From c6bc383269ee44da5501ad377669e6727ce20511 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Motiejus=20Jak=C5=A1tys?=
Date: Mon, 14 Feb 2022 13:05:33 +0200
Subject: [PATCH] explain some optimizations

---
 README.md | 137 +++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 88 insertions(+), 49 deletions(-)

diff --git a/README.md b/README.md
index ec9f4bd..e55b0fe 100644
--- a/README.md
+++ b/README.md
@@ -7,17 +7,17 @@ entries (i.e. system users, groups, and group memberships). It's main goal is
 performance, with focus on making [`id(1)`][id] run as fast as possible.
 
 To understand more about name service switch, start with
-[`nsswitch.conf(5)`](nsswitch).
+[`nsswitch.conf(5)`][nsswitch].
 
 Design & constraints
 --------------------
 
 To be fast, the user/group database (later: DB) has to be small ([highly
-recommended background viewing](data-oriented-design)). It encodes user & group
+recommended background viewing][data-oriented-design]). It encodes user & group
 information in a way that minimizes the DB size, and reduces jumping across the
-DB ("chasing pointers and polluting CPU cache").
+DB ("chasing pointers and thrashing CPU cache").
 
-For example, [`getpwnam_r(3)`](getpwnam_r) accepts a username and returns
+For example, [`getpwnam_r(3)`][getpwnam_r] accepts a username and returns
 the following user information:
 
 ```
@@ -33,14 +33,14 @@ struct passwd {
 ```
 
 Turbonss, among others, implements this call, and takes the following steps to
-resolve this:
+resolve a username to a `struct passwd*`:
 
 - Hash the username using a perfect hash function. Perfect hash function
-  returns a number between [0,N], where N is the total number of users.
-- Jump to a known location in the DB (by pointer arithmetic) which links the
-  user's index to the user's information. That is an index to a different
-  location within the DB.
-- Jump to the location which stores the full user information.
+  returns a number `n ∈ [0,N-1]`, where N is the total number of users.
+- Jump to the `n`'th location in the DB (by pointer arithmetic) which contains
+  the index `i` to the user's information.
+- Jump to the location `i` (pointer arithmetic) which stores the full user
+  information.
 - Decode the user information (which is all in a continuous memory block) and
   return it to the caller.
 
@@ -48,7 +48,12 @@ In total, that's one hash for the username (~150ns), two pointer jumps within
 the group file, and, now that the user record is found, `memcpy` for each
 field.
 
-This tight packing places some constraints on the underlying data:
+The turbonss DB file is `mmap`-ed, making it simple to implement pointer
+arithmetic and jumping across the file. This also reduces memory usage,
+especially across multiple concurrent invocations of the `id` command. The
+consumed heap space for each separate turbonss instance will be minimal.
+
+Tight packing places some constraints on the underlying data:
 
 - Maximum database size: 4GB.
 - Maximum length of username and groupname: 32 bytes.
@@ -83,54 +88,53 @@ remarks on `id(1)`
 ------------------
 
 A known implementation runs id(1) at ~250 rps sequentially on ~20k users and
-~10k groups. Our target is 10k id/s.
+~10k groups. Our target is 10k id/s for the same payload.
 
-`id(1)` works as follows:
+To better reason about the trade-offs, it is useful to understand how `id(1)`
+is implemented, in rough terms:
 - lookup user by name.
 - get all additional gids (an array attached to a member).
-- for each additional gid, get the group name.
+- for each additional gid, get the group information (`struct group*`).
 
 Assuming a member is in ~100 groups on average, that's 1M group lookups per
 second. We need to convert gid to a group index, and group index to a group
 gid/name quickly.
 
 Caveat: `struct group` contains an array of pointers to names of group members
-(`char **gr_mem`). However, `id` does not use that information, resulting in a
-significant read amplification. Therefore, if `argv[0] == "id"`, `getgrid(3)`
-will return group without the members. This speeds up `id` by about 10x on a
-known NSS implementation.
+(`char **gr_mem`). However, `id` does not use that information, resulting in
+read amplification. Therefore, if `argv[0] == "id"`, our implementation of
+`getgrgid(3)` returns the `struct group*` without the members. This speeds up
+`id` by about 10x on a known NSS implementation.
 
-Because `getgrid(3)` does not use the group members' information, the group
-members are stored in a different location, making the `Groups` section
-smaller, thus more CPU-cache-friendly.
+Relatedly, because `getgrgid(3)` does not need the group members, the group
+members are stored in a different DB section, making the `Groups` section
+smaller, thus more CPU-cache-friendly in the hot path.
 
 Indices
 -------
 
-The following operations need to be fast, in order of importance:
+Now that we've sketched the implementation of `id(1)`, it's easier to
+understand which operations need to be fast, in order of importance:
 
-1. lookup gid -> group (this is on hot path in id) with or without members (2
-   separate calls).
+1. lookup gid -> group info (this is on hot path in id) without members.
 2. lookup uid -> user.
 3. lookup groupname -> group.
 4. lookup username -> user.
-5. (optional) iterate users using a defined order (`getent passwd`).
-6. (optional) iterate groups using a defined order (`getent group`).
 
-First 4 can use perfect hashing like [cmph][cmph]: it hashes a list of bytes to
-a sequential list of integers. Perfect hashing algorithms require some space,
-and take some time to calculate ("hashing duration"). I've tested BDZ, which
-hashes [][]u8 to a sequential list of integers (not preserving order) and CHM, which
-does the same, but preserves order. BDZ accepts an argument 3 <= b <= 10.
+These indices can use perfect hashing like [cmph][cmph]: a perfect hash hashes
+a list of bytes to a sequential list of integers. Perfect hashing algorithms
+require some space, and take some time to calculate ("hashing duration"). I've
+tested BDZ, which hashes [][]u8 to a sequential list of integers (not
+preserving order) and CHM, which preserves order. BDZ accepts an optional
+argument `3 <= b <= 10`.
 
-BDZ: tried b=3, b=7 (default), and b=10.
-
-* BDZ algorithm requires (900KB, 338KB, 306KB, respectively) for 1M values.
-* Latency to resolve 1M keys: (170ms, 180ms, 230ms).
+* BDZ algorithm requires (b=3: 900KB, b=7: 338KB, b=10: 306KB) for 1M values.
+* Latency to resolve 1M keys: (170ms, 180ms, 230ms, respectively).
 * Packed vs non-packed latency differences are not meaningful.
 
 CHM retains order, however, 1M keys weigh 8MB. 10k keys are ~20x larger with
-CHM than with BDZ, eliminating the benefit of preserved ordering.
+CHM than with BDZ, eliminating the benefit of preserved ordering: we can just
+have a separate index.
 
 Turbonss header
 ---------------
@@ -168,6 +172,11 @@ and the header block fits to 64 bytes anyway, we are keeping them as u32 now.
 Primitive types
 ---------------
 
+`User` and `Group` entries are sorted by name, ordered by their unicode
+codepoints. They are byte-aligned (8 bits). All `User` and `Group` entries are
+referred to by their byte offset in the `Users` and `Groups` section relative
+to the beginning of the section.
+
 ```
 const Group = struct {
     gid: u32,
@@ -187,9 +196,9 @@ const User = struct {
     // shell is a different story, documented elsewhere.
     shell_here: u1,
     shell_len_or_place: u6,
-    home_len: u6,
-    username_pos: u1,
-    username_len: u5,
+    homedir_len: u6,
+    username_is_a_suffix: u1,
+    username_offset_or_len: u5,
     gecos_len: u8,
     // a variable-sized array that will be stored immediately after this
     // struct.
@@ -197,8 +206,28 @@ const User = struct {
 }
 ```
 
-`User` and `Group` entries are sorted by name, ordered by their unicode
-codepoints.
+`stringdata` contains a few string entries:
+- homedir.
+- username.
+- gecos.
+- shell (optional).
+
+First byte of the homedir is stored right after the `gecos_len` field, and its
+length is `homedir_len`. The same logic applies to all the `stringdata` fields:
+there is a way to calculate their relative position from the length of the
+fields before them.
+
+Additionally, two optimizations for special fields are made:
+- shells are often shared across different users, see the "Shells" section.
+- username is frequently a suffix of the homedir. For example, `/home/motiejus`
+  and `motiejus`. In that case, storing both username and homedir strings is
+  wasteful. In such cases, username has two options:
+  1. `username_is_a_suffix=true`: username is a suffix of the home dir. In that
+     case, the username starts at the `username_offset_or_len`'th byte of the
+     homedir, and ends at the same place as the homedir.
+  2. `username_is_a_suffix=false`: username is stored separately. In that case,
+     it begins one byte after homedir, and its length is
+     `username_offset_or_len`.
 
 Shells
 ------
@@ -221,14 +250,21 @@ to it's index in the external storage.
 Shells in the external storage are sorted by their weight, which is
 `length*frequency`.
 
+Variable-length integers (varints)
+----------------------------------
+
+Varint is an efficiently encoded integer (packed for small values). Same as
+[protocol buffer varints][varint], except the largest possible value is `u64`.
+They compress integers well.
+
 `groupmembers`, `additional_gids`
 ---------------------------------
 
 `groupmembers` and `additional_gids` store group and user memberships
-respectively: for each group, a list of pointers ("offsets") to User records,
-and for each user — a list of pointers to Group records. These fields are
-always used in their entirety — making random-access not required, thus
-suitable for tight packing.
+respectively: for each group, a list of pointers (offsets) to User records, and
+for each user — a list of pointers to Group records. These fields are always
+used in their entirety — not necessitating random access, thus suitable for
+tight packing.
 
 An entry of `groupmembers` and `additional_gids` looks like this piece of
 pseudo-code:
@@ -242,10 +278,12 @@ const Groupmembers = PackedList;
 const AdditionalGids = PackedList;
 ```
 
-The single entry in `members` field points to an offset into a `User` or
-`Group` entry (number of bytes relative to the first entry of the respective
-type). The `members` field in `PackedList` is sorted by the name (`username` or
-`groupname`) of the record it is pointing to.
+An entry in the `members` field is an offset into the respective `User` or
+`Group` entry (number of bytes relative to the first entry of the type).
+`members` in `PackedList` is sorted by the name (`username` or `groupname`) of
+the record it is pointing to.
+
+A packed list is a list of varints.
 
 Complete file structure
 -----------------------
@@ -270,3 +308,4 @@ Complete file structure
 [nsswitch]: https://linux.die.net/man/5/nsswitch.conf
 [data-oriented-design]: https://media.handmade-seattle.com/practical-data-oriented-design/
 [getpwnam_r]: https://linux.die.net/man/3/getpwnam_r
+[varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints
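
Note on the varint and packed-list wording added by this patch: the README describes varints as protocol-buffer-style and a packed list as a list of varints, but the patch does not show the `PackedList` layout. The C sketch below is therefore only an illustration of that encoding, not turbonss code. The function names (`varint_encode`, `varint_decode`, `varint_list_decode`) are hypothetical, and it assumes the caller already knows the byte length of the list.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Encode `value` as a protobuf-style varint: 7 payload bits per byte, least
 * significant group first, high bit set on every byte except the last.
 * `buf` must have room for 10 bytes. Returns the number of bytes written.
 */
static size_t varint_encode(uint64_t value, uint8_t *buf)
{
    size_t n = 0;
    while (value >= 0x80) {
        buf[n++] = (uint8_t)((value & 0x7f) | 0x80);
        value >>= 7;
    }
    buf[n++] = (uint8_t)value;
    return n;
}

/*
 * Decode one varint from buf[0..len). Returns the number of bytes consumed,
 * or 0 if the input is truncated or the value does not fit in 64 bits.
 */
static size_t varint_decode(const uint8_t *buf, size_t len, uint64_t *out)
{
    uint64_t value = 0;
    for (size_t i = 0; i < len && i < 10; i++) {
        uint8_t byte = buf[i];
        /* The 10th byte may only carry the top bit of a 64-bit value. */
        if (i == 9 && byte > 1)
            return 0;
        value |= (uint64_t)(byte & 0x7f) << (7 * i);
        if ((byte & 0x80) == 0) {
            *out = value;
            return i + 1;
        }
    }
    return 0;
}

/*
 * Decode consecutive varints from buf[0..len) into out[0..cap).
 * ASSUMPTION: the list is delimited by a known byte length; the real
 * on-disk layout may instead store an element count.
 * Returns the number of values decoded, or -1 on malformed input or if
 * `cap` is too small.
 */
static long varint_list_decode(const uint8_t *buf, size_t len,
                               uint64_t *out, size_t cap)
{
    size_t pos = 0, count = 0;
    while (pos < len) {
        if (count == cap)
            return -1;
        size_t used = varint_decode(buf + pos, len - pos, &out[count]);
        if (used == 0)
            return -1;
        pos += used;
        count++;
    }
    return (long)count;
}
```

Offsets below 128 take a single byte, which is why a list of small record offsets packs tightly. The cost is that the list has to be decoded front to back, which matches the README's remark that these fields are always used in their entirety.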