Design & constraints -------------------- To be fast, the user/group database (later: DB) has to be small ([background][data-oriented-design]). It encodes user & group information in a way that minimizes the DB size, and reduces jumping across the DB ("chasing pointers and thrashing CPU cache"). To understand how this is done efficiently, let's analyze the [`getpwnam_r(3)`][getpwnam_r] in high level. This API call accepts a username and returns the following user information: ``` struct passwd { char *pw_name; /* username */ char *pw_passwd; /* user password */ uid_t pw_uid; /* user ID */ gid_t pw_gid; /* group ID */ char *pw_gecos; /* user information */ char *pw_dir; /* home directory */ char *pw_shell; /* shell program */ }; ``` Turbonss, among others, implements this call, and takes the following steps to resolve a username to a `struct passwd*`: - Open the DB (using `mmap`) and interpret it's first 128 bytes as a `*struct Header`. The header stores offsets to the sections of the file. This needs to be done once, when the NSS library is loaded. - Hash the username using a perfect hash function. Perfect hash function returns a number `n ∈ [0,N-1]`, where N is the total number of users. - Jump to the `n`'th location in the `idx_name2user` section, which contains the index `i` to the user's information. - Jump to the location `i` of section `Users`, which stores the full user information. - Decode the user information (which is all in a continuous memory block) and return it to the caller. In total, that's one hash for the username (~150ns), two pointer jumps within the group file (to sections `idx_name2user` and `Users`), and, now that the user record is found, `memcpy` for each field. The turbonss DB file is be `mmap`-ed, making it simple to jump across the file using pointer arithmetic. This also reduces memory usage, as the mmap'ed regions are shared. Turbonss reads do not consume any heap space. Tight packing places some constraints on the underlying data: - Permitted length of username and groupname: 1-32 bytes. - Permitted length of shell and home: 1-256 bytes. - Permitted comment ("gecos") length: 0-255 bytes. - User name, groupname, gecos and shell must be utf8-encoded. - User and Groups sections are up to 2^35B (~34GB) large. Assuming an "average" user record takes 50 bytes, this section would fit ~660M users. The worst-case upper bound is left as an exercise to the reader. Sorting is stable. In v0: - Groups are sorted by gid, ascending. - Users are sorted by their name, ascending by the unicode codepoints (locale-independent). remarks on `id(1)` ------------------ A known implementation runs id(1) at ~250 rps sequentially on ~20k users and ~10k groups. Our rps target is much higher. To better reason about the trade-offs, it is useful to understand how `id(1)` is implemented, in rough terms: - lookup user by name ([`getpwent_r(3)`][getpwent]). - get all gids for the user ([`getgrouplist(3)`][getgrouplist]). Note: it is actually using `initgroups_dyn`, accepts a uid, and is very poorly documented. - for each additional gid, get the `struct group*` ([`getgrgid_r(3)`][getgrgid_r]). Assuming a member is in ~100 groups on average, to reach 10k id/s translates to 1M group lookups per second. We need to convert gid to a group index, and group index to a group gid/name quickly. Caveat: `struct group` contains an array of pointers to names of group members (`char **gr_mem`). However, `id` does not use that information, resulting in read amplification, sometimes by 10-100x. Therefore, if `argv[0] == "id"`, our implementation of [`getgrid_r(3)`][getgrid] returns the `struct group*` without the members. This speeds up `id` by about 10x on a known NSS implementation. Relatedly, because [`getgrid_r(3)`][getgrid] does not need the group members, the group members are stored in a different DB section, reducing the `Groups` section and making more of it fit the CPU caches. Turbonss header --------------- The turbonss header looks like this: ``` OFFSET TYPE NAME DESCRIPTION 0 [4]u8 magic f0 9f a4 b7 4 u8 version 0 5 u8 endian 0 for little, 1 for big 6 u8 nblocks_shell_blob max value: 63 7 u8 num_shells max value: 63 8 u32 num_groups number of group entries 12 u32 num_users number of passwd entries 16 u32 nblocks_bdz_gid bdz_gid section block count 20 u32 nblocks_bdz_groupname 24 u32 nblocks_bdz_uid 28 u32 nblocks_bdz_username 32 u64 nblocks_groups 40 u64 nblocks_users 48 u64 nblocks_groupmembers 56 u64 nblocks_additional_gids 64 u64 getgr_bufsize 72 u64 getpw_bufsize 80 [48]u8 padding ``` `magic` is 0xf09fa4b7, and `version` must be `0`. All integers are native-endian. `nblocks_*` is the count of blocks of a particular section; this helps calculate the offsets to all sections. Some numbers, like `nblocks_shell_blob`, `num_shells`, would fit to smaller number of bytes. However, interpreting `[2]u6` with `xxd(1)` is harder than interpreting `[2]u8`. Therefore we are using the space we have to make these integers byte-wide. `getgr_bufsize` and `getpw_bufsize` is a hint for the caller of `getgr*` and `getpw*`-family calls. This is the recommended size of the buffer, so the caller does not receive `ENOMEM`. Primitive types --------------- `User` and `Group` entries are sorted by the order they were received in the input file. All entries are aligned to 8 bytes. All `User` and `Group` entries are referred by their byte offset (shifted by 3 bits due to the 8-byte alignment) in the `Users` and `Groups` section relative to the beginning of the section. ``` const PackedGroup = packed struct { gid: u32, padding: u3, groupname_len: u5, } ``` PackedGroup is followed by the group name (of length `groupname_len`), followed by a varint-compressed offset to the groupmembers section, followed by padding upto 8 bytes. PackedUser is a bit more involved: ``` pub const PackedUser = packed struct { uid: u32, gid: u32, shell_len_or_idx: u8, shell_here: bool, name_is_a_suffix: bool, home_len: u6, name_len: u5, gecos_len: u11, } ``` ... followed by `userdata: []u8`: - home. - name (optional). - gecos. - shell (optional). - `additional_gids_offset`: varint. First byte of home is stored right after the `gecos_len` field, and its length is `home_len`. The same logic applies to all the `stringdata` fields: there is a way to calculate their relative position from the length of the fields before them. PackedUser employs two data-oriented compression techniques: - shells are often shared across different users, see the "Shells" section. - `name` is frequently a suffix of `home`. For example, `home=/home/vidmantas` and `name=vidmantas`. In this case storing both name and home is wasteful. Therefore name has two options: 1. `name_is_a_suffix=true`: name is a suffix of the home dir. Then `name` starts at the `home_len - name_len`'th byte of `home`, and ends at the same place as `home`. 2. `name_is_a_suffix=false`: name begins one byte after home, and it's length is `name_len`. The last field `additional_gids_offset: varint` points to the `additional_gids` section for this user. Shells ------ Normally there is a limited number of separate shells even in huge user databases. A few examples: `/bin/bash`, `/usr/bin/nologin`, `/bin/zsh` among others. Therefore, "shells" have an optimization: they can be pointed by in the external list, or, if they are unique to the user, reside among the user's data. 255 most popular shells (i.e. referred to by at least two User entries) are stored externally in "Shells" area. The less popular ones are stored with userdata. Shells section consists of two sub-sections: the index and the blob. The index is an array of offsets: the i'th shell starts at `offsets[i]` byte, and ends at `offsets[i+1]` byte. If there is at least one shell in the shell section, the index contains a sentinel index as the last element, which signifies the position of the last byte of the shell blob. `shell_here=true` in the User struct means the shell is stored with userdata, and it's length is `shell_len_or_idx`. `shell_here=false` means it is stored in the `Shells` section, and it's index is `shell_len_or_idx` (and the actual string start and end offsets are resolved as described in the paragraph above). Variable-length integers (varints) ---------------------------------- Varint is an efficiently encoded integer (packed for small values). Same as [protocol buffer varints][varint], except the largest possible value is `u64`. They compress integers well. Varints are stored for group memberships. Group memberships ----------------- There are two group memberships at play: 1. Given a group (gid/name), resolve the members' names (e.g. `getgrgid`). 2. Given a username, resolve user's group gids (for `initgroups(3)`). When group's memberships are resolved in (1), the same call also requires other group information: gid and group name. Therefore it makes sense to store a pointer to the group members in the group information itself. However, the memberships are not *always* necessary (see remarks about `id(1)`), therefore the memberships will be stored separately, outside of the groups section. Similarly, when user's groups are resolved in (2), they are not always necessary (i.e. not part of `struct user*`), therefore the memberships themselves are stored out of bound. `groupmembers` and `additional_gids` store group and user memberships respectively. Membership IDs are packed — not necessitating random access, thus suitable for compression. - `groupmembers` consists of a number X followed by a list of offsets to User records, because `getgr*` returns pointers to membernames, thus a name has to be immediately resolvable. - `additional_gids` is a list of gids, because `initgroups_dyn` (and friends) returns an array of gids. Each entry of `groupmembers` and `additional_gids` starts with a varint N, which is the number of upcoming elements. Then N delta-compressed varints, which are: - **additional_gids** a list of gids. - **groupmembers** byte-offsets to the User records in the `users` section. Indices ------- Now that we've sketched the implementation of `id(3)`, it's clearer to understand which operations need to be fast; in order of importance: 1. lookup gid -> group info (this is on hot path in id) without members. 2. lookup username -> user's groups. 3. lookup uid -> user. 4. lookup groupname -> group. 5. lookup username -> user. These indices can use perfect hashing like [bdz from cmph][cmph]: a perfect hash hashes a list of bytes to a sequential list of integers. Perfect hashing algorithms require some space, and take some time to calculate ("hashing duration"). I've tested BDZ, which hashes `[][]u8` to a sequential list of integers (not preserving order) and CHM, preserves order. BDZ accepts an optional argument `3 <= b <= 10`. * BDZ algorithm requires (b=3, 900KB, b=7, 338KB, b=10, 306KB) for 1M values. * Latency to resolve 1M keys: (170ms, 180ms, 230ms, respectively). * Packed vs non-packed latency differences are not meaningful. CHM retains order, however, 1M keys weigh 8MB. 10k keys are ~20x larger with CHM than with BDZ, eliminating the benefit of preserved ordering: we can just have a separate index. None of the tested perfect hashing algorithms makes the distinction between existing (in the initial dictionary) and new keys. In other words, HASH(value) will be pointing to a number `n ∈ [0,N-1]`, regardless whether the value was in the initial dictionary. Therefore one must always confirm, after calculating the hash, that the key matches what's been hashed. `idx_*` sections are of type `[]u32` and are pointing from `hash(key)` to the respective `Groups` and `Users` entries (from the beginning of the respective section). Since User and Group records are 8-byte aligned, the actual offset to the record is acquired by right-shifting this value by 3 bits. Database file structure ----------------------- Each section is padded to 64 bytes. ``` SECTION SIZE DESCRIPTION header 128 see "Turbonss header" section bdz_gid ? bdz(gid) bdz_groupname ? bdz(groupname) bdz_uid ? bdz(uid) bdz_username ? bdz(username) idx_gid2group len(group)*4 bdz->offset Groups idx_groupname2group len(group)*4 bdz->offset Groups idx_uid2user len(user)*4 bdz->offset Users idx_name2user len(user)*4 bdz->offset Users shell_index len(shells)*2 shell index array shell_blob <= 65280 shell data blob (max 255*256 bytes) groups ? packed Group entries (8b padding) users ? packed User entries (8b padding) groupmembers ? per-group delta varint memberlist (no padding) additional_gids ? per-user delta varint gidlist (no padding) ``` [cmph]: http://cmph.sourceforge.net/ [id]: https://linux.die.net/man/1/id [data-oriented-design]: https://media.handmade-seattle.com/practical-data-oriented-design/ [getpwnam_r]: https://linux.die.net/man/3/getpwnam_r [varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints [getpwent]: https://www.man7.org/linux/man-pages/man3/getpwent_r.3.html [getgrouplist]: https://www.man7.org/linux/man-pages/man3/getgrouplist.3.html [getgrid]: https://www.man7.org/linux/man-pages/man3/getgrid_r.3.html