Turbo NSS
---------

Turbonss is a plugin for the GNU Name Service Switch (NSS) functionality of the
GNU C Library (glibc). Turbonss implements lookup for `passwd` and `group`
database entries (i.e. system users, groups, and group memberships). Its main
goal is performance, with a focus on making [`id(1)`][id] run as fast as
possible.

To understand more about name service switch, start with
[`nsswitch.conf(5)`][nsswitch].

Design & constraints
--------------------

To be fast, the user/group database (later: DB) has to be small
([background][data-oriented-design]). It encodes user & group information in a
way that minimizes the DB size and reduces jumping across the DB ("chasing
pointers and thrashing CPU cache").

To understand how this is done efficiently, let's analyze
[`getpwnam_r(3)`][getpwnam_r] at a high level. This API call accepts a username
and returns the following user information:

```
struct passwd {
   char   *pw_name;       /* username */
   char   *pw_passwd;     /* user password */
   uid_t   pw_uid;        /* user ID */
   gid_t   pw_gid;        /* group ID */
   char   *pw_gecos;      /* user information */
   char   *pw_dir;        /* home directory */
   char   *pw_shell;      /* shell program */
};
```

Turbonss, among others, implements this call, and takes the following steps to
resolve a username to a `struct passwd*`:

- Open the DB (using `mmap`) and interpret its first 40 bytes as a
  `struct Header`. The header stores offsets to the sections of the file. This
  needs to be done once, when the NSS library is loaded (or on the first call).
- Hash the username using a perfect hash function. A perfect hash function
  returns a number `n ∈ [0,N-1]`, where N is the total number of users.
- Jump to the `n`'th location in the `idx_username2user` section (by pointer
  arithmetic), which contains the index `i` to the user's information.
- Jump to the location `i` of section `Users` (again, using pointer
  arithmetic), which stores the full user information.
- Decode the user information (which is all in a contiguous memory block) and
  return it to the caller.

In total, that's one hash for the username (~150ns), two pointer jumps within
the DB file (to sections `idx_username2user` and `Users`), and, now that the
user record is found, a `memcpy` for each field.

The turbonss DB file is `mmap`-ed, making it simple to implement pointer
arithmetic and jumping across the file. This also reduces memory usage,
especially across multiple concurrent invocations of the `id` command. The
consumed heap space for each separate turbonss instance will be minimal.

Tight packing places some constraints on the underlying data:

- Maximum database size: 4GB.
- Maximum length of username and groupname: 32 bytes.
- Maximum length of shell and homedir: 64 bytes.
- Maximum comment ("gecos") length: 256 bytes.
- Username and groupname must be UTF-8 encoded.

Checking out and building
-------------------------

```
$ git clone --recursive https://git.sr.ht/~motiejus/turbonss
```

Alternatively, if you forgot `--recursive`:

```
$ git submodule update --init
```

And run tests:

```
$ zig build test
```

Other commands will be documented as they are implemented.

This project uses [git subtrac][git-subtrac] for managing dependencies.

Remarks on `id(1)`
------------------

A known implementation runs `id(1)` at ~250 rps sequentially on ~20k users and
~10k groups. Our target is 10k id/s for the same payload.

To better reason about the trade-offs, it is useful to understand how `id(1)`
is implemented, in rough terms:

- lookup user by name.
- get all additional gids (an array attached to a member).
- for each additional gid, get the group information (`struct group*`).

Assuming a member is in ~100 groups on average, that's 1M group lookups per
second. We need to convert gid to a group index, and group index to a group
gid/name quickly.

Caveat: `struct group` contains an array of pointers to names of group members
(`char **gr_mem`).
However, `id` does not use that information, resulting in read amplification.
Therefore, if `argv[0] == "id"`, our implementation of `getgrgid(3)` returns
the `struct group*` without the members. This speeds up `id` by about 10x on a
known NSS implementation.

Relatedly, because `getgrgid(3)` does not need the group members, the group
members are stored in a different DB section, making the `Groups` section
smaller, thus more CPU-cache-friendly in the hot path.

Indices
-------

Now that we've sketched the implementation of `id(1)`, it is clearer which
operations need to be fast; in order of importance:

1. lookup gid -> group info (this is on hot path in id) without members.
2. lookup uid -> user.
3. lookup groupname -> group.
4. lookup username -> user.

These indices can use perfect hashing like [cmph][cmph]: a perfect hash hashes
a list of bytes to a sequential list of integers. Perfect hashing algorithms
require some space, and take some time to calculate ("hashing duration"). I've
tested BDZ, which hashes `[][]u8` to a sequential list of integers (not
preserving order), and CHM, which preserves order. BDZ accepts an optional
argument `3 <= b <= 10`.

* The BDZ algorithm requires 900KB (b=3), 338KB (b=7), or 306KB (b=10) for 1M
  values.
* Latency to resolve 1M keys: 170ms, 180ms, and 230ms, respectively.
* Packed vs non-packed latency differences are not meaningful.

CHM retains order; however, 1M keys weigh 8MB. 10k keys are ~20x larger with
CHM than with BDZ, eliminating the benefit of preserved ordering: we can just
have a separate index.
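
To make the role of these indices concrete, here is a minimal sketch of lookup
1 (gid -> group info) in C. It is not turbonss code: `toy_hash`,
`struct group_rec`, the plain-array `idx_gid2group`, and the sample data are
assumptions made for illustration, and `gid % 4` merely stands in for a
cmph-generated function such as BDZ. The sketch also shows a step that a
perfect-hash index generally needs: a key that was never in the key set may
still hash to a valid slot, so the record found has to be compared against the
requested gid.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified group record (the packed on-disk layout is not
 * modelled here). */
struct group_rec {
    uint32_t gid;
    const char *name;
};

#define NUM_GROUPS 4

/* Records sorted by name, standing in for the `Groups` section. */
static const struct group_rec groups[NUM_GROUPS] = {
    {1003, "alice"},
    {0, "root"},
    {5, "tty"},
    {10, "wheel"},
};

/* Stand-in for `idx_gid2group`: slot n holds the position of the record whose
 * gid hashes to n. A real index would hold byte offsets into the section. */
static const uint32_t idx_gid2group[NUM_GROUPS] = {1, 2, 3, 0};

/* Toy perfect hash: collision-free for the gids {0, 5, 10, 1003} only. */
static uint32_t toy_hash(uint32_t gid)
{
    return gid % NUM_GROUPS;
}

/* Returns the record for `gid`, or NULL if no such group exists. */
static const struct group_rec *lookup_gid(uint32_t gid)
{
    uint32_t n = toy_hash(gid);      /* slot in the index       */
    uint32_t i = idx_gid2group[n];   /* position of the record  */
    assert(i < NUM_GROUPS);          /* index must stay in range */

    const struct group_rec *rec = &groups[i];
    /* Unknown gids also land on some slot, so verify the stored key. */
    return rec->gid == gid ? rec : NULL;
}
```

With this sample data, `lookup_gid(10)` returns the `wheel` record, while
`lookup_gid(14)` hashes to the same slot and returns `NULL` because the stored
gid differs.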

Turbonss header
---------------

The turbonss header looks like this:

```
OFFSET     TYPE     NAME                          DESCRIPTION
   0      [4]u8     magic                         always 0xf09fa4b7
   4         u8     version                       now `0`
   5        u16     bom                           0x1234
   7         u8     padding
   8        u32     num_users                     number of passwd entries
  12        u32     num_groups                    number of group entries
  16        u32     offset_cmph_gid2group
  20        u32     offset_cmph_uid2user
  24        u32     offset_cmph_groupname2group
  28        u32     offset_cmph_username2user
  32        u32     offset_groupmembers
  36        u32     offset_additional_gids
```

`magic` is 0xf09fa4b7, and `version` must be `0`. All integers are
native-endian. `bom` is a byte-order-mark. It must resolve to `0x1234` (4660).
If that's not true, the file is being read with a different endianness than it
was created with. Turbonss files cannot be moved across different-endianness
computers. If that happens, turbonss will refuse to read the file.

Offsets are indices to further sections of the file, with zero being the first
block (pointing to the `magic` field). As all blobs are 64-byte aligned, the
offsets always point to the beginning of a 64-byte "block". Therefore, all
`offset_*` values could be `u26`. As `u32` is easier to visualize with xxd, and
the header block fits into 64 bytes anyway, we are keeping them as `u32` for
now.

Primitive types
---------------

`User` and `Group` entries are sorted by name, ordered by their unicode
codepoints. They are byte-aligned (8bits). All `User` and `Group` entries are
referred to by their byte offset in the `Users` and `Groups` section, relative
to the beginning of the section.

```
const Group = struct {
    gid: u32,
    // index to a separate structure with a list of members. The memberlist is
    // always 2^5-byte aligned (32b), this is an index there.
    members_offset: u27,
    groupname_len: u5,
    // a groupname_len-sized string
    groupname: []u8,
};

const User = struct {
    uid: u32,
    gid: u32,
    // pointer to a separate structure that contains a list of gids
    additional_gids_offset: u29,
    // shell is a different story, documented elsewhere.
    shell_here: u1,
    shell_len_or_place: u6,
    homedir_len: u6,
    username_is_a_suffix: u1,
    username_offset_or_len: u5,
    gecos_len: u8,
    // a variable-sized array that will be stored immediately after this
    // struct.
    stringdata: []u8,
};
```

`stringdata` contains a few string entries:

- homedir.
- username.
- gecos.
- shell (optional).

The first byte of the homedir is stored right after the `gecos_len` field, and
its length is `homedir_len`. The same logic applies to all the `stringdata`
fields: their relative position can be calculated from the lengths of the
fields before them.

Additionally, two optimizations for special fields are made:

- shells are often shared across different users, see the "Shells" section.
- username is frequently a suffix of the homedir. For example, `/home/motiejus`
  and `motiejus`. In that case, storing both username and homedir strings is
  wasteful. For such cases, username has two options:
  1. `username_is_a_suffix=true`: username is a suffix of the home dir. In that
     case, the username starts at the `username_offset_or_len`'th byte of the
     homedir, and ends at the same place as the homedir.
  2. `username_is_a_suffix=false`: username is stored separately. In that case,
     it begins one byte after homedir, and its length is
     `username_offset_or_len`.

Shells
------

Normally there is a limited number of shells, even in huge user databases. A
few examples: `/bin/bash`, `/usr/bin/nologin`, `/bin/zsh` among others.
Therefore, "shells" have an optimization: they can either be referenced from
an external list, or reside among the user's data.

The 64 (1<<6) most popular shells (i.e. referred to by at least two User
entries) are stored externally in the "Shells" area. The less popular ones are
stored with userdata.

The `shell_here=true` bit signifies that the shell is stored with userdata.
`false` means it is stored in the `Shells` section. If the shell is stored
"here", it is the first element in `stringdata`, and its length is
`shell_len_or_place`.
If it is stored externally, the latter variable holds its index in the
external storage.

Shells in the external storage are sorted by their weight, which is
`length*frequency`.

Variable-length integers (varints)
----------------------------------

A varint is an efficiently encoded integer (packed for small values). It is
the same as [protocol buffer varints][varint], except the largest possible
value is `u64`. They compress integers well.

`groupmembers`, `additional_gids`
---------------------------------

`groupmembers` and `additional_gids` store group and user memberships
respectively: for each group, a list of pointers (offsets) to User records,
and for each user, a list of pointers to Group records. These fields are
always used in their entirety, so they do not need random access and are
suitable for tight packing.

An entry of `groupmembers` and `additional_gids` looks like this piece of
pseudo-code:

```
const PackedList = struct {
    length: varint,
    members: []varint
}
const Groupmembers = PackedList;
const AdditionalGids = PackedList;
```

An entry in the `members` field is an offset into the respective `User` or
`Group` entry (number of bytes relative to the first entry of the type).
`members` in `PackedList` is sorted by the name (`username` or `groupname`) of
the record it is pointing to.

A packed list is a list of varints.

Complete file structure
-----------------------

`idx_*` entries are of type `[]u29` and point to the respective `Groups` and
`Users` entries (from the beginning of the respective section). Since entries
are 8-byte aligned, 3 bits are saved from every element.

Each section is padded to 64 bytes.

```
SECTION               SIZE                 DESCRIPTION
Header                40                   see "Turbonss header" section
idx_gid2group         len(group)*4*29/32   list of gid2group indices
idx_groupname2group   len(group)*4*29/32   list of groupname2group indices
idx_uid2user          len(user)*4*29/32    list of uid2user indices
idx_username2user     len(user)*4*29/32    list of username2user indices
Groups                ?                    list of Group entries
Users                 ?                    list of User entries
Shells                ?                    see "Shells" section
cmph_gid2group        ?                    offset by offset_cmph_gid2group
cmph_uid2user         ?                    offset by offset_cmph_uid2user
cmph_groupname2group  ?                    offset by offset_cmph_groupname2group
cmph_username2user    ?                    offset by offset_cmph_username2user
groupmembers          ?                    offset by offset_groupmembers
additional_gids       ?                    offset by offset_additional_gids
```

[git-subtrac]: https://github.com/apenwarr/git-subtrac/
[cmph]: http://cmph.sourceforge.net/
[id]: https://linux.die.net/man/1/id
[nsswitch]: https://linux.die.net/man/5/nsswitch.conf
[data-oriented-design]: https://media.handmade-seattle.com/practical-data-oriented-design/
[getpwnam_r]: https://linux.die.net/man/3/getpwnam_r
[varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints
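
The `groupmembers` and `additional_gids` sections listed above consist of
packed lists of varints. As a rough illustration of how such a list could be
read, the C sketch below decodes one `PackedList`. It assumes the
protocol-buffer byte layout (7 payload bits per byte, least-significant group
first, high bit set on every byte except the last); the function names
`varint_decode` and `packed_list_decode` and the error conventions are
hypothetical and not taken from turbonss.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decodes one varint from buf[0..len). Returns the number of bytes consumed,
 * or 0 if the input is truncated or does not fit into 64 bits. */
static size_t varint_decode(const uint8_t *buf, size_t len, uint64_t *out)
{
    assert(buf != NULL && out != NULL);
    uint64_t value = 0;
    for (size_t i = 0; i < len && i < 10; i++) {
        uint64_t low7 = buf[i] & 0x7f;
        if (i == 9 && low7 > 1)
            return 0;                /* 10th byte may only carry bit 63 */
        value |= low7 << (7 * i);
        if ((buf[i] & 0x80) == 0) {  /* no continuation bit: last byte */
            *out = value;
            return i + 1;
        }
    }
    return 0;
}

/* Decodes a PackedList: a varint `length` followed by `length` varints.
 * Stores the members and their count; returns bytes consumed, or 0 on error
 * (including a list longer than `cap`). */
static size_t packed_list_decode(const uint8_t *buf, size_t len,
                                 uint64_t *members, size_t cap, size_t *count)
{
    uint64_t n;
    size_t pos = varint_decode(buf, len, &n);
    if (pos == 0 || n > cap)
        return 0;
    for (uint64_t k = 0; k < n; k++) {
        size_t used = varint_decode(buf + pos, len - pos, &members[k]);
        if (used == 0)
            return 0;
        pos += used;
    }
    *count = (size_t)n;
    return pos;
}
```

For example, the five bytes `03 08 ac 02 00` decode to a list of three
members: 8, 300, and 0. Each member would then be interpreted as a byte offset
into the `Users` or `Groups` section.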