
explain some optimizations

main
Motiejus Jakštys 2022-02-14 13:05:33 +02:00 committed by Motiejus Jakštys
parent f0d9d16cad
commit c6bc383269
1 changed files with 88 additions and 49 deletions

137
README.md

@ -7,17 +7,17 @@ entries (i.e. system users, groups, and group memberships). It's main goal is
performance, with focus on making [`id(1)`][id] run as fast as possible. performance, with focus on making [`id(1)`][id] run as fast as possible.
To understand more about name service switch, start with To understand more about name service switch, start with
[`nsswitch.conf(5)`](nsswitch). [`nsswitch.conf(5)`][nsswitch].
Design & constraints Design & constraints
-------------------- --------------------
To be fast, the user/group database (later: DB) has to be small ([highly To be fast, the user/group database (later: DB) has to be small ([highly
recommended background viewing](data-oriented-design)). It encodes user & group recommended background viewing][data-oriented-design]). It encodes user & group
information in a way that minimizes the DB size, and reduces jumping across the information in a way that minimizes the DB size, and reduces jumping across the
DB ("chasing pointers and polluting CPU cache"). DB ("chasing pointers and thrashing CPU cache").
For example, [`getpwnam_r(3)`](getpwnam_r) accepts a username and returns For example, [`getpwnam_r(3)`][getpwnam_r] accepts a username and returns
the following user information: the following user information:
``` ```
@ -33,14 +33,14 @@ struct passwd {
``` ```
Turbonss, among others, implements this call, and takes the following steps to Turbonss, among others, implements this call, and takes the following steps to
resolve this: resolve a username to a `struct passwd*`:
- Hash the username using a perfect hash function. Perfect hash function - Hash the username using a perfect hash function. Perfect hash function
returns a number between [0,N], where N is the total number of users. returns a number `n ∈ [0,N-1]`, where N is the total number of users.
- Jump to a known location in the DB (by pointer arithmetic) which links the - Jump to the `n`'th location in the DB (by pointer arithmetic) which contains
user's index to the user's information. That is an index to a different the index `i` to the user's information.
location within the DB. - Jump to the location `i` (pointer arithmetic) which stores the full user
- Jump to the location which stores the full user information. information.
- Decode the user information (which is all in a continuous memory block) and - Decode the user information (which is all in a continuous memory block) and
return it to the caller. return it to the caller.
@ -48,7 +48,12 @@ In total, that's one hash for the username (~150ns), two pointer jumps within
the group file, and, now that the user record is found, `memcpy` for each the group file, and, now that the user record is found, `memcpy` for each
field. field.
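
The lookup steps above can be sketched roughly as follows. This is a minimal illustration, not the turbonss implementation: `struct db_view`, `toy_hash` and `lookup_user` are hypothetical names, the toy hash stands in for a real perfect hash (e.g. one generated by cmph), and a user record is simplified to a NUL-terminated username.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified view of the DB; the layout is an assumption. */
struct db_view {
    const uint32_t *slot_to_offset; /* slot n -> byte offset i of a user record */
    uint32_t        n_users;        /* N: total number of users */
    const uint8_t  *users;          /* first byte of the user records */
};

/* Toy stand-in for a perfect hash over a two-user key set. */
static uint32_t toy_hash(const char *key) {
    return (uint32_t)((unsigned char)key[0] % 2);
}

/* Returns a pointer to the user's record, or NULL if the name is unknown. */
static const uint8_t *lookup_user(const struct db_view *db,
                                  uint32_t (*hash)(const char *),
                                  const char *name) {
    assert(db != NULL && name != NULL);
    uint32_t n = hash(name);                /* 1. hash the username */
    if (n >= db->n_users)
        return NULL;
    uint32_t i = db->slot_to_offset[n];     /* 2. first jump: slot n -> offset i */
    const uint8_t *record = db->users + i;  /* 3. second jump: offset i -> record */
    /* A perfect hash maps unknown keys to arbitrary slots, so the stored
     * name has to be compared before the record is trusted. */
    if (strcmp((const char *)record, name) != 0)
        return NULL;
    return record;
}
```

Both jumps are plain pointer arithmetic on one contiguous buffer, which is what keeps the lookup cache-friendly.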
This tight packing places some constraints on the underlying data: The turbonss DB file is `mmap`-ed, making it simple to implement pointer
arithmetic and jumping across the file. This also reduces memory usage,
especially across multiple concurrent invocations of the `id` command. The
consumed heap space for each separate turbonss instance will be minimal.
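
As a rough sketch of the `mmap` approach, assuming a POSIX system: the helper name `map_db` is hypothetical, and the real turbonss code may open and validate the file differently.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a DB file read-only. Returns NULL on failure; on success stores the
 * file size in *len_out. The caller releases the mapping with munmap(). */
static const void *map_db(const char *path, size_t *len_out) {
    assert(path != NULL && len_out != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }
    /* Read-only pages come straight from the page cache, so concurrent
     * processes mapping the same file share physical memory. */
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED)
        return NULL;
    *len_out = (size_t)st.st_size;
    return p;
}
```

No heap allocation is involved: the process only holds a pointer and a length.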
Tight packing places some constraints on the underlying data:
- Maximum database size: 4GB. - Maximum database size: 4GB.
- Maximum length of username and groupname: 32 bytes. - Maximum length of username and groupname: 32 bytes.
@ -83,54 +88,53 @@ remarks on `id(1)`
------------------ ------------------
A known implementation runs id(1) at ~250 rps sequentially on ~20k users and A known implementation runs id(1) at ~250 rps sequentially on ~20k users and
~10k groups. Our target is 10k id/s. ~10k groups. Our target is 10k id/s for the same payload.
`id(1)` works as follows: To better reason about the trade-offs, it is useful to understand how `id(1)`
is implemented, in rough terms:
- lookup user by name. - lookup user by name.
- get all additional gids (an array attached to a member). - get all additional gids (an array attached to a member).
- for each additional gid, get the group name. - for each additional gid, get the group information (`struct group*`).
Assuming a member is in ~100 groups on average, that's 1M group lookups per Assuming a member is in ~100 groups on average, that's 1M group lookups per
second. We need to convert gid to a group index, and group index to a group second. We need to convert gid to a group index, and group index to a group
gid/name quickly. gid/name quickly.
Caveat: `struct group` contains an array of pointers to names of group members Caveat: `struct group` contains an array of pointers to names of group members
(`char **gr_mem`). However, `id` does not use that information, resulting in a (`char **gr_mem`). However, `id` does not use that information, resulting in
significant read amplification. Therefore, if `argv[0] == "id"`, `getgrgid(3)` read amplification. Therefore, if `argv[0] == "id"`, our implementation of
will return group without the members. This speeds up `id` by about 10x on a `getgrgid(3)` returns the `struct group*` without the members. This speeds up
known NSS implementation. `id` by about 10x on a known NSS implementation.
Because `getgrgid(3)` does not use the group members' information, the group Relatedly, because `getgrgid(3)` does not need the group members, the group
members are stored in a different location, making the `Groups` section members are stored in a different DB section, making the `Groups` section
smaller, thus more CPU-cache-friendly. smaller, thus more CPU-cache-friendly in the hot path.
Indices Indices
------- -------
The following operations need to be fast, in order of importance: Now that we've sketched the implementation of `id(1)`, it's easier to
see which operations need to be fast, in order of importance:
1. lookup gid -> group (this is on hot path in id) with or without members (2 1. lookup gid -> group info (this is on hot path in id) without members.
separate calls).
2. lookup uid -> user. 2. lookup uid -> user.
3. lookup groupname -> group. 3. lookup groupname -> group.
4. lookup username -> user. 4. lookup username -> user.
5. (optional) iterate users using a defined order (`getent passwd`).
6. (optional) iterate groups using a defined order (`getent group`).
First 4 can use perfect hashing like [cmph][cmph]: it hashes a list of bytes to These indices can use perfect hashing like [cmph][cmph]: a perfect hash hashes
a sequential list of integers. Perfect hashing algorithms require some space, a list of bytes to a sequential list of integers. Perfect hashing algorithms
and take some time to calculate ("hashing duration"). I've tested BDZ, which require some space, and take some time to calculate ("hashing duration"). I've
hashes [][]u8 to a sequential list of integers (not preserving order) and CHM, which tested BDZ, which hashes [][]u8 to a sequential list of integers (not
does the same, but preserves order. BDZ accepts an argument 3 <= b <= 10. preserving order) and CHM, which preserves order. BDZ accepts an optional argument `3
<= b <= 10`.
BDZ: tried b=3, b=7 (default), and b=10. * BDZ algorithm requires (b=3: 900KB, b=7: 338KB, b=10: 306KB) for 1M values.
* Latency to resolve 1M keys: (170ms, 180ms, 230ms, respectively).
* BDZ algorithm requires (900KB, 338KB, 306KB, respectively) for 1M values.
* Latency to resolve 1M keys: (170ms, 180ms, 230ms).
* Packed vs non-packed latency differences are not meaningful. * Packed vs non-packed latency differences are not meaningful.
CHM retains order, however, 1M keys weigh 8MB. 10k keys are ~20x larger with CHM retains order, however, 1M keys weigh 8MB. 10k keys are ~20x larger with
CHM than with BDZ, eliminating the benefit of preserved ordering. CHM than with BDZ, eliminating the benefit of preserved ordering: we can just
have a separate index.
Turbonss header Turbonss header
--------------- ---------------
@ -168,6 +172,11 @@ and the header block fits to 64 bytes anyway, we are keeping them as u32 now.
Primitive types Primitive types
--------------- ---------------
`User` and `Group` entries are sorted by name, ordered by their unicode
codepoints. They are byte-aligned (8 bits). All `User` and `Group` entries are
referred to by their byte offset in the `Users` and `Groups` section relative to
the beginning of the section.
``` ```
const Group = struct { const Group = struct {
gid: u32, gid: u32,
@ -187,9 +196,9 @@ const User = struct {
// shell is a different story, documented elsewhere. // shell is a different story, documented elsewhere.
shell_here: u1, shell_here: u1,
shell_len_or_place: u6, shell_len_or_place: u6,
home_len: u6, homedir_len: u6,
username_pos: u1, username_is_a_suffix: u1,
username_len: u5, username_offset_or_len: u5,
gecos_len: u8, gecos_len: u8,
// a variable-sized array that will be stored immediately after this // a variable-sized array that will be stored immediately after this
// struct. // struct.
@ -197,8 +206,28 @@ const User = struct {
} }
``` ```
`User` and `Group` entries are sorted by name, ordered by their unicode `stringdata` contains a few string entries:
codepoints. - homedir.
- username.
- gecos.
- shell (optional).
First byte of the homedir is stored right after the `gecos_len` field, and its
length is `homedir_len`. The same logic applies to all the `stringdata` fields:
there is a way to calculate their relative position from the length of the
fields before them.
Additionally, two optimizations for special fields are made:
- shells are often shared across different users, see the "Shells" section.
- username is frequently a suffix of the homedir, for example `/home/motiejus`
and `motiejus`, in which case storing both username and homedir strings is
wasteful. For such cases, username has two options:
1. `username_is_a_suffix=true`: username is a suffix of the home dir. In that
case, the username starts at the `username_offset_or_len`'th byte of the
homedir, and ends at the same place as the homedir.
2. `username_is_a_suffix=false`: username is stored separately. In that case,
it begins one byte after homedir, and its length is
`username_offset_or_len`.
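
The two username cases can be sketched as below. This is a hedged illustration: `struct str_view` and `username_of` are hypothetical names, the length and offset fields are assumed to hold plain byte counts, and "one byte after homedir" is read as "immediately after the last homedir byte".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* A borrowed, non-NUL-terminated byte range inside the mapped DB. */
struct str_view {
    const uint8_t *ptr;
    size_t         len;
};

/* stringdata points at the first byte of the homedir. */
static struct str_view username_of(const uint8_t *stringdata,
                                   size_t homedir_len,
                                   bool username_is_a_suffix,
                                   size_t username_offset_or_len) {
    struct str_view v;
    if (username_is_a_suffix) {
        /* The field is an offset into the homedir; the username ends where
         * the homedir ends. */
        assert(username_offset_or_len <= homedir_len);
        v.ptr = stringdata + username_offset_or_len;
        v.len = homedir_len - username_offset_or_len;
    } else {
        /* The field is a length; the username follows the homedir. */
        v.ptr = stringdata + homedir_len;
        v.len = username_offset_or_len;
    }
    return v;
}
```

For `/home/motiejus` (14 bytes) the suffix case stores offset 6 and no separate username bytes, saving 8 bytes for that record.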
Shells Shells
------ ------
@ -221,14 +250,21 @@ to it's index in the external storage.
Shells in the external storage are sorted by their weight, which is Shells in the external storage are sorted by their weight, which is
`length*frequency`. `length*frequency`.
Variable-length integers (varints)
----------------------------------
Varint is an efficiently encoded integer (packed for small values). Same as
[protocol buffer varints][varint], except the largest possible value is `u64`.
They compress integers well.
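
For reference, a minimal sketch of protocol-buffer-style varint encoding and decoding for `u64` values. The function names are hypothetical, and the turbonss encoder may differ in details such as error handling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode value into out (which needs room for up to 10 bytes).
 * Returns the number of bytes written. */
static size_t varint_encode(uint64_t value, uint8_t *out) {
    assert(out != NULL);
    size_t n = 0;
    while (value >= 0x80) {
        out[n++] = (uint8_t)(value | 0x80); /* low 7 bits + continuation bit */
        value >>= 7;
    }
    out[n++] = (uint8_t)value;              /* last byte: continuation bit clear */
    return n;
}

/* Decode one varint from in[0..in_len). Returns the number of bytes
 * consumed, or 0 if the input is truncated or longer than 10 bytes. */
static size_t varint_decode(const uint8_t *in, size_t in_len, uint64_t *value) {
    assert(in != NULL && value != NULL);
    uint64_t result = 0;
    for (size_t i = 0; i < in_len && i < 10; i++) {
        result |= (uint64_t)(in[i] & 0x7f) << (7 * i);
        if ((in[i] & 0x80) == 0) {
            *value = result;
            return i + 1;
        }
    }
    return 0;
}
```

Values below 128 take a single byte, and 300 encodes to the two bytes `0xAC 0x02`. Small offsets therefore stay compact.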
`groupmembers`, `additional_gids` `groupmembers`, `additional_gids`
--------------------------------- ---------------------------------
`groupmembers` and `additional_gids` store group and user memberships `groupmembers` and `additional_gids` store group and user memberships
respectively: for each group, a list of pointers ("offsets") to User records, respectively: for each group, a list of pointers (offsets) to User records, and
and for each user — a list of pointers to Group records. These fields are for each user — a list of pointers to Group records. These fields are always
always used in their entirety — making random-access not required, thus used in their entirety — not necessitating random access, thus suitable for
suitable for tight packing. tight packing.
An entry of `groupmembers` and `additional_gids` looks like this piece of An entry of `groupmembers` and `additional_gids` looks like this piece of
pseudo-code: pseudo-code:
@ -242,10 +278,12 @@ const Groupmembers = PackedList;
const AdditionalGids = PackedList; const AdditionalGids = PackedList;
``` ```
The single entry in `members` field points to an offset into a `User` or An entry in `members` field points to the offset into a respective `User` or
`Group` entry (number of bytes relative to the first entry of the respective `Group` entry (number of bytes relative to the first entry of the type).
type). The `members` field in `PackedList` is sorted by the name (`username` or `members` in `PackedList` is sorted by the name (`username` or `groupname`) of
`groupname`) of the record it is pointing to. the record it is pointing to.
A packed list is a list of varints.
Complete file structure Complete file structure
----------------------- -----------------------
@ -270,3 +308,4 @@ Complete file structure
[nsswitch]: https://linux.die.net/man/5/nsswitch.conf [nsswitch]: https://linux.die.net/man/5/nsswitch.conf
[data-oriented-design]: https://media.handmade-seattle.com/practical-data-oriented-design/ [data-oriented-design]: https://media.handmade-seattle.com/practical-data-oriented-design/
[getpwnam_r]: https://linux.die.net/man/3/getpwnam_r [getpwnam_r]: https://linux.die.net/man/3/getpwnam_r
[varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints