update README

Motiejus Jakštys 2022-02-23 10:45:05 +02:00 committed by Motiejus Jakštys
parent 609ab3d2b6
commit 6c386e720e

README.md

@ -10,7 +10,7 @@ Turbonss is optimized for reading. If the data changes in any way, the whole
file will need to be regenerated (and tooling supports only full file will need to be regenerated (and tooling supports only full
generation). It was created, and best suited, for environments that have a generation). It was created, and best suited, for environments that have a
central user & group database which then needs to be distributed to many central user & group database which then needs to be distributed to many
servers/services. servers/services, and the data does not change very often (e.g. hourly).
To understand more about name service switch, start with To understand more about name service switch, start with
[`nsswitch.conf(5)`][nsswitch]. [`nsswitch.conf(5)`][nsswitch].
@ -42,15 +42,15 @@ struct passwd {
Turbonss, among others, implements this call, and takes the following steps to Turbonss, among others, implements this call, and takes the following steps to
resolve a username to a `struct passwd*`: resolve a username to a `struct passwd*`:
- Open the DB (using `mmap`) and interpret its first 40 bytes as a `struct - Open the DB (using `mmap`) and interpret its first 64 bytes as a `*struct
Header`. The header stores offsets to the sections of the file. This needs to Header`. The header stores offsets to the sections of the file. This needs to
be done once, when the NSS library is loaded (or on the first call). be done once, when the NSS library is loaded.
- Hash the username using a perfect hash function. Perfect hash function - Hash the username using a perfect hash function. Perfect hash function
returns a number `n ∈ [0,N-1]`, where N is the total number of users. returns a number `n ∈ [0,N-1]`, where N is the total number of users.
- Jump to the `n`'th location in the `idx_name2user` section (by pointer - Jump to the `n`'th location in the `idx_name2user` section, which contains
arithmetic), which contains the index `i` to the user's information. the index `i` to the user's information.
- Jump to the location `i` of section `Users` (again, using pointer arithmetic) - Jump to the location `i` of section `Users`, which stores the full user
which stores the full user information. information.
- Decode the user information (which is all in a continuous memory block) and - Decode the user information (which is all in a continuous memory block) and
return it to the caller. return it to the caller.
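
The lookup steps above can be sketched as a toy model. This is not the turbonss implementation: the function names, the length-prefixed record format, and the dict standing in for the perfect hash are all assumptions made for illustration. The real DB is a `mmap`-ed binary file and the real hash is a minimal perfect hash.

```python
# Toy model of the lookup path: hash -> idx_name2user -> Users.
# All names are illustrative; a Python dict stands in for the perfect hash.

def build(users):
    """users: dict of name -> record bytes. Returns (hash_fn, idx, blob)."""
    names = sorted(users)
    slot = {name: n for n, name in enumerate(names)}  # stand-in perfect hash

    def perfect_hash(name):
        return slot[name]  # n in [0, N-1]

    blob = bytearray()  # plays the role of the Users section
    idx_name2user = [0] * len(names)
    for name in names:
        idx_name2user[slot[name]] = len(blob)  # offset of the record in Users
        record = users[name]
        # hypothetical record format: 2-byte length prefix, then the payload
        blob += len(record).to_bytes(2, "little") + record
    return perfect_hash, idx_name2user, bytes(blob)


def lookup(name, perfect_hash, idx_name2user, blob):
    n = perfect_hash(name)            # hash the username
    i = idx_name2user[n]              # jump to the n-th index entry
    size = int.from_bytes(blob[i:i + 2], "little")
    return blob[i + 2:i + 2 + size]   # jump to offset i in Users and decode
```

One difference worth keeping in mind: a real minimal perfect hash also returns some slot for names that are not in the DB, so a real lookup has to compare the stored name with the requested one before returning the record.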
@ -58,10 +58,9 @@ In total, that's one hash for the username (~150ns), two pointer jumps within
the group file (to sections `idx_name2user` and `Users`), and, now that the the group file (to sections `idx_name2user` and `Users`), and, now that the
user record is found, `memcpy` for each field. user record is found, `memcpy` for each field.
The turbonss DB file is `mmap`-ed, making it simple to implement pointer The turbonss DB file is `mmap`-ed, making it simple to jump across the file
arithmetic and jumping across the file. This also reduces memory usage, using pointer arithmetic. This also reduces memory usage, as the mmap'ed
especially across multiple concurrent invocations of the `id` command. The regions are shared. Turbonss reads do not consume any heap space.
consumed heap space for each separate turbonss instance will be minimal.
Tight packing places some constraints on the underlying data: Tight packing places some constraints on the underlying data:
@ -105,9 +104,12 @@ A known implementation runs id(1) at ~250 rps sequentially on ~20k users and
To better reason about the trade-offs, it is useful to understand how `id(1)` To better reason about the trade-offs, it is useful to understand how `id(1)`
is implemented, in rough terms: is implemented, in rough terms:
- lookup user by name. - lookup user by name ([`getpwent_r(3)`][getpwent_r]).
- get all additional gids (an array attached to a member). - get all gids for the user ([`getgrouplist(3)`][getgrouplist]). Note: it is
- for each additional gid, get the group information (`struct group*`). actually using `initgroups_dyn`, accepts a uid, and is very poorly
documented.
- for each additional gid, get the `struct group*`
([`getgrgid_r(3)`][getgrgid_r]).
Assuming a member is in ~100 groups on average, that's 1M group lookups per Assuming a member is in ~100 groups on average, that's 1M group lookups per
second. We need to convert gid to a group index, and group index to a group second. We need to convert gid to a group index, and group index to a group
@ -115,40 +117,13 @@ gid/name quickly.
Caveat: `struct group` contains an array of pointers to names of group members Caveat: `struct group` contains an array of pointers to names of group members
(`char **gr_mem`). However, `id` does not use that information, resulting in (`char **gr_mem`). However, `id` does not use that information, resulting in
read amplification. Therefore, if `argv[0] == "id"`, our implementation of read amplification, sometimes by 10-100x. Therefore, if `argv[0] == "id"`, our
`getgrid(3)` returns the `struct group*` without the members. This speeds up implementation of [`getgrid_r(3)`][getgrid] returns the `struct group*` without
`id` by about 10x on a known NSS implementation. the members. This speeds up `id` by about 10x on a known NSS implementation.
Relatedly, because `getgrid(3)` does not need the group members, the group Relatedly, because [`getgrid_r(3)`][getgrid] does not need the group members,
members are stored in a different DB sectoin, making the `Groups` section the group members are stored in a different DB section, reducing the `Groups`
smaller, thus more CPU-cache-friendly in the hot path. section and making more of it fit the CPU caches.
Indices
-------
Now that we've sketched the implementation of `id(3)`, it's clearer to
understand which operations need to be fast; in order of importance:
1. lookup gid -> group info (this is on hot path in id) without members.
2. lookup username -> user's groups.
3. lookup uid -> user.
4. lookup groupname -> group.
5. lookup username -> user.
These indices can use perfect hashing like [bdz from cmph][cmph]: a perfect
hash hashes a list of bytes to a sequential list of integers. Perfect hashing
algorithms require some space, and take some time to calculate ("hashing
duration"). I've tested BDZ, which hashes [][]u8 to a sequential list of
integers (not preserving order) and CHM, preserves order. BDZ accepts an
optional argument `3 <= b <= 10`.
* BDZ algorithm requires (b=3, 900KB, b=7, 338KB, b=10, 306KB) for 1M values.
* Latency to resolve 1M keys: (170ms, 180ms, 230ms, respectively).
* Packed vs non-packed latency differences are not meaningful.
CHM retains order, however, 1M keys weigh 8MB. 10k keys are ~20x larger with
CHM than with BDZ, eliminating the benefit of preserved ordering: we can just
have a separate index.
Turbonss header Turbonss header
--------------- ---------------
@ -160,7 +135,7 @@ OFFSET TYPE NAME DESCRIPTION
0 [4]u8 magic always 0xf09fa4b7 0 [4]u8 magic always 0xf09fa4b7
4 u8 version now `0` 4 u8 version now `0`
5 u16 bom 0x1234 5 u16 bom 0x1234
u8 num_shells max value: 63. Padding is strange on little endian. u8 num_shells max value: 63.
8 u32 num_users number of passwd entries 8 u32 num_users number of passwd entries
12 u32 num_groups number of group entries 12 u32 num_groups number of group entries
16 u32 offset_bdz_uid2user 16 u32 offset_bdz_uid2user
@ -180,10 +155,11 @@ created at. Turbonss files cannot be moved across different-endianness
computers. If that happens, turbonss will refuse to read the file. computers. If that happens, turbonss will refuse to read the file.
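
A minimal sketch of the header checks described here, covering only the first 16 bytes listed in the table (`magic` through `num_groups`). The function name and the error messages are assumptions; the field offsets, the magic bytes, the `0x1234` byte-order mark and the 63-shell limit come from this document.

```python
import struct

MAGIC = bytes([0xF0, 0x9F, 0xA4, 0xB7])

# "=" selects native byte order without padding, which matches the packed
# offsets 0 (magic), 4 (version), 5 (bom), 7 (num_shells), 8, 12.
PREFIX = struct.Struct("=4sBHBII")


def parse_header_prefix(buf):
    """Validate and decode the first 16 header bytes (illustrative only)."""
    magic, version, bom, num_shells, num_users, num_groups = PREFIX.unpack_from(buf, 0)
    if magic != MAGIC:
        raise ValueError("bad magic")
    if version != 0:
        raise ValueError("unsupported version")
    if bom != 0x1234:
        # A file written on a host with the other byte order reads as 0x3412.
        raise ValueError("file was created on a different-endianness host")
    if num_shells > 63:
        raise ValueError("too many shells")
    return {
        "num_shells": num_shells,
        "num_users": num_users,
        "num_groups": num_groups,
    }
```

Because `bom` is read in native byte order, the same comparison detects a foreign file on both little-endian and big-endian hosts.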
Offsets are indices to further sections of the file, with zero being the first Offsets are indices to further sections of the file, with zero being the first
block (pointing to the `magic` field). As all blobs are 64-byte aligned, the block (pointing to the `magic` field). As all sections are aligned to 64 bytes,
offsets are always pointing to the beginning of an 64-byte "block". Therefore, the offsets are always pointing to the beginning of an 64-byte "block".
all `offset_*` values could be `u26`. As `u32` is easier to visualize with xxd, Therefore, all `offset_*` values could be `u26`. As `u32` is easier to
and the header block fits to 64 bytes anyway, we are keeping them as u32 now. visualize with xxd, and the header block fits to 64 bytes anyway, we are
keeping them as u32 now.
Sections whose lengths can be calculated do not have a corresponding `offset_*` Sections whose lengths can be calculated do not have a corresponding `offset_*`
header field. For example, `bdz_gid2group` comes immediately after the header, header field. For example, `bdz_gid2group` comes immediately after the header,
@ -193,13 +169,13 @@ and `idx_groupname2group` comes after `idx_gid2group`, whose offset is
`num_shells` would fit to u6; however, we would need 2 bits of padding (all `num_shells` would fit to u6; however, we would need 2 bits of padding (all
other fields are byte-aligned). If we instead do `u2` followed by `u6`, the other fields are byte-aligned). If we instead do `u2` followed by `u6`, the
byte would look very unusual on a little-endian architecture. Therefore we will byte would look very unusual on a little-endian architecture. Therefore we will
just refuse loading the file if the number of shells exceeds 63. just reject the DB if the number of shells exceeds 63.
Primitive types Primitive types
--------------- ---------------
`User` and `Group` entries are sorted by name, ordered by their unicode `User` and `Group` entries are sorted by the order they were received in the input
codepoints. They are byte-aligned (8bits). All `User` and `Group` entries are file. All entries are aligned to 8 bytes. All `User` and `Group` entries are
referred by their byte offset in the `Users` and `Groups` section relative to referred by their byte offset in the `Users` and `Groups` section relative to
the beginning of the section. the beginning of the section.
@ -214,61 +190,59 @@ const Group = struct {
groupname []u8; groupname []u8;
} }
const User = struct { pub const PackedUser = packed struct {
uid: u32, uid: u32,
gid: u32, gid: u32,
// pointer to a separate structure that contains a list of gids
additional_gids_offset: u29,
// shell is a different story, documented elsewhere.
shell_here: bool, shell_here: bool,
shell_len_or_idx: u6, shell_len_or_idx: u6,
home_len: u6, home_len: u6,
name_is_a_suffix: bool, name_is_a_suffix: bool,
name_len: u5, name_len: u5,
gecos_len: u8, gecos_len: u10,
// a variable-sized array that will be stored immediately after this padding: u3,
// struct. // pseudocode: variable-sized array that will be stored immediately after
// this struct.
stringdata []u8; stringdata []u8;
} }
``` ```
`stringdata` contains a few string entries: `stringdata` contains a few string entries:
- home. - home.
- name. - name (optional).
- gecos. - gecos.
- shell (optional). - shell (optional).
First byte of the home is stored right after the `gecos_len` field, and its First byte of home is stored right after the `gecos_len` field, and its
length is `home_len`. The same logic applies to all the `stringdata` fields: length is `home_len`. The same logic applies to all the `stringdata` fields:
there is a way to calculate their relative position from the length of the there is a way to calculate their relative position from the length of the
fields before them. fields before them.
Additionally, two optimizations for special fields are made: Additionally, there are two "easy" optimizations:
- shells are often shared across different users, see the "Shells" section. - shells are often shared across different users, see the "Shells" section.
- name is frequently a suffix of the home. For example, `/home/motiejus` - `name` is frequently a suffix of `home`. For example, `/home/motiejus` and
and `motiejus`. In which case storing both name and home strings is `motiejus`. In this case storing both name and home is wasteful. Therefore
wasteful. For that cases, name has two options: name has two options:
1. `name_is_a_suffix=true`: name is a suffix of the home dir. In that 1. `name_is_a_suffix=true`: name is a suffix of the home dir. Then `name`
case, the name starts at the `home_len - name_len`'th starts at the `home_len - name_len`'th byte of `home`, and ends at the same
byte of the home, and ends at the same place as the home. place as `home`.
2. `name_is_a_suffix=false`: name is stored separately. In that case, 2. `name_is_a_suffix=false`: name begins one byte after home, and it's length
it begins one byte after home, and it's length is is `name_len`.
`name_len`.
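
The two `name` cases can be illustrated with a short sketch. It assumes that `home_len` and `name_len` hold plain byte lengths and that `stringdata` starts with `home`; the function name and return shape are hypothetical.

```python
def decode_home_and_name(stringdata, home_len, name_len, name_is_a_suffix):
    """Return (home, name, offset where the next string starts)."""
    home = stringdata[:home_len]
    if name_is_a_suffix:
        # name shares its bytes with the tail of home; nothing extra is stored
        name = home[home_len - name_len:]
        next_offset = home_len
    else:
        # name is stored separately, immediately after home
        name = stringdata[home_len:home_len + name_len]
        next_offset = home_len + name_len
    return home, name, next_offset
```

For `/home/motiejus` and `motiejus` the suffix case stores 14 bytes instead of 22.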
Shells Shells
------ ------
Normally there is a limited number of shells even in the huge user databases. A Normally there is a limited number of separate shells even in huge user
few examples: `/bin/bash`, `/usr/bin/nologin`, `/bin/zsh` among others. databases. A few examples: `/bin/bash`, `/usr/bin/nologin`, `/bin/zsh` among
Therefore, "shells" have an optimization: they can be pointed by in the others. Therefore, "shells" have an optimization: they can be pointed by in the
external list, or reside among the user's data. external list, or, if they are unique to the user, reside among the user's
data.
63 most popular shells (i.e. referred to by at least two User entries) are 63 most popular shells (i.e. referred to by at least two User entries) are
stored externally in "Shells" area. The less popular ones are stored with stored externally in "Shells" area. The less popular ones are stored with
userdata. userdata.
There are two "Shells" areas: the index and the blob. The index is a list of Shells section consists of two sub-sections: the index and the blob. The index
structs which point to a location in the "blob" area: is a list of structs which point to a location in the "blob" area:
``` ```
const ShellIndex = struct { const ShellIndex = struct {
@ -277,29 +251,24 @@ const ShellIndex = struct {
}; };
``` ```
In the user's struct the `shell_here=true` bit signifies that the shell is In the user's struct `shell_here=true` signifies that the shell is stored with
stored with userdata. `false` means it is stored in the `Shells` section. If userdata, and it's length is `shell_len_or_idx`. `shell_here=false` means it is
the shell is stored "here", it is the first element in `stringdata`, and it's stored in the `Shells` section, and it's index is `shell_len_or_idx`.
length is `shell_len_or_idx`. If it is stored externally, the latter variable
points to it's index in the ShellIndex area.
Shells in the external storage are sorted by their weight, which is
`length*frequency`.
Variable-length integers (varints) Variable-length integers (varints)
---------------------------------- ----------------------------------
Varint is an efficiently encoded integer (packed for small values). Same as Varint is an efficiently encoded integer (packed for small values). Same as
[protocol buffer varints][varint], except the largest possible value is `u64`. [protocol buffer varints][varint], except the largest possible value is `u64`.
They compress integers well. They compress integers well. Varints are stored for group memberships.
Group memberships Group memberships
----------------- -----------------
There are two group memberships at play: There are two group memberships at play:
1. given a username, resolve user's group gids (for `initgroups(3)`). 1. Given a username, resolve user's group gids (for `initgroups(3)`).
2. given a group (gid/name), resolve the members' names (e.g. `getgrgid`). 2. Given a group (gid/name), resolve the members' names (e.g. `getgrgid`).
When user's groups are resolved in (1), the additional userdata is not When user's groups are resolved in (1), the additional userdata is not
requested (there is no way to return it). Therefore, it is reasonable to store requested (there is no way to return it). Therefore, it is reasonable to store
@ -309,23 +278,26 @@ username.
When group's memberships are resolved in (2), the same call also requires other When group's memberships are resolved in (2), the same call also requires other
group information: gid and group name. Therefore it makes sense to store a group information: gid and group name. Therefore it makes sense to store a
pointer to the group members in the group information itself. However, the pointer to the group members in the group information itself. However, the
memberships are not *always* necessary (see remarks about `id(1)` in this memberships are not *always* necessary (see remarks about `id(1)`), therefore
document), therefore the memberships will be stored separately, outside of the the memberships will be stored separately, outside of the groups section.
groups section.
`groupmembers` and `additional_gids` store group and user memberships `Groupmembers` and `Username2gids` store group and user memberships
respectively: for each group, a list of pointers (offsets) to User records, and respectively. Membership IDs are used in their entirety — not necessitating
for each user — a list of pointers to Group records. These fields are always random access, thus suitable for tight packing and varint encoding.
used in their entirety — not necessitating random access, thus suitable for
tight packing.
An entry of `groupmembers` and `additional_gids` looks like this piece of
- For each group — a list of pointers (offsets) to User records, because
`getgr*_r` returns an array of pointers to membernames.
- For each user — a list of gids, because `initgroups_dyn` (and friends)
returns an array of gids.
An entry of `Groupmembers` and `Username2gids` looks like this piece of
pseudo-code: pseudo-code:
``` ```
const PackedList = struct { const PackedList = struct {
length: varint, Length: varint,
members: [length]varint, Members: [Length]varint,
} }
const Groupmembers = PackedList; const Groupmembers = PackedList;
const Username2gids = PackedList; const Username2gids = PackedList;
@ -333,17 +305,40 @@ const Username2gids = PackedList;
A packed list is a list of varints. A packed list is a list of varints.
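
As an illustration of the `PackedList` layout, here is a sketch using protocol-buffer-style varints (7 payload bits per byte, least significant group first, high bit set on every byte except the last). The function names are hypothetical, and whether turbonss stores members as raw values or in some other form is not specified here, so the sketch stores raw non-negative integers.

```python
def encode_varint(value):
    """Encode a non-negative integer as a protobuf-style varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf, pos=0):
    """Return (value, position after the varint)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_packed_list(members):
    # Length first, then each member, all as varints.
    return encode_varint(len(members)) + b"".join(encode_varint(m) for m in members)


def decode_packed_list(buf, pos=0):
    """Return (members, position after the list)."""
    length, pos = decode_varint(buf, pos)
    members = []
    for _ in range(length):
        member, pos = decode_varint(buf, pos)
        members.append(member)
    return members, pos
```

Since a list is always read front to back, the decoder never needs to seek inside it, which is what makes the variable-width encoding practical.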
Section `Username2gidsIndex` stores an index from `hash(username)` to `offset` Indices
in Username2gids. -------
Complete file structure Now that we've sketched the implementation of `id(3)`, it's clearer to
----------------------- understand which operations need to be fast; in order of importance:
1. lookup gid -> group info (this is on hot path in id) without members.
2. lookup username -> user's groups.
3. lookup uid -> user.
4. lookup groupname -> group.
5. lookup username -> user.
`idx_*` sections are of type `[]PackedIntArray(u29)` and are pointing to the `idx_*` sections are of type `[]PackedIntArray(u29)` and are pointing to the
respective `Groups` and `Users` entries (from the beginning of the respective respective `Groups` and `Users` entries (from the beginning of the respective
section). Since User and Group records are 8-byte aligned, 3 bits can be saved section). Since User and Group records are 8-byte aligned, 3 bits are saved for
from every element. However, since the header easily fits to 64 bytes, we are every element.
storing plain `u32` for easier inspection.
These indices can use perfect hashing like [bdz from cmph][cmph]: a perfect
hash hashes a list of bytes to a sequential list of integers. Perfect hashing
algorithms require some space, and take some time to calculate ("hashing
duration"). I've tested BDZ, which hashes [][]u8 to a sequential list of
integers (not preserving order) and CHM, preserves order. BDZ accepts an
optional argument `3 <= b <= 10`.
* BDZ algorithm requires (b=3, 900KB, b=7, 338KB, b=10, 306KB) for 1M values.
* Latency to resolve 1M keys: (170ms, 180ms, 230ms, respectively).
* Packed vs non-packed latency differences are not meaningful.
CHM retains order, however, 1M keys weigh 8MB. 10k keys are ~20x larger with
CHM than with BDZ, eliminating the benefit of preserved ordering: we can just
have a separate index.
Complete file structure
-----------------------
Each section is padded to 64 bytes. Each section is padded to 64 bytes.
@ -374,3 +369,6 @@ STATUS SECTION SIZE DESCRIPTION
[data-oriented-design]: https://media.handmade-seattle.com/practical-data-oriented-design/ [data-oriented-design]: https://media.handmade-seattle.com/practical-data-oriented-design/
[getpwnam_r]: https://linux.die.net/man/3/getpwnam_r [getpwnam_r]: https://linux.die.net/man/3/getpwnam_r
[varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints [varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints
[getpwent_r]: https://www.man7.org/linux/man-pages/man3/getpwent_r.3.html
[getgrouplist]: https://www.man7.org/linux/man-pages/man3/getgrouplist.3.html
[getgrid_r]: https://www.man7.org/linux/man-pages/man3/getgrid_r.3.html