update README

Turbonss is optimized for reading. If the data changes in any way, the whole
file will need to be regenerated (and the tooling supports only full
generation). It was created for, and is best suited to, environments that have
a central user & group database which then needs to be distributed to many
servers/services, and where the data does not change very often (e.g. hourly).

To understand more about name service switch, start with
[`nsswitch.conf(5)`][nsswitch].

Turbonss, among others, implements this call, and takes the following steps to
resolve a username to a `struct passwd*`:

- Open the DB (using `mmap`) and interpret its first 64 bytes as a `*struct
  Header`. The header stores offsets to the sections of the file. This needs
  to be done once, when the NSS library is loaded.
- Hash the username using a perfect hash function. A perfect hash function
  returns a number `n ∈ [0,N-1]`, where N is the total number of users.
- Jump to the `n`'th location in the `idx_name2user` section, which contains
  the index `i` to the user's information.
- Jump to the location `i` of section `Users`, which stores the full user
  information.
- Decode the user information (which is all in a contiguous memory block) and
  return it to the caller.
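
A minimal C sketch of these steps, assuming the section pointers have already
been computed from the header; `bdz_hash`, the entry widths and the argument
names are placeholders for illustration, not the turbonss API:

```
#include <stdint.h>
#include <string.h>

/* Placeholder: perfect hash mapping a username to n in [0, num_users-1]. */
uint32_t bdz_hash(const char *name, size_t len);

/* idx_name2user: one index entry per user; users: packed user records.
 * Both pointers point into the mmap-ed DB. */
const uint8_t *lookup_user(const uint32_t *idx_name2user,
                           const uint8_t *users,
                           const char *name) {
    uint32_t n = bdz_hash(name, strlen(name)); /* hash the username      */
    uint32_t i = idx_name2user[n];             /* jump #1: index section */
    return users + i;                          /* jump #2: Users section */
}
```

The returned pointer would then be decoded into the caller-provided
`struct passwd` buffers.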

In total, that's one hash for the username (~150ns), two pointer jumps within
the DB file (to sections `idx_name2user` and `Users`), and, now that the user
record is found, a `memcpy` for each field.

The turbonss DB file is `mmap`-ed, making it simple to jump across the file
using pointer arithmetic. This also reduces memory usage, as the mmap'ed
regions are shared. Turbonss reads do not consume any heap space.
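
For reference, a sketch of how such a read-only mapping is typically set up
(standard POSIX calls; the function name is made up):

```
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map the DB read-only; concurrent processes share the same page cache pages. */
const uint8_t *map_db(const char *path, size_t *out_len) {
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return NULL;
    struct stat st;
    if (fstat(fd, &st) == -1) {
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return NULL;
    *out_len = (size_t)st.st_size;
    return (const uint8_t *)p;
}
```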

Tight packing places some constraints on the underlying data:

To better reason about the trade-offs, it is useful to understand how `id(1)`
is implemented, in rough terms:

- lookup user by name ([`getpwent_r(3)`][getpwent_r]).
- get all gids for the user ([`getgrouplist(3)`][getgrouplist]). Note: it
  actually uses `initgroups_dyn`, accepts a uid, and is very poorly
  documented.
- for each additional gid, get the `struct group*`
  ([`getgrgid_r(3)`][getgrgid_r]).
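
A compressed C sketch of that call sequence (not how any particular `id`
binary is written; [`getpwnam_r(3)`][getpwnam_r] is used here for the by-name
lookup, and the buffer sizes and output format are arbitrary):

```
#include <sys/types.h>
#include <grp.h>
#include <pwd.h>
#include <stdio.h>

/* Roughly what `id <username>` asks of NSS; error handling kept minimal. */
int print_id(const char *username) {
    char pwbuf[16 * 1024];
    struct passwd pw, *pwp = NULL;
    if (getpwnam_r(username, &pw, pwbuf, sizeof(pwbuf), &pwp) != 0 || !pwp)
        return -1;

    gid_t gids[1024];
    int ngids = 1024;
    if (getgrouplist(username, pw.pw_gid, gids, &ngids) == -1)
        return -1; /* more than 1024 groups; a real caller would retry */

    for (int i = 0; i < ngids; i++) {
        char grbuf[16 * 1024];
        struct group gr, *grp = NULL;
        if (getgrgid_r(gids[i], &gr, grbuf, sizeof(grbuf), &grp) == 0 && grp)
            printf("%u(%s)%s", (unsigned)grp->gr_gid, grp->gr_name,
                   i + 1 == ngids ? "\n" : ",");
    }
    return 0;
}
```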

Assuming a member is in ~100 groups on average, that's 1M group lookups per
second. We need to convert a gid to a group index, and a group index to a group
gid/name quickly.

Caveat: `struct group` contains an array of pointers to the names of the group
members (`char **gr_mem`). However, `id` does not use that information,
resulting in read amplification, sometimes by 10-100x. Therefore, if
`argv[0] == "id"`, our implementation of [`getgrgid_r(3)`][getgrgid_r] returns
the `struct group*` without the members. This speeds up `id` by about 10x on a
known NSS implementation.

Relatedly, because [`getgrgid_r(3)`][getgrgid_r] does not need the group
members, the group members are stored in a different DB section, reducing the
`Groups` section and making more of it fit the CPU caches.

Turbonss header
---------------

```
OFFSET  TYPE   NAME        DESCRIPTION
     0  [4]u8  magic       always 0xf09fa4b7
     4  u8     version     now `0`
     5  u16    bom         0x1234
     7  u8     num_shells  max value: 63.
     8  u32    num_users   number of passwd entries
    12  u32    num_groups  number of group entries
    16  u32    offset_bdz_uid2user
```
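
A C view of the fields shown above (only the fields visible in this excerpt;
the remaining `offset_*` fields are omitted). Packing is required because
`bom` sits at the odd offset 5, and the `bom` check is how a reader could
detect an endianness mismatch:

```
#include <stdint.h>
#include <string.h>

struct __attribute__((packed)) turbonss_header {
    uint8_t  magic[4];    /* always 0xf0 0x9f 0xa4 0xb7 */
    uint8_t  version;     /* now 0 */
    uint16_t bom;         /* 0x1234 when read on a same-endianness host */
    uint8_t  num_shells;  /* max value: 63 */
    uint32_t num_users;
    uint32_t num_groups;
    uint32_t offset_bdz_uid2user;
    /* further offset_* fields not shown in this excerpt */
};

/* Refuse the file if the magic is wrong or it was written on a
 * different-endianness machine. */
int header_ok(const struct turbonss_header *h) {
    static const uint8_t magic[4] = {0xf0, 0x9f, 0xa4, 0xb7};
    return memcmp(h->magic, magic, 4) == 0 && h->bom == 0x1234;
}
```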

Turbonss files cannot be moved across different-endianness computers. If that
happens, turbonss will refuse to read the file.

Offsets are indices to further sections of the file, with zero being the first
block (pointing to the `magic` field). As all sections are aligned to 64 bytes,
the offsets always point to the beginning of a 64-byte "block". Therefore, all
`offset_*` values could be `u26`. As `u32` is easier to visualize with xxd, and
the header block fits in 64 bytes anyway, we are keeping them as u32 for now.
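
In code, translating such an offset into a pointer is a single multiplication
(sketch; `offset` stands for any of the `offset_*` header fields):

```
#include <stddef.h>
#include <stdint.h>

/* db points at the start of the mmap-ed file (i.e. at `magic`). */
static inline const void *section_ptr(const uint8_t *db, uint32_t offset) {
    return db + (size_t)offset * 64;
}
```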

Sections whose lengths can be calculated do not have a corresponding `offset_*`
header field. For example, `bdz_gid2group` comes immediately after the header,
and `idx_groupname2group` comes after `idx_gid2group`, whose offset is

`num_shells` would fit in a u6; however, we would need 2 bits of padding (all
other fields are byte-aligned). If we instead do `u2` followed by `u6`, the
byte would look very unusual on a little-endian architecture. Therefore we
just reject the DB if the number of shells exceeds 63.

Primitive types
---------------

`User` and `Group` entries are sorted in the order they were received in the
input file. All entries are aligned to 8 bytes. All `User` and `Group` entries
are referred to by their byte offset in the `Users` and `Groups` sections,
relative to the beginning of the section.

```
const Group = struct {
    groupname []u8;
}

pub const PackedUser = packed struct {
    uid: u32,
    gid: u32,
    // pointer to a separate structure that contains a list of gids
    additional_gids_offset: u29,
    // shell is a different story, documented elsewhere.
    shell_here: bool,
    shell_len_or_idx: u6,
    home_len: u6,
    name_is_a_suffix: bool,
    name_len: u5,
    gecos_len: u10,
    padding: u3,
    // pseudocode: variable-sized array that will be stored immediately after
    // this struct.
    stringdata []u8;
}
```

`stringdata` contains a few string entries:
- home.
- name (optional).
- gecos.
- shell (optional).

The first byte of home is stored right after the `gecos_len` field, and its
length is `home_len`. The same logic applies to all the `stringdata` fields:
their relative positions can be calculated from the lengths of the fields
before them (see the sketch after the list below).

Additionally, there are two "easy" optimizations:
- shells are often shared across different users, see the "Shells" section.
- `name` is frequently a suffix of `home`. For example, `/home/motiejus` and
  `motiejus`. In this case storing both name and home is wasteful. Therefore
  name has two options:
  1. `name_is_a_suffix=true`: name is a suffix of the home dir. Then `name`
     starts at the `home_len - name_len`'th byte of `home`, and ends at the
     same place as `home`.
  2. `name_is_a_suffix=false`: name begins one byte after home, and its length
     is `name_len`.
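
A sketch of that layout logic in C. The view struct and field widths are
illustrative (they mirror the packed fields above), not the turbonss decoder:

```
#include <stdint.h>

/* Field values as decoded from one PackedUser record (illustrative). */
struct packed_user_view {
    const char *stringdata; /* first byte right after the fixed-size fields */
    uint8_t home_len;       /* u6 in the packed record */
    uint8_t name_len;       /* u5 in the packed record */
    int name_is_a_suffix;
};

/* home always starts at stringdata[0]. */
static const char *user_home(const struct packed_user_view *u, uint8_t *len) {
    *len = u->home_len;
    return u->stringdata;
}

/* name either overlaps the tail of home, or follows it. */
static const char *user_name(const struct packed_user_view *u, uint8_t *len) {
    *len = u->name_len;
    if (u->name_is_a_suffix)
        return u->stringdata + u->home_len - u->name_len;
    /* interpreted as: starting immediately after home's last byte */
    return u->stringdata + u->home_len;
}
```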

Shells
------

Normally there is a limited number of separate shells even in huge user
databases. A few examples: `/bin/bash`, `/usr/bin/nologin`, `/bin/zsh`, among
others. Therefore "shells" have an optimization: they can be pointed to in the
external list, or, if they are unique to the user, reside among the user's
data.

The 63 most popular shells (i.e. those referred to by at least two User
entries) are stored externally in the "Shells" area. The less popular ones are
stored with the userdata.

The Shells section consists of two sub-sections: the index and the blob. The
index is a list of structs which point to a location in the "blob" area:

```
const ShellIndex = struct {
    // ...
};
```

In the user's struct, `shell_here=true` signifies that the shell is stored with
the userdata, and its length is `shell_len_or_idx`. `shell_here=false` means it
is stored in the `Shells` section, and its index is `shell_len_or_idx`.
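
In other words, resolving a shell branches on that one bit. A tiny
illustrative sketch (`shells` stands for the external Shells index/blob,
`shell_in_record` for the shell bytes inside this user's `stringdata`; both
are assumptions for illustration):

```
#include <stdint.h>

const char *user_shell(int shell_here, uint8_t shell_len_or_idx,
                       const char *const *shells,
                       const char *shell_in_record) {
    if (shell_here)
        return shell_in_record;      /* its length is shell_len_or_idx  */
    return shells[shell_len_or_idx]; /* index into the Shells area      */
}
```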

Variable-length integers (varints)
----------------------------------

A varint is an efficiently encoded integer (packed for small values), the same
as [protocol buffer varints][varint], except the largest possible value is
`u64`. They compress integers well. Varints are stored for group memberships.
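
A sketch of a protobuf-style decoder for such a varint (little-endian
base-128; the high bit of each byte is the continuation flag), capped at
`u64`:

```
#include <stddef.h>
#include <stdint.h>

/* Decodes one varint from buf[0..len); returns the number of bytes consumed,
 * or 0 on truncated or oversized input. */
size_t varint_decode(const uint8_t *buf, size_t len, uint64_t *out) {
    uint64_t value = 0;
    for (size_t i = 0; i < len && i < 10; i++) { /* a u64 needs <= 10 bytes */
        value |= (uint64_t)(buf[i] & 0x7f) << (7 * i);
        if ((buf[i] & 0x80) == 0) {              /* last byte of the varint */
            *out = value;
            return i + 1;
        }
    }
    return 0;
}
```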

Group memberships
-----------------

There are two group memberships at play:

1. Given a username, resolve the user's group gids (for `initgroups(3)`).
2. Given a group (gid/name), resolve the members' names (e.g. `getgrgid`).

When a user's groups are resolved in (1), the additional userdata is not
requested (there is no way to return it). Therefore, it is reasonable to store

When a group's memberships are resolved in (2), the same call also requires
other group information: gid and group name. Therefore it makes sense to store
a pointer to the group members in the group information itself. However, the
memberships are not *always* necessary (see remarks about `id(1)`), therefore
the memberships will be stored separately, outside of the groups section.

`Groupmembers` and `Username2gids` store group and user memberships
respectively. Membership IDs are used in their entirety — not necessitating
random access, thus suitable for tight packing and varint encoding.

- For each group — a list of pointers (offsets) to User records, because
  `getgr*_r` returns an array of pointers to member names.
- For each user — a list of gids, because `initgroups_dyn` (and friends)
  returns an array of gids.

An entry of `Groupmembers` and `Username2gids` looks like this piece of
pseudo-code:

```
const PackedList = struct {
    Length: varint,
    Members: [Length]varint,
}
const Groupmembers = PackedList;
const Username2gids = PackedList;
```

A packed list is a list of varints.
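
Using the varint decoder sketched earlier, walking one such list could look
like this (illustrative only):

```
#include <stddef.h>
#include <stdint.h>

size_t varint_decode(const uint8_t *buf, size_t len, uint64_t *out);

/* Walks one PackedList: a Length varint followed by Length member varints.
 * Returns the number of members read, or -1 on malformed input. */
long packed_list_walk(const uint8_t *buf, size_t len,
                      void (*visit)(uint64_t member)) {
    uint64_t n;
    size_t used = varint_decode(buf, len, &n);
    if (used == 0)
        return -1;
    for (uint64_t i = 0; i < n; i++) {
        uint64_t member;
        size_t m = varint_decode(buf + used, len - used, &member);
        if (m == 0)
            return -1;
        used += m;
        visit(member);
    }
    return (long)n;
}
```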

Indices
-------

Now that we've sketched the implementation of `id(1)`, it's clearer which
operations need to be fast; in order of importance:

1. lookup gid -> group info (this is on the hot path in id) without members.
2. lookup username -> user's groups.
3. lookup uid -> user.
4. lookup groupname -> group.
5. lookup username -> user.

`idx_*` sections are of type `[]PackedIntArray(u29)` and point to the
respective `Groups` and `Users` entries (from the beginning of the respective
section). Since User and Group records are 8-byte aligned, 3 bits are saved
for every element.

These indices can use perfect hashing like [bdz from cmph][cmph]: a perfect
hash maps a list of byte strings to a sequential list of integers. Perfect
hashing algorithms require some space, and take some time to calculate
("hashing duration"). I've tested BDZ, which hashes `[][]u8` to a sequential
list of integers (not preserving order), and CHM, which preserves order. BDZ
accepts an optional argument `3 <= b <= 10`.

* The BDZ algorithm requires 900KB (b=3), 338KB (b=7) or 306KB (b=10) for 1M
  values.
* Latency to resolve 1M keys: 170ms, 180ms and 230ms, respectively.
* Packed vs non-packed latency differences are not meaningful.

CHM retains order; however, 1M keys weigh 8MB. 10k keys are ~20x larger with
CHM than with BDZ, eliminating the benefit of preserved ordering: we can just
have a separate index.
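
For reference, a minimal sketch of building and querying a BDZ perfect hash
with the cmph C API (the keys here are made up):

```
#include <cmph.h>
#include <string.h>

/* Builds a BDZ minimal perfect hash over `keys` and looks one key up.
 * The result is that key's index in [0, nkeys-1]. */
unsigned bdz_example(void) {
    char *keys[] = {"root", "daemon", "motiejus"};
    unsigned nkeys = 3;

    cmph_io_adapter_t *source = cmph_io_vector_adapter(keys, nkeys);
    cmph_config_t *config = cmph_config_new(source);
    cmph_config_set_algo(config, CMPH_BDZ);
    cmph_t *hash = cmph_new(config);
    cmph_config_destroy(config);

    unsigned idx =
        cmph_search(hash, "motiejus", (cmph_uint32)strlen("motiejus"));

    cmph_destroy(hash);
    cmph_io_vector_adapter_destroy(source);
    return idx;
}
```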

Complete file structure
-----------------------

Each section is padded to 64 bytes.

[data-oriented-design]: https://media.handmade-seattle.com/practical-data-oriented-design/
[getpwnam_r]: https://linux.die.net/man/3/getpwnam_r
[varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints
[getpwent_r]: https://www.man7.org/linux/man-pages/man3/getpwent_r.3.html
[getgrouplist]: https://www.man7.org/linux/man-pages/man3/getgrouplist.3.html
[getgrgid_r]: https://www.man7.org/linux/man-pages/man3/getgrgid_r.3.html