explain some optimizations
parent
f0d9d16cad
commit
c6bc383269
137
README.md
@ -7,17 +7,17 @@ entries (i.e. system users, groups, and group memberships). It's main goal is
|
|||||||
performance, with focus on making [`id(1)`][id] run as fast as possible.
|
performance, with focus on making [`id(1)`][id] run as fast as possible.
|
||||||
|
|
||||||
To understand more about name service switch, start with
|
To understand more about name service switch, start with
|
||||||
[`nsswitch.conf(5)`](nsswitch).
|
[`nsswitch.conf(5)`][nsswitch].
|
||||||
|
|
||||||
Design & constraints
|
Design & constraints
|
||||||
--------------------
|
--------------------
|
||||||
|
|
||||||
To be fast, the user/group database (later: DB) has to be small ([highly
|
To be fast, the user/group database (later: DB) has to be small ([highly
|
||||||
recommended background viewing](data-oriented-design)). It encodes user & group
|
recommended background viewing][data-oriented-design]). It encodes user & group
|
||||||
information in a way that minimizes the DB size, and reduces jumping across the
|
information in a way that minimizes the DB size, and reduces jumping across the
|
||||||
DB ("chasing pointers and polluting CPU cache").
|
DB ("chasing pointers and thrashing CPU cache").
|
||||||
|
|
||||||
For example, [`getpwnam_r(3)`](getpwnam_r) accepts a username and returns
|
For example, [`getpwnam_r(3)`][getpwnam_r] accepts a username and returns
|
||||||
the following user information:
|
the following user information:
|
||||||
|
|
||||||
```
|
```
|
||||||
@ -33,14 +33,14 @@ struct passwd {
|
|||||||
```
|
```
|
||||||
|
|
||||||
Turbonss, among others, implements this call, and takes the following steps to
|
Turbonss, among others, implements this call, and takes the following steps to
|
||||||
resolve this:
|
resolve a username to a `struct passwd*`:
|
||||||
|
|
||||||
- Hash the username using a perfect hash function. Perfect hash function
|
- Hash the username using a perfect hash function. Perfect hash function
|
||||||
returns a number between [0,N], where N is the total number of users.
|
returns a number `n ∈ [0,N-1]`, where N is the total number of users.
|
||||||
- Jump to a known location in the DB (by pointer arithmetic) which links the
|
- Jump to the `n`'th location in the DB (by pointer arithmetic) which contains
|
||||||
user's index to the user's information. That is an index to a different
|
the index `i` to the user's information.
|
||||||
location within the DB.
|
- Jump to the location `i` (pointer arithmetic) which stores the full user
|
||||||
- Jump to the location which stores the full user information.
|
information.
|
||||||
- Decode the user information (which is all in a continuous memory block) and
|
- Decode the user information (which is all in a continuous memory block) and
|
||||||
return it to the caller.
|
return it to the caller.
|
||||||
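The four steps above can be sketched in C. This is a minimal illustration under stated assumptions, not the actual turbonss layout: `struct toy_db`, `struct toy_user`, `toy_hash`, `toy_lookup` and the record encoding (4-byte uid, 1-byte name length, name bytes) are all hypothetical. `toy_hash` is only a stand-in that happens to be perfect for key sets whose first bytes differ modulo N; a real build would use a generated perfect hash.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: index[n] holds the byte offset i of the record
 * selected by hash value n; records is one contiguous blob. */
struct toy_db {
    uint32_t        n_users;
    const uint32_t *index;    /* n_users entries */
    const uint8_t  *records;  /* packed user records */
};

/* Hypothetical record: uid (4 bytes, host order), name length (1 byte),
 * then the name bytes. */
struct toy_user {
    uint32_t    uid;
    uint8_t     name_len;
    const char *name;  /* points into the DB, not NUL-terminated */
};

/* Placeholder for a real perfect hash: collision-free only when the first
 * bytes of all keys are distinct modulo n. */
static uint32_t toy_hash(const char *name, uint32_t n) {
    return (uint32_t)(unsigned char)name[0] % n;
}

/* Returns 1 and fills *out if name is found, 0 otherwise. */
static int toy_lookup(const struct toy_db *db, const char *name,
                      struct toy_user *out) {
    assert(db != NULL && name != NULL && out != NULL);
    if (db->n_users == 0 || name[0] == '\0')
        return 0;
    uint32_t n = toy_hash(name, db->n_users);  /* step 1: name -> n */
    uint32_t i = db->index[n];                 /* step 2: n -> offset i */
    const uint8_t *rec = db->records + i;      /* step 3: jump to the record */
    memcpy(&out->uid, rec, sizeof out->uid);   /* step 4: decode the fields */
    out->name_len = rec[4];
    out->name = (const char *)(rec + 5);
    /* Keys outside the original set land in an arbitrary slot: verify. */
    size_t len = strlen(name);
    return len == out->name_len && memcmp(name, out->name, len) == 0;
}
```

Because a perfect hash maps keys outside the original set to some arbitrary slot, the sketch compares the stored name before reporting a hit.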
|
|
||||||
@ -48,7 +48,12 @@ In total, that's one hash for the username (~150ns), two pointer jumps within
|
|||||||
the group file, and, now that the user record is found, `memcpy` for each
|
the group file, and, now that the user record is found, `memcpy` for each
|
||||||
field.
|
field.
|
||||||
|
|
||||||
This tight packing places some constraints on the underlying data:
|
The turbonss DB file is `mmap`-ed, making it simple to implement pointer
|
||||||
|
arithmetic and jumping across the file. This also reduces memory usage,
|
||||||
|
especially across multiple concurrent invocations of the `id` command. The
|
||||||
|
consumed heap space for each separate turbonss instance will be minimal.
|
||||||
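A read-only mapping of this kind can be set up with plain POSIX calls. The sketch below is not taken from the turbonss sources, and the helper name `map_db` is hypothetical. It only shows that after `mmap(2)` the file contents are reachable through an ordinary pointer, with no copy onto the heap.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Maps the file at path read-only. On success returns the mapping and
 * stores its size in *len; returns NULL on any failure (including an
 * empty file, which cannot be mapped). */
static const unsigned char *map_db(const char *path, size_t *len) {
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED)
        return NULL;

    *len = (size_t)st.st_size;
    return p;
}
```

The caller would release the mapping with `munmap(2)` when done. Pages of a read-only file mapping are backed by the page cache, so concurrent processes mapping the same file share them.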
|
|
||||||
|
Tight packing places some constraints on the underlying data:
|
||||||
|
|
||||||
- Maximum database size: 4GB.
|
- Maximum database size: 4GB.
|
||||||
- Maximum length of username and groupname: 32 bytes.
|
- Maximum length of username and groupname: 32 bytes.
|
||||||
@ -83,54 +88,53 @@ remarks on `id(1)`
|
|||||||
------------------
|
------------------
|
||||||
|
|
||||||
A known implementation runs id(1) at ~250 rps sequentially on ~20k users and
|
A known implementation runs id(1) at ~250 rps sequentially on ~20k users and
|
||||||
~10k groups. Our target is 10k id/s.
|
~10k groups. Our target is 10k id/s for the same payload.
|
||||||
|
|
||||||
`id(1)` works as follows:
|
To better reason about the trade-offs, it is useful to understand how `id(1)`
|
||||||
|
is implemented, in rough terms:
|
||||||
- lookup user by name.
|
- lookup user by name.
|
||||||
- get all additional gids (an array attached to a member).
|
- get all additional gids (an array attached to a member).
|
||||||
- for each additional gid, get the group name.
|
- for each additional gid, get the group information (`struct group*`).
|
||||||
|
|
||||||
Assuming a member is in ~100 groups on average, that's 1M group lookups per
|
Assuming a member is in ~100 groups on average, that's 1M group lookups per
|
||||||
second. We need to convert gid to a group index, and group index to a group
|
second. We need to convert gid to a group index, and group index to a group
|
||||||
gid/name quickly.
|
gid/name quickly.
|
||||||
|
|
||||||
Caveat: `struct group` contains an array of pointers to names of group members
|
Caveat: `struct group` contains an array of pointers to names of group members
|
||||||
(`char **gr_mem`). However, `id` does not use that information, resulting in a
|
(`char **gr_mem`). However, `id` does not use that information, resulting in
|
||||||
significant read amplification. Therefore, if `argv[0] == "id"`, `getgrgid(3)`
|
read amplification. Therefore, if `argv[0] == "id"`, our implementation of
|
||||||
will return group without the members. This speeds up `id` by about 10x on a
|
`getgrgid(3)` returns the `struct group*` without the members. This speeds up
|
||||||
known NSS implementation.
|
`id` by about 10x on a known NSS implementation.
|
||||||
|
|
||||||
Because `getgrgid(3)` does not use the group members' information, the group
|
Relatedly, because `getgrgid(3)` does not need the group members, the group
|
||||||
members are stored in a different location, making the `Groups` section
|
members are stored in a different DB section, making the `Groups` section
|
||||||
smaller, thus more CPU-cache-friendly.
|
smaller, thus more CPU-cache-friendly in the hot path.
|
||||||
|
|
||||||
Indices
|
Indices
|
||||||
-------
|
-------
|
||||||
|
|
||||||
The following operations need to be fast, in order of importance:
|
Now that we've sketched the implementation of `id(1)`, it's easier to
|
||||||
|
understand which operations need to be fast; in order of importance:
|
||||||
|
|
||||||
1. lookup gid -> group (this is on hot path in id) with or without members (2
|
1. lookup gid -> group info (this is on hot path in id) without members.
|
||||||
separate calls).
|
|
||||||
2. lookup uid -> user.
|
2. lookup uid -> user.
|
||||||
3. lookup groupname -> group.
|
3. lookup groupname -> group.
|
||||||
4. lookup username -> user.
|
4. lookup username -> user.
|
||||||
5. (optional) iterate users using a defined order (`getent passwd`).
|
|
||||||
6. (optional) iterate groups using a defined order (`getent group`).
|
|
||||||
|
|
||||||
First 4 can use perfect hashing like [cmph][cmph]: it hashes a list of bytes to
|
These indices can use perfect hashing like [cmph][cmph]: a perfect hash hashes
|
||||||
a sequential list of integers. Perfect hashing algorithms require some space,
|
a list of bytes to a sequential list of integers. Perfect hashing algorithms
|
||||||
and take some time to calculate ("hashing duration"). I've tested BDZ, which
|
require some space, and take some time to calculate ("hashing duration"). I've
|
||||||
hashes [][]u8 to a sequential list of integers (not preserving order) and CHM, which
|
tested BDZ, which hashes [][]u8 to a sequential list of integers (not
|
||||||
does the same, but preserves order. BDZ accepts an argument 3 <= b <= 10.
|
preserving order) and CHM, which preserves order. BDZ accepts an optional argument `3
|
||||||
|
<= b <= 10`.
|
||||||
|
|
||||||
BDZ: tried b=3, b=7 (default), and b=10.
|
* BDZ algorithm requires (b=3: 900KB, b=7: 338KB, b=10: 306KB) for 1M values.
|
||||||
|
* Latency to resolve 1M keys: (170ms, 180ms, 230ms, respectively).
|
||||||
* BDZ algorithm requires (900KB, 338KB, 306KB, respectively) for 1M values.
|
|
||||||
* Latency to resolve 1M keys: (170ms, 180ms, 230ms).
|
|
||||||
* Packed vs non-packed latency differences are not meaningful.
|
* Packed vs non-packed latency differences are not meaningful.
|
||||||
|
|
||||||
CHM retains order, however, 1M keys weigh 8MB. 10k keys are ~20x larger with
|
CHM retains order, however, 1M keys weigh 8MB. 10k keys are ~20x larger with
|
||||||
CHM than with BDZ, eliminating the benefit of preserved ordering.
|
CHM than with BDZ, eliminating the benefit of preserved ordering: we can just
|
||||||
|
have a separate index.
|
||||||
|
|
||||||
Turbonss header
|
Turbonss header
|
||||||
---------------
|
---------------
|
||||||
@ -168,6 +172,11 @@ and the header block fits to 64 bytes anyway, we are keeping them as u32 now.
|
|||||||
Primitive types
|
Primitive types
|
||||||
---------------
|
---------------
|
||||||
|
|
||||||
|
`User` and `Group` entries are sorted by name, ordered by their unicode
|
||||||
|
codepoints. They are byte-aligned (8bits). All `User` and `Group` entries are
|
||||||
|
referred to by their byte offset in the `Users` and `Groups` section relative to
|
||||||
|
the beginning of the section.
|
||||||
|
|
||||||
```
|
```
|
||||||
const Group = struct {
|
const Group = struct {
|
||||||
gid: u32,
|
gid: u32,
|
||||||
@ -187,9 +196,9 @@ const User = struct {
|
|||||||
// shell is a different story, documented elsewhere.
|
// shell is a different story, documented elsewhere.
|
||||||
shell_here: u1,
|
shell_here: u1,
|
||||||
shell_len_or_place: u6,
|
shell_len_or_place: u6,
|
||||||
home_len: u6,
|
homedir_len: u6,
|
||||||
username_pos: u1,
|
username_is_a_suffix: u1,
|
||||||
username_len: u5,
|
username_offset_or_len: u5,
|
||||||
gecos_len: u8,
|
gecos_len: u8,
|
||||||
// a variable-sized array that will be stored immediately after this
|
// a variable-sized array that will be stored immediately after this
|
||||||
// struct.
|
// struct.
|
||||||
@ -197,8 +206,28 @@ const User = struct {
|
|||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
`User` and `Group` entries are sorted by name, ordered by their unicode
|
`stringdata` contains a few string entries:
|
||||||
codepoints.
|
- homedir.
|
||||||
|
- username.
|
||||||
|
- gecos.
|
||||||
|
- shell (optional).
|
||||||
|
|
||||||
|
First byte of the homedir is stored right after the `gecos_len` field, and its
|
||||||
|
length is `homedir_len`. The same logic applies to all the `stringdata` fields:
|
||||||
|
there is a way to calculate their relative position from the length of the
|
||||||
|
fields before them.
|
||||||
|
|
||||||
|
Additionally, two optimizations for special fields are made:
|
||||||
|
- shells are often shared across different users, see the "Shells" section.
|
||||||
|
- username is frequently a suffix of the homedir. For example, `/home/motiejus`
|
||||||
|
and `motiejus`. In which case storing both username and homedir strings is
|
||||||
|
wasteful. For such cases, username has two options:
|
||||||
|
1. `username_is_a_suffix=true`: username is a suffix of the home dir. In that
|
||||||
|
case, the username starts at the `username_offset_or_len`'th byte of the
|
||||||
|
homedir, and ends at the same place as the homedir.
|
||||||
|
2. `username_is_a_suffix=false`: username is stored separately. In that case,
|
||||||
|
it begins one byte after homedir, and its length is
|
||||||
|
`username_offset_or_len`.
|
||||||
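The two username cases can be sketched as a small C helper. The names `struct user_strings`, `struct str_slice` and `get_username` are hypothetical and the bit-fields are widened to bytes for readability. The sketch also assumes that "one byte after homedir" means the byte immediately following the last homedir byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of the fields needed to locate the username. */
struct user_strings {
    const char *stringdata;          /* homedir starts at stringdata[0] */
    uint8_t homedir_len;             /* u6 in the packed struct */
    uint8_t username_is_a_suffix;    /* u1 */
    uint8_t username_offset_or_len;  /* u5 */
};

/* A (pointer, length) pair; the bytes are not NUL-terminated. */
struct str_slice {
    const char *ptr;
    size_t len;
};

static struct str_slice get_username(const struct user_strings *u) {
    struct str_slice s;
    if (u->username_is_a_suffix) {
        /* Case 1: the username shares its bytes with the homedir tail. */
        assert(u->username_offset_or_len <= u->homedir_len);
        s.ptr = u->stringdata + u->username_offset_or_len;
        s.len = (size_t)u->homedir_len - u->username_offset_or_len;
    } else {
        /* Case 2 (assumption): the username is stored right after the
         * homedir and the field holds its length. */
        s.ptr = u->stringdata + u->homedir_len;
        s.len = u->username_offset_or_len;
    }
    return s;
}
```

For `/home/motiejus` (14 bytes) with `username_is_a_suffix=true` and `username_offset_or_len=6`, the helper yields the 8-byte slice `motiejus` without storing the name a second time.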
|
|
||||||
Shells
|
Shells
|
||||||
------
|
------
|
||||||
@ -221,14 +250,21 @@ to it's index in the external storage.
|
|||||||
Shells in the external storage are sorted by their weight, which is
|
Shells in the external storage are sorted by their weight, which is
|
||||||
`length*frequency`.
|
`length*frequency`.
|
||||||
|
|
||||||
|
Variable-length integers (varints)
|
||||||
|
----------------------------------
|
||||||
|
|
||||||
|
Varint is an efficiently encoded integer (packed for small values). Same as
|
||||||
|
[protocol buffer varints][varint], except the largest possible value is `u64`.
|
||||||
|
They compress integers well.
|
||||||
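For reference, the protocol-buffer encoding stores 7 bits of the value per byte, least-significant group first, and sets the top bit of every byte except the last. A minimal C sketch follows; the function names `varint_encode` and `varint_decode` are illustrative and not taken from the turbonss sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Writes value to out (which must have room for up to 10 bytes) and
 * returns the number of bytes written. */
static size_t varint_encode(uint64_t value, uint8_t *out) {
    assert(out != NULL);
    size_t n = 0;
    while (value >= 0x80) {
        out[n++] = (uint8_t)(value | 0x80);  /* low 7 bits + continuation bit */
        value >>= 7;
    }
    out[n++] = (uint8_t)value;               /* last byte: top bit clear */
    return n;
}

/* Reads one varint from in (at most len bytes). Returns the number of
 * bytes consumed, or 0 if the input is truncated or longer than 10 bytes. */
static size_t varint_decode(const uint8_t *in, size_t len, uint64_t *value) {
    assert(in != NULL && value != NULL);
    uint64_t result = 0;
    for (size_t i = 0; i < len && i < 10; i++) {
        result |= (uint64_t)(in[i] & 0x7f) << (7 * i);
        if ((in[i] & 0x80) == 0) {
            *value = result;
            return i + 1;
        }
    }
    return 0;
}
```

Values below 128 take one byte and 300 takes two (`0xAC 0x02`); a full `u64` needs at most ten.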
|
|
||||||
`groupmembers`, `additional_gids`
|
`groupmembers`, `additional_gids`
|
||||||
---------------------------------
|
---------------------------------
|
||||||
|
|
||||||
`groupmembers` and `additional_gids` store group and user memberships
|
`groupmembers` and `additional_gids` store group and user memberships
|
||||||
respectively: for each group, a list of pointers ("offsets") to User records,
|
respectively: for each group, a list of pointers (offsets) to User records, and
|
||||||
and for each user — a list of pointers to Group records. These fields are
|
for each user — a list of pointers to Group records. These fields are always
|
||||||
always used in their entirety — making random-access not required, thus
|
used in their entirety — not necessitating random access, thus suitable for
|
||||||
suitable for tight packing.
|
tight packing.
|
||||||
|
|
||||||
An entry of `groupmembers` and `additional_gids` looks like this piece of
|
An entry of `groupmembers` and `additional_gids` looks like this piece of
|
||||||
pseudo-code:
|
pseudo-code:
|
||||||
@ -242,10 +278,12 @@ const Groupmembers = PackedList;
|
|||||||
const AdditionalGids = PackedList;
|
const AdditionalGids = PackedList;
|
||||||
```
|
```
|
||||||
|
|
||||||
The single entry in `members` field points to an offset into a `User` or
|
An entry in the `members` field points to the offset into a respective `User` or
|
||||||
`Group` entry (number of bytes relative to the first entry of the respective
|
`Group` entry (number of bytes relative to the first entry of the type).
|
||||||
type). The `members` field in `PackedList` is sorted by the name (`username` or
|
`members` in `PackedList` is sorted by the name (`username` or `groupname`) of
|
||||||
`groupname`) of the record it is pointing to.
|
the record it is pointing to.
|
||||||
|
|
||||||
|
A packed list is a list of varints.
|
||||||
|
|
||||||
Complete file structure
|
Complete file structure
|
||||||
-----------------------
|
-----------------------
|
||||||
@ -270,3 +308,4 @@ Complete file structure
|
|||||||
[nsswitch]: https://linux.die.net/man/5/nsswitch.conf
|
[nsswitch]: https://linux.die.net/man/5/nsswitch.conf
|
||||||
[data-oriented-design]: https://media.handmade-seattle.com/practical-data-oriented-design/
|
[data-oriented-design]: https://media.handmade-seattle.com/practical-data-oriented-design/
|
||||||
[getpwnam_r]: https://linux.die.net/man/3/getpwnam_r
|
[getpwnam_r]: https://linux.die.net/man/3/getpwnam_r
|
||||||
|
[varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints
|
||||||
|