Turbo NSS
---------
Turbonss is a plugin for the GNU Name Service Switch (NSS) functionality of the
GNU C Library (glibc). Turbonss implements lookup for `passwd` and `group`
database entries (i.e. system users, groups, and group memberships). Its main
goal is performance, with a focus on making [`id(1)`][id] run as fast as
possible.

Turbonss is optimized for reading. If the data changes in any way, the whole
file needs to be regenerated (the tooling supports only full generation). It
was created for, and is best suited to, environments that have a central user &
group database which then needs to be distributed to many servers/services,
and where the data does not change very often (e.g. hourly).

To understand more about name service switch, start with
[`nsswitch.conf(5)`][nsswitch].

Design & constraints
--------------------

To be fast, the user/group database (later: DB) has to be small
([background][data-oriented-design]). It encodes user & group information in a
way that minimizes the DB size, and reduces jumping across the DB ("chasing
pointers and thrashing CPU cache").

To understand how this is done efficiently, let's analyze
[`getpwnam_r(3)`][getpwnam_r] at a high level. This API call accepts a username
and returns the following user information:
```
struct passwd {
    char   *pw_name;       /* username */
    char   *pw_passwd;     /* user password */
    uid_t   pw_uid;        /* user ID */
    gid_t   pw_gid;        /* group ID */
    char   *pw_gecos;      /* user information */
    char   *pw_dir;        /* home directory */
    char   *pw_shell;      /* shell program */
};
```
Turbonss implements this call, among others, and takes the following steps to
resolve a username to a `struct passwd*`:

- Open the DB (using `mmap`) and interpret its first 64 bytes as a
  `*struct Header`. The header stores offsets to the sections of the file. This
  needs to be done once, when the NSS library is loaded.
- Hash the username using a perfect hash function. A perfect hash function
  returns a number `n ∈ [0,N-1]`, where `N` is the total number of users.
- Jump to the `n`'th location in the `idx_name2user` section, which contains
  the index `i` to the user's information.
- Jump to the location `i` of section `Users`, which stores the full user
  information.
- Decode the user information (which is all in a contiguous memory block) and
  return it to the caller.

In total, that's one hash for the username (~150ns), two pointer jumps within
the DB file (to sections `idx_name2user` and `Users`), and, now that the
user record is found, a `memcpy` for each field.

The turbonss DB file is `mmap`-ed, making it simple to jump across the file
using pointer arithmetic. This also reduces memory usage, as the mmap'ed
regions are shared. Turbonss reads do not consume any heap space.

Tight packing places some constraints on the underlying data:

- Permitted length of username and groupname: 1-32 bytes.
- Permitted length of shell and home: 1-64 bytes.
- Permitted comment ("gecos") length: 0-255 bytes.
- Username, groupname, gecos and shell must be UTF-8 encoded.

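The length limits above can be checked before a record is packed. The following is a minimal sketch in C, not code from turbonss: the names `len_between` and `user_fits` are hypothetical, and UTF-8 validity is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true when len is within [min, max]. */
bool len_between(size_t len, size_t min, size_t max) {
    return len >= min && len <= max;
}

/* Checks the documented byte-length limits for one user record.
 * UTF-8 validity is not checked in this sketch. */
bool user_fits(const char *name, const char *home,
               const char *shell, const char *gecos) {
    assert(name != NULL && home != NULL && shell != NULL && gecos != NULL);
    return len_between(strlen(name), 1, 32)     /* username: 1-32 bytes */
        && len_between(strlen(home), 1, 64)     /* home: 1-64 bytes */
        && len_between(strlen(shell), 1, 64)    /* shell: 1-64 bytes */
        && len_between(strlen(gecos), 0, 255);  /* gecos: 0-255 bytes */
}
```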

Sorting is stable. In v0:

- Groups are sorted by gid, ascending.
- Users are sorted by their name, ascending by the Unicode codepoints
  (locale-independent).

Checking out and building
-------------------------

```
$ git clone --recursive https://git.sr.ht/~motiejus/turbonss
```
Alternatively, if you forgot `--recursive`:
```
$ git submodule update --init
```
And run tests:
```
$ zig build test
```
Other commands will be documented as they are implemented.

This project uses [git subtrac][git-subtrac] for managing dependencies. They
work just like regular submodules, except all the refs of the submodules are in
this repository. Repeat after me: all the submodules are in this repository.
So if you have a copy of this repo, dependencies will not disappear.

Remarks on `id(1)`
------------------

A known implementation runs `id(1)` at ~250 rps sequentially on ~20k users and
~10k groups. Our rps target is much higher.

To better reason about the trade-offs, it is useful to understand how `id(1)`
is implemented, in rough terms:

- look up the user by name ([`getpwnam_r(3)`][getpwnam_r]).
- get all gids for the user ([`getgrouplist(3)`][getgrouplist]). Note: it
  actually uses `initgroups_dyn`, which accepts a uid and is very poorly
  documented.
- for each additional gid, get the `struct group*`
  ([`getgrgid_r(3)`][getgrgid_r]).

Assuming a user is in ~100 groups on average, reaching 10k id/s translates to
1M group lookups per second. We need to convert a gid to a group index, and a
group index to a group gid/name, quickly.

Caveat: `struct group` contains an array of pointers to names of group members
(`char **gr_mem`). However, `id` does not use that information, resulting in
read amplification, sometimes by 10-100x. Therefore, if `argv[0] == "id"`, our
implementation of [`getgrgid_r(3)`][getgrgid_r] returns the `struct group*`
without the members. This speeds up `id` by about 10x on a known NSS
implementation.

Relatedly, because [`getgrgid_r(3)`][getgrgid_r] does not need the group
members, the group members are stored in a different DB section, reducing the
size of the `Groups` section and making more of it fit the CPU caches.

Turbonss header
---------------

The turbonss header looks like this:
```
OFFSET     TYPE  NAME                   DESCRIPTION
     0    [4]u8  magic                  f0 9f a4 b7
     4       u8  version                0
     5       u8  bigendian              0 for little-endian, 1 for big-endian
     6       u8  nblocks_shell_blob     max value: 63
     7       u8  num_shells             max value: 63
     8      u32  num_groups             number of group entries
    12      u32  num_users              number of passwd entries
    16      u32  nblocks_bdz_gid        bdz_gid section block count
    20      u32  nblocks_bdz_groupname
    24      u32  nblocks_bdz_uid
    28      u32  nblocks_bdz_username
    32      u64  nblocks_groups
    40      u64  nblocks_users
    48      u64  nblocks_groupmembers
    56      u64  nblocks_usergids
```
`magic` is 0xf09fa4b7, and `version` must be `0`. All integers are
native-endian. `nblocks_*` is the count of blocks of a particular section; this
helps calculate the offsets to all sections.

Some numbers, like `nblocks_shell_blob` and `num_shells`, would fit in fewer
bits. However, interpreting `[2]u6` with `xxd(1)` is harder than interpreting
`[2]u8`. Therefore we are using the space we have to make these integers
byte-wide.
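To make the layout concrete, here is a hedged C mirror of the header table. The struct and function names (`turbonss_header`, `header_is_valid`) are illustrative assumptions, not the project's own definitions; the field order and widths are taken from the table, and the static assertions confirm that natural alignment yields exactly 64 bytes with no padding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative C mirror of the header table; fields follow the documented offsets. */
struct turbonss_header {
    uint8_t  magic[4];
    uint8_t  version;
    uint8_t  bigendian;
    uint8_t  nblocks_shell_blob;
    uint8_t  num_shells;
    uint32_t num_groups;
    uint32_t num_users;
    uint32_t nblocks_bdz_gid;
    uint32_t nblocks_bdz_groupname;
    uint32_t nblocks_bdz_uid;
    uint32_t nblocks_bdz_username;
    uint64_t nblocks_groups;
    uint64_t nblocks_users;
    uint64_t nblocks_groupmembers;
    uint64_t nblocks_usergids;
};

_Static_assert(sizeof(struct turbonss_header) == 64, "header must be 64 bytes");
_Static_assert(offsetof(struct turbonss_header, num_groups) == 8, "num_groups at 8");
_Static_assert(offsetof(struct turbonss_header, nblocks_groups) == 32, "nblocks_groups at 32");

/* Checks only the constraints stated in the table: magic, version and maxima. */
bool header_is_valid(const struct turbonss_header *h) {
    static const uint8_t magic[4] = {0xf0, 0x9f, 0xa4, 0xb7};
    assert(h != NULL);
    return memcmp(h->magic, magic, sizeof magic) == 0
        && h->version == 0
        && h->bigendian <= 1
        && h->nblocks_shell_blob <= 63
        && h->num_shells <= 63;
}
```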

Primitive types
---------------

`User` and `Group` entries are sorted by the order they were received in the
input file. All entries are aligned to 8 bytes. All `User` and `Group` entries
are referred to by their byte offset in the `Users` and `Groups` section,
relative to the beginning of the section.
```
const PackedGroup = packed struct {
    gid: u32,
    groupname_len: u8, // max is 32, but have too much space here.
    // pseudocode: varint members_offset + (groupname_len-1)-length string
    groupdata: []u8,
};

pub const PackedUser = packed struct {
    uid: u32,
    gid: u32,
    padding: u2 = 0,
    shell_len_or_idx: u6,
    shell_here: bool,
    name_is_a_suffix: bool,
    home_len: u6,
    name_len: u5,
    gecos_len: u11,
    // pseudocode: variable-sized array that will be stored immediately after
    // this struct.
    userdata: []u8,
};
```
`userdata` contains a few entries:

- home.
- name (optional).
- gecos.
- shell (optional).
- `additional_gids_offset`: varint.

The first byte of home is stored right after the `gecos_len` field, and its
length is `home_len`. The same logic applies to all the `userdata` fields:
their relative position can be calculated from the lengths of the fields
before them.

Additionally, there are two "easy" optimizations:

- shells are often shared across different users, see the "Shells" section.
- `name` is frequently a suffix of `home`. For example, `/home/motiejus` and
  `motiejus`. In this case storing both name and home is wasteful. Therefore
  name has two options:
  1. `name_is_a_suffix=true`: name is a suffix of the home dir. Then `name`
     starts at the `home_len - name_len`'th byte of `home`, and ends at the
     same place as `home`.
  2. `name_is_a_suffix=false`: name begins one byte after home, and its length
     is `name_len`.

The last field, `additional_gids_offset`, which is needed least frequently,
is stored at the end.
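The two `name` placements reduce to a small offset calculation. This is a sketch under assumptions: `name_offset` is a hypothetical helper, both lengths are taken to be already-decoded byte counts (the on-disk bit fields may store them differently), and "one byte after home" is read as "immediately after the last byte of home".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns the offset of `name` within userdata, which starts with `home`.
 * Assumes home_len and name_len are plain byte lengths. */
size_t name_offset(size_t home_len, size_t name_len, bool name_is_a_suffix) {
    if (name_is_a_suffix) {
        assert(name_len <= home_len);
        return home_len - name_len;  /* name shares the tail of home */
    }
    return home_len;                 /* name is stored right after home */
}
```

For `/home/motiejus` (14 bytes) and `motiejus` (8 bytes) with `name_is_a_suffix=true`, the name starts at offset 6 and no extra bytes are stored for it.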

Shells
------

Normally there is a limited number of distinct shells even in huge user
databases. A few examples: `/bin/bash`, `/usr/bin/nologin`, `/bin/zsh` among
others. Therefore, "shells" have an optimization: they can be referenced from
an external list, or, if they are unique to the user, reside among the user's
data.

The 63 most popular shells (i.e. those referred to by at least two User
entries) are stored externally in the "Shells" area. The less popular ones are
stored with userdata.

The Shells section consists of two sub-sections: the index and the blob. The
index is a list of structs which point to a location in the "blob" area:
```
const ShellIndex = struct {
    offset: u10,
    len: u6,
};
```
In the user's struct, `shell_here=true` signifies that the shell is stored with
userdata, and its length is `shell_len_or_idx`. `shell_here=false` means it is
stored in the `Shells` section, and its index is `shell_len_or_idx`.
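The dual meaning of `shell_len_or_idx` can be sketched as a small dispatch. Everything here is illustrative: `shell_entry` is an already-decoded, in-memory view of the shell table (not the on-disk `ShellIndex`), and `resolve_shell` assumes `shell_len_or_idx` has been converted to a plain length or index beforehand.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One entry of a decoded (in-memory) shell table; not the on-disk layout. */
struct shell_entry {
    const char *data;
    size_t len;
};

/* Picks the shell bytes for a user: inline in userdata, or from the table. */
struct shell_entry resolve_shell(bool shell_here, unsigned shell_len_or_idx,
                                 const char *inline_shell,
                                 const struct shell_entry *table,
                                 size_t table_len) {
    if (shell_here) {
        /* The field is a length; the bytes live with the user's data. */
        struct shell_entry e = { inline_shell, shell_len_or_idx };
        return e;
    }
    /* The field is an index into the shared shell table. */
    assert(table != NULL && shell_len_or_idx < table_len);
    return table[shell_len_or_idx];
}
```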

Variable-length integers (varints)
----------------------------------

A varint is an efficiently encoded integer (packed for small values). It is the
same as [protocol buffer varints][varint], except the largest possible value is
`u64`. They compress integers well. Varints are used to store group
memberships.
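For readers unfamiliar with the encoding, here is a minimal C sketch of protocol-buffer-style varints: 7 payload bits per byte, least significant group first, with the high bit marking continuation. The function names `varint_put` and `varint_get` are made up for this example and are not the turbonss API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encodes v into out; returns the number of bytes written (1..10). */
size_t varint_put(uint8_t *out, uint64_t v) {
    size_t n = 0;
    assert(out != NULL);
    while (v >= 0x80) {
        out[n++] = (uint8_t)(v | 0x80);  /* low 7 bits plus continuation bit */
        v >>= 7;
    }
    out[n++] = (uint8_t)v;               /* last byte: continuation bit clear */
    return n;
}

/* Decodes a varint from buf[0..len); returns bytes consumed, or 0 when the
 * input is truncated or longer than 10 bytes. */
size_t varint_get(const uint8_t *buf, size_t len, uint64_t *value) {
    uint64_t v = 0;
    assert(buf != NULL && value != NULL);
    for (size_t i = 0; i < len && i < 10; i++) {
        v |= (uint64_t)(buf[i] & 0x7f) << (7 * i);
        if ((buf[i] & 0x80) == 0) {
            *value = v;
            return i + 1;
        }
    }
    return 0;
}
```

Values below 128 take a single byte, which is why small offsets and short lists pack tightly; 300 encodes as the two bytes `ac 02`.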

Group memberships
-----------------

There are two group memberships at play:

1. Given a group (gid/name), resolve the members' names (e.g. `getgrgid`).
2. Given a username, resolve the user's group gids (for `initgroups(3)`).

When a group's memberships are resolved in (1), the same call also requires
other group information: gid and group name. Therefore it makes sense to store
a pointer to the group members in the group information itself. However, the
memberships are not *always* necessary (see remarks about `id(1)`), therefore
the memberships will be stored separately, outside of the groups section.

Similarly, when a user's groups are resolved in (2), they are not always
necessary (i.e. not part of `struct passwd*`), therefore the memberships
themselves are stored out of band.

`Groupmembers` and `UserGids` store group and user memberships respectively.
Membership IDs are always read in their entirety and need no random access,
which makes them suitable for tight packing and varint encoding.

- For each group: a list of pointers (offsets) to User records, because
  `getgr*_r` returns pointers to member names.
- For each user: a list of gids, because `initgroups_dyn` (and friends)
  returns an array of gids.

An entry of `Groupmembers` and `UserGids` looks like this piece of
pseudo-code:
```
const PackedList = struct {
    Length: varint,
    Members: [Length]varint,
};
const Groupmembers = PackedList;
const UserGids = PackedList;
```
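Reading such a list is a single sequential pass: one varint for the length, then that many varints for the members. The sketch below is an assumption-laden illustration in C, with hypothetical names `read_varint` and `packed_list_decode`; it stores the member values exactly as read and does not apply any delta decoding that the real sections may use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal protobuf-style varint reader; returns bytes consumed, 0 on bad input. */
size_t read_varint(const uint8_t *buf, size_t len, uint64_t *value) {
    uint64_t v = 0;
    assert(buf != NULL && value != NULL);
    for (size_t i = 0; i < len && i < 10; i++) {
        v |= (uint64_t)(buf[i] & 0x7f) << (7 * i);
        if ((buf[i] & 0x80) == 0) {
            *value = v;
            return i + 1;
        }
    }
    return 0;
}

/* Decodes Length followed by Length members into out[0..cap).
 * Returns the member count, or -1 if the input is truncated or out is too small. */
long packed_list_decode(const uint8_t *buf, size_t len,
                        uint64_t *out, size_t cap) {
    uint64_t count = 0;
    size_t pos = read_varint(buf, len, &count);
    if (pos == 0 || count > cap)
        return -1;
    for (uint64_t i = 0; i < count; i++) {
        size_t n = read_varint(buf + pos, len - pos, &out[i]);
        if (n == 0)
            return -1;  /* ran out of bytes mid-list */
        pos += n;
    }
    return (long)count;
}
```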
Indices
-------

Now that we've sketched the implementation of `id(1)`, it is easier to see
which operations need to be fast; in order of importance:

1. lookup gid -> group info (this is on hot path in id) without members.
2. lookup username -> user's groups.
3. lookup uid -> user.
4. lookup groupname -> group.
5. lookup username -> user.

These indices can use perfect hashing like [bdz from cmph][cmph]: a perfect
hash hashes a list of bytes to a sequential list of integers. Perfect hashing
algorithms require some space, and take some time to calculate ("hashing
duration"). I've tested BDZ, which hashes `[][]u8` to a sequential list of
integers (not preserving order), and CHM, which preserves order. BDZ accepts an
optional argument `3 <= b <= 10`.

* The BDZ algorithm requires 900KB (b=3), 338KB (b=7) or 306KB (b=10) for 1M
  values.
* Latency to resolve 1M keys: 170ms, 180ms and 230ms, respectively.
* Packed vs non-packed latency differences are not meaningful.

CHM retains order; however, 1M keys weigh 8MB. 10k keys are ~20x larger with
CHM than with BDZ, eliminating the benefit of preserved ordering: we can just
have a separate index.

None of the tested perfect hashing algorithms makes the distinction between
existing (in the initial dictionary) and new keys. In other words, `HASH(value)`
will point to a number `n ∈ [0,N-1]`, regardless of whether the value was in
the initial dictionary. Therefore one must always confirm, after calculating
the hash, that the key stored at that position matches the key that was hashed.
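The need for that confirmation step can be shown with a toy example. `toy_hash` below is a stand-in chosen so that it happens to be a perfect hash for three sample names; it is not BDZ and not part of cmph. Like a real perfect hash, it maps unknown keys into the same range, so `lookup` compares the stored key before accepting a slot.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define N_USERS 3

/* Sample keys laid out by hash slot; toy_hash() is perfect for exactly these. */
static const char *const users_by_slot[N_USERS] = {"carol", "alice", "bob"};

/* Toy stand-in for a perfect hash: maps ANY string to [0, N_USERS-1]. */
size_t toy_hash(const char *key) {
    return (size_t)(unsigned char)key[0] % N_USERS;
}

/* Returns the slot of key, or -1 when key is not in the dictionary. */
int lookup(const char *key) {
    assert(key != NULL);
    size_t n = toy_hash(key);
    /* The hash alone cannot reject unknown keys, so compare the stored key. */
    return strcmp(users_by_slot[n], key) == 0 ? (int)n : -1;
}
```

Here `"dave"` hashes to the same slot as `"alice"`; without the `strcmp` the lookup would silently return the wrong user.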

`idx_*` sections are of type `[]PackedIntArray(u29)` and point to the
respective `Groups` and `Users` entries (from the beginning of the respective
section). Since User and Group records are 8-byte aligned, the low three bits
of every offset are zero, so a `u29` is enough.
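The arithmetic behind the 29-bit index can be sketched as follows, assuming the stored value is the byte offset divided by 8 (an interpretation of the text, not confirmed against the implementation). With that assumption, 2^29 index values times 8 bytes cover sections of up to 4 GiB. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define RECORD_ALIGN 8u

/* Rounds a byte offset up to the next multiple of 8. */
uint64_t align8(uint64_t offset) {
    return (offset + (RECORD_ALIGN - 1)) & ~(uint64_t)(RECORD_ALIGN - 1);
}

/* Packs an 8-byte-aligned offset into 29 bits by dropping the zero low bits. */
uint32_t offset_to_idx(uint64_t offset) {
    assert(offset % RECORD_ALIGN == 0);
    assert(offset < ((uint64_t)1 << 32));  /* 2^29 slots * 8 bytes */
    return (uint32_t)(offset >> 3);
}

/* Restores the byte offset from a 29-bit index. */
uint64_t idx_to_offset(uint32_t idx) {
    assert(idx < (1u << 29));
    return (uint64_t)idx << 3;
}
```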

Database file structure
-----------------------

Each section is padded to 64 bytes.
```
SECTION              SIZE           DESCRIPTION
header               64             see "Turbonss header" section
bdz_gid              ?              bdz(gid)
bdz_groupname        ?              bdz(groupname)
bdz_uid              ?              bdz(uid)
bdz_username         ?              bdz(username)
idx_gid2group        len(group)*4   bdz->offset Groups
idx_groupname2group  len(group)*4   bdz->offset Groups
idx_uid2user         len(user)*4    bdz->offset Users
idx_name2user        len(user)*4    bdz->offset Users
shell_index          len(shells)*2  shell index array
shell_blob           <= 4032        shell data blob (max 63*64 bytes)
groups               ?              packed Group entries (8b padding)
users                ?              packed User entries (8b padding)
groupmembers         ?              per-group delta varint memberlist (no padding)
user_gids            ?              per-user delta varint gidlist (no padding)
```
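Because every section is padded to 64 bytes, the start of each section follows from the sizes of the sections before it. The C sketch below illustrates that calculation; `nblocks` and `section_offsets` are hypothetical helpers, and the sizes passed in are unpadded byte counts listed in file order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SECTION_ALIGN 64u

/* Number of 64-byte blocks needed to hold `bytes`. */
uint64_t nblocks(uint64_t bytes) {
    return (bytes + SECTION_ALIGN - 1) / SECTION_ALIGN;
}

/* Fills offsets[i] with the start of section i, given each section's unpadded
 * size in file order. Returns the total (padded) file size. */
uint64_t section_offsets(const uint64_t *sizes, size_t n, uint64_t *offsets) {
    uint64_t pos = 0;
    assert(sizes != NULL && offsets != NULL);
    for (size_t i = 0; i < n; i++) {
        offsets[i] = pos;
        pos += nblocks(sizes[i]) * SECTION_ALIGN;  /* pad up to a full block */
    }
    return pos;
}
```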
Section creation order:

1. `bdz_*`. No dependencies.
1. `shellIndex`, `shellBlob`. No dependencies.
1. ✅ userGids. No dependencies.
1. ✅ Users. Requires `userGids` and shell.
1. ✅ Groupmembers. Requires Users.
1. ✅ Groups. Requires Groupmembers.
1. `idx_*`. Requires offsets to Groups and Users.
1. Header.

[git-subtrac]: https://apenwarr.ca/log/20191109
[cmph]: http://cmph.sourceforge.net/
[id]: https://linux.die.net/man/1/id
[nsswitch]: https://linux.die.net/man/5/nsswitch.conf
[data-oriented-design]: https://media.handmade-seattle.com/practical-data-oriented-design/
[getpwnam_r]: https://linux.die.net/man/3/getpwnam_r
[varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints
[getpwent_r]: https://www.man7.org/linux/man-pages/man3/getpwent_r.3.html
[getgrouplist]: https://www.man7.org/linux/man-pages/man3/getgrouplist.3.html
[getgrgid_r]: https://www.man7.org/linux/man-pages/man3/getgrgid_r.3.html