Turbo NSS
---------

Turbonss is a plugin for the GNU Name Service Switch (NSS) functionality of the
GNU C Library (glibc). Turbonss implements lookup for `passwd` and `group`
database entries (i.e. system users, groups, and group memberships). Its main
goal is performance, with a focus on making [`id(1)`][id] run as fast as
possible.

To understand more about the name service switch, start with
[`nsswitch.conf(5)`][nsswitch].

Design & constraints
--------------------

To be fast, the user/group database (later: DB) has to be small ([highly
recommended background viewing][data-oriented-design]). Turbonss encodes user
and group information in a way that minimizes the DB size and reduces jumping
across the DB ("chasing pointers and polluting CPU cache").

For example, [`getpwnam_r(3)`][getpwnam_r] accepts a username and returns
the following user information:

```
struct passwd {
    char   *pw_name;       /* username */
    char   *pw_passwd;     /* user password */
    uid_t   pw_uid;        /* user ID */
    gid_t   pw_gid;        /* group ID */
    char   *pw_gecos;      /* user information */
    char   *pw_dir;        /* home directory */
    char   *pw_shell;      /* shell program */
};
```

Turbonss implements this call, among others, and takes the following steps to
resolve it:

- Hash the username using a perfect hash function. The perfect hash function
  returns a number in the range [0, N-1], where N is the total number of users.
- Jump to a known location in the DB (by pointer arithmetic) which links the
  user's index to the user's information. That is an index to a different
  location within the DB.
- Jump to the location which stores the full user information.
- Decode the user information (which is all in a contiguous memory block) and
  return it to the caller.

In total, that's one hash for the username (~150ns), two pointer jumps within
the DB, and, now that the user record is found, a `memcpy` for each field.

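
The steps above can be sketched in C as follows. This is a minimal
illustration, not the turbonss implementation: `toy_db`, `toy_hash`, and the
record layout (one length byte followed by the name) are hypothetical and exist
only for this example. It also shows one detail the list leaves implicit: a
perfect hash maps unknown keys to an arbitrary slot, so the stored name has to
be compared before the record is returned.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory view of a DB, for illustration only. */
struct toy_db {
    const uint32_t *index;    /* hash value -> byte offset into records */
    const uint8_t  *records;  /* packed records: length byte, then name */
    uint32_t        num_users;
    uint32_t      (*hash)(const char *key, size_t len);
};

/* Stand-in for a real perfect hash; only valid for a two-key demo set. */
static uint32_t toy_hash(const char *key, size_t len)
{
    return len ? (uint32_t)(unsigned char)key[0] % 2u : 0u;
}

/* Returns the record for `name`, or NULL when the user is not in the DB. */
static const uint8_t *toy_lookup(const struct toy_db *db, const char *name)
{
    size_t len = strlen(name);
    uint32_t slot = db->hash(name, len);       /* step 1: hash the username */
    if (slot >= db->num_users)
        return NULL;
    uint32_t off = db->index[slot];            /* step 2: first jump */
    const uint8_t *rec = db->records + off;    /* step 3: second jump */
    /* Unknown keys land on some existing slot, so verify the stored name. */
    if (rec[0] != len || memcmp(rec + 1, name, len) != 0)
        return NULL;
    return rec;                                /* step 4 would decode this */
}
```

With records for `bob` and `alice`, a lookup of `carol` hashes to the same slot
as `alice` and is rejected by the name comparison.
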
This tight packing places some constraints on the underlying data:

- Maximum database size: 4GB.
- Maximum length of username and groupname: 32 bytes.
- Maximum length of shell and homedir: 64 bytes.
- Maximum comment ("gecos") length: 256 bytes.
- Username and groupname must be UTF-8 encoded.

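
A tool that builds the DB would have to reject entries that exceed these
limits. A minimal sketch of such a check follows; `user_fits` and the constant
names are hypothetical, and UTF-8 validity is deliberately not checked here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Limits copied from the list above. */
enum {
    MAX_NAME_LEN  = 32,   /* username and groupname */
    MAX_SHELL_LEN = 64,
    MAX_HOME_LEN  = 64,
    MAX_GECOS_LEN = 256
};

/* Hypothetical helper: true when every string field fits its limit. */
static bool user_fits(const char *name, const char *gecos,
                      const char *home, const char *shell)
{
    return strlen(name)  <= MAX_NAME_LEN
        && strlen(gecos) <= MAX_GECOS_LEN
        && strlen(home)  <= MAX_HOME_LEN
        && strlen(shell) <= MAX_SHELL_LEN;
}
```
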
Checking out and building
-------------------------

```
$ git clone --recursive https://git.sr.ht/~motiejus/turbonss
```

Alternatively, if you forgot `--recursive`:

```
$ git submodule update --init
```

Then run the tests:

```
$ zig build test
```

Other commands will be documented as they are implemented.

This project uses [git subtrac][git-subtrac] for managing dependencies.

Remarks on `id(1)`
------------------

A known implementation runs `id(1)` at ~250 rps sequentially on ~20k users and
~10k groups. Our target is 10k id/s.

`id(1)` works as follows:

- look up the user by name.
- get all additional gids (an array attached to a member).
- for each additional gid, get the group name.

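
The same three steps can be reproduced with standard libc calls. The sketch
below is an approximation of what `id` does, not its actual source;
`print_groups` is a hypothetical name. It uses `getpwnam(3)`,
`getgrouplist(3)`, and `getgrgid(3)`, which are the calls an NSS plugin ends up
serving.

```c
#define _DEFAULT_SOURCE
#include <grp.h>
#include <pwd.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Prints the groups of `username`; returns the number of groups, or -1 when
 * the user is unknown or memory runs out. */
static int print_groups(const char *username)
{
    struct passwd *pw = getpwnam(username);        /* look up user by name */
    if (pw == NULL)
        return -1;
    gid_t primary = pw->pw_gid;

    int ngroups = 16;
    gid_t *gids = malloc((size_t)ngroups * sizeof *gids);
    if (gids == NULL)
        return -1;
    /* On -1, ngroups holds the required size, so grow the buffer and retry. */
    if (getgrouplist(username, primary, gids, &ngroups) == -1) {
        gid_t *bigger = realloc(gids, (size_t)ngroups * sizeof *gids);
        if (bigger == NULL) {
            free(gids);
            return -1;
        }
        gids = bigger;
        if (getgrouplist(username, primary, gids, &ngroups) == -1) {
            free(gids);
            return -1;
        }
    }
    for (int i = 0; i < ngroups; i++) {
        struct group *gr = getgrgid(gids[i]);      /* gid -> group name */
        printf("%u(%s)\n", (unsigned)gids[i], gr ? gr->gr_name : "?");
    }
    free(gids);
    return ngroups;
}
```

Note that every `getgrgid` call returns a full `struct group`, including the
member list, even though only `gr_name` is printed.
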
Assuming a member is in ~100 groups on average, the 10k id/s target means 1M
group lookups per second. We need to convert a gid to a group index, and a
group index to a group gid/name, quickly.

Caveat: `struct group` contains an array of pointers to names of group members
(`char **gr_mem`). However, `id` does not use that information, resulting in
significant read amplification. Therefore, if `argv[0] == "id"`, `getgrgid(3)`
will return the group without the members. This speeds up `id` by about 10x on
a known NSS implementation.

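
In C the `argv[0] == "id"` test has to be a string comparison on the base name
of the program. The text does not say how turbonss obtains the program name
(an NSS module is not handed `argv`); one option on glibc is the GNU extension
`program_invocation_short_name`. The helper below is a hypothetical sketch of
the comparison itself.

```c
#include <stdbool.h>
#include <string.h>

/* True when the program was invoked as `id`, with or without a directory
 * prefix such as "/usr/bin/id". */
static bool invoked_as_id(const char *argv0)
{
    const char *slash = strrchr(argv0, '/');
    const char *base = slash ? slash + 1 : argv0;
    return strcmp(base, "id") == 0;
}
```
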
Because `getgrgid(3)` does not use the group members' information, the group
members are stored in a different location, making the `Groups` section
smaller and thus more CPU-cache-friendly.

Indices
-------

The following operations need to be fast, in order of importance:

1. lookup gid -> group (this is on the hot path in `id`) with or without
   members (2 separate calls).
2. lookup uid -> user.
3. lookup groupname -> group.
4. lookup username -> user.
5. (optional) iterate users using a defined order (`getent passwd`).
6. (optional) iterate groups using a defined order (`getent group`).

The first 4 can use perfect hashing like [cmph][cmph]: it hashes a list of
byte strings to a sequential list of integers. Perfect hashing algorithms
require some space, and take some time to calculate ("hashing duration"). I've
tested BDZ, which hashes `[][]u8` to a sequential list of integers (not
preserving order), and CHM, which does the same but preserves order. BDZ
accepts an argument `3 <= b <= 10`.

BDZ: tried b=3, b=7 (default), and b=10.

* The BDZ algorithm requires 900KB, 338KB, and 306KB, respectively, for 1M
  values.
* Latency to resolve 1M keys: 170ms, 180ms, and 230ms, respectively.
* Packed vs non-packed latency differences are not meaningful.

CHM retains order; however, 1M keys weigh 8MB. 10k keys are ~20x larger with
CHM than with BDZ, eliminating the benefit of preserved ordering.

Turbonss header
---------------

The turbonss header looks like this:

```
OFFSET  TYPE   NAME                         DESCRIPTION
0       [4]u8  magic                        always 0xf09fa4b7
4       u8     version                      now `0`
5       u16    bom                          0x1234
7       u8     padding
8       u32    num_users                    number of passwd entries
12      u32    num_groups                   number of group entries
16      u32    offset_cmph_gid2group
20      u32    offset_cmph_uid2user
24      u32    offset_cmph_groupname2group
28      u32    offset_cmph_username2user
32      u32    offset_groupmembers
36      u32    offset_additional_gids
```

`magic` is 0xf09fa4b7, and `version` must be `0`. All integers are
native-endian. `bom` is a byte-order mark. It must resolve to `0x1234` (4660).
If it does not, the file is being consumed on a machine with a different
endianness than the one it was created on. Turbonss files cannot be moved
across computers of different endianness; if that happens, turbonss will
refuse to read the file.

Offsets are indices to further sections of the file, with zero being the first
block (pointing to the `magic` field). As all blobs are 64-byte aligned, the
offsets always point to the beginning of a 64-byte "block". Therefore, all
`offset_*` values could be `u26`. As `u32` is easier to visualize with xxd, and
the header block fits in 64 bytes anyway, we are keeping them as `u32` for now.

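
A reader could perform the header checks and the block-to-byte conversion
roughly as follows. This is a sketch under assumptions: `check_header`,
`block_to_byte_offset`, and the error codes are hypothetical names, and the
magic is assumed to be stored as the four bytes `f0 9f a4 b7` in that order.

```c
#include <stdint.h>
#include <string.h>

enum { HEADER_OK = 0, HEADER_BAD_MAGIC, HEADER_BAD_VERSION, HEADER_BAD_ENDIAN };

/* Validates the first 8 bytes of a turbonss file. */
static int check_header(const uint8_t *buf)
{
    static const uint8_t magic[4] = { 0xf0, 0x9f, 0xa4, 0xb7 };
    uint16_t bom;

    if (memcmp(buf, magic, sizeof magic) != 0)
        return HEADER_BAD_MAGIC;
    if (buf[4] != 0)
        return HEADER_BAD_VERSION;
    /* Offset 5 is not 2-byte aligned, so copy instead of casting. */
    memcpy(&bom, buf + 5, sizeof bom);
    if (bom != 0x1234)
        return HEADER_BAD_ENDIAN;   /* written on a different-endian machine */
    return HEADER_OK;
}

/* Converts an offset_* value (a count of 64-byte blocks) to a byte position. */
static uint64_t block_to_byte_offset(uint32_t block)
{
    return (uint64_t)block << 6;
}
```

The largest `u26` block index maps to byte 2^32 - 64, which is why 26 bits are
enough for a 4GB file.
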
Primitive types
---------------

```
const Group = struct {
    gid: u32,
    // index to a separate structure with a list of members. The memberlist is
    // always 2^5-byte aligned (32 bytes); this is an index there.
    members_offset: u27,
    groupname_len: u5,
    // a groupname_len-sized string
    groupname: []u8,
};

const User = struct {
    uid: u32,
    gid: u32,
    // pointer to a separate structure that contains a list of gids
    additional_gids_offset: u29,
    // shell is a different story, documented under "Shells".
    shell_here: u1,
    shell_len_or_place: u6,
    home_len: u6,
    username_pos: u1,
    username_len: u5,
    gecos_len: u8,
    // a variable-sized array that will be stored immediately after this
    // struct.
    stringdata: []u8,
};
```

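
To show how the sub-byte fields share a word, here is a hedged C sketch that
packs and unpacks the `members_offset`/`groupname_len` pair of `Group`. The bit
order is an assumption (first field in the low bits, as Zig lays out packed
structs); the text above does not specify it, and the helper names are
hypothetical.

```c
#include <stdint.h>

/* Assumed layout: members_offset in bits 0..26, groupname_len in bits 27..31. */
struct group_bits {
    uint32_t members_offset;  /* u27 */
    uint8_t  groupname_len;   /* u5 */
};

static uint32_t pack_group_bits(struct group_bits g)
{
    return (g.members_offset & 0x07ffffffu)
         | ((uint32_t)(g.groupname_len & 0x1fu) << 27);
}

static struct group_bits unpack_group_bits(uint32_t word)
{
    struct group_bits g = { word & 0x07ffffffu, (uint8_t)(word >> 27) };
    return g;
}
```
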
`User` and `Group` entries are sorted by name, ordered by their Unicode
codepoints.

Shells
------

Normally there is a limited number of distinct shells, even in huge user
databases. A few examples: `/bin/bash`, `/usr/bin/nologin`, `/bin/zsh`, among
others. Therefore, shells have an optimization: a shell can either be
referenced by its index in an external list, or reside among the user's data.

The 64 (1<<6) most popular shells (i.e. referred to by at least two `User`
entries) are stored externally in the "Shells" area. The less popular ones are
stored with the userdata.

The `shell_here=true` bit signifies that the shell is stored with the userdata.
`false` means it is stored in the `Shells` section. If the shell is stored
"here", it is the first element in `stringdata`, and its length is
`shell_len_or_place`. If it is stored externally, the latter variable holds
its index in the external storage.

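
The two cases can be sketched as a small resolver. This is an illustration
only: `shell_ref`, `resolve_shell`, and the shape of the external table are
hypothetical, and `shell_len_or_place` is treated as the literal length, which
the text does not confirm.

```c
#include <stddef.h>
#include <stdint.h>

/* A non-owning view of a shell string (not NUL-terminated when inline). */
struct shell_ref {
    const char *ptr;
    size_t      len;
};

/* `shells` stands in for the external "Shells" section (at most 64 entries). */
static struct shell_ref resolve_shell(int shell_here, uint8_t shell_len_or_place,
                                      const char *stringdata,
                                      const struct shell_ref *shells)
{
    if (shell_here) {
        /* Stored inline: the shell is the first element of stringdata. */
        struct shell_ref inline_shell = { stringdata, shell_len_or_place };
        return inline_shell;
    }
    /* Stored externally: the field is an index into the Shells section. */
    return shells[shell_len_or_place];
}
```
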
Shells in the external storage are sorted by their weight, which is
`length*frequency`.

`groupmembers`, `additional_gids`
---------------------------------

`groupmembers` and `additional_gids` store group and user memberships
respectively: for each group, a list of pointers ("offsets") to `User` records,
and for each user, a list of pointers to `Group` records. These fields are
always used in their entirety, so random access is not required, which makes
them suitable for tight packing.

An entry of `groupmembers` and `additional_gids` looks like this piece of
pseudo-code:

```
const PackedList = struct {
    length: varint,
    members: []varint,
};
const Groupmembers = PackedList;
const AdditionalGids = PackedList;
```

Each entry in the `members` field is an offset to a `User` or `Group` entry
(number of bytes relative to the first entry of the respective type). The
`members` field in `PackedList` is sorted by the name (`username` or
`groupname`) of the record it points to.

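
Reading a `PackedList` could look like the sketch below. The varint encoding is
not specified above, so this assumes unsigned LEB128 (7 payload bits per byte,
high bit as continuation flag); `varint_decode` and `packed_list_read` are
hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Decodes one unsigned LEB128 varint from buf[0..len). Returns the number of
 * bytes consumed, or 0 on truncated or over-long input. */
static size_t varint_decode(const uint8_t *buf, size_t len, uint64_t *out)
{
    uint64_t value = 0;
    for (size_t i = 0; i < len && i < 10; i++) {
        value |= (uint64_t)(buf[i] & 0x7f) << (7 * i);
        if ((buf[i] & 0x80) == 0) {
            *out = value;
            return i + 1;
        }
    }
    return 0;
}

/* Reads a PackedList: a varint count followed by that many varint members.
 * Returns the count, or -1 on malformed input or when `cap` is too small. */
static long packed_list_read(const uint8_t *buf, size_t len,
                             uint64_t *members, size_t cap)
{
    uint64_t count;
    size_t pos = varint_decode(buf, len, &count);
    if (pos == 0 || count > cap)
        return -1;
    for (uint64_t i = 0; i < count; i++) {
        size_t n = varint_decode(buf + pos, len - pos, &members[i]);
        if (n == 0)
            return -1;
        pos += n;
    }
    return (long)count;
}
```

Because the list is always consumed front to back, a sequential decoder like
this is sufficient; no index into the middle of the list is needed.
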
Complete file structure
-----------------------

```
SECTION               SIZE  DESCRIPTION
Header                1<<6  documented above
[]Group               ?     list of Group entries
[]User                ?     list of User entries
Shells                ?     documented in "Shells"
cmph_gid2group        ?     offset by offset_cmph_gid2group
cmph_uid2user         ?     offset by offset_cmph_uid2user
cmph_groupname2group  ?     offset by offset_cmph_groupname2group
cmph_username2user    ?     offset by offset_cmph_username2user
groupmembers          ?     offset by offset_groupmembers
additional_gids       ?     offset by offset_additional_gids
```

[git-subtrac]: https://github.com/apenwarr/git-subtrac/
[cmph]: http://cmph.sourceforge.net/
[id]: https://linux.die.net/man/1/id
[nsswitch]: https://linux.die.net/man/5/nsswitch.conf
[data-oriented-design]: https://media.handmade-seattle.com/practical-data-oriented-design/
[getpwnam_r]: https://linux.die.net/man/3/getpwnam_r