
Compare commits


11 Commits

Author SHA1 Message Date
Motiejus Jakštys 4493b4408c update README 2022-11-30 12:04:12 +02:00
Motiejus Jakštys 312e510eff add lto and fpic (learning about linkers. Thanks, Drepper) 2022-11-30 11:56:02 +02:00
Motiejus Jakštys 0df7d8b722 DSO: reduce visibility of bdz lib 2022-11-25 14:50:42 +02:00
Motiejus Jakštys 422c264df9 zig v0.10 compatibility 2022-11-20 13:33:05 +02:00
Motiejus Jakštys ff814a474b add compress-debug-sections to turbonss-makecorpus 2022-08-23 15:50:00 +03:00
Motiejus Jakštys 292c87a597 Merge branch 'compress-debug-sections' 2022-08-23 15:49:33 +03:00
Motiejus Jakštys 8212f3f51a clarify compiler requirements 2022-08-21 06:58:05 +03:00
Motiejus Jakštys 4d4c8a5be1 analyze command 2022-08-21 06:10:47 +03:00
Motiejus Jakštys fbd449b21f Move docs around; finish it 2022-08-21 06:08:21 +03:00
Motiejus Jakštys 8bfc4a30cd wip turbonss-unix2systemd 2022-08-20 19:08:04 +03:00
Motiejus Jakštys ef436294e9 wip compress-debug-sections 2022-07-20 14:30:48 +03:00
4 changed files with 520 additions and 410 deletions

534 README.md

@ -1,442 +1,170 @@
Turbo NSS
---------

Turbonss is a plugin for GNU Name Service Switch ([NSS][nsswitch])
functionality of GNU C Library (glibc). Turbonss implements lookup for `user`
and `passwd` database entries (i.e. system users, groups, and group
memberships). Its main goal is to run [`id(1)`][id] as fast as possible.

Turbonss is optimized for reading. If the data changes in any way, the whole
file will need to be regenerated. Therefore, it was created for, and is best
suited to, environments that have a central user & group database which then
needs to be distributed to many servers/services, and where the data does not
change very often (e.g. hourly).

This is the fastest known NSS passwd/group implementation for *reads*. On a
corpus with 10k users, 10k groups and 500 average members per group, `id` takes
17 seconds with the glibc default implementation, 10-17 milliseconds with a
pre-cached `nscd`, and ~8 milliseconds with `turbonss`.

Project status
--------------

The project is finished and was never used nor recommended for production. If
you are considering using turbonss, try nscd first. Turbonss is only 2-5 times
faster than pre-warmed nscd, which usually does not matter enough to go through
the hoops of using a nonstandard NSS library in the first place.

Yours truly worked on this for about 7 months. This was also my first Zig
project, and I never came back to clean it up (nor really needed to).

Dependencies
------------

1. zig v0.10. Turbonss is implemented in stage1, so it will not work with zig
   v0.11+.
2. [cmph][cmph]: bundled with this repository.

Trying it out
-------------

Clone, compile and test first:

    $ git clone --recursive https://git.sr.ht/~motiejus/turbonss
    $ zig build test
    $ zig build -Dtarget=x86_64-linux-gnu.2.16 -Dcpu=baseline -Drelease-safe=true

One may choose different options, depending on requirements. Here are some
hints:

1. `-Dcpu=<...>` for the CPU
   [microarchitecture](https://en.wikipedia.org/wiki/X86-64#Microarchitecture_levels).
2. `-Drelease-fast=true` for maximum speed.
3. `-Drelease-small=true` for the smallest binary sizes.
4. `-Dstrip=true` to strip debug symbols.

For reference, the sizes of the shared library and helper binaries when
compiled with `-Dstrip=true -Drelease-small=true`:

     17K Nov 30 11:53 turbonss-analyze
     16K Nov 30 11:53 turbonss-getent
     17K Nov 30 11:53 turbonss-makecorpus
    166K Nov 30 11:53 turbonss-unix2db
     22K Nov 30 11:53 libnss_turbo.so.2.0.0

Many thanks to Ulrich Drepper for [teaching how to link it properly][dso].

Test turbonss on a real system
------------------------------

`db.turbo` is the TurboNSS database file. To create one from `/etc/group` and
`/etc/passwd`, use `turbonss-unix2db`:

    $ zig-out/bin/turbonss-unix2db --passwd /etc/passwd --group /etc/group
    $ zig-out/bin/turbonss-analyze db.turbo
    File: /etc/turbonss/db.turbo
    Size: 2,624 bytes
    Version: 0
    Endian: little
    Pointer size: 8 bytes
    getgr buffer size: 17
    getpw buffer size: 74
    Users: 19
    Groups: 39
    Shells: 1
    Most memberships: _apt (1)
    Sections:
      Name                    Begin       End  Size bytes
      header               00000000  00000080         128
      bdz_gid              00000080  000000c0          64
      bdz_groupname        000000c0  00000100          64
      bdz_uid              00000100  00000140          64
      bdz_username         00000140  00000180          64
      idx_gid2group        00000180  00000240         192
      idx_groupname2group  00000240  00000300         192
      idx_uid2user         00000300  00000380         128
      idx_name2user        00000380  00000400         128
      shell_index          00000400  00000440          64
      shell_blob           00000440  00000480          64
      groups               00000480  00000700         640
      users                00000700  000009c0         704
      groupmembers         000009c0  00000a00          64
      additional_gids      00000a00  00000a40          64

Run and configure a test container that uses `turbonss` instead of the default
`files`:

    $ docker run -ti --rm -v `pwd`:/etc/turbonss -w /etc/turbonss debian:bullseye
    # cp zig-out/lib/libnss_turbo.so.2 /lib/x86_64-linux-gnu/
    # sed -i '/passwd\|group/ s/files/turbo/' /etc/nsswitch.conf

And run the commands:

    $ getent passwd
    $ getent group
    $ id root

More users and groups
---------------------

`turbonss-makecorpus` can synthesize more `users` and `groups`:

    # ./zig-out/bin/turbonss-makecorpus
    wrote users=10000 groups=10000 avg-members=1000 to .
    # cat group >> /etc/group
    # cat passwd >> /etc/passwd
    # time id u_1000000
    <...>
    real    0m17.380s
    user    0m13.117s
    sys     0m4.263s

17 seconds for an `id` command! Well, there are indeed many users and groups.
Let's see how turbonss fares with it:

    # zig-out/bin/turbonss-unix2db --group /etc/group --passwd /etc/passwd
    total 10968512 bytes. groups=10019 users=10039
    # ls -hs /etc/group /etc/passwd db.turbo
    48M /etc/group  668K /etc/passwd  11M db.turbo
    # sed -i '/passwd\|group/ s/files/turbo/' /etc/nsswitch.conf
    # time id u_1000000
    real    0m0.008s
    user    0m0.000s
    sys     0m0.008s

That's a ~1500x improvement for the `id` command (and notice the roughly 4x
compression ratio compared to the plain files). If the number of users and
groups is increased by 10x (to 100k each), the difference becomes even crazier:

    # time id u_1000000
    <...>
    real    3m42.281s
    user    2m30.482s
    sys     0m55.840s
    # sed -i '/passwd\|group/ s/files/turbo/' /etc/nsswitch.conf
    # time id u_1000000
    <...>
    real    0m0.008s
    user    0m0.000s
    sys     0m0.008s

Documentation
-------------

- Architecture is detailed in `docs/architecture.md`.
- Development notes are in `docs/development.md`.

[nsswitch]: https://linux.die.net/man/5/nsswitch.conf
[id]: https://linux.die.net/man/1/id
[cmph]: http://cmph.sourceforge.net/
[dso]: https://akkadia.org/drepper/dsohowto.pdf

build.zig

@@ -5,6 +5,7 @@ const zbs = std.build;
 pub fn build(b: *zbs.Builder) void {
     const target = b.standardTargetOptions(.{});
     const mode = b.standardReleaseOptions();
+    b.use_stage1 = true;

     const strip = b.option(bool, "strip", "Omit debug information") orelse false;
@@ -42,9 +43,11 @@ pub fn build(b: *zbs.Builder) void {
         //"-DDEBUG",
     });
     cmph.strip = strip;
+    cmph.want_lto = true;
+    cmph.compress_debug_sections = .zlib;
     cmph.omit_frame_pointer = true;
-    cmph.addIncludeDir("deps/cmph/src");
-    cmph.addIncludeDir("include/deps/cmph");
+    cmph.addIncludePath("deps/cmph/src");
+    cmph.addIncludePath("include/deps/cmph");

     const bdz = b.addStaticLibrary("bdz", null);
     bdz.setTarget(target);
@@ -57,15 +60,20 @@ pub fn build(b: *zbs.Builder) void {
     }, &.{
         "-W",
         "-Wno-unused-function",
+        "-fvisibility=hidden",
+        "-fpic",
         //"-DDEBUG",
     });
     bdz.omit_frame_pointer = true;
-    bdz.addIncludeDir("deps/cmph/src");
-    bdz.addIncludeDir("include/deps/cmph");
+    bdz.addIncludePath("deps/cmph/src");
+    bdz.addIncludePath("include/deps/cmph");
+    bdz.want_lto = true;

     {
         const exe = b.addExecutable("turbonss-unix2db", "src/turbonss-unix2db.zig");
+        exe.compress_debug_sections = .zlib;
         exe.strip = strip;
+        exe.want_lto = true;
         exe.setTarget(target);
         exe.setBuildMode(mode);
         addCmphDeps(exe, cmph);
@@ -74,7 +82,9 @@ pub fn build(b: *zbs.Builder) void {
     {
         const exe = b.addExecutable("turbonss-analyze", "src/turbonss-analyze.zig");
+        exe.compress_debug_sections = .zlib;
         exe.strip = strip;
+        exe.want_lto = true;
         exe.setTarget(target);
         exe.setBuildMode(mode);
         exe.install();
@@ -82,7 +92,9 @@ pub fn build(b: *zbs.Builder) void {
     {
         const exe = b.addExecutable("turbonss-makecorpus", "src/turbonss-makecorpus.zig");
+        exe.compress_debug_sections = .zlib;
         exe.strip = strip;
+        exe.want_lto = true;
         exe.setTarget(target);
         exe.setBuildMode(mode);
         exe.install();
@@ -90,10 +102,12 @@ pub fn build(b: *zbs.Builder) void {
     {
         const exe = b.addExecutable("turbonss-getent", "src/turbonss-getent.zig");
+        exe.compress_debug_sections = .zlib;
         exe.strip = strip;
+        exe.want_lto = true;
         exe.linkLibC();
         exe.linkLibrary(bdz);
-        exe.addIncludeDir("deps/cmph/src");
+        exe.addIncludePath("deps/cmph/src");
         exe.setTarget(target);
         exe.setBuildMode(mode);
         exe.install();
@@ -107,10 +121,12 @@ pub fn build(b: *zbs.Builder) void {
             .patch = 0,
         },
     });
+    so.compress_debug_sections = .zlib;
     so.strip = strip;
+    so.want_lto = true;
     so.linkLibC();
     so.linkLibrary(bdz);
-    so.addIncludeDir("deps/cmph/src");
+    so.addIncludePath("deps/cmph/src");
     so.setTarget(target);
     so.setBuildMode(mode);
     so.install();
@@ -127,5 +143,5 @@ pub fn build(b: *zbs.Builder) void {
 fn addCmphDeps(exe: *zbs.LibExeObjStep, cmph: *zbs.LibExeObjStep) void {
     exe.linkLibC();
     exe.linkLibrary(cmph);
-    exe.addIncludeDir("deps/cmph/src");
+    exe.addIncludePath("deps/cmph/src");
 }

327 docs/architecture.md Normal file

@ -0,0 +1,327 @@
Design & constraints
--------------------
To be fast, the user/group database (later: DB) has to be small
([background][data-oriented-design]). It encodes user & group information in a
way that minimizes the DB size, and reduces jumping across the DB ("chasing
pointers and thrashing CPU cache").
To understand how this is done efficiently, let's analyze
[`getpwnam_r(3)`][getpwnam_r] at a high level. This API call accepts a username
and returns the following user information:
```
struct passwd {
char *pw_name; /* username */
char *pw_passwd; /* user password */
uid_t pw_uid; /* user ID */
gid_t pw_gid; /* group ID */
char *pw_gecos; /* user information */
char *pw_dir; /* home directory */
char *pw_shell; /* shell program */
};
```
Turbonss, among others, implements this call, and takes the following steps to
resolve a username to a `struct passwd*`:
- Open the DB (using `mmap`) and interpret its first 64 bytes as a `*struct
Header`. The header stores offsets to the sections of the file. This needs to
be done once, when the NSS library is loaded.
- Hash the username using a perfect hash function. The perfect hash function
returns a number `n ∈ [0,N-1]`, where N is the total number of users.
- Jump to the `n`'th location in the `idx_name2user` section, which contains
the index `i` to the user's information.
- Jump to the location `i` of section `Users`, which stores the full user
information.
- Decode the user information (which is all in a contiguous memory block) and
return it to the caller.
In total, that's one hash for the username (~150ns), two pointer jumps within
the DB file (to sections `idx_name2user` and `Users`), and, now that the
user record is found, `memcpy` for each field.
The turbonss DB file is `mmap`-ed, making it simple to jump across the file
using pointer arithmetic. This also reduces memory usage, as the mmap'ed
regions are shared. Turbonss reads do not consume any heap space.
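A minimal sketch of this lookup flow in C (the real implementation is Zig). The
`struct db` layout and the `perfect_hash_username`/`user_record_name` helpers
are hypothetical stand-ins for the bdz hash and the record decoding described
below, not a real API:

```
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical view of the mmap'ed DB; field names are illustrative only. */
struct db {
    const uint8_t *base;            /* start of the mmap'ed file              */
    const uint32_t *idx_name2user;  /* perfect-hash slot -> user offset / 8   */
    const uint8_t *users;           /* packed user records                    */
    uint32_t num_users;
};

/* Hypothetical helpers, declared only so the sketch is complete. */
uint32_t perfect_hash_username(const struct db *db, const char *name);
const char *user_record_name(const uint8_t *user_record);

/* Done once, when the NSS library is loaded. */
static int db_open(struct db *db, const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return -1; }
    void *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    if (p == MAP_FAILED) return -1;
    db->base = p;
    /* ...parse the 128-byte header here and fill in the section pointers... */
    return 0;
}

/* getpwnam-style lookup: one hash, two jumps, then copy the fields out. */
static const uint8_t *lookup_user(const struct db *db, const char *name) {
    uint32_t n = perfect_hash_username(db, name);  /* n in [0, num_users) */
    uint32_t packed = db->idx_name2user[n];        /* byte offset / 8     */
    const uint8_t *user = db->users + ((size_t)packed << 3);
    /* The hash is perfect but not key-aware: confirm the record's name
     * actually matches before trusting it (see "Indices" below). */
    if (strcmp(user_record_name(user), name) != 0)
        return NULL;
    return user;
}
```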
Tight packing places some constraints on the underlying data:
- Permitted length of username and groupname: 1-32 bytes.
- Permitted length of shell and home: 1-256 bytes.
- Permitted comment ("gecos") length: 0-255 bytes.
- User name, groupname, gecos and shell must be utf8-encoded.
- User and Groups sections are up to 2^35B (~34GB) large. Assuming an "average"
user record takes 50 bytes, this section would fit ~660M users. The
worst-case upper bound is left as an exercise to the reader.
Sorting is stable. In v0:
- Groups are sorted by gid, ascending.
- Users are sorted by their name, ascending by the unicode codepoints
(locale-independent).
remarks on `id(1)`
------------------
A known implementation runs id(1) at ~250 rps sequentially on ~20k users and
~10k groups. Our rps target is much higher.
To better reason about the trade-offs, it is useful to understand how `id(1)`
is implemented, in rough terms:
- lookup user by name ([`getpwent_r(3)`][getpwent]).
- get all gids for the user ([`getgrouplist(3)`][getgrouplist]). Note: glibc
  actually uses `initgroups_dyn`, which accepts a uid and is very poorly
  documented.
- for each additional gid, get the `struct group*`
([`getgrgid_r(3)`][getgrgid_r]).
Assuming a member is in ~100 groups on average, to reach 10k id/s translates to
1M group lookups per second. We need to convert gid to a group index, and group
index to a group gid/name quickly.
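For illustration, roughly the same sequence expressed against the public glibc
APIs named above; this is a hedged sketch, not the coreutils `id` source, and
the real `id` goes through `initgroups_dyn` rather than calling
`getgrouplist(3)` directly:

```
#include <grp.h>
#include <pwd.h>
#include <stdio.h>
#include <stdlib.h>

/* Roughly what `id someuser` has to ask NSS for. */
static void id_like(const char *username) {
    struct passwd pw, *pwp;
    char buf[1 << 14];
    if (getpwnam_r(username, &pw, buf, sizeof(buf), &pwp) != 0 || !pwp)
        return;

    int ngroups = 128;
    gid_t *gids = malloc(ngroups * sizeof(*gids));
    if (getgrouplist(username, pw.pw_gid, gids, &ngroups) < 0) {
        /* buffer too small: ngroups now holds the required count */
        gids = realloc(gids, ngroups * sizeof(*gids));
        getgrouplist(username, pw.pw_gid, gids, &ngroups);
    }

    /* One group lookup per gid: this loop is the hot path. */
    for (int i = 0; i < ngroups; i++) {
        struct group gr, *grp;
        char gbuf[1 << 14];
        if (getgrgid_r(gids[i], &gr, gbuf, sizeof(gbuf), &grp) == 0 && grp)
            printf("%u(%s) ", (unsigned)gr.gr_gid, gr.gr_name);
    }
    printf("\n");
    free(gids);
}
```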
Caveat: `struct group` contains an array of pointers to names of group members
(`char **gr_mem`). However, `id` does not use that information, resulting in
read amplification, sometimes by 10-100x. Therefore, if `argv[0] == "id"`, our
implementation of [`getgrgid_r(3)`][getgrid] returns the `struct group*` without
the members. This speeds up `id` by about 10x on a known NSS implementation.
Relatedly, because [`getgrgid_r(3)`][getgrid] does not need the group members,
the group members are stored in a different DB section, reducing the `Groups`
section and making more of it fit the CPU caches.
Turbonss header
---------------
The turbonss header looks like this:
```
OFFSET TYPE NAME DESCRIPTION
0 [4]u8 magic f0 9f a4 b7
4 u8 version 0
5 u8 endian 0 for little, 1 for big
6 u8 nblocks_shell_blob max value: 63
7 u8 num_shells max value: 63
8 u32 num_groups number of group entries
12 u32 num_users number of passwd entries
16 u32 nblocks_bdz_gid bdz_gid section block count
20 u32 nblocks_bdz_groupname
24 u32 nblocks_bdz_uid
28 u32 nblocks_bdz_username
32 u64 nblocks_groups
40 u64 nblocks_users
48 u64 nblocks_groupmembers
56 u64 nblocks_additional_gids
64 u64 getgr_bufsize
72 u64 getpw_bufsize
80 [48]u8 padding
```
`magic` is 0xf09fa4b7, and `version` must be `0`. All integers are
native-endian. `nblocks_*` is the count of blocks of a particular section; this
helps calculate the offsets to all sections.
Some numbers, like `nblocks_shell_blob` and `num_shells`, would fit into a smaller
number of bytes. However, interpreting `[2]u6` with `xxd(1)` is harder than
interpreting `[2]u8`. Therefore we are using the space we have to make these
integers byte-wide.
`getgr_bufsize` and `getpw_bufsize` are hints for callers of the `getgr*` and
`getpw*` families of calls. They are the recommended buffer sizes, so the
caller does not receive `ENOMEM`.
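For illustration, the same layout as a C struct. This mirror is derived from
the table above and is an assumption, not the project's actual (Zig)
declaration:

```
#include <stdint.h>

/* 128-byte on-disk header, all integers native-endian. */
struct turbonss_header {
    uint8_t  magic[4];             /* f0 9f a4 b7          */
    uint8_t  version;              /* must be 0            */
    uint8_t  endian;               /* 0 = little, 1 = big  */
    uint8_t  nblocks_shell_blob;   /* max 63               */
    uint8_t  num_shells;           /* max 63               */
    uint32_t num_groups;
    uint32_t num_users;
    uint32_t nblocks_bdz_gid;
    uint32_t nblocks_bdz_groupname;
    uint32_t nblocks_bdz_uid;
    uint32_t nblocks_bdz_username;
    uint64_t nblocks_groups;
    uint64_t nblocks_users;
    uint64_t nblocks_groupmembers;
    uint64_t nblocks_additional_gids;
    uint64_t getgr_bufsize;
    uint64_t getpw_bufsize;
    uint8_t  padding[48];
};

/* Compile-time check that the layout adds up to the table's 128 bytes. */
_Static_assert(sizeof(struct turbonss_header) == 128, "header must be 128 bytes");
```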
Primitive types
---------------
`User` and `Group` entries are sorted by the order they were received in the input
file. All entries are aligned to 8 bytes. All `User` and `Group` entries are
referred to by their byte offset in the `Users` and `Groups` sections, relative to
the beginning of the section.
```
const PackedGroup = packed struct {
gid: u32,
padding: u3,
groupname_len: u5,
}
```
PackedGroup is followed by the group name (of length `groupname_len`), followed
by a varint-compressed offset into the groupmembers section, followed by padding
to an 8-byte boundary.
PackedUser is a bit more involved:
```
pub const PackedUser = packed struct {
uid: u32,
gid: u32,
shell_len_or_idx: u8,
shell_here: bool,
name_is_a_suffix: bool,
home_len: u6,
name_len: u5,
gecos_len: u11,
}
```
... followed by `userdata: []u8`:
- home.
- name (optional).
- gecos.
- shell (optional).
- `additional_gids_offset`: varint.
First byte of home is stored right after the `gecos_len` field, and its length
is `home_len`. The same logic applies to all the `stringdata` fields: there is
a way to calculate their relative position from the length of the fields before
them.
PackedUser employs two data-oriented compression techniques:
- shells are often shared across different users, see the "Shells" section.
- `name` is frequently a suffix of `home`. For example, `/home/vidmantas` and
`vidmantas`. In this case storing both name and home is wasteful. Therefore
name has two options:
1. `name_is_a_suffix=true`: name is a suffix of the home dir. Then `name`
starts at the `home_len - name_len`'th byte of `home`, and ends at the same
place as `home`.
2. `name_is_a_suffix=false`: name begins one byte after home, and its length
is `name_len`.
The last field `additional_gids_offset: varint` points to the `additional_gids`
section for this user.
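A sketch of decoding one record in C, assuming the bit fields are packed
least-significant-bit first in declaration order and that userdata starts
immediately after the 12 fixed bytes; the actual Zig `packed struct` layout may
differ, and the sketch uses the stored lengths as-is (the 1-256 ranges hint at
a bias it does not model):

```
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Parsed view of one packed user record. */
struct user_view {
    uint32_t uid, gid;
    const char *home;  size_t home_len;
    const char *name;  size_t name_len;
    const char *gecos; size_t gecos_len;
    uint8_t shell_len_or_idx;   /* resolved in the "Shells" sketch below */
    bool shell_here;
};

static void unpack_user(const uint8_t *rec, struct user_view *u) {
    u->uid = (uint32_t)rec[0] | rec[1] << 8 | rec[2] << 16 | (uint32_t)rec[3] << 24;
    u->gid = (uint32_t)rec[4] | rec[5] << 8 | rec[6] << 16 | (uint32_t)rec[7] << 24;
    u->shell_len_or_idx = rec[8];

    /* 24 bits of flags and lengths, assumed LSB-first in declaration order. */
    uint32_t bits = (uint32_t)rec[9] | rec[10] << 8 | (uint32_t)rec[11] << 16;
    u->shell_here       = bits & 1;
    bool name_is_suffix = (bits >> 1) & 1;
    u->home_len         = (bits >> 2) & 0x3f;    /* u6  */
    u->name_len         = (bits >> 8) & 0x1f;    /* u5  */
    u->gecos_len        = (bits >> 13) & 0x7ff;  /* u11 */

    const char *data = (const char *)rec + 12;   /* userdata starts here */
    u->home = data;
    if (name_is_suffix) {
        /* name is the trailing part of home: no extra bytes are stored */
        u->name  = u->home + (u->home_len - u->name_len);
        u->gecos = data + u->home_len;
    } else {
        u->name  = data + u->home_len;
        u->gecos = u->name + u->name_len;
    }
}
```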
Shells
------
Normally there is a limited number of separate shells even in huge user
databases. A few examples: `/bin/bash`, `/usr/bin/nologin`, `/bin/zsh` among
others. Therefore, shells get an optimization: they can be referenced in an
external list, or, if they are unique to the user, reside among the user's
data.
The 255 most popular shells (i.e. referred to by at least two User entries) are
stored externally in the "Shells" area. The less popular ones are stored with
userdata.
The Shells section consists of two sub-sections: the index and the blob. The index
is an array of offsets: the i'th shell starts at `offsets[i]` byte, and ends at
`offsets[i+1]` byte. If there is at least one shell in the shell section, the
index contains a sentinel index as the last element, which signifies the position
of the last byte of the shell blob.
`shell_here=true` in the User struct means the shell is stored with userdata,
and its length is `shell_len_or_idx`. `shell_here=false` means it is stored in
the `Shells` section, and its index is `shell_len_or_idx` (and the actual
string start and end offsets are resolved as described in the paragraph above).
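A sketch of shell resolution under the same assumptions; the `struct shells`
handle and the `inline_shell` pointer (the shell's position inside userdata)
are illustrative:

```
#include <stddef.h>
#include <stdint.h>

/* Hypothetical handles to the two shell sub-sections. */
struct shells {
    const uint16_t *index;  /* offsets into blob; last element is a sentinel */
    const char *blob;
};

/* Returns a pointer to the shell string and stores its length in *len. */
static const char *resolve_shell(const struct shells *s,
                                 int shell_here, uint8_t shell_len_or_idx,
                                 const char *inline_shell, size_t *len) {
    if (shell_here) {
        /* Shell is stored inside this user's own userdata. */
        *len = shell_len_or_idx;
        return inline_shell;
    }
    /* Shell i lives in the blob between offsets index[i] and index[i+1]. */
    uint16_t begin = s->index[shell_len_or_idx];
    uint16_t end = s->index[shell_len_or_idx + 1];
    *len = end - begin;
    return s->blob + begin;
}
```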
Variable-length integers (varints)
----------------------------------
A varint is an efficiently encoded integer (small values take fewer bytes). It
is the same as [protocol buffer varints][varint], except the largest possible
value is `u64`. Varints compress small integers well and are used for group
memberships.
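A sketch of a decoder for this format (7 data bits per byte, high bit set on
all but the last byte), shown in C only to describe the encoding:

```
#include <stddef.h>
#include <stdint.h>

/* Decodes one varint starting at p, returns the number of bytes consumed
 * (at most 10 for a u64), or 0 if the input is truncated before `limit`. */
static size_t varint_decode(const uint8_t *p, const uint8_t *limit, uint64_t *out) {
    uint64_t value = 0;
    for (size_t i = 0; i < 10 && p + i < limit; i++) {
        value |= (uint64_t)(p[i] & 0x7f) << (7 * i);
        if ((p[i] & 0x80) == 0) {   /* high bit clear: last byte */
            *out = value;
            return i + 1;
        }
    }
    return 0;
}
```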
Group memberships
-----------------
There are two kinds of group membership lookups at play:
1. Given a group (gid/name), resolve the members' names (e.g. `getgrgid`).
2. Given a username, resolve user's group gids (for `initgroups(3)`).
When a group's memberships are resolved in (1), the same call also requires other
group information: gid and group name. Therefore it makes sense to store a
pointer to the group members in the group information itself. However, the
memberships are not *always* necessary (see remarks about `id(1)`), therefore
the memberships will be stored separately, outside of the groups section.
Similarly, when a user's groups are resolved in (2), they are not always necessary
(i.e. not part of `struct passwd*`), therefore the memberships themselves are
stored out of band.
`groupmembers` and `additional_gids` store group and user memberships
respectively. Membership IDs are packed: they do not need random access, so they
are suitable for compression.
- `groupmembers` consists of a number X followed by a list of offsets to User
records, because `getgr*` returns pointers to member names, thus a name has to
be immediately resolvable.
- `additional_gids` is a list of gids, because `initgroups_dyn` (and friends)
returns an array of gids.
Each entry of `groupmembers` and `additional_gids` starts with a varint N,
which is the number of upcoming elements. Then N delta-compressed varints,
which are:
- **additional_gids** a list of gids.
- **groupmembers** byte-offsets to the User records in the `users` section.
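A sketch of reading one such entry, reusing the `varint_decode` helper sketched
above. That the deltas accumulate onto the previous value (starting from zero)
is an assumption about the delta encoding:

```
#include <stdint.h>
#include <stdlib.h>

/* Reads "varint N, then N delta-compressed varints" and returns the
 * reconstructed absolute values (caller frees). Returns NULL on error. */
static uint64_t *read_member_list(const uint8_t *p, const uint8_t *limit,
                                  uint64_t *count) {
    uint64_t n;
    size_t used = varint_decode(p, limit, &n);
    if (used == 0) return NULL;
    p += used;

    uint64_t *vals = calloc(n, sizeof(*vals));
    if (!vals) return NULL;

    uint64_t prev = 0;
    for (uint64_t i = 0; i < n; i++) {
        uint64_t delta;
        used = varint_decode(p, limit, &delta);
        if (used == 0) { free(vals); return NULL; }
        p += used;
        prev += delta;   /* delta-compression: each value adds to the last   */
        vals[i] = prev;  /* a gid (additional_gids) or a byte offset
                          * into `users` (groupmembers)                      */
    }
    *count = n;
    return vals;
}
```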
Indices
-------
Now that we've sketched the implementation of `id(1)`, it's easier to see
which operations need to be fast; in order of importance:
1. lookup gid -> group info (this is on hot path in id) without members.
2. lookup username -> user's groups.
3. lookup uid -> user.
4. lookup groupname -> group.
5. lookup username -> user.
These indices can use perfect hashing like [bdz from cmph][cmph]: a perfect
hash hashes a list of bytes to a sequential list of integers. Perfect hashing
algorithms require some space, and take some time to calculate ("hashing
duration"). I've tested BDZ, which hashes `[][]u8` to a sequential list of
integers (not preserving order), and CHM, which preserves order. BDZ accepts an
optional argument `3 <= b <= 10`.
* The BDZ algorithm requires 900KB (b=3), 338KB (b=7) or 306KB (b=10) for 1M values.
* Latency to resolve 1M keys: 170ms, 180ms and 230ms, respectively.
* Packed vs non-packed latency differences are not meaningful.
CHM retains order, however, 1M keys weigh 8MB. 10k keys are ~20x larger with
CHM than with BDZ, eliminating the benefit of preserved ordering: we can just
have a separate index.
None of the tested perfect hashing algorithms makes the distinction between
existing (in the initial dictionary) and new keys. In other words, HASH(value)
will point to a number `n ∈ [0,N-1]`, regardless of whether the value was in
the initial dictionary. Therefore one must always confirm, after calculating
the hash, that the key matches what's been hashed.
`idx_*` sections are of type `[]u32` and map `hash(key)` to the respective
`Groups` and `Users` entries (as offsets from the beginning of the respective
section). Since User and Group records are 8-byte aligned, the stored value is
the byte offset right-shifted by 3 bits; shift it left by 3 (multiply by 8) to
get the actual offset to the record.
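A sketch of the hottest lookup, gid to group, combining the perfect hash, the
`idx_gid2group` array and the verification step; `bdz_hash_gid` and
`group_record_gid` are hypothetical helpers, not cmph's or turbonss's real API:

```
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers, declared only so the sketch is complete. */
uint32_t bdz_hash_gid(const void *bdz_gid_section, uint32_t gid);
uint32_t group_record_gid(const uint8_t *group_record);

/* idx_gid2group[hash] holds (byte offset into `groups`) >> 3. */
static const uint8_t *find_group_by_gid(const uint32_t *idx_gid2group,
                                        const void *bdz_gid_section,
                                        const uint8_t *groups_section,
                                        uint32_t gid) {
    uint32_t n = bdz_hash_gid(bdz_gid_section, gid);
    const uint8_t *rec = groups_section + ((size_t)idx_gid2group[n] << 3);
    /* The perfect hash also maps unknown gids somewhere in [0,N): verify. */
    if (group_record_gid(rec) != gid)
        return NULL;
    return rec;
}
```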
Database file structure
-----------------------
Each section is padded to a 64-byte boundary.
```
SECTION SIZE DESCRIPTION
header 128 see "Turbonss header" section
bdz_gid ? bdz(gid)
bdz_groupname ? bdz(groupname)
bdz_uid ? bdz(uid)
bdz_username ? bdz(username)
idx_gid2group len(group)*4 bdz->offset Groups
idx_groupname2group len(group)*4 bdz->offset Groups
idx_uid2user len(user)*4 bdz->offset Users
idx_name2user len(user)*4 bdz->offset Users
shell_index len(shells)*2 shell index array
shell_blob <= 65280 shell data blob (max 255*256 bytes)
groups ? packed Group entries (8b padding)
users ? packed User entries (8b padding)
groupmembers ? per-group delta varint memberlist (no padding)
additional_gids ? per-user delta varint gidlist (no padding)
```
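A sketch of how a reader could derive every section offset from the header
alone, assuming a "block" is 64 bytes and the sections appear exactly in the
table's order (this is consistent with the `turbonss-analyze` output shown
earlier, but it is an inference, not the project's code; `struct
turbonss_header` is the mirror sketched in the header section):

```
#include <stdint.h>

static uint64_t pad64(uint64_t n) { return (n + 63) & ~(uint64_t)63; }

/* Byte offsets of each section from the start of the file. */
struct offsets {
    uint64_t bdz_gid, bdz_groupname, bdz_uid, bdz_username;
    uint64_t idx_gid2group, idx_groupname2group, idx_uid2user, idx_name2user;
    uint64_t shell_index, shell_blob, groups, users, groupmembers, additional_gids;
};

static void compute_offsets(const struct turbonss_header *h, struct offsets *o) {
    uint64_t at = 128;                                   /* header */
    o->bdz_gid = at;             at += h->nblocks_bdz_gid * 64ULL;
    o->bdz_groupname = at;       at += h->nblocks_bdz_groupname * 64ULL;
    o->bdz_uid = at;             at += h->nblocks_bdz_uid * 64ULL;
    o->bdz_username = at;        at += h->nblocks_bdz_username * 64ULL;
    o->idx_gid2group = at;       at += pad64(h->num_groups * 4ULL);
    o->idx_groupname2group = at; at += pad64(h->num_groups * 4ULL);
    o->idx_uid2user = at;        at += pad64(h->num_users * 4ULL);
    o->idx_name2user = at;       at += pad64(h->num_users * 4ULL);
    o->shell_index = at;         at += pad64(h->num_shells * 2ULL);
    o->shell_blob = at;          at += h->nblocks_shell_blob * 64ULL;
    o->groups = at;              at += h->nblocks_groups * 64ULL;
    o->users = at;               at += h->nblocks_users * 64ULL;
    o->groupmembers = at;        at += h->nblocks_groupmembers * 64ULL;
    o->additional_gids = at;
}
```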
[cmph]: http://cmph.sourceforge.net/
[id]: https://linux.die.net/man/1/id
[data-oriented-design]: https://media.handmade-seattle.com/practical-data-oriented-design/
[getpwnam_r]: https://linux.die.net/man/3/getpwnam_r
[varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints
[getpwent]: https://www.man7.org/linux/man-pages/man3/getpwent_r.3.html
[getgrouplist]: https://www.man7.org/linux/man-pages/man3/getgrouplist.3.html
[getgrid]: https://www.man7.org/linux/man-pages/man3/getgrgid_r.3.html

39 docs/development.md Normal file

@ -0,0 +1,39 @@
Profiling
---------
Prepare `perf.data`:
```
zig build -Drelease-small=true && \
perf record --call-graph=dwarf \
zig-out/bin/turbonss-unix2db --passwd passwd --group group
```
Perf interactive:
```
perf report -i perf.data
```
Flame graph:
```
perf script | inferno-collapse-perf | inferno-flamegraph > profile.svg
```
For v2
------
These are desired for the next DB format:
- Compress strings with fsst.
- Trim first 4 bytes from the cmph headers.
Dependencies
------------
This project uses [git subtrac][git-subtrac] for managing dependencies. They
work just like regular submodules, except all the refs of the submodules are in
this repository. Repeat after me: all the submodules are in this repository.
So if you have a copy of this repo, dependencies will not disappear.
[git-subtrac]: https://apenwarr.ca/log/20191109