update README

add lto and fpic
learning about linkers. Thanks, Drepper
2022-11-30 12:04:12 +02:00 · 2022-11-30 11:56:02 +02:00 · 2022-11-25 14:50:42 +02:00 · 2022-11-20 13:33:05 +02:00 · 2022-08-23 15:50:00 +03:00 · 2022-08-23 15:49:33 +03:00
4 changed files with 520 additions and 410 deletions
--- a/README.md
+++ b/README.md
@@ -1,442 +1,170 @@
 Turbo NSS
 ---------

-Turbonss is a plugin for GNU Name Service Switch (NSS) functionality of GNU C
-Library (glibc). Turbonss implements lookup for `user` and `passwd` database
-entries (i.e. system users, groups, and group memberships). It's main goal is
-performance, with focus on making [`id(1)`][id] run as fast as possible.
+Turbonss is a plugin for the GNU Name Service Switch ([NSS][nsswitch])
+functionality of the GNU C Library (glibc). Turbonss implements lookup for
+`passwd` and `group` database entries (i.e. system users, groups, and group
+memberships). Its main goal is to run [`id(1)`][id] as fast as possible.

 Turbonss is optimized for reading. If the data changes in any way, the whole
-file will need to be regenerated (and tooling only supports only full
-generation). It was created, and best suited, for environments that have a
-central user & group database which then needs to be distributed to many
-servers/services, and the data does not change very often (e.g. hourly).
+file will need to be regenerated. It was therefore created for, and is best
+suited to, environments that have a central user & group database which then
+needs to be distributed to many servers/services, and where the data does not
+change very often (e.g. hourly).

-To understand more about name service switch, start with
-[`nsswitch.conf(5)`][nsswitch].
+This is the fastest known NSS passwd/group implementation for *reads*. On a
+corpus with 10k users, 10k groups and an average of 500 members per group, `id`
+takes 17 seconds with the glibc default implementation, 10-17 milliseconds with
+a pre-cached `nscd`, and ~8 milliseconds with `turbonss`.

-Design & constraints
--------------------
+Project status
+--------------

-To be fast, the user/group database (later: DB) has to be small
-([background][data-oriented-design]). It encodes user & group information in a
-way that minimizes the DB size, and reduces jumping across the DB ("chasing
-pointers and thrashing CPU cache").
+The project is finished, but was never used in, or recommended for,
+production. If you are considering using turbonss, try nscd first. Turbonss is
+only 2-5 times faster than pre-warmed nscd, which usually does not matter
+enough to jump through the hoops of using a nonstandard NSS library in the
+first place.

-To understand how this is done efficiently, let's analyze the
-[`getpwnam_r(3)`][getpwnam_r] in high level. This API call accepts a username
-and returns the following user information:
-
-```
-struct passwd {
-   char   *pw_name;       /* username */
-   char   *pw_passwd;     /* user password */
-   uid_t   pw_uid;        /* user ID */
-   gid_t   pw_gid;        /* group ID */
-   char   *pw_gecos;      /* user information */
-   char   *pw_dir;        /* home directory */
-   char   *pw_shell;      /* shell program */
-};
-```
-
-Turbonss, among others, implements this call, and takes the following steps to
-resolve a username to a `struct passwd*`:
-
- Open the DB (using `mmap`) and interpret it's first 64 bytes as a `*struct
-  Header`. The header stores offsets to the sections of the file. This needs to
-  be done once, when the NSS library is loaded.
- Hash the username using a perfect hash function. Perfect hash function
-  returns a number `n ∈ [0,N-1]`, where N is the total number of users.
- Jump to the `n`'th location in the `idx_name2user` section, which contains
-  the index `i` to the user's information.
- Jump to the location `i` of section `Users`, which stores the full user
-  information.
- Decode the user information (which is all in a continuous memory block) and
-  return it to the caller.
-
-In total, that's one hash for the username (~150ns), two pointer jumps within
-the group file (to sections `idx_name2user` and `Users`), and, now that the
-user record is found, `memcpy` for each field.
-
-The turbonss DB file is be `mmap`-ed, making it simple to jump across the file
-using pointer arithmetic. This also reduces memory usage, as the mmap'ed
-regions are shared. Turbonss reads do not consume any heap space.
-
-Tight packing places some constraints on the underlying data:
-
- Permitted length of username and groupname: 1-32 bytes.
- Permitted length of shell and home: 1-256 bytes.
- Permitted comment ("gecos") length: 0-255 bytes.
- User name, groupname, gecos and shell must be utf8-encoded.
- User and Groups sections are up to 2^35B (~34GB) large. Assuming an "average"
-  user record takes 50 bytes, this section would fit ~660M users. The
-  worst-case upper bound is left as an exercise to the reader.
-
-Sorting is stable. In v0:
- Groups are sorted by gid, ascending.
- Users are sorted by their name, ascending by the unicode codepoints
-  (locale-independent).
-
-Checking out and building
-------------------------
-
-```
-$ git clone --recursive https://git.sr.ht/~motiejus/turbonss
-```
-
-Alternatively, if you forgot `--recursive`:
-
-```
-$ git submodule update --init
-```
-
-And run tests:
-
-```
-$ zig build test
-```
-
-Test the so
-----------
-
-Build:
-
-    zig build -Dtarget=x86_64-linux-gnu.2.31 -Dcpu=x86_64_v3 -Drelease-fast=true -Dstrip=true
-
-Generate `db.turbo`:
-
-    zig-out/bin/turbonss-unix2db --passwd /etc/passwd --group /etc/group
-    zig-out/bin/turbonss-analyze db.turbo
-    <...>
-
-Run a test container:
-
-    $ docker run -ti --rm --privileged -v `pwd`:/etc/turbonss -w /etc/turbonss debian:bullseye
-    # cp zig-out/lib/libnss_turbo.so.2 /lib/x86_64-linux-gnu
-    # sed -i 's/\(\(passwd\|group\).*files\)$/\1 turbo/' /etc/nsswitch.conf
-
-And knock yourself out:
-
-    getent passwd
-    getent group
-    id root
-
-This is probably not very interesting; you may want to take a larger corpus of
-/etc/passwd and /etc/group for more interesting results.
+Yours truly worked on this for about 7 months. This was also my first Zig
+project, which I never went back to clean up (nor really needed to).

 Dependencies
 ------------

-This project uses [git subtrac][git-subtrac] for managing dependencies. They
-work just like regular submodules, except all the refs of the submodules are in
-this repository. Repeat after me: all the submodules are in this repository.
-So if you have a copy of this repo, dependencies will not disappear.
+1. zig v0.10. turbonss is implemented in stage1, so it will not work with zig
+   v0.11+.
+2. [cmph][cmph]: bundled with this repository.

-remarks on `id(1)`
------------------
+Trying it out
+-------------

-A known implementation runs id(1) at ~250 rps sequentially on ~20k users and
-~10k groups. Our rps target is much higher.
+Clone, compile and test first:

-To better reason about the trade-offs, it is useful to understand how `id(1)`
-is implemented, in rough terms:
- lookup user by name ([`getpwent_r(3)`][getpwent]).
- get all gids for the user ([`getgrouplist(3)`][getgrouplist]). Note: it is
-  actually using `initgroups_dyn`, accepts a uid, and is very poorly
-  documented.
- for each additional gid, get the `struct group*`
-  ([`getgrgid_r(3)`][getgrgid_r]).
+    $ git clone --recursive https://git.sr.ht/~motiejus/turbonss
+    $ zig build test
+    $ zig build -Dtarget=x86_64-linux-gnu.2.16 -Dcpu=baseline -Drelease-safe=true

-Assuming a member is in ~100 groups on average, to reach 10k id/s translates to
-1M group lookups per second. We need to convert gid to a group index, and group
-index to a group gid/name quickly.
+One may choose different options, depending on requirements. Here are some
+hints:

-Caveat: `struct group` contains an array of pointers to names of group members
-(`char **gr_mem`). However, `id` does not use that information, resulting in
-read amplification, sometimes by 10-100x. Therefore, if `argv[0] == "id"`, our
-implementation of [`getgrid_r(3)`][getgrid] returns the `struct group*` without
-the members. This speeds up `id` by about 10x on a known NSS implementation.
+1. `-Dcpu=<...>` for the CPU
+   [microarchitecture](https://en.wikipedia.org/wiki/X86-64#Microarchitecture_levels).
+2. `-Drelease-fast=true` for max speed.
+3. `-Drelease-small=true` for smallest binary sizes.
+4. `-Dstrip=true` to strip debug symbols.

-Relatedly, because [`getgrid_r(3)`][getgrid] does not need the group members,
-the group members are stored in a different DB section, reducing the `Groups`
-section and making more of it fit the CPU caches.
+For reference, the sizes of the shared library and helper binaries when
+compiled with `-Dstrip=true -Drelease-small=true`:

-Turbonss header
---------------
+     17K Nov 30 11:53 turbonss-analyze
+     16K Nov 30 11:53 turbonss-getent
+     17K Nov 30 11:53 turbonss-makecorpus
+    166K Nov 30 11:53 turbonss-unix2db
+     22K Nov 30 11:53 libnss_turbo.so.2.0.0

-The turbonss header looks like this:
+Many thanks to Ulrich Drepper for [teaching how to link it properly][dso].

-```
-OFFSET     TYPE     NAME                      DESCRIPTION
-   0      [4]u8     magic                     f0 9f a4 b7
-   4         u8     version                   0
-   5         u8     endian                    0 for little, 1 for big
-   6         u8     nblocks_shell_blob        max value: 63
-   7         u8     num_shells                max value: 63
-   8        u32     num_groups                number of group entries
-  12        u32     num_users                 number of passwd entries
-  16        u32     nblocks_bdz_gid           bdz_gid section block count
-  20        u32     nblocks_bdz_groupname
-  24        u32     nblocks_bdz_uid
-  28        u32     nblocks_bdz_username
-  32        u64     nblocks_groups
-  40        u64     nblocks_users
-  48        u64     nblocks_groupmembers
-  56        u64     nblocks_additional_gids
-  64        u64     getgr_bufsize
-  72        u64     getpw_bufsize
-  80     [48]u8     padding
-```
+Test turbonss on a real system
+------------------------------

-`magic` is 0xf09fa4b7, and `version` must be `0`. All integers are
-native-endian. `nblocks_*` is the count of blocks of a particular section; this
-helps calculate the offsets to all sections.
+`db.turbo` is the TurboNSS database file. To create one from `/etc/group` and
+`/etc/passwd`, use `turbonss-unix2db`:

-Some numbers, like `nblocks_shell_blob`, `num_shells`, would fit to smaller
-number of bytes. However, interpreting `[2]u6` with `xxd(1)` is harder than
-interpreting `[2]u8`. Therefore we are using the space we have to make these
-integers byte-wide.
+    $ zig-out/bin/turbonss-unix2db --passwd /etc/passwd --group /etc/group
+    $ zig-out/bin/turbonss-analyze db.turbo
+    File:               /etc/turbonss/db.turbo
+    Size:               2,624 bytes
+    Version:            0
+    Endian:             little
+    Pointer size:       8 bytes
+    getgr buffer size:  17
+    getpw buffer size:  74
+    Users:              19
+    Groups:             39
+    Shells:             1
+    Most memberships:   _apt (1)
+    Sections:
+        Name                 Begin    End          Size bytes
+        header               00000000 00000080            128
+        bdz_gid              00000080 000000c0             64
+        bdz_groupname        000000c0 00000100             64
+        bdz_uid              00000100 00000140             64
+        bdz_username         00000140 00000180             64
+        idx_gid2group        00000180 00000240            192
+        idx_groupname2group  00000240 00000300            192
+        idx_uid2user         00000300 00000380            128
+        idx_name2user        00000380 00000400            128
+        shell_index          00000400 00000440             64
+        shell_blob           00000440 00000480             64
+        groups               00000480 00000700            640
+        users                00000700 000009c0            704
+        groupmembers         000009c0 00000a00             64
+        additional_gids      00000a00 00000a40             64

-`getgr_bufsize` and `getpw_bufsize` is a hint for the caller of `getgr*` and
-`getpw*`-family calls. This is the recommended size of the buffer, so the
-caller does not receive `ENOMEM`.
+Run and configure a test container that uses `turbonss` instead of the default
+`files`:

-Primitive types
---------------
+    $ docker run -ti --rm -v `pwd`:/etc/turbonss -w /etc/turbonss debian:bullseye
+    # cp zig-out/lib/libnss_turbo.so.2 /lib/x86_64-linux-gnu/
+    # sed -i '/passwd\|group/ s/files/turbo/' /etc/nsswitch.conf

-`User` and `Group` entries are sorted by the order they were received in the input
-file. All entries are aligned to 8 bytes. All `User` and `Group` entries are
-referred by their byte offset in the `Users` and `Groups` section relative to
-the beginning of the section.
+And run the commands:

-```
-const PackedGroup = packed struct {
-    gid: u32,
-    padding: u3,
-    groupname_len: u5,
-}
-```
+    $ getent passwd
+    $ getent group
+    $ id root

-PackedGroup is followed by the group name (of length `groupname_len`), followed
-by a varint-compressed offset to the groupmembers section, followed by 8b padding.
+More users and groups
+---------------------

-PackedUser is a bit more involved:
+`turbonss-makecorpus` can synthesize more `users` and `groups`:

-```
-pub const PackedUser = packed struct {
-    uid: u32,
-    gid: u32,
-    shell_len_or_idx: u8,
-    shell_here: bool,
-    name_is_a_suffix: bool,
-    home_len: u6,
-    name_len: u5,
-    gecos_len: u11,
-}
-```
+    # ./zig-out/bin/turbonss-makecorpus 
+    wrote users=10000 groups=10000 avg-members=1000 to .
+    # cat group >> /etc/group
+    # cat passwd >> /etc/passwd
+    # time id u_1000000
+    <...>
+    real    0m17.380s
+    user    0m13.117s
+    sys     0m4.263s

-... followed by `userdata: []u8`:
- home.
- name (optional).
- gecos.
- shell (optional).
- `additional_gids_offset`: varint.
+17 seconds for an `id` command! Well, there are indeed many users and groups.
+Let's see how turbonss fares with it:

-First byte of home is stored right after the `gecos_len` field, and its length
-is `home_len`. The same logic applies to all the `stringdata` fields: there is
-a way to calculate their relative position from the length of the fields before
-them.
+    # zig-out/bin/turbonss-unix2db --group /etc/group --passwd /etc/passwd
+    total 10968512 bytes. groups=10019 users=10039
+    # ls -hs /etc/group /etc/passwd db.turbo
+    48M /etc/group  668K /etc/passwd   11M db.turbo
+    # sed -i '/passwd\|group/ s/files/turbo/' /etc/nsswitch.conf
+    # time id u_1000000
+    real    0m0.008s
+    user    0m0.000s
+    sys     0m0.008s

-PackedUser employs two data-oriented compression techniques:
- shells are often shared across different users, see the "Shells" section.
- `name` is frequently a suffix of `home`. For example, `/home/vidmantas` and
-  `vidmantas`. In this case storing both name and home is wasteful. Therefore
-  name has two options:
-  1. `name_is_a_suffix=true`: name is a suffix of the home dir. Then `name`
-  starts at the `home_len - name_len`'th byte of `home`, and ends at the same
-  place as `home`.
-  2. `name_is_a_suffix=false`: name begins one byte after home, and it's length
-  is `name_len`.
+That's a roughly 2000x improvement for the `id` command (and notice the ~4x
+compression ratio compared to plain files). If the number of users and groups
+is increased by 10x (to 100k each), the difference becomes even crazier:

-The last field `additional_gids_offset: varint` points to the `additional_gids`
-section for this user.
+    # time id u_1000000
+    <...>
+    real    3m42.281s
+    user    2m30.482s
+    sys     0m55.840s
+    # sed -i '/passwd\|group/ s/files/turbo/' /etc/nsswitch.conf
+    # time id u_1000000
+    <...>
+    real    0m0.008s
+    user    0m0.000s
+    sys     0m0.008s

-Shells
------
+Documentation
+-------------

-Normally there is a limited number of separate shells even in huge user
-databases. A few examples: `/bin/bash`, `/usr/bin/nologin`, `/bin/zsh` among
-others. Therefore, "shells" have an optimization: they can be pointed by in the
-external list, or, if they are unique to the user, reside among the user's
-data.
+- Architecture is detailed in `docs/architecture.md`
+- Development notes are in `docs/development.md`

-255 most popular shells (i.e. referred to by at least two User entries) are
-stored externally in "Shells" area. The less popular ones are stored with
-userdata.
-
-Shells section consists of two sub-sections: the index and the blob. The index
-is an array of offsets: the i'th shell starts at `offsets[i]` byte, and ends at
-`offsets[i+1]` byte. If there is at least one shell in the shell section, the
-index contains a sentinel index as the last element, which signifies the position
-of the last byte of the shell blob.
-
-`shell_here=true` in the User struct means the shell is stored with userdata,
-and it's length is `shell_len_or_idx`. `shell_here=false` means it is stored in
-the `Shells` section, and it's index is `shell_len_or_idx` (and the actual
-string start and end offsets are resolved as described in the paragraph above).
-
-Variable-length integers (varints)
----------------------------------
-
-Varint is an efficiently encoded integer (packed for small values). Same as
-[protocol buffer varints][varint], except the largest possible value is `u64`.
-They compress integers well. Varints are stored for group memberships.
-
-Group memberships
-----------------
-
-There are two group memberships at play:
-
-1. Given a group (gid/name), resolve the members' names (e.g. `getgrgid`).
-2. Given a username, resolve user's group gids (for `initgroups(3)`).
-
-When group's memberships are resolved in (1), the same call also requires other
-group information: gid and group name. Therefore it makes sense to store a
-pointer to the group members in the group information itself. However, the
-memberships are not *always* necessary (see remarks about `id(1)`), therefore
-the memberships will be stored separately, outside of the groups section.
-
-Similarly, when user's groups are resolved in (2), they are not always necessary
-(i.e. not part of `struct user*`), therefore the memberships themselves are
-stored out of bound.
-
-`groupmembers` and `additional_gids` store group and user memberships
-respectively. Membership IDs are packed — not necessitating random access, thus
-suitable for compression.
-
- `groupmembers` consists of a number X followed by a list of offsets to User
-  records, because `getgr*` returns pointers to membernames, thus a name has to
-  be immediately resolvable.
- `additional_gids` is a list of gids, because `initgroups_dyn` (and friends)
-  returns an array of gids.
-
-Each entry of `groupmembers` and `additional_gids` starts with a varint N,
-which is the number of upcoming elements. Then N delta-compressed varints,
-which are:
-
- **additional_gids** a list of gids.
- **groupmembers** byte-offsets to the User records in the `users` section.
-
-Indices
-------
-
-Now that we've sketched the implementation of `id(3)`, it's clearer to
-understand which operations need to be fast; in order of importance:
-
-1. lookup gid -> group info (this is on hot path in id) without members.
-2. lookup username -> user's groups.
-3. lookup uid -> user.
-4. lookup groupname -> group.
-5. lookup username -> user.
-
-These indices can use perfect hashing like [bdz from cmph][cmph]: a perfect
-hash hashes a list of bytes to a sequential list of integers. Perfect hashing
-algorithms require some space, and take some time to calculate ("hashing
-duration"). I've tested BDZ, which hashes `[][]u8` to a sequential list of
-integers (not preserving order) and CHM, preserves order. BDZ accepts an
-optional argument `3 <= b <= 10`.
-
-* BDZ algorithm requires (b=3, 900KB, b=7, 338KB, b=10, 306KB) for 1M values.
-* Latency to resolve 1M keys: (170ms, 180ms, 230ms, respectively).
-* Packed vs non-packed latency differences are not meaningful.
-
-CHM retains order, however, 1M keys weigh 8MB. 10k keys are ~20x larger with
-CHM than with BDZ, eliminating the benefit of preserved ordering: we can just
-have a separate index.
-
-None of the tested perfect hashing algorithms makes the distinction between
-existing (in the initial dictionary) and new keys. In other words, HASH(value)
-will be pointing to a number `n ∈ [0,N-1]`, regardless whether the value was in
-the initial dictionary. Therefore one must always confirm, after calculating
-the hash, that the key matches what's been hashed.
-
-`idx_*` sections are of type `[]u32` and are pointing from `hash(key)` to the
-respective `Groups` and `Users` entries (from the beginning of the respective
-section). Since User and Group records are 8-byte aligned, the actual offset to
-the record is acquired by right-shifting this value by 3 bits.
-
-Database file structure
-----------------------
-
-Each section is padded to 64 bytes.
-
-```
-SECTION               SIZE             DESCRIPTION
-header                128              see "Turbonss header" section
-bdz_gid               ?                bdz(gid)
-bdz_groupname         ?                bdz(groupname)
-bdz_uid               ?                bdz(uid)
-bdz_username          ?                bdz(username)
-idx_gid2group         len(group)*4     bdz->offset Groups
-idx_groupname2group   len(group)*4     bdz->offset Groups
-idx_uid2user          len(user)*4      bdz->offset Users
-idx_name2user         len(user)*4      bdz->offset Users
-shell_index           len(shells)*2    shell index array
-shell_blob            <= 65280         shell data blob (max 255*256 bytes)
-groups                ?                packed Group entries (8b padding)
-users                 ?                packed User entries (8b padding)
-groupmembers          ?                per-group delta varint memberlist (no padding)
-additional_gids       ?                per-user delta varint gidlist (no padding)
-```
-
-Section creation order:
-
-1. ✅ `bdz_*`.
-1. ✅ `shell_index`, `shell_blob`.
-1. ✅ `additional_gids`.
-1. ✅ `users` requires `additional_gids` and shell.
-1. ✅ `groupmembers` requires `users`.
-1. ✅ `groups` requires `groupmembers`.
-1. ✅ `idx_*`. requires offsets to `groups` and `users`.
-1. ✅ Header.
-
-For v2
------
-
-These are desired for the next DB format:
- Compress strings with fsst.
- Trim first 4 bytes from the cmph headers.
-
-Profiling
---------
-
-Prepare `profile.data`:
-
-```
-zig build -Drelease-small=true && \
-    perf record --call-graph=dwarf \
-        zig-out/bin/turbonss-unix2db --passwd passwd2 --group group2
-```
-
-Perf interactive:
-
-```
-perf report -i perf.data
-```
-
-Flame graph:
-
-```
-perf script | inferno-collapse-perf | inferno-flamegraph > profile.svg
-```
-
-[git-subtrac]: https://apenwarr.ca/log/20191109
-[cmph]: http://cmph.sourceforge.net/
-[id]: https://linux.die.net/man/1/id
 [nsswitch]: https://linux.die.net/man/5/nsswitch.conf
-[data-oriented-design]: https://media.handmade-seattle.com/practical-data-oriented-design/
-[getpwnam_r]: https://linux.die.net/man/3/getpwnam_r
-[varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints
-[getpwent]: https://www.man7.org/linux/man-pages/man3/getpwent_r.3.html
-[getgrouplist]: https://www.man7.org/linux/man-pages/man3/getgrouplist.3.html
-[getgrid]: https://www.man7.org/linux/man-pages/man3/getgrid_r.3.html
+[id]: https://linux.die.net/man/1/id
+[cmph]: http://cmph.sourceforge.net/
+[dso]: https://akkadia.org/drepper/dsohowto.pdf
--- a/build.zig
+++ b/build.zig
@@ -5,6 +5,7 @@ const zbs = std.build;
 pub fn build(b: *zbs.Builder) void {
    const target = b.standardTargetOptions(.{});
    const mode = b.standardReleaseOptions();
+    b.use_stage1 = true;

    const strip = b.option(bool, "strip", "Omit debug information") orelse false;

@@ -42,9 +43,11 @@ pub fn build(b: *zbs.Builder) void {
        //"-DDEBUG",
    });
    cmph.strip = strip;
+    cmph.want_lto = true;
+    cmph.compress_debug_sections = .zlib;
    cmph.omit_frame_pointer = true;
-    cmph.addIncludeDir("deps/cmph/src");
-    cmph.addIncludeDir("include/deps/cmph");
+    cmph.addIncludePath("deps/cmph/src");
+    cmph.addIncludePath("include/deps/cmph");

    const bdz = b.addStaticLibrary("bdz", null);
    bdz.setTarget(target);
@@ -57,15 +60,20 @@ pub fn build(b: *zbs.Builder) void {
    }, &.{
        "-W",
        "-Wno-unused-function",
+        "-fvisibility=hidden",
+        "-fpic",
        //"-DDEBUG",
    });
    bdz.omit_frame_pointer = true;
-    bdz.addIncludeDir("deps/cmph/src");
-    bdz.addIncludeDir("include/deps/cmph");
+    bdz.addIncludePath("deps/cmph/src");
+    bdz.addIncludePath("include/deps/cmph");
+    bdz.want_lto = true;

    {
        const exe = b.addExecutable("turbonss-unix2db", "src/turbonss-unix2db.zig");
+        exe.compress_debug_sections = .zlib;
        exe.strip = strip;
+        exe.want_lto = true;
        exe.setTarget(target);
        exe.setBuildMode(mode);
        addCmphDeps(exe, cmph);
@@ -74,7 +82,9 @@ pub fn build(b: *zbs.Builder) void {

    {
        const exe = b.addExecutable("turbonss-analyze", "src/turbonss-analyze.zig");
+        exe.compress_debug_sections = .zlib;
        exe.strip = strip;
+        exe.want_lto = true;
        exe.setTarget(target);
        exe.setBuildMode(mode);
        exe.install();
@@ -82,7 +92,9 @@ pub fn build(b: *zbs.Builder) void {

    {
        const exe = b.addExecutable("turbonss-makecorpus", "src/turbonss-makecorpus.zig");
+        exe.compress_debug_sections = .zlib;
        exe.strip = strip;
+        exe.want_lto = true;
        exe.setTarget(target);
        exe.setBuildMode(mode);
        exe.install();
@@ -90,10 +102,12 @@ pub fn build(b: *zbs.Builder) void {

    {
        const exe = b.addExecutable("turbonss-getent", "src/turbonss-getent.zig");
+        exe.compress_debug_sections = .zlib;
        exe.strip = strip;
+        exe.want_lto = true;
        exe.linkLibC();
        exe.linkLibrary(bdz);
-        exe.addIncludeDir("deps/cmph/src");
+        exe.addIncludePath("deps/cmph/src");
        exe.setTarget(target);
        exe.setBuildMode(mode);
        exe.install();
@@ -107,10 +121,12 @@ pub fn build(b: *zbs.Builder) void {
                .patch = 0,
            },
        });
+        so.compress_debug_sections = .zlib;
        so.strip = strip;
+        so.want_lto = true;
        so.linkLibC();
        so.linkLibrary(bdz);
-        so.addIncludeDir("deps/cmph/src");
+        so.addIncludePath("deps/cmph/src");
        so.setTarget(target);
        so.setBuildMode(mode);
        so.install();
@@ -127,5 +143,5 @@ pub fn build(b: *zbs.Builder) void {
 fn addCmphDeps(exe: *zbs.LibExeObjStep, cmph: *zbs.LibExeObjStep) void {
    exe.linkLibC();
    exe.linkLibrary(cmph);
-    exe.addIncludeDir("deps/cmph/src");
+    exe.addIncludePath("deps/cmph/src");
 }
--- a/docs/architecture.md
+++ b/docs/architecture.md
@@ -0,0 +1,327 @@
+Design & constraints
+--------------------
+
+To be fast, the user/group database (later: DB) has to be small
+([background][data-oriented-design]). It encodes user & group information in a
+way that minimizes the DB size, and reduces jumping across the DB ("chasing
+pointers and thrashing CPU cache").
+
+To understand how this is done efficiently, let's analyze
+[`getpwnam_r(3)`][getpwnam_r] at a high level. This API call accepts a username
+and returns the following user information:
+
+```
+struct passwd {
+   char   *pw_name;       /* username */
+   char   *pw_passwd;     /* user password */
+   uid_t   pw_uid;        /* user ID */
+   gid_t   pw_gid;        /* group ID */
+   char   *pw_gecos;      /* user information */
+   char   *pw_dir;        /* home directory */
+   char   *pw_shell;      /* shell program */
+};
+```
+
+Turbonss implements this call, among others, and takes the following steps to
+resolve a username to a `struct passwd*`:
+
+- Open the DB (using `mmap`) and interpret its first 128 bytes as a `*struct
+  Header`. The header stores offsets to the sections of the file. This needs to
+  be done once, when the NSS library is loaded.
+- Hash the username using a perfect hash function. The perfect hash function
+  returns a number `n ∈ [0,N-1]`, where N is the total number of users.
+- Jump to the `n`'th location in the `idx_name2user` section, which contains
+  the index `i` to the user's information.
+- Jump to the location `i` of section `Users`, which stores the full user
+  information.
+- Decode the user information (which is all in a continuous memory block) and
+  return it to the caller.
+
+In total, that's one hash for the username (~150ns), two pointer jumps within
+the DB file (to sections `idx_name2user` and `Users`), and, now that the
+user record is found, a `memcpy` for each field.
+
+The turbonss DB file is `mmap`-ed, making it simple to jump across the file
+using pointer arithmetic. This also reduces memory usage, as the mmap'ed
+regions are shared. Turbonss reads do not consume any heap space.
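
As a rough illustration of the mapping step (this is not turbonss's actual Zig code), a read-only shared mapping in C could look like the sketch below. `map_db` and `unmap_db` are hypothetical helper names introduced here.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map a DB file read-only.
 * Returns NULL on failure; on success stores the file size in *len_out. */
const unsigned char *map_db(const char *path, size_t *len_out) {
    assert(path != NULL && len_out != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }

    /* PROT_READ + MAP_SHARED: pages come straight from the page cache,
     * so every process mapping the same file shares them. */
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd); /* the mapping remains valid after the descriptor is closed */
    if (p == MAP_FAILED) return NULL;

    *len_out = (size_t)st.st_size;
    return p;
}

/* Hypothetical helper: release a mapping obtained from map_db(). */
void unmap_db(const unsigned char *db, size_t len) {
    munmap((void *)db, len);
}
```

Once mapped, a lookup only does pointer arithmetic on the returned base address, so no heap allocation is needed on the read path.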
+
+Tight packing places some constraints on the underlying data:
+
+- Permitted length of username and groupname: 1-32 bytes.
+- Permitted length of shell and home: 1-256 bytes.
+- Permitted comment ("gecos") length: 0-255 bytes.
+- User name, groupname, gecos and shell must be utf8-encoded.
+- User and Groups sections are up to 2^35B (~34GB) large. Assuming an "average"
+  user record takes 50 bytes, this section would fit ~660M users. The
+  worst-case upper bound is left as an exercise to the reader.
+
+Sorting is stable. In v0:
+- Groups are sorted by gid, ascending.
+- Users are sorted by their name, ascending by the unicode codepoints
+  (locale-independent).
+
+Remarks on `id(1)`
+------------------
+
+A known implementation runs id(1) at ~250 rps sequentially on ~20k users and
+~10k groups. Our rps target is much higher.
+
+To better reason about the trade-offs, it is useful to understand how `id(1)`
+is implemented, in rough terms:
+- lookup user by name ([`getpwent_r(3)`][getpwent]).
+- get all gids for the user ([`getgrouplist(3)`][getgrouplist]). Note: it is
+  actually using `initgroups_dyn`, accepts a uid, and is very poorly
+  documented.
+- for each additional gid, get the `struct group*`
+  ([`getgrgid_r(3)`][getgrgid_r]).
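
The three steps above can be sketched from the caller's side roughly as follows. This is a simplified approximation, not the coreutils source: it uses the non-reentrant `getpwnam(3)` and `getgrgid(3)` for brevity, and `print_id` is a hypothetical name.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* exposes getgrouplist(3) on glibc */
#endif
#include <assert.h>
#include <grp.h>
#include <pwd.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Prints a line similar to id(1) output.
 * Returns the number of groups, or -1 if the user is unknown or on error. */
int print_id(const char *username) {
    assert(username != NULL);

    /* Step 1: look the user up by name. */
    struct passwd *pw = getpwnam(username);
    if (pw == NULL) return -1;
    uid_t uid = pw->pw_uid;
    gid_t gid = pw->pw_gid;

    /* Step 2: fetch all gids for the user, growing the array if needed. */
    int ngroups = 16;
    gid_t *gids = malloc((size_t)ngroups * sizeof *gids);
    if (gids == NULL) return -1;
    if (getgrouplist(username, gid, gids, &ngroups) == -1) {
        /* ngroups now holds the required element count. */
        gid_t *bigger = realloc(gids, (size_t)ngroups * sizeof *gids);
        if (bigger == NULL) { free(gids); return -1; }
        gids = bigger;
        if (getgrouplist(username, gid, gids, &ngroups) == -1) {
            free(gids);
            return -1;
        }
    }

    /* Step 3: resolve every gid to a group; only gr_name is used,
     * gr_mem (the member list) is never read. */
    printf("uid=%u gid=%u groups=", (unsigned)uid, (unsigned)gid);
    for (int i = 0; i < ngroups; i++) {
        struct group *gr = getgrgid(gids[i]);
        printf("%s%u(%s)", i ? "," : "", (unsigned)gids[i],
               gr ? gr->gr_name : "?");
    }
    putchar('\n');

    free(gids);
    return ngroups;
}
```

Step 3 runs once per group membership, which is why the gid-to-group lookup dominates the cost.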
+
+Assuming a user is in ~100 groups on average, reaching 10k `id` calls per
+second translates to 1M group lookups per second. We need to convert gid to a
+group index, and group index to a group gid/name quickly.
+
+Caveat: `struct group` contains an array of pointers to names of group members
+(`char **gr_mem`). However, `id` does not use that information, resulting in
+read amplification, sometimes by 10-100x. Therefore, if `argv[0] == "id"`, our
+implementation of [`getgrgid_r(3)`][getgrid] returns the `struct group*` without
+the members. This speeds up `id` by about 10x on a known NSS implementation.
+
+Relatedly, because [`getgrgid_r(3)`][getgrid] does not need the group members,
+the group members are stored in a different DB section, reducing the `Groups`
+section and making more of it fit the CPU caches.
+
+Turbonss header
+---------------
+
+The turbonss header looks like this:
+
+```
+OFFSET     TYPE     NAME                      DESCRIPTION
+   0      [4]u8     magic                     f0 9f a4 b7
+   4         u8     version                   0
+   5         u8     endian                    0 for little, 1 for big
+   6         u8     nblocks_shell_blob        max value: 63
+   7         u8     num_shells                max value: 63
+   8        u32     num_groups                number of group entries
+  12        u32     num_users                 number of passwd entries
+  16        u32     nblocks_bdz_gid           bdz_gid section block count
+  20        u32     nblocks_bdz_groupname
+  24        u32     nblocks_bdz_uid
+  28        u32     nblocks_bdz_username
+  32        u64     nblocks_groups
+  40        u64     nblocks_users
+  48        u64     nblocks_groupmembers
+  56        u64     nblocks_additional_gids
+  64        u64     getgr_bufsize
+  72        u64     getpw_bufsize
+  80     [48]u8     padding
+```
+
+`magic` is 0xf09fa4b7, and `version` must be `0`. All integers are
+native-endian. `nblocks_*` is the count of blocks of a particular section; this
+helps calculate the offsets to all sections.
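
The header table above can be read with a few lines of Python. This is a minimal sketch, not the project's reader: `parse_header` and `HEADER_FIELDS` are hypothetical names, and using the `endian` byte to pick the byte order (rather than rejecting a foreign-endian file) is an assumption.

```python
import struct

# Field names copied from the header table, in file order.
HEADER_FIELDS = (
    "magic", "version", "endian", "nblocks_shell_blob", "num_shells",
    "num_groups", "num_users",
    "nblocks_bdz_gid", "nblocks_bdz_groupname",
    "nblocks_bdz_uid", "nblocks_bdz_username",
    "nblocks_groups", "nblocks_users",
    "nblocks_groupmembers", "nblocks_additional_gids",
    "getgr_bufsize", "getpw_bufsize",
)
# [4]u8 magic, 4 x u8, 6 x u32, 6 x u64, 48 padding bytes = 128 bytes.
HEADER_LAYOUT = "4sBBBB" + "I" * 6 + "Q" * 6 + "48x"
MAGIC = bytes.fromhex("f09fa4b7")


def parse_header(buf: bytes) -> dict:
    """Decode the first 128 bytes of a DB file (hypothetical helper)."""
    if len(buf) < 128:
        raise ValueError("header must be 128 bytes")
    if buf[:4] != MAGIC:
        raise ValueError("bad magic")
    if buf[4] != 0:
        raise ValueError("unsupported version")
    # Assumption: byte 5 says how the multi-byte integers were written.
    order = {0: "<", 1: ">"}.get(buf[5])
    if order is None:
        raise ValueError("bad endian marker")
    values = struct.unpack(order + HEADER_LAYOUT, buf[:128])
    return dict(zip(HEADER_FIELDS, values))
```

The explicit `<`/`>` prefix disables `struct`'s alignment padding, so the format string maps one-to-one onto the offsets in the table.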
+
+Some numbers, like `nblocks_shell_blob` and `num_shells`, would fit in fewer
+bits. However, interpreting `[2]u6` with `xxd(1)` is harder than interpreting
+`[2]u8`. Therefore we are using the space we have to make these integers
+byte-wide.
+
+`getgr_bufsize` and `getpw_bufsize` are hints for the caller of `getgr*` and
+`getpw*`-family calls. They give the recommended buffer size, so the caller
+does not receive `ERANGE` because the buffer is too small.
+
+Primitive types
+---------------
+
+`User` and `Group` entries are stored in the order they were received in the
+input file. All entries are aligned to 8 bytes. All `User` and `Group` entries
+are referred to by their byte offset in the `Users` and `Groups` section,
+relative to the beginning of the section.
+
+```
+const PackedGroup = packed struct {
+    gid: u32,
+    padding: u3,
+    groupname_len: u5,
+}
+```
+
+PackedGroup is followed by the group name (of length `groupname_len`), followed
+by a varint-compressed offset to the groupmembers section, followed by padding
+to the next 8-byte boundary.
+
+PackedUser is a bit more involved:
+
+```
+pub const PackedUser = packed struct {
+    uid: u32,
+    gid: u32,
+    shell_len_or_idx: u8,
+    shell_here: bool,
+    name_is_a_suffix: bool,
+    home_len: u6,
+    name_len: u5,
+    gecos_len: u11,
+}
+```
+
+... followed by `userdata: []u8`:
+- home.
+- name (optional).
+- gecos.
+- shell (optional).
+- `additional_gids_offset`: varint.
+
+The first byte of home is stored right after the `gecos_len` field, and its
+length is `home_len`. The same logic applies to all the `userdata` fields:
+their relative position can be calculated from the lengths of the fields
+before them.
+
+PackedUser employs two data-oriented compression techniques:
+- shells are often shared across different users, see the "Shells" section.
+- `name` is frequently a suffix of `home`. For example, `/home/vidmantas` and
+  `vidmantas`. In this case storing both name and home is wasteful. Therefore
+  name has two options:
+  1. `name_is_a_suffix=true`: name is a suffix of the home dir. Then `name`
+  starts at the `home_len - name_len`'th byte of `home`, and ends at the same
+  place as `home`.
+  2. `name_is_a_suffix=false`: name begins one byte after home, and its length
+  is `name_len`.
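
The two `name` options can be illustrated as follows. This is a sketch under assumptions: `user_strings` is a hypothetical helper, and the `*_len` fields are taken to hold the literal byte lengths (the real encoding may bias them).

```python
def user_strings(userdata: bytes, home_len: int, name_len: int,
                 gecos_len: int, name_is_a_suffix: bool):
    """Return (home, name, gecos) sliced out of the userdata bytes."""
    home = userdata[:home_len]
    pos = home_len
    if name_is_a_suffix:
        # Name shares its bytes with the tail of home; nothing extra is stored.
        name = home[home_len - name_len:]
    else:
        # Name is stored on its own, immediately after home.
        name = userdata[pos:pos + name_len]
        pos += name_len
    gecos = userdata[pos:pos + gecos_len]
    return home, name, gecos
```

For `/home/vidmantas` the suffix form saves the 9 bytes of `vidmantas`; the optional shell and the `additional_gids_offset` varint would follow `gecos` and are left out here.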
+
+The last field `additional_gids_offset: varint` points to the `additional_gids`
+section for this user.
+
+Shells
+------
+
+Normally there is a limited number of distinct shells even in huge user
+databases. A few examples: `/bin/bash`, `/usr/bin/nologin`, `/bin/zsh` among
+others. Therefore, "shells" have an optimization: a shell is either referenced
+by its index in an external list or, if it is unique to the user, stored among
+the user's data.
+
+The 255 most popular shells (i.e. referred to by at least two User entries)
+are stored externally in the "Shells" area. The less popular ones are stored
+with userdata.
+
+The Shells section consists of two sub-sections: the index and the blob. The
+index is an array of offsets: the i'th shell starts at byte `offsets[i]` and
+ends at byte `offsets[i+1]`. If there is at least one shell in the shell
+section, the index contains a sentinel offset as the last element, which marks
+the end of the shell blob.
+
+`shell_here=true` in the User struct means the shell is stored with userdata,
+and its length is `shell_len_or_idx`. `shell_here=false` means it is stored in
+the `Shells` section, and its index is `shell_len_or_idx` (and the actual
+string start and end offsets are resolved as described in the paragraph above).
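
The shell resolution described above can be sketched like this. `build_shell_section` and `resolve_shell` are hypothetical names, and the sketch assumes `offsets[i+1]` is an exclusive end (one past the shell's last byte).

```python
def build_shell_section(shells):
    """Return (offsets, blob) for a list of popular shells."""
    offsets, blob = [], b""
    for shell in shells:
        offsets.append(len(blob))
        blob += shell
    if shells:
        offsets.append(len(blob))  # sentinel: end of the blob
    return offsets, blob


def resolve_shell(shell_here, shell_len_or_idx, inline_shell, offsets, blob):
    """Pick the shell from userdata or from the shared Shells section."""
    if shell_here:
        # The field is a length; the bytes live with the user's own data.
        return inline_shell[:shell_len_or_idx]
    # The field is an index into the offsets array.
    start = offsets[shell_len_or_idx]
    end = offsets[shell_len_or_idx + 1]
    return blob[start:end]
```

With the sentinel in place, the last shell needs no special case: every index `i` has a valid `offsets[i+1]`.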
+
+Variable-length integers (varints)
+----------------------------------
+
+A varint is a compactly encoded integer: small values take fewer bytes. The
+encoding is the same as [protocol buffer varints][varint], with `u64` as the
+largest possible value. Varints are used to store group memberships.
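
For illustration, here is a small sketch of the protocol-buffer-style varint encoding referred to above: 7 payload bits per byte, least significant group first, with the high bit set on every byte except the last. The function names are hypothetical.

```python
def encode_varint(value: int) -> bytes:
    """Encode an unsigned integer below 2**64 as a varint."""
    if not 0 <= value < 2**64:
        raise ValueError("value does not fit in u64")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes, pos: int = 0):
    """Decode one varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7
        if shift > 63:
            raise ValueError("varint longer than 10 bytes")
```

Values below 128 take a single byte, and the largest `u64` takes 10.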
+
+Group memberships
+-----------------
+
+There are two kinds of group membership lookups at play:
+
+1. Given a group (gid/name), resolve the members' names (e.g. `getgrgid`).
+2. Given a username, resolve user's group gids (for `initgroups(3)`).
+
+When a group's memberships are resolved in (1), the same call also requires
+other group information: gid and group name. Therefore it makes sense to store
+a pointer to the group members in the group information itself. However, the
+memberships are not *always* necessary (see remarks about `id(1)`), therefore
+the memberships are stored separately, outside of the groups section.
+
+Similarly, a user's groups, resolved in (2), are not always necessary (i.e.
+they are not part of `struct passwd*`), therefore the memberships themselves
+are stored out of band.
+
+`groupmembers` and `additional_gids` store group and user memberships
+respectively. Membership IDs are packed: they do not need random access, which
+makes them suitable for compression.
+
+- `groupmembers` consists of a number X followed by a list of offsets to User
+  records, because `getgr*` returns pointers to membernames, thus a name has to
+  be immediately resolvable.
+- `additional_gids` is a list of gids, because `initgroups_dyn` (and friends)
+  returns an array of gids.
+
+Each entry of `groupmembers` and `additional_gids` starts with a varint N,
+which is the number of upcoming elements. Then N delta-compressed varints,
+which are:
+
+- **additional_gids** a list of gids.
+- **groupmembers** byte-offsets to the User records in the `users` section.
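
A membership entry (count N, then N delta-compressed varints) could be encoded roughly as below. This is a sketch under assumptions not confirmed by the text: values are sorted ascending, the first delta is taken against 0, and deltas are stored as-is. All function names are hypothetical.

```python
def put_varint(out: bytearray, value: int) -> None:
    """Append value as a protobuf-style varint."""
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)


def get_varint(buf: bytes, pos: int):
    """Read one varint at pos; return (value, next_pos)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_list(values) -> bytes:
    """Encode gids (or user offsets) as N followed by N deltas."""
    values = sorted(values)
    out = bytearray()
    put_varint(out, len(values))
    prev = 0
    for value in values:
        put_varint(out, value - prev)  # small gaps become 1-byte varints
        prev = value
    return bytes(out)


def decode_list(buf: bytes, pos: int = 0):
    """Decode one entry; return (values, next_pos)."""
    count, pos = get_varint(buf, pos)
    values, prev = [], 0
    for _ in range(count):
        delta, pos = get_varint(buf, pos)
        prev += delta
        values.append(prev)
    return values, pos
```

Three neighbouring gids such as 1000, 1001 and 1005 take 5 bytes this way, instead of 12 bytes as raw `u32` values.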
+
+Indices
+-------
+
+Now that we've sketched the implementation of `id(1)`, it is easier to see
+which operations need to be fast; in order of importance:
+
+1. lookup gid -> group info (this is on hot path in id) without members.
+2. lookup username -> user's groups.
+3. lookup uid -> user.
+4. lookup groupname -> group.
+5. lookup username -> user.
+
+These indices can use perfect hashing like [bdz from cmph][cmph]: a perfect
+hash maps a list of byte strings to a sequential list of integers. Perfect
+hashing algorithms require some space, and take some time to calculate
+("hashing duration"). I've tested BDZ, which hashes `[][]u8` to a sequential
+list of integers (not preserving order), and CHM, which preserves order. BDZ
+accepts an optional argument `3 <= b <= 10`.
+
+* The BDZ algorithm requires 900KB (b=3), 338KB (b=7) or 306KB (b=10) for 1M
+  values.
+* Latency to resolve 1M keys: 170ms, 180ms and 230ms, respectively.
+* Packed vs non-packed latency differences are not meaningful.
+
+CHM retains order; however, 1M keys weigh 8MB. 10k keys are ~20x larger with
+CHM than with BDZ, eliminating the benefit of preserved ordering: we can just
+have a separate index.
+
+None of the tested perfect hashing algorithms distinguishes between existing
+keys (those in the initial dictionary) and new keys. In other words,
+HASH(value) will point to a number `n ∈ [0,N-1]`, regardless of whether the
+value was in the initial dictionary. Therefore one must always confirm, after
+calculating the hash, that the key matches what's been hashed.
+
+`idx_*` sections are of type `[]u32` and point from `hash(key)` to the
+respective `Groups` and `Users` entries (counted from the beginning of the
+respective section). Since User and Group records are 8-byte aligned, each
+index entry stores the record's byte offset divided by 8; the actual offset is
+recovered by left-shifting the stored value by 3 bits.
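
The lookup path (hash the key, follow the index, then verify the key) can be sketched as follows. The hash here is a toy stand-in, not BDZ: it is perfect on the known keys and falls back to `crc32` for anything else, which mimics the "unknown keys still land in `[0,N-1]`" behaviour. The dict standing in for the `Groups` section and all names are hypothetical.

```python
import zlib


def make_toy_perfect_hash(keys):
    """Stand-in for bdz: known keys get 0..N-1, unknown keys get *some* slot."""
    slots = {key: i for i, key in enumerate(keys)}
    n = len(keys)
    return lambda key: slots.get(key, zlib.crc32(key) % n)


def lookup(key, hash_fn, idx, section):
    """Resolve key to its record, or None if the key is not in the DB."""
    # Index entries hold offset / 8 because records are 8-byte aligned.
    offset = idx[hash_fn(key)] << 3
    stored_key, record = section[offset]
    # The hash cannot reject unknown keys, so compare against the stored key.
    if stored_key != key:
        return None
    return record


# Three group records at 8-byte-aligned offsets 0, 16 and 40.
section = {
    0: (b"root", {"gid": 0}),
    16: (b"wheel", {"gid": 10}),
    40: (b"audio", {"gid": 63}),
}
hash_fn = make_toy_perfect_hash([b"root", b"wheel", b"audio"])
idx = [0 >> 3, 16 >> 3, 40 >> 3]
```

Without the final comparison, a lookup for a name that was never in the input would return whichever record its hash happens to point at.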
+
+Database file structure
+-----------------------
+
+Each section is padded to 64 bytes.
+
+```
+SECTION               SIZE             DESCRIPTION
+header                128              see "Turbonss header" section
+bdz_gid               ?                bdz(gid)
+bdz_groupname         ?                bdz(groupname)
+bdz_uid               ?                bdz(uid)
+bdz_username          ?                bdz(username)
+idx_gid2group         len(group)*4     bdz->offset Groups
+idx_groupname2group   len(group)*4     bdz->offset Groups
+idx_uid2user          len(user)*4      bdz->offset Users
+idx_name2user         len(user)*4      bdz->offset Users
+shell_index           len(shells)*2    shell index array
+shell_blob            <= 65280         shell data blob (max 255*256 bytes)
+groups                ?                packed Group entries (8b padding)
+users                 ?                packed User entries (8b padding)
+groupmembers          ?                per-group delta varint memberlist (no padding)
+additional_gids       ?                per-user delta varint gidlist (no padding)
+```
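
Given the header counts, the section offsets could be derived roughly as below. This is a sketch, not the project's code: `section_offsets` is a hypothetical helper, and it assumes that one block is 64 bytes and that the `idx_*` and `shell_index` sizes from the table are rounded up to whole blocks.

```python
BLOCK = 64  # assumption: sections are padded to 64-byte blocks


def blocks(nbytes: int) -> int:
    """Number of 64-byte blocks needed to hold nbytes."""
    return -(-nbytes // BLOCK)


def section_offsets(h: dict) -> dict:
    """Map section name -> byte offset, given header fields in h."""
    layout = [
        ("header", blocks(128)),
        ("bdz_gid", h["nblocks_bdz_gid"]),
        ("bdz_groupname", h["nblocks_bdz_groupname"]),
        ("bdz_uid", h["nblocks_bdz_uid"]),
        ("bdz_username", h["nblocks_bdz_username"]),
        ("idx_gid2group", blocks(h["num_groups"] * 4)),
        ("idx_groupname2group", blocks(h["num_groups"] * 4)),
        ("idx_uid2user", blocks(h["num_users"] * 4)),
        ("idx_name2user", blocks(h["num_users"] * 4)),
        ("shell_index", blocks(h["num_shells"] * 2)),
        ("shell_blob", h["nblocks_shell_blob"]),
        ("groups", h["nblocks_groups"]),
        ("users", h["nblocks_users"]),
        ("groupmembers", h["nblocks_groupmembers"]),
        ("additional_gids", h["nblocks_additional_gids"]),
    ]
    offsets, pos = {}, 0
    for name, nblocks in layout:
        offsets[name] = pos * BLOCK
        pos += nblocks
    return offsets
```

Every offset follows from the header alone, so a reader can `mmap` the file and slice out each section without scanning it.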
+
+[cmph]: http://cmph.sourceforge.net/
+[id]: https://linux.die.net/man/1/id
+[data-oriented-design]: https://media.handmade-seattle.com/practical-data-oriented-design/
+[getpwnam_r]: https://linux.die.net/man/3/getpwnam_r
+[varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints
+[getpwent]: https://www.man7.org/linux/man-pages/man3/getpwent_r.3.html
+[getgrouplist]: https://www.man7.org/linux/man-pages/man3/getgrouplist.3.html
+[getgrid]: https://www.man7.org/linux/man-pages/man3/getgrgid_r.3.html
--- a/docs/development.md
+++ b/docs/development.md
@ -0,0 +1,39 @@
+Profiling
+---------
+
+Prepare `perf.data`:
+
+```
+zig build -Drelease-small=true && \
+    perf record --call-graph=dwarf \
+        zig-out/bin/turbonss-unix2db --passwd passwd --group group
+```
+
+Perf interactive:
+
+```
+perf report -i perf.data
+```
+
+Flame graph:
+
+```
+perf script | inferno-collapse-perf | inferno-flamegraph > profile.svg
+```
+
+For v2
+------
+
+These are desired for the next DB format:
+- Compress strings with fsst.
+- Trim first 4 bytes from the cmph headers.
+
+Dependencies
+------------
+
+This project uses [git subtrac][git-subtrac] for managing dependencies. They
+work just like regular submodules, except all the refs of the submodules are in
+this repository. Repeat after me: all the submodules are in this repository.
+So if you have a copy of this repo, dependencies will not disappear.
+
+[git-subtrac]: https://apenwarr.ca/log/20191109
Author	SHA1	Message	Date
Motiejus Jakštys	4493b4408c	update README	2022-11-30 12:04:12 +02:00
Motiejus Jakštys	312e510eff	add lto and fpic learning about linkers. Thanks, Drepper	2022-11-30 11:56:02 +02:00
Motiejus Jakštys	0df7d8b722	DSO: reduce visibility of bdz lib	2022-11-25 14:50:42 +02:00
Motiejus Jakštys	422c264df9	zig v0.10 compatibility	2022-11-20 13:33:05 +02:00
Motiejus Jakštys	ff814a474b	add compress-debug-sections to turbonss-makecorpus	2022-08-23 15:50:00 +03:00
Motiejus Jakštys	292c87a597	Merge branch 'compress-debug-sections'	2022-08-23 15:49:33 +03:00
Motiejus Jakštys	8212f3f51a	clarify compiler requirements	2022-08-21 06:58:05 +03:00
Motiejus Jakštys	4d4c8a5be1	analyze command	2022-08-21 06:10:47 +03:00
Motiejus Jakštys	fbd449b21f	Move docs around; finish it	2022-08-21 06:08:21 +03:00
Motiejus Jakštys	8bfc4a30cd	wip turbonss-unix2systemd	2022-08-20 19:08:04 +03:00
Motiejus Jakštys	ef436294e9	wip compress-debug-sections	2022-07-20 14:30:48 +03:00