Turbo NSS
---------

Turbonss is a plugin for the GNU Name Service Switch (NSS) functionality of the
GNU C Library (glibc). Turbonss implements lookup for `passwd` and `group`
database entries (i.e. system users, groups, and group memberships). Its main
goal is performance, with a focus on making [`id(1)`][id] run as fast as
possible.

Turbonss is optimized for reading. If the data changes in any way, the whole
file needs to be regenerated (the tooling supports only full generation). It
was created for, and is best suited to, environments that have a central
user & group database which then needs to be distributed to many
servers/services.

To understand more about name service switch, start with
[`nsswitch.conf(5)`][nsswitch].

Design & constraints
--------------------

To be fast, the user/group database (later: DB) has to be small
([background][data-oriented-design]). It encodes user & group information in a
way that minimizes the DB size, and reduces jumping across the DB ("chasing
pointers and thrashing CPU cache").

To understand how this is done efficiently, let's analyze
[`getpwnam_r(3)`][getpwnam_r] at a high level. This API call accepts a username
and returns the following user information:

```
struct passwd {
   char   *pw_name;       /* username */
   char   *pw_passwd;     /* user password */
   uid_t   pw_uid;        /* user ID */
   gid_t   pw_gid;        /* group ID */
   char   *pw_gecos;      /* user information */
   char   *pw_dir;        /* home directory */
   char   *pw_shell;      /* shell program */
};
```

Turbonss, among others, implements this call, and takes the following steps to
resolve a username to a `struct passwd*`:

- Open the DB (using `mmap`) and interpret its first 48 bytes as a `struct
  Header`. The header stores offsets to the sections of the file. This needs to
  be done once, when the NSS library is loaded (or on the first call).
- Hash the username using a perfect hash function. A perfect hash function
  returns a number `n ∈ [0,N-1]`, where N is the total number of users.
- Jump to the `n`'th location in the `idx_name2user` section (by pointer
  arithmetic), which contains the index `i` to the user's information.
- Jump to the location `i` of section `Users` (again, using pointer arithmetic)
  which stores the full user information.
- Decode the user information (which is all in a contiguous memory block) and
  return it to the caller.

In total, that's one hash for the username (~150ns), two pointer jumps within
the DB file (to sections `idx_name2user` and `Users`), and, now that the
user record is found, a `memcpy` for each field.
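The data flow of these steps can be sketched in a few lines of Python. This is
a toy model, not the real encoding: the record layout (`uid`, `gid`,
`name_len`, name) and every name in the snippet are assumptions made for
illustration, and the perfect hash is replaced by a lookup table. The final
name comparison is an addition of this sketch: a minimal perfect hash also
maps unknown keys to some valid slot, so the stored name has to be checked.

```python
import struct

NAMES = [b"alice", b"bob", b"carol"]

# Stand-in for the perfect hash (assumption: a real DB would evaluate a BDZ
# function here). Like a real minimal perfect hash, it maps every key,
# including unknown ones, to some slot in [0, N-1].
_SLOTS = {name: n for n, name in enumerate(reversed(NAMES))}

def perfect_hash(name):
    return _SLOTS.get(name, 0)

# Toy "Users" section: uid (u32), gid (u32), name_len (u8), name bytes.
RECORD = struct.Struct("=IIB")
users = bytearray()
record_offset = {}
for uid, name in enumerate(NAMES, start=1000):
    record_offset[name] = len(users)
    users += RECORD.pack(uid, uid, len(name)) + name

# Toy "idx_name2user" section: slot n holds the offset i into `users`.
idx_name2user = [0] * len(NAMES)
for name, n in _SLOTS.items():
    idx_name2user[n] = record_offset[name]

def lookup(name):
    n = perfect_hash(name)                             # one hash
    i = idx_name2user[n]                               # jump 1: index section
    uid, gid, name_len = RECORD.unpack_from(users, i)  # jump 2: Users section
    start = i + RECORD.size
    stored = bytes(users[start:start + name_len])
    # Unknown names also land on a valid slot, so compare before returning.
    return (uid, gid, stored) if stored == name else None
```

Calling `lookup(b"bob")` walks hash, index, and record in that order; a name
that is not in the DB yields `None`.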

The turbonss DB file is `mmap`-ed, making it simple to implement pointer
arithmetic and jumping across the file. This also reduces memory usage,
especially across multiple concurrent invocations of the `id` command. The
consumed heap space for each separate turbonss instance will be minimal.

Tight packing places some constraints on the underlying data:

- Maximum database size: 4GB.
- Permitted length of username and groupname: 1-32 bytes.
- Permitted length of shell and home: 1-64 bytes.
- Permitted comment ("gecos") length: 0-1023 bytes.
- Username, groupname and gecos must be UTF-8 encoded.
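A generator could enforce these limits before writing a record. The following
is a minimal sketch of such a check; `check_user` and its keyword arguments
are hypothetical names, not part of turbonss, and the limits are copied from
the list above.

```python
# Field name -> (minimum, maximum) length in bytes, as listed above.
LIMITS = {"name": (1, 32), "home": (1, 64), "shell": (1, 64), "gecos": (0, 1023)}
# Fields that must be valid UTF-8.
UTF8_FIELDS = ("name", "gecos")

def check_user(**fields):
    """Return a list of human-readable problems; empty if the user fits."""
    problems = []
    for field, (lo, hi) in LIMITS.items():
        size = len(fields[field])
        if not lo <= size <= hi:
            problems.append(f"{field}: length {size} outside {lo}-{hi} bytes")
    for field in UTF8_FIELDS:
        try:
            fields[field].decode("utf-8")
        except UnicodeDecodeError:
            problems.append(f"{field}: not valid UTF-8")
    return problems
```

All values are passed as `bytes`, because the limits are expressed in bytes
rather than in characters.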

Checking out and building
-------------------------

```
$ git clone --recursive https://git.sr.ht/~motiejus/turbonss
```

Alternatively, if you forgot `--recursive`:

```
$ git submodule update --init
```

And run tests:

```
$ zig build test
```

Other commands will be documented as they are implemented.

This project uses [git subtrac][git-subtrac] for managing dependencies. They
work just like regular submodules, except all the refs of the submodules are in
this repository. Repeat after me: all the submodules are in this repository.
So if you have a copy of this repo, dependencies will not disappear.

Remarks on `id(1)`
------------------

A known implementation runs id(1) at ~250 rps sequentially on ~20k users and
~10k groups. Our target is 10k id/s for the same payload.

To better reason about the trade-offs, it is useful to understand how `id(1)`
is implemented, in rough terms:
- look up the user by name.
- get all additional gids (an array attached to a member).
- for each additional gid, get the group information (`struct group*`).

Assuming a member is in ~100 groups on average, the 10k id/s target amounts to
1M group lookups per second. We need to convert gid to a group index, and group
index to a group gid/name quickly.
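The three steps and the resulting lookup rate can be written down as a small
model. The tables below are made-up in-memory stand-ins for the NSS calls, and
`id_like` is a hypothetical name; only the shape of the work matters.

```python
# Toy in-memory tables; all names and numbers here are made up.
users = {"alice": {"uid": 1000, "gid": 1000}}
additional_gids = {"alice": [1000, 27, 100]}
groups = {1000: "alice", 27: "sudo", 100: "users"}

def id_like(username):
    user = users[username]                   # 1. look up the user by name
    gids = additional_gids[username]         # 2. get all additional gids
    names = [groups[gid] for gid in gids]    # 3. one group lookup per gid
    return user["uid"], user["gid"], list(zip(gids, names))

# Group lookups per second implied by the throughput target.
target_ids_per_second = 10_000
groups_per_user = 100
group_lookups_per_second = target_ids_per_second * groups_per_user
```

Step 3 dominates: it runs once per gid, which is why the gid-to-group path is
the one that has to be cheapest.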

Caveat: `struct group` contains an array of pointers to names of group members
(`char **gr_mem`). However, `id` does not use that information, resulting in
read amplification. Therefore, if `argv[0] == "id"`, our implementation of
`getgrgid(3)` returns the `struct group*` without the members. This speeds up
`id` by about 10x on a known NSS implementation.

Relatedly, because `getgrgid(3)` does not need the group members in that case,
the group members are stored in a different DB section, making the `Groups`
section smaller, thus more CPU-cache-friendly in the hot path.

Indices
-------

Now that we've sketched the implementation of `id(1)`, it is easier to see
which operations need to be fast; in order of importance:

1. lookup gid -> group info without members (this is on the hot path in `id`).
2. lookup username -> user's groups.
3. lookup uid -> user.
4. lookup groupname -> group.
5. lookup username -> user.

These indices can use perfect hashing like [bdz from cmph][cmph]: a perfect
hash hashes a list of bytes to a sequential list of integers. Perfect hashing
algorithms require some space, and take some time to calculate ("hashing
duration"). I've tested BDZ, which hashes `[][]u8` to a sequential list of
integers (not preserving order), and CHM, which preserves order. BDZ accepts an
optional argument `3 <= b <= 10`.

* BDZ requires 900KB (b=3), 338KB (b=7), or 306KB (b=10) for 1M values.
* Latency to resolve 1M keys: 170ms, 180ms, and 230ms, respectively.
* Packed vs non-packed latency differences are not meaningful.

CHM retains order, however, 1M keys weigh 8MB. 10k keys are ~20x larger with
CHM than with BDZ, eliminating the benefit of preserved ordering: we can just
have a separate index.

Turbonss header
---------------

The turbonss header looks like this:

```
OFFSET     TYPE     NAME                          DESCRIPTION
   0      [4]u8     magic                         always 0xf09fa4b7
   4         u8     version                       now `0`
   5        u16     bom                           0x1234
   7         u8     num_shells                    max value: 63. Padding is strange on little endian.
   8        u32     num_users                     number of passwd entries
  12        u32     num_groups                    number of group entries
  16        u32     offset_bdz_uid2user
  20        u32     offset_bdz_groupname2group
  24        u32     offset_bdz_name2user
  28        u32     offset_idx                    offset to the first idx_ section
  32        u32     offset_groups
  36        u32     offset_users
  40        u32     offset_groupmembers
  44        u32     offset_additional_gids
```

`magic` is 0xf09fa4b7, and `version` must be `0`. All integers are
native-endian. `bom` is a byte-order mark. It must resolve to `0x1234` (4660).
If it does not, the file is being read with a different endianness than it was
created with. Turbonss files cannot be moved between computers of different
endianness: if that happens, turbonss will refuse to read the file.
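These checks can be sketched with Python's `struct` module. The format string
follows the header table field by field; `parse_header` is a hypothetical
name, and the on-disk byte order of `magic` (`f0 9f a4 b7`) is an assumption
of this sketch.

```python
import struct

# "=" selects native byte order without padding, matching the packed layout:
# magic [4]u8, version u8, bom u16, num_shells u8, then ten u32 fields.
HEADER = struct.Struct("=4sBHB10I")
MAGIC = b"\xf0\x9f\xa4\xb7"  # assumed byte order of 0xf09fa4b7 in the file

def parse_header(buf):
    magic, version, bom, num_shells, *u32s = HEADER.unpack_from(buf, 0)
    if magic != MAGIC:
        raise ValueError("not a turbonss file")
    if version != 0:
        raise ValueError("unsupported version")
    if bom != 0x1234:
        # A file written on the other endianness reads back as 0x3412.
        raise ValueError("file was created with a different endianness")
    if num_shells > 63:
        raise ValueError("too many shells")
    return {
        "num_shells": num_shells,
        "num_users": u32s[0],
        "num_groups": u32s[1],
        "offsets": u32s[2:],  # the eight offset_* fields, in table order
    }
```

`HEADER.size` evaluates to 48, which matches the last offset in the table
(44) plus the four bytes of that field.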

Offsets are indices to further sections of the file, with zero being the first
block (pointing to the `magic` field). As all blobs are 64-byte aligned, the
offsets always point to the beginning of a 64-byte "block". Therefore,
all `offset_*` values could be `u26`. As `u32` is easier to visualize with xxd,
and the header block fits in 64 bytes anyway, we are keeping them as `u32` for
now.
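The `u26` figure follows from the size limit and the alignment. A short check,
assuming "4GB" means 2^32 bytes (the names below are illustrative):

```python
BLOCK = 64                 # section alignment in bytes
MAX_DB_SIZE = 4 * 1024**3  # maximum database size, taken as 2**32 bytes

def to_block_index(byte_offset):
    """Convert an aligned byte offset into a 64-byte block number."""
    if byte_offset % BLOCK:
        raise ValueError("offset is not 64-byte aligned")
    return byte_offset // BLOCK

# Highest possible block number in a maximum-size file, and its bit width.
last_block = to_block_index(MAX_DB_SIZE - BLOCK)
bits_needed = last_block.bit_length()
```

Dropping the six always-zero low bits of a 32-bit byte offset leaves 26 bits.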

Sections whose lengths can be calculated do not have a corresponding `offset_*`
header field. For example, `bdz_gid2group` comes immediately after the header,
and `idx_groupname2group` comes after `idx_gid2group`, whose offset is
`offset_idx`, and size can be calculated.

`num_shells` would fit in a `u6`; however, we would need 2 bits of padding (all
other fields are byte-aligned). If we instead did `u2` followed by `u6`, the
byte would look very unusual on a little-endian architecture. Therefore we will
just refuse to load the file if the number of shells exceeds 63.

Primitive types
---------------

`User` and `Group` entries are sorted by name, ordered by their Unicode
codepoints. They are byte-aligned (8bits). All `User` and `Group` entries are
referred to by their byte offset in the `Users` and `Groups` section relative
to the beginning of the section.

```
const Group = struct {
    gid: u32,
    // index to a separate structure with a list of members. The memberlist is
    // always 2^5-byte aligned (32b), this is an index there.
    members_offset: u27,
    groupname_len: u5,
    // a groupname_len-sized string
    groupname: []u8,
};

const User = struct {
    uid: u32,
    gid: u32,
    // pointer to a separate structure that contains a list of gids
    additional_gids_offset: u29,
    // shell is a different story, documented elsewhere.
    shell_here: bool,
    shell_len_or_idx: u6,
    home_len: u6,
    name_is_a_suffix: bool,
    name_len: u5,
    gecos_len: u8,
    // a variable-sized array that will be stored immediately after this
    // struct.
    stringdata: []u8,
};
```

`stringdata` contains a few string entries:
- home.
- name.
- gecos.
- shell (optional).

The first byte of the home is stored right after the `gecos_len` field, and its
length is `home_len`. The same logic applies to all the `stringdata` fields:
there is a way to calculate their relative position from the length of the
fields before them.

Additionally, two optimizations for special fields are made:
- shells are often shared across different users, see the "Shells" section.
- name is frequently a suffix of the home. For example, `/home/motiejus`
  and `motiejus`. In that case, storing both name and home strings is
  wasteful. For such cases, name has two options:
  1. `name_is_a_suffix=true`: name is a suffix of the home dir. In that
     case, the name starts at the `home_len - name_len`'th
     byte of the home, and ends at the same place as the home.
  2. `name_is_a_suffix=false`: name is stored separately. In that case,
     it begins one byte after home, and its length is
     `name_len`.
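The two cases for the name can be sketched as follows. This is a minimal
illustration under stated assumptions: `home` is the first entry in
`stringdata`, a separately stored name starts at the byte immediately
following `home`, and the length arguments are real byte counts (how lengths
are encoded in the bit fields is not covered here). The function names are
hypothetical.

```python
def decode_home_and_name(stringdata, home_len, name_len, name_is_a_suffix):
    """Extract (home, name) from the bytes following a User record."""
    home = stringdata[:home_len]
    if name_is_a_suffix:
        # The name shares storage with the tail of home.
        name = stringdata[home_len - name_len:home_len]
    else:
        # The name is stored on its own, right after home.
        name = stringdata[home_len:home_len + name_len]
    return home, name

def name_can_be_suffix(home, name):
    """Decide at generation time whether the shared encoding applies."""
    return home.endswith(name)
```

For `/home/motiejus` and `motiejus`, the shared encoding saves the 8 bytes
that a separate copy of the name would take.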

Shells
------

Normally there is a limited number of shells even in huge user databases. A
few examples: `/bin/bash`, `/usr/bin/nologin`, `/bin/zsh`, among others.
Therefore, shells have an optimization: a shell can either be referenced from
an external list, or reside among the user's data.

The 63 most popular shells (i.e. referred to by at least two User entries) are
stored externally in the "Shells" area. The less popular ones are stored with
userdata.

There are two "Shells" areas: the index and the blob. The index is a list of
structs which point to a location in the "blob" area:

```
const ShellIndex = struct {
    offset: u10,
    len: u6,
};
```

In the user's struct, the `shell_here=true` bit signifies that the shell is
stored with userdata. `false` means it is stored in the `Shells` section. If
the shell is stored "here", it is the first element in `stringdata`, and its
length is `shell_len_or_idx`. If it is stored externally, the latter variable
holds its index in the ShellIndex area.

Shells in the external storage are sorted by their weight, which is
`length*frequency`.
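One way to pick and order the external shells is sketched below. Two details
are assumptions of this sketch rather than documented behavior: popularity is
ranked by the same `length*frequency` weight, and heavier shells come first
(ties are broken by name to keep the output deterministic).
`pick_external_shells` is a hypothetical name.

```python
from collections import Counter

MAX_SHELLS = 63

def pick_external_shells(user_shells):
    """Return the shells to store in the Shells area, heaviest first."""
    freq = Counter(user_shells)
    # Only shells referred to by at least two users qualify.
    shared = [shell for shell, count in freq.items() if count >= 2]
    # weight = length * frequency; descending order is an assumption.
    shared.sort(key=lambda shell: (-len(shell) * freq[shell], shell))
    return shared[:MAX_SHELLS]
```

A shell used by a single user never enters the list, so it stays in that
user's `stringdata`.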

Variable-length integers (varints)
----------------------------------

A varint is an efficiently encoded integer (packed for small values). It is the
same as [protocol buffer varints][varint], except that values are limited to
the `u64` range. They compress integers well.
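For reference, protocol-buffer-style varints store 7 bits per byte, least
significant group first, with the high bit of each byte marking "more bytes
follow". A minimal sketch (the function names are illustrative):

```python
def encode_varint(value):
    """Encode an unsigned integer in the u64 range as a varint."""
    if not 0 <= value < 2**64:
        raise ValueError("value does not fit in u64")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more groups follow
        else:
            out.append(byte)         # last group: high bit clear
            return bytes(out)

def decode_varint(buf, pos=0):
    """Decode one varint at buf[pos]; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7
```

Values below 128 take a single byte, and the largest `u64` takes ten.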

Group memberships
-----------------

There are two group memberships at play:

1. given a username, resolve user's group gids (for `initgroups(3)`).
2. given a group (gid/name), resolve the members' names (e.g. `getgrgid`).

When a user's groups are resolved in (1), the additional userdata is not
requested (there is no way to return it). Therefore, it is reasonable to store
the user's memberships completely out-of-band, keyed by the hash of the
username.

When group's memberships are resolved in (2), the same call also requires other
group information: gid and group name. Therefore it makes sense to store a
pointer to the group members in the group information itself. However, the
memberships are not *always* necessary (see remarks about `id(1)` in this
document), therefore the memberships will be stored separately, outside of the
groups section.

`groupmembers` and `additional_gids` store group and user memberships
respectively: for each group, a list of pointers (offsets) to User records, and
for each user, a list of pointers to Group records. These lists are always
read in their entirety and need no random access, which makes them suitable for
tight packing.

An entry of `groupmembers` and `additional_gids` looks like this piece of
pseudo-code:

```
const PackedList = struct {
    length: varint,
    members: [length]varint,
};
const Groupmembers = PackedList;
const AdditionalGids = PackedList;
```

A packed list is a varint length followed by that many varints.
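A round trip through this encoding might look as follows. The helpers are a
sketch with illustrative names; they carry their own compact varint routines
so that the snippet runs on its own.

```python
def put_varint(out, value):
    """Append one protobuf-style varint to the bytearray `out`."""
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)

def get_varint(buf, pos):
    """Read one varint at buf[pos]; return (value, next position)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
        shift += 7

def encode_packed_list(members):
    out = bytearray()
    put_varint(out, len(members))      # length comes first
    for member in members:
        put_varint(out, member)
    return bytes(out)

def decode_packed_list(buf, pos=0):
    length, pos = get_varint(buf, pos)
    members = []
    for _ in range(length):
        member, pos = get_varint(buf, pos)
        members.append(member)
    return members, pos
```

Because the decoder returns the position after the list, consecutive lists can
be stored back to back and reached through an offset to their first byte.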

Section `AdditionalGidsIndex` stores an index from `hash(username)` to `offset`
in AdditionalGids.

Complete file structure
-----------------------

`idx_*` sections are of type `[]PackedIntArray(u29)` and point to the
respective `Groups` and `Users` entries (from the beginning of the respective
section). Since User and Group records are 8-byte aligned, 3 bits can be saved
from every element. However, since the header easily fits in 64 bytes, we are
storing plain `u32` for easier inspection.

Each section is padded to 64 bytes.

```
SECTION               SIZE            DESCRIPTION
Header                48              see "Turbonss header" section
bdz_gid2group         ?               gid->group bdz
bdz_uid2user          ?               uid->user bdz
bdz_groupname2group   ?               groupname->group bdz
bdz_name2user         ?               username->user bdz
idx_gid2group         len(group)*4    bdz->offset gid2group
idx_groupname2group   len(group)*4    bdz->offset groupname2group
idx_uid2user          len(user)*4     bdz->offset uid2user
idx_name2user         len(user)*4     bdz->offset name2user
idx_username2gids     len(user)*4     Per-user gidlist index
ShellIndex            len(shells)*2   Shell index array
ShellBlob             <= 4032         Shell data blob (max 63*64 bytes)
Groups                ?               packed Group entries (8b padding)
Users                 ?               packed User entries (8b padding)
Groupmembers          ?               per-group memberlist (32b padding)
AdditionalGids        ?               Per-user gidlist entries
```
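Given the section sizes, the start of every section follows from the 64-byte
padding rule. A small sketch, with made-up sizes for the sections whose size is
listed as `?` (the `layout` helper is hypothetical):

```python
BLOCK = 64  # every section is padded to a multiple of 64 bytes

def layout(sections):
    """Map (name, size in bytes) pairs to start offsets; also return the total."""
    offsets = {}
    pos = 0
    for name, size in sections:
        offsets[name] = pos
        pos += -(-size // BLOCK) * BLOCK  # round size up to a whole block
    return offsets, pos

# Example: a 48-byte header, a 100-byte bdz blob (made-up size), and an index
# with three u32 entries.
offsets, total = layout([
    ("Header", 48),
    ("bdz_gid2group", 100),
    ("idx_gid2group", 3 * 4),
])
```

In this example the header occupies the first block, the bdz blob the next
two, and the index the fourth.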

[git-subtrac]: https://apenwarr.ca/log/20191109
[cmph]: http://cmph.sourceforge.net/
[id]: https://linux.die.net/man/1/id
[nsswitch]: https://linux.die.net/man/5/nsswitch.conf
[data-oriented-design]: https://media.handmade-seattle.com/practical-data-oriented-design/
[getpwnam_r]: https://linux.die.net/man/3/getpwnam_r
[varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints