Turbo NSS
---------

Turbonss is a plugin for the GNU Name Service Switch (NSS) functionality of the
GNU C Library (glibc). Turbonss implements lookup for `passwd` and `group`
database entries (i.e. system users, groups, and group memberships). Its main
goal is performance, with a focus on making [`id(1)`][id] run as fast as
possible.

To understand more about the name service switch, start with
[`nsswitch.conf(5)`][nsswitch].

Design & constraints
--------------------

To be fast, the user/group database (later: DB) has to be small ([highly
recommended background viewing][data-oriented-design]). It encodes user & group
information in a way that minimizes the DB size, and reduces jumping across the
DB ("chasing pointers and polluting CPU cache").

For example, [`getpwnam_r(3)`][getpwnam_r] accepts a username and returns
the following user information:

```
struct passwd {
   char   *pw_name;       /* username */
   char   *pw_passwd;     /* user password */
   uid_t   pw_uid;        /* user ID */
   gid_t   pw_gid;        /* group ID */
   char   *pw_gecos;      /* user information */
   char   *pw_dir;        /* home directory */
   char   *pw_shell;      /* shell program */
};
```

Turbonss implements this call, among others, and takes the following steps to
resolve it:

- Hash the username using a perfect hash function. A perfect hash function
  returns a number in the range [0, N-1], where N is the total number of users.
- Jump to a known location in the DB (by pointer arithmetic) which links the
  user's index to the user's information. That is an index to a different
  location within the DB.
- Jump to the location which stores the full user information.
- Decode the user information (which is all in one contiguous memory block)
  and return it to the caller.

In total, that's one hash for the username (~150ns), two pointer jumps within
the DB, and, now that the user record is found, a `memcpy` for each field.
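
The steps above can be sketched in C roughly as follows. Everything here is
illustrative: `toy_db`, `toy_hash`, `toy_getpwnam` and the record layout (uid,
one length byte, name bytes) are hypothetical stand-ins, not the turbonss
on-disk format. The hash is a hand-picked function that merely happens to be
collision-free for names starting with `a`, `b` and `c`.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory "DB": an index table plus packed user records. */
struct toy_db {
    const uint32_t *index;    /* hash value -> byte offset into records */
    const uint8_t  *records;  /* packed, contiguous user records */
    uint32_t        num_users;
};

struct toy_user {
    uint32_t uid;
    char     name[33];
};

/* Stand-in for a real perfect hash: first byte modulo the user count. */
uint32_t toy_hash(const char *name, uint32_t n) {
    return (uint8_t)name[0] % n;
}

/* Resolve a username: hash, jump to the index, jump to the record, decode.
 * Returns 0 on success, -1 if the name is not in the DB. */
int toy_getpwnam(const struct toy_db *db, const char *name,
                 struct toy_user *out) {
    uint32_t slot = toy_hash(name, db->num_users);  /* step 1: hash */
    uint32_t offset = db->index[slot];              /* step 2: index jump */
    const uint8_t *rec = db->records + offset;      /* step 3: record jump */
    uint8_t name_len = rec[4];

    /* A perfect hash maps unknown keys to arbitrary slots, so the stored
     * name has to be compared with the requested one. */
    if (name_len > 32 || strlen(name) != name_len ||
        memcmp(rec + 5, name, name_len) != 0)
        return -1;

    memcpy(&out->uid, rec, sizeof out->uid);        /* step 4: decode */
    memcpy(out->name, rec + 5, name_len);
    out->name[name_len] = '\0';
    return 0;
}
```

Note the final comparison: a perfect hash only guarantees distinct slots for
keys that were present when it was built, so a miss has to be detected by
checking the record itself.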

This tight packing places some constraints on the underlying data:

- Maximum database size: 4GB.
- Maximum length of username and groupname: 32 bytes.
- Maximum length of shell and homedir: 64 bytes.
- Maximum comment ("gecos") length: 256 bytes.
- Username and groupname must be utf8-encoded.
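
As a minimal sketch, the length limits above could be checked before a user
is admitted into the DB. The function and constant names are hypothetical, and
the sketch covers neither the UTF-8 requirement nor the total DB size.

```c
#include <stdbool.h>
#include <string.h>

/* Limits taken from the list above; the names are illustrative only. */
enum {
    MAX_NAME_LEN  = 32,
    MAX_SHELL_LEN = 64,
    MAX_HOME_LEN  = 64,
    MAX_GECOS_LEN = 256
};

/* Return true if every string field of a user fits the length limits. */
bool user_fits(const char *name, const char *gecos,
               const char *home, const char *shell) {
    return strlen(name)  <= MAX_NAME_LEN
        && strlen(gecos) <= MAX_GECOS_LEN
        && strlen(home)  <= MAX_HOME_LEN
        && strlen(shell) <= MAX_SHELL_LEN;
}
```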

Checking out and building
-------------------------

```
$ git clone --recursive https://git.sr.ht/~motiejus/turbonss
```

Alternatively, if you forgot `--recursive`:

```
$ git submodule update --init
```

And run tests:

```
$ zig build test
```

Other commands will be documented as they are implemented.

This project uses [git subtrac][git-subtrac] for managing dependencies.

Remarks on `id(1)`
------------------

A known implementation runs `id(1)` at ~250 invocations per second
sequentially on ~20k users and ~10k groups. Our target is 10k id/s.

`id(1)` works as follows:
- look up the user by name.
- get all additional gids (an array attached to the user).
- for each additional gid, get the group name.

Assuming a user is in ~100 groups on average, the 10k id/s target translates to
1M group lookups per second. We need to convert gid to a group index, and group
index to a group gid/name quickly.

Caveat: `struct group` contains an array of pointers to names of group members
(`char **gr_mem`). However, `id` does not use that information, resulting in
significant read amplification. Therefore, if `argv[0] == "id"`, `getgrgid(3)`
will return the group without the members. This speeds up `id` by about 10x on
a known NSS implementation.

Because this `getgrgid(3)` path does not use the group members' information,
the group members are stored in a different location, making the `Groups`
section smaller, thus more CPU-cache-friendly.
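
In C, the `argv[0] == "id"` test has to be a string comparison. A minimal
sketch, assuming the check is done on the final path component (the text above
does not say whether the directory part is stripped), with `invoked_as_id` as
a hypothetical helper name:

```c
#include <stdbool.h>
#include <string.h>

/* Return true when the final path component of argv0 is exactly "id". */
bool invoked_as_id(const char *argv0) {
    const char *slash = strrchr(argv0, '/');
    const char *base = slash ? slash + 1 : argv0;
    return strcmp(base, "id") == 0;
}
```

An NSS module does not receive `argv` directly. With glibc, one option is the
GNU extension `program_invocation_short_name` (declared in `<errno.h>` when
`_GNU_SOURCE` is defined); whether turbonss uses it is not stated here.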

Indices
-------

The following operations need to be fast, in order of importance:

1. lookup gid -> group (this is on the hot path in `id`) with or without
   members (2 separate calls).
2. lookup uid -> user.
3. lookup groupname -> group.
4. lookup username -> user.
5. (optional) iterate users using a defined order (`getent passwd`).
6. (optional) iterate groups using a defined order (`getent group`).

The first four can use perfect hashing like [cmph][cmph]: it hashes a list of
byte strings to a sequential list of integers. Perfect hashing algorithms
require some space, and take some time to calculate ("hashing duration"). I've
tested BDZ, which hashes `[][]u8` to a sequential list of integers (not
preserving order), and CHM, which does the same but preserves order. BDZ accepts
an argument 3 <= b <= 10.

BDZ: tried b=3, b=7 (default), and b=10.

* BDZ algorithm requires (900KB, 338KB, 306KB, respectively) for 1M values.
* Latency to resolve 1M keys: (170ms, 180ms, 230ms).
* Packed vs non-packed latency differences are not meaningful.

CHM retains order; however, 1M keys weigh 8MB. 10k keys are ~20x larger with
CHM than with BDZ, eliminating the benefit of preserved ordering.

Turbonss header
---------------

The turbonss header looks like this:

```
OFFSET     TYPE     NAME                          DESCRIPTION
   0      [4]u8     magic                         always 0xf09fa4b7
   4         u8     version                       now `0`
   5        u16     bom                           0x1234
   7         u8     padding
   8        u32     num_users                     number of passwd entries
  12        u32     num_groups                    number of group entries
  16        u32     offset_cmph_gid2group
  20        u32     offset_cmph_uid2user
  24        u32     offset_cmph_groupname2group
  28        u32     offset_cmph_username2user
  32        u32     offset_groupmembers
  36        u32     offset_additional_gids
```

`magic` is 0xf09fa4b7, and `version` must be `0`. All integers are
native-endian. `bom` is a byte-order-mark. It must resolve to `0x1234` (4660).
If it does not, the file is being read on a machine with a different endianness
than the one it was created on. Turbonss files cannot be moved between
computers of different endianness; if that happens, turbonss will refuse to
read the file.
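
A minimal sketch of these checks in C, reading fields with `memcpy` because
`bom` sits at the unaligned offset 5. It assumes the `magic` bytes are stored
in the order `f0 9f a4 b7`; `check_header`, `HEADER_LEN` and the return codes
are hypothetical.

```c
#include <stdint.h>
#include <string.h>

enum { HEADER_LEN = 40 };  /* last header field is a u32 at offset 36 */

/* Validate magic, version and byte-order mark of a header buffer.
 * Returns 0 if the header is acceptable, a negative code otherwise. */
int check_header(const uint8_t *buf, size_t len) {
    static const uint8_t magic[4] = {0xf0, 0x9f, 0xa4, 0xb7};
    uint16_t bom;

    if (len < HEADER_LEN)
        return -1;                      /* too short to hold a header */
    if (memcmp(buf, magic, sizeof magic) != 0)
        return -2;                      /* wrong magic */
    if (buf[4] != 0)
        return -3;                      /* unsupported version */
    memcpy(&bom, buf + 5, sizeof bom);  /* native-endian read */
    if (bom != 0x1234)
        return -4;                      /* written with another endianness */
    return 0;
}
```

If the file was written on a machine of the opposite endianness, the two
`bom` bytes read back as `0x3412`, which is how the mismatch is detected.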

Offsets are indices to further sections of the file, with zero being the first
block (pointing to the `magic` field). As all blobs are 64-byte aligned, the
offsets always point to the beginning of a 64-byte "block". Therefore, all
`offset_*` values could be `u26`. As `u32` is easier to visualize with xxd,
and the header block fits in 64 bytes anyway, we are keeping them as `u32` for
now.
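
The `u26` figure follows from the size limit: 4GB is 2^32 bytes, and 2^32 / 64
is 2^26 blocks. A small sketch of the conversion, assuming `offset_*` values
count 64-byte blocks as described above (the helper names are hypothetical):

```c
#include <stdint.h>

enum { BLOCK_SHIFT = 6 };  /* 1 << 6 == 64-byte blocks */

/* Convert a block index (as stored in an offset_* field) to a byte offset. */
uint64_t block_to_byte_offset(uint32_t block) {
    return (uint64_t)block << BLOCK_SHIFT;
}

/* Number of 64-byte blocks needed to hold nbytes, rounding up. */
uint32_t blocks_needed(uint64_t nbytes) {
    return (uint32_t)((nbytes + 63) >> BLOCK_SHIFT);
}
```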

Primitive types
---------------

```
const Group = struct {
    gid: u32,
    // index to a separate structure with a list of members. The memberlist is
    // always 2^5-byte aligned (32 bytes), this is an index there.
    members_offset: u27,
    groupname_len: u5,
    // a groupname_len-sized string
    groupname: []u8,
};

const User = struct {
    uid: u32,
    gid: u32,
    // pointer to a separate structure that contains a list of gids
    additional_gids_offset: u29,
    // shell is a different story, documented elsewhere.
    shell_here: u1,
    shell_len_or_place: u6,
    home_len: u6,
    username_pos: u1,
    username_len: u5,
    gecos_len: u8,
    // a variable-sized array that will be stored immediately after this
    // struct.
    stringdata: []u8,
};
```
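
To illustrate how the sub-byte fields of `Group` share one 32-bit word, here
is a C sketch of packing and unpacking `members_offset` (27 bits) and
`groupname_len` (5 bits). The bit order is an assumption (first field in the
least significant bits), and all function names are hypothetical.

```c
#include <stdint.h>

struct group_bits {
    uint32_t members_offset;  /* 27 bits: index in 32-byte units */
    uint8_t  groupname_len;   /* 5 bits */
};

/* Pack both fields into one 32-bit word (assumed layout, see above). */
uint32_t pack_group_bits(uint32_t members_offset, uint8_t groupname_len) {
    return (members_offset & 0x07ffffffu)
         | ((uint32_t)(groupname_len & 0x1f) << 27);
}

/* Split a packed 32-bit word back into its two fields. */
struct group_bits unpack_group_bits(uint32_t word) {
    struct group_bits g;
    g.members_offset = word & 0x07ffffffu;
    g.groupname_len = (uint8_t)(word >> 27);
    return g;
}

/* The member list is 32-byte aligned, so the index is scaled by 2^5. */
uint64_t members_byte_offset(uint32_t members_offset) {
    return (uint64_t)members_offset << 5;
}
```

With 27 bits of index and 32-byte alignment, the member list can address the
full 2^32-byte database.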

`User` and `Group` entries are sorted by name, ordered by their unicode
codepoints.

Shells
------

Normally there is a limited number of shells, even in huge user databases. A
few examples: `/bin/bash`, `/usr/bin/nologin`, `/bin/zsh` among others.
Therefore, "shells" have an optimization: a shell can either be referenced by
its index in an external list, or reside among the user's data.

The 64 (1<<6) most popular shells (i.e. referred to by at least two User
entries) are stored externally in the "Shells" area. The less popular ones are
stored with userdata.

The `shell_here=true` bit signifies that the shell is stored with userdata.
`false` means it is stored in the `Shells` section. If the shell is stored
"here", it is the first element in `stringdata`, and its length is
`shell_len_or_place`. If it is stored externally, the latter variable holds
its index in the external storage.

Shells in the external storage are sorted by their weight, which is
`length*frequency`.
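
The decoding rule for the shell can be sketched as below. `shell_ref`,
`resolve_shell` and the in-memory `shells_table` are hypothetical, and the
sketch assumes `shell_len_or_place` holds the length as-is (a `u6` would need
a different encoding, such as length minus one, to express 64).

```c
#include <stddef.h>

/* A shell string that is not NUL-terminated: pointer plus length. */
struct shell_ref {
    const char *ptr;
    size_t      len;
};

/* Resolve a user's shell from the shell_here / shell_len_or_place fields.
 * stringdata is the user's own string block; shells_table stands in for
 * the external "Shells" section. */
struct shell_ref resolve_shell(unsigned shell_here,
                               unsigned shell_len_or_place,
                               const char *stringdata,
                               const struct shell_ref *shells_table) {
    struct shell_ref r;
    if (shell_here) {
        /* Stored inline: first element of stringdata. */
        r.ptr = stringdata;
        r.len = shell_len_or_place;
    } else {
        /* Stored externally: the field is an index into the table. */
        r = shells_table[shell_len_or_place];
    }
    return r;
}
```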

`groupmembers`, `additional_gids`
---------------------------------

`groupmembers` and `additional_gids` store group and user memberships
respectively: for each group, a list of pointers ("offsets") to User records,
and for each user, a list of pointers to Group records. These lists are always
read in their entirety, so random access is not required, which makes them
suitable for tight packing.

An entry of `groupmembers` and `additional_gids` looks like this piece of
pseudo-code:

```
const PackedList = struct {
    length: varint,
    members: []varint
}
const Groupmembers = PackedList;
const AdditionalGids = PackedList;
```

Each entry in the `members` field is an offset to a `User` or `Group` entry
(number of bytes relative to the first entry of the respective type). The
`members` field in `PackedList` is sorted by the name (`username` or
`groupname`) of the record it is pointing to.
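
A sketch of decoding such a list in C. The text above does not specify the
varint encoding; this example assumes unsigned LEB128 (7 payload bits per
byte, high bit set on all but the last byte), and both function names are
hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one unsigned LEB128 varint from buf.
 * Returns the number of bytes consumed, or 0 if truncated or too long. */
size_t varint_decode(const uint8_t *buf, size_t len, uint64_t *out) {
    uint64_t value = 0;
    for (size_t i = 0; i < len && i < 10; i++) {
        value |= (uint64_t)(buf[i] & 0x7f) << (7 * i);
        if ((buf[i] & 0x80) == 0) {
            *out = value;
            return i + 1;
        }
    }
    return 0;
}

/* Decode a PackedList (length, then that many members) into members[].
 * Returns the member count, or -1 on malformed input or if cap is too
 * small. */
long packed_list_decode(const uint8_t *buf, size_t len,
                        uint64_t *members, size_t cap) {
    uint64_t count;
    size_t pos = varint_decode(buf, len, &count);
    if (pos == 0 || count > cap)
        return -1;
    for (uint64_t i = 0; i < count; i++) {
        size_t n = varint_decode(buf + pos, len - pos, &members[i]);
        if (n == 0)
            return -1;
        pos += n;
    }
    return (long)count;
}
```

Because the list is always consumed front to back, the variable-width
encoding costs nothing in lookup flexibility while keeping small offsets to a
single byte each.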

Complete file structure
-----------------------

```
  SECTION              SIZE                         DESCRIPTION
  Header               1<<6                         documented above
  []Group                 ?                         list of Group entries
  []User                  ?                         list of User entries
  Shells                  ?                         documented in "Shells"
  cmph_gid2group          ?                         offset by offset_cmph_gid2group
  cmph_uid2user           ?                         offset by offset_cmph_uid2user
  cmph_groupname2group    ?                         offset by offset_cmph_groupname2group
  cmph_username2user      ?                         offset by offset_cmph_username2user
  groupmembers            ?                         offset by offset_groupmembers
  additional_gids         ?                         offset by offset_additional_gids
```

[git-subtrac]: https://github.com/apenwarr/git-subtrac/
[cmph]: http://cmph.sourceforge.net/
[id]: https://linux.die.net/man/1/id
[nsswitch]: https://linux.die.net/man/5/nsswitch.conf
[data-oriented-design]: https://media.handmade-seattle.com/practical-data-oriented-design/
[getpwnam_r]: https://linux.die.net/man/3/getpwnam_r