From c6bc383269ee44da5501ad377669e6727ce20511 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Motiejus=20Jak=C5=A1tys?= <motiejus@jakstys.lt>
Date: Mon, 14 Feb 2022 13:05:33 +0200
Subject: [PATCH] explain some optimizations

---
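The varint format referenced in the new "Variable-length integers (varints)" section can be sketched as follows. This is a minimal illustration of protobuf-style encoding capped at `u64`, not the turbonss code; the names `encode_varint` and `decode_varint` are hypothetical.

```python
from typing import Tuple

U64_MAX = (1 << 64) - 1


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf-style varint."""
    if not 0 <= value <= U64_MAX:
        raise ValueError("value does not fit in u64")
    out = bytearray()
    # Emit 7 bits at a time, least significant group first; the high bit
    # of each byte signals that more bytes follow.
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def decode_varint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one varint starting at buf[pos].

    Returns (value, position of the first byte after the varint).
    """
    result = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
        if shift > 63:
            # A u64 never needs more than 10 bytes.
            raise ValueError("varint too long")
    if result > U64_MAX:
        raise ValueError("value does not fit in u64")
    return result, pos
```

Values below 128 take a single byte, which is why small offsets and short lists pack tightly; the largest `u64` takes 10 bytes.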
 README.md | 137 +++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 88 insertions(+), 49 deletions(-)
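The username-as-suffix optimization described in the "Primitive types" hunk can be sketched like this. It is an illustration under stated assumptions, not the turbonss decoder: the function name is hypothetical, offsets are assumed to be zero-based, and `homedir_len` and `username_offset_or_len` are assumed to hold plain byte counts with no bias.

```python
from typing import Tuple


def decode_homedir_and_username(
    stringdata: bytes,
    homedir_len: int,
    username_is_a_suffix: bool,
    username_offset_or_len: int,
) -> Tuple[bytes, bytes]:
    """Recover (homedir, username) from the start of a User's stringdata."""
    if homedir_len > len(stringdata):
        raise ValueError("homedir runs past the end of stringdata")
    # The homedir is the first string in stringdata.
    homedir = stringdata[:homedir_len]

    if username_is_a_suffix:
        # Assumption: the field is a zero-based offset into the homedir;
        # the username runs from there to the end of the homedir.
        if username_offset_or_len > homedir_len:
            raise ValueError("username offset is outside the homedir")
        username = homedir[username_offset_or_len:]
    else:
        # The field is a length; the username is stored right after
        # the homedir.
        end = homedir_len + username_offset_or_len
        if end > len(stringdata):
            raise ValueError("username runs past the end of stringdata")
        username = stringdata[homedir_len:end]
    return homedir, username
```

With `/home/motiejus` stored once, the suffix case needs only the offset 6 to recover `motiejus`, saving 8 bytes for that record.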

diff --git a/README.md b/README.md
index ec9f4bd..e55b0fe 100644
--- a/README.md
+++ b/README.md
@@ -7,17 +7,17 @@ entries (i.e. system users, groups, and group memberships). It's main goal is
 performance, with focus on making [`id(1)`][id] run as fast as possible.
 
 To understand more about name service switch, start with
-[`nsswitch.conf(5)`](nsswitch).
+[`nsswitch.conf(5)`][nsswitch].
 
 Design & constraints
 --------------------
 
 To be fast, the user/group database (later: DB) has to be small ([highly
-recommended background viewing](data-oriented-design)). It encodes user & group
+recommended background viewing][data-oriented-design]). It encodes user & group
 information in a way that minimizes the DB size, and reduces jumping across the
-DB ("chasing pointers and polluting CPU cache").
+DB ("chasing pointers and thrashing CPU cache").
 
-For example, [`getpwnam_r(3)`](getpwnam_r) accepts a username and returns
+For example, [`getpwnam_r(3)`][getpwnam_r] accepts a username and returns
 the following user information:
 
 ```
@@ -33,14 +33,14 @@ struct passwd {
 ```
 
 Turbonss, among others, implements this call, and takes the following steps to
-resolve this:
+resolve a username to a `struct passwd*`:
 
 - Hash the username using a perfect hash function. Perfect hash function
-  returns a number between [0,N], where N is the total number of users.
-- Jump to a known location in the DB (by pointer arithmetic) which links the
-  user's index to the user's information. That is an index to a different
-  location within the DB.
-- Jump to the location which stores the full user information.
+  returns a number `n ∈ [0,N-1]`, where N is the total number of users.
+- Jump to the `n`'th location in the DB (by pointer arithmetic) which contains
+  the index `i` to the user's information.
+- Jump to the location `i` (pointer arithmetic) which stores the full user
+  information.
 - Decode the user information (which is all in a continuous memory block) and
   return it to the caller.
 
@@ -48,7 +48,12 @@ In total, that's one hash for the username (~150ns), two pointer jumps within
 the group file, and, now that the user record is found, `memcpy` for each
 field.
 
-This tight packing places some constraints on the underlying data:
+The turbonss DB file is `mmap`-ed, making it simple to implement pointer
+arithmetic and jumping across the file. This also reduces memory usage,
+especially across multiple concurrent invocations of the `id` command. The
+consumed heap space for each separate turbonss instance will be minimal.
+
+Tight packing places some constraints on the underlying data:
 
 - Maximum database size: 4GB.
 - Maximum length of username and groupname: 32 bytes.
@@ -83,54 +88,53 @@ remarks on `id(1)`
 ------------------
 
 A known implementation runs id(1) at ~250 rps sequentially on ~20k users and
-~10k groups. Our target is 10k id/s.
+~10k groups. Our target is 10k id/s for the same payload.
 
-`id(1)` works as follows:
+To better reason about the trade-offs, it is useful to understand how `id(1)`
+is implemented, in rough terms:
 - lookup user by name.
 - get all additional gids (an array attached to a member).
-- for each additional gid, get the group name.
+- for each additional gid, get the group information (`struct group*`).
 
 Assuming a member is in ~100 groups on average, that's 1M group lookups per
 second. We need to convert gid to a group index, and group index to a group
 gid/name quickly.
 
 Caveat: `struct group` contains an array of pointers to names of group members
-(`char **gr_mem`). However, `id` does not use that information, resulting in a
-significant read amplification. Therefore, if `argv[0] == "id"`, `getgrid(3)`
-will return group without the members. This speeds up `id` by about 10x on a
-known NSS implementation.
+(`char **gr_mem`). However, `id` does not use that information, resulting in
+read amplification. Therefore, if `argv[0] == "id"`, our implementation of
+`getgrgid(3)` returns the `struct group*` without the members. This speeds up
+`id` by about 10x on a known NSS implementation.
 
-Because `getgrid(3)` does not use the group members' information, the group
-members are stored in a different location, making the `Groups` section
-smaller, thus more CPU-cache-friendly.
+Relatedly, because `getgrgid(3)` does not need the group members, the group
+members are stored in a different DB section, making the `Groups` section
+smaller, thus more CPU-cache-friendly in the hot path.
 
 Indices
 -------
 
-The following operations need to be fast, in order of importance:
+Now that we've sketched the implementation of `id(1)`, it is easier to see
+which operations need to be fast; in order of importance:
 
-1. lookup gid -> group (this is on hot path in id) with or without members (2
-   separate calls).
+1. lookup gid -> group info (this is on hot path in id) without members.
 2. lookup uid -> user.
 3. lookup groupname -> group.
 4. lookup username -> user.
-5. (optional) iterate users using a defined order (`getent passwd`).
-6. (optional) iterate groups using a defined order (`getent group`).
 
-First 4 can use perfect hashing like [cmph][cmph]: it hashes a list of bytes to
-a sequential list of integers. Perfect hashing algorithms require some space,
-and take some time to calculate ("hashing duration"). I've tested BDZ, which
-hashes [][]u8 to a sequential list of integers (not preserving order) and CHM, which
-does the same, but preserves order. BDZ accepts an argument 3 <= b <= 10.
+These indices can use perfect hashing like [cmph][cmph]: a perfect hash maps
+a list of byte strings to a sequential list of integers. Perfect hashing
+algorithms require some space, and take some time to calculate ("hashing
+duration"). I've tested BDZ, which hashes [][]u8 to a sequential list of
+integers (not preserving order), and CHM, which does the same but preserves
+order. BDZ accepts an optional argument `3 <= b <= 10`.
 
-BDZ: tried b=3, b=7 (default), and b=10.
-
-* BDZ algorithm requires (900KB, 338KB, 306KB, respectively) for 1M values.
-* Latency to resolve 1M keys: (170ms, 180ms, 230ms).
+* BDZ algorithm requires 900KB (b=3), 338KB (b=7), 306KB (b=10) for 1M values.
+* Latency to resolve 1M keys: 170ms, 180ms, 230ms, respectively.
 * Packed vs non-packed latency differences are not meaningful.
 
 CHM retains order, however, 1M keys weigh 8MB. 10k keys are ~20x larger with
-CHM than with BDZ, eliminating the benefit of preserved ordering.
+CHM than with BDZ, eliminating the benefit of preserved ordering: we can just
+have a separate index.
 
 Turbonss header
 ---------------
@@ -168,6 +172,11 @@ and the header block fits to 64 bytes anyway, we are keeping them as u32 now.
 Primitive types
 ---------------
 
+`User` and `Group` entries are sorted by name, ordered by their Unicode
+codepoints. They are byte-aligned (8 bits). All `User` and `Group` entries are
+referred to by their byte offset in the `Users` and `Groups` sections, relative
+to the beginning of the respective section.
+
 ```
 const Group = struct {
     gid: u32,
@@ -187,9 +196,9 @@ const User = struct {
     // shell is a different story, documented elsewhere.
     shell_here: u1,
     shell_len_or_place: u6,
-    home_len: u6,
-    username_pos: u1,
-    username_len: u5,
+    homedir_len: u6,
+    username_is_a_suffix: u1,
+    username_offset_or_len: u5,
     gecos_len: u8,
     // a variable-sized array that will be stored immediately after this
     // struct.
@@ -197,8 +206,28 @@ const User = struct {
 }
 ```
 
-`User` and `Group` entries are sorted by name, ordered by their unicode
-codepoints.
+`stringdata` contains a few string entries:
+- homedir.
+- username.
+- gecos.
+- shell (optional).
+
+The first byte of the homedir is stored right after the `gecos_len` field, and
+its length is `homedir_len`. The same logic applies to all the `stringdata`
+fields: their relative position can be calculated from the lengths of the
+fields before them.
+
+Additionally, two optimizations for special fields are made:
+- shells are often shared across different users, see the "Shells" section.
+- username is frequently a suffix of the homedir. For example, `/home/motiejus`
+  and `motiejus`. In that case storing both username and homedir strings is
+  wasteful. To handle this, the username is stored in one of two ways:
+  1. `username_is_a_suffix=true`: username is a suffix of the home dir. In that
+  case, the username starts at the `username_offset_or_len`'th byte of the
+  homedir, and ends at the same place as the homedir.
+  2. `username_is_a_suffix=false`: username is stored separately. In that case,
+  it begins at the first byte after the homedir, and its length is
+  `username_offset_or_len`.
 
 Shells
 ------
@@ -221,14 +250,21 @@ to it's index in the external storage.
 Shells in the external storage are sorted by their weight, which is
 `length*frequency`.
 
+Variable-length integers (varints)
+----------------------------------
+
+A varint is an efficiently encoded integer (packed for small values). Same as
+[protocol buffer varints][varint], except the largest possible value is the
+maximum `u64`. They compress small integers well.
+
 `groupmembers`, `additional_gids`
 ---------------------------------
 
 `groupmembers` and `additional_gids` store group and user memberships
-respectively: for each group, a list of pointers ("offsets") to User records,
-and for each user — a list of pointers to Group records. These fields are
-always used in their entirety — making random-access not required, thus
-suitable for tight packing.
+respectively: for each group, a list of pointers (offsets) to User records, and
+for each user, a list of pointers to Group records. These fields are always
+used in their entirety, so random access is not required, which makes them
+suitable for tight packing.
 
 An entry of `groupmembers` and `additional_gids` looks like this piece of
 pseudo-code:
@@ -242,10 +278,12 @@ const Groupmembers = PackedList;
 const AdditionalGids = PackedList;
 ```
 
-The single entry in `members` field points to an offset into a `User` or
-`Group` entry (number of bytes relative to the first entry of the respective
-type). The `members` field in `PackedList` is sorted by the name (`username` or
-`groupname`) of the record it is pointing to.
+An entry in the `members` field is an offset to the respective `User` or
+`Group` entry (number of bytes relative to the first entry of that type).
+`members` in `PackedList` is sorted by the name (`username` or `groupname`) of
+the record it is pointing to.
+
+A packed list is a list of varints.
 
 Complete file structure
 -----------------------
@@ -270,3 +308,4 @@ Complete file structure
 [nsswitch]: https://linux.die.net/man/5/nsswitch.conf
 [data-oriented-design]: https://media.handmade-seattle.com/practical-data-oriented-design/
 [getpwnam_r]: https://linux.die.net/man/3/getpwnam_r
+[varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints