turbonss

Turbonss is a plugin for GNU Name Service Switch (NSS) functionality of GNU C Library (glibc). Turbonss implements lookup for user and passwd database entries (i.e. system users, groups, and group memberships). It's main goal is performance, with focus on making id(1) run as fast as possible.

Turbonss is optimized for reading. If the data changes in any way, the whole file will need to be regenerated (and tooling only supports only full generation). It was created, and best suited, for environments that have a central user & group database which then needs to be distributed to many servers/services, and the data does not change very often (e.g. hourly).

To understand more about name service switch, start with nsswitch.conf(5).

Design & constraints

To be fast, the user/group database (later: DB) has to be small (background). It encodes user & group information in a way that minimizes the DB size, and reduces jumping across the DB ("chasing pointers and thrashing CPU cache").

To understand how this is done efficiently, let's analyze the getpwnam_r(3) in high level. This API call accepts a username and returns the following user information:

struct passwd {
   char   *pw_name;       /* username */
   char   *pw_passwd;     /* user password */
   uid_t   pw_uid;        /* user ID */
   gid_t   pw_gid;        /* group ID */
   char   *pw_gecos;      /* user information */
   char   *pw_dir;        /* home directory */
   char   *pw_shell;      /* shell program */
};

Turbonss, among others, implements this call, and takes the following steps to resolve a username to a struct passwd*:

Open the DB (using mmap) and interpret it's first 64 bytes as a *struct Header. The header stores offsets to the sections of the file. This needs to be done once, when the NSS library is loaded.
Hash the username using a perfect hash function. Perfect hash function returns a number n ∈ [0,N-1], where N is the total number of users.
Jump to the n'th location in the idx_name2user section, which contains the index i to the user's information.
Jump to the location i of section Users, which stores the full user information.
Decode the user information (which is all in a continuous memory block) and return it to the caller.

In total, that's one hash for the username (~150ns), two pointer jumps within the group file (to sections idx_name2user and Users), and, now that the user record is found, memcpy for each field.

The turbonss DB file is be mmap-ed, making it simple to jump across the file using pointer arithmetic. This also reduces memory usage, as the mmap'ed regions are shared. Turbonss reads do not consume any heap space.

Tight packing places some constraints on the underlying data:

Permitted length of username and groupname: 1-32 bytes.
Permitted length of shell and home: 1-64 bytes.
Permitted comment ("gecos") length: 0-255 bytes.
User name, groupname, gecos and shell must be utf8-encoded.

Sorting is stable. In v0:

Groups are sorted by gid, ascending.
Users are sorted by their name, ascending by the unicode codepoints (locale-independent).

Checking out and building

$ git clone --recursive https://git.sr.ht/~motiejus/turbonss

Alternatively, if you forgot --recursive:

$ git submodule update --init

And run tests:

$ zig build test

Other commands will be documented as they are implemented.

This project uses git subtrac for managing dependencies. They work just like regular submodules, except all the refs of the submodules are in this repository. Repeat after me: all the submodules are in this repository. So if you have a copy of this repo, dependencies will not disappear.

remarks on `id(1)`

A known implementation runs id(1) at ~250 rps sequentially on ~20k users and ~10k groups. Our rps target is much higher.

To better reason about the trade-offs, it is useful to understand how id(1) is implemented, in rough terms:

lookup user by name (getpwent_r(3)).
get all gids for the user (getgrouplist(3)). Note: it is actually using initgroups_dyn, accepts a uid, and is very poorly documented.
for each additional gid, get the struct group* ([getgrgid_r(3)][getgrgid_r]).

Assuming a member is in ~100 groups on average, to reach 10k id/s translates to 1M group lookups per second. We need to convert gid to a group index, and group index to a group gid/name quickly.

Caveat: struct group contains an array of pointers to names of group members (char **gr_mem). However, id does not use that information, resulting in read amplification, sometimes by 10-100x. Therefore, if argv[0] == "id", our implementation of [getgrid_r(3)][getgrid] returns the struct group* without the members. This speeds up id by about 10x on a known NSS implementation.

Relatedly, because [getgrid_r(3)][getgrid] does not need the group members, the group members are stored in a different DB section, reducing the Groups section and making more of it fit the CPU caches.

Turbonss header

The turbonss header looks like this:

OFFSET     TYPE     NAME                      DESCRIPTION
   0      [4]u8     magic                     f0 9f a4 b7
   4         u8     version                   0
   5         u8     bigendian                 0 for little-endian, 1 for big-endian
   6         u8     nblocks_shell_blob        max value: 63
   7         u8     num_shells                max value: 63
   8        u32     num_groups                number of group entries
  12        u32     num_users                 number of passwd entries
  16        u32     nblocks_bdz_gid           bdz_gid section block count
  20        u32     nblocks_bdz_groupname
  24        u32     nblocks_bdz_uid
  28        u32     nblocks_bdz_username
  32        u64     nblocks_groups
  40        u64     nblocks_users
  48        u64     nblocks_groupmembers
  56        u64     nblocks_usergids

magic is 0xf09fa4b7, and version must be 0. All integers are native-endian. nblocks_* is the count of blocks of a particular section; this helps calculate the offsets to all sections.

Some numbers, like nblocks_shell_blob, num_shells, would fit to smaller number of bytes. However, interpreting [2]u6 with xxd(1) is harder than interpreting [2]u8. Therefore we are using the space we have to make these integers byte-wide.

Primitive types

User and Group entries are sorted by the order they were received in the input file. All entries are aligned to 8 bytes. All User and Group entries are referred by their byte offset in the Users and Groups section relative to the beginning of the section.

const PackedGroup = packed struct {
    gid: u32,
    groupname_len: u8, // max is 32, but have too much space here.
    // varint members_offset + (groupname_len-1)-length string
    groupdata []u8;
}

pub const PackedUser = packed struct {
    uid: u32,
    gid: u32,
    padding: u2 = 0,
    shell_len_or_idx: u6,
    shell_here: bool,
    name_is_a_suffix: bool,
    home_len: u6,
    name_len: u5,
    gecos_len: u11,
    // pseudocode: variable-sized array that will be stored immediately after
    // this struct.
    userdata []u8;
}

userdata contains a few entries:

home.
name (optional).
gecos.
shell (optional).
additional_gids_offset: varint.

First byte of home is stored right after the gecos_len field, and it's length is home_len. The same logic applies to all the stringdata fields: there is a way to calculate their relative position from the length of the fields before them.

Additionally, there are two "easy" optimizations:

shells are often shared across different users, see the "Shells" section.
name is frequently a suffix of home. For example, /home/motiejus and motiejus. In this case storing both name and home is wasteful. Therefore name has two options:
1. name_is_a_suffix=true: name is a suffix of the home dir. Then name starts at the home_len - name_len'th byte of home, and ends at the same place as home.
2. name_is_a_suffix=false: name begins one byte after home, and it's length is name_len.

The last field, additional_gids_offset, which is needed least frequently, is stored at the end.

Shells

Normally there is a limited number of separate shells even in huge user databases. A few examples: /bin/bash, /usr/bin/nologin, /bin/zsh among others. Therefore, "shells" have an optimization: they can be pointed by in the external list, or, if they are unique to the user, reside among the user's data.

63 most popular shells (i.e. referred to by at least two User entries) are stored externally in "Shells" area. The less popular ones are stored with userdata.

Shells section consists of two sub-sections: the index and the blob. The index is a list of structs which point to a location in the "blob" area:

const ShellIndex = struct {
    offset: u10,
    len: u6,
};

In the user's struct shell_here=true signifies that the shell is stored with userdata, and it's length is shell_len_or_idx. shell_here=false means it is stored in the Shells section, and it's index is shell_len_or_idx.

Variable-length integers (varints)

Varint is an efficiently encoded integer (packed for small values). Same as protocol buffer varints, except the largest possible value is u64. They compress integers well. Varints are stored for group memberships.

Group memberships

There are two group memberships at play:

Given a group (gid/name), resolve the members' names (e.g. getgrgid).
Given a username, resolve user's group gids (for initgroups(3)).

When group's memberships are resolved in (1), the same call also requires other group information: gid and group name. Therefore it makes sense to store a pointer to the group members in the group information itself. However, the memberships are not always necessary (see remarks about id(1)), therefore the memberships will be stored separately, outside of the groups section.

Similarly, when user's groups are resolved in (2), they are not always necessary (i.e. not part of struct user*), therefore the memberships themselves are stored out of bound.

Groupmembers and UserGids store group and user memberships respectively. Membership IDs are used in their entirety — not necessitating random access, thus suitable for tight packing and varint encoding.

For each group — a list of pointers (offsets) to User records, because getgr*_r returns pointers to membernames.
For each user — a list of gids, because initgroups_dyn (and friends) returns an array of gids.

An entry of Groupmembers and UserGids looks like this piece of pseudo-code:

const PackedList = struct {
    Length: varint,
    Members: [Length]varint,
}
const Groupmembers = PackedList;
const UserGids  = PackedList;

Indices

Now that we've sketched the implementation of id(3), it's clearer to understand which operations need to be fast; in order of importance:

lookup gid -> group info (this is on hot path in id) without members.
lookup username -> user's groups.
lookup uid -> user.
lookup groupname -> group.
lookup username -> user.

These indices can use perfect hashing like bdz from cmph: a perfect hash hashes a list of bytes to a sequential list of integers. Perfect hashing algorithms require some space, and take some time to calculate ("hashing duration"). I've tested BDZ, which hashes [][]u8 to a sequential list of integers (not preserving order) and CHM, preserves order. BDZ accepts an optional argument 3 <= b <= 10.

BDZ algorithm requires (b=3, 900KB, b=7, 338KB, b=10, 306KB) for 1M values.
Latency to resolve 1M keys: (170ms, 180ms, 230ms, respectively).
Packed vs non-packed latency differences are not meaningful.

CHM retains order, however, 1M keys weigh 8MB. 10k keys are ~20x larger with CHM than with BDZ, eliminating the benefit of preserved ordering: we can just have a separate index.

None of the tested perfect hashing algorithms makes the distinction between existing (in the initial dictionary) and new keys. In other words, HASH(value) will be pointing to a number n ∈ [0,N-1], regardless whether the value was in the initial dictionary. Therefore one must always confirm, after calculating the hash, that the key matches what's been hashed.

idx_* sections are of type []PackedIntArray(u29) and are pointing to the respective Groups and Users entries (from the beginning of the respective section). Since User and Group records are 8-byte aligned, u29 is used.

Database file structure

Each section is padded to 64 bytes.

SECTION               SIZE             DESCRIPTION
header                64               see "Turbonss header" section
bdz_gid               ?                bdz(gid)
bdz_groupname         ?                bdz(groupname)
bdz_uid               ?                bdz(uid)
bdz_username          ?                bdz(username)
idx_gid2group         len(group)*4     bdz->offset Groups
idx_groupname2group   len(group)*4     bdz->offset Groups
idx_uid2user          len(user)*4      bdz->offset Users
idx_name2user         len(user)*4      bdz->offset Users
shell_index           len(shells)*2    shell index array
shell_blob            <= 4032          shell data blob (max 63*64 bytes)
groups                ?                packed Group entries (8b padding)
users                 ?                packed User entries (8b padding)
groupmembers          ?                per-group delta varint memberlist (no padding)
user_gids             ?                per-user delta varint gidlist (no padding)

Section creation order:

✅ bdz_*. No depdendencies.
✅ shellIndex, shellBlob. No dependencies.
✅ userGids. No dependencies.
✅ Users. Requires userGids and shell.
✅ Groupmembers. Requires Users.
✅ Groups. Requires Groupmembers.
✅ idx_*. Requires offsets to Groups and Users.
Header.

README.md

Turbo NSS