Turbo NSS

Turbonss is a plugin for the GNU Name Service Switch (NSS) functionality of the GNU C Library (glibc). Turbonss implements lookup for the passwd and group databases (i.e. system users, groups, and group memberships). Its main goal is performance, with a focus on making id(1) run as fast as possible.

Turbonss is optimized for reading. If the data changes in any way, the whole file needs to be regenerated (the tooling supports only full generation). It was created for, and is best suited to, environments that have a central user & group database which then needs to be distributed to many servers/services.

To understand more about name service switch, start with nsswitch.conf(5).

Design & constraints

To be fast, the user/group database (later: DB) has to be small (background). It encodes user & group information in a way that minimizes the DB size, and reduces jumping across the DB ("chasing pointers and thrashing CPU cache").

To understand how this is done efficiently, let's analyze getpwnam_r(3) at a high level. This API call accepts a username and returns the following user information:

struct passwd {
   char   *pw_name;       /* username */
   char   *pw_passwd;     /* user password */
   uid_t   pw_uid;        /* user ID */
   gid_t   pw_gid;        /* group ID */
   char   *pw_gecos;      /* user information */
   char   *pw_dir;        /* home directory */
   char   *pw_shell;      /* shell program */
};

Turbonss, among others, implements this call, and takes the following steps to resolve a username to a struct passwd*:

Open the DB (using mmap) and interpret its first 48 bytes as a struct Header. The header stores offsets to the sections of the file. This needs to be done once, when the NSS library is loaded (or on the first call).
Hash the username using a perfect hash function. A perfect hash function returns a number n ∈ [0,N-1], where N is the total number of users.
Jump to the n'th location in the idx_username2user section (by pointer arithmetic), which contains the index i to the user's information.
Jump to the location i of section Users (again, using pointer arithmetic) which stores the full user information.
Decode the user information (which is all in a contiguous memory block) and return it to the caller.

In total, that's one hash for the username (~150ns), two pointer jumps within the DB file (to sections idx_username2user and Users), and, now that the user record is found, a memcpy for each field.
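The lookup steps above can be sketched on a toy database. This is a minimal illustration, not the turbonss code: `toy_hash`, `toy_users`, `toy_idx` and `lookup_user` are hypothetical names, the hash is a stand-in that is only perfect for the two names in the toy set, and the record layout (a length byte followed by the name) is a simplification of the real User entry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-in for the perfect hash; only meaningful for {"alice", "bob"}. */
uint32_t toy_hash(const char *name) {
    return (uint32_t)((unsigned char)name[0] - 'a');
}

/* Toy Users section: a length byte followed by the username (not the real layout). */
const uint8_t toy_users[] = {5, 'a', 'l', 'i', 'c', 'e', 3, 'b', 'o', 'b'};

/* Toy idx_username2user: hash value n -> byte offset i into toy_users. */
const uint32_t toy_idx[] = {0, 6};

/* Returns the byte offset of the user's record, or -1 if the name is unknown. */
long lookup_user(const char *name) {
    assert(name != NULL);
    uint32_t n = toy_hash(name);                 /* hash the username */
    if (n >= sizeof toy_idx / sizeof toy_idx[0]) return -1;
    uint32_t i = toy_idx[n];                     /* first jump: the index */
    const uint8_t *rec = toy_users + i;          /* second jump: the record */
    size_t len = strlen(name);
    /* A perfect hash also maps unknown keys to some slot, so compare names. */
    if ((size_t)rec[0] != len || memcmp(rec + 1, name, len) != 0) return -1;
    return (long)i;
}
```

The final comparison is an assumption about how misses would be detected: a perfect hash is only defined over the keys it was built from, so a name that is not in the DB still lands on some record.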

The turbonss DB file is mmap-ed, making it simple to implement pointer arithmetic and jumping across the file. This also reduces memory usage, especially across multiple concurrent invocations of the id command. The consumed heap space for each separate turbonss instance will be minimal.

Tight packing places some constraints on the underlying data:

Maximum database size: 4GB.
Permitted length of username and groupname: 1-32 bytes.
Permitted length of shell and homedir: 1-64 bytes.
Permitted comment ("gecos") length: 0-255 bytes.
Username, groupname and gecos must be utf8-encoded.
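A generator would have to reject input that violates the length limits above. The following is a minimal sketch of such checks. The function names are hypothetical, lengths are counted in bytes of NUL-terminated strings, and the UTF-8 requirement is deliberately not checked here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if the byte length of s is within [min, max]. */
bool len_between(const char *s, size_t min, size_t max) {
    assert(s != NULL);
    size_t n = strlen(s);
    return n >= min && n <= max;
}

/* username and groupname: 1-32 bytes */
bool valid_name(const char *s)  { return len_between(s, 1, 32); }

/* shell and homedir: 1-64 bytes */
bool valid_path(const char *s)  { return len_between(s, 1, 64); }

/* comment ("gecos"): 0-255 bytes */
bool valid_gecos(const char *s) { return len_between(s, 0, 255); }
```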

Checking out and building

$ git clone --recursive https://git.sr.ht/~motiejus/turbonss

Alternatively, if you forgot --recursive:

$ git submodule update --init

And run tests:

$ zig build test

Other commands will be documented as they are implemented.

This project uses git subtrac for managing dependencies.

Remarks on `id(1)`

A known implementation runs id(1) at ~250 rps sequentially on ~20k users and ~10k groups. Our target is 10k id/s for the same payload.

To better reason about the trade-offs, it is useful to understand how id(1) is implemented, in rough terms:

lookup user by name.
get all additional gids (an array attached to a member).
for each additional gid, get the group information (struct group*).

Assuming a member is in ~100 groups on average, that's 1M group lookups per second. We need to convert gid to a group index, and group index to a group gid/name quickly.

Caveat: struct group contains an array of pointers to names of group members (char **gr_mem). However, id does not use that information, resulting in read amplification. Therefore, if argv[0] == "id", our implementation of getgrgid(3) returns the struct group* without the members. This speeds up id by about 10x on a known NSS implementation.

Relatedly, because getgrgid(3) does not need the group members, the group members are stored in a different DB section, making the Groups section smaller, thus more CPU-cache-friendly in the hot path.
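The argv[0] check mentioned in the caveat could look roughly like this. It is a sketch under assumptions: `invoked_as_id` is a hypothetical helper, and stripping a leading path before comparing is an addition that the text above does not specify. How the NSS module gets hold of the program name is also left open; on glibc, `program_invocation_short_name` is one possibility.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* True if the program was invoked as "id", with or without a leading path. */
bool invoked_as_id(const char *argv0) {
    assert(argv0 != NULL);
    const char *slash = strrchr(argv0, '/');
    const char *base = slash ? slash + 1 : argv0;  /* part after the last '/' */
    return strcmp(base, "id") == 0;
}
```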

Indices

Now that we've sketched the implementation of id(1), it is easier to see which operations need to be fast; in order of importance:

lookup gid -> group info (this is on hot path in id) without members.
lookup uid -> user.
lookup groupname -> group.
lookup username -> user.

These indices can use perfect hashing like cmph: a perfect hash hashes a list of bytes to a sequential list of integers. Perfect hashing algorithms require some space, and take some time to calculate ("hashing duration"). I've tested BDZ, which hashes [][]u8 to a sequential list of integers (not preserving order), and CHM, which preserves order. BDZ accepts an optional argument 3 <= b <= 10.

The BDZ algorithm requires 900KB (b=3), 338KB (b=7), or 306KB (b=10) for 1M values.
Latency to resolve 1M keys: 170ms, 180ms, and 230ms, respectively.
Packed vs non-packed latency differences are not meaningful.

CHM retains order, however, 1M keys weigh 8MB. 10k keys are ~20x larger with CHM than with BDZ, eliminating the benefit of preserved ordering: we can just have a separate index.

Turbonss header

The turbonss header looks like this:

OFFSET     TYPE     NAME                          DESCRIPTION
   0      [4]u8     magic                         always 0xf09fa4b7
   4         u8     version                       now `0`
   5        u16     bom                           0x1234
   7         u6     num_shells                    max value: 63
   8        u32     num_users                     number of passwd entries
  12        u32     num_groups                    number of group entries
  16        u32     offset_cmph_uid2user
  20        u32     offset_cmph_groupname2group
  24        u32     offset_cmph_username2user
  28        u32     offset_idx                    offset to the first idx_ section
  32        u32     offset_groups
  36        u32     offset_users
  40        u32     offset_groupmembers
  44        u32     offset_additional_gids

magic is 0xf09fa4b7, and version must be 0. All integers are native-endian. bom is a byte-order-mark. It must resolve to 0x1234 (4660). If that's not true, the file is consumed in a different endianness than it was created with. Turbonss files cannot be moved across different-endianness computers. If that happens, turbonss will refuse to read the file.
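The header checks described above can be sketched as follows. This is an illustration, not the turbonss implementation: `header_ok` and `header_u32` are hypothetical names, the byte offsets are taken from the header table, and the on-disk order of the magic bytes (f0, 9f, a4, b7) is an assumption based on its [4]u8 type. Fields are copied out with memcpy because bom sits at the unaligned offset 5.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HEADER_SIZE 48u

/* Returns true if buf starts with a header this reader can consume. */
bool header_ok(const uint8_t *buf, size_t len) {
    static const uint8_t magic[4] = {0xf0, 0x9f, 0xa4, 0xb7};
    uint16_t bom;

    assert(buf != NULL);
    if (len < HEADER_SIZE) return false;
    if (memcmp(buf, magic, sizeof magic) != 0) return false; /* offset 0 */
    if (buf[4] != 0) return false;                           /* version */
    memcpy(&bom, buf + 5, sizeof bom);  /* native-endian read of bom */
    return bom == 0x1234;               /* mismatch: written with other endianness */
}

/* Reads the native-endian u32 at a byte offset, e.g. 8 for num_users. */
uint32_t header_u32(const uint8_t *buf, size_t offset) {
    uint32_t v;
    assert(offset + sizeof v <= HEADER_SIZE);
    memcpy(&v, buf + offset, sizeof v);
    return v;
}
```

A file written on a machine with the opposite byte order would yield 0x3412 for bom, which is how the mismatch shows up.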

Offsets are indices to further sections of the file, with zero being the first block (pointing to the magic field). As all blobs are 64-byte aligned, the offsets always point to the beginning of a 64-byte "block". Therefore, all offset_* values could be u26. As u32 is easier to visualize with xxd, and the header block fits in 64 bytes anyway, we are keeping them as u32 now.
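The u26 claim follows from the 4GB limit: 2^32 bytes divided into 64-byte blocks gives 2^26 blocks. A small sketch of the conversion, with `block_to_byte` being a hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>

/* Converts an offset_* value (counted in 64-byte blocks) to a byte offset. */
uint64_t block_to_byte(uint32_t block) {
    assert(block < (UINT32_C(1) << 26));  /* 4GB / 64 bytes = 2^26 blocks */
    return (uint64_t)block << 6;
}
```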

Sections whose lengths can be calculated do not have a corresponding offset_* header field. For example, cmph_gid2group comes immediately after the header, and idx_groupname2group comes after idx_gid2group, whose offset is offset_idx, and size can be calculated.

Primitive types

User and Group entries are sorted by name, ordered by their unicode codepoints. They are aligned to 8 bytes. All User and Group entries are referred to by their byte offset in the Users and Groups sections, relative to the beginning of the section.

const Group = struct {
    gid: u32,
    // index to a separate structure with a list of members. The memberlist is
    // always 2^5-byte aligned (32b), this is an index there.
    members_offset: u27,
    groupname_len: u5,
    // a groupname_len-sized string
    groupname: []u8,
};

const User = struct {
    uid: u32,
    gid: u32,
    // pointer to a separate structure that contains a list of gids
    additional_gids_offset: u29,
    // shell is a different story, documented elsewhere.
    shell_here: u1,
    shell_len_or_place: u6,
    homedir_len: u6,
    username_is_a_suffix: u1,
    username_offset_or_len: u5,
    gecos_len: u8,
    // a variable-sized array that will be stored immediately after this
    // struct.
    stringdata: []u8,
};

stringdata contains a few string entries:

homedir.
username.
gecos.
shell (optional).

The first byte of the homedir is stored right after the gecos_len field, and its length is homedir_len. The same logic applies to all the stringdata fields: there is a way to calculate their relative position from the length of the fields before them.

Additionally, two optimizations for special fields are made:

shells are often shared across different users, see the "Shells" section.
username is frequently a suffix of the homedir, for example /home/motiejus and motiejus. In that case storing both the username and the homedir strings is wasteful. To handle this, username has two options:
1. username_is_a_suffix=true: username is a suffix of the homedir. The username starts at the username_offset_or_len'th byte of the homedir and ends at the same place as the homedir.
2. username_is_a_suffix=false: username is stored separately. It begins one byte after homedir, and its length is username_offset_or_len.
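The two username options can be sketched as a small decoder. This is a simplified illustration with hypothetical names (`str_view`, `username_of`). It assumes that the homedir starts at the beginning of stringdata, that the length arguments are plain byte counts (any bias needed to fit 1-64 into a u6 is ignored), and that a separately stored username starts at the byte immediately following the homedir.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A non-owning view into stringdata; the bytes are not NUL-terminated. */
struct str_view {
    const char *ptr;
    size_t len;
};

struct str_view username_of(const char *stringdata, size_t homedir_len,
                            bool username_is_a_suffix,
                            size_t username_offset_or_len) {
    struct str_view v;
    if (username_is_a_suffix) {
        /* username shares its bytes with the tail of the homedir */
        assert(username_offset_or_len <= homedir_len);
        v.ptr = stringdata + username_offset_or_len;
        v.len = homedir_len - username_offset_or_len;
    } else {
        /* username is stored right behind the homedir */
        v.ptr = stringdata + homedir_len;
        v.len = username_offset_or_len;
    }
    return v;
}
```

For /home/motiejus (14 bytes) with username_offset_or_len=6, the view covers the last 8 bytes, motiejus, without storing the name twice.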

Shells

Normally there is a limited number of shells even in huge user databases. A few examples: /bin/bash, /usr/bin/nologin, /bin/zsh, among others. Therefore, "shells" have an optimization: a shell is either referenced from an external list or resides among the user's data.

The 63 most popular shells (i.e. referred to by at least two User entries) are stored externally in the "Shells" area. The less popular ones are stored with userdata.

There are two "Shells" areas: the index and the blob. The index is a list of structs which point to a location in the "blob" area:

const ShellIndex = struct {
    offset: u10,
    len: u6,
};

In the user's struct, the shell_here=true bit signifies that the shell is stored with userdata; false means it is stored in the Shells section. If the shell is stored "here", it is the first element in stringdata, and its length is shell_len_or_place. If it is stored externally, shell_len_or_place holds its index in the ShellIndex area.

Shells in the external storage are sorted by their weight, which is length*frequency.
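Ordering by weight can be sketched with qsort. The names below (`shell_stat`, `shell_weight`, `by_weight_desc`, `sort_shells`) are hypothetical, and putting the heaviest shell first is an assumption: the text above only says that the list is sorted by length*frequency.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct shell_stat {
    const char *path;  /* e.g. "/bin/bash" */
    size_t users;      /* number of User entries referring to this shell */
};

/* weight = length * frequency */
size_t shell_weight(const struct shell_stat *s) {
    return strlen(s->path) * s->users;
}

/* qsort comparator: heavier shells first (the direction is an assumption). */
int by_weight_desc(const void *a, const void *b) {
    size_t wa = shell_weight(a), wb = shell_weight(b);
    return (wa < wb) - (wa > wb);
}

void sort_shells(struct shell_stat *shells, size_t n) {
    assert(shells != NULL || n == 0);
    qsort(shells, n, sizeof *shells, by_weight_desc);
}
```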

Variable-length integers (varints)

A varint is an efficiently encoded integer (packed for small values). It is the same as a protocol buffer varint, except that the largest representable value is that of a u64. Varints compress small integers well.
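For reference, here is a minimal sketch of the protocol-buffer-style encoding: 7 bits of payload per byte, least significant group first, with the high bit set on every byte except the last. The function names are hypothetical and this is not the turbonss code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encodes v into out (up to 10 bytes) and returns the number of bytes written. */
size_t varint_encode(uint64_t v, uint8_t *out) {
    size_t n = 0;
    assert(out != NULL);
    while (v >= 0x80) {
        out[n++] = (uint8_t)(v | 0x80);  /* low 7 bits plus continuation bit */
        v >>= 7;
    }
    out[n++] = (uint8_t)v;               /* last byte: continuation bit clear */
    return n;
}

/* Decodes one varint from in[0..len). Returns the number of bytes consumed,
 * or 0 if the input is truncated or longer than 10 bytes. */
size_t varint_decode(const uint8_t *in, size_t len, uint64_t *out) {
    uint64_t v = 0;
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < len && i < 10; i++) {
        v |= (uint64_t)(in[i] & 0x7f) << (7 * i);
        if (!(in[i] & 0x80)) {
            *out = v;
            return i + 1;
        }
    }
    return 0;
}
```

Values below 128 take a single byte, which is why small offsets and short lists pack tightly; the largest u64 takes 10 bytes.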

`groupmembers`, `additional_gids`

groupmembers and additional_gids store group and user memberships respectively: for each group, a list of pointers (offsets) to User records, and for each user — a list of pointers to Group records. These fields are always used in their entirety — not necessitating random access, thus suitable for tight packing.

An entry of groupmembers and additional_gids looks like this piece of pseudo-code:

const PackedList = struct {
    length: varint,
    members: []varint,
};
const Groupmembers = PackedList;
const AdditionalGids = PackedList;

An entry in the members field is the offset of the respective User or Group entry (number of bytes relative to the first entry of that type). members in PackedList is sorted by the name (username or groupname) of the record it points to.

A packed list is a list of varints.
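Reading a PackedList back could look like the sketch below. It assumes that length counts the members (not bytes) and that varints use the protocol-buffer layout; `read_varint` and `packed_list_decode` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Reads one varint at *pos and advances *pos; false on truncated input. */
bool read_varint(const uint8_t *buf, size_t len, size_t *pos, uint64_t *out) {
    uint64_t v = 0;
    for (unsigned shift = 0; shift < 64 && *pos < len; shift += 7) {
        uint8_t b = buf[(*pos)++];
        v |= (uint64_t)(b & 0x7f) << shift;
        if (!(b & 0x80)) {
            *out = v;
            return true;
        }
    }
    return false;
}

/* Decodes `length` followed by that many members. Returns the member count,
 * or -1 if the input is truncated or does not fit into `cap` slots. */
long packed_list_decode(const uint8_t *buf, size_t len,
                        uint64_t *members, size_t cap) {
    size_t pos = 0;
    uint64_t count;
    assert(buf != NULL && members != NULL);
    if (!read_varint(buf, len, &pos, &count) || count > cap) return -1;
    for (uint64_t k = 0; k < count; k++)
        if (!read_varint(buf, len, &pos, &members[k])) return -1;
    return (long)count;
}
```

The list is consumed front to back in one pass, which matches the observation that these fields are always read in their entirety.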

Complete file structure

idx_* sections are of type []PackedIntArray(u29) and are pointing to the respective Groups and Users entries (from the beginning of the respective section). Since User and Group records are 8-byte aligned, 3 bits are saved from every element.

Each section is padded to 64 bytes.
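The size and access arithmetic for a packed u29 index can be sketched as follows. All names are hypothetical. The bit order is an assumption (element k occupies bits 29k to 29k+28, counted from the least significant bit of byte 0), as is the reading that a stored value is the record's byte offset divided by 8; the actual layout depends on the packed array implementation used by the generator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rounds a section size up to the next multiple of 64 bytes. */
size_t pad64(size_t n) {
    return (n + 63) & ~(size_t)63;
}

/* Bytes needed for n packed 29-bit elements, before padding. */
size_t idx_bytes(size_t n) {
    return (n * 29 + 7) / 8;
}

/* Reads element k under the assumed little-endian bit order. */
uint32_t packed_u29_get(const uint8_t *buf, size_t k) {
    size_t bit = k * 29;
    size_t byte = bit / 8;
    unsigned shift = (unsigned)(bit % 8);
    size_t nbytes = (shift + 29 + 7) / 8;  /* 4 or 5 bytes cover the element */
    uint64_t w = 0;
    assert(buf != NULL);
    for (size_t j = 0; j < nbytes; j++)
        w |= (uint64_t)buf[byte + j] << (8 * j);
    return (uint32_t)((w >> shift) & 0x1fffffff);
}

/* Records are 8-byte aligned, so the three low bits are not stored. */
uint64_t idx_to_byte_offset(uint32_t value) {
    return (uint64_t)value << 3;
}
```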

SECTION               SIZE                 DESCRIPTION
Header                48                   see "Turbonss header" section
cmph_gid2group        ?                    gid->group cmph
cmph_uid2user         ?                    uid->user cmph
cmph_groupname2group  ?                    groupname->group cmph
cmph_username2user    ?                    username->user cmph
idx_gid2group         len(group)*4*29/32   cmph->offset gid2group
idx_groupname2group   len(group)*4*29/32   cmph->offset groupname2group
idx_uid2user          len(user)*4*29/32    cmph->offset uid2user
idx_username2user     len(user)*4*29/32    cmph->offset username2user
ShellIndex            len(shells)*2        Shell index array
ShellBlob             <= 4032              Shell data blob (max 63*64 bytes)
Groups                ?                    packed Group entries (8b padding)
Users                 ?                    packed User entries (8b padding)
groupmembers          ?                    per-group memberlist (32b padding)
additional_gids       ?                    per-user grouplist (8b padding)
