Turbo NSS

Turbonss is a plugin for the Name Service Switch (NSS) functionality of the GNU C Library (glibc). Turbonss implements lookups for passwd and group database entries (i.e. system users, groups, and group memberships). Its main goal is performance, with a focus on making id(1) run as fast as possible.

To understand more about name service switch, start with nsswitch.conf(5).

Design & constraints

To be fast, the user/group database (hereafter: DB) has to be small (highly recommended background viewing). Turbonss encodes user and group information in a way that minimizes the DB size and reduces jumping across the DB ("chasing pointers and polluting CPU cache").

For example, getpwnam_r(3) accepts a username and returns the following user information:

struct passwd {
   char   *pw_name;       /* username */
   char   *pw_passwd;     /* user password */
   uid_t   pw_uid;        /* user ID */
   gid_t   pw_gid;        /* group ID */
   char   *pw_gecos;      /* user information */
   char   *pw_dir;        /* home directory */
   char   *pw_shell;      /* shell program */
};
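
For reference, a caller typically reaches this code path as in the minimal sketch below. getpwnam_r(3) and struct passwd are the standard interfaces; the wrapper name lookup_uid, its return codes, and the fixed buffer size are illustrative assumptions.

```c
#include <pwd.h>
#include <stddef.h>
#include <sys/types.h>

/* Look up `name` through NSS and store its uid in *uid_out.
 * Returns 0 if found, 1 if there is no such user, -1 on error.
 * (lookup_uid is a hypothetical helper, not part of turbonss.) */
int lookup_uid(const char *name, uid_t *uid_out) {
    struct passwd pwd;
    struct passwd *result = NULL;
    char buf[16384]; /* backing storage for the string fields of pwd */

    int rc = getpwnam_r(name, &pwd, buf, sizeof buf, &result);
    if (rc != 0)
        return -1; /* lookup failed, e.g. ERANGE when buf is too small */
    if (result == NULL)
        return 1;  /* no matching entry */
    *uid_out = result->pw_uid;
    return 0;
}
```

Every char * field in the returned struct points into buf, which is why the NSS module ends up copying each string field into caller-provided memory.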

Turbonss implements this call, among others, and resolves it in the following steps:

Hash the username using a perfect hash function. The perfect hash function returns a number in the range [0, N), where N is the total number of users.
Jump to a known location in the DB (by pointer arithmetic) which links the user's index to the user's information. The value stored there is an index to a different location within the DB.
Jump to the location which stores the full user information.
Decode the user information (which is all in a continuous memory block) and return it to the caller.

In total, that's one hash for the username (~150ns), two pointer jumps within the DB file, and, once the user record is found, a memcpy for each field.
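
The steps above can be sketched in a few lines of C. Everything here is a toy stand-in: the tables, the names, and toy_hash (which is collision-free only for the three names listed) are hypothetical and do not reflect the real DB layout or the real hash function. The final comparison is there because a perfect hash also maps unknown keys to some valid slot.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for two DB sections. */
struct toy_user { const char *name; uint32_t uid; };

/* "Full user information", sorted by name. */
static const struct toy_user toy_users[] = {
    {"alex", 1000}, {"alice", 1001}, {"bob", 1002},
};
/* Hash slot -> position of the record in toy_users. */
static const uint32_t toy_index[] = {2, 0, 1};

/* Toy perfect hash: "bob" -> 0, "alex" -> 1, "alice" -> 2. */
static uint32_t toy_hash(const char *name) {
    return (uint32_t)(strlen(name) % 3);
}

/* Returns the matching record, or NULL if `name` is not in the DB. */
const struct toy_user *toy_lookup(const char *name) {
    uint32_t slot = toy_hash(name);             /* step 1: hash */
    uint32_t pos = toy_index[slot];             /* step 2: slot -> record index */
    const struct toy_user *u = &toy_users[pos]; /* step 3: jump to the record */
    /* Unknown names still hash to a valid slot, so confirm the match. */
    return strcmp(u->name, name) == 0 ? u : NULL;
}
```

For example, toy_lookup("eve") lands on bob's slot and is rejected by the final comparison.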

This tight packing places some constraints on the underlying data:

Maximum database size: 4GB.
Maximum length of username and groupname: 32 bytes.
Maximum length of shell and homedir: 64 bytes.
Maximum comment ("gecos") length: 256 bytes.
Username and groupname must be utf8-encoded.

Checking out and building

$ git clone --recursive https://git.sr.ht/~motiejus/turbonss

Alternatively, if you forgot --recursive:

$ git submodule update --init

And run tests:

$ zig build test

Other commands will be documented as they are implemented.

This project uses git subtrac for managing dependencies.

Remarks on `id(1)`

A known implementation runs id(1) at ~250 rps sequentially on ~20k users and ~10k groups. Our target is 10k id/s.

id(1) works as follows:

Look up the user by name.
Get all additional gids (an array attached to the user).
For each additional gid, get the group name.

Assuming a user is in ~100 groups on average, the 10k id/s target translates to 1M group lookups per second. We need to convert gid to a group index, and group index to a group gid/name quickly.

Caveat: struct group contains an array of pointers to the names of group members (char **gr_mem). However, id does not use that information, resulting in significant read amplification. Therefore, if argv[0] == "id", getgrgid(3) will return the group without its members. This speeds up id by about 10x on a known NSS implementation.

Because getgrgid(3) does not use the group members' information, the group members are stored in a different location, making the Groups section smaller, thus more CPU-cache-friendly.
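
The program-name check could look like the sketch below. The function name invoked_as_id is hypothetical. An NSS module does not receive argv directly; on glibc the GNU extension program_invocation_short_name (declared in errno.h under _GNU_SOURCE) exposes the same information, and whether turbonss uses it is an assumption.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if the program was invoked as "id" (through any path),
 * 0 otherwise. Hypothetical helper for illustration. */
int invoked_as_id(const char *argv0) {
    if (argv0 == NULL)
        return 0;
    const char *slash = strrchr(argv0, '/');
    const char *base = slash ? slash + 1 : argv0; /* strip the directory part */
    return strcmp(base, "id") == 0;
}
```

Comparing the full basename (rather than a prefix) keeps programs such as "idle" on the regular code path.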

Indices

The following operations need to be fast, in order of importance:

lookup gid -> group (this is on hot path in id) with or without members (2 separate calls).
lookup uid -> user.
lookup groupname -> group.
lookup username -> user.
(optional) iterate users using a defined order (getent passwd).
(optional) iterate groups using a defined order (getent group).

The first four can use perfect hashing, for example with cmph: it hashes a list of byte strings to a sequential list of integers. Perfect hashing algorithms require some space, and take some time to calculate ("hashing duration"). I've tested BDZ, which hashes [][]u8 to a sequential list of integers (not preserving order), and CHM, which does the same but preserves order. BDZ accepts an argument 3 <= b <= 10.

BDZ: tried b=3, b=7 (default), and b=10.

The BDZ algorithm requires 900KB, 338KB, and 306KB, respectively, for 1M values.
Latency to resolve 1M keys: 170ms, 180ms, and 230ms, respectively.
Packed vs non-packed latency differences are not meaningful.

CHM retains order; however, 1M keys weigh 8MB. 10k keys are ~20x larger with CHM than with BDZ, eliminating the benefit of preserved ordering.

Turbonss header

The turbonss header looks like this:

OFFSET     TYPE     NAME                          DESCRIPTION
   0      [4]u8     magic                         always 0xf09fa4b7
   4         u8     version                       now `0`
   5        u16     bom                           0x1234
   7         u8     padding
   8        u32     num_users                     number of passwd entries
  12        u32     num_groups                    number of group entries
  16        u32     offset_cmph_gid2group
  20        u32     offset_cmph_uid2user
  24        u32     offset_cmph_groupname2group
  28        u32     offset_cmph_username2user
  32        u32     offset_groupmembers
  36        u32     offset_additional_gids

magic is 0xf09fa4b7, and version must be 0. All integers are native-endian. bom is a byte-order mark: it must read back as 0x1234 (4660). If it does not, the file is being consumed with a different endianness than it was created with. Turbonss files cannot be moved between computers of different endianness; if that happens, turbonss will refuse to read the file.
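
A reader-side check of these three fields might look like the following sketch. The function name, the return codes, and the assumption that the magic bytes appear in the file in the order written above (f0 9f a4 b7) are illustrative, not taken from the turbonss sources.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TURBONSS_HEADER_SIZE 40 /* last field sits at offset 36 and is 4 bytes wide */

/* Returns 0 if the header is acceptable, or a negative code naming
 * the check that failed. Hypothetical helper for illustration. */
int turbonss_check_header(const uint8_t *buf, size_t len) {
    static const uint8_t magic[4] = {0xf0, 0x9f, 0xa4, 0xb7};
    uint16_t bom;

    if (len < TURBONSS_HEADER_SIZE)
        return -1; /* truncated file */
    if (memcmp(buf, magic, sizeof magic) != 0)
        return -2; /* not a turbonss file */
    if (buf[4] != 0)
        return -3; /* unsupported version */
    /* bom lives at the unaligned offset 5, so copy it out instead of casting. */
    memcpy(&bom, buf + 5, sizeof bom);
    if (bom != 0x1234)
        return -4; /* written on a machine with the other endianness */
    return 0;
}
```

On a machine of the opposite endianness the same two bytes read back as 0x3412, which is what the last check catches.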

Offsets are indices to further sections of the file, with zero being the first block (pointing to the magic field). As all blobs are 64-byte aligned, the offsets always point to the beginning of a 64-byte "block". Therefore, all offset_* values could be u26. As u32 is easier to visualize with xxd, and the header block fits in 64 bytes anyway, we are keeping them as u32 for now.
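
The u26 figure follows from the size limit: a 4GB file is 2^32 bytes, and 2^32 / 2^6 = 2^26 blocks. A small sketch of the conversions, with hypothetical helper names:

```c
#include <stdint.h>

#define TURBONSS_BLOCK_BITS 6 /* 1 << 6 == 64-byte blocks */

/* Convert a block index, as stored in an offset_* field, to a byte offset. */
uint64_t block_to_byte(uint32_t block) {
    return (uint64_t)block << TURBONSS_BLOCK_BITS;
}

/* Round a section size up to a whole number of blocks. */
uint64_t align_to_block(uint64_t nbytes) {
    const uint64_t mask = (1u << TURBONSS_BLOCK_BITS) - 1;
    return (nbytes + mask) & ~mask;
}
```

The largest u26 block index, 2^26 - 1, maps to byte offset 2^32 - 64, the last block of a 4GB file.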

Primitive types

const Group = struct {
    gid: u32,
    // index to a separate structure with a list of members. The memberlist is
    // always 2^5-byte aligned (32 bytes); this is an index there.
    members_offset: u27,
    groupname_len: u5,
    // a groupname_len-sized string
    groupname: []u8,
};

const User = struct {
    uid: u32,
    gid: u32,
    // pointer to a separate structure that contains a list of gids
    additional_gids_offset: u29,
    // shell is a different story, documented elsewhere.
    shell_here: u1,
    shell_len_or_place: u6,
    home_len: u6,
    username_pos: u1,
    username_len: u5,
    gecos_len: u8,
    // a variable-sized array that will be stored immediately after this
    // struct.
    stringdata: []u8,
};

User and Group entries are sorted by name, ordered by their unicode codepoints.
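
To illustrate how the sub-byte fields fit together, the sketch below packs and unpacks the two Group fields that share one 32-bit word (27 + 5 bits). The bit order is an assumption: it places the first declared field, members_offset, in the least significant bits. The helper names are hypothetical.

```c
#include <stdint.h>

#define MEMBERS_OFFSET_BITS 27
#define MEMBERS_OFFSET_MASK ((1u << MEMBERS_OFFSET_BITS) - 1)

struct group_bits {
    uint32_t members_offset; /* index of a 32-byte-aligned member list */
    uint8_t groupname_len;   /* 5-bit length field */
};

/* Combine the two fields into one 32-bit word (assumed layout). */
uint32_t pack_group_word(uint32_t members_offset, uint8_t groupname_len) {
    return (members_offset & MEMBERS_OFFSET_MASK) |
           ((uint32_t)(groupname_len & 0x1f) << MEMBERS_OFFSET_BITS);
}

/* Split the word back into its two fields. */
struct group_bits unpack_group_word(uint32_t word) {
    struct group_bits g;
    g.members_offset = word & MEMBERS_OFFSET_MASK;
    g.groupname_len = (uint8_t)(word >> MEMBERS_OFFSET_BITS);
    return g;
}

/* The member list is 2^5-byte aligned, so the index scales by 32. */
uint64_t members_byte_offset(uint32_t members_offset) {
    return (uint64_t)members_offset << 5;
}
```

With 27 index bits and 32-byte alignment, members_offset can address the full 2^32-byte range allowed by the size limit.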

Shells

Normally there is a limited number of shells, even in huge user databases. A few examples: /bin/bash, /usr/bin/nologin, /bin/zsh, among others. Therefore, shells have an optimization: a shell can either be referenced from an external list or reside among the user's data.

The 64 (1<<6) most popular shells (i.e. referred to by at least two User entries) are stored externally in the "Shells" area. The less popular ones are stored with userdata.

The shell_here=true bit signifies that the shell is stored with userdata; false means it is stored in the Shells section. If the shell is stored "here", it is the first element in stringdata, and its length is shell_len_or_place. If it is stored externally, the latter variable holds its index in the external storage.
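
A decoder for this rule could be sketched as follows. The types str_view and shell_table are hypothetical, and the sketch assumes shell_len_or_place holds the length directly when the shell is stored with the user data.

```c
#include <stddef.h>
#include <stdint.h>

struct str_view { const char *ptr; size_t len; };

/* Hypothetical view of the "Shells" section: at most 64 entries. */
struct shell_table { const struct str_view *entries; size_t count; };

/* Resolve a user's shell from either stringdata or the shared table.
 * Returns {NULL, 0} if an external index is out of range. */
struct str_view resolve_shell(int shell_here, uint8_t shell_len_or_place,
                              const char *stringdata,
                              const struct shell_table *shells) {
    struct str_view out = {NULL, 0};
    if (shell_here) {
        /* Stored "here": the shell is the first element of stringdata. */
        out.ptr = stringdata;
        out.len = shell_len_or_place;
    } else if (shell_len_or_place < shells->count) {
        /* Stored externally: the field is an index into the Shells section. */
        out = shells->entries[shell_len_or_place];
    }
    return out;
}
```

Either way the caller gets a pointer and a length, so the remaining decoding does not need to know where the shell came from.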

Shells in the external storage are sorted by their weight, which is length*frequency.

`groupmembers`, `additional_gids`

groupmembers and additional_gids store group and user memberships respectively: for each group, a list of pointers ("offsets") to User records, and for each user, a list of pointers to Group records. These fields are always read in their entirety, so random access is not required, which makes them suitable for tight packing.

An entry of groupmembers and additional_gids looks like this piece of pseudo-code:

const PackedList = struct {
    length: varint,
    members: []varint,
};
const Groupmembers = PackedList;
const AdditionalGids = PackedList;

Each entry in the members field is an offset into the list of User or Group entries (number of bytes relative to the first entry of the respective type). The members field in PackedList is sorted by the name (username or groupname) of the record it points to.
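
A decoder for such a list could look like the sketch below. Two things are assumptions, since the text above does not spell them out: the varint encoding is taken to be unsigned LEB128 (7 payload bits per byte, least significant group first, high bit set on all but the last byte), and length is taken to be the number of members rather than a byte count. The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one unsigned LEB128 varint. Returns the number of bytes consumed,
 * or 0 if the input is truncated or longer than 10 bytes. Overflow in the
 * 10th byte is not detected in this sketch. */
size_t uvarint_decode(const uint8_t *buf, size_t len, uint64_t *out) {
    uint64_t value = 0;
    unsigned shift = 0;
    for (size_t i = 0; i < len && shift < 64; i++, shift += 7) {
        value |= (uint64_t)(buf[i] & 0x7f) << shift;
        if ((buf[i] & 0x80) == 0) { /* high bit clear: last byte */
            *out = value;
            return i + 1;
        }
    }
    return 0;
}

/* Decode a PackedList into `members` (capacity `cap`).
 * Returns the number of members, or -1 on malformed input. */
long packed_list_decode(const uint8_t *buf, size_t len,
                        uint64_t *members, size_t cap) {
    uint64_t count;
    size_t pos = uvarint_decode(buf, len, &count);
    if (pos == 0 || count > cap)
        return -1;
    for (uint64_t i = 0; i < count; i++) {
        size_t n = uvarint_decode(buf + pos, len - pos, &members[i]);
        if (n == 0)
            return -1;
        pos += n;
    }
    return (long)count;
}
```

Because the list is always consumed front to back, a sequential decoder like this is sufficient; no per-entry index is needed.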

Complete file structure

  SECTION              SIZE                         DESCRIPTION
  Header               1<<6                         documented above
  []Group                 ?                         list of Group entries
  []User                  ?                         list of User entries
  Shells                  ?                         documented in "Shells"
  cmph_gid2group          ?                         offset by offset_cmph_gid2group
  cmph_uid2user           ?                         offset by offset_cmph_uid2user
  cmph_groupname2group    ?                         offset by offset_cmph_groupname2group
  cmph_username2user      ?                         offset by offset_cmph_username2user
  groupmembers            ?                         offset by offset_groupmembers
  additional_gids         ?                         offset by offset_additional_gids
