Turbo NSS
Turbonss is a plugin for the GNU Name Service Switch (NSS) functionality of the GNU C
Library (glibc). Turbonss implements lookup for passwd and group database
entries (i.e. system users, groups, and group memberships). Its main goal is
performance, with a focus on making id(1) run as fast as possible.
To understand more about name service switch, start with nsswitch.conf(5).
Design & constraints
To be fast, the user/group database (later: DB) has to be small (highly recommended background viewing). It encodes user & group information in a way that minimizes the DB size, and reduces jumping across the DB ("chasing pointers and polluting CPU cache").
For example, getpwnam_r(3) accepts a username and returns the following user information:
struct passwd {
    char   *pw_name;   /* username */
    char   *pw_passwd; /* user password */
    uid_t   pw_uid;    /* user ID */
    gid_t   pw_gid;    /* group ID */
    char   *pw_gecos;  /* user information */
    char   *pw_dir;    /* home directory */
    char   *pw_shell;  /* shell program */
};
Turbonss implements this call, among others, and takes the following steps to resolve it:
- Hash the username using a perfect hash function. A perfect hash function returns a number in the range [0, N-1], where N is the total number of users.
- Jump to a known location in the DB (by pointer arithmetic) which maps the user's index to the user's information. That entry is an offset to a different location within the DB.
- Jump to the location which stores the full user information.
- Decode the user information (which is all in one contiguous memory block) and return it to the caller.
In total, that's one hash for the username (~150ns), two pointer jumps within
the DB, and, now that the user record is found, a memcpy for each field.
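The four steps above can be sketched in C. This is only an illustration under loud assumptions: the record layout, the `toy_*` names, and `toy_hash` (a stand-in that happens to be collision-free for three demo keys) are all hypothetical and are not the turbonss format or API. The final name comparison is there because a perfect hash also maps unknown keys to some slot.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical toy layout, NOT the turbonss format:
 * each record is uid (4 bytes), gid (4 bytes), name_len (1 byte), name bytes. */
struct toy_user {
    uint32_t uid;
    uint32_t gid;
    char     name[33]; /* up to 32 bytes plus NUL */
};

struct toy_db {
    const uint32_t *index;     /* hash slot -> byte offset into records */
    const uint8_t  *records;   /* packed records */
    uint32_t        num_users;
};

/* Stand-in for a generated perfect hash; collision-free only for the
 * demo keys "root", "alice" and "bob". */
static uint32_t toy_hash(const char *name)
{
    return (uint32_t)((unsigned char)name[0] % 3u);
}

/* Appends one record to buf and returns the number of bytes written. */
static uint32_t toy_put(uint8_t *buf, uint32_t uid, uint32_t gid, const char *name)
{
    size_t len = strlen(name);
    assert(len <= 32);
    memcpy(buf, &uid, 4);
    memcpy(buf + 4, &gid, 4);
    buf[8] = (uint8_t)len;
    memcpy(buf + 9, name, len);
    return (uint32_t)(9 + len);
}

/* Returns 0 and fills *out on success, -1 if the name is not in the DB. */
static int toy_lookup(const struct toy_db *db, const char *name, struct toy_user *out)
{
    uint32_t slot = toy_hash(name);              /* step 1: hash */
    if (slot >= db->num_users)
        return -1;
    uint32_t off = db->index[slot];              /* step 2: index -> offset */
    const uint8_t *rec = db->records + off;      /* step 3: jump to record */
    uint8_t len = rec[8];
    if (len > 32)
        return -1;
    memcpy(&out->uid, rec, 4);                   /* step 4: decode fields */
    memcpy(&out->gid, rec + 4, 4);
    memcpy(out->name, rec + 9, len);
    out->name[len] = '\0';
    /* A perfect hash maps unknown keys to some slot too, so verify. */
    return strcmp(out->name, name) == 0 ? 0 : -1;
}
```

To try it, append a few records with `toy_put`, store each record's starting offset at `index[toy_hash(name)]`, and call `toy_lookup`.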
This tight packing places some constraints on the underlying data:
- Maximum database size: 4GB.
- Maximum length of username and groupname: 32 bytes.
- Maximum length of shell and homedir: 64 bytes.
- Maximum comment ("gecos") length: 256 bytes.
- Username and groupname must be utf8-encoded.
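A tool that builds the DB would presumably reject entries that exceed these limits. The check below is a minimal sketch, assuming the limits are inclusive and counted in bytes; `user_fits_limits` is a hypothetical name, and UTF-8 validation is deliberately left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Limits from the list above, in bytes (assumed inclusive). */
#define MAX_NAME  32u
#define MAX_SHELL 64u
#define MAX_HOME  64u
#define MAX_GECOS 256u

static bool fits(const char *s, size_t max)
{
    return strlen(s) <= max;
}

/* True if every string field of a user entry is within the limits. */
static bool user_fits_limits(const char *name, const char *gecos,
                             const char *home, const char *shell)
{
    return fits(name, MAX_NAME) && fits(gecos, MAX_GECOS) &&
           fits(home, MAX_HOME) && fits(shell, MAX_SHELL);
}
```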
Checking out and building
$ git clone --recursive https://git.sr.ht/~motiejus/turbonss
Alternatively, if you forgot --recursive:
$ git submodule update --init
And run tests:
$ zig build test
Other commands will be documented as they are implemented.
This project uses git subtrac for managing dependencies.
Remarks on id(1)
A known implementation runs id(1) at ~250 rps sequentially on ~20k users and ~10k groups. Our target is 10k id/s.
id(1) works as follows:
- lookup user by name.
- get all additional gids (an array attached to a member).
- for each additional gid, get the group name.
Assuming a member is in ~100 groups on average, that's 1M group lookups per second. We need to convert gid to a group index, and group index to a group gid/name quickly.
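The three steps map roughly onto standard libc calls, each of which ends up in the configured NSS module. The sketch below is not the id(1) source; it is a minimal approximation using getpwnam(3), getgrouplist(3) and getgrgid(3), and `resolve_like_id` is a hypothetical name.

```c
#define _DEFAULT_SOURCE /* for getgrouplist() on glibc */
#include <assert.h>
#include <grp.h>
#include <pwd.h>
#include <stdlib.h>
#include <sys/types.h>

/* Returns the number of groups whose names could be resolved,
 * or -1 if the user is unknown or memory runs out. */
static int resolve_like_id(const char *username)
{
    assert(username != NULL);

    struct passwd *pw = getpwnam(username);       /* 1. look up user by name */
    if (pw == NULL)
        return -1;
    gid_t primary = pw->pw_gid;                   /* copy: pw is static storage */

    int n = 32;                                   /* initial guess */
    gid_t *gids = malloc((size_t)n * sizeof *gids);
    if (gids == NULL)
        return -1;
    if (getgrouplist(username, primary, gids, &n) == -1) {  /* 2. all gids */
        /* n now holds the required count; retry with a bigger buffer. */
        gid_t *bigger = realloc(gids, (size_t)n * sizeof *gids);
        if (bigger == NULL) {
            free(gids);
            return -1;
        }
        gids = bigger;
        if (getgrouplist(username, primary, gids, &n) == -1) {
            free(gids);
            return -1;
        }
    }

    int named = 0;
    for (int i = 0; i < n; i++)                   /* 3. gid -> group name */
        if (getgrgid(gids[i]) != NULL)
            named++;
    free(gids);
    return named;
}
```

The loop in step 3 is the hot path: one getgrgid(3) call per membership.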
Caveat: struct group contains an array of pointers to the names of group members
(char **gr_mem). However, id does not use that information, resulting in
significant read amplification. Therefore, if argv[0] == "id", getgrgid(3)
will return the group without its members. This speeds up id by about 10x on a
known NSS implementation.
Because getgrgid(3) does not use the group members' information, the group
members are stored in a different location, making the Groups section
smaller, thus more CPU-cache-friendly.
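An NSS module does not receive argv directly. On glibc, one way to get at the program name is `program_invocation_short_name` (declared in `<errno.h>` with `_GNU_SOURCE`); whether turbonss uses it is not stated here. The comparison itself could look like the sketch below, where `invoked_as_id` is a hypothetical helper and matching on the basename rather than the full argv[0] is an assumption.

```c
#include <stdbool.h>
#include <string.h>

/* True if argv0 names the program "id", with or without a directory part. */
static bool invoked_as_id(const char *argv0)
{
    if (argv0 == NULL)
        return false;
    const char *slash = strrchr(argv0, '/');
    const char *base = slash ? slash + 1 : argv0;
    return strcmp(base, "id") == 0;
}
```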
Indices
The following operations need to be fast, in order of importance:
- lookup gid -> group (this is on hot path in id) with or without members (2 separate calls).
- lookup uid -> user.
- lookup groupname -> group.
- lookup username -> user.
- (optional) iterate users using a defined order (getent passwd).
- (optional) iterate groups using a defined order (getent group).
First 4 can use perfect hashing like cmph: it hashes a list of bytes to a sequential list of integers. Perfect hashing algorithms require some space, and take some time to calculate ("hashing duration"). I've tested BDZ, which hashes [][]u8 to a sequential list of integers (not preserving order) and CHM, which does the same, but preserves order. BDZ accepts an argument 3 <= b <= 10.
BDZ: tried b=3, b=7 (default), and b=10.
- BDZ algorithm requires (900KB, 338KB, 306KB, respectively) for 1M values.
- Latency to resolve 1M keys: (170ms, 180ms, 230ms).
- Packed vs non-packed latency differences are not meaningful.
CHM retains order, however, 1M keys weigh 8MB. 10k keys are ~20x larger with CHM than with BDZ, eliminating the benefit of preserved ordering.
Turbonss header
The turbonss header looks like this:
OFFSET  TYPE   NAME                         DESCRIPTION
     0  [4]u8  magic                        always 0xf09fa4b7
     4  u8     version                      now 0
     5  u16    bom                          0x1234
     7  u8     padding
     8  u32    num_users                    number of passwd entries
    12  u32    num_groups                   number of group entries
    16  u32    offset_cmph_gid2group
    20  u32    offset_cmph_uid2user
    24  u32    offset_cmph_groupname2group
    28  u32    offset_cmph_username2user
    32  u32    offset_groupmembers
    36  u32    offset_additional_gids
magic is 0xf09fa4b7, and version must be 0. All integers are
native-endian. bom is a byte-order mark. It must resolve to 0x1234 (4660).
If it does not, the file is being read with a different endianness than it was
created with. Turbonss files cannot be moved between computers of different
endianness. If that happens, turbonss will refuse to read the file.
Offsets are indices to further sections of the file, with zero being the first
block (pointing to the magic field). As all blobs are 64-byte aligned, the
offsets always point to the beginning of a 64-byte "block". Therefore,
all offset_* values could be u26. As u32 is easier to visualize with xxd,
and the header block fits in 64 bytes anyway, we are keeping them as u32 for now.
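Reading and validating the header could look like the following sketch. It follows the offsets in the table above, but the names (`struct header`, `header_parse`, `block_to_byte_offset`) and the error codes are hypothetical, and it assumes the magic is stored as the byte sequence f0 9f a4 b7. Fields are copied with memcpy because bom sits at an odd offset.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct header {
    uint8_t  version;
    uint32_t num_users;
    uint32_t num_groups;
    uint32_t offset_cmph_gid2group;
    uint32_t offset_cmph_uid2user;
    uint32_t offset_cmph_groupname2group;
    uint32_t offset_cmph_username2user;
    uint32_t offset_groupmembers;
    uint32_t offset_additional_gids;
};

/* Reads a native-endian u32 from a possibly unaligned address. */
static uint32_t read_u32(const uint8_t *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Returns 0 on success, -1 on bad magic or short input,
 * -2 on unknown version, -3 on endianness mismatch. */
static int header_parse(const uint8_t *buf, size_t len, struct header *h)
{
    static const uint8_t magic[4] = { 0xf0, 0x9f, 0xa4, 0xb7 };
    uint16_t bom;

    if (len < 40 || memcmp(buf, magic, sizeof magic) != 0)
        return -1;
    if (buf[4] != 0)
        return -2;
    memcpy(&bom, buf + 5, sizeof bom);
    if (bom != 0x1234)
        return -3; /* file was written on a different-endian machine */

    h->version                     = buf[4];
    h->num_users                   = read_u32(buf + 8);
    h->num_groups                  = read_u32(buf + 12);
    h->offset_cmph_gid2group       = read_u32(buf + 16);
    h->offset_cmph_uid2user        = read_u32(buf + 20);
    h->offset_cmph_groupname2group = read_u32(buf + 24);
    h->offset_cmph_username2user   = read_u32(buf + 28);
    h->offset_groupmembers         = read_u32(buf + 32);
    h->offset_additional_gids      = read_u32(buf + 36);
    return 0;
}

/* Offsets count 64-byte blocks, so the byte position is block * 64. */
static uint64_t block_to_byte_offset(uint32_t block)
{
    return (uint64_t)block * 64u;
}
```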
Primitive types
const Group = struct {
    gid: u32,
    // index to a separate structure with a list of members. The memberlist is
    // always 2^5-byte aligned (32 bytes), this is an index there.
    members_offset: u27,
    groupname_len: u5,
    // a groupname_len-sized string
    groupname: []u8,
};
const User = struct {
    uid: u32,
    gid: u32,
    // pointer to a separate structure that contains a list of gids
    additional_gids_offset: u29,
    // shell is a different story, documented elsewhere.
    shell_here: u1,
    shell_len_or_place: u6,
    home_len: u6,
    username_pos: u1,
    username_len: u5,
    gecos_len: u8,
    // a variable-sized array that will be stored immediately after this
    // struct.
    stringdata: []u8,
};
User and Group entries are sorted by name, ordered by their unicode
codepoints.
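In the Group pseudo-code above, members_offset (u27) and groupname_len (u5) together fill exactly one 32-bit word. A minimal C sketch of that packing follows; the bit order (members_offset in the low 27 bits) and the `group_*` helper names are assumptions, not something the format description confirms.

```c
#include <assert.h>
#include <stdint.h>

#define MEMBERS_OFFSET_MASK 0x07ffffffu /* low 27 bits */
#define GROUPNAME_LEN_SHIFT 27          /* high 5 bits */

/* Packs a 27-bit offset and a 5-bit length into one u32. */
static uint32_t group_pack(uint32_t members_offset, uint32_t groupname_len)
{
    assert(members_offset <= MEMBERS_OFFSET_MASK);
    assert(groupname_len < 32u);
    return members_offset | (groupname_len << GROUPNAME_LEN_SHIFT);
}

static uint32_t group_members_offset(uint32_t packed)
{
    return packed & MEMBERS_OFFSET_MASK;
}

static uint32_t group_groupname_len(uint32_t packed)
{
    return packed >> GROUPNAME_LEN_SHIFT;
}
```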
Shells
Normally there is a limited number of distinct shells, even in huge user databases. A
few examples: /bin/bash, /usr/bin/nologin, /bin/zsh, among others.
Therefore, "shells" have an optimization: a shell can either be referenced from an
external list, or reside among the user's data.
The 64 (1<<6) most popular shells (i.e. referred to by at least two User entries) are stored externally in the "Shells" area. The less popular ones are stored with userdata.
The shell_here=true bit signifies that the shell is stored with userdata.
false means it is stored in the Shells section. If the shell is stored
"here", it is the first element in stringdata, and its length is
shell_len_or_place. If it is stored externally, the latter variable holds
its index in the external storage.
Shells in the external storage are sorted by their weight, which is
length*frequency.
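Resolving a user's shell from these two fields could look like the sketch below. The on-disk layout of the Shells section is not described here, so it is modeled as a plain array of C strings; that model, the `resolve_shell` name, and treating shell_len_or_place as a raw byte count are all assumptions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns a pointer to the shell bytes (not NUL-terminated when stored
 * "here") and sets *len, or returns NULL if the external index is invalid. */
static const char *resolve_shell(bool shell_here, uint8_t shell_len_or_place,
                                 const char *stringdata,
                                 const char *const *shells, size_t num_shells,
                                 size_t *len)
{
    if (shell_here) {
        /* Shell is the first element of stringdata. */
        *len = shell_len_or_place;
        return stringdata;
    }
    if (shell_len_or_place >= num_shells)
        return NULL;
    /* Shell lives in the external list; the field is an index. */
    *len = strlen(shells[shell_len_or_place]);
    return shells[shell_len_or_place];
}
```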
groupmembers, additional_gids
groupmembers and additional_gids store group and user memberships
respectively: for each group, a list of pointers ("offsets") to User records,
and for each user, a list of pointers to Group records. These fields are
always used in their entirety, so random access is not required, which makes
them suitable for tight packing.
An entry of groupmembers and additional_gids looks like this piece of
pseudo-code:
const PackedList = struct {
    length: varint,
    members: []varint,
};
const Groupmembers = PackedList;
const AdditionalGids = PackedList;
A single entry in the members field is an offset to a User or Group entry
(number of bytes relative to the first entry of the respective type). The
members field in PackedList is sorted by the name (username or groupname) of
the record it is pointing to.
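The varint encoding is not specified above. The sketch below assumes the common LEB128 scheme (7 payload bits per byte, high bit set on all but the last byte) purely for illustration; `varint_decode` and `packed_list_decode` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Decodes one LEB128 varint (an assumed encoding).
 * Returns the number of bytes consumed, or 0 on truncated/overlong input. */
static size_t varint_decode(const uint8_t *buf, size_t len, uint64_t *out)
{
    uint64_t v = 0;
    for (size_t i = 0; i < len && i < 10; i++) {
        v |= (uint64_t)(buf[i] & 0x7f) << (7 * i);
        if ((buf[i] & 0x80) == 0) {
            *out = v;
            return i + 1;
        }
    }
    return 0;
}

/* Decodes a PackedList (length, then members) into out[0..cap).
 * Returns the number of members, or -1 on malformed input or if cap is too small. */
static long packed_list_decode(const uint8_t *buf, size_t len,
                               uint64_t *out, size_t cap)
{
    uint64_t count;
    size_t pos = varint_decode(buf, len, &count);
    if (pos == 0 || count > cap)
        return -1;
    for (uint64_t i = 0; i < count; i++) {
        size_t n = varint_decode(buf + pos, len - pos, &out[i]);
        if (n == 0)
            return -1;
        pos += n;
    }
    return (long)count;
}
```

Because the list is always read front to back, decoding is a single linear pass with no need for an index.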
Complete file structure
SECTION               SIZE  DESCRIPTION
Header                1<<6  documented above
[]Group               ?     list of Group entries
[]User                ?     list of User entries
Shells                ?     documented in "Shells"
cmph_gid2group        ?     offset by offset_cmph_gid2group
cmph_uid2user         ?     offset by offset_cmph_uid2user
cmph_groupname2group  ?     offset by offset_cmph_groupname2group
cmph_username2user    ?     offset by offset_cmph_username2user
groupmembers          ?     offset by offset_groupmembers
additional_gids       ?     offset by offset_additional_gids
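Since every section starts on a 64-byte boundary, a writer has to pad each section up to the next block. A minimal sketch of that bookkeeping follows, assuming the header occupies block 0 and sections are laid out back to back; `align_to_block` and `layout_sections` are hypothetical helpers.

```c
#include <stddef.h>
#include <stdint.h>

/* Rounds n up to the next multiple of 64. */
static uint64_t align_to_block(uint64_t n)
{
    return (n + 63u) & ~(uint64_t)63u;
}

/* Given section sizes in bytes, writes the starting block index of each
 * section, with the first section placed right after the header block. */
static void layout_sections(const uint64_t *sizes, size_t n, uint32_t *block_offsets)
{
    uint64_t pos = 64; /* header takes the first 64-byte block */
    for (size_t i = 0; i < n; i++) {
        block_offsets[i] = (uint32_t)(pos >> 6);
        pos += align_to_block(sizes[i]);
    }
}
```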