bring back additional_gids_offset

2022-02-24 05:32:27 +02:00
parent c0afca00b0
commit 3bf1b3fc01
2 changed files with 39 additions and 30 deletions
--- a/README.md
+++ b/README.md
@@ -67,7 +67,7 @@ Tight packing places some constraints on the underlying data:
 - Maximum database size: 4GB.
 - Permitted length of username and groupname: 1-32 bytes.
 - Permitted length of shell and home: 1-64 bytes.
- Permitted comment ("gecos") length: 0-1023 bytes.
+- Permitted comment ("gecos") length: 0-255 bytes.
 - User name, groupname, gecos and shell must be utf8-encoded.
 Checking out and building
@@ -100,7 +100,7 @@ remarks on `id(1)`
 ------------------
 A known implementation runs id(1) at ~250 rps sequentially on ~20k users and
-~10k groups. Our target is 10k id/s for the same payload.
+~10k groups. Our rps target is much higher.
 To better reason about the trade-offs, it is useful to understand how `id(1)`
 is implemented, in rough terms:
@@ -111,9 +111,9 @@ is implemented, in rough terms:
 - for each additional gid, get the `struct group*`
  ([`getgrgid_r(3)`][getgrgid_r]).
-Assuming a member is in ~100 groups on average, that's 1M group lookups per
+Assuming a member is in ~100 groups on average, reaching 10k id/s translates to
-second. We need to convert gid to a group index, and group index to a group
+1M group lookups per second. We need to convert gid to a group index, and group
-gid/name quickly.
+index to a group gid/name quickly.
 Caveat: `struct group` contains an array of pointers to names of group members
 (`char **gr_mem`). However, `id` does not use that information, resulting in
@@ -193,13 +193,13 @@ const PackedGroup = struct {
 pub const PackedUser = packed struct {
    uid: u32,
    gid: u32,
    additional_gids_offset: u29,
    shell_here: bool,
    shell_len_or_idx: u6,
    home_len: u6,
    name_is_a_suffix: bool,
    name_len: u5,
-    gecos_len: u10,
+    gecos_len: u8,
    padding: u3,
    // pseudocode: variable-sized array that will be stored immediately after
    // this struct.
    stringdata []u8;
@@ -267,27 +267,26 @@ Group memberships
 There are two group memberships at play:
-1. Given a username, resolve user's group gids (for `initgroups(3)`).
+1. Given a group (gid/name), resolve the members' names (e.g. `getgrgid`).
-2. Given a group (gid/name), resolve the members' names (e.g. `getgrgid`).
+2. Given a username, resolve user's group gids (for `initgroups(3)`).
 When a user's groups are resolved in (2), the additional userdata is not
 requested (there is no way to return it). Therefore, it is reasonable to store
 the user's memberships completely out-of-band, keyed by the hash of the
 username.
-When group's memberships are resolved in (2), the same call also requires other
+When group's memberships are resolved in (1), the same call also requires other
 group information: gid and group name. Therefore it makes sense to store a
 pointer to the group members in the group information itself. However, the
 memberships are not *always* necessary (see remarks about `id(1)`), therefore
 the memberships will be stored separately, outside of the groups section.
 Similarly, when a user's groups are resolved in (2), they are not always necessary
 (i.e. not part of `struct user*`), therefore the memberships themselves are
 stored out of band.
 `Groupmembers` and `Username2gids` store group and user memberships
 respectively. Membership IDs are used in their entirety — not necessitating
 random access, thus suitable for tight packing and varint encoding.
 - For each group — a list of pointers (offsets) to User records, because
-  `getgr*_r` returns an array of pointers to membernames.
+  `getgr*_r` returns pointers to membernames.
 - For each user — a list of gids, because `initgroups_dyn` (and friends)
  returns an array of gids.
@@ -303,8 +302,6 @@ const Groupmembers = PackedList;
 const Username2gids  = PackedList;
 ```
 A packed list is a list of varints.
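
As a rough illustration of reading such a list, the sketch below decodes unsigned LEB128-style varints one after another. The exact varint flavour is an assumption (the text above does not name one), and `Decoded` and `readVarint` are hypothetical names that do not come from this repository.

```zig
const std = @import("std");

// Hypothetical result type: the decoded value and how many bytes it used.
const Decoded = struct { value: u64, len: usize };

// Decodes one unsigned LEB128-style varint from the start of `buf`.
// Assumption: 7 payload bits per byte, high bit set on all but the last byte.
fn readVarint(buf: []const u8) ?Decoded {
    var value: u64 = 0;
    var shift: u6 = 0;
    var i: usize = 0;
    while (i < buf.len) : (i += 1) {
        const b = buf[i];
        value |= @as(u64, b & 0x7f) << shift;
        if ((b & 0x80) == 0) return Decoded{ .value = value, .len = i + 1 };
        if (shift > 56) return null; // more than 10 bytes: does not fit in u64
        shift += 7;
    }
    return null; // input ended in the middle of a varint
}

test "walk a packed list of two varints" {
    const list = [_]u8{ 0x2a, 0xac, 0x02 }; // encodes 42, then 300
    const first = readVarint(list[0..]).?;
    const second = readVarint(list[first.len..]).?;
    try std.testing.expectEqual(@as(u64, 42), first.value);
    try std.testing.expectEqual(@as(u64, 300), second.value);
}
```

Because the list is always consumed front to back, each element only needs to report its own length; no per-element offsets are stored.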
 Indices
 -------
@@ -317,15 +314,10 @@ understand which operations need to be fast; in order of importance:
 4. lookup groupname -> group.
 5. lookup username -> user.
 `idx_*` sections are of type `[]PackedIntArray(u29)` and are pointing to the
 respective `Groups` and `Users` entries (from the beginning of the respective
 section). Since User and Group records are 8-byte aligned, 3 bits are saved for
 every element.
 These indices can use perfect hashing like [bdz from cmph][cmph]: a perfect
 hash hashes a list of bytes to a sequential list of integers. Perfect hashing
 algorithms require some space, and take some time to calculate ("hashing
-duration"). I've tested BDZ, which hashes [][]u8 to a sequential list of
+duration"). I've tested BDZ, which hashes `[][]u8` to a sequential list of
 integers (not preserving order), and CHM, which preserves order. BDZ accepts an
 optional argument `3 <= b <= 10`.
@@ -337,6 +329,16 @@ CHM retains order, however, 1M keys weigh 8MB. 10k keys are ~20x larger with
 CHM than with BDZ, eliminating the benefit of preserved ordering: we can just
 have a separate index.
 None of the tested perfect hashing algorithms makes the distinction between
 existing (in the initial dictionary) and new keys. In other words, HASH(value)
 will be pointing to a number `n ∈ [0,N-1]`, regardless of whether the value was in
 the initial dictionary. Therefore one must always confirm, after calculating
 the hash, that the key matches what's been hashed.
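
The confirmation step can be sketched as follows. `toyHash` is a deliberately trivial stand-in that happens to be collision-free for the three example keys; it is not BDZ or CHM, and all names here are hypothetical.

```zig
const std = @import("std");

// Keys of the initial dictionary, stored in hash order.
const names = [_][]const u8{ "root", "daemon", "bin" };

// Toy stand-in for a perfect hash (NOT bdz/chm): for these three keys the
// first byte modulo 3 yields 0, 1 and 2 respectively.
fn toyHash(key: []const u8) usize {
    if (key.len == 0) return 0;
    return key[0] % 3;
}

// Returns the index of `key`, or null if it was not in the dictionary.
fn lookup(key: []const u8) ?usize {
    const n = toyHash(key);
    // The hash alone cannot tell known keys from unknown ones: compare.
    if (!std.mem.eql(u8, names[n], key)) return null;
    return n;
}

test "unknown keys still hash into range" {
    try std.testing.expect(lookup("daemon").? == 1);
    try std.testing.expect(toyHash("rabbit") == 0); // same slot as "root"
    try std.testing.expect(lookup("rabbit") == null);
}
```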
 `idx_*` sections are of type `[]PackedIntArray(u29)` and are pointing to the
 respective `Groups` and `Users` entries (from the beginning of the respective
 section). Since User and Group records are 8-byte aligned, `u29` is used.
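
To make the arithmetic concrete: assuming an index entry holds the record's byte offset divided by 8 (one reading of the paragraph above, not confirmed by the source), a `u29` entry reaches just under 4GB, which lines up with the maximum database size. `recordOffset` is a hypothetical helper.

```zig
const std = @import("std");

// Assumption: an index entry stores the record's byte offset divided by 8,
// which is lossless because records are 8-byte aligned.
fn recordOffset(entry: u29) u32 {
    return @as(u32, entry) << 3;
}

test "u29 entries cover a 4GB section" {
    try std.testing.expectEqual(@as(u32, 8), recordOffset(1));
    // (2^29 - 1) * 8 = 2^32 - 8
    try std.testing.expectEqual(@as(u32, 4294967288), recordOffset(std.math.maxInt(u29)));
}
```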
 Complete file structure
 -----------------------
--- a/src/user.zig
+++ b/src/user.zig
@@ -6,16 +6,17 @@ const Allocator = std.mem.Allocator;
 const ArrayList = std.ArrayList;
 const cast = std.math.cast;
 const PackedUserSize = @divExact(@bitSizeOf(PackedUser), 8);
 pub const PackedUser = packed struct {
    uid: u32,
    gid: u32,
    additional_gids_offset: u29,
    shell_here: bool,
    shell_len_or_idx: u6,
    home_len: u6,
    name_is_a_suffix: bool,
    name_len: u5,
-    gecos_len: u10,
+    gecos_len: u8,
    padding: u3,
    // blobLength returns the length of the blob storing string values.
    pub fn blobLength(self: *const PackedUser) usize {
@@ -107,7 +108,7 @@ pub const UserWriter = struct {
        const home_len = try downCast(u6, user.home.len - 1);
        const name_len = try downCast(u5, user.name.len - 1);
        const shell_len = try downCast(u6, user.shell.len - 1);
-        const gecos_len = try downCast(u10, user.gecos.len);
+        const gecos_len = try downCast(u8, user.gecos.len);
        try validateUtf8(user.home);
        try validateUtf8(user.name);
@@ -117,13 +118,13 @@ pub const UserWriter = struct {
        var puser = PackedUser{
            .uid = user.uid,
            .gid = user.gid,
            .additional_gids_offset = (1 << 29) - 1,
            .shell_here = self.shellIndexFn(user.shell) == null,
            .shell_len_or_idx = self.shellIndexFn(user.shell) orelse shell_len,
            .home_len = home_len,
            .name_is_a_suffix = std.mem.endsWith(u8, user.home, user.name),
            .name_len = name_len,
            .gecos_len = gecos_len,
            .padding = 0,
        };
        try self.appendTo.appendSlice(std.mem.asBytes(&puser));
@@ -241,7 +242,13 @@ pub const UserReader = struct {
 const testing = std.testing;
 test "PackedUser internal and external alignment" {
-    try testing.expectEqual(@bitSizeOf(PackedUser), @sizeOf(PackedUser) * 8);
+    // External padding (PackedUserAlignmentBits) must be higher or equal to
    // the "internal" PackedUser alignment. By aligning PackedUser we are also
    // working around https://github.com/ziglang/zig/issues/10958 ; PackedUser
    // cannot be converted from/to [@bitSizeOf(PackedUser)/8]u8;
    // asBytes/bytesAsValue use @sizeOf, which is larger. Now we are putting no
    // more than 1, but it probably could be higher.
    try testing.expect(@sizeOf(PackedUser) * 8 - @bitSizeOf(PackedUser) <= 8);
 }
 fn testShellIndex(shell: []const u8) ?u6 {
@@ -284,7 +291,7 @@ test "construct PackedUser section" {
        .uid = 0,
        .gid = 4294967295,
        .name = "n" ** 32,
-        .gecos = "g" ** 1023,
+        .gecos = "g" ** 255,
        .home = "h" ** 64,
        .shell = "s" ** 64,
    } };