rewrite shells

- Shell is up to 256 bytes long.
- Store up to 255 shells in the Shells area.
- Remove padding from the User struct.
Motiejus Jakštys 2022-03-17 16:50:41 +01:00 committed by Motiejus Jakštys
parent 85552c1302
commit 4e36d7850e
6 changed files with 81 additions and 134 deletions


@ -65,9 +65,12 @@ regions are shared. Turbonss reads do not consume any heap space.
Tight packing places some constraints on the underlying data: Tight packing places some constraints on the underlying data:
- Permitted length of username and groupname: 1-32 bytes. - Permitted length of username and groupname: 1-32 bytes.
- Permitted length of shell and home: 1-64 bytes. - Permitted length of shell and home: 1-256 bytes.
- Permitted comment ("gecos") length: 0-255 bytes. - Permitted comment ("gecos") length: 0-255 bytes.
- User name, groupname, gecos and shell must be utf8-encoded. - User name, groupname, gecos and shell must be utf8-encoded.
- User and Groups sections are up to 2^35B (~34GB) large. Assuming an "average"
user record takes 50 bytes, this section would fit ~660M users. The
worst-case upper bound is left as an exercise to the reader.
Sorting is stable. In v0: Sorting is stable. In v0:
- Groups are sorted by gid, ascending. - Groups are sorted by gid, ascending.
@ -173,7 +176,8 @@ the beginning of the section.
``` ```
const PackedGroup = packed struct { const PackedGroup = packed struct {
gid: u32, gid: u32,
groupname_len: u8, // max is 32, but have too much space here. padding: u3,
groupname_len: u5,
} }
``` ```
@ -186,8 +190,7 @@ PackedUser is a bit more involved:
pub const PackedUser = packed struct { pub const PackedUser = packed struct {
uid: u32, uid: u32,
gid: u32, gid: u32,
padding: u2 = 0, shell_len_or_idx: u8,
shell_len_or_idx: u6,
shell_here: bool, shell_here: bool,
name_is_a_suffix: bool, name_is_a_suffix: bool,
home_len: u6, home_len: u6,
@ -219,8 +222,8 @@ PackedUser employs two "simple" compression techniques:
2. `name_is_a_suffix=false`: name begins one byte after home, and its length 2. `name_is_a_suffix=false`: name begins one byte after home, and its length
is `name_len`. is `name_len`.
The last field `additional_gids_offset: varint` points to the `additional_gids` section for The last field `additional_gids_offset: varint` points to the `additional_gids`
this user. section for this user.
Shells Shells
------ ------
@ -231,23 +234,20 @@ others. Therefore, "shells" have an optimization: they can be pointed to in the
external list, or, if they are unique to the user, reside among the user's external list, or, if they are unique to the user, reside among the user's
data. data.
63 most popular shells (i.e. referred to by at least two User entries) are 255 most popular shells (i.e. referred to by at least two User entries) are
stored externally in "Shells" area. The less popular ones are stored with stored externally in "Shells" area. The less popular ones are stored with
userdata. userdata.
Shells section consists of two sub-sections: the index and the blob. The index Shells section consists of two sub-sections: the index and the blob. The index
is a list of structs which point to a location in the "blob" area: is an array of offsets: the i'th shell starts at `offsets[i]` byte, and ends at
`offsets[i+1]` byte. If there is at least one shell in the shell section, the
index contains a sentinel index as the last element, which signifies the position
of the last byte of the shell blob.
``` `shell_here=true` in the User struct means the shell is stored with userdata,
const ShellIndex = struct { and its length is `shell_len_or_idx`. `shell_here=false` means it is stored in
offset: u10, the `Shells` section, and its index is `shell_len_or_idx` (and the actual
len: u6, string start and end offsets are resolved as described in the paragraph above).
};
```
In the user's struct `shell_here=true` signifies that the shell is stored with
userdata, and its length is `shell_len_or_idx`. `shell_here=false` means it is
stored in the `Shells` section, and its index is `shell_len_or_idx`.
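
The offset-array lookup described above can be sketched as follows. This is a minimal illustration, not the project's reader: `example_blob`, `example_index` and `shellAt` are hypothetical names, and the sketch assumes the end offset `index[i+1]` is exclusive, i.e. it points one byte past the shell.

```zig
const std = @import("std");

// Hypothetical example data: "/bin/bash" and "/bin/zsh" stored back to back.
// The index has one entry per shell plus a sentinel equal to the blob length.
const example_blob = "/bin/bash/bin/zsh";
const example_index = [_]u16{ 0, 9, 17 };

// shellAt resolves the idx'th shell: it starts at index[idx] and ends just
// before index[idx + 1] (assumption: the end offset is exclusive).
fn shellAt(index: []const u16, blob: []const u8, idx: u8) []const u8 {
    const i = @as(usize, idx);
    return blob[index[i]..index[i + 1]];
}

test "resolve shells through the offset index" {
    try std.testing.expectEqualStrings("/bin/bash", shellAt(&example_index, example_blob, 0));
    try std.testing.expectEqualStrings("/bin/zsh", shellAt(&example_index, example_blob, 1));
}
```

Because the length of a shell is derived from two neighboring offsets, no per-shell length field or padding is needed; the cost is one extra sentinel entry in the index.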
Variable-length integers (varints) Variable-length integers (varints)
---------------------------------- ----------------------------------
@ -264,7 +264,6 @@ There are two group memberships at play:
1. Given a group (gid/name), resolve the members' names (e.g. `getgrgid`). 1. Given a group (gid/name), resolve the members' names (e.g. `getgrgid`).
2. Given a username, resolve user's group gids (for `initgroups(3)`). 2. Given a username, resolve user's group gids (for `initgroups(3)`).
When group's memberships are resolved in (1), the same call also requires other When group's memberships are resolved in (1), the same call also requires other
group information: gid and group name. Therefore it makes sense to store a group information: gid and group name. Therefore it makes sense to store a
pointer to the group members in the group information itself. However, the pointer to the group members in the group information itself. However, the
@ -323,9 +322,10 @@ will be pointing to a number `n ∈ [0,N-1]`, regardless whether the value was i
the initial dictionary. Therefore one must always confirm, after calculating the initial dictionary. Therefore one must always confirm, after calculating
the hash, that the key matches what's been hashed. the hash, that the key matches what's been hashed.
`idx_*` sections are of type `[]PackedIntArray(u29)` and are pointing to the `idx_*` sections are of type `[]u32` and are pointing to the respective
respective `Groups` and `Users` entries (from the beginning of the respective `Groups` and `Users` entries (from the beginning of the respective section).
section). Since User and Group records are 8-byte aligned, `u29` is used. Since User and Group records are 8-byte aligned, the actual offset to the
record is acquired by left-shifting this value by 3 bits.
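
A minimal sketch of that conversion, assuming each `u32` index entry holds the record's byte offset divided by 8 (this encoding is an assumption inferred from the 8-byte alignment and the 2^35 byte section limit; `recordOffset` and `alignment_bits` are hypothetical names):

```zig
const std = @import("std");

// Records are 8-byte aligned, so the 3 low bits of a byte offset are always
// zero and do not need to be stored.
const alignment_bits = 3;

// recordOffset converts a stored u32 index value to a byte offset from the
// beginning of the Users or Groups section.
fn recordOffset(stored: u32) u64 {
    return @as(u64, stored) << alignment_bits;
}

test "stored index value to byte offset" {
    try std.testing.expectEqual(@as(u64, 16), recordOffset(2));
    // (2^32 - 1) * 8: just under the 2^35 byte section limit.
    try std.testing.expectEqual(@as(u64, 34359738360), recordOffset(std.math.maxInt(u32)));
}
```

Widening to `u64` before shifting matters: the largest stored value maps to an offset that no longer fits in 32 bits.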
Database file structure Database file structure
----------------------- -----------------------
@ -344,7 +344,7 @@ idx_groupname2group len(group)*4 bdz->offset Groups
idx_uid2user len(user)*4 bdz->offset Users idx_uid2user len(user)*4 bdz->offset Users
idx_name2user len(user)*4 bdz->offset Users idx_name2user len(user)*4 bdz->offset Users
shell_index len(shells)*2 shell index array shell_index len(shells)*2 shell index array
shell_blob <= 4032 shell data blob (max 63*64 bytes) shell_blob <= 65280 shell data blob (max 255*256 bytes)
groups ? packed Group entries (8b padding) groups ? packed Group entries (8b padding)
users ? packed User entries (8b padding) users ? packed User entries (8b padding)
groupmembers ? per-group delta varint memberlist (no padding) groupmembers ? per-group delta varint memberlist (no padding)


@ -42,10 +42,11 @@ pub const PackedGroup = struct {
const Inner = packed struct { const Inner = packed struct {
gid: u32, gid: u32,
groupname_len: u8, padding: u3 = 0,
groupname_len: u5,
pub fn groupnameLen(self: *const Inner) usize { pub fn groupnameLen(self: *const Inner) usize {
return self.groupname_len + 1; return @as(usize, self.groupname_len) + 1;
} }
}; };
@ -120,13 +121,9 @@ pub const PackedGroup = struct {
group: GroupStored, group: GroupStored,
) packErr!void { ) packErr!void {
std.debug.assert(arr.items.len & 7 == 0); std.debug.assert(arr.items.len & 7 == 0);
const groupname_len = try validate.downCast(u5, group.name.len - 1);
try validate.utf8(group.name); try validate.utf8(group.name);
const inner = Inner{ const len = try validate.downCast(u5, group.name.len - 1);
.gid = group.gid, const inner = Inner{ .gid = group.gid, .groupname_len = len };
.groupname_len = groupname_len,
};
try arr.*.appendSlice(mem.asBytes(&inner)); try arr.*.appendSlice(mem.asBytes(&inner));
try arr.*.appendSlice(group.name); try arr.*.appendSlice(group.name);
try compress.appendUvarint(arr, group.members_offset); try compress.appendUvarint(arr, group.members_offset);


@ -110,12 +110,6 @@ test "header pack, unpack and validation" {
try testing.expectError(error.InvalidBom, Header.init(header.asArray())); try testing.expectError(error.InvalidBom, Header.init(header.asArray()));
} }
{
var header = goodHeader;
header.num_shells = shell.max_shells + 1;
try testing.expectError(error.TooManyShells, Header.init(header.asArray()));
}
{ {
var header = goodHeader; var header = goodHeader;
header.offset_bdz_uid2user = 65; header.offset_bdz_uid2user = 65;


@ -234,7 +234,7 @@ pub fn usersSection(
&blob, &blob,
user, user,
gids.idx2offset[i], gids.idx2offset[i],
shells.indices, shells.shell2idx,
); );
try pad.arrayList(&blob, userImport.PackedUser.alignment_bits); try pad.arrayList(&blob, userImport.PackedUser.alignment_bits);
} }
@ -439,6 +439,7 @@ pub const AllSections = struct {
var groups = try groupsSection(allocator, corpus, group_members.idx2offset); var groups = try groupsSection(allocator, corpus, group_members.idx2offset);
errdefer groups.deinit(allocator); errdefer groups.deinit(allocator);
// TODO: these indices must point to the *offsets*, not the indices in "users"
var idx_gid2group = try bdzIdx(u32, allocator, bdz_gid, gids); var idx_gid2group = try bdzIdx(u32, allocator, bdz_gid, gids);
errdefer allocator.free(idx_gid2group); errdefer allocator.free(idx_gid2group);


@ -1,5 +1,4 @@
const std = @import("std"); const std = @import("std");
const pad = @import("padding.zig");
const Allocator = std.mem.Allocator; const Allocator = std.mem.Allocator;
const PriorityDequeue = std.PriorityDequeue; const PriorityDequeue = std.PriorityDequeue;
const StringArrayHashMap = std.StringArrayHashMap; const StringArrayHashMap = std.StringArrayHashMap;
@ -7,40 +6,24 @@ const StringHashMap = std.StringHashMap;
const BoundedArray = std.BoundedArray; const BoundedArray = std.BoundedArray;
const StringContext = std.hash_map.StringContext; const StringContext = std.hash_map.StringContext;
// maxShells is the maximum number of "popular" shells. pub const max_shells = 255;
pub const max_shells = 63; pub const max_shell_len = 256;
pub const max_shell_len = 64;
pub const shell_alignment_bits = 2; // bits
// ShellIndex is an index to the shell strings. As shell can be up to 64 bytes
// (1<<6), maximum number of shells is 63 (1<<6-1), the maximum location offset
// is 1<<12. To make location resolvable in 10 bits, all shells will be padded
// to 4 bytes.
// The actual shell length is len+1: we don't allow empty shells, and the real
// length of the shell is 1-64 bytes.
pub const ShellIndex = packed struct {
offset: u10,
len: u6,
};
// ShellReader interprets "Shell Index" and "Shell Blob" sections. // ShellReader interprets "Shell Index" and "Shell Blob" sections.
pub const ShellReader = struct { pub const ShellReader = struct {
section_index: []const ShellIndex, index: []const u16,
section_blob: []const u8, blob: []const u8,
pub fn init(index: []const u8, blob: []const u8) ShellReader { pub fn init(index: []align(2) const u8, blob: []const u8) ShellReader {
return ShellReader{ return ShellReader{
.section_index = std.mem.bytesAsSlice(ShellIndex, index), .index = std.mem.bytesAsSlice(u16, index),
.section_blob = blob, .blob = blob,
}; };
} }
// get returns a shell at the given index. // get returns a shell at the given index.
pub fn get(self: *const ShellReader, idx: u6) []const u8 { pub fn get(self: *const ShellReader, idx: u8) []const u8 {
const shell_index = self.section_index[idx]; return self.blob[self.index[idx]..self.index[idx + 1]];
const start = shell_index.offset << 2;
const end = start + shell_index.len + 1;
return self.section_blob[start..end];
} }
}; };
@ -55,45 +38,42 @@ pub const ShellWriter = struct {
}; };
pub const ShellSections = struct { pub const ShellSections = struct {
index: BoundedArray(ShellIndex, max_shells), // index maps the i'th shell to its start offset in blob; the i'th
blob: BoundedArray(u8, max_shells * max_shell_len), // shell ends just before offset index[i+1].
indices: StringHashMap(u6), index: BoundedArray(u16, max_shells),
// blob holds the shell strings back to back, without padding. The last
// record of index points to the end of the blob, so the length of the last
// shell can be calculated from the index array.
blob: BoundedArray(u8, (max_shells + 1) * max_shell_len),
// shell2idx helps translate a shell (string) to its index.
shell2idx: StringHashMap(u8),
// initializes and populates shell sections. All strings are copied, // initializes and populates shell sections. All strings are copied,
// nothing is owned. // nothing is owned.
pub const initErr = Allocator.Error || error{Overflow};
pub fn init( pub fn init(
allocator: Allocator, allocator: Allocator,
shells: BoundedArray([]const u8, max_shells), shells: BoundedArray([]const u8, max_shells),
) initErr!ShellSections { ) error{ Overflow, OutOfMemory }!ShellSections {
var self = ShellSections{ var self = ShellSections{
.index = try BoundedArray(ShellIndex, max_shells).init(shells.len), .index = try BoundedArray(u16, max_shells).init(shells.len),
.blob = try BoundedArray(u8, max_shells * max_shell_len).init(0), .blob = try BoundedArray(u8, (max_shells + 1) * max_shell_len).init(0),
.indices = StringHashMap(u6).init(allocator), .shell2idx = StringHashMap(u8).init(allocator),
}; };
errdefer self.indices.deinit(); if (shells.len == 0) return self;
var full_offset: u12 = 0;
var idx: u6 = 0;
while (idx < shells.len) : (idx += 1) {
const len = try std.math.cast(u6, shells.get(idx).len);
try self.blob.appendSlice(shells.get(idx));
const our_shell = self.blob.constSlice()[full_offset .. full_offset + len];
try self.indices.put(our_shell, idx);
std.debug.assert(full_offset & 3 == 0);
self.index.set(idx, ShellIndex{
.offset = try std.math.cast(u10, full_offset >> 2),
.len = len - 1,
});
full_offset += len; errdefer self.shell2idx.deinit();
const padding = pad.roundUpPadding(u12, shell_alignment_bits, full_offset); for (shells.constSlice()) |shell, idx| {
full_offset += padding; const idx8 = @intCast(u8, idx);
try self.blob.appendNTimes(0, padding); const offset = @intCast(u16, self.blob.len);
try self.blob.appendSlice(shell);
try self.shell2idx.put(self.blob.constSlice()[offset..], idx8);
self.index.set(idx8, offset);
} }
} try self.index.append(@intCast(u16, self.blob.len));
return self; return self;
} }
pub fn section_index(self: *const ShellSections) []const u8 { pub fn section_index(self: *const ShellSections) []align(2) const u8 {
return std.mem.sliceAsBytes(self.index.constSlice()); return std.mem.sliceAsBytes(self.index.constSlice());
} }
@ -102,12 +82,12 @@ pub const ShellWriter = struct {
} }
pub fn deinit(self: *ShellSections) void { pub fn deinit(self: *ShellSections) void {
self.indices.deinit(); self.shell2idx.deinit();
self.* = undefined; self.* = undefined;
} }
pub fn getIndex(self: *const ShellSections, shell: []const u8) ?u6 { pub fn getIndex(self: *const ShellSections, shell: []const u8) ?u8 {
return self.indices.get(shell); return self.shell2idx.get(shell);
} }
}; };
@ -143,8 +123,10 @@ pub const ShellWriter = struct {
// toOwnedSections returns the analyzed ShellSections. Resets the shell // toOwnedSections returns the analyzed ShellSections. Resets the shell
// popularity contest. ShellSections memory is allocated by the ShellWriter // popularity contest. ShellSections memory is allocated by the ShellWriter
// allocator, and must be deInit'ed by the caller. // allocator, and must be deInit'ed by the caller.
const toOwnedSectionsErr = Allocator.Error || error{Overflow}; pub fn toOwnedSections(
pub fn toOwnedSections(self: *ShellWriter, limit: u10) toOwnedSectionsErr!ShellSections { self: *ShellWriter,
limit: u10,
) error{ Overflow, OutOfMemory }!ShellSections {
var deque = PriorityDequeue(KV, void, cmpShells).init(self.allocator, {}); var deque = PriorityDequeue(KV, void, cmpShells).init(self.allocator, {});
defer deque.deinit(); defer deque.deinit();
@ -164,9 +146,8 @@ pub const ShellWriter = struct {
topShells.set(i, deque.removeMax().shell); topShells.set(i, deque.removeMax().shell);
const result = ShellSections.init(self.allocator, topShells); const result = ShellSections.init(self.allocator, topShells);
const allocator = self.allocator;
self.deinit(); self.deinit();
self.* = init(allocator); self.* = init(self.allocator);
return result; return result;
} }
}; };
@ -192,16 +173,13 @@ test "basic shellpopcon" {
var sections = try popcon.toOwnedSections(max_shells); var sections = try popcon.toOwnedSections(max_shells);
defer sections.deinit(); defer sections.deinit();
try testing.expectEqual(sections.index.len, 3); // all but "nobody" qualify try testing.expectEqual(sections.index.len, 4); // all but "nobody" qualify
try testing.expectEqual(sections.getIndex(long).?, 0); try testing.expectEqual(sections.getIndex(long).?, 0);
try testing.expectEqual(sections.getIndex(zsh).?, 1); try testing.expectEqual(sections.getIndex(zsh).?, 1);
try testing.expectEqual(sections.getIndex(bash).?, 2); try testing.expectEqual(sections.getIndex(bash).?, 2);
try testing.expectEqual(sections.getIndex(nobody), null); try testing.expectEqual(sections.getIndex(nobody), null);
try testing.expectEqual( try testing.expectEqual(sections.section_blob().len, bash.len + zsh.len + long.len);
sections.section_blob().len,
pad.roundUp(u12, 2, bash.len) + pad.roundUp(u12, 2, zsh.len) + pad.roundUp(u12, 2, long.len),
);
const shellReader = ShellReader.init( const shellReader = ShellReader.init(
sections.section_index(), sections.section_index(),
@ -211,5 +189,5 @@ test "basic shellpopcon" {
try testing.expectEqualStrings(shellReader.get(1), zsh); try testing.expectEqualStrings(shellReader.get(1), zsh);
try testing.expectEqualStrings(shellReader.get(2), bash); try testing.expectEqualStrings(shellReader.get(2), bash);
try testing.expectEqual(shellReader.section_index.len, 3); try testing.expectEqual(shellReader.index.len, 4);
} }


@ -13,10 +13,6 @@ const Allocator = mem.Allocator;
const ArrayList = std.ArrayList; const ArrayList = std.ArrayList;
const StringHashMap = std.StringHashMap; const StringHashMap = std.StringHashMap;
// Idx2ShellProto is a function prototype that, given a shell's index (in
// global shell section), will return a shell string. Matches ShellReader.get.
const Idx2ShellProto = fn (u6) []const u8;
// User is a convenient public struct for record construction and // User is a convenient public struct for record construction and
// serialization. // serialization.
pub const User = struct { pub const User = struct {
@ -65,21 +61,6 @@ pub const User = struct {
} }
}; };
pub fn Shell2Index(T: type) type {
return struct {
const Self = @This();
data: T,
pub fn init(data: T) Self {
return Self{ .data = data };
}
pub fn get(self: *const Self, str: []const u8) ?u6 {
return self.data.get(str);
}
};
}
pub const PackedUser = struct { pub const PackedUser = struct {
const Self = @This(); const Self = @This();
@ -88,8 +69,7 @@ pub const PackedUser = struct {
const Inner = packed struct { const Inner = packed struct {
uid: u32, uid: u32,
gid: u32, gid: u32,
padding: u2 = 0, shell_len_or_idx: u8,
shell_len_or_idx: u6,
shell_here: bool, shell_here: bool,
name_is_a_suffix: bool, name_is_a_suffix: bool,
home_len: u6, home_len: u6,
@ -204,14 +184,14 @@ pub const PackedUser = struct {
arr: *ArrayList(u8), arr: *ArrayList(u8),
user: User, user: User,
additional_gids_offset: u64, additional_gids_offset: u64,
idxFn: StringHashMap(u6), idxFn: StringHashMap(u8),
) error{ InvalidRecord, OutOfMemory }!void { ) error{ InvalidRecord, OutOfMemory }!void {
std.debug.assert(arr.items.len & 7 == 0); std.debug.assert(arr.items.len & 7 == 0);
// function arguments are consts. We need to mutate the underlying // function arguments are consts. We need to mutate the underlying
// slice, so passing it via pointer instead. // slice, so passing it via pointer instead.
const home_len = try validate.downCast(u6, user.home.len - 1); const home_len = try validate.downCast(u6, user.home.len - 1);
const name_len = try validate.downCast(u5, user.name.len - 1); const name_len = try validate.downCast(u5, user.name.len - 1);
const shell_len = try validate.downCast(u6, user.shell.len - 1); const shell_len = try validate.downCast(u8, user.shell.len - 1);
const gecos_len = try validate.downCast(u8, user.gecos.len); const gecos_len = try validate.downCast(u8, user.gecos.len);
try validate.utf8(user.home); try validate.utf8(user.home);
@ -289,19 +269,16 @@ test "PackedUser internal and external alignment" {
); );
} }
fn testShellIndex(allocator: Allocator) StringHashMap(u6) { fn testShellIndex(allocator: Allocator) StringHashMap(u8) {
var result = StringHashMap(u6).init(allocator); var result = StringHashMap(u8).init(allocator);
result.put("/bin/bash", 0) catch unreachable; result.put("/bin/bash", 0) catch unreachable;
result.put("/bin/zsh", 1) catch unreachable; result.put("/bin/zsh", 1) catch unreachable;
return result; return result;
} }
const test_shell_reader = shellImport.ShellReader{ const test_shell_reader = shellImport.ShellReader{
.section_blob = "/bin/bash.../bin/zsh", .blob = "/bin/bash/bin/zsh",
.section_index = &[_]shellImport.ShellIndex{ .index = &[_]u16{ 0, 9, 17 },
shellImport.ShellIndex{ .offset = 0, .len = 9 - 1 },
shellImport.ShellIndex{ .offset = 12 >> 2, .len = 8 - 1 },
},
}; };
test "construct PackedUser section" { test "construct PackedUser section" {
@ -328,7 +305,7 @@ test "construct PackedUser section" {
.name = "Name" ** 8, .name = "Name" ** 8,
.gecos = "Gecos" ** 51, .gecos = "Gecos" ** 51,
.home = "Home" ** 16, .home = "Home" ** 16,
.shell = "She.LllL" ** 8, .shell = "She.LllL" ** 32,
}, User{ }, User{
.uid = 1002, .uid = 1002,
.gid = 1002, .gid = 1002,