From b9e818b556d15672ad589e1dbdf5f6f1ea557c78 Mon Sep 17 00:00:00 2001
From: Shawn Pearce <spearce@spearce.org>
Date: Sat, 22 Jul 2017 10:12:32 -0700
Subject: [PATCH] reftable: file format documentation

Some repositories contain a lot of references (e.g. android at 866k,
rails at 31k). The reftable format provides:

- Near constant time lookup for any single reference, even when the
  repository is cold and not in process or kernel cache.
- Near constant time verification a SHA-1 is referred to by at least
  one reference (for allow-tip-sha1-in-want).
- Efficient lookup of an entire namespace, such as `refs/tags/`.
- Support atomic push `O(size_of_update)` operations.
- Combine reflog storage with ref storage.

Change-Id: I29d0ff1eee475845660ac9173413e1407adcfbf2
---
 Documentation/technical/reftable.md | 950 ++++++++++++++++++++++++++++
 1 file changed, 950 insertions(+)
 create mode 100644 Documentation/technical/reftable.md

diff --git a/Documentation/technical/reftable.md b/Documentation/technical/reftable.md
new file mode 100644
index 000000000..34f752950
--- /dev/null
+++ b/Documentation/technical/reftable.md
@@ -0,0 +1,950 @@
+# reftable
+
+[TOC]
+
+## Overview
+
+### Problem statement
+
+Some repositories contain a lot of references (e.g.  android at 866k,
+rails at 31k).  The existing packed-refs format takes up a lot of
+space (e.g.  62M), and does not scale with additional references.
+Lookup of a single reference requires linearly scanning the file.
+
+Atomic pushes modifying multiple references require copying the
+entire packed-refs file, which can be a considerable amount of data
+moved (e.g. 62M in, 62M out) for even small transactions (2 refs
+modified).
+
+Repositories with many loose references occupy a large number of disk
+blocks from the local file system, as each reference is its own file
+storing 41 bytes (and another file for the corresponding reflog).
+This negatively affects the number of inodes available when a large
+number of repositories are stored on the same filesystem.  Readers can
+be penalized due to the larger number of syscalls required to traverse
+and read the `$GIT_DIR/refs` directory.
+
+### Objectives
+
+- Near constant time lookup for any single reference, even when the
+  repository is cold and not in process or kernel cache.
+- Near constant time verification if a SHA-1 is referred to by at
+  least one reference (for allow-tip-sha1-in-want).
+- Efficient lookup of an entire namespace, such as `refs/tags/`.
+- Support atomic push with `O(size_of_update)` operations.
+- Combine reflog storage with ref storage for small transactions.
+- Separate reflog storage for base refs and historical logs.
+
+### Description
+
+A reftable file is a portable binary file format customized for
+reference storage. References are sorted, enabling linear scans,
+binary search lookup, and range scans.
+
+Storage in the file is organized into variable sized blocks.  Prefix
+compression is used within a single block to reduce disk space.  Block
+size and alignment is tunable by the writer.
+
+### Performance
+
+Space used, packed-refs vs. reftable:
+
+repository | packed-refs | reftable | % original | avg ref  | avg obj
+-----------|------------:|---------:|-----------:|---------:|--------:
+android    |      62.2 M |   36.1 M |     58.0%  | 33 bytes | 5 bytes
+rails      |       1.8 M |    1.1 M |     57.7%  | 29 bytes | 4 bytes
+git        |      78.7 K |   48.1 K |     61.0%  | 50 bytes | 4 bytes
+git (heads)|       332 b |    269 b |     81.0%  | 33 bytes | 0 bytes
+
+Scan (read 866k refs), by reference name lookup (single ref from 866k
+refs), and by SHA-1 lookup (refs with that SHA-1, from 866k refs):
+
+format      | cache | scan    | by name        | by SHA-1
+------------|------:|--------:|---------------:|---------------:
+packed-refs | cold  |  402 ms | 409,660.1 usec | 412,535.8 usec
+packed-refs | hot   |         |   6,844.6 usec |  20,110.1 usec
+reftable    | cold  |  112 ms |      33.9 usec |     323.2 usec
+reftable    | hot   |         |      20.2 usec |     320.8 usec
+
+Space used for 149,932 log entries for 43,061 refs,
+reflog vs. reftable:
+
+format        | size  | avg entry
+--------------|------:|-----------:
+$GIT_DIR/logs | 173 M | 1209 bytes
+reftable      |   5 M |   37 bytes
+
+## Details
+
+### Peeling
+
+References stored in a reftable are peeled, a record for an annotated
+(or signed) tag records both the tag object, and the object it refers
+to.
+
+### Reference name encoding
+
+Reference names are an uninterpreted sequence of bytes that must pass
+[git-check-ref-format][ref-fmt] as a valid reference name.
+
+[ref-fmt]: https://git-scm.com/docs/git-check-ref-format
+
+### Network byte order
+
+All multi-byte, fixed width fields are in network byte order.
+
+### Ordering
+
+Blocks are lexicographically ordered by their first reference.
+
+### Directory/file conflicts
+
+The reftable format accepts both `refs/heads/foo` and
+`refs/heads/foo/bar` as distinct references.
+
+This property is useful for retaining log records in reftable, but may
+confuse versions of Git using `$GIT_DIR/refs` directory tree to
+maintain references.  Users of reftable may choose to continue to
+reject `foo` and `foo/bar` type conflicts to prevent problems for
+peers.
+
+## File format
+
+### Structure
+
+A reftable file has the following high-level structure:
+
+    first_block {
+      header
+      first_ref_block
+    }
+    ref_block*
+    ref_index*
+    obj_block*
+    obj_index*
+    log_block*
+    log_index*
+    footer
+
+A log-only file omits the `ref_block`, `ref_index`, `obj_block` and
+`obj_index` sections, containing only the file header and log block:
+
+    first_block {
+      header
+    }
+    log_block*
+    log_index*
+    footer
+
+in a log-only file the first log block immediately follows the file
+header, without padding to block alignment.
+
+### Block size
+
+The file's block size is arbitrarily determined by the writer, and
+does not have to be a power of 2.  The block size must be larger than
+the longest reference name or log entry used in the repository, as
+references cannot span blocks.
+
+Powers of two that are friendly to the virtual memory system or
+filesystem (such as 4k or 8k) are recommended.  Larger sizes (64k) can
+yield better compression, with a possible increased cost incurred by
+readers during access.
+
+The largest block size is `16777215` bytes (15.99 MiB).
+
+### Block alignment
+
+Writers may choose to align blocks at multiples of the block size by
+including `padding` filled with NUL bytes at the end of a block to
+round out to the chosen alignment.  When alignment is used, writers
+must specify the alignment with the file header's `block_size` field.
+
+Block alignment is not required by the file format.  Unaligned files
+must set `block_size = 0` in the file header, and omit `padding`.
+Unaligned files with more than one ref block must include the
+[ref index](#Ref-index) to support fast lookup.  Readers must be
+able to read both aligned and non-aligned files.
+
+Very small files (e.g. 1 only ref block) may omit `padding` and the
+ref index to reduce total file size.
+
+### Header
+
+A 24-byte header appears at the beginning of the file:
+
+    'REFT'
+    uint8( version_number = 1 )
+    uint24( block_size )
+    uint64( min_update_index )
+    uint64( max_update_index )
+
+Aligned files must specify `block_size` to configure readers with the
+expected block alignment.  Unaligned files must set `block_size = 0`.
+
+The `min_update_index` and `max_update_index` describe bounds for the
+`update_index` field of all log records in this file.  When reftables
+are used in a stack for [transactions](#Update-transactions), these
+fields can order the files such that the prior file's
+`max_update_index + 1` is the next file's `min_update_index`.
+
+### First ref block
+
+The first ref block shares the same block as the file header, and is
+24 bytes smaller than all other blocks in the file.  The first block
+immediately begins after the file header, at position 24.
+
+If the first block is a log block (a log-only file), its block header
+begins immediately at position 24.
+
+### Ref block format
+
+A ref block is written as:
+
+    'r'
+    uint24( block_len )
+    ref_record+
+    uint24( restart_offset )+
+    uint16( restart_count )
+
+    padding?
+
+Blocks begin with `block_type = 'r'` and a 3-byte `block_len` which
+encodes the number of bytes in the block up to, but not including the
+optional `padding`.  This is always less than or equal to the file's
+block size.  In the first ref block, `block_len` includes 24 bytes
+for the file header.
+
+The 2-byte `restart_count` stores the number of entries in the
+`restart_offset` list, which must not be empty.  Readers can use
+`restart_count` to binary search between restarts before starting a
+linear scan.
+
+Exactly `restart_count` 3-byte `restart_offset` values precedes the
+`restart_count`.  Offsets are relative to the start of the block and
+refer to the first byte of any `ref_record` whose name has not been
+prefix compressed.  Entries in the `restart_offset` list must be
+sorted, ascending.  Readers can start linear scans from any of these
+records.
+
+A variable number of `ref_record` fill the middle of the block,
+describing reference names and values.  The format is described below.
+
+As the first ref block shares the first file block with the file
+header, all `restart_offset` in the first block are relative to the
+start of the file (position 0), and include the file header.  This
+forces the first `restart_offset` to be `28`.
+
+#### ref record
+
+A `ref_record` describes a single reference, storing both the name and
+its value(s). Records are formatted as:
+
+    varint( prefix_length )
+    varint( (suffix_length << 3) | value_type )
+    suffix
+    value?
+
+The `prefix_length` field specifies how many leading bytes of the
+prior reference record's name should be copied to obtain this
+reference's name.  This must be 0 for the first reference in any
+block, and also must be 0 for any `ref_record` whose offset is listed
+in the `restart_offset` table at the end of the block.
+
+Recovering a reference name from any `ref_record` is a simple concat:
+
+    this_name = prior_name[0..prefix_length] + suffix
+
+The `suffix_length` value provides the number of bytes available in
+`suffix` to copy from `suffix` to complete the reference name.
+
+The `value` follows.  Its format is determined by `value_type`, one of
+the following:
+
+- `0x0`: deletion; no value data (see transactions, below)
+- `0x1`: one 20-byte object id; value of the ref
+- `0x2`: two 20-byte object ids; value of the ref, peeled target
+- `0x3`: symbolic reference: `varint( target_len ) target`
+
+Symbolic references use `0x3`, followed by the complete name of the
+reference target.  No compression is applied to the target name.
+
+Types `0x4..0x7` are reserved for future use.
+
+### Ref index
+
+The ref index stores the name of the last reference from every ref
+block in the file, enabling reduced disk seeks for lookups.  Any
+reference can be found by searching the index, identifying the
+containing block, and searching within that block.
+
+The index may be organized into a multi-level index, where the 1st
+level index block points to additional ref index blocks (2nd level),
+which may in turn point to either additional index blocks (e.g. 3rd
+level) or ref blocks (leaf level).  Disk reads required to access a
+ref go up with higher index levels.  Multi-level indexes may be
+required to ensure no single index block exceeds the file format's max
+block size of `16777215` bytes (15.99 MiB).  To acheive constant O(1)
+disk seeks for lookups the index must be a single level, which is
+permitted to exceed the file's configured block size, but not the
+format's max block size of 15.99 MiB.
+
+If present, the ref index block(s) appears after the last ref block.
+
+If there are at least 4 ref blocks, a ref index block should be
+written to improve lookup times.  Cold reads using the index require
+2 disk reads (read index, read block), and binary searching < 4 blocks
+also requires <= 2 reads.  Omitting the index block from smaller files
+saves space.
+
+If the file is unaligned and contains more than one ref block, the ref
+index must be written.
+
+Index block format:
+
+    'i'
+    uint24( block_len )
+    index_record+
+    uint24( restart_offset )+
+    uint16( restart_count )
+
+    padding?
+
+The index blocks begin with `block_type = 'i'` and a 3-byte
+`block_len` which encodes the number of bytes in the block,
+up to but not including the optional `padding`.
+
+The `restart_offset` and `restart_count` fields are identical in
+format, meaning and usage as in ref blocks.
+
+To reduce the number of reads required for random access in very large
+files the index block may be larger than other blocks.  However,
+readers must hold the entire index in memory to benefit from this, so
+it's a time-space tradeoff in both file size and reader memory.
+
+Increasing the file's block size decreases the index size.
+Alternatively a multi-level index may be used, keeping index blocks
+within the file's block size, but increasing the number of blocks
+that need to be accessed.
+
+#### index record
+
+An index record describes the last entry in another block.
+Index records are written as:
+
+    varint( prefix_length )
+    varint( (suffix_length << 3) | 0 )
+    suffix
+    varint( block_position )
+
+Index records use prefix compression exactly like `ref_record`.
+
+Index records store `block_position` after the suffix, specifying the
+absolute position in bytes (from the start of the file) of the block
+that ends with this reference. Readers can seek to `block_position` to
+begin reading the block header.
+
+Readers must examine the block header at `block_position` to determine
+if the next block is another level index block, or the leaf-level ref
+block.
+
+#### Reading the index
+
+Readers loading the ref index must first read the footer (below) to
+obtain `ref_index_position`. If not present, the position will be 0.
+The `ref_index_position` is for the 1st level root of the ref index.
+
+### Obj block format
+
+Object blocks are optional.  Writers may choose to omit object blocks,
+especially if readers will not use the SHA-1 to ref mapping.
+
+Object blocks use unique, abbreviated 2-20 byte SHA-1 keys, mapping
+to ref blocks containing references pointing to that object directly,
+or as the peeled value of an annotated tag.  Like ref blocks, object
+blocks use the file's standard block size. The abbrevation length is
+available in the footer as `obj_id_len`.
+
+To save space in small files, object blocks may be omitted if the ref
+index is not present, as brute force search will only need to read a
+few ref blocks.  When missing, readers should brute force a linear
+search of all references to lookup by SHA-1.
+
+An object block is written as:
+
+    'o'
+    uint24( block_len )
+    obj_record+
+    uint24( restart_offset )+
+    uint16( restart_count )
+
+    padding?
+
+Fields are identical to ref block.  Binary search using the restart
+table works the same as in reference blocks.
+
+Because object identifiers are abbreviated by writers to the shortest
+unique abbreviation within the reftable, obj key lengths are variable
+between 2 and 20 bytes.  Readers must compare only for common prefix
+match within an obj block or obj index.
+
+#### obj record
+
+An `obj_record` describes a single object abbreviation, and the blocks
+containing references using that unique abbreviation:
+
+    varint( prefix_length )
+    varint( (suffix_length << 3) | cnt_3 )
+    suffix
+    varint( cnt_large )?
+    varint( position_delta )*
+
+Like in reference blocks, abbreviations are prefix compressed within
+an obj block.  On large reftables with many unique objects, higher
+block sizes (64k), and higher restart interval (128), a
+`prefix_length` of 2 or 3 and `suffix_length` of 3 may be common in
+obj records (unique abbreviation of 5-6 raw bytes, 10-12 hex digits).
+
+Each record contains `position_count` number of positions for matching
+ref blocks.  For 1-7 positions the count is stored in `cnt_3`.  When
+`cnt_3 = 0` the actual count follows in a varint, `cnt_large`.
+
+The use of `cnt_3` bets most objects are pointed to by only a single
+reference, some may be pointed to by a couple of references, and very
+few (if any) are pointed to by more than 7 references.
+
+A special case exists when `cnt_3 = 0` and `cnt_large = 0`: there
+are no `position_delta`, but at least one reference starts with this
+abbreviation.  A reader that needs exact reference names must scan all
+references to find which specific references have the desired object.
+Writers should use this format when the `position_delta` list would have
+overflowed the file's block size due to a high number of references
+pointing to the same object.
+
+The first `position_delta` is the position from the start of the file.
+Additional `position_delta` entries are sorted ascending and relative
+to the prior entry, e.g.  a reader would perform:
+
+    pos = position_delta[0]
+    prior = pos
+    for (j = 1; j < position_count; j++) {
+      pos = prior + position_delta[j]
+      prior = pos
+    }
+
+With a position in hand, a reader must linearly scan the ref block,
+starting from the first `ref_record`, testing each reference's SHA-1s
+(for `value_type = 0x1` or `0x2`) for full equality.  Faster searching
+by SHA-1 within a single ref block is not supported by the reftable
+format.  Smaller block sizes reduce the number of candidates this step
+must consider.
+
+### Obj index
+
+The obj index stores the abbreviation from the last entry for every
+obj block in the file, enabling reduced disk seeks for all lookups.
+It is formatted exactly the same as the ref index, but refers to obj
+blocks.
+
+The obj index should be present if obj blocks are present, as
+obj blocks should only be written in larger files.
+
+Readers loading the obj index must first read the footer (below) to
+obtain `obj_index_position`.  If not present, the position will be 0.
+
+### Log block format
+
+Unlike ref and obj blocks, log blocks are always unaligned.
+
+Log blocks are variable in size, and do not match the `block_size`
+specified in the file header or footer.  Writers should choose an
+appropriate buffer size to prepare a log block for deflation, such as
+`2 * block_size`.
+
+A log block is written as:
+
+    'g'
+    uint24( block_len )
+    zlib_deflate {
+      log_record+
+      uint24( restart_offset )+
+      uint16( restart_count )
+    }
+
+Log blocks look similar to ref blocks, except `block_type = 'g'`.
+
+The 4-byte block header is followed by the deflated block contents
+using zlib deflate.  The `block_len` in the header is the inflated
+size (including 4-byte block header), and should be used by readers to
+preallocate the inflation output buffer.  A log block's `block_len`
+may exceed the file's block size.
+
+Offsets within the log block (e.g.  `restart_offset`) still include
+the 4-byte header.  Readers may prefer prefixing the inflation output
+buffer with the 4-byte header.
+
+Within the deflate container, a variable number of `log_record`
+describe reference changes.  The log record format is described
+below.  See ref block format (above) for a description of
+`restart_offset` and `restart_count`.
+
+Because log blocks have no alignment or padding between blocks,
+readers must keep track of the bytes consumed by the inflater to
+know where the next log block begins.
+
+#### log record
+
+Log record keys are structured as:
+
+    ref_name '\0' reverse_int64( update_index )
+
+where `update_index` is the unique transaction identifier.  The
+`update_index` field must be unique within the scope of a `ref_name`.
+See the update transactions section below for further details.
+
+The `reverse_int64` function inverses the value so lexographical
+ordering the network byte order encoding sorts the more recent records
+with higher `update_index` values first:
+
+    reverse_int64(int64 t) {
+      return 0xffffffffffffffff - t;
+    }
+
+Log records have a similar starting structure to ref and index
+records, utilizing the same prefix compression scheme applied to the
+log record key described above.
+
+```
+    varint( prefix_length )
+    varint( (suffix_length << 3) | log_type )
+    suffix
+    log_data {
+      old_id
+      new_id
+      varint( name_length    )  name
+      varint( email_length   )  email
+      varint( time_seconds )
+      sint16( tz_offset )
+      varint( message_length )  message
+    }?
+```
+
+Log record entries use `log_type` to indicate what follows:
+
+- `0x0`: deletion; no log data.
+- `0x1`: standard git reflog data using `log_data` above.
+
+The `log_type = 0x0` is mostly useful for `git stash drop`, removing
+an entry from the reflog of `refs/stash` in a transaction file
+(below), without needing to rewrite larger files.  Readers reading a
+stack of reflogs must treat this as a deletion.
+
+For `log_type = 0x1`, the `log_data` section follows
+[git update-ref][update-ref] logging, and includes:
+
+- two 20-byte SHA-1s (old id, new id)
+- varint string of committer's name
+- varint string of committer's email
+- varint time in seconds since epoch (Jan 1, 1970)
+- 2-byte timezone offset in minutes (signed)
+- varint string of message
+
+`tz_offset` is the absolute number of minutes from GMT the committer
+was at the time of the update.  For example `GMT-0800` is encoded in
+reftable as `sint16(-480)` and `GMT+0230` is `sint16(150)`.
+
+The committer email does not contain `<` or `>`, it's the value
+normally found between the `<>` in a git commit object header.
+
+The `message_length` may be 0, in which case there was no message
+supplied for the update.
+
+[update-ref]: https://git-scm.com/docs/git-update-ref#_logging_updates
+
+#### Reading the log
+
+Readers accessing the log must first read the footer (below) to
+determine the `log_position`.  The first block of the log begins at
+`log_position` bytes since the start of the file.  The `log_position`
+is not block aligned.
+
+#### Importing logs
+
+When importing from `$GIT_DIR/logs` writers should globally order all
+log records roughly by timestamp while preserving file order, and
+assign unique, increasing `update_index` values for each log line.
+Newer log records get higher `update_index` values.
+
+Although an import may write only a single reftable file, the reftable
+file must span many unique `update_index`, as each log line requires
+its own `update_index` to preserve semantics.
+
+### Log index
+
+The log index stores the log key (`refname \0 reverse_int64(update_index)`)
+for the last log record of every log block in the file, supporting
+bounded-time lookup.
+
+A log index block must be written if 2 or more log blocks are written
+to the file.  If present, the log index appears after the last log
+block.  There is no padding used to align the log index to block
+alignment.
+
+Log index format is identical to ref index, except the keys are 9
+bytes longer to include `'\0'` and the 8-byte
+`reverse_int64(update_index)`.  Records use `block_position` to
+refer to the start of a log block.
+
+#### Reading the index
+
+Readers loading the log index must first read the footer (below) to
+obtain `log_index_position`. If not present, the position will be 0.
+
+### Footer
+
+After the last block of the file, a file footer is written.  It begins
+like the file header, but is extended with additional data.
+
+A 68-byte footer appears at the end:
+
+```
+    'REFT'
+    uint8( version_number = 1 )
+    uint24( block_size )
+    uint64( min_update_index )
+    uint64( max_update_index )
+
+    uint64( ref_index_position )
+    uint64( (obj_position << 5) | obj_id_len )
+    uint64( obj_index_position )
+
+    uint64( log_position )
+    uint64( log_index_position )
+
+    uint32( CRC-32 of above )
+```
+
+If a section is missing (e.g. ref index) the corresponding position
+field (e.g. `ref_index_position`) will be 0.
+
+- `obj_position`: byte position for the first obj block.
+- `obj_id_len`: number of bytes used to abbreviate object identifiers
+  in obj blocks.
+- `log_position`: byte position for the first log block.
+- `ref_index_position`: byte position for the start of the ref index.
+- `obj_index_position`: byte position for the start of the obj index.
+- `log_index_position`: byte position for the start of the log index.
+
+#### Reading the footer
+
+Readers must seek to `file_length - 68` to access the footer.  A
+trusted external source (such as `stat(2)`) is necessary to obtain
+`file_length`.  When reading the footer, readers must verify:
+
+- 4-byte magic is correct
+- 1-byte version number is recognized
+- 4-byte CRC-32 matches the other 64 bytes (including magic, and version)
+
+Once verified, the other fields of the footer can be accessed.
+
+### Varint encoding
+
+Varint encoding is identical to the ofs-delta encoding method used
+within pack files.
+
+Decoder works such as:
+
+    val = buf[ptr] & 0x7f
+    while (buf[ptr] & 0x80) {
+      ptr++
+      val = ((val + 1) << 7) | (buf[ptr] & 0x7f)
+    }
+
+### Binary search
+
+Binary search within a block is supported by the `restart_offset`
+fields at the end of the block.  Readers can binary search through the
+restart table to locate between which two restart points the sought
+reference or key should appear.
+
+Each record identified by a `restart_offset` stores the complete key
+in the `suffix` field of the record, making the compare operation
+during binary search straightforward.
+
+Once a restart point lexicographically before the sought reference has
+been identified, readers can linearly scan through the following
+record entries to locate the sought record, terminating if the current
+record sorts after (and therefore the sought key is not present).
+
+#### Restart point selection
+
+Writers determine the restart points at file creation.  The process is
+arbitrary, but every 16 or 64 records is recommended.  Every 16 may
+be more suitable for smaller block sizes (4k or 8k), every 64 for
+larger block sizes (64k).
+
+More frequent restart points reduces prefix compression and increases
+space consumed by the restart table, both of which increase file size.
+
+Less frequent restart points makes prefix compression more effective,
+decreasing overall file size, with increased penalities for readers
+walking through more records after the binary search step.
+
+A maximum of `65535` restart points per block is supported.
+
+## Considerations
+
+### Lightweight refs dominate
+
+The reftable format assumes the vast majority of references are single
+SHA-1 valued with common prefixes, such as Gerrit Code Review's
+`refs/changes/` namespace, GitHub's `refs/pulls/` namespace, or many
+lightweight tags in the `refs/tags/` namespace.
+
+Annotated tags storing the peeled object cost an additional 20 bytes
+per reference.
+
+### Low overhead
+
+A reftable with very few references (e.g. git.git with 5 heads)
+is 269 bytes for reftable, vs. 332 bytes for packed-refs.  This
+supports reftable scaling down for transaction logs (below).
+
+### Block size
+
+For a Gerrit Code Review type repository with many change refs, larger
+block sizes (64 KiB) and less frequent restart points (every 64) yield
+better compression due to more references within the block compressing
+against the prior reference.
+
+Larger block sizes reduce the index size, as the reftable will
+require fewer blocks to store the same number of references.
+
+### Minimal disk seeks
+
+Assuming the index block has been loaded into memory, binary searching
+for any single reference requires exactly 1 disk seek to load the
+containing block.
+
+### Scans and lookups dominate
+
+Scanning all references and lookup by name (or namespace such as
+`refs/heads/`) are the most common activities performed on repositories.
+SHA-1s are stored directly with references to optimize this use case.
+
+### Logs are infrequently read
+
+Logs are infrequently accessed, but can be large.  Deflating log
+blocks saves disk space, with some increased penalty at read time.
+
+Logs are stored in an isolated section from refs, reducing the burden
+on reference readers that want to ignore logs.  Further, historical
+logs can be isolated into log-only files.
+
+### Logs are read backwards
+
+Logs are frequently accessed backwards (most recent N records for
+master to answer `master@{4}`), so log records are grouped by
+reference, and sorted descending by update index.
+
+## Repository format
+
+### Version 1
+
+A repository must set its `$GIT_DIR/config` to configure reftable:
+
+    [core]
+        repositoryformatversion = 1
+    [extensions]
+        refStorage = reftable
+
+### Layout
+
+The `$GIT_DIR/refs` path is a file when reftable is configured, not a
+directory.  This prevents loose references from being stored.
+
+A collection of reftable files are stored in the `$GIT_DIR/reftable/`
+directory:
+
+    00000001.log
+    00000001.ref
+    00000002.ref
+
+where reftable files are named by a unique name such as produced by
+the function `${update_index}.ref`.
+
+Log-only files use the `.log` extension, while ref-only and mixed ref
+and log files use `.ref`.  extension.
+
+The stack ordering file is `$GIT_DIR/refs` and lists the current
+files, one per line, in order, from oldest (base) to newest (most
+recent):
+
+    $ cat .git/refs
+    00000001.log
+    00000001.ref
+    00000002.ref
+
+Readers must read `$GIT_DIR/refs` to determine which files are
+relevant right now, and search through the stack in reverse order
+(last reftable is examined first).
+
+Reftable files not listed in `refs` may be new (and about to be added
+to the stack by the active writer), or ancient and ready to be pruned.
+
+### Readers
+
+Readers can obtain a consistent snapshot of the reference space by
+following:
+
+1.  Open and read the `refs` file.
+2.  Open each of the reftable files that it mentions.
+3.  If any of the files is missing, goto 1.
+4.  Read from the now-open files as long as necessary.
+
+### Update transactions
+
+Although reftables are immutable, mutations are supported by writing a
+new reftable and atomically appending it to the stack:
+
+1. Acquire `refs.lock`.
+2. Read `refs` to determine current reftables.
+3. Select `update_index` to be most recent file's `max_update_index + 1`.
+4. Prepare temp reftable `${update_index}_XXXXXX`, including log entries.
+5. Rename `${update_index}_XXXXXX` to `${update_index}.ref`.
+6. Copy `refs` to `refs.lock`, appending file from (5).
+7. Rename `refs.lock` to `refs`.
+
+During step 4 the new file's `min_update_index` and `max_update_index`
+are both set to the `update_index` selected by step 3.  All log
+records for the transaction use the same `update_index` in their keys.
+This enables later correlation of which references were updated by the
+same transaction.
+
+Because a single `refs.lock` file is used to manage locking, the
+repository is single-threaded for writers.  Writers may have to
+busy-spin (with backoff) around creating `refs.lock`, for up to an
+acceptable wait period, aborting if the repository is too busy to
+mutate.  Application servers wrapped around repositories (e.g.  Gerrit
+Code Review) can layer their own lock/wait queue to improve fairness
+to writers.
+
+### Reference deletions
+
+Deletion of any reference can be explicitly stored by setting the
+`type` to `0x0` and omitting the `value` field of the `ref_record`.
+This serves as a tombstone, overriding any assertions about the
+existence of the reference from earlier files in the stack.
+
+### Compaction
+
+A partial stack of reftables can be compacted by merging references
+using a straightforward merge join across reftables, selecting the
+most recent value for output, and omitting deleted references that do
+not appear in remaining, lower reftables.
+
+A compacted reftable should set its `min_update_index` to the smallest of
+the input files' `min_update_index`, and its `max_update_index`
+likewise to the largest input `max_update_index`.
+
+For sake of illustration, assume the stack currently consists of
+reftable files (from oldest to newest): A, B, C, and D. The compactor
+is going to compact B and C, leaving A and D alone.
+
+1.  Obtain lock `refs.lock` and read the `refs` file.
+2.  Obtain locks `B.lock` and `C.lock`.
+    Ownership of these locks prevents other processes from trying
+    to compact these files.
+3.  Release `refs.lock`.
+4.  Compact `B` and `C` into a temp file `${min_update_index}_XXXXXX`.
+5.  Reacquire lock `refs.lock`.
+6.  Verify that `B` and `C` are still in the stack, in that order. This
+    should always be the case, assuming that other processes are adhering
+    to the locking protocol.
+7.  Rename `${min_update_index}_XXXXXX` to `${min_update_index}_2.ref`.
+8.  Write the new stack to `refs.lock`, replacing `B` and `C` with the
+    file from (4).
+9.  Rename `refs.lock` to `refs`.
+10. Delete `B` and `C`, perhaps after a short sleep to avoid forcing
+    readers to backtrack.
+
+This strategy permits compactions to proceed independently of updates.
+
+## Alternatives considered
+
+### bzip packed-refs
+
+`bzip2` can significantly shrink a large packed-refs file (e.g. 62
+MiB compresses to 23 MiB, 37%).  However the bzip format does not support
+random access to a single reference. Readers must inflate and discard
+while performing a linear scan.
+
+Breaking packed-refs into chunks (individually compressing each chunk)
+would reduce the amount of data a reader must inflate, but still
+leaves the problem of indexing chunks to support readers efficiently
+locating the correct chunk.
+
+Given the compression achieved by reftable's encoding, it does not
+seem necessary to add the complexity of bzip/gzip/zlib.
+
+### Michael Haggerty's alternate format
+
+Michael Haggerty proposed [an alternate][mh-alt] format to reftable on
+the Git mailing list.  This format uses smaller chunks, without the
+restart table, and avoids block alignment with padding.  Reflog entries
+immediately follow each ref, and are thus interleaved between refs.
+
+Performance testing indicates reftable is faster for lookups (51%
+faster, 11.2 usec vs.  5.4 usec), although reftable produces a
+slightly larger file (+ ~3.2%, 28.3M vs 29.2M):
+
+format    |  size  | seek cold | seek hot  |
+---------:|-------:|----------:|----------:|
+mh-alt    | 28.3 M | 23.4 usec | 11.2 usec |
+reftable  | 29.2 M | 19.9 usec |  5.4 usec |
+
+[mh-alt]: https://public-inbox.org/git/CAMy9T_HCnyc1g8XWOOWhe7nN0aEFyyBskV2aOMb_fe+wGvEJ7A@mail.gmail.com/
+
+### JGit Ketch RefTree
+
+[JGit Ketch][ketch] proposed [RefTree][reftree], an encoding of
+references inside Git tree objects stored as part of the repository's
+object database.
+
+The RefTree format adds additional load on the object database storage
+layer (more loose objects, more objects in packs), and relies heavily
+on the packer's delta compression to save space.  Namespaces which are
+flat (e.g.  thousands of tags in refs/tags) initially create very
+large loose objects, and so RefTree does not address the problem of
+copying many references to modify a handful.
+
+Flat namespaces are not efficiently searchable in RefTree, as tree
+objects in canonical formatting cannot be binary searched. This fails
+the need to handle a large number of references in a single namespace,
+such as GitHub's `refs/pulls`, or a project with many tags.
+
+[ketch]: https://dev.eclipse.org/mhonarc/lists/jgit-dev/msg03073.html
+[reftree]: https://public-inbox.org/git/CAJo=hJvnAPNAdDcAAwAvU9C4RVeQdoS3Ev9WTguHx4fD0V_nOg@mail.gmail.com/
+
+### LMDB
+
+David Turner proposed [using LMDB][dt-lmdb], as LMDB is lightweight
+(64k of runtime code) and GPL-compatible license.
+
+A downside of LMDB is its reliance on a single C implementation.  This
+makes embedding inside JGit (a popular reimplemenation of Git)
+difficult, and hoisting onto virtual storage (for JGit DFS) virtually
+impossible.
+
+A common format that can be supported by all major Git implementations
+(git-core, JGit, libgit2) is strongly preferred.
+
+[dt-lmdb]: https://public-inbox.org/git/1455772670-21142-26-git-send-email-dturner@twopensource.com/
+
+## Future
+
+### Longer hashes
+
+Version will bump (e.g.  2) to indicate `value` uses a different
+object id length other than 20.  The length could be stored in an
+expanded file header, or hardcoded as part of the version.