When running a GC.repack() against a repository with over one
thousands of refs/heads and tens of millions of ObjectIds,
the calculation of all bitmaps associated with all the refs
would result in an unreasonable big file that would take up to
several hours to compute.
Test scenario: repo with 2500 heads / 10M obj Intel Xeon E5-2680 2.5GHz
Before this change: 20 mins
After this change and 2300 heads excluded: 10 mins (90s for bitmap)
Having such a large bitmap file is also slow in the runtime
processing and have negligible or even negative benefits, because
the time lost in reading and decompressing the bitmap in memory
would not be compensated by the time saved by using it.
It is key to preserve the bitmaps for those refs that are mostly
used in clone/fetch and give the ability to exlude some refs
prefixes that are known to be less frequently accessed, even
though they may actually be actively written.
Example: Gerrit sandbox branches may even be actively
used and selected automatically because its commits are very
recent, however, they may bloat the bitmap, making it ineffective.
A mono-repo with tens of thousands of developers may have
a relatively small number of active branches where the
CI/CD jobs are continuously fetching/cloning the code. However,
because Gerrit allows the use of sandbox branches, the
total number of refs/heads may be even tens to hundred
thousands.
Change-Id: I466dcde69fa008e7f7785735c977f6e150e3b644
Signed-off-by: Luca Milanesio <luca.milanesio@gmail.com>
The search for reuse phase for *all* the objects scans *all*
the packfiles, looking for the best candidate to serve back to the
client.
This can lead to an expensive operation when the number of
packfiles and objects is high.
Add parameter "pack.searchForReuseTimeout" to limit the time spent
on this search.
Change-Id: I54f5cddb6796fdc93ad9585c2ab4b44854fa6c48
Previously, the list of tables was in .git/refs. This makes
repo detection fail in older clients, which is undesirable.
This is proposal was discussed and approved on the git@vger list at
https://lore.kernel.org/git/CAFQ2z_PvKiz==GyS6J1H1uG0FRPL86JvDj+LjX1We4-yCSVQ+g@mail.gmail.com/
For backward compatibility, JGit could detect a file under .git/refs and
use it as a reftable list.
Change-Id: Ic0b974fa250cfa905463b811957e2a4fdd7bbc6b
Signed-off-by: Han-Wen Nienhuys <hanwen@google.com>
MergedReftableTest#scanDuplicates tests whether we can write duplicate
keys in a merged reftable. Apparently, the first key appearing should
get precedence, and this works because the sort() algorithm on ordered
collections is stable.
This is potentially confusing behavior, because you can write data
into the table that cannot be retrieved (Merged table can only have
one entry per key), and the APIs such as exactRef() only return a
single value.
Make this consistent with behavior introduced in I04f55c481 "reftable:
enforce ordering for ref and log writes" by considering a duplicate key
in sortAndWriteRefs as a fatal runtime error.
Change-Id: I1eedd18f028180069f78c5c467169dcfe1521157
Signed-off-by: Han-Wen Nienhuys <hanwen@google.com>
By using ${min_update}-${max_update} as file name template, we
guarantee that each file has a unique name.
This allows data from open files to be cached across reloads of the
stack.
This is in anticipation of Change I1837f268e
("file: implement FileReftableDatabase"), which is the first
implementation of reftable on a filesystem.
Change-Id: I7ef0610eb60c494165382d0c372afcf41f074393
Signed-off-by: Han-Wen Nienhuys <hanwen@google.com>
Add an update_index to every reference in a reftable, storing the
exact transaction that last modified the reference. This is necessary
to fix some merge race conditions.
Consider updates at T1, T3 are present in two reftables. Compacting
these will create a table with range [T1,T3]. If T2 arrives during
or after the compaction its impossible for readers to know how to
merge the [T1,T3] table with the T2 table.
With an explicit update_index per reference, MergedReftable is able to
individually sort each reference, merging individual entries at T3
from [T1,T3] ahead of identically named entries appearing in T2.
Change-Id: Ie4065d4176a5a0207dcab9696ae05d086e042140
Some repositories contain a lot of references (e.g. android at 866k,
rails at 31k). The reftable format provides:
- Near constant time lookup for any single reference, even when the
repository is cold and not in process or kernel cache.
- Near constant time verification a SHA-1 is referred to by at least
one reference (for allow-tip-sha1-in-want).
- Efficient lookup of an entire namespace, such as `refs/tags/`.
- Support atomic push `O(size_of_update)` operations.
- Combine reflog storage with ref storage.
Change-Id: I29d0ff1eee475845660ac9173413e1407adcfbf2