Commit Graph

2793 Commits

Author SHA1 Message Date
Robin Stocker 78fca8a099 Improve test coverage of AutoCRLF(In|Out)putStream
Bug: 405672
Change-Id: I3894e98617fcee16dc2ac9853c203c62eb30c3ab
Signed-off-by: Chris Aniszczyk <zx@twitter.com>
2013-04-18 14:58:51 -05:00
Shawn Pearce fa1bc6abb7 Merge changes Id2848c16,I7621c434
* changes:
  Rescale "Compressing objects" progress meter by size
  Split delta search buckets by byte weight
2013-04-17 14:53:13 -04:00
Shawn Pearce 5d8a9f6f3f Rescale "Compressing objects" progress meter by size
Instead of counting objects processed, count number of bytes added
into the window. This should rescale the progress meter so that 30%
complete means 30% of the total uncompressed content size has been
inflated and fed into the window.

In theory the progress meter should be more accurate about its
percentage complete/remaining fraction than with objects. When
counting objects, small objects move the progress meter more rapidly
than large objects, but demand a smaller amount of work than large
objects being compressed.
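
As a rough illustration of the difference (a sketch only, not the PackWriter code; the class and method names are made up for this example), the two ways of computing the percentage can be compared like this:

```java
final class ProgressScale {
    /** Percent complete when every processed object counts the same. */
    static int percentByCount(long[] sizes, int done) {
        return (int) (100L * done / sizes.length);
    }

    /** Percent complete when the first `done` objects count by byte size. */
    static int percentByBytes(long[] sizes, int done) {
        long total = 0;
        long seen = 0;
        for (int i = 0; i < sizes.length; i++) {
            total += sizes[i];
            if (i < done)
                seen += sizes[i];
        }
        // An empty or zero-byte input is treated as already complete.
        return total == 0 ? 100 : (int) (100L * seen / total);
    }
}
```

With sizes {10, 10, 980} and two objects processed, counting objects reports 66% while counting bytes reports 2%, which is closer to the work actually remaining.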

Change-Id: Id2848c16a2148b5ca51e0ca1e29c5be97eefeb48
2013-04-17 14:43:01 -04:00
Shawn Pearce 21e4aa2b9e Split delta search buckets by byte weight
Instead of assuming all objects cost the same amount of time to
delta compress, aggregate the byte size of objects in the list
and partition threads with roughly equal total bytes.

Before splitting the list select the N largest paths and assign
each one to its own thread. This allows threads to get through the
worst cases in parallel before attempting smaller paths that are
more likely to be splittable.

By running the largest path buckets first on each thread the likely
slowest part of compression is done early, while progress is still
reporting a low percentage. This gives users a better impression of
how fast the phase will run. On very complex inputs the slow part
is more likely to happen first, making a user realize it's time to
go grab lunch, or even run it overnight.

If the worst sections are earlier, memory overruns may show up
earlier, giving the user a chance to correct the configuration and
try again before wasting large amounts of time. It also makes it
less likely the delta compression phase reaches 92% in 30 minutes
and then crawls for 10 hours through the remaining 8%.
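
A minimal sketch of splitting by byte weight, assuming the list is cut into contiguous slices and ignoring the separate handling of the N largest paths described above. ByteWeightSplit and split() are hypothetical names, not the JGit implementation:

```java
import java.util.ArrayList;
import java.util.List;

final class ByteWeightSplit {
    /** Returns the exclusive end index of each contiguous slice. */
    static List<Integer> split(long[] sizes, int threads) {
        long total = 0;
        for (long s : sizes)
            total += s;

        List<Integer> ends = new ArrayList<>();
        long acc = 0;
        int slice = 1;
        for (int i = 0; i < sizes.length; i++) {
            acc += sizes[i];
            // Close the slice once the running total reaches its share of bytes.
            if (slice < threads && acc * threads >= total * slice) {
                ends.add(i + 1);
                slice++;
            }
        }
        // The last slice always runs to the end of the list.
        if (ends.isEmpty() || ends.get(ends.size() - 1) != sizes.length)
            ends.add(sizes.length);
        return ends;
    }
}
```

For sizes {100, 1, 1, 1, 97} and 2 threads the slices end at indexes 1 and 5: the single large object gets a thread to itself, where a split by object count would have paired it with other entries.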

Change-Id: I7621c4349b99e40098825c4966b8411079992e5f
2013-04-17 11:31:00 -07:00
Shawn Pearce e74263e743 Merge "Support excluding objects during DFS compaction" 2013-04-17 14:19:21 -04:00
Shawn Pearce 3c27ee1a91 Support excluding objects during DFS compaction
By excluding objects the compactor can avoid storing objects that
are already well packed in the base GC packs, or any other pack
not being replaced by the current compaction operation.

For deltas the base object is still included even if the base exists
in another exclusion set.  This favors keeping deltas for recent
history, to support faster fetch operations for clients.

Change-Id: Ie822fe075fe5072fe3171450fda2f0ca507796a1
2013-04-16 17:54:23 -07:00
Matthias Sohn aa7be667bc Make recursive merge strategy the default merge strategy
Use recursive merge as the default strategy since it can successfully
merge more cases than the resolve strategy can. This is also the default
in native Git.

Change-Id: I38fd522edb2791f15d83e99038185edb09fed8e1
Signed-off-by: Matthias Sohn <matthias.sohn@sap.com>
2013-04-15 21:46:12 +02:00
Colby Ranger eaa52b12f5 Update PackBitmapIndexRemapper to handle mappings not in the new pack.
Previously, the code assumed all commits in the old pack would also
be present in the new pack. This assumption caused an
ArrayIndexOutOfBoundsException during remapping of ids. Fix the
iterator to only return entries that may be remapped. Furthermore,
update getBitmap() to return null if the commit does not exist in the
new pack.

Change-Id: I065babe8cd39a7654c916bd01c7012135733dddf
2013-04-15 09:35:07 -07:00
Robin Rosenberg 4c638be79f Fix boundary conditions in AutoCRLFOutputStream
This fixes some problems with inputs around the size of the internal
buffer in AutoCRLFOutputStream (8000).

Tests supplied by Robin Stocker.

Bug: 405672
Change-Id: I6147897290392b3bfd4040e8006da39c302a3d49
2013-04-14 19:53:48 +02:00
Robin Rosenberg a6ed390ea7 NLS warning cleanup
Change-Id: Ia76aa02dd330a1f88096c2b059b363aa38d653e9
2013-04-14 00:41:55 +02:00
Robin Rosenberg 5db307a695 Merge "Fix a possible NPE" 2013-04-13 06:16:11 -04:00
Shawn Pearce 5f03dc61b4 Merge changes I845caede,Ie25c6d3a,I5caec313,Ib11ff99f,I9ccf20c3,Ic7826f29,I1bdd8b58,Idb84c1d7,I078841f9
* changes:
  Always attempt delta compression when reuseDeltas is false
  Avoid TemporaryBuffer.Heap on very small deltas
  Correct distribution of allowed delta size along chain length
  Split remaining delta work on path boundaries
  Replace DeltaWindow array with circularly linked list
  Micro-optimize copy instructions in DeltaEncoder
  Micro-optimize DeltaWindow primary loop
  Micro-optimize DeltaWindow maxMemory test to be != 0
  Mark DeltaWindowEntry methods final
2013-04-12 16:16:09 -04:00
Shawn Pearce c9707e6353 Always attempt delta compression when reuseDeltas is false
If reuseObjects=true but reuseDeltas=false the caller wants to attempt
a delta for every object in the input list. Test for reuseDeltas
to ensure every object passes through the searchInWindow() method.

If no delta is possible for an object and it will be stored whole
(non-delta format), PackWriter may still reuse its content from any
source pack. This avoids an inflate()-deflate() cycle to recompress
the object contents.

Change-Id: I845caeded419ef4551ef1c85787dd5ffd73235d9
2013-04-12 12:59:02 -07:00
Shawn Pearce a5c6aac76c Avoid TemporaryBuffer.Heap on very small deltas
TemporaryBuffer is great when the output size is not known, but must
be bound by a relatively large upper limit that fits in memory, e.g.
64 KiB or 20 MiB.  The buffer gracefully supports growing storage by
allocating 8 KiB blocks and storing them in an ArrayList.

In a Git repository many deltas are less than 8 KiB.  Typical tree
objects are well below this threshold, and their deltas must be
encoded even smaller.

For these much smaller cases avoid the 8 KiB minimum allocation used
by TemporaryBuffer.  Instead allocate a very small OutputStream
writing to an array that is sized at the limit.
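
The idea can be sketched as a stream backed by one array sized at the limit. This is an illustration only; BoundedArrayOutputStream is a hypothetical name, and throwing IOException on overflow is an assumption about how a caller would detect an oversized delta:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.Arrays;

final class BoundedArrayOutputStream extends OutputStream {
    private final byte[] buf; // allocated once, exactly at the limit
    private int cnt;

    BoundedArrayOutputStream(int limit) {
        buf = new byte[limit];
    }

    @Override
    public void write(int b) throws IOException {
        if (cnt == buf.length)
            throw new IOException("limit exceeded");
        buf[cnt++] = (byte) b;
    }

    @Override
    public void write(byte[] src, int off, int len) throws IOException {
        // Reject the whole write rather than storing a partial delta.
        if (len > buf.length - cnt)
            throw new IOException("limit exceeded");
        System.arraycopy(src, off, buf, cnt, len);
        cnt += len;
    }

    int length() {
        return cnt;
    }

    byte[] toByteArray() {
        return Arrays.copyOf(buf, cnt);
    }
}
```

For a limit of a few hundred bytes this allocates only the array itself, rather than an 8 KiB block plus a list to hold it.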

Change-Id: Ie25c6d3a8cf4604e0f8cd9a3b5b701a592d6ffca
2013-04-12 12:07:11 -07:00
Shawn Pearce 8a7c2f97d0 Correct distribution of allowed delta size along chain length
Nicolas Pitre discovered a very simple rule for selecting between two
different delta base candidates:

  - if based on a whole object, must be <= 50% of target
  - if at end of a chain, must be <= 1/depth * 50% of target

The rule penalizes deltas near the end of the chain, requiring them to
be very small in order to be kept by the packer.  This favors deltas
that are based on a shorter chain, where the read-time unpack cost is
much lower.  Fewer bytes need to be consulted from the source pack
file, and less copying is required in memory to rebuild the object.

Junio Hamano explained Nico's rule to me today, and this commit fixes
DeltaWindow to implement it as described.

When no base has been chosen the computation is simply the statements
denoted above.  However once a base with depth of 9 has been chosen
(e.g.  when pack.depth is limited to 10), a non-delta source may
create a new delta that is up to 10x larger than the already selected
base.  This reflects the intent of Nico's size distribution rule no
matter what order objects are visited in the DeltaWindow.
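
The two endpoints of the rule can be written as one formula. The sketch below assumes the limit shrinks linearly with the depth of the candidate base, which matches both bullets above but is not taken from the DeltaWindow source; maxDeltaSize and its parameters are hypothetical names:

```java
final class DeltaSizeLimit {
    /**
     * Largest delta accepted for a target of targetSize bytes when the
     * candidate base already sits srcDepth links deep in a chain.
     * Assumes 0 <= srcDepth < maxDepth.
     */
    static long maxDeltaSize(long targetSize, int srcDepth, int maxDepth) {
        long half = targetSize / 2; // whole-object base: up to 50% of target
        return half * (maxDepth - srcDepth) / maxDepth;
    }
}
```

For a 1000 byte target with a depth limit of 10, a whole-object base (srcDepth 0) may produce a delta of up to 500 bytes, while a base at depth 9 may produce at most 50 bytes, the 10x spread mentioned above.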

With this patch and my other patches applied, repacking JGit with:

  [pack]
    reuseObjects = false
    reuseDeltas = false
    depth = 50
    window = 250
    threads = 4
    compression = 9

  CGit (all) 5,711,735 bytes; real 0m13.942s user 0m47.722s [1]
  JGit heads 5,718,295 bytes; real 0m11.880s user 0m38.177s [2]
       rest      9,809 bytes

The improved JGit result for the head pack is only 6.4 KiB larger than
CGit's resulting pack.  This patch allowed JGit to find an additional
39.7 KiB worth of space savings.  JGit now also often runs 2s faster
than CGit, despite also creating bitmaps and pruning objects after the
head pack creation.

[1] time git repack -a -d -F --window=250 --depth=50
[2] time java -Xmx128m -jar jgit debug-gc

Change-Id: I5caec31359bf7248cabdd2a3254c84d4ee3cd96b
2013-04-12 12:07:09 -07:00
Shawn Pearce 3b7924f403 Split remaining delta work on path boundaries
When an idle thread tries to steal work from a sibling's remaining
toSearch queue, always try to split along a path boundary. This
avoids missing delta opportunities in the current window of the
thread whose work is being taken.

The search order is reversed to walk further down the chain from
current position, avoiding the risk of splitting the list within
the path the thread is currently processing.

When selecting which thread to split from use an accurate estimate
of the size to be taken. This avoids selecting a thread that has
only one path remaining but may contain more pending entries than
another thread with several paths remaining.

As there is now a race condition where the straggling thread can
start the next path before the split can finish, the stealWork()
loop spins until it is able to acquire a split or there is only
one path remaining in the siblings.
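
A simplified, single-threaded sketch of choosing a split point on a path boundary. It assumes each queued entry carries a path hash and that a boundary is any index where the hash changes; PathSplit and findSplit() are hypothetical names, and none of the locking or spinning described above is shown:

```java
final class PathSplit {
    /**
     * Finds an index in (cur, end) at or after the midpoint where the path
     * changes, so that [index, end) can be handed to another thread.
     * Returns -1 when no boundary exists in the searched range.
     */
    static int findSplit(int[] pathHash, int cur, int end) {
        int mid = cur + (end - cur) / 2;
        for (int i = Math.max(mid, cur + 1); i < end; i++) {
            // A boundary is where this entry's path differs from the previous one.
            if (pathHash[i] != pathHash[i - 1])
                return i;
        }
        return -1;
    }
}
```

For hashes {1, 1, 1, 1, 2, 2, 3} the midpoint is index 3, which is still inside path 1, so the split moves forward to index 4 where path 2 begins.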

Change-Id: Ib11ff99f90a4d9efab24bf4a85342cc63203dba5
2013-04-12 12:03:38 -07:00
Shawn Pearce 65f44bef23 Remove DFS locality ordering during packing
PackWriter generally chooses the order for objects when it builds the
object lists.  This ordering already depends on history information to
guide placing more recent objects first and historical objects last.

Allow PackWriter to make the basic ordering decisions, instead of
trying to override them.  The old approach of sorting the list caused
DfsReader to override any ordering change PackWriter might have tried
to make when repacking a repository.

This now better matches with WindowCursor's implementation, where
PackWriter solely determines the object ordering.

Change-Id: Ic17ab5631ec539f0758b962966c3a1823735b814
2013-04-12 07:10:30 -07:00
Shawn Pearce af33a911d0 Replace DeltaWindow array with circularly linked list
Typical window sizes are 10 and 250 (although others are accepted).
In either case the pointer overhead of 1 pointer in an array or
2 pointers for a doubly linked list is trivial.  A doubly linked
list as used here for window=250 is only another 1024 bytes on a
32 bit machine, or 2048 bytes on a 64 bit machine.

The critical search loops scan through the array in either the
previous direction or the next direction until the cycle is finished,
or some other scan abort condition is reached.  Loading the next
object's pointer from a field in the current object avoids the
branch required to test for wrapping around the edge of the array.
It also saves the array bounds check on each access.

When a delta is chosen the window is shuffled to hoist the currently
selected base as an earlier candidate for the next object. Moving
the window entry is easier in a doubly linked list than sliding a
group of array entries.
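
A minimal sketch of such a ring, to show why moving an entry is cheap. WindowEntry here is a hypothetical stand-in and not the DeltaWindowEntry class:

```java
final class WindowEntry {
    WindowEntry prev = this; // a single entry forms a ring with itself
    WindowEntry next = this;
    final String name;

    WindowEntry(String name) {
        this.name = name;
    }

    /** Links a fresh entry directly after this one and returns it. */
    WindowEntry insertAfter(String n) {
        WindowEntry e = new WindowEntry(n);
        e.prev = this;
        e.next = next;
        next.prev = e;
        next = e;
        return e;
    }

    /** Unlinks this entry and relinks it directly after target. */
    void moveAfter(WindowEntry target) {
        if (this == target || target.next == this)
            return; // already in place
        prev.next = next;
        next.prev = prev;
        prev = target;
        next = target.next;
        target.next.prev = this;
        target.next = this;
    }
}
```

Scanning follows next (or prev) until the starting entry is reached again, with no index arithmetic or wrap-around test. Moving an entry touches a fixed number of pointers regardless of the window size.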

Change-Id: I9ccf20c3362a78678aede0f0f2cda165e509adff
2013-04-11 10:44:51 -07:00
Shawn Pearce 0f32901ab7 Micro-optimize copy instructions in DeltaEncoder
The copy instruction formatter should not compute the shifts and
masks twice.  Instead compute them once and assume there is a register
available to store the temporary "b" for compare with 0.

Change-Id: Ic7826f29dca67b16903d8f790bdf785eb478c10d
2013-04-11 01:17:35 -07:00
Shawn Pearce 1db50c9d91 Micro-optimize DeltaWindow primary loop
javac and the JIT are more likely to understand a boolean being
used as a branch conditional than comparing int against 0 and 1.
Rewrite NEXT_RES and NEXT_SRC constants to be booleans so the
code is clarified for the JIT.

Change-Id: I1bdd8b587a69572975a84609c779b9ebf877b85d
2013-04-11 01:17:28 -07:00
Shawn Pearce 6903fa4a34 Micro-optimize DeltaWindow maxMemory test to be != 0
Instead of using a compare-with-0, use a does-not-equal-0 test.
javac bytecode has a special instruction for this, as it
is very common in software. We can assume the JIT knows
how to efficiently translate the opcode to machine code,
and processors can do != 0 very quickly.

Change-Id: Idb84c1d744d2874517fd4bfa1db390e2dbf64eac
2013-04-11 01:17:22 -07:00
Robin Rosenberg 4955301fac Merge "Consider working tree changes when stashing newly added files" 2013-04-11 02:06:54 -04:00
Shawn Pearce 4db695c1c6 Mark DeltaWindowEntry methods final
This class and all of its methods are only package visible.
Clarify the methods as final for the benefit of the JIT to
inline trivial code.

Change-Id: I078841f9900dbf299fbe6abf2599f0208ae96856
2013-04-10 21:20:24 -07:00
Shawn Pearce b5cbfa0146 Merge changes Ideecc472,I2b12788a,I6cb9382d,I12cd3326,I200baa0b,I05626f2e,I65e45422
* changes:
  Increase PackOutputStream copy buffer to 64 KiB
  Tighten object header writing in PackOutputStream
  Skip main thread test in ThreadSafeProgressMonitor
  Declare members of PackOutputStream final
  Always allocate the PackOutputStream copyBuffer
  Disable CRC32 computation when no PackIndex will be created
  Steal work from delta threads to rebalance CPU load
2013-04-10 20:56:13 -04:00
Robin Rosenberg 8272f65730 Merge "LogCommand.all(): filter out refs that do not refer to commit objects" 2013-04-10 17:30:18 -04:00
Robin Rosenberg ad2ffc576b Merge "LogCommand.all(), peel references before using them" 2013-04-10 17:29:58 -04:00
Shawn Pearce 6c0bb4351d Increase PackOutputStream copy buffer to 64 KiB
Colby just pointed out to me the buffer was 16 KiB. This may
be very small for common objects. Increase to 64 KiB.

Change-Id: Ideecc4720655a57673252f7adb8eebdf2fda230d
2013-04-10 13:05:58 -07:00
Shawn Pearce 46ef61a702 Tighten object header writing in PackOutputStream
Most objects are written as OFS_DELTA with the base in the pack,
that is why this case comes first in writeHeader(). Rewrite the
condition to always examine this first and cache the PackWriter's
formatting flag for use of OFS_DELTA headers. In modern Git networks
this is true more often than it is false.

Assume the cost of write() is high, especially due to entering the
MessageDigest to update the pack footer SHA-1 computation. Combine
the OFS_DELTA information as part of the header buffer so that the
entire burst is a single write call, rather than two relatively
small ones. Most OFS_DELTA headers are <= 6 bytes, so this rewrite
transforms 2 writes of 3 bytes each into 1 write of ~6 bytes.

Try to simplify the objectHeader code to reduce branches and use
more local registers. This shouldn't really be necessary if the
compiler is well optimized, but it isn't very hard to clarify data
usage to either javac or the JIT, which may make it easier for the
JIT to produce better machine code for this method.

Change-Id: I2b12788ad6866076fabbf7fa11f8cce44e963f35
2013-04-10 12:59:11 -07:00
Shawn Pearce d01fe32795 Skip main thread test in ThreadSafeProgressMonitor
update(int) is only invoked from a worker thread; in JGit's case
this is DeltaTask. The Javadoc of TSPM suggests update should only
ever be used by a worker thread.

Skip the main thread check, saving some cycles on each run of the
progress monitor.

Change-Id: I6cb9382d71b4cb3f8e8981c7ac382da25304dfcb
2013-04-10 12:59:11 -07:00
Shawn Pearce 66192817cd Declare members of PackOutputStream final
These methods cannot be sanely overridden anywhere. Most methods
are package visible only, or are private. A few public methods do
exist but there is no useful way to override them since creation
of PackOutputStream is managed by PackWriter and cannot be delegated.

Change-Id: I12cd3326b78d497c1f9751014d04d1460b46e0b0
2013-04-10 12:59:09 -07:00
Shawn Pearce 2be6927d8e Always allocate the PackOutputStream copyBuffer
The getCopyBuffer() is almost always used during output. All known
implementations of ObjectReuseAsIs rely on the buffer to be present,
and the only sane way to get good performance from PackWriter is to
reuse objects during packing.

Avoid a branch and test when obtaining this buffer by making sure
it is always populated.

Change-Id: I200baa0bde5dcdd11bab7787291ad64535c9f7fb
2013-04-10 12:58:51 -07:00
Shawn Pearce eb17495ca4 Disable CRC32 computation when no PackIndex will be created
If a server is streaming 3GiB worth of pack data to a client there
is no reason to compute the CRC32 checksum on the objects. The
CRC32 code computed by PackWriter is used only in the new index
created by writeIndex(), which is never invoked for the native Git
network protocols.

Object reuse may still compute its own CRC32 to verify the data
being copied from an existing pack has not been corrupted. This
check is done by the ObjectReader that implements ObjectReuseAsIs
and has no relationship to the CRC32 being skipped during output.

Change-Id: I05626f2e0d6ce19119b57d8a27193922636d60a7
2013-04-10 12:58:50 -07:00
Shawn Pearce d0a5337625 Steal work from delta threads to rebalance CPU load
If the configuration wants to run 4 threads the delta search work
is initially split somewhat evenly across the 4 threads. During
execution some threads will finish early due to the work not being
split fairly, as the initial partitions were based on object count
and not cost to inflate or size of DeltaIndex.

When a thread finishes early it now tries to take 50% of the work
remaining on a sibling thread, and executes that before exiting.
This repeats as each thread completes until a thread has only 1
object remaining.
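
The halving step might look roughly like the following. This is a single-threaded sketch with hypothetical names (WorkSlice, stealHalf); real stealing needs synchronization between the victim and the thief, which is left out here:

```java
final class WorkSlice {
    int cur; // next index the owning thread will process
    int end; // exclusive end of the owner's remaining work

    WorkSlice(int cur, int end) {
        this.cur = cur;
        this.end = end;
    }

    int remaining() {
        return end - cur;
    }

    /** Gives away the upper half of the remaining work, or null if too little is left. */
    WorkSlice stealHalf() {
        if (remaining() <= 1)
            return null;
        int mid = cur + remaining() / 2;
        WorkSlice stolen = new WorkSlice(mid, end);
        end = mid; // the owner keeps the lower half
        return stolen;
    }
}
```

A slice covering [0, 10) gives up [5, 10) and keeps [0, 5); a slice with one object left gives up nothing.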

Repacking Blink, Chromium's new fork of WebKit (2.2M objects 3.9G):

  [pack]
    reuseDeltas = false
    reuseObjects = false
    depth = 50
    threads = 8
    window = 250
    windowMemory = 800m

  before: ~105% CPU after 80%
  after:  >780% CPU to 100%

Change-Id: I65e45422edd96778aba4b6e5a0fd489ea48e8ca3
2013-04-10 11:34:50 -07:00
Robin Rosenberg 1bede91db2 Consider working tree changes when stashing newly added files
Bug: 402396
Change-Id: I50ff707c0c9abcab3f98eea21aaa6e824f7af63a
2013-04-09 21:28:15 +02:00
Matthias Sohn 135a78cfcb Remove unused dependencies
Change-Id: I3cd161ac360a2e2635bffe309725a41c9527694e
Signed-off-by: Matthias Sohn <matthias.sohn@sap.com>
2013-04-09 16:57:54 +02:00
Christian Halstrick 266ec24d49 Merge "clean up merge squash and no-commit messages in pgm" 2013-04-09 03:09:33 -04:00
Robin Rosenberg 0a824f5996 Add a constant for info/exclude
Change-Id: Ifd537ce4e726cb9460ea332f683428689bd3d7f4
2013-04-08 18:08:23 -04:00
Matthias Sohn 0182e8152e Merge changes I8445070d,I38f10d62,I2af0bf68
* changes:
  Fix plugin provider names to conform with release train requirement
  Add missing @since tags for new API methods
  DfsReaderOptions are options for a DFS stored repository
2013-04-08 17:25:01 -04:00
Matthias Sohn 011f7fd27d Fix plugin provider names to conform with release train requirement
According to release train requirements [1] the provider name for all
artifacts of Eclipse projects is "Eclipse <project name>".

[1] http://wiki.eclipse.org/Development_Resources/HOWTO/Release_Reviews#Checklist

Change-Id: I8445070d1d96896d378bfc49ed062a5e7e0f201f
Signed-off-by: Matthias Sohn <matthias.sohn@sap.com>
2013-04-08 23:05:36 +02:00
Robin Rosenberg cc00feaa8d A deleted work tree file is not a conflict when merge wants to delete it
Bug: 405199
Change-Id: I4b2ef3dc432d2fad8a6fabd1c8aec407b5c8c5ac
Signed-off-by: Robin Rosenberg <robin.rosenberg@dewire.com>
2013-04-08 22:43:39 +02:00
Tomasz Zarna b42b50fdf5 clean up merge squash and no-commit messages in pgm
Change-Id: Iffa6e8752fbd94f3ef69f49df772be82e3da5d05
2013-04-08 09:51:00 -04:00
Robin Rosenberg 59baf9148e Detect and handle a checkout conflict during merge nicely
Report the conflicting files nicely and inform the user.

Change-Id: I75d464d4156d10c6cc6c7ce5a321e2c9fb0df375
2013-04-08 05:48:09 -04:00
Matthias Sohn 2f93551e18 Add missing @since tags for new API methods
Change-Id: I38f10d622c30f19d1154a4901477e844cb411707
Signed-off-by: Matthias Sohn <matthias.sohn@sap.com>
2013-04-07 23:22:48 +02:00
Gustav Karlsson b3e9626743 Added characters to be escaped in file name patterns
Originally, characters could not be escaped in FileNameMatcher patterns.
This breaks file name matching when escaped brackets "\[" and "\]" are
used in the pattern. A fix has been implemented to allow for any
character to be escaped by prefixing it with a '\'.

Bug: 340715
Change-Id: Ie46fd211931fa09ef3a6a712bd1da3d7fb64c5e3
Signed-off-by: Gustav Karlsson <gustav.karlsson@tieto.com>
2013-04-06 18:23:33 +02:00
Matthias Sohn 41cba241d8 DfsReaderOptions are options for a DFS stored repository
Change-Id: I2af0bf686188f1402fb53bf6dbe0ecb228069ace
Signed-off-by: Matthias Sohn <matthias.sohn@sap.com>
2013-04-06 01:28:20 +02:00
Shawn Pearce 5d446f410d Support cutting existing delta chains longer than the max depth
Some packs built by JGit have incredibly long delta chains due to a
long standing bug in PackWriter. Google has packs created by JGit's
DfsGarbageCollector with chains of 6000 objects long, or more.

Inflating objects at the end of this 6000 long chain is impossible
to complete within a reasonable time bound. It could take a beefy
system hours to perform even using the heavily optimized native C
implementation of Git, let alone with JGit.

Enable pack.cutDeltaChains to be set in a configuration file to
permit the PackWriter to determine the length of each delta chain
and clip the chain at arbitrary points to fit within pack.depth.

Delta chain cycles are still possible, but no attempt is made to
detect them. A trivial chain of A->B->A will iterate for the full
pack.depth configured limit (e.g. 50) and then pick an object to
store as non-delta.

When cutting chains the object list is walked in reverse to try
and take advantage of existing chain computations. The assumption
here is most deltas are near the end of the list, and their bases
are near the front of the list. Going up from the tail attempts to
reuse chainLength computations by relying on the memoized value in
the delta base.

The chainLength field in ObjectToPack is overloaded into the depth
field normally used by DeltaWindow. This is acceptable because the
chain cut happens before delta search, and the chainLength is reset
to 0 if delta search will follow.
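
The reverse walk with memoized depths can be sketched as below. This is an illustration under stated assumptions, not the PackWriter code: PackedObject and ChainCutter are hypothetical names, depth is kept in its own field rather than overloaded, and the input is assumed to contain no delta cycles (a cycle would make this sketch loop forever):

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class PackedObject {
    PackedObject base; // delta base, or null when stored whole
    int depth = -1;    // memoized hops to a whole object; -1 means unknown
}

final class ChainCutter {
    /** Clips every chain in list so that no object is deeper than maxDepth. */
    static void cut(PackedObject[] list, int maxDepth) {
        Deque<PackedObject> path = new ArrayDeque<>();
        for (int i = list.length - 1; i >= 0; i--) {
            PackedObject o = list[i];
            // Walk toward the base until a memoized depth or a whole object.
            while (o != null && o.depth < 0) {
                path.push(o);
                o = o.base;
            }
            int depth = (o == null) ? -1 : o.depth;
            // Unwind from the base outward, assigning depths and clipping.
            while (!path.isEmpty()) {
                PackedObject p = path.pop();
                if (++depth > maxDepth) {
                    p.base = null; // store whole, starting a new chain
                    depth = 0;
                }
                p.depth = depth;
            }
        }
    }
}
```

Because the depth is stored on each object, a later entry that shares a base does not walk the full chain again.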

Change-Id: Ida4fde9558f3abbbb77ade398d2af3941de9c812
2013-04-05 10:07:14 -07:00
Shawn Pearce 01a0699acc Micro-optimize reuseDeltaFor in PackWriter
This switch is called mostly for OBJ_TREE and OBJ_BLOB types, which
typically make up 66% of the objects in a repository. Simplify the
test for these common types by testing for the one bit they have in
common and returning early.

Object type 5 is currently undefined. In the old code it would hit
the default and return true. In the new code it will match the early
case and also return true. In either implementation 5 should never show
up as it is not a valid type known to Git.

Object type 6 OFS_DELTA is not permitted to be supplied here.
Object type 7 REF_DELTA is not permitted to be supplied here.
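
OBJ_TREE is type 2 (binary 010) and OBJ_BLOB is type 3 (binary 011), so the bit they share is the one with value 2. A sketch of the early test follows; the handling of commits and tags, the reuseDeltaCommits parameter and the class name are assumptions made for illustration and are not taken from this commit:

```java
final class ReuseDeltaSketch {
    static final int OBJ_COMMIT = 1;
    static final int OBJ_TREE = 2;
    static final int OBJ_BLOB = 3;
    static final int OBJ_TAG = 4;

    static boolean reuseDeltaFor(int type, boolean reuseDeltaCommits) {
        // Common case first: one bit test covers both trees and blobs.
        if ((type & 2) != 0)
            return true;
        // Assumed policy for the rarer types, for illustration only.
        if (type == OBJ_COMMIT)
            return reuseDeltaCommits;
        if (type == OBJ_TAG)
            return false;
        return true;
    }
}
```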

Change-Id: I0ede8acee928bb3e73c744450863942064864e9c
2013-04-05 09:43:02 -07:00
Shawn Pearce 8e83c36e27 Static import OBJ_* constants into PackWriter
Shortens most of the code that touches the objectLists.

Change-Id: Ib14d366dd311e544e7ba50e9ce07a6f3ce0cf254
2013-04-05 09:43:02 -07:00
Shawn Pearce 6a5019f539 Renumber internal ObjectToPack flags
Now that WANT_WRITE is gone renumber the flags to move the unused
bit next to the type. Recluster AS_IS and DELTA_ATTEMPTED to be
next to each other since these bits are tested as a pair.

Change-Id: I42994b5ff1f67435e15c3f06d02e3b82141e8f08
2013-04-04 19:44:41 -07:00
Shawn Pearce 241eed844d Move wantWrite flag to be special offset 1
Free up the WANT_WRITE flag in ObjectToPack by switching the test
to use the special offset value of 1. The Git pack file format
calls for the first 4 bytes to be 'PACK', which means any object
must start at an offset >= 4. Current versions require another 8
bytes in the header, placing the first object at offset = 12.

So offset = 1 is an invalid location for an object, and can be
used as a marker signal to indicate the writing loop has tried
to write the object, but recursed into the base first. When an
object is visited with offset == 1 it means there is a cycle in
the delta base path, and the cycle must be broken.
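
A sketch of how the marker could drive cycle breaking in a recursive write loop. PendingObject and CycleBreakingWriter are hypothetical names, the size field stands in for the bytes actually written, and breaking the cycle by storing the revisited object whole is an assumption of this sketch:

```java
final class PendingObject {
    PendingObject base; // delta base, or null when stored whole
    long offset;        // 0 = not visited, 1 = visiting, >= 12 = written
    long size;          // bytes this object occupies in the pack

    PendingObject(long size) {
        this.size = size;
    }
}

final class CycleBreakingWriter {
    private long pos = 12; // first object follows the 12 byte pack header

    void write(PendingObject o) {
        if (o.offset > 1)
            return; // already in the pack
        if (o.base != null) {
            if (o.offset == 1) {
                // Reached again while writing its own base chain: a cycle.
                o.base = null; // break it by storing this object whole
            } else {
                o.offset = 1; // marker: recursing into the base first
                write(o.base);
                if (o.offset > 1)
                    return; // written while the cycle was being broken
            }
        }
        o.offset = pos;
        pos += o.size;
    }
}
```

For a cycle A->B->A, writing A marks A, recurses into B, and B recurses back into A. A is found with offset 1, loses its base and is written at offset 12; B then follows as a delta against A.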

Change-Id: I2d05b9017c5f9bd9464b91d43e8d4b4a085e55bc
2013-04-04 17:53:01 -07:00