motiejus/jgit - jgit - gitea: Gitea Service

motiejus

jgit

Go to file

Shawn O. Pearce 11f99fecfd Reduce content hash function collisions The hash code returned by RawTextComparator (or that is used by the SimilarityIndex) play an important role in the speed of any algorithm that is based upon them. The lower the number of collisions produced by the hash function, the shorter the hash chains within hash tables will be, and the less likely we are to fall into O(N^2) runtime behaviors for algorithms like PatienceDiff. Our prior hash function was absolutely horrid, so replace it with the proper definition of the DJB hash that was originally published by Professor Daniel J. Bernstein. To support this assertion, below is a table listing the maximum number of collisions that result when hashing the unique lines in each source code file of 3 randomly chosen projects: test_jgit: 931 files; 122 avg. unique lines/file Algorithm \| Collisions -------------+----------- prior_hash 418 djb 5 sha1 6 string_hash31 11 test_linux26: 30198 files; 258 avg. unique lines/file Algorithm \| Collisions -------------+----------- prior_hash 8675 djb 32 sha1 8 string_hash31 32 test_frameworks_base: 8381 files; 184 avg. unique lines/file Algorithm \| Collisions -------------+----------- prior_hash 4615 djb 10 sha1 6 string_hash31 13 We can clearly see that prior_hash performed very poorly, resulting in 8,675 collisions (elements in the same hash bucket) for at least one file in the Linux kernel repository. This leads to some very bad O(N) style insertion and lookup performance, even though the hash table was sized to be the next power-of-2 larger than the total number of unique lines in the file. The djb hash we are replacing prior_hash with performs closer to SHA-1 in terms of having very few collisions. This indicates it provides a reasonably distributed output for this type of input, despite being a much simpler algorithm (and therefore will be much faster to execute). The string_hash31 function is provided just to compare results with, it is the algorithm commonly used by java.lang.String hashCode(). However, life isn't quite this simple. djb produces a 32 bit hash code, but our hash tables are always smaller than 2^32 buckets. Mashing the 32 bit code into an array index used to be done by simply taking the lower bits of the hash code by a bitwise and operator. This unfortuntely still produces many collisions, e.g. 32 on the linux-2.6 repository files. From [1] we can apply a final "cleanup" step to the hash code to mix the bits together a little better, and give priority to the higher order bits as they include data from more bytes of input: test_jgit: 931 files; 122 avg. unique lines/file Algorithm \| Collisions -------------+----------- prior_hash 418 djb 5 djb + cleanup 6 test_linux26: 30198 files; 258 avg. unique lines/file Algorithm \| Collisions -------------+----------- prior_hash 8675 djb 32 djb + cleanup 7 test_frameworks_base: 8381 files; 184 avg. unique lines/file Algorithm \| Collisions -------------+----------- prior_hash 4615 djb 10 djb + cleanup 7 This is a massive improvement, as the number of collisions for common inputs drops to acceptable levels, and we haven't really made the hash functions any more complex than they were before. [1] http://lkml.org/lkml/2009/10/27/404 Change-Id: Ia753b695de9526a157ddba265824240bd05dead1 Signed-off-by: Shawn O. Pearce <spearce@spearce.org>		2010-09-23 17:44:53 -07:00
org.eclipse.jgit	Reduce content hash function collisions	2010-09-23 17:44:53 -07:00
org.eclipse.jgit.console	Qualify builds as 0.10.0	2010-09-16 17:26:53 -07:00
org.eclipse.jgit.http.server	Qualify builds as 0.10.0	2010-09-16 17:26:53 -07:00
org.eclipse.jgit.http.test	Qualify builds as 0.10.0	2010-09-16 17:26:53 -07:00
org.eclipse.jgit.iplog	Merge "Define DiffAlgorithm as an abstract function"	2010-09-17 15:08:27 -04:00
org.eclipse.jgit.junit	Qualify builds as 0.10.0	2010-09-16 17:26:53 -07:00
org.eclipse.jgit.packaging	Fix dependency to jgit source bundle	2010-09-17 22:01:40 +02:00
org.eclipse.jgit.pgm	Qualify builds as 0.10.0	2010-09-16 17:26:53 -07:00
org.eclipse.jgit.test	Fix PatienceDiffTest	2010-09-23 14:45:33 -07:00
org.eclipse.jgit.ui	Qualify builds as 0.10.0	2010-09-16 17:26:53 -07:00
tools	Clean up LICENSE file	2010-07-02 14:52:49 -07:00
.eclipse_iplog	Update .eclipse_iplog for 0.9	2010-09-08 23:17:54 +02:00
.gitattributes	Initial JGit contribution to eclipse.org	2009-09-29 16:47:03 -07:00
LICENSE	Clean up LICENSE file	2010-07-02 14:52:49 -07:00
README	Initial JGit contribution to eclipse.org	2009-09-29 16:47:03 -07:00
SUBMITTING_PATCHES	Correcting explanation of EDL	2009-10-28 14:12:07 +01:00
pom.xml	Qualify builds as 0.10.0	2010-09-16 17:26:53 -07:00

README

            == Java GIT ==

This package is licensed under the BSD.

  org.eclipse.jgit/

    A pure Java library capable of being run standalone, with no
    additional support libraries.  Some JUnit tests are provided
    to exercise the library.  The library provides functions to
    read and write a GIT formatted repository.

    All portions of jgit are covered by the BSD.  Absolutely no GPL,
    LGPL or EPL contributions are accepted within this package.

  org.eclipse.jgit.test/
    Unit tests for org.eclipse.jgit and the same licensing rules.

            == WARNINGS / CAVEATS              ==

- Symbolic links are not supported because java does not support it.
  Such links could be damaged.

- Only the timestamp of the index is used by jgit check if  the index
  is dirty.

- Don't try the library with a JDK other than 1.6 (Java 6) unless you
  are prepared to investigate problems yourself. JDK 1.5.0_11 and later
  Java 5 versions *may* work. Earlier versions do not. JDK 1.4 is *not*
  supported. Apple's Java 1.5.0_07 is reported to work acceptably. We
  have no information about other vendors. Please report your findings
  if you try.

- CRLF conversion is never performed. On Windows you should thereforc
  make sure your projects and workspaces are configured to save files
  with Unix (LF) line endings.

            == Package Features                ==

  org.eclipse.jgit/

    * Read loose and packed commits, trees, blobs, including
      deltafied objects.

    * Read objects from shared repositories

    * Write loose commits, trees, blobs.

    * Write blobs from local files or Java InputStreams.

    * Read blobs as Java InputStreams.

    * Copy trees to local directory, or local directory to a tree.

    * Lazily loads objects as necessary.

    * Read and write .git/config files.

    * Create a new repository.

    * Read and write refs, including walking through symrefs.

    * Read, update and write the Git index.

    * Checkout in dirty working directory if trivial.

    * Walk the history from a given set of commits looking for commits
      introducing changes in files under a specified path.

    * Object transport
      Fetch via ssh, git, http, Amazon S3 and bundles.
      Push via ssh, git and Amazon S3. JGit does not yet deltify
      the pushed packs so they may be a lot larger than C Git packs.

  org.eclipse.jgit.pgm/

    * Assorted set of command line utilities. Mostly for ad-hoc testing of jgit
      log, glog, fetch etc.

            == Missing Features                ==

There are a lot of missing features. You need the real Git for this.
For some operations it may just be the preferred solution also. There
are not just a command line, there is e.g. git-gui that makes committing
partial files simple.

- Merging. 

- Repacking.

- Generate a GIT format patch.

- Apply a GIT format patch.

- Documentation. :-)

- gitattributes support
  In particular CRLF conversion is not implemented. Files are treated
  as byte sequences.

- submodule support
  Submodules are not supported or even recognized.

            == Support                         ==

  Post question, comments or patches to the git@vger.kernel.org mailing list.


            == Contributing                    ==

  See SUBMITTING_PATCHES in this directory. However, feedback and bug reports
  are also contributions.


            == About GIT                       ==

More information about GIT, its repository format, and the canonical
C based implementation can be obtained from the GIT websites:

  http://git.or.cz/
  http://www.kernel.org/pub/software/scm/git/
  http://www.kernel.org/pub/software/scm/git/docs/