11f99fecfd
The hash code returned by RawTextComparator (or that is used by the SimilarityIndex) plays an important role in the speed of any algorithm that is based upon it. The lower the number of collisions produced by the hash function, the shorter the hash chains within hash tables will be, and the less likely we are to fall into O(N^2) runtime behaviors for algorithms like PatienceDiff.

Our prior hash function was absolutely horrid, so replace it with the proper definition of the DJB hash that was originally published by Professor Daniel J. Bernstein.

To support this assertion, below is a table listing the maximum number of collisions that result when hashing the unique lines in each source code file of 3 randomly chosen projects:

  test_jgit: 931 files; 122 avg. unique lines/file
    Algorithm    | Collisions
    -------------+-----------
    prior_hash          418
    djb                   5
    sha1                  6
    string_hash31        11

  test_linux26: 30198 files; 258 avg. unique lines/file
    Algorithm    | Collisions
    -------------+-----------
    prior_hash         8675
    djb                  32
    sha1                  8
    string_hash31        32

  test_frameworks_base: 8381 files; 184 avg. unique lines/file
    Algorithm    | Collisions
    -------------+-----------
    prior_hash         4615
    djb                  10
    sha1                  6
    string_hash31        13

We can clearly see that prior_hash performed very poorly, resulting in 8,675 collisions (elements in the same hash bucket) for at least one file in the Linux kernel repository. This leads to some very bad O(N) style insertion and lookup performance, even though the hash table was sized to be the next power-of-2 larger than the total number of unique lines in the file.

The djb hash we are replacing prior_hash with performs closer to SHA-1 in terms of having very few collisions. This indicates it provides a reasonably distributed output for this type of input, despite being a much simpler algorithm (and therefore will be much faster to execute).

The string_hash31 function is provided just to compare results with; it is the algorithm commonly used by java.lang.String hashCode().
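For readers unfamiliar with the DJB hash, the following is a minimal sketch of its commonly published form (start at 5381, multiply by 33 and add each byte), applied to one line of a byte buffer. The class name DjbHashSketch and the hashRegion signature are assumptions made for illustration; this is not necessarily the code introduced by this commit.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; names are hypothetical, not taken from JGit.
final class DjbHashSketch {
    // Hash the bytes raw[ptr..end) using the classic DJB recurrence:
    // hash = hash * 33 + byte, seeded with 5381.
    static int hashRegion(byte[] raw, int ptr, int end) {
        int hash = 5381;
        for (; ptr < end; ptr++) {
            // (hash << 5) + hash is hash * 33; mask keeps the byte unsigned.
            hash = ((hash << 5) + hash) + (raw[ptr] & 0xff);
        }
        return hash;
    }

    public static void main(String[] args) {
        byte[] line = "int x = 0;".getBytes(StandardCharsets.UTF_8);
        System.out.println(Integer.toHexString(hashRegion(line, 0, line.length)));
    }
}
```

Integer overflow is intentional here: Java int arithmetic wraps, so the result is simply the low 32 bits of the running sum.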
However, life isn't quite this simple.

djb produces a 32 bit hash code, but our hash tables are always smaller than 2^32 buckets. Mashing the 32 bit code into an array index used to be done by simply taking the lower bits of the hash code with a bitwise AND operator. This unfortunately still produces many collisions, e.g. 32 on the linux-2.6 repository files.

From [1] we can apply a final "cleanup" step to the hash code to mix the bits together a little better, and give priority to the higher order bits as they include data from more bytes of input:

  test_jgit: 931 files; 122 avg. unique lines/file
    Algorithm    | Collisions
    -------------+-----------
    prior_hash          418
    djb                   5
    djb + cleanup         6

  test_linux26: 30198 files; 258 avg. unique lines/file
    Algorithm    | Collisions
    -------------+-----------
    prior_hash         8675
    djb                  32
    djb + cleanup         7

  test_frameworks_base: 8381 files; 184 avg. unique lines/file
    Algorithm    | Collisions
    -------------+-----------
    prior_hash         4615
    djb                  10
    djb + cleanup         7

This is a massive improvement, as the number of collisions for common inputs drops to acceptable levels, and we haven't really made the hash functions any more complex than they were before.

[1] http://lkml.org/lkml/2009/10/27/404

Change-Id: Ia753b695de9526a157ddba265824240bd05dead1
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
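The difference between the two ways of turning a 32 bit hash code into a bucket index can be sketched as follows. The message does not spell out the exact cleanup step, so the multiplier 0x9e370001 (a golden-ratio prime used for hashing in the Linux kernel) and the names HashIndexSketch, maskIndex and mixedIndex are assumptions chosen for illustration, not the confirmed implementation.

```java
// Illustrative sketch only; the constant and all names are assumptions.
final class HashIndexSketch {
    // Old approach: keep only the low 'bits' bits of the hash code.
    // Expects 1 <= bits <= 31.
    static int maskIndex(int hash, int bits) {
        return hash & ((1 << bits) - 1);
    }

    // "Cleanup" approach: multiply to mix the bits, then keep the
    // high 'bits' bits, which depend on more of the input.
    // Expects 1 <= bits <= 31.
    static int mixedIndex(int hash, int bits) {
        return (hash * 0x9e370001) >>> (32 - bits);
    }

    public static void main(String[] args) {
        int bits = 8; // a table with 256 buckets
        // These two codes differ only above the low 8 bits.
        int a = 0x100;
        int b = 0x200;
        System.out.println(maskIndex(a, bits) + " " + maskIndex(b, bits));
        System.out.println(mixedIndex(a, bits) + " " + mixedIndex(b, bits));
    }
}
```

With the plain mask, hash codes that differ only in their upper bits always share a bucket; after mixing, those upper bits influence the index.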
org.eclipse.jgit
org.eclipse.jgit.console
org.eclipse.jgit.http.server
org.eclipse.jgit.http.test
org.eclipse.jgit.iplog
org.eclipse.jgit.junit
org.eclipse.jgit.packaging
org.eclipse.jgit.pgm
org.eclipse.jgit.test
org.eclipse.jgit.ui
tools
.eclipse_iplog
.gitattributes
LICENSE
README
SUBMITTING_PATCHES
pom.xml
README
== Java GIT ==

This package is licensed under the BSD.

  org.eclipse.jgit/

    A pure Java library capable of being run standalone, with no
    additional support libraries. Some JUnit tests are provided to
    exercise the library. The library provides functions to read and
    write a GIT formatted repository.

    All portions of jgit are covered by the BSD. Absolutely no GPL,
    LGPL or EPL contributions are accepted within this package.

  org.eclipse.jgit.test/

    Unit tests for org.eclipse.jgit and the same licensing rules.

== WARNINGS / CAVEATS ==

- Symbolic links are not supported because java does not support
  them. Such links could be damaged.

- Only the timestamp of the index is used by jgit to check if the
  index is dirty.

- Don't try the library with a JDK other than 1.6 (Java 6) unless you
  are prepared to investigate problems yourself. JDK 1.5.0_11 and
  later Java 5 versions *may* work. Earlier versions do not. JDK 1.4
  is *not* supported. Apple's Java 1.5.0_07 is reported to work
  acceptably. We have no information about other vendors. Please
  report your findings if you try.

- CRLF conversion is never performed. On Windows you should therefore
  make sure your projects and workspaces are configured to save files
  with Unix (LF) line endings.

== Package Features ==

  org.eclipse.jgit/

    * Read loose and packed commits, trees, blobs, including
      deltafied objects.

    * Read objects from shared repositories

    * Write loose commits, trees, blobs.

    * Write blobs from local files or Java InputStreams.

    * Read blobs as Java InputStreams.

    * Copy trees to local directory, or local directory to a tree.

    * Lazily loads objects as necessary.

    * Read and write .git/config files.

    * Create a new repository.

    * Read and write refs, including walking through symrefs.

    * Read, update and write the Git index.

    * Checkout in dirty working directory if trivial.

    * Walk the history from a given set of commits looking for commits
      introducing changes in files under a specified path.

    * Object transport
      Fetch via ssh, git, http, Amazon S3 and bundles.
      Push via ssh, git and Amazon S3. JGit does not yet deltify
      the pushed packs so they may be a lot larger than C Git packs.

  org.eclipse.jgit.pgm/

    * Assorted set of command line utilities. Mostly for ad-hoc testing
      of jgit log, glog, fetch etc.

== Missing Features ==

There are a lot of missing features. You need the real Git for this.
For some operations it may just be the preferred solution also. There
is not just a command line; there is e.g. git-gui, which makes
committing partial files simple.

- Merging.

- Repacking.

- Generate a GIT format patch.

- Apply a GIT format patch.

- Documentation. :-)

- gitattributes support
  In particular CRLF conversion is not implemented. Files are treated
  as byte sequences.

- submodule support
  Submodules are not supported or even recognized.

== Support ==

Post questions, comments or patches to the git@vger.kernel.org mailing
list.

== Contributing ==

See SUBMITTING_PATCHES in this directory. However, feedback and bug
reports are also contributions.

== About GIT ==

More information about GIT, its repository format, and the canonical
C based implementation can be obtained from the GIT websites:

  http://git.or.cz/
  http://www.kernel.org/pub/software/scm/git/
  http://www.kernel.org/pub/software/scm/git/docs/