Compare commits

809 Commits

Author SHA1 Message Date
238af6eb4a add README note about breakage 2023-08-22 17:59:51 +03:00
fbff819ece make tests run again 2023-08-22 15:38:55 +03:00
993a29d2f8 zig 0.11 2023-08-21 16:40:19 +03:00
37e24524c2 Add 'deps/cmph/' from commit 'a250982ade093f4eed0552bbdd22dd7b0432007f'
git-subtree-dir: deps/cmph
git-subtree-mainline: 5040f4007b
git-subtree-split: a250982ade
2023-08-21 13:50:16 +03:00
5040f4007b replacing cmph with subtree 2023-08-21 13:49:41 +03:00
Motiejus Jakštys
ff0f5bca77 bump to zig 0.11.0-dev.3735+a72d634b7 2023-06-20 13:01:42 +03:00
Motiejus Jakštys
f723d48fe2 remove hardcoded cmph/config.h 2023-06-20 12:51:10 +03:00
Motiejus Jakštys
a447e7fdf4 safety checks in group parsing 2023-06-06 21:13:12 +03:00
Motiejus Jakštys
e1cae43d08 more safety checks in user parsing 2023-06-06 21:13:12 +03:00
Motiejus Jakštys
277a48296a compress: handle overflows in varints
This handles corrupt data explicitly.
2023-06-06 21:13:12 +03:00
Motiejus Jakštys
d6150734f1 syntax cosmetics 2023-06-06 21:13:12 +03:00
Motiejus Jakštys
65914ddcd6 libnss: fail early 2023-06-06 19:27:32 +03:00
Motiejus Jakštys
325c01b341 fix one TODO 2023-06-06 19:12:46 +03:00
Motiejus Jakštys
4962c7286b add Group Members Iterator 2023-06-06 19:05:14 +03:00
Motiejus Jakštys
ef4062edb7 doc nitpicks 2023-06-06 18:55:36 +03:00
Motiejus Jakštys
617c256863 zig update: sort 2023-06-06 18:43:54 +03:00
Motiejus Jakštys
6d957967ed multi_array_list: use slice() when possible 2023-05-19 16:12:33 +03:00
Motiejus Jakštys
82ab622c66 close a TODO 2023-05-19 16:04:21 +03:00
Motiejus Jakštys
f5b50068b7 zig compatibility 2023-05-19 15:40:27 +03:00
Motiejus Jakštys
4785ed029f restructure README a bit 2023-04-13 12:12:40 +03:00
Motiejus Jakštys
65bd0c9f9d update README 2023-04-13 12:09:03 +03:00
Motiejus Jakštys
d7ca840de8 styling some for loops 2023-04-13 11:26:26 +03:00
Motiejus Jakštys
c9bb95f395 avg-members -> max-members
that's what it really means.
2023-04-13 11:19:45 +03:00
Motiejus Jakštys
58c4d1e78f README sections 2023-04-13 11:11:39 +03:00
Motiejus Jakštys
6027508e60 update README 2023-04-13 11:10:47 +03:00
Motiejus Jakštys
abeb25f3e2 bump to zig 0.11.0-dev.2560+602029bb2 2023-04-12 22:24:20 +03:00
Motiejus Jakštys
f55fb9f86a keep_sigpipe
we want the unix-y sigpipe handler.
2023-02-24 15:22:40 +02:00
Motiejus Jakštys
8d6e8aa4bd upgrade to multi-for syntax 2023-02-24 15:18:33 +02:00
Motiejus Jakštys
127d44e375 some stage2 cleanups 2023-02-09 17:08:21 +02:00
Motiejus Jakštys
5d4f17e6bf re-license to apache2 2023-02-09 10:22:34 +02:00
Motiejus Jakštys
345c38eb61 stage2: update README 2023-02-08 16:40:39 +02:00
Motiejus Jakštys
000080a781 replace padding functions with ones from std.mem 2023-02-08 16:33:15 +02:00
Motiejus Jakštys
0ecd6172fc stage2 2023-02-08 16:13:03 +02:00
Motiejus Jakštys
5d3dfdc8dc architecture.md: fix the header size 2023-01-03 13:44:59 +02:00
Motiejus Jakštys
4493b4408c update README 2022-11-30 12:04:12 +02:00
Motiejus Jakštys
312e510eff add lto and fpic
learning about linkers. Thanks, Drepper
2022-11-30 11:56:02 +02:00
Motiejus Jakštys
0df7d8b722 DSO: reduce visibility of bdz lib 2022-11-25 14:50:42 +02:00
Motiejus Jakštys
422c264df9 zig v0.10 compatibility 2022-11-20 13:33:05 +02:00
Motiejus Jakštys
ff814a474b add compress-debug-sections to turbonss-makecorpus 2022-08-23 15:50:00 +03:00
Motiejus Jakštys
292c87a597 Merge branch 'compress-debug-sections' 2022-08-23 15:49:33 +03:00
8212f3f51a clarify compiler requirements 2022-08-21 06:58:05 +03:00
4d4c8a5be1 analyze command 2022-08-21 06:10:47 +03:00
fbd449b21f Move docs around; finish it 2022-08-21 06:08:21 +03:00
8bfc4a30cd wip turbonss-unix2systemd 2022-08-20 19:08:04 +03:00
ef436294e9 wip compress-debug-sections 2022-07-20 14:30:48 +03:00
Davi de Castro Reis
a250982ade Add docs directory for github. 2018-12-28 23:53:52 -02:00
Davi de Castro Reis
815d089f34 Set theme jekyll-theme-cayman 2018-12-28 23:44:49 -02:00
Davi de Castro Reis
e5f83da75b Minor version bump. 2018-12-28 22:47:25 -02:00
Davi de Castro Reis
bbf77c63c9 Apply some of debian patches. 2018-12-28 01:43:43 -02:00
Davi de Castro Reis
d233b4943f Partially apply https://sourceforge.net/p/cmph/patches/3/ 2018-12-28 00:51:37 -02:00
Davi de Castro Reis
9209046797 Make benchmarks optional and dependent on hopscotch_map. 2018-12-27 23:44:19 -02:00
Davi de Castro Reis
6e9f152f92 Add flat_hash_map to benchmark. 2018-12-27 23:37:06 -02:00
Davi de Castro Reis
3e4c4fa3ff Decrease macro pollution. 2018-12-27 23:37:06 -02:00
Davi de Castro Reis
a69bdded7d Add hopscotch to the benchmark baseline. 2018-12-27 23:37:06 -02:00
Davi de Castro Reis
776ae2cbca Add a swap function. 2018-12-27 23:37:06 -02:00
Davi de Castro Reis
69f81ca7ba Rollback confusing ACLOCAL_AMFLAGS. 2018-12-27 23:37:02 -02:00
Davi Reis
68705bea29 First tentative of minor version bump. 2014-06-06 17:47:37 -03:00
Davi Reis
452486b310 Merge branch 'master' of ssh://git.code.sf.net/p/cmph/git 2014-06-06 17:45:34 -03:00
Davi Reis
dc45060090 Add c++11 m4 check. 2014-06-06 13:44:33 -07:00
Davi Reis
17ef289801 Merge branch 'master' of ssh://git.code.sf.net/p/cmph/git 2014-06-06 17:35:00 -03:00
Davi Reis
84b042137d Simple fix to gendocs. 2014-06-06 17:34:21 -03:00
Davi Reis
8230f935c4 Tentative fix to m4 stuff. 2014-06-06 13:31:14 -07:00
Davi Reis
217c784dda Revert "Fix memory leak problem for bmz8"
This code leaks memory when iterations reaches 0 and previous code was
correct. See https://sourceforge.net/p/cmph/git/merge-requests/4/.
2014-06-06 12:33:04 -03:00
Davi Reis
c09a1f64ea Merge commit 'efe08b8080d0696bf388b21' 2014-06-06 12:00:56 -03:00
Davi Reis
a57b0e966d Style fixes to https://sourceforge.net/p/cmph/git/merge-requests/3/. 2014-06-06 11:52:06 -03:00
Davi Reis
a57fe72c9a Merge commit '2e797796a545748ea815f39113088c701b45653' 2014-06-06 11:49:39 -03:00
Davi Reis
6808fe7cf1 Merge commit 'c85b8f8ecd23039aa9794098735f896c2c56346f' 2014-06-06 11:39:57 -03:00
Davi Reis
b055b8d3cf Merge remote-tracking branch 'sf/githubmaster'
Bring fixes from sourceforge.
2014-06-06 11:34:57 -03:00
Davi Reis
91ad6123ad Small syntatic fixes. 2014-06-06 11:26:18 -03:00
Davi Reis
c113674a89 Dropped inline from Murmur for gcc 4.8 compatibility. 2014-06-06 11:26:18 -03:00
Davi de Castro Reis
4238530db6 Fixed initialization order of test framework. 2014-06-06 11:26:18 -03:00
Davi Reis
e4d6e18de0 A couple test cases. 2014-06-06 11:26:18 -03:00
Huang-Ming Huang
efe08b8080 fixed problem where the FCH algorithm did not dispose the keys using the user supplied function specified in cmph_config_t. 2014-05-23 11:21:34 -05:00
Huang-Ming Huang
a45235f886 Fix memory leak problem for bmz8 2014-05-23 08:55:22 -05:00
Joseph HERLANT
2e797796a5 Correcting potential segfault due to division by 0
This happend when passing a weird file name to cmph in generate
or verbose mode.
To get the crash test script and full report, refer to:
http://bugs.debian.org/715745
2014-03-07 12:12:39 +01:00
Joseph HERLANT
c85b8f8ecd Correcting not escaped minus signs in manpage.
Extract of the explanations why we need a change for this:
By default, "-" chars are interpreted as hyphens (U+2010) by groff, not as minus
signs (U+002D). Since options to programs use minus signs (U+002D), this means
for example in UTF-8 locales that you cannot cut and paste options, nor search
for them easily. The Debian groff package currently forces "-" to be interpreted
as a minus sign due to the number of manual pages with this problem, but this is
a Debian-specific modification and hopefully eventually can be removed.

"-" must be escaped ("\-") to be interpreted as minus.
2014-03-06 18:56:00 +01:00
Fabiano C. Botelho
2cf5c15cf6 This fixes a bug the key_struct_vector_read function.
This has been reported by Rama Krishna Chitta.
2013-04-20 01:44:25 -07:00
Fabiano C. Botelho
9f999ef428 Remaining part of the fix for bug 3465649. This one fixes both BRZ and
CHD_PH for small key sets.
2013-04-20 01:28:00 -07:00
Fabiano C. Botelho
1b2a7cedff Fixing bug 3465649. I still need to make CMPH_BRZ work for small sets. 2013-04-17 01:19:15 -07:00
Fabiano C. Botelho
c1a9eb164e Fixes bug 3482222 filed in the bug tracker. 2013-04-16 00:50:24 -07:00
Fabiano C. Botelho
9e434d41d0 Applying Michael Samuel's patch to fix a bug in jenkins_hash.
This will break compatibility with mphf previously generated.
If someone has functions stored they will need to either change
this function back and keep using the buggy version or regenerate
them.
2013-04-16 00:16:49 -07:00
Fabiano C. Botelho
f9e5adacbd Applying Michael Samuel's patch to avoid buffer overflow while loading
the algorithm name from a dump file.
2013-04-15 23:42:21 -07:00
Davi de Castro Reis
d59aaa88c4 Fixed initialization order of test framework. 2012-08-21 18:07:46 +02:00
Davi Reis
4f3c9003d5 A couple test cases. 2012-06-15 18:09:54 -03:00
Davi de Castro Reis
b8c5b54c9a Added missing file in the distribution. 2012-06-09 03:55:16 -03:00
Davi de Castro Reis
3beff9ee9c spurious comment removed. 2012-06-09 03:12:09 -03:00
Davi de Castro Reis
772cdca462 Less brandhes. 2012-06-09 03:11:22 -03:00
Davi de Castro Reis
7968e08658 Less branches. 2012-06-09 03:11:07 -03:00
Davi de Castro Reis
4fabbd9d25 Less branches. 2012-06-09 02:47:37 -03:00
Davi de Castro Reis
b3e2ef709d More conservative hash functions. 2012-06-09 02:27:39 -03:00
Davi de Castro Reis
c06ad3e25e Fixed erase bug. 2012-06-09 00:58:52 -03:00
Davi de Castro Reis
99ac6744c7 Fixed equality op. 2012-06-08 12:15:01 -03:00
Davi de Castro Reis
6744476198 Minor fixes in command line tool. 2012-06-08 11:37:40 -03:00
Davi de Castro Reis
f5c937d1fc Add c++0x flag to pkgconfig. 2012-06-08 11:25:32 -03:00
Davi de Castro Reis
e57ec31b93 Fixed conditional inclusion of pkg-config for c++. 2012-06-08 11:17:18 -03:00
Davi de Castro Reis
07c9f0c9f8 Updated ebuild file. 2012-06-08 11:08:46 -03:00
Davi de Castro Reis
4d3be28d19 Fixes for make dist. 2012-06-08 11:02:07 -03:00
Davi Reis
fba715aebb Improved integration of check library. Should do the same for benchmarks. 2012-06-05 00:27:17 -03:00
Davi Reis
688c382420 Merge remote-tracking branch 'upstream/master' 2012-06-04 21:04:24 -03:00
Davi Reis
9581e5ad12 Remove dependency on cmph in cxxmph 2012-06-04 21:03:19 -03:00
Davi de Castro Reis
cc42ab3b74 Pre-release comments. 2012-06-03 04:17:14 -03:00
Davi de Castro Reis
0d7a176458 Merge branch 'master' of github.com:bonitao/cmph 2012-06-03 03:35:46 -03:00
Davi Reis
a17b3792d1 Merge branch 'master' of github.com:bonitao/cmph 2012-06-03 03:27:33 -03:00
Davi de Castro Reis
d5b579fbd6 Generalized mph_map for trade-offs. 2012-06-03 03:13:06 -03:00
Davi de Castro Reis
8ebb9da1ab Compiles on gcc-4.7 macos. 2012-06-03 00:03:46 -03:00
Davi de Castro Reis
20a67137d4 Several build fixes. 2012-06-02 23:17:45 -03:00
Davi de Castro Reis
e808f82311 Version bump 2012-06-02 22:29:46 -03:00
Davi Reis
3232485ef0 Merge branch 'master' of github.com:bonitao/cmph 2012-06-02 22:22:02 -03:00
Davi Reis
c95e06eba4 Fix typo 2012-06-02 22:21:08 -03:00
Davi de Castro Reis
c7b77ae329 Improved test support. 2012-06-02 21:47:18 -03:00
Davi de Castro Reis
ec467953bc Added missing file. 2012-06-01 20:39:10 -03:00
Davi de Castro Reis
2cffd02352 Added unit test library check. 2012-06-01 20:06:21 -03:00
Davi Reis
3bad70ec1c Merge branch 'master' of ssh://cmph.git.sourceforge.net/gitroot/cmph/cmph 2012-06-01 18:13:08 -03:00
Davi de Castro Reis
0c13ae5fa4 Compiles with clang. 2012-06-01 17:49:00 -03:00
Davi Reis
99cbe8193a Merged upstream 2012-05-30 21:45:54 -03:00
Davi Reis
1e9c9f5594 Improved readme. 2012-05-30 21:44:13 -03:00
bonitao
d4f70244f3 Initial commit 2012-05-30 17:36:49 -07:00
Davi Reis
59bb0dac72 merge 2012-05-28 01:56:11 -03:00
Davi Reis
2a227cc956 Added a test framework. 2012-05-28 01:45:12 -03:00
Davi Reis
cdc0f5cd98 Forgot. 2012-05-28 01:39:36 -03:00
Davi Reis
7c425203df Fixed configure.ac for ubuntu. 2012-05-09 14:04:34 -03:00
Davi Reis
9d59436461 Make mph_bits compile fast. 2012-05-08 14:22:47 -03:00
Davi de Castro Reis
f8d5fe91f1 Fixed warnings. 2012-04-30 00:55:28 -03:00
Davi Reis
aaa59b7edb Real results. Minimal is slightly slower than STL, perfect is faster, perfect and pof2 even better. 2012-04-22 02:58:04 +02:00
Davi Reis
c432a3b848 Slack search needs to come first. 2012-04-22 02:50:14 +02:00
Davi Reis
8b1d7da028 Investigating benchmark u64 failures. 2012-04-22 02:41:43 +02:00
Davi Reis
334f5592ea Improved benchmark, something broke in bm_map latest cases. 2012-04-22 01:39:05 +02:00
Davi Reis
6afc7cf105 Fastest true incarnation so far. Not much faster than unordered_map. 2012-04-21 21:48:32 +02:00
Davi Reis
c52152bcb4 Fixed inline problem for hollow iterator. 2012-04-15 01:23:44 -03:00
Davi Reis
ea1cb4709e Fixed hollow iterator, but it still breaks inlining. 2012-04-15 01:10:03 -03:00
Davi Reis
48155e5b66 All tests pass. 2012-04-15 00:03:00 -03:00
Davi de Castro Reis
bcf4962604 Fixed inline crazyness. 2012-04-14 17:59:15 -03:00
Davi de Castro Reis
e85d7cc8d9 Improved comments. 2012-04-12 16:36:23 -03:00
Davi Reis
c112b11abe Fixed find, now minimal also beats STL. 2012-03-22 00:58:02 -03:00
Davi Reis
57ce26c5b1 Fixed bug in ranking function. 2012-03-21 22:25:38 -03:00
Davi Reis
9375a15dd4 Added rank function implementation. 2012-03-21 20:19:16 -03:00
Davi Reis
86dccdb466 Merge branch 'master' of ssh://cmph.git.sourceforge.net/gitroot/cmph/cmph 2012-03-21 10:26:42 -03:00
Davi Reis
1bb2d6a4dc Optimized slack_type. 2012-03-21 10:20:30 -03:00
Fabiano C. Botelho
14fda50f8f Merge branch 'master' of ssh://cmph.git.sourceforge.net/gitroot/cmph/cmph 2012-03-20 21:50:46 -07:00
Fabiano C. Botelho
bf98b6eaf1 Adding Nivio as a co-author in the web pages 2012-03-20 21:37:25 -07:00
Davi Reis
b8610f52e1 Some debugging, found that minimal version of mph_map is broken. Need to investigate. 2012-03-20 12:06:30 -03:00
Davi Reis
d4d79c62bd Improved hash signature. 2012-03-20 11:47:55 -03:00
Davi Reis
e760465fca Some comments. 2012-03-19 22:48:11 -03:00
Davi Reis
b47f367db0 Nice and fast. 2012-03-19 03:18:57 -03:00
Davi Reis
b3842c69e8 New bit code works, need to cleanup logging. 2012-03-19 03:10:42 -03:00
Davi Reis
50ac0e2974 Removed cuckoo hash failed attempt. Slower because of extra memory access. 2012-03-16 03:11:39 -03:00
Davi Reis
11d54ea837 Added nice optimization to avoid mod 3. 2012-03-16 02:54:16 -03:00
Davi Reis
2bfe38d2da Merge branch 'master' of ssh://cmph.git.sourceforge.net/gitroot/cmph/cmph 2012-03-15 18:34:19 -03:00
Davi Reis
40c6626d87 Cleanup for upload. 2012-03-15 18:14:39 -03:00
Davi Reis
3c127c7690 First tentative on the perfect hash design. 2012-03-14 23:23:48 -03:00
Davi Reis
7fe9527459 Interesting point, but get_cuckoo_nest is adding a lot and fast path is not that fast for int64. 2012-03-14 21:22:40 -03:00
Davi Reis
e3ccde3ba0 Working, but it sucks. 2012-03-14 18:26:26 -03:00
Davi Reis
b96b71961d struggle 2012-03-14 16:44:16 -03:00
Davi Reis
0335cbe679 struggle 2012-03-14 16:43:38 -03:00
Davi Reis
b63f618204 bit methods need tests. 2012-03-14 12:40:50 -03:00
Davi Reis
9c4bb27dc4 Disabled cuckoo stuff to beat STL again. 2012-03-14 12:07:08 -03:00
Davi Reis
687cc1b194 Added cuckoo stuff, uint64 became slower again. 2012-03-14 11:58:37 -03:00
Davi Reis
a4d96e6cb2 Tests pass, but it segfaults at the benchmark. Need further investigation, but the core for the cuckoo stuff is already there. 2012-03-14 04:51:55 -03:00
Davi Reis
86797b6402 Finally beat STL. Trying improvement around cuckoo hashing idea. 2012-03-14 01:29:13 -03:00
Davi Reis
aa5fa26b49 Strange optimizations for 64 bit integers. 2012-03-13 20:25:06 -03:00
Davi Reis
498884327a Use hash64. 2012-03-13 19:34:41 -03:00
Davi Reis
fd0bc2ae43 Added Murmur3 support. 2012-03-13 19:34:24 -03:00
Davi Reis
bd9efab766 Added Murmur3 support. Not necessarily faster.
Conflicts:

	cxxmph/Makefile.am
2012-03-13 19:34:03 -03:00
Davi Reis
7b8b3e5834 Use hash64. 2012-03-13 19:31:35 -03:00
Davi Reis
ee75d9a620 Reenabled benchmarks. 2012-03-12 01:44:56 -03:00
Davi Reis
9dcf0450f0 Added Murmur3 support. Not necessarily faster. 2012-03-12 01:43:06 -03:00
Davi Reis
09c1af7771 Perfect hash working, but it is slower. 2012-03-12 00:17:08 -03:00
Davi Reis
238e384367 Compiles, still need to fix size tracking. 2012-03-11 23:21:18 -03:00
Davi Reis
c057fb882b Iterator game. 2012-03-07 03:10:29 -05:00
Davi Reis
20aeaf8ee1 Poor hash functions break tests because of small set sizes. 2012-03-07 01:53:19 -05:00
Davi Reis
dbd4856fae Removed unnecessary seed mod which was breaking on presence of poor hash functions. 2012-03-07 01:48:20 -05:00
Davi Reis
b8b0cde5c7 Added miss ratio to benchmark tools. 2012-03-07 01:00:17 -05:00
Davi Reis
7b6c163075 Adding support for miss benchmarks. Need to fix myfind methods. 2012-03-06 18:25:05 -08:00
Davi de Castro Reis
08ff389f61 Merge branch 'master' of ssh://cmph.git.sourceforge.net/gitroot/cmph/cmph 2011-12-26 19:37:40 -02:00
Davi de Castro Reis
24e645febe Aesthetics in C code and replaced some asserts with NULL returns. 2011-12-26 19:35:30 -02:00
Davi de Castro Reis
4e4d36d833 Fixed fread test. 2011-12-26 19:12:24 -02:00
Davi Reis
3ba778f671 Aesthetics, compile on mac with gcc44. 2011-12-09 23:57:37 -02:00
Davi de Castro Reis
91dc5d95d5 Fixed headers. 2011-12-05 16:03:10 -02:00
Davi de Castro Reis
beb77d0e2d Removed tr1 stuff. 2011-11-10 16:44:37 -02:00
Davi de Castro Reis
d4ee76b7bf Small fixes, more comments. 2011-11-05 15:15:11 -02:00
Davi de Castro Reis
d3b3b3dfba Merge branch 'cxxmph' of ssh://cmph.git.sourceforge.net/gitroot/cmph/cmph into cxxmph 2011-11-05 15:06:06 -02:00
Davi Reis
b603173d01 About to merge. 2011-11-05 10:32:47 -02:00
Davi Reis
2a67236e29 Improved c++0x test. 2011-11-05 10:27:24 -02:00
Davi Reis
96862d3113 Fixed license. 2011-11-05 09:51:45 -02:00
Davi de Castro Reis
245a84c75e Fixed include camelcase. 2011-08-03 18:48:28 -03:00
Davi de Castro Reis
85a0d7453a Playing with benchmarks. 2011-06-14 04:59:54 -03:00
Davi de Castro Reis
2c88ab61ec Exposed perfect hash internals. 2011-06-14 04:58:22 -03:00
Davi de Castro Reis
1e1cbfe606 Trying perfect hash. 2011-06-14 03:38:23 -03:00
Davi de Castro Reis
cc80fcfa2b Fixed benchmark 2011-06-14 03:32:02 -03:00
Davi de Castro Reis
1a5eee170c Fixed bug in uit64 benchmark. 2011-06-14 03:30:41 -03:00
Davi de Castro Reis
0846177267 All tests pass. 2011-06-14 02:24:40 -03:00
Davi Reis
c749ab444b Added bm_common and bm_index missing files. 2011-06-13 03:17:56 -03:00
Davi Reis
b10fe56a4e All compiles in the mac. 2011-06-13 02:16:19 -03:00
Davi Reis
bbfcdeb5a6 Compiles with clang in mac. 2011-05-23 17:18:24 -07:00
Davi Reis
bb40a4bb00 Renamed table to index and reorganized benchmarks. 2011-05-23 11:01:08 -07:00
Davi de Castro Reis
c630eb2a70 Implemented serialization machinery. 2011-05-16 11:26:18 -03:00
Davi de Castro Reis
5a4ba7516c Improved const-correctness. 2011-05-15 23:24:12 -03:00
Davi de Castro Reis
37a57c18e8 Moved to c arrays to allow mmap'ing. 2011-05-15 23:04:30 -03:00
Davi de Castro Reis
a61882d722 Enabled debug. 2011-05-15 22:02:34 -03:00
Davi de Castro Reis
cb50e06bc2 Fixed cxxflags. 2011-05-15 21:57:58 -03:00
Davi de Castro Reis
5a46ad95be Better header organization. 2011-05-15 20:47:42 -03:00
Davi de Castro Reis
0761f24182 Improved cxxmph test organization. 2011-05-15 19:48:01 -03:00
Davi de Castro Reis
7d5a62cb1a Merge branch 'cxxmph' of ssh://cmph.git.sourceforge.net/gitroot/cmph/cmph into cxxmph
Conflicts:
	Makefile.am
	configure.ac
2011-05-15 19:44:28 -03:00
Davi de Castro Reis
7d9253fd98 Fixed include dir. 2011-05-15 19:39:55 -03:00
Davi Reis
4482c8f39b Conditional compilation of the cxxmph directory. 2011-05-15 19:38:31 -03:00
Davi de Castro Reis
532ee999b9 Moved benchmark code into tests directory. 2011-05-15 17:19:08 -03:00
Davi de Castro Reis
af887685f5 Merge branch 'master' into cxxmph
Conflicts:
	INSTALL
	Makefile.am
	cxxmph/Makefile.am
	cxxmph/cmph_hash_map.h
	cxxmph/cmph_hash_map_test.cc
	cxxmph/mphtable.cc
	cxxmph/mphtable.h
2011-05-15 17:15:57 -03:00
Davi de Castro Reis
6adc2a4816 Added changes to README.t2t as well. 2011-05-15 12:58:13 -03:00
Davi de Castro Reis
0e6849792e Ready to roll. 2011-05-15 12:29:24 -03:00
Davi de Castro Reis
b0f3aaa351 Reorganized tests. 2011-05-14 17:44:58 -03:00
Fabiano C. Botelho
5315e3a597 Committer: Fabiano C. Botelho <fbotelho@fbotelho-desk.(none)>
On branch master
Changes to be committed:
Fixing bugs reported by Steve Friedman, Kaštil Jan
and a bug that was accidently introduced in file
chd_ph.c in 1.0 release.

	modified:   src/bdz.c
	modified:   src/bitbool.h
	modified:   src/chd_ph.c
	modified:   src/cmph_structs.c
	modified:   src/hash.c
2011-03-27 01:41:28 -07:00
Davi de Castro Reis
05eaf15d53 Added a benchmark to the C++ code. 2011-02-18 14:15:24 -08:00
Davi de Castro Reis
8355e2e1b8 Added a benchmark to the C code. 2011-02-18 14:15:10 -08:00
Davi de Castro Reis
4fc0c52c56 Benchmark works. 2011-02-15 20:46:05 -08:00
Davi de Castro Reis
d0eb54d030 Finishing benchmarks. 2011-02-15 14:49:08 -05:00
Davi de Castro Reis
b2da526497 Dumping cmph_uint32. 2011-02-13 23:32:50 -02:00
Davi de Castro Reis
5b78c02da0 Dumping cmph_uint32. 2011-02-13 20:40:26 -02:00
Davi de Castro Reis
2a35666bfa Add benchmarking code. 2011-01-24 10:29:22 -02:00
Davi de Castro Reis
084a940c2a Removed c++ stuff from the c master branch. 2011-01-23 22:07:29 -02:00
Davi de Castro Reis
2a68958c1f Fixed licenses, applied patch from Colin Walters fixing gcc 2.95 warnings and added -Wall -Werror compilation flags. 2011-01-21 03:11:55 -02:00
Davi de Castro Reis
799b4a3fc5 Forgot. 2011-01-20 23:07:46 -02:00
davi
62ac3f4bde All fine, time to optimize. 2010-11-09 03:51:33 -02:00
davi
bb2e9e28a8 All looks fine, commenting debug. 2010-11-09 03:38:46 -02:00
davi
b0255a8269 Valgrind pass. 2010-11-09 02:29:39 -02:00
davi
676d34073c Fixed first_edge initialization bug. 2010-11-08 22:02:18 -02:00
davi
cde9f72c9e Repro failure. 2010-11-08 18:19:44 -02:00
davi
7c9c6c518d Thinking slack. 2010-11-06 09:14:07 -02:00
Davi de Castro Reis
0c5f2301df Fixed compilation error and detected iterator problem. 2010-11-05 21:46:53 -02:00
davi
c09df518dc make install works. 2010-11-05 04:48:53 -02:00
davi
8663285897 Better design for hash templates. 2010-11-05 04:40:15 -02:00
davi
6c69aa0a8f Fixed small bugs. 2010-11-05 00:17:08 -02:00
Davi de Castro Reis
84f7da426c Added trigraph test. 2010-11-04 22:59:42 -02:00
Davi de Castro Reis
76a88922ac Added iterator first. 2010-11-04 22:57:41 -02:00
Davi de Castro Reis
7ead7bff2f Better. 2010-10-28 23:26:37 -07:00
Davi de Castro Reis
5fab722781 Now going to adapt hash_map. 2010-10-28 17:53:40 -07:00
Davi de Castro Reis
22d149d3a8 It works. 2010-10-27 19:45:43 -07:00
Davi de Castro Reis
385ce27a10 Added half nibble code. 2010-10-27 17:17:09 -07:00
Davi de Castro Reis
724e716d67 Added murmur hash and finished porting all c code. 2010-10-24 19:12:47 -07:00
Davi de Castro Reis
bf0c5892d8 Lots of work. 2010-10-05 11:51:17 -03:00
Davi de Castro Reis
f04df98f91 Extracting graph code. 2010-09-10 00:07:06 -07:00
Davi de Castro Reis
a3d29713b7 Fixed Makefile. 2010-09-09 18:42:30 -07:00
Davi de Castro Reis
121d13d08b Added fixed INSTALL file. 2010-09-09 17:50:07 -07:00
Davi de Castro Reis
52e7da2187 Merge branch 'master' of ssh://cmph.git.sourceforge.net/gitroot/cmph/cmph
Conflicts:
	INSTALL
	Makefile.am
	src/Makefile
	src/Makefile.in
2010-09-09 16:30:16 -07:00
Davi de Castro Reis
bfdcc3a3a1 Updated documentation. 2010-09-09 16:28:20 -07:00
Davi de Castro Reis
7f4a877e93 Removed cdb code. 2010-09-09 16:14:21 -07:00
Elias Gabriel Amaral da Silva
e32d398467 Adding an ebuild for 0.9 2010-09-09 15:57:23 -07:00
Davi Reis
3103d23ff4 Fixed m4 large files macro. 2010-09-09 15:51:03 -07:00
Davi de Castro Reis
62c3b0d375 Removed more noise. 2010-06-28 16:03:59 -03:00
Davi de Castro Reis
d3aee08baa Removed noise. 2010-06-28 16:02:07 -03:00
Davi de Castro Reis
355836a156 Added new files. 2010-06-28 16:01:18 -03:00
Davi Reis
1fea1cc9a0 Merge branch 'master' of ssh://davi@cmph.git.sourceforge.net/gitroot/cmph 2009-07-04 20:44:42 -07:00
Davi Reis
829ba6be75 Merge branch 'master' of ssh://davi@cmph.git.sourceforge.net/gitroot/cmph
Conflicts:
	Makefile.am
	cmph.pc.in
	src/Makefile
	src/Makefile.in
	src/cmph.c
	src/cmph_types.h
2009-07-04 20:42:01 -07:00
Davi Reis
7bcf8a7962 Merge branch 'master' of ssh://davi@cmph.git.sourceforge.net/gitroot/cmph
Conflicts:
	Makefile.am
	cmph.pc.in
	src/Makefile
	src/Makefile.in
	src/cmph.c
	src/cmph_types.h
2009-07-04 20:42:01 -07:00
Davi Reis
47a73e7e89 Minor 2009-07-04 20:17:36 -07:00
Davi Reis
250afcb75f Minor 2009-07-04 20:17:36 -07:00
Davi Reis
bb5c464ca1 Very early draft of cdb support. 2009-07-04 20:08:38 -07:00
Davi Reis
cbb817ddef Very early draft of cdb support. 2009-07-04 20:08:38 -07:00
Fabiano C. Botelho
7c88459cb2 cmph_time.h included as header in Makefile.am 2009-06-17 15:01:39 -03:00
Fabiano C. Botelho
39423c781b cmph_time.h included as header in Makefile.am 2009-06-17 15:01:39 -03:00
Fabiano C. Botelho
9737faca0c cmph_time.h included as header in Makefile.am 2009-06-17 15:01:39 -03:00
Fabiano C. Botelho
27a9158cb5 cmph_time.h included as header in Makefile.am 2009-06-17 15:01:39 -03:00
Fabiano C. Botelho
3ef4d951fd Changed Fabiano'email by its homepage 2009-06-15 13:37:47 -03:00
Fabiano C. Botelho
0b54d7b11c Changed Fabiano'email by its homepage 2009-06-15 13:37:47 -03:00
Fabiano C. Botelho
4f1749504f Changed Fabiano'email by its homepage 2009-06-15 13:37:47 -03:00
Fabiano C. Botelho
6b652073f9 Changed Fabiano'email by its homepage 2009-06-15 13:37:47 -03:00
Fabiano C. Botelho
327675f93e Fixed a broken link 2009-06-12 22:07:09 -03:00
Fabiano C. Botelho
7f0b4e07e9 Fixed a broken link 2009-06-12 22:07:09 -03:00
Fabiano C. Botelho
d1dce621ed Fixed a broken link 2009-06-12 22:07:09 -03:00
Fabiano C. Botelho
b415d7a575 Fixed a broken link 2009-06-12 22:07:09 -03:00
Fabiano C. Botelho
9ae0e10732 Documentation updated for release 0.9 2009-06-12 21:49:26 -03:00
Fabiano C. Botelho
e042615de6 Documentation updated for release 0.9 2009-06-12 21:49:26 -03:00
Fabiano C. Botelho
088389184f Documentation updated for release 0.9 2009-06-12 21:49:26 -03:00
Fabiano C. Botelho
7476b61a88 Documentation updated for release 0.9 2009-06-12 21:49:26 -03:00
Fabiano C. Botelho
37444720b5 vldb07 directory removed 2009-06-12 19:42:24 -03:00
Fabiano C. Botelho
c3cece4173 vldb07 directory removed 2009-06-12 19:42:24 -03:00
Fabiano C. Botelho
b8aa2106e9 vldb07 directory removed 2009-06-12 19:42:24 -03:00
Fabiano C. Botelho
ef89d3b305 vldb07 directory removed 2009-06-12 19:42:24 -03:00
Fabiano C. Botelho
eb1442b4ac checking the bug reported by Ayat Dawood. It was fixed using his suggestion (if((bdz->r%2)==0) bdz->r+=1; // The new line) 2009-06-12 03:07:54 -03:00
Fabiano C. Botelho
7835d6b3e6 checking the bug reported by Ayat Dawood. It was fixed using his suggestion (if((bdz->r%2)==0) bdz->r+=1; // The new line) 2009-06-12 03:07:54 -03:00
Fabiano C. Botelho
13ab2f0988 checking the bug reported by Ayat Dawood. It was fixed using his suggestion (if((bdz->r%2)==0) bdz->r+=1; // The new line) 2009-06-12 03:07:54 -03:00
Fabiano C. Botelho
d3cfd78aee checking the bug reported by Ayat Dawood. It was fixed using his suggestion (if((bdz->r%2)==0) bdz->r+=1; // The new line) 2009-06-12 03:07:54 -03:00
Fabiano C. Botelho
09f5ac08d5 Compiled with -Werror -Wall -O3 and -Wconversion flags 2009-06-12 02:46:18 -03:00
Fabiano C. Botelho
cf0b25f6ee Compiled with -Werror -Wall -O3 and -Wconversion flags 2009-06-12 02:46:18 -03:00
Fabiano C. Botelho
17b3299c7c Compiled with -Werror -Wall -O3 and -Wconversion flags 2009-06-12 02:46:18 -03:00
Fabiano C. Botelho
f68c665136 Compiled with -Werror -Wall -O3 and -Wconversion flags 2009-06-12 02:46:18 -03:00
Davi de Castro Reis
0174dfe059 Fixed Makefile.am such that make dist works on ubuntu. 2009-06-11 23:10:39 -03:00
Davi de Castro Reis
318a60aa7a Fixed Makefile.am such that make dist works on ubuntu. 2009-06-11 23:10:39 -03:00
Davi de Castro Reis
ddf8d74132 Fixed Makefile.am such that make dist works on ubuntu. 2009-06-11 23:10:39 -03:00
Davi de Castro Reis
4ce2a5a8c6 Fixed Makefile.am such that make dist works on ubuntu. 2009-06-11 23:10:39 -03:00
davi
7e8b70a0c2 Added missing file compressed_rank_tests.c 2009-06-09 23:59:41 -03:00
davi
cd18128973 Added missing file compressed_rank_tests.c 2009-06-09 23:59:41 -03:00
davi
76f126c15e Added missing file compressed_rank_tests.c 2009-06-09 23:59:41 -03:00
davi
9b6e70767e Added missing file compressed_rank_tests.c 2009-06-09 23:59:41 -03:00
davi
5ba74165d8 Fixed r value initialization bdz and bdz_ph. 2009-06-08 02:36:54 -03:00
davi
f8d086f454 Fixed r value initialization bdz and bdz_ph. 2009-06-08 02:36:54 -03:00
davi
0dc99f5930 Fixed r value initialization bdz and bdz_ph. 2009-06-08 02:36:54 -03:00
davi
4666739e25 Fixed r value initialization bdz and bdz_ph. 2009-06-08 02:36:54 -03:00
davi
7cbf88d680 Removed broken target. 2009-06-08 02:35:12 -03:00
davi
48bc719728 Removed broken target. 2009-06-08 02:35:12 -03:00
davi
cb477d93c8 Removed broken target. 2009-06-08 02:35:12 -03:00
davi
778e620928 Removed broken target. 2009-06-08 02:35:12 -03:00
fc_botelho
d1fe095a69 *** empty log message *** 2009-04-21 13:01:45 +00:00
fc_botelho
503d257967 *** empty log message *** 2009-04-21 13:01:45 +00:00
fc_botelho
d5fb4476ea *** empty log message *** 2009-04-21 13:01:45 +00:00
fc_botelho
0407fea016 *** empty log message *** 2009-04-21 13:01:45 +00:00
fc_botelho
1a11b02e71 *** empty log message *** 2009-04-07 23:16:40 +00:00
fc_botelho
8ae4bca345 *** empty log message *** 2009-04-07 23:16:40 +00:00
fc_botelho
985e27a4fa *** empty log message *** 2009-04-07 23:16:40 +00:00
fc_botelho
f682fe0304 *** empty log message *** 2009-04-07 23:16:40 +00:00
fc_botelho
b8d0614a2d *** empty log message *** 2009-03-18 22:08:46 +00:00
fc_botelho
2cfffbcc9d *** empty log message *** 2009-03-18 22:08:46 +00:00
fc_botelho
ea0a1fadd6 *** empty log message *** 2009-03-18 22:08:46 +00:00
fc_botelho
a80f0de19f *** empty log message *** 2009-03-18 22:08:46 +00:00
fc_botelho
c23b2ae173 compressed hash and displace method added 2009-03-18 19:40:23 +00:00
fc_botelho
cb676ee676 compressed hash and displace method added 2009-03-18 19:40:23 +00:00
fc_botelho
29535bbde4 compressed hash and displace method added 2009-03-18 19:40:23 +00:00
fc_botelho
79d250d152 compressed hash and displace method added 2009-03-18 19:40:23 +00:00
fc_botelho
25d9cc9f21 compressed sequence added 2009-03-16 13:07:50 +00:00
fc_botelho
4934a6b662 compressed sequence added 2009-03-16 13:07:50 +00:00
fc_botelho
e22a2db37d compressed sequence added 2009-03-16 13:07:50 +00:00
fc_botelho
83ce8f171a compressed sequence added 2009-03-16 13:07:50 +00:00
fc_botelho
523202c04a *** empty log message *** 2009-03-16 13:07:10 +00:00
fc_botelho
0935b2bbcb *** empty log message *** 2009-03-16 13:07:10 +00:00
fc_botelho
fb7de5303d *** empty log message *** 2009-03-16 13:07:10 +00:00
fc_botelho
417c7fb458 *** empty log message *** 2009-03-16 13:07:10 +00:00
fc_botelho
933252406a select data structure added 2009-03-14 23:24:05 +00:00
fc_botelho
7eb145226c select data structure added 2009-03-14 23:24:05 +00:00
fc_botelho
579487c5cf select data structure added 2009-03-14 23:24:05 +00:00
fc_botelho
7e265956c9 select data structure added 2009-03-14 23:24:05 +00:00
fc_botelho
5e82267a5f *** empty log message *** 2009-03-12 02:20:33 +00:00
fc_botelho
3de50d4344 *** empty log message *** 2009-03-12 02:20:33 +00:00
fc_botelho
042ff338c3 *** empty log message *** 2009-03-12 02:20:33 +00:00
fc_botelho
8203d56952 *** empty log message *** 2009-03-12 02:20:33 +00:00
fc_botelho
3d084cf7fa *** empty log message *** 2009-03-09 03:05:48 +00:00
fc_botelho
059c3a5f58 *** empty log message *** 2009-03-09 03:05:48 +00:00
fc_botelho
61de79d8f1 *** empty log message *** 2009-03-09 03:05:48 +00:00
fc_botelho
d047cb5822 *** empty log message *** 2009-03-09 03:05:48 +00:00
fc_botelho
1ce421d289 *** empty log message *** 2008-04-29 14:27:13 +00:00
fc_botelho
2ad176de49 *** empty log message *** 2008-04-29 14:27:13 +00:00
fc_botelho
3d77f341b2 *** empty log message *** 2008-04-29 14:27:13 +00:00
fc_botelho
1d8437c29b *** empty log message *** 2008-04-29 14:27:13 +00:00
davi
e3a161b899 Updated ebuild. 2008-04-28 01:27:31 +00:00
davi
bbf0b5b190 Updated ebuild. 2008-04-28 01:27:31 +00:00
davi
bf45031604 Updated ebuild. 2008-04-28 01:27:31 +00:00
davi
d313bb9e2e Updated ebuild. 2008-04-28 01:27:31 +00:00
davi
35b0a816dc Removed fingerprint methods and fixed pending bugs. 2008-04-28 01:18:23 +00:00
davi
2ad31e6bb9 Removed fingerprint methods and fixed pending bugs. 2008-04-28 01:18:23 +00:00
davi
b5cc0a8ea8 Removed fingerprint methods and fixed pending bugs. 2008-04-28 01:18:23 +00:00
davi
7ad2151315 Removed fingerprint methods and fixed pending bugs. 2008-04-28 01:18:23 +00:00
davi
9cb2e56a2f *** empty log message *** 2008-04-27 23:31:03 +00:00
davi
e0c009a62c *** empty log message *** 2008-04-27 23:31:03 +00:00
davi
522c66581e *** empty log message *** 2008-04-27 23:31:03 +00:00
davi
6c22c28319 *** empty log message *** 2008-04-27 23:31:03 +00:00
fc_botelho
e882574526 *** empty log message *** 2008-04-12 06:17:21 +00:00
fc_botelho
81a8cad41f *** empty log message *** 2008-04-12 06:17:21 +00:00
fc_botelho
b8d4392b85 *** empty log message *** 2008-04-12 06:17:21 +00:00
fc_botelho
8152e01b05 *** empty log message *** 2008-04-12 06:17:21 +00:00
fc_botelho
06e31d97f4 *** empty log message *** 2008-03-30 00:59:30 +00:00
fc_botelho
a1dbe680cf *** empty log message *** 2008-03-30 00:59:30 +00:00
fc_botelho
44e343a040 *** empty log message *** 2008-03-30 00:59:30 +00:00
fc_botelho
80c14026b4 *** empty log message *** 2008-03-30 00:59:30 +00:00
fc_botelho
b2bcb0a2a6 *** empty log message *** 2008-03-30 00:59:30 +00:00
fc_botelho
22b82de9b9 *** empty log message *** 2008-03-30 00:59:30 +00:00
fc_botelho
7704f19336 *** empty log message *** 2008-03-30 00:59:30 +00:00
fc_botelho
a5cdc7743f *** empty log message *** 2008-03-30 00:59:30 +00:00
fc_botelho
7de8e8fe72 *** empty log message *** 2008-03-30 00:34:06 +00:00
fc_botelho
812e0d9c32 *** empty log message *** 2008-03-30 00:34:06 +00:00
fc_botelho
4c325b58dc *** empty log message *** 2008-03-30 00:34:06 +00:00
fc_botelho
8fae31e538 *** empty log message *** 2008-03-30 00:34:06 +00:00
fc_botelho
b671dc57c8 *** empty log message *** 2008-03-30 00:23:41 +00:00
fc_botelho
5b262df7c4 *** empty log message *** 2008-03-30 00:23:41 +00:00
fc_botelho
b8c88b42c4 *** empty log message *** 2008-03-30 00:23:41 +00:00
fc_botelho
dd6352db4f *** empty log message *** 2008-03-30 00:23:41 +00:00
fc_botelho
ca2f228840 *** empty log message *** 2008-03-29 01:48:15 +00:00
fc_botelho
5b62735b39 *** empty log message *** 2008-03-29 01:48:15 +00:00
fc_botelho
b0c9cd5c4c *** empty log message *** 2008-03-29 01:48:15 +00:00
fc_botelho
1114622eea *** empty log message *** 2008-03-29 01:48:15 +00:00
fc_botelho
d63806a90a *** empty log message *** 2008-03-26 20:26:48 +00:00
fc_botelho
bc478aff26 *** empty log message *** 2008-03-26 20:26:48 +00:00
fc_botelho
c8a1c4fef9 *** empty log message *** 2008-03-26 20:26:48 +00:00
fc_botelho
872c528cd9 *** empty log message *** 2008-03-26 20:26:48 +00:00
fc_botelho
1f42de94f2 *** empty log message *** 2008-03-25 20:44:12 +00:00
fc_botelho
beafb103e7 *** empty log message *** 2008-03-25 20:44:12 +00:00
fc_botelho
2665414b4e *** empty log message *** 2008-03-25 20:44:12 +00:00
fc_botelho
efa30b7e21 *** empty log message *** 2008-03-25 20:44:12 +00:00
fc_botelho
0892a6ec30 *** empty log message *** 2008-03-25 20:36:30 +00:00
fc_botelho
ff5931a80b *** empty log message *** 2008-03-25 20:36:30 +00:00
fc_botelho
929fd42a1f *** empty log message *** 2008-03-25 20:36:30 +00:00
fc_botelho
def808fcd1 *** empty log message *** 2008-03-25 20:36:30 +00:00
fc_botelho
64e65b477d BDZ_PH added 2008-03-25 20:20:55 +00:00
fc_botelho
963fede7f4 BDZ_PH added 2008-03-25 20:20:55 +00:00
fc_botelho
cfd0019e93 BDZ_PH added 2008-03-25 20:20:55 +00:00
fc_botelho
fdf37b35e8 BDZ_PH added 2008-03-25 20:20:55 +00:00
fc_botelho
df6f3dfa68 *** empty log message *** 2008-03-23 03:54:54 +00:00
fc_botelho
825d8abca6 *** empty log message *** 2008-03-23 03:54:54 +00:00
fc_botelho
49feeafbeb *** empty log message *** 2008-03-23 03:54:54 +00:00
fc_botelho
6de67f1732 *** empty log message *** 2008-03-23 03:54:54 +00:00
fc_botelho
a967fa4c5f *** empty log message *** 2008-03-23 03:45:01 +00:00
fc_botelho
2173fdf8cd *** empty log message *** 2008-03-23 03:45:01 +00:00
fc_botelho
32afd2075f *** empty log message *** 2008-03-23 03:45:01 +00:00
fc_botelho
94957a85ba *** empty log message *** 2008-03-23 03:45:01 +00:00
fc_botelho
06ae667957 *** empty log message *** 2008-03-23 02:17:44 +00:00
fc_botelho
70633c7832 *** empty log message *** 2008-03-23 02:17:44 +00:00
fc_botelho
f3ab04d6ef *** empty log message *** 2008-03-23 02:17:44 +00:00
fc_botelho
a1296d2fdd *** empty log message *** 2008-03-23 02:17:44 +00:00
fc_botelho
e26444097c *** empty log message *** 2008-03-23 01:26:01 +00:00
fc_botelho
e134b0edd4 *** empty log message *** 2008-03-23 01:26:01 +00:00
fc_botelho
393ac11f54 *** empty log message *** 2008-03-23 01:26:01 +00:00
fc_botelho
906b5a5443 *** empty log message *** 2008-03-23 01:26:01 +00:00
fc_botelho
b2fc0b695d *** empty log message *** 2008-03-23 00:46:34 +00:00
fc_botelho
6e360d39b8 *** empty log message *** 2008-03-23 00:46:34 +00:00
fc_botelho
39e68583d3 *** empty log message *** 2008-03-23 00:46:34 +00:00
fc_botelho
328e2c10b2 *** empty log message *** 2008-03-23 00:46:34 +00:00
fc_botelho
9b6d00a34a *** empty log message *** 2008-03-23 00:46:34 +00:00
fc_botelho
06095b44b7 *** empty log message *** 2008-03-23 00:46:34 +00:00
fc_botelho
ffd364c68a *** empty log message *** 2008-03-23 00:46:34 +00:00
fc_botelho
cc4d4824b5 *** empty log message *** 2008-03-23 00:46:34 +00:00
davi
d098140c8f Updated web page. 2007-12-01 01:19:18 +00:00
davi
bf7b84aac6 Updated web page. 2007-12-01 01:19:18 +00:00
davi
29caf24f2f Updated web page. 2007-12-01 01:19:18 +00:00
davi
b7a4427250 Updated web page. 2007-12-01 01:19:18 +00:00
davi
687d6a4ac9 Updated web page. 2007-12-01 01:11:32 +00:00
davi
5122f6368f Updated web page. 2007-12-01 01:11:32 +00:00
davi
8887bd08e7 Updated web page. 2007-12-01 01:11:32 +00:00
davi
6c82d59954 Updated web page. 2007-12-01 01:11:32 +00:00
davi
916725bac1 Added man pages and pc file. 2007-11-29 04:07:53 +00:00
davi
bc7e367a41 Added man pages and pc file. 2007-11-29 04:07:53 +00:00
davi
fc405aeaaf Added man pages and pc file. 2007-11-29 04:07:53 +00:00
davi
3216327798 Added man pages and pc file. 2007-11-29 04:07:53 +00:00
davi
10ef034125 Added man pages and pc file. 2007-11-29 03:49:39 +00:00
davi
4f7f77e673 Added man pages and pc file. 2007-11-29 03:49:39 +00:00
davi
e745011f52 Added man pages and pc file. 2007-11-29 03:49:39 +00:00
davi
342fde1b12 Added man pages and pc file. 2007-11-29 03:49:39 +00:00
fc_botelho
b5b865d76c FCH algorithm documentation was added 2007-02-14 15:45:08 +00:00
fc_botelho
4922b34c54 FCH algorithm documentation was added 2007-02-14 15:45:08 +00:00
fc_botelho
092460d3e3 FCH algorithm documentation was added 2007-02-14 15:45:08 +00:00
fc_botelho
6e59abab61 FCH algorithm documentation was added 2007-02-14 15:45:08 +00:00
fc_botelho
248e5a2545 FCH algorithm documentation was added 2007-02-14 02:14:10 +00:00
fc_botelho
7683542778 FCH algorithm documentation was added 2007-02-14 02:14:10 +00:00
fc_botelho
60c686a2fc FCH algorithm documentation was added 2007-02-14 02:14:10 +00:00
fc_botelho
6a18486edd FCH algorithm documentation was added 2007-02-14 02:14:10 +00:00
fc_botelho
e2dd0748d6 documentation of release 0.6 was included 2007-02-13 18:03:28 +00:00
fc_botelho
fe2ddbe366 documentation of release 0.6 was included 2007-02-13 18:03:28 +00:00
fc_botelho
beeea04351 documentation of release 0.6 was included 2007-02-13 18:03:28 +00:00
fc_botelho
1a41a85d2d documentation of release 0.6 was included 2007-02-13 18:03:28 +00:00
fc_botelho
24a6ff3840 Removed some unused variables at BRZ and FCH algorithm 2007-02-13 15:35:38 +00:00
fc_botelho
80413e6a42 Removed some unused variables at BRZ and FCH algorithm 2007-02-13 15:35:38 +00:00
fc_botelho
b10dcef656 Removed some unused variables at BRZ and FCH algorithm 2007-02-13 15:35:38 +00:00
fc_botelho
9d6365d965 Removed some unused variables at BRZ and FCH algorithm 2007-02-13 15:35:38 +00:00
davi
856be01a01 Modifications to use pdflatex. 2006-09-20 04:05:40 +00:00
davi
1eccb27c5a Modifications to use pdflatex. 2006-09-20 04:05:40 +00:00
davi
c73ee33b98 Modifications to use pdflatex. 2006-09-20 04:05:40 +00:00
davi
f9cac72303 Modifications to use pdflatex. 2006-09-20 04:05:40 +00:00
davi
2119ffc3ab Compressed papers files. 2006-08-29 18:46:28 +00:00
davi
3f834e50dd Compressed papers files. 2006-08-29 18:46:28 +00:00
davi
924a60b5a0 Compressed papers files. 2006-08-29 18:46:28 +00:00
davi
995d68898b Compressed papers files. 2006-08-29 18:46:28 +00:00
fc_botelho
f959c3b7eb paper for vldb07 added 2006-08-11 17:32:31 +00:00
fc_botelho
80549b6ca6 paper for vldb07 added 2006-08-11 17:32:31 +00:00
fc_botelho
bd2e291de9 paper for vldb07 added 2006-08-11 17:32:31 +00:00
fc_botelho
fe4600148e paper for vldb07 added 2006-08-11 17:32:31 +00:00
fc_botelho
ad8a0a51f4 *** empty log message *** 2006-08-11 16:04:55 +00:00
fc_botelho
00c049787a *** empty log message *** 2006-08-11 16:04:55 +00:00
fc_botelho
aa4b59e7c1 *** empty log message *** 2006-08-11 16:04:55 +00:00
fc_botelho
b0546b1fcc *** empty log message *** 2006-08-11 16:04:55 +00:00
fc_botelho
e62ba1982d BRZ is working with FCH or BMZ8. BMZ8 is faster but the MPHFs for each bucket are larger 2006-08-07 16:37:49 +00:00
fc_botelho
c09dbb5acc BRZ is working with FCH or BMZ8. BMZ8 is faster but the MPHFs for each bucket are larger 2006-08-07 16:37:49 +00:00
fc_botelho
690dc3ee82 BRZ is working with FCH or BMZ8. BMZ8 is faster but the MPHFs for each bucket are larger 2006-08-07 16:37:49 +00:00
fc_botelho
9144ecc809 BRZ is working with FCH or BMZ8. BMZ8 is faster but the MPHFs for each bucket are larger 2006-08-07 16:37:49 +00:00
fc_botelho
98f29044d1 BRZ is working with FCH or BMZ8. BMZ8 is faster but the MPHFs for each bucket are larger 2006-08-07 14:44:24 +00:00
fc_botelho
bfb82810b1 BRZ is working with FCH or BMZ8. BMZ8 is faster but the MPHFs for each bucket are larger 2006-08-07 14:44:24 +00:00
fc_botelho
5334c9debc BRZ is working with FCH or BMZ8. BMZ8 is faster but the MPHFs for each bucket are larger 2006-08-07 14:44:24 +00:00
fc_botelho
8adabac1c9 BRZ is working with FCH or BMZ8. BMZ8 is faster but the MPHFs for each bucket are larger 2006-08-07 14:44:24 +00:00
fc_botelho
1f478e47c2 FCH algorithm is working... 2006-08-04 17:55:13 +00:00
fc_botelho
7f44ed0d0b FCH algorithm is working... 2006-08-04 17:55:13 +00:00
fc_botelho
99f0705fed FCH algorithm is working... 2006-08-04 17:55:13 +00:00
fc_botelho
c9edcadd8f FCH algorithm is working... 2006-08-04 17:55:13 +00:00
fc_botelho
2752852d0f FCH algorithm added but not tested... 2006-08-03 20:00:11 +00:00
fc_botelho
c5ce704d0e FCH algorithm added but not tested... 2006-08-03 20:00:11 +00:00
fc_botelho
3825c0a511 FCH algorithm added but not tested... 2006-08-03 20:00:11 +00:00
fc_botelho
8c7e3e8e9a FCH algorithm added but not tested... 2006-08-03 20:00:11 +00:00
fc_botelho
97e8247c4e FCH algorithm added but not tested... 2006-08-03 18:27:16 +00:00
fc_botelho
ee06e15f43 FCH algorithm added but not tested... 2006-08-03 18:27:16 +00:00
fc_botelho
9352564651 FCH algorithm added but not tested... 2006-08-03 18:27:16 +00:00
fc_botelho
79099c26b8 FCH algorithm added but not tested... 2006-08-03 18:27:16 +00:00
fc_botelho
0e8b3df922 strlen function removed from BRZ algorithm 2006-07-28 22:36:50 +00:00
fc_botelho
6b9ee06643 strlen function removed from BRZ algorithm 2006-07-28 22:36:50 +00:00
fc_botelho
e75eb4b616 strlen function removed from BRZ algorithm 2006-07-28 22:36:50 +00:00
fc_botelho
6bc95ace44 strlen function removed from BRZ algorithm 2006-07-28 22:36:50 +00:00
fc_botelho
789e5d39b1 vector adapter example updated to be used with brz algorithm 2006-07-27 12:28:41 +00:00
fc_botelho
79377900f9 vector adapter example updated to be used with brz algorithm 2006-07-27 12:28:41 +00:00
fc_botelho
9d762c4bc0 vector adapter example updated to be used with brz algorithm 2006-07-27 12:28:41 +00:00
fc_botelho
32829274c4 vector adapter example updated to be used with brz algorithm 2006-07-27 12:28:41 +00:00
fc_botelho
ce21b7044b vector adapter example updated to be used with brz algorithm 2006-07-27 12:27:34 +00:00
fc_botelho
dafb408c10 vector adapter example updated to be used with brz algorithm 2006-07-27 12:27:34 +00:00
fc_botelho
3ecdcb88f7 vector adapter example updated to be used with brz algorithm 2006-07-27 12:27:34 +00:00
fc_botelho
293ae3c811 vector adapter example updated to be used with brz algorithm 2006-07-27 12:27:34 +00:00
fc_botelho
f114562f10 More fixes for C++ compilation proposed by Steffan Webb. 2006-07-27 00:44:14 +00:00
fc_botelho
9c3a26110c More fixes for C++ compilation proposed by Steffan Webb. 2006-07-27 00:44:14 +00:00
fc_botelho
99e6763c2f More fixes for C++ compilation proposed by Steffan Webb. 2006-07-27 00:44:14 +00:00
fc_botelho
4f41cdc956 More fixes for C++ compilation proposed by Steffan Webb. 2006-07-27 00:44:14 +00:00
davi
f730920907 Fixes for C++ compilation proposed by Steffan Webb. 2006-07-25 15:30:46 +00:00
davi
f0fd02d42e Fixes for C++ compilation proposed by Steffan Webb. 2006-07-25 15:30:46 +00:00
davi
038f836d0d Fixes for C++ compilation proposed by Steffan Webb. 2006-07-25 15:30:46 +00:00
davi
b5070227c6 Fixes for C++ compilation proposed by Steffan Webb. 2006-07-25 15:30:46 +00:00
fc_botelho
b8ac6fb0c1 *** empty log message *** 2006-05-03 20:25:41 +00:00
fc_botelho
f2881c8f2e *** empty log message *** 2006-05-03 20:25:41 +00:00
fc_botelho
2753738103 *** empty log message *** 2006-05-03 20:25:41 +00:00
fc_botelho
c1aea5cc0f *** empty log message *** 2006-05-03 20:25:41 +00:00
fc_botelho
124e0dfe68 buffer_manage replaced for buffer_manager 2006-04-28 16:35:48 +00:00
fc_botelho
ffc2d22424 buffer_manage replaced for buffer_manager 2006-04-28 16:35:48 +00:00
fc_botelho
998648ca4c buffer_manage replaced for buffer_manager 2006-04-28 16:35:48 +00:00
fc_botelho
3578933d4a buffer_manage replaced for buffer_manager 2006-04-28 16:35:48 +00:00
fc_botelho
09ef0957fc compilation errors corrected and license updated to MOZILLA PUBLIC LICENSE 2006-04-27 17:30:19 +00:00
fc_botelho
6002eb7693 compilation errors corrected and license updated to MOZILLA PUBLIC LICENSE 2006-04-27 17:30:19 +00:00
fc_botelho
1643b7400b compilation errors corrected and license updated to MOZILLA PUBLIC LICENSE 2006-04-27 17:30:19 +00:00
fc_botelho
9551495b02 compilation errors corrected and license updated to MOZILLA PUBLIC LICENSE 2006-04-27 17:30:19 +00:00
fc_botelho
9eccc3d4e5 Makefile.am and configure.ac updated 2006-04-26 17:59:34 +00:00
fc_botelho
dcba8ec14f Makefile.am and configure.ac updated 2006-04-26 17:59:34 +00:00
fc_botelho
d5bfc91289 Makefile.am and configure.ac updated 2006-04-26 17:59:34 +00:00
fc_botelho
a73afcb8c2 Makefile.am and configure.ac updated 2006-04-26 17:59:34 +00:00
fc_botelho
8673ed16e1 external memory based algorithm documentation updated 2006-04-25 19:34:06 +00:00
fc_botelho
7e37e5009e external memory based algorithm documentation updated 2006-04-25 19:34:06 +00:00
fc_botelho
1acdcba4b7 external memory based algorithm documentation updated 2006-04-25 19:34:06 +00:00
fc_botelho
7522a90f27 external memory based algorithm documentation updated 2006-04-25 19:34:06 +00:00
fc_botelho
a15a9ee018 external memory based algorithm documentation added 2006-04-25 16:51:02 +00:00
fc_botelho
ef8eb85832 external memory based algorithm documentation added 2006-04-25 16:51:02 +00:00
fc_botelho
e44ecbdbbc external memory based algorithm documentation added 2006-04-25 16:51:02 +00:00
fc_botelho
bc1dac6891 external memory based algorithm documentation added 2006-04-25 16:51:02 +00:00
fc_botelho
7b9cee37fe Makefile.am fixed and, wingetopt.c and wingetopt.h were moved to src directory 2006-04-19 19:46:12 +00:00
fc_botelho
baec893907 Makefile.am fixed and, wingetopt.c and wingetopt.h were moved to src directory 2006-04-19 19:46:12 +00:00
fc_botelho
5ecd08726e Makefile.am fixed and, wingetopt.c and wingetopt.h were moved to src directory 2006-04-19 19:46:12 +00:00
fc_botelho
27cd2b7978 Makefile.am fixed and, wingetopt.c and wingetopt.h were moved to src directory 2006-04-19 19:46:12 +00:00
fc_botelho
e3893430f6 Makefile.am fixed 2006-04-19 19:44:03 +00:00
fc_botelho
1afd893d0d Makefile.am fixed 2006-04-19 19:44:03 +00:00
fc_botelho
8959513c6e Makefile.am fixed 2006-04-19 19:44:03 +00:00
fc_botelho
4a69377253 Makefile.am fixed 2006-04-19 19:44:03 +00:00
fc_botelho
12fbab96cc Makefile.am fixed 2006-04-19 19:38:41 +00:00
fc_botelho
27b33e82ce Makefile.am fixed 2006-04-19 19:38:41 +00:00
fc_botelho
618bd59815 Makefile.am fixed 2006-04-19 19:38:41 +00:00
fc_botelho
1c450d1365 Makefile.am fixed 2006-04-19 19:38:41 +00:00
fc_botelho
7767312143 buffer_entry.c and buffer_entry.h added 2006-04-19 18:44:29 +00:00
fc_botelho
2fb21586d5 buffer_entry.c and buffer_entry.h added 2006-04-19 18:44:29 +00:00
fc_botelho
c97ea937a8 buffer_entry.c and buffer_entry.h added 2006-04-19 18:44:29 +00:00
fc_botelho
86ad500249 buffer_entry.c and buffer_entry.h added 2006-04-19 18:44:29 +00:00
fc_botelho
99d2a2527d buffer_manage.c and buffer_manage.h added 2006-04-19 18:02:40 +00:00
fc_botelho
f7d0e5b7fd buffer_manage.c and buffer_manage.h added 2006-04-19 18:02:40 +00:00
fc_botelho
d2a51f1231 buffer_manage.c and buffer_manage.h added 2006-04-19 18:02:40 +00:00
fc_botelho
ec7d16e77f buffer_manage.c and buffer_manage.h added 2006-04-19 18:02:40 +00:00
fc_botelho
9aa43a0994 stable version of BRZ algorithm using buffers 2006-01-25 19:54:53 +00:00
fc_botelho
a0cf4648a4 stable version of BRZ algorithm using buffers 2006-01-25 19:54:53 +00:00
fc_botelho
c12535761c stable version of BRZ algorithm using buffers 2006-01-25 19:54:53 +00:00
fc_botelho
14744749a3 stable version of BRZ algorithm using buffers 2006-01-25 19:54:53 +00:00
fc_botelho
6284bb94a5 stable version of BRZ algorithm using buffers 2006-01-25 19:45:14 +00:00
fc_botelho
dcd8e025e2 stable version of BRZ algorithm using buffers 2006-01-25 19:45:14 +00:00
fc_botelho
59ddeb6379 stable version of BRZ algorithm using buffers 2006-01-25 19:45:14 +00:00
fc_botelho
ab38f8e13f stable version of BRZ algorithm using buffers 2006-01-25 19:45:14 +00:00
fc_botelho
bb8335f65b thread-safe vector adapter 2005-10-10 17:43:21 +00:00
fc_botelho
312947b34f thread-safe vector adapter 2005-10-10 17:43:21 +00:00
fc_botelho
26550aad88 thread-safe vector adapter 2005-10-10 17:43:21 +00:00
fc_botelho
2fd4b6a46d thread-safe vector adapter 2005-10-10 17:43:21 +00:00
fc_botelho
54b8ded9ed *** empty log message *** 2005-10-03 00:01:27 +00:00
fc_botelho
da288dc849 *** empty log message *** 2005-10-03 00:01:27 +00:00
fc_botelho
a2af7dfa69 *** empty log message *** 2005-10-03 00:01:27 +00:00
fc_botelho
dab54e510c *** empty log message *** 2005-10-03 00:01:27 +00:00
fc_botelho
b4863fbce9 worked on related work 2005-10-02 23:56:51 +00:00
fc_botelho
d30103fcb0 worked on related work 2005-10-02 23:56:51 +00:00
fc_botelho
4825611cd0 worked on related work 2005-10-02 23:56:51 +00:00
fc_botelho
a1e0b08deb worked on related work 2005-10-02 23:56:51 +00:00
fc_botelho
4b707bc9fc *** empty log message *** 2005-10-02 21:28:13 +00:00
fc_botelho
797b6c0123 *** empty log message *** 2005-10-02 21:28:13 +00:00
fc_botelho
83982b2317 *** empty log message *** 2005-10-02 21:28:13 +00:00
fc_botelho
93d49e044c *** empty log message *** 2005-10-02 21:28:13 +00:00
fc_botelho
1592c86c28 Changed introduction and related work 2005-09-30 20:14:51 +00:00
fc_botelho
7d742eaeab Changed introduction and related work 2005-09-30 20:14:51 +00:00
fc_botelho
d0c7c1cd1f Changed introduction and related work 2005-09-30 20:14:51 +00:00
fc_botelho
ad495643bf Changed introduction and related work 2005-09-30 20:14:51 +00:00
fc_botelho
386dcbf9e4 worked on the related work section 2005-09-28 21:37:19 +00:00
fc_botelho
766607a897 worked on the related work section 2005-09-28 21:37:19 +00:00
fc_botelho
884d7ff8f2 worked on the related work section 2005-09-28 21:37:19 +00:00
fc_botelho
a36080a274 worked on the related work section 2005-09-28 21:37:19 +00:00
fc_botelho
95096c2cbc added vldb journal 2005-09-27 15:11:25 +00:00
fc_botelho
230a632915 added vldb journal 2005-09-27 15:11:25 +00:00
fc_botelho
c0ae67705b added vldb journal 2005-09-27 15:11:25 +00:00
fc_botelho
ab356bf19b added vldb journal 2005-09-27 15:11:25 +00:00
davi
b9842cfcbd Removed useless files. 2005-09-26 18:45:56 +00:00
davi
5d271817dd Removed useless files. 2005-09-26 18:45:56 +00:00
davi
8a95b0c180 Removed useless files. 2005-09-26 18:45:56 +00:00
davi
7c10672e35 Removed useless files. 2005-09-26 18:45:56 +00:00
davi
ed92147794 Added gentoo ebuild. 2005-09-26 04:30:54 +00:00
davi
e350c110d2 Added gentoo ebuild. 2005-09-26 04:30:54 +00:00
davi
a4b8e4d31e Added gentoo ebuild. 2005-09-26 04:30:54 +00:00
davi
aa501651c1 Added gentoo ebuild. 2005-09-26 04:30:54 +00:00
davi
f2b6da9a13 Starting to implement hashtree algorithm. 2005-09-23 22:31:02 +00:00
davi
ba8d86fc2e Starting to implement hashtree algorithm. 2005-09-23 22:31:02 +00:00
davi
12445eb287 Starting to implement hashtree algorithm. 2005-09-23 22:31:02 +00:00
davi
3cd94080ee Starting to implement hashtree algorithm. 2005-09-23 22:31:02 +00:00
davi
a8c4aa7a45 Started to implement new algorithm hashtree. 2005-09-23 21:43:33 +00:00
davi
5d4d6024e5 Started to implement new algorithm hashtree. 2005-09-23 21:43:33 +00:00
davi
5c679aeb77 Started to implement new algorithm hashtree. 2005-09-23 21:43:33 +00:00
davi
5adcefa255 Started to implement new algorithm hashtree. 2005-09-23 21:43:33 +00:00
fc_botelho
f30b8f6172 *** empty log message *** 2005-09-23 21:05:48 +00:00
fc_botelho
feb24d8a69 *** empty log message *** 2005-09-23 21:05:48 +00:00
fc_botelho
70e14dcc67 *** empty log message *** 2005-09-23 21:05:48 +00:00
fc_botelho
f7c6e8a60a *** empty log message *** 2005-09-23 21:05:48 +00:00
davi
8fed224575 Fixed small main.cc bug when -m parameter was in use. 2005-09-23 20:57:42 +00:00
davi
17ed72b8d2 Fixed small main.cc bug when -m parameter was in use. 2005-09-23 20:57:42 +00:00
davi
dffa191e74 Fixed small main.cc bug when -m parameter was in use. 2005-09-23 20:57:42 +00:00
davi
06f9e8b768 Fixed small main.cc bug when -m parameter was in use. 2005-09-23 20:57:42 +00:00
fc_botelho
7352e08488 stable version of BRZ using external memory to flush vector g 2005-09-23 20:54:31 +00:00
fc_botelho
ded7440fd4 stable version of BRZ using external memory to flush vector g 2005-09-23 20:54:31 +00:00
fc_botelho
c4bc326c4b stable version of BRZ using external memory to flush vector g 2005-09-23 20:54:31 +00:00
fc_botelho
7c654b88a8 stable version of BRZ using external memory to flush vector g 2005-09-23 20:54:31 +00:00
davi
24dbad9bd3 Removed useless files. 2005-09-23 19:58:44 +00:00
davi
e1b7b74776 Removed useless files. 2005-09-23 19:58:44 +00:00
davi
009bd68bcd Removed useless files. 2005-09-23 19:58:44 +00:00
davi
bab5f9465b Removed useless files. 2005-09-23 19:58:44 +00:00
fc_botelho
3871a24405 version 0.4. It was included the bmz8 algorithm to generate mphfs for small set of keys (at most 256 keys), the vector adapter and some bugs have been corrected. 2005-09-16 04:10:16 +00:00
fc_botelho
ee35a5237d version 0.4. It was included the bmz8 algorithm to generate mphfs for small set of keys (at most 256 keys), the vector adapter and some bugs have been corrected. 2005-09-16 04:10:16 +00:00
fc_botelho
f462c10611 version 0.4. It was included the bmz8 algorithm to generate mphfs for small set of keys (at most 256 keys), the vector adapter and some bugs have been corrected. 2005-09-16 04:10:16 +00:00
fc_botelho
6722f0e80b version 0.4. It was included the bmz8 algorithm to generate mphfs for small set of keys (at most 256 keys), the vector adapter and some bugs have been corrected. 2005-09-16 04:10:16 +00:00
fc_botelho
92cf0ea484 version 0.4. It was included the bmz8 algorithm to generate mphfs for small set of keys (at most 256 keys), the vector adapter and some bugs have been corrected. 2005-09-16 02:53:07 +00:00
fc_botelho
2303517703 version 0.4. It was included the bmz8 algorithm to generate mphfs for small set of keys (at most 256 keys), the vector adapter and some bugs have been corrected. 2005-09-16 02:53:07 +00:00
fc_botelho
cfbe583520 version 0.4. It was included the bmz8 algorithm to generate mphfs for small set of keys (at most 256 keys), the vector adapter and some bugs have been corrected. 2005-09-16 02:53:07 +00:00
fc_botelho
c9cddd1318 version 0.4. It was included the bmz8 algorithm to generate mphfs for small set of keys (at most 256 keys), the vector adapter and some bugs have been corrected. 2005-09-16 02:53:07 +00:00
fc_botelho
cff94651c4 LGPL license included in the file COPYING 2005-09-12 21:58:41 +00:00
fc_botelho
b78fa91879 LGPL license included in the file COPYING 2005-09-12 21:58:41 +00:00
fc_botelho
7a64ad46b2 LGPL license included in the file COPYING 2005-09-12 21:58:41 +00:00
fc_botelho
109853d82c LGPL license included in the file COPYING 2005-09-12 21:58:41 +00:00
fc_botelho
9c5dd49f61 Stable version of BRZ algorithm with option -M (memory_availability) 2005-09-06 14:37:35 +00:00
fc_botelho
c9b937fcbb Stable version of BRZ algorithm with option -M (memory_availability) 2005-09-06 14:37:35 +00:00
fc_botelho
0c25b2d6f5 Stable version of BRZ algorithm with option -M (memory_availability) 2005-09-06 14:37:35 +00:00
fc_botelho
a5b68f040a Stable version of BRZ algorithm with option -M (memory_availability) 2005-09-06 14:37:35 +00:00
fc_botelho
209986c6a5 Stable version of BRZ algorithm 2005-09-06 14:11:37 +00:00
fc_botelho
d2aeaae27c Stable version of BRZ algorithm 2005-09-06 14:11:37 +00:00
fc_botelho
d686e0a53e Stable version of BRZ algorithm 2005-09-06 14:11:37 +00:00
fc_botelho
d9ca12f60c Stable version of BRZ algorithm 2005-09-06 14:11:37 +00:00
fc_botelho
97e2fd04a0 BMZ8 - A 8 bit version of BMZ has been added 2005-09-05 17:32:21 +00:00
fc_botelho
3181ab91f2 BMZ8 - A 8 bit version of BMZ has been added 2005-09-05 17:32:21 +00:00
fc_botelho
0bd5ad5d1a BMZ8 - A 8 bit version of BMZ has been added 2005-09-05 17:32:21 +00:00
fc_botelho
3324db0600 BMZ8 - A 8 bit version of BMZ has been added 2005-09-05 17:32:21 +00:00
fc_botelho
e9a300f189 Fixed: gcc 2.95 problem and initializes memory 2005-09-02 17:51:17 +00:00
fc_botelho
5473b2f124 Fixed: gcc 2.95 problem and initializes memory 2005-09-02 17:51:17 +00:00
fc_botelho
eb3afb8e8e Fixed: gcc 2.95 problem and initializes memory 2005-09-02 17:51:17 +00:00
fc_botelho
b930363855 Fixed: gcc 2.95 problem and initializes memory 2005-09-02 17:51:17 +00:00
fc_botelho
8f5f608426 Fixed: gcc 2.95 problem and initializes memory 2005-09-02 17:31:23 +00:00
fc_botelho
1d9e0c2c81 Fixed: gcc 2.95 problem and initializes memory 2005-09-02 17:31:23 +00:00
fc_botelho
41cd24604c Fixed: gcc 2.95 problem and initializes memory 2005-09-02 17:31:23 +00:00
fc_botelho
72dfef8a95 Fixed: gcc 2.95 problem and initializes memory 2005-09-02 17:31:23 +00:00
fc_botelho
c9ca8dbc28 Fixed: gcc 2.95 problem and initializes memory 2005-09-02 17:24:39 +00:00
fc_botelho
1d6ef167f6 Fixed: gcc 2.95 problem and initializes memory 2005-09-02 17:24:39 +00:00
fc_botelho
01912bbe07 Fixed: gcc 2.95 problem and initializes memory 2005-09-02 17:24:39 +00:00
fc_botelho
7721b027e2 Fixed: gcc 2.95 problem and initializes memory 2005-09-02 17:24:39 +00:00
fc_botelho
2a1a7f9896 *** empty log message *** 2005-08-08 21:34:22 +00:00
fc_botelho
7f62bf4837 *** empty log message *** 2005-08-08 21:34:22 +00:00
fc_botelho
51be86814b *** empty log message *** 2005-08-08 21:34:22 +00:00
fc_botelho
275bb81d06 *** empty log message *** 2005-08-08 21:34:22 +00:00
fc_botelho
987870bc59 temporary directory passed by command line 2005-08-08 01:00:27 +00:00
fc_botelho
5e22ae4934 temporary directory passed by command line 2005-08-08 01:00:27 +00:00
fc_botelho
5f3d477c1f temporary directory passed by command line 2005-08-08 01:00:27 +00:00
fc_botelho
9553f65537 temporary directory passed by command line 2005-08-08 01:00:27 +00:00
fc_botelho
114c4e1c63 stable version of BRZ 2005-08-07 23:22:32 +00:00
fc_botelho
d334b15512 stable version of BRZ 2005-08-07 23:22:32 +00:00
fc_botelho
3a486da2ec stable version of BRZ 2005-08-07 23:22:32 +00:00
fc_botelho
da4ca77b9c stable version of BRZ 2005-08-07 23:22:32 +00:00
fc_botelho
abb193e618 no message 2005-08-07 01:09:14 +00:00
fc_botelho
9ae51473a3 no message 2005-08-07 01:09:14 +00:00
fc_botelho
dbfb678316 no message 2005-08-07 01:09:14 +00:00
fc_botelho
ca6e4355d9 no message 2005-08-07 01:09:14 +00:00
fc_botelho
c983eec983 no message 2005-08-07 01:02:49 +00:00
fc_botelho
fb2a18974e no message 2005-08-07 01:02:49 +00:00
fc_botelho
cc66c831c0 no message 2005-08-07 01:02:49 +00:00
fc_botelho
360cf99794 no message 2005-08-07 01:02:49 +00:00
fc_botelho
57126668ec fastest version of BRZ 2005-08-07 00:45:39 +00:00
fc_botelho
7432524608 fastest version of BRZ 2005-08-07 00:45:39 +00:00
fc_botelho
1dad589c00 fastest version of BRZ 2005-08-07 00:45:39 +00:00
fc_botelho
93010977df fastest version of BRZ 2005-08-07 00:45:39 +00:00
fc_botelho
508ef470c8 *** empty log message *** 2005-08-06 20:20:04 +00:00
fc_botelho
33a669055e *** empty log message *** 2005-08-06 20:20:04 +00:00
fc_botelho
1d07882eb3 *** empty log message *** 2005-08-06 20:20:04 +00:00
fc_botelho
b67c58a108 *** empty log message *** 2005-08-06 20:20:04 +00:00
fc_botelho
0db620e9cd BRZ algorithm is almost stable 2005-07-29 19:43:21 +00:00
fc_botelho
b95fd8234a BRZ algorithm is almost stable 2005-07-29 19:43:21 +00:00
fc_botelho
fc61796452 BRZ algorithm is almost stable 2005-07-29 19:43:21 +00:00
fc_botelho
a1a8bb8681 BRZ algorithm is almost stable 2005-07-29 19:43:21 +00:00
fc_botelho
2fa5457c27 BRZ algorithm is almost stable 2005-07-29 18:29:30 +00:00
fc_botelho
2a56ec26d7 BRZ algorithm is almost stable 2005-07-29 18:29:30 +00:00
fc_botelho
f54cbcdfdc BRZ algorithm is almost stable 2005-07-29 18:29:30 +00:00
fc_botelho
e796250cec BRZ algorithm is almost stable 2005-07-29 18:29:30 +00:00
fc_botelho
f5dc722c54 it was fixed more mistakes in BRZ algorithm 2005-07-29 03:09:31 +00:00
fc_botelho
c98912e0ae it was fixed more mistakes in BRZ algorithm 2005-07-29 03:09:31 +00:00
fc_botelho
be73e6b2a4 it was fixed more mistakes in BRZ algorithm 2005-07-29 03:09:31 +00:00
fc_botelho
e88b0312ec it was fixed more mistakes in BRZ algorithm 2005-07-29 03:09:31 +00:00
fc_botelho
699e9b4617 fixed some mistakes in BRZ algorithm 2005-07-29 00:00:01 +00:00
fc_botelho
77d05871bf fixed some mistakes in BRZ algorithm 2005-07-29 00:00:01 +00:00
fc_botelho
5212c7d5f3 fixed some mistakes in BRZ algorithm 2005-07-29 00:00:01 +00:00
fc_botelho
38f6621978 fixed some mistakes in BRZ algorithm 2005-07-29 00:00:01 +00:00
fc_botelho
7f38f249a9 algorithm BRZ included 2005-07-27 22:13:25 +00:00
fc_botelho
3e4b8fdfcb algorithm BRZ included 2005-07-27 22:13:25 +00:00
fc_botelho
0bf7615f53 algorithm BRZ included 2005-07-27 22:13:25 +00:00
fc_botelho
dfd23b84ff algorithm BRZ included 2005-07-27 22:13:25 +00:00
fc_botelho
426dfe2a63 Algorithm BRZ included 2005-07-27 21:13:02 +00:00
fc_botelho
347c675362 Algorithm BRZ included 2005-07-27 21:13:02 +00:00
fc_botelho
2948f675d5 Algorithm BRZ included 2005-07-27 21:13:02 +00:00
fc_botelho
8c60d7b436 Algorithm BRZ included 2005-07-27 21:13:02 +00:00
fc_botelho
652565a1a4 it was included an examples directory 2005-07-25 22:18:45 +00:00
fc_botelho
3914ba8d73 it was included an examples directory 2005-07-25 22:18:45 +00:00
fc_botelho
9a12ab1456 it was included an examples directory 2005-07-25 22:18:45 +00:00
fc_botelho
877cf9fd38 it was included an examples directory 2005-07-25 22:18:45 +00:00
fc_botelho
ba0fd4d0ae it was included a examples directory 2005-07-25 21:26:17 +00:00
fc_botelho
5555a42284 it was included a examples directory 2005-07-25 21:26:17 +00:00
fc_botelho
6aeb1033ed it was included a examples directory 2005-07-25 21:26:17 +00:00
fc_botelho
e7a99efa0e it was included a examples directory 2005-07-25 21:26:17 +00:00
davi
1a986bfbca Fixed off by one bug in chm. 2005-02-28 22:53:40 +00:00
davi
c3578666ac Fixed off by one bug in chm. 2005-02-28 22:53:40 +00:00
fc_botelho
2a1914c0d8 The way of calling the function cmph_search was fixed in the file README.t2t 2005-02-17 18:20:14 +00:00
fc_botelho
670212dbf8 The way of calling the function cmph_search was fixed in the file README.t2t 2005-02-17 18:20:14 +00:00
fc_botelho
1eac6fe727 Heuristic BMZ memory consumption was updated 2005-01-31 19:13:56 +00:00
fc_botelho
60819c778e Heuristic BMZ memory consumption was updated 2005-01-31 19:13:56 +00:00
fc_botelho
e2b4c10e74 DJB2, SDBM, FNV and Jenkins hash link were added 2005-01-31 19:09:29 +00:00
fc_botelho
00c77e2038 DJB2, SDBM, FNV and Jenkins hash link were added 2005-01-31 19:09:29 +00:00
fc_botelho
9abc48f91c BMZ documentation was finished 2005-01-31 18:50:58 +00:00
fc_botelho
4951dedce9 BMZ documentation was finished 2005-01-31 18:50:58 +00:00
fc_botelho
9110014044 Initial version 2005-01-28 20:12:58 +00:00
fc_botelho
8401ce6d92 Initial version 2005-01-28 20:12:58 +00:00
fc_botelho
29d8ae6995 It was improved the documentation of BMZ and CHM algorithms 2005-01-28 20:07:22 +00:00
fc_botelho
f1b1f12dda It was improved the documentation of BMZ and CHM algorithms 2005-01-28 20:07:22 +00:00
fc_botelho
e632be1080 history of BMZ algorithm is available 2005-01-27 20:07:57 +00:00
fc_botelho
dfa28a005a history of BMZ algorithm is available 2005-01-27 20:07:57 +00:00
fc_botelho
2c220612eb It was added the authors' email 2005-01-27 16:23:11 +00:00
fc_botelho
b4930a47f1 It was added the authors' email 2005-01-27 16:23:11 +00:00
fc_botelho
2fba2d5bf4 It was added FOOTER.t2t file 2005-01-27 16:21:49 +00:00
fc_botelho
1e67ed9f88 It was added FOOTER.t2t file 2005-01-27 16:21:49 +00:00
fc_botelho
76bae3e585 It was removed pjw and glib functions from cmph_hash_names vector 2005-01-27 14:12:28 +00:00
fc_botelho
7d3d1248d3 It was removed pjw and glib functions from cmph_hash_names vector 2005-01-27 14:12:28 +00:00
davi
1825613fd1 Fix to alternate hash functions code. Removed htonl stuff from chm algorithm. Added faq. 2005-01-27 13:01:45 +00:00
davi
71a55f697e Fix to alternate hash functions code. Removed htonl stuff from chm algorithm. Added faq. 2005-01-27 13:01:45 +00:00
fc_botelho
9eadc88397 It was corrected some formatting mistakes 2005-01-27 11:14:13 +00:00
fc_botelho
928f088348 It was corrected some formatting mistakes 2005-01-27 11:14:13 +00:00
davi
70796d9383 Added gperf notes. 2005-01-27 00:04:11 +00:00
davi
1e5c4f0f4b Added gperf notes. 2005-01-27 00:04:11 +00:00
fc_botelho
4b1d7a7713 generated in version 0.3 2005-01-25 21:10:46 +00:00
fc_botelho
a646050111 generated in version 0.3 2005-01-25 21:10:46 +00:00
fc_botelho
d831257c4f The czech.h, czech_structs.h and czech.c files were removed 2005-01-25 21:09:44 +00:00
fc_botelho
ff1b59d7c7 The czech.h, czech_structs.h and czech.c files were removed 2005-01-25 21:09:44 +00:00
fc_botelho
2c79aa809a It was changed the prefix czech by chm 2005-01-25 21:06:58 +00:00
fc_botelho
d7ea6d6a3e It was changed the prefix czech by chm 2005-01-25 21:06:58 +00:00
fc_botelho
e984385fa4 script to generate the documentation and the README file 2005-01-25 20:50:41 +00:00
fc_botelho
18516c9a11 script to generate the documentation and the README file 2005-01-25 20:50:41 +00:00
fc_botelho
ed091d5dee README was updated 2005-01-25 20:47:23 +00:00
fc_botelho
92db0cf750 README was updated 2005-01-25 20:47:23 +00:00
fc_botelho
56a9e19d84 Version was updated 2005-01-25 20:44:50 +00:00
fc_botelho
a5153049d6 Version was updated 2005-01-25 20:44:50 +00:00
fc_botelho
9e724e5d80 Vector adapter commented 2005-01-25 20:42:03 +00:00
fc_botelho
7c7324a44f Vector adapter commented 2005-01-25 20:42:03 +00:00
fc_botelho
c8f5d2dc9d It was included the PreProc macro through the CONFIG.t2t file and the LOGO through the LOGO.html file 2005-01-25 20:40:08 +00:00
fc_botelho
a29ea2dbd1 It was included the PreProc macro through the CONFIG.t2t file and the LOGO through the LOGO.html file 2005-01-25 20:40:08 +00:00
fc_botelho
b6124cfb22 It was included the PreProc macro through the CONFIG.t2t file and the LOGO through the LOGO.html file 2005-01-25 20:33:08 +00:00
fc_botelho
c684ca967b It was included the PreProc macro through the CONFIG.t2t file and the LOGO through the LOGO.html file 2005-01-25 20:33:08 +00:00
fc_botelho
8221293106 The file adpater was implemented. 2005-01-24 20:25:58 +00:00
fc_botelho
cf5ff6f140 The file adpater was implemented. 2005-01-24 20:25:58 +00:00
fc_botelho
8efdd6af87 the memory consumption to create a mphf using bmz with a heuristic was fixed. 2005-01-24 19:20:26 +00:00
fc_botelho
8ea43ca39f the memory consumption to create a mphf using bmz with a heuristic was fixed. 2005-01-24 19:20:26 +00:00
fc_botelho
439b3ecbcf The algorithms and hash functions were put in alphabetical order 2005-01-24 19:11:08 +00:00
fc_botelho
be80deedb6 The algorithms and hash functions were put in alphabetical order 2005-01-24 19:11:08 +00:00
fc_botelho
061a3c3a8c It was fixed some English mistakes and It was included the files BMZ.t2t, CZECH.t2t and COMPARISON.t2t 2005-01-24 18:15:50 +00:00
fc_botelho
e5f0aef11c It was fixed some English mistakes and It was included the files BMZ.t2t, CZECH.t2t and COMPARISON.t2t 2005-01-24 18:15:50 +00:00
davi
60cba13617 Added Doxyfile. 2005-01-21 21:19:18 +00:00
davi
783e633b6c Added Doxyfile. 2005-01-21 21:19:18 +00:00
davi
ae434fdc8d Fixed wingetopt.c 2005-01-21 21:14:55 +00:00
davi
c8b19092f1 Fixed wingetopt.c 2005-01-21 21:14:55 +00:00
fc_botelho
8305f399aa included files bitbool.h and bitbool.c 2005-01-21 20:44:11 +00:00
fc_botelho
70ad75cf97 included files bitbool.h and bitbool.c 2005-01-21 20:44:11 +00:00
fc_botelho
d3788d43a8 Only public symbols were prefixed with cmph, and the API was changed to agree with the initial txt2html documentation 2005-01-21 20:42:33 +00:00
fc_botelho
3ed086d14a Only public symbols were prefixed with cmph, and the API was changed to agree with the initial txt2html documentation 2005-01-21 20:42:33 +00:00
fc_botelho
a050a62857 mask to represent a boolean value using only 1 bit 2005-01-21 20:30:23 +00:00
fc_botelho
4dda0a3b62 mask to represent a boolean value using only 1 bit 2005-01-21 20:30:23 +00:00
davi
76f7be31e4 Added initial txt2tags documentation. 2005-01-20 12:28:42 +00:00
davi
955d4ad8fd Added initial txt2tags documentation. 2005-01-20 12:28:42 +00:00
davi
31e9d838e8 Added macros for large file support. 2005-01-19 12:40:22 +00:00
davi
81774a4464 Added macros for large file support. 2005-01-19 12:40:22 +00:00
fc_botelho
264a1996c8 version with cmph prefix 2005-01-18 21:06:08 +00:00
fc_botelho
ea71f288b3 version with cmph prefix 2005-01-18 21:06:08 +00:00
davi
ac4a2f539f Added missing files. 2005-01-18 17:10:28 +00:00
davi
2c837e225e Added missing files. 2005-01-18 17:10:28 +00:00
fc_botelho
c902bb96d0 initial version 2005-01-18 16:25:02 +00:00
fc_botelho
1e238c4b03 initial version 2005-01-18 16:25:02 +00:00
fc_botelho
9d691116b6 initial version 2005-01-18 16:16:17 +00:00
fc_botelho
df3977eb03 initial version 2005-01-18 16:16:17 +00:00
fc_botelho
345e2a7bf0 using bit mask to represent boolean values 2005-01-18 15:58:02 +00:00
fc_botelho
7e25497523 using bit mask to represent boolean values 2005-01-18 15:58:02 +00:00
fc_botelho
d2527ad507 no message 2005-01-18 15:56:14 +00:00
fc_botelho
583b16d44b no message 2005-01-18 15:56:14 +00:00
davi
edcd5b670a Fixed a lot of warnings. Added visual studio project. Make needed changes to work with windows. 2005-01-18 12:18:51 +00:00
davi
d24b968de4 Fixed a lot of warnings. Added visual studio project. Make needed changes to work with windows. 2005-01-18 12:18:51 +00:00
fc_botelho
69c177a494 stable version 2005-01-17 17:58:43 +00:00
fc_botelho
eea53e77f0 stable version 2005-01-17 17:58:43 +00:00
davi
1cfaf691bd Better error handling in czech.c. 2005-01-13 23:56:54 +00:00
davi
2e002e78ec Better error handling in czech.c. 2005-01-13 23:56:54 +00:00
fc_botelho
e88eb681ac included option -k to specify the number of keys to use 2005-01-05 20:45:33 +00:00
fc_botelho
66501e5eca included option -k to specify the number of keys to use 2005-01-05 20:45:33 +00:00
fc_botelho
30d7a654f0 included option -k to specify the number of keys to use 2005-01-05 19:48:23 +00:00
fc_botelho
c0928fd9f0 included option -k to specify the number of keys to use 2005-01-05 19:48:23 +00:00
fc_botelho
b63e04ee09 using less memory 2005-01-03 21:38:59 +00:00
fc_botelho
900e0a62f2 using less memory 2005-01-03 21:38:59 +00:00
fc_botelho
77ce0aaddf using less space to store the used_edges and critical_nodes arrays 2005-01-03 20:47:21 +00:00
fc_botelho
bda9c46618 using less space to store the used_edges and critical_nodes arrays 2005-01-03 20:47:21 +00:00
davi
b3f008eb40 Initial revision 2004-12-23 13:16:30 +00:00
davi
03519cc9c8 Initial revision 2004-12-23 13:16:30 +00:00
742 changed files with 30746 additions and 1210 deletions

3
.gitmodules vendored
View File

@@ -1,3 +0,0 @@
[submodule "deps/cmph"]
path = deps/cmph
url = .

202
LICENSE Normal file
View File

@@ -0,0 +1,202 @@
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright 2022 Motiejus Jakštys
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

596
README.md
View File

@@ -1,442 +1,200 @@
Turbo NSS
---------
Turbonss is a plugin for GNU Name Service Switch (NSS) functionality of GNU C
Library (glibc). Turbonss implements lookup for `passwd` and `group` database
entries (i.e. system users, groups, and group memberships). Its main goal is
performance, with focus on making [`id(1)`][id] run as fast as possible.
Turbonss is a plugin for GNU Name Service Switch ([NSS][nsswitch])
functionality of GNU C Library (glibc). Turbonss implements lookup for `passwd`
and `group` database entries (i.e. system users, groups, and group
memberships). Its main goal is to run [`id(1)`][id] as fast as possible.
Turbonss is optimized for reading. If the data changes in any way, the whole
file will need to be regenerated (the tooling supports only full
regeneration). It was created for, and is best suited to, environments that have a
central user & group database which then needs to be distributed to many
servers/services, and the data does not change very often (e.g. hourly).
file will need to be regenerated. Therefore, it was created for, and is best suited to,
for environments that have a central user & group database which then needs to
be distributed to many servers/services, and the data does not change very
often.
To understand more about name service switch, start with
[`nsswitch.conf(5)`][nsswitch].
This is the fastest known NSS passwd/group implementation for *reads*. On my
2018-era laptop, with a corpus of 10k users, 10k groups and 500 average members
per group, `id` takes 17 seconds with the glibc default implementation, 10-17
milliseconds with a pre-cached `nscd`, and ~8 milliseconds with uncached
`turbonss`.
Design & constraints
--------------------
Because it is built with Zig, this will work on glibc versions as
old as 2.16 (may work with even older ones, I did not test beyond that).
To be fast, the user/group database (later: DB) has to be small
([background][data-oriented-design]). It encodes user & group information in a
way that minimizes the DB size, and reduces jumping across the DB ("chasing
pointers and thrashing CPU cache").
Status (2023-08-22): I introduced breakages while converting to Zig v0.11. The
latest known working version is f723d48fe24f5d536dbc78fafa543d62a4063ae1. I am
working on fixing this.
To understand how this is done efficiently, let's analyze
[`getpwnam_r(3)`][getpwnam_r] at a high level. This API call accepts a username
and returns the following user information:
Project goals
-------------
```
struct passwd {
    char   *pw_name;       /* username */
    char   *pw_passwd;     /* user password */
    uid_t   pw_uid;        /* user ID */
    gid_t   pw_gid;        /* group ID */
    char   *pw_gecos;      /* user information */
    char   *pw_dir;        /* home directory */
    char   *pw_shell;      /* shell program */
};
```
- Make it as fast as possible. Especially optimize for the `id` command.
- Small database size (helps making it fast).
- No runtime, no GC, as little as possible overhead.
- Easy to compile for ancient glibc versions (comes out of the box with Zig).
Turbonss, among others, implements this call, and takes the following steps to
resolve a username to a `struct passwd*`:
- Open the DB (using `mmap`) and interpret its first 64 bytes as a `*struct
Header`. The header stores offsets to the sections of the file. This needs to
be done once, when the NSS library is loaded.
- Hash the username using a perfect hash function. The perfect hash function
returns a number `n ∈ [0,N-1]`, where N is the total number of users.
- Jump to the `n`'th location in the `idx_name2user` section, which contains
the index `i` to the user's information.
- Jump to the location `i` of section `Users`, which stores the full user
information.
- Decode the user information (which is all in a continuous memory block) and
return it to the caller.
In total, that's one hash for the username (~150ns), two pointer jumps within
the group file (to sections `idx_name2user` and `Users`), and, now that the
user record is found, `memcpy` for each field.
The turbonss DB file is `mmap`-ed, making it simple to jump across the file
using pointer arithmetic. This also reduces memory usage, as the mmap'ed
regions are shared. Turbonss reads do not consume any heap space.
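The lookup path above can be sketched in C. This is a toy model and not turbonss code: `toy_user`, `users`, `idx_name2user`, `toy_perfect_hash` and `lookup_by_name` are hypothetical names, the two arrays stand in for the `mmap`-ed sections, and the hash is a stand-in that is collision-free only for the three names shown. The final name comparison reflects a general property of perfect hash functions (an unknown key still hashes to some slot); whether turbonss performs the check at this exact point is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_USERS 3u

/* Hypothetical stand-in for one decoded record of the "Users" section. */
struct toy_user {
    const char *name;
    uint32_t uid;
};

/* "Users" section: full records, in storage order. */
static const struct toy_user users[NUM_USERS] = {
    {"alice", 1000},
    {"bob", 1001},
    {"carol", 1002},
};

/* "idx_name2user" section: slot n holds the index of the record whose
 * name hashes to n. */
static const uint32_t idx_name2user[NUM_USERS] = {2, 0, 1};

/* Stand-in for the real perfect hash. It maps the three names above to
 * distinct slots in [0, NUM_USERS - 1] and says nothing useful about
 * any other input. */
uint32_t toy_perfect_hash(const char *name)
{
    return (uint32_t)(unsigned char)name[0] % NUM_USERS;
}

/* Returns the record for `name`, or NULL if it is not in the toy DB. */
const struct toy_user *lookup_by_name(const char *name)
{
    uint32_t n = toy_perfect_hash(name);   /* one hash */
    uint32_t i = idx_name2user[n];         /* jump 1: index section */
    assert(i < NUM_USERS);
    const struct toy_user *u = &users[i];  /* jump 2: record */

    /* Unknown names land on some valid slot too, so compare the stored
     * name before trusting the record. */
    return strcmp(u->name, name) == 0 ? u : NULL;
}
```

For example, `lookup_by_name("bob")` returns the record with uid 1001, while `lookup_by_name("mallory")` returns `NULL` because the slot it hashes to belongs to another user.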
Tight packing places some constraints on the underlying data:
- Permitted length of username and groupname: 1-32 bytes.
- Permitted length of shell and home: 1-256 bytes.
- Permitted comment ("gecos") length: 0-255 bytes.
- User name, groupname, gecos and shell must be utf8-encoded.
- User and Groups sections are up to 2^35B (~34GB) large. Assuming an "average"
user record takes 50 bytes, this section would fit ~660M users. The
worst-case upper bound is left as an exercise to the reader.
Sorting is stable. In v0:
- Groups are sorted by gid, ascending.
- Users are sorted by their name, ascending by the unicode codepoints
(locale-independent).
Checking out and building
-------------------------
```
$ git clone --recursive https://git.sr.ht/~motiejus/turbonss
```
Alternatively, if you forgot `--recursive`:
```
$ git submodule update --init
```
And run tests:
```
$ zig build test
```
Test the so
-----------
Build:
zig build -Dtarget=x86_64-linux-gnu.2.31 -Dcpu=x86_64_v3 -Drelease-fast=true -Dstrip=true
Generate `db.turbo`:
zig-out/bin/turbonss-unix2db --passwd /etc/passwd --group /etc/group
zig-out/bin/turbonss-analyze db.turbo
<...>
Run a test container:
$ docker run -ti --rm --privileged -v `pwd`:/etc/turbonss -w /etc/turbonss debian:bullseye
# cp zig-out/lib/libnss_turbo.so.2 /lib/x86_64-linux-gnu
# sed -i 's/\(\(passwd\|group\).*files\)$/\1 turbo/' /etc/nsswitch.conf
And knock yourself out:
getent passwd
getent group
id root
This is probably not very interesting; you may want to take a larger corpus of
/etc/passwd and /etc/group for more interesting results.
Dependencies
------------
This project uses [git subtrac][git-subtrac] for managing dependencies. They
work just like regular submodules, except all the refs of the submodules are in
this repository. Repeat after me: all the submodules are in this repository.
So if you have a copy of this repo, dependencies will not disappear.
1. zig v0.11.
2. [cmph][cmph]: bundled with this repository.
remarks on `id(1)`
Building
--------
Clone, compile and run tests first:
$ git clone https://git.jakstys.lt/motiejus/turbonss
$ zig build test
$ zig build -Dtarget=x86_64-linux-gnu.2.16 -Doptimize=ReleaseSafe
One may choose different options, depending on requirements. Here are some
hints:
1. `-Dcpu=<...>` for the CPU [microarchitecture][mcpu].
2. `-Dstrip=true` to strip debug symbols.
For reference, size of the shared library and helper binaries when compiled
with `zig build -Dstrip=true -Doptimize=ReleaseSmall`:
$ ls -h1s zig-out/{bin/*,lib/libnss_turbo.so.2.0.0}
24K zig-out/bin/turbonss-analyze
20K zig-out/bin/turbonss-getent
24K zig-out/bin/turbonss-makecorpus
136K zig-out/bin/turbonss-unix2db
20K zig-out/lib/libnss_turbo.so.2.0.0
Many thanks to Ulrich Drepper for [teaching how to link it properly][dso].
Demo
----
turbonss is best tested, of course, with many users and groups. The guide below
will show how to synthesize 10k users, 10k groups with an average membership
of 1k users per group, and test turbonss with such a corpus.
1. Synthesize some users and groups to `passwd` and `group` in the current directory:
```
$ zig-out/bin/turbonss-makecorpus
wrote users=10000 groups=10000 max-members=1000 to .
$ ls -1hs passwd group
48M group
668K passwd
```
2. Convert the generated `passwd` and `group` to the turbonss database. Note
the `db.turbo` database is more than 4 times smaller than the textual one:
```
$ zig-out/bin/turbonss-unix2db --group group --passwd passwd
total 10968064 bytes. groups=10000 users=10000
$ ls -1hs db.turbo
11M db.turbo
```
3. Optional: inspect the freshly created database:
```
$ zig-out/bin/turbonss-analyze db.turbo
File: db.turbo
Size: 10,968,064 bytes
Version: 0
Endian: little
Pointer size: 8 bytes
getgr buffer size: 18000
getpw buffer size: 57
Users: 10000
Groups: 10000
Shells: 4
Most memberships: u_1000000 (501)
Sections:
Name Begin End Size bytes
header 00000000 00000080 128
bdz_gid 00000080 00000e40 3,520
bdz_groupname 00000e40 00001c00 3,520
bdz_uid 00001c00 000029c0 3,520
bdz_username 000029c0 00003780 3,520
idx_gid2group 00003780 0000d3c0 40,000
idx_groupname2group 0000d3c0 00017000 40,000
idx_uid2user 00017000 00020c40 40,000
idx_name2user 00020c40 0002a880 40,000
shell_index 0002a880 0002a8c0 64
shell_blob 0002a8c0 0002a900 64
groups 0002a900 00065280 240,000
users 00065280 000da580 480,000
groupmembers 000da580 005a69c0 5,030,976
additional_gids 005a69c0 00a75c00 5,042,752
$ zig-out/bin/turbonss-getent --db db.turbo passwd u_1000000
u_1000000:x:1000000:1000000:User 1000000:/home/u_1000000:/bin/bash
$ zig-out/bin/turbonss-getent --db db.turbo group g_1000003
g_1000003:x:1000003:u_1000002,u_1000003,u_1000004
```
4. Now since we will be messing with the system, run all following commands in
a container:
```
$ docker run -ti --rm -v `pwd`:/etc/turbonss -w /etc/turbonss debian:bullseye
# cp zig-out/lib/libnss_turbo.so.2 /lib/x86_64-linux-gnu/
```
5. Instruct `nsswitch.conf` to use both turbonss and the standard resolver:
```
# sed -i '/passwd\|group/ s/files/turbo files/' /etc/nsswitch.conf
# time id u_1000000
<...>
real 0m0.006s
user 0m0.000s
sys 0m0.008s
```
The `id` call resolved `u_1000000` from `db.turbo`.
6. Compare the performance to plain `files` (that is, without turbonss):
```
# sed -i '/passwd\|group/ s/turbo files/files/' /etc/nsswitch.conf
# cat passwd >> /etc/passwd
# cat group >> /etc/group
# time id u_1000000
<...>
real 0m17.164s
user 0m13.288s
sys 0m3.876s
```
Over 2500x difference.
More Documentation
------------------
A known implementation runs id(1) at ~250 rps sequentially on ~20k users and
~10k groups. Our rps target is much higher.
- Architecture is detailed in `docs/architecture.md`
- Development notes are in `docs/development.md`
To better reason about the trade-offs, it is useful to understand how `id(1)`
is implemented, in rough terms:
- lookup user by name ([`getpwnam_r(3)`][getpwent]).
- get all gids for the user ([`getgrouplist(3)`][getgrouplist]). Note: it is
actually using `initgroups_dyn`, accepts a uid, and is very poorly
documented.
- for each additional gid, get the `struct group*`
([`getgrgid_r(3)`][getgrgid_r]).
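The three steps can be reproduced with the public libc calls. The sketch below is not how `id(1)` or turbonss is implemented; it only shows the call sequence an NSS module ends up serving. `resolve_like_id` is a hypothetical helper, the by-name lookup uses `getpwnam_r(3)`, and the fixed buffer sizes are assumptions that keep the example short (a real caller would retry on `ERANGE`).

```c
#define _DEFAULT_SOURCE /* exposes getgrouplist() in glibc's <grp.h> */
#include <assert.h>
#include <grp.h>
#include <pwd.h>
#include <stdlib.h>
#include <sys/types.h>

/* Returns how many of the user's groups resolved to a struct group,
 * or -1 if the user is unknown or a call failed. */
int resolve_like_id(const char *username)
{
    char pwbuf[4096]; /* fixed size keeps the sketch short */
    struct passwd pw, *pwres = NULL;

    assert(username != NULL);

    /* Step 1: look the user up by name. */
    if (getpwnam_r(username, &pw, pwbuf, sizeof pwbuf, &pwres) != 0 ||
        pwres == NULL)
        return -1;

    /* Step 2: collect all gids the user belongs to. */
    int ngroups = 16;
    gid_t *gids = malloc((size_t)ngroups * sizeof *gids);
    if (gids == NULL)
        return -1;
    if (getgrouplist(username, pw.pw_gid, gids, &ngroups) == -1) {
        /* Too small: the required count is now stored in ngroups. */
        gid_t *bigger = realloc(gids, (size_t)ngroups * sizeof *gids);
        if (bigger == NULL) {
            free(gids);
            return -1;
        }
        gids = bigger;
        if (getgrouplist(username, pw.pw_gid, gids, &ngroups) == -1) {
            free(gids);
            return -1;
        }
    }

    /* Step 3: resolve every gid; this is the per-group lookup that
     * dominates when a user is in many groups. */
    size_t grbuflen = 65536;
    char *grbuf = malloc(grbuflen);
    if (grbuf == NULL) {
        free(gids);
        return -1;
    }
    int resolved = 0;
    for (int i = 0; i < ngroups; i++) {
        struct group gr, *grres = NULL;
        if (getgrgid_r(gids[i], &gr, grbuf, grbuflen, &grres) == 0 &&
            grres != NULL)
            resolved++;
    }

    free(grbuf);
    free(gids);
    return resolved;
}
```

Calling `resolve_like_id("u_1000000")` inside the demo container would exercise all three calls against whichever sources `nsswitch.conf` lists.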
Project status and known deficiencies
-------------------------------------
Assuming a member is in ~100 groups on average, to reach 10k id/s translates to
1M group lookups per second. We need to convert gid to a group index, and group
index to a group gid/name quickly.
Turbonss works, but, to the author's knowledge, was not deployed to production.
If you want to use turbonss instead of a battle-tested, albeit slower nscd,
keep the following in mind:
- turbonss has not been fuzz-tested, so it will crash a program on an invalid
database file. Please compile with `ReleaseSafe`. It is plenty fast with this
mode, but an invalid database will lead to defined behavior (i.e. crash with
a stack trace) instead of overwriting memory wherever.
- if the database file was replaced while the program has been running,
turbonss will not re-read the file (it holds on to the previous file
descriptor).
Caveat: `struct group` contains an array of pointers to names of group members
(`char **gr_mem`). However, `id` does not use that information, resulting in
read amplification, sometimes by 10-100x. Therefore, if `argv[0] == "id"`, our
implementation of [`getgrgid_r(3)`][getgrid] returns the `struct group*` without
the members. This speeds up `id` by about 10x on a known NSS implementation.
The license is permissive, so feel free to fork and implement the above (I
would appreciate if you told me, but surely you don't have to). I am also
available for [consulting][consulting] if that's your preference instead.
Relatedly, because [`getgrgid_r(3)`][getgrid] does not need the group members,
the group members are stored in a different DB section, reducing the `Groups`
section and making more of it fit the CPU caches.
Turbonss header
---------------
The turbonss header looks like this:
```
OFFSET  TYPE    NAME                     DESCRIPTION
     0  [4]u8   magic                    f0 9f a4 b7
     4  u8      version                  0
     5  u8      endian                   0 for little, 1 for big
     6  u8      nblocks_shell_blob       max value: 63
     7  u8      num_shells               max value: 63
     8  u32     num_groups               number of group entries
    12  u32     num_users                number of passwd entries
    16  u32     nblocks_bdz_gid          bdz_gid section block count
    20  u32     nblocks_bdz_groupname
    24  u32     nblocks_bdz_uid
    28  u32     nblocks_bdz_username
    32  u64     nblocks_groups
    40  u64     nblocks_users
    48  u64     nblocks_groupmembers
    56  u64     nblocks_additional_gids
    64  u64     getgr_bufsize
    72  u64     getpw_bufsize
    80  [48]u8  padding
```
`magic` is 0xf09fa4b7, and `version` must be `0`. All integers are
native-endian. `nblocks_*` is the count of blocks of a particular section; this
helps calculate the offsets to all sections.
Some numbers, such as `nblocks_shell_blob` and `num_shells`, would fit in fewer
bits. However, interpreting `[2]u6` with `xxd(1)` is harder than
interpreting `[2]u8`. Therefore we are using the space we have to make these
integers byte-wide.
`getgr_bufsize` and `getpw_bufsize` are hints for the caller of `getgr*` and
`getpw*`-family calls. They give the recommended buffer size, so that the
caller does not receive `ENOMEM`.
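As an illustration, the header table above could be decoded along these lines. This is a minimal Python sketch, not code from turbonss: `parse_header`, `FIELDS` and `LAYOUT` are hypothetical names, and it assumes that the `endian` byte at offset 5 selects the byte order of every multi-byte field.

```python
import struct

HEADER_SIZE = 128
MAGIC = bytes([0xF0, 0x9F, 0xA4, 0xB7])

# Field names in file order, taken from the header table.
FIELDS = (
    "magic", "version", "endian", "nblocks_shell_blob", "num_shells",
    "num_groups", "num_users",
    "nblocks_bdz_gid", "nblocks_bdz_groupname",
    "nblocks_bdz_uid", "nblocks_bdz_username",
    "nblocks_groups", "nblocks_users",
    "nblocks_groupmembers", "nblocks_additional_gids",
    "getgr_bufsize", "getpw_bufsize",
)

# 4-byte magic, 4 x u8, 2 x u32 counts, 4 x u32 bdz block counts,
# 4 x u64 section block counts, 2 x u64 buffer sizes, 48 padding bytes.
LAYOUT = "4s4B2I4I4Q2Q48x"


def parse_header(buf):
    """Decode the 128-byte header into a dict keyed by field name."""
    if len(buf) < HEADER_SIZE:
        raise ValueError("header is truncated")
    # Assumption: byte 5 (`endian`) says how to read the other integers.
    byte_order = "<" if buf[5] == 0 else ">"
    values = struct.unpack(byte_order + LAYOUT, buf[:HEADER_SIZE])
    header = dict(zip(FIELDS, values))
    if header["magic"] != MAGIC:
        raise ValueError("bad magic")
    if header["version"] != 0:
        raise ValueError("unsupported version")
    return header
```

Reading the first 128 bytes of a database file and passing them to `parse_header` yields the counts needed to locate the remaining sections.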
Primitive types
---------------
`User` and `Group` entries are sorted by the order they were received in the input
file. All entries are aligned to 8 bytes. All `User` and `Group` entries are
referred by their byte offset in the `Users` and `Groups` section relative to
the beginning of the section.
```
const PackedGroup = packed struct {
gid: u32,
padding: u3,
groupname_len: u5,
}
```
PackedGroup is followed by the group name (of length `groupname_len`), followed
by a varint-compressed offset to the groupmembers section, followed by 8b padding.
PackedUser is a bit more involved:
```
pub const PackedUser = packed struct {
uid: u32,
gid: u32,
shell_len_or_idx: u8,
shell_here: bool,
name_is_a_suffix: bool,
home_len: u6,
name_len: u5,
gecos_len: u11,
}
```
... followed by `userdata: []u8`:
- home.
- name (optional).
- gecos.
- shell (optional).
- `additional_gids_offset`: varint.
The first byte of home is stored right after the `gecos_len` field, and its length
is `home_len`. The same logic applies to all the `userdata` fields: the relative
position of each can be calculated from the lengths of the fields before it.
PackedUser employs two data-oriented compression techniques:
- shells are often shared across different users, see the "Shells" section.
- `name` is frequently a suffix of `home`. For example, `/home/vidmantas` and
`vidmantas`. In this case storing both name and home is wasteful. Therefore
name has two options:
1. `name_is_a_suffix=true`: name is a suffix of the home dir. Then `name`
starts at the `home_len - name_len`'th byte of `home`, and ends at the same
place as `home`.
2. `name_is_a_suffix=false`: name begins right after the last byte of home, and its length
is `name_len`.
The last field `additional_gids_offset: varint` points to the `additional_gids`
section for this user.
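The name-as-suffix rule can be sketched as follows. This is a hedged Python illustration only: `resolve_name` is a hypothetical helper, and it assumes `home_len` and `name_len` are plain byte counts (the on-disk fields may be encoded differently).

```python
def resolve_name(userdata, home_len, name_len, name_is_a_suffix):
    """Return the user name stored in (or after) the home directory bytes."""
    if name_is_a_suffix:
        # The name shares its bytes with the tail of `home`.
        if name_len > home_len:
            raise ValueError("name cannot be longer than home")
        return userdata[home_len - name_len:home_len]
    # Otherwise the name is stored separately, right after `home`.
    end = home_len + name_len
    if end > len(userdata):
        raise ValueError("userdata is truncated")
    return userdata[home_len:end]
```

For `/home/vidmantas` with `name_len = 9` the suffix case yields `vidmantas` without storing those nine bytes twice.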
Shells
------
Normally there is a limited number of separate shells even in huge user
databases. A few examples: `/bin/bash`, `/usr/bin/nologin`, `/bin/zsh` among
others. Therefore, "shells" have an optimization: they can either be referenced
from an external list or, if they are unique to the user, reside among the user's
data.
255 most popular shells (i.e. referred to by at least two User entries) are
stored externally in "Shells" area. The less popular ones are stored with
userdata.
Shells section consists of two sub-sections: the index and the blob. The index
is an array of offsets: the i'th shell starts at `offsets[i]` byte, and ends at
`offsets[i+1]` byte. If there is at least one shell in the shell section, the
index contains a sentinel index as the last element, which signifies the position
of the last byte of the shell blob.
`shell_here=true` in the User struct means the shell is stored with userdata,
and its length is `shell_len_or_idx`. `shell_here=false` means it is stored in
the `Shells` section, and its index is `shell_len_or_idx` (and the actual
string start and end offsets are resolved as described in the paragraph above).
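A rough Python sketch of the index-plus-blob lookup described in this section. The helper names are hypothetical, offsets are taken to be plain byte positions, and `shell_len_or_idx` is assumed to hold the literal length or index.

```python
def build_shell_section(shells):
    """Return (offsets, blob); offsets ends with one sentinel entry."""
    offsets, blob = [], bytearray()
    for shell in shells:
        offsets.append(len(blob))
        blob += shell
    if shells:
        offsets.append(len(blob))  # sentinel: end of the last shell
    return offsets, bytes(blob)


def resolve_shell(shell_here, shell_len_or_idx, inline_bytes, offsets, blob):
    """Pick the shell from userdata or from the shared Shells section."""
    if shell_here:
        # The shell is unique to this user and stored with its userdata.
        return inline_bytes[:shell_len_or_idx]
    start = offsets[shell_len_or_idx]
    end = offsets[shell_len_or_idx + 1]
    return blob[start:end]
```

Thanks to the sentinel, the last shell needs no special case: every shell `i` is simply `blob[offsets[i]:offsets[i + 1]]`.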
Variable-length integers (varints)
----------------------------------
Varint is an efficiently encoded integer (packed for small values). Same as
[protocol buffer varints][varint], except the largest possible value is `u64`.
They compress integers well. Varints are stored for group memberships.
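For reference, a protobuf-style varint capped at `u64` can be sketched in Python like this. It is an illustration of the encoding, not the turbonss implementation; the function names are made up for the example.

```python
MAX_VARINT_BYTES = 10  # ceil(64 / 7): enough for any u64


def encode_varint(value):
    """Encode an unsigned 64-bit integer, least-significant 7 bits first."""
    if not 0 <= value < 1 << 64:
        raise ValueError("value does not fit in u64")
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)  # high bit set: more bytes follow
        value >>= 7
    out.append(value)
    return bytes(out)


def decode_varint(buf, pos=0):
    """Return (value, next_pos); reject truncated or oversized input."""
    result = 0
    for i in range(MAX_VARINT_BYTES):
        if pos + i >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos + i]
        result |= (byte & 0x7F) << (7 * i)
        if byte < 0x80:
            if result >= 1 << 64:
                raise ValueError("varint overflows u64")
            return result, pos + i + 1
    raise ValueError("varint is longer than 10 bytes")
```

Values below 128 take a single byte, which is why small offsets and deltas compress well.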
Group memberships
-----------------
There are two group memberships at play:
1. Given a group (gid/name), resolve the members' names (e.g. `getgrgid`).
2. Given a username, resolve user's group gids (for `initgroups(3)`).
When group's memberships are resolved in (1), the same call also requires other
group information: gid and group name. Therefore it makes sense to store a
pointer to the group members in the group information itself. However, the
memberships are not *always* necessary (see remarks about `id(1)`), therefore
the memberships will be stored separately, outside of the groups section.
Similarly, when user's groups are resolved in (2), they are not always necessary
(i.e. not part of `struct passwd*`), therefore the memberships themselves are
stored out of band.
`groupmembers` and `additional_gids` store group and user memberships
respectively. Membership IDs are packed — not necessitating random access, thus
suitable for compression.
- `groupmembers` consists of a number X followed by a list of offsets to User
records, because `getgr*` returns pointers to membernames, thus a name has to
be immediately resolvable.
- `additional_gids` is a list of gids, because `initgroups_dyn` (and friends)
returns an array of gids.
Each entry of `groupmembers` and `additional_gids` starts with a varint N,
which is the number of upcoming elements. Then N delta-compressed varints,
which are:
- **additional_gids** a list of gids.
- **groupmembers** byte-offsets to the User records in the `users` section.
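Decoding one such entry could look roughly like the Python sketch below. It assumes each stored varint is the plain difference from the previous element (the real encoder may use a slightly different delta convention), and the helper names are hypothetical.

```python
def read_varint(buf, pos):
    """Minimal protobuf-style varint reader: returns (value, next_pos)."""
    result, shift = 0, 0
    while True:
        if pos >= len(buf) or shift > 63:
            raise ValueError("truncated or oversized varint")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
        shift += 7


def decode_membership(buf, pos=0):
    """Decode `N` followed by N delta-compressed varints."""
    count, pos = read_varint(buf, pos)
    values, current = [], 0
    for _ in range(count):
        delta, pos = read_varint(buf, pos)
        current += delta  # assumption: plain running sum of deltas
        values.append(current)
    return values, pos
```

The same routine serves both sections: for `additional_gids` the values are gids, for `groupmembers` they are byte offsets into the `users` section.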
Indices
-------
Now that we've sketched the implementation of `id(1)`, it's easier to
see which operations need to be fast; in order of importance:
1. lookup gid -> group info (this is on hot path in id) without members.
2. lookup username -> user's groups.
3. lookup uid -> user.
4. lookup groupname -> group.
5. lookup username -> user.
These indices can use perfect hashing like [bdz from cmph][cmph]: a perfect
hash hashes a list of bytes to a sequential list of integers. Perfect hashing
algorithms require some space, and take some time to calculate ("hashing
duration"). I've tested BDZ, which hashes `[][]u8` to a sequential list of
integers (not preserving order), and CHM, which preserves order. BDZ accepts an
optional argument `3 <= b <= 10`.
* BDZ algorithm requires (b=3, 900KB, b=7, 338KB, b=10, 306KB) for 1M values.
* Latency to resolve 1M keys: (170ms, 180ms, 230ms, respectively).
* Packed vs non-packed latency differences are not meaningful.
CHM retains order, however, 1M keys weigh 8MB. 10k keys are ~20x larger with
CHM than with BDZ, eliminating the benefit of preserved ordering: we can just
have a separate index.
None of the tested perfect hashing algorithms makes the distinction between
existing (in the initial dictionary) and new keys. In other words, HASH(value)
will be pointing to a number `n ∈ [0,N-1]`, regardless whether the value was in
the initial dictionary. Therefore one must always confirm, after calculating
the hash, that the key matches what's been hashed.
`idx_*` sections are of type `[]u32` and are pointing from `hash(key)` to the
respective `Groups` and `Users` entries (from the beginning of the respective
section). Since User and Group records are 8-byte aligned, the index stores the
offset divided by 8; the actual byte offset to the record is acquired by
left-shifting this value by 3 bits.
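The lookup flow, including the mandatory key confirmation, can be sketched in Python with a toy stand-in for the perfect hash. Everything here is hypothetical scaffolding: `make_toy_index` replaces bdz, and the index is assumed to hold the record's byte offset divided by 8, so it is scaled back by 8 before use.

```python
def make_toy_index(records):
    """records: list of (key, byte_offset); offsets are multiples of 8."""
    n = len(records)
    slots = {key: i for i, (key, _) in enumerate(records)}

    def toy_hash(key):
        # Like a real perfect hash, unknown keys still land on a valid slot.
        return slots.get(key, hash(key) % n)

    idx = [offset >> 3 for _, offset in records]
    return toy_hash, idx


def lookup(key, toy_hash, idx, read_key_at):
    """Return the record's byte offset, or None if the key is absent."""
    offset = idx[toy_hash(key)] << 3  # scale the stored value back to bytes
    if read_key_at(offset) != key:
        return None  # the hash pointed at some other key's record
    return offset
```

Skipping the final comparison would make every unknown gid or name resolve to an arbitrary existing record.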
Database file structure
-----------------------
Each section is padded to 64 bytes.
```
SECTION SIZE DESCRIPTION
header 128 see "Turbonss header" section
bdz_gid ? bdz(gid)
bdz_groupname ? bdz(groupname)
bdz_uid ? bdz(uid)
bdz_username ? bdz(username)
idx_gid2group len(group)*4 bdz->offset Groups
idx_groupname2group len(group)*4 bdz->offset Groups
idx_uid2user len(user)*4 bdz->offset Users
idx_name2user len(user)*4 bdz->offset Users
shell_index len(shells)*2 shell index array
shell_blob <= 65280 shell data blob (max 255*256 bytes)
groups ? packed Group entries (8b padding)
users ? packed User entries (8b padding)
groupmembers ? per-group delta varint memberlist (no padding)
additional_gids ? per-user delta varint gidlist (no padding)
```
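Because every section is padded to a 64-byte boundary, section offsets follow from the sizes alone. A small Python sketch of that arithmetic (hypothetical helper names; the sizes used when calling it are up to the caller):

```python
BLOCK = 64  # every section is padded to a multiple of 64 bytes


def nblocks(nbytes):
    """Number of 64-byte blocks needed to hold `nbytes`."""
    return (nbytes + BLOCK - 1) // BLOCK


def section_offsets(sections):
    """sections: ordered (name, size_in_bytes) pairs.

    Returns ({name: byte offset}, total file size in bytes).
    """
    offsets, pos = {}, 0
    for name, size in sections:
        offsets[name] = pos
        pos += nblocks(size) * BLOCK
    return offsets, pos
```

With the `nblocks_*` values from the header, the padded size of a section is simply `nblocks * 64`, so a reader never needs to scan the file to find a section.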
Section creation order:
1. ✅ `bdz_*`.
1. ✅ `shell_index`, `shell_blob`.
1. ✅ `additional_gids`.
1. ✅ `users` requires `additional_gids` and shell.
1. ✅ `groupmembers` requires `users`.
1. ✅ `groups` requires `groupmembers`.
1. ✅ `idx_*` requires offsets to `groups` and `users`.
1. ✅ Header.
For v2
------
These are desired for the next DB format:
- Compress strings with fsst.
- Trim first 4 bytes from the cmph headers.
Profiling
---------
Prepare `perf.data`:
```
zig build -Drelease-small=true && \
perf record --call-graph=dwarf \
zig-out/bin/turbonss-unix2db --passwd passwd2 --group group2
```
Perf interactive:
```
perf report -i perf.data
```
Flame graph:
```
perf script | inferno-collapse-perf | inferno-flamegraph > profile.svg
```
[git-subtrac]: https://apenwarr.ca/log/20191109
[cmph]: http://cmph.sourceforge.net/
[id]: https://linux.die.net/man/1/id
[nsswitch]: https://linux.die.net/man/5/nsswitch.conf
[data-oriented-design]: https://media.handmade-seattle.com/practical-data-oriented-design/
[getpwnam_r]: https://linux.die.net/man/3/getpwnam_r
[varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints
[getpwent]: https://www.man7.org/linux/man-pages/man3/getpwent_r.3.html
[getgrouplist]: https://www.man7.org/linux/man-pages/man3/getgrouplist.3.html
[getgrid]: https://www.man7.org/linux/man-pages/man3/getgrid_r.3.html
[dso]: https://akkadia.org/drepper/dsohowto.pdf
[mcpu]: https://en.wikipedia.org/wiki/X86-64#Microarchitecture_levels
[consulting]: https://jakstys.lt/contact

135
build.zig

@@ -4,13 +4,15 @@ const zbs = std.build;
pub fn build(b: *zbs.Builder) void {
const target = b.standardTargetOptions(.{});
const mode = b.standardReleaseOptions();
const optimize = b.standardOptimizeOption(.{});
const strip = b.option(bool, "strip", "Omit debug information") orelse false;
const cmph = b.addStaticLibrary("cmph", null);
cmph.setTarget(target);
cmph.setBuildMode(mode);
const cmph = b.addStaticLibrary(.{
.name = "cmph",
.target = target,
.optimize = optimize,
});
cmph.linkLibC();
cmph.addCSourceFiles(&.{
"deps/cmph/src/bdz.c",
@@ -42,13 +44,40 @@ pub fn build(b: *zbs.Builder) void {
//"-DDEBUG",
});
cmph.strip = strip;
cmph.want_lto = true;
cmph.compress_debug_sections = .zlib;
cmph.omit_frame_pointer = true;
cmph.addIncludeDir("deps/cmph/src");
cmph.addIncludeDir("include/deps/cmph");
cmph.addIncludePath(.{ .path = "deps/cmph/src" });
cmph.addConfigHeader(b.addConfigHeader(.{}, .{
.HAVE_DLFCN_H = true,
.HAVE_GETOPT_H = true,
.HAVE_INTTYPES_H = true,
.HAVE_MATH_H = true,
.HAVE_MEMORY_H = true,
.HAVE_STDINT_H = true,
.HAVE_STDLIB_H = true,
.HAVE_STRINGS_H = true,
.HAVE_STRING_H = true,
.HAVE_SYS_STAT_H = true,
.HAVE_SYS_TYPES_H = true,
.HAVE_UNISTD_H = true,
.LT_OBJDIR = ".libs/",
.PACKAGE = "cmph",
.PACKAGE_BUGREPORT = "",
.PACKAGE_NAME = "cmph",
.PACKAGE_STRING = "cmph 2.0.2",
.PACKAGE_TARNAME = "cmph",
.PACKAGE_URL = "",
.PACKAGE_VERSION = "2.0.2",
.STDC_HEADERS = 1,
.VERSION = "2.0.2",
}));
const bdz = b.addStaticLibrary("bdz", null);
bdz.setTarget(target);
bdz.setBuildMode(mode);
const bdz = b.addStaticLibrary(.{
.name = "bdz",
.target = target,
.optimize = optimize,
});
bdz.linkLibC();
bdz.addCSourceFiles(&.{
"deps/bdz_read.c",
@@ -57,75 +86,107 @@ pub fn build(b: *zbs.Builder) void {
}, &.{
"-W",
"-Wno-unused-function",
"-fvisibility=hidden",
"-fpic",
//"-DDEBUG",
});
bdz.omit_frame_pointer = true;
bdz.addIncludeDir("deps/cmph/src");
bdz.addIncludeDir("include/deps/cmph");
bdz.addIncludePath(.{ .path = "deps/cmph/src" });
bdz.addIncludePath(.{ .path = "include/deps/cmph" });
bdz.want_lto = true;
{
const exe = b.addExecutable("turbonss-unix2db", "src/turbonss-unix2db.zig");
const exe = b.addExecutable(.{
.name = "turbonss-unix2db",
.root_source_file = .{ .path = "src/turbonss-unix2db.zig" },
.target = target,
.optimize = optimize,
});
exe.compress_debug_sections = .zlib;
exe.strip = strip;
exe.setTarget(target);
exe.setBuildMode(mode);
exe.want_lto = true;
addCmphDeps(exe, cmph);
exe.install();
b.installArtifact(exe);
}
{
const exe = b.addExecutable("turbonss-analyze", "src/turbonss-analyze.zig");
const exe = b.addExecutable(.{
.name = "turbonss-analyze",
.root_source_file = .{ .path = "src/turbonss-analyze.zig" },
.target = target,
.optimize = optimize,
});
exe.compress_debug_sections = .zlib;
exe.strip = strip;
exe.setTarget(target);
exe.setBuildMode(mode);
exe.install();
exe.want_lto = true;
b.installArtifact(exe);
}
{
const exe = b.addExecutable("turbonss-makecorpus", "src/turbonss-makecorpus.zig");
const exe = b.addExecutable(.{
.name = "turbonss-makecorpus",
.root_source_file = .{ .path = "src/turbonss-makecorpus.zig" },
.target = target,
.optimize = optimize,
});
exe.compress_debug_sections = .zlib;
exe.strip = strip;
exe.setTarget(target);
exe.setBuildMode(mode);
exe.install();
exe.want_lto = true;
b.installArtifact(exe);
}
{
const exe = b.addExecutable("turbonss-getent", "src/turbonss-getent.zig");
const exe = b.addExecutable(.{
.name = "turbonss-getent",
.root_source_file = .{ .path = "src/turbonss-getent.zig" },
.target = target,
.optimize = optimize,
});
exe.compress_debug_sections = .zlib;
exe.strip = strip;
exe.want_lto = true;
exe.linkLibC();
exe.linkLibrary(bdz);
exe.addIncludeDir("deps/cmph/src");
exe.setTarget(target);
exe.setBuildMode(mode);
exe.install();
exe.addIncludePath(.{ .path = "deps/cmph/src" });
b.installArtifact(exe);
}
{
const so = b.addSharedLibrary("nss_turbo", "src/libnss.zig", .{
.versioned = builtin.Version{
const so = b.addSharedLibrary(.{
.name = "nss_turbo",
.root_source_file = .{ .path = "src/libnss.zig" },
.version = std.SemanticVersion{
.major = 2,
.minor = 0,
.patch = 0,
},
.target = target,
.optimize = optimize,
});
so.compress_debug_sections = .zlib;
so.strip = strip;
so.want_lto = true;
so.linkLibC();
so.linkLibrary(bdz);
so.addIncludeDir("deps/cmph/src");
so.setTarget(target);
so.setBuildMode(mode);
so.install();
so.addIncludePath(.{ .path = "deps/cmph/src" });
b.installArtifact(so);
}
{
const src_test = b.addTest("src/test_all.zig");
const src_test = b.addTest(.{
.root_source_file = .{ .path = "src/test_all.zig" },
.target = target,
.optimize = optimize,
});
addCmphDeps(src_test, cmph);
const run = b.addRunArtifact(src_test);
const test_step = b.step("test", "Run the tests");
test_step.dependOn(&src_test.step);
test_step.dependOn(&run.step);
}
}
fn addCmphDeps(exe: *zbs.LibExeObjStep, cmph: *zbs.LibExeObjStep) void {
exe.linkLibC();
exe.linkLibrary(cmph);
exe.addIncludeDir("deps/cmph/src");
exe.addIncludePath(.{ .path = "deps/cmph/src" });
}

1
deps/cmph vendored

Submodule deps/cmph deleted from a250982ade

6
deps/cmph/ALGORITHMS.t2t vendored Normal file

@@ -0,0 +1,6 @@
----------------------------------------
| [Home index.html] | [CHD chd.html] | [BDZ bdz.html] | [BMZ bmz.html] | [CHM chm.html] | [BRZ brz.html] | [FCH fch.html]
----------------------------------------

4
deps/cmph/AUTHORS vendored Normal file

@@ -0,0 +1,4 @@
Davi de Castro Reis davi@users.sourceforge.net
Djamel Belazzougui db8192@users.sourceforge.net
Fabiano Cupertino Botelho fc_botelho@users.sourceforge.net
Nivio Ziviani nivio@dcc.ufmg.br

174
deps/cmph/BDZ.t2t vendored Executable file

@@ -0,0 +1,174 @@
BDZ Algorithm
%!includeconf: CONFIG.t2t
----------------------------------------
==Introduction==
The BDZ algorithm was designed by Fabiano C. Botelho, Djamal Belazzougui, Rasmus Pagh and Nivio Ziviani. It is a simple, efficient and practical algorithm with near-optimal space usage that generates a family [figs/bdz/img8.png] of PHFs and MPHFs. It is also referred to as the BPZ algorithm because of the work presented by Botelho, Pagh and Ziviani in [[2 #papers]]. In Botelho's PhD dissertation [[1 #papers]] it is also referred to as the RAM algorithm because it is more suitable for key sets that can be handled in internal memory.
The BDZ algorithm uses //r//-uniform random hypergraphs given by function values of //r// uniform random hash functions on the input key set //S// for generating PHFs and MPHFs that require //O(n)// bits to be stored. A hypergraph is the generalization of a standard undirected graph where each edge connects [figs/bdz/img12.png] vertices. This idea is not new, see e.g. [[8 #papers]], but we have proceeded differently to achieve a space usage of //O(n)// bits rather than //O(n log n)// bits. Evaluation time for all schemes considered is constant. For //r=3// we obtain a space usage of approximately //2.6n// bits for an MPHF. More compact, and even simpler, representations can be achieved for larger //m//. For example, for //m=1.23n// we can get a space usage of //1.95n// bits.
Our best MPHF space upper bound is within a factor of //2// from the information theoretical lower bound of approximately //1.44// bits. We have shown that the BDZ algorithm is far more practical than previous methods with proven space complexity, both because of its simplicity, and because the constant factor of the space complexity is more than //6// times lower than its closest competitor, for plausible problem sizes. We verify the practicality experimentally, using slightly more space than in the mentioned theoretical bounds.
----------------------------------------
==The Algorithm==
The BDZ algorithm is a three-step algorithm that generates PHFs and MPHFs based on random //r//-partite hypergraphs. This approach provides a much tighter analysis and is much simpler than the one presented in [[3 #papers]], where it was implicit how to construct similar PHFs. The fastest and most compact functions are generated when //r=3//. In this case a PHF can be stored in approximately //1.95// bits per key and an MPHF in approximately //2.62// bits per key.
Figure 1 gives an overview of the algorithm for //r=3//, taking as input a key set [figs/bdz/img22.png] containing three English words, i.e., //S={who,band,the}//. The edge-oriented data structure proposed in [[4 #papers]] is used to represent hypergraphs, where each edge is explicitly represented as an array of //r// vertices and, for each vertex //v//, there is a list of edges that are incident on //v//.
| [figs/bdz/img50.png]
| **Figure 1:** (a) The mapping step generates a random acyclic //3//-partite hypergraph
| with //m=6// vertices and //n=3// edges and a list [figs/bdz/img4.png] of edges obtained when we test
| whether the hypergraph is acyclic. (b) The assigning step builds an array //g// that
| maps values from //[0,5]// to //[0,3]// to uniquely assign an edge to a vertex. (c) The ranking
| step builds the data structure used to compute function //rank// in //O(1)// time.
The //Mapping Step// in Figure 1(a) carries out two important tasks:
+ It assumes that it is possible to find three uniform hash functions //h,,0,,//, //h,,1,,// and //h,,2,,//, with ranges //{0,1}//, //{2,3}// and //{4,5}//, respectively. These functions build a one-to-one mapping of the key set //S// to the edge set //E// of a random acyclic //3//-partite hypergraph //G=(V,E)//, where //|V|=m=6// and //|E|=n=3//. In [[1,2 #papers]] it is shown that it is possible to obtain such a hypergraph with probability tending to //1// as //n// tends to infinity whenever //m=cn// and //c > 1.22//. The value of //c// that minimizes the hypergraph size (and thereby the number of bits needed to represent the resulting functions) is in the range //(1.22,1.23)//. To illustrate the mapping, key "who" is mapped to edge //{h,,0,,("who"), h,,1,,("who"), h,,2,,("who")} = {1,3,5}//, key "band" is mapped to edge //{h,,0,,("band"), h,,1,,("band"), h,,2,,("band")} = {1,2,4}//, and key "the" is mapped to edge //{h,,0,,("the"), h,,1,,("the"), h,,2,,("the")} = {0,2,5}//.
+ It tests whether the resulting random //3//-partite hypergraph contains cycles by iteratively deleting edges connecting vertices of degree 1. The deleted edges are stored in the order of deletion in a list [figs/bdz/img4.png] to be used in the assigning step. The first deleted edge in Figure 1(a) was //{1,2,4}//, the second one was //{1,3,5}// and the third one was //{0,2,5}//. If it ends with an empty graph, then the test succeeds, otherwise it fails.
We now show how to use the Jenkins hash functions [[7 #papers]] to implement the three hash functions //h,,i,,//, which map values from //S// to //V,,i,,//, where [figs/bdz/img52.png]. These functions are used to build a random //3//-partite hypergraph, where [figs/bdz/img53.png] and [figs/bdz/img54.png]. Let [figs/bdz/img55.png] be a Jenkins hash function for [figs/bdz/img56.png], where
//w=32 or 64// for 32-bit and 64-bit architectures, respectively.
Let //H'// be an array of 3 //w//-bit values. The Jenkins hash function
allows us to compute in parallel the three entries in //H'//
and thereby the three hash functions //h,,i,,//, as follows:
| //H' = h'(x)//
| //h,,0,,(x) = H'[0] mod// [figs/bdz/img136.png]
| //h,,1,,(x) = H'[1] mod// [figs/bdz/img136.png] //+// [figs/bdz/img136.png]
| //h,,2,,(x) = H'[2] mod// [figs/bdz/img136.png] //+ 2//[figs/bdz/img136.png]
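The three range-partitioned hash functions can be illustrated with a short Python sketch. Two assumptions are made here: the image in the formulas above is taken to denote the partition size //m/3// (consistent with the ranges //{0,1}//, //{2,3}//, //{4,5}// for //m=6//), and BLAKE2b is swapped in for the Jenkins hash purely for convenience. The function name is hypothetical.

```python
import hashlib


def three_hashes(key, m):
    """Return (h0, h1, h2), each falling into its own third of [0, m-1]."""
    if m <= 0 or m % 3 != 0:
        raise ValueError("m must be a positive multiple of 3")
    part = m // 3
    # Stand-in for the Jenkins hash: one call yields three 32-bit words H'.
    digest = hashlib.blake2b(key, digest_size=12).digest()
    words = [int.from_bytes(digest[4 * i:4 * i + 4], "little") for i in range(3)]
    # h_i(x) = H'[i] mod (m/3) + i * (m/3)
    return tuple(words[i] % part + i * part for i in range(3))
```

Since each function has its own range, the three vertices of an edge are always distinct, which is what makes the hypergraph 3-partite.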
The //Assigning Step// in Figure 1(b) outputs a PHF that maps the key set //S// into the range //[0,m-1]// and is represented by an array //g// storing values from the range //[0,3]//. The array //g// is used to select one out of the //3// vertices of a given edge, which is associated with a key //k//. A vertex for a key //k// is given by either //h,,0,,(k)//, //h,,1,,(k)// or //h,,2,,(k)//. The function //h,,i,,(k)// to be used for //k// is chosen by calculating //i = (g[h,,0,,(k)] + g[h,,1,,(k)] + g[h,,2,,(k)]) mod 3//. For instance, the values 1 and 4 represent the keys "who" and "band" because //i = (g[1] + g[3] + g[5]) mod 3 = 0// and //h,,0,,("who") = 1//, and //i = (g[1] + g[2] + g[4]) mod 3 = 2// and //h,,2,,("band") = 4//, respectively. The assigning step first initializes //g[i]=3// to mark every vertex as unassigned and //Visited[i]= false//, [figs/bdz/img88.png]. Let //Visited// be a boolean vector of size //m// to indicate whether a vertex has been visited. Then, for each edge [figs/bdz/img90.png] from tail to head, it looks for the first vertex //u// belonging to //e// not yet visited. This is a sufficient condition for success [[1,2,8 #papers]]. Let //j// be the index of //u// in //e// for //j// in the range //[0,2]//. Then, it assigns [figs/bdz/img95.png]. Whenever it passes through a vertex //u// from //e//, if //u// has not yet been visited, it sets //Visited[u] = true//.
If we stop the BDZ algorithm in the assigning step we obtain a PHF with range //[0,m-1]//. The PHF has the following form: //phf(x) = h,,i(x),,(x)//, where key //x// is in //S// and //i(x) = (g[h,,0,,(x)] + g[h,,1,,(x)] + g[h,,2,,(x)]) mod 3//. In this case we do not need information for ranking and can set //g[i] = 0// whenever //g[i]// is equal to //3//, where //i// is in the range //[0,m-1]//. Therefore, the range of the values stored in //g// is narrowed from //[0,3]// to //[0,2]//. By using arithmetic coding as block of values (see [[1,2 #papers]] for details), or any compression technique that allows to perform random access in constant time to an array of compressed values [[5,6,12 #papers]], we can store the resulting PHFs in //mlog 3 = cnlog 3// bits, where //c > 1.22//. For //c = 1.23//, the space requirement is //1.95n// bits.
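The assigning step and the resulting PHF can be replayed on the three-word example with a small Python sketch. The assignment formula itself is only available as an image in this text, so the code assumes the form //g[u] = (j - sum of g over the other vertices of e) mod 3//, which is the choice that makes the lookup rule //i(x)// return //j//. Edges and deletion order are copied from the description of Figure 1; the function names are hypothetical.

```python
UNASSIGNED = 3

# Edges and deletion order as given in the description of Figure 1(a).
EDGES = {"who": (1, 3, 5), "band": (1, 2, 4), "the": (0, 2, 5)}
DELETION_ORDER = ["band", "who", "the"]


def assign(edges, deletion_order, m):
    """Build the array g by walking the deleted edges from tail to head."""
    g = [UNASSIGNED] * m
    visited = [False] * m
    for key in reversed(deletion_order):
        edge = edges[key]
        # j: position in the edge of the first vertex not yet visited.
        j = next(i for i, v in enumerate(edge) if not visited[v])
        others = sum(g[v] for i, v in enumerate(edge) if i != j)
        # Assumed formula: make the edge's g-sum congruent to j modulo 3.
        g[edge[j]] = (j - others) % 3
        for v in edge:
            visited[v] = True
    return g


def phf(key, edges, g):
    """phf(x) = h_i(x) with i = (g[h0] + g[h1] + g[h2]) mod 3."""
    edge = edges[key]
    return edge[sum(g[v] for v in edge) % 3]
```

Running `assign(EDGES, DELETION_ORDER, 6)` and then `phf` for each word reproduces the values quoted in the text: "who" maps to vertex 1 and "band" to vertex 4.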
The //Ranking Step// in Figure 1 (c) outputs a data structure that permits to narrow the range of a PHF generated in the assigning step from //[0,m-1]// to //[0,n-1]// and thereby an MPHF is produced. The data structure allows to compute in constant time a function //rank// from //[0,m-1]// to //[0,n-1]// that counts the number of assigned positions before a given position //v// in //g//. For instance, //rank(4) = 2// because the positions //0// and //1// are assigned since //g[0]// and //g[1]// are not equal to //3//.
For the implementation of the ranking step we have borrowed a simple and efficient implementation from [[10 #papers]]. It requires [figs/bdz/img111.png] additional bits of space, where [figs/bdz/img112.png], and is obtained by storing explicitly the //rank// of every //k//th index in a rankTable, where [figs/bdz/img114.png]. The larger //k// is, the more compact the resulting MPHF. Therefore, users can trade off space for evaluation time by setting //k// appropriately in the implementation. We only allow values for //k// that are powers of two (i.e., //k=2^^b,,k,,^^// for some constant //b,,k,,//) in order to replace the expensive division and modulo operations by bit-shift and bitwise "and" operations, respectively. We have used //k=256// in the experiments for generating more succinct MPHFs. We remark that it is still possible to obtain a more compact data structure by using the results presented in [[9,11 #papers]], but at the cost of a much more complex implementation.
We need to use an additional lookup table //T,,r,,// to guarantee the constant evaluation time of //rank(u)//. Let us illustrate how //rank(u)// is computed using both the rankTable and the lookup table //T,,r,,//. We first look up the rank of the largest precomputed index //v// lower than or equal to //u// in the rankTable, and use //T,,r,,// to count the number of assigned vertices from position //v// to //u-1//. The lookup table //T,,r,,// allows us to count in constant time the number of assigned vertices in [figs/bdz/img122.png] bits, where [figs/bdz/img112.png]. Thus the actual evaluation time is [figs/bdz/img123.png]. For simplicity and without loss of generality we let [figs/bdz/img124.png] be a multiple of the number of bits [figs/bdz/img125.png] used to encode each entry of //g//. As the values in //g// come from the range //[0,3]//,
then [figs/bdz/img126.png] bits and we have tried [figs/bdz/img124.png] equal to //8// and //16//. We would expect that [figs/bdz/img124.png] equal to 16 should provide a faster evaluation time because we would need to carry out fewer lookups in //T,,r,,//. But, for both values the lookup table //T,,r,,// fits entirely in the CPU cache and we did not observe any significant difference in the evaluation times. Therefore we settle for the value //8//. We remark that each value of //r// requires a different lookup table //T,,r,,// that can be generated a priori.
The resulting MPHFs have the following form: //h(x) = rank(phf(x))//. In this case we cannot get rid of the ranking information by replacing the values 3 by 0 in the entries of //g//. Each entry in the array //g// is encoded with //2// bits and we need [figs/bdz/img133.png] additional bits to compute function //rank// in constant time. Then, the total space to store the resulting functions is [figs/bdz/img134.png] bits. By using //c = 1.23// and [figs/bdz/img135.png] we have obtained MPHFs that require approximately //2.62// bits per key to be stored.
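A sampled rank table can be sketched in Python as follows. This is a simplified illustration: a linear scan over at most //k-1// entries stands in for the //T,,r,,// lookup table, the names are hypothetical, and the sample array //g = [0, 0, 3, 3, 2, 3]// used in the usage note is one assignment consistent with the worked example (positions 0 and 1 assigned).

```python
UNASSIGNED = 3


def build_rank_table(g, k):
    """Store the number of assigned entries before every k-th index of g."""
    if k <= 0 or k & (k - 1):
        raise ValueError("k must be a power of two")
    table, count = [], 0
    for pos, value in enumerate(g):
        if pos % k == 0:
            table.append(count)
        if value != UNASSIGNED:
            count += 1
    return table


def rank(g, table, k, u):
    """Count the assigned positions of g that lie strictly before u."""
    shift = k.bit_length() - 1  # k = 2**shift, so division becomes a shift
    count = table[u >> shift]
    # Scan the remainder; the real implementation uses the table T_r here.
    for pos in range((u >> shift) << shift, u):
        if g[pos] != UNASSIGNED:
            count += 1
    return count
```

With that sample array and //k=2//, `rank(g, build_rank_table(g, 2), 2, 4)` gives 2, matching //rank(4) = 2// in the text, so the key mapped to vertex 4 gets the MPHF value 2.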
----------------------------------------
==Memory Consumption==
Now we detail the memory consumption to generate and to store minimal perfect hash functions
using the BDZ algorithm. The structures responsible for memory consumption are the
following:
- 3-graph:
+ **first**: is a vector that stores //cn// integer numbers, each one representing
the first edge (index in the vector edges) in the list of
incident edges of each vertex. The integer numbers are 4 bytes long. Therefore,
the vector first is stored in //4cn// bytes.
+ **edges**: is a vector to represent the edges of the graph. As each edge
is compounded by three vertices, each entry stores three integer numbers
of 4 bytes that represent the vertices. As there are //n// edges, the
vector edges is stored in //12n// bytes.
+ **next**: given a vertex [figs/img139.png], we can discover the edges that
contain [figs/img139.png] following its list of incident edges,
which starts on first[[figs/img139.png]] and the next
edges are given by next[...first[[figs/img139.png]]...]. Therefore, the vectors first and next represent
the linked lists of edges of each vertex. As there are three vertices for each edge,
when an edge is inserted in the 3-graph, it must be inserted in the three linked lists
of the vertices in its composition. Therefore, there are //3n// entries of integer
numbers in the vector next, so it is stored in //4*3n = 12n// bytes.
+ **Vertices degree (vert_degree vector)**: is a vector of //cn// bytes
that represents the degree of each vertex. We can use just one byte for each
vertex because the 3-graph is sparse, once it has more vertices than edges.
Therefore, the vertices degree is represented in //cn// bytes.
- Acyclicity test:
+ **List of deleted edges obtained when we test whether the 3-graph is a forest (queue vector)**:
is a vector of //n// integer numbers containing indexes of vector edges. Therefore, it
requires //4n// bytes in internal memory.
+ **Marked edges in the acyclicity test (marked_edges vector)**:
is a bit vector of //n// bits to indicate the edges that have already been deleted during
the acyclicity test. Therefore, it requires //n/8// bytes in internal memory.
- MPHF description
+ **function //g//**: is represented by a vector of //2cn// bits. Therefore, it is
stored in //0.25cn// bytes
+ **ranktable**: is a lookup table used to store some precomputed ranking information.
It has //(cn)/(2^b)// entries of 4-byte integer numbers. Therefore it is stored in
//(4cn)/(2^b)// bytes. The larger b is, the more compact the resulting MPHFs and
the slower the functions. So b imposes a trade-off between space and time.
+ **Total**: 0.25cn + (4cn)/(2^b) bytes
Thus, the total memory consumption of BDZ algorithm for generating a minimal
perfect hash function (MPHF) is: //(28.125 + 5c)n + 0.25cn + (4cn)/(2^b) + O(1)// bytes.
As the value of constant //c// may be larger than or equal to 1.23 we have:
|| //c// | //b// | Memory consumption to generate a MPHF (in bytes) |
| 1.23 | //7// | //34.62n + O(1)// |
| 1.23 | //8// | //34.60n + O(1)// |
| **Table 1:** Memory consumption to generate a MPHF using the BDZ algorithm.
Now we present the memory consumption to store the resulting function.
So we have:
|| //c// | //b// | Memory consumption to store a MPHF (in bits) |
| 1.23 | //7// | //2.77n + O(1)// |
| 1.23 | //8// | //2.61n + O(1)// |
| **Table 2:** Memory consumption to store a MPHF generated by the BDZ algorithm.
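The figures in Tables 1 and 2 follow directly from the per-structure sizes listed above. A short Python sketch of that arithmetic (function names are hypothetical, values are per key, and the //O(1)// term is ignored):

```python
def generation_bytes_per_key(c, b):
    """Bytes per key needed while generating an MPHF."""
    graph = 4 * c + 12 + 12 + c        # first, edges, next, vert_degree
    acyclicity = 4 + 1 / 8             # queue, marked_edges
    mphf = 0.25 * c + 4 * c / 2 ** b   # function g and ranktable
    return graph + acyclicity + mphf


def storage_bits_per_key(c, b):
    """Bits per key needed to store the resulting MPHF."""
    return 8 * (0.25 * c + 4 * c / 2 ** b)
```

For //c = 1.23// the first function evaluates to about 34.62 (//b=7//) and 34.60 (//b=8//) bytes per key, and the second to about 2.77 and 2.61 bits per key, in line with the two tables.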
----------------------------------------
==Experimental Results==
Experimental results to compare the BDZ algorithm with the other ones in the CMPH
library are presented in Botelho, Pagh and Ziviani [[1,2 #papers]].
----------------------------------------
==Papers==[papers]
+ [F. C. Botelho http://www.dcc.ufmg.br/~fbotelho]. [Near-Optimal Space Perfect Hashing Algorithms papers/thesis.pdf]. //PhD. Thesis//, //Department of Computer Science//, //Federal University of Minas Gerais//, September 2008. Supervised by [N. Ziviani http://www.dcc.ufmg.br/~nivio].
+ [F. C. Botelho http://www.dcc.ufmg.br/~fbotelho], [R. Pagh http://www.itu.dk/~pagh/], [N. Ziviani http://www.dcc.ufmg.br/~nivio]. [Simple and space-efficient minimal perfect hash functions papers/wads07.pdf]. //In Proceedings of the 10th International Workshop on Algorithms and Data Structures (WADs'07),// Springer-Verlag Lecture Notes in Computer Science, vol. 4619, Halifax, Canada, August 2007, 139-150.
+ B. Chazelle, J. Kilian, R. Rubinfeld, and A. Tal. The bloomier filter: An efficient data structure for static support lookup tables. //In Proceedings of the 15th annual ACM-SIAM symposium on Discrete algorithms (SODA'04)//, pages 30-39, Philadelphia, PA, USA, 2004. Society for Industrial and Applied Mathematics.
+ J. Ebert. A versatile data structure for edges oriented graph algorithms. //Communication of The ACM//, (30):513-519, 1987.
+ K. Fredriksson and F. Nikitin. Simple compression code supporting random access and fast string matching. //In Proceedings of the 6th International Workshop on Efficient and Experimental Algorithms (WEA'07)//, pages 203-216, 2007.
+ R. Gonzalez and G. Navarro. Statistical encoding of succinct data structures. //In Proceedings of the 19th Annual Symposium on Combinatorial Pattern Matching (CPM'06)//, pages 294-305, 2006.
+ B. Jenkins. Algorithm alley: Hash functions. //Dr. Dobb's Journal of Software Tools//, 22(9), september 1997. Extended version available at [http://burtleburtle.net/bob/hash/doobs.html http://burtleburtle.net/bob/hash/doobs.html].
+ B.S. Majewski, N.C. Wormald, G. Havas, and Z.J. Czech. A family of perfect hashing methods. //The Computer Journal//, 39(6):547-554, 1996.
+ D. Okanohara and K. Sadakane. Practical entropy-compressed rank/select dictionary. //In Proceedings of the Workshop on Algorithm Engineering and Experiments (ALENEX'07)//, 2007.
+ [R. Pagh http://www.itu.dk/~pagh/]. Low redundancy in static dictionaries with constant query time. //SIAM Journal on Computing//, 31(2):353-363, 2001.
+ R. Raman, V. Raman, and S. S. Rao. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. //In Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms (SODA'02)//, pages 233-242, Philadelphia PA, USA, 2002. Society for Industrial and Applied Mathematics.
+ K. Sadakane and R. Grossi. Squeezing succinct data structures into entropy bounds. //In Proceedings of the 17th annual ACM-SIAM symposium on Discrete algorithms (SODA'06)//, pages 1230-1239, 2006.
%!include: ALGORITHMS.t2t
%!include: FOOTER.t2t
%!include(html): ''GOOGLEANALYTICS.t2t''

405
deps/cmph/BMZ.t2t vendored Normal file

@@ -0,0 +1,405 @@
BMZ Algorithm
%!includeconf: CONFIG.t2t
----------------------------------------
==History==
At the end of 2003, professor [Nivio Ziviani http://www.dcc.ufmg.br/~nivio] was
finishing the second edition of his [book http://www.dcc.ufmg.br/algoritmos/].
While writing the [book http://www.dcc.ufmg.br/algoritmos/],
professor [Nivio Ziviani http://www.dcc.ufmg.br/~nivio] studied the problem of generating
[minimal perfect hash functions concepts.html]
(if you are not familiar with this problem, see [[1 #papers]][[2 #papers]]).
Professor [Nivio Ziviani http://www.dcc.ufmg.br/~nivio] coded a modified version of
the [CHM algorithm chm.html], which was proposed by
Czech, Havas and Majewski, and put it in his [book http://www.dcc.ufmg.br/algoritmos/].
The [CHM algorithm chm.html] is based on acyclic random graphs to generate
[order preserving minimal perfect hash functions concepts.html] in linear time.
Professor [Nivio Ziviani http://www.dcc.ufmg.br/~nivio]
asked himself: why must the random graph
be acyclic? In the modified version available in his [book http://www.dcc.ufmg.br/algoritmos/] he got rid of this restriction.
The modification presented a problem: it was impossible to generate minimal perfect hash functions
for sets with more than 1000 keys.
At the same time, [Fabiano C. Botelho http://www.dcc.ufmg.br/~fbotelho],
a master's degree student at the [Department of Computer Science http://www.dcc.ufmg.br] of the
[Federal University of Minas Gerais http://www.ufmg.br],
started to be advised by [Nivio Ziviani http://www.dcc.ufmg.br/~nivio], who presented the problem
to [Fabiano http://www.dcc.ufmg.br/~fbotelho].
During the master's program, [Fabiano http://www.dcc.ufmg.br/~fbotelho] and
[Nivio Ziviani http://www.dcc.ufmg.br/~nivio] faced lots of problems.
In April 2004, [Fabiano http://www.dcc.ufmg.br/~fbotelho] was talking with a
friend of his (David Menoti) about the problems
and many ideas appeared.
The ideas were implemented and a very fast algorithm to generate
minimal perfect hash functions was designed.
We refer to the algorithm as **BMZ**, because it was conceived by Fabiano C. **B**otelho,
David **M**enoti and Nivio **Z**iviani. The algorithm is described in [[1 #papers]].
To analyse the BMZ algorithm we needed some results from random graph theory, so
we invited professor [Yoshiharu Kohayakawa http://www.ime.usp.br/~yoshi] to help us.
The final description and analysis of the BMZ algorithm is presented in [[2 #papers]].
----------------------------------------
==The Algorithm==
The BMZ algorithm shares several features with the [CHM algorithm chm.html].
In particular, BMZ algorithm is also
based on the generation of random graphs [figs/img27.png], where [figs/img28.png] is in
one-to-one correspondence with the key set [figs/img20.png] for which we wish to
generate a [minimal perfect hash function concepts.html].
The two main differences between BMZ algorithm and CHM algorithm
are as follows: (//i//) BMZ algorithm generates random
graphs [figs/img27.png] with [figs/img29.png] and [figs/img30.png], where [figs/img31.png],
and hence [figs/img32.png] necessarily contains cycles,
while CHM algorithm generates //acyclic// random
graphs [figs/img27.png] with [figs/img29.png] and [figs/img30.png],
with a greater number of vertices: [figs/img33.png];
(//ii//) CHM algorithm generates [order preserving minimal perfect hash functions concepts.html]
while BMZ algorithm does not preserve order. Thus, BMZ algorithm improves
the space requirement at the expense of generating functions that are not
order preserving.
Suppose [figs/img14.png] is a universe of //keys//.
Let [figs/img17.png] be a set of [figs/img8.png] keys from [figs/img14.png].
Let us show how the BMZ algorithm constructs a minimal perfect hash function [figs/img7.png].
We make use of two auxiliary random functions [figs/img41.png] and [figs/img55.png],
where [figs/img56.png] for some suitably chosen integer [figs/img57.png],
where [figs/img58.png]. We build a random graph [figs/img59.png] on [figs/img60.png],
whose edge set is [figs/img61.png]. There is an edge in [figs/img32.png] for each
key in the set of keys [figs/img20.png].
In what follows, we shall be interested in the //2-core// of
the random graph [figs/img32.png], that is, the maximal subgraph
of [figs/img32.png] with minimum degree at
least 2 (see [[2 #papers]] for details).
Because of its importance in our context, we call the 2-core the
//critical// subgraph of [figs/img32.png] and denote it by [figs/img63.png].
The vertices and edges in [figs/img63.png] are said to be //critical//.
We let [figs/img64.png] and [figs/img65.png].
Moreover, we let [figs/img66.png] be the set of //non-critical//
vertices in [figs/img32.png].
We also let [figs/img67.png] be the set of all critical
vertices that have at least one non-critical vertex as a neighbour.
Let [figs/img68.png] be the set of //non-critical// edges in [figs/img32.png].
Finally, we let [figs/img69.png] be the //non-critical// subgraph
of [figs/img32.png].
The non-critical subgraph [figs/img70.png] corresponds to the //acyclic part//
of [figs/img32.png].
We have [figs/img71.png].
We then construct a suitable labelling [figs/img72.png] of the vertices
of [figs/img32.png]: we choose [figs/img73.png] for each [figs/img74.png] in such
a way that [figs/img75.png] ([figs/img18.png]) is a
minimal perfect hash function for [figs/img20.png].
This labelling [figs/img37.png] can be found in linear time
if the number of edges in [figs/img63.png] is at most [figs/img76.png] (see [[2 #papers]]
for details).
Figure 1 presents pseudocode for the BMZ algorithm.
The procedure BMZ ([figs/img20.png], [figs/img37.png]) receives as input the set of
keys [figs/img20.png] and produces the labelling [figs/img37.png].
The method uses a mapping, ordering and searching approach.
We now describe each step.
| procedure BMZ ([figs/img20.png], [figs/img37.png])
| &nbsp;&nbsp;&nbsp;&nbsp;Mapping ([figs/img20.png], [figs/img32.png]);
| &nbsp;&nbsp;&nbsp;&nbsp;Ordering ([figs/img32.png], [figs/img63.png], [figs/img70.png]);
| &nbsp;&nbsp;&nbsp;&nbsp;Searching ([figs/img32.png], [figs/img63.png], [figs/img70.png], [figs/img37.png]);
| **Figure 1**: Main steps of BMZ algorithm for constructing a minimal perfect hash function
----------------------------------------
===Mapping Step===
The procedure Mapping ([figs/img20.png], [figs/img32.png]) receives as input the set
of keys [figs/img20.png] and generates the random graph [figs/img59.png], by generating
two auxiliary functions [figs/img41.png], [figs/img78.png].
The functions [figs/img41.png] and [figs/img42.png] are constructed as follows.
We impose some upper bound [figs/img79.png] on the lengths of the keys in [figs/img20.png].
To define [figs/img80.png] ([figs/img81.png], [figs/img62.png]), we generate
an [figs/img82.png] table of random integers [figs/img83.png].
For a key [figs/img18.png] of length [figs/img84.png] and [figs/img85.png], we let
| [figs/img86.png]
The random graph [figs/img59.png] has vertex set [figs/img56.png] and
edge set [figs/img61.png]. We need [figs/img32.png] to be
simple, i.e., [figs/img32.png] should have neither loops nor multiple edges.
A loop occurs when [figs/img87.png] for some [figs/img18.png].
We solve this in an ad hoc manner: we simply let [figs/img88.png] in this case.
If we still find a loop after this, we generate another pair [figs/img89.png].
When a multiple edge occurs we abort and generate a new pair [figs/img89.png].
Although the function above causes [collisions concepts.html] with probability //1/t//,
in [cmph library index.html] we use faster hash
functions ([DJB2 hash http://www.cs.yorku.ca/~oz/hash.html], [FNV hash http://www.isthe.com/chongo/tech/comp/fnv/],
[Jenkins hash http://burtleburtle.net/bob/hash/doobs.html] and [SDBM hash http://www.cs.yorku.ca/~oz/hash.html])
in which we do not need to impose any upper bound [figs/img79.png] on the lengths of the keys in [figs/img20.png].
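The mapping step can be sketched in Python as follows. This is only an illustration under stated assumptions: ``seeded_hash`` is a hypothetical stand-in built on BLAKE2b (it is not one of the hash functions used by the cmph library), the loop repair ``(v + 1) mod t`` is an assumed placeholder for the rule given in the original text, and the default ``c = 1.15`` is taken from the value quoted in this section.

```python
import hashlib
import math
import random


def seeded_hash(key, seed, t):
    """Hypothetical stand-in for h1/h2: map a string key to a vertex in [0, t-1]."""
    digest = hashlib.blake2b(
        key.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little") % t


def mapping(keys, c=1.15, rng=None, max_tries=1000):
    """Return (t, edges) for a simple graph with one edge per key."""
    rng = rng or random.Random(0)
    t = max(2, math.ceil(c * len(keys)))
    for _ in range(max_tries):
        # Draw a fresh pair of auxiliary functions by choosing two seeds.
        seed1, seed2 = rng.getrandbits(64), rng.getrandbits(64)
        edges, seen = [], set()
        for key in keys:
            u, v = seeded_hash(key, seed1, t), seeded_hash(key, seed2, t)
            if u == v:
                v = (v + 1) % t  # assumed ad hoc loop repair, not the original rule
            edge = (min(u, v), max(u, v))
            if u == v or edge in seen:
                break  # loop or multiple edge: generate a new pair of functions
            seen.add(edge)
            edges.append((u, v))
        else:
            return t, edges  # every key produced a distinct, loop-free edge
    raise RuntimeError("no simple graph found within max_tries")
```

Calling ``mapping(["alpha", "beta", "gamma"])`` returns the number of vertices and one ``(u, v)`` edge per key, in key order.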
As mentioned before, for us to find the labelling [figs/img72.png] of the
vertices of [figs/img59.png] in linear time,
we require that [figs/img108.png].
The crucial step now is to determine the value
of [figs/img1.png] (in [figs/img57.png]) to obtain a random
graph [figs/img71.png] with [figs/img109.png].
Botelho, Menoti and Ziviani determined empirically in [[1 #papers]] that
the value of [figs/img1.png] is //1.15//. This value is remarkably
close to the theoretical value determined in [[2 #papers]],
which is around [figs/img112.png].
----------------------------------------
===Ordering Step===
The procedure Ordering ([figs/img32.png], [figs/img63.png], [figs/img70.png]) receives
as input the graph [figs/img32.png] and partitions [figs/img32.png] into the two
subgraphs [figs/img63.png] and [figs/img70.png], so that [figs/img71.png].
Figure 2 presents a sample graph with 9 vertices
and 8 edges, where the degree of a vertex is shown beside each vertex.
Initially, all vertices with degree 1 are added to a queue [figs/img136.png].
For the example shown in Figure 2(a), [figs/img137.png] after the initialization step.
| [figs/img138.png]
| **Figure 2:** Ordering step for a graph with 9 vertices and 8 edges.
Next, we remove one vertex [figs/img139.png] from the queue, decrement its degree and
the degree of the vertices with degree greater than 0 in the adjacency
list of [figs/img139.png], as depicted in Figure 2(b) for [figs/img140.png].
At this point, the adjacencies of [figs/img139.png] with degree 1 are
inserted into the queue, such as vertex 1.
This process is repeated until the queue becomes empty.
All vertices with degree 0 are non-critical vertices and the others are
critical vertices, as depicted in Figure 2(c).
Finally, to determine the vertices in [figs/img141.png] we collect all
vertices [figs/img142.png] with at least one vertex [figs/img143.png] that
is in Adj[figs/img144.png] and in [figs/img145.png], as the vertex 8 in Figure 2(c).
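The queue-based procedure above can be sketched as follows. This is a minimal illustration, not the cmph implementation: the function name ``ordering`` and the return layout are assumptions, and the sample graph in the usage note is made up rather than taken from Figure 2.

```python
from collections import deque


def ordering(num_vertices, edges):
    """Split the vertices of a simple graph into critical (2-core) and
    non-critical ones by repeatedly removing vertices of degree 1."""
    adj = [[] for _ in range(num_vertices)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    degree = [len(neighbours) for neighbours in adj]

    queue = deque(v for v in range(num_vertices) if degree[v] == 1)
    while queue:
        v = queue.popleft()
        if degree[v] == 0:
            continue  # its only remaining neighbour was removed first
        degree[v] -= 1
        for w in adj[v]:
            if degree[w] > 0:
                degree[w] -= 1
                if degree[w] == 1:
                    queue.append(w)

    critical = {v for v in range(num_vertices) if degree[v] > 0}
    non_critical = set(range(num_vertices)) - critical
    # Critical vertices with at least one non-critical neighbour.
    boundary = {v for v in critical if any(w in non_critical for w in adj[v])}
    return critical, non_critical, boundary
```

For a triangle ``0-1-2`` with a tail ``2-3-4``, the triangle is the critical subgraph, vertices 3 and 4 are non-critical, and vertex 2 is the only critical vertex with a non-critical neighbour.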
----------------------------------------
===Searching Step===
In the searching step, the key part is
the //perfect assignment problem//: find [figs/img153.png] such that
the function [figs/img154.png] defined by
| [figs/img155.png]
is a bijection from [figs/img156.png] to [figs/img157.png] (recall [figs/img158.png]).
We are interested in a labelling [figs/img72.png] of
the vertices of the graph [figs/img59.png] with
the property that if [figs/img11.png] and [figs/img22.png] are keys
in [figs/img20.png], then [figs/img159.png]; that is, if we associate
to each edge the sum of the labels on its endpoints, then these values
should be all distinct.
Moreover, we require that all the sums [figs/img160.png] ([figs/img18.png])
fall between [figs/img115.png] and [figs/img161.png], and thus we have a bijection
between [figs/img20.png] and [figs/img157.png].
The procedure Searching ([figs/img32.png], [figs/img63.png], [figs/img70.png], [figs/img37.png])
receives as input [figs/img32.png], [figs/img63.png], [figs/img70.png] and finds a
suitable [figs/img162.png] bit value for each vertex [figs/img74.png], stored in the
array [figs/img37.png].
This step is first performed for the vertices in the
critical subgraph [figs/img63.png] of [figs/img32.png] (the 2-core of [figs/img32.png])
and then it is performed for the vertices in [figs/img70.png] (the non-critical subgraph
of [figs/img32.png] that contains the "acyclic part" of [figs/img32.png]).
The reason the assignment of the [figs/img37.png] values is first
performed on the vertices in [figs/img63.png] is to resolve reassignments
as early as possible (such reassignments are consequences of the cycles
in [figs/img63.png] and are described below).
----------------------------------------
====Assignment of Values to Critical Vertices====
The labels [figs/img73.png] ([figs/img142.png])
are assigned in increasing order following a greedy
strategy where the critical vertices [figs/img139.png] are considered one at a time,
according to a breadth-first search on [figs/img63.png].
If a candidate value [figs/img11.png] for [figs/img73.png] is forbidden
because setting [figs/img163.png] would create two edges with the same sum,
we try [figs/img164.png] for [figs/img73.png]. This fact is referred to
as a //reassignment//.
Let [figs/img165.png] be the set of addresses assigned to edges in [figs/img166.png].
Initially [figs/img167.png].
Let [figs/img11.png] be a candidate value for [figs/img73.png].
Initially [figs/img168.png].
Considering the subgraph [figs/img63.png] in Figure 2(c),
a step by step example of the assignment of values to vertices in [figs/img63.png] is
presented in Figure 3.
Initially, a vertex [figs/img139.png] is chosen, the assignment [figs/img163.png] is made
and [figs/img11.png] is set to [figs/img164.png].
For example, suppose that vertex [figs/img169.png] in Figure 3(a) is
chosen, the assignment [figs/img170.png] is made and [figs/img11.png] is set to [figs/img96.png].
| [figs/img171.png]
| **Figure 3:** Example of the assignment of values to critical vertices.
In Figure 3(b), following the adjacency list of vertex [figs/img169.png],
the unassigned vertex [figs/img115.png] is reached.
At this point, we collect in the temporary variable [figs/img172.png] all adjacencies
of vertex [figs/img115.png] that have been assigned an [figs/img11.png] value,
and [figs/img173.png].
Next, for all [figs/img174.png], we check if [figs/img175.png].
Since [figs/img176.png], then [figs/img177.png] is set
to [figs/img96.png], [figs/img11.png] is incremented
by 1 (now [figs/img178.png]) and [figs/img179.png].
Next, vertex [figs/img180.png] is reached, [figs/img181.png] is set
to [figs/img62.png], [figs/img11.png] is set to [figs/img180.png] and [figs/img182.png].
Next, vertex [figs/img183.png] is reached and [figs/img184.png].
Since [figs/img185.png] and [figs/img186.png], then [figs/img187.png] is
set to [figs/img180.png], [figs/img11.png] is set to [figs/img183.png] and [figs/img188.png].
Finally, vertex [figs/img189.png] is reached and [figs/img190.png].
Since [figs/img191.png], [figs/img11.png] is incremented by 1 and set to 5, as depicted in
Figure 3(c).
Since [figs/img192.png], [figs/img11.png] is again incremented by 1 and set to 6,
as depicted in Figure 3(d).
These two reassignments are indicated by the arrows in Figure 3.
Since [figs/img193.png] and [figs/img194.png], then [figs/img195.png] is set
to [figs/img196.png] and [figs/img197.png]. This finishes the algorithm.
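The greedy assignment can be sketched as below. This is an illustrative sketch under assumptions: the names ``assign_critical``, ``g`` and ``used`` are hypothetical, the graph is passed as an adjacency dictionary, and the sketch only guarantees that the edge sums are pairwise distinct (it does not check that they stay below the number of keys).

```python
from collections import deque


def assign_critical(adj, critical):
    """Label critical vertices so that all critical edges get distinct sums.

    adj maps each vertex to its list of neighbours; critical is the set of
    vertices in the critical subgraph.
    """
    g = {}        # labels assigned so far
    used = set()  # addresses (edge sums) already taken by critical edges
    x = 0         # candidate label, only ever increased
    for root in sorted(critical):
        if root in g:
            continue
        g[root] = x
        x += 1
        queue = deque([root])
        while queue:  # breadth-first search on the critical subgraph
            v = queue.popleft()
            for u in adj[v]:
                if u not in critical or u in g:
                    continue
                assigned = [w for w in adj[u] if w in critical and w in g]
                while any(g[w] + x in used for w in assigned):
                    x += 1  # reassignment: candidate x is forbidden
                g[u] = x
                used.update(g[w] + x for w in assigned)
                x += 1
                queue.append(u)
    return g, used
```

On the complete graph with four vertices the last vertex needs one reassignment: the candidate 3 would repeat the sum 3, so the label 4 is used instead.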
----------------------------------------
====Assignment of Values to Non-Critical Vertices====
As [figs/img70.png] is acyclic, we can impose the order in which addresses are
associated with edges in [figs/img70.png], making this step simple to solve
by a standard depth-first search algorithm.
Therefore, in the assignment of values to vertices in [figs/img70.png] we
benefit from the unused addresses in the gaps left by the assignment of values
to vertices in [figs/img63.png].
For that, we start the depth-first search from the vertices in [figs/img141.png] because
the [figs/img37.png] values for these critical vertices were already assigned
and cannot be changed.
Considering the subgraph [figs/img70.png] in Figure 2(c),
a step by step example of the assignment of values to vertices in [figs/img70.png] is
presented in Figure 4.
Figure 4(a) presents the initial state of the algorithm.
The critical vertex 8 is the only one that is adjacent to
non-critical vertices.
In the example presented in Figure 3, the addresses [figs/img198.png] were not used.
So, taking the first unused address [figs/img115.png] and the vertex [figs/img96.png],
which is reached from the vertex [figs/img169.png], [figs/img199.png] is set
to [figs/img200.png], as shown in Figure 4(b).
The only vertex that is reached from vertex [figs/img96.png] is vertex [figs/img62.png], so
taking the unused address [figs/img183.png] we set [figs/img201.png] to [figs/img202.png],
as shown in Figure 4(c).
This process is repeated until the UnAssignedAddresses list becomes empty.
| [figs/img203.png]
| **Figure 4:** Example of the assignment of values to non-critical vertices.
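A sketch of this step, assuming the labels of the critical vertices and the set of addresses they consumed are already known. The names ``assign_noncritical``, ``g`` and ``used`` are hypothetical, and the sample graph in the usage note is made up rather than taken from Figure 4. Labels may become negative in this sketch; it only checks the sums.

```python
def assign_noncritical(adj, g, used, n_edges):
    """Extend the labelling g over the acyclic part of the graph.

    adj maps each vertex to its neighbours, g holds the labels of the
    critical vertices, used is the set of addresses taken by critical
    edges, and n_edges is the total number of edges (keys).
    """
    g = dict(g)
    # Unused addresses, handed out in increasing order.
    free = (a for a in range(n_edges) if a not in used)

    def dfs(root):
        stack = [root]
        while stack:
            v = stack.pop()
            for u in adj[v]:
                if u not in g:
                    # Choose g[u] so that the edge (v, u) gets a free address.
                    g[u] = next(free) - g[v]
                    stack.append(u)

    for v in list(g):  # start from the already labelled critical vertices
        dfs(v)
    for v in adj:      # trees that do not touch the critical subgraph
        if v not in g:
            g[v] = 0
            dfs(v)
    return g
```

For a triangle ``0-1-2`` labelled ``{0: 0, 1: 1, 2: 2}`` (addresses 1, 2 and 3 used) with a tail ``2-3-4``, the two tail edges receive the free addresses 0 and 4, so the five edge sums cover 0 to 4 exactly once.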
----------------------------------------
==The Heuristic==[heuristic]
We now present a heuristic for the BMZ algorithm that
reduces the value of [figs/img1.png] to any given value between //1.15// and //0.93//.
This reduces the space requirement to store the resulting function
to any given value between [figs/img12.png] words and [figs/img13.png] words.
The heuristic reuses, when possible, the set
of [figs/img11.png] values that caused reassignments, just before
trying [figs/img164.png].
Decreasing the value of [figs/img1.png] leads to an increase in the number of
iterations to generate [figs/img32.png].
For example, for [figs/img244.png] and [figs/img6.png], the analytical expected numbers
of iterations are [figs/img245.png] and [figs/img246.png], respectively (see [[2 #papers]]
for details),
while for [figs/img128.png] the same value is around //2.13//.
----------------------------------------
==Memory Consumption==
Now we detail the memory consumption to generate and to store minimal perfect hash functions
using the BMZ algorithm. The structures responsible for memory consumption are the
following:
- Graph:
+ **first**: is a vector that stores //cn// integer numbers, each one representing
the first edge (index in the vector edges) in the list of
edges of each vertex.
The integer numbers are 4 bytes long. Therefore,
the vector first is stored in //4cn// bytes.
+ **edges**: is a vector to represent the edges of the graph. As each edge
is composed of a pair of vertices, each entry stores two integer numbers
of 4 bytes that represent the vertices. As there are //n// edges, the
vector edges is stored in //8n// bytes.
+ **next**: given a vertex [figs/img139.png], we can discover the edges that
contain [figs/img139.png] following its list of edges,
which starts on first[[figs/img139.png]] and the next
edges are given by next[...first[[figs/img139.png]]...]. Therefore, the vectors first and next represent
the linked lists of edges of each vertex. As there are two vertices for each edge,
when an edge is inserted in the graph, it must be inserted in the two linked lists
of the vertices in its composition. Therefore, there are //2n// entries of integer
numbers in the vector next, so it is stored in //4*2n = 8n// bytes.
+ **critical vertices (critical_nodes vector)**: is a vector of //cn// bits,
where each bit indicates if a vertex is critical (1) or non-critical (0).
Therefore, the critical and non-critical vertices are represented in //cn/8// bytes.
+ **critical edges (used_edges vector)**: is a vector of //n// bits, where each
bit indicates if an edge is critical (1) or non-critical (0). Therefore, the
critical and non-critical edges are represented in //n/8// bytes.
- Other auxiliary structures
+ **queue**: is a queue of integer numbers used in the breadth-first search of the
assignment of values to critical vertices. There is an entry in the queue for
each two critical vertices. Let [figs/img110.png] be the expected number of critical
vertices. Therefore, the queue is stored in //4*0.5*[figs/img110.png]=2[figs/img110.png]// bytes.
+ **visited**: is a vector of //cn// bits, where each bit indicates if the g value of
a given vertex was already defined. Therefore, the vector visited is stored
in //cn/8// bytes.
+ **function //g//**: is represented by a vector of //cn// integer numbers.
As each integer number is 4 bytes long, the function //g// is stored in
//4cn// bytes.
Thus, the total memory consumption of BMZ algorithm for generating a minimal
perfect hash function (MPHF) is: //(8.25c + 16.125)n +2[figs/img110.png] + O(1)// bytes.
As the value of the constant //c// may be 0.93 or 1.15, we have:
|| //c// | [figs/img110.png] | Memory consumption to generate a MPHF (in bytes) |
| 0.93 | //0.497n// | //24.80n + O(1)// |
| 1.15 | //0.401n// | //26.42n + O(1)// |
| **Table 1:** Memory consumption to generate a MPHF using the BMZ algorithm.
The values of [figs/img110.png] were calculated using Eq.(1) presented in [[2 #papers]].
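As a quick arithmetic check, the per-structure sizes listed above can be summed per key. The function name ``bmz_generation_bytes_per_key`` is hypothetical, and the expected number of critical vertices per key is passed in as a plain number taken from Table 1. The results agree with the table to within rounding.

```python
def bmz_generation_bytes_per_key(c, critical_vertices_per_key):
    """Sum the sizes of the structures listed above, in bytes per key."""
    first = 4 * c                  # cn integers of 4 bytes
    edges = 8                      # n entries of two 4-byte integers
    next_ = 8                      # 2n integers of 4 bytes
    critical_nodes = c / 8         # cn bits
    used_edges = 1 / 8             # n bits
    queue = 2 * critical_vertices_per_key
    visited = c / 8                # cn bits
    g = 4 * c                      # cn integers of 4 bytes
    return (first + edges + next_ + critical_nodes
            + used_edges + queue + visited + g)


for c, vcrit in ((0.93, 0.497), (1.15, 0.401)):
    print(c, round(bmz_generation_bytes_per_key(c, vcrit), 2))
```

The sum simplifies to //(8.25c + 16.125)// plus twice the number of critical vertices per key, which is the closed form given above.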
Now we present the memory consumption to store the resulting function.
We only need to store the //g// function. Thus, we need //4cn// bytes.
Again we have:
|| //c// | Memory consumption to store a MPHF (in bytes) |
| 0.93 | //3.72n// |
| 1.15 | //4.60n// |
| **Table 2:** Memory consumption to store a MPHF generated by the BMZ algorithm.
----------------------------------------
==Experimental Results==
[CHM x BMZ comparison.html]
----------------------------------------
==Papers==[papers]
+ [F. C. Botelho http://www.dcc.ufmg.br/~fbotelho], D. Menoti, [N. Ziviani http://www.dcc.ufmg.br/~nivio]. [A New algorithm for constructing minimal perfect hash functions papers/bmz_tr004_04.ps], Technical Report TR004/04, Department of Computer Science, Federal University of Minas Gerais, 2004.
+ [F. C. Botelho http://www.dcc.ufmg.br/~fbotelho], Y. Kohayakawa, and [N. Ziviani http://www.dcc.ufmg.br/~nivio]. [A Practical Minimal Perfect Hashing Method papers/wea05.pdf]. //4th International Workshop on Efficient and Experimental Algorithms (WEA05),// Springer-Verlag Lecture Notes in Computer Science, vol. 3505, Santorini Island, Greece, May 2005, 488-500.
%!include: ALGORITHMS.t2t
%!include: FOOTER.t2t
%!include(html): ''GOOGLEANALYTICS.t2t''

440
deps/cmph/BRZ.t2t vendored Normal file

@@ -0,0 +1,440 @@
External Memory Based Algorithm
%!includeconf: CONFIG.t2t
----------------------------------------
==Introduction==
Until now, because of the limitations of current algorithms,
the use of MPHFs is restricted to scenarios where the set of keys being hashed is
relatively small.
However, in many cases it is crucial to deal in an efficient way with very large
sets of keys.
Due to the exponential growth of the Web, the work with huge collections is becoming
a daily task.
For instance, the simple assignment of numeric identifiers to web pages of a collection
can be a challenging task.
While traditional databases simply cannot handle more traffic once the working
set of URLs does not fit in main memory anymore[[4 #papers]], the algorithm we propose here to
construct MPHFs can easily scale to billions of entries.
As there are many applications for MPHFs, it is
important to design and implement space and time efficient algorithms for
constructing such functions.
The attractiveness of using MPHFs depends on the following issues:
+ The amount of CPU time required by the algorithms for constructing MPHFs.
+ The space requirements of the algorithms for constructing MPHFs.
+ The amount of CPU time required by a MPHF for each retrieval.
+ The space requirements of the description of the resulting MPHFs to be used at retrieval time.
We present here a novel external memory based algorithm for constructing MPHFs that
are very efficient in the four requirements mentioned previously.
First, the time the algorithm takes to construct a MPHF is linear in the size of the key set,
which is optimal.
For instance, for a collection of 1 billion URLs
collected from the web, each one 64 characters long on average, the time to construct a
MPHF using a 2.4 gigahertz PC with 500 megabytes of available main memory
is approximately 3 hours.
Second, the algorithm needs a small a priori defined vector of [figs/brz/img23.png] one
byte entries in main memory to construct a MPHF.
For the collection of 1 billion URLs and using [figs/brz/img4.png], the algorithm needs only
5.45 megabytes of internal memory.
Third, the evaluation of the MPHF for each retrieval requires three memory accesses and
the computation of three universal hash functions.
This is not optimal as any MPHF requires at least one memory access and the computation
of two universal hash functions.
Fourth, the description of a MPHF takes a constant number of bits for each key, which is optimal.
For the collection of 1 billion URLs, it needs 8.1 bits for each key,
while the theoretical lower bound is [figs/brz/img24.png] bits per key.
----------------------------------------
==The Algorithm==
The main idea supporting our algorithm is the classical divide and conquer technique.
The algorithm is a two-step external memory based algorithm
that generates a MPHF //h// for a set //S// of //n// keys.
Figure 1 illustrates the two steps of the
algorithm: the partitioning step and the searching step.
| [figs/brz/brz.png]
| **Figure 1:** Main steps of our algorithm.
The partitioning step takes a key set //S// and uses a universal hash
function [figs/brz/img42.png] proposed by Jenkins[[5 #papers]]
to transform each key [figs/brz/img43.png] into an integer [figs/brz/img44.png].
Reducing [figs/brz/img44.png] modulo [figs/brz/img23.png], we partition //S//
into [figs/brz/img23.png] buckets containing at most 256 keys in each bucket (with high
probability).
The searching step generates a MPHF[figs/brz/img46.png] for each bucket //i//, [figs/brz/img47.png].
The resulting MPHF //h(k)//, [figs/brz/img43.png], is given by
| [figs/brz/img49.png]
where [figs/brz/img50.png].
The //i//th entry //offset[i]// of the displacement vector
//offset//, [figs/brz/img47.png], contains the total number
of keys in the buckets from 0 to //i-1//, that is, it gives the interval of the
keys in the hash table addressed by the MPHF[figs/brz/img46.png]. In the following we explain
each step in detail.
----------------------------------------
=== Partitioning step ===
The set //S// of //n// keys is partitioned into [figs/brz/img23.png] buckets,
where //b// is a suitable parameter chosen to guarantee
that each bucket has at most 256 keys with high probability
(see [[2 #papers]] for details).
The partitioning step works as follows:
| [figs/brz/img54.png]
| **Figure 2:** Partitioning step.
Statement 1.1 of the **for** loop presented in Figure 2
reads sequentially all the keys of block [figs/brz/img55.png] from disk into an internal area
of size [figs/brz/img8.png].
Statement 1.2 performs an indirect bucket sort of the keys in block [figs/brz/img55.png] and
at the same time updates the entries in the vector //size//.
Let us briefly describe how [figs/brz/img55.png] is partitioned among
the [figs/brz/img23.png] buckets.
We use a local array of [figs/brz/img23.png] counters to store a
count of how many keys from [figs/brz/img55.png] belong to each bucket.
The pointers to the keys in each bucket //i//, [figs/brz/img47.png],
are stored in contiguous positions in an array.
For this we first reserve the required number of entries
in this array of pointers using the information from the array of counters.
Next, we place the pointers to the keys in each bucket into the respective
reserved areas in the array (i.e., we place the pointers to the keys in bucket 0,
followed by the pointers to the keys in bucket 1, and so on).
To find the bucket address of a given key
we use the universal hash function [figs/brz/img44.png][[5 #papers]].
Key //k// goes into bucket //i//, where
| [figs/brz/img57.png] (1)
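The indirect bucket sort of one block can be sketched as a counting sort over key positions. This is only an illustration: ``bucket_address`` uses CRC-32 as a hypothetical stand-in for the universal hash function of Eq. (1), and the function and variable names are assumptions.

```python
import zlib


def bucket_address(key, num_buckets):
    """Hypothetical stand-in for Eq. (1): hash the key, reduce modulo the bucket count."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def bucket_sort_block(block, num_buckets):
    """Return (counts, pointers) for one block of keys.

    counts[i] is the number of keys of the block in bucket i; pointers lists
    positions into block, grouped by bucket (bucket 0 first, then bucket 1, ...).
    """
    counts = [0] * num_buckets
    for key in block:
        counts[bucket_address(key, num_buckets)] += 1

    # Reserve a contiguous area of the pointer array for each bucket.
    next_slot = [0] * num_buckets
    for i in range(1, num_buckets):
        next_slot[i] = next_slot[i - 1] + counts[i - 1]

    pointers = [None] * len(block)
    for index, key in enumerate(block):
        i = bucket_address(key, num_buckets)
        pointers[next_slot[i]] = index
        next_slot[i] += 1
    return counts, pointers
```

The keys themselves are never moved; only their positions are arranged by bucket, and the per-block counts can then be added to the vector //size//.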
Figure 3(a) shows a //logical// view of the [figs/brz/img23.png] buckets
generated in the partitioning step.
In reality, the keys belonging to each bucket are distributed among many files,
as depicted in Figure 3(b).
In the example of Figure 3(b), the keys in bucket 0
appear in files 1 and //N//, the keys in bucket 1 appear in files 1, 2
and //N//, and so on.
| [figs/brz/brz-partitioning.png]
| **Figure 3:** Situation of the buckets at the end of the partitioning step: (a) Logical view (b) Physical view.
This scattering of the keys in the buckets could generate a performance
problem because of the potential number of seeks
needed to read the keys in each bucket from the //N// files on disk
during the searching step.
But, as we show in [[2 #papers]], the number of seeks
can be kept small using buffering techniques.
Considering that only the vector //size//, which has [figs/brz/img23.png] one-byte
entries (remember that each bucket has at most 256 keys),
must be maintained in main memory during the searching step,
almost all main memory is available to be used as disk I/O buffer.
The last step is to compute the //offset// vector and dump it to the disk.
We use the vector //size// to compute the
//offset// displacement vector.
The //offset[i]// entry contains the number of keys
in the buckets //0, 1, ..., i-1//.
As //size[i]// stores the number of keys
in bucket //i//, where [figs/brz/img47.png], we have
| [figs/brz/img63.png]
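The role of the //offset// vector can be sketched as follows. This is an illustration under assumptions: the bucket address uses CRC-32 as a hypothetical stand-in for the universal hash function, and each per-bucket MPHF is replaced by a plain dictionary that numbers the keys of the bucket from 0, since only the combination with //offset// is being shown.

```python
import zlib


def compute_offsets(size):
    """offset[i] is the number of keys in buckets 0, 1, ..., i-1."""
    offset, total = [], 0
    for count in size:
        offset.append(total)
        total += count
    return offset


def build(keys, num_buckets):
    """Return a function h that maps each key to a distinct value in [0, n-1]."""
    def bucket_of(key):
        return zlib.crc32(key.encode("utf-8")) % num_buckets

    buckets = [[] for _ in range(num_buckets)]
    for key in keys:
        buckets[bucket_of(key)].append(key)

    size = [len(bucket) for bucket in buckets]
    offset = compute_offsets(size)
    # Stand-in for the per-bucket MPHFs: position of the key inside its bucket.
    local = [{key: j for j, key in enumerate(bucket)} for bucket in buckets]

    def h(key):
        i = bucket_of(key)
        return local[i][key] + offset[i]

    return h
```

Because each local function is a bijection onto ``0 .. size[i]-1`` and the offsets are prefix sums of //size//, the combined function covers ``0 .. n-1`` without gaps or collisions.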
----------------------------------------
=== Searching step ===
The searching step is responsible for generating a MPHF for each
bucket. Figure 4 presents the searching step algorithm.
| [figs/brz/img64.png]
| **Figure 4:** Searching step.
Statement 1 of Figure 4 inserts one key from each file
in a minimum heap //H// of size //N//.
The order relation in //H// is given by the bucket address //i// given by
Eq. (1).
Statement 2 has two important steps.
In statement 2.1, a bucket is read from disk,
as described below.
In statement 2.2, a MPHF is generated for each bucket //i//, as described
in the following.
The description of MPHF[figs/brz/img46.png] is a vector [figs/brz/img66.png] of 8-bit integers.
Finally, statement 2.3 writes the description [figs/brz/img66.png] of MPHF[figs/brz/img46.png] to disk.
----------------------------------------
==== Reading a bucket from disk ====
In this section we present the refinement of statement 2.1 of
Figure 4.
The algorithm to read bucket //i// from disk is presented
in Figure 5.
| [figs/brz/img67.png]
| **Figure 5:** Reading a bucket.
Bucket //i// is distributed among many files and the heap //H// is used to drive a
multiway merge operation.
In Figure 5, statement 1.1 extracts and removes triple
//(i, j, k)// from //H//, where //i// is a minimum value in //H//.
Statement 1.2 inserts key //k// in bucket //i//.
Notice that the //k// in the triple //(i, j, k)// is in fact a pointer to
the first byte of the key that is kept in contiguous positions of an array of characters
(this array containing the keys is initialized during the heap construction
in statement 1 of Figure 4).
Statement 1.3 performs a seek operation in File //j// on disk for the first
read operation and reads sequentially all keys //k// that have the same //i//
and inserts them all in bucket //i//.
Finally, statement 1.4 inserts in //H// the triple //(i, j, x)//,
where //x// is the first key read from File //j// (in statement 1.3)
that does not have the same bucket address as the previous keys.
The number of seek operations on disk performed in statement 1.3 is discussed
in [[2, Section 5.1 #papers]],
where we present a buffering technique that brings down
the time spent with seeks.
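
The multiway merge of Figure 5 can be sketched as follows. This is a minimal illustration, not the cmph implementation: the name ``read_buckets`` and the in-memory lists standing in for the //N// files on disk are assumptions made for the example, and each file is assumed to be sorted by bucket address.

```python
import heapq


def read_buckets(files):
    """Yield (bucket_address, keys) pairs by merging sorted runs.

    `files` is a hypothetical in-memory stand-in for the N files on disk:
    each entry is a list of (bucket_address, key) pairs sorted by address.
    """
    # Statement 1 of Figure 4: one triple (i, j, k) per file goes into H.
    cursors = [0] * len(files)
    heap = []
    for j, run in enumerate(files):
        if run:
            i, k = run[0]
            heap.append((i, j, k))
            cursors[j] = 1
    heapq.heapify(heap)

    while heap:
        current = heap[0][0]  # smallest bucket address currently in H
        bucket = []
        while heap and heap[0][0] == current:
            i, j, k = heapq.heappop(heap)  # statement 1.1
            bucket.append(k)               # statement 1.2
            run = files[j]
            # Statement 1.3: read sequentially all keys with the same i.
            while cursors[j] < len(run) and run[cursors[j]][0] == i:
                bucket.append(run[cursors[j]][1])
                cursors[j] += 1
            # Statement 1.4: push the first key with a different address.
            if cursors[j] < len(run):
                next_i, next_k = run[cursors[j]]
                heapq.heappush(heap, (next_i, j, next_k))
                cursors[j] += 1
        yield current, bucket
```

Each file contributes at most one triple to the heap at any time, so the heap never holds more than //N// entries.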
----------------------------------------
==== Generating a MPHF for each bucket ====
To the best of our knowledge the [BMZ algorithm bmz.html] we have designed in
our previous works [[1,3 #papers]] is the fastest published algorithm for
constructing MPHFs.
That is why we are using that algorithm as a building block for the
algorithm presented here. In reality, we are using
an optimized version of BMZ (BMZ8) for small sets of keys (at most 256 keys).
[Click here to see details about BMZ algorithm bmz.html].
----------------------------------------
==Analysis of the Algorithm==
Analytical results and the complete analysis of the external memory based algorithm
can be found in [[2 #papers]].
----------------------------------------
==Experimental Results==
In this section we present the experimental results.
We start by presenting the experimental setup.
We then present experimental results for
the internal memory based algorithm ([the BMZ algorithm bmz.html])
and for our external memory based algorithm.
Finally, we discuss how the amount of internal memory available
affects the runtime of the external memory based algorithm.
----------------------------------------
===The data and the experimental setup===
All experiments were carried out on
a computer running the Linux operating system, version 2.6,
with a 2.4 gigahertz processor and
1 gigabyte of main memory.
In the experiments related to the new
algorithm we limited the main memory to 500 megabytes.
Our data consists of a collection of 1 billion
URLs collected from the Web, each URL 64 characters long on average.
The collection is stored on disk in 60.5 gigabytes.
----------------------------------------
===Performance of the BMZ Algorithm===
[The BMZ algorithm bmz.html] is used for constructing a MPHF for each bucket.
It is a randomized algorithm because it needs to generate a simple random graph
in its first step.
Once the graph is obtained the other two steps are deterministic.
Thus, we can consider the runtime of the algorithm to have
the form [figs/brz/img159.png] for an input of //n// keys,
where [figs/brz/img160.png] is some machine dependent
constant that further depends on the length of the keys and //Z// is a random
variable with geometric distribution with mean [figs/brz/img162.png]. All results
in our experiments were obtained taking //c=1//; the value of //c//, with //c// in //[0.93,1.15]//,
in fact has little influence on the runtime, as shown in [[3 #papers]].
The values chosen for //n// were 1, 2, 4, 8, 16 and 32 million.
Although we have a dataset with 1 billion URLs, on a PC with
1 gigabyte of main memory, the algorithm is able
to handle an input with at most 32 million keys.
This is mainly because of the graph we need to keep in main memory.
The algorithm requires //25n + O(1)// bytes for constructing
a MPHF ([click here to get details about the data structures used by the BMZ algorithm bmz.html]).
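
The 32 million key limit can be checked with a short calculation. The sketch below drops the //O(1)// term and assumes that 1 gigabyte means 2^30 bytes; the function name is illustrative only.

```python
BYTES_PER_KEY = 25  # from the 25n + O(1) estimate quoted above

MILLION = 10**6
GIGABYTE = 2**30  # assumption: 1 gigabyte taken as 2^30 bytes


def bmz_construction_bytes(n):
    """Approximate memory in bytes to build a MPHF for n keys (O(1) dropped)."""
    return BYTES_PER_KEY * n


for n in (32 * MILLION, 64 * MILLION):
    need = bmz_construction_bytes(n)
    verdict = "fits" if need <= GIGABYTE else "does not fit"
    print(n // MILLION, "million keys need", need, "bytes:", verdict)
```

Under these assumptions 32 million keys need 800 million bytes, which fits in main memory, while the next value in the doubling sequence, 64 million keys, would not.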
In order to estimate the number of trials for each value of //n// we use
a statistical method for determining a suitable sample size (see, e.g., [[6, Chapter 13 #papers]]).
As we obtained different values for each //n//,
we used the maximal value obtained, namely, 300 trials in order to have
a confidence level of 95 %.
Table 1 presents the runtime average for each //n//,
the respective standard deviations, and
the respective confidence intervals given by
the average time [figs/brz/img167.png] the distance from average time
considering a confidence level of 95 %.
Observing the runtime averages one sees that
the algorithm runs in expected linear time,
as shown in [[3 #papers]].
%!include(html): ''TABLEBRZ1.t2t''
| **Table 1:** Internal memory based algorithm: average time in seconds for constructing a MPHF, the standard deviation (SD), and the confidence intervals considering a confidence level of 95 %.
Figure 6 presents the runtime for each trial. In addition,
the solid line corresponds to a linear regression model
obtained from the experimental measurements.
As we can see, the runtime for a given //n// has a considerable
fluctuation. However, the fluctuation also grows linearly with //n//.
| [figs/brz/bmz_temporegressao.png]
| **Figure 6:** Time versus number of keys in //S// for the internal memory based algorithm. The solid line corresponds to a linear regression model.
The observed fluctuation in the runtimes is as expected; recall that this
runtime has the form [figs/brz/img159.png] with //Z// a geometric random variable with
mean //1/p=e//. Thus, the runtime has mean [figs/brz/img181.png] and standard
deviation [figs/brz/img182.png].
Therefore, the standard deviation also grows
linearly with //n//, as experimentally verified
in Table 1 and in Figure 6.
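
The linear growth of the standard deviation can be made concrete with a small sketch. It assumes that the runtime is proportional to //nZ//, with a hypothetical machine constant ``alpha``, and that //Z// counts the trials up to and including the first success.

```python
import math


def geometric_mean_and_sd(p):
    """Mean and standard deviation of a geometric variable that counts
    trials until the first success, with success probability p."""
    mean = 1.0 / p
    sd = math.sqrt(1.0 - p) / p
    return mean, sd


P = 1.0 / math.e  # chosen so that the mean 1/p equals e, as in the text
MEAN_Z, SD_Z = geometric_mean_and_sd(P)


def runtime_stats(alpha, n):
    """Mean and SD of alpha * n * Z; alpha is a hypothetical constant."""
    return alpha * n * MEAN_Z, alpha * n * SD_Z
```

Doubling //n// doubles both values returned by ``runtime_stats``, which is the behavior observed in Table 1 and Figure 6.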
----------------------------------------
===Performance of the External Memory Based Algorithm===
The runtime of the external memory based algorithm is also a random variable,
but now it follows a (highly concentrated) normal distribution, as we discuss at the end of this
section. Again, we are interested in verifying the linearity claim made in
[[2, Section 5.1 #papers]]. Therefore, we ran the algorithm for
several numbers //n// of keys in //S//.
The values chosen for //n// were 1, 2, 4, 8, 16, 32, 64, 128, 512 and 1000
million.
We limited the main memory to 500 megabytes for the experiments.
The size [figs/brz/img8.png] of the a priori reserved internal memory area
was set to 250 megabytes, the parameter //b// was set to //175// and
the building block algorithm parameter //c// was again set to //1//.
We show later on how [figs/brz/img8.png] affects the runtime of the algorithm. The other two parameters
have insignificant influence on the runtime.
We again use a statistical method for determining a suitable sample size
to estimate the number of trials to be run for each value of //n//. We got that
just one trial for each //n// would be enough with a confidence level of 95 %.
However, we made 10 trials. This number of trials seems rather small, but, as
shown below, the behavior of our algorithm is very stable and its runtime is
almost deterministic (i.e., the standard deviation is very small).
Table 2 presents the runtime average for each //n//,
the respective standard deviations, and
the respective confidence intervals given by
the average time [figs/brz/img167.png] the distance from average time
considering a confidence level of 95 %.
Observing the runtime averages we noticed that
the algorithm runs in expected linear time,
as shown in [[2, Section 5.1 #papers]]. Better still,
it is only approximately 60 % slower than the BMZ algorithm.
To get that value we used the linear regression model obtained for the runtime of
the internal memory based algorithm to estimate how much time it would require
for constructing a MPHF for a set of 1 billion keys.
We got 2.3 hours for the internal memory based algorithm and we measured
3.67 hours on average for the external memory based algorithm.
Increasing the size of the internal memory area
from 250 to 600 megabytes,
we brought the time down to 3.09 hours. In this setup, the external memory based algorithm is
just 34 % slower.
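
The two slowdown figures follow directly from the reported times. A minimal check, using only the hours quoted above (the function name is illustrative):

```python
def percent_slower(external_hours, internal_hours):
    """Relative slowdown of the external algorithm, in percent."""
    return 100.0 * (external_hours / internal_hours - 1.0)


slowdown_250mb = percent_slower(3.67, 2.3)  # 250 megabytes of internal memory
slowdown_600mb = percent_slower(3.09, 2.3)  # 600 megabytes of internal memory
print(round(slowdown_250mb), round(slowdown_600mb))
```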
%!include(html): ''TABLEBRZ2.t2t''
| **Table 2:** The external memory based algorithm: average time in seconds for constructing a MPHF, the standard deviation (SD), and the confidence intervals considering a confidence level of 95 %.
Figure 7 presents the runtime for each trial. In addition,
the solid line corresponds to a linear regression model
obtained from the experimental measurements.
As expected, the runtime for a given //n// has almost no
variation.
| [figs/brz/brz_temporegressao.png]
| **Figure 7:** Time versus number of keys in //S// for our algorithm. The solid line corresponds to a linear regression model.
An intriguing observation is that the runtime of the algorithm is almost
deterministic, in spite of the fact that it uses as building block an
algorithm with a considerable fluctuation in its runtime. A given bucket
//i//, [figs/brz/img47.png], is a small set of keys (at most 256 keys) and,
as argued in the previous section, the runtime of the
building block algorithm is a random variable [figs/brz/img207.png] with high fluctuation.
However, the runtime //Y// of the searching step of the external memory based algorithm is given
by [figs/brz/img209.png]. Under the hypothesis that
the [figs/brz/img207.png] are independent and bounded, the //law of large numbers// (see,
e.g., [[6 #papers]]) implies that the random variable [figs/brz/img210.png] converges
to a constant as [figs/brz/img83.png]. This explains why the runtime of our
algorithm is almost deterministic.
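
The concentration effect can be illustrated with a small simulation. This is a sketch under simplifying assumptions: the per-bucket runtimes are modeled as independent uniform values on [0, 1), which is a hypothetical bounded distribution and not the measured one, and ``relative_spread`` is a name introduced only for this example.

```python
import random
import statistics


def relative_spread(num_buckets, trials=200, seed=1):
    """Standard deviation divided by mean of the total runtime.

    Each per-bucket runtime is drawn uniformly from [0, 1): a hypothetical
    bounded distribution used only for illustration.
    """
    rng = random.Random(seed)
    totals = []
    for _ in range(trials):
        totals.append(sum(rng.random() for _ in range(num_buckets)))
    return statistics.stdev(totals) / statistics.mean(totals)


few = relative_spread(10)
many = relative_spread(1000)
print(few, many)
```

The relative spread of the total shrinks roughly with the square root of the number of buckets, so a sum over millions of buckets is nearly constant even though each term fluctuates.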
----------------------------------------
=== Controlling disk accesses ===
In order to bring down the number of seek operations on disk
we benefit from the fact that our algorithm leaves almost all main
memory available to be used as disk I/O buffer.
In this section we evaluate how much the parameter [figs/brz/img8.png] affects the runtime of our algorithm.
For that we fixed //n// at 1 billion URLs,
set the main memory of the machine used for the experiments
to 1 gigabyte and used [figs/brz/img8.png] equal to 100, 200, 300, 400, 500 and 600
megabytes.
Table 3 presents the number of files //N//,
the buffer size used for all files, the number of seeks in the worst case considering
the pessimistic assumption mentioned in [[2, Section 5.1 #papers]], and
the time to generate a MPHF for 1 billion keys as a function of the amount of internal
memory available. Observing Table 3 we noticed that the time spent in the construction
decreases as the value of [figs/brz/img8.png] increases. However, for [figs/brz/img213.png], the variation
in the time is not as significant as for [figs/brz/img214.png].
This can be explained by the fact that the kernel 2.6 I/O scheduler of Linux
has smart policies for avoiding seeks and diminishing the average seek time
(see [http://www.linuxjournal.com/article/6931 http://www.linuxjournal.com/article/6931]).
%!include(html): ''TABLEBRZ3.t2t''
| **Table 3:** Influence of the internal memory area size ([figs/brz/img8.png]) in the external memory based algorithm runtime.
----------------------------------------
==Papers==[papers]
+ [F. C. Botelho http://www.dcc.ufmg.br/~fbotelho], D. Menoti, [N. Ziviani http://www.dcc.ufmg.br/~nivio]. [A New algorithm for constructing minimal perfect hash functions papers/bmz_tr004_04.ps], Technical Report TR004/04, Department of Computer Science, Federal University of Minas Gerais, 2004.
+ [F. C. Botelho http://www.dcc.ufmg.br/~fbotelho], Y. Kohayakawa, [N. Ziviani http://www.dcc.ufmg.br/~nivio]. [An Approach for Minimal Perfect Hash Functions for Very Large Databases papers/tr06.pdf], Technical Report TR003/06, Department of Computer Science, Federal University of Minas Gerais, 2004.
+ [F. C. Botelho http://www.dcc.ufmg.br/~fbotelho], Y. Kohayakawa, and [N. Ziviani http://www.dcc.ufmg.br/~nivio]. [A Practical Minimal Perfect Hashing Method papers/wea05.pdf]. //4th International Workshop on Efficient and Experimental Algorithms (WEA05),// Springer-Verlag Lecture Notes in Computer Science, vol. 3505, Santorini Island, Greece, May 2005, 488-500.
+ [M. Seltzer. Beyond relational databases. ACM Queue, 3(3), April 2005. http://acmqueue.com/modules.php?name=Content&pa=showpage&pid=299]
+ [Bob Jenkins. Algorithm alley: Hash functions. Dr. Dobb's Journal of Software Tools, 22(9), September 1997. http://burtleburtle.net/bob/hash/doobs.html]
+ R. Jain. The art of computer systems performance analysis: techniques for experimental design, measurement, simulation, and modeling. John Wiley, first edition, 1991.
%!include: ALGORITHMS.t2t
%!include: FOOTER.t2t
%!include(html): ''GOOGLEANALYTICS.t2t''

44
deps/cmph/CHD.t2t vendored Normal file

@@ -0,0 +1,44 @@
Compress, Hash and Displace: CHD Algorithm
%!includeconf: CONFIG.t2t
----------------------------------------
==Introduction==
The important performance parameters of a PHF are representation size, evaluation time and construction time. The representation size plays an important role when the whole function fits in a faster memory and the actual data is stored in a slower memory. For instance, compact PHFs can fit entirely in a CPU cache, which makes their computation really fast by avoiding cache misses. The CHD algorithm plays an important role in this context. It was designed by Djamal Belazzougui, Fabiano C. Botelho, and Martin Dietzfelbinger in [[2 #papers]].
The CHD algorithm makes it possible to obtain PHFs with representation size very close to optimal while retaining //O(n)// construction time and //O(1)// evaluation time. For example, in the case //m=2n// we obtain a PHF that uses //0.67// bits per key, and for //m=1.23n// we obtain //1.4// bits per key, which was not achievable with previously known methods. The CHD algorithm is inspired by several known algorithms; the main new feature is that it combines a modification of Pagh's "hash-and-displace" approach with data compression on a sequence of hash function indices. That combination makes it possible to significantly reduce space usage while retaining linear construction time and constant query time. The CHD algorithm can also be used for //k//-perfect hashing, where at most //k// keys may be mapped to the same value. For the analysis we assume that fully random hash functions are given for free; such assumptions can be justified and were made in previous papers.
The compact PHFs generated by the CHD algorithm can be used in many applications in which we want to assign a unique identifier to each key without storing any information on the key. One of the most obvious applications of those functions (or //k//-perfect hash functions) is when we have a small fast memory in which we can store the perfect hash function while the keys and associated satellite data are stored in slower but larger memory. The size of a block or a transfer unit may be chosen so that //k// data items can be retrieved in one read access. In this case we can ensure that data associated with a key can be retrieved in a single probe to slower memory. This has been used for example in hardware routers [[4 #papers]].
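
The //k//-perfect property is easy to state in code. The sketch below is only a checker for the definition, not part of the CHD algorithm; the name ``is_k_perfect`` is an assumption made for the example.

```python
from collections import Counter


def is_k_perfect(h, keys, k):
    """True if at most k of the given keys are mapped to any single value."""
    counts = Counter(h(key) for key in keys)
    return max(counts.values(), default=0) <= k


def block_of(key):
    """Toy hash: integers 0..5 are mapped to blocks of two consecutive keys."""
    return key // 2


print(is_k_perfect(block_of, range(6), 2))
```

With //k// equal to the number of items in one block of slower memory, a function that passes this check lets every key be fetched with a single block read.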
The CHD algorithm generates the most compact PHFs and MPHFs we know of in //O(n)// time. The time required to evaluate the generated functions is constant (in practice less than //1.4// microseconds). The storage space of the resulting PHFs and MPHFs is within a factor of //1.43// of the information theoretic lower bound. The closest competitor is the algorithm by Dietzfelbinger and Pagh [[3 #papers]], but their algorithm does not run in linear time. Furthermore, the CHD algorithm can be tuned to run faster than the BPZ algorithm [[1 #papers]] (the fastest algorithm available in the literature so far) and to obtain more compact functions. The most impressive characteristic is that it has the ability, in principle, to approximate the information theoretic lower bound while being practical. A detailed description of the CHD algorithm can be found in [[2 #papers]].
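
To give a feeling for the factor //1.43//, the sketch below applies it to the asymptotic lower bound of log2(e) bits per key for minimal perfect hash functions. That bound is a standard result and is not stated on this page, so it is an assumption of the example; the lower bound for non-minimal PHFs is smaller and is not modeled here.

```python
import math

# Assumption: asymptotic lower bound for MPHFs, log2(e) bits per key.
LOWER_BOUND_BITS_PER_KEY = math.log2(math.e)

CHD_FACTOR = 1.43  # factor quoted in the text

chd_bits_per_key = CHD_FACTOR * LOWER_BOUND_BITS_PER_KEY
print(LOWER_BOUND_BITS_PER_KEY, chd_bits_per_key)
```

Under this assumption a MPHF produced by CHD would need a little over two bits per key.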
----------------------------------------
==Experimental Results==
Experimental results comparing the CHD algorithm with [the BDZ algorithm bdz.html]
and others available in the CMPH library are presented in [[2 #papers]].
----------------------------------------
==Papers==[papers]
+ [F. C. Botelho http://www.dcc.ufmg.br/~fbotelho], [R. Pagh http://www.itu.dk/~pagh/], [N. Ziviani http://www.dcc.ufmg.br/~nivio]. [Simple and space-efficient minimal perfect hash functions papers/wads07.pdf]. //In Proceedings of the 10th International Workshop on Algorithms and Data Structures (WADs'07),// Springer-Verlag Lecture Notes in Computer Science, vol. 4619, Halifax, Canada, August 2007, 139-150.
+ [F. C. Botelho http://www.dcc.ufmg.br/~fbotelho], D. Belazzougui and M. Dietzfelbinger. [Compress, hash and displace papers/esa09.pdf]. //In Proceedings of the 17th European Symposium on Algorithms (ESA09)//. Springer LNCS, 2009.
+ M. Dietzfelbinger and [R. Pagh http://www.itu.dk/~pagh/]. Succinct data structures for retrieval and approximate membership. //In Proceedings of the 35th international colloquium on Automata, Languages and Programming (ICALP08)//, pages 385-396, Berlin, Heidelberg, 2008. Springer-Verlag.
+ B. Prabhakar and F. Bonomi. Perfect hashing for network applications. //In Proceedings of the IEEE International Symposium on Information Theory//. IEEE Press, 2006.
%!include: ALGORITHMS.t2t
%!include: FOOTER.t2t
%!include(html): ''GOOGLEANALYTICS.t2t''

88
deps/cmph/CHM.t2t vendored Normal file

@@ -0,0 +1,88 @@
CHM Algorithm
%!includeconf: CONFIG.t2t
----------------------------------------
==The Algorithm==
The algorithm is presented in [[1,2,3 #papers]].
----------------------------------------
==Memory Consumption==
Now we detail the memory consumption to generate and to store minimal perfect hash functions
using the CHM algorithm. The structures responsible for memory consumption are the
following:
- Graph:
+ **first**: is a vector that stores //cn// integer numbers, each one representing
the first edge (index in the vector edges) in the list of
edges of each vertex.
The integer numbers are 4 bytes long. Therefore,
the vector first is stored in //4cn// bytes.
+ **edges**: is a vector to represent the edges of the graph. As each edge
consists of a pair of vertices, each entry stores two integer numbers
of 4 bytes that represent the vertices. As there are //n// edges, the
vector edges is stored in //8n// bytes.
+ **next**: given a vertex [figs/img139.png], we can discover the edges that
contain [figs/img139.png] following its list of edges, which starts on
first[[figs/img139.png]] and the next
edges are given by next[...first[[figs/img139.png]]...]. Therefore,
the vectors first and next represent
the linked lists of edges of each vertex. As there are two vertices for each edge,
when an edge is inserted in the graph, it must be inserted in the two linked lists
of the vertices in its composition. Therefore, there are //2n// entries of integer
numbers in the vector next, so it is stored in //4*2n = 8n// bytes.
- Other auxiliary structures
+ **visited**: is a vector of //cn// bits, where each bit indicates if the g value of
a given vertex was already defined. Therefore, the vector visited is stored
in //cn/8// bytes.
+ **function //g//**: is represented by a vector of //cn// integer numbers.
As each integer number is 4 bytes long, the function //g// is stored in
//4cn// bytes.
Thus, the total memory consumption of CHM algorithm for generating a minimal
perfect hash function (MPHF) is: //(8.125c + 16)n + O(1)// bytes.
As the value of constant //c// must be at least 2.09 we have:
|| //c// | Memory consumption to generate a MPHF |
| 2.09 | //33.00n + O(1)// |
| **Table 1:** Memory consumption to generate a MPHF using the CHM algorithm.
Now we present the memory consumption to store the resulting function.
We only need to store the //g// function. Thus, we need //4cn// bytes.
Again we have:
|| //c// | Memory consumption to store a MPHF |
| 2.09 | //8.36n// |
| **Table 2:** Memory consumption to store a MPHF generated by the CHM algorithm.
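
The per-structure costs listed above can be added up in a few lines. This sketch expresses everything in bytes per key and ignores the //O(1)// term; the function names are illustrative only.

```python
def chm_generation_bytes_per_key(c):
    """Bytes per key needed while generating a MPHF with CHM."""
    first = 4 * c    # cn integers of 4 bytes
    edges = 8        # n entries, each holding two 4-byte vertices
    next_ = 8        # 2n integers of 4 bytes
    visited = c / 8  # cn bits
    g = 4 * c        # cn integers of 4 bytes
    return first + edges + next_ + visited + g


def chm_storage_bytes_per_key(c):
    """Bytes per key needed to store the resulting function g."""
    return 4 * c


generation = chm_generation_bytes_per_key(2.09)
storage = chm_storage_bytes_per_key(2.09)
print(generation, storage)
```

For //c = 2.09// the sum is about 32.98 bytes per key, which Table 1 reports rounded as //33.00n//, and the stored function takes 8.36 bytes per key.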
----------------------------------------
==Experimental Results==
[CHM x BMZ comparison.html]
----------------------------------------
==Papers==[papers]
+ Z.J. Czech, G. Havas, and B.S. Majewski. [An optimal algorithm for generating minimal perfect hash functions. papers/chm92.pdf], Information Processing Letters, 43(5):257-264, 1992.
+ Z.J. Czech, G. Havas, and B.S. Majewski. Fundamental study perfect hashing.
Theoretical Computer Science, 182:1-143, 1997.
+ B.S. Majewski, N.C. Wormald, G. Havas, and Z.J. Czech. A family of perfect hashing methods.
The Computer Journal, 39(6):547--554, 1996.
%!include: ALGORITHMS.t2t
%!include: FOOTER.t2t
%!include(html): ''GOOGLEANALYTICS.t2t''

111
deps/cmph/COMPARISON.t2t vendored Normal file

@@ -0,0 +1,111 @@
Comparison Between BMZ And CHM Algorithms
%!includeconf: CONFIG.t2t
----------------------------------------
==Characteristics==
Table 1 presents the main characteristics of the two algorithms.
The number of edges in the graph [figs/img27.png] is [figs/img236.png],
the number of keys in the input set [figs/img20.png].
The number of vertices of [figs/img32.png] is equal
to [figs/img12.png] and [figs/img237.png] for BMZ algorithm and the CHM algorithm, respectively.
This measure is related to the amount of space to store the array [figs/img37.png].
This improves the space required to store a function in BMZ algorithm to [figs/img238.png] of the space required by the CHM algorithm.
The number of critical edges is [figs/img76.png] and 0, for BMZ algorithm and the CHM algorithm,
respectively.
BMZ algorithm generates random graphs that necessarily contain cycles and the
CHM algorithm
generates
acyclic random graphs.
Finally, the CHM algorithm generates [order preserving functions concepts.html]
while BMZ algorithm does not preserve order.
%!include(html): ''TABLE1.t2t''
| **Table 1:** Main characteristics of the algorithms.
----------------------------------------
==Memory Consumption==
- Memory consumption to generate the minimal perfect hash function (MPHF):
|| Algorithm | //c// | Memory consumption to generate a MPHF |
| BMZ | 0.93 | //24.80n + O(1)// |
| BMZ | 1.15 | //26.42n + O(1)// |
| CHM | 2.09 | //33.00n + O(1)// |
| **Table 2:** Memory consumption to generate a MPHF using the algorithms BMZ and CHM.
- Memory consumption to store the resulting minimal perfect hash function (MPHF):
|| Algorithm | //c// | Memory consumption to store a MPHF |
| BMZ | 0.93 | //3.72n// |
| BMZ | 1.15 | //4.60n// |
| CHM | 2.09 | //8.36n// |
| **Table 3:** Memory consumption to store a MPHF generated by the algorithms BMZ and CHM.
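
The entries of Table 3 match //4c// bytes per key for every row, so the storage needed by BMZ relative to CHM is just the ratio of the two values of //c//. A short sketch of that comparison, with names introduced only for this example:

```python
def storage_bytes_per_key(c):
    """Storage per key in bytes, using the 4c rule that Table 3 follows."""
    return 4 * c


CHM_C = 2.09

ratios = {}
for bmz_c in (0.93, 1.15):
    ratios[bmz_c] = storage_bytes_per_key(bmz_c) / storage_bytes_per_key(CHM_C)

print(ratios)
```

With these values a function generated by BMZ takes between roughly 44 % and 55 % of the space of one generated by CHM.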
----------------------------------------
==Run times==
We now present some experimental results to compare the BMZ and CHM algorithms.
The data consists of a collection of 100 million uniform resource locators
(URLs) collected from the Web.
The average length of a URL in the collection is 63 bytes.
All experiments were carried out on
a computer running the Linux operating system, version 2.6.7,
with a 2.4 gigahertz processor and
4 gigabytes of main memory.
Table 4 presents time measurements.
All times are in seconds.
The table entries represent averages over 50 trials.
The column labelled as [figs/img243.png] represents
the number of iterations to generate the random graph [figs/img32.png] in the
mapping step of the algorithms.
The next columns represent the run times
for the mapping plus ordering steps together and the searching
step for each algorithm.
The last column represents the percent gain of our algorithm
over the CHM algorithm.
%!include(html): ''TABLE4.t2t''
| **Table 4:** Time measurements for BMZ and the CHM algorithm.
The mapping step of the BMZ algorithm is faster because
the expected numbers of iterations in the mapping step to generate [figs/img32.png] are
2.13 and 2.92 for BMZ algorithm and the CHM algorithm, respectively
(see [[2 bmz.html#papers]] for details).
The graph [figs/img32.png] generated by BMZ algorithm
has [figs/img12.png] vertices, against [figs/img237.png] for the CHM algorithm.
These two facts make BMZ algorithm faster in the mapping step.
The ordering step of BMZ algorithm is approximately equal to
the time to check if [figs/img32.png] is acyclic for the CHM algorithm.
The searching step of the CHM algorithm is faster, but the total
time of BMZ algorithm is, on average, approximately 59 % faster
than the CHM algorithm.
It is important to notice the times for the searching step:
for both algorithms they are not the dominant times,
and the experimental results clearly show
a linear behavior for the searching step.
We now present run times for BMZ algorithm using a [heuristic bmz.html#heuristic] that
reduces the space requirement
to any given value between [figs/img12.png] words and [figs/img13.png] words.
For example, for [figs/img244.png] and [figs/img6.png], the analytically expected numbers
of iterations are [figs/img245.png] and [figs/img246.png], respectively
(for [figs/img247.png], the numbers of iterations are 2.78 for [figs/img244.png] and 3.04
for [figs/img6.png]).
Table 5 presents the total times to construct a
function for [figs/img247.png], with an increase from [figs/img248.png] seconds
for [figs/img128.png] (see Table 4) to [figs/img249.png] seconds for [figs/img244.png] and
to [figs/img250.png] seconds for [figs/img6.png].
%!include(html): ''TABLE5.t2t''
| **Table 5:** Time measurements for BMZ tuned algorithm with [figs/img5.png] and [figs/img6.png].
%!include: ALGORITHMS.t2t
%!include: FOOTER.t2t
%!include(html): ''GOOGLEANALYTICS.t2t''

56
deps/cmph/CONCEPTS.t2t vendored Normal file

@@ -0,0 +1,56 @@
Minimal Perfect Hash Functions - Introduction
%!includeconf: CONFIG.t2t
----------------------------------------
==Basic Concepts==
Suppose [figs/img14.png] is a universe of //keys//.
Let [figs/img15.png] be a //hash function// that maps the keys from [figs/img14.png] to a given interval of integers [figs/img16.png].
Let [figs/img17.png] be a set of [figs/img8.png] keys from [figs/img14.png].
Given a key [figs/img18.png], the hash function [figs/img7.png] computes an
integer in [figs/img19.png] for the storage or retrieval of [figs/img11.png] in
a //hash table//.
Hashing methods for //non-static sets// of keys can be used to construct
data structures storing [figs/img20.png] and supporting membership queries
"[figs/img18.png]?" in expected time [figs/img21.png].
However, they involve a certain amount of wasted space owing to unused
locations in the table and wasted time to resolve collisions when
two keys are hashed to the same table location.
For //static sets// of keys it is possible to compute a function
to find any key in a table in one probe; such hash functions are called
//perfect//.
More precisely, given a set of keys [figs/img20.png], we shall say that a
hash function [figs/img15.png] is a //perfect hash function//
for [figs/img20.png] if [figs/img7.png] is an injection on [figs/img20.png],
that is, there are no //collisions// among the keys in [figs/img20.png]:
if [figs/img11.png] and [figs/img22.png] are in [figs/img20.png] and [figs/img23.png],
then [figs/img24.png].
Figure 1(a) illustrates a perfect hash function.
Since no collisions occur, each key can be retrieved from the table
with a single probe.
If [figs/img25.png], that is, the table has the same size as [figs/img20.png],
then we say that [figs/img7.png] is a //minimal perfect hash function//
for [figs/img20.png].
Figure 1(b) illustrates a minimal perfect hash function.
Minimal perfect hash functions totally avoid the problem of wasted
space and time. A perfect hash function [figs/img7.png] is //order preserving//
if the keys in [figs/img20.png] are arranged in some given order
and [figs/img7.png] preserves this order in the hash table.
| [figs/img26.png]
| **Figure 1:** (a) Perfect hash function. (b) Minimal perfect hash function.
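
The three definitions above translate directly into small checks. The sketch assumes that hash values are integers starting at 0, and the function names and the toy key set are introduced only for this example.

```python
def is_perfect(h, keys):
    """True if h is injective on keys, i.e. there are no collisions."""
    values = [h(k) for k in keys]
    return len(set(values)) == len(values)


def is_minimal_perfect(h, keys):
    """True if h maps the n keys one-to-one onto 0..n-1 (assumed range)."""
    values = sorted(h(k) for k in keys)
    return values == list(range(len(keys)))


def is_order_preserving(h, keys):
    """True if hash values strictly increase along the given key order."""
    values = [h(k) for k in keys]
    return all(a < b for a, b in zip(values, values[1:]))


KEYS = ["jan", "feb", "mar"]
SPARSE = {"jan": 4, "feb": 0, "mar": 2}  # perfect, table larger than the set
DENSE = {"jan": 0, "feb": 1, "mar": 2}   # minimal and order preserving

print(is_perfect(SPARSE.get, KEYS), is_minimal_perfect(DENSE.get, KEYS))
```

A strictly increasing function is necessarily injective, so every order preserving function in this sense is also perfect.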
Minimal perfect hash functions are widely used for memory efficient
storage and fast retrieval of items from static sets, such as words in natural
languages, reserved words in programming languages or interactive systems,
uniform resource locators (URLs) in Web search engines, or item sets in
data mining techniques.
%!include: ALGORITHMS.t2t
%!include: FOOTER.t2t
%!include(html): ''GOOGLEANALYTICS.t2t''

51
deps/cmph/CONFIG.t2t vendored Normal file

@@ -0,0 +1,51 @@
%! style(html): DOC.css
%! PreProc(html): '^%html% ' ''
%! PreProc(txt): '^%txt% ' ''
%! PostProc(html): "&amp;" "&"
%! PostProc(txt): "&nbsp;" " "
%! PostProc(html): 'ALIGN="middle" SRC="figs/img7.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img7.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img57.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img57.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img32.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img32.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img20.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img20.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img60.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img60.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img62.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img62.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img79.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img79.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img139.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img139.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img140.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img140.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img143.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img143.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img115.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img115.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img11.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img11.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img169.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img169.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img96.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img96.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img178.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img178.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img180.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img180.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img183.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img183.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img189.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img189.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img196.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img196.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img172.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img172.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img8.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img8.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img1.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img1.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img14.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img14.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img128.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img128.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img112.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img112.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img12.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img12.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img13.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img13.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img244.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img244.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img245.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img245.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img246.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img246.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img15.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img15.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img25.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img25.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img168.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img168.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img6.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img6.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img5.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img5.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img28.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img28.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img237.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img237.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img248.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img248.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img249.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img249.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img250.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img250.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/bdz/img8.png"(.*?)>' 'ALIGN="bottom" SRC="figs/bdz/img8.png"\1>'
% The ^ needs to be escaped by \
%!postproc(html): \^\^(.*?)\^\^ <sup>\1</sup>
%!postproc(html): ,,(.*?),, <sub>\1</sub>

5
deps/cmph/COPYING vendored Normal file

@@ -0,0 +1,5 @@
The code of the cmph library is dual licensed under the LGPL version 2 and MPL
1.1 licenses. Please refer to the LGPL-2 and MPL-1.1 files in the repository
for the full description of each of the licenses.
For cxxmph, the files stringpiece.h and MurmurHash2 are covered by the BSD and MIT licenses, respectively.

453
deps/cmph/ChangeLog vendored Normal file

@@ -0,0 +1,453 @@
2005-08-08 18:34 fc_botelho
* INSTALL, examples/Makefile, examples/Makefile.in,
examples/.deps/file_adapter_ex2.Po,
examples/.deps/vector_adapter_ex1.Po, src/brz.c: [no log message]
2005-08-07 22:00 fc_botelho
* src/: brz.c, brz.h, brz_structs.h, cmph.c, cmph.h, main.c:
temporary directory passed by command line
2005-08-07 20:22 fc_botelho
* src/brz.c: stable version of BRZ
2005-08-06 22:09 fc_botelho
* src/bmz.c: no message
2005-08-06 22:02 fc_botelho
* src/bmz.c: no message
2005-08-06 21:45 fc_botelho
* src/brz.c: fastest version of BRZ
2005-08-06 17:20 fc_botelho
* src/: bmz.c, brz.c, main.c: [no log message]
2005-07-29 16:43 fc_botelho
* src/brz.c: BRZ algorithm is almost stable
2005-07-29 15:29 fc_botelho
* src/: bmz.c, brz.c, brz_structs.h, cmph_types.h: BRZ algorithm is
almost stable
2005-07-29 00:09 fc_botelho
* src/: brz.c, djb2_hash.c, djb2_hash.h, fnv_hash.c, fnv_hash.h,
hash.c, hash.h, jenkins_hash.c, jenkins_hash.h, sdbm_hash.c,
sdbm_hash.h: it was fixed more mistakes in BRZ algorithm
2005-07-28 21:00 fc_botelho
* src/: bmz.c, brz.c, cmph.c: fixed some mistakes in BRZ algorithm
2005-07-27 19:13 fc_botelho
* src/brz.c: algorithm BRZ included
2005-07-27 18:16 fc_botelho
* src/: bmz_structs.h, brz.c, brz.h, brz_structs.h: Algorithm BRZ
included
2005-07-27 18:13 fc_botelho
* src/: Makefile.am, bmz.c, chm.c, cmph.c, cmph.h, cmph_types.h:
Algorithm BRZ included
2005-07-25 19:18 fc_botelho
* README, README.t2t, scpscript: it was included an examples
directory
2005-07-25 18:26 fc_botelho
* INSTALL, Makefile.am, configure.ac, examples/Makefile,
examples/Makefile.am, examples/Makefile.in,
examples/file_adapter_ex2.c, examples/keys.txt,
examples/vector_adapter_ex1.c, examples/.deps/file_adapter_ex2.Po,
examples/.deps/vector_adapter_ex1.Po, src/cmph.c, src/cmph.h: it
was included a examples directory
2005-03-03 02:07 davi
* src/: bmz.c, chm.c, chm.h, chm_structs.h, cmph.c, cmph.h,
graph.c, graph.h, jenkins_hash.c, jenkins_hash.h, main.c (xgraph):
New f*cking cool algorithm works. Roughly implemented in chm.c
2005-03-02 20:55 davi
* src/xgraph.c (xgraph): xchmr working nice, but a bit slow
2005-03-02 02:01 davi
* src/xchmr.h: file xchmr.h was initially added on branch xgraph.
2005-03-02 02:01 davi
* src/xchmr_structs.h: file xchmr_structs.h was initially added on
branch xgraph.
2005-03-02 02:01 davi
* src/xchmr.c: file xchmr.c was initially added on branch xgraph.
2005-03-02 02:01 davi
* src/: Makefile.am, cmph.c, cmph_types.h, xchmr.c, xchmr.h,
xchmr_structs.h, xgraph.c, xgraph.h (xgraph): xchmr working fine
except for false positives on cyclic detection.
2005-03-02 00:05 davi
* src/: Makefile.am, xgraph.c, xgraph.h (xgraph): Added external
graph functionality in branch xgraph.
2005-03-02 00:05 davi
* src/xgraph.c: file xgraph.c was initially added on branch xgraph.
2005-03-02 00:05 davi
* src/xgraph.h: file xgraph.h was initially added on branch xgraph.
2005-02-28 19:53 davi
* src/chm.c: Fixed off by one bug in chm.
2005-02-17 16:20 fc_botelho
* LOGO.html, README, README.t2t, gendocs: The way of calling the
function cmph_search was fixed in the file README.t2t
2005-01-31 17:13 fc_botelho
* README.t2t: Heuristic BMZ memory consumption was updated
2005-01-31 17:09 fc_botelho
* BMZ.t2t: DJB2, SDBM, FNV and Jenkins hash link were added
2005-01-31 16:50 fc_botelho
* BMZ.t2t, CHM.t2t, COMPARISON.t2t, CONCEPTS.t2t, CONFIG.t2t,
FAQ.t2t, GPERF.t2t, LOGO.t2t, README.t2t, TABLE1.t2t, TABLE4.t2t,
TABLE5.t2t, DOC.css: BMZ documentation was finished
2005-01-28 18:12 fc_botelho
* figs/img1.png, figs/img10.png, figs/img100.png, figs/img101.png,
figs/img102.png, figs/img103.png, figs/img104.png, figs/img105.png,
figs/img106.png, figs/img107.png, figs/img108.png, figs/img109.png,
papers/bmz_tr004_04.ps, papers/bmz_wea2005.ps, papers/chm92.pdf,
figs/img11.png, figs/img110.png, figs/img111.png, figs/img112.png,
figs/img113.png, figs/img114.png, figs/img115.png, figs/img116.png,
figs/img117.png, figs/img118.png, figs/img119.png, figs/img12.png,
figs/img120.png, figs/img121.png, figs/img122.png, figs/img123.png,
figs/img124.png, figs/img125.png, figs/img126.png, figs/img127.png,
figs/img128.png, figs/img129.png, figs/img13.png, figs/img130.png,
figs/img131.png, figs/img132.png, figs/img133.png, figs/img134.png,
figs/img135.png, figs/img136.png, figs/img137.png, figs/img138.png,
figs/img139.png, figs/img14.png, figs/img140.png, figs/img141.png,
figs/img142.png, figs/img143.png, figs/img144.png, figs/img145.png,
figs/img146.png, figs/img147.png, figs/img148.png, figs/img149.png,
figs/img15.png, figs/img150.png, figs/img151.png, figs/img152.png,
figs/img153.png, figs/img154.png, figs/img155.png, figs/img156.png,
figs/img157.png, figs/img158.png, figs/img159.png, figs/img16.png,
figs/img160.png, figs/img161.png, figs/img162.png, figs/img163.png,
figs/img164.png, figs/img165.png, figs/img166.png, figs/img167.png,
figs/img168.png, figs/img169.png, figs/img17.png, figs/img170.png,
figs/img171.png, figs/img172.png, figs/img173.png, figs/img174.png,
figs/img175.png, figs/img176.png, figs/img177.png, figs/img178.png,
figs/img179.png, figs/img18.png, figs/img180.png, figs/img181.png,
figs/img182.png, figs/img183.png, figs/img184.png, figs/img185.png,
figs/img186.png, figs/img187.png, figs/img188.png, figs/img189.png,
figs/img19.png, figs/img190.png, figs/img191.png, figs/img192.png,
figs/img193.png, figs/img194.png, figs/img195.png, figs/img196.png,
figs/img197.png, figs/img198.png, figs/img199.png, figs/img2.png,
figs/img20.png, figs/img200.png, figs/img201.png, figs/img202.png,
figs/img203.png, figs/img204.png, figs/img205.png, figs/img206.png,
figs/img207.png, figs/img208.png, figs/img209.png, figs/img21.png,
figs/img210.png, figs/img211.png, figs/img212.png, figs/img213.png,
figs/img214.png, figs/img215.png, figs/img216.png, figs/img217.png,
figs/img218.png, figs/img219.png, figs/img22.png, figs/img220.png,
figs/img221.png, figs/img222.png, figs/img223.png, figs/img224.png,
figs/img225.png, figs/img226.png, figs/img227.png, figs/img228.png,
figs/img229.png, figs/img23.png, figs/img230.png, figs/img231.png,
figs/img232.png, figs/img233.png, figs/img234.png, figs/img235.png,
figs/img236.png, figs/img237.png, figs/img238.png, figs/img239.png,
figs/img24.png, figs/img240.png, figs/img241.png, figs/img242.png,
figs/img243.png, figs/img244.png, figs/img245.png, figs/img246.png,
figs/img247.png, figs/img248.png, figs/img249.png, figs/img25.png,
figs/img250.png, figs/img251.png, figs/img252.png, figs/img253.png,
figs/img26.png, figs/img27.png, figs/img28.png, figs/img29.png,
figs/img3.png, figs/img30.png, figs/img31.png, figs/img32.png,
figs/img33.png, figs/img34.png, figs/img35.png, figs/img36.png,
figs/img37.png, figs/img38.png, figs/img39.png, figs/img4.png,
figs/img40.png, figs/img41.png, figs/img42.png, figs/img43.png,
figs/img44.png, figs/img45.png, figs/img46.png, figs/img47.png,
figs/img48.png, figs/img49.png, figs/img5.png, figs/img50.png,
figs/img51.png, figs/img52.png, figs/img53.png, figs/img54.png,
figs/img55.png, figs/img56.png, figs/img57.png, figs/img58.png,
figs/img59.png, figs/img6.png, figs/img60.png, figs/img61.png,
figs/img62.png, figs/img63.png, figs/img64.png, figs/img65.png,
figs/img66.png, figs/img67.png, figs/img68.png, figs/img69.png,
figs/img7.png, figs/img70.png, figs/img71.png, figs/img72.png,
figs/img73.png, figs/img74.png, figs/img75.png, figs/img76.png,
figs/img77.png, figs/img78.png, figs/img79.png, figs/img8.png,
figs/img80.png, figs/img81.png, figs/img82.png, figs/img83.png,
figs/img84.png, figs/img85.png, figs/img86.png, figs/img87.png,
figs/img88.png, figs/img89.png, figs/img9.png, figs/img90.png,
figs/img91.png, figs/img92.png, figs/img93.png, figs/img94.png,
figs/img95.png, figs/img96.png, figs/img97.png, figs/img98.png,
figs/img99.png: Initial version
2005-01-28 18:07 fc_botelho
* BMZ.t2t, CHM.t2t, COMPARISON.t2t, CONFIG.t2t, README.t2t: It was
improved the documentation of BMZ and CHM algorithms
2005-01-27 18:07 fc_botelho
* BMZ.t2t, CHM.t2t, FAQ.t2t: history of BMZ algorithm is available
2005-01-27 14:23 fc_botelho
* AUTHORS: It was added the authors' email
2005-01-27 14:21 fc_botelho
* BMZ.t2t, CHM.t2t, COMPARISON.t2t, FAQ.t2t, FOOTER.t2t, GPERF.t2t,
README.t2t: It was added FOOTER.t2t file
2005-01-27 12:16 fc_botelho
* src/cmph_types.h: It was removed pjw and glib functions from
cmph_hash_names vector
2005-01-27 12:12 fc_botelho
* src/hash.c: It was removed pjw and glib functions from
cmph_hash_names vector
2005-01-27 11:01 davi
* FAQ.t2t, README, README.t2t, gendocs, src/bmz.c, src/bmz.h,
src/chm.c, src/chm.h, src/cmph.c, src/cmph_structs.c, src/debug.h,
src/main.c: Fix to alternate hash functions code. Removed htonl
stuff from chm algorithm. Added faq.
2005-01-27 09:14 fc_botelho
* README.t2t: It was corrected some formatting mistakes
2005-01-26 22:04 davi
* BMZ.t2t, CHM.t2t, COMPARISON.t2t, GPERF.t2t, README, README.t2t,
gendocs: Added gperf notes.
2005-01-25 19:10 fc_botelho
* INSTALL: generated in version 0.3
2005-01-25 19:09 fc_botelho
* src/: czech.c, czech.h, czech_structs.h: The czech.h,
czech_structs.h and czech.c files were removed
2005-01-25 19:06 fc_botelho
* src/: chm.c, chm.h, chm_structs.h, cmph.c, cmph_types.h, main.c,
Makefile.am: It was changed the prefix czech by chm
2005-01-25 18:50 fc_botelho
* gendocs: script to generate the documentation and the README file
2005-01-25 18:47 fc_botelho
* README: README was updated
2005-01-25 18:44 fc_botelho
* configure.ac: Version was updated
2005-01-25 18:42 fc_botelho
* src/cmph.h: Vector adapter commented
2005-01-25 18:40 fc_botelho
* CHM.t2t, CONFIG.t2t, LOGO.html: It was included the PreProc macro
through the CONFIG.t2t file and the LOGO through the LOGO.html file
2005-01-25 18:33 fc_botelho
* README.t2t, BMZ.t2t, COMPARISON.t2t, CZECH.t2t: It was included
the PreProc macro through the CONFIG.t2t file and the LOGO through
the LOGO.html file
2005-01-24 18:25 fc_botelho
* src/: bmz.c, bmz.h, cmph_structs.c, cmph_structs.h, czech.c,
cmph.c, czech.h, main.c, cmph.h: The file adpater was implemented.
2005-01-24 17:20 fc_botelho
* README.t2t: the memory consumption to create a mphf using bmz
with a heuristic was fixed.
2005-01-24 17:11 fc_botelho
* src/: cmph_types.h, main.c: The algorithms and hash functions
were put in alphabetical order
2005-01-24 16:15 fc_botelho
* BMZ.t2t, COMPARISON.t2t, CZECH.t2t, README.t2t: It was fixed some
English mistakes and It was included the files BMZ.t2t, CZECH.t2t
and COMPARISON.t2t
2005-01-21 19:19 davi
* ChangeLog, Doxyfile: Added Doxyfile.
2005-01-21 19:14 davi
* README.t2t, wingetopt.c, src/cmph.h, tests/graph_tests.c: Fixed
wingetopt.c
2005-01-21 18:44 fc_botelho
* src/Makefile.am: included files bitbool.h and bitbool.c
2005-01-21 18:42 fc_botelho
* src/: bmz.c, bmz.h, bmz_structs.h, cmph.c, cmph.h,
cmph_structs.c, cmph_structs.h, czech.c, czech.h, czech_structs.h,
djb2_hash.c, djb2_hash.h, fnv_hash.c, fnv_hash.h, graph.c, graph.h,
hash.c, hash.h, hash_state.h, jenkins_hash.c, jenkins_hash.h,
main.c, sdbm_hash.c, sdbm_hash.h, vqueue.c, vqueue.h, vstack.c,
vstack.h: Only public symbols were prefixed with cmph, and the API
was changed to agree with the initial txt2html documentation
2005-01-21 18:30 fc_botelho
* src/: bitbool.c, bitbool.h: mask to represent a boolean value
using only 1 bit
2005-01-20 10:28 davi
* ChangeLog, README, README.t2t, wingetopt.h, src/main.c: Added
initial txt2tags documentation.
2005-01-19 10:40 davi
* acinclude.m4, configure.ac: Added macros for large file support.
2005-01-18 19:06 fc_botelho
* src/: bmz.c, bmz.h, bmz_structs.h, cmph.c, cmph.h,
cmph_structs.c, cmph_structs.h, cmph_types.h, czech.c, czech.h,
czech_structs.h, djb2_hash.c, djb2_hash.h, fnv_hash.c, fnv_hash.h,
graph.c, graph.h, hash.c, hash.h, hash_state.h, jenkins_hash.c,
jenkins_hash.h, main.c, sdbm_hash.c, sdbm_hash.h, vqueue.c,
vqueue.h, vstack.c, vstack.h: version with cmph prefix
2005-01-18 15:10 davi
* ChangeLog, cmph.vcproj, cmphapp.vcproj, wingetopt.c, wingetopt.h:
Added missing files.
2005-01-18 14:25 fc_botelho
* aclocal.m4: initial version
2005-01-18 14:16 fc_botelho
* aclocal.m4: initial version
2005-01-18 13:58 fc_botelho
* src/czech.c: using bit mask to represent boolean values
2005-01-18 13:56 fc_botelho
* src/czech.c: no message
2005-01-18 10:18 davi
* COPYING, INSTALL, src/Makefile.am, src/bmz.c, src/bmz.h,
src/cmph.c, src/cmph.h, src/cmph_structs.c, src/cmph_structs.h,
src/czech.c, src/czech.h, src/debug.h, src/djb2_hash.c,
src/graph.c, src/graph.h, src/hash.c, src/jenkins_hash.c,
src/main.c, src/sdbm_hash.c, src/vqueue.c: Fixed a lot of warnings.
Added visual studio project. Make needed changes to work with
windows.
2005-01-17 16:01 fc_botelho
* src/main.c: stable version
2005-01-17 15:58 fc_botelho
* src/: bmz.c, cmph.c, cmph.h, graph.c: stable version
2005-01-13 21:56 davi
* src/czech.c: Better error handling in czech.c.
2005-01-05 18:45 fc_botelho
* src/cmph_structs.c: included option -k to specify the number of
keys to use
2005-01-05 17:48 fc_botelho
* src/: cmph.c, main.c: included option -k to specify the number of
keys to use
2005-01-03 19:38 fc_botelho
* src/bmz.c: using less memory
2005-01-03 18:47 fc_botelho
* src/: bmz.c, graph.c: using less space to store the used_edges
and critical_nodes arrays
2004-12-23 11:16 davi
* INSTALL, COPYING, AUTHORS, ChangeLog, Makefile.am, NEWS, README,
cmph.spec, configure.ac, src/graph.c, tests/Makefile.am,
tests/graph_tests.c, src/bmz.c, src/cmph_types.h,
src/czech_structs.h, src/hash_state.h, src/jenkins_hash.c,
src/bmz_structs.h, src/cmph.c, src/cmph.h, src/cmph_structs.h,
src/czech.c, src/debug.h, src/djb2_hash.c, src/djb2_hash.h,
src/fnv_hash.c, src/fnv_hash.h, src/graph.h, src/hash.c,
src/hash.h, src/jenkins_hash.h, src/sdbm_hash.c, src/vstack.h,
src/Makefile.am, src/bmz.h, src/cmph_structs.c, src/czech.h,
src/main.c, src/sdbm_hash.h, src/vqueue.c, src/vqueue.h,
src/vstack.c: Initial release.
2004-12-23 11:16 davi
* INSTALL, COPYING, AUTHORS, ChangeLog, Makefile.am, NEWS, README,
cmph.spec, configure.ac, src/graph.c, tests/Makefile.am,
tests/graph_tests.c, src/bmz.c, src/cmph_types.h,
src/czech_structs.h, src/hash_state.h, src/jenkins_hash.c,
src/bmz_structs.h, src/cmph.c, src/cmph.h, src/cmph_structs.h,
src/czech.c, src/debug.h, src/djb2_hash.c, src/djb2_hash.h,
src/fnv_hash.c, src/fnv_hash.h, src/graph.h, src/hash.c,
src/hash.h, src/jenkins_hash.h, src/sdbm_hash.c, src/vstack.h,
src/Makefile.am, src/bmz.h, src/cmph_structs.c, src/czech.h,
src/main.c, src/sdbm_hash.h, src/vqueue.c, src/vqueue.h,
src/vstack.c: Initial revision

33
deps/cmph/DOC.css vendored Normal file

@@ -0,0 +1,33 @@
/* implement both fixed-size and relative sizes */
SMALL.XTINY { }
SMALL.TINY { }
SMALL.SCRIPTSIZE { }
BODY { font-size: 13 }
TD { font-size: 13 }
SMALL.FOOTNOTESIZE { font-size: 13 }
SMALL.SMALL { }
BIG.LARGE { }
BIG.XLARGE { }
BIG.XXLARGE { }
BIG.HUGE { }
BIG.XHUGE { }
/* heading styles */
H1 { }
H2 { }
H3 { }
H4 { }
H5 { }
/* mathematics styles */
DIV.displaymath { } /* math displays */
TD.eqno { } /* equation-number cells */
/* document-specific styles come next */
DIV.navigation { }
DIV.center { }
SPAN.textit { font-style: italic }
SPAN.arabic { }
SPAN.eqn-number { }

1153
deps/cmph/Doxyfile vendored Normal file

File diff suppressed because it is too large

152
deps/cmph/EXAMPLES.t2t vendored Normal file

@@ -0,0 +1,152 @@
CMPH - Examples
%!includeconf: CONFIG.t2t
Using cmph is quite simple. Take a look at the following examples.
-------------------------------------------------------------------
```
#include <cmph.h>
#include <string.h>
// Create minimal perfect hash function from in-memory vector
int main(int argc, char **argv)
{
// Creating a filled vector
unsigned int i = 0;
const char *vector[] = {"aaaaaaaaaa", "bbbbbbbbbb", "cccccccccc", "dddddddddd", "eeeeeeeeee",
"ffffffffff", "gggggggggg", "hhhhhhhhhh", "iiiiiiiiii", "jjjjjjjjjj"};
unsigned int nkeys = 10;
FILE* mphf_fd = fopen("temp.mph", "w");
// Source of keys
cmph_io_adapter_t *source = cmph_io_vector_adapter((char **)vector, nkeys);
//Create minimal perfect hash function using the brz algorithm.
cmph_config_t *config = cmph_config_new(source);
cmph_config_set_algo(config, CMPH_BRZ);
cmph_config_set_mphf_fd(config, mphf_fd);
cmph_t *hash = cmph_new(config);
cmph_config_destroy(config);
cmph_dump(hash, mphf_fd);
cmph_destroy(hash);
fclose(mphf_fd);
//Find key
mphf_fd = fopen("temp.mph", "r");
hash = cmph_load(mphf_fd);
while (i < nkeys) {
const char *key = vector[i];
unsigned int id = cmph_search(hash, key, (cmph_uint32)strlen(key));
fprintf(stderr, "key:%s -- hash:%u\n", key, id);
i++;
}
//Destroy hash
cmph_destroy(hash);
cmph_io_vector_adapter_destroy(source);
fclose(mphf_fd);
return 0;
}
```
Download [vector_adapter_ex1.c examples/vector_adapter_ex1.c]. This example does not work in versions below 0.6.
-------------------------------
```
#include <cmph.h>
#include <string.h>
// Create minimal perfect hash function from in-memory vector
#pragma pack(1)
typedef struct {
cmph_uint32 id;
char key[11];
cmph_uint32 year;
} rec_t;
#pragma pack(0)
int main(int argc, char **argv)
{
// Creating a filled vector
unsigned int i = 0;
rec_t vector[10] = {{1, "aaaaaaaaaa", 1999}, {2, "bbbbbbbbbb", 2000}, {3, "cccccccccc", 2001},
{4, "dddddddddd", 2002}, {5, "eeeeeeeeee", 2003}, {6, "ffffffffff", 2004},
{7, "gggggggggg", 2005}, {8, "hhhhhhhhhh", 2006}, {9, "iiiiiiiiii", 2007},
{10,"jjjjjjjjjj", 2008}};
unsigned int nkeys = 10;
FILE* mphf_fd = fopen("temp_struct_vector.mph", "w");
// Source of keys
cmph_io_adapter_t *source = cmph_io_struct_vector_adapter(vector, (cmph_uint32)sizeof(rec_t), (cmph_uint32)sizeof(cmph_uint32), 11, nkeys);
//Create minimal perfect hash function using the BDZ algorithm.
cmph_config_t *config = cmph_config_new(source);
cmph_config_set_algo(config, CMPH_BDZ);
cmph_config_set_mphf_fd(config, mphf_fd);
cmph_t *hash = cmph_new(config);
cmph_config_destroy(config);
cmph_dump(hash, mphf_fd);
cmph_destroy(hash);
fclose(mphf_fd);
//Find key
mphf_fd = fopen("temp_struct_vector.mph", "r");
hash = cmph_load(mphf_fd);
while (i < nkeys) {
const char *key = vector[i].key;
unsigned int id = cmph_search(hash, key, 11);
fprintf(stderr, "key:%s -- hash:%u\n", key, id);
i++;
}
//Destroy hash
cmph_destroy(hash);
cmph_io_struct_vector_adapter_destroy(source);
fclose(mphf_fd);
return 0;
}
```
Download [struct_vector_adapter_ex3.c examples/struct_vector_adapter_ex3.c]. This example does not work in versions below 0.8.
-------------------------------
```
#include <cmph.h>
#include <stdio.h>
#include <string.h>
// Create minimal perfect hash function from on-disk keys using BDZ algorithm
int main(int argc, char **argv)
{
//Open file with newline separated list of keys
FILE * keys_fd = fopen("keys.txt", "r");
cmph_t *hash = NULL;
if (keys_fd == NULL)
{
fprintf(stderr, "File \"keys.txt\" not found\n");
exit(1);
}
// Source of keys
cmph_io_adapter_t *source = cmph_io_nlfile_adapter(keys_fd);
cmph_config_t *config = cmph_config_new(source);
cmph_config_set_algo(config, CMPH_BDZ);
hash = cmph_new(config);
cmph_config_destroy(config);
//Find key
const char *key = "jjjjjjjjjj";
unsigned int id = cmph_search(hash, key, (cmph_uint32)strlen(key));
fprintf(stderr, "Id:%u\n", id);
//Destroy hash
cmph_destroy(hash);
cmph_io_nlfile_adapter_destroy(source);
fclose(keys_fd);
return 0;
}
```
Download [file_adapter_ex2.c examples/file_adapter_ex2.c] and [keys.txt examples/keys.txt]. This example does not work in versions below 0.8.
%!include: ALGORITHMS.t2t
%!include: FOOTER.t2t
%!include(html): ''GOOGLEANALYTICS.t2t''

38
deps/cmph/FAQ.t2t vendored Normal file

@@ -0,0 +1,38 @@
CMPH FAQ
%!includeconf: CONFIG.t2t
- How do I define the ids of the keys?
- You don't. The ids will be assigned by the algorithm creating the minimal
perfect hash function. If the algorithm creates an **ordered** minimal
perfect hash function, the ids will be the indices of the keys in the
input. Otherwise, you have no guarantee of the distribution of the ids.
- Why do I always get the error "Unable to create minimum perfect hashing function"?
- The algorithms do not guarantee that a minimal perfect hash function can
be created. In practice, it will always work if your input
is big enough (>100 keys).
The error is probably because you have duplicated
keys in the input. You must guarantee that the keys are unique in the
input. If you are using a UN*X based OS, try doing
``` #sort input.txt | uniq > input_uniq.txt
and run cmph with input_uniq.txt
- Why is the default hash function (jenkins) still executed after I change the
hash function with cmph_config_set_hashfuncs?
- You are probably calling the cmph_config_set_algo function after
the cmph_config_set_hashfuncs function. The default hash function
is restored whenever you call the cmph_config_set_algo function.
- What do I do when I get the following error?
- Error: **error while loading shared libraries: libcmph.so.0: cannot open shared object file: No such file or directory**
- Solution: type **export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib/** at the shell or put that shell command
in your .profile file or in the /etc/profile file.
%!include: ALGORITHMS.t2t
%!include: FOOTER.t2t
%!include(html): ''GOOGLEANALYTICS.t2t''

47
deps/cmph/FCH.t2t vendored Normal file

@@ -0,0 +1,47 @@
FCH Algorithm
%!includeconf: CONFIG.t2t
----------------------------------------
==The Algorithm==
The algorithm is presented in [[1 #papers]].
----------------------------------------
==Memory Consumption==
Now we detail the memory consumption to generate and to store minimal perfect hash functions
using the FCH algorithm. The structures responsible for the memory consumption are the
following:
- A vector containing all the //n// keys.
- Data structure to speed up the searching step:
+ **random_table**: is a vector used to remember currently empty slots in the hash table. It stores //n// 4 byte long integer numbers. This vector initially contains a random permutation of the //n// hash addresses. A pointer called filled_count is used to keep the invariant that any slots to the right side of filled_count (inclusive) are empty and any ones to the left are filled.
+ **hash_table**: Table used to check whether all the collisions were resolved. It has //n// entries of one byte.
+ **map_table**: For any unfilled slot //x// in hash_table, the map_table vector contains //n// 4 byte long pointers pointing at random_table such that random_table[map_table[x]] = x. Thus, given an empty slot x in the hash_table, we can locate its position in the random_table vector through map_table.
- Other auxiliary structures
+ **sorted_indexes**: is a vector of //cn/(log(n) + 1)// 4 byte long pointers to indirectly keep the buckets sorted by decreasing order of their sizes.
+ **function //g//**: is represented by a vector of //cn/(log(n) + 1)// 4 byte long integer numbers, one for each bucket. It is used to spread all the keys in a given bucket into the hash table without collisions.
Thus, the total memory consumption of the FCH algorithm for generating a minimal
perfect hash function (MPHF) is: //O(n) + 9n + 8cn/(log(n) + 1)// bytes.
The value of parameter //c// must be greater than or equal to 2.6.
Now we present the memory consumption to store the resulting function.
We only need to store the //g// function and a constant number of bytes for the seed of the hash functions used in the resulting MPHF. Thus, we need //cn/(log(n) + 1) + O(1)// bytes.
----------------------------------------
==Papers==[papers]
+ E.A. Fox, Q.F. Chen, and L.S. Heath. [A faster algorithm for constructing minimal perfect hash functions. papers/fch92.pdf] In Proc. 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 266-273, 1992.
%!include: ALGORITHMS.t2t
%!include: FOOTER.t2t
%!include(html): ''GOOGLEANALYTICS.t2t''

13
deps/cmph/FOOTER.t2t vendored Normal file

@@ -0,0 +1,13 @@
Enjoy!
[Davi de Castro Reis davi@users.sourceforge.net]
[Djamel Belazzougui db8192@users.sourceforge.net]
[Fabiano Cupertino Botelho fc_botelho@users.sourceforge.net]
[Nivio Ziviani nivio@dcc.ufmg.br]

9
deps/cmph/GOOGLEANALYTICS.t2t vendored Normal file

@@ -0,0 +1,9 @@
<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
try {
var pageTracker = _gat._getTracker("UA-7698683-2");
pageTracker._trackPageview();
} catch(err) {}</script>

39
deps/cmph/GPERF.t2t vendored Normal file

@@ -0,0 +1,39 @@
GPERF versus CMPH
%!includeconf: CONFIG.t2t
You might ask why cmph if [gperf http://www.gnu.org/software/gperf/gperf.html]
already works perfectly. Actually, gperf and cmph have different goals.
Basically, these are the requirements for each of them:
- GPERF
- Create very fast hash functions for **small** sets
- Create **perfect** hash functions
- CMPH
- Create very fast hash functions for **very large** sets
- Create **minimal perfect** hash functions
As a result, cmph can be used to create hash functions where gperf would run
forever without finding a perfect hash function, because of the running
time of the algorithm and the large memory usage.
On the other hand, functions created by cmph are about 2x slower than those
created by gperf.
So, if you have large sets, or memory usage is a key restriction for you, stick
to cmph. If you have small sets, and do not care about memory usage, go with
gperf. The first problem is common in the information retrieval field (e.g.
assigning ids to millions of documents), while the latter is usually found in
the compiler programming area (detecting reserved keywords).
%!include: ALGORITHMS.t2t
%!include: FOOTER.t2t
%!include(html): ''GOOGLEANALYTICS.t2t''


@@ -1,19 +1,14 @@
TurboNSS: a glibc NSS plugin
Copyright (C) 2022 Motiejus Jakštys
Most components of the "acl" package are licensed under
Version 2.1 of the GNU Lesser General Public License (see below).
below.
This library is free software; you can redistribute it and/or
modify it under the terms of the GNU Lesser General Public
License as published by the Free Software Foundation; either
version 2.1 of the License, or (at your option) any later version.
Some components (as annotated in the source) are licensed
under Version 2 of the GNU General Public License (see COPYING).
This library is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
Lesser General Public License for more details.
----------------------------------------------------------------------
GNU LESSER GENERAL PUBLIC LICENSE
Version 2.1, February 1999
GNU LESSER GENERAL PUBLIC LICENSE
Version 2.1, February 1999
Copyright (C) 1991, 1999 Free Software Foundation, Inc.
51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
@@ -24,7 +19,7 @@ Lesser General Public License for more details.
as the successor of the GNU Library Public License, version 2, hence
the version number 2.1.]
Preamble
Preamble
The licenses for most software are designed to take away your
freedom to share and change it. By contrast, the GNU General Public
@@ -126,7 +121,7 @@ modification follow. Pay close attention to the difference between a
former contains code derived from the library, whereas the latter must
be combined with the library in order to run.
GNU LESSER GENERAL PUBLIC LICENSE
GNU LESSER GENERAL PUBLIC LICENSE
TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
0. This License Agreement applies to any software library or other
@@ -160,7 +155,7 @@ such a program is covered only if its contents constitute a work based
on the Library (independent of the use of the Library in a tool for
writing it). Whether that is true depends on what the Library does
and what the program that uses the Library does.
1. You may copy and distribute verbatim copies of the Library's
complete source code as you receive it, in any medium, provided that
you conspicuously and appropriately publish on each copy an
@@ -446,7 +441,7 @@ decision will be guided by the two goals of preserving the free status
of all derivatives of our free software and of promoting the sharing
and reuse of software generally.
NO WARRANTY
NO WARRANTY
15. BECAUSE THE LIBRARY IS LICENSED FREE OF CHARGE, THERE IS NO
WARRANTY FOR THE LIBRARY, TO THE EXTENT PERMITTED BY APPLICABLE LAW.
@@ -469,4 +464,50 @@ FAILURE OF THE LIBRARY TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF
SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH
DAMAGES.
END OF TERMS AND CONDITIONS
END OF TERMS AND CONDITIONS
How to Apply These Terms to Your New Libraries
If you develop a new library, and you want it to be of the greatest
possible use to the public, we recommend making it free software that
everyone can redistribute and change. You can do so by permitting
redistribution under these terms (or, alternatively, under the terms of the
ordinary General Public License).
To apply these terms, attach the following notices to the library. It is
safest to attach them to the start of each source file to most effectively
convey the exclusion of warranty; and each file should have at least the
"copyright" line and a pointer to where the full notice is found.
<one line to give the library's name and a brief idea of what it does.>
Copyright (C) <year> <name of author>
This library is free software; you can redistribute it and/or
modify it under the terms of the GNU Lesser General Public
License as published by the Free Software Foundation; either
version 2.1 of the License, or (at your option) any later version.
This library is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
Lesser General Public License for more details.
You should have received a copy of the GNU Lesser General Public
License along with this library; if not, write to the Free Software
Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
Also add information on how to contact you by electronic and paper mail.
You should also get your employer (if you work as a programmer) or your
school, if any, to sign a "copyright disclaimer" for the library, if
necessary. Here is a sample; alter the names:
Yoyodyne, Inc., hereby disclaims all copyright interest in the
library `Frob' (a library for tweaking knobs) written by James Random Hacker.
<signature of Ty Coon>, 1 April 1990
Ty Coon, President of Vice
That's all there is to it!

1
deps/cmph/LOGO.t2t vendored Normal file

@@ -0,0 +1 @@
<a href="http://sourceforge.net"><img src="http://sourceforge.net/sflogo.php?group_id=96251&amp;type=1" width="88" height="31" border="0" alt="SourceForge.net Logo" /> </a>

469
deps/cmph/MPL-1.1 vendored Normal file

@@ -0,0 +1,469 @@
MOZILLA PUBLIC LICENSE
Version 1.1
---------------
1. Definitions.
1.0.1. "Commercial Use" means distribution or otherwise making the
Covered Code available to a third party.
1.1. "Contributor" means each entity that creates or contributes to
the creation of Modifications.
1.2. "Contributor Version" means the combination of the Original
Code, prior Modifications used by a Contributor, and the Modifications
made by that particular Contributor.
1.3. "Covered Code" means the Original Code or Modifications or the
combination of the Original Code and Modifications, in each case
including portions thereof.
1.4. "Electronic Distribution Mechanism" means a mechanism generally
accepted in the software development community for the electronic
transfer of data.
1.5. "Executable" means Covered Code in any form other than Source
Code.
1.6. "Initial Developer" means the individual or entity identified
as the Initial Developer in the Source Code notice required by Exhibit
A.
1.7. "Larger Work" means a work which combines Covered Code or
portions thereof with code not governed by the terms of this License.
1.8. "License" means this document.
1.8.1. "Licensable" means having the right to grant, to the maximum
extent possible, whether at the time of the initial grant or
subsequently acquired, any and all of the rights conveyed herein.
1.9. "Modifications" means any addition to or deletion from the
substance or structure of either the Original Code or any previous
Modifications. When Covered Code is released as a series of files, a
Modification is:
A. Any addition to or deletion from the contents of a file
containing Original Code or previous Modifications.
B. Any new file that contains any part of the Original Code or
previous Modifications.
1.10. "Original Code" means Source Code of computer software code
which is described in the Source Code notice required by Exhibit A as
Original Code, and which, at the time of its release under this
License is not already Covered Code governed by this License.
1.10.1. "Patent Claims" means any patent claim(s), now owned or
hereafter acquired, including without limitation, method, process,
and apparatus claims, in any patent Licensable by grantor.
1.11. "Source Code" means the preferred form of the Covered Code for
making modifications to it, including all modules it contains, plus
any associated interface definition files, scripts used to control
compilation and installation of an Executable, or source code
differential comparisons against either the Original Code or another
well known, available Covered Code of the Contributor's choice. The
Source Code can be in a compressed or archival form, provided the
appropriate decompression or de-archiving software is widely available
for no charge.
1.12. "You" (or "Your") means an individual or a legal entity
exercising rights under, and complying with all of the terms of, this
License or a future version of this License issued under Section 6.1.
For legal entities, "You" includes any entity which controls, is
controlled by, or is under common control with You. For purposes of
this definition, "control" means (a) the power, direct or indirect,
to cause the direction or management of such entity, whether by
contract or otherwise, or (b) ownership of more than fifty percent
(50%) of the outstanding shares or beneficial ownership of such
entity.
2. Source Code License.
2.1. The Initial Developer Grant.
The Initial Developer hereby grants You a world-wide, royalty-free,
non-exclusive license, subject to third party intellectual property
claims:
(a) under intellectual property rights (other than patent or
trademark) Licensable by Initial Developer to use, reproduce,
modify, display, perform, sublicense and distribute the Original
Code (or portions thereof) with or without Modifications, and/or
as part of a Larger Work; and
(b) under Patents Claims infringed by the making, using or
selling of Original Code, to make, have made, use, practice,
sell, and offer for sale, and/or otherwise dispose of the
Original Code (or portions thereof).
(c) the licenses granted in this Section 2.1(a) and (b) are
effective on the date Initial Developer first distributes
Original Code under the terms of this License.
(d) Notwithstanding Section 2.1(b) above, no patent license is
granted: 1) for code that You delete from the Original Code; 2)
separate from the Original Code; or 3) for infringements caused
by: i) the modification of the Original Code or ii) the
combination of the Original Code with other software or devices.
2.2. Contributor Grant.
Subject to third party intellectual property claims, each Contributor
hereby grants You a world-wide, royalty-free, non-exclusive license
(a) under intellectual property rights (other than patent or
trademark) Licensable by Contributor, to use, reproduce, modify,
display, perform, sublicense and distribute the Modifications
created by such Contributor (or portions thereof) either on an
unmodified basis, with other Modifications, as Covered Code
and/or as part of a Larger Work; and
(b) under Patent Claims infringed by the making, using, or
selling of Modifications made by that Contributor either alone
and/or in combination with its Contributor Version (or portions
of such combination), to make, use, sell, offer for sale, have
made, and/or otherwise dispose of: 1) Modifications made by that
Contributor (or portions thereof); and 2) the combination of
Modifications made by that Contributor with its Contributor
Version (or portions of such combination).
(c) the licenses granted in Sections 2.2(a) and 2.2(b) are
effective on the date Contributor first makes Commercial Use of
the Covered Code.
(d) Notwithstanding Section 2.2(b) above, no patent license is
granted: 1) for any code that Contributor has deleted from the
Contributor Version; 2) separate from the Contributor Version;
3) for infringements caused by: i) third party modifications of
Contributor Version or ii) the combination of Modifications made
by that Contributor with other software (except as part of the
Contributor Version) or other devices; or 4) under Patent Claims
infringed by Covered Code in the absence of Modifications made by
that Contributor.
3. Distribution Obligations.
3.1. Application of License.
The Modifications which You create or to which You contribute are
governed by the terms of this License, including without limitation
Section 2.2. The Source Code version of Covered Code may be
distributed only under the terms of this License or a future version
of this License released under Section 6.1, and You must include a
copy of this License with every copy of the Source Code You
distribute. You may not offer or impose any terms on any Source Code
version that alters or restricts the applicable version of this
License or the recipients' rights hereunder. However, You may include
an additional document offering the additional rights described in
Section 3.5.
3.2. Availability of Source Code.
Any Modification which You create or to which You contribute must be
made available in Source Code form under the terms of this License
either on the same media as an Executable version or via an accepted
Electronic Distribution Mechanism to anyone to whom you made an
Executable version available; and if made available via Electronic
Distribution Mechanism, must remain available for at least twelve (12)
months after the date it initially became available, or at least six
(6) months after a subsequent version of that particular Modification
has been made available to such recipients. You are responsible for
ensuring that the Source Code version remains available even if the
Electronic Distribution Mechanism is maintained by a third party.
3.3. Description of Modifications.
You must cause all Covered Code to which You contribute to contain a
file documenting the changes You made to create that Covered Code and
the date of any change. You must include a prominent statement that
the Modification is derived, directly or indirectly, from Original
Code provided by the Initial Developer and including the name of the
Initial Developer in (a) the Source Code, and (b) in any notice in an
Executable version or related documentation in which You describe the
origin or ownership of the Covered Code.
3.4. Intellectual Property Matters
(a) Third Party Claims.
If Contributor has knowledge that a license under a third party's
intellectual property rights is required to exercise the rights
granted by such Contributor under Sections 2.1 or 2.2,
Contributor must include a text file with the Source Code
distribution titled "LEGAL" which describes the claim and the
party making the claim in sufficient detail that a recipient will
know whom to contact. If Contributor obtains such knowledge after
the Modification is made available as described in Section 3.2,
Contributor shall promptly modify the LEGAL file in all copies
Contributor makes available thereafter and shall take other steps
(such as notifying appropriate mailing lists or newsgroups)
reasonably calculated to inform those who received the Covered
Code that new knowledge has been obtained.
(b) Contributor APIs.
If Contributor's Modifications include an application programming
interface and Contributor has knowledge of patent licenses which
are reasonably necessary to implement that API, Contributor must
also include this information in the LEGAL file.
(c) Representations.
Contributor represents that, except as disclosed pursuant to
Section 3.4(a) above, Contributor believes that Contributor's
Modifications are Contributor's original creation(s) and/or
Contributor has sufficient rights to grant the rights conveyed by
this License.
3.5. Required Notices.
You must duplicate the notice in Exhibit A in each file of the Source
Code. If it is not possible to put such notice in a particular Source
Code file due to its structure, then You must include such notice in a
location (such as a relevant directory) where a user would be likely
to look for such a notice. If You created one or more Modification(s)
You may add your name as a Contributor to the notice described in
Exhibit A. You must also duplicate this License in any documentation
for the Source Code where You describe recipients' rights or ownership
rights relating to Covered Code. You may choose to offer, and to
charge a fee for, warranty, support, indemnity or liability
obligations to one or more recipients of Covered Code. However, You
may do so only on Your own behalf, and not on behalf of the Initial
Developer or any Contributor. You must make it absolutely clear than
any such warranty, support, indemnity or liability obligation is
offered by You alone, and You hereby agree to indemnify the Initial
Developer and every Contributor for any liability incurred by the
Initial Developer or such Contributor as a result of warranty,
support, indemnity or liability terms You offer.
3.6. Distribution of Executable Versions.
You may distribute Covered Code in Executable form only if the
requirements of Section 3.1-3.5 have been met for that Covered Code,
and if You include a notice stating that the Source Code version of
the Covered Code is available under the terms of this License,
including a description of how and where You have fulfilled the
obligations of Section 3.2. The notice must be conspicuously included
in any notice in an Executable version, related documentation or
collateral in which You describe recipients' rights relating to the
Covered Code. You may distribute the Executable version of Covered
Code or ownership rights under a license of Your choice, which may
contain terms different from this License, provided that You are in
compliance with the terms of this License and that the license for the
Executable version does not attempt to limit or alter the recipient's
rights in the Source Code version from the rights set forth in this
License. If You distribute the Executable version under a different
license You must make it absolutely clear that any terms which differ
from this License are offered by You alone, not by the Initial
Developer or any Contributor. You hereby agree to indemnify the
Initial Developer and every Contributor for any liability incurred by
the Initial Developer or such Contributor as a result of any such
terms You offer.
3.7. Larger Works.
You may create a Larger Work by combining Covered Code with other code
not governed by the terms of this License and distribute the Larger
Work as a single product. In such a case, You must make sure the
requirements of this License are fulfilled for the Covered Code.
4. Inability to Comply Due to Statute or Regulation.
If it is impossible for You to comply with any of the terms of this
License with respect to some or all of the Covered Code due to
statute, judicial order, or regulation then You must: (a) comply with
the terms of this License to the maximum extent possible; and (b)
describe the limitations and the code they affect. Such description
must be included in the LEGAL file described in Section 3.4 and must
be included with all distributions of the Source Code. Except to the
extent prohibited by statute or regulation, such description must be
sufficiently detailed for a recipient of ordinary skill to be able to
understand it.
5. Application of this License.
This License applies to code to which the Initial Developer has
attached the notice in Exhibit A and to related Covered Code.
6. Versions of the License.
6.1. New Versions.
Netscape Communications Corporation ("Netscape") may publish revised
and/or new versions of the License from time to time. Each version
will be given a distinguishing version number.
6.2. Effect of New Versions.
Once Covered Code has been published under a particular version of the
License, You may always continue to use it under the terms of that
version. You may also choose to use such Covered Code under the terms
of any subsequent version of the License published by Netscape. No one
other than Netscape has the right to modify the terms applicable to
Covered Code created under this License.
6.3. Derivative Works.
If You create or use a modified version of this License (which you may
only do in order to apply it to code which is not already Covered Code
governed by this License), You must (a) rename Your license so that
the phrases "Mozilla", "MOZILLAPL", "MOZPL", "Netscape",
"MPL", "NPL" or any confusingly similar phrase do not appear in your
license (except to note that your license differs from this License)
and (b) otherwise make it clear that Your version of the license
contains terms which differ from the Mozilla Public License and
Netscape Public License. (Filling in the name of the Initial
Developer, Original Code or Contributor in the notice described in
Exhibit A shall not of themselves be deemed to be modifications of
this License.)
7. DISCLAIMER OF WARRANTY.
COVERED CODE IS PROVIDED UNDER THIS LICENSE ON AN "AS IS" BASIS,
WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING,
WITHOUT LIMITATION, WARRANTIES THAT THE COVERED CODE IS FREE OF
DEFECTS, MERCHANTABLE, FIT FOR A PARTICULAR PURPOSE OR NON-INFRINGING.
THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE COVERED CODE
IS WITH YOU. SHOULD ANY COVERED CODE PROVE DEFECTIVE IN ANY RESPECT,
YOU (NOT THE INITIAL DEVELOPER OR ANY OTHER CONTRIBUTOR) ASSUME THE
COST OF ANY NECESSARY SERVICING, REPAIR OR CORRECTION. THIS DISCLAIMER
OF WARRANTY CONSTITUTES AN ESSENTIAL PART OF THIS LICENSE. NO USE OF
ANY COVERED CODE IS AUTHORIZED HEREUNDER EXCEPT UNDER THIS DISCLAIMER.
8. TERMINATION.
8.1. This License and the rights granted hereunder will terminate
automatically if You fail to comply with terms herein and fail to cure
such breach within 30 days of becoming aware of the breach. All
sublicenses to the Covered Code which are properly granted shall
survive any termination of this License. Provisions which, by their
nature, must remain in effect beyond the termination of this License
shall survive.
8.2. If You initiate litigation by asserting a patent infringement
claim (excluding declatory judgment actions) against Initial Developer
or a Contributor (the Initial Developer or Contributor against whom
You file such action is referred to as "Participant") alleging that:
(a) such Participant's Contributor Version directly or indirectly
infringes any patent, then any and all rights granted by such
Participant to You under Sections 2.1 and/or 2.2 of this License
shall, upon 60 days notice from Participant terminate prospectively,
unless if within 60 days after receipt of notice You either: (i)
agree in writing to pay Participant a mutually agreeable reasonable
royalty for Your past and future use of Modifications made by such
Participant, or (ii) withdraw Your litigation claim with respect to
the Contributor Version against such Participant. If within 60 days
of notice, a reasonable royalty and payment arrangement are not
mutually agreed upon in writing by the parties or the litigation claim
is not withdrawn, the rights granted by Participant to You under
Sections 2.1 and/or 2.2 automatically terminate at the expiration of
the 60 day notice period specified above.
(b) any software, hardware, or device, other than such Participant's
Contributor Version, directly or indirectly infringes any patent, then
any rights granted to You by such Participant under Sections 2.1(b)
and 2.2(b) are revoked effective as of the date You first made, used,
sold, distributed, or had made, Modifications made by that
Participant.
8.3. If You assert a patent infringement claim against Participant
alleging that such Participant's Contributor Version directly or
indirectly infringes any patent where such claim is resolved (such as
by license or settlement) prior to the initiation of patent
infringement litigation, then the reasonable value of the licenses
granted by such Participant under Sections 2.1 or 2.2 shall be taken
into account in determining the amount or value of any payment or
license.
8.4. In the event of termination under Sections 8.1 or 8.2 above,
all end user license agreements (excluding distributors and resellers)
which have been validly granted by You or any distributor hereunder
prior to termination shall survive termination.
9. LIMITATION OF LIABILITY.
UNDER NO CIRCUMSTANCES AND UNDER NO LEGAL THEORY, WHETHER TORT
(INCLUDING NEGLIGENCE), CONTRACT, OR OTHERWISE, SHALL YOU, THE INITIAL
DEVELOPER, ANY OTHER CONTRIBUTOR, OR ANY DISTRIBUTOR OF COVERED CODE,
OR ANY SUPPLIER OF ANY OF SUCH PARTIES, BE LIABLE TO ANY PERSON FOR
ANY INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES OF ANY
CHARACTER INCLUDING, WITHOUT LIMITATION, DAMAGES FOR LOSS OF GOODWILL,
WORK STOPPAGE, COMPUTER FAILURE OR MALFUNCTION, OR ANY AND ALL OTHER
COMMERCIAL DAMAGES OR LOSSES, EVEN IF SUCH PARTY SHALL HAVE BEEN
INFORMED OF THE POSSIBILITY OF SUCH DAMAGES. THIS LIMITATION OF
LIABILITY SHALL NOT APPLY TO LIABILITY FOR DEATH OR PERSONAL INJURY
RESULTING FROM SUCH PARTY'S NEGLIGENCE TO THE EXTENT APPLICABLE LAW
PROHIBITS SUCH LIMITATION. SOME JURISDICTIONS DO NOT ALLOW THE
EXCLUSION OR LIMITATION OF INCIDENTAL OR CONSEQUENTIAL DAMAGES, SO
THIS EXCLUSION AND LIMITATION MAY NOT APPLY TO YOU.
10. U.S. GOVERNMENT END USERS.
The Covered Code is a "commercial item," as that term is defined in
48 C.F.R. 2.101 (Oct. 1995), consisting of "commercial computer
software" and "commercial computer software documentation," as such
terms are used in 48 C.F.R. 12.212 (Sept. 1995). Consistent with 48
C.F.R. 12.212 and 48 C.F.R. 227.7202-1 through 227.7202-4 (June 1995),
all U.S. Government End Users acquire Covered Code with only those
rights set forth herein.
11. MISCELLANEOUS.
This License represents the complete agreement concerning subject
matter hereof. If any provision of this License is held to be
unenforceable, such provision shall be reformed only to the extent
necessary to make it enforceable. This License shall be governed by
California law provisions (except to the extent applicable law, if
any, provides otherwise), excluding its conflict-of-law provisions.
With respect to disputes in which at least one party is a citizen of,
or an entity chartered or registered to do business in the United
States of America, any litigation relating to this License shall be
subject to the jurisdiction of the Federal Courts of the Northern
District of California, with venue lying in Santa Clara County,
California, with the losing party responsible for costs, including
without limitation, court costs and reasonable attorneys' fees and
expenses. The application of the United Nations Convention on
Contracts for the International Sale of Goods is expressly excluded.
Any law or regulation which provides that the language of a contract
shall be construed against the drafter shall not apply to this
License.
12. RESPONSIBILITY FOR CLAIMS.
As between Initial Developer and the Contributors, each party is
responsible for claims and damages arising, directly or indirectly,
out of its utilization of rights under this License and You agree to
work with Initial Developer and Contributors to distribute such
responsibility on an equitable basis. Nothing herein is intended or
shall be deemed to constitute any admission of liability.
13. MULTIPLE-LICENSED CODE.
Initial Developer may designate portions of the Covered Code as
"Multiple-Licensed". "Multiple-Licensed" means that the Initial
Developer permits you to utilize portions of the Covered Code under
Your choice of the NPL or the alternative licenses, if any, specified
by the Initial Developer in the file described in Exhibit A.
EXHIBIT A -Mozilla Public License.
``The contents of this file are subject to the Mozilla Public License
Version 1.1 (the "License"); you may not use this file except in
compliance with the License. You may obtain a copy of the License at
http://www.mozilla.org/MPL/
Software distributed under the License is distributed on an "AS IS"
basis, WITHOUT WARRANTY OF ANY KIND, either express or implied. See the
License for the specific language governing rights and limitations
under the License.
The Original Code is ______________________________________.
The Initial Developer of the Original Code is ________________________.
Portions created by ______________________ are Copyright (C) ______
_______________________. All Rights Reserved.
Contributor(s): ______________________________________.
Alternatively, the contents of this file may be used under the terms
of the _____ license (the "[___] License"), in which case the
provisions of [______] License are applicable instead of those
above. If you wish to allow use of your version of this file only
under the terms of the [____] License and not to allow others to use
your version of this file under the MPL, indicate your decision by
deleting the provisions above and replace them with the notice and
other provisions required by the [___] License. If you do not delete
the provisions above, a recipient may use your version of this file
under either the MPL or the [___] License."
[NOTE: The text of this Exhibit A may differ slightly from the text of
the notices in the Source Code files of the Original Code. You should
use the text of this Exhibit A rather than the text found in the
Original Code Source Code for Your Modifications.]

9
deps/cmph/Makefile.am vendored Normal file

@@ -0,0 +1,9 @@
SUBDIRS = src tests examples man $(CXXMPH)
EXTRA_DIST = cmph.spec configure.ac cmph.pc.in cxxmph.pc.in LGPL-2 MPL-1.1
pkgconfig_DATA = cmph.pc
if USE_CXXMPH
pkgconfig_DATA += cxxmph.pc
endif
ACLOCAL_AMFLAGS="-Im4"
pkgconfigdir = $(libdir)/pkgconfig

0
deps/cmph/NEWS vendored Normal file

85
deps/cmph/NEWSLOG.t2t vendored Normal file

@@ -0,0 +1,85 @@
News Log
%!includeconf: CONFIG.t2t
----------------------------------------
==News for version 1.1==
Fixed a bug in the chd_pc algorithm and reorganized tests.
==News for version 1.0==
This is a bugfix only version, after which a revamp of the cmph code and
algorithms will be done.
----------------------------------------
==News for version 0.9==
- [The CHD algorithm chd.html], which is an algorithm that can be tuned to generate MPHFs that require approximately 2.07 bits per key to be stored. The algorithm outperforms [the BDZ algorithm bdz.html] and therefore is the fastest one available in the literature for sets that can be treated in internal memory.
- [The CHD_PH algorithm chd.html], which is an algorithm to generate PHFs with load factor up to //99 %//. It is actually the CHD algorithm without the ranking step. If we set the load factor to //81 %//, which is the maximum that can be obtained with [the BDZ algorithm bdz.html], the resulting functions can be stored in //1.40// bits per key. The space requirement increases with the load factor.
- All reported bugs and suggestions have been corrected and included as well.
----------------------------------------
==News for version 0.8==
- [An algorithm to generate MPHFs that require around 2.6 bits per key to be stored bdz.html], which is referred to as BDZ algorithm. The algorithm is the fastest one available in the literature for sets that can be treated in internal memory.
- [An algorithm to generate PHFs with range m = cn, for c > 1.22 bdz.html], which is referred to as BDZ_PH algorithm. It is actually the BDZ algorithm without the ranking step. The resulting functions can be stored in 1.95 bits per key for //c = 1.23// and are considerably faster than the MPHFs generated by the BDZ algorithm.
- An adapter to support a vector of struct as the source of keys has been added.
- An API to support the ability of packing a perfect hash function into a preallocated contiguous memory space. The computation of a packed function is still faster and can be easily mmapped.
- The hash functions djb2, fnv and sdbm were removed because they do not use random seeds and therefore are not useful for MPHF algorithms.
- All reported bugs and suggestions have been corrected and included as well.
----------------------------------------
==News for version 0.7==
- Added man pages and a pkgconfig file.
----------------------------------------
==News for version 0.6==
- [An algorithm to generate MPHFs that require less than 4 bits per key to be stored fch.html], which is referred to as FCH algorithm. The algorithm is only efficient for small sets.
- The FCH algorithm is integrated with [BRZ algorithm brz.html] so that you will be able to efficiently generate space-efficient MPHFs for sets on the order of billions of keys.
- All reported bugs and suggestions have been corrected and included as well.
----------------------------------------
==News for version 0.5==
- A thread safe vector adapter has been added.
- [A new algorithm for sets on the order of billions of keys that requires approximately 8.1 bits per key to store the resulting MPHFs. brz.html]
- All reported bugs and suggestions have been corrected and included as well.
----------------------------------------
==News for version 0.4==
- Vector Adapter has been added.
- An optimized version of bmz (bmz8) for small sets of keys (at most 256 keys) has been added.
- All reported bugs and suggestions have been corrected and included as well.
----------------------------------------
==News for version 0.3==
- A new heuristic added to the bmz algorithm makes it possible to generate an MPHF using only
//24.80n + O(1)// bytes. The resulting function can be stored in //3.72n// bytes.
%html% [click here bmz.html#heuristic] for details.
%!include: ALGORITHMS.t2t
%!include: FOOTER.t2t
%!include(html): ''GOOGLEANALYTICS.t2t''

326
deps/cmph/README vendored Normal file

@@ -0,0 +1,326 @@
CMPH - C Minimal Perfect Hashing Library
-------------------------------------------------------------------
Motivation
==========
A perfect hash function maps a static set of n keys into a set of m integer numbers without collisions, where m is greater than or equal to n. If m is equal to n, the function is called minimal.
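As a minimal sketch of this definition (not part of CMPH; toy_hash, is_perfect and is_minimal_perfect are hypothetical names chosen for illustration), the following C code checks whether a candidate function is perfect, or minimal perfect, over a static key set:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical toy hash, for illustration only: key length modulo m. */
static unsigned toy_hash(const char *key, unsigned m)
{
    return (unsigned)(strlen(key) % m);
}

/*
 * Returns 1 if h maps the n keys to n distinct values in [0, m), else 0.
 * Also returns 0 if the scratch buffer cannot be allocated.
 */
static int is_perfect(const char *const *keys, unsigned n, unsigned m,
                      unsigned (*h)(const char *, unsigned))
{
    unsigned char *seen;
    unsigned i;
    int ok = 1;

    assert(keys != NULL && h != NULL);
    if (m == 0 || m < n)
        return 0; /* fewer slots than keys: a collision is unavoidable */

    seen = calloc(m, 1);
    if (seen == NULL)
        return 0;

    for (i = 0; i < n; i++) {
        unsigned v = h(keys[i], m);
        if (v >= m || seen[v]) { /* out of range, or slot already taken */
            ok = 0;
            break;
        }
        seen[v] = 1;
    }
    free(seen);
    return ok;
}

/* A perfect hash function is minimal when m equals n. */
static int is_minimal_perfect(const char *const *keys, unsigned n,
                              unsigned (*h)(const char *, unsigned))
{
    return is_perfect(keys, n, n, h);
}
```

With the keys "if", "for", "else" and "while", toy_hash yields 2, 3, 0 and 1 for m = 4, so is_minimal_perfect(keys, 4, toy_hash) returns 1. Adding "int" makes it collide with "for", and the check returns 0.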
Minimal perfect hash functions (concepts.html) are widely used for memory efficient storage and fast retrieval of items from static sets, such as words in natural languages, reserved words in programming languages or interactive systems, uniform resource locators (URLs) in Web search engines, or item sets in data mining techniques. Therefore, there are applications for minimal perfect hash functions in information retrieval systems, database systems, language translation systems, electronic commerce systems, compilers, and operating systems, among others.
Until now, the use of minimal perfect hash functions has been restricted to scenarios where the set of keys being hashed is small, because of the limitations of current algorithms. In many cases, however, dealing with huge sets of keys is crucial. This project therefore gives the free software community an API that will work with sets on the order of billions of keys.
Probably the most interesting application of minimal perfect hash functions is their use as an indexing structure for databases. The most popular indexing structure in databases is the B+ tree, which is widely used for dynamic applications with frequent insertions and deletions of records. However, for applications with sporadic modifications and a huge number of queries the B+ tree is not the best option, because practical deployments of this structure are extremely complex and perform poorly with very large sets of keys, such as those required for the new frontiers of database applications (http://acmqueue.com/modules.php?name=Content&pa=showpage&pid=299).
For example, in the information retrieval field, working with huge collections is a daily task. Simply assigning ids to the web pages of a collection can be challenging. While traditional databases cannot handle more traffic once the working set of web page URLs no longer fits in main memory, minimal perfect hash functions can easily scale to hundreds of millions of entries using stock hardware.
As there are lots of applications for minimal perfect hash functions, it is important to implement memory and time efficient algorithms for constructing such functions. The lack of similar libraries in the free software world has been the main motivation to create the C Minimal Perfect Hashing Library (gperf is a bit different (gperf.html), since it was conceived to create very fast perfect hash functions for small sets of keys and CMPH Library was conceived to create minimal perfect hash functions for very large sets of keys). C Minimal Perfect Hashing Library is a portable LGPLed library to generate and to work with very efficient minimal perfect hash functions.
-------------------------------------------------------------------
Description
===========
The CMPH Library encapsulates the newest and most efficient algorithms in an easy-to-use, production-quality, fast API. The library was designed to work with large key sets that cannot fit in main memory. It has been used successfully for constructing minimal perfect hash functions for sets with more than 100 million keys, and we intend to expand this number to the order of billions of keys. Although there is a lack of similar libraries, we can point out some of the distinguishing features of the CMPH Library:
- Fast.
- Space-efficient with main memory usage carefully documented.
- The best modern algorithms are available (or at least scheduled for implementation :-)).
- Works with on-disk key sets through the adapter pattern.
- Serialization of hash functions.
- Portable C code (currently works on GNU/Linux and WIN32 and is reported to work on OpenBSD and Solaris).
- Object oriented implementation.
- Easily extensible.
- Well-encapsulated API aiming at binary compatibility across releases.
- Free Software.
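
To illustrate the adapter pattern mentioned in the list above, here is a minimal sketch in C. It is not CMPH's actual API: key_source, array_state, array_source and total_key_bytes are hypothetical names. The idea is that a consumer reads keys only through two callbacks, so an in-memory array and an on-disk file can be swapped without changing the consumer:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical key-source interface (illustrative, not CMPH's types). */
typedef struct key_source {
    void *state;     /* adapter-specific data */
    unsigned nkeys;  /* number of keys the source provides */
    /* Stores the next key and its length; returns 0 when no keys remain. */
    int (*next)(struct key_source *src, const char **key, size_t *len);
    /* Restarts iteration from the first key. */
    void (*rewind)(struct key_source *src);
} key_source;

/* State for an adapter over an in-memory array of C strings. */
typedef struct {
    const char *const *keys;
    unsigned pos;
} array_state;

static int array_next(key_source *src, const char **key, size_t *len)
{
    array_state *st = src->state;
    if (st->pos >= src->nkeys)
        return 0;
    *key = st->keys[st->pos++];
    *len = strlen(*key);
    return 1;
}

static void array_rewind(key_source *src)
{
    array_state *st = src->state;
    st->pos = 0;
}

/* Builds a key source backed by the caller-owned array and state. */
static key_source array_source(array_state *st, const char *const *keys,
                               unsigned nkeys)
{
    key_source src;
    assert(st != NULL && keys != NULL);
    st->keys = keys;
    st->pos = 0;
    src.state = st;
    src.nkeys = nkeys;
    src.next = array_next;
    src.rewind = array_rewind;
    return src;
}

/* Example consumer: it sees only the interface, never the storage. */
static size_t total_key_bytes(key_source *src)
{
    const char *key;
    size_t len, total = 0;

    src->rewind(src);
    while (src->next(src, &key, &len))
        total += len;
    return total;
}
```

A file-backed adapter would supply its own next and rewind callbacks (for example, reading one key per line and seeking back to the start), and total_key_bytes would work with it unchanged.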
----------------------------------------
Supported Algorithms
====================
- CHD Algorithm:
- It is the fastest algorithm to build PHFs and MPHFs in linear time.
- It generates the most compact PHFs and MPHFs we know of.
- It can generate PHFs with a load factor up to 99 %.
- It can be used to generate t-perfect hash functions. A t-perfect hash function allows at most t collisions in a given bin. It is a well-known fact that modern memories are organized as blocks that constitute the unit of transfer. Examples of such blocks are cache lines for internal memory or sectors for hard disks. Thus, it can be very useful for devices that carry out I/O operations in blocks.
- It is a two-level scheme. It uses a first-level hash function to split the key set into buckets of average size determined by a parameter b in the range [1,32]. In the second level it uses displacement values to resolve the collisions that have given rise to the buckets.
- It can generate MPHFs that can be stored in approximately 2.07 bits per key.
- For a load factor equal to the maximum one that is achieved by the BDZ algorithm (81 %), the resulting PHFs are stored in approximately 1.40 bits per key.
- BDZ Algorithm:
- It is very simple and efficient. It outperforms all the ones below.
- It constructs both PHFs and MPHFs in linear time.
- The maximum load factor one can achieve for a PHF is 1/1.23.
- It is based on acyclic random 3-graphs. A 3-graph is a generalization of a graph where each edge connects 3 vertices instead of only 2.
- The resulting MPHFs are not order preserving.
- The resulting MPHFs can be stored in only (2 + x)cn bits, where c should be larger than or equal to 1.23 and x is a constant larger than 0 (actually, x = 1/b and b is a parameter that should be larger than 2). For c = 1.23 and b = 8, the resulting functions are stored in approximately 2.6 bits per key.
- For its maximum load factor (81 %), the resulting PHFs are stored in approximately 1.95 bits per key.
- BMZ Algorithm:
- Construct MPHFs in linear time.
- It is based on cyclic random graphs. This makes it faster than the CHM algorithm.
- The resulting MPHFs are not order preserving.
- The resulting MPHFs are more compact than the ones generated by the CHM algorithm and can be stored in 4cn bytes, where c is in the range [0.93,1.15].
- BRZ Algorithm:
- A very fast external memory based algorithm for constructing minimal perfect hash functions for sets in the order of billions of keys.
- It works in linear time.
- The resulting MPHFs are not order preserving.
- The resulting MPHFs can be stored using less than 8.0 bits per key.
- CHM Algorithm:
- Construct minimal MPHFs in linear time.
- It is based on acyclic random graphs
- The resulting MPHFs are order preserving.
- The resulting MPHFs are stored in 4cn bytes, where c is greater than 2.
- FCH Algorithm:
- Construct minimal perfect hash functions that require less than 4 bits per key to be stored.
- The resulting MPHFs are very compact and very efficient at evaluation time
- The algorithm is only efficient for small sets.
- It is used as internal algorithm in the BRZ algorithm to efficiently solve larger problems and even so to generate MPHFs that require approximately 4.1 bits per key to be stored. For that, you just need to set the parameters -a to brz and -c to a value larger than or equal to 2.6.
----------------------------------------
News for version 2.0
====================
Cleaned up most warnings for the C code.
Experimental C++ interface (--enable-cxxmph) implementing the BDZ algorithm in
a convenient interface, which serves as the basis
for drop-in replacements for std::unordered_map, sparsehash::sparse_hash_map
and sparsehash::dense_hash_map. Potentially faster lookup time at the expense
of insertion time. See cxxmph/mph_map.h and cxxmph/mph_index.h for details.
News for version 1.1
====================
Fixed a bug in the chd_pc algorithm and reorganized tests.
News for version 1.0
====================
This is a bugfix only version, after which a revamp of the cmph code and
algorithms will be done.
News for version 0.9
====================
- The CHD algorithm (chd.html), which is an algorithm that can be tuned to generate MPHFs that require approximately 2.07 bits per key to be stored. The algorithm outperforms the BDZ algorithm (bdz.html) and therefore is the fastest one available in the literature for sets that can be treated in internal memory.
- The CHD_PH algorithm (chd.html), which is an algorithm to generate PHFs with load factor up to 99 %. It is actually the CHD algorithm without the ranking step. If we set the load factor to 81 %, which is the maximum that can be obtained with the BDZ algorithm (bdz.html), the resulting functions can be stored in 1.40 bits per key. The space requirement increases with the load factor.
- All reported bugs have been fixed and the suggestions incorporated.
News for version 0.8
====================
- An algorithm to generate MPHFs that require around 2.6 bits per key to be stored (bdz.html), which is referred to as BDZ algorithm. The algorithm is the fastest one available in the literature for sets that can be treated in internal memory.
- An algorithm to generate PHFs with range m = cn, for c > 1.22 (bdz.html), which is referred to as BDZ_PH algorithm. It is actually the BDZ algorithm without the ranking step. The resulting functions can be stored in 1.95 bits per key for c = 1.23 and are considerably faster than the MPHFs generated by the BDZ algorithm.
- An adapter to support a vector of struct as the source of keys has been added.
- An API for packing a perfect hash function into a preallocated contiguous memory space. A packed function is even faster to evaluate and can easily be mmapped.
- The hash functions djb2, fnv and sdbm were removed because they do not use random seeds and therefore are not useful for MPHF algorithms.
- All reported bugs have been fixed and the suggestions incorporated.
News log (newslog.html)
----------------------------------------
Examples
========
Using cmph is quite simple. Take a look.
#include <cmph.h>
#include <stdio.h>
#include <string.h>
// Create minimal perfect hash function from in-memory vector
int main(int argc, char **argv)
{
    // Creating a filled vector
    unsigned int i = 0;
    const char *vector[] = {"aaaaaaaaaa", "bbbbbbbbbb", "cccccccccc", "dddddddddd", "eeeeeeeeee",
        "ffffffffff", "gggggggggg", "hhhhhhhhhh", "iiiiiiiiii", "jjjjjjjjjj"};
    unsigned int nkeys = 10;
    FILE* mphf_fd = fopen("temp.mph", "w");
    // Source of keys
    cmph_io_adapter_t *source = cmph_io_vector_adapter((char **)vector, nkeys);
    //Create minimal perfect hash function using the brz algorithm.
    cmph_config_t *config = cmph_config_new(source);
    cmph_config_set_algo(config, CMPH_BRZ);
    cmph_config_set_mphf_fd(config, mphf_fd);
    cmph_t *hash = cmph_new(config);
    cmph_config_destroy(config);
    cmph_dump(hash, mphf_fd);
    cmph_destroy(hash);
    fclose(mphf_fd);
    //Find key
    mphf_fd = fopen("temp.mph", "r");
    hash = cmph_load(mphf_fd);
    while (i < nkeys) {
        const char *key = vector[i];
        unsigned int id = cmph_search(hash, key, (cmph_uint32)strlen(key));
        fprintf(stderr, "key:%s -- hash:%u\n", key, id);
        i++;
    }
    //Destroy hash
    cmph_destroy(hash);
    cmph_io_vector_adapter_destroy(source);
    fclose(mphf_fd);
    return 0;
}
Download vector_adapter_ex1.c (examples/vector_adapter_ex1.c). This example does not work in versions below 0.6. You need to update the sources from GIT to make it work.
-------------------------------
#include <cmph.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
// Create minimal perfect hash function from in-disk keys using BDZ algorithm
int main(int argc, char **argv)
{
    //Open file with newline separated list of keys
    FILE * keys_fd = fopen("keys.txt", "r");
    cmph_t *hash = NULL;
    if (keys_fd == NULL)
    {
        fprintf(stderr, "File \"keys.txt\" not found\n");
        exit(1);
    }
    // Source of keys
    cmph_io_adapter_t *source = cmph_io_nlfile_adapter(keys_fd);
    cmph_config_t *config = cmph_config_new(source);
    cmph_config_set_algo(config, CMPH_BDZ);
    hash = cmph_new(config);
    cmph_config_destroy(config);
    //Find key
    const char *key = "jjjjjjjjjj";
    unsigned int id = cmph_search(hash, key, (cmph_uint32)strlen(key));
    fprintf(stderr, "Id:%u\n", id);
    //Destroy hash
    cmph_destroy(hash);
    cmph_io_nlfile_adapter_destroy(source);
    fclose(keys_fd);
    return 0;
}
Download file_adapter_ex2.c (examples/file_adapter_ex2.c) and keys.txt (examples/keys.txt). This example does not work in versions below 0.8. You need to update the sources from GIT to make it work.
Click here to see more examples (examples.html)
--------------------------------------
The cmph application
====================
cmph is the name of both the library and the utility
application that comes with this package. You can use the cmph
application for constructing minimal perfect hash functions from the command line.
The cmph utility
comes with a number of flags, but it is very simple to create and to query
minimal perfect hash functions:
$ # Using the chm algorithm (default one) for constructing a mphf for keys in file keys_file
$ ./cmph -g keys_file
$ # Query id of keys in the file keys_query
$ ./cmph -m keys_file.mph keys_query
The additional options let you set most of the parameters you have
available through the C API. Below you can see the full help message for the
utility.
usage: cmph [-v] [-h] [-V] [-k nkeys] [-f hash_function] [-g [-c algorithm_dependent_value][-s seed] ]
            [-a algorithm] [-M memory_in_MB] [-b algorithm_dependent_value] [-t keys_per_bin] [-d tmp_dir]
            [-m file.mph] keysfile
Minimum perfect hashing tool
  -h   print this help message
  -c   c value determines:
        * the number of vertices in the graph for the algorithms BMZ and CHM
        * the number of bits per key required in the FCH algorithm
        * the load factor in the CHD_PH algorithm
  -a   algorithm - valid values are
        * bmz
        * bmz8
        * chm
        * brz
        * fch
        * bdz
        * bdz_ph
        * chd_ph
        * chd
  -f   hash function (may be used multiple times) - valid values are
        * jenkins
  -V   print version number and exit
  -v   increase verbosity (may be used multiple times)
  -k   number of keys
  -g   generation mode
  -s   random seed
  -m   minimum perfect hash function file
  -M   main memory availability (in MB) used in BRZ algorithm
  -d   temporary directory used in BRZ algorithm
  -b   the meaning of this parameter depends on the algorithm selected in the -a option:
        * For BRZ it is used to make the maximal number of keys in a bucket lower than 256.
          In this case its value should be an integer in the range [64,175]. Default is 128.
        * For BDZ it is used to determine the size of some precomputed rank
          information and its value should be an integer in the range [3,10]. Default
          is 7. The larger this value, the more compact the resulting functions
          and the slower they are at evaluation time.
        * For CHD and CHD_PH it is used to set the average number of keys per bucket
          and its value should be an integer in the range [1,32]. Default is 4. The
          larger this value, the slower the construction of the functions.
          This parameter has no effect for other algorithms.
  -t   set the number of keys per bin for a t-perfect hashing function. A t-perfect
       hash function allows at most t collisions in a given bin. This parameter applies
       only to the CHD and CHD_PH algorithms. Its value should be an integer in the
       range [1,128]. Default is 1
  keysfile   line separated file with keys
Additional Documentation
========================
FAQ (faq.html)
Downloads
=========
Use the github releases page at: https://github.com/bonitao/cmph/releases
License Stuff
=============
Code is under the LGPL and the MPL 1.1.
----------------------------------------
Enjoy!
Davi de Castro Reis (davi@users.sourceforge.net)
Djamel Belazzougui (db8192@users.sourceforge.net)
Fabiano Cupertino Botelho (fc_botelho@users.sourceforge.net)
Nivio Ziviani (nivio@dcc.ufmg.br)
Last Updated: Fri Dec 28 23:50:31 2018

1
deps/cmph/README.md vendored Normal file

@@ -0,0 +1 @@
See http://cmph.sf.net

315
deps/cmph/README.t2t vendored Normal file

@@ -0,0 +1,315 @@
CMPH - C Minimal Perfect Hashing Library
%!includeconf: CONFIG.t2t
-------------------------------------------------------------------
==Motivation==
A perfect hash function maps a static set of n keys into a set of m integer numbers without collisions, where m is greater than or equal to n. If m is equal to n, the function is called minimal.
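As a minimal sketch of this definition (not part of the CMPH API; the function names are hypothetical), the following C helpers check whether a list of n hash values describes a perfect hash into m slots, and whether that hash is minimal:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Returns true if the n hash values in `h` are pairwise distinct and all
 * lie in [0, m), i.e. they describe a perfect hash into m slots. */
bool is_perfect_hash(const unsigned *h, size_t n, unsigned m)
{
    if (m == 0)
        return n == 0;
    /* One "seen" flag per slot of the range. */
    bool *seen = calloc(m, sizeof *seen);
    if (seen == NULL)
        return false; /* out of memory: report failure conservatively */
    bool ok = true;
    for (size_t i = 0; i < n && ok; i++) {
        if (h[i] >= m || seen[h[i]])
            ok = false; /* value out of range, or a collision */
        else
            seen[h[i]] = true;
    }
    free(seen);
    return ok;
}

/* A perfect hash is minimal when the range has exactly n slots. */
bool is_minimal_perfect_hash(const unsigned *h, size_t n, unsigned m)
{
    return m == n && is_perfect_hash(h, n, m);
}
```

For instance, the values {2, 0, 3, 1} for four keys form a minimal perfect hash when m = 4 and a perfect but non-minimal hash when m = 6, while {2, 0, 2, 1} is not perfect because two keys share slot 2.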
[Minimal perfect hash functions concepts.html] are widely used for memory efficient storage and fast retrieval of items from static sets, such as words in natural languages, reserved words in programming languages or interactive systems, uniform resource locators (URLs) in Web search engines, or item sets in data mining techniques. Therefore, there are applications for minimal perfect hash functions in information retrieval systems, database systems, language translation systems, electronic commerce systems, compilers, operating systems, among others.
Until now, the use of minimal perfect hash functions has been restricted to scenarios where the set of keys being hashed is small, because of the limitations of existing algorithms. In many cases, however, dealing with huge sets of keys is crucial. This project therefore gives the free software community an API that works with sets in the order of billions of keys.
Probably the most interesting application of minimal perfect hash functions is their use as an indexing structure for databases. The most popular indexing structure in databases is the B+ tree, which is widely used for dynamic applications with frequent insertions and deletions of records. However, for applications with sporadic modifications and a huge number of queries the B+ tree is not the best option, because practical deployments of this structure are extremely complex and perform poorly with very large sets of keys such as those required for the new frontiers [database applications http://acmqueue.com/modules.php?name=Content&pa=showpage&pid=299].
For example, in the information retrieval field, working with huge collections is a daily task. Simply assigning ids to the web pages of a collection can be challenging. While traditional databases simply cannot handle more traffic once the working set of web page URLs no longer fits in main memory, minimal perfect hash functions can easily scale to hundreds of millions of entries on stock hardware.
As there are lots of applications for minimal perfect hash functions, it is important to implement memory and time efficient algorithms for constructing such functions. The lack of similar libraries in the free software world has been the main motivation to create the C Minimal Perfect Hashing Library ([gperf is a bit different gperf.html], since it was conceived to create very fast perfect hash functions for small sets of keys and CMPH Library was conceived to create minimal perfect hash functions for very large sets of keys). C Minimal Perfect Hashing Library is a portable LGPLed library to generate and to work with very efficient minimal perfect hash functions.
-------------------------------------------------------------------
==Description==
The CMPH Library encapsulates the newest and most efficient algorithms in an easy-to-use, production-quality, fast API. The library was designed to work with key sets too large to fit in main memory. It has been used successfully to construct minimal perfect hash functions for sets of more than 100 million keys, and we intend to extend this to sets in the order of billions of keys. Although similar libraries are scarce, these are the distinguishing features of the CMPH Library:
- Fast.
- Space-efficient with main memory usage carefully documented.
- The best modern algorithms are available (or at least scheduled for implementation :-)).
- Works with on-disk key sets by using the adapter pattern.
- Serialization of hash functions.
- Portable C code (currently works on GNU/Linux and WIN32 and is reported to work in OpenBSD and Solaris).
- Object oriented implementation.
- Easily extensible.
- Well-encapsulated API that aims for binary compatibility across releases.
- Free Software.
----------------------------------------
==Supported Algorithms==
%html% - [CHD Algorithm chd.html]:
%txt% - CHD Algorithm:
  - It is the fastest algorithm to build PHFs and MPHFs in linear time.
  - It generates the most compact PHFs and MPHFs we know of.
  - It can generate PHFs with a load factor up to //99 %//.
  - It can be used to generate //t//-perfect hash functions. A //t//-perfect hash function allows at most //t// collisions in a given bin. Modern memories are organized in blocks that form the unit of transfer, such as cache lines for internal memory or sectors for hard disks, so //t//-perfect hash functions can be very useful for devices that carry out I/O operations in blocks.
  - It is a two-level scheme. A first-level hash function splits the key set into buckets whose average size is determined by a parameter //b// in the range //[1,32]//. The second level uses displacement values to resolve the collisions that gave rise to the buckets.
  - It can generate MPHFs that can be stored in approximately //2.07// bits per key.
  - For a load factor equal to the maximum one that is achieved by the BDZ algorithm (//81 %//), the resulting PHFs are stored in approximately //1.40// bits per key.
%html% - [BDZ Algorithm bdz.html]:
%txt% - BDZ Algorithm:
  - It is very simple and efficient. It outperforms all the ones below.
  - It constructs both PHFs and MPHFs in linear time.
  - The maximum load factor one can achieve for a PHF is //1/1.23//.
  - It is based on acyclic random 3-graphs. A 3-graph is a generalization of a graph where each edge connects 3 vertices instead of only 2.
  - The resulting MPHFs are not order preserving.
  - The resulting MPHFs can be stored in only //(2 + x)cn// bits, where //c// should be larger than or equal to //1.23// and //x// is a constant larger than //0// (actually, x = 1/b and b is a parameter that should be larger than 2). For //c = 1.23// and //b = 8//, the resulting functions are stored in approximately 2.6 bits per key.
  - For its maximum load factor (//81 %//), the resulting PHFs are stored in approximately //1.95// bits per key.
%html% - [BMZ Algorithm bmz.html]:
%txt% - BMZ Algorithm:
  - It constructs MPHFs in linear time.
  - It is based on cyclic random graphs. This makes it faster than the CHM algorithm.
  - The resulting MPHFs are not order preserving.
  - The resulting MPHFs are more compact than the ones generated by the CHM algorithm and can be stored in //4cn// bytes, where //c// is in the range //[0.93,1.15]//.
%html% - [BRZ Algorithm brz.html]:
%txt% - BRZ Algorithm:
  - A very fast external memory based algorithm for constructing minimal perfect hash functions for sets in the order of billions of keys.
  - It works in linear time.
  - The resulting MPHFs are not order preserving.
  - The resulting MPHFs can be stored using less than //8.0// bits per key.
%html% - [CHM Algorithm chm.html]:
%txt% - CHM Algorithm:
  - It constructs MPHFs in linear time.
  - It is based on acyclic random graphs.
  - The resulting MPHFs are order preserving.
  - The resulting MPHFs are stored in //4cn// bytes, where //c// is greater than 2.
%html% - [FCH Algorithm fch.html]:
%txt% - FCH Algorithm:
  - It constructs minimal perfect hash functions that require less than 4 bits per key to be stored.
  - The resulting MPHFs are very compact and very efficient at evaluation time.
  - The algorithm is only efficient for small sets.
  - It is used as the internal algorithm in the BRZ algorithm to solve larger problems efficiently while still generating MPHFs that require approximately 4.1 bits per key to be stored. To do so, set the parameter -a to brz and -c to a value larger than or equal to 2.6.
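To make the two-level scheme described for CHD above more concrete, here is a toy hash-and-displace construction. It is only a sketch under several assumptions: keys are 64-bit integers, the mixing function is the SplitMix64 finalizer rather than any hash CMPH uses, each bucket's displacement value is simply a seed found by trial, and every name (`hd_build`, `hd_lookup`, `HD_MAX_SLOTS`, `HD_MAX_TRIES`) is hypothetical. The real CHD algorithm is more elaborate, for example in how it compresses the displacement values.

```c
#include <stdint.h>
#include <string.h>

#define HD_MAX_SLOTS 64          /* toy limit on the range m */
#define HD_MAX_TRIES (1u << 16)  /* displacement values tried per bucket */

/* Assumed mixer: the SplitMix64 finalizer (not a CMPH hash function). */
static uint64_t hd_mix(uint64_t x)
{
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
}

/* First level: the bucket a key falls into. */
static uint32_t hd_bucket(uint64_t key, uint32_t nbuckets)
{
    return (uint32_t)(hd_mix(key) % nbuckets);
}

/* Second level: the slot of a key under displacement value d. */
static uint32_t hd_slot(uint64_t key, uint32_t d, uint32_t m)
{
    return (uint32_t)(hd_mix(key ^ hd_mix((uint64_t)d + 1)) % m);
}

/* Finds one displacement value per bucket so that all n distinct keys land
 * in distinct slots of [0, m). Returns 0 on success and -1 on failure. */
int hd_build(const uint64_t *keys, uint32_t n, uint32_t m,
             uint32_t nbuckets, uint32_t *disp)
{
    unsigned char used[HD_MAX_SLOTS] = {0};  /* slot occupancy */
    if (m == 0 || n > m || m > HD_MAX_SLOTS || nbuckets == 0)
        return -1;
    memset(disp, 0, nbuckets * sizeof *disp);

    /* Place the largest buckets first: they are the hardest to fit. */
    for (uint32_t size = n; size >= 1; size--) {
        for (uint32_t b = 0; b < nbuckets; b++) {
            uint32_t members[HD_MAX_SLOTS], cnt = 0, d;
            for (uint32_t i = 0; i < n; i++)
                if (hd_bucket(keys[i], nbuckets) == b)
                    members[cnt++] = i;
            if (cnt != size)
                continue;
            for (d = 0; d < HD_MAX_TRIES; d++) {
                uint32_t placed = 0;
                while (placed < cnt) {
                    uint32_t s = hd_slot(keys[members[placed]], d, m);
                    if (used[s])
                        break;           /* collision: try the next d */
                    used[s] = 1;
                    placed++;
                }
                if (placed == cnt)
                    break;               /* the whole bucket fits */
                while (placed > 0)       /* undo the partial placement */
                    used[hd_slot(keys[members[--placed]], d, m)] = 0;
            }
            if (d == HD_MAX_TRIES)
                return -1;
            disp[b] = d;
        }
    }
    return 0;
}

/* Evaluates the function: one bucket lookup plus one displaced hash. */
uint32_t hd_lookup(uint64_t key, uint32_t m, uint32_t nbuckets,
                   const uint32_t *disp)
{
    return hd_slot(key, disp[hd_bucket(key, nbuckets)], m);
}
```

After a successful `hd_build`, calling `hd_lookup` on any of the original keys returns a slot that no other key shares. Choosing m = n makes the result minimal, at the cost of more trials per bucket.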
----------------------------------------
==News for version 2.0==
Cleaned up most warnings for the C code.
Experimental C++ interface (--enable-cxxmph) implementing the BDZ algorithm in
a convenient interface, which serves as the basis
for drop-in replacements for std::unordered_map, sparsehash::sparse_hash_map
and sparsehash::dense_hash_map. Potentially faster lookup time at the expense
of insertion time. See cxxmph/mph_map.h and cxxmph/mph_index.h for details.
==News for version 1.1==
Fixed a bug in the chd_pc algorithm and reorganized tests.
==News for version 1.0==
This is a bugfix only version, after which a revamp of the cmph code and
algorithms will be done.
==News for version 0.9==
- [The CHD algorithm chd.html], which is an algorithm that can be tuned to generate MPHFs that require approximately 2.07 bits per key to be stored. The algorithm outperforms [the BDZ algorithm bdz.html] and therefore is the fastest one available in the literature for sets that can be treated in internal memory.
- [The CHD_PH algorithm chd.html], which is an algorithm to generate PHFs with load factor up to //99 %//. It is actually the CHD algorithm without the ranking step. If we set the load factor to //81 %//, which is the maximum that can be obtained with [the BDZ algorithm bdz.html], the resulting functions can be stored in //1.40// bits per key. The space requirement increases with the load factor.
- All reported bugs have been fixed and the suggestions incorporated.
==News for version 0.8==
- [An algorithm to generate MPHFs that require around 2.6 bits per key to be stored bdz.html], which is referred to as BDZ algorithm. The algorithm is the fastest one available in the literature for sets that can be treated in internal memory.
- [An algorithm to generate PHFs with range m = cn, for c > 1.22 bdz.html], which is referred to as BDZ_PH algorithm. It is actually the BDZ algorithm without the ranking step. The resulting functions can be stored in 1.95 bits per key for //c = 1.23// and are considerably faster than the MPHFs generated by the BDZ algorithm.
- An adapter to support a vector of struct as the source of keys has been added.
- An API for packing a perfect hash function into a preallocated contiguous memory space. A packed function is even faster to evaluate and can easily be mmapped.
- The hash functions djb2, fnv and sdbm were removed because they do not use random seeds and therefore are not useful for MPHF algorithms.
- All reported bugs have been fixed and the suggestions incorporated.
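The space figures quoted for BDZ in these notes can be checked against the expressions given in the Supported Algorithms section: an MPHF takes (2 + x)cn bits with x = 1/b, and a PHF with range m = cn has load factor n/m = 1/c. The helpers below are a small sketch of that arithmetic. The function names are hypothetical and nothing here calls into CMPH.

```c
/* Bits per key of a BDZ MPHF stored in (2 + x)cn bits, with x = 1/b. */
double bdz_mphf_bits_per_key(double c, unsigned b)
{
    return (2.0 + 1.0 / (double)b) * c;
}

/* Load factor n/m of a PHF whose range is m = cn. */
double phf_load_factor(double c)
{
    return 1.0 / c;
}

/* Bytes per key of a BMZ or CHM function stored in 4cn bytes. */
double graph_mphf_bytes_per_key(double c)
{
    return 4.0 * c;
}
```

With c = 1.23 and b = 8 the first function gives 2.61375, in line with the figure of around 2.6 bits per key, and `phf_load_factor(1.23)` is about 0.813, the 81 % maximum load factor. For comparison, `graph_mphf_bytes_per_key(1.15)` gives 4.6 bytes per key for BMZ at the top of its range for c.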
[News log newslog.html]
----------------------------------------
==Examples==
Using cmph is quite simple. Take a look.
```
#include <cmph.h>
#include <stdio.h>
#include <string.h>
// Create minimal perfect hash function from in-memory vector
int main(int argc, char **argv)
{
    // Creating a filled vector
    unsigned int i = 0;
    const char *vector[] = {"aaaaaaaaaa", "bbbbbbbbbb", "cccccccccc", "dddddddddd", "eeeeeeeeee",
        "ffffffffff", "gggggggggg", "hhhhhhhhhh", "iiiiiiiiii", "jjjjjjjjjj"};
    unsigned int nkeys = 10;
    FILE* mphf_fd = fopen("temp.mph", "w");
    // Source of keys
    cmph_io_adapter_t *source = cmph_io_vector_adapter((char **)vector, nkeys);
    //Create minimal perfect hash function using the brz algorithm.
    cmph_config_t *config = cmph_config_new(source);
    cmph_config_set_algo(config, CMPH_BRZ);
    cmph_config_set_mphf_fd(config, mphf_fd);
    cmph_t *hash = cmph_new(config);
    cmph_config_destroy(config);
    cmph_dump(hash, mphf_fd);
    cmph_destroy(hash);
    fclose(mphf_fd);
    //Find key
    mphf_fd = fopen("temp.mph", "r");
    hash = cmph_load(mphf_fd);
    while (i < nkeys) {
        const char *key = vector[i];
        unsigned int id = cmph_search(hash, key, (cmph_uint32)strlen(key));
        fprintf(stderr, "key:%s -- hash:%u\n", key, id);
        i++;
    }
    //Destroy hash
    cmph_destroy(hash);
    cmph_io_vector_adapter_destroy(source);
    fclose(mphf_fd);
    return 0;
}
```
Download [vector_adapter_ex1.c examples/vector_adapter_ex1.c]. This example does not work in versions below 0.6. You need to update the sources from GIT to make it work.
-------------------------------
```
#include <cmph.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
// Create minimal perfect hash function from in-disk keys using BDZ algorithm
int main(int argc, char **argv)
{
    //Open file with newline separated list of keys
    FILE * keys_fd = fopen("keys.txt", "r");
    cmph_t *hash = NULL;
    if (keys_fd == NULL)
    {
        fprintf(stderr, "File \"keys.txt\" not found\n");
        exit(1);
    }
    // Source of keys
    cmph_io_adapter_t *source = cmph_io_nlfile_adapter(keys_fd);
    cmph_config_t *config = cmph_config_new(source);
    cmph_config_set_algo(config, CMPH_BDZ);
    hash = cmph_new(config);
    cmph_config_destroy(config);
    //Find key
    const char *key = "jjjjjjjjjj";
    unsigned int id = cmph_search(hash, key, (cmph_uint32)strlen(key));
    fprintf(stderr, "Id:%u\n", id);
    //Destroy hash
    cmph_destroy(hash);
    cmph_io_nlfile_adapter_destroy(source);
    fclose(keys_fd);
    return 0;
}
```
Download [file_adapter_ex2.c examples/file_adapter_ex2.c] and [keys.txt examples/keys.txt]. This example does not work in versions below 0.8. You need to update the sources from GIT to make it work.
[Click here to see more examples examples.html]
--------------------------------------
==The cmph application==
cmph is the name of both the library and the utility
application that comes with this package. You can use the cmph
application for constructing minimal perfect hash functions from the command line.
The cmph utility
comes with a number of flags, but it is very simple to create and to query
minimal perfect hash functions:
```
$ # Using the chm algorithm (default one) for constructing a mphf for keys in file keys_file
$ ./cmph -g keys_file
$ # Query id of keys in the file keys_query
$ ./cmph -m keys_file.mph keys_query
```
The additional options let you set most of the parameters you have
available through the C API. Below you can see the full help message for the
utility.
```
usage: cmph [-v] [-h] [-V] [-k nkeys] [-f hash_function] [-g [-c algorithm_dependent_value][-s seed] ]
            [-a algorithm] [-M memory_in_MB] [-b algorithm_dependent_value] [-t keys_per_bin] [-d tmp_dir]
            [-m file.mph] keysfile
Minimum perfect hashing tool
  -h   print this help message
  -c   c value determines:
        * the number of vertices in the graph for the algorithms BMZ and CHM
        * the number of bits per key required in the FCH algorithm
        * the load factor in the CHD_PH algorithm
  -a   algorithm - valid values are
        * bmz
        * bmz8
        * chm
        * brz
        * fch
        * bdz
        * bdz_ph
        * chd_ph
        * chd
  -f   hash function (may be used multiple times) - valid values are
        * jenkins
  -V   print version number and exit
  -v   increase verbosity (may be used multiple times)
  -k   number of keys
  -g   generation mode
  -s   random seed
  -m   minimum perfect hash function file
  -M   main memory availability (in MB) used in BRZ algorithm
  -d   temporary directory used in BRZ algorithm
  -b   the meaning of this parameter depends on the algorithm selected in the -a option:
        * For BRZ it is used to make the maximal number of keys in a bucket lower than 256.
          In this case its value should be an integer in the range [64,175]. Default is 128.
        * For BDZ it is used to determine the size of some precomputed rank
          information and its value should be an integer in the range [3,10]. Default
          is 7. The larger this value, the more compact the resulting functions
          and the slower they are at evaluation time.
        * For CHD and CHD_PH it is used to set the average number of keys per bucket
          and its value should be an integer in the range [1,32]. Default is 4. The
          larger this value, the slower the construction of the functions.
          This parameter has no effect for other algorithms.
  -t   set the number of keys per bin for a t-perfect hashing function. A t-perfect
       hash function allows at most t collisions in a given bin. This parameter applies
       only to the CHD and CHD_PH algorithms. Its value should be an integer in the
       range [1,128]. Default is 1
  keysfile   line separated file with keys
```
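The `-t` option in the help text above relaxes the perfect-hash requirement from one key per bin to at most t keys per bin. The sketch below illustrates that property on a plain array of bin indices. It reads "at most t collisions in a given bin" as "at most t keys per bin", which is an interpretation of the help text, and the function names are hypothetical rather than part of the cmph tool or library.

```c
#include <stddef.h>
#include <stdlib.h>

/* Returns the largest number of keys that share one bin, or 0 on error
 * (no keys, nbins == 0, allocation failure, or a bin index >= nbins). */
unsigned max_keys_per_bin(const unsigned *bin, size_t n, unsigned nbins)
{
    if (nbins == 0)
        return 0;
    unsigned *load = calloc(nbins, sizeof *load);
    if (load == NULL)
        return 0;
    unsigned worst = 0;
    for (size_t i = 0; i < n; i++) {
        if (bin[i] >= nbins) {
            worst = 0;  /* invalid input */
            break;
        }
        if (++load[bin[i]] > worst)
            worst = load[bin[i]];
    }
    free(load);
    return worst;
}

/* Nonzero if every bin holds at most t keys, with t limited to the
 * range [1,128] stated in the help text. */
int is_t_perfect(const unsigned *bin, size_t n, unsigned nbins, unsigned t)
{
    if (t < 1 || t > 128)
        return 0;
    unsigned worst = max_keys_per_bin(bin, n, nbins);
    return worst >= 1 && worst <= t;
}
```

For the bin assignment {0, 0, 1, 1, 2} over three bins, `max_keys_per_bin` returns 2, so the mapping is 2-perfect but not 1-perfect. With t = 1 the check reduces to the ordinary perfect-hash condition.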
==Additional Documentation==
[FAQ faq.html]
==Downloads==
Use the github releases page at: https://github.com/bonitao/cmph/releases
==License Stuff==
Code is under the LGPL and the MPL 1.1.
----------------------------------------
%!include: FOOTER.t2t
%!include(html): ''LOGO.t2t''
Last Updated: %%date(%c)
%!include(html): ''GOOGLEANALYTICS.t2t''

76
deps/cmph/TABLE1.t2t vendored Normal file

@@ -0,0 +1,76 @@
<TABLE CELLPADDING=3 BORDER="1" ALIGN="center">
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">
Characteristics </SMALL></TD>
<TD ALIGN="CENTER" COLSPAN=2><SMALL CLASS="FOOTNOTESIZE"> <SPAN>Algorithms</SPAN></SMALL></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">
</SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> BMZ </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> CHM </SMALL></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">
<SPAN CLASS="MATH"><IMG
WIDTH="11" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/img1.png"
ALT="$c$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 1.15 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 2.09 </SMALL></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">
<SPAN CLASS="MATH"><IMG
WIDTH="50" HEIGHT="32" ALIGN="MIDDLE" BORDER="0"
SRC="figs/img239.png"
ALT="$\vert E(G)\vert$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="14" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/img8.png"
ALT="$n$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="14" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/img8.png"
ALT="$n$"></SPAN> </SMALL></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">
<SPAN CLASS="MATH"><IMG
WIDTH="89" HEIGHT="32" ALIGN="MIDDLE" BORDER="0"
SRC="figs/img240.png"
ALT="$\vert V(G)\vert=\vert g\vert$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="20" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/img241.png"
ALT="$cn$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="20" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/img241.png"
ALT="$cn$"></SPAN> </SMALL></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">
<!-- MATH
$|E(G_{\rm crit})|$
-->
<SPAN CLASS="MATH"><IMG
WIDTH="70" HEIGHT="32" ALIGN="MIDDLE" BORDER="0"
SRC="figs/img111.png"
ALT="$\vert E(G_{\rm crit})\vert$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="71" HEIGHT="32" ALIGN="MIDDLE" BORDER="0"
SRC="figs/img242.png"
ALT="$0.5\vert E(G)\vert$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 0</SMALL></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">
<SPAN CLASS="MATH"><IMG
WIDTH="17" HEIGHT="14" ALIGN="BOTTOM" BORDER="0"
SRC="figs/img32.png"
ALT="$G$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> cyclic </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> acyclic </SMALL></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">
Order preserving </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> no </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> yes </SMALL></TD>
</TR>
</TABLE>

109
deps/cmph/TABLE4.t2t vendored Normal file

@@ -0,0 +1,109 @@
<TABLE CELLPADDING=3 BORDER="1" ALIGN="center">
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="14" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/img8.png"
ALT="$n$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER" COLSPAN=4><SMALL CLASS="FOOTNOTESIZE"> <SPAN> BMZ </SPAN> </SMALL></TD>
<TD ALIGN="CENTER" COLSPAN=4><SMALL CLASS="FOOTNOTESIZE">
<SPAN>CHM algorithm</SPAN></SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> Gain</SMALL></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">
</SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="22" HEIGHT="30" ALIGN="MIDDLE" BORDER="0"
SRC="figs/img243.png"
ALT="$N_i$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">Map+Ord </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">
Search </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">Total </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">
<SPAN CLASS="MATH"><IMG
WIDTH="22" HEIGHT="30" ALIGN="MIDDLE" BORDER="0"
SRC="figs/img243.png"
ALT="$N_i$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">Map+Ord </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">Search </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">
Total </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> (%)</SMALL></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 1,562,500 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 2.28 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 8.54 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 2.37 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 10.91 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 2.70 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 14.56 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 1.57 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 16.13 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 48 </SMALL></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 3,125,000 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 2.16 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 15.92 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 4.88 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 20.80 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 2.85 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 30.36 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 3.20 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 33.56 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 61 </SMALL></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 6,250,000 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 2.20 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 33.09 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 10.48 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 43.57 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 2.90 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 62.26 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 6.76 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 69.02 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 58 </SMALL></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 12,500,000 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 2.00 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 63.26 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 23.04 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 86.30 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 2.60 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 117.99 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 14.94 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 132.92 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 54 </SMALL></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 25,000,000 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 2.00 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 130.79 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 51.55 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 182.34 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 2.80 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 262.05 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 33.68 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 295.73 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 62 </SMALL></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 50,000,000 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 2.07 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 273.75 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 114.12 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 387.87 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 2.90 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 577.59 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 73.97 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 651.56 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 68 </SMALL></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 100,000,000 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 2.07 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 567.47 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 243.13 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 810.60 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 2.80 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 1,131.06 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 157.23 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 1,288.29 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 59 </SMALL></TD>
</TR>
</TABLE>
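The Gain (%) column in the table above appears to be the extra total time CHM needs relative to BMZ, rounded to a whole percent. That reading is an assumption inferred from the numbers, not something the table states. A minimal sketch that recomputes the column from the two Total columns (the names `ROWS`, `gain_percent` and `check_rows` are hypothetical helpers, not part of cmph):

```python
# Rows copied from the table: (n, BMZ total, CHM total, reported gain in %).
ROWS = [
    (1_562_500, 10.91, 16.13, 48),
    (3_125_000, 20.80, 33.56, 61),
    (6_250_000, 43.57, 69.02, 58),
    (12_500_000, 86.30, 132.92, 54),
    (25_000_000, 182.34, 295.73, 62),
    (50_000_000, 387.87, 651.56, 68),
    (100_000_000, 810.60, 1288.29, 59),
]


def gain_percent(bmz_total, chm_total):
    """Assumed definition: (CHM total / BMZ total - 1) * 100, rounded."""
    return round((chm_total / bmz_total - 1.0) * 100.0)


def check_rows(rows=ROWS):
    """Return (n, recomputed gain, reported gain) for every row."""
    return [(n, gain_percent(b, c), reported) for n, b, c, reported in rows]


if __name__ == "__main__":
    for n, computed, reported in check_rows():
        print(f"n={n:>11,}  computed={computed}%  reported={reported}%")
```

Under this assumed definition, the recomputed values agree with the reported column for all seven rows.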

46
deps/cmph/TABLE5.t2t vendored Normal file

@@ -0,0 +1,46 @@
<TABLE CELLPADDING=3 BORDER="1" ALIGN="center">
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="14" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/img8.png"
ALT="$n$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER" COLSPAN=4><SMALL CLASS="FOOTNOTESIZE"> <SPAN> BMZ <SPAN CLASS="MATH"><IMG
WIDTH="60" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/img5.png"
ALT="$c=1.00$"></SPAN></SPAN> </SMALL></TD>
<TD ALIGN="CENTER" COLSPAN=4><SMALL CLASS="FOOTNOTESIZE">
<SPAN> BMZ <SPAN CLASS="MATH"><IMG
WIDTH="60" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/img6.png"
ALT="$c=0.93$"></SPAN></SPAN> </SMALL></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">
</SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="22" HEIGHT="30" ALIGN="MIDDLE" BORDER="0"
SRC="figs/img243.png"
ALT="$N_i$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">Map+Ord </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">
Search </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">Total </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">
<SPAN CLASS="MATH"><IMG
WIDTH="22" HEIGHT="30" ALIGN="MIDDLE" BORDER="0"
SRC="figs/img243.png"
ALT="$N_i$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">Map+Ord </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">Search </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">
Total </SMALL></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 12,500,000 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 2.78 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 76.68 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 25.06 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 101.74 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 3.04 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 76.39 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 25.80 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 102.19 </SMALL></TD>
</TR>
</TABLE>

72
deps/cmph/TABLEBRZ1.t2t vendored Normal file

@@ -0,0 +1,72 @@
<TABLE CELLPADDING=3 BORDER="1" ALIGN="CENTER">
<TR><TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE">
<SPAN CLASS="MATH"><IMG
WIDTH="14" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img5.png"
ALT="$n$"></SPAN> (millions) </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 1 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 2 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 4 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 8 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 16 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 32 </SMALL></TD>
<TD></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE">
Average time (s)</SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="64" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img168.png"
ALT="$6.1 \pm 0.3$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="72" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img169.png"
ALT="$12.2 \pm 0.6$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="72" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img170.png"
ALT="$25.4 \pm 1.1$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="72" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img171.png"
ALT="$51.4 \pm 2.0$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="80" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img172.png"
ALT="$117.3 \pm 4.4$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="80" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img173.png"
ALT="$262.2 \pm 8.7$"></SPAN></SMALL></TD>
<TD></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE">
SD (s) </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img174.png"
ALT="$2.6$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img175.png"
ALT="$5.4$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img176.png"
ALT="$9.8$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img177.png"
ALT="$17.6$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img178.png"
ALT="$37.3$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img179.png"
ALT="$76.3$"></SPAN> </SMALL></TD>
<TD></TD>
</TR>
</TABLE>

133
deps/cmph/TABLEBRZ2.t2t vendored Normal file

@@ -0,0 +1,133 @@
<TABLE CELLPADDING=3 BORDER="1" ALIGN="CENTER">
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
<SPAN CLASS="MATH"><IMG
WIDTH="14" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img5.png"
ALT="$n$"></SPAN> (millions) </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 1 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 2 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 4 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 8 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 16 </SMALL></TD>
</TR>
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
Average time (s) </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="64" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img187.png"
ALT="$6.9 \pm 0.3$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="72" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img188.png"
ALT="$13.8 \pm 0.2$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="72" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img189.png"
ALT="$31.9 \pm 0.7$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="72" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img190.png"
ALT="$69.9 \pm 1.1$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="80" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img191.png"
ALT="$140.6 \pm 2.5$"></SPAN> </SMALL></TD>
</TR>
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
SD </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img192.png"
ALT="$0.4$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img193.png"
ALT="$0.2$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img194.png"
ALT="$0.9$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img195.png"
ALT="$1.5$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img196.png"
ALT="$3.5$"></SPAN> </SMALL></TD>
</TR>
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
<SPAN CLASS="MATH"><IMG
WIDTH="14" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img5.png"
ALT="$n$"></SPAN> (millions) </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 32 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 64 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 128 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 512 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 1000 </SMALL></TD>
</TR>
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
Average time (s) </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="80" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img197.png"
ALT="$284.3 \pm 1.1$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="80" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img198.png"
ALT="$587.9 \pm 3.9$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <!-- MATH
$1223.6 \pm 4.9$
-->
<SPAN CLASS="MATH"><IMG
WIDTH="88" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img199.png"
ALT="$1223.6 \pm 4.9$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <!-- MATH
$5966.4 \pm 9.5$
-->
<SPAN CLASS="MATH"><IMG
WIDTH="88" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img200.png"
ALT="$5966.4 \pm 9.5$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <!-- MATH
$13229.5 \pm 12.7$
-->
<SPAN CLASS="MATH"><IMG
WIDTH="104" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img201.png"
ALT="$13229.5 \pm 12.7$"></SPAN> </SMALL></TD>
</TR>
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
SD </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img202.png"
ALT="$1.6$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img203.png"
ALT="$5.5$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img204.png"
ALT="$6.8$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img205.png"
ALT="$13.2$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img206.png"
ALT="$18.6$"></SPAN> </SMALL></TD>
</TR>
<TR><TD></TD>
<TD></TD>
<TD></TD>
<TD></TD>
<TD></TD>
<TD></TD>
</TR>
</TABLE>

147
deps/cmph/TABLEBRZ3.t2t vendored Normal file

@@ -0,0 +1,147 @@
<TABLE CELLPADDING=3 BORDER="1" ALIGN="center">
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
<SPAN CLASS="MATH"><IMG
WIDTH="14" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img8.png"
ALT="$\mu $"></SPAN> (MB) </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img215.png"
ALT="$100$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img216.png"
ALT="$200$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img217.png"
ALT="$300$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img218.png"
ALT="$400$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img219.png"
ALT="$500$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img212.png"
ALT="$600$"></SPAN> </SMALL></TD>
</TR>
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
<SPAN CLASS="MATH"><IMG
WIDTH="19" HEIGHT="14" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img58.png"
ALT="$N$"></SPAN> (files) </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img220.png"
ALT="$619$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img221.png"
ALT="$310$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img222.png"
ALT="$207$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img223.png"
ALT="$155$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img224.png"
ALT="$124$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img225.png"
ALT="$104$"></SPAN> </SMALL></TD>
</TR>
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
&nbsp;(buffer size in KB) </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img226.png"
ALT="$165$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img227.png"
ALT="$661$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="43" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img228.png"
ALT="$1,484$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="43" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img229.png"
ALT="$2,643$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="43" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img230.png"
ALT="$4,129$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="43" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img231.png"
ALT="$5,908$"></SPAN> </SMALL></TD>
</TR>
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
<SPAN CLASS="MATH"><IMG
WIDTH="14" HEIGHT="30" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img135.png"
ALT="$\beta$"></SPAN>/&nbsp;(# of seeks in the worst case) </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="59" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img232.png"
ALT="$384,478$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="51" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img233.png"
ALT="$95,974$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="51" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img234.png"
ALT="$42,749$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="51" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img235.png"
ALT="$24,003$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="51" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img236.png"
ALT="$15,365$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="51" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img237.png"
ALT="$10,738$"></SPAN> </SMALL></TD>
</TR>
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
Time (hours) </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img238.png"
ALT="$4.04$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img239.png"
ALT="$3.64$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img240.png"
ALT="$3.34$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img241.png"
ALT="$3.20$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img242.png"
ALT="$3.13$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img243.png"
ALT="$3.09$"></SPAN> </SMALL></TD>
</TR>
</TABLE>

1
deps/cmph/_config.yml vendored Normal file

@@ -0,0 +1 @@
theme: jekyll-theme-cayman

12
deps/cmph/cmph.pc.in vendored Normal file

@@ -0,0 +1,12 @@
url=http://cmph.sourceforge.net/
prefix=@prefix@
exec_prefix=@exec_prefix@
libdir=@libdir@
includedir=@includedir@
Name: cmph
Description: minimal perfect hashing library
Version: @VERSION@
Libs: -L${libdir} -lcmph
Cflags: -I${includedir}
URL: ${url}

39
deps/cmph/cmph.spec vendored Normal file

@@ -0,0 +1,39 @@
%define name cmph
%define version 0.4
%define release 3
Name: %{name}
Version: %{version}
Release: %{release}
Summary: C Minimal perfect hash library
Source: %{name}-%{version}.tar.gz
License: Proprietary
URL: http://www.akwan.com.br
BuildArch: i386
Group: Sitesearch
BuildRoot: %{_tmppath}/%{name}-root
%description
C Minimal perfect hash library
%prep
rm -Rf $RPM_BUILD_ROOT
rm -rf $RPM_BUILD_ROOT
%setup
mkdir $RPM_BUILD_ROOT
mkdir $RPM_BUILD_ROOT/usr
CXXFLAGS="-O2" ./configure --prefix=/usr/
%build
make
%install
DESTDIR=$RPM_BUILD_ROOT make install
%files
%defattr(755,root,root)
/
%changelog
* Tue Jun 1 2004 Davi de Castro Reis <davi@akwan.com.br>
+ Initial build

210
deps/cmph/cmph.vcproj vendored Normal file

@@ -0,0 +1,210 @@
<?xml version="1.0" encoding="Windows-1252"?>
<VisualStudioProject
ProjectType="Visual C++"
Version="7.10"
Name="cmph"
ProjectGUID="{F215E028-2FB8-41DE-B211-8E4616CF5B59}"
Keyword="Win32Proj">
<Platforms>
<Platform
Name="Win32"/>
</Platforms>
<Configurations>
<Configuration
Name="Debug|Win32"
OutputDirectory="Debug"
IntermediateDirectory="Debug"
ConfigurationType="4"
CharacterSet="2">
<Tool
Name="VCCLCompilerTool"
Optimization="0"
PreprocessorDefinitions="WIN32;_DEBUG;_LIB"
MinimalRebuild="TRUE"
BasicRuntimeChecks="3"
RuntimeLibrary="5"
UsePrecompiledHeader="0"
WarningLevel="3"
Detect64BitPortabilityProblems="TRUE"
DebugInformationFormat="4"/>
<Tool
Name="VCCustomBuildTool"/>
<Tool
Name="VCLibrarianTool"
OutputFile="$(OutDir)/cmph.lib"/>
<Tool
Name="VCMIDLTool"/>
<Tool
Name="VCPostBuildEventTool"/>
<Tool
Name="VCPreBuildEventTool"/>
<Tool
Name="VCPreLinkEventTool"/>
<Tool
Name="VCResourceCompilerTool"/>
<Tool
Name="VCWebServiceProxyGeneratorTool"/>
<Tool
Name="VCXMLDataGeneratorTool"/>
<Tool
Name="VCManagedWrapperGeneratorTool"/>
<Tool
Name="VCAuxiliaryManagedWrapperGeneratorTool"/>
</Configuration>
<Configuration
Name="Release|Win32"
OutputDirectory="Release"
IntermediateDirectory="Release"
ConfigurationType="4"
CharacterSet="2">
<Tool
Name="VCCLCompilerTool"
PreprocessorDefinitions="WIN32;NDEBUG;_LIB"
RuntimeLibrary="4"
UsePrecompiledHeader="3"
WarningLevel="3"
Detect64BitPortabilityProblems="TRUE"
DebugInformationFormat="3"/>
<Tool
Name="VCCustomBuildTool"/>
<Tool
Name="VCLibrarianTool"
OutputFile="$(OutDir)/cmph.lib"/>
<Tool
Name="VCMIDLTool"/>
<Tool
Name="VCPostBuildEventTool"/>
<Tool
Name="VCPreBuildEventTool"/>
<Tool
Name="VCPreLinkEventTool"/>
<Tool
Name="VCResourceCompilerTool"/>
<Tool
Name="VCWebServiceProxyGeneratorTool"/>
<Tool
Name="VCXMLDataGeneratorTool"/>
<Tool
Name="VCManagedWrapperGeneratorTool"/>
<Tool
Name="VCAuxiliaryManagedWrapperGeneratorTool"/>
</Configuration>
</Configurations>
<References>
</References>
<Files>
<Filter
Name="Source Files"
Filter="cpp;c;cxx;def;odl;idl;hpj;bat;asm;asmx"
UniqueIdentifier="{4FC737F1-C7A5-4376-A066-2A32D752A2FF}">
<File
RelativePath=".\src\bmz.c">
</File>
<File
RelativePath=".\src\cmph.c">
</File>
<File
RelativePath=".\src\cmph_structs.c">
</File>
<File
RelativePath=".\src\czech.c">
</File>
<File
RelativePath=".\src\djb2_hash.c">
</File>
<File
RelativePath=".\src\fnv_hash.c">
</File>
<File
RelativePath=".\src\graph.c">
</File>
<File
RelativePath=".\src\hash.c">
</File>
<File
RelativePath=".\src\jenkins_hash.c">
</File>
<File
RelativePath=".\src\sdbm_hash.c">
</File>
<File
RelativePath=".\src\vqueue.c">
</File>
<File
RelativePath=".\src\vstack.c">
</File>
</Filter>
<Filter
Name="Header Files"
Filter="h;hpp;hxx;hm;inl;inc;xsd"
UniqueIdentifier="{93995380-89BD-4b04-88EB-625FBE52EBFB}">
<File
RelativePath=".\src\bmz.h">
</File>
<File
RelativePath=".\src\bmz_structs.h">
</File>
<File
RelativePath=".\src\cmph.h">
</File>
<File
RelativePath=".\src\cmph_structs.h">
</File>
<File
RelativePath=".\src\cmph_types.h">
</File>
<File
RelativePath=".\src\czech.h">
</File>
<File
RelativePath=".\src\czech_structs.h">
</File>
<File
RelativePath=".\src\debug.h">
</File>
<File
RelativePath=".\src\djb2_hash.h">
</File>
<File
RelativePath=".\src\fnv_hash.h">
</File>
<File
RelativePath=".\src\graph.h">
</File>
<File
RelativePath=".\src\hash.h">
</File>
<File
RelativePath=".\src\hash_funcs.h">
</File>
<File
RelativePath=".\src\hash_state.h">
</File>
<File
RelativePath=".\src\jenkins_hash.h">
</File>
<File
RelativePath=".\src\list.h">
</File>
<File
RelativePath=".\src\sdbm_hash.h">
</File>
<File
RelativePath=".\src\vqueue.h">
</File>
<File
RelativePath=".\src\vstack.h">
</File>
</Filter>
<Filter
Name="Resource Files"
Filter="rc;ico;cur;bmp;dlg;rc2;rct;bin;rgs;gif;jpg;jpeg;jpe;resx"
UniqueIdentifier="{67DA6AB6-F800-4c08-8B7A-83BB121AAD01}">
</Filter>
<File
RelativePath=".\ReadMe.txt">
</File>
</Files>
<Globals>
</Globals>
</VisualStudioProject>

141
deps/cmph/cmphapp.vcproj vendored Normal file

@@ -0,0 +1,141 @@
<?xml version="1.0" encoding="Windows-1252"?>
<VisualStudioProject
ProjectType="Visual C++"
Version="7.10"
Name="cmphapp"
ProjectGUID="{5CD55126-9AC1-4393-B26B-AB3DFB2A9AD6}"
Keyword="Win32Proj">
<Platforms>
<Platform
Name="Win32"/>
</Platforms>
<Configurations>
<Configuration
Name="Debug|Win32"
OutputDirectory="Debug"
IntermediateDirectory="Debug"
ConfigurationType="1"
CharacterSet="2">
<Tool
Name="VCCLCompilerTool"
Optimization="0"
PreprocessorDefinitions="WIN32;_DEBUG;_CONSOLE"
MinimalRebuild="TRUE"
BasicRuntimeChecks="3"
RuntimeLibrary="5"
UsePrecompiledHeader="0"
WarningLevel="3"
Detect64BitPortabilityProblems="TRUE"
DebugInformationFormat="4"/>
<Tool
Name="VCCustomBuildTool"/>
<Tool
Name="VCLinkerTool"
OutputFile="$(OutDir)/cmphapp.exe"
LinkIncremental="2"
GenerateDebugInformation="TRUE"
ProgramDatabaseFile="$(OutDir)/cmphapp.pdb"
SubSystem="1"
TargetMachine="1"/>
<Tool
Name="VCMIDLTool"/>
<Tool
Name="VCPostBuildEventTool"/>
<Tool
Name="VCPreBuildEventTool"/>
<Tool
Name="VCPreLinkEventTool"/>
<Tool
Name="VCResourceCompilerTool"/>
<Tool
Name="VCWebServiceProxyGeneratorTool"/>
<Tool
Name="VCXMLDataGeneratorTool"/>
<Tool
Name="VCWebDeploymentTool"/>
<Tool
Name="VCManagedWrapperGeneratorTool"/>
<Tool
Name="VCAuxiliaryManagedWrapperGeneratorTool"/>
</Configuration>
<Configuration
Name="Release|Win32"
OutputDirectory="Release"
IntermediateDirectory="Release"
ConfigurationType="1"
CharacterSet="2">
<Tool
Name="VCCLCompilerTool"
PreprocessorDefinitions="WIN32;NDEBUG;_CONSOLE"
RuntimeLibrary="4"
UsePrecompiledHeader="3"
WarningLevel="3"
Detect64BitPortabilityProblems="TRUE"
DebugInformationFormat="3"/>
<Tool
Name="VCCustomBuildTool"/>
<Tool
Name="VCLinkerTool"
OutputFile="$(OutDir)/cmphapp.exe"
LinkIncremental="1"
GenerateDebugInformation="TRUE"
SubSystem="1"
OptimizeReferences="2"
EnableCOMDATFolding="2"
TargetMachine="1"/>
<Tool
Name="VCMIDLTool"/>
<Tool
Name="VCPostBuildEventTool"/>
<Tool
Name="VCPreBuildEventTool"/>
<Tool
Name="VCPreLinkEventTool"/>
<Tool
Name="VCResourceCompilerTool"/>
<Tool
Name="VCWebServiceProxyGeneratorTool"/>
<Tool
Name="VCXMLDataGeneratorTool"/>
<Tool
Name="VCWebDeploymentTool"/>
<Tool
Name="VCManagedWrapperGeneratorTool"/>
<Tool
Name="VCAuxiliaryManagedWrapperGeneratorTool"/>
</Configuration>
</Configurations>
<References>
</References>
<Files>
<Filter
Name="Source Files"
Filter="cpp;c;cxx;def;odl;idl;hpj;bat;asm;asmx"
UniqueIdentifier="{4FC737F1-C7A5-4376-A066-2A32D752A2FF}">
<File
RelativePath=".\src\main.c">
</File>
<File
RelativePath=".\wingetopt.c">
</File>
</Filter>
<Filter
Name="Header Files"
Filter="h;hpp;hxx;hm;inl;inc;xsd"
UniqueIdentifier="{93995380-89BD-4b04-88EB-625FBE52EBFB}">
<File
RelativePath=".\wingetopt.h">
</File>
</Filter>
<Filter
Name="Resource Files"
Filter="rc;ico;cur;bmp;dlg;rc2;rct;bin;rgs;gif;jpg;jpeg;jpe;resx"
UniqueIdentifier="{67DA6AB6-F800-4c08-8B7A-83BB121AAD01}">
</Filter>
<File
RelativePath=".\ReadMe.txt">
</File>
</Files>
<Globals>
</Globals>
</VisualStudioProject>

83
deps/cmph/configure.ac vendored Normal file

@@ -0,0 +1,83 @@
dnl Process this file with autoconf to produce a configure script.
AC_INIT([cmph], [2.0.2])
AC_CONFIG_SRCDIR([Makefile.am])
AM_INIT_AUTOMAKE
AC_CONFIG_HEADERS([config.h])
AC_CONFIG_MACRO_DIR([m4])
dnl Checks for programs.
AC_PROG_AWK
AC_PROG_CC
AC_PROG_INSTALL
AC_PROG_LN_S
LT_INIT
AC_SYS_EXTRA_LARGEFILE
if test "x$ac_cv_sys_largefile_CFLAGS" = "xno" ; then
ac_cv_sys_largefile_CFLAGS=""
fi
if test "x$ac_cv_sys_largefile_LDFLAGS" = "xno" ; then
ac_cv_sys_largefile_LDFLAGS=""
fi
if test "x$ac_cv_sys_largefile_LIBS" = "xno" ; then
ac_cv_sys_largefile_LIBS=""
fi
CFLAGS="$ac_cv_sys_largefile_CFLAGS $CFLAGS"
LDFLAGS="$ac_cv_sys_largefile_LDFLAGS $LDFLAGS"
LIBS="$LIBS $ac_cv_sys_largefile_LIBS"
dnl Checks for headers
AC_CHECK_HEADERS([getopt.h math.h])
dnl Checks for libraries.
LT_LIB_M
LDFLAGS="$LIBS $LIBM $LDFLAGS"
CFLAGS="-Wall $CFLAGS"
AC_PROG_CXX
CXXFLAGS="-Wall -Wno-unused-function -DNDEBUG -O3 -fomit-frame-pointer $CXXFLAGS"
AC_ENABLE_CXXMPH
if test x$cxxmph = xtrue; then
AC_COMPILE_STDCXX_0X
if test x$ac_cv_cxx_compile_cxx0x_native = "xno"; then
if test x$ac_cv_cxx_compile_cxx11_cxx = "xyes"; then
CXXFLAGS="$CXXFLAGS -std=c++11"
elif test x$ac_cv_cxx_compile_cxx0x_cxx = "xyes"; then
CXXFLAGS="$CXXFLAGS -std=c++0x"
elif test x$ac_cv_cxx_compile_cxx0x_gxx = "xyes"; then
CXXFLAGS="$CXXFLAGS -std=gnu++0x"
else
AC_MSG_ERROR("cxxmph demands a working c++0x compiler.")
fi
fi
AC_SUBST([CXXMPH], "cxxmph")
fi
AM_CONDITIONAL([USE_CXXMPH], [test "$cxxmph" = true])
AC_ENABLE_BENCHMARKS
if test x$benchmarks = xtrue; then
AC_LANG_PUSH([C++])
AC_CHECK_HEADERS([hopscotch_map.h])
AC_LANG_POP([C++])
fi
AM_CONDITIONAL([USE_BENCHMARKS], [test "$benchmarks" = true])
# Unit tests based on the check library. Disabled by default.
# We do not use pkg-config because it is inconvenient for all developers to
# have check library installed.
AC_ARG_ENABLE(check, AS_HELP_STRING(
[--enable-check],
[Build unit tests depending on check library (default: disabled)]))
AS_IF([test "x$enable_check" = "xyes"],
[ AC_CHECK_LIB([check], [tcase_create])
AS_IF([test "$ac_cv_lib_check_tcase_create" = yes], [CHECK_LIBS="-lcheck"],
[AC_MSG_ERROR("Failed to find check library (http://check.sf.net).")])
AC_CHECK_HEADER(check.h,[],
[AC_MSG_ERROR("Failed to find check library header (http://check.sf.net).")])
])
AM_CONDITIONAL([USE_LIBCHECK], [test "$ac_cv_lib_check_tcase_create" = yes])
AC_SUBST(CHECK_LIBS)
AC_SUBST(CHECK_CFLAGS)
AC_CHECK_SPOON
AC_CONFIG_FILES([Makefile src/Makefile cxxmph/Makefile tests/Makefile examples/Makefile man/Makefile cmph.pc cxxmph.pc])
AC_OUTPUT

deps/cmph/cxxmph.pc.in vendored Normal file

@@ -0,0 +1,12 @@
url=http://cmph.sourceforge.net/
prefix=@prefix@
exec_prefix=@exec_prefix@
libdir=@libdir@
includedir=@includedir@
Name: cxxmph
Description: minimal perfect hashing c++11 library
Version: @VERSION@
Libs: -L${libdir} -lcxxmph
Cflags: -std=c++0x -I${includedir}
URL: ${url}

deps/cmph/cxxmph/.ycm_extra_conf.py vendored Normal file

@@ -0,0 +1,58 @@
import os
import ycm_core

flags = [
  '-Wall',
  '-Wextra',
  '-Werror',
  '-DNDEBUG',
  '-DUSE_CLANG_COMPLETER',
  '-std=c++11',
  '-x',
  'c++',
  '-isystem',
  '/usr/lib/c++/v1',
  '-I',
  '.',
]

def DirectoryOfThisScript():
  return os.path.dirname( os.path.abspath( __file__ ) )

def MakeRelativePathsInFlagsAbsolute( flags, working_directory ):
  if not working_directory:
    return list( flags )
  new_flags = []
  make_next_absolute = False
  path_flags = [ '-isystem', '-I', '-iquote', '--sysroot=' ]
  for flag in flags:
    new_flag = flag
    if make_next_absolute:
      make_next_absolute = False
      if not flag.startswith( '/' ):
        new_flag = os.path.join( working_directory, flag )
    for path_flag in path_flags:
      if flag == path_flag:
        make_next_absolute = True
        break
      if flag.startswith( path_flag ):
        path = flag[ len( path_flag ): ]
        new_flag = path_flag + os.path.join( working_directory, path )
        break
    if new_flag:
      new_flags.append( new_flag )
  return new_flags

def FlagsForFile( filename ):
  relative_to = DirectoryOfThisScript()
  final_flags = MakeRelativePathsInFlagsAbsolute( flags, relative_to )
  return {
    'flags': final_flags,
    'do_cache': True
  }

deps/cmph/cxxmph/Makefile.am vendored Normal file

@@ -0,0 +1,62 @@
TESTS = $(check_PROGRAMS)
check_PROGRAMS = seeded_hash_test mph_bits_test hollow_iterator_test mph_index_test trigraph_test
if USE_LIBCHECK
check_PROGRAMS += test_test map_tester_test mph_map_test dense_hash_map_test string_util_test
check_LTLIBRARIES = libcxxmph_test.la
endif
if USE_BENCHMARKS
noinst_PROGRAMS = bm_map # bm_index - disabled because of cmph dependency
endif
bin_PROGRAMS = cxxmph
cxxmph_includedir = $(includedir)/cxxmph/
cxxmph_include_HEADERS = mph_bits.h mph_map.h mph_index.h MurmurHash3.h trigraph.h seeded_hash.h stringpiece.h hollow_iterator.h string_util.h
noinst_LTLIBRARIES = libcxxmph_bm.la
lib_LTLIBRARIES = libcxxmph.la
libcxxmph_la_SOURCES = MurmurHash3.cpp trigraph.cc mph_bits.cc mph_index.cc benchmark.h benchmark.cc string_util.cc
libcxxmph_la_LDFLAGS = -version-info 0:0:0
libcxxmph_test_la_SOURCES = test.h test.cc
libcxxmph_test_la_LIBADD = libcxxmph.la
libcxxmph_bm_la_SOURCES = benchmark.h benchmark.cc bm_common.h bm_common.cc
libcxxmph_bm_la_LIBADD = libcxxmph.la
test_test_SOURCES = test_test.cc
test_test_LDADD = libcxxmph_test.la $(CHECK_LIBS)
mph_map_test_LDADD = libcxxmph_test.la $(CHECK_LIBS)
mph_map_test_SOURCES = mph_map_test.cc
dense_hash_map_test_LDADD = libcxxmph_test.la $(CHECK_LIBS)
dense_hash_map_test_SOURCES = dense_hash_map_test.cc
mph_index_test_LDADD = libcxxmph.la
mph_index_test_SOURCES = mph_index_test.cc
trigraph_test_LDADD = libcxxmph.la
trigraph_test_SOURCES = trigraph_test.cc
# Bad dependency, do not compile by default.
# bm_index_LDADD = libcxxmph_bm.la -lcmph
# bm_index_SOURCES = bm_index.cc
bm_map_LDADD = libcxxmph_bm.la
bm_map_SOURCES = bm_map.cc
cxxmph_LDADD = libcxxmph.la
cxxmph_SOURCES = cxxmph.cc
hollow_iterator_test_SOURCES = hollow_iterator_test.cc
seeded_hash_test_SOURCES = seeded_hash_test.cc
seeded_hash_test_LDADD = libcxxmph.la
mph_bits_test_SOURCES = mph_bits_test.cc
mph_bits_test_LDADD = libcxxmph.la
string_util_test_SOURCES = string_util_test.cc
string_util_test_LDADD = libcxxmph.la libcxxmph_test.la $(CHECK_LIBS)
map_tester_test_SOURCES = map_tester.h map_tester.cc map_tester_test.cc
map_tester_test_LDADD = libcxxmph.la libcxxmph_test.la $(CHECK_LIBS)

deps/cmph/cxxmph/MurmurHash3.cpp vendored Normal file

@@ -0,0 +1,335 @@
//-----------------------------------------------------------------------------
// MurmurHash3 was written by Austin Appleby, and is placed in the public
// domain. The author hereby disclaims copyright to this source code.
// Note - The x86 and x64 versions do _not_ produce the same results, as the
// algorithms are optimized for their respective platforms. You can still
// compile and run any of them on any platform, but your performance with the
// non-native version will be less than optimal.
#include "MurmurHash3.h"
//-----------------------------------------------------------------------------
// Platform-specific functions and macros
// Microsoft Visual Studio
#if defined(_MSC_VER)
#define FORCE_INLINE __forceinline
#include <stdlib.h>
#define ROTL32(x,y) _rotl(x,y)
#define ROTL64(x,y) _rotl64(x,y)
#define BIG_CONSTANT(x) (x)
// Other compilers
#else // defined(_MSC_VER)
#define FORCE_INLINE __attribute__((always_inline))
inline uint32_t rotl32 ( uint32_t x, int8_t r )
{
return (x << r) | (x >> (32 - r));
}
inline uint64_t rotl64 ( uint64_t x, int8_t r )
{
return (x << r) | (x >> (64 - r));
}
#define ROTL32(x,y) rotl32(x,y)
#define ROTL64(x,y) rotl64(x,y)
#define BIG_CONSTANT(x) (x##LLU)
#endif // !defined(_MSC_VER)
//-----------------------------------------------------------------------------
// Block read - if your platform needs to do endian-swapping or can only
// handle aligned reads, do the conversion here
/*FORCE_INLINE*/ uint32_t getblock ( const uint32_t * p, int i )
{
return p[i];
}
/*FORCE_INLINE*/ uint64_t getblock ( const uint64_t * p, int i )
{
return p[i];
}
//-----------------------------------------------------------------------------
// Finalization mix - force all bits of a hash block to avalanche
/*FORCE_INLINE*/ uint32_t fmix ( uint32_t h )
{
h ^= h >> 16;
h *= 0x85ebca6b;
h ^= h >> 13;
h *= 0xc2b2ae35;
h ^= h >> 16;
return h;
}
//----------
/*FORCE_INLINE*/ uint64_t fmix ( uint64_t k )
{
k ^= k >> 33;
k *= BIG_CONSTANT(0xff51afd7ed558ccd);
k ^= k >> 33;
k *= BIG_CONSTANT(0xc4ceb9fe1a85ec53);
k ^= k >> 33;
return k;
}
//-----------------------------------------------------------------------------
void MurmurHash3_x86_32 ( const void * key, int len,
uint32_t seed, void * out )
{
const uint8_t * data = (const uint8_t*)key;
const int nblocks = len / 4;
uint32_t h1 = seed;
uint32_t c1 = 0xcc9e2d51;
uint32_t c2 = 0x1b873593;
//----------
// body
const uint32_t * blocks = (const uint32_t *)(data + nblocks*4);
for(int i = -nblocks; i; i++)
{
uint32_t k1 = getblock(blocks,i);
k1 *= c1;
k1 = ROTL32(k1,15);
k1 *= c2;
h1 ^= k1;
h1 = ROTL32(h1,13);
h1 = h1*5+0xe6546b64;
}
//----------
// tail
const uint8_t * tail = (const uint8_t*)(data + nblocks*4);
uint32_t k1 = 0;
switch(len & 3)
{
case 3: k1 ^= tail[2] << 16;
case 2: k1 ^= tail[1] << 8;
case 1: k1 ^= tail[0];
k1 *= c1; k1 = ROTL32(k1,15); k1 *= c2; h1 ^= k1;
};
//----------
// finalization
h1 ^= len;
h1 = fmix(h1);
*(uint32_t*)out = h1;
}
//-----------------------------------------------------------------------------
void MurmurHash3_x86_128 ( const void * key, const int len,
uint32_t seed, void * out )
{
const uint8_t * data = (const uint8_t*)key;
const int nblocks = len / 16;
uint32_t h1 = seed;
uint32_t h2 = seed;
uint32_t h3 = seed;
uint32_t h4 = seed;
uint32_t c1 = 0x239b961b;
uint32_t c2 = 0xab0e9789;
uint32_t c3 = 0x38b34ae5;
uint32_t c4 = 0xa1e38b93;
//----------
// body
const uint32_t * blocks = (const uint32_t *)(data + nblocks*16);
for(int i = -nblocks; i; i++)
{
uint32_t k1 = getblock(blocks,i*4+0);
uint32_t k2 = getblock(blocks,i*4+1);
uint32_t k3 = getblock(blocks,i*4+2);
uint32_t k4 = getblock(blocks,i*4+3);
k1 *= c1; k1 = ROTL32(k1,15); k1 *= c2; h1 ^= k1;
h1 = ROTL32(h1,19); h1 += h2; h1 = h1*5+0x561ccd1b;
k2 *= c2; k2 = ROTL32(k2,16); k2 *= c3; h2 ^= k2;
h2 = ROTL32(h2,17); h2 += h3; h2 = h2*5+0x0bcaa747;
k3 *= c3; k3 = ROTL32(k3,17); k3 *= c4; h3 ^= k3;
h3 = ROTL32(h3,15); h3 += h4; h3 = h3*5+0x96cd1c35;
k4 *= c4; k4 = ROTL32(k4,18); k4 *= c1; h4 ^= k4;
h4 = ROTL32(h4,13); h4 += h1; h4 = h4*5+0x32ac3b17;
}
//----------
// tail
const uint8_t * tail = (const uint8_t*)(data + nblocks*16);
uint32_t k1 = 0;
uint32_t k2 = 0;
uint32_t k3 = 0;
uint32_t k4 = 0;
switch(len & 15)
{
case 15: k4 ^= tail[14] << 16;
case 14: k4 ^= tail[13] << 8;
case 13: k4 ^= tail[12] << 0;
k4 *= c4; k4 = ROTL32(k4,18); k4 *= c1; h4 ^= k4;
case 12: k3 ^= tail[11] << 24;
case 11: k3 ^= tail[10] << 16;
case 10: k3 ^= tail[ 9] << 8;
case 9: k3 ^= tail[ 8] << 0;
k3 *= c3; k3 = ROTL32(k3,17); k3 *= c4; h3 ^= k3;
case 8: k2 ^= tail[ 7] << 24;
case 7: k2 ^= tail[ 6] << 16;
case 6: k2 ^= tail[ 5] << 8;
case 5: k2 ^= tail[ 4] << 0;
k2 *= c2; k2 = ROTL32(k2,16); k2 *= c3; h2 ^= k2;
case 4: k1 ^= tail[ 3] << 24;
case 3: k1 ^= tail[ 2] << 16;
case 2: k1 ^= tail[ 1] << 8;
case 1: k1 ^= tail[ 0] << 0;
k1 *= c1; k1 = ROTL32(k1,15); k1 *= c2; h1 ^= k1;
};
//----------
// finalization
h1 ^= len; h2 ^= len; h3 ^= len; h4 ^= len;
h1 += h2; h1 += h3; h1 += h4;
h2 += h1; h3 += h1; h4 += h1;
h1 = fmix(h1);
h2 = fmix(h2);
h3 = fmix(h3);
h4 = fmix(h4);
h1 += h2; h1 += h3; h1 += h4;
h2 += h1; h3 += h1; h4 += h1;
((uint32_t*)out)[0] = h1;
((uint32_t*)out)[1] = h2;
((uint32_t*)out)[2] = h3;
((uint32_t*)out)[3] = h4;
}
//-----------------------------------------------------------------------------
void MurmurHash3_x64_128 ( const void * key, const int len,
const uint32_t seed, void * out )
{
const uint8_t * data = (const uint8_t*)key;
const int nblocks = len / 16;
uint64_t h1 = seed;
uint64_t h2 = seed;
uint64_t c1 = BIG_CONSTANT(0x87c37b91114253d5);
uint64_t c2 = BIG_CONSTANT(0x4cf5ad432745937f);
//----------
// body
const uint64_t * blocks = (const uint64_t *)(data);
for(int i = 0; i < nblocks; i++)
{
uint64_t k1 = getblock(blocks,i*2+0);
uint64_t k2 = getblock(blocks,i*2+1);
k1 *= c1; k1 = ROTL64(k1,31); k1 *= c2; h1 ^= k1;
h1 = ROTL64(h1,27); h1 += h2; h1 = h1*5+0x52dce729;
k2 *= c2; k2 = ROTL64(k2,33); k2 *= c1; h2 ^= k2;
h2 = ROTL64(h2,31); h2 += h1; h2 = h2*5+0x38495ab5;
}
//----------
// tail
const uint8_t * tail = (const uint8_t*)(data + nblocks*16);
uint64_t k1 = 0;
uint64_t k2 = 0;
switch(len & 15)
{
case 15: k2 ^= uint64_t(tail[14]) << 48;
case 14: k2 ^= uint64_t(tail[13]) << 40;
case 13: k2 ^= uint64_t(tail[12]) << 32;
case 12: k2 ^= uint64_t(tail[11]) << 24;
case 11: k2 ^= uint64_t(tail[10]) << 16;
case 10: k2 ^= uint64_t(tail[ 9]) << 8;
case 9: k2 ^= uint64_t(tail[ 8]) << 0;
k2 *= c2; k2 = ROTL64(k2,33); k2 *= c1; h2 ^= k2;
case 8: k1 ^= uint64_t(tail[ 7]) << 56;
case 7: k1 ^= uint64_t(tail[ 6]) << 48;
case 6: k1 ^= uint64_t(tail[ 5]) << 40;
case 5: k1 ^= uint64_t(tail[ 4]) << 32;
case 4: k1 ^= uint64_t(tail[ 3]) << 24;
case 3: k1 ^= uint64_t(tail[ 2]) << 16;
case 2: k1 ^= uint64_t(tail[ 1]) << 8;
case 1: k1 ^= uint64_t(tail[ 0]) << 0;
k1 *= c1; k1 = ROTL64(k1,31); k1 *= c2; h1 ^= k1;
};
//----------
// finalization
h1 ^= len; h2 ^= len;
h1 += h2;
h2 += h1;
h1 = fmix(h1);
h2 = fmix(h2);
h1 += h2;
h2 += h1;
((uint64_t*)out)[0] = h1;
((uint64_t*)out)[1] = h2;
}
//-----------------------------------------------------------------------------

deps/cmph/cxxmph/MurmurHash3.h vendored Normal file

@@ -0,0 +1,37 @@
//-----------------------------------------------------------------------------
// MurmurHash3 was written by Austin Appleby, and is placed in the public
// domain. The author hereby disclaims copyright to this source code.
#ifndef _MURMURHASH3_H_
#define _MURMURHASH3_H_
//-----------------------------------------------------------------------------
// Platform-specific functions and macros
// Microsoft Visual Studio
#if defined(_MSC_VER)
typedef unsigned char uint8_t;
typedef unsigned long uint32_t;
typedef unsigned __int64 uint64_t;
// Other compilers
#else // defined(_MSC_VER)
#include <stdint.h>
#endif // !defined(_MSC_VER)
//-----------------------------------------------------------------------------
void MurmurHash3_x86_32 ( const void * key, int len, uint32_t seed, void * out );
void MurmurHash3_x86_128 ( const void * key, int len, uint32_t seed, void * out );
void MurmurHash3_x64_128 ( const void * key, int len, uint32_t seed, void * out );
//-----------------------------------------------------------------------------
#endif // _MURMURHASH3_H_
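
The block-read comment in MurmurHash3.cpp notes that platforms restricted to aligned reads need a conversion step in `getblock`. A minimal sketch of one way to do that is shown below, assuming a `memcpy`-based read; `ReadBlock32`, `Rotl32`, `Fmix32` and `Murmur3_32` are hypothetical names used only for illustration and are not part of the vendored sources. The mixing constants are the ones from `MurmurHash3_x86_32`. Blocks are read in native byte order, so results match the reference only on little-endian hosts.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: read 4 bytes without assuming the pointer is aligned.
static inline uint32_t ReadBlock32(const uint8_t* p) {
  uint32_t v;
  std::memcpy(&v, p, sizeof(v));
  return v;
}

static inline uint32_t Rotl32(uint32_t x, int r) {
  return (x << r) | (x >> (32 - r));
}

// Finalization mix, same constants as fmix(uint32_t) in MurmurHash3.cpp.
static inline uint32_t Fmix32(uint32_t h) {
  h ^= h >> 16;
  h *= 0x85ebca6b;
  h ^= h >> 13;
  h *= 0xc2b2ae35;
  h ^= h >> 16;
  return h;
}

// Illustrative 32-bit MurmurHash3 that returns the hash by value.
uint32_t Murmur3_32(const void* key, size_t len, uint32_t seed) {
  const uint8_t* data = static_cast<const uint8_t*>(key);
  const size_t nblocks = len / 4;
  const uint32_t c1 = 0xcc9e2d51;
  const uint32_t c2 = 0x1b873593;
  uint32_t h1 = seed;

  // Body: one 4-byte block per iteration.
  for (size_t i = 0; i < nblocks; ++i) {
    uint32_t k1 = ReadBlock32(data + i * 4);
    k1 *= c1; k1 = Rotl32(k1, 15); k1 *= c2;
    h1 ^= k1; h1 = Rotl32(h1, 13); h1 = h1 * 5 + 0xe6546b64;
  }

  // Tail: the remaining 0..3 bytes.
  const uint8_t* tail = data + nblocks * 4;
  uint32_t k1 = 0;
  switch (len & 3) {
    case 3: k1 ^= static_cast<uint32_t>(tail[2]) << 16;  // fall through
    case 2: k1 ^= static_cast<uint32_t>(tail[1]) << 8;   // fall through
    case 1: k1 ^= tail[0];
            k1 *= c1; k1 = Rotl32(k1, 15); k1 *= c2; h1 ^= k1;
  }

  h1 ^= static_cast<uint32_t>(len);
  return Fmix32(h1);
}
```

Because every block goes through `memcpy`, the key may start at any byte offset, which the pointer cast in the vendored `getblock` does not guarantee on strict-alignment targets.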

deps/cmph/cxxmph/benchmark.cc vendored Normal file

@@ -0,0 +1,142 @@
#include "benchmark.h"
#include <cerrno>
#include <cstring>
#include <cstdio>
#include <cstdlib>
#ifdef HAVE_CXA_DEMANGLE
#include <cxxabi.h>
#endif
#include <memory>
#include <sys/time.h>
#include <sys/resource.h>
#include <iomanip>
#include <iostream>
#include <sstream>
#include <vector>
using std::cerr;
using std::cout;
using std::endl;
using std::setfill;
using std::setw;
using std::string;
using std::ostringstream;
using std::vector;
namespace {
/* Subtract the `struct timeval' values X and Y,
storing the result in RESULT.
Return 1 if the difference is negative, otherwise 0. */
int timeval_subtract (
struct timeval *result, struct timeval *x, struct timeval* y) {
/* Perform the carry for the later subtraction by updating y. */
if (x->tv_usec < y->tv_usec) {
int nsec = (y->tv_usec - x->tv_usec) / 1000000 + 1;
y->tv_usec -= 1000000 * nsec;
y->tv_sec += nsec;
}
if (x->tv_usec - y->tv_usec > 1000000) {
int nsec = (x->tv_usec - y->tv_usec) / 1000000;
y->tv_usec += 1000000 * nsec;
y->tv_sec -= nsec;
}
/* Compute the time remaining to wait.
tv_usec is certainly positive. */
result->tv_sec = x->tv_sec - y->tv_sec;
result->tv_usec = x->tv_usec - y->tv_usec;
/* Return 1 if result is negative. */
return x->tv_sec < y->tv_sec;
}
// C++ iostream is terrible for formatting.
string timeval_to_string(timeval tv) {
ostringstream out;
out << setfill(' ') << setw(3) << tv.tv_sec << '.';
out << setfill('0') << setw(6) << tv.tv_usec;
return out.str();
}
struct rusage getrusage_or_die() {
struct rusage rs;
int ret = getrusage(RUSAGE_SELF, &rs);
if (ret != 0) {
cerr << "rusage failed: " << strerror(errno) << endl;
exit(-1);
}
return rs;
}
struct timeval gettimeofday_or_die() {
struct timeval tv;
int ret = gettimeofday(&tv, NULL);
if (ret != 0) {
cerr << "gettimeofday failed: " << strerror(errno) << endl;
exit(-1);
}
return tv;
}
#ifdef HAVE_CXA_DEMANGLE
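string demangle(const string& name) {
int status = 0;
// __cxa_demangle needs malloc()ed storage; pass NULL so it allocates the buffer.
char* res = abi::__cxa_demangle(name.c_str(), NULL, NULL, &status);
// Fall back to the mangled name when demangling fails.
if (status != 0 || res == NULL) return name;
string demangled(res);
free(res);
return demangled;
}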
#else
string demangle(const string& name) { return name; }
#endif
static vector<cxxmph::Benchmark*> g_benchmarks;
} // anonymous namespace
namespace cxxmph {
/* static */ void Benchmark::Register(Benchmark* bm) {
if (bm->name().empty()) {
string name = demangle(typeid(*bm).name());
bm->set_name(name);
}
g_benchmarks.push_back(bm);
}
/* static */ void Benchmark::RunAll() {
for (uint32_t i = 0; i < g_benchmarks.size(); ++i) {
// std::auto_ptr is deprecated in C++11; unique_ptr gives the same ownership here.
std::unique_ptr<Benchmark> bm(g_benchmarks[i]);
if (!bm->SetUp()) {
cerr << "Set up phase for benchmark "
<< bm->name() << " failed." << endl;
continue;
}
bm->MeasureRun();
bm->TearDown();
}
}
void Benchmark::MeasureRun() {
struct timeval walltime_begin = gettimeofday_or_die();
struct rusage begin = getrusage_or_die();
Run();
struct rusage end = getrusage_or_die();
struct timeval walltime_end = gettimeofday_or_die();
struct timeval utime;
timeval_subtract(&utime, &end.ru_utime, &begin.ru_utime);
struct timeval stime;
timeval_subtract(&stime, &end.ru_stime, &begin.ru_stime);
struct timeval wtime;
timeval_subtract(&wtime, &walltime_end, &walltime_begin);
cout << "Benchmark: " << name_ << endl;
cout << "CPU User time : " << timeval_to_string(utime) << endl;
cout << "CPU System time: " << timeval_to_string(stime) << endl;
cout << "Wall clock time: " << timeval_to_string(wtime) << endl;
cout << endl;
}
} // namespace cxxmph
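
The borrow step that `timeval_subtract` performs can be illustrated in isolation. This is a minimal sketch, assuming both inputs are already normalized (0 <= usec < 1000000); `TimeVal` and `Subtract` are hypothetical names used only for illustration and do not appear in cxxmph.

```cpp
#include <cassert>

// Hypothetical stand-in for struct timeval: whole seconds plus microseconds.
struct TimeVal {
  long sec;
  long usec;
};

// Returns x - y. Assumes both operands are normalized.
TimeVal Subtract(TimeVal x, TimeVal y) {
  TimeVal r;
  r.sec = x.sec - y.sec;
  r.usec = x.usec - y.usec;
  if (r.usec < 0) {
    // Borrow one second so the microsecond field stays non-negative.
    r.usec += 1000000;
    r.sec -= 1;
  }
  return r;
}
```

For example, subtracting 3 s 900000 us from 5 s 100 us borrows one second and yields 1 s 100100 us. Unlike `timeval_subtract`, this sketch takes its operands by value and does not modify them.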

deps/cmph/cxxmph/benchmark.h vendored Normal file

@@ -0,0 +1,32 @@
#ifndef __CXXMPH_BENCHMARK_H__
#define __CXXMPH_BENCHMARK_H__
#include <string>
#include <typeinfo>
namespace cxxmph {
class Benchmark {
public:
Benchmark() {}
virtual ~Benchmark() {}
const std::string& name() { return name_; }
void set_name(const std::string& name) { name_ = name; }
static void Register(Benchmark* bm);
static void RunAll();
protected:
virtual bool SetUp() { return true; };
virtual void Run() = 0;
virtual bool TearDown() { return true; };
private:
std::string name_;
void MeasureRun();
};
} // namespace cxxmph
#endif

deps/cmph/cxxmph/bm_common.cc vendored Normal file

@@ -0,0 +1,75 @@
#include <cmath>
#include <cstdlib>
#include <fstream>
#include <limits>
#include <iostream>
#include <set>
#include "bm_common.h"
using std::cerr;
using std::endl;
using std::set;
using std::string;
using std::vector;
namespace cxxmph {
UrlsBenchmark::~UrlsBenchmark() {}
bool UrlsBenchmark::SetUp() {
vector<string> urls;
std::ifstream f(urls_file_.c_str());
if (!f.is_open()) {
cerr << "Failed to open urls file " << urls_file_ << endl;
return false;
}
string buffer;
while(std::getline(f, buffer)) urls.push_back(buffer);
set<string> unique(urls.begin(), urls.end());
if (unique.size() != urls.size()) {
cerr << "Input file has repeated keys." << endl;
return false;
}
urls.swap(urls_);
return true;
}
SearchUrlsBenchmark::~SearchUrlsBenchmark() {}
bool SearchUrlsBenchmark::SetUp() {
if (!UrlsBenchmark::SetUp()) return false;
int32_t miss_ratio_int32 = std::numeric_limits<int32_t>::max() * miss_ratio_;
forced_miss_urls_.resize(nsearches_);
random_.resize(nsearches_);
for (uint32_t i = 0; i < nsearches_; ++i) {
random_[i] = urls_[random() % urls_.size()];
if (random() < miss_ratio_int32) {
forced_miss_urls_[i] = random_[i].as_string() + ".force_miss";
random_[i] = forced_miss_urls_[i];
}
}
return true;
}
Uint64Benchmark::~Uint64Benchmark() {}
bool Uint64Benchmark::SetUp() {
set<uint64_t> unique;
for (uint32_t i = 0; i < count_; ++i) {
uint64_t v;
do { v = random(); } while (unique.find(v) != unique.end());
values_.push_back(v);
unique.insert(v);
}
return true;
}
SearchUint64Benchmark::~SearchUint64Benchmark() {}
bool SearchUint64Benchmark::SetUp() {
if (!Uint64Benchmark::SetUp()) return false;
random_.resize(nsearches_);
for (uint32_t i = 0; i < nsearches_; ++i) {
uint32_t pos = random() % values_.size();
random_[i] = values_[pos];
}
return true;
}
} // namespace cxxmph

deps/cmph/cxxmph/bm_common.h vendored Normal file

@@ -0,0 +1,73 @@
#ifndef __CXXMPH_BM_COMMON_H__
#define __CXXMPH_BM_COMMON_H__
#include "stringpiece.h"
#include <string>
#include <vector>
#include <unordered_map> // std::hash
#include "MurmurHash3.h"
#include "benchmark.h"
namespace std {
template <> struct hash<cxxmph::StringPiece> {
uint32_t operator()(const cxxmph::StringPiece& k) const {
uint32_t out;
MurmurHash3_x86_32(k.data(), k.length(), 1, &out);
return out;
}
};
} // namespace std
namespace cxxmph {
class UrlsBenchmark : public Benchmark {
public:
UrlsBenchmark(const std::string& urls_file) : urls_file_(urls_file) { }
virtual ~UrlsBenchmark();
protected:
virtual bool SetUp();
const std::string urls_file_;
std::vector<std::string> urls_;
};
class SearchUrlsBenchmark : public UrlsBenchmark {
public:
SearchUrlsBenchmark(const std::string& urls_file, uint32_t nsearches, float miss_ratio)
: UrlsBenchmark(urls_file), nsearches_(nsearches), miss_ratio_(miss_ratio) {}
virtual ~SearchUrlsBenchmark();
protected:
virtual bool SetUp();
const uint32_t nsearches_;
float miss_ratio_;
std::vector<std::string> forced_miss_urls_;
std::vector<StringPiece> random_;
};
class Uint64Benchmark : public Benchmark {
public:
Uint64Benchmark(uint32_t count) : count_(count) { }
virtual ~Uint64Benchmark();
virtual void Run() {}
protected:
virtual bool SetUp();
const uint32_t count_;
std::vector<uint64_t> values_;
};
class SearchUint64Benchmark : public Uint64Benchmark {
public:
SearchUint64Benchmark(uint32_t count, uint32_t nsearches)
: Uint64Benchmark(count), nsearches_(nsearches) { }
virtual ~SearchUint64Benchmark();
virtual void Run() {};
protected:
virtual bool SetUp();
const uint32_t nsearches_;
std::vector<uint64_t> random_;
};
} // namespace cxxmph
#endif // __CXXMPH_BM_COMMON_H__

deps/cmph/cxxmph/bm_index.cc vendored Normal file

@@ -0,0 +1,149 @@
#include <cmph.h>
#include <cstdio>
#include <cstring>
#include <iostream>
#include <set>
#include <string>
#include <unordered_map>
#include "bm_common.h"
#include "stringpiece.h"
#include "mph_index.h"
using namespace cxxmph;
using std::cerr;
using std::endl;
using std::string;
using std::unordered_map;
class BM_MPHIndexCreate : public UrlsBenchmark {
public:
BM_MPHIndexCreate(const std::string& urls_file)
: UrlsBenchmark(urls_file) { }
protected:
virtual void Run() {
SimpleMPHIndex<StringPiece> index;
index.Reset(urls_.begin(), urls_.end(), urls_.size());
}
};
class BM_STLIndexCreate : public UrlsBenchmark {
public:
BM_STLIndexCreate(const std::string& urls_file)
: UrlsBenchmark(urls_file) { }
protected:
virtual void Run() {
unordered_map<StringPiece, uint32_t> index;
int idx = 0;
for (auto it = urls_.begin(); it != urls_.end(); ++it) {
index.insert(make_pair(*it, idx++));
}
}
};
class BM_MPHIndexSearch : public SearchUrlsBenchmark {
public:
BM_MPHIndexSearch(const std::string& urls_file, int nsearches)
: SearchUrlsBenchmark(urls_file, nsearches, 0) { }
virtual void Run() {
uint64_t sum = 0;
for (auto it = random_.begin(); it != random_.end(); ++it) {
auto idx = index_.index(*it);
// Collision check to be fair with STL
if (strcmp(urls_[idx].c_str(), it->data()) != 0) idx = -1;
sum += idx;
}
}
protected:
virtual bool SetUp () {
if (!SearchUrlsBenchmark::SetUp()) return false;
index_.Reset(urls_.begin(), urls_.end(), urls_.size());
return true;
}
SimpleMPHIndex<StringPiece> index_;
};
class BM_CmphIndexSearch : public SearchUrlsBenchmark {
public:
BM_CmphIndexSearch(const std::string& urls_file, int nsearches)
: SearchUrlsBenchmark(urls_file, nsearches, 0), index_(NULL) { }
~BM_CmphIndexSearch() { if (index_) cmph_destroy(index_); }
virtual void Run() {
uint64_t sum = 0;
for (auto it = random_.begin(); it != random_.end(); ++it) {
auto idx = cmph_search(index_, it->data(), it->length());
// Collision check to be fair with STL
if (strcmp(urls_[idx].c_str(), it->data()) != 0) idx = -1;
sum += idx;
}
}
protected:
virtual bool SetUp() {
if (!SearchUrlsBenchmark::SetUp()) {
cerr << "Parent class setup failed." << endl;
return false;
}
FILE* f = fopen(urls_file_.c_str(), "r");
if (!f) {
cerr << "Failed to open " << urls_file_ << endl;
return false;
}
cmph_io_adapter_t* source = cmph_io_nlfile_adapter(f);
if (!source) {
cerr << "Failed to create io adapter for " << urls_file_ << endl;
return false;
}
cmph_config_t* config = cmph_config_new(source);
if (!config) {
cerr << "Failed to create config" << endl;
return false;
}
cmph_config_set_algo(config, CMPH_BDZ);
cmph_t* mphf = cmph_new(config);
if (!mphf) {
cerr << "Failed to create mphf." << endl;
return false;
}
cmph_config_destroy(config);
cmph_io_nlfile_adapter_destroy(source);
fclose(f);
index_ = mphf;
return true;
}
cmph_t* index_;
};
class BM_STLIndexSearch : public SearchUrlsBenchmark {
public:
BM_STLIndexSearch(const std::string& urls_file, int nsearches)
: SearchUrlsBenchmark(urls_file, nsearches, 0) { }
virtual void Run() {
uint64_t sum = 0;
for (auto it = random_.begin(); it != random_.end(); ++it) {
auto idx = index_.find(*it);
sum += idx->second;
}
}
protected:
virtual bool SetUp () {
if (!SearchUrlsBenchmark::SetUp()) return false;
unordered_map<StringPiece, uint32_t> index;
int idx = 0;
for (auto it = urls_.begin(); it != urls_.end(); ++it) {
index.insert(make_pair(*it, idx++));
}
index.swap(index_);
return true;
}
unordered_map<StringPiece, uint32_t> index_;
};
int main(int argc, char** argv) {
Benchmark::Register(new BM_MPHIndexCreate("URLS100k"));
Benchmark::Register(new BM_STLIndexCreate("URLS100k"));
Benchmark::Register(new BM_MPHIndexSearch("URLS100k", 10*1000*1000));
Benchmark::Register(new BM_STLIndexSearch("URLS100k", 10*1000*1000));
Benchmark::Register(new BM_CmphIndexSearch("URLS100k", 10*1000*1000));
Benchmark::RunAll();
return 0;
}

deps/cmph/cxxmph/bm_map.cc vendored Normal file

@@ -0,0 +1,126 @@
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <string>
#include <unordered_map>
#include "hopscotch_map.h"
#include "bm_common.h"
#include "mph_map.h"
using std::cerr;
using std::endl;
using std::string;
// Another reference benchmark:
// http://blog.aggregateknowledge.com/tag/bigmemory/
namespace cxxmph {
template <class MapType, class T>
const T* myfind(const MapType& mymap, const T& k) {
auto it = mymap.find(k);
auto end = mymap.end();
if (it == end) return NULL;
return &it->second;
}
template <class MapType>
class BM_CreateUrls : public UrlsBenchmark {
public:
BM_CreateUrls(const string& urls_file) : UrlsBenchmark(urls_file) { }
virtual void Run() {
MapType mymap;
for (auto it = urls_.begin(); it != urls_.end(); ++it) {
mymap[*it] = *it;
}
}
};
template <class MapType>
class BM_SearchUrls : public SearchUrlsBenchmark {
public:
BM_SearchUrls(const std::string& urls_file, int nsearches, float miss_ratio)
: SearchUrlsBenchmark(urls_file, nsearches, miss_ratio) { }
virtual ~BM_SearchUrls() {}
virtual void Run() {
uint32_t total = 1;
for (auto it = random_.begin(); it != random_.end(); ++it) {
auto v = myfind(mymap_, *it);
if (v) total += v->length();
}
fprintf(stderr, "Total: %u\n", total);
}
protected:
virtual bool SetUp() {
if (!SearchUrlsBenchmark::SetUp()) return false;
for (auto it = urls_.begin(); it != urls_.end(); ++it) {
mymap_[*it] = *it;
}
mymap_.rehash(mymap_.bucket_count());
fprintf(stderr, "Occupation: %f\n", static_cast<float>(mymap_.size())/mymap_.bucket_count());
return true;
}
MapType mymap_;
};
template <class MapType>
class BM_SearchUint64 : public SearchUint64Benchmark {
public:
BM_SearchUint64() : SearchUint64Benchmark(100000, 10*1000*1000) { }
virtual bool SetUp() {
if (!SearchUint64Benchmark::SetUp()) return false;
for (uint32_t i = 0; i < values_.size(); ++i) {
mymap_[values_[i]] = values_[i];
}
mymap_.rehash(mymap_.bucket_count());
// Double check if everything is all right
cerr << "Doing double check" << endl;
for (uint32_t i = 0; i < values_.size(); ++i) {
if (mymap_[values_[i]] != values_[i]) {
cerr << "Looking for " << i << " th key value " << values_[i];
cerr << " yielded " << mymap_[values_[i]] << endl;
return false;
}
}
return true;
}
virtual void Run() {
for (auto it = random_.begin(); it != random_.end(); ++it) {
auto v = myfind(mymap_, *it);
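// myfind returns NULL on a miss, so check before dereferencing.
if (v == NULL || *v != *it) {
cerr << "Looked for " << *it << " got ";
if (v) cerr << *v; else cerr << "nothing";
cerr << endl;
exit(-1);
}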
}
}
MapType mymap_;
};
} // namespace cxxmph
using namespace cxxmph;
int main(int argc, char** argv) {
srandom(4);
Benchmark::Register(new BM_CreateUrls<dense_hash_map<StringPiece, StringPiece>>("URLS100k"));
Benchmark::Register(new BM_CreateUrls<std::unordered_map<StringPiece, StringPiece>>("URLS100k"));
Benchmark::Register(new BM_CreateUrls<mph_map<StringPiece, StringPiece>>("URLS100k"));
Benchmark::Register(new BM_CreateUrls<sparse_hash_map<StringPiece, StringPiece>>("URLS100k"));
Benchmark::Register(new BM_CreateUrls<tsl::hopscotch_map<StringPiece, StringPiece>>("URLS100k"));
Benchmark::Register(new BM_SearchUrls<dense_hash_map<StringPiece, StringPiece>>("URLS100k", 10*1000 * 1000, 0));
Benchmark::Register(new BM_SearchUrls<std::unordered_map<StringPiece, StringPiece, Murmur3StringPiece>>("URLS100k", 10*1000 * 1000, 0));
Benchmark::Register(new BM_SearchUrls<mph_map<StringPiece, StringPiece>>("URLS100k", 10*1000 * 1000, 0));
Benchmark::Register(new BM_SearchUrls<sparse_hash_map<StringPiece, StringPiece>>("URLS100k", 10*1000 * 1000, 0));
Benchmark::Register(new BM_SearchUrls<tsl::hopscotch_map<StringPiece, StringPiece>>("URLS100k", 10*1000 * 1000, 0));
Benchmark::Register(new BM_SearchUrls<dense_hash_map<StringPiece, StringPiece>>("URLS100k", 10*1000 * 1000, 0.9));
Benchmark::Register(new BM_SearchUrls<std::unordered_map<StringPiece, StringPiece, Murmur3StringPiece>>("URLS100k", 10*1000 * 1000, 0.9));
Benchmark::Register(new BM_SearchUrls<mph_map<StringPiece, StringPiece>>("URLS100k", 10*1000 * 1000, 0.9));
Benchmark::Register(new BM_SearchUrls<sparse_hash_map<StringPiece, StringPiece>>("URLS100k", 10*1000 * 1000, 0.9));
Benchmark::Register(new BM_SearchUrls<tsl::hopscotch_map<StringPiece, StringPiece>>("URLS100k", 10*1000 * 1000, 0.9));
Benchmark::Register(new BM_SearchUint64<dense_hash_map<uint64_t, uint64_t>>);
Benchmark::Register(new BM_SearchUint64<std::unordered_map<uint64_t, uint64_t>>);
Benchmark::Register(new BM_SearchUint64<mph_map<uint64_t, uint64_t>>);
Benchmark::Register(new BM_SearchUint64<sparse_hash_map<uint64_t, uint64_t>>);
Benchmark::Register(new BM_SearchUint64<tsl::hopscotch_map<uint64_t, uint64_t>>);
Benchmark::RunAll();
}

deps/cmph/cxxmph/cxxmph.cc vendored Normal file

@@ -0,0 +1,74 @@
// Copyright 2010 Google Inc. All Rights Reserved.
// Author: davi@google.com (Davi Reis)
#include <getopt.h>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>
#include "mph_map.h"
#include "config.h"
using std::cerr;
using std::cout;
using std::endl;
using std::getline;
using std::ifstream;
using std::string;
using std::vector;
using cxxmph::mph_map;
void usage(const char* prg) {
cerr << "usage: " << prg << " [-v] [-h] [-V] <keys.txt>" << endl;
}
void usage_long(const char* prg) {
usage(prg);
cerr << " -h\t print this help message" << endl;
cerr << " -V\t print version number and exit" << endl;
cerr << " -v\t increase verbosity (may be used multiple times)" << endl;
}
int main(int argc, char** argv) {
int verbosity = 0;
while (1) {
// getopt returns int; storing it in a char never equals -1 where char is unsigned.
int ch = getopt(argc, argv, "hvV");
if (ch == -1) break;
switch (ch) {
case 'h':
usage_long(argv[0]);
return 0;
case 'V':
std::cout << VERSION << std::endl;
return 0;
case 'v':
++verbosity;
break;
}
}
if (optind != argc - 1) {
usage(argv[0]);
return 1;
}
vector<string> keys;
ifstream f(argv[optind]);
if (!f.is_open()) {
std::cerr << "Failed to open " << argv[optind] << std::endl;
exit(-1);
}
string buffer;
// Test the stream itself so a final line without a trailing newline is kept.
while (getline(f, buffer)) keys.push_back(buffer);
for (uint32_t i = 0; i < keys.size(); ++i) string s = keys[i];
mph_map<string, string> table;
for (uint32_t i = 0; i < keys.size(); ++i) table[keys[i]] = keys[i];
mph_map<string, string>::const_iterator it = table.begin();
mph_map<string, string>::const_iterator end = table.end();
for (int i = 0; it != end; ++it, ++i) {
cout << i << ": " << it->first
<<" -> " << it->second << endl;
}
}

deps/cmph/cxxmph/dense_hash_map_test.cc vendored Normal file

@@ -0,0 +1,25 @@
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <string>
#include "mph_map.h"
#include "map_tester.h"
#include "test.h"
using namespace cxxmph;
typedef MapTester<dense_hash_map> Tester;
CXXMPH_CXX_TEST_CASE(empty_find, Tester::empty_find);
CXXMPH_CXX_TEST_CASE(empty_erase, Tester::empty_erase);
CXXMPH_CXX_TEST_CASE(small_insert, Tester::small_insert);
CXXMPH_CXX_TEST_CASE(large_insert, Tester::large_insert);
CXXMPH_CXX_TEST_CASE(small_search, Tester::small_search);
CXXMPH_CXX_TEST_CASE(default_search, Tester::default_search);
CXXMPH_CXX_TEST_CASE(large_search, Tester::large_search);
CXXMPH_CXX_TEST_CASE(string_search, Tester::string_search);
CXXMPH_CXX_TEST_CASE(rehash_zero, Tester::rehash_zero);
CXXMPH_CXX_TEST_CASE(rehash_size, Tester::rehash_size);
CXXMPH_CXX_TEST_CASE(erase_value, Tester::erase_value);
CXXMPH_CXX_TEST_CASE(erase_iterator, Tester::erase_iterator);

deps/cmph/cxxmph/hollow_iterator.h vendored Normal file

@@ -0,0 +1,81 @@
#ifndef __CXXMPH_HOLLOW_ITERATOR_H__
#define __CXXMPH_HOLLOW_ITERATOR_H__
#include <cstddef>
#include <iterator>
#include <vector>
namespace cxxmph {
using std::vector;
template <typename container_type>
struct is_empty {
public:
is_empty() : c_(NULL), p_(NULL) {};
is_empty(const container_type* c, const vector<bool>* p) : c_(c), p_(p) {};
bool operator()(typename container_type::const_iterator it) const {
if (it == c_->end()) return false;
return !(*p_)[it - c_->begin()];
}
private:
const container_type* c_;
const vector<bool>* p_;
};
template <typename iterator, typename is_empty>
struct hollow_iterator_base
: public std::iterator<std::forward_iterator_tag,
typename iterator::value_type> {
public:
typedef hollow_iterator_base<iterator, is_empty> self_type;
typedef self_type& self_reference;
typedef typename iterator::reference reference;
typedef typename iterator::pointer pointer;
inline hollow_iterator_base() : it_(), empty_() { }
inline hollow_iterator_base(iterator it, is_empty empty, bool solid) : it_(it), empty_(empty) {
if (!solid) advance();
}
// Same as above, assumes solid==true.
inline hollow_iterator_base(iterator it, is_empty empty) : it_(it), empty_(empty) {}
inline hollow_iterator_base(const self_type& rhs) { it_ = rhs.it_; empty_ = rhs.empty_; }
template <typename const_iterator>
hollow_iterator_base(const hollow_iterator_base<const_iterator, is_empty>& rhs) { it_ = rhs.it_; empty_ = rhs.empty_; }
reference operator*() { return *it_; }
pointer operator->() { return &(*it_); }
self_reference operator++() { ++it_; advance(); return *this; }
// self_type operator++() { auto tmp(*this); ++tmp; return tmp; }
template <typename const_iterator>
bool operator==(const hollow_iterator_base<const_iterator, is_empty>& rhs) { return rhs.it_ == it_; }
template <typename const_iterator>
bool operator!=(const hollow_iterator_base<const_iterator, is_empty>& rhs) { return rhs.it_ != it_; }
// should be friend
iterator it_;
is_empty empty_;
private:
void advance() {
while (empty_(it_)) ++it_;
}
};
template <typename container_type, typename iterator>
inline auto make_solid(
container_type* v, const vector<bool>* p, iterator it) ->
hollow_iterator_base<iterator, is_empty<const container_type>> {
return hollow_iterator_base<iterator, is_empty<const container_type>>(
it, is_empty<const container_type>(v, p));
}
template <typename container_type, typename iterator>
inline auto make_hollow(
container_type* v, const vector<bool>* p, iterator it) ->
hollow_iterator_base<iterator, is_empty<const container_type>> {
return hollow_iterator_base<iterator, is_empty<const container_type>>(
it, is_empty<const container_type>(v, p), false);
}
} // namespace cxxmph
#endif // __CXXMPH_HOLLOW_ITERATOR_H__
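
The skipping behaviour of hollow_iterator_base can be illustrated without the header itself. The following is a minimal standalone sketch, not the cxxmph implementation: collect_present is a hypothetical helper that walks a slot vector alongside a presence bitmap and keeps only the occupied slots, mirroring what operator++ followed by advance() does.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of cxxmph): gather the values stored in slots
// whose presence bit is set, skipping holes the way advance() does.
template <typename T>
std::vector<T> collect_present(const std::vector<T>& slots,
                               const std::vector<bool>& present) {
  assert(slots.size() == present.size());
  std::vector<T> out;
  std::size_t i = 0;
  // Equivalent of constructing with solid == false: skip leading holes.
  while (i < slots.size() && !present[i]) ++i;
  while (i < slots.size()) {
    out.push_back(slots[i]);
    ++i;                                          // operator++ steps once,
    while (i < slots.size() && !present[i]) ++i;  // then holes are skipped.
  }
  return out;
}
```

Unlike this sketch, the header's is_empty functor returns false at end() instead of bounds-checking an index, which is what stops advance() from running past the container.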


@@ -0,0 +1,49 @@
#include <cstdlib>
#include <cstdio>
#include <vector>
#include <iostream>
using std::cerr;
using std::endl;
using std::vector;
#include "hollow_iterator.h"
using cxxmph::hollow_iterator_base;
using cxxmph::make_hollow;
using cxxmph::is_empty;
int main(int, char**) {
vector<int> v;
vector<bool> p;
for (int i = 0; i < 100; ++i) {
v.push_back(i);
p.push_back(i % 2 == 0);
}
auto begin = make_hollow(&v, &p, v.begin());
auto end = make_hollow(&v, &p, v.end());
for (auto it = begin; it != end; ++it) {
if (((*it) % 2) != 0) exit(-1);
}
const vector<int>* cv(&v);
auto cbegin(make_hollow(cv, &p, cv->begin()));
auto cend(make_hollow(cv, &p, cv->end()));
for (auto it = cbegin; it != cend; ++it) {
if (((*it) % 2) != 0) exit(-1);
}
const vector<bool>* cp(&p);
cbegin = make_hollow(cv, cp, v.begin());
cend = make_hollow(cv, cp, cv->end());
vector<int>::iterator vit1 = v.begin();
vector<int>::const_iterator vit2 = v.begin();
if (vit1 != vit2) exit(-1);
auto it1 = make_hollow(&v, &p, vit1);
auto it2 = make_hollow(&v, &p, vit2);
if (it1 != it2) exit(-1);
typedef is_empty<const vector<int>> iev;
hollow_iterator_base<vector<int>::iterator, iev> default_constructed;
default_constructed = make_hollow(&v, &p, v.begin());
return 0;
}

4
deps/cmph/cxxmph/map_tester.cc vendored Normal file

@@ -0,0 +1,4 @@
#include "map_tester.h"
namespace cxxmph {
}

138
deps/cmph/cxxmph/map_tester.h vendored Normal file

@@ -0,0 +1,138 @@
#ifndef __CXXMPH_MAP_TEST_HELPER_H__
#define __CXXMPH_MAP_TEST_HELPER_H__
#include <cstdint>
#include <string>
#include <utility>
#include <vector>
#include <unordered_map>
#include "string_util.h"
#include <check.h>
namespace cxxmph {
using namespace cxxmph;
using namespace std;
template <template<typename...> class map_type>
struct MapTester {
static bool empty_find() {
map_type<int64_t, int64_t> m;
for (int i = 0; i < 1000; ++i) {
if (m.find(i) != m.end()) return false;
}
return true;
}
static bool empty_erase() {
map_type<int64_t, int64_t> m;
for (int i = 0; i < 1000; ++i) {
m.erase(i);
if (m.size()) return false;
}
return true;
}
static bool small_insert() {
map_type<int64_t, int64_t> m;
// Start counting from 1 to not touch default constructed value bugs
for (int i = 1; i < 12; ++i) m.insert(make_pair(i, i));
return m.size() == 11;
}
static bool large_insert() {
map_type<int64_t, int64_t> m;
// Start counting from 1 to not touch default constructed value bugs
int nkeys = 12 * 256 * 256;
for (int i = 1; i < nkeys; ++i) m.insert(make_pair(i, i));
return static_cast<int>(m.size()) == nkeys - 1;
}
static bool small_search() {
map_type<int64_t, int64_t> m;
// Start counting from 1 to not touch default constructed value bugs
for (int i = 1; i < 12; ++i) m.insert(make_pair(i, i));
for (int i = 1; i < 12; ++i) if (m.find(i) == m.end()) return false;
return true;
}
static bool default_search() {
map_type<int64_t, int64_t> m;
if (m.find(0) != m.end()) return false;
for (int i = 1; i < 256; ++i) m.insert(make_pair(i, i));
if (m.find(0) != m.end()) return false;
for (int i = 0; i < 256; ++i) m.insert(make_pair(i, i));
if (m.find(0) == m.end()) return false;
return true;
}
static bool large_search() {
int nkeys = 10 * 1000;
map_type<int64_t, int64_t> m;
for (int i = 0; i < nkeys; ++i) m.insert(make_pair(i, i));
for (int i = 0; i < nkeys; ++i) if (m.find(i) == m.end()) return false;
return true;
}
static bool string_search() {
int nkeys = 10 * 1000;
vector<string> keys;
for (int i = 0; i < nkeys; ++i) {
keys.push_back(format("%v", i));
}
map_type<string, int64_t> m;
for (int i = 0; i < nkeys; ++i) m.insert(make_pair(keys[i], i));
for (int i = 0; i < nkeys; ++i) {
auto it = m.find(keys[i]);
if (it == m.end()) return false;
if (it->second != i) return false;
}
return true;
}
static bool rehash_zero() {
map_type<int64_t, int64_t> m;
m.rehash(0);
return m.size() == 0;
}
static bool rehash_size() {
map_type<int64_t, int64_t> m;
int nkeys = 10 * 1000;
for (int i = 0; i < nkeys; ++i) { m.insert(make_pair(i, i)); }
m.rehash(nkeys);
for (int i = 0; i < nkeys; ++i) { if (m.find(i) == m.end()) return false; }
for (int i = nkeys; i < nkeys * 2; ++i) {
if (m.find(i) != m.end()) return false;
}
return true;
}
static bool erase_iterator() {
map_type<int64_t, int64_t> m;
int nkeys = 10 * 1000;
for (int i = 0; i < nkeys; ++i) { m.insert(make_pair(i, i)); }
for (int i = 0; i < nkeys; ++i) {
if (m.find(i) == m.end()) return false;
}
for (int i = nkeys - 1; i >= 0; --i) { if (m.find(i) == m.end()) return false; }
for (int i = nkeys - 1; i >= 0; --i) {
fail_unless(m.find(i) != m.end(), "after erase %d cannot be found", i);
fail_unless(m.find(i)->first == i, "after erase key %d cannot be found", i);
}
for (int i = nkeys - 1; i >= 0; --i) {
fail_unless(m.find(i) != m.end(), "after erase %d cannot be found", i);
fail_unless(m.find(i)->first == i, "after erase key %d cannot be found", i);
if (!(m.find(i)->first == i)) return false;
m.erase(m.find(i));
if (static_cast<int>(m.size()) != i) return false;
}
return true;
}
static bool erase_value() {
map_type<int64_t, int64_t> m;
int nkeys = 10 * 1000;
for (int i = 0; i < nkeys; ++i) { m.insert(make_pair(i, i)); }
for (int i = nkeys - 1; i >= 0; --i) {
fail_unless(m.find(i) != m.end());
m.erase(i);
if (static_cast<int>(m.size()) != i) return false;
}
return true;
}
};
} // namespace cxxmph
#endif // __CXXMPH_MAP_TEST_HELPER_H__
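
MapTester takes the map class itself as a template template parameter, so the same checks can run against any container with an unordered_map-like interface. Below is a cut-down sketch of that pattern under stated assumptions: MiniTester is a hypothetical name, it uses only standard containers, and it reports failures through return values rather than check's fail_unless.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <unordered_map>
#include <utility>

// Hypothetical miniature of MapTester: map_type is a class template, not a type.
template <template <typename...> class map_type>
struct MiniTester {
  static bool small_insert() {
    map_type<int64_t, int64_t> m;
    // Keys start at 1, as in the original tests.
    for (int64_t i = 1; i < 12; ++i) m.insert(std::make_pair(i, i));
    return m.size() == 11;
  }
  static bool erase_value() {
    map_type<int64_t, int64_t> m;
    const int64_t nkeys = 100;
    for (int64_t i = 0; i < nkeys; ++i) m.insert(std::make_pair(i, i));
    for (int64_t i = nkeys - 1; i >= 0; --i) {
      if (m.find(i) == m.end()) return false;
      m.erase(i);
      // After erasing key i, exactly the keys 0..i-1 remain.
      if (static_cast<int64_t>(m.size()) != i) return false;
    }
    return true;
  }
};
```

Instantiating it as MiniTester<std::unordered_map> or MiniTester<std::map> works because the variadic template template parameter accepts class templates that have extra defaulted parameters.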

17
deps/cmph/cxxmph/map_tester_test.cc vendored Normal file

@@ -0,0 +1,17 @@
#include "map_tester.h"
#include "test.h"
using namespace cxxmph;
typedef MapTester<std::unordered_map> Tester;
CXXMPH_CXX_TEST_CASE(small_insert, Tester::small_insert);
CXXMPH_CXX_TEST_CASE(large_insert, Tester::large_insert);
CXXMPH_CXX_TEST_CASE(small_search, Tester::small_search);
CXXMPH_CXX_TEST_CASE(default_search, Tester::default_search);
CXXMPH_CXX_TEST_CASE(large_search, Tester::large_search);
CXXMPH_CXX_TEST_CASE(string_search, Tester::string_search);
CXXMPH_CXX_TEST_CASE(rehash_zero, Tester::rehash_zero);
CXXMPH_CXX_TEST_CASE(rehash_size, Tester::rehash_size);
CXXMPH_CXX_TEST_CASE(erase_value, Tester::erase_value);
CXXMPH_CXX_TEST_CASE(erase_iterator, Tester::erase_iterator);

11
deps/cmph/cxxmph/mph_bits.cc vendored Normal file

@@ -0,0 +1,11 @@
#include "mph_bits.h"
namespace cxxmph {
const uint8_t dynamic_2bitset::vmask[] = { 0xfc, 0xf3, 0xcf, 0x3f};
dynamic_2bitset::dynamic_2bitset() : size_(0), fill_(false) {}
dynamic_2bitset::dynamic_2bitset(uint32_t size, bool fill)
: size_(size), fill_(fill), data_(ceil(size / 4.0), ones()*fill) {}
dynamic_2bitset::~dynamic_2bitset() {}
}

73
deps/cmph/cxxmph/mph_bits.h vendored Normal file

@@ -0,0 +1,73 @@
#ifndef __CXXMPH_MPH_BITS_H__
#define __CXXMPH_MPH_BITS_H__
#include <stdint.h> // for uint32_t and friends
#include <array>
#include <cassert>
#include <climits>
#include <cmath>
#include <cstdio>
#include <cstring>
#include <limits>
#include <vector>
#include <utility>
namespace cxxmph {
class dynamic_2bitset {
public:
dynamic_2bitset();
~dynamic_2bitset();
dynamic_2bitset(uint32_t size, bool fill = false);
const uint8_t operator[](uint32_t i) const { return get(i); }
const uint8_t get(uint32_t i) const {
assert(i < size());
assert((i >> 2) < data_.size());
return (data_[(i >> 2)] >> (((i & 3) << 1)) & 3);
}
void set(uint32_t i, uint8_t v) {
assert((i >> 2) < data_.size());
data_[(i >> 2)] |= ones() ^ dynamic_2bitset::vmask[i & 3];
data_[(i >> 2)] &= ((v << ((i & 3) << 1)) | dynamic_2bitset::vmask[i & 3]);
assert(v <= 3);
assert(get(i) == v);
}
void resize(uint32_t size) {
size_ = size;
// Round up so a partial final byte is kept, matching the constructor.
data_.resize((size + 3) >> 2, fill_*ones());
}
void swap(dynamic_2bitset& other) {
std::swap(other.size_, size_);
std::swap(other.fill_, fill_);
other.data_.swap(data_);
}
void clear() { data_.clear(); size_ = 0; }
uint32_t size() const { return size_; }
static const uint8_t vmask[];
const std::vector<uint8_t>& data() const { return data_; }
private:
uint32_t size_;
bool fill_;
std::vector<uint8_t> data_;
const uint8_t ones() { return std::numeric_limits<uint8_t>::max(); }
};
static uint32_t nextpoweroftwo(uint32_t k) {
if (k == 0) return 1;
k--;
for (uint32_t i=1; i<sizeof(uint32_t)*CHAR_BIT; i<<=1) k = k | k >> i;
return k+1;
}
// Interesting bit tricks that might end up here:
// http://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord
// Fast a % (k*2^t)
// http://www.azillionmonkeys.com/qed/adiv.html
// rank and select:
// http://vigna.dsi.unimi.it/ftp/papers/Broadword.pdf
} // namespace cxxmph
#endif
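
dynamic_2bitset packs four 2-bit values into each byte: entry i lives in byte i >> 2 at bit offset (i & 3) << 1. The sketch below restates that layout with hypothetical free functions (get2, set2 and bytes_for are illustrative names, not part of cxxmph). It clears the slot with a computed mask instead of the vmask table.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Number of bytes needed for n two-bit entries (rounded up).
inline uint32_t bytes_for(uint32_t n) { return (n + 3) >> 2; }

// Read the 2-bit entry at index i.
inline uint8_t get2(const std::vector<uint8_t>& data, uint32_t i) {
  return (data[i >> 2] >> ((i & 3) << 1)) & 3;
}

// Overwrite the 2-bit entry at index i with v (only the low two bits are used).
inline void set2(std::vector<uint8_t>& data, uint32_t i, uint8_t v) {
  const uint32_t shift = (i & 3) << 1;
  uint8_t& byte = data[i >> 2];
  byte = static_cast<uint8_t>((byte & ~(3u << shift)) | ((v & 3u) << shift));
}
```

With this layout, five entries need two bytes, and a vector filled with 0xff reads back as 3 in every slot. That is the state dynamic_2bitset(size, true) starts from.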

61
deps/cmph/cxxmph/mph_bits_test.cc vendored Normal file

@@ -0,0 +1,61 @@
#include <cstdio>
#include <cstdlib>
#include "mph_bits.h"
using cxxmph::dynamic_2bitset;
using cxxmph::nextpoweroftwo;
int main(int argc, char** argv) {
dynamic_2bitset small(256, true);
for (uint32_t i = 0; i < small.size(); ++i) small.set(i, i % 4);
for (uint32_t i = 0; i < small.size(); ++i) {
if (small[i] != i % 4) {
fprintf(stderr, "wrong bits %d at %d expected %d\n", small[i], i, i % 4);
exit(-1);
}
}
uint32_t size = 256;
dynamic_2bitset bits(size, true /* fill with ones */);
for (uint32_t i = 0; i < size; ++i) {
if (bits[i] != 3) {
fprintf(stderr, "wrong bits %d at %d expected %d\n", bits[i], i, 3);
exit(-1);
}
}
for (uint32_t i = 0; i < size; ++i) bits.set(i, 0);
for (uint32_t i = 0; i < size; ++i) {
if (bits[i] != 0) {
fprintf(stderr, "wrong bits %d at %d expected %d\n", bits[i], i, 0);
exit(-1);
}
}
for (uint32_t i = 0; i < size; ++i) bits.set(i, i % 4);
for (uint32_t i = 0; i < size; ++i) {
if (bits[i] != i % 4) {
fprintf(stderr, "wrong bits %d at %d expected %d\n", bits[i], i, i % 4);
exit(-1);
}
}
dynamic_2bitset size_corner1(1);
if (size_corner1.size() != 1) exit(-1);
dynamic_2bitset size_corner2(2);
if (size_corner2.size() != 2) exit(-1);
(dynamic_2bitset(4, true)).swap(size_corner2);
if (size_corner2.size() != 4) exit(-1);
for (uint32_t i = 0; i < size_corner2.size(); ++i) {
if (size_corner2[i] != 3) exit(-1);
}
size_corner2.clear();
if (size_corner2.size() != 0) exit(-1);
dynamic_2bitset empty;
empty.clear();
dynamic_2bitset large(1000, true);
empty.swap(large);
if (nextpoweroftwo(3) != 4) exit(-1);
}

229
deps/cmph/cxxmph/mph_index.cc vendored Normal file

@@ -0,0 +1,229 @@
#include <limits>
#include <iostream>
#include <vector>
using std::cerr;
using std::endl;
#include "mph_index.h"
using std::vector;
namespace {
static const uint8_t kUnassigned = 3;
// Lookup table indexed by an 8-bit value holding four 2-bit entries: gives the
// number of entries that are assigned (i.e. not equal to kUnassigned).
static uint8_t kBdzLookupIndex[] =
{
4, 4, 4, 3, 4, 4, 4, 3, 4, 4, 4, 3, 3, 3, 3, 2,
4, 4, 4, 3, 4, 4, 4, 3, 4, 4, 4, 3, 3, 3, 3, 2,
4, 4, 4, 3, 4, 4, 4, 3, 4, 4, 4, 3, 3, 3, 3, 2,
3, 3, 3, 2, 3, 3, 3, 2, 3, 3, 3, 2, 2, 2, 2, 1,
4, 4, 4, 3, 4, 4, 4, 3, 4, 4, 4, 3, 3, 3, 3, 2,
4, 4, 4, 3, 4, 4, 4, 3, 4, 4, 4, 3, 3, 3, 3, 2,
4, 4, 4, 3, 4, 4, 4, 3, 4, 4, 4, 3, 3, 3, 3, 2,
3, 3, 3, 2, 3, 3, 3, 2, 3, 3, 3, 2, 2, 2, 2, 1,
4, 4, 4, 3, 4, 4, 4, 3, 4, 4, 4, 3, 3, 3, 3, 2,
4, 4, 4, 3, 4, 4, 4, 3, 4, 4, 4, 3, 3, 3, 3, 2,
4, 4, 4, 3, 4, 4, 4, 3, 4, 4, 4, 3, 3, 3, 3, 2,
3, 3, 3, 2, 3, 3, 3, 2, 3, 3, 3, 2, 2, 2, 2, 1,
3, 3, 3, 2, 3, 3, 3, 2, 3, 3, 3, 2, 2, 2, 2, 1,
3, 3, 3, 2, 3, 3, 3, 2, 3, 3, 3, 2, 2, 2, 2, 1,
3, 3, 3, 2, 3, 3, 3, 2, 3, 3, 3, 2, 2, 2, 2, 1,
2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 1, 1, 1, 1, 0
};
} // anonymous namespace
namespace cxxmph {
MPHIndex::~MPHIndex() {
clear();
}
void MPHIndex::clear() {
std::vector<uint32_t> empty_ranktable;
ranktable_.swap(empty_ranktable);
dynamic_2bitset empty_g;
g_.swap(empty_g);
}
bool MPHIndex::GenerateQueue(
TriGraph* graph, vector<uint32_t>* queue_output) {
uint32_t queue_head = 0, queue_tail = 0;
uint32_t nedges = m_;
uint32_t nvertices = n_;
// Relies on vector<bool> using 1 bit per element
vector<bool> marked_edge(nedges + 1, false);
vector<uint32_t> queue(nvertices, 0);
for (uint32_t i = 0; i < nedges; ++i) {
const TriGraph::Edge& e = graph->edges()[i];
if (graph->vertex_degree()[e[0]] == 1 ||
graph->vertex_degree()[e[1]] == 1 ||
graph->vertex_degree()[e[2]] == 1) {
if (!marked_edge[i]) {
queue[queue_head++] = i;
marked_edge[i] = true;
}
}
}
/*
for (unsigned int i = 0; i < marked_edge.size(); ++i) {
cerr << "vertex with degree " << static_cast<uint32_t>(graph->vertex_degree()[i]) << " marked " << marked_edge[i] << endl;
}
for (unsigned int i = 0; i < queue.size(); ++i) {
cerr << "vertex " << i << " queued at " << queue[i] << endl;
}
*/
// At this point queue head is the number of edges touching at least one
// vertex of degree 1.
// cerr << "Queue head " << queue_head << " Queue tail " << queue_tail << endl;
// graph->DebugGraph();
while (queue_tail != queue_head) {
uint32_t current_edge = queue[queue_tail++];
graph->RemoveEdge(current_edge);
const TriGraph::Edge& e = graph->edges()[current_edge];
for (int i = 0; i < 3; ++i) {
uint32_t v = e[i];
if (graph->vertex_degree()[v] == 1) {
uint32_t first_edge = graph->first_edge()[v];
if (!marked_edge[first_edge]) {
queue[queue_head++] = first_edge;
marked_edge[first_edge] = true;
}
}
}
}
/*
for (unsigned int i = 0; i < queue.size(); ++i) {
cerr << "vertex " << i << " queued at " << queue[i] << endl;
}
*/
int cycles = queue_head - nedges;
if (cycles == 0) queue.swap(*queue_output);
return cycles == 0;
}
void MPHIndex::Assigning(
const vector<TriGraph::Edge>& edges, const vector<uint32_t>& queue) {
uint32_t current_edge = 0;
vector<bool> marked_vertices(n_ + 1);
dynamic_2bitset(8, true).swap(g_);
// Initialize vector of half nibbles with all bits set.
dynamic_2bitset g(n_, true /* set bits to 1 */);
uint32_t nedges = m_; // for legibility
for (int i = nedges - 1; i + 1 >= 1; --i) {
current_edge = queue[i];
const TriGraph::Edge& e = edges[current_edge];
/*
cerr << "B: " << e[0] << " " << e[1] << " " << e[2] << " -> "
<< get_2bit_value(g_, e[0]) << " "
<< get_2bit_value(g_, e[1]) << " "
<< get_2bit_value(g_, e[2]) << " edge " << current_edge << endl;
*/
if (!marked_vertices[e[0]]) {
if (!marked_vertices[e[1]]) {
g.set(e[1], kUnassigned);
marked_vertices[e[1]] = true;
}
if (!marked_vertices[e[2]]) {
g.set(e[2], kUnassigned);
assert(marked_vertices.size() > e[2]);
marked_vertices[e[2]] = true;
}
g.set(e[0], (6 - (g[e[1]] + g[e[2]])) % 3);
marked_vertices[e[0]] = true;
} else if (!marked_vertices[e[1]]) {
if (!marked_vertices[e[2]]) {
g.set(e[2], kUnassigned);
marked_vertices[e[2]] = true;
}
g.set(e[1], (7 - (g[e[0]] + g[e[2]])) % 3);
marked_vertices[e[1]] = true;
} else {
g.set(e[2], (8 - (g[e[0]] + g[e[1]])) % 3);
marked_vertices[e[2]] = true;
}
/*
cerr << "A: " << e[0] << " " << e[1] << " " << e[2] << " -> "
<< static_cast<uint32_t>(g[e[0]]) << " "
<< static_cast<uint32_t>(g[e[1]]) << " "
<< static_cast<uint32_t>(g[e[2]]) << " " << endl;
*/
}
g_.swap(g);
}
void MPHIndex::Ranking() {
uint32_t nbytes_total = static_cast<uint32_t>(ceil(n_ / 4.0));
uint32_t size = k_ >> 2U;
uint32_t ranktable_size = static_cast<uint32_t>(
ceil(n_ / static_cast<double>(k_)));
vector<uint32_t> ranktable(ranktable_size);
uint32_t offset = 0;
uint32_t count = 0;
uint32_t i = 1;
while (1) {
if (i == ranktable.size()) break;
uint32_t nbytes = size < nbytes_total ? size : nbytes_total;
for (uint32_t j = 0; j < nbytes; ++j) {
count += kBdzLookupIndex[g_.data()[offset + j]];
}
ranktable[i] = count;
offset += nbytes;
nbytes_total -= size;
++i;
}
ranktable_.swap(ranktable);
}
uint32_t MPHIndex::Rank(uint32_t vertex) const {
if (ranktable_.empty()) return 0;
uint32_t index = vertex >> b_;
uint32_t base_rank = ranktable_[index];
uint32_t beg_idx_v = index << b_;
uint32_t beg_idx_b = beg_idx_v >> 2;
uint32_t end_idx_b = vertex >> 2;
while (beg_idx_b < end_idx_b) {
assert(g_.data().size() > beg_idx_b);
base_rank += kBdzLookupIndex[g_.data()[beg_idx_b++]];
}
beg_idx_v = beg_idx_b << 2;
/*
cerr << "beg_idx_v: " << beg_idx_v << endl;
cerr << "base rank: " << base_rank << endl;
cerr << "G: ";
for (unsigned int i = 0; i < n_; ++i) {
cerr << static_cast<uint32_t>(g_[i]) << " ";
}
cerr << endl;
*/
while (beg_idx_v < vertex) {
if (g_[beg_idx_v] != kUnassigned) ++base_rank;
++beg_idx_v;
}
// cerr << "Base rank: " << base_rank << endl;
return base_rank;
}
void MPHIndex::swap(std::vector<uint32_t>& params, dynamic_2bitset& g, std::vector<uint32_t>& ranktable) {
params.resize(12);
uint32_t rounded_c = c_ * 1000 * 1000;
std::swap(params[0], rounded_c);
c_ = static_cast<double>(rounded_c) / 1000 / 1000;
std::swap(params[1], m_);
std::swap(params[2], n_);
std::swap(params[3], k_);
uint32_t uint32_square = static_cast<uint32_t>(square_);
std::swap(params[4], uint32_square);
square_ = uint32_square;
std::swap(params[5], hash_seed_[0]);
std::swap(params[6], hash_seed_[1]);
std::swap(params[7], hash_seed_[2]);
g.swap(g_);
ranktable.swap(ranktable_);
}
} // namespace cxxmph
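
The kBdzLookupIndex table and the byte-wise loop in Rank() can be cross-checked with a small standalone sketch. The names below (assigned_in_byte, build_lookup, rank_before) are hypothetical. The sketch assumes the 2-bit layout used by dynamic_2bitset and leaves out the sampled ranktable, so it always counts from vertex 0.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// Number of 2-bit fields in `byte` that differ from 3 (the unassigned marker).
inline uint8_t assigned_in_byte(uint8_t byte) {
  uint8_t count = 0;
  for (int slot = 0; slot < 4; ++slot) {
    if (((byte >> (slot * 2)) & 3) != 3) ++count;
  }
  return count;
}

// Rebuild the 256-entry lookup table programmatically.
inline std::array<uint8_t, 256> build_lookup() {
  std::array<uint8_t, 256> table{};
  for (int b = 0; b < 256; ++b) {
    table[b] = assigned_in_byte(static_cast<uint8_t>(b));
  }
  return table;
}

// Count assigned entries strictly before `vertex`: whole bytes via the table,
// then the remaining entries of the last partial byte one by one.
inline uint32_t rank_before(const std::vector<uint8_t>& g, uint32_t vertex) {
  static const std::array<uint8_t, 256> table = build_lookup();
  uint32_t rank = 0;
  const uint32_t full_bytes = vertex >> 2;
  for (uint32_t b = 0; b < full_bytes; ++b) rank += table[g[b]];
  for (uint32_t v = full_bytes << 2; v < vertex; ++v) {
    if (((g[v >> 2] >> ((v & 3) << 1)) & 3) != 3) ++rank;
  }
  return rank;
}
```

MPHIndex::Rank() starts from ranktable_[vertex >> b_] rather than zero, which bounds the byte loop to one block of k_ vertices. The counting itself follows the same two phases.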

270
deps/cmph/cxxmph/mph_index.h vendored Normal file

@@ -0,0 +1,270 @@
#ifndef __CXXMPH_MPH_INDEX_H__
#define __CXXMPH_MPH_INDEX_H__
// Minimal perfect hash abstraction implementing the BDZ algorithm
//
// This is a data structure that given a set of known keys S, will create a
// mapping from S to [0..|S|). The class is informed about S through the Reset
// method and the mapping is queried by calling index(key).
//
// This is a pretty uncommon data structure, and if your application has a real
// use case for it, chances are that it is a real win. If all you are doing is
// a straightforward implementation of an in-memory associative mapping data
// structure, then it will probably be slower. Take a look at mph_map.h
// instead.
//
// Thesis presenting this and similar algorithms:
// http://homepages.dcc.ufmg.br/~fbotelho/en/talks/thesis2008/thesis.pdf
//
// Notes:
//
// Most users can use the SimpleMPHIndex wrapper instead of MPHIndex, which
// has confusing template parameters.
// This class only implements a minimal perfect hash function, it does not
// implement an associative mapping data structure.
#include <stdint.h>
#include <cassert>
#include <climits>
#include <cmath>
#include <unordered_map> // for std::hash
#include <vector>
#include <iostream>
using std::cerr;
using std::endl;
#include "seeded_hash.h"
#include "mph_bits.h"
#include "trigraph.h"
namespace cxxmph {
class MPHIndex {
public:
MPHIndex(bool square = false, double c = 1.23, uint8_t b = 7) :
c_(c), b_(b), m_(0), n_(0), k_(0), square_(square), r_(1), g_(8, true) {
nest_displacement_[0] = 0;
nest_displacement_[1] = r_;
nest_displacement_[2] = (r_ << 1);
}
~MPHIndex();
template <class SeededHashFcn, class ForwardIterator>
bool Reset(ForwardIterator begin, ForwardIterator end, uint32_t size);
template <class SeededHashFcn, class Key> // must agree with Reset
// Get a unique identifier for x, in the range [0, size()). If x wasn't part
// of the input in the last Reset call, returns a random value.
uint32_t index(const Key& x) const;
uint32_t size() const { return m_; }
void clear();
// Functions for advanced users. Please avoid unless you know what you are doing.
uint32_t perfect_hash_size() const { return n_; }
template <class SeededHashFcn, class Key> // must agree with Reset
uint32_t perfect_hash(const Key& x) const; // way faster than the minimal
template <class SeededHashFcn, class Key> // must agree with Reset
uint32_t perfect_square(const Key& x) const; // even faster but needs square=true
uint32_t minimal_perfect_hash_size() const { return size(); }
template <class SeededHashFcn, class Key> // must agree with Reset
uint32_t minimal_perfect_hash(const Key& x) const;
// Experimental api to use as a serialization building block.
// Since this signature exposes some implementation details, expect it to
// change.
void swap(std::vector<uint32_t>& params, dynamic_2bitset& g, std::vector<uint32_t>& ranktable);
private:
template <class SeededHashFcn, class ForwardIterator>
bool Mapping(ForwardIterator begin, ForwardIterator end,
std::vector<TriGraph::Edge>* edges,
std::vector<uint32_t>* queue);
bool GenerateQueue(TriGraph* graph, std::vector<uint32_t>* queue);
void Assigning(const std::vector<TriGraph::Edge>& edges,
const std::vector<uint32_t>& queue);
void Ranking();
uint32_t Rank(uint32_t vertex) const;
// Algorithm parameters
// Perfect hash function density. If this was a 2graph,
// then probability of having an acyclic graph would be
// sqrt(1-(2/c)^2). See section 3 for details.
// http://www.it-c.dk/people/pagh/papers/simpleperf.pdf
double c_;
uint8_t b_; // Number of bits of the kth index in the ranktable
// Values used during generation
uint32_t m_; // edges count
uint32_t n_; // vertex count
uint32_t k_; // kth index in ranktable, $k = log_2(n=3r)\varepsilon$
bool square_; // make bit vector size a power of 2
// Values used during search
// Partition vertex count, derived from c parameter.
uint32_t r_;
uint32_t nest_displacement_[3]; // derived from r_
// The array containing the minimal perfect hash function graph.
dynamic_2bitset g_;
uint8_t threebit_mod3[10]; // speed up mod3 calculation for 3bit ints
// The table used for the rank step of the minimal perfect hash function
std::vector<uint32_t> ranktable_;
// The selected hash seed triplet for finding the edges in the minimal
// perfect hash function graph.
uint32_t hash_seed_[3];
};
// Template method needs to go in the header file.
template <class SeededHashFcn, class ForwardIterator>
bool MPHIndex::Reset(
ForwardIterator begin, ForwardIterator end, uint32_t size) {
if (end == begin) {
clear();
return true;
}
m_ = size;
r_ = static_cast<uint32_t>(ceil((c_*m_)/3));
if ((r_ % 2) == 0) r_ += 1;
// This can be used to speed mods, but increases occupation too much.
// Needs to try http://gmplib.org/manual/Integer-Exponentiation.html instead
if (square_) r_ = nextpoweroftwo(r_);
nest_displacement_[0] = 0;
nest_displacement_[1] = r_;
nest_displacement_[2] = (r_ << 1);
for (uint32_t i = 0; i < sizeof(threebit_mod3); ++i) threebit_mod3[i] = i % 3;
n_ = 3*r_;
k_ = 1U << b_;
// cerr << "m " << m_ << " n " << n_ << " r " << r_ << endl;
int iterations = 1000;
std::vector<TriGraph::Edge> edges;
std::vector<uint32_t> queue;
while (1) {
// cerr << "Iterations missing: " << iterations << endl;
for (int i = 0; i < 3; ++i) hash_seed_[i] = random();
if (Mapping<SeededHashFcn>(begin, end, &edges, &queue)) break;
else --iterations;
if (iterations == 0) break;
}
if (iterations == 0) return false;
Assigning(edges, queue);
std::vector<TriGraph::Edge>().swap(edges);
Ranking();
return true;
}
template <class SeededHashFcn, class ForwardIterator>
bool MPHIndex::Mapping(
ForwardIterator begin, ForwardIterator end,
std::vector<TriGraph::Edge>* edges, std::vector<uint32_t>* queue) {
TriGraph graph(n_, m_);
for (ForwardIterator it = begin; it != end; ++it) {
h128 h = SeededHashFcn().hash128(*it, hash_seed_[0]);
// for (int i = 0; i < 3; ++i) h[i] = SeededHashFcn()(*it, hash_seed_[i]);
uint32_t v0 = h[0] % r_;
uint32_t v1 = h[1] % r_ + r_;
uint32_t v2 = h[2] % r_ + (r_ << 1);
// cerr << "Key: " << *it << " edge " << it - begin << " (" << v0 << "," << v1 << "," << v2 << ")" << endl;
graph.AddEdge(TriGraph::Edge(v0, v1, v2));
}
if (GenerateQueue(&graph, queue)) {
graph.ExtractEdgesAndClear(edges);
return true;
}
return false;
}
template <class SeededHashFcn, class Key>
uint32_t MPHIndex::perfect_square(const Key& key) const {
h128 h = SeededHashFcn().hash128(key, hash_seed_[0]);
h[0] = (h[0] & (r_-1)) + nest_displacement_[0];
h[1] = (h[1] & (r_-1)) + nest_displacement_[1];
h[2] = (h[2] & (r_-1)) + nest_displacement_[2];
assert((h[0]) < g_.size());
assert((h[1]) < g_.size());
assert((h[2]) < g_.size());
uint8_t nest = threebit_mod3[g_[h[0]] + g_[h[1]] + g_[h[2]]];
uint32_t vertex = h[nest];
return vertex;
}
template <class SeededHashFcn, class Key>
uint32_t MPHIndex::perfect_hash(const Key& key) const {
if (!g_.size()) return 0;
h128 h = SeededHashFcn().hash128(key, hash_seed_[0]);
h[0] = (h[0] % r_) + nest_displacement_[0];
h[1] = (h[1] % r_) + nest_displacement_[1];
h[2] = (h[2] % r_) + nest_displacement_[2];
assert((h[0]) < g_.size());
assert((h[1]) < g_.size());
assert((h[2]) < g_.size());
uint8_t nest = threebit_mod3[g_[h[0]] + g_[h[1]] + g_[h[2]]];
uint32_t vertex = h[nest];
return vertex;
}
template <class SeededHashFcn, class Key>
uint32_t MPHIndex::minimal_perfect_hash(const Key& key) const {
return Rank(perfect_hash<SeededHashFcn, Key>(key));
}
template <class SeededHashFcn, class Key>
uint32_t MPHIndex::index(const Key& key) const {
return minimal_perfect_hash<SeededHashFcn, Key>(key);
}
// Simple wrapper around MPHIndex to simplify calling code. Please refer to the
// MPHIndex class for documentation.
template <class Key, class HashFcn = typename seeded_hash<std::hash<Key>>::hash_function>
class SimpleMPHIndex : public MPHIndex {
public:
SimpleMPHIndex(bool advanced_usage = false) : MPHIndex(advanced_usage) {}
template <class ForwardIterator>
bool Reset(ForwardIterator begin, ForwardIterator end, uint32_t size) {
return MPHIndex::Reset<HashFcn>(begin, end, size);
}
uint32_t index(const Key& key) const { return MPHIndex::index<HashFcn>(key); }
};
// The parameters minimal and square trade memory usage for evaluation speed.
// Minimal decreases speed and memory usage, and square does the opposite.
// Using minimal=true and square=false is the same as SimpleMPHIndex.
template <bool minimal, bool square, class Key, class HashFcn>
struct FlexibleMPHIndex {};
template <class Key, class HashFcn>
struct FlexibleMPHIndex<true, false, Key, HashFcn>
: public SimpleMPHIndex<Key, HashFcn> {
FlexibleMPHIndex() : SimpleMPHIndex<Key, HashFcn>(false) {}
uint32_t index(const Key& key) const {
return MPHIndex::minimal_perfect_hash<HashFcn>(key); }
uint32_t size() const { return MPHIndex::minimal_perfect_hash_size(); }
};
template <class Key, class HashFcn>
struct FlexibleMPHIndex<false, true, Key, HashFcn>
: public SimpleMPHIndex<Key, HashFcn> {
FlexibleMPHIndex() : SimpleMPHIndex<Key, HashFcn>(true) {}
uint32_t index(const Key& key) const {
return MPHIndex::perfect_square<HashFcn>(key); }
uint32_t size() const { return MPHIndex::perfect_hash_size(); }
};
template <class Key, class HashFcn>
struct FlexibleMPHIndex<false, false, Key, HashFcn>
: public SimpleMPHIndex<Key, HashFcn> {
FlexibleMPHIndex() : SimpleMPHIndex<Key, HashFcn>(false) {}
uint32_t index(const Key& key) const {
return MPHIndex::perfect_hash<HashFcn>(key); }
uint32_t size() const { return MPHIndex::perfect_hash_size(); }
};
// From a trade-off perspective this case does not make much sense.
// template <class Key, class HashFcn>
// class FlexibleMPHIndex<true, true, Key, HashFcn>
} // namespace cxxmph
#endif // __CXXMPH_MPH_INDEX_H__
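
The sizing arithmetic at the top of MPHIndex::Reset can be checked in isolation. The sketch below is illustrative only: BdzSizes, next_pow2 and bdz_sizes are hypothetical names that restate the computation of r_ and n_ from the key count m, the density c and the square flag.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical result type: partition size r and total vertex count n = 3r.
struct BdzSizes {
  uint32_t r;
  uint32_t n;
};

// Smallest power of two >= k (1 for k == 0), same approach as nextpoweroftwo.
inline uint32_t next_pow2(uint32_t k) {
  if (k == 0) return 1;
  --k;
  for (uint32_t i = 1; i < 32; i <<= 1) k |= k >> i;
  return k + 1;
}

// Mirrors the sizing steps in Reset: r = ceil(c * m / 3), forced odd,
// optionally rounded up to a power of two, and n = 3 * r.
inline BdzSizes bdz_sizes(uint32_t m, double c = 1.23, bool square = false) {
  uint32_t r = static_cast<uint32_t>(std::ceil((c * m) / 3));
  if (r % 2 == 0) r += 1;
  if (square) r = next_pow2(r);
  return BdzSizes{r, 3 * r};
}
```

For nine keys with the default c = 1.23 this gives r = 5 and n = 15, so perfect_hash() maps into 15 slots and the rank step compresses that to the 9 values of the minimal hash. With square = true the same input yields r = 8 and n = 24, which allows masking with r - 1 in perfect_square() at the cost of lower occupation.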

53
deps/cmph/cxxmph/mph_index_test.cc vendored Normal file

@@ -0,0 +1,53 @@
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>
#include "mph_index.h"
using std::string;
using std::vector;
using namespace cxxmph;
int main(int argc, char** argv) {
srandom(1); // Reset() draws its hash seeds with random(), which srandom() seeds
vector<string> keys;
keys.push_back("davi");
keys.push_back("paulo");
keys.push_back("joao");
keys.push_back("maria");
keys.push_back("bruno");
keys.push_back("paula");
keys.push_back("diego");
keys.push_back("diogo");
keys.push_back("algume");
SimpleMPHIndex<string> mph_index;
if (!mph_index.Reset(keys.begin(), keys.end(), keys.size())) { exit(-1); }
vector<int> ids;
for (vector<int>::size_type i = 0; i < keys.size(); ++i) {
ids.push_back(mph_index.index(keys[i]));
cerr << " " << *(ids.end() - 1);
}
cerr << endl;
sort(ids.begin(), ids.end());
for (vector<int>::size_type i = 0; i < ids.size(); ++i) assert(ids[i] == static_cast<vector<int>::value_type>(i));
// Test serialization
vector<uint32_t> params;
dynamic_2bitset g;
vector<uint32_t> ranktable;
mph_index.swap(params, g, ranktable);
assert(mph_index.size() == 0);
mph_index.swap(params, g, ranktable);
assert(mph_index.size() == ids.size());
for (vector<int>::size_type i = 0; i < ids.size(); ++i) assert(ids[i] == static_cast<vector<int>::value_type>(i));
FlexibleMPHIndex<false, true, int64_t, seeded_hash<std::hash<int64_t>>::hash_function> square_empty;
auto id = square_empty.index(1);
FlexibleMPHIndex<false, false, int64_t, seeded_hash<std::hash<int64_t>>::hash_function> unordered_empty;
id ^= unordered_empty.index(1);
FlexibleMPHIndex<true, false, int64_t, seeded_hash<std::hash<int64_t>>::hash_function> minimal_empty;
id ^= minimal_empty.index(1);
}
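
The property this test asserts, that the indices of the input keys form a permutation of 0..n-1, can be factored into a small checker. This is a sketch with a hypothetical helper name (is_minimal_perfect). It uses the same sort-and-compare idea as the test above and does not depend on cxxmph.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// True when ids is a permutation of 0..ids.size()-1, i.e. the hash that
// produced them is both collision-free and minimal on the key set.
inline bool is_minimal_perfect(std::vector<uint32_t> ids) {
  std::sort(ids.begin(), ids.end());
  for (std::size_t i = 0; i < ids.size(); ++i) {
    if (ids[i] != i) return false;
  }
  return true;
}
```

A duplicate index (a collision) or an index outside the range (a non-minimal mapping) both make the sorted sequence diverge from 0, 1, 2, ... at some position, so a single pass detects either failure.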

272
deps/cmph/cxxmph/mph_map.h vendored Normal file

@@ -0,0 +1,272 @@
#ifndef __CXXMPH_MPH_MAP_H__
#define __CXXMPH_MPH_MAP_H__
// Implementation of the unordered associative mapping interface using a
// minimal perfect hash function.
//
// Since these are header-mostly libraries, make sure you compile your code
// with -DNDEBUG and -O3. The code requires a modern C++11 compiler.
//
// The container comes in 3 flavors, all in the cxxmph namespace and drop-in
// replacements for the popular classes with the same names.
// * dense_hash_map
// -> fast, uses more memory, 2.93 bits per bucket, ~50% occupation
// * unordered_map (aliases: hash_map, mph_map)
// -> middle ground, uses 2.93 bits per bucket, ~81% occupation
// * sparse_hash_map
// -> slower, uses 3.6 bits per bucket, 100% occupation
//
// These classes are not necessarily faster than their existing counterparts.
// Benchmark your code before using them. The larger the key, the larger the
// number of elements inserted, and the bigger the number of failed searches,
// the more likely these classes will outperform existing code.
//
// For large sets of URLs (>100k), which are somewhat expensive to compare, I
// found these classes to be about 10%-50% faster than unordered_map.
#include <algorithm>
#include <iostream>
#include <limits>
#include <unordered_map>
#include <unordered_set>
#include <vector>
#include <utility> // for std::pair
#include "string_util.h"
#include "hollow_iterator.h"
#include "mph_bits.h"
#include "mph_index.h"
#include "seeded_hash.h"
namespace cxxmph {
using std::pair;
using std::make_pair;
using std::vector;
// Save on repetitive typing.
#define MPH_MAP_TMPL_SPEC \
template <bool minimal, bool square, \
class Key, class Data, class HashFcn, class EqualKey, class Alloc>
#define MPH_MAP_CLASS_SPEC mph_map_base<minimal, square, Key, Data, HashFcn, EqualKey, Alloc>
#define MPH_MAP_METHOD_DECL(r, m) MPH_MAP_TMPL_SPEC typename MPH_MAP_CLASS_SPEC::r MPH_MAP_CLASS_SPEC::m
#define MPH_MAP_INLINE_METHOD_DECL(r, m) MPH_MAP_TMPL_SPEC inline typename MPH_MAP_CLASS_SPEC::r MPH_MAP_CLASS_SPEC::m
template <bool minimal, bool square, class Key, class Data, class HashFcn = std::hash<Key>, class EqualKey = std::equal_to<Key>, class Alloc = std::allocator<Data> >
class mph_map_base {
public:
typedef Key key_type;
typedef Data data_type;
typedef pair<Key, Data> value_type;
typedef HashFcn hasher;
typedef EqualKey key_equal;
typedef typename vector<value_type>::pointer pointer;
typedef typename vector<value_type>::reference reference;
typedef typename vector<value_type>::const_reference const_reference;
typedef typename vector<value_type>::size_type size_type;
typedef typename vector<value_type>::difference_type difference_type;
typedef is_empty<const vector<value_type>> is_empty_type;
typedef hollow_iterator_base<typename vector<value_type>::iterator, is_empty_type> iterator;
typedef hollow_iterator_base<typename vector<value_type>::const_iterator, is_empty_type> const_iterator;
// For making macros simpler.
typedef void void_type;
typedef bool bool_type;
typedef pair<iterator, bool> insert_return_type;
mph_map_base();
~mph_map_base();
iterator begin();
iterator end();
const_iterator begin() const;
const_iterator end() const;
size_type size() const;
bool empty() const;
void clear();
void erase(iterator pos);
void erase(const key_type& k);
pair<iterator, bool> insert(const value_type& x);
inline iterator find(const key_type& k);
inline const_iterator find(const key_type& k) const;
typedef int32_t my_int32_t; // help macros
inline int32_t index(const key_type& k) const;
data_type& operator[](const key_type &k);
const data_type& operator[](const key_type &k) const;
size_type bucket_count() const { return index_.size() + slack_.bucket_count(); }
void rehash(size_type nbuckets /*ignored*/);
protected: // mimicking STL implementation
EqualKey equal_;
private:
template <typename iterator>
struct iterator_first : public iterator {
iterator_first(iterator it) : iterator(it) { }
const typename iterator::value_type::first_type& operator*() {
return this->iterator::operator*().first;
}
};
template <typename iterator>
iterator_first<iterator> make_iterator_first(iterator it) {
return iterator_first<iterator>(it);
}
void pack();
vector<value_type> values_;
vector<bool> present_;
FlexibleMPHIndex<minimal, square, Key, typename seeded_hash<HashFcn>::hash_function> index_;
// TODO(davi) optimize slack to use hash from index rather than calculate its own
typedef std::unordered_map<h128, uint32_t, h128::hash32> slack_type;
slack_type slack_;
size_type size_;
typename seeded_hash<HashFcn>::hash_function hasher128_;
};
MPH_MAP_TMPL_SPEC
bool operator==(const MPH_MAP_CLASS_SPEC& lhs, const MPH_MAP_CLASS_SPEC& rhs) {
return lhs.size() == rhs.size() && std::equal(lhs.begin(), lhs.end(), rhs.begin());
}
MPH_MAP_TMPL_SPEC MPH_MAP_CLASS_SPEC::mph_map_base() : size_(0) {
clear();
pack();
}
MPH_MAP_TMPL_SPEC MPH_MAP_CLASS_SPEC::~mph_map_base() { }
MPH_MAP_METHOD_DECL(insert_return_type, insert)(const value_type& x) {
auto it = find(x.first);
auto it_end = end();
if (it != it_end) return make_pair(it, false);
bool should_pack = false;
if (values_.capacity() == values_.size() && values_.size() > 256) {
should_pack = true;
}
values_.push_back(x);
present_.push_back(true);
++size_;
h128 h = hasher128_.hash128(x.first, 0);
if (slack_.find(h) != slack_.end()) should_pack = true; // unavoidable pack
else slack_.insert(std::make_pair(h, values_.size() - 1));
if (should_pack) pack();
it = find(x.first);
return make_pair(it, true);
}
MPH_MAP_METHOD_DECL(void_type, pack)() {
// CXXMPH_DEBUGLN("Packing %v values")(values_.size());
if (values_.empty()) return;
assert(std::unordered_set<key_type>(make_iterator_first(begin()), make_iterator_first(end())).size() == size());
bool success = index_.Reset(
make_iterator_first(begin()),
make_iterator_first(end()), size_);
if (!success) { exit(-1); }
vector<value_type> new_values(index_.size());
new_values.reserve(new_values.size() * 2);
vector<bool> new_present(index_.size(), false);
new_present.reserve(new_present.size() * 2);
for (iterator it = begin(), it_end = end(); it != it_end; ++it) {
size_type id = index_.index(it->first);
assert(id < index_.size());
assert(id < new_values.size());
new_values[id] = *it;
new_present[id] = true;
}
// fprintf(stderr, "Collision ratio: %f\n", collisions*1.0/size());
values_.swap(new_values);
present_.swap(new_present);
slack_type().swap(slack_);
}
MPH_MAP_METHOD_DECL(iterator, begin)() { return make_hollow(&values_, &present_, values_.begin()); }
MPH_MAP_METHOD_DECL(iterator, end)() { return make_solid(&values_, &present_, values_.end()); }
MPH_MAP_METHOD_DECL(const_iterator, begin)() const { return make_hollow(&values_, &present_, values_.begin()); }
MPH_MAP_METHOD_DECL(const_iterator, end)() const { return make_solid(&values_, &present_, values_.end()); }
MPH_MAP_METHOD_DECL(bool_type, empty)() const { return size_ == 0; }
MPH_MAP_METHOD_DECL(size_type, size)() const { return size_; }
MPH_MAP_METHOD_DECL(void_type, clear)() {
values_.clear();
present_.clear();
slack_.clear();
index_.clear();
size_ = 0;
}
MPH_MAP_METHOD_DECL(void_type, erase)(iterator pos) {
assert(pos.it_ - values_.begin() < present_.size());
assert(present_[pos.it_ - values_.begin()]);
present_[pos.it_ - values_.begin()] = false;
*pos = value_type();
--size_;
}
MPH_MAP_METHOD_DECL(void_type, erase)(const key_type& k) {
iterator it = find(k);
if (it == end()) return;
erase(it);
}
MPH_MAP_INLINE_METHOD_DECL(const_iterator, find)(const key_type& k) const {
auto idx = index(k);
// Check the -1 sentinel before forming the iterator: begin() + (-1) is invalid.
if (idx == -1) return end();
typename vector<value_type>::const_iterator vit = values_.begin() + idx;
if (!equal_(vit->first, k)) return end();
return make_solid(&values_, &present_, vit);
}
MPH_MAP_INLINE_METHOD_DECL(iterator, find)(const key_type& k) {
auto idx = index(k);
// Check the -1 sentinel before forming the iterator: begin() + (-1) is invalid.
if (idx == -1) return end();
typename vector<value_type>::iterator vit = values_.begin() + idx;
if (!equal_(vit->first, k)) return end();
return make_solid(&values_, &present_, vit);
}
MPH_MAP_INLINE_METHOD_DECL(my_int32_t, index)(const key_type& k) const {
if (__builtin_expect(!slack_.empty(), 0)) {
auto sit = slack_.find(hasher128_.hash128(k, 0));
if (sit != slack_.end()) return sit->second;
}
if (__builtin_expect(index_.size(), 1)) {
auto id = index_.index(k);
if (__builtin_expect(present_[id], true)) return id;
}
return -1;
}
MPH_MAP_METHOD_DECL(data_type&, operator[])(const key_type& k) {
return insert(make_pair(k, data_type())).first->second;
}
MPH_MAP_METHOD_DECL(void_type, rehash)(size_type /*nbuckets*/) {
pack();
vector<value_type>(values_.begin(), values_.end()).swap(values_);
vector<bool>(present_.begin(), present_.end()).swap(present_);
slack_type().swap(slack_);
}
#define MPH_MAP_PREAMBLE template <class Key, class Data,\
class HashFcn = std::hash<Key>, class EqualKey = std::equal_to<Key>,\
class Alloc = std::allocator<Data> >
MPH_MAP_PREAMBLE class mph_map : public mph_map_base<
false, false, Key, Data, HashFcn, EqualKey, Alloc> {};
MPH_MAP_PREAMBLE class unordered_map : public mph_map_base<
false, false, Key, Data, HashFcn, EqualKey, Alloc> {};
MPH_MAP_PREAMBLE class hash_map : public mph_map_base<
false, false, Key, Data, HashFcn, EqualKey, Alloc> {};
MPH_MAP_PREAMBLE class dense_hash_map : public mph_map_base<
false, true, Key, Data, HashFcn, EqualKey, Alloc> {};
MPH_MAP_PREAMBLE class sparse_hash_map : public mph_map_base<
true, false, Key, Data, HashFcn, EqualKey, Alloc> {};
#undef MPH_MAP_TMPL_SPEC
#undef MPH_MAP_CLASS_SPEC
#undef MPH_MAP_METHOD_DECL
#undef MPH_MAP_INLINE_METHOD_DECL
#undef MPH_MAP_PREAMBLE
} // namespace cxxmph
#endif // __CXXMPH_MPH_MAP_H__
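
Note (illustrative, not part of the vendored file): mph_map_base keeps entries in values_ and marks live slots in present_, so erase() only clears a flag and iteration skips the holes. The sketch below shows that idea in isolation. HollowStore, push, erase_slot and live_keys are hypothetical names invented for this note, and the perfect-hash index and slack map are deliberately left out.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for the values_/present_ pair.
struct HollowStore {
  std::vector<std::pair<int, std::string>> values;
  std::vector<bool> present;

  // Append an entry and return the slot it occupies.
  std::size_t push(int key, const std::string& data) {
    values.emplace_back(key, data);
    present.push_back(true);
    return values.size() - 1;
  }

  // Erase keeps the slot but marks it empty, so other slot numbers stay valid.
  void erase_slot(std::size_t slot) {
    assert(slot < present.size() && present[slot]);
    present[slot] = false;
    values[slot] = std::pair<int, std::string>();
  }

  // Iteration skips empty slots, which is the job of the hollow iterator.
  std::vector<int> live_keys() const {
    std::vector<int> keys;
    for (std::size_t i = 0; i < values.size(); ++i) {
      if (present[i]) keys.push_back(values[i].first);
    }
    return keys;
  }
};
```

Because slots are never compacted on erase, an index computed earlier for another key remains usable until the next pack() or rehash().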

25
deps/cmph/cxxmph/mph_map_test.cc vendored Normal file

@@ -0,0 +1,25 @@
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <string>
#include "mph_map.h"
#include "map_tester.h"
#include "test.h"
using namespace cxxmph;
typedef MapTester<mph_map> Tester;
CXXMPH_CXX_TEST_CASE(empty_find, Tester::empty_find);
CXXMPH_CXX_TEST_CASE(empty_erase, Tester::empty_erase);
CXXMPH_CXX_TEST_CASE(small_insert, Tester::small_insert);
CXXMPH_CXX_TEST_CASE(large_insert, Tester::large_insert);
CXXMPH_CXX_TEST_CASE(small_search, Tester::small_search);
CXXMPH_CXX_TEST_CASE(default_search, Tester::default_search);
CXXMPH_CXX_TEST_CASE(large_search, Tester::large_search);
CXXMPH_CXX_TEST_CASE(string_search, Tester::string_search);
CXXMPH_CXX_TEST_CASE(rehash_zero, Tester::rehash_zero);
CXXMPH_CXX_TEST_CASE(rehash_size, Tester::rehash_size);
CXXMPH_CXX_TEST_CASE(erase_value, Tester::erase_value);
CXXMPH_CXX_TEST_CASE(erase_iterator, Tester::erase_iterator);

147
deps/cmph/cxxmph/seeded_hash.h vendored Normal file

@@ -0,0 +1,147 @@
#ifndef __CXXMPH_SEEDED_HASH_H__
#define __CXXMPH_SEEDED_HASH_H__
#include <stdint.h> // for uint32_t and friends
#include <cstdlib>
#include <cstring>
#include <unordered_map> // for std::hash
#include "MurmurHash3.h"
#include "stringpiece.h"
namespace cxxmph {
struct h128 {
const uint32_t& operator[](uint8_t i) const { return uint32[i]; }
uint32_t& operator[](uint8_t i) { return uint32[i]; }
uint64_t get64(bool second) const { return (static_cast<uint64_t>(uint32[second << 1]) << 32) | uint32[1 + (second << 1)]; }
void set64(uint64_t v, bool second) { uint32[second << 1] = v >> 32; uint32[1+(second<<1)] = ((v << 32) >> 32); }
bool operator==(const h128 rhs) const { return memcmp(uint32, rhs.uint32, sizeof(uint32)) == 0; }
uint32_t uint32[4];
struct hash32 { uint32_t operator()(const cxxmph::h128& h) const { return h[3]; } };
};
template <class HashFcn>
struct seeded_hash_function {
template <class Key>
uint32_t operator()(const Key& k, uint32_t seed) const {
uint32_t h;
uint32_t h0 = HashFcn()(k);
MurmurHash3_x86_32(reinterpret_cast<const void*>(&h0), 4, seed, &h);
return h;
}
template <class Key>
h128 hash128(const Key& k, uint32_t seed) const {
h128 h;
uint32_t h0 = HashFcn()(k);
MurmurHash3_x64_128(reinterpret_cast<const void*>(&h0), 4, seed, &h);
return h;
}
};
struct Murmur3 {
template<class Key>
uint32_t operator()(const Key& k) const {
uint32_t out;
MurmurHash3_x86_32(reinterpret_cast<const void*>(&k), sizeof(Key), 1 /* seed */, &out);
return out;
}
template <class Key>
h128 hash128(const Key& k) const {
h128 h;
MurmurHash3_x64_128(reinterpret_cast<const void*>(&k), sizeof(Key), 1 /* seed */, &h);
return h;
}
};
struct Murmur3StringPiece {
template <class Key>
uint32_t operator()(const Key& k) const {
StringPiece s(k);
uint32_t out;
MurmurHash3_x86_32(s.data(), s.length(), 1 /* seed */, &out);
return out;
}
template <class Key>
h128 hash128(const Key& k) const {
h128 h;
StringPiece s(k);
MurmurHash3_x64_128(s.data(), s.length(), 1 /* seed */, &h);
return h;
}
};
template <>
struct seeded_hash_function<Murmur3> {
template <class Key>
uint32_t operator()(const Key& k, uint32_t seed) const {
uint32_t out;
MurmurHash3_x86_32(reinterpret_cast<const void*>(&k), sizeof(Key), seed, &out);
return out;
}
template <class Key>
h128 hash128(const Key& k, uint32_t seed) const {
h128 h;
MurmurHash3_x64_128(reinterpret_cast<const void*>(&k), sizeof(Key), seed, &h);
return h;
}
};
template <>
struct seeded_hash_function<Murmur3StringPiece> {
template <class Key>
uint32_t operator()(const Key& k, uint32_t seed) const {
StringPiece s(k);
uint32_t out;
MurmurHash3_x86_32(s.data(), s.length(), seed, &out);
return out;
}
template <class Key>
h128 hash128(const Key& k, uint32_t seed) const {
h128 h;
StringPiece s(k);
MurmurHash3_x64_128(s.data(), s.length(), seed, &h);
return h;
}
};
template <class HashFcn> struct seeded_hash
{ typedef seeded_hash_function<HashFcn> hash_function; };
// Use Murmur3 instead for all types defined in std::hash, plus
// std::string which is commonly extended.
template <> struct seeded_hash<std::hash<char*> >
{ typedef seeded_hash_function<Murmur3StringPiece> hash_function; };
template <> struct seeded_hash<std::hash<const char*> >
{ typedef seeded_hash_function<Murmur3StringPiece> hash_function; };
template <> struct seeded_hash<std::hash<std::string> >
{ typedef seeded_hash_function<Murmur3StringPiece> hash_function; };
template <> struct seeded_hash<std::hash<cxxmph::StringPiece> >
{ typedef seeded_hash_function<Murmur3StringPiece> hash_function; };
template <> struct seeded_hash<std::hash<char> >
{ typedef seeded_hash_function<Murmur3> hash_function; };
template <> struct seeded_hash<std::hash<unsigned char> >
{ typedef seeded_hash_function<Murmur3> hash_function; };
template <> struct seeded_hash<std::hash<short> >
{ typedef seeded_hash_function<Murmur3> hash_function; };
template <> struct seeded_hash<std::hash<unsigned short> >
{ typedef seeded_hash_function<Murmur3> hash_function; };
template <> struct seeded_hash<std::hash<int> >
{ typedef seeded_hash_function<Murmur3> hash_function; };
template <> struct seeded_hash<std::hash<unsigned int> >
{ typedef seeded_hash_function<Murmur3> hash_function; };
template <> struct seeded_hash<std::hash<long> >
{ typedef seeded_hash_function<Murmur3> hash_function; };
template <> struct seeded_hash<std::hash<unsigned long> >
{ typedef seeded_hash_function<Murmur3> hash_function; };
template <> struct seeded_hash<std::hash<long long> >
{ typedef seeded_hash_function<Murmur3> hash_function; };
template <> struct seeded_hash<std::hash<unsigned long long> >
{ typedef seeded_hash_function<Murmur3> hash_function; };
} // namespace cxxmph
#endif // __CXXMPH_SEEDED_HASH_H__
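
Note (illustrative, not part of the vendored file): h128 stores a 128-bit hash as four 32-bit words, and get64/set64 view them as two 64-bit halves with the high word first. The sketch below restates that packing with explicit indices so the layout is easier to check. Hash128Sketch and its member name word are assumptions made for this note.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical re-statement of the h128 word layout.
struct Hash128Sketch {
  uint32_t word[4];

  // second == false selects words 0 and 1, second == true selects words 2 and 3.
  uint64_t get64(bool second) const {
    const int base = second ? 2 : 0;
    return (static_cast<uint64_t>(word[base]) << 32) | word[base + 1];
  }

  // The high 32 bits go to the first word of the pair, the low 32 bits to the next.
  void set64(uint64_t v, bool second) {
    const int base = second ? 2 : 0;
    word[base] = static_cast<uint32_t>(v >> 32);
    word[base + 1] = static_cast<uint32_t>(v & 0xffffffffu);
  }
};
```

Note that h128::hash32 in the file above returns word 3 only, which is the low half of the second 64-bit value in this layout.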

59
deps/cmph/cxxmph/seeded_hash_test.cc vendored Normal file

@@ -0,0 +1,59 @@
#include "seeded_hash.h"
#include <cstdio> // for fprintf
#include <unordered_map>
#include <string>
#include <iostream>
using std::cerr;
using std::endl;
using std::string;
using std::unordered_map;
using namespace cxxmph;
int main(int argc, char** argv) {
auto hasher = seeded_hash_function<Murmur3StringPiece>();
string key1("0");
string key2("1");
auto h1 = hasher.hash128(key1, 1);
auto h2 = hasher.hash128(key2, 1);
if (h1 == h2) {
fprintf(stderr, "unexpected murmur collision\n");
exit(-1);
}
unordered_map<uint64_t, int> g;
for (int i = 0; i < 1000; ++i) g[i] = i;
for (int i = 0; i < 1000; ++i) if (g[i] != i) exit(-1);
auto inthasher = seeded_hash_function<std::hash<uint64_t>>();
unordered_map<h128, uint64_t, h128::hash32> g2;
for (uint64_t i = 0; i < 1000; ++i) {
auto h = inthasher.hash128(i, 0);
if (g2.find(h) != g2.end()) {
std::cerr << "Incorrectly found " << i << std::endl;
exit(-1);
}
if (h128::hash32()(h) != h[3]) {
cerr << "Buggy hash method." << endl;
exit(-1);
}
auto h2 = inthasher.hash128(i, 0);
if (!(h == h2)) {
cerr << "h 64(0) " << h.get64(0) << " h 64(1) " << h.get64(1) << endl;
cerr << " h2 64(0) " << h2.get64(0) << " h2 64(1) " << h2.get64(1) << endl;
cerr << "Broken equality for h128" << endl;
exit(-1);
}
if (h128::hash32()(h) != h128::hash32()(h2)) {
cerr << "Inconsistent hash method." << endl;
exit(-1);
}
g2[h] = i;
if (g2.find(h) == g2.end()) {
std::cerr << "Incorrectly missed " << i << std::endl;
exit(-1);
}
}
for (uint64_t i = 0; i < 1000; ++i) if (g2[inthasher.hash128(i, 0)] != i) exit(-1);
}

23
deps/cmph/cxxmph/string_util.cc vendored Normal file

@@ -0,0 +1,23 @@
#include "string_util.h"
#include <cassert>
#include <cstdint>
#include <iostream>
#include <string>
using namespace std;
namespace cxxmph {
bool stream_printf(
const std::string& format_string, uint32_t offset, std::ostream* out) {
if (offset == format_string.length()) return true;
assert(offset < format_string.length());
cerr << "length:" << format_string.length() << endl;
cerr << "offset:" << offset << endl;
auto txt = format_string.substr(offset, format_string.length() - offset);
*out << txt;
return true;
}
} // namespace cxxmph

133
deps/cmph/cxxmph/string_util.h vendored Normal file

@@ -0,0 +1,133 @@
#ifndef __CXXMPH_STRING_UTIL_H__
#define __CXXMPH_STRING_UTIL_H__
// Helper functions for string formatting and terminal output. Should be used
// only for debugging and tests, since performance was not a concern.
// Implemented using variadic templates because it is cool.
//
// Adds the extra format %v to the printf formatting language. Uses
// cxxmph::tostr to implement custom printers and falls back to
// ostream::operator<< otherwise.
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <iostream>
#include <string>
#include <sstream>
#include <type_traits> // for std::is_pod
#include <utility>
#include <vector>
#define CXXMPH_DEBUGLN(fmt) variadic_print(__FILE__, __LINE__, &std::cerr, fmt)
#define CXXMPH_INFOLN(fmt) variadic_print(__FILE__, __LINE__, &std::cout, fmt)
namespace cxxmph {
using std::pair;
using std::string;
using std::ostream;
using std::vector;
template <class T> void tostr(ostream *out, const T& v) {
*out << v;
}
inline void tostr(std::ostream* out, uint8_t v) {
*out << static_cast<uint32_t>(v);
}
template <class V>
inline void tostr(ostream* out, const vector<V>& v) {
*out << "[";
for (uint32_t i = 0; i < v.size(); ++i) {
tostr(out, v[i]);
if (i != v.size() - 1)*out << " ";
}
*out << "]";
}
template <class F, class S>
inline void tostr(ostream* out, const pair<F, S>& v) {
*out << "(";
tostr(out, v.first);
*out << ",";
tostr(out, v.second);
*out << ")";
}
bool stream_printf(
const std::string& format_string, uint32_t offset, std::ostream* out);
template <bool ispod> struct pod_snprintf {};
template <> struct pod_snprintf<false> {
template <class T>
int operator()(char*, size_t, const char*, const T&) {
return -1;
}
};
template <> struct pod_snprintf<true> {
template <class T>
int operator()(char* str, size_t size, const char* format, const T& v) {
return snprintf(str, size, format, v);
}
};
template <typename T, typename... Args>
bool stream_printf(const std::string& format_string, uint32_t offset,
std::ostream* out, const T& value, Args&&... args) {
auto txt = format_string.c_str() + offset;
while (*txt) {
auto b = txt;
for (; *txt != '%'; ++txt);
if (*(txt + 1) == '%') ++txt;
else if (txt == b) break;
*out << string(b, txt - b);
if (*(txt - 1) == '%') ++txt;
}
auto fmt = txt + 1;
while (*fmt && *fmt != '%') ++fmt;
if (strncmp(txt, "%v", 2) == 0) {
txt += 2;
tostr(out, value);
if (txt != fmt) *out << string(txt, fmt);
} else {
char buf[256]; // Is this enough?
auto n = pod_snprintf<std::is_pod<T>::value>()(
buf, 256, std::string(txt, fmt).c_str(), value);
if (n < 0) return false;
*out << buf;
}
return stream_printf(format_string, fmt - format_string.c_str(), out,
std::forward<Args>(args)...);
}
template <typename... Args>
std::string format(const std::string& format_string, Args&&... args) {
std::ostringstream out;
if (!stream_printf(format_string, 0, &out, std::forward<Args>(args)...)) {
return std::string();
};
return out.str();
}
template <typename... Args>
void infoln(const std::string& format_string, Args&&... args) {
stream_printf(format_string + "\n", 0, &std::cout, std::forward<Args>(args)...);
}
struct variadic_print {
variadic_print(const std::string& file, uint32_t line, std::ostream* out,
const std::string& format_line)
: file_(file), line_(line), out_(out), format_line_(format_line) {}
template <typename... Args>
void operator()(Args&&... args) {
std::string fancy_format = "%v:%d: ";
fancy_format += format_line_ + "\n";
stream_printf(fancy_format, 0, out_, file_, line_, std::forward<Args>(args)...);
}
const std::string& file_;
const uint32_t& line_;
std::ostream* out_;
const std::string& format_line_;
};
} // namespace cxxmph
#endif // __CXXMPH_STRING_UTIL_H__
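
Note (illustrative, not part of the vendored file): stream_printf consumes one argument per recursion step and hands the rest of the format string to the next step. The sketch below shows only that recursion for a single %v placeholder, under the assumption that every argument can be written with operator<<. format_v and format_v_into are hypothetical names, and the printf-style conversions (%d, %.2X, %%) handled by the real code are not implemented here.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <utility>

// Base case: no arguments left, copy the rest of the format verbatim.
inline void format_v_into(std::ostringstream& out, const char* fmt) {
  out << fmt;
}

// Recursive case: print text up to the next "%v", substitute one value,
// then recurse on the remaining format and the remaining arguments.
template <class T, class... Rest>
void format_v_into(std::ostringstream& out, const char* fmt,
                   const T& value, Rest&&... rest) {
  for (; *fmt; ++fmt) {
    if (fmt[0] == '%' && fmt[1] == 'v') {
      out << value;
      format_v_into(out, fmt + 2, std::forward<Rest>(rest)...);
      return;
    }
    out << *fmt;
  }
  // Format exhausted: surplus arguments are silently ignored in this sketch.
}

template <class... Args>
std::string format_v(const std::string& fmt, Args&&... args) {
  std::ostringstream out;
  format_v_into(out, fmt.c_str(), std::forward<Args>(args)...);
  return out.str();
}
```

Unlike this sketch, the loop in the real stream_printf scans for '%' without testing for the terminating NUL, so it appears to rely on the format string containing at least as many placeholders as there are arguments.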

27
deps/cmph/cxxmph/string_util_test.cc vendored Normal file

@@ -0,0 +1,27 @@
#include "string_util.h"
#include "test.h"
using namespace cxxmph;
bool test_format() {
string expected = " %% 4 foo 0x0A bar ";
string foo = "foo";
string fmt = format(" %%%% %v %v 0x%.2X bar ", 4, foo, 10);
fail_unless(fmt == expected, "expected\n-%s-\n got \n-%s-", expected.c_str(), fmt.c_str());
return true;
}
bool test_infoln() {
infoln(string("%s:%d: MY INFO LINE"), __FILE__, __LINE__);
return true;
}
bool test_macro() {
CXXMPH_DEBUGLN("here i am")();
return true;
}
CXXMPH_TEST_CASE(test_format)
CXXMPH_TEST_CASE(test_infoln)
CXXMPH_TEST_CASE(test_macro)

182
deps/cmph/cxxmph/stringpiece.h vendored Normal file

@@ -0,0 +1,182 @@
// Copyright 2001-2010 The RE2 Authors. All Rights Reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.
// A string-like object that points to a sized piece of memory.
//
// Functions or methods may use const StringPiece& parameters to accept either
// a "const char*" or a "string" value that will be implicitly converted to
// a StringPiece. The implicit conversion means that it is often appropriate
// to include this .h file in other files rather than forward-declaring
// StringPiece as would be appropriate for most other Google classes.
//
// Systematic usage of StringPiece is encouraged as it will reduce unnecessary
// conversions from "const char*" to "string" and back again.
//
//
// Arghh! I wish C++ literals were "string".
#ifndef CXXMPH_STRINGPIECE_H__
#define CXXMPH_STRINGPIECE_H__
#include <algorithm> // for std::min
#include <cstddef>
#include <string.h>
#include <iosfwd>
#include <iterator> // for std::reverse_iterator
#include <string>
namespace cxxmph {
class StringPiece {
private:
const char* ptr_;
int length_;
public:
// We provide non-explicit singleton constructors so users can pass
// in a "const char*" or a "string" wherever a "StringPiece" is
// expected.
StringPiece() : ptr_(NULL), length_(0) { }
StringPiece(const char* str)
: ptr_(str), length_((str == NULL) ? 0 : static_cast<int>(strlen(str))) { }
StringPiece(const std::string& str)
: ptr_(str.data()), length_(static_cast<int>(str.size())) { }
StringPiece(const char* offset, int len) : ptr_(offset), length_(len) { }
// data() may return a pointer to a buffer with embedded NULs, and the
// returned buffer may or may not be null terminated. Therefore it is
// typically a mistake to pass data() to a routine that expects a NUL
// terminated string.
const char* data() const { return ptr_; }
int size() const { return length_; }
int length() const { return length_; }
bool empty() const { return length_ == 0; }
void clear() { ptr_ = NULL; length_ = 0; }
void set(const char* data, int len) { ptr_ = data; length_ = len; }
void set(const char* str) {
ptr_ = str;
if (str != NULL)
length_ = static_cast<int>(strlen(str));
else
length_ = 0;
}
void set(const void* data, int len) {
ptr_ = reinterpret_cast<const char*>(data);
length_ = len;
}
char operator[](int i) const { return ptr_[i]; }
void remove_prefix(int n) {
ptr_ += n;
length_ -= n;
}
void remove_suffix(int n) {
length_ -= n;
}
int compare(const StringPiece& x) const {
int r = memcmp(ptr_, x.ptr_, std::min(length_, x.length_));
if (r == 0) {
if (length_ < x.length_) r = -1;
else if (length_ > x.length_) r = +1;
}
return r;
}
std::string as_string() const {
return std::string(data(), size());
}
// We also define ToString() here, since many other string-like
// interfaces name the routine that converts to a C++ string
// "ToString", and it's confusing to have the method that does that
// for a StringPiece be called "as_string()". We also leave the
// "as_string()" method defined here for existing code.
std::string ToString() const {
return std::string(data(), size());
}
void CopyToString(std::string* target) const;
void AppendToString(std::string* target) const;
// Does "this" start with "x"
bool starts_with(const StringPiece& x) const {
return ((length_ >= x.length_) &&
(memcmp(ptr_, x.ptr_, x.length_) == 0));
}
// Does "this" end with "x"
bool ends_with(const StringPiece& x) const {
return ((length_ >= x.length_) &&
(memcmp(ptr_ + (length_-x.length_), x.ptr_, x.length_) == 0));
}
// standard STL container boilerplate
typedef char value_type;
typedef const char* pointer;
typedef const char& reference;
typedef const char& const_reference;
typedef size_t size_type;
typedef ptrdiff_t difference_type;
static const size_type npos;
typedef const char* const_iterator;
typedef const char* iterator;
typedef std::reverse_iterator<const_iterator> const_reverse_iterator;
typedef std::reverse_iterator<iterator> reverse_iterator;
iterator begin() const { return ptr_; }
iterator end() const { return ptr_ + length_; }
const_reverse_iterator rbegin() const {
return const_reverse_iterator(ptr_ + length_);
}
const_reverse_iterator rend() const {
return const_reverse_iterator(ptr_);
}
// STLS says return size_type, but Google says return int
int max_size() const { return length_; }
int capacity() const { return length_; }
int copy(char* buf, size_type n, size_type pos = 0) const;
int find(const StringPiece& s, size_type pos = 0) const;
int find(char c, size_type pos = 0) const;
int rfind(const StringPiece& s, size_type pos = npos) const;
int rfind(char c, size_type pos = npos) const;
StringPiece substr(size_type pos, size_type n = npos) const;
};
inline bool operator==(const StringPiece& x, const StringPiece& y) {
return x.length() == y.length() && memcmp(x.data(), y.data(), x.length()) == 0;
}
inline bool operator!=(const StringPiece& x, const StringPiece& y) {
return !(x == y);
}
inline bool operator<(const StringPiece& x, const StringPiece& y) {
const int r = memcmp(x.data(), y.data(),
std::min(x.size(), y.size()));
return ((r < 0) || ((r == 0) && (x.size() < y.size())));
}
inline bool operator>(const StringPiece& x, const StringPiece& y) {
return y < x;
}
inline bool operator<=(const StringPiece& x, const StringPiece& y) {
return !(x > y);
}
inline bool operator>=(const StringPiece& x, const StringPiece& y) {
return !(x < y);
}
} // namespace cxxmph
// allow StringPiece to be logged
inline std::ostream& operator<<(std::ostream& o, const cxxmph::StringPiece& piece) {
o << piece.as_string(); return o;
}
#endif // CXXMPH_STRINGPIECE_H__

22
deps/cmph/cxxmph/test.cc vendored Normal file

@@ -0,0 +1,22 @@
#include <cstdlib> // For EXIT_SUCCESS, EXIT_FAILURE
#include "test.h"
Suite* global_suite() {
static Suite* gs = suite_create("cxxmph_test_suite");
return gs;
}
TCase* global_tc_core() {
static TCase* gtc = tcase_create("Core");
return gtc;
}
int main (void) {
suite_add_tcase(global_suite(), global_tc_core());
int number_failed;
SRunner *sr = srunner_create (global_suite());
srunner_run_all (sr, CK_NORMAL);
number_failed = srunner_ntests_failed (sr);
srunner_free (sr);
return (number_failed == 0) ? EXIT_SUCCESS : EXIT_FAILURE;
}

32
deps/cmph/cxxmph/test.h vendored Normal file

@@ -0,0 +1,32 @@
#ifndef __CXXMPH_TEST_H__
#define __CXXMPH_TEST_H__
// Thin wrapper on top of check.h to get rid of boilerplate in tests. Assumes a
// single test suite and test case per file, with each fixture represented by a
// parameter-less boolean function.
//
// The check.h header macro-clashes with c++ libraries so this file needs to be
// included last.
#include <check.h>
#include <stdio.h>
Suite* global_suite();
TCase* global_tc_core();
// Creates a new test case calling boolean_function. Name must be a valid,
// unique c identifier when prefixed with tc_.
#define CXXMPH_CXX_TEST_CASE(name, boolean_function) \
START_TEST(tc_ ## name) \
{ fail_unless(boolean_function()); } END_TEST \
static TestCase global_cxxmph_tc_ ## name(tc_ ## name);
#define CXXMPH_TEST_CASE(name) CXXMPH_CXX_TEST_CASE(name, name)
struct TestCase {
TestCase(void (*f)(int)) {
tcase_add_test(global_tc_core(), f);
}
};
#endif // __CXXMPH_TEST_H__
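
Note (illustrative, not part of the vendored file): the TestCase struct above registers a test from the constructor of a static object, so each CXXMPH_TEST_CASE line adds itself to the suite before main() runs. The sketch below reproduces that pattern without check.h. All names in it (sketch_registry, SketchRegistrar, SKETCH_TEST_CASE, sketch_run_all) are invented for this note.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef bool (*BoolTest)();

// Function-local static, so it is constructed on first use and is safe to
// call from the constructors of other static objects.
inline std::vector<BoolTest>& sketch_registry() {
  static std::vector<BoolTest> tests;
  return tests;
}

// Constructing one of these registers a test as a side effect.
struct SketchRegistrar {
  explicit SketchRegistrar(BoolTest f) { sketch_registry().push_back(f); }
};

// One static registrar per test; the name must be a valid, unique identifier.
#define SKETCH_TEST_CASE(name) \
  static SketchRegistrar sketch_registrar_##name(name);

// Runs every registered test and returns how many returned false.
inline std::size_t sketch_run_all() {
  std::size_t failed = 0;
  for (BoolTest t : sketch_registry()) {
    if (!t()) ++failed;
  }
  return failed;
}

bool always_true() { return true; }
bool always_false() { return false; }
SKETCH_TEST_CASE(always_true)
SKETCH_TEST_CASE(always_false)
```

The accessor function plays the same role as global_tc_core() in test.cc: it avoids depending on the initialization order of statics across translation units.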

4
deps/cmph/cxxmph/test_test.cc vendored Normal file

@@ -0,0 +1,4 @@
#include "test.h"
bool tautology() { return true; }
CXXMPH_TEST_CASE(tautology)

82
deps/cmph/cxxmph/trigraph.cc vendored Normal file

@@ -0,0 +1,82 @@
#include <cassert>
#include <limits>
#include <iostream>
#include "trigraph.h"
using std::cerr;
using std::endl;
using std::vector;
namespace {
static const uint32_t kInvalidEdge = std::numeric_limits<uint32_t>::max();
}
namespace cxxmph {
TriGraph::TriGraph(uint32_t nvertices, uint32_t nedges)
: nedges_(0),
edges_(nedges),
next_edge_(nedges),
first_edge_(nvertices, kInvalidEdge),
vertex_degree_(nvertices, 0) { }
TriGraph::~TriGraph() {}
void TriGraph::ExtractEdgesAndClear(vector<Edge>* edges) {
vector<Edge>().swap(next_edge_);
vector<uint32_t>().swap(first_edge_);
vector<uint8_t>().swap(vertex_degree_);
nedges_ = 0;
edges->swap(edges_);
}
void TriGraph::AddEdge(const Edge& edge) {
edges_[nedges_] = edge;
assert(first_edge_.size() > edge[0]);
assert(first_edge_.size() > edge[1]);
assert(first_edge_.size() > edge[2]);
assert(next_edge_.size() > nedges_);
next_edge_[nedges_] = Edge(
first_edge_[edge[0]], first_edge_[edge[1]], first_edge_[edge[2]]);
first_edge_[edge[0]] = first_edge_[edge[1]] = first_edge_[edge[2]] = nedges_;
++vertex_degree_[edge[0]];
++vertex_degree_[edge[1]];
++vertex_degree_[edge[2]];
++nedges_;
}
void TriGraph::RemoveEdge(uint32_t current_edge) {
// cerr << "Removing edge " << current_edge << " from " << nedges_ << " existing edges " << endl;
for (int i = 0; i < 3; ++i) {
uint32_t vertex = edges_[current_edge][i];
uint32_t edge1 = first_edge_[vertex];
uint32_t edge2 = kInvalidEdge;
uint32_t j = 0;
while (edge1 != current_edge && edge1 != kInvalidEdge) {
edge2 = edge1;
if (edges_[edge1][0] == vertex) j = 0;
else if (edges_[edge1][1] == vertex) j = 1;
else j = 2;
edge1 = next_edge_[edge1][j];
}
assert(edge1 != kInvalidEdge);
if (edge2 != kInvalidEdge) next_edge_[edge2][j] = next_edge_[edge1][i];
else first_edge_[vertex] = next_edge_[edge1][i];
--vertex_degree_[vertex];
}
}
void TriGraph::DebugGraph() const {
uint32_t i;
for(i = 0; i < edges_.size(); i++){
cerr << i << " " << edges_[i][0] << " " << edges_[i][1] << " " << edges_[i][2]
<< " nexts " << next_edge_[i][0] << " " << next_edge_[i][1] << " " << next_edge_[i][2] << endl;
}
for(i = 0; i < first_edge_.size();i++){
cerr << "first for vertex " <<i << " " << first_edge_[i] << endl;
}
}
} // namespace cxxmph

49
deps/cmph/cxxmph/trigraph.h vendored Normal file

@@ -0,0 +1,49 @@
#ifndef __CXXMPH_TRIGRAPH_H__
#define __CXXMPH_TRIGRAPH_H__
// Build a trigraph using a memory efficient representation.
//
// Prior knowledge of the number of edges and vertices for the graph is
// required. For each vertex, we store how many edges touch it (degree) and the
// index of the first edge in the vector of triples representing the edges.
#include <stdint.h> // for uint32_t and friends
#include <vector>
namespace cxxmph {
class TriGraph {
public:
struct Edge {
Edge() { }
Edge(uint32_t v0, uint32_t v1, uint32_t v2) {
vertices[0] = v0;
vertices[1] = v1;
vertices[2] = v2;
}
uint32_t& operator[](uint8_t v) { return vertices[v]; }
const uint32_t& operator[](uint8_t v) const { return vertices[v]; }
uint32_t vertices[3];
};
TriGraph(uint32_t nvertices, uint32_t nedges);
~TriGraph();
void AddEdge(const Edge& edge);
void RemoveEdge(uint32_t edge_id);
void ExtractEdgesAndClear(std::vector<Edge>* edges);
void DebugGraph() const;
const std::vector<Edge>& edges() const { return edges_; }
const std::vector<uint8_t>& vertex_degree() const { return vertex_degree_; }
const std::vector<uint32_t>& first_edge() const { return first_edge_; }
private:
uint32_t nedges_; // total number of edges
std::vector<Edge> edges_;
std::vector<Edge> next_edge_; // for implementing removal
std::vector<uint32_t> first_edge_; // the first edge for this vertex
std::vector<uint8_t> vertex_degree_; // number of edges for this vertex
};
} // namespace cxxmph
#endif // __CXXMPH_TRIGRAPH_H__
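
Note (illustrative, not part of the vendored file): TriGraph threads one linked list per vertex through next_edge_, with first_edge_ holding the head of each list. The sketch below shows how adding an edge pushes it onto the front of three lists and how the edges touching a vertex can be walked afterwards. MiniTriGraph, add_edge, incident and kNoEdge are hypothetical names, and edge removal is omitted.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

const uint32_t kNoEdge = std::numeric_limits<uint32_t>::max();

// Hypothetical, simplified version of the TriGraph adjacency layout.
struct MiniTriGraph {
  typedef std::array<uint32_t, 3> Edge;
  std::vector<Edge> edges;           // the three vertices of each edge
  std::vector<Edge> next_edge;       // per edge and slot: next edge on that vertex
  std::vector<uint32_t> first_edge;  // per vertex: most recently added edge
  std::vector<uint8_t> degree;       // per vertex: number of incident edges

  explicit MiniTriGraph(uint32_t nvertices)
      : first_edge(nvertices, kNoEdge), degree(nvertices, 0) {}

  void add_edge(uint32_t v0, uint32_t v1, uint32_t v2) {
    const uint32_t id = static_cast<uint32_t>(edges.size());
    const Edge e = {{v0, v1, v2}};
    Edge next;
    // Remember the old list heads first, then make this edge the new head.
    for (int slot = 0; slot < 3; ++slot) next[slot] = first_edge[e[slot]];
    for (int slot = 0; slot < 3; ++slot) {
      first_edge[e[slot]] = id;
      ++degree[e[slot]];
    }
    edges.push_back(e);
    next_edge.push_back(next);
  }

  // Walks the per-vertex list, newest edge first.
  std::vector<uint32_t> incident(uint32_t v) const {
    std::vector<uint32_t> out;
    uint32_t e = first_edge[v];
    while (e != kNoEdge) {
      out.push_back(e);
      int slot = 0;
      while (edges[e][slot] != v) ++slot;  // find which slot holds v
      e = next_edge[e][slot];
    }
    return out;
  }
};
```

With the two edges used in trigraph_test.cc, (0, 1, 2) and (1, 3, 2), vertex 1 is reached from edge 1 first and then from edge 0, matching the degree of 2 asserted in that test.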

22
deps/cmph/cxxmph/trigraph_test.cc vendored Normal file

@@ -0,0 +1,22 @@
#include <cassert>
#include "trigraph.h"
using cxxmph::TriGraph;
int main(int argc, char** argv) {
TriGraph g(4, 2);
g.AddEdge(TriGraph::Edge(0, 1, 2));
g.AddEdge(TriGraph::Edge(1, 3, 2));
assert(g.vertex_degree()[0] == 1);
assert(g.vertex_degree()[1] == 2);
assert(g.vertex_degree()[2] == 2);
assert(g.vertex_degree()[3] == 1);
g.RemoveEdge(0);
assert(g.vertex_degree()[0] == 0);
assert(g.vertex_degree()[1] == 1);
assert(g.vertex_degree()[2] == 1);
assert(g.vertex_degree()[3] == 1);
std::vector<TriGraph::Edge> edges;
g.ExtractEdgesAndClear(&edges);
}

315
deps/cmph/docs/bdz.html vendored Normal file

@@ -0,0 +1,315 @@
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<META NAME="generator" CONTENT="http://txt2tags.org">
<LINK REL="stylesheet" TYPE="text/css" HREF="DOC.css">
<TITLE>BDZ Algorithm</TITLE>
</HEAD><BODY BGCOLOR="white" TEXT="black">
<CENTER>
<H1>BDZ Algorithm</H1>
</CENTER>
<HR NOSHADE SIZE=1>
<H2>Introduction</H2>
<P>
The BDZ algorithm was designed by Fabiano C. Botelho, Djamal Belazzougui, Rasmus Pagh and Nivio Ziviani. It is a simple, efficient and practical algorithm with near-optimal space usage that generates a family <IMG ALIGN="bottom" SRC="figs/bdz/img8.png" BORDER="0" ALT=""> of perfect hash functions (PHFs) and minimal perfect hash functions (MPHFs). It is also referred to as the BPZ algorithm because of the work presented by Botelho, Pagh and Ziviani in <A HREF="#papers">[2</A>]. In Botelho's PhD dissertation <A HREF="#papers">[1</A>] it is also referred to as the RAM algorithm because it is more suitable for key sets that can be handled in internal memory.
</P>
<P>
The BDZ algorithm uses <I>r</I>-uniform random hypergraphs given by function values of <I>r</I> uniform random hash functions on the input key set <I>S</I> for generating PHFs and MPHFs that require <I>O(n)</I> bits to be stored. A hypergraph is the generalization of a standard undirected graph where each edge connects <IMG ALIGN="middle" SRC="figs/bdz/img12.png" BORDER="0" ALT=""> vertices. This idea is not new, see e.g. <A HREF="#papers">[8</A>], but we have proceeded differently to achieve a space usage of <I>O(n)</I> bits rather than <I>O(n log n)</I> bits. Evaluation time for all schemes considered is constant. For <I>r=3</I> we obtain a space usage of approximately <I>2.6n</I> bits for an MPHF. More compact, and even simpler, representations can be achieved when the range <I>m</I> is allowed to exceed <I>n</I>, that is, for PHFs. For example, for <I>m=1.23n</I> we can get a space usage of <I>1.95n</I> bits.
</P>
<P>
Our best MPHF space upper bound is within a factor of <I>2</I> from the information-theoretic lower bound of approximately <I>1.44</I> bits per key. We have shown that the BDZ algorithm is far more practical than previous methods with proven space complexity, both because of its simplicity, and because the constant factor of the space complexity is more than <I>6</I> times lower than its closest competitor, for plausible problem sizes. We verify the practicality experimentally, using slightly more space than in the mentioned theoretical bounds.
</P>
<HR NOSHADE SIZE=1>
<H2>The Algorithm</H2>
<P>
The BDZ algorithm is a three-step algorithm that generates PHFs and MPHFs based on random <I>r</I>-partite hypergraphs. This approach provides a much tighter analysis and is much simpler than the one presented in <A HREF="#papers">[3</A>], where it was implicit how to construct similar PHFs. The fastest and most compact functions are generated when <I>r=3</I>. In this case a PHF can be stored in approximately <I>1.95</I> bits per key and an MPHF in approximately <I>2.62</I> bits per key.
</P>
<P>
Figure 1 gives an overview of the algorithm for <I>r=3</I>, taking as input a key set <IMG ALIGN="middle" SRC="figs/bdz/img22.png" BORDER="0" ALT=""> containing three English words, i.e., <I>S={who,band,the}</I>. The edge-oriented data structure proposed in <A HREF="#papers">[4</A>] is used to represent hypergraphs, where each edge is explicitly represented as an array of <I>r</I> vertices and, for each vertex <I>v</I>, there is a list of edges that are incident on <I>v</I>.
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><IMG ALIGN="middle" SRC="figs/bdz/img50.png" BORDER="0" ALT=""></TD>
</TR>
<TR>
<TD><B>Figure 1:</B> (a) The mapping step generates a random acyclic <I>3</I>-partite hypergraph</TD>
</TR>
<TR>
<TD>with <I>m=6</I> vertices and <I>n=3</I> edges and a list <IMG ALIGN="middle" SRC="figs/bdz/img4.png" BORDER="0" ALT=""> of edges obtained when we test</TD>
</TR>
<TR>
<TD>whether the hypergraph is acyclic. (b) The assigning step builds an array <I>g</I> that</TD>
</TR>
<TR>
<TD>maps values from <I>[0,5]</I> to <I>[0,3]</I> to uniquely assign an edge to a vertex. (c) The ranking</TD>
</TR>
<TR>
<TD>step builds the data structure used to compute function <I>rank</I> in <I>O(1)</I> time.</TD>
</TR>
</TABLE>
<P>
The <I>Mapping Step</I> in Figure 1(a) carries out two important tasks:
</P>
<OL>
<LI>It assumes that it is possible to find three uniform hash functions <I>h<sub>0</sub></I>, <I>h<sub>1</sub></I> and <I>h<sub>2</sub></I>, with ranges <I>{0,1}</I>, <I>{2,3}</I> and <I>{4,5}</I>, respectively. These functions build a one-to-one mapping of the key set <I>S</I> to the edge set <I>E</I> of a random acyclic <I>3</I>-partite hypergraph <I>G=(V,E)</I>, where <I>|V|=m=6</I> and <I>|E|=n=3</I>. In <A HREF="#papers">[1,2</A>] it is shown that it is possible to obtain such a hypergraph with probability tending to <I>1</I> as <I>n</I> tends to infinity whenever <I>m=cn</I> and <I>c &gt; 1.22</I>. The value of <I>c</I> that minimizes the hypergraph size (and thereby the number of bits needed to represent the resulting functions) is in the range <I>(1.22,1.23)</I>. To illustrate the mapping, key "who" is mapped to edge <I>{h<sub>0</sub>("who"), h<sub>1</sub>("who"), h<sub>2</sub>("who")} = {1,3,5}</I>, key "band" is mapped to edge <I>{h<sub>0</sub>("band"), h<sub>1</sub>("band"), h<sub>2</sub>("band")} = {1,2,4}</I>, and key "the" is mapped to edge <I>{h<sub>0</sub>("the"), h<sub>1</sub>("the"), h<sub>2</sub>("the")} = {0,2,5}</I>.
<P></P>
<LI>It tests whether the resulting random <I>3</I>-partite hypergraph contains cycles by iteratively deleting edges connecting vertices of degree 1. The deleted edges are stored in the order of deletion in a list <IMG ALIGN="middle" SRC="figs/bdz/img4.png" BORDER="0" ALT=""> to be used in the assigning step. The first deleted edge in Figure 1(a) was <I>{1,2,4}</I>, the second one was <I>{1,3,5}</I> and the third one was <I>{0,2,5}</I>. If it ends with an empty graph, then the test succeeds, otherwise it fails.
</OL>
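<P>
The acyclicity test of item 2 can be sketched in C++ as follows. This is a minimal illustration and not the cmph implementation: the names <I>Edge</I> and <I>PeelOrder</I> are hypothetical, and instead of per-vertex lists of incident edges it keeps, for each vertex, the XOR of the ids of its incident edges, which identifies the single remaining edge once the degree drops to 1. The deletion order depends on the traversal, so it need not match the order shown in Figure 1(a).
</P>

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical names, used only for this illustration.
using Edge = std::array<uint32_t, 3>;

// Repeatedly deletes an edge that contains a vertex of degree 1 and returns
// the deleted edge ids in order of deletion. The hypergraph is acyclic
// exactly when every edge ends up in the returned list.
std::vector<uint32_t> PeelOrder(const std::vector<Edge>& edges,
                                uint32_t nvertices) {
  std::vector<uint32_t> degree(nvertices, 0);
  // XOR of the ids of all edges touching a vertex. When the degree is 1,
  // this value is the id of the only remaining incident edge.
  std::vector<uint32_t> incident_xor(nvertices, 0);
  for (uint32_t e = 0; e < edges.size(); ++e) {
    for (uint32_t v : edges[e]) {
      assert(v < nvertices);
      ++degree[v];
      incident_xor[v] ^= e;
    }
  }
  std::vector<uint32_t> pending;  // vertices that had degree 1 when pushed
  for (uint32_t v = 0; v < nvertices; ++v) {
    if (degree[v] == 1) pending.push_back(v);
  }
  std::vector<uint32_t> order;
  while (!pending.empty()) {
    const uint32_t v = pending.back();
    pending.pop_back();
    if (degree[v] != 1) continue;  // its edge was already deleted
    const uint32_t e = incident_xor[v];
    order.push_back(e);
    for (uint32_t u : edges[e]) {
      --degree[u];
      incident_xor[u] ^= e;
      if (degree[u] == 1) pending.push_back(u);
    }
  }
  return order;
}
```

<P>
For the three edges of the example, <I>{1,3,5}</I>, <I>{1,2,4}</I> and <I>{0,2,5}</I> with <I>m=6</I>, the returned list contains all three edge ids, so the test succeeds.
</P>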
<P>
We now show how to use the Jenkins hash functions <A HREF="#papers">[7</A>] to implement the three hash functions <I>h<sub>i</sub></I>, which map values from <I>S</I> to <I>V<sub>i</sub></I>, where <IMG ALIGN="middle" SRC="figs/bdz/img52.png" BORDER="0" ALT="">. These functions are used to build a random <I>3</I>-partite hypergraph, where <IMG ALIGN="middle" SRC="figs/bdz/img53.png" BORDER="0" ALT=""> and <IMG ALIGN="middle" SRC="figs/bdz/img54.png" BORDER="0" ALT="">. Let <IMG ALIGN="middle" SRC="figs/bdz/img55.png" BORDER="0" ALT=""> be a Jenkins hash function for <IMG ALIGN="middle" SRC="figs/bdz/img56.png" BORDER="0" ALT="">, where
<I>w=32 or 64</I> for 32-bit and 64-bit architectures, respectively.
Let <I>H'</I> be an array of 3 <I>w</I>-bit values. The Jenkins hash function
allows us to compute in parallel the three entries in <I>H'</I>
and thereby the three hash functions <I>h<sub>i</sub></I>, as follows:
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><I>H' = h'(x)</I></TD>
</TR>
<TR>
<TD><I>h<sub>0</sub>(x) = H'[0] mod</I> <IMG ALIGN="middle" SRC="figs/bdz/img136.png" BORDER="0" ALT=""></TD>
</TR>
<TR>
<TD><I>h<sub>1</sub>(x) = H'[1] mod</I> <IMG ALIGN="middle" SRC="figs/bdz/img136.png" BORDER="0" ALT=""> <I>+</I> <IMG ALIGN="middle" SRC="figs/bdz/img136.png" BORDER="0" ALT=""></TD>
</TR>
<TR>
<TD><I>h<sub>2</sub>(x) = H'[2] mod</I> <IMG ALIGN="middle" SRC="figs/bdz/img136.png" BORDER="0" ALT=""> <I>+ 2</I><IMG ALIGN="middle" SRC="figs/bdz/img136.png" BORDER="0" ALT=""></TD>
</TR>
</TABLE>
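<P>
The three formulas above can be sketched as a small C++ helper. The sketch assumes that the quantity shown as an image in the formulas is the size of one vertex partition and that it equals <I>m/3</I> with <I>m</I> a multiple of 3; the name <I>EdgeFromHashes</I> is hypothetical, and the three input words stand for the entries of <I>H'</I> however they were computed.
</P>

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Maps the three hash words H'[0..2] of a key to one vertex per partition.
// Assumption: each partition holds m / 3 vertices and m is a multiple of 3.
std::array<uint32_t, 3> EdgeFromHashes(const std::array<uint32_t, 3>& hashes,
                                       uint32_t m) {
  assert(m >= 3 && m % 3 == 0);
  const uint32_t part = m / 3;
  std::array<uint32_t, 3> edge;
  edge[0] = hashes[0] % part;             // h0(x) falls in [0, part)
  edge[1] = hashes[1] % part + part;      // h1(x) falls in [part, 2*part)
  edge[2] = hashes[2] % part + 2 * part;  // h2(x) falls in [2*part, 3*part)
  return edge;
}
```

<P>
With <I>m=6</I> the three ranges are <I>{0,1}</I>, <I>{2,3}</I> and <I>{4,5}</I>, as in Figure 1.
</P>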
<P>
The <I>Assigning Step</I> in Figure 1(b) outputs a PHF that maps the key set <I>S</I> into the range <I>[0,m-1]</I> and is represented by an array <I>g</I> storing values from the range <I>[0,3]</I>. The array <I>g</I> makes it possible to select one out of the <I>3</I> vertices of a given edge, which is associated with a key <I>k</I>. A vertex for a key <I>k</I> is given by either <I>h<sub>0</sub>(k)</I>, <I>h<sub>1</sub>(k)</I> or <I>h<sub>2</sub>(k)</I>. The function <I>h<sub>i</sub>(k)</I> to be used for <I>k</I> is chosen by calculating <I>i = (g[h<sub>0</sub>(k)] + g[h<sub>1</sub>(k)] + g[h<sub>2</sub>(k)]) mod 3</I>. For instance, the values 1 and 4 represent the keys "who" and "band" because <I>i = (g[1] + g[3] + g[5]) mod 3 = 0</I> and <I>h<sub>0</sub>("who") = 1</I>, and <I>i = (g[1] + g[2] + g[4]) mod 3 = 2</I> and <I>h<sub>2</sub>("band") = 4</I>, respectively. Let <I>Visited</I> be a boolean vector of size <I>m</I> that indicates whether a vertex has been visited. The assigning step first initializes <I>g[i]=3</I> to mark every vertex as unassigned and <I>Visited[i]= false</I>, <IMG ALIGN="middle" SRC="figs/bdz/img88.png" BORDER="0" ALT="">. Then, for each edge <IMG ALIGN="middle" SRC="figs/bdz/img90.png" BORDER="0" ALT=""> from tail to head, it looks for the first vertex <I>u</I> belonging to <I>e</I> that has not yet been visited. Finding such a vertex for every edge is a sufficient condition for success <A HREF="#papers">[1,2,8</A>]. Let <I>j</I> be the index of <I>u</I> in <I>e</I>, for <I>j</I> in the range <I>[0,2]</I>. Then, it assigns <IMG ALIGN="middle" SRC="figs/bdz/img95.png" BORDER="0" ALT="">. Whenever it passes through a vertex <I>u</I> from <I>e</I>, if <I>u</I> has not yet been visited, it sets <I>Visited[u] = true</I>.
</P>
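<P>
A minimal C++ sketch of the assigning step follows. It is not the cmph code: the names <I>Edge</I>, <I>kUnassigned</I> and <I>AssignG</I> are hypothetical, and since the assignment formula appears only as an image here, the sketch assumes the value is chosen so that the lookup rule <I>(g[h<sub>0</sub>(k)] + g[h<sub>1</sub>(k)] + g[h<sub>2</sub>(k)]) mod 3</I> returns <I>j</I> for the edge being processed.
</P>

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical names, used only for this illustration.
using Edge = std::array<uint32_t, 3>;
const uint8_t kUnassigned = 3;

// edges: the hypergraph; order: edge ids in order of deletion, as produced by
// the acyclicity test; m: number of vertices. Returns the array g.
std::vector<uint8_t> AssignG(const std::vector<Edge>& edges,
                             const std::vector<uint32_t>& order, uint32_t m) {
  std::vector<uint8_t> g(m, kUnassigned);
  std::vector<bool> visited(m, false);
  // Walk the list of deleted edges from tail to head.
  for (auto it = order.rbegin(); it != order.rend(); ++it) {
    const Edge& e = edges[*it];
    int j = -1;  // index in e of the first vertex not yet visited
    for (int t = 0; t < 3 && j < 0; ++t) {
      if (!visited[e[t]]) j = t;
    }
    assert(j >= 0);  // holds when order comes from a successful acyclicity test
    int sum_others = 0;  // at most 6; an unassigned 3 counts as 0 modulo 3
    for (int t = 0; t < 3; ++t) {
      if (t != j) sum_others += g[e[t]];
    }
    // Assumed rule: make the sum over the edge congruent to j modulo 3.
    g[e[j]] = static_cast<uint8_t>((j + 9 - sum_others) % 3);
    for (int t = 0; t < 3; ++t) visited[e[t]] = true;
  }
  return g;
}
```

<P>
With the edges and the deletion order of Figure 1(a), this sketch yields <I>g = [0,0,3,3,2,3]</I>, which agrees with the values <I>i=0</I> for "who" and <I>i=2</I> for "band" given in the text.
</P>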
<P>
If we stop the BDZ algorithm in the assigning step we obtain a PHF with range <I>[0,m-1]</I>. The PHF has the following form: <I>phf(x) = h<sub>i(x)</sub>(x)</I>, where key <I>x</I> is in <I>S</I> and <I>i(x) = (g[h<sub>0</sub>(x)] + g[h<sub>1</sub>(x)] + g[h<sub>2</sub>(x)]) mod 3</I>. In this case we do not need information for ranking and can set <I>g[i] = 0</I> whenever <I>g[i]</I> is equal to <I>3</I>, where <I>i</I> is in the range <I>[0,m-1]</I>. Therefore, the range of the values stored in <I>g</I> is narrowed from <I>[0,3]</I> to <I>[0,2]</I>. By using arithmetic coding on blocks of values (see <A HREF="#papers">[1,2</A>] for details), or any compression technique that supports constant-time random access to an array of compressed values <A HREF="#papers">[5,6,12</A>], we can store the resulting PHFs in <I>m log 3 = cn log 3</I> bits, where <I>c &gt; 1.22</I>. For <I>c = 1.23</I>, the space requirement is <I>1.95n</I> bits.
</P>
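<P>
Evaluating the PHF then takes three lookups in <I>g</I>. The C++ sketch below assumes the three vertices <I>h<sub>0</sub>(x)</I>, <I>h<sub>1</sub>(x)</I> and <I>h<sub>2</sub>(x)</I> have already been computed; the name <I>Phf</I> is hypothetical.
</P>

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// vertices = {h0(x), h1(x), h2(x)}; g holds values in [0,3].
// Returns phf(x) = h_i(x) with i = (g[h0(x)] + g[h1(x)] + g[h2(x)]) mod 3.
uint32_t Phf(const std::array<uint32_t, 3>& vertices,
             const std::vector<uint8_t>& g) {
  for (uint32_t v : vertices) assert(v < g.size());
  const uint32_t i =
      (g[vertices[0]] + g[vertices[1]] + g[vertices[2]]) % 3u;
  return vertices[i];
}
```

<P>
Because 3 is congruent to 0 modulo 3, replacing every 3 in <I>g</I> by 0 leaves the result unchanged, which is why the narrowed range <I>[0,2]</I> suffices for a PHF.
</P>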
<P>
The <I>Ranking Step</I> in Figure 1(c) outputs a data structure that narrows the range of a PHF generated in the assigning step from <I>[0,m-1]</I> to <I>[0,n-1]</I>, thereby producing an MPHF. The data structure makes it possible to compute in constant time a function <I>rank</I> from <I>[0,m-1]</I> to <I>[0,n-1]</I> that counts the number of assigned positions before a given position <I>v</I> in <I>g</I>. For instance, <I>rank(4) = 2</I> because the positions <I>0</I> and <I>1</I> are assigned since <I>g[0]</I> and <I>g[1]</I> are not equal to <I>3</I>.
</P>
<P>
For the implementation of the ranking step we have borrowed a simple and efficient implementation from <A HREF="#papers">[10</A>]. It requires <IMG ALIGN="middle" SRC="figs/bdz/img111.png" BORDER="0" ALT=""> additional bits of space, where <IMG ALIGN="middle" SRC="figs/bdz/img112.png" BORDER="0" ALT="">, and is obtained by storing explicitly the <I>rank</I> of every <I>k</I>th index in a rankTable, where <IMG ALIGN="middle" SRC="figs/bdz/img114.png" BORDER="0" ALT="">. The larger <I>k</I> is, the more compact the resulting MPHF is. Therefore, users can trade off space for evaluation time by setting <I>k</I> appropriately in the implementation. We only allow values of <I>k</I> that are powers of two (i.e., <I>k=2<sup>b<sub>k</sub></sup></I> for some constant <I>b<sub>k</sub></I>) in order to replace the expensive division and modulo operations by bit-shift and bitwise "and" operations, respectively. We have used <I>k=256</I> in the experiments for generating more succinct MPHFs. We remark that it is still possible to obtain a more compact data structure by using the results presented in <A HREF="#papers">[9,11</A>], but at the cost of a much more complex implementation.
</P>
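<P>
The sampled rankTable can be sketched in C++ as follows. This is an illustration under simplifying assumptions, not the cmph code: <I>g</I> is kept as one byte per entry rather than packed, the struct name <I>SampledRank</I> is hypothetical, and the entries between two samples are scanned one at a time, where the text uses the lookup table <I>T<sub>r</sub></I> instead.
</P>

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Stores the rank of every k-th index, with k = 2^b, and scans the rest.
struct SampledRank {
  uint32_t b;                        // log2(k), assumed to be below 32
  std::vector<uint8_t> g;            // values in [0,3]; 3 means unassigned
  std::vector<uint32_t> rank_table;  // rank_table[t] = rank(t * k)

  SampledRank(const std::vector<uint8_t>& g_in, uint32_t b_in)
      : b(b_in), g(g_in) {
    assert(b < 32);
    const uint32_t mask = (1u << b) - 1;  // i & mask replaces i mod k
    uint32_t count = 0;
    for (uint32_t i = 0; i < g.size(); ++i) {
      if ((i & mask) == 0) rank_table.push_back(count);
      if (g[i] != 3) ++count;
    }
  }

  // Number of assigned positions strictly before u.
  uint32_t Rank(uint32_t u) const {
    assert(u < g.size());
    const uint32_t block = u >> b;  // u >> b replaces u / k
    uint32_t count = rank_table[block];
    for (uint32_t i = block << b; i < u; ++i) {
      if (g[i] != 3) ++count;
    }
    return count;
  }
};
```

<P>
With <I>g = [0,0,3,3,2,3]</I> and <I>k=2</I>, the table holds the ranks of positions 0, 2 and 4, and <I>Rank(4)</I> returns 2 as in the example above.
</P>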
<P>
We need to use an additional lookup table <I>T<sub>r</sub></I> to guarantee the constant evaluation time of <I>rank(u)</I>. Let us illustrate how <I>rank(u)</I> is computed using both the rankTable and the lookup table <I>T<sub>r</sub></I>. We first look up the rank of the largest precomputed index <I>v</I> lower than or equal to <I>u</I> in the rankTable, and use <I>T<sub>r</sub></I> to count the number of assigned vertices from position <I>v</I> to <I>u-1</I>. The lookup table <I>T<sub>r</sub></I> allows us to count in constant time the number of assigned vertices in <IMG ALIGN="middle" SRC="figs/bdz/img122.png" BORDER="0" ALT=""> bits, where <IMG ALIGN="middle" SRC="figs/bdz/img112.png" BORDER="0" ALT="">. Thus the actual evaluation time is <IMG ALIGN="middle" SRC="figs/bdz/img123.png" BORDER="0" ALT="">. For simplicity and without loss of generality we let <IMG ALIGN="middle" SRC="figs/bdz/img124.png" BORDER="0" ALT=""> be a multiple of the number of bits <IMG ALIGN="middle" SRC="figs/bdz/img125.png" BORDER="0" ALT=""> used to encode each entry of <I>g</I>. As the values in <I>g</I> come from the range <I>[0,3]</I>,
then <IMG ALIGN="middle" SRC="figs/bdz/img126.png" BORDER="0" ALT=""> bits and we have tried <IMG ALIGN="middle" SRC="figs/bdz/img124.png" BORDER="0" ALT=""> equal to <I>8</I> and <I>16</I>. We would expect <IMG ALIGN="middle" SRC="figs/bdz/img124.png" BORDER="0" ALT=""> equal to 16 to provide a faster evaluation time because we would need to carry out fewer lookups in <I>T<sub>r</sub></I>. However, for both values the lookup table <I>T<sub>r</sub></I> fits entirely in the CPU cache and we did not observe any significant difference in the evaluation times. Therefore we settled on the value <I>8</I>. We remark that each value of <I>r</I> requires a different lookup table <I>T<sub>r</sub></I> that can be generated a priori.
</P>
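<P>
For the chosen value of 8 bits per lookup, such a table can be generated as in the following C++ sketch. It assumes that <I>g</I> is packed with four 2-bit entries per byte, which is a layout assumption made for this illustration; the name <I>BuildAssignedCountTable</I> is hypothetical.
</P>

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// For every possible byte, counts how many of its four 2-bit entries are
// assigned, that is, different from 3 (binary 11).
std::array<uint8_t, 256> BuildAssignedCountTable() {
  std::array<uint8_t, 256> table;
  for (uint32_t byte = 0; byte < 256; ++byte) {
    uint8_t count = 0;
    for (uint32_t slot = 0; slot < 4; ++slot) {
      const uint32_t entry = (byte >> (2 * slot)) & 3u;
      if (entry != 3u) ++count;
    }
    assert(count <= 4);
    table[byte] = count;
  }
  return table;
}
```

<P>
One lookup then accounts for four entries of <I>g</I> at once, so counting the assigned vertices between a sampled index and <I>u</I> needs at most one lookup per byte in that interval.
</P>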
<P>
The resulting MPHFs have the following form: <I>h(x) = rank(phf(x))</I>. Therefore, we cannot get rid of the ranking information by replacing the values 3 by 0 in the entries of <I>g</I>. In this case each entry in the array <I>g</I> is encoded with <I>2</I> bits and we need <IMG ALIGN="middle" SRC="figs/bdz/img133.png" BORDER="0" ALT=""> additional bits to compute function <I>rank</I> in constant time. Then, the total space to store the resulting functions is <IMG ALIGN="middle" SRC="figs/bdz/img134.png" BORDER="0" ALT=""> bits. By using <I>c = 1.23</I> and <IMG ALIGN="middle" SRC="figs/bdz/img135.png" BORDER="0" ALT=""> we have obtained MPHFs that require approximately <I>2.62</I> bits per key to be stored.
</P>
<HR NOSHADE SIZE=1>
<H2>Memory Consumption</H2>
<P>
Now we detail the memory consumption to generate and to store minimal perfect hash functions
using the BDZ algorithm. The structures responsible for memory consumption are the
following:
</P>
<UL>
<LI>3-graph:
<OL>
<LI><B>first</B>: is a vector that stores <I>cn</I> integer numbers, each one representing
the first edge (index in the vector edges) in the list of
incident edges of each vertex. The integer numbers are 4 bytes long. Therefore,
the vector first is stored in <I>4cn</I> bytes.
<P></P>
<LI><B>edges</B>: is a vector to represent the edges of the graph. As each edge
is composed of three vertices, each entry stores three integer numbers
of 4 bytes that represent the vertices. As there are <I>n</I> edges, the
vector edges is stored in <I>12n</I> bytes.
<P></P>
<LI><B>next</B>: given a vertex <IMG ALIGN="bottom" SRC="figs/img139.png" BORDER="0" ALT="">, we can discover the edges that
contain <IMG ALIGN="bottom" SRC="figs/img139.png" BORDER="0" ALT=""> by following its list of incident edges,
which starts at first[<IMG ALIGN="bottom" SRC="figs/img139.png" BORDER="0" ALT="">]; the next
edges are given by next[...first[<IMG ALIGN="bottom" SRC="figs/img139.png" BORDER="0" ALT="">]...]. Therefore, the vectors first and next represent
the linked lists of edges of each vertex. As there are three vertices for each edge,
when an edge is inserted in the 3-graph, it must be inserted in the three linked lists
of the vertices in its composition. Therefore, there are <I>3n</I> entries of integer
numbers in the vector next, so it is stored in <I>4*3n = 12n</I> bytes.
<P></P>
<LI><B>Vertices degree (vert_degree vector)</B>: is a vector of <I>cn</I> bytes
that represents the degree of each vertex. We can use just one byte for each
vertex because the 3-graph is sparse, since it has more vertices than edges.
Therefore, the vertices degree is represented in <I>cn</I> bytes.
<P></P>
</OL>
<LI>Acyclicity test:
<OL>
<LI><B>List of deleted edges obtained when we test whether the 3-graph is a forest (queue vector)</B>:
is a vector of <I>n</I> integer numbers containing indexes of vector edges. Therefore, it
requires <I>4n</I> bytes in internal memory.
<P></P>
<LI><B>Marked edges in the acyclicity test (marked_edges vector)</B>:
is a bit vector of <I>n</I> bits to indicate the edges that have already been deleted during
the acyclicity test. Therefore, it requires <I>n/8</I> bytes in internal memory.
<P></P>
</OL>
<LI>MPHF description
<OL>
<LI><B>function <I>g</I></B>: is represented by a vector of <I>2cn</I> bits. Therefore, it is
stored in <I>0.25cn</I> bytes.
<LI><B>ranktable</B>: is a lookup table used to store some precomputed ranking information.
It has <I>(cn)/(2^b)</I> entries of 4-byte integer numbers. Therefore it is stored in
<I>(4cn)/(2^b)</I> bytes. The larger b is, the more compact the resulting MPHFs are and
the slower the functions are. So b imposes a trade-off between space and time.
<LI><B>Total</B>: 0.25cn + (4cn)/(2^b) bytes
</OL>
</UL>
<P>
Thus, the total memory consumption of the BDZ algorithm for generating a minimal
perfect hash function (MPHF) is: <I>(28.125 + 5c)n + 0.25cn + (4cn)/(2^b) + O(1)</I> bytes.
As the value of constant <I>c</I> may be larger than or equal to 1.23 we have:
</P>
<TABLE ALIGN="center" BORDER="1" CELLPADDING="4">
<TR>
<TH><I>c</I></TH>
<TH><I>b</I></TH>
<TH>Memory consumption to generate an MPHF (in bytes)</TH>
</TR>
<TR>
<TD>1.23</TD>
<TD ALIGN="center"><I>7</I></TD>
<TD ALIGN="center"><I>34.62n + O(1)</I></TD>
</TR>
<TR>
<TD>1.23</TD>
<TD ALIGN="center"><I>8</I></TD>
<TD ALIGN="center"><I>34.60n + O(1)</I></TD>
</TR>
</TABLE>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><B>Table 1:</B> Memory consumption to generate an MPHF using the BDZ algorithm.</TD>
</TR>
</TABLE>
<P>
Now we present the memory consumption to store the resulting function.
So we have:
</P>
<TABLE ALIGN="center" BORDER="1" CELLPADDING="4">
<TR>
<TH><I>c</I></TH>
<TH><I>b</I></TH>
<TH>Memory consumption to store an MPHF (in bits)</TH>
</TR>
<TR>
<TD>1.23</TD>
<TD ALIGN="center"><I>7</I></TD>
<TD ALIGN="center"><I>2.77n + O(1)</I></TD>
</TR>
<TR>
<TD>1.23</TD>
<TD ALIGN="center"><I>8</I></TD>
<TD ALIGN="center"><I>2.61n + O(1)</I></TD>
</TR>
</TABLE>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><B>Table 2:</B> Memory consumption to store an MPHF generated by the BDZ algorithm.</TD>
</TR>
</TABLE>
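<P>
The entries of Tables 1 and 2 follow directly from the formulas above. The C++ sketch below evaluates them per key, ignoring the <I>O(1)</I> term; the function names are hypothetical, and the storage figure is the "Total" of the MPHF description converted from bytes to bits.
</P>

```cpp
#include <cassert>
#include <cmath>

// Bytes per key needed while generating an MPHF:
// (28.125 + 5c) + 0.25c + 4c / 2^b.
double GenerationBytesPerKey(double c, int b) {
  assert(b >= 0);
  return 28.125 + 5.0 * c + 0.25 * c + 4.0 * c / std::ldexp(1.0, b);
}

// Bits per key needed to store the resulting MPHF:
// 8 * (0.25c + 4c / 2^b).
double StorageBitsPerKey(double c, int b) {
  assert(b >= 0);
  return 8.0 * (0.25 * c + 4.0 * c / std::ldexp(1.0, b));
}
```

<P>
For <I>c = 1.23</I> these functions give about 34.62 and 34.60 bytes per key for <I>b = 7</I> and <I>b = 8</I> during generation, and about 2.77 and 2.61 bits per key for storage.
</P>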
<HR NOSHADE SIZE=1>
<H2>Experimental Results</H2>
<P>
Experimental results comparing the BDZ algorithm with the other algorithms in the CMPH
library are presented in Botelho, Pagh and Ziviani <A HREF="#papers">[1,2</A>].
</P>
<HR NOSHADE SIZE=1>
<A NAME="papers"></A>
<H2>Papers</H2>
<OL>
<LI><A HREF="http://www.dcc.ufmg.br/~fbotelho">F. C. Botelho</A>. <A HREF="papers/thesis.pdf">Near-Optimal Space Perfect Hashing Algorithms</A>. <I>PhD. Thesis</I>, <I>Department of Computer Science</I>, <I>Federal University of Minas Gerais</I>, September 2008. Supervised by <A HREF="http://www.dcc.ufmg.br/~nivio">N. Ziviani</A>.
<P></P>
<LI><A HREF="http://www.dcc.ufmg.br/~fbotelho">F. C. Botelho</A>, <A HREF="http://www.itu.dk/~pagh/">R. Pagh</A>, <A HREF="http://www.dcc.ufmg.br/~nivio">N. Ziviani</A>. <A HREF="papers/wads07.pdf">Simple and space-efficient minimal perfect hash functions</A>. <I>In Proceedings of the 10th International Workshop on Algorithms and Data Structures (WADs'07),</I> Springer-Verlag Lecture Notes in Computer Science, vol. 4619, Halifax, Canada, August 2007, 139-150.
<P></P>
<LI>B. Chazelle, J. Kilian, R. Rubinfeld, and A. Tal. The bloomier filter: An efficient data structure for static support lookup tables. <I>In Proceedings of the 15th annual ACM-SIAM symposium on Discrete algorithms (SODA'04)</I>, pages 30-39, Philadelphia, PA, USA, 2004. Society for Industrial and Applied Mathematics.
<P></P>
<LI>J. Ebert. A versatile data structure for edge-oriented graph algorithms. <I>Communications of the ACM</I>, (30):513-519, 1987.
<P></P>
<LI>K. Fredriksson and F. Nikitin. Simple compression code supporting random access and fast string matching. <I>In Proceedings of the 6th International Workshop on Efficient and Experimental Algorithms (WEA'07)</I>, pages 203-216, 2007.
<P></P>
<LI>R. Gonzalez and G. Navarro. Statistical encoding of succinct data structures. <I>In Proceedings of the 19th Annual Symposium on Combinatorial Pattern Matching (CPM'06)</I>, pages 294-305, 2006.
<P></P>
<LI>B. Jenkins. Algorithm alley: Hash functions. <I>Dr. Dobb's Journal of Software Tools</I>, 22(9), September 1997. Extended version available at <A HREF="http://burtleburtle.net/bob/hash/doobs.html">http://burtleburtle.net/bob/hash/doobs.html</A>.
<P></P>
<LI>B.S. Majewski, N.C. Wormald, G. Havas, and Z.J. Czech. A family of perfect hashing methods. <I>The Computer Journal</I>, 39(6):547-554, 1996.
<P></P>
<LI>D. Okanohara and K. Sadakane. Practical entropy-compressed rank/select dictionary. <I>In Proceedings of the Workshop on Algorithm Engineering and Experiments (ALENEX'07)</I>, 2007.
<P></P>
<LI><A HREF="http://www.itu.dk/~pagh/">R. Pagh</A>. Low redundancy in static dictionaries with constant query time. <I>SIAM Journal on Computing</I>, 31(2):353-363, 2001.
<P></P>
<LI>R. Raman, V. Raman, and S. S. Rao. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. <I>In Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms (SODA'02)</I>, pages 233-242, Philadelphia PA, USA, 2002. Society for Industrial and Applied Mathematics.
<P></P>
<LI>K. Sadakane and R. Grossi. Squeezing succinct data structures into entropy bounds. <I>In Proceedings of the 17th annual ACM-SIAM symposium on Discrete algorithms (SODA'06)</I>, pages 1230-1239, 2006.
</OL>
<HR NOSHADE SIZE=1>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><A HREF="index.html">Home</A></TD>
<TD><A HREF="chd.html">CHD</A></TD>
<TD><A HREF="bdz.html">BDZ</A></TD>
<TD><A HREF="bmz.html">BMZ</A></TD>
<TD><A HREF="chm.html">CHM</A></TD>
<TD><A HREF="brz.html">BRZ</A></TD>
<TD><A HREF="fch.html">FCH</A></TD>
</TR>
</TABLE>
<HR NOSHADE SIZE=1>
<P>
Enjoy!
</P>
<P>
<A HREF="mailto:davi@users.sourceforge.net">Davi de Castro Reis</A>
</P>
<P>
<A HREF="mailto:db8192@users.sourceforge.net">Djamal Belazzougui</A>
</P>
<P>
<A HREF="mailto:fc_botelho@users.sourceforge.net">Fabiano Cupertino Botelho</A>
</P>
<P>
<A HREF="mailto:nivio@dcc.ufmg.br">Nivio Ziviani</A>
</P>
<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
try {
var pageTracker = _gat._getTracker("UA-7698683-2");
pageTracker._trackPageview();
} catch(err) {}</script>
<!-- html code generated by txt2tags 2.6 (http://txt2tags.org) -->
<!-- cmdline: txt2tags -t html -i BDZ.t2t -o docs/bdz.html -->
</BODY></HTML>

581
deps/cmph/docs/bmz.html vendored Normal file

@@ -0,0 +1,581 @@
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<META NAME="generator" CONTENT="http://txt2tags.org">
<LINK REL="stylesheet" TYPE="text/css" HREF="DOC.css">
<TITLE>BMZ Algorithm</TITLE>
</HEAD><BODY BGCOLOR="white" TEXT="black">
<CENTER>
<H1>BMZ Algorithm</H1>
</CENTER>
<HR NOSHADE SIZE=1>
<H2>History</H2>
<P>
At the end of 2003, professor <A HREF="http://www.dcc.ufmg.br/~nivio">Nivio Ziviani</A> was
finishing the second edition of his <A HREF="http://www.dcc.ufmg.br/algoritmos/">book</A>.
While writing the <A HREF="http://www.dcc.ufmg.br/algoritmos/">book</A>,
professor <A HREF="http://www.dcc.ufmg.br/~nivio">Nivio Ziviani</A> studied the problem of generating
<A HREF="concepts.html">minimal perfect hash functions</A>
(if you are not familiar with this problem, see <A HREF="#papers">[1</A>]<A HREF="#papers">[2</A>]).
Professor <A HREF="http://www.dcc.ufmg.br/~nivio">Nivio Ziviani</A> coded a modified version of
the <A HREF="chm.html">CHM algorithm</A>, which was proposed by
Czech, Havas and Majewski, and put it in his <A HREF="http://www.dcc.ufmg.br/algoritmos/">book</A>.
The <A HREF="chm.html">CHM algorithm</A> is based on acyclic random graphs to generate
<A HREF="concepts.html">order preserving minimal perfect hash functions</A> in linear time.
Professor <A HREF="http://www.dcc.ufmg.br/~nivio">Nivio Ziviani</A>
asked himself: why must the random graph
be acyclic? In the modified version available in his <A HREF="http://www.dcc.ufmg.br/algoritmos/">book</A> he got rid of this restriction.
</P>
<P>
The modification presented a problem: it was impossible to generate minimal perfect hash functions
for sets with more than 1000 keys.
At the same time, <A HREF="http://www.dcc.ufmg.br/~fbotelho">Fabiano C. Botelho</A>,
a master's student at the <A HREF="http://www.dcc.ufmg.br">Department of Computer Science</A> of the
<A HREF="http://www.ufmg.br">Federal University of Minas Gerais</A>,
started to be advised by <A HREF="http://www.dcc.ufmg.br/~nivio">Nivio Ziviani</A>, who presented the problem
to <A HREF="http://www.dcc.ufmg.br/~fbotelho">Fabiano</A>.
</P>
<P>
During the master's program, <A HREF="http://www.dcc.ufmg.br/~fbotelho">Fabiano</A> and
<A HREF="http://www.dcc.ufmg.br/~nivio">Nivio Ziviani</A> faced lots of problems.
In April 2004, <A HREF="http://www.dcc.ufmg.br/~fbotelho">Fabiano</A> was talking with a
friend of his (David Menoti) about the problems,
and many ideas appeared.
The ideas were implemented and a very fast algorithm to generate
minimal perfect hash functions was designed.
We refer to the algorithm as <B>BMZ</B>, because it was conceived by Fabiano C. <B>B</B>otelho,
David <B>M</B>enoti and Nivio <B>Z</B>iviani. The algorithm is described in <A HREF="#papers">[1</A>].
To analyse the BMZ algorithm we needed some results from random graph theory, so
we invited professor <A HREF="http://www.ime.usp.br/~yoshi">Yoshiharu Kohayakawa</A> to help us.
The final description and analysis of the BMZ algorithm is presented in <A HREF="#papers">[2</A>].
</P>
<HR NOSHADE SIZE=1>
<H2>The Algorithm</H2>
<P>
The BMZ algorithm shares several features with the <A HREF="chm.html">CHM algorithm</A>.
In particular, BMZ algorithm is also
based on the generation of random graphs <IMG ALIGN="middle" SRC="figs/img27.png" BORDER="0" ALT="">, where <IMG ALIGN="bottom" SRC="figs/img28.png" BORDER="0" ALT=""> is in
one-to-one correspondence with the key set <IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT=""> for which we wish to
generate a <A HREF="concepts.html">minimal perfect hash function</A>.
The two main differences between BMZ algorithm and CHM algorithm
are as follows: (<I>i</I>) BMZ algorithm generates random
graphs <IMG ALIGN="middle" SRC="figs/img27.png" BORDER="0" ALT=""> with <IMG ALIGN="middle" SRC="figs/img29.png" BORDER="0" ALT=""> and <IMG ALIGN="middle" SRC="figs/img30.png" BORDER="0" ALT="">, where <IMG ALIGN="middle" SRC="figs/img31.png" BORDER="0" ALT="">,
and hence <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT=""> necessarily contains cycles,
while CHM algorithm generates <I>acyclic</I> random
graphs <IMG ALIGN="middle" SRC="figs/img27.png" BORDER="0" ALT=""> with <IMG ALIGN="middle" SRC="figs/img29.png" BORDER="0" ALT=""> and <IMG ALIGN="middle" SRC="figs/img30.png" BORDER="0" ALT="">,
with a greater number of vertices: <IMG ALIGN="middle" SRC="figs/img33.png" BORDER="0" ALT="">;
(<I>ii</I>) CHM algorithm generates <A HREF="concepts.html">order preserving minimal perfect hash functions</A>
while BMZ algorithm does not preserve order. Thus, BMZ algorithm improves
the space requirement at the expense of generating functions that are not
order preserving.
</P>
<P>
Suppose <IMG ALIGN="bottom" SRC="figs/img14.png" BORDER="0" ALT=""> is a universe of <I>keys</I>.
Let <IMG ALIGN="middle" SRC="figs/img17.png" BORDER="0" ALT=""> be a set of <IMG ALIGN="bottom" SRC="figs/img8.png" BORDER="0" ALT=""> keys from <IMG ALIGN="bottom" SRC="figs/img14.png" BORDER="0" ALT="">.
Let us show how the BMZ algorithm constructs a minimal perfect hash function <IMG ALIGN="bottom" SRC="figs/img7.png" BORDER="0" ALT="">.
We make use of two auxiliary random functions <IMG ALIGN="middle" SRC="figs/img41.png" BORDER="0" ALT=""> and <IMG ALIGN="middle" SRC="figs/img55.png" BORDER="0" ALT="">,
where <IMG ALIGN="middle" SRC="figs/img56.png" BORDER="0" ALT=""> for some suitably chosen integer <IMG ALIGN="bottom" SRC="figs/img57.png" BORDER="0" ALT="">,
where <IMG ALIGN="middle" SRC="figs/img58.png" BORDER="0" ALT="">. We build a random graph <IMG ALIGN="middle" SRC="figs/img59.png" BORDER="0" ALT=""> on <IMG ALIGN="bottom" SRC="figs/img60.png" BORDER="0" ALT="">,
whose edge set is <IMG ALIGN="middle" SRC="figs/img61.png" BORDER="0" ALT="">. There is an edge in <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT=""> for each
key in the set of keys <IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT="">.
</P>
<P>
In what follows, we shall be interested in the <I>2-core</I> of
the random graph <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT="">, that is, the maximal subgraph
of <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT=""> with minimal degree at
least 2 (see <A HREF="#papers">[2</A>] for details).
Because of its importance in our context, we call the 2-core the
<I>critical</I> subgraph of <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT=""> and denote it by <IMG ALIGN="middle" SRC="figs/img63.png" BORDER="0" ALT="">.
The vertices and edges in <IMG ALIGN="middle" SRC="figs/img63.png" BORDER="0" ALT=""> are said to be <I>critical</I>.
We let <IMG ALIGN="middle" SRC="figs/img64.png" BORDER="0" ALT=""> and <IMG ALIGN="middle" SRC="figs/img65.png" BORDER="0" ALT="">.
Moreover, we let <IMG ALIGN="middle" SRC="figs/img66.png" BORDER="0" ALT=""> be the set of <I>non-critical</I>
vertices in <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT="">.
We also let <IMG ALIGN="middle" SRC="figs/img67.png" BORDER="0" ALT=""> be the set of all critical
vertices that have at least one non-critical vertex as a neighbour.
Let <IMG ALIGN="middle" SRC="figs/img68.png" BORDER="0" ALT=""> be the set of <I>non-critical</I> edges in <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT="">.
Finally, we let <IMG ALIGN="middle" SRC="figs/img69.png" BORDER="0" ALT=""> be the <I>non-critical</I> subgraph
of <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT="">.
The non-critical subgraph <IMG ALIGN="middle" SRC="figs/img70.png" BORDER="0" ALT=""> corresponds to the <I>acyclic part</I>
of <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT="">.
We have <IMG ALIGN="middle" SRC="figs/img71.png" BORDER="0" ALT="">.
</P>
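<P>
The split into critical and non-critical edges can be sketched in C++ by repeatedly removing the edge attached to a vertex of degree 1; the edges that survive form the 2-core. This is a minimal illustration and not the cmph code: the name <I>CriticalEdges</I> is hypothetical, and the per-vertex XOR of incident edge ids is a shortcut chosen here to find the last remaining edge of a vertex without adjacency lists.
</P>

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Returns one flag per edge: true when the edge is critical, that is, when
// it belongs to the 2-core of the graph.
std::vector<bool> CriticalEdges(
    uint32_t nvertices,
    const std::vector<std::pair<uint32_t, uint32_t>>& edges) {
  std::vector<uint32_t> degree(nvertices, 0);
  // XOR of the ids of the edges touching a vertex; when the degree is 1 it
  // equals the id of the only remaining incident edge.
  std::vector<uint32_t> incident_xor(nvertices, 0);
  for (uint32_t e = 0; e < edges.size(); ++e) {
    assert(edges[e].first < nvertices && edges[e].second < nvertices);
    ++degree[edges[e].first];
    incident_xor[edges[e].first] ^= e;
    ++degree[edges[e].second];
    incident_xor[edges[e].second] ^= e;
  }
  std::vector<bool> critical(edges.size(), true);
  std::vector<uint32_t> pending;  // vertices that had degree 1 when pushed
  for (uint32_t v = 0; v < nvertices; ++v) {
    if (degree[v] == 1) pending.push_back(v);
  }
  while (!pending.empty()) {
    const uint32_t v = pending.back();
    pending.pop_back();
    if (degree[v] != 1) continue;  // its edge was already removed
    const uint32_t e = incident_xor[v];
    critical[e] = false;  // e lies in the acyclic part
    const uint32_t ends[2] = {edges[e].first, edges[e].second};
    for (uint32_t u : ends) {
      --degree[u];
      incident_xor[u] ^= e;
      if (degree[u] == 1) pending.push_back(u);
    }
  }
  return critical;
}
```

<P>
For a triangle on vertices 0, 1 and 2 with a path 2-3-4 attached, the three triangle edges are reported as critical and the two path edges as non-critical.
</P>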
<P>
We then construct a suitable labelling <IMG ALIGN="middle" SRC="figs/img72.png" BORDER="0" ALT=""> of the vertices
of <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT="">: we choose <IMG ALIGN="middle" SRC="figs/img73.png" BORDER="0" ALT=""> for each <IMG ALIGN="middle" SRC="figs/img74.png" BORDER="0" ALT=""> in such
a way that <IMG ALIGN="middle" SRC="figs/img75.png" BORDER="0" ALT=""> (<IMG ALIGN="middle" SRC="figs/img18.png" BORDER="0" ALT="">) is a
minimal perfect hash function for <IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT="">.
This labelling <IMG ALIGN="middle" SRC="figs/img37.png" BORDER="0" ALT=""> can be found in linear time
if the number of edges in <IMG ALIGN="middle" SRC="figs/img63.png" BORDER="0" ALT=""> is at most <IMG ALIGN="middle" SRC="figs/img76.png" BORDER="0" ALT=""> (see <A HREF="#papers">[2</A>]
for details).
</P>
<P>
Figure 1 presents pseudocode for the BMZ algorithm.
The procedure BMZ (<IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT="">, <IMG ALIGN="middle" SRC="figs/img37.png" BORDER="0" ALT="">) receives as input the set of
keys <IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT=""> and produces the labelling <IMG ALIGN="middle" SRC="figs/img37.png" BORDER="0" ALT="">.
The method uses a mapping, ordering and searching approach.
We now describe each step.
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD>procedure BMZ (<IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT="">, <IMG ALIGN="middle" SRC="figs/img37.png" BORDER="0" ALT="">)</TD>
</TR>
<TR>
<TD>&nbsp;&nbsp;&nbsp;&nbsp;Mapping (<IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT="">, <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT="">);</TD>
</TR>
<TR>
<TD>&nbsp;&nbsp;&nbsp;&nbsp;Ordering (<IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT="">, <IMG ALIGN="middle" SRC="figs/img63.png" BORDER="0" ALT="">, <IMG ALIGN="middle" SRC="figs/img70.png" BORDER="0" ALT="">);</TD>
</TR>
<TR>
<TD>&nbsp;&nbsp;&nbsp;&nbsp;Searching (<IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT="">, <IMG ALIGN="middle" SRC="figs/img63.png" BORDER="0" ALT="">, <IMG ALIGN="middle" SRC="figs/img70.png" BORDER="0" ALT="">, <IMG ALIGN="middle" SRC="figs/img37.png" BORDER="0" ALT="">);</TD>
</TR>
<TR>
<TD><B>Figure 1</B>: Main steps of the BMZ algorithm for constructing a minimal perfect hash function</TD>
</TR>
</TABLE>
<HR NOSHADE SIZE=1>
<H3>Mapping Step</H3>
<P>
The procedure Mapping (<IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT="">, <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT="">) receives as input the set
of keys <IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT=""> and generates the random graph <IMG ALIGN="middle" SRC="figs/img59.png" BORDER="0" ALT="">, by generating
two auxiliary functions <IMG ALIGN="middle" SRC="figs/img41.png" BORDER="0" ALT="">, <IMG ALIGN="middle" SRC="figs/img78.png" BORDER="0" ALT="">.
</P>
<P>
The functions <IMG ALIGN="middle" SRC="figs/img41.png" BORDER="0" ALT=""> and <IMG ALIGN="middle" SRC="figs/img42.png" BORDER="0" ALT=""> are constructed as follows.
We impose some upper bound <IMG ALIGN="bottom" SRC="figs/img79.png" BORDER="0" ALT=""> on the lengths of the keys in <IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT="">.
To define <IMG ALIGN="middle" SRC="figs/img80.png" BORDER="0" ALT=""> (<IMG ALIGN="middle" SRC="figs/img81.png" BORDER="0" ALT="">, <IMG ALIGN="bottom" SRC="figs/img62.png" BORDER="0" ALT="">), we generate
an <IMG ALIGN="middle" SRC="figs/img82.png" BORDER="0" ALT=""> table of random integers <IMG ALIGN="middle" SRC="figs/img83.png" BORDER="0" ALT="">.
For a key <IMG ALIGN="middle" SRC="figs/img18.png" BORDER="0" ALT=""> of length <IMG ALIGN="middle" SRC="figs/img84.png" BORDER="0" ALT=""> and <IMG ALIGN="middle" SRC="figs/img85.png" BORDER="0" ALT="">, we let
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><IMG ALIGN="middle" SRC="figs/img86.png" BORDER="0" ALT=""></TD>
</TR>
</TABLE>
<P>
The random graph <IMG ALIGN="middle" SRC="figs/img59.png" BORDER="0" ALT=""> has vertex set <IMG ALIGN="middle" SRC="figs/img56.png" BORDER="0" ALT=""> and
edge set <IMG ALIGN="middle" SRC="figs/img61.png" BORDER="0" ALT="">. We need <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT=""> to be
simple, i.e., <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT=""> should have neither loops nor multiple edges.
A loop occurs when <IMG ALIGN="middle" SRC="figs/img87.png" BORDER="0" ALT=""> for some <IMG ALIGN="middle" SRC="figs/img18.png" BORDER="0" ALT="">.
We solve this in an ad hoc manner: we simply let <IMG ALIGN="middle" SRC="figs/img88.png" BORDER="0" ALT=""> in this case.
If we still find a loop after this, we generate another pair <IMG ALIGN="middle" SRC="figs/img89.png" BORDER="0" ALT="">.
When a multiple edge occurs we abort and generate a new pair <IMG ALIGN="middle" SRC="figs/img89.png" BORDER="0" ALT="">.
Although the function above causes <A HREF="concepts.html">collisions</A> with probability <I>1/t</I>,
in the <A HREF="index.html">cmph library</A> we use faster hash
functions (<A HREF="http://www.cs.yorku.ca/~oz/hash.html">DJB2 hash</A>, <A HREF="http://www.isthe.com/chongo/tech/comp/fnv/">FNV hash</A>,
<A HREF="http://burtleburtle.net/bob/hash/doobs.html">Jenkins hash</A> and <A HREF="http://www.cs.yorku.ca/~oz/hash.html">SDBM hash</A>)
in which we do not need to impose any upper bound <IMG ALIGN="bottom" SRC="figs/img79.png" BORDER="0" ALT=""> on the lengths of the keys in <IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT="">.
</P>
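<P>
The mapping step can be sketched in Python as follows. This is a minimal illustration, not the cmph implementation: the names <I>make_table_hash</I> and <I>mapping</I> are hypothetical, the number of vertices is assumed to be <I>t = ceil(c*n)</I>, and the loop repair rule <I>h2 = (2*h1 + 1) mod t</I> is an assumption, since the exact expression appears only in a figure.
</P>

```python
import math
import random

def make_table_hash(max_len, t, rng):
    """Build h(key) = (sum over j of table[j][key[j]]) mod t from a random table."""
    table = [[rng.randrange(t) for _ in range(256)] for _ in range(max_len)]

    def h(key):
        return sum(table[j][byte] for j, byte in enumerate(key)) % t

    return h

def mapping(keys, c=1.15, seed=0, max_tries=1000):
    """Return (t, edges): one edge (h1(k), h2(k)) per key, forming a simple graph."""
    rng = random.Random(seed)
    t = max(2, math.ceil(c * len(keys)))   # assumed: t = ceil(c * n) vertices
    max_len = max(len(k) for k in keys)    # upper bound on the key length
    for _ in range(max_tries):
        h1 = make_table_hash(max_len, t, rng)
        h2 = make_table_hash(max_len, t, rng)
        edges, seen = [], set()
        for key in keys:
            u, v = h1(key), h2(key)
            if u == v:                      # loop: try the ad hoc repair first
                v = (2 * u + 1) % t         # assumed repair rule
            edge = (min(u, v), max(u, v))
            if u == v or edge in seen:      # still a loop, or a multiple edge
                break                       # abort and draw a new pair (h1, h2)
            seen.add(edge)
            edges.append((u, v))
        else:
            return t, edges                 # every key mapped without a conflict
    raise RuntimeError("no simple graph found; try a larger c")
```

<P>
Keys are passed as byte strings, for example <I>mapping([b"alpha", b"beta", b"gamma"])</I>. Each failed attempt discards both tables, which mirrors the "generate a new pair" rule described above.
</P>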
<P>
As mentioned before, for us to find the labelling <IMG ALIGN="middle" SRC="figs/img72.png" BORDER="0" ALT=""> of the
vertices of <IMG ALIGN="middle" SRC="figs/img59.png" BORDER="0" ALT=""> in linear time,
we require that <IMG ALIGN="middle" SRC="figs/img108.png" BORDER="0" ALT="">.
The crucial step now is to determine the value
of <IMG ALIGN="bottom" SRC="figs/img1.png" BORDER="0" ALT=""> (in <IMG ALIGN="bottom" SRC="figs/img57.png" BORDER="0" ALT="">) to obtain a random
graph <IMG ALIGN="middle" SRC="figs/img71.png" BORDER="0" ALT=""> with <IMG ALIGN="middle" SRC="figs/img109.png" BORDER="0" ALT="">.
Botelho, Menoti and Ziviani determined empirically in <A HREF="#papers">[1</A>] that
the value of <IMG ALIGN="bottom" SRC="figs/img1.png" BORDER="0" ALT=""> is <I>1.15</I>. This value is remarkably
close to the theoretical value determined in <A HREF="#papers">[2</A>],
which is around <IMG ALIGN="bottom" SRC="figs/img112.png" BORDER="0" ALT="">.
</P>
<HR NOSHADE SIZE=1>
<H3>Ordering Step</H3>
<P>
The procedure Ordering (<IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT="">, <IMG ALIGN="middle" SRC="figs/img63.png" BORDER="0" ALT="">, <IMG ALIGN="middle" SRC="figs/img70.png" BORDER="0" ALT="">) receives
as input the graph <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT=""> and partitions <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT=""> into the two
subgraphs <IMG ALIGN="middle" SRC="figs/img63.png" BORDER="0" ALT=""> and <IMG ALIGN="middle" SRC="figs/img70.png" BORDER="0" ALT="">, so that <IMG ALIGN="middle" SRC="figs/img71.png" BORDER="0" ALT="">.
</P>
<P>
Figure 2 presents a sample graph with 9 vertices
and 8 edges, where the degree of a vertex is shown beside each vertex.
Initially, all vertices with degree 1 are added to a queue <IMG ALIGN="middle" SRC="figs/img136.png" BORDER="0" ALT="">.
For the example shown in Figure 2(a), <IMG ALIGN="middle" SRC="figs/img137.png" BORDER="0" ALT=""> after the initialization step.
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><IMG ALIGN="middle" SRC="figs/img138.png" BORDER="0" ALT=""></TD>
</TR>
<TR>
<TD><B>Figure 2:</B> Ordering step for a graph with 9 vertices and 8 edges.</TD>
</TR>
</TABLE>
<P>
Next, we remove one vertex <IMG ALIGN="bottom" SRC="figs/img139.png" BORDER="0" ALT=""> from the queue, decrement its degree and
the degree of the vertices with degree greater than 0 in the adjacency
list of <IMG ALIGN="bottom" SRC="figs/img139.png" BORDER="0" ALT="">, as depicted in Figure 2(b) for <IMG ALIGN="bottom" SRC="figs/img140.png" BORDER="0" ALT="">.
At this point, the neighbours of <IMG ALIGN="bottom" SRC="figs/img139.png" BORDER="0" ALT=""> that now have degree 1 are
inserted into the queue, such as vertex 1.
This process is repeated until the queue becomes empty.
All vertices with degree 0 are non-critical vertices and the others are
critical vertices, as depicted in Figure 2(c).
Finally, to determine the vertices in <IMG ALIGN="middle" SRC="figs/img141.png" BORDER="0" ALT=""> we collect all
vertices <IMG ALIGN="middle" SRC="figs/img142.png" BORDER="0" ALT=""> with at least one vertex <IMG ALIGN="bottom" SRC="figs/img143.png" BORDER="0" ALT=""> that
is in Adj<IMG ALIGN="middle" SRC="figs/img144.png" BORDER="0" ALT=""> and in <IMG ALIGN="middle" SRC="figs/img145.png" BORDER="0" ALT="">, such as vertex 8 in Figure 2(c).
</P>
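<P>
The queue-based peeling described above can be sketched in Python as follows. The function name <I>ordering</I> and its return values are hypothetical; the graph is assumed to be simple and given as a list of vertex pairs.
</P>

```python
from collections import deque

def ordering(t, edges):
    """Split a simple graph on vertices 0..t-1 into critical and non-critical parts."""
    adj = [[] for _ in range(t)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    degree = [len(neighbours) for neighbours in adj]

    # Start with every vertex of degree 1 and peel until the queue is empty.
    queue = deque(v for v in range(t) if degree[v] == 1)
    while queue:
        v = queue.popleft()
        if degree[v] != 1:          # already reduced to 0 by its last neighbour
            continue
        degree[v] = 0
        for w in adj[v]:
            if degree[w] > 0:
                degree[w] -= 1
                if degree[w] == 1:  # w became a leaf, so it is peeled later
                    queue.append(w)

    # Vertices left with a positive degree form the 2-core.
    critical = {v for v in range(t) if degree[v] > 0}
    # Critical vertices with at least one non-critical neighbour.
    boundary = {v for v in critical if any(w not in critical for w in adj[v])}
    critical_edges = [(u, v) for u, v in edges if u in critical and v in critical]
    return critical, boundary, critical_edges
```

<P>
For a triangle with a two-edge tail attached to one corner, only the triangle survives the peeling, and the corner carrying the tail is the single critical vertex with a non-critical neighbour.
</P>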
<HR NOSHADE SIZE=1>
<H3>Searching Step</H3>
<P>
In the searching step, the key part is
the <I>perfect assignment problem</I>: find <IMG ALIGN="middle" SRC="figs/img153.png" BORDER="0" ALT=""> such that
the function <IMG ALIGN="middle" SRC="figs/img154.png" BORDER="0" ALT=""> defined by
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><IMG ALIGN="middle" SRC="figs/img155.png" BORDER="0" ALT=""></TD>
</TR>
</TABLE>
<P>
is a bijection from <IMG ALIGN="middle" SRC="figs/img156.png" BORDER="0" ALT=""> to <IMG ALIGN="middle" SRC="figs/img157.png" BORDER="0" ALT=""> (recall <IMG ALIGN="middle" SRC="figs/img158.png" BORDER="0" ALT="">).
We are interested in a labelling <IMG ALIGN="middle" SRC="figs/img72.png" BORDER="0" ALT=""> of
the vertices of the graph <IMG ALIGN="middle" SRC="figs/img59.png" BORDER="0" ALT=""> with
the property that if <IMG ALIGN="bottom" SRC="figs/img11.png" BORDER="0" ALT=""> and <IMG ALIGN="middle" SRC="figs/img22.png" BORDER="0" ALT=""> are keys
in <IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT="">, then <IMG ALIGN="middle" SRC="figs/img159.png" BORDER="0" ALT="">; that is, if we associate
to each edge the sum of the labels on its endpoints, then these values
should be all distinct.
Moreover, we require that all the sums <IMG ALIGN="middle" SRC="figs/img160.png" BORDER="0" ALT=""> (<IMG ALIGN="middle" SRC="figs/img18.png" BORDER="0" ALT="">)
fall between <IMG ALIGN="bottom" SRC="figs/img115.png" BORDER="0" ALT=""> and <IMG ALIGN="middle" SRC="figs/img161.png" BORDER="0" ALT="">, and thus we have a bijection
between <IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT=""> and <IMG ALIGN="middle" SRC="figs/img157.png" BORDER="0" ALT="">.
</P>
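<P>
The requirement on the labelling can be checked mechanically. The sketch below assumes that the hash value of a key is the sum of the labels on the two endpoints of its edge, as the text describes, and that the target addresses are 0 to n-1; the function name <I>is_minimal_perfect</I> is hypothetical.
</P>

```python
def is_minimal_perfect(g, edges):
    """True if the edge sums g[u] + g[v] are exactly the addresses 0..n-1."""
    sums = sorted(g[u] + g[v] for u, v in edges)
    return sums == list(range(len(edges)))
```

<P>
For the path 0-1-2-3 the labelling <I>g = [0, 0, 1, 1]</I> gives the edge sums 0, 1 and 2, so it is accepted, whereas the all-zero labelling maps every edge to address 0 and is rejected.
</P>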
<P>
The procedure Searching (<IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT="">, <IMG ALIGN="middle" SRC="figs/img63.png" BORDER="0" ALT="">, <IMG ALIGN="middle" SRC="figs/img70.png" BORDER="0" ALT="">, <IMG ALIGN="middle" SRC="figs/img37.png" BORDER="0" ALT="">)
receives as input <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT="">, <IMG ALIGN="middle" SRC="figs/img63.png" BORDER="0" ALT="">, <IMG ALIGN="middle" SRC="figs/img70.png" BORDER="0" ALT=""> and finds a
suitable <IMG ALIGN="middle" SRC="figs/img162.png" BORDER="0" ALT=""> bit value for each vertex <IMG ALIGN="middle" SRC="figs/img74.png" BORDER="0" ALT="">, stored in the
array <IMG ALIGN="middle" SRC="figs/img37.png" BORDER="0" ALT="">.
This step is first performed for the vertices in the
critical subgraph <IMG ALIGN="middle" SRC="figs/img63.png" BORDER="0" ALT=""> of <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT=""> (the 2-core of <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT="">)
and then it is performed for the vertices in <IMG ALIGN="middle" SRC="figs/img70.png" BORDER="0" ALT=""> (the non-critical subgraph
of <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT=""> that contains the "acyclic part" of <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT="">).
The reason the assignment of the <IMG ALIGN="middle" SRC="figs/img37.png" BORDER="0" ALT=""> values is first
performed on the vertices in <IMG ALIGN="middle" SRC="figs/img63.png" BORDER="0" ALT=""> is to resolve reassignments
as early as possible (such reassignments are consequences of the cycles
in <IMG ALIGN="middle" SRC="figs/img63.png" BORDER="0" ALT=""> and are described below).
</P>
<HR NOSHADE SIZE=1>
<H4>Assignment of Values to Critical Vertices</H4>
<P>
The labels <IMG ALIGN="middle" SRC="figs/img73.png" BORDER="0" ALT=""> (<IMG ALIGN="middle" SRC="figs/img142.png" BORDER="0" ALT="">)
are assigned in increasing order following a greedy
strategy where the critical vertices <IMG ALIGN="bottom" SRC="figs/img139.png" BORDER="0" ALT=""> are considered one at a time,
according to a breadth-first search on <IMG ALIGN="middle" SRC="figs/img63.png" BORDER="0" ALT="">.
If a candidate value <IMG ALIGN="bottom" SRC="figs/img11.png" BORDER="0" ALT=""> for <IMG ALIGN="middle" SRC="figs/img73.png" BORDER="0" ALT=""> is forbidden
because setting <IMG ALIGN="middle" SRC="figs/img163.png" BORDER="0" ALT=""> would create two edges with the same sum,
we try <IMG ALIGN="middle" SRC="figs/img164.png" BORDER="0" ALT=""> for <IMG ALIGN="middle" SRC="figs/img73.png" BORDER="0" ALT="">. Each such retry is referred to
as a <I>reassignment</I>.
</P>
<P>
Let <IMG ALIGN="middle" SRC="figs/img165.png" BORDER="0" ALT=""> be the set of addresses assigned to edges in <IMG ALIGN="middle" SRC="figs/img166.png" BORDER="0" ALT="">.
Initially <IMG ALIGN="middle" SRC="figs/img167.png" BORDER="0" ALT="">.
Let <IMG ALIGN="bottom" SRC="figs/img11.png" BORDER="0" ALT=""> be a candidate value for <IMG ALIGN="middle" SRC="figs/img73.png" BORDER="0" ALT="">.
Initially <IMG ALIGN="bottom" SRC="figs/img168.png" BORDER="0" ALT="">.
Considering the subgraph <IMG ALIGN="middle" SRC="figs/img63.png" BORDER="0" ALT=""> in Figure 2(c),
a step by step example of the assignment of values to vertices in <IMG ALIGN="middle" SRC="figs/img63.png" BORDER="0" ALT=""> is
presented in Figure 3.
Initially, a vertex <IMG ALIGN="bottom" SRC="figs/img139.png" BORDER="0" ALT=""> is chosen, the assignment <IMG ALIGN="middle" SRC="figs/img163.png" BORDER="0" ALT=""> is made
and <IMG ALIGN="bottom" SRC="figs/img11.png" BORDER="0" ALT=""> is set to <IMG ALIGN="middle" SRC="figs/img164.png" BORDER="0" ALT="">.
For example, suppose that vertex <IMG ALIGN="bottom" SRC="figs/img169.png" BORDER="0" ALT=""> in Figure 3(a) is
chosen, the assignment <IMG ALIGN="middle" SRC="figs/img170.png" BORDER="0" ALT=""> is made and <IMG ALIGN="bottom" SRC="figs/img11.png" BORDER="0" ALT=""> is set to <IMG ALIGN="bottom" SRC="figs/img96.png" BORDER="0" ALT="">.
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><IMG ALIGN="middle" SRC="figs/img171.png" BORDER="0" ALT=""></TD>
</TR>
<TR>
<TD><B>Figure 3:</B> Example of the assignment of values to critical vertices.</TD>
</TR>
</TABLE>
<P>
In Figure 3(b), following the adjacency list of vertex <IMG ALIGN="bottom" SRC="figs/img169.png" BORDER="0" ALT="">,
the unassigned vertex <IMG ALIGN="bottom" SRC="figs/img115.png" BORDER="0" ALT=""> is reached.
At this point, we collect in the temporary variable <IMG ALIGN="bottom" SRC="figs/img172.png" BORDER="0" ALT=""> all adjacencies
of vertex <IMG ALIGN="bottom" SRC="figs/img115.png" BORDER="0" ALT=""> that have been assigned an <IMG ALIGN="bottom" SRC="figs/img11.png" BORDER="0" ALT=""> value,
and <IMG ALIGN="middle" SRC="figs/img173.png" BORDER="0" ALT="">.
Next, for all <IMG ALIGN="middle" SRC="figs/img174.png" BORDER="0" ALT="">, we check if <IMG ALIGN="middle" SRC="figs/img175.png" BORDER="0" ALT="">.
Since <IMG ALIGN="middle" SRC="figs/img176.png" BORDER="0" ALT="">, then <IMG ALIGN="middle" SRC="figs/img177.png" BORDER="0" ALT=""> is set
to <IMG ALIGN="bottom" SRC="figs/img96.png" BORDER="0" ALT="">, <IMG ALIGN="bottom" SRC="figs/img11.png" BORDER="0" ALT=""> is incremented
by 1 (now <IMG ALIGN="bottom" SRC="figs/img178.png" BORDER="0" ALT="">) and <IMG ALIGN="middle" SRC="figs/img179.png" BORDER="0" ALT="">.
Next, vertex <IMG ALIGN="bottom" SRC="figs/img180.png" BORDER="0" ALT=""> is reached, <IMG ALIGN="middle" SRC="figs/img181.png" BORDER="0" ALT=""> is set
to <IMG ALIGN="bottom" SRC="figs/img62.png" BORDER="0" ALT="">, <IMG ALIGN="bottom" SRC="figs/img11.png" BORDER="0" ALT=""> is set to <IMG ALIGN="bottom" SRC="figs/img180.png" BORDER="0" ALT=""> and <IMG ALIGN="middle" SRC="figs/img182.png" BORDER="0" ALT="">.
Next, vertex <IMG ALIGN="bottom" SRC="figs/img183.png" BORDER="0" ALT=""> is reached and <IMG ALIGN="middle" SRC="figs/img184.png" BORDER="0" ALT="">.
Since <IMG ALIGN="middle" SRC="figs/img185.png" BORDER="0" ALT=""> and <IMG ALIGN="middle" SRC="figs/img186.png" BORDER="0" ALT="">, then <IMG ALIGN="middle" SRC="figs/img187.png" BORDER="0" ALT=""> is
set to <IMG ALIGN="bottom" SRC="figs/img180.png" BORDER="0" ALT="">, <IMG ALIGN="bottom" SRC="figs/img11.png" BORDER="0" ALT=""> is set to <IMG ALIGN="bottom" SRC="figs/img183.png" BORDER="0" ALT=""> and <IMG ALIGN="middle" SRC="figs/img188.png" BORDER="0" ALT="">.
Finally, vertex <IMG ALIGN="bottom" SRC="figs/img189.png" BORDER="0" ALT=""> is reached and <IMG ALIGN="middle" SRC="figs/img190.png" BORDER="0" ALT="">.
Since <IMG ALIGN="middle" SRC="figs/img191.png" BORDER="0" ALT="">, <IMG ALIGN="bottom" SRC="figs/img11.png" BORDER="0" ALT=""> is incremented by 1 and set to 5, as depicted in
Figure 3(c).
Since <IMG ALIGN="middle" SRC="figs/img192.png" BORDER="0" ALT="">, <IMG ALIGN="bottom" SRC="figs/img11.png" BORDER="0" ALT=""> is again incremented by 1 and set to 6,
as depicted in Figure 3(d).
These two reassignments are indicated by the arrows in Figure 3.
Since <IMG ALIGN="middle" SRC="figs/img193.png" BORDER="0" ALT=""> and <IMG ALIGN="middle" SRC="figs/img194.png" BORDER="0" ALT="">, then <IMG ALIGN="middle" SRC="figs/img195.png" BORDER="0" ALT=""> is set
to <IMG ALIGN="bottom" SRC="figs/img196.png" BORDER="0" ALT=""> and <IMG ALIGN="middle" SRC="figs/img197.png" BORDER="0" ALT="">. This finishes the algorithm.
</P>
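<P>
The greedy assignment walked through above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the cmph code: <I>assign_critical</I> is a hypothetical name, the critical subgraph is passed as a list of edges, and the sketch only guarantees that the edge sums are distinct. It does not check that they stay below <I>n</I>, which a full implementation must also ensure.
</P>

```python
from collections import deque

def assign_critical(critical_edges):
    """Label critical vertices in BFS order so that all edge sums are distinct."""
    adj = {}
    for u, v in critical_edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)

    g = {}          # vertex -> label
    used = set()    # addresses already taken by critical edges
    x = 0           # current candidate label, only ever increases
    for root in sorted(adj):
        if root in g:
            continue
        g[root] = x                 # first vertex of a component: no edge to check
        x += 1
        queue = deque([root])
        while queue:
            v = queue.popleft()
            for u in adj[v]:
                if u in g:
                    continue
                # Neighbours of u that already carry a label.
                assigned = [w for w in adj[u] if w in g]
                while any(x + g[w] in used for w in assigned):
                    x += 1          # reassignment: the candidate was forbidden
                g[u] = x
                used.update(x + g[w] for w in assigned)
                x += 1
                queue.append(u)
    return g, used
```

<P>
Because the candidate value never decreases, all labels are distinct, so the sums created by one vertex cannot collide with each other; only collisions with earlier edges need the explicit check.
</P>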
<HR NOSHADE SIZE=1>
<H4>Assignment of Values to Non-Critical Vertices</H4>
<P>
As <IMG ALIGN="middle" SRC="figs/img70.png" BORDER="0" ALT=""> is acyclic, we can impose the order in which addresses are
associated with edges in <IMG ALIGN="middle" SRC="figs/img70.png" BORDER="0" ALT="">, making this step simple to solve
by a standard depth-first search algorithm.
Therefore, in the assignment of values to vertices in <IMG ALIGN="middle" SRC="figs/img70.png" BORDER="0" ALT=""> we
benefit from the unused addresses in the gaps left by the assignment of values
to vertices in <IMG ALIGN="middle" SRC="figs/img63.png" BORDER="0" ALT="">.
For that, we start the depth-first search from the vertices in <IMG ALIGN="middle" SRC="figs/img141.png" BORDER="0" ALT=""> because
the <IMG ALIGN="middle" SRC="figs/img37.png" BORDER="0" ALT=""> values for these critical vertices were already assigned
and cannot be changed.
</P>
<P>
Considering the subgraph <IMG ALIGN="middle" SRC="figs/img70.png" BORDER="0" ALT=""> in Figure 2(c),
a step by step example of the assignment of values to vertices in <IMG ALIGN="middle" SRC="figs/img70.png" BORDER="0" ALT=""> is
presented in Figure 4.
Figure 4(a) presents the initial state of the algorithm.
The critical vertex 8 is the only one that has non-critical vertices as
neighbours.
In the example presented in Figure 3, the addresses <IMG ALIGN="middle" SRC="figs/img198.png" BORDER="0" ALT=""> were not used.
So, taking the first unused address <IMG ALIGN="bottom" SRC="figs/img115.png" BORDER="0" ALT=""> and the vertex <IMG ALIGN="bottom" SRC="figs/img96.png" BORDER="0" ALT="">,
which is reached from the vertex <IMG ALIGN="bottom" SRC="figs/img169.png" BORDER="0" ALT="">, <IMG ALIGN="middle" SRC="figs/img199.png" BORDER="0" ALT=""> is set
to <IMG ALIGN="middle" SRC="figs/img200.png" BORDER="0" ALT="">, as shown in Figure 4(b).
The only vertex that is reached from vertex <IMG ALIGN="bottom" SRC="figs/img96.png" BORDER="0" ALT=""> is vertex <IMG ALIGN="bottom" SRC="figs/img62.png" BORDER="0" ALT="">, so
taking the unused address <IMG ALIGN="bottom" SRC="figs/img183.png" BORDER="0" ALT=""> we set <IMG ALIGN="middle" SRC="figs/img201.png" BORDER="0" ALT=""> to <IMG ALIGN="middle" SRC="figs/img202.png" BORDER="0" ALT="">,
as shown in Figure 4(c).
This process is repeated until the UnAssignedAddresses list becomes empty.
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><IMG ALIGN="middle" SRC="figs/img203.png" BORDER="0" ALT=""></TD>
</TR>
<TR>
<TD><B>Figure 4:</B> Example of the assignment of values to non-critical vertices.</TD>
</TR>
</TABLE>
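<P>
A minimal Python sketch of this step follows. It assumes the labels of the critical vertices and the set of addresses they occupy are already known, and that each non-critical edge receives the next unused address, with the new label chosen so that the edge sum equals that address. The name <I>assign_non_critical</I> is hypothetical, starting isolated trees at label 0 is an assumption, and labels may come out negative here; how a real implementation represents them is not covered.
</P>

```python
def assign_non_critical(t, edges, g_critical, used):
    """Extend the labelling to non-critical vertices by depth-first search."""
    n = len(edges)
    g = dict(g_critical)
    adj = [[] for _ in range(t)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    # Addresses in 0..n-1 that no critical edge has taken, in increasing order.
    unused = (a for a in range(n) if a not in used)

    # Start from labelled (critical) vertices, then from any leftover tree.
    roots = list(g) + [v for v in range(t) if v not in g]
    for root in roots:
        if root not in g:
            g[root] = 0             # assumed start value for a detached tree
        stack = [root]
        while stack:
            v = stack.pop()
            for w in adj[v]:
                if w not in g:
                    # Choose g[w] so that the edge (v, w) gets the next free address.
                    g[w] = next(unused) - g[v]
                    stack.append(w)
    return g
```

<P>
Since the non-critical part is acyclic, every non-critical edge is traversed exactly once as a tree edge, so exactly one unused address is consumed per edge.
</P>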
<HR NOSHADE SIZE=1>
<A NAME="heuristic"></A>
<H2>The Heuristic</H2>
<P>
We now present a heuristic for the BMZ algorithm that
reduces the value of <IMG ALIGN="bottom" SRC="figs/img1.png" BORDER="0" ALT=""> to any given value between <I>1.15</I> and <I>0.93</I>.
This reduces the space requirement to store the resulting function
to any given value between <IMG ALIGN="bottom" SRC="figs/img12.png" BORDER="0" ALT=""> words and <IMG ALIGN="bottom" SRC="figs/img13.png" BORDER="0" ALT=""> words.
The heuristic reuses, when possible, the set
of <IMG ALIGN="bottom" SRC="figs/img11.png" BORDER="0" ALT=""> values that caused reassignments, just before
trying <IMG ALIGN="middle" SRC="figs/img164.png" BORDER="0" ALT="">.
Decreasing the value of <IMG ALIGN="bottom" SRC="figs/img1.png" BORDER="0" ALT=""> leads to an increase in the number of
iterations to generate <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT="">.
For example, for <IMG ALIGN="bottom" SRC="figs/img244.png" BORDER="0" ALT=""> and <IMG ALIGN="bottom" SRC="figs/img6.png" BORDER="0" ALT="">, the analytically expected numbers
of iterations are <IMG ALIGN="bottom" SRC="figs/img245.png" BORDER="0" ALT=""> and <IMG ALIGN="bottom" SRC="figs/img246.png" BORDER="0" ALT="">, respectively (see <A HREF="#papers">[2</A>]
for details),
while for <IMG ALIGN="bottom" SRC="figs/img128.png" BORDER="0" ALT=""> the same value is around <I>2.13</I>.
</P>
<HR NOSHADE SIZE=1>
<H2>Memory Consumption</H2>
<P>
Now we detail the memory consumption to generate and to store minimal perfect hash functions
using the BMZ algorithm. The structures responsible for memory consumption are the
following:
</P>
<UL>
<LI>Graph:
<OL>
<LI><B>first</B>: is a vector that stores <I>cn</I> integer numbers, each one representing
the first edge (index in the vector edges) in the list of
edges of each vertex.
The integer numbers are 4 bytes long. Therefore,
the vector first is stored in <I>4cn</I> bytes.
<P></P>
<LI><B>edges</B>: is a vector to represent the edges of the graph. As each edge
is composed of a pair of vertices, each entry stores two integer numbers
of 4 bytes that represent the vertices. As there are <I>n</I> edges, the
vector edges is stored in <I>8n</I> bytes.
<P></P>
<LI><B>next</B>: given a vertex <IMG ALIGN="bottom" SRC="figs/img139.png" BORDER="0" ALT="">, we can discover the edges that
contain <IMG ALIGN="bottom" SRC="figs/img139.png" BORDER="0" ALT=""> following its list of edges,
which starts on first[<IMG ALIGN="bottom" SRC="figs/img139.png" BORDER="0" ALT="">] and the next
edges are given by next[...first[<IMG ALIGN="bottom" SRC="figs/img139.png" BORDER="0" ALT="">]...]. Therefore, the vectors first and next represent
the linked lists of edges of each vertex. As there are two vertices for each edge,
when an edge is inserted in the graph, it must be inserted in the two linked lists
of the vertices in its composition. Therefore, there are <I>2n</I> entries of integer
numbers in the vector next, so it is stored in <I>4*2n = 8n</I> bytes.
<P></P>
<LI><B>critical vertices (critical_nodes vector)</B>: is a vector of <I>cn</I> bits,
where each bit indicates if a vertex is critical (1) or non-critical (0).
Therefore, the critical and non-critical vertices are represented in <I>cn/8</I> bytes.
<P></P>
<LI><B>critical edges (used_edges vector)</B>: is a vector of <I>n</I> bits, where each
bit indicates if an edge is critical (1) or non-critical (0). Therefore, the
critical and non-critical edges are represented in <I>n/8</I> bytes.
<P></P>
</OL>
<LI>Other auxiliary structures
<OL>
<LI><B>queue</B>: is a queue of integer numbers used in the breadth-first search of the
assignment of values to critical vertices. There is an entry in the queue for
each two critical vertices. Let <IMG ALIGN="middle" SRC="figs/img110.png" BORDER="0" ALT=""> be the expected number of critical
vertices. Therefore, the queue is stored in <I>4*0.5*<IMG ALIGN="middle" SRC="figs/img110.png" BORDER="0" ALT="">=2<IMG ALIGN="middle" SRC="figs/img110.png" BORDER="0" ALT=""></I> bytes.
<P></P>
<LI><B>visited</B>: is a vector of <I>cn</I> bits, where each bit indicates if the g value of
a given vertex was already defined. Therefore, the vector visited is stored
in <I>cn/8</I> bytes.
<P></P>
<LI><B>function <I>g</I></B>: is represented by a vector of <I>cn</I> integer numbers.
As each integer number is 4 bytes long, the function <I>g</I> is stored in
<I>4cn</I> bytes.
</OL>
</UL>
<P>
Thus, the total memory consumption of the BMZ algorithm for generating a minimal
perfect hash function (MPHF) is: <I>(8.25c + 16.125)n +2<IMG ALIGN="middle" SRC="figs/img110.png" BORDER="0" ALT=""> + O(1)</I> bytes.
As the value of the constant <I>c</I> may be 1.15 or 0.93, we have:
</P>
<TABLE ALIGN="center" BORDER="1" CELLPADDING="4">
<TR>
<TH><I>c</I></TH>
<TH><IMG ALIGN="middle" SRC="figs/img110.png" BORDER="0" ALT=""></TH>
<TH>Memory consumption to generate a MPHF</TH>
</TR>
<TR>
<TD>0.93</TD>
<TD ALIGN="center"><I>0.497n</I></TD>
<TD ALIGN="center"><I>24.80n + O(1)</I></TD>
</TR>
<TR>
<TD>1.15</TD>
<TD ALIGN="center"><I>0.401n</I></TD>
<TD ALIGN="center"><I>26.42n + O(1)</I></TD>
</TR>
</TABLE>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><B>Table 1:</B> Memory consumption to generate a MPHF using the BMZ algorithm.</TD>
</TR>
</TABLE>
<P>
The values of <IMG ALIGN="middle" SRC="figs/img110.png" BORDER="0" ALT=""> were calculated using Eq.(1) presented in <A HREF="#papers">[2</A>].
</P>
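<P>
The total can be reproduced by adding up the per-structure costs listed above. The sketch below expresses every cost in bytes per key (that is, divided by <I>n</I>); the function names are hypothetical, and the expected number of critical vertices per key is taken from Table 1 rather than computed.
</P>

```python
def bmz_generation_bytes_per_key(c, critical_vertices_per_key):
    """Bytes per key needed while generating a BMZ function, ignoring O(1) terms."""
    first = 4 * c                           # cn integers of 4 bytes
    edges = 8                               # n entries of two 4-byte integers
    nxt = 8                                 # 2n integers of 4 bytes
    critical_nodes = c / 8                  # cn bits
    used_edges = 1 / 8                      # n bits
    queue = 2 * critical_vertices_per_key   # 4 bytes for every two critical vertices
    visited = c / 8                         # cn bits
    g = 4 * c                               # cn integers of 4 bytes
    return first + edges + nxt + critical_nodes + used_edges + queue + visited + g

def bmz_storage_bytes_per_key(c):
    """Bytes per key needed to store the resulting function (the g vector only)."""
    return 4 * c
```

<P>
With <I>c = 0.93</I> and <I>0.497</I> critical vertices per key the sum is about 24.79 bytes per key, and with <I>c = 1.15</I> and <I>0.401</I> it is about 26.41, which agrees with Table 1 to within rounding.
</P>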
<P>
Now we present the memory consumption to store the resulting function.
We only need to store the <I>g</I> function. Thus, we need <I>4cn</I> bytes.
Again we have:
</P>
<TABLE ALIGN="center" BORDER="1" CELLPADDING="4">
<TR>
<TH><I>c</I></TH>
<TH>Memory consumption to store a MPHF</TH>
</TR>
<TR>
<TD>0.93</TD>
<TD ALIGN="center"><I>3.72n</I></TD>
</TR>
<TR>
<TD>1.15</TD>
<TD ALIGN="center"><I>4.60n</I></TD>
</TR>
</TABLE>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><B>Table 2:</B> Memory consumption to store a MPHF generated by the BMZ algorithm.</TD>
</TR>
</TABLE>
<HR NOSHADE SIZE=1>
<H2>Experimental Results</H2>
<P>
<A HREF="comparison.html">CHM x BMZ</A>
</P>
<HR NOSHADE SIZE=1>
<A NAME="papers"></A>
<H2>Papers</H2>
<OL>
<LI><A HREF="http://www.dcc.ufmg.br/~fbotelho">F. C. Botelho</A>, D. Menoti, <A HREF="http://www.dcc.ufmg.br/~nivio">N. Ziviani</A>. <A HREF="papers/bmz_tr004_04.ps">A New algorithm for constructing minimal perfect hash functions</A>, Technical Report TR004/04, Department of Computer Science, Federal University of Minas Gerais, 2004.
<P></P>
<LI><A HREF="http://www.dcc.ufmg.br/~fbotelho">F. C. Botelho</A>, Y. Kohayakawa, and <A HREF="http://www.dcc.ufmg.br/~nivio">N. Ziviani</A>. <A HREF="papers/wea05.pdf">A Practical Minimal Perfect Hashing Method</A>. <I>4th International Workshop on Efficient and Experimental Algorithms (WEA05),</I> Springer-Verlag Lecture Notes in Computer Science, vol. 3505, Santorini Island, Greece, May 2005, 488-500.
</OL>
<HR NOSHADE SIZE=1>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><A HREF="index.html">Home</A></TD>
<TD><A HREF="chd.html">CHD</A></TD>
<TD><A HREF="bdz.html">BDZ</A></TD>
<TD><A HREF="bmz.html">BMZ</A></TD>
<TD><A HREF="chm.html">CHM</A></TD>
<TD><A HREF="brz.html">BRZ</A></TD>
<TD><A HREF="fch.html">FCH</A></TD>
</TR>
</TABLE>
<HR NOSHADE SIZE=1>
<P>
Enjoy!
</P>
<P>
<A HREF="mailto:davi@users.sourceforge.net">Davi de Castro Reis</A>
</P>
<P>
<A HREF="mailto:db8192@users.sourceforge.net">Djamel Belazzougui</A>
</P>
<P>
<A HREF="mailto:fc_botelho@users.sourceforge.net">Fabiano Cupertino Botelho</A>
</P>
<P>
<A HREF="mailto:nivio@dcc.ufmg.br">Nivio Ziviani</A>
</P>
<!-- html code generated by txt2tags 2.6 (http://txt2tags.org) -->
<!-- cmdline: txt2tags -t html -i BMZ.t2t -o docs/bmz.html -->
</BODY></HTML>

966
deps/cmph/docs/brz.html vendored Normal file

@@ -0,0 +1,966 @@
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<META NAME="generator" CONTENT="http://txt2tags.org">
<LINK REL="stylesheet" TYPE="text/css" HREF="DOC.css">
<TITLE>External Memory Based Algorithm</TITLE>
</HEAD><BODY BGCOLOR="white" TEXT="black">
<CENTER>
<H1>External Memory Based Algorithm</H1>
</CENTER>
<HR NOSHADE SIZE=1>
<H2>Introduction</H2>
<P>
Until now, because of the limitations of current algorithms,
the use of MPHFs is restricted to scenarios where the set of keys being hashed is
relatively small.
However, in many cases it is crucial to deal in an efficient way with very large
sets of keys.
Due to the exponential growth of the Web, working with huge collections is becoming
a daily task.
For instance, simply assigning numeric identifiers to the web pages of a collection
can be a challenging task.
While traditional databases simply cannot handle more traffic once the working
set of URLs no longer fits in main memory <A HREF="#papers">[4</A>], the algorithm we propose here to
construct MPHFs can easily scale to billions of entries.
</P>
<P>
As there are many applications for MPHFs, it is
important to design and implement space and time efficient algorithms for
constructing such functions.
The attractiveness of using MPHFs depends on the following issues:
</P>
<OL>
<LI>The amount of CPU time required by the algorithms for constructing MPHFs.
<P></P>
<LI>The space requirements of the algorithms for constructing MPHFs.
<P></P>
<LI>The amount of CPU time required by a MPHF for each retrieval.
<P></P>
<LI>The space requirements of the description of the resulting MPHFs to be used at retrieval time.
</OL>
<P>
We present here a novel external memory based algorithm for constructing MPHFs that
is very efficient with respect to the four requirements listed above.
First, the time to construct a MPHF is linear in the size of the key set,
which is optimal.
For instance, for a collection of 1 billion URLs
collected from the web, each one 64 characters long on average, the time to construct a
MPHF using a 2.4 gigahertz PC with 500 megabytes of available main memory
is approximately 3 hours.
Second, the algorithm needs a small a priori defined vector of <IMG ALIGN="middle" SRC="figs/brz/img23.png" BORDER="0" ALT=""> one
byte entries in main memory to construct a MPHF.
For the collection of 1 billion URLs and using <IMG ALIGN="middle" SRC="figs/brz/img4.png" BORDER="0" ALT="">, the algorithm needs only
5.45 megabytes of internal memory.
Third, the evaluation of the MPHF for each retrieval requires three memory accesses and
the computation of three universal hash functions.
This is not optimal as any MPHF requires at least one memory access and the computation
of two universal hash functions.
Fourth, the description of a MPHF takes a constant number of bits for each key, which is optimal.
For the collection of 1 billion URLs, it needs 8.1 bits for each key,
while the theoretical lower bound is <IMG ALIGN="middle" SRC="figs/brz/img24.png" BORDER="0" ALT=""> bits per key.
</P>
<HR NOSHADE SIZE=1>
<H2>The Algorithm</H2>
<P>
The main idea supporting our algorithm is the classical divide and conquer technique.
The algorithm is a two-step external memory based algorithm
that generates a MPHF <I>h</I> for a set <I>S</I> of <I>n</I> keys.
Figure 1 illustrates the two steps of the
algorithm: the partitioning step and the searching step.
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><IMG ALIGN="middle" SRC="figs/brz/brz.png" BORDER="0" ALT=""></TD>
</TR>
<TR>
<TD><B>Figure 1:</B> Main steps of our algorithm.</TD>
</TR>
</TABLE>
<P>
The partitioning step takes a key set <I>S</I> and uses a universal hash
function <IMG ALIGN="middle" SRC="figs/brz/img42.png" BORDER="0" ALT=""> proposed by Jenkins<A HREF="#papers">[5</A>]
to transform each key <IMG ALIGN="middle" SRC="figs/brz/img43.png" BORDER="0" ALT=""> into an integer <IMG ALIGN="middle" SRC="figs/brz/img44.png" BORDER="0" ALT="">.
Reducing <IMG ALIGN="middle" SRC="figs/brz/img44.png" BORDER="0" ALT=""> modulo <IMG ALIGN="middle" SRC="figs/brz/img23.png" BORDER="0" ALT="">, we partition <I>S</I>
into <IMG ALIGN="middle" SRC="figs/brz/img23.png" BORDER="0" ALT=""> buckets containing at most 256 keys in each bucket (with high
probability).
</P>
<P>
The searching step generates a MPHF<IMG ALIGN="middle" SRC="figs/brz/img46.png" BORDER="0" ALT=""> for each bucket <I>i</I>, <IMG ALIGN="middle" SRC="figs/brz/img47.png" BORDER="0" ALT="">.
The resulting MPHF <I>h(k)</I>, <IMG ALIGN="middle" SRC="figs/brz/img43.png" BORDER="0" ALT="">, is given by
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><IMG ALIGN="middle" SRC="figs/brz/img49.png" BORDER="0" ALT=""></TD>
</TR>
</TABLE>
<P>
where <IMG ALIGN="middle" SRC="figs/brz/img50.png" BORDER="0" ALT="">.
The <I>i</I>th entry <I>offset[i]</I> of the displacement vector
<I>offset</I>, <IMG ALIGN="middle" SRC="figs/brz/img47.png" BORDER="0" ALT="">, contains the total number
of keys in the buckets from 0 to <I>i-1</I>, that is, it gives the interval of the
keys in the hash table addressed by the MPHF<IMG ALIGN="middle" SRC="figs/brz/img46.png" BORDER="0" ALT="">. In the following we explain
each step in detail.
</P>
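<P>
The way the per-bucket functions and the <I>offset</I> vector combine into one MPHF can be
sketched as follows. This is a toy illustration, not the CMPH implementation:
<CODE>zlib.crc32</CODE> stands in for the Jenkins hash, and each per-bucket MPHF is replaced by a
plain dictionary that maps a key to its position inside the bucket. The names
<CODE>h0</CODE>, <CODE>build</CODE> and <CODE>h</CODE> are hypothetical.
</P>
<PRE>
```python
import zlib
from itertools import accumulate

def h0(key: bytes) -> int:
    # Stand-in for the Jenkins hash (assumption: any well-mixing hash
    # shows the structure equally well).
    return zlib.crc32(key)

def build(keys, num_buckets):
    # Partitioning: key k goes to bucket h0(k) mod num_buckets.
    buckets = [[] for _ in range(num_buckets)]
    for k in keys:
        buckets[h0(k) % num_buckets].append(k)
    # Toy per-bucket MPHF: position of the key inside its bucket.
    # The real algorithm stores a compact BMZ8 description instead.
    local = [{k: pos for pos, k in enumerate(bkt)} for bkt in buckets]
    sizes = [len(bkt) for bkt in buckets]
    # offset[i] = number of keys in buckets 0 .. i-1.
    offset = list(accumulate(sizes, initial=0))[:-1]
    return local, offset

def h(key, local, offset):
    # Resulting MPHF: local value inside bucket i plus the displacement of bucket i.
    i = h0(key) % len(local)
    return local[i][key] + offset[i]

keys = [f"http://example.com/{n}".encode() for n in range(1000)]
local, offset = build(keys, num_buckets=8)
values = sorted(h(k, local, offset) for k in keys)
```
</PRE>
<P>
Because every bucket occupies its own interval of the hash table, the values
produced for the 1000 distinct keys are exactly the integers 0 to 999.
</P>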
<HR NOSHADE SIZE=1>
<H3>Partitioning step</H3>
<P>
The set <I>S</I> of <I>n</I> keys is partitioned into <IMG ALIGN="middle" SRC="figs/brz/img23.png" BORDER="0" ALT=""> buckets,
where <I>b</I> is a suitable parameter chosen to guarantee
that each bucket has at most 256 keys with high probability
(see <A HREF="#papers">[2</A>] for details).
The partitioning step works as follows:
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><IMG ALIGN="middle" SRC="figs/brz/img54.png" BORDER="0" ALT=""></TD>
</TR>
<TR>
<TD><B>Figure 2:</B> Partitioning step.</TD>
</TR>
</TABLE>
<P>
Statement 1.1 of the <B>for</B> loop presented in Figure 2
reads sequentially all the keys of block <IMG ALIGN="middle" SRC="figs/brz/img55.png" BORDER="0" ALT=""> from disk into an internal area
of size <IMG ALIGN="middle" SRC="figs/brz/img8.png" BORDER="0" ALT="">.
</P>
<P>
Statement 1.2 performs an indirect bucket sort of the keys in block <IMG ALIGN="middle" SRC="figs/brz/img55.png" BORDER="0" ALT=""> and
at the same time updates the entries in the vector <I>size</I>.
Let us briefly describe how <IMG ALIGN="middle" SRC="figs/brz/img55.png" BORDER="0" ALT=""> is partitioned among
the <IMG ALIGN="middle" SRC="figs/brz/img23.png" BORDER="0" ALT=""> buckets.
We use a local array of <IMG ALIGN="middle" SRC="figs/brz/img23.png" BORDER="0" ALT=""> counters to store a
count of how many keys from <IMG ALIGN="middle" SRC="figs/brz/img55.png" BORDER="0" ALT=""> belong to each bucket.
The pointers to the keys in each bucket <I>i</I>, <IMG ALIGN="middle" SRC="figs/brz/img47.png" BORDER="0" ALT="">,
are stored in contiguous positions in an array.
For this we first reserve the required number of entries
in this array of pointers using the information from the array of counters.
Next, we place the pointers to the keys in each bucket into the respective
reserved areas in the array (i.e., we place the pointers to the keys in bucket 0,
followed by the pointers to the keys in bucket 1, and so on).
</P>
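<P>
The indirect bucket sort described above is essentially a counting sort over pointers.
A minimal sketch, assuming the block fits in a Python list and using list indices as
"pointers"; <CODE>zlib.crc32</CODE> is a stand-in for the hash function, and the function names are
hypothetical:
</P>
<PRE>
```python
import zlib

def bucket_of(key: bytes, num_buckets: int) -> int:
    # Stand-in bucket address (assumption: crc32 replaces the Jenkins hash).
    return zlib.crc32(key) % num_buckets

def indirect_bucket_sort(block, num_buckets, size):
    # Local counters: how many keys of this block belong to each bucket.
    counts = [0] * num_buckets
    for key in block:
        counts[bucket_of(key, num_buckets)] += 1

    # Reserve a contiguous area per bucket in the pointer array.
    start = [0] * num_buckets
    for i in range(1, num_buckets):
        start[i] = start[i - 1] + counts[i - 1]

    # Place the pointers: bucket 0 first, then bucket 1, and so on.
    pointers = [0] * len(block)
    next_free = start[:]
    for idx, key in enumerate(block):
        i = bucket_of(key, num_buckets)
        pointers[next_free[i]] = idx
        next_free[i] += 1

    # Update the global vector of bucket sizes at the same time.
    for i in range(num_buckets):
        size[i] += counts[i]
    return pointers

block = [f"key-{n}".encode() for n in range(50)]
size = [0] * 4
pointers = indirect_bucket_sort(block, 4, size)
addresses = [bucket_of(block[p], 4) for p in pointers]
```
</PRE>
<P>
The keys themselves are never moved; only the pointer array is ordered by bucket address.
</P>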
<P>
To find the bucket address of a given key
we use the universal hash function <IMG ALIGN="middle" SRC="figs/brz/img44.png" BORDER="0" ALT=""><A HREF="#papers">[5</A>].
Key <I>k</I> goes into bucket <I>i</I>, where
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><IMG ALIGN="middle" SRC="figs/brz/img57.png" BORDER="0" ALT=""> (1)</TD>
</TR>
</TABLE>
<P>
Figure 3(a) shows a <I>logical</I> view of the <IMG ALIGN="middle" SRC="figs/brz/img23.png" BORDER="0" ALT=""> buckets
generated in the partitioning step.
In reality, the keys belonging to each bucket are distributed among many files,
as depicted in Figure 3(b).
In the example of Figure 3(b), the keys in bucket 0
appear in files 1 and <I>N</I>, the keys in bucket 1 appear in files 1, 2
and <I>N</I>, and so on.
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><IMG ALIGN="middle" SRC="figs/brz/brz-partitioning.png" BORDER="0" ALT=""></TD>
</TR>
<TR>
<TD><B>Figure 3:</B> Situation of the buckets at the end of the partitioning step: (a) Logical view (b) Physical view.</TD>
</TR>
</TABLE>
<P>
This scattering of the keys in the buckets could generate a performance
problem because of the potential number of seeks
needed to read the keys in each bucket from the <I>N</I> files on disk
during the searching step.
But, as we show in <A HREF="#papers">[2</A>], the number of seeks
can be kept small using buffering techniques.
Considering that only the vector <I>size</I>, which has <IMG ALIGN="middle" SRC="figs/brz/img23.png" BORDER="0" ALT=""> one-byte
entries (remember that each bucket has at most 256 keys),
must be maintained in main memory during the searching step,
almost all main memory is available to be used as a disk I/O buffer.
</P>
<P>
The last step is to compute the <I>offset</I> vector and dump it to the disk.
We use the vector <I>size</I> to compute the
<I>offset</I> displacement vector.
The <I>offset[i]</I> entry contains the number of keys
in the buckets <I>0, 1, ..., i-1</I>.
As <I>size[i]</I> stores the number of keys
in bucket <I>i</I>, where <IMG ALIGN="middle" SRC="figs/brz/img47.png" BORDER="0" ALT="">, we have
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><IMG ALIGN="middle" SRC="figs/brz/img63.png" BORDER="0" ALT=""></TD>
</TR>
</TABLE>
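<P>
In other words, <I>offset</I> is the exclusive prefix sum of <I>size</I>. A short sketch
(the function name <CODE>compute_offset</CODE> is hypothetical):
</P>
<PRE>
```python
from itertools import accumulate

def compute_offset(size):
    # offset[i] = size[0] + ... + size[i-1], with offset[0] = 0.
    return list(accumulate(size, initial=0))[:-1]

offset = compute_offset([3, 0, 5, 2])
print(offset)
```
</PRE>
<P>
For bucket sizes 3, 0, 5 and 2 the buckets start at positions 0, 3, 3 and 8 of the hash table.
</P>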
<HR NOSHADE SIZE=1>
<H3>Searching step</H3>
<P>
The searching step is responsible for generating a MPHF for each
bucket. Figure 4 presents the searching step algorithm.
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><IMG ALIGN="middle" SRC="figs/brz/img64.png" BORDER="0" ALT=""></TD>
</TR>
<TR>
<TD><B>Figure 4:</B> Searching step.</TD>
</TR>
</TABLE>
<P>
Statement 1 of Figure 4 inserts one key from each file
in a minimum heap <I>H</I> of size <I>N</I>.
The order relation in <I>H</I> is defined by the bucket address <I>i</I> computed with
Eq. (1).
</P>
<P>
Statement 2 has two important steps.
In statement 2.1, a bucket is read from disk,
as described below.
In statement 2.2, a MPHF is generated for each bucket <I>i</I>, as described
in the following.
The description of MPHF<IMG ALIGN="middle" SRC="figs/brz/img46.png" BORDER="0" ALT=""> is a vector <IMG ALIGN="middle" SRC="figs/brz/img66.png" BORDER="0" ALT=""> of 8-bit integers.
Finally, statement 2.3 writes the description <IMG ALIGN="middle" SRC="figs/brz/img66.png" BORDER="0" ALT=""> of MPHF<IMG ALIGN="middle" SRC="figs/brz/img46.png" BORDER="0" ALT=""> to disk.
</P>
<HR NOSHADE SIZE=1>
<H4>Reading a bucket from disk</H4>
<P>
In this section we present the refinement of statement 2.1 of
Figure 4.
The algorithm to read bucket <I>i</I> from disk is presented
in Figure 5.
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><IMG ALIGN="middle" SRC="figs/brz/img67.png" BORDER="0" ALT=""></TD>
</TR>
<TR>
<TD><B>Figure 5:</B> Reading a bucket.</TD>
</TR>
</TABLE>
<P>
Bucket <I>i</I> is distributed among many files and the heap <I>H</I> is used to drive a
multiway merge operation.
In Figure 5, statement 1.1 extracts and removes triple
<I>(i, j, k)</I> from <I>H</I>, where <I>i</I> is a minimum value in <I>H</I>.
Statement 1.2 inserts key <I>k</I> in bucket <I>i</I>.
Notice that the <I>k</I> in the triple <I>(i, j, k)</I> is in fact a pointer to
the first byte of the key that is kept in contiguous positions of an array of characters
(this array containing the keys is initialized during the heap construction
in statement 1 of Figure 4).
Statement 1.3 performs a seek operation in File <I>j</I> on disk for the first
read operation and reads sequentially all keys <I>k</I> that have the same <I>i</I>
and inserts them all in bucket <I>i</I>.
Finally, statement 1.4 inserts in <I>H</I> the triple <I>(i, j, x)</I>,
where <I>x</I> is the first key read from File <I>j</I> (in statement 1.3)
that does not have the same bucket address as the previous keys.
</P>
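<P>
The heap-driven multiway merge can be sketched as below. This is an in-memory
illustration only: each "file" is modelled as a list of <I>(bucket address, key)</I> pairs
already ordered by bucket address, and seeks and buffering are not modelled.
The function name <CODE>read_buckets</CODE> is hypothetical.
</P>
<PRE>
```python
import heapq

def read_buckets(files):
    """Yield (i, keys) for each non-empty bucket, in increasing order of i."""
    pos = [0] * len(files)          # next unread entry of each file
    heap = []
    for j, f in enumerate(files):   # heap construction: one key per file
        if f:
            i, k = f[0]
            heap.append((i, j, k))
            pos[j] = 1
    heapq.heapify(heap)

    while heap:
        i = heap[0][0]              # smallest bucket address in the heap
        bucket = []
        while heap and heap[0][0] == i:
            _, j, k = heapq.heappop(heap)   # 1.1: extract triple (i, j, k)
            bucket.append(k)                # 1.2: insert k in bucket i
            f = files[j]
            # 1.3: read sequentially the keys of file j with the same i.
            while pos[j] != len(f) and f[pos[j]][0] == i:
                bucket.append(f[pos[j]][1])
                pos[j] += 1
            # 1.4: push the first key of file j with a different address.
            if pos[j] != len(f):
                next_i, x = f[pos[j]]
                heapq.heappush(heap, (next_i, j, x))
                pos[j] += 1
        yield i, bucket

files = [
    [(0, "a"), (1, "b"), (3, "c")],
    [(1, "d"), (1, "e"), (2, "f")],
    [(0, "g"), (3, "h")],
]
merged = [(i, sorted(keys)) for i, keys in read_buckets(files)]
```
</PRE>
<P>
At most one triple per file is in the heap at any time, so the heap never holds more than <I>N</I> entries.
</P>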
<P>
The number of seek operations on disk performed in statement 1.3 is discussed
in <A HREF="#papers">[2, Section 5.1</A>],
where we present a buffering technique that brings down
the time spent with seeks.
</P>
<HR NOSHADE SIZE=1>
<H4>Generating a MPHF for each bucket</H4>
<P>
To the best of our knowledge the <A HREF="bmz.html">BMZ algorithm</A> we have designed in
our previous works <A HREF="#papers">[1,3</A>] is the fastest published algorithm for
constructing MPHFs.
That is why we are using that algorithm as a building block for the
algorithm presented here. In practice, we use
an optimized version of BMZ (BMZ8) for small sets of keys (at most 256 keys).
<A HREF="bmz.html">Click here to see details about BMZ algorithm</A>.
</P>
<HR NOSHADE SIZE=1>
<H2>Analysis of the Algorithm</H2>
<P>
Analytical results and the complete analysis of the external memory based algorithm
can be found in <A HREF="#papers">[2</A>].
</P>
<HR NOSHADE SIZE=1>
<H2>Experimental Results</H2>
<P>
In this section we present the experimental results.
We start by presenting the experimental setup.
We then present experimental results for
the internal memory based algorithm (<A HREF="bmz.html">the BMZ algorithm</A>)
and for our external memory based algorithm.
Finally, we discuss how the amount of internal memory available
affects the runtime of the external memory based algorithm.
</P>
<HR NOSHADE SIZE=1>
<H3>The data and the experimental setup</H3>
<P>
All experiments were carried out on
a computer running the Linux operating system, version 2.6,
with a 2.4 gigahertz processor and
1 gigabyte of main memory.
In the experiments related to the new
algorithm we limited the main memory to 500 megabytes.
</P>
<P>
Our data consists of a collection of 1 billion
URLs collected from the Web, each URL 64 characters long on average.
The collection is stored on disk in 60.5 gigabytes.
</P>
<HR NOSHADE SIZE=1>
<H3>Performance of the BMZ Algorithm</H3>
<P>
<A HREF="bmz.html">The BMZ algorithm</A> is used for constructing a MPHF for each bucket.
It is a randomized algorithm because it needs to generate a simple random graph
in its first step.
Once the graph is obtained the other two steps are deterministic.
</P>
<P>
Thus, we can consider the runtime of the algorithm to have
the form <IMG ALIGN="middle" SRC="figs/brz/img159.png" BORDER="0" ALT=""> for an input of <I>n</I> keys,
where <IMG ALIGN="middle" SRC="figs/brz/img160.png" BORDER="0" ALT=""> is some machine dependent
constant that further depends on the length of the keys and <I>Z</I> is a random
variable with geometric distribution with mean <IMG ALIGN="middle" SRC="figs/brz/img162.png" BORDER="0" ALT="">. All results
in our experiments were obtained taking <I>c=1</I>; the value of <I>c</I>, with <I>c</I> in <I>[0.93,1.15]</I>,
in fact has little influence on the runtime, as shown in <A HREF="#papers">[3</A>].
</P>
<P>
The values chosen for <I>n</I> were 1, 2, 4, 8, 16 and 32 million.
Although we have a dataset with 1 billion URLs, on a PC with
1 gigabyte of main memory, the algorithm is able
to handle an input with at most 32 million keys.
This is mainly because of the graph we need to keep in main memory.
The algorithm requires <I>25n + O(1)</I> bytes for constructing
a MPHF (<A HREF="bmz.html">click here to get details about the data structures used by the BMZ algorithm</A>).
</P>
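<P>
The 32 million key limit follows directly from the <I>25n + O(1)</I> bound. A rough check,
ignoring the <I>O(1)</I> term and reading 1 gigabyte as 2<SUP>30</SUP> bytes (both assumptions);
the function name is hypothetical:
</P>
<PRE>
```python
BYTES_PER_KEY = 25   # from the 25n + O(1) bound in the text
GIB = 2**30          # assumption: 1 gigabyte = 2**30 bytes

def bmz_memory_gib(n_keys):
    # Approximate internal memory needed by BMZ, constant term ignored.
    return BYTES_PER_KEY * n_keys / GIB

for n in (32_000_000, 64_000_000, 1_000_000_000):
    print(f"n = {n:>13,}: {bmz_memory_gib(n):6.2f} GB")
```
</PRE>
<P>
With 32 million keys the estimate stays below 1 gigabyte, while doubling <I>n</I> already exceeds it.
</P>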
<P>
In order to estimate the number of trials for each value of <I>n</I> we use
a statistical method for determining a suitable sample size (see, e.g., <A HREF="#papers">[6, Chapter 13</A>]).
As we obtained different values for each <I>n</I>,
we used the maximal value obtained, namely 300 trials, in order to have
a confidence level of 95 %.
</P>
<P>
Table 1 presents the runtime average for each <I>n</I>,
the respective standard deviations, and
the respective confidence intervals given by
the average time <IMG ALIGN="middle" SRC="figs/brz/img167.png" BORDER="0" ALT=""> the distance from average time
considering a confidence level of 95 %.
Observing the runtime averages one sees that
the algorithm runs in expected linear time,
as shown in <A HREF="#papers">[3</A>].
</P>
<TABLE CELLPADDING=3 BORDER="1" ALIGN="CENTER">
<TR><TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE">
<SPAN CLASS="MATH"><IMG
WIDTH="14" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img5.png"
ALT="$n$"></SPAN> (millions) </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 1 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 2 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 4 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 8 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 16 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 32 </SMALL></TD>
<TD></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE">
Average time (s)</SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="64" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img168.png"
ALT="$6.1 \pm 0.3$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="72" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img169.png"
ALT="$12.2 \pm 0.6$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="72" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img170.png"
ALT="$25.4 \pm 1.1$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="72" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img171.png"
ALT="$51.4 \pm 2.0$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="80" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img172.png"
ALT="$117.3 \pm 4.4$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="80" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img173.png"
ALT="$262.2 \pm 8.7$"></SPAN></SMALL></TD>
<TD></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE">
SD (s) </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img174.png"
ALT="$2.6$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img175.png"
ALT="$5.4$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img176.png"
ALT="$9.8$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img177.png"
ALT="$17.6$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img178.png"
ALT="$37.3$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img179.png"
ALT="$76.3$"></SPAN> </SMALL></TD>
<TD></TD>
</TR>
</TABLE>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><B>Table 1:</B> Internal memory based algorithm: average time in seconds for constructing a MPHF, the standard deviation (SD), and the confidence intervals considering a confidence level of 95 %.</TD>
</TR>
</TABLE>
<P>
Figure 6 presents the runtime for each trial. In addition,
the solid line corresponds to a linear regression model
obtained from the experimental measurements.
As we can see, the runtime for a given <I>n</I> has a considerable
fluctuation. However, the fluctuation also grows linearly with <I>n</I>.
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><IMG ALIGN="middle" SRC="figs/brz/bmz_temporegressao.png" BORDER="0" ALT=""></TD>
</TR>
<TR>
<TD><B>Figure 6:</B> Time versus number of keys in <I>S</I> for the internal memory based algorithm. The solid line corresponds to a linear regression model.</TD>
</TR>
</TABLE>
<P>
The observed fluctuation in the runtimes is as expected; recall that this
runtime has the form <IMG ALIGN="middle" SRC="figs/brz/img159.png" BORDER="0" ALT=""> with <I>Z</I> a geometric random variable with
mean <I>1/p=e</I>. Thus, the runtime has mean <IMG ALIGN="middle" SRC="figs/brz/img181.png" BORDER="0" ALT=""> and standard
deviation <IMG ALIGN="middle" SRC="figs/brz/img182.png" BORDER="0" ALT="">.
Therefore, the standard deviation also grows
linearly with <I>n</I>, as experimentally verified
in Table 1 and in Figure 6.
</P>
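<P>
The constants behind this statement can be worked out numerically. The sketch below
assumes that <I>Z</I> counts the trials up to and including the first success, so that a
geometric variable with success probability <I>p</I> has mean <I>1/p</I> and variance
<I>(1-p)/p<SUP>2</SUP></I>; the name <CODE>alpha</CODE> for the machine dependent constant is hypothetical.
</P>
<PRE>
```python
import math

p = 1 / math.e                  # success probability, so that 1/p = e
mean_Z = 1 / p                  # mean of the geometric variable Z
sd_Z = math.sqrt(1 - p) / p     # standard deviation of Z

def runtime_stats(alpha, n):
    # Mean and standard deviation of a runtime of the form alpha * n * Z.
    return alpha * n * mean_Z, alpha * n * sd_Z

mean_1, sd_1 = runtime_stats(alpha=1.0, n=1_000_000)
mean_2, sd_2 = runtime_stats(alpha=1.0, n=2_000_000)
print(f"E[Z] = {mean_Z:.4f}, SD[Z] = {sd_Z:.4f}, SD ratio for 2n vs n = {sd_2 / sd_1:.1f}")
```
</PRE>
<P>
Under this model both the mean and the standard deviation scale with <I>n</I>, so doubling <I>n</I> doubles the expected fluctuation.
</P>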
<HR NOSHADE SIZE=1>
<H3>Performance of the External Memory Based Algorithm</H3>
<P>
The runtime of the external memory based algorithm is also a random variable,
but now it follows a (highly concentrated) normal distribution, as we discuss at the end of this
section. Again, we are interested in verifying the linearity claim made in
<A HREF="#papers">[2, Section 5.1</A>]. Therefore, we ran the algorithm for
several numbers <I>n</I> of keys in <I>S</I>.
</P>
<P>
The values chosen for <I>n</I> were 1, 2, 4, 8, 16, 32, 64, 128, 512 and 1000
million.
We limited the main memory to 500 megabytes for the experiments.
The size <IMG ALIGN="middle" SRC="figs/brz/img8.png" BORDER="0" ALT=""> of the a priori reserved internal memory area
was set to 250 megabytes, the parameter <I>b</I> was set to <I>175</I> and
the building block algorithm parameter <I>c</I> was again set to <I>1</I>.
We show later on how <IMG ALIGN="middle" SRC="figs/brz/img8.png" BORDER="0" ALT=""> affects the runtime of the algorithm. The other two parameters
have insignificant influence on the runtime.
</P>
<P>
We again use a statistical method for determining a suitable sample size
to estimate the number of trials to be run for each value of <I>n</I>. We found that
just one trial for each <I>n</I> would be enough for a confidence level of 95 %.
However, we made 10 trials. This number of trials seems rather small, but, as
shown below, the behavior of our algorithm is very stable and its runtime is
almost deterministic (i.e., the standard deviation is very small).
</P>
<P>
Table 2 presents the runtime average for each <I>n</I>,
the respective standard deviations, and
the respective confidence intervals given by
the average time <IMG ALIGN="middle" SRC="figs/brz/img167.png" BORDER="0" ALT=""> the distance from average time
considering a confidence level of 95 %.
Observing the runtime averages we noticed that
the algorithm runs in expected linear time,
as shown in <A HREF="#papers">[2, Section 5.1</A>]. Better still,
it is only approximately 60 % slower than the BMZ algorithm.
To get that value we used the linear regression model obtained for the runtime of
the internal memory based algorithm to estimate how much time it would require
for constructing a MPHF for a set of 1 billion keys.
We got 2.3 hours for the internal memory based algorithm and we measured
3.67 hours on average for the external memory based algorithm.
Increasing the size of the internal memory area
from 250 to 600 megabytes,
we brought the time down to 3.09 hours. In this setup, the external memory based algorithm is
just 34 % slower.
</P>
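<P>
The percentages quoted above follow from the reported times. A small check, using only
numbers given in this section and in Table 2 (the function name <CODE>slowdown_pct</CODE> is hypothetical):
</P>
<PRE>
```python
bmz_hours = 2.3          # regression estimate for 1 billion keys (from the text)
brz_hours_250 = 3.67     # measured with a 250 megabyte internal area
brz_hours_600 = 3.09     # measured with a 600 megabyte internal area
table2_seconds = 13229.5 # Table 2 entry for n = 1000 million

def slowdown_pct(external_hours, internal_hours):
    # Relative slowdown of the external memory based algorithm, in percent.
    return (external_hours / internal_hours - 1) * 100

print(f"{slowdown_pct(brz_hours_250, bmz_hours):.1f} %")
print(f"{slowdown_pct(brz_hours_600, bmz_hours):.1f} %")
print(f"{table2_seconds / 3600:.2f} hours")
```
</PRE>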
<TABLE CELLPADDING=3 BORDER="1" ALIGN="CENTER">
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
<SPAN CLASS="MATH"><IMG
WIDTH="14" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img5.png"
ALT="$n$"></SPAN> (millions) </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 1 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 2 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 4 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 8 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 16 </SMALL></TD>
</TR>
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
Average time (s) </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="64" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img187.png"
ALT="$6.9 \pm 0.3$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="72" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img188.png"
ALT="$13.8 \pm 0.2$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="72" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img189.png"
ALT="$31.9 \pm 0.7$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="72" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img190.png"
ALT="$69.9 \pm 1.1$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="80" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img191.png"
ALT="$140.6 \pm 2.5$"></SPAN> </SMALL></TD>
</TR>
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
SD (s) </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img192.png"
ALT="$0.4$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img193.png"
ALT="$0.2$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img194.png"
ALT="$0.9$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img195.png"
ALT="$1.5$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img196.png"
ALT="$3.5$"></SPAN> </SMALL></TD>
</TR>
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
<SPAN CLASS="MATH"><IMG
WIDTH="14" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img5.png"
ALT="$n$"></SPAN> (millions) </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 32 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 64 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 128 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 512 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 1000 </SMALL></TD>
</TR>
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
Average time (s) </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="80" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img197.png"
ALT="$284.3 \pm 1.1$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="80" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img198.png"
ALT="$587.9 \pm 3.9$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <!-- MATH
$1223.6 \pm 4.9$
-->
<SPAN CLASS="MATH"><IMG
WIDTH="88" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img199.png"
ALT="$1223.6 \pm 4.9$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <!-- MATH
$5966.4 \pm 9.5$
-->
<SPAN CLASS="MATH"><IMG
WIDTH="88" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img200.png"
ALT="$5966.4 \pm 9.5$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <!-- MATH
$13229.5 \pm 12.7$
-->
<SPAN CLASS="MATH"><IMG
WIDTH="104" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img201.png"
ALT="$13229.5 \pm 12.7$"></SPAN> </SMALL></TD>
</TR>
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
SD (s) </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img202.png"
ALT="$1.6$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img203.png"
ALT="$5.5$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img204.png"
ALT="$6.8$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img205.png"
ALT="$13.2$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img206.png"
ALT="$18.6$"></SPAN> </SMALL></TD>
</TR>
</TABLE>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><B>Table 2:</B> The external memory based algorithm: average time in seconds for constructing a MPHF, the standard deviation (SD), and the confidence intervals considering a confidence level of 95 %.</TD>
</TR>
</TABLE>
<P>
Figure 7 presents the runtime for each trial. In addition,
the solid line corresponds to a linear regression model
obtained from the experimental measurements.
As expected, the runtime for a given <I>n</I> has almost no
variation.
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><IMG ALIGN="middle" SRC="figs/brz/brz_temporegressao.png" BORDER="0" ALT=""></TD>
</TR>
<TR>
<TD><B>Figure 7:</B> Time versus number of keys in <I>S</I> for our algorithm. The solid line corresponds to a linear regression model.</TD>
</TR>
</TABLE>
<P>
An intriguing observation is that the runtime of the algorithm is almost
deterministic, in spite of the fact that it uses as building block an
algorithm with a considerable fluctuation in its runtime. A given bucket
<I>i</I>, <IMG ALIGN="middle" SRC="figs/brz/img47.png" BORDER="0" ALT="">, is a small set of keys (at most 256 keys) and,
as argued in the previous section, the runtime of the
building block algorithm is a random variable <IMG ALIGN="middle" SRC="figs/brz/img207.png" BORDER="0" ALT=""> with high fluctuation.
However, the runtime <I>Y</I> of the searching step of the external memory based algorithm is given
by <IMG ALIGN="middle" SRC="figs/brz/img209.png" BORDER="0" ALT="">. Under the hypothesis that
the <IMG ALIGN="middle" SRC="figs/brz/img207.png" BORDER="0" ALT=""> are independent and bounded, the <I>law of large numbers</I> (see,
e.g., <A HREF="#papers">[6</A>]) implies that the random variable <IMG ALIGN="middle" SRC="figs/brz/img210.png" BORDER="0" ALT=""> converges
to a constant as <IMG ALIGN="middle" SRC="figs/brz/img83.png" BORDER="0" ALT="">. This explains why the runtime of our
algorithm is almost deterministic.
</P>
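<P>
The concentration effect is easy to reproduce in a small simulation. The sketch below is
an illustration under simplifying assumptions: each per-bucket runtime is modelled as an
independent geometric variable with success probability <I>1/e</I> (which is not bounded, unlike
the hypothesis above), and the relative standard deviation of the sum is estimated from
200 repetitions. All names are hypothetical.
</P>
<PRE>
```python
import math
import random
import statistics

def geometric(rng, p):
    # Number of trials up to and including the first success.
    trials = 1
    while rng.random() >= p:
        trials += 1
    return trials

def relative_sd_of_sum(m, repetitions=200, seed=0):
    # Relative standard deviation of the sum of m per-bucket runtimes.
    rng = random.Random(seed)
    p = 1 / math.e
    totals = [sum(geometric(rng, p) for _ in range(m))
              for _ in range(repetitions)]
    return statistics.pstdev(totals) / statistics.mean(totals)

cv_small = relative_sd_of_sum(10)      # few buckets: large relative fluctuation
cv_large = relative_sd_of_sum(1000)    # many buckets: small relative fluctuation
print(f"m = 10: {cv_small:.3f}   m = 1000: {cv_large:.3f}")
```
</PRE>
<P>
The relative fluctuation shrinks roughly like one over the square root of the number of buckets, and the
experiments above involve millions of buckets.
</P>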
<HR NOSHADE SIZE=1>
<H3>Controlling disk accesses</H3>
<P>
In order to bring down the number of seek operations on disk
we benefit from the fact that our algorithm leaves almost all main
memory available to be used as a disk I/O buffer.
In this section we evaluate how much the parameter <IMG ALIGN="middle" SRC="figs/brz/img8.png" BORDER="0" ALT=""> affects the runtime of our algorithm.
For that we fixed <I>n</I> at 1 billion URLs,
set the main memory of the machine used for the experiments
to 1 gigabyte and used <IMG ALIGN="middle" SRC="figs/brz/img8.png" BORDER="0" ALT=""> equal to 100, 200, 300, 400, 500 and 600
megabytes.
</P>
<P>
Table 3 presents the number of files <I>N</I>,
the buffer size used for all files, the number of seeks in the worst case considering
the pessimistic assumption mentioned in <A HREF="#papers">[2, Section 5.1</A>], and
the time to generate a MPHF for 1 billion keys as a function of the amount of internal
memory available. Observing Table 3 we noticed that the time spent in the construction
decreases as the value of <IMG ALIGN="middle" SRC="figs/brz/img8.png" BORDER="0" ALT=""> increases. However, for <IMG ALIGN="middle" SRC="figs/brz/img213.png" BORDER="0" ALT="">, the variation
on the time is not as significant as for <IMG ALIGN="middle" SRC="figs/brz/img214.png" BORDER="0" ALT="">.
This can be explained by the fact that the I/O scheduler of the Linux 2.6 kernel
has smart policies for avoiding seeks and reducing the average seek time
(see <A HREF="http://www.linuxjournal.com/article/6931">http://www.linuxjournal.com/article/6931</A>).
</P>
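<P>
The buffer sizes listed in Table 3 are consistent with splitting the internal memory area
evenly among the <I>N</I> files. The sketch below assumes exactly that, with 1 megabyte taken as
1024 kilobytes; it is a reading of the table, not a description of the implementation, and
the function name <CODE>buffer_kb</CODE> is hypothetical.
</P>
<PRE>
```python
# Internal memory area in MB -> number of files N (values from Table 3).
files_per_area = {100: 619, 200: 310, 300: 207, 400: 155, 500: 124, 600: 104}

def buffer_kb(area_mb, n_files):
    # Assumption: the area is divided evenly among the N files.
    return round(area_mb * 1024 / n_files)

buffer_sizes = {area: buffer_kb(area, n) for area, n in files_per_area.items()}
for area, kb in buffer_sizes.items():
    print(f"{area} MB over {files_per_area[area]} files: {kb} KB per file")
```
</PRE>
<P>
Larger buffers mean that each seek is followed by a longer sequential read, which is why the
worst-case number of seeks falls as the internal memory area grows.
</P>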
<TABLE CELLPADDING=3 BORDER="1" ALIGN="center">
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
<SPAN CLASS="MATH"><IMG
WIDTH="14" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img8.png"
ALT="$\mu $"></SPAN> (MB) </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img215.png"
ALT="$100$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img216.png"
ALT="$200$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img217.png"
ALT="$300$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img218.png"
ALT="$400$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img219.png"
ALT="$500$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img212.png"
ALT="$600$"></SPAN> </SMALL></TD>
</TR>
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
<SPAN CLASS="MATH"><IMG
WIDTH="19" HEIGHT="14" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img58.png"
ALT="$N$"></SPAN> (files) </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img220.png"
ALT="$619$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img221.png"
ALT="$310$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img222.png"
ALT="$207$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img223.png"
ALT="$155$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img224.png"
ALT="$124$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img225.png"
ALT="$104$"></SPAN> </SMALL></TD>
</TR>
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
&nbsp;(buffer size in KB) </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img226.png"
ALT="$165$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img227.png"
ALT="$661$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="43" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img228.png"
ALT="$1,484$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="43" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img229.png"
ALT="$2,643$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="43" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img230.png"
ALT="$4,129$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="43" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img231.png"
ALT="$5,908$"></SPAN> </SMALL></TD>
</TR>
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
<SPAN CLASS="MATH"><IMG
WIDTH="14" HEIGHT="30" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img135.png"
ALT="$\beta$"></SPAN>/&nbsp;(# of seeks in the worst case) </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="59" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img232.png"
ALT="$384,478$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="51" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img233.png"
ALT="$95,974$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="51" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img234.png"
ALT="$42,749$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="51" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img235.png"
ALT="$24,003$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="51" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img236.png"
ALT="$15,365$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="51" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img237.png"
ALT="$10,738$"></SPAN> </SMALL></TD>
</TR>
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
Time (hours) </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img238.png"
ALT="$4.04$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img239.png"
ALT="$3.64$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img240.png"
ALT="$3.34$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img241.png"
ALT="$3.20$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img242.png"
ALT="$3.13$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img243.png"
ALT="$3.09$"></SPAN> </SMALL></TD>
</TR>
</TABLE>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><B>Table 3:</B> Influence of the internal memory area size (<IMG ALIGN="middle" SRC="figs/brz/img8.png" BORDER="0" ALT="">) on the runtime of the external-memory-based algorithm.</TD>
</TR>
</TABLE>
<HR NOSHADE SIZE=1>
<A NAME="papers"></A>
<H2>Papers</H2>
<OL>
<LI><A HREF="http://www.dcc.ufmg.br/~fbotelho">F. C. Botelho</A>, D. Menoti, <A HREF="http://www.dcc.ufmg.br/~nivio">N. Ziviani</A>. <A HREF="papers/bmz_tr004_04.ps">A New algorithm for constructing minimal perfect hash functions</A>, Technical Report TR004/04, Department of Computer Science, Federal University of Minas Gerais, 2004.
<P></P>
<LI><A HREF="http://www.dcc.ufmg.br/~fbotelho">F. C. Botelho</A>, Y. Kohayakawa, <A HREF="http://www.dcc.ufmg.br/~nivio">N. Ziviani</A>. <A HREF="papers/tr06.pdf">An Approach for Minimal Perfect Hash Functions for Very Large Databases</A>, Technical Report TR003/06, Department of Computer Science, Federal University of Minas Gerais, 2004.
<P></P>
<LI><A HREF="http://www.dcc.ufmg.br/~fbotelho">F. C. Botelho</A>, Y. Kohayakawa, and <A HREF="http://www.dcc.ufmg.br/~nivio">N. Ziviani</A>. <A HREF="papers/wea05.pdf">A Practical Minimal Perfect Hashing Method</A>. <I>4th International Workshop on Efficient and Experimental Algorithms (WEA05),</I> Springer-Verlag Lecture Notes in Computer Science, vol. 3505, Santorini Island, Greece, May 2005, 488-500.
<P></P>
<LI><A HREF="http://acmqueue.com/modules.php?name=Content&pa=showpage&pid=299">M. Seltzer. Beyond relational databases. ACM Queue, 3(3), April 2005.</A>
<P></P>
<LI><A HREF="http://burtleburtle.net/bob/hash/doobs.html">Bob Jenkins. Algorithm alley: Hash functions. Dr. Dobb's Journal of Software Tools, 22(9), September 1997.</A>
<P></P>
<LI>R. Jain. The art of computer systems performance analysis: techniques for experimental design, measurement, simulation, and modeling. John Wiley, first edition, 1991.
</OL>
<HR NOSHADE SIZE=1>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><A HREF="index.html">Home</A></TD>
<TD><A HREF="chd.html">CHD</A></TD>
<TD><A HREF="bdz.html">BDZ</A></TD>
<TD><A HREF="bmz.html">BMZ</A></TD>
<TD><A HREF="chm.html">CHM</A></TD>
<TD><A HREF="brz.html">BRZ</A></TD>
<TD><A HREF="fch.html">FCH</A></TD>
</TR>
</TABLE>
<HR NOSHADE SIZE=1>
<P>
Enjoy!
</P>
<P>
<A HREF="mailto:davi@users.sourceforge.net">Davi de Castro Reis</A>
</P>
<P>
<A HREF="mailto:db8192@users.sourceforge.net">Djamel Belazzougui</A>
</P>
<P>
<A HREF="mailto:fc_botelho@users.sourceforge.net">Fabiano Cupertino Botelho</A>
</P>
<P>
<A HREF="mailto:nivio@dcc.ufmg.br">Nivio Ziviani</A>
</P>
<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
try {
var pageTracker = _gat._getTracker("UA-7698683-2");
pageTracker._trackPageview();
} catch(err) {}</script>
<!-- html code generated by txt2tags 2.6 (http://txt2tags.org) -->
<!-- cmdline: txt2tags -t html -i BRZ.t2t -o docs/brz.html -->
</BODY></HTML>

97
deps/cmph/docs/chd.html vendored Normal file

@@ -0,0 +1,97 @@
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<META NAME="generator" CONTENT="http://txt2tags.org">
<LINK REL="stylesheet" TYPE="text/css" HREF="DOC.css">
<TITLE>Compress, Hash and Displace: CHD Algorithm</TITLE>
</HEAD><BODY BGCOLOR="white" TEXT="black">
<CENTER>
<H1>Compress, Hash and Displace: CHD Algorithm</H1>
</CENTER>
<HR NOSHADE SIZE=1>
<H2>Introduction</H2>
<P>
The important performance parameters of a PHF are representation size, evaluation time and construction time. The representation size plays an important role when the whole function fits in a faster memory and the actual data is stored in a slower memory. For instance, compact PHFs can fit entirely in a CPU cache, which makes their computation very fast by avoiding cache misses. The CHD algorithm plays an important role in this context. It was designed by Djamal Belazzougui, Fabiano C. Botelho, and Martin Dietzfelbinger in <A HREF="#papers">[2]</A>.
</P>
<P>
The CHD algorithm makes it possible to obtain PHFs with representation size very close to optimal while retaining <I>O(n)</I> construction time and <I>O(1)</I> evaluation time. For example, in the case <I>m=2n</I> we obtain a PHF that uses <I>0.67</I> bits per key, and for <I>m=1.23n</I> we obtain a PHF that uses <I>1.4</I> bits per key, which was not achievable with previously known methods. The CHD algorithm is inspired by several known algorithms; the main new feature is that it combines a modification of Pagh's "hash-and-displace" approach with data compression on a sequence of hash function indices. That combination makes it possible to significantly reduce space usage while retaining linear construction time and constant query time. The CHD algorithm can also be used for <I>k</I>-perfect hashing, where at most <I>k</I> keys may be mapped to the same value. For the analysis we assume that fully random hash functions are given for free; such assumptions can be justified and were made in previous papers.
</P>
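<P>
The hash-and-displace idea described above can be illustrated with a minimal Python sketch. It is not the CMPH implementation: the seeded BLAKE2b hash stands in for the fully random hash functions assumed in the analysis, the names <I>seeded_hash</I>, <I>build_phf</I> and <I>evaluate</I> are hypothetical, and the compression of the displacement sequence (the "compress" part of CHD) is left out.
</P>

```python
import hashlib


def seeded_hash(key: str, seed: int) -> int:
    """64-bit hash of `key` under `seed` (a stand-in for a random hash function)."""
    digest = hashlib.blake2b(
        key.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def build_phf(keys, m, bucket_count, max_tries=100_000):
    """Return one hash-function index per bucket so that every key gets its own slot in [0, m)."""
    # Step 1: split the keys into buckets with a first-level hash (seed 0).
    buckets = [[] for _ in range(bucket_count)]
    for key in keys:
        buckets[seeded_hash(key, 0) % bucket_count].append(key)

    occupied = [False] * m
    displacement = [0] * bucket_count
    # Step 2: place the largest buckets first, while the table is still mostly empty.
    order = sorted(range(bucket_count), key=lambda b: len(buckets[b]), reverse=True)
    for b in order:
        if not buckets[b]:
            continue
        # Step 3: try hash-function indices 1, 2, ... until the whole bucket fits.
        for d in range(1, max_tries + 1):
            slots = [seeded_hash(key, d) % m for key in buckets[b]]
            if len(set(slots)) == len(slots) and not any(occupied[s] for s in slots):
                for s in slots:
                    occupied[s] = True
                displacement[b] = d
                break
        else:
            raise RuntimeError("no displacement found; increase m or bucket_count")
    return displacement


def evaluate(key, displacement, m):
    """Look up the bucket's index, then hash the key with it: two hash evaluations."""
    d = displacement[seeded_hash(key, 0) % len(displacement)]
    return seeded_hash(key, d) % m


if __name__ == "__main__":
    sample = [f"key{i}" for i in range(200)]
    table = build_phf(sample, m=400, bucket_count=50)
    print(sorted(evaluate(k, table, 400) for k in sample)[:5])
```

<P>
Only the list of per-bucket indices has to be stored; in this sketch it is kept as plain integers, whereas CHD compresses that sequence to approach the space figures quoted above.
</P>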
<P>
The compact PHFs generated by the CHD algorithm can be used in many applications in which we want to assign a unique identifier to each key without storing any information about the key. One of the most obvious applications of those functions (or <I>k</I>-perfect hash functions) is when we have a small fast memory in which we can store the perfect hash function while the keys and associated satellite data are stored in slower but larger memory. The size of a block or a transfer unit may be chosen so that <I>k</I> data items can be retrieved in one read access. In this case we can ensure that data associated with a key can be retrieved in a single probe to slower memory. This has been used, for example, in hardware routers <A HREF="#papers">[4]</A>.
</P>
<P>
The CHD algorithm generates the most compact PHFs and MPHFs we know of in <I>O(n)</I> time. The time required to evaluate the generated functions is constant (in practice less than <I>1.4</I> microseconds). The storage space of the resulting PHFs and MPHFs is within a factor of <I>1.43</I> of the information-theoretic lower bound. The closest competitor is the algorithm by Dietzfelbinger and Pagh <A HREF="#papers">[3]</A>, but their algorithm does not work in linear time. Furthermore, the CHD algorithm can be tuned to run faster than the BPZ algorithm <A HREF="#papers">[1]</A> (the fastest algorithm available in the literature so far) and to obtain more compact functions. Its most impressive characteristic is that it can, in principle, approximate the information-theoretic lower bound while remaining practical. A detailed description of the CHD algorithm can be found in <A HREF="#papers">[2]</A>.
</P>
<HR NOSHADE SIZE=1>
<H2>Experimental Results</H2>
<P>
Experimental results comparing the CHD algorithm with <A HREF="bdz.html">the BDZ algorithm</A>
and others available in the CMPH library are presented in <A HREF="#papers">[2]</A>.
</P>
<HR NOSHADE SIZE=1>
<A NAME="papers"></A>
<H2>Papers</H2>
<OL>
<LI><A HREF="http://www.dcc.ufmg.br/~fbotelho">F. C. Botelho</A>, <A HREF="http://www.itu.dk/~pagh/">R. Pagh</A>, <A HREF="http://www.dcc.ufmg.br/~nivio">N. Ziviani</A>. <A HREF="papers/wads07.pdf">Simple and space-efficient minimal perfect hash functions</A>. <I>In Proceedings of the 10th International Workshop on Algorithms and Data Structures (WADS'07),</I> Springer-Verlag Lecture Notes in Computer Science, vol. 4619, Halifax, Canada, August 2007, 139-150.
<P></P>
<LI><A HREF="http://www.dcc.ufmg.br/~fbotelho">F. C. Botelho</A>, D. Belazzougui and M. Dietzfelbinger. <A HREF="papers/esa09.pdf">Compress, hash and displace</A>. <I>In Proceedings of the 17th European Symposium on Algorithms (ESA09)</I>. Springer LNCS, 2009.
<P></P>
<LI>M. Dietzfelbinger and <A HREF="http://www.itu.dk/~pagh/">R. Pagh</A>. Succinct data structures for retrieval and approximate membership. <I>In Proceedings of the 35th International Colloquium on Automata, Languages and Programming (ICALP08)</I>, pages 385-396, Berlin, Heidelberg, 2008. Springer-Verlag.
<P></P>
<LI>B. Prabhakar and F. Bonomi. Perfect hashing for network applications. <I>In Proceedings of the IEEE International Symposium on Information Theory</I>. IEEE Press, 2006.
</OL>
<HR NOSHADE SIZE=1>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><A HREF="index.html">Home</A></TD>
<TD><A HREF="chd.html">CHD</A></TD>
<TD><A HREF="bdz.html">BDZ</A></TD>
<TD><A HREF="bmz.html">BMZ</A></TD>
<TD><A HREF="chm.html">CHM</A></TD>
<TD><A HREF="brz.html">BRZ</A></TD>
<TD><A HREF="fch.html">FCH</A></TD>
</TR>
</TABLE>
<HR NOSHADE SIZE=1>
<P>
Enjoy!
</P>
<P>
<A HREF="mailto:davi@users.sourceforge.net">Davi de Castro Reis</A>
</P>
<P>
<A HREF="mailto:db8192@users.sourceforge.net">Djamel Belazzougui</A>
</P>
<P>
<A HREF="mailto:fc_botelho@users.sourceforge.net">Fabiano Cupertino Botelho</A>
</P>
<P>
<A HREF="mailto:nivio@dcc.ufmg.br">Nivio Ziviani</A>
</P>
<!-- html code generated by txt2tags 2.6 (http://txt2tags.org) -->
<!-- cmdline: txt2tags -t html -i CHD.t2t -o docs/chd.html -->
</BODY></HTML>

180
deps/cmph/docs/chm.html vendored Normal file

@@ -0,0 +1,180 @@
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<META NAME="generator" CONTENT="http://txt2tags.org">
<LINK REL="stylesheet" TYPE="text/css" HREF="DOC.css">
<TITLE>CHM Algorithm</TITLE>
</HEAD><BODY BGCOLOR="white" TEXT="black">
<CENTER>
<H1>CHM Algorithm</H1>
</CENTER>
<HR NOSHADE SIZE=1>
<H2>The Algorithm</H2>
<P>
The algorithm is presented in <A HREF="#papers">[1,2,3]</A>.
</P>
<HR NOSHADE SIZE=1>
<H2>Memory Consumption</H2>
<P>
We now detail the memory consumed to generate and to store minimal perfect hash functions
using the CHM algorithm. The structures responsible for memory consumption are the
following:
</P>
<UL>
<LI>Graph:
<OL>
<LI><B>first</B>: is a vector that stores <I>cn</I> integer numbers, each one representing
the first edge (index in the vector edges) in the list of
edges of each vertex.
The integer numbers are 4 bytes long. Therefore,
the vector first is stored in <I>4cn</I> bytes.
<P></P>
<LI><B>edges</B>: is a vector to represent the edges of the graph. As each edge
consists of a pair of vertices, each entry stores two integer numbers
of 4 bytes that represent the vertices. As there are <I>n</I> edges, the
vector edges is stored in <I>8n</I> bytes.
<P></P>
<LI><B>next</B>: given a vertex <IMG ALIGN="bottom" SRC="figs/img139.png" BORDER="0" ALT="">, we can discover the edges that
contain <IMG ALIGN="bottom" SRC="figs/img139.png" BORDER="0" ALT=""> by following its list of edges, which starts at
first[<IMG ALIGN="bottom" SRC="figs/img139.png" BORDER="0" ALT="">]; the subsequent
edges are given by next[...first[<IMG ALIGN="bottom" SRC="figs/img139.png" BORDER="0" ALT="">]...]. Therefore,
the vectors first and next represent
the linked lists of edges of each vertex. As there are two vertices for each edge,
when an edge is inserted in the graph, it must be inserted in the linked lists
of both of its vertices. Therefore, there are <I>2n</I> entries of integer
numbers in the vector next, so it is stored in <I>4*2n = 8n</I> bytes.
<P></P>
</OL>
<LI>Other auxiliary structures
<OL>
<LI><B>visited</B>: is a vector of <I>cn</I> bits, where each bit indicates if the g value of
a given vertex was already defined. Therefore, the vector visited is stored
in <I>cn/8</I> bytes.
<P></P>
<LI><B>function <I>g</I></B>: is represented by a vector of <I>cn</I> integer numbers.
As each integer number is 4 bytes long, the function <I>g</I> is stored in
<I>4cn</I> bytes.
</OL>
</UL>
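<P>
The way the vectors first, edges and next cooperate can be sketched in Python as follows. This is an illustrative layout under assumptions, not the library's source: the class name <I>Graph</I>, the sentinel <I>NIL</I>, and the choice of storing the two endpoints of edge <I>e</I> in slots <I>e</I> and <I>e + n</I> are all hypothetical.
</P>

```python
NIL = -1  # marks the end of a vertex's edge list (assumed sentinel)


class Graph:
    """Edge lists stored in three flat vectors, as described in the text."""

    def __init__(self, num_vertices, num_edges):
        self.n = num_edges
        self.first = [NIL] * num_vertices    # one entry per vertex (cn entries)
        self.edges = [NIL] * (2 * num_edges)  # two endpoints per edge
        self.next = [NIL] * (2 * num_edges)   # one list link per endpoint
        self.count = 0

    def add_edge(self, u, v):
        e = self.count
        self.count += 1
        # Slot e holds endpoint u, slot e + n holds endpoint v.
        self.edges[e] = u
        self.edges[e + self.n] = v
        # Push the edge onto the front of both endpoints' linked lists.
        self.next[e] = self.first[u]
        self.first[u] = e
        self.next[e + self.n] = self.first[v]
        self.first[v] = e + self.n

    def neighbours(self, u):
        """Walk first[u], next[first[u]], ... and yield the opposite endpoints."""
        slot = self.first[u]
        while slot != NIL:
            yield self.edges[(slot + self.n) % (2 * self.n)]
            slot = self.next[slot]


if __name__ == "__main__":
    g = Graph(num_vertices=4, num_edges=3)
    for u, v in [(0, 1), (1, 2), (0, 2)]:
        g.add_edge(u, v)
    print(sorted(g.neighbours(0)))
```

<P>
Each edge occupies two slots in next, which is where the <I>2n</I> entries mentioned above come from.
</P>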
<P>
Thus, the total memory consumption of the CHM algorithm for generating a minimal
perfect hash function (MPHF) is: <I>(8.125c + 16)n + O(1)</I> bytes.
As the value of the constant <I>c</I> must be at least 2.09, we have:
</P>
<TABLE ALIGN="center" BORDER="1" CELLPADDING="4">
<TR>
<TH><I>c</I></TH>
<TH>Memory consumption to generate a MPHF</TH>
</TR>
<TR>
<TD>2.09</TD>
<TD ALIGN="center"><I>33.00n + O(1)</I></TD>
</TR>
</TABLE>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><B>Table 1:</B> Memory consumption to generate a MPHF using the CHM algorithm.</TD>
</TR>
</TABLE>
<P>
Now we present the memory consumption to store the resulting function.
We only need to store the <I>g</I> function. Thus, we need <I>4cn</I> bytes.
Again we have:
</P>
<TABLE ALIGN="center" BORDER="1" CELLPADDING="4">
<TR>
<TH><I>c</I></TH>
<TH>Memory consumption to store a MPHF</TH>
</TR>
<TR>
<TD>2.09</TD>
<TD ALIGN="center"><I>8.36n</I></TD>
</TR>
</TABLE>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><B>Table 2:</B> Memory consumption to store a MPHF generated by the CHM algorithm.</TD>
</TR>
</TABLE>
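<P>
As a cross-check of Tables 1 and 2, the per-key byte counts can be recomputed by summing the structures listed above. The Python sketch below uses hypothetical function names and ignores the <I>O(1)</I> term; it only restates the arithmetic given in the text.
</P>

```python
def chm_generation_bytes_per_key(c):
    """Bytes per key needed while generating the MPHF (O(1) term ignored)."""
    first = 4 * c    # cn integers of 4 bytes
    edges = 8        # n entries of two 4-byte integers
    next_ = 8        # 2n integers of 4 bytes
    visited = c / 8  # cn bits
    g = 4 * c        # cn integers of 4 bytes
    return first + edges + next_ + visited + g


def chm_storage_bytes_per_key(c):
    """Only the function g has to be stored: cn integers of 4 bytes."""
    return 4 * c


if __name__ == "__main__":
    c = 2.09
    print(f"generate: {chm_generation_bytes_per_key(c):.2f}n bytes")
    print(f"store:    {chm_storage_bytes_per_key(c):.2f}n bytes")
```

<P>
For <I>c = 2.09</I> the sum equals <I>(8.125c + 16)n</I>, which evaluates to about <I>32.98n</I>; Table 1 lists this as <I>33.00n</I>.
</P>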
<HR NOSHADE SIZE=1>
<H2>Experimental Results</H2>
<P>
<A HREF="comparison.html">CHM x BMZ</A>
</P>
<HR NOSHADE SIZE=1>
<A NAME="papers"></A>
<H2>Papers</H2>
<OL>
<LI>Z.J. Czech, G. Havas, and B.S. Majewski. <A HREF="papers/chm92.pdf">An optimal algorithm for generating minimal perfect hash functions</A>. Information Processing Letters, 43(5):257-264, 1992.
<P></P>
<LI>Z.J. Czech, G. Havas, and B.S. Majewski. Fundamental study perfect hashing.
Theoretical Computer Science, 182:1-143, 1997.
<P></P>
<LI>B.S. Majewski, N.C. Wormald, G. Havas, and Z.J. Czech. A family of perfect hashing methods.
The Computer Journal, 39(6):547-554, 1996.
</OL>
<HR NOSHADE SIZE=1>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><A HREF="index.html">Home</A></TD>
<TD><A HREF="chd.html">CHD</A></TD>
<TD><A HREF="bdz.html">BDZ</A></TD>
<TD><A HREF="bmz.html">BMZ</A></TD>
<TD><A HREF="chm.html">CHM</A></TD>
<TD><A HREF="brz.html">BRZ</A></TD>
<TD><A HREF="fch.html">FCH</A></TD>
</TR>
</TABLE>
<HR NOSHADE SIZE=1>
<P>
Enjoy!
</P>
<P>
<A HREF="mailto:davi@users.sourceforge.net">Davi de Castro Reis</A>
</P>
<P>
<A HREF="mailto:db8192@users.sourceforge.net">Djamel Belazzougui</A>
</P>
<P>
<A HREF="mailto:fc_botelho@users.sourceforge.net">Fabiano Cupertino Botelho</A>
</P>
<P>
<A HREF="mailto:nivio@dcc.ufmg.br">Nivio Ziviani</A>
</P>
<!-- html code generated by txt2tags 2.6 (http://txt2tags.org) -->
<!-- cmdline: txt2tags -t html -i CHM.t2t -o docs/chm.html -->
</BODY></HTML>

457
deps/cmph/docs/comparison.html vendored Normal file

@@ -0,0 +1,457 @@
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<META NAME="generator" CONTENT="http://txt2tags.org">
<LINK REL="stylesheet" TYPE="text/css" HREF="DOC.css">
<TITLE>Comparison Between BMZ And CHM Algorithms</TITLE>
</HEAD><BODY BGCOLOR="white" TEXT="black">
<CENTER>
<H1>Comparison Between BMZ And CHM Algorithms</H1>
</CENTER>
<HR NOSHADE SIZE=1>
<H2>Characteristics</H2>
<P>
Table 1 presents the main characteristics of the two algorithms.
The number of edges in the graph <IMG ALIGN="middle" SRC="figs/img27.png" BORDER="0" ALT=""> is <IMG ALIGN="middle" SRC="figs/img236.png" BORDER="0" ALT="">,
the number of keys in the input set <IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT="">.
The number of vertices of <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT=""> is equal
to <IMG ALIGN="bottom" SRC="figs/img12.png" BORDER="0" ALT=""> and <IMG ALIGN="bottom" SRC="figs/img237.png" BORDER="0" ALT=""> for the BMZ algorithm and the CHM algorithm, respectively.
This measure is related to the amount of space needed to store the array <IMG ALIGN="middle" SRC="figs/img37.png" BORDER="0" ALT="">.
This reduces the space required to store a function in the BMZ algorithm to <IMG ALIGN="middle" SRC="figs/img238.png" BORDER="0" ALT=""> of the space required by the CHM algorithm.
The number of critical edges is <IMG ALIGN="middle" SRC="figs/img76.png" BORDER="0" ALT=""> and 0 for the BMZ algorithm and the CHM algorithm,
respectively.
The BMZ algorithm generates random graphs that necessarily contain cycles, whereas the
CHM algorithm generates acyclic random graphs.
Finally, the CHM algorithm generates <A HREF="concepts.html">order preserving functions</A>,
while the BMZ algorithm does not preserve order.
</P>
<TABLE CELLPADDING=3 BORDER="1" ALIGN="center">
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">
Characteristics </SMALL></TD>
<TD ALIGN="CENTER" COLSPAN=2><SMALL CLASS="FOOTNOTESIZE"> <SPAN>Algorithms</SPAN></SMALL></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">
</SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> BMZ </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> CHM </SMALL></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">
<SPAN CLASS="MATH"><IMG
WIDTH="11" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/img1.png"
ALT="$c$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 1.15 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 2.09 </SMALL></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">
<SPAN CLASS="MATH"><IMG
WIDTH="50" HEIGHT="32" ALIGN="MIDDLE" BORDER="0"
SRC="figs/img239.png"
ALT="$\vert E(G)\vert$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="14" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/img8.png"
ALT="$n$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="14" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/img8.png"
ALT="$n$"></SPAN> </SMALL></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">
<SPAN CLASS="MATH"><IMG
WIDTH="89" HEIGHT="32" ALIGN="MIDDLE" BORDER="0"
SRC="figs/img240.png"
ALT="$\vert V(G)\vert=\vert g\vert$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="20" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/img241.png"
ALT="$cn$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="20" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/img241.png"
ALT="$cn$"></SPAN> </SMALL></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">
<!-- MATH
$|E(G_{\rm crit})|$
-->
<SPAN CLASS="MATH"><IMG
WIDTH="70" HEIGHT="32" ALIGN="MIDDLE" BORDER="0"
SRC="figs/img111.png"
ALT="$\vert E(G_{\rm crit})\vert$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="71" HEIGHT="32" ALIGN="MIDDLE" BORDER="0"
SRC="figs/img242.png"
ALT="$0.5\vert E(G)\vert$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 0</SMALL></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">
<SPAN CLASS="MATH"><IMG
WIDTH="17" HEIGHT="14" ALIGN="BOTTOM" BORDER="0"
SRC="figs/img32.png"
ALT="$G$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> cyclic </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> acyclic </SMALL></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">
Order preserving </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> no </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> yes </SMALL></TD>
</TR>
</TABLE>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><B>Table 1:</B> Main characteristics of the algorithms.</TD>
</TR>
</TABLE>
<HR NOSHADE SIZE=1>
<H2>Memory Consumption</H2>
<UL>
<LI>Memory consumption to generate the minimal perfect hash function (MPHF):
</UL>
<TABLE ALIGN="center" BORDER="1" CELLPADDING="4">
<TR>
<TH>Algorithm</TH>
<TH><I>c</I></TH>
<TH>Memory consumption to generate a MPHF</TH>
</TR>
<TR>
<TD ALIGN="center">BMZ</TD>
<TD>0.93</TD>
<TD ALIGN="center"><I>24.80n + O(1)</I></TD>
</TR>
<TR>
<TD ALIGN="center">BMZ</TD>
<TD>1.15</TD>
<TD ALIGN="center"><I>26.42n + O(1)</I></TD>
</TR>
<TR>
<TD ALIGN="center">CHM</TD>
<TD>2.09</TD>
<TD ALIGN="center"><I>33.00n + O(1)</I></TD>
</TR>
</TABLE>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><B>Table 2:</B> Memory consumption to generate a MPHF using the algorithms BMZ and CHM.</TD>
</TR>
</TABLE>
<UL>
<LI>Memory consumption to store the resulting minimal perfect hash function (MPHF):
</UL>
<TABLE ALIGN="center" BORDER="1" CELLPADDING="4">
<TR>
<TH>Algorithm</TH>
<TH><I>c</I></TH>
<TH>Memory consumption to store a MPHF</TH>
</TR>
<TR>
<TD ALIGN="center">BMZ</TD>
<TD>0.93</TD>
<TD ALIGN="center"><I>3.72n</I></TD>
</TR>
<TR>
<TD ALIGN="center">BMZ</TD>
<TD>1.15</TD>
<TD ALIGN="center"><I>4.60n</I></TD>
</TR>
<TR>
<TD ALIGN="center">CHM</TD>
<TD>2.09</TD>
<TD ALIGN="center"><I>8.36n</I></TD>
</TR>
</TABLE>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><B>Table 3:</B> Memory consumption to store a MPHF generated by the algorithms BMZ and CHM.</TD>
</TR>
</TABLE>
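<P>
The storage figures in Table 3 are consistent with one 4-byte integer per vertex, that is, <I>4c</I> bytes per key. The Python sketch below recomputes them under that assumption; the names <I>BYTES_PER_ENTRY</I>, <I>configs</I> and <I>storage_bytes_per_key</I> are hypothetical and introduced only for illustration.
</P>

```python
BYTES_PER_ENTRY = 4  # assumed size of one entry of the array g

# (algorithm, c) pairs taken from Table 3
configs = [("BMZ", 0.93), ("BMZ", 1.15), ("CHM", 2.09)]


def storage_bytes_per_key(c):
    """The array g has cn entries, so it costs 4c bytes per key."""
    return BYTES_PER_ENTRY * c


table = [(name, c, round(storage_bytes_per_key(c), 2)) for name, c in configs]

# Space of BMZ (c = 1.15) relative to CHM (c = 2.09).
ratio = storage_bytes_per_key(1.15) / storage_bytes_per_key(2.09)

if __name__ == "__main__":
    for name, c, per_key in table:
        print(f"{name}  c={c:<4}  {per_key:.2f}n bytes")
    print(f"BMZ/CHM storage ratio: {ratio:.2%}")
```

<P>
Because both algorithms store the same kind of array, the ratio depends only on the two values of <I>c</I>.
</P>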
<HR NOSHADE SIZE=1>
<H2>Run times</H2>
<P>
We now present some experimental results to compare the BMZ and CHM algorithms.
The data consists of a collection of 100 million uniform resource locators
(URLs) collected from the Web.
The average length of a URL in the collection is 63 bytes.
All experiments were carried out on
a computer running the Linux operating system, version 2.6.7,
with a 2.4 gigahertz processor and
4 gigabytes of main memory.
</P>
<P>
Table 4 presents time measurements.
All times are in seconds.
The table entries represent averages over 50 trials.
The column labelled as <IMG ALIGN="middle" SRC="figs/img243.png" BORDER="0" ALT=""> represents
the number of iterations to generate the random graph <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT=""> in the
mapping step of the algorithms.
The next columns represent the run times
for the mapping plus ordering steps together and the searching
step for each algorithm.
The last column represents the percent gain of our algorithm
over the CHM algorithm.
</P>
<TABLE CELLPADDING=3 BORDER="1" ALIGN="center">
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="14" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/img8.png"
ALT="$n$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER" COLSPAN=4><SMALL CLASS="FOOTNOTESIZE"> <SPAN> BMZ </SPAN> </SMALL></TD>
<TD ALIGN="CENTER" COLSPAN=4><SMALL CLASS="FOOTNOTESIZE">
<SPAN>CHM algorithm</SPAN></SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> Gain</SMALL></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">
</SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="22" HEIGHT="30" ALIGN="MIDDLE" BORDER="0"
SRC="figs/img243.png"
ALT="$N_i$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">Map+Ord </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">
Search </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">Total </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">
<SPAN CLASS="MATH"><IMG
WIDTH="22" HEIGHT="30" ALIGN="MIDDLE" BORDER="0"
SRC="figs/img243.png"
ALT="$N_i$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">Map+Ord </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">Search </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">
Total </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> (%)</SMALL></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 1,562,500 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 2.28 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 8.54 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 2.37 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 10.91 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 2.70 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 14.56 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 1.57 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 16.13 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 48 </SMALL></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 3,125,000 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 2.16 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 15.92 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 4.88 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 20.80 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 2.85 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 30.36 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 3.20 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 33.56 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 61 </SMALL></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 6,250,000 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 2.20 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 33.09 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 10.48 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 43.57 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 2.90 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 62.26 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 6.76 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 69.02 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 58 </SMALL></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 12,500,000 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 2.00 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 63.26 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 23.04 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 86.30 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 2.60 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 117.99 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 14.94 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 132.92 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 54 </SMALL></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 25,000,000 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 2.00 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 130.79 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 51.55 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 182.34 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 2.80 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 262.05 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 33.68 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 295.73 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 62 </SMALL></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 50,000,000 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 2.07 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 273.75 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 114.12 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 387.87 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 2.90 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 577.59 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 73.97 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 651.56 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 68 </SMALL></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 100,000,000 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 2.07 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 567.47 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 243.13 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 810.60 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 2.80 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 1,131.06 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 157.23 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 1,288.29 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 59 </SMALL></TD>
</TR>
</TABLE>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><B>Table 4:</B> Time measurements for BMZ and the CHM algorithm.</TD>
</TR>
</TABLE>
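<P>
The Gain column of Table 4 can be reproduced from the two Total columns. The formula used in the Python sketch below, the extra time of CHM relative to the BMZ total, is inferred from the table values rather than stated in the text, and the variable names are hypothetical.
</P>

```python
# (n, BMZ total seconds, CHM total seconds) copied from Table 4
totals = [
    (1_562_500, 10.91, 16.13),
    (3_125_000, 20.80, 33.56),
    (6_250_000, 43.57, 69.02),
    (12_500_000, 86.30, 132.92),
    (25_000_000, 182.34, 295.73),
    (50_000_000, 387.87, 651.56),
    (100_000_000, 810.60, 1288.29),
]


def percent_gain(bmz_total, chm_total):
    """Extra time taken by CHM, as a percentage of the BMZ total (inferred formula)."""
    return 100 * (chm_total - bmz_total) / bmz_total


gains = [round(percent_gain(bmz, chm)) for _, bmz, chm in totals]
average_gain = sum(gains) / len(gains)

if __name__ == "__main__":
    for (n, _, _), gain in zip(totals, gains):
        print(f"{n:>11,}  {gain}%")
    print(f"average: {average_gain:.0f}%")
```

<P>
With this definition the rounded values match the Gain column row by row.
</P>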
<P>
The mapping step of the BMZ algorithm is faster because
the expected number of iterations in the mapping step to generate <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT=""> is
2.13 and 2.92 for the BMZ algorithm and the CHM algorithm, respectively
(see <A HREF="bmz.html#papers">[2]</A> for details).
The graph <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT=""> generated by the BMZ algorithm
has <IMG ALIGN="bottom" SRC="figs/img12.png" BORDER="0" ALT=""> vertices, against <IMG ALIGN="bottom" SRC="figs/img237.png" BORDER="0" ALT=""> for the CHM algorithm.
These two facts make the BMZ algorithm faster in the mapping step.
The time of the ordering step of the BMZ algorithm is approximately equal to
the time to check if <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT=""> is acyclic for the CHM algorithm.
The searching step of the CHM algorithm is faster, but overall
the BMZ algorithm is, on average, approximately 59% faster
than the CHM algorithm.
It is important to notice the times for the searching step:
for both algorithms they are not the dominant times,
and the experimental results clearly show
a linear behavior for the searching step.
</P>
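<P>
As a small worked check, the last column of the two Table 4 rows shown above (68 and 59) appears consistent with a gain computed as the extra total time of the CHM algorithm relative to the BMZ total. That reading of the column is an assumption, and the helper name <I>gain_percent</I> is hypothetical:
</P>

```c
#include <assert.h>

/* Relative gain, in percent, of a faster total time over a slower one.
 * Assumption: gain = 100 * (slow - fast) / fast. */
double gain_percent(double fast_total, double slow_total)
{
    assert(fast_total > 0.0);
    return 100.0 * (slow_total - fast_total) / fast_total;
}
```

<P>
With the totals of the 100,000,000-key row, gain_percent(810.60, 1288.29) is about 58.9, which rounds to the 59 reported in the table.
</P>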
<P>
We now present run times for BMZ algorithm using a <A HREF="bmz.html#heuristic">heuristic</A> that
reduces the space requirement
to any given value between <IMG ALIGN="bottom" SRC="figs/img12.png" BORDER="0" ALT=""> words and <IMG ALIGN="bottom" SRC="figs/img13.png" BORDER="0" ALT=""> words.
For example, for <IMG ALIGN="bottom" SRC="figs/img244.png" BORDER="0" ALT=""> and <IMG ALIGN="bottom" SRC="figs/img6.png" BORDER="0" ALT="">, the analytical expected numbers
of iterations are <IMG ALIGN="bottom" SRC="figs/img245.png" BORDER="0" ALT=""> and <IMG ALIGN="bottom" SRC="figs/img246.png" BORDER="0" ALT="">, respectively
(for <IMG ALIGN="middle" SRC="figs/img247.png" BORDER="0" ALT="">, the number of iterations is 2.78 for <IMG ALIGN="bottom" SRC="figs/img244.png" BORDER="0" ALT=""> and 3.04
for <IMG ALIGN="bottom" SRC="figs/img6.png" BORDER="0" ALT="">).
Table 5 presents the total times to construct a
function for <IMG ALIGN="middle" SRC="figs/img247.png" BORDER="0" ALT="">, with an increase from <IMG ALIGN="bottom" SRC="figs/img237.png" BORDER="0" ALT=""> seconds
for <IMG ALIGN="bottom" SRC="figs/img128.png" BORDER="0" ALT=""> (see Table 4) to <IMG ALIGN="bottom" SRC="figs/img249.png" BORDER="0" ALT=""> seconds for <IMG ALIGN="bottom" SRC="figs/img244.png" BORDER="0" ALT=""> and
to <IMG ALIGN="bottom" SRC="figs/img250.png" BORDER="0" ALT=""> seconds for <IMG ALIGN="bottom" SRC="figs/img6.png" BORDER="0" ALT="">.
</P>
<TABLE CELLPADDING=3 BORDER="1" ALIGN="center">
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="14" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/img8.png"
ALT="$n$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER" COLSPAN=4><SMALL CLASS="FOOTNOTESIZE"> <SPAN> BMZ <SPAN CLASS="MATH"><IMG
WIDTH="60" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/img5.png"
ALT="$c=1.00$"></SPAN></SPAN> </SMALL></TD>
<TD ALIGN="CENTER" COLSPAN=4><SMALL CLASS="FOOTNOTESIZE">
<SPAN> BMZ <SPAN CLASS="MATH"><IMG
WIDTH="60" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/img6.png"
ALT="$c=0.93$"></SPAN></SPAN> </SMALL></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">
</SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="22" HEIGHT="30" ALIGN="MIDDLE" BORDER="0"
SRC="figs/img243.png"
ALT="$N_i$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">Map+Ord </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">
Search </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">Total </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">
<SPAN CLASS="MATH"><IMG
WIDTH="22" HEIGHT="30" ALIGN="MIDDLE" BORDER="0"
SRC="figs/img243.png"
ALT="$N_i$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">Map+Ord </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">Search </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">
Total </SMALL></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 12,500,000 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 2.78 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 76.68 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 25.06 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 101.74 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 3.04 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 76.39 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 25.80 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 102.19 </SMALL></TD>
</TR>
</TABLE>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><B>Table 5:</B> Time measurements for BMZ tuned algorithm with <IMG ALIGN="bottom" SRC="figs/img5.png" BORDER="0" ALT=""> and <IMG ALIGN="bottom" SRC="figs/img6.png" BORDER="0" ALT="">.</TD>
</TR>
</TABLE>
<HR NOSHADE SIZE=1>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><A HREF="index.html">Home</A></TD>
<TD><A HREF="chd.html">CHD</A></TD>
<TD><A HREF="bdz.html">BDZ</A></TD>
<TD><A HREF="bmz.html">BMZ</A></TD>
<TD><A HREF="chm.html">CHM</A></TD>
<TD><A HREF="brz.html">BRZ</A></TD>
<TD><A HREF="fch.html">FCH</A></TD>
</TR>
</TABLE>
<HR NOSHADE SIZE=1>
<P>
Enjoy!
</P>
<P>
<A HREF="mailto:davi@users.sourceforge.net">Davi de Castro Reis</A>
</P>
<P>
<A HREF="mailto:db8192@users.sourceforge.net">Djamel Belazzougui</A>
</P>
<P>
<A HREF="mailto:fc_botelho@users.sourceforge.net">Fabiano Cupertino Botelho</A>
</P>
<P>
<A HREF="mailto:nivio@dcc.ufmg.br">Nivio Ziviani</A>
</P>
<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
try {
var pageTracker = _gat._getTracker("UA-7698683-2");
pageTracker._trackPageview();
} catch(err) {}</script>
<!-- html code generated by txt2tags 2.6 (http://txt2tags.org) -->
<!-- cmdline: txt2tags -t html -i COMPARISON.t2t -o docs/comparison.html -->
</BODY></HTML>

114
deps/cmph/docs/concepts.html vendored Normal file

@@ -0,0 +1,114 @@
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<META NAME="generator" CONTENT="http://txt2tags.org">
<LINK REL="stylesheet" TYPE="text/css" HREF="DOC.css">
<TITLE>Minimal Perfect Hash Functions - Introduction</TITLE>
</HEAD><BODY BGCOLOR="white" TEXT="black">
<CENTER>
<H1>Minimal Perfect Hash Functions - Introduction</H1>
</CENTER>
<HR NOSHADE SIZE=1>
<H2>Basic Concepts</H2>
<P>
Suppose <IMG ALIGN="bottom" SRC="figs/img14.png" BORDER="0" ALT=""> is a universe of <I>keys</I>.
Let <IMG ALIGN="bottom" SRC="figs/img15.png" BORDER="0" ALT=""> be a <I>hash function</I> that maps the keys from <IMG ALIGN="bottom" SRC="figs/img14.png" BORDER="0" ALT=""> to a given interval of integers <IMG ALIGN="middle" SRC="figs/img16.png" BORDER="0" ALT="">.
Let <IMG ALIGN="middle" SRC="figs/img17.png" BORDER="0" ALT=""> be a set of <IMG ALIGN="bottom" SRC="figs/img8.png" BORDER="0" ALT=""> keys from <IMG ALIGN="bottom" SRC="figs/img14.png" BORDER="0" ALT="">.
Given a key <IMG ALIGN="middle" SRC="figs/img18.png" BORDER="0" ALT="">, the hash function <IMG ALIGN="bottom" SRC="figs/img7.png" BORDER="0" ALT=""> computes an
integer in <IMG ALIGN="middle" SRC="figs/img19.png" BORDER="0" ALT=""> for the storage or retrieval of <IMG ALIGN="bottom" SRC="figs/img11.png" BORDER="0" ALT=""> in
a <I>hash table</I>.
Hashing methods for <I>non-static sets</I> of keys can be used to construct
data structures storing <IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT=""> and supporting membership queries
"<IMG ALIGN="middle" SRC="figs/img18.png" BORDER="0" ALT="">?" in expected time <IMG ALIGN="middle" SRC="figs/img21.png" BORDER="0" ALT="">.
However, they involve a certain amount of wasted space owing to unused
locations in the table and wasted time to resolve collisions when
two keys are hashed to the same table location.
</P>
<P>
For <I>static sets</I> of keys it is possible to compute a function
to find any key in a table in one probe; such hash functions are called
<I>perfect</I>.
More precisely, given a set of keys <IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT="">, we shall say that a
hash function <IMG ALIGN="bottom" SRC="figs/img15.png" BORDER="0" ALT=""> is a <I>perfect hash function</I>
for <IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT=""> if <IMG ALIGN="bottom" SRC="figs/img7.png" BORDER="0" ALT=""> is an injection on <IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT="">,
that is, there are no <I>collisions</I> among the keys in <IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT="">:
if <IMG ALIGN="bottom" SRC="figs/img11.png" BORDER="0" ALT=""> and <IMG ALIGN="middle" SRC="figs/img22.png" BORDER="0" ALT=""> are in <IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT=""> and <IMG ALIGN="middle" SRC="figs/img23.png" BORDER="0" ALT="">,
then <IMG ALIGN="middle" SRC="figs/img24.png" BORDER="0" ALT="">.
Figure 1(a) illustrates a perfect hash function.
Since no collisions occur, each key can be retrieved from the table
with a single probe.
If <IMG ALIGN="bottom" SRC="figs/img25.png" BORDER="0" ALT="">, that is, the table has the same size as <IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT="">,
then we say that <IMG ALIGN="bottom" SRC="figs/img7.png" BORDER="0" ALT=""> is a <I>minimal perfect hash function</I>
for <IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT="">.
Figure 1(b) illustrates a minimal perfect hash function.
Minimal perfect hash functions totally avoid the problem of wasted
space and time. A perfect hash function <IMG ALIGN="bottom" SRC="figs/img7.png" BORDER="0" ALT=""> is <I>order preserving</I>
if the keys in <IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT=""> are arranged in some given order
and <IMG ALIGN="bottom" SRC="figs/img7.png" BORDER="0" ALT=""> preserves this order in the hash table.
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD ALIGN="right"><center><IMG ALIGN="middle" SRC="figs/img26.png" BORDER="0" ALT=""></center></TD>
</TR>
<TR>
<TD><B>Figure 1:</B> (a) Perfect hash function. (b) Minimal perfect hash function.</TD>
</TR>
</TABLE>
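<P>
The definitions above can be checked mechanically for a small key set. The following is a minimal sketch, not part of the CMPH API: it assumes the hash values of the <I>n</I> keys have already been computed into an array and that the target interval is [0, m-1]. The names <I>hash_kind_t</I> and <I>classify_hash</I> are hypothetical.
</P>

```c
#include <assert.h>
#include <stdlib.h>

typedef enum {
    HASH_CHECK_ERROR = -1,    /* allocation failure */
    HASH_NOT_PERFECT = 0,     /* a collision or out-of-range value was found */
    HASH_PERFECT = 1,         /* injective on the key set, m > n */
    HASH_MINIMAL_PERFECT = 2  /* injective on the key set, m == n */
} hash_kind_t;

/* values[i] is the hash of the i-th key; m is the table size. */
hash_kind_t classify_hash(const unsigned *values, size_t n, size_t m)
{
    unsigned char *used;
    hash_kind_t kind;
    size_t i;

    assert(values != NULL || n == 0);
    if (n > m)
        return HASH_NOT_PERFECT;          /* more keys than slots: collisions are unavoidable */
    used = calloc(m ? m : 1, 1);          /* one flag per table slot */
    if (used == NULL)
        return HASH_CHECK_ERROR;
    kind = (m == n) ? HASH_MINIMAL_PERFECT : HASH_PERFECT;
    for (i = 0; i < n; i++) {
        if (values[i] >= m || used[values[i]]) {
            kind = HASH_NOT_PERFECT;      /* two keys share a slot, or a value is out of range */
            break;
        }
        used[values[i]] = 1;
    }
    free(used);
    return kind;
}
```

<P>
For instance, the hash values {2, 0, 1} with m = 3 are classified as minimal perfect, {4, 0, 2} with m = 5 as perfect but not minimal, and {1, 1, 0} with m = 3 as not perfect.
</P>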
<P>
Minimal perfect hash functions are widely used for memory-efficient
storage and fast retrieval of items from static sets, such as words in natural
languages, reserved words in programming languages or interactive systems,
uniform resource locators (URLs) in Web search engines, or item sets in
data mining techniques.
</P>
<HR NOSHADE SIZE=1>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><A HREF="index.html">Home</A></TD>
<TD><A HREF="chd.html">CHD</A></TD>
<TD><A HREF="bdz.html">BDZ</A></TD>
<TD><A HREF="bmz.html">BMZ</A></TD>
<TD><A HREF="chm.html">CHM</A></TD>
<TD><A HREF="brz.html">BRZ</A></TD>
<TD><A HREF="fch.html">FCH</A></TD>
</TR>
</TABLE>
<HR NOSHADE SIZE=1>
<P>
Enjoy!
</P>
<P>
<A HREF="mailto:davi@users.sourceforge.net">Davi de Castro Reis</A>
</P>
<P>
<A HREF="mailto:db8192@users.sourceforge.net">Djamel Belazzougui</A>
</P>
<P>
<A HREF="mailto:fc_botelho@users.sourceforge.net">Fabiano Cupertino Botelho</A>
</P>
<P>
<A HREF="mailto:nivio@dcc.ufmg.br">Nivio Ziviani</A>
</P>
<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
try {
var pageTracker = _gat._getTracker("UA-7698683-2");
pageTracker._trackPageview();
} catch(err) {}</script>
<!-- html code generated by txt2tags 2.6 (http://txt2tags.org) -->
<!-- cmdline: txt2tags -t html -i CONCEPTS.t2t -o docs/concepts.html -->
</BODY></HTML>

210
deps/cmph/docs/examples.html vendored Normal file

@@ -0,0 +1,210 @@
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<META NAME="generator" CONTENT="http://txt2tags.org">
<LINK REL="stylesheet" TYPE="text/css" HREF="DOC.css">
<TITLE>CMPH - Examples</TITLE>
</HEAD><BODY BGCOLOR="white" TEXT="black">
<CENTER>
<H1>CMPH - Examples</H1>
</CENTER>
<P>
Using cmph is quite simple. Take a look at the following examples.
</P>
<HR NOSHADE SIZE=1>
<PRE>
#include &lt;cmph.h&gt;
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;
// Create minimal perfect hash function from in-memory vector
int main(int argc, char **argv)
{
    // Creating a filled vector
    unsigned int i = 0;
    const char *vector[] = {"aaaaaaaaaa", "bbbbbbbbbb", "cccccccccc", "dddddddddd", "eeeeeeeeee",
        "ffffffffff", "gggggggggg", "hhhhhhhhhh", "iiiiiiiiii", "jjjjjjjjjj"};
    unsigned int nkeys = 10;
    FILE* mphf_fd = fopen("temp.mph", "w");
    // Source of keys
    cmph_io_adapter_t *source = cmph_io_vector_adapter((char **)vector, nkeys);
    //Create minimal perfect hash function using the brz algorithm.
    cmph_config_t *config = cmph_config_new(source);
    cmph_config_set_algo(config, CMPH_BRZ);
    cmph_config_set_mphf_fd(config, mphf_fd);
    cmph_t *hash = cmph_new(config);
    cmph_config_destroy(config);
    cmph_dump(hash, mphf_fd);
    cmph_destroy(hash);
    fclose(mphf_fd);
    //Find key
    mphf_fd = fopen("temp.mph", "r");
    hash = cmph_load(mphf_fd);
    while (i &lt; nkeys) {
        const char *key = vector[i];
        unsigned int id = cmph_search(hash, key, (cmph_uint32)strlen(key));
        fprintf(stderr, "key:%s -- hash:%u\n", key, id);
        i++;
    }
    //Destroy hash
    cmph_destroy(hash);
    cmph_io_vector_adapter_destroy(source);
    fclose(mphf_fd);
    return 0;
}
</PRE>
<P>
Download <A HREF="examples/vector_adapter_ex1.c">vector_adapter_ex1.c</A>. This example does not work in versions below 0.6.
</P>
<HR NOSHADE SIZE=1>
<PRE>
#include &lt;cmph.h&gt;
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;
// Create minimal perfect hash function from in-memory vector of structs
#pragma pack(1)
typedef struct {
    cmph_uint32 id;
    char key[11];
    cmph_uint32 year;
} rec_t;
// Restore the default packing
#pragma pack()
int main(int argc, char **argv)
{
    // Creating a filled vector
    unsigned int i = 0;
    rec_t vector[10] = {{1, "aaaaaaaaaa", 1999}, {2, "bbbbbbbbbb", 2000}, {3, "cccccccccc", 2001},
        {4, "dddddddddd", 2002}, {5, "eeeeeeeeee", 2003}, {6, "ffffffffff", 2004},
        {7, "gggggggggg", 2005}, {8, "hhhhhhhhhh", 2006}, {9, "iiiiiiiiii", 2007},
        {10,"jjjjjjjjjj", 2008}};
    unsigned int nkeys = 10;
    FILE* mphf_fd = fopen("temp_struct_vector.mph", "w");
    // Source of keys: struct size, offset of the key inside the struct, key length, number of keys
    cmph_io_adapter_t *source = cmph_io_struct_vector_adapter(vector, (cmph_uint32)sizeof(rec_t), (cmph_uint32)sizeof(cmph_uint32), 11, nkeys);
    //Create minimal perfect hash function using the BDZ algorithm.
    cmph_config_t *config = cmph_config_new(source);
    cmph_config_set_algo(config, CMPH_BDZ);
    cmph_config_set_mphf_fd(config, mphf_fd);
    cmph_t *hash = cmph_new(config);
    cmph_config_destroy(config);
    cmph_dump(hash, mphf_fd);
    cmph_destroy(hash);
    fclose(mphf_fd);
    //Find key
    mphf_fd = fopen("temp_struct_vector.mph", "r");
    hash = cmph_load(mphf_fd);
    while (i &lt; nkeys) {
        const char *key = vector[i].key;
        unsigned int id = cmph_search(hash, key, 11);
        fprintf(stderr, "key:%s -- hash:%u\n", key, id);
        i++;
    }
    //Destroy hash
    cmph_destroy(hash);
    cmph_io_struct_vector_adapter_destroy(source);
    fclose(mphf_fd);
    return 0;
}
</PRE>
<P>
Download <A HREF="examples/struct_vector_adapter_ex3.c">struct_vector_adapter_ex3.c</A>. This example does not work in versions below 0.8.
</P>
<HR NOSHADE SIZE=1>
<PRE>
#include &lt;cmph.h&gt;
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;
// Create minimal perfect hash function from in-disk keys using BDZ algorithm
int main(int argc, char **argv)
{
    //Open file with newline separated list of keys
    FILE * keys_fd = fopen("keys.txt", "r");
    cmph_t *hash = NULL;
    if (keys_fd == NULL)
    {
        fprintf(stderr, "File \"keys.txt\" not found\n");
        exit(1);
    }
    // Source of keys
    cmph_io_adapter_t *source = cmph_io_nlfile_adapter(keys_fd);
    cmph_config_t *config = cmph_config_new(source);
    cmph_config_set_algo(config, CMPH_BDZ);
    hash = cmph_new(config);
    cmph_config_destroy(config);
    //Find key
    const char *key = "jjjjjjjjjj";
    unsigned int id = cmph_search(hash, key, (cmph_uint32)strlen(key));
    fprintf(stderr, "Id:%u\n", id);
    //Destroy hash
    cmph_destroy(hash);
    cmph_io_nlfile_adapter_destroy(source);
    fclose(keys_fd);
    return 0;
}
</PRE>
<P>
Download <A HREF="examples/file_adapter_ex2.c">file_adapter_ex2.c</A> and <A HREF="examples/keys.txt">keys.txt</A>. This example does not work in versions below 0.8.
</P>
<HR NOSHADE SIZE=1>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><A HREF="index.html">Home</A></TD>
<TD><A HREF="chd.html">CHD</A></TD>
<TD><A HREF="bdz.html">BDZ</A></TD>
<TD><A HREF="bmz.html">BMZ</A></TD>
<TD><A HREF="chm.html">CHM</A></TD>
<TD><A HREF="brz.html">BRZ</A></TD>
<TD><A HREF="fch.html">FCH</A></TD>
</TR>
</TABLE>
<HR NOSHADE SIZE=1>
<P>
Enjoy!
</P>
<P>
<A HREF="mailto:davi@users.sourceforge.net">Davi de Castro Reis</A>
</P>
<P>
<A HREF="mailto:db8192@users.sourceforge.net">Djamel Belazzougui</A>
</P>
<P>
<A HREF="mailto:fc_botelho@users.sourceforge.net">Fabiano Cupertino Botelho</A>
</P>
<P>
<A HREF="mailto:nivio@dcc.ufmg.br">Nivio Ziviani</A>
</P>
<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
try {
var pageTracker = _gat._getTracker("UA-7698683-2");
pageTracker._trackPageview();
} catch(err) {}</script>
<!-- html code generated by txt2tags 2.6 (http://txt2tags.org) -->
<!-- cmdline: txt2tags -t html -i EXAMPLES.t2t -o docs/examples.html -->
</BODY></HTML>

111
deps/cmph/docs/faq.html vendored Normal file

@@ -0,0 +1,111 @@
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<META NAME="generator" CONTENT="http://txt2tags.org">
<LINK REL="stylesheet" TYPE="text/css" HREF="DOC.css">
<TITLE>CMPH FAQ</TITLE>
</HEAD><BODY BGCOLOR="white" TEXT="black">
<CENTER>
<H1>CMPH FAQ</H1>
</CENTER>
<UL>
<LI>How do I define the ids of the keys?
</UL>
<BLOCKQUOTE>
- You don't. The ids will be assigned by the algorithm creating the minimal
perfect hash function. If the algorithm creates an <B>ordered</B> minimal
perfect hash function, the ids will be the indices of the keys in the
input. Otherwise, you have no guarantee of the distribution of the ids.
</BLOCKQUOTE>
<UL>
<LI>Why do I always get the error "Unable to create minimum perfect hashing function"?
</UL>
<BLOCKQUOTE>
- The algorithms do not guarantee that a minimal perfect hash function can
be created. In practice, it will always work if your input
is big enough (&gt;100 keys).
The error is probably because you have duplicated
keys in the input. You must guarantee that the keys are unique in the
input. If you are using a UN*X based OS, try doing
</BLOCKQUOTE>
<PRE>
sort input.txt | uniq &gt; input_uniq.txt
</PRE>
<BLOCKQUOTE>
and run cmph with input_uniq.txt
</BLOCKQUOTE>
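<P>
When the keys live in memory rather than in a text file, the same uniqueness check can be done before building the function. The sketch below is an illustration under that assumption and is not part of the CMPH API; the name <I>has_duplicate_keys</I> is hypothetical. It sorts a copy of the key pointers and compares neighbours, mirroring the sort and uniq pipeline.
</P>

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator for an array of C-string pointers. */
static int compare_keys(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Returns 1 if some key occurs more than once, 0 if all keys are unique,
 * and -1 if the temporary copy could not be allocated. */
int has_duplicate_keys(const char *const *keys, size_t nkeys)
{
    const char **sorted;
    size_t i;
    int dup = 0;

    assert(keys != NULL || nkeys == 0);
    if (nkeys < 2)
        return 0;
    sorted = malloc(nkeys * sizeof *sorted);
    if (sorted == NULL)
        return -1;
    memcpy(sorted, keys, nkeys * sizeof *sorted);   /* leave the caller's order untouched */
    qsort(sorted, nkeys, sizeof *sorted, compare_keys);
    for (i = 1; i < nkeys; i++) {
        if (strcmp(sorted[i - 1], sorted[i]) == 0) { /* equal neighbours mean a duplicate */
            dup = 1;
            break;
        }
    }
    free(sorted);
    return dup;
}
```

<P>
A caller could run this check on the key vector and only proceed to build the hash function when it returns 0.
</P>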
<UL>
<LI>Why is the default (jenkins) hash function still executed after I change the hash function
with the cmph_config_set_hashfuncs function?
</UL>
<BLOCKQUOTE>
- You are probably calling the cmph_config_set_algo function after
cmph_config_set_hashfuncs. The hash function is reset to the default
when you call cmph_config_set_algo, so call cmph_config_set_algo first.
</BLOCKQUOTE>
<UL>
<LI>What should I do when I get the following error?
</UL>
<BLOCKQUOTE>
- Error: <B>error while loading shared libraries: libcmph.so.0: cannot open shared object file: No such file or directory</B>
</BLOCKQUOTE>
<BLOCKQUOTE>
- Solution: type <B>export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib/</B> at the shell or put that shell command
in your .profile file or in the /etc/profile file.
</BLOCKQUOTE>
<HR NOSHADE SIZE=1>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><A HREF="index.html">Home</A></TD>
<TD><A HREF="chd.html">CHD</A></TD>
<TD><A HREF="bdz.html">BDZ</A></TD>
<TD><A HREF="bmz.html">BMZ</A></TD>
<TD><A HREF="chm.html">CHM</A></TD>
<TD><A HREF="brz.html">BRZ</A></TD>
<TD><A HREF="fch.html">FCH</A></TD>
</TR>
</TABLE>
<HR NOSHADE SIZE=1>
<P>
Enjoy!
</P>
<P>
<A HREF="mailto:davi@users.sourceforge.net">Davi de Castro Reis</A>
</P>
<P>
<A HREF="mailto:db8192@users.sourceforge.net">Djamel Belazzougui</A>
</P>
<P>
<A HREF="mailto:fc_botelho@users.sourceforge.net">Fabiano Cupertino Botelho</A>
</P>
<P>
<A HREF="mailto:nivio@dcc.ufmg.br">Nivio Ziviani</A>
</P>
<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
try {
var pageTracker = _gat._getTracker("UA-7698683-2");
pageTracker._trackPageview();
} catch(err) {}</script>
<!-- html code generated by txt2tags 2.6 (http://txt2tags.org) -->
<!-- cmdline: txt2tags -t html -i FAQ.t2t -o docs/faq.html -->
</BODY></HTML>

110
deps/cmph/docs/fch.html vendored Normal file

@@ -0,0 +1,110 @@
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<META NAME="generator" CONTENT="http://txt2tags.org">
<LINK REL="stylesheet" TYPE="text/css" HREF="DOC.css">
<TITLE>FCH Algorithm</TITLE>
</HEAD><BODY BGCOLOR="white" TEXT="black">
<CENTER>
<H1>FCH Algorithm</H1>
</CENTER>
<HR NOSHADE SIZE=1>
<H2>The Algorithm</H2>
<P>
The algorithm is presented in <A HREF="#papers">[1</A>].
</P>
<HR NOSHADE SIZE=1>
<H2>Memory Consumption</H2>
<P>
Now we detail the memory consumption to generate and to store minimal perfect hash functions
using the FCH algorithm. The structures responsible for memory consumption are the
following:
</P>
<UL>
<LI>A vector containing all the <I>n</I> keys.
<LI>Data structure to speed up the searching step:
<OL>
<LI><B>random_table</B>: is a vector used to remember currently empty slots in the hash table. It stores <I>n</I> 4 byte long integer numbers. This vector initially contains a random permutation of the <I>n</I> hash addresses. A pointer called filled_count is used to keep the invariant that any slots to the right side of filled_count (inclusive) are empty and any ones to the left are filled.
<LI><B>hash_table</B>: Table used to check whether all the collisions were resolved. It has <I>n</I> entries of one byte.
<LI><B>map_table</B>: For any unfilled slot <I>x</I> in hash_table, the map_table vector contains <I>n</I> 4 byte long pointers pointing at random_table such that random_table[map_table[x]] = x. Thus, given an empty slot x in the hash_table, we can locate its position in the random_table vector through map_table.
<P></P>
</OL>
<LI>Other auxiliary structures
<OL>
<LI><B>sorted_indexes</B>: is a vector of <I>cn/(log(n) + 1)</I> 4 byte long pointers to indirectly keep the buckets sorted by decreasing order of their sizes.
<P></P>
<LI><B>function <I>g</I></B>: is represented by a vector of <I>cn/(log(n) + 1)</I> 4 byte long integer numbers, one for each bucket. It is used to spread all the keys in a given bucket into the hash table without collisions.
</OL>
</UL>
<P>
Thus, the total memory consumption of FCH algorithm for generating a minimal
perfect hash function (MPHF) is: <I>O(n) + 9n + 8cn/(log(n) + 1)</I> bytes.
The value of parameter <I>c</I> must be greater than or equal to 2.6.
</P>
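<P>
The relationship between random_table, map_table and filled_count described in the list above can be sketched as follows. This is an illustrative reading of that description, not the library's code: the type <I>slot_index_t</I> and both function names are hypothetical, the sketch starts from the identity permutation instead of a random one, and it keeps map_table entries for filled slots as well.
</P>

```c
#include <assert.h>

/* Hypothetical container for the two index vectors. */
typedef struct {
    unsigned *random_table; /* permutation of the n hash addresses */
    unsigned *map_table;    /* inverse: random_table[map_table[x]] == x */
    unsigned filled_count;  /* positions left of this index hold filled slots */
    unsigned n;
} slot_index_t;

/* Start with all slots empty. A real implementation would shuffle random_table. */
void slot_index_init(slot_index_t *s, unsigned *random_table, unsigned *map_table, unsigned n)
{
    unsigned i;
    s->random_table = random_table;
    s->map_table = map_table;
    s->filled_count = 0;
    s->n = n;
    for (i = 0; i < n; i++) {
        random_table[i] = i;
        map_table[i] = i;
    }
}

/* Mark slot x as filled. Returns 0 on success, -1 if x is out of range or already filled. */
int slot_index_fill(slot_index_t *s, unsigned x)
{
    unsigned pos, first_empty, y;

    if (x >= s->n)
        return -1;
    pos = s->map_table[x];
    assert(s->random_table[pos] == x);   /* the invariant from the text */
    if (pos < s->filled_count)
        return -1;                       /* x already sits in the filled region */
    first_empty = s->filled_count;
    y = s->random_table[first_empty];
    /* Swap x into the first empty position and fix both inverse entries. */
    s->random_table[first_empty] = x;
    s->random_table[pos] = y;
    s->map_table[x] = first_empty;
    s->map_table[y] = pos;
    s->filled_count++;
    return 0;
}
```

<P>
After each call, every position to the left of filled_count holds a filled slot and every position from filled_count onwards holds an empty one, so locating or picking an empty slot takes constant time.
</P>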
<P>
Now we present the memory consumption to store the resulting function.
We only need to store the <I>g</I> function and a constant number of bytes for the seed of the hash functions used in the resulting MPHF. Thus, we need <I>cn/(log(n) + 1) + O(1)</I> bytes.
</P>
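<P>
To get a feel for the two expressions above, they can be evaluated for concrete values of <I>n</I> and <I>c</I>. The sketch below makes two loud assumptions: the logarithm is taken in base 2, and it is rounded down to an integer so that the code needs no math library. The O(n) key vector and the O(1) seed bytes are left out, and all function names are hypothetical.
</P>

```c
#include <assert.h>

/* floor(log2(n)) for n >= 1, computed with shifts (assumption: base-2 logarithm). */
unsigned floor_log2(unsigned long n)
{
    unsigned r = 0;
    assert(n >= 1);
    while (n >>= 1)
        r++;
    return r;
}

/* The recurring term cn/(log(n) + 1). */
double fch_bucket_term(unsigned long n, double c)
{
    return c * (double)n / ((double)floor_log2(n) + 1.0);
}

/* Construction estimate in bytes, excluding the key vector: 9n + 8cn/(log(n) + 1). */
double fch_build_bytes(unsigned long n, double c)
{
    assert(c >= 2.6);   /* lower bound on c stated in the text */
    return 9.0 * (double)n + 8.0 * fch_bucket_term(n, c);
}

/* Storage estimate for the g function as written in the text, excluding the seeds. */
double fch_store_bytes(unsigned long n, double c)
{
    return fch_bucket_term(n, c);
}
```

<P>
Under these assumptions, one million keys with <I>c</I> = 2.6 give a bucket term of 130,000 and a construction estimate of roughly 10,040,000 bytes on top of the keys themselves.
</P>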
<HR NOSHADE SIZE=1>
<A NAME="papers"></A>
<H2>Papers</H2>
<OL>
<LI>E.A. Fox, Q.F. Chen, and L.S. Heath. <A HREF="papers/fch92.pdf">A faster algorithm for constructing minimal perfect hash functions.</A> In Proc. 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 266-273, 1992.
</OL>
<HR NOSHADE SIZE=1>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><A HREF="index.html">Home</A></TD>
<TD><A HREF="chd.html">CHD</A></TD>
<TD><A HREF="bdz.html">BDZ</A></TD>
<TD><A HREF="bmz.html">BMZ</A></TD>
<TD><A HREF="chm.html">CHM</A></TD>
<TD><A HREF="brz.html">BRZ</A></TD>
<TD><A HREF="fch.html">FCH</A></TD>
</TR>
</TABLE>
<HR NOSHADE SIZE=1>
<P>
Enjoy!
</P>
<P>
<A HREF="mailto:davi@users.sourceforge.net">Davi de Castro Reis</A>
</P>
<P>
<A HREF="mailto:db8192@users.sourceforge.net">Djamel Belazzougui</A>
</P>
<P>
<A HREF="mailto:fc_botelho@users.sourceforge.net">Fabiano Cupertino Botelho</A>
</P>
<P>
<A HREF="mailto:nivio@dcc.ufmg.br">Nivio Ziviani</A>
</P>
<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
try {
var pageTracker = _gat._getTracker("UA-7698683-2");
pageTracker._trackPageview();
} catch(err) {}</script>
<!-- html code generated by txt2tags 2.6 (http://txt2tags.org) -->
<!-- cmdline: txt2tags -t html -i FCH.t2t -o docs/fch.html -->
</BODY></HTML>

99
deps/cmph/docs/gperf.html vendored Normal file

@@ -0,0 +1,99 @@
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<META NAME="generator" CONTENT="http://txt2tags.org">
<LINK REL="stylesheet" TYPE="text/css" HREF="DOC.css">
<TITLE>GPERF versus CMPH</TITLE>
</HEAD><BODY BGCOLOR="white" TEXT="black">
<CENTER>
<H1>GPERF versus CMPH</H1>
</CENTER>
<P>
You might ask why cmph if <A HREF="http://www.gnu.org/software/gperf/gperf.html">gperf</A>
already works perfectly. Actually, gperf and cmph have different goals.
Basically, these are the requirements for each of them:
</P>
<UL>
<LI>GPERF
<P></P>
</UL>
<BLOCKQUOTE>
- Create very fast hash functions for <B>small</B> sets
</BLOCKQUOTE>
<BLOCKQUOTE>
- Create <B>perfect</B> hash functions
</BLOCKQUOTE>
<UL>
<LI>CMPH
<P></P>
</UL>
<BLOCKQUOTE>
- Create very fast hash functions for <B>very large</B> sets
</BLOCKQUOTE>
<BLOCKQUOTE>
- Create <B>minimal perfect</B> hash functions
</BLOCKQUOTE>
<P>
As a result, cmph can be used to create hash functions where gperf would run
forever without finding a perfect hash function, because of the running
time of the algorithm and the large memory usage.
On the other hand, functions created by cmph are about 2x slower than those
created by gperf.
</P>
<P>
So, if you have large sets, or memory usage is a key restriction for you, stick
to cmph. If you have small sets, and do not care about memory usage, go with
gperf. The first problem is common in the information retrieval field (e.g.
assigning ids to millions of documents), while the latter is usually found in
the compiler programming area (detecting reserved keywords).
</P>
<HR NOSHADE SIZE=1>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><A HREF="index.html">Home</A></TD>
<TD><A HREF="chd.html">CHD</A></TD>
<TD><A HREF="bdz.html">BDZ</A></TD>
<TD><A HREF="bmz.html">BMZ</A></TD>
<TD><A HREF="chm.html">CHM</A></TD>
<TD><A HREF="brz.html">BRZ</A></TD>
<TD><A HREF="fch.html">FCH</A></TD>
</TR>
</TABLE>
<HR NOSHADE SIZE=1>
<P>
Enjoy!
</P>
<P>
<A HREF="mailto:davi@users.sourceforge.net">Davi de Castro Reis</A>
</P>
<P>
<A HREF="mailto:db8192@users.sourceforge.net">Djamel Belazzougui</A>
</P>
<P>
<A HREF="mailto:fc_botelho@users.sourceforge.net">Fabiano Cupertino Botelho</A>
</P>
<P>
<A HREF="mailto:nivio@dcc.ufmg.br">Nivio Ziviani</A>
</P>
<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
try {
var pageTracker = _gat._getTracker("UA-7698683-2");
pageTracker._trackPageview();
} catch(err) {}</script>
<!-- html code generated by txt2tags 2.6 (http://txt2tags.org) -->
<!-- cmdline: txt2tags -t html -i GPERF.t2t -o docs/gperf.html -->
</BODY></HTML>

392
deps/cmph/docs/index.html vendored Normal file

@@ -0,0 +1,392 @@
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<META NAME="generator" CONTENT="http://txt2tags.org">
<LINK REL="stylesheet" TYPE="text/css" HREF="DOC.css">
<TITLE>CMPH - C Minimal Perfect Hashing Library</TITLE>
</HEAD><BODY BGCOLOR="white" TEXT="black">
<CENTER>
<H1>CMPH - C Minimal Perfect Hashing Library</H1>
</CENTER>
<HR NOSHADE SIZE=1>
<H2>Motivation</H2>
<P>
A perfect hash function maps a static set of n keys into a set of m integer numbers without collisions, where m is greater than or equal to n. If m is equal to n, the function is called minimal.
</P>
<P>
<A HREF="concepts.html">Minimal perfect hash functions</A> are widely used for memory-efficient storage and fast retrieval of items from static sets, such as words in natural languages, reserved words in programming languages or interactive systems, uniform resource locators (URLs) in Web search engines, or item sets in data mining techniques. Therefore, there are applications for minimal perfect hash functions in information retrieval systems, database systems, language translation systems, electronic commerce systems, compilers, and operating systems, among others.
</P>
<P>
The use of minimal perfect hash functions has, until now, been restricted to scenarios where the set of keys being hashed is small, because of the limitations of current algorithms. But in many cases, dealing with huge sets of keys is crucial. So, this project gives the free software community an API that will work with sets on the order of billions of keys.
</P>
<P>
Probably the most interesting application of minimal perfect hash functions is their use as an indexing structure for databases. The most popular data structure used as an indexing structure in databases is the B+ tree. In fact, the B+ tree is widely used for dynamic applications with frequent insertions and deletions of records. However, for applications with sporadic modifications and a huge number of queries the B+ tree is not the best option, because practical deployments of this structure are extremely complex and perform poorly with very large sets of keys, such as those required by new-frontier <A HREF="http://acmqueue.com/modules.php?name=Content&pa=showpage&pid=299">database applications</A>.
</P>
<P>
For example, in the information retrieval field, working with huge collections is a daily task. The simple assignment of ids to the web pages of a collection can be a challenging task. While traditional databases simply cannot handle more traffic once the working set of web page urls no longer fits in main memory, minimal perfect hash functions can easily scale to hundreds of millions of entries using stock hardware.
</P>
<P>
As there are lots of applications for minimal perfect hash functions, it is important to implement memory and time efficient algorithms for constructing such functions. The lack of similar libraries in the free software world has been the main motivation to create the C Minimal Perfect Hashing Library (<A HREF="gperf.html">gperf is a bit different</A>, since it was conceived to create very fast perfect hash functions for small sets of keys and CMPH Library was conceived to create minimal perfect hash functions for very large sets of keys). C Minimal Perfect Hashing Library is a portable LGPLed library to generate and to work with very efficient minimal perfect hash functions.
</P>
<HR NOSHADE SIZE=1>
<H2>Description</H2>
<P>
The CMPH Library encapsulates the newest and most efficient algorithms in an easy-to-use, production-quality, fast API. The library was designed to work with large inputs that cannot fit in main memory. It has been used successfully for constructing minimal perfect hash functions for sets with more than 100 million keys, and we intend to expand this number to the order of billions of keys. Although there is a lack of similar libraries, we can point out some of the distinguishing features of the CMPH Library:
</P>
<UL>
<LI>Fast.
<LI>Space-efficient with main memory usage carefully documented.
<LI>The best modern algorithms are available (or at least scheduled for implementation :-)).
<LI>Works with in-disk key sets through the adapter pattern.
<LI>Serialization of hash functions.
<LI>Portable C code (currently works on GNU/Linux and WIN32 and is reported to work in OpenBSD and Solaris).
<LI>Object-oriented implementation.
<LI>Easily extensible.
<LI>Well-encapsulated API aiming at binary compatibility across releases.
<LI>Free Software.
</UL>
<HR NOSHADE SIZE=1>
<H2>Supported Algorithms</H2>
<UL>
<LI><A HREF="chd.html">CHD Algorithm</A>:
<UL>
<LI>It is the fastest algorithm to build PHFs and MPHFs in linear time.
<LI>It generates the most compact PHFs and MPHFs we know of.
<LI>It can generate PHFs with a load factor up to <I>99 %</I>.
<LI>It can be used to generate <I>t</I>-perfect hash functions. A <I>t</I>-perfect hash function allows at most <I>t</I> collisions in a given bin. It is a well-known fact that modern memories are organized as blocks, which constitute the unit of transfer. Examples of such blocks are cache lines for internal memory or sectors for hard disks. Thus, it can be very useful for devices that carry out I/O operations in blocks.
<LI>It is a two-level scheme. It uses a first-level hash function to split the key set into buckets of average size determined by a parameter <I>b</I> in the range <I>[1,32]</I>. In the second level it uses displacement values to resolve the collisions that have given rise to the buckets.
<LI>It can generate MPHFs that can be stored in approximately <I>2.07</I> bits per key.
<LI>For a load factor equal to the maximum one that is achieved by the BDZ algorithm (<I>81 %</I>), the resulting PHFs are stored in approximately <I>1.40</I> bits per key.
</UL>
<LI><A HREF="bdz.html">BDZ Algorithm</A>:
<UL>
<LI>It is very simple and efficient. It outperforms all the ones below.
<LI>It constructs both PHFs and MPHFs in linear time.
<LI>The maximum load factor one can achieve for a PHF is <I>1/1.23</I>.
<LI>It is based on acyclic random 3-graphs. A 3-graph is a generalization of a graph where each edge connects 3 vertices instead of only 2.
<LI>The resulting MPHFs are not order preserving.
<LI>The resulting MPHFs can be stored in only <I>(2 + x)cn</I> bits, where <I>c</I> should be larger than or equal to <I>1.23</I> and <I>x</I> is a constant larger than <I>0</I> (actually, x = 1/b and b is a parameter that should be larger than 2). For <I>c = 1.23</I> and <I>b = 8</I>, the resulting functions are stored in approximately 2.6 bits per key.
<LI>For its maximum load factor (<I>81 %</I>), the resulting PHFs are stored in approximately <I>1.95</I> bits per key.
</UL>
<LI><A HREF="bmz.html">BMZ Algorithm</A>:
<UL>
<LI>It constructs MPHFs in linear time.
<LI>It is based on cyclic random graphs. This makes it faster than the CHM algorithm.
<LI>The resulting MPHFs are not order preserving.
<LI>The resulting MPHFs are more compact than the ones generated by the CHM algorithm and can be stored in <I>4cn</I> bytes, where <I>c</I> is in the range <I>[0.93,1.15]</I>.
</UL>
<LI><A HREF="brz.html">BRZ Algorithm</A>:
<UL>
<LI>A very fast external memory based algorithm for constructing minimal perfect hash functions for sets in the order of billions of keys.
<LI>It works in linear time.
<LI>The resulting MPHFs are not order preserving.
<LI>The resulting MPHFs can be stored using less than <I>8.0</I> bits per key.
</UL>
<LI><A HREF="chm.html">CHM Algorithm</A>:
<UL>
<LI>It constructs MPHFs in linear time.
<LI>It is based on acyclic random graphs.
<LI>The resulting MPHFs are order preserving.
<LI>The resulting MPHFs are stored in <I>4cn</I> bytes, where <I>c</I> is greater than 2.
</UL>
<LI><A HREF="fch.html">FCH Algorithm</A>:
<UL>
<LI>It constructs minimal perfect hash functions that require less than 4 bits per key to be stored.
<LI>The resulting MPHFs are very compact and very efficient at evaluation time.
<LI>The algorithm is only efficient for small sets.
<LI>It is used as the internal algorithm in the BRZ algorithm to efficiently solve larger problems while still generating MPHFs that require approximately 4.1 bits per key to be stored. For that, you just need to set the parameter -a to brz and -c to a value larger than or equal to 2.6.
</UL>
</UL>
<HR NOSHADE SIZE=1>
<H2>News for version 2.0</H2>
<P>
Cleaned up most warnings for the C code.
</P>
<P>
Experimental C++ interface (--enable-cxxmph) implementing the BDZ algorithm in
a convenient interface, which serves as the basis
for drop-in replacements for std::unordered_map, sparsehash::sparse_hash_map
and sparsehash::dense_hash_map. Potentially faster lookup time at the expense
of insertion time. See cxxmph/mph_map.h and cxxmph/mph_index.h for details.
</P>
<H2>News for version 1.1</H2>
<P>
Fixed a bug in the chd_pc algorithm and reorganized tests.
</P>
<H2>News for version 1.0</H2>
<P>
This is a bugfix only version, after which a revamp of the cmph code and
algorithms will be done.
</P>
<H2>News for version 0.9</H2>
<UL>
<LI><A HREF="chd.html">The CHD algorithm</A>, which is an algorithm that can be tuned to generate MPHFs that require approximately 2.07 bits per key to be stored. The algorithm outperforms <A HREF="bdz.html">the BDZ algorithm</A> and therefore is the fastest one available in the literature for sets that can be treated in internal memory.
<LI><A HREF="chd.html">The CHD_PH algorithm</A>, which is an algorithm to generate PHFs with load factor up to <I>99 %</I>. It is actually the CHD algorithm without the ranking step. If we set the load factor to <I>81 %</I>, which is the maximum that can be obtained with <A HREF="bdz.html">the BDZ algorithm</A>, the resulting functions can be stored in <I>1.40</I> bits per key. The space requirement increases with the load factor.
<LI>All reported bugs and suggestions have been corrected and included as well.
</UL>
<H2>News for version 0.8</H2>
<UL>
<LI><A HREF="bdz.html">An algorithm to generate MPHFs that require around 2.6 bits per key to be stored</A>, which is referred to as BDZ algorithm. The algorithm is the fastest one available in the literature for sets that can be treated in internal memory.
<LI><A HREF="bdz.html">An algorithm to generate PHFs with range m = cn, for c &gt; 1.22</A>, which is referred to as BDZ_PH algorithm. It is actually the BDZ algorithm without the ranking step. The resulting functions can be stored in 1.95 bits per key for <I>c = 1.23</I> and are considerably faster than the MPHFs generated by the BDZ algorithm.
<LI>An adapter to support a vector of struct as the source of keys has been added.
<LI>An API for packing a perfect hash function into a preallocated contiguous memory space. A packed function is even faster to evaluate and can easily be mmapped.
<LI>The hash functions djb2, fnv and sdbm were removed because they do not use random seeds and therefore are not useful for MPHF algorithms.
<LI>All reported bugs and suggestions have been corrected and included as well.
</UL>
<P>
<A HREF="newslog.html">News log</A>
</P>
<HR NOSHADE SIZE=1>
<H2>Examples</H2>
<P>
Using cmph is quite simple. Take a look.
</P>
<PRE>
#include &lt;cmph.h&gt;
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;
// Create minimal perfect hash function from in-memory vector
int main(int argc, char **argv)
{
// Creating a filled vector
unsigned int i = 0;
const char *vector[] = {"aaaaaaaaaa", "bbbbbbbbbb", "cccccccccc", "dddddddddd", "eeeeeeeeee",
"ffffffffff", "gggggggggg", "hhhhhhhhhh", "iiiiiiiiii", "jjjjjjjjjj"};
unsigned int nkeys = 10;
FILE* mphf_fd = fopen("temp.mph", "w");
// Source of keys
cmph_io_adapter_t *source = cmph_io_vector_adapter((char **)vector, nkeys);
//Create minimal perfect hash function using the brz algorithm.
cmph_config_t *config = cmph_config_new(source);
cmph_config_set_algo(config, CMPH_BRZ);
cmph_config_set_mphf_fd(config, mphf_fd);
cmph_t *hash = cmph_new(config);
cmph_config_destroy(config);
cmph_dump(hash, mphf_fd);
cmph_destroy(hash);
fclose(mphf_fd);
//Find key
mphf_fd = fopen("temp.mph", "r");
hash = cmph_load(mphf_fd);
while (i &lt; nkeys) {
const char *key = vector[i];
unsigned int id = cmph_search(hash, key, (cmph_uint32)strlen(key));
fprintf(stderr, "key:%s -- hash:%u\n", key, id);
i++;
}
//Destroy hash
cmph_destroy(hash);
cmph_io_vector_adapter_destroy(source);
fclose(mphf_fd);
return 0;
}
</PRE>
<P>
Download <A HREF="examples/vector_adapter_ex1.c">vector_adapter_ex1.c</A>. This example does not work in versions below 0.6. You need to update the sources from GIT to make it work.
</P>
<HR NOSHADE SIZE=1>
<PRE>
#include &lt;cmph.h&gt;
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;
// Create minimal perfect hash function from in-disk keys using BDZ algorithm
int main(int argc, char **argv)
{
//Open file with newline separated list of keys
FILE * keys_fd = fopen("keys.txt", "r");
cmph_t *hash = NULL;
if (keys_fd == NULL)
{
fprintf(stderr, "File \"keys.txt\" not found\n");
exit(1);
}
// Source of keys
cmph_io_adapter_t *source = cmph_io_nlfile_adapter(keys_fd);
cmph_config_t *config = cmph_config_new(source);
cmph_config_set_algo(config, CMPH_BDZ);
hash = cmph_new(config);
cmph_config_destroy(config);
//Find key
const char *key = "jjjjjjjjjj";
unsigned int id = cmph_search(hash, key, (cmph_uint32)strlen(key));
fprintf(stderr, "Id:%u\n", id);
//Destroy hash
cmph_destroy(hash);
cmph_io_nlfile_adapter_destroy(source);
fclose(keys_fd);
return 0;
}
</PRE>
<P>
Download <A HREF="examples/file_adapter_ex2.c">file_adapter_ex2.c</A> and <A HREF="examples/keys.txt">keys.txt</A>. This example does not work in versions below 0.8. You need to update the sources from GIT to make it work.
</P>
<P>
<A HREF="examples.html">Click here to see more examples</A>
</P>
<HR NOSHADE SIZE=1>
<H2>The cmph application</H2>
<P>
cmph is the name of both the library and the utility
application that comes with this package. You can use the cmph
application for constructing minimal perfect hash functions from the command line.
The cmph utility
comes with a number of flags, but it is very simple to create and to query
minimal perfect hash functions:
</P>
<PRE>
$ # Using the chm algorithm (default one) for constructing a mphf for keys in file keys_file
$ ./cmph -g keys_file
$ # Query id of keys in the file keys_query
$ ./cmph -m keys_file.mph keys_query
</PRE>
<P>
The additional options let you set most of the parameters you have
available through the C API. Below you can see the full help message for the
utility.
</P>
<PRE>
usage: cmph [-v] [-h] [-V] [-k nkeys] [-f hash_function] [-g [-c algorithm_dependent_value][-s seed] ]
[-a algorithm] [-M memory_in_MB] [-b algorithm_dependent_value] [-t keys_per_bin] [-d tmp_dir]
[-m file.mph] keysfile
Minimum perfect hashing tool
-h print this help message
-c c value determines:
* the number of vertices in the graph for the algorithms BMZ and CHM
* the number of bits per key required in the FCH algorithm
* the load factor in the CHD_PH algorithm
-a algorithm - valid values are
* bmz
* bmz8
* chm
* brz
* fch
* bdz
* bdz_ph
* chd_ph
* chd
-f hash function (may be used multiple times) - valid values are
* jenkins
-V print version number and exit
-v increase verbosity (may be used multiple times)
-k number of keys
-g generation mode
-s random seed
-m minimum perfect hash function file
-M main memory availability (in MB) used in BRZ algorithm
-d temporary directory used in BRZ algorithm
-b the meaning of this parameter depends on the algorithm selected in the -a option:
* For BRZ it is used to make the maximal number of keys in a bucket lower than 256.
In this case its value should be an integer in the range [64,175]. Default is 128.
* For BDZ it is used to determine the size of some precomputed rank
information and its value should be an integer in the range [3,10]. Default
is 7. The larger is this value, the more compact are the resulting functions
and the slower are them at evaluation time.
* For CHD and CHD_PH it is used to set the average number of keys per bucket
and its value should be an integer in the range [1,32]. Default is 4. The
larger is this value, the slower is the construction of the functions.
This parameter has no effect for other algorithms.
-t set the number of keys per bin for a t-perfect hashing function. A t-perfect
hash function allows at most t collisions in a given bin. This parameter applies
only to the CHD and CHD_PH algorithms. Its value should be an integer in the
range [1,128]. Defaul is 1
keysfile line separated file with keys
</PRE>
<H2>Additional Documentation</H2>
<P>
<A HREF="faq.html">FAQ</A>
</P>
<H2>Downloads</H2>
<P>
Use the github releases page at: <A HREF="https://github.com/bonitao/cmph/releases">https://github.com/bonitao/cmph/releases</A>
</P>
<H2>License Stuff</H2>
<P>
Code is under the LGPL and the MPL 1.1.
</P>
<HR NOSHADE SIZE=1>
<P>
Enjoy!
</P>
<P>
<A HREF="mailto:davi@users.sourceforge.net">Davi de Castro Reis</A>
</P>
<P>
<A HREF="mailto:db8192@users.sourceforge.net">Djamel Belazzougui</A>
</P>
<P>
<A HREF="mailto:fc_botelho@users.sourceforge.net">Fabiano Cupertino Botelho</A>
</P>
<P>
<A HREF="mailto:nivio@dcc.ufmg.br">Nivio Ziviani</A>
</P>
<a href="http://sourceforge.net"><img src="http://sourceforge.net/sflogo.php?group_id=96251&type=1" width="88" height="31" border="0" alt="SourceForge.net Logo" /> </a>
<P>
Last Updated: Fri Dec 28 23:50:29 2018
</P>
<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
try {
var pageTracker = _gat._getTracker("UA-7698683-2");
pageTracker._trackPageview();
} catch(err) {}</script>
<!-- html code generated by txt2tags 2.6 (http://txt2tags.org) -->
<!-- cmdline: txt2tags -t html -\-mask-email -i README.t2t -o docs/index.html -->
</BODY></HTML>

142
deps/cmph/docs/newslog.html vendored Normal file

@@ -0,0 +1,142 @@
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<META NAME="generator" CONTENT="http://txt2tags.org">
<LINK REL="stylesheet" TYPE="text/css" HREF="DOC.css">
<TITLE>News Log</TITLE>
</HEAD><BODY BGCOLOR="white" TEXT="black">
<CENTER>
<H1>News Log</H1>
</CENTER>
<HR NOSHADE SIZE=1>
<H2>News for version 1.1</H2>
<P>
Fixed a bug in the chd_pc algorithm and reorganized tests.
</P>
<H2>News for version 1.0</H2>
<P>
This is a bugfix only version, after which a revamp of the cmph code and
algorithms will be done.
</P>
<HR NOSHADE SIZE=1>
<H2>News for version 0.9</H2>
<UL>
<LI><A HREF="chd.html">The CHD algorithm</A>, which is an algorithm that can be tuned to generate MPHFs that require approximately 2.07 bits per key to be stored. The algorithm outperforms <A HREF="bdz.html">the BDZ algorithm</A> and therefore is the fastest one available in the literature for sets that can be treated in internal memory.
<LI><A HREF="chd.html">The CHD_PH algorithm</A>, which is an algorithm to generate PHFs with load factor up to <I>99 %</I>. It is actually the CHD algorithm without the ranking step. If we set the load factor to <I>81 %</I>, which is the maximum that can be obtained with <A HREF="bdz.html">the BDZ algorithm</A>, the resulting functions can be stored in <I>1.40</I> bits per key. The space requirement increases with the load factor.
<LI>All reported bugs and suggestions have been corrected and included as well.
</UL>
<HR NOSHADE SIZE=1>
<H2>News for version 0.8</H2>
<UL>
<LI><A HREF="bdz.html">An algorithm to generate MPHFs that require around 2.6 bits per key to be stored</A>, which is referred to as BDZ algorithm. The algorithm is the fastest one available in the literature for sets that can be treated in internal memory.
<LI><A HREF="bdz.html">An algorithm to generate PHFs with range m = cn, for c &gt; 1.22</A>, which is referred to as BDZ_PH algorithm. It is actually the BDZ algorithm without the ranking step. The resulting functions can be stored in 1.95 bits per key for <I>c = 1.23</I> and are considerably faster than the MPHFs generated by the BDZ algorithm.
<LI>An adapter to support a vector of struct as the source of keys has been added.
<LI>An API for packing a perfect hash function into a preallocated contiguous memory space. A packed function is even faster to evaluate and can easily be mmapped.
<LI>The hash functions djb2, fnv and sdbm were removed because they do not use random seeds and therefore are not useful for MPHF algorithms.
<LI>All reported bugs and suggestions have been corrected and included as well.
</UL>
<HR NOSHADE SIZE=1>
<H2>News for version 0.7</H2>
<UL>
<LI>Added man pages and a pkgconfig file.
</UL>
<HR NOSHADE SIZE=1>
<H2>News for version 0.6</H2>
<UL>
<LI><A HREF="fch.html">An algorithm to generate MPHFs that require less than 4 bits per key to be stored</A>, which is referred to as the FCH algorithm. The algorithm is only efficient for small sets.
<LI>The FCH algorithm is integrated with the <A HREF="brz.html">BRZ algorithm</A> so that you can efficiently generate space-efficient MPHFs for sets on the order of billions of keys.
<LI>All reported bugs and suggestions have been corrected and included as well.
</UL>
<HR NOSHADE SIZE=1>
<H2>News for version 0.5</H2>
<UL>
<LI>A thread safe vector adapter has been added.
<LI><A HREF="brz.html">A new algorithm for sets on the order of billions of keys that requires approximately 8.1 bits per key to store the resulting MPHFs.</A>
<LI>All reported bugs and suggestions have been corrected and included as well.
</UL>
<HR NOSHADE SIZE=1>
<H2>News for version 0.4</H2>
<UL>
<LI>Vector Adapter has been added.
<LI>An optimized version of bmz (bmz8) for small sets of keys (at most 256 keys) has been added.
<LI>All reported bugs and suggestions have been corrected and included as well.
</UL>
<HR NOSHADE SIZE=1>
<H2>News for version 0.3</H2>
<UL>
<LI>A new heuristic added to the bmz algorithm makes it possible to generate an MPHF using only
<I>24.80n + O(1)</I> bytes. The resulting function can be stored in <I>3.72n</I> bytes.
<A HREF="bmz.html#heuristic">Click here</A> for details.
</UL>
<HR NOSHADE SIZE=1>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><A HREF="index.html">Home</A></TD>
<TD><A HREF="chd.html">CHD</A></TD>
<TD><A HREF="bdz.html">BDZ</A></TD>
<TD><A HREF="bmz.html">BMZ</A></TD>
<TD><A HREF="chm.html">CHM</A></TD>
<TD><A HREF="brz.html">BRZ</A></TD>
<TD><A HREF="fch.html">FCH</A></TD>
</TR>
</TABLE>
<HR NOSHADE SIZE=1>
<P>
Enjoy!
</P>
<P>
<A HREF="mailto:davi@users.sourceforge.net">Davi de Castro Reis</A>
</P>
<P>
<A HREF="mailto:db8192@users.sourceforge.net">Djamel Belazzougui</A>
</P>
<P>
<A HREF="mailto:fc_botelho@users.sourceforge.net">Fabiano Cupertino Botelho</A>
</P>
<P>
<A HREF="mailto:nivio@dcc.ufmg.br">Nivio Ziviani</A>
</P>
<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
try {
var pageTracker = _gat._getTracker("UA-7698683-2");
pageTracker._trackPageview();
} catch(err) {}</script>
<!-- html code generated by txt2tags 2.6 (http://txt2tags.org) -->
<!-- cmdline: txt2tags -t html -i NEWSLOG.t2t -o docs/newslog.html -->
</BODY></HTML>

15
deps/cmph/examples/Makefile.am vendored Executable file

@@ -0,0 +1,15 @@
noinst_PROGRAMS = vector_adapter_ex1 file_adapter_ex2 struct_vector_adapter_ex3 small_set_ex4
AM_CPPFLAGS = -I../src/
vector_adapter_ex1_LDADD = ../src/libcmph.la
vector_adapter_ex1_SOURCES = vector_adapter_ex1.c
file_adapter_ex2_LDADD = ../src/libcmph.la
file_adapter_ex2_SOURCES = file_adapter_ex2.c
struct_vector_adapter_ex3_LDADD = ../src/libcmph.la
struct_vector_adapter_ex3_SOURCES = struct_vector_adapter_ex3.c
small_set_ex4_LDADD = ../src/libcmph.la
small_set_ex4_SOURCES = small_set_ex4.c

32
deps/cmph/examples/file_adapter_ex2.c vendored Normal file

@@ -0,0 +1,32 @@
#include <cmph.h>
#include <stdio.h>
#include <string.h>
// Create minimal perfect hash function from in-disk keys using BDZ algorithm
int main(int argc, char **argv)
{
//Open file with newline separated list of keys
FILE * keys_fd = fopen("keys.txt", "r");
cmph_t *hash = NULL;
if (keys_fd == NULL)
{
fprintf(stderr, "File \"keys.txt\" not found\n");
exit(1);
}
// Source of keys
cmph_io_adapter_t *source = cmph_io_nlfile_adapter(keys_fd);
cmph_config_t *config = cmph_config_new(source);
cmph_config_set_algo(config, CMPH_BDZ);
hash = cmph_new(config);
cmph_config_destroy(config);
//Find key
const char *key = "jjjjjjjjjj";
unsigned int id = cmph_search(hash, key, (cmph_uint32)strlen(key));
fprintf(stderr, "Id:%u\n", id);
//Destroy hash
cmph_destroy(hash);
cmph_io_nlfile_adapter_destroy(source);
fclose(keys_fd);
return 0;
}

10
deps/cmph/examples/keys.txt vendored Normal file

@@ -0,0 +1,10 @@
aaaaaaaaaa
bbbbbbbbbb
cccccccccc
dddddddddd
eeeeeeeeee
ffffffffff
gggggggggg
hhhhhhhhhh
iiiiiiiiii
jjjjjjjjjj

Some files were not shown because too many files have changed in this diff