
Add docs directory for github.

This commit is contained in:
Davi de Castro Reis 2018-12-28 23:53:09 -02:00
parent 815d089f34
commit a250982ade
16 changed files with 3790 additions and 17 deletions

4
README

@@ -302,7 +302,7 @@ FAQ (faq.html)
 Downloads
 =========
-Use the project page at sourceforge: http://sf.net/projects/cmph
+Use the github releases page at: https://github.com/bonitao/cmph/releases
 License Stuff
@@ -322,5 +322,5 @@ Fabiano Cupertino Botelho (fc_botelho@users.sourceforge.net)
 Nivio Ziviani (nivio@dcc.ufmg.br)
-Last Updated: Fri Jun 6 17:16:57 2014
+Last Updated: Fri Dec 28 23:50:31 2018


@@ -300,8 +300,7 @@ Minimum perfect hashing tool
 ==Downloads==
-Use the project page at sourceforge: http://sf.net/projects/cmph
+Use the github releases page at: https://github.com/bonitao/cmph/releases
 ==License Stuff==

315
docs/bdz.html Normal file

@@ -0,0 +1,315 @@
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<META NAME="generator" CONTENT="http://txt2tags.org">
<LINK REL="stylesheet" TYPE="text/css" HREF="DOC.css">
<TITLE>BDZ Algorithm</TITLE>
</HEAD><BODY BGCOLOR="white" TEXT="black">
<CENTER>
<H1>BDZ Algorithm</H1>
</CENTER>
<HR NOSHADE SIZE=1>
<H2>Introduction</H2>
<P>
The BDZ algorithm was designed by Fabiano C. Botelho, Djamal Belazzougui, Rasmus Pagh and Nivio Ziviani. It is a simple, efficient, near-optimal space and practical algorithm to generate a family <IMG ALIGN="bottom" SRC="figs/bdz/img8.png" BORDER="0" ALT=""> of PHFs and MPHFs. It is also referred to as the BPZ algorithm because of the work presented by Botelho, Pagh and Ziviani in <A HREF="#papers">[2</A>]. In Botelho's PhD dissertation <A HREF="#papers">[1</A>] it is also referred to as the RAM algorithm because it is more suitable for key sets that can be handled in internal memory.
</P>
<P>
The BDZ algorithm uses <I>r</I>-uniform random hypergraphs given by function values of <I>r</I> uniform random hash functions on the input key set <I>S</I> for generating PHFs and MPHFs that require <I>O(n)</I> bits to be stored. A hypergraph is the generalization of a standard undirected graph where each edge connects <IMG ALIGN="middle" SRC="figs/bdz/img12.png" BORDER="0" ALT=""> vertices. This idea is not new, see e.g. <A HREF="#papers">[8</A>], but we have proceeded differently to achieve a space usage of <I>O(n)</I> bits rather than <I>O(n log n)</I> bits. Evaluation time for all schemes considered is constant. For <I>r=3</I> we obtain a space usage of approximately <I>2.6n</I> bits for an MPHF. More compact, and even simpler, representations can be achieved for larger <I>m</I>. For example, for <I>m=1.23n</I> we can get a space usage of <I>1.95n</I> bits.
</P>
<P>
Our best MPHF space upper bound is within a factor of <I>2</I> of the information theoretical lower bound of approximately <I>1.44</I> bits per key. We have shown that the BDZ algorithm is far more practical than previous methods with proven space complexity, both because of its simplicity, and because the constant factor of the space complexity is more than <I>6</I> times lower than its closest competitor, for plausible problem sizes. We verify the practicality experimentally, using slightly more space than in the mentioned theoretical bounds.
</P>
<HR NOSHADE SIZE=1>
<H2>The Algorithm</H2>
<P>
The BDZ algorithm is a three-step algorithm that generates PHFs and MPHFs based on random <I>r</I>-partite hypergraphs. This approach provides a much tighter analysis and is much simpler than the one presented in <A HREF="#papers">[3</A>], where it was only implicit how to construct similar PHFs. The fastest and most compact functions are generated when <I>r=3</I>. In this case a PHF can be stored in approximately <I>1.95</I> bits per key and an MPHF in approximately <I>2.62</I> bits per key.
</P>
<P>
Figure 1 gives an overview of the algorithm for <I>r=3</I>, taking as input a key set <IMG ALIGN="middle" SRC="figs/bdz/img22.png" BORDER="0" ALT=""> containing three English words, i.e., <I>S={who,band,the}</I>. The edge-oriented data structure proposed in <A HREF="#papers">[4</A>] is used to represent hypergraphs, where each edge is explicitly represented as an array of <I>r</I> vertices and, for each vertex <I>v</I>, there is a list of edges that are incident on <I>v</I>.
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><IMG ALIGN="middle" SRC="figs/bdz/img50.png" BORDER="0" ALT=""></TD>
</TR>
<TR>
<TD><B>Figure 1:</B> (a) The mapping step generates a random acyclic <I>3</I>-partite hypergraph</TD>
</TR>
<TR>
<TD>with <I>m=6</I> vertices and <I>n=3</I> edges and a list <IMG ALIGN="middle" SRC="figs/bdz/img4.png" BORDER="0" ALT=""> of edges obtained when we test</TD>
</TR>
<TR>
<TD>whether the hypergraph is acyclic. (b) The assigning step builds an array <I>g</I> that</TD>
</TR>
<TR>
<TD>maps values from <I>[0,5]</I> to <I>[0,3]</I> to uniquely assign an edge to a vertex. (c) The ranking</TD>
</TR>
<TR>
<TD>step builds the data structure used to compute function <I>rank</I> in <I>O(1)</I> time.</TD>
</TR>
</TABLE>
<P>
The <I>Mapping Step</I> in Figure 1(a) carries out two important tasks:
</P>
<OL>
<LI>It assumes that it is possible to find three uniform hash functions <I>h<sub>0</sub></I>, <I>h<sub>1</sub></I> and <I>h<sub>2</sub></I>, with ranges <I>{0,1}</I>, <I>{2,3}</I> and <I>{4,5}</I>, respectively. These functions build a one-to-one mapping of the key set <I>S</I> to the edge set <I>E</I> of a random acyclic <I>3</I>-partite hypergraph <I>G=(V,E)</I>, where <I>|V|=m=6</I> and <I>|E|=n=3</I>. In <A HREF="#papers">[1,2</A>] it is shown that it is possible to obtain such a hypergraph with probability tending to <I>1</I> as <I>n</I> tends to infinity whenever <I>m=cn</I> and <I>c &gt; 1.22</I>. The value of <I>c</I> that minimizes the hypergraph size (and thereby the amount of bits to represent the resulting functions) is in the range <I>(1.22,1.23)</I>. To illustrate the mapping, key "who" is mapped to edge <I>{h<sub>0</sub>("who"), h<sub>1</sub>("who"), h<sub>2</sub>("who")} = {1,3,5}</I>, key "band" is mapped to edge <I>{h<sub>0</sub>("band"), h<sub>1</sub>("band"), h<sub>2</sub>("band")} = {1,2,4}</I>, and key "the" is mapped to edge <I>{h<sub>0</sub>("the"), h<sub>1</sub>("the"), h<sub>2</sub>("the")} = {0,2,5}</I>.
<P></P>
<LI>It tests whether the resulting random <I>3</I>-partite hypergraph contains cycles by iteratively deleting edges connecting vertices of degree 1. The deleted edges are stored in the order of deletion in a list <IMG ALIGN="middle" SRC="figs/bdz/img4.png" BORDER="0" ALT=""> to be used in the assigning step. The first deleted edge in Figure 1(a) was <I>{1,2,4}</I>, the second one was <I>{1,3,5}</I> and the third one was <I>{0,2,5}</I>. If the process ends with an empty graph, the test succeeds; otherwise it fails. A sketch of this peeling procedure is given right after this list.
</OL>
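<P>
The sketch below illustrates the peeling test described in step 2, under the following assumptions (the names are illustrative and do not match the CMPH source): <I>edges[e][0..2]</I> holds the three vertices of edge <I>e</I>, <I>inc[v]</I> lists the edges incident on vertex <I>v</I>, and <I>deg[v]</I> is the current degree of <I>v</I>. Deleted edges are appended to <I>order</I>, which is exactly the list used later by the assigning step.
</P>
<PRE>
#include &lt;stdint.h&gt;
#include &lt;stdlib.h&gt;

/* Peel edges incident on degree-1 vertices until none is left.  Returns 1
 * iff all edges were peeled, i.e. the 3-partite hypergraph is acyclic.
 * Illustrative sketch only, not the CMPH implementation. */
static int toy_peel(uint32_t nverts, uint32_t nedges,
                    uint32_t (*edges)[3], uint32_t **inc,
                    uint32_t *deg, uint32_t *order)
{
    unsigned char *removed = (unsigned char *)calloc(nedges, 1u);
    uint32_t *queue = (uint32_t *)malloc(nverts * sizeof(uint32_t));
    uint32_t head = 0u, tail = 0u, peeled = 0u, v, j, r;

    for (v = 0u; v != nverts; ++v)          /* start from degree-1 vertices */
        if (deg[v] == 1u)
            queue[tail++] = v;

    while (head != tail) {
        v = queue[head++];
        if (deg[v] != 1u)
            continue;                       /* its last edge is already gone */
        for (j = 0u; removed[inc[v][j]]; ++j)
            ;                               /* the single remaining edge of v */
        uint32_t e = inc[v][j];
        removed[e] = 1u;
        order[peeled++] = e;                /* record the deletion order      */
        for (r = 0u; r != 3u; ++r) {        /* drop e from its three vertices */
            uint32_t u = edges[e][r];
            deg[u] = deg[u] - 1u;
            if (deg[u] == 1u)
                queue[tail++] = u;
        }
    }
    free(queue);
    free(removed);
    return peeled == nedges;                /* acyclic iff everything peeled  */
}
</PRE>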
<P>
We now show how to use the Jenkins hash functions <A HREF="#papers">[7</A>] to implement the three hash functions <I>h<sub>i</sub></I>, which map values from <I>S</I> to <I>V<sub>i</sub></I>, where <IMG ALIGN="middle" SRC="figs/bdz/img52.png" BORDER="0" ALT="">. These functions are used to build a random <I>3</I>-partite hypergraph, where <IMG ALIGN="middle" SRC="figs/bdz/img53.png" BORDER="0" ALT=""> and <IMG ALIGN="middle" SRC="figs/bdz/img54.png" BORDER="0" ALT="">. Let <IMG ALIGN="middle" SRC="figs/bdz/img55.png" BORDER="0" ALT=""> be a Jenkins hash function for <IMG ALIGN="middle" SRC="figs/bdz/img56.png" BORDER="0" ALT="">, where
<I>w=32 or 64</I> for 32-bit and 64-bit architectures, respectively.
Let <I>H'</I> be an array of 3 <I>w</I>-bit values. The Jenkins hash function
allows us to compute in parallel the three entries in <I>H'</I>
and thereby the three hash functions <I>h<sub>i</sub></I>, as follows:
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><I>H' = h'(x)</I></TD>
</TR>
<TR>
<TD><I>h<sub>0</sub>(x) = H'[0] mod</I> <IMG ALIGN="middle" SRC="figs/bdz/img136.png" BORDER="0" ALT=""></TD>
</TR>
<TR>
<TD><I>h<sub>1</sub>(x) = H'[1] mod</I> <IMG ALIGN="middle" SRC="figs/bdz/img136.png" BORDER="0" ALT=""> <I>+</I> <IMG ALIGN="middle" SRC="figs/bdz/img136.png" BORDER="0" ALT=""></TD>
</TR>
<TR>
<TD><I>h<sub>2</sub>(x) = H'[2] mod</I> <IMG ALIGN="middle" SRC="figs/bdz/img136.png" BORDER="0" ALT=""> <I>+ 2</I><IMG ALIGN="middle" SRC="figs/bdz/img136.png" BORDER="0" ALT=""></TD>
</TR>
</TABLE>
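<P>
A small C sketch of this mapping is shown below. The mixer <I>toy_hash3</I> is only an illustrative stand-in for the Jenkins function <I>h'</I> (the real CMPH code uses the actual Jenkins hash), but it shows how one triple of <I>w</I>-bit values <I>H'</I> is reduced to the three ranges of the <I>3</I>-partite vertex set.
</P>
<PRE>
#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;

/* Illustrative stand-in for the Jenkins hash h'(x): a simple FNV-style mix
 * producing three 32-bit values per key.  Not the CMPH implementation. */
static void toy_hash3(const char *key, uint32_t seed, uint32_t out[3])
{
    uint32_t h = seed ^ 0x9e3779b9u;
    out[0] = h; out[1] = h ^ 0x85ebca6bu; out[2] = h ^ 0xc2b2ae35u;
    for (size_t i = 0; key[i] != '\0'; ++i) {
        h = (h ^ (uint32_t)(unsigned char)key[i]) * 0x01000193u;
        out[i % 3] ^= h;
    }
    out[0] += h; out[1] ^= h * 3u; out[2] += h * 7u;
}

/* Split the three values into the 3-partite ranges [0,m/3), [m/3,2m/3) and
 * [2m/3,m), exactly as in the formulas above; third = m/3. */
static void bdz_map_key(const char *key, uint32_t seed,
                        uint32_t third, uint32_t h[3])
{
    uint32_t H[3];
    toy_hash3(key, seed, H);
    h[0] = H[0] % third;              /* h0(x) */
    h[1] = H[1] % third + third;      /* h1(x) */
    h[2] = H[2] % third + 2u * third; /* h2(x) */
}

int main(void)
{
    const char *keys[3] = { "who", "band", "the" };
    uint32_t third = 2u;              /* m = 6 vertices, as in Figure 1 */
    for (int i = 0; i != 3; ++i) {
        uint32_t h[3];
        bdz_map_key(keys[i], 42u, third, h);
        printf("%s -> {%u, %u, %u}\n", keys[i], h[0], h[1], h[2]);
    }
    return 0;
}
</PRE>
<P>
Because <I>toy_hash3</I> is not the real Jenkins hash, the edges printed by this sketch will generally differ from the ones used in Figure 1; the point is only the splitting of <I>H'</I> into the three vertex ranges.
</P>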
<P>
The <I>Assigning Step</I> in Figure 1(b) outputs a PHF that maps the key set <I>S</I> into the range <I>[0,m-1]</I> and is represented by an array <I>g</I> storing values from the range <I>[0,3]</I>. The array <I>g</I> allows us to select one of the <I>3</I> vertices of a given edge, which is associated with a key <I>k</I>. A vertex for a key <I>k</I> is given by either <I>h<sub>0</sub>(k)</I>, <I>h<sub>1</sub>(k)</I> or <I>h<sub>2</sub>(k)</I>. The function <I>h<sub>i</sub>(k)</I> to be used for <I>k</I> is chosen by calculating <I>i = (g[h<sub>0</sub>(k)] + g[h<sub>1</sub>(k)] + g[h<sub>2</sub>(k)]) mod 3</I>. For instance, the values 1 and 4 represent the keys "who" and "band" because <I>i = (g[1] + g[3] + g[5]) mod 3 = 0</I> and <I>h<sub>0</sub>("who") = 1</I>, and <I>i = (g[1] + g[2] + g[4]) mod 3 = 2</I> and <I>h<sub>2</sub>("band") = 4</I>, respectively. Let <I>Visited</I> be a boolean vector of size <I>m</I> that indicates whether a vertex has been visited. The assigning step first initializes <I>g[i]=3</I>, to mark every vertex as unassigned, and <I>Visited[i]= false</I>, <IMG ALIGN="middle" SRC="figs/bdz/img88.png" BORDER="0" ALT="">. Then, for each edge <IMG ALIGN="middle" SRC="figs/bdz/img90.png" BORDER="0" ALT=""> from tail to head, it looks for the first vertex <I>u</I> belonging to <I>e</I> not yet visited (processing the edges in this order guarantees that such a vertex exists <A HREF="#papers">[1,2,8</A>]). Let <I>j</I> be the index of <I>u</I> in <I>e</I>, for <I>j</I> in the range <I>[0,2]</I>. Then, it assigns <IMG ALIGN="middle" SRC="figs/bdz/img95.png" BORDER="0" ALT="">. Whenever it passes through a vertex <I>u</I> from <I>e</I>, if <I>u</I> has not yet been visited, it sets <I>Visited[u] = true</I>.
</P>
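<P>
Under the same assumptions as the peeling sketch above (edge array <I>edges</I> and deletion order <I>order</I> produced by the mapping step), the assigning step can be sketched as follows. The value 3 plays the role of "unassigned"; since 3 mod 3 = 0, unassigned vertices do not disturb the sum computed at evaluation time. The names are illustrative, not the CMPH API.
</P>
<PRE>
#include &lt;stdint.h&gt;

/* Assign g values by scanning the deletion list from tail to head.  The
 * reverse deletion order guarantees that every edge still has an unvisited
 * vertex when it is processed. */
static void toy_assign(uint32_t nverts, uint32_t nedges,
                       uint32_t (*edges)[3], const uint32_t *order,
                       unsigned char *g, unsigned char *visited)
{
    uint32_t v, i, k;
    for (v = 0u; v != nverts; ++v) { g[v] = 3u; visited[v] = 0u; }
    for (i = nedges; i != 0u; --i) {
        const uint32_t *e = edges[order[i - 1u]];
        uint32_t j = 3u, sum = 0u;
        for (k = 0u; k != 3u; ++k) {
            if (visited[e[k]])
                sum += g[e[k]];        /* value of an already visited vertex */
            else if (j == 3u)
                j = k;                 /* first unvisited vertex of e        */
            visited[e[k]] = 1u;
        }
        g[e[j]] = (j + 6u - sum) % 3u; /* makes (g[e0]+g[e1]+g[e2]) mod 3 = j */
    }
}
</PRE>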
<P>
If we stop the BDZ algorithm at the assigning step we obtain a PHF with range <I>[0,m-1]</I>. The PHF has the following form: <I>phf(x) = h<sub>i(x)</sub>(x)</I>, where key <I>x</I> is in <I>S</I> and <I>i(x) = (g[h<sub>0</sub>(x)] + g[h<sub>1</sub>(x)] + g[h<sub>2</sub>(x)]) mod 3</I>. In this case we do not need ranking information and can set <I>g[i] = 0</I> whenever <I>g[i]</I> is equal to <I>3</I>, where <I>i</I> is in the range <I>[0,m-1]</I>. Therefore, the range of the values stored in <I>g</I> is narrowed from <I>[0,3]</I> to <I>[0,2]</I>. By using arithmetic coding over blocks of values (see <A HREF="#papers">[1,2</A>] for details), or any compression technique that allows performing random access in constant time to an array of compressed values <A HREF="#papers">[5,6,12</A>], we can store the resulting PHFs in <I>m log 3 = cn log 3</I> bits, where <I>c &gt; 1.22</I>. For <I>c = 1.23</I>, the space requirement is <I>1.95n</I> bits.
</P>
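<P>
For reference, a minimal sketch of evaluating such a PHF, assuming <I>g</I> is stored as one byte per entry and the three vertices of the key's edge are computed as in the mapping sketch; the names are illustrative.
</P>
<PRE>
#include &lt;stdint.h&gt;

/* phf(x) = h_{i(x)}(x), with i(x) = (g[h0(x)] + g[h1(x)] + g[h2(x)]) mod 3.
 * h[0..2] are the three vertices of the edge associated with the key. */
static uint32_t toy_phf_eval(const uint32_t h[3], const unsigned char *g)
{
    uint32_t i = ((uint32_t)g[h[0]] + g[h[1]] + g[h[2]]) % 3u;
    return h[i];                       /* a value in [0, m-1] */
}
</PRE>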
<P>
The <I>Ranking Step</I> in Figure 1(c) outputs a data structure that permits narrowing the range of a PHF generated in the assigning step from <I>[0,m-1]</I> to <I>[0,n-1]</I>, thereby producing an MPHF. The data structure allows computing in constant time a function <I>rank</I> from <I>[0,m-1]</I> to <I>[0,n-1]</I> that counts the number of assigned positions before a given position <I>v</I> in <I>g</I>. For instance, <I>rank(4) = 2</I> because positions <I>0</I> and <I>1</I> are assigned, since <I>g[0]</I> and <I>g[1]</I> are not equal to <I>3</I>.
</P>
<P>
For the implementation of the ranking step we have borrowed a simple and efficient implementation from <A HREF="#papers">[10</A>]. It requires <IMG ALIGN="middle" SRC="figs/bdz/img111.png" BORDER="0" ALT=""> additional bits of space, where <IMG ALIGN="middle" SRC="figs/bdz/img112.png" BORDER="0" ALT="">, and is obtained by explicitly storing the <I>rank</I> of every <I>k</I>th index in a rankTable, where <IMG ALIGN="middle" SRC="figs/bdz/img114.png" BORDER="0" ALT="">. The larger <I>k</I> is, the more compact the resulting MPHF. Therefore, users can trade off space for evaluation time by setting <I>k</I> appropriately in the implementation. We only allow values of <I>k</I> that are powers of two (i.e., <I>k=2<sup>b<sub>k</sub></sup></I> for some constant <I>b<sub>k</sub></I>) in order to replace the expensive division and modulo operations by bit-shift and bitwise "and" operations, respectively. We have used <I>k=256</I> in the experiments for generating more succinct MPHFs. We remark that it is still possible to obtain a more compact data structure by using the results presented in <A HREF="#papers">[9,11</A>], but at the cost of a much more complex implementation.
</P>
<P>
We need to use an additional lookup table <I>T<sub>r</sub></I> to guarantee the constant evaluation time of <I>rank(u)</I>. Let us illustrate how <I>rank(u)</I> is computed using both the rankTable and the lookup table <I>T<sub>r</sub></I>. We first look up the rank of the largest precomputed index <I>v</I> lower than or equal to <I>u</I> in the rankTable, and use <I>T<sub>r</sub></I> to count the number of assigned vertices from position <I>v</I> to <I>u-1</I>. The lookup table <I>T<sub>r</sub></I> allows us to count in constant time the number of assigned vertices in <IMG ALIGN="middle" SRC="figs/bdz/img122.png" BORDER="0" ALT=""> bits, where <IMG ALIGN="middle" SRC="figs/bdz/img112.png" BORDER="0" ALT="">. Thus the actual evaluation time is <IMG ALIGN="middle" SRC="figs/bdz/img123.png" BORDER="0" ALT="">. For simplicity and without loss of generality we let <IMG ALIGN="middle" SRC="figs/bdz/img124.png" BORDER="0" ALT=""> be a multiple of the number of bits <IMG ALIGN="middle" SRC="figs/bdz/img125.png" BORDER="0" ALT=""> used to encode each entry of <I>g</I>. As the values in <I>g</I> come from the range <I>[0,3]</I>,
then <IMG ALIGN="middle" SRC="figs/bdz/img126.png" BORDER="0" ALT=""> bits and we have tried <IMG ALIGN="middle" SRC="figs/bdz/img124.png" BORDER="0" ALT=""> equal to <I>8</I> and <I>16</I>. We expected that <IMG ALIGN="middle" SRC="figs/bdz/img124.png" BORDER="0" ALT=""> equal to 16 would provide a faster evaluation time because we would need to carry out fewer lookups in <I>T<sub>r</sub></I>. However, for both values the lookup table <I>T<sub>r</sub></I> fits entirely in the CPU cache and we did not observe any significant difference in the evaluation times. Therefore we settled for the value <I>8</I>. We remark that each value of <I>r</I> requires a different lookup table <I>T<sub>r</sub></I> that can be generated a priori.
</P>
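<P>
A sketch of <I>rank(u)</I> under the parameters discussed above (packed 2-bit entries of <I>g</I>, <I>k=256</I>). For clarity the inner loop counts assigned entries one by one instead of using the lookup table <I>T<sub>r</sub></I>; the names are illustrative, not the CMPH code.
</P>
<PRE>
#include &lt;stdint.h&gt;

/* rank(u) = rankTable[u / k] + number of assigned entries in g[u - u%k .. u-1].
 * g_packed stores four 2-bit entries per byte; the value 3 means unassigned. */
static uint32_t toy_rank(uint32_t u, const unsigned char *g_packed,
                         const uint32_t *rank_table)
{
    static const uint32_t w[4] = { 1u, 4u, 16u, 64u }; /* divisors 4^j */
    const uint32_t k = 256u;                 /* k = 2^{b_k}, b_k = 8 */
    uint32_t rank = rank_table[u / k];
    uint32_t v;
    for (v = u - (u % k); v != u; ++v) {
        uint32_t val = (g_packed[v / 4u] / w[v % 4u]) % 4u;  /* 2-bit g[v] */
        if (val != 3u)
            rank = rank + 1u;
    }
    return rank;
}
</PRE>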
<P>
The resulting MPHFs have the following form: <I>h(x) = rank(phf(x))</I>. In this case we cannot get rid of the ranking information by replacing the values 3 with 0 in the entries of <I>g</I>. Each entry in the array <I>g</I> is therefore encoded with <I>2</I> bits and we need <IMG ALIGN="middle" SRC="figs/bdz/img133.png" BORDER="0" ALT=""> additional bits to compute the function <I>rank</I> in constant time. Thus, the total space to store the resulting functions is <IMG ALIGN="middle" SRC="figs/bdz/img134.png" BORDER="0" ALT=""> bits. By using <I>c = 1.23</I> and <IMG ALIGN="middle" SRC="figs/bdz/img135.png" BORDER="0" ALT=""> we have obtained MPHFs that require approximately <I>2.62</I> bits per key to be stored.
</P>
<HR NOSHADE SIZE=1>
<H2>Memory Consumption</H2>
<P>
Now we detail the memory consumption to generate and to store minimal perfect hash functions
using the BDZ algorithm. The structures responsible for memory consumption are the
following:
</P>
<UL>
<LI>3-graph:
<OL>
<LI><B>first</B>: is a vector that stores <I>cn</I> integer numbers, each one representing
the first edge (index in the vector edges) in the list of
incident edges of each vertex. The integer numbers are 4 bytes long. Therefore,
the vector first is stored in <I>4cn</I> bytes.
<P></P>
<LI><B>edges</B>: is a vector to represent the edges of the graph. As each edge
is composed of three vertices, each entry stores three integer numbers
of 4 bytes that represent the vertices. As there are <I>n</I> edges, the
vector edges is stored in <I>12n</I> bytes.
<P></P>
<LI><B>next</B>: given a vertex <IMG ALIGN="bottom" SRC="figs/img139.png" BORDER="0" ALT="">, we can discover the edges that
contain <IMG ALIGN="bottom" SRC="figs/img139.png" BORDER="0" ALT=""> following its list of incident edges,
which starts on first[<IMG ALIGN="bottom" SRC="figs/img139.png" BORDER="0" ALT="">] and the next
edges are given by next[...first[<IMG ALIGN="bottom" SRC="figs/img139.png" BORDER="0" ALT="">]...]. Therefore, the vectors first and next represent
the linked lists of edges of each vertex. As there are three vertices for each edge,
when an edge is inserted into the 3-graph, it must be inserted in the three linked lists
of the vertices in its composition. Therefore, there are <I>3n</I> entries of integer
numbers in the vector next, so it is stored in <I>4*3n = 12n</I> bytes.
<P></P>
<LI><B>Vertices degree (vert_degree vector)</B>: is a vector of <I>cn</I> bytes
that represents the degree of each vertex. We can use just one byte for each
vertex because the 3-graph is sparse, since it has more vertices than edges.
Therefore, the vertices degree is represented in <I>cn</I> bytes.
<P></P>
</OL>
<LI>Acyclicity test:
<OL>
<LI><B>List of deleted edges obtained when we test whether the 3-graph is a forest (queue vector)</B>:
is a vector of <I>n</I> integer numbers containing indexes of vector edges. Therefore, it
requires <I>4n</I> bytes in internal memory.
<P></P>
<LI><B>Marked edges in the acyclicity test (marked_edges vector)</B>:
is a bit vector of <I>n</I> bits to indicate the edges that have already been deleted during
the acyclicity test. Therefore, it requires <I>n/8</I> bytes in internal memory.
<P></P>
</OL>
<LI>MPHF description
<OL>
<LI><B>function <I>g</I></B>: is represented by a vector of <I>2cn</I> bits. Therefore, it is
stored in <I>0.25cn</I> bytes.
<LI><B>ranktable</B>: is a lookup table used to store some precomputed ranking information.
It has <I>(cn)/(2^b)</I> entries of 4-byte integer numbers. Therefore it is stored in
<I>(4cn)/(2^b)</I> bytes. The larger b is, the more compact the resulting MPHF and
the slower the function evaluation. So b imposes a trade-off between space and time.
<LI><B>Total</B>: 0.25cn + (4cn)/(2^b) bytes
</OL>
</UL>
<P>
Thus, the total memory consumption of BDZ algorithm for generating a minimal
perfect hash function (MPHF) is: <I>(28.125 + 5c)n + 0.25cn + (4cn)/(2^b) + O(1)</I> bytes.
As the value of the constant <I>c</I> may be larger than or equal to 1.23 we have:
</P>
<TABLE ALIGN="center" BORDER="1" CELLPADDING="4">
<TR>
<TH><I>c</I></TH>
<TH><I>b</I></TH>
<TH>Memory consumption to generate a MPHF (in bytes)</TH>
</TR>
<TR>
<TD>1.23</TD>
<TD ALIGN="center"><I>7</I></TD>
<TD ALIGN="center"><I>34.62n + O(1)</I></TD>
</TR>
<TR>
<TD>1.23</TD>
<TD ALIGN="center"><I>8</I></TD>
<TD ALIGN="center"><I>34.60n + O(1)</I></TD>
</TR>
</TABLE>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><B>Table 1:</B> Memory consumption to generate a MPHF using the BDZ algorithm.</TD>
</TR>
</TABLE>
<P>
Now we present the memory consumption to store the resulting function.
So we have:
</P>
<TABLE ALIGN="center" BORDER="1" CELLPADDING="4">
<TR>
<TH><I>c</I></TH>
<TH><I>b</I></TH>
<TH>Memory consumption to store a MPHF (in bits)</TH>
</TR>
<TR>
<TD>1.23</TD>
<TD ALIGN="center"><I>7</I></TD>
<TD ALIGN="center"><I>2.77n + O(1)</I></TD>
</TR>
<TR>
<TD>1.23</TD>
<TD ALIGN="center"><I>8</I></TD>
<TD ALIGN="center"><I>2.61n + O(1)</I></TD>
</TR>
</TABLE>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><B>Table 2:</B> Memory consumption to store a MPHF generated by the BDZ algorithm.</TD>
</TR>
</TABLE>
<HR NOSHADE SIZE=1>
<H2>Experimental Results</H2>
<P>
Experimental results to compare the BDZ algorithm with the other ones in the CMPH
library are presented in Botelho, Pagh and Ziviani <A HREF="#papers">[1,2</A>].
</P>
<HR NOSHADE SIZE=1>
<A NAME="papers"></A>
<H2>Papers</H2>
<OL>
<LI><A HREF="http://www.dcc.ufmg.br/~fbotelho">F. C. Botelho</A>. <A HREF="papers/thesis.pdf">Near-Optimal Space Perfect Hashing Algorithms</A>. <I>PhD. Thesis</I>, <I>Department of Computer Science</I>, <I>Federal University of Minas Gerais</I>, September 2008. Supervised by <A HREF="http://www.dcc.ufmg.br/~nivio">N. Ziviani</A>.
<P></P>
<LI><A HREF="http://www.dcc.ufmg.br/~fbotelho">F. C. Botelho</A>, <A HREF="http://www.itu.dk/~pagh/">R. Pagh</A>, <A HREF="http://www.dcc.ufmg.br/~nivio">N. Ziviani</A>. <A HREF="papers/wads07.pdf">Simple and space-efficient minimal perfect hash functions</A>. <I>In Proceedings of the 10th International Workshop on Algorithms and Data Structures (WADs'07),</I> Springer-Verlag Lecture Notes in Computer Science, vol. 4619, Halifax, Canada, August 2007, 139-150.
<P></P>
<LI>B. Chazelle, J. Kilian, R. Rubinfeld, and A. Tal. The bloomier filter: An efficient data structure for static support lookup tables. <I>In Proceedings of the 15th annual ACM-SIAM symposium on Discrete algorithms (SODA'04)</I>, pages 30-39, Philadelphia, PA, USA, 2004. Society for Industrial and Applied Mathematics.
<P></P>
<LI>J. Ebert. A versatile data structure for edge-oriented graph algorithms. <I>Communications of the ACM</I>, (30):513-519, 1987.
<P></P>
<LI>K. Fredriksson and F. Nikitin. Simple compression code supporting random access and fast string matching. <I>In Proceedings of the 6th International Workshop on Efficient and Experimental Algorithms (WEA07)</I>, pages 203-216, 2007.
<P></P>
<LI>R. Gonzalez and G. Navarro. Statistical encoding of succinct data structures. <I>In Proceedings of the 19th Annual Symposium on Combinatorial Pattern Matching (CPM06)</I>, pages 294-305, 2006.
<P></P>
<LI>B. Jenkins. Algorithm alley: Hash functions. <I>Dr. Dobb's Journal of Software Tools</I>, 22(9), September 1997. Extended version available at <A HREF="http://burtleburtle.net/bob/hash/doobs.html">http://burtleburtle.net/bob/hash/doobs.html</A>.
<P></P>
<LI>B.S. Majewski, N.C. Wormald, G. Havas, and Z.J. Czech. A family of perfect hashing methods. <I>The Computer Journal</I>, 39(6):547-554, 1996.
<P></P>
<LI>D. Okanohara and K. Sadakane. Practical entropy-compressed rank/select dictionary. <I>In Proceedings of the Workshop on Algorithm Engineering and Experiments (ALENEX07)</I>, 2007.
<P></P>
<LI><A HREF="http://www.itu.dk/~pagh/">R. Pagh</A>. Low redundancy in static dictionaries with constant query time. <I>SIAM Journal on Computing</I>, 31(2):353363, 2001.
<P></P>
<LI>R. Raman, V. Raman, and S. S. Rao. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. <I>In Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms (SODA02)</I>, pages 233-242, Philadelphia, PA, USA, 2002. Society for Industrial and Applied Mathematics.
<P></P>
<LI>K. Sadakane and R. Grossi. Squeezing succinct data structures into entropy bounds. <I>In Proceedings of the 17th annual ACM-SIAM symposium on Discrete algorithms (SODA06)</I>, pages 1230-1239, 2006.
</OL>
<HR NOSHADE SIZE=1>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><A HREF="index.html">Home</A></TD>
<TD><A HREF="chd.html">CHD</A></TD>
<TD><A HREF="bdz.html">BDZ</A></TD>
<TD><A HREF="bmz.html">BMZ</A></TD>
<TD><A HREF="chm.html">CHM</A></TD>
<TD><A HREF="brz.html">BRZ</A></TD>
<TD><A HREF="fch.html">FCH</A></TD>
</TR>
</TABLE>
<HR NOSHADE SIZE=1>
<P>
Enjoy!
</P>
<P>
<A HREF="mailto:davi@users.sourceforge.net">Davi de Castro Reis</A>
</P>
<P>
<A HREF="mailto:db8192@users.sourceforge.net">Djamel Belazzougui</A>
</P>
<P>
<A HREF="mailto:fc_botelho@users.sourceforge.net">Fabiano Cupertino Botelho</A>
</P>
<P>
<A HREF="mailto:nivio@dcc.ufmg.br">Nivio Ziviani</A>
</P>
<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
try {
var pageTracker = _gat._getTracker("UA-7698683-2");
pageTracker._trackPageview();
} catch(err) {}</script>
<!-- html code generated by txt2tags 2.6 (http://txt2tags.org) -->
<!-- cmdline: txt2tags -t html -i BDZ.t2t -o docs/bdz.html -->
</BODY></HTML>

581
docs/bmz.html Normal file

@@ -0,0 +1,581 @@
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<META NAME="generator" CONTENT="http://txt2tags.org">
<LINK REL="stylesheet" TYPE="text/css" HREF="DOC.css">
<TITLE>BMZ Algorithm</TITLE>
</HEAD><BODY BGCOLOR="white" TEXT="black">
<CENTER>
<H1>BMZ Algorithm</H1>
</CENTER>
<HR NOSHADE SIZE=1>
<H2>History</H2>
<P>
At the end of 2003, professor <A HREF="http://www.dcc.ufmg.br/~nivio">Nivio Ziviani</A> was
finishing the second edition of his <A HREF="http://www.dcc.ufmg.br/algoritmos/">book</A>.
During the <A HREF="http://www.dcc.ufmg.br/algoritmos/">book</A> writing,
professor <A HREF="http://www.dcc.ufmg.br/~nivio">Nivio Ziviani</A> studied the problem of generating
<A HREF="concepts.html">minimal perfect hash functions</A>
(if you are not familiar with this problem, see <A HREF="#papers">[1</A>]<A HREF="#papers">[2</A>]).
Professor <A HREF="http://www.dcc.ufmg.br/~nivio">Nivio Ziviani</A> coded a modified version of
the <A HREF="chm.html">CHM algorithm</A>, which was proposed by
Czech, Havas and Majewski, and put it in his <A HREF="http://www.dcc.ufmg.br/algoritmos/">book</A>.
The <A HREF="chm.html">CHM algorithm</A> is based on acyclic random graphs to generate
<A HREF="concepts.html">order preserving minimal perfect hash functions</A> in linear time.
Professor <A HREF="http://www.dcc.ufmg.br/~nivio">Nivio Ziviani</A>
asked himself why the random graph must
be acyclic. In the modified version available in his <A HREF="http://www.dcc.ufmg.br/algoritmos/">book</A> he got rid of this restriction.
</P>
<P>
The modification presented a problem: it was impossible to generate minimal perfect hash functions
for sets with more than 1000 keys.
At the same time, <A HREF="http://www.dcc.ufmg.br/~fbotelho">Fabiano C. Botelho</A>,
a master's degree student at the <A HREF="http://www.dcc.ufmg.br">Department of Computer Science</A> of the
<A HREF="http://www.ufmg.br">Federal University of Minas Gerais</A>,
started to be advised by <A HREF="http://www.dcc.ufmg.br/~nivio">Nivio Ziviani</A> who presented the problem
to <A HREF="http://www.dcc.ufmg.br/~fbotelho">Fabiano</A>.
</P>
<P>
During his master's studies, <A HREF="http://www.dcc.ufmg.br/~fbotelho">Fabiano</A> and
<A HREF="http://www.dcc.ufmg.br/~nivio">Nivio Ziviani</A> faced many problems.
In April 2004, <A HREF="http://www.dcc.ufmg.br/~fbotelho">Fabiano</A> was talking with a
friend of his (David Menoti) about the problems
and many ideas appeared.
The ideas were implemented and a very fast algorithm to generate
minimal perfect hash functions was designed.
We refer to the algorithm as <B>BMZ</B>, because it was conceived by Fabiano C. <B>B</B>otelho,
David <B>M</B>enoti and Nivio <B>Z</B>iviani. The algorithm is described in <A HREF="#papers">[1</A>].
To analyse the BMZ algorithm we needed some results from random graph theory, so
we invited professor <A HREF="http://www.ime.usp.br/~yoshi">Yoshiharu Kohayakawa</A> to help us.
The final description and analysis of the BMZ algorithm are presented in <A HREF="#papers">[2</A>].
</P>
<HR NOSHADE SIZE=1>
<H2>The Algorithm</H2>
<P>
The BMZ algorithm shares several features with the <A HREF="chm.html">CHM algorithm</A>.
In particular, the BMZ algorithm is also
based on the generation of random graphs <IMG ALIGN="middle" SRC="figs/img27.png" BORDER="0" ALT="">, where <IMG ALIGN="bottom" SRC="figs/img28.png" BORDER="0" ALT=""> is in
one-to-one correspondence with the key set <IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT=""> for which we wish to
generate a <A HREF="concepts.html">minimal perfect hash function</A>.
The two main differences between the BMZ algorithm and the CHM algorithm
are as follows: (<I>i</I>) the BMZ algorithm generates random
graphs <IMG ALIGN="middle" SRC="figs/img27.png" BORDER="0" ALT=""> with <IMG ALIGN="middle" SRC="figs/img29.png" BORDER="0" ALT=""> and <IMG ALIGN="middle" SRC="figs/img30.png" BORDER="0" ALT="">, where <IMG ALIGN="middle" SRC="figs/img31.png" BORDER="0" ALT="">,
and hence <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT=""> necessarily contains cycles,
while the CHM algorithm generates <I>acyclic</I> random
graphs <IMG ALIGN="middle" SRC="figs/img27.png" BORDER="0" ALT=""> with <IMG ALIGN="middle" SRC="figs/img29.png" BORDER="0" ALT=""> and <IMG ALIGN="middle" SRC="figs/img30.png" BORDER="0" ALT="">,
with a greater number of vertices: <IMG ALIGN="middle" SRC="figs/img33.png" BORDER="0" ALT="">;
(<I>ii</I>) the CHM algorithm generates <A HREF="concepts.html">order preserving minimal perfect hash functions</A>
while the BMZ algorithm does not preserve order. Thus, the BMZ algorithm improves
the space requirement at the expense of generating functions that are not
order preserving.
</P>
<P>
Suppose <IMG ALIGN="bottom" SRC="figs/img14.png" BORDER="0" ALT=""> is a universe of <I>keys</I>.
Let <IMG ALIGN="middle" SRC="figs/img17.png" BORDER="0" ALT=""> be a set of <IMG ALIGN="bottom" SRC="figs/img8.png" BORDER="0" ALT=""> keys from <IMG ALIGN="bottom" SRC="figs/img14.png" BORDER="0" ALT="">.
Let us show how the BMZ algorithm constructs a minimal perfect hash function <IMG ALIGN="bottom" SRC="figs/img7.png" BORDER="0" ALT="">.
We make use of two auxiliary random functions <IMG ALIGN="middle" SRC="figs/img41.png" BORDER="0" ALT=""> and <IMG ALIGN="middle" SRC="figs/img55.png" BORDER="0" ALT="">,
where <IMG ALIGN="middle" SRC="figs/img56.png" BORDER="0" ALT=""> for some suitably chosen integer <IMG ALIGN="bottom" SRC="figs/img57.png" BORDER="0" ALT="">,
where <IMG ALIGN="middle" SRC="figs/img58.png" BORDER="0" ALT="">. We build a random graph <IMG ALIGN="middle" SRC="figs/img59.png" BORDER="0" ALT=""> on <IMG ALIGN="bottom" SRC="figs/img60.png" BORDER="0" ALT="">,
whose edge set is <IMG ALIGN="middle" SRC="figs/img61.png" BORDER="0" ALT="">. There is an edge in <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT=""> for each
key in the set of keys <IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT="">.
</P>
<P>
In what follows, we shall be interested in the <I>2-core</I> of
the random graph <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT="">, that is, the maximal subgraph
of <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT=""> with minimal degree at
least 2 (see <A HREF="#papers">[2</A>] for details).
Because of its importance in our context, we call the 2-core the
<I>critical</I> subgraph of <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT=""> and denote it by <IMG ALIGN="middle" SRC="figs/img63.png" BORDER="0" ALT="">.
The vertices and edges in <IMG ALIGN="middle" SRC="figs/img63.png" BORDER="0" ALT=""> are said to be <I>critical</I>.
We let <IMG ALIGN="middle" SRC="figs/img64.png" BORDER="0" ALT=""> and <IMG ALIGN="middle" SRC="figs/img65.png" BORDER="0" ALT="">.
Moreover, we let <IMG ALIGN="middle" SRC="figs/img66.png" BORDER="0" ALT=""> be the set of <I>non-critical</I>
vertices in <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT="">.
We also let <IMG ALIGN="middle" SRC="figs/img67.png" BORDER="0" ALT=""> be the set of all critical
vertices that have at least one non-critical vertex as a neighbour.
Let <IMG ALIGN="middle" SRC="figs/img68.png" BORDER="0" ALT=""> be the set of <I>non-critical</I> edges in <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT="">.
Finally, we let <IMG ALIGN="middle" SRC="figs/img69.png" BORDER="0" ALT=""> be the <I>non-critical</I> subgraph
of <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT="">.
The non-critical subgraph <IMG ALIGN="middle" SRC="figs/img70.png" BORDER="0" ALT=""> corresponds to the <I>acyclic part</I>
of <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT="">.
We have <IMG ALIGN="middle" SRC="figs/img71.png" BORDER="0" ALT="">.
</P>
<P>
We then construct a suitable labelling <IMG ALIGN="middle" SRC="figs/img72.png" BORDER="0" ALT=""> of the vertices
of <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT="">: we choose <IMG ALIGN="middle" SRC="figs/img73.png" BORDER="0" ALT=""> for each <IMG ALIGN="middle" SRC="figs/img74.png" BORDER="0" ALT=""> in such
a way that <IMG ALIGN="middle" SRC="figs/img75.png" BORDER="0" ALT=""> (<IMG ALIGN="middle" SRC="figs/img18.png" BORDER="0" ALT="">) is a
minimal perfect hash function for <IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT="">.
This labelling <IMG ALIGN="middle" SRC="figs/img37.png" BORDER="0" ALT=""> can be found in linear time
if the number of edges in <IMG ALIGN="middle" SRC="figs/img63.png" BORDER="0" ALT=""> is at most <IMG ALIGN="middle" SRC="figs/img76.png" BORDER="0" ALT=""> (see <A HREF="#papers">[2</A>]
for details).
</P>
<P>
Figure 1 presents a pseudo code for the BMZ algorithm.
The procedure BMZ (<IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT="">, <IMG ALIGN="middle" SRC="figs/img37.png" BORDER="0" ALT="">) receives as input the set of
keys <IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT=""> and produces the labelling <IMG ALIGN="middle" SRC="figs/img37.png" BORDER="0" ALT="">.
The method uses a mapping, ordering and searching approach.
We now describe each step.
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD>procedure BMZ (<IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT="">, <IMG ALIGN="middle" SRC="figs/img37.png" BORDER="0" ALT="">)</TD>
</TR>
<TR>
<TD>&nbsp;&nbsp;&nbsp;&nbsp;Mapping (<IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT="">, <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT="">);</TD>
</TR>
<TR>
<TD>&nbsp;&nbsp;&nbsp;&nbsp;Ordering (<IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT="">, <IMG ALIGN="middle" SRC="figs/img63.png" BORDER="0" ALT="">, <IMG ALIGN="middle" SRC="figs/img70.png" BORDER="0" ALT="">);</TD>
</TR>
<TR>
<TD>&nbsp;&nbsp;&nbsp;&nbsp;Searching (<IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT="">, <IMG ALIGN="middle" SRC="figs/img63.png" BORDER="0" ALT="">, <IMG ALIGN="middle" SRC="figs/img70.png" BORDER="0" ALT="">, <IMG ALIGN="middle" SRC="figs/img37.png" BORDER="0" ALT="">);</TD>
</TR>
<TR>
<TD><B>Figure 1</B>: Main steps of BMZ algorithm for constructing a minimal perfect hash function</TD>
</TR>
</TABLE>
<HR NOSHADE SIZE=1>
<H3>Mapping Step</H3>
<P>
The procedure Mapping (<IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT="">, <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT="">) receives as input the set
of keys <IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT=""> and generates the random graph <IMG ALIGN="middle" SRC="figs/img59.png" BORDER="0" ALT="">, by generating
two auxiliary functions <IMG ALIGN="middle" SRC="figs/img41.png" BORDER="0" ALT="">, <IMG ALIGN="middle" SRC="figs/img78.png" BORDER="0" ALT="">.
</P>
<P>
The functions <IMG ALIGN="middle" SRC="figs/img41.png" BORDER="0" ALT=""> and <IMG ALIGN="middle" SRC="figs/img42.png" BORDER="0" ALT=""> are constructed as follows.
We impose some upper bound <IMG ALIGN="bottom" SRC="figs/img79.png" BORDER="0" ALT=""> on the lengths of the keys in <IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT="">.
To define <IMG ALIGN="middle" SRC="figs/img80.png" BORDER="0" ALT=""> (<IMG ALIGN="middle" SRC="figs/img81.png" BORDER="0" ALT="">, <IMG ALIGN="bottom" SRC="figs/img62.png" BORDER="0" ALT="">), we generate
an <IMG ALIGN="middle" SRC="figs/img82.png" BORDER="0" ALT=""> table of random integers <IMG ALIGN="middle" SRC="figs/img83.png" BORDER="0" ALT="">.
For a key <IMG ALIGN="middle" SRC="figs/img18.png" BORDER="0" ALT=""> of length <IMG ALIGN="middle" SRC="figs/img84.png" BORDER="0" ALT=""> and <IMG ALIGN="middle" SRC="figs/img85.png" BORDER="0" ALT="">, we let
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><IMG ALIGN="middle" SRC="figs/img86.png" BORDER="0" ALT=""></TD>
</TR>
</TABLE>
<P>
The random graph <IMG ALIGN="middle" SRC="figs/img59.png" BORDER="0" ALT=""> has vertex set <IMG ALIGN="middle" SRC="figs/img56.png" BORDER="0" ALT=""> and
edge set <IMG ALIGN="middle" SRC="figs/img61.png" BORDER="0" ALT="">. We need <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT=""> to be
simple, i.e., <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT=""> should have neither loops nor multiple edges.
A loop occurs when <IMG ALIGN="middle" SRC="figs/img87.png" BORDER="0" ALT=""> for some <IMG ALIGN="middle" SRC="figs/img18.png" BORDER="0" ALT="">.
We solve this in an ad hoc manner: we simply let <IMG ALIGN="middle" SRC="figs/img88.png" BORDER="0" ALT=""> in this case.
If we still find a loop after this, we generate another pair <IMG ALIGN="middle" SRC="figs/img89.png" BORDER="0" ALT="">.
When a multiple edge occurs we abort and generate a new pair <IMG ALIGN="middle" SRC="figs/img89.png" BORDER="0" ALT="">.
Although the function above causes <A HREF="concepts.html">collisions</A> with probability <I>1/t</I>,
in <A HREF="index.html">cmph library</A> we use faster hash
functions (<A HREF="http://www.cs.yorku.ca/~oz/hash.html">DJB2 hash</A>, <A HREF="http://www.isthe.com/chongo/tech/comp/fnv/">FNV hash</A>,
<A HREF="http://burtleburtle.net/bob/hash/doobs.html">Jenkins hash</A> and <A HREF="http://www.cs.yorku.ca/~oz/hash.html">SDBM hash</A>)
in which we do not need to impose any upper bound <IMG ALIGN="bottom" SRC="figs/img79.png" BORDER="0" ALT=""> on the lengths of the keys in <IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT="">.
</P>
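<P>
A sketch of this table-based construction is given below, assuming keys are byte strings of length at most <I>L</I> and that a table <I>T</I> of random integers has been generated for each of the two functions. The names are illustrative and, as noted above, the cmph library itself uses different (faster) hash functions.
</P>
<PRE>
#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

/* h_i(x) = (sum over the key positions j of T[j][x[j]]) mod t, where T is an
 * L x 256 table of random integers and t is the number of vertices. */
static uint32_t table_hash(const unsigned char *key, size_t len,
                           uint32_t T[][256], uint32_t t)
{
    uint32_t sum = 0u;
    size_t j;
    for (j = 0u; j != len; ++j)
        sum += T[j][key[j]];   /* one random integer per (position, character) */
    return sum % t;            /* a vertex in [0, t-1] */
}
</PRE>
<P>
Two such tables give the two functions <I>h<sub>1</sub></I> and <I>h<sub>2</sub></I>; loops (<I>h<sub>1</sub>(x) = h<sub>2</sub>(x)</I>) and multiple edges are handled as described above, by adjusting the value or regenerating the pair of functions.
</P>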
<P>
As mentioned before, for us to find the labelling <IMG ALIGN="middle" SRC="figs/img72.png" BORDER="0" ALT=""> of the
vertices of <IMG ALIGN="middle" SRC="figs/img59.png" BORDER="0" ALT=""> in linear time,
we require that <IMG ALIGN="middle" SRC="figs/img108.png" BORDER="0" ALT="">.
The crucial step now is to determine the value
of <IMG ALIGN="bottom" SRC="figs/img1.png" BORDER="0" ALT=""> (in <IMG ALIGN="bottom" SRC="figs/img57.png" BORDER="0" ALT="">) to obtain a random
graph <IMG ALIGN="middle" SRC="figs/img71.png" BORDER="0" ALT=""> with <IMG ALIGN="middle" SRC="figs/img109.png" BORDER="0" ALT="">.
Botelho, Menoti and Ziviani determined empirically in <A HREF="#papers">[1</A>] that
the value of <IMG ALIGN="bottom" SRC="figs/img1.png" BORDER="0" ALT=""> is <I>1.15</I>. This value is remarkably
close to the theoretical value determined in <A HREF="#papers">[2</A>],
which is around <IMG ALIGN="bottom" SRC="figs/img112.png" BORDER="0" ALT="">.
</P>
<HR NOSHADE SIZE=1>
<H3>Ordering Step</H3>
<P>
The procedure Ordering (<IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT="">, <IMG ALIGN="middle" SRC="figs/img63.png" BORDER="0" ALT="">, <IMG ALIGN="middle" SRC="figs/img70.png" BORDER="0" ALT="">) receives
as input the graph <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT=""> and partitions <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT=""> into the two
subgraphs <IMG ALIGN="middle" SRC="figs/img63.png" BORDER="0" ALT=""> and <IMG ALIGN="middle" SRC="figs/img70.png" BORDER="0" ALT="">, so that <IMG ALIGN="middle" SRC="figs/img71.png" BORDER="0" ALT="">.
</P>
<P>
Figure 2 presents a sample graph with 9 vertices
and 8 edges, where the degree of a vertex is shown besides each vertex.
Initially, all vertices with degree 1 are added to a queue <IMG ALIGN="middle" SRC="figs/img136.png" BORDER="0" ALT="">.
For the example shown in Figure 2(a), <IMG ALIGN="middle" SRC="figs/img137.png" BORDER="0" ALT=""> after the initialization step.
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><IMG ALIGN="middle" SRC="figs/img138.png" BORDER="0" ALT=""></TD>
</TR>
<TR>
<TD><B>Figure 2:</B> Ordering step for a graph with 9 vertices and 8 edges.</TD>
</TR>
</TABLE>
<P>
Next, we remove one vertex <IMG ALIGN="bottom" SRC="figs/img139.png" BORDER="0" ALT=""> from the queue, decrement its degree and
the degree of the vertices with degree greater than 0 in the adjacency
list of <IMG ALIGN="bottom" SRC="figs/img139.png" BORDER="0" ALT="">, as depicted in Figure 2(b) for <IMG ALIGN="bottom" SRC="figs/img140.png" BORDER="0" ALT="">.
At this point, the neighbours of <IMG ALIGN="bottom" SRC="figs/img139.png" BORDER="0" ALT=""> with degree 1 are
inserted into the queue, such as vertex 1.
This process is repeated until the queue becomes empty.
All vertices with degree 0 are non-critical vertices and the others are
critical vertices, as depicted in Figure 2(c).
Finally, to determine the vertices in <IMG ALIGN="middle" SRC="figs/img141.png" BORDER="0" ALT=""> we collect all
vertices <IMG ALIGN="middle" SRC="figs/img142.png" BORDER="0" ALT=""> with at least one vertex <IMG ALIGN="bottom" SRC="figs/img143.png" BORDER="0" ALT=""> that
is in Adj<IMG ALIGN="middle" SRC="figs/img144.png" BORDER="0" ALT=""> and in <IMG ALIGN="middle" SRC="figs/img145.png" BORDER="0" ALT="">, such as vertex 8 in Figure 2(c).
</P>
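<P>
The sketch below illustrates this ordering step, assuming a plain adjacency-list representation: <I>adj[v]</I> lists the <I>adj_len[v]</I> neighbours of vertex <I>v</I> and <I>deg[v]</I> its current degree (which the procedure consumes). Vertices whose degree drops to 0 are non-critical; the remaining ones form the critical (2-core) subgraph. All names are illustrative, not the CMPH implementation.
</P>
<PRE>
#include &lt;stdint.h&gt;
#include &lt;stdlib.h&gt;

/* Repeatedly remove degree-1 vertices with a queue; what survives is the
 * 2-core.  critical[v] is set to 1 for critical vertices, 0 otherwise. */
static void toy_ordering(uint32_t nverts, uint32_t **adj,
                         const uint32_t *adj_len, uint32_t *deg,
                         unsigned char *critical)
{
    uint32_t *queue = (uint32_t *)malloc(nverts * sizeof(uint32_t));
    uint32_t head = 0u, tail = 0u, v, j;

    for (v = 0u; v != nverts; ++v)      /* enqueue all degree-1 vertices */
        if (deg[v] == 1u)
            queue[tail++] = v;

    while (head != tail) {              /* peel until the queue is empty */
        v = queue[head++];
        if (deg[v] == 0u)
            continue;                   /* its last edge was already removed */
        deg[v] = 0u;
        for (j = 0u; j != adj_len[v]; ++j) {
            uint32_t u = adj[v][j];
            if (deg[u] != 0u) {         /* the edge (v,u) is still present   */
                deg[u] = deg[u] - 1u;
                if (deg[u] == 1u)
                    queue[tail++] = u;
            }
        }
    }
    for (v = 0u; v != nverts; ++v)
        critical[v] = (unsigned char)(deg[v] != 0u);
    free(queue);
}
</PRE>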
<HR NOSHADE SIZE=1>
<H3>Searching Step</H3>
<P>
In the searching step, the key part is
the <I>perfect assignment problem</I>: find <IMG ALIGN="middle" SRC="figs/img153.png" BORDER="0" ALT=""> such that
the function <IMG ALIGN="middle" SRC="figs/img154.png" BORDER="0" ALT=""> defined by
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><IMG ALIGN="middle" SRC="figs/img155.png" BORDER="0" ALT=""></TD>
</TR>
</TABLE>
<P>
is a bijection from <IMG ALIGN="middle" SRC="figs/img156.png" BORDER="0" ALT=""> to <IMG ALIGN="middle" SRC="figs/img157.png" BORDER="0" ALT=""> (recall <IMG ALIGN="middle" SRC="figs/img158.png" BORDER="0" ALT="">).
We are interested in a labelling <IMG ALIGN="middle" SRC="figs/img72.png" BORDER="0" ALT=""> of
the vertices of the graph <IMG ALIGN="middle" SRC="figs/img59.png" BORDER="0" ALT=""> with
the property that if <IMG ALIGN="bottom" SRC="figs/img11.png" BORDER="0" ALT=""> and <IMG ALIGN="middle" SRC="figs/img22.png" BORDER="0" ALT=""> are keys
in <IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT="">, then <IMG ALIGN="middle" SRC="figs/img159.png" BORDER="0" ALT="">; that is, if we associate
to each edge the sum of the labels on its endpoints, then these values
should be all distinct.
Moreover, we require that all the sums <IMG ALIGN="middle" SRC="figs/img160.png" BORDER="0" ALT=""> (<IMG ALIGN="middle" SRC="figs/img18.png" BORDER="0" ALT="">)
fall between <IMG ALIGN="bottom" SRC="figs/img115.png" BORDER="0" ALT=""> and <IMG ALIGN="middle" SRC="figs/img161.png" BORDER="0" ALT="">, and thus we have a bijection
between <IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT=""> and <IMG ALIGN="middle" SRC="figs/img157.png" BORDER="0" ALT="">.
</P>
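<P>
Once the labelling <I>g</I> has been found, evaluating the BMZ function is just the sum of the two labels, as the formula above states. A minimal sketch (illustrative names, not the CMPH API):
</P>
<PRE>
#include &lt;stdint.h&gt;

/* h(x) = g[h1(x)] + g[h2(x)]: the sum of the labels of the two endpoints of
 * the edge associated with key x, a distinct value in [0, n-1] per key. */
static uint32_t toy_bmz_eval(uint32_t h1, uint32_t h2, const uint32_t *g)
{
    return g[h1] + g[h2];
}
</PRE>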
<P>
The procedure Searching (<IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT="">, <IMG ALIGN="middle" SRC="figs/img63.png" BORDER="0" ALT="">, <IMG ALIGN="middle" SRC="figs/img70.png" BORDER="0" ALT="">, <IMG ALIGN="middle" SRC="figs/img37.png" BORDER="0" ALT="">)
receives as input <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT="">, <IMG ALIGN="middle" SRC="figs/img63.png" BORDER="0" ALT="">, <IMG ALIGN="middle" SRC="figs/img70.png" BORDER="0" ALT=""> and finds a
suitable <IMG ALIGN="middle" SRC="figs/img162.png" BORDER="0" ALT=""> bit value for each vertex <IMG ALIGN="middle" SRC="figs/img74.png" BORDER="0" ALT="">, stored in the
array <IMG ALIGN="middle" SRC="figs/img37.png" BORDER="0" ALT="">.
This step is first performed for the vertices in the
critical subgraph <IMG ALIGN="middle" SRC="figs/img63.png" BORDER="0" ALT=""> of <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT=""> (the 2-core of <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT="">)
and then it is performed for the vertices in <IMG ALIGN="middle" SRC="figs/img70.png" BORDER="0" ALT=""> (the non-critical subgraph
of <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT=""> that contains the "acyclic part" of <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT="">).
The reason the assignment of the <IMG ALIGN="middle" SRC="figs/img37.png" BORDER="0" ALT=""> values is first
performed on the vertices in <IMG ALIGN="middle" SRC="figs/img63.png" BORDER="0" ALT=""> is to resolve reassignments
as early as possible (such reassignments are consequences of the cycles
in <IMG ALIGN="middle" SRC="figs/img63.png" BORDER="0" ALT=""> and are depicted hereinafter).
</P>
<HR NOSHADE SIZE=1>
<H4>Assignment of Values to Critical Vertices</H4>
<P>
The labels <IMG ALIGN="middle" SRC="figs/img73.png" BORDER="0" ALT=""> (<IMG ALIGN="middle" SRC="figs/img142.png" BORDER="0" ALT="">)
are assigned in increasing order following a greedy
strategy where the critical vertices <IMG ALIGN="bottom" SRC="figs/img139.png" BORDER="0" ALT=""> are considered one at a time,
according to a breadth-first search on <IMG ALIGN="middle" SRC="figs/img63.png" BORDER="0" ALT="">.
If a candidate value <IMG ALIGN="bottom" SRC="figs/img11.png" BORDER="0" ALT=""> for <IMG ALIGN="middle" SRC="figs/img73.png" BORDER="0" ALT=""> is forbidden
because setting <IMG ALIGN="middle" SRC="figs/img163.png" BORDER="0" ALT=""> would create two edges with the same sum,
we try <IMG ALIGN="middle" SRC="figs/img164.png" BORDER="0" ALT=""> for <IMG ALIGN="middle" SRC="figs/img73.png" BORDER="0" ALT="">. This fact is referred to
as a <I>reassignment</I>.
</P>
<P>
Let <IMG ALIGN="middle" SRC="figs/img165.png" BORDER="0" ALT=""> be the set of addresses assigned to edges in <IMG ALIGN="middle" SRC="figs/img166.png" BORDER="0" ALT="">.
Initially <IMG ALIGN="middle" SRC="figs/img167.png" BORDER="0" ALT="">.
Let <IMG ALIGN="bottom" SRC="figs/img11.png" BORDER="0" ALT=""> be a candidate value for <IMG ALIGN="middle" SRC="figs/img73.png" BORDER="0" ALT="">.
Initially <IMG ALIGN="bottom" SRC="figs/img168.png" BORDER="0" ALT="">.
Considering the subgraph <IMG ALIGN="middle" SRC="figs/img63.png" BORDER="0" ALT=""> in Figure 2(c),
a step by step example of the assignment of values to vertices in <IMG ALIGN="middle" SRC="figs/img63.png" BORDER="0" ALT=""> is
presented in Figure 3.
Initially, a vertex <IMG ALIGN="bottom" SRC="figs/img139.png" BORDER="0" ALT=""> is chosen, the assignment <IMG ALIGN="middle" SRC="figs/img163.png" BORDER="0" ALT=""> is made
and <IMG ALIGN="bottom" SRC="figs/img11.png" BORDER="0" ALT=""> is set to <IMG ALIGN="middle" SRC="figs/img164.png" BORDER="0" ALT="">.
For example, suppose that vertex <IMG ALIGN="bottom" SRC="figs/img169.png" BORDER="0" ALT=""> in Figure 3(a) is
chosen, the assignment <IMG ALIGN="middle" SRC="figs/img170.png" BORDER="0" ALT=""> is made and <IMG ALIGN="bottom" SRC="figs/img11.png" BORDER="0" ALT=""> is set to <IMG ALIGN="bottom" SRC="figs/img96.png" BORDER="0" ALT="">.
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><IMG ALIGN="middle" SRC="figs/img171.png" BORDER="0" ALT=""></TD>
</TR>
<TR>
<TD><B>Figure 3:</B> Example of the assignment of values to critical vertices.</TD>
</TR>
</TABLE>
<P>
In Figure 3(b), following the adjacency list of vertex <IMG ALIGN="bottom" SRC="figs/img169.png" BORDER="0" ALT="">,
the unassigned vertex <IMG ALIGN="bottom" SRC="figs/img115.png" BORDER="0" ALT=""> is reached.
At this point, we collect in the temporary variable <IMG ALIGN="bottom" SRC="figs/img172.png" BORDER="0" ALT=""> all adjacencies
of vertex <IMG ALIGN="bottom" SRC="figs/img115.png" BORDER="0" ALT=""> that have been assigned an <IMG ALIGN="bottom" SRC="figs/img11.png" BORDER="0" ALT=""> value,
and <IMG ALIGN="middle" SRC="figs/img173.png" BORDER="0" ALT="">.
Next, for all <IMG ALIGN="middle" SRC="figs/img174.png" BORDER="0" ALT="">, we check if <IMG ALIGN="middle" SRC="figs/img175.png" BORDER="0" ALT="">.
Since <IMG ALIGN="middle" SRC="figs/img176.png" BORDER="0" ALT="">, then <IMG ALIGN="middle" SRC="figs/img177.png" BORDER="0" ALT=""> is set
to <IMG ALIGN="bottom" SRC="figs/img96.png" BORDER="0" ALT="">, <IMG ALIGN="bottom" SRC="figs/img11.png" BORDER="0" ALT=""> is incremented
by 1 (now <IMG ALIGN="bottom" SRC="figs/img178.png" BORDER="0" ALT="">) and <IMG ALIGN="middle" SRC="figs/img179.png" BORDER="0" ALT="">.
Next, vertex <IMG ALIGN="bottom" SRC="figs/img180.png" BORDER="0" ALT=""> is reached, <IMG ALIGN="middle" SRC="figs/img181.png" BORDER="0" ALT=""> is set
to <IMG ALIGN="bottom" SRC="figs/img62.png" BORDER="0" ALT="">, <IMG ALIGN="bottom" SRC="figs/img11.png" BORDER="0" ALT=""> is set to <IMG ALIGN="bottom" SRC="figs/img180.png" BORDER="0" ALT=""> and <IMG ALIGN="middle" SRC="figs/img182.png" BORDER="0" ALT="">.
Next, vertex <IMG ALIGN="bottom" SRC="figs/img183.png" BORDER="0" ALT=""> is reached and <IMG ALIGN="middle" SRC="figs/img184.png" BORDER="0" ALT="">.
Since <IMG ALIGN="middle" SRC="figs/img185.png" BORDER="0" ALT=""> and <IMG ALIGN="middle" SRC="figs/img186.png" BORDER="0" ALT="">, then <IMG ALIGN="middle" SRC="figs/img187.png" BORDER="0" ALT=""> is
set to <IMG ALIGN="bottom" SRC="figs/img180.png" BORDER="0" ALT="">, <IMG ALIGN="bottom" SRC="figs/img11.png" BORDER="0" ALT=""> is set to <IMG ALIGN="bottom" SRC="figs/img183.png" BORDER="0" ALT=""> and <IMG ALIGN="middle" SRC="figs/img188.png" BORDER="0" ALT="">.
Finally, vertex <IMG ALIGN="bottom" SRC="figs/img189.png" BORDER="0" ALT=""> is reached and <IMG ALIGN="middle" SRC="figs/img190.png" BORDER="0" ALT="">.
Since <IMG ALIGN="middle" SRC="figs/img191.png" BORDER="0" ALT="">, <IMG ALIGN="bottom" SRC="figs/img11.png" BORDER="0" ALT=""> is incremented by 1 and set to 5, as depicted in
Figure 3(c).
Since <IMG ALIGN="middle" SRC="figs/img192.png" BORDER="0" ALT="">, <IMG ALIGN="bottom" SRC="figs/img11.png" BORDER="0" ALT=""> is again incremented by 1 and set to 6,
as depicted in Figure 3(d).
These two reassignments are indicated by the arrows in Figure 3.
Since <IMG ALIGN="middle" SRC="figs/img193.png" BORDER="0" ALT=""> and <IMG ALIGN="middle" SRC="figs/img194.png" BORDER="0" ALT="">, then <IMG ALIGN="middle" SRC="figs/img195.png" BORDER="0" ALT=""> is set
to <IMG ALIGN="bottom" SRC="figs/img196.png" BORDER="0" ALT=""> and <IMG ALIGN="middle" SRC="figs/img197.png" BORDER="0" ALT="">. This finishes the algorithm.
</P>
<HR NOSHADE SIZE=1>
<H4>Assignment of Values to Non-Critical Vertices</H4>
<P>
As <IMG ALIGN="middle" SRC="figs/img70.png" BORDER="0" ALT=""> is acyclic, we can impose the order in which addresses are
associated with edges in <IMG ALIGN="middle" SRC="figs/img70.png" BORDER="0" ALT="">, making this step simple to solve
by a standard depth first search algorithm.
Therefore, in the assignment of values to vertices in <IMG ALIGN="middle" SRC="figs/img70.png" BORDER="0" ALT=""> we
benefit from the unused addresses in the gaps left by the assignment of values
to vertices in <IMG ALIGN="middle" SRC="figs/img63.png" BORDER="0" ALT="">.
For that, we start the depth-first search from the vertices in <IMG ALIGN="middle" SRC="figs/img141.png" BORDER="0" ALT=""> because
the <IMG ALIGN="middle" SRC="figs/img37.png" BORDER="0" ALT=""> values for these critical vertices were already assigned
and cannot be changed.
</P>
<P>
Considering the subgraph <IMG ALIGN="middle" SRC="figs/img70.png" BORDER="0" ALT=""> in Figure 2(c),
a step by step example of the assignment of values to vertices in <IMG ALIGN="middle" SRC="figs/img70.png" BORDER="0" ALT=""> is
presented in Figure 4.
Figure 4(a) presents the initial state of the algorithm.
The critical vertex 8 is the only one that has non-critical vertices as
neighbours.
In the example presented in Figure 3, the addresses <IMG ALIGN="middle" SRC="figs/img198.png" BORDER="0" ALT=""> were not used.
So, taking the first unused address <IMG ALIGN="bottom" SRC="figs/img115.png" BORDER="0" ALT=""> and the vertex <IMG ALIGN="bottom" SRC="figs/img96.png" BORDER="0" ALT="">,
which is reached from the vertex <IMG ALIGN="bottom" SRC="figs/img169.png" BORDER="0" ALT="">, <IMG ALIGN="middle" SRC="figs/img199.png" BORDER="0" ALT=""> is set
to <IMG ALIGN="middle" SRC="figs/img200.png" BORDER="0" ALT="">, as shown in Figure 4(b).
The only vertex that is reached from vertex <IMG ALIGN="bottom" SRC="figs/img96.png" BORDER="0" ALT=""> is vertex <IMG ALIGN="bottom" SRC="figs/img62.png" BORDER="0" ALT="">, so
taking the unused address <IMG ALIGN="bottom" SRC="figs/img183.png" BORDER="0" ALT=""> we set <IMG ALIGN="middle" SRC="figs/img201.png" BORDER="0" ALT=""> to <IMG ALIGN="middle" SRC="figs/img202.png" BORDER="0" ALT="">,
as shown in Figure 4(c).
This process is repeated until the UnAssignedAddresses list becomes empty.
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><IMG ALIGN="middle" SRC="figs/img203.png" BORDER="0" ALT=""></TD>
</TR>
<TR>
<TD><B>Figure 4:</B> Example of the assignment of values to non-critical vertices.</TD>
</TR>
</TABLE>
<HR NOSHADE SIZE=1>
<A NAME="heuristic"></A>
<H2>The Heuristic</H2>
<P>
We now present a heuristic for the BMZ algorithm that
reduces the value of <IMG ALIGN="bottom" SRC="figs/img1.png" BORDER="0" ALT=""> to any given value between <I>1.15</I> and <I>0.93</I>.
This reduces the space requirement to store the resulting function
to any given value between <IMG ALIGN="bottom" SRC="figs/img12.png" BORDER="0" ALT=""> words and <IMG ALIGN="bottom" SRC="figs/img13.png" BORDER="0" ALT=""> words.
The heuristic reuses, when possible, the set
of <IMG ALIGN="bottom" SRC="figs/img11.png" BORDER="0" ALT=""> values that caused reassignments, just before
trying <IMG ALIGN="middle" SRC="figs/img164.png" BORDER="0" ALT="">.
Decreasing the value of <IMG ALIGN="bottom" SRC="figs/img1.png" BORDER="0" ALT=""> leads to an increase in the number of
iterations to generate <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT="">.
For example, for <IMG ALIGN="bottom" SRC="figs/img244.png" BORDER="0" ALT=""> and <IMG ALIGN="bottom" SRC="figs/img6.png" BORDER="0" ALT="">, the analytical expected number
of iterations are <IMG ALIGN="bottom" SRC="figs/img245.png" BORDER="0" ALT=""> and <IMG ALIGN="bottom" SRC="figs/img246.png" BORDER="0" ALT="">, respectively (see <A HREF="#papers">[2</A>]
for details),
while for <IMG ALIGN="bottom" SRC="figs/img128.png" BORDER="0" ALT=""> the same value is around <I>2.13</I>.
</P>
<HR NOSHADE SIZE=1>
<H2>Memory Consumption</H2>
<P>
Now we detail the memory consumption to generate and to store minimal perfect hash functions
using the BMZ algorithm. The structures responsible for memory consumption are the
following:
</P>
<UL>
<LI>Graph:
<OL>
<LI><B>first</B>: is a vector that stores <I>cn</I> integer numbers, each one representing
the first edge (index in the vector edges) in the list of
edges of each vertex.
The integer numbers are 4 bytes long. Therefore,
the vector first is stored in <I>4cn</I> bytes.
<P></P>
<LI><B>edges</B>: is a vector to represent the edges of the graph. As each edge
is composed of a pair of vertices, each entry stores two integer numbers
of 4 bytes that represent the vertices. As there are <I>n</I> edges, the
vector edges is stored in <I>8n</I> bytes.
<P></P>
<LI><B>next</B>: given a vertex <IMG ALIGN="bottom" SRC="figs/img139.png" BORDER="0" ALT="">, we can discover the edges that
contain <IMG ALIGN="bottom" SRC="figs/img139.png" BORDER="0" ALT=""> following its list of edges,
which starts on first[<IMG ALIGN="bottom" SRC="figs/img139.png" BORDER="0" ALT="">] and the next
edges are given by next[...first[<IMG ALIGN="bottom" SRC="figs/img139.png" BORDER="0" ALT="">]...]. Therefore, the vectors first and next represent
the linked lists of edges of each vertex. As there are two vertices for each edge,
when an edge is inserted into the graph, it must be inserted in the two linked lists
of the vertices in its composition. Therefore, there are <I>2n</I> entries of integer
numbers in the vector next, so it is stored in <I>4*2n = 8n</I> bytes.
<P></P>
<LI><B>critical vertices(critical_nodes vector)</B>: is a vector of <I>cn</I> bits,
where each bit indicates if a vertex is critical (1) or non-critical (0).
Therefore, the critical and non-critical vertices are represented in <I>cn/8</I> bytes.
<P></P>
<LI><B>critical edges (used_edges vector)</B>: is a vector of <I>n</I> bits, where each
bit indicates if an edge is critical (1) or non-critical (0). Therefore, the
critical and non-critical edges are represented in <I>n/8</I> bytes.
<P></P>
</OL>
<LI>Other auxiliary structures
<OL>
<LI><B>queue</B>: is a queue of integer numbers used in the breadth-first search of the
assignment of values to critical vertices. There is an entry in the queue for
each two critical vertices. Let <IMG ALIGN="middle" SRC="figs/img110.png" BORDER="0" ALT=""> be the expected number of critical
vertices. Therefore, the queue is stored in <I>4*0.5*<IMG ALIGN="middle" SRC="figs/img110.png" BORDER="0" ALT="">=2<IMG ALIGN="middle" SRC="figs/img110.png" BORDER="0" ALT=""></I> bytes.
<P></P>
<LI><B>visited</B>: is a vector of <I>cn</I> bits, where each bit indicates if the g value of
a given vertex was already defined. Therefore, the vector visited is stored
in <I>cn/8</I> bytes.
<P></P>
<LI><B>function <I>g</I></B>: is represented by a vector of <I>cn</I> integer numbers.
As each integer number is 4 bytes long, the function <I>g</I> is stored in
<I>4cn</I> bytes.
</OL>
</UL>
<P>
Thus, the total memory consumption of BMZ algorithm for generating a minimal
perfect hash function (MPHF) is: <I>(8.25c + 16.125)n +2<IMG ALIGN="middle" SRC="figs/img110.png" BORDER="0" ALT=""> + O(1)</I> bytes.
As the value of the constant <I>c</I> may be either 0.93 or 1.15, we have:
</P>
<TABLE ALIGN="center" BORDER="1" CELLPADDING="4">
<TR>
<TH><I>c</I></TH>
<TH><IMG ALIGN="middle" SRC="figs/img110.png" BORDER="0" ALT=""></TH>
<TH>Memory consumption to generate a MPHF</TH>
</TR>
<TR>
<TD>0.93</TD>
<TD ALIGN="center"><I>0.497n</I></TD>
<TD ALIGN="center"><I>24.80n + O(1)</I></TD>
</TR>
<TR>
<TD>1.15</TD>
<TD ALIGN="center"><I>0.401n</I></TD>
<TD ALIGN="center"><I>26.42n + O(1)</I></TD>
</TR>
</TABLE>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><B>Table 1:</B> Memory consumption to generate a MPHF using the BMZ algorithm.</TD>
</TR>
</TABLE>
<P>
The values of <IMG ALIGN="middle" SRC="figs/img110.png" BORDER="0" ALT=""> were calculated using Eq.(1) presented in <A HREF="#papers">[2</A>].
</P>
<P>
Now we present the memory consumption to store the resulting function.
We only need to store the <I>g</I> function. Thus, we need <I>4cn</I> bytes.
Again we have:
</P>
<TABLE ALIGN="center" BORDER="1" CELLPADDING="4">
<TR>
<TH><I>c</I></TH>
<TH>Memory consumption to store a MPHF</TH>
</TR>
<TR>
<TD>0.93</TD>
<TD ALIGN="center"><I>3.72n</I></TD>
</TR>
<TR>
<TD>1.15</TD>
<TD ALIGN="center"><I>4.60n</I></TD>
</TR>
</TABLE>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><B>Table 2:</B> Memory consumption to store a MPHF generated by the BMZ algorithm.</TD>
</TR>
</TABLE>
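<P>
To make the accounting above concrete, the small C sketch below (not part of the library) reproduces the per-key numbers in Tables 1 and 2 from the formulas <I>(8.25c + 16.125)n</I> plus twice the expected number of critical vertices for generation, and <I>4cn</I> for storage. The critical-vertex counts <I>0.497n</I> and <I>0.401n</I> are hard-coded from Table 1 for illustration only; in general they come from Eq.(1) of <A HREF="#papers">[2</A>].
</P>
<PRE>
/* Back-of-the-envelope sketch of the accounting above: generation takes
 * (8.25c + 16.125)n bytes plus twice the expected number of critical
 * vertices, and storage takes 4cn bytes (the g vector only). */
#include &lt;stdio.h&gt;

static double bmz_generation_bytes(double n, double c, double ncrit)
{
    return (8.25 * c + 16.125) * n + 2.0 * ncrit;
}

static double bmz_storage_bytes(double n, double c)
{
    return 4.0 * c * n;
}

int main(void)
{
    double n = 1000000.0;
    printf("c=0.93: ~%.2fn bytes to generate, ~%.2fn bytes to store\n",
           bmz_generation_bytes(n, 0.93, 0.497 * n) / n,
           bmz_storage_bytes(n, 0.93) / n);
    printf("c=1.15: ~%.2fn bytes to generate, ~%.2fn bytes to store\n",
           bmz_generation_bytes(n, 1.15, 0.401 * n) / n,
           bmz_storage_bytes(n, 1.15) / n);
    return 0;
}
</PRE>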
<HR NOSHADE SIZE=1>
<H2>Experimental Results</H2>
<P>
<A HREF="comparison.html">CHM x BMZ</A>
</P>
<HR NOSHADE SIZE=1>
<A NAME="papers"></A>
<H2>Papers</H2>
<OL>
<LI><A HREF="http://www.dcc.ufmg.br/~fbotelho">F. C. Botelho</A>, D. Menoti, <A HREF="http://www.dcc.ufmg.br/~nivio">N. Ziviani</A>. <A HREF="papers/bmz_tr004_04.ps">A New algorithm for constructing minimal perfect hash functions</A>, Technical Report TR004/04, Department of Computer Science, Federal University of Minas Gerais, 2004.
<P></P>
<LI><A HREF="http://www.dcc.ufmg.br/~fbotelho">F. C. Botelho</A>, Y. Kohayakawa, and <A HREF="http://www.dcc.ufmg.br/~nivio">N. Ziviani</A>. <A HREF="papers/wea05.pdf">A Practical Minimal Perfect Hashing Method</A>. <I>4th International Workshop on efficient and Experimental Algorithms (WEA05),</I> Springer-Verlag Lecture Notes in Computer Science, vol. 3505, Santorini Island, Greece, May 2005, 488-500.
</OL>
<HR NOSHADE SIZE=1>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><A HREF="index.html">Home</A></TD>
<TD><A HREF="chd.html">CHD</A></TD>
<TD><A HREF="bdz.html">BDZ</A></TD>
<TD><A HREF="bmz.html">BMZ</A></TD>
<TD><A HREF="chm.html">CHM</A></TD>
<TD><A HREF="brz.html">BRZ</A></TD>
<TD><A HREF="fch.html">FCH</A></TD>
</TR>
</TABLE>
<HR NOSHADE SIZE=1>
<P>
Enjoy!
</P>
<P>
<A HREF="mailto:davi@users.sourceforge.net">Davi de Castro Reis</A>
</P>
<P>
<A HREF="mailto:db8192@users.sourceforge.net">Djamel Belazzougui</A>
</P>
<P>
<A HREF="mailto:fc_botelho@users.sourceforge.net">Fabiano Cupertino Botelho</A>
</P>
<P>
<A HREF="mailto:nivio@dcc.ufmg.br">Nivio Ziviani</A>
</P>
<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
try {
var pageTracker = _gat._getTracker("UA-7698683-2");
pageTracker._trackPageview();
} catch(err) {}</script>
<!-- html code generated by txt2tags 2.6 (http://txt2tags.org) -->
<!-- cmdline: txt2tags -t html -i BMZ.t2t -o docs/bmz.html -->
</BODY></HTML>

966
docs/brz.html Normal file
View File

@ -0,0 +1,966 @@
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<META NAME="generator" CONTENT="http://txt2tags.org">
<LINK REL="stylesheet" TYPE="text/css" HREF="DOC.css">
<TITLE>External Memory Based Algorithm</TITLE>
</HEAD><BODY BGCOLOR="white" TEXT="black">
<CENTER>
<H1>External Memory Based Algorithm</H1>
</CENTER>
<HR NOSHADE SIZE=1>
<H2>Introduction</H2>
<P>
Until now, because of the limitations of current algorithms,
the use of MPHFs is restricted to scenarios where the set of keys being hashed is
relatively small.
However, in many cases it is crucial to deal in an efficient way with very large
sets of keys.
Due to the exponential growth of the Web, the work with huge collections is becoming
a daily task.
For instance, the simple assignment of number identifiers to web pages of a collection
can be a challenging task.
While traditional databases simply cannot handle more traffic once the working
set of URLs does not fit in main memory anymore<A HREF="#papers">[4</A>], the algorithm we propose here to
construct MPHFs can easily scale to billions of entries.
</P>
<P>
As there are many applications for MPHFs, it is
important to design and implement space and time efficient algorithms for
constructing such functions.
The attractiveness of using MPHFs depends on the following issues:
</P>
<OL>
<LI>The amount of CPU time required by the algorithms for constructing MPHFs.
<P></P>
<LI>The space requirements of the algorithms for constructing MPHFs.
<P></P>
<LI>The amount of CPU time required by a MPHF for each retrieval.
<P></P>
<LI>The space requirements of the description of the resulting MPHFs to be used at retrieval time.
</OL>
<P>
We present here a novel external memory based algorithm for constructing MPHFs that
is very efficient with respect to the four requirements mentioned previously.
First, the algorithm runs in time linear in the size of the key set to construct a MPHF,
which is optimal.
For instance, for a collection of 1 billion URLs
collected from the web, each one 64 characters long on average, the time to construct a
MPHF using a 2.4 gigahertz PC with 500 megabytes of available main memory
is approximately 3 hours.
Second, the algorithm needs a small, a priori defined vector of <IMG ALIGN="middle" SRC="figs/brz/img23.png" BORDER="0" ALT=""> one-byte
entries in main memory to construct a MPHF.
For the collection of 1 billion URLs and using <IMG ALIGN="middle" SRC="figs/brz/img4.png" BORDER="0" ALT="">, the algorithm needs only
5.45 megabytes of internal memory.
Third, the evaluation of the MPHF for each retrieval requires three memory accesses and
the computation of three universal hash functions.
This is not optimal as any MPHF requires at least one memory access and the computation
of two universal hash functions.
Fourth, the description of a MPHF takes a constant number of bits for each key, which is optimal.
For the collection of 1 billion URLs, it needs 8.1 bits for each key,
while the theoretical lower bound is <IMG ALIGN="middle" SRC="figs/brz/img24.png" BORDER="0" ALT=""> bits per key.
</P>
<HR NOSHADE SIZE=1>
<H2>The Algorithm</H2>
<P>
The main idea supporting our algorithm is the classical divide and conquer technique.
The algorithm is a two-step external memory based algorithm
that generates a MPHF <I>h</I> for a set <I>S</I> of <I>n</I> keys.
Figure 1 illustrates the two steps of the
algorithm: the partitioning step and the searching step.
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><IMG ALIGN="middle" SRC="figs/brz/brz.png" BORDER="0" ALT=""></TD>
</TR>
<TR>
<TD><B>Figure 1:</B> Main steps of our algorithm.</TD>
</TR>
</TABLE>
<P>
The partitioning step takes a key set <I>S</I> and uses a universal hash
function <IMG ALIGN="middle" SRC="figs/brz/img42.png" BORDER="0" ALT=""> proposed by Jenkins<A HREF="#papers">[5</A>]
to transform each key <IMG ALIGN="middle" SRC="figs/brz/img43.png" BORDER="0" ALT=""> into an integer <IMG ALIGN="middle" SRC="figs/brz/img44.png" BORDER="0" ALT="">.
Reducing <IMG ALIGN="middle" SRC="figs/brz/img44.png" BORDER="0" ALT=""> modulo <IMG ALIGN="middle" SRC="figs/brz/img23.png" BORDER="0" ALT="">, we partition <I>S</I>
into <IMG ALIGN="middle" SRC="figs/brz/img23.png" BORDER="0" ALT=""> buckets, each containing at most 256 keys (with high
probability).
</P>
<P>
The searching step generates a MPHF<IMG ALIGN="middle" SRC="figs/brz/img46.png" BORDER="0" ALT=""> for each bucket <I>i</I>, <IMG ALIGN="middle" SRC="figs/brz/img47.png" BORDER="0" ALT="">.
The resulting MPHF <I>h(k)</I>, <IMG ALIGN="middle" SRC="figs/brz/img43.png" BORDER="0" ALT="">, is given by
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><IMG ALIGN="middle" SRC="figs/brz/img49.png" BORDER="0" ALT=""></TD>
</TR>
</TABLE>
<P>
where <IMG ALIGN="middle" SRC="figs/brz/img50.png" BORDER="0" ALT="">.
The <I>i</I>th entry <I>offset[i]</I> of the displacement vector
<I>offset</I>, <IMG ALIGN="middle" SRC="figs/brz/img47.png" BORDER="0" ALT="">, contains the total number
of keys in the buckets from 0 to <I>i-1</I>, that is, it gives the interval of the
keys in the hash table addressed by the MPHF<IMG ALIGN="middle" SRC="figs/brz/img46.png" BORDER="0" ALT="">. In the following we explain
each step in detail.
</P>
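<P>
As a minimal sketch of how the resulting function is evaluated once all the pieces are in place, assume hypothetical helpers <I>bucket_of(k)</I> (the universal hash <IMG ALIGN="middle" SRC="figs/brz/img44.png" BORDER="0" ALT=""> reduced modulo the number of buckets) and <I>mphf_bucket(i, k)</I> (the per-bucket MPHF): the final value is simply the rank of <I>k</I> inside its bucket plus the bucket's base <I>offset[i]</I>.
</P>
<PRE>
/* h(k) = MPHF_i(k) + offset[i], where i is the bucket address of k.
 * bucket_of() and mphf_bucket() are hypothetical stand-ins for the bucket
 * address function and for the MPHF of bucket i, respectively. */
unsigned brz_eval(const char *k,
                  unsigned (*bucket_of)(const char *),
                  unsigned (*mphf_bucket)(unsigned bucket, const char *),
                  const unsigned *offset)
{
    unsigned i = bucket_of(k);              /* which bucket k fell into    */
    return mphf_bucket(i, k) + offset[i];   /* rank inside bucket + base   */
}
</PRE>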
<HR NOSHADE SIZE=1>
<H3>Partitioning step</H3>
<P>
The set <I>S</I> of <I>n</I> keys is partitioned into <IMG ALIGN="middle" SRC="figs/brz/img23.png" BORDER="0" ALT=""> buckets,
where <I>b</I> is a suitable parameter chosen to guarantee
that each bucket has at most 256 keys with high probability
(see <A HREF="#papers">[2</A>] for details).
The partitioning step works as follows:
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><IMG ALIGN="middle" SRC="figs/brz/img54.png" BORDER="0" ALT=""></TD>
</TR>
<TR>
<TD><B>Figure 2:</B> Partitioning step.</TD>
</TR>
</TABLE>
<P>
Statement 1.1 of the <B>for</B> loop presented in Figure 2
reads sequentially all the keys of block <IMG ALIGN="middle" SRC="figs/brz/img55.png" BORDER="0" ALT=""> from disk into an internal area
of size <IMG ALIGN="middle" SRC="figs/brz/img8.png" BORDER="0" ALT="">.
</P>
<P>
Statement 1.2 performs an indirect bucket sort of the keys in block <IMG ALIGN="middle" SRC="figs/brz/img55.png" BORDER="0" ALT=""> and
at the same time updates the entries in the vector <I>size</I>.
Let us briefly describe how <IMG ALIGN="middle" SRC="figs/brz/img55.png" BORDER="0" ALT=""> is partitioned among
the <IMG ALIGN="middle" SRC="figs/brz/img23.png" BORDER="0" ALT=""> buckets.
We use a local array of <IMG ALIGN="middle" SRC="figs/brz/img23.png" BORDER="0" ALT=""> counters to store a
count of how many keys from <IMG ALIGN="middle" SRC="figs/brz/img55.png" BORDER="0" ALT=""> belong to each bucket.
The pointers to the keys in each bucket <I>i</I>, <IMG ALIGN="middle" SRC="figs/brz/img47.png" BORDER="0" ALT="">,
are stored in contiguous positions in an array.
For this we first reserve the required number of entries
in this array of pointers using the information from the array of counters.
Next, we place the pointers to the keys in each bucket into the respective
reserved areas in the array (i.e., we place the pointers to the keys in bucket 0,
followed by the pointers to the keys in bucket 1, and so on).
</P>
<P>
To find the bucket address of a given key
we use the universal hash function <IMG ALIGN="middle" SRC="figs/brz/img44.png" BORDER="0" ALT=""><A HREF="#papers">[5</A>].
Key <I>k</I> goes into bucket <I>i</I>, where
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><IMG ALIGN="middle" SRC="figs/brz/img57.png" BORDER="0" ALT=""> (1)</TD>
</TR>
</TABLE>
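<P>
The following C sketch illustrates the idea behind statement 1.2 for one block (it is a simplification, not the library's implementation): count how many keys fall into each bucket, reserve a contiguous range of the pointer array for every bucket, and then place the pointers, updating the global vector <I>size</I> as a side effect. The hypothetical <I>bucket_of()</I> stands in for Eq. (1).
</P>
<PRE>
/* Indirect bucket sort of one block (statement 1.2 of Figure 2).
 * bucket_of() is a placeholder for Eq. (1), i.e. h0(k) mod (number of
 * buckets); any universal hash would do for this sketch. */
#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;

static unsigned bucket_of(const char *key, unsigned nbuckets)
{
    unsigned h = 5381;
    while (*key) h = h * 33 + (unsigned char)*key++;
    return h % nbuckets;
}

void bucket_sort_block(const char **keys, unsigned nkeys,
                       unsigned nbuckets, unsigned char *size)
{
    unsigned *count = calloc(nbuckets, sizeof *count);
    const char **sorted = malloc(nkeys * sizeof *sorted);
    unsigned k, i, start = 0;

    for (k = 0; k &lt; nkeys; k++)             /* local counters              */
        count[bucket_of(keys[k], nbuckets)]++;
    for (i = 0; i &lt; nbuckets; i++) {        /* reserve the pointer areas   */
        unsigned c = count[i];
        count[i] = start;                   /* first free slot of bucket i */
        start += c;
    }
    for (k = 0; k &lt; nkeys; k++) {           /* place the pointers          */
        i = bucket_of(keys[k], nbuckets);
        sorted[count[i]++] = keys[k];
        size[i]++;                          /* update the global size[]    */
    }
    memcpy(keys, sorted, nkeys * sizeof *sorted);
    free(sorted);
    free(count);
}
</PRE>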
<P>
Figure 3(a) shows a <I>logical</I> view of the <IMG ALIGN="middle" SRC="figs/brz/img23.png" BORDER="0" ALT=""> buckets
generated in the partitioning step.
In reality, the keys belonging to each bucket are distributed among many files,
as depicted in Figure 3(b).
In the example of Figure 3(b), the keys in bucket 0
appear in files 1 and <I>N</I>, the keys in bucket 1 appear in files 1, 2
and <I>N</I>, and so on.
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><IMG ALIGN="middle" SRC="figs/brz/brz-partitioning.png" BORDER="0" ALT=""></TD>
</TR>
<TR>
<TD><B>Figure 3:</B> Situation of the buckets at the end of the partitioning step: (a) Logical view (b) Physical view.</TD>
</TR>
</TABLE>
<P>
This scattering of the keys in the buckets could generate a performance
problem because of the potential number of seeks
needed to read the keys in each bucket from the <I>N</I> files in disk
during the searching step.
But, as we show in <A HREF="#papers">[2</A>], the number of seeks
can be kept small using buffering techniques.
Considering that only the vector <I>size</I>, which has <IMG ALIGN="middle" SRC="figs/brz/img23.png" BORDER="0" ALT=""> one-byte
entries (remember that each bucket has at most 256 keys),
must be maintained in main memory during the searching step,
almost all main memory is available to be used as disk I/O buffer.
</P>
<P>
The last step is to compute the <I>offset</I> vector and dump it to the disk.
We use the vector <I>size</I> to compute the
<I>offset</I> displacement vector.
The <I>offset[i]</I> entry contains the number of keys
in the buckets <I>0, 1, ..., i-1</I>.
As <I>size[i]</I> stores the number of keys
in bucket <I>i</I>, where <IMG ALIGN="middle" SRC="figs/brz/img47.png" BORDER="0" ALT="">, we have
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><IMG ALIGN="middle" SRC="figs/brz/img63.png" BORDER="0" ALT=""></TD>
</TR>
</TABLE>
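<P>
In code, this is just an exclusive prefix sum over the <I>size</I> vector, as in the short sketch below.
</P>
<PRE>
/* offset[i] = number of keys in buckets 0 .. i-1 (exclusive prefix sum). */
void compute_offset(const unsigned char *size, unsigned *offset,
                    unsigned nbuckets)
{
    unsigned i, sum = 0;
    for (i = 0; i &lt; nbuckets; i++) {
        offset[i] = sum;
        sum += size[i];
    }
}
</PRE>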
<HR NOSHADE SIZE=1>
<H3>Searching step</H3>
<P>
The searching step is responsible for generating a MPHF for each
bucket. Figure 4 presents the searching step algorithm.
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><IMG ALIGN="middle" SRC="figs/brz/img64.png" BORDER="0" ALT=""></TD>
</TR>
<TR>
<TD><B>Figure 4:</B> Searching step.</TD>
</TR>
</TABLE>
<P>
Statement 1 of Figure 4 inserts one key from each file
into a minimum heap <I>H</I> of size <I>N</I>.
The order relation in <I>H</I> is the bucket address <I>i</I> given by
Eq. (1).
</P>
<P>
Statement 2 has two important steps.
In statement 2.1, a bucket is read from disk,
as described below.
In statement 2.2, a MPHF is generated for each bucket <I>i</I>, as described
in the following.
The description of MPHF<IMG ALIGN="middle" SRC="figs/brz/img46.png" BORDER="0" ALT=""> is a vector <IMG ALIGN="middle" SRC="figs/brz/img66.png" BORDER="0" ALT=""> of 8-bit integers.
Finally, statement 2.3 writes the description <IMG ALIGN="middle" SRC="figs/brz/img66.png" BORDER="0" ALT=""> of MPHF<IMG ALIGN="middle" SRC="figs/brz/img46.png" BORDER="0" ALT=""> to disk.
</P>
<HR NOSHADE SIZE=1>
<H4>Reading a bucket from disk</H4>
<P>
In this section we present the refinement of statement 2.1 of
Figure 4.
The algorithm to read bucket <I>i</I> from disk is presented
in Figure 5.
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><IMG ALIGN="middle" SRC="figs/brz/img67.png" BORDER="0" ALT=""></TD>
</TR>
<TR>
<TD><B>Figure 5:</B> Reading a bucket.</TD>
</TR>
</TABLE>
<P>
Bucket <I>i</I> is distributed among many files and the heap <I>H</I> is used to drive a
multiway merge operation.
In Figure 5, statement 1.1 extracts and removes triple
<I>(i, j, k)</I> from <I>H</I>, where <I>i</I> is a minimum value in <I>H</I>.
Statement 1.2 inserts key <I>k</I> in bucket <I>i</I>.
Notice that the <I>k</I> in the triple <I>(i, j, k)</I> is in fact a pointer to
the first byte of the key that is kept in contiguous positions of an array of characters
(this array containing the keys is initialized during the heap construction
in statement 1 of Figure 4).
Statement 1.3 performs a seek operation in File <I>j</I> on disk for the first
read operation and reads sequentially all keys <I>k</I> that have the same <I>i</I>
and inserts them all in bucket <I>i</I>.
Finally, statement 1.4 inserts in <I>H</I> the triple <I>(i, j, x)</I>,
where <I>x</I> is the first key read from File <I>j</I> (in statement 1.3)
that does not have the same bucket address as the previous keys.
</P>
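<P>
A minimal sketch of this multiway merge is given below. To keep it self-contained, each "file" is modeled as an in-memory array of (bucket, key) records already ordered by bucket address, as the partitioning step guarantees, and a linear scan over the file heads stands in for the heap <I>H</I>; the real implementation reads the <I>N</I> files from disk through I/O buffers.
</P>
<PRE>
/* Sketch of statement 2.1 (Figure 5): merge the N partial files and
 * materialize one bucket at a time. Toy data, no real disk I/O. */
#include &lt;stdio.h&gt;

typedef struct { int bucket; const char *key; } record;

static record file0[] = { {0, "ant"}, {1, "bee"}, {2, "cow"} };
static record file1[] = { {0, "asp"}, {2, "cat"}, {2, "dog"} };
static record *files[] = { file0, file1 };
static int sizes[] = { 3, 3 };
static int heads[] = { 0, 0 };               /* next unread record per file */
#define NFILES 2

int main(void)
{
    for (;;) {
        int j, min_bucket = -1;
        /* plays the role of extracting the minimum triple (i, j, k) from H */
        for (j = 0; j &lt; NFILES; j++)
            if (heads[j] &lt; sizes[j] &amp;&amp;
                (min_bucket &lt; 0 || files[j][heads[j]].bucket &lt; min_bucket))
                min_bucket = files[j][heads[j]].bucket;
        if (min_bucket &lt; 0) break;           /* all files exhausted         */

        printf("bucket %d:", min_bucket);
        /* statement 1.3: read sequentially, from each file, the keys that
         * share this bucket address */
        for (j = 0; j &lt; NFILES; j++)
            while (heads[j] &lt; sizes[j] &amp;&amp;
                   files[j][heads[j]].bucket == min_bucket)
                printf(" %s", files[j][heads[j]++].key);
        printf("\n");
        /* here the real algorithm would build MPHF_i for this bucket */
    }
    return 0;
}
</PRE>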
<P>
The number of seek operations on disk performed in statement 1.3 is discussed
in <A HREF="#papers">[2, Section 5.1</A>],
where we present a buffering technique that brings down
the time spent with seeks.
</P>
<HR NOSHADE SIZE=1>
<H4>Generating a MPHF for each bucket</H4>
<P>
To the best of our knowledge the <A HREF="bmz.html">BMZ algorithm</A> we have designed in
our previous works <A HREF="#papers">[1,3</A>] is the fastest published algorithm for
constructing MPHFs.
That is why we are using that algorithm as a building block for the
algorithm presented here. In reality, we are using
an optimized version of BMZ (BMZ8) for small sets of keys (at most 256 keys).
<A HREF="bmz.html">Click here to see details about BMZ algorithm</A>.
</P>
<HR NOSHADE SIZE=1>
<H2>Analysis of the Algorithm</H2>
<P>
Analytical results and the complete analysis of the external memory based algorithm
can be found in <A HREF="#papers">[2</A>].
</P>
<HR NOSHADE SIZE=1>
<H2>Experimental Results</H2>
<P>
In this section we present the experimental results.
We start presenting the experimental setup.
We then present experimental results for
the internal memory based algorithm (<A HREF="bmz.html">the BMZ algorithm</A>)
and for our external memory based algorithm.
Finally, we discuss how the amount of internal memory available
affects the runtime of the external memory based algorithm.
</P>
<HR NOSHADE SIZE=1>
<H3>The data and the experimental setup</H3>
<P>
All experiments were carried out on
a computer running the Linux operating system, version 2.6,
with a 2.4 gigahertz processor and
1 gigabyte of main memory.
In the experiments related to the new
algorithm we limited the main memory to 500 megabytes.
</P>
<P>
Our data consists of a collection of 1 billion
URLs collected from the Web, each URL 64 characters long on average.
The collection occupies 60.5 gigabytes on disk.
</P>
<HR NOSHADE SIZE=1>
<H3>Performance of the BMZ Algorithm</H3>
<P>
<A HREF="bmz.html">The BMZ algorithm</A> is used for constructing a MPHF for each bucket.
It is a randomized algorithm because it needs to generate a simple random graph
in its first step.
Once the graph is obtained the other two steps are deterministic.
</P>
<P>
Thus, we can consider the runtime of the algorithm to have
the form <IMG ALIGN="middle" SRC="figs/brz/img159.png" BORDER="0" ALT=""> for an input of <I>n</I> keys,
where <IMG ALIGN="middle" SRC="figs/brz/img160.png" BORDER="0" ALT=""> is some machine dependent
constant that further depends on the length of the keys and <I>Z</I> is a random
variable with geometric distribution with mean <IMG ALIGN="middle" SRC="figs/brz/img162.png" BORDER="0" ALT="">. All results
in our experiments were obtained taking <I>c=1</I>; the value of <I>c</I>, with <I>c</I> in <I>[0.93,1.15]</I>,
in fact has little influence on the runtime, as shown in <A HREF="#papers">[3</A>].
</P>
<P>
The values chosen for <I>n</I> were 1, 2, 4, 8, 16 and 32 million.
Although we have a dataset with 1 billion URLs, on a PC with
1 gigabyte of main memory, the algorithm is able
to handle an input with at most 32 million keys.
This is mainly because of the graph we need to keep in main memory.
The algorithm requires <I>25n + O(1)</I> bytes for constructing
a MPHF (<A HREF="bmz.html">click here to get details about the data structures used by the BMZ algorithm</A>).
</P>
<P>
In order to estimate the number of trials for each value of <I>n</I> we use
a statistical method for determining a suitable sample size (see, e.g., <A HREF="#papers">[6, Chapter 13</A>]).
As we obtained different values for each <I>n</I>,
we used the maximal value obtained, namely, 300 trials in order to have
a confidence level of 95 %.
</P>
<P>
Table 1 presents the runtime average for each <I>n</I>,
the respective standard deviations, and
the respective confidence intervals given by
the average time <IMG ALIGN="middle" SRC="figs/brz/img167.png" BORDER="0" ALT=""> the distance from average time
considering a confidence level of 95 %.
Observing the runtime averages one sees that
the algorithm runs in expected linear time,
as shown in <A HREF="#papers">[3</A>].
</P>
<TABLE CELLPADDING=3 BORDER="1" ALIGN="CENTER">
<TR><TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE">
<SPAN CLASS="MATH"><IMG
WIDTH="14" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img5.png"
ALT="$n$"></SPAN> (millions) </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 1 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 2 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 4 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 8 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 16 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 32 </SMALL></TD>
<TD></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE">
Average time (s)</SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="64" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img168.png"
ALT="$6.1 \pm 0.3$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="72" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img169.png"
ALT="$12.2 \pm 0.6$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="72" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img170.png"
ALT="$25.4 \pm 1.1$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="72" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img171.png"
ALT="$51.4 \pm 2.0$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="80" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img172.png"
ALT="$117.3 \pm 4.4$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="80" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img173.png"
ALT="$262.2 \pm 8.7$"></SPAN></SMALL></TD>
<TD></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE">
SD (s) </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img174.png"
ALT="$2.6$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img175.png"
ALT="$5.4$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img176.png"
ALT="$9.8$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img177.png"
ALT="$17.6$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img178.png"
ALT="$37.3$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img179.png"
ALT="$76.3$"></SPAN> </SMALL></TD>
<TD></TD>
</TR>
</TABLE>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><B>Table 1:</B> Internal memory based algorithm: average time in seconds for constructing a MPHF, the standard deviation (SD), and the confidence intervals considering a confidence level of 95 %.</TD>
</TR>
</TABLE>
<P>
Figure 6 presents the runtime for each trial. In addition,
the solid line corresponds to a linear regression model
obtained from the experimental measurements.
As we can see, the runtime for a given <I>n</I> has a considerable
fluctuation. However, the fluctuation also grows linearly with <I>n</I>.
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><IMG ALIGN="middle" SRC="figs/brz/bmz_temporegressao.png" BORDER="0" ALT=""></TD>
</TR>
<TR>
<TD><B>Figure 6:</B> Time versus number of keys in <I>S</I> for the internal memory based algorithm. The solid line corresponds to a linear regression model.</TD>
</TR>
</TABLE>
<P>
The observed fluctuation in the runtimes is as expected; recall that this
runtime has the form <IMG ALIGN="middle" SRC="figs/brz/img159.png" BORDER="0" ALT=""> with <I>Z</I> a geometric random variable with
mean <I>1/p=e</I>. Thus, the runtime has mean <IMG ALIGN="middle" SRC="figs/brz/img181.png" BORDER="0" ALT=""> and standard
deviation <IMG ALIGN="middle" SRC="figs/brz/img182.png" BORDER="0" ALT="">.
Therefore, the standard deviation also grows
linearly with <I>n</I>, as experimentally verified
in Table 1 and in Figure 6.
</P>
<HR NOSHADE SIZE=1>
<H3>Performance of the External Memory Based Algorithm</H3>
<P>
The runtime of the external memory based algorithm is also a random variable,
but now it follows a (highly concentrated) normal distribution, as we discuss at the end of this
section. Again, we are interested in verifying the linearity claim made in
<A HREF="#papers">[2, Section 5.1</A>]. Therefore, we ran the algorithm for
several numbers <I>n</I> of keys in <I>S</I>.
</P>
<P>
The values chosen for <I>n</I> were 1, 2, 4, 8, 16, 32, 64, 128, 512 and 1000
million.
We limited the main memory to 500 megabytes for the experiments.
The size <IMG ALIGN="middle" SRC="figs/brz/img8.png" BORDER="0" ALT=""> of the a priori reserved internal memory area
was set to 250 megabytes, the parameter <I>b</I> was set to <I>175</I> and
the building block algorithm parameter <I>c</I> was again set to <I>1</I>.
We show later on how <IMG ALIGN="middle" SRC="figs/brz/img8.png" BORDER="0" ALT=""> affects the runtime of the algorithm. The other two parameters
have insignificant influence on the runtime.
</P>
<P>
We again use a statistical method for determining a suitable sample size
to estimate the number of trials to be run for each value of <I>n</I>. We found that
just one trial for each <I>n</I> would be enough with a confidence level of 95 %.
However, we ran 10 trials. This number of trials seems rather small, but, as
shown below, the behavior of our algorithm is very stable and its runtime is
almost deterministic (i.e., the standard deviation is very small).
</P>
<P>
Table 2 presents the runtime average for each <I>n</I>,
the respective standard deviations, and
the respective confidence intervals given by
the average time <IMG ALIGN="middle" SRC="figs/brz/img167.png" BORDER="0" ALT=""> the distance from average time
considering a confidence level of 95 %.
Observing the runtime averages we noticed that
the algorithm runs in expected linear time,
as shown in <A HREF="#papers">[2, Section 5.1</A>]. Better still,
it is only approximately 60 % slower than the BMZ algorithm.
To get that value we used the linear regression model obtained for the runtime of
the internal memory based algorithm to estimate how much time it would require
for constructing a MPHF for a set of 1 billion keys.
We got 2.3 hours for the internal memory based algorithm and we measured
3.67 hours on average for the external memory based algorithm.
Increasing the size of the internal memory area
from 250 to 600 megabytes,
we brought the time down to 3.09 hours. In this setup, the external memory based algorithm is
just 34 % slower.
</P>
<TABLE CELLPADDING=3 BORDER="1" ALIGN="CENTER">
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
<SPAN CLASS="MATH"><IMG
WIDTH="14" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img5.png"
ALT="$n$"></SPAN> (millions) </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 1 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 2 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 4 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 8 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 16 </SMALL></TD>
</TR>
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
Average time (s) </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="64" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img187.png"
ALT="$6.9 \pm 0.3$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="72" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img188.png"
ALT="$13.8 \pm 0.2$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="72" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img189.png"
ALT="$31.9 \pm 0.7$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="72" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img190.png"
ALT="$69.9 \pm 1.1$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="80" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img191.png"
ALT="$140.6 \pm 2.5$"></SPAN> </SMALL></TD>
</TR>
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
SD </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img192.png"
ALT="$0.4$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img193.png"
ALT="$0.2$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img194.png"
ALT="$0.9$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img195.png"
ALT="$1.5$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img196.png"
ALT="$3.5$"></SPAN> </SMALL></TD>
</TR>
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
<SPAN CLASS="MATH"><IMG
WIDTH="14" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img5.png"
ALT="$n$"></SPAN> (millions) </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 32 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 64 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 128 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 512 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 1000 </SMALL></TD>
</TR>
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
Average time (s) </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="80" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img197.png"
ALT="$284.3 \pm 1.1$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="80" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img198.png"
ALT="$587.9 \pm 3.9$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <!-- MATH
$1223.6 \pm 4.9$
-->
<SPAN CLASS="MATH"><IMG
WIDTH="88" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img199.png"
ALT="$1223.6 \pm 4.9$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <!-- MATH
$5966.4 \pm 9.5$
-->
<SPAN CLASS="MATH"><IMG
WIDTH="88" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img200.png"
ALT="$5966.4 \pm 9.5$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <!-- MATH
$13229.5 \pm 12.7$
-->
<SPAN CLASS="MATH"><IMG
WIDTH="104" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img201.png"
ALT="$13229.5 \pm 12.7$"></SPAN> </SMALL></TD>
</TR>
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
SD </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img202.png"
ALT="$1.6$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img203.png"
ALT="$5.5$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img204.png"
ALT="$6.8$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img205.png"
ALT="$13.2$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img206.png"
ALT="$18.6$"></SPAN> </SMALL></TD>
</TR>
<TR><TD></TD>
<TD></TD>
<TD></TD>
<TD></TD>
<TD></TD>
<TD></TD>
</TR>
</TABLE>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><B>Table 2:</B>The external memory based algorithm: average time in seconds for constructing a MPHF, the standard deviation (SD), and the confidence intervals considering a confidence level of 95 %.</TD>
</TR>
</TABLE>
<P>
Figure 7 presents the runtime for each trial. In addition,
the solid line corresponds to a linear regression model
obtained from the experimental measurements.
As we were expecting, the runtime for a given <I>n</I> has almost no
variation.
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><IMG ALIGN="middle" SRC="figs/brz/brz_temporegressao.png" BORDER="0" ALT=""></TD>
</TR>
<TR>
<TD><B>Figure 7:</B> Time versus number of keys in <I>S</I> for our algorithm. The solid line corresponds to a linear regression model.</TD>
</TR>
</TABLE>
<P>
An intriguing observation is that the runtime of the algorithm is almost
deterministic, in spite of the fact that it uses as building block an
algorithm with a considerable fluctuation in its runtime. A given bucket
<I>i</I>, <IMG ALIGN="middle" SRC="figs/brz/img47.png" BORDER="0" ALT="">, is a small set of keys (at most 256 keys) and,
as argued in the last section, the runtime of the
building block algorithm is a random variable <IMG ALIGN="middle" SRC="figs/brz/img207.png" BORDER="0" ALT=""> with high fluctuation.
However, the runtime <I>Y</I> of the searching step of the external memory based algorithm is given
by <IMG ALIGN="middle" SRC="figs/brz/img209.png" BORDER="0" ALT="">. Under the hypothesis that
the <IMG ALIGN="middle" SRC="figs/brz/img207.png" BORDER="0" ALT=""> are independent and bounded, the <I>law of large numbers</I> (see,
e.g., <A HREF="#papers">[6</A>]) implies that the random variable <IMG ALIGN="middle" SRC="figs/brz/img210.png" BORDER="0" ALT=""> converges
to a constant as <IMG ALIGN="middle" SRC="figs/brz/img83.png" BORDER="0" ALT="">. This explains why the runtime of our
algorithm is almost deterministic.
</P>
<HR NOSHADE SIZE=1>
<H3>Controlling disk accesses</H3>
<P>
In order to bring down the number of seek operations on disk
we benefit from the fact that our algorithm leaves almost all main
memory available to be used as disk I/O buffer.
In this section we evaluate how much the parameter <IMG ALIGN="middle" SRC="figs/brz/img8.png" BORDER="0" ALT=""> affects the runtime of our algorithm.
For that we fixed <I>n</I> at 1 billion URLs,
set the main memory of the machine used for the experiments
to 1 gigabyte and used <IMG ALIGN="middle" SRC="figs/brz/img8.png" BORDER="0" ALT=""> equal to 100, 200, 300, 400, 500 and 600
megabytes.
</P>
<P>
Table 3 presents the number of files <I>N</I>,
the buffer size used for all files, the number of seeks in the worst case considering
the pessimistic assumption mentioned in <A HREF="#papers">[2, Section 5.1</A>], and
the time to generate a MPHF for 1 billion keys as a function of the amount of internal
memory available. Observing Table 3 we noticed that the time spent in the construction
decreases as the value of <IMG ALIGN="middle" SRC="figs/brz/img8.png" BORDER="0" ALT=""> increases. However, for <IMG ALIGN="middle" SRC="figs/brz/img213.png" BORDER="0" ALT="">, the variation
in the time is not as significant as for <IMG ALIGN="middle" SRC="figs/brz/img214.png" BORDER="0" ALT="">.
This can be explained by the fact that the kernel 2.6 I/O scheduler of Linux
has smart policies for avoiding seeks and diminishing the average seek time
(see <A HREF="http://www.linuxjournal.com/article/6931">http://www.linuxjournal.com/article/6931</A>).
</P>
<TABLE CELLPADDING=3 BORDER="1" ALIGN="center">
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
<SPAN CLASS="MATH"><IMG
WIDTH="14" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img8.png"
ALT="$\mu $"></SPAN> (MB) </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img215.png"
ALT="$100$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img216.png"
ALT="$200$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img217.png"
ALT="$300$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img218.png"
ALT="$400$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img219.png"
ALT="$500$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img212.png"
ALT="$600$"></SPAN> </SMALL></TD>
</TR>
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
<SPAN CLASS="MATH"><IMG
WIDTH="19" HEIGHT="14" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img58.png"
ALT="$N$"></SPAN> (files) </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img220.png"
ALT="$619$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img221.png"
ALT="$310$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img222.png"
ALT="$207$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img223.png"
ALT="$155$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img224.png"
ALT="$124$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img225.png"
ALT="$104$"></SPAN> </SMALL></TD>
</TR>
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
&nbsp;(buffer size in KB) </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img226.png"
ALT="$165$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img227.png"
ALT="$661$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="43" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img228.png"
ALT="$1,484$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="43" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img229.png"
ALT="$2,643$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="43" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img230.png"
ALT="$4,129$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="43" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img231.png"
ALT="$5,908$"></SPAN> </SMALL></TD>
</TR>
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
<SPAN CLASS="MATH"><IMG
WIDTH="14" HEIGHT="30" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img135.png"
ALT="$\beta$"></SPAN>/&nbsp;(# of seeks in the worst case) </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="59" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img232.png"
ALT="$384,478$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="51" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img233.png"
ALT="$95,974$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="51" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img234.png"
ALT="$42,749$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="51" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img235.png"
ALT="$24,003$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="51" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img236.png"
ALT="$15,365$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="51" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
SRC="figs/brz/img237.png"
ALT="$10,738$"></SPAN> </SMALL></TD>
</TR>
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
Time (hours) </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img238.png"
ALT="$4.04$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img239.png"
ALT="$3.64$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img240.png"
ALT="$3.34$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img241.png"
ALT="$3.20$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img242.png"
ALT="$3.13$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/brz/img243.png"
ALT="$3.09$"></SPAN> </SMALL></TD>
</TR>
</TABLE>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><B>Table 3:</B>Influence of the internal memory area size (<IMG ALIGN="middle" SRC="figs/brz/img8.png" BORDER="0" ALT="">) in the external memory based algorithm runtime.</TD>
</TR>
</TABLE>
<HR NOSHADE SIZE=1>
<A NAME="papers"></A>
<H2>Papers</H2>
<OL>
<LI><A HREF="http://www.dcc.ufmg.br/~fbotelho">F. C. Botelho</A>, D. Menoti, <A HREF="http://www.dcc.ufmg.br/~nivio">N. Ziviani</A>. <A HREF="papers/bmz_tr004_04.ps">A New algorithm for constructing minimal perfect hash functions</A>, Technical Report TR004/04, Department of Computer Science, Federal University of Minas Gerais, 2004.
<P></P>
<LI><A HREF="http://www.dcc.ufmg.br/~fbotelho">F. C. Botelho</A>, Y. Kohayakawa, <A HREF="http://www.dcc.ufmg.br/~nivio">N. Ziviani</A>. <A HREF="papers/tr06.pdf">An Approach for Minimal Perfect Hash Functions for Very Large Databases</A>, Technical Report TR003/06, Department of Computer Science, Federal University of Minas Gerais, 2004.
<P></P>
<LI><A HREF="http://www.dcc.ufmg.br/~fbotelho">F. C. Botelho</A>, Y. Kohayakawa, and <A HREF="http://www.dcc.ufmg.br/~nivio">N. Ziviani</A>. <A HREF="papers/wea05.pdf">A Practical Minimal Perfect Hashing Method</A>. <I>4th International Workshop on efficient and Experimental Algorithms (WEA05),</I> Springer-Verlag Lecture Notes in Computer Science, vol. 3505, Santorini Island, Greece, May 2005, 488-500.
<P></P>
<LI><A HREF="http://acmqueue.com/modules.php?name=Content&pa=showpage&pid=299">M. Seltzer. Beyond relational databases. ACM Queue, 3(3), April 2005.</A>
<P></P>
<LI><A HREF="http://burtleburtle.net/bob/hash/doobs.html">Bob Jenkins. Algorithm alley: Hash functions. Dr. Dobb's Journal of Software Tools, 22(9), september 1997.</A>
<P></P>
<LI>R. Jain. The art of computer systems performance analysis: techniques for experimental design, measurement, simulation, and modeling. John Wiley, first edition, 1991.
</OL>
<HR NOSHADE SIZE=1>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><A HREF="index.html">Home</A></TD>
<TD><A HREF="chd.html">CHD</A></TD>
<TD><A HREF="bdz.html">BDZ</A></TD>
<TD><A HREF="bmz.html">BMZ</A></TD>
<TD><A HREF="chm.html">CHM</A></TD>
<TD><A HREF="brz.html">BRZ</A></TD>
<TD><A HREF="fch.html">FCH</A></TD>
</TR>
</TABLE>
<HR NOSHADE SIZE=1>
<P>
Enjoy!
</P>
<P>
<A HREF="mailto:davi@users.sourceforge.net">Davi de Castro Reis</A>
</P>
<P>
<A HREF="mailto:db8192@users.sourceforge.net">Djamel Belazzougui</A>
</P>
<P>
<A HREF="mailto:fc_botelho@users.sourceforge.net">Fabiano Cupertino Botelho</A>
</P>
<P>
<A HREF="mailto:nivio@dcc.ufmg.br">Nivio Ziviani</A>
</P>
<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
try {
var pageTracker = _gat._getTracker("UA-7698683-2");
pageTracker._trackPageview();
} catch(err) {}</script>
<!-- html code generated by txt2tags 2.6 (http://txt2tags.org) -->
<!-- cmdline: txt2tags -t html -i BRZ.t2t -o docs/brz.html -->
</BODY></HTML>

97
docs/chd.html Normal file
View File

@ -0,0 +1,97 @@
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<META NAME="generator" CONTENT="http://txt2tags.org">
<LINK REL="stylesheet" TYPE="text/css" HREF="DOC.css">
<TITLE>Compress, Hash and Displace: CHD Algorithm</TITLE>
</HEAD><BODY BGCOLOR="white" TEXT="black">
<CENTER>
<H1>Compress, Hash and Displace: CHD Algorithm</H1>
</CENTER>
<HR NOSHADE SIZE=1>
<H2>Introduction</H2>
<P>
The important performance parameters of a PHF are representation size, evaluation time and construction time. The representation size plays an important role when the whole function fits in a faster memory and the actual data is stored in a slower memory. For instance, compact PHFs can fit entirely in a CPU cache, which makes their evaluation very fast by avoiding cache misses. The CHD algorithm plays an important role in this context. It was designed by Djamal Belazzougui, Fabiano C. Botelho, and Martin Dietzfelbinger in <A HREF="#papers">[2</A>].
</P>
<P>
The CHD algorithm makes it possible to obtain PHFs with representation size very close to optimal while retaining <I>O(n)</I> construction time and <I>O(1)</I> evaluation time. For example, in the case <I>m=2n</I> we obtain a PHF that uses <I>0.67</I> bits per key, and for <I>m=1.23n</I> we obtain <I>1.4</I> bits per key, which was not achievable with previously known methods. The CHD algorithm is inspired by several known algorithms; the main new feature is that it combines a modification of Pagh's "hash-and-displace" approach with data compression on a sequence of hash function indices. That combination makes it possible to significantly reduce space usage while retaining linear construction time and constant query time. The CHD algorithm can also be used for <I>k</I>-perfect hashing, where at most <I>k</I> keys may be mapped to the same value. For the analysis we assume that fully random hash functions are given for free; such assumptions can be justified and were made in previous papers.
</P>
<P>
The compact PHFs generated by the CHD algorithm can be used in many applications in which we want to assign a unique identifier to each key without storing any information on the key. One of the most obvious applications of those functions (or <I>k</I>-perfect hash functions) is when we have a small fast memory in which we can store the perfect hash function while the keys and associated satellite data are stored in slower but larger memory. The size of a block or a transfer unit may be chosen so that <I>k</I> data items can be retrieved in one read access. In this case we can ensure that data associated with a key can be retrieved in a single probe to slower memory. This has been used for example in hardware routers <A HREF="#papers">[4</A>].
</P>
<P>
The CHD algorithm generates the most compact PHFs and MPHFs we know of in <I>O(n)</I> time. The time required to evaluate the generated functions is constant (in practice less than <I>1.4</I> microseconds). The storage space of the resulting PHFs and MPHFs is distant from the information theoretic lower bound by a factor of <I>1.43</I>. The closest competitor is the algorithm by Dietzfelbinger and Pagh <A HREF="#papers">[3</A>], but their algorithm does not run in linear time. Furthermore, the CHD algorithm can be tuned to run faster than the BPZ algorithm <A HREF="#papers">[1</A>] (the fastest algorithm available in the literature so far) and to obtain more compact functions. The most impressive characteristic is that it has the ability, in principle, to approximate the information theoretic lower bound while being practical. A detailed description of the CHD algorithm can be found in <A HREF="#papers">[2</A>].
</P>
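<P>
A minimal usage sketch, along the lines of the library's vector adapter example and assuming the standard cmph C API, is shown below: it builds a CHD-based MPHF over a small set of string keys and queries it. The constant CMPH_CHD selects the MPHF variant (CMPH_CHD_PH gives the plain PHF); error handling is omitted for brevity.
</P>
<PRE>
#include &lt;cmph.h&gt;
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;

int main(void)
{
    const char *vector[] = { "aaaaaaaaaa", "bbbbbbbbbb", "cccccccccc",
                             "dddddddddd", "eeeeeeeeee" };
    unsigned int nkeys = 5, i;

    /* source adapter over an in-memory vector of keys */
    cmph_io_adapter_t *source = cmph_io_vector_adapter((char **)vector, nkeys);
    cmph_config_t *config = cmph_config_new(source);
    cmph_config_set_algo(config, CMPH_CHD);      /* pick the CHD algorithm  */
    cmph_t *hash = cmph_new(config);
    cmph_config_destroy(config);

    for (i = 0; i &lt; nkeys; i++) {
        const char *key = vector[i];
        unsigned int id = cmph_search(hash, key, (cmph_uint32)strlen(key));
        printf("%s -&gt; %u\n", key, id);
    }

    cmph_destroy(hash);
    cmph_io_vector_adapter_destroy(source);
    return 0;
}
</PRE>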
<HR NOSHADE SIZE=1>
<H2>Experimental Results</H2>
<P>
Experimental results comparing the CHD algorithm with <A HREF="bdz.html">the BDZ algorithm</A>
and others available in the CMPH library are presented in <A HREF="#papers">[2</A>].
</P>
<HR NOSHADE SIZE=1>
<A NAME="papers"></A>
<H2>Papers</H2>
<OL>
<LI><A HREF="http://www.dcc.ufmg.br/~fbotelho">F. C. Botelho</A>, <A HREF="http://www.itu.dk/~pagh/">R. Pagh</A>, <A HREF="http://www.dcc.ufmg.br/~nivio">N. Ziviani</A>. <A HREF="papers/wads07.pdf">Simple and space-efficient minimal perfect hash functions</A>. <I>In Proceedings of the 10th International Workshop on Algorithms and Data Structures (WADs'07),</I> Springer-Verlag Lecture Notes in Computer Science, vol. 4619, Halifax, Canada, August 2007, 139-150.
<P></P>
<LI><A HREF="http://www.dcc.ufmg.br/~fbotelho">F. C. Botelho</A>, D. Belazzougui and M. Dietzfelbinger. <A HREF="papers/esa09.pdf">Compress, hash and displace</A>. <I>In Proceedings of the 17th European Symposium on Algorithms (ESA09)</I>. Springer LNCS, 2009.
<P></P>
<LI>M. Dietzfelbinger and <A HREF="http://www.itu.dk/~pagh/">R. Pagh</A>. Succinct data structures for retrieval and approximate membership. <I>In Proceedings of the 35th international colloquium on Automata, Languages and Programming (ICALP08)</I>, pages 385-396, Berlin, Heidelberg, 2008. Springer-Verlag.
<P></P>
<LI>B. Prabhakar and F. Bonomi. Perfect hashing for network applications. <I>In Proceedings of the IEEE International Symposium on Information Theory</I>. IEEE Press, 2006.
</OL>
<HR NOSHADE SIZE=1>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><A HREF="index.html">Home</A></TD>
<TD><A HREF="chd.html">CHD</A></TD>
<TD><A HREF="bdz.html">BDZ</A></TD>
<TD><A HREF="bmz.html">BMZ</A></TD>
<TD><A HREF="chm.html">CHM</A></TD>
<TD><A HREF="brz.html">BRZ</A></TD>
<TD><A HREF="fch.html">FCH</A></TD>
</TR>
</TABLE>
<HR NOSHADE SIZE=1>
<P>
Enjoy!
</P>
<P>
<A HREF="mailto:davi@users.sourceforge.net">Davi de Castro Reis</A>
</P>
<P>
<A HREF="mailto:db8192@users.sourceforge.net">Djamel Belazzougui</A>
</P>
<P>
<A HREF="mailto:fc_botelho@users.sourceforge.net">Fabiano Cupertino Botelho</A>
</P>
<P>
<A HREF="mailto:nivio@dcc.ufmg.br">Nivio Ziviani</A>
</P>
<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
try {
var pageTracker = _gat._getTracker("UA-7698683-2");
pageTracker._trackPageview();
} catch(err) {}</script>
<!-- html code generated by txt2tags 2.6 (http://txt2tags.org) -->
<!-- cmdline: txt2tags -t html -i CHD.t2t -o docs/chd.html -->
</BODY></HTML>

180
docs/chm.html Normal file
View File

@ -0,0 +1,180 @@
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<META NAME="generator" CONTENT="http://txt2tags.org">
<LINK REL="stylesheet" TYPE="text/css" HREF="DOC.css">
<TITLE>CHM Algorithm</TITLE>
</HEAD><BODY BGCOLOR="white" TEXT="black">
<CENTER>
<H1>CHM Algorithm</H1>
</CENTER>
<HR NOSHADE SIZE=1>
<H2>The Algorithm</H2>
<P>
The algorithm is presented in <A HREF="#papers">[1,2,3</A>].
</P>
<HR NOSHADE SIZE=1>
<H2>Memory Consumption</H2>
<P>
Now we detail the memory consumption to generate and to store minimal perfect hash functions
using the CHM algorithm. The structures responsible for the memory consumption are the
following:
</P>
<UL>
<LI>Graph:
<OL>
<LI><B>first</B>: is a vector that stores <I>cn</I> integer numbers, each one representing
the first edge (index in the vector edges) in the list of
edges of each vertex.
The integer numbers are 4 bytes long. Therefore,
the vector first is stored in <I>4cn</I> bytes.
<P></P>
<LI><B>edges</B>: is a vector to represent the edges of the graph. As each edge
is composed of a pair of vertices, each entry stores two integer numbers
of 4 bytes that represent the vertices. As there are <I>n</I> edges, the
vector edges is stored in <I>8n</I> bytes.
<P></P>
<LI><B>next</B>: given a vertex <IMG ALIGN="bottom" SRC="figs/img139.png" BORDER="0" ALT="">, we can discover the edges that
contain <IMG ALIGN="bottom" SRC="figs/img139.png" BORDER="0" ALT=""> following its list of edges, which starts on
first[<IMG ALIGN="bottom" SRC="figs/img139.png" BORDER="0" ALT="">] and the next
edges are given by next[...first[<IMG ALIGN="bottom" SRC="figs/img139.png" BORDER="0" ALT="">]...]. Therefore,
the vectors first and next represent
the linked lists of edges of each vertex. As there are two vertices for each edge,
when an edge is inserted in the graph, it must be inserted in the two linked lists
of the vertices in its composition. Therefore, there are <I>2n</I> entries of integer
numbers in the vector next, so it is stored in <I>4*2n = 8n</I> bytes.
<P></P>
</OL>
<LI>Other auxiliary structures
<OL>
<LI><B>visited</B>: is a vector of <I>cn</I> bits, where each bit indicates if the g value of
a given vertex was already defined. Therefore, the vector visited is stored
in <I>cn/8</I> bytes.
<P></P>
<LI><B>function <I>g</I></B>: is represented by a vector of <I>cn</I> integer numbers.
As each integer number is 4 bytes long, the function <I>g</I> is stored in
<I>4cn</I> bytes.
</OL>
</UL>
<P>
Thus, the total memory consumption of CHM algorithm for generating a minimal
perfect hash function (MPHF) is: <I>(8.125c + 16)n + O(1)</I> bytes.
As the value of the constant <I>c</I> must be at least 2.09, we have:
</P>
<TABLE ALIGN="center" BORDER="1" CELLPADDING="4">
<TR>
<TH><I>c</I></TH>
<TH>Memory consumption to generate a MPHF</TH>
</TR>
<TR>
<TD>2.09</TD>
<TD ALIGN="center"><I>33.00n + O(1)</I></TD>
</TR>
</TABLE>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><B>Table 1:</B> Memory consumption to generate a MPHF using the CHM algorithm.</TD>
</TR>
</TABLE>
<P>
Now we present the memory consumption to store the resulting function.
We only need to store the <I>g</I> function. Thus, we need <I>4cn</I> bytes.
Again we have:
</P>
<TABLE ALIGN="center" BORDER="1" CELLPADDING="4">
<TR>
<TH><I>c</I></TH>
<TH>Memory consumption to store a MPHF</TH>
</TR>
<TR>
<TD>2.09</TD>
<TD ALIGN="center"><I>8.36n</I></TD>
</TR>
</TABLE>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><B>Table 2:</B> Memory consumption to store a MPHF generated by the CHM algorithm.</TD>
</TR>
</TABLE>
<HR NOSHADE SIZE=1>
<H2>Experimental Results</H2>
<P>
<A HREF="comparison.html">CHM x BMZ</A>
</P>
<HR NOSHADE SIZE=1>
<A NAME="papers"></A>
<H2>Papers</H2>
<OL>
<LI>Z.J. Czech, G. Havas, and B.S. Majewski. <A HREF="papers/chm92.pdf">An optimal algorithm for generating minimal perfect hash functions</A>. Information Processing Letters, 43(5):257-264, 1992.
<P></P>
<LI>Z.J. Czech, G. Havas, and B.S. Majewski. Fundamental study perfect hashing.
Theoretical Computer Science, 182:1-143, 1997.
<P></P>
<LI>B.S. Majewski, N.C. Wormald, G. Havas, and Z.J. Czech. A family of perfect hashing methods.
The Computer Journal, 39(6):547-554, 1996.
</OL>
<HR NOSHADE SIZE=1>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><A HREF="index.html">Home</A></TD>
<TD><A HREF="chd.html">CHD</A></TD>
<TD><A HREF="bdz.html">BDZ</A></TD>
<TD><A HREF="bmz.html">BMZ</A></TD>
<TD><A HREF="chm.html">CHM</A></TD>
<TD><A HREF="brz.html">BRZ</A></TD>
<TD><A HREF="fch.html">FCH</A></TD>
</TR>
</TABLE>
<HR NOSHADE SIZE=1>
<P>
Enjoy!
</P>
<P>
<A HREF="mailto:davi@users.sourceforge.net">Davi de Castro Reis</A>
</P>
<P>
<A HREF="mailto:db8192@users.sourceforge.net">Djamel Belazzougui</A>
</P>
<P>
<A HREF="mailto:fc_botelho@users.sourceforge.net">Fabiano Cupertino Botelho</A>
</P>
<P>
<A HREF="mailto:nivio@dcc.ufmg.br">Nivio Ziviani</A>
</P>
<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
try {
var pageTracker = _gat._getTracker("UA-7698683-2");
pageTracker._trackPageview();
} catch(err) {}</script>
<!-- html code generated by txt2tags 2.6 (http://txt2tags.org) -->
<!-- cmdline: txt2tags -t html -i CHM.t2t -o docs/chm.html -->
</BODY></HTML>

457
docs/comparison.html Normal file
View File

@ -0,0 +1,457 @@
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<META NAME="generator" CONTENT="http://txt2tags.org">
<LINK REL="stylesheet" TYPE="text/css" HREF="DOC.css">
<TITLE>Comparison Between BMZ And CHM Algorithms</TITLE>
</HEAD><BODY BGCOLOR="white" TEXT="black">
<CENTER>
<H1>Comparison Between BMZ And CHM Algorithms</H1>
</CENTER>
<HR NOSHADE SIZE=1>
<H2>Characteristics</H2>
<P>
Table 1 presents the main characteristics of the two algorithms.
The number of edges in the graph <IMG ALIGN="middle" SRC="figs/img27.png" BORDER="0" ALT=""> is <IMG ALIGN="middle" SRC="figs/img236.png" BORDER="0" ALT="">,
the number of keys in the input set <IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT="">.
The number of vertices of <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT=""> is equal
to <IMG ALIGN="bottom" SRC="figs/img12.png" BORDER="0" ALT=""> and <IMG ALIGN="bottom" SRC="figs/img237.png" BORDER="0" ALT=""> for the BMZ algorithm and the CHM algorithm, respectively.
This measure is related to the amount of space needed to store the array <IMG ALIGN="middle" SRC="figs/img37.png" BORDER="0" ALT="">.
This reduces the space required to store a function generated by the BMZ algorithm to <IMG ALIGN="middle" SRC="figs/img238.png" BORDER="0" ALT=""> of the space required by the CHM algorithm.
The number of critical edges is <IMG ALIGN="middle" SRC="figs/img76.png" BORDER="0" ALT=""> and 0 for the BMZ algorithm and the CHM algorithm,
respectively.
The BMZ algorithm generates random graphs that necessarily contain cycles, while the
CHM algorithm generates acyclic random graphs.
Finally, the CHM algorithm generates <A HREF="concepts.html">order preserving functions</A>
while the BMZ algorithm does not preserve order.
</P>
<TABLE CELLPADDING=3 BORDER="1" ALIGN="center">
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">
Characteristics </SMALL></TD>
<TD ALIGN="CENTER" COLSPAN=2><SMALL CLASS="FOOTNOTESIZE"> <SPAN>Algorithms</SPAN></SMALL></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">
</SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> BMZ </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> CHM </SMALL></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">
<SPAN CLASS="MATH"><IMG
WIDTH="11" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/img1.png"
ALT="$c$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 1.15 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 2.09 </SMALL></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">
<SPAN CLASS="MATH"><IMG
WIDTH="50" HEIGHT="32" ALIGN="MIDDLE" BORDER="0"
SRC="figs/img239.png"
ALT="$\vert E(G)\vert$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="14" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/img8.png"
ALT="$n$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="14" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/img8.png"
ALT="$n$"></SPAN> </SMALL></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">
<SPAN CLASS="MATH"><IMG
WIDTH="89" HEIGHT="32" ALIGN="MIDDLE" BORDER="0"
SRC="figs/img240.png"
ALT="$\vert V(G)\vert=\vert g\vert$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="20" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/img241.png"
ALT="$cn$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="20" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/img241.png"
ALT="$cn$"></SPAN> </SMALL></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">
<!-- MATH
$|E(G_{\rm crit})|$
-->
<SPAN CLASS="MATH"><IMG
WIDTH="70" HEIGHT="32" ALIGN="MIDDLE" BORDER="0"
SRC="figs/img111.png"
ALT="$\vert E(G_{\rm crit})\vert$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="71" HEIGHT="32" ALIGN="MIDDLE" BORDER="0"
SRC="figs/img242.png"
ALT="$0.5\vert E(G)\vert$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 0</SMALL></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">
<SPAN CLASS="MATH"><IMG
WIDTH="17" HEIGHT="14" ALIGN="BOTTOM" BORDER="0"
SRC="figs/img32.png"
ALT="$G$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> cyclic </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> acyclic </SMALL></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">
Order preserving </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> no </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> yes </SMALL></TD>
</TR>
</TABLE>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><B>Table 1:</B> Main characteristics of the algorithms.</TD>
</TR>
</TABLE>
<HR NOSHADE SIZE=1>
<H2>Memory Consumption</H2>
<UL>
<LI>Memory consumption to generate the minimal perfect hash function (MPHF):
</UL>
<TABLE ALIGN="center" BORDER="1" CELLPADDING="4">
<TR>
<TH>Algorithm</TH>
<TH><I>c</I></TH>
<TH>Memory consumption to generate a MPHF</TH>
</TR>
<TR>
<TD ALIGN="center">BMZ</TD>
<TD>0.93</TD>
<TD ALIGN="center"><I>24.80n + O(1)</I></TD>
</TR>
<TR>
<TD ALIGN="center">BMZ</TD>
<TD>1.15</TD>
<TD ALIGN="center"><I>26.42n + O(1)</I></TD>
</TR>
<TR>
<TD ALIGN="center">CHM</TD>
<TD>2.09</TD>
<TD ALIGN="center"><I>33.00n + O(1)</I></TD>
</TR>
</TABLE>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><B>Table 2:</B> Memory consumption to generate a MPHF using the algorithms BMZ and CHM.</TD>
</TR>
</TABLE>
<UL>
<LI>Memory consumption to store the resulting minimal perfect hash function (MPHF):
</UL>
<TABLE ALIGN="center" BORDER="1" CELLPADDING="4">
<TR>
<TH>Algorithm</TH>
<TH><I>c</I></TH>
<TH>Memory consumption to store a MPHF</TH>
</TR>
<TR>
<TD ALIGN="center">BMZ</TD>
<TD>0.93</TD>
<TD ALIGN="center"><I>3.72n</I></TD>
</TR>
<TR>
<TD ALIGN="center">BMZ</TD>
<TD>1.15</TD>
<TD ALIGN="center"><I>4.60n</I></TD>
</TR>
<TR>
<TD ALIGN="center">CHM</TD>
<TD>2.09</TD>
<TD ALIGN="center"><I>8.36n</I></TD>
</TR>
</TABLE>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><B>Table 3:</B> Memory consumption to store a MPHF generated by the algorithms BMZ and CHM.</TD>
</TR>
</TABLE>
<HR NOSHADE SIZE=1>
<H2>Run times</H2>
<P>
We now present some experimental results to compare the BMZ and CHM algorithms.
The data consists of a collection of 100 million universal resource locations
(URLs) collected from the Web.
The average length of a URL in the collection is 63 bytes.
All experiments were carried out on
a computer running the Linux operating system, version 2.6.7,
with a 2.4 gigahertz processor and
4 gigabytes of main memory.
</P>
<P>
Table 4 presents time measurements.
All times are in seconds.
The table entries represent averages over 50 trials.
The column labelled as <IMG ALIGN="middle" SRC="figs/img243.png" BORDER="0" ALT=""> represents
the number of iterations to generate the random graph <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT=""> in the
mapping step of the algorithms.
The next columns represent the run times
for the mapping plus ordering steps together and the searching
step for each algorithm.
The last column represents the percent gain of our algorithm
over the CHM algorithm.
</P>
<TABLE CELLPADDING=3 BORDER="1" ALIGN="center">
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="14" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/img8.png"
ALT="$n$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER" COLSPAN=4><SMALL CLASS="FOOTNOTESIZE"> <SPAN> BMZ </SPAN> </SMALL></TD>
<TD ALIGN="CENTER" COLSPAN=4><SMALL CLASS="FOOTNOTESIZE">
<SPAN>CHM algorithm</SPAN></SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> Gain</SMALL></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">
</SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="22" HEIGHT="30" ALIGN="MIDDLE" BORDER="0"
SRC="figs/img243.png"
ALT="$N_i$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">Map+Ord </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">
Search </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">Total </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">
<SPAN CLASS="MATH"><IMG
WIDTH="22" HEIGHT="30" ALIGN="MIDDLE" BORDER="0"
SRC="figs/img243.png"
ALT="$N_i$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">Map+Ord </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">Search </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">
Total </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> (%)</SMALL></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 1,562,500 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 2.28 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 8.54 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 2.37 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 10.91 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 2.70 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 14.56 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 1.57 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 16.13 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 48 </SMALL></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 3,125,000 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 2.16 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 15.92 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 4.88 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 20.80 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 2.85 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 30.36 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 3.20 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 33.56 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 61 </SMALL></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 6,250,000 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 2.20 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 33.09 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 10.48 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 43.57 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 2.90 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 62.26 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 6.76 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 69.02 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 58 </SMALL></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 12,500,000 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 2.00 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 63.26 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 23.04 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 86.30 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 2.60 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 117.99 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 14.94 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 132.92 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 54 </SMALL></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 25,000,000 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 2.00 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 130.79 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 51.55 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 182.34 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 2.80 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 262.05 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 33.68 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 295.73 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 62 </SMALL></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 50,000,000 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 2.07 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 273.75 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 114.12 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 387.87 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 2.90 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 577.59 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 73.97 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 651.56 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 68 </SMALL></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 100,000,000 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 2.07 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 567.47 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 243.13 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 810.60 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 2.80 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 1,131.06 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 157.23 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 1,288.29 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 59 </SMALL></TD>
</TR>
</TABLE>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><B>Table 4:</B> Time measurements for BMZ and the CHM algorithm.</TD>
</TR>
</TABLE>
<P>
The mapping step of the BMZ algorithm is faster because
the expected numbers of iterations in the mapping step to generate <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT=""> are
2.13 and 2.92 for the BMZ algorithm and the CHM algorithm, respectively
(see <A HREF="bmz.html#papers">[2</A>] for details).
The graph <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT=""> generated by the BMZ algorithm
has <IMG ALIGN="bottom" SRC="figs/img12.png" BORDER="0" ALT=""> vertices, against <IMG ALIGN="bottom" SRC="figs/img237.png" BORDER="0" ALT=""> for the CHM algorithm.
These two facts make the BMZ algorithm faster in the mapping step.
The time for the ordering step of the BMZ algorithm is approximately equal to
the time to check whether <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT=""> is acyclic in the CHM algorithm.
The searching step of the CHM algorithm is faster but, on average, the BMZ
algorithm is approximately 59% faster than the CHM algorithm in total time.
It is important to notice the times for the searching step:
for both algorithms they are not the dominant times,
and the experimental results clearly show
a linear behavior for the searching step.
</P>
<P>
We now present run times for the BMZ algorithm using a <A HREF="bmz.html#heuristic">heuristic</A> that
reduces the space requirement
to any given value between <IMG ALIGN="bottom" SRC="figs/img12.png" BORDER="0" ALT=""> words and <IMG ALIGN="bottom" SRC="figs/img13.png" BORDER="0" ALT=""> words.
For example, for <IMG ALIGN="bottom" SRC="figs/img244.png" BORDER="0" ALT=""> and <IMG ALIGN="bottom" SRC="figs/img6.png" BORDER="0" ALT="">, the analytical expected numbers
of iterations are <IMG ALIGN="bottom" SRC="figs/img245.png" BORDER="0" ALT=""> and <IMG ALIGN="bottom" SRC="figs/img246.png" BORDER="0" ALT="">, respectively
(for <IMG ALIGN="middle" SRC="figs/img247.png" BORDER="0" ALT="">, the numbers of iterations are 2.78 for <IMG ALIGN="bottom" SRC="figs/img244.png" BORDER="0" ALT=""> and 3.04
for <IMG ALIGN="bottom" SRC="figs/img6.png" BORDER="0" ALT="">).
Table 5 presents the total times to construct a
function for <IMG ALIGN="middle" SRC="figs/img247.png" BORDER="0" ALT="">, with an increase from <IMG ALIGN="bottom" SRC="figs/img237.png" BORDER="0" ALT=""> seconds
for <IMG ALIGN="bottom" SRC="figs/img128.png" BORDER="0" ALT=""> (see Table 4) to <IMG ALIGN="bottom" SRC="figs/img249.png" BORDER="0" ALT=""> seconds for <IMG ALIGN="bottom" SRC="figs/img244.png" BORDER="0" ALT=""> and
to <IMG ALIGN="bottom" SRC="figs/img250.png" BORDER="0" ALT=""> seconds for <IMG ALIGN="bottom" SRC="figs/img6.png" BORDER="0" ALT="">.
</P>
<TABLE CELLPADDING=3 BORDER="1" ALIGN="center">
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="14" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/img8.png"
ALT="$n$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER" COLSPAN=4><SMALL CLASS="FOOTNOTESIZE"> <SPAN> BMZ <SPAN CLASS="MATH"><IMG
WIDTH="60" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/img5.png"
ALT="$c=1.00$"></SPAN></SPAN> </SMALL></TD>
<TD ALIGN="CENTER" COLSPAN=4><SMALL CLASS="FOOTNOTESIZE">
<SPAN> BMZ <SPAN CLASS="MATH"><IMG
WIDTH="60" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
SRC="figs/img6.png"
ALT="$c=0.93$"></SPAN></SPAN> </SMALL></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">
</SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> <SPAN CLASS="MATH"><IMG
WIDTH="22" HEIGHT="30" ALIGN="MIDDLE" BORDER="0"
SRC="figs/img243.png"
ALT="$N_i$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">Map+Ord </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">
Search </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">Total </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">
<SPAN CLASS="MATH"><IMG
WIDTH="22" HEIGHT="30" ALIGN="MIDDLE" BORDER="0"
SRC="figs/img243.png"
ALT="$N_i$"></SPAN> </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">Map+Ord </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">Search </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE">
Total </SMALL></TD>
</TR>
<TR><TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 12,500,000 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 2.78 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 76.68 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 25.06 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 101.74 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 3.04 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 76.39 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 25.80 </SMALL></TD>
<TD ALIGN="CENTER"><SMALL CLASS="FOOTNOTESIZE"> 102.19 </SMALL></TD>
</TR>
</TABLE>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><B>Table 5:</B> Time measurements for BMZ tuned algorithm with <IMG ALIGN="bottom" SRC="figs/img5.png" BORDER="0" ALT=""> and <IMG ALIGN="bottom" SRC="figs/img6.png" BORDER="0" ALT="">.</TD>
</TR>
</TABLE>
<HR NOSHADE SIZE=1>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><A HREF="index.html">Home</A></TD>
<TD><A HREF="chd.html">CHD</A></TD>
<TD><A HREF="bdz.html">BDZ</A></TD>
<TD><A HREF="bmz.html">BMZ</A></TD>
<TD><A HREF="chm.html">CHM</A></TD>
<TD><A HREF="brz.html">BRZ</A></TD>
<TD><A HREF="fch.html">FCH</A></TD>
</TR>
</TABLE>
<HR NOSHADE SIZE=1>
<P>
Enjoy!
</P>
<P>
<A HREF="mailto:davi@users.sourceforge.net">Davi de Castro Reis</A>
</P>
<P>
<A HREF="mailto:db8192@users.sourceforge.net">Djamel Belazzougui</A>
</P>
<P>
<A HREF="mailto:fc_botelho@users.sourceforge.net">Fabiano Cupertino Botelho</A>
</P>
<P>
<A HREF="mailto:nivio@dcc.ufmg.br">Nivio Ziviani</A>
</P>
<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
try {
var pageTracker = _gat._getTracker("UA-7698683-2");
pageTracker._trackPageview();
} catch(err) {}</script>
<!-- html code generated by txt2tags 2.6 (http://txt2tags.org) -->
<!-- cmdline: txt2tags -t html -i COMPARISON.t2t -o docs/comparison.html -->
</BODY></HTML>

114
docs/concepts.html Normal file
View File

@ -0,0 +1,114 @@
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<META NAME="generator" CONTENT="http://txt2tags.org">
<LINK REL="stylesheet" TYPE="text/css" HREF="DOC.css">
<TITLE>Minimal Perfect Hash Functions - Introduction</TITLE>
</HEAD><BODY BGCOLOR="white" TEXT="black">
<CENTER>
<H1>Minimal Perfect Hash Functions - Introduction</H1>
</CENTER>
<HR NOSHADE SIZE=1>
<H2>Basic Concepts</H2>
<P>
Suppose <IMG ALIGN="bottom" SRC="figs/img14.png" BORDER="0" ALT=""> is a universe of <I>keys</I>.
Let <IMG ALIGN="bottom" SRC="figs/img15.png" BORDER="0" ALT=""> be a <I>hash function</I> that maps the keys from <IMG ALIGN="bottom" SRC="figs/img14.png" BORDER="0" ALT=""> to a given interval of integers <IMG ALIGN="middle" SRC="figs/img16.png" BORDER="0" ALT="">.
Let <IMG ALIGN="middle" SRC="figs/img17.png" BORDER="0" ALT=""> be a set of <IMG ALIGN="bottom" SRC="figs/img8.png" BORDER="0" ALT=""> keys from <IMG ALIGN="bottom" SRC="figs/img14.png" BORDER="0" ALT="">.
Given a key <IMG ALIGN="middle" SRC="figs/img18.png" BORDER="0" ALT="">, the hash function <IMG ALIGN="bottom" SRC="figs/img7.png" BORDER="0" ALT=""> computes an
integer in <IMG ALIGN="middle" SRC="figs/img19.png" BORDER="0" ALT=""> for the storage or retrieval of <IMG ALIGN="bottom" SRC="figs/img11.png" BORDER="0" ALT=""> in
a <I>hash table</I>.
Hashing methods for <I>non-static sets</I> of keys can be used to construct
data structures storing <IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT=""> and supporting membership queries
"<IMG ALIGN="middle" SRC="figs/img18.png" BORDER="0" ALT="">?" in expected time <IMG ALIGN="middle" SRC="figs/img21.png" BORDER="0" ALT="">.
However, they involve a certain amount of wasted space owing to unused
locations in the table and wasted time to resolve collisions when
two keys are hashed to the same table location.
</P>
<P>
For <I>static sets</I> of keys it is possible to compute a function
to find any key in a table in one probe; such hash functions are called
<I>perfect</I>.
More precisely, given a set of keys <IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT="">, we shall say that a
hash function <IMG ALIGN="bottom" SRC="figs/img15.png" BORDER="0" ALT=""> is a <I>perfect hash function</I>
for <IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT=""> if <IMG ALIGN="bottom" SRC="figs/img7.png" BORDER="0" ALT=""> is an injection on <IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT="">,
that is, there are no <I>collisions</I> among the keys in <IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT="">:
if <IMG ALIGN="bottom" SRC="figs/img11.png" BORDER="0" ALT=""> and <IMG ALIGN="middle" SRC="figs/img22.png" BORDER="0" ALT=""> are in <IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT=""> and <IMG ALIGN="middle" SRC="figs/img23.png" BORDER="0" ALT="">,
then <IMG ALIGN="middle" SRC="figs/img24.png" BORDER="0" ALT="">.
Figure 1(a) illustrates a perfect hash function.
Since no collisions occur, each key can be retrieved from the table
with a single probe.
If <IMG ALIGN="bottom" SRC="figs/img25.png" BORDER="0" ALT="">, that is, the table has the same size as <IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT="">,
then we say that <IMG ALIGN="bottom" SRC="figs/img7.png" BORDER="0" ALT=""> is a <I>minimal perfect hash function</I>
for <IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT="">.
Figure 1(b) illustrates a minimal perfect hash function.
Minimal perfect hash functions totally avoid the problem of wasted
space and time. A perfect hash function <IMG ALIGN="bottom" SRC="figs/img7.png" BORDER="0" ALT=""> is <I>order preserving</I>
if the keys in <IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT=""> are arranged in some given order
and <IMG ALIGN="bottom" SRC="figs/img7.png" BORDER="0" ALT=""> preserves this order in the hash table.
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD ALIGN="right"><center><IMG ALIGN="middle" SRC="figs/img26.png" BORDER="0" ALT=""></center></TD>
</TR>
<TR>
<TD><B>Figure 1:</B> (a) Perfect hash function. (b) Minimal perfect hash function.</TD>
</TR>
</TABLE>
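<P>
As a small illustration of these definitions, the hypothetical C helper below (not part of
the cmph library) checks that a hash function h is perfect on a set of n keys over a table
of size m by testing injectivity; the function is additionally minimal when m equals n:
</P>
<PRE>
#include &lt;stdlib.h&gt;

/* Returns 1 if h maps the n keys into [0,m-1] without collisions. */
static int is_perfect(unsigned (*h)(const char *), const char **keys,
                      unsigned n, unsigned m)
{
    char *used = calloc(m, 1);            /* marks occupied table positions */
    unsigned i;
    int ok = 1;
    if (used == NULL) return 0;
    for (i = 0; i &lt; n; i++) {
        unsigned pos = h(keys[i]);
        if (pos &gt;= m || used[pos]) { ok = 0; break; }  /* out of range or collision */
        used[pos] = 1;
    }
    free(used);
    return ok;                            /* minimal additionally requires m == n */
}
</PRE>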
<P>
Minimal perfect hash functions are widely used for memory efficient
storage and fast retrieval of items from static sets, such as words in natural
languages, reserved words in programming languages or interactive systems,
universal resource locations (URLs) in Web search engines, or item sets in
data mining techniques.
</P>
<HR NOSHADE SIZE=1>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><A HREF="index.html">Home</A></TD>
<TD><A HREF="chd.html">CHD</A></TD>
<TD><A HREF="bdz.html">BDZ</A></TD>
<TD><A HREF="bmz.html">BMZ</A></TD>
<TD><A HREF="chm.html">CHM</A></TD>
<TD><A HREF="brz.html">BRZ</A></TD>
<TD><A HREF="fch.html">FCH</A></TD>
</TR>
</TABLE>
<HR NOSHADE SIZE=1>
<P>
Enjoy!
</P>
<P>
<A HREF="mailto:davi@users.sourceforge.net">Davi de Castro Reis</A>
</P>
<P>
<A HREF="mailto:db8192@users.sourceforge.net">Djamel Belazzougui</A>
</P>
<P>
<A HREF="mailto:fc_botelho@users.sourceforge.net">Fabiano Cupertino Botelho</A>
</P>
<P>
<A HREF="mailto:nivio@dcc.ufmg.br">Nivio Ziviani</A>
</P>
<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
try {
var pageTracker = _gat._getTracker("UA-7698683-2");
pageTracker._trackPageview();
} catch(err) {}</script>
<!-- html code generated by txt2tags 2.6 (http://txt2tags.org) -->
<!-- cmdline: txt2tags -t html -i CONCEPTS.t2t -o docs/concepts.html -->
</BODY></HTML>

210
docs/examples.html Normal file
View File

@ -0,0 +1,210 @@
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<META NAME="generator" CONTENT="http://txt2tags.org">
<LINK REL="stylesheet" TYPE="text/css" HREF="DOC.css">
<TITLE>CMPH - Examples</TITLE>
</HEAD><BODY BGCOLOR="white" TEXT="black">
<CENTER>
<H1>CMPH - Examples</H1>
</CENTER>
<P>
Using cmph is quite simple. Take a look at the following examples.
</P>
<HR NOSHADE SIZE=1>
<PRE>
#include &lt;cmph.h&gt;
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;
// Create minimal perfect hash function from in-memory vector
int main(int argc, char **argv)
{
// Creating a filled vector
unsigned int i = 0;
const char *vector[] = {"aaaaaaaaaa", "bbbbbbbbbb", "cccccccccc", "dddddddddd", "eeeeeeeeee",
"ffffffffff", "gggggggggg", "hhhhhhhhhh", "iiiiiiiiii", "jjjjjjjjjj"};
unsigned int nkeys = 10;
FILE* mphf_fd = fopen("temp.mph", "w");
// Source of keys
cmph_io_adapter_t *source = cmph_io_vector_adapter((char **)vector, nkeys);
//Create minimal perfect hash function using the brz algorithm.
cmph_config_t *config = cmph_config_new(source);
cmph_config_set_algo(config, CMPH_BRZ);
cmph_config_set_mphf_fd(config, mphf_fd);
cmph_t *hash = cmph_new(config);
cmph_config_destroy(config);
cmph_dump(hash, mphf_fd);
cmph_destroy(hash);
fclose(mphf_fd);
//Find key
mphf_fd = fopen("temp.mph", "r");
hash = cmph_load(mphf_fd);
while (i &lt; nkeys) {
const char *key = vector[i];
unsigned int id = cmph_search(hash, key, (cmph_uint32)strlen(key));
fprintf(stderr, "key:%s -- hash:%u\n", key, id);
i++;
}
//Destroy hash
cmph_destroy(hash);
cmph_io_vector_adapter_destroy(source);
fclose(mphf_fd);
return 0;
}
</PRE>
<P>
Download <A HREF="examples/vector_adapter_ex1.c">vector_adapter_ex1.c</A>. This example does not work in versions below 0.6.
</P>
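<P>
Assuming the library and its headers are installed in a standard location (paths and flags
may differ on your system), the example can typically be compiled and run as follows:
</P>
<PRE>
$ gcc vector_adapter_ex1.c -o vector_adapter_ex1 -lcmph
$ ./vector_adapter_ex1
</PRE>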
<HR NOSHADE SIZE=1>
<PRE>
#include &lt;cmph.h&gt;
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;
// Create minimal perfect hash function from in-memory vector
#pragma pack(1)
typedef struct {
cmph_uint32 id;
char key[11];
cmph_uint32 year;
} rec_t;
#pragma pack(0)
int main(int argc, char **argv)
{
// Creating a filled vector
unsigned int i = 0;
rec_t vector[10] = {{1, "aaaaaaaaaa", 1999}, {2, "bbbbbbbbbb", 2000}, {3, "cccccccccc", 2001},
{4, "dddddddddd", 2002}, {5, "eeeeeeeeee", 2003}, {6, "ffffffffff", 2004},
{7, "gggggggggg", 2005}, {8, "hhhhhhhhhh", 2006}, {9, "iiiiiiiiii", 2007},
{10,"jjjjjjjjjj", 2008}};
unsigned int nkeys = 10;
FILE* mphf_fd = fopen("temp_struct_vector.mph", "w");
// Source of keys
cmph_io_adapter_t *source = cmph_io_struct_vector_adapter(vector, (cmph_uint32)sizeof(rec_t), (cmph_uint32)sizeof(cmph_uint32), 11, nkeys);
//Create minimal perfect hash function using the BDZ algorithm.
cmph_config_t *config = cmph_config_new(source);
cmph_config_set_algo(config, CMPH_BDZ);
cmph_config_set_mphf_fd(config, mphf_fd);
cmph_t *hash = cmph_new(config);
cmph_config_destroy(config);
cmph_dump(hash, mphf_fd);
cmph_destroy(hash);
fclose(mphf_fd);
//Find key
mphf_fd = fopen("temp_struct_vector.mph", "r");
hash = cmph_load(mphf_fd);
while (i &lt; nkeys) {
const char *key = vector[i].key;
unsigned int id = cmph_search(hash, key, 11);
fprintf(stderr, "key:%s -- hash:%u\n", key, id);
i++;
}
//Destroy hash
cmph_destroy(hash);
cmph_io_vector_adapter_destroy(source);
fclose(mphf_fd);
return 0;
}
</PRE>
<P>
Download <A HREF="examples/struct_vector_adapter_ex3.c">struct_vector_adapter_ex3.c</A>. This example does not work in versions below 0.8.
</P>
<HR NOSHADE SIZE=1>
<PRE>
#include &lt;cmph.h&gt;
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;
// Create minimal perfect hash function from in-disk keys using BDZ algorithm
int main(int argc, char **argv)
{
//Open file with newline separated list of keys
FILE * keys_fd = fopen("keys.txt", "r");
cmph_t *hash = NULL;
if (keys_fd == NULL)
{
fprintf(stderr, "File \"keys.txt\" not found\n");
exit(1);
}
// Source of keys
cmph_io_adapter_t *source = cmph_io_nlfile_adapter(keys_fd);
cmph_config_t *config = cmph_config_new(source);
cmph_config_set_algo(config, CMPH_BDZ);
hash = cmph_new(config);
cmph_config_destroy(config);
//Find key
const char *key = "jjjjjjjjjj";
unsigned int id = cmph_search(hash, key, (cmph_uint32)strlen(key));
fprintf(stderr, "Id:%u\n", id);
//Destroy hash
cmph_destroy(hash);
cmph_io_nlfile_adapter_destroy(source);
fclose(keys_fd);
return 0;
}
</PRE>
<P>
Download <A HREF="examples/file_adapter_ex2.c">file_adapter_ex2.c</A> and <A HREF="examples/keys.txt">keys.txt</A>. This example does not work in versions below 0.8.
</P>
<HR NOSHADE SIZE=1>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><A HREF="index.html">Home</A></TD>
<TD><A HREF="chd.html">CHD</A></TD>
<TD><A HREF="bdz.html">BDZ</A></TD>
<TD><A HREF="bmz.html">BMZ</A></TD>
<TD><A HREF="chm.html">CHM</A></TD>
<TD><A HREF="brz.html">BRZ</A></TD>
<TD><A HREF="fch.html">FCH</A></TD>
</TR>
</TABLE>
<HR NOSHADE SIZE=1>
<P>
Enjoy!
</P>
<P>
<A HREF="mailto:davi@users.sourceforge.net">Davi de Castro Reis</A>
</P>
<P>
<A HREF="mailto:db8192@users.sourceforge.net">Djamel Belazzougui</A>
</P>
<P>
<A HREF="mailto:fc_botelho@users.sourceforge.net">Fabiano Cupertino Botelho</A>
</P>
<P>
<A HREF="mailto:nivio@dcc.ufmg.br">Nivio Ziviani</A>
</P>
<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
try {
var pageTracker = _gat._getTracker("UA-7698683-2");
pageTracker._trackPageview();
} catch(err) {}</script>
<!-- html code generated by txt2tags 2.6 (http://txt2tags.org) -->
<!-- cmdline: txt2tags -t html -i EXAMPLES.t2t -o docs/examples.html -->
</BODY></HTML>

111
docs/faq.html Normal file
View File

@ -0,0 +1,111 @@
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<META NAME="generator" CONTENT="http://txt2tags.org">
<LINK REL="stylesheet" TYPE="text/css" HREF="DOC.css">
<TITLE>CMPH FAQ</TITLE>
</HEAD><BODY BGCOLOR="white" TEXT="black">
<CENTER>
<H1>CMPH FAQ</H1>
</CENTER>
<UL>
<LI>How do I define the ids of the keys?
</UL>
<BLOCKQUOTE>
- You don't. The ids will be assigned by the algorithm creating the minimal
perfect hash function. If the algorithm creates an <B>ordered</B> minimal
perfect hash function, the ids will be the indices of the keys in the
input. Otherwise, you have no guarantee of the distribution of the ids.
</BLOCKQUOTE>
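<BLOCKQUOTE>
- If you need the ids to follow the order of the keys in the input, select an order
preserving algorithm such as CHM, for example (illustrative fragment):
</BLOCKQUOTE>
<PRE>
cmph_config_set_algo(config, CMPH_CHM); /* order preserving: the id of a key is its index in the input */
</PRE>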
<UL>
<LI>Why do I always get the error "Unable to create minimum perfect hashing function"?
</UL>
<BLOCKQUOTE>
- The algorithms do not guarantee that a minimal perfect hash function can
be created. In practice, it will always work if your input
is big enough (&gt;100 keys).
The error most likely occurs because there are duplicated
keys in the input. You must guarantee that the keys are unique in the
input. If you are using a UN*X based OS, try running
</BLOCKQUOTE>
<PRE>
#sort input.txt | uniq &gt; input_uniq.txt
</PRE>
<BLOCKQUOTE>
and run cmph with input_uniq.txt
</BLOCKQUOTE>
<UL>
<LI>Why is the default hash function (jenkins) still used even after I change the hash function
with the cmph_config_set_hashfuncs function?
</UL>
<BLOCKQUOTE>
- You are probably calling the cmph_config_set_algo function after
cmph_config_set_hashfuncs. The default hash function
is reset when you call the cmph_config_set_algo function.
</BLOCKQUOTE>
<UL>
<LI>What should I do when I get the following error?
</UL>
<BLOCKQUOTE>
- Error: <B>error while loading shared libraries: libcmph.so.0: cannot open shared object file: No such file or directory</B>
</BLOCKQUOTE>
<BLOCKQUOTE>
- Solution: type <B>export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib/</B> at the shell or put that shell command
in your .profile file or in the /etc/profile file.
</BLOCKQUOTE>
<HR NOSHADE SIZE=1>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><A HREF="index.html">Home</A></TD>
<TD><A HREF="chd.html">CHD</A></TD>
<TD><A HREF="bdz.html">BDZ</A></TD>
<TD><A HREF="bmz.html">BMZ</A></TD>
<TD><A HREF="chm.html">CHM</A></TD>
<TD><A HREF="brz.html">BRZ</A></TD>
<TD><A HREF="fch.html">FCH</A></TD>
</TR>
</TABLE>
<HR NOSHADE SIZE=1>
<P>
Enjoy!
</P>
<P>
<A HREF="mailto:davi@users.sourceforge.net">Davi de Castro Reis</A>
</P>
<P>
<A HREF="mailto:db8192@users.sourceforge.net">Djamel Belazzougui</A>
</P>
<P>
<A HREF="mailto:fc_botelho@users.sourceforge.net">Fabiano Cupertino Botelho</A>
</P>
<P>
<A HREF="mailto:nivio@dcc.ufmg.br">Nivio Ziviani</A>
</P>
<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
try {
var pageTracker = _gat._getTracker("UA-7698683-2");
pageTracker._trackPageview();
} catch(err) {}</script>
<!-- html code generated by txt2tags 2.6 (http://txt2tags.org) -->
<!-- cmdline: txt2tags -t html -i FAQ.t2t -o docs/faq.html -->
</BODY></HTML>

110
docs/fch.html Normal file
View File

@ -0,0 +1,110 @@
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<META NAME="generator" CONTENT="http://txt2tags.org">
<LINK REL="stylesheet" TYPE="text/css" HREF="DOC.css">
<TITLE>FCH Algorithm</TITLE>
</HEAD><BODY BGCOLOR="white" TEXT="black">
<CENTER>
<H1>FCH Algorithm</H1>
</CENTER>
<HR NOSHADE SIZE=1>
<H2>The Algorithm</H2>
<P>
The algorithm is presented in <A HREF="#papers">[1</A>].
</P>
<HR NOSHADE SIZE=1>
<H2>Memory Consumption</H2>
<P>
Now we detail the memory consumption to generate and to store minimal perfect hash functions
using the FCH algorithm. The structures responsible for the memory consumption are the
following (a rough C sketch of these structures is shown right after the list):
</P>
<UL>
<LI>A vector containing all the <I>n</I> keys.
<LI>Data structure to speed up the searching step:
<OL>
<LI><B>random_table</B>: is a vector used to remember currently empty slots in the hash table. It stores <I>n</I> 4 byte long integer numbers. This vector initially contains a random permutation of the <I>n</I> hash addresses. A pointer called filled_count is used to keep the invariant that any slots to the right side of filled_count (inclusive) are empty and any ones to the left are filled.
<LI><B>hash_table</B>: Table used to check whether all the collisions were resolved. It has <I>n</I> entries of one byte.
<LI><B>map_table</B>: For any unfilled slot <I>x</I> in hash_table, the map_table vector contains <I>n</I> 4 byte long pointers pointing at random_table such that random_table[map_table[x]] = x. Thus, given an empty slot x in the hash_table, we can locate its position in the random_table vector through map_table.
<P></P>
</OL>
<LI>Other auxiliary structures
<OL>
<LI><B>sorted_indexes</B>: is a vector of <I>cn/(log(n) + 1)</I> 4 byte long pointers to indirectly keep the buckets sorted by decreasing order of their sizes.
<P></P>
<LI><B>function <I>g</I></B>: is represented by a vector of <I>cn/(log(n) + 1)</I> 4 byte long integer numbers, one for each bucket. It is used to spread all the keys in a given bucket into the hash table without collisions.
</OL>
</UL>
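<P>
A rough C sketch of these construction-time structures is shown below; the identifiers are
illustrative only and do not match the names used in the cmph sources.
</P>
<PRE>
#include &lt;stdint.h&gt;

/* Illustrative sketch of the FCH construction state, for n keys and
 * nbuckets = cn/(log(n) + 1) buckets, with c &gt;= 2.6. */
typedef struct {
    uint32_t *random_table;   /* n entries: random permutation of the n hash addresses  */
    uint32_t  filled_count;   /* slots of random_table at or after this index are empty */
    uint8_t  *hash_table;     /* n bytes: marks which table slots are already filled    */
    uint32_t *map_table;      /* n entries: position of each empty slot in random_table */
    uint32_t *sorted_indexes; /* nbuckets entries: buckets by decreasing size           */
    uint32_t *g;              /* nbuckets entries: displacement value of each bucket    */
} fch_construction_t;
</PRE>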
<P>
Thus, the total memory consumption of the FCH algorithm for generating a minimal
perfect hash function (MPHF) is <I>O(n) + 9n + 8cn/(log(n) + 1)</I> bytes.
The value of the parameter <I>c</I> must be greater than or equal to 2.6.
</P>
<P>
Now we present the memory consumption to store the resulting function.
We only need to store the <I>g</I> function and a constant number of bytes for the seed of the hash functions used in the resulting MPHF. Thus, we need <I>cn/(log(n) + 1) + O(1)</I> bytes.
</P>
<HR NOSHADE SIZE=1>
<A NAME="papers"></A>
<H2>Papers</H2>
<OL>
<LI>E.A. Fox, Q.F. Chen, and L.S. Heath. <A HREF="papers/fch92.pdf">A faster algorithm for constructing minimal perfect hash functions.</A> In Proc. 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 266-273, 1992.
</OL>
<HR NOSHADE SIZE=1>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><A HREF="index.html">Home</A></TD>
<TD><A HREF="chd.html">CHD</A></TD>
<TD><A HREF="bdz.html">BDZ</A></TD>
<TD><A HREF="bmz.html">BMZ</A></TD>
<TD><A HREF="chm.html">CHM</A></TD>
<TD><A HREF="brz.html">BRZ</A></TD>
<TD><A HREF="fch.html">FCH</A></TD>
</TR>
</TABLE>
<HR NOSHADE SIZE=1>
<P>
Enjoy!
</P>
<P>
<A HREF="mailto:davi@users.sourceforge.net">Davi de Castro Reis</A>
</P>
<P>
<A HREF="mailto:db8192@users.sourceforge.net">Djamel Belazzougui</A>
</P>
<P>
<A HREF="mailto:fc_botelho@users.sourceforge.net">Fabiano Cupertino Botelho</A>
</P>
<P>
<A HREF="mailto:nivio@dcc.ufmg.br">Nivio Ziviani</A>
</P>
<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
try {
var pageTracker = _gat._getTracker("UA-7698683-2");
pageTracker._trackPageview();
} catch(err) {}</script>
<!-- html code generated by txt2tags 2.6 (http://txt2tags.org) -->
<!-- cmdline: txt2tags -t html -i FCH.t2t -o docs/fch.html -->
</BODY></HTML>

99
docs/gperf.html Normal file
View File

@ -0,0 +1,99 @@
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<META NAME="generator" CONTENT="http://txt2tags.org">
<LINK REL="stylesheet" TYPE="text/css" HREF="DOC.css">
<TITLE>GPERF versus CMPH</TITLE>
</HEAD><BODY BGCOLOR="white" TEXT="black">
<CENTER>
<H1>GPERF versus CMPH</H1>
</CENTER>
<P>
You might ask why cmph if <A HREF="http://www.gnu.org/software/gperf/gperf.html">gperf</A>
already works perfectly. Actually, gperf and cmph have different goals.
Basically, these are the requirements for each of them:
</P>
<UL>
<LI>GPERF
<P></P>
</UL>
<BLOCKQUOTE>
- Create very fast hash functions for <B>small</B> sets
</BLOCKQUOTE>
<BLOCKQUOTE>
- Create <B>perfect</B> hash functions
</BLOCKQUOTE>
<UL>
<LI>CMPH
<P></P>
</UL>
<BLOCKQUOTE>
- Create very fast hash function for <B>very large</B> sets
</BLOCKQUOTE>
<BLOCKQUOTE>
- Create <B>minimal perfect</B> hash functions
</BLOCKQUOTE>
<P>
As a result, cmph can be used to create hash functions where gperf would run
forever without finding a perfect hash function, because of the running
time of its algorithm and its large memory usage.
On the other hand, functions created by cmph are about 2x slower than those
created by gperf.
</P>
<P>
So, if you have large sets, or memory usage is a key restriction for you, stick
to cmph. If you have small sets, and do not care about memory usage, go with
gperf. The first scenario is common in the information retrieval field (e.g.
assigning ids to millions of documents), while the latter is usually found in
the compiler programming area (detecting reserved keywords).
</P>
<HR NOSHADE SIZE=1>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><A HREF="index.html">Home</A></TD>
<TD><A HREF="chd.html">CHD</A></TD>
<TD><A HREF="bdz.html">BDZ</A></TD>
<TD><A HREF="bmz.html">BMZ</A></TD>
<TD><A HREF="chm.html">CHM</A></TD>
<TD><A HREF="brz.html">BRZ</A></TD>
<TD><A HREF="fch.html">FCH</A></TD>
</TR>
</TABLE>
<HR NOSHADE SIZE=1>
<P>
Enjoy!
</P>
<P>
<A HREF="mailto:davi@users.sourceforge.net">Davi de Castro Reis</A>
</P>
<P>
<A HREF="mailto:db8192@users.sourceforge.net">Djamel Belazzougui</A>
</P>
<P>
<A HREF="mailto:fc_botelho@users.sourceforge.net">Fabiano Cupertino Botelho</A>
</P>
<P>
<A HREF="mailto:nivio@dcc.ufmg.br">Nivio Ziviani</A>
</P>
<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
try {
var pageTracker = _gat._getTracker("UA-7698683-2");
pageTracker._trackPageview();
} catch(err) {}</script>
<!-- html code generated by txt2tags 2.6 (http://txt2tags.org) -->
<!-- cmdline: txt2tags -t html -i GPERF.t2t -o docs/gperf.html -->
</BODY></HTML>

392
docs/index.html Normal file
View File

@ -0,0 +1,392 @@
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<META NAME="generator" CONTENT="http://txt2tags.org">
<LINK REL="stylesheet" TYPE="text/css" HREF="DOC.css">
<TITLE>CMPH - C Minimal Perfect Hashing Library</TITLE>
</HEAD><BODY BGCOLOR="white" TEXT="black">
<CENTER>
<H1>CMPH - C Minimal Perfect Hashing Library</H1>
</CENTER>
<HR NOSHADE SIZE=1>
<H2>Motivation</H2>
<P>
A perfect hash function maps a static set of n keys into a set of m integer numbers without collisions, where m is greater than or equal to n. If m is equal to n, the function is called minimal.
</P>
<P>
<A HREF="concepts.html">Minimal perfect hash functions</A> are widely used for memory efficient storage and fast retrieval of items from static sets, such as words in natural languages, reserved words in programming languages or interactive systems, universal resource locations (URLs) in Web search engines, or item sets in data mining techniques. Therefore, there are applications for minimal perfect hash functions in information retrieval systems, database systems, language translation systems, electronic commerce systems, compilers, operating systems, among others.
</P>
<P>
The use of minimal perfect hash functions has, until now, been restricted to scenarios where the set of keys being hashed is small, because of the limitations of current algorithms. But in many cases it is crucial to deal with huge sets of keys. So, this project gives the free software community an API that works with sets on the order of billions of keys.
</P>
<P>
Probably the most interesting application for minimal perfect hash functions is their use as an indexing structure for databases. The most popular data structure used as an indexing structure in databases is the B+ tree. In fact, the B+ tree is widely used for dynamic applications with frequent insertions and deletions of records. However, for applications with sporadic modifications and a huge number of queries the B+ tree is not the best option, because practical deployments of this structure are extremely complex, and perform poorly with very large sets of keys such as those required by the new frontier <A HREF="http://acmqueue.com/modules.php?name=Content&pa=showpage&pid=299">database applications</A>.
</P>
<P>
For example, in the information retrieval field, working with huge collections is a daily task. The simple assignment of ids to the web pages of a collection can be a challenging task. While traditional databases simply cannot handle more traffic once the working set of web page URLs no longer fits in main memory, minimal perfect hash functions can easily scale to hundreds of millions of entries, using stock hardware.
</P>
<P>
As there are many applications for minimal perfect hash functions, it is important to implement memory and time efficient algorithms for constructing such functions. The lack of similar libraries in the free software world has been the main motivation to create the C Minimal Perfect Hashing Library (<A HREF="gperf.html">gperf is a bit different</A>, since it was conceived to create very fast perfect hash functions for small sets of keys, while the CMPH Library was conceived to create minimal perfect hash functions for very large sets of keys). The C Minimal Perfect Hashing Library is a portable LGPLed library to generate and to work with very efficient minimal perfect hash functions.
</P>
<HR NOSHADE SIZE=1>
<H2>Description</H2>
<P>
The CMPH Library encapsulates the newest and most efficient algorithms in an easy-to-use, production-quality, fast API. The library was designed to work with key sets that are too big to fit in main memory. It has been used successfully for constructing minimal perfect hash functions for sets with more than 100 million keys, and we intend to expand this number to the order of billions of keys. Although there is a lack of similar libraries, we can point out some of the distinguishing features of the CMPH Library:
</P>
<UL>
<LI>Fast.
<LI>Space-efficient with main memory usage carefully documented.
<LI>The best modern algorithms are available (or at least scheduled for implementation :-)).
<LI>Works with on-disk key sets through the use of the adapter pattern.
<LI>Serialization of hash functions.
<LI>Portable C code (currently works on GNU/Linux and WIN32 and is reported to work in OpenBSD and Solaris).
<LI>Object oriented implementation.
<LI>Easily extensible.
<LI>Well encapsulated API aiming binary compatibility through releases.
<LI>Free Software.
</UL>
<HR NOSHADE SIZE=1>
<H2>Supported Algorithms</H2>
<UL>
<LI><A HREF="chd.html">CHD Algorithm</A>:
<UL>
<LI>It is the fastest algorithm to build PHFs and MPHFs in linear time.
<LI>It generates the most compact PHFs and MPHFs we know of.
<LI>It can generate PHFs with a load factor up to <I>99 %</I>.
<LI>It can be used to generate <I>t</I>-perfect hash functions. A <I>t</I>-perfect hash function allows at most <I>t</I> collisions in a given bin. It is a well-known fact that modern memories are organized as blocks which constitute the transfer unit. Examples of such blocks are cache lines for internal memory or sectors for hard disks. Thus, it can be very useful for devices that carry out I/O operations in blocks.
<LI>It is a two level scheme. It uses a first level hash function to split the key set in buckets of average size determined by a parameter <I>b</I> in the range <I>[1,32]</I>. In the second level it uses displacement values to resolve the collisions that have given rise to the buckets.
<LI>It can generate MPHFs that can be stored in approximately <I>2.07</I> bits per key.
<LI>For a load factor equal to the maximum one that is achieved by the BDZ algorithm (<I>81 %</I>), the resulting PHFs are stored in approximately <I>1.40</I> bits per key.
</UL>
<LI><A HREF="bdz.html">BDZ Algorithm</A>:
<UL>
<LI>It is very simple and efficient. It outperforms all the ones below.
<LI>It constructs both PHFs and MPHFs in linear time.
<LI>The maximum load factor one can achieve for a PHF is <I>1/1.23</I>.
<LI>It is based on acyclic random 3-graphs. A 3-graph is a generalization of a graph where each edge connects 3 vertices instead of only 2.
<LI>The resulting MPHFs are not order preserving.
<LI>The resulting MPHFs can be stored in only <I>(2 + x)cn</I> bits, where <I>c</I> should be larger than or equal to <I>1.23</I> and <I>x</I> is a constant larger than <I>0</I> (actually, x = 1/b and b is a parameter that should be larger than 2). For <I>c = 1.23</I> and <I>b = 8</I>, the resulting functions are stored in approximately 2.6 bits per key.
<LI>For its maximum load factor (<I>81 %</I>), the resulting PHFs are stored in approximately <I>1.95</I> bits per key.
</UL>
<LI><A HREF="bmz.html">BMZ Algorithm</A>:
<UL>
<LI>Constructs MPHFs in linear time.
<LI>It is based on cyclic random graphs. This makes it faster than the CHM algorithm.
<LI>The resulting MPHFs are not order preserving.
<LI>The resulting MPHFs are more compact than the ones generated by the CHM algorithm and can be stored in <I>4cn</I> bytes, where <I>c</I> is in the range <I>[0.93,1.15]</I>.
</UL>
<LI><A HREF="brz.html">BRZ Algorithm</A>:
<UL>
<LI>A very fast external memory based algorithm for constructing minimal perfect hash functions for sets in the order of billions of keys.
<LI>It works in linear time.
<LI>The resulting MPHFs are not order preserving.
<LI>The resulting MPHFs can be stored using less than <I>8.0</I> bits per key.
</UL>
<LI><A HREF="chm.html">CHM Algorithm</A>:
<UL>
<LI>Constructs MPHFs in linear time.
<LI>It is based on acyclic random graphs.
<LI>The resulting MPHFs are order preserving.
<LI>The resulting MPHFs are stored in <I>4cn</I> bytes, where <I>c</I> is greater than 2.
</UL>
<LI><A HREF="fch.html">FCH Algorithm</A>:
<UL>
<LI>Constructs minimal perfect hash functions that require less than 4 bits per key to be stored.
<LI>The resulting MPHFs are very compact and very efficient at evaluation time.
<LI>The algorithm is only efficient for small sets.
<LI>It is used as the internal algorithm in the BRZ algorithm to efficiently solve larger problems and, even so, to generate MPHFs that require approximately 4.1 bits per key to be stored. For that, you just need to set the parameters -a to brz and -c to a value larger than or equal to 2.6.
</UL>
</UL>
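<P>
Each of the algorithms above is selected through the same configuration API. The fragment
below is only an illustrative sketch, assuming a cmph_io_adapter_t named source has already
been created as in the complete examples further down this page; the algorithm constants
follow the CMPH_* pattern used there:
</P>
<PRE>
cmph_config_t *config = cmph_config_new(source);
cmph_config_set_algo(config, CMPH_CHD);  /* or CMPH_BDZ, CMPH_BMZ, CMPH_BRZ, CMPH_CHM, CMPH_FCH */
cmph_t *hash = cmph_new(config);
cmph_config_destroy(config);
</PRE>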
<HR NOSHADE SIZE=1>
<H2>News for version 2.0</H2>
<P>
Cleaned up most warnings in the C code.
</P>
<P>
Experimental C++ interface (--enable-cxxmph) implementing the BDZ algorithm in
a convenient interface, which serves as the basis
for drop-in replacements for std::unordered_map, sparsehash::sparse_hash_map
and sparsehash::dense_hash_map. Potentially faster lookup time at the expense
of insertion time. See cxxmph/mph_map.h and cxxmph/mph_index.h for details.
</P>
<H2>News for version 1.1</H2>
<P>
Fixed a bug in the chd_pc algorithm and reorganized tests.
</P>
<H2>News for version 1.0</H2>
<P>
This is a bugfix only version, after which a revamp of the cmph code and
algorithms will be done.
</P>
<H2>News for version 0.9</H2>
<UL>
<LI><A HREF="chd.html">The CHD algorithm</A>, which is an algorithm that can be tuned to generate MPHFs that require approximately 2.07 bits per key to be stored. The algorithm outperforms <A HREF="bdz.html">the BDZ algorithm</A> and therefore is the fastest one available in the literature for sets that can be treated in internal memory.
<LI><A HREF="chd.html">The CHD_PH algorithm</A>, which is an algorithm to generate PHFs with load factor up to <I>99 %</I>. It is actually the CHD algorithm without the ranking step. If we set the load factor to <I>81 %</I>, which is the maximum that can be obtained with <A HREF="bdz.html">the BDZ algorithm</A>, the resulting functions can be stored in <I>1.40</I> bits per key. The space requirement increases with the load factor.
<LI>All reported bugs and suggestions have been corrected and included as well.
</UL>
<H2>News for version 0.8</H2>
<UL>
<LI><A HREF="bdz.html">An algorithm to generate MPHFs that require around 2.6 bits per key to be stored</A>, which is referred to as BDZ algorithm. The algorithm is the fastest one available in the literature for sets that can be treated in internal memory.
<LI><A HREF="bdz.html">An algorithm to generate PHFs with range m = cn, for c &gt; 1.22</A>, which is referred to as BDZ_PH algorithm. It is actually the BDZ algorithm without the ranking step. The resulting functions can be stored in 1.95 bits per key for <I>c = 1.23</I> and are considerably faster than the MPHFs generated by the BDZ algorithm.
<LI>An adapter to support a vector of struct as the source of keys has been added.
<LI>An API to support the ability of packing a perfect hash function into a preallocated contiguous memory space. The computation of a packed function is still faster and can be easily mmapped.
<LI>The hash functions djb2, fnv and sdbm were removed because they do not use random seeds and therefore are not useful for MPHFs algorithms.
<LI>All reported bugs and suggestions have been corrected and included as well.
</UL>
<P>
<A HREF="newslog.html">News log</A>
</P>
<HR NOSHADE SIZE=1>
<H2>Examples</H2>
<P>
Using cmph is quite simple. Take a look.
</P>
<PRE>
#include &lt;cmph.h&gt;
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;
// Create minimal perfect hash function from in-memory vector
int main(int argc, char **argv)
{
// Creating a filled vector
unsigned int i = 0;
const char *vector[] = {"aaaaaaaaaa", "bbbbbbbbbb", "cccccccccc", "dddddddddd", "eeeeeeeeee",
"ffffffffff", "gggggggggg", "hhhhhhhhhh", "iiiiiiiiii", "jjjjjjjjjj"};
unsigned int nkeys = 10;
FILE* mphf_fd = fopen("temp.mph", "w");
// Source of keys
cmph_io_adapter_t *source = cmph_io_vector_adapter((char **)vector, nkeys);
//Create minimal perfect hash function using the brz algorithm.
cmph_config_t *config = cmph_config_new(source);
cmph_config_set_algo(config, CMPH_BRZ);
cmph_config_set_mphf_fd(config, mphf_fd);
cmph_t *hash = cmph_new(config);
cmph_config_destroy(config);
cmph_dump(hash, mphf_fd);
cmph_destroy(hash);
fclose(mphf_fd);
//Find key
mphf_fd = fopen("temp.mph", "r");
hash = cmph_load(mphf_fd);
while (i &lt; nkeys) {
const char *key = vector[i];
unsigned int id = cmph_search(hash, key, (cmph_uint32)strlen(key));
fprintf(stderr, "key:%s -- hash:%u\n", key, id);
i++;
}
//Destroy hash
cmph_destroy(hash);
cmph_io_vector_adapter_destroy(source);
fclose(mphf_fd);
return 0;
}
</PRE>
<P>
Download <A HREF="examples/vector_adapter_ex1.c">vector_adapter_ex1.c</A>. This example does not work in versions below 0.6. You need to update the sources from GIT to make it work.
</P>
<HR NOSHADE SIZE=1>
<PRE>
#include &lt;cmph.h&gt;
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;
// Create minimal perfect hash function from in-disk keys using BDZ algorithm
int main(int argc, char **argv)
{
//Open file with newline separated list of keys
FILE * keys_fd = fopen("keys.txt", "r");
cmph_t *hash = NULL;
if (keys_fd == NULL)
{
fprintf(stderr, "File \"keys.txt\" not found\n");
exit(1);
}
// Source of keys
cmph_io_adapter_t *source = cmph_io_nlfile_adapter(keys_fd);
cmph_config_t *config = cmph_config_new(source);
cmph_config_set_algo(config, CMPH_BDZ);
hash = cmph_new(config);
cmph_config_destroy(config);
//Find key
const char *key = "jjjjjjjjjj";
unsigned int id = cmph_search(hash, key, (cmph_uint32)strlen(key));
fprintf(stderr, "Id:%u\n", id);
//Destroy hash
cmph_destroy(hash);
cmph_io_nlfile_adapter_destroy(source);
fclose(keys_fd);
return 0;
}
</PRE>
<P>
Download <A HREF="examples/file_adapter_ex2.c">file_adapter_ex2.c</A> and <A HREF="examples/keys.txt">keys.txt</A>. This example does not work in versions below 0.8. You need to update the sources from GIT to make it work.
</P>
<P>
<A HREF="examples.html">Click here to see more examples</A>
</P>
<HR NOSHADE SIZE=1>
<H2>The cmph application</H2>
<P>
cmph is the name of both the library and the utility
application that comes with this package. You can use the cmph
application for constructing minimal perfect hash functions from the command line.
The cmph utility
comes with a number of flags, but it is very simple to create and to query
minimal perfect hash functions:
</P>
<PRE>
$ # Using the chm algorithm (default one) for constructing a mphf for keys in file keys_file
$ ./cmph -g keys_file
$ # Query id of keys in the file keys_query
$ ./cmph -m keys_file.mph keys_query
</PRE>
<P>
The additional options let you set most of the parameters you have
available through the C API. Below you can see the full help message for the
utility.
</P>
<PRE>
usage: cmph [-v] [-h] [-V] [-k nkeys] [-f hash_function] [-g [-c algorithm_dependent_value][-s seed] ]
[-a algorithm] [-M memory_in_MB] [-b algorithm_dependent_value] [-t keys_per_bin] [-d tmp_dir]
[-m file.mph] keysfile
Minimum perfect hashing tool
-h print this help message
-c c value determines:
* the number of vertices in the graph for the algorithms BMZ and CHM
* the number of bits per key required in the FCH algorithm
* the load factor in the CHD_PH algorithm
-a algorithm - valid values are
* bmz
* bmz8
* chm
* brz
* fch
* bdz
* bdz_ph
* chd_ph
* chd
-f hash function (may be used multiple times) - valid values are
* jenkins
-V print version number and exit
-v increase verbosity (may be used multiple times)
-k number of keys
-g generation mode
-s random seed
-m minimum perfect hash function file
-M main memory availability (in MB) used in BRZ algorithm
-d temporary directory used in BRZ algorithm
-b the meaning of this parameter depends on the algorithm selected in the -a option:
* For BRZ it is used to make the maximal number of keys in a bucket lower than 256.
In this case its value should be an integer in the range [64,175]. Default is 128.
* For BDZ it is used to determine the size of some precomputed rank
information and its value should be an integer in the range [3,10]. Default
is 7. The larger this value is, the more compact the resulting functions are
and the slower they are at evaluation time.
* For CHD and CHD_PH it is used to set the average number of keys per bucket
and its value should be an integer in the range [1,32]. Default is 4. The
larger this value is, the slower the construction of the functions.
This parameter has no effect for other algorithms.
-t set the number of keys per bin for a t-perfect hashing function. A t-perfect
hash function allows at most t collisions in a given bin. This parameter applies
only to the CHD and CHD_PH algorithms. Its value should be an integer in the
range [1,128]. Default is 1.
keysfile line separated file with keys
</PRE>
<H2>Additional Documentation</H2>
<P>
<A HREF="faq.html">FAQ</A>
</P>
<H2>Downloads</H2>
<P>
Use the github releases page at: <A HREF="https://github.com/bonitao/cmph/releases">https://github.com/bonitao/cmph/releases</A>
</P>
<H2>License Stuff</H2>
<P>
Code is under the LGPL and the MPL 1.1.
</P>
<HR NOSHADE SIZE=1>
<P>
Enjoy!
</P>
<P>
<A HREF="mailto:davi@users.sourceforge.net">Davi de Castro Reis</A>
</P>
<P>
<A HREF="mailto:db8192@users.sourceforge.net">Djamel Belazzougui</A>
</P>
<P>
<A HREF="mailto:fc_botelho@users.sourceforge.net">Fabiano Cupertino Botelho</A>
</P>
<P>
<A HREF="mailto:nivio@dcc.ufmg.br">Nivio Ziviani</A>
</P>
<a href="http://sourceforge.net"><img src="http://sourceforge.net/sflogo.php?group_id=96251&type=1" width="88" height="31" border="0" alt="SourceForge.net Logo" /> </a>
<P>
Last Updated: Fri Dec 28 23:50:29 2018
</P>
<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
try {
var pageTracker = _gat._getTracker("UA-7698683-2");
pageTracker._trackPageview();
} catch(err) {}</script>
<!-- html code generated by txt2tags 2.6 (http://txt2tags.org) -->
<!-- cmdline: txt2tags -t html -\-mask-email -i README.t2t -o docs/index.html -->
</BODY></HTML>

142
docs/newslog.html Normal file
View File

@@ -0,0 +1,142 @@
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<META NAME="generator" CONTENT="http://txt2tags.org">
<LINK REL="stylesheet" TYPE="text/css" HREF="DOC.css">
<TITLE>News Log</TITLE>
</HEAD><BODY BGCOLOR="white" TEXT="black">
<CENTER>
<H1>News Log</H1>
</CENTER>
<HR NOSHADE SIZE=1>
<H2>News for version 1.1</H2>
<P>
Fixed a bug in the chd_ph algorithm and reorganized the tests.
</P>
<H2>News for version 1.0</H2>
<P>
This is a bugfix only version, after which a revamp of the cmph code and
algorithms will be done.
</P>
<HR NOSHADE SIZE=1>
<H2>News for version 0.9</H2>
<UL>
<LI><A HREF="chd.html">The CHD algorithm</A>, which is an algorithm that can be tuned to generate MPHFs that require approximately 2.07 bits per key to be stored. The algorithm outperforms <A HREF="bdz.html">the BDZ algorithm</A> and therefore is the fastest one available in the literature for sets that can be treated in internal memory.
<LI><A HREF="chd.html">The CHD_PH algorithm</A>, which is an algorithm to generate PHFs with load factor up to <I>99 %</I>. It is actually the CHD algorithm without the ranking step. If we set the load factor to <I>81 %</I>, which is the maximum that can be obtained with <A HREF="bdz.html">the BDZ algorithm</A>, the resulting functions can be stored in <I>1.40</I> bits per key. The space requirement increases with the load factor.
<LI>All reported bugs and suggestions have been corrected and included as well.
</UL>
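<P>
As an illustration of the CHD_PH tuning described above, the following sketch (not part of the distribution's examples) builds a PHF over a synthetic key set, assuming that cmph_config_set_graphsize sets the load factor and cmph_config_set_b the average bucket size, as the -c and -b flags of the cmph tool do.
</P>
<PRE>
#include &lt;cmph.h&gt;
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;
// Sketch: build a CHD_PH function over generated keys with a 99% load factor
int main(void)
{
    unsigned int nkeys = 1000, i;
    char **vector = (char **)malloc(nkeys * sizeof(char *));
    for (i = 0; i &lt; nkeys; ++i) {
        vector[i] = (char *)malloc(16);
        snprintf(vector[i], 16, "key_%u", i); // distinct synthetic keys
    }
    cmph_io_adapter_t *source = cmph_io_vector_adapter(vector, nkeys);
    cmph_config_t *config = cmph_config_new(source);
    cmph_config_set_algo(config, CMPH_CHD_PH);
    cmph_config_set_graphsize(config, 0.99); // load factor, the -c value of the cmph tool
    cmph_config_set_b(config, 4);            // average keys per bucket, the -b value
    cmph_t *hash = cmph_new(config);
    cmph_config_destroy(config);
    if (hash == NULL) { fprintf(stderr, "generation failed\n"); return 1; }
    const char *key = "key_42";
    unsigned int id = cmph_search(hash, key, (cmph_uint32)strlen(key));
    fprintf(stderr, "key:%s -- hash:%u\n", key, id);
    cmph_destroy(hash);
    cmph_io_vector_adapter_destroy(source);
    for (i = 0; i &lt; nkeys; ++i) free(vector[i]);
    free(vector);
    return 0;
}
</PRE>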
<HR NOSHADE SIZE=1>
<H2>News for version 0.8</H2>
<UL>
<LI><A HREF="bdz.html">An algorithm to generate MPHFs that require around 2.6 bits per key to be stored</A>, which is referred to as BDZ algorithm. The algorithm is the fastest one available in the literature for sets that can be treated in internal memory.
<LI><A HREF="bdz.html">An algorithm to generate PHFs with range m = cn, for c &gt; 1.22</A>, which is referred to as BDZ_PH algorithm. It is actually the BDZ algorithm without the ranking step. The resulting functions can be stored in 1.95 bits per key for <I>c = 1.23</I> and are considerably faster than the MPHFs generated by the BDZ algorithm.
<LI>An adapter to support a vector of structs as the source of keys has been added.
<LI>An API for packing a perfect hash function into a preallocated contiguous memory space. Evaluating a packed function is even faster, and the packed representation can easily be mmapped; a minimal usage sketch is shown right after this list.
<LI>The hash functions djb2, fnv and sdbm were removed because they do not use random seeds and therefore are not useful for MPHF algorithms.
<LI>All reported bugs and suggestions have been corrected and included as well.
</UL>
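<P>
The packing API mentioned above can be used roughly as follows. This is a minimal sketch rather than one of the distribution's examples, relying on cmph_packed_size, cmph_pack and cmph_search_packed and keeping error handling to a minimum.
</P>
<PRE>
#include &lt;cmph.h&gt;
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;
// Sketch: pack an MPHF into one contiguous buffer and query the packed form
int main(void)
{
    const char *vector[] = {"aaaaaaaaaa", "bbbbbbbbbb", "cccccccccc", "dddddddddd", "eeeeeeeeee",
                            "ffffffffff", "gggggggggg", "hhhhhhhhhh", "iiiiiiiiii", "jjjjjjjjjj"};
    unsigned int nkeys = 10;
    cmph_io_adapter_t *source = cmph_io_vector_adapter((char **)vector, nkeys);
    cmph_config_t *config = cmph_config_new(source);
    cmph_config_set_algo(config, CMPH_BDZ);
    cmph_t *hash = cmph_new(config);
    cmph_config_destroy(config);
    // Ask for the packed size, allocate one contiguous block and pack into it
    cmph_uint32 packed_size = cmph_packed_size(hash);
    void *packed_mphf = malloc(packed_size);
    cmph_pack(hash, packed_mphf);
    // The packed copy is self-contained, so the original handle can be destroyed
    cmph_destroy(hash);
    // Query the packed function directly
    const char *key = "jjjjjjjjjj";
    unsigned int id = cmph_search_packed(packed_mphf, key, (cmph_uint32)strlen(key));
    fprintf(stderr, "key:%s -- hash:%u\n", key, id);
    free(packed_mphf);
    cmph_io_vector_adapter_destroy(source);
    return 0;
}
</PRE>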
<HR NOSHADE SIZE=1>
<H2>News for version 0.7</H2>
<UL>
<LI>Added man pages and a pkgconfig file.
</UL>
<HR NOSHADE SIZE=1>
<H2>News for version 0.6</H2>
<UL>
<LI><A HREF="fch.html">An algorithm to generate MPHFs that require less than 4 bits per key to be stored</A>, which is referred to as FCH algorithm. The algorithm is only efficient for small sets.
<LI>The FCH algorithm is integrated with <A HREF="brz.html">BRZ algorithm</A> so that you will be able to efficiently generate space-efficient MPHFs for sets in the order of billion keys.
<LI>All reported bugs and suggestions have been corrected and included as well.
</UL>
<HR NOSHADE SIZE=1>
<H2>News for version 0.5</H2>
<UL>
<LI>A thread safe vector adapter has been added.
<LI><A HREF="brz.html">A new algorithm for sets in the order of billion of keys that requires approximately 8.1 bits per key to store the resulting MPHFs.</A>
<LI>All reported bugs and suggestions have been corrected and included as well.
</UL>
<HR NOSHADE SIZE=1>
<H2>News for version 0.4</H2>
<UL>
<LI>Vector Adapter has been added.
<LI>An optimized version of bmz (bmz8) for small set of keys (at most 256 keys) has been added.
<LI>All reported bugs and suggestions have been corrected and included as well.
</UL>
<HR NOSHADE SIZE=1>
<H2>News for version 0.3</H2>
<UL>
<LI>A new heuristic added to the bmz algorithm makes it possible to generate an MPHF using only
<I>24.80n + O(1)</I> bytes during construction. The resulting function can be stored in <I>3.72n</I> bytes.
<A HREF="bmz.html#heuristic">click here</A> for details.
</UL>
<HR NOSHADE SIZE=1>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><A HREF="index.html">Home</A></TD>
<TD><A HREF="chd.html">CHD</A></TD>
<TD><A HREF="bdz.html">BDZ</A></TD>
<TD><A HREF="bmz.html">BMZ</A></TD>
<TD><A HREF="chm.html">CHM</A></TD>
<TD><A HREF="brz.html">BRZ</A></TD>
<TD><A HREF="fch.html">FCH</A></TD>
</TR>
</TABLE>
<HR NOSHADE SIZE=1>
<P>
Enjoy!
</P>
<P>
<A HREF="mailto:davi@users.sourceforge.net">Davi de Castro Reis</A>
</P>
<P>
<A HREF="mailto:db8192@users.sourceforge.net">Djamel Belazzougui</A>
</P>
<P>
<A HREF="mailto:fc_botelho@users.sourceforge.net">Fabiano Cupertino Botelho</A>
</P>
<P>
<A HREF="mailto:nivio@dcc.ufmg.br">Nivio Ziviani</A>
</P>
<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
try {
var pageTracker = _gat._getTracker("UA-7698683-2");
pageTracker._trackPageview();
} catch(err) {}</script>
<!-- html code generated by txt2tags 2.6 (http://txt2tags.org) -->
<!-- cmdline: txt2tags -t html -i NEWSLOG.t2t -o docs/newslog.html -->
</BODY></HTML>

26
gendocs
View File

@@ -1,18 +1,18 @@
 #!/bin/sh
-txt2tags -t html --mask-email -i README.t2t -o index.html
+txt2tags -t html --mask-email -i README.t2t -o docs/index.html
-txt2tags -t html -i CHD.t2t -o chd.html
+txt2tags -t html -i CHD.t2t -o docs/chd.html
-txt2tags -t html -i BDZ.t2t -o bdz.html
+txt2tags -t html -i BDZ.t2t -o docs/bdz.html
-txt2tags -t html -i BMZ.t2t -o bmz.html
+txt2tags -t html -i BMZ.t2t -o docs/bmz.html
-txt2tags -t html -i BRZ.t2t -o brz.html
+txt2tags -t html -i BRZ.t2t -o docs/brz.html
-txt2tags -t html -i CHM.t2t -o chm.html
+txt2tags -t html -i CHM.t2t -o docs/chm.html
-txt2tags -t html -i FCH.t2t -o fch.html
+txt2tags -t html -i FCH.t2t -o docs/fch.html
-txt2tags -t html -i COMPARISON.t2t -o comparison.html
+txt2tags -t html -i COMPARISON.t2t -o docs/comparison.html
-txt2tags -t html -i GPERF.t2t -o gperf.html
+txt2tags -t html -i GPERF.t2t -o docs/gperf.html
-txt2tags -t html -i FAQ.t2t -o faq.html
+txt2tags -t html -i FAQ.t2t -o docs/faq.html
-txt2tags -t html -i CONCEPTS.t2t -o concepts.html
+txt2tags -t html -i CONCEPTS.t2t -o docs/concepts.html
-txt2tags -t html -i NEWSLOG.t2t -o newslog.html
+txt2tags -t html -i NEWSLOG.t2t -o docs/newslog.html
-txt2tags -t html -i EXAMPLES.t2t -o examples.html
+txt2tags -t html -i EXAMPLES.t2t -o docs/examples.html
 txt2tags -t txt --mask-email -i README.t2t -o README
 txt2tags -t txt -i CHD.t2t -o CHD