Add 'deps/cmph/' from commit 'a250982ade093f4eed0552bbdd22dd7b0432007f'
git-subtree-dir: deps/cmph git-subtree-mainline:5040f4007b
git-subtree-split:a250982ade
commit
37e24524c2
|
@ -0,0 +1,6 @@
|
|||
|
||||
|
||||
|
||||
----------------------------------------
|
||||
| [Home index.html] | [CHD chd.html] | [BDZ bdz.html] | [BMZ bmz.html] | [CHM chm.html] | [BRZ brz.html] | [FCH fch.html]
|
||||
----------------------------------------
|
|
@ -0,0 +1,4 @@
|
|||
Davi de Castro Reis davi@users.sourceforge.net
|
||||
Djamel Belazzougui db8192@users.sourceforge.net
|
||||
Fabiano Cupertino Botelho fc_botelho@users.sourceforge.net
|
||||
Nivio Ziviani nivio@dcc.ufmg.br
|
|
@ -0,0 +1,174 @@
|
|||
BDZ Algorithm
|
||||
|
||||
|
||||
%!includeconf: CONFIG.t2t
|
||||
|
||||
----------------------------------------
|
||||
==Introduction==
|
||||
|
||||
The BDZ algorithm was designed by Fabiano C. Botelho, Djamel Belazzougui, Rasmus Pagh and Nivio Ziviani. It is a simple, efficient and practical algorithm, with near-optimal space usage, for generating a family [figs/bdz/img8.png] of PHFs and MPHFs. It is also referred to as the BPZ algorithm because of the work presented by Botelho, Pagh and Ziviani in [[2 #papers]]. In Botelho's PhD dissertation [[1 #papers]] it is also referred to as the RAM algorithm because it is more suitable for key sets that can be handled in internal memory.
|
||||
|
||||
The BDZ algorithm uses //r//-uniform random hypergraphs given by function values of //r// uniform random hash functions on the input key set //S// for generating PHFs and MPHFs that require //O(n)// bits to be stored. A hypergraph is the generalization of a standard undirected graph where each edge connects [figs/bdz/img12.png] vertices. This idea is not new, see e.g. [[8 #papers]], but we have proceeded differently to achieve a space usage of //O(n)// bits rather than //O(n log n)// bits. Evaluation time for all schemes considered is constant. For //r=3// we obtain a space usage of approximately //2.6n// bits for an MPHF. More compact, and even simpler, representations can be achieved for larger //m//. For example, for //m=1.23n// we can get a space usage of //1.95n// bits.
|
||||
|
||||
Our best MPHF space upper bound is within a factor of //2// from the information theoretical lower bound of approximately //1.44// bits per key. We have shown that the BDZ algorithm is far more practical than previous methods with proven space complexity, both because of its simplicity and because the constant factor of its space complexity is more than //6// times lower than that of its closest competitor for plausible problem sizes. We verify the practicality experimentally, using slightly more space than the mentioned theoretical bounds.
|
||||
|
||||
----------------------------------------
|
||||
|
||||
==The Algorithm==
|
||||
|
||||
The BDZ algorithm is a three-step algorithm that generates PHFs and MPHFs based on random //r//-partite hypergraphs. This approach provides a much tighter analysis and is much simpler than the one presented in [[3 #papers]], where it was implicit how to construct similar PHFs. The fastest and most compact functions are generated when //r=3//. In this case a PHF can be stored in approximately //1.95// bits per key and an MPHF in approximately //2.62// bits per key.
|
||||
|
||||
Figure 1 gives an overview of the algorithm for //r=3//, taking as input a key set [figs/bdz/img22.png] containing three English words, i.e., //S={who,band,the}//. The edge-oriented data structure proposed in [[4 #papers]] is used to represent hypergraphs, where each edge is explicitly represented as an array of //r// vertices and, for each vertex //v//, there is a list of edges that are incident on //v//.
|
||||
|
||||
| [figs/bdz/img50.png]
|
||||
| **Figure 1:** (a) The mapping step generates a random acyclic //3//-partite hypergraph
|
||||
| with //m=6// vertices and //n=3// edges and a list [figs/bdz/img4.png] of edges obtained when we test
|
||||
| whether the hypergraph is acyclic. (b) The assigning step builds an array //g// that
|
||||
| maps values from //[0,5]// to //[0,3]// to uniquely assign an edge to a vertex. (c) The ranking
|
||||
| step builds the data structure used to compute function //rank// in //O(1)// time.
|
||||
|
||||
|
||||
|
||||
The //Mapping Step// in Figure 1(a) carries out two important tasks:
|
||||
|
||||
+ It assumes that it is possible to find three uniform hash functions //h,,0,,//, //h,,1,,// and //h,,2,,//, with ranges //{0,1}//, //{2,3}// and //{4,5}//, respectively. These functions build a one-to-one mapping of the key set //S// to the edge set //E// of a random acyclic //3//-partite hypergraph //G=(V,E)//, where //|V|=m=6// and //|E|=n=3//. In [[1,2 #papers]] it is shown that it is possible to obtain such a hypergraph with probability tending to //1// as //n// tends to infinity whenever //m=cn// and //c > 1.22//. The value of //c// that minimizes the hypergraph size (and thereby the number of bits needed to represent the resulting functions) is in the range //(1.22,1.23)//. To illustrate the mapping, key "who" is mapped to edge //{h,,0,,("who"), h,,1,,("who"), h,,2,,("who")} = {1,3,5}//, key "band" is mapped to edge //{h,,0,,("band"), h,,1,,("band"), h,,2,,("band")} = {1,2,4}//, and key "the" is mapped to edge //{h,,0,,("the"), h,,1,,("the"), h,,2,,("the")} = {0,2,5}//.
|
||||
|
||||
+ It tests whether the resulting random //3//-partite hypergraph contains cycles by iteratively deleting edges that contain a vertex of degree 1. The deleted edges are stored in the order of deletion in a list [figs/bdz/img4.png] to be used in the assigning step. The first deleted edge in Figure 1(a) was //{1,2,4}//, the second one was //{1,3,5}// and the third one was //{0,2,5}//. If the process ends with an empty graph, the test succeeds; otherwise it fails.
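The acyclicity test described above can be sketched as a queue-based peeling procedure. This is a minimal illustration, not the cmph implementation: the function name `peel_hypergraph` and the plain Python lists are assumptions made for the example, and the edges are the three edges of Figure 1(a).

```python
from collections import deque


def peel_hypergraph(num_vertices, edges):
    """Return (is_acyclic, deletion_order) for a hypergraph.

    edges is a list of vertex tuples; deletion_order holds edge indices
    in the order in which the edges were removed.
    """
    # For every vertex, the indices of the edges incident on it.
    incident = [[] for _ in range(num_vertices)]
    for edge_index, edge in enumerate(edges):
        for vertex in edge:
            incident[vertex].append(edge_index)

    degree = [len(edge_list) for edge_list in incident]
    deleted = [False] * len(edges)
    queue = deque(v for v in range(num_vertices) if degree[v] == 1)
    deletion_order = []

    while queue:
        vertex = queue.popleft()
        if degree[vertex] != 1:
            continue  # its last edge was already removed through another vertex
        edge_index = next(e for e in incident[vertex] if not deleted[e])
        deleted[edge_index] = True
        deletion_order.append(edge_index)
        for other in edges[edge_index]:
            degree[other] -= 1
            if degree[other] == 1:
                queue.append(other)

    # The hypergraph is acyclic exactly when every edge could be removed.
    return len(deletion_order) == len(edges), deletion_order


edges = [(1, 3, 5), (1, 2, 4), (0, 2, 5)]  # "who", "band", "the"
acyclic, order = peel_hypergraph(6, edges)
```

The order in which edges are deleted depends on how the degree-1 vertices are queued, so it may differ from the order shown in Figure 1(a); any order produced this way should be usable by the assigning step.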
|
||||
|
||||
|
||||
We now show how to use the Jenkins hash functions [[7 #papers]] to implement the three hash functions //h,,i,,//, which map values from //S// to //V,,i,,//, where [figs/bdz/img52.png]. These functions are used to build a random //3//-partite hypergraph, where [figs/bdz/img53.png] and [figs/bdz/img54.png]. Let [figs/bdz/img55.png] be a Jenkins hash function for [figs/bdz/img56.png], where
|
||||
//w=32 or 64// for 32-bit and 64-bit architectures, respectively.
|
||||
Let //H'// be an array of 3 //w//-bit values. The Jenkins hash function
|
||||
allows us to compute in parallel the three entries in //H'//
|
||||
and thereby the three hash functions //h,,i,,//, as follows:
|
||||
|
||||
| //H' = h'(x)//
|
||||
| //h,,0,,(x) = H'[0] mod// [figs/bdz/img136.png]
|
||||
| //h,,1,,(x) = H'[1] mod// [figs/bdz/img136.png] //+// [figs/bdz/img136.png]
|
||||
| //h,,2,,(x) = H'[2] mod// [figs/bdz/img136.png] //+ 2//[figs/bdz/img136.png]
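The following sketch shows how one wide hash value can be split into three functions with disjoint ranges, as in the formulas above. Several things here are assumptions: BLAKE2b from Python's `hashlib` stands in for the Jenkins function, //w// is fixed at 32 bits, the name `three_hashes` is hypothetical, and the quantity shown only as an image in the formulas is taken to be the size of each vertex part, //m/3//.

```python
import hashlib


def three_hashes(key, m, seed=0):
    """Map a key to one vertex in each third of the range [0, m-1]."""
    part = m // 3  # assumed size of each of the three vertex parts
    # One 96-bit digest plays the role of H', an array of three 32-bit words.
    digest = hashlib.blake2b(
        key.encode("utf-8"),
        digest_size=12,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    h_prime = [int.from_bytes(digest[4 * i:4 * i + 4], "little") for i in range(3)]
    # h_i(x) = H'[i] mod part + i * part
    return tuple(h_prime[i] % part + i * part for i in range(3))


vertices = three_hashes("who", 6)
```

Changing `seed` yields a new triple of functions, which is what a retry needs when the acyclicity test fails.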
|
||||
|
||||
|
||||
The //Assigning Step// in Figure 1(b) outputs a PHF that maps the key set //S// into the range //[0,m-1]// and is represented by an array //g// storing values from the range //[0,3]//. The array //g// allows us to select one out of the //3// vertices of a given edge, which is associated with a key //k//. A vertex for a key //k// is given by either //h,,0,,(k)//, //h,,1,,(k)// or //h,,2,,(k)//. The function //h,,i,,(k)// to be used for //k// is chosen by calculating //i = (g[h,,0,,(k)] + g[h,,1,,(k)] + g[h,,2,,(k)]) mod 3//. For instance, the values 1 and 4 represent the keys "who" and "band" because //i = (g[1] + g[3] + g[5]) mod 3 = 0// and //h,,0,,("who") = 1//, and //i = (g[1] + g[2] + g[4]) mod 3 = 2// and //h,,2,,("band") = 4//, respectively. Let //Visited// be a boolean vector of size //m// that indicates whether a vertex has been visited. The assigning step first initializes //g[i]=3//, to mark every vertex as unassigned, and //Visited[i] = false//, [figs/bdz/img88.png]. Then, for each edge [figs/bdz/img90.png] from tail to head, it looks for the first vertex //u// belonging to //e// that has not yet been visited. This is a sufficient condition for success [[1,2,8 #papers]]. Let //j// be the index of //u// in //e//, for //j// in the range //[0,2]//. Then, it assigns [figs/bdz/img95.png]. Whenever it passes through a vertex //u// from //e//, if //u// has not yet been visited, it sets //Visited[u] = true//.
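A minimal sketch of the assigning step on the example of Figure 1 follows. The exact assignment formula appears only as an image in the text, so the rule used here, //g[u] = (j - sum of the other two entries) mod 3//, is an assumption chosen to be consistent with the selection rule //i = (g[h,,0,,(k)] + g[h,,1,,(k)] + g[h,,2,,(k)]) mod 3//. The names `assign` and `phf` are hypothetical.

```python
UNASSIGNED = 3  # marks a vertex that no key selects; note that 3 mod 3 == 0


def assign(num_vertices, edges, deletion_order):
    """Build the array g by visiting edges in reverse order of deletion."""
    g = [UNASSIGNED] * num_vertices
    visited = [False] * num_vertices
    for edge_index in reversed(deletion_order):
        edge = edges[edge_index]
        # Position j of the first vertex of the edge not visited so far.
        j = next(pos for pos, v in enumerate(edge) if not visited[v])
        u = edge[j]
        others = sum(g[v] for v in edge if v != u)
        g[u] = (j - others) % 3  # makes the edge's sum congruent to j mod 3
        for v in edge:
            visited[v] = True
    return g


def phf(g, edge):
    """Select the vertex of the edge indicated by the sum of its g values."""
    return edge[sum(g[v] for v in edge) % 3]


edges = [(1, 3, 5), (1, 2, 4), (0, 2, 5)]  # "who", "band", "the"
deletion_order = [1, 0, 2]                 # order reported for Figure 1(a)
g = assign(6, edges, deletion_order)       # [0, 0, 3, 3, 2, 3]
```

With this array, "who" selects vertex 1 and "band" selects vertex 4, which agrees with the values quoted in the paragraph above.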
|
||||
|
||||
|
||||
If we stop the BDZ algorithm in the assigning step we obtain a PHF with range //[0,m-1]//. The PHF has the following form: //phf(x) = h,,i(x),,(x)//, where key //x// is in //S// and //i(x) = (g[h,,0,,(x)] + g[h,,1,,(x)] + g[h,,2,,(x)]) mod 3//. In this case we do not need information for ranking and can set //g[i] = 0// whenever //g[i]// is equal to //3//, where //i// is in the range //[0,m-1]//. Therefore, the range of the values stored in //g// is narrowed from //[0,3]// to //[0,2]//. By using arithmetic coding on blocks of values (see [[1,2 #papers]] for details), or any compression technique that allows random access in constant time to an array of compressed values [[5,6,12 #papers]], we can store the resulting PHFs in //m log 3 = cn log 3// bits, where //c > 1.22//. For //c = 1.23//, the space requirement is //1.95n// bits.
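As a quick check of the figure quoted above, the arithmetic can be written out as follows, taking the logarithm in base 2 since the result is measured in bits:

```latex
m \log_2 3 \;=\; c\,n \log_2 3 \;\approx\; 1.23 \times 1.585\, n \;\approx\; 1.95\, n \ \text{bits}
```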
|
||||
|
||||
The //Ranking Step// in Figure 1(c) outputs a data structure that makes it possible to narrow the range of a PHF generated in the assigning step from //[0,m-1]// to //[0,n-1]//, thereby producing an MPHF. The data structure allows us to compute in constant time a function //rank// from //[0,m-1]// to //[0,n-1]// that counts the number of assigned positions before a given position //v// in //g//. For instance, //rank(4) = 2// because the positions //0// and //1// are assigned, since //g[0]// and //g[1]// are not equal to //3//.
|
||||
|
||||
|
||||
For the implementation of the ranking step we have borrowed a simple and efficient implementation from [[10 #papers]]. It requires [figs/bdz/img111.png] additional bits of space, where [figs/bdz/img112.png], and is obtained by storing explicitly the //rank// of every //k//th index in a rankTable, where [figs/bdz/img114.png]. The larger //k// is, the more compact the resulting MPHF is. Therefore, users can trade off space for evaluation time by setting //k// appropriately in the implementation. We only allow values for //k// that are powers of two (i.e., //k=2^^b,,k,,^^// for some constant //b,,k,,//) in order to replace the expensive division and modulo operations by bit-shift and bitwise "and" operations, respectively. We have used //k=256// in the experiments for generating more succinct MPHFs. We remark that it is still possible to obtain a more compact data structure by using the results presented in [[9,11 #papers]], but at the cost of a much more complex implementation.
|
||||
|
||||
|
||||
We need to use an additional lookup table //T,,r,,// to guarantee the constant evaluation time of //rank(u)//. Let us illustrate how //rank(u)// is computed using both the rankTable and the lookup table //T,,r,,//. We first look up the rank of the largest precomputed index //v// less than or equal to //u// in the rankTable, and use //T,,r,,// to count the number of assigned vertices from position //v// to //u-1//. The lookup table //T,,r,,// allows us to count in constant time the number of assigned vertices in [figs/bdz/img122.png] bits, where [figs/bdz/img112.png]. Thus the actual evaluation time is [figs/bdz/img123.png]. For simplicity and without loss of generality we let [figs/bdz/img124.png] be a multiple of the number of bits [figs/bdz/img125.png] used to encode each entry of //g//. As the values in //g// come from the range //[0,3]//,
|
||||
then [figs/bdz/img126.png] bits and we have tried [figs/bdz/img124.png] equal to //8// and //16//. We would expect [figs/bdz/img124.png] equal to //16// to provide a faster evaluation time because fewer lookups in //T,,r,,// would be needed. However, for both values the lookup table //T,,r,,// fits entirely in the CPU cache and we did not observe any significant difference in the evaluation times. Therefore we settled on the value //8//. We remark that each value of //r// requires a different lookup table //T,,r,,// that can be generated a priori.
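The combination of a sampled rankTable and a byte-indexed lookup table can be sketched as below. This is an illustration under stated assumptions rather than the cmph code: entries of //g// are packed four per byte with the lowest bits first, the sampling interval //k = 2^^b^^// is a multiple of 4, and the names `pack_g`, `T_R`, `build_rank_table` and `rank` are hypothetical.

```python
def pack_g(g):
    """Pack 2-bit entries of g into bytes, four entries per byte."""
    packed = bytearray((len(g) + 3) // 4)
    for i, value in enumerate(g):
        packed[i >> 2] |= value << ((i & 3) * 2)
    # Pad unused trailing slots with 3 (unassigned) so they are never counted.
    for i in range(len(g), len(packed) * 4):
        packed[i >> 2] |= 3 << ((i & 3) * 2)
    return bytes(packed)


# T_R[byte] = number of 2-bit fields in the byte that differ from 3.
T_R = [sum(((byte >> shift) & 3) != 3 for shift in (0, 2, 4, 6))
       for byte in range(256)]


def build_rank_table(packed, num_entries, b):
    """Store the rank of every k-th index, with k = 2**b and b >= 2."""
    k = 1 << b
    table = []
    count = 0
    for entry in range(0, num_entries, 4):
        if entry % k == 0:
            table.append(count)
        count += T_R[packed[entry >> 2]]
    return table


def rank(packed, table, b, u):
    """Count the assigned positions of g strictly before position u."""
    block = u >> b                      # shift replaces division by k
    count = table[block]
    first_byte = (block << b) >> 2
    last_byte = u >> 2
    for byte_index in range(first_byte, last_byte):
        count += T_R[packed[byte_index]]
    # Entries of the last byte that lie before u are checked one by one.
    for slot in range(u & 3):
        count += int(((packed[last_byte] >> (slot * 2)) & 3) != 3)
    return count


g = [0, 0, 3, 3, 2, 3]
packed = pack_g(g)
table = build_rank_table(packed, len(g), 2)
r4 = rank(packed, table, 2, 4)  # 2: positions 0 and 1 are assigned
```

At most //k/4// table lookups are made per query in this layout, so a larger //b// shrinks the rankTable while lengthening the scan.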
|
||||
|
||||
The resulting MPHFs have the following form: //h(x) = rank(phf(x))//. Therefore, we cannot get rid of the ranking information by replacing the values 3 by 0 in the entries of //g//. In this case each entry in the array //g// is encoded with //2// bits and we need [figs/bdz/img133.png] additional bits to compute function //rank// in constant time. Then, the total space to store the resulting functions is [figs/bdz/img134.png] bits. By using //c = 1.23// and [figs/bdz/img135.png] we have obtained MPHFs that require approximately //2.62// bits per key to be stored.
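To make the form //h(x) = rank(phf(x))// concrete, here is a small sketch that evaluates it on the three-key example. The array `G` is an assumption: it is one assignment consistent with the values quoted in the text (positions 0 and 1 assigned, "who" selecting vertex 1 and "band" selecting vertex 4), not necessarily the array drawn in Figure 1. The hash values are hard-coded from the example, and `rank` is the naive linear-time version.

```python
G = [0, 0, 3, 3, 2, 3]  # assumed example array; 3 marks an unassigned vertex

# (h0, h1, h2) for each key, as given in the mapping example.
EDGES = {"who": (1, 3, 5), "band": (1, 2, 4), "the": (0, 2, 5)}


def phf(key):
    """Perfect hash with range [0, m-1]: pick one vertex of the key's edge."""
    edge = EDGES[key]
    return edge[sum(G[v] for v in edge) % 3]


def rank(u):
    """Number of assigned positions of G strictly before position u."""
    return sum(1 for value in G[:u] if value != 3)


def mphf(key):
    """Minimal perfect hash with range [0, n-1]."""
    return rank(phf(key))


values = {key: mphf(key) for key in EDGES}  # {"who": 1, "band": 2, "the": 0}
```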
|
||||
|
||||
|
||||
----------------------------------------
|
||||
|
||||
==Memory Consumption==
|
||||
|
||||
Now we detail the memory consumption to generate and to store minimal perfect hash functions
|
||||
using the BDZ algorithm. The structures responsible for memory consumption are the
|
||||
following:
|
||||
- 3-graph:
|
||||
+ **first**: is a vector that stores //cn// integer numbers, each one representing
|
||||
the first edge (index in the vector edges) in the list of
|
||||
incident edges of each vertex. The integer numbers are 4 bytes long. Therefore,
|
||||
the vector first is stored in //4cn// bytes.
|
||||
|
||||
+ **edges**: is a vector to represent the edges of the graph. As each edge
|
||||
is composed of three vertices, each entry stores three integer numbers
|
||||
of 4 bytes that represent the vertices. As there are //n// edges, the
|
||||
vector edges is stored in //12n// bytes.
|
||||
|
||||
+ **next**: given a vertex [figs/img139.png], we can discover the edges that
|
||||
contain [figs/img139.png] following its list of incident edges,
|
||||
which starts on first[[figs/img139.png]] and the next
|
||||
edges are given by next[...first[[figs/img139.png]]...]. Therefore, the vectors first and next represent
|
||||
the linked lists of edges of each vertex. As there are three vertices for each edge,
|
||||
when an edge is inserted in the 3-graph, it must be inserted in the three linked lists
|
||||
of the vertices in its composition. Therefore, there are //3n// entries of integer
|
||||
numbers in the vector next, so it is stored in //4*3n = 12n// bytes.
|
||||
|
||||
+ **Vertices degree (vert_degree vector)**: is a vector of //cn// bytes
|
||||
that represents the degree of each vertex. We can use just one byte for each
|
||||
vertex because the 3-graph is sparse, since it has more vertices than edges.
|
||||
Therefore, the vertices degree is represented in //cn// bytes.
|
||||
|
||||
- Acyclicity test:
|
||||
+ **List of deleted edges obtained when we test whether the 3-graph is a forest (queue vector)**:
|
||||
is a vector of //n// integer numbers containing indexes of vector edges. Therefore, it
|
||||
requires //4n// bytes in internal memory.
|
||||
|
||||
+ **Marked edges in the acyclicity test (marked_edges vector)**:
|
||||
is a bit vector of //n// bits to indicate the edges that have already been deleted during
|
||||
the acyclicity test. Therefore, it requires //n/8// bytes in internal memory.
|
||||
|
||||
- MPHF description
|
||||
+ **function //g//**: is represented by a vector of //2cn// bits. Therefore, it is
|
||||
stored in //0.25cn// bytes
|
||||
+ **ranktable**: is a lookup table used to store some precomputed ranking information.
|
||||
It has //(cn)/(2^b)// entries of 4-byte integer numbers. Therefore it is stored in
|
||||
//(4cn)/(2^b)// bytes. The larger b is, the more compact the resulting MPHFs are and
|
||||
the slower the functions are. So b imposes a trade-off between space and time.
|
||||
+ **Total**: 0.25cn + (4cn)/(2^b) bytes
|
||||
|
||||
|
||||
Thus, the total memory consumption of the BDZ algorithm for generating a minimal
|
||||
perfect hash function (MPHF) is: //(28.125 + 5c)n + 0.25cn + (4cn)/(2^b) + O(1)// bytes.
|
||||
As the value of constant //c// may be larger than or equal to 1.23 we have:
|
||||
|| //c// | //b// | Memory consumption to generate an MPHF (in bytes) |
|
||||
| 1.23 | //7// | //34.62n + O(1)// |
|
||||
| 1.23 | //8// | //34.60n + O(1)// |
|
||||
|
||||
| **Table 1:** Memory consumption to generate an MPHF using the BDZ algorithm.
|
||||
|
||||
Now we present the memory consumption to store the resulting function.
|
||||
So we have:
|
||||
|| //c// | //b// | Memory consumption to store an MPHF (in bits) |
|
||||
| 1.23 | //7// | //2.77n + O(1)// |
|
||||
| 1.23 | //8// | //2.61n + O(1)// |
|
||||
|
||||
| **Table 2:** Memory consumption to store an MPHF generated by the BDZ algorithm.
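The entries of Tables 1 and 2 can be reproduced from the per-structure costs listed above. The sketch below simply adds them up per key (the //O(1)// term is ignored); the function names are hypothetical.

```python
def bytes_to_generate(c, b):
    """Bytes per key needed while constructing the function."""
    graph = 4 * c + 12 + 12 + c            # first, edges, next, vert_degree
    acyclicity_test = 4 + 1 / 8            # queue, marked_edges
    description = 0.25 * c + 4 * c / 2 ** b  # g, ranktable
    return graph + acyclicity_test + description


def bits_to_store(c, b):
    """Bits per key needed to store the resulting MPHF."""
    return 8 * (0.25 * c + 4 * c / 2 ** b)


for b in (7, 8):
    print(b, round(bytes_to_generate(1.23, b), 2), round(bits_to_store(1.23, b), 2))
```

The construction cost is dominated by the 3-graph itself, which is why changing //b// barely affects Table 1 while it visibly changes Table 2.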
|
||||
----------------------------------------
|
||||
|
||||
==Experimental Results==
|
||||
|
||||
Experimental results to compare the BDZ algorithm with the other ones in the CMPH
|
||||
library are presented in Botelho, Pagh and Ziviani [[1,2 #papers]].
|
||||
----------------------------------------
|
||||
|
||||
==Papers==[papers]
|
||||
|
||||
+ [F. C. Botelho http://www.dcc.ufmg.br/~fbotelho]. [Near-Optimal Space Perfect Hashing Algorithms papers/thesis.pdf]. //PhD. Thesis//, //Department of Computer Science//, //Federal University of Minas Gerais//, September 2008. Supervised by [N. Ziviani http://www.dcc.ufmg.br/~nivio].
|
||||
|
||||
+ [F. C. Botelho http://www.dcc.ufmg.br/~fbotelho], [R. Pagh http://www.itu.dk/~pagh/], [N. Ziviani http://www.dcc.ufmg.br/~nivio]. [Simple and space-efficient minimal perfect hash functions papers/wads07.pdf]. //In Proceedings of the 10th International Workshop on Algorithms and Data Structures (WADS'07),// Springer-Verlag Lecture Notes in Computer Science, vol. 4619, Halifax, Canada, August 2007, 139-150.
|
||||
|
||||
+ B. Chazelle, J. Kilian, R. Rubinfeld, and A. Tal. The bloomier filter: An efficient data structure for static support lookup tables. //In Proceedings of the 15th annual ACM-SIAM symposium on Discrete algorithms (SODA'04)//, pages 30–39, Philadelphia, PA, USA, 2004. Society for Industrial and Applied Mathematics.
|
||||
|
||||
+ J. Ebert. A versatile data structure for edge-oriented graph algorithms. //Communications of the ACM//, (30):513–519, 1987.
|
||||
|
||||
+ K. Fredriksson and F. Nikitin. Simple compression code supporting random access and fast string matching. //In Proceedings of the 6th International Workshop on Efficient and Experimental Algorithms (WEA’07)//, pages 203–216, 2007.
|
||||
|
||||
+ R. Gonzalez and G. Navarro. Statistical encoding of succinct data structures. //In Proceedings of the 19th Annual Symposium on Combinatorial Pattern Matching (CPM’06)//, pages 294–305, 2006.
|
||||
|
||||
+ B. Jenkins. Algorithm alley: Hash functions. //Dr. Dobb's Journal of Software Tools//, 22(9), September 1997. Extended version available at [http://burtleburtle.net/bob/hash/doobs.html http://burtleburtle.net/bob/hash/doobs.html].
|
||||
|
||||
+ B.S. Majewski, N.C. Wormald, G. Havas, and Z.J. Czech. A family of perfect hashing methods. //The Computer Journal//, 39(6):547–554, 1996.
|
||||
|
||||
+ D. Okanohara and K. Sadakane. Practical entropy-compressed rank/select dictionary. //In Proceedings of the Workshop on Algorithm Engineering and Experiments (ALENEX’07)//, 2007.
|
||||
|
||||
+ [R. Pagh http://www.itu.dk/~pagh/]. Low redundancy in static dictionaries with constant query time. //SIAM Journal on Computing//, 31(2):353–363, 2001.
|
||||
|
||||
+ R. Raman, V. Raman, and S. S. Rao. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. //In Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms (SODA’02)//, pages 233–242, Philadelphia PA, USA, 2002. Society for Industrial and Applied Mathematics.
|
||||
|
||||
+ K. Sadakane and R. Grossi. Squeezing succinct data structures into entropy bounds. //In Proceedings of the 17th annual ACM-SIAM symposium on Discrete algorithms (SODA’06)//, pages 1230–1239, 2006.
|
||||
|
||||
|
||||
%!include: ALGORITHMS.t2t
|
||||
|
||||
%!include: FOOTER.t2t
|
||||
|
||||
%!include(html): ''GOOGLEANALYTICS.t2t''
|
|
@ -0,0 +1,405 @@
|
|||
BMZ Algorithm
|
||||
|
||||
|
||||
%!includeconf: CONFIG.t2t
|
||||
|
||||
----------------------------------------
|
||||
==History==
|
||||
|
||||
At the end of 2003, professor [Nivio Ziviani http://www.dcc.ufmg.br/~nivio] was
|
||||
finishing the second edition of his [book http://www.dcc.ufmg.br/algoritmos/].
|
||||
During the [book http://www.dcc.ufmg.br/algoritmos/] writing,
|
||||
professor [Nivio Ziviani http://www.dcc.ufmg.br/~nivio] studied the problem of generating
|
||||
[minimal perfect hash functions concepts.html]
|
||||
(if you are not familiar with this problem, see [[1 #papers]][[2 #papers]]).
|
||||
Professor [Nivio Ziviani http://www.dcc.ufmg.br/~nivio] coded a modified version of
|
||||
the [CHM algorithm chm.html], which was proposed by
|
||||
Czech, Havas and Majewski, and put it in his [book http://www.dcc.ufmg.br/algoritmos/].
|
||||
The [CHM algorithm chm.html] is based on acyclic random graphs to generate
|
||||
[order preserving minimal perfect hash functions concepts.html] in linear time.
|
||||
Professor [Nivio Ziviani http://www.dcc.ufmg.br/~nivio]
|
||||
asked himself: why must the random graph
|
||||
be acyclic? In the modified version available in his [book http://www.dcc.ufmg.br/algoritmos/] he got rid of this restriction.
|
||||
|
||||
The modification presented a problem: it was impossible to generate minimal perfect hash functions
|
||||
for sets with more than 1000 keys.
|
||||
At the same time, [Fabiano C. Botelho http://www.dcc.ufmg.br/~fbotelho],
|
||||
a master's degree student at the [Department of Computer Science http://www.dcc.ufmg.br] of the
|
||||
[Federal University of Minas Gerais http://www.ufmg.br],
|
||||
started to be advised by [Nivio Ziviani http://www.dcc.ufmg.br/~nivio] who presented the problem
|
||||
to [Fabiano http://www.dcc.ufmg.br/~fbotelho].
|
||||
|
||||
During the master's program, [Fabiano http://www.dcc.ufmg.br/~fbotelho] and
|
||||
[Nivio Ziviani http://www.dcc.ufmg.br/~nivio] faced lots of problems.
|
||||
In April 2004, [Fabiano http://www.dcc.ufmg.br/~fbotelho] was talking with a
|
||||
friend of his (David Menoti) about the problems
|
||||
and many ideas appeared.
|
||||
The ideas were implemented and a very fast algorithm to generate
|
||||
minimal perfect hash functions was designed.
|
||||
We refer to the algorithm as **BMZ**, because it was conceived by Fabiano C. **B**otelho,
|
||||
David **M**enoti and Nivio **Z**iviani. The algorithm is described in [[1 #papers]].
|
||||
To analyse the BMZ algorithm we needed some results from random graph theory, so
|
||||
we invited professor [Yoshiharu Kohayakawa http://www.ime.usp.br/~yoshi] to help us.
|
||||
The final description and analysis of BMZ algorithm is presented in [[2 #papers]].
|
||||
|
||||
----------------------------------------
|
||||
|
||||
==The Algorithm==
|
||||
|
||||
The BMZ algorithm shares several features with the [CHM algorithm chm.html].
|
||||
In particular, BMZ algorithm is also
|
||||
based on the generation of random graphs [figs/img27.png], where [figs/img28.png] is in
|
||||
one-to-one correspondence with the key set [figs/img20.png] for which we wish to
|
||||
generate a [minimal perfect hash function concepts.html].
|
||||
The two main differences between BMZ algorithm and CHM algorithm
|
||||
are as follows: (//i//) BMZ algorithm generates random
|
||||
graphs [figs/img27.png] with [figs/img29.png] and [figs/img30.png], where [figs/img31.png],
|
||||
and hence [figs/img32.png] necessarily contains cycles,
|
||||
while CHM algorithm generates //acyclic// random
|
||||
graphs [figs/img27.png] with [figs/img29.png] and [figs/img30.png],
|
||||
with a greater number of vertices: [figs/img33.png];
|
||||
(//ii//) CHM algorithm generates [order preserving minimal perfect hash functions concepts.html]
|
||||
while BMZ algorithm does not preserve order. Thus, BMZ algorithm improves
|
||||
the space requirement at the expense of generating functions that are not
|
||||
order preserving.
|
||||
|
||||
Suppose [figs/img14.png] is a universe of //keys//.
|
||||
Let [figs/img17.png] be a set of [figs/img8.png] keys from [figs/img14.png].
|
||||
Let us show how the BMZ algorithm constructs a minimal perfect hash function [figs/img7.png].
|
||||
We make use of two auxiliary random functions [figs/img41.png] and [figs/img55.png],
|
||||
where [figs/img56.png] for some suitably chosen integer [figs/img57.png],
|
||||
where [figs/img58.png]. We build a random graph [figs/img59.png] on [figs/img60.png],
|
||||
whose edge set is [figs/img61.png]. There is an edge in [figs/img32.png] for each
|
||||
key in the set of keys [figs/img20.png].
|
||||
|
||||
In what follows, we shall be interested in the //2-core// of
|
||||
the random graph [figs/img32.png], that is, the maximal subgraph
|
||||
of [figs/img32.png] with minimal degree at
|
||||
least 2 (see [[2 #papers]] for details).
|
||||
Because of its importance in our context, we call the 2-core the
|
||||
//critical// subgraph of [figs/img32.png] and denote it by [figs/img63.png].
|
||||
The vertices and edges in [figs/img63.png] are said to be //critical//.
|
||||
We let [figs/img64.png] and [figs/img65.png].
|
||||
Moreover, we let [figs/img66.png] be the set of //non-critical//
|
||||
vertices in [figs/img32.png].
|
||||
We also let [figs/img67.png] be the set of all critical
|
||||
vertices that have at least one non-critical vertex as a neighbour.
|
||||
Let [figs/img68.png] be the set of //non-critical// edges in [figs/img32.png].
|
||||
Finally, we let [figs/img69.png] be the //non-critical// subgraph
|
||||
of [figs/img32.png].
|
||||
The non-critical subgraph [figs/img70.png] corresponds to the //acyclic part//
|
||||
of [figs/img32.png].
|
||||
We have [figs/img71.png].
|
||||
|
||||
We then construct a suitable labelling [figs/img72.png] of the vertices
|
||||
of [figs/img32.png]: we choose [figs/img73.png] for each [figs/img74.png] in such
|
||||
a way that [figs/img75.png] ([figs/img18.png]) is a
|
||||
minimal perfect hash function for [figs/img20.png].
|
||||
This labelling [figs/img37.png] can be found in linear time
|
||||
if the number of edges in [figs/img63.png] is at most [figs/img76.png] (see [[2 #papers]]
|
||||
for details).
|
||||
|
||||
Figure 1 presents pseudocode for the BMZ algorithm.
|
||||
The procedure BMZ ([figs/img20.png], [figs/img37.png]) receives as input the set of
|
||||
keys [figs/img20.png] and produces the labelling [figs/img37.png].
|
||||
The method uses a mapping, ordering and searching approach.
|
||||
We now describe each step.
|
||||
| procedure BMZ ([figs/img20.png], [figs/img37.png])
|
||||
| Mapping ([figs/img20.png], [figs/img32.png]);
|
||||
| Ordering ([figs/img32.png], [figs/img63.png], [figs/img70.png]);
|
||||
| Searching ([figs/img32.png], [figs/img63.png], [figs/img70.png], [figs/img37.png]);
|
||||
| **Figure 1**: Main steps of BMZ algorithm for constructing a minimal perfect hash function
|
||||
|
||||
----------------------------------------
|
||||
|
||||
===Mapping Step===
|
||||
|
||||
The procedure Mapping ([figs/img20.png], [figs/img32.png]) receives as input the set
|
||||
of keys [figs/img20.png] and generates the random graph [figs/img59.png], by generating
|
||||
two auxiliary functions [figs/img41.png], [figs/img78.png].
|
||||
|
||||
The functions [figs/img41.png] and [figs/img42.png] are constructed as follows.
|
||||
We impose some upper bound [figs/img79.png] on the lengths of the keys in [figs/img20.png].
|
||||
To define [figs/img80.png] ([figs/img81.png], [figs/img62.png]), we generate
|
||||
an [figs/img82.png] table of random integers [figs/img83.png].
|
||||
For a key [figs/img18.png] of length [figs/img84.png] and [figs/img85.png], we let
|
||||
|
||||
| [figs/img86.png]
|
||||
|
||||
The random graph [figs/img59.png] has vertex set [figs/img56.png] and
|
||||
edge set [figs/img61.png]. We need [figs/img32.png] to be
|
||||
simple, i.e., [figs/img32.png] should have neither loops nor multiple edges.
|
||||
A loop occurs when [figs/img87.png] for some [figs/img18.png].
|
||||
We solve this in an ad hoc manner: we simply let [figs/img88.png] in this case.
|
||||
If we still find a loop after this, we generate another pair [figs/img89.png].
|
||||
When a multiple edge occurs we abort and generate a new pair [figs/img89.png].
|
||||
Although the function above causes [collisions concepts.html] with probability //1/t//,
|
||||
in [cmph library index.html] we use faster hash
|
||||
functions ([DJB2 hash http://www.cs.yorku.ca/~oz/hash.html], [FNV hash http://www.isthe.com/chongo/tech/comp/fnv/],
|
||||
[Jenkins hash http://burtleburtle.net/bob/hash/doobs.html] and [SDBM hash http://www.cs.yorku.ca/~oz/hash.html])
|
||||
in which we do not need to impose any upper bound [figs/img79.png] on the lengths of the keys in [figs/img20.png].
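The mapping step can be sketched as a retry loop that keeps drawing a pair of hash functions until the resulting graph is simple. Several simplifications are assumptions of this example: BLAKE2b from `hashlib` stands in for the hash functions listed above, the number of vertices is taken as //t = cn// with //c = 1.15//, and a loop triggers a new pair directly instead of first applying the ad hoc repair mentioned in the text. The names `hash_pair` and `mapping` are hypothetical.

```python
import hashlib
import math


def hash_pair(key, t, seed):
    """Two hash values in [0, t-1] derived from one seeded digest."""
    digest = hashlib.blake2b(
        key.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    h1 = int.from_bytes(digest[:4], "little") % t
    h2 = int.from_bytes(digest[4:], "little") % t
    return h1, h2


def mapping(keys, c=1.15, max_tries=100):
    """Return (seed, t, edges), where edges maps each edge (u, v) to its key."""
    t = max(2, math.ceil(c * len(keys)))
    for seed in range(max_tries):
        edges = {}
        for key in keys:
            u, v = hash_pair(key, t, seed)
            edge = (min(u, v), max(u, v))
            if u == v or edge in edges:
                break  # loop or multiple edge: try another pair of functions
            edges[edge] = key
        else:
            return seed, t, edges
    raise RuntimeError("no simple graph found; increase max_tries")


keys = ["who", "band", "the", "hash", "graph", "edge", "vertex", "key"]
seed, t, edges = mapping(keys)
```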
|
||||
|
||||
As mentioned before, for us to find the labelling [figs/img72.png] of the
|
||||
vertices of [figs/img59.png] in linear time,
|
||||
we require that [figs/img108.png].
|
||||
The crucial step now is to determine the value
|
||||
of [figs/img1.png] (in [figs/img57.png]) to obtain a random
|
||||
graph [figs/img71.png] with [figs/img109.png].
|
||||
Botelho, Menoti and Ziviani determined empirically in [[1 #papers]] that
|
||||
the value of [figs/img1.png] is //1.15//. This value is remarkably
|
||||
close to the theoretical value determined in [[2 #papers]],
|
||||
which is around [figs/img112.png].
|
||||
|
||||
----------------------------------------
|
||||
|
||||
===Ordering Step===
|
||||
|
||||
The procedure Ordering ([figs/img32.png], [figs/img63.png], [figs/img70.png]) receives
|
||||
as input the graph [figs/img32.png] and partitions [figs/img32.png] into the two
|
||||
subgraphs [figs/img63.png] and [figs/img70.png], so that [figs/img71.png].
|
||||
|
||||
Figure 2 presents a sample graph with 9 vertices
|
||||
and 8 edges, where the degree of a vertex is shown beside each vertex.
|
||||
Initially, all vertices with degree 1 are added to a queue [figs/img136.png].
|
||||
For the example shown in Figure 2(a), [figs/img137.png] after the initialization step.
|
||||
|
||||
| [figs/img138.png]
|
||||
| **Figure 2:** Ordering step for a graph with 9 vertices and 8 edges.
|
||||
|
||||
Next, we remove one vertex [figs/img139.png] from the queue, decrement its degree and
|
||||
the degree of the vertices with degree greater than 0 in the adjacency
|
||||
list of [figs/img139.png], as depicted in Figure 2(b) for [figs/img140.png].
|
||||
At this point, the neighbours of [figs/img139.png] with degree 1 are
|
||||
inserted into the queue, such as vertex 1.
|
||||
This process is repeated until the queue becomes empty.
|
||||
All vertices with degree 0 are non-critical vertices and the others are
|
||||
critical vertices, as depicted in Figure 2(c).
|
||||
Finally, to determine the vertices in [figs/img141.png] we collect all
|
||||
vertices [figs/img142.png] with at least one vertex [figs/img143.png] that
|
||||
is in Adj[figs/img144.png] and in [figs/img145.png], such as vertex 8 in Figure 2(c).
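The ordering step can be sketched as follows. The graph used here is a hypothetical one (a triangle with a two-edge tail), not the graph of Figure 2, and the function name `ordering` and the returned sets are assumptions made for the illustration.

```python
from collections import deque


def ordering(num_vertices, edges):
    """Split the vertices into critical (2-core) and non-critical ones.

    Also returns the critical vertices that have a non-critical neighbour.
    """
    adjacency = [[] for _ in range(num_vertices)]
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)

    degree = [len(neighbours) for neighbours in adjacency]
    queue = deque(v for v in range(num_vertices) if degree[v] == 1)

    while queue:
        v = queue.popleft()
        if degree[v] != 1:
            continue  # already reduced to degree 0 by a neighbour
        degree[v] = 0
        for w in adjacency[v]:
            if degree[w] > 0:
                degree[w] -= 1
                if degree[w] == 1:
                    queue.append(w)

    critical = {v for v in range(num_vertices) if degree[v] > 0}
    non_critical = set(range(num_vertices)) - critical
    border = {v for v in critical
              if any(w in non_critical for w in adjacency[v])}
    return critical, non_critical, border


# Triangle 0-1-2 with a tail 2-3-4 attached to vertex 2.
edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4)]
critical, non_critical, border = ordering(5, edges)
```

Here the triangle survives as the critical subgraph, the tail is peeled away, and vertex 2 is the only critical vertex adjacent to a non-critical one.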
|
||||
|
||||
----------------------------------------
|
||||
|
||||
===Searching Step===
|
||||
|
||||
In the searching step, the key part is
|
||||
the //perfect assignment problem//: find [figs/img153.png] such that
|
||||
the function [figs/img154.png] defined by
|
||||
|
||||
| [figs/img155.png]
|
||||
|
||||
is a bijection from [figs/img156.png] to [figs/img157.png] (recall [figs/img158.png]).
|
||||
We are interested in a labelling [figs/img72.png] of
|
||||
the vertices of the graph [figs/img59.png] with
|
||||
the property that if [figs/img11.png] and [figs/img22.png] are keys
|
||||
in [figs/img20.png], then [figs/img159.png]; that is, if we associate
|
||||
to each edge the sum of the labels on its endpoints, then these values
|
||||
should be all distinct.
|
||||
Moreover, we require that all the sums [figs/img160.png] ([figs/img18.png])
|
||||
fall between [figs/img115.png] and [figs/img161.png], and thus we have a bijection
|
||||
between [figs/img20.png] and [figs/img157.png].
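
The requirement stated above (all edge sums distinct and covering exactly the addresses 0 to n-1) is easy to check mechanically. The helper below is a minimal sketch; the name ``is_perfect_assignment`` and the list-based representation of the labelling are assumptions made for illustration.

```python
def is_perfect_assignment(g, edges):
    """Return True if the sums g[u] + g[v] over all edges are exactly 0..n-1.

    g     -- list of vertex labels, indexed by vertex
    edges -- list of (u, v) pairs, one per key
    """
    sums = sorted(g[u] + g[v] for u, v in edges)
    return sums == list(range(len(edges)))

# Hypothetical example: the path 0-1-2 has two edges (two keys).
path_edges = [(0, 1), (1, 2)]
good = is_perfect_assignment([0, 0, 1], path_edges)  # sums are 0 and 1
bad = is_perfect_assignment([0, 0, 0], path_edges)   # both sums are 0
```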
|
||||
|
||||
The procedure Searching ([figs/img32.png], [figs/img63.png], [figs/img70.png], [figs/img37.png])
|
||||
receives as input [figs/img32.png], [figs/img63.png], [figs/img70.png] and finds a
|
||||
suitable [figs/img162.png] bit value for each vertex [figs/img74.png], stored in the
|
||||
array [figs/img37.png].
|
||||
This step is first performed for the vertices in the
|
||||
critical subgraph [figs/img63.png] of [figs/img32.png] (the 2-core of [figs/img32.png])
|
||||
and then it is performed for the vertices in [figs/img70.png] (the non-critical subgraph
|
||||
of [figs/img32.png] that contains the "acyclic part" of [figs/img32.png]).
|
||||
The reason the assignment of the [figs/img37.png] values is first
|
||||
performed on the vertices in [figs/img63.png] is to resolve reassignments
|
||||
as early as possible (such reassignments are consequences of the cycles
|
||||
in [figs/img63.png] and are illustrated below).
|
||||
|
||||
----------------------------------------
|
||||
|
||||
====Assignment of Values to Critical Vertices====
|
||||
|
||||
The labels [figs/img73.png] ([figs/img142.png])
|
||||
are assigned in increasing order following a greedy
|
||||
strategy where the critical vertices [figs/img139.png] are considered one at a time,
|
||||
according to a breadth-first search on [figs/img63.png].
|
||||
If a candidate value [figs/img11.png] for [figs/img73.png] is forbidden
|
||||
because setting [figs/img163.png] would create two edges with the same sum,
|
||||
we try [figs/img164.png] for [figs/img73.png]. This event is referred to
|
||||
as a //reassignment//.
|
||||
|
||||
Let [figs/img165.png] be the set of addresses assigned to edges in [figs/img166.png].
|
||||
Initially [figs/img167.png].
|
||||
Let [figs/img11.png] be a candidate value for [figs/img73.png].
|
||||
Initially [figs/img168.png].
|
||||
Considering the subgraph [figs/img63.png] in Figure 2(c),
|
||||
a step by step example of the assignment of values to vertices in [figs/img63.png] is
|
||||
presented in Figure 3.
|
||||
Initially, a vertex [figs/img139.png] is chosen, the assignment [figs/img163.png] is made
|
||||
and [figs/img11.png] is set to [figs/img164.png].
|
||||
For example, suppose that vertex [figs/img169.png] in Figure 3(a) is
|
||||
chosen, the assignment [figs/img170.png] is made and [figs/img11.png] is set to [figs/img96.png].
|
||||
|
||||
| [figs/img171.png]
|
||||
| **Figure 3:** Example of the assignment of values to critical vertices.
|
||||
|
||||
In Figure 3(b), following the adjacency list of vertex [figs/img169.png],
|
||||
the unassigned vertex [figs/img115.png] is reached.
|
||||
At this point, we collect in the temporary variable [figs/img172.png] all adjacencies
|
||||
of vertex [figs/img115.png] that have been assigned an [figs/img11.png] value,
|
||||
and [figs/img173.png].
|
||||
Next, for all [figs/img174.png], we check if [figs/img175.png].
|
||||
Since [figs/img176.png], then [figs/img177.png] is set
|
||||
to [figs/img96.png], [figs/img11.png] is incremented
|
||||
by 1 (now [figs/img178.png]) and [figs/img179.png].
|
||||
Next, vertex [figs/img180.png] is reached, [figs/img181.png] is set
|
||||
to [figs/img62.png], [figs/img11.png] is set to [figs/img180.png] and [figs/img182.png].
|
||||
Next, vertex [figs/img183.png] is reached and [figs/img184.png].
|
||||
Since [figs/img185.png] and [figs/img186.png], then [figs/img187.png] is
|
||||
set to [figs/img180.png], [figs/img11.png] is set to [figs/img183.png] and [figs/img188.png].
|
||||
Finally, vertex [figs/img189.png] is reached and [figs/img190.png].
|
||||
Since [figs/img191.png], [figs/img11.png] is incremented by 1 and set to 5, as depicted in
|
||||
Figure 3(c).
|
||||
Since [figs/img192.png], [figs/img11.png] is again incremented by 1 and set to 6,
|
||||
as depicted in Figure 3(d).
|
||||
These two reassignments are indicated by the arrows in Figure 3.
|
||||
Since [figs/img193.png] and [figs/img194.png], then [figs/img195.png] is set
|
||||
to [figs/img196.png] and [figs/img197.png]. This finishes the algorithm.
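
The greedy strategy walked through above can be sketched as follows. This is a simplified illustration, not the implementation shipped with the library: the function name ``assign_critical``, the dictionary-based adjacency lists and the example graph (two triangles sharing vertex 0) are assumptions. The sketch only enforces that edge sums are distinct; it does not check that they stay below the number of edges.

```python
from collections import deque

def assign_critical(adj, critical):
    """Label critical vertices in breadth-first order with increasing values.

    adj      -- dict mapping each vertex to a list of its neighbors
    critical -- set of critical vertices
    Returns (g, assigned_edges): the labels and the set of edge sums in use.
    """
    g = {}
    assigned_edges = set()
    x = 0  # current candidate value
    for root in sorted(critical):
        if root in g:
            continue
        g[root] = x
        x += 1
        queue = deque([root])
        while queue:
            v = queue.popleft()
            for u in adj[v]:
                if u not in critical or u in g:
                    continue
                # Neighbors of u that already carry a label.
                labelled = [w for w in adj[u] if w in g]
                # Reassignment: skip candidates that would repeat an edge sum.
                while any(x + g[w] in assigned_edges for w in labelled):
                    x += 1
                g[u] = x
                assigned_edges.update(x + g[w] for w in labelled)
                x += 1
                queue.append(u)
    return g, assigned_edges

# Hypothetical example: triangles 0-1-2 and 0-3-4 sharing vertex 0.
adj = {0: [1, 2, 3, 4], 1: [0, 2], 2: [0, 1], 3: [0, 4], 4: [0, 3]}
g, assigned = assign_critical(adj, {0, 1, 2, 3, 4})
```

In this example the candidate 3 for vertex 3 is rejected because the sum 3 + g(0) = 3 is already taken by the edge between vertices 1 and 2, so vertex 3 receives 4 instead. That is one reassignment.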
|
||||
|
||||
----------------------------------------
|
||||
|
||||
====Assignment of Values to Non-Critical Vertices====
|
||||
|
||||
As [figs/img70.png] is acyclic, we can impose the order in which addresses are
|
||||
associated with edges in [figs/img70.png], making this step simple to solve
|
||||
by a standard depth-first search algorithm.
|
||||
Therefore, in the assignment of values to vertices in [figs/img70.png] we
|
||||
benefit from the unused addresses in the gaps left by the assignment of values
|
||||
to vertices in [figs/img63.png].
|
||||
For that, we start the depth-first search from the vertices in [figs/img141.png] because
|
||||
the [figs/img37.png] values for these critical vertices were already assigned
|
||||
and cannot be changed.
|
||||
|
||||
Considering the subgraph [figs/img70.png] in Figure 2(c),
|
||||
a step by step example of the assignment of values to vertices in [figs/img70.png] is
|
||||
presented in Figure 4.
|
||||
Figure 4(a) presents the initial state of the algorithm.
|
||||
The critical vertex 8 is the only one that has non-critical
|
||||
neighbors.
|
||||
In the example presented in Figure 3, the addresses [figs/img198.png] were not used.
|
||||
So, taking the first unused address [figs/img115.png] and the vertex [figs/img96.png],
|
||||
which is reached from the vertex [figs/img169.png], [figs/img199.png] is set
|
||||
to [figs/img200.png], as shown in Figure 4(b).
|
||||
The only vertex that is reached from vertex [figs/img96.png] is vertex [figs/img62.png], so
|
||||
taking the unused address [figs/img183.png] we set [figs/img201.png] to [figs/img202.png],
|
||||
as shown in Figure 4(c).
|
||||
This process is repeated until the UnAssignedAddresses list becomes empty.
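
A minimal sketch of this step is given below, under several assumptions: the names ``assign_non_critical``, ``roots`` and ``unused_addresses`` are hypothetical, a stack-based traversal stands in for the depth-first search, and labels are plain Python integers that may become negative (a fixed-width implementation would have to deal with that, which is not shown here). The example reuses a triangle 0-1-2 with a tail 2-3-4, where the triangle is assumed to be already labelled.

```python
def assign_non_critical(adj, g, roots, unused_addresses):
    """Label non-critical vertices so that each tree edge gets an unused address.

    adj              -- dict mapping each vertex to a list of its neighbors
    g                -- labels already fixed for the critical vertices
    roots            -- critical vertices that have non-critical neighbors
    unused_addresses -- addresses not consumed by the critical edges
    """
    g = dict(g)  # do not modify the caller's labels
    free = iter(sorted(unused_addresses))
    stack = list(roots)
    while stack:
        v = stack.pop()
        for u in adj[v]:
            if u in g:
                continue
            address = next(free)
            g[u] = address - g[v]  # forces g[u] + g[v] == address
            stack.append(u)
    return g

# Hypothetical example: critical triangle 0-1-2 uses addresses 1, 2 and 3,
# so addresses 0 and 4 are left for the tail edges (2,3) and (3,4).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
labels = assign_non_critical(adj, {0: 0, 1: 1, 2: 2}, [2], {0, 4})
edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4)]
addresses = sorted(labels[u] + labels[v] for u, v in edges)
```

Because the non-critical part is acyclic, each of its vertices is reached through exactly one edge, so the address of that edge can be chosen freely.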
|
||||
|
||||
| [figs/img203.png]
|
||||
| **Figure 4:** Example of the assignment of values to non-critical vertices.
|
||||
|
||||
----------------------------------------
|
||||
|
||||
==The Heuristic==[heuristic]
|
||||
|
||||
We now present a heuristic for the BMZ algorithm that
|
||||
reduces the value of [figs/img1.png] to any given value between //1.15// and //0.93//.
|
||||
This reduces the space requirement to store the resulting function
|
||||
to any given value between [figs/img12.png] words and [figs/img13.png] words.
|
||||
The heuristic reuses, when possible, the set
|
||||
of [figs/img11.png] values that caused reassignments, just before
|
||||
trying [figs/img164.png].
|
||||
Decreasing the value of [figs/img1.png] leads to an increase in the number of
|
||||
iterations to generate [figs/img32.png].
|
||||
For example, for [figs/img244.png] and [figs/img6.png], the analytically expected numbers
|
||||
of iterations are [figs/img245.png] and [figs/img246.png], respectively (see [[2 #papers]]
|
||||
for details),
|
||||
while for [figs/img128.png] the same value is around //2.13//.
|
||||
|
||||
----------------------------------------
|
||||
|
||||
==Memory Consumption==
|
||||
|
||||
Now we detail the memory consumption to generate and to store minimal perfect hash functions
|
||||
using the BMZ algorithm. The structures responsible for memory consumption are the
|
||||
following:
|
||||
- Graph:
|
||||
+ **first**: is a vector that stores //cn// integer numbers, each one representing
|
||||
the first edge (index in the vector edges) in the list of
|
||||
edges of each vertex.
|
||||
The integer numbers are 4 bytes long. Therefore,
|
||||
the vector first is stored in //4cn// bytes.
|
||||
|
||||
+ **edges**: is a vector to represent the edges of the graph. As each edge
|
||||
is composed of a pair of vertices, each entry stores two integer numbers
|
||||
of 4 bytes that represent the vertices. As there are //n// edges, the
|
||||
vector edges is stored in //8n// bytes.
|
||||
|
||||
+ **next**: given a vertex [figs/img139.png], we can discover the edges that
|
||||
contain [figs/img139.png] following its list of edges,
|
||||
which starts on first[[figs/img139.png]] and the next
|
||||
edges are given by next[...first[[figs/img139.png]]...]. Therefore, the vectors first and next represent
|
||||
the linked lists of edges of each vertex. As there are two vertices for each edge,
|
||||
when an edge is inserted in the graph, it must be inserted in the two linked lists
|
||||
of the vertices in its composition. Therefore, there are //2n// entries of integer
|
||||
numbers in the vector next, so it is stored in //4*2n = 8n// bytes.
|
||||
|
||||
+ **critical vertices (critical_nodes vector)**: is a vector of //cn// bits,
|
||||
where each bit indicates if a vertex is critical (1) or non-critical (0).
|
||||
Therefore, the critical and non-critical vertices are represented in //cn/8// bytes.
|
||||
|
||||
+ **critical edges (used_edges vector)**: is a vector of //n// bits, where each
|
||||
bit indicates if an edge is critical (1) or non-critical (0). Therefore, the
|
||||
critical and non-critical edges are represented in //n/8// bytes.
|
||||
|
||||
- Other auxiliary structures
|
||||
+ **queue**: is a queue of integer numbers used in the breadth-first search of the
|
||||
assignment of values to critical vertices. There is an entry in the queue for
|
||||
each two critical vertices. Let [figs/img110.png] be the expected number of critical
|
||||
vertices. Therefore, the queue is stored in //4*0.5*[figs/img110.png]=2[figs/img110.png]// bytes.
|
||||
|
||||
+ **visited**: is a vector of //cn// bits, where each bit indicates if the g value of
|
||||
a given vertex was already defined. Therefore, the vector visited is stored
|
||||
in //cn/8// bytes.
|
||||
|
||||
+ **function //g//**: is represented by a vector of //cn// integer numbers.
|
||||
As each integer number is 4 bytes long, the function //g// is stored in
|
||||
//4cn// bytes.
|
||||
|
||||
|
||||
Thus, the total memory consumption of the BMZ algorithm for generating a minimal
|
||||
perfect hash function (MPHF) is: //(8.25c + 16.125)n +2[figs/img110.png] + O(1)// bytes.
|
||||
Taking the constant //c// to be 0.93 or 1.15, we have:
|
||||
|| //c// | [figs/img110.png] | Memory consumption to generate a MPHF |
|
||||
| 0.93 | //0.497n// | //24.80n + O(1)// |
|
||||
| 1.15 | //0.401n// | //26.42n + O(1)// |
|
||||
|
||||
| **Table 1:** Memory consumption to generate a MPHF using the BMZ algorithm.
|
||||
|
||||
The values of [figs/img110.png] were calculated using Eq.(1) presented in [[2 #papers]].
|
||||
|
||||
Now we present the memory consumption to store the resulting function.
|
||||
We only need to store the //g// function. Thus, we need //4cn// bytes.
|
||||
Again we have:
|
||||
|| //c// | Memory consumption to store a MPHF |
|
||||
| 0.93 | //3.72n// |
|
||||
| 1.15 | //4.60n// |
|
||||
|
||||
| **Table 2:** Memory consumption to store a MPHF generated by the BMZ algorithm.
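
The totals in Tables 1 and 2 can be reproduced from the per-structure sizes listed above. The sketch below expresses them in bytes per key; the function names are hypothetical, and ``critical_fraction`` stands for the expected number of critical vertices divided by //n// (0.497 and 0.401 in Table 1).

```python
def construction_bytes_per_key(c, critical_fraction):
    """Bytes per key needed while building the MPHF (O(1) term ignored)."""
    first = 4 * c             # one 4-byte integer per vertex
    edges = 8                 # two 4-byte integers per edge
    next_ = 8                 # two 4-byte list links per edge
    critical_nodes = c / 8    # one bit per vertex
    used_edges = 1 / 8        # one bit per edge
    queue = 2 * critical_fraction
    visited = c / 8           # one bit per vertex
    g = 4 * c                 # one 4-byte integer per vertex
    return (first + edges + next_ + critical_nodes
            + used_edges + queue + visited + g)

def storage_bytes_per_key(c):
    """Bytes per key needed to store the resulting function (the g vector)."""
    return 4 * c

low = construction_bytes_per_key(0.93, 0.497)   # about 24.79
high = construction_bytes_per_key(1.15, 0.401)  # about 26.41
```

The computed values agree with Table 1 to within about 0.01 bytes per key; the small difference presumably comes from rounding in the tabulated number of critical vertices.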
|
||||
----------------------------------------
|
||||
|
||||
==Experimental Results==
|
||||
|
||||
[CHM x BMZ comparison.html]
|
||||
|
||||
----------------------------------------
|
||||
|
||||
==Papers==[papers]
|
||||
|
||||
+ [F. C. Botelho http://www.dcc.ufmg.br/~fbotelho], D. Menoti, [N. Ziviani http://www.dcc.ufmg.br/~nivio]. [A New algorithm for constructing minimal perfect hash functions papers/bmz_tr004_04.ps], Technical Report TR004/04, Department of Computer Science, Federal University of Minas Gerais, 2004.
|
||||
|
||||
+ [F. C. Botelho http://www.dcc.ufmg.br/~fbotelho], Y. Kohayakawa, and [N. Ziviani http://www.dcc.ufmg.br/~nivio]. [A Practical Minimal Perfect Hashing Method papers/wea05.pdf]. //4th International Workshop on Efficient and Experimental Algorithms (WEA05),// Springer-Verlag Lecture Notes in Computer Science, vol. 3505, Santorini Island, Greece, May 2005, 488-500.
|
||||
|
||||
|
||||
%!include: ALGORITHMS.t2t
|
||||
|
||||
%!include: FOOTER.t2t
|
||||
|
||||
%!include(html): ''GOOGLEANALYTICS.t2t''
|
|
@ -0,0 +1,440 @@
|
|||
External Memory Based Algorithm
|
||||
|
||||
|
||||
%!includeconf: CONFIG.t2t
|
||||
|
||||
----------------------------------------
|
||||
==Introduction==
|
||||
|
||||
Until now, because of the limitations of current algorithms,
|
||||
the use of MPHFs has been restricted to scenarios where the set of keys being hashed is
|
||||
relatively small.
|
||||
However, in many cases it is crucial to deal in an efficient way with very large
|
||||
sets of keys.
|
||||
Due to the exponential growth of the Web, working with huge collections is becoming
|
||||
a daily task.
|
||||
For instance, the simple assignment of numeric identifiers to web pages of a collection
|
||||
can be a challenging task.
|
||||
While traditional databases simply cannot handle more traffic once the working
|
||||
set of URLs does not fit in main memory anymore [[4 #papers]], the algorithm we propose here to
|
||||
construct MPHFs can easily scale to billions of entries.
|
||||
|
||||
As there are many applications for MPHFs, it is
|
||||
important to design and implement space and time efficient algorithms for
|
||||
constructing such functions.
|
||||
The attractiveness of using MPHFs depends on the following issues:
|
||||
|
||||
+ The amount of CPU time required by the algorithms for constructing MPHFs.
|
||||
|
||||
+ The space requirements of the algorithms for constructing MPHFs.
|
||||
|
||||
+ The amount of CPU time required by a MPHF for each retrieval.
|
||||
|
||||
+ The space requirements of the description of the resulting MPHFs to be used at retrieval time.
|
||||
|
||||
|
||||
We present here a novel external memory based algorithm for constructing MPHFs that
|
||||
is very efficient with respect to the four requirements mentioned above.
|
||||
First, the algorithm constructs a MPHF in time linear in the size of the key set,
|
||||
which is optimal.
|
||||
For instance, for a collection of 1 billion URLs
|
||||
collected from the web, each one 64 characters long on average, the time to construct a
|
||||
MPHF using a 2.4 gigahertz PC with 500 megabytes of available main memory
|
||||
is approximately 3 hours.
|
||||
Second, the algorithm needs a small a priori defined vector of [figs/brz/img23.png] one-byte
|
||||
entries in main memory to construct a MPHF.
|
||||
For the collection of 1 billion URLs and using [figs/brz/img4.png], the algorithm needs only
|
||||
5.45 megabytes of internal memory.
|
||||
Third, the evaluation of the MPHF for each retrieval requires three memory accesses and
|
||||
the computation of three universal hash functions.
|
||||
This is not optimal as any MPHF requires at least one memory access and the computation
|
||||
of two universal hash functions.
|
||||
Fourth, the description of a MPHF takes a constant number of bits for each key, which is optimal.
|
||||
For the collection of 1 billion URLs, it needs 8.1 bits for each key,
|
||||
while the theoretical lower bound is [figs/brz/img24.png] bits per key.
|
||||
|
||||
----------------------------------------
|
||||
|
||||
|
||||
==The Algorithm==
|
||||
|
||||
The main idea supporting our algorithm is the classical divide and conquer technique.
|
||||
The algorithm is a two-step external memory based algorithm
|
||||
that generates a MPHF //h// for a set //S// of //n// keys.
|
||||
Figure 1 illustrates the two steps of the
|
||||
algorithm: the partitioning step and the searching step.
|
||||
|
||||
| [figs/brz/brz.png]
|
||||
| **Figure 1:** Main steps of our algorithm.
|
||||
|
||||
The partitioning step takes a key set //S// and uses a universal hash
|
||||
function [figs/brz/img42.png] proposed by Jenkins[[5 #papers]]
|
||||
to transform each key [figs/brz/img43.png] into an integer [figs/brz/img44.png].
|
||||
Reducing [figs/brz/img44.png] modulo [figs/brz/img23.png], we partition //S//
|
||||
into [figs/brz/img23.png] buckets containing at most 256 keys in each bucket (with high
|
||||
probability).
|
||||
|
||||
The searching step generates a MPHF[figs/brz/img46.png] for each bucket //i//, [figs/brz/img47.png].
|
||||
The resulting MPHF //h(k)//, [figs/brz/img43.png], is given by
|
||||
|
||||
| [figs/brz/img49.png]
|
||||
|
||||
where [figs/brz/img50.png].
|
||||
The //i//th entry //offset[i]// of the displacement vector
|
||||
//offset//, [figs/brz/img47.png], contains the total number
|
||||
of keys in the buckets from 0 to //i-1//, that is, it gives the interval of the
|
||||
keys in the hash table addressed by the MPHF[figs/brz/img46.png]. In the following we explain
|
||||
each step in detail.
|
||||
|
||||
----------------------------------------
|
||||
|
||||
=== Partitioning step ===
|
||||
|
||||
The set //S// of //n// keys is partitioned into [figs/brz/img23.png] buckets,
|
||||
where //b// is a suitable parameter chosen to guarantee
|
||||
that each bucket has at most 256 keys with high probability
|
||||
(see [[2 #papers]] for details).
|
||||
The partitioning step works as follows:
|
||||
|
||||
| [figs/brz/img54.png]
|
||||
| **Figure 2:** Partitioning step.
|
||||
|
||||
Statement 1.1 of the **for** loop presented in Figure 2
|
||||
reads sequentially all the keys of block [figs/brz/img55.png] from disk into an internal area
|
||||
of size [figs/brz/img8.png].
|
||||
|
||||
Statement 1.2 performs an indirect bucket sort of the keys in block [figs/brz/img55.png] and
|
||||
at the same time updates the entries in the vector //size//.
|
||||
Let us briefly describe how [figs/brz/img55.png] is partitioned among
|
||||
the [figs/brz/img23.png] buckets.
|
||||
We use a local array of [figs/brz/img23.png] counters to store a
|
||||
count of how many keys from [figs/brz/img55.png] belong to each bucket.
|
||||
The pointers to the keys in each bucket //i//, [figs/brz/img47.png],
|
||||
are stored in contiguous positions in an array.
|
||||
For this we first reserve the required number of entries
|
||||
in this array of pointers using the information from the array of counters.
|
||||
Next, we place the pointers to the keys in each bucket into the respective
|
||||
reserved areas in the array (i.e., we place the pointers to the keys in bucket 0,
|
||||
followed by the pointers to the keys in bucket 1, and so on).
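
The indirect bucket sort just described is essentially a counting sort over key positions. The sketch below illustrates it under assumptions: the name ``bucket_sort_indices`` is hypothetical, list indices play the role of pointers, and ``bucket_of`` is a stand-in for the real bucket-address function.

```python
def bucket_sort_indices(keys, num_buckets, bucket_of):
    """Return (order, counts): key indices grouped by bucket, and bucket sizes.

    The keys themselves are never moved; only their indices are arranged so
    that bucket 0 comes first, then bucket 1, and so on.
    """
    addresses = [bucket_of(key) for key in keys]

    # Count how many keys of this block fall into each bucket.
    counts = [0] * num_buckets
    for i in addresses:
        counts[i] += 1

    # Reserve a contiguous area per bucket from the counters.
    start = [0] * num_buckets
    total = 0
    for i, count in enumerate(counts):
        start[i] = total
        total += count

    # Place each key index into the next free slot of its bucket's area.
    order = [None] * len(keys)
    next_free = list(start)
    for index, i in enumerate(addresses):
        order[next_free[i]] = index
        next_free[i] += 1
    return order, counts

# Hypothetical example with two buckets chosen by character-code parity.
keys = ["d", "a", "c", "b", "e"]
order, counts = bucket_sort_indices(keys, 2, lambda k: ord(k) % 2)
```

Here bucket 0 receives "d" and "b" and bucket 1 receives "a", "c" and "e", each in their original relative order.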
|
||||
|
||||
To find the bucket address of a given key
|
||||
we use the universal hash function [figs/brz/img44.png][[5 #papers]].
|
||||
Key //k// goes into bucket //i//, where
|
||||
|
||||
| [figs/brz/img57.png] (1)
|
||||
|
||||
Figure 3(a) shows a //logical// view of the [figs/brz/img23.png] buckets
|
||||
generated in the partitioning step.
|
||||
In reality, the keys belonging to each bucket are distributed among many files,
|
||||
as depicted in Figure 3(b).
|
||||
In the example of Figure 3(b), the keys in bucket 0
|
||||
appear in files 1 and //N//, the keys in bucket 1 appear in files 1, 2
|
||||
and //N//, and so on.
|
||||
|
||||
| [figs/brz/brz-partitioning.png]
|
||||
| **Figure 3:** Situation of the buckets at the end of the partitioning step: (a) Logical view (b) Physical view.
|
||||
|
||||
This scattering of the keys in the buckets could generate a performance
|
||||
problem because of the potential number of seeks
|
||||
needed to read the keys in each bucket from the //N// files on disk
|
||||
during the searching step.
|
||||
But, as we show in [[2 #papers]], the number of seeks
|
||||
can be kept small using buffering techniques.
|
||||
Considering that only the vector //size//, which has [figs/brz/img23.png] one-byte
|
||||
entries (remember that each bucket has at most 256 keys),
|
||||
must be maintained in main memory during the searching step,
|
||||
almost all main memory is available to be used as disk I/O buffer.
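
As a rough check of how small the //size// vector is, the snippet below computes its footprint. It assumes //b=175// (the value used in the experiments reported later in this document) and 1 billion keys; the function name is hypothetical.

```python
def size_vector_bytes(n, b):
    """Bytes needed for one one-byte counter per bucket, with ceil(n / b) buckets."""
    num_buckets = -(-n // b)  # integer ceiling of n / b
    return num_buckets

footprint = size_vector_bytes(10**9, 175)
footprint_mib = footprint / 2**20  # expressed in units of 2**20 bytes
```

Under these assumptions the vector has 5,714,286 entries, which is about 5.45 when expressed in units of 2^20 bytes.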
|
||||
|
||||
The last step is to compute the //offset// vector and dump it to the disk.
|
||||
We use the vector //size// to compute the
|
||||
//offset// displacement vector.
|
||||
The //offset[i]// entry contains the number of keys
|
||||
in the buckets //0, 1, ..., i-1//.
|
||||
As //size[i]// stores the number of keys
|
||||
in bucket //i//, where [figs/brz/img47.png], we have
|
||||
|
||||
| [figs/brz/img63.png]
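
The //offset// vector is an exclusive prefix sum of //size//, and the final function adds the offset of a key's bucket to the value returned by that bucket's MPHF. The sketch below shows both pieces; the names ``offsets_from_sizes`` and ``evaluate`` are hypothetical, and the toy hash function and per-bucket functions are stand-ins chosen only so that the example can be checked by hand.

```python
from itertools import accumulate

def offsets_from_sizes(size):
    """offset[i] = size[0] + ... + size[i-1], with offset[0] = 0."""
    return list(accumulate(size, initial=0))[:-1]

def evaluate(key, h0, bucket_mphfs, offset):
    """h(k) = MPHF_i(k) + offset[i], where i = h0(k) mod number of buckets."""
    i = h0(key) % len(offset)
    return bucket_mphfs[i](key) + offset[i]

# Toy example: integer keys 0..6, three buckets, identity as the hash function.
keys = list(range(7))
num_buckets = 3
size = [sum(1 for k in keys if k % num_buckets == i) for i in range(num_buckets)]
offset = offsets_from_sizes(size)
# Stand-in per-bucket MPHF: the rank of the key inside its bucket.
bucket_mphfs = [lambda k: k // num_buckets] * num_buckets
values = [evaluate(k, lambda k: k, bucket_mphfs, offset) for k in keys]
```

With bucket sizes 3, 2 and 2 the offsets are 0, 3 and 5, and the seven keys are mapped onto the seven addresses 0 to 6 without collisions.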
|
||||
|
||||
----------------------------------------
|
||||
|
||||
=== Searching step ===
|
||||
|
||||
The searching step is responsible for generating a MPHF for each
|
||||
bucket. Figure 4 presents the searching step algorithm.
|
||||
|
||||
| [figs/brz/img64.png]
|
||||
| **Figure 4:** Searching step.
|
||||
|
||||
Statement 1 of Figure 4 inserts one key from each file
|
||||
in a minimum heap //H// of size //N//.
|
||||
The order relation in //H// is given by the bucket address //i// given by
|
||||
Eq. (1).
|
||||
|
||||
Statement 2 has two important steps.
|
||||
In statement 2.1, a bucket is read from disk,
|
||||
as described below.
|
||||
In statement 2.2, a MPHF is generated for each bucket //i//, as described
|
||||
in the following.
|
||||
The description of MPHF[figs/brz/img46.png] is a vector [figs/brz/img66.png] of 8-bit integers.
|
||||
Finally, statement 2.3 writes the description [figs/brz/img66.png] of MPHF[figs/brz/img46.png] to disk.
|
||||
|
||||
----------------------------------------
|
||||
|
||||
==== Reading a bucket from disk ====
|
||||
|
||||
In this section we present the refinement of statement 2.1 of
|
||||
Figure 4.
|
||||
The algorithm to read bucket //i// from disk is presented
|
||||
in Figure 5.
|
||||
|
||||
| [figs/brz/img67.png]
|
||||
| **Figure 5:** Reading a bucket.
|
||||
|
||||
Bucket //i// is distributed among many files and the heap //H// is used to drive a
|
||||
multiway merge operation.
|
||||
In Figure 5, statement 1.1 extracts and removes triple
|
||||
//(i, j, k)// from //H//, where //i// is a minimum value in //H//.
|
||||
Statement 1.2 inserts key //k// in bucket //i//.
|
||||
Notice that the //k// in the triple //(i, j, k)// is in fact a pointer to
|
||||
the first byte of the key that is kept in contiguous positions of an array of characters
|
||||
(this array containing the keys is initialized during the heap construction
|
||||
in statement 1 of Figure 4).
|
||||
Statement 1.3 performs a seek operation in File //j// on disk for the first
|
||||
read operation and reads sequentially all keys //k// that have the same //i//
|
||||
and inserts them all in bucket //i//.
|
||||
Finally, statement 1.4 inserts in //H// the triple //(i, j, x)//,
|
||||
where //x// is the first key read from File //j// (in statement 1.3)
|
||||
that does not have the same bucket address as the previous keys.
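
The heap-driven multiway merge can be sketched as follows. This is an in-memory illustration under assumptions: each "file" is a Python list of (bucket address, key) pairs already ordered by bucket address, the name ``read_buckets`` is hypothetical, and no disk seeks or buffering are modelled.

```python
import heapq

def read_buckets(files):
    """Yield (bucket address, keys) in increasing bucket order.

    files -- list of lists of (bucket, key) pairs, each sorted by bucket
    """
    positions = [0] * len(files)
    heap = []
    # Insert the first key of each file into the heap.
    for j, entries in enumerate(files):
        if entries:
            i, key = entries[0]
            heap.append((i, j, key))
            positions[j] = 1
    heapq.heapify(heap)

    while heap:
        current = heap[0][0]
        bucket = []
        while heap and heap[0][0] == current:
            i, j, key = heapq.heappop(heap)          # extract a minimum triple
            bucket.append(key)                       # insert its key
            entries, pos = files[j], positions[j]
            while pos < len(entries) and entries[pos][0] == i:
                bucket.append(entries[pos][1])       # same bucket, same file
                pos += 1
            if pos < len(entries):                   # first key of a later bucket
                heapq.heappush(heap, (entries[pos][0], j, entries[pos][1]))
                pos += 1
            positions[j] = pos
        yield current, bucket

# Hypothetical example with three files and three buckets.
files = [
    [(0, "a"), (1, "b"), (1, "c")],
    [(1, "d"), (2, "e")],
    [(0, "f"), (2, "g")],
]
merged = list(read_buckets(files))
```

Each file contributes at most one triple to the heap at any time, so the heap never holds more than //N// entries.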
|
||||
|
||||
The number of seek operations on disk performed in statement 1.3 is discussed
|
||||
in [[2, Section 5.1 #papers]],
|
||||
where we present a buffering technique that brings down
|
||||
the time spent with seeks.
|
||||
|
||||
----------------------------------------
|
||||
|
||||
==== Generating a MPHF for each bucket ====
|
||||
|
||||
To the best of our knowledge the [BMZ algorithm bmz.html] we have designed in
|
||||
our previous works [[1,3 #papers]] is the fastest published algorithm for
|
||||
constructing MPHFs.
|
||||
That is why we are using that algorithm as a building block for the
|
||||
algorithm presented here. In reality, we are using
|
||||
an optimized version of BMZ (BMZ8) for small sets of keys (at most 256 keys).
|
||||
[Click here to see details about BMZ algorithm bmz.html].
|
||||
|
||||
----------------------------------------
|
||||
|
||||
==Analysis of the Algorithm==
|
||||
|
||||
Analytical results and the complete analysis of the external memory based algorithm
|
||||
can be found in [[2 #papers]].
|
||||
|
||||
----------------------------------------
|
||||
|
||||
==Experimental Results==
|
||||
|
||||
In this section we present the experimental results.
|
||||
We start by presenting the experimental setup.
|
||||
We then present experimental results for
|
||||
the internal memory based algorithm ([the BMZ algorithm bmz.html])
|
||||
and for our external memory based algorithm.
|
||||
Finally, we discuss how the amount of internal memory available
|
||||
affects the runtime of the external memory based algorithm.
|
||||
|
||||
----------------------------------------
|
||||
|
||||
===The data and the experimental setup===
|
||||
|
||||
All experiments were carried out on
|
||||
a computer running the Linux operating system, version 2.6,
|
||||
with a 2.4 gigahertz processor and
|
||||
1 gigabyte of main memory.
|
||||
In the experiments related to the new
|
||||
algorithm we limited the main memory to 500 megabytes.
|
||||
|
||||
Our data consists of a collection of 1 billion
|
||||
URLs collected from the Web, each URL 64 characters long on average.
|
||||
The collection is stored on disk in 60.5 gigabytes.
|
||||
|
||||
----------------------------------------
|
||||
|
||||
===Performance of the BMZ Algorithm===
|
||||
|
||||
[The BMZ algorithm bmz.html] is used for constructing a MPHF for each bucket.
|
||||
It is a randomized algorithm because it needs to generate a simple random graph
|
||||
in its first step.
|
||||
Once the graph is obtained the other two steps are deterministic.
|
||||
|
||||
Thus, we can consider the runtime of the algorithm to have
|
||||
the form [figs/brz/img159.png] for an input of //n// keys,
|
||||
where [figs/brz/img160.png] is some machine dependent
|
||||
constant that further depends on the length of the keys and //Z// is a random
|
||||
variable with geometric distribution with mean [figs/brz/img162.png]. All results
|
||||
in our experiments were obtained taking //c=1//; the value of //c//, with //c// in //[0.93,1.15]//,
|
||||
in fact has little influence on the runtime, as shown in [[3 #papers]].
|
||||
|
||||
The values chosen for //n// were 1, 2, 4, 8, 16 and 32 million.
|
||||
Although we have a dataset with 1 billion URLs, on a PC with
|
||||
1 gigabyte of main memory, the algorithm is able
|
||||
to handle an input with at most 32 million keys.
|
||||
This is mainly because of the graph we need to keep in main memory.
|
||||
The algorithm requires //25n + O(1)// bytes for constructing
|
||||
a MPHF ([click here to get details about the data structures used by the BMZ algorithm bmz.html]).
|
||||
|
||||
In order to estimate the number of trials for each value of //n// we use
|
||||
a statistical method for determining a suitable sample size (see, e.g., [[6, Chapter 13 #papers]]).
|
||||
As we obtained different values for each //n//,
|
||||
we used the maximal value obtained, namely, 300 trials in order to have
|
||||
a confidence level of 95 %.
|
||||
|
||||
|
||||
Table 1 presents the runtime average for each //n//,
|
||||
the respective standard deviations, and
|
||||
the respective confidence intervals given by
|
||||
the average time [figs/brz/img167.png] the distance from average time
|
||||
considering a confidence level of 95 %.
|
||||
Observing the runtime averages one sees that
|
||||
the algorithm runs in expected linear time,
|
||||
as shown in [[3 #papers]].
|
||||
|
||||
%!include(html): ''TABLEBRZ1.t2t''
|
||||
| **Table 1:** Internal memory based algorithm: average time in seconds for constructing a MPHF, the standard deviation (SD), and the confidence intervals considering a confidence level of 95 %.
|
||||
|
||||
Figure 6 presents the runtime for each trial. In addition,
|
||||
the solid line corresponds to a linear regression model
|
||||
obtained from the experimental measurements.
|
||||
As we can see, the runtime for a given //n// has a considerable
|
||||
fluctuation. However, the fluctuation also grows linearly with //n//.
|
||||
|
||||
| [figs/brz/bmz_temporegressao.png]
|
||||
| **Figure 6:** Time versus number of keys in //S// for the internal memory based algorithm. The solid line corresponds to a linear regression model.
|
||||
|
||||
The observed fluctuation in the runtimes is as expected; recall that this
|
||||
runtime has the form [figs/brz/img159.png] with //Z// a geometric random variable with
|
||||
mean //1/p=e//. Thus, the runtime has mean [figs/brz/img181.png] and standard
|
||||
deviation [figs/brz/img182.png].
|
||||
Therefore, the standard deviation also grows
|
||||
linearly with //n//, as experimentally verified
|
||||
in Table 1 and in Figure 6.
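
The linear growth of both the mean and the standard deviation follows directly from the moments of a geometric random variable. The sketch below assumes that //Z// counts the number of trials up to and including the first success, so that its mean is //1/p// and its variance is //(1-p)/p^2//; the function name and the symbol ``alpha`` for the machine dependent constant are assumptions.

```python
import math

def geometric_runtime_stats(alpha, n, p=1 / math.e):
    """Mean and standard deviation of alpha * n * Z for geometric Z."""
    mean_z = 1 / p
    sd_z = math.sqrt(1 - p) / p
    return alpha * n * mean_z, alpha * n * sd_z

mean, sd = geometric_runtime_stats(alpha=1.0, n=1000)
ratio = sd / mean  # does not depend on n
```

With //p=1/e// the ratio between standard deviation and mean is sqrt(1 - 1/e), roughly 0.795, for every //n//, which is why the fluctuation grows in step with the average runtime.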
|
||||
|
||||
----------------------------------------
|
||||
|
||||
===Performance of the External Memory Based Algorithm===
|
||||
|
||||
The runtime of the external memory based algorithm is also a random variable,
|
||||
but now it follows a (highly concentrated) normal distribution, as we discuss at the end of this
|
||||
section. Again, we are interested in verifying the linearity claim made in
|
||||
[[2, Section 5.1 #papers]]. Therefore, we ran the algorithm for
|
||||
several numbers //n// of keys in //S//.
|
||||
|
||||
The values chosen for //n// were 1, 2, 4, 8, 16, 32, 64, 128, 512 and 1000
|
||||
million.
|
||||
We limited the main memory to 500 megabytes for the experiments.
|
||||
The size [figs/brz/img8.png] of the a priori reserved internal memory area
|
||||
was set to 250 megabytes, the parameter //b// was set to //175// and
|
||||
the building block algorithm parameter //c// was again set to //1//.
|
||||
We show later on how [figs/brz/img8.png] affects the runtime of the algorithm. The other two parameters
|
||||
have insignificant influence on the runtime.
|
||||
|
||||
We again use a statistical method for determining a suitable sample size
|
||||
to estimate the number of trials to be run for each value of //n//. We found that
|
||||
just one trial for each //n// would be enough with a confidence level of 95 %.
|
||||
However, we made 10 trials. This number of trials seems rather small, but, as
|
||||
shown below, the behavior of our algorithm is very stable and its runtime is
|
||||
almost deterministic (i.e., the standard deviation is very small).
|
||||
|
||||
Table 2 presents the runtime average for each //n//,
|
||||
the respective standard deviations, and
|
||||
the respective confidence intervals given by
|
||||
the average time [figs/brz/img167.png] the distance from average time
|
||||
considering a confidence level of 95 %.
|
||||
Observing the runtime averages we noticed that
|
||||
the algorithm runs in expected linear time,
|
||||
as shown in [[2, Section 5.1 #papers]]. Better still,
|
||||
it is only approximately 60 % slower than the BMZ algorithm.
|
||||
To get that value we used the linear regression model obtained for the runtime of
|
||||
the internal memory based algorithm to estimate how much time it would require
|
||||
for constructing a MPHF for a set of 1 billion keys.
|
||||
We got 2.3 hours for the internal memory based algorithm and we measured
|
||||
3.67 hours on average for the external memory based algorithm.
|
||||
Increasing the size of the internal memory area
|
||||
from 250 to 600 megabytes,
|
||||
we have brought the time to 3.09 hours. In this case, the external memory based algorithm is
|
||||
just 34 % slower.
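
The two percentages quoted above follow from the measured and estimated times. The helper below is a trivial sketch with a hypothetical name, included only to make the arithmetic explicit.

```python
def slowdown_percent(external_hours, internal_hours):
    """How much slower the external algorithm is, as a percentage."""
    return (external_hours / internal_hours - 1.0) * 100.0

with_250_mb = slowdown_percent(3.67, 2.3)
with_600_mb = slowdown_percent(3.09, 2.3)
```

Rounded to whole numbers, these give the 60 % and 34 % figures stated in the text.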
|
||||
|
||||
%!include(html): ''TABLEBRZ2.t2t''
|
||||
| **Table 2:** The external memory based algorithm: average time in seconds for constructing a MPHF, the standard deviation (SD), and the confidence intervals considering a confidence level of 95 %.
|
||||
|
||||
Figure 7 presents the runtime for each trial. In addition,
|
||||
the solid line corresponds to a linear regression model
|
||||
obtained from the experimental measurements.
|
||||
As expected, the runtime for a given //n// has almost no
|
||||
variation.
|
||||
|
||||
| [figs/brz/brz_temporegressao.png]
|
||||
| **Figure 7:** Time versus number of keys in //S// for our algorithm. The solid line corresponds to a linear regression model.
|
||||
|
||||
An intriguing observation is that the runtime of the algorithm is almost
|
||||
deterministic, in spite of the fact that it uses as building block an
|
||||
algorithm with a considerable fluctuation in its runtime. A given bucket
|
||||
//i//, [figs/brz/img47.png], is a small set of keys (at most 256 keys) and,
|
||||
as argued in the previous section, the runtime of the
|
||||
building block algorithm is a random variable [figs/brz/img207.png] with high fluctuation.
|
||||
However, the runtime //Y// of the searching step of the external memory based algorithm is given
|
||||
by [figs/brz/img209.png]. Under the hypothesis that
|
||||
the [figs/brz/img207.png] are independent and bounded, the //law of large numbers// (see,
|
||||
e.g., [[6 #papers]]) implies that the random variable [figs/brz/img210.png] converges
|
||||
to a constant as [figs/brz/img83.png]. This explains why the runtime of our
|
||||
algorithm is almost deterministic.
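
The argument above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the per-bucket runtime is modelled as a uniform random number in [0, 1] (a hypothetical stand-in for a bounded, highly fluctuating runtime), and the bucket counts and the function name are chosen for illustration.

```python
import random
import statistics


def relative_spread(num_buckets, trials=100, seed=1):
    """Coefficient of variation of Y, the sum of num_buckets bucket runtimes."""
    rng = random.Random(seed)
    totals = []
    for _ in range(trials):
        # Assumed model: each bucket runtime is uniform on [0, 1],
        # so it is bounded and independent of the other buckets.
        totals.append(sum(rng.random() for _ in range(num_buckets)))
    return statistics.pstdev(totals) / statistics.mean(totals)


small = relative_spread(10)
large = relative_spread(10_000)
print(f"10 buckets:     relative spread {small:.4f}")
print(f"10,000 buckets: relative spread {large:.4f}")
```

The relative spread of the total shrinks as the number of buckets grows, which is the behaviour the law of large numbers predicts for the searching step.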
|
||||
|
||||
----------------------------------------
|
||||
|
||||
=== Controlling disk accesses ===
|
||||
|
||||
In order to bring down the number of seek operations on disk
|
||||
we benefit from the fact that our algorithm leaves almost all main
|
||||
memory available to be used as disk I/O buffer.
|
||||
In this section we evaluate how much the parameter [figs/brz/img8.png] affects the runtime of our algorithm.
|
||||
For that we fixed //n// at 1 billion URLs,
|
||||
set the main memory of the machine used for the experiments
|
||||
to 1 gigabyte and used [figs/brz/img8.png] equal to 100, 200, 300, 400, 500 and 600
|
||||
megabytes.
|
||||
|
||||
Table 3 presents the number of files //N//,
|
||||
the buffer size used for all files, the number of seeks in the worst case considering
|
||||
the pessimistic assumption mentioned in [[2, Section 5.1 #papers]], and
|
||||
the time to generate an MPHF for 1 billion keys as a function of the amount of internal
|
||||
memory available. Observing Table 3 we noticed that the time spent in the construction
|
||||
decreases as the value of [figs/brz/img8.png] increases. However, for [figs/brz/img213.png], the variation
|
||||
in the time is not as significant as for [figs/brz/img214.png].
|
||||
This can be explained by the fact that the I/O scheduler of the Linux 2.6 kernel
|
||||
has smart policies for avoiding seeks and diminishing the average seek time
|
||||
(see [http://www.linuxjournal.com/article/6931 http://www.linuxjournal.com/article/6931]).
|
||||
|
||||
%!include(html): ''TABLEBRZ3.t2t''
|
||||
 | **Table 3:** Influence of the internal memory area size ([figs/brz/img8.png]) on the runtime of the external memory based algorithm.
|
||||
|
||||
|
||||
----------------------------------------
|
||||
|
||||
==Papers==[papers]
|
||||
|
||||
+ [F. C. Botelho http://www.dcc.ufmg.br/~fbotelho], D. Menoti, [N. Ziviani http://www.dcc.ufmg.br/~nivio]. [A New algorithm for constructing minimal perfect hash functions papers/bmz_tr004_04.ps], Technical Report TR004/04, Department of Computer Science, Federal University of Minas Gerais, 2004.
|
||||
|
||||
+ [F. C. Botelho http://www.dcc.ufmg.br/~fbotelho], Y. Kohayakawa, [N. Ziviani http://www.dcc.ufmg.br/~nivio]. [An Approach for Minimal Perfect Hash Functions for Very Large Databases papers/tr06.pdf], Technical Report TR003/06, Department of Computer Science, Federal University of Minas Gerais, 2004.
|
||||
|
||||
+ [F. C. Botelho http://www.dcc.ufmg.br/~fbotelho], Y. Kohayakawa, and [N. Ziviani http://www.dcc.ufmg.br/~nivio]. [A Practical Minimal Perfect Hashing Method papers/wea05.pdf]. //4th International Workshop on Efficient and Experimental Algorithms (WEA05),// Springer-Verlag Lecture Notes in Computer Science, vol. 3505, Santorini Island, Greece, May 2005, 488-500.
|
||||
|
||||
+ [M. Seltzer. Beyond relational databases. ACM Queue, 3(3), April 2005. http://acmqueue.com/modules.php?name=Content&pa=showpage&pid=299]
|
||||
|
||||
+ [Bob Jenkins. Algorithm alley: Hash functions. Dr. Dobb's Journal of Software Tools, 22(9), September 1997. http://burtleburtle.net/bob/hash/doobs.html]
|
||||
|
||||
+ R. Jain. The art of computer systems performance analysis: techniques for experimental design, measurement, simulation, and modeling. John Wiley, first edition, 1991.
|
||||
|
||||
|
||||
%!include: ALGORITHMS.t2t
|
||||
|
||||
%!include: FOOTER.t2t
|
||||
|
||||
%!include(html): ''GOOGLEANALYTICS.t2t''
|
|
@ -0,0 +1,44 @@
|
|||
Compress, Hash and Displace: CHD Algorithm
|
||||
|
||||
|
||||
%!includeconf: CONFIG.t2t
|
||||
|
||||
----------------------------------------
|
||||
==Introduction==
|
||||
|
||||
The important performance parameters of a PHF are representation size, evaluation time and construction time. The representation size plays an important role when the whole function fits in a faster memory and the actual data is stored in a slower memory. For instance, compact PHFs can fit entirely in a CPU cache, which makes their computation really fast by avoiding cache misses. The CHD algorithm plays an important role in this context. It was designed by Djamal Belazzougui, Fabiano C. Botelho, and Martin Dietzfelbinger in [[2 #papers]].
|
||||
|
||||
|
||||
The CHD algorithm makes it possible to obtain PHFs with representation size very close to optimal while retaining //O(n)// construction time and //O(1)// evaluation time. For example, in the case //m=2n// we obtain a PHF that uses //0.67// bits per key, and for //m=1.23n// we obtain //1.4// bits per key, which was not achievable with previously known methods. The CHD algorithm is inspired by several known algorithms; the main new feature is that it combines a modification of Pagh's "hash-and-displace" approach with data compression on a sequence of hash function indices. That combination makes it possible to significantly reduce space usage while retaining linear construction time and constant query time. The CHD algorithm can also be used for //k//-perfect hashing, where at most //k// keys may be mapped to the same value. For the analysis we assume that fully random hash functions are given for free; such assumptions can be justified and were made in previous papers.
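
To make the hash-and-displace idea more concrete, here is a simplified sketch. It is not the cmph implementation and it omits the compression step entirely. The hash function (seeded BLAKE2b), the bucket count, the function names and the key set are all assumptions made for illustration. Keys are split into buckets, and for each bucket the sketch searches for the first hash function index that sends all of its keys to free, distinct slots.

```python
import hashlib


def seeded_hash(key, seed, modulus):
    """Seeded hash of a string key into range(modulus) (illustrative only)."""
    digest = hashlib.blake2b(
        key.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little") % modulus


def build_indices(keys, m, num_buckets, max_tries=10_000):
    """Return one hash function index per bucket so that keys never collide."""
    buckets = [[] for _ in range(num_buckets)]
    for key in keys:
        buckets[seeded_hash(key, 0, num_buckets)].append(key)

    occupied = [False] * m
    indices = [0] * num_buckets
    # Place the largest buckets first; they are the hardest to fit.
    order = sorted(range(num_buckets), key=lambda b: len(buckets[b]), reverse=True)
    for b in order:
        if not buckets[b]:
            continue
        for index in range(1, max_tries + 1):
            slots = [seeded_hash(key, index, m) for key in buckets[b]]
            if len(set(slots)) == len(slots) and not any(occupied[s] for s in slots):
                for s in slots:
                    occupied[s] = True
                indices[b] = index
                break
        else:
            raise RuntimeError("no suitable index found; increase m or max_tries")
    return indices


def evaluate(key, indices, m):
    """Evaluate the resulting PHF: two hash computations and one table lookup."""
    bucket = seeded_hash(key, 0, len(indices))
    return seeded_hash(key, indices[bucket], m)


keys = [f"key{i}" for i in range(1000)]
m = 2 * len(keys)  # the case m = 2n mentioned in the text
indices = build_indices(keys, m, num_buckets=250)
values = [evaluate(key, indices, m) for key in keys]
print(len(set(values)) == len(keys))
```

Only the list of per-bucket indices has to be stored. In the CHD algorithm it is this sequence of indices that is compressed, which the sketch does not attempt.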
|
||||
|
||||
The compact PHFs generated by the CHD algorithm can be used in many applications in which we want to assign a unique identifier to each key without storing any information on the key. One of the most obvious applications of those functions (or //k//-perfect hash functions) is when we have a small fast memory in which we can store the perfect hash function while the keys and associated satellite data are stored in slower but larger memory. The size of a block or a transfer unit may be chosen so that //k// data items can be retrieved in one read access. In this case we can ensure that data associated with a key can be retrieved in a single probe to slower memory. This has been used for example in hardware routers [[4 #papers]].
|
||||
|
||||
|
||||
The CHD algorithm generates the most compact PHFs and MPHFs we know of in //O(n)// time. The time required to evaluate the generated functions is constant (in practice less than //1.4// microseconds). The storage space of the resulting PHFs and MPHFs is within a factor of //1.43// of the information theoretic lower bound. The closest competitor is the algorithm by Dietzfelbinger and Pagh [[3 #papers]], but their algorithm does not work in linear time. Furthermore, the CHD algorithm can be tuned to run faster than the BPZ algorithm [[1 #papers]] (the fastest algorithm available in the literature so far) and to obtain more compact functions. The most impressive characteristic is that it has the ability, in principle, to approximate the information theoretic lower bound while being practical. A detailed description of the CHD algorithm can be found in [[2 #papers]].
|
||||
|
||||
|
||||
|
||||
----------------------------------------
|
||||
|
||||
==Experimental Results==
|
||||
|
||||
Experimental results comparing the CHD algorithm with [the BDZ algorithm bdz.html]
|
||||
and others available in the CMPH library are presented in [[2 #papers]].
|
||||
----------------------------------------
|
||||
|
||||
==Papers==[papers]
|
||||
|
||||
+ [F. C. Botelho http://www.dcc.ufmg.br/~fbotelho], [R. Pagh http://www.itu.dk/~pagh/], [N. Ziviani http://www.dcc.ufmg.br/~nivio]. [Simple and space-efficient minimal perfect hash functions papers/wads07.pdf]. //In Proceedings of the 10th International Workshop on Algorithms and Data Structures (WADS'07),// Springer-Verlag Lecture Notes in Computer Science, vol. 4619, Halifax, Canada, August 2007, 139-150.
|
||||
|
||||
+ [F. C. Botelho http://www.dcc.ufmg.br/~fbotelho], D. Belazzougui and M. Dietzfelbinger. [Compress, hash and displace papers/esa09.pdf]. //In Proceedings of the 17th European Symposium on Algorithms (ESA’09)//. Springer LNCS, 2009.
|
||||
|
||||
+ M. Dietzfelbinger and [R. Pagh http://www.itu.dk/~pagh/]. Succinct data structures for retrieval and approximate membership. //In Proceedings of the 35th international colloquium on Automata, Languages and Programming (ICALP’08)//, pages 385–396, Berlin, Heidelberg, 2008. Springer-Verlag.
|
||||
|
||||
+ B. Prabhakar and F. Bonomi. Perfect hashing for network applications. //In Proceedings of the IEEE International Symposium on Information Theory//. IEEE Press, 2006.
|
||||
|
||||
|
||||
%!include: ALGORITHMS.t2t
|
||||
|
||||
%!include: FOOTER.t2t
|
||||
|
||||
%!include(html): ''GOOGLEANALYTICS.t2t''
|
|
@ -0,0 +1,88 @@
|
|||
CHM Algorithm
|
||||
|
||||
|
||||
%!includeconf: CONFIG.t2t
|
||||
|
||||
----------------------------------------
|
||||
|
||||
==The Algorithm==
|
||||
The algorithm is presented in [[1,2,3 #papers]].
|
||||
----------------------------------------
|
||||
|
||||
==Memory Consumption==
|
||||
|
||||
Now we detail the memory consumption to generate and to store minimal perfect hash functions
|
||||
using the CHM algorithm. The structures responsible for memory consumption are the
|
||||
following:
|
||||
- Graph:
|
||||
+ **first**: is a vector that stores //cn// integer numbers, each one representing
|
||||
the first edge (index in the vector edges) in the list of
|
||||
edges of each vertex.
|
||||
The integer numbers are 4 bytes long. Therefore,
|
||||
the vector first is stored in //4cn// bytes.
|
||||
|
||||
+ **edges**: is a vector to represent the edges of the graph. As each edge
|
||||
is composed of a pair of vertices, each entry stores two integer numbers
|
||||
of 4 bytes that represent the vertices. As there are //n// edges, the
|
||||
vector edges is stored in //8n// bytes.
|
||||
|
||||
+ **next**: given a vertex [figs/img139.png], we can discover the edges that
|
||||
contain [figs/img139.png] following its list of edges, which starts on
|
||||
first[[figs/img139.png]] and the next
|
||||
edges are given by next[...first[[figs/img139.png]]...]. Therefore,
|
||||
the vectors first and next represent
|
||||
the linked lists of edges of each vertex. As there are two vertices for each edge,
|
||||
when an edge is inserted in the graph, it must be inserted in the two linked lists
|
||||
of its two vertices. Therefore, there are //2n// entries of integer
|
||||
numbers in the vector next, so it is stored in //4*2n = 8n// bytes.
|
||||
|
||||
- Other auxiliary structures
|
||||
+ **visited**: is a vector of //cn// bits, where each bit indicates if the g value of
|
||||
a given vertex was already defined. Therefore, the vector visited is stored
|
||||
in //cn/8// bytes.
|
||||
|
||||
+ **function //g//**: is represented by a vector of //cn// integer numbers.
|
||||
As each integer number is 4 bytes long, the function //g// is stored in
|
||||
//4cn// bytes.
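
The graph structures listed above can be sketched as follows. This is an illustration under assumptions, not the library's code: the sentinel ``NIL``, the class and method names, and the choice of giving edge //e// the two list slots //2e// and //2e+1// are all hypothetical. It shows how **first** and **next** together form one linked list of incident edges per vertex.

```python
NIL = -1  # hypothetical end-of-list marker


class Graph:
    def __init__(self, num_vertices, num_edges):
        self.first = [NIL] * num_vertices       # cn entries
        self.edges = [(NIL, NIL)] * num_edges   # n entries, two vertices each
        self.next = [NIL] * (2 * num_edges)     # 2n entries
        self.count = 0

    def add_edge(self, u, v):
        """Insert edge (u, v) into the linked lists of both of its vertices."""
        e = self.count
        self.edges[e] = (u, v)
        # Assumed encoding: slot 2e links the edge into u's list,
        # slot 2e + 1 links it into v's list.
        self.next[2 * e] = self.first[u]
        self.first[u] = 2 * e
        self.next[2 * e + 1] = self.first[v]
        self.first[v] = 2 * e + 1
        self.count += 1

    def incident_edges(self, vertex):
        """Yield the indices of the edges that contain the given vertex."""
        slot = self.first[vertex]
        while slot != NIL:
            yield slot // 2
            slot = self.next[slot]


g = Graph(num_vertices=4, num_edges=3)
g.add_edge(0, 1)
g.add_edge(1, 2)
g.add_edge(0, 2)
print(sorted(g.incident_edges(0)))
```

With 4-byte integers, the three vectors have the sizes stated in the list: //4cn// bytes for **first**, //8n// bytes for **edges** and //8n// bytes for **next**.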
|
||||
|
||||
|
||||
Thus, the total memory consumption of the CHM algorithm for generating a minimal
|
||||
perfect hash function (MPHF) is: //(8.125c + 16)n + O(1)// bytes.
|
||||
As the value of constant //c// must be at least 2.09 we have:
|
||||
 || //c// | Memory consumption to generate an MPHF |
|
||||
| 2.09 | //33.00n + O(1)// |
|
||||
|
||||
 | **Table 1:** Memory consumption to generate an MPHF using the CHM algorithm.
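
The total in Table 1 is the sum of the five structures described earlier. The sketch below adds them up per key (that is, with the factor //n// divided out); the function and variable names are chosen for illustration only.

```python
def chm_generation_bytes_per_key(c):
    """Bytes per key needed while generating an MPHF, ignoring the O(1) term."""
    first = 4 * c     # cn integers of 4 bytes
    edges = 8         # n edges, two 4-byte vertices each
    next_ = 8         # 2n integers of 4 bytes
    visited = c / 8   # cn bits
    g = 4 * c         # cn integers of 4 bytes
    return first + edges + next_ + visited + g


per_key = chm_generation_bytes_per_key(2.09)
print(f"{per_key:.2f} bytes per key")
```

For //c = 2.09// the sum evaluates to about 32.98 bytes per key, which agrees with the //(8.125c + 16)n// formula and with the rounded value //33.00n// in the table.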
|
||||
|
||||
Now we present the memory consumption to store the resulting function.
|
||||
We only need to store the //g// function. Thus, we need //4cn// bytes.
|
||||
Again we have:
|
||||
 || //c// | Memory consumption to store an MPHF |
|
||||
| 2.09 | //8.36n// |
|
||||
|
||||
 | **Table 2:** Memory consumption to store an MPHF generated by the CHM algorithm.
|
||||
|
||||
----------------------------------------
|
||||
|
||||
==Experimental Results==
|
||||
|
||||
[CHM x BMZ comparison.html]
|
||||
|
||||
----------------------------------------
|
||||
|
||||
==Papers==[papers]
|
||||
|
||||
+ Z.J. Czech, G. Havas, and B.S. Majewski. [An optimal algorithm for generating minimal perfect hash functions. papers/chm92.pdf], Information Processing Letters, 43(5):257-264, 1992.
|
||||
|
||||
+ Z.J. Czech, G. Havas, and B.S. Majewski. Fundamental study perfect hashing.
|
||||
Theoretical Computer Science, 182:1-143, 1997.
|
||||
|
||||
+ B.S. Majewski, N.C. Wormald, G. Havas, and Z.J. Czech. A family of perfect hashing methods.
|
||||
The Computer Journal, 39(6):547--554, 1996.
|
||||
|
||||
|
||||
%!include: ALGORITHMS.t2t
|
||||
|
||||
%!include: FOOTER.t2t
|
||||
|
||||
%!include(html): ''GOOGLEANALYTICS.t2t''
|
|
@ -0,0 +1,111 @@
|
|||
Comparison Between BMZ And CHM Algorithms
|
||||
|
||||
|
||||
%!includeconf: CONFIG.t2t
|
||||
|
||||
----------------------------------------
|
||||
|
||||
==Characteristics==
|
||||
Table 1 presents the main characteristics of the two algorithms.
|
||||
The number of edges in the graph [figs/img27.png] is [figs/img236.png],
|
||||
the number of keys in the input set [figs/img20.png].
|
||||
The number of vertices of [figs/img32.png] is equal
|
||||
to [figs/img12.png] and [figs/img237.png] for the BMZ algorithm and the CHM algorithm, respectively.
|
||||
This measure is related to the amount of space to store the array [figs/img37.png].
|
||||
This reduces the space required to store a function in the BMZ algorithm to [figs/img238.png] of the space required by the CHM algorithm.
|
||||
The number of critical edges is [figs/img76.png] and 0, for the BMZ algorithm and the CHM algorithm,
|
||||
respectively.
|
||||
The BMZ algorithm generates random graphs that necessarily contain cycles and the
|
||||
CHM algorithm
|
||||
generates
|
||||
acyclic random graphs.
|
||||
Finally, the CHM algorithm generates [order preserving functions concepts.html]
|
||||
while the BMZ algorithm does not preserve order.
|
||||
|
||||
%!include(html): ''TABLE1.t2t''
|
||||
| **Table 1:** Main characteristics of the algorithms.
|
||||
|
||||
----------------------------------------
|
||||
|
||||
==Memory Consumption==
|
||||
|
||||
- Memory consumption to generate the minimal perfect hash function (MPHF):
|
||||
 || Algorithm | //c// | Memory consumption to generate an MPHF |
|
||||
| BMZ | 0.93 | //24.80n + O(1)// |
|
||||
| BMZ | 1.15 | //26.42n + O(1)// |
|
||||
| CHM | 2.09 | //33.00n + O(1)// |
|
||||
|
||||
 | **Table 2:** Memory consumption to generate an MPHF using the algorithms BMZ and CHM.
|
||||
|
||||
- Memory consumption to store the resulting minimal perfect hash function (MPHF):
|
||||
 || Algorithm | //c// | Memory consumption to store an MPHF |
|
||||
| BMZ | 0.93 | //3.72n// |
|
||||
| BMZ | 1.15 | //4.60n// |
|
||||
| CHM | 2.09 | //8.36n// |
|
||||
|
||||
 | **Table 3:** Memory consumption to store an MPHF generated by the algorithms BMZ and CHM.
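
The storage figures in Table 3 are consistent with one 4-byte entry per vertex, that is //4cn// bytes. The sketch below assumes that rule for both algorithms (an assumption inferred from the table values, not a statement about the implementation) and also expresses each value relative to CHM.

```python
BYTES_PER_ENTRY = 4  # assumed: one 4-byte integer per vertex

configs = [("BMZ", 0.93), ("BMZ", 1.15), ("CHM", 2.09)]
storage = {(name, c): BYTES_PER_ENTRY * c for name, c in configs}
chm = storage[("CHM", 2.09)]

for (name, c), per_key in storage.items():
    share = 100 * per_key / chm
    print(f"{name} c={c}: {per_key:.2f}n bytes ({share:.0f} % of CHM)")
```

The computed per-key values reproduce the three rows of the table.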
|
||||
|
||||
----------------------------------------
|
||||
|
||||
==Run times==
|
||||
We now present some experimental results to compare the BMZ and CHM algorithms.
|
||||
The data consists of a collection of 100 million uniform resource locators
|
||||
(URLs) collected from the Web.
|
||||
The average length of a URL in the collection is 63 bytes.
|
||||
All experiments were carried out on
|
||||
a computer running the Linux operating system, version 2.6.7,
|
||||
with a 2.4 gigahertz processor and
|
||||
4 gigabytes of main memory.
|
||||
|
||||
Table 4 presents time measurements.
|
||||
All times are in seconds.
|
||||
The table entries represent averages over 50 trials.
|
||||
The column labelled as [figs/img243.png] represents
|
||||
the number of iterations to generate the random graph [figs/img32.png] in the
|
||||
mapping step of the algorithms.
|
||||
The next columns represent the run times
|
||||
for the mapping plus ordering steps together and the searching
|
||||
step for each algorithm.
|
||||
The last column represents the percent gain of our algorithm
|
||||
over the CHM algorithm.
|
||||
|
||||
%!include(html): ''TABLE4.t2t''
|
||||
 | **Table 4:** Time measurements for the BMZ and CHM algorithms.
|
||||
|
||||
The mapping step of the BMZ algorithm is faster because
|
||||
the expected numbers of iterations in the mapping step to generate [figs/img32.png] are
|
||||
2.13 and 2.92 for the BMZ algorithm and the CHM algorithm, respectively
|
||||
(see [[2 bmz.html#papers]] for details).
|
||||
The graph [figs/img32.png] generated by the BMZ algorithm
|
||||
has [figs/img12.png] vertices, against [figs/img237.png] for the CHM algorithm.
|
||||
These two facts make the BMZ algorithm faster in the mapping step.
|
||||
The time of the ordering step of the BMZ algorithm is approximately equal to
|
||||
the time to check if [figs/img32.png] is acyclic for the CHM algorithm.
|
||||
The searching step of the CHM algorithm is faster, but the total
|
||||
time of the BMZ algorithm is, on average, approximately 59 % faster
|
||||
than the CHM algorithm.
|
||||
It is important to notice the times for the searching step:
|
||||
for both algorithms they are not the dominant times,
|
||||
and the experimental results clearly show
|
||||
a linear behavior for the searching step.
|
||||
|
||||
We now present run times for the BMZ algorithm using a [heuristic bmz.html#heuristic] that
|
||||
reduces the space requirement
|
||||
to any given value between [figs/img12.png] words and [figs/img13.png] words.
|
||||
For example, for [figs/img244.png] and [figs/img6.png], the analytical expected numbers
|
||||
of iterations are [figs/img245.png] and [figs/img246.png], respectively
|
||||
(for [figs/img247.png], the numbers of iterations are 2.78 for [figs/img244.png] and 3.04
|
||||
for [figs/img6.png]).
|
||||
Table 5 presents the total times to construct a
|
||||
function for [figs/img247.png], with an increase from [figs/img248.png] seconds
|
||||
for [figs/img128.png] (see Table 4) to [figs/img249.png] seconds for [figs/img244.png] and
|
||||
to [figs/img250.png] seconds for [figs/img6.png].
|
||||
|
||||
%!include(html): ''TABLE5.t2t''
|
||||
 | **Table 5:** Time measurements for the tuned BMZ algorithm with [figs/img5.png] and [figs/img6.png].
|
||||
|
||||
%!include: ALGORITHMS.t2t
|
||||
|
||||
%!include: FOOTER.t2t
|
||||
|
||||
%!include(html): ''GOOGLEANALYTICS.t2t''
|
|
@ -0,0 +1,56 @@
|
|||
Minimal Perfect Hash Functions - Introduction
|
||||
|
||||
|
||||
%!includeconf: CONFIG.t2t
|
||||
|
||||
----------------------------------------
|
||||
==Basic Concepts==
|
||||
|
||||
Suppose [figs/img14.png] is a universe of //keys//.
|
||||
Let [figs/img15.png] be a //hash function// that maps the keys from [figs/img14.png] to a given interval of integers [figs/img16.png].
|
||||
Let [figs/img17.png] be a set of [figs/img8.png] keys from [figs/img14.png].
|
||||
Given a key [figs/img18.png], the hash function [figs/img7.png] computes an
|
||||
integer in [figs/img19.png] for the storage or retrieval of [figs/img11.png] in
|
||||
a //hash table//.
|
||||
Hashing methods for //non-static sets// of keys can be used to construct
|
||||
data structures storing [figs/img20.png] and supporting membership queries
|
||||
"[figs/img18.png]?" in expected time [figs/img21.png].
|
||||
However, they involve a certain amount of wasted space owing to unused
|
||||
locations in the table and wasted time to resolve collisions when
|
||||
two keys are hashed to the same table location.
|
||||
|
||||
For //static sets// of keys it is possible to compute a function
|
||||
to find any key in a table in one probe; such hash functions are called
|
||||
//perfect//.
|
||||
More precisely, given a set of keys [figs/img20.png], we shall say that a
|
||||
hash function [figs/img15.png] is a //perfect hash function//
|
||||
for [figs/img20.png] if [figs/img7.png] is an injection on [figs/img20.png],
|
||||
that is, there are no //collisions// among the keys in [figs/img20.png]:
|
||||
if [figs/img11.png] and [figs/img22.png] are in [figs/img20.png] and [figs/img23.png],
|
||||
then [figs/img24.png].
|
||||
Figure 1(a) illustrates a perfect hash function.
|
||||
Since no collisions occur, each key can be retrieved from the table
|
||||
with a single probe.
|
||||
If [figs/img25.png], that is, the table has the same size as [figs/img20.png],
|
||||
then we say that [figs/img7.png] is a //minimal perfect hash function//
|
||||
for [figs/img20.png].
|
||||
Figure 1(b) illustrates a minimal perfect hash function.
|
||||
Minimal perfect hash functions totally avoid the problem of wasted
|
||||
space and time. A perfect hash function [figs/img7.png] is //order preserving//
|
||||
if the keys in [figs/img20.png] are arranged in some given order
|
||||
and [figs/img7.png] preserves this order in the hash table.
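
The three definitions above can be stated as small checks. This is a minimal sketch with assumptions: the hash values are taken to start at 0, the toy hash functions are plain dictionary lookups over a made-up key set, and all names are chosen for illustration.

```python
def is_perfect(h, keys):
    """True if h is injective on keys, that is, there are no collisions."""
    values = [h(k) for k in keys]
    return len(set(values)) == len(values)


def is_minimal_perfect(h, keys):
    """True if h maps the n keys one-to-one onto 0 .. n-1 (assumed 0-based)."""
    keys = list(keys)
    return sorted(h(k) for k in keys) == list(range(len(keys)))


def is_order_preserving(h, keys):
    """True if h is perfect and keeps the keys in their given order."""
    values = [h(k) for k in keys]
    return all(a < b for a, b in zip(values, values[1:]))


keys = ["jan", "feb", "mar"]                       # hypothetical key set, in order
sparse = {"jan": 0, "feb": 3, "mar": 5}.get        # perfect, table larger than the set
dense = {"jan": 1, "feb": 0, "mar": 2}.get         # minimal, order not preserved
ordered = {"jan": 0, "feb": 1, "mar": 2}.get       # minimal and order preserving

print(is_perfect(sparse, keys), is_minimal_perfect(sparse, keys))
print(is_minimal_perfect(dense, keys), is_order_preserving(dense, keys))
print(is_minimal_perfect(ordered, keys), is_order_preserving(ordered, keys))
```

A function that sends two keys to the same value fails the first check, which corresponds to a collision in the hash table.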
|
||||
|
||||
| [figs/img26.png]
|
||||
| **Figure 1:** (a) Perfect hash function. (b) Minimal perfect hash function.
|
||||
|
||||
Minimal perfect hash functions are widely used for memory efficient
|
||||
storage and fast retrieval of items from static sets, such as words in natural
|
||||
languages, reserved words in programming languages or interactive systems,
|
||||
uniform resource locators (URLs) in Web search engines, or item sets in
|
||||
data mining techniques.
|
||||
|
||||
%!include: ALGORITHMS.t2t
|
||||
|
||||
%!include: FOOTER.t2t
|
||||
|
||||
%!include(html): ''GOOGLEANALYTICS.t2t''
|
|
@ -0,0 +1,51 @@
|
|||
%! style(html): DOC.css
|
||||
%! PreProc(html): '^%html% ' ''
|
||||
%! PreProc(txt): '^%txt% ' ''
|
||||
%! PostProc(html): "&" "&"
|
||||
%! PostProc(txt): " " " "
|
||||
%! PostProc(html): 'ALIGN="middle" SRC="figs/img7.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img7.png"\1>'
|
||||
%! PostProc(html): 'ALIGN="middle" SRC="figs/img57.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img57.png"\1>'
|
||||
%! PostProc(html): 'ALIGN="middle" SRC="figs/img32.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img32.png"\1>'
|
||||
%! PostProc(html): 'ALIGN="middle" SRC="figs/img20.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img20.png"\1>'
|
||||
%! PostProc(html): 'ALIGN="middle" SRC="figs/img60.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img60.png"\1>'
|
||||
%! PostProc(html): 'ALIGN="middle" SRC="figs/img62.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img62.png"\1>'
|
||||
%! PostProc(html): 'ALIGN="middle" SRC="figs/img79.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img79.png"\1>'
|
||||
%! PostProc(html): 'ALIGN="middle" SRC="figs/img139.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img139.png"\1>'
|
||||
%! PostProc(html): 'ALIGN="middle" SRC="figs/img140.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img140.png"\1>'
|
||||
%! PostProc(html): 'ALIGN="middle" SRC="figs/img143.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img143.png"\1>'
|
||||
%! PostProc(html): 'ALIGN="middle" SRC="figs/img115.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img115.png"\1>'
|
||||
%! PostProc(html): 'ALIGN="middle" SRC="figs/img11.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img11.png"\1>'
|
||||
%! PostProc(html): 'ALIGN="middle" SRC="figs/img169.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img169.png"\1>'
|
||||
%! PostProc(html): 'ALIGN="middle" SRC="figs/img96.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img96.png"\1>'
|
||||
%! PostProc(html): 'ALIGN="middle" SRC="figs/img178.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img178.png"\1>'
|
||||
%! PostProc(html): 'ALIGN="middle" SRC="figs/img180.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img180.png"\1>'
|
||||
%! PostProc(html): 'ALIGN="middle" SRC="figs/img183.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img183.png"\1>'
|
||||
%! PostProc(html): 'ALIGN="middle" SRC="figs/img189.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img189.png"\1>'
|
||||
%! PostProc(html): 'ALIGN="middle" SRC="figs/img196.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img196.png"\1>'
|
||||
%! PostProc(html): 'ALIGN="middle" SRC="figs/img172.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img172.png"\1>'
|
||||
%! PostProc(html): 'ALIGN="middle" SRC="figs/img8.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img8.png"\1>'
|
||||
%! PostProc(html): 'ALIGN="middle" SRC="figs/img1.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img1.png"\1>'
|
||||
%! PostProc(html): 'ALIGN="middle" SRC="figs/img14.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img14.png"\1>'
|
||||
%! PostProc(html): 'ALIGN="middle" SRC="figs/img128.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img128.png"\1>'
|
||||
%! PostProc(html): 'ALIGN="middle" SRC="figs/img112.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img112.png"\1>'
|
||||
%! PostProc(html): 'ALIGN="middle" SRC="figs/img12.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img12.png"\1>'
|
||||
%! PostProc(html): 'ALIGN="middle" SRC="figs/img13.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img13.png"\1>'
|
||||
%! PostProc(html): 'ALIGN="middle" SRC="figs/img244.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img244.png"\1>'
|
||||
%! PostProc(html): 'ALIGN="middle" SRC="figs/img245.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img245.png"\1>'
|
||||
%! PostProc(html): 'ALIGN="middle" SRC="figs/img246.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img246.png"\1>'
|
||||
%! PostProc(html): 'ALIGN="middle" SRC="figs/img15.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img15.png"\1>'
|
||||
%! PostProc(html): 'ALIGN="middle" SRC="figs/img25.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img25.png"\1>'
|
||||
%! PostProc(html): 'ALIGN="middle" SRC="figs/img168.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img168.png"\1>'
|
||||
%! PostProc(html): 'ALIGN="middle" SRC="figs/img6.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img6.png"\1>'
|
||||
%! PostProc(html): 'ALIGN="middle" SRC="figs/img5.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img5.png"\1>'
|
||||
%! PostProc(html): 'ALIGN="middle" SRC="figs/img28.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img28.png"\1>'
|
||||
%! PostProc(html): 'ALIGN="middle" SRC="figs/img237.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img237.png"\1>'
|
||||
%! PostProc(html): 'ALIGN="middle" SRC="figs/img248.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img248.png"\1>'
|
||||
%! PostProc(html): 'ALIGN="middle" SRC="figs/img249.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img249.png"\1>'
|
||||
%! PostProc(html): 'ALIGN="middle" SRC="figs/img250.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img250.png"\1>'
|
||||
%! PostProc(html): 'ALIGN="middle" SRC="figs/bdz/img8.png"(.*?)>' 'ALIGN="bottom" SRC="figs/bdz/img8.png"\1>'
|
||||
% The ^ needs to be escaped with \
|
||||
%!postproc(html): \^\^(.*?)\^\^ <sup>\1</sup>
|
||||
%!postproc(html): ,,(.*?),, <sub>\1</sub>
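
The effect of the two substitution rules above can be reproduced with Python's ``re`` module. This is only a sketch of the regular expressions themselves, assuming they are applied as ordinary non-greedy substitutions; the function name and the sample string are made up for illustration.

```python
import re


def apply_postproc(html):
    """Apply the superscript and subscript substitutions to a string."""
    html = re.sub(r"\^\^(.*?)\^\^", r"<sup>\1</sup>", html)
    html = re.sub(r",,(.*?),,", r"<sub>\1</sub>", html)
    return html


sample = "n^^2^^ and h,,1,,"
print(apply_postproc(sample))
```

The non-greedy group ``(.*?)`` keeps each match as short as possible, so two marked spans on the same line are converted separately.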
|
||||
|
|
@ -0,0 +1,5 @@
|
|||
The code of the cmph library is dual licensed under the LGPL version 2 and MPL
|
||||
1.1 licenses. Please refer to the LGPL-2 and MPL-1.1 files in the repository
|
||||
for the full description of each of the licenses.
|
||||
|
||||
For cxxmph, the files stringpiece.h and MurmurHash2 are covered by the BSD and MIT licenses, respectively.
|
|
@ -0,0 +1,453 @@
|
|||
2005-08-08 18:34 fc_botelho
|
||||
|
||||
* INSTALL, examples/Makefile, examples/Makefile.in,
|
||||
examples/.deps/file_adapter_ex2.Po,
|
||||
examples/.deps/vector_adapter_ex1.Po, src/brz.c: [no log message]
|
||||
|
||||
2005-08-07 22:00 fc_botelho
|
||||
|
||||
* src/: brz.c, brz.h, brz_structs.h, cmph.c, cmph.h, main.c:
|
||||
temporary directory passed by command line
|
||||
|
||||
2005-08-07 20:22 fc_botelho
|
||||
|
||||
* src/brz.c: stable version of BRZ
|
||||
|
||||
2005-08-06 22:09 fc_botelho
|
||||
|
||||
* src/bmz.c: no message
|
||||
|
||||
2005-08-06 22:02 fc_botelho
|
||||
|
||||
* src/bmz.c: no message
|
||||
|
||||
2005-08-06 21:45 fc_botelho
|
||||
|
||||
* src/brz.c: fastest version of BRZ
|
||||
|
||||
2005-08-06 17:20 fc_botelho
|
||||
|
||||
* src/: bmz.c, brz.c, main.c: [no log message]
|
||||
|
||||
2005-07-29 16:43 fc_botelho
|
||||
|
||||
* src/brz.c: BRZ algorithm is almost stable
|
||||
|
||||
2005-07-29 15:29 fc_botelho
|
||||
|
||||
* src/: bmz.c, brz.c, brz_structs.h, cmph_types.h: BRZ algorithm is
|
||||
almost stable
|
||||
|
||||
2005-07-29 00:09 fc_botelho
|
||||
|
||||
* src/: brz.c, djb2_hash.c, djb2_hash.h, fnv_hash.c, fnv_hash.h,
|
||||
hash.c, hash.h, jenkins_hash.c, jenkins_hash.h, sdbm_hash.c,
|
||||
sdbm_hash.h: it was fixed more mistakes in BRZ algorithm
|
||||
|
||||
2005-07-28 21:00 fc_botelho
|
||||
|
||||
* src/: bmz.c, brz.c, cmph.c: fixed some mistakes in BRZ algorithm
|
||||
|
||||
2005-07-27 19:13 fc_botelho
|
||||
|
||||
* src/brz.c: algorithm BRZ included
|
||||
|
||||
2005-07-27 18:16 fc_botelho
|
||||
|
||||
* src/: bmz_structs.h, brz.c, brz.h, brz_structs.h: Algorithm BRZ
|
||||
included
|
||||
|
||||
2005-07-27 18:13 fc_botelho
|
||||
|
||||
* src/: Makefile.am, bmz.c, chm.c, cmph.c, cmph.h, cmph_types.h:
|
||||
Algorithm BRZ included
|
||||
|
||||
2005-07-25 19:18 fc_botelho
|
||||
|
||||
* README, README.t2t, scpscript: it was included an examples
|
||||
directory
|
||||
|
||||
2005-07-25 18:26 fc_botelho
|
||||
|
||||
* INSTALL, Makefile.am, configure.ac, examples/Makefile,
|
||||
examples/Makefile.am, examples/Makefile.in,
|
||||
examples/file_adapter_ex2.c, examples/keys.txt,
|
||||
examples/vector_adapter_ex1.c, examples/.deps/file_adapter_ex2.Po,
|
||||
examples/.deps/vector_adapter_ex1.Po, src/cmph.c, src/cmph.h: it
|
||||
was included a examples directory
|
||||
|
||||
2005-03-03 02:07 davi
|
||||
|
||||
* src/: bmz.c, chm.c, chm.h, chm_structs.h, cmph.c, cmph.h,
|
||||
graph.c, graph.h, jenkins_hash.c, jenkins_hash.h, main.c (xgraph):
|
||||
New f*cking cool algorithm works. Roughly implemented in chm.c
|
||||
|
||||
2005-03-02 20:55 davi
|
||||
|
||||
* src/xgraph.c (xgraph): xchmr working nice, but a bit slow
|
||||
|
||||
2005-03-02 02:01 davi
|
||||
|
||||
* src/xchmr.h: file xchmr.h was initially added on branch xgraph.
|
||||
|
||||
2005-03-02 02:01 davi
|
||||
|
||||
* src/xchmr_structs.h: file xchmr_structs.h was initially added on
|
||||
branch xgraph.
|
||||
|
||||
2005-03-02 02:01 davi
|
||||
|
||||
* src/xchmr.c: file xchmr.c was initially added on branch xgraph.
|
||||
|
||||
2005-03-02 02:01 davi
|
||||
|
||||
* src/: Makefile.am, cmph.c, cmph_types.h, xchmr.c, xchmr.h,
|
||||
xchmr_structs.h, xgraph.c, xgraph.h (xgraph): xchmr working fine
|
||||
except for false positives on cyclic detection.
|
||||
|
||||
2005-03-02 00:05 davi
|
||||
|
||||
* src/: Makefile.am, xgraph.c, xgraph.h (xgraph): Added external
|
||||
graph functionality in branch xgraph.
|
||||
|
||||
2005-03-02 00:05 davi
|
||||
|
||||
* src/xgraph.c: file xgraph.c was initially added on branch xgraph.
|
||||
|
||||
2005-03-02 00:05 davi
|
||||
|
||||
* src/xgraph.h: file xgraph.h was initially added on branch xgraph.
|
||||
|
||||
2005-02-28 19:53 davi
|
||||
|
||||
* src/chm.c: Fixed off by one bug in chm.
|
||||
|
||||
2005-02-17 16:20 fc_botelho
|
||||
|
||||
* LOGO.html, README, README.t2t, gendocs: The way of calling the
|
||||
function cmph_search was fixed in the file README.t2t
|
||||
|
||||
2005-01-31 17:13 fc_botelho
|
||||
|
||||
* README.t2t: Heuristic BMZ memory consumption was updated
|
||||
|
||||
2005-01-31 17:09 fc_botelho
|
||||
|
||||
* BMZ.t2t: DJB2, SDBM, FNV and Jenkins hash link were added
|
||||
|
||||
2005-01-31 16:50 fc_botelho
|
||||
|
||||
* BMZ.t2t, CHM.t2t, COMPARISON.t2t, CONCEPTS.t2t, CONFIG.t2t,
|
||||
FAQ.t2t, GPERF.t2t, LOGO.t2t, README.t2t, TABLE1.t2t, TABLE4.t2t,
|
||||
TABLE5.t2t, DOC.css: BMZ documentation was finished
|
||||
|
||||
2005-01-28 18:12 fc_botelho
|
||||
|
||||
* figs/img1.png, figs/img10.png, figs/img100.png, figs/img101.png,
|
||||
figs/img102.png, figs/img103.png, figs/img104.png, figs/img105.png,
|
||||
figs/img106.png, figs/img107.png, figs/img108.png, figs/img109.png,
|
||||
papers/bmz_tr004_04.ps, papers/bmz_wea2005.ps, papers/chm92.pdf,
|
||||
figs/img11.png, figs/img110.png, figs/img111.png, figs/img112.png,
|
||||
figs/img113.png, figs/img114.png, figs/img115.png, figs/img116.png,
|
||||
figs/img117.png, figs/img118.png, figs/img119.png, figs/img12.png,
|
||||
figs/img120.png, figs/img121.png, figs/img122.png, figs/img123.png,
|
||||
figs/img124.png, figs/img125.png, figs/img126.png, figs/img127.png,
|
||||
figs/img128.png, figs/img129.png, figs/img13.png, figs/img130.png,
|
||||
figs/img131.png, figs/img132.png, figs/img133.png, figs/img134.png,
|
||||
figs/img135.png, figs/img136.png, figs/img137.png, figs/img138.png,
|
||||
figs/img139.png, figs/img14.png, figs/img140.png, figs/img141.png,
|
||||
figs/img142.png, figs/img143.png, figs/img144.png, figs/img145.png,
|
||||
figs/img146.png, figs/img147.png, figs/img148.png, figs/img149.png,
|
||||
figs/img15.png, figs/img150.png, figs/img151.png, figs/img152.png,
|
||||
figs/img153.png, figs/img154.png, figs/img155.png, figs/img156.png,
|
||||
figs/img157.png, figs/img158.png, figs/img159.png, figs/img16.png,
|
||||
figs/img160.png, figs/img161.png, figs/img162.png, figs/img163.png,
|
||||
figs/img164.png, figs/img165.png, figs/img166.png, figs/img167.png,
|
||||
figs/img168.png, figs/img169.png, figs/img17.png, figs/img170.png,
|
||||
figs/img171.png, figs/img172.png, figs/img173.png, figs/img174.png,
|
||||
figs/img175.png, figs/img176.png, figs/img177.png, figs/img178.png,
|
||||
figs/img179.png, figs/img18.png, figs/img180.png, figs/img181.png,
|
||||
figs/img182.png, figs/img183.png, figs/img184.png, figs/img185.png,
|
||||
figs/img186.png, figs/img187.png, figs/img188.png, figs/img189.png,
|
||||
figs/img19.png, figs/img190.png, figs/img191.png, figs/img192.png,
|
||||
figs/img193.png, figs/img194.png, figs/img195.png, figs/img196.png,
|
||||
figs/img197.png, figs/img198.png, figs/img199.png, figs/img2.png,
|
||||
figs/img20.png, figs/img200.png, figs/img201.png, figs/img202.png,
|
||||
figs/img203.png, figs/img204.png, figs/img205.png, figs/img206.png,
|
||||
figs/img207.png, figs/img208.png, figs/img209.png, figs/img21.png,
|
||||
figs/img210.png, figs/img211.png, figs/img212.png, figs/img213.png,
|
||||
figs/img214.png, figs/img215.png, figs/img216.png, figs/img217.png,
|
||||
figs/img218.png, figs/img219.png, figs/img22.png, figs/img220.png,
|
||||
figs/img221.png, figs/img222.png, figs/img223.png, figs/img224.png,
|
||||
figs/img225.png, figs/img226.png, figs/img227.png, figs/img228.png,
|
||||
figs/img229.png, figs/img23.png, figs/img230.png, figs/img231.png,
|
||||
figs/img232.png, figs/img233.png, figs/img234.png, figs/img235.png,
|
||||
figs/img236.png, figs/img237.png, figs/img238.png, figs/img239.png,
|
||||
figs/img24.png, figs/img240.png, figs/img241.png, figs/img242.png,
|
||||
figs/img243.png, figs/img244.png, figs/img245.png, figs/img246.png,
|
||||
figs/img247.png, figs/img248.png, figs/img249.png, figs/img25.png,
|
||||
figs/img250.png, figs/img251.png, figs/img252.png, figs/img253.png,
|
||||
figs/img26.png, figs/img27.png, figs/img28.png, figs/img29.png,
|
||||
figs/img3.png, figs/img30.png, figs/img31.png, figs/img32.png,
|
||||
figs/img33.png, figs/img34.png, figs/img35.png, figs/img36.png,
|
||||
figs/img37.png, figs/img38.png, figs/img39.png, figs/img4.png,
|
||||
figs/img40.png, figs/img41.png, figs/img42.png, figs/img43.png,
|
||||
figs/img44.png, figs/img45.png, figs/img46.png, figs/img47.png,
|
||||
figs/img48.png, figs/img49.png, figs/img5.png, figs/img50.png,
|
||||
figs/img51.png, figs/img52.png, figs/img53.png, figs/img54.png,
|
||||
figs/img55.png, figs/img56.png, figs/img57.png, figs/img58.png,
|
||||
figs/img59.png, figs/img6.png, figs/img60.png, figs/img61.png,
|
||||
figs/img62.png, figs/img63.png, figs/img64.png, figs/img65.png,
|
||||
figs/img66.png, figs/img67.png, figs/img68.png, figs/img69.png,
|
||||
figs/img7.png, figs/img70.png, figs/img71.png, figs/img72.png,
|
||||
figs/img73.png, figs/img74.png, figs/img75.png, figs/img76.png,
|
||||
figs/img77.png, figs/img78.png, figs/img79.png, figs/img8.png,
|
||||
figs/img80.png, figs/img81.png, figs/img82.png, figs/img83.png,
|
||||
figs/img84.png, figs/img85.png, figs/img86.png, figs/img87.png,
|
||||
figs/img88.png, figs/img89.png, figs/img9.png, figs/img90.png,
|
||||
figs/img91.png, figs/img92.png, figs/img93.png, figs/img94.png,
|
||||
figs/img95.png, figs/img96.png, figs/img97.png, figs/img98.png,
|
||||
figs/img99.png: Initial version
|
||||
|
||||
2005-01-28 18:07 fc_botelho
|
||||
|
||||
* BMZ.t2t, CHM.t2t, COMPARISON.t2t, CONFIG.t2t, README.t2t: Improved
|
||||
the documentation of the BMZ and CHM algorithms
|
||||
|
||||
2005-01-27 18:07 fc_botelho
|
||||
|
||||
* BMZ.t2t, CHM.t2t, FAQ.t2t: The history of the BMZ algorithm is now available
|
||||
|
||||
2005-01-27 14:23 fc_botelho
|
||||
|
||||
* AUTHORS: Added the authors' email addresses
|
||||
|
||||
2005-01-27 14:21 fc_botelho
|
||||
|
||||
* BMZ.t2t, CHM.t2t, COMPARISON.t2t, FAQ.t2t, FOOTER.t2t, GPERF.t2t,
|
||||
README.t2t: Added the FOOTER.t2t file
|
||||
|
||||
2005-01-27 12:16 fc_botelho
|
||||
|
||||
* src/cmph_types.h: Removed the pjw and glib functions from
|
||||
cmph_hash_names vector
|
||||
|
||||
2005-01-27 12:12 fc_botelho
|
||||
|
||||
* src/hash.c: Removed the pjw and glib functions from
|
||||
cmph_hash_names vector
|
||||
|
||||
2005-01-27 11:01 davi
|
||||
|
||||
* FAQ.t2t, README, README.t2t, gendocs, src/bmz.c, src/bmz.h,
|
||||
src/chm.c, src/chm.h, src/cmph.c, src/cmph_structs.c, src/debug.h,
|
||||
src/main.c: Fix to alternate hash functions code. Removed htonl
|
||||
stuff from chm algorithm. Added faq.
|
||||
|
||||
2005-01-27 09:14 fc_botelho
|
||||
|
||||
* README.t2t: Corrected some formatting mistakes
|
||||
|
||||
2005-01-26 22:04 davi
|
||||
|
||||
* BMZ.t2t, CHM.t2t, COMPARISON.t2t, GPERF.t2t, README, README.t2t,
|
||||
gendocs: Added gperf notes.
|
||||
|
||||
2005-01-25 19:10 fc_botelho
|
||||
|
||||
* INSTALL: generated in version 0.3
|
||||
|
||||
2005-01-25 19:09 fc_botelho
|
||||
|
||||
* src/: czech.c, czech.h, czech_structs.h: The czech.h,
|
||||
czech_structs.h and czech.c files were removed
|
||||
|
||||
2005-01-25 19:06 fc_botelho
|
||||
|
||||
* src/: chm.c, chm.h, chm_structs.h, cmph.c, cmph_types.h, main.c,
|
||||
Makefile.am: The prefix czech was changed to chm
|
||||
|
||||
2005-01-25 18:50 fc_botelho
|
||||
|
||||
* gendocs: script to generate the documentation and the README file
|
||||
|
||||
2005-01-25 18:47 fc_botelho
|
||||
|
||||
* README: README was updated
|
||||
|
||||
2005-01-25 18:44 fc_botelho
|
||||
|
||||
* configure.ac: Version was updated
|
||||
|
||||
2005-01-25 18:42 fc_botelho
|
||||
|
||||
* src/cmph.h: Vector adapter commented
|
||||
|
||||
2005-01-25 18:40 fc_botelho
|
||||
|
||||
* CHM.t2t, CONFIG.t2t, LOGO.html: Included the PreProc macro
|
||||
through the CONFIG.t2t file and the LOGO through the LOGO.html file
|
||||
|
||||
2005-01-25 18:33 fc_botelho
|
||||
|
||||
* README.t2t, BMZ.t2t, COMPARISON.t2t, CZECH.t2t: Included
|
||||
the PreProc macro through the CONFIG.t2t file and the LOGO through
|
||||
the LOGO.html file
|
||||
|
||||
2005-01-24 18:25 fc_botelho
|
||||
|
||||
* src/: bmz.c, bmz.h, cmph_structs.c, cmph_structs.h, czech.c,
|
||||
cmph.c, czech.h, main.c, cmph.h: The file adapter was implemented.
|
||||
|
||||
2005-01-24 17:20 fc_botelho
|
||||
|
||||
* README.t2t: the memory consumption to create an MPHF using bmz
|
||||
with a heuristic was fixed.
|
||||
|
||||
2005-01-24 17:11 fc_botelho
|
||||
|
||||
* src/: cmph_types.h, main.c: The algorithms and hash functions
|
||||
were put in alphabetical order
|
||||
|
||||
2005-01-24 16:15 fc_botelho
|
||||
|
||||
* BMZ.t2t, COMPARISON.t2t, CZECH.t2t, README.t2t: Fixed some
|
||||
English mistakes and included the files BMZ.t2t, CZECH.t2t
|
||||
and COMPARISON.t2t
|
||||
|
||||
2005-01-21 19:19 davi
|
||||
|
||||
* ChangeLog, Doxyfile: Added Doxyfile.
|
||||
|
||||
2005-01-21 19:14 davi
|
||||
|
||||
* README.t2t, wingetopt.c, src/cmph.h, tests/graph_tests.c: Fixed
|
||||
wingetopt.c
|
||||
|
||||
2005-01-21 18:44 fc_botelho
|
||||
|
||||
* src/Makefile.am: included files bitbool.h and bitbool.c
|
||||
|
||||
2005-01-21 18:42 fc_botelho
|
||||
|
||||