Documentation updated for release 0.9

This commit is contained in:
Fabiano C. Botelho 2009-06-12 21:49:26 -03:00
parent b8aa2106e9
commit 088389184f
274 changed files with 2951 additions and 217 deletions

View File

@ -2,5 +2,5 @@
---------------------------------------- ----------------------------------------
| [Home index.html] | [BDZ bdz.html] | [BMZ bmz.html] | [CHM chm.html] | [BRZ brz.html] | [FCH fch.html] | [Home index.html] | [CHD chd.html] | [BDZ bdz.html] | [BMZ bmz.html] | [CHM chm.html] | [BRZ brz.html] | [FCH fch.html]
---------------------------------------- ----------------------------------------

85
BDZ.t2t
View File

@ -6,33 +6,64 @@ BDZ Algorithm
---------------------------------------- ----------------------------------------
==Introduction== ==Introduction==
Coming soon... The BDZ algorithm was designed by Fabiano C. Botelho, Djamal Belazzougui, Rasmus Pagh and Nivio Ziviani. It is a simple, efficient and practical algorithm, with near-optimal space usage, to generate a family [figs/bdz/img8.png] of PHFs and MPHFs. It is also referred to as the BPZ algorithm because of the work presented by Botelho, Pagh and Ziviani in [[2 #papers]]. In Botelho's PhD dissertation [[1 #papers]] it is also referred to as the RAM algorithm because it is more suitable for key sets that can be handled in internal memory.
The BDZ algorithm uses //r//-uniform random hypergraphs given by function values of //r// uniform random hash functions on the input key set //S// for generating PHFs and MPHFs that require //O(n)// bits to be stored. A hypergraph is the generalization of a standard undirected graph where each edge connects [figs/bdz/img12.png] vertices. This idea is not new, see e.g. [[8 #papers]], but we have proceeded differently to achieve a space usage of //O(n)// bits rather than //O(n log n)// bits. Evaluation time for all schemes considered is constant. For //r=3// we obtain a space usage of approximately //2.6n// bits for an MPHF. More compact, and even simpler, representations can be achieved for larger //m//. For example, for //m=1.23n// we can get a space usage of //1.95n// bits.
Our best MPHF space upper bound is within a factor of //2// from the information theoretical lower bound of approximately //1.44// bits per key. We have shown that the BDZ algorithm is far more practical than previous methods with proven space complexity, both because of its simplicity, and because the constant factor of the space complexity is more than //6// times lower than that of its closest competitor, for plausible problem sizes. We verify the practicality experimentally, using slightly more space than in the mentioned theoretical bounds.
---------------------------------------- ----------------------------------------
==The Algorithm== ==The Algorithm==
Coming soon... The BDZ algorithm is a three-step algorithm that generates PHFs and MPHFs based on random //r//-partite hypergraphs. This approach provides a much tighter analysis and is much simpler than the one presented in [[3 #papers]], where it was implicit how to construct similar PHFs. The fastest and most compact functions are generated when //r=3//. In this case a PHF can be stored in approximately //1.95// bits per key and an MPHF in approximately //2.62// bits per key.
Figure 1 gives an overview of the algorithm for //r=3//, taking as input a key set [figs/bdz/img22.png] containing three English words, i.e., //S={who,band,the}//. The edge-oriented data structure proposed in [[4 #papers]] is used to represent hypergraphs, where each edge is explicitly represented as an array of //r// vertices and, for each vertex //v//, there is a list of edges that are incident on //v//.
| [figs/bdz/img50.png]
| **Figure 1:** (a) The mapping step generates a random acyclic //3//-partite hypergraph
| with //m=6// vertices and //n=3// edges and a list [figs/bdz/img4.png] of edges obtained when we test
| whether the hypergraph is acyclic. (b) The assigning step builds an array //g// that
| maps values from //[0,5]// to //[0,3]// to uniquely assign an edge to a vertex. (c) The ranking
| step builds the data structure used to compute function //rank// in //O(1)// time.
----------------------------------------
===Mapping Step=== The //Mapping Step// in Figure 1(a) carries out two important tasks:
Coming soon... + It assumes that it is possible to find three uniform hash functions //h,,0,,//, //h,,1,,// and //h,,2,,//, with ranges //{0,1}//, //{2,3}// and //{4,5}//, respectively. These functions build a one-to-one mapping of the key set //S// to the edge set //E// of a random acyclic //3//-partite hypergraph //G=(V,E)//, where //|V|=m=6// and //|E|=n=3//. In [[1,2 #papers]] it is shown that it is possible to obtain such a hypergraph with probability tending to //1// as //n// tends to infinity whenever //m=cn// and //c > 1.22//. The value of //c// that minimizes the hypergraph size (and thereby the number of bits needed to represent the resulting functions) is in the range //(1.22,1.23)//. To illustrate the mapping, key "who" is mapped to edge //{h,,0,,("who"), h,,1,,("who"), h,,2,,("who")} = {1,3,5}//, key "band" is mapped to edge //{h,,0,,("band"), h,,1,,("band"), h,,2,,("band")} = {1,2,4}//, and key "the" is mapped to edge //{h,,0,,("the"), h,,1,,("the"), h,,2,,("the")} = {0,2,5}//.
---------------------------------------- + It tests whether the resulting random //3//-partite hypergraph contains cycles by iteratively deleting edges connecting vertices of degree 1. The deleted edges are stored in the order of deletion in a list [figs/bdz/img4.png] to be used in the assigning step. The first deleted edge in Figure 1(a) was //{1,2,4}//, the second one was //{1,3,5}// and the third one was //{0,2,5}//. If it ends with an empty graph, then the test succeeds, otherwise it fails.
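The acyclicity test described in the second item can be sketched in C as follows. This is only an illustration under assumptions: the name `sketch_peel` and its parameters are hypothetical and are not part of the CMPH API, the hypergraph is given as a plain array of edges instead of the edge-oriented data structure of [[4 #papers]], and the scan is quadratic in the worst case for the sake of brevity. The order in which edges are deleted depends on the scan order, so it may differ from the order shown in Figure 1(a).

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper (not part of the CMPH API).
 * Repeatedly deletes any edge that has at least one vertex of degree 1.
 * edges[e] holds the three vertices of edge e, each in the range [0,m-1].
 * order[] receives the indices of the deleted edges in order of deletion.
 * Returns 1 if every edge could be deleted (the test succeeds), 0 otherwise. */
int sketch_peel(unsigned m, unsigned n, const unsigned edges[][3], unsigned *order)
{
    unsigned *degree = calloc(m ? m : 1, sizeof *degree);
    unsigned char *deleted = calloc(n ? n : 1, 1);
    unsigned ndeleted = 0, e, j;
    int progress = 1;

    if (degree == NULL || deleted == NULL) {
        free(degree);
        free(deleted);
        return 0;
    }
    /* Count how many edges are incident on each vertex. */
    for (e = 0; e < n; e++)
        for (j = 0; j < 3; j++) {
            assert(edges[e][j] < m);
            degree[edges[e][j]]++;
        }
    /* Keep scanning until a full pass deletes nothing. */
    while (progress) {
        progress = 0;
        for (e = 0; e < n; e++) {
            if (deleted[e])
                continue;
            if (degree[edges[e][0]] == 1 || degree[edges[e][1]] == 1 ||
                degree[edges[e][2]] == 1) {
                for (j = 0; j < 3; j++)
                    degree[edges[e][j]]--;
                deleted[e] = 1;
                order[ndeleted++] = e;
                progress = 1;
            }
        }
    }
    free(degree);
    free(deleted);
    return ndeleted == n;
}
```

For the three edges of Figure 1(a) the test succeeds, whereas two identical edges leave every vertex with degree 2 and the test fails.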
===Assigning Step===
Coming soon...
---------------------------------------- We now show how to use the Jenkins hash functions [[7 #papers]] to implement the three hash functions //h,,i,,//, which map values from //S// to //V,,i,,//, where [figs/bdz/img52.png]. These functions are used to build a random //3//-partite hypergraph, where [figs/bdz/img53.png] and [figs/bdz/img54.png]. Let [figs/bdz/img55.png] be a Jenkins hash function for [figs/bdz/img56.png], where
//w=32 or 64// for 32-bit and 64-bit architectures, respectively.
Let //H'// be an array of 3 //w//-bit values. The Jenkins hash function
allows us to compute in parallel the three entries in //H'//
and thereby the three hash functions //h,,i,,//, as follows:
===Ranking Step=== | //H' = h'(x)//
| //h,,0,,(x) = H'[0] mod// [figs/bdz/img136.png]
| //h,,1,,(x) = H'[1] mod// [figs/bdz/img136.png] //+// [figs/bdz/img136.png]
| //h,,2,,(x) = H'[2] mod// [figs/bdz/img136.png] //+ 2//[figs/bdz/img136.png]
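The computation of the three vertices can be sketched in C as follows. This is only an illustration under assumptions: the name `sketch_edge` is hypothetical and is not part of the CMPH API, the three values of //H'// are taken as an input array rather than computed by a Jenkins function, and the modulus that appears as an image in the formulas above is passed in as the parameter `part`, since its value is not spelled out in the text. In the example of Figure 1 each of the three ranges contains two vertices, so `part` would be 2 there.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not part of the CMPH API).
 * hv[]  : the three hash values H'[0], H'[1], H'[2] (32-bit case, w = 32).
 * part  : number of vertices in each of the three parts (assumed parameter).
 * edge[]: receives h0(x), h1(x), h2(x), one vertex per part. */
void sketch_edge(const uint32_t hv[3], uint32_t part, uint32_t edge[3])
{
    assert(part > 0);
    edge[0] = hv[0] % part;            /* vertex in the first part  */
    edge[1] = hv[1] % part + part;     /* vertex in the second part */
    edge[2] = hv[2] % part + 2 * part; /* vertex in the third part  */
}
```

Because each vertex is taken from a different part, the three vertices of an edge are always distinct, which is what makes the hypergraph //3//-partite.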
The //Assigning Step// in Figure 1(b) outputs a PHF that maps the key set //S// into the range //[0,m-1]// and is represented by an array //g// storing values from the range //[0,3]//. The array //g// allows us to select one out of the //3// vertices of a given edge, which is associated with a key //k//. A vertex for a key //k// is given by either //h,,0,,(k)//, //h,,1,,(k)// or //h,,2,,(k)//. The function //h,,i,,(k)// to be used for //k// is chosen by calculating //i = (g[h,,0,,(k)] + g[h,,1,,(k)] + g[h,,2,,(k)]) mod 3//. For instance, the values 1 and 4 represent the keys "who" and "band" because //i = (g[1] + g[3] + g[5]) mod 3 = 0// and //h,,0,,("who") = 1//, and //i = (g[1] + g[2] + g[4]) mod 3 = 2// and //h,,2,,("band") = 4//, respectively. Let //Visited// be a boolean vector of size //m// that indicates whether a vertex has been visited. The assigning step first initializes //g[i]=3// to mark every vertex as unassigned and //Visited[i] = false//, [figs/bdz/img88.png]. Then, for each edge [figs/bdz/img90.png] from tail to head, it looks for the first vertex //u// belonging to //e// that has not yet been visited. This is a sufficient condition for success [[1,2,8 #papers]]. Let //j// be the index of //u// in //e// for //j// in the range //[0,2]//. Then, it assigns [figs/bdz/img95.png]. Whenever it passes through a vertex //u// from //e//, if //u// has not yet been visited, it sets //Visited[u] = true//.
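The assigning step can be sketched in C as follows. This is only an illustration under assumptions: the name `sketch_assign` and its parameters are hypothetical and are not part of the CMPH API, and the assignment formula, which appears only as an image above, is not copied from the source. Instead the sketch uses the one value that makes the selection rule //i = (g[h,,0,,(k)] + g[h,,1,,(k)] + g[h,,2,,(k)]) mod 3// return //j//, namely //j// minus the sum of the other two entries, taken modulo 3.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of the CMPH API).
 * edges[e]: the three vertices of edge e, each in [0,m-1].
 * order[] : edge indices in order of deletion (output of the mapping step).
 * g[]     : output array of size m, values in [0,3]; 3 means "unassigned".
 * visited[]: scratch array of size m. */
void sketch_assign(unsigned m, unsigned n, const unsigned edges[][3],
                   const unsigned *order, unsigned char *g,
                   unsigned char *visited)
{
    unsigned v, k, j;

    assert(g != NULL && visited != NULL);
    for (v = 0; v < m; v++) {
        g[v] = 3;       /* every vertex starts unassigned */
        visited[v] = 0;
    }
    /* Traverse the deletion list from tail to head. */
    for (k = n; k-- > 0; ) {
        const unsigned *e = edges[order[k]];
        int chosen = -1;
        unsigned others = 0;

        /* Pick the first unvisited vertex and mark all vertices of e. */
        for (j = 0; j < 3; j++) {
            if (!visited[e[j]]) {
                if (chosen < 0)
                    chosen = (int)j;
                visited[e[j]] = 1;
            }
        }
        if (chosen < 0)
            continue; /* cannot happen for a list produced by a successful test */
        for (j = 0; j < 3; j++)
            if ((int)j != chosen)
                others += g[e[j]];
        /* others is at most 6, so adding 9 keeps the value non-negative. */
        g[e[chosen]] = (unsigned char)(((unsigned)chosen + 9u - others) % 3u);
    }
}
```

Running the sketch by hand on the edges and the deletion order of Figure 1(a) gives //g = (0,0,3,3,2,3)//, which agrees with the values of //i// computed for "who" and "band" in the text. This array was worked out from the description and is not printed in the source.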
If we stop the BDZ algorithm in the assigning step we obtain a PHF with range //[0,m-1]//. The PHF has the following form: //phf(x) = h,,i(x),,(x)//, where key //x// is in //S// and //i(x) = (g[h,,0,,(x)] + g[h,,1,,(x)] + g[h,,2,,(x)]) mod 3//. In this case we do not need information for ranking and can set //g[i] = 0// whenever //g[i]// is equal to //3//, where //i// is in the range //[0,m-1]//. Therefore, the range of the values stored in //g// is narrowed from //[0,3]// to //[0,2]//. By applying arithmetic coding to blocks of values (see [[1,2 #papers]] for details), or any compression technique that allows random access in constant time to an array of compressed values [[5,6,12 #papers]], we can store the resulting PHFs in //m log 3 = cn log 3// bits, where //c > 1.22//. For //c = 1.23//, the space requirement is //1.95n// bits.
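Evaluating the PHF can be sketched in C as follows. This is only an illustration under assumptions: the name `sketch_phf` is hypothetical and is not part of the CMPH API, the three vertices //h,,0,,(x)//, //h,,1,,(x)// and //h,,2,,(x)// are passed in already computed, and //g// is stored as one byte per entry rather than in compressed form.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of the CMPH API).
 * g[]   : array produced by the assigning step, values in [0,3].
 * edge[]: the vertices h0(x), h1(x), h2(x) of the key x.
 * Returns phf(x), the vertex selected for x, in the range [0,m-1]. */
unsigned sketch_phf(const unsigned char *g, const unsigned edge[3])
{
    unsigned i;

    assert(g != NULL);
    /* An unassigned entry holds 3, which contributes 0 modulo 3. */
    i = (g[edge[0]] + g[edge[1]] + g[edge[2]]) % 3;
    return edge[i];
}
```

With //g = (0,0,3,3,2,3)//, an array worked out by hand from the example of Figure 1 and not printed in the source, the keys "who", "band" and "the" are mapped to the vertices 1, 4 and 0, respectively. Since the value 3 only ever contributes 0 to the sum, replacing it by 0 does not change the result, which is why the range of //g// can be narrowed to //[0,2]// for a PHF.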
The //Ranking Step// in Figure 1(c) outputs a data structure that narrows the range of a PHF generated in the assigning step from //[0,m-1]// to //[0,n-1]//, thereby producing an MPHF. The data structure allows us to compute in constant time a function //rank// from //[0,m-1]// to //[0,n-1]// that counts the number of assigned positions before a given position //v// in //g//. For instance, //rank(4) = 2// because the positions //0// and //1// are assigned since //g[0]// and //g[1]// are not equal to //3//.
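The definition of //rank// can be sketched in C as follows. This is only an illustration of what the function computes: the name `sketch_rank_naive` is hypothetical and is not part of the CMPH API, and the loop takes time linear in //v//, whereas the data structure built by the ranking step answers the same query in constant time.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of the CMPH API).
 * Counts the assigned positions (entries different from 3) that come
 * strictly before position v in g. Linear time, for illustration only. */
unsigned sketch_rank_naive(const unsigned char *g, unsigned v)
{
    unsigned i, count = 0;

    assert(g != NULL);
    for (i = 0; i < v; i++)
        if (g[i] != 3)
            count++;
    return count;
}
```

With //g = (0,0,3,3,2,3)//, an array worked out by hand from the example of Figure 1 and not printed in the source, the call for position 4 returns 2, as stated in the text.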
For the implementation of the ranking step we have borrowed a simple and efficient implementation from [[10 #papers]]. It requires [figs/bdz/img111.png] additional bits of space, where [figs/bdz/img112.png], and is obtained by storing explicitly the //rank// of every //k//th index in a rankTable, where [figs/bdz/img114.png]. The larger //k// is, the more compact the resulting MPHF is. Therefore, users can trade off space for evaluation time by setting //k// appropriately in the implementation. We only allow values for //k// that are powers of two (i.e., //k=2^^b,,k,,^^// for some constant //b,,k,,//) in order to replace the expensive division and modulo operations by bit-shift and bitwise "and" operations, respectively. We have used //k=256// in the experiments for generating more succinct MPHFs. We remark that it is still possible to obtain a more compact data structure by using the results presented in [[9,11 #papers]], but at the cost of a much more complex implementation.
We need to use an additional lookup table //T,,r,,// to guarantee the constant evaluation time of //rank(u)//. Let us illustrate how //rank(u)// is computed using both the rankTable and the lookup table //T,,r,,//. We first look up the rank of the largest precomputed index //v// lower than or equal to //u// in the rankTable, and use //T,,r,,// to count the number of assigned vertices from position //v// to //u-1//. The lookup table //T,,r,,// allows us to count in constant time the number of assigned vertices in [figs/bdz/img122.png] bits, where [figs/bdz/img112.png]. Thus the actual evaluation time is [figs/bdz/img123.png]. For simplicity and without loss of generality we let [figs/bdz/img124.png] be a multiple of the number of bits [figs/bdz/img125.png] used to encode each entry of //g//. As the values in //g// come from the range //[0,3]//,
then [figs/bdz/img126.png] bits and we have tried [figs/bdz/img124.png] equal to //8// and //16//. We would expect that [figs/bdz/img124.png] equal to 16 should provide a faster evaluation time because we would need to carry out fewer lookups in //T,,r,,//. However, for both values the lookup table //T,,r,,// fits entirely in the CPU cache and we did not observe any significant difference in the evaluation times. Therefore we settle for the value //8//. We remark that each value of //r// requires a different lookup table //T,,r,,// that can be generated a priori.
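The use of the rankTable can be sketched in C as follows. This is only an illustration under assumptions: the names `sketch_build_ranktable` and `sketch_rank` are hypothetical and are not part of the CMPH API, //g// is stored as one byte per entry instead of two bits, and the lookup table //T,,r,,// is replaced by a plain scan of at most //k-1// entries, so the sketch shows the sampling idea but not the table-based counting.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper (not part of the CMPH API).
 * Stores the rank of every k-th index of g, where k = 2^b.
 * Returns a malloc'ed table (caller frees it) or NULL on failure. */
unsigned *sketch_build_ranktable(const unsigned char *g, unsigned m, unsigned b)
{
    unsigned k = 1u << b;
    unsigned entries = (m + k - 1) >> b; /* ceil(m / k) */
    unsigned *table;
    unsigned i, count = 0;

    if (entries == 0)
        entries = 1;
    table = calloc(entries, sizeof *table);
    if (table == NULL)
        return NULL;
    for (i = 0; i < m; i++) {
        if ((i & (k - 1)) == 0) /* i is a multiple of k */
            table[i >> b] = count;
        if (g[i] != 3)
            count++;
    }
    return table;
}

/* Hypothetical helper (not part of the CMPH API).
 * rank(u) for u in [0,m-1]: sampled rank plus a short scan. */
unsigned sketch_rank(const unsigned char *g, const unsigned *table,
                     unsigned b, unsigned u)
{
    unsigned v = (u >> b) << b;      /* largest sampled index <= u */
    unsigned count = table[u >> b];  /* rank(v), precomputed */

    assert(g != NULL);
    for (; v < u; v++)               /* stand-in for the lookups in T_r */
        if (g[v] != 3)
            count++;
    return count;
}
```

The shift `u >> b` and the mask `i & (k - 1)` are the bit operations that replace division and modulo when //k// is a power of two.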
The resulting MPHFs have the following form: //h(x) = rank(phf(x))//. Therefore, we cannot get rid of the ranking information by replacing the values 3 by 0 in the entries of //g//. In this case each entry in the array //g// is encoded with //2// bits and we need [figs/bdz/img133.png] additional bits to compute function //rank// in constant time. Then, the total space to store the resulting functions is [figs/bdz/img134.png] bits. By using //c = 1.23// and [figs/bdz/img135.png] we have obtained MPHFs that require approximately //2.62// bits per key to be stored.
Coming soon...
---------------------------------------- ----------------------------------------
@ -106,16 +137,38 @@ So we have:
==Experimental Results== ==Experimental Results==
Experimental results to compare the BDZ algorithm with the other ones in the CMPH Experimental results to compare the BDZ algorithm with the other ones in the CMPH
library are presented in Botelho, Pagh and Ziviani [[1 #papers],[2 #papers]]. library are presented in Botelho, Pagh and Ziviani [[1,2 #papers]].
---------------------------------------- ----------------------------------------
==Papers==[papers] ==Papers==[papers]
+ [F. C. Botelho http://www.dcc.ufmg.br/~fbotelho], R. Pagh, [N. Ziviani http://www.dcc.ufmg.br/~nivio]. [Simple and space-efficient minimal perfect hash functions papers/wads07.pdf]. //10th International Workshop on Algorithms and Data Structures (WADs'07),// Springer-Verlag Lecture Notes in Computer Science, vol. 4619, Halifax, Canada, August 2007, 139-150. + [F. C. Botelho http://www.dcc.ufmg.br/~fbotelho]. [Near-Optimal Space Perfect Hashing Algorithms papers/thesis.pdf]. //PhD. Thesis//, //Department of Computer Science//, //Federal University of Minas Gerais//, September 2008. Supervised by [N. Ziviani http://www.dcc.ufmg.br/~nivio].
+ [F. C. Botelho http://www.dcc.ufmg.br/~fbotelho]. [Near Space-Optimal Perfect Hashing Algorithms papers/thesis.pdf]. //Thesis Proposal//, //Department of Computer Science//, //Federal University of Minas Gerais//, July 2007. + [F. C. Botelho http://www.dcc.ufmg.br/~fbotelho], [R. Pagh http://www.itu.dk/~pagh/], [N. Ziviani http://www.dcc.ufmg.br/~nivio]. [Simple and space-efficient minimal perfect hash functions papers/wads07.pdf]. //In Proceedings of the 10th International Workshop on Algorithms and Data Structures (WADs'07),// Springer-Verlag Lecture Notes in Computer Science, vol. 4619, Halifax, Canada, August 2007, 139-150.
+ B. Chazelle, J. Kilian, R. Rubinfeld, and A. Tal. The bloomier filter: An efficient data structure for static support lookup tables. //In Proceedings of the 15th annual ACM-SIAM symposium on Discrete algorithms (SODA'04)//, pages 30-39, Philadelphia, PA, USA, 2004. Society for Industrial and Applied Mathematics.
+ J. Ebert. A versatile data structure for edges oriented graph algorithms. //Communication of The ACM//, (30):513-519, 1987.
+ K. Fredriksson and F. Nikitin. Simple compression code supporting random access and fast string matching. //In Proceedings of the 6th International Workshop on Efficient and Experimental Algorithms (WEA07)//, pages 203-216, 2007.
+ R. Gonzalez and G. Navarro. Statistical encoding of succinct data structures. //In Proceedings of the 19th Annual Symposium on Combinatorial Pattern Matching (CPM06)//, pages 294-305, 2006.
+ B. Jenkins. Algorithm alley: Hash functions. //Dr. Dobb's Journal of Software Tools//, 22(9), September 1997. Extended version available at [http://burtleburtle.net/bob/hash/doobs.html http://burtleburtle.net/bob/hash/doobs.html].
+ B.S. Majewski, N.C. Wormald, G. Havas, and Z.J. Czech. A family of perfect hashing methods. //The Computer Journal//, 39(6):547-554, 1996.
+ D. Okanohara and K. Sadakane. Practical entropy-compressed rank/select dictionary. //In Proceedings of the Workshop on Algorithm Engineering and Experiments (ALENEX07)//, 2007.
+ [R. Pagh http://www.itu.dk/~pagh/]. Low redundancy in static dictionaries with constant query time. //SIAM Journal on Computing//, 31(2):353-363, 2001.
+ R. Raman, V. Raman, and S. S. Rao. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. //In Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms (SODA02)//, pages 233-242, Philadelphia PA, USA, 2002. Society for Industrial and Applied Mathematics.
+ K. Sadakane and R. Grossi. Squeezing succinct data structures into entropy bounds. //In Proceedings of the 17th annual ACM-SIAM symposium on Discrete algorithms (SODA06)//, pages 1230-1239, 2006.
%!include: ALGORITHMS.t2t %!include: ALGORITHMS.t2t
%!include: FOOTER.t2t %!include: FOOTER.t2t
%!include(html): ''GOOGLEANALYTICS.t2t''

View File

@ -401,3 +401,5 @@ Again we have:
%!include: ALGORITHMS.t2t %!include: ALGORITHMS.t2t
%!include: FOOTER.t2t %!include: FOOTER.t2t
%!include(html): ''GOOGLEANALYTICS.t2t''

View File

@ -436,3 +436,5 @@ has smart policies for avoiding seeks and diminishing the average seek time
%!include: ALGORITHMS.t2t %!include: ALGORITHMS.t2t
%!include: FOOTER.t2t %!include: FOOTER.t2t
%!include(html): ''GOOGLEANALYTICS.t2t''

44
CHD.t2t Normal file
View File

@ -0,0 +1,44 @@
Compress, Hash and Displace: CHD Algorithm
%!includeconf: CONFIG.t2t
----------------------------------------
==Introduction==
The important performance parameters of a PHF are representation size, evaluation time and construction time. The representation size plays an important role when the whole function fits in a faster memory and the actual data is stored in a slower memory. For instance, compact PHFs can fit entirely in a CPU cache, and this makes their computation really fast by avoiding cache misses. The CHD algorithm plays an important role in this context. It was designed by Djamal Belazzougui, Fabiano C. Botelho, and Martin Dietzfelbinger in [[2 #papers]].
The CHD algorithm makes it possible to obtain PHFs with representation size very close to optimal while retaining //O(n)// construction time and //O(1)// evaluation time. For example, in the case //m=2n// we obtain a PHF that uses space //0.67// bits per key, and for //m=1.23n// we obtain space //1.4// bits per key, which was not achievable with previously known methods. The CHD algorithm is inspired by several known algorithms; the main new feature is that it combines a modification of Pagh's ``hash-and-displace'' approach with data compression on a sequence of hash function indices. That combination makes it possible to significantly reduce space usage while retaining linear construction time and constant query time. The CHD algorithm can also be used for //k//-perfect hashing, where at most //k// keys may be mapped to the same value. For the analysis we assume that fully random hash functions are given for free; such assumptions can be justified and were made in previous papers.
The compact PHFs generated by the CHD algorithm can be used in many applications in which we want to assign a unique identifier to each key without storing any information on the key. One of the most obvious applications of those functions (or //k//-perfect hash functions) is when we have a small fast memory in which we can store the perfect hash function while the keys and associated satellite data are stored in slower but larger memory. The size of a block or a transfer unit may be chosen so that //k// data items can be retrieved in one read access. In this case we can ensure that data associated with a key can be retrieved in a single probe to slower memory. This has been used for example in hardware routers [[4 #papers]].
The CHD algorithm generates the most compact PHFs and MPHFs we know of in //O(n)// time. The time required to evaluate the generated functions is constant (in practice less than //1.4// microseconds). The storage space of the resulting PHFs and MPHFs is within a factor of //1.43// of the information theoretic lower bound. The closest competitor is the algorithm by Dietzfelbinger and Pagh [[3 #papers]], but their algorithm does not work in linear time. Furthermore, the CHD algorithm can be tuned to run faster than the BPZ algorithm [[1 #papers]] (the fastest algorithm available in the literature so far) and to obtain more compact functions. The most impressive characteristic is that it has the ability, in principle, to approximate the information theoretic lower bound while being practical. A detailed description of the CHD algorithm can be found in [[2 #papers]].
----------------------------------------
==Experimental Results==
Experimental results comparing the CHD algorithm with [the BDZ algorithm bdz.html]
and others available in the CMPH library are presented in [[2 #papers]].
----------------------------------------
==Papers==[papers]
+ [F. C. Botelho http://www.dcc.ufmg.br/~fbotelho], [R. Pagh http://www.itu.dk/~pagh/], [N. Ziviani http://www.dcc.ufmg.br/~nivio]. [Simple and space-efficient minimal perfect hash functions papers/wads07.pdf]. //In Proceedings of the 10th International Workshop on Algorithms and Data Structures (WADs'07),// Springer-Verlag Lecture Notes in Computer Science, vol. 4619, Halifax, Canada, August 2007, 139-150.
+ [F. C. Botelho http://www.dcc.ufmg.br/~fbotelho], D. Belazzougui and M. Dietzfelbinger. [Compress, hash and displace papers/esa09.pdf]. //In Proceedings of the 17th European Symposium on Algorithms (ESA09)//. Springer LNCS, 2009.
+ M. Dietzfelbinger and [R. Pagh http://www.itu.dk/~pagh/]. Succinct data structures for retrieval and approximate membership. //In Proceedings of the 35th international colloquium on Automata, Languages and Programming (ICALP08)//, pages 385-396, Berlin, Heidelberg, 2008. Springer-Verlag.
+ B. Prabhakar and F. Bonomi. Perfect hashing for network applications. //In Proceedings of the IEEE International Symposium on Information Theory//. IEEE Press, 2006.
%!include: ALGORITHMS.t2t
%!include: FOOTER.t2t
%!include(html): ''GOOGLEANALYTICS.t2t''

View File

@ -84,3 +84,5 @@ Again we have:
%!include: ALGORITHMS.t2t %!include: ALGORITHMS.t2t
%!include: FOOTER.t2t %!include: FOOTER.t2t
%!include(html): ''GOOGLEANALYTICS.t2t''

View File

@ -107,3 +107,5 @@ to [figs/img250.png] seconds for [figs/img6.png].
%!include: ALGORITHMS.t2t %!include: ALGORITHMS.t2t
%!include: FOOTER.t2t %!include: FOOTER.t2t
%!include(html): ''GOOGLEANALYTICS.t2t''

View File

@ -52,3 +52,5 @@ data mining techniques.
%!include: ALGORITHMS.t2t %!include: ALGORITHMS.t2t
%!include: FOOTER.t2t %!include: FOOTER.t2t
%!include(html): ''GOOGLEANALYTICS.t2t''

View File

@ -44,3 +44,8 @@
%! PostProc(html): 'ALIGN="middle" SRC="figs/img248.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img237.png"\1>' %! PostProc(html): 'ALIGN="middle" SRC="figs/img248.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img237.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img249.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img249.png"\1>' %! PostProc(html): 'ALIGN="middle" SRC="figs/img249.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img249.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/img250.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img250.png"\1>' %! PostProc(html): 'ALIGN="middle" SRC="figs/img250.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img250.png"\1>'
%! PostProc(html): 'ALIGN="middle" SRC="figs/bdz/img8.png"(.*?)>' 'ALIGN="bottom" SRC="figs/bdz/img8.png"\1>'
% The ^ need to be escaped by \
%!postproc(html): \^\^(.*?)\^\^ <sup>\1</sup>
%!postproc(html): ,,(.*?),, <sub>\1</sub>

View File

@ -13,29 +13,44 @@ Using cmph is quite simple. Take a look in the following examples.
// Create minimal perfect hash function from in-memory vector // Create minimal perfect hash function from in-memory vector
int main(int argc, char **argv) int main(int argc, char **argv)
{ {
// Creating a filled vector
const char *vector[] = {"aaaaaaaaaa", "bbbbbbbbbb", "cccccccccc", "dddddddddd", "eeeeeeeeee",
"ffffffffff", "gggggggggg", "hhhhhhhhhh", "iiiiiiiiii", "jjjjjjjjjj"};
unsigned int nkeys = 10;
// Source of keys
cmph_io_adapter_t *source = cmph_io_vector_adapter((char **)vector, nkeys);
//Create minimal perfect hash function using the default (chm) algorithm. // Creating a filled vector
cmph_config_t *config = cmph_config_new(source); unsigned int i = 0;
cmph_t *hash = cmph_new(config); const char *vector[] = {"aaaaaaaaaa", "bbbbbbbbbb", "cccccccccc", "dddddddddd", "eeeeeeeeee",
cmph_config_destroy(config); "ffffffffff", "gggggggggg", "hhhhhhhhhh", "iiiiiiiiii", "jjjjjjjjjj"};
unsigned int nkeys = 10;
FILE* mphf_fd = fopen("temp.mph", "w");
// Source of keys
cmph_io_adapter_t *source = cmph_io_vector_adapter((char **)vector, nkeys);
//Find key //Create minimal perfect hash function using the brz algorithm.
const char *key = "jjjjjjjjjj"; cmph_config_t *config = cmph_config_new(source);
unsigned int id = cmph_search(hash, key, strlen(key)); cmph_config_set_algo(config, CMPH_BRZ);
fprintf(stderr, "Id:%u\n", id); cmph_config_set_mphf_fd(config, mphf_fd);
//Destroy hash cmph_t *hash = cmph_new(config);
cmph_destroy(hash); cmph_config_destroy(config);
cmph_io_vector_adapter_destroy(source); cmph_dump(hash, mphf_fd);
return 0; cmph_destroy(hash);
fclose(mphf_fd);
//Find key
mphf_fd = fopen("temp.mph", "r");
hash = cmph_load(mphf_fd);
while (i < nkeys) {
const char *key = vector[i];
unsigned int id = cmph_search(hash, key, (cmph_uint32)strlen(key));
fprintf(stderr, "key:%s -- hash:%u\n", key, id);
i++;
}
//Destroy hash
cmph_destroy(hash);
cmph_io_vector_adapter_destroy(source);
fclose(mphf_fd);
return 0;
} }
``` ```
Download [vector_adapter_ex1.c examples/vector_adapter_ex1.c]. This example does not work in versions below 0.3. Download [vector_adapter_ex1.c examples/vector_adapter_ex1.c]. This example does not work in versions below 0.6.
------------------------------- -------------------------------
``` ```
@ -45,9 +60,9 @@ Download [vector_adapter_ex1.c examples/vector_adapter_ex1.c]. This example does
#pragma pack(1) #pragma pack(1)
typedef struct { typedef struct {
cmph_uint32 id; cmph_uint32 id;
char key[11]; char key[11];
cmph_uint32 year; cmph_uint32 year;
} rec_t; } rec_t;
#pragma pack(0) #pragma pack(0)
@ -56,15 +71,15 @@ int main(int argc, char **argv)
// Creating a filled vector // Creating a filled vector
unsigned int i = 0; unsigned int i = 0;
rec_t vector[10] = {{1, "aaaaaaaaaa", 1999}, {2, "bbbbbbbbbb", 2000}, {3, "cccccccccc", 2001}, rec_t vector[10] = {{1, "aaaaaaaaaa", 1999}, {2, "bbbbbbbbbb", 2000}, {3, "cccccccccc", 2001},
{4, "dddddddddd", 2002}, {5, "eeeeeeeeee", 2003}, {6, "ffffffffff", 2004}, {4, "dddddddddd", 2002}, {5, "eeeeeeeeee", 2003}, {6, "ffffffffff", 2004},
{7, "gggggggggg", 2005}, {8, "hhhhhhhhhh", 2006}, {9, "iiiiiiiiii", 2007}, {7, "gggggggggg", 2005}, {8, "hhhhhhhhhh", 2006}, {9, "iiiiiiiiii", 2007},
{10,"jjjjjjjjjj", 2008}}; {10,"jjjjjjjjjj", 2008}};
unsigned int nkeys = 10; unsigned int nkeys = 10;
FILE* mphf_fd = fopen("temp_struct_vector.mph", "w"); FILE* mphf_fd = fopen("temp_struct_vector.mph", "w");
// Source of keys // Source of keys
cmph_io_adapter_t *source = cmph_io_struct_vector_adapter(vector, sizeof(rec_t), sizeof(cmph_uint32), 11, nkeys); cmph_io_adapter_t *source = cmph_io_struct_vector_adapter(vector, (cmph_uint32)sizeof(rec_t), (cmph_uint32)sizeof(cmph_uint32), 11, nkeys);
//Create minimal perfect hash function using the default (chm) algorithm. //Create minimal perfect hash function using the BDZ algorithm.
cmph_config_t *config = cmph_config_new(source); cmph_config_t *config = cmph_config_new(source);
cmph_config_set_algo(config, CMPH_BDZ); cmph_config_set_algo(config, CMPH_BDZ);
cmph_config_set_mphf_fd(config, mphf_fd); cmph_config_set_mphf_fd(config, mphf_fd);
@ -78,10 +93,10 @@ int main(int argc, char **argv)
mphf_fd = fopen("temp_struct_vector.mph", "r"); mphf_fd = fopen("temp_struct_vector.mph", "r");
hash = cmph_load(mphf_fd); hash = cmph_load(mphf_fd);
while (i < nkeys) { while (i < nkeys) {
const char *key = vector[i].key; const char *key = vector[i].key;
unsigned int id = cmph_search(hash, key, 11); unsigned int id = cmph_search(hash, key, 11);
fprintf(stderr, "key:%s -- hash:%u\n", key, id); fprintf(stderr, "key:%s -- hash:%u\n", key, id);
i++; i++;
} }
//Destroy hash //Destroy hash
@ -91,35 +106,35 @@ int main(int argc, char **argv)
return 0; return 0;
} }
``` ```
Download [struct_vector_adapter_ex3.c examples/struct_vector_adapter_ex3.c]. This example does not work in versions below 0.7. Download [struct_vector_adapter_ex3.c examples/struct_vector_adapter_ex3.c]. This example does not work in versions below 0.8.
------------------------------- -------------------------------
``` ```
#include <cmph.h> #include <cmph.h>
#include <stdio.h> #include <stdio.h>
#include <string.h> #include <string.h>
// Create minimal perfect hash function from in-disk keys using BMZ algorithm // Create minimal perfect hash function from in-disk keys using BDZ algorithm
int main(int argc, char **argv) int main(int argc, char **argv)
{ {
//Open file with newline separated list of keys //Open file with newline separated list of keys
FILE * keys_fd = fopen("keys.txt", "r"); FILE * keys_fd = fopen("keys.txt", "r");
cmph_t *hash = NULL; cmph_t *hash = NULL;
if (keys_fd == NULL) if (keys_fd == NULL)
{ {
fprintf(stderr, "File \"keys.txt\" not found\n"); fprintf(stderr, "File \"keys.txt\" not found\n");
exit(1); exit(1);
} }
// Source of keys // Source of keys
cmph_io_adapter_t *source = cmph_io_nlfile_adapter(keys_fd); cmph_io_adapter_t *source = cmph_io_nlfile_adapter(keys_fd);
cmph_config_t *config = cmph_config_new(source); cmph_config_t *config = cmph_config_new(source);
cmph_config_set_algo(config, CMPH_BMZ); cmph_config_set_algo(config, CMPH_BDZ);
hash = cmph_new(config); hash = cmph_new(config);
cmph_config_destroy(config); cmph_config_destroy(config);
//Find key //Find key
const char *key = "jjjjjjjjjj"; const char *key = "jjjjjjjjjj";
unsigned int id = cmph_search(hash, key, strlen(key)); unsigned int id = cmph_search(hash, key, (cmph_uint32)strlen(key));
fprintf(stderr, "Id:%u\n", id); fprintf(stderr, "Id:%u\n", id);
//Destroy hash //Destroy hash
cmph_destroy(hash); cmph_destroy(hash);
@ -128,8 +143,10 @@ int main(int argc, char **argv)
return 0; return 0;
} }
``` ```
Download [file_adapter_ex2.c examples/file_adapter_ex2.c] and [keys.txt examples/keys.txt] Download [file_adapter_ex2.c examples/file_adapter_ex2.c] and [keys.txt examples/keys.txt]. This example does not work in versions below 0.8.
%!include: ALGORITHMS.t2t %!include: ALGORITHMS.t2t
%!include: FOOTER.t2t %!include: FOOTER.t2t
%!include(html): ''GOOGLEANALYTICS.t2t''

View File

@ -34,3 +34,5 @@ one is executed?
%!include: ALGORITHMS.t2t %!include: ALGORITHMS.t2t
%!include: FOOTER.t2t %!include: FOOTER.t2t
%!include(html): ''GOOGLEANALYTICS.t2t''

View File

@ -43,3 +43,5 @@ We only need to store the //g// function and a constant number of bytes for the
%!include: ALGORITHMS.t2t %!include: ALGORITHMS.t2t
%!include: FOOTER.t2t %!include: FOOTER.t2t
%!include(html): ''GOOGLEANALYTICS.t2t''

9
GOOGLEANALYTICS.t2t Normal file
View File

@ -0,0 +1,9 @@
<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
try {
var pageTracker = _gat._getTracker("UA-7698683-2");
pageTracker._trackPageview();
} catch(err) {}</script>

View File

@ -35,3 +35,5 @@ the compiler programming area (detect reserved keywords).
%!include: ALGORITHMS.t2t %!include: ALGORITHMS.t2t
%!include: FOOTER.t2t %!include: FOOTER.t2t
%!include(html): ''GOOGLEANALYTICS.t2t''

View File

@ -3,6 +3,15 @@ News Log
%!includeconf: CONFIG.t2t %!includeconf: CONFIG.t2t
----------------------------------------
==News for version 0.9==
- [The CHD algorithm chd.html], which is an algorithm that can be tuned to generate MPHFs that require approximately 2.07 bits per key to be stored. The algorithm outperforms [the BDZ algorithm bdz.html] and therefore is the fastest one available in the literature for sets that can be treated in internal memory.
- [The CHD_PH algorithm chd.html], which is an algorithm to generate PHFs with load factor up to //99 %//. It is actually the CHD algorithm without the ranking step. If we set the load factor to //81 %//, which is the maximum that can be obtained with [the BDZ algorithm bdz.html], the resulting functions can be stored in //1.40// bits per key. The space requirement increases with the load factor.
- All reported bugs have been fixed and the suggestions received have been incorporated.
---------------------------------------- ----------------------------------------
==News for version 0.8== ==News for version 0.8==
@ -61,3 +70,5 @@ News Log
%!include: ALGORITHMS.t2t %!include: ALGORITHMS.t2t
%!include: FOOTER.t2t %!include: FOOTER.t2t
%!include(html): ''GOOGLEANALYTICS.t2t''

View File

@ -42,43 +42,61 @@ The CMPH Library encapsulates the newest and more efficient algorithms in an eas
==Supported Algorithms== ==Supported Algorithms==
%html% - [BDZ Algorithm bdz.html]. %html% - [CHD Algorithm chd.html]:
%txt% - BDZ Algorithm. %txt% - CHD Algorithm:
The fastest algorithm to build PHFs and MPHFs. It is based on random 3-graphs. A 3-graph is a - It is the fastest algorithm to build PHFs and MPHFs in linear time.
generalization of a graph where each edge connects 3 vertices instead of only 2. The - It generates the most compact PHFs and MPHFs we know of.
resulting functions are not order preserving and can be stored in only //(2 + x)cn// - It can generate PHFs with a load factor up to //99 %//.
bits, where //c// should be larger than or equal to //1.23// and //x// is a constant - It can be used to generate //t//-perfect hash functions. A //t//-perfect hash function allows at most //t// collisions in a given bin. Modern memories are organized in blocks, which constitute the unit of transfer; examples of such blocks are cache lines for internal memory and sectors for hard disks. Thus, //t//-perfect hashing can be very useful for devices that carry out I/O operations in blocks.
larger than //0// (actually, x = 1/b and b is a parameter that should be larger than 2). - It is a two-level scheme. It uses a first-level hash function to split the key set into buckets of average size determined by a parameter //b// in the range //[1,32]//. In the second level it uses displacement values to resolve the collisions that gave rise to the buckets.
For //c = 1.23// and //b = 8//, the resulting functions are stored in approximately 2.6 bits per key. - It can generate MPHFs that can be stored in approximately //2.07// bits per key.
%html% - [BMZ Algorithm bmz.html]. - For a load factor equal to the maximum one that is achieved by the BDZ algorithm (//81 %//), the resulting PHFs are stored in approximately //1.40// bits per key.
%txt% - BMZ Algorithm. %html% - [BDZ Algorithm bdz.html]:
A very fast algorithm based on cyclic random graphs to construct minimal %txt% - BDZ Algorithm:
perfect hash functions in linear time. The resulting functions are not order preserving and - It is very simple and efficient. It outperforms all the ones below.
can be stored in only //4cn// bytes, where //c// is between 0.93 and 1.15. - It constructs both PHFs and MPHFs in linear time.
%html% - [BRZ Algorithm brz.html]. - The maximum load factor one can achieve for a PHF is //1/1.23//.
%txt% - BRZ Algorithm. - It is based on acyclic random 3-graphs. A 3-graph is a generalization of a graph where each edge connects 3 vertices instead of only 2.
A very fast external memory based algorithm for constructing minimal perfect hash functions - The resulting MPHFs are not order preserving.
for sets in the order of billion of keys in linear time. The resulting functions are not order preserving and - The resulting MPHFs can be stored in only //(2 + x)cn// bits, where //c// should be larger than or equal to //1.23// and //x// is a constant larger than //0// (actually, x = 1/b and b is a parameter that should be larger than 2). For //c = 1.23// and //b = 8//, the resulting functions are stored in approximately 2.6 bits per key.
can be stored using just 8.1 bits per key. - For its maximum load factor (//81 %//), the resulting PHFs are stored in approximately //1.95// bits per key.
%html% - [CHM Algorithm chm.html]. %html% - [BMZ Algorithm bmz.html]:
%txt% - CHM Algorithm. %txt% - BMZ Algorithm:
An algorithm based on acyclic random graphs to construct minimal - Constructs MPHFs in linear time.
perfect hash functions in linear time. The resulting functions are order preserving and - It is based on cyclic random graphs. This makes it faster than the CHM algorithm.
are stored in //4cn// bytes, where //c// is greater than 2. - The resulting MPHFs are not order preserving.
%html% - [FCH Algorithm fch.html]. - The resulting MPHFs are more compact than the ones generated by the CHM algorithm and can be stored in //4cn// bytes, where //c// is in the range //[0.93,1.15]//.
%txt% - FCH Algorithm. %html% - [BRZ Algorithm brz.html]:
An algorithm to construct minimal perfect hash functions that require %txt% - BRZ Algorithm:
less than 4 bits per key to be stored. Although the resulting MPHFs are - A very fast external memory based algorithm for constructing minimal perfect hash functions for sets in the order of billions of keys.
very compact, the algorithm is only efficient for small sets. - It works in linear time.
However, it is used as internal algorithm in the BRZ algorithm for efficiently solving - The resulting MPHFs are not order preserving.
larger problems and even so to generate MPHFs that require approximately - The resulting MPHFs can be stored using less than //8.0// bits per key.
4.1 bits per key to be stored. For that, you just need to set the parameters -a to brz and %html% - [CHM Algorithm chm.html]:
-c to a value larger than or equal to 2.6. %txt% - CHM Algorithm:
- Constructs MPHFs in linear time.
- It is based on acyclic random graphs.
- The resulting MPHFs are order preserving.
- The resulting MPHFs are stored in //4cn// bytes, where //c// is greater than 2.
%html% - [FCH Algorithm fch.html]:
%txt% - FCH Algorithm:
- Constructs minimal perfect hash functions that require less than 4 bits per key to be stored.
- The resulting MPHFs are very compact and very efficient at evaluation time.
- The algorithm is only efficient for small sets.
- It is used as the internal algorithm in the BRZ algorithm to efficiently solve larger problems while still generating MPHFs that require approximately 4.1 bits per key to be stored. For that, you just need to set the parameters -a to brz and -c to a value larger than or equal to 2.6.
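The two-level scheme described for CHD in the list above can be illustrated with a small sketch. It is not the CHD implementation shipped with cmph: the names ``sketch_hash``, ``sketch_phf_build`` and ``sketch_phf_search``, the fixed-size limits, and the use of a seeded rehash in place of real displacement arithmetic are all assumptions made for this illustration. Keys are split into buckets by a first-level hash, buckets are processed from largest to smallest, and for each bucket the first displacement value that sends all of its keys to free, distinct bins is stored.

```c
#include <stdint.h>
#include <string.h>

#define SKETCH_MAX_KEYS  64      /* illustration-only limits */
#define SKETCH_MAX_BINS  128
#define SKETCH_MAX_TRIES 65536u

/* Seeded string hash: FNV-1a followed by a 32-bit avalanche step.
 * A stand-in chosen for this sketch, not the hash function cmph uses. */
static uint32_t sketch_hash(const char *key, uint32_t seed)
{
    uint32_t h = 2166136261u ^ seed;
    for (; *key != '\0'; key++) {
        h ^= (unsigned char)*key;
        h *= 16777619u;
    }
    h ^= h >> 16; h *= 0x85ebca6bu;
    h ^= h >> 13; h *= 0xc2b2ae35u;
    h ^= h >> 16;
    return h;
}

typedef struct {
    uint32_t nbuckets;               /* first level: number of buckets */
    uint32_t nbins;                  /* range m of the PHF */
    uint32_t disp[SKETCH_MAX_KEYS];  /* one displacement value per bucket */
} sketch_phf_t;

/* Evaluation: pick the bucket, then rehash with that bucket's displacement. */
uint32_t sketch_phf_search(const sketch_phf_t *f, const char *key)
{
    uint32_t bucket = sketch_hash(key, 0) % f->nbuckets;
    return sketch_hash(key, 1 + f->disp[bucket]) % f->nbins;
}

/* Construction. Returns 1 on success, 0 on failure or if a limit is exceeded. */
int sketch_phf_build(sketch_phf_t *f, const char **keys, uint32_t n,
                     uint32_t keys_per_bucket, uint32_t nbins)
{
    uint32_t bucket_of[SKETCH_MAX_KEYS], size[SKETCH_MAX_KEYS] = {0};
    unsigned char used[SKETCH_MAX_BINS] = {0};
    uint32_t i, b, s, max_size = 0;

    if (n == 0 || n > SKETCH_MAX_KEYS || keys_per_bucket == 0 ||
        nbins < n || nbins > SKETCH_MAX_BINS)
        return 0;
    f->nbuckets = (n + keys_per_bucket - 1) / keys_per_bucket;
    f->nbins = nbins;
    memset(f->disp, 0, sizeof f->disp);

    /* First level: assign every key to a bucket and record bucket sizes. */
    for (i = 0; i < n; i++) {
        bucket_of[i] = sketch_hash(keys[i], 0) % f->nbuckets;
        if (++size[bucket_of[i]] > max_size)
            max_size = size[bucket_of[i]];
    }

    /* Second level: place the largest buckets first, while most bins are free. */
    for (s = max_size; s >= 1; s--) {
        for (b = 0; b < f->nbuckets; b++) {
            uint32_t d, pos[SKETCH_MAX_KEYS];
            if (size[b] != s)
                continue;
            for (d = 0; d < SKETCH_MAX_TRIES; d++) {
                uint32_t cnt = 0, j;
                int ok = 1;
                for (i = 0; i < n && ok; i++) {
                    uint32_t p;
                    if (bucket_of[i] != b)
                        continue;
                    p = sketch_hash(keys[i], 1 + d) % nbins;
                    if (used[p])
                        ok = 0;                 /* bin taken by an earlier bucket */
                    for (j = 0; j < cnt && ok; j++)
                        if (pos[j] == p)
                            ok = 0;             /* collision inside this bucket */
                    if (ok)
                        pos[cnt++] = p;
                }
                if (ok) {
                    for (j = 0; j < cnt; j++)
                        used[pos[j]] = 1;
                    f->disp[b] = d;
                    break;
                }
            }
            if (d == SKETCH_MAX_TRIES)
                return 0;                       /* no displacement found */
        }
    }
    return 1;
}
```

Calling ``sketch_phf_build(&f, keys, 10, 4, 13)`` on ten distinct keys asks for an average of 4 keys per bucket and 13 bins, that is, a load factor of about 77 %. Only the per-bucket displacement values need to be kept to evaluate the function afterwards with ``sketch_phf_search``.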
---------------------------------------- ----------------------------------------
==News for version 0.8 (Coming soon)== ==News for version 0.9==
- [The CHD algorithm chd.html], which is an algorithm that can be tuned to generate MPHFs that require approximately 2.07 bits per key to be stored. The algorithm outperforms [the BDZ algorithm bdz.html] and therefore is the fastest one available in the literature for sets that can be treated in internal memory.
- [The CHD_PH algorithm chd.html], which is an algorithm to generate PHFs with load factor up to //99 %//. It is actually the CHD algorithm without the ranking step. If we set the load factor to //81 %//, which is the maximum that can be obtained with [the BDZ algorithm bdz.html], the resulting functions can be stored in //1.40// bits per key. The space requirement increases with the load factor.
- All reported bugs have been fixed and the suggestions received have been incorporated.
==News for version 0.8==
- [An algorithm to generate MPHFs that require around 2.6 bits per key to be stored bdz.html], which is referred to as BDZ algorithm. The algorithm is the fastest one available in the literature for sets that can be treated in internal memory. - [An algorithm to generate MPHFs that require around 2.6 bits per key to be stored bdz.html], which is referred to as BDZ algorithm. The algorithm is the fastest one available in the literature for sets that can be treated in internal memory.
- [An algorithm to generate PHFs with range m = cn, for c > 1.22 bdz.html], which is referred to as BDZ_PH algorithm. It is actually the BDZ algorithm without the ranking step. The resulting functions can be stored in 1.95 bits per key for //c = 1.23// and are considerably faster than the MPHFs generated by the BDZ algorithm. - [An algorithm to generate PHFs with range m = cn, for c > 1.22 bdz.html], which is referred to as BDZ_PH algorithm. It is actually the BDZ algorithm without the ranking step. The resulting functions can be stored in 1.95 bits per key for //c = 1.23// and are considerably faster than the MPHFs generated by the BDZ algorithm.
@ -88,10 +106,6 @@ The CMPH Library encapsulates the newest and more efficient algorithms in an eas
- All reported bugs and suggestions have been corrected and included as well. - All reported bugs and suggestions have been corrected and included as well.
==News for version 0.7==
- Added man pages and a pkgconfig file.
[News log newslog.html] [News log newslog.html]
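The space figures quoted for BDZ follow from simple arithmetic. As a minimal sketch (the function names are hypothetical and not part of the cmph API), the bound of //(2 + x)cn// bits with //x = 1/b//, and the load factor //n/m// of a PHF with range //m = cn//, can be computed as:

```c
/* Bits per key of a BDZ MPHF stored in (2 + x) * c * n bits, with x = 1/b.
 * Hypothetical helper written for this page; it only evaluates the formula. */
double bdz_mphf_bits_per_key(double c, unsigned int b)
{
    return (2.0 + 1.0 / (double)b) * c;
}

/* Load factor n/m of a PHF whose range is m = c * n. */
double phf_load_factor(double c)
{
    return 1.0 / c;
}
```

For //c = 1.23// and //b = 8// the first function evaluates (2 + 0.125) * 1.23, which is about 2.61 bits per key, and the second gives 1/1.23, which is about 0.81. Both agree with the approximate 2.6 bits per key and the //81 %// maximum load factor stated for BDZ.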
---------------------------------------- ----------------------------------------
@ -107,57 +121,72 @@ Using cmph is quite simple. Take a look.
// Create minimal perfect hash function from in-memory vector // Create minimal perfect hash function from in-memory vector
int main(int argc, char **argv) int main(int argc, char **argv)
{ {
// Creating a filled vector
const char *vector[] = {"aaaaaaaaaa", "bbbbbbbbbb", "cccccccccc", "dddddddddd", "eeeeeeeeee",
"ffffffffff", "gggggggggg", "hhhhhhhhhh", "iiiiiiiiii", "jjjjjjjjjj"};
unsigned int nkeys = 10;
// Source of keys
cmph_io_adapter_t *source = cmph_io_vector_adapter((char **)vector, nkeys);
//Create minimal perfect hash function using the default (chm) algorithm. // Creating a filled vector
cmph_config_t *config = cmph_config_new(source); unsigned int i = 0;
cmph_t *hash = cmph_new(config); const char *vector[] = {"aaaaaaaaaa", "bbbbbbbbbb", "cccccccccc", "dddddddddd", "eeeeeeeeee",
cmph_config_destroy(config); "ffffffffff", "gggggggggg", "hhhhhhhhhh", "iiiiiiiiii", "jjjjjjjjjj"};
unsigned int nkeys = 10;
FILE* mphf_fd = fopen("temp.mph", "w");
// Source of keys
cmph_io_adapter_t *source = cmph_io_vector_adapter((char **)vector, nkeys);
//Find key //Create minimal perfect hash function using the brz algorithm.
const char *key = "jjjjjjjjjj"; cmph_config_t *config = cmph_config_new(source);
unsigned int id = cmph_search(hash, key, strlen(key)); cmph_config_set_algo(config, CMPH_BRZ);
fprintf(stderr, "Id:%u\n", id); cmph_config_set_mphf_fd(config, mphf_fd);
//Destroy hash cmph_t *hash = cmph_new(config);
cmph_destroy(hash); cmph_config_destroy(config);
cmph_io_vector_adapter_destroy(source); cmph_dump(hash, mphf_fd);
return 0; cmph_destroy(hash);
fclose(mphf_fd);
//Find key
mphf_fd = fopen("temp.mph", "r");
hash = cmph_load(mphf_fd);
while (i < nkeys) {
const char *key = vector[i];
unsigned int id = cmph_search(hash, key, (cmph_uint32)strlen(key));
fprintf(stderr, "key:%s -- hash:%u\n", key, id);
i++;
}
//Destroy hash
cmph_destroy(hash);
cmph_io_vector_adapter_destroy(source);
fclose(mphf_fd);
return 0;
} }
``` ```
Download [vector_adapter_ex1.c examples/vector_adapter_ex1.c]. This example does not work in version 0.3. You need to update the sources from CVS to make it works. Download [vector_adapter_ex1.c examples/vector_adapter_ex1.c]. This example does not work in versions below 0.6. You need to update the sources from GIT to make it work.
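Since the example above prints one hash value per key, a simple way to check the result is to verify that the values form a permutation of //0..n-1//, which is what "minimal perfect" means for //n// keys. The helper below is a sketch for that check; ``ids_are_minimal_perfect`` is a hypothetical name and not a cmph function.

```c
#include <stdlib.h>

/* Returns 1 if ids[0..n-1] is a permutation of 0..n-1, i.e. the hash values
 * are collision-free and fill the range [0, n-1] exactly; returns 0 otherwise
 * (also when the scratch buffer cannot be allocated). */
int ids_are_minimal_perfect(const unsigned int *ids, unsigned int n)
{
    unsigned char *seen;
    unsigned int i;
    int ok = 1;

    if (n == 0)
        return 1;
    seen = calloc(n, 1);
    if (seen == NULL)
        return 0;
    for (i = 0; i < n && ok; i++) {
        if (ids[i] >= n || seen[ids[i]])
            ok = 0;              /* out of range, or value already used */
        else
            seen[ids[i]] = 1;
    }
    free(seen);
    return ok;
}
```

To use it with the example, the values returned by ``cmph_search`` would be collected into an array of ``nkeys`` entries inside the loop and passed to the helper once the loop finishes.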
------------------------------- -------------------------------
``` ```
#include <cmph.h> #include <cmph.h>
#include <stdio.h> #include <stdio.h>
#include <string.h> #include <string.h>
// Create minimal perfect hash function from in-disk keys using BMZ algorithm // Create minimal perfect hash function from in-disk keys using BDZ algorithm
int main(int argc, char **argv) int main(int argc, char **argv)
{ {
//Open file with newline separated list of keys //Open file with newline separated list of keys
FILE * keys_fd = fopen("keys.txt", "r"); FILE * keys_fd = fopen("keys.txt", "r");
cmph_t *hash = NULL; cmph_t *hash = NULL;
if (keys_fd == NULL) if (keys_fd == NULL)
{ {
fprintf(stderr, "File \"keys.txt\" not found\n"); fprintf(stderr, "File \"keys.txt\" not found\n");
exit(1); exit(1);
} }
// Source of keys // Source of keys
cmph_io_adapter_t *source = cmph_io_nlfile_adapter(keys_fd); cmph_io_adapter_t *source = cmph_io_nlfile_adapter(keys_fd);
cmph_config_t *config = cmph_config_new(source); cmph_config_t *config = cmph_config_new(source);
cmph_config_set_algo(config, CMPH_BMZ); cmph_config_set_algo(config, CMPH_BDZ);
hash = cmph_new(config); hash = cmph_new(config);
cmph_config_destroy(config); cmph_config_destroy(config);
//Find key //Find key
const char *key = "jjjjjjjjjj"; const char *key = "jjjjjjjjjj";
unsigned int id = cmph_search(hash, key, strlen(key)); unsigned int id = cmph_search(hash, key, (cmph_uint32)strlen(key));
fprintf(stderr, "Id:%u\n", id); fprintf(stderr, "Id:%u\n", id);
//Destroy hash //Destroy hash
cmph_destroy(hash); cmph_destroy(hash);
@ -166,7 +195,7 @@ int main(int argc, char **argv)
return 0; return 0;
} }
``` ```
Download [file_adapter_ex2.c examples/file_adapter_ex2.c] and [keys.txt examples/keys.txt] Download [file_adapter_ex2.c examples/file_adapter_ex2.c] and [keys.txt examples/keys.txt]. This example does not work in versions below 0.8. You need to update the sources from GIT to make it work.
[Click here to see more examples examples.html] [Click here to see more examples examples.html]
-------------------------------------- --------------------------------------
@ -195,41 +224,55 @@ utility.
``` ```
usage: cmph [-v] [-h] [-V] [-k nkeys] [-f hash_function] [-g [-c value][-s seed] ] usage: cmph [-v] [-h] [-V] [-k nkeys] [-f hash_function] [-g [-c algorithm_dependent_value][-s seed] ]
[-a algorithm] [-M memory_in_MB] [-b BRZ_parameter] [-d tmp_dir] [-a algorithm] [-M memory_in_MB] [-b algorithm_dependent_value] [-t keys_per_bin] [-d tmp_dir]
[-m file.mph] keysfile [-m file.mph] keysfile
Minimum perfect hashing tool Minimum perfect hashing tool
-h print this help message -h print this help message
-c c value determines: -c c value determines:
the number of vertices in the graph for the algorithms BMZ and CHM * the number of vertices in the graph for the algorithms BMZ and CHM
the number of bits per key required in the FCH algorithm * the number of bits per key required in the FCH algorithm
-a algorithm - valid values are * the load factor in the CHD_PH algorithm
* bmz -a algorithm - valid values are
* bmz8 * bmz
* chm * bmz8
* brz * chm
* fch * brz
* bdz * fch
* bdz_ph * bdz
-f hash function (may be used multiple times) - valid values are * bdz_ph
* jenkins * chd_ph
-V print version number and exit * chd
-v increase verbosity (may be used multiple times) -f hash function (may be used multiple times) - valid values are
-k number of keys * jenkins
-g generation mode -V print version number and exit
-s random seed -v increase verbosity (may be used multiple times)
-m minimum perfect hash function file -k number of keys
-M main memory availability (in MB) -g generation mode
-d temporary directory used in brz algorithm -s random seed
-b the meaning of this parameter depends on the algorithm used. -m minimum perfect hash function file
If BRZ algorithm is selected in -a option, than it is used -M main memory availability (in MB) used in BRZ algorithm
to make the maximal number of keys in a bucket lower than 256. -d temporary directory used in BRZ algorithm
In this case its value should be an integer in the range [64,175]. -b the meaning of this parameter depends on the algorithm selected in the -a option:
If BDZ algorithm is selected in option -a, than it is used to * For BRZ it is used to make the maximal number of keys in a bucket lower than 256.
determine the size of some precomputed rank information and In this case its value should be an integer in the range [64,175]. Default is 128.
its value should be an integer in the range [3,10]
keysfile line separated file with keys * For BDZ it is used to determine the size of some precomputed rank
information and its value should be an integer in the range [3,10]. Default
is 7. The larger this value, the more compact the resulting functions
and the slower they are at evaluation time.
* For CHD and CHD_PH it is used to set the average number of keys per bucket
and its value should be an integer in the range [1,32]. Default is 4. The
larger this value, the slower the construction of the functions.
This parameter has no effect for other algorithms.
-t set the number of keys per bin for a t-perfect hashing function. A t-perfect
hash function allows at most t collisions in a given bin. This parameter applies
only to the CHD and CHD_PH algorithms. Its value should be an integer in the
range [1,128]. Default is 1.
keysfile line separated file with keys
``` ```
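The //t//-perfect property mentioned for the ``-t`` option can be checked with a few lines of code. The sketch below assumes that "at most t collisions in a given bin" means that no bin receives more than //t// keys, which is how the option describes it ("keys per bin"). The name ``bins_are_t_perfect`` is hypothetical and not part of cmph.

```c
#include <stdlib.h>

/* bin_of_key[i] is the bin assigned to key i by some hash function.
 * Returns 1 if every bin index is below nbins and no bin holds more than
 * t keys; returns 0 otherwise (also when allocation fails). */
int bins_are_t_perfect(const unsigned int *bin_of_key, unsigned int nkeys,
                       unsigned int nbins, unsigned int t)
{
    unsigned int *load;
    unsigned int i;
    int ok = 1;

    if (nbins == 0)
        return nkeys == 0;
    load = calloc(nbins, sizeof *load);
    if (load == NULL)
        return 0;
    for (i = 0; i < nkeys && ok; i++) {
        /* The range check comes first so an invalid bin is never indexed. */
        if (bin_of_key[i] >= nbins || ++load[bin_of_key[i]] > t)
            ok = 0;
    }
    free(load);
    return ok;
}
```

With //t = 1// the check reduces to the usual perfect hashing condition: no two keys share a bin.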
==Additional Documentation== ==Additional Documentation==
@ -250,3 +293,5 @@ Code is under the LGPL and the MPL 1.1.
%!include(html): ''LOGO.t2t'' %!include(html): ''LOGO.t2t''
Last Updated: %%date(%c) Last Updated: %%date(%c)
%!include(html): ''GOOGLEANALYTICS.t2t''

View File

@ -1,7 +1,7 @@
#include <cmph.h> #include <cmph.h>
#include <stdio.h> #include <stdio.h>
#include <string.h> #include <string.h>
// Create minimal perfect hash function from in-disk keys using BMZ algorithm // Create minimal perfect hash function from in-disk keys using BDZ algorithm
int main(int argc, char **argv) int main(int argc, char **argv)
{ {
//Open file with newline separated list of keys //Open file with newline separated list of keys
@ -16,7 +16,7 @@ int main(int argc, char **argv)
cmph_io_adapter_t *source = cmph_io_nlfile_adapter(keys_fd); cmph_io_adapter_t *source = cmph_io_nlfile_adapter(keys_fd);
cmph_config_t *config = cmph_config_new(source); cmph_config_t *config = cmph_config_new(source);
cmph_config_set_algo(config, CMPH_BMZ); cmph_config_set_algo(config, CMPH_BDZ);
hash = cmph_new(config); hash = cmph_new(config);
cmph_config_destroy(config); cmph_config_destroy(config);

View File

@ -12,40 +12,40 @@ typedef struct {
int main(int argc, char **argv) int main(int argc, char **argv)
{ {
// Creating a filled vector // Creating a filled vector
unsigned int i = 0; unsigned int i = 0;
rec_t vector[10] = {{1, "aaaaaaaaaa", 1999}, {2, "bbbbbbbbbb", 2000}, {3, "cccccccccc", 2001}, rec_t vector[10] = {{1, "aaaaaaaaaa", 1999}, {2, "bbbbbbbbbb", 2000}, {3, "cccccccccc", 2001},
{4, "dddddddddd", 2002}, {5, "eeeeeeeeee", 2003}, {6, "ffffffffff", 2004}, {4, "dddddddddd", 2002}, {5, "eeeeeeeeee", 2003}, {6, "ffffffffff", 2004},
{7, "gggggggggg", 2005}, {8, "hhhhhhhhhh", 2006}, {9, "iiiiiiiiii", 2007}, {7, "gggggggggg", 2005}, {8, "hhhhhhhhhh", 2006}, {9, "iiiiiiiiii", 2007},
{10,"jjjjjjjjjj", 2008}}; {10,"jjjjjjjjjj", 2008}};
unsigned int nkeys = 10; unsigned int nkeys = 10;
FILE* mphf_fd = fopen("temp_struct_vector.mph", "w"); FILE* mphf_fd = fopen("temp_struct_vector.mph", "w");
// Source of keys // Source of keys
cmph_io_adapter_t *source = cmph_io_struct_vector_adapter(vector, (cmph_uint32)sizeof(rec_t), (cmph_uint32)sizeof(cmph_uint32), 11, nkeys); cmph_io_adapter_t *source = cmph_io_struct_vector_adapter(vector, (cmph_uint32)sizeof(rec_t), (cmph_uint32)sizeof(cmph_uint32), 11, nkeys);
//Create minimal perfect hash function using the default (chm) algorithm. //Create minimal perfect hash function using the BDZ algorithm.
cmph_config_t *config = cmph_config_new(source); cmph_config_t *config = cmph_config_new(source);
cmph_config_set_algo(config, CMPH_BDZ); cmph_config_set_algo(config, CMPH_BDZ);
cmph_config_set_mphf_fd(config, mphf_fd); cmph_config_set_mphf_fd(config, mphf_fd);
cmph_t *hash = cmph_new(config); cmph_t *hash = cmph_new(config);
cmph_config_destroy(config); cmph_config_destroy(config);
cmph_dump(hash, mphf_fd); cmph_dump(hash, mphf_fd);
cmph_destroy(hash); cmph_destroy(hash);
fclose(mphf_fd); fclose(mphf_fd);
//Find key //Find key
mphf_fd = fopen("temp_struct_vector.mph", "r"); mphf_fd = fopen("temp_struct_vector.mph", "r");
hash = cmph_load(mphf_fd); hash = cmph_load(mphf_fd);
while (i < nkeys) { while (i < nkeys) {
const char *key = vector[i].key; const char *key = vector[i].key;
unsigned int id = cmph_search(hash, key, 11); unsigned int id = cmph_search(hash, key, 11);
fprintf(stderr, "key:%s -- hash:%u\n", key, id); fprintf(stderr, "key:%s -- hash:%u\n", key, id);
i++; i++;
} }
//Destroy hash //Destroy hash
cmph_destroy(hash); cmph_destroy(hash);
cmph_io_vector_adapter_destroy(source); cmph_io_vector_adapter_destroy(source);
fclose(mphf_fd); fclose(mphf_fd);
return 0; return 0;
} }

View File

@ -13,7 +13,7 @@ int main(int argc, char **argv)
// Source of keys // Source of keys
cmph_io_adapter_t *source = cmph_io_vector_adapter((char **)vector, nkeys); cmph_io_adapter_t *source = cmph_io_vector_adapter((char **)vector, nkeys);
//Create minimal perfect hash function using the default (chm) algorithm. //Create minimal perfect hash function using the brz algorithm.
cmph_config_t *config = cmph_config_new(source); cmph_config_t *config = cmph_config_new(source);
cmph_config_set_algo(config, CMPH_BRZ); cmph_config_set_algo(config, CMPH_BRZ);
cmph_config_set_mphf_fd(config, mphf_fd); cmph_config_set_mphf_fd(config, mphf_fd);

Binary files not shown:

BIN figs/bdz/img1.png Normal file (234 B)
BIN figs/bdz/img10.png Normal file (235 B)
BIN figs/bdz/img100.png Normal file (399 B)
BIN figs/bdz/img101.png Normal file (321 B)
BIN figs/bdz/img102.png Normal file (505 B)
BIN figs/bdz/img103.png Normal file (364 B)
BIN figs/bdz/img104.png Normal file (709 B)
BIN figs/bdz/img105.png Normal file (401 B)
BIN figs/bdz/img106.png Normal file (399 B)
BIN figs/bdz/img107.png Normal file (850 B)
BIN figs/bdz/img108.png Normal file (587 B)
BIN figs/bdz/img109.png Normal file (324 B)
BIN figs/bdz/img11.png Normal file (415 B)
BIN figs/bdz/img110.png Normal file (573 B)
BIN figs/bdz/img111.png Normal file (274 B)
BIN figs/bdz/img112.png Normal file (388 B)
BIN figs/bdz/img113.png Normal file (348 B)
BIN figs/bdz/img114.png Normal file (675 B)
BIN figs/bdz/img115.png Normal file (381 B)
BIN figs/bdz/img116.png Normal file (235 B)
BIN figs/bdz/img117.png Normal file (434 B)
BIN figs/bdz/img118.png Normal file (256 B)
BIN figs/bdz/img119.png Normal file (493 B)
BIN figs/bdz/img12.png Normal file (344 B)
BIN figs/bdz/img120.png Normal file (319 B)
BIN figs/bdz/img121.png Normal file (251 B)
BIN figs/bdz/img122.png Normal file (469 B)
BIN figs/bdz/img123.png Normal file (473 B)
BIN figs/bdz/img124.png Normal file (205 B)
BIN figs/bdz/img125.png Normal file (250 B)
BIN figs/bdz/img126.png Normal file (340 B)
BIN figs/bdz/img127.png Normal file (325 B)
BIN figs/bdz/img128.png Normal file (267 B)
BIN figs/bdz/img129.png Normal file (337 B)
BIN figs/bdz/img13.png Normal file (583 B)
BIN figs/bdz/img130.png Normal file (344 B)
BIN figs/bdz/img131.png Normal file (888 B)
BIN figs/bdz/img132.png Normal file (220 B)
BIN figs/bdz/img133.png Normal file (263 B)
BIN figs/bdz/img134.png Normal file (625 B)
BIN figs/bdz/img135.png Normal file (428 B)
BIN figs/bdz/img136.png Normal file (229 B)
BIN figs/bdz/img137.png Normal file (178 B)
BIN figs/bdz/img138.png Normal file (205 B)
BIN figs/bdz/img14.png Normal file (297 B)
BIN figs/bdz/img15.png Normal file (354 B)
BIN figs/bdz/img16.png Normal file (250 B)
BIN figs/bdz/img17.png Normal file (465 B)
BIN figs/bdz/img18.png Normal file (366 B)
BIN figs/bdz/img19.png Normal file (445 B)
BIN figs/bdz/img2.png Normal file (352 B)
BIN figs/bdz/img20.png Normal file (326 B)
BIN figs/bdz/img21.png Normal file (332 B)
BIN figs/bdz/img22.png Normal file (377 B)
BIN figs/bdz/img23.png Normal file (809 B)
BIN figs/bdz/img24.png Normal file (205 B)
BIN figs/bdz/img25.png Normal file (264 B)
BIN figs/bdz/img26.png Normal file (256 B)
BIN figs/bdz/img27.png Normal file (241 B)
BIN figs/bdz/img28.png Normal file (360 B)
BIN figs/bdz/img29.png Normal file (445 B)
BIN figs/bdz/img3.png Normal file (316 B)
BIN figs/bdz/img30.png Normal file (411 B)
BIN figs/bdz/img31.png Normal file (248 B)
BIN figs/bdz/img32.png Normal file (556 B)
BIN figs/bdz/img33.png Normal file (469 B)
BIN figs/bdz/img34.png Normal file (442 B)
BIN figs/bdz/img35.png Normal file (161 B)
BIN figs/bdz/img36.png Normal file (216 B)
BIN figs/bdz/img37.png Normal file (366 B)
BIN figs/bdz/img38.png Normal file (442 B)
BIN figs/bdz/img39.png Normal file (200 B)
BIN figs/bdz/img4.png Normal file (223 B)
BIN figs/bdz/img40.png Normal file (438 B)
BIN figs/bdz/img41.png Normal file (376 B)
BIN figs/bdz/img42.png Normal file (382 B)
BIN figs/bdz/img43.png Normal file (392 B)
BIN figs/bdz/img44.png Normal file (588 B)
BIN figs/bdz/img45.png Normal file (585 B)
BIN figs/bdz/img46.png Normal file (627 B)
BIN figs/bdz/img47.png Normal file (474 B)

Some files were not shown because too many files have changed in this diff.