The BDZ algorithm was designed by Fabiano C. Botelho, Djamal Belazzougui, Rasmus Pagh and Nivio Ziviani. It is a simple, efficient, near-optimal space and practical algorithm to generate a family of PHFs and MPHFs. It is also referred to as the BPZ algorithm because of the work presented by Botelho, Pagh and Ziviani in [2]. In Botelho's PhD dissertation [1] it is also referred to as the RAM algorithm because it is more suitable for key sets that can be handled in internal memory.
The BDZ algorithm uses r-uniform random hypergraphs given by function values of r uniform random hash functions on the input key set S to generate PHFs and MPHFs that require O(n) bits to be stored. A hypergraph is the generalization of a standard undirected graph where each edge connects r vertices. This idea is not new, see e.g. [8], but we have proceeded differently to achieve a space usage of O(n) bits rather than O(n log n) bits. Evaluation time for all schemes considered is constant. For r=3 we obtain a space usage of approximately 2.6n bits for an MPHF. More compact, and even simpler, representations can be achieved for larger m. For example, for m=1.23n we can get a space usage of 1.95n bits.
Our best MPHF space upper bound is within a factor of 2 of the information theoretical lower bound of approximately 1.44 bits per key. We have shown that the BDZ algorithm is far more practical than previous methods with proven space complexity, both because of its simplicity and because the constant factor of the space complexity is more than 6 times lower than that of its closest competitor, for plausible problem sizes. We verify the practicality experimentally, using slightly more space than in the mentioned theoretical bounds.
The BDZ algorithm is a three-step algorithm that generates PHFs and MPHFs based on random r-partite hypergraphs. This approach provides a much tighter analysis and is much simpler than the one presented in [3], where it was implicit how to construct similar PHFs. The fastest and most compact functions are generated when r=3. In this case a PHF can be stored in approximately 1.95 bits per key and an MPHF in approximately 2.62 bits per key.
++Figure 1 gives an overview of the algorithm for r=3, taking as input a key set containing three English words, i.e., S={who,band,the}. The edge-oriented data structure proposed in [4] is used to represent hypergraphs, where each edge is explicitly represented as an array of r vertices and, for each vertex v, there is a list of edges that are incident on v. +
Figure 1: (a) The mapping step generates a random acyclic 3-partite hypergraph with m=6 vertices and n=3 edges and a list of edges obtained when we test whether the hypergraph is acyclic. (b) The assigning step builds an array g that maps values from [0,5] to [0,3] to uniquely assign an edge to a vertex. (c) The ranking step builds the data structure used to compute function rank in O(1) time.
The Mapping Step in Figure 1(a) carries out two important tasks: (i) it builds a random 3-partite hypergraph by mapping each key of S to an edge with one vertex in each part, using three hash functions; and (ii) it tests whether the resulting hypergraph is acyclic, producing the list of edges used later by the assigning step.
We now show how to use the Jenkins hash function [7] to implement the three hash functions h_{i}, which map values from S to V_{i}, for 0 <= i <= 2. These functions are used to build a random 3-partite hypergraph, where V = V_0 ∪ V_1 ∪ V_2 and |V_i| = m/3. Let h'(x) be a Jenkins hash function of a key x in S, where w=32 or 64 for 32-bit and 64-bit architectures, respectively. Let H' be an array of 3 w-bit values. The Jenkins hash function allows us to compute in parallel the three entries in H' and thereby the three hash functions h_{i}, as follows:
H' = h'(x)
h_{0}(x) = H'[0] mod (m/3)
h_{1}(x) = H'[1] mod (m/3) + (m/3)
h_{2}(x) = H'[2] mod (m/3) + 2(m/3)
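As an illustration, here is a minimal Python sketch of this splitting scheme. It is not the CMPH implementation: MD5 stands in for the Jenkins hash, and the key and the value of m are made up.

```python
import hashlib

def three_hashes(key: bytes, m: int):
    """Map a key to one vertex in each of the three partitions of size m/3."""
    part = m // 3                       # |V_i| = m/3
    digest = hashlib.md5(key).digest()  # stand-in for the Jenkins hash h'(x)
    H = [int.from_bytes(digest[4*i:4*i+4], "little") for i in range(3)]
    return (H[0] % part,                # falls in V_0 = [0, m/3 - 1]
            H[1] % part + part,         # falls in V_1 = [m/3, 2m/3 - 1]
            H[2] % part + 2 * part)     # falls in V_2 = [2m/3, m - 1]

h0, h1, h2 = three_hashes(b"who", 6)
```

Each h_i lands in its own partition of the vertex set, so the three endpoints of every edge always lie in different parts of the 3-partite hypergraph.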
The Assigning Step in Figure 1(b) outputs a PHF that maps the key set S into the range [0,m-1] and is represented by an array g storing values from the range [0,3]. The array g allows us to select one out of the 3 vertices of a given edge, which is associated with a key k. A vertex for a key k is given by either h_{0}(k), h_{1}(k) or h_{2}(k). The function h_{i}(k) to be used for k is chosen by calculating i = (g[h_{0}(k)] + g[h_{1}(k)] + g[h_{2}(k)]) mod 3. For instance, the vertices 1 and 4 represent the keys "who" and "band" because i = (g[1] + g[3] + g[5]) mod 3 = 0 and h_{0}("who") = 1, and i = (g[1] + g[2] + g[4]) mod 3 = 2 and h_{2}("band") = 4, respectively. Let Visited be a boolean vector of size m that indicates whether a vertex has been visited. The assigning step first initializes g[i] = 3 to mark every vertex as unassigned and Visited[i] = false, for 0 <= i <= m-1. Then, for each edge e in the list of edges, traversed from tail to head, it looks for the first vertex u belonging to e not yet visited. This is a sufficient condition for success [1,2,8]. Let j be the index of u in e, for j in the range [0,2]. Then, it assigns g[u] = (j - (g[v_1] + g[v_2])) mod 3, where v_1 and v_2 are the other two vertices of e (an unvisited vertex still holds the initial value 3, which is congruent to 0 mod 3). Whenever it passes through a vertex of e that has not yet been visited, it sets the corresponding Visited entry to true.
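To make the peeling and assignment concrete, here is a hedged Python sketch on a toy 3-partite hypergraph. The edge set below is made up for illustration, not the one from Figure 1.

```python
from collections import defaultdict

def peel(edges, m):
    """Acyclicity test: repeatedly remove an edge that has a degree-1 vertex.
    Returns the list of removed edge indices, or None if not peelable."""
    incident = defaultdict(set)
    for idx, e in enumerate(edges):
        for v in e:
            incident[v].add(idx)
    L, alive, changed = [], set(range(len(edges))), True
    while alive and changed:
        changed = False
        for v in range(m):
            if len(incident[v]) == 1:
                idx = next(iter(incident[v]))
                for u in edges[idx]:
                    incident[u].discard(idx)
                alive.discard(idx)
                L.append(idx)
                changed = True
    return L if not alive else None

def assign(edges, L, m):
    """Traverse L from tail to head; for each edge pick its first unvisited
    vertex (index j in the edge) and set its g value so that the sum of the
    edge's g values is congruent to j mod 3 (unassigned vertices hold 3)."""
    g, visited = [3] * m, [False] * m
    for idx in reversed(L):
        e = edges[idx]
        j = next(i for i, v in enumerate(e) if not visited[v])
        g[e[j]] = (j - sum(g[v] for i, v in enumerate(e) if i != j)) % 3
        for v in e:
            visited[v] = True
    return g

edges = [(1, 3, 5), (1, 2, 4), (0, 2, 5)]   # hypothetical 3-partite edges
L = peel(edges, 6)
g = assign(edges, L, 6)
# The selected vertex is distinct for every edge, i.e., a perfect assignment:
chosen = [e[(g[e[0]] + g[e[1]] + g[e[2]]) % 3] for e in edges]
```

Because every edge, when processed, still contains a vertex untouched by previously processed edges, the assignment never has to revisit a decision.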
If we stop the BDZ algorithm at the assigning step we obtain a PHF with range [0,m-1]. The PHF has the following form: phf(x) = h_{i(x)}(x), where key x is in S and i(x) = (g[h_{0}(x)] + g[h_{1}(x)] + g[h_{2}(x)]) mod 3. In this case we do not need information for ranking and can set g[i] = 0 whenever g[i] is equal to 3, where i is in the range [0,m-1]. Therefore, the range of the values stored in g is narrowed from [0,3] to [0,2]. By using arithmetic coding over blocks of values (see [1,2] for details), or any compression technique that allows random access in constant time to an array of compressed values [5,6,12], we can store the resulting PHFs in m log 3 = cn log 3 bits, where c > 1.22. For c = 1.23, the space requirement is 1.95n bits.
The Ranking Step in Figure 1(c) outputs a data structure that narrows the range of a PHF generated in the assigning step from [0,m-1] to [0,n-1], thereby producing an MPHF. The data structure allows us to compute in constant time a function rank from [0,m-1] to [0,n-1] that counts the number of assigned positions before a given position v in g. For instance, rank(4) = 2 because the positions 0 and 1 are assigned, since g[0] and g[1] are not equal to 3.
For the implementation of the ranking step we have borrowed a simple and efficient implementation from [10]. It requires εm additional bits of space, where 0 < ε < 1, and is obtained by storing explicitly the rank of every kth index in a rankTable. The larger k is, the more compact the resulting MPHF is. Therefore, users can trade space for evaluation time by setting k appropriately in the implementation. We only allow values of k that are powers of two (i.e., k = 2^b for some constant b) in order to replace the expensive division and modulo operations by bit-shift and bitwise "and" operations, respectively. We have used k = 256 in the experiments for generating more succinct MPHFs. We remark that it is still possible to obtain a more compact data structure by using the results presented in [9,11], but at the cost of a much more complex implementation.
We need to use an additional lookup table T_{r} to guarantee the constant evaluation time of rank(u). Let us illustrate how rank(u) is computed using both the rankTable and the lookup table T_{r}. We first look up in the rankTable the rank of the largest precomputed index v lower than or equal to u, and use T_{r} to count the number of assigned vertices from position v to u-1. The lookup table T_{r} allows us to count in constant time the number of assigned vertices in β bits, where β is the number of bits looked up at a time. Thus the actual evaluation time is O(k/β). For simplicity and without loss of generality we let β be a multiple of the number of bits used to encode each entry of g. As the values in g come from the range [0,3], each entry is encoded in 2 bits, and we have tried β equal to 8 and 16. We would expect β equal to 16 to provide a faster evaluation time because we would need to carry out fewer lookups in T_{r}. But for both values the lookup table T_{r} fits entirely in the CPU cache and we did not observe any significant difference in the evaluation times. Therefore we settled for the value 8. We remark that each value of r requires a different lookup table T_{r} that can be generated a priori.
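The following Python sketch mimics this scheme (our own illustration, not CMPH code): a bit vector marks the assigned positions of g, the rank of every 2^b-th position is precomputed, and a 256-entry popcount table plays the role of T_r with β = 8.

```python
T = [bin(x).count("1") for x in range(256)]   # lookup table: popcount per byte

def build_rank_table(bits: bytes, b: int):
    """Precompute the rank of every k-th position, k = 2^b (b >= 3)."""
    k, table, count = 1 << b, [], 0
    for byte_idx, byte in enumerate(bits):
        if (byte_idx * 8) % k == 0:
            table.append(count)
        count += T[byte]
    return table

def rank(bits: bytes, table, b: int, u: int):
    """Number of assigned positions in [0, u), for u < 8 * len(bits).
    Bits are numbered LSB-first inside each byte."""
    r = table[u >> b]                       # rank at the sampled index <= u
    start_byte = ((u >> b) << b) // 8       # byte-aligned since b >= 3
    for byte_idx in range(start_byte, u // 8):
        r += T[bits[byte_idx]]              # whole bytes via the lookup table
    r += T[bits[u // 8] & ((1 << (u % 8)) - 1)]   # partial last byte
    return r

# Positions 0 and 1 assigned (as in the example above), so rank(4) = 2.
bits = bytes([0b00000011, 0b00000000])
table = build_rank_table(bits, b=3)
```

With k = 2^b constant, at most k/8 table lookups happen per query, so evaluation time is constant; a larger b shrinks the rankTable at the cost of more lookups.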
The resulting MPHFs have the following form: h(x) = rank(phf(x)). In this case we cannot get rid of the ranking information by replacing the values 3 by 0 in the entries of g. Each entry in the array g is then encoded with 2 bits, and we need the additional rankTable bits to compute function rank in constant time. The total space to store the resulting functions is therefore (2 + 32/2^b)cn + O(1) bits, assuming each precomputed rank is stored in 32 bits. By using c = 1.23 and k = 256 we have obtained MPHFs that require approximately 2.62 bits per key to be stored.
Now we detail the memory consumption to generate and to store minimal perfect hash functions using the BDZ algorithm. The structures responsible for memory consumption are as follows:
Thus, the total memory consumption of the BDZ algorithm for generating a minimal perfect hash function (MPHF) is (28.125 + 5c)n + 0.25cn + (4cn)/(2^b) + O(1) bytes. As the value of the constant c may be larger than or equal to 1.23 we have:
c | b | Memory consumption to generate a MPHF (in bytes)
---|---|---
1.23 | 7 | 34.62n + O(1)
1.23 | 8 | 34.60n + O(1)
Table 1: Memory consumption to generate a MPHF using the BDZ algorithm.
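The coefficients in Table 1 can be reproduced directly from the formula above; a quick, purely illustrative check:

```python
def generation_bytes_per_key(c: float, b: int) -> float:
    """Per-key constant of (28.125 + 5c)n + 0.25cn + (4cn)/(2^b)."""
    return 28.125 + 5 * c + 0.25 * c + (4 * c) / (2 ** b)

# c = 1.23 with b = 7 gives about 34.62 bytes per key; b = 8 gives about 34.60.
per_key_b7 = generation_bytes_per_key(1.23, 7)
per_key_b8 = generation_bytes_per_key(1.23, 8)
```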
Now we present the memory consumption to store the resulting function:
c | b | Memory consumption to store a MPHF (in bits)
---|---|---
1.23 | 7 | 2.77n + O(1)
1.23 | 8 | 2.61n + O(1)
Table 2: Memory consumption to store a MPHF generated by the BDZ algorithm.
+Experimental results to compare the BDZ algorithm with the other ones in the CMPH +library are presented in Botelho, Pagh and Ziviani [1,2]. +
+Enjoy! +
At the end of 2003, professor Nivio Ziviani was finishing the second edition of his book. During the writing, he studied the problem of generating minimal perfect hash functions (if you are not familiar with this problem, see [1][2]). He coded a modified version of the CHM algorithm, which was proposed by Czech, Havas and Majewski, and put it in his book. The CHM algorithm is based on acyclic random graphs to generate order preserving minimal perfect hash functions in linear time. Professor Nivio Ziviani asked himself: why must the random graph be acyclic? In the modified version available in his book he got rid of this restriction.
The modification presented a problem: it was impossible to generate minimal perfect hash functions for sets with more than 1000 keys. At the same time, Fabiano C. Botelho, a master's student at the Department of Computer Science of the Federal University of Minas Gerais, began to be advised by Nivio Ziviani, who presented the problem to Fabiano.
During his master's, Fabiano and Nivio Ziviani faced lots of problems. In April of 2004, Fabiano was talking with a friend of his, David Menoti, about the problems, and many ideas appeared. The ideas were implemented and a very fast algorithm to generate minimal perfect hash functions was designed. We refer to the algorithm as BMZ because it was conceived by Fabiano C. Botelho, David Menoti and Nivio Ziviani. The algorithm is described in [1]. To analyse the BMZ algorithm we needed some results from random graph theory, so we invited professor Yoshiharu Kohayakawa to help us. The final description and analysis of the BMZ algorithm is presented in [2].
The BMZ algorithm shares several features with the CHM algorithm. In particular, the BMZ algorithm is also based on the generation of random graphs G = (V, E), where E is in one-to-one correspondence with the key set S for which we wish to generate a minimal perfect hash function. The two main differences between the BMZ and CHM algorithms are as follows: (i) the BMZ algorithm generates random graphs G = (V, E) with |V| = cn and |E| = n, where 0.93 <= c <= 1.15, and hence G necessarily contains cycles, while the CHM algorithm generates acyclic random graphs G = (V, E) with |V| = cn and |E| = n, with a greater number of vertices: c >= 2.09; (ii) the CHM algorithm generates order preserving minimal perfect hash functions while the BMZ algorithm does not preserve order. Thus, the BMZ algorithm improves the space requirement at the expense of generating functions that are not order preserving.
Suppose U is a universe of keys. Let S ⊆ U be a set of n keys. Let us show how the BMZ algorithm constructs a minimal perfect hash function h. We make use of two auxiliary random functions h_1 and h_2 from U to V = [0, t-1], where t = cn for some suitably chosen constant c > 0. We build a random graph G = G(h_1, h_2) on V, whose edge set is {{h_1(k), h_2(k)} : k ∈ S}. There is an edge in G for each key in the set of keys S.
In what follows, we shall be interested in the 2-core of the random graph G, that is, the maximal subgraph of G with minimal degree at least 2 (see [2] for details). Because of its importance in our context, we call the 2-core the critical subgraph of G and denote it by G_crit. The vertices and edges in G_crit are said to be critical. We let V_crit = V(G_crit) and E_crit = E(G_crit). Moreover, we let V_ncrit = V - V_crit be the set of non-critical vertices in G. We also let V_scrit ⊆ V_crit be the set of all critical vertices that have at least one non-critical vertex as a neighbour. Let E_ncrit = E - E_crit be the set of non-critical edges in G. Finally, we let G_ncrit = (V_ncrit ∪ V_scrit, E_ncrit) be the non-critical subgraph of G. The non-critical subgraph G_ncrit corresponds to the acyclic part of G. We have G = G_crit ∪ G_ncrit.
We then construct a suitable labelling g of the vertices of G: we choose g(v) for each v ∈ V(G) in such a way that h(k) = g(h_1(k)) + g(h_2(k)) (k ∈ S) is a minimal perfect hash function for S. This labelling can be found in linear time if the number of edges in G_crit is at most half the number of edges in G (see [2] for details).
Figure 1 presents pseudocode for the BMZ algorithm. The procedure BMZ (S, g) receives as input the set of keys S and produces the labelling g. The method uses a mapping, ordering and searching approach. We now describe each step.
procedure BMZ (S, g)
    Mapping (S, G);
    Ordering (G, V_crit, V_ncrit);
    Searching (G, V_crit, V_ncrit, g);
Figure 1: Main steps of the BMZ algorithm for constructing a minimal perfect hash function
The procedure Mapping (S, G) receives as input the set of keys S and generates the random graph G, by generating the two auxiliary functions h_1 and h_2.
The functions h_1 and h_2 are constructed as follows. We impose some upper bound L on the lengths of the keys in S. To define h_j (j = 1, 2), we generate a table of random integers table_j with one row per character position and one column per character value. For a key x of length |x| <= L and j ∈ {1, 2}, we let

h_j(x) = (table_j[1, x[1]] + table_j[2, x[2]] + ... + table_j[|x|, x[|x|]]) mod t
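A small Python sketch of this table-based construction (all parameters made up; as noted below, CMPH itself substitutes faster hash functions):

```python
import random

def make_table_hash(t: int, max_len: int, seed: int):
    """Build an h_j from a max_len x 256 table of random integers modulo t."""
    rng = random.Random(seed)
    table = [[rng.randrange(t) for _ in range(256)] for _ in range(max_len)]
    def h(key: bytes) -> int:
        assert len(key) <= max_len      # the imposed upper bound L
        return sum(table[i][ch] for i, ch in enumerate(key)) % t
    return h

t = 11                                  # t = cn, the number of vertices
h1 = make_table_hash(t, max_len=16, seed=1)
h2 = make_table_hash(t, max_len=16, seed=2)
edge = (h1(b"key"), h2(b"key"))         # one graph edge per key
```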
The random graph G = G(h_1, h_2) has vertex set V = [0, t-1] and edge set {{h_1(k), h_2(k)} : k ∈ S}. We need G to be simple, i.e., G should have neither loops nor multiple edges. A loop occurs when h_1(k) = h_2(k) for some k ∈ S. We solve this in an ad hoc manner: we simply let h_2(k) = (2h_1(k) + 1) mod t in this case. If we still find a loop after this, we generate another pair (h_1, h_2). When a multiple edge occurs we abort and generate a new pair (h_1, h_2). Although the function above causes collisions with probability 1/t, in the cmph library we use faster hash functions (DJB2 hash, FNV hash, Jenkins hash and SDBM hash) for which we do not need to impose any upper bound on the lengths of the keys in S.
As mentioned before, for us to find the labelling g of the vertices of G in linear time, we require that the number of critical edges be at most half the total number of edges. The crucial step now is to determine the value of c (in t = cn) to obtain a random graph G = G(h_1, h_2) with this property. Botelho, Menoti and Ziviani determined empirically in [1] that the value of c is 1.15. This value is remarkably close to the theoretical value determined in [2].
The procedure Ordering (G, V_crit, V_ncrit) receives as input the graph G and partitions the vertex set V into the two subsets V_crit and V_ncrit, so that V = V_crit ∪ V_ncrit.
Figure 2 presents a sample graph with 9 vertices and 8 edges, where the degree of a vertex is shown beside each vertex. Initially, all vertices with degree 1 are added to a queue Q; Figure 2(a) shows the state after this initialization step.
Figure 2: Ordering step for a graph with 9 vertices and 8 edges.
Next, we remove a vertex v from the queue and decrement its degree and the degree of the vertices with degree greater than 0 in the adjacency list of v, as depicted in Figure 2(b). At this point, the adjacencies of v with degree 1 are inserted into the queue, such as vertex 1. This process is repeated until the queue becomes empty. All vertices with degree 0 are non-critical vertices and the others are critical vertices, as depicted in Figure 2(c). Finally, to determine the vertices in V_scrit we collect all critical vertices that have at least one neighbour in V_ncrit, such as vertex 8 in Figure 2(c).
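A hedged Python sketch of this peeling process (it tracks live edges explicitly rather than only degrees, but implements the same queue-driven idea):

```python
from collections import deque

def critical_vertices(num_vertices, edges):
    """Find the 2-core by repeatedly deleting degree-1 vertices with a queue."""
    deg = [0] * num_vertices
    incident = [[] for _ in range(num_vertices)]
    for i, (u, v) in enumerate(edges):
        deg[u] += 1; deg[v] += 1
        incident[u].append(i); incident[v].append(i)
    alive = [True] * len(edges)
    q = deque(v for v in range(num_vertices) if deg[v] == 1)
    while q:
        v = q.popleft()
        if deg[v] != 1:
            continue
        for i in incident[v]:
            if alive[i]:
                alive[i] = False                  # remove the pendant edge
                u, w = edges[i]
                other = w if u == v else u
                deg[v] -= 1; deg[other] -= 1
                if deg[other] == 1:
                    q.append(other)               # it just became degree 1
    return [v for v in range(num_vertices) if deg[v] > 0]

# A triangle {0,1,2} with a tail 2-3-4: only the triangle survives the peeling.
edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4)]
core = critical_vertices(5, edges)
```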
In the searching step, the key part is the perfect assignment problem: find g such that the function h defined by

h(k) = g(h_1(k)) + g(h_2(k))

is a bijection from S to [0, n-1] (recall n = |S|). We are interested in a labelling g of the vertices of the graph G with the property that if k_1 and k_2 are distinct keys in S, then h(k_1) ≠ h(k_2); that is, if we associate to each edge the sum of the labels on its endpoints, then these values should be all distinct. Moreover, we require that all the sums g(h_1(k)) + g(h_2(k)) (k ∈ S) fall between 0 and n-1, and thus we have a bijection between S and [0, n-1].
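A tiny Python check of this property; the graph and labelling below are made up for illustration:

```python
# One edge per key; n = 3 keys over 4 vertices.
edges = [(0, 1), (1, 2), (2, 3)]   # {h1(k), h2(k)} for each key k
g = [0, 0, 1, 1]                   # a suitable labelling of the vertices
addresses = [g[u] + g[v] for u, v in edges]
# All edge sums are distinct and fall in [0, n-1]: a perfect assignment.
```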
The procedure Searching (G, V_crit, V_ncrit, g) receives as input G, V_crit, V_ncrit and finds a suitable value g(v) for each vertex v ∈ V, stored in the array g. This step is first performed for the vertices in the critical subgraph G_crit of G (the 2-core of G) and then for the vertices in G_ncrit (the non-critical subgraph of G that contains the "acyclic part" of G). The reason the assignment of the g values is first performed on the vertices in G_crit is to resolve reassignments as early as possible (such reassignments are consequences of the cycles in G_crit and are depicted hereinafter).
The labels g(v) (v ∈ V_crit) are assigned in increasing order following a greedy strategy where the critical vertices are considered one at a time, according to a breadth-first search on G_crit. If a candidate value x for g(v) is forbidden because setting g(v) = x would create two edges with the same sum, we try x + 1 for g(v). This fact is referred to as a reassignment.
Let AE be the set of addresses assigned to edges in E_crit. Initially AE is empty. Let x be a candidate value for g(v). Initially x = 0. Considering the subgraph G_crit in Figure 2(c), a step by step example of the assignment of values to vertices in V_crit is presented in Figure 3. Initially, a vertex v is chosen, the assignment g(v) = x is made and x is set to x + 1. For example, the first critical vertex chosen in Figure 3(a) is assigned the value 0 and x is set to 1.
Figure 3: Example of the assignment of values to critical vertices.
In Figure 3(b), following the adjacency list of the first assigned vertex, the next unassigned vertex is reached. At this point, we collect in a temporary variable Y all adjacencies of this vertex that have already been assigned an x value. Next, for all u in Y, we check whether g(u) + x is already in AE. If no sum collides, g of the vertex is set to x, x is incremented by 1, and the new address is added to AE; the breadth-first search then continues with the next vertex. Near the end of the example, a candidate value causes a collision, so x is incremented by 1 and set to 5, as depicted in Figure 3(c). Since a collision occurs again, x is again incremented by 1 and set to 6, as depicted in Figure 3(d). These two reassignments are indicated by the arrows in Figure 3. The last vertex is then assigned and AE is updated, which finishes the algorithm.
As G_ncrit is acyclic, we can impose the order in which addresses are associated with edges in G_ncrit, making this step simple to solve by a standard depth-first search algorithm. Therefore, in the assignment of values to vertices in V_ncrit we benefit from the unused addresses in the gaps left by the assignment of values to vertices in V_crit. For that, we start the depth-first search from the vertices in V_scrit, because the g values for these critical vertices were already assigned and cannot be changed.
Considering the subgraph G_ncrit in Figure 2(c), a step by step example of the assignment of values to vertices in V_ncrit is presented in Figure 4. Figure 4(a) presents the initial state of the algorithm. The critical vertex 8 is the only one that has non-critical vertices as neighbours. In the example presented in Figure 3, some addresses were left unused. So, taking the first unused address and the first non-critical vertex reached from vertex 8, we set its g value so that the corresponding edge receives that address, as shown in Figure 4(b). The only vertex reached from it is handled next: taking the next unused address, we set its g value accordingly, as shown in Figure 4(c). This process is repeated until the UnAssignedAddresses list becomes empty.
Figure 4: Example of the assignment of values to non-critical vertices.
We now present a heuristic for the BMZ algorithm that reduces the value of c to any given value between 1.15 and 0.93. This reduces the space requirement to store the resulting function to any given value between 1.15n words and 0.93n words. The heuristic reuses, when possible, the set of x values that caused reassignments, just before trying x + 1. Decreasing the value of c leads to an increase in the number of iterations needed to generate G: the analytical expected number of iterations grows as c decreases (see [2] for details), while for c = 1.15 the same value is around 2.13.
Now we detail the memory consumption to generate and to store minimal perfect hash functions using the BMZ algorithm. The structures responsible for memory consumption are as follows:
Thus, the total memory consumption of the BMZ algorithm for generating a minimal perfect hash function (MPHF) is (8.25c + 16.125)n + 2|E(G_crit)| + O(1) bytes. As the value of the constant c may be 1.15 or 0.93 we have:
c | |E(G_crit)| | Memory consumption to generate a MPHF (in bytes)
---|---|---
0.93 | 0.497n | 24.80n + O(1)
1.15 | 0.401n | 26.42n + O(1)
Table 1: Memory consumption to generate a MPHF using the BMZ algorithm.
The values of |E(G_crit)| were calculated using Eq. (1) presented in [2].
++Now we present the memory consumption to store the resulting function. +We only need to store the g function. Thus, we need 4cn bytes. +Again we have: +
c | Memory consumption to store a MPHF (in bytes)
---|---
0.93 | 3.72n
1.15 | 4.60n
Table 2: Memory consumption to store a MPHF generated by the BMZ algorithm.
+CHM x BMZ +
+Enjoy! +
Until now, because of the limitations of current algorithms, the use of MPHFs has been restricted to scenarios where the set of keys being hashed is relatively small. However, in many cases it is crucial to deal in an efficient way with very large sets of keys. Due to the exponential growth of the Web, working with huge collections is becoming a daily task. For instance, the simple assignment of number identifiers to the web pages of a collection can be a challenging task. While traditional databases simply cannot handle more traffic once the working set of URLs does not fit in main memory anymore [4], the algorithm we propose here to construct MPHFs can easily scale to billions of entries.
++As there are many applications for MPHFs, it is +important to design and implement space and time efficient algorithms for +constructing such functions. +The attractiveness of using MPHFs depends on the following issues: +
We present here a novel external memory based algorithm for constructing MPHFs that is very efficient with respect to the four requirements mentioned previously. First, the algorithm runs in time linear in the size of the key set, which is optimal. For instance, for a collection of 1 billion URLs collected from the web, each one 64 characters long on average, the time to construct a MPHF using a 2.4 gigahertz PC with 500 megabytes of available main memory is approximately 3 hours. Second, the algorithm needs a small, a priori defined vector of one-byte entries in main memory to construct a MPHF. For the collection of 1 billion URLs the algorithm needs only 5.45 megabytes of internal memory. Third, the evaluation of the MPHF for each retrieval requires three memory accesses and the computation of three universal hash functions. This is not optimal, as any MPHF requires at least one memory access and the computation of two universal hash functions. Fourth, the description of a MPHF takes a constant number of bits per key, which is optimal. For the collection of 1 billion URLs, it needs 8.1 bits per key, while the theoretical lower bound is approximately 1.44 bits per key.
+ ++The main idea supporting our algorithm is the classical divide and conquer technique. +The algorithm is a two-step external memory based algorithm +that generates a MPHF h for a set S of n keys. +Figure 1 illustrates the two steps of the +algorithm: the partitioning step and the searching step. +
Figure 1: Main steps of our algorithm.
The partitioning step takes a key set S and uses a universal hash function h_0 proposed by Jenkins [5] to transform each key k into an integer h_0(k). Reducing h_0(k) modulo ⌈n/b⌉, we partition S into ⌈n/b⌉ buckets containing at most 256 keys each (with high probability).
The searching step generates a MPHF_i for each bucket i, 0 <= i < ⌈n/b⌉. The resulting MPHF h(k), k ∈ S, is given by

h(k) = MPHF_i(k) + offset[i]

where i = h_0(k) mod ⌈n/b⌉ is the bucket address of key k. The ith entry offset[i] of the displacement vector offset, 0 <= i < ⌈n/b⌉, contains the total number of keys in the buckets from 0 to i-1, that is, it gives the interval of the keys in the hash table addressed by MPHF_i. In the following we explain each step in detail.
+ + + + + + + + +Figure 2: Partitioning step. |
Statement 1.1 of the for loop presented in Figure 2 reads sequentially all the keys of a block from disk into an internal memory area.
Statement 1.2 performs an indirect bucket sort of the keys in the block and at the same time updates the entries in the vector size. Let us briefly describe how the block is partitioned among the buckets. We use a local array of counters to store a count of how many keys from the block belong to each bucket. The pointers to the keys in each bucket i are stored in contiguous positions in an array. For this we first reserve the required number of entries in this array of pointers using the information from the array of counters. Next, we place the pointers to the keys in each bucket into the respective reserved areas in the array (i.e., we place the pointers to the keys in bucket 0, followed by the pointers to the keys in bucket 1, and so on).
To find the bucket address of a given key we use the universal hash function h_0(k) [5]. Key k goes into bucket i, where

i = h_0(k) mod ⌈n/b⌉    (1)
+Figure 3(a) shows a logical view of the buckets +generated in the partitioning step. +In reality, the keys belonging to each bucket are distributed among many files, +as depicted in Figure 3(b). +In the example of Figure 3(b), the keys in bucket 0 +appear in files 1 and N, the keys in bucket 1 appear in files 1, 2 +and N, and so on. +
Figure 3: Situation of the buckets at the end of the partitioning step: (a) Logical view (b) Physical view.
+This scattering of the keys in the buckets could generate a performance +problem because of the potential number of seeks +needed to read the keys in each bucket from the N files in disk +during the searching step. +But, as we show in [2], the number of seeks +can be kept small using buffering techniques. +Considering that only the vector size, which has one-byte +entries (remember that each bucket has at most 256 keys), +must be maintained in main memory during the searching step, +almost all main memory is available to be used as disk I/O buffer. +
The last step is to compute the offset vector and dump it to disk. We use the vector size to compute the offset displacement vector. The offset[i] entry contains the number of keys in the buckets 0, 1, ..., i-1. As size[i] stores the number of keys in bucket i, where 0 <= i < ⌈n/b⌉, we have
offset[i] = size[0] + size[1] + ... + size[i-1]

The searching step is responsible for generating a MPHF for each bucket. Figure 4 presents the searching step algorithm.
Figure 4: Searching step.
+Statement 1 of Figure 4 inserts one key from each file +in a minimum heap H of size N. +The order relation in H is given by the bucket address i given by +Eq. (1). +
Statement 2 has two important steps. In statement 2.1, a bucket is read from disk, as described below. In statement 2.2, a MPHF is generated for each bucket i, as described in the following. The description of each bucket's MPHF is a vector of 8-bit integers. Finally, statement 2.3 writes that description to disk.
+ ++In this section we present the refinement of statement 2.1 of +Figure 4. +The algorithm to read bucket i from disk is presented +in Figure 5. +
Figure 5: Reading a bucket.
+Bucket i is distributed among many files and the heap H is used to drive a +multiway merge operation. +In Figure 5, statement 1.1 extracts and removes triple +(i, j, k) from H, where i is a minimum value in H. +Statement 1.2 inserts key k in bucket i. +Notice that the k in the triple (i, j, k) is in fact a pointer to +the first byte of the key that is kept in contiguous positions of an array of characters +(this array containing the keys is initialized during the heap construction +in statement 1 of Figure 4). +Statement 1.3 performs a seek operation in File j on disk for the first +read operation and reads sequentially all keys k that have the same i +and inserts them all in bucket i. +Finally, statement 1.4 inserts in H the triple (i, j, x), +where x is the first key read from File j (in statement 1.3) +that does not have the same bucket address as the previous keys. +
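The heap-driven merge can be sketched in a few lines of Python (lists stand in for the N files on disk; each holds (bucket, key) pairs sorted by bucket address):

```python
import heapq

def read_buckets(files):
    """Yield (bucket, key) pairs in nondecreasing bucket order,
    merging the per-file runs with a min-heap keyed on bucket address."""
    heap = []
    for j, run in enumerate(files):
        if run:
            bucket, key = run[0]
            heapq.heappush(heap, (bucket, j, 0, key))
    while heap:
        bucket, j, pos, key = heapq.heappop(heap)
        yield bucket, key
        if pos + 1 < len(files[j]):      # advance within file j
            nxt_bucket, nxt_key = files[j][pos + 1]
            heapq.heappush(heap, (nxt_bucket, j, pos + 1, nxt_key))

files = [[(0, "a"), (1, "c")], [(0, "b"), (2, "d")]]
merged = list(read_buckets(files))   # bucket 0 comes out contiguously
```

In the real algorithm, advancing within a file is the sequential read of statement 1.3; the heap only pays the cost of choosing which file to read next.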
++The number of seek operations on disk performed in statement 1.3 is discussed +in [2, Section 5.1], +where we present a buffering technique that brings down +the time spent with seeks. +
To the best of our knowledge, the BMZ algorithm we designed in our previous works [1,3] is the fastest published algorithm for constructing MPHFs. That is why we use it as a building block for the algorithm presented here. In reality, we use an optimized version of BMZ (BMZ8) for small sets of keys (at most 256 keys). Details about the BMZ algorithm can be found in its documentation page.
+ ++Analytical results and the complete analysis of the external memory based algorithm +can be found in [2]. +
+ ++In this section we present the experimental results. +We start presenting the experimental setup. +We then present experimental results for +the internal memory based algorithm (the BMZ algorithm) +and for our external memory based algorithm. +Finally, we discuss how the amount of internal memory available +affects the runtime of the external memory based algorithm. +
All experiments were carried out on a computer running the Linux operating system, version 2.6, with a 2.4 gigahertz processor and 1 gigabyte of main memory. In the experiments related to the new algorithm we limited the main memory to 500 megabytes.
++Our data consists of a collection of 1 billion +URLs collected from the Web, each URL 64 characters long on average. +The collection is stored on disk in 60.5 gigabytes. +
+ ++The BMZ algorithm is used for constructing a MPHF for each bucket. +It is a randomized algorithm because it needs to generate a simple random graph +in its first step. +Once the graph is obtained the other two steps are deterministic. +
Thus, we can consider the runtime of the algorithm to have the form αnZ for an input of n keys, where α is some machine dependent constant that further depends on the length of the keys, and Z is a random variable with geometric distribution with mean 1/p = e. All results in our experiments were obtained taking c = 1; the value of c, with c in [0.93, 1.15], in fact has little influence on the runtime, as shown in [3].
The values chosen for n were 1, 2, 4, 8, 16 and 32 million. Although we have a dataset with 1 billion URLs, on a PC with 1 gigabyte of main memory the algorithm is able to handle an input of at most 32 million keys. This is mainly because of the graph we need to keep in main memory. The algorithm requires 25n + O(1) bytes for constructing a MPHF (click here to get details about the data structures used by the BMZ algorithm).
In order to estimate the number of trials for each value of n we use a statistical method for determining a suitable sample size (see, e.g., [6, Chapter 13]). As the method yielded a different sample size for each n, we used the maximal value obtained, namely 300 trials, in order to have a confidence level of 95 %.
Table 1 presents the runtime averages for each n, the respective standard deviations, and the respective confidence intervals given by the average time ± the distance from the average time considering a confidence level of 95 %. Observing the runtime averages one sees that the algorithm runs in expected linear time, as shown in [3].
n (millions) | 1 | 2 | 4 | 8 | 16 | 32
---|---|---|---|---|---|---
Average time (s) | – | – | – | – | – | –
SD (s) | – | – | – | – | – | –

Table 1: Internal memory based algorithm: average time in seconds for constructing a MPHF, the standard deviation (SD), and the confidence intervals considering a confidence level of 95 %.
Figure 6 presents the runtime for each trial. In addition, the solid line corresponds to a linear regression model obtained from the experimental measurements. As we can see, the runtime for a given n shows considerable fluctuation. However, the fluctuation also grows linearly with n.
Figure 6: Time versus number of keys in S for the internal memory based algorithm. The solid line corresponds to a linear regression model.
The observed fluctuation in the runtimes is as expected; recall that this runtime has the form αnZ with Z a geometric random variable with mean 1/p = e. Thus, the runtime has mean αne and standard deviation αn√(e(e-1)). Therefore, the standard deviation also grows linearly with n, as experimentally verified in Table 1 and in Figure 6.
+ ++The runtime of the external memory based algorithm is also a random variable, +but now it follows a (highly concentrated) normal distribution, as we discuss at the end of this +section. Again, we are interested in verifying the linearity claim made in +[2, Section 5.1]. Therefore, we ran the algorithm for +several numbers n of keys in S. +
The values chosen for n were 1, 2, 4, 8, 16, 32, 64, 128, 512 and 1000 million. We limited the main memory to 500 megabytes for the experiments. The size μ of the a priori reserved internal memory area was set to 250 megabytes, the parameter b was set to 175, and the building block algorithm parameter c was again set to 1. We show later on how μ affects the runtime of the algorithm. The other two parameters have insignificant influence on the runtime.
We again use a statistical method for determining a suitable sample size to estimate the number of trials to be run for each value of n. It indicated that just one trial for each n would be enough for a confidence level of 95 %. Nevertheless, we ran 10 trials. This number of trials may seem rather small but, as shown below, the behavior of our algorithm is very stable and its runtime is almost deterministic (i.e., the standard deviation is very small).
Table 2 presents the runtime average for each n, the respective standard deviations, and the respective confidence intervals given by the average time ± the distance from the average time considering a confidence level of 95 %. Observing the runtime averages we noticed that the algorithm runs in expected linear time, as shown in [2, Section 5.1]. Better still, it is only approximately 60 % slower than the BMZ algorithm. To get that value we used the linear regression model obtained for the runtime of the internal memory based algorithm to estimate how much time it would require to construct a MPHF for a set of 1 billion keys. We got 2.3 hours for the internal memory based algorithm, and we measured 3.67 hours on average for the external memory based algorithm. Increasing the size of the internal memory area from 250 to 600 megabytes brought the time down to 3.09 hours, making the external memory based algorithm just 34 % slower in this setup.
n (millions) | 1 | 2 | 4 | 8 | 16
---|---|---|---|---|---
Average time (s) | – | – | – | – | –
SD | – | – | – | – | –

n (millions) | 32 | 64 | 128 | 512 | 1000
---|---|---|---|---|---
Average time (s) | – | – | – | – | –
SD | – | – | – | – | –

Table 2: The external memory based algorithm: average time in seconds for constructing a MPHF, the standard deviation (SD), and the confidence intervals considering a confidence level of 95 %.
Figure 7 presents the runtime for each trial. In addition, the solid line corresponds to a linear regression model obtained from the experimental measurements. As expected, the runtime for a given n shows almost no variation.
Figure 7: Time versus number of keys in S for our algorithm. The solid line corresponds to a linear regression model.
An intriguing observation is that the runtime of the algorithm is almost deterministic, in spite of the fact that it uses as a building block an algorithm with considerable fluctuation in its runtime. A given bucket i is a small set of keys (at most 256 keys) and, as argued in the last section, the runtime Y_i of the building block algorithm on bucket i is a random variable with high fluctuation. However, the runtime Y of the searching step of the external memory based algorithm is given by the sum Y = Y_0 + Y_1 + ... + Y_{N-1} over the N buckets. Under the hypothesis that the Y_i are independent and bounded, the law of large numbers (see, e.g., [6]) implies that the random variable Y/N converges to a constant as N grows. This explains why the runtime of our algorithm is almost deterministic.
In order to bring down the number of seek operations on disk we benefit from the fact that our algorithm leaves almost all main memory available to be used as disk I/O buffer. In this section we evaluate how much the parameter μ affects the runtime of our algorithm. For that we fixed n at 1 billion URLs, set the main memory of the machine used for the experiments to 1 gigabyte, and used μ equal to 100, 200, 300, 400, 500 and 600 megabytes.
Table 3 presents the number of files N, the buffer size used for all files, the number of seeks in the worst case considering the pessimistic assumption mentioned in [2, Section 5.1], and the time to generate a MPHF for 1 billion keys as a function of the amount of internal memory available. Observing Table 3 we notice that the time spent in the construction decreases as the value of μ increases; beyond a certain point, however, further increases in μ bring only minor gains. This can be explained by the fact that the I/O scheduler of the Linux 2.6 kernel has smart policies for avoiding seeks and diminishing the average seek time (see http://www.linuxjournal.com/article/6931).
μ (MB) | 100 | 200 | 300 | 400 | 500 | 600
---|---|---|---|---|---|---
N (files) | – | – | – | – | – | –
μ/N (buffer size in KB) | – | – | – | – | – | –
# of seeks in the worst case | – | – | – | – | – | –
Time (hours) | – | – | – | – | – | –

Table 3: Influence of the internal memory area size (μ) on the external memory based algorithm runtime.
Home | CHD | BDZ | BMZ | CHM | BRZ | FCH |
+Enjoy! +
The important performance parameters of a PHF are representation size, evaluation time and construction time. The representation size plays an important role when the whole function fits in a faster memory and the actual data is stored in a slower memory. For instance, compact PHFs can fit entirely in a CPU cache, which makes their evaluation very fast by avoiding cache misses. The CHD algorithm plays an important role in this context. It was designed by Djamal Belazzougui, Fabiano C. Botelho, and Martin Dietzfelbinger in [2].
The CHD algorithm makes it possible to obtain PHFs with representation size very close to optimal while retaining O(n) construction time and O(1) evaluation time. For example, in the case m = 2n we obtain a PHF that uses 0.67 bits per key, and for m = 1.23n we obtain 1.4 bits per key, which was not achievable with previously known methods. The CHD algorithm is inspired by several known algorithms; the main new feature is that it combines a modification of Pagh's "hash-and-displace" approach with data compression on a sequence of hash function indices. That combination makes it possible to significantly reduce space usage while retaining linear construction time and constant query time. The CHD algorithm can also be used for k-perfect hashing, where at most k keys may be mapped to the same value. For the analysis we assume that fully random hash functions are given for free; such assumptions can be justified and were made in previous papers.
++The compact PHFs generated by the CHD algorithm can be used in many applications in which we want to assign a unique identifier to each key without storing any information on the key. One of the most obvious applications of those functions (or k-perfect hash functions) is when we have a small fast memory in which we can store the perfect hash function while the keys and associated satellite data are stored in slower but larger memory. The size of a block or a transfer unit may be chosen so that k data items can be retrieved in one read access. In this case we can ensure that data associated with a key can be retrieved in a single probe to slower memory. This has been used for example in hardware routers [4]. +
The CHD algorithm generates the most compact PHFs and MPHFs we know of in O(n) time. The time required to evaluate the generated functions is constant (in practice less than 1.4 microseconds). The storage space of the resulting PHFs and MPHFs is within a factor of 1.43 of the information theoretic lower bound. The closest competitor is the algorithm by Martin and Pagh [3], but their algorithm does not work in linear time. Furthermore, the CHD algorithm can be tuned to run faster than the BPZ algorithm [1] (the fastest algorithm available in the literature so far) and to obtain more compact functions. Its most impressive characteristic is that it has the ability, in principle, to approximate the information theoretic lower bound while remaining practical. A detailed description of the CHD algorithm can be found in [2].
+ ++Experimental results comparing the CHD algorithm with the BDZ algorithm +and others available in the CMPH library are presented in [2]. +
The algorithm is presented in [1,2,3].
Now we detail the memory consumption needed to generate and to store minimal perfect hash functions using the CHM algorithm. The structures responsible for the memory consumption are described next.
Thus, the total memory consumption of the CHM algorithm for generating a minimal perfect hash function (MPHF) is (8.125c + 16)n + O(1) bytes. As the value of the constant c must be at least 2.09, we have:
c | Memory consumption to generate a MPHF
---|---|
2.09 | 33.00n + O(1) |
Table 1: Memory consumption to generate a MPHF using the CHM algorithm. |
+Now we present the memory consumption to store the resulting function. +We only need to store the g function. Thus, we need 4cn bytes. +Again we have: +
c | Memory consumption to store a MPHF
---|---|
2.09 | 8.36n |
Table 2: Memory consumption to store a MPHF generated by the CHM algorithm. |
+CHM x BMZ +
Table 1 presents the main characteristics of the two algorithms. The number of edges in the random graph G is n, the number of keys in the input set S. The number of vertices of G is equal to 1.15n for the BMZ algorithm and 2.09n for the CHM algorithm. This measure is related to the amount of space needed to store the array g; it reduces the space required to store the function generated by the BMZ algorithm to approximately 55 % of the space required by the CHM algorithm. The number of critical edges is approximately 0.5n for the BMZ algorithm and 0 for the CHM algorithm: the BMZ algorithm generates random graphs that necessarily contain cycles, while the CHM algorithm generates acyclic random graphs. Finally, the CHM algorithm generates order preserving functions, while the BMZ algorithm does not preserve order.
Characteristics | BMZ | CHM
---|---|---
c | 1.15 | 2.09
Number of edges of G | n | n
Number of vertices of G (size of g) | cn | cn
Number of critical edges | 0.5n | 0
Type of random graph G | cyclic | acyclic
Order preserving | no | yes

Table 1: Main characteristics of the algorithms.
Algorithm | c | Memory consumption to generate a MPHF |
---|---|---|
BMZ | 0.93 | 24.80n + O(1) |
BMZ | 1.15 | 26.42n + O(1) |
CHM | 2.09 | 33.00n + O(1) |
Table 2: Memory consumption to generate a MPHF using the algorithms BMZ and CHM. |
Algorithm | c | Memory consumption to store a MPHF |
---|---|---|
BMZ | 0.93 | 3.72n |
BMZ | 1.15 | 4.60n |
CHM | 2.09 | 8.36n |
Table 3: Memory consumption to store a MPHF generated by the algorithms BMZ and CHM. |
We now present some experimental results to compare the BMZ and CHM algorithms. The data consists of a collection of 100 million uniform resource locators (URLs) collected from the Web. The average length of a URL in the collection is 63 bytes. All experiments were carried out on a computer running the Linux operating system, version 2.6.7, with a 2.4 gigahertz processor and 4 gigabytes of main memory.
Table 4 presents time measurements. All times are in seconds. The table entries represent averages over 50 trials. The column labelled Ni represents the number of iterations needed to generate the random graph in the mapping step of the algorithms. The next columns represent the run times for the mapping plus ordering steps together and for the searching step of each algorithm. The last column represents the percentage gain of our algorithm over the CHM algorithm.
 | BMZ | | | | CHM | | | | Gain
n | Ni | Map+Ord | Search | Total | Ni | Map+Ord | Search | Total | (%)
---|---|---|---|---|---|---|---|---|---
1,562,500 | 2.28 | 8.54 | 2.37 | 10.91 | 2.70 | 14.56 | 1.57 | 16.13 | 48 |
3,125,000 | 2.16 | 15.92 | 4.88 | 20.80 | 2.85 | 30.36 | 3.20 | 33.56 | 61 |
6,250,000 | 2.20 | 33.09 | 10.48 | 43.57 | 2.90 | 62.26 | 6.76 | 69.02 | 58 |
12,500,000 | 2.00 | 63.26 | 23.04 | 86.30 | 2.60 | 117.99 | 14.94 | 132.92 | 54 |
25,000,000 | 2.00 | 130.79 | 51.55 | 182.34 | 2.80 | 262.05 | 33.68 | 295.73 | 62 |
50,000,000 | 2.07 | 273.75 | 114.12 | 387.87 | 2.90 | 577.59 | 73.97 | 651.56 | 68 |
100,000,000 | 2.07 | 567.47 | 243.13 | 810.60 | 2.80 | 1,131.06 | 157.23 | 1,288.29 | 59 |
Table 4: Time measurements for BMZ and the CHM algorithm. |
The mapping step of the BMZ algorithm is faster because the expected numbers of iterations in the mapping step to generate the random graph are 2.13 and 2.92 for the BMZ algorithm and the CHM algorithm, respectively (see [2] for details). The graph generated by the BMZ algorithm has 1.15n vertices, against 2.09n for the CHM algorithm. These two facts make the BMZ algorithm faster in the mapping step. The time spent in the ordering step of the BMZ algorithm is approximately equal to the time spent checking whether the graph is acyclic in the CHM algorithm. The searching step of the CHM algorithm is faster, but the BMZ algorithm is, on average, approximately 59 % faster in total time than the CHM algorithm. It is important to notice the times for the searching step: for both algorithms they are not the dominant times, and the experimental results clearly show a linear behavior for the searching step.
We now present run times for the BMZ algorithm using a heuristic that reduces the space requirement to any given value between 1.15n words and 0.93n words. For n = 12,500,000, the observed numbers of iterations in the mapping step were 2.78 for c = 1.00 and 3.04 for c = 0.93 (see Table 5). Table 5 presents the total times to construct a MPHF for n = 12,500,000, with an increase from 86.30 seconds for c = 1.15 (see Table 4) to 101.74 seconds for c = 1.00 and to 102.19 seconds for c = 0.93.
 | BMZ with c = 1.00 | | | | BMZ with c = 0.93 | | |
n | Ni | Map+Ord | Search | Total | Ni | Map+Ord | Search | Total
---|---|---|---|---|---|---|---|---
12,500,000 | 2.78 | 76.68 | 25.06 | 101.74 | 3.04 | 76.39 | 25.80 | 102.19 |
Table 5: Time measurements for the BMZ tuned algorithm with c = 1.00 and c = 0.93.
Suppose U is a universe of keys and h : U → M is a hash function that maps the keys from U to a given interval of integers M = [0, m-1]. Let S be a set of n keys from U. Given a key x in S, the hash function h computes an integer in [0, m-1] for the storage or retrieval of x in a hash table. Hashing methods for non-static sets of keys can be used to construct data structures storing S and supporting membership queries "x in S?" in expected time O(1). However, they involve a certain amount of wasted space owing to unused locations in the table, and wasted time to resolve collisions when two keys are hashed to the same table location.
For static sets of keys it is possible to compute a function to find any key in a table in one probe; such hash functions are called perfect. More precisely, given a set of keys S, we shall say that a hash function h is a perfect hash function for S if h is an injection on S, that is, there are no collisions among the keys in S: if x and y are in S and x ≠ y, then h(x) ≠ h(y). Figure 1(a) illustrates a perfect hash function. Since no collisions occur, each key can be retrieved from the table with a single probe. If m = n, that is, the table has the same size as S, then we say that h is a minimal perfect hash function for S. Figure 1(b) illustrates a minimal perfect hash function. Minimal perfect hash functions totally avoid the problem of wasted space and time. A perfect hash function is order preserving if the keys in S are arranged in some given order and the function preserves this order in the hash table.
Figure 1: (a) Perfect hash function. (b) Minimal perfect hash function.
+Minimal perfect hash functions are widely used for memory efficient +storage and fast retrieval of items from static sets, such as words in natural +languages, reserved words in programming languages or interactive systems, +universal resource locations (URLs) in Web search engines, or item sets in +data mining techniques. +
Using cmph is quite simple. Take a look at the following examples.
+ ++ #include <cmph.h> + #include <string.h> + // Create minimal perfect hash function from in-memory vector + int main(int argc, char **argv) + { + + // Creating a filled vector + unsigned int i = 0; + const char *vector[] = {"aaaaaaaaaa", "bbbbbbbbbb", "cccccccccc", "dddddddddd", "eeeeeeeeee", + "ffffffffff", "gggggggggg", "hhhhhhhhhh", "iiiiiiiiii", "jjjjjjjjjj"}; + unsigned int nkeys = 10; + FILE* mphf_fd = fopen("temp.mph", "w"); + // Source of keys + cmph_io_adapter_t *source = cmph_io_vector_adapter((char **)vector, nkeys); + + //Create minimal perfect hash function using the brz algorithm. + cmph_config_t *config = cmph_config_new(source); + cmph_config_set_algo(config, CMPH_BRZ); + cmph_config_set_mphf_fd(config, mphf_fd); + cmph_t *hash = cmph_new(config); + cmph_config_destroy(config); + cmph_dump(hash, mphf_fd); + cmph_destroy(hash); + fclose(mphf_fd); + + //Find key + mphf_fd = fopen("temp.mph", "r"); + hash = cmph_load(mphf_fd); + while (i < nkeys) { + const char *key = vector[i]; + unsigned int id = cmph_search(hash, key, (cmph_uint32)strlen(key)); + fprintf(stderr, "key:%s -- hash:%u\n", key, id); + i++; + } + + //Destroy hash + cmph_destroy(hash); + cmph_io_vector_adapter_destroy(source); + fclose(mphf_fd); + return 0; + } ++ +
+Download vector_adapter_ex1.c. This example does not work in versions below 0.6. +
+ ++ #include <cmph.h> + #include <string.h> + // Create minimal perfect hash function from in-memory vector + + #pragma pack(1) + typedef struct { + cmph_uint32 id; + char key[11]; + cmph_uint32 year; + } rec_t; + #pragma pack(0) + + int main(int argc, char **argv) + { + // Creating a filled vector + unsigned int i = 0; + rec_t vector[10] = {{1, "aaaaaaaaaa", 1999}, {2, "bbbbbbbbbb", 2000}, {3, "cccccccccc", 2001}, + {4, "dddddddddd", 2002}, {5, "eeeeeeeeee", 2003}, {6, "ffffffffff", 2004}, + {7, "gggggggggg", 2005}, {8, "hhhhhhhhhh", 2006}, {9, "iiiiiiiiii", 2007}, + {10,"jjjjjjjjjj", 2008}}; + unsigned int nkeys = 10; + FILE* mphf_fd = fopen("temp_struct_vector.mph", "w"); + // Source of keys + cmph_io_adapter_t *source = cmph_io_struct_vector_adapter(vector, (cmph_uint32)sizeof(rec_t), (cmph_uint32)sizeof(cmph_uint32), 11, nkeys); + + //Create minimal perfect hash function using the BDZ algorithm. + cmph_config_t *config = cmph_config_new(source); + cmph_config_set_algo(config, CMPH_BDZ); + cmph_config_set_mphf_fd(config, mphf_fd); + cmph_t *hash = cmph_new(config); + cmph_config_destroy(config); + cmph_dump(hash, mphf_fd); + cmph_destroy(hash); + fclose(mphf_fd); + + //Find key + mphf_fd = fopen("temp_struct_vector.mph", "r"); + hash = cmph_load(mphf_fd); + while (i < nkeys) { + const char *key = vector[i].key; + unsigned int id = cmph_search(hash, key, 11); + fprintf(stderr, "key:%s -- hash:%u\n", key, id); + i++; + } + + //Destroy hash + cmph_destroy(hash); + cmph_io_vector_adapter_destroy(source); + fclose(mphf_fd); + return 0; + } ++ +
+Download struct_vector_adapter_ex3.c. This example does not work in versions below 0.8. +
+ ++ #include <cmph.h> + #include <stdio.h> + #include <string.h> + // Create minimal perfect hash function from in-disk keys using BDZ algorithm + int main(int argc, char **argv) + { + //Open file with newline separated list of keys + FILE * keys_fd = fopen("keys.txt", "r"); + cmph_t *hash = NULL; + if (keys_fd == NULL) + { + fprintf(stderr, "File \"keys.txt\" not found\n"); + exit(1); + } + // Source of keys + cmph_io_adapter_t *source = cmph_io_nlfile_adapter(keys_fd); + + cmph_config_t *config = cmph_config_new(source); + cmph_config_set_algo(config, CMPH_BDZ); + hash = cmph_new(config); + cmph_config_destroy(config); + + //Find key + const char *key = "jjjjjjjjjj"; + unsigned int id = cmph_search(hash, key, (cmph_uint32)strlen(key)); + fprintf(stderr, "Id:%u\n", id); + //Destroy hash + cmph_destroy(hash); + cmph_io_nlfile_adapter_destroy(source); + fclose(keys_fd); + return 0; + } ++ +
+Download file_adapter_ex2.c and keys.txt. This example does not work in versions below 0.8. +
+ + + + + + + + + + diff --git a/docs/faq.html b/docs/faq.html new file mode 100644 index 0000000..07b61db --- /dev/null +++ b/docs/faq.html @@ -0,0 +1,111 @@ + + + + + ++ - You don't. The ids will be assigned by the algorithm creating the minimal + perfect hash function. If the algorithm creates an ordered minimal + perfect hash function, the ids will be the indices of the keys in the + input. Otherwise, you have no guarantee of the distribution of the ids. ++ +
+ - The algorithms do not guarantee that a minimal perfect hash function can + be created. In practice, it will always work if your input + is big enough (>100 keys). + The error is probably because you have duplicated + keys in the input. You must guarantee that the keys are unique in the + input. If you are using a UN*X based OS, try doing ++ +
+ #sort input.txt | uniq > input_uniq.txt ++ +
+ and run cmph with input_uniq.txt ++ +
+ - Probably you are you using the cmph_config_set_algo function after + the cmph_config_set_hashfuncs. Therefore, the default hash function + is reset when you call the cmph_config_set_algo function. ++ +
+ - Error: error while loading shared libraries: libcmph.so.0: cannot open shared object file: No such file ordirectory ++
+ - Solution: type export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib/ at the shell or put that shell command + in your .profile file or in the /etc/profile file. ++ +
Home | CHD | BDZ | BMZ | CHM | BRZ | FCH |
+Enjoy! +
+ + + + + + + + + + diff --git a/docs/fch.html b/docs/fch.html new file mode 100644 index 0000000..2676ee2 --- /dev/null +++ b/docs/fch.html @@ -0,0 +1,110 @@ + + + + + ++The algorithm is presented in [1]. +
+ ++Now we detail the memory consumption to generate and to store minimal perfect hash functions +using the FCH algorithm. The structures responsible for memory consumption are in the +following: +
+ ++Thus, the total memory consumption of FCH algorithm for generating a minimal +perfect hash function (MPHF) is: O(n) + 9n + 8cn/(log(n) + 1) bytes. +The value of parameter c must be greater than or equal to 2.6. +
++Now we present the memory consumption to store the resulting function. +We only need to store the g function and a constant number of bytes for the seed of the hash functions used in the resulting MPHF. Thus, we need cn/(log(n) + 1) + O(1) bytes. +
+ +Home | CHD | BDZ | BMZ | CHM | BRZ | FCH |
+Enjoy! +
+ + + + + + + + + + diff --git a/docs/gperf.html b/docs/gperf.html new file mode 100644 index 0000000..2603bcc --- /dev/null +++ b/docs/gperf.html @@ -0,0 +1,99 @@ + + + + + ++You might ask why cmph if gperf +already works perfectly. Actually, gperf and cmph have different goals. +Basically, these are the requirements for each of them: +
+ ++ - Create very fast hash functions for small sets ++
+ - Create perfect hash functions ++ +
+ - Create very fast hash function for very large sets ++
+ - Create minimal perfect hash functions ++
+As result, cmph can be used to create hash functions where gperf would run +forever without finding a perfect hash function, because of the running +time of the algorithm and the large memory usage. +On the other side, functions created by cmph are about 2x slower than those +created by gperf. +
++So, if you have large sets, or memory usage is a key restriction for you, stick +to cmph. If you have small sets, and do not care about memory usage, go with +gperf. The first problem is common in the information retrieval field (e.g. +assigning ids to millions of documents), while the former is usually found in +the compiler programming area (detect reserved keywords). +
+ +Home | CHD | BDZ | BMZ | CHM | BRZ | FCH |
+Enjoy! +
+ + + + + + + + + + diff --git a/docs/index.html b/docs/index.html new file mode 100644 index 0000000..a1629a0 --- /dev/null +++ b/docs/index.html @@ -0,0 +1,392 @@ + + + + + ++A perfect hash function maps a static set of n keys into a set of m integer numbers without collisions, where m is greater than or equal to n. If m is equal to n, the function is called minimal. +
++Minimal perfect hash functions are widely used for memory efficient storage and fast retrieval of items from static sets, such as words in natural languages, reserved words in programming languages or interactive systems, universal resource locations (URLs) in Web search engines, or item sets in data mining techniques. Therefore, there are applications for minimal perfect hash functions in information retrieval systems, database systems, language translation systems, electronic commerce systems, compilers, operating systems, among others. +
++The use of minimal perfect hash functions is, until now, restricted to scenarios where the set of keys being hashed is small, because of the limitations of current algorithms. But in many cases, to deal with huge set of keys is crucial. So, this project gives to the free software community an API that will work with sets in the order of billion of keys. +
++Probably, the most interesting application for minimal perfect hash functions is its use as an indexing structure for databases. The most popular data structure used as an indexing structure in databases is the B+ tree. In fact, the B+ tree is very used for dynamic applications with frequent insertions and deletions of records. However, for applications with sporadic modifications and a huge number of queries the B+ tree is not the best option, because practical deployments of this structure are extremely complex, and perform poorly with very large sets of keys such as those required for the new frontiers database applications. +
++For example, in the information retrieval field, the work with huge collections is a daily task. The simple assignment of ids to web pages of a collection can be a challenging task. While traditional databases simply cannot handle more traffic once the working set of web page urls does not fit in main memory anymore, minimal perfect hash functions can easily scale to hundred of millions of entries, using stock hardware. +
++As there are lots of applications for minimal perfect hash functions, it is important to implement memory and time efficient algorithms for constructing such functions. The lack of similar libraries in the free software world has been the main motivation to create the C Minimal Perfect Hashing Library (gperf is a bit different, since it was conceived to create very fast perfect hash functions for small sets of keys and CMPH Library was conceived to create minimal perfect hash functions for very large sets of keys). C Minimal Perfect Hashing Library is a portable LGPLed library to generate and to work with very efficient minimal perfect hash functions. +
+ ++The CMPH Library encapsulates the newest and more efficient algorithms in an easy-to-use, production-quality, fast API. The library was designed to work with big entries that cannot fit in the main memory. It has been used successfully for constructing minimal perfect hash functions for sets with more than 100 million of keys, and we intend to expand this number to the order of billion of keys. Although there is a lack of similar libraries, we can point out some of the distinguishable features of the CMPH Library: +
+ ++Cleaned up most warnings for the c code. +
++Experimental C++ interface (--enable-cxxmph) implementing the BDZ algorithm in +a convenient interface, which serves as the basis +for drop-in replacements for std::unordered_map, sparsehash::sparse_hash_map +and sparsehash::dense_hash_map. Potentially faster lookup time at the expense +of insertion time. See cxxmpph/mph_map.h and cxxmph/mph_index.h for details. +
+ ++Fixed a bug in the chd_pc algorithm and reorganized tests. +
+ ++This is a bugfix only version, after which a revamp of the cmph code and +algorithms will be done. +
+ ++News log +
+ ++Using cmph is quite simple. Take a look. +
+ ++ #include <cmph.h> + #include <string.h> + // Create minimal perfect hash function from in-memory vector + int main(int argc, char **argv) + { + + // Creating a filled vector + unsigned int i = 0; + const char *vector[] = {"aaaaaaaaaa", "bbbbbbbbbb", "cccccccccc", "dddddddddd", "eeeeeeeeee", + "ffffffffff", "gggggggggg", "hhhhhhhhhh", "iiiiiiiiii", "jjjjjjjjjj"}; + unsigned int nkeys = 10; + FILE* mphf_fd = fopen("temp.mph", "w"); + // Source of keys + cmph_io_adapter_t *source = cmph_io_vector_adapter((char **)vector, nkeys); + + //Create minimal perfect hash function using the brz algorithm. + cmph_config_t *config = cmph_config_new(source); + cmph_config_set_algo(config, CMPH_BRZ); + cmph_config_set_mphf_fd(config, mphf_fd); + cmph_t *hash = cmph_new(config); + cmph_config_destroy(config); + cmph_dump(hash, mphf_fd); + cmph_destroy(hash); + fclose(mphf_fd); + + //Find key + mphf_fd = fopen("temp.mph", "r"); + hash = cmph_load(mphf_fd); + while (i < nkeys) { + const char *key = vector[i]; + unsigned int id = cmph_search(hash, key, (cmph_uint32)strlen(key)); + fprintf(stderr, "key:%s -- hash:%u\n", key, id); + i++; + } + + //Destroy hash + cmph_destroy(hash); + cmph_io_vector_adapter_destroy(source); + fclose(mphf_fd); + return 0; + } ++ +
+Download vector_adapter_ex1.c. This example does not work in versions below 0.6. You need to update the sources from GIT to make it work. +
+ ++ #include <cmph.h> + #include <stdio.h> + #include <string.h> + // Create minimal perfect hash function from in-disk keys using BDZ algorithm + int main(int argc, char **argv) + { + //Open file with newline separated list of keys + FILE * keys_fd = fopen("keys.txt", "r"); + cmph_t *hash = NULL; + if (keys_fd == NULL) + { + fprintf(stderr, "File \"keys.txt\" not found\n"); + exit(1); + } + // Source of keys + cmph_io_adapter_t *source = cmph_io_nlfile_adapter(keys_fd); + + cmph_config_t *config = cmph_config_new(source); + cmph_config_set_algo(config, CMPH_BDZ); + hash = cmph_new(config); + cmph_config_destroy(config); + + //Find key + const char *key = "jjjjjjjjjj"; + unsigned int id = cmph_search(hash, key, (cmph_uint32)strlen(key)); + fprintf(stderr, "Id:%u\n", id); + //Destroy hash + cmph_destroy(hash); + cmph_io_nlfile_adapter_destroy(source); + fclose(keys_fd); + return 0; + } ++ +
+Download file_adapter_ex2.c and keys.txt. This example does not work in versions below 0.8. You need to update the sources from GIT to make it work. +
++Click here to see more examples +
+ ++cmph is the name of both the library and the utility +application that comes with this package. You can use the cmph +application to construct minimal perfect hash functions from the command line. +The cmph utility +comes with a number of flags, but creating and querying +minimal perfect hash functions is very simple: +
+ ++ $ # Using the chm algorithm (default one) for constructing a mphf for keys in file keys_file + $ ./cmph -g keys_file + $ # Query id of keys in the file keys_query + $ ./cmph -m keys_file.mph keys_query ++ +
+The additional options let you set most of the parameters you have +available through the C API. Below you can see the full help message for the +utility. +
+ ++ usage: cmph [-v] [-h] [-V] [-k nkeys] [-f hash_function] [-g [-c algorithm_dependent_value][-s seed] ] + [-a algorithm] [-M memory_in_MB] [-b algorithm_dependent_value] [-t keys_per_bin] [-d tmp_dir] + [-m file.mph] keysfile + Minimum perfect hashing tool + + -h print this help message + -c c value determines: + * the number of vertices in the graph for the algorithms BMZ and CHM + * the number of bits per key required in the FCH algorithm + * the load factor in the CHD_PH algorithm + -a algorithm - valid values are + * bmz + * bmz8 + * chm + * brz + * fch + * bdz + * bdz_ph + * chd_ph + * chd + -f hash function (may be used multiple times) - valid values are + * jenkins + -V print version number and exit + -v increase verbosity (may be used multiple times) + -k number of keys + -g generation mode + -s random seed + -m minimum perfect hash function file + -M main memory availability (in MB) used in BRZ algorithm + -d temporary directory used in BRZ algorithm + -b the meaning of this parameter depends on the algorithm selected in the -a option: + * For BRZ it is used to make the maximal number of keys in a bucket lower than 256. + In this case its value should be an integer in the range [64,175]. Default is 128. + + * For BDZ it is used to determine the size of some precomputed rank + information and its value should be an integer in the range [3,10]. Default + is 7. The larger this value is, the more compact the resulting functions + are and the slower they are at evaluation time. + + * For CHD and CHD_PH it is used to set the average number of keys per bucket + and its value should be an integer in the range [1,32]. Default is 4. The + larger this value is, the slower the construction of the functions is. + This parameter has no effect for other algorithms. + + -t set the number of keys per bin for a t-perfect hashing function. A t-perfect + hash function allows at most t collisions in a given bin. This parameter applies + only to the CHD and CHD_PH algorithms. Its value should be an integer in the + range [1,128]. Default is 1 + keysfile line separated file with keys ++ +
+FAQ +
+ ++Use the github releases page at: https://github.com/bonitao/cmph/releases +
+ ++Code is under the LGPL and the MPL 1.1. +
+ ++Enjoy! +
+ + + + + ++Last Updated: Fri Dec 28 23:50:29 2018 +
+ + + + + + diff --git a/docs/newslog.html b/docs/newslog.html new file mode 100644 index 0000000..6b7c4e4 --- /dev/null +++ b/docs/newslog.html @@ -0,0 +1,142 @@ + + + + + ++Fixed a bug in the chd_pc algorithm and reorganized tests. +
+ ++This is a bugfix only version, after which a revamp of the cmph code and +algorithms will be done. +
+ +Home | CHD | BDZ | BMZ | CHM | BRZ | FCH |
+Enjoy! +
+ + + + + + + + + + diff --git a/gendocs b/gendocs index 7982818..c8d4c86 100755 --- a/gendocs +++ b/gendocs @@ -1,18 +1,18 @@ #!/bin/sh -txt2tags -t html --mask-email -i README.t2t -o index.html -txt2tags -t html -i CHD.t2t -o chd.html -txt2tags -t html -i BDZ.t2t -o bdz.html -txt2tags -t html -i BMZ.t2t -o bmz.html -txt2tags -t html -i BRZ.t2t -o brz.html -txt2tags -t html -i CHM.t2t -o chm.html -txt2tags -t html -i FCH.t2t -o fch.html -txt2tags -t html -i COMPARISON.t2t -o comparison.html -txt2tags -t html -i GPERF.t2t -o gperf.html -txt2tags -t html -i FAQ.t2t -o faq.html -txt2tags -t html -i CONCEPTS.t2t -o concepts.html -txt2tags -t html -i NEWSLOG.t2t -o newslog.html -txt2tags -t html -i EXAMPLES.t2t -o examples.html +txt2tags -t html --mask-email -i README.t2t -o docs/index.html +txt2tags -t html -i CHD.t2t -o docs/chd.html +txt2tags -t html -i BDZ.t2t -o docs/bdz.html +txt2tags -t html -i BMZ.t2t -o docs/bmz.html +txt2tags -t html -i BRZ.t2t -o docs/brz.html +txt2tags -t html -i CHM.t2t -o docs/chm.html +txt2tags -t html -i FCH.t2t -o docs/fch.html +txt2tags -t html -i COMPARISON.t2t -o docs/comparison.html +txt2tags -t html -i GPERF.t2t -o docs/gperf.html +txt2tags -t html -i FAQ.t2t -o docs/faq.html +txt2tags -t html -i CONCEPTS.t2t -o docs/concepts.html +txt2tags -t html -i NEWSLOG.t2t -o docs/newslog.html +txt2tags -t html -i EXAMPLES.t2t -o docs/examples.html txt2tags -t txt --mask-email -i README.t2t -o README txt2tags -t txt -i CHD.t2t -o CHD