From 9abc48f91ca7070223fa38ae2f83bb19e5a8d81d Mon Sep 17 00:00:00 2001 From: fc_botelho Date: Mon, 31 Jan 2005 18:50:58 +0000 Subject: [PATCH] BMZ documentation was finished --- BMZ.t2t | 309 ++++++++++++++++++++++++++++++++++++++++++------- CHM.t2t | 24 +++- COMPARISON.t2t | 105 +++++++++++++++-- CONCEPTS.t2t | 56 +++++++++ CONFIG.t2t | 42 +++++++ DOC.css | 33 ++++++ FAQ.t2t | 3 +- GPERF.t2t | 3 +- LOGO.t2t | 1 + README.t2t | 7 +- TABLE1.t2t | 76 ++++++++++++ TABLE4.t2t | 109 +++++++++++++++++ TABLE5.t2t | 46 ++++++++ 13 files changed, 753 insertions(+), 61 deletions(-) create mode 100644 CONCEPTS.t2t create mode 100644 DOC.css create mode 100644 LOGO.t2t create mode 100644 TABLE1.t2t create mode 100644 TABLE4.t2t create mode 100644 TABLE5.t2t diff --git a/BMZ.t2t b/BMZ.t2t index 37e3101..08c8ce2 100644 --- a/BMZ.t2t +++ b/BMZ.t2t @@ -9,15 +9,17 @@ BMZ Algorithm At the end of 2003, professor [Nivio Ziviani http://www.dcc.ufmg.br/~nivio] was finishing the second edition of his [book http://www.dcc.ufmg.br/algoritmos/]. During the [book http://www.dcc.ufmg.br/algoritmos/] writing, -professor [Nivio Ziviani http://www.dcc.ufmg.br/~nivio] studied the problem of generating minimal perfect hash -functions (if you are not familiarized with this problem, see [1][2]). +professor [Nivio Ziviani http://www.dcc.ufmg.br/~nivio] studied the problem of generating +[minimal perfect hash functions concepts.html] +(if you are not familiar with this problem, see [[1 #papers]][[2 #papers]]). Professor [Nivio Ziviani http://www.dcc.ufmg.br/~nivio] coded a modified version of the [CHM algorithm chm.html], which was proposed by -Czech, Havas and Majewski and put it in his [book http://www.dcc.ufmg.br/algoritmos/]. -The [CHM algorithm chm.html] is based on acyclic random graphs to generate order preserving -minimal perfect hash functions in linear time. Professor [Nivio Ziviani http://www.dcc.ufmg.br/~nivio] +Czech, Havas and Majewski, and put it in his [book http://www.dcc.ufmg.br/algoritmos/]. +The [CHM algorithm chm.html] is based on acyclic random graphs to generate +[order preserving minimal perfect hash functions concepts.html] in linear time. +Professor [Nivio Ziviani http://www.dcc.ufmg.br/~nivio] asked himself: why must the random graph -be acyclic? In the modified version availalbe in his [book http://www.dcc.ufmg.br/algoritmos/] he got rid of such restriction. +be acyclic? In the modified version available in his [book http://www.dcc.ufmg.br/algoritmos/] he got rid of this restriction. The modification presented a problem: it was impossible to generate minimal perfect hash functions for sets with more than 1000 keys. @@ -32,19 +34,38 @@ During the master, [Fabiano http://www.dcc.ufmg.br/~fbotelho] and In April 2004, [Fabiano http://www.dcc.ufmg.br/~fbotelho] was talking with a friend of his (David Menoti) about the problems, and many ideas appeared. -The ideas were implemented and we noticed that a very fast algorithm to generate +The ideas were implemented and a very fast algorithm to generate minimal perfect hash functions had been designed. -We refer the algorithm to as **BMZ**, because it was conceived by Fabiano C. **B**otelho -David **M**enoti and Nivio **Z**iviani. The algorithm is described in [1]. +We refer to the algorithm as **BMZ**, because it was conceived by Fabiano C. **B**otelho, +David **M**enoti and Nivio **Z**iviani. The algorithm is described in [[1 #papers]].
To analyse BMZ algorithm we needed some results from the random graph theory, so we invite professor [Yoshiharu Kohayakawa http://www.ime.usp.br/~yoshi] to help us. -The final description and analysis of BMZ algorithm is presented in [2]. +The final description and analysis of BMZ algorithm is presented in [[2 #papers]]. ---------------------------------------- ==The Algorithm== -Let us show how the minimal perfect hash function [figs/img7.png] will be constructed. +The BMZ algorithm shares several features with the [CHM algorithm chm.html]. +In particular, BMZ algorithm is also +based on the generation of random graphs [figs/img27.png], where [figs/img28.png] is in +one-to-one correspondence with the key set [figs/img20.png] for which we wish to +generate a [minimal perfect hash function concepts.html]. +The two main differences between BMZ algorithm and CHM algorithm +are as follows: (//i//) BMZ algorithm generates random +graphs [figs/img27.png] with [figs/img29.png] and [figs/img30.png], where [figs/img31.png], +and hence [figs/img32.png] necessarily contains cycles, +while CHM algorithm generates //acyclic// random +graphs [figs/img27.png] with [figs/img29.png] and [figs/img30.png], +with a greater number of vertices: [figs/img33.png]; +(//ii//) CHM algorithm generates [order preserving minimal perfect hash functions concepts.html] +while BMZ algorithm does not preserve order. Thus, BMZ algorithm improves +the space requirement at the expense of generating functions that are not +order preserving. + +Suppose [figs/img14.png] is a universe of //keys//. +Let [figs/img17.png] be a set of [figs/img8.png] keys from [figs/img14.png]. +Let us show how the BMZ algorithm constructs a minimal perfect hash function [figs/img7.png]. We make use of two auxiliary random functions [figs/img41.png] and [figs/img55.png], where [figs/img56.png] for some suitably chosen integer [figs/img57.png], where [figs/img58.png].We build a random graph [figs/img59.png] on [figs/img60.png], @@ -54,7 +75,7 @@ key in the set of keys [figs/img20.png]. In what follows, we shall be interested in the //2-core// of the random graph [figs/img32.png], that is, the maximal subgraph of [figs/img32.png] with minimal degree at -least 2 (see, e.g., [2] for details). +least 2 (see [[2 #papers]] for details). Because of its importance in our context, we call the 2-core the //critical// subgraph of [figs/img32.png] and denote it by [figs/img63.png]. The vertices and edges in [figs/img63.png] are said to be //critical//. @@ -65,7 +86,7 @@ We also let [figs/img67.png] be the set of all critical vertices that have at least one non-critical vertex as a neighbour. Let [figs/img68.png] be the set of //non-critical// edges in [figs/img32.png]. Finally, we let [figs/img69.png] be the //non-critical// subgraph -of [figs/img32.png. +of [figs/img32.png]. The non-critical subgraph [figs/img70.png] corresponds to the //acyclic part// of [figs/img32.png]. We have [figs/img71.png]. @@ -74,33 +95,222 @@ We then construct a suitable labelling [figs/img72.png] of the vertices of [figs/img32.png]: we choose [figs/img73.png] for each [figs/img74.png] in such a way that [figs/img75.png] ([figs/img18.png]) is a minimal perfect hash function for [figs/img20.png]. -We will see later on that this labelling [figs/img37.png] can be found in linear time -if the number of edges in [figs/img63.png] is at most [figs/img76.png]. 
+This labelling [figs/img37.png] can be found in linear time +if the number of edges in [figs/img63.png] is at most [figs/img76.png] (see [[2 #papers]] +for details). -Figure 2 presents a pseudo code for the algorithm. -The procedure GenerateMPHF ([figs/img20.png], [figs/img37.png]) receives as input the set of +Figure 1 presents a pseudo code for the BMZ algorithm. +The procedure BMZ ([figs/img20.png], [figs/img37.png]) receives as input the set of keys [figs/img20.png] and produces the labelling [figs/img37.png]. The method uses a mapping, ordering and searching approach. We now describe each step. -| procedure GenerateMPHF ([figs/img20.png], [figs/img37.png]) -|     Mapping ([figs/img20.png], [figs/img32.png]); -|     Ordering ([figs/img32.png], [figs/img63.png], [figs/img70.png]); -|     Searching ([figs/img32.png], [figs/img63.png], [figs/img70.png], [figs/img37.png]); -**Figure 2**: Main steps of the algorithm for constructing a minimal perfect hash function - -===Mapping Step=== - -===Ordering Step=== - -===Searching Step=== - -====Assignment of Values to Critical Vertices==== - -====Assignment of Values to Non-Critical Vertices==== + | procedure BMZ ([figs/img20.png], [figs/img37.png]) + |     Mapping ([figs/img20.png], [figs/img32.png]); + |     Ordering ([figs/img32.png], [figs/img63.png], [figs/img70.png]); + |     Searching ([figs/img32.png], [figs/img63.png], [figs/img70.png], [figs/img37.png]); + | **Figure 1**: Main steps of BMZ algorithm for constructing a minimal perfect hash function ---------------------------------------- -==The Heuristic== +===Mapping Step=== + +The procedure Mapping ([figs/img20.png], [figs/img32.png]) receives as input the set +of keys [figs/img20.png] and generates the random graph [figs/img59.png], by generating +two auxiliary functions [figs/img41.png], [figs/img78.png]. + +The functions [figs/img41.png] and [figs/img42.png] are constructed as follows. +We impose some upper bound [figs/img79.png] on the lengths of the keys in [figs/img20.png]. +To define [figs/img80.png] ([figs/img81.png], [figs/img62.png]), we generate +an [figs/img82.png] table of random integers [figs/img83.png]. +For a key [figs/img18.png] of length [figs/img84.png] and [figs/img85.png], we let + + | [figs/img86.png] + +The random graph [figs/img59.png] has vertex set [figs/img56.png] and +edge set [figs/img61.png]. We need [figs/img32.png] to be +simple, i.e., [figs/img32.png] should have neither loops nor multiple edges. +A loop occurs when [figs/img87.png] for some [figs/img18.png]. +We solve this in an ad hoc manner: we simply let [figs/img88.png] in this case. +If we still find a loop after this, we generate another pair [figs/img89.png]. +When a multiple edge occurs we abort and generate a new pair [figs/img89.png]. +Although the function above causes [collisions concepts.html] with probability //1/t//, +in [cmph library index.html] we use faster hash +functions ([DJB2 hash http://], [FNV hash http://], [Jenkins hash http://] +and [SDBM hash http://]) in which we do not need to impose any upper bound [figs/img79.png] on the lengths of the keys in [figs/img20.png]. + +As mentioned before, for us to find the labelling [figs/img72.png] of the +vertices of [figs/img59.png] in linear time, +we require that [figs/img108.png]. +The crucial step now is to determine the value +of [figs/img1.png] (in [figs/img57.png]) to obtain a random +graph [figs/img71.png] with [figs/img109.png]. 
+Botelho, Menoti and Ziviani determined empirically in [[1 #papers]] that +the value of [figs/img1.png] is //1.15//. This value is remarkably +close to the theoretical value determined in [[2 #papers]], +which is around [figs/img112.png]. + +---------------------------------------- + +===Ordering Step=== + +The procedure Ordering ([figs/img32.png], [figs/img63.png], [figs/img70.png]) receives +as input the graph [figs/img32.png] and partitions [figs/img32.png] into the two +subgraphs [figs/img63.png] and [figs/img70.png], so that [figs/img71.png]. + +Figure 2 presents a sample graph with 9 vertices +and 8 edges, where the degree of a vertex is shown beside each vertex. +Initially, all vertices with degree 1 are added to a queue [figs/img136.png]. +For the example shown in Figure 2(a), [figs/img137.png] after the initialization step. + + | [figs/img138.png] + | **Figure 2:** Ordering step for a graph with 9 vertices and 8 edges. + +Next, we remove one vertex [figs/img139.png] from the queue, decrement its degree and +the degrees of the vertices with degree greater than 0 in the adjacency +list of [figs/img139.png], as depicted in Figure 2(b) for [figs/img140.png]. +At this point, the adjacencies of [figs/img139.png] with degree 1 are +inserted into the queue, such as vertex 1. +This process is repeated until the queue becomes empty. +All vertices with degree 0 are non-critical vertices and the others are +critical vertices, as depicted in Figure 2(c). +Finally, to determine the vertices in [figs/img141.png] we collect all +vertices [figs/img142.png] with at least one vertex [figs/img143.png] that +is in Adj[figs/img144.png] and in [figs/img145.png], as is the case for vertex 8 in Figure 2(c). + +---------------------------------------- + +===Searching Step=== + +In the searching step, the key part is +the //perfect assignment problem//: find [figs/img153.png] such that +the function [figs/img154.png] defined by + + | [figs/img155.png] + +is a bijection from [figs/img156.png] to [figs/img157.png] (recall [figs/img158.png]). +We are interested in a labelling [figs/img72.png] of +the vertices of the graph [figs/img59.png] with +the property that if [figs/img11.png] and [figs/img22.png] are keys +in [figs/img20.png], then [figs/img159.png]; that is, if we associate +to each edge the sum of the labels on its endpoints, then these values +should all be distinct. +Moreover, we require that all the sums [figs/img160.png] ([figs/img18.png]) +fall between [figs/img115.png] and [figs/img161.png], and thus we have a bijection +between [figs/img20.png] and [figs/img157.png]. + +The procedure Searching ([figs/img32.png], [figs/img63.png], [figs/img70.png], [figs/img37.png]) +receives as input [figs/img32.png], [figs/img63.png], [figs/img70.png] and finds a +suitable [figs/img162.png] bit value for each vertex [figs/img74.png], stored in the +array [figs/img37.png]. +This step is first performed for the vertices in the +critical subgraph [figs/img63.png] of [figs/img32.png] (the 2-core of [figs/img32.png]) +and then it is performed for the vertices in [figs/img70.png] (the non-critical subgraph +of [figs/img32.png] that contains the "acyclic part" of [figs/img32.png]). +The reason the assignment of the [figs/img37.png] values is first +performed on the vertices in [figs/img63.png] is to resolve reassignments +as early as possible (such reassignments are consequences of the cycles +in [figs/img63.png] and are depicted hereinafter).
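+To make the perfect assignment problem concrete, the sketch below shows how the resulting function could be evaluated once a suitable labelling [figs/img37.png] has been found. It is only an illustration, not the cmph API: the identifiers are invented, the auxiliary functions follow the table-based construction given in the mapping step (with an upper bound on key lengths), and the returned value is the sum of the two labels, which the searching step forces to be a distinct number between //0// and //n-1// for each key.
```
/* Illustrative sketch only -- not the cmph API.  The auxiliary functions
 * follow the table-based construction of the mapping step, and the final
 * value g[h1(x)] + g[h2(x)] is what the searching step turns into a
 * bijection onto {0, ..., n-1}.  g and t are assumed to be filled in by
 * the construction described in the text. */
#include <string.h>

#define MAX_KEY_LEN 64                       /* upper bound L on key lengths */

static unsigned int T[2][MAX_KEY_LEN][256];  /* tables of random integers    */
static unsigned int *g;                      /* labelling found by searching */
static unsigned int t;                       /* number of vertices, t = cn   */

static unsigned int aux_hash(int i, const char *x)
{
    unsigned int sum = 0;
    size_t j, len = strlen(x);

    for (j = 0; j < len && j < MAX_KEY_LEN; j++)
        sum += T[i][j][(unsigned char)x[j]];
    return sum % t;                          /* a vertex of the random graph */
}

unsigned int mphf(const char *x)             /* value in {0, ..., n-1}       */
{
    return g[aux_hash(0, x)] + g[aux_hash(1, x)];
}
```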
+ +---------------------------------------- + +====Assignment of Values to Critical Vertices==== + +The labels [figs/img73.png] ([figs/img142.png]) +are assigned in increasing order following a greedy +strategy where the critical vertices [figs/img139.png] are considered one at a time, +according to a breadth-first search on [figs/img63.png]. +If a candidate value [figs/img11.png] for [figs/img73.png] is forbidden +because setting [figs/img163.png] would create two edges with the same sum, +we try [figs/img164.png] for [figs/img73.png]. This fact is referred to +as a //reassignment//. + +Let [figs/img165.png] be the set of addresses assigned to edges in [figs/img166.png]. +Initially [figs/img167.png]. +Let [figs/img11.png] be a candidate value for [figs/img73.png]. +Initially [figs/img168.png]. +Considering the subgraph [figs/img63.png] in Figure 2(c), +a step by step example of the assignment of values to vertices in [figs/img63.png] is +presented in Figure 3. +Initially, a vertex [figs/img139.png] is chosen, the assignment [figs/img163.png] is made +and [figs/img11.png] is set to [figs/img164.png]. +For example, suppose that vertex [figs/img169.png] in Figure 3(a) is +chosen, the assignment [figs/img170.png] is made and [figs/img11.png] is set to [figs/img96.png]. + + | [figs/img171.png] + | **Figure 3:** Example of the assignment of values to critical vertices. + +In Figure 3(b), following the adjacent list of vertex [figs/img169.png], +the unassigned vertex [figs/img115.png] is reached. +At this point, we collect in the temporary variable [figs/img172.png] all adjacencies +of vertex [figs/img115.png] that have been assigned an [figs/img11.png] value, +and [figs/img173.png]. +Next, for all [figs/img174.png], we check if [figs/img175.png]. +Since [figs/img176.png], then [figs/img177.png] is set +to [figs/img96.png], [figs/img11.png] is incremented +by 1 (now [figs/img178.png]) and [figs/img179.png]. +Next, vertex [figs/img180.png] is reached, [figs/img181.png] is set +to [figs/img62.png], [figs/img11.png] is set to [figs/img180.png] and [figs/img182.png]. +Next, vertex [figs/img183.png] is reached and [figs/img184.png]. +Since [figs/img185.png] and [figs/img186.png], then [figs/img187.png] is +set to [figs/img180.png], [figs/img11.png] is set to [figs/img183.png] and [figs/img188.png]. +Finally, vertex [figs/img189.png] is reached and [figs/img190.png]. +Since [figs/img191.png], [figs/img11.png] is incremented by 1 and set to 5, as depicted in +Figure 3(c). +Since [figs/img192.png], [figs/img11.png] is again incremented by 1 and set to 6, +as depicted in Figure 3(d). +These two reassignments are indicated by the arrows in Figure 3. +Since [figs/img193.png] and [figs/img194.png], then [figs/img195.png] is set +to [figs/img196.png] and [figs/img197.png]. This finishes the algorithm. + +---------------------------------------- + +====Assignment of Values to Non-Critical Vertices==== + +As [figs/img70.png] is acyclic, we can impose the order in which addresses are +associated with edges in [figs/img70.png], making this step simple to solve +by a standard depth first search algorithm. +Therefore, in the assignment of values to vertices in [figs/img70.png] we +benefit from the unused addresses in the gaps left by the assignment of values +to vertices in [figs/img63.png]. +For that, we start the depth-first search from the vertices in [figs/img141.png] because +the [figs/img37.png] values for these critical vertices were already assigned +and cannot be changed. 
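+In code, this step might look roughly as follows. The sketch is a simplified illustration and not the cmph source: the graph is kept in the edges/first/next representation described later in the memory consumption section, the identifiers are invented, and each vertex reached by the depth-first search through an edge with an already assigned endpoint receives the next unused address minus the [figs/img37.png] value of that endpoint, so the edge gets exactly that address.
```
/* Simplified illustration, not the cmph source code.  The graph is stored
 * as described in the memory consumption section: edges has 2n entries,
 * edges[2e] and edges[2e+1] are the endpoints of edge e, first[v] starts
 * the list of incidences i with edges[i] == v, next[i] links them, and -1
 * marks the end of a list.  The other endpoint of incidence i is then
 * edges[i ^ 1]. */
typedef struct {
    int *edges;   /* 2n entries: the endpoints of each edge                */
    int *first;   /* first incidence of each vertex, or -1                 */
    int *next;    /* next incidence in the list, or -1                     */
} graph_t;

/* Assign g values to the acyclic part: vertex u, reached from an already
 * assigned vertex v, gets g[u] = addr - g[v], where addr is the next
 * address not used by the critical step, so edge {v, u} receives addr.
 * Precondition: visited[] is already set for all critical vertices. */
static void assign_non_critical(const graph_t *G, int *g, char *visited,
                                const int *unused_addr, int *k, int v)
{
    visited[v] = 1;
    for (int i = G->first[v]; i != -1; i = G->next[i]) {
        int u = G->edges[i ^ 1];              /* other endpoint of this edge */
        if (visited[u])
            continue;
        g[u] = unused_addr[(*k)++] - g[v];    /* reuse a gap left by G_crit  */
        assign_non_critical(G, g, visited, unused_addr, k, u);
    }
}
```
In the worked example that follows, such a search would start from vertex 8, the only critical vertex with non-critical neighbours.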
+ +Considering the subgraph [figs/img70.png] in Figure 2(c), +a step by step example of the assignment of values to vertices in [figs/img70.png] is +presented in Figure 4. +Figure 4(a) presents the initial state of the algorithm. +The critical vertex 8 is the only one that has non-critical vertices as +adjacent. +In the example presented in Figure 3, the addresses [figs/img198.png] were not used. +So, taking the first unused address [figs/img115.png] and the vertex [figs/img96.png], +which is reached from the vertex [figs/img169.png], [figs/img199.png] is set +to [figs/img200.png], as shown in Figure 4(b). +The only vertex that is reached from vertex [figs/img96.png] is vertex [figs/img62.png], so +taking the unused address [figs/img183.png] we set [figs/img201.png] to [figs/img202.png], +as shown in Figure 4(c). +This process is repeated until the UnAssignedAddresses list becomes empty. + + | [figs/img203.png] + | **Figure 4:** Example of the assignment of values to non-critical vertices. + +---------------------------------------- + +==The Heuristic==[heuristic] + +We now present an heuristic for BMZ algorithm that +reduces the value of [figs/img1.png] to any given value between //1.15// and //0.93//. +This reduces the space requirement to store the resulting function +to any given value between [figs/img12.png] words and [figs/img13.png] words. +The heuristic reuses, when possible, the set +of [figs/img11.png] values that caused reassignments, just before +trying [figs/img164.png]. +Decreasing the value of [figs/img1.png] leads to an increase in the number of +iterations to generate [figs/img32.png]. +For example, for [figs/img244.png] and [figs/img6.png], the analytical expected number +of iterations are [figs/img245.png] and [figs/img246.png], respectively (see [[2 #papers]] +for details), +while for [figs/img128.png] the same value is around //2.13//. ---------------------------------------- @@ -121,9 +331,10 @@ following: of 4 bytes that represent the vertices. As there are //n// edges, the vector edges is stored in //8n// bytes. - + **next**: given a vertex //v//, we can discover the edges that contain //v// - following its list of edges, which starts on first[//v//] and the next - edges are given by next[...first[//v//]...]. Therefore, the vectors first and next represent + + **next**: given a vertex [figs/img139.png], we can discover the edges that + contain [figs/img139.png] following its list of edges, + which starts on first[[figs/img139.png]] and the next + edges are given by next[...first[[figs/img139.png]]...]. Therefore, the vectors first and next represent the linked lists of edges of each vertex. As there are two vertices for each edge, when an edge is iserted in the graph, it must be inserted in the two linked lists of the vertices in its composition. Therefore, there are //2n// entries of integer @@ -140,8 +351,8 @@ following: - Other auxiliary structures + **queue**: is a queue of integer numbers used in the breadth-first search of the assignment of values to critical vertices. There is an entry in the queue for - each two critical vertices. Let //|Vcrit|// be the expected number of critical - vertices. Therefore, the queue is stored in //4*0.5*|Vcrit|=2|Vcrit|//. + each two critical vertices. Let [figs/img110.png] be the expected number of critical + vertices. Therefore, the queue is stored in //4*0.5*[figs/img110.png]=2[figs/img110.png]//. + **visited**: is a vector of //cn// bits, where each bit indicates if the g value of a given vertex was already defined. 
Therefore, the vector visited is stored @@ -153,12 +364,15 @@ following: Thus, the total memory consumption of BMZ algorithm for generating a minimal -perfect hash function (MPHF) is: //(8.25c + 16.125)n +2|Vcrit| + O(1)// bytes. +perfect hash function (MPHF) is: //(8.25c + 16.125)n +2[figs/img110.png] + O(1)// bytes. As the value of constant //c// may be 1.15 and 0.93 we have: - || //c// | //|Vcrit|// | Memory consumption to generate a MPHF | + || //c// | [figs/img110.png] | Memory consumption to generate a MPHF | | 0.93 | //0.497n// | //24.80n + O(1)// | | 1.15 | //0.401n// | //26.42n + O(1)// | -The values of |Vcrit| were calculated using Eq.(1) presented in [2]. + + | **Table 1:** Memory consumption to generate a MPHF using the BMZ algorithm. + +The values of [figs/img110.png] were calculated using Eq.(1) presented in [[2 #papers]]. Now we present the memory consumption to store the resulting function. We only need to store the //g// function. Thus, we need //4cn// bytes. @@ -166,10 +380,17 @@ Again we have: || //c// | Memory consumption to store a MPHF | | 0.93 | //3.72n// | | 1.15 | //4.60n// | - + + | **Table 2:** Memory consumption to store a MPHF generated by the BMZ algorithm. ---------------------------------------- -==Papers== +==Experimental Results== + +[CHM x BMZ comparison.html] + +---------------------------------------- + +==Papers==[papers] + [F. C. Botelho http://www.dcc.ufmg.br/~fbotelho], D. Menoti, [N. Ziviani http://www.dcc.ufmg.br/~nivio]. [A New algorithm for constructing minimal perfect hash functions papers/bmz_tr004_04.ps], Technical Report TR004/04, Department of Computer Science, Federal University of Minas Gerais, 2004. @@ -177,7 +398,7 @@ Again we have: ---------------------------------------- -[Home index.html] + | [Home index.html] | [CHM chm.html] | [BMZ bmz.html] ---------------------------------------- %!include: FOOTER.t2t diff --git a/CHM.t2t b/CHM.t2t index 8859eff..e3090dc 100644 --- a/CHM.t2t +++ b/CHM.t2t @@ -4,8 +4,11 @@ CHM Algorithm %!includeconf: CONFIG.t2t ---------------------------------------- + ==The Algorithm== +---------------------------------------- + ==Memory Consumption== Now we detail the memory consumption to generate and to store minimal perfect hash functions @@ -23,9 +26,11 @@ following: of 4 bytes that represent the vertices. As there are //n// edges, the vector edges is stored in //8n// bytes. - + **next**: given a vertex //v//, we can discover the edges that contain //v// - following its list of edges, which starts on first[//v//] and the next - edges are given by next[...first[//v//]...]. Therefore, the vectors first and next represent + + **next**: given a vertex [figs/img139.png], we can discover the edges that + contain [figs/img139.png] following its list of edges, which starts on + first[[figs/img139.png]] and the next + edges are given by next[...first[[figs/img139.png]]...]. Therefore, + the vectors first and next represent the linked lists of edges of each vertex. As there are two vertices for each edge, when an edge is iserted in the graph, it must be inserted in the two linked lists of the vertices in its composition. Therefore, there are //2n// entries of integer @@ -47,12 +52,23 @@ As the value of constant //c// must be at least 2.09 we have: || //c// | Memory consumption to generate a MPHF | | 2.09 | //33.00n + O(1)// | + | **Table 1:** Memory consumption to generate a MPHF using the CHM algorithm. + Now we present the memory consumption to store the resulting function. 
We only need to store the //g// function. Thus, we need //4cn// bytes. Again we have: || //c// | Memory consumption to store a MPHF | | 2.09 | //8.36n// | + | **Table 2:** Memory consumption to store a MPHF generated by the CHM algorithm. + +---------------------------------------- + +==Experimental Results== + +[CHM x BMZ comparison.html] + +---------------------------------------- ==Papers== @@ -66,7 +82,7 @@ Again we have: ---------------------------------------- -[Home index.html] + | [Home index.html] | [CHM chm.html] | [BMZ bmz.html] ---------------------------------------- %!include: FOOTER.t2t diff --git a/COMPARISON.t2t b/COMPARISON.t2t index a6ff823..1a6e328 100644 --- a/COMPARISON.t2t +++ b/COMPARISON.t2t @@ -5,17 +5,106 @@ Comparison Between BMZ And CHM Algorithms ---------------------------------------- -==Features== +==Characteristics== +Table 1 presents the main characteristics of the two algorithms. +The number of edges in the graph [figs/img27.png] is [figs/img236.png], +the number of keys in the input set [figs/img20.png]. +The number of vertices of [figs/img32.png] is equal +to [figs/img12.png] and [figs/img237.png] for BMZ algorithm and the CHM algorithm, respectively. +This measure is related to the amount of space to store the array [figs/img37.png]. +This reduces the space required to store a function generated by BMZ algorithm to [figs/img238.png] of the space required by the CHM algorithm. +The number of critical edges is [figs/img76.png] and 0 for BMZ algorithm and the CHM algorithm, +respectively. +BMZ algorithm generates random graphs that necessarily contain cycles, while the +CHM algorithm +generates +acyclic random graphs. +Finally, the CHM algorithm generates [order preserving functions concepts.html] +while BMZ algorithm does not preserve order. -==Constructing Minimal Perfect Hash Functions== - -==Memory Consumption== - - -==Run times== +%!include(html): ''TABLE1.t2t'' + | **Table 1:** Main characteristics of the algorithms. ---------------------------------------- -[Home index.html] + +==Memory Consumption== + +- Memory consumption to generate the minimal perfect hash function (MPHF): + || Algorithm | //c// | Memory consumption to generate a MPHF | + | BMZ | 0.93 | //24.80n + O(1)// | + | BMZ | 1.15 | //26.42n + O(1)// | + | CHM | 2.09 | //33.00n + O(1)// | + + | **Table 2:** Memory consumption to generate a MPHF using the algorithms BMZ and CHM. + +- Memory consumption to store the resulting minimal perfect hash function (MPHF): + || Algorithm | //c// | Memory consumption to store a MPHF | + | BMZ | 0.93 | //3.72n// | + | BMZ | 1.15 | //4.60n// | + | CHM | 2.09 | //8.36n// | + + | **Table 3:** Memory consumption to store a MPHF generated by the algorithms BMZ and CHM. + +---------------------------------------- + +==Run times== +We now present some experimental results to compare the BMZ and CHM algorithms. +The data consists of a collection of 100 million uniform resource locators +(URLs) collected from the Web. +The average length of a URL in the collection is 63 bytes. +All experiments were carried out on +a computer running the Linux operating system, version 2.6.7, +with a 2.4 gigahertz processor and +4 gigabytes of main memory. + +Table 4 presents time measurements. +All times are in seconds. +The table entries represent averages over 50 trials. +The column labelled as [figs/img243.png] represents +the number of iterations to generate the random graph [figs/img32.png] in the +mapping step of the algorithms.
+The next columns represent the run times +for the mapping plus ordering steps together and the searching +step for each algorithm. +The last column represents the percent gain of our algorithm +over the CHM algorithm. + +%!include(html): ''TABLE4.t2t'' + | **Table 4:** Time measurements for BMZ and the CHM algorithm. + +The mapping step of the BMZ algorithm is faster because +the expected numbers of iterations in the mapping step to generate [figs/img32.png] are +2.13 and 2.92 for BMZ algorithm and the CHM algorithm, respectively +(see [[2 bmz.html#papers]] for details). +The graph [figs/img32.png] generated by BMZ algorithm +has [figs/img12.png] vertices, against [figs/img237.png] for the CHM algorithm. +These two facts make BMZ algorithm faster in the mapping step. +The time of the ordering step of BMZ algorithm is approximately equal to +the time to check whether [figs/img32.png] is acyclic in the CHM algorithm. +The searching step of the CHM algorithm is faster, but the total +time of BMZ algorithm is, on average, approximately 59% faster +than the CHM algorithm. +It is important to notice the times for the searching step: +for both algorithms they are not the dominant times, +and the experimental results clearly show +a linear behavior for the searching step. + +We now present run times for BMZ algorithm using a [heuristic bmz.html#heuristic] that +reduces the space requirement +to any given value between [figs/img12.png] words and [figs/img13.png] words. +For example, for [figs/img244.png] and [figs/img6.png], the analytical expected numbers +of iterations are [figs/img245.png] and [figs/img246.png], respectively +(for [figs/img247.png], the numbers of iterations are 2.78 for [figs/img244.png] and 3.04 +for [figs/img6.png]). +Table 5 presents the total times to construct a +function for [figs/img247.png], with an increase from [figs/img248.png] seconds +for [figs/img128.png] (see Table 4) to [figs/img249.png] seconds for [figs/img244.png] and +to [figs/img250.png] seconds for [figs/img6.png]. + +%!include(html): ''TABLE5.t2t'' + | **Table 5:** Time measurements for the tuned BMZ algorithm with [figs/img5.png] and [figs/img6.png]. +---------------------------------------- + | [Home index.html] | [CHM chm.html] | [BMZ bmz.html] ---------------------------------------- %!include: FOOTER.t2t diff --git a/CONCEPTS.t2t b/CONCEPTS.t2t new file mode 100644 index 0000000..49bedde --- /dev/null +++ b/CONCEPTS.t2t @@ -0,0 +1,56 @@ +Minimal Perfect Hash Functions - Introduction + + +%!includeconf: CONFIG.t2t + +---------------------------------------- +==Basic Concepts== + +Suppose [figs/img14.png] is a universe of //keys//. +Let [figs/img15.png] be a //hash function// that maps the keys from [figs/img14.png] to a given interval of integers [figs/img16.png]. +Let [figs/img17.png] be a set of [figs/img8.png] keys from [figs/img14.png]. +Given a key [figs/img18.png], the hash function [figs/img7.png] computes an +integer in [figs/img19.png] for the storage or retrieval of [figs/img11.png] in +a //hash table//. +Hashing methods for //non-static sets// of keys can be used to construct +data structures storing [figs/img20.png] and supporting membership queries +"[figs/img18.png]?" in expected time [figs/img21.png]. +However, they involve a certain amount of wasted space owing to unused +locations in the table and wasted time to resolve collisions when +two keys are hashed to the same table location.
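+As a small illustration of such a collision (the keys and the table size below are made up for the example), consider a textbook modular hash function:
```
/* Purely illustrative: an ordinary (non-perfect) modular hash.  With a
 * table of size m = 10, the keys 17 and 37 both map to slot 7, so a
 * conventional hash table must resolve the collision and typically also
 * leaves some of its slots unused. */
#include <stdio.h>

static unsigned int h(unsigned int key, unsigned int m)
{
    return key % m;
}

int main(void)
{
    printf("h(17) = %u\n", h(17, 10));   /* prints 7            */
    printf("h(37) = %u\n", h(37, 10));   /* prints 7: collision */
    return 0;
}
```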
+ +For //static sets// of keys it is possible to compute a function +to find any key in a table in one probe; such hash functions are called +//perfect//. +More precisely, given a set of keys [figs/img20.png], we shall say that a +hash function [figs/img15.png] is a //perfect hash function// +for [figs/img20.png] if [figs/img7.png] is an injection on [figs/img20.png], +that is, there are no //collisions// among the keys in [figs/img20.png]: +if [figs/img11.png] and [figs/img22.png] are in [figs/img20.png] and [figs/img23.png], +then [figs/img24.png]. +Figure 1(a) illustrates a perfect hash function. +Since no collisions occur, each key can be retrieved from the table +with a single probe. +If [figs/img25.png], that is, the table has the same size as [figs/img20.png], +then we say that [figs/img7.png] is a //minimal perfect hash function// +for [figs/img20.png]. +Figure 1(b) illustrates a minimal perfect hash function. +Minimal perfect hash functions totally avoid the problem of wasted +space and time. A perfect hash function [figs/img7.png] is //order preserving// +if the keys in [figs/img20.png] are arranged in some given order +and [figs/img7.png] preserves this order in the hash table. + + | [figs/img26.png] + | **Figure 1:** (a) Perfect hash function. (b) Minimal perfect hash function. + +Minimal perfect hash functions are widely used for memory efficient +storage and fast retrieval of items from static sets, such as words in natural +languages, reserved words in programming languages or interactive systems, +universal resource locations (URLs) in Web search engines, or item sets in +data mining techniques. + +---------------------------------------- + | [Home index.html] | [CHM chm.html] | [BMZ bmz.html] +---------------------------------------- + +%!include: FOOTER.t2t diff --git a/CONFIG.t2t b/CONFIG.t2t index 19dd4e9..cf391eb 100644 --- a/CONFIG.t2t +++ b/CONFIG.t2t @@ -1,4 +1,46 @@ +%! style(html): DOC.css %! PreProc(html): '^%html% ' '' %! PreProc(txt): '^%txt% ' '' %! PostProc(html): "&" "&" %! PostProc(txt): " " " " +%! PostProc(html): 'ALIGN="middle" SRC="figs/img7.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img7.png"\1>' +%! PostProc(html): 'ALIGN="middle" SRC="figs/img57.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img57.png"\1>' +%! PostProc(html): 'ALIGN="middle" SRC="figs/img32.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img32.png"\1>' +%! PostProc(html): 'ALIGN="middle" SRC="figs/img20.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img20.png"\1>' +%! PostProc(html): 'ALIGN="middle" SRC="figs/img60.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img60.png"\1>' +%! PostProc(html): 'ALIGN="middle" SRC="figs/img62.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img62.png"\1>' +%! PostProc(html): 'ALIGN="middle" SRC="figs/img79.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img79.png"\1>' +%! PostProc(html): 'ALIGN="middle" SRC="figs/img139.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img139.png"\1>' +%! PostProc(html): 'ALIGN="middle" SRC="figs/img140.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img140.png"\1>' +%! PostProc(html): 'ALIGN="middle" SRC="figs/img143.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img143.png"\1>' +%! PostProc(html): 'ALIGN="middle" SRC="figs/img115.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img115.png"\1>' +%! PostProc(html): 'ALIGN="middle" SRC="figs/img11.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img11.png"\1>' +%! PostProc(html): 'ALIGN="middle" SRC="figs/img169.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img169.png"\1>' +%! PostProc(html): 'ALIGN="middle" SRC="figs/img96.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img96.png"\1>' +%! 
PostProc(html): 'ALIGN="middle" SRC="figs/img178.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img178.png"\1>' +%! PostProc(html): 'ALIGN="middle" SRC="figs/img180.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img180.png"\1>' +%! PostProc(html): 'ALIGN="middle" SRC="figs/img183.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img183.png"\1>' +%! PostProc(html): 'ALIGN="middle" SRC="figs/img189.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img189.png"\1>' +%! PostProc(html): 'ALIGN="middle" SRC="figs/img196.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img196.png"\1>' +%! PostProc(html): 'ALIGN="middle" SRC="figs/img172.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img172.png"\1>' +%! PostProc(html): 'ALIGN="middle" SRC="figs/img8.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img8.png"\1>' +%! PostProc(html): 'ALIGN="middle" SRC="figs/img1.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img1.png"\1>' +%! PostProc(html): 'ALIGN="middle" SRC="figs/img14.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img14.png"\1>' +%! PostProc(html): 'ALIGN="middle" SRC="figs/img128.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img128.png"\1>' +%! PostProc(html): 'ALIGN="middle" SRC="figs/img112.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img112.png"\1>' +%! PostProc(html): 'ALIGN="middle" SRC="figs/img12.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img12.png"\1>' +%! PostProc(html): 'ALIGN="middle" SRC="figs/img13.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img13.png"\1>' +%! PostProc(html): 'ALIGN="middle" SRC="figs/img244.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img244.png"\1>' +%! PostProc(html): 'ALIGN="middle" SRC="figs/img245.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img245.png"\1>' +%! PostProc(html): 'ALIGN="middle" SRC="figs/img246.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img246.png"\1>' +%! PostProc(html): 'ALIGN="middle" SRC="figs/img15.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img15.png"\1>' +%! PostProc(html): 'ALIGN="middle" SRC="figs/img25.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img25.png"\1>' +%! PostProc(html): 'ALIGN="middle" SRC="figs/img168.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img168.png"\1>' +%! PostProc(html): 'ALIGN="middle" SRC="figs/img6.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img6.png"\1>' +%! PostProc(html): 'ALIGN="middle" SRC="figs/img5.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img5.png"\1>' +%! PostProc(html): 'ALIGN="middle" SRC="figs/img28.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img28.png"\1>' +%! PostProc(html): 'ALIGN="middle" SRC="figs/img237.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img237.png"\1>' +%! PostProc(html): 'ALIGN="middle" SRC="figs/img248.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img237.png"\1>' +%! PostProc(html): 'ALIGN="middle" SRC="figs/img248.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img237.png"\1>' +%! PostProc(html): 'ALIGN="middle" SRC="figs/img249.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img249.png"\1>' +%! 
PostProc(html): 'ALIGN="middle" SRC="figs/img250.png"(.*?)>' 'ALIGN="bottom" SRC="figs/img250.png"\1>' diff --git a/DOC.css b/DOC.css new file mode 100644 index 0000000..db09b2d --- /dev/null +++ b/DOC.css @@ -0,0 +1,33 @@ +/* implement both fixed-size and relative sizes */ +SMALL.XTINY { } +SMALL.TINY { } +SMALL.SCRIPTSIZE { } +BODY { font-size: 13 } +TD { font-size: 13 } +SMALL.FOOTNOTESIZE { font-size: 13 } +SMALL.SMALL { } +BIG.LARGE { } +BIG.XLARGE { } +BIG.XXLARGE { } +BIG.HUGE { } +BIG.XHUGE { } + +/* heading styles */ +H1 { } +H2 { } +H3 { } +H4 { } +H5 { } + + +/* mathematics styles */ +DIV.displaymath { } /* math displays */ +TD.eqno { } /* equation-number cells */ + + +/* document-specific styles come next */ +DIV.navigation { } +DIV.center { } +SPAN.textit { font-style: italic } +SPAN.arabic { } +SPAN.eqn-number { } diff --git a/FAQ.t2t b/FAQ.t2t index 1ce1774..05ca410 100644 --- a/FAQ.t2t +++ b/FAQ.t2t @@ -1,6 +1,7 @@ CMPH FAQ +%!includeconf: CONFIG.t2t - How do I define the ids of the keys? - You don't. The ids will be assigned by the algorithm creating the minimal @@ -26,7 +27,7 @@ one is executed? ---------------------------------------- -[Home index.html] + | [Home index.html] | [CHM chm.html] | [BMZ bmz.html] ---------------------------------------- %!include: FOOTER.t2t diff --git a/GPERF.t2t b/GPERF.t2t index 61190b9..67d5d40 100644 --- a/GPERF.t2t +++ b/GPERF.t2t @@ -1,6 +1,7 @@ GPERF versus CMPH +%!includeconf: CONFIG.t2t You might ask why cmph if [gperf http://www.gnu.org/software/gperf/gperf.html] already works perfectly. Actually, gperf and cmph have different goals. @@ -32,7 +33,7 @@ assigning ids to millions of documents), while the former is usually found in the compiler programming area (detect reserved keywords). ---------------------------------------- -[Home index.html] + | [Home index.html] | [CHM chm.html] | [BMZ bmz.html] ---------------------------------------- %!include: FOOTER.t2t diff --git a/LOGO.t2t b/LOGO.t2t new file mode 100644 index 0000000..dc245a8 --- /dev/null +++ b/LOGO.t2t @@ -0,0 +1 @@ +SourceForge.net Logo diff --git a/README.t2t b/README.t2t index e491c32..7ddc1ec 100644 --- a/README.t2t +++ b/README.t2t @@ -8,7 +8,8 @@ CMPH - C Minimal Perfect Hashing Library ==Description== C Minimal Perfect Hashing Library is a portable LGPLed library to create and -to work with minimal perfect hash functions. The cmph library encapsulates the newest +to work with [minimal perfect hash functions concepts.html]. +The cmph library encapsulates the newest and more efficient algorithms (available in the literature) in an easy-to-use, production-quality and fast API. The library is designed to work with big entries that can not fit in the main memory. It has been used successfully for constructing minimal perfect @@ -54,7 +55,7 @@ of the distinguishable features of cmph: - New heuristic added to the bmz algorithm permits to generate a mphf with only //24.6n + O(1)// bytes. The resulting function can be stored in //3.72n// bytes. -%html% [click here bmz.html] for details. +%html% [click here bmz.html#heuristic] for details. ---------------------------------------- @@ -173,5 +174,5 @@ Code is under the LGPL. %!include: FOOTER.t2t -%!include(html): ''LOGO.html'' +%!include(html): ''LOGO.t2t'' Last Updated: %%date(%c) diff --git a/TABLE1.t2t b/TABLE1.t2t new file mode 100644 index 0000000..402a854 --- /dev/null +++ b/TABLE1.t2t @@ -0,0 +1,76 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+Characteristics Algorithms
+ + BMZ CHM
+ +$c$ 1.15 2.09
+$\vert E(G)\vert$ $n$ $n$
+$\vert V(G)\vert=\vert g\vert$ $cn$ $cn$
+ +$\vert E(G_{\rm crit})\vert$ $0.5\vert E(G)\vert$ 0
+$G$ cyclic acyclic
+Order preserving no yes
\ No newline at end of file diff --git a/TABLE4.t2t b/TABLE4.t2t new file mode 100644 index 0000000..350fa1e --- /dev/null +++ b/TABLE4.t2t @@ -0,0 +1,109 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
$n$ BMZ +CHM algorithm Gain
+ $N_i$ Map+Ord +Search Total +$N_i$ Map+Ord Search +Total (%)
1,562,500 2.28 8.54 2.37 10.91 2.70 14.56 1.57 16.13 48
3,125,000 2.16 15.92 4.88 20.80 2.85 30.36 3.20 33.56 61
6,250,000 2.20 33.09 10.48 43.57 2.90 62.26 6.76 69.02 58
12,500,000 2.00 63.26 23.04 86.30 2.60 117.99 14.94 132.92 54
25,000,000 2.00 130.79 51.55 182.34 2.80 262.05 33.68 295.73 62
50,000,000 2.07 273.75 114.12 387.87 2.90 577.59 73.97 651.56 68
100,000,000 2.07 567.47 243.13 810.60 2.80 1,131.06 157.23 1,288.29 59
diff --git a/TABLE5.t2t b/TABLE5.t2t new file mode 100644 index 0000000..8cf966a --- /dev/null +++ b/TABLE5.t2t @@ -0,0 +1,46 @@ + + + + + + + + + + + + + + + + + + + + + + + + + +
$n$ BMZ $c=1.00$ + BMZ $c=0.93$
+ $N_i$ Map+Ord +Search Total +$N_i$ Map+Ord Search +Total
12,500,000 2.78 76.68 25.06 101.74 3.04 76.39 25.80 102.19
\ No newline at end of file