From f1b1f12dda025eb50b8488d2b493e80b0d5fc3ff Mon Sep 17 00:00:00 2001 From: fc_botelho Date: Fri, 28 Jan 2005 20:07:22 +0000 Subject: [PATCH] It was improved the documentation of BMZ and CHM algorithms --- BMZ.t2t | 164 ++++++++++++++++++++++++++++++++++++++++++++----- CHM.t2t | 51 ++++++++++++++- COMPARISON.t2t | 8 +-- CONFIG.t2t | 2 + README.t2t | 22 +++---- 5 files changed, 212 insertions(+), 35 deletions(-) diff --git a/BMZ.t2t b/BMZ.t2t index 616d6bd..37e3101 100644 --- a/BMZ.t2t +++ b/BMZ.t2t @@ -4,46 +4,176 @@ BMZ Algorithm %!includeconf: CONFIG.t2t ---------------------------------------- -**History** +==History== At the end of 2003, professor [Nivio Ziviani http://www.dcc.ufmg.br/~nivio] was -finishing the second edition of his book. -During the book writing, professor Nivio studied the problem of generating minimal perfect hash +finishing the second edition of his [book http://www.dcc.ufmg.br/algoritmos/]. +While writing the [book http://www.dcc.ufmg.br/algoritmos/], +professor [Nivio Ziviani http://www.dcc.ufmg.br/~nivio] studied the problem of generating minimal perfect hash functions (if you are not familiarized with this problem, see [1][2]). -Professor Nivio coded a modified version of the [CHM algorithm chm.html], which was proposed by -Czech, Havas and Majewski and put it in his book. +Professor [Nivio Ziviani http://www.dcc.ufmg.br/~nivio] coded a modified version of +the [CHM algorithm chm.html], which was proposed by +Czech, Havas and Majewski and put it in his [book http://www.dcc.ufmg.br/algoritmos/]. The [CHM algorithm chm.html] is based on acyclic random graphs to generate order preserving -minimal perfect hash functions in linear time. Professor Nivio argued himself, why must the random graph -be acyclic? In the modified version availalbe in his book he got rid of such restriction. +minimal perfect hash functions in linear time. Professor [Nivio Ziviani http://www.dcc.ufmg.br/~nivio] +asked himself: why must the random graph +be acyclic? 
In the modified version available in his [book http://www.dcc.ufmg.br/algoritmos/] he got rid of such restriction. The modification presented a problem, it was impossible to generate minimal perfect hash functions for sets with more than 1000 keys. At the same time, [Fabiano C. Botelho http://www.dcc.ufmg.br/~fbotelho], a master degree student at [Departament of Computer Science http://www.dcc.ufmg.br] in [Federal University of Minas Gerais http://www.ufmg.br], -started to be advised by Nivio who presented the problem to Fabiano. +started to be advised by [Nivio Ziviani http://www.dcc.ufmg.br/~nivio] who presented the problem +to [Fabiano http://www.dcc.ufmg.br/~fbotelho]. -During the master, Fabiano and Nivio faced lots of problems. -Talking with a friend of mine (David Menoti) about our problems, many ideas -appeared and after of implementing them, we got a very fast algorithm to generate -minimal perfect hash functions that does not preserve order. +During the master's degree, [Fabiano http://www.dcc.ufmg.br/~fbotelho] and +[Nivio Ziviani http://www.dcc.ufmg.br/~nivio] faced lots of problems. +In April 2004, [Fabiano http://www.dcc.ufmg.br/~fbotelho] was talking with a +friend of his (David Menoti) about the problems +and many ideas appeared. +The ideas were implemented and we noticed that a very fast algorithm to generate +minimal perfect hash functions had been designed. We refer the algorithm to as **BMZ**, because it was conceived by Fabiano C. **B**otelho David **M**enoti and Nivio **Z**iviani. The algorithm is described in [1]. To analyse BMZ algorithm we needed some results from the random graph theory, so we invite professor [Yoshiharu Kohayakawa http://www.ime.usp.br/~yoshi] to help us. The final description and analysis of BMZ algorithm is presented in [2]. - +---------------------------------------- -**The Algorithm** +==The Algorithm== -**The Heuristic** +Let us show how the minimal perfect hash function [figs/img7.png] will be constructed. 
+We make use of two auxiliary random functions [figs/img41.png] and [figs/img55.png], +where [figs/img56.png] for some suitably chosen integer [figs/img57.png], +where [figs/img58.png]. We build a random graph [figs/img59.png] on [figs/img60.png], +whose edge set is [figs/img61.png]. There is an edge in [figs/img32.png] for each +key in the set of keys [figs/img20.png]. -**Papers** +In what follows, we shall be interested in the //2-core// of +the random graph [figs/img32.png], that is, the maximal subgraph +of [figs/img32.png] with minimum degree at +least 2 (see, e.g., [2] for details). +Because of its importance in our context, we call the 2-core the +//critical// subgraph of [figs/img32.png] and denote it by [figs/img63.png]. +The vertices and edges in [figs/img63.png] are said to be //critical//. +We let [figs/img64.png] and [figs/img65.png]. +Moreover, we let [figs/img66.png] be the set of //non-critical// +vertices in [figs/img32.png]. +We also let [figs/img67.png] be the set of all critical +vertices that have at least one non-critical vertex as a neighbour. +Let [figs/img68.png] be the set of //non-critical// edges in [figs/img32.png]. +Finally, we let [figs/img69.png] be the //non-critical// subgraph +of [figs/img32.png]. +The non-critical subgraph [figs/img70.png] corresponds to the //acyclic part// +of [figs/img32.png]. +We have [figs/img71.png]. + +We then construct a suitable labelling [figs/img72.png] of the vertices +of [figs/img32.png]: we choose [figs/img73.png] for each [figs/img74.png] in such +a way that [figs/img75.png] ([figs/img18.png]) is a +minimal perfect hash function for [figs/img20.png]. +We will see later on that this labelling [figs/img37.png] can be found in linear time +if the number of edges in [figs/img63.png] is at most [figs/img76.png]. + +Figure 2 presents pseudocode for the algorithm. 
+The procedure GenerateMPHF ([figs/img20.png], [figs/img37.png]) receives as input the set of +keys [figs/img20.png] and produces the labelling [figs/img37.png]. +The method uses a mapping, ordering and searching approach. +We now describe each step. +| procedure GenerateMPHF ([figs/img20.png], [figs/img37.png]) +|     Mapping ([figs/img20.png], [figs/img32.png]); +|     Ordering ([figs/img32.png], [figs/img63.png], [figs/img70.png]); +|     Searching ([figs/img32.png], [figs/img63.png], [figs/img70.png], [figs/img37.png]); +**Figure 2**: Main steps of the algorithm for constructing a minimal perfect hash function + +===Mapping Step=== + +===Ordering Step=== + +===Searching Step=== + +====Assignment of Values to Critical Vertices==== + +====Assignment of Values to Non-Critical Vertices==== + +---------------------------------------- + +==The Heuristic== + +---------------------------------------- + +==Memory Consumption== + +Now we detail the memory consumption to generate and to store minimal perfect hash functions +using the BMZ algorithm. The structures responsible for memory consumption are in the +following: +- Graph: + + **first**: is a vector that stores //cn// integer numbers, each one representing + the first edge (index in the vector edges) in the list of + edges of each vertex. + The integer numbers are 4 bytes long. Therefore, + the vector first is stored in //4cn// bytes. + + + **edges**: is a vector to represent the edges of the graph. As each edge + is compounded by a pair of vertices, each entry stores two integer numbers + of 4 bytes that represent the vertices. As there are //n// edges, the + vector edges is stored in //8n// bytes. + + + **next**: given a vertex //v//, we can discover the edges that contain //v// + following its list of edges, which starts on first[//v//] and the next + edges are given by next[...first[//v//]...]. Therefore, the vectors first and next represent + the linked lists of edges of each vertex. 
As there are two vertices for each edge, + when an edge is inserted in the graph, it must be inserted in the two linked lists + of the vertices in its composition. Therefore, there are //2n// entries of integer + numbers in the vector next, so it is stored in //4*2n = 8n// bytes. + + + **critical vertices (critical_nodes vector)**: is a vector of //cn// bits, + where each bit indicates if a vertex is critical (1) or non-critical (0). + Therefore, the critical and non-critical vertices are represented in //cn/8// bytes. + + + **critical edges (used_edges vector)**: is a vector of //n// bits, where each + bit indicates if an edge is critical (1) or non-critical (0). Therefore, the + critical and non-critical edges are represented in //n/8// bytes. + +- Other auxiliary structures + + **queue**: is a queue of integer numbers used in the breadth-first search of the + assignment of values to critical vertices. There is an entry in the queue for + each two critical vertices. Let //|Vcrit|// be the expected number of critical + vertices. Therefore, the queue is stored in //4*0.5*|Vcrit| = 2|Vcrit|// bytes. + + + **visited**: is a vector of //cn// bits, where each bit indicates if the g value of + a given vertex was already defined. Therefore, the vector visited is stored + in //cn/8// bytes. + + + **function //g//**: is represented by a vector of //cn// integer numbers. + As each integer number is 4 bytes long, the function //g// is stored in + //4cn// bytes. + + +Thus, the total memory consumption of BMZ algorithm for generating a minimal +perfect hash function (MPHF) is: //(8.25c + 16.125)n + 2|Vcrit| + O(1)// bytes. +As the value of constant //c// may be 0.93 or 1.15, we have: + || //c// | //|Vcrit|// | Memory consumption to generate a MPHF | + | 0.93 | //0.497n// | //24.80n + O(1)// | + | 1.15 | //0.401n// | //26.42n + O(1)// | +The values of |Vcrit| were calculated using Eq.(1) presented in [2]. + +Now we present the memory consumption to store the resulting function. 
+We only need to store the //g// function. Thus, we need //4cn// bytes. +Again we have: + || //c// | Memory consumption to store a MPHF | + | 0.93 | //3.72n// | + | 1.15 | //4.60n// | + +---------------------------------------- + +==Papers== + [F. C. Botelho http://www.dcc.ufmg.br/~fbotelho], D. Menoti, [N. Ziviani http://www.dcc.ufmg.br/~nivio]. [A New algorithm for constructing minimal perfect hash functions papers/bmz_tr004_04.ps], Technical Report TR004/04, Department of Computer Science, Federal University of Minas Gerais, 2004. -+ [F. C. Botelho http://www.dcc.ufmg.br/~fbotelho], Y. Kohayakawa, and [N. Ziviani http://www.dcc.ufmg.br/~nivio]. [A Practical Minimal Perfect Hashing Method papers/bmz_wea2005.ps], 4th International Workshop on Efficient and Experimental Algorithms (WEA), 2005.(submitted) ++ [F. C. Botelho http://www.dcc.ufmg.br/~fbotelho], Y. Kohayakawa, and [N. Ziviani http://www.dcc.ufmg.br/~nivio]. [A Practical Minimal Perfect Hashing Method papers/bmz_wea2005.ps] (Submitted). ---------------------------------------- diff --git a/CHM.t2t b/CHM.t2t index 1ceccaa..8859eff 100644 --- a/CHM.t2t +++ b/CHM.t2t @@ -4,12 +4,57 @@ CHM Algorithm %!includeconf: CONFIG.t2t ---------------------------------------- +==The Algorithm== -**History** +==Memory Consumption== -**The Algorithm** +Now we detail the memory consumption to generate and to store minimal perfect hash functions +using the CHM algorithm. The structures responsible for memory consumption are in the +following: +- Graph: + + **first**: is a vector that stores //cn// integer numbers, each one representing + the first edge (index in the vector edges) in the list of + edges of each vertex. + The integer numbers are 4 bytes long. Therefore, + the vector first is stored in //4cn// bytes. + + + **edges**: is a vector to represent the edges of the graph. As each edge + is compounded by a pair of vertices, each entry stores two integer numbers + of 4 bytes that represent the vertices. 
As there are //n// edges, the + vector edges is stored in //8n// bytes. + + + **next**: given a vertex //v//, we can discover the edges that contain //v// + following its list of edges, which starts on first[//v//] and the next + edges are given by next[...first[//v//]...]. Therefore, the vectors first and next represent + the linked lists of edges of each vertex. As there are two vertices for each edge, + when an edge is inserted in the graph, it must be inserted in the two linked lists + of the vertices in its composition. Therefore, there are //2n// entries of integer + numbers in the vector next, so it is stored in //4*2n = 8n// bytes. + +- Other auxiliary structures + + **visited**: is a vector of //cn// bits, where each bit indicates if the g value of + a given vertex was already defined. Therefore, the vector visited is stored + in //cn/8// bytes. + + + **function //g//**: is represented by a vector of //cn// integer numbers. + As each integer number is 4 bytes long, the function //g// is stored in + //4cn// bytes. -**Papers** + +Thus, the total memory consumption of CHM algorithm for generating a minimal +perfect hash function (MPHF) is: //(8.125c + 16)n + O(1)// bytes. +As the value of constant //c// must be at least 2.09, we have: + || //c// | Memory consumption to generate a MPHF | + | 2.09 | //33.00n + O(1)// | + +Now we present the memory consumption to store the resulting function. +We only need to store the //g// function. Thus, we need //4cn// bytes. +Again we have: + || //c// | Memory consumption to store a MPHF | + | 2.09 | //8.36n// | + + +==Papers== + Z.J. Czech, G. Havas, and B.S. Majewski. [An optimal algorithm for generating minimal perfect hash functions. papers/chm92.pdf], Information Processing Letters, 43(5):257-264, 1992. 
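
Reviewer note on the memory figures in the BMZ.t2t and CHM.t2t hunks: the per-structure costs can be cross-checked by summing them directly. The sketch below is only an illustration under stated assumptions. The function names are hypothetical and are not part of cmph, every integer is taken to be 4 bytes as the text says, and the |Vcrit| fractions are the rounded values quoted in the BMZ table, so the sums may differ from the tabulated totals in the last digit.

```python
# Illustrative cross-check only; these function names are hypothetical
# and do not exist in cmph. All results are bytes per key (total / n),
# assuming 4-byte integers and ignoring the O(1) term.

def bmz_generation_bytes_per_key(c, vcrit_fraction):
    """Sum the BMZ structures listed in the Memory Consumption section."""
    first = 4 * c                # cn integers
    edges = 8                    # n entries, two integers each
    next_ = 8                    # 2n integers
    critical_nodes = c / 8       # cn bits
    used_edges = 1 / 8           # n bits
    queue = 2 * vcrit_fraction   # 4 * 0.5 * |Vcrit| bytes
    visited = c / 8              # cn bits
    g = 4 * c                    # cn integers
    return (first + edges + next_ + critical_nodes
            + used_edges + queue + visited + g)

def chm_generation_bytes_per_key(c):
    """Sum the CHM structures (no critical-vertex bookkeeping, no queue)."""
    first = 4 * c
    edges = 8
    next_ = 8
    visited = c / 8
    g = 4 * c
    return first + edges + next_ + visited + g

def storage_bytes_per_key(c):
    """Only the g vector is kept: cn integers of 4 bytes."""
    return 4 * c

if __name__ == "__main__":
    # (c, |Vcrit| / n) pairs as quoted in the BMZ table.
    for c, v in ((0.93, 0.497), (1.15, 0.401)):
        print("BMZ c=%.2f generate=%.2fn store=%.2fn"
              % (c, bmz_generation_bytes_per_key(c, v), storage_bytes_per_key(c)))
    print("CHM c=2.09 generate=%.2fn store=%.2fn"
          % (chm_generation_bytes_per_key(2.09), storage_bytes_per_key(2.09)))
```

With these rounded inputs the sums agree with the closed forams //(8.25c + 16.125)n + 2|Vcrit|// and //(8.125c + 16)n// and land within about 0.02 bytes per key of the tabulated totals; the small gap is presumably due to rounding of //|Vcrit|// and of the totals themselves.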
diff --git a/COMPARISON.t2t b/COMPARISON.t2t index 4176c28..a6ff823 100644 --- a/COMPARISON.t2t +++ b/COMPARISON.t2t @@ -5,14 +5,14 @@ Comparison Between BMZ And CHM Algorithms ---------------------------------------- -**Features** +==Features== -**Constructing Minimal Perfect Hash Functions** +==Constructing Minimal Perfect Hash Functions== -**Memory Consumption** +==Memory Consumption== -**Run times** +==Run times== ---------------------------------------- [Home index.html] diff --git a/CONFIG.t2t b/CONFIG.t2t index 807454c..19dd4e9 100644 --- a/CONFIG.t2t +++ b/CONFIG.t2t @@ -1,2 +1,4 @@ %! PreProc(html): '^%html% ' '' %! PreProc(txt): '^%txt% ' '' +%! PostProc(html): "&amp;" "&" +%! PostProc(txt): "&nbsp;" " " diff --git a/README.t2t b/README.t2t index 0b25b81..e491c32 100644 --- a/README.t2t +++ b/README.t2t @@ -5,7 +5,7 @@ CMPH - C Minimal Perfect Hashing Library ------------------------------------------------------------------- -**Description** +==Description== C Minimal Perfect Hashing Library is a portable LGPLed library to create and to work with minimal perfect hash functions. The cmph library encapsulates the newest @@ -31,35 +31,35 @@ of the distinguishable features of cmph: ---------------------------------------- -**Supported Algorithms** +==Supported Algorithms== %html% - [BMZ Algorithm bmz.html]. %txt% - BMZ Algorithm. A very fast algorithm based on cyclic random graphs to construct minimal perfect hash functions in linear time. The resulting functions are not order preserving and - can be stored in only 4cn bytes, where c is between 0.93 and 1.15. + can be stored in only //4cn// bytes, where //c// is between 0.93 and 1.15. %html% - [CHM Algorithm chm.html]. %txt% - CHM Algorithm. An algorithm based on acyclic random graphs to construct minimal perfect hash functions in linear time. The resulting functions are order preserving and - are stored in 4cn bytes, where c is greater than 2. + are stored in //4cn// bytes, where //c// is greater than 2. 
%html% [Click Here comparison.html] to see a comparison of the supported algorithms. ---------------------------------------- -**News for version 0.3** +==News for version 0.3== - New heuristic added to the bmz algorithm permits to generate a mphf with only - 24.61*n + O(1) bytes. The resulting function can be stored in 3.72*n bytes. + //24.6n + O(1)// bytes. The resulting function can be stored in //3.72n// bytes. %html% [click here bmz.html] for details. ---------------------------------------- -**Examples** +==Examples== Using cmph is quite simple. Take a look. @@ -113,7 +113,7 @@ Using cmph is quite simple. Take a look. ``` -------------------------------------- -**The cmph application** +==The cmph application== cmph is the name of both the library and the utility application that comes with this package. You can use the cmph @@ -157,16 +157,16 @@ utility. keysfile line separated file with keys ``` -**Additional Documentation** +==Additional Documentation== [FAQ faq.html] -**Downloads** +==Downloads== Use the project page at sourceforge: http://sf.net/projects/cmph -**License Stuff** +==License Stuff== Code is under the LGPL. ----------------------------------------