Improved the documentation of the BMZ and CHM algorithms

fc_botelho 2005-01-28 20:07:22 +00:00
parent dfa28a005a
commit f1b1f12dda
5 changed files with 212 additions and 35 deletions

BMZ.t2t

@ -4,46 +4,176 @@ BMZ Algorithm
%!includeconf: CONFIG.t2t
----------------------------------------
**History**
==History==
At the end of 2003, professor [Nivio Ziviani http://www.dcc.ufmg.br/~nivio] was
finishing the second edition of his book.
While writing the book, professor Nivio studied the problem of generating minimal perfect hash
finishing the second edition of his [book http://www.dcc.ufmg.br/algoritmos/].
While writing the [book http://www.dcc.ufmg.br/algoritmos/],
professor [Nivio Ziviani http://www.dcc.ufmg.br/~nivio] studied the problem of generating minimal perfect hash
functions (if you are not familiar with this problem, see [1][2]).
Professor Nivio coded a modified version of the [CHM algorithm chm.html], which was proposed by
Czech, Havas and Majewski and put it in his book.
Professor [Nivio Ziviani http://www.dcc.ufmg.br/~nivio] coded a modified version of
the [CHM algorithm chm.html], which was proposed by
Czech, Havas and Majewski and put it in his [book http://www.dcc.ufmg.br/algoritmos/].
The [CHM algorithm chm.html] is based on acyclic random graphs to generate order preserving
minimal perfect hash functions in linear time. Professor Nivio asked himself: why must the random graph
be acyclic? In the modified version available in his book he got rid of this restriction.
minimal perfect hash functions in linear time. Professor [Nivio Ziviani http://www.dcc.ufmg.br/~nivio]
asked himself: why must the random graph
be acyclic? In the modified version available in his [book http://www.dcc.ufmg.br/algoritmos/] he got rid of this restriction.
The modification presented a problem: it was impossible to generate minimal perfect hash functions
for sets with more than 1000 keys.
At the same time, [Fabiano C. Botelho http://www.dcc.ufmg.br/~fbotelho],
a master's degree student at the [Department of Computer Science http://www.dcc.ufmg.br] of the
[Federal University of Minas Gerais http://www.ufmg.br],
started to be advised by Nivio who presented the problem to Fabiano.
started to be advised by [Nivio Ziviani http://www.dcc.ufmg.br/~nivio] who presented the problem
to [Fabiano http://www.dcc.ufmg.br/~fbotelho].
During the master's program, Fabiano and Nivio faced lots of problems.
While talking with a friend of mine (David Menoti) about our problems, many ideas
appeared, and after implementing them we got a very fast algorithm to generate
minimal perfect hash functions that does not preserve order.
During the master's program, [Fabiano http://www.dcc.ufmg.br/~fbotelho] and
[Nivio Ziviani http://www.dcc.ufmg.br/~nivio] faced lots of problems.
In April 2004, [Fabiano http://www.dcc.ufmg.br/~fbotelho] was talking with a
friend of his (David Menoti) about the problems,
and many ideas appeared.
The ideas were implemented and we noticed that a very fast algorithm to generate
minimal perfect hash functions had been designed.
We refer to the algorithm as **BMZ**, because it was conceived by Fabiano C. **B**otelho,
David **M**enoti and Nivio **Z**iviani. The algorithm is described in [1].
To analyse the BMZ algorithm we needed some results from random graph theory, so
we invited professor [Yoshiharu Kohayakawa http://www.ime.usp.br/~yoshi] to help us.
The final description and analysis of the BMZ algorithm is presented in [2].
----------------------------------------
**The Algorithm**
==The Algorithm==
**The Heuristic**
Let us show how the minimal perfect hash function [figs/img7.png] will be constructed.
We make use of two auxiliary random functions [figs/img41.png] and [figs/img55.png],
where [figs/img56.png] for some suitably chosen integer [figs/img57.png],
where [figs/img58.png]. We build a random graph [figs/img59.png] on [figs/img60.png],
whose edge set is [figs/img61.png]. There is an edge in [figs/img32.png] for each
key in the set of keys [figs/img20.png].
**Papers**
In what follows, we shall be interested in the //2-core// of
the random graph [figs/img32.png], that is, the maximal subgraph
of [figs/img32.png] with minimal degree at
least 2 (see, e.g., [2] for details).
Because of its importance in our context, we call the 2-core the
//critical// subgraph of [figs/img32.png] and denote it by [figs/img63.png].
The vertices and edges in [figs/img63.png] are said to be //critical//.
We let [figs/img64.png] and [figs/img65.png].
Moreover, we let [figs/img66.png] be the set of //non-critical//
vertices in [figs/img32.png].
We also let [figs/img67.png] be the set of all critical
vertices that have at least one non-critical vertex as a neighbour.
Let [figs/img68.png] be the set of //non-critical// edges in [figs/img32.png].
Finally, we let [figs/img69.png] be the //non-critical// subgraph
of [figs/img32.png].
The non-critical subgraph [figs/img70.png] corresponds to the //acyclic part//
of [figs/img32.png].
We have [figs/img71.png].
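As an illustration of the definitions above, the following minimal sketch separates critical from non-critical edges by repeatedly removing vertices of degree 1, which is the usual way to obtain a 2-core. It is not taken from the cmph sources: the function name `two_core_edges`, the edge-list input and the assumption that the graph has no self-loops are choices made only for this example.

```python
from collections import defaultdict


def two_core_edges(edges):
    """Return the indices of the edges that belong to the 2-core.

    `edges` is a list of (u, v) vertex pairs. Self-loops are assumed
    not to occur (an assumption of this sketch).
    """
    # vertex -> indices of the incident edges that are still alive
    incident = defaultdict(set)
    for i, (u, v) in enumerate(edges):
        incident[u].add(i)
        incident[v].add(i)

    alive = set(range(len(edges)))
    # Vertices of degree 1 cannot be part of the 2-core.
    stack = [v for v in incident if len(incident[v]) == 1]
    while stack:
        v = stack.pop()
        if len(incident[v]) != 1:
            continue  # degree changed since the vertex was pushed
        (i,) = incident[v]
        alive.discard(i)  # the edge is non-critical
        for w in edges[i]:
            incident[w].discard(i)
            if len(incident[w]) == 1:
                stack.append(w)
    return alive


# A triangle (edges 0, 1, 2) with a path of two edges hanging off vertex 2.
example = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4)]
critical = two_core_edges(example)
non_critical = set(range(len(example))) - critical
```

In this example the triangle edges are the critical ones and the two path edges form the acyclic, non-critical part.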
We then construct a suitable labelling [figs/img72.png] of the vertices
of [figs/img32.png]: we choose [figs/img73.png] for each [figs/img74.png] in such
a way that [figs/img75.png] ([figs/img18.png]) is a
minimal perfect hash function for [figs/img20.png].
We will see later on that this labelling [figs/img37.png] can be found in linear time
if the number of edges in [figs/img63.png] is at most [figs/img76.png].
Figure 2 presents pseudocode for the algorithm.
The procedure GenerateMPHF ([figs/img20.png], [figs/img37.png]) receives as input the set of
keys [figs/img20.png] and produces the labelling [figs/img37.png].
The method uses a mapping, ordering and searching approach.
We now describe each step.
| procedure GenerateMPHF ([figs/img20.png], [figs/img37.png])
|     Mapping ([figs/img20.png], [figs/img32.png]);
|     Ordering ([figs/img32.png], [figs/img63.png], [figs/img70.png]);
|     Searching ([figs/img32.png], [figs/img63.png], [figs/img70.png], [figs/img37.png]);
**Figure 2**: Main steps of the algorithm for constructing a minimal perfect hash function
===Mapping Step===
===Ordering Step===
===Searching Step===
====Assignment of Values to Critical Vertices====
====Assignment of Values to Non-Critical Vertices====
----------------------------------------
==The Heuristic==
----------------------------------------
==Memory Consumption==
Now we detail the memory consumption to generate and to store minimal perfect hash functions
using the BMZ algorithm. The structures responsible for memory consumption are the
following:
- Graph:
+ **first**: is a vector that stores //cn// integer numbers, each one representing
the first edge (index in the vector edges) in the list of
edges of each vertex.
The integer numbers are 4 bytes long. Therefore,
the vector first is stored in //4cn// bytes.
+ **edges**: is a vector to represent the edges of the graph. As each edge
is composed of a pair of vertices, each entry stores two integer numbers
of 4 bytes that represent the vertices. As there are //n// edges, the
vector edges is stored in //8n// bytes.
+ **next**: given a vertex //v//, we can discover the edges that contain //v//
following its list of edges, which starts on first[//v//] and the next
edges are given by next[...first[//v//]...]. Therefore, the vectors first and next represent
the linked lists of edges of each vertex. As there are two vertices for each edge,
when an edge is inserted in the graph, it must be inserted in the two linked lists
of the vertices in its composition. Therefore, there are //2n// entries of integer
numbers in the vector next, so it is stored in //4*2n = 8n// bytes.
+ **critical vertices (critical_nodes vector)**: is a vector of //cn// bits,
where each bit indicates if a vertex is critical (1) or non-critical (0).
Therefore, the critical and non-critical vertices are represented in //cn/8// bytes.
+ **critical edges (used_edges vector)**: is a vector of //n// bits, where each
bit indicates if an edge is critical (1) or non-critical (0). Therefore, the
critical and non-critical edges are represented in //n/8// bytes.
- Other auxiliary structures
+ **queue**: is a queue of integer numbers used in the breadth-first search of the
assignment of values to critical vertices. There is one entry in the queue for
every two critical vertices. Let //|Vcrit|// be the expected number of critical
vertices. Therefore, the queue is stored in //4*0.5*|Vcrit|=2|Vcrit|// bytes.
+ **visited**: is a vector of //cn// bits, where each bit indicates if the g value of
a given vertex was already defined. Therefore, the vector visited is stored
in //cn/8// bytes.
+ **function //g//**: is represented by a vector of //cn// integer numbers.
As each integer number is 4 bytes long, the function //g// is stored in
//4cn// bytes.
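The roles of the first, edges and next vectors described above can be pictured with the following minimal sketch. It is an illustration only, not the cmph implementation: the class name `Graph`, the `EMPTY` sentinel and the convention that slot //e// of next belongs to the first endpoint of edge //e// and slot //e + n// to the second are assumptions made for this example.

```python
EMPTY = -1  # end-of-list marker (an assumption of this sketch)


class Graph:
    """Adjacency lists stored in three flat vectors: first, edges, next."""

    def __init__(self, num_vertices, num_edges):
        self.n = num_edges
        self.first = [EMPTY] * num_vertices    # cn entries, one per vertex
        self.edges = [None] * num_edges        # n entries, one pair per edge
        self.next = [EMPTY] * (2 * num_edges)  # 2n entries, two per edge
        self.count = 0

    def add_edge(self, u, v):
        """Insert edge (u, v) in the linked lists of both endpoints."""
        e = self.count
        self.count += 1
        self.edges[e] = (u, v)
        # Slot e links the edge into u's list, slot e + n into v's list.
        self.next[e] = self.first[u]
        self.first[u] = e
        self.next[e + self.n] = self.first[v]
        self.first[v] = e + self.n
        return e

    def incident_edges(self, v):
        """Yield the indices of the edges that contain vertex v."""
        slot = self.first[v]
        while slot != EMPTY:
            yield slot % self.n  # map the slot back to its edge index
            slot = self.next[slot]


g = Graph(num_vertices=4, num_edges=3)
g.add_edge(0, 1)
g.add_edge(1, 2)
g.add_edge(1, 3)
edges_of_vertex_1 = sorted(g.incident_edges(1))
```

With 4-byte integers, these three vectors take //4cn//, //8n// and //8n// bytes respectively, as stated in the list.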
Thus, the total memory consumption of the BMZ algorithm for generating a minimal
perfect hash function (MPHF) is: //(8.25c + 16.125)n + 2|Vcrit| + O(1)// bytes.
As the value of the constant //c// may be 0.93 or 1.15, we have:
|| //c// | //|Vcrit|// | Memory consumption to generate a MPHF |
| 0.93 | //0.497n// | //24.80n + O(1)// |
| 1.15 | //0.401n// | //26.42n + O(1)// |
The values of |Vcrit| were calculated using Eq.(1) presented in [2].
Now we present the memory consumption to store the resulting function.
We only need to store the //g// function. Thus, we need //4cn// bytes.
Again we have:
|| //c// | Memory consumption to store a MPHF |
| 0.93 | //3.72n// |
| 1.15 | //4.60n// |
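The figures in the two tables can be checked with a short calculation. The sketch below simply adds up the per-structure sizes listed in this section, ignoring the //O(1)// term; the function names are hypothetical and the //|Vcrit|// ratios are the rounded values from the table.

```python
def bmz_generation_bytes_per_key(c, vcrit_ratio):
    """Bytes per key needed to generate an MPHF with BMZ, without the O(1) term.

    c           -- ratio between number of vertices and number of keys
    vcrit_ratio -- |Vcrit| / n, the expected fraction of critical vertices
    """
    per_vertex = 4 + 1 / 8 + 1 / 8 + 4  # first, critical_nodes, visited, g
    per_edge = 8 + 8 + 1 / 8            # edges, next, used_edges
    queue = 2 * vcrit_ratio             # 4 bytes for every two critical vertices
    return per_vertex * c + per_edge + queue


def bmz_storage_bytes_per_key(c):
    """Bytes per key needed to store the function g (4-byte integers)."""
    return 4 * c


for c, vcrit_ratio in [(0.93, 0.497), (1.15, 0.401)]:
    print(c,
          round(bmz_generation_bytes_per_key(c, vcrit_ratio), 2),
          round(bmz_storage_bytes_per_key(c), 2))
```

With the rounded ratios the generation cost comes out within 0.01 of the table entries; the small difference presumably comes from the rounding of //|Vcrit|//.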
----------------------------------------
==Papers==
+ [F. C. Botelho http://www.dcc.ufmg.br/~fbotelho], D. Menoti, [N. Ziviani http://www.dcc.ufmg.br/~nivio]. [A New algorithm for constructing minimal perfect hash functions papers/bmz_tr004_04.ps], Technical Report TR004/04, Department of Computer Science, Federal University of Minas Gerais, 2004.
+ [F. C. Botelho http://www.dcc.ufmg.br/~fbotelho], Y. Kohayakawa, and [N. Ziviani http://www.dcc.ufmg.br/~nivio]. [A Practical Minimal Perfect Hashing Method papers/bmz_wea2005.ps], 4th International Workshop on Efficient and Experimental Algorithms (WEA), 2005.(submitted)
+ [F. C. Botelho http://www.dcc.ufmg.br/~fbotelho], Y. Kohayakawa, and [N. Ziviani http://www.dcc.ufmg.br/~nivio]. [A Practical Minimal Perfect Hashing Method papers/bmz_wea2005.ps] (Submitted).
----------------------------------------


CHM.t2t

@ -4,12 +4,57 @@ CHM Algorithm
%!includeconf: CONFIG.t2t
----------------------------------------
==The Algorithm==
**History**
==Memory Consumption==
**The Algorithm**
Now we detail the memory consumption to generate and to store minimal perfect hash functions
using the CHM algorithm. The structures responsible for memory consumption are the
following:
- Graph:
+ **first**: is a vector that stores //cn// integer numbers, each one representing
the first edge (index in the vector edges) in the list of
edges of each vertex.
The integer numbers are 4 bytes long. Therefore,
the vector first is stored in //4cn// bytes.
+ **edges**: is a vector to represent the edges of the graph. As each edge
is composed of a pair of vertices, each entry stores two integer numbers
of 4 bytes that represent the vertices. As there are //n// edges, the
vector edges is stored in //8n// bytes.
+ **next**: given a vertex //v//, we can discover the edges that contain //v//
following its list of edges, which starts on first[//v//] and the next
edges are given by next[...first[//v//]...]. Therefore, the vectors first and next represent
the linked lists of edges of each vertex. As there are two vertices for each edge,
when an edge is inserted in the graph, it must be inserted in the two linked lists
of the vertices in its composition. Therefore, there are //2n// entries of integer
numbers in the vector next, so it is stored in //4*2n = 8n// bytes.
- Other auxiliary structures
+ **visited**: is a vector of //cn// bits, where each bit indicates if the g value of
a given vertex was already defined. Therefore, the vector visited is stored
in //cn/8// bytes.
+ **function //g//**: is represented by a vector of //cn// integer numbers.
As each integer number is 4 bytes long, the function //g// is stored in
//4cn// bytes.
**Papers**
Thus, the total memory consumption of the CHM algorithm for generating a minimal
perfect hash function (MPHF) is: //(8.125c + 16)n + O(1)// bytes.
As the value of the constant //c// must be at least 2.09, we have:
|| //c// | Memory consumption to generate a MPHF |
| 2.09 | //33.00n + O(1)// |
Now we present the memory consumption to store the resulting function.
We only need to store the //g// function. Thus, we need //4cn// bytes.
Again we have:
|| //c// | Memory consumption to store a MPHF |
| 2.09 | //8.36n// |
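The same kind of check can be made for CHM. The sketch below adds up the per-structure sizes listed in this section, ignoring the //O(1)// term; the function names are hypothetical.

```python
def chm_generation_bytes_per_key(c):
    """Bytes per key needed to generate an MPHF with CHM, without the O(1) term."""
    per_vertex = 4 + 1 / 8 + 4  # first, visited, g
    per_edge = 8 + 8            # edges, next
    return per_vertex * c + per_edge


def chm_storage_bytes_per_key(c):
    """Bytes per key needed to store the function g (4-byte integers)."""
    return 4 * c


print(round(chm_generation_bytes_per_key(2.09), 2),
      round(chm_storage_bytes_per_key(2.09), 2))
```

For //c = 2.09// the formula gives about //32.98n// bytes, which the table reports as //33.00n//.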
==Papers==
+ Z.J. Czech, G. Havas, and B.S. Majewski. [An optimal algorithm for generating minimal perfect hash functions. papers/chm92.pdf], Information Processing Letters, 43(5):257-264, 1992.


@ -5,14 +5,14 @@ Comparison Between BMZ And CHM Algorithms
----------------------------------------
**Features**
==Features==
**Constructing Minimal Perfect Hash Functions**
==Constructing Minimal Perfect Hash Functions==
**Memory Consumption**
==Memory Consumption==
**Run times**
==Run times==
----------------------------------------
[Home index.html]


@ -1,2 +1,4 @@
%! PreProc(html): '^%html% ' ''
%! PreProc(txt): '^%txt% ' ''
%! PostProc(html): "&" "&"
%! PostProc(txt): " " " "


@ -5,7 +5,7 @@ CMPH - C Minimal Perfect Hashing Library
-------------------------------------------------------------------
**Description**
==Description==
C Minimal Perfect Hashing Library is a portable LGPLed library to create and
to work with minimal perfect hash functions. The cmph library encapsulates the newest
@ -31,35 +31,35 @@ of the distinguishable features of cmph:
----------------------------------------
**Supported Algorithms**
==Supported Algorithms==
%html% - [BMZ Algorithm bmz.html].
%txt% - BMZ Algorithm.
A very fast algorithm based on cyclic random graphs to construct minimal
perfect hash functions in linear time. The resulting functions are not order preserving and
can be stored in only 4cn bytes, where c is between 0.93 and 1.15.
can be stored in only //4cn// bytes, where //c// is between 0.93 and 1.15.
%html% - [CHM Algorithm chm.html].
%txt% - CHM Algorithm.
An algorithm based on acyclic random graphs to construct minimal
perfect hash functions in linear time. The resulting functions are order preserving and
are stored in 4cn bytes, where c is greater than 2.
are stored in //4cn// bytes, where //c// is greater than 2.
%html% [Click Here comparison.html] to see a comparison of the supported algorithms.
----------------------------------------
**News for version 0.3**
==News for version 0.3==
- A new heuristic added to the BMZ algorithm makes it possible to generate an MPHF with only
24.61*n + O(1) bytes. The resulting function can be stored in 3.72*n bytes.
//24.6n + O(1)// bytes. The resulting function can be stored in //3.72n// bytes.
%html% [click here bmz.html] for details.
----------------------------------------
**Examples**
==Examples==
Using cmph is quite simple. Take a look.
@ -113,7 +113,7 @@ Using cmph is quite simple. Take a look.
```
--------------------------------------
**The cmph application**
==The cmph application==
cmph is the name of both the library and the utility
application that comes with this package. You can use the cmph
@ -157,16 +157,16 @@ utility.
keysfile line separated file with keys
```
**Additional Documentation**
==Additional Documentation==
[FAQ faq.html]
**Downloads**
==Downloads==
Use the project page at sourceforge: http://sf.net/projects/cmph
**License Stuff**
==License Stuff==
Code is under the LGPL.
----------------------------------------