Add docs directory for github.
This commit is contained in:
parent
815d089f34
commit
a250982ade
4
README
4
README
|
@ -302,7 +302,7 @@ FAQ (faq.html)
|
|||
Downloads
|
||||
=========
|
||||
|
||||
Use the project page at sourceforge: http://sf.net/projects/cmph
|
||||
Use the github releases page at: https://github.com/bonitao/cmph/releases
|
||||
|
||||
|
||||
License Stuff
|
||||
|
@ -322,5 +322,5 @@ Fabiano Cupertino Botelho (fc_botelho@users.sourceforge.net)
|
|||
|
||||
Nivio Ziviani (nivio@dcc.ufmg.br)
|
||||
|
||||
Last Updated: Fri Jun 6 17:16:57 2014
|
||||
Last Updated: Fri Dec 28 23:50:31 2018
|
||||
|
||||
|
|
|
@ -300,8 +300,7 @@ Minimum perfect hashing tool
|
|||
|
||||
==Downloads==
|
||||
|
||||
Use the project page at sourceforge: http://sf.net/projects/cmph
|
||||
|
||||
Use the github releases page at: https://github.com/bonitao/cmph/releases
|
||||
|
||||
==License Stuff==
|
||||
|
||||
|
|
|
@ -0,0 +1,315 @@
|
|||
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
|
||||
<HTML>
|
||||
<HEAD>
|
||||
<META NAME="generator" CONTENT="http://txt2tags.org">
|
||||
<LINK REL="stylesheet" TYPE="text/css" HREF="DOC.css">
|
||||
<TITLE>BDZ Algorithm</TITLE>
|
||||
</HEAD><BODY BGCOLOR="white" TEXT="black">
|
||||
<CENTER>
|
||||
<H1>BDZ Algorithm</H1>
|
||||
</CENTER>
|
||||
|
||||
|
||||
<HR NOSHADE SIZE=1>
|
||||
|
||||
<H2>Introduction</H2>
|
||||
|
||||
<P>
|
||||
The BDZ algorithm was designed by Fabiano C. Botelho, Djamal Belazzougui, Rasmus Pagh and Nivio Ziviani. It is a simple, efficient and practical algorithm, with near-optimal space usage, to generate a family <IMG ALIGN="bottom" SRC="figs/bdz/img8.png" BORDER="0" ALT=""> of PHFs and MPHFs. It is also referred to as the BPZ algorithm because of the work presented by Botelho, Pagh and Ziviani in <A HREF="#papers">[2</A>]. In Botelho's PhD dissertation <A HREF="#papers">[1</A>] it is also referred to as the RAM algorithm because it is more suitable for key sets that can be handled in internal memory.
|
||||
</P>
|
||||
<P>
|
||||
The BDZ algorithm uses <I>r</I>-uniform random hypergraphs given by function values of <I>r</I> uniform random hash functions on the input key set <I>S</I> for generating PHFs and MPHFs that require <I>O(n)</I> bits to be stored. A hypergraph is the generalization of a standard undirected graph where each edge connects <IMG ALIGN="middle" SRC="figs/bdz/img12.png" BORDER="0" ALT=""> vertices. This idea is not new, see e.g. <A HREF="#papers">[8</A>], but we have proceeded differently to achieve a space usage of <I>O(n)</I> bits rather than <I>O(n log n)</I> bits. Evaluation time for all schemes considered is constant. For <I>r=3</I> we obtain a space usage of approximately <I>2.6n</I> bits for an MPHF. More compact, and even simpler, representations can be achieved for larger <I>m</I>. For example, for <I>m=1.23n</I> we can get a space usage of <I>1.95n</I> bits.
|
||||
</P>
|
||||
<P>
|
||||
Our best MPHF space upper bound is within a factor of <I>2</I> from the information theoretical lower bound of approximately <I>1.44</I> bits per key. We have shown that the BDZ algorithm is far more practical than previous methods with proven space complexity, both because of its simplicity and because the constant factor of the space complexity is more than <I>6</I> times lower than that of its closest competitor, for plausible problem sizes. We verify the practicality experimentally, using slightly more space than in the mentioned theoretical bounds.
|
||||
</P>
|
||||
|
||||
<HR NOSHADE SIZE=1>
|
||||
|
||||
<H2>The Algorithm</H2>
|
||||
|
||||
<P>
|
||||
The BDZ algorithm is a three-step algorithm that generates PHFs and MPHFs based on random <I>r</I>-partite hypergraphs. This approach provides a much tighter analysis and is much simpler than the one presented in <A HREF="#papers">[3</A>], where it was implicit how to construct similar PHFs. The fastest and most compact functions are generated when <I>r=3</I>. In this case a PHF can be stored in approximately <I>1.95</I> bits per key and an MPHF in approximately <I>2.62</I> bits per key.
|
||||
</P>
|
||||
<P>
|
||||
Figure 1 gives an overview of the algorithm for <I>r=3</I>, taking as input a key set <IMG ALIGN="middle" SRC="figs/bdz/img22.png" BORDER="0" ALT=""> containing three English words, i.e., <I>S={who,band,the}</I>. The edge-oriented data structure proposed in <A HREF="#papers">[4</A>] is used to represent hypergraphs, where each edge is explicitly represented as an array of <I>r</I> vertices and, for each vertex <I>v</I>, there is a list of edges that are incident on <I>v</I>.
|
||||
</P>
|
||||
|
||||
<TABLE ALIGN="center" CELLPADDING="4">
|
||||
<TR>
|
||||
<TD><IMG ALIGN="middle" SRC="figs/bdz/img50.png" BORDER="0" ALT=""></TD>
|
||||
</TR>
|
||||
<TR>
|
||||
<TD><B>Figure 1:</B> (a) The mapping step generates a random acyclic <I>3</I>-partite hypergraph</TD>
|
||||
</TR>
|
||||
<TR>
|
||||
<TD>with <I>m=6</I> vertices and <I>n=3</I> edges and a list <IMG ALIGN="middle" SRC="figs/bdz/img4.png" BORDER="0" ALT=""> of edges obtained when we test</TD>
|
||||
</TR>
|
||||
<TR>
|
||||
<TD>whether the hypergraph is acyclic. (b) The assigning step builds an array <I>g</I> that</TD>
|
||||
</TR>
|
||||
<TR>
|
||||
<TD>maps values from <I>[0,5]</I> to <I>[0,3]</I> to uniquely assign an edge to a vertex. (c) The ranking</TD>
|
||||
</TR>
|
||||
<TR>
|
||||
<TD>step builds the data structure used to compute function <I>rank</I> in <I>O(1)</I> time.</TD>
|
||||
</TR>
|
||||
</TABLE>
|
||||
|
||||
<P>
|
||||
The <I>Mapping Step</I> in Figure 1(a) carries out two important tasks:
|
||||
</P>
|
||||
|
||||
<OL>
|
||||
<LI>It assumes that it is possible to find three uniform hash functions <I>h<sub>0</sub></I>, <I>h<sub>1</sub></I> and <I>h<sub>2</sub></I>, with ranges <I>{0,1}</I>, <I>{2,3}</I> and <I>{4,5}</I>, respectively. These functions build a one-to-one mapping of the key set <I>S</I> to the edge set <I>E</I> of a random acyclic <I>3</I>-partite hypergraph <I>G=(V,E)</I>, where <I>|V|=m=6</I> and <I>|E|=n=3</I>. In <A HREF="#papers">[1,2</A>] it is shown that it is possible to obtain such a hypergraph with probability tending to <I>1</I> as <I>n</I> tends to infinity whenever <I>m=cn</I> and <I>c > 1.22</I>. The value of <I>c</I> that minimizes the hypergraph size (and thereby the number of bits needed to represent the resulting functions) is in the range <I>(1.22,1.23)</I>. To illustrate the mapping, key "who" is mapped to edge <I>{h<sub>0</sub>("who"), h<sub>1</sub>("who"), h<sub>2</sub>("who")} = {1,3,5}</I>, key "band" is mapped to edge <I>{h<sub>0</sub>("band"), h<sub>1</sub>("band"), h<sub>2</sub>("band")} = {1,2,4}</I>, and key "the" is mapped to edge <I>{h<sub>0</sub>("the"), h<sub>1</sub>("the"), h<sub>2</sub>("the")} = {0,2,5}</I>.
|
||||
<P></P>
|
||||
<LI>It tests whether the resulting random <I>3</I>-partite hypergraph contains cycles by iteratively deleting edges connecting vertices of degree 1. The deleted edges are stored in the order of deletion in a list <IMG ALIGN="middle" SRC="figs/bdz/img4.png" BORDER="0" ALT=""> to be used in the assigning step. The first deleted edge in Figure 1(a) was <I>{1,2,4}</I>, the second one was <I>{1,3,5}</I> and the third one was <I>{0,2,5}</I>. If it ends with an empty graph, then the test succeeds, otherwise it fails.
|
||||
</OL>
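<P>
The acyclicity test in the second task can be sketched as a peeling procedure. The following is a minimal illustration in Python, not the CMPH implementation: the function name <I>peel</I> and the use of sets for the incidence lists are assumptions made for brevity. The order in which edges are deleted depends on the scan order, so it need not match the order shown in Figure 1(a).
</P>

```python
from collections import defaultdict


def peel(edges):
    """Return the deletion order (edge indices) or None if a cycle remains.

    `edges` is a list of vertex tuples. An edge can be deleted when it is
    the only edge incident on one of its vertices (a vertex of degree 1).
    """
    incident = defaultdict(set)
    for idx, edge in enumerate(edges):
        for v in edge:
            incident[v].add(idx)

    # Start from every vertex that currently has degree 1.
    stack = [v for v in incident if len(incident[v]) == 1]
    order = []
    while stack:
        v = stack.pop()
        if len(incident[v]) != 1:
            continue  # degree changed since v was pushed
        (idx,) = incident[v]
        order.append(idx)
        for u in edges[idx]:
            incident[u].discard(idx)
            if len(incident[u]) == 1:
                stack.append(u)

    # The test succeeds only if every edge was deleted.
    return order if len(order) == len(edges) else None


example = [(1, 3, 5), (1, 2, 4), (0, 2, 5)]  # who, band, the
order = peel(example)
```

<P>
For the three example edges the test succeeds and all edges are deleted. A hypergraph in which every vertex has degree 2 or more cannot be peeled, and the function returns None.
</P>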
|
||||
|
||||
<P>
|
||||
We now show how to use the Jenkins hash functions <A HREF="#papers">[7</A>] to implement the three hash functions <I>h<sub>i</sub></I>, which map values from <I>S</I> to <I>V<sub>i</sub></I>, where <IMG ALIGN="middle" SRC="figs/bdz/img52.png" BORDER="0" ALT="">. These functions are used to build a random <I>3</I>-partite hypergraph, where <IMG ALIGN="middle" SRC="figs/bdz/img53.png" BORDER="0" ALT=""> and <IMG ALIGN="middle" SRC="figs/bdz/img54.png" BORDER="0" ALT="">. Let <IMG ALIGN="middle" SRC="figs/bdz/img55.png" BORDER="0" ALT=""> be a Jenkins hash function for <IMG ALIGN="middle" SRC="figs/bdz/img56.png" BORDER="0" ALT="">, where
|
||||
<I>w=32 or 64</I> for 32-bit and 64-bit architectures, respectively.
|
||||
Let <I>H'</I> be an array of 3 <I>w</I>-bit values. The Jenkins hash function
|
||||
allows us to compute in parallel the three entries in <I>H'</I>
|
||||
and thereby the three hash functions <I>h<sub>i</sub></I>, as follows:
|
||||
</P>
|
||||
|
||||
<TABLE ALIGN="center" CELLPADDING="4">
|
||||
<TR>
|
||||
<TD><I>H' = h'(x)</I></TD>
|
||||
</TR>
|
||||
<TR>
|
||||
<TD><I>h<sub>0</sub>(x) = H'[0] mod</I> <IMG ALIGN="middle" SRC="figs/bdz/img136.png" BORDER="0" ALT=""></TD>
|
||||
</TR>
|
||||
<TR>
|
||||
<TD><I>h<sub>1</sub>(x) = H'[1] mod</I> <IMG ALIGN="middle" SRC="figs/bdz/img136.png" BORDER="0" ALT=""> <I>+</I> <IMG ALIGN="middle" SRC="figs/bdz/img136.png" BORDER="0" ALT=""></TD>
|
||||
</TR>
|
||||
<TR>
|
||||
<TD><I>h<sub>2</sub>(x) = H'[2] mod</I> <IMG ALIGN="middle" SRC="figs/bdz/img136.png" BORDER="0" ALT=""> <I>+ 2</I><IMG ALIGN="middle" SRC="figs/bdz/img136.png" BORDER="0" ALT=""></TD>
|
||||
</TR>
|
||||
</TABLE>
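<P>
The derivation of the three functions from one hash evaluation can be sketched as follows. This is only an illustration under stated assumptions: the quantity shown as an image in the table above is assumed to be the number of vertices in each part (called <I>part_size</I> here), and SHA-256 from Python's hashlib stands in for the Jenkins function, which is not reproduced in this page. The name <I>three_hashes</I> is hypothetical.
</P>

```python
import hashlib


def three_hashes(key, part_size):
    """Return (h0, h1, h2) for `key`, one vertex in each of three parts.

    Assumption: SHA-256 replaces the Jenkins hash; the first three 32-bit
    words of the digest play the role of the array H'.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    words = [int.from_bytes(digest[4 * i:4 * i + 4], "little") for i in range(3)]
    # h_i(x) = H'[i] mod part_size + i * part_size
    return tuple(words[i] % part_size + i * part_size for i in range(3))


edges = {key: three_hashes(key, 2) for key in ("who", "band", "the")}
```

<P>
With <I>part_size = 2</I> the three values fall in <I>{0,1}</I>, <I>{2,3}</I> and <I>{4,5}</I>, as in the example, although the actual vertices will differ from those in Figure 1 because a different hash function is used.
</P>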
|
||||
|
||||
<P>
|
||||
The <I>Assigning Step</I> in Figure 1(b) outputs a PHF that maps the key set <I>S</I> into the range <I>[0,m-1]</I> and is represented by an array <I>g</I> storing values from the range <I>[0,3]</I>. The array <I>g</I> allows us to select one out of the <I>3</I> vertices of a given edge, which is associated with a key <I>k</I>. A vertex for a key <I>k</I> is given by either <I>h<sub>0</sub>(k)</I>, <I>h<sub>1</sub>(k)</I> or <I>h<sub>2</sub>(k)</I>. The function <I>h<sub>i</sub>(k)</I> to be used for <I>k</I> is chosen by calculating <I>i = (g[h<sub>0</sub>(k)] + g[h<sub>1</sub>(k)] + g[h<sub>2</sub>(k)]) mod 3</I>. For instance, the values 1 and 4 represent the keys "who" and "band" because <I>i = (g[1] + g[3] + g[5]) mod 3 = 0</I> and <I>h<sub>0</sub>("who") = 1</I>, and <I>i = (g[1] + g[2] + g[4]) mod 3 = 2</I> and <I>h<sub>2</sub>("band") = 4</I>, respectively. Let <I>Visited</I> be a boolean vector of size <I>m</I> that indicates whether a vertex has been visited. The assigning step first initializes <I>g[i]=3</I>, to mark every vertex as unassigned, and <I>Visited[i]=false</I>, <IMG ALIGN="middle" SRC="figs/bdz/img88.png" BORDER="0" ALT="">. Then, for each edge <IMG ALIGN="middle" SRC="figs/bdz/img90.png" BORDER="0" ALT=""> from tail to head, it looks for the first vertex <I>u</I> belonging to <I>e</I> that has not yet been visited. This is a sufficient condition for success <A HREF="#papers">[1,2,8</A>]. Let <I>j</I> be the index of <I>u</I> in <I>e</I> for <I>j</I> in the range <I>[0,2]</I>. Then, it assigns <IMG ALIGN="middle" SRC="figs/bdz/img95.png" BORDER="0" ALT="">. Whenever it passes through a vertex <I>u</I> from <I>e</I>, if <I>u</I> has not yet been visited, it sets <I>Visited[u] = true</I>.
|
||||
</P>
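<P>
The assigning step described above can be sketched in Python as follows. The exact assignment formula is given as an image in the text and is not reproduced here; the sketch assumes the value is chosen so that the sum of the three <I>g</I> entries of the edge, taken modulo 3, equals <I>j</I>, which is what the selection rule requires. The name <I>assign</I> is hypothetical.
</P>

```python
def assign(num_vertices, deletion_order):
    """Build the array g from the edges in their order of deletion.

    Edges are processed from tail to head, that is, in reverse order of
    deletion. The value 3 marks an unassigned vertex.
    """
    g = [3] * num_vertices
    visited = [False] * num_vertices
    for edge in reversed(deletion_order):
        # Index j of the first vertex of the edge that is not yet visited.
        j = next(idx for idx, v in enumerate(edge) if not visited[v])
        u = edge[j]
        others = sum(g[v] for v in edge if v != u)
        # Assumed rule: make (g[e0] + g[e1] + g[e2]) mod 3 equal to j.
        g[u] = (j - others) % 3
        for v in edge:
            visited[v] = True
    return g


# Deletion order quoted for Figure 1(a): band, who, the.
deleted = [(1, 2, 4), (1, 3, 5), (0, 2, 5)]
g = assign(6, deleted)
```

<P>
For the example this yields an array that reproduces the two selections quoted in the paragraph above: the sum for "who" is 0 modulo 3, selecting vertex 1, and the sum for "band" is 2 modulo 3, selecting vertex 4.
</P>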
|
||||
<P>
|
||||
If we stop the BDZ algorithm in the assigning step we obtain a PHF with range <I>[0,m-1]</I>. The PHF has the following form: <I>phf(x) = h<sub>i(x)</sub>(x)</I>, where key <I>x</I> is in <I>S</I> and <I>i(x) = (g[h<sub>0</sub>(x)] + g[h<sub>1</sub>(x)] + g[h<sub>2</sub>(x)]) mod 3</I>. In this case we do not need information for ranking and can set <I>g[i] = 0</I> whenever <I>g[i]</I> is equal to <I>3</I>, where <I>i</I> is in the range <I>[0,m-1]</I>. Therefore, the range of the values stored in <I>g</I> is narrowed from <I>[0,3]</I> to <I>[0,2]</I>. By using arithmetic coding on blocks of values (see <A HREF="#papers">[1,2</A>] for details), or any compression technique that allows random access in constant time to an array of compressed values <A HREF="#papers">[5,6,12</A>], we can store the resulting PHFs in <I>m log 3 = cn log 3</I> bits, where <I>c > 1.22</I>. For <I>c = 1.23</I>, the space requirement is <I>1.95n</I> bits.
|
||||
</P>
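<P>
One simple way to approach the <I>log 3</I> bits per entry mentioned above is to pack blocks of five values from <I>[0,2]</I> into one byte, since 3<sup>5</sup> = 243 fits in 8 bits. The sketch below illustrates that idea only; it is an assumption chosen for illustration and is not claimed to be the encoding used in the references. The names <I>pack_trits</I> and <I>get_trit</I> are hypothetical.
</P>

```python
def pack_trits(values):
    """Pack values from {0, 1, 2} into bytes, five values per byte."""
    packed = bytearray()
    for start in range(0, len(values), 5):
        byte = 0
        for offset, value in enumerate(values[start:start + 5]):
            byte += value * 3 ** offset  # base-3 digit at position `offset`
        packed.append(byte)
    return bytes(packed)


def get_trit(packed, index):
    """Random access to entry `index` without unpacking the whole array."""
    return (packed[index // 5] // 3 ** (index % 5)) % 3


g = [0, 0, 0, 0, 2, 0, 1, 2, 2, 1, 0, 2]
packed = pack_trits(g)
restored = [get_trit(packed, i) for i in range(len(g))]
```

<P>
This block encoding uses 8/5 = 1.6 bits per entry, slightly more than <I>log 3</I>, which is about 1.585 bits. With <I>c = 1.23</I> that corresponds to roughly 1.97 bits per key instead of the 1.95 bits per key quoted above.
</P>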
|
||||
<P>
|
||||
The <I>Ranking Step</I> in Figure 1(c) outputs a data structure that narrows the range of a PHF generated in the assigning step from <I>[0,m-1]</I> to <I>[0,n-1]</I>, thereby producing an MPHF. The data structure allows us to compute in constant time a function <I>rank</I> from <I>[0,m-1]</I> to <I>[0,n-1]</I> that counts the number of assigned positions before a given position <I>v</I> in <I>g</I>. For instance, <I>rank(4) = 2</I> because the positions <I>0</I> and <I>1</I> are assigned since <I>g[0]</I> and <I>g[1]</I> are not equal to <I>3</I>.
|
||||
</P>
|
||||
<P>
|
||||
For the implementation of the ranking step we have borrowed a simple and efficient implementation from <A HREF="#papers">[10</A>]. It requires <IMG ALIGN="middle" SRC="figs/bdz/img111.png" BORDER="0" ALT=""> additional bits of space, where <IMG ALIGN="middle" SRC="figs/bdz/img112.png" BORDER="0" ALT="">, and is obtained by storing explicitly the <I>rank</I> of every <I>k</I>th index in a rankTable, where <IMG ALIGN="middle" SRC="figs/bdz/img114.png" BORDER="0" ALT="">. The larger <I>k</I> is, the more compact the resulting MPHF is. Therefore, users can trade off space for evaluation time by setting <I>k</I> appropriately in the implementation. We only allow values for <I>k</I> that are powers of two (i.e., <I>k=2<sup>b<sub>k</sub></sup></I> for some constant <I>b<sub>k</sub></I>) in order to replace the expensive division and modulo operations by bit-shift and bitwise "and" operations, respectively. We have used <I>k=256</I> in the experiments for generating more succinct MPHFs. We remark that it is still possible to obtain a more compact data structure by using the results presented in <A HREF="#papers">[9,11</A>], but at the cost of a much more complex implementation.
|
||||
</P>
|
||||
<P>
|
||||
We need to use an additional lookup table <I>T<sub>r</sub></I> to guarantee the constant evaluation time of <I>rank(u)</I>. Let us illustrate how <I>rank(u)</I> is computed using both the rankTable and the lookup table <I>T<sub>r</sub></I>. We first look up the rank of the largest precomputed index <I>v</I> lower than or equal to <I>u</I> in the rankTable, and use <I>T<sub>r</sub></I> to count the number of assigned vertices from position <I>v</I> to <I>u-1</I>. The lookup table <I>T<sub>r</sub></I> allows us to count in constant time the number of assigned vertices in <IMG ALIGN="middle" SRC="figs/bdz/img122.png" BORDER="0" ALT=""> bits, where <IMG ALIGN="middle" SRC="figs/bdz/img112.png" BORDER="0" ALT="">. Thus the actual evaluation time is <IMG ALIGN="middle" SRC="figs/bdz/img123.png" BORDER="0" ALT="">. For simplicity and without loss of generality we let <IMG ALIGN="middle" SRC="figs/bdz/img124.png" BORDER="0" ALT=""> be a multiple of the number of bits <IMG ALIGN="middle" SRC="figs/bdz/img125.png" BORDER="0" ALT=""> used to encode each entry of <I>g</I>. As the values in <I>g</I> come from the range <I>[0,3]</I>,
|
||||
then <IMG ALIGN="middle" SRC="figs/bdz/img126.png" BORDER="0" ALT=""> bits and we have tried <IMG ALIGN="middle" SRC="figs/bdz/img124.png" BORDER="0" ALT=""> equal to <I>8</I> and <I>16</I>. We would expect <IMG ALIGN="middle" SRC="figs/bdz/img124.png" BORDER="0" ALT=""> equal to 16 to provide a faster evaluation time because we would need to carry out fewer lookups in <I>T<sub>r</sub></I>. However, for both values the lookup table <I>T<sub>r</sub></I> fits entirely in the CPU cache and we did not observe any significant difference in the evaluation times. Therefore we settled for the value <I>8</I>. We remark that each value of <I>r</I> requires a different lookup table <I>T<sub>r</sub></I> that can be generated a priori.
|
||||
</P>
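<P>
The role of the rankTable can be sketched as follows. This is a simplified illustration: it stores the rank of every <I>k</I>th index, as described above, but counts the remaining entries by scanning <I>g</I> directly rather than through the byte-wise lookup table <I>T<sub>r</sub></I>. The function names and the small value <I>k=2</I> are assumptions chosen to keep the example readable.
</P>

```python
def build_rank_table(g, k):
    """Store the number of assigned entries before every k-th index of g."""
    table = []
    count = 0
    for index, value in enumerate(g):
        if index % k == 0:
            table.append(count)
        if value != 3:  # 3 marks an unassigned vertex
            count += 1
    return table


def rank(g, table, k, u):
    """Count the assigned entries of g before position u."""
    base = (u // k) * k  # largest precomputed index not above u
    # A real implementation would replace this scan by lookups in T_r.
    return table[u // k] + sum(1 for value in g[base:u] if value != 3)


g = [0, 0, 3, 3, 2, 3]
k = 2
table = build_rank_table(g, k)
```

<P>
With this array, <I>rank(4)</I> evaluates to 2, in agreement with the example given for the ranking step. A larger <I>k</I> shrinks the table but lengthens the scan between two precomputed positions.
</P>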
|
||||
<P>
|
||||
The resulting MPHFs have the following form: <I>h(x) = rank(phf(x))</I>. Consequently, we cannot get rid of the ranking information by replacing the values 3 by 0 in the entries of <I>g</I>. In this case each entry in the array <I>g</I> is encoded with <I>2</I> bits and we need <IMG ALIGN="middle" SRC="figs/bdz/img133.png" BORDER="0" ALT=""> additional bits to compute function <I>rank</I> in constant time. Then, the total space to store the resulting functions is <IMG ALIGN="middle" SRC="figs/bdz/img134.png" BORDER="0" ALT=""> bits. By using <I>c = 1.23</I> and <IMG ALIGN="middle" SRC="figs/bdz/img135.png" BORDER="0" ALT=""> we have obtained MPHFs that require approximately <I>2.62</I> bits per key to be stored.
|
||||
</P>
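<P>
Putting the pieces together, evaluating <I>h(x) = rank(phf(x))</I> on the running example can be sketched as below. The hash values are the ones quoted for "who", "band" and "the" in the mapping step. The array <I>G</I> is an assumption: it is one array consistent with the values quoted in the text, since Figure 1 is not reproduced here. The rank is computed by a plain scan for clarity.
</P>

```python
# Edges (h0, h1, h2) quoted in the mapping step for the three keys.
EDGES = {"who": (1, 3, 5), "band": (1, 2, 4), "the": (0, 2, 5)}

# Assumed g array, consistent with the worked values in the text.
G = [0, 0, 3, 3, 2, 3]


def phf(key):
    """Select one of the three vertices of the key's edge."""
    vertices = EDGES[key]
    i = sum(G[v] for v in vertices) % 3
    return vertices[i]


def rank(position):
    """Number of assigned entries (values other than 3) before position."""
    return sum(1 for value in G[:position] if value != 3)


def mphf(key):
    return rank(phf(key))


values = {key: mphf(key) for key in EDGES}
```

<P>
The PHF maps the three keys to the distinct vertices 1, 4 and 0 in <I>[0,5]</I>, and the ranking step compresses these to the distinct values 1, 2 and 0 in <I>[0,2]</I>.
</P>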
|
||||
|
||||
<HR NOSHADE SIZE=1>
|
||||
|
||||
<H2>Memory Consumption</H2>
|
||||
|
||||
<P>
|
||||
Now we detail the memory consumption to generate and to store minimal perfect hash functions
|
||||
using the BDZ algorithm. The structures responsible for memory consumption are the
|
||||
following:
|
||||
</P>
|
||||
|
||||
<UL>
|
||||
<LI>3-graph:
|
||||
<OL>
|
||||
<LI><B>first</B>: is a vector that stores <I>cn</I> integer numbers, each one representing
|
||||
the first edge (index in the vector edges) in the list of
|
||||
incident edges of each vertex. The integer numbers are 4 bytes long. Therefore,
|
||||
the vector first is stored in <I>4cn</I> bytes.
|
||||
<P></P>
|
||||
<LI><B>edges</B>: is a vector to represent the edges of the graph. As each edge
|
||||
is composed of three vertices, each entry stores three integer numbers
|
||||
of 4 bytes that represent the vertices. As there are <I>n</I> edges, the
|
||||
vector edges is stored in <I>12n</I> bytes.
|
||||
<P></P>
|
||||
<LI><B>next</B>: given a vertex <IMG ALIGN="bottom" SRC="figs/img139.png" BORDER="0" ALT="">, we can discover the edges that
|
||||
contain <IMG ALIGN="bottom" SRC="figs/img139.png" BORDER="0" ALT=""> following its list of incident edges,
|
||||
which starts on first[<IMG ALIGN="bottom" SRC="figs/img139.png" BORDER="0" ALT="">] and the next
|
||||
edges are given by next[...first[<IMG ALIGN="bottom" SRC="figs/img139.png" BORDER="0" ALT="">]...]. Therefore, the vectors first and next represent
|
||||
the linked lists of edges of each vertex. As there are three vertices for each edge,
|
||||
when an edge is inserted in the 3-graph, it must be inserted in the three linked lists
|
||||
of the vertices in its composition. Therefore, there are <I>3n</I> entries of integer
|
||||
numbers in the vector next, so it is stored in <I>4*3n = 12n</I> bytes.
|
||||
<P></P>
|
||||
<LI><B>Vertices degree (vert_degree vector)</B>: is a vector of <I>cn</I> bytes
|
||||
that represents the degree of each vertex. We can use just one byte for each
|
||||
vertex because the 3-graph is sparse, since it has more vertices than edges.
|
||||
Therefore, the vertices degree is represented in <I>cn</I> bytes.
|
||||
<P></P>
|
||||
</OL>
|
||||
<LI>Acyclicity test:
|
||||
<OL>
|
||||
<LI><B>List of deleted edges obtained when we test whether the 3-graph is a forest (queue vector)</B>:
|
||||
is a vector of <I>n</I> integer numbers containing indexes of vector edges. Therefore, it
|
||||
requires <I>4n</I> bytes in internal memory.
|
||||
<P></P>
|
||||
<LI><B>Marked edges in the acyclicity test (marked_edges vector)</B>:
|
||||
is a bit vector of <I>n</I> bits to indicate the edges that have already been deleted during
|
||||
the acyclicity test. Therefore, it requires <I>n/8</I> bytes in internal memory.
|
||||
<P></P>
|
||||
</OL>
|
||||
<LI>MPHF description
|
||||
<OL>
|
||||
<LI><B>function <I>g</I></B>: is represented by a vector of <I>2cn</I> bits. Therefore, it is
|
||||
stored in <I>0.25cn</I> bytes
|
||||
<LI><B>ranktable</B>: is a lookup table used to store some precomputed ranking information.
|
||||
It has <I>(cn)/(2^b)</I> entries of 4-byte integer numbers. Therefore it is stored in
|
||||
<I>(4cn)/(2^b)</I> bytes. The larger is <I>b</I>, the more compact are the resulting MPHFs and
|
||||
the slower are the functions. So <I>b</I> imposes a trade-off between space and time.
|
||||
<LI><B>Total</B>: 0.25cn + (4cn)/(2^b) bytes
|
||||
</OL>
|
||||
</UL>
|
||||
|
||||
<P>
|
||||
Thus, the total memory consumption of BDZ algorithm for generating a minimal
|
||||
perfect hash function (MPHF) is: <I>(28.125 + 5c)n + 0.25cn + (4cn)/(2^b) + O(1)</I> bytes.
|
||||
As the value of constant <I>c</I> may be larger than or equal to 1.23 we have:
|
||||
</P>
|
||||
|
||||
<TABLE ALIGN="center" BORDER="1" CELLPADDING="4">
|
||||
<TR>
|
||||
<TH><I>c</I></TH>
|
||||
<TH><I>b</I></TH>
|
||||
<TH>Memory consumption to generate a MPHF (in bytes)</TH>
|
||||
</TR>
|
||||
<TR>
|
||||
<TD>1.23</TD>
|
||||
<TD ALIGN="center"><I>7</I></TD>
|
||||
<TD ALIGN="center"><I>34.62n + O(1)</I></TD>
|
||||
</TR>
|
||||
<TR>
|
||||
<TD>1.23</TD>
|
||||
<TD ALIGN="center"><I>8</I></TD>
|
||||
<TD ALIGN="center"><I>34.60n + O(1)</I></TD>
|
||||
</TR>
|
||||
</TABLE>
|
||||
|
||||
<TABLE ALIGN="center" CELLPADDING="4">
|
||||
<TR>
|
||||
<TD><B>Table 1:</B> Memory consumption to generate a MPHF using the BDZ algorithm.</TD>
|
||||
</TR>
|
||||
</TABLE>
|
||||
|
||||
<P>
|
||||
Now we present the memory consumption to store the resulting function.
|
||||
So we have:
|
||||
</P>
|
||||
|
||||
<TABLE ALIGN="center" BORDER="1" CELLPADDING="4">
|
||||
<TR>
|
||||
<TH><I>c</I></TH>
|
||||
<TH><I>b</I></TH>
|
||||
<TH>Memory consumption to store a MPHF (in bits)</TH>
|
||||
</TR>
|
||||
<TR>
|
||||
<TD>1.23</TD>
|
||||
<TD ALIGN="center"><I>7</I></TD>
|
||||
<TD ALIGN="center"><I>2.77n + O(1)</I></TD>
|
||||
</TR>
|
||||
<TR>
|
||||
<TD>1.23</TD>
|
||||
<TD ALIGN="center"><I>8</I></TD>
|
||||
<TD ALIGN="center"><I>2.61n + O(1)</I></TD>
|
||||
</TR>
|
||||
</TABLE>
|
||||
|
||||
<TABLE ALIGN="center" CELLPADDING="4">
|
||||
<TR>
|
||||
<TD><B>Table 2:</B> Memory consumption to store a MPHF generated by the BDZ algorithm.</TD>
|
||||
</TR>
|
||||
</TABLE>
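<P>
The entries of Tables 1 and 2 follow from the per-structure sizes listed above. The short calculation below recomputes them; the function names are hypothetical, and the storage figure assumes, as stated in the list, 2 bits per entry of <I>g</I> and 4-byte (32-bit) ranktable entries.
</P>

```python
def generation_bytes_per_key(c, b):
    """Bytes per key needed while generating an MPHF (Table 1)."""
    graph = 4 * c + 12 + 12 + c      # first, edges, next, vert_degree
    acyclicity = 4 + 1 / 8           # queue, marked_edges
    description = 0.25 * c + 4 * c / 2 ** b  # g and ranktable
    return graph + acyclicity + description


def storage_bits_per_key(c, b):
    """Bits per key needed to store the resulting MPHF (Table 2)."""
    return 2 * c + 32 * c / 2 ** b   # 2-bit g entries plus 32-bit ranks


table1 = [generation_bytes_per_key(1.23, b) for b in (7, 8)]
table2 = [storage_bits_per_key(1.23, b) for b in (7, 8)]
```

<P>
For <I>c = 1.23</I> this gives about 34.62 and 34.60 bytes per key for generation, and about 2.77 and 2.61 bits per key for storage, with <I>b = 7</I> and <I>b = 8</I> respectively.
</P>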
|
||||
|
||||
<HR NOSHADE SIZE=1>
|
||||
|
||||
<H2>Experimental Results</H2>
|
||||
|
||||
<P>
|
||||
Experimental results to compare the BDZ algorithm with the other ones in the CMPH
|
||||
library are presented in Botelho, Pagh and Ziviani <A HREF="#papers">[1,2</A>].
|
||||
</P>
|
||||
|
||||
<HR NOSHADE SIZE=1>
|
||||
|
||||
<A NAME="papers"></A>
|
||||
<H2>Papers</H2>
|
||||
|
||||
<OL>
|
||||
<LI><A HREF="http://www.dcc.ufmg.br/~fbotelho">F. C. Botelho</A>. <A HREF="papers/thesis.pdf">Near-Optimal Space Perfect Hashing Algorithms</A>. <I>PhD. Thesis</I>, <I>Department of Computer Science</I>, <I>Federal University of Minas Gerais</I>, September 2008. Supervised by <A HREF="http://www.dcc.ufmg.br/~nivio">N. Ziviani</A>.
|
||||
<P></P>
|
||||
<LI><A HREF="http://www.dcc.ufmg.br/~fbotelho">F. C. Botelho</A>, <A HREF="http://www.itu.dk/~pagh/">R. Pagh</A>, <A HREF="http://www.dcc.ufmg.br/~nivio">N. Ziviani</A>. <A HREF="papers/wads07.pdf">Simple and space-efficient minimal perfect hash functions</A>. <I>In Proceedings of the 10th International Workshop on Algorithms and Data Structures (WADS'07),</I> Springer-Verlag Lecture Notes in Computer Science, vol. 4619, Halifax, Canada, August 2007, 139-150.
|
||||
<P></P>
|
||||
<LI>B. Chazelle, J. Kilian, R. Rubinfeld, and A. Tal. The bloomier filter: An efficient data structure for static support lookup tables. <I>In Proceedings of the 15th annual ACM-SIAM symposium on Discrete algorithms (SODA'04)</I>, pages 30–39, Philadelphia, PA, USA, 2004. Society for Industrial and Applied Mathematics.
|
||||
<P></P>
|
||||
<LI>J. Ebert. A versatile data structure for edge-oriented graph algorithms. <I>Communications of the ACM</I>, (30):513–519, 1987.
|
||||
<P></P>
|
||||
<LI>K. Fredriksson and F. Nikitin. Simple compression code supporting random access and fast string matching. <I>In Proceedings of the 6th International Workshop on Efficient and Experimental Algorithms (WEA’07)</I>, pages 203–216, 2007.
|
||||
<P></P>
|
||||
<LI>R. Gonzalez and G. Navarro. Statistical encoding of succinct data structures. <I>In Proceedings of the 19th Annual Symposium on Combinatorial Pattern Matching (CPM’06)</I>, pages 294–305, 2006.
|
||||
<P></P>
|
||||
<LI>B. Jenkins. Algorithm alley: Hash functions. <I>Dr. Dobb's Journal of Software Tools</I>, 22(9), September 1997. Extended version available at <A HREF="http://burtleburtle.net/bob/hash/doobs.html">http://burtleburtle.net/bob/hash/doobs.html</A>.
|
||||
<P></P>
|
||||
<LI>B.S. Majewski, N.C. Wormald, G. Havas, and Z.J. Czech. A family of perfect hashing methods. <I>The Computer Journal</I>, 39(6):547–554, 1996.
|
||||
<P></P>
|
||||
<LI>D. Okanohara and K. Sadakane. Practical entropy-compressed rank/select dictionary. <I>In Proceedings of the Workshop on Algorithm Engineering and Experiments (ALENEX’07)</I>, 2007.
|
||||
<P></P>
|
||||
<LI><A HREF="http://www.itu.dk/~pagh/">R. Pagh</A>. Low redundancy in static dictionaries with constant query time. <I>SIAM Journal on Computing</I>, 31(2):353–363, 2001.
|
||||
<P></P>
|
||||
<LI>R. Raman, V. Raman, and S. S. Rao. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. <I>In Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms (SODA’02)</I>, pages 233–242, Philadelphia PA, USA, 2002. Society for Industrial and Applied Mathematics.
|
||||
<P></P>
|
||||
<LI>K. Sadakane and R. Grossi. Squeezing succinct data structures into entropy bounds. <I>In Proceedings of the 17th annual ACM-SIAM symposium on Discrete algorithms (SODA’06)</I>, pages 1230–1239, 2006.
|
||||
</OL>
|
||||
|
||||
<HR NOSHADE SIZE=1>
|
||||
|
||||
<TABLE ALIGN="center" CELLPADDING="4">
|
||||
<TR>
|
||||
<TD><A HREF="index.html">Home</A></TD>
|
||||
<TD><A HREF="chd.html">CHD</A></TD>
|
||||
<TD><A HREF="bdz.html">BDZ</A></TD>
|
||||
<TD><A HREF="bmz.html">BMZ</A></TD>
|
||||
<TD><A HREF="chm.html">CHM</A></TD>
|
||||
<TD><A HREF="brz.html">BRZ</A></TD>
|
||||
<TD><A HREF="fch.html">FCH</A></TD>
|
||||
</TR>
|
||||
</TABLE>
|
||||
|
||||
<HR NOSHADE SIZE=1>
|
||||
|
||||
<P>
|
||||
Enjoy!
|
||||
</P>
|
||||
<P>
|
||||
<A HREF="mailto:davi@users.sourceforge.net">Davi de Castro Reis</A>
|
||||
</P>
|
||||
<P>
|
||||
<A HREF="mailto:db8192@users.sourceforge.net">Djamel Belazzougui</A>
|
||||
</P>
|
||||
<P>
|
||||
<A HREF="mailto:fc_botelho@users.sourceforge.net">Fabiano Cupertino Botelho</A>
|
||||
</P>
|
||||
<P>
|
||||
<A HREF="mailto:nivio@dcc.ufmg.br">Nivio Ziviani</A>
|
||||
</P>
|
||||
<script type="text/javascript">
|
||||
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
|
||||
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
|
||||
</script>
|
||||
<script type="text/javascript">
|
||||
try {
|
||||
var pageTracker = _gat._getTracker("UA-7698683-2");
|
||||
pageTracker._trackPageview();
|
||||
} catch(err) {}</script>
|
||||
|
||||
<!-- html code generated by txt2tags 2.6 (http://txt2tags.org) -->
|
||||
<!-- cmdline: txt2tags -t html -i BDZ.t2t -o docs/bdz.html -->
|
||||
</BODY></HTML>
|
|
@ -0,0 +1,581 @@
|
|||
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
|
||||
<HTML>
|
||||
<HEAD>
|
||||
<META NAME="generator" CONTENT="http://txt2tags.org">
|
||||
<LINK REL="stylesheet" TYPE="text/css" HREF="DOC.css">
|
||||
<TITLE>BMZ Algorithm</TITLE>
|
||||
</HEAD><BODY BGCOLOR="white" TEXT="black">
|
||||
<CENTER>
|
||||
<H1>BMZ Algorithm</H1>
|
||||
</CENTER>
|
||||
|
||||
|
||||
<HR NOSHADE SIZE=1>
|
||||
|
||||
<H2>History</H2>
|
||||
|
||||
<P>
|
||||
At the end of 2003, professor <A HREF="http://www.dcc.ufmg.br/~nivio">Nivio Ziviani</A> was
|
||||
finishing the second edition of his <A HREF="http://www.dcc.ufmg.br/algoritmos/">book</A>.
|
||||
While writing the <A HREF="http://www.dcc.ufmg.br/algoritmos/">book</A>,
|
||||
professor <A HREF="http://www.dcc.ufmg.br/~nivio">Nivio Ziviani</A> studied the problem of generating
|
||||
<A HREF="concepts.html">minimal perfect hash functions</A>
|
||||
(if you are not familiar with this problem, see <A HREF="#papers">[1</A>]<A HREF="#papers">[2</A>]).
|
||||
Professor <A HREF="http://www.dcc.ufmg.br/~nivio">Nivio Ziviani</A> coded a modified version of
|
||||
the <A HREF="chm.html">CHM algorithm</A>, which was proposed by
|
||||
Czech, Havas and Majewski, and put it in his <A HREF="http://www.dcc.ufmg.br/algoritmos/">book</A>.
|
||||
The <A HREF="chm.html">CHM algorithm</A> is based on acyclic random graphs to generate
|
||||
<A HREF="concepts.html">order preserving minimal perfect hash functions</A> in linear time.
|
||||
Professor <A HREF="http://www.dcc.ufmg.br/~nivio">Nivio Ziviani</A>
|
||||
asked himself: why must the random graph
|
||||
be acyclic? In the modified version available in his <A HREF="http://www.dcc.ufmg.br/algoritmos/">book</A> he got rid of this restriction.
|
||||
</P>
|
||||
<P>
|
||||
The modification presented a problem: it was impossible to generate minimal perfect hash functions
|
||||
for sets with more than 1000 keys.
|
||||
At the same time, <A HREF="http://www.dcc.ufmg.br/~fbotelho">Fabiano C. Botelho</A>,
|
||||
a master's degree student at the <A HREF="http://www.dcc.ufmg.br">Department of Computer Science</A> of the
|
||||
<A HREF="http://www.ufmg.br">Federal University of Minas Gerais</A>,
|
||||
started to be advised by <A HREF="http://www.dcc.ufmg.br/~nivio">Nivio Ziviani</A> who presented the problem
|
||||
to <A HREF="http://www.dcc.ufmg.br/~fbotelho">Fabiano</A>.
|
||||
</P>
|
||||
<P>
|
||||
During the master's program, <A HREF="http://www.dcc.ufmg.br/~fbotelho">Fabiano</A> and
|
||||
<A HREF="http://www.dcc.ufmg.br/~nivio">Nivio Ziviani</A> faced lots of problems.
|
||||
In April 2004, <A HREF="http://www.dcc.ufmg.br/~fbotelho">Fabiano</A> was talking with a
|
||||
friend of his (David Menoti) about the problems
|
||||
and many ideas appeared.
|
||||
The ideas were implemented and a very fast algorithm to generate
|
||||
minimal perfect hash functions was designed.
|
||||
We refer to the algorithm as <B>BMZ</B>, because it was conceived by Fabiano C. <B>B</B>otelho,
|
||||
David <B>M</B>enoti and Nivio <B>Z</B>iviani. The algorithm is described in <A HREF="#papers">[1</A>].
|
||||
To analyse the BMZ algorithm we needed some results from random graph theory, so
|
||||
we invited professor <A HREF="http://www.ime.usp.br/~yoshi">Yoshiharu Kohayakawa</A> to help us.
|
||||
The final description and analysis of BMZ algorithm is presented in <A HREF="#papers">[2</A>].
|
||||
</P>
|
||||
|
||||
<HR NOSHADE SIZE=1>
|
||||
|
||||
<H2>The Algorithm</H2>
|
||||
|
||||
<P>
|
||||
The BMZ algorithm shares several features with the <A HREF="chm.html">CHM algorithm</A>.
|
||||
In particular, BMZ algorithm is also
|
||||
based on the generation of random graphs <IMG ALIGN="middle" SRC="figs/img27.png" BORDER="0" ALT="">, where <IMG ALIGN="bottom" SRC="figs/img28.png" BORDER="0" ALT=""> is in
|
||||
one-to-one correspondence with the key set <IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT=""> for which we wish to
|
||||
generate a <A HREF="concepts.html">minimal perfect hash function</A>.
|
||||
The two main differences between BMZ algorithm and CHM algorithm
|
||||
are as follows: (<I>i</I>) BMZ algorithm generates random
|
||||
graphs <IMG ALIGN="middle" SRC="figs/img27.png" BORDER="0" ALT=""> with <IMG ALIGN="middle" SRC="figs/img29.png" BORDER="0" ALT=""> and <IMG ALIGN="middle" SRC="figs/img30.png" BORDER="0" ALT="">, where <IMG ALIGN="middle" SRC="figs/img31.png" BORDER="0" ALT="">,
|
||||
and hence <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT=""> necessarily contains cycles,
|
||||
while CHM algorithm generates <I>acyclic</I> random
|
||||
graphs <IMG ALIGN="middle" SRC="figs/img27.png" BORDER="0" ALT=""> with <IMG ALIGN="middle" SRC="figs/img29.png" BORDER="0" ALT=""> and <IMG ALIGN="middle" SRC="figs/img30.png" BORDER="0" ALT="">,
|
||||
with a greater number of vertices: <IMG ALIGN="middle" SRC="figs/img33.png" BORDER="0" ALT="">;
|
||||
(<I>ii</I>) CHM algorithm generates <A HREF="concepts.html">order preserving minimal perfect hash functions</A>
|
||||
while BMZ algorithm does not preserve order. Thus, BMZ algorithm improves
|
||||
the space requirement at the expense of generating functions that are not
|
||||
order preserving.
|
||||
</P>
|
||||
<P>
|
||||
Suppose <IMG ALIGN="bottom" SRC="figs/img14.png" BORDER="0" ALT=""> is a universe of <I>keys</I>.
Let <IMG ALIGN="middle" SRC="figs/img17.png" BORDER="0" ALT=""> be a set of <IMG ALIGN="bottom" SRC="figs/img8.png" BORDER="0" ALT=""> keys from <IMG ALIGN="bottom" SRC="figs/img14.png" BORDER="0" ALT="">.
Let us show how the BMZ algorithm constructs a minimal perfect hash function <IMG ALIGN="bottom" SRC="figs/img7.png" BORDER="0" ALT="">.
We make use of two auxiliary random functions <IMG ALIGN="middle" SRC="figs/img41.png" BORDER="0" ALT=""> and <IMG ALIGN="middle" SRC="figs/img55.png" BORDER="0" ALT="">,
where <IMG ALIGN="middle" SRC="figs/img56.png" BORDER="0" ALT=""> for some suitably chosen integer <IMG ALIGN="bottom" SRC="figs/img57.png" BORDER="0" ALT="">,
where <IMG ALIGN="middle" SRC="figs/img58.png" BORDER="0" ALT="">. We build a random graph <IMG ALIGN="middle" SRC="figs/img59.png" BORDER="0" ALT=""> on <IMG ALIGN="bottom" SRC="figs/img60.png" BORDER="0" ALT="">,
whose edge set is <IMG ALIGN="middle" SRC="figs/img61.png" BORDER="0" ALT="">. There is an edge in <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT=""> for each
key in the set of keys <IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT="">.
</P>
<P>
In what follows, we shall be interested in the <I>2-core</I> of
the random graph <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT="">, that is, the maximal subgraph
of <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT=""> with minimal degree at
least 2 (see <A HREF="#papers">[2</A>] for details).
Because of its importance in our context, we call the 2-core the
<I>critical</I> subgraph of <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT=""> and denote it by <IMG ALIGN="middle" SRC="figs/img63.png" BORDER="0" ALT="">.
The vertices and edges in <IMG ALIGN="middle" SRC="figs/img63.png" BORDER="0" ALT=""> are said to be <I>critical</I>.
We let <IMG ALIGN="middle" SRC="figs/img64.png" BORDER="0" ALT=""> and <IMG ALIGN="middle" SRC="figs/img65.png" BORDER="0" ALT="">.
Moreover, we let <IMG ALIGN="middle" SRC="figs/img66.png" BORDER="0" ALT=""> be the set of <I>non-critical</I>
vertices in <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT="">.
We also let <IMG ALIGN="middle" SRC="figs/img67.png" BORDER="0" ALT=""> be the set of all critical
vertices that have at least one non-critical vertex as a neighbour.
Let <IMG ALIGN="middle" SRC="figs/img68.png" BORDER="0" ALT=""> be the set of <I>non-critical</I> edges in <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT="">.
Finally, we let <IMG ALIGN="middle" SRC="figs/img69.png" BORDER="0" ALT=""> be the <I>non-critical</I> subgraph
of <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT="">.
The non-critical subgraph <IMG ALIGN="middle" SRC="figs/img70.png" BORDER="0" ALT=""> corresponds to the <I>acyclic part</I>
of <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT="">.
We have <IMG ALIGN="middle" SRC="figs/img71.png" BORDER="0" ALT="">.
</P>
<P>
We then construct a suitable labelling <IMG ALIGN="middle" SRC="figs/img72.png" BORDER="0" ALT=""> of the vertices
of <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT="">: we choose <IMG ALIGN="middle" SRC="figs/img73.png" BORDER="0" ALT=""> for each <IMG ALIGN="middle" SRC="figs/img74.png" BORDER="0" ALT=""> in such
a way that <IMG ALIGN="middle" SRC="figs/img75.png" BORDER="0" ALT=""> (<IMG ALIGN="middle" SRC="figs/img18.png" BORDER="0" ALT="">) is a
minimal perfect hash function for <IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT="">.
This labelling <IMG ALIGN="middle" SRC="figs/img37.png" BORDER="0" ALT=""> can be found in linear time
if the number of edges in <IMG ALIGN="middle" SRC="figs/img63.png" BORDER="0" ALT=""> is at most <IMG ALIGN="middle" SRC="figs/img76.png" BORDER="0" ALT=""> (see <A HREF="#papers">[2</A>]
for details).
</P>
<P>
Figure 1 presents pseudocode for the BMZ algorithm.
The procedure BMZ (<IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT="">, <IMG ALIGN="middle" SRC="figs/img37.png" BORDER="0" ALT="">) receives as input the set of
keys <IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT=""> and produces the labelling <IMG ALIGN="middle" SRC="figs/img37.png" BORDER="0" ALT="">.
The method uses a mapping, ordering and searching approach.
We now describe each step.
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD>procedure BMZ (<IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT="">, <IMG ALIGN="middle" SRC="figs/img37.png" BORDER="0" ALT="">)</TD>
</TR>
<TR>
<TD> Mapping (<IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT="">, <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT="">);</TD>
</TR>
<TR>
<TD> Ordering (<IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT="">, <IMG ALIGN="middle" SRC="figs/img63.png" BORDER="0" ALT="">, <IMG ALIGN="middle" SRC="figs/img70.png" BORDER="0" ALT="">);</TD>
</TR>
<TR>
<TD> Searching (<IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT="">, <IMG ALIGN="middle" SRC="figs/img63.png" BORDER="0" ALT="">, <IMG ALIGN="middle" SRC="figs/img70.png" BORDER="0" ALT="">, <IMG ALIGN="middle" SRC="figs/img37.png" BORDER="0" ALT="">);</TD>
</TR>
<TR>
<TD><B>Figure 1</B>: Main steps of BMZ algorithm for constructing a minimal perfect hash function</TD>
</TR>
</TABLE>
<HR NOSHADE SIZE=1>
<H3>Mapping Step</H3>
<P>
The procedure Mapping (<IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT="">, <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT="">) receives as input the set
of keys <IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT=""> and generates the random graph <IMG ALIGN="middle" SRC="figs/img59.png" BORDER="0" ALT="">, by generating
two auxiliary functions <IMG ALIGN="middle" SRC="figs/img41.png" BORDER="0" ALT="">, <IMG ALIGN="middle" SRC="figs/img78.png" BORDER="0" ALT="">.
</P>
<P>
The functions <IMG ALIGN="middle" SRC="figs/img41.png" BORDER="0" ALT=""> and <IMG ALIGN="middle" SRC="figs/img42.png" BORDER="0" ALT=""> are constructed as follows.
We impose some upper bound <IMG ALIGN="bottom" SRC="figs/img79.png" BORDER="0" ALT=""> on the lengths of the keys in <IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT="">.
To define <IMG ALIGN="middle" SRC="figs/img80.png" BORDER="0" ALT=""> (<IMG ALIGN="middle" SRC="figs/img81.png" BORDER="0" ALT="">, <IMG ALIGN="bottom" SRC="figs/img62.png" BORDER="0" ALT="">), we generate
an <IMG ALIGN="middle" SRC="figs/img82.png" BORDER="0" ALT=""> table of random integers <IMG ALIGN="middle" SRC="figs/img83.png" BORDER="0" ALT="">.
For a key <IMG ALIGN="middle" SRC="figs/img18.png" BORDER="0" ALT=""> of length <IMG ALIGN="middle" SRC="figs/img84.png" BORDER="0" ALT=""> and <IMG ALIGN="middle" SRC="figs/img85.png" BORDER="0" ALT="">, we let
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><IMG ALIGN="middle" SRC="figs/img86.png" BORDER="0" ALT=""></TD>
</TR>
</TABLE>
<P>
The random graph <IMG ALIGN="middle" SRC="figs/img59.png" BORDER="0" ALT=""> has vertex set <IMG ALIGN="middle" SRC="figs/img56.png" BORDER="0" ALT=""> and
edge set <IMG ALIGN="middle" SRC="figs/img61.png" BORDER="0" ALT="">. We need <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT=""> to be
simple, i.e., <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT=""> should have neither loops nor multiple edges.
A loop occurs when <IMG ALIGN="middle" SRC="figs/img87.png" BORDER="0" ALT=""> for some <IMG ALIGN="middle" SRC="figs/img18.png" BORDER="0" ALT="">.
We solve this in an ad hoc manner: we simply let <IMG ALIGN="middle" SRC="figs/img88.png" BORDER="0" ALT=""> in this case.
If we still find a loop after this, we generate another pair <IMG ALIGN="middle" SRC="figs/img89.png" BORDER="0" ALT="">.
When a multiple edge occurs we abort and generate a new pair <IMG ALIGN="middle" SRC="figs/img89.png" BORDER="0" ALT="">.
Although the function above causes <A HREF="concepts.html">collisions</A> with probability <I>1/t</I>,
in <A HREF="index.html">cmph library</A> we use faster hash
functions (<A HREF="http://www.cs.yorku.ca/~oz/hash.html">DJB2 hash</A>, <A HREF="http://www.isthe.com/chongo/tech/comp/fnv/">FNV hash</A>,
<A HREF="http://burtleburtle.net/bob/hash/doobs.html">Jenkins hash</A> and <A HREF="http://www.cs.yorku.ca/~oz/hash.html">SDBM hash</A>)
in which we do not need to impose any upper bound <IMG ALIGN="bottom" SRC="figs/img79.png" BORDER="0" ALT=""> on the lengths of the keys in <IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT="">.
</P>
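<P>
As a rough illustration of the mapping step, the Python sketch below builds two table-based hash functions and retries until the resulting graph is simple. It assumes that the formula above sums one table entry per key position modulo <I>t</I>, that <I>t</I> is the ceiling of <I>cn</I>, and that the ad hoc loop repair is <I>h2(x) = (h1(x) + 1) mod t</I>. The names <CODE>make_table_hash</CODE> and <CODE>mapping</CODE> are hypothetical and are not part of the cmph API.
</P>

```python
import math
import random


def make_table_hash(rng, max_len, t):
    """Build one hash function from a max_len x 256 table of random integers in [0, t)."""
    table = [[rng.randrange(t) for _ in range(256)] for _ in range(max_len)]

    def h(key):
        # Add the table entry selected by each (position, byte) pair, modulo t.
        return sum(table[j][byte] for j, byte in enumerate(key)) % t

    return h


def mapping(keys, c=1.15, seed=0, max_attempts=1000):
    """Return (t, edges); edges[i] is the edge assigned to keys[i] in a simple graph."""
    rng = random.Random(seed)
    t = math.ceil(c * len(keys))          # assumed number of vertices
    max_len = max(len(k) for k in keys)   # upper bound on key length
    for _ in range(max_attempts):
        h1 = make_table_hash(rng, max_len, t)
        h2 = make_table_hash(rng, max_len, t)
        edges, seen = [], set()
        for key in keys:
            u, v = h1(key), h2(key)
            if u == v:
                v = (u + 1) % t           # assumed ad hoc loop repair
            edge = (min(u, v), max(u, v))
            if u == v or edge in seen:    # remaining loop or multiple edge
                break                     # abort and draw a new pair (h1, h2)
            seen.add(edge)
            edges.append((u, v))
        else:
            return t, edges
    raise RuntimeError("no simple graph found")


# Hypothetical key set, given as byte strings.
keys = [b"alpha", b"beta", b"gamma", b"delta", b"epsilon", b"zeta", b"eta", b"theta"]
t, edges = mapping(keys)
```

<P>
The sketch keeps the tables only long enough to produce the edges. A real generator would also have to keep the accepted pair of functions so that keys can be hashed again at lookup time.
</P>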
<P>
As mentioned before, for us to find the labelling <IMG ALIGN="middle" SRC="figs/img72.png" BORDER="0" ALT=""> of the
vertices of <IMG ALIGN="middle" SRC="figs/img59.png" BORDER="0" ALT=""> in linear time,
we require that <IMG ALIGN="middle" SRC="figs/img108.png" BORDER="0" ALT="">.
The crucial step now is to determine the value
of <IMG ALIGN="bottom" SRC="figs/img1.png" BORDER="0" ALT=""> (in <IMG ALIGN="bottom" SRC="figs/img57.png" BORDER="0" ALT="">) to obtain a random
graph <IMG ALIGN="middle" SRC="figs/img71.png" BORDER="0" ALT=""> with <IMG ALIGN="middle" SRC="figs/img109.png" BORDER="0" ALT="">.
Botelho, Menoti and Ziviani determined empirically in <A HREF="#papers">[1</A>] that
the value of <IMG ALIGN="bottom" SRC="figs/img1.png" BORDER="0" ALT=""> is <I>1.15</I>. This value is remarkably
close to the theoretical value determined in <A HREF="#papers">[2</A>],
which is around <IMG ALIGN="bottom" SRC="figs/img112.png" BORDER="0" ALT="">.
</P>
<HR NOSHADE SIZE=1>
<H3>Ordering Step</H3>
<P>
The procedure Ordering (<IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT="">, <IMG ALIGN="middle" SRC="figs/img63.png" BORDER="0" ALT="">, <IMG ALIGN="middle" SRC="figs/img70.png" BORDER="0" ALT="">) receives
as input the graph <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT=""> and partitions <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT=""> into the two
subgraphs <IMG ALIGN="middle" SRC="figs/img63.png" BORDER="0" ALT=""> and <IMG ALIGN="middle" SRC="figs/img70.png" BORDER="0" ALT="">, so that <IMG ALIGN="middle" SRC="figs/img71.png" BORDER="0" ALT="">.
</P>
<P>
Figure 2 presents a sample graph with 9 vertices
and 8 edges, where the degree of a vertex is shown beside each vertex.
Initially, all vertices with degree 1 are added to a queue <IMG ALIGN="middle" SRC="figs/img136.png" BORDER="0" ALT="">.
For the example shown in Figure 2(a), <IMG ALIGN="middle" SRC="figs/img137.png" BORDER="0" ALT=""> after the initialization step.
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><IMG ALIGN="middle" SRC="figs/img138.png" BORDER="0" ALT=""></TD>
</TR>
<TR>
<TD><B>Figure 2:</B> Ordering step for a graph with 9 vertices and 8 edges.</TD>
</TR>
</TABLE>
<P>
Next, we remove one vertex <IMG ALIGN="bottom" SRC="figs/img139.png" BORDER="0" ALT=""> from the queue, decrement its degree and
the degree of the vertices with degree greater than 0 in the adjacency
list of <IMG ALIGN="bottom" SRC="figs/img139.png" BORDER="0" ALT="">, as depicted in Figure 2(b) for <IMG ALIGN="bottom" SRC="figs/img140.png" BORDER="0" ALT="">.
At this point, the adjacencies of <IMG ALIGN="bottom" SRC="figs/img139.png" BORDER="0" ALT=""> with degree 1 are
inserted into the queue, such as vertex 1.
This process is repeated until the queue becomes empty.
All vertices with degree 0 are non-critical vertices and the others are
critical vertices, as depicted in Figure 2(c).
Finally, to determine the vertices in <IMG ALIGN="middle" SRC="figs/img141.png" BORDER="0" ALT=""> we collect all
vertices <IMG ALIGN="middle" SRC="figs/img142.png" BORDER="0" ALT=""> with at least one vertex <IMG ALIGN="bottom" SRC="figs/img143.png" BORDER="0" ALT=""> that
is in Adj<IMG ALIGN="middle" SRC="figs/img144.png" BORDER="0" ALT=""> and in <IMG ALIGN="middle" SRC="figs/img145.png" BORDER="0" ALT="">, such as vertex 8 in Figure 2(c).
</P>
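<P>
The peeling procedure described above can be sketched in Python as follows. This is a minimal illustration, not the cmph implementation: the function name <CODE>ordering</CODE>, its return values and the five-vertex sample graph are assumptions made for the example (the graph is not the one in Figure 2).
</P>

```python
from collections import deque


def ordering(num_vertices, edges):
    """Split the vertices into critical (2-core) and non-critical sets."""
    adj = [[] for _ in range(num_vertices)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    degree = [len(neighbours) for neighbours in adj]

    # Start with every vertex of degree 1.
    queue = deque(v for v in range(num_vertices) if degree[v] == 1)
    while queue:
        v = queue.popleft()
        if degree[v] != 1:
            continue                  # its last edge was already removed
        degree[v] = 0
        for u in adj[v]:
            if degree[u] > 0:         # only the neighbour still attached to v
                degree[u] -= 1
                if degree[u] == 1:
                    queue.append(u)

    critical = {v for v in range(num_vertices) if degree[v] > 0}
    non_critical = set(range(num_vertices)) - critical
    # Critical vertices with at least one non-critical neighbour.
    scrit = {v for v in critical if any(u in non_critical for u in adj[v])}
    return critical, non_critical, scrit


# Hypothetical graph: a triangle 0-1-2 with a tail 2-3-4.
edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4)]
critical, non_critical, scrit = ordering(5, edges)
```

<P>
For this sample graph the triangle survives the peeling, the tail is removed, and vertex 2 is the only critical vertex with a non-critical neighbour.
</P>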
<HR NOSHADE SIZE=1>
<H3>Searching Step</H3>
<P>
In the searching step, the key part is
the <I>perfect assignment problem</I>: find <IMG ALIGN="middle" SRC="figs/img153.png" BORDER="0" ALT=""> such that
the function <IMG ALIGN="middle" SRC="figs/img154.png" BORDER="0" ALT=""> defined by
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><IMG ALIGN="middle" SRC="figs/img155.png" BORDER="0" ALT=""></TD>
</TR>
</TABLE>
<P>
is a bijection from <IMG ALIGN="middle" SRC="figs/img156.png" BORDER="0" ALT=""> to <IMG ALIGN="middle" SRC="figs/img157.png" BORDER="0" ALT=""> (recall <IMG ALIGN="middle" SRC="figs/img158.png" BORDER="0" ALT="">).
We are interested in a labelling <IMG ALIGN="middle" SRC="figs/img72.png" BORDER="0" ALT=""> of
the vertices of the graph <IMG ALIGN="middle" SRC="figs/img59.png" BORDER="0" ALT=""> with
the property that if <IMG ALIGN="bottom" SRC="figs/img11.png" BORDER="0" ALT=""> and <IMG ALIGN="middle" SRC="figs/img22.png" BORDER="0" ALT=""> are keys
in <IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT="">, then <IMG ALIGN="middle" SRC="figs/img159.png" BORDER="0" ALT="">; that is, if we associate
to each edge the sum of the labels on its endpoints, then these values
should be all distinct.
Moreover, we require that all the sums <IMG ALIGN="middle" SRC="figs/img160.png" BORDER="0" ALT=""> (<IMG ALIGN="middle" SRC="figs/img18.png" BORDER="0" ALT="">)
fall between <IMG ALIGN="bottom" SRC="figs/img115.png" BORDER="0" ALT=""> and <IMG ALIGN="middle" SRC="figs/img161.png" BORDER="0" ALT="">, and thus we have a bijection
between <IMG ALIGN="bottom" SRC="figs/img20.png" BORDER="0" ALT=""> and <IMG ALIGN="middle" SRC="figs/img157.png" BORDER="0" ALT="">.
</P>
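<P>
To make the perfect assignment condition concrete, the following Python sketch evaluates the sum of the two endpoint labels for each key and checks that the sums cover every address exactly once. It assumes the address range is 0 to <I>n</I>-1. The helper names, the three-key instance and the labellings are hypothetical.
</P>

```python
def bmz_hash(key, h1, h2, g):
    """Evaluate the sum of the labels on the two endpoints of the key's edge."""
    return g[h1(key)] + g[h2(key)]


def is_minimal_perfect(keys, h1, h2, g):
    """True when the edge sums hit every address in 0 .. n-1 exactly once."""
    values = sorted(bmz_hash(k, h1, h2, g) for k in keys)
    return values == list(range(len(keys)))


# Hypothetical toy instance: each key is mapped straight to its edge.
endpoints = {"a": (0, 1), "b": (1, 2), "c": (2, 3)}
h1 = lambda k: endpoints[k][0]
h2 = lambda k: endpoints[k][1]

good_g = [0, 0, 1, 1]   # edge sums 0, 1, 2: a bijection onto {0, 1, 2}
bad_g = [0, 0, 0, 0]    # every edge sums to 0: all keys collide

ok = is_minimal_perfect(endpoints, h1, h2, good_g)
broken = is_minimal_perfect(endpoints, h1, h2, bad_g)
```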
<P>
The procedure Searching (<IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT="">, <IMG ALIGN="middle" SRC="figs/img63.png" BORDER="0" ALT="">, <IMG ALIGN="middle" SRC="figs/img70.png" BORDER="0" ALT="">, <IMG ALIGN="middle" SRC="figs/img37.png" BORDER="0" ALT="">)
receives as input <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT="">, <IMG ALIGN="middle" SRC="figs/img63.png" BORDER="0" ALT="">, <IMG ALIGN="middle" SRC="figs/img70.png" BORDER="0" ALT=""> and finds a
suitable <IMG ALIGN="middle" SRC="figs/img162.png" BORDER="0" ALT=""> bit value for each vertex <IMG ALIGN="middle" SRC="figs/img74.png" BORDER="0" ALT="">, stored in the
array <IMG ALIGN="middle" SRC="figs/img37.png" BORDER="0" ALT="">.
This step is first performed for the vertices in the
critical subgraph <IMG ALIGN="middle" SRC="figs/img63.png" BORDER="0" ALT=""> of <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT=""> (the 2-core of <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT="">)
and then it is performed for the vertices in <IMG ALIGN="middle" SRC="figs/img70.png" BORDER="0" ALT=""> (the non-critical subgraph
of <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT=""> that contains the "acyclic part" of <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT="">).
The reason the assignment of the <IMG ALIGN="middle" SRC="figs/img37.png" BORDER="0" ALT=""> values is first
performed on the vertices in <IMG ALIGN="middle" SRC="figs/img63.png" BORDER="0" ALT=""> is to resolve reassignments
as early as possible (such reassignments are consequences of the cycles
in <IMG ALIGN="middle" SRC="figs/img63.png" BORDER="0" ALT=""> and are described below).
</P>
<HR NOSHADE SIZE=1>
<H4>Assignment of Values to Critical Vertices</H4>
<P>
The labels <IMG ALIGN="middle" SRC="figs/img73.png" BORDER="0" ALT=""> (<IMG ALIGN="middle" SRC="figs/img142.png" BORDER="0" ALT="">)
are assigned in increasing order following a greedy
strategy where the critical vertices <IMG ALIGN="bottom" SRC="figs/img139.png" BORDER="0" ALT=""> are considered one at a time,
according to a breadth-first search on <IMG ALIGN="middle" SRC="figs/img63.png" BORDER="0" ALT="">.
If a candidate value <IMG ALIGN="bottom" SRC="figs/img11.png" BORDER="0" ALT=""> for <IMG ALIGN="middle" SRC="figs/img73.png" BORDER="0" ALT=""> is forbidden
because setting <IMG ALIGN="middle" SRC="figs/img163.png" BORDER="0" ALT=""> would create two edges with the same sum,
we try <IMG ALIGN="middle" SRC="figs/img164.png" BORDER="0" ALT=""> for <IMG ALIGN="middle" SRC="figs/img73.png" BORDER="0" ALT="">. This event is referred to
as a <I>reassignment</I>.
</P>
<P>
Let <IMG ALIGN="middle" SRC="figs/img165.png" BORDER="0" ALT=""> be the set of addresses assigned to edges in <IMG ALIGN="middle" SRC="figs/img166.png" BORDER="0" ALT="">.
Initially <IMG ALIGN="middle" SRC="figs/img167.png" BORDER="0" ALT="">.
Let <IMG ALIGN="bottom" SRC="figs/img11.png" BORDER="0" ALT=""> be a candidate value for <IMG ALIGN="middle" SRC="figs/img73.png" BORDER="0" ALT="">.
Initially <IMG ALIGN="bottom" SRC="figs/img168.png" BORDER="0" ALT="">.
Considering the subgraph <IMG ALIGN="middle" SRC="figs/img63.png" BORDER="0" ALT=""> in Figure 2(c),
a step by step example of the assignment of values to vertices in <IMG ALIGN="middle" SRC="figs/img63.png" BORDER="0" ALT=""> is
presented in Figure 3.
Initially, a vertex <IMG ALIGN="bottom" SRC="figs/img139.png" BORDER="0" ALT=""> is chosen, the assignment <IMG ALIGN="middle" SRC="figs/img163.png" BORDER="0" ALT=""> is made
and <IMG ALIGN="bottom" SRC="figs/img11.png" BORDER="0" ALT=""> is set to <IMG ALIGN="middle" SRC="figs/img164.png" BORDER="0" ALT="">.
For example, suppose that vertex <IMG ALIGN="bottom" SRC="figs/img169.png" BORDER="0" ALT=""> in Figure 3(a) is
chosen, the assignment <IMG ALIGN="middle" SRC="figs/img170.png" BORDER="0" ALT=""> is made and <IMG ALIGN="bottom" SRC="figs/img11.png" BORDER="0" ALT=""> is set to <IMG ALIGN="bottom" SRC="figs/img96.png" BORDER="0" ALT="">.
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><IMG ALIGN="middle" SRC="figs/img171.png" BORDER="0" ALT=""></TD>
</TR>
<TR>
<TD><B>Figure 3:</B> Example of the assignment of values to critical vertices.</TD>
</TR>
</TABLE>
<P>
In Figure 3(b), following the adjacency list of vertex <IMG ALIGN="bottom" SRC="figs/img169.png" BORDER="0" ALT="">,
the unassigned vertex <IMG ALIGN="bottom" SRC="figs/img115.png" BORDER="0" ALT=""> is reached.
At this point, we collect in the temporary variable <IMG ALIGN="bottom" SRC="figs/img172.png" BORDER="0" ALT=""> all adjacencies
of vertex <IMG ALIGN="bottom" SRC="figs/img115.png" BORDER="0" ALT=""> that have been assigned an <IMG ALIGN="bottom" SRC="figs/img11.png" BORDER="0" ALT=""> value,
and <IMG ALIGN="middle" SRC="figs/img173.png" BORDER="0" ALT="">.
Next, for all <IMG ALIGN="middle" SRC="figs/img174.png" BORDER="0" ALT="">, we check if <IMG ALIGN="middle" SRC="figs/img175.png" BORDER="0" ALT="">.
Since <IMG ALIGN="middle" SRC="figs/img176.png" BORDER="0" ALT="">, then <IMG ALIGN="middle" SRC="figs/img177.png" BORDER="0" ALT=""> is set
to <IMG ALIGN="bottom" SRC="figs/img96.png" BORDER="0" ALT="">, <IMG ALIGN="bottom" SRC="figs/img11.png" BORDER="0" ALT=""> is incremented
by 1 (now <IMG ALIGN="bottom" SRC="figs/img178.png" BORDER="0" ALT="">) and <IMG ALIGN="middle" SRC="figs/img179.png" BORDER="0" ALT="">.
Next, vertex <IMG ALIGN="bottom" SRC="figs/img180.png" BORDER="0" ALT=""> is reached, <IMG ALIGN="middle" SRC="figs/img181.png" BORDER="0" ALT=""> is set
to <IMG ALIGN="bottom" SRC="figs/img62.png" BORDER="0" ALT="">, <IMG ALIGN="bottom" SRC="figs/img11.png" BORDER="0" ALT=""> is set to <IMG ALIGN="bottom" SRC="figs/img180.png" BORDER="0" ALT=""> and <IMG ALIGN="middle" SRC="figs/img182.png" BORDER="0" ALT="">.
Next, vertex <IMG ALIGN="bottom" SRC="figs/img183.png" BORDER="0" ALT=""> is reached and <IMG ALIGN="middle" SRC="figs/img184.png" BORDER="0" ALT="">.
Since <IMG ALIGN="middle" SRC="figs/img185.png" BORDER="0" ALT=""> and <IMG ALIGN="middle" SRC="figs/img186.png" BORDER="0" ALT="">, then <IMG ALIGN="middle" SRC="figs/img187.png" BORDER="0" ALT=""> is
set to <IMG ALIGN="bottom" SRC="figs/img180.png" BORDER="0" ALT="">, <IMG ALIGN="bottom" SRC="figs/img11.png" BORDER="0" ALT=""> is set to <IMG ALIGN="bottom" SRC="figs/img183.png" BORDER="0" ALT=""> and <IMG ALIGN="middle" SRC="figs/img188.png" BORDER="0" ALT="">.
Finally, vertex <IMG ALIGN="bottom" SRC="figs/img189.png" BORDER="0" ALT=""> is reached and <IMG ALIGN="middle" SRC="figs/img190.png" BORDER="0" ALT="">.
Since <IMG ALIGN="middle" SRC="figs/img191.png" BORDER="0" ALT="">, <IMG ALIGN="bottom" SRC="figs/img11.png" BORDER="0" ALT=""> is incremented by 1 and set to 5, as depicted in
Figure 3(c).
Since <IMG ALIGN="middle" SRC="figs/img192.png" BORDER="0" ALT="">, <IMG ALIGN="bottom" SRC="figs/img11.png" BORDER="0" ALT=""> is again incremented by 1 and set to 6,
as depicted in Figure 3(d).
These two reassignments are indicated by the arrows in Figure 3.
Since <IMG ALIGN="middle" SRC="figs/img193.png" BORDER="0" ALT=""> and <IMG ALIGN="middle" SRC="figs/img194.png" BORDER="0" ALT="">, then <IMG ALIGN="middle" SRC="figs/img195.png" BORDER="0" ALT=""> is set
to <IMG ALIGN="bottom" SRC="figs/img196.png" BORDER="0" ALT=""> and <IMG ALIGN="middle" SRC="figs/img197.png" BORDER="0" ALT="">. This finishes the algorithm.
</P>
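<P>
The greedy assignment walked through above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the library code: the function name <CODE>assign_critical</CODE>, the dictionary-based adjacency lists and the four-vertex critical graph are hypothetical (the graph is not the one in Figure 3), and the sketch simply gives up when an edge address would reach <I>n</I>.
</P>

```python
from collections import deque


def assign_critical(adj, n):
    """Label critical vertices by BFS; adj maps each critical vertex to its
    critical neighbours and n is the number of keys (edges of the full graph)."""
    g = {}
    used = set()   # addresses already given to critical edges
    x = 0          # next candidate label
    for root in sorted(adj):
        if root in g:
            continue
        g[root] = x
        x += 1
        queue = deque([root])
        while queue:
            v = queue.popleft()
            for u in adj[v]:
                if u in g:
                    continue
                # Neighbours of u that already carry a label.
                assigned = [w for w in adj[u] if w in g]
                # Reassignment: skip candidates that would repeat an address.
                while any(g[w] + x in used for w in assigned):
                    x += 1
                sums = [g[w] + x for w in assigned]
                if max(sums) >= n:
                    raise ValueError("edge address out of range")
                g[u] = x
                used.update(sums)
                x += 1
                queue.append(u)
    return g, used


# Hypothetical critical graph: edges 0-1, 0-2, 0-3, 1-2, 2-3, with n = 7 keys.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
g, used = assign_critical(adj, 7)
```

<P>
In this example the candidate 3 for vertex 3 is rejected because address 3 is already taken by edge 1-2, so one reassignment occurs and the addresses 0 and 5 are left unused.
</P>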
<HR NOSHADE SIZE=1>
<H4>Assignment of Values to Non-Critical Vertices</H4>
<P>
As <IMG ALIGN="middle" SRC="figs/img70.png" BORDER="0" ALT=""> is acyclic, we can impose the order in which addresses are
associated with edges in <IMG ALIGN="middle" SRC="figs/img70.png" BORDER="0" ALT="">, making this step simple to solve
by a standard depth-first search algorithm.
Therefore, in the assignment of values to vertices in <IMG ALIGN="middle" SRC="figs/img70.png" BORDER="0" ALT=""> we
benefit from the unused addresses in the gaps left by the assignment of values
to vertices in <IMG ALIGN="middle" SRC="figs/img63.png" BORDER="0" ALT="">.
For that, we start the depth-first search from the vertices in <IMG ALIGN="middle" SRC="figs/img141.png" BORDER="0" ALT=""> because
the <IMG ALIGN="middle" SRC="figs/img37.png" BORDER="0" ALT=""> values for these critical vertices were already assigned
and cannot be changed.
</P>
<P>
Considering the subgraph <IMG ALIGN="middle" SRC="figs/img70.png" BORDER="0" ALT=""> in Figure 2(c),
a step by step example of the assignment of values to vertices in <IMG ALIGN="middle" SRC="figs/img70.png" BORDER="0" ALT=""> is
presented in Figure 4.
Figure 4(a) presents the initial state of the algorithm.
The critical vertex 8 is the only one that has non-critical vertices as neighbours.
In the example presented in Figure 3, the addresses <IMG ALIGN="middle" SRC="figs/img198.png" BORDER="0" ALT=""> were not used.
So, taking the first unused address <IMG ALIGN="bottom" SRC="figs/img115.png" BORDER="0" ALT=""> and the vertex <IMG ALIGN="bottom" SRC="figs/img96.png" BORDER="0" ALT="">,
which is reached from the vertex <IMG ALIGN="bottom" SRC="figs/img169.png" BORDER="0" ALT="">, <IMG ALIGN="middle" SRC="figs/img199.png" BORDER="0" ALT=""> is set
to <IMG ALIGN="middle" SRC="figs/img200.png" BORDER="0" ALT="">, as shown in Figure 4(b).
The only vertex that is reached from vertex <IMG ALIGN="bottom" SRC="figs/img96.png" BORDER="0" ALT=""> is vertex <IMG ALIGN="bottom" SRC="figs/img62.png" BORDER="0" ALT="">, so
taking the unused address <IMG ALIGN="bottom" SRC="figs/img183.png" BORDER="0" ALT=""> we set <IMG ALIGN="middle" SRC="figs/img201.png" BORDER="0" ALT=""> to <IMG ALIGN="middle" SRC="figs/img202.png" BORDER="0" ALT="">,
as shown in Figure 4(c).
This process is repeated until the UnAssignedAddresses list becomes empty.
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><IMG ALIGN="middle" SRC="figs/img203.png" BORDER="0" ALT=""></TD>
</TR>
<TR>
<TD><B>Figure 4:</B> Example of the assignment of values to non-critical vertices.</TD>
</TR>
</TABLE>
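<P>
The depth-first assignment for the non-critical part can be sketched in Python as follows. The sketch assumes that each traversed edge takes the smallest unused address and that a newly reached vertex receives that address minus the label of the vertex it was reached from. The function name <CODE>assign_non_critical</CODE>, the sample graph and the choice of label 0 for the root of a tree that is not attached to any critical vertex are hypothetical. Labels may become negative in this sketch.
</P>

```python
def assign_non_critical(adj, g_critical, used, n):
    """Extend the labelling of the critical vertices to the acyclic part.

    adj maps every vertex to its neighbours, g_critical holds the labels that
    are already fixed, used holds the addresses taken by critical edges and
    n is the number of keys."""
    g = dict(g_critical)
    free = (a for a in range(n) if a not in used)   # unused addresses, ascending

    def dfs(start):
        stack = [start]
        while stack:
            v = stack.pop()
            for u in adj[v]:
                if u not in g:
                    # Edge (v, u) takes the next unused address.
                    g[u] = next(free) - g[v]
                    stack.append(u)

    # Start from the labelled critical vertices, whose values cannot change.
    for v in sorted(g_critical):
        dfs(v)
    # Assumption of this sketch: trees without a critical vertex start at label 0.
    for v in sorted(adj):
        if v not in g:
            g[v] = 0
            dfs(v)
    return g


# Hypothetical graph: critical part 0-1, 0-2, 0-3, 1-2, 2-3 plus the tail 3-4-5.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2, 4], 4: [3, 5], 5: [4]}
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (2, 3), (3, 4), (4, 5)]
g_all = assign_non_critical(adj, {0: 0, 1: 1, 2: 2, 3: 4}, {1, 2, 3, 4, 6}, 7)
addresses = sorted(g_all[u] + g_all[v] for u, v in edges)
```

<P>
Here the two non-critical edges fill the gaps 0 and 5 left by the critical edges, so the seven edge sums cover the addresses 0 to 6.
</P>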
<HR NOSHADE SIZE=1>
<A NAME="heuristic"></A>
<H2>The Heuristic</H2>
<P>
We now present a heuristic for the BMZ algorithm that
reduces the value of <IMG ALIGN="bottom" SRC="figs/img1.png" BORDER="0" ALT=""> to any given value between <I>1.15</I> and <I>0.93</I>.
This reduces the space requirement to store the resulting function
to any given value between <IMG ALIGN="bottom" SRC="figs/img12.png" BORDER="0" ALT=""> words and <IMG ALIGN="bottom" SRC="figs/img13.png" BORDER="0" ALT=""> words.
The heuristic reuses, when possible, the set
of <IMG ALIGN="bottom" SRC="figs/img11.png" BORDER="0" ALT=""> values that caused reassignments, just before
trying <IMG ALIGN="middle" SRC="figs/img164.png" BORDER="0" ALT="">.
Decreasing the value of <IMG ALIGN="bottom" SRC="figs/img1.png" BORDER="0" ALT=""> leads to an increase in the number of
iterations to generate <IMG ALIGN="bottom" SRC="figs/img32.png" BORDER="0" ALT="">.
For example, for <IMG ALIGN="bottom" SRC="figs/img244.png" BORDER="0" ALT=""> and <IMG ALIGN="bottom" SRC="figs/img6.png" BORDER="0" ALT="">, the analytical expected number
of iterations are <IMG ALIGN="bottom" SRC="figs/img245.png" BORDER="0" ALT=""> and <IMG ALIGN="bottom" SRC="figs/img246.png" BORDER="0" ALT="">, respectively (see <A HREF="#papers">[2</A>]
for details),
while for <IMG ALIGN="bottom" SRC="figs/img128.png" BORDER="0" ALT=""> the same value is around <I>2.13</I>.
</P>
<HR NOSHADE SIZE=1>
<H2>Memory Consumption</H2>
<P>
Now we detail the memory consumption to generate and to store minimal perfect hash functions
using the BMZ algorithm. The structures responsible for memory consumption are the following:
</P>
<UL>
<LI>Graph:
<OL>
<LI><B>first</B>: is a vector that stores <I>cn</I> integer numbers, each one representing
the first edge (index in the vector edges) in the list of
edges of each vertex.
The integer numbers are 4 bytes long. Therefore,
the vector first is stored in <I>4cn</I> bytes.
<P></P>
<LI><B>edges</B>: is a vector to represent the edges of the graph. As each edge
is composed of a pair of vertices, each entry stores two integer numbers
of 4 bytes that represent the vertices. As there are <I>n</I> edges, the
vector edges is stored in <I>8n</I> bytes.
<P></P>
<LI><B>next</B>: given a vertex <IMG ALIGN="bottom" SRC="figs/img139.png" BORDER="0" ALT="">, we can discover the edges that
contain <IMG ALIGN="bottom" SRC="figs/img139.png" BORDER="0" ALT=""> following its list of edges,
which starts on first[<IMG ALIGN="bottom" SRC="figs/img139.png" BORDER="0" ALT="">] and the next
edges are given by next[...first[<IMG ALIGN="bottom" SRC="figs/img139.png" BORDER="0" ALT="">]...]. Therefore, the vectors first and next represent
the linked lists of edges of each vertex. As there are two vertices for each edge,
when an edge is inserted in the graph, it must be inserted in the two linked lists
of the vertices in its composition. Therefore, there are <I>2n</I> entries of integer
numbers in the vector next, so it is stored in <I>4*2n = 8n</I> bytes.
<P></P>
<LI><B>critical vertices (critical_nodes vector)</B>: is a vector of <I>cn</I> bits,
where each bit indicates if a vertex is critical (1) or non-critical (0).
Therefore, the critical and non-critical vertices are represented in <I>cn/8</I> bytes.
<P></P>
<LI><B>critical edges (used_edges vector)</B>: is a vector of <I>n</I> bits, where each
bit indicates if an edge is critical (1) or non-critical (0). Therefore, the
critical and non-critical edges are represented in <I>n/8</I> bytes.
<P></P>
</OL>
<LI>Other auxiliary structures
<OL>
<LI><B>queue</B>: is a queue of integer numbers used in the breadth-first search of the
assignment of values to critical vertices. There is an entry in the queue for
each two critical vertices. Let <IMG ALIGN="middle" SRC="figs/img110.png" BORDER="0" ALT=""> be the expected number of critical
vertices. Therefore, the queue is stored in <I>4*0.5*<IMG ALIGN="middle" SRC="figs/img110.png" BORDER="0" ALT="">=2<IMG ALIGN="middle" SRC="figs/img110.png" BORDER="0" ALT=""></I> bytes.
<P></P>
<LI><B>visited</B>: is a vector of <I>cn</I> bits, where each bit indicates if the g value of
a given vertex was already defined. Therefore, the vector visited is stored
in <I>cn/8</I> bytes.
<P></P>
<LI><B>function <I>g</I></B>: is represented by a vector of <I>cn</I> integer numbers.
As each integer number is 4 bytes long, the function <I>g</I> is stored in
<I>4cn</I> bytes.
</OL>
</UL>
<P>
Thus, the total memory consumption of BMZ algorithm for generating a minimal
perfect hash function (MPHF) is: <I>(8.25c + 16.125)n +2<IMG ALIGN="middle" SRC="figs/img110.png" BORDER="0" ALT=""> + O(1)</I> bytes.
As the value of the constant <I>c</I> may be 1.15 or 0.93, we have:
</P>
<TABLE ALIGN="center" BORDER="1" CELLPADDING="4">
<TR>
<TH><I>c</I></TH>
<TH><IMG ALIGN="middle" SRC="figs/img110.png" BORDER="0" ALT=""></TH>
<TH>Memory consumption to generate a MPHF</TH>
</TR>
<TR>
<TD>0.93</TD>
<TD ALIGN="center"><I>0.497n</I></TD>
<TD ALIGN="center"><I>24.80n + O(1)</I></TD>
</TR>
<TR>
<TD>1.15</TD>
<TD ALIGN="center"><I>0.401n</I></TD>
<TD ALIGN="center"><I>26.42n + O(1)</I></TD>
</TR>
</TABLE>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><B>Table 1:</B> Memory consumption to generate a MPHF using the BMZ algorithm.</TD>
</TR>
</TABLE>
<P>
The values of <IMG ALIGN="middle" SRC="figs/img110.png" BORDER="0" ALT=""> were calculated using Eq.(1) presented in <A HREF="#papers">[2</A>].
</P>
<P>
Now we present the memory consumption to store the resulting function.
We only need to store the <I>g</I> function. Thus, we need <I>4cn</I> bytes.
Again we have:
</P>
<TABLE ALIGN="center" BORDER="1" CELLPADDING="4">
|
||||
<TR>
|
||||
<TH><I>c</I></TH>
|
||||
<TH>Memory consumption to store a MPHF</TH>
|
||||
</TR>
|
||||
<TR>
|
||||
<TD>0.93</TD>
|
||||
<TD ALIGN="center"><I>3.72n</I></TD>
|
||||
</TR>
|
||||
<TR>
|
||||
<TD>1.15</TD>
|
||||
<TD ALIGN="center"><I>4.60n</I></TD>
|
||||
</TR>
|
||||
</TABLE>
|
||||
|
||||
<TABLE ALIGN="center" CELLPADDING="4">
|
||||
<TR>
|
||||
<TD><B>Table 2:</B> Memory consumption to store a MPHF generated by the BMZ algorithm.</TD>
|
||||
</TR>
|
||||
</TABLE>
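<P>
As a quick worked example of Table 2, the sketch below multiplies out <I>4cn</I> for a concrete key count. The function name and the choice of one million keys are assumptions made only for illustration.
</P>

```python
def bmz_storage_bytes(n, c, bytes_per_entry=4):
    """Bytes needed to store g: one 4-byte integer for each of the c*n entries."""
    return bytes_per_entry * c * n


n = 1_000_000  # hypothetical key count, chosen only for the example
for c in (0.93, 1.15):
    megabytes = bmz_storage_bytes(n, c) / 1e6
    print(f"c = {c}: {megabytes:.2f} megabytes for {n} keys")
```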
|
||||
|
||||
<HR NOSHADE SIZE=1>
|
||||
|
||||
<H2>Experimental Results</H2>
|
||||
|
||||
<P>
|
||||
<A HREF="comparison.html">CHM x BMZ</A>
|
||||
</P>
|
||||
|
||||
<HR NOSHADE SIZE=1>
|
||||
|
||||
<A NAME="papers"></A>
|
||||
<H2>Papers</H2>
|
||||
|
||||
<OL>
|
||||
<LI><A HREF="http://www.dcc.ufmg.br/~fbotelho">F. C. Botelho</A>, D. Menoti, <A HREF="http://www.dcc.ufmg.br/~nivio">N. Ziviani</A>. <A HREF="papers/bmz_tr004_04.ps">A New algorithm for constructing minimal perfect hash functions</A>, Technical Report TR004/04, Department of Computer Science, Federal University of Minas Gerais, 2004.
|
||||
<P></P>
|
||||
<LI><A HREF="http://www.dcc.ufmg.br/~fbotelho">F. C. Botelho</A>, Y. Kohayakawa, and <A HREF="http://www.dcc.ufmg.br/~nivio">N. Ziviani</A>. <A HREF="papers/wea05.pdf">A Practical Minimal Perfect Hashing Method</A>. <I>4th International Workshop on Efficient and Experimental Algorithms (WEA05),</I> Springer-Verlag Lecture Notes in Computer Science, vol. 3505, Santorini Island, Greece, May 2005, 488-500.
|
||||
</OL>
|
||||
|
||||
<HR NOSHADE SIZE=1>
|
||||
|
||||
<TABLE ALIGN="center" CELLPADDING="4">
|
||||
<TR>
|
||||
<TD><A HREF="index.html">Home</A></TD>
|
||||
<TD><A HREF="chd.html">CHD</A></TD>
|
||||
<TD><A HREF="bdz.html">BDZ</A></TD>
|
||||
<TD><A HREF="bmz.html">BMZ</A></TD>
|
||||
<TD><A HREF="chm.html">CHM</A></TD>
|
||||
<TD><A HREF="brz.html">BRZ</A></TD>
|
||||
<TD><A HREF="fch.html">FCH</A></TD>
|
||||
</TR>
|
||||
</TABLE>
|
||||
|
||||
<HR NOSHADE SIZE=1>
|
||||
|
||||
<P>
|
||||
Enjoy!
|
||||
</P>
|
||||
<P>
|
||||
<A HREF="mailto:davi@users.sourceforge.net">Davi de Castro Reis</A>
|
||||
</P>
|
||||
<P>
|
||||
<A HREF="mailto:db8192@users.sourceforge.net">Djamel Belazzougui</A>
|
||||
</P>
|
||||
<P>
|
||||
<A HREF="mailto:fc_botelho@users.sourceforge.net">Fabiano Cupertino Botelho</A>
|
||||
</P>
|
||||
<P>
|
||||
<A HREF="mailto:nivio@dcc.ufmg.br">Nivio Ziviani</A>
|
||||
</P>
|
||||
<script type="text/javascript">
|
||||
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
|
||||
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
|
||||
</script>
|
||||
<script type="text/javascript">
|
||||
try {
|
||||
var pageTracker = _gat._getTracker("UA-7698683-2");
|
||||
pageTracker._trackPageview();
|
||||
} catch(err) {}</script>
|
||||
|
||||
<!-- html code generated by txt2tags 2.6 (http://txt2tags.org) -->
|
||||
<!-- cmdline: txt2tags -t html -i BMZ.t2t -o docs/bmz.html -->
|
||||
</BODY></HTML>
|
|
@ -0,0 +1,966 @@
|
|||
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
|
||||
<HTML>
|
||||
<HEAD>
|
||||
<META NAME="generator" CONTENT="http://txt2tags.org">
|
||||
<LINK REL="stylesheet" TYPE="text/css" HREF="DOC.css">
|
||||
<TITLE>External Memory Based Algorithm</TITLE>
|
||||
</HEAD><BODY BGCOLOR="white" TEXT="black">
|
||||
<CENTER>
|
||||
<H1>External Memory Based Algorithm</H1>
|
||||
</CENTER>
|
||||
|
||||
|
||||
<HR NOSHADE SIZE=1>
|
||||
|
||||
<H2>Introduction</H2>
|
||||
|
||||
<P>
|
||||
Until now, because of the limitations of current algorithms,
|
||||
the use of MPHFs is restricted to scenarios where the set of keys being hashed is
|
||||
relatively small.
|
||||
However, in many cases it is crucial to deal in an efficient way with very large
|
||||
sets of keys.
|
||||
Due to the exponential growth of the Web, working with huge collections is becoming
|
||||
a daily task.
|
||||
For instance, the simple assignment of numeric identifiers to web pages of a collection
|
||||
can be a challenging task.
|
||||
While traditional databases simply cannot handle more traffic once the working
|
||||
set of URLs does not fit in main memory anymore <A HREF="#papers">[4</A>], the algorithm we propose here to
|
||||
construct MPHFs can easily scale to billions of entries.
|
||||
</P>
|
||||
<P>
|
||||
As there are many applications for MPHFs, it is
|
||||
important to design and implement space and time efficient algorithms for
|
||||
constructing such functions.
|
||||
The attractiveness of using MPHFs depends on the following issues:
|
||||
</P>
|
||||
|
||||
<OL>
|
||||
<LI>The amount of CPU time required by the algorithms for constructing MPHFs.
|
||||
<P></P>
|
||||
<LI>The space requirements of the algorithms for constructing MPHFs.
|
||||
<P></P>
|
||||
<LI>The amount of CPU time required by a MPHF for each retrieval.
|
||||
<P></P>
|
||||
<LI>The space requirements of the description of the resulting MPHFs to be used at retrieval time.
|
||||
</OL>
|
||||
|
||||
<P>
|
||||
We present here a novel external memory based algorithm for constructing MPHFs that
|
||||
are very efficient in the four requirements mentioned previously.
|
||||
First, the time to construct a MPHF is linear in the size of the key set,
|
||||
which is optimal.
|
||||
For instance, for a collection of 1 billion URLs
|
||||
collected from the web, each one 64 characters long on average, the time to construct a
|
||||
MPHF using a 2.4 gigahertz PC with 500 megabytes of available main memory
|
||||
is approximately 3 hours.
|
||||
Second, the algorithm needs a small a priori defined vector of <IMG ALIGN="middle" SRC="figs/brz/img23.png" BORDER="0" ALT=""> one
|
||||
byte entries in main memory to construct a MPHF.
|
||||
For the collection of 1 billion URLs and using <IMG ALIGN="middle" SRC="figs/brz/img4.png" BORDER="0" ALT="">, the algorithm needs only
|
||||
5.45 megabytes of internal memory.
|
||||
Third, the evaluation of the MPHF for each retrieval requires three memory accesses and
|
||||
the computation of three universal hash functions.
|
||||
This is not optimal as any MPHF requires at least one memory access and the computation
|
||||
of two universal hash functions.
|
||||
Fourth, the description of a MPHF takes a constant number of bits for each key, which is optimal.
|
||||
For the collection of 1 billion URLs, it needs 8.1 bits for each key,
|
||||
while the theoretical lower bound is <IMG ALIGN="middle" SRC="figs/brz/img24.png" BORDER="0" ALT=""> bits per key.
|
||||
</P>
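<P>
The internal memory figure quoted above can be reproduced with a small calculation. This is a sketch under two assumptions that appear only as images in this page: that the vector has <I>ceil(n/b)</I> one-byte entries, and that <I>b = 175</I>. The function name is illustrative.
</P>

```python
def size_vector_bytes(n, b):
    """One byte per bucket, assuming ceil(n / b) buckets."""
    return (n + b - 1) // b  # integer ceiling of n / b


n = 1_000_000_000  # 1 billion URLs, as in the text
b = 175            # assumed bucket parameter (shown only as an image in the page)
mebibytes = size_vector_bytes(n, b) / 2**20
print(f"{mebibytes:.2f} megabytes of internal memory")
```

<P>
Under these assumptions the vector takes about 5.45 megabytes (counting 2<SUP>20</SUP> bytes per megabyte), in line with the figure given in the text.
</P>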
|
||||
|
||||
<HR NOSHADE SIZE=1>
|
||||
|
||||
<H2>The Algorithm</H2>
|
||||
|
||||
<P>
|
||||
The main idea supporting our algorithm is the classical divide and conquer technique.
|
||||
The algorithm is a two-step external memory based algorithm
|
||||
that generates a MPHF <I>h</I> for a set <I>S</I> of <I>n</I> keys.
|
||||
Figure 1 illustrates the two steps of the
|
||||
algorithm: the partitioning step and the searching step.
|
||||
</P>
|
||||
|
||||
<TABLE ALIGN="center" CELLPADDING="4">
|
||||
<TR>
|
||||
<TD><IMG ALIGN="middle" SRC="figs/brz/brz.png" BORDER="0" ALT=""></TD>
|
||||
</TR>
|
||||
<TR>
|
||||
<TD><B>Figure 1:</B> Main steps of our algorithm.</TD>
|
||||
</TR>
|
||||
</TABLE>
|
||||
|
||||
<P>
|
||||
The partitioning step takes a key set <I>S</I> and uses a universal hash
|
||||
function <IMG ALIGN="middle" SRC="figs/brz/img42.png" BORDER="0" ALT=""> proposed by Jenkins<A HREF="#papers">[5</A>]
|
||||
to transform each key <IMG ALIGN="middle" SRC="figs/brz/img43.png" BORDER="0" ALT=""> into an integer <IMG ALIGN="middle" SRC="figs/brz/img44.png" BORDER="0" ALT="">.
|
||||
Reducing <IMG ALIGN="middle" SRC="figs/brz/img44.png" BORDER="0" ALT=""> modulo <IMG ALIGN="middle" SRC="figs/brz/img23.png" BORDER="0" ALT="">, we partition <I>S</I>
|
||||
into <IMG ALIGN="middle" SRC="figs/brz/img23.png" BORDER="0" ALT=""> buckets containing at most 256 keys in each bucket (with high
|
||||
probability).
|
||||
</P>
|
||||
<P>
|
||||
The searching step generates a MPHF<IMG ALIGN="middle" SRC="figs/brz/img46.png" BORDER="0" ALT=""> for each bucket <I>i</I>, <IMG ALIGN="middle" SRC="figs/brz/img47.png" BORDER="0" ALT="">.
|
||||
The resulting MPHF <I>h(k)</I>, <IMG ALIGN="middle" SRC="figs/brz/img43.png" BORDER="0" ALT="">, is given by
|
||||
</P>
|
||||
|
||||
<TABLE ALIGN="center" CELLPADDING="4">
|
||||
<TR>
|
||||
<TD><IMG ALIGN="middle" SRC="figs/brz/img49.png" BORDER="0" ALT=""></TD>
|
||||
</TR>
|
||||
</TABLE>
|
||||
|
||||
<P>
|
||||
where <IMG ALIGN="middle" SRC="figs/brz/img50.png" BORDER="0" ALT="">.
|
||||
The <I>i</I>th entry <I>offset[i]</I> of the displacement vector
|
||||
<I>offset</I>, <IMG ALIGN="middle" SRC="figs/brz/img47.png" BORDER="0" ALT="">, contains the total number
|
||||
of keys in the buckets from 0 to <I>i-1</I>, that is, it gives the start of the interval of the
|
||||
keys in the hash table addressed by the MPHF<IMG ALIGN="middle" SRC="figs/brz/img46.png" BORDER="0" ALT="">. In the following we explain
|
||||
each step in detail.
|
||||
</P>
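<P>
The way the per-bucket functions and the <I>offset</I> vector combine can be sketched as follows. This is not the library code: the hash <I>h0</I> is a stand-in built on BLAKE2 rather than the Jenkins function, and each per-bucket MPHF is replaced by a plain dictionary, purely to show that adding <I>offset[i]</I> to a bucket-local value yields a bijection onto <I>0..n-1</I>.
</P>

```python
import hashlib
from itertools import accumulate


def h0(key):
    """Stand-in for the hash used in the partitioning step (an assumption)."""
    digest = hashlib.blake2b(key.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def build(keys, num_buckets):
    """Partition keys into buckets and build one local function per bucket."""
    buckets = [[] for _ in range(num_buckets)]
    for k in keys:
        buckets[h0(k) % num_buckets].append(k)
    # Stand-in for MPHF_i: maps each key of bucket i to 0..size[i]-1.
    local = [{k: j for j, k in enumerate(bucket)} for bucket in buckets]
    size = [len(bucket) for bucket in buckets]
    # offset[i] = number of keys in buckets 0..i-1.
    offset = list(accumulate(size, initial=0))[:-1]
    return local, offset


def h(key, local, offset):
    """Evaluate the composed function: local value plus bucket offset."""
    i = h0(key) % len(local)
    return local[i][key] + offset[i]


keys = [f"http://example.org/page{j}" for j in range(1000)]
local, offset = build(keys, num_buckets=8)
values = sorted(h(k, local, offset) for k in keys)
assert values == list(range(len(keys)))  # minimal and perfect
```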
|
||||
|
||||
<HR NOSHADE SIZE=1>
|
||||
|
||||
<H3>Partitioning step</H3>
|
||||
|
||||
<P>
|
||||
The set <I>S</I> of <I>n</I> keys is partitioned into <IMG ALIGN="middle" SRC="figs/brz/img23.png" BORDER="0" ALT=""> buckets,
|
||||
where <I>b</I> is a suitable parameter chosen to guarantee
|
||||
that each bucket has at most 256 keys with high probability
|
||||
(see <A HREF="#papers">[2</A>] for details).
|
||||
The partitioning step works as follows:
|
||||
</P>
|
||||
|
||||
<TABLE ALIGN="center" CELLPADDING="4">
|
||||
<TR>
|
||||
<TD><IMG ALIGN="middle" SRC="figs/brz/img54.png" BORDER="0" ALT=""></TD>
|
||||
</TR>
|
||||
<TR>
|
||||
<TD><B>Figure 2:</B> Partitioning step.</TD>
|
||||
</TR>
|
||||
</TABLE>
|
||||
|
||||
<P>
|
||||
Statement 1.1 of the <B>for</B> loop presented in Figure 2
|
||||
reads sequentially all the keys of block <IMG ALIGN="middle" SRC="figs/brz/img55.png" BORDER="0" ALT=""> from disk into an internal area
|
||||
of size <IMG ALIGN="middle" SRC="figs/brz/img8.png" BORDER="0" ALT="">.
|
||||
</P>
|
||||
<P>
|
||||
Statement 1.2 performs an indirect bucket sort of the keys in block <IMG ALIGN="middle" SRC="figs/brz/img55.png" BORDER="0" ALT=""> and
|
||||
at the same time updates the entries in the vector <I>size</I>.
|
||||
Let us briefly describe how <IMG ALIGN="middle" SRC="figs/brz/img55.png" BORDER="0" ALT=""> is partitioned among
|
||||
the <IMG ALIGN="middle" SRC="figs/brz/img23.png" BORDER="0" ALT=""> buckets.
|
||||
We use a local array of <IMG ALIGN="middle" SRC="figs/brz/img23.png" BORDER="0" ALT=""> counters to store a
|
||||
count of how many keys from <IMG ALIGN="middle" SRC="figs/brz/img55.png" BORDER="0" ALT=""> belong to each bucket.
|
||||
The pointers to the keys in each bucket <I>i</I>, <IMG ALIGN="middle" SRC="figs/brz/img47.png" BORDER="0" ALT="">,
|
||||
are stored in contiguous positions in an array.
|
||||
For this we first reserve the required number of entries
|
||||
in this array of pointers using the information from the array of counters.
|
||||
Next, we place the pointers to the keys in each bucket into the respective
|
||||
reserved areas in the array (i.e., we place the pointers to the keys in bucket 0,
|
||||
followed by the pointers to the keys in bucket 1, and so on).
|
||||
</P>
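<P>
The indirect bucket sort described above can be sketched as a counting sort over key positions. The names used here (<I>indirect_bucket_sort</I>, <I>bucket_of</I>) are hypothetical, and list indices play the role of the pointers to keys.
</P>

```python
def indirect_bucket_sort(block, bucket_of, num_buckets):
    """Return (pointers, counts): positions in `block` grouped by bucket."""
    # First pass: count how many keys of the block fall in each bucket.
    counts = [0] * num_buckets
    for key in block:
        counts[bucket_of(key)] += 1

    # Reserve a contiguous region per bucket: start[i] is where bucket i begins.
    start = [0] * num_buckets
    for i in range(1, num_buckets):
        start[i] = start[i - 1] + counts[i - 1]

    # Second pass: place each key's position into its bucket's region.
    pointers = [None] * len(block)
    next_free = start[:]
    for idx, key in enumerate(block):
        i = bucket_of(key)
        pointers[next_free[i]] = idx
        next_free[i] += 1
    return pointers, counts


block = ["d", "a", "c", "b", "e"]
bucket_of = lambda key: (ord(key) - ord("a")) % 3  # toy bucket function
pointers, counts = indirect_bucket_sort(block, bucket_of, 3)
print([block[p] for p in pointers], counts)
```

<P>
The keys themselves are never moved; only the array of positions is rearranged, and the counters double as the per-block contribution to the vector <I>size</I>.
</P>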
|
||||
<P>
|
||||
To find the bucket address of a given key
|
||||
we use the universal hash function <IMG ALIGN="middle" SRC="figs/brz/img44.png" BORDER="0" ALT=""><A HREF="#papers">[5</A>].
|
||||
Key <I>k</I> goes into bucket <I>i</I>, where
|
||||
</P>
|
||||
|
||||
<TABLE ALIGN="center" CELLPADDING="4">
|
||||
<TR>
|
||||
<TD><IMG ALIGN="middle" SRC="figs/brz/img57.png" BORDER="0" ALT=""> (1)</TD>
|
||||
</TR>
|
||||
</TABLE>
|
||||
|
||||
<P>
|
||||
Figure 3(a) shows a <I>logical</I> view of the <IMG ALIGN="middle" SRC="figs/brz/img23.png" BORDER="0" ALT=""> buckets
|
||||
generated in the partitioning step.
|
||||
In reality, the keys belonging to each bucket are distributed among many files,
|
||||
as depicted in Figure 3(b).
|
||||
In the example of Figure 3(b), the keys in bucket 0
|
||||
appear in files 1 and <I>N</I>, the keys in bucket 1 appear in files 1, 2
|
||||
and <I>N</I>, and so on.
|
||||
</P>
|
||||
|
||||
<TABLE ALIGN="center" CELLPADDING="4">
|
||||
<TR>
|
||||
<TD><IMG ALIGN="middle" SRC="figs/brz/brz-partitioning.png" BORDER="0" ALT=""></TD>
|
||||
</TR>
|
||||
<TR>
|
||||
<TD><B>Figure 3:</B> Situation of the buckets at the end of the partitioning step: (a) Logical view (b) Physical view.</TD>
|
||||
</TR>
|
||||
</TABLE>
|
||||
|
||||
<P>
|
||||
This scattering of the keys in the buckets could generate a performance
|
||||
problem because of the potential number of seeks
|
||||
needed to read the keys in each bucket from the <I>N</I> files on disk
|
||||
during the searching step.
|
||||
But, as we show in <A HREF="#papers">[2</A>], the number of seeks
|
||||
can be kept small using buffering techniques.
|
||||
Considering that only the vector <I>size</I>, which has <IMG ALIGN="middle" SRC="figs/brz/img23.png" BORDER="0" ALT=""> one-byte
|
||||
entries (remember that each bucket has at most 256 keys),
|
||||
must be maintained in main memory during the searching step,
|
||||
almost all main memory is available to be used as disk I/O buffer.
|
||||
</P>
|
||||
<P>
|
||||
The last step is to compute the <I>offset</I> vector and dump it to the disk.
|
||||
We use the vector <I>size</I> to compute the
|
||||
<I>offset</I> displacement vector.
|
||||
The <I>offset[i]</I> entry contains the number of keys
|
||||
in the buckets <I>0, 1, ..., i-1</I>.
|
||||
As <I>size[i]</I> stores the number of keys
|
||||
in bucket <I>i</I>, where <IMG ALIGN="middle" SRC="figs/brz/img47.png" BORDER="0" ALT="">, we have
|
||||
</P>
|
||||
|
||||
<TABLE ALIGN="center" CELLPADDING="4">
|
||||
<TR>
|
||||
<TD><IMG ALIGN="middle" SRC="figs/brz/img63.png" BORDER="0" ALT=""></TD>
|
||||
</TR>
|
||||
</TABLE>
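<P>
The definition of <I>offset</I> given in the text is an exclusive prefix sum of <I>size</I>. A minimal sketch, with a hypothetical function name and made-up bucket sizes:
</P>

```python
from itertools import accumulate


def compute_offset(size):
    """offset[i] = size[0] + ... + size[i-1], with offset[0] = 0."""
    return list(accumulate(size, initial=0))[:-1]


size = [3, 0, 2, 4]            # made-up bucket sizes
offset = compute_offset(size)  # bucket i covers [offset[i], offset[i] + size[i])
print(offset)
```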
|
||||
|
||||
<HR NOSHADE SIZE=1>
|
||||
|
||||
<H3>Searching step</H3>
|
||||
|
||||
<P>
|
||||
The searching step is responsible for generating a MPHF for each
|
||||
bucket. Figure 4 presents the searching step algorithm.
|
||||
</P>
|
||||
|
||||
<TABLE ALIGN="center" CELLPADDING="4">
|
||||
<TR>
|
||||
<TD><IMG ALIGN="middle" SRC="figs/brz/img64.png" BORDER="0" ALT=""></TD>
|
||||
</TR>
|
||||
<TR>
|
||||
<TD><B>Figure 4:</B> Searching step.</TD>
|
||||
</TR>
|
||||
</TABLE>
|
||||
|
||||
<P>
|
||||
Statement 1 of Figure 4 inserts one key from each file
|
||||
in a minimum heap <I>H</I> of size <I>N</I>.
|
||||
The order relation in <I>H</I> is determined by the bucket address <I>i</I> given by
|
||||
Eq. (1).
|
||||
</P>
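<P>
The role of the heap can be illustrated with a multiway merge. The sketch below assumes each file already holds its keys in increasing bucket order (as produced by the partitioning step) and models files as in-memory lists; the function name <I>read_buckets</I> and the toy bucket function are hypothetical.
</P>

```python
import heapq


def read_buckets(files, bucket_of):
    """Yield (bucket, keys) in increasing bucket order from bucket-sorted files."""
    iters = [iter(f) for f in files]
    heap = []
    # Insert one key from each file; the heap is ordered by bucket address.
    for file_id, it in enumerate(iters):
        key = next(it, None)
        if key is not None:
            heap.append((bucket_of(key), file_id, key))
    heapq.heapify(heap)

    while heap:
        current = heap[0][0]
        keys = []
        # Drain every key of the current bucket, refilling from the same file.
        while heap and heap[0][0] == current:
            _, file_id, key = heapq.heappop(heap)
            keys.append(key)
            nxt = next(iters[file_id], None)
            if nxt is not None:
                heapq.heappush(heap, (bucket_of(nxt), file_id, nxt))
        yield current, keys


files = [[1, 12, 25], [3, 4, 21], [15]]  # toy "files", sorted by bucket
result = list(read_buckets(files, bucket_of=lambda k: k // 10))
print(result)
```

<P>
The heap never holds more than one entry per file, so its size is bounded by <I>N</I>, as stated above.
</P>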
|
||||
<P>
|
||||
Statement 2 has two important steps.
|
||||
In statement 2.1, a bucket is read from disk,
|
||||
as described below.
|
||||
In statement 2.2, a MPHF is generated for each bucket <I>i</I>, as described
|
||||
in the following.
|
||||
The description of MPHF<IMG ALIGN="middle" SRC="figs/brz/img46.png" BORDER="0" ALT=""> is a vector <IMG ALIGN="middle" SRC="figs/brz/img66.png" BORDER="0" ALT=""> of 8-bit integers.
|
||||
Finally, statement 2.3 writes the description <IMG ALIGN="middle" SRC="figs/brz/img66.png" BORDER="0" ALT=""> of MPHF<IMG ALIGN="middle" SRC="figs/brz/img46.png" BORDER="0" ALT=""> to disk.
|
||||
</P>
|
||||
|
||||
<HR NOSHADE SIZE=1>
|
||||
|
||||
<H4>Reading a bucket from disk</H4>
|
||||
|
||||
<P>
|
||||
In this section we present the refinement of statement 2.1 of
|
||||
Figure 4.
|
||||
The algorithm to read bucket <I>i</I> from disk is presented
|
||||
in Figure 5.
|
||||
</P>
|
||||
|
||||
<TABLE ALIGN="center" CELLPADDING="4">
|
||||
<TR>
|
||||
<TD><IMG ALIGN="middle" SRC="figs/brz/img67.png" BORDER="0" ALT=""></TD>
|
||||
</TR>
|
||||
<TR>
|
||||
<TD><B>Figure 5:</B> Reading a bucket.</ |