<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<META NAME="generator" CONTENT="http://txt2tags.org">
<LINK REL="stylesheet" TYPE="text/css" HREF="DOC.css">
<TITLE>External Memory Based Algorithm</TITLE>
</HEAD><BODY BGCOLOR="white" TEXT="black">
<CENTER>
<H1>External Memory Based Algorithm</H1>
</CENTER>
<HR NOSHADE SIZE=1>
<H2>Introduction</H2>
<P>
Until now, because of the limitations of current algorithms,
the use of MPHFs has been restricted to scenarios where the set of keys being hashed is
relatively small.
However, in many cases it is crucial to deal in an efficient way with very large
sets of keys.
Due to the exponential growth of the Web, working with huge collections is becoming
a daily task.
For instance, the simple assignment of number identifiers to web pages of a collection
can be a challenging task.
While traditional databases simply cannot handle more traffic once the working
set of URLs does not fit in main memory anymore<A HREF="#papers">[4</A>], the algorithm we propose here to
construct MPHFs can easily scale to billions of entries.
</P>
<P>
As there are many applications for MPHFs, it is
important to design and implement space and time efficient algorithms for
constructing such functions.
The attractiveness of using MPHFs depends on the following issues:
</P>
<OL>
<LI>The amount of CPU time required by the algorithms for constructing MPHFs.
<P></P>
<LI>The space requirements of the algorithms for constructing MPHFs.
<P></P>
<LI>The amount of CPU time required by a MPHF for each retrieval.
<P></P>
<LI>The space requirements of the description of the resulting MPHFs to be used at retrieval time.
</OL>
<P>
We present here a novel external memory based algorithm for constructing MPHFs that
is very efficient with respect to the four requirements mentioned previously.
First, the time the algorithm takes to construct a MPHF is linear in the number of keys,
which is optimal.
For instance, for a collection of 1 billion URLs
collected from the web, each one 64 characters long on average, the time to construct a
MPHF using a 2.4 gigahertz PC with 500 megabytes of available main memory
is approximately 3 hours.
Second, the algorithm needs a small, a priori defined vector of <IMG ALIGN="middle" SRC="figs/brz/img23.png" BORDER="0" ALT=""> one-byte
entries in main memory to construct a MPHF.
For the collection of 1 billion URLs and using <IMG ALIGN="middle" SRC="figs/brz/img4.png" BORDER="0" ALT="">, the algorithm needs only
5.45 megabytes of internal memory.
Third, the evaluation of the MPHF for each retrieval requires three memory accesses and
the computation of three universal hash functions.
This is not optimal as any MPHF requires at least one memory access and the computation
of two universal hash functions.
Fourth, the description of a MPHF takes a constant number of bits for each key, which is optimal.
For the collection of 1 billion URLs, it needs 8.1 bits for each key,
while the theoretical lower bound is <IMG ALIGN="middle" SRC="figs/brz/img24.png" BORDER="0" ALT=""> bits per key.
</P>
<HR NOSHADE SIZE=1>
<H2>The Algorithm</H2>
<P>
The main idea supporting our algorithm is the classical divide and conquer technique.
The algorithm is a two-step external memory based algorithm
that generates a MPHF <I>h</I> for a set <I>S</I> of <I>n</I> keys.
Figure 1 illustrates the two steps of the
algorithm: the partitioning step and the searching step.
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><IMG ALIGN="middle" SRC="figs/brz/brz.png" BORDER="0" ALT=""></TD>
</TR>
<TR>
<TD><B>Figure 1:</B> Main steps of our algorithm.</TD>
</TR>
</TABLE>
<P>
The partitioning step takes a key set <I>S</I> and uses a universal hash
function <IMG ALIGN="middle" SRC="figs/brz/img42.png" BORDER="0" ALT=""> proposed by Jenkins<A HREF="#papers">[5</A>]
to transform each key <IMG ALIGN="middle" SRC="figs/brz/img43.png" BORDER="0" ALT=""> into an integer <IMG ALIGN="middle" SRC="figs/brz/img44.png" BORDER="0" ALT="">.
Reducing <IMG ALIGN="middle" SRC="figs/brz/img44.png" BORDER="0" ALT=""> modulo <IMG ALIGN="middle" SRC="figs/brz/img23.png" BORDER="0" ALT="">, we partition <I>S</I>
into <IMG ALIGN="middle" SRC="figs/brz/img23.png" BORDER="0" ALT=""> buckets containing at most 256 keys in each bucket (with high
probability).
</P>
<P>
The searching step generates a MPHF<IMG ALIGN="middle" SRC="figs/brz/img46.png" BORDER="0" ALT=""> for each bucket <I>i</I>, <IMG ALIGN="middle" SRC="figs/brz/img47.png" BORDER="0" ALT="">.
The resulting MPHF <I>h(k)</I>, <IMG ALIGN="middle" SRC="figs/brz/img43.png" BORDER="0" ALT="">, is given by
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><IMG ALIGN="middle" SRC="figs/brz/img49.png" BORDER="0" ALT=""></TD>
</TR>
</TABLE>
<P>
where <IMG ALIGN="middle" SRC="figs/brz/img50.png" BORDER="0" ALT="">.
The <I>i</I>th entry <I>offset[i]</I> of the displacement vector
<I>offset</I>, <IMG ALIGN="middle" SRC="figs/brz/img47.png" BORDER="0" ALT="">, contains the total number
of keys in the buckets from 0 to <I>i-1</I>, that is, it gives the interval of the
keys in the hash table addressed by the MPHF<IMG ALIGN="middle" SRC="figs/brz/img46.png" BORDER="0" ALT="">. In the following we explain
each step in detail.
</P>
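<P>
Before detailing each step, the following sketch shows how the resulting function would be evaluated once all the pieces are in place. It is only an illustration in C: <I>jenkins_hash</I> and <I>bmz8_eval</I> are hypothetical placeholders for the universal hash function and for the evaluation of the per-bucket MPHF, and the structure merely mirrors the <I>offset</I> vector and the per-bucket descriptions introduced above.
</P>
<PRE>
/* Sketch of the two-level evaluation h(k) = MPHF_i(k) + offset[i].
 * jenkins_hash() and bmz8_eval() are hypothetical placeholders for the
 * universal hash function h0 and the per-bucket MPHF evaluation. */
#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

extern uint32_t jenkins_hash(const char *key, uint32_t len);     /* h0(k) */
extern uint8_t  bmz8_eval(const uint8_t *g_i, const char *key,
                          uint32_t len);                         /* MPHF_i(k) */

typedef struct {
    uint32_t  nbuckets;   /* number of buckets */
    uint32_t *offset;     /* offset[i] = number of keys in buckets 0..i-1 */
    uint8_t **g;          /* g[i] = 8-bit description of MPHF_i */
} brz_mphf;

uint32_t brz_eval(const brz_mphf *f, const char *key)
{
    uint32_t len = (uint32_t)strlen(key);
    uint32_t i   = jenkins_hash(key, len) % f->nbuckets;   /* bucket address, Eq. (1) below */
    return f->offset[i] + bmz8_eval(f->g[i], key, len);    /* h(k) */
}
</PRE>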
<HR NOSHADE SIZE=1>
<H3>Partitioning step</H3>
<P>
The set <I>S</I> of <I>n</I> keys is partitioned into <IMG ALIGN="middle" SRC="figs/brz/img23.png" BORDER="0" ALT=""> buckets,
where <I>b</I> is a suitable parameter chosen to guarantee
that each bucket has at most 256 keys with high probability
(see <A HREF="#papers">[2</A>] for details).
The partitioning step works as follows:
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><IMG ALIGN="middle" SRC="figs/brz/img54.png" BORDER="0" ALT=""></TD>
</TR>
<TR>
<TD><B>Figure 2:</B> Partitioning step.</TD>
</TR>
</TABLE>
<P>
Statement 1.1 of the <B>for</B> loop presented in Figure 2
reads sequentially all the keys of block <IMG ALIGN="middle" SRC="figs/brz/img55.png" BORDER="0" ALT=""> from disk into an internal area
of size <IMG ALIGN="middle" SRC="figs/brz/img8.png" BORDER="0" ALT="">.
</P>
<P>
Statement 1.2 performs an indirect bucket sort of the keys in block <IMG ALIGN="middle" SRC="figs/brz/img55.png" BORDER="0" ALT=""> and
at the same time updates the entries in the vector <I>size</I>.
Let us briefly describe how <IMG ALIGN="middle" SRC="figs/brz/img55.png" BORDER="0" ALT=""> is partitioned among
the <IMG ALIGN="middle" SRC="figs/brz/img23.png" BORDER="0" ALT=""> buckets.
We use a local array of <IMG ALIGN="middle" SRC="figs/brz/img23.png" BORDER="0" ALT=""> counters to store a
count of how many keys from <IMG ALIGN="middle" SRC="figs/brz/img55.png" BORDER="0" ALT=""> belong to each bucket.
The pointers to the keys in each bucket <I>i</I>, <IMG ALIGN="middle" SRC="figs/brz/img47.png" BORDER="0" ALT="">,
are stored in contiguous positions in an array.
For this we first reserve the required number of entries
in this array of pointers using the information from the array of counters.
Next, we place the pointers to the keys in each bucket into the respective
reserved areas in the array (i.e., we place the pointers to the keys in bucket 0,
followed by the pointers to the keys in bucket 1, and so on).
</P>
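<P>
The sketch below illustrates this indirect bucket sort for one block. It is not the actual implementation: <I>bucket_of</I> is a hypothetical helper returning the bucket address of a key as given by Eq. (1) below, and the key pointers are assumed to point into the internal area filled by statement 1.1.
</P>
<PRE>
/* Indirect bucket sort of one block (a sketch of statement 1.2). */
#include &lt;stdint.h&gt;
#include &lt;stdlib.h&gt;

extern uint32_t bucket_of(const char *key);    /* bucket address, Eq. (1) below */

void sort_block(const char **keys, uint32_t nkeys,
                uint32_t nbuckets, uint8_t *size,
                const char **sorted /* output: nkeys key pointers */)
{
    uint32_t *count = calloc(nbuckets, sizeof *count);
    uint32_t *start = malloc(nbuckets * sizeof *start);

    /* 1. count how many keys of this block fall into each bucket and
     *    update the global one-byte vector size[] at the same time   */
    for (uint32_t j = 0; j &lt; nkeys; j++) {
        count[bucket_of(keys[j])]++;
        size[bucket_of(keys[j])]++;
    }
    /* 2. reserve a contiguous range of pointer slots for every bucket */
    for (uint32_t i = 0, acc = 0; i &lt; nbuckets; i++) {
        start[i] = acc;
        acc += count[i];
    }
    /* 3. place each key pointer into its bucket's reserved range */
    for (uint32_t j = 0; j &lt; nkeys; j++)
        sorted[start[bucket_of(keys[j])]++] = keys[j];

    free(count);
    free(start);
}
</PRE>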
<P>
To find the bucket address of a given key
we use the universal hash function <IMG ALIGN="middle" SRC="figs/brz/img44.png" BORDER="0" ALT=""><A HREF="#papers">[5</A>].
Key <I>k</I> goes into bucket <I>i</I>, where
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><IMG ALIGN="middle" SRC="figs/brz/img57.png" BORDER="0" ALT=""> (1)</TD>
</TR>
</TABLE>
<P>
Figure 3(a) shows a <I>logical</I> view of the <IMG ALIGN="middle" SRC="figs/brz/img23.png" BORDER="0" ALT=""> buckets
generated in the partitioning step.
In reality, the keys belonging to each bucket are distributed among many files,
as depicted in Figure 3(b).
In the example of Figure 3(b), the keys in bucket 0
appear in files 1 and <I>N</I>, the keys in bucket 1 appear in files 1, 2
and <I>N</I>, and so on.
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><IMG ALIGN="middle" SRC="figs/brz/brz-partitioning.png" BORDER="0" ALT=""></TD>
</TR>
<TR>
<TD><B>Figure 3:</B> Situation of the buckets at the end of the partitioning step: (a) Logical view (b) Physical view.</TD>
</TR>
</TABLE>
<P>
This scattering of the keys in the buckets could generate a performance
problem because of the potential number of seeks
needed to read the keys in each bucket from the <I>N</I> files in disk
during the searching step.
But, as we show in <A HREF="#papers">[2</A>], the number of seeks
can be kept small using buffering techniques.
Considering that only the vector <I>size</I>, which has <IMG ALIGN="middle" SRC="figs/brz/img23.png" BORDER="0" ALT=""> one-byte
entries (remember that each bucket has at most 256 keys),
must be maintained in main memory during the searching step,
almost all main memory is available to be used as disk I/O buffer.
</P>
<P>
The last step is to compute the <I>offset</I> vector and dump it to the disk.
We use the vector <I>size</I> to compute the
<I>offset</I> displacement vector.
The <I>offset[i]</I> entry contains the number of keys
in the buckets <I>0, 1, ..., i-1</I>.
As <I>size[i]</I> stores the number of keys
in bucket <I>i</I>, where <IMG ALIGN="middle" SRC="figs/brz/img47.png" BORDER="0" ALT="">, we have
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><IMG ALIGN="middle" SRC="figs/brz/img63.png" BORDER="0" ALT=""></TD>
</TR>
</TABLE>
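<P>
In code, computing <I>offset</I> amounts to a prefix sum over <I>size</I>. The sketch below assumes the one-byte <I>size</I> vector described above; the names are illustrative only.
</P>
<PRE>
#include &lt;stdint.h&gt;

/* offset[i] = size[0] + size[1] + ... + size[i-1] (a sketch). */
void compute_offset(const uint8_t *size, uint32_t nbuckets, uint32_t *offset)
{
    uint32_t acc = 0;
    for (uint32_t i = 0; i &lt; nbuckets; i++) {
        offset[i] = acc;        /* keys of buckets 0..i-1 precede bucket i */
        acc += size[i];
    }
}
</PRE>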
<HR NOSHADE SIZE=1>
<H3>Searching step</H3>
<P>
The searching step is responsible for generating a MPHF for each
bucket. Figure 4 presents the searching step algorithm.
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><IMG ALIGN="middle" SRC="figs/brz/img64.png" BORDER="0" ALT=""></TD>
</TR>
<TR>
<TD><B>Figure 4:</B> Searching step.</TD>
</TR>
</TABLE>
<P>
Statement 1 of Figure 4 inserts one key from each file
into a minimum heap <I>H</I> of size <I>N</I>.
The order relation in <I>H</I> is the bucket address <I>i</I> computed by
Eq. (1).
</P>
<P>
Statement 2 has two important steps.
In statement 2.1, a bucket is read from disk,
as described in the next subsection.
In statement 2.2, a MPHF is generated for bucket <I>i</I>, as described
in the subsection after that.
The description of MPHF<IMG ALIGN="middle" SRC="figs/brz/img46.png" BORDER="0" ALT=""> is a vector <IMG ALIGN="middle" SRC="figs/brz/img66.png" BORDER="0" ALT=""> of 8-bit integers.
Finally, statement 2.3 writes the description <IMG ALIGN="middle" SRC="figs/brz/img66.png" BORDER="0" ALT=""> of MPHF<IMG ALIGN="middle" SRC="figs/brz/img46.png" BORDER="0" ALT=""> to disk.
</P>
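<P>
The sketch below summarizes the loop of Figure 4. All helpers are hypothetical stand-ins: <I>heap_init_from_files</I> corresponds to statement 1, <I>read_bucket</I> to statement 2.1 (refined in the next subsection), <I>bmz8_build</I> to statement 2.2, and <I>dump_mphf</I> to statement 2.3.
</P>
<PRE>
#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;

extern void     heap_init_from_files(void);                      /* statement 1   */
extern uint32_t read_bucket(uint32_t i, const char **bucket);    /* statement 2.1 */
extern uint8_t *bmz8_build(const char **bucket, uint32_t nkeys); /* statement 2.2 */
extern void     dump_mphf(FILE *out, uint32_t i,
                          const uint8_t *g, uint32_t nkeys);     /* statement 2.3 */

void searching_step(FILE *out, uint32_t nbuckets)
{
    const char *bucket[256];            /* a bucket holds at most 256 keys */
    heap_init_from_files();             /* one key from each of the N files */
    for (uint32_t i = 0; i &lt; nbuckets; i++) {
        uint32_t nkeys = read_bucket(i, bucket);  /* 2.1: multiway merge from disk  */
        uint8_t *g = bmz8_build(bucket, nkeys);   /* 2.2: MPHF_i as an 8-bit vector */
        dump_mphf(out, i, g, nkeys);              /* 2.3: write the description g_i */
    }
}
</PRE>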
<HR NOSHADE SIZE=1>
<H4>Reading a bucket from disk</H4>
<P>
In this section we present the refinement of statement 2.1 of
Figure 4.
The algorithm to read bucket <I>i</I> from disk is presented
in Figure 5.
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><IMG ALIGN="middle" SRC="figs/brz/img67.png" BORDER="0" ALT=""></TD>
</TR>
<TR>
<TD><B>Figure 5:</B> Reading a bucket.</TD>
</TR>
</TABLE>
<P>
Bucket <I>i</I> is distributed among many files and the heap <I>H</I> is used to drive a
multiway merge operation.
In Figure 5, statement 1.1 extracts and removes the triple
<I>(i, j, k)</I> from <I>H</I>, where <I>i</I> is the minimum bucket address in <I>H</I>.
Statement 1.2 inserts key <I>k</I> in bucket <I>i</I>.
Notice that the <I>k</I> in the triple <I>(i, j, k)</I> is in fact a pointer to
the first byte of the key that is kept in contiguous positions of an array of characters
(this array containing the keys is initialized during the heap construction
in statement 1 of Figure 4).
Statement 1.3 performs a seek operation in File <I>j</I> on disk for the first
read operation, reads sequentially all keys <I>k</I> that have the same bucket address <I>i</I>,
and inserts them all in bucket <I>i</I>.
Finally, statement 1.4 inserts in <I>H</I> the triple <I>(i, j, x)</I>,
where <I>x</I> is the first key read from File <I>j</I> (in statement 1.3)
that does not have the same bucket address as the previous keys.
</P>
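<P>
A sketch of this refinement is given below. The heap and file routines are hypothetical placeholders: <I>heap_min_bucket</I> returns the bucket address of the minimum triple (or -1 when the heap is empty), <I>heap_extract</I> and <I>heap_insert</I> implement statements 1.1 and 1.4, <I>next_key</I> reads the next key of a file into the in-memory key area (returning NULL at end of file), and <I>bucket_of</I> computes the bucket address of Eq. (1).
</P>
<PRE>
#include &lt;stdint.h&gt;
#include &lt;stddef.h&gt;

typedef struct { uint32_t i, j; const char *k; } triple;  /* (bucket, file, key) */
extern int      heap_min_bucket(void);         /* -1 when the heap is empty */
extern triple   heap_extract(void);            /* statement 1.1 */
extern void     heap_insert(triple t);         /* statement 1.4 */
extern const char *next_key(uint32_t j);       /* next key of file j, NULL at end */
extern uint32_t bucket_of(const char *key);    /* Eq. (1) */

uint32_t read_bucket(uint32_t i, const char **bucket)
{
    uint32_t nkeys = 0;
    while (heap_min_bucket() == (int)i) {
        triple t = heap_extract();                  /* 1.1 */
        bucket[nkeys++] = t.k;                      /* 1.2 */
        /* 1.3: read sequentially from file j all keys with bucket address i */
        const char *k = next_key(t.j);
        while (k != NULL &amp;&amp; bucket_of(k) == i) {
            bucket[nkeys++] = k;
            k = next_key(t.j);
        }
        if (k != NULL)                              /* 1.4: first key of another bucket */
            heap_insert((triple){ bucket_of(k), t.j, k });
    }
    return nkeys;
}
</PRE>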
<P>
The number of seek operations on disk performed in statement 1.3 is discussed
in <A HREF="#papers">[2, Section 5.1</A>],
where we present a buffering technique that brings down
the time spent with seeks.
</P>
<HR NOSHADE SIZE=1>
<H4>Generating a MPHF for each bucket</H4>
<P>
To the best of our knowledge, the <A HREF="bmz.html">BMZ algorithm</A> we have designed in
our previous works <A HREF="#papers">[1,3</A>] is the fastest published algorithm for
constructing MPHFs.
That is why we are using that algorithm as a building block for the
algorithm presented here. In reality, we are using
an optimized version of BMZ (BMZ8) for small sets of keys (at most 256 keys).
<A HREF="bmz.html">Click here to see details about the BMZ algorithm</A>.
</P>
<HR NOSHADE SIZE=1>
<H2>Analysis of the Algorithm</H2>
<P>
Analytical results and the complete analysis of the external memory based algorithm
can be found in <A HREF="#papers">[2</A>].
</P>
<HR NOSHADE SIZE=1>
<H2>Experimental Results</H2>
<P>
In this section we present the experimental results.
We start by presenting the experimental setup.
We then present experimental results for
the internal memory based algorithm (<A HREF="bmz.html">the BMZ algorithm</A>)
and for our external memory based algorithm.
Finally, we discuss how the amount of internal memory available
affects the runtime of the external memory based algorithm.
</P>
<HR NOSHADE SIZE=1>
<H3>The data and the experimental setup</H3>
<P>
All experiments were carried out on
a computer running the Linux operating system, version 2.6,
with a 2.4 gigahertz processor and
1 gigabyte of main memory.
In the experiments related to the new
algorithm we limited the main memory to 500 megabytes.
</P>
<P>
Our data consists of a collection of 1 billion
URLs collected from the Web, each URL 64 characters long on average.
The collection is stored on disk and occupies 60.5 gigabytes.
</P>
<HR NOSHADE SIZE=1>
<H3>Performance of the BMZ Algorithm</H3>
<P>
<A HREF="bmz.html">The BMZ algorithm</A> is used for constructing a MPHF for each bucket.
It is a randomized algorithm because it needs to generate a simple random graph
in its first step.
Once the graph is obtained the other two steps are deterministic.
</P>
<P>
Thus, we can consider the runtime of the algorithm to have
the form <IMG ALIGN="middle" SRC="figs/brz/img159.png" BORDER="0" ALT=""> for an input of <I>n</I> keys,
where <IMG ALIGN="middle" SRC="figs/brz/img160.png" BORDER="0" ALT=""> is some machine-dependent
constant that further depends on the length of the keys and <I>Z</I> is a random
variable with geometric distribution with mean <IMG ALIGN="middle" SRC="figs/brz/img162.png" BORDER="0" ALT="">. All results
in our experiments were obtained taking <I>c=1</I>; the value of <I>c</I>, with <I>c</I> in <I>[0.93,1.15]</I>,
in fact has little influence on the runtime, as shown in <A HREF="#papers">[3</A>].
</P>
<P>
The values chosen for <I>n</I> were 1, 2, 4, 8, 16 and 32 million.
Although we have a dataset with 1 billion URLs, on a PC with
1 gigabyte of main memory the algorithm is able
to handle an input with at most 32 million keys.
This is mainly because of the graph we need to keep in main memory.
The algorithm requires <I>25n + O(1)</I> bytes for constructing
a MPHF (<A HREF="bmz.html">click here to get details about the data structures used by the BMZ algorithm</A>); for <I>n</I> = 32 million keys these data structures alone take roughly 800 megabytes.
</P>
<P>
In order to estimate the number of trials for each value of <I>n</I> we use
a statistical method for determining a suitable sample size (see, e.g., <A HREF="#papers">[6, Chapter 13</A>]).
As we obtained different values for each <I>n</I>,
we used the maximal value obtained, namely 300 trials, in order to have
a confidence level of 95 %.
</P>
<P>
Table 1 presents the runtime average for each <I>n</I>,
the respective standard deviations, and
the respective confidence intervals given by
the average time <IMG ALIGN="middle" SRC="figs/brz/img167.png" BORDER="0" ALT=""> the distance from average time
considering a confidence level of 95 %.
Observing the runtime averages one sees that
the algorithm runs in expected linear time,
as shown in <A HREF="#papers">[3</A>].
</P>
<TABLE CELLPADDING=3 BORDER="1" ALIGN="CENTER">
<TR>
<TD ALIGN="CENTER"><I>n</I> (millions)</TD>
<TD ALIGN="CENTER">1</TD>
<TD ALIGN="CENTER">2</TD>
<TD ALIGN="CENTER">4</TD>
<TD ALIGN="CENTER">8</TD>
<TD ALIGN="CENTER">16</TD>
<TD ALIGN="CENTER">32</TD>
</TR>
<TR>
<TD ALIGN="CENTER">Average time (s)</TD>
<TD ALIGN="CENTER">6.1 &plusmn; 0.3</TD>
<TD ALIGN="CENTER">12.2 &plusmn; 0.6</TD>
<TD ALIGN="CENTER">25.4 &plusmn; 1.1</TD>
<TD ALIGN="CENTER">51.4 &plusmn; 2.0</TD>
<TD ALIGN="CENTER">117.3 &plusmn; 4.4</TD>
<TD ALIGN="CENTER">262.2 &plusmn; 8.7</TD>
</TR>
<TR>
<TD ALIGN="CENTER">SD (s)</TD>
<TD ALIGN="CENTER">2.6</TD>
<TD ALIGN="CENTER">5.4</TD>
<TD ALIGN="CENTER">9.8</TD>
<TD ALIGN="CENTER">17.6</TD>
<TD ALIGN="CENTER">37.3</TD>
<TD ALIGN="CENTER">76.3</TD>
</TR>
</TABLE>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><B>Table 1:</B> Internal memory based algorithm: average time in seconds for constructing a MPHF, the standard deviation (SD), and the confidence intervals considering a confidence level of 95 %.</TD>
</TR>
</TABLE>
<P>
Figure 6 presents the runtime for each trial. In addition,
the solid line corresponds to a linear regression model
obtained from the experimental measurements.
As we can see, the runtime for a given <I>n</I> has a considerable
fluctuation. However, the fluctuation also grows linearly with <I>n</I>.
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><IMG ALIGN="middle" SRC="figs/brz/bmz_temporegressao.png" BORDER="0" ALT=""></TD>
</TR>
<TR>
<TD><B>Figure 6:</B> Time versus number of keys in <I>S</I> for the internal memory based algorithm. The solid line corresponds to a linear regression model.</TD>
</TR>
</TABLE>
<P>
The observed fluctuation in the runtimes is as expected; recall that this
runtime has the form <IMG ALIGN="middle" SRC="figs/brz/img159.png" BORDER="0" ALT=""> with <I>Z</I> a geometric random variable with
mean <I>1/p=e</I>. Thus, the runtime has mean <IMG ALIGN="middle" SRC="figs/brz/img181.png" BORDER="0" ALT=""> and standard
deviation <IMG ALIGN="middle" SRC="figs/brz/img182.png" BORDER="0" ALT="">.
Therefore, the standard deviation also grows
linearly with <I>n</I>, as experimentally verified
in Table 1 and in Figure 6.
</P>
<HR NOSHADE SIZE=1>
<H3>Performance of the External Memory Based Algorithm</H3>
<P>
The runtime of the external memory based algorithm is also a random variable,
but now it follows a (highly concentrated) normal distribution, as we discuss at the end of this
section. Again, we are interested in verifying the linearity claim made in
<A HREF="#papers">[2, Section 5.1</A>]. Therefore, we ran the algorithm for
several numbers <I>n</I> of keys in <I>S</I>.
</P>
<P>
The values chosen for <I>n</I> were 1, 2, 4, 8, 16, 32, 64, 128, 512 and 1000
million.
We limited the main memory to 500 megabytes for the experiments.
The size <IMG ALIGN="middle" SRC="figs/brz/img8.png" BORDER="0" ALT=""> of the a priori reserved internal memory area
was set to 250 megabytes, the parameter <I>b</I> was set to <I>175</I> and
the building block algorithm parameter <I>c</I> was again set to <I>1</I>.
We show later on how <IMG ALIGN="middle" SRC="figs/brz/img8.png" BORDER="0" ALT=""> affects the runtime of the algorithm. The other two parameters
have insignificant influence on the runtime.
</P>
<P>
We again use a statistical method for determining a suitable sample size
to estimate the number of trials to be run for each value of <I>n</I>. We found that
just one trial for each <I>n</I> would be enough with a confidence level of 95 %.
However, we made 10 trials. This number of trials seems rather small, but, as
shown below, the behavior of our algorithm is very stable and its runtime is
almost deterministic (i.e., the standard deviation is very small).
</P>
<P>
Table 2 presents the runtime average for each <I>n</I>,
the respective standard deviations, and
the respective confidence intervals given by
the average time <IMG ALIGN="middle" SRC="figs/brz/img167.png" BORDER="0" ALT=""> the distance from average time
considering a confidence level of 95 %.
Observing the runtime averages we noticed that
the algorithm runs in expected linear time,
as shown in <A HREF="#papers">[2, Section 5.1</A>]. Better still,
it is only approximately 60 % slower than the BMZ algorithm.
To get that value we used the linear regression model obtained for the runtime of
the internal memory based algorithm to estimate how much time it would require
for constructing a MPHF for a set of 1 billion keys.
We got 2.3 hours for the internal memory based algorithm and we measured
3.67 hours on average for the external memory based algorithm.
Increasing the size of the internal memory area
from 250 to 600 megabytes
brought the time down to 3.09 hours. In that setup, the external memory based algorithm is
just 34 % slower.
</P>
<TABLE CELLPADDING=3 BORDER="1" ALIGN="CENTER">
<TR>
<TD ALIGN="LEFT"><I>n</I> (millions)</TD>
<TD ALIGN="CENTER">1</TD>
<TD ALIGN="CENTER">2</TD>
<TD ALIGN="CENTER">4</TD>
<TD ALIGN="CENTER">8</TD>
<TD ALIGN="CENTER">16</TD>
</TR>
<TR>
<TD ALIGN="LEFT">Average time (s)</TD>
<TD ALIGN="CENTER">6.9 &plusmn; 0.3</TD>
<TD ALIGN="CENTER">13.8 &plusmn; 0.2</TD>
<TD ALIGN="CENTER">31.9 &plusmn; 0.7</TD>
<TD ALIGN="CENTER">69.9 &plusmn; 1.1</TD>
<TD ALIGN="CENTER">140.6 &plusmn; 2.5</TD>
</TR>
<TR>
<TD ALIGN="LEFT">SD (s)</TD>
<TD ALIGN="CENTER">0.4</TD>
<TD ALIGN="CENTER">0.2</TD>
<TD ALIGN="CENTER">0.9</TD>
<TD ALIGN="CENTER">1.5</TD>
<TD ALIGN="CENTER">3.5</TD>
</TR>
<TR>
<TD ALIGN="LEFT"><I>n</I> (millions)</TD>
<TD ALIGN="CENTER">32</TD>
<TD ALIGN="CENTER">64</TD>
<TD ALIGN="CENTER">128</TD>
<TD ALIGN="CENTER">512</TD>
<TD ALIGN="CENTER">1000</TD>
</TR>
<TR>
<TD ALIGN="LEFT">Average time (s)</TD>
<TD ALIGN="CENTER">284.3 &plusmn; 1.1</TD>
<TD ALIGN="CENTER">587.9 &plusmn; 3.9</TD>
<TD ALIGN="CENTER">1223.6 &plusmn; 4.9</TD>
<TD ALIGN="CENTER">5966.4 &plusmn; 9.5</TD>
<TD ALIGN="CENTER">13229.5 &plusmn; 12.7</TD>
</TR>
<TR>
<TD ALIGN="LEFT">SD (s)</TD>
<TD ALIGN="CENTER">1.6</TD>
<TD ALIGN="CENTER">5.5</TD>
<TD ALIGN="CENTER">6.8</TD>
<TD ALIGN="CENTER">13.2</TD>
<TD ALIGN="CENTER">18.6</TD>
</TR>
</TABLE>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><B>Table 2:</B> The external memory based algorithm: average time in seconds for constructing a MPHF, the standard deviation (SD), and the confidence intervals considering a confidence level of 95 %.</TD>
</TR>
</TABLE>
<P>
Figure 7 presents the runtime for each trial. In addition,
the solid line corresponds to a linear regression model
obtained from the experimental measurements.
As we expected, the runtime for a given <I>n</I> shows almost no
variation.
</P>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><IMG ALIGN="middle" SRC="figs/brz/brz_temporegressao.png" BORDER="0" ALT=""></TD>
</TR>
<TR>
<TD><B>Figure 7:</B> Time versus number of keys in <I>S</I> for our algorithm. The solid line corresponds to a linear regression model.</TD>
</TR>
</TABLE>
<P>
An intriguing observation is that the runtime of the algorithm is almost
deterministic, in spite of the fact that it uses as building block an
algorithm with a considerable fluctuation in its runtime. A given bucket
<I>i</I>, <IMG ALIGN="middle" SRC="figs/brz/img47.png" BORDER="0" ALT="">, is a small set of keys (at most 256 keys) and,
as argued in the last section, the runtime of the
building block algorithm is a random variable <IMG ALIGN="middle" SRC="figs/brz/img207.png" BORDER="0" ALT=""> with high fluctuation.
However, the runtime <I>Y</I> of the searching step of the external memory based algorithm is given
by <IMG ALIGN="middle" SRC="figs/brz/img209.png" BORDER="0" ALT="">. Under the hypothesis that
the <IMG ALIGN="middle" SRC="figs/brz/img207.png" BORDER="0" ALT=""> are independent and bounded, the <I>law of large numbers</I> (see,
e.g., <A HREF="#papers">[6</A>]) implies that the random variable <IMG ALIGN="middle" SRC="figs/brz/img210.png" BORDER="0" ALT=""> converges
to a constant as <IMG ALIGN="middle" SRC="figs/brz/img83.png" BORDER="0" ALT="">. This explains why the runtime of our
algorithm is almost deterministic.
</P>
<HR NOSHADE SIZE=1>
<H3>Controlling disk accesses</H3>
<P>
In order to bring down the number of seek operations on disk
we benefit from the fact that our algorithm leaves almost all main
memory available to be used as disk I/O buffer.
In this section we evaluate how much the parameter <IMG ALIGN="middle" SRC="figs/brz/img8.png" BORDER="0" ALT=""> affects the runtime of our algorithm.
For that, we fixed <I>n</I> at 1 billion URLs,
set the main memory of the machine used for the experiments
to 1 gigabyte and used <IMG ALIGN="middle" SRC="figs/brz/img8.png" BORDER="0" ALT=""> equal to 100, 200, 300, 400, 500 and 600
megabytes.
</P>
<P>
Table 3 presents the number of files <I>N</I>,
the buffer size used for each file, the number of seeks in the worst case considering
the pessimistic assumption mentioned in <A HREF="#papers">[2, Section 5.1</A>], and
the time to generate a MPHF for 1 billion keys as a function of the amount of internal
memory available. Observing Table 3 we noticed that the time spent in the construction
decreases as the value of <IMG ALIGN="middle" SRC="figs/brz/img8.png" BORDER="0" ALT=""> increases. However, for <IMG ALIGN="middle" SRC="figs/brz/img213.png" BORDER="0" ALT="">, the variation
in the time is not as significant as for <IMG ALIGN="middle" SRC="figs/brz/img214.png" BORDER="0" ALT="">.
This can be explained by the fact that the Linux 2.6 kernel I/O scheduler
has smart policies for avoiding seeks and diminishing the average seek time
(see <A HREF="http://www.linuxjournal.com/article/6931">http://www.linuxjournal.com/article/6931</A>).
</P>
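<P>
The per-file buffer sizes reported in Table 3 below are consistent with splitting the internal memory area evenly among the <I>N</I> files. The snippet below is only an illustrative check of the first column, not part of the algorithm.
</P>
<PRE>
#include &lt;stdio.h&gt;

int main(void)
{
    double mu_mb  = 100.0;   /* internal memory area, in megabytes */
    int    nfiles = 619;     /* N for mu = 100 megabytes, from Table 3 */
    /* per-file buffer, in kilobytes */
    printf("buffer per file: %.0f KB\n", mu_mb * 1024.0 / nfiles);  /* about 165 KB */
    return 0;
}
</PRE>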
<TABLE CELLPADDING=3 BORDER="1" ALIGN="center">
<TR>
<TD ALIGN="LEFT"><IMG ALIGN="middle" SRC="figs/brz/img8.png" BORDER="0" ALT=""> (MB)</TD>
<TD ALIGN="CENTER">100</TD>
<TD ALIGN="CENTER">200</TD>
<TD ALIGN="CENTER">300</TD>
<TD ALIGN="CENTER">400</TD>
<TD ALIGN="CENTER">500</TD>
<TD ALIGN="CENTER">600</TD>
</TR>
<TR>
<TD ALIGN="LEFT"><I>N</I> (files)</TD>
<TD ALIGN="CENTER">619</TD>
<TD ALIGN="CENTER">310</TD>
<TD ALIGN="CENTER">207</TD>
<TD ALIGN="CENTER">155</TD>
<TD ALIGN="CENTER">124</TD>
<TD ALIGN="CENTER">104</TD>
</TR>
<TR>
<TD ALIGN="LEFT">Buffer size per file (KB)</TD>
<TD ALIGN="CENTER">165</TD>
<TD ALIGN="CENTER">661</TD>
<TD ALIGN="CENTER">1,484</TD>
<TD ALIGN="CENTER">2,643</TD>
<TD ALIGN="CENTER">4,129</TD>
<TD ALIGN="CENTER">5,908</TD>
</TR>
<TR>
<TD ALIGN="LEFT"># of seeks in the worst case</TD>
<TD ALIGN="CENTER">384,478</TD>
<TD ALIGN="CENTER">95,974</TD>
<TD ALIGN="CENTER">42,749</TD>
<TD ALIGN="CENTER">24,003</TD>
<TD ALIGN="CENTER">15,365</TD>
<TD ALIGN="CENTER">10,738</TD>
</TR>
<TR>
<TD ALIGN="LEFT">Time (hours)</TD>
<TD ALIGN="CENTER">4.04</TD>
<TD ALIGN="CENTER">3.64</TD>
<TD ALIGN="CENTER">3.34</TD>
<TD ALIGN="CENTER">3.20</TD>
<TD ALIGN="CENTER">3.13</TD>
<TD ALIGN="CENTER">3.09</TD>
</TR>
</TABLE>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><B>Table 3:</B> Influence of the internal memory area size (<IMG ALIGN="middle" SRC="figs/brz/img8.png" BORDER="0" ALT="">) on the runtime of the external memory based algorithm.</TD>
</TR>
</TABLE>
<HR NOSHADE SIZE=1>
<A NAME="papers"></A>
<H2>Papers</H2>
<OL>
<LI><A HREF="http://www.dcc.ufmg.br/~fbotelho">F. C. Botelho</A>, D. Menoti, <A HREF="http://www.dcc.ufmg.br/~nivio">N. Ziviani</A>. <A HREF="papers/bmz_tr004_04.ps">A New algorithm for constructing minimal perfect hash functions</A>, Technical Report TR004/04, Department of Computer Science, Federal University of Minas Gerais, 2004.
<P></P>
<LI><A HREF="http://www.dcc.ufmg.br/~fbotelho">F. C. Botelho</A>, Y. Kohayakawa, <A HREF="http://www.dcc.ufmg.br/~nivio">N. Ziviani</A>. <A HREF="papers/tr06.pdf">An Approach for Minimal Perfect Hash Functions for Very Large Databases</A>, Technical Report TR003/06, Department of Computer Science, Federal University of Minas Gerais, 2006.
<P></P>
<LI><A HREF="http://www.dcc.ufmg.br/~fbotelho">F. C. Botelho</A>, Y. Kohayakawa, and <A HREF="http://www.dcc.ufmg.br/~nivio">N. Ziviani</A>. <A HREF="papers/wea05.pdf">A Practical Minimal Perfect Hashing Method</A>. <I>4th International Workshop on Efficient and Experimental Algorithms (WEA05),</I> Springer-Verlag Lecture Notes in Computer Science, vol. 3505, Santorini Island, Greece, May 2005, pp. 488-500.
<P></P>
<LI><A HREF="http://acmqueue.com/modules.php?name=Content&pa=showpage&pid=299">M. Seltzer. Beyond relational databases. ACM Queue, 3(3), April 2005.</A>
<P></P>
<LI><A HREF="http://burtleburtle.net/bob/hash/doobs.html">Bob Jenkins. Algorithm alley: Hash functions. Dr. Dobb's Journal of Software Tools, 22(9), September 1997.</A>
<P></P>
<LI>R. Jain. The art of computer systems performance analysis: techniques for experimental design, measurement, simulation, and modeling. John Wiley, first edition, 1991.
</OL>
<HR NOSHADE SIZE=1>
<TABLE ALIGN="center" CELLPADDING="4">
<TR>
<TD><A HREF="index.html">Home</A></TD>
<TD><A HREF="chd.html">CHD</A></TD>
<TD><A HREF="bdz.html">BDZ</A></TD>
<TD><A HREF="bmz.html">BMZ</A></TD>
<TD><A HREF="chm.html">CHM</A></TD>
<TD><A HREF="brz.html">BRZ</A></TD>
<TD><A HREF="fch.html">FCH</A></TD>
</TR>
</TABLE>
<HR NOSHADE SIZE=1>
<P>
Enjoy!
</P>
<P>
<A HREF="mailto:davi@users.sourceforge.net">Davi de Castro Reis</A>
</P>
<P>
<A HREF="mailto:db8192@users.sourceforge.net">Djamel Belazzougui</A>
</P>
<P>
<A HREF="mailto:fc_botelho@users.sourceforge.net">Fabiano Cupertino Botelho</A>
</P>
<P>
<A HREF="mailto:nivio@dcc.ufmg.br">Nivio Ziviani</A>
</P>
<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
try {
var pageTracker = _gat._getTracker("UA-7698683-2");
pageTracker._trackPageview();
} catch(err) {}</script>
<!-- html code generated by txt2tags 2.6 (http://txt2tags.org) -->
<!-- cmdline: txt2tags -t html -i BRZ.t2t -o docs/brz.html -->
</BODY></HTML>