<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
|
|
<HTML>
|
|
<HEAD>
|
|
<META NAME="generator" CONTENT="http://txt2tags.org">
|
|
<LINK REL="stylesheet" TYPE="text/css" HREF="DOC.css">
|
|
<TITLE>External Memory Based Algorithm</TITLE>
|
|
</HEAD><BODY BGCOLOR="white" TEXT="black">
|
|
<CENTER>
|
|
<H1>External Memory Based Algorithm</H1>
|
|
</CENTER>
|
|
|
|
|
|
<HR NOSHADE SIZE=1>
|
|
|
|
<H2>Introduction</H2>
|
|
|
|
<P>
|
|
Until now, because of the limitations of current algorithms,
|
|
the use of MPHFs is restricted to scenarios where the set of keys being hashed is
|
|
relatively small.
|
|
However, in many cases it is crucial to deal in an efficient way with very large
|
|
sets of keys.
|
|
Due to the exponential growth of the Web, the work with huge collections is becoming
|
|
a daily task.
|
|
For instance, the simple assignment of number identifiers to web pages of a collection
|
|
can be a challenging task.
|
|
While traditional databases simply cannot handle more traffic once the working
|
|
set of URLs does not fit in main memory anymore<A HREF="#papers">[4</A>], the algorithm we propose here to
|
|
construct MPHFs can easily scale to billions of entries.
|
|
</P>
|
|
<P>
|
|
As there are many applications for MPHFs, it is
|
|
important to design and implement space and time efficient algorithms for
|
|
constructing such functions.
|
|
The attractiveness of using MPHFs depends on the following issues:
|
|
</P>
|
|
|
|
<OL>
|
|
<LI>The amount of CPU time required by the algorithms for constructing MPHFs.
|
|
<P></P>
|
|
<LI>The space requirements of the algorithms for constructing MPHFs.
|
|
<P></P>
|
|
<LI>The amount of CPU time required by a MPHF for each retrieval.
|
|
<P></P>
|
|
<LI>The space requirements of the description of the resulting MPHFs to be used at retrieval time.
|
|
</OL>
|
|
|
|
<P>
|
|
We present here a novel external memory based algorithm for constructing MPHFs that
|
|
are very efficient in the four requirements mentioned previously.
|
|
First, the algorithm constructs a MPHF in time linear in the size of the key set,
|
|
which is optimal.
|
|
For instance, for a collection of 1 billion URLs
|
|
collected from the web, each one 64 characters long on average, the time to construct a
|
|
MPHF using a 2.4 gigahertz PC with 500 megabytes of available main memory
|
|
is approximately 3 hours.
|
|
Second, the algorithm needs a small a priori defined vector of <IMG ALIGN="middle" SRC="figs/brz/img23.png" BORDER="0" ALT=""> one
|
|
byte entries in main memory to construct a MPHF.
|
|
For the collection of 1 billion URLs and using <IMG ALIGN="middle" SRC="figs/brz/img4.png" BORDER="0" ALT="">, the algorithm needs only
|
|
5.45 megabytes of internal memory.
|
|
Third, the evaluation of the MPHF for each retrieval requires three memory accesses and
|
|
the computation of three universal hash functions.
|
|
This is not optimal as any MPHF requires at least one memory access and the computation
|
|
of two universal hash functions.
|
|
Fourth, the description of a MPHF takes a constant number of bits for each key, which is optimal.
|
|
For the collection of 1 billion URLs, it needs 8.1 bits for each key,
|
|
while the theoretical lower bound is <IMG ALIGN="middle" SRC="figs/brz/img24.png" BORDER="0" ALT=""> bits per key.
|
|
</P>
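<P>
The 5.45 megabytes quoted above can be checked with a few lines of arithmetic.
The sketch below assumes <I>b = 175</I> (the value used later in the experiments) and
reads "megabytes" as units of 2<SUP>20</SUP> bytes; the function name is ours and is only illustrative.
</P>

```python
def size_vector_entries(n, b):
    """Number of one-byte entries in the vector: ceil(n / b)."""
    return -(-n // b)  # integer ceiling division

n = 1_000_000_000   # 1 billion keys, as in the text
b = 175             # assumed value of b (taken from the experimental section)

entries = size_vector_entries(n, b)
mebibytes = entries / 2**20   # one byte per entry

print(entries, round(mebibytes, 2))
```

<P>
Under these assumptions the vector has 5,714,286 entries, which rounds to 5.45 megabytes.
</P>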
|
|
|
|
<HR NOSHADE SIZE=1>
|
|
|
|
<H2>The Algorithm</H2>
|
|
|
|
<P>
|
|
The main idea supporting our algorithm is the classical divide and conquer technique.
|
|
The algorithm is a two-step external memory based algorithm
|
|
that generates a MPHF <I>h</I> for a set <I>S</I> of <I>n</I> keys.
|
|
Figure 1 illustrates the two steps of the
|
|
algorithm: the partitioning step and the searching step.
|
|
</P>
|
|
|
|
<TABLE ALIGN="center" CELLPADDING="4">
|
|
<TR>
|
|
<TD><IMG ALIGN="middle" SRC="figs/brz/brz.png" BORDER="0" ALT=""></TD>
|
|
</TR>
|
|
<TR>
|
|
<TD><B>Figure 1:</B> Main steps of our algorithm.</TD>
|
|
</TR>
|
|
</TABLE>
|
|
|
|
<P>
|
|
The partitioning step takes a key set <I>S</I> and uses a universal hash
|
|
function <IMG ALIGN="middle" SRC="figs/brz/img42.png" BORDER="0" ALT=""> proposed by Jenkins<A HREF="#papers">[5</A>]
|
|
to transform each key <IMG ALIGN="middle" SRC="figs/brz/img43.png" BORDER="0" ALT=""> into an integer <IMG ALIGN="middle" SRC="figs/brz/img44.png" BORDER="0" ALT="">.
|
|
Reducing <IMG ALIGN="middle" SRC="figs/brz/img44.png" BORDER="0" ALT=""> modulo <IMG ALIGN="middle" SRC="figs/brz/img23.png" BORDER="0" ALT="">, we partition <I>S</I>
|
|
into <IMG ALIGN="middle" SRC="figs/brz/img23.png" BORDER="0" ALT=""> buckets containing at most 256 keys in each bucket (with high
|
|
probability).
|
|
</P>
|
|
<P>
|
|
The searching step generates a MPHF<IMG ALIGN="middle" SRC="figs/brz/img46.png" BORDER="0" ALT=""> for each bucket <I>i</I>, <IMG ALIGN="middle" SRC="figs/brz/img47.png" BORDER="0" ALT="">.
|
|
The resulting MPHF <I>h(k)</I>, <IMG ALIGN="middle" SRC="figs/brz/img43.png" BORDER="0" ALT="">, is given by
|
|
</P>
|
|
|
|
<TABLE ALIGN="center" CELLPADDING="4">
|
|
<TR>
|
|
<TD><IMG ALIGN="middle" SRC="figs/brz/img49.png" BORDER="0" ALT=""></TD>
|
|
</TR>
|
|
</TABLE>
|
|
|
|
<P>
|
|
where <IMG ALIGN="middle" SRC="figs/brz/img50.png" BORDER="0" ALT="">.
|
|
The <I>i</I>th entry <I>offset[i]</I> of the displacement vector
|
|
<I>offset</I>, <IMG ALIGN="middle" SRC="figs/brz/img47.png" BORDER="0" ALT="">, contains the total number
|
|
of keys in the buckets from 0 to <I>i-1</I>, that is, it gives the interval of the
|
|
keys in the hash table addressed by the MPHF<IMG ALIGN="middle" SRC="figs/brz/img46.png" BORDER="0" ALT="">. In the following we explain
|
|
each step in detail.
|
|
</P>
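<P>
A minimal sketch of how the two steps combine at retrieval time is given below.
It is not the CMPH implementation: <I>zlib.crc32</I> stands in for the Jenkins hash,
and the per-bucket MPHF is replaced by a plain dictionary holding the rank of each key
inside its bucket. Both are assumptions made only to keep the example short.
</P>

```python
import zlib

def h0(key):
    # Stand-in for the Jenkins hash function used in the text (assumption).
    return zlib.crc32(key.encode("utf-8"))

def build(keys, num_buckets):
    # Partition the keys into buckets by h0(k) mod num_buckets.
    buckets = [[] for _ in range(num_buckets)]
    for k in keys:
        buckets[h0(k) % num_buckets].append(k)
    # Toy per-bucket MPHF: the rank of the key inside its bucket.
    local = [{k: rank for rank, k in enumerate(bkt)} for bkt in buckets]
    # offset[i] = number of keys in buckets 0 .. i-1.
    offset, total = [], 0
    for bkt in buckets:
        offset.append(total)
        total += len(bkt)
    return local, offset

def h(key, local, offset):
    # h(k) = MPHF_i(k) + offset[i], with i the bucket address of k.
    i = h0(key) % len(offset)
    return local[i][key] + offset[i]

keys = [f"http://example.org/page{j}" for j in range(1000)]
local, offset = build(keys, num_buckets=8)
values = sorted(h(k, local, offset) for k in keys)
print(values[:5], values[-1])
```

<P>
Because each bucket maps its keys onto a contiguous range starting at <I>offset[i]</I>,
the combined function hits every value from 0 to <I>n-1</I> exactly once.
</P>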
|
|
|
|
<HR NOSHADE SIZE=1>
|
|
|
|
<H3>Partitioning step</H3>
|
|
|
|
<P>
|
|
The set <I>S</I> of <I>n</I> keys is partitioned into <IMG ALIGN="middle" SRC="figs/brz/img23.png" BORDER="0" ALT=""> buckets,
|
|
where <I>b</I> is a suitable parameter chosen to guarantee
|
|
that each bucket has at most 256 keys with high probability
|
|
(see <A HREF="#papers">[2</A>] for details).
|
|
The partitioning step works as follows:
|
|
</P>
|
|
|
|
<TABLE ALIGN="center" CELLPADDING="4">
|
|
<TR>
|
|
<TD><IMG ALIGN="middle" SRC="figs/brz/img54.png" BORDER="0" ALT=""></TD>
|
|
</TR>
|
|
<TR>
|
|
<TD><B>Figure 2:</B> Partitioning step.</TD>
|
|
</TR>
|
|
</TABLE>
|
|
|
|
<P>
|
|
Statement 1.1 of the <B>for</B> loop presented in Figure 2
|
|
reads sequentially all the keys of block <IMG ALIGN="middle" SRC="figs/brz/img55.png" BORDER="0" ALT=""> from disk into an internal area
|
|
of size <IMG ALIGN="middle" SRC="figs/brz/img8.png" BORDER="0" ALT="">.
|
|
</P>
|
|
<P>
|
|
Statement 1.2 performs an indirect bucket sort of the keys in block <IMG ALIGN="middle" SRC="figs/brz/img55.png" BORDER="0" ALT=""> and
|
|
at the same time updates the entries in the vector <I>size</I>.
|
|
Let us briefly describe how <IMG ALIGN="middle" SRC="figs/brz/img55.png" BORDER="0" ALT=""> is partitioned among
|
|
the <IMG ALIGN="middle" SRC="figs/brz/img23.png" BORDER="0" ALT=""> buckets.
|
|
We use a local array of <IMG ALIGN="middle" SRC="figs/brz/img23.png" BORDER="0" ALT=""> counters to store a
|
|
count of how many keys from <IMG ALIGN="middle" SRC="figs/brz/img55.png" BORDER="0" ALT=""> belong to each bucket.
|
|
The pointers to the keys in each bucket <I>i</I>, <IMG ALIGN="middle" SRC="figs/brz/img47.png" BORDER="0" ALT="">,
|
|
are stored in contiguous positions in an array.
|
|
For this we first reserve the required number of entries
|
|
in this array of pointers using the information from the array of counters.
|
|
Next, we place the pointers to the keys in each bucket into the respective
|
|
reserved areas in the array (i.e., we place the pointers to the keys in bucket 0,
|
|
followed by the pointers to the keys in bucket 1, and so on).
|
|
</P>
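<P>
The indirect bucket sort described above can be sketched as a counting sort over key
indices. The function and variable names are ours, and list indices play the role of
the pointers to the keys; this is an illustration, not the code used in the library.
</P>

```python
def indirect_bucket_sort(keys, bucket_of, num_buckets):
    # Local array of counters: how many keys fall into each bucket.
    counts = [0] * num_buckets
    for k in keys:
        counts[bucket_of(k)] += 1

    # Reserve a contiguous area per bucket: start[i] is its first slot.
    start = [0] * num_buckets
    for i in range(1, num_buckets):
        start[i] = start[i - 1] + counts[i - 1]

    # Place the "pointer" (here: the index) of each key in its reserved area.
    pointers = [None] * len(keys)
    next_free = list(start)
    for idx, k in enumerate(keys):
        i = bucket_of(k)
        pointers[next_free[i]] = idx
        next_free[i] += 1
    return counts, pointers

keys = ["d", "a", "c", "b", "e", "f"]
counts, pointers = indirect_bucket_sort(keys, lambda k: ord(k) % 3, 3)
print(counts, [keys[p] for p in pointers])
```

<P>
The keys themselves never move; only the array of pointers is arranged bucket by bucket,
and the counters can be added to the vector <I>size</I> as each block is processed.
</P>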
|
|
<P>
|
|
To find the bucket address of a given key
|
|
we use the universal hash function <IMG ALIGN="middle" SRC="figs/brz/img44.png" BORDER="0" ALT=""><A HREF="#papers">[5</A>].
|
|
Key <I>k</I> goes into bucket <I>i</I>, where
|
|
</P>
|
|
|
|
<TABLE ALIGN="center" CELLPADDING="4">
|
|
<TR>
|
|
<TD><IMG ALIGN="middle" SRC="figs/brz/img57.png" BORDER="0" ALT=""> (1)</TD>
|
|
</TR>
|
|
</TABLE>
|
|
|
|
<P>
|
|
Figure 3(a) shows a <I>logical</I> view of the <IMG ALIGN="middle" SRC="figs/brz/img23.png" BORDER="0" ALT=""> buckets
|
|
generated in the partitioning step.
|
|
In reality, the keys belonging to each bucket are distributed among many files,
|
|
as depicted in Figure 3(b).
|
|
In the example of Figure 3(b), the keys in bucket 0
|
|
appear in files 1 and <I>N</I>, the keys in bucket 1 appear in files 1, 2
|
|
and <I>N</I>, and so on.
|
|
</P>
|
|
|
|
<TABLE ALIGN="center" CELLPADDING="4">
|
|
<TR>
|
|
<TD><IMG ALIGN="middle" SRC="figs/brz/brz-partitioning.png" BORDER="0" ALT=""></TD>
|
|
</TR>
|
|
<TR>
|
|
<TD><B>Figure 3:</B> Situation of the buckets at the end of the partitioning step: (a) Logical view (b) Physical view.</TD>
|
|
</TR>
|
|
</TABLE>
|
|
|
|
<P>
|
|
This scattering of the keys in the buckets could generate a performance
|
|
problem because of the potential number of seeks
|
|
needed to read the keys in each bucket from the <I>N</I> files on disk
|
|
during the searching step.
|
|
But, as we show in <A HREF="#papers">[2</A>], the number of seeks
|
|
can be kept small using buffering techniques.
|
|
Considering that only the vector <I>size</I>, which has <IMG ALIGN="middle" SRC="figs/brz/img23.png" BORDER="0" ALT=""> one-byte
|
|
entries (remember that each bucket has at most 256 keys),
|
|
must be maintained in main memory during the searching step,
|
|
almost all main memory is available to be used as disk I/O buffer.
|
|
</P>
|
|
<P>
|
|
The last step is to compute the <I>offset</I> vector and dump it to the disk.
|
|
We use the vector <I>size</I> to compute the
|
|
<I>offset</I> displacement vector.
|
|
The <I>offset[i]</I> entry contains the number of keys
|
|
in the buckets <I>0, 1, ..., i-1</I>.
|
|
As <I>size[i]</I> stores the number of keys
|
|
in bucket <I>i</I>, where <IMG ALIGN="middle" SRC="figs/brz/img47.png" BORDER="0" ALT="">, we have
|
|
</P>
|
|
|
|
<TABLE ALIGN="center" CELLPADDING="4">
|
|
<TR>
|
|
<TD><IMG ALIGN="middle" SRC="figs/brz/img63.png" BORDER="0" ALT=""></TD>
|
|
</TR>
|
|
</TABLE>
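<P>
In other words, <I>offset</I> is the exclusive prefix sum of <I>size</I>.
A short sketch, with an illustrative function name and made-up bucket sizes:
</P>

```python
def compute_offset(size):
    offset = []
    total = 0
    for count in size:
        offset.append(total)   # number of keys in buckets 0 .. i-1
        total += count
    return offset

size = [3, 0, 5, 2]            # hypothetical bucket sizes
offset = compute_offset(size)
print(offset)
```

<P>
For these sizes the result is <I>[0, 3, 3, 8]</I>: an empty bucket simply shares its
offset with the bucket that follows it.
</P>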
|
|
|
|
<HR NOSHADE SIZE=1>
|
|
|
|
<H3>Searching step</H3>
|
|
|
|
<P>
|
|
The searching step is responsible for generating a MPHF for each
|
|
bucket. Figure 4 presents the searching step algorithm.
|
|
</P>
|
|
|
|
<TABLE ALIGN="center" CELLPADDING="4">
|
|
<TR>
|
|
<TD><IMG ALIGN="middle" SRC="figs/brz/img64.png" BORDER="0" ALT=""></TD>
|
|
</TR>
|
|
<TR>
|
|
<TD><B>Figure 4:</B> Searching step.</TD>
|
|
</TR>
|
|
</TABLE>
|
|
|
|
<P>
|
|
Statement 1 of Figure 4 inserts one key from each file
|
|
in a minimum heap <I>H</I> of size <I>N</I>.
|
|
The order relation in <I>H</I> is defined by the bucket address <I>i</I> computed with
Eq. (1).
|
|
</P>
|
|
<P>
|
|
Statement 2 has two important steps.
|
|
In statement 2.1, a bucket is read from disk,
|
|
as described below.
|
|
In statement 2.2, a MPHF is generated for each bucket <I>i</I>, as described
|
|
in the following.
|
|
The description of MPHF<IMG ALIGN="middle" SRC="figs/brz/img46.png" BORDER="0" ALT=""> is a vector <IMG ALIGN="middle" SRC="figs/brz/img66.png" BORDER="0" ALT=""> of 8-bit integers.
|
|
Finally, statement 2.3 writes the description <IMG ALIGN="middle" SRC="figs/brz/img66.png" BORDER="0" ALT=""> of MPHF<IMG ALIGN="middle" SRC="figs/brz/img46.png" BORDER="0" ALT=""> to disk.
|
|
</P>
|
|
|
|
<HR NOSHADE SIZE=1>
|
|
|
|
<H4>Reading a bucket from disk</H4>
|
|
|
|
<P>
|
|
In this section we present the refinement of statement 2.1 of
|
|
Figure 4.
|
|
The algorithm to read bucket <I>i</I> from disk is presented
|
|
in Figure 5.
|
|
</P>
|
|
|
|
<TABLE ALIGN="center" CELLPADDING="4">
|
|
<TR>
|
|
<TD><IMG ALIGN="middle" SRC="figs/brz/img67.png" BORDER="0" ALT=""></TD>
|
|
</TR>
|
|
<TR>
|
|
<TD><B>Figure 5:</B> Reading a bucket.</TD>
|
|
</TR>
|
|
</TABLE>
|
|
|
|
<P>
|
|
Bucket <I>i</I> is distributed among many files and the heap <I>H</I> is used to drive a
|
|
multiway merge operation.
|
|
In Figure 5, statement 1.1 extracts and removes triple
|
|
<I>(i, j, k)</I> from <I>H</I>, where <I>i</I> is a minimum value in <I>H</I>.
|
|
Statement 1.2 inserts key <I>k</I> in bucket <I>i</I>.
|
|
Notice that the <I>k</I> in the triple <I>(i, j, k)</I> is in fact a pointer to
|
|
the first byte of the key that is kept in contiguous positions of an array of characters
|
|
(this array containing the keys is initialized during the heap construction
|
|
in statement 1 of Figure 4).
|
|
Statement 1.3 performs a seek operation in File <I>j</I> on disk for the first
|
|
read operation and reads sequentially all keys <I>k</I> that have the same <I>i</I>
|
|
and inserts them all in bucket <I>i</I>.
|
|
Finally, statement 1.4 inserts in <I>H</I> the triple <I>(i, j, x)</I>,
|
|
where <I>x</I> is the first key read from File <I>j</I> (in statement 1.3)
|
|
that does not have the same bucket address as the previous keys.
|
|
</P>
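<P>
The multiway merge can be sketched with a binary heap as follows. Here each "file" is
simply an in-memory list of <I>(bucket address, key)</I> pairs already sorted by bucket
address, which is an assumption standing in for the files written by the partitioning
step; seeks and buffering are not modelled.
</P>

```python
import heapq

def read_buckets(files):
    """Yield (bucket address, keys) in increasing order of bucket address."""
    pos = [0] * len(files)              # read position inside each file
    heap = []
    for j, f in enumerate(files):       # statement 1 of Figure 4
        if f:
            i, k = f[0]
            heapq.heappush(heap, (i, j, k))
            pos[j] = 1
    while heap:
        current = heap[0][0]            # smallest bucket address in H
        bucket = []
        while heap and heap[0][0] == current:
            i, j, k = heapq.heappop(heap)            # statement 1.1
            bucket.append(k)                         # statement 1.2
            f = files[j]
            # Statement 1.3: read sequentially all keys with the same i.
            while pos[j] < len(f) and f[pos[j]][0] == i:
                bucket.append(f[pos[j]][1])
                pos[j] += 1
            # Statement 1.4: push the first key that belongs to another bucket.
            if pos[j] < len(f):
                next_i, x = f[pos[j]]
                heapq.heappush(heap, (next_i, j, x))
                pos[j] += 1
        yield current, bucket

files = [
    [(0, "a"), (1, "b"), (1, "c")],
    [(1, "d"), (2, "e")],
    [(0, "f"), (2, "g")],
]
result = list(read_buckets(files))
print(result)
```

<P>
At any time the heap holds at most one triple per file, so its size never exceeds <I>N</I>.
</P>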
|
|
<P>
|
|
The number of seek operations on disk performed in statement 1.3 is discussed
|
|
in <A HREF="#papers">[2, Section 5.1</A>],
|
|
where we present a buffering technique that brings down
|
|
the time spent with seeks.
|
|
</P>
|
|
|
|
<HR NOSHADE SIZE=1>
|
|
|
|
<H4>Generating a MPHF for each bucket</H4>
|
|
|
|
<P>
|
|
To the best of our knowledge the <A HREF="bmz.html">BMZ algorithm</A> we have designed in
|
|
our previous work <A HREF="#papers">[1,3</A>] is the fastest published algorithm for
|
|
constructing MPHFs.
|
|
That is why we are using that algorithm as a building block for the
|
|
algorithm presented here. In reality, we are using
|
|
an optimized version of BMZ (BMZ8) for small sets of keys (at most 256 keys).
|
|
<A HREF="bmz.html">Click here to see details about BMZ algorithm</A>.
|
|
</P>
|
|
|
|
<HR NOSHADE SIZE=1>
|
|
|
|
<H2>Analysis of the Algorithm</H2>
|
|
|
|
<P>
|
|
Analytical results and the complete analysis of the external memory based algorithm
|
|
can be found in <A HREF="#papers">[2</A>].
|
|
</P>
|
|
|
|
<HR NOSHADE SIZE=1>
|
|
|
|
<H2>Experimental Results</H2>
|
|
|
|
<P>
|
|
In this section we present the experimental results.
|
|
We start by presenting the experimental setup.
|
|
We then present experimental results for
|
|
the internal memory based algorithm (<A HREF="bmz.html">the BMZ algorithm</A>)
|
|
and for our external memory based algorithm.
|
|
Finally, we discuss how the amount of internal memory available
|
|
affects the runtime of the external memory based algorithm.
|
|
</P>
|
|
|
|
<HR NOSHADE SIZE=1>
|
|
|
|
<H3>The data and the experimental setup</H3>
|
|
|
|
<P>
|
|
All experiments were carried out on
|
|
a computer running the Linux operating system, version 2.6,
|
|
with a 2.4 gigahertz processor and
|
|
1 gigabyte of main memory.
|
|
In the experiments related to the new
|
|
algorithm we limited the main memory to 500 megabytes.
|
|
</P>
|
|
<P>
|
|
Our data consists of a collection of 1 billion
|
|
URLs collected from the Web, each URL 64 characters long on average.
|
|
The collection occupies 60.5 gigabytes on disk.
|
|
</P>
|
|
|
|
<HR NOSHADE SIZE=1>
|
|
|
|
<H3>Performance of the BMZ Algorithm</H3>
|
|
|
|
<P>
|
|
<A HREF="bmz.html">The BMZ algorithm</A> is used for constructing a MPHF for each bucket.
|
|
It is a randomized algorithm because it needs to generate a simple random graph
|
|
in its first step.
|
|
Once the graph is obtained the other two steps are deterministic.
|
|
</P>
|
|
<P>
|
|
Thus, we can consider the runtime of the algorithm to have
|
|
the form <IMG ALIGN="middle" SRC="figs/brz/img159.png" BORDER="0" ALT=""> for an input of <I>n</I> keys,
|
|
where <IMG ALIGN="middle" SRC="figs/brz/img160.png" BORDER="0" ALT=""> is some machine dependent
|
|
constant that further depends on the length of the keys and <I>Z</I> is a random
|
|
variable with geometric distribution with mean <IMG ALIGN="middle" SRC="figs/brz/img162.png" BORDER="0" ALT="">. All results
|
|
in our experiments were obtained taking <I>c=1</I>; the value of <I>c</I>, with <I>c</I> in <I>[0.93,1.15]</I>,
|
|
in fact has little influence on the runtime, as shown in <A HREF="#papers">[3</A>].
|
|
</P>
|
|
<P>
|
|
The values chosen for <I>n</I> were 1, 2, 4, 8, 16 and 32 million.
|
|
Although we have a dataset with 1 billion URLs, on a PC with
|
|
1 gigabyte of main memory, the algorithm is able
|
|
to handle an input with at most 32 million keys.
|
|
This is mainly because of the graph we need to keep in main memory.
|
|
The algorithm requires <I>25n + O(1)</I> bytes for constructing
|
|
a MPHF (<A HREF="bmz.html">click here to get details about the data structures used by the BMZ algorithm</A>).
|
|
</P>
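<P>
The limit of 32 million keys follows from the <I>25n</I> bytes figure. The rough check
below ignores the <I>O(1)</I> term and everything else that competes for memory, so it
is only an estimate; the helper name is ours.
</P>

```python
BYTES_PER_KEY = 25   # from the text: 25n + O(1) bytes

def bmz_memory_gib(n):
    """Approximate memory needed by BMZ for n keys, in units of 2**30 bytes."""
    return BYTES_PER_KEY * n / 2**30

for millions in (16, 32, 64):
    print(millions, round(bmz_memory_gib(millions * 10**6), 2))
```

<P>
With 32 million keys the estimate stays below 1 gigabyte, while doubling the input to
64 million keys would need roughly 1.5 gigabytes.
</P>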
|
|
<P>
|
|
In order to estimate the number of trials for each value of <I>n</I> we use
|
|
a statistical method for determining a suitable sample size (see, e.g., <A HREF="#papers">[6, Chapter 13</A>]).
|
|
As we obtained different values for each <I>n</I>,
|
|
we used the maximal value obtained, namely, 300 trials in order to have
|
|
a confidence level of 95 %.
|
|
</P>
|
|
<P>
|
|
Table 1 presents the runtime average for each <I>n</I>,
|
|
the respective standard deviations, and
|
|
the respective confidence intervals given by
|
|
the average time <IMG ALIGN="middle" SRC="figs/brz/img167.png" BORDER="0" ALT=""> the distance from average time
|
|
considering a confidence level of 95 %.
|
|
Observing the runtime averages one sees that
|
|
the algorithm runs in expected linear time,
|
|
as shown in <A HREF="#papers">[3</A>].
|
|
</P>
|
|
<TABLE CELLPADDING=3 BORDER="1" ALIGN="CENTER">
|
|
<TR><TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE">
|
|
<SPAN CLASS="MATH"><IMG
|
|
WIDTH="14" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
|
SRC="figs/brz/img5.png"
|
|
ALT="$n$"></SPAN> (millions) </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 1 </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 2 </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 4 </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 8 </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 16 </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 32 </SMALL></TD>
|
|
<TD></TD>
|
|
</TR>
|
|
<TR><TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE">
|
|
|
|
Average time (s)</SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="64" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
|
SRC="figs/brz/img168.png"
|
|
ALT="$6.1 \pm 0.3$"></SPAN> </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="72" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
|
SRC="figs/brz/img169.png"
|
|
ALT="$12.2 \pm 0.6$"></SPAN> </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="72" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
|
SRC="figs/brz/img170.png"
|
|
ALT="$25.4 \pm 1.1$"></SPAN> </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="72" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
|
SRC="figs/brz/img171.png"
|
|
ALT="$51.4 \pm 2.0$"></SPAN> </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="80" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
|
SRC="figs/brz/img172.png"
|
|
ALT="$117.3 \pm 4.4$"></SPAN> </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="80" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
|
SRC="figs/brz/img173.png"
|
|
ALT="$262.2 \pm 8.7$"></SPAN></SMALL></TD>
|
|
<TD></TD>
|
|
</TR>
|
|
<TR><TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE">
|
|
SD (s) </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
|
SRC="figs/brz/img174.png"
|
|
ALT="$2.6$"></SPAN> </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
|
SRC="figs/brz/img175.png"
|
|
ALT="$5.4$"></SPAN> </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
|
SRC="figs/brz/img176.png"
|
|
ALT="$9.8$"></SPAN> </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
|
SRC="figs/brz/img177.png"
|
|
ALT="$17.6$"></SPAN> </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
|
SRC="figs/brz/img178.png"
|
|
ALT="$37.3$"></SPAN> </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
|
SRC="figs/brz/img179.png"
|
|
ALT="$76.3$"></SPAN> </SMALL></TD>
|
|
<TD></TD>
|
|
</TR>
|
|
</TABLE>
|
|
|
|
<TABLE ALIGN="center" CELLPADDING="4">
|
|
<TR>
|
|
<TD><B>Table 1:</B> Internal memory based algorithm: average time in seconds for constructing a MPHF, the standard deviation (SD), and the confidence intervals considering a confidence level of 95 %.</TD>
|
|
</TR>
|
|
</TABLE>
|
|
|
|
<P>
|
|
Figure 6 presents the runtime for each trial. In addition,
|
|
the solid line corresponds to a linear regression model
|
|
obtained from the experimental measurements.
|
|
As we can see, the runtime for a given <I>n</I> has a considerable
|
|
fluctuation. However, the fluctuation also grows linearly with <I>n</I>.
|
|
</P>
|
|
|
|
<TABLE ALIGN="center" CELLPADDING="4">
|
|
<TR>
|
|
<TD><IMG ALIGN="middle" SRC="figs/brz/bmz_temporegressao.png" BORDER="0" ALT=""></TD>
|
|
</TR>
|
|
<TR>
|
|
<TD><B>Figure 6:</B> Time versus number of keys in <I>S</I> for the internal memory based algorithm. The solid line corresponds to a linear regression model.</TD>
|
|
</TR>
|
|
</TABLE>
|
|
|
|
<P>
|
|
The observed fluctuation in the runtimes is as expected; recall that this
|
|
runtime has the form <IMG ALIGN="middle" SRC="figs/brz/img159.png" BORDER="0" ALT=""> with <I>Z</I> a geometric random variable with
|
|
mean <I>1/p=e</I>. Thus, the runtime has mean <IMG ALIGN="middle" SRC="figs/brz/img181.png" BORDER="0" ALT=""> and standard
|
|
deviation <IMG ALIGN="middle" SRC="figs/brz/img182.png" BORDER="0" ALT="">.
|
|
Therefore, the standard deviation also grows
|
|
linearly with <I>n</I>, as experimentally verified
|
|
in Table 1 and in Figure 6.
|
|
</P>
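<P>
The linear growth of the standard deviation can be checked directly on the values of
Table 1: if it grows linearly with <I>n</I>, the ratio SD/<I>n</I> should stay roughly
constant. The numbers below are transcribed from the table.
</P>

```python
n_millions = [1, 2, 4, 8, 16, 32]
sd_seconds = [2.6, 5.4, 9.8, 17.6, 37.3, 76.3]   # SD row of Table 1

# Standard deviation per million keys.
ratios = [sd / n for sd, n in zip(sd_seconds, n_millions)]
for n, r in zip(n_millions, ratios):
    print(n, round(r, 2))
```

<P>
All six ratios fall between 2.2 and 2.7 seconds per million keys, which is consistent
with a standard deviation proportional to <I>n</I>.
</P>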
|
|
|
|
<HR NOSHADE SIZE=1>
|
|
|
|
<H3>Performance of the External Memory Based Algorithm</H3>
|
|
|
|
<P>
|
|
The runtime of the external memory based algorithm is also a random variable,
|
|
but now it follows a (highly concentrated) normal distribution, as we discuss at the end of this
|
|
section. Again, we are interested in verifying the linearity claim made in
|
|
<A HREF="#papers">[2, Section 5.1</A>]. Therefore, we ran the algorithm for
|
|
several numbers <I>n</I> of keys in <I>S</I>.
|
|
</P>
|
|
<P>
|
|
The values chosen for <I>n</I> were 1, 2, 4, 8, 16, 32, 64, 128, 512 and 1000
|
|
million.
|
|
We limited the main memory to 500 megabytes for the experiments.
|
|
The size <IMG ALIGN="middle" SRC="figs/brz/img8.png" BORDER="0" ALT=""> of the a priori reserved internal memory area
|
|
was set to 250 megabytes, the parameter <I>b</I> was set to <I>175</I> and
|
|
the building block algorithm parameter <I>c</I> was again set to <I>1</I>.
|
|
We show later on how <IMG ALIGN="middle" SRC="figs/brz/img8.png" BORDER="0" ALT=""> affects the runtime of the algorithm. The other two parameters
|
|
have insignificant influence on the runtime.
|
|
</P>
|
|
<P>
|
|
We again use a statistical method for determining a suitable sample size
|
|
to estimate the number of trials to be run for each value of <I>n</I>. We found that
|
|
just one trial for each <I>n</I> would be enough with a confidence level of 95 %.
|
|
However, we ran 10 trials. This number of trials may seem rather small but, as
|
|
shown below, the behavior of our algorithm is very stable and its runtime is
|
|
almost deterministic (i.e., the standard deviation is very small).
|
|
</P>
|
|
<P>
|
|
Table 2 presents the runtime average for each <I>n</I>,
|
|
the respective standard deviations, and
|
|
the respective confidence intervals given by
|
|
the average time <IMG ALIGN="middle" SRC="figs/brz/img167.png" BORDER="0" ALT=""> the distance from average time
|
|
considering a confidence level of 95 %.
|
|
Observing the runtime averages we noticed that
|
|
the algorithm runs in expected linear time,
|
|
as shown in <A HREF="#papers">[2, Section 5.1</A>]. Better still,
|
|
it is only approximately 60 % slower than the BMZ algorithm.
|
|
To get that value we used the linear regression model obtained for the runtime of
|
|
the internal memory based algorithm to estimate how much time it would require
|
|
for constructing a MPHF for a set of 1 billion keys.
|
|
The model gives 2.3 hours for the internal memory based algorithm and we measured
|
|
3.67 hours on average for the external memory based algorithm.
|
|
Increasing the size of the internal memory area
|
|
from 250 to 600 megabytes,
|
|
we brought the time down to 3.09 hours. In this setup, the external memory based algorithm is
just 34 % slower.
|
|
</P>
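<P>
The percentages quoted above follow from the reported times. The short check below
uses only numbers given in the text and in Table 2; the function name is ours.
</P>

```python
def slowdown_percent(external_hours, internal_hours):
    """How much slower the external algorithm is, in percent."""
    return 100.0 * (external_hours / internal_hours - 1.0)

internal = 2.3                       # estimated hours for BMZ on 1 billion keys
hours_table2 = 13229.5 / 3600        # Table 2 average for n = 1000 million

print(round(hours_table2, 2))                       # about 3.67 hours
print(round(slowdown_percent(3.67, internal)))      # 250 megabytes of buffer
print(round(slowdown_percent(3.09, internal)))      # 600 megabytes of buffer
```

<P>
The last two lines give 60 and 34, matching the figures stated in the text.
</P>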
|
|
<TABLE CELLPADDING=3 BORDER="1" ALIGN="CENTER">
|
|
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
|
|
<SPAN CLASS="MATH"><IMG
|
|
WIDTH="14" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
|
SRC="figs/brz/img5.png"
|
|
ALT="$n$"></SPAN> (millions) </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 1 </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 2 </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 4 </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 8 </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 16 </SMALL></TD>
|
|
</TR>
|
|
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
|
|
Average time (s) </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="64" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
|
SRC="figs/brz/img187.png"
|
|
ALT="$6.9 \pm 0.3$"></SPAN> </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="72" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
|
SRC="figs/brz/img188.png"
|
|
ALT="$13.8 \pm 0.2$"></SPAN> </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="72" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
|
SRC="figs/brz/img189.png"
|
|
ALT="$31.9 \pm 0.7$"></SPAN> </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="72" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
|
SRC="figs/brz/img190.png"
|
|
ALT="$69.9 \pm 1.1$"></SPAN> </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="80" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
|
SRC="figs/brz/img191.png"
|
|
ALT="$140.6 \pm 2.5$"></SPAN> </SMALL></TD>
|
|
</TR>
|
|
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
|
|
SD (s) </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
|
SRC="figs/brz/img192.png"
|
|
ALT="$0.4$"></SPAN> </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
|
SRC="figs/brz/img193.png"
|
|
ALT="$0.2$"></SPAN> </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
|
SRC="figs/brz/img194.png"
|
|
ALT="$0.9$"></SPAN> </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
|
SRC="figs/brz/img195.png"
|
|
ALT="$1.5$"></SPAN> </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
|
SRC="figs/brz/img196.png"
|
|
ALT="$3.5$"></SPAN> </SMALL></TD>
|
|
</TR>
|
|
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
|
|
|
|
<SPAN CLASS="MATH"><IMG
|
|
WIDTH="14" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
|
SRC="figs/brz/img5.png"
|
|
ALT="$n$"></SPAN> (millions) </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 32 </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 64 </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 128 </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 512 </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 1000 </SMALL></TD>
|
|
</TR>
|
|
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
|
|
Average time (s) </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="80" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
|
SRC="figs/brz/img197.png"
|
|
ALT="$284.3 \pm 1.1$"></SPAN> </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="80" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
|
SRC="figs/brz/img198.png"
|
|
ALT="$587.9 \pm 3.9$"></SPAN> </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <!-- MATH
|
|
$1223.6 \pm 4.9$
|
|
-->
|
|
<SPAN CLASS="MATH"><IMG
|
|
WIDTH="88" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
|
SRC="figs/brz/img199.png"
|
|
ALT="$1223.6 \pm 4.9$"></SPAN> </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <!-- MATH
|
|
$5966.4 \pm 9.5$
|
|
-->
|
|
<SPAN CLASS="MATH"><IMG
|
|
WIDTH="88" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
|
SRC="figs/brz/img200.png"
|
|
ALT="$5966.4 \pm 9.5$"></SPAN> </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <!-- MATH
|
|
$13229.5 \pm 12.7$
|
|
-->
|
|
<SPAN CLASS="MATH"><IMG
|
|
WIDTH="104" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
|
SRC="figs/brz/img201.png"
|
|
ALT="$13229.5 \pm 12.7$"></SPAN> </SMALL></TD>
|
|
</TR>
|
|
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
|
|
SD (s) </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
|
SRC="figs/brz/img202.png"
|
|
ALT="$1.6$"></SPAN> </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
|
SRC="figs/brz/img203.png"
|
|
ALT="$5.5$"></SPAN> </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
|
SRC="figs/brz/img204.png"
|
|
ALT="$6.8$"></SPAN> </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
|
SRC="figs/brz/img205.png"
|
|
ALT="$13.2$"></SPAN> </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
|
SRC="figs/brz/img206.png"
|
|
ALT="$18.6$"></SPAN> </SMALL></TD>
|
|
</TR>
|
|
<TR><TD></TD>
|
|
<TD></TD>
|
|
<TD></TD>
|
|
<TD></TD>
|
|
<TD></TD>
|
|
<TD></TD>
|
|
</TR>
|
|
</TABLE>
|
|
|
|
<TABLE ALIGN="center" CELLPADDING="4">
|
|
<TR>
|
|
<TD><B>Table 2:</B> The external memory based algorithm: average time in seconds for constructing a MPHF, the standard deviation (SD), and the confidence intervals considering a confidence level of 95 %.</TD>
|
|
</TR>
|
|
</TABLE>
|
|
|
|
<P>
|
|
Figure 7 presents the runtime for each trial. In addition,
|
|
the solid line corresponds to a linear regression model
|
|
obtained from the experimental measurements.
|
|
As expected, the runtime for a given <I>n</I> shows almost no
|
|
variation.
|
|
</P>
|
|
|
|
<TABLE ALIGN="center" CELLPADDING="4">
|
|
<TR>
|
|
<TD><IMG ALIGN="middle" SRC="figs/brz/brz_temporegressao.png" BORDER="0" ALT=""></TD>
|
|
</TR>
|
|
<TR>
|
|
<TD><B>Figure 7:</B> Time versus number of keys in <I>S</I> for our algorithm. The solid line corresponds to a linear regression model.</TD>
|
|
</TR>
|
|
</TABLE>
|
|
|
|
<P>
|
|
An intriguing observation is that the runtime of the algorithm is almost
deterministic, even though it uses as a building block an
algorithm whose runtime fluctuates considerably. A given bucket
<I>i</I>, <IMG ALIGN="middle" SRC="figs/brz/img47.png" BORDER="0" ALT="">, is a small set of keys (at most 256 keys) and,
as argued in the previous section, the runtime of the
building block algorithm is a random variable <IMG ALIGN="middle" SRC="figs/brz/img207.png" BORDER="0" ALT=""> with high fluctuation.
However, the runtime <I>Y</I> of the searching step of the external memory based algorithm is given
by <IMG ALIGN="middle" SRC="figs/brz/img209.png" BORDER="0" ALT="">. Under the hypothesis that
the <IMG ALIGN="middle" SRC="figs/brz/img207.png" BORDER="0" ALT=""> are independent and bounded, the <I>law of large numbers</I> (see,
e.g., <A HREF="#papers">[6]</A>) implies that the random variable <IMG ALIGN="middle" SRC="figs/brz/img210.png" BORDER="0" ALT=""> converges
to a constant as <IMG ALIGN="middle" SRC="figs/brz/img83.png" BORDER="0" ALT="">. This explains why the runtime of our
algorithm is almost deterministic.
|
|
</P>
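<P>
The effect described above can be illustrated with a small simulation. The sketch below is not part of the original experiments: it assumes, purely for illustration, that each per-bucket runtime is drawn independently and uniformly from [0, 1), and the function names are hypothetical. It estimates the relative spread (standard deviation divided by mean) of the total runtime for an increasing number of buckets.
</P>

```python
import random
import statistics


def total_runtime(num_buckets, rng):
    """Sum of per-bucket runtimes (assumption: independent, uniform on [0, 1))."""
    return sum(rng.random() for _ in range(num_buckets))


def relative_spread(num_buckets, trials=200, seed=0):
    """Coefficient of variation (stdev / mean) of the total runtime over many trials."""
    rng = random.Random(seed)
    totals = [total_runtime(num_buckets, rng) for _ in range(trials)]
    return statistics.stdev(totals) / statistics.mean(totals)


if __name__ == "__main__":
    # The relative spread should shrink as the number of buckets grows.
    for num_buckets in (1, 100, 10_000):
        print(num_buckets, relative_spread(num_buckets))
```

<P>
Under these assumptions a single bucket has a large relative spread, while the sum over many buckets is nearly constant relative to its mean, which matches the qualitative behaviour reported for the searching step.
</P>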
|
|
|
|
<HR NOSHADE SIZE=1>
|
|
|
|
<H3>Controlling disk accesses</H3>
|
|
|
|
<P>
|
|
To reduce the number of seek operations on disk,
we take advantage of the fact that our algorithm leaves almost all main
memory available to be used as a disk I/O buffer.
In this section we evaluate how much the parameter <IMG ALIGN="middle" SRC="figs/brz/img8.png" BORDER="0" ALT="$\mu$"> affects the runtime of our algorithm.
To do so, we fixed <I>n</I> at 1 billion URLs,
set the main memory of the machine used for the experiments
to 1 gigabyte, and used <IMG ALIGN="middle" SRC="figs/brz/img8.png" BORDER="0" ALT="$\mu$"> equal to 100, 200, 300, 400, 500 and 600
megabytes.
|
|
</P>
|
|
<P>
|
|
Table 3 presents the number of files <I>N</I>,
the buffer size used for all files, the number of seeks in the worst case under
the pessimistic assumption mentioned in <A HREF="#papers">[2, Section 5.1]</A>, and
the time to generate a MPHF for 1 billion keys as a function of the amount of internal
memory available. Table 3 shows that the construction time
decreases as the value of <IMG ALIGN="middle" SRC="figs/brz/img8.png" BORDER="0" ALT="$\mu$"> increases. However, for <IMG ALIGN="middle" SRC="figs/brz/img213.png" BORDER="0" ALT="">, the variation
in the time is not as significant as for <IMG ALIGN="middle" SRC="figs/brz/img214.png" BORDER="0" ALT="">.
This can be explained by the fact that the I/O scheduler of the Linux 2.6 kernel
has smart policies for avoiding seeks and reducing the average seek time
(see <A HREF="http://www.linuxjournal.com/article/6931">http://www.linuxjournal.com/article/6931</A>).
|
|
</P>
|
|
<TABLE CELLPADDING=3 BORDER="1" ALIGN="center">
|
|
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
|
|
<SPAN CLASS="MATH"><IMG
|
|
WIDTH="14" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
|
SRC="figs/brz/img8.png"
|
|
ALT="$\mu $"></SPAN> (MB) </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
|
SRC="figs/brz/img215.png"
|
|
ALT="$100$"></SPAN> </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
|
SRC="figs/brz/img216.png"
|
|
ALT="$200$"></SPAN> </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
|
SRC="figs/brz/img217.png"
|
|
ALT="$300$"></SPAN> </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
|
SRC="figs/brz/img218.png"
|
|
ALT="$400$"></SPAN> </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
|
SRC="figs/brz/img219.png"
|
|
ALT="$500$"></SPAN> </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
|
SRC="figs/brz/img212.png"
|
|
ALT="$600$"></SPAN> </SMALL></TD>
|
|
</TR>
|
|
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
|
|
|
|
<SPAN CLASS="MATH"><IMG
|
|
WIDTH="19" HEIGHT="14" ALIGN="BOTTOM" BORDER="0"
|
|
SRC="figs/brz/img58.png"
|
|
ALT="$N$"></SPAN> (files) </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
|
SRC="figs/brz/img220.png"
|
|
ALT="$619$"></SPAN> </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
|
SRC="figs/brz/img221.png"
|
|
ALT="$310$"></SPAN> </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
|
SRC="figs/brz/img222.png"
|
|
ALT="$207$"></SPAN> </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
|
SRC="figs/brz/img223.png"
|
|
ALT="$155$"></SPAN> </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
|
SRC="figs/brz/img224.png"
|
|
ALT="$124$"></SPAN> </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
|
SRC="figs/brz/img225.png"
|
|
ALT="$104$"></SPAN> </SMALL></TD>
|
|
</TR>
|
|
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
|
|
(buffer size in KB) </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
|
SRC="figs/brz/img226.png"
|
|
ALT="$165$"></SPAN> </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
|
SRC="figs/brz/img227.png"
|
|
ALT="$661$"></SPAN> </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="43" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
|
SRC="figs/brz/img228.png"
|
|
ALT="$1,484$"></SPAN> </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="43" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
|
SRC="figs/brz/img229.png"
|
|
ALT="$2,643$"></SPAN> </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="43" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
|
SRC="figs/brz/img230.png"
|
|
ALT="$4,129$"></SPAN> </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="43" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
|
SRC="figs/brz/img231.png"
|
|
ALT="$5,908$"></SPAN> </SMALL></TD>
|
|
</TR>
|
|
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
|
|
<SPAN CLASS="MATH"><IMG
|
|
WIDTH="14" HEIGHT="30" ALIGN="MIDDLE" BORDER="0"
|
|
SRC="figs/brz/img135.png"
|
|
ALT="$\beta$"></SPAN>/ (# of seeks in the worst case) </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="59" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
|
SRC="figs/brz/img232.png"
|
|
ALT="$384,478$"></SPAN> </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="51" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
|
SRC="figs/brz/img233.png"
|
|
ALT="$95,974$"></SPAN> </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="51" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
|
SRC="figs/brz/img234.png"
|
|
ALT="$42,749$"></SPAN> </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="51" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
|
SRC="figs/brz/img235.png"
|
|
ALT="$24,003$"></SPAN> </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="51" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
|
SRC="figs/brz/img236.png"
|
|
ALT="$15,365$"></SPAN> </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="51" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
|
SRC="figs/brz/img237.png"
|
|
ALT="$10,738$"></SPAN> </SMALL></TD>
|
|
</TR>
|
|
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
|
|
Time (hours) </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
|
SRC="figs/brz/img238.png"
|
|
ALT="$4.04$"></SPAN> </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
|
SRC="figs/brz/img239.png"
|
|
ALT="$3.64$"></SPAN> </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
|
SRC="figs/brz/img240.png"
|
|
ALT="$3.34$"></SPAN> </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
|
SRC="figs/brz/img241.png"
|
|
ALT="$3.20$"></SPAN> </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
|
SRC="figs/brz/img242.png"
|
|
ALT="$3.13$"></SPAN> </SMALL></TD>
|
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
|
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
|
SRC="figs/brz/img243.png"
|
|
ALT="$3.09$"></SPAN> </SMALL></TD>
|
|
</TR>
|
|
</TABLE>
|
|
|
|
<TABLE ALIGN="center" CELLPADDING="4">
|
|
<TR>
|
|
<TD><B>Table 3:</B> Influence of the internal memory area size (<IMG ALIGN="middle" SRC="figs/brz/img8.png" BORDER="0" ALT="$\mu$">) on the runtime of the external memory based algorithm.</TD>
|
|
</TR>
|
|
</TABLE>
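<P>
As a quick consistency check on Table 3, the sketch below recomputes the per-file buffer size and the time saved by each additional 100 MB of internal memory. It rests on an assumption inferred from the table values rather than stated in the text: that the memory area is divided evenly among the <I>N</I> files, with 1 MB taken as 1024 KB. The function names are hypothetical.
</P>

```python
# Values copied from Table 3.
MU_MB = [100, 200, 300, 400, 500, 600]           # internal memory area size (MB)
NUM_FILES = [619, 310, 207, 155, 124, 104]       # number of files N
TIME_HOURS = [4.04, 3.64, 3.34, 3.20, 3.13, 3.09]


def buffer_size_kb(mu_mb, num_files):
    """Per-file buffer in KB, assuming an even split and 1 MB = 1024 KB."""
    return round(mu_mb * 1024 / num_files)


def buffer_sizes():
    """Buffer size in KB for every column of Table 3."""
    return [buffer_size_kb(mu, n) for mu, n in zip(MU_MB, NUM_FILES)]


def time_savings():
    """Hours saved by each successive 100 MB increase of the memory area."""
    return [round(a - b, 2) for a, b in zip(TIME_HOURS, TIME_HOURS[1:])]


if __name__ == "__main__":
    print(buffer_sizes())
    print(time_savings())
```

<P>
Under that assumption the computed buffer sizes agree with the buffer size row of Table 3, and the shrinking savings reflect the diminishing benefit of a larger memory area noted above.
</P>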
|
|
|
|
<HR NOSHADE SIZE=1>
|
|
|
|
<A NAME="papers"></A>
|
|
<H2>Papers</H2>
|
|
|
|
<OL>
|
|
<LI><A HREF="http://www.dcc.ufmg.br/~fbotelho">F. C. Botelho</A>, D. Menoti, <A HREF="http://www.dcc.ufmg.br/~nivio">N. Ziviani</A>. <A HREF="papers/bmz_tr004_04.ps">A New algorithm for constructing minimal perfect hash functions</A>, Technical Report TR004/04, Department of Computer Science, Federal University of Minas Gerais, 2004.
|
|
<P></P>
|
|
<LI><A HREF="http://www.dcc.ufmg.br/~fbotelho">F. C. Botelho</A>, Y. Kohayakawa, <A HREF="http://www.dcc.ufmg.br/~nivio">N. Ziviani</A>. <A HREF="papers/tr06.pdf">An Approach for Minimal Perfect Hash Functions for Very Large Databases</A>, Technical Report TR003/06, Department of Computer Science, Federal University of Minas Gerais, 2004.
|
|
<P></P>
|
|
<LI><A HREF="http://www.dcc.ufmg.br/~fbotelho">F. C. Botelho</A>, Y. Kohayakawa, and <A HREF="http://www.dcc.ufmg.br/~nivio">N. Ziviani</A>. <A HREF="papers/wea05.pdf">A Practical Minimal Perfect Hashing Method</A>. <I>4th International Workshop on Efficient and Experimental Algorithms (WEA05),</I> Springer-Verlag Lecture Notes in Computer Science, vol. 3505, Santorini Island, Greece, May 2005, 488-500.
|
|
<P></P>
|
|
<LI><A HREF="http://acmqueue.com/modules.php?name=Content&pa=showpage&pid=299">M. Seltzer. Beyond relational databases. ACM Queue, 3(3), April 2005.</A>
|
|
<P></P>
|
|
<LI><A HREF="http://burtleburtle.net/bob/hash/doobs.html">Bob Jenkins. Algorithm alley: Hash functions. Dr. Dobb's Journal of Software Tools, 22(9), September 1997.</A>
|
|
<P></P>
|
|
<LI>R. Jain. The art of computer systems performance analysis: techniques for experimental design, measurement, simulation, and modeling. John Wiley, first edition, 1991.
|
|
</OL>
|
|
|
|
<HR NOSHADE SIZE=1>
|
|
|
|
<TABLE ALIGN="center" CELLPADDING="4">
|
|
<TR>
|
|
<TD><A HREF="index.html">Home</A></TD>
|
|
<TD><A HREF="chd.html">CHD</A></TD>
|
|
<TD><A HREF="bdz.html">BDZ</A></TD>
|
|
<TD><A HREF="bmz.html">BMZ</A></TD>
|
|
<TD><A HREF="chm.html">CHM</A></TD>
|
|
<TD><A HREF="brz.html">BRZ</A></TD>
|
|
<TD><A HREF="fch.html">FCH</A></TD>
|
|
</TR>
|
|
</TABLE>
|
|
|
|
<HR NOSHADE SIZE=1>
|
|
|
|
<P>
|
|
Enjoy!
|
|
</P>
|
|
<P>
|
|
<A HREF="mailto:davi@users.sourceforge.net">Davi de Castro Reis</A>
|
|
</P>
|
|
<P>
|
|
<A HREF="mailto:db8192@users.sourceforge.net">Djamel Belazzougui</A>
|
|
</P>
|
|
<P>
|
|
<A HREF="mailto:fc_botelho@users.sourceforge.net">Fabiano Cupertino Botelho</A>
|
|
</P>
|
|
<P>
|
|
<A HREF="mailto:nivio@dcc.ufmg.br">Nivio Ziviani</A>
|
|
</P>
|
|
|
|
|
|
<!-- html code generated by txt2tags 2.6 (http://txt2tags.org) -->
|
|
<!-- cmdline: txt2tags -t html -i BRZ.t2t -o docs/brz.html -->
|
|
</BODY></HTML>
|