Add 'deps/cmph/' from commit 'a250982ade093f4eed0552bbdd22dd7b0432007f'
git-subtree-dir: deps/cmph git-subtree-mainline: 5040f4007b git-subtree-split: a250982ade
966
deps/cmph/docs/brz.html
vendored
Normal file
@@ -0,0 +1,966 @@
|
||||
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
|
||||
<HTML>
|
||||
<HEAD>
|
||||
<META NAME="generator" CONTENT="http://txt2tags.org">
|
||||
<LINK REL="stylesheet" TYPE="text/css" HREF="DOC.css">
|
||||
<TITLE>External Memory Based Algorithm</TITLE>
|
||||
</HEAD><BODY BGCOLOR="white" TEXT="black">
|
||||
<CENTER>
|
||||
<H1>External Memory Based Algorithm</H1>
|
||||
</CENTER>
|
||||
|
||||
|
||||
<HR NOSHADE SIZE=1>
|
||||
|
||||
<H2>Introduction</H2>
|
||||
|
||||
<P>
|
||||
Until now, because of the limitations of current algorithms,
|
||||
the use of MPHFs is restricted to scenarios where the set of keys being hashed is
|
||||
relatively small.
|
||||
However, in many cases it is crucial to deal in an efficient way with very large
|
||||
sets of keys.
|
||||
Due to the exponential growth of the Web, working with huge collections is becoming
|
||||
a daily task.
|
||||
For instance, the simple assignment of numeric identifiers to web pages of a collection
|
||||
can be a challenging task.
|
||||
While traditional databases simply cannot handle more traffic once the working
|
||||
set of URLs does not fit in main memory anymore<A HREF="#papers">[4</A>], the algorithm we propose here to
|
||||
construct MPHFs can easily scale to billions of entries.
|
||||
</P>
|
||||
<P>
|
||||
As there are many applications for MPHFs, it is
|
||||
important to design and implement space and time efficient algorithms for
|
||||
constructing such functions.
|
||||
The attractiveness of using MPHFs depends on the following issues:
|
||||
</P>
|
||||
|
||||
<OL>
|
||||
<LI>The amount of CPU time required by the algorithms for constructing MPHFs.
|
||||
<P></P>
|
||||
<LI>The space requirements of the algorithms for constructing MPHFs.
|
||||
<P></P>
|
||||
<LI>The amount of CPU time required by a MPHF for each retrieval.
|
||||
<P></P>
|
||||
<LI>The space requirements of the description of the resulting MPHFs to be used at retrieval time.
|
||||
</OL>
|
||||
|
||||
<P>
|
||||
We present here a novel external memory based algorithm for constructing MPHFs that
|
||||
are very efficient in the four requirements mentioned previously.
|
||||
First, the algorithm constructs a MPHF in time linear in the size of the keys,
|
||||
which is optimal.
|
||||
For instance, for a collection of 1 billion URLs
|
||||
collected from the web, each one 64 characters long on average, the time to construct a
|
||||
MPHF using a 2.4 gigahertz PC with 500 megabytes of available main memory
|
||||
is approximately 3 hours.
|
||||
Second, the algorithm needs a small a priori defined vector of <IMG ALIGN="middle" SRC="figs/brz/img23.png" BORDER="0" ALT=""> one-byte
entries in main memory to construct a MPHF.
|
||||
For the collection of 1 billion URLs and using <IMG ALIGN="middle" SRC="figs/brz/img4.png" BORDER="0" ALT="">, the algorithm needs only
|
||||
5.45 megabytes of internal memory.
|
||||
Third, the evaluation of the MPHF for each retrieval requires three memory accesses and
|
||||
the computation of three universal hash functions.
|
||||
This is not optimal as any MPHF requires at least one memory access and the computation
|
||||
of two universal hash functions.
|
||||
Fourth, the description of a MPHF takes a constant number of bits for each key, which is optimal.
|
||||
For the collection of 1 billion URLs, it needs 8.1 bits for each key,
|
||||
while the theoretical lower bound is <IMG ALIGN="middle" SRC="figs/brz/img24.png" BORDER="0" ALT=""> bits per key.
|
||||
</P>
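<P>
The 5.45 megabyte figure can be checked with a few lines of arithmetic. The sketch below assumes <I>b = 175</I> (the value reported later in the experimental setup), one byte per bucket, and that a megabyte means 2<SUP>20</SUP> bytes; the function name is ours, not part of the library.
</P>

```python
def size_vector_bytes(n: int, b: int) -> int:
    """Bytes needed for a vector with ceil(n / b) one-byte entries."""
    # Integer ceiling division avoids floating-point rounding for large n.
    return -(-n // b)


n_keys = 10**9   # 1 billion URLs
b = 175          # assumed bucket parameter (taken from the experiments section)

num_bytes = size_vector_bytes(n_keys, b)
megabytes = num_bytes / 2**20
print(num_bytes, round(megabytes, 2))
```

<P>
Under these assumptions the vector has 5,714,286 entries, which rounds to the 5.45 megabytes quoted above.
</P>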
|
||||
|
||||
<HR NOSHADE SIZE=1>
|
||||
|
||||
<H2>The Algorithm</H2>
|
||||
|
||||
<P>
|
||||
The main idea supporting our algorithm is the classical divide and conquer technique.
|
||||
The algorithm is a two-step external memory based algorithm
|
||||
that generates a MPHF <I>h</I> for a set <I>S</I> of <I>n</I> keys.
|
||||
Figure 1 illustrates the two steps of the
|
||||
algorithm: the partitioning step and the searching step.
|
||||
</P>
|
||||
|
||||
<TABLE ALIGN="center" CELLPADDING="4">
|
||||
<TR>
|
||||
<TD><IMG ALIGN="middle" SRC="figs/brz/brz.png" BORDER="0" ALT=""></TD>
|
||||
</TR>
|
||||
<TR>
|
||||
<TD><B>Figure 1:</B> Main steps of our algorithm.</TD>
|
||||
</TR>
|
||||
</TABLE>
|
||||
|
||||
<P>
|
||||
The partitioning step takes a key set <I>S</I> and uses a universal hash
|
||||
function <IMG ALIGN="middle" SRC="figs/brz/img42.png" BORDER="0" ALT=""> proposed by Jenkins<A HREF="#papers">[5</A>]
|
||||
to transform each key <IMG ALIGN="middle" SRC="figs/brz/img43.png" BORDER="0" ALT=""> into an integer <IMG ALIGN="middle" SRC="figs/brz/img44.png" BORDER="0" ALT="">.
|
||||
Reducing <IMG ALIGN="middle" SRC="figs/brz/img44.png" BORDER="0" ALT=""> modulo <IMG ALIGN="middle" SRC="figs/brz/img23.png" BORDER="0" ALT="">, we partition <I>S</I>
|
||||
into <IMG ALIGN="middle" SRC="figs/brz/img23.png" BORDER="0" ALT=""> buckets containing at most 256 keys in each bucket (with high
|
||||
probability).
|
||||
</P>
|
||||
<P>
|
||||
The searching step generates a MPHF<IMG ALIGN="middle" SRC="figs/brz/img46.png" BORDER="0" ALT=""> for each bucket <I>i</I>, <IMG ALIGN="middle" SRC="figs/brz/img47.png" BORDER="0" ALT="">.
|
||||
The resulting MPHF <I>h(k)</I>, <IMG ALIGN="middle" SRC="figs/brz/img43.png" BORDER="0" ALT="">, is given by
|
||||
</P>
|
||||
|
||||
<TABLE ALIGN="center" CELLPADDING="4">
|
||||
<TR>
|
||||
<TD><IMG ALIGN="middle" SRC="figs/brz/img49.png" BORDER="0" ALT=""></TD>
|
||||
</TR>
|
||||
</TABLE>
|
||||
|
||||
<P>
|
||||
where <IMG ALIGN="middle" SRC="figs/brz/img50.png" BORDER="0" ALT="">.
|
||||
The <I>i</I>th entry <I>offset[i]</I> of the displacement vector
|
||||
<I>offset</I>, <IMG ALIGN="middle" SRC="figs/brz/img47.png" BORDER="0" ALT="">, contains the total number
|
||||
of keys in the buckets from 0 to <I>i-1</I>, that is, it gives the interval of the
|
||||
keys in the hash table addressed by the MPHF<IMG ALIGN="middle" SRC="figs/brz/img46.png" BORDER="0" ALT="">. In the following we explain
|
||||
each step in detail.
|
||||
</P>
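<P>
A minimal sketch of how the pieces combine at retrieval time, assuming the usual reading of the formula: the bucket address selects a per-bucket function, and <I>offset[i]</I> shifts its result into the global range. Here <CODE>zlib.crc32</CODE> stands in for the Jenkins hash, and an explicit rank table stands in for each per-bucket MPHF; both are simplifications for illustration only, and all function names are hypothetical.
</P>

```python
import zlib
from itertools import accumulate


def bucket_of(key: bytes, num_buckets: int) -> int:
    # Stand-in for the first-level hash reduced modulo the number of buckets.
    return zlib.crc32(key) % num_buckets


def build_toy(keys, num_buckets):
    """Partition keys and build a toy per-bucket function plus the offset vector."""
    buckets = [[] for _ in range(num_buckets)]
    for k in keys:
        buckets[bucket_of(k, num_buckets)].append(k)
    # Toy replacement for the per-bucket MPHF: rank of the key inside its bucket.
    local = [{k: rank for rank, k in enumerate(bkt)} for bkt in buckets]
    sizes = [len(bkt) for bkt in buckets]
    # offset[i] = number of keys in buckets 0 .. i-1
    offset = [0] + list(accumulate(sizes))[:-1]
    return local, offset


def evaluate(key, num_buckets, local, offset):
    """h(k) = (per-bucket value of k) + offset[i], with i the bucket of k."""
    i = bucket_of(key, num_buckets)
    return local[i][key] + offset[i]


keys = [b"url%d" % j for j in range(1000)]
local, offset = build_toy(keys, 8)
values = sorted(evaluate(k, 8, local, offset) for k in keys)
print(values[0], values[-1])
```

<P>
Because each bucket maps its keys to a contiguous range starting at its own offset, the combined function covers every integer from 0 to <I>n-1</I> exactly once.
</P>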
|
||||
|
||||
<HR NOSHADE SIZE=1>
|
||||
|
||||
<H3>Partitioning step</H3>
|
||||
|
||||
<P>
|
||||
The set <I>S</I> of <I>n</I> keys is partitioned into <IMG ALIGN="middle" SRC="figs/brz/img23.png" BORDER="0" ALT=""> buckets,
|
||||
where <I>b</I> is a suitable parameter chosen to guarantee
|
||||
that each bucket has at most 256 keys with high probability
|
||||
(see <A HREF="#papers">[2</A>] for details).
|
||||
The partitioning step works as follows:
|
||||
</P>
|
||||
|
||||
<TABLE ALIGN="center" CELLPADDING="4">
|
||||
<TR>
|
||||
<TD><IMG ALIGN="middle" SRC="figs/brz/img54.png" BORDER="0" ALT=""></TD>
|
||||
</TR>
|
||||
<TR>
|
||||
<TD><B>Figure 2:</B> Partitioning step.</TD>
|
||||
</TR>
|
||||
</TABLE>
|
||||
|
||||
<P>
|
||||
Statement 1.1 of the <B>for</B> loop presented in Figure 2
|
||||
reads sequentially all the keys of block <IMG ALIGN="middle" SRC="figs/brz/img55.png" BORDER="0" ALT=""> from disk into an internal area
|
||||
of size <IMG ALIGN="middle" SRC="figs/brz/img8.png" BORDER="0" ALT="">.
|
||||
</P>
|
||||
<P>
|
||||
Statement 1.2 performs an indirect bucket sort of the keys in block <IMG ALIGN="middle" SRC="figs/brz/img55.png" BORDER="0" ALT=""> and
|
||||
at the same time updates the entries in the vector <I>size</I>.
|
||||
Let us briefly describe how <IMG ALIGN="middle" SRC="figs/brz/img55.png" BORDER="0" ALT=""> is partitioned among
|
||||
the <IMG ALIGN="middle" SRC="figs/brz/img23.png" BORDER="0" ALT=""> buckets.
|
||||
We use a local array of <IMG ALIGN="middle" SRC="figs/brz/img23.png" BORDER="0" ALT=""> counters to store a
|
||||
count of how many keys from <IMG ALIGN="middle" SRC="figs/brz/img55.png" BORDER="0" ALT=""> belong to each bucket.
|
||||
The pointers to the keys in each bucket <I>i</I>, <IMG ALIGN="middle" SRC="figs/brz/img47.png" BORDER="0" ALT="">,
|
||||
are stored in contiguous positions in an array.
|
||||
For this we first reserve the required number of entries
|
||||
in this array of pointers using the information from the array of counters.
|
||||
Next, we place the pointers to the keys in each bucket into the respective
|
||||
reserved areas in the array (i.e., we place the pointers to the keys in bucket 0,
|
||||
followed by the pointers to the keys in bucket 1, and so on).
|
||||
</P>
|
||||
<P>
|
||||
To find the bucket address of a given key
|
||||
we use the universal hash function <IMG ALIGN="middle" SRC="figs/brz/img44.png" BORDER="0" ALT=""><A HREF="#papers">[5</A>].
|
||||
Key <I>k</I> goes into bucket <I>i</I>, where
|
||||
</P>
|
||||
|
||||
<TABLE ALIGN="center" CELLPADDING="4">
|
||||
<TR>
|
||||
<TD><IMG ALIGN="middle" SRC="figs/brz/img57.png" BORDER="0" ALT=""> (1)</TD>
|
||||
</TR>
|
||||
</TABLE>
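<P>
The indirect bucket sort of one block can be sketched as a counting sort over key positions: count, reserve contiguous regions, then place the pointers. This is an illustrative sketch rather than the library code; <CODE>zlib.crc32</CODE> is an assumed stand-in for the Jenkins hash, list indices play the role of pointers, and the function names are hypothetical.
</P>

```python
import zlib


def bucket_of(key: bytes, num_buckets: int) -> int:
    # Stand-in for Eq. (1): first-level hash reduced modulo the bucket count.
    return zlib.crc32(key) % num_buckets


def bucket_sort_block(block, num_buckets, size):
    """Return key positions of `block` grouped by bucket; update global `size`."""
    addr = [bucket_of(k, num_buckets) for k in block]

    # Local counters: how many keys of this block fall in each bucket.
    counts = [0] * num_buckets
    for i in addr:
        counts[i] += 1
        size[i] += 1  # global per-bucket key count

    # Reserve a contiguous region per bucket in the pointer array.
    start = [0] * num_buckets
    for i in range(1, num_buckets):
        start[i] = start[i - 1] + counts[i - 1]

    # Place each pointer in the next free slot of its bucket's region.
    pointers = [0] * len(block)
    next_free = start[:]
    for pos, i in enumerate(addr):
        pointers[next_free[i]] = pos
        next_free[i] += 1
    return pointers


block = [b"k%d" % j for j in range(50)]
size = [0] * 4
pointers = bucket_sort_block(block, 4, size)
addresses = [bucket_of(block[p], 4) for p in pointers]
```

<P>
Writing the keys to a file in the order given by <CODE>pointers</CODE> yields a file sorted by bucket address, which is what the later merge relies on.
</P>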
|
||||
|
||||
<P>
|
||||
Figure 3(a) shows a <I>logical</I> view of the <IMG ALIGN="middle" SRC="figs/brz/img23.png" BORDER="0" ALT=""> buckets
|
||||
generated in the partitioning step.
|
||||
In reality, the keys belonging to each bucket are distributed among many files,
|
||||
as depicted in Figure 3(b).
|
||||
In the example of Figure 3(b), the keys in bucket 0
|
||||
appear in files 1 and <I>N</I>, the keys in bucket 1 appear in files 1, 2
|
||||
and <I>N</I>, and so on.
|
||||
</P>
|
||||
|
||||
<TABLE ALIGN="center" CELLPADDING="4">
|
||||
<TR>
|
||||
<TD><IMG ALIGN="middle" SRC="figs/brz/brz-partitioning.png" BORDER="0" ALT=""></TD>
|
||||
</TR>
|
||||
<TR>
|
||||
<TD><B>Figure 3:</B> Situation of the buckets at the end of the partitioning step: (a) Logical view (b) Physical view.</TD>
|
||||
</TR>
|
||||
</TABLE>
|
||||
|
||||
<P>
|
||||
This scattering of the keys in the buckets could generate a performance
|
||||
problem because of the potential number of seeks
|
||||
needed to read the keys in each bucket from the <I>N</I> files on disk
|
||||
during the searching step.
|
||||
But, as we show in <A HREF="#papers">[2</A>], the number of seeks
|
||||
can be kept small using buffering techniques.
|
||||
Considering that only the vector <I>size</I>, which has <IMG ALIGN="middle" SRC="figs/brz/img23.png" BORDER="0" ALT=""> one-byte
|
||||
entries (remember that each bucket has at most 256 keys),
|
||||
must be maintained in main memory during the searching step,
|
||||
almost all main memory is available to be used as disk I/O buffer.
|
||||
</P>
|
||||
<P>
|
||||
The last step is to compute the <I>offset</I> vector and dump it to the disk.
|
||||
We use the vector <I>size</I> to compute the
|
||||
<I>offset</I> displacement vector.
|
||||
The <I>offset[i]</I> entry contains the number of keys
|
||||
in the buckets <I>0, 1, ..., i-1</I>.
|
||||
As <I>size[i]</I> stores the number of keys
|
||||
in bucket <I>i</I>, where <IMG ALIGN="middle" SRC="figs/brz/img47.png" BORDER="0" ALT="">, we have
|
||||
</P>
|
||||
|
||||
<TABLE ALIGN="center" CELLPADDING="4">
|
||||
<TR>
|
||||
<TD><IMG ALIGN="middle" SRC="figs/brz/img63.png" BORDER="0" ALT=""></TD>
|
||||
</TR>
|
||||
</TABLE>
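<P>
In other words, <I>offset</I> is the exclusive prefix sum of <I>size</I>. A short sketch, with a hypothetical function name:
</P>

```python
from itertools import accumulate


def compute_offset(size):
    """offset[i] = size[0] + ... + size[i-1], with offset[0] = 0."""
    if not size:
        return []
    return [0] + list(accumulate(size))[:-1]


example = compute_offset([3, 0, 2, 5])
print(example)
```

<P>
For bucket sizes 3, 0, 2 and 5, the third bucket starts at position 3 and the fourth at position 5.
</P>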
|
||||
|
||||
<HR NOSHADE SIZE=1>
|
||||
|
||||
<H3>Searching step</H3>
|
||||
|
||||
<P>
|
||||
The searching step is responsible for generating a MPHF for each
|
||||
bucket. Figure 4 presents the searching step algorithm.
|
||||
</P>
|
||||
|
||||
<TABLE ALIGN="center" CELLPADDING="4">
|
||||
<TR>
|
||||
<TD><IMG ALIGN="middle" SRC="figs/brz/img64.png" BORDER="0" ALT=""></TD>
|
||||
</TR>
|
||||
<TR>
|
||||
<TD><B>Figure 4:</B> Searching step.</TD>
|
||||
</TR>
|
||||
</TABLE>
|
||||
|
||||
<P>
|
||||
Statement 1 of Figure 4 inserts one key from each file
|
||||
in a minimum heap <I>H</I> of size <I>N</I>.
|
||||
The order relation in <I>H</I> is given by the bucket address <I>i</I> given by
|
||||
Eq. (1).
|
||||
</P>
|
||||
<P>
|
||||
Statement 2 has two important steps.
|
||||
In statement 2.1, a bucket is read from disk,
|
||||
as described below.
|
||||
In statement 2.2, a MPHF is generated for each bucket <I>i</I>, as described
|
||||
in the following.
|
||||
The description of MPHF<IMG ALIGN="middle" SRC="figs/brz/img46.png" BORDER="0" ALT=""> is a vector <IMG ALIGN="middle" SRC="figs/brz/img66.png" BORDER="0" ALT=""> of 8-bit integers.
|
||||
Finally, statement 2.3 writes the description <IMG ALIGN="middle" SRC="figs/brz/img66.png" BORDER="0" ALT=""> of MPHF<IMG ALIGN="middle" SRC="figs/brz/img46.png" BORDER="0" ALT=""> to disk.
|
||||
</P>
|
||||
|
||||
<HR NOSHADE SIZE=1>
|
||||
|
||||
<H4>Reading a bucket from disk</H4>
|
||||
|
||||
<P>
|
||||
In this section we present the refinement of statement 2.1 of
|
||||
Figure 4.
|
||||
The algorithm to read bucket <I>i</I> from disk is presented
|
||||
in Figure 5.
|
||||
</P>
|
||||
|
||||
<TABLE ALIGN="center" CELLPADDING="4">
|
||||
<TR>
|
||||
<TD><IMG ALIGN="middle" SRC="figs/brz/img67.png" BORDER="0" ALT=""></TD>
|
||||
</TR>
|
||||
<TR>
|
||||
<TD><B>Figure 5:</B> Reading a bucket.</TD>
|
||||
</TR>
|
||||
</TABLE>
|
||||
|
||||
<P>
|
||||
Bucket <I>i</I> is distributed among many files and the heap <I>H</I> is used to drive a
|
||||
multiway merge operation.
|
||||
In Figure 5, statement 1.1 extracts and removes triple
|
||||
<I>(i, j, k)</I> from <I>H</I>, where <I>i</I> is a minimum value in <I>H</I>.
|
||||
Statement 1.2 inserts key <I>k</I> in bucket <I>i</I>.
|
||||
Notice that the <I>k</I> in the triple <I>(i, j, k)</I> is in fact a pointer to
|
||||
the first byte of the key that is kept in contiguous positions of an array of characters
|
||||
(this array containing the keys is initialized during the heap construction
|
||||
in statement 1 of Figure 4).
|
||||
Statement 1.3 performs a seek operation in File <I>j</I> on disk for the first
|
||||
read operation and reads sequentially all keys <I>k</I> that have the same <I>i</I>
|
||||
and inserts them all in bucket <I>i</I>.
|
||||
Finally, statement 1.4 inserts in <I>H</I> the triple <I>(i, j, x)</I>,
|
||||
where <I>x</I> is the first key read from File <I>j</I> (in statement 1.3)
|
||||
that does not have the same bucket address as the previous keys.
|
||||
</P>
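<P>
The heap-driven multiway merge can be sketched as follows. This is a simplified illustration, not the library code: each "file" is an in-memory list of <I>(bucket address, key)</I> pairs already sorted by bucket address, so seeks and buffering are not modelled, and the names are hypothetical. The comments map each step to the statements of Figure 5 as described above.
</P>

```python
import heapq


def read_buckets(files):
    """Yield (bucket_address, keys) in increasing bucket order.

    files: list of lists of (bucket_address, key), each sorted by address.
    """
    cursors = [0] * len(files)
    heap = []
    # Heap construction: one key from each non-empty file.
    for j, f in enumerate(files):
        if f:
            i, k = f[0]
            heap.append((i, j, k))
            cursors[j] = 1
    heapq.heapify(heap)

    while heap:
        current = heap[0][0]
        bucket = []
        while heap and heap[0][0] == current:
            i, j, k = heapq.heappop(heap)        # 1.1: extract minimum triple
            bucket.append(k)                     # 1.2: insert key in bucket i
            f, pos = files[j], cursors[j]
            while pos < len(f) and f[pos][0] == i:
                bucket.append(f[pos][1])         # 1.3: read rest of bucket i
                pos += 1
            if pos < len(f):                     # 1.4: push first key of the
                heapq.heappush(heap, (f[pos][0], j, f[pos][1]))  # next bucket
                pos += 1
            cursors[j] = pos
        yield current, bucket


files = [
    [(0, "a"), (1, "b"), (1, "c")],
    [(1, "d"), (2, "e")],
    [(0, "f"), (2, "g")],
]
merged = list(read_buckets(files))
```

<P>
Each file contributes at most one triple to the heap at any time, so the heap never holds more than <I>N</I> entries.
</P>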
|
||||
<P>
|
||||
The number of seek operations on disk performed in statement 1.3 is discussed
|
||||
in <A HREF="#papers">[2, Section 5.1</A>],
|
||||
where we present a buffering technique that brings down
|
||||
the time spent with seeks.
|
||||
</P>
|
||||
|
||||
<HR NOSHADE SIZE=1>
|
||||
|
||||
<H4>Generating a MPHF for each bucket</H4>
|
||||
|
||||
<P>
|
||||
To the best of our knowledge the <A HREF="bmz.html">BMZ algorithm</A> we have designed in
|
||||
our previous works <A HREF="#papers">[1,3</A>] is the fastest published algorithm for
|
||||
constructing MPHFs.
|
||||
That is why we are using that algorithm as a building block for the
|
||||
algorithm presented here. In reality, we are using
|
||||
an optimized version of BMZ (BMZ8) for small sets of keys (at most 256 keys).
|
||||
<A HREF="bmz.html">Click here to see details about the BMZ algorithm</A>.
|
||||
</P>
|
||||
|
||||
<HR NOSHADE SIZE=1>
|
||||
|
||||
<H2>Analysis of the Algorithm</H2>
|
||||
|
||||
<P>
|
||||
Analytical results and the complete analysis of the external memory based algorithm
|
||||
can be found in <A HREF="#papers">[2</A>].
|
||||
</P>
|
||||
|
||||
<HR NOSHADE SIZE=1>
|
||||
|
||||
<H2>Experimental Results</H2>
|
||||
|
||||
<P>
|
||||
In this section we present the experimental results.
|
||||
We start by presenting the experimental setup.
|
||||
We then present experimental results for
|
||||
the internal memory based algorithm (<A HREF="bmz.html">the BMZ algorithm</A>)
|
||||
and for our external memory based algorithm.
|
||||
Finally, we discuss how the amount of internal memory available
|
||||
affects the runtime of the external memory based algorithm.
|
||||
</P>
|
||||
|
||||
<HR NOSHADE SIZE=1>
|
||||
|
||||
<H3>The data and the experimental setup</H3>
|
||||
|
||||
<P>
|
||||
All experiments were carried out on
|
||||
a computer running the Linux operating system, version 2.6,
|
||||
with a 2.4 gigahertz processor and
|
||||
1 gigabyte of main memory.
|
||||
In the experiments related to the new
|
||||
algorithm we limited the main memory to 500 megabytes.
|
||||
</P>
|
||||
<P>
|
||||
Our data consists of a collection of 1 billion
|
||||
URLs collected from the Web, each URL 64 characters long on average.
|
||||
The collection is stored on disk in 60.5 gigabytes.
|
||||
</P>
|
||||
|
||||
<HR NOSHADE SIZE=1>
|
||||
|
||||
<H3>Performance of the BMZ Algorithm</H3>
|
||||
|
||||
<P>
|
||||
<A HREF="bmz.html">The BMZ algorithm</A> is used for constructing a MPHF for each bucket.
|
||||
It is a randomized algorithm because it needs to generate a simple random graph
|
||||
in its first step.
|
||||
Once the graph is obtained the other two steps are deterministic.
|
||||
</P>
|
||||
<P>
|
||||
Thus, we can consider the runtime of the algorithm to have
|
||||
the form <IMG ALIGN="middle" SRC="figs/brz/img159.png" BORDER="0" ALT=""> for an input of <I>n</I> keys,
|
||||
where <IMG ALIGN="middle" SRC="figs/brz/img160.png" BORDER="0" ALT=""> is some machine dependent
|
||||
constant that further depends on the length of the keys and <I>Z</I> is a random
|
||||
variable with geometric distribution with mean <IMG ALIGN="middle" SRC="figs/brz/img162.png" BORDER="0" ALT="">. All results
|
||||
in our experiments were obtained taking <I>c=1</I>; the value of <I>c</I>, with <I>c</I> in <I>[0.93,1.15]</I>,
|
||||
in fact has little influence on the runtime, as shown in <A HREF="#papers">[3</A>].
|
||||
</P>
|
||||
<P>
|
||||
The values chosen for <I>n</I> were 1, 2, 4, 8, 16 and 32 million.
|
||||
Although we have a dataset with 1 billion URLs, on a PC with
|
||||
1 gigabyte of main memory, the algorithm is able
|
||||
to handle an input with at most 32 million keys.
|
||||
This is mainly because of the graph we need to keep in main memory.
|
||||
The algorithm requires <I>25n + O(1)</I> bytes for constructing
|
||||
a MPHF (<A HREF="bmz.html">click here to get details about the data structures used by the BMZ algorithm</A>).
|
||||
</P>
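<P>
A rough check of the 32 million key limit, ignoring the <I>O(1)</I> term and assuming 1 gigabyte means 2<SUP>30</SUP> bytes (the function name is hypothetical):
</P>

```python
def bmz_construction_bytes(n: int) -> int:
    """Approximate memory for construction: 25 bytes per key, O(1) term ignored."""
    return 25 * n


one_gigabyte = 2**30
for millions in (32, 64):
    needed = bmz_construction_bytes(millions * 10**6)
    print(millions, needed, needed <= one_gigabyte)
```

<P>
With 32 million keys the estimate is 800 million bytes, which fits in 1 gigabyte; doubling to 64 million keys would require 1.6 billion bytes, which does not.
</P>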
|
||||
<P>
|
||||
In order to estimate the number of trials for each value of <I>n</I> we use
|
||||
a statistical method for determining a suitable sample size (see, e.g., <A HREF="#papers">[6, Chapter 13</A>]).
|
||||
As we obtained different values for each <I>n</I>,
|
||||
we used the maximal value obtained, namely, 300 trials in order to have
|
||||
a confidence level of 95 %.
|
||||
</P>
|
||||
<P>
|
||||
Table 1 presents the runtime average for each <I>n</I>,
|
||||
the respective standard deviations, and
|
||||
the respective confidence intervals given by
|
||||
the average time <IMG ALIGN="middle" SRC="figs/brz/img167.png" BORDER="0" ALT=""> the distance from average time
|
||||
considering a confidence level of 95 %.
|
||||
Observing the runtime averages one sees that
|
||||
the algorithm runs in expected linear time,
|
||||
as shown in <A HREF="#papers">[3</A>].
|
||||
</P>
|
||||
<TABLE CELLPADDING=3 BORDER="1" ALIGN="CENTER">
|
||||
<TR><TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE">
|
||||
<SPAN CLASS="MATH"><IMG
|
||||
WIDTH="14" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
||||
SRC="figs/brz/img5.png"
|
||||
ALT="$n$"></SPAN> (millions) </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 1 </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 2 </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 4 </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 8 </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 16 </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 32 </SMALL></TD>
|
||||
<TD></TD>
|
||||
</TR>
|
||||
<TR><TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE">
|
||||
|
||||
Average time (s)</SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="64" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
||||
SRC="figs/brz/img168.png"
|
||||
ALT="$6.1 \pm 0.3$"></SPAN> </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="72" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
||||
SRC="figs/brz/img169.png"
|
||||
ALT="$12.2 \pm 0.6$"></SPAN> </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="72" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
||||
SRC="figs/brz/img170.png"
|
||||
ALT="$25.4 \pm 1.1$"></SPAN> </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="72" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
||||
SRC="figs/brz/img171.png"
|
||||
ALT="$51.4 \pm 2.0$"></SPAN> </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="80" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
||||
SRC="figs/brz/img172.png"
|
||||
ALT="$117.3 \pm 4.4$"></SPAN> </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="80" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
||||
SRC="figs/brz/img173.png"
|
||||
ALT="$262.2 \pm 8.7$"></SPAN></SMALL></TD>
|
||||
<TD></TD>
|
||||
</TR>
|
||||
<TR><TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE">
|
||||
SD (s) </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
||||
SRC="figs/brz/img174.png"
|
||||
ALT="$2.6$"></SPAN> </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
||||
SRC="figs/brz/img175.png"
|
||||
ALT="$5.4$"></SPAN> </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
||||
SRC="figs/brz/img176.png"
|
||||
ALT="$9.8$"></SPAN> </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
||||
SRC="figs/brz/img177.png"
|
||||
ALT="$17.6$"></SPAN> </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
||||
SRC="figs/brz/img178.png"
|
||||
ALT="$37.3$"></SPAN> </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
||||
SRC="figs/brz/img179.png"
|
||||
ALT="$76.3$"></SPAN> </SMALL></TD>
|
||||
<TD></TD>
|
||||
</TR>
|
||||
</TABLE>
|
||||
|
||||
<TABLE ALIGN="center" CELLPADDING="4">
|
||||
<TR>
|
||||
<TD><B>Table 1:</B> Internal memory based algorithm: average time in seconds for constructing a MPHF, the standard deviation (SD), and the confidence intervals considering a confidence level of 95 %.</TD>
|
||||
</TR>
|
||||
</TABLE>
|
||||
|
||||
<P>
|
||||
Figure 6 presents the runtime for each trial. In addition,
|
||||
the solid line corresponds to a linear regression model
|
||||
obtained from the experimental measurements.
|
||||
As we can see, the runtime for a given <I>n</I> has a considerable
|
||||
fluctuation. However, the fluctuation also grows linearly with <I>n</I>.
|
||||
</P>
|
||||
|
||||
<TABLE ALIGN="center" CELLPADDING="4">
|
||||
<TR>
|
||||
<TD><IMG ALIGN="middle" SRC="figs/brz/bmz_temporegressao.png" BORDER="0" ALT=""></TD>
|
||||
</TR>
|
||||
<TR>
|
||||
<TD><B>Figure 6:</B> Time versus number of keys in <I>S</I> for the internal memory based algorithm. The solid line corresponds to a linear regression model.</TD>
|
||||
</TR>
|
||||
</TABLE>
|
||||
|
||||
<P>
|
||||
The observed fluctuation in the runtimes is as expected; recall that this
|
||||
runtime has the form <IMG ALIGN="middle" SRC="figs/brz/img159.png" BORDER="0" ALT=""> with <I>Z</I> a geometric random variable with
|
||||
mean <I>1/p=e</I>. Thus, the runtime has mean <IMG ALIGN="middle" SRC="figs/brz/img181.png" BORDER="0" ALT=""> and standard
|
||||
deviation <IMG ALIGN="middle" SRC="figs/brz/img182.png" BORDER="0" ALT="">.
|
||||
Therefore, the standard deviation also grows
|
||||
linearly with <I>n</I>, as experimentally verified
|
||||
in Table 1 and in Figure 6.
|
||||
</P>
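<P>
The linear growth of both quantities follows from the standard formulas for a geometric random variable counting trials until the first success: mean <I>1/p</I> and standard deviation <I>sqrt(1-p)/p</I>. The sketch below applies them with <I>p = 1/e</I>; the per-key constant passed as <CODE>alpha</CODE> is a made-up value for illustration, not a measured one, and the model ignores any fixed costs present in the measurements of Table 1.
</P>

```python
import math


def runtime_mean_and_sd(alpha: float, n: int, p: float = 1 / math.e):
    """Mean and standard deviation of alpha * n * Z for geometric Z."""
    mean_z = 1 / p                    # expected number of trials
    sd_z = math.sqrt(1 - p) / p       # standard deviation of the trial count
    return alpha * n * mean_z, alpha * n * sd_z


alpha = 1e-6  # hypothetical machine-dependent constant (seconds per key)
m1, s1 = runtime_mean_and_sd(alpha, 10**6)
m2, s2 = runtime_mean_and_sd(alpha, 2 * 10**6)
print(m1, s1, m2, s2)
```

<P>
Doubling <I>n</I> doubles both the mean and the standard deviation, and their ratio stays fixed at <I>sqrt(1 - 1/e)</I>.
</P>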
|
||||
|
||||
<HR NOSHADE SIZE=1>
|
||||
|
||||
<H3>Performance of the External Memory Based Algorithm</H3>
|
||||
|
||||
<P>
|
||||
The runtime of the external memory based algorithm is also a random variable,
|
||||
but now it follows a (highly concentrated) normal distribution, as we discuss at the end of this
|
||||
section. Again, we are interested in verifying the linearity claim made in
|
||||
<A HREF="#papers">[2, Section 5.1</A>]. Therefore, we ran the algorithm for
|
||||
several numbers <I>n</I> of keys in <I>S</I>.
|
||||
</P>
|
||||
<P>
|
||||
The values chosen for <I>n</I> were 1, 2, 4, 8, 16, 32, 64, 128, 512 and 1000
|
||||
million.
|
||||
We limited the main memory to 500 megabytes for the experiments.
|
||||
The size <IMG ALIGN="middle" SRC="figs/brz/img8.png" BORDER="0" ALT=""> of the a priori reserved internal memory area
|
||||
was set to 250 megabytes, the parameter <I>b</I> was set to <I>175</I> and
|
||||
the building block algorithm parameter <I>c</I> was again set to <I>1</I>.
|
||||
We show later on how <IMG ALIGN="middle" SRC="figs/brz/img8.png" BORDER="0" ALT=""> affects the runtime of the algorithm. The other two parameters
|
||||
have insignificant influence on the runtime.
|
||||
</P>
|
||||
<P>
|
||||
We again use a statistical method for determining a suitable sample size
|
||||
to estimate the number of trials to be run for each value of <I>n</I>. We found that
|
||||
just one trial for each <I>n</I> would be enough with a confidence level of 95 %.
|
||||
However, we made 10 trials. This number of trials seems rather small, but, as
|
||||
shown below, the behavior of our algorithm is very stable and its runtime is
|
||||
almost deterministic (i.e., the standard deviation is very small).
|
||||
</P>
|
||||
<P>
|
||||
Table 2 presents the runtime average for each <I>n</I>,
|
||||
the respective standard deviations, and
|
||||
the respective confidence intervals given by
|
||||
the average time <IMG ALIGN="middle" SRC="figs/brz/img167.png" BORDER="0" ALT=""> the distance from average time
|
||||
considering a confidence level of 95 %.
|
||||
Observing the runtime averages we noticed that
|
||||
the algorithm runs in expected linear time,
|
||||
as shown in <A HREF="#papers">[2, Section 5.1</A>]. Better still,
|
||||
it is only approximately 60 % slower than the BMZ algorithm.
|
||||
To get that value we used the linear regression model obtained for the runtime of
|
||||
the internal memory based algorithm to estimate how much time it would require
|
||||
for constructing a MPHF for a set of 1 billion keys.
|
||||
We estimated 2.3 hours for the internal memory based algorithm and we measured
|
||||
3.67 hours on average for the external memory based algorithm.
|
||||
Increasing the size of the internal memory area
|
||||
from 250 to 600 megabytes,
|
||||
we have brought the time to 3.09 hours. In this setup, the external memory based algorithm is
just 34 % slower.
|
||||
</P>
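<P>
The two percentages follow directly from the reported times. A small sketch of the arithmetic, using only the figures quoted above (the function name is hypothetical):
</P>

```python
def slowdown_percent(external_hours: float, internal_hours: float) -> float:
    """How much slower the external algorithm is, as a percentage."""
    return (external_hours / internal_hours - 1) * 100


internal_estimate = 2.3   # hours, extrapolated from the linear regression
print(round(slowdown_percent(3.67, internal_estimate)))  # 250 MB internal area
print(round(slowdown_percent(3.09, internal_estimate)))  # 600 MB internal area
```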
|
||||
<TABLE CELLPADDING=3 BORDER="1" ALIGN="CENTER">
|
||||
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
|
||||
<SPAN CLASS="MATH"><IMG
|
||||
WIDTH="14" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
||||
SRC="figs/brz/img5.png"
|
||||
ALT="$n$"></SPAN> (millions) </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 1 </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 2 </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 4 </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 8 </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 16 </SMALL></TD>
|
||||
</TR>
|
||||
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
|
||||
Average time (s) </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="64" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
||||
SRC="figs/brz/img187.png"
|
||||
ALT="$6.9 \pm 0.3$"></SPAN> </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="72" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
||||
SRC="figs/brz/img188.png"
|
||||
ALT="$13.8 \pm 0.2$"></SPAN> </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="72" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
||||
SRC="figs/brz/img189.png"
|
||||
ALT="$31.9 \pm 0.7$"></SPAN> </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="72" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
||||
SRC="figs/brz/img190.png"
|
||||
ALT="$69.9 \pm 1.1$"></SPAN> </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="80" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
||||
SRC="figs/brz/img191.png"
|
||||
ALT="$140.6 \pm 2.5$"></SPAN> </SMALL></TD>
|
||||
</TR>
|
||||
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
|
||||
SD (s) </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
||||
SRC="figs/brz/img192.png"
|
||||
ALT="$0.4$"></SPAN> </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
||||
SRC="figs/brz/img193.png"
|
||||
ALT="$0.2$"></SPAN> </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
||||
SRC="figs/brz/img194.png"
|
||||
ALT="$0.9$"></SPAN> </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
||||
SRC="figs/brz/img195.png"
|
||||
ALT="$1.5$"></SPAN> </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
||||
SRC="figs/brz/img196.png"
|
||||
ALT="$3.5$"></SPAN> </SMALL></TD>
|
||||
</TR>
|
||||
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
|
||||
|
||||
<SPAN CLASS="MATH"><IMG
|
||||
WIDTH="14" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
||||
SRC="figs/brz/img5.png"
|
||||
ALT="$n$"></SPAN> (millions) </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 32 </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 64 </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 128 </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 512 </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 1000 </SMALL></TD>
|
||||
</TR>
|
||||
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
|
||||
Average time (s) </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="80" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
||||
SRC="figs/brz/img197.png"
|
||||
ALT="$284.3 \pm 1.1$"></SPAN> </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="80" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
||||
SRC="figs/brz/img198.png"
|
||||
ALT="$587.9 \pm 3.9$"></SPAN> </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <!-- MATH
|
||||
$1223.6 \pm 4.9$
|
||||
-->
|
||||
<SPAN CLASS="MATH"><IMG
|
||||
WIDTH="88" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
||||
SRC="figs/brz/img199.png"
|
||||
ALT="$1223.6 \pm 4.9$"></SPAN> </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <!-- MATH
|
||||
$5966.4 \pm 9.5$
|
||||
-->
|
||||
<SPAN CLASS="MATH"><IMG
|
||||
WIDTH="88" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
||||
SRC="figs/brz/img200.png"
|
||||
ALT="$5966.4 \pm 9.5$"></SPAN> </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <!-- MATH
|
||||
$13229.5 \pm 12.7$
|
||||
-->
|
||||
<SPAN CLASS="MATH"><IMG
|
||||
WIDTH="104" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
||||
SRC="figs/brz/img201.png"
|
||||
ALT="$13229.5 \pm 12.7$"></SPAN> </SMALL></TD>
|
||||
</TR>
|
||||
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
|
||||
SD </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
||||
SRC="figs/brz/img202.png"
|
||||
ALT="$1.6$"></SPAN> </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
||||
SRC="figs/brz/img203.png"
|
||||
ALT="$5.5$"></SPAN> </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
||||
SRC="figs/brz/img204.png"
|
||||
ALT="$6.8$"></SPAN> </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
||||
SRC="figs/brz/img205.png"
|
||||
ALT="$13.2$"></SPAN> </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
||||
SRC="figs/brz/img206.png"
|
||||
ALT="$18.6$"></SPAN> </SMALL></TD>
|
||||
</TR>
|
||||
<TR><TD></TD>
|
||||
<TD></TD>
|
||||
<TD></TD>
|
||||
<TD></TD>
|
||||
<TD></TD>
|
||||
<TD></TD>
|
||||
</TR>
|
||||
</TABLE>
|
||||
|
||||
<TABLE ALIGN="center" CELLPADDING="4">
|
||||
<TR>
|
||||
<TD><B>Table 2:</B> The external memory based algorithm: average time in seconds for constructing an MPHF, the standard deviation (SD), and the confidence intervals at a confidence level of 95%.</TD>
|
||||
</TR>
|
||||
</TABLE>
|
||||
|
||||
<P>
|
||||
Figure 7 presents the runtime for each trial. In addition,
the solid line corresponds to a linear regression model
obtained from the experimental measurements.
As expected, the runtime for a given <I>n</I> shows almost no
variation.
|
||||
</P>
|
||||
|
||||
<TABLE ALIGN="center" CELLPADDING="4">
|
||||
<TR>
|
||||
<TD><IMG ALIGN="middle" SRC="figs/brz/brz_temporegressao.png" BORDER="0" ALT=""></TD>
|
||||
</TR>
|
||||
<TR>
|
||||
<TD><B>Figure 7:</B> Time versus number of keys in <I>S</I> for our algorithm. The solid line corresponds to a linear regression model.</TD>
|
||||
</TR>
|
||||
</TABLE>
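<P>
As a rough illustration of the linear model, the sketch below fits an ordinary least-squares line to the five average times reported in Table 2. It is only an approximation of the regression in Figure 7, which was obtained from the individual trials; the function name <CODE>least_squares</CODE> is ours and is not part of the library.
</P>

```python
# Average construction times from Table 2
# (n in millions of keys, time in seconds).
n_millions = [32, 64, 128, 512, 1000]
avg_seconds = [284.3, 587.9, 1223.6, 5966.4, 13229.5]


def least_squares(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = least_squares(n_millions, avg_seconds)
print(f"slope: {slope:.2f} s per million keys, intercept: {intercept:.1f} s")
```

<P>
The slope can be read as the marginal cost, in seconds, of each additional million keys. A fit on five averaged points hides the per-trial spread, so it should not be taken as a replacement for the model plotted in Figure 7.
</P>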
|
||||
|
||||
<P>
|
||||
An intriguing observation is that the runtime of the algorithm is almost
deterministic, even though it uses as a building block an
algorithm whose runtime fluctuates considerably. A given bucket
<I>i</I>, <IMG ALIGN="middle" SRC="figs/brz/img47.png" BORDER="0" ALT="">, is a small set of keys (at most 256 keys) and,
as argued in the previous section, the runtime of the
building block algorithm is a random variable <IMG ALIGN="middle" SRC="figs/brz/img207.png" BORDER="0" ALT=""> with high fluctuation.
However, the runtime <I>Y</I> of the searching step of the external memory based algorithm is given
by <IMG ALIGN="middle" SRC="figs/brz/img209.png" BORDER="0" ALT="">. Under the hypothesis that
the <IMG ALIGN="middle" SRC="figs/brz/img207.png" BORDER="0" ALT=""> are independent and bounded, the <I>law of large numbers</I> (see,
e.g., <A HREF="#papers">[6</A>]) implies that the random variable <IMG ALIGN="middle" SRC="figs/brz/img210.png" BORDER="0" ALT=""> converges
to a constant as <IMG ALIGN="middle" SRC="figs/brz/img83.png" BORDER="0" ALT="">. This explains why the runtime of our
algorithm is almost deterministic.
|
||||
</P>
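<P>
The effect can be seen in a small simulation. The sketch below is not the algorithm itself: it assumes, purely for illustration, that the cost of each bucket is an independent draw from a uniform distribution on [0, 1] (bounded and highly variable), and it compares the relative spread of the total when few and when many buckets are summed. The function names and the chosen distribution are hypothetical.
</P>

```python
import random
import statistics


def searching_time(num_buckets, rng):
    # Total cost of the searching step: one independent, bounded,
    # highly variable cost per bucket (uniform on [0, 1] by assumption).
    return sum(rng.uniform(0.0, 1.0) for _ in range(num_buckets))


def relative_spread(num_buckets, trials=100, seed=0):
    # Standard deviation of the total divided by its mean, over repeated trials.
    rng = random.Random(seed)
    totals = [searching_time(num_buckets, rng) for _ in range(trials)]
    return statistics.stdev(totals) / statistics.mean(totals)


few = relative_spread(10)
many = relative_spread(10_000)
print(f"relative spread with 10 buckets:     {few:.4f}")
print(f"relative spread with 10,000 buckets: {many:.4f}")
```

<P>
Under these assumptions the relative spread shrinks roughly as one over the square root of the number of buckets, which is why a sum over a very large number of buckets looks almost deterministic even though each term fluctuates widely.
</P>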
|
||||
|
||||
<HR NOSHADE SIZE=1>
|
||||
|
||||
<H3>Controlling disk accesses</H3>
|
||||
|
||||
<P>
|
||||
To reduce the number of seek operations on disk,
we benefit from the fact that our algorithm leaves almost all main
memory available to be used as disk I/O buffer.
In this section we evaluate how much the parameter <IMG ALIGN="middle" SRC="figs/brz/img8.png" BORDER="0" ALT=""> affects the runtime of our algorithm.
To that end, we fixed <I>n</I> at 1 billion URLs,
set the main memory of the machine used for the experiments
to 1 gigabyte and used <IMG ALIGN="middle" SRC="figs/brz/img8.png" BORDER="0" ALT=""> equal to 100, 200, 300, 400, 500 and 600
megabytes.
|
||||
</P>
|
||||
<P>
|
||||
Table 3 presents the number of files <I>N</I>,
the buffer size used for all files, the number of seeks in the worst case considering
the pessimistic assumption mentioned in <A HREF="#papers">[2, Section 5.1</A>], and
the time to generate an MPHF for 1 billion keys as a function of the amount of internal
memory available. Table 3 shows that the construction time
decreases as the value of <IMG ALIGN="middle" SRC="figs/brz/img8.png" BORDER="0" ALT=""> increases. However, for <IMG ALIGN="middle" SRC="figs/brz/img213.png" BORDER="0" ALT="">, the variation
in the time is not as significant as for <IMG ALIGN="middle" SRC="figs/brz/img214.png" BORDER="0" ALT="">.
This can be explained by the fact that the I/O scheduler of the Linux 2.6 kernel
has smart policies for avoiding seeks and diminishing the average seek time
(see <A HREF="http://www.linuxjournal.com/article/6931">http://www.linuxjournal.com/article/6931</A>).
|
||||
</P>
|
||||
<TABLE CELLPADDING=3 BORDER="1" ALIGN="center">
|
||||
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
|
||||
<SPAN CLASS="MATH"><IMG
|
||||
WIDTH="14" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
||||
SRC="figs/brz/img8.png"
|
||||
ALT="$\mu $"></SPAN> (MB) </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
||||
SRC="figs/brz/img215.png"
|
||||
ALT="$100$"></SPAN> </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
||||
SRC="figs/brz/img216.png"
|
||||
ALT="$200$"></SPAN> </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
||||
SRC="figs/brz/img217.png"
|
||||
ALT="$300$"></SPAN> </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
||||
SRC="figs/brz/img218.png"
|
||||
ALT="$400$"></SPAN> </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
||||
SRC="figs/brz/img219.png"
|
||||
ALT="$500$"></SPAN> </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
||||
SRC="figs/brz/img212.png"
|
||||
ALT="$600$"></SPAN> </SMALL></TD>
|
||||
</TR>
|
||||
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
|
||||
|
||||
<SPAN CLASS="MATH"><IMG
|
||||
WIDTH="19" HEIGHT="14" ALIGN="BOTTOM" BORDER="0"
|
||||
SRC="figs/brz/img58.png"
|
||||
ALT="$N$"></SPAN> (files) </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
||||
SRC="figs/brz/img220.png"
|
||||
ALT="$619$"></SPAN> </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
||||
SRC="figs/brz/img221.png"
|
||||
ALT="$310$"></SPAN> </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
||||
SRC="figs/brz/img222.png"
|
||||
ALT="$207$"></SPAN> </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
||||
SRC="figs/brz/img223.png"
|
||||
ALT="$155$"></SPAN> </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
||||
SRC="figs/brz/img224.png"
|
||||
ALT="$124$"></SPAN> </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
||||
SRC="figs/brz/img225.png"
|
||||
ALT="$104$"></SPAN> </SMALL></TD>
|
||||
</TR>
|
||||
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
|
||||
(buffer size in KB) </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
||||
SRC="figs/brz/img226.png"
|
||||
ALT="$165$"></SPAN> </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="28" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
||||
SRC="figs/brz/img227.png"
|
||||
ALT="$661$"></SPAN> </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="43" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
||||
SRC="figs/brz/img228.png"
|
||||
ALT="$1,484$"></SPAN> </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="43" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
||||
SRC="figs/brz/img229.png"
|
||||
ALT="$2,643$"></SPAN> </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="43" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
||||
SRC="figs/brz/img230.png"
|
||||
ALT="$4,129$"></SPAN> </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="43" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
||||
SRC="figs/brz/img231.png"
|
||||
ALT="$5,908$"></SPAN> </SMALL></TD>
|
||||
</TR>
|
||||
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
|
||||
<SPAN CLASS="MATH"><IMG
|
||||
WIDTH="14" HEIGHT="30" ALIGN="MIDDLE" BORDER="0"
|
||||
SRC="figs/brz/img135.png"
|
||||
ALT="$\beta$"></SPAN>/ (# of seeks in the worst case) </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="59" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
||||
SRC="figs/brz/img232.png"
|
||||
ALT="$384,478$"></SPAN> </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="51" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
||||
SRC="figs/brz/img233.png"
|
||||
ALT="$95,974$"></SPAN> </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="51" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
||||
SRC="figs/brz/img234.png"
|
||||
ALT="$42,749$"></SPAN> </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="51" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
||||
SRC="figs/brz/img235.png"
|
||||
ALT="$24,003$"></SPAN> </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="51" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
||||
SRC="figs/brz/img236.png"
|
||||
ALT="$15,365$"></SPAN> </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="51" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
||||
SRC="figs/brz/img237.png"
|
||||
ALT="$10,738$"></SPAN> </SMALL></TD>
|
||||
</TR>
|
||||
<TR><TD ALIGN="LEFT"><SMALL CLASS="SCRIPTSIZE">
|
||||
Time (hours) </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
||||
SRC="figs/brz/img238.png"
|
||||
ALT="$4.04$"></SPAN> </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
||||
SRC="figs/brz/img239.png"
|
||||
ALT="$3.64$"></SPAN> </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
||||
SRC="figs/brz/img240.png"
|
||||
ALT="$3.34$"></SPAN> </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
||||
SRC="figs/brz/img241.png"
|
||||
ALT="$3.20$"></SPAN> </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
||||
SRC="figs/brz/img242.png"
|
||||
ALT="$3.13$"></SPAN> </SMALL></TD>
|
||||
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
||||
SRC="figs/brz/img243.png"
|
||||
ALT="$3.09$"></SPAN> </SMALL></TD>
|
||||
</TR>
|
||||
</TABLE>
|
||||
|
||||
<TABLE ALIGN="center" CELLPADDING="4">
|
||||
<TR>
|
||||
<TD><B>Table 3:</B> Influence of the internal memory area size (<IMG ALIGN="middle" SRC="figs/brz/img8.png" BORDER="0" ALT="">) in the external memory based algorithm runtime.</TD>
|
||||
</TR>
|
||||
</TABLE>
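<P>
The rows of Table 3 can be cross-checked with a few lines of arithmetic. The sketch below rests on two assumptions that are inferred from the numbers rather than stated in the text: that the internal memory area is divided evenly among the <I>N</I> files (with 1 MB = 1024 KB), and that the worst-case seek count is the total amount of data divided by the per-file buffer size. The helper <CODE>buffer_kb</CODE> is hypothetical.
</P>

```python
# Rows of Table 3: memory area (MB), N (files), buffer size (KB), worst-case seeks.
table3 = [
    (100, 619, 165, 384478),
    (200, 310, 661, 95974),
    (300, 207, 1484, 42749),
    (400, 155, 2643, 24003),
    (500, 124, 4129, 15365),
    (600, 104, 5908, 10738),
]


def buffer_kb(mu_mb, num_files):
    # Assumption: the memory area is split evenly among the files, 1 MB = 1024 KB.
    return mu_mb * 1024 / num_files


# Data volume implied by each row, in KB, if seeks = total data / buffer size.
implied_total_kb = [seeks * buffer_kb(mu, n) for mu, n, _, seeks in table3]

for (mu, n, reported_kb, _), total_kb in zip(table3, implied_total_kb):
    computed_kb = round(buffer_kb(mu, n))
    print(f"memory {mu} MB: buffer {computed_kb} KB (table: {reported_kb} KB), "
          f"implied data volume {total_kb / 1024 ** 2:.1f} GB")
```

<P>
If both assumptions hold, every row should reproduce the reported buffer size and imply roughly the same data volume, which would make the table internally consistent: doubling the memory area halves the number of files and quadruples the buffer size per seek-bound pass.
</P>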
|
||||
|
||||
<HR NOSHADE SIZE=1>
|
||||
|
||||
<A NAME="papers"></A>
|
||||
<H2>Papers</H2>
|
||||
|
||||
<OL>
|
||||
<LI><A HREF="http://www.dcc.ufmg.br/~fbotelho">F. C. Botelho</A>, D. Menoti, <A HREF="http://www.dcc.ufmg.br/~nivio">N. Ziviani</A>. <A HREF="papers/bmz_tr004_04.ps">A New algorithm for constructing minimal perfect hash functions</A>, Technical Report TR004/04, Department of Computer Science, Federal University of Minas Gerais, 2004.
|
||||
<P></P>
|
||||
<LI><A HREF="http://www.dcc.ufmg.br/~fbotelho">F. C. Botelho</A>, Y. Kohayakawa, <A HREF="http://www.dcc.ufmg.br/~nivio">N. Ziviani</A>. <A HREF="papers/tr06.pdf">An Approach for Minimal Perfect Hash Functions for Very Large Databases</A>, Technical Report TR003/06, Department of Computer Science, Federal University of Minas Gerais, 2004.
|
||||
<P></P>
|
||||
<LI><A HREF="http://www.dcc.ufmg.br/~fbotelho">F. C. Botelho</A>, Y. Kohayakawa, and <A HREF="http://www.dcc.ufmg.br/~nivio">N. Ziviani</A>. <A HREF="papers/wea05.pdf">A Practical Minimal Perfect Hashing Method</A>. <I>4th International Workshop on Efficient and Experimental Algorithms (WEA05),</I> Springer-Verlag Lecture Notes in Computer Science, vol. 3505, Santorini Island, Greece, May 2005, 488-500.
|
||||
<P></P>
|
||||
<LI><A HREF="http://acmqueue.com/modules.php?name=Content&pa=showpage&pid=299">M. Seltzer. Beyond relational databases. ACM Queue, 3(3), April 2005.</A>
|
||||
<P></P>
|
||||
<LI><A HREF="http://burtleburtle.net/bob/hash/doobs.html">Bob Jenkins. Algorithm alley: Hash functions. Dr. Dobb's Journal of Software Tools, 22(9), September 1997.</A>
|
||||
<P></P>
|
||||
<LI>R. Jain. The art of computer systems performance analysis: techniques for experimental design, measurement, simulation, and modeling. John Wiley, first edition, 1991.
|
||||
</OL>
|
||||
|
||||
<HR NOSHADE SIZE=1>
|
||||
|
||||
<TABLE ALIGN="center" CELLPADDING="4">
|
||||
<TR>
|
||||
<TD><A HREF="index.html">Home</A></TD>
|
||||
<TD><A HREF="chd.html">CHD</A></TD>
|
||||
<TD><A HREF="bdz.html">BDZ</A></TD>
|
||||
<TD><A HREF="bmz.html">BMZ</A></TD>
|
||||
<TD><A HREF="chm.html">CHM</A></TD>
|
||||
<TD><A HREF="brz.html">BRZ</A></TD>
|
||||
<TD><A HREF="fch.html">FCH</A></TD>
|
||||
</TR>
|
||||
</TABLE>
|
||||
|
||||
<HR NOSHADE SIZE=1>
|
||||
|
||||
<P>
|
||||
Enjoy!
|
||||
</P>
|
||||
<P>
|
||||
<A HREF="mailto:davi@users.sourceforge.net">Davi de Castro Reis</A>
|
||||
</P>
|
||||
<P>
|
||||
<A HREF="mailto:db8192@users.sourceforge.net">Djamel Belazzougui</A>
|
||||
</P>
|
||||
<P>
|
||||
<A HREF="mailto:fc_botelho@users.sourceforge.net">Fabiano Cupertino Botelho</A>
|
||||
</P>
|
||||
<P>
|
||||
<A HREF="mailto:nivio@dcc.ufmg.br">Nivio Ziviani</A>
|
||||
</P>
|
||||
<script type="text/javascript">
|
||||
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
|
||||
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
|
||||
</script>
|
||||
<script type="text/javascript">
|
||||
try {
|
||||
var pageTracker = _gat._getTracker("UA-7698683-2");
|
||||
pageTracker._trackPageview();
|
||||
} catch(err) {}</script>
|
||||
|
||||
<!-- html code generated by txt2tags 2.6 (http://txt2tags.org) -->
|
||||
<!-- cmdline: txt2tags -t html -i BRZ.t2t -o docs/brz.html -->
|
||||
</BODY></HTML>
|
||||