paper for vldb07 added

This commit is contained in:
fc_botelho 2006-08-11 17:32:31 +00:00
parent 00c049787a
commit 80549b6ca6
28 changed files with 4517 additions and 0 deletions

7
vldb07/acknowledgments.tex Executable file

@ -0,0 +1,7 @@
\section{Acknowledgments}
This section is optional; it is a location for you
to acknowledge grants, funding, editing assistance and
what have you. In the present case, for example, the
authors would like to thank Gerald Murray of ACM for
his help in codifying this \textit{Author's Guide}
and the \textbf{.cls} and \textbf{.tex} files that it describes.

174
vldb07/analyticalresults.tex Executable file

@ -0,0 +1,174 @@
%% Nivio: 23/jan/06 29/jan/06
% Time-stamp: <Monday 30 Jan 2006 03:56:47am EDT yoshi@ime.usp.br>
\enlargethispage{2\baselineskip}
\section{Analytical results}
\label{sec:analytcal-results}
\vspace{-1mm}
The purpose of this section is fourfold.
First, we show that our algorithm runs in expected time $O(n)$.
Second, we present the main memory requirements for constructing the MPHF.
Third, we discuss the cost of evaluating the resulting MPHF.
Fourth, we present the space required to store the resulting MPHF.
\vspace{-2mm}
\subsection{The linear time complexity}
\label{sec:linearcomplexity}
First, we show that the partitioning step presented in
Figure~\ref{fig:partitioningstep} runs in $O(n)$ time.
Each iteration of the {\bf for} loop in statement~1
runs in $O(|B_j|)$ time, $1 \leq j \leq N$, where $|B_j|$ is the
number of keys
that fit in block $B_j$ of size $\mu$. This is because statement 1.1 just reads
$|B_j|$ keys from disk, statement 1.2 runs a bucket-sort-like algorithm,
which is well known to be linear in the number of keys it sorts (i.e., $|B_j|$ keys),
and statement 1.3 just dumps $|B_j|$ keys back to disk into File $j$.
Thus, the {\bf for} loop runs in $O(\sum_{j=1}^{N}|B_j|)$ time.
As $\sum_{j=1}^{N}|B_j|=n$, the partitioning step runs in $O(n)$ time.
Second, we show that the searching step presented in
Figure~\ref{fig:searchingstep} also runs in $O(n)$ time.
The heap construction in statement 1 runs in $O(N)$ time, for $N \ll n$.
We have assumed that insertions and deletions in the heap cost $O(1)$ because
$N$ is typically much smaller than $n$ (see \cite[Section 6.4]{bkz06t} for details).
Statement 2 runs in $O(\sum_{i=0}^{\lceil n/b \rceil - 1} \mathit{size}[i])$ time
(remember that $\mathit{size}[i]$ stores the number of keys in bucket $i$).
As $\sum_{i=0}^{\lceil n/b \rceil - 1} \mathit{size}[i] = n$, if
statements 2.1, 2.2 and 2.3 run in $O(\mathit{size}[i])$ time, then statement 2
runs in $O(n)$ time.
%Statement 2.1 runs the algorithm to read a bucket from disk. That algorithm reads $\mathit{size}[i]$
%keys of bucket $i$ that might be spread into many files or, in the worst case,
%into $|BS_{max}|$ files, where $|BS_{max}|$ is the number of keys in the bucket of maximum size.
%It uses the heap $H$ to drive a multiway merge of the sprayed bucket $i$.
%As we are considering that each read/write on disk costs $O(1)$ and
%each heap operation also costs $O(1)$ (recall $N \ll n$), then statement 2.1
%costs $O(\mathit{size}[i])$ time.
%We need to take into account that this step could generate a lot of seeks on disk.
%However, the number of seeks can be amortized (see Section~\ref{sec:contr-disk-access})
%and that is why we have been able of getting a MPHF for a set of 1 billion keys in less
%than 4 hours using a machine with just 500 MB of main memory
%(see Section~\ref{sec:performance}).
Statement 2.1 reads $O(\mathit{size}[i])$ keys of bucket $i$
and is detailed in Figure~\ref{fig:readingbucket}.
As we are assuming that each read or write on disk costs $O(1)$ and
each heap operation also costs $O(1)$, statement~2.1
takes $O(\mathit{size}[i])$ time.
However, the keys of bucket $i$ are distributed in at most~$BS_{max}$ files on disk
in the worst case
(recall that $BS_{max}$ is the maximum number of keys found in any bucket).
Therefore, we need to take into account that
the critical step in reading a bucket is in statement~1.3 of Figure~\ref{fig:readingbucket},
where a seek operation in File $j$
may be performed by the first read operation.
In order to amortize the number of seeks performed we use a buffering technique~\cite{k73}.
We create a buffer $j$ of size \textbaht$\: = \mu/N$ for each file $j$,
where $1\leq j \leq N$
(recall that $\mu$ is the size in bytes of an a priori reserved internal memory area).
Every time a read operation is requested to file $j$ and the data is not found
in the $j$th~buffer, \textbaht~bytes are read from file $j$ to buffer $j$.
Hence, the number of seeks performed in the worst case is given by
$\beta/$\textbaht~(remember that $\beta$ is the size in bytes of $S$).
Here we make the pessimistic assumption that one seek happens every time
buffer $j$ is refilled.
Thus, the number of seeks performed in the worst case is $64n/$\textbaht, since
each URL is 64 bytes long on average. Therefore, the number of seeks is linear in
$n$ and amortized by \textbaht.
It is important to emphasize two things.
First, the operating system uses techniques
to diminish the number of seeks and the average seek time.
This makes the amortization factor greater than \textbaht~in practice.
Second, almost all main memory is available to be used as
file buffers because just a small vector
of $\lceil n/b\rceil$ one-byte entries must be maintained in main memory,
as we show in Section~\ref{sec:memconstruction}.
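To make the buffering scheme concrete, the sketch below shows one possible
way of organizing the per-file buffers in C. It is a minimal illustration
written for this discussion: the type and function names are ours, the
capacity of each buffer plays the role of \textbaht$\:=\mu/N$, and the
heap-driven merge of statement~2.1 is omitted; the actual implementation is
the one available at \texttt{http://cmph.sf.net}.
\begin{verbatim}
#include <stdio.h>

/* Hypothetical per-file input buffer used to
   amortize seeks; cap plays the role of the
   buffer size mu/N discussed above.          */
typedef struct {
    FILE  *fp;   /* File j, written by the
                    partitioning step         */
    char  *buf;  /* in-memory buffer          */
    size_t cap;  /* capacity (mu/N bytes)     */
    size_t len;  /* valid bytes in buf        */
    size_t pos;  /* next unread position      */
} FileBuffer;

/* Refill buffer j; in the worst case this is
   where one seek happens.                    */
static size_t refill(FileBuffer *b)
{
    b->pos = 0;
    b->len = fread(b->buf, 1, b->cap, b->fp);
    return b->len;
}

/* Read the next byte of file j through the
   buffer, so at most one seek occurs per
   cap bytes read from that file.             */
static int next_byte(FileBuffer *b)
{
    if (b->pos == b->len && refill(b) == 0)
        return EOF;  /* file j exhausted      */
    return (unsigned char)b->buf[b->pos++];
}
\end{verbatim}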
Statement 2.2 runs our internal memory based algorithm in order to generate a MPHF for each bucket.
That algorithm is linear, as we showed in~\cite{bkz05}. As it is applied to buckets with {\it size}$[i]$ keys, statement~2.2 takes $O(\mathit{size}[i])$ time.
Statement 2.3 has time complexity $O(\mathit{size}[i])$ because it writes to disk
the description of each generated MPHF and each description is stored in
$c \times \mathit{size}[i] + O(1)$ bytes, where $c\in[0.93,1.15]$.
In conclusion, our algorithm takes $O(n)$ time because both the partitioning and
the searching steps run in $O(n)$ time.
An experimental validation of the above proof and a performance comparison with
our internal memory based algorithm~\cite{bkz05} were not included here due to
space restrictions but can be found in~\cite{bkz06t} and also in the appendix.
\vspace{-1mm}
\enlargethispage{2\baselineskip}
\subsection{Space used for constructing a MPHF}
\label{sec:memconstruction}
The vector {\it size} is kept in main memory
all the time.
It has $\lceil n/b \rceil$ one-byte entries and
stores the number of keys in each bucket;
those values are at most 256.
For example, for a set of 1 billion keys and $b=175$ the vector {\it size} needs
$5.45$ megabytes of main memory.
We need an internal memory area of size $\mu$ bytes to be used in
the partitioning step and in the searching step.
The size $\mu$ is fixed a priori and depends only on the amount
of internal memory available to run the algorithm
(i.e., it does not depend on the size $n$ of the problem).
% One could argue about the a priori reserved internal memory area and the main memory
% required to run the indirect bucket sort algorithm.
% Those internal memory requirements do not depend on the size of the problem
% (i.e., the number of keys being hashed) and can be fixed a priori.
The additional space required in the searching step
is constant, since the problem is broken down
into several small problems (buckets of at most 256 keys) and
the heap size is much smaller than $n$ ($N \ll n$).
For example, for a set of 1 billion keys and an internal area of~$\mu = 250$ megabytes,
the number of files is $N = 248$.
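As a quick check of the figures above, the fragment below computes the memory
taken by the vector {\it size} for the 1-billion-key example. It is a trivial
sketch written only for this discussion, not part of the implementation.
\begin{verbatim}
#include <stdio.h>
#include <math.h>

int main(void)
{
    double n = 1e9;    /* number of keys      */
    double b = 175.0;  /* keys per bucket     */

    /* one byte per bucket, ceil(n/b) buckets */
    double bytes = ceil(n / b);
    printf("size vector: %.2f MB\n",
           bytes / (1024.0 * 1024.0));
    /* prints about 5.45 MB, as stated above  */
    return 0;
}
\end{verbatim}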
\vspace{-1mm}
\subsection{Evaluation cost of the MPHF}
Now we consider the amount of CPU time
required by the resulting MPHF at retrieval time.
The MPHF requires for each key the computation of three
universal hash functions and three memory accesses
(see Eqs.~(\ref{eq:mphf}), (\ref{eq:bucketindex}) and (\ref{eq:mphfi})).
This is not optimal. Pagh~\cite{p99} showed that any MPHF requires
at least the computation of two universal hash functions and one memory
access.
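To illustrate this cost profile, the sketch below shows a hypothetical
evaluation routine in C: one hash computation to select the bucket, two hash
computations inside the bucket, and the three memory accesses counted above
(two entries of a $g_i$ vector and one entry of the {\it offset} vector).
The structure, the field names and the way the two $g_i$ entries are combined
are assumptions made only for this illustration; the precise definitions are
those of Eqs.~(\ref{eq:mphf}), (\ref{eq:bucketindex}) and~(\ref{eq:mphfi}).
\begin{verbatim}
#include <stdint.h>

/* Illustrative layout only; field names are
   assumptions made for this sketch.          */
typedef struct {
    uint32_t   nbuckets;  /* ceil(n/b)         */
    uint8_t  **g;         /* g[i]: 8-bit
                             entries of MPHF_i */
    uint32_t  *glen;      /* entries in g[i],
                             about c*size[i]   */
    uint32_t  *offset;    /* keys before
                             bucket i          */
    uint32_t  *seed1;     /* seed of h1_i      */
    uint32_t  *seed2;     /* seed of h2_i      */
} MPHF;

/* Some universal hash family mapping a key
   into [0, range-1].                          */
extern uint32_t uhash(const char *key, int len,
                      uint32_t seed,
                      uint32_t range);

/* Three hash computations per key; the three
   memory accesses counted in the text are the
   two g[i] entries and the offset[i] entry.
   Combining the two g[i] entries by a plain
   sum is an assumption of this sketch.        */
uint32_t mphf_eval(const MPHF *f,
                   const char *key, int len)
{
    uint32_t i  = uhash(key, len, 0u,
                        f->nbuckets);
    uint32_t v1 = uhash(key, len, f->seed1[i],
                        f->glen[i]);
    uint32_t v2 = uhash(key, len, f->seed2[i],
                        f->glen[i]);
    return f->offset[i] + f->g[i][v1]
                        + f->g[i][v2];
}
\end{verbatim}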
\subsection{Description size of the MPHF}
The number of bits required to store the MPHF generated by the algorithm
is computed by Eq.~(\ref{eq:newmphfbits}).
We need to store each $g_i$ vector presented in Eq.~(\ref{eq:mphfi}), where
$0\leq i < \lceil n/b \rceil$. As each bucket has at most 256 keys, each
entry in a $g_i$ vector has 8~bits. In each $g_i$ vector there are
$c \times \mathit{size}[i]$ entries (recall $c\in[0.93, 1.15]$).
When we sum up the number of entries of $\lceil n/b \rceil$ $g_i$ vectors we have
$c\sum_{i=0}^{\lceil n/b \rceil -1} \mathit{size}[i]=cn$ entries. We also need to
store $3 \lceil n/b \rceil$ integer numbers of
$\log_2n$ bits referring respectively to the {\it offset} vector and the two random seeds of
$h_{1i}$ and $h_{2i}$. In addition, we need to store $\lceil n/b \rceil$ 8-bit entries of
the vector {\it size}. Therefore,
\begin{eqnarray}\label{eq:newmphfbits}
\mathrm{Required\: Space} = 8cn + \frac{n}{b}\left( 3\log_2n +8\right) \:
\mathrm{bits}.
\end{eqnarray}
Considering $c=0.93$ and $b=175$, the number of bits per key to store
the description of the resulting MPHF for a set of 1~billion keys is $8.1$.
If we set $b=128$, then the bits per key ratio increases to $8.3$.
Theoretically, the number of bits required to store the MPHF in
Eq.~(\ref{eq:newmphfbits})
is $O(n\log n)$ as~$n\to\infty$. However, for sets of up to $2^{b/3}$ keys
the number of bits per key is less than 9 (note that
$2^{b/3}>2^{58}>10^{17}$ for $b=175$).
%For $b=175$, the number of bits per key will be close to 9 for a set of $2^{58}$ keys.
Thus, in practice the resulting function is stored in $O(n)$ bits.

6
vldb07/appendix.tex Normal file

@ -0,0 +1,6 @@
\appendix
\input{experimentalresults}
\input{thedataandsetup}
\input{costhashingbuckets}
\input{performancenewalgorithm}
\input{diskaccess}

42
vldb07/conclusions.tex Executable file

@ -0,0 +1,42 @@
% Time-stamp: <Monday 30 Jan 2006 12:38:06am EST yoshi@flare>
\enlargethispage{2\baselineskip}
\section{Concluding remarks}
\label{sec:concuding-remarks}
This paper has presented a novel external memory based algorithm for
constructing MPHFs that works for sets in the order of billions of keys. The
algorithm outputs the resulting function in~$O(n)$ time and, furthermore, it
can be tuned to run only $34\%$ slower (see \cite{bkz06t} for details) than the fastest
algorithm available in the literature for constructing MPHFs~\cite{bkz05}.
In addition, the space
requirement of the resulting MPHF is of up to 9 bits per key for datasets of
up to $2^{58}\simeq10^{17.4}$ keys.
The algorithm is simple and needs just a
small vector of size approximately 5.45 megabytes in main memory to construct
a MPHF for a collection of 1 billion URLs, each one 64 bytes long on average.
Therefore, almost all main memory is available to be used as disk I/O buffer.
Making use of such a buffering scheme considering an internal memory area of size
$\mu=200$ megabytes, our algorithm can produce a MPHF for a
set of 1 billion URLs in approximately 3.6 hours using a commodity PC with a 2.4 gigahertz processor and
500 megabytes of main memory.
If we increase both the main memory
available to 1 gigabyte and the internal memory area to $\mu=500$ megabytes,
a MPHF for the set of 1 billion URLs is produced in approximately 3 hours. For any
key, the evaluation of the resulting MPHF takes three memory accesses and the
computation of three universal hash functions.
In order to allow the reproduction of our results and the utilization of both the internal memory
based algorithm and the external memory based algorithm,
the algorithms are available at \texttt{http://cmph.sf.net}
under the GNU Lesser General Public License (LGPL).
They were implemented in the C language.
In future work, we will exploit the fact that the searching step intrinsically
presents a high degree of parallelism and accounts for $73\%$ of the
construction time. A parallel implementation of our algorithm will therefore
allow the construction and the evaluation of the resulting function in parallel.
Moreover, the description of the resulting MPHFs will be distributed across the parallel
computer, allowing the approach to scale to sets of hundreds of billions of keys.
This is an important contribution, mainly for applications related to the Web, as
mentioned in Section~\ref{sec:intro}.

177
vldb07/costhashingbuckets.tex Executable file

@ -0,0 +1,177 @@
% Nivio: 29/jan/06
% Time-stamp: <Monday 30 Jan 2006 12:37:22am EST yoshi@flare>
\vspace{-2mm}
\subsection{Performance of the internal memory based algorithm}
\label{sec:intern-memory-algor}
%\begin{table*}[htb]
%\begin{center}
%{\scriptsize
%\begin{tabular}{|c|c|c|c|c|c|c|c|}
%\hline
%$n$ (millions) & 1 & 2 & 4 & 8 & 16 & 32 \\
%\hline
%Average time (s)& $6.1 \pm 0.3$ & $12.2 \pm 0.6$ & $25.4 \pm 1.1$ & $51.4 \pm 2.0$ & $117.3 \pm 4.4$ & $262.2 \pm 8.7$\\
%SD (s) & $2.6$ & $5.4$ & $9.8$ & $17.6$ & $37.3$ & $76.3$ \\
%\hline
%\end{tabular}
%\vspace{-3mm}
%}
%\end{center}
%\caption{Internal memory based algorithm: average time in seconds for constructing a MPHF,
%the standard deviation (SD), and the confidence intervals considering
%a confidence level of $95\%$.}
%\label{tab:medias}
%\end{table*}
Our three-step internal memory based algorithm presented in~\cite{bkz05}
is used for constructing a MPHF for each bucket.
It is a randomized algorithm because it needs to generate a simple random graph
in its first step.
Once the graph is obtained the other two steps are deterministic.
Thus, we can consider the runtime of the algorithm to have the form~$\alpha
nZ$ for an input of~$n$ keys, where~$\alpha$ is some machine dependent
constant that further depends on the length of the keys and~$Z$ is a random
variable with geometric distribution with mean~$1/p=e^{1/c^2}$ (see
Section~\ref{sec:mphfbucket}). All results in our experiments were obtained
taking $c=1$; the value of~$c$, with~$c\in[0.93,1.15]$, in fact has little
influence in the runtime, as shown in~\cite{bkz05}.
The values chosen for $n$ were $1, 2, 4, 8, 16$ and $32$ million.
Although we have a dataset with 1~billion URLs, on a PC with
1~gigabyte of main memory, the algorithm is able
to handle an input with at most 32 million keys.
This is mainly because of the graph we need to keep in main memory.
The algorithm requires $25n + O(1)$ bytes for constructing
a MPHF (details about the data structures used by the algorithm can
be found at~\texttt{http://cmph.sf.net}).
% for the details about the data structures
%used by the algorithm).
In order to estimate the number of trials for each value of $n$ we use
a statistical method for determining a suitable sample size (see, e.g.,
\cite[Chapter 13]{j91}).
As this method yielded a different sample size for each $n$,
we used the maximum value obtained, namely 300~trials, in order to have
a confidence level of $95\%$.
% \begin{figure*}[ht]
% \noindent
% \begin{minipage}[b]{0.5\linewidth}
% \centering
% \subfigure[The previous algorithm]
% {\scalebox{0.5}{\includegraphics{figs/bmz_temporegressao.eps}}}
% \end{minipage}
% \hfill
% \begin{minipage}[b]{0.5\linewidth}
% \centering
% \subfigure[The new algorithm]
% {\scalebox{0.5}{\includegraphics{figs/brz_temporegressao.eps}}}
% \end{minipage}
% \caption{Time versus number of keys in $S$. The solid line corresponds to
% a linear regression model.}
% %obtained from the experimental measurements.}
% \label{fig:temporegressao}
% \end{figure*}
Table~\ref{tab:medias} presents the runtime average for each $n$,
the respective standard deviations, and
the respective confidence intervals given by
the average time $\pm$ the distance from average time
considering a confidence level of $95\%$.
Observing the runtime averages one sees that
the algorithm runs in expected linear time,
as shown in~\cite{bkz05}.
\vspace{-2mm}
\begin{table*}[htb]
\begin{center}
{\scriptsize
\begin{tabular}{|c|c|c|c|c|c|c|c|}
\hline
$n$ (millions) & 1 & 2 & 4 & 8 & 16 & 32 \\
\hline
Average time (s)& $6.1 \pm 0.3$ & $12.2 \pm 0.6$ & $25.4 \pm 1.1$ & $51.4 \pm 2.0$ & $117.3 \pm 4.4$ & $262.2 \pm 8.7$\\
SD (s) & $2.6$ & $5.4$ & $9.8$ & $17.6$ & $37.3$ & $76.3$ \\
\hline
\end{tabular}
\vspace{-1mm}
}
\end{center}
\caption{Internal memory based algorithm: average time in seconds for constructing a MPHF,
the standard deviation (SD), and the confidence intervals considering
a confidence level of $95\%$.}
\label{tab:medias}
\vspace{-4mm}
\end{table*}
% \enlargethispage{\baselineskip}
% \begin{table*}[htb]
% \begin{center}
% {\scriptsize
% (a)
% \begin{tabular}{|c|c|c|c|c|c|c|c|}
% \hline
% $n$ (millions) & 1 & 2 & 4 & 8 & 16 & 32 \\
% \hline
% Average time (s)& $6.119 \pm 0.300$ & $12.190 \pm 0.615$ & $25.359 \pm 1.109$ & $51.408 \pm 2.003$ & $117.343 \pm 4.364$ & $262.215 \pm 8.724$\\
% SD (s) & $2.644$ & $5.414$ & $9.757$ & $17.627$ & $37.333$ & $76.271$ \\
% \hline
% \end{tabular}
% \\[5mm] (b)
% \begin{tabular}{|l|c|c|c|c|c|}
% \hline
% $n$ (millions) & 1 & 2 & 4 & 8 & 16 \\
% \hline % Part. 16 \% 16 \% 16 \% 18 \% 20\%
% Average time (s) & $6.927 \pm 0.309$ & $13.828 \pm 0.175$ & $31.936 \pm 0.663$ & $69.902 \pm 1.084$ & $140.617 \pm 2.502$ \\
% SD & $0.431$ & $0.245$ & $0.926$ & $1.515$ & $3.498$ \\
% \hline
% \hline
% $n$ (millions) & 32 & 64 & 128 & 512 & 1000 \\
% \hline % Part. 20 \% 20\% 20\% 18\% 18\%
% Average time (s) & $284.330 \pm 1.135$ & $587.880 \pm 3.945$ & $1223.581 \pm 4.864$ & $5966.402 \pm 9.465$ & $13229.540 \pm 12.670$ \\
% SD & $1.587$ & $5.514$ & $6.800$ & $13.232$ & $18.577$ \\
% \hline
% \end{tabular}
% }
% \end{center}
% \caption{The runtime averages in seconds,
% the standard deviation (SD), and
% the confidence intervals given by the average time $\pm$
% the distance from average time considering
% a confidence level of $95\%$.}
% \label{tab:medias}
% \end{table*}
\enlargethispage{2\baselineskip}
Figure~\ref{fig:bmz_temporegressao}
presents the runtime for each trial. In addition,
the solid line corresponds to a linear regression model
obtained from the experimental measurements.
As we can see, the runtime for a given $n$ fluctuates considerably.
However, the fluctuation also grows linearly with $n$.
\begin{figure}[htb]
\vspace{-2mm}
\begin{center}
\scalebox{0.4}{\includegraphics{figs/bmz_temporegressao.eps}}
\caption{Time versus number of keys in $S$ for the internal memory based algorithm.
The solid line corresponds to a linear regression model.}
\label{fig:bmz_temporegressao}
\end{center}
\vspace{-6mm}
\end{figure}
The observed fluctuation in the runtimes is as expected; recall that this
runtime has the form~$\alpha nZ$ with~$Z$ a geometric random variable with
mean~$1/p=e$. Thus, the runtime has mean~$\alpha n/p=\alpha en$ and standard
deviation~$\alpha n\sqrt{(1-p)/p^2}=\alpha n\sqrt{e(e-1)}$.
Therefore, the standard deviation also grows
linearly with $n$, as experimentally verified
in Table~\ref{tab:medias} and in Figure~\ref{fig:bmz_temporegressao}.
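For completeness, both expressions follow directly from the moments of the
geometric distribution: if $Z$ has success probability~$p$, then $E[Z]=1/p$
and $\mathrm{Var}(Z)=(1-p)/p^2$, so that for $p=1/e$
\begin{eqnarray*}
E[\alpha n Z] & = & \frac{\alpha n}{p} \: = \: \alpha e n, \\
\mathrm{SD}[\alpha n Z] & = & \alpha n \sqrt{\frac{1-p}{p^2}}
\: = \: \alpha n \sqrt{e^2 - e} \: = \: \alpha n \sqrt{e(e-1)}.
\end{eqnarray*}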
%\noindent-------------------------------------------------------------------------\\
%Comment for Yoshi: I could not quite reproduce what we discussed
%in the paragraph above; I think you will be able to justify it better :-). \\
%-------------------------------------------------------------------------\\

146
vldb07/determiningb.tex Executable file

@ -0,0 +1,146 @@
% Nivio: 29/jan/06
% Time-stamp: <Monday 30 Jan 2006 04:01:40am EDT yoshi@ime.usp.br>
\enlargethispage{2\baselineskip}
\subsection{Determining~$b$}
\label{sec:determining-b}
\begin{table*}[t]
\begin{center}
{\small %\scriptsize
\begin{tabular}{|c|ccc|ccc|}
\hline
\raisebox{-0.7em}{$n$} & \multicolumn{3}{c|}{\raisebox{-1mm}{b=128}} &
\multicolumn{3}{c|}{\raisebox{-1mm}{b=175}}\\
\cline{2-4} \cline{5-7}
& \raisebox{-0.5mm}{Worst Case} & \raisebox{-0.5mm}{Average} &\raisebox{-0.5mm}{Eq.~(\ref{eq:maxbs})}
& \raisebox{-0.5mm}{Worst Case} & \raisebox{-0.5mm}{Average} &\raisebox{-0.5mm}{Eq.~(\ref{eq:maxbs})} \\
\hline
$1.0 \times 10^6$ & 177 & 172.0 & 176 & 232 & 226.6 & 230 \\
%$2.0 \times 10^6$ & 179 & 174.0 & 178 & 236 & 228.5 & 232 \\
$4.0 \times 10^6$ & 182 & 177.5 & 179 & 241 & 231.8 & 234 \\
%$8.0 \times 10^6$ & 186 & 181.6 & 181 & 238 & 234.2 & 236 \\
$1.6 \times 10^7$ & 184 & 181.6 & 183 & 241 & 236.1 & 238 \\
%$3.2 \times 10^7$ & 191 & 183.9 & 184 & 240 & 236.6 & 240 \\
$6.4 \times 10^7$ & 195 & 185.2 & 186 & 244 & 239.0 & 242 \\
%$1.28 \times 10^8$ & 193 & 187.7 & 187 & 244 & 239.7 & 244 \\
$5.12 \times 10^8$ & 196 & 191.7 & 190 & 251 & 246.3 & 247 \\
$1.0 \times 10^9$ & 197 & 191.6 & 192 & 253 & 248.9 & 249 \\
\hline
\end{tabular}
\vspace{-1mm}
}
\end{center}
\caption{Values for $\mathit{BS}_{\mathit{max}}$: worst case and average case obtained in the experiments and using Eq.~(\ref{eq:maxbs}),
considering $b=128$ and $b=175$ for different number $n$ of keys in $S$.}
\label{tab:comparison}
\vspace{-6mm}
\end{table*}
The partitioning step can be viewed as the well known ``balls into bins''
problem~\cite{ra98,dfm02} where~$n$ keys (the balls) are placed independently and
uniformly into $\lceil n/b\rceil$ buckets (the bins). The main question about this problem that interests us
is: what is the maximum number of keys in any bucket?
In fact, we want the maximum value of $b$ for which the maximum number of keys in any bucket
is no greater than 256 with high probability.
This is important, as we wish to use 8 bits per entry in the vector $g_i$ of
each $\mathrm{MPHF}_i$,
where $0 \leq i < \lceil n/b\rceil$.
Let $\mathit{BS}_{\mathit{max}}$ be the maximum number of keys in any bucket.
Clearly, $\BSmax$ is the maximum
of~$\lceil n/b\rceil$ random variables~$Z_i$, each with binomial
distribution~$\Bi(n,p)$ with parameters~$n$ and~$p=1/\lceil n/b\rceil$.
However, the~$Z_i$ are not independent. Note that~$\Bi(n,p)$ has mean and
variance~$\simeq b$. To give an upper estimate for the probability of the
event~$\BSmax\geq \gamma$, we can estimate the probability that we have~$Z_i\geq \gamma$
for a fixed~$i$, and then sum these estimates over all~$i$.
Let~$\gamma=b+\sigma\sqrt{b\ln(n/b)}$, where~$\sigma=\sqrt2$.
Approximating~$\Bi(n,p)$ by the normal distribution with mean and
variance~$b$, we obtain the
estimate~$(\sigma\sqrt{2\pi\ln(n/b)})^{-1}\times\exp(-(1/2)\sigma^2\ln(n/b))$ for
the probability that~$Z_i\geq \gamma$ occurs, which, summed over all~$i$, gives
that the probability that~$\BSmax\geq \gamma$ occurs is at
most~$1/(\sigma\sqrt{2\pi\ln(n/b)})$, which tends to~$0$ as~$n\to\infty$.
Thus, we have shown that, with high probability,
\begin{equation}
\label{eq:maxbs}
\BSmax\leq b+\sqrt{2b\ln{n\over b}}.
\end{equation}
% The traditional approach used to estimate $\mathit{BS}_{\mathit{max}}$ with high probability is
% to consider $\mathit{BS}_{\mathit{max}}$ as a random variable that follows a binomial distribution
% that can be approximated by a poisson distribution. This yields a good approximation
% when the number of balls is lower than or equal to the number of bins~\cite{g81}. In our case,
% the number of balls is greater than the number of buckets.
% % and that is why we have used more recent works to estimate $\mathit{BS}_{\mathit{max}}$.
% As $b > \ln (n/b)$, we can use the result by Raab and Steger~\cite{ra98} to estimate
% $\mathit{BS}_{\mathit{max}}$ with high probability.
% The following equation gives the estimation, where $\sigma=\sqrt{2}$:
% \begin{eqnarray} \label{eq:maxbs}
% \mathit{BS}_{\mathit{max}} = b + O \left( \sqrt{b\ln\frac{n}{b}} \right) = b + \sigma \times \left(\sqrt{b\ln\frac{n}{b}} \right)
% \end{eqnarray}
% In order to estimate the suitable constant $\sigma$ we did a linear
% regression suppressing the constant term.
% We use the equation $BS_{max} - b = \sigma \times \sqrt{b\ln (n/b)}$
% in the linear regression considering $y=BS_{max} - b$ and $x=\sqrt{b\ln (n/b)}$.
% In order to obtain data to be used in the linear regression we set
% b=128 and ran the new algorithm ten times
% for n equal to 1, 2, 4, 8, 16, 32, 64, 128, 512, 1000 million keys.
% Taking a confidence level equal to 95\% we got
% $\sigma = 2.11 \pm 0.03$.
% The coefficient of determination was $99.6\%$, which means that the linear
% regression explains $99.6\%$ of the data variation and only $0.4\%$
% is due to experimental errors.
% Therefore, Eq.~(\ref{eq:maxbs}) with $\sigma = 2.11 \pm 0.03$ and $b=128$
% makes a very good estimation of the maximum number of keys in any bucket.
% Repeating the same experiments for $b$ equals to $175$ and
% a confidence level of $95\%$ we got $\sigma = 2.07 \pm 0.03$.
% Again we verified that Eq.~(\ref{eq:maxbs}) with $\sigma = 2.07 \pm 0.03$ and $b=175$ is
% a very good estimation of the maximum number of keys in any bucket once the
% coefficient of determination obtained was $99.7 \%$ and $\sigma$ is in a very narrow range.
In our algorithm the maximum number of keys in any bucket must be at most 256.
Table~\ref{tab:comparison} presents the values for $\mathit{BS}_{\mathit{max}}$
obtained experimentally and using Eq.~(\ref{eq:maxbs}).
The table presents the worst case and the average case,
considering $b=128$, $b=175$ and Eq.~(\ref{eq:maxbs}),
for several numbers~$n$ of keys in $S$.
The estimation given by Eq.~(\ref{eq:maxbs}) is very close to the experimental
results.
Now we estimate the largest problem our algorithm is able to solve for
a given $b$.
Using Eq.~(\ref{eq:maxbs}) with $b=128$ and $b=175$, and imposing
that~$\mathit{BS}_{\mathit{max}}\leq256$,
the largest key sets our algorithm
can deal with have $10^{30}$ keys and $10^{10}$ keys, respectively.
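These estimates are easy to reproduce numerically. The short C program below
evaluates Eq.~(\ref{eq:maxbs}) and the corresponding bound on~$n$; it is a
sketch written only for this discussion (the names and the output format are
ours, not part of the implementation at \texttt{http://cmph.sf.net}).
\begin{verbatim}
#include <stdio.h>
#include <math.h>

/* Eq. (maxbs): BSmax <= b + sqrt(2 b ln(n/b)),
   with high probability.                       */
static double bsmax(double n, double b)
{
    return b + sqrt(2.0 * b * log(n / b));
}

int main(void)
{
    /* For n = 10^9: about 192 keys for b = 128
       and about 249 keys for b = 175, in line
       with Table (tab:comparison).             */
    printf("b=128: %.0f\n", bsmax(1e9, 128.0));
    printf("b=175: %.0f\n", bsmax(1e9, 175.0));

    /* Largest n with the estimate below 256:
       solving b + sqrt(2 b ln(n/b)) = 256 gives
       n = b * exp((256-b)^2 / (2b)), roughly
       10^30 for b = 128 and 10^10 for b = 175. */
    printf("b=128: n <= %.1e\n",
           128.0 * exp(pow(256.0 - 128.0, 2)
                       / (2.0 * 128.0)));
    printf("b=175: n <= %.1e\n",
           175.0 * exp(pow(256.0 - 175.0, 2)
                       / (2.0 * 175.0)));
    return 0;
}
\end{verbatim}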
%It is also important to have $b$ as big as possible, once its value is
%related to the space required to store the resultant MPHF, as shown later on.
%Table~\ref{tab:bp} shows the biggest problem the algorithm can solve.
% The values were obtained from Eq.~(\ref{eq:maxbs}),
% considering $b=128$ and~$b=175$ and imposing
% that~$\mathit{BS}_{\mathit{max}}\leq256$.
% We set $\sigma=2.14$ because it was the greatest value obtained for $\sigma$
% in the two linear regression we did.
% \vspace{-3mm}
% \begin{table}[htb]
% \begin{center}
% {\small %\scriptsize
% \begin{tabular}{|c|c|}
% \hline
% b & Problem size ($n$) \\
% \hline
% 128 & $10^{30}$ keys \\
% 175 & $10^{10}$ keys \\
% \hline
% \end{tabular}
% \vspace{-1mm}
% }
% \end{center}
% \caption{Using Eq.~(\ref{eq:maxbs}) to estimate the biggest problem our algorithm can solve.}
% %considering $\sigma=\sqrt{2}$.}
% \label{tab:bp}
% \vspace{-14mm}
% \end{table}

113
vldb07/diskaccess.tex Executable file

@ -0,0 +1,113 @@
% Nivio: 29/jan/06
% Time-stamp: <Sunday 29 Jan 2006 11:58:28pm EST yoshi@flare>
\vspace{-2mm}
\subsection{Controlling disk accesses}
\label{sec:contr-disk-access}
In order to bring down the number of seek operations on disk
we benefit from the fact that our algorithm leaves almost all main
memory available to be used as disk I/O buffer.
In this section we evaluate how much the parameter $\mu$
affects the runtime of our algorithm.
For that we fixed $n$ at 1 billion URLs,
set the main memory of the machine used for the experiments
to 1 gigabyte, and used $\mu$ equal to $100, 200, 300, 400, 500$ and $600$
megabytes.
\enlargethispage{2\baselineskip}
Table~\ref{tab:diskaccess} presents the number of files $N$,
the buffer size used for all files, the number of seeks in the worst case considering
the pessimistic assumption mentioned in Section~\ref{sec:linearcomplexity}, and
the time to generate a MPHF for 1 billion keys as a function of the amount of internal
memory available. Observing Table~\ref{tab:diskaccess}, we notice that the construction time
decreases as the value of $\mu$ increases. However, for $\mu > 400$, the variation
in time is not as significant as for $\mu \leq 400$.
This can be explained by the fact that the I/O scheduler of the Linux 2.6 kernel
has smart policies
for avoiding seeks and diminishing the average seek time
(see \texttt{http://www.linuxjournal.com/article/6931}).
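To relate the columns of Table~\ref{tab:diskaccess}, recall that the buffer
size is \textbaht$\:=\mu/N$ and that the worst-case number of seeks is
$\beta/$\textbaht$\:\approx 64n/$\textbaht. The fragment below redoes this
arithmetic for the $\mu=200$ megabyte column; it is only an illustration
written for this discussion, and it reproduces the buffer size while giving a
worst-case seek count of the same order as the one measured for the table.
\begin{verbatim}
#include <stdio.h>

int main(void)
{
    /* Values taken from the mu = 200 MB column:
       N = 310 files, n = 10^9 URLs of 64 bytes
       on average.                               */
    double mu   = 200.0 * 1024 * 1024;
    double N    = 310.0;
    double beta = 64.0 * 1e9;  /* approx. |S|    */

    double buf   = mu / N;     /* bytes per file */
    double seeks = beta / buf; /* worst case     */

    printf("buffer ~ %.0f KB, seeks ~ %.0f\n",
           buf / 1024.0, seeks);
    /* Prints about 661 KB (as in the table) and
       about 9.5e4 seeks, the same order as the
       95,974 worst-case seeks reported there.   */
    return 0;
}
\end{verbatim}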
\begin{table*}[ht]
\vspace{-2mm}
\begin{center}
{\scriptsize
\begin{tabular}{|l|c|c|c|c|c|c|}
\hline
$\mu$ (MB) & $100$ & $200$ & $300$ & $400$ & $500$ & $600$ \\
\hline
$N$ (files) & $619$ & $310$ & $207$ & $155$ & $124$ & $104$ \\
%\hline
\textbaht~(buffer size in KB) & $165$ & $661$ & $1,484$ & $2,643$ & $4,129$ & $5,908$ \\
%\hline
$\beta$/\textbaht~(\# of seeks in the worst case) & $384,478$ & $95,974$ & $42,749$ & $24,003$ & $15,365$ & $10,738$ \\
% \hline
% \raisebox{-0.2em}{\# of seeks performed in} & \raisebox{-0.7em}{$383,056$} & \raisebox{-0.7em}{$95,919$} & \raisebox{-0.7em}{$42,700$} & \raisebox{-0.7em}{$23,980$} & \raisebox{-0.7em}{$15,347$} & \raisebox{-0.7em}{$xx,xxx$} \\
% \raisebox{0.2em}{statement 1.3 of Figure~\ref{fig:readingbucket}} & & & & & & \\
% \hline
Time (hours) & $4.04$ & $3.64$ & $3.34$ & $3.20$ & $3.13$ & $3.09$ \\
\hline
\end{tabular}
\vspace{-1mm}
}
\end{center}
\caption{Influence of the internal memory area size ($\mu$) in our algorithm runtime.}
\label{tab:diskaccess}
\vspace{-14mm}
\end{table*}
% \begin{table*}[ht]
% \begin{center}
% {\scriptsize
% \begin{tabular}{|l|c|c|c|c|c|c|c|c|c|c|c|}
% \hline
% $\mu$ (MB) & $100$ & $150$ & $200$ & $250$ & $300$ & $350$ & $400$ & $450$ & $500$ & $550$ & $600$ \\
% \hline
% $N$ (files) & $619$ & $413$ & $310$ & $248$ & $207$ & $177$ & $155$ & $138$ & $124$ & $113$ & $103$ \\
% \hline
% \textbaht~(buffer size in KB) & $165$ & $372$ & $661$ & $1,033$ & $1,484$ & $2,025$ & $2,643$ & $3,339$ & & & \\
% \hline
% \# of seeks (Worst case) & $384,478$ & $170,535$ & $95,974$ & $61,413$ & $42,749$ & $31,328$ & $24,003$ & $19,000$ & & & \\
% \hline
% \raisebox{-0.2em}{\# of seeks performed in} & \raisebox{-0.7em}{$383,056$} & \raisebox{-0.7em}{$170,385$} & \raisebox{-0.7em}{$95,919$} & \raisebox{-0.7em}{$61,388$} & \raisebox{-0.7em}{$42,700$} & \raisebox{-0.7em}{$31,296$} & \raisebox{-0.7em}{$23,980$} & \raisebox{-0.7em}{$18,978$} & \raisebox{-0.7em}{$xx,xxx$} & \raisebox{-0.7em}{$xx,xxx$} & \raisebox{-0.7em}{$xx,xxx$} \\
% \raisebox{0.2em}{statement 1.3 of Figure~\ref{fig:readingbucket}} & & & & & & & & & & & \\
% \hline
% Time (horas) & $4.04$ & $3.93$ & $3.64$ & $3.46$ & $3.34$ & $3.26$ & $3.20$ & $3.13$ & & & \\
% \hline
% \end{tabular}
% }
% \end{center}
% \caption{Influence of the internal memory area size ($\mu$) in our algorithm runtime.}
% \label{tab:diskaccess}
% \end{table*}
% \begin{table*}[htb]
% \begin{center}
% {\scriptsize
% \begin{tabular}{|l|c|c|c|c|c|}
% \hline
% $n$ (millions) & 1 & 2 & 4 & 8 & 16 \\
% \hline % Part. 16 \% 16 \% 16 \% 18 \% 20\%
% Average time (s) & $14.124 \pm 0.128$ & $28.301 \pm 0.140$ & $56.807 \pm 0.312$ & $117.286 \pm 0.997$ & $241.086 \pm 0.936$ \\
% SD & $0.179$ & $0.196$ & $0.437$ & $1.394$ & $1.308$ \\
% \hline
% \hline
% $n$ (millions) & 32 & 64 & 128 & 512 & 1000 \\
% \hline % Part. 20 \% 20\% 20\% 18\% 18\%
% Average time (s) & $492.430 \pm 1.565$ & $1006.307 \pm 1.425$ & $2081.208 \pm 0.740$ & $9253.188 \pm 4.406$ & $19021.480 \pm 13.850$ \\
% SD & $2.188$ & $1.992$ & $1.035$ & $ 6.160$ & $18.016$ \\
% \hline
% \end{tabular}
% }
% \end{center}
% \caption{The runtime averages in seconds,
% the standard deviation (SD), and
% the confidence intervals given by the average time $\pm$
% the distance from average time considering
% a confidence level of $95\%$.
% }
% \label{tab:mediasbrz}
% \end{table*}

15
vldb07/experimentalresults.tex Executable file

@ -0,0 +1,15 @@
%Nivio: 29/jan/06
% Time-stamp: <Sunday 29 Jan 2006 11:57:21pm EST yoshi@flare>
\vspace{-2mm}
\enlargethispage{2\baselineskip}
\section{Appendix: Experimental results}
\label{sec:experimental-results}
\vspace{-1mm}
In this section we present the experimental results.
We start by presenting the experimental setup.
We then present experimental results for
the internal memory based algorithm~\cite{bkz05}
and for our algorithm.
Finally, we discuss how the amount of internal memory available
affects the runtime of our algorithm.

Binary file not shown.



@ -0,0 +1,107 @@
#FIG 3.2
Landscape
Center
Metric
A4
100.00
Single
-2
1200 2
0 32 #bdbebd
0 33 #bdbebd
0 34 #bdbebd
0 35 #4a4d4a
0 36 #bdbebd
0 37 #4a4d4a
0 38 #bdbebd
0 39 #bdbebd
6 225 6615 2520 7560
2 1 0 1 -1 7 50 -1 -1 0.000 0 0 -1 0 0 2
900 7133 1608 7133
2 2 0 1 0 35 50 -1 20 0.000 0 0 7 0 0 5
260 6795 474 6795 474 6965 260 6965 260 6795
2 2 0 1 0 35 50 -1 -1 0.000 0 0 7 0 0 5
474 6795 686 6795 686 6965 474 6965 474 6795
2 2 0 1 0 35 50 -1 -1 0.000 0 0 7 0 0 5
474 6626 686 6626 686 6795 474 6795 474 6626
2 2 0 1 0 32 50 -1 43 0.000 0 0 7 0 0 5
1538 6795 1750 6795 1750 6965 1538 6965 1538 6795
2 2 0 1 0 32 50 -1 43 0.000 0 0 7 0 0 5
1538 6965 1750 6965 1750 7133 1538 7133 1538 6965
2 2 0 1 -1 7 50 -1 -1 0.000 0 0 7 0 0 5
474 6965 686 6965 686 7133 474 7133 474 6965
2 2 0 1 0 7 50 -1 41 0.000 0 0 -1 0 0 5
686 6965 900 6965 900 7133 686 7133 686 6965
2 2 0 1 0 32 50 -1 43 0.000 0 0 7 0 0 5
1538 6626 1750 6626 1750 6795 1538 6795 1538 6626
2 2 0 1 0 35 50 -1 20 0.000 0 0 7 0 0 5
260 6965 474 6965 474 7133 260 7133 260 6965
2 2 0 1 0 7 50 -1 41 0.000 0 0 -1 0 0 5
686 6795 900 6795 900 6965 686 6965 686 6795
4 0 0 50 -1 0 14 0.0000 4 30 180 1148 7049 ...\001
4 0 -1 50 -1 0 7 0.0000 2 60 60 332 7260 0\001
4 0 -1 50 -1 0 7 0.0000 2 75 60 544 7260 1\001
4 0 -1 50 -1 0 7 0.0000 2 60 60 758 7260 2\001
4 0 -1 50 -1 0 7 0.0000 2 90 960 1538 7260 ${\\lceil n/b\\rceil - 1}$\001
4 0 -1 50 -1 0 7 0.0000 2 105 975 540 7515 Buckets Logical View\001
-6
6 2700 6390 4365 7830
6 3461 6445 3675 7425
6 3463 6786 3675 7245
6 3546 6893 3591 7094
4 0 -1 50 -1 0 12 0.0000 2 15 45 3546 6959 .\001
4 0 -1 50 -1 0 12 0.0000 2 15 45 3546 7027 .\001
4 0 -1 50 -1 0 12 0.0000 2 15 45 3546 7094 .\001
-6
2 2 0 1 0 35 50 -1 -1 0.000 0 0 7 0 0 5
3463 6786 3675 6786 3675 7245 3463 7245 3463 6786
-6
2 2 0 1 0 35 50 -1 -1 0.000 0 0 7 0 0 5
3461 6445 3675 6445 3675 6615 3461 6615 3461 6445
2 2 0 1 -1 7 50 -1 41 0.000 0 0 7 0 0 5
3463 6616 3675 6616 3675 6785 3463 6785 3463 6616
2 2 0 1 0 32 50 -1 43 0.000 0 0 7 0 0 5
3463 7246 3675 7246 3675 7425 3463 7425 3463 7246
-6
6 3023 6786 3235 7245
6 3106 6893 3151 7094
4 0 -1 50 -1 0 12 0.0000 2 15 45 3106 6959 .\001
4 0 -1 50 -1 0 12 0.0000 2 15 45 3106 7027 .\001
4 0 -1 50 -1 0 12 0.0000 2 15 45 3106 7094 .\001
-6
2 2 0 1 0 35 50 -1 -1 0.000 0 0 7 0 0 5
3023 6786 3235 6786 3235 7245 3023 7245 3023 6786
-6
6 4091 6425 4305 7425
6 4093 6946 4305 7255
6 4176 7018 4221 7153
4 0 -1 50 -1 0 12 0.0000 2 15 45 4176 7063 .\001
4 0 -1 50 -1 0 12 0.0000 2 15 45 4176 7108 .\001
4 0 -1 50 -1 0 12 0.0000 2 15 45 4176 7153 .\001
-6
2 2 0 1 0 35 50 -1 -1 0.000 0 0 7 0 0 5
4093 6946 4305 6946 4305 7255 4093 7255 4093 6946
-6
2 2 0 1 0 35 50 -1 -1 0.000 0 0 7 0 0 5
4091 6605 4305 6605 4305 6775 4091 6775 4091 6605
2 2 0 1 0 32 50 -1 43 0.000 0 0 7 0 0 5
4093 7256 4305 7256 4305 7425 4093 7425 4093 7256
2 2 0 1 -1 7 50 -1 41 0.000 0 0 7 0 0 5
4093 6776 4305 6776 4305 6945 4093 6945 4093 6776
2 2 0 1 0 35 50 -1 20 0.000 0 0 7 0 0 5
4091 6425 4305 6425 4305 6595 4091 6595 4091 6425
-6
2 2 0 1 0 35 50 -1 20 0.000 0 0 7 0 0 5
3021 6445 3235 6445 3235 6615 3021 6615 3021 6445
2 2 0 1 -1 7 50 -1 -1 0.000 0 0 7 0 0 5
3023 6616 3235 6616 3235 6785 3023 6785 3023 6616
2 2 0 1 0 32 50 -1 43 0.000 0 0 7 0 0 5
3023 7246 3235 7246 3235 7425 3023 7425 3023 7246
4 0 0 50 -1 0 14 0.0000 4 30 180 3780 6975 ...\001
4 0 -1 50 -1 0 7 0.0000 2 75 255 3015 7560 File 1\001
4 0 -1 50 -1 0 7 0.0000 2 75 255 3465 7560 File 2\001
4 0 -1 50 -1 0 7 0.0000 2 75 270 4095 7560 File N\001
4 0 -1 50 -1 0 7 0.0000 2 105 1020 3195 7785 Buckets Physical View\001
4 0 0 50 -1 0 10 0.0000 4 150 120 2700 7020 b)\001
-6
4 0 0 50 -1 0 10 0.0000 4 150 105 0 7020 a)\001


@ -0,0 +1,126 @@
#FIG 3.2
Landscape
Center
Metric
A4
100.00
Single
-2
1200 2
0 32 #bebebe
0 33 #4e4e4e
6 2160 3825 2430 4365
2 2 0 1 0 33 50 -1 -1 0.000 0 0 7 0 0 5
2160 4005 2430 4005 2430 4095 2160 4095 2160 4005
2 2 0 1 0 33 50 -1 -1 0.000 0 0 7 0 0 5
2160 3825 2430 3825 2430 3915 2160 3915 2160 3825
2 2 0 1 0 33 50 -1 -1 0.000 0 0 7 0 0 5
2160 3915 2430 3915 2430 4005 2160 4005 2160 3915
2 2 0 1 0 33 50 -1 -1 0.000 0 0 7 0 0 5
2160 4275 2430 4275 2430 4365 2160 4365 2160 4275
2 2 0 1 0 33 50 -1 -1 0.000 0 0 7 0 0 5
2160 4185 2430 4185 2430 4275 2160 4275 2160 4185
2 2 0 1 0 33 50 -1 -1 0.000 0 0 7 0 0 5
2160 4095 2430 4095 2430 4185 2160 4185 2160 4095
-6
6 2430 3735 2700 4365
2 2 0 1 0 7 50 -1 41 0.000 0 0 -1 0 0 5
2430 3825 2700 3825 2700 3915 2430 3915 2430 3825
2 2 0 1 0 7 50 -1 41 0.000 0 0 -1 0 0 5
2430 4275 2700 4275 2700 4365 2430 4365 2430 4275
2 2 0 1 0 7 50 -1 41 0.000 0 0 -1 0 0 5
2430 4185 2700 4185 2700 4275 2430 4275 2430 4185
2 2 0 1 0 7 50 -1 41 0.000 0 0 -1 0 0 5
2430 4095 2700 4095 2700 4185 2430 4185 2430 4095
2 2 0 1 0 7 50 -1 41 0.000 0 0 -1 0 0 5
2430 4005 2700 4005 2700 4095 2430 4095 2430 4005
2 2 0 1 0 7 50 -1 41 0.000 0 0 -1 0 0 5
2430 3915 2700 3915 2700 4005 2430 4005 2430 3915
2 2 0 1 0 7 50 -1 41 0.000 0 0 -1 0 0 5
2430 3735 2700 3735 2700 3825 2430 3825 2430 3735
-6
6 2700 4005 2970 4365
2 2 0 1 0 32 50 -1 43 0.000 0 0 -1 0 0 5
2700 4275 2970 4275 2970 4365 2700 4365 2700 4275
2 2 0 1 0 32 50 -1 43 0.000 0 0 -1 0 0 5
2700 4185 2970 4185 2970 4275 2700 4275 2700 4185
2 2 0 1 0 32 50 -1 43 0.000 0 0 -1 0 0 5
2700 4095 2970 4095 2970 4185 2700 4185 2700 4095
2 2 0 1 -1 32 50 -1 43 0.000 0 0 -1 0 0 5
2700 4005 2970 4005 2970 4095 2700 4095 2700 4005
-6
6 2025 5625 3690 5760
4 0 0 50 -1 0 10 0.0000 4 105 360 2025 5760 File 1\001
4 0 0 50 -1 0 10 0.0000 4 105 360 2565 5760 File 2\001
4 0 0 50 -1 0 10 0.0000 4 105 405 3285 5760 File N\001
-6
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
3510 4410 3510 4590
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
3780 4410 3780 4590
2 2 0 1 0 33 50 -1 20 0.000 0 0 7 0 0 5
1890 4185 2160 4185 2160 4275 1890 4275 1890 4185
2 2 0 1 0 33 50 -1 20 0.000 0 0 7 0 0 5
1890 4275 2160 4275 2160 4365 1890 4365 1890 4275
2 2 0 1 0 33 50 -1 20 0.000 0 0 7 0 0 5
1890 4095 2160 4095 2160 4185 1890 4185 1890 4095
2 2 0 1 0 33 50 -1 20 0.000 0 0 7 0 0 5
2070 4860 2340 4860 2340 5040 2070 5040 2070 4860
2 2 0 1 0 7 50 -1 41 0.000 0 0 7 0 0 5
3330 5220 3600 5220 3600 5400 3330 5400 3330 5220
2 2 0 1 0 33 50 -1 20 0.000 0 0 7 0 0 5
3330 4860 3600 4860 3600 4950 3330 4950 3330 4860
2 2 0 1 0 33 50 -1 -1 0.000 0 0 7 0 0 5
2070 5040 2340 5040 2340 5130 2070 5130 2070 5040
2 2 0 1 0 33 50 -1 -1 0.000 0 0 7 0 0 5
3330 4950 3600 4950 3600 5220 3330 5220 3330 4950
2 2 0 1 0 7 50 -1 41 0.000 0 0 7 0 0 5
2070 5130 2340 5130 2340 5310 2070 5310 2070 5130
2 2 0 1 0 7 50 -1 10 0.000 0 0 7 0 0 5
2610 5400 2880 5400 2880 5580 2610 5580 2610 5400
2 2 0 1 0 7 50 -1 41 0.000 0 0 7 0 0 5
2610 4860 2880 4860 2880 5040 2610 5040 2610 4860
2 2 0 1 0 32 50 -1 43 0.000 0 0 7 0 0 5
2610 5040 2880 5040 2880 5130 2610 5130 2610 5040
2 2 0 1 0 7 50 -1 50 0.000 0 0 -1 0 0 5
2970 4275 3240 4275 3240 4365 2970 4365 2970 4275
2 2 0 1 0 7 50 -1 50 0.000 0 0 -1 0 0 5
2970 4185 3240 4185 3240 4275 2970 4275 2970 4185
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
3510 4410 3600 4410
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
3690 4410 3780 4410
2 2 0 1 0 7 50 -1 10 0.000 0 0 -1 0 0 5
3510 4275 3780 4275 3780 4365 3510 4365 3510 4275
2 2 0 1 0 7 50 -1 10 0.000 0 0 -1 0 0 5
3510 4185 3780 4185 3780 4275 3510 4275 3510 4185
2 2 0 1 0 32 50 -1 20 0.000 0 0 7 0 0 5
2610 5130 2880 5130 2880 5400 2610 5400 2610 5130
2 2 0 1 0 32 50 -1 43 0.000 0 0 7 0 0 5
2070 5310 2340 5310 2340 5490 2070 5490 2070 5310
2 2 0 1 0 7 50 -1 10 0.000 0 0 7 0 0 5
2070 5490 2340 5490 2340 5580 2070 5580 2070 5490
2 2 0 1 0 7 50 -1 50 0.000 0 0 7 0 0 5
3330 5400 3600 5400 3600 5490 3330 5490 3330 5400
2 2 0 1 0 32 50 -1 20 0.000 0 0 -1 0 0 5
3240 4275 3510 4275 3510 4365 3240 4365 3240 4275
2 2 0 1 0 32 50 -1 20 0.000 0 0 -1 0 0 5
3240 4185 3510 4185 3510 4275 3240 4275 3240 4185
2 2 0 1 -1 32 50 -1 20 0.000 0 0 -1 0 0 5
3240 4095 3510 4095 3510 4185 3240 4185 3240 4095
2 2 0 1 0 32 50 -1 20 0.000 0 0 -1 0 0 5
3240 4005 3510 4005 3510 4095 3240 4095 3240 4005
2 2 0 1 0 32 50 -1 20 0.000 0 0 -1 0 0 5
3240 3915 3510 3915 3510 4005 3240 4005 3240 3915
2 2 0 1 0 32 50 -1 20 0.000 0 0 -1 0 0 5
3330 5490 3600 5490 3600 5580 3330 5580 3330 5490
4 0 0 50 -1 0 10 0.0000 4 105 75 1980 4545 0\001
4 0 0 50 -1 0 10 0.0000 4 105 420 3555 4545 n/b - 1\001
4 0 0 50 -1 0 18 0.0000 4 30 180 3015 5265 ...\001
4 0 0 50 -1 0 10 0.0000 4 105 75 2250 4545 1\001
4 0 0 50 -1 0 10 0.0000 4 105 75 2520 4545 2\001
4 0 0 50 -1 0 18 0.0000 4 30 180 2880 4500 ...\001
4 0 0 50 -1 0 10 0.0000 4 135 1410 4050 5310 Buckets Physical View\001
4 0 0 50 -1 0 10 0.0000 4 135 1350 4050 4140 Buckets Logical View\001
4 0 0 50 -1 0 10 0.0000 4 135 120 1665 3780 a)\001
4 0 0 50 -1 0 10 0.0000 4 135 135 1620 4950 b)\001

183
vldb07/figs/brz.fig Executable file

@ -0,0 +1,183 @@
#FIG 3.2 Produced by xfig version 3.2.5-alpha5
Landscape
Center
Metric
A4
100.00
Single
-2
1200 2
0 32 #bdbebd
0 33 #bdbebd
0 34 #bdbebd
0 35 #4a4d4a
0 36 #bdbebd
0 37 #4a4d4a
0 38 #bdbebd
0 39 #bdbebd
0 40 #bdbebd
6 3427 4042 3852 4211
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
3427 4041 3852 4041 3852 4211 3427 4211 3427 4041
4 0 0 50 -1 0 14 0.0000 4 30 180 3551 4140 ...\001
-6
6 3410 5689 3835 5859
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
3410 5689 3835 5689 3835 5858 3410 5858 3410 5689
4 0 0 50 -1 0 14 0.0000 4 30 180 3534 5788 ...\001
-6
6 3825 5445 4455 5535
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
4140 5445 4095 5490
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
4140 5445 4185 5490
3 0 0 1 0 7 50 -1 -1 0.000 0 0 0 8
3825 5535 3825 5490 3870 5490 3915 5490 3959 5490 4006 5490
4095 5490 4095 5490
0.000 1.000 1.000 1.000 1.000 1.000 1.000 0.000
3 0 0 1 0 7 50 -1 -1 0.000 0 0 0 7
4455 5535 4455 5490 4410 5490 4365 5490 4321 5490 4274 5490
4185 5490
0.000 1.000 1.000 1.000 1.000 1.000 0.000
-6
6 1873 5442 2323 5532
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
2098 5442 2066 5487
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
2098 5442 2130 5487
3 0 0 1 0 7 50 -1 -1 0.000 0 0 0 8
1873 5532 1873 5487 1905 5487 1937 5487 1969 5487 2002 5487
2066 5487 2066 5487
0.000 1.000 1.000 1.000 1.000 1.000 1.000 0.000
3 0 0 1 0 7 50 -1 -1 0.000 0 0 0 7
2323 5532 2323 5487 2291 5487 2259 5487 2227 5487 2194 5487
2130 5487
0.000 1.000 1.000 1.000 1.000 1.000 0.000
-6
6 2338 5442 2968 5532
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
2653 5442 2608 5487
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
2653 5442 2698 5487
3 0 0 1 0 7 50 -1 -1 0.000 0 0 0 8
2338 5532 2338 5487 2383 5487 2428 5487 2473 5487 2518 5487
2608 5487 2608 5487
0.000 1.000 1.000 1.000 1.000 1.000 1.000 0.000
3 0 0 1 0 7 50 -1 -1 0.000 0 0 0 7
2968 5532 2968 5487 2923 5487 2878 5487 2833 5487 2788 5487
2698 5487
0.000 1.000 1.000 1.000 1.000 1.000 0.000
-6
6 2475 4500 4770 5175
2 1 0 1 -1 7 50 -1 -1 0.000 0 0 -1 0 0 2
3137 5013 3845 5013
2 2 0 1 0 37 50 -1 20 0.000 0 0 7 0 0 5
2497 4675 2711 4675 2711 4845 2497 4845 2497 4675
2 2 0 1 0 37 50 -1 -1 0.000 0 0 7 0 0 5
2711 4675 2923 4675 2923 4845 2711 4845 2711 4675
2 2 0 1 0 37 50 -1 -1 0.000 0 0 7 0 0 5
2711 4506 2923 4506 2923 4675 2711 4675 2711 4506
2 2 0 1 0 36 50 -1 43 0.000 0 0 7 0 0 5
3775 4675 3987 4675 3987 4845 3775 4845 3775 4675
2 2 0 1 0 36 50 -1 43 0.000 0 0 7 0 0 5
3775 4845 3987 4845 3987 5013 3775 5013 3775 4845
2 2 0 1 -1 7 50 -1 -1 0.000 0 0 7 0 0 5
2711 4845 2923 4845 2923 5013 2711 5013 2711 4845
2 2 0 1 0 7 50 -1 41 0.000 0 0 -1 0 0 5
2923 4845 3137 4845 3137 5013 2923 5013 2923 4845
2 2 0 1 0 36 50 -1 43 0.000 0 0 7 0 0 5
3775 4506 3987 4506 3987 4675 3775 4675 3775 4506
2 2 0 1 0 37 50 -1 20 0.000 0 0 7 0 0 5
2497 4845 2711 4845 2711 5013 2497 5013 2497 4845
2 2 0 1 0 7 50 -1 41 0.000 0 0 -1 0 0 5
2923 4675 3137 4675 3137 4845 2923 4845 2923 4675
4 0 0 50 -1 0 14 0.0000 4 30 180 3385 4929 ...\001
4 0 -1 50 -1 0 7 0.0000 2 75 60 2569 5140 0\001
4 0 -1 50 -1 0 7 0.0000 2 75 60 2781 5140 1\001
4 0 -1 50 -1 0 7 0.0000 2 75 60 2995 5140 2\001
4 0 -1 50 -1 0 7 0.0000 2 75 405 4059 4845 Buckets\001
4 0 -1 50 -1 0 7 0.0000 2 105 1095 3775 5140 ${\\lceil n/b\\rceil - 1}$\001
-6
6 2983 5446 3433 5536
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
3208 5446 3176 5491
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
3208 5446 3240 5491
3 0 0 1 0 7 50 -1 -1 0.000 0 0 0 8
2983 5536 2983 5491 3015 5491 3047 5491 3079 5491 3112 5491
3176 5491 3176 5491
0.000 1.000 1.000 1.000 1.000 1.000 1.000 0.000
3 0 0 1 0 7 50 -1 -1 0.000 0 0 0 7
3433 5536 3433 5491 3401 5491 3369 5491 3337 5491 3304 5491
3240 5491
0.000 1.000 1.000 1.000 1.000 1.000 0.000
-6
2 2 0 1 0 36 50 -1 -1 0.000 0 0 7 0 0 5
3852 4041 4066 4041 4066 4211 3852 4211 3852 4041
2 2 0 1 0 36 50 -1 -1 0.000 0 0 7 0 0 5
4066 4041 4279 4041 4279 4211 4066 4211 4066 4041
2 2 0 1 0 37 50 -1 -1 0.000 0 0 7 0 0 5
1937 4041 2149 4041 2149 4211 1937 4211 1937 4041
2 2 0 1 0 37 50 -1 -1 0.000 0 0 7 0 0 5
2149 4041 2362 4041 2362 4211 2149 4211 2149 4041
2 2 0 1 0 37 50 -1 -1 0.000 0 0 7 0 0 5
2362 4041 2576 4041 2576 4211 2362 4211 2362 4041
2 2 0 1 0 37 50 -1 -1 0.000 0 0 7 0 0 5
2576 4041 2788 4041 2788 4211 2576 4211 2576 4041
2 2 0 1 0 37 50 -1 -1 0.000 0 0 7 0 0 5
2788 4041 3002 4041 3002 4211 2788 4211 2788 4041
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
3214 4041 3427 4041 3427 4211 3214 4211 3214 4041
2 2 0 1 0 36 50 -1 -1 0.000 0 0 7 0 0 5
4279 4041 4492 4041 4492 4211 4279 4211 4279 4041
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
3002 4041 3214 4041 3214 4211 3002 4211 3002 4041
2 2 0 1 0 37 50 -1 20 0.000 0 0 7 0 0 5
2132 5689 2345 5689 2345 5858 2132 5858 2132 5689
2 2 0 1 0 7 50 -1 41 0.000 0 0 -1 0 0 5
3197 5689 3410 5689 3410 5858 3197 5858 3197 5689
2 2 0 1 0 37 50 -1 -1 0.000 0 0 7 0 0 5
2771 5689 2985 5689 2985 5858 2771 5858 2771 5689
2 2 0 1 0 36 50 -1 43 0.000 0 0 7 0 0 5
4262 5689 4475 5689 4475 5858 4262 5858 4262 5689
2 2 0 1 0 36 50 -1 43 0.000 0 0 7 0 0 5
4049 5689 4262 5689 4262 5858 4049 5858 4049 5689
2 2 0 1 0 7 50 -1 41 0.000 0 0 -1 0 0 5
2985 5689 3197 5689 3197 5858 2985 5858 2985 5689
2 2 0 1 0 37 50 -1 -1 0.000 0 0 7 0 0 5
2345 5689 2559 5689 2559 5858 2345 5858 2345 5689
2 2 0 1 0 37 50 -1 20 0.000 0 0 7 0 0 5
1914 5687 2127 5687 2127 5856 1914 5856 1914 5687
2 2 0 1 0 36 50 -1 43 0.000 0 0 7 0 0 5
3835 5689 4049 5689 4049 5858 3835 5858 3835 5689
2 2 0 1 0 37 50 -1 -1 0.000 0 0 7 0 0 5
2559 5689 2771 5689 2771 5858 2559 5858 2559 5689
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 0 5
1 1 1.00 60.00 120.00
3330 4275 3330 4365 3330 4410 3330 4455 3330 4500
2 1 0 1 0 7 50 -1 -1 0.000 0 0 7 1 0 2
1 1 1.00 45.00 60.00
3880 5168 4140 5445
2 1 0 1 0 7 50 -1 -1 0.000 0 0 7 1 0 2
1 1 1.00 45.00 60.00
3025 5170 3205 5440
2 1 0 1 0 7 50 -1 -1 0.000 0 0 7 1 0 2
1 1 1.00 45.00 60.00
2805 5164 2653 5438
2 1 0 1 0 7 50 -1 -1 0.000 0 0 7 1 0 2
1 1 1.00 45.00 60.00
2577 5170 2103 5434
4 0 -1 50 -1 0 7 0.0000 2 120 645 4562 4168 Key Set $S$\001
4 0 -1 50 -1 0 7 0.0000 2 75 60 2008 3999 0\001
4 0 -1 50 -1 0 7 0.0000 2 75 60 2220 3999 1\001
4 0 -1 50 -1 0 7 0.0000 2 75 165 4314 3999 n-1\001
4 0 -1 50 -1 0 7 0.0000 2 75 60 1991 5985 0\001
4 0 -1 50 -1 0 7 0.0000 2 75 60 2203 5985 1\001
4 0 -1 50 -1 0 7 0.0000 2 75 165 4297 5985 n-1\001
4 0 -1 50 -1 0 7 0.0000 2 75 555 4545 5816 Hash Table\001
4 0 -1 50 -1 0 3 0.0000 2 75 450 1980 5625 MPHF$_0$\001
4 0 -1 50 -1 0 3 0.0000 2 75 450 2520 5625 MPHF$_1$\001
4 0 -1 50 -1 0 3 0.0000 2 75 450 3015 5625 MPHF$_2$\001
4 0 -1 50 -1 0 3 0.0000 2 75 1065 3825 5625 MPHF$_{\\lceil n/b \\rceil - 1}$\001
4 0 -1 50 -1 0 7 0.0000 2 105 585 1440 4455 Partitioning\001
4 0 -1 50 -1 0 7 0.0000 2 105 495 1440 5265 Searching\001

Binary file not shown.


153
vldb07/figs/brzfabiano.fig Executable file

@ -0,0 +1,153 @@
#FIG 3.2 Produced by xfig version 3.2.5-alpha5
Landscape
Center
Metric
A4
100.00
Single
-2
1200 2
0 32 #bebebe
6 2025 3015 3555 3690
2 3 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 8
2025 3285 2295 3285 2295 3015 3285 3015 3285 3285 3555 3285
2790 3690 2025 3285
4 0 0 50 -1 0 10 0.0000 4 135 765 2385 3330 Partitioning\001
-6
6 1890 3735 3780 4365
6 2430 3735 2700 4365
6 2430 3915 2700 4365
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2430 4275 2700 4275 2700 4365 2430 4365 2430 4275
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2430 4185 2700 4185 2700 4275 2430 4275 2430 4185
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2430 4095 2700 4095 2700 4185 2430 4185 2430 4095
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2430 4005 2700 4005 2700 4095 2430 4095 2430 4005
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2430 3915 2700 3915 2700 4005 2430 4005 2430 3915
-6
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2430 3825 2700 3825 2700 3915 2430 3915 2430 3825
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2430 3735 2700 3735 2700 3825 2430 3825 2430 3735
-6
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
1890 4275 2160 4275 2160 4365 1890 4365 1890 4275
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
1890 4185 2160 4185 2160 4275 1890 4275 1890 4185
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2160 4275 2430 4275 2430 4365 2160 4365 2160 4275
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2160 4185 2430 4185 2430 4275 2160 4275 2160 4185
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2160 4095 2430 4095 2430 4185 2160 4185 2160 4095
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2160 4005 2430 4005 2430 4095 2160 4095 2160 4005
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2160 3915 2430 3915 2430 4005 2160 4005 2160 3915
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2700 4275 2970 4275 2970 4365 2700 4365 2700 4275
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2700 4185 2970 4185 2970 4275 2700 4275 2700 4185
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2700 4095 2970 4095 2970 4185 2700 4185 2700 4095
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2700 4005 2970 4005 2970 4095 2700 4095 2700 4005
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2160 3825 2430 3825 2430 3915 2160 3915 2160 3825
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
3240 4275 3510 4275 3510 4365 3240 4365 3240 4275
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
3510 4275 3780 4275 3780 4365 3510 4365 3510 4275
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2970 4275 3240 4275 3240 4365 2970 4365 2970 4275
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
3240 4185 3510 4185 3510 4275 3240 4275 3240 4185
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
1890 4095 2160 4095 2160 4185 1890 4185 1890 4095
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
3510 4185 3780 4185 3780 4275 3510 4275 3510 4185
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
3240 4095 3510 4095 3510 4185 3240 4185 3240 4095
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
3240 4005 3510 4005 3510 4095 3240 4095 3240 4005
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
3240 3915 3510 3915 3510 4005 3240 4005 3240 3915
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
1890 4365 3780 4365
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2970 4185 3240 4185 3240 4275 2970 4275 2970 4185
-6
6 1260 5310 4230 5580
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
1260 5400 4230 5400
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
1530 5310 1800 5310 1800 5400 1530 5400 1530 5310
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2070 5310 2340 5310 2340 5400 2070 5400 2070 5310
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2340 5310 2610 5310 2610 5400 2340 5400 2340 5310
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2610 5310 2880 5310 2880 5400 2610 5400 2610 5310
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2880 5310 3150 5310 3150 5400 2880 5400 2880 5310
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
3420 5310 3690 5310 3690 5400 3420 5400 3420 5310
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
3690 5310 3960 5310 3960 5400 3690 5400 3690 5310
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
3960 5310 4230 5310 4230 5400 3960 5400 3960 5310
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
1800 5310 2070 5310 2070 5400 1800 5400 1800 5310
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
3150 5310 3420 5310 3420 5400 3150 5400 3150 5310
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
1260 5310 1530 5310 1530 5400 1260 5400 1260 5310
4 0 0 50 -1 0 10 0.0000 4 105 210 4005 5580 n-1\001
4 0 0 50 -1 0 10 0.0000 4 105 75 1350 5580 0\001
-6
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
1260 2925 4230 2925
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
1530 2835 1800 2835 1800 2925 1530 2925 1530 2835
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2070 2835 2340 2835 2340 2925 2070 2925 2070 2835
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2340 2835 2610 2835 2610 2925 2340 2925 2340 2835
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2610 2835 2880 2835 2880 2925 2610 2925 2610 2835
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2880 2835 3150 2835 3150 2925 2880 2925 2880 2835
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
3420 2835 3690 2835 3690 2925 3420 2925 3420 2835
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
3690 2835 3960 2835 3960 2925 3690 2925 3690 2835
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
3960 2835 4230 2835 4230 2925 3960 2925 3960 2835
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
1800 2835 2070 2835 2070 2925 1800 2925 1800 2835
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
3150 2835 3420 2835 3420 2925 3150 2925 3150 2835
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
1260 2835 1530 2835 1530 2925 1260 2925 1260 2835
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
3510 4410 3510 4590
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
3510 4410 3600 4410
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
3690 4410 3780 4410
2 3 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 8
2025 4815 2295 4815 2295 4545 3285 4545 3285 4815 3555 4815
2790 5220 2025 4815
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
3780 4410 3780 4590
4 0 0 50 -1 0 10 0.0000 4 135 585 2475 4860 Searching\001
4 0 0 50 -1 0 10 0.0000 4 105 75 1980 4545 0\001
4 0 0 50 -1 0 10 0.0000 4 105 690 4410 5400 Hash Table\001
4 0 0 50 -1 0 10 0.0000 4 105 480 4410 4230 Buckets\001
4 0 0 50 -1 0 10 0.0000 4 135 555 4410 2925 Key set S\001
4 0 0 50 -1 0 10 0.0000 4 105 75 1350 2745 0\001
4 0 0 50 -1 0 10 0.0000 4 105 210 4005 2745 n-1\001
4 0 0 50 -1 0 10 0.0000 4 105 420 3555 4545 n/b - 1\001

Binary file not shown.


109
vldb07/introduction.tex Executable file

@ -0,0 +1,109 @@
%% Nivio: 22/jan/06 23/jan/06 29/jan
% Time-stamp: <Monday 30 Jan 2006 03:52:42am EDT yoshi@ime.usp.br>
\section{Introduction}
\label{sec:intro}
\enlargethispage{2\baselineskip}
Suppose~$U$ is a universe of \textit{keys} of size $u$.
Let $h:U\to M$ be a {\em hash function} that maps the keys from~$U$
to a given interval of integers $M=[0,m-1]=\{0,1,\dots,m-1\}$.
Let~$S\subseteq U$ be a set of~$n$ keys from~$U$, where $ n \ll u$.
Given a key~$x\in S$, the hash function~$h$ computes an integer in
$[0,m-1]$ for the storage or retrieval of~$x$ in a {\em hash table}.
% Hashing methods for {\em non-static sets} of keys can be used to construct
% data structures storing $S$ and supporting membership queries
% ``$x \in S$?'' in expected time $O(1)$.
% However, they involve a certain amount of wasted space owing to unused
% locations in the table and waisted time to resolve collisions when
% two keys are hashed to the same table location.
A perfect hash function maps a {\em static set} $S$ of $n$ keys from $U$ into a set of $m$ integer
numbers without collisions, where $m$ is greater than or equal to $n$.
If $m$ is equal to $n$, the function is called minimal.
% Figure~\ref{fig:minimalperfecthash-ph-mph}(a) illustrates a perfect hash function and
% Figure~\ref{fig:minimalperfecthash-ph-mph}(b) illustrates a minimal perfect hash function (MPHF).
%
% \begin{figure}
% \centering
% \scalebox{0.7}{\epsfig{file=figs/minimalperfecthash-ph-mph.ps}}
% \caption{(a) Perfect hash function (b) Minimal perfect hash function (MPHF)}
% \label{fig:minimalperfecthash-ph-mph}
% %\vspace{-5mm}
% \end{figure}
Minimal perfect hash functions are widely used for memory efficient storage and fast
retrieval of items from static sets, such as words in natural languages,
reserved words in programming languages or interactive systems, universal resource
locations (URLs) in web search engines, or item sets in data mining techniques.
Search engines are nowadays indexing tens of billions of pages and algorithms
like PageRank~\cite{Brin1998}, which uses the web link structure to derive a
measure of popularity for Web pages, would benefit from a MPHF for storage and
retrieval of such huge sets of URLs.
For instance, the TodoBr\footnote{TodoBr ({\texttt www.todobr.com.br}) is a trademark of
Akwan Information Technologies, which was acquired by Google Inc. in July 2005.}
search engine used the algorithm proposed hereinafter to
improve and to scale its link analysis system.
The WebGraph research group~\cite{bv04} would
also benefit from a MPHF for sets in the order of billions of URLs to scale
and to improve the storage requirements of their algorithms for graph compression.
Another interesting application for MPHFs is their use as an indexing structure
for databases.
The B+ tree is very popular as an indexing structure for dynamic applications
with frequent insertions and deletions of records.
However, for applications with sporadic modifications and a huge number of
queries the B+ tree is not the best option,
because it performs poorly with very large sets of keys
such as those required for the new frontiers of database applications~\cite{s05}.
Therefore, there are applications for MPHFs in
information retrieval systems, database systems, language translation systems,
electronic commerce systems, compilers, operating systems, among others.
Until now, because of the limitations of current algorithms,
the use of MPHFs is restricted to scenarios where the set of keys being hashed is
relatively small.
However, in many cases it is crucial to deal in an efficient way with very large
sets of keys.
Due to the exponential growth of the Web, working with huge collections is becoming
a daily task.
For instance, the simple assignment of numeric identifiers to the web pages of a collection
can be a challenging task.
While traditional databases simply cannot handle more traffic once the working
set of URLs does not fit in main memory anymore~\cite{s05}, the algorithm we propose here to
construct MPHFs can easily scale to billions of entries.
% using stock hardware.
As there are many applications for MPHFs, it is
important to design and implement space and time efficient algorithms for
constructing such functions.
The attractiveness of using MPHFs depends on the following issues:
\begin{enumerate}
\item The amount of CPU time required by the algorithms for constructing MPHFs.
\item The space requirements of the algorithms for constructing MPHFs.
\item The amount of CPU time required by a MPHF for each retrieval.
\item The space requirements of the description of the resulting MPHFs to be
used at retrieval time.
\end{enumerate}
\enlargethispage{2\baselineskip}
This paper presents a novel external memory based algorithm for constructing MPHFs that
is very efficient with respect to these four requirements.
First, the algorithm constructs a MPHF in time linear in the size of the key set,
which is optimal.
For instance, for a collection of 1 billion URLs
collected from the web, each one 64 characters long on average, the time to construct a
MPHF using a 2.4 gigahertz PC with 500 megabytes of available main memory
is approximately 3 hours.
Second, the algorithm needs a small, a priori defined vector of $\lceil n/b \rceil$
one-byte entries in main memory to construct a MPHF.
For the collection of 1 billion URLs and using $b=175$, the algorithm needs only
5.45 megabytes of internal memory.
Third, the evaluation of the MPHF for each retrieval requires three memory accesses and
the computation of three universal hash functions.
This is not optimal as any MPHF requires at least one memory access and the computation
of two universal hash functions.
Fourth, the description of a MPHF takes a constant number of bits for each key, which is optimal.
For the collection of 1 billion URLs, it needs 8.1 bits for each key,
while the theoretical lower bound is $1/\ln2 \approx 1.4427$ bits per
key~\cite{m84}.

17
vldb07/makefile Executable file
View File

@ -0,0 +1,17 @@
all:
latex vldb.tex
bibtex vldb
latex vldb.tex
latex vldb.tex
dvips vldb.dvi -o vldb.ps
ps2pdf vldb.ps
chmod -R g+rwx *
perm:
chmod -R g+rwx *
run: clean all
gv vldb.ps &
clean:
rm *.aux *.bbl *.blg *.log *.ps *.pdf *.dvi

141
vldb07/partitioningthekeys.tex Executable file
View File

@ -0,0 +1,141 @@
%% Nivio: 21/jan/06
% Time-stamp: <Monday 30 Jan 2006 03:57:28am EDT yoshi@ime.usp.br>
\vspace{-2mm}
\subsection{Partitioning step}
\label{sec:partitioning-keys}
The set $S$ of $n$ keys is partitioned into $\lceil n/b \rceil$ buckets,
where $b$ is a suitable parameter chosen to guarantee
that each bucket has at most 256 keys with high probability
(see Section~\ref{sec:determining-b}).
The partitioning step works as follows:
\begin{figure}[h]
\hrule
\hrule
\vspace{2mm}
\begin{tabbing}
aa\=type booleanx \== (false, true); \kill
\> $\blacktriangleright$ Let $\beta$ be the size in bytes of the set $S$ \\
\> $\blacktriangleright$ Let $\mu$ be the size in bytes of an a priori reserved \\
\> ~~~ internal memory area \\
\> $\blacktriangleright$ Let $N = \lceil \beta/\mu \rceil$ be the number of key blocks that will \\
\> ~~~ be read from disk into an internal memory area \\
\> $\blacktriangleright$ Let $\mathit{size}$ be a vector that stores the size of each bucket \\
\> $1.$ {\bf for} $j = 1$ {\bf to} $N$ {\bf do} \\
\> ~~ $1.1$ Read block $B_j$ of keys from disk \\
\> ~~ $1.2$ Cluster $B_j$ into $\lceil n/b \rceil$ buckets using a bucket sort \\
\> ~~~~~~~ algorithm and update the entries in the vector {\it size} \\
\> ~~ $1.3$ Dump $B_j$ to the disk into File $j$\\
\> $2.$ Compute the {\it offset} vector and dump it to the disk.
\end{tabbing}
\hrule
\hrule
\vspace{-1.0mm}
\caption{Partitioning step}
\vspace{-3mm}
\label{fig:partitioningstep}
\end{figure}
Statement 1.1 of the {\bf for} loop presented in Figure~\ref{fig:partitioningstep}
reads sequentially all the keys of block $B_j$ from disk into an internal area
of size $\mu$.
Statement 1.2 performs an indirect bucket sort of the keys in block $B_j$
and at the same time updates the entries in the vector {\em size}.
Let us briefly describe how~$B_j$ is partitioned among the~$\lceil n/b\rceil$
buckets.
We use a local array of $\lceil n/b \rceil$ counters to store a
count of how many keys from $B_j$ belong to each bucket.
%At the same time, the global vector {\it size} is computed based on the local
%counters.
The pointers to the keys in each bucket $i$, $0 \leq i < \lceil n/b \rceil$,
are stored in contiguous positions in an array.
For this we first reserve the required number of entries
in this array of pointers using the information from the array of counters.
Next, we place the pointers to the keys in each bucket into the respective
reserved areas in the array (i.e., we place the pointers to the keys in bucket 0,
followed by the pointers to the keys in bucket 1, and so on).
\enlargethispage{2\baselineskip}
To find the bucket address of a given key
we use the universal hash function $h_0(k)$~\cite{j97}.
Key~$k$ goes into bucket~$i$, where
%Then, for each integer $h_0(k)$ the respective bucket address is obtained
%as follows:
\begin{eqnarray} \label{eq:bucketindex}
i=h_0(k) \bmod \left \lceil \frac{n}{b} \right \rceil.
\end{eqnarray}
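As an illustration of statement 1.2 (the indirect bucket sort), the following C sketch clusters one block of keys into buckets using a counting sort over the bucket addresses. It is a minimal sketch, not our actual implementation: the names \texttt{hash0} and \texttt{NUM\_BUCKETS} are hypothetical stand-ins for $h_0$ and $\lceil n/b \rceil$, and for simplicity the bucket sizes are accumulated in an integer array, whereas the actual vector {\it size} uses one-byte entries.
\begin{lstlisting}[language=C]
/* Minimal sketch: cluster the keys of one block into buckets using a
   counting sort over the bucket addresses h_0(k) mod ceil(n/b). */
extern unsigned int hash0(const char *key);   /* hypothetical h_0 (Jenkins-style) */
#define NUM_BUCKETS 1000                      /* hypothetical value of ceil(n/b)  */

/* Writes pointers to the keys of block[0..nkeys-1] in bucket order into out[]
   and adds the per-bucket counts of this block to the global vector size[]. */
void cluster_block(char **block, int nkeys, char **out, int size[])
{
    int count[NUM_BUCKETS] = {0};   /* local counters: keys of this block per bucket */
    int start[NUM_BUCKETS];         /* first free slot of each bucket in out[]       */
    int i, acc = 0;

    for (i = 0; i < nkeys; i++)
        count[hash0(block[i]) % NUM_BUCKETS]++;
    for (i = 0; i < NUM_BUCKETS; i++) {         /* reserve contiguous areas          */
        start[i] = acc;
        acc += count[i];
        size[i] += count[i];                    /* update the global bucket sizes    */
    }
    for (i = 0; i < nkeys; i++) {               /* place the pointers in bucket order */
        int b = hash0(block[i]) % NUM_BUCKETS;
        out[start[b]++] = block[i];
    }
}
\end{lstlisting}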
Figure~\ref{fig:brz-partitioning}(a) shows a \emph{logical} view of the
$\lceil n/b \rceil$ buckets generated in the partitioning step.
%In this case, the keys of each bucket are put together by the pointers to
%each key stored
%in contiguous positions in the array of pointers.
In reality, the keys belonging to each bucket are distributed among many files,
as depicted in Figure~\ref{fig:brz-partitioning}(b).
In the example of Figure~\ref{fig:brz-partitioning}(b), the keys in bucket 0
appear in files 1 and $N$, the keys in bucket 1 appear in files 1, 2
and $N$, and so on.
\vspace{-7mm}
\begin{figure}[ht]
\centering
\begin{picture}(0,0)%
\includegraphics{figs/brz-partitioning.ps}%
\end{picture}%
\setlength{\unitlength}{4144sp}%
%
\begingroup\makeatletter\ifx\SetFigFont\undefined%
\gdef\SetFigFont#1#2#3#4#5{%
\reset@font\fontsize{#1}{#2pt}%
\fontfamily{#3}\fontseries{#4}\fontshape{#5}%
\selectfont}%
\fi\endgroup%
\begin{picture}(4371,1403)(1,-6977)
\put(333,-6421){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}0}}}}
\put(545,-6421){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}1}}}}
\put(759,-6421){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}2}}}}
\put(1539,-6421){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}${\lceil n/b\rceil - 1}$}}}}
\put(541,-6676){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}Buckets Logical View}}}}
\put(3547,-6120){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}}
\put(3547,-6188){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}}
\put(3547,-6255){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}}
\put(3107,-6120){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}}
\put(3107,-6188){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}}
\put(3107,-6255){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}}
\put(4177,-6224){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}}
\put(4177,-6269){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}}
\put(4177,-6314){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}}
\put(3016,-6721){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}File 1}}}}
\put(3466,-6721){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}File 2}}}}
\put(4096,-6721){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}File N}}}}
\put(3196,-6946){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}Buckets Physical View}}}}
\end{picture}%
\caption{Situation of the buckets at the end of the partitioning step: (a) Logical view (b) Physical view}
\label{fig:brz-partitioning}
\vspace{-2mm}
\end{figure}
This scattering of the keys in the buckets could cause a performance
problem because of the potentially large number of seeks
needed to read the keys in each bucket from the $N$ files on disk
during the searching step.
But, as we show later in Section~\ref{sec:analytcal-results}, the number of seeks
can be kept small using buffering techniques.
Considering that only the vector {\it size}, which has $\lceil n/b \rceil$
one-byte entries (remember that each bucket has at most 256 keys),
must be maintained in main memory during the searching step,
almost all main memory is available to be used as disk I/O buffer.
The last step is to compute the {\it offset} vector and dump it to the disk.
We use the vector $\mathit{size}$ to compute the
$\mathit{offset}$ displacement vector.
The $\mathit{offset}[i]$ entry contains the number of keys
in the buckets $0, 1, \dots, i-1$.
As {\it size}$[i]$ stores the number of keys
in bucket $i$, where $0 \leq i <\lceil n/b \rceil$, we have
\begin{displaymath}
\mathit{offset}[i] = \sum_{j=0}^{i-1} \mathit{size}[j].
\end{displaymath}
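As a minimal illustration (not our actual code), the $\mathit{offset}$ vector can be produced with a single prefix-sum pass over $\mathit{size}$, where \texttt{nbuckets} stands for $\lceil n/b \rceil$:
\begin{lstlisting}[language=C]
/* Minimal sketch: compute the offset (displacement) vector as a prefix sum
   over the bucket sizes; nbuckets stands for ceil(n/b). */
void compute_offset(const unsigned char size[], unsigned int offset[], int nbuckets)
{
    unsigned int acc = 0;
    for (int i = 0; i < nbuckets; i++) {
        offset[i] = acc;          /* number of keys in buckets 0, 1, ..., i-1 */
        acc += size[i];
    }
}
\end{lstlisting}
This pass costs $O(\lceil n/b \rceil)$ time and is negligible compared to the rest of the partitioning step.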

View File

@ -0,0 +1,113 @@
% Nivio: 29/jan/06
% Time-stamp: <Monday 30 Jan 2006 12:13:14pm EST yoshi@flare>
\subsection{Performance of the new algorithm}
\label{sec:performance}
%As we have done for the internal memory based algorithm,
The runtime of our algorithm is also a random variable, but now it follows a
(highly concentrated) normal distribution, as we discuss at the end of this
section. Again, we are interested in verifying the linearity claim made in
Section~\ref{sec:linearcomplexity}. Therefore, we ran the algorithm for
several numbers $n$ of keys in $S$.
The values chosen for $n$ were $1, 2, 4, 8, 16, 32, 64, 128, 512$ and $1000$
million.
%Just the small vector {\it size} must be kept in main memory,
%as we saw in Section~\ref{sec:memconstruction}.
We limited the main memory to 500 megabytes for the experiments.
The size $\mu$ of the a priori reserved internal memory area
was set to 250 megabytes, the parameter $b$ was set to $175$ and
the building block algorithm parameter $c$ was again set to $1$.
In Section~\ref{sec:contr-disk-access} we show how $\mu$
affects the runtime of the algorithm. The other two parameters
have insignificant influence on the runtime.
We again use a statistical method for determining a suitable sample size
%~\cite[Chapter 13]{j91}
to estimate the number of trials to be run for each value of $n$. We found that
just one trial for each $n$ would be enough with a confidence level of $95\%$.
Nevertheless, we ran 10 trials. This number of trials seems rather small, but, as
shown below, the behavior of our algorithm is very stable and its runtime is
almost deterministic (i.e., the standard deviation is very small).
Table~\ref{tab:mediasbrz} presents the runtime average for each $n$,
the respective standard deviations, and
the respective confidence intervals given by
the average time $\pm$ the distance from average time
considering a confidence level of $95\%$.
Observing the runtime averages we noticed that
the algorithm runs in expected linear time,
as shown in~Section~\ref{sec:linearcomplexity}. Better still,
it is only approximately $60\%$ slower than our internal memory based algorithm.
To get that value we used the linear regression model obtained for the runtime of
the internal memory based algorithm to estimate how much time it would require
for constructing a MPHF for a set of 1 billion keys.
We got 2.3 hours for the internal memory based algorithm and we measured
3.67 hours on average for our algorithm.
Increasing the size of the internal memory area
from 250 to 600 megabytes (see Section~\ref{sec:contr-disk-access}),
we have brought the time down to 3.09 hours. In that setup, our algorithm is
just $34\%$ slower.
\enlargethispage{2\baselineskip}
\begin{table*}[htb]
\vspace{-1mm}
\begin{center}
{\scriptsize
\begin{tabular}{|l|c|c|c|c|c|}
\hline
$n$ (millions) & 1 & 2 & 4 & 8 & 16 \\
\hline % Part. 16 \% 16 \% 16 \% 18 \% 20\%
Average time (s) & $6.9 \pm 0.3$ & $13.8 \pm 0.2$ & $31.9 \pm 0.7$ & $69.9 \pm 1.1$ & $140.6 \pm 2.5$ \\
SD & $0.4$ & $0.2$ & $0.9$ & $1.5$ & $3.5$ \\
\hline
\hline
$n$ (millions) & 32 & 64 & 128 & 512 & 1000 \\
\hline % Part. 20 \% 20\% 20\% 18\% 18\%
Average time (s) & $284.3 \pm 1.1$ & $587.9 \pm 3.9$ & $1223.6 \pm 4.9$ & $5966.4 \pm 9.5$ & $13229.5 \pm 12.7$ \\
SD & $1.6$ & $5.5$ & $6.8$ & $13.2$ & $18.6$ \\
\hline
\end{tabular}
\vspace{-1mm}
}
\end{center}
\caption{Our algorithm: average time in seconds for constructing a MPHF,
the standard deviation (SD), and the confidence intervals considering
a confidence level of $95\%$.
}
\label{tab:mediasbrz}
\vspace{-5mm}
\end{table*}
Figure~\ref{fig:brz_temporegressao}
presents the runtime for each trial. In addition,
the solid line corresponds to a linear regression model
obtained from the experimental measurements.
As expected, the runtime for a given $n$ shows almost no
variation.
\begin{figure}[htb]
\begin{center}
\scalebox{0.4}{\includegraphics{figs/brz_temporegressao.eps}}
\caption{Time versus number of keys in $S$ for our algorithm. The solid line corresponds to
a linear regression model.}
\label{fig:brz_temporegressao}
\end{center}
\vspace{-9mm}
\end{figure}
An intriguing observation is that the runtime of the algorithm is almost
deterministic, in spite of the fact that it uses as building block an
algorithm with a considerable fluctuation in its runtime. A given bucket~$i$,
$0 \leq i < \lceil n/b \rceil$, is a small set of keys (at most 256 keys) and,
as argued in Section~\ref{sec:intern-memory-algor}, the runtime of the
building block algorithm is a random variable~$X_i$ with high fluctuation.
However, the runtime~$Y$ of the searching step of our algorithm is given
by~$Y=\sum_{0\leq i<\lceil n/b\rceil}X_i$. Under the hypothesis that
the~$X_i$ are independent and bounded, the {\it law of large numbers} (see,
e.g., \cite{j91}) implies that the random variable $Y/\lceil n/b\rceil$
converges to a constant as~$n\to\infty$. This explains why the runtime of our
algorithm is almost deterministic.

814
vldb07/references.bib Executable file
View File

@ -0,0 +1,814 @@
@InProceedings{Brin1998,
author = "Sergey Brin and Lawrence Page",
title = "The Anatomy of a Large-Scale Hypertextual Web Search Engine",
booktitle = "Proceedings of the 7th International {World Wide Web}
Conference",
pages = "107--117",
address = "Brisbane, Australia",
month = "April",
year = 1998,
annote = "The Google paper."
}
@inproceedings{p99,
author = {R. Pagh},
title = {Hash and Displace: Efficient Evaluation of Minimal Perfect Hash Functions},
booktitle = {Workshop on Algorithms and Data Structures},
pages = {49-54},
year = 1999,
url = {citeseer.nj.nec.com/pagh99hash.html},
key = {author}
}
@article{p00,
author = {R. Pagh},
title = {Faster deterministic dictionaries},
journal = {Symposium on Discrete Algorithms (ACM SODA)},
OPTvolume = {43},
OPTnumber = {5},
pages = {487--493},
year = {2000}
}
@article{g81,
author = {G. H. Gonnet},
title = {Expected Length of the Longest Probe Sequence in Hash Code Searching},
journal = {J. ACM},
volume = {28},
number = {2},
year = {1981},
issn = {0004-5411},
pages = {289--304},
doi = {http://doi.acm.org/10.1145/322248.322254},
publisher = {ACM Press},
address = {New York, NY, USA},
}
@misc{r04,
author = "S. Rao",
title = "Combinatorial Algorithms and Data Structures",
year = 2004,
howpublished = {CS 270 Spring},
url = "citeseer.ist.psu.edu/700201.html"
}
@article{ra98,
author = {Martin Raab and Angelika Steger},
title = {``{B}alls into Bins'' --- {A} Simple and Tight Analysis},
journal = {Lecture Notes in Computer Science},
volume = 1518,
pages = {159--170},
year = 1998,
url = "citeseer.ist.psu.edu/raab98balls.html"
}
@misc{mrs00,
author = "M. Mitzenmacher and A. Richa and R. Sitaraman",
title = "The power of two random choices: A survey of the techniques and results",
howpublished={In Handbook of Randomized
Computing, P. Pardalos, S. Rajasekaran, and J. Rolim, Eds. Kluwer},
year = "2000",
url = "citeseer.ist.psu.edu/article/mitzenmacher00power.html"
}
@article{dfm02,
author = {E. Drinea and A. Frieze and M. Mitzenmacher},
title = {Balls and bins models with feedback},
journal = {Symposium on Discrete Algorithms (ACM SODA)},
pages = {308--315},
year = {2002}
}
@Article{j97,
author = {Bob Jenkins},
title = {Algorithm Alley: Hash Functions},
journal = {Dr. Dobb's Journal of Software Tools},
volume = {22},
number = {9},
month = {September},
year = {1997}
}
@article{gss01,
author = {N. Galli and B. Seybold and K. Simon},
title = {Tetris-Hashing or optimal table compression},
journal = {Discrete Applied Mathematics},
volume = {110},
number = {1},
pages = {41--58},
month = {June},
publisher = {Elsevier Science},
year = {2001}
}
@article{s05,
author = {M. Seltzer},
title = {Beyond Relational Databases},
journal = {ACM Queue},
volume = {3},
number = {3},
month = {April},
year = {2005}
}
@InProceedings{ss89,
author = {P. Schmidt and A. Siegel},
title = {On aspects of universality and performance for closed hashing},
booktitle = {Proc. 21th Ann. ACM Symp. on Theory of Computing -- STOC'89},
month = {May},
year = {1989},
pages = {355--366}
}
@article{asw00,
author = {M. Atici and D. R. Stinson and R. Wei.},
title = {A new practical algorithm for the construction of a perfect hash function},
journal = {Journal Combin. Math. Combin. Comput.},
volume = {35},
pages = {127--145},
year = {2000}
}
@article{swz00,
author = {D. R. Stinson and R. Wei and L. Zhu},
title = {New constructions for perfect hash families and related structures using combinatorial designs and codes},
journal = {Journal Combin. Designs.},
volume = {8},
pages = {189--200},
year = {2000}
}
@inproceedings{ht01,
author = {T. Hagerup and T. Tholey},
title = {Efficient minimal perfect hashing in nearly minimal space},
booktitle = {The 18th Symposium on Theoretical Aspects of Computer Science (STACS), volume 2010 of Lecture Notes in Computer Science},
year = 2001,
pages = {317--326},
key = {author}
}
@inproceedings{dh01,
author = {M. Dietzfelbinger and T. Hagerup},
title = {Simple minimal perfect hashing in less space},
booktitle = {The 9th European Symposium on Algorithms (ESA), volume 2161 of Lecture Notes in Computer Science},
year = 2001,
pages = {109--120},
key = {author}
}
@MastersThesis{mar00,
author = {M. S. Neubert},
title = {Algoritmos Distribu\'{\i}dos para a Constru\c{c}\~ao de Arquivos Invertidos},
school = {Departamento de Ci\^encia da Computa\c{c}\~ao, Universidade Federal de Minas Gerais},
year = 2000,
month = {March},
key = {author}
}
@Book{clrs01,
author = {T. H. Cormen and C. E. Leiserson and R. L. Rivest and C. Stein},
title = {Introduction to Algorithms},
publisher = {MIT Press},
year = {2001},
edition = {second},
}
@Book{j91,
author = {R. Jain},
title = {The art of computer systems performance analysis: techniques for experimental design, measurement, simulation, and modeling. },
publisher = {John Wiley},
year = {1991},
edition = {first}
}
@Book{k73,
author = {D. E. Knuth},
title = {The Art of Computer Programming: Sorting and Searching},
publisher = {Addison-Wesley},
volume = {3},
year = {1973},
edition = {second},
}
@inproceedings{rp99,
author = {R. Pagh},
title = {Hash and Displace: Efficient Evaluation of Minimal Perfect Hash Functions},
booktitle = {Workshop on Algorithms and Data Structures},
pages = {49-54},
year = 1999,
url = {citeseer.nj.nec.com/pagh99hash.html},
key = {author}
}
@inproceedings{hmwc93,
author = {G. Havas and B.S. Majewski and N.C. Wormald and Z.J. Czech},
title = {Graphs, Hypergraphs and Hashing},
booktitle = {19th International Workshop on Graph-Theoretic Concepts in Computer Science},
publisher = {Springer Lecture Notes in Computer Science vol. 790},
pages = {153-165},
year = 1993,
key = {author}
}
@inproceedings{bkz05,
author = {F.C. Botelho and Y. Kohayakawa and N. Ziviani},
title = {A Practical Minimal Perfect Hashing Method},
booktitle = {4th International Workshop on Efficient and Experimental Algorithms},
publisher = {Springer Lecture Notes in Computer Science vol. 3503},
pages = {488-500},
month = May,
year = 2005,
key = {author}
}
@Article{chm97,
author = {Z.J. Czech and G. Havas and B.S. Majewski},
title = {Fundamental Study Perfect Hashing},
journal = {Theoretical Computer Science},
volume = {182},
year = {1997},
pages = {1-143},
key = {author}
}
@article{chm92,
author = {Z.J. Czech and G. Havas and B.S. Majewski},
title = {An Optimal Algorithm for Generating Minimal Perfect Hash Functions},
journal = {Information Processing Letters},
volume = {43},
number = {5},
pages = {257-264},
year = {1992},
url = {citeseer.nj.nec.com/czech92optimal.html},
key = {author}
}
@Article{mwhc96,
author = {B.S. Majewski and N.C. Wormald and G. Havas and Z.J. Czech},
title = {A family of perfect hashing methods},
journal = {The Computer Journal},
year = {1996},
volume = {39},
number = {6},
pages = {547-554},
key = {author}
}
@InProceedings{bv04,
author = {P. Boldi and S. Vigna},
title = {The WebGraph Framework I: Compression Techniques},
booktitle = {13th International World Wide Web Conference},
pages = {595--602},
year = {2004}
}
@Book{z04,
author = {N. Ziviani},
title = {Projeto de Algoritmos com Implementa\c{c}\~oes em Pascal e C},
publisher = {Pioneira Thompson},
year = 2004,
edition = {segunda edi\c{c}\~ao}
}
@Book{p85,
author = {E. M. Palmer},
title = {Graphical Evolution: An Introduction to the Theory of Random Graphs},
publisher = {John Wiley \& Sons},
year = {1985},
address = {New York}
}
@Book{imb99,
author = {I.H. Witten and A. Moffat and T.C. Bell},
title = {Managing Gigabytes: Compressing and Indexing Documents and Images},
publisher = {Morgan Kaufmann Publishers},
year = 1999,
edition = {second edition}
}
@Book{wfe68,
author = {W. Feller},
title = { An Introduction to Probability Theory and Its Applications},
publisher = {Wiley},
year = 1968,
volume = 1,
optedition = {second edition}
}
@Article{fhcd92,
author = {E.A. Fox and L. S. Heath and Q. Chen and A.M. Daoud},
title = {Practical Minimal Perfect Hash Functions For Large Databases},
journal = {Communications of the ACM},
year = {1992},
volume = {35},
number = {1},
pages = {105--121}
}
@inproceedings{fch92,
author = {E.A. Fox and Q.F. Chen and L.S. Heath},
title = {A Faster Algorithm for Constructing Minimal Perfect Hash Functions},
booktitle = {Proceedings of the 15th Annual International ACM SIGIR Conference
on Research and Development in Information Retrieval},
year = {1992},
pages = {266-273},
}
@article{c80,
author = {R.J. Cichelli},
title = {Minimal perfect hash functions made simple},
journal = {Communications of the ACM},
volume = {23},
number = {1},
year = {1980},
issn = {0001-0782},
pages = {17--19},
doi = {http://doi.acm.org/10.1145/358808.358813},
publisher = {ACM Press},
}
@TechReport{fhc89,
author = {E.A. Fox and L.S. Heath and Q.F. Chen},
title = {An $O(n\log n)$ algorithm for finding minimal perfect hash functions},
institution = {Virginia Polytechnic Institute and State University},
year = {1989},
OPTkey = {},
OPTtype = {},
OPTnumber = {},
address = {Blacksburg, VA},
month = {April},
OPTnote = {},
OPTannote = {}
}
@TechReport{bkz06t,
author = {F.C. Botelho and Y. Kohayakawa and N. Ziviani},
title = {An Approach for Minimal Perfect Hash Functions in Very Large Databases},
institution = {Department of Computer Science, Federal University of Minas Gerais},
note = {Available at http://www.dcc.ufmg.br/\texttt{\~ }nivio/pub/technicalreports.html},
year = {2006},
OPTkey = {},
OPTtype = {},
number = {RT.DCC.003},
address = {Belo Horizonte, MG, Brazil},
month = {April},
OPTannote = {}
}
@inproceedings{fcdh90,
author = {E.A. Fox and Q.F. Chen and A.M. Daoud and L.S. Heath},
title = {Order preserving minimal perfect hash functions and information retrieval},
booktitle = {Proceedings of the 13th annual international ACM SIGIR conference on Research and development in information retrieval},
year = {1990},
isbn = {0-89791-408-2},
pages = {279--311},
location = {Brussels, Belgium},
doi = {http://doi.acm.org/10.1145/96749.98233},
publisher = {ACM Press},
}
@Article{fkp89,
author = {P. Flajolet and D. E. Knuth and B. Pittel},
title = {The first cycles in an evolving graph},
journal = {Discrete Math},
year = {1989},
volume = {75},
pages = {167-215},
}
@Article{s77,
author = {R. Sprugnoli},
title = {Perfect Hashing Functions: A Single Probe Retrieving
Method For Static Sets},
journal = {Communications of the ACM},
year = {1977},
volume = {20},
number = {11},
pages = {841--850},
month = {November},
}
@Article{j81,
author = {G. Jaeschke},
title = {Reciprocal Hashing: A method For Generating Minimal Perfect
Hashing Functions},
journal = {Communications of the ACM},
year = {1981},
volume = {24},
number = {12},
month = {December},
pages = {829--833}
}
@Article{c84,
author = {C. C. Chang},
title = {The Study Of An Ordered Minimal Perfect Hashing Scheme},
journal = {Communications of the ACM},
year = {1984},
volume = {27},
number = {4},
month = {December},
pages = {384--387}
}
@Article{c86,
author = {C. C. Chang},
title = {Letter-Oriented Reciprocal Hashing Scheme},
journal = {Inform. Sci.},
year = {1986},
volume = {27},
pages = {243--255}
}
@Article{cl86,
author = {C. C. Chang and R. C. T. Lee},
title = {A Letter-Oriented Minimal Perfect Hashing Scheme},
journal = {Computer Journal},
year = {1986},
volume = {29},
number = {3},
month = {June},
pages = {277--281}
}
@Article{cc88,
author = {C. C. Chang and C. H. Chang},
title = {An Ordered Minimal Perfect Hashing Scheme with Single Parameter},
journal = {Inform. Process. Lett.},
year = {1988},
volume = {27},
number = {2},
month = {February},
pages = {79--83}
}
@Article{w90,
author = {V. G. Winters},
title = {Minimal Perfect Hashing in Polynomial Time},
journal = {BIT},
year = {1990},
volume = {30},
number = {2},
pages = {235--244}
}
@Article{fcdh91,
author = {E. A. Fox and Q. F. Chen and A. M. Daoud and L. S. Heath},
title = {Order Preserving Minimal Perfect Hash Functions and Information Retrieval},
journal = {ACM Trans. Inform. Systems},
year = {1991},
volume = {9},
number = {3},
month = {July},
pages = {281--308}
}
@Article{fks84,
author = {M. L. Fredman and J. Koml\'os and E. Szemer\'edi},
title = {Storing a sparse table with {O(1)} worst case access time},
journal = {J. ACM},
year = {1984},
volume = {31},
number = {3},
month = {July},
pages = {538--544}
}
@Article{dhjs83,
author = {M. W. Du and T. M. Hsieh and K. F. Jea and D. W. Shieh},
title = {The study of a new perfect hash scheme},
journal = {IEEE Trans. Software Eng.},
year = {1983},
volume = {9},
number = {3},
month = {May},
pages = {305--313}
}
@Article{bt94,
author = {M. D. Brain and A. L. Tharp},
title = {Using Tries to Eliminate Pattern Collisions in Perfect Hashing},
journal = {IEEE Trans. on Knowledge and Data Eng.},
year = {1994},
volume = {6},
number = {2},
month = {April},
pages = {239--247}
}
@Article{bt90,
author = {M. D. Brain and A. L. Tharp},
title = {Perfect hashing using sparse matrix packing},
journal = {Inform. Systems},
year = {1990},
volume = {15},
number = {3},
OPTmonth = {April},
pages = {281--290}
}
@Article{ckw93,
author = {C. C. Chang and H. C. Kowng and T. C. Wu},
title = {A refinement of a compression-oriented addressing scheme},
journal = {BIT},
year = {1993},
volume = {33},
number = {4},
OPTmonth = {April},
pages = {530--535}
}
@Article{cw91,
author = {C. C. Chang and T. C. Wu},
title = {A letter-oriented perfect hashing scheme based upon sparse table compression},
journal = {Software -- Practice Experience},
year = {1991},
volume = {21},
number = {1},
month = {January},
pages = {35--49}
}
@Article{ty79,
author = {R. E. Tarjan and A. C. C. Yao},
title = {Storing a sparse table},
journal = {Comm. ACM},
year = {1979},
volume = {22},
number = {11},
month = {November},
pages = {606--611}
}
@Article{yd85,
author = {W. P. Yang and M. W. Du},
title = {A backtracking method for constructing perfect hash functions from a set of mapping functions},
journal = {BIT},
year = {1985},
volume = {25},
number = {1},
pages = {148--164}
}
@Article{s85,
author = {T. J. Sager},
title = {A polynomial time generator for minimal perfect hash functions},
journal = {Commun. ACM},
year = {1985},
volume = {28},
number = {5},
month = {May},
pages = {523--532}
}
@Article{cm93,
author = {Z. J. Czech and B. S. Majewski},
title = {A linear time algorithm for finding minimal perfect hash functions},
journal = {The computer Journal},
year = {1993},
volume = {36},
number = {6},
pages = {579--587}
}
@Article{gbs94,
author = {R. Gupta and S. Bhaskar and S. Smolka},
title = {On randomization in sequential and distributed algorithms},
journal = {ACM Comput. Surveys},
year = {1994},
volume = {26},
number = {1},
month = {March},
pages = {7--86}
}
@InProceedings{sb84,
author = {C. Slot and P. V. E. Boas},
title = {On tape versus core; an application of space efficient perfect hash functions to the
invariance of space},
booktitle = {Proc. 16th Ann. ACM Symp. on Theory of Computing -- STOC'84},
address = {Washington},
month = {May},
year = {1984},
pages = {391--400},
}
@InProceedings{wi90,
author = {V. G. Winters},
title = {Minimal perfect hashing for large sets of data},
booktitle = {Internat. Conf. on Computing and Information -- ICCI'90},
address = {Canada},
month = {May},
year = {1990},
pages = {275--284},
}
@InProceedings{lr85,
author = {P. Larson and M. V. Ramakrishna},
title = {External perfect hashing},
booktitle = {Proc. ACM SIGMOD Conf.},
address = {Austin TX},
month = {June},
year = {1985},
pages = {190--199},
}
@Book{m84,
author = {K. Mehlhorn},
editor = {W. Brauer and G. Rozenberg and A. Salomaa},
title = {Data Structures and Algorithms 1: Sorting and Searching},
publisher = {Springer-Verlag},
year = {1984},
}
@PhdThesis{c92,
author = {Q. F. Chen},
title = {An Object-Oriented Database System for Efficient Information Retrieval Applications},
school = {Virginia Tech Dept. of Computer Science},
year = {1992},
month = {March}
}
@article {er59,
AUTHOR = {Erd{\H{o}}s, P. and R{\'e}nyi, A.},
TITLE = {On random graphs {I}},
JOURNAL = {Pub. Math. Debrecen},
VOLUME = {6},
YEAR = {1959},
PAGES = {290--297},
MRCLASS = {05.00},
MRNUMBER = {MR0120167 (22 \#10924)},
MRREVIEWER = {A. Dvoretzky},
}
@article {erdos61,
AUTHOR = {Erd{\H{o}}s, P. and R{\'e}nyi, A.},
TITLE = {On the evolution of random graphs},
JOURNAL = {Bull. Inst. Internat. Statist.},
VOLUME = 38,
YEAR = 1961,
PAGES = {343--347},
MRCLASS = {05.40 (55.10)},
MRNUMBER = {MR0148055 (26 \#5564)},
}
@article {er60,
AUTHOR = {Erd{\H{o}}s, P. and R{\'e}nyi, A.},
TITLE = {On the evolution of random graphs},
JOURNAL = {Magyar Tud. Akad. Mat. Kutat\'o Int. K\"ozl.},
VOLUME = {5},
YEAR = {1960},
PAGES = {17--61},
MRCLASS = {05.40},
MRNUMBER = {MR0125031 (23 \#A2338)},
MRREVIEWER = {J. Riordan},
}
@Article{er60:_Old,
author = {P. Erd{\H{o}}s and A. R\'enyi},
title = {On the evolution of random graphs},
journal = {Publications of the Mathematical Institute of the Hungarian
Academy of Sciences},
year = {1960},
volume = {56},
pages = {17-61}
}
@Article{er61,
author = {P. Erd{\H{o}}s and A. R\'enyi},
title = {On the strength of connectedness of a random graph},
journal = {Acta Mathematica Scientia Hungary},
year = {1961},
volume = {12},
pages = {261-267}
}
@Article{bp04,
author = {B. Bollob\'as and O. Pikhurko},
title = {Integer Sets with Prescribed Pairwise Differences Being Distinct},
journal = {European Journal of Combinatorics},
OPTkey = {},
OPTvolume = {},
OPTnumber = {},
OPTpages = {},
OPTmonth = {},
note = {To Appear},
OPTannote = {}
}
@Article{pw04:_OLD,
author = {B. Pittel and N. C. Wormald},
title = {Counting connected graphs inside-out},
journal = {Journal of Combinatorial Theory},
OPTkey = {},
OPTvolume = {},
OPTnumber = {},
OPTpages = {},
OPTmonth = {},
note = {To Appear},
OPTannote = {}
}
@Article{mr95,
author = {M. Molloy and B. Reed},
title = {A critical point for random graphs with a given degree sequence},
journal = {Random Structures and Algorithms},
year = {1995},
volume = {6},
pages = {161-179}
}
@TechReport{bmz04,
author = {F. C. Botelho and D. Menoti and N. Ziviani},
title = {A New algorithm for constructing minimal perfect hash functions},
institution = {Federal Univ. of Minas Gerais},
year = {2004},
OPTkey = {},
OPTtype = {},
number = {TR004},
OPTaddress = {},
OPTmonth = {},
note = {(http://www.dcc.ufmg.br/\texttt{\~ }nivio/pub/technicalreports.html)},
OPTannote = {}
}
@Article{mr98,
author = {M. Molloy and B. Reed},
title = {The size of the giant component of a random graph with a given degree sequence},
journal = {Combinatorics, Probability and Computing},
year = {1998},
volume = {7},
pages = {295-305}
}
@misc{h98,
author = {D. Hawking},
title = {Overview of TREC-7 Very Large Collection Track (Draft for Notebook)},
url = {citeseer.ist.psu.edu/4991.html},
year = {1998}}
@book {jlr00,
AUTHOR = {Janson, S. and {\L}uczak, T. and Ruci{\'n}ski, A.},
TITLE = {Random graphs},
PUBLISHER = {Wiley-Inter.},
YEAR = 2000,
PAGES = {xii+333},
ISBN = {0-471-17541-2},
MRCLASS = {05C80 (60C05 82B41)},
MRNUMBER = {2001k:05180},
MRREVIEWER = {Mark R. Jerrum},
}
@incollection {jlr90,
AUTHOR = {Janson, Svante and {\L}uczak, Tomasz and Ruci{\'n}ski,
Andrzej},
TITLE = {An exponential bound for the probability of nonexistence of a
specified subgraph in a random graph},
BOOKTITLE = {Random graphs '87 (Pozna\'n, 1987)},
PAGES = {73--87},
PUBLISHER = {Wiley},
ADDRESS = {Chichester},
YEAR = 1990,
MRCLASS = {05C80 (60C05)},
MRNUMBER = {91m:05168},
MRREVIEWER = {J. Spencer},
}
@book {b01,
AUTHOR = {Bollob{\'a}s, B.},
TITLE = {Random graphs},
SERIES = {Cambridge Studies in Advanced Mathematics},
VOLUME = 73,
EDITION = {Second},
PUBLISHER = {Cambridge University Press},
ADDRESS = {Cambridge},
YEAR = 2001,
PAGES = {xviii+498},
ISBN = {0-521-80920-7; 0-521-79722-5},
MRCLASS = {05C80 (60C05)},
MRNUMBER = {MR1864966 (2002j:05132)},
}
@article {pw04,
AUTHOR = {Pittel, Boris and Wormald, Nicholas C.},
TITLE = {Counting connected graphs inside-out},
JOURNAL = {J. Combin. Theory Ser. B},
FJOURNAL = {Journal of Combinatorial Theory. Series B},
VOLUME = 93,
YEAR = 2005,
NUMBER = 2,
PAGES = {127--172},
ISSN = {0095-8956},
CODEN = {JCBTB8},
MRCLASS = {05C30 (05A16 05C40 05C80)},
MRNUMBER = {MR2117934 (2005m:05117)},
MRREVIEWER = {Edward A. Bender},
}

112
vldb07/relatedwork.tex Executable file
View File

@ -0,0 +1,112 @@
% Time-stamp: <Monday 30 Jan 2006 03:06:57am EDT yoshi@ime.usp.br>
\vspace{-3mm}
\section{Related work}
\label{sec:relatedprevious-work}
\vspace{-2mm}
% Optimal speed for hashing means that each key from the key set $S$
% will map to an unique location in the hash table, avoiding time wasted
% in resolving collisions. That is achieved with a MPHF and
% because of that many algorithms for constructing static
% and dynamic MPHFs, when static or dynamic sets are involved,
% were developed. Our focus has been on static MPHFs, since
% in many applications the key sets change slowly, if at all~\cite{s05}.
\enlargethispage{2\baselineskip}
Czech, Havas and Majewski~\cite{chm97} provide a
comprehensive survey of the most important theoretical and practical results
on perfect hashing.
In this section we review some of the most important results.
%We also present more recent algorithms that share some features with
%the one presented hereinafter.
Fredman, Koml\'os and Szemer\'edi~\cite{FKS84} showed that it is possible to
construct space efficient perfect hash functions that can be evaluated in
constant time with table sizes that are linear in the number of keys:
$m=O(n)$. In their model of computation, an element of the universe~$U$ fits
into one machine word, and arithmetic operations and memory accesses have unit
cost. Randomized algorithms in the FKS model can construct a perfect hash
function in expected time~$O(n)$:
this is the case of our algorithm and the works in~\cite{chm92,p99}.
Mehlhorn~\cite{m84} showed
that at least $\Omega((1/\ln 2)n + \ln\ln u)$ bits are
required to represent a MPHF (i.e., at least 1.4427 bits per
key must be stored).
To the best of our knowledge, our algorithm
is the first one capable of generating MPHFs for sets in the order
of billions of keys, and the generated functions
require less than 9 bits per key to be stored.
This is one order of magnitude larger than the largest
key set for which a MPHF has been obtained in the literature~\cite{bkz05}.
%which is close to the lower bound presented in~\cite{m84}.
Some work on minimal perfect hashing has been done under the assumption that
the algorithm can pick and store truly random functions~\cite{bkz05,chm92,p99}.
Since the space requirements for truly random functions make them unsuitable for
implementation, one has to settle for pseudo-random functions in practice.
Empirical studies show that limited randomness properties are often as good as
total randomness.
We could verify that phenomenon in our experiments by using the universal hash
function proposed by Jenkins~\cite{j97}, which is
time efficient at retrieval time and requires just an integer to be used as a
random seed (the function is completely determined by the seed).
% Os trabalhos~\cite{asw00,swz00} apresentam algoritmos para construir
% FHPs e FHPMs deterministicamente.
% As fun\c{c}\~oes geradas necessitam de $O(n \log(n) + \log(\log(u)))$ bits para serem descritas.
% A complexidade de caso m\'edio dos algoritmos para gerar as fun\c{c}\~oes \'e
% $O(n\log(n) \log( \log (u)))$ e a de pior caso \'e $O(n^3\log(n) \log(\log(u)))$.
% A complexidade de avalia\c{c}\~ao das fun\c{c}\~oes \'e $O(\log(n) + \log(\log(u)))$.
% Assim, os algoritmos n\~ao geram fun\c{c}\~oes que podem ser avaliadas com complexidade
% de tempo $O(1)$, est\~ao distantes a um fator de $\log n$ da complexidade \'otima para descrever
% FHPs e FHPMs (Mehlhorn mostra em~\cite{m84}
% que para armazenar uma FHP s\~ao necess\'arios no m\'{\i}nimo
% $\Omega(n^2/(2\ln 2) m + \log\log u)$ bits), e n\~ao geram as
% fun\c{c}\~oes com complexidade linear.
% Al\'em disso, o universo $U$ das chaves \'e restrito a n\'umeros inteiros, o que pode
% limitar a utiliza\c{c}\~ao na pr\'atica.
Pagh~\cite{p99} proposed a family of randomized algorithms for
constructing MPHFs
where the form of the resulting function is $h(x) = (f(x) + d[g(x)]) \bmod n$,
where $f$ and $g$ are universal hash functions and $d$ is a set of
displacement values to resolve collisions that are caused by the function $f$.
Pagh identified a set of conditions concerning $f$ and $g$ and showed
that if these conditions are satisfied, then a minimal perfect hash
function can be computed in expected time $O(n)$ and stored in
$(2+\epsilon)n\log_2n$ bits.
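To make the form of these functions concrete, the following minimal C sketch evaluates a function of the hash-and-displace form above, assuming that $f$, $g$ and the displacement table $d$ have already been constructed; it illustrates only the shape of the resulting function, not Pagh's construction algorithm, and the helper names are hypothetical.
\begin{lstlisting}[language=C]
/* Minimal sketch of evaluating h(x) = (f(x) + d[g(x)]) mod n, where f and g
   are universal hash functions and d is the displacement table; g is assumed
   to map into the index range of d. */
#include <stdint.h>

extern uint32_t f(const char *x);   /* hypothetical universal hash function */
extern uint32_t g(const char *x);   /* hypothetical universal hash function */

uint32_t hash_and_displace(const char *x, const uint32_t d[], uint32_t n)
{
    return (f(x) + d[g(x)]) % n;    /* d resolves the collisions caused by f */
}
\end{lstlisting}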
Dietzfelbinger and Hagerup~\cite{dh01} improved~\cite{p99},
reducing from $(2+\epsilon)n\log_2n$ to $(1+\epsilon)n\log_2n$ the number of bits
required to store the function, but in their approach~$f$ and~$g$ must
be chosen from a class
of hash functions that meet additional requirements.
%Differently from the works in~\cite{dh01, p99}, our algorithm generates a MPHF
%$h$ in expected linear time and $h$ can be stored in $O(n)$ bits (9 bits per key).
% Galli, Seybold e Simon~\cite{gss01} propuseram um algoritmo r\^andomico
% que gera FHPMs da mesma forma das geradas pelos algoritmos de Pagh~\cite{p99}
% e, Dietzfelbinger e Hagerup~\cite{dh01}. No entanto, eles definiram a forma das
% fun\c{c}\~oes $f(k) = h_c(k) \bmod n$ e $g(k) = \lfloor h_c(k)/n \rfloor$ para obter em tempo esperado $O(n)$ uma fun\c{c}\~ao que pode ser descrita em $O(n\log n)$ bits, onde
% $h_c(k) = (ck \bmod p) \bmod n^2$, $1 \leq c \leq p-1$ e $p$ um primo maior do que $u$.
%Our algorithm is the first one capable of generating MPHFs for sets in the order of
%billion of keys. It happens because we do not need to keep into main memory
%at generation time complex data structures as a graph, lists and so on. We just need to maintain
%a small vector that occupies around 8MB for a set of 1 billion keys.
Fox et al.~\cite{fch92,fhcd92} studied MPHFs
%that also share features with the ones generated by our algorithm.
that bring the storage requirements down to between 2 and 4 bits per key,
below the ones we obtain.
However, it is shown in~\cite[Section 6.7]{chm97} that their algorithms have exponential
running times and, in our implementation of their algorithm, do not scale to sets
larger than 11 million keys.
Our previous work~\cite{bkz05} improves on the algorithm by Czech, Havas and Majewski~\cite{chm92}.
We obtained more compact functions in less time. Although
the algorithm in~\cite{bkz05} is the fastest algorithm
we know of, the resulting functions are stored in $O(n\log n)$ bits and
one needs to keep in main memory at generation time a random graph of $n$ edges
and $cn$ vertices,
where $c\in[0.93,1.15]$. Using the well-known divide-and-conquer approach,
we use that algorithm as a building block for the new one, in which the
resulting functions are stored in $O(n)$ bits.

155
vldb07/searching.tex Executable file
View File

@ -0,0 +1,155 @@
%% Nivio: 22/jan/06
% Time-stamp: <Monday 30 Jan 2006 03:57:35am EDT yoshi@ime.usp.br>
\vspace{-7mm}
\subsection{Searching step}
\label{sec:searching}
\enlargethispage{2\baselineskip}
The searching step is responsible for generating a MPHF for each
bucket.
Figure~\ref{fig:searchingstep} presents the searching step algorithm.
\vspace{-2mm}
\begin{figure}[h]
%\centering
\hrule
\hrule
\vspace{2mm}
\begin{tabbing}
aa\=type booleanx \== (false, true); \kill
\> $\blacktriangleright$ Let $H$ be a minimum heap of size $N$, where the \\
\> ~~ order relation in $H$ is given by Eq.~(\ref{eq:bucketindex}), that is, the\\
\> ~~ remove operation removes the item with smallest $i$\\
\> $1.$ {\bf for} $j = 1$ {\bf to} $N$ {\bf do} \{ Heap construction \}\\
\> ~~ $1.1$ Read key $k$ from File $j$ on disk\\
\> ~~ $1.2$ Insert $(i, j, k)$ in $H$ \\
\> $2.$ {\bf for} $i = 0$ {\bf to} $\lceil n/b \rceil - 1$ {\bf do} \\
\> ~~ $2.1$ Read bucket $i$ from disk driven by heap $H$ \\
\> ~~ $2.2$ Generate a MPHF for bucket $i$ \\
\> ~~ $2.3$ Write the description of MPHF$_i$ to the disk
\end{tabbing}
\vspace{-1mm}
\hrule
\hrule
\caption{Searching step}
\label{fig:searchingstep}
\vspace{-4mm}
\end{figure}
Statement 1 of Figure~\ref{fig:searchingstep} inserts one key from each file
in a minimum heap $H$ of size $N$.
The order relation in $H$ is the bucket address $i$ computed by
Eq.~(\ref{eq:bucketindex}).
%\enlargethispage{-\baselineskip}
Statement 2 has two important steps.
In statement 2.1, a bucket is read from disk,
as described below.
%in Section~\ref{sec:readingbucket}.
In statement 2.2, a MPHF is generated for each bucket $i$, as described
in the following.
%in Section~\ref{sec:mphfbucket}.
The description of MPHF$_i$ is a vector $g_i$ of 8-bit integers.
Finally, statement 2.3 writes the description $g_i$ of MPHF$_i$ to disk.
\vspace{-3mm}
\label{sec:readingbucket}
\subsubsection{Reading a bucket from disk.}
In this section we present the refinement of statement 2.1 of
Figure~\ref{fig:searchingstep}.
The algorithm to read bucket $i$ from disk is presented
in Figure~\ref{fig:readingbucket}.
\begin{figure}[h]
\hrule
\hrule
\vspace{2mm}
\begin{tabbing}
aa\=type booleanx \== (false, true); \kill
\> $1.$ {\bf while} bucket $i$ is not full {\bf do} \\
\> ~~ $1.1$ Remove $(i, j, k)$ from $H$\\
\> ~~ $1.2$ Insert $k$ into bucket $i$ \\
\> ~~ $1.3$ Read sequentially all keys $k$ from File $j$ that have \\
\> ~~~~~~~ the same $i$ and insert them into bucket $i$ \\
\> ~~ $1.4$ Insert the triple $(i, j, x)$ in $H$, where $x$ is the first \\
\> ~~~~~~~ key read from File $j$ that does not have the \\
\> ~~~~~~~ same bucket index $i$
\end{tabbing}
\hrule
\hrule
\vspace{-1.0mm}
\caption{Reading a bucket}
\vspace{-4.0mm}
\label{fig:readingbucket}
\end{figure}
Bucket $i$ is distributed among many files and the heap $H$ is used to drive a
multiway merge operation.
In Figure~\ref{fig:readingbucket}, statement 1.1 extracts and removes triple
$(i, j, k)$ from $H$, where $i$ is a minimum value in $H$.
Statement 1.2 inserts key $k$ in bucket $i$.
Notice that the $k$ in the triple $(i, j, k)$ is in fact a pointer to
the first byte of the key that is kept in contiguous positions of an array of characters
(this array containing the keys is initialized during the heap construction
in statement 1 of Figure~\ref{fig:searchingstep}).
Statement 1.3 performs a seek operation in File $j$ on disk for the first
read operation and reads sequentially all keys $k$ that have the same $i$
%(obtained from Eq.~(\ref{eq:bucketindex}))
and inserts them all in bucket $i$.
Finally, statement 1.4 inserts in $H$ the triple $(i, j, x)$,
where $x$ is the first key read from File $j$ (in statement 1.3)
that does not have the same bucket address as the previous keys.
The number of seek operations on disk performed in statement 1.3 is discussed
in Section~\ref{sec:linearcomplexity},
where we present a buffering technique that brings down
the time spent with seeks.
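As an illustration of statements 1.1--1.4, the following C sketch reads one bucket through the heap-driven multiway merge. It is a minimal sketch: the helpers \texttt{heap\_extract\_min}, \texttt{heap\_insert}, \texttt{read\_key} and \texttt{bucket\_index} are hypothetical names, not part of our implementation.
\begin{lstlisting}[language=C]
/* Minimal sketch of reading bucket i via the multiway merge driven by the
   minimum heap H; read_key() returns NULL when File j is exhausted. */
typedef struct { int i; int j; char *k; } triple;   /* bucket, file, key */

extern triple heap_extract_min(void);       /* removes the item with smallest i  */
extern void   heap_insert(triple t);
extern char  *read_key(int file_j);         /* next key from File j, or NULL     */
extern int    bucket_index(const char *k);  /* h_0(k) mod ceil(n/b)              */

/* Fills bucket[] with the expected number of keys of bucket i. */
int read_bucket(int i, char *bucket[], int expected)
{
    int filled = 0;
    while (filled < expected) {                   /* 1. while bucket i is not full  */
        triple t = heap_extract_min();            /* 1.1 triple with smallest i     */
        bucket[filled++] = t.k;                   /* 1.2 insert key into bucket i   */
        for (;;) {                                /* 1.3 keys from File j that      */
            char *k = read_key(t.j);              /*     have the same bucket index */
            if (k == NULL)
                break;                            /* File j is exhausted            */
            if (bucket_index(k) != i) {           /* 1.4 first key of another       */
                triple next = { bucket_index(k), t.j, k };
                heap_insert(next);                /*     bucket goes back into H    */
                break;
            }
            bucket[filled++] = k;
        }
    }
    return filled;
}
\end{lstlisting}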
\vspace{-2mm}
\enlargethispage{2\baselineskip}
\subsubsection{Generating a MPHF for each bucket.} \label{sec:mphfbucket}
To the best of our knowledge the algorithm we have designed in
our previous work~\cite{bkz05} is the fastest published algorithm for
constructing MPHFs.
That is why we are using that algorithm as a building block for the
algorithm presented here.
%\enlargethispage{-\baselineskip}
Our previous algorithm is a three-step internal memory based algorithm
that produces a MPHF based on random graphs.
For a set of $n$ keys, the algorithm outputs the resulting MPHF in expected time $O(n)$.
For a given bucket $i$, $0 \leq i < \lceil n/b \rceil$, the corresponding MPHF$_i$
has the following form:
\begin{eqnarray}
\mathrm{MPHF}_i(k) &=& g_i[a] + g_i[b] \label{eq:mphfi}
\end{eqnarray}
where $a = h_{i1}(k) \bmod t$, $b = h_{i2}(k) \bmod t$ and
$t = c\times \mathit{size}[i]$. The functions
$h_{i1}(k)$ and $h_{i2}(k)$ are instances of the universal hash function proposed by Jenkins~\cite{j97}
that was used in the partitioning step described in Section~\ref{sec:partitioning-keys}.
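As a minimal illustration of Eq.~(\ref{eq:mphfi}) (not our actual code), MPHF$_i$ can be evaluated as follows, where \texttt{jenkins\_hash} and the two seeds are hypothetical stand-ins for $h_{i1}$ and $h_{i2}$:
\begin{lstlisting}[language=C]
/* Minimal sketch: evaluate MPHF_i(k) = g_i[a] + g_i[b] for one bucket, with
   a = h_i1(k) mod t, b = h_i2(k) mod t and t = c * size[i]. */
#include <stdint.h>

extern uint32_t jenkins_hash(const char *key, uint32_t seed);  /* hypothetical */

uint32_t mphf_bucket(const char *k, const uint8_t g_i[], uint32_t t,
                     uint32_t seed1, uint32_t seed2)
{
    uint32_t a = jenkins_hash(k, seed1) % t;   /* a = h_i1(k) mod t         */
    uint32_t b = jenkins_hash(k, seed2) % t;   /* b = h_i2(k) mod t         */
    return (uint32_t)g_i[a] + g_i[b];          /* value in [0, size[i] - 1] */
}
\end{lstlisting}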
In order to generate the function of Eq.~(\ref{eq:mphfi}), the algorithm builds simple random graphs
$G_i = (V_i, E_i)$ with~$|V_i|=t=c\times\mathit{size}[i]$ and $|E_i|=\mathit{size}[i]$, with $c \in [0.93, 1.15]$.
To generate a simple random graph with high
probability\footnote{We use the term `with high probability'
to mean `with probability tending to~$1$ as~$n\to\infty$'.}, two vertices $a$ and $b$ are
computed for each key $k$ in bucket $i$.
Thus, each bucket $i$ has a corresponding graph~$G_i=(V_i,E_i)$, where $V_i=\{0,1,
\ldots,t-1\}$ and $E_i=\big\{\{a,b\}:k \in \mathrm{bucket}\: i\big\}$.
In order to get a simple graph,
the algorithm repeatedly selects $h_{i1}$ and $h_{i2}$ from a family of universal hash functions
until the corresponding graph is simple.
The probability of getting a simple graph is $p=e^{-1/c^2}$.
For $c=1$, this probability is $p \simeq 0.368$, and the expected number of
iterations to obtain a simple graph is~$1/p \simeq 2.72$.
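The following minimal C sketch illustrates this retry loop; \texttt{jenkins\_hash} and \texttt{random\_seed} are hypothetical helpers, and the quadratic duplicate-edge check is acceptable here only because a bucket has at most 256 keys.
\begin{lstlisting}[language=C]
/* Minimal sketch: draw new seeds for h_i1 and h_i2 until the graph G_i defined
   by the keys of the bucket is simple (no self-loops or parallel edges). */
#include <stdint.h>

extern uint32_t jenkins_hash(const char *key, uint32_t seed);  /* hypothetical */
extern uint32_t random_seed(void);                             /* hypothetical */

int build_simple_graph(char *keys[], int m, uint32_t t,
                       uint32_t edge_a[], uint32_t edge_b[],
                       uint32_t *seed1, uint32_t *seed2)
{
    for (;;) {                                     /* expected 1/p iterations   */
        int simple = 1;
        *seed1 = random_seed();
        *seed2 = random_seed();
        for (int e = 0; e < m && simple; e++) {
            edge_a[e] = jenkins_hash(keys[e], *seed1) % t;
            edge_b[e] = jenkins_hash(keys[e], *seed2) % t;
            if (edge_a[e] == edge_b[e])            /* self-loop                 */
                simple = 0;
            for (int f = 0; f < e && simple; f++)  /* parallel edge             */
                if ((edge_a[e] == edge_a[f] && edge_b[e] == edge_b[f]) ||
                    (edge_a[e] == edge_b[f] && edge_b[e] == edge_a[f]))
                    simple = 0;
        }
        if (simple)
            return 1;   /* G_i is simple; the seeds define h_i1 and h_i2 */
    }
}
\end{lstlisting}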
The construction of MPHF$_i$ ends with a computation of a suitable labelling of the vertices
of~$G_i$. The labelling is stored into vector $g_i$.
We choose~$g_i[v]$ for each~$v\in V_i$ in such
a way that Eq.~(\ref{eq:mphfi}) is a MPHF for bucket $i$.
In order to get the values of each entry of $g_i$ we first
run a breadth-first search on the 2-\textit{core} of $G_i$, i.e., the maximal subgraph
of~$G_i$ with minimum degree at least~$2$ (see, e.g., \cite{b01,jlr00,pw04}) and
a depth-first search on the acyclic part of $G_i$ (see \cite{bkz05} for details).

77
vldb07/svglov2.clo Normal file
View File

@ -0,0 +1,77 @@
% SVJour2 DOCUMENT CLASS OPTION SVGLOV2 -- for standardised journals
%
% This is an enhancement for the LaTeX
% SVJour2 document class for Springer journals
%
%%
%%
%% \CharacterTable
%% {Upper-case \A\B\C\D\E\F\G\H\I\J\K\L\M\N\O\P\Q\R\S\T\U\V\W\X\Y\Z
%% Lower-case \a\b\c\d\e\f\g\h\i\j\k\l\m\n\o\p\q\r\s\t\u\v\w\x\y\z
%% Digits \0\1\2\3\4\5\6\7\8\9
%% Exclamation \! Double quote \" Hash (number) \#
%% Dollar \$ Percent \% Ampersand \&
%% Acute accent \' Left paren \( Right paren \)
%% Asterisk \* Plus \+ Comma \,
%% Minus \- Point \. Solidus \/
%% Colon \: Semicolon \; Less than \<
%% Equals \= Greater than \> Question mark \?
%% Commercial at \@ Left bracket \[ Backslash \\
%% Right bracket \] Circumflex \^ Underscore \_
%% Grave accent \` Left brace \{ Vertical bar \|
%% Right brace \} Tilde \~}
\ProvidesFile{svglov2.clo}
[2004/10/25 v2.1
style option for standardised journals]
\typeout{SVJour Class option: svglov2.clo for standardised journals}
\def\validfor{svjour2}
\ExecuteOptions{final,10pt,runningheads}
% No size changing allowed, hence a copy of size10.clo is included
\renewcommand\normalsize{%
\@setfontsize\normalsize{10.2pt}{4mm}%
\abovedisplayskip=3 mm plus6pt minus 4pt
\belowdisplayskip=3 mm plus6pt minus 4pt
\abovedisplayshortskip=0.0 mm plus6pt
\belowdisplayshortskip=2 mm plus4pt minus 4pt
\let\@listi\@listI}
\normalsize
\newcommand\small{%
\@setfontsize\small{8.7pt}{3.25mm}%
\abovedisplayskip 8.5\p@ \@plus3\p@ \@minus4\p@
\abovedisplayshortskip \z@ \@plus2\p@
\belowdisplayshortskip 4\p@ \@plus2\p@ \@minus2\p@
\def\@listi{\leftmargin\leftmargini
\parsep 0\p@ \@plus1\p@ \@minus\p@
\topsep 4\p@ \@plus2\p@ \@minus4\p@
\itemsep0\p@}%
\belowdisplayskip \abovedisplayskip
}
\let\footnotesize\small
\newcommand\scriptsize{\@setfontsize\scriptsize\@viipt\@viiipt}
\newcommand\tiny{\@setfontsize\tiny\@vpt\@vipt}
\newcommand\large{\@setfontsize\large\@xiipt{14pt}}
\newcommand\Large{\@setfontsize\Large\@xivpt{16dd}}
\newcommand\LARGE{\@setfontsize\LARGE\@xviipt{17dd}}
\newcommand\huge{\@setfontsize\huge\@xxpt{25}}
\newcommand\Huge{\@setfontsize\Huge\@xxvpt{30}}
%
%ALT% \def\runheadhook{\rlap{\smash{\lower5pt\hbox to\textwidth{\hrulefill}}}}
\def\runheadhook{\rlap{\smash{\lower11pt\hbox to\textwidth{\hrulefill}}}}
\AtEndOfClass{\advance\headsep by5pt}
\if@twocolumn
\setlength{\textwidth}{17.6cm}
\setlength{\textheight}{230mm}
\AtEndOfClass{\setlength\columnsep{4mm}}
\else
\setlength{\textwidth}{11.7cm}
\setlength{\textheight}{517.5dd} % 19.46cm
\fi
%
\AtBeginDocument{%
\@ifundefined{@journalname}
{\typeout{Unknown journal: specify \string\journalname\string{%
<name of your journal>\string} in preambel^^J}}{}}
%
\endinput
%%
%% End of file `svglov2.clo'.

1419
vldb07/svjour2.cls Normal file

File diff suppressed because it is too large

18
vldb07/terminology.tex Executable file
View File

@ -0,0 +1,18 @@
% Time-stamp: <Sunday 29 Jan 2006 11:55:42pm EST yoshi@flare>
\vspace{-3mm}
\section{Notation and terminology}
\vspace{-2mm}
\label{sec:notation}
\enlargethispage{2\baselineskip}
The essential notation and terminology used throughout this paper are as follows.
\begin{itemize}
\item $U$: key universe. $|U| = u$.
\item $S$: actual static key set. $S \subset U$, $|S| = n \ll u$.
\item $h: U \to M$ is a hash function that maps keys from a universe $U$ into
a given range $M = \{0,1,\dots,m-1\}$ of integer numbers.
\item $h$ is a perfect hash function if it is one-to-one on~$S$, i.e., if
$h(k_1) \not = h(k_2)$ for all $k_1 \not = k_2$ from $S$.
\item $h$ is a minimal perfect hash function (MPHF) if it is one-to-one on~$S$
and $n=m$.
\end{itemize}

78
vldb07/thealgorithm.tex Executable file
View File

@ -0,0 +1,78 @@
%% Nivio: 13/jan/06, 21/jan/06 29/jan/06
% Time-stamp: <Sunday 29 Jan 2006 11:56:25pm EST yoshi@flare>
\vspace{-3mm}
\section{The algorithm}
\label{sec:new-algorithm}
\vspace{-2mm}
\enlargethispage{2\baselineskip}
The main idea supporting our algorithm is the classical divide and conquer technique.
The algorithm is a two-step external memory based algorithm
that generates a MPHF $h$ for a set $S$ of $n$ keys.
Figure~\ref{fig:new-algo-main-steps} illustrates the two steps of the
algorithm: the partitioning step and the searching step.
\vspace{-2mm}
\begin{figure}[ht]
\centering
\begin{picture}(0,0)%
\includegraphics{figs/brz.ps}%
\end{picture}%
\setlength{\unitlength}{4144sp}%
%
\begingroup\makeatletter\ifx\SetFigFont\undefined%
\gdef\SetFigFont#1#2#3#4#5{%
\reset@font\fontsize{#1}{#2pt}%
\fontfamily{#3}\fontseries{#4}\fontshape{#5}%
\selectfont}%
\fi\endgroup%
\begin{picture}(3704,2091)(1426,-5161)
\put(2570,-4301){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}0}}}}
\put(2782,-4301){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}1}}}}
\put(2996,-4301){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}2}}}}
\put(4060,-4006){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}Buckets}}}}
\put(3776,-4301){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}${\lceil n/b\rceil - 1}$}}}}
\put(4563,-3329){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}Key Set $S$}}}}
\put(2009,-3160){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}0}}}}
\put(2221,-3160){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}1}}}}
\put(4315,-3160){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}n-1}}}}
\put(1992,-5146){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}0}}}}
\put(2204,-5146){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}1}}}}
\put(4298,-5146){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}n-1}}}}
\put(4546,-4977){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}Hash Table}}}}
\put(1441,-3616){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}Partitioning}}}}
\put(1441,-4426){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}Searching}}}}
\put(1981,-4786){\makebox(0,0)[lb]{\smash{{\SetFigFont{5}{6.0}{\familydefault}{\mddefault}{\updefault}MPHF$_0$}}}}
\put(2521,-4786){\makebox(0,0)[lb]{\smash{{\SetFigFont{5}{6.0}{\familydefault}{\mddefault}{\updefault}MPHF$_1$}}}}
\put(3016,-4786){\makebox(0,0)[lb]{\smash{{\SetFigFont{5}{6.0}{\familydefault}{\mddefault}{\updefault}MPHF$_2$}}}}
\put(3826,-4786){\makebox(0,0)[lb]{\smash{{\SetFigFont{5}{6.0}{\familydefault}{\mddefault}{\updefault}MPHF$_{\lceil n/b \rceil - 1}$}}}}
\end{picture}%
\vspace{-1mm}
\caption{Main steps of our algorithm}
\label{fig:new-algo-main-steps}
\vspace{-3mm}
\end{figure}
The partitioning step takes a key set $S$ and uses a universal hash function
$h_0$ proposed by Jenkins~\cite{j97}
%for each key $k \in S$ of length $|k|$
to transform each key~$k\in S$ into an integer~$h_0(k)$.
Reducing~$h_0(k)$ modulo~$\lceil n/b\rceil$, we partition~$S$ into $\lceil n/b
\rceil$ buckets, each containing at most 256 keys (with high
probability).
The searching step generates a MPHF$_i$ for each bucket $i$,
$0 \leq i < \lceil n/b \rceil$.
The resulting MPHF $h(k)$, $k \in S$, is given by
\begin{eqnarray}\label{eq:mphf}
h(k) = \mathrm{MPHF}_i (k) + \mathit{offset}[i],
\end{eqnarray}
where~$i=h_0(k)\bmod\lceil n/b\rceil$.
The $i$th entry~$\mathit{offset}[i]$ of the displacement vector
$\mathit{offset}$, $0 \leq i < \lceil n/b \rceil$, contains the total number
of keys in the buckets from 0 to $i-1$, that is, it gives the interval of the
keys in the hash table addressed by the MPHF$_i$. In the following we explain
each step in detail.
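Before doing so, we make Equation~(\ref{eq:mphf}) concrete with a short C
sketch of how a lookup could be evaluated once the searching step has produced
the functions MPHF$_i$ and the vector $\mathit{offset}$. The struct layout and
the helper names (\texttt{h0}, \texttt{bucket\_mphf\_eval}) are illustrative
assumptions, not the interface of the code available at
\texttt{http://cmph.sf.net}.
\begin{lstlisting}[language=C]
/* Hypothetical in-memory layout of the resulting MPHF $h$. */
typedef struct {
  unsigned   nbuckets;    /* number of buckets, $\lceil n/b \rceil$ */
  unsigned  *offset;      /* offset[i] = keys in buckets $0,\dots,i-1$ */
  void     **bucket_mphf; /* description of MPHF$_i$ for each bucket $i$ */
} mphf_t;

/* Assumed to be provided elsewhere: the Jenkins hash $h_0$ and the
   evaluator of the per-bucket function MPHF$_i$. */
unsigned h0(const char *key, unsigned len);
unsigned bucket_mphf_eval(void *mphf_i, const char *key, unsigned len);

/* Builds offset as the prefix sum of the bucket sizes produced
   by the partitioning step. */
void build_offset(unsigned *offset, const unsigned *size, unsigned nbuckets)
{
  unsigned i, sum = 0;
  for (i = 0; i < nbuckets; i++) { offset[i] = sum; sum += size[i]; }
}

/* Evaluates $h(k) = \mathrm{MPHF}_i(k) + \mathit{offset}[i]$,
   where $i = h_0(k) \bmod \lceil n/b \rceil$. */
unsigned mphf_eval(mphf_t *f, const char *key, unsigned len)
{
  unsigned i = h0(key, len) % f->nbuckets;
  return bucket_mphf_eval(f->bucket_mphf[i], key, len) + f->offset[i];
}
\end{lstlisting}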

21
vldb07/thedataandsetup.tex Executable file
View File

@ -0,0 +1,21 @@
% Nivio: 29/jan/06
% Time-stamp: <Sunday 29 Jan 2006 11:57:40pm EST yoshi@flare>
\vspace{-2mm}
\subsection{The data and the experimental setup}
\label{sec:data-exper-set}
The algorithms were implemented in the C language and
are available at \texttt{http://\-cmph.sf.net}
under the GNU Lesser General Public License (LGPL).
% free software licence.
All experiments were carried out on
a computer running the Linux operating system, version 2.6,
with a 2.4 gigahertz processor and
1 gigabyte of main memory.
In the experiments related to the new
algorithm we limited the available main memory to 500 megabytes.
Our data consists of a collection of 1 billion
URLs collected from the Web, each URL 64 characters long on average.
The collection occupies 60.5 gigabytes on disk.

194
vldb07/vldb.tex Normal file
View File

@ -0,0 +1,194 @@
%%%%%%%%%%%%%%%%%%%%%%% file template.tex %%%%%%%%%%%%%%%%%%%%%%%%%
%
% This is a template file for the LaTeX package SVJour2 for the
% Springer journal "The VLDB Journal".
%
% Springer Heidelberg 2004/12/03
%
% Copy it to a new file with a new name and use it as the basis
% for your article. Delete % as needed.
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% First comes an example EPS file -- just ignore it and
% proceed on the \documentclass line
% your LaTeX will extract the file if required
%\begin{filecontents*}{figs/minimalperfecthash-ph-mph.ps}
%!PS-Adobe-3.0 EPSF-3.0
%%BoundingBox: 19 19 221 221
%%CreationDate: Mon Sep 29 1997
%%Creator: programmed by hand (JK)
%%EndComments
%gsave
%newpath
% 20 20 moveto
% 20 220 lineto
% 220 220 lineto
% 220 20 lineto
%closepath
%2 setlinewidth
%gsave
% .4 setgray fill
%grestore
%stroke
%grestore
%\end{filecontents*}
%
\documentclass[twocolumn,fleqn,runningheads]{svjour2}
%
\smartqed % flush right qed marks, e.g. at end of proof
%
\usepackage{graphicx}
\usepackage{listings}
\usepackage{epsfig}
\usepackage{textcomp}
\usepackage[latin1]{inputenc}
\usepackage{amssymb}
%\DeclareGraphicsExtensions{.png}
%
% \usepackage{mathptmx} % use Times fonts if available on your TeX system
%
% insert here the call for the packages your document requires
%\usepackage{latexsym}
% etc.
%
% please place your own definitions here and don't use \def but
% \newcommand{}{}
%
\lstset{
language=Pascal,
basicstyle=\fontsize{9}{9}\selectfont,
captionpos=t,
aboveskip=1mm,
belowskip=1mm,
abovecaptionskip=1mm,
belowcaptionskip=1mm,
% numbers = left,
mathescape=true,
escapechar=@,
extendedchars=true,
showstringspaces=false,
columns=fixed,
basewidth=0.515em,
frame=single,
framesep=2mm,
xleftmargin=2mm,
xrightmargin=2mm,
framerule=0.5pt
}
\def\cG{{\mathcal G}}
\def\crit{{\rm crit}}
\def\ncrit{{\rm ncrit}}
\def\scrit{{\rm scrit}}
\def\bedges{{\rm bedges}}
\def\ZZ{{\mathbb Z}}
\journalname{The VLDB Journal}
%
\begin{document}
\title{Space and Time Efficient Minimal Perfect Hash \\[0.2cm]
Functions for Very Large Databases\thanks{
This work was supported in part by
GERINDO Project--grant MCT/CNPq/CT-INFO 552.087/02-5,
CAPES/PROF Scholarship (Fabiano C. Botelho),
FAPESP Proj.\ Tem.\ 03/09925-5 and CNPq Grant 30.0334/93-1
(Yoshiharu Kohayakawa),
and CNPq Grant 30.5237/02-0 (Nivio Ziviani).}
}
%\subtitle{Do you have a subtitle?\\ If so, write it here}
%\titlerunning{Short form of title} % if too long for running head
\author{Fabiano C. Botelho \and Davi C. Reis \and Yoshiharu Kohayakawa \and Nivio Ziviani}
%\authorrunning{Short form of author list} % if too long for running head
\institute{
F. C. Botelho \and
N. Ziviani \at
Dept. of Computer Science,
Federal Univ. of Minas Gerais,
Belo Horizonte, Brazil\\
\email{\{fbotelho,nivio\}@dcc.ufmg.br}
\and
D. C. Reis \at
Google, Brazil \\
\email{davi.reis@gmail.com}
\and
Y. Kohayakawa \at
Dept. of Computer Science,
Univ. of S\~ao Paulo,
S\~ao Paulo, Brazil\\
\email{yoshi@ime.usp.br}
}
\date{Received: date / Accepted: date}
% The correct dates will be entered by the editor
\maketitle
\begin{abstract}
We propose a novel external-memory-based algorithm for constructing minimal
perfect hash functions~$h$ for huge sets of keys.
For a set of~$n$ keys, our algorithm outputs~$h$ in time~$O(n)$.
The algorithm needs only a small vector of one-byte entries
in main memory to construct $h$.
The evaluation of~$h(x)$ requires three memory accesses for any key~$x$.
The description of~$h$ takes a constant number of bits
per key, which is asymptotically optimal: the theoretical lower bound is
$1/\ln 2 \approx 1.4427$ bits per key.
In our experiments, we used a collection of 1 billion URLs collected
from the web, each URL 64 characters long on average.
For this collection, our algorithm
(i) finds a minimal perfect hash function in approximately
3 hours using a commodity PC,
(ii) needs just 5.45 megabytes of internal memory to generate $h$
and (iii) takes 8.1 bits per key for the description of~$h$.
\keywords{Minimal Perfect Hashing \and Large Databases}
\end{abstract}
% main text
\def\BSmax{\mathit{BS}_{\mathit{max}}}
\def\Bi{\mathop{\rm Bi}\nolimits}
\input{introduction}
%\input{terminology}
\input{relatedwork}
\input{thealgorithm}
\input{partitioningthekeys}
\input{searching}
%\input{computingoffset}
%\input{hashingbuckets}
\input{determiningb}
%\input{analyticalandexperimentalresults}
\input{analyticalresults}
%\input{results}
\input{conclusions}
%\input{acknowledgments}
%\begin{acknowledgements}
%If you'd like to thank anyone, place your comments here
%and remove the percent signs.
%\end{acknowledgements}
% BibTeX users please use
%\bibliographystyle{spmpsci}
%\bibliography{} % name your BibTeX data base
\bibliographystyle{plain}
\bibliography{references}
\input{appendix}
\end{document}