paper for vldb07 added
This commit is contained in:
parent
b0546b1fcc
commit
fe4600148e
7
vldb07/acknowledgments.tex
Executable file
@ -0,0 +1,7 @@
|
||||
\section{Acknowledgments}
|
||||
This section is optional; it is a location for you
|
||||
to acknowledge grants, funding, editing assistance and
|
||||
what have you. In the present case, for example, the
|
||||
authors would like to thank Gerald Murray of ACM for
|
||||
his help in codifying this \textit{Author's Guide}
|
||||
and the \textbf{.cls} and \textbf{.tex} files that it describes.
|
174
vldb07/analyticalresults.tex
Executable file
@ -0,0 +1,174 @@
|
||||
%% Nivio: 23/jan/06 29/jan/06
|
||||
% Time-stamp: <Monday 30 Jan 2006 03:56:47am EDT yoshi@ime.usp.br>
|
||||
\enlargethispage{2\baselineskip}
|
||||
\section{Analytical results}
|
||||
\label{sec:analytcal-results}
|
||||
|
||||
\vspace{-1mm}
|
||||
The purpose of this section is fourfold.
|
||||
First, we show that our algorithm runs in expected time $O(n)$.
|
||||
Second, we present the main memory requirements for constructing the MPHF.
|
||||
Third, we discuss the cost of evaluating the resulting MPHF.
|
||||
Fourth, we present the space required to store the resulting MPHF.
|
||||
|
||||
\vspace{-2mm}
|
||||
\subsection{The linear time complexity}
|
||||
\label{sec:linearcomplexity}
|
||||
|
||||
First, we show that the partitioning step presented in
|
||||
Figure~\ref{fig:partitioningstep} runs in $O(n)$ time.
|
||||
Each iteration of the {\bf for} loop in statement~1
|
||||
runs in $O(|B_j|)$ time, $1 \leq j \leq N$, where $|B_j|$ is the
|
||||
number of keys
|
||||
that fit in block $B_j$ of size $\mu$. This is because statement 1.1 just reads
|
||||
$|B_j|$ keys from disk, statement 1.2 runs a bucket-sort-like algorithm
|
||||
that is well known to be linear in the number of keys it sorts (i.e., $|B_j|$ keys),
|
||||
and statement 1.3 just dumps $|B_j|$ keys to the disk into File $j$.
|
||||
Thus, the {\bf for} loop runs in $O(\sum_{j=1}^{N}|B_j|)$ time.
|
||||
Since $\sum_{j=1}^{N}|B_j|=n$, the partitioning step runs in $O(n)$ time.
|
||||
|
||||
Second, we show that the searching step presented in
|
||||
Figure~\ref{fig:searchingstep} also runs in $O(n)$ time.
|
||||
The heap construction in statement 1 runs in $O(N)$ time, for $N \ll n$.
|
||||
We have assumed that insertions and deletions in the heap cost $O(1)$ because
|
||||
$N$ is typically much smaller than $n$ (see \cite[Section 6.4]{bkz06t} for details).
|
||||
Statement 2 runs in $O(\sum_{i=0}^{\lceil n/b \rceil - 1} \mathit{size}[i])$ time
|
||||
(remember that $\mathit{size}[i]$ stores the number of keys in bucket $i$).
|
||||
As $\sum_{i=0}^{\lceil n/b \rceil - 1} \mathit{size}[i] = n$, if
|
||||
statements 2.1, 2.2 and 2.3 run in $O(\mathit{size}[i])$ time, then statement 2
|
||||
runs in $O(n)$ time.
|
||||
|
||||
%Statement 2.1 runs the algorithm to read a bucket from disk. That algorithm reads $\mathit{size}[i]$
|
||||
%keys of bucket $i$ that might be spread into many files or, in the worst case,
|
||||
%into $|BS_{max}|$ files, where $|BS_{max}|$ is the number of keys in the bucket of maximum size.
|
||||
%It uses the heap $H$ to drive a multiway merge of the sprayed bucket $i$.
|
||||
%As we are considering that each read/write on disk costs $O(1)$ and
|
||||
%each heap operation also costs $O(1)$ (recall $N \ll n$), then statement 2.1
|
||||
%costs $O(\mathit{size}[i])$ time.
|
||||
%We need to take into account that this step could generate a lot of seeks on disk.
|
||||
%However, the number of seeks can be amortized (see Section~\ref{sec:contr-disk-access})
|
||||
%and that is why we have been able of getting a MPHF for a set of 1 billion keys in less
|
||||
%than 4 hours using a machine with just 500 MB of main memory
|
||||
%(see Section~\ref{sec:performance}).
|
||||
Statement 2.1 reads $O(\mathit{size}[i])$ keys of bucket $i$
|
||||
and is detailed in Figure~\ref{fig:readingbucket}.
|
||||
As we are assuming that each read or write on disk costs $O(1)$ and
|
||||
each heap operation also costs $O(1)$, statement~2.1
|
||||
takes $O(\mathit{size}[i])$ time.
|
||||
However, in the worst case the keys of bucket $i$ are distributed over~$BS_{max}$ files on disk
(recall that $BS_{max}$ is the maximum number of keys found in any bucket).
|
||||
Therefore, we need to take into account that
|
||||
the critical step in reading a bucket is in statement~1.3 of Figure~\ref{fig:readingbucket},
|
||||
where a seek operation in File $j$
|
||||
may be performed by the first read operation.
|
||||
|
||||
In order to amortize the number of seeks performed we use a buffering technique~\cite{k73}.
|
||||
We create a buffer $j$ of size \textbaht$\: = \mu/N$ for each file $j$,
|
||||
where $1\leq j \leq N$
|
||||
(recall that $\mu$ is the size in bytes of an a priori reserved internal memory area).
|
||||
Every time a read operation is requested on file $j$ and the data is not found
in the $j$th~buffer, \textbaht~bytes are read from file $j$ into buffer $j$.
|
||||
Hence, the number of seeks performed in the worst case is given by
|
||||
$\beta/$\textbaht~(remember that $\beta$ is the size in bytes of $S$).
|
||||
Here we make the pessimistic assumption that one seek happens every time
buffer $j$ is filled.
|
||||
Thus, the number of seeks performed in the worst case is $64n/$\textbaht, since
|
||||
each URL is 64 bytes long on average. Therefore, the number of seeks is linear in
|
||||
$n$ and amortized by \textbaht.
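
As a rough illustration of the buffer size, assume $\mu = 200$ megabytes and
$N = 310$ files (the values reported in Table~\ref{tab:diskaccess}) and take
one megabyte to be $1024$ kilobytes. Each buffer then holds
\[
\mbox{\textbaht} \;=\; \frac{\mu}{N} \;=\; \frac{200 \times 1024}{310}\ \mathrm{KB} \;\approx\; 661\ \mathrm{KB},
\]
so that, under the pessimistic assumption above, one seek is charged for
roughly every $661$ kilobytes read rather than for every key.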
|
||||
|
||||
It is important to emphasize two things.
|
||||
First, the operating system uses techniques
|
||||
to diminish the number of seeks and the average seek time.
|
||||
This makes the amortization factor greater than \textbaht~in practice.
|
||||
Second, almost all main memory is available to be used as
|
||||
file buffers because just a small vector
|
||||
of $\lceil n/b\rceil$ one-byte entries must be maintained in main memory,
|
||||
as we show in Section~\ref{sec:memconstruction}.
|
||||
|
||||
|
||||
Statement 2.2 runs our internal memory based algorithm in order to generate a MPHF for each bucket.
|
||||
That algorithm is linear, as we showed in~\cite{bkz05}. As it is applied to buckets with $\mathit{size}[i]$ keys, statement~2.2 takes $O(\mathit{size}[i])$ time.
|
||||
|
||||
Statement 2.3 has time complexity $O(\mathit{size}[i])$ because it writes to disk
|
||||
the description of each generated MPHF and each description is stored in
|
||||
$c \times \mathit{size}[i] + O(1)$ bytes, where $c\in[0.93,1.15]$.
|
||||
In conclusion, our algorithm takes $O(n)$ time because both the partitioning and
|
||||
the searching steps run in $O(n)$ time.
|
||||
|
||||
An experimental validation of the above proof and a performance comparison with
|
||||
our internal memory based algorithm~\cite{bkz05} were not included here due to
|
||||
space restrictions but can be found in~\cite{bkz06t} and also in the appendix.
|
||||
|
||||
\vspace{-1mm}
|
||||
\enlargethispage{2\baselineskip}
|
||||
\subsection{Space used for constructing a MPHF}
|
||||
\label{sec:memconstruction}
|
||||
|
||||
The vector {\it size} is kept in main memory
|
||||
all the time.
|
||||
The vector {\it size} has $\lceil n/b \rceil$ one-byte entries.
|
||||
It stores the number of keys in each bucket and
|
||||
those values are less than or equal to 256.
|
||||
For example, for a set of 1 billion keys and $b=175$ the vector {\it size} needs
|
||||
$5.45$ megabytes of main memory.
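
This figure can be checked directly from the number of buckets, assuming that
one megabyte denotes $2^{20}$ bytes:
\[
\left\lceil \frac{10^9}{175} \right\rceil = 5{,}714{,}286\ \mathrm{bytes}
\;\approx\; \frac{5{,}714{,}286}{2^{20}}\ \mathrm{MB} \;\approx\; 5.45\ \mathrm{MB}.
\]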
|
||||
|
||||
We need an internal memory area of size $\mu$ bytes to be used in
|
||||
the partitioning step and in the searching step.
|
||||
The size $\mu$ is fixed a priori and depends only on the amount
|
||||
of internal memory available to run the algorithm
|
||||
(i.e., it does not depend on the size $n$ of the problem).
|
||||
|
||||
% One could argue about the a priori reserved internal memory area and the main memory
|
||||
% required to run the indirect bucket sort algorithm.
|
||||
% Those internal memory requirements do not depend on the size of the problem
|
||||
% (i.e., the number of keys being hashed) and can be fixed a priori.
|
||||
|
||||
The additional space required in the searching step
|
||||
is constant, since the problem is broken down
into several small problems (at most 256 keys) and
the heap size is assumed to be much smaller than $n$ ($N \ll n$).
|
||||
For example, for a set of 1 billion keys and an internal area of~$\mu = 250$ megabytes,
|
||||
the number of files is $N = 248$.
|
||||
|
||||
\vspace{-1mm}
|
||||
\subsection{Evaluation cost of the MPHF}
|
||||
|
||||
Now we consider the amount of CPU time
|
||||
required by the resulting MPHF at retrieval time.
|
||||
The MPHF requires for each key the computation of three
|
||||
universal hash functions and three memory accesses
|
||||
(see Eqs.~(\ref{eq:mphf}), (\ref{eq:bucketindex}) and (\ref{eq:mphfi})).
|
||||
This is not optimal. Pagh~\cite{p99} showed that any MPHF requires
|
||||
at least the computation of two universal hash functions and one memory
|
||||
access.
|
||||
|
||||
\subsection{Description size of the MPHF}
|
||||
|
||||
The number of bits required to store the MPHF generated by the algorithm
|
||||
is computed by Eq.~(\ref{eq:newmphfbits}).
|
||||
We need to store each $g_i$ vector presented in Eq.~(\ref{eq:mphfi}), where
|
||||
$0\leq i < \lceil n/b \rceil$. As each bucket has at most 256 keys, each
|
||||
entry in a $g_i$ vector has 8~bits. In each $g_i$ vector there are
|
||||
$c \times \mathit{size}[i]$ entries (recall $c\in[0.93, 1.15]$).
|
||||
When we sum up the number of entries of $\lceil n/b \rceil$ $g_i$ vectors we have
|
||||
$c\sum_{i=0}^{\lceil n/b \rceil -1} \mathit{size}[i]=cn$ entries. We also need to
|
||||
store $3 \lceil n/b \rceil$ integer numbers of
|
||||
$\log_2n$ bits referring respectively to the {\it offset} vector and the two random seeds of
|
||||
$h_{1i}$ and $h_{2i}$. In addition, we need to store $\lceil n/b \rceil$ 8-bit entries of
|
||||
the vector {\it size}. Therefore,
|
||||
\begin{eqnarray}\label{eq:newmphfbits}
|
||||
\mathrm{Required\: Space} = 8cn + \frac{n}{b}\left( 3\log_2n +8\right) \:
|
||||
\mathrm{bits}.
|
||||
\end{eqnarray}
|
||||
|
||||
Considering $c=0.93$ and $b=175$, the number of bits per key to store
|
||||
the description of the resulting MPHF for a set of 1~billion keys is $8.1$.
|
||||
If we set $b=128$, then the bits per key ratio increases to $8.3$.
|
||||
Theoretically, the number of bits required to store the MPHF in
|
||||
Eq.~(\ref{eq:newmphfbits})
|
||||
is $O(n\log n)$ as~$n\to\infty$. However, for sets of size up to $2^{b/3}$ keys
|
||||
the number of bits per key is lower than 9~bits (note that
|
||||
$2^{b/3}>2^{58}>10^{17}$ for $b=175$).
|
||||
%For $b=175$, the number of bits per key will be close to 9 for a set of $2^{58}$ keys.
|
||||
Thus, in practice the resulting function is stored in $O(n)$ bits.
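
The bound of 9~bits per key quoted above can be checked by dividing
Eq.~(\ref{eq:newmphfbits}) by $n$, which gives a per-key cost of
\[
8c + \frac{3\log_2 n + 8}{b}\ \mathrm{bits}.
\]
At $n = 2^{b/3}$ we have $3\log_2 n = b$, so the cost becomes $8c + 1 + 8/b$.
Assuming $c=0.93$ and $b=175$, as in the example above, this evaluates to
\[
8 \times 0.93 + 1 + \frac{8}{175} \;\approx\; 7.44 + 1 + 0.05 \;=\; 8.49 \;<\; 9.
\]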
|
6
vldb07/appendix.tex
Normal file
@ -0,0 +1,6 @@
|
||||
\appendix
|
||||
\input{experimentalresults}
|
||||
\input{thedataandsetup}
|
||||
\input{costhashingbuckets}
|
||||
\input{performancenewalgorithm}
|
||||
\input{diskaccess}
|
42
vldb07/conclusions.tex
Executable file
@ -0,0 +1,42 @@
|
||||
% Time-stamp: <Monday 30 Jan 2006 12:38:06am EST yoshi@flare>
|
||||
\enlargethispage{2\baselineskip}
|
||||
\section{Concluding remarks}
|
||||
\label{sec:concuding-remarks}
|
||||
|
||||
This paper has presented a novel external memory based algorithm for
|
||||
constructing MPHFs that works for sets on the order of billions of keys. The
|
||||
algorithm outputs the resulting function in~$O(n)$ time and, furthermore, it
|
||||
can be tuned to run only $34\%$ slower (see \cite{bkz06t} for details) than the fastest
|
||||
algorithm available in the literature for constructing MPHFs~\cite{bkz05}.
|
||||
In addition, the space
|
||||
requirement of the resulting MPHF is at most 9 bits per key for datasets of
|
||||
up to $2^{58}\simeq10^{17.4}$ keys.
|
||||
|
||||
The algorithm is simple and needs just a
|
||||
small vector of size approximately 5.45 megabytes in main memory to construct
|
||||
a MPHF for a collection of 1 billion URLs, each one 64 bytes long on average.
|
||||
Therefore, almost all main memory is available to be used as disk I/O buffer.
|
||||
Making use of such a buffering scheme considering an internal memory area of size
|
||||
$\mu=200$ megabytes, our algorithm can produce a MPHF for a
|
||||
set of 1 billion URLs in approximately 3.6 hours using a 2.4 gigahertz commodity PC with
|
||||
500 megabytes of main memory.
|
||||
If we increase both the main memory
|
||||
available to 1 gigabyte and the internal memory area to $\mu=500$ megabytes,
|
||||
a MPHF for the set of 1 billion URLs is produced in approximately 3 hours. For any
|
||||
key, the evaluation of the resulting MPHF takes three memory accesses and the
|
||||
computation of three universal hash functions.
|
||||
|
||||
In order to allow the reproduction of our results and the utilization of both the internal memory
|
||||
based algorithm and the external memory based algorithm,
|
||||
the algorithms are available at \texttt{http://cmph.sf.net}
|
||||
under the GNU Lesser General Public License (LGPL).
|
||||
They were implemented in the C language.
|
||||
|
||||
In future work, we will exploit the fact that the searching step intrinsically
|
||||
presents a high degree of parallelism and requires $73\%$ of the
|
||||
construction time. A parallel implementation of our algorithm will therefore
allow the construction and the evaluation of the resulting function in parallel.
The description of the resulting MPHFs will then be distributed across the parallel
computer, allowing the algorithm to scale to sets of hundreds of billions of keys.
|
||||
This is an important contribution, mainly for applications related to the Web, as
|
||||
mentioned in Section~\ref{sec:intro}.
|
177
vldb07/costhashingbuckets.tex
Executable file
@ -0,0 +1,177 @@
|
||||
% Nivio: 29/jan/06
|
||||
% Time-stamp: <Monday 30 Jan 2006 12:37:22am EST yoshi@flare>
|
||||
\vspace{-2mm}
|
||||
\subsection{Performance of the internal memory based algorithm}
|
||||
\label{sec:intern-memory-algor}
|
||||
|
||||
%\begin{table*}[htb]
|
||||
%\begin{center}
|
||||
%{\scriptsize
|
||||
%\begin{tabular}{|c|c|c|c|c|c|c|c|}
|
||||
%\hline
|
||||
%$n$ (millions) & 1 & 2 & 4 & 8 & 16 & 32 \\
|
||||
%\hline
|
||||
%Average time (s)& $6.1 \pm 0.3$ & $12.2 \pm 0.6$ & $25.4 \pm 1.1$ & $51.4 \pm 2.0$ & $117.3 \pm 4.4$ & $262.2 \pm 8.7$\\
|
||||
%SD (s) & $2.6$ & $5.4$ & $9.8$ & $17.6$ & $37.3$ & $76.3$ \\
|
||||
%\hline
|
||||
%\end{tabular}
|
||||
%\vspace{-3mm}
|
||||
%}
|
||||
%\end{center}
|
||||
%\caption{Internal memory based algorithm: average time in seconds for constructing a MPHF,
|
||||
%the standard deviation (SD), and the confidence intervals considering
|
||||
%a confidence level of $95\%$.}
|
||||
%\label{tab:medias}
|
||||
%\end{table*}
|
||||
|
||||
Our three-step internal memory based algorithm presented in~\cite{bkz05}
|
||||
is used for constructing a MPHF for each bucket.
|
||||
It is a randomized algorithm because it needs to generate a simple random graph
|
||||
in its first step.
|
||||
Once the graph is obtained the other two steps are deterministic.
|
||||
|
||||
Thus, we can consider the runtime of the algorithm to have the form~$\alpha
|
||||
nZ$ for an input of~$n$ keys, where~$\alpha$ is some machine dependent
|
||||
constant that further depends on the length of the keys and~$Z$ is a random
|
||||
variable with geometric distribution with mean~$1/p=e^{1/c^2}$ (see
|
||||
Section~\ref{sec:mphfbucket}). All results in our experiments were obtained
|
||||
taking $c=1$; the value of~$c$, with~$c\in[0.93,1.15]$, in fact has little
|
||||
influence on the runtime, as shown in~\cite{bkz05}.
|
||||
|
||||
The values chosen for $n$ were $1, 2, 4, 8, 16$ and $32$ million.
|
||||
Although we have a dataset with 1~billion URLs, on a PC with
|
||||
1~gigabyte of main memory, the algorithm is able
|
||||
to handle an input with at most 32 million keys.
|
||||
This is mainly because of the graph we need to keep in main memory.
|
||||
The algorithm requires $25n + O(1)$ bytes for constructing
|
||||
a MPHF (details about the data structures used by the algorithm can
|
||||
be found at~\texttt{http://cmph.sf.net}).
|
||||
% for the details about the data structures
|
||||
%used by the algorithm).
|
||||
|
||||
In order to estimate the number of trials for each value of $n$ we use
|
||||
a statistical method for determining a suitable sample size (see, e.g.,
|
||||
\cite[Chapter 13]{j91}).
|
||||
As we obtained different values for each $n$,
|
||||
we used the maximal value obtained, namely, 300~trials in order to have
|
||||
a confidence level of $95\%$.
|
||||
|
||||
% \begin{figure*}[ht]
|
||||
% \noindent
|
||||
% \begin{minipage}[b]{0.5\linewidth}
|
||||
% \centering
|
||||
% \subfigure[The previous algorithm]
|
||||
% {\scalebox{0.5}{\includegraphics{figs/bmz_temporegressao.eps}}}
|
||||
% \end{minipage}
|
||||
% \hfill
|
||||
% \begin{minipage}[b]{0.5\linewidth}
|
||||
% \centering
|
||||
% \subfigure[The new algorithm]
|
||||
% {\scalebox{0.5}{\includegraphics{figs/brz_temporegressao.eps}}}
|
||||
% \end{minipage}
|
||||
% \caption{Time versus number of keys in $S$. The solid line corresponds to
|
||||
% a linear regression model.}
|
||||
% %obtained from the experimental measurements.}
|
||||
% \label{fig:temporegressao}
|
||||
% \end{figure*}
|
||||
|
||||
Table~\ref{tab:medias} presents the runtime average for each $n$,
|
||||
the respective standard deviations, and
|
||||
the respective confidence intervals given by
|
||||
the average time $\pm$ the distance from average time
|
||||
considering a confidence level of $95\%$.
|
||||
Observing the runtime averages one sees that
|
||||
the algorithm runs in expected linear time,
|
||||
as shown in~\cite{bkz05}.
|
||||
|
||||
\vspace{-2mm}
|
||||
\begin{table*}[htb]
|
||||
\begin{center}
|
||||
{\scriptsize
|
||||
\begin{tabular}{|c|c|c|c|c|c|c|}
|
||||
\hline
|
||||
$n$ (millions) & 1 & 2 & 4 & 8 & 16 & 32 \\
|
||||
\hline
|
||||
Average time (s)& $6.1 \pm 0.3$ & $12.2 \pm 0.6$ & $25.4 \pm 1.1$ & $51.4 \pm 2.0$ & $117.3 \pm 4.4$ & $262.2 \pm 8.7$\\
|
||||
SD (s) & $2.6$ & $5.4$ & $9.8$ & $17.6$ & $37.3$ & $76.3$ \\
|
||||
\hline
|
||||
\end{tabular}
|
||||
\vspace{-1mm}
|
||||
}
|
||||
\end{center}
|
||||
\caption{Internal memory based algorithm: average time in seconds for constructing a MPHF,
|
||||
the standard deviation (SD), and the confidence intervals considering
|
||||
a confidence level of $95\%$.}
|
||||
\label{tab:medias}
|
||||
\vspace{-4mm}
|
||||
\end{table*}
|
||||
|
||||
% \enlargethispage{\baselineskip}
|
||||
% \begin{table*}[htb]
|
||||
% \begin{center}
|
||||
% {\scriptsize
|
||||
% (a)
|
||||
% \begin{tabular}{|c|c|c|c|c|c|c|c|}
|
||||
% \hline
|
||||
% $n$ (millions) & 1 & 2 & 4 & 8 & 16 & 32 \\
|
||||
% \hline
|
||||
% Average time (s)& $6.119 \pm 0.300$ & $12.190 \pm 0.615$ & $25.359 \pm 1.109$ & $51.408 \pm 2.003$ & $117.343 \pm 4.364$ & $262.215 \pm 8.724$\\
|
||||
% SD (s) & $2.644$ & $5.414$ & $9.757$ & $17.627$ & $37.333$ & $76.271$ \\
|
||||
% \hline
|
||||
% \end{tabular}
|
||||
% \\[5mm] (b)
|
||||
% \begin{tabular}{|l|c|c|c|c|c|}
|
||||
% \hline
|
||||
% $n$ (millions) & 1 & 2 & 4 & 8 & 16 \\
|
||||
% \hline % Part. 16 \% 16 \% 16 \% 18 \% 20\%
|
||||
% Average time (s) & $6.927 \pm 0.309$ & $13.828 \pm 0.175$ & $31.936 \pm 0.663$ & $69.902 \pm 1.084$ & $140.617 \pm 2.502$ \\
|
||||
% SD & $0.431$ & $0.245$ & $0.926$ & $1.515$ & $3.498$ \\
|
||||
% \hline
|
||||
% \hline
|
||||
% $n$ (millions) & 32 & 64 & 128 & 512 & 1000 \\
|
||||
% \hline % Part. 20 \% 20\% 20\% 18\% 18\%
|
||||
% Average time (s) & $284.330 \pm 1.135$ & $587.880 \pm 3.945$ & $1223.581 \pm 4.864$ & $5966.402 \pm 9.465$ & $13229.540 \pm 12.670$ \\
|
||||
% SD & $1.587$ & $5.514$ & $6.800$ & $13.232$ & $18.577$ \\
|
||||
% \hline
|
||||
% \end{tabular}
|
||||
% }
|
||||
% \end{center}
|
||||
% \caption{The runtime averages in seconds,
|
||||
% the standard deviation (SD), and
|
||||
% the confidence intervals given by the average time $\pm$
|
||||
% the distance from average time considering
|
||||
% a confidence level of $95\%$.}
|
||||
% \label{tab:medias}
|
||||
% \end{table*}
|
||||
|
||||
\enlargethispage{2\baselineskip}
|
||||
Figure~\ref{fig:bmz_temporegressao}
|
||||
presents the runtime for each trial. In addition,
|
||||
the solid line corresponds to a linear regression model
|
||||
obtained from the experimental measurements.
|
||||
As we can see, the runtime for a given $n$ has a considerable
|
||||
fluctuation. However, the fluctuation also grows linearly with $n$.
|
||||
|
||||
\begin{figure}[htb]
|
||||
\vspace{-2mm}
|
||||
\begin{center}
|
||||
\scalebox{0.4}{\includegraphics{figs/bmz_temporegressao.eps}}
|
||||
\caption{Time versus number of keys in $S$ for the internal memory based algorithm.
|
||||
The solid line corresponds to a linear regression model.}
|
||||
\label{fig:bmz_temporegressao}
|
||||
\end{center}
|
||||
\vspace{-6mm}
|
||||
\end{figure}
|
||||
|
||||
The observed fluctuation in the runtimes is as expected; recall that this
|
||||
runtime has the form~$\alpha nZ$ with~$Z$ a geometric random variable with
|
||||
mean~$1/p=e$. Thus, the runtime has mean~$\alpha n/p=\alpha en$ and standard
|
||||
deviation~$\alpha n\sqrt{(1-p)/p^2}=\alpha n\sqrt{e(e-1)}$.
|
||||
Therefore, the standard deviation also grows
|
||||
linearly with $n$, as experimentally verified
|
||||
in Table~\ref{tab:medias} and in Figure~\ref{fig:bmz_temporegressao}.
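
The expression for the standard deviation can be checked step by step.
A geometric random variable~$Z$ with success probability~$p$ has
variance~$(1-p)/p^2$, so with $p=1/e$ (that is, $c=1$) we get
\[
\mathrm{Var}(Z) \;=\; \frac{1-1/e}{1/e^2} \;=\; e^2 - e \;=\; e(e-1),
\]
and scaling by the constant~$\alpha n$ gives the standard
deviation~$\alpha n\sqrt{e(e-1)} \approx 2.16\,\alpha n$.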
|
||||
|
||||
%\noindent-------------------------------------------------------------------------\\
|
||||
%Comentario para Yoshi: Nao consegui reproduzir bem o que discutimos
|
||||
%no paragrafo acima, acho que vc conseguira justificar melhor :-). \\
|
||||
%-------------------------------------------------------------------------\\
|
146
vldb07/determiningb.tex
Executable file
@ -0,0 +1,146 @@
|
||||
% Nivio: 29/jan/06
|
||||
% Time-stamp: <Monday 30 Jan 2006 04:01:40am EDT yoshi@ime.usp.br>
|
||||
\enlargethispage{2\baselineskip}
|
||||
\subsection{Determining~$b$}
|
||||
\label{sec:determining-b}
|
||||
\begin{table*}[t]
|
||||
\begin{center}
|
||||
{\small %\scriptsize
|
||||
\begin{tabular}{|c|ccc|ccc|}
|
||||
\hline
|
||||
\raisebox{-0.7em}{$n$} & \multicolumn{3}{c|}{\raisebox{-1mm}{$b=128$}} &
\multicolumn{3}{c|}{\raisebox{-1mm}{$b=175$}}\\
|
||||
\cline{2-4} \cline{5-7}
|
||||
& \raisebox{-0.5mm}{Worst Case} & \raisebox{-0.5mm}{Average} &\raisebox{-0.5mm}{Eq.~(\ref{eq:maxbs})}
|
||||
& \raisebox{-0.5mm}{Worst Case} & \raisebox{-0.5mm}{Average} &\raisebox{-0.5mm}{Eq.~(\ref{eq:maxbs})} \\
|
||||
\hline
|
||||
$1.0 \times 10^6$ & 177 & 172.0 & 176 & 232 & 226.6 & 230 \\
|
||||
%$2.0 \times 10^6$ & 179 & 174.0 & 178 & 236 & 228.5 & 232 \\
|
||||
$4.0 \times 10^6$ & 182 & 177.5 & 179 & 241 & 231.8 & 234 \\
|
||||
%$8.0 \times 10^6$ & 186 & 181.6 & 181 & 238 & 234.2 & 236 \\
|
||||
$1.6 \times 10^7$ & 184 & 181.6 & 183 & 241 & 236.1 & 238 \\
|
||||
%$3.2 \times 10^7$ & 191 & 183.9 & 184 & 240 & 236.6 & 240 \\
|
||||
$6.4 \times 10^7$ & 195 & 185.2 & 186 & 244 & 239.0 & 242 \\
|
||||
%$1.28 \times 10^8$ & 193 & 187.7 & 187 & 244 & 239.7 & 244 \\
|
||||
$5.12 \times 10^8$ & 196 & 191.7 & 190 & 251 & 246.3 & 247 \\
|
||||
$1.0 \times 10^9$ & 197 & 191.6 & 192 & 253 & 248.9 & 249 \\
|
||||
\hline
|
||||
\end{tabular}
|
||||
\vspace{-1mm}
|
||||
}
|
||||
\end{center}
|
||||
\caption{Values for $\mathit{BS}_{\mathit{max}}$: worst case and average case obtained in the experiments and using Eq.~(\ref{eq:maxbs}),
|
||||
considering $b=128$ and $b=175$ for different numbers $n$ of keys in $S$.}
|
||||
\label{tab:comparison}
|
||||
\vspace{-6mm}
|
||||
\end{table*}
|
||||
|
||||
The partitioning step can be viewed as the well-known ``balls into bins''
|
||||
problem~\cite{ra98,dfm02} where~$n$ keys (the balls) are placed independently and
|
||||
uniformly into $\lceil n/b\rceil$ buckets (the bins). The main question related to that problem we are interested
|
||||
in is: what is the maximum number of keys in any bucket?
|
||||
In fact, we want to get the maximum value for $b$ that makes the maximum number of keys in any bucket
|
||||
no greater than 256 with high probability.
|
||||
This is important, as we wish to use 8 bits per entry in the vector $g_i$ of
|
||||
each $\mathrm{MPHF}_i$,
|
||||
where $0 \leq i < \lceil n/b\rceil$.
|
||||
Let $\mathit{BS}_{\mathit{max}}$ be the maximum number of keys in any bucket.
|
||||
|
||||
Clearly, $\BSmax$ is the maximum
|
||||
of~$\lceil n/b\rceil$ random variables~$Z_i$, each with binomial
|
||||
distribution~$\Bi(n,p)$ with parameters~$n$ and~$p=1/\lceil n/b\rceil$.
|
||||
However, the~$Z_i$ are not independent. Note that~$\Bi(n,p)$ has mean and
|
||||
variance~$\simeq b$. To give an upper estimate for the probability of the
|
||||
event~$\BSmax\geq \gamma$, we can estimate the probability that we have~$Z_i\geq \gamma$
|
||||
for a fixed~$i$, and then sum these estimates over all~$i$.
|
||||
Let~$\gamma=b+\sigma\sqrt{b\ln(n/b)}$, where~$\sigma=\sqrt2$.
|
||||
Approximating~$\Bi(n,p)$ by the normal distribution with mean and
|
||||
variance~$b$, we obtain the
|
||||
estimate~$(\sigma\sqrt{2\pi\ln(n/b)})^{-1}\times\exp(-(1/2)\sigma^2\ln(n/b))$ for
|
||||
the probability that~$Z_i\geq \gamma$ occurs, which, summed over all~$i$, gives
|
||||
that the probability that~$\BSmax\geq \gamma$ occurs is at
|
||||
most~$1/(\sigma\sqrt{2\pi\ln(n/b)})$, which tends to~$0$ as~$n\to\infty$.
|
||||
Thus, we have shown that, with high probability,
|
||||
\begin{equation}
|
||||
\label{eq:maxbs}
|
||||
\BSmax\leq b+\sqrt{2b\ln\frac{n}{b}}.
|
||||
\end{equation}
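
The steps of this estimate can be written out explicitly; the numerical check
at the end assumes $n=10^9$ and $b=175$. With $\sigma=\sqrt2$ the exponential
factor simplifies to
\[
\exp\left(-\frac{1}{2}\sigma^2\ln\frac{n}{b}\right)
\;=\; \exp\left(-\ln\frac{n}{b}\right) \;=\; \frac{b}{n},
\]
and summing over the $\lceil n/b\rceil\simeq n/b$ buckets cancels this factor,
leaving the bound~$1/(\sigma\sqrt{2\pi\ln(n/b)})$.
For $n=10^9$ and $b=175$ we have $\ln(n/b)\approx 15.56$, so
Eq.~(\ref{eq:maxbs}) gives
\[
\BSmax \;\leq\; 175 + \sqrt{2\times 175\times 15.56} \;\approx\; 175 + 73.8 \;=\; 248.8,
\]
which rounds to the value~$249$ listed in Table~\ref{tab:comparison}.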
|
||||
|
||||
% The traditional approach used to estimate $\mathit{BS}_{\mathit{max}}$ with high probability is
|
||||
% to consider $\mathit{BS}_{\mathit{max}}$ as a random variable that follows a binomial distribution
|
||||
% that can be approximated by a poisson distribution. This yields a good approximation
|
||||
% when the number of balls is lower than or equal to the number of bins~\cite{g81}. In our case,
|
||||
% the number of balls is greater than the number of buckets.
|
||||
% % and that is why we have used more recent works to estimate $\mathit{BS}_{\mathit{max}}$.
|
||||
% As $b > \ln (n/b)$, we can use the result by Raab and Steger~\cite{ra98} to estimate
|
||||
% $\mathit{BS}_{\mathit{max}}$ with high probability.
|
||||
% The following equation gives the estimation, where $\sigma=\sqrt{2}$:
|
||||
% \begin{eqnarray} \label{eq:maxbs}
|
||||
% \mathit{BS}_{\mathit{max}} = b + O \left( \sqrt{b\ln\frac{n}{b}} \right) = b + \sigma \times \left(\sqrt{b\ln\frac{n}{b}} \right)
|
||||
% \end{eqnarray}
|
||||
|
||||
% In order to estimate the suitable constant $\sigma$ we did a linear
|
||||
% regression suppressing the constant term.
|
||||
% We use the equation $BS_{max} - b = \sigma \times \sqrt{b\ln (n/b)}$
|
||||
% in the linear regression considering $y=BS_{max} - b$ and $x=\sqrt{b\ln (n/b)}$.
|
||||
% In order to obtain data to be used in the linear regression we set
|
||||
% b=128 and ran the new algorithm ten times
|
||||
% for n equal to 1, 2, 4, 8, 16, 32, 64, 128, 512, 1000 million keys.
|
||||
% Taking a confidence level equal to 95\% we got
|
||||
% $\sigma = 2.11 \pm 0.03$.
|
||||
% The coefficient of determination was $99.6\%$, which means that the linear
|
||||
% regression explains $99.6\%$ of the data variation and only $0.4\%$
|
||||
% is due to experimental errors.
|
||||
% Therefore, Eq.~(\ref{eq:maxbs}) with $\sigma = 2.11 \pm 0.03$ and $b=128$
|
||||
% makes a very good estimation of the maximum number of keys in any bucket.
|
||||
|
||||
% Repeating the same experiments for $b$ equals to $175$ and
|
||||
% a confidence level of $95\%$ we got $\sigma = 2.07 \pm 0.03$.
|
||||
% Again we verified that Eq.~(\ref{eq:maxbs}) with $\sigma = 2.07 \pm 0.03$ and $b=175$ is
|
||||
% a very good estimation of the maximum number of keys in any bucket once the
|
||||
% coefficient of determination obtained was $99.7 \%$ and $\sigma$ is in a very narrow range.
|
||||
|
||||
In our algorithm the maximum number of keys in any bucket must be at most 256.
|
||||
Table~\ref{tab:comparison} presents the values for $\mathit{BS}_{\mathit{max}}$
|
||||
obtained experimentally and using Eq.~(\ref{eq:maxbs}).
|
||||
The table presents the worst case and the average case,
|
||||
considering $b=128$, $b=175$ and Eq.~(\ref{eq:maxbs}),
|
||||
for several numbers~$n$ of keys in $S$.
|
||||
The estimation given by Eq.~(\ref{eq:maxbs}) is very close to the experimental
|
||||
results.
|
||||
|
||||
Now we estimate the biggest problem our algorithm is able to solve for
|
||||
a given $b$.
|
||||
Using Eq.~(\ref{eq:maxbs}) considering $b=128$, $b=175$ and imposing
|
||||
that~$\mathit{BS}_{\mathit{max}}\leq256$,
|
||||
the sizes of the biggest key sets our algorithm
can deal with are approximately $10^{30}$ keys and $10^{10}$ keys, respectively.
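
These limits follow from solving Eq.~(\ref{eq:maxbs}) for~$n$ with the
right-hand side required to be at most~256:
\[
b+\sqrt{2b\ln\frac{n}{b}} \;\leq\; 256
\quad\Longleftrightarrow\quad
n \;\leq\; b\,\exp\left(\frac{(256-b)^2}{2b}\right).
\]
For $b=128$ this gives $n\leq 128\,e^{64}\approx 8\times10^{29}$, and for
$b=175$ it gives $n\leq 175\,e^{6561/350}\approx 2.4\times10^{10}$, in line
with the orders of magnitude quoted above.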
|
||||
%It is also important to have $b$ as big as possible, once its value is
|
||||
%related to the space required to store the resultant MPHF, as shown later on.
|
||||
%Table~\ref{tab:bp} shows the biggest problem the algorithm can solve.
|
||||
% The values were obtained from Eq.~(\ref{eq:maxbs}),
|
||||
% considering $b=128$ and~$b=175$ and imposing
|
||||
% that~$\mathit{BS}_{\mathit{max}}\leq256$.
|
||||
|
||||
% We set $\sigma=2.14$ because it was the greatest value obtained for $\sigma$
|
||||
% in the two linear regression we did.
|
||||
% \vspace{-3mm}
|
||||
% \begin{table}[htb]
|
||||
% \begin{center}
|
||||
% {\small %\scriptsize
|
||||
% \begin{tabular}{|c|c|}
|
||||
% \hline
|
||||
% b & Problem size ($n$) \\
|
||||
% \hline
|
||||
% 128 & $10^{30}$ keys \\
|
||||
% 175 & $10^{10}$ keys \\
|
||||
% \hline
|
||||
% \end{tabular}
|
||||
% \vspace{-1mm}
|
||||
% }
|
||||
% \end{center}
|
||||
% \caption{Using Eq.~(\ref{eq:maxbs}) to estimate the biggest problem our algorithm can solve.}
|
||||
% %considering $\sigma=\sqrt{2}$.}
|
||||
% \label{tab:bp}
|
||||
% \vspace{-14mm}
|
||||
% \end{table}
|
113
vldb07/diskaccess.tex
Executable file
@ -0,0 +1,113 @@
|
||||
% Nivio: 29/jan/06
|
||||
% Time-stamp: <Sunday 29 Jan 2006 11:58:28pm EST yoshi@flare>
|
||||
\vspace{-2mm}
|
||||
\subsection{Controlling disk accesses}
|
||||
\label{sec:contr-disk-access}
|
||||
|
||||
In order to bring down the number of seek operations on disk
|
||||
we benefit from the fact that our algorithm leaves almost all main
|
||||
memory available to be used as disk I/O buffer.
|
||||
In this section we evaluate how much the parameter $\mu$
|
||||
affects the runtime of our algorithm.
|
||||
For that we fixed $n$ at 1 billion URLs,
|
||||
set the main memory of the machine used for the experiments
|
||||
to 1 gigabyte and used $\mu$ equal to $100, 200, 300, 400, 500$ and $600$
|
||||
megabytes.
|
||||
|
||||
\enlargethispage{2\baselineskip}
|
||||
Table~\ref{tab:diskaccess} presents the number of files $N$,
|
||||
the buffer size used for all files, the number of seeks in the worst case considering
|
||||
the pessimistic assumption mentioned in Section~\ref{sec:linearcomplexity}, and
|
||||
the time to generate a MPHF for 1 billion of keys as a function of the amount of internal
|
||||
memory available. Observing Table~\ref{tab:diskaccess} we noticed that the time spent in the construction
|
||||
decreases as the value of $\mu$ increases. However, for $\mu > 400$, the variation
|
||||
in the time is not as significant as for $\mu \leq 400$.
|
||||
This can be explained by the fact that the I/O scheduler of the Linux 2.6 kernel
|
||||
has smart policies
|
||||
for avoiding seeks and diminishing the average seek time
|
||||
(see \texttt{http://www.linuxjournal.com/article/6931}).
|
||||
\begin{table*}[ht]
|
||||
\vspace{-2mm}
|
||||
\begin{center}
|
||||
{\scriptsize
|
||||
\begin{tabular}{|l|c|c|c|c|c|c|}
|
||||
\hline
|
||||
$\mu$ (MB) & $100$ & $200$ & $300$ & $400$ & $500$ & $600$ \\
|
||||
\hline
|
||||
$N$ (files) & $619$ & $310$ & $207$ & $155$ & $124$ & $104$ \\
|
||||
%\hline
|
||||
\textbaht~(buffer size in KB) & $165$ & $661$ & $1{,}484$ & $2{,}643$ & $4{,}129$ & $5{,}908$ \\
|
||||
%\hline
|
||||
$\beta$/\textbaht~(\# of seeks in the worst case) & $384{,}478$ & $95{,}974$ & $42{,}749$ & $24{,}003$ & $15{,}365$ & $10{,}738$ \\
|
||||
% \hline
|
||||
% \raisebox{-0.2em}{\# of seeks performed in} & \raisebox{-0.7em}{$383,056$} & \raisebox{-0.7em}{$95,919$} & \raisebox{-0.7em}{$42,700$} & \raisebox{-0.7em}{$23,980$} & \raisebox{-0.7em}{$15,347$} & \raisebox{-0.7em}{$xx,xxx$} \\
|
||||
% \raisebox{0.2em}{statement 1.3 of Figure~\ref{fig:readingbucket}} & & & & & & \\
|
||||
% \hline
|
||||
Time (hours) & $4.04$ & $3.64$ & $3.34$ & $3.20$ & $3.13$ & $3.09$ \\
|
||||
\hline
|
||||
\end{tabular}
|
||||
\vspace{-1mm}
|
||||
}
|
||||
\end{center}
|
||||
\caption{Influence of the internal memory area size ($\mu$) on the runtime of our algorithm.}
|
||||
\label{tab:diskaccess}
|
||||
\vspace{-14mm}
|
||||
\end{table*}
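
One way to read the rows of Table~\ref{tab:diskaccess} together is sketched below.
It rests on an assumption that is not stated in this section, namely that the
memory area of size $\mu$ is split evenly among the $N$ files, so that each file
receives a buffer of about $\mu/N$ (taking $1\,\mathrm{MB} = 1024\,\mathrm{KB}$):
\[
\frac{\mu}{N} = \frac{100 \times 1024\,\mathrm{KB}}{619} \approx 165\,\mathrm{KB},
\qquad
\frac{\mu}{N} = \frac{600 \times 1024\,\mathrm{KB}}{104} \approx 5{,}908\,\mathrm{KB}.
\]
Both values agree with the buffer sizes reported in the table.
Under the same assumption, doubling $\mu$ roughly halves $N$ and hence roughly
quadruples the buffer size, which is consistent with the worst-case number of
seeks dropping by a factor of about four between $\mu = 100$ and $\mu = 200$:
\[
\frac{384{,}478}{95{,}974} \approx 4.0 .
\]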
|
||||
|
||||
|
||||
|
||||
% \begin{table*}[ht]
|
||||
% \begin{center}
|
||||
% {\scriptsize
|
||||
% \begin{tabular}{|l|c|c|c|c|c|c|c|c|c|c|c|}
|
||||
% \hline
|
||||
% $\mu$ (MB) & $100$ & $150$ & $200$ & $250$ & $300$ & $350$ & $400$ & $450$ & $500$ & $550$ & $600$ \\
|
||||
% \hline
|
||||
% $N$ (files) & $619$ & $413$ & $310$ & $248$ & $207$ & $177$ & $155$ & $138$ & $124$ & $113$ & $103$ \\
|
||||
% \hline
|
||||
% \textbaht~(buffer size in KB) & $165$ & $372$ & $661$ & $1,033$ & $1,484$ & $2,025$ & $2,643$ & $3,339$ & & & \\
|
||||
% \hline
|
||||
% \# of seeks (Worst case) & $384,478$ & $170,535$ & $95,974$ & $61,413$ & $42,749$ & $31,328$ & $24,003$ & $19,000$ & & & \\
|
||||
% \hline
|
||||
% \raisebox{-0.2em}{\# of seeks performed in} & \raisebox{-0.7em}{$383,056$} & \raisebox{-0.7em}{$170,385$} & \raisebox{-0.7em}{$95,919$} & \raisebox{-0.7em}{$61,388$} & \raisebox{-0.7em}{$42,700$} & \raisebox{-0.7em}{$31,296$} & \raisebox{-0.7em}{$23,980$} & \raisebox{-0.7em}{$18,978$} & \raisebox{-0.7em}{$xx,xxx$} & \raisebox{-0.7em}{$xx,xxx$} & \raisebox{-0.7em}{$xx,xxx$} \\
|
||||
% \raisebox{0.2em}{statement 1.3 of Figure~\ref{fig:readingbucket}} & & & & & & & & & & & \\
|
||||
% \hline
|
||||
% Time (hours) & $4.04$ & $3.93$ & $3.64$ & $3.46$ & $3.34$ & $3.26$ & $3.20$ & $3.13$ & & & \\
|
||||
% \hline
|
||||
% \end{tabular}
|
||||
% }
|
||||
% \end{center}
|
||||
% \caption{Influence of the internal memory area size ($\mu$) in our algorithm runtime.}
|
||||
% \label{tab:diskaccess}
|
||||
% \end{table*}
|
||||
|
||||
|
||||
|
||||
% \begin{table*}[htb]
|
||||
% \begin{center}
|
||||
% {\scriptsize
|
||||
% \begin{tabular}{|l|c|c|c|c|c|}
|
||||
% \hline
|
||||
% $n$ (millions) & 1 & 2 & 4 & 8 & 16 \\
|
||||
% \hline % Part. 16 \% 16 \% 16 \% 18 \% 20\%
|
||||
% Average time (s) & $14.124 \pm 0.128$ & $28.301 \pm 0.140$ & $56.807 \pm 0.312$ & $117.286 \pm 0.997$ & $241.086 \pm 0.936$ \\
|
||||
% SD & $0.179$ & $0.196$ & $0.437$ & $1.394$ & $1.308$ \\
|
||||
% \hline
|
||||
% \hline
|
||||
% $n$ (millions) & 32 & 64 & 128 & 512 & 1000 \\
|
||||
% \hline % Part. 20 \% 20\% 20\% 18\% 18\%
|
||||
% Average time (s) & $492.430 \pm 1.565$ & $1006.307 \pm 1.425$ & $2081.208 \pm 0.740$ & $9253.188 \pm 4.406$ & $19021.480 \pm 13.850$ \\
|
||||
% SD & $2.188$ & $1.992$ & $1.035$ & $ 6.160$ & $18.016$ \\
|
||||
% \hline
|
||||
|
||||
% \end{tabular}
|
||||
% }
|
||||
% \end{center}
|
||||
% \caption{The runtime averages in seconds,
|
||||
% the standard deviation (SD), and
|
||||
% the confidence intervals given by the average time $\pm$
|
||||
% the distance from average time considering
|
||||
% a confidence level of $95\%$.
|
||||
% }
|
||||
% \label{tab:mediasbrz}
|
||||
% \end{table*}
|
15
vldb07/experimentalresults.tex
Executable file
@ -0,0 +1,15 @@
|
||||
%Nivio: 29/jan/06
|
||||
% Time-stamp: <Sunday 29 Jan 2006 11:57:21pm EST yoshi@flare>
|
||||
\vspace{-2mm}
|
||||
\enlargethispage{2\baselineskip}
|
||||
\section{Appendix: Experimental results}
|
||||
\label{sec:experimental-results}
|
||||
\vspace{-1mm}
|
||||
|
||||
In this section we present the experimental results.
|
||||
We start by presenting the experimental setup.
|
||||
We then present experimental results for
|
||||
the internal memory based algorithm~\cite{bkz05}
|
||||
and for our algorithm.
|
||||
Finally, we discuss how the amount of internal memory available
|
||||
affects the runtime of our algorithm.
|
BIN
vldb07/figs/bmz_temporegressao.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 5.6 KiB |
107
vldb07/figs/brz-partitioning.fig
Normal file
@ -0,0 +1,107 @@
|
||||
#FIG 3.2
|
||||
Landscape
|
||||
Center
|
||||
Metric
|
||||
A4
|
||||
100.00
|
||||
Single
|
||||
-2
|
||||
1200 2
|
||||
0 32 #bdbebd
|
||||
0 33 #bdbebd
|
||||
0 34 #bdbebd
|
||||
0 35 #4a4d4a
|
||||
0 36 #bdbebd
|
||||
0 37 #4a4d4a
|
||||
0 38 #bdbebd
|
||||
0 39 #bdbebd
|
||||
6 225 6615 2520 7560
|
||||
2 1 0 1 -1 7 50 -1 -1 0.000 0 0 -1 0 0 2
|
||||
900 7133 1608 7133
|
||||
2 2 0 1 0 35 50 -1 20 0.000 0 0 7 0 0 5
|
||||
260 6795 474 6795 474 6965 260 6965 260 6795
|
||||
2 2 0 1 0 35 50 -1 -1 0.000 0 0 7 0 0 5
|
||||
474 6795 686 6795 686 6965 474 6965 474 6795
|
||||
2 2 0 1 0 35 50 -1 -1 0.000 0 0 7 0 0 5
|
||||
474 6626 686 6626 686 6795 474 6795 474 6626
|
||||
2 2 0 1 0 32 50 -1 43 0.000 0 0 7 0 0 5
|
||||
1538 6795 1750 6795 1750 6965 1538 6965 1538 6795
|
||||
2 2 0 1 0 32 50 -1 43 0.000 0 0 7 0 0 5
|
||||
1538 6965 1750 6965 1750 7133 1538 7133 1538 6965
|
||||
2 2 0 1 -1 7 50 -1 -1 0.000 0 0 7 0 0 5
|
||||
474 6965 686 6965 686 7133 474 7133 474 6965
|
||||
2 2 0 1 0 7 50 -1 41 0.000 0 0 -1 0 0 5
|
||||
686 6965 900 6965 900 7133 686 7133 686 6965
|
||||
2 2 0 1 0 32 50 -1 43 0.000 0 0 7 0 0 5
|
||||
1538 6626 1750 6626 1750 6795 1538 6795 1538 6626
|
||||
2 2 0 1 0 35 50 -1 20 0.000 0 0 7 0 0 5
|
||||
260 6965 474 6965 474 7133 260 7133 260 6965
|
||||
2 2 0 1 0 7 50 -1 41 0.000 0 0 -1 0 0 5
|
||||
686 6795 900 6795 900 6965 686 6965 686 6795
|
||||
4 0 0 50 -1 0 14 0.0000 4 30 180 1148 7049 ...\001
|
||||
4 0 -1 50 -1 0 7 0.0000 2 60 60 332 7260 0\001
|
||||
4 0 -1 50 -1 0 7 0.0000 2 75 60 544 7260 1\001
|
||||
4 0 -1 50 -1 0 7 0.0000 2 60 60 758 7260 2\001
|
||||
4 0 -1 50 -1 0 7 0.0000 2 90 960 1538 7260 ${\\lceil n/b\\rceil - 1}$\001
|
||||
4 0 -1 50 -1 0 7 0.0000 2 105 975 540 7515 Buckets Logical View\001
|
||||
-6
|
||||
6 2700 6390 4365 7830
|
||||
6 3461 6445 3675 7425
|
||||
6 3463 6786 3675 7245
|
||||
6 3546 6893 3591 7094
|
||||
4 0 -1 50 -1 0 12 0.0000 2 15 45 3546 6959 .\001
|
||||
4 0 -1 50 -1 0 12 0.0000 2 15 45 3546 7027 .\001
|
||||
4 0 -1 50 -1 0 12 0.0000 2 15 45 3546 7094 .\001
|
||||
-6
|
||||
2 2 0 1 0 35 50 -1 -1 0.000 0 0 7 0 0 5
|
||||
3463 6786 3675 6786 3675 7245 3463 7245 3463 6786
|
||||
-6
|
||||
2 2 0 1 0 35 50 -1 -1 0.000 0 0 7 0 0 5
|
||||
3461 6445 3675 6445 3675 6615 3461 6615 3461 6445
|
||||
2 2 0 1 -1 7 50 -1 41 0.000 0 0 7 0 0 5
|
||||
3463 6616 3675 6616 3675 6785 3463 6785 3463 6616
|
||||
2 2 0 1 0 32 50 -1 43 0.000 0 0 7 0 0 5
|
||||
3463 7246 3675 7246 3675 7425 3463 7425 3463 7246
|
||||
-6
|
||||
6 3023 6786 3235 7245
|
||||
6 3106 6893 3151 7094
|
||||
4 0 -1 50 -1 0 12 0.0000 2 15 45 3106 6959 .\001
|
||||
4 0 -1 50 -1 0 12 0.0000 2 15 45 3106 7027 .\001
|
||||
4 0 -1 50 -1 0 12 0.0000 2 15 45 3106 7094 .\001
|
||||
-6
|
||||
2 2 0 1 0 35 50 -1 -1 0.000 0 0 7 0 0 5
|
||||
3023 6786 3235 6786 3235 7245 3023 7245 3023 6786
|
||||
-6
|
||||
6 4091 6425 4305 7425
|
||||
6 4093 6946 4305 7255
|
||||
6 4176 7018 4221 7153
|
||||
4 0 -1 50 -1 0 12 0.0000 2 15 45 4176 7063 .\001
|
||||
4 0 -1 50 -1 0 12 0.0000 2 15 45 4176 7108 .\001
|
||||
4 0 -1 50 -1 0 12 0.0000 2 15 45 4176 7153 .\001
|
||||
-6
|
||||
2 2 0 1 0 35 50 -1 -1 0.000 0 0 7 0 0 5
|
||||
4093 6946 4305 6946 4305 7255 4093 7255 4093 6946
|
||||
-6
|
||||
2 2 0 1 0 35 50 -1 -1 0.000 0 0 7 0 0 5
|
||||
4091 6605 4305 6605 4305 6775 4091 6775 4091 6605
|
||||
2 2 0 1 0 32 50 -1 43 0.000 0 0 7 0 0 5
|
||||
4093 7256 4305 7256 4305 7425 4093 7425 4093 7256
|
||||
2 2 0 1 -1 7 50 -1 41 0.000 0 0 7 0 0 5
|
||||
4093 6776 4305 6776 4305 6945 4093 6945 4093 6776
|
||||
2 2 0 1 0 35 50 -1 20 0.000 0 0 7 0 0 5
|
||||
4091 6425 4305 6425 4305 6595 4091 6595 4091 6425
|
||||
-6
|
||||
2 2 0 1 0 35 50 -1 20 0.000 0 0 7 0 0 5
|
||||
3021 6445 3235 6445 3235 6615 3021 6615 3021 6445
|
||||
2 2 0 1 -1 7 50 -1 -1 0.000 0 0 7 0 0 5
|
||||
3023 6616 3235 6616 3235 6785 3023 6785 3023 6616
|
||||
2 2 0 1 0 32 50 -1 43 0.000 0 0 7 0 0 5
|
||||
3023 7246 3235 7246 3235 7425 3023 7425 3023 7246
|
||||
4 0 0 50 -1 0 14 0.0000 4 30 180 3780 6975 ...\001
|
||||
4 0 -1 50 -1 0 7 0.0000 2 75 255 3015 7560 File 1\001
|
||||
4 0 -1 50 -1 0 7 0.0000 2 75 255 3465 7560 File 2\001
|
||||
4 0 -1 50 -1 0 7 0.0000 2 75 270 4095 7560 File N\001
|
||||
4 0 -1 50 -1 0 7 0.0000 2 105 1020 3195 7785 Buckets Physical View\001
|
||||
4 0 0 50 -1 0 10 0.0000 4 150 120 2700 7020 b)\001
|
||||
-6
|
||||
4 0 0 50 -1 0 10 0.0000 4 150 105 0 7020 a)\001
|
126
vldb07/figs/brz-partitioningfabiano.fig
Executable file
@ -0,0 +1,126 @@
|
||||
#FIG 3.2
|
||||
Landscape
|
||||
Center
|
||||
Metric
|
||||
A4
|
||||
100.00
|
||||
Single
|
||||
-2
|
||||
1200 2
|
||||
0 32 #bebebe
|
||||
0 33 #4e4e4e
|
||||
6 2160 3825 2430 4365
|
||||
2 2 0 1 0 33 50 -1 -1 0.000 0 0 7 0 0 5
|
||||
2160 4005 2430 4005 2430 4095 2160 4095 2160 4005
|
||||
2 2 0 1 0 33 50 -1 -1 0.000 0 0 7 0 0 5
|
||||
2160 3825 2430 3825 2430 3915 2160 3915 2160 3825
|
||||
2 2 0 1 0 33 50 -1 -1 0.000 0 0 7 0 0 5
|
||||
2160 3915 2430 3915 2430 4005 2160 4005 2160 3915
|
||||
2 2 0 1 0 33 50 -1 -1 0.000 0 0 7 0 0 5
|
||||
2160 4275 2430 4275 2430 4365 2160 4365 2160 4275
|
||||
2 2 0 1 0 33 50 -1 -1 0.000 0 0 7 0 0 5
|
||||
2160 4185 2430 4185 2430 4275 2160 4275 2160 4185
|
||||
2 2 0 1 0 33 50 -1 -1 0.000 0 0 7 0 0 5
|
||||
2160 4095 2430 4095 2430 4185 2160 4185 2160 4095
|
||||
-6
|
||||
6 2430 3735 2700 4365
|
||||
2 2 0 1 0 7 50 -1 41 0.000 0 0 -1 0 0 5
|
||||
2430 3825 2700 3825 2700 3915 2430 3915 2430 3825
|
||||
2 2 0 1 0 7 50 -1 41 0.000 0 0 -1 0 0 5
|
||||
2430 4275 2700 4275 2700 4365 2430 4365 2430 4275
|
||||
2 2 0 1 0 7 50 -1 41 0.000 0 0 -1 0 0 5
|
||||
2430 4185 2700 4185 2700 4275 2430 4275 2430 4185
|
||||
2 2 0 1 0 7 50 -1 41 0.000 0 0 -1 0 0 5
|
||||
2430 4095 2700 4095 2700 4185 2430 4185 2430 4095
|
||||
2 2 0 1 0 7 50 -1 41 0.000 0 0 -1 0 0 5
|
||||
2430 4005 2700 4005 2700 4095 2430 4095 2430 4005
|
||||
2 2 0 1 0 7 50 -1 41 0.000 0 0 -1 0 0 5
|
||||
2430 3915 2700 3915 2700 4005 2430 4005 2430 3915
|
||||
2 2 0 1 0 7 50 -1 41 0.000 0 0 -1 0 0 5
|
||||
2430 3735 2700 3735 2700 3825 2430 3825 2430 3735
|
||||
-6
|
||||
6 2700 4005 2970 4365
|
||||
2 2 0 1 0 32 50 -1 43 0.000 0 0 -1 0 0 5
|
||||
2700 4275 2970 4275 2970 4365 2700 4365 2700 4275
|
||||
2 2 0 1 0 32 50 -1 43 0.000 0 0 -1 0 0 5
|
||||
2700 4185 2970 4185 2970 4275 2700 4275 2700 4185
|
||||
2 2 0 1 0 32 50 -1 43 0.000 0 0 -1 0 0 5
|
||||
2700 4095 2970 4095 2970 4185 2700 4185 2700 4095
|
||||
2 2 0 1 -1 32 50 -1 43 0.000 0 0 -1 0 0 5
|
||||
2700 4005 2970 4005 2970 4095 2700 4095 2700 4005
|
||||
-6
|
||||
6 2025 5625 3690 5760
|
||||
4 0 0 50 -1 0 10 0.0000 4 105 360 2025 5760 File 1\001
|
||||
4 0 0 50 -1 0 10 0.0000 4 105 360 2565 5760 File 2\001
|
||||
4 0 0 50 -1 0 10 0.0000 4 105 405 3285 5760 File N\001
|
||||
-6
|
||||
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
|
||||
3510 4410 3510 4590
|
||||
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
|
||||
3780 4410 3780 4590
|
||||
2 2 0 1 0 33 50 -1 20 0.000 0 0 7 0 0 5
|
||||
1890 4185 2160 4185 2160 4275 1890 4275 1890 4185
|
||||
2 2 0 1 0 33 50 -1 20 0.000 0 0 7 0 0 5
|
||||
1890 4275 2160 4275 2160 4365 1890 4365 1890 4275
|
||||
2 2 0 1 0 33 50 -1 20 0.000 0 0 7 0 0 5
|
||||
1890 4095 2160 4095 2160 4185 1890 4185 1890 4095
|
||||
2 2 0 1 0 33 50 -1 20 0.000 0 0 7 0 0 5
|
||||
2070 4860 2340 4860 2340 5040 2070 5040 2070 4860
|
||||
2 2 0 1 0 7 50 -1 41 0.000 0 0 7 0 0 5
|
||||
3330 5220 3600 5220 3600 5400 3330 5400 3330 5220
|
||||
2 2 0 1 0 33 50 -1 20 0.000 0 0 7 0 0 5
|
||||
3330 4860 3600 4860 3600 4950 3330 4950 3330 4860
|
||||
2 2 0 1 0 33 50 -1 -1 0.000 0 0 7 0 0 5
|
||||
2070 5040 2340 5040 2340 5130 2070 5130 2070 5040
|
||||
2 2 0 1 0 33 50 -1 -1 0.000 0 0 7 0 0 5
|
||||
3330 4950 3600 4950 3600 5220 3330 5220 3330 4950
|
||||
2 2 0 1 0 7 50 -1 41 0.000 0 0 7 0 0 5
|
||||
2070 5130 2340 5130 2340 5310 2070 5310 2070 5130
|
||||
2 2 0 1 0 7 50 -1 10 0.000 0 0 7 0 0 5
|
||||
2610 5400 2880 5400 2880 5580 2610 5580 2610 5400
|
||||
2 2 0 1 0 7 50 -1 41 0.000 0 0 7 0 0 5
|
||||
2610 4860 2880 4860 2880 5040 2610 5040 2610 4860
|
||||
2 2 0 1 0 32 50 -1 43 0.000 0 0 7 0 0 5
|
||||
2610 5040 2880 5040 2880 5130 2610 5130 2610 5040
|
||||
2 2 0 1 0 7 50 -1 50 0.000 0 0 -1 0 0 5
|
||||
2970 4275 3240 4275 3240 4365 2970 4365 2970 4275
|
||||
2 2 0 1 0 7 50 -1 50 0.000 0 0 -1 0 0 5
|
||||
2970 4185 3240 4185 3240 4275 2970 4275 2970 4185
|
||||
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
|
||||
3510 4410 3600 4410
|
||||
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
|
||||
3690 4410 3780 4410
|
||||
2 2 0 1 0 7 50 -1 10 0.000 0 0 -1 0 0 5
|
||||
3510 4275 3780 4275 3780 4365 3510 4365 3510 4275
|
||||
2 2 0 1 0 7 50 -1 10 0.000 0 0 -1 0 0 5
|
||||
3510 4185 3780 4185 3780 4275 3510 4275 3510 4185
|
||||
2 2 0 1 0 32 50 -1 20 0.000 0 0 7 0 0 5
|
||||
2610 5130 2880 5130 2880 5400 2610 5400 2610 5130
|
||||
2 2 0 1 0 32 50 -1 43 0.000 0 0 7 0 0 5
|
||||
2070 5310 2340 5310 2340 5490 2070 5490 2070 5310
|
||||
2 2 0 1 0 7 50 -1 10 0.000 0 0 7 0 0 5
|
||||
2070 5490 2340 5490 2340 5580 2070 5580 2070 5490
|
||||
2 2 0 1 0 7 50 -1 50 0.000 0 0 7 0 0 5
|
||||
3330 5400 3600 5400 3600 5490 3330 5490 3330 5400
|
||||
2 2 0 1 0 32 50 -1 20 0.000 0 0 -1 0 0 5
|
||||
3240 4275 3510 4275 3510 4365 3240 4365 3240 4275
|
||||
2 2 0 1 0 32 50 -1 20 0.000 0 0 -1 0 0 5
|
||||
3240 4185 3510 4185 3510 4275 3240 4275 3240 4185
|
||||
2 2 0 1 -1 32 50 -1 20 0.000 0 0 -1 0 0 5
|
||||
3240 4095 3510 4095 3510 4185 3240 4185 3240 4095
|
||||
2 2 0 1 0 32 50 -1 20 0.000 0 0 -1 0 0 5
|
||||
3240 4005 3510 4005 3510 4095 3240 4095 3240 4005
|
||||
2 2 0 1 0 32 50 -1 20 0.000 0 0 -1 0 0 5
|
||||
3240 3915 3510 3915 3510 4005 3240 4005 3240 3915
|
||||
2 2 0 1 0 32 50 -1 20 0.000 0 0 -1 0 0 5
|
||||
3330 5490 3600 5490 3600 5580 3330 5580 3330 5490
|
||||
4 0 0 50 -1 0 10 0.0000 4 105 75 1980 4545 0\001
|
||||
4 0 0 50 -1 0 10 0.0000 4 105 420 3555 4545 n/b - 1\001
|
||||
4 0 0 50 -1 0 18 0.0000 4 30 180 3015 5265 ...\001
|
||||
4 0 0 50 -1 0 10 0.0000 4 105 75 2250 4545 1\001
|
||||
4 0 0 50 -1 0 10 0.0000 4 105 75 2520 4545 2\001
|
||||
4 0 0 50 -1 0 18 0.0000 4 30 180 2880 4500 ...\001
|
||||
4 0 0 50 -1 0 10 0.0000 4 135 1410 4050 5310 Buckets Physical View\001
|
||||
4 0 0 50 -1 0 10 0.0000 4 135 1350 4050 4140 Buckets Logical View\001
|
||||
4 0 0 50 -1 0 10 0.0000 4 135 120 1665 3780 a)\001
|
||||
4 0 0 50 -1 0 10 0.0000 4 135 135 1620 4950 b)\001
|
183
vldb07/figs/brz.fig
Executable file
@ -0,0 +1,183 @@
|
||||
#FIG 3.2 Produced by xfig version 3.2.5-alpha5
|
||||
Landscape
|
||||
Center
|
||||
Metric
|
||||
A4
|
||||
100.00
|
||||
Single
|
||||
-2
|
||||
1200 2
|
||||
0 32 #bdbebd
|
||||
0 33 #bdbebd
|
||||
0 34 #bdbebd
|
||||
0 35 #4a4d4a
|
||||
0 36 #bdbebd
|
||||
0 37 #4a4d4a
|
||||
0 38 #bdbebd
|
||||
0 39 #bdbebd
|
||||
0 40 #bdbebd
|
||||
6 3427 4042 3852 4211
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
3427 4041 3852 4041 3852 4211 3427 4211 3427 4041
|
||||
4 0 0 50 -1 0 14 0.0000 4 30 180 3551 4140 ...\001
|
||||
-6
|
||||
6 3410 5689 3835 5859
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
3410 5689 3835 5689 3835 5858 3410 5858 3410 5689
|
||||
4 0 0 50 -1 0 14 0.0000 4 30 180 3534 5788 ...\001
|
||||
-6
|
||||
6 3825 5445 4455 5535
|
||||
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
|
||||
4140 5445 4095 5490
|
||||
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
|
||||
4140 5445 4185 5490
|
||||
3 0 0 1 0 7 50 -1 -1 0.000 0 0 0 8
|
||||
3825 5535 3825 5490 3870 5490 3915 5490 3959 5490 4006 5490
|
||||
4095 5490 4095 5490
|
||||
0.000 1.000 1.000 1.000 1.000 1.000 1.000 0.000
|
||||
3 0 0 1 0 7 50 -1 -1 0.000 0 0 0 7
|
||||
4455 5535 4455 5490 4410 5490 4365 5490 4321 5490 4274 5490
|
||||
4185 5490
|
||||
0.000 1.000 1.000 1.000 1.000 1.000 0.000
|
||||
-6
|
||||
6 1873 5442 2323 5532
|
||||
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
|
||||
2098 5442 2066 5487
|
||||
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
|
||||
2098 5442 2130 5487
|
||||
3 0 0 1 0 7 50 -1 -1 0.000 0 0 0 8
|
||||
1873 5532 1873 5487 1905 5487 1937 5487 1969 5487 2002 5487
|
||||
2066 5487 2066 5487
|
||||
0.000 1.000 1.000 1.000 1.000 1.000 1.000 0.000
|
||||
3 0 0 1 0 7 50 -1 -1 0.000 0 0 0 7
|
||||
2323 5532 2323 5487 2291 5487 2259 5487 2227 5487 2194 5487
|
||||
2130 5487
|
||||
0.000 1.000 1.000 1.000 1.000 1.000 0.000
|
||||
-6
|
||||
6 2338 5442 2968 5532
|
||||
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
|
||||
2653 5442 2608 5487
|
||||
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
|
||||
2653 5442 2698 5487
|
||||
3 0 0 1 0 7 50 -1 -1 0.000 0 0 0 8
|
||||
2338 5532 2338 5487 2383 5487 2428 5487 2473 5487 2518 5487
|
||||
2608 5487 2608 5487
|
||||
0.000 1.000 1.000 1.000 1.000 1.000 1.000 0.000
|
||||
3 0 0 1 0 7 50 -1 -1 0.000 0 0 0 7
|
||||
2968 5532 2968 5487 2923 5487 2878 5487 2833 5487 2788 5487
|
||||
2698 5487
|
||||
0.000 1.000 1.000 1.000 1.000 1.000 0.000
|
||||
-6
|
||||
6 2475 4500 4770 5175
|
||||
2 1 0 1 -1 7 50 -1 -1 0.000 0 0 -1 0 0 2
|
||||
3137 5013 3845 5013
|
||||
2 2 0 1 0 37 50 -1 20 0.000 0 0 7 0 0 5
|
||||
2497 4675 2711 4675 2711 4845 2497 4845 2497 4675
|
||||
2 2 0 1 0 37 50 -1 -1 0.000 0 0 7 0 0 5
|
||||
2711 4675 2923 4675 2923 4845 2711 4845 2711 4675
|
||||
2 2 0 1 0 37 50 -1 -1 0.000 0 0 7 0 0 5
|
||||
2711 4506 2923 4506 2923 4675 2711 4675 2711 4506
|
||||
2 2 0 1 0 36 50 -1 43 0.000 0 0 7 0 0 5
|
||||
3775 4675 3987 4675 3987 4845 3775 4845 3775 4675
|
||||
2 2 0 1 0 36 50 -1 43 0.000 0 0 7 0 0 5
|
||||
3775 4845 3987 4845 3987 5013 3775 5013 3775 4845
|
||||
2 2 0 1 -1 7 50 -1 -1 0.000 0 0 7 0 0 5
|
||||
2711 4845 2923 4845 2923 5013 2711 5013 2711 4845
|
||||
2 2 0 1 0 7 50 -1 41 0.000 0 0 -1 0 0 5
|
||||
2923 4845 3137 4845 3137 5013 2923 5013 2923 4845
|
||||
2 2 0 1 0 36 50 -1 43 0.000 0 0 7 0 0 5
|
||||
3775 4506 3987 4506 3987 4675 3775 4675 3775 4506
|
||||
2 2 0 1 0 37 50 -1 20 0.000 0 0 7 0 0 5
|
||||
2497 4845 2711 4845 2711 5013 2497 5013 2497 4845
|
||||
2 2 0 1 0 7 50 -1 41 0.000 0 0 -1 0 0 5
|
||||
2923 4675 3137 4675 3137 4845 2923 4845 2923 4675
|
||||
4 0 0 50 -1 0 14 0.0000 4 30 180 3385 4929 ...\001
|
||||
4 0 -1 50 -1 0 7 0.0000 2 75 60 2569 5140 0\001
|
||||
4 0 -1 50 -1 0 7 0.0000 2 75 60 2781 5140 1\001
|
||||
4 0 -1 50 -1 0 7 0.0000 2 75 60 2995 5140 2\001
|
||||
4 0 -1 50 -1 0 7 0.0000 2 75 405 4059 4845 Buckets\001
|
||||
4 0 -1 50 -1 0 7 0.0000 2 105 1095 3775 5140 ${\\lceil n/b\\rceil - 1}$\001
|
||||
-6
|
||||
6 2983 5446 3433 5536
|
||||
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
|
||||
3208 5446 3176 5491
|
||||
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
|
||||
3208 5446 3240 5491
|
||||
3 0 0 1 0 7 50 -1 -1 0.000 0 0 0 8
|
||||
2983 5536 2983 5491 3015 5491 3047 5491 3079 5491 3112 5491
|
||||
3176 5491 3176 5491
|
||||
0.000 1.000 1.000 1.000 1.000 1.000 1.000 0.000
|
||||
3 0 0 1 0 7 50 -1 -1 0.000 0 0 0 7
|
||||
3433 5536 3433 5491 3401 5491 3369 5491 3337 5491 3304 5491
|
||||
3240 5491
|
||||
0.000 1.000 1.000 1.000 1.000 1.000 0.000
|
||||
-6
|
||||
2 2 0 1 0 36 50 -1 -1 0.000 0 0 7 0 0 5
|
||||
3852 4041 4066 4041 4066 4211 3852 4211 3852 4041
|
||||
2 2 0 1 0 36 50 -1 -1 0.000 0 0 7 0 0 5
|
||||
4066 4041 4279 4041 4279 4211 4066 4211 4066 4041
|
||||
2 2 0 1 0 37 50 -1 -1 0.000 0 0 7 0 0 5
|
||||
1937 4041 2149 4041 2149 4211 1937 4211 1937 4041
|
||||
2 2 0 1 0 37 50 -1 -1 0.000 0 0 7 0 0 5
|
||||
2149 4041 2362 4041 2362 4211 2149 4211 2149 4041
|
||||
2 2 0 1 0 37 50 -1 -1 0.000 0 0 7 0 0 5
|
||||
2362 4041 2576 4041 2576 4211 2362 4211 2362 4041
|
||||
2 2 0 1 0 37 50 -1 -1 0.000 0 0 7 0 0 5
|
||||
2576 4041 2788 4041 2788 4211 2576 4211 2576 4041
|
||||
2 2 0 1 0 37 50 -1 -1 0.000 0 0 7 0 0 5
|
||||
2788 4041 3002 4041 3002 4211 2788 4211 2788 4041
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
3214 4041 3427 4041 3427 4211 3214 4211 3214 4041
|
||||
2 2 0 1 0 36 50 -1 -1 0.000 0 0 7 0 0 5
|
||||
4279 4041 4492 4041 4492 4211 4279 4211 4279 4041
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
3002 4041 3214 4041 3214 4211 3002 4211 3002 4041
|
||||
2 2 0 1 0 37 50 -1 20 0.000 0 0 7 0 0 5
|
||||
2132 5689 2345 5689 2345 5858 2132 5858 2132 5689
|
||||
2 2 0 1 0 7 50 -1 41 0.000 0 0 -1 0 0 5
|
||||
3197 5689 3410 5689 3410 5858 3197 5858 3197 5689
|
||||
2 2 0 1 0 37 50 -1 -1 0.000 0 0 7 0 0 5
|
||||
2771 5689 2985 5689 2985 5858 2771 5858 2771 5689
|
||||
2 2 0 1 0 36 50 -1 43 0.000 0 0 7 0 0 5
|
||||
4262 5689 4475 5689 4475 5858 4262 5858 4262 5689
|
||||
2 2 0 1 0 36 50 -1 43 0.000 0 0 7 0 0 5
|
||||
4049 5689 4262 5689 4262 5858 4049 5858 4049 5689
|
||||
2 2 0 1 0 7 50 -1 41 0.000 0 0 -1 0 0 5
|
||||
2985 5689 3197 5689 3197 5858 2985 5858 2985 5689
|
||||
2 2 0 1 0 37 50 -1 -1 0.000 0 0 7 0 0 5
|
||||
2345 5689 2559 5689 2559 5858 2345 5858 2345 5689
|
||||
2 2 0 1 0 37 50 -1 20 0.000 0 0 7 0 0 5
|
||||
1914 5687 2127 5687 2127 5856 1914 5856 1914 5687
|
||||
2 2 0 1 0 36 50 -1 43 0.000 0 0 7 0 0 5
|
||||
3835 5689 4049 5689 4049 5858 3835 5858 3835 5689
|
||||
2 2 0 1 0 37 50 -1 -1 0.000 0 0 7 0 0 5
|
||||
2559 5689 2771 5689 2771 5858 2559 5858 2559 5689
|
||||
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 0 5
|
||||
1 1 1.00 60.00 120.00
|
||||
3330 4275 3330 4365 3330 4410 3330 4455 3330 4500
|
||||
2 1 0 1 0 7 50 -1 -1 0.000 0 0 7 1 0 2
|
||||
1 1 1.00 45.00 60.00
|
||||
3880 5168 4140 5445
|
||||
2 1 0 1 0 7 50 -1 -1 0.000 0 0 7 1 0 2
|
||||
1 1 1.00 45.00 60.00
|
||||
3025 5170 3205 5440
|
||||
2 1 0 1 0 7 50 -1 -1 0.000 0 0 7 1 0 2
|
||||
1 1 1.00 45.00 60.00
|
||||
2805 5164 2653 5438
|
||||
2 1 0 1 0 7 50 -1 -1 0.000 0 0 7 1 0 2
|
||||
1 1 1.00 45.00 60.00
|
||||
2577 5170 2103 5434
|
||||
4 0 -1 50 -1 0 7 0.0000 2 120 645 4562 4168 Key Set $S$\001
|
||||
4 0 -1 50 -1 0 7 0.0000 2 75 60 2008 3999 0\001
|
||||
4 0 -1 50 -1 0 7 0.0000 2 75 60 2220 3999 1\001
|
||||
4 0 -1 50 -1 0 7 0.0000 2 75 165 4314 3999 n-1\001
|
||||
4 0 -1 50 -1 0 7 0.0000 2 75 60 1991 5985 0\001
|
||||
4 0 -1 50 -1 0 7 0.0000 2 75 60 2203 5985 1\001
|
||||
4 0 -1 50 -1 0 7 0.0000 2 75 165 4297 5985 n-1\001
|
||||
4 0 -1 50 -1 0 7 0.0000 2 75 555 4545 5816 Hash Table\001
|
||||
4 0 -1 50 -1 0 3 0.0000 2 75 450 1980 5625 MPHF$_0$\001
|
||||
4 0 -1 50 -1 0 3 0.0000 2 75 450 2520 5625 MPHF$_1$\001
|
||||
4 0 -1 50 -1 0 3 0.0000 2 75 450 3015 5625 MPHF$_2$\001
|
||||
4 0 -1 50 -1 0 3 0.0000 2 75 1065 3825 5625 MPHF$_{\\lceil n/b \\rceil - 1}$\001
|
||||
4 0 -1 50 -1 0 7 0.0000 2 105 585 1440 4455 Partitioning\001
|
||||
4 0 -1 50 -1 0 7 0.0000 2 105 495 1440 5265 Searching\001
|
BIN
vldb07/figs/brz_temporegressao.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 5.5 KiB |
153
vldb07/figs/brzfabiano.fig
Executable file
@ -0,0 +1,153 @@
|
||||
#FIG 3.2 Produced by xfig version 3.2.5-alpha5
|
||||
Landscape
|
||||
Center
|
||||
Metric
|
||||
A4
|
||||
100.00
|
||||
Single
|
||||
-2
|
||||
1200 2
|
||||
0 32 #bebebe
|
||||
6 2025 3015 3555 3690
|
||||
2 3 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 8
|
||||
2025 3285 2295 3285 2295 3015 3285 3015 3285 3285 3555 3285
|
||||
2790 3690 2025 3285
|
||||
4 0 0 50 -1 0 10 0.0000 4 135 765 2385 3330 Partitioning\001
|
||||
-6
|
||||
6 1890 3735 3780 4365
|
||||
6 2430 3735 2700 4365
|
||||
6 2430 3915 2700 4365
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
2430 4275 2700 4275 2700 4365 2430 4365 2430 4275
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
2430 4185 2700 4185 2700 4275 2430 4275 2430 4185
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
2430 4095 2700 4095 2700 4185 2430 4185 2430 4095
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
2430 4005 2700 4005 2700 4095 2430 4095 2430 4005
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
2430 3915 2700 3915 2700 4005 2430 4005 2430 3915
|
||||
-6
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
2430 3825 2700 3825 2700 3915 2430 3915 2430 3825
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
2430 3735 2700 3735 2700 3825 2430 3825 2430 3735
|
||||
-6
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
1890 4275 2160 4275 2160 4365 1890 4365 1890 4275
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
1890 4185 2160 4185 2160 4275 1890 4275 1890 4185
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
2160 4275 2430 4275 2430 4365 2160 4365 2160 4275
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
2160 4185 2430 4185 2430 4275 2160 4275 2160 4185
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
2160 4095 2430 4095 2430 4185 2160 4185 2160 4095
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
2160 4005 2430 4005 2430 4095 2160 4095 2160 4005
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
2160 3915 2430 3915 2430 4005 2160 4005 2160 3915
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
2700 4275 2970 4275 2970 4365 2700 4365 2700 4275
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
2700 4185 2970 4185 2970 4275 2700 4275 2700 4185
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
2700 4095 2970 4095 2970 4185 2700 4185 2700 4095
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
2700 4005 2970 4005 2970 4095 2700 4095 2700 4005
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
2160 3825 2430 3825 2430 3915 2160 3915 2160 3825
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
3240 4275 3510 4275 3510 4365 3240 4365 3240 4275
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
3510 4275 3780 4275 3780 4365 3510 4365 3510 4275
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
2970 4275 3240 4275 3240 4365 2970 4365 2970 4275
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
3240 4185 3510 4185 3510 4275 3240 4275 3240 4185
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
1890 4095 2160 4095 2160 4185 1890 4185 1890 4095
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
3510 4185 3780 4185 3780 4275 3510 4275 3510 4185
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
3240 4095 3510 4095 3510 4185 3240 4185 3240 4095
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
3240 4005 3510 4005 3510 4095 3240 4095 3240 4005
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
3240 3915 3510 3915 3510 4005 3240 4005 3240 3915
|
||||
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
|
||||
1890 4365 3780 4365
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
2970 4185 3240 4185 3240 4275 2970 4275 2970 4185
|
||||
-6
|
||||
6 1260 5310 4230 5580
|
||||
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
|
||||
1260 5400 4230 5400
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
1530 5310 1800 5310 1800 5400 1530 5400 1530 5310
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
2070 5310 2340 5310 2340 5400 2070 5400 2070 5310
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
2340 5310 2610 5310 2610 5400 2340 5400 2340 5310
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
2610 5310 2880 5310 2880 5400 2610 5400 2610 5310
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
2880 5310 3150 5310 3150 5400 2880 5400 2880 5310
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
3420 5310 3690 5310 3690 5400 3420 5400 3420 5310
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
3690 5310 3960 5310 3960 5400 3690 5400 3690 5310
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
3960 5310 4230 5310 4230 5400 3960 5400 3960 5310
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
1800 5310 2070 5310 2070 5400 1800 5400 1800 5310
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
3150 5310 3420 5310 3420 5400 3150 5400 3150 5310
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
1260 5310 1530 5310 1530 5400 1260 5400 1260 5310
|
||||
4 0 0 50 -1 0 10 0.0000 4 105 210 4005 5580 n-1\001
|
||||
4 0 0 50 -1 0 10 0.0000 4 105 75 1350 5580 0\001
|
||||
-6
|
||||
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
|
||||
1260 2925 4230 2925
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
1530 2835 1800 2835 1800 2925 1530 2925 1530 2835
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
2070 2835 2340 2835 2340 2925 2070 2925 2070 2835
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
2340 2835 2610 2835 2610 2925 2340 2925 2340 2835
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
2610 2835 2880 2835 2880 2925 2610 2925 2610 2835
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
2880 2835 3150 2835 3150 2925 2880 2925 2880 2835
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
3420 2835 3690 2835 3690 2925 3420 2925 3420 2835
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
3690 2835 3960 2835 3960 2925 3690 2925 3690 2835
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
3960 2835 4230 2835 4230 2925 3960 2925 3960 2835
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
1800 2835 2070 2835 2070 2925 1800 2925 1800 2835
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
3150 2835 3420 2835 3420 2925 3150 2925 3150 2835
|
||||
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
|
||||
1260 2835 1530 2835 1530 2925 1260 2925 1260 2835
|
||||
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
|
||||
3510 4410 3510 4590
|
||||
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
|
||||
3510 4410 3600 4410
|
||||
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
|
||||
3690 4410 3780 4410
|
||||
2 3 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 8
|
||||
2025 4815 2295 4815 2295 4545 3285 4545 3285 4815 3555 4815
|
||||
2790 5220 2025 4815
|
||||
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
|
||||
3780 4410 3780 4590
|
||||
4 0 0 50 -1 0 10 0.0000 4 135 585 2475 4860 Searching\001
|
||||
4 0 0 50 -1 0 10 0.0000 4 105 75 1980 4545 0\001
|
||||
4 0 0 50 -1 0 10 0.0000 4 105 690 4410 5400 Hash Table\001
|
||||
4 0 0 50 -1 0 10 0.0000 4 105 480 4410 4230 Buckets\001
|
||||
4 0 0 50 -1 0 10 0.0000 4 135 555 4410 2925 Key set S\001
|
||||
4 0 0 50 -1 0 10 0.0000 4 105 75 1350 2745 0\001
|
||||
4 0 0 50 -1 0 10 0.0000 4 105 210 4005 2745 n-1\001
|
||||
4 0 0 50 -1 0 10 0.0000 4 105 420 3555 4545 n/b - 1\001
|
BIN
vldb07/figs/minimalperfecthash-ph-mph.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 3.8 KiB |
109
vldb07/introduction.tex
Executable file
@ -0,0 +1,109 @@
|
||||
%% Nivio: 22/jan/06 23/jan/06 29/jan
|
||||
% Time-stamp: <Monday 30 Jan 2006 03:52:42am EDT yoshi@ime.usp.br>
|
||||
\section{Introduction}
|
||||
\label{sec:intro}
|
||||
|
||||
\enlargethispage{2\baselineskip}
|
||||
Suppose~$U$ is a universe of \textit{keys} of size $u$.
|
||||
Let $h:U\to M$ be a {\em hash function} that maps the keys from~$U$
|
||||
to a given interval of integers $M=[0,m-1]=\{0,1,\dots,m-1\}$.
|
||||
Let~$S\subseteq U$ be a set of~$n$ keys from~$U$, where $ n \ll u$.
|
||||
Given a key~$x\in S$, the hash function~$h$ computes an integer in
|
||||
$[0,m-1]$ for the storage or retrieval of~$x$ in a {\em hash table}.
|
||||
% Hashing methods for {\em non-static sets} of keys can be used to construct
|
||||
% data structures storing $S$ and supporting membership queries
|
||||
% ``$x \in S$?'' in expected time $O(1)$.
|
||||
% However, they involve a certain amount of wasted space owing to unused
|
||||
% locations in the table and wasted time to resolve collisions when
|
||||
% two keys are hashed to the same table location.
|
||||
A perfect hash function maps a {\em static set} $S$ of $n$ keys from $U$ into a set of $m$ integer
|
||||
numbers without collisions, where $m$ is greater than or equal to $n$.
|
||||
If $m$ is equal to $n$, the function is called minimal.
|
||||
|
||||
% Figure~\ref{fig:minimalperfecthash-ph-mph}(a) illustrates a perfect hash function and
|
||||
% Figure~\ref{fig:minimalperfecthash-ph-mph}(b) illustrates a minimal perfect hash function (MPHF).
|
||||
%
|
||||
% \begin{figure}
|
||||
% \centering
|
||||
% \scalebox{0.7}{\epsfig{file=figs/minimalperfecthash-ph-mph.ps}}
|
||||
% \caption{(a) Perfect hash function (b) Minimal perfect hash function (MPHF)}
|
||||
% \label{fig:minimalperfecthash-ph-mph}
|
||||
% %\vspace{-5mm}
|
||||
% \end{figure}
|
||||
|
||||
Minimal perfect hash functions are widely used for memory efficient storage and fast
|
||||
retrieval of items from static sets, such as words in natural languages,
|
||||
reserved words in programming languages or interactive systems, universal resource
|
||||
locations (URLs) in web search engines, or item sets in data mining techniques.
|
||||
Search engines are nowadays indexing tens of billions of pages and algorithms
|
||||
like PageRank~\cite{Brin1998}, which uses the web link structure to derive a
|
||||
measure of popularity for Web pages, would benefit from a MPHF for storage and
|
||||
retrieval of such huge sets of URLs.
|
||||
For instance, the TodoBr\footnote{TodoBr (\texttt{www.todobr.com.br}) is a trademark of
|
||||
Akwan Information Technologies, which was acquired by Google Inc. in July 2005.}
|
||||
search engine used the algorithm proposed hereinafter to
|
||||
improve and to scale its link analysis system.
|
||||
The WebGraph research group~\cite{bv04} would
|
||||
also benefit from a MPHF for sets in the order of billions of URLs to scale
|
||||
and to improve the storage requirements of their algorithms on graph compression.
|
||||
|
||||
Another interesting application for MPHFs is their use as an indexing structure
|
||||
for databases.
|
||||
The B+ tree is very popular as an indexing structure for dynamic applications
|
||||
with frequent insertions and deletions of records.
|
||||
However, for applications with sporadic modifications and a huge number of
|
||||
queries, the B+ tree is not the best option,
|
||||
because it performs poorly with very large sets of keys
|
||||
such as those required for the new frontiers of database applications~\cite{s05}.
|
||||
Therefore, there are applications for MPHFs in
|
||||
information retrieval systems, database systems, language translation systems,
|
||||
electronic commerce systems, compilers, operating systems, among others.
|
||||
|
||||
Until now, because of the limitations of current algorithms,
|
||||
the use of MPHFs is restricted to scenarios where the set of keys being hashed is
|
||||
relatively small.
|
||||
However, in many cases it is crucial to deal in an efficient way with very large
|
||||
sets of keys.
|
||||
Due to the exponential growth of the Web, working with huge collections is becoming
|
||||
a daily task.
|
||||
For instance, the simple assignment of numeric identifiers to the web pages of a collection
|
||||
can be a challenging task.
|
||||
While traditional databases simply cannot handle more traffic once the working
|
||||
set of URLs does not fit in main memory anymore~\cite{s05}, the algorithm we propose here to
|
||||
construct MPHFs can easily scale to billions of entries.
|
||||
% using stock hardware.
|
||||
|
||||
As there are many applications for MPHFs, it is
|
||||
important to design and implement space and time efficient algorithms for
|
||||
constructing such functions.
|
||||
The attractiveness of using MPHFs depends on the following issues:
|
||||
\begin{enumerate}
|
||||
\item The amount of CPU time required by the algorithms for constructing MPHFs.
|
||||
\item The space requirements of the algorithms for constructing MPHFs.
|
||||
\item The amount of CPU time required by a MPHF for each retrieval.
|
||||
\item The space requirements of the description of the resulting MPHFs to be
|
||||
used at retrieval time.
|
||||
\end{enumerate}
|
||||
|
||||
\enlargethispage{2\baselineskip}
|
||||
This paper presents a novel external memory based algorithm for constructing MPHFs that
|
||||
is very efficient in these four requirements.
|
||||
First, the algorithm constructs a MPHF in expected time linear in the number of keys,
|
||||
which is optimal.
|
||||
For instance, for a collection of 1 billion URLs
|
||||
collected from the web, each one 64 characters long on average, the time to construct a
|
||||
MPHF using a 2.4 gigahertz PC with 500 megabytes of available main memory
|
||||
is approximately 3 hours.
|
||||
Second, the algorithm needs a small a priori defined vector of $\lceil n/b \rceil$
|
||||
one byte entries in main memory to construct a MPHF.
|
||||
For the collection of 1 billion URLs and using $b=175$, the algorithm needs only
|
||||
5.45 megabytes of internal memory.
|
||||
Third, the evaluation of the MPHF for each retrieval requires three memory accesses and
|
||||
the computation of three universal hash functions.
|
||||
This is not optimal as any MPHF requires at least one memory access and the computation
|
||||
of two universal hash functions.
|
||||
Fourth, the description of a MPHF takes a constant number of bits for each key, which is optimal.
|
||||
For the collection of 1 billion URLs, it needs 8.1 bits for each key,
|
||||
while the theoretical lower bound is $1/\ln2 \approx 1.4427$ bits per
|
||||
key~\cite{m84}.
|
||||
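As a rough consistency check on the figures above (our own arithmetic, assuming
that one megabyte denotes $2^{20}$ bytes), for $n = 10^9$ keys and $b = 175$
the vector has
\begin{displaymath}
\left\lceil \frac{10^9}{175} \right\rceil = 5{,}714{,}286 \mbox{ one-byte entries},
\qquad
\frac{5{,}714{,}286}{2^{20}} \approx 5.45 \mbox{ megabytes}.
\end{displaymath}
Likewise, $8.1$ bits per key amount to about
$8.1 \times 10^9 / 8 \approx 1.01 \times 10^9$ bytes for the whole description,
that is, roughly $8.1/1.4427 \approx 5.6$ times the theoretical lower bound.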
|
17
vldb07/makefile
Executable file
@ -0,0 +1,17 @@
|
||||
all:
|
||||
latex vldb.tex
|
||||
bibtex vldb
|
||||
latex vldb.tex
|
||||
latex vldb.tex
|
||||
dvips vldb.dvi -o vldb.ps
|
||||
ps2pdf vldb.ps
|
||||
chmod -R g+rwx *
|
||||
|
||||
perm:
|
||||
chmod -R g+rwx *
|
||||
|
||||
run: clean all
|
||||
gv vldb.ps &
|
||||
clean:
|
||||
rm -f *.aux *.bbl *.blg *.log *.ps *.pdf *.dvi
|
||||
|
141
vldb07/partitioningthekeys.tex
Executable file
@ -0,0 +1,141 @@
|
||||
%% Nivio: 21/jan/06
|
||||
% Time-stamp: <Monday 30 Jan 2006 03:57:28am EDT yoshi@ime.usp.br>
|
||||
\vspace{-2mm}
|
||||
\subsection{Partitioning step}
|
||||
\label{sec:partitioning-keys}
|
||||
|
||||
The set $S$ of $n$ keys is partitioned into $\lceil n/b \rceil$ buckets,
|
||||
where $b$ is a suitable parameter chosen to guarantee
|
||||
that each bucket has at most 256 keys with high probability
|
||||
(see Section~\ref{sec:determining-b}).
|
||||
The partitioning step works as follows:
|
||||
|
||||
\begin{figure}[h]
|
||||
\hrule
|
||||
\hrule
|
||||
\vspace{2mm}
|
||||
\begin{tabbing}
|
||||
aa\=type booleanx \== (false, true); \kill
|
||||
\> $\blacktriangleright$ Let $\beta$ be the size in bytes of the set $S$ \\
|
||||
\> $\blacktriangleright$ Let $\mu$ be the size in bytes of an a priori reserved \\
|
||||
\> ~~~ internal memory area \\
|
||||
\> $\blacktriangleright$ Let $N = \lceil \beta/\mu \rceil$ be the number of key blocks that will \\
|
||||
\> ~~~ be read from disk into an internal memory area \\
|
||||
\> $\blacktriangleright$ Let $\mathit{size}$ be a vector that stores the size of each bucket \\
|
||||
\> $1.$ {\bf for} $j = 1$ {\bf to} $N$ {\bf do} \\
|
||||
\> ~~ $1.1$ Read block $B_j$ of keys from disk \\
|
||||
\> ~~ $1.2$ Cluster $B_j$ into $\lceil n/b \rceil$ buckets using a bucket sort \\
|
||||
\> ~~~~~~~ algorithm and update the entries in the vector {\it size} \\
|
||||
\> ~~ $1.3$ Dump $B_j$ to the disk into File $j$\\
|
||||
\> $2.$ Compute the {\it offset} vector and dump it to the disk.
|
||||
\end{tabbing}
|
||||
\hrule
|
||||
\hrule
|
||||
\vspace{-1.0mm}
|
||||
\caption{Partitioning step}
|
||||
\vspace{-3mm}
|
||||
\label{fig:partitioningstep}
|
||||
\end{figure}
|
||||
|
||||
Statement 1.1 of the {\bf for} loop presented in Figure~\ref{fig:partitioningstep}
|
||||
reads sequentially all the keys of block $B_j$ from disk into an internal area
|
||||
of size $\mu$.
|
||||
|
||||
Statement 1.2 performs an indirect bucket sort of the keys in block $B_j$
|
||||
and at the same time updates the entries in the vector {\em size}.
|
||||
Let us briefly describe how~$B_j$ is partitioned among the~$\lceil n/b\rceil$
|
||||
buckets.
|
||||
We use a local array of $\lceil n/b \rceil$ counters to store a
|
||||
count of how many keys from $B_j$ belong to each bucket.
|
||||
%At the same time, the global vector {\it size} is computed based on the local
|
||||
%counters.
|
||||
The pointers to the keys in each bucket $i$, $0 \leq i < \lceil n/b \rceil$,
|
||||
are stored in contiguous positions in an array.
|
||||
For this we first reserve the required number of entries
|
||||
in this array of pointers using the information from the array of counters.
|
||||
Next, we place the pointers to the keys in each bucket into the respective
|
||||
reserved areas in the array (i.e., we place the pointers to the keys in bucket 0,
|
||||
followed by the pointers to the keys in bucket 1, and so on).
|
||||
|
||||
\enlargethispage{2\baselineskip}
|
||||
To find the bucket address of a given key
|
||||
we use the universal hash function $h_0(k)$~\cite{j97}.
|
||||
Key~$k$ goes into bucket~$i$, where
|
||||
%Then, for each integer $h_0(k)$ the respective bucket address is obtained
|
||||
%as follows:
|
||||
\begin{eqnarray} \label{eq:bucketindex}
|
||||
i=h_0(k) \bmod \left \lceil \frac{n}{b} \right \rceil.
|
||||
\end{eqnarray}
|
||||
|
||||
Figure~\ref{fig:brz-partitioning}(a) shows a \emph{logical} view of the
|
||||
$\lceil n/b \rceil$ buckets generated in the partitioning step.
|
||||
%In this case, the keys of each bucket are put together by the pointers to
|
||||
%each key stored
|
||||
%in contiguous positions in the array of pointers.
|
||||
In reality, the keys belonging to each bucket are distributed among many files,
|
||||
as depicted in Figure~\ref{fig:brz-partitioning}(b).
|
||||
In the example of Figure~\ref{fig:brz-partitioning}(b), the keys in bucket 0
|
||||
appear in files 1 and $N$, the keys in bucket 1 appear in files 1, 2
|
||||
and $N$, and so on.
|
||||
|
||||
\vspace{-7mm}
|
||||
\begin{figure}[ht]
|
||||
\centering
|
||||
\begin{picture}(0,0)%
|
||||
\includegraphics{figs/brz-partitioning.ps}%
|
||||
\end{picture}%
|
||||
\setlength{\unitlength}{4144sp}%
|
||||
%
|
||||
\begingroup\makeatletter\ifx\SetFigFont\undefined%
|
||||
\gdef\SetFigFont#1#2#3#4#5{%
|
||||
\reset@font\fontsize{#1}{#2pt}%
|
||||
\fontfamily{#3}\fontseries{#4}\fontshape{#5}%
|
||||
\selectfont}%
|
||||
\fi\endgroup%
|
||||
\begin{picture}(4371,1403)(1,-6977)
|
||||
\put(333,-6421){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}0}}}}
|
||||
\put(545,-6421){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}1}}}}
|
||||
\put(759,-6421){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}2}}}}
|
||||
\put(1539,-6421){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}${\lceil n/b\rceil - 1}$}}}}
|
||||
\put(541,-6676){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}Buckets Logical View}}}}
|
||||
\put(3547,-6120){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}}
|
||||
\put(3547,-6188){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}}
|
||||
\put(3547,-6255){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}}
|
||||
\put(3107,-6120){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}}
|
||||
\put(3107,-6188){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}}
|
||||
\put(3107,-6255){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}}
|
||||
\put(4177,-6224){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}}
|
||||
\put(4177,-6269){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}}
|
||||
\put(4177,-6314){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}}
|
||||
\put(3016,-6721){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}File 1}}}}
|
||||
\put(3466,-6721){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}File 2}}}}
|
||||
\put(4096,-6721){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}File N}}}}
|
||||
\put(3196,-6946){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}Buckets Physical View}}}}
|
||||
\end{picture}%
|
||||
\caption{Situation of the buckets at the end of the partitioning step: (a) Logical view (b) Physical view}
|
||||
\label{fig:brz-partitioning}
|
||||
\vspace{-2mm}
|
||||
\end{figure}
|
||||
|
||||
This scattering of the keys in the buckets could generate a performance
|
||||
problem because of the potential number of seeks
|
||||
needed to read the keys in each bucket from the $N$ files on disk
|
||||
during the searching step.
|
||||
But, as we show later in Section~\ref{sec:analytcal-results}, the number of seeks
|
||||
can be kept small using buffering techniques.
|
||||
Considering that only the vector {\it size}, which has $\lceil n/b \rceil$
|
||||
one-byte entries (remember that each bucket has at most 256 keys),
|
||||
must be maintained in main memory during the searching step,
|
||||
almost all main memory is available to be used as disk I/O buffer.
|
||||
|
||||
The last step is to compute the {\it offset} vector and dump it to the disk.
|
||||
We use the vector $\mathit{size}$ to compute the
|
||||
$\mathit{offset}$ displacement vector.
|
||||
The $\mathit{offset}[i]$ entry contains the number of keys
|
||||
in the buckets $0, 1, \dots, i-1$.
|
||||
As {\it size}$[i]$ stores the number of keys
|
||||
in bucket $i$, where $0 \leq i <\lceil n/b \rceil$, we have
|
||||
\begin{displaymath}
|
||||
\mathit{offset}[i] = \sum_{j=0}^{i-1} \mathit{size}[j].
|
||||
\end{displaymath}
|
||||
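For illustration, consider a hypothetical instance with four buckets (the sizes
below are invented for this example and are not taken from our experiments) in
which $\mathit{size} = (3, 0, 2, 4)$. Then
\begin{displaymath}
\mathit{offset}[0] = 0, \quad
\mathit{offset}[1] = 3, \quad
\mathit{offset}[2] = 3 + 0 = 3, \quad
\mathit{offset}[3] = 3 + 0 + 2 = 5,
\end{displaymath}
and $\mathit{offset}[3] + \mathit{size}[3] = 9$ is the total number of keys.
In general $\mathit{offset}[i+1] = \mathit{offset}[i] + \mathit{size}[i]$, so
the whole vector can be computed in a single pass over $\mathit{size}$.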
|
113
vldb07/performancenewalgorithm.tex
Executable file
@ -0,0 +1,113 @@
|
||||
% Nivio: 29/jan/06
|
||||
% Time-stamp: <Monday 30 Jan 2006 12:13:14pm EST yoshi@flare>
|
||||
\subsection{Performance of the new algorithm}
|
||||
\label{sec:performance}
|
||||
%As we have done for the internal memory based algorithm,
|
||||
|
||||
The runtime of our algorithm is also a random variable, but now it follows a
|
||||
(highly concentrated) normal distribution, as we discuss at the end of this
|
||||
section. Again, we are interested in verifying the linearity claim made in
|
||||
Section~\ref{sec:linearcomplexity}. Therefore, we ran the algorithm for
|
||||
several numbers $n$ of keys in $S$.
|
||||
|
||||
The values chosen for $n$ were $1, 2, 4, 8, 16, 32, 64, 128, 512$ and $1000$
|
||||
million.
|
||||
%Just the small vector {\it size} must be kept in main memory,
|
||||
%as we saw in Section~\ref{sec:memconstruction}.
|
||||
We limited the main memory to 500 megabytes for the experiments.
|
||||
The size $\mu$ of the a priori reserved internal memory area
|
||||
was set to 250 megabytes, the parameter $b$ was set to $175$ and
|
||||
the building block algorithm parameter $c$ was again set to $1$.
|
||||
In Section~\ref{sec:contr-disk-access} we show how $\mu$
|
||||
affects the runtime of the algorithm. The other two parameters
|
||||
have insignificant influence on the runtime.
|
||||
|
||||
We again use a statistical method for determining a suitable sample size
|
||||
%~\cite[Chapter 13]{j91}
|
||||
to estimate the number of trials to be run for each value of $n$. We found that
|
||||
just one trial for each $n$ would be enough with a confidence level of $95\%$.
|
||||
However, we made 10 trials. This number of trials seems rather small, but, as
|
||||
shown below, the behavior of our algorithm is very stable and its runtime is
|
||||
almost deterministic (i.e., the standard deviation is very small).
|
||||
|
||||
Table~\ref{tab:mediasbrz} presents the runtime average for each $n$,
|
||||
the respective standard deviations, and
|
||||
the respective confidence intervals given by
|
||||
the average time $\pm$ the distance from average time
|
||||
considering a confidence level of $95\%$.
|
||||
Observing the runtime averages, we note that
|
||||
the algorithm runs in expected linear time,
|
||||
as shown in~Section~\ref{sec:linearcomplexity}. Better still,
|
||||
it is only approximately $60\%$ slower than our internal memory based algorithm.
|
||||
To get that value we used the linear regression model obtained for the runtime of
|
||||
the internal memory based algorithm to estimate how much time it would require
|
||||
for constructing a MPHF for a set of 1 billion keys.
|
||||
We got 2.3 hours for the internal memory based algorithm and we measured
|
||||
3.67 hours on average for our algorithm.
|
||||
Increasing the size of the internal memory area
|
||||
from 250 to 600 megabytes (see Section~\ref{sec:contr-disk-access}),
|
||||
we have brought the time to 3.09 hours. In this case, our algorithm is
|
||||
just $34\%$ slower.
|
||||
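These percentages can be checked directly from the numbers reported here
(a back-of-the-envelope calculation). The average of $13229.5$ seconds for
$n = 1000$ million in Table~\ref{tab:mediasbrz} corresponds to
$13229.5/3600 \approx 3.67$ hours, and
\begin{displaymath}
\frac{3.67}{2.3} \approx 1.60, \qquad \frac{3.09}{2.3} \approx 1.34,
\end{displaymath}
that is, slowdowns of about $60\%$ and $34\%$, respectively, relative to the
estimated $2.3$ hours of the internal memory based algorithm.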
|
||||
\enlargethispage{2\baselineskip}
|
||||
\begin{table*}[htb]
|
||||
\vspace{-1mm}
|
||||
\begin{center}
|
||||
{\scriptsize
|
||||
\begin{tabular}{|l|c|c|c|c|c|}
|
||||
\hline
|
||||
$n$ (millions) & 1 & 2 & 4 & 8 & 16 \\
|
||||
\hline % Part. 16 \% 16 \% 16 \% 18 \% 20\%
|
||||
Average time (s) & $6.9 \pm 0.3$ & $13.8 \pm 0.2$ & $31.9 \pm 0.7$ & $69.9 \pm 1.1$ & $140.6 \pm 2.5$ \\
|
||||
SD & $0.4$ & $0.2$ & $0.9$ & $1.5$ & $3.5$ \\
|
||||
\hline
|
||||
\hline
|
||||
$n$ (millions) & 32 & 64 & 128 & 512 & 1000 \\
|
||||
\hline % Part. 20 \% 20\% 20\% 18\% 18\%
|
||||
Average time (s) & $284.3 \pm 1.1$ & $587.9 \pm 3.9$ & $1223.6 \pm 4.9$ & $5966.4 \pm 9.5$ & $13229.5 \pm 12.7$ \\
|
||||
SD & $1.6$ & $5.5$ & $6.8$ & $13.2$ & $18.6$ \\
|
||||
\hline
|
||||
|
||||
\end{tabular}
|
||||
\vspace{-1mm}
|
||||
}
|
||||
\end{center}
|
||||
\caption{Our algorithm: average time in seconds for constructing a MPHF,
|
||||
the standard deviation (SD), and the confidence intervals considering
|
||||
a confidence level of $95\%$.
|
||||
}
|
||||
\label{tab:mediasbrz}
|
||||
\vspace{-5mm}
|
||||
\end{table*}
|
||||
|
||||
Figure~\ref{fig:brz_temporegressao}
|
||||
presents the runtime for each trial. In addition,
|
||||
the solid line corresponds to a linear regression model
|
||||
obtained from the experimental measurements.
|
||||
As expected, the runtime for a given $n$ has almost no
|
||||
variation.
|
||||
|
||||
\begin{figure}[htb]
|
||||
\begin{center}
|
||||
\scalebox{0.4}{\includegraphics{figs/brz_temporegressao.eps}}
|
||||
\caption{Time versus number of keys in $S$ for our algorithm. The solid line corresponds to
|
||||
a linear regression model.}
|
||||
\label{fig:brz_temporegressao}
|
||||
\end{center}
|
||||
\vspace{-9mm}
|
||||
\end{figure}
|
||||
|
||||
An intriguing observation is that the runtime of the algorithm is almost
|
||||
deterministic, in spite of the fact that it uses as a building block an
|
||||
algorithm with a considerable fluctuation in its runtime. A given bucket~$i$,
|
||||
$0 \leq i < \lceil n/b \rceil$, is a small set of keys (at most 256 keys) and,
|
||||
as argued in Section~\ref{sec:intern-memory-algor}, the runtime of the
|
||||
building block algorithm is a random variable~$X_i$ with high fluctuation.
|
||||
However, the runtime~$Y$ of the searching step of our algorithm is given
|
||||
by~$Y=\sum_{0\leq i<\lceil n/b\rceil}X_i$. Under the hypothesis that
|
||||
the~$X_i$ are independent and bounded, the {\it law of large numbers} (see,
|
||||
e.g., \cite{j91}) implies that the random variable $Y/\lceil n/b\rceil$
|
||||
converges to a constant as~$n\to\infty$. This explains why the runtime of our
|
||||
algorithm is almost deterministic.
|
||||
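To make the argument slightly more quantitative, here is a sketch under an
additional simplifying assumption that is not verified in this paper:
the~$X_i$ are identically distributed with mean~$\mathbf{E}[X]$ and finite
standard deviation~$\sigma_X$. By independence,
\begin{displaymath}
\mathbf{E}[Y] = \left\lceil \frac{n}{b} \right\rceil \mathbf{E}[X],
\qquad
\mathrm{SD}(Y) = \sqrt{\left\lceil \frac{n}{b} \right\rceil}\, \sigma_X ,
\end{displaymath}
so the relative fluctuation
$\mathrm{SD}(Y)/\mathbf{E}[Y] = \sigma_X / \bigl(\mathbf{E}[X] \sqrt{\lceil n/b \rceil}\bigr)$
decreases as $1/\sqrt{\lceil n/b \rceil}$. For $n = 10^9$ and $b = 175$ there
are about $5.7 \times 10^6$ buckets, so this factor is approximately $1/2390$.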
|
||||
|
814
vldb07/references.bib
Executable file
@ -0,0 +1,814 @@
|
||||
|
||||
@InProceedings{Brin1998,
|
||||
author = "Sergey Brin and Lawrence Page",
|
||||
title = "The Anatomy of a Large-Scale Hypertextual Web Search Engine",
|
||||
booktitle = "Proceedings of the 7th International {World Wide Web}
|
||||
Conference",
|
||||
pages = "107--117",
|
||||
address = "Brisbane, Australia",
|
||||
month = "April",
|
||||
year = 1998,
|
||||
annote = "Artigo do Google."
|
||||
}
|
||||
|
||||
@inproceedings{p99,
|
||||
author = {R. Pagh},
|
||||
title = {Hash and Displace: Efficient Evaluation of Minimal Perfect Hash Functions},
|
||||
booktitle = {Workshop on Algorithms and Data Structures},
|
||||
pages = {49-54},
|
||||
year = 1999,
|
||||
url = {citeseer.nj.nec.com/pagh99hash.html},
|
||||
key = {author}
|
||||
}
|
||||
|
||||
@article{p00,
|
||||
author = {R. Pagh},
|
||||
title = {Faster deterministic dictionaries},
|
||||
journal = {Symposium on Discrete Algorithms (ACM SODA)},
|
||||
OPTvolume = {43},
|
||||
OPTnumber = {5},
|
||||
pages = {487--493},
|
||||
year = {2000}
|
||||
}
|
||||
@article{g81,
|
||||
author = {G. H. Gonnet},
|
||||
title = {Expected Length of the Longest Probe Sequence in Hash Code Searching},
|
||||
journal = {J. ACM},
|
||||
volume = {28},
|
||||
number = {2},
|
||||
year = {1981},
|
||||
issn = {0004-5411},
|
||||
pages = {289--304},
|
||||
doi = {http://doi.acm.org/10.1145/322248.322254},
|
||||
publisher = {ACM Press},
|
||||
address = {New York, NY, USA},
|
||||
}
|
||||
|
||||
@misc{r04,
|
||||
author = "S. Rao",
|
||||
title = "Combinatorial Algorithms Data Structures",
|
||||
year = 2004,
|
||||
howpublished = {CS 270 Spring},
|
||||
url = "citeseer.ist.psu.edu/700201.html"
|
||||
}
|
||||
@article{ra98,
|
||||
author = {Martin Raab and Angelika Steger},
|
||||
title = {``{B}alls into Bins'' --- {A} Simple and Tight Analysis},
|
||||
journal = {Lecture Notes in Computer Science},
|
||||
volume = 1518,
|
||||
pages = {159--170},
|
||||
year = 1998,
|
||||
url = "citeseer.ist.psu.edu/raab98balls.html"
|
||||
}
|
||||
|
||||
@misc{mrs00,
|
||||
author = "M. Mitzenmacher and A. Richa and R. Sitaraman",
|
||||
title = "The power of two random choices: A survey of the techniques and results",
|
||||
howpublished={In Handbook of Randomized
|
||||
Computing, P. Pardalos, S. Rajasekaran, and J. Rolim, Eds. Kluwer},
|
||||
year = "2000",
|
||||
url = "citeseer.ist.psu.edu/article/mitzenmacher00power.html"
|
||||
}
|
||||
|
||||
@article{dfm02,
|
||||
author = {E. Drinea and A. Frieze and M. Mitzenmacher},
|
||||
title = {Balls and bins models with feedback},
|
||||
journal = {Symposium on Discrete Algorithms (ACM SODA)},
|
||||
pages = {308--315},
|
||||
year = {2002}
|
||||
}
|
||||
@Article{j97,
|
||||
author = {Bob Jenkins},
|
||||
title = {Algorithm Alley: Hash Functions},
|
||||
journal = {Dr. Dobb's Journal of Software Tools},
|
||||
volume = {22},
|
||||
number = {9},
|
||||
month = {september},
|
||||
year = {1997}
|
||||
}
|
||||
|
||||
@article{gss01,
|
||||
author = {N. Galli and B. Seybold and K. Simon},
|
||||
title = {Tetris-Hashing or optimal table compression},
|
||||
journal = {Discrete Applied Mathematics},
|
||||
volume = {110},
|
||||
number = {1},
|
||||
pages = {41--58},
|
||||
month = {june},
|
||||
publisher = {Elsevier Science},
|
||||
year = {2001}
|
||||
}
|
||||
|
||||
@article{s05,
|
||||
author = {M. Seltzer},
|
||||
title = {Beyond Relational Databases},
|
||||
journal = {ACM Queue},
|
||||
volume = {3},
|
||||
number = {3},
|
||||
month = {April},
|
||||
year = {2005}
|
||||
}
|
||||
|
||||
@InProceedings{ss89,
|
||||
author = {P. Schmidt and A. Siegel},
|
||||
title = {On aspects of universality and performance for closed hashing},
|
||||
booktitle = {Proc. 21st Ann. ACM Symp. on Theory of Computing -- STOC'89},
|
||||
month = {May},
|
||||
year = {1989},
|
||||
pages = {355--366}
|
||||
}
|
||||
|
||||
@article{asw00,
|
||||
author = {M. Atici and D. R. Stinson and R. Wei},
|
||||
title = {A new practical algorithm for the construction of a perfect hash function},
|
||||
journal = {Journal Combin. Math. Combin. Comput.},
|
||||
volume = {35},
|
||||
pages = {127--145},
|
||||
year = {2000}
|
||||
}
|
||||
|
||||
@article{swz00,
|
||||
author = {D. R. Stinson and R. Wei and L. Zhu},
|
||||
title = {New constructions for perfect hash families and related structures using combinatorial designs and codes},
|
||||
journal = {Journal Combin. Designs.},
|
||||
volume = {8},
|
||||
pages = {189--200},
|
||||
year = {2000}
|
||||
}
|
||||
|
||||
@inproceedings{ht01,
|
||||
author = {T. Hagerup and T. Tholey},
|
||||
title = {Efficient minimal perfect hashing in nearly minimal space},
|
||||
booktitle = {The 18th Symposium on Theoretical Aspects of Computer Science (STACS), volume 2010 of Lecture Notes in Computer Science},
|
||||
year = 2001,
|
||||
pages = {317--326},
|
||||
key = {author}
|
||||
}
|
||||
|
||||
@inproceedings{dh01,
|
||||
author = {M. Dietzfelbinger and T. Hagerup},
|
||||
title = {Simple minimal perfect hashing in less space},
|
||||
booktitle = {The 9th European Symposium on Algorithms (ESA), volume 2161 of Lecture Notes in Computer Science},
|
||||
year = 2001,
|
||||
pages = {109--120},
|
||||
key = {author}
|
||||
}
|
||||
|
||||
|
||||
@MastersThesis{mar00,
|
||||
author = {M. S. Neubert},
|
||||
title = {Algoritmos Distribu{\'\i}dos para a Constru{\c{c}}{\~a}o de Arquivos invertidos},
|
||||
school = {Departamento de Ci{\^e}ncia da Computa{\c{c}}{\~a}o, Universidade Federal de Minas Gerais},
|
||||
year = 2000,
|
||||
month = {Mar{\c{c}}o},
|
||||
key = {author}
|
||||
}
|
||||
|
||||
|
||||
@Book{clrs01,
|
||||
author = {T. H. Cormen and C. E. Leiserson and R. L. Rivest and C. Stein},
|
||||
title = {Introduction to Algorithms},
|
||||
publisher = {MIT Press},
|
||||
year = {2001},
|
||||
edition = {second},
|
||||
}
|
||||
|
||||
@Book{j91,
|
||||
author = {R. Jain},
|
||||
title = {The art of computer systems performance analysis: techniques for experimental design, measurement, simulation, and modeling. },
|
||||
publisher = {John Wiley},
|
||||
year = {1991},
|
||||
edition = {first}
|
||||
}
|
||||
|
||||
@Book{k73,
|
||||
author = {D. E. Knuth},
|
||||
title = {The Art of Computer Programming: Sorting and Searching},
|
||||
publisher = {Addison-Wesley},
|
||||
volume = {3},
|
||||
year = {1973},
|
||||
edition = {second},
|
||||
}
|
||||
|
||||
@inproceedings{rp99,
|
||||
author = {R. Pagh},
|
||||
title = {Hash and Displace: Efficient Evaluation of Minimal Perfect Hash Functions},
|
||||
booktitle = {Workshop on Algorithms and Data Structures},
|
||||
pages = {49-54},
|
||||
year = 1999,
|
||||
url = {citeseer.nj.nec.com/pagh99hash.html},
|
||||
key = {author}
|
||||
}
|
||||
|
||||
@inproceedings{hmwc93,
|
||||
author = {G. Havas and B.S. Majewski and N.C. Wormald and Z.J. Czech},
|
||||
title = {Graphs, Hypergraphs and Hashing},
|
||||
booktitle = {19th International Workshop on Graph-Theoretic Concepts in Computer Science},
|
||||
publisher = {Springer Lecture Notes in Computer Science vol. 790},
|
||||
pages = {153-165},
|
||||
year = 1993,
|
||||
key = {author}
|
||||
}
|
||||
|
||||
@inproceedings{bkz05,
|
||||
author = {F.C. Botelho and Y. Kohayakawa and N. Ziviani},
|
||||
title = {A Practical Minimal Perfect Hashing Method},
|
||||
booktitle = {4th International Workshop on Efficient and Experimental Algorithms},
|
||||
publisher = {Springer Lecture Notes in Computer Science vol. 3503},
|
||||
pages = {488-500},
|
||||
month = may,
|
||||
year = 2005,
|
||||
key = {author}
|
||||
}
|
||||
|
||||
@Article{chm97,
|
||||
author = {Z.J. Czech and G. Havas and B.S. Majewski},
|
||||
title = {Fundamental Study Perfect Hashing},
|
||||
journal = {Theoretical Computer Science},
|
||||
volume = {182},
|
||||
year = {1997},
|
||||
pages = {1-143},
|
||||
key = {author}
|
||||
}
|
||||
|
||||
@article{chm92,
|
||||
author = {Z.J. Czech and G. Havas and B.S. Majewski},
|
||||
title = {An Optimal Algorithm for Generating Minimal Perfect Hash Functions},
|
||||
journal = {Information Processing Letters},
|
||||
volume = {43},
|
||||
number = {5},
|
||||
pages = {257-264},
|
||||
year = {1992},
|
||||
url = {citeseer.nj.nec.com/czech92optimal.html},
|
||||
key = {author}
|
||||
}
|
||||
|
||||
@Article{mwhc96,
|
||||
author = {B.S. Majewski and N.C. Wormald and G. Havas and Z.J. Czech},
|
||||
title = {A family of perfect hashing methods},
|
||||
journal = {The Computer Journal},
|
||||
year = {1996},
|
||||
volume = {39},
|
||||
number = {6},
|
||||
pages = {547-554},
|
||||
key = {author}
|
||||
}
|
||||
|
||||
@InProceedings{bv04,
|
||||
author = {P. Boldi and S. Vigna},
|
||||
title = {The WebGraph Framework I: Compression Techniques},
|
||||
booktitle = {13th International World Wide Web Conference},
|
||||
pages = {595--602},
|
||||
year = {2004}
|
||||
}
|
||||
|
||||
|
||||
@Book{z04,
|
||||
author = {N. Ziviani},
|
||||
title = {Projeto de Algoritmos com implementa{\c{c}}{\~o}es em Pascal e C},
|
||||
publisher = {Pioneira Thomson},
|
||||
year = 2004,
|
||||
edition = {segunda edi{\c{c}}{\~a}o}
|
||||
}
|
||||
|
||||
|
||||
@Book{p85,
|
||||
author = {E. M. Palmer},
|
||||
title = {Graphical Evolution: An Introduction to the Theory of Random Graphs},
|
||||
publisher = {John Wiley \& Sons},
|
||||
year = {1985},
|
||||
address = {New York}
|
||||
}
|
||||
|
||||
@Book{imb99,
|
||||
author = {I.H. Witten and A. Moffat and T.C. Bell},
|
||||
title = {Managing Gigabytes: Compressing and Indexing Documents and Images},
|
||||
publisher = {Morgan Kaufmann Publishers},
|
||||
year = 1999,
|
||||
edition = {second edition}
|
||||
}
|
||||
@Book{wfe68,
|
||||
author = {W. Feller},
|
||||
title = { An Introduction to Probability Theory and Its Applications},
|
||||
publisher = {Wiley},
|
||||
year = 1968,
|
||||
volume = 1,
|
||||
optedition = {second edition}
|
||||
}
|
||||
|
||||
|
||||
@Article{fhcd92,
|
||||
author = {E.A. Fox and L. S. Heath and Q. Chen and A.M. Daoud},
|
||||
title = {Practical Minimal Perfect Hash Functions For Large Databases},
|
||||
journal = {Communications of the ACM},
|
||||
year = {1992},
|
||||
volume = {35},
|
||||
number = {1},
|
||||
pages = {105--121}
|
||||
}
|
||||
|
||||
|
||||
@inproceedings{fch92,
|
||||
author = {E.A. Fox and Q.F. Chen and L.S. Heath},
|
||||
title = {A Faster Algorithm for Constructing Minimal Perfect Hash Functions},
|
||||
booktitle = {Proceedings of the 15th Annual International ACM SIGIR Conference
|
||||
on Research and Development in Information Retrieval},
|
||||
year = {1992},
|
||||
pages = {266-273},
|
||||
}
|
||||
|
||||
@article{c80,
|
||||
author = {R.J. Cichelli},
|
||||
title = {Minimal perfect hash functions made simple},
|
||||
journal = {Communications of the ACM},
|
||||
volume = {23},
|
||||
number = {1},
|
||||
year = {1980},
|
||||
issn = {0001-0782},
|
||||
pages = {17--19},
|
||||
doi = {http://doi.acm.org/10.1145/358808.358813},
|
||||
publisher = {ACM Press},
|
||||
}
|
||||
|
||||
|
||||
@TechReport{fhc89,
|
||||
author = {E.A. Fox and L.S. Heath and Q.F. Chen},
|
||||
title = {An $O(n\log n)$ algorithm for finding minimal perfect hash functions},
|
||||
institution = {Virginia Polytechnic Institute and State University},
|
||||
year = {1989},
|
||||
OPTkey = {},
|
||||
OPTtype = {},
|
||||
OPTnumber = {},
|
||||
address = {Blacksburg, VA},
|
||||
month = {April},
|
||||
OPTnote = {},
|
||||
OPTannote = {}
|
||||
}
|
||||
|
||||
@TechReport{bkz06t,
|
||||
author = {F.C. Botelho and Y. Kohayakawa and N. Ziviani},
|
||||
title = {An Approach for Minimal Perfect Hash Functions in Very Large Databases},
|
||||
institution = {Department of Computer Science, Federal University of Minas Gerais},
|
||||
note = {Available at http://www.dcc.ufmg.br/\texttt{\~ }nivio/pub/technicalreports.html},
|
||||
year = {2006},
|
||||
OPTkey = {},
|
||||
OPTtype = {},
|
||||
number = {RT.DCC.003},
|
||||
address = {Belo Horizonte, MG, Brazil},
|
||||
month = {April},
|
||||
OPTannote = {}
|
||||
}
|
||||
|
||||
@inproceedings{fcdh90,
|
||||
author = {E.A. Fox and Q.F. Chen and A.M. Daoud and L.S. Heath},
|
||||
title = {Order preserving minimal perfect hash functions and information retrieval},
|
||||
booktitle = {Proceedings of the 13th annual international ACM SIGIR conference on Research and development in information retrieval},
|
||||
year = {1990},
|
||||
isbn = {0-89791-408-2},
|
||||
pages = {279--311},
|
||||
location = {Brussels, Belgium},
|
||||
doi = {http://doi.acm.org/10.1145/96749.98233},
|
||||
publisher = {ACM Press},
|
||||
}
|
||||
|
||||
@Article{fkp89,
|
||||
author = {P. Flajolet and D. E. Knuth and B. Pittel},
|
||||
title = {The first cycles in an evolving graph},
|
||||
journal = {Discrete Mathematics},
|
||||
year = {1989},
|
||||
volume = {75},
|
||||
pages = {167--215},
|
||||
}
|
||||
|
||||
@Article{s77,
|
||||
author = {R. Sprugnoli},
|
||||
title = {Perfect Hashing Functions: A Single Probe Retrieving
|
||||
Method For Static Sets},
|
||||
journal = {Communications of the ACM},
|
||||
year = {1977},
|
||||
volume = {20},
|
||||
number = {11},
|
||||
pages = {841--850},
|
||||
month = {November},
|
||||
}
|
||||
|
||||
@Article{j81,
|
||||
author = {G. Jaeschke},
|
||||
title = {Reciprocal Hashing: A method For Generating Minimal Perfect
|
||||
Hashing Functions},
|
||||
journal = {Communications of the ACM},
|
||||
year = {1981},
|
||||
volume = {24},
|
||||
number = {12},
|
||||
month = {December},
|
||||
pages = {829--833}
|
||||
}
|
||||
|
||||
@Article{c84,
|
||||
author = {C. C. Chang},
|
||||
title = {The Study Of An Ordered Minimal Perfect Hashing Scheme},
|
||||
journal = {Communications of the ACM},
|
||||
year = {1984},
|
||||
volume = {27},
|
||||
number = {4},
|
||||
month = {April},
|
||||
pages = {384--387}
|
||||
}
|
||||
|
||||
@Article{c86,
|
||||
author = {C. C. Chang},
|
||||
title = {Letter-Oriented Reciprocal Hashing Scheme},
|
||||
journal = {Inform. Sci.},
|
||||
year = {1986},
|
||||
volume = {27},
|
||||
pages = {243--255}
|
||||
}
|
||||
|
||||
@Article{cl86,
|
||||
author = {C. C. Chang and R. C. T. Lee},
|
||||
title = {A Letter-Oriented Minimal Perfect Hashing Scheme},
|
||||
journal = {Computer Journal},
|
||||
year = {1986},
|
||||
volume = {29},
|
||||
number = {3},
|
||||
month = {June},
|
||||
pages = {277--281}
|
||||
}
|
||||
|
||||
|
||||
@Article{cc88,
|
||||
author = {C. C. Chang and C. H. Chang},
|
||||
title = {An Ordered Minimal Perfect Hashing Scheme with Single Parameter},
|
||||
journal = {Inform. Process. Lett.},
|
||||
year = {1988},
|
||||
volume = {27},
|
||||
number = {2},
|
||||
month = {February},
|
||||
pages = {79--83}
|
||||
}
|
||||
|
||||
@Article{w90,
|
||||
author = {V. G. Winters},
|
||||
title = {Minimal Perfect Hashing in Polynomial Time},
|
||||
journal = {BIT},
|
||||
year = {1990},
|
||||
volume = {30},
|
||||
number = {2},
|
||||
pages = {235--244}
|
||||
}
|
||||
|
||||
@Article{fcdh91,
|
||||
author = {E. A. Fox and Q. F. Chen and A. M. Daoud and L. S. Heath},
|
||||
title = {Order Preserving Minimal Perfect Hash Functions and Information Retrieval},
|
||||
journal = {ACM Trans. Inform. Systems},
|
||||
year = {1991},
|
||||
volume = {9},
|
||||
number = {3},
|
||||
month = {July},
|
||||
pages = {281--308}
|
||||
}
|
||||
|
||||
@Article{fks84,
|
||||
author = {M. L. Fredman and J. Koml\'os and E. Szemer\'edi},
|
||||
title = {Storing a sparse table with {O(1)} worst case access time},
|
||||
journal = {J. ACM},
|
||||
year = {1984},
|
||||
volume = {31},
|
||||
number = {3},
|
||||
month = {July},
|
||||
pages = {538--544}
|
||||
}
|
||||
|
||||
@Article{dhjs83,
|
||||
author = {M. W. Du and T. M. Hsieh and K. F. Jea and D. W. Shieh},
|
||||
title = {The study of a new perfect hash scheme},
|
||||
journal = {IEEE Trans. Software Eng.},
|
||||
year = {1983},
|
||||
volume = {9},
|
||||
number = {3},
|
||||
month = {May},
|
||||
pages = {305--313}
|
||||
}
|
||||
|
||||
@Article{bt94,
|
||||
author = {M. D. Brain and A. L. Tharp},
|
||||
title = {Using Tries to Eliminate Pattern Collisions in Perfect Hashing},
|
||||
journal = {IEEE Trans. on Knowledge and Data Eng.},
|
||||
year = {1994},
|
||||
volume = {6},
|
||||
number = {2},
|
||||
month = {April},
|
||||
pages = {239--247}
|
||||
}
|
||||
|
||||
@Article{bt90,
|
||||
author = {M. D. Brain and A. L. Tharp},
|
||||
title = {Perfect hashing using sparse matrix packing},
|
||||
journal = {Inform. Systems},
|
||||
year = {1990},
|
||||
volume = {15},
|
||||
number = {3},
|
||||
OPTmonth = {April},
|
||||
pages = {281--290}
|
||||
}
|
||||
|
||||
@Article{ckw93,
|
||||
author = {C. C. Chang and H. C. Kowng and T. C. Wu},
|
||||
title = {A refinement of a compression-oriented addressing scheme},
|
||||
journal = {BIT},
|
||||
year = {1993},
|
||||
volume = {33},
|
||||
number = {4},
|
||||
OPTmonth = {April},
|
||||
pages = {530--535}
|
||||
}
|
||||
|
||||
@Article{cw91,
|
||||
author = {C. C. Chang and T. C. Wu},
|
||||
title = {A letter-oriented perfect hashing scheme based upon sparse table compression},
|
||||
journal = {Software -- Practice Experience},
|
||||
year = {1991},
|
||||
volume = {21},
|
||||
number = {1},
|
||||
month = {January},
|
||||
pages = {35--49}
|
||||
}
|
||||
|
||||
@Article{ty79,
|
||||
author = {R. E. Tarjan and A. C. C. Yao},
|
||||
title = {Storing a sparse table},
|
||||
journal = {Comm. ACM},
|
||||
year = {1979},
|
||||
volume = {22},
|
||||
number = {11},
|
||||
month = {November},
|
||||
pages = {606--611}
|
||||
}
|
||||
|
||||
@Article{yd85,
|
||||
author = {W. P. Yang and M. W. Du},
|
||||
title = {A backtracking method for constructing perfect hash functions from a set of mapping functions},
|
||||
journal = {BIT},
|
||||
year = {1985},
|
||||
volume = {25},
|
||||
number = {1},
|
||||
pages = {148--164}
|
||||
}
|
||||
|
||||
@Article{s85,
|
||||
author = {T. J. Sager},
|
||||
title = {A polynomial time generator for minimal perfect hash functions},
|
||||
journal = {Commun. ACM},
|
||||
year = {1985},
|
||||
volume = {28},
|
||||
number = {5},
|
||||
month = {May},
|
||||
pages = {523--532}
|
||||
}
|
||||
|
||||
@Article{cm93,
|
||||
author = {Z. J. Czech and B. S. Majewski},
|
||||
title = {A linear time algorithm for finding minimal perfect hash functions},
|
||||
journal = {The Computer Journal},
|
||||
year = {1993},
|
||||
volume = {36},
|
||||
number = {6},
|
||||
pages = {579--587}
|
||||
}
|
||||
|
||||
@Article{gbs94,
|
||||
author = {R. Gupta and S. Bhaskar and S. Smolka},
|
||||
title = {On randomization in sequential and distributed algorithms},
|
||||
journal = {ACM Comput. Surveys},
|
||||
year = {1994},
|
||||
volume = {26},
|
||||
number = {1},
|
||||
month = {March},
|
||||
pages = {7--86}
|
||||
}
|
||||
|
||||
@InProceedings{sb84,
|
||||
author = {C. Slot and P. V. E. Boas},
|
||||
title = {On tape versus core; an application of space efficient perfect hash functions to the
|
||||
invariance of space},
|
||||
booktitle = {Proc. 16th Ann. ACM Symp. on Theory of Computing -- STOC'84},
|
||||
address = {Washington},
|
||||
month = {May},
|
||||
year = {1984},
|
||||
pages = {391--400},
|
||||
}
|
||||
|
||||
@InProceedings{wi90,
|
||||
author = {V. G. Winters},
|
||||
title = {Minimal perfect hashing for large sets of data},
|
||||
booktitle = {Internat. Conf. on Computing and Information -- ICCI'90},
|
||||
address = {Canada},
|
||||
month = {May},
|
||||
year = {1990},
|
||||
pages = {275--284},
|
||||
}
|
||||
|
||||
@InProceedings{lr85,
|
||||
author = {P. Larson and M. V. Ramakrishna},
|
||||
title = {External perfect hashing},
|
||||
booktitle = {Proc. ACM SIGMOD Conf.},
|
||||
address = {Austin TX},
|
||||
month = {June},
|
||||
year = {1985},
|
||||
pages = {190--199},
|
||||
}
|
||||
|
||||
@Book{m84,
|
||||
author = {K. Mehlhorn},
|
||||
editor = {W. Brauer and G. Rozenberg and A. Salomaa},
|
||||
title = {Data Structures and Algorithms 1: Sorting and Searching},
|
||||
publisher = {Springer-Verlag},
|
||||
year = {1984},
|
||||
}
|
||||
|
||||
@PhdThesis{c92,
|
||||
author = {Q. F. Chen},
|
||||
title = {An Object-Oriented Database System for Efficient Information Retrieval Applications},
|
||||
school = {Virginia Tech Dept. of Computer Science},
|
||||
year = {1992},
|
||||
month = {March}
|
||||
}
|
||||
|
||||
@article {er59,
|
||||
AUTHOR = {Erd{\H{o}}s, P. and R{\'e}nyi, A.},
|
||||
TITLE = {On random graphs {I}},
|
||||
JOURNAL = {Pub. Math. Debrecen},
|
||||
VOLUME = {6},
|
||||
YEAR = {1959},
|
||||
PAGES = {290--297},
|
||||
MRCLASS = {05.00},
|
||||
MRNUMBER = {MR0120167 (22 \#10924)},
|
||||
MRREVIEWER = {A. Dvoretzky},
|
||||
}
|
||||
|
||||
|
||||
@article {erdos61,
|
||||
AUTHOR = {Erd{\H{o}}s, P. and R{\'e}nyi, A.},
|
||||
TITLE = {On the evolution of random graphs},
|
||||
JOURNAL = {Bull. Inst. Internat. Statist.},
|
||||
VOLUME = 38,
|
||||
YEAR = 1961,
|
||||
PAGES = {343--347},
|
||||
MRCLASS = {05.40 (55.10)},
|
||||
MRNUMBER = {MR0148055 (26 \#5564)},
|
||||
}
|
||||
|
||||
@article {er60,
|
||||
AUTHOR = {Erd{\H{o}}s, P. and R{\'e}nyi, A.},
|
||||
TITLE = {On the evolution of random graphs},
|
||||
JOURNAL = {Magyar Tud. Akad. Mat. Kutat\'o Int. K\"ozl.},
|
||||
VOLUME = {5},
|
||||
YEAR = {1960},
|
||||
PAGES = {17--61},
|
||||
MRCLASS = {05.40},
|
||||
MRNUMBER = {MR0125031 (23 \#A2338)},
|
||||
MRREVIEWER = {J. Riordan},
|
||||
}
|
||||
|
||||
@Article{er60:_Old,
|
||||
author = {P. Erd{\H{o}}s and A. R\'enyi},
|
||||
title = {On the evolution of random graphs},
|
||||
journal = {Publications of the Mathematical Institute of the Hungarian
|
||||
Academy of Sciences},
|
||||
year = {1960},
|
||||
volume = {56},
|
||||
pages = {17--61}
|
||||
}
|
||||
|
||||
@Article{er61,
|
||||
author = {P. Erd{\H{o}}s and A. R\'enyi},
|
||||
title = {On the strength of connectedness of a random graph},
|
||||
journal = {Acta Mathematica Scientia Hungary},
|
||||
year = {1961},
|
||||
volume = {12},
|
||||
pages = {261--267}
|
||||
}
|
||||
|
||||
|
||||
@Article{bp04,
|
||||
author = {B. Bollob\'as and O. Pikhurko},
|
||||
title = {Integer Sets with Prescribed Pairwise Differences Being Distinct},
|
||||
journal = {European Journal of Combinatorics},
|
||||
OPTkey = {},
|
||||
OPTvolume = {},
|
||||
OPTnumber = {},
|
||||
OPTpages = {},
|
||||
OPTmonth = {},
|
||||
note = {To Appear},
|
||||
OPTannote = {}
|
||||
}
|
||||
|
||||
@Article{pw04:_OLD,
|
||||
author = {B. Pittel and N. C. Wormald},
|
||||
title = {Counting connected graphs inside-out},
|
||||
journal = {Journal of Combinatorial Theory},
|
||||
OPTkey = {},
|
||||
OPTvolume = {},
|
||||
OPTnumber = {},
|
||||
OPTpages = {},
|
||||
OPTmonth = {},
|
||||
note = {To Appear},
|
||||
OPTannote = {}
|
||||
}
|
||||
|
||||
|
||||
@Article{mr95,
|
||||
author = {M. Molloy and B. Reed},
|
||||
title = {A critical point for random graphs with a given degree sequence},
|
||||
journal = {Random Structures and Algorithms},
|
||||
year = {1995},
|
||||
volume = {6},
|
||||
pages = {161--179}
|
||||
}
|
||||
|
||||
@TechReport{bmz04,
|
||||
author = {F. C. Botelho and D. Menoti and N. Ziviani},
|
||||
title = {A New algorithm for constructing minimal perfect hash functions},
|
||||
institution = {Federal Univ. of Minas Gerais},
|
||||
year = {2004},
|
||||
OPTkey = {},
|
||||
OPTtype = {},
|
||||
number = {TR004},
|
||||
OPTaddress = {},
|
||||
OPTmonth = {},
|
||||
note = {(http://www.dcc.ufmg.br/\texttt{\~ }nivio/pub/technicalreports.html)},
|
||||
OPTannote = {}
|
||||
}
|
||||
|
||||
@Article{mr98,
|
||||
author = {M. Molloy and B. Reed},
|
||||
title = {The size of the giant component of a random graph with a given degree sequence},
|
||||
journal = {Combinatorics, Probability and Computing},
|
||||
year = {1998},
|
||||
volume = {7},
|
||||
pages = {295--305}
|
||||
}
|
||||
|
||||
@misc{h98,
|
||||
author = {D. Hawking},
|
||||
title = {Overview of TREC-7 Very Large Collection Track (Draft for Notebook)},
|
||||
url = {citeseer.ist.psu.edu/4991.html},
|
||||
year = {1998}}
|
||||
|
||||
@book {jlr00,
|
||||
AUTHOR = {Janson, S. and {\L}uczak, T. and Ruci{\'n}ski, A.},
|
||||
TITLE = {Random graphs},
|
||||
PUBLISHER = {Wiley-Inter.},
|
||||
YEAR = 2000,
|
||||
PAGES = {xii+333},
|
||||
ISBN = {0-471-17541-2},
|
||||
MRCLASS = {05C80 (60C05 82B41)},
|
||||
MRNUMBER = {2001k:05180},
|
||||
MRREVIEWER = {Mark R. Jerrum},
|
||||
}
|
||||
|
||||
@incollection {jlr90,
|
||||
AUTHOR = {Janson, Svante and {\L}uczak, Tomasz and Ruci{\'n}ski,
|
||||
Andrzej},
|
||||
TITLE = {An exponential bound for the probability of nonexistence of a
|
||||
specified subgraph in a random graph},
|
||||
BOOKTITLE = {Random graphs '87 (Pozna\'n, 1987)},
|
||||
PAGES = {73--87},
|
||||
PUBLISHER = {Wiley},
|
||||
ADDRESS = {Chichester},
|
||||
YEAR = 1990,
|
||||
MRCLASS = {05C80 (60C05)},
|
||||
MRNUMBER = {91m:05168},
|
||||
MRREVIEWER = {J. Spencer},
|
||||
}
|
||||
|
||||
@book {b01,
|
||||
AUTHOR = {Bollob{\'a}s, B.},
|
||||
TITLE = {Random graphs},
|
||||
SERIES = {Cambridge Studies in Advanced Mathematics},
|
||||
VOLUME = 73,
|
||||
EDITION = {Second},
|
||||
PUBLISHER = {Cambridge University Press},
|
||||
ADDRESS = {Cambridge},
|
||||
YEAR = 2001,
|
||||
PAGES = {xviii+498},
|
||||
ISBN = {0-521-80920-7; 0-521-79722-5},
|
||||
MRCLASS = {05C80 (60C05)},
|
||||
MRNUMBER = {MR1864966 (2002j:05132)},
|
||||
}
|
||||
|
||||
@article {pw04,
|
||||
AUTHOR = {Pittel, Boris and Wormald, Nicholas C.},
|
||||
TITLE = {Counting connected graphs inside-out},
|
||||
JOURNAL = {J. Combin. Theory Ser. B},
|
||||
FJOURNAL = {Journal of Combinatorial Theory. Series B},
|
||||
VOLUME = 93,
|
||||
YEAR = 2005,
|
||||
NUMBER = 2,
|
||||
PAGES = {127--172},
|
||||
ISSN = {0095-8956},
|
||||
CODEN = {JCBTB8},
|
||||
MRCLASS = {05C30 (05A16 05C40 05C80)},
|
||||
MRNUMBER = {MR2117934 (2005m:05117)},
|
||||
MRREVIEWER = {Edward A. Bender},
|
||||
}
|
112
vldb07/relatedwork.tex
Executable file
@ -0,0 +1,112 @@
|
||||
% Time-stamp: <Monday 30 Jan 2006 03:06:57am EDT yoshi@ime.usp.br>
|
||||
\vspace{-3mm}
|
||||
\section{Related work}
|
||||
\label{sec:relatedprevious-work}
|
||||
\vspace{-2mm}
|
||||
|
||||
% Optimal speed for hashing means that each key from the key set $S$
|
||||
% will map to an unique location in the hash table, avoiding time wasted
|
||||
% in resolving collisions. That is achieved with a MPHF and
|
||||
% because of that many algorithms for constructing static
|
||||
% and dynamic MPHFs, when static or dynamic sets are involved,
|
||||
% were developed. Our focus has been on static MPHFs, since
|
||||
% in many applications the key sets change slowly, if at all~\cite{s05}.
|
||||
|
||||
\enlargethispage{2\baselineskip}
|
||||
Czech, Havas and Majewski~\cite{chm97} provide a
|
||||
comprehensive survey of the most important theoretical and practical results
|
||||
on perfect hashing.
|
||||
In this section we review some of the most important results.
|
||||
%We also present more recent algorithms that share some features with
|
||||
%the one presented hereinafter.
|
||||
|
||||
Fredman, Koml\'os and Szemer\'edi~\cite{fks84} showed that it is possible to
|
||||
construct space efficient perfect hash functions that can be evaluated in
|
||||
constant time with table sizes that are linear in the number of keys:
|
||||
$m=O(n)$. In their model of computation, an element of the universe~$U$ fits
|
||||
into one machine word, and arithmetic operations and memory accesses have unit
|
||||
cost. Randomized algorithms in the FKS model can construct a perfect hash
|
||||
function in expected time~$O(n)$:
|
||||
this is the case of our algorithm and the works in~\cite{chm92,p99}.
|
||||
|
||||
Mehlhorn~\cite{m84} showed
|
||||
that at least $\Omega((1/\ln 2)n + \ln\ln u)$ bits are
|
||||
required to represent a MPHF (i.e., at least 1.4427 bits per
|
||||
key must be stored).
|
||||
To the best of our knowledge our algorithm
|
||||
is the first one capable of generating MPHFs for sets on the order
|
||||
of billions of keys, and the generated functions
|
||||
require less than 9 bits per key to be stored.
|
||||
This is an increase of one order of magnitude in the size of the largest
|
||||
key set for which a MPHF was obtained in the literature~\cite{bkz05}.
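As a rough illustration of these figures (the value $n = 10^9$ is our own choice, picked to match the order of magnitude mentioned above, and the $\ln\ln u$ term is ignored), note that $1/\ln 2 = \log_2 e \approx 1.4427$, so that
\begin{eqnarray*}
(1/\ln 2)\,n &\approx& 1.44 \times 10^{9} \mbox{ bits} \;\approx\; 180 \mbox{ megabytes}, \\
9n &=& 9 \times 10^{9} \mbox{ bits} \;=\; 1.125 \mbox{ gigabytes}.
\end{eqnarray*}
Under these assumptions, a function stored in 9 bits per key is within a factor of about $9/1.4427 \approx 6.2$ of the lower bound.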
|
||||
%which is close to the lower bound presented in~\cite{m84}.
|
||||
|
||||
Some work on minimal perfect hashing has been done under the assumption that
|
||||
the algorithm can pick and store truly random functions~\cite{bkz05,chm92,p99}.
|
||||
Since the space requirements for truly random functions make them unsuitable for
|
||||
implementation, one has to settle for pseudo-random functions in practice.
|
||||
Empirical studies show that limited randomness properties are often as good as
|
||||
total randomness.
|
||||
We could verify that phenomenon in our experiments by using the universal hash
|
||||
function proposed by Jenkins~\cite{j97}, which is
|
||||
time efficient at retrieval time and requires just an integer to be used as a
|
||||
random seed (the function is completely determined by the seed).
|
||||
% Os trabalhos~\cite{asw00,swz00} apresentam algoritmos para construir
|
||||
% FHPs e FHPMs deterministicamente.
|
||||
% As fun\c{c}\~oes geradas necessitam de $O(n \log(n) + \log(\log(u)))$ bits para serem descritas.
|
||||
% A complexidade de caso m\'edio dos algoritmos para gerar as fun\c{c}\~oes \'e
|
||||
% $O(n\log(n) \log( \log (u)))$ e a de pior caso \'e $O(n^3\log(n) \log(\log(u)))$.
|
||||
% A complexidade de avalia\c{c}\~ao das fun\c{c}\~oes \'e $O(\log(n) + \log(\log(u)))$.
|
||||
% Assim, os algoritmos n\~ao geram fun\c{c}\~oes que podem ser avaliadas com complexidade
|
||||
% de tempo $O(1)$, est\~ao distantes a um fator de $\log n$ da complexidade \'otima para descrever
|
||||
% FHPs e FHPMs (Mehlhorn mostra em~\cite{m84}
|
||||
% que para armazenar uma FHP s\~ao necess\'arios no m\'{\i}nimo
|
||||
% $\Omega(n^2/(2\ln 2) m + \log\log u)$ bits), e n\~ao geram as
|
||||
% fun\c{c}\~oes com complexidade linear.
|
||||
% Al\'em disso, o universo $U$ das chaves \'e restrito a n\'umeros inteiros, o que pode
|
||||
% limitar a utiliza\c{c}\~ao na pr\'atica.
|
||||
|
||||
Pagh~\cite{p99} proposed a family of randomized algorithms for
|
||||
constructing MPHFs
|
||||
where the form of the resulting function is $h(x) = (f(x) + d[g(x)]) \bmod n$,
|
||||
where $f$ and $g$ are universal hash functions and $d$ is a set of
|
||||
displacement values to resolve collisions that are caused by the function $f$.
|
||||
Pagh identified a set of conditions concerning $f$ and $g$ and showed
|
||||
that if these conditions are satisfied, then a minimal perfect hash
|
||||
function can be computed in expected time $O(n)$ and stored in
|
||||
$(2+\epsilon)n\log_2n$ bits.
|
||||
|
||||
Dietzfelbinger and Hagerup~\cite{dh01} improved~\cite{p99},
|
||||
reducing from $(2+\epsilon)n\log_2n$ to $(1+\epsilon)n\log_2n$ the number of bits
|
||||
required to store the function, but in their approach~$f$ and~$g$ must
|
||||
be chosen from a class
|
||||
of hash functions that meet additional requirements.
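To put these bounds in perspective, consider a hypothetical set of $n = 10^9$ keys (our choice, for illustration only) and ignore~$\epsilon$. Since $\log_2 10^9 = 9\log_2 10 \approx 29.9$, we get
\begin{eqnarray*}
2\log_2 n &\approx& 59.8 \mbox{ bits per key for~\cite{p99}}, \\
\log_2 n &\approx& 29.9 \mbox{ bits per key for~\cite{dh01}},
\end{eqnarray*}
and both figures keep growing with~$n$, whereas a function stored in $O(n)$ bits uses a constant number of bits per key.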
|
||||
%Differently from the works in~\cite{dh01, p99}, our algorithm generates a MPHF
|
||||
%$h$ in expected linear time and $h$ can be stored in $O(n)$ bits (9 bits per key).
|
||||
|
||||
% Galli, Seybold e Simon~\cite{gss01} propuseram um algoritmo r\^andomico
|
||||
% que gera FHPMs da mesma forma das geradas pelos algoritmos de Pagh~\cite{p99}
|
||||
% e, Dietzfelbinger e Hagerup~\cite{dh01}. No entanto, eles definiram a forma das
|
||||
% fun\c{c}\~oes $f(k) = h_c(k) \bmod n$ e $g(k) = \lfloor h_c(k)/n \rfloor$ para obter em tempo esperado $O(n)$ uma fun\c{c}\~ao que pode ser descrita em $O(n\log n)$ bits, onde
|
||||
% $h_c(k) = (ck \bmod p) \bmod n^2$, $1 \leq c \leq p-1$ e $p$ um primo maior do que $u$.
|
||||
%Our algorithm is the first one capable of generating MPHFs for sets in the order of
|
||||
%billion of keys. It happens because we do not need to keep into main memory
|
||||
%at generation time complex data structures as a graph, lists and so on. We just need to maintain
|
||||
%a small vector that occupies around 8MB for a set of 1 billion keys.
|
||||
|
||||
Fox et al.~\cite{fch92,fhcd92} studied MPHFs
|
||||
%that also share features with the ones generated by our algorithm.
|
||||
that bring the storage requirements down to between 2 and 4 bits per key, below what our algorithm achieves.
|
||||
However, it is shown in~\cite[Section 6.7]{chm97} that their algorithms have exponential
|
||||
running times and do not scale to sets larger than 11 million keys in our
|
||||
implementation of the algorithm.
|
||||
|
||||
Our previous work~\cite{bkz05} improves the one by Czech, Havas and Majewski~\cite{chm92}.
|
||||
We obtained more compact functions in less time. Although
|
||||
the algorithm in~\cite{bkz05} is the fastest algorithm
|
||||
we know of, the resulting functions are stored in $O(n\log n)$ bits and
|
||||
one needs to keep in main memory at generation time a random graph of $n$ edges
|
||||
and $cn$ vertices,
|
||||
where $c\in[0.93,1.15]$. Using the well-known divide-and-conquer approach,
|
||||
we use that algorithm as a building block for the new one, where the
|
||||
resulting functions are stored in $O(n)$ bits.
|
155
vldb07/searching.tex
Executable file
@ -0,0 +1,155 @@
|
||||
%% Nivio: 22/jan/06
|
||||
% Time-stamp: <Monday 30 Jan 2006 03:57:35am EDT yoshi@ime.usp.br>
|
||||
\vspace{-7mm}
|
||||
\subsection{Searching step}
|
||||
\label{sec:searching}
|
||||
|
||||
\enlargethispage{2\baselineskip}
|
||||
The searching step is responsible for generating a MPHF for each
|
||||
bucket.
|
||||
Figure~\ref{fig:searchingstep} presents the searching step algorithm.
|
||||
\vspace{-2mm}
|
||||
\begin{figure}[h]
|
||||
%\centering
|
||||
\hrule
|
||||
\hrule
|
||||
\vspace{2mm}
|
||||
\begin{tabbing}
|
||||
aa\=type booleanx \== (false, true); \kill
|
||||
\> $\blacktriangleright$ Let $H$ be a minimum heap of size $N$, where the \\
|
||||
\> ~~ order relation in $H$ is given by Eq.~(\ref{eq:bucketindex}), that is, the\\
|
||||
\> ~~ remove operation removes the item with smallest $i$\\
|
||||
\> $1.$ {\bf for} $j = 1$ {\bf to} $N$ {\bf do} \{ Heap construction \}\\
|
||||
\> ~~ $1.1$ Read key $k$ from File $j$ on disk\\
|
||||
\> ~~ $1.2$ Insert $(i, j, k)$ in $H$ \\
|
||||
\> $2.$ {\bf for} $i = 0$ {\bf to} $\lceil n/b \rceil - 1$ {\bf do} \\
|
||||
\> ~~ $2.1$ Read bucket $i$ from disk driven by heap $H$ \\
|
||||
\> ~~ $2.2$ Generate a MPHF for bucket $i$ \\
|
||||
\> ~~ $2.3$ Write the description of MPHF$_i$ to the disk
|
||||
\end{tabbing}
|
||||
\vspace{-1mm}
|
||||
\hrule
|
||||
\hrule
|
||||
\caption{Searching step}
|
||||
\label{fig:searchingstep}
|
||||
\vspace{-4mm}
|
||||
\end{figure}
|
||||
|
||||
Statement 1 of Figure~\ref{fig:searchingstep} inserts one key from each file
|
||||
in a minimum heap $H$ of size $N$.
|
||||
The order relation in $H$ is determined by the bucket address $i$ computed by
|
||||
Eq.~(\ref{eq:bucketindex}).
|
||||
|
||||
%\enlargethispage{-\baselineskip}
|
||||
Statement 2 has two important steps.
|
||||
In statement 2.1, a bucket is read from disk,
|
||||
as described below.
|
||||
%in Section~\ref{sec:readingbucket}.
|
||||
In statement 2.2, a MPHF is generated for each bucket $i$, as described
|
||||
in the following.
|
||||
%in Section~\ref{sec:mphfbucket}.
|
||||
The description of MPHF$_i$ is a vector $g_i$ of 8-bit integers.
|
||||
Finally, statement 2.3 writes the description $g_i$ of MPHF$_i$ to disk.
|
||||
|
||||
\vspace{-3mm}
|
||||
\subsubsection{Reading a bucket from disk.}
|
||||
\label{sec:readingbucket}
|
||||
|
||||
In this section we present the refinement of statement 2.1 of
|
||||
Figure~\ref{fig:searchingstep}.
|
||||
The algorithm to read bucket $i$ from disk is presented
|
||||
in Figure~\ref{fig:readingbucket}.
|
||||
|
||||
\begin{figure}[h]
|
||||
\hrule
|
||||
\hrule
|
||||
\vspace{2mm}
|
||||
\begin{tabbing}
|
||||
aa\=type booleanx \== (false, true); \kill
|
||||
\> $1.$ {\bf while} bucket $i$ is not full {\bf do} \\
|
||||
\> ~~ $1.1$ Remove $(i, j, k)$ from $H$\\
|
||||
\> ~~ $1.2$ Insert $k$ into bucket $i$ \\
|
||||
\> ~~ $1.3$ Read sequentially all keys $k$ from File $j$ that have \\
|
||||
\> ~~~~~~~ the same $i$ and insert them into bucket $i$ \\
|
||||
\> ~~ $1.4$ Insert the triple $(i, j, x)$ in $H$, where $x$ is the first \\
|
||||
\> ~~~~~~~ key read from File $j$ that does not have the \\
|
||||
\> ~~~~~~~ same bucket index $i$
|
||||
\end{tabbing}
|
||||
\hrule
|
||||
\hrule
|
||||
\vspace{-1.0mm}
|
||||
\caption{Reading a bucket}
|
||||
\vspace{-4.0mm}
|
||||
\label{fig:readingbucket}
|
||||
\end{figure}
|
||||
|
||||
Bucket $i$ is distributed among many files and the heap $H$ is used to drive a
|
||||
multiway merge operation.
|
||||
In Figure~\ref{fig:readingbucket}, statement 1.1 extracts and removes triple
|
||||
$(i, j, k)$ from $H$, where $i$ is the minimum value in $H$.
|
||||
Statement 1.2 inserts key $k$ in bucket $i$.
|
||||
Notice that the $k$ in the triple $(i, j, k)$ is in fact a pointer to
|
||||
the first byte of the key that is kept in contiguous positions of an array of characters
|
||||
(this array containing the keys is initialized during the heap construction
|
||||
in statement 1 of Figure~\ref{fig:searchingstep}).
|
||||
Statement 1.3 performs a seek operation in File $j$ on disk for the first
|
||||
read operation and reads sequentially all keys $k$ that have the same $i$
|
||||
%(obtained from Eq.~(\ref{eq:bucketindex}))
|
||||
and inserts them all in bucket $i$.
|
||||
Finally, statement 1.4 inserts in $H$ the triple $(i, j, x)$,
|
||||
where $x$ is the first key read from File $j$ (in statement 1.3)
|
||||
that does not have the same bucket address as the previous keys.
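
As a small worked example of this merge (the files, keys and bucket sizes below are hypothetical, and we assume that the keys in each file are already ordered by bucket address, as the merge requires), let $N = 2$, let File~1 hold keys $k_1, k_2, k_3$ with bucket addresses $0, 0, 2$, and let File~2 hold keys $k_4, k_5, k_6$ with bucket addresses $0, 1, 2$. After the heap construction, $H = \{(0,1,k_1), (0,2,k_4)\}$, and the buckets are read as follows.
\begin{itemize}
\item Bucket $0$ (three keys): $(0,1,k_1)$ is removed, $k_1$ and then $k_2$ are inserted into the bucket, and $(2,1,k_3)$ is inserted in $H$. The bucket is not full yet, so $(0,2,k_4)$ is removed, $k_4$ is inserted into the bucket, and $(1,2,k_5)$ is inserted in $H$.
\item Bucket $1$ (one key): $(1,2,k_5)$ is removed, $k_5$ is inserted into the bucket, and $(2,2,k_6)$ is inserted in $H$.
\item Bucket $2$ (two keys): $(2,1,k_3)$ and $(2,2,k_6)$ are removed one after the other and $k_3$ and $k_6$ are inserted into the bucket; both files are exhausted, so nothing is inserted in $H$.
\end{itemize}
Each removal from $H$ corresponds to one visit to a file, and hence to at most one seek in statement 1.3.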
|
||||
|
||||
The number of seek operations on disk performed in statement 1.3 is discussed
|
||||
in Section~\ref{sec:linearcomplexity},
|
||||
where we present a buffering technique that brings down
|
||||
the time spent on seeks.
|
||||
|
||||
\vspace{-2mm}
|
||||
\enlargethispage{2\baselineskip}
|
||||
\subsubsection{Generating a MPHF for each bucket.} \label{sec:mphfbucket}
|
||||
|
||||
To the best of our knowledge the algorithm we have designed in
|
||||
our previous work~\cite{bkz05} is the fastest published algorithm for
|
||||
constructing MPHFs.
|
||||
That is why we are using that algorithm as a building block for the
|
||||
algorithm presented here.
|
||||
|
||||
%\enlargethispage{-\baselineskip}
|
||||
Our previous algorithm is a three-step internal memory based algorithm
|
||||
that produces a MPHF based on random graphs.
|
||||
For a set of $n$ keys, the algorithm outputs the resulting MPHF in expected time $O(n)$.
|
||||
For a given bucket $i$, $0 \leq i < \lceil n/b \rceil$, the corresponding MPHF$_i$
|
||||
has the following form:
|
||||
\begin{eqnarray}
|
||||
\mathrm{MPHF}_i(k) &=& g_i[a] + g_i[b] \label{eq:mphfi}
|
||||
\end{eqnarray}
|
||||
where $a = h_{i1}(k) \bmod t$, $b = h_{i2}(k) \bmod t$ and
|
||||
$t = c\times \mathit{size}[i]$. The functions
|
||||
$h_{i1}(k)$ and $h_{i2}(k)$ are the same universal function proposed by Jenkins~\cite{j97}
|
||||
that was used in the partitioning step described in Section~\ref{sec:partitioning-keys}.
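
For instance, take a bucket with $\mathit{size}[i] = 4$ and $c = 1$, so that $t = 4$, and suppose that, for some key $k$, the hash functions return $h_{i1}(k) = 14$ and $h_{i2}(k) = 7$ (hypothetical values chosen only for illustration). Then
\begin{eqnarray*}
a &=& 14 \bmod 4 \;=\; 2, \\
b &=& 7 \bmod 4 \;=\; 3, \\
\mathrm{MPHF}_i(k) &=& g_i[2] + g_i[3],
\end{eqnarray*}
so an evaluation costs two hash computations and two accesses to the vector $g_i$.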
|
||||
|
||||
In order to generate the function above the algorithm involves the generation of simple random graphs
|
||||
$G_i = (V_i, E_i)$ with~$|V_i|=t=c\times\mathit{size}[i]$ and $|E_i|=\mathit{size}[i]$, with $c \in [0.93, 1.15]$.
|
||||
To generate a simple random graph with high
|
||||
probability\footnote{We use the terms `with high probability'
|
||||
to mean `with probability tending to~$1$ as~$n\to\infty$'.}, two vertices $a$ and $b$ are
|
||||
computed for each key $k$ in bucket $i$.
|
||||
Thus, each bucket $i$ has a corresponding graph~$G_i=(V_i,E_i)$, where $V_i=\{0,1,
|
||||
\ldots,t-1\}$ and $E_i=\big\{\{a,b\}:k \in \mathrm{bucket}\: i\big\}$.
|
||||
In order to get a simple graph,
|
||||
the algorithm repeatedly selects $h_{i1}$ and $h_{i2}$ from a family of universal hash functions
|
||||
until the corresponding graph is simple.
|
||||
The probability of getting a simple graph is $p=e^{-1/c^2}$.
|
||||
For $c=1$, this probability is $p \simeq 0.368$, and the expected number of
|
||||
iterations to obtain a simple graph is~$1/p \simeq 2.72$.
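The expected value follows from the geometric distribution, under the assumption that successive choices of $h_{i1}$ and $h_{i2}$ are independent trials, each succeeding with probability~$p$:
\begin{eqnarray*}
\sum_{r \geq 1} r\,p\,(1-p)^{r-1} &=& \frac{1}{p}.
\end{eqnarray*}
Evaluating $p = e^{-1/c^2}$ at the two ends of the range of $c$ used here gives (values rounded by us)
\begin{eqnarray*}
c = 0.93: \quad p &\simeq& 0.315, \quad 1/p \;\simeq\; 3.18, \\
c = 1.15: \quad p &\simeq& 0.469, \quad 1/p \;\simeq\; 2.13.
\end{eqnarray*}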
|
||||
|
||||
The construction of MPHF$_i$ ends with a computation of a suitable labelling of the vertices
|
||||
of~$G_i$. The labelling is stored into vector $g_i$.
|
||||
We choose~$g_i[v]$ for each~$v\in V_i$ in such
|
||||
a way that Eq.~(\ref{eq:mphfi}) is a MPHF for bucket $i$.
|
||||
In order to get the values of each entry of $g_i$ we first
|
||||
run a breadth-first search on the 2-\textit{core} of $G_i$, i.e., the maximal subgraph
|
||||
of~$G_i$ with minimum degree at least~$2$ (see, e.g., \cite{b01,jlr00,pw04}) and
|
||||
a depth-first search on the acyclic part of $G_i$ (see \cite{bkz05} for details).
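
As a toy illustration of Eq.~(\ref{eq:mphfi}) (the bucket, the edges and the labelling below are hypothetical and were worked out by hand; they are not claimed to be what the algorithm of~\cite{bkz05} outputs), consider a bucket with four keys $k_1,\ldots,k_4$, $c = 1$ and $t = 4$, whose keys are mapped to the edges $\{0,1\}$, $\{1,2\}$, $\{0,2\}$ and $\{2,3\}$, respectively. The $2$-core of this graph is the triangle on vertices $0$, $1$ and $2$, and the acyclic part is the edge $\{2,3\}$. With $g_i = (1, 2, 0, 0)$ we get
\begin{eqnarray*}
\mathrm{MPHF}_i(k_1) &=& g_i[0] + g_i[1] \;=\; 3, \\
\mathrm{MPHF}_i(k_2) &=& g_i[1] + g_i[2] \;=\; 2, \\
\mathrm{MPHF}_i(k_3) &=& g_i[0] + g_i[2] \;=\; 1, \\
\mathrm{MPHF}_i(k_4) &=& g_i[2] + g_i[3] \;=\; 0,
\end{eqnarray*}
so the four keys are mapped one-to-one onto $\{0,1,2,3\}$.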
|
||||
|
77
vldb07/svglov2.clo
Normal file
@ -0,0 +1,77 @@
|
||||
% SVJour2 DOCUMENT CLASS OPTION SVGLOV2 -- for standardised journals
|
||||
%
|
||||
% This is an enhancement for the LaTeX
|
||||
% SVJour2 document class for Springer journals
|
||||
%
|
||||
%%
|
||||
%%
|
||||
%% \CharacterTable
|
||||
%% {Upper-case \A\B\C\D\E\F\G\H\I\J\K\L\M\N\O\P\Q\R\S\T\U\V\W\X\Y\Z
|
||||
%% Lower-case \a\b\c\d\e\f\g\h\i\j\k\l\m\n\o\p\q\r\s\t\u\v\w\x\y\z
|
||||
%% Digits \0\1\2\3\4\5\6\7\8\9
|
||||
%% Exclamation \! Double quote \" Hash (number) \#
|
||||
%% Dollar \$ Percent \% Ampersand \&
|
||||
%% Acute accent \' Left paren \( Right paren \)
|
||||
%% Asterisk \* Plus \+ Comma \,
|
||||
%% Minus \- Point \. Solidus \/
|
||||
%% Colon \: Semicolon \; Less than \<
|
||||
%% Equals \= Greater than \> Question mark \?
|
||||
%% Commercial at \@ Left bracket \[ Backslash \\
|
||||
%% Right bracket \] Circumflex \^ Underscore \_
|
||||
%% Grave accent \` Left brace \{ Vertical bar \|
|
||||
%% Right brace \} Tilde \~}
|
||||
\ProvidesFile{svglov2.clo}
|
||||
[2004/10/25 v2.1
|
||||
style option for standardised journals]
|
||||
\typeout{SVJour Class option: svglov2.clo for standardised journals}
|
||||
\def\validfor{svjour2}
|
||||
\ExecuteOptions{final,10pt,runningheads}
|
||||
% No size changing allowed, hence a copy of size10.clo is included
|
||||
\renewcommand\normalsize{%
|
||||
\@setfontsize\normalsize{10.2pt}{4mm}%
|
||||
\abovedisplayskip=3 mm plus6pt minus 4pt
|
||||
\belowdisplayskip=3 mm plus6pt minus 4pt
|
||||
\abovedisplayshortskip=0.0 mm plus6pt
|
||||
\belowdisplayshortskip=2 mm plus4pt minus 4pt
|
||||
\let\@listi\@listI}
|
||||
\normalsize
|
||||
\newcommand\small{%
|
||||
\@setfontsize\small{8.7pt}{3.25mm}%
|
||||
\abovedisplayskip 8.5\p@ \@plus3\p@ \@minus4\p@
|
||||
\abovedisplayshortskip \z@ \@plus2\p@
|
||||
\belowdisplayshortskip 4\p@ \@plus2\p@ \@minus2\p@
|
||||
\def\@listi{\leftmargin\leftmargini
|
||||
\parsep 0\p@ \@plus1\p@ \@minus\p@
|
||||
\topsep 4\p@ \@plus2\p@ \@minus4\p@
|
||||
\itemsep0\p@}%
|
||||
\belowdisplayskip \abovedisplayskip
|
||||
}
|
||||
\let\footnotesize\small
|
||||
\newcommand\scriptsize{\@setfontsize\scriptsize\@viipt\@viiipt}
|
||||
\newcommand\tiny{\@setfontsize\tiny\@vpt\@vipt}
|
||||
\newcommand\large{\@setfontsize\large\@xiipt{14pt}}
|
||||
\newcommand\Large{\@setfontsize\Large\@xivpt{16dd}}
|
||||
\newcommand\LARGE{\@setfontsize\LARGE\@xviipt{17dd}}
|
||||
\newcommand\huge{\@setfontsize\huge\@xxpt{25}}
|
||||
\newcommand\Huge{\@setfontsize\Huge\@xxvpt{30}}
|
||||
%
|
||||
%ALT% \def\runheadhook{\rlap{\smash{\lower5pt\hbox to\textwidth{\hrulefill}}}}
|
||||
\def\runheadhook{\rlap{\smash{\lower11pt\hbox to\textwidth{\hrulefill}}}}
|
||||
\AtEndOfClass{\advance\headsep by5pt}
|
||||
\if@twocolumn
|
||||
\setlength{\textwidth}{17.6cm}
|
||||
\setlength{\textheight}{230mm}
|
||||
\AtEndOfClass{\setlength\columnsep{4mm}}
|
||||
\else
|
||||
\setlength{\textwidth}{11.7cm}
|
||||
\setlength{\textheight}{517.5dd} % 19.46cm
|
||||
\fi
|
||||
%
|
||||
\AtBeginDocument{%
|
||||
\@ifundefined{@journalname}
|
||||
{\typeout{Unknown journal: specify \string\journalname\string{%
<name of your journal>\string} in preamble^^J}}{}}
|
||||
%
|
||||
\endinput
|
||||
%%
|
||||
%% End of file `svglov2.clo'.
|
1419
vldb07/svjour2.cls
Normal file
File diff suppressed because it is too large
18
vldb07/terminology.tex
Executable file
@ -0,0 +1,18 @@
|
||||
% Time-stamp: <Sunday 29 Jan 2006 11:55:42pm EST yoshi@flare>
|
||||
\vspace{-3mm}
|
||||
\section{Notation and terminology}
|
||||
\vspace{-2mm}
|
||||
\label{sec:notation}
|
||||
|
||||
\enlargethispage{2\baselineskip}
|
||||
The essential notation and terminology used throughout this paper are as follows.
\begin{itemize}
\item $U$: key universe. $|U| = u$.
\item $S$: actual static key set. $S \subset U$, $|S| = n \ll u$.
\item $h: U \to M$ is a hash function that maps keys from a universe $U$ into
a given range $M = \{0,1,\dots,m-1\}$ of integer numbers.
\item $h$ is a perfect hash function if it is one-to-one on~$S$, i.e., if
$h(k_1) \neq h(k_2)$ for all $k_1 \neq k_2$ from $S$.
\item $h$ is a minimal perfect hash function (MPHF) if it is one-to-one on~$S$
and $n=m$.
\end{itemize}
|
78
vldb07/thealgorithm.tex
Executable file
@ -0,0 +1,78 @@
|
||||
%% Nivio: 13/jan/06, 21/jan/06 29/jan/06
|
||||
% Time-stamp: <Sunday 29 Jan 2006 11:56:25pm EST yoshi@flare>
|
||||
\vspace{-3mm}
|
||||
\section{The algorithm}
|
||||
\label{sec:new-algorithm}
|
||||
\vspace{-2mm}
|
||||
|
||||
\enlargethispage{2\baselineskip}
|
||||
The main idea supporting our algorithm is the classical divide-and-conquer technique.
The algorithm is a two-step, external-memory algorithm
that generates an MPHF $h$ for a set $S$ of $n$ keys.
Figure~\ref{fig:new-algo-main-steps} illustrates the two steps of the
algorithm: the partitioning step and the searching step.
|
||||
|
||||
\vspace{-2mm}
|
||||
\begin{figure}[ht]
|
||||
\centering
|
||||
\begin{picture}(0,0)%
|
||||
\includegraphics{figs/brz.ps}%
|
||||
\end{picture}%
|
||||
\setlength{\unitlength}{4144sp}%
|
||||
%
|
||||
\begingroup\makeatletter\ifx\SetFigFont\undefined%
|
||||
\gdef\SetFigFont#1#2#3#4#5{%
|
||||
\reset@font\fontsize{#1}{#2pt}%
|
||||
\fontfamily{#3}\fontseries{#4}\fontshape{#5}%
|
||||
\selectfont}%
|
||||
\fi\endgroup%
|
||||
\begin{picture}(3704,2091)(1426,-5161)
|
||||
\put(2570,-4301){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}0}}}}
|
||||
\put(2782,-4301){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}1}}}}
|
||||
\put(2996,-4301){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}2}}}}
|
||||
\put(4060,-4006){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}Buckets}}}}
|
||||
\put(3776,-4301){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}${\lceil n/b\rceil - 1}$}}}}
|
||||
\put(4563,-3329){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}Key Set $S$}}}}
|
||||
\put(2009,-3160){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}0}}}}
|
||||
\put(2221,-3160){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}1}}}}
|
||||
\put(4315,-3160){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}n-1}}}}
|
||||
\put(1992,-5146){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}0}}}}
|
||||
\put(2204,-5146){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}1}}}}
|
||||
\put(4298,-5146){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}n-1}}}}
|
||||
\put(4546,-4977){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}Hash Table}}}}
|
||||
\put(1441,-3616){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}Partitioning}}}}
|
||||
\put(1441,-4426){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}Searching}}}}
|
||||
\put(1981,-4786){\makebox(0,0)[lb]{\smash{{\SetFigFont{5}{6.0}{\familydefault}{\mddefault}{\updefault}MPHF$_0$}}}}
|
||||
\put(2521,-4786){\makebox(0,0)[lb]{\smash{{\SetFigFont{5}{6.0}{\familydefault}{\mddefault}{\updefault}MPHF$_1$}}}}
|
||||
\put(3016,-4786){\makebox(0,0)[lb]{\smash{{\SetFigFont{5}{6.0}{\familydefault}{\mddefault}{\updefault}MPHF$_2$}}}}
|
||||
\put(3826,-4786){\makebox(0,0)[lb]{\smash{{\SetFigFont{5}{6.0}{\familydefault}{\mddefault}{\updefault}MPHF$_{\lceil n/b \rceil - 1}$}}}}
|
||||
\end{picture}%
|
||||
\vspace{-1mm}
|
||||
\caption{Main steps of our algorithm}
|
||||
\label{fig:new-algo-main-steps}
|
||||
\vspace{-3mm}
|
||||
\end{figure}
|
||||
|
||||
The partitioning step takes a key set $S$ and uses a universal hash function
$h_0$ proposed by Jenkins~\cite{j97}
%for each key $k \in S$ of length $|k|$
to transform each key~$k\in S$ into an integer~$h_0(k)$.
Reducing~$h_0(k)$ modulo~$\lceil n/b\rceil$, we partition~$S$ into $\lceil n/b
\rceil$ buckets, each containing at most 256 keys (with high
probability).
|
||||
|
||||
The searching step generates an MPHF$_i$ for each bucket $i$,
$0 \leq i < \lceil n/b \rceil$.
|
||||
The resulting MPHF $h(k)$, $k \in S$, is given by
|
||||
\begin{eqnarray}\label{eq:mphf}
|
||||
h(k) = \mathrm{MPHF}_i (k) + \mathit{offset}[i],
|
||||
\end{eqnarray}
|
||||
where~$i=h_0(k)\bmod\lceil n/b\rceil$.
|
||||
The $i$th entry~$\mathit{offset}[i]$ of the displacement vector
$\mathit{offset}$, $0 \leq i < \lceil n/b \rceil$, contains the total number
of keys in buckets 0 to $i-1$; that is, it gives the first position of the
interval of the hash table addressed by MPHF$_i$. In the following we explain
each step in detail.
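
Before doing so, a small worked instance of Eq.~(\ref{eq:mphf}) may help.
The bucket sizes and the value of MPHF$_2(k)$ used here are hypothetical,
chosen only for illustration.
Suppose $n = 9$ keys fall into three buckets of sizes $3$, $2$ and $4$. Then
\begin{eqnarray*}
\mathit{offset}[0] &=& 0, \\
\mathit{offset}[1] &=& 3, \\
\mathit{offset}[2] &=& 3 + 2 = 5,
\end{eqnarray*}
so buckets 0, 1 and 2 address the hash table intervals $[0,2]$, $[3,4]$
and $[5,8]$, respectively.
For a key~$k$ with $h_0(k) \bmod 3 = 2$ and $\mathrm{MPHF}_2(k) = 1$ we obtain
\begin{eqnarray*}
h(k) = \mathrm{MPHF}_2(k) + \mathit{offset}[2] = 1 + 5 = 6.
\end{eqnarray*}
Since each MPHF$_i$ is one-to-one onto $\{0,\dots,s_i-1\}$, where $s_i$
denotes the size of bucket~$i$, and the intervals are disjoint and cover
$\{0,\dots,8\}$, the resulting $h$ is one-to-one onto $\{0,\dots,n-1\}$.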
|
||||
|
||||
|
||||
|
21
vldb07/thedataandsetup.tex
Executable file
21
vldb07/thedataandsetup.tex
Executable file
@ -0,0 +1,21 @@
|
||||
% Nivio: 29/jan/06
|
||||
% Time-stamp: <Sunday 29 Jan 2006 11:57:40pm EST yoshi@flare>
|
||||
\vspace{-2mm}
|
||||
\subsection{The data and the experimental setup}
|
||||
\label{sec:data-exper-set}
|
||||
|
||||
The algorithms were implemented in the C language and
|
||||
are available at \texttt{http://\-cmph.sf.net}
|
||||
under the GNU Lesser General Public License (LGPL).
|
||||
% free software licence.
|
||||
All experiments were carried out on
|
||||
a computer running the Linux operating system, version 2.6,
|
||||
with a 2.4 gigahertz processor and
|
||||
1 gigabyte of main memory.
|
||||
In the experiments related to the new
algorithm we limited the main memory to 500 megabytes.
|
||||
|
||||
Our data consists of a collection of 1 billion
|
||||
URLs collected from the Web, each URL 64 characters long on average.
|
||||
The collection occupies 60.5 gigabytes on disk.
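
As a rough consistency check of these figures, assume, for illustration only,
that a gigabyte means $2^{30}$ bytes and that each URL is stored with one
separator byte (neither assumption is stated above). Then the average space per URL is
\begin{eqnarray*}
\frac{60.5 \times 2^{30}}{10^{9}} \approx 64.96 \approx 64 + 1
\end{eqnarray*}
bytes, which agrees with the stated average of 64 characters per URL.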
|
||||
|
194
vldb07/vldb.tex
Normal file
@ -0,0 +1,194 @@
|
||||
%%%%%%%%%%%%%%%%%%%%%%% file template.tex %%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
%
|
||||
% This is a template file for the LaTeX package SVJour2 for the
|
||||
% Springer journal "The VLDB Journal".
|
||||
%
|
||||
% Springer Heidelberg 2004/12/03
|
||||
%
|
||||
% Copy it to a new file with a new name and use it as the basis
|
||||
% for your article. Delete % as needed.
|
||||
%
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
%
|
||||
% First comes an example EPS file -- just ignore it and
|
||||
% proceed on the \documentclass line
|
||||
% your LaTeX will extract the file if required
|
||||
%\begin{filecontents*}{figs/minimalperfecthash-ph-mph.ps}
|
||||
%!PS-Adobe-3.0 EPSF-3.0
|
||||
%%BoundingBox: 19 19 221 221
|
||||
%%CreationDate: Mon Sep 29 1997
|
||||
%%Creator: programmed by hand (JK)
|
||||
%%EndComments
|
||||
%gsave
|
||||
%newpath
|
||||
% 20 20 moveto
|
||||
% 20 220 lineto
|
||||
% 220 220 lineto
|
||||
% 220 20 lineto
|
||||
%closepath
|
||||
%2 setlinewidth
|
||||
%gsave
|
||||
% .4 setgray fill
|
||||
%grestore
|
||||
%stroke
|
||||
%grestore
|
||||
%\end{filecontents*}
|
||||
%
|
||||
\documentclass[twocolumn,fleqn,runningheads]{svjour2}
|
||||
%
|
||||
\smartqed % flush right qed marks, e.g. at end of proof
|
||||
%
|
||||
\usepackage{graphicx}
|
||||
\usepackage{listings}
|
||||
\usepackage{epsfig}
|
||||
\usepackage{textcomp}
|
||||
\usepackage[latin1]{inputenc}
|
||||
\usepackage{amssymb}
|
||||
|
||||
%\DeclareGraphicsExtensions{.png}
|
||||
%
|
||||
% \usepackage{mathptmx} % use Times fonts if available on your TeX system
|
||||
%
|
||||
% insert here the call for the packages your document requires
|
||||
%\usepackage{latexsym}
|
||||
% etc.
|
||||
%
|
||||
% please place your own definitions here and don't use \def but
|
||||
% \newcommand{}{}
|
||||
%
|
||||
|
||||
\lstset{
|
||||
language=Pascal,
|
||||
basicstyle=\fontsize{9}{9}\selectfont,
|
||||
captionpos=t,
|
||||
aboveskip=1mm,
|
||||
belowskip=1mm,
|
||||
abovecaptionskip=1mm,
|
||||
belowcaptionskip=1mm,
|
||||
% numbers = left,
|
||||
mathescape=true,
|
||||
escapechar=@,
|
||||
extendedchars=true,
|
||||
showstringspaces=false,
|
||||
columns=fixed,
|
||||
basewidth=0.515em,
|
||||
frame=single,
|
||||
framesep=2mm,
|
||||
xleftmargin=2mm,
|
||||
xrightmargin=2mm,
|
||||
framerule=0.5pt
|
||||
}
|
||||
|
||||
\def\cG{{\mathcal G}}
|
||||
\def\crit{{\rm crit}}
|
||||
\def\ncrit{{\rm ncrit}}
|
||||
\def\scrit{{\rm scrit}}
|
||||
\def\bedges{{\rm bedges}}
|
||||
\def\ZZ{{\mathbb Z}}
|
||||
|
||||
\journalname{The VLDB Journal}
|
||||
%
|
||||
|
||||
\begin{document}
|
||||
|
||||
\title{Space and Time Efficient Minimal Perfect Hash \\[0.2cm]
|
||||
Functions for Very Large Databases\thanks{
|
||||
This work was supported in part by
|
||||
GERINDO Project--grant MCT/CNPq/CT-INFO 552.087/02-5,
|
||||
CAPES/PROF Scholarship (Fabiano C. Botelho),
|
||||
FAPESP Proj.\ Tem.\ 03/09925-5 and CNPq Grant 30.0334/93-1
|
||||
(Yoshiharu Kohayakawa),
|
||||
and CNPq Grant 30.5237/02-0 (Nivio Ziviani).}
|
||||
}
|
||||
%\subtitle{Do you have a subtitle?\\ If so, write it here}
|
||||
|
||||
%\titlerunning{Short form of title} % if too long for running head
|
||||
|
||||
\author{Fabiano C. Botelho \and Davi C. Reis \and Yoshiharu Kohayakawa \and Nivio Ziviani}
|
||||
%\authorrunning{Short form of author list} % if too long for running head
|
||||
\institute{
|
||||
F. C. Botelho \and
|
||||
N. Ziviani \at
|
||||
Dept. of Computer Science,
|
||||
Federal Univ. of Minas Gerais,
|
||||
Belo Horizonte, Brazil\\
|
||||
\email{\{fbotelho,nivio\}@dcc.ufmg.br}
|
||||
\and
|
||||
D. C. Reis \at
|
||||
Google, Brazil \\
|
||||
\email{davi.reis@gmail.com}
|
||||
\and
|
||||
Y. Kohayakawa \at
|
||||
Dept. of Computer Science,
|
||||
Univ. of S\~ao Paulo,
|
||||
S\~ao Paulo, Brazil\\
|
||||
\email{yoshi@ime.usp.br}
|
||||
}
|
||||
|
||||
\date{Received: date / Accepted: date}
|
||||
% The correct dates will be entered by the editor
|
||||
|
||||
|
||||
\maketitle
|
||||
|
||||
\begin{abstract}
|
||||
We propose a novel external memory based algorithm for constructing minimal
|
||||
perfect hash functions~$h$ for huge sets of keys.
|
||||
For a set of~$n$ keys, our algorithm outputs~$h$ in time~$O(n)$.
|
||||
The algorithm needs a small vector of one byte entries
|
||||
in main memory to construct $h$.
|
||||
The evaluation of~$h(x)$ requires three memory accesses for any key~$x$.
|
||||
The description of~$h$ takes a constant number of bits
for each key, which is optimal up to a constant factor, since the theoretical lower bound is $1/\ln 2$
bits per key.
|
||||
In our experiments, we used a collection of 1 billion URLs collected
|
||||
from the web, each URL 64 characters long on average.
|
||||
For this collection, our algorithm
|
||||
(i) finds a minimal perfect hash function in approximately
|
||||
3 hours using a commodity PC,
|
||||
(ii) needs just 5.45 megabytes of internal memory to generate $h$
|
||||
and (iii) takes 8.1 bits per key for the description of~$h$.
|
||||
\keywords{Minimal Perfect Hashing \and Large Databases}
|
||||
\end{abstract}
|
||||
|
||||
% main text
|
||||
|
||||
\def\cG{{\mathcal G}}
|
||||
\def\crit{{\rm crit}}
|
||||
\def\ncrit{{\rm ncrit}}
|
||||
\def\scrit{{\rm scrit}}
|
||||
\def\bedges{{\rm bedges}}
|
||||
\def\ZZ{{\mathbb Z}}
|
||||
\def\BSmax{\mathit{BS}_{\mathit{max}}}
|
||||
\def\Bi{\mathop{\rm Bi}\nolimits}
|
||||
|
||||
\input{introduction}
|
||||
%\input{terminology}
|
||||
\input{relatedwork}
|
||||
\input{thealgorithm}
|
||||
\input{partitioningthekeys}
|
||||
\input{searching}
|
||||
%\input{computingoffset}
|
||||
%\input{hashingbuckets}
|
||||
\input{determiningb}
|
||||
%\input{analyticalandexperimentalresults}
|
||||
\input{analyticalresults}
|
||||
%\input{results}
|
||||
\input{conclusions}
|
||||
|
||||
|
||||
|
||||
|
||||
%\input{acknowledgments}
|
||||
%\begin{acknowledgements}
|
||||
%If you'd like to thank anyone, place your comments here
|
||||
%and remove the percent signs.
|
||||
%\end{acknowledgements}
|
||||
|
||||
% BibTeX users please use
|
||||
%\bibliographystyle{spmpsci}
|
||||
%\bibliography{} % name your BibTeX data base
|
||||
\bibliographystyle{plain}
|
||||
\bibliography{references}
|
||||
\input{appendix}
|
||||
\end{document}
|