%% Nivio: 22/jan/06
% Time-stamp: <Monday 30 Jan 2006 03:57:35am EDT yoshi@ime.usp.br>
\vspace{-7mm}
\subsection{Searching step}
\label{sec:searching}

\enlargethispage{2\baselineskip}
The searching step is responsible for generating a MPHF for each
bucket.
Figure~\ref{fig:searchingstep} presents the searching step algorithm.
\vspace{-2mm}
\begin{figure}[h]
%\centering
\hrule
\hrule
\vspace{2mm}
\begin{tabbing}
aa\=type booleanx \== (false, true); \kill
\> $\blacktriangleright$ Let $H$ be a minimum heap of size $N$, where the \\
\> ~~ order relation in $H$ is given by Eq.~(\ref{eq:bucketindex}), that is, the\\
\> ~~ remove operation removes the item with smallest $i$\\
\> $1.$ {\bf for} $j = 1$ {\bf to} $N$ {\bf do} \{ Heap construction \}\\
\> ~~ $1.1$ Read key $k$ from File $j$ on disk\\
\> ~~ $1.2$ Insert $(i, j, k)$ in $H$ \\
\> $2.$ {\bf for} $i = 0$ {\bf to} $\lceil n/b \rceil - 1$ {\bf do} \\
\> ~~ $2.1$ Read bucket $i$ from disk driven by heap $H$ \\
\> ~~ $2.2$ Generate a MPHF for bucket $i$ \\
\> ~~ $2.3$ Write the description of MPHF$_i$ to the disk
\end{tabbing}
\vspace{-1mm}
\hrule
\hrule
\caption{Searching step}
\label{fig:searchingstep}
\vspace{-4mm}
\end{figure}

Statement 1 of Figure~\ref{fig:searchingstep} inserts one key from each file
into a minimum heap $H$ of size $N$.
The order relation in $H$ is the bucket address $i$ computed by
Eq.~(\ref{eq:bucketindex}).

%\enlargethispage{-\baselineskip}
Statement 2 has two important steps.
In statement 2.1, bucket $i$ is read from disk,
as described below.
%in Section~\ref{sec:readingbucket}.
In statement 2.2, a MPHF is generated for bucket $i$, as also described
below.
%in Section~\ref{sec:mphfbucket}.
The description of MPHF$_i$ is a vector $g_i$ of 8-bit integers.
Finally, statement 2.3 writes the description $g_i$ of MPHF$_i$ to disk.

\vspace{-3mm}
\subsubsection{Reading a bucket from disk.} \label{sec:readingbucket}

This section refines statement 2.1 of
Figure~\ref{fig:searchingstep}.
Figure~\ref{fig:readingbucket} presents the algorithm
that reads bucket $i$ from disk.

\begin{figure}[h]
\hrule
\hrule
\vspace{2mm}
\begin{tabbing}
aa\=type booleanx \== (false, true); \kill
\> $1.$ {\bf while} bucket $i$ is not full {\bf do} \\
\> ~~ $1.1$ Remove $(i, j, k)$ from $H$\\
\> ~~ $1.2$ Insert $k$ into bucket $i$ \\
\> ~~ $1.3$ Read sequentially all keys $k$ from File $j$ that have \\
\> ~~~~~~~ the same $i$ and insert them into bucket $i$ \\
\> ~~ $1.4$ Insert the triple $(i, j, x)$ in $H$, where $x$ is the first \\
\> ~~~~~~~ key read from File $j$ that does not have the \\
\> ~~~~~~~ same bucket index $i$
\end{tabbing}
\hrule
\hrule
\vspace{-1.0mm}
\caption{Reading a bucket}
\vspace{-4.0mm}
\label{fig:readingbucket}
\end{figure}

Bucket $i$ is distributed among many files, and the heap $H$ is used to drive a
multiway merge operation.
In Figure~\ref{fig:readingbucket}, statement 1.1 removes from $H$ the triple
$(i, j, k)$ whose bucket address $i$ is a minimum in $H$.
Statement 1.2 inserts key $k$ into bucket $i$.
Notice that the $k$ in the triple $(i, j, k)$ is in fact a pointer to
the first byte of the key, which is kept in contiguous positions of an array of characters
(this array containing the keys is initialized during the heap construction
in statement 1 of Figure~\ref{fig:searchingstep}).
Statement 1.3 performs a seek operation in File $j$ on disk for the first
read operation and then reads sequentially all keys $k$ that have the same bucket address $i$
%(obtained from Eq.~(\ref{eq:bucketindex}))
and inserts them all into bucket $i$.
Finally, statement 1.4 inserts in $H$ the triple $(i, j, x)$,
where $x$ is the first key read from File $j$ (in statement 1.3)
that does not have the same bucket address as the previous keys;
in this new triple, the first component is the bucket address of $x$.

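
% Illustrative trace of Figure~\ref{fig:readingbucket}; all values are hypothetical.
As a small illustration of Figure~\ref{fig:readingbucket}, consider a
hypothetical instance with $N = 2$ files, assuming that the keys in each file
are stored in nondecreasing order of bucket address.
Suppose File~1 holds keys $k_1, k_2, k_3$ with bucket addresses $0, 0, 1$ and
File~2 holds keys $k_4, k_5$ with bucket addresses $0, 1$, so that bucket $0$
is full once it holds three keys.
After the heap construction, $H = \{(0, 1, k_1), (0, 2, k_4)\}$.
In the first iteration, statement 1.1 removes $(0, 1, k_1)$
(we assume ties between equal bucket addresses are broken arbitrarily),
statement 1.2 inserts $k_1$ into bucket $0$,
statement 1.3 reads $k_2$ and inserts it into bucket $0$,
and statement 1.4 inserts $(1, 1, k_3)$ in $H$, since $k_3$ is the first key
of File~1 with a different bucket address.
In the second iteration, statement 1.1 removes $(0, 2, k_4)$,
statement 1.2 inserts $k_4$ into bucket $0$,
statement 1.3 finds no further key with bucket address $0$ in File~2,
and statement 1.4 inserts $(1, 2, k_5)$ in $H$.
Bucket $0$ now holds $k_1$, $k_2$ and $k_4$, so the loop ends with
$H = \{(1, 1, k_3), (1, 2, k_5)\}$, ready for reading bucket $1$.
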
The number of seek operations on disk performed in statement 1.3 is discussed
in Section~\ref{sec:linearcomplexity},
where we present a buffering technique that reduces
the time spent on seeks.

\vspace{-2mm}
\enlargethispage{2\baselineskip}
\subsubsection{Generating a MPHF for each bucket.} \label{sec:mphfbucket}

To the best of our knowledge, the algorithm we designed in
our previous work~\cite{bkz05} is the fastest published algorithm for
constructing MPHFs.
We therefore use it as a building block for the
algorithm presented here.

%\enlargethispage{-\baselineskip}
Our previous algorithm is a three-step internal-memory algorithm
that produces a MPHF based on random graphs.
For a set of $n$ keys, the algorithm outputs the resulting MPHF in expected time $O(n)$.
For a given bucket $i$, $0 \leq i < \lceil n/b \rceil$, the corresponding MPHF$_i$
has the following form:
\begin{eqnarray}
\mathrm{MPHF}_i(k) &=& g_i[a] + g_i[b] \label{eq:mphfi}
\end{eqnarray}
where $a = h_{i1}(k) \bmod t$, $b = h_{i2}(k) \bmod t$ and
$t = c\times \mathit{size}[i]$. The functions
$h_{i1}(k)$ and $h_{i2}(k)$ are instances of the universal hash function proposed by Jenkins~\cite{j97}
that was used in the partitioning step described in Section~\ref{sec:partitioning-keys}.

To generate the function above, the algorithm builds simple random graphs
$G_i = (V_i, E_i)$ with~$|V_i|=t=c\times\mathit{size}[i]$ and $|E_i|=\mathit{size}[i]$, where $c \in [0.93, 1.15]$.
To generate a simple random graph with high
probability\footnote{We use the term `with high probability'
to mean `with probability tending to~$1$ as~$n\to\infty$'.}, two vertices $a$ and $b$ are
computed for each key $k$ in bucket $i$.
Thus, each bucket $i$ has a corresponding graph~$G_i=(V_i,E_i)$, where $V_i=\{0,1,
\ldots,t-1\}$ and $E_i=\big\{\{a,b\}:k \in \mathrm{bucket}\: i\big\}$.
In order to get a simple graph,
the algorithm repeatedly selects $h_{i1}$ and $h_{i2}$ from a family of universal hash functions
until the corresponding graph is simple.
The probability of getting a simple graph is $p=e^{-1/c^2}$.
For $c=1$, this probability is $p \simeq 0.368$, and the expected number of
iterations to obtain a simple graph is~$1/p \simeq 2.72$.

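
% Worked arithmetic for the expected number of iterations; the independence assumption is ours.
The value $1/p$ can be checked with a short calculation, under the assumption
(made here only for illustration) that successive selections of the pair
$(h_{i1}, h_{i2})$ are independent and that each one yields a simple graph with
probability $p$.
The number $X$ of iterations is then geometrically distributed, and
\begin{eqnarray*}
\mathrm{E}[X] &=& \sum_{r \geq 1} r\, p\, (1-p)^{r-1} \;=\; \frac{1}{p} \;=\; e^{1/c^2}.
\end{eqnarray*}
Evaluating $p = e^{-1/c^2}$ at the endpoints of the interval for $c$ gives
$p \simeq 0.315$ and $1/p \simeq 3.18$ for $c = 0.93$, and
$p \simeq 0.469$ and $1/p \simeq 2.13$ for $c = 1.15$,
so a larger $c$ trades a larger graph for fewer expected iterations.
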
The construction of MPHF$_i$ ends with the computation of a suitable labelling of the vertices
of~$G_i$. The labelling is stored in vector $g_i$.
We choose~$g_i[v]$ for each~$v\in V_i$ in such
a way that Eq.~(\ref{eq:mphfi}) is a MPHF for bucket $i$.
To obtain the value of each entry of $g_i$, we first
run a breadth-first search on the 2-\textit{core} of $G_i$, i.e., the maximal subgraph
of~$G_i$ with minimum degree at least~$2$ (see, e.g., \cite{b01,jlr00,pw04}), and then
a depth-first search on the acyclic part of $G_i$ (see \cite{bkz05} for details).
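
% Rough size estimate for g_i; the simplifying assumptions are ours.
As a rough estimate of the size of this labelling, assume that $g_i$ has one
8-bit entry per vertex of $G_i$, and ignore both the rounding of $t$ to an
integer and any other data stored with the description of MPHF$_i$.
Under these assumptions,
\begin{eqnarray*}
\mathrm{bits}(g_i) &=& 8\,t \;=\; 8\,c \times \mathit{size}[i],
\end{eqnarray*}
that is, $8c$ bits per key: about $7.44$ bits for $c = 0.93$ and
$9.2$ bits for $c = 1.15$.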