paper for vldb07 added
This commit is contained in:
155
vldb07/searching.tex
Executable file
@@ -0,0 +1,155 @@
%% Nivio: 22/jan/06
% Time-stamp: <Monday 30 Jan 2006 03:57:35am EDT yoshi@ime.usp.br>
\vspace{-7mm}
\subsection{Searching step}
\label{sec:searching}

\enlargethispage{2\baselineskip}
The searching step is responsible for generating an MPHF for each
bucket.
Figure~\ref{fig:searchingstep} presents the searching step algorithm.
\vspace{-2mm}
\begin{figure}[h]
%\centering
\hrule
\hrule
\vspace{2mm}
\begin{tabbing}
aa\=type booleanx \== (false, true); \kill
\> $\blacktriangleright$ Let $H$ be a minimum heap of size $N$, where the \\
\> ~~ order relation in $H$ is given by Eq.~(\ref{eq:bucketindex}), that is, the\\
\> ~~ remove operation removes the item with smallest $i$\\
\> $1.$ {\bf for} $j = 1$ {\bf to} $N$ {\bf do} \{ Heap construction \}\\
\> ~~ $1.1$ Read key $k$ from File $j$ on disk\\
\> ~~ $1.2$ Insert $(i, j, k)$ in $H$ \\
\> $2.$ {\bf for} $i = 0$ {\bf to} $\lceil n/b \rceil - 1$ {\bf do} \\
\> ~~ $2.1$ Read bucket $i$ from disk driven by heap $H$ \\
\> ~~ $2.2$ Generate an MPHF for bucket $i$ \\
\> ~~ $2.3$ Write the description of MPHF$_i$ to the disk
\end{tabbing}
\vspace{-1mm}
\hrule
\hrule
\caption{Searching step}
\label{fig:searchingstep}
\vspace{-4mm}
\end{figure}

Statement 1 of Figure~\ref{fig:searchingstep} inserts one key from each file
into a minimum heap $H$ of size $N$.
Items in $H$ are ordered by the bucket address $i$ computed with
Eq.~(\ref{eq:bucketindex}).

%\enlargethispage{-\baselineskip}
Statement 2 has two important steps.
In statement 2.1, a bucket is read from disk,
as described below.
%in Section~\ref{sec:readingbucket}.
In statement 2.2, an MPHF is generated for each bucket $i$, as described
later in this section.
%in Section~\ref{sec:mphfbucket}.
The description of MPHF$_i$ is a vector $g_i$ of 8-bit integers.
Finally, statement 2.3 writes the description $g_i$ of MPHF$_i$ to disk.

\vspace{-3mm}
\subsubsection{Reading a bucket from disk.}
\label{sec:readingbucket}

In this section we present the refinement of statement 2.1 of
Figure~\ref{fig:searchingstep}.
The algorithm to read bucket $i$ from disk is presented
in Figure~\ref{fig:readingbucket}.

\begin{figure}[h]
\hrule
\hrule
\vspace{2mm}
\begin{tabbing}
aa\=type booleanx \== (false, true); \kill
\> $1.$ {\bf while} bucket $i$ is not full {\bf do} \\
\> ~~ $1.1$ Remove $(i, j, k)$ from $H$\\
\> ~~ $1.2$ Insert $k$ into bucket $i$ \\
\> ~~ $1.3$ Read sequentially all keys $k$ from File $j$ that have \\
\> ~~~~~~~ the same $i$ and insert them into bucket $i$ \\
\> ~~ $1.4$ Insert the triple $(i, j, x)$ in $H$, where $x$ is the first \\
\> ~~~~~~~ key read from File $j$ that does not have the \\
\> ~~~~~~~ same bucket index $i$
\end{tabbing}
\hrule
\hrule
\vspace{-1.0mm}
\caption{Reading a bucket}
\vspace{-4.0mm}
\label{fig:readingbucket}
\end{figure}

Bucket $i$ is distributed among many files and the heap $H$ is used to drive a
multiway merge operation.
In Figure~\ref{fig:readingbucket}, statement 1.1 extracts and removes triple
$(i, j, k)$ from $H$, where $i$ is the minimum bucket address in $H$.
Statement 1.2 inserts key $k$ into bucket $i$.
Notice that the $k$ in the triple $(i, j, k)$ is in fact a pointer to
the first byte of the key, which is kept in contiguous positions of an array of characters
(this array containing the keys is initialized during the heap construction
in statement 1 of Figure~\ref{fig:searchingstep}).
Statement 1.3 performs a seek operation in File $j$ on disk for the first
read operation, reads sequentially all keys $k$ that have the same $i$,
%(obtained from Eq.~(\ref{eq:bucketindex}))
and inserts them all into bucket $i$.
Finally, statement 1.4 inserts in $H$ the triple $(i, j, x)$,
where $x$ is the first key read from File $j$ (in statement 1.3)
that does not have the same bucket address as the previous keys,
and $i$ now denotes the bucket address of $x$.
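
The heap-driven merge of Figure~\ref{fig:readingbucket} can be illustrated with a minimal Python sketch. It is not the implementation used in this work: each file is modelled as an in-memory list of (bucket address, key) pairs already sorted by bucket address, keys are plain strings rather than pointers into a character array, and the names \texttt{build\_heap}, \texttt{read\_bucket} and \texttt{read\_all\_buckets} are hypothetical. The bucket sizes are assumed to be known in advance.

```python
import heapq


def build_heap(files, pos):
    """Statement 1 of the searching step: one triple (i, j, k) per file."""
    heap = []
    for j, f in enumerate(files):
        if f:                                # skip empty files
            i, k = f[0]
            pos[j] = 1                       # next unread record of file j
            heap.append((i, j, k))
    heapq.heapify(heap)                      # ordered by i, ties broken by j
    return heap


def read_bucket(heap, files, pos, size_i):
    """Statement 2.1: gather the keys of the bucket with the smallest address."""
    bucket = []
    while len(bucket) < size_i:              # 1.  while bucket i is not full
        i, j, k = heapq.heappop(heap)        # 1.1 remove (i, j, k) from H
        bucket.append(k)                     # 1.2 insert k into bucket i
        f = files[j]
        # 1.3 read sequentially the keys of file j with the same address i
        while pos[j] < len(f) and f[pos[j]][0] == i:
            bucket.append(f[pos[j]][1])
            pos[j] += 1
        # 1.4 push the first key of file j that belongs to a later bucket
        if pos[j] < len(f):
            next_i, next_k = f[pos[j]]
            pos[j] += 1
            heapq.heappush(heap, (next_i, j, next_k))
    return bucket


def read_all_buckets(files, sizes):
    """Read buckets 0, 1, ... in order; sizes[i] is the size of bucket i."""
    pos = [0] * len(files)
    heap = build_heap(files, pos)
    return [read_bucket(heap, files, pos, s) for s in sizes]
```

Since each file contributes at most one triple to the heap at any time, ties on $i$ are resolved by the file index $j$ and the keys themselves are never compared.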

The number of seek operations on disk performed in statement 1.3 is discussed
in Section~\ref{sec:linearcomplexity},
where we present a buffering technique that reduces
the time spent on seeks.

\vspace{-2mm}
\enlargethispage{2\baselineskip}
\subsubsection{Generating an MPHF for each bucket.} \label{sec:mphfbucket}

To the best of our knowledge, the algorithm we designed in
our previous work~\cite{bkz05} is the fastest published algorithm for
constructing MPHFs.
For this reason, we use it as a building block for the
algorithm presented here.

%\enlargethispage{-\baselineskip}
Our previous algorithm is a three-step internal-memory algorithm
that produces an MPHF based on random graphs.
For a set of $n$ keys, the algorithm outputs the resulting MPHF in expected time $O(n)$.
For a given bucket $i$, $0 \leq i < \lceil n/b \rceil$, the corresponding MPHF$_i$
has the following form:
\begin{eqnarray}
\mathrm{MPHF}_i(k) &=& g_i[a] + g_i[b] \label{eq:mphfi}
\end{eqnarray}
where $a = h_{i1}(k) \bmod t$, $b = h_{i2}(k) \bmod t$ and
$t = c\times \mathit{size}[i]$. The functions
$h_{i1}(k)$ and $h_{i2}(k)$ are instances of the same universal hash function proposed by Jenkins~\cite{j97}
that was used in the partitioning step described in Section~\ref{sec:partitioning-keys}.
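
As an illustration of Eq.~(\ref{eq:mphfi}), the following Python sketch evaluates MPHF$_i$ for a toy bucket of four keys with $c = 1$ and $t = 4$. The hash values and the vector \texttt{g} are hand-picked for the example (they are assumptions, not output of the construction algorithm), and the two lookup tables merely stand in for the Jenkins functions $h_{i1}$ and $h_{i2}$.

```python
def mphf_i(key, g, t, h1, h2):
    """Evaluate MPHF_i(k) = g_i[a] + g_i[b] with a = h1(k) mod t, b = h2(k) mod t."""
    a = h1(key) % t
    b = h2(key) % t
    return g[a] + g[b]


# Toy stand-ins for h_{i1} and h_{i2}: fixed tables instead of real hash functions.
H1 = {"w": 4, "x": 9, "y": 6, "z": 3}
H2 = {"w": 5, "x": 2, "y": 7, "z": 8}

T = 4                    # t = c * size[i] with c = 1 and size[i] = 4
G = [0, 0, 1, 2]         # hand-picked labelling of the vertices 0..3

values = {k: mphf_i(k, G, T, H1.__getitem__, H2.__getitem__) for k in H1}
```

In this toy instance the four keys induce the edges $\{0,1\}$, $\{1,2\}$, $\{2,3\}$ and $\{3,0\}$, and the labelling maps them to the distinct values $0$, $1$, $3$ and $2$, i.e., onto the range $\{0,\ldots,3\}$ with no collisions.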

To generate the function above, the algorithm builds simple random graphs
$G_i = (V_i, E_i)$ with~$|V_i|=t=c\times\mathit{size}[i]$ and $|E_i|=\mathit{size}[i]$, with $c \in [0.93, 1.15]$.
To generate a simple random graph with high
probability\footnote{We use the term `with high probability'
to mean `with probability tending to~$1$ as~$n\to\infty$'.}, two vertices $a$ and $b$ are
computed for each key $k$ in bucket $i$.
Thus, each bucket $i$ has a corresponding graph~$G_i=(V_i,E_i)$, where $V_i=\{0,1,
\ldots,t-1\}$ and $E_i=\big\{\{a,b\}:k \in \mathrm{bucket}\: i\big\}$.
In order to get a simple graph,
the algorithm repeatedly selects $h_{i1}$ and $h_{i2}$ from a family of universal hash functions
until the corresponding graph is simple.
The probability of getting a simple graph is $p=e^{-1/c^2}$.
For $c=1$, this probability is $p \simeq 0.368$, and the expected number of
iterations to obtain a simple graph is~$1/p \simeq 2.72$.
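
The retry loop described above can be sketched in Python as follows. This is only an illustration under stated assumptions: a salted BLAKE2b hash from the standard library stands in for the Jenkins function, $t$ is rounded up to an integer (the text does not specify the rounding), and the names \texttt{seeded\_hash}, \texttt{build\_simple\_graph} and \texttt{expected\_tries} are hypothetical.

```python
import hashlib
import math


def seeded_hash(seed, key):
    """Stand-in for a seeded universal hash function (NOT the Jenkins function)."""
    digest = hashlib.blake2b(key.encode(), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def build_simple_graph(keys, c=1.15, max_tries=1000):
    """Pick new (h1, h2) seeds until the graph has no self-loop or multi-edge."""
    t = max(2, math.ceil(c * len(keys)))     # rounding up is an assumption
    for attempt in range(1, max_tries + 1):
        s1, s2 = 2 * attempt, 2 * attempt + 1
        edges = set()
        for k in keys:
            a = seeded_hash(s1, k) % t
            b = seeded_hash(s2, k) % t
            edge = frozenset((a, b))
            if a == b or edge in edges:      # self-loop or repeated edge
                break                        # graph is not simple: retry
            edges.add(edge)
        else:
            return t, (s1, s2), edges, attempt
    raise RuntimeError("no simple graph found within max_tries")


def expected_tries(c):
    """Expected number of iterations 1/p for p = exp(-1/c^2)."""
    return math.exp(1.0 / (c * c))
```

For small buckets the observed number of attempts can differ noticeably from \texttt{expected\_tries(c)}, since $p=e^{-1/c^2}$ is the probability quoted in the text and the sketch makes no claim about how quickly finite inputs approach it.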

The construction of MPHF$_i$ ends with the computation of a suitable labelling of the vertices
of~$G_i$. The labelling is stored in vector $g_i$.
We choose~$g_i[v]$ for each~$v\in V_i$ in such
a way that Eq.~(\ref{eq:mphfi}) is an MPHF for bucket $i$.
To obtain the value of each entry of $g_i$, we first
run a breadth-first search on the 2-\textit{core} of $G_i$, i.e., the maximal subgraph
of~$G_i$ with minimum degree at least~$2$ (see, e.g., \cite{b01,jlr00,pw04}), and then
a depth-first search on the acyclic part of $G_i$ (see \cite{bkz05} for details).
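
To make the notion of 2-core concrete, the Python sketch below separates the vertices of a simple graph into its 2-core and the remaining acyclic part by repeatedly removing vertices of degree less than~$2$. It covers only this decomposition; the labelling itself follows~\cite{bkz05} and is not reproduced here. The function name \texttt{two\_core} and the edge-list representation are assumptions made for the example.

```python
from collections import defaultdict, deque


def two_core(num_vertices, edges):
    """Return the vertex set of the 2-core of a simple undirected graph."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    degree = {v: len(adj[v]) for v in range(num_vertices)}
    removed = set()
    # Start from every vertex that cannot belong to the 2-core.
    queue = deque(v for v in range(num_vertices) if degree[v] < 2)
    while queue:
        v = queue.popleft()
        if v in removed:
            continue
        removed.add(v)
        for u in adj[v]:                     # removing v lowers its neighbours' degree
            if u not in removed:
                degree[u] -= 1
                if degree[u] < 2:
                    queue.append(u)
    return set(range(num_vertices)) - removed


# A triangle 0-1-2 with a tail 2-3-4 and an isolated vertex 5.
core = two_core(6, [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4)])
```

In this example only the triangle survives the peeling, so the breadth-first search would run on vertices $0$, $1$ and $2$, while the tail and the isolated vertex belong to the acyclic part.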