paper for vldb07 added

2006-08-11 17:32:31 +00:00
parent 00c049787a
commit 80549b6ca6
28 changed files with 4517 additions and 0 deletions
--- a/vldb07/searching.tex
+++ b/vldb07/searching.tex
@@ -0,0 +1,155 @@
+%% Nivio: 22/jan/06
+% Time-stamp: <Monday 30 Jan 2006 03:57:35am EDT yoshi@ime.usp.br>
+\vspace{-7mm}
+\subsection{Searching step}
+\label{sec:searching}
+
+\enlargethispage{2\baselineskip}
+The searching step is responsible for generating a MPHF for each 
+bucket.
+Figure~\ref{fig:searchingstep} presents the searching step algorithm.
+\vspace{-2mm}
+\begin{figure}[h]
+%\centering
+\hrule 
+\hrule 
+\vspace{2mm}
+\begin{tabbing}
+aa\=type booleanx \==  (false, true); \kill
+\> $\blacktriangleright$ Let $H$ be a minimum heap of size $N$, where the \\
+\> ~~ order relation in $H$ is given by Eq.~(\ref{eq:bucketindex}), that is, the\\
+\> ~~ remove operation removes the item with smallest $i$\\ 
+\> $1.$ {\bf for} $j = 1$ {\bf to} $N$ {\bf do} \{ Heap construction \}\\
+\> ~~ $1.1$ Read key $k$ from File $j$ on disk\\
+\> ~~ $1.2$ Insert $(i, j, k)$ in $H$ \\
+\> $2.$ {\bf for} $i = 0$ {\bf to} $\lceil n/b \rceil - 1$ {\bf do} \\
+\> ~~ $2.1$ Read bucket $i$ from disk driven by heap $H$ \\
+\> ~~ $2.2$ Generate a MPHF for bucket $i$ \\
+\> ~~ $2.3$ Write the description of MPHF$_i$ to the disk 
+\end{tabbing}
+\vspace{-1mm}
+\hrule 
+\hrule 
+\caption{Searching step}
+\label{fig:searchingstep}
+\vspace{-4mm}
+\end{figure}
+
+Statement 1 of Figure~\ref{fig:searchingstep} inserts one key from each file
+into a minimum heap $H$ of size $N$.
+The entries of $H$ are ordered by the bucket address $i$ given by
+Eq.~(\ref{eq:bucketindex}).
+
+%\enlargethispage{-\baselineskip}
+Statement 2 has two important steps.
+In statement 2.1, a bucket is read from disk,
+as described below.
+%in Section~\ref{sec:readingbucket}. 
+In statement 2.2, a MPHF is generated for bucket $i$, as described 
+later in this section.
+%in Section~\ref{sec:mphfbucket}.
+The description of MPHF$_i$ is a vector $g_i$ of 8-bit integers.
+Finally, statement 2.3 writes the description $g_i$ of MPHF$_i$ to disk.
+
+\vspace{-3mm}
+\subsubsection{Reading a bucket from disk.} 
+\label{sec:readingbucket}
+
+In this section we present the refinement of statement 2.1 of
+Figure~\ref{fig:searchingstep}.
+The algorithm to read bucket $i$ from disk is presented 
+in Figure~\ref{fig:readingbucket}.
+
+\begin{figure}[h]
+\hrule 
+\hrule 
+\vspace{2mm}
+\begin{tabbing}
+aa\=type booleanx \==  (false, true); \kill
+\> $1.$ {\bf while} bucket $i$ is not full {\bf do} \\
+\> ~~ $1.1$ Remove $(i, j, k)$ from $H$\\
+\> ~~ $1.2$ Insert $k$ into bucket $i$ \\
+\> ~~ $1.3$ Read sequentially all keys $k$ from File $j$ that have \\
+\> ~~~~~~~ the same $i$ and insert them into bucket $i$ \\
+\> ~~ $1.4$ Insert the triple $(i, j, x)$ in $H$, where $x$ is the first \\
+\> ~~~~~~~ key read from File $j$ that does not have the \\ 
+\> ~~~~~~~ same bucket index $i$
+\end{tabbing}
+\hrule 
+\hrule 
+\vspace{-1.0mm}
+\caption{Reading a bucket}
+\vspace{-4.0mm}
+\label{fig:readingbucket}
+\end{figure}
+
+Bucket $i$ is distributed among many files, and the heap $H$ is used to drive a
+multiway merge operation.
+In Figure~\ref{fig:readingbucket}, statement 1.1 extracts and removes the triple 
+$(i, j, k)$ from $H$, where $i$ is the minimum bucket index in $H$.
+Statement 1.2 inserts key $k$ into bucket $i$.
+Notice that the $k$ in the triple $(i, j, k)$ is in fact a pointer to
+the first byte of the key, which is kept in contiguous positions of an array of characters
+(this array, which contains the keys, is initialized during the heap construction
+in statement 1 of Figure~\ref{fig:searchingstep}).
+Statement 1.3 performs a seek operation in File $j$ on disk for the first 
+read operation, then reads sequentially all keys $k$ that have the same $i$ 
+%(obtained from Eq.~(\ref{eq:bucketindex})) 
+and inserts them all into bucket $i$.
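+Finally, statement 1.4 inserts into $H$ the triple $(i, j, x)$,  
+where $x$ is the first key read from File $j$ (in statement 1.3) 
+that does not have the same bucket address as the previous keys
+(the first component of this new triple is the bucket address of $x$, not that of the bucket being read).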
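+
+As an illustration, consider a hypothetical instance (not taken from our experiments)
+with $N = 2$ files, in which the keys of each file are assumed to be stored in
+nondecreasing order of bucket address.
+Suppose the bucket addresses of the keys in File~1 are $0, 0, 2$, those of the keys
+in File~2 are $0, 1$, and $\mathit{size}[0] = 3$.
+After the heap construction, $H$ holds the triples $(0, 1, k_1)$ and $(0, 2, k_2)$,
+where $k_1$ and $k_2$ are the first keys of Files~1 and~2.
+To read bucket~$0$, statement 1.1 removes $(0, 1, k_1)$, statements 1.2 and 1.3 insert
+the two keys of File~1 with address~$0$ into the bucket, and statement 1.4 inserts
+$(2, 1, x)$ into $H$, where $x$ is the third key of File~1.
+The next iteration removes $(0, 2, k_2)$, inserts $k_2$ into the bucket, and inserts
+$(1, 2, y)$ into $H$, where $y$ is the second key of File~2.
+Bucket~$0$ now holds three keys and is full, and the minimum address in $H$ is~$1$,
+which is the next bucket to be read.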
+
+The number of seek operations on disk performed in statement 1.3 is discussed
+in Section~\ref{sec:linearcomplexity}, 
+where we present a buffering technique that reduces 
+the time spent on seeks.
+
+\vspace{-2mm}
+\enlargethispage{2\baselineskip}
+\subsubsection{Generating a MPHF for each bucket.} \label{sec:mphfbucket}
+
+To the best of our knowledge, the algorithm we designed in
+our previous work~\cite{bkz05} is the fastest published algorithm for
+constructing MPHFs.
+We therefore use that algorithm as a building block for the 
+algorithm presented here.
+
+%\enlargethispage{-\baselineskip}
+Our previous algorithm is a three-step internal-memory algorithm
+that produces a MPHF based on random graphs.
+For a set of $n$ keys, the algorithm outputs the resulting MPHF in expected time $O(n)$.
+For a given bucket $i$, $0 \leq i < \lceil n/b \rceil$, the corresponding MPHF$_i$ 
+has the following form:
+\begin{eqnarray}
+        \mathrm{MPHF}_i(k) &=& g_i[a] + g_i[b] \label{eq:mphfi}
+\end{eqnarray} 
+where $a = h_{i1}(k) \bmod t$, $b = h_{i2}(k) \bmod t$ and
+$t = c\times \mathit{size}[i]$ (here $b$ denotes a vertex and is unrelated to the
+parameter $b$ in $\lceil n/b \rceil$). The functions
+$h_{i1}$ and $h_{i2}$ are based on the same universal hash function proposed by Jenkins~\cite{j97}
+that was used in the partitioning step described in Section~\ref{sec:partitioning-keys}.
+
+To generate the function above, the algorithm builds simple random graphs
+$G_i = (V_i, E_i)$ with~$|V_i|=t=c\times\mathit{size}[i]$ and $|E_i|=\mathit{size}[i]$, with  $c \in [0.93, 1.15]$.
+To generate a simple random graph with high 
+probability\footnote{We use the term `with high probability'
+to mean `with probability tending to~$1$ as~$n\to\infty$'.}, two vertices $a$ and $b$ are
+computed for each key $k$ in bucket $i$.
+Thus, each bucket $i$ has a corresponding graph~$G_i=(V_i,E_i)$, where $V_i=\{0,1,
+\ldots,t-1\}$ and $E_i=\big\{\{a,b\}:k \in \mathrm{bucket}\: i\big\}$.
+In order to get a simple graph, 
+the algorithm repeatedly selects $h_{i1}$ and $h_{i2}$ from a family of universal hash functions
+until the corresponding graph is simple.
+The probability of getting a simple graph is $p=e^{-1/c^2}$.
+For $c=1$, this probability is $p \simeq 0.368$, and the expected number of 
+iterations to obtain a simple graph is~$1/p \simeq 2.72$.
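+
+The same formula can be evaluated at the endpoints of the range of $c$ considered
+above (values rounded to three significant digits):
+\begin{eqnarray*}
+c = 0.93: &\quad& p = e^{-1/0.93^2} \simeq e^{-1.156} \simeq 0.315, \quad 1/p \simeq 3.18, \\
+c = 1.15: &\quad& p = e^{-1/1.15^2} \simeq e^{-0.756} \simeq 0.469, \quad 1/p \simeq 2.13.
+\end{eqnarray*}
+Thus, under this formula, the expected number of iterations stays between
+roughly $2.1$ and $3.2$ over the whole range.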
+
+The construction of MPHF$_i$ ends with a computation of a suitable labelling of the vertices
+of~$G_i$. The labelling is stored into vector $g_i$.
+We choose~$g_i[v]$ for each~$v\in V_i$ in such
+a way that Eq.~(\ref{eq:mphfi}) is a MPHF for bucket $i$.
+To obtain the value of each entry of $g_i$, we first 
+run a breadth-first search on the 2-\textit{core} of $G_i$, i.e., the maximal subgraph 
+of~$G_i$ with minimum degree at least~$2$ (see, e.g., \cite{b01,jlr00,pw04}), and then
+a depth-first search on the acyclic part of $G_i$ (see \cite{bkz05} for details).
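+
+As a small illustration of Eq.~(\ref{eq:mphfi}), consider a hypothetical bucket $i$
+with $\mathit{size}[i] = 4$ and $c = 1$, so that $t = 4$.
+The hash values below are chosen by hand for the sake of the example; they are not
+the output of the functions of~\cite{j97}.
+Suppose the four keys $k_1, \ldots, k_4$ are mapped to the edges $\{0,1\}$, $\{1,2\}$,
+$\{0,2\}$ and $\{1,3\}$, respectively.
+The graph is simple; its 2-core is the triangle on the vertices $0, 1, 2$, and the
+edge $\{1,3\}$ forms the acyclic part.
+One labelling that makes Eq.~(\ref{eq:mphfi}) a MPHF is $g_i = [1, 0, 2, 0]$, since
+\begin{eqnarray*}
+\mathrm{MPHF}_i(k_1) &=& g_i[0] + g_i[1] = 1 + 0 = 1, \\
+\mathrm{MPHF}_i(k_2) &=& g_i[1] + g_i[2] = 0 + 2 = 2, \\
+\mathrm{MPHF}_i(k_3) &=& g_i[0] + g_i[2] = 1 + 2 = 3, \\
+\mathrm{MPHF}_i(k_4) &=& g_i[1] + g_i[3] = 0 + 0 = 0,
+\end{eqnarray*}
+and the four values are distinct and cover $\{0, 1, 2, 3\}$.
+This only checks that a suitable labelling exists for this instance; how the
+labelling is computed in general is described in~\cite{bkz05}.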
+