\vspace{-7mm} \subsection{Searching step} \label{sec:searching} \enlargethispage{2\baselineskip} The searching step is responsible for generating a MPHF for each bucket. Figure~\ref{fig:searchingstep} presents the searching step algorithm. \vspace{-2mm} \begin{figure}[h] \hrule \hrule \vspace{2mm} \begin{tabbing} aa\=type booleanx \== (false, true); \kill \> $\blacktriangleright$ Let $H$ be a minimum heap of size $N$, where the \\ \> ~~ order relation in $H$ is given by Eq.~(\ref{eq:bucketindex}), that is, the\\ \> ~~ remove operation removes the item with smallest $i$\\ \> $1.$ {\bf for} $j = 1$ {\bf to} $N$ {\bf do} \{ Heap construction \}\\ \> ~~ $1.1$ Read key $k$ from File $j$ on disk\\ \> ~~ $1.2$ Insert $(i, j, k)$ in $H$ \\ \> $2.$ {\bf for} $i = 0$ {\bf to} $\lceil n/b \rceil - 1$ {\bf do} \\ \> ~~ $2.1$ Read bucket $i$ from disk driven by heap $H$ \\ \> ~~ $2.2$ Generate a MPHF for bucket $i$ \\ \> ~~ $2.3$ Write the description of MPHF$_i$ to the disk \end{tabbing} \vspace{-1mm} \hrule \hrule \caption{Searching step} \label{fig:searchingstep} \vspace{-4mm} \end{figure} Statement 1 of Figure~\ref{fig:searchingstep} inserts one key from each file into a minimum heap $H$ of size $N$. The order relation in $H$ is determined by the bucket address $i$ computed by Eq.~(\ref{eq:bucketindex}). Statement 2 has two important steps. In statement 2.1, a bucket is read from disk, as described below. In statement 2.2, a MPHF is generated for each bucket $i$, as described in the sequel. The description of MPHF$_i$ is a vector $g_i$ of 8-bit integers. Finally, statement 2.3 writes the description $g_i$ of MPHF$_i$ to disk. \vspace{-3mm} \label{sec:readingbucket} \subsubsection{Reading a bucket from disk.} In this section we present the refinement of statement 2.1 of Figure~\ref{fig:searchingstep}.
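Before refining statement 2.1, the overall loop of Figure~\ref{fig:searchingstep} can be sketched in Python. This is only an illustrative stand-in for the pseudocode: the in-memory "files" (lists of keys sorted by bucket address), the toy byte-sum bucket address, and the choice to merely collect each bucket where the real algorithm would build MPHF$_i$ are all assumptions of the sketch, not part of the paper's implementation.

```python
import heapq

N_BUCKETS = 4

def bucket_index(key):
    # Toy stand-in for the bucket address of Eq. (bucketindex); the paper
    # derives it from a Jenkins-style hash of the key.
    return sum(key.encode()) % N_BUCKETS

def searching_step(files):
    """Sketch of the searching step: a minimum heap of triples (i, j, key)
    drives the merge, so buckets come out in increasing order of i."""
    readers = [iter(f) for f in files]
    heap = []
    # Statement 1: heap construction, one key from each file.
    for j, r in enumerate(readers):
        key = next(r, None)
        if key is not None:
            heapq.heappush(heap, (bucket_index(key), j, key))
    buckets = []
    # Statement 2: read each bucket driven by the heap.
    while heap:
        i = heap[0][0]
        bucket = []
        while heap and heap[0][0] == i:
            _, j, key = heapq.heappop(heap)
            bucket.append(key)
            nxt = next(readers[j], None)
            if nxt is not None:
                heapq.heappush(heap, (bucket_index(nxt), j, nxt))
        buckets.append((i, bucket))  # the real algorithm builds MPHF_i here
    return buckets
```

Because each file is sorted by bucket address, the heap always exposes the smallest outstanding bucket address, which is what lets statement 2 visit the buckets in order.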
The algorithm to read bucket $i$ from disk is presented in Figure~\ref{fig:readingbucket}. \begin{figure}[h] \hrule \hrule \vspace{2mm} \begin{tabbing} aa\=type booleanx \== (false, true); \kill \> $1.$ {\bf while} bucket $i$ is not full {\bf do} \\ \> ~~ $1.1$ Remove $(i, j, k)$ from $H$\\ \> ~~ $1.2$ Insert $k$ into bucket $i$ \\ \> ~~ $1.3$ Read sequentially all keys $k$ from File $j$ that have \\ \> ~~~~~~~ the same $i$ and insert them into bucket $i$ \\ \> ~~ $1.4$ Insert the triple $(i, j, x)$ in $H$, where $x$ is the first \\ \> ~~~~~~~ key read from File $j$ that does not have the \\ \> ~~~~~~~ same bucket index $i$ \end{tabbing} \hrule \hrule \vspace{-1.0mm} \caption{Reading a bucket} \vspace{-4.0mm} \label{fig:readingbucket} \end{figure} Bucket $i$ is distributed among many files, and the heap $H$ is used to drive a multiway merge operation. In Figure~\ref{fig:readingbucket}, statement 1.1 removes the triple $(i, j, k)$ from $H$, where $i$ is a minimum value in $H$. Statement 1.2 inserts key $k$ into bucket $i$. Notice that the $k$ in the triple $(i, j, k)$ is in fact a pointer to the first byte of the key, which is kept in contiguous positions of an array of characters (this array containing the keys is initialized during the heap construction in statement 1 of Figure~\ref{fig:searchingstep}). Statement 1.3 performs a seek operation in File $j$ on disk for the first read operation and then reads sequentially all keys $k$ that have the same bucket address $i$, inserting them all into bucket $i$. Finally, statement 1.4 inserts into $H$ the triple $(i, j, x)$, where $x$ is the first key read from File $j$ (in statement 1.3) that does not have the same bucket address as the previous keys. The number of seek operations on disk performed in statement 1.3 is discussed in Section~\ref{sec:linearcomplexity}, where we present a buffering technique that reduces the time spent on seeks.
\vspace{-2mm} \enlargethispage{2\baselineskip} \subsubsection{Generating a MPHF for each bucket.} \label{sec:mphfbucket} To the best of our knowledge, the algorithm we designed in our previous work~\cite{bkz05} is the fastest published algorithm for constructing MPHFs. That is why we use it as a building block for the algorithm presented here. Our previous algorithm is a three-step internal-memory algorithm that produces a MPHF based on random graphs. For a set of $n$ keys, the algorithm outputs the resulting MPHF in expected time $O(n)$. For a given bucket $i$, $0 \leq i < \lceil n/b \rceil$, the corresponding MPHF$_i$ has the following form: \begin{eqnarray} \mathrm{MPHF}_i(k) &=& g_i[a] + g_i[b] \label{eq:mphfi} \end{eqnarray} where $a = h_{i1}(k) \bmod t$, $b = h_{i2}(k) \bmod t$ and $t = c\times \mathit{size}[i]$. The functions $h_{i1}(k)$ and $h_{i2}(k)$ are both derived from the universal hash function proposed by Jenkins~\cite{j97} that was used in the partitioning step described in Section~\ref{sec:partitioning-keys}. To generate the function above, the algorithm constructs simple random graphs $G_i = (V_i, E_i)$ with~$|V_i|=t=c\times\mathit{size}[i]$ and $|E_i|=\mathit{size}[i]$, where $c \in [0.93, 1.15]$. To generate a simple random graph with high probability\footnote{We use the term `with high probability' to mean `with probability tending to~$1$ as~$n\to\infty$'.}, two vertices $a$ and $b$ are computed for each key $k$ in bucket $i$. Thus, each bucket $i$ has a corresponding graph~$G_i=(V_i,E_i)$, where $V_i=\{0,1, \ldots,t-1\}$ and $E_i=\big\{\{a,b\}:k \in \mathrm{bucket}\: i\big\}$. To obtain a simple graph, the algorithm repeatedly selects $h_{i1}$ and $h_{i2}$ from a family of universal hash functions until the corresponding graph is simple. The probability of getting a simple graph is $p=e^{-1/c^2}$.
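The retry loop that produces a simple graph for one bucket can be sketched as follows. The sketch is illustrative only: Python's salted tuple hashing stands in for selecting $(h_{i1}, h_{i2})$ from a universal family, and the rounding of $t$ is an assumption, not the paper's exact rule.

```python
import random

def build_bucket_graph(bucket, c=1.0):
    """Map each key of the bucket to an edge {a, b} on t = c*size[i]
    vertices, retrying with fresh hash seeds until the graph is simple
    (no self-loops, no repeated edges)."""
    t = max(2, int(round(c * len(bucket))))  # |V_i| = t = c * size[i]
    while True:
        # New pair of hash functions, modelled by fresh random seeds.
        s1, s2 = random.getrandbits(32), random.getrandbits(32)
        edges, seen, simple = [], set(), True
        for k in bucket:
            a = hash((s1, k)) % t
            b = hash((s2, k)) % t
            if a == b or frozenset((a, b)) in seen:
                simple = False  # self-loop or multiple edge: retry
                break
            seen.add(frozenset((a, b)))
            edges.append((a, b))
        if simple:
            return t, edges, (s1, s2)
```

Each trial succeeds with probability tending to $p=e^{-1/c^2}$, so the expected number of iterations of the outer loop is $1/p$, a constant.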
For $c=1$, this probability is $p \simeq 0.368$, and the expected number of iterations to obtain a simple graph is~$1/p \simeq 2.72$. The construction of MPHF$_i$ ends with the computation of a suitable labelling of the vertices of~$G_i$. The labelling is stored in the vector $g_i$: we choose~$g_i[v]$ for each~$v\in V_i$ in such a way that Eq.~(\ref{eq:mphfi}) is a MPHF for bucket $i$. To compute the values of the entries of $g_i$, we first run a breadth-first search on the 2-\textit{core} of $G_i$, i.e., the maximal subgraph of~$G_i$ with minimum degree at least~$2$ (see, e.g., \cite{b01,jlr00,pw04}), and then a depth-first search on the acyclic part of $G_i$ (see \cite{bkz05} for details).
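To illustrate the labelling idea, the depth-first phase for a fully acyclic $G_i$ can be sketched as follows. Two caveats: the sketch evaluates the function as $(g_i[a]+g_i[b]) \bmod m$ with $m$ the number of keys, a CHM-style simplification of Eq.~(\ref{eq:mphfi}), and it omits the breadth-first treatment of the 2-core that~\cite{bkz05} performs before this traversal.

```python
from collections import defaultdict

def label_acyclic(t, edges):
    """Assign g so that (g[a] + g[b]) mod m equals the index of the
    edge {a, b}, where m = len(edges); valid when the graph is a forest."""
    m = len(edges)
    adj = defaultdict(list)
    for idx, (a, b) in enumerate(edges):
        adj[a].append((b, idx))
        adj[b].append((a, idx))
    g = [0] * t
    visited = [False] * t
    for root in range(t):
        if visited[root]:
            continue
        visited[root] = True
        stack = [root]  # iterative depth-first traversal of the forest
        while stack:
            u = stack.pop()
            for v, idx in adj[u]:
                if not visited[v]:
                    visited[v] = True
                    # Fix g[v] so the edge (u, v) evaluates to its index.
                    g[v] = (idx - g[u]) % m
                    stack.append(v)
    return g
```

Since the graph is acyclic, each vertex is fixed exactly once and every edge sum lands on its own index, which is precisely what makes the resulting function minimal and perfect on the bucket.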