\enlargethispage{2\baselineskip}
\section{Analytical results}
\label{sec:analytcal-results}
\vspace{-1mm}

The purpose of this section is fourfold. First, we show that our algorithm runs in expected time $O(n)$. Second, we present the main memory requirements for constructing the MPHF. Third, we discuss the cost of evaluating the resulting MPHF. Fourth, we present the space required to store the resulting MPHF.

\vspace{-2mm}
\subsection{The linear time complexity}
\label{sec:linearcomplexity}

First, we show that the partitioning step presented in Figure~\ref{fig:partitioningstep} runs in $O(n)$ time. Each iteration of the {\bf for} loop in statement~1 runs in $O(|B_j|)$ time, $1 \leq j \leq N$, where $|B_j|$ is the number of keys that fit in block $B_j$ of size $\mu$. This is because statement 1.1 just reads $|B_j|$ keys from disk, statement 1.2 runs a bucket-sort-like algorithm that is well known to be linear in the number of keys it sorts (i.e., $|B_j|$ keys), and statement 1.3 just dumps $|B_j|$ keys to disk into File $j$. Thus, the {\bf for} loop runs in $O(\sum_{j=1}^{N}|B_j|)$ time. As $\sum_{j=1}^{N}|B_j|=n$, the partitioning step runs in $O(n)$ time.

Second, we show that the searching step presented in Figure~\ref{fig:searchingstep} also runs in $O(n)$ time. The heap construction in statement 1 runs in $O(N)$ time, for $N \ll n$. We have assumed that insertions and deletions in the heap cost $O(1)$ because $N$ is typically much smaller than $n$ (see \cite[Section 6.4]{bkz06t} for details). Statement 2 runs in $O(\sum_{i=0}^{\lceil n/b \rceil - 1} \mathit{size}[i])$ time (remember that $\mathit{size}[i]$ stores the number of keys in bucket $i$). As $\sum_{i=0}^{\lceil n/b \rceil - 1} \mathit{size}[i] = n$, if statements 2.1, 2.2 and 2.3 run in $O(\mathit{size}[i])$ time, then statement 2 runs in $O(n)$ time.

Statement 2.1 reads the $O(\mathit{size}[i])$ keys of bucket $i$ and is detailed in Figure~\ref{fig:readingbucket}. As we are assuming that each read or write on disk costs $O(1)$ and each heap operation also costs $O(1)$, statement~2.1 takes $O(\mathit{size}[i])$ time. However, in the worst case the keys of bucket $i$ are distributed among $BS_{max}$ files on disk (recall that $BS_{max}$ is the maximum number of keys found in any bucket). Therefore, the critical step in reading a bucket is statement~1.3 of Figure~\ref{fig:readingbucket}, where the first read operation on File $j$ may trigger a seek. To amortize the number of seeks performed we use a buffering technique~\cite{k73}. We create a buffer $j$ of size \textbaht$\: = \mu/N$ for each file $j$, where $1\leq j \leq N$ (recall that $\mu$ is the size in bytes of the a priori reserved internal memory area). Every time a read operation is requested to file $j$ and the data is not found in the $j$th~buffer, \textbaht~bytes are read from file $j$ into buffer $j$. Hence, under the pessimistic assumption that one seek happens every time buffer $j$ is refilled, the number of seeks performed in the worst case is $\beta/$\textbaht~(remember that $\beta$ is the size in bytes of $S$). Thus, the number of seeks performed in the worst case is $64n/$\textbaht, since each URL is 64 bytes long on average. Therefore, the number of seeks is linear in $n$ and amortized by \textbaht.
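To make the buffering concrete, the sketch below shows one possible shape of such a buffered reader. It is an illustration only, not the code used in our experiments: it assumes fixed-length keys and one buffer of \textbaht~bytes per file, so that in the worst case one seek is paid per buffer refill rather than per key read.
\begin{verbatim}
# Illustrative sketch of the buffering technique (not the experimental code):
# each of the N files gets a buffer of Bht = mu/N bytes, so in the worst case
# one seek is charged per buffer refill rather than per key read.
class BufferedFile:
    def __init__(self, path, buffer_size):
        self.f = open(path, "rb")        # File j produced by the partitioning step
        self.buffer_size = buffer_size   # Bht = mu / N bytes
        self.data = b""
        self.pos = 0

    def read_key(self, key_len):
        """Return the next key (key_len bytes; fixed length for simplicity)."""
        if self.pos + key_len > len(self.data):
            # Buffer exhausted: refill with Bht bytes (one seek in the worst case).
            self.data = self.data[self.pos:] + self.f.read(self.buffer_size)
            self.pos = 0
            if len(self.data) < key_len:
                return None              # File j is exhausted.
        key = self.data[self.pos:self.pos + key_len]
        self.pos += key_len
        return key

# One buffer per file, e.g.:
# readers = [BufferedFile("File%d" % j, mu // N) for j in range(1, N + 1)]
\end{verbatim}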
It is important to emphasize two things. First, the operating system uses techniques to diminish the number of seeks and the average seek time; this makes the amortization factor greater than \textbaht~in practice. Second, almost all main memory is available to be used as file buffers, because just a small vector of $\lceil n/b\rceil$ one-byte entries must be maintained in main memory, as we show in Section~\ref{sec:memconstruction}.

Statement 2.2 runs our internal memory based algorithm in order to generate a MPHF for each bucket. That algorithm is linear, as we showed in~\cite{bkz05}. As it is applied to buckets with $\mathit{size}[i]$ keys, statement~2.2 takes $O(\mathit{size}[i])$ time.

Statement 2.3 has time complexity $O(\mathit{size}[i])$ because it writes to disk the description of each generated MPHF, and each description is stored in $c \times \mathit{size}[i] + O(1)$ bytes, where $c\in[0.93,1.15]$.

In conclusion, our algorithm takes $O(n)$ time because both the partitioning and the searching steps run in $O(n)$ time. An experimental validation of the above proof and a performance comparison with our internal memory based algorithm~\cite{bkz05} are not included here due to space restrictions, but can be found in~\cite{bkz06t} and also in the appendix.

\vspace{-1mm}
\enlargethispage{2\baselineskip}
\subsection{Space used for constructing a MPHF}
\label{sec:memconstruction}

The vector {\it size} is kept in main memory all the time. It has $\lceil n/b \rceil$ one-byte entries and stores the number of keys in each bucket, values that are less than or equal to 256. For example, for a set of 1 billion keys and $b=175$, the vector {\it size} needs $5.45$ megabytes of main memory.

We also need an internal memory area of size $\mu$ bytes to be used in the partitioning step and in the searching step. The size $\mu$ is fixed a priori and depends only on the amount of internal memory available to run the algorithm (i.e., it does not depend on the size $n$ of the problem).

The additional space required in the searching step is constant, since the problem has been broken down into several small problems (buckets of at most 256 keys) and the heap size is much smaller than $n$ ($N \ll n$). For example, for a set of 1 billion keys and an internal area of~$\mu = 250$ megabytes, the number of files is $N = 248$.
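For concreteness, combining this running example with the seek analysis of Section~\ref{sec:linearcomplexity} gives the following figures (the arithmetic below only instantiates the numbers already quoted, with rough rounding to whole megabytes):
\begin{eqnarray*}
\mbox{construction memory} & \approx & \mu + \lceil n/b \rceil \mbox{ bytes} \;\approx\; 250 + 5.45 \;\approx\; 255 \mbox{ megabytes},\\
\mbox{\textbaht} & = & \mu/N \;\approx\; 250/248 \;\approx\; 1 \mbox{ megabyte},\\
\mbox{worst-case seeks} & \approx & 64n/\mbox{\textbaht} \;\approx\; 64\times 10^{9}/10^{6} \;\approx\; 6\times 10^{4}.
\end{eqnarray*}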
\vspace{-1mm}
\subsection{Evaluation cost of the MPHF}

Now we consider the amount of CPU time required by the resulting MPHF at retrieval time. For each key, the MPHF requires the computation of three universal hash functions and three memory accesses (see Eqs.~(\ref{eq:mphf}), (\ref{eq:bucketindex}) and (\ref{eq:mphfi})). This is not optimal: Pagh~\cite{p99} showed that any MPHF requires at least the computation of two universal hash functions and one memory access.

\subsection{Description size of the MPHF}

The number of bits required to store the MPHF generated by the algorithm is computed by Eq.~(\ref{eq:newmphfbits}). We need to store each $g_i$ vector presented in Eq.~(\ref{eq:mphfi}), where $0\leq i < \lceil n/b \rceil$. As each bucket has at most 256 keys, each entry in a $g_i$ vector has 8~bits. Each $g_i$ vector has $c \times \mathit{size}[i]$ entries (recall $c\in[0.93, 1.15]$). Summing the number of entries over the $\lceil n/b \rceil$ $g_i$ vectors gives $c\sum_{i=0}^{\lceil n/b \rceil -1} \mathit{size}[i]=cn$ entries. We also need to store $3 \lceil n/b \rceil$ integer numbers of $\log_2n$ bits, referring respectively to the {\it offset} vector and the two random seeds of $h_{1i}$ and $h_{2i}$. In addition, we need to store the $\lceil n/b \rceil$ 8-bit entries of the vector {\it size}. Therefore,
\begin{eqnarray}\label{eq:newmphfbits}
\mathrm{Required\: Space} = 8cn + \frac{n}{b}\left( 3\log_2n +8\right) \: \mathrm{bits}.
\end{eqnarray}
Considering $c=0.93$ and $b=175$, the number of bits per key needed to store the description of the resulting MPHF for a set of 1~billion keys is $8.1$. If we set $b=128$, then the bits per key ratio increases to $8.3$. Theoretically, the number of bits required to store the MPHF in Eq.~(\ref{eq:newmphfbits}) is $O(n\log n)$ as~$n\to\infty$. However, for sets of up to $2^{b/3}$ keys the number of bits per key stays below 9~bits (note that $2^{b/3}>2^{58}>10^{17}$ for $b=175$). Thus, in practice the resulting function is stored in $O(n)$ bits.
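As an aside, Eq.~(\ref{eq:newmphfbits}) is easy to evaluate directly. The small helper below is only a sketch (the function name and default parameters are ours) and simply restates the formula:
\begin{verbatim}
import math

def mphf_storage_bits(n, b=175, c=0.93):
    """Eq. (4): 8*c*n + (n/b) * (3*log2(n) + 8) bits for the MPHF description."""
    return 8 * c * n + (n / b) * (3 * math.log2(n) + 8)

# About 8 bits per key for n = 10**9 and b = 175, consistent with the
# figures quoted above; smaller b increases the per-key cost.
bits_per_key = mphf_storage_bits(10**9) / 10**9
\end{verbatim}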