\enlargethispage{2\baselineskip}

\section{Analytical results}
\label{sec:analytcal-results}

\vspace{-1mm}
The purpose of this section is fourfold.
First, we show that our algorithm runs in expected time $O(n)$.
Second, we present the main memory requirements for constructing the MPHF.
Third, we discuss the cost of evaluating the resulting MPHF.
Fourth, we present the space required to store the resulting MPHF.

\vspace{-2mm}
\subsection{The linear time complexity}
\label{sec:linearcomplexity}
First, we show that the partitioning step presented in
Figure~\ref{fig:partitioningstep} runs in $O(n)$ time.
Each iteration of the {\bf for} loop in statement~1
runs in $O(|B_j|)$ time, $1 \leq j \leq N$, where $|B_j|$ is the
number of keys that fit in block $B_j$ of size $\mu$.
This is because statement 1.1 just reads
$|B_j|$ keys from disk, statement 1.2 runs a bucket-sort-like algorithm
that is well known to be linear in the number of keys it sorts (i.e., $|B_j|$ keys),
and statement 1.3 just dumps the $|B_j|$ keys to disk into File $j$.
Thus, the {\bf for} loop runs in $O(\sum_{j=1}^{N}|B_j|)$ time.
Since $\sum_{j=1}^{N}|B_j|=n$, the partitioning step runs in $O(n)$ time.
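To make the analysis concrete, the sketch below (in Python, with illustrative
names; it is not the implementation evaluated in this paper) mirrors statements
1.1, 1.2 and 1.3: read a block $B_j$ of keys, group the keys by bucket address,
and dump them into File~$j$ in bucket order. The real algorithm uses a linear
bucket sort; the sketch simply groups the keys with a dictionary.
\begin{verbatim}
from collections import defaultdict

def partition(blocks, bucket_of, open_file):
    """blocks yields successive lists of keys (byte strings), each small
    enough to fit in the mu-byte internal area; bucket_of(key) returns the
    bucket address of a key; open_file(j) opens File j for writing."""
    for j, block in enumerate(blocks, start=1):   # statement 1: each block B_j
        groups = defaultdict(list)                # statement 1.2: group the |B_j|
        for key in block:                         #   keys by bucket address
            groups[bucket_of(key)].append(key)
        with open_file(j) as out:                 # statement 1.3: dump the |B_j|
            for bucket in sorted(groups):         #   keys into File j, ordered
                for key in groups[bucket]:        #   by bucket address
                    out.write(key + b"\n")
\end{verbatim}
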
Second, we show that the searching step presented in
Figure~\ref{fig:searchingstep} also runs in $O(n)$ time.
The heap construction in statement 1 runs in $O(N)$ time, for $N \ll n$.
We have assumed that insertions and deletions in the heap cost $O(1)$ because
$N$ is typically much smaller than $n$ (see \cite[Section 6.4]{bkz06t} for details).
Statement 2 runs in $O(\sum_{i=0}^{\lceil n/b \rceil - 1} \mathit{size}[i])$ time
(remember that $\mathit{size}[i]$ stores the number of keys in bucket $i$).
As $\sum_{i=0}^{\lceil n/b \rceil - 1} \mathit{size}[i] = n$, if
statements 2.1, 2.2 and 2.3 run in $O(\mathit{size}[i])$ time, then statement 2
runs in $O(n)$ time.
Statement 2.1 reads $O(\mathit{size}[i])$ keys of bucket $i$
and is detailed in Figure~\ref{fig:readingbucket}.
As we are assuming that each read or write on disk costs $O(1)$ and
each heap operation also costs $O(1)$, statement~2.1
takes $O(\mathit{size}[i])$ time.
However, in the worst case the keys of bucket $i$ are distributed among at
most~$BS_{max}$ files on disk
(recall that $BS_{max}$ is the maximum number of keys found in any bucket).
Therefore, the critical step in reading a bucket is statement~1.3 of
Figure~\ref{fig:readingbucket}, where a seek operation in File $j$
may be performed by the first read operation.
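Figure~\ref{fig:readingbucket} is not reproduced here; the hedged sketch below
only conveys the idea behind statement~2.1. It assumes that each File~$j$ stores
its keys grouped by increasing bucket address (as produced by the partitioning
step) and that the heap $H$ holds, for every file, the bucket address of its
next unread group; the function and variable names are ours.
\begin{verbatim}
import heapq

def read_bucket(i, heap, read_run):
    """Collect the keys of bucket i.  heap holds pairs (b, j) meaning that
    the next unread group of File j belongs to bucket b; read_run(j) returns
    that group of keys together with the bucket address of the following
    group in File j (or None if File j is exhausted)."""
    keys = []
    while heap and heap[0][0] == i:    # every file whose next group is bucket i
        _, j = heapq.heappop(heap)     # heap operations treated as O(1): N << n
        group, next_b = read_run(j)    # sequential read of the group's keys
        keys.extend(group)
        if next_b is not None:
            heapq.heappush(heap, (next_b, j))
    return keys
\end{verbatim}
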
In order to amortize the number of seeks performed we use a buffering technique~\cite{k73}.
We create a buffer $j$ of size \textbaht$\: = \mu/N$ for each file $j$,
where $1\leq j \leq N$
(recall that $\mu$ is the size in bytes of the a priori reserved internal memory area).
Every time a read operation is requested to file $j$ and the data is not found
in the $j$th~buffer, \textbaht~bytes are read from file $j$ to buffer $j$.
Hence, the number of seeks performed in the worst case is given by
$\beta/$\textbaht~(remember that $\beta$ is the size in bytes of $S$).
Here we make the pessimistic assumption that one seek happens every time
buffer $j$ is refilled.
Thus, the number of seeks performed in the worst case is $64n/$\textbaht, since
each URL is 64 bytes long on average. Therefore, the number of seeks is linear in
$n$ and amortized by \textbaht.
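The sketch below (again with names of our own choosing) shows the buffering
just described: each of the $N$ files gets a private buffer of \textbaht~bytes,
and a seek is charged only when that buffer is refilled, so the total number of
refills over all files is at most $\beta/$\textbaht.
\begin{verbatim}
class BufferedFile:
    """Buffered reader for one of the N files; buf_size plays the role of
    the per-file buffer of mu/N bytes."""

    def __init__(self, f, buf_size):
        self.f = f                  # underlying file object for File j
        self.buf_size = buf_size
        self.buf = b""
        self.pos = 0
        self.seeks = 0              # pessimistic count: one seek per refill

    def read(self, nbytes):
        out = b""
        while len(out) < nbytes:
            if self.pos == len(self.buf):              # data not in the buffer:
                self.buf = self.f.read(self.buf_size)  # refill with buf_size bytes
                self.pos = 0
                self.seeks += 1                        # charge one seek
                if not self.buf:
                    break                              # end of file
            take = min(nbytes - len(out), len(self.buf) - self.pos)
            out += self.buf[self.pos:self.pos + take]
            self.pos += take
        return out
\end{verbatim}
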
It is important to emphasize two things.
First, the operating system uses techniques
to diminish the number of seeks and the average seek time.
This makes the amortization factor greater than \textbaht~in practice.
Second, almost all main memory is available to be used as
file buffers because just a small vector
of $\lceil n/b\rceil$ one-byte entries must be maintained in main memory,
as we show in Section~\ref{sec:memconstruction}.

Statement 2.2 runs our internal memory based algorithm in order to generate a MPHF for each bucket.
That algorithm is linear, as we showed in~\cite{bkz05}. As it is applied to buckets with {\it size}$[i]$ keys, statement~2.2 takes $O(\mathit{size}[i])$ time.

Statement 2.3 has time complexity $O(\mathit{size}[i])$ because it writes to disk
the description of each generated MPHF and each description is stored in
$c \times \mathit{size}[i] + O(1)$ bytes, where $c\in[0.93,1.15]$.
In conclusion, our algorithm takes $O(n)$ time because both the partitioning and
the searching steps run in $O(n)$ time.

An experimental validation of the above proof and a performance comparison with
our internal memory based algorithm~\cite{bkz05} were not included here due to
space restrictions but can be found in~\cite{bkz06t} and also in the appendix.

\vspace{-1mm}
\enlargethispage{2\baselineskip}
\subsection{Space used for constructing a MPHF}
\label{sec:memconstruction}
The vector {\it size} is kept in main memory all the time.
It has $\lceil n/b \rceil$ one-byte entries and stores the number of keys in
each bucket; those values are less than or equal to 256.
For example, for a set of 1 billion keys and $b=175$ the vector {\it size} needs
$5.45$ megabytes of main memory.
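For concreteness, the figure just quoted follows directly from the number of
buckets, at one byte per bucket:
\[
\left\lceil 10^9/175 \right\rceil = 5{,}714{,}286 \:\mathrm{bytes}
 = \frac{5{,}714{,}286}{2^{20}}\:\mathrm{megabytes} \approx 5.45 \:\mathrm{megabytes}.
\]
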
We need an internal memory area of size $\mu$ bytes to be used in
the partitioning step and in the searching step.
The size $\mu$ is fixed a priori and depends only on the amount
of internal memory available to run the algorithm
(i.e., it does not depend on the size $n$ of the problem).

The additional space required in the searching step
is constant, because the problem is broken down
into several small problems (buckets of at most 256 keys) and
the heap size is much smaller than $n$ ($N \ll n$).
For example, for a set of 1 billion keys and an internal area of~$\mu = 250$ megabytes,
the number of files is $N = 248$.
\vspace{-1mm}
\subsection{Evaluation cost of the MPHF}

Now we consider the amount of CPU time
required by the resulting MPHF at retrieval time.
The MPHF requires for each key the computation of three
universal hash functions and three memory accesses
(see Eqs.~(\ref{eq:mphf}), (\ref{eq:bucketindex}) and (\ref{eq:mphfi})).
This is not optimal. Pagh~\cite{p99} showed that any MPHF requires
at least the computation of two universal hash functions and one memory access.
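The exact expressions are given by the equations referenced above; the sketch
below is only schematic and assumes the common form in which one hash function
selects the bucket and each per-bucket function combines two probes into the
table $g_i$ with the {\it offset} vector (all names, and the placeholder hash,
are ours).
\begin{verbatim}
def universal_hash(key: bytes, seed: int) -> int:
    h = seed                                   # placeholder for a universal
    for byte in key:                           # hash function; not the one
        h = (h * 31 + byte) & 0xFFFFFFFF       # used in the paper
    return h

def evaluate_mphf(key, h0_seed, seeds, g, offset, size):
    i = universal_hash(key, h0_seed) % len(g)  # hash 1: bucket index
    s1, s2 = seeds[i]                          # seeds of h_{1i} and h_{2i}
    m = len(g[i])                              # about c * size[i] entries
    v = (g[i][universal_hash(key, s1) % m] +   # hash 2 + lookup in g_i
         g[i][universal_hash(key, s2) % m])    # hash 3 + lookup in g_i
    return offset[i] + (v % size[i])           # local value + bucket offset
\end{verbatim}
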
\subsection{Description size of the MPHF}

The number of bits required to store the MPHF generated by the algorithm
is computed by Eq.~(\ref{eq:newmphfbits}).
We need to store each $g_i$ vector presented in Eq.~(\ref{eq:mphfi}), where
$0\leq i < \lceil n/b \rceil$. As each bucket has at most 256 keys, each
entry in a $g_i$ vector has 8~bits. In each $g_i$ vector there are
$c \times \mathit{size}[i]$ entries (recall $c\in[0.93, 1.15]$).
When we sum up the number of entries of $\lceil n/b \rceil$ $g_i$ vectors we have
$c\sum_{i=0}^{\lceil n/b \rceil -1} \mathit{size}[i]=cn$ entries. We also need to
store $3 \lceil n/b \rceil$ integer numbers of
$\log_2n$ bits referring respectively to the {\it offset} vector and the two random seeds of
$h_{1i}$ and $h_{2i}$. In addition, we need to store $\lceil n/b \rceil$ 8-bit entries of
the vector {\it size}. Therefore,
\begin{eqnarray}\label{eq:newmphfbits}
\mathrm{Required\: Space} = 8cn + \frac{n}{b}\left( 3\log_2n +8\right) \:
\mathrm{bits}.
\end{eqnarray}
Considering $c=0.93$ and $b=175$, the number of bits per key to store
the description of the resulting MPHF for a set of 1~billion keys is $8.1$.
If we set $b=128$, then the bits per key ratio increases to $8.3$.
Theoretically, the number of bits required to store the MPHF in
Eq.~(\ref{eq:newmphfbits})
is $O(n\log n)$ as~$n\to\infty$. However, for sets of size up to $2^{b/3}$ keys
the number of bits per key is lower than 9~bits (note that
$2^{b/3}>2^{58}>10^{17}$ for $b=175$).
Thus, in practice the resulting function is stored in $O(n)$ bits.
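As a quick sanity check on Eq.~(\ref{eq:newmphfbits}), the bits-per-key figures
quoted above can be recomputed as follows (a sketch; the constants are the ones
used in the text, and minor differences come from rounding of $\log_2 n$):
\begin{verbatim}
import math

def bits_per_key(n, b, c):
    """Eq. (eq:newmphfbits) divided by n: 8c bits for the g_i entries plus,
    per bucket of b keys, three log2(n)-bit integers and one 8-bit size
    entry."""
    return 8 * c + (3 * math.log2(n) + 8) / b

# bits_per_key(10**9, 175, 0.93) is roughly 8 bits per key, and b = 128
# increases the per-bucket overhead and hence the ratio slightly.
\end{verbatim}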