%% Nivio: 23/jan/06 29/jan/06
% Time-stamp: <Monday 30 Jan 2006 03:56:47am EDT yoshi@ime.usp.br>
\enlargethispage{2\baselineskip}
\section{Analytical results}
\label{sec:analytcal-results}
\vspace{-1mm}
The purpose of this section is fourfold.
First, we show that our algorithm runs in expected time $O(n)$.
Second, we present the main memory requirements for constructing the MPHF.
Third, we discuss the cost of evaluating the resulting MPHF.
Fourth, we present the space required to store the resulting MPHF.
\vspace{-2mm}
\subsection{The linear time complexity}
\label{sec:linearcomplexity}
First, we show that the partitioning step presented in
Figure~\ref{fig:partitioningstep} runs in $O(n)$ time.
Each iteration of the {\bf for} loop in statement~1
runs in $O(|B_j|)$ time, $1 \leq j \leq N$, where $|B_j|$ is the
number of keys
that fit in block $B_j$ of size $\mu$. This is because statement 1.1 simply reads
$|B_j|$ keys from disk, statement 1.2 runs a bucket-sort-like algorithm,
which is well known to be linear in the number of keys it sorts (i.e., $|B_j|$ keys),
and statement 1.3 simply writes the $|B_j|$ keys to File $j$ on disk.
Thus, the {\bf for} loop runs in $O(\sum_{j=1}^{N}|B_j|)$ time.
Since $\sum_{j=1}^{N}|B_j|=n$, the partitioning step runs in $O(n)$ time.
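For concreteness, the following sketch mirrors the structure of the partitioning step; it is only an illustration under stated assumptions, not the actual implementation, and the names \texttt{h0}, \texttt{blocks} and \texttt{open\_file} are ours.
\begin{verbatim}
# A minimal sketch of the partitioning step.  Assumptions: `blocks` yields the
# keys block by block (each block fits in the internal memory area of mu
# bytes), `h0(key)` returns the bucket index of a key, and `open_file(j)`
# opens File j for writing.
def bucket_sort(keys, h0, num_buckets):
    # Statement 1.2: group keys by bucket index in O(|B_j| + num_buckets) time.
    groups = [[] for _ in range(num_buckets)]
    for key in keys:
        groups[h0(key)].append(key)
    return [key for group in groups for key in group]

def partition(blocks, h0, num_buckets, open_file):
    for j, block in enumerate(blocks, start=1):       # statement 1
        keys = list(block)                            # 1.1: read |B_j| keys
        ordered = bucket_sort(keys, h0, num_buckets)  # 1.2: bucket sort
        with open_file(j) as out:                     # 1.3: dump |B_j| keys
            for key in ordered:                       #      to File j on disk
                out.write(key + b"\n")
\end{verbatim}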
Second, we show that the searching step presented in
Figure~\ref{fig:searchingstep} also runs in $O(n)$ time.
The heap construction in statement 1 runs in $O(N)$ time, where $N \ll n$.
We have assumed that insertions and deletions in the heap cost $O(1)$ because
$N$ is typically much smaller than $n$ (see \cite[Section 6.4]{bkz06t} for details).
Statement 2 runs in $O(\sum_{i=0}^{\lceil n/b \rceil - 1} \mathit{size}[i])$ time
(remember that $\mathit{size}[i]$ stores the number of keys in bucket $i$).
Since $\sum_{i=0}^{\lceil n/b \rceil - 1} \mathit{size}[i] = n$, statement 2
runs in $O(n)$ time provided that statements 2.1, 2.2 and 2.3 each run in
$O(\mathit{size}[i])$ time, which we show next.
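The overall structure of statement 2 can be summarized by the sketch below, in which the helper names are illustrative and the three calls correspond to statements 2.1, 2.2 and 2.3.
\begin{verbatim}
# A schematic view of statement 2 of the searching step (illustrative names).
# `read_bucket(i)` performs the heap-driven multiway read of statement 2.1,
# `build_mphf(keys)` runs the internal-memory algorithm of statement 2.2, and
# `write_description(i, mphf)` dumps its description to disk (statement 2.3).
def searching_step(num_buckets, size, read_bucket, build_mphf,
                   write_description):
    for i in range(num_buckets):        # statement 2
        keys = read_bucket(i)           # 2.1: O(size[i]) time
        assert len(keys) == size[i]
        mphf = build_mphf(keys)         # 2.2: linear-time internal algorithm
        write_description(i, mphf)      # 2.3: c*size[i] + O(1) bytes written
\end{verbatim}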
Statement 2.1 reads the $\mathit{size}[i]$ keys of bucket $i$
and is detailed in Figure~\ref{fig:readingbucket}.
Since we assume that each read or write on disk costs $O(1)$ and
each heap operation also costs $O(1)$, statement~2.1
takes $O(\mathit{size}[i])$ time.
However, in the worst case the keys of bucket $i$ are spread over
as many as~$BS_{max}$ files on disk
(recall that $BS_{max}$ is the maximum number of keys found in any bucket).
Therefore, the critical part of reading a bucket is statement~1.3 of
Figure~\ref{fig:readingbucket},
where the first read operation on File $j$ may trigger a seek.
In order to amortize the number of seeks performed we use a buffering technique~\cite{k73}.
We create a buffer $j$ of size \textbaht$\: = \mu/N$ for each file $j$,
where $1\leq j \leq N$
(recall that $\mu$ is the size in bytes of an a priori reserved internal memory area).
Every time a read operation is requested on File $j$ and the data is not found
in the $j$th~buffer, \textbaht~bytes are read from File $j$ into buffer $j$.
Hence, the number of seeks performed in the worst case is
$\beta/$\textbaht~(remember that $\beta$ is the size in bytes of $S$),
under the pessimistic assumption that one seek happens every time
buffer $j$ is refilled.
Since each URL is 64 bytes long on average, the number of seeks performed
in the worst case is $64n/$\textbaht. Therefore, the number of seeks is linear
in $n$ and amortized by \textbaht.
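For instance, for the setting of Section~\ref{sec:memconstruction} ($n=10^9$ keys, $\mu = 250$ megabytes and $N=248$ files), we would have \textbaht$\:\approx 1$~megabyte and hence roughly $64\cdot 10^{9}/10^{6} = 64{,}000$ seeks in the worst case.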
It is important to emphasize two things.
First, the operating system uses techniques
to reduce the number of seeks and the average seek time,
which makes the amortization factor greater than \textbaht~in practice.
Second, almost all of the main memory is available for file buffers,
because only a small vector
of $\lceil n/b\rceil$ one-byte entries must be kept in main memory,
as we show in Section~\ref{sec:memconstruction}.
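The buffering scheme can be sketched as follows; the file interface, the newline-terminated key format and the name \texttt{buf\_size} are illustrative assumptions, while the buffer size itself is \textbaht$\:=\mu/N$ as in the text.
\begin{verbatim}
# A minimal sketch of the per-file read buffers used to amortize seeks.
# Assumption: each File j stores newline-terminated keys and buf_size is
# the per-file buffer size (mu // N bytes).
class BufferedFile:
    def __init__(self, path, buf_size):
        self.f = open(path, "rb")
        self.buf_size = buf_size     # bytes fetched per refill
        self.buf = b""
        self.pos = 0

    def read_key(self):
        # Refill the buffer only when it is exhausted; in the worst case one
        # seek is charged per refill, i.e. beta / buf_size seeks overall.
        while b"\n" not in self.buf[self.pos:]:
            chunk = self.f.read(self.buf_size)
            if not chunk:
                break
            self.buf = self.buf[self.pos:] + chunk
            self.pos = 0
        end = self.buf.find(b"\n", self.pos)
        if end == -1:                       # end of file reached
            key, self.pos = self.buf[self.pos:], len(self.buf)
            return key or None
        key = self.buf[self.pos:end]
        self.pos = end + 1
        return key
\end{verbatim}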
Statement 2.2 runs our internal-memory-based algorithm to generate a MPHF for each bucket.
That algorithm is linear, as we showed in~\cite{bkz05}. As it is applied to buckets with {\it size}$[i]$ keys, statement~2.2 takes $O(\mathit{size}[i])$ time.
Statement 2.3 has time complexity $O(\mathit{size}[i])$ because it writes to disk
the description of each generated MPHF and each description is stored in
$c \times \mathit{size}[i] + O(1)$ bytes, where $c\in[0.93,1.15]$.
In conclusion, our algorithm takes $O(n)$ time because both the partitioning and
the searching steps run in $O(n)$ time.
An experimental validation of the above proof and a performance comparison with
our internal-memory-based algorithm~\cite{bkz05} were not included here due to
space restrictions, but can be found in~\cite{bkz06t} and also in the appendix.
\vspace{-1mm}
\enlargethispage{2\baselineskip}
\subsection{Space used for constructing a MPHF}
\label{sec:memconstruction}
The vector {\it size} is kept in main memory
at all times.
It has $\lceil n/b \rceil$ one-byte entries and stores the number of keys in
each bucket; these values are at most 256.
For example, for a set of 1 billion keys and $b=175$, the vector {\it size} needs
$5.45$ megabytes of main memory.
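This figure follows directly from the number of buckets:
\[
\left\lceil \frac{n}{b} \right\rceil = \left\lceil \frac{10^9}{175} \right\rceil
= 5{,}714{,}286 \mbox{ one-byte entries} \approx 5.45 \mbox{ megabytes}.
\]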
We need an internal memory area of size $\mu$ bytes to be used in
the partitioning step and in the searching step.
The size $\mu$ is fixed a priori and depends only on the amount
of internal memory available to run the algorithm
(i.e., it does not depend on the size $n$ of the problem).
The additional space required in the searching step
is constant, since the problem has been broken down
into several small problems (buckets of at most 256 keys) and
the heap size is much smaller than $n$ ($N \ll n$).
For example, for a set of 1 billion keys and an internal area of~$\mu = 250$ megabytes,
the number of files is $N = 248$.
\vspace{-1mm}
\subsection{Evaluation cost of the MPHF}
Now we consider the amount of CPU time
required by the resulting MPHF at retrieval time.
For each key, the MPHF requires the computation of three
universal hash functions and three memory accesses
(see Eqs.~(\ref{eq:mphf}), (\ref{eq:bucketindex}) and (\ref{eq:mphfi})).
This is not optimal. Pagh~\cite{p99} showed that any MPHF requires
at least the computation of two universal hash functions and one memory
access.
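These costs can be seen in the following schematic evaluation; the names are illustrative and the exact per-bucket expression is the one of Eq.~(\ref{eq:mphfi}).
\begin{verbatim}
# A schematic view of MPHF evaluation (illustrative names).  One universal
# hash (h0) selects the bucket and one access reads offset[i]; the per-bucket
# function contributes the remaining two universal hashes (h1i, h2i) and two
# accesses to the vector g_i.
def evaluate(key, h0, offset, bucket_mphf):
    i = h0(key)                       # bucket index (Eq. bucketindex)
    return offset[i] + bucket_mphf(i, key)
\end{verbatim}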
\subsection{Description size of the MPHF}
The number of bits required to store the MPHF generated by the algorithm
is computed by Eq.~(\ref{eq:newmphfbits}).
We need to store each $g_i$ vector presented in Eq.~(\ref{eq:mphfi}), where
$0\leq i < \lceil n/b \rceil$. As each bucket has at most 256 keys, each
entry in a $g_i$ vector has 8~bits. In each $g_i$ vector there are
$c \times \mathit{size}[i]$ entries (recall $c\in[0.93, 1.15]$).
Summing the entries over all $\lceil n/b \rceil$ $g_i$ vectors gives
$c\sum_{i=0}^{\lceil n/b \rceil -1} \mathit{size}[i]=cn$ entries. We also need to
store $3 \lceil n/b \rceil$ integers of
$\log_2n$ bits each, corresponding to the {\it offset} vector and the two random seeds of
$h_{1i}$ and $h_{2i}$. In addition, we need to store $\lceil n/b \rceil$ 8-bit entries of
the vector {\it size}. Therefore,
\begin{eqnarray}\label{eq:newmphfbits}
\mathrm{Required\: Space} = 8cn + \frac{n}{b}\left( 3\log_2n +8\right) \:
\mathrm{bits}.
\end{eqnarray}
Considering $c=0.93$ and $b=175$, the number of bits per key to store
the description of the resulting MPHF for a set of 1~billion keys is $8.1$.
If we set $b=128$, then the bits per key ratio increases to $8.3$.
Theoretically, the number of bits required to store the MPHF in
Eq.~(\ref{eq:newmphfbits})
is $O(n\log n)$ as~$n\to\infty$. However, for sets of up to $2^{b/3}$ keys
the number of bits per key stays below 9 (note that
$2^{b/3}>2^{58}>10^{17}$ for $b=175$).
Thus, in practice the resulting function is stored in $O(n)$ bits.
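For convenience, the small sketch below evaluates Eq.~(\ref{eq:newmphfbits}) as a number of bits per key; the function and its default parameters are only an illustration of the formula.
\begin{verbatim}
import math

# Bits per key according to Eq. (newmphfbits):
#   required space = 8cn + (n/b)(3 log2 n + 8) bits.
def bits_per_key(n, b=175, c=0.93):
    total_bits = 8 * c * n + (n / b) * (3 * math.log2(n) + 8)
    return total_bits / n

print(bits_per_key(10**9))          # roughly 8 bits per key for 10^9 keys
print(bits_per_key(10**9, b=128))   # slightly higher for b = 128
\end{verbatim}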