%% Nivio: 23/jan/06 29/jan/06
% Time-stamp: <Monday 30 Jan 2006 03:56:47am EDT yoshi@ime.usp.br>

\enlargethispage{2\baselineskip}

\section{Analytical results}
\label{sec:analytcal-results}

\vspace{-1mm}
The purpose of this section is fourfold.
First, we show that our algorithm runs in expected time $O(n)$.
Second, we present the main memory requirements for constructing the MPHF.
Third, we discuss the cost of evaluating the resulting MPHF.
Fourth, we present the space required to store the resulting MPHF.

\vspace{-2mm}
\subsection{The linear time complexity}
\label{sec:linearcomplexity}

First, we show that the partitioning step presented in
Figure~\ref{fig:partitioningstep} runs in $O(n)$ time.
Each iteration of the {\bf for} loop in statement~1
runs in $O(|B_j|)$ time, $1 \leq j \leq N$, where $|B_j|$ is the
number of keys that fit in block $B_j$ of size $\mu$.
This is because statement~1.1 just reads $|B_j|$ keys from disk,
statement~1.2 runs a bucket-sort-like algorithm
that is well known to be linear in the number of keys it sorts
(i.e., $|B_j|$ keys),
and statement~1.3 just writes the $|B_j|$ keys to File~$j$ on disk.
Thus, the {\bf for} loop runs in $O(\sum_{j=1}^{N}|B_j|)$ time.
Since $\sum_{j=1}^{N}|B_j|=n$, the partitioning step runs in $O(n)$ time.
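
To make this accounting concrete, the following minimal Python sketch
mirrors the shape of the partitioning step. It is an illustration under
our own simplifying assumptions: \texttt{hash0} stands in for the bucket
hash function, and the files are returned as in-memory lists instead of
being written to disk as in Figure~\ref{fig:partitioningstep}.
\begin{verbatim}
# Minimal sketch of the partitioning step (illustrative names; the
# "files" are returned as lists instead of being written to disk).
def partition(keys, block_size, n_buckets, hash0):
    files = []                        # File 1 .. File N
    block, block_bytes = [], 0
    for key in keys:                  # one sequential pass: O(n)
        if block and block_bytes + len(key) > block_size:
            files.append(sort_block(block, n_buckets, hash0))
            block, block_bytes = [], 0
        block.append(key)             # statement 1.1: read block B_j
        block_bytes += len(key)
    if block:
        files.append(sort_block(block, n_buckets, hash0))
    return files

def sort_block(block, n_buckets, hash0):
    # Statement 1.2: a bucket sort is linear in |B_j|, since each
    # key is appended once to the list of its bucket.
    buckets = [[] for _ in range(n_buckets)]
    for key in block:
        buckets[hash0(key) % n_buckets].append(key)
    # Statement 1.3: dump the keys, grouped by bucket, to File j.
    return [key for bucket in buckets for key in bucket]
\end{verbatim}
Each key is handled a constant number of times, which is exactly the
$O(\sum_{j=1}^{N}|B_j|) = O(n)$ accounting above.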

Second, we show that the searching step presented in
Figure~\ref{fig:searchingstep} also runs in $O(n)$ time.
The heap construction in statement~1 runs in $O(N)$ time, for $N \ll n$.
We assume that insertions and deletions in the heap cost $O(1)$ because
$N$ is typically much smaller than $n$ (see \cite[Section 6.4]{bkz06t} for details).
Statement~2 runs in $O(\sum_{i=0}^{\lceil n/b \rceil - 1} \mathit{size}[i])$ time
(recall that $\mathit{size}[i]$ stores the number of keys in bucket $i$).
As $\sum_{i=0}^{\lceil n/b \rceil - 1} \mathit{size}[i] = n$, if
statements~2.1, 2.2 and~2.3 run in $O(\mathit{size}[i])$ time, then statement~2
runs in $O(n)$ time.

Statement~2.1 reads $O(\mathit{size}[i])$ keys of bucket $i$
and is detailed in Figure~\ref{fig:readingbucket}.
As we assume that each read or write on disk costs $O(1)$ and
each heap operation also costs $O(1)$, statement~2.1
takes $O(\mathit{size}[i])$ time.
However, in the worst case the keys of bucket $i$ are distributed among at
most~$BS_{max}$ files on disk
(recall that $BS_{max}$ is the maximum number of keys found in any bucket).
Hence, the critical step in reading a bucket is statement~1.3 of
Figure~\ref{fig:readingbucket}, where the first read operation on File~$j$
may trigger a seek.
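
The heap-driven merge performed when reading buckets can be sketched as
runnable Python. Representing the $N$ files as in-memory lists of
(bucket, key) pairs is our simplification of the on-disk layout of
Figure~\ref{fig:readingbucket}.
\begin{verbatim}
import heapq

# Sketch of statement 2.1: the heap holds one entry per file, namely
# (bucket of the file's next unread key, file index, position), so
# its size is N and each operation costs O(log N), treated as O(1)
# because N << n.
def read_buckets(files):
    heap = [(f[0][0], j, 0) for j, f in enumerate(files) if f]
    heapq.heapify(heap)                      # statement 1: O(N)
    while heap:
        i = heap[0][0]                       # next bucket to emit
        bucket = []
        while heap and heap[0][0] == i:      # merge bucket i's pieces
            _, j, pos = heapq.heappop(heap)
            while pos < len(files[j]) and files[j][pos][0] == i:
                bucket.append(files[j][pos][1])
                pos += 1
            if pos < len(files[j]):          # file j still has keys
                heapq.heappush(heap, (files[j][pos][0], j, pos))
        yield i, bucket                      # size[i] keys in total
\end{verbatim}
For example, \texttt{list(read\_buckets([[(0,'a'),(2,'c')],
[(0,'b'),(1,'d')]]))} yields buckets 0, 1 and 2, in this order.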

In order to amortize the number of seeks performed we use a buffering
technique~\cite{k73}.
We create a buffer $j$ of size \textbaht$\: = \mu/N$ for each file $j$,
where $1\leq j \leq N$
(recall that $\mu$ is the size in bytes of an a priori reserved internal
memory area).
Every time a read operation is requested on file $j$ and the data is not found
in the $j$th~buffer, \textbaht~bytes are read from file $j$ into buffer $j$.
Hence, the number of seeks performed in the worst case is given by
$\beta/$\textbaht~(recall that $\beta$ is the size in bytes of $S$),
under the pessimistic assumption that one seek happens every time
buffer $j$ is refilled.
Thus, the number of seeks performed in the worst case is $64n/$\textbaht, since
each URL is 64 bytes long on average. Therefore, the number of seeks is linear
in $n$ and amortized by \textbaht.
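
As an illustration of this technique (not the paper's implementation),
the Python sketch below charges one seek per buffer refill, matching the
pessimistic accounting above; \texttt{buf\_size} plays the role
of~\textbaht.
\begin{verbatim}
import io

# Toy buffered reader: a seek is charged only when the buffer of
# baht = mu/N bytes is refilled, so reading beta bytes costs about
# beta/baht seeks in the worst case.
class BufferedFile:
    def __init__(self, f, buf_size):
        self.f, self.buf_size = f, buf_size
        self.buf, self.pos = b"", 0
        self.refills = 0                   # pessimistic seek counter

    def read(self, nbytes):
        out = bytearray()
        while len(out) < nbytes:
            if self.pos == len(self.buf):  # buffer exhausted
                self.buf = self.f.read(self.buf_size)
                self.pos = 0
                if not self.buf:           # end of file
                    break
                self.refills += 1          # one (assumed) seek
            take = min(nbytes - len(out), len(self.buf) - self.pos)
            out += self.buf[self.pos:self.pos + take]
            self.pos += take
        return bytes(out)

# Reading 100 records of 64 bytes through a 640-byte buffer costs
# 10 refills rather than 100 seeks:
bf = BufferedFile(io.BytesIO(b"x" * 6400), buf_size=640)
while bf.read(64):
    pass
assert bf.refills == 10
\end{verbatim}
For the figures used in Section~\ref{sec:memconstruction} ($\mu = 250$
megabytes and $N = 248$), \textbaht~is close to one megabyte, and with
$\beta = 64n = 64 \times 10^9$ bytes this accounting charges on the
order of $64{,}000$ seeks in total.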

It is important to emphasize two things.
First, the operating system uses techniques
to diminish the number of seeks and the average seek time.
This makes the amortization factor greater than \textbaht~in practice.
Second, almost all main memory is available to be used as
file buffers because just a small vector
of $\lceil n/b\rceil$ one-byte entries must be maintained in main memory,
as we show in Section~\ref{sec:memconstruction}.

Statement~2.2 runs our internal-memory-based algorithm in order to generate
an MPHF for each bucket.
That algorithm is linear, as we showed in~\cite{bkz05}. As it is applied to
buckets with $\mathit{size}[i]$ keys, statement~2.2 takes
$O(\mathit{size}[i])$ time.
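
For intuition on statement~2.2, the toy below builds, for one small
bucket, a function of the form $(g[h_1(x)] + g[h_2(x)]) \bmod n$ that is
a bijection onto $\{0,\ldots,n-1\}$. It uses the simpler acyclic-graph
(CHM-style) construction with $3n$ vertices rather than the algorithm
of~\cite{bkz05}, which also handles cyclic graphs and needs only
$c \times \mathit{size}[i]$ entries per bucket; all names are ours.
\begin{verbatim}
import random

# Toy per-bucket construction: find g such that
# (g[h1(x)] + g[h2(x)]) mod n is distinct for each of the n keys.
# CHM-style: retry random seeds until the graph whose edges are
# (h1(x), h2(x)) is acyclic, then fix g by traversing it.
def build_bucket_mphf(keys, tries=100):
    n, m = len(keys), 3 * len(keys) + 1
    for _ in range(tries):
        s1, s2 = random.random(), random.random()
        edges = [(hash((s1, k)) % m, hash((s2, k)) % m) for k in keys]
        g = assign(edges, m, n)
        if g is not None:
            # Key k maps to:
            # (g[hash((s1, k)) % m] + g[hash((s2, k)) % m]) % n
            return s1, s2, g
    raise RuntimeError("no acyclic graph found")

def assign(edges, m, n):
    adj = [[] for _ in range(m)]
    for e, (u, v) in enumerate(edges):
        if u == v:
            return None                    # self-loop: retry
        adj[u].append((v, e))
        adj[v].append((u, e))
    g, visited, used = [0] * m, [False] * m, [False] * len(edges)
    for root in range(m):
        if visited[root]:
            continue
        visited[root], stack = True, [root]
        while stack:                       # traverse one component
            u = stack.pop()
            for v, e in adj[u]:
                if used[e]:
                    continue
                if visited[v]:
                    return None            # cycle: retry
                used[e], visited[v] = True, True
                g[v] = (e - g[u]) % n      # g[u] + g[v] = e (mod n)
                stack.append(v)
    return g
\end{verbatim}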

Statement~2.3 has time complexity $O(\mathit{size}[i])$ because it writes to disk
the description of each generated MPHF, and each description is stored in
$c \times \mathit{size}[i] + O(1)$ bytes, where $c\in[0.93,1.15]$.
In conclusion, our algorithm takes $O(n)$ time because both the partitioning and
the searching steps run in $O(n)$ time.

An experimental validation of the above proof and a performance comparison with
our internal-memory-based algorithm~\cite{bkz05} were not included here due to
space restrictions, but can be found in~\cite{bkz06t} and also in the appendix.

\vspace{-1mm}
\enlargethispage{2\baselineskip}
\subsection{Space used for constructing an MPHF}
\label{sec:memconstruction}

The vector {\it size} is kept in main memory at all times.
It has $\lceil n/b \rceil$ one-byte entries, each storing the number of keys
in its bucket; those values are less than or equal to 256.
For example, for a set of 1 billion keys and $b=175$, the vector {\it size}
needs $5.45$ megabytes of main memory.
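Indeed, $\lceil 10^9/175 \rceil = 5{,}714{,}286$ one-byte entries occupy
$5{,}714{,}286/2^{20} \approx 5.45$ megabytes.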

We need an internal memory area of size $\mu$ bytes to be used in
the partitioning step and in the searching step.
The size $\mu$ is fixed a priori and depends only on the amount
of internal memory available to run the algorithm
(i.e., it does not depend on the size $n$ of the problem).

The additional space required in the searching step
is constant, since the problem has been broken down
into several small problems (buckets of at most 256 keys) and
the heap size is much smaller than $n$ ($N \ll n$).
For example, for a set of 1 billion keys and an internal area of~$\mu = 250$
megabytes, the number of files is $N = 248$.

\vspace{-1mm}
\subsection{Evaluation cost of the MPHF}

Now we consider the amount of CPU time
required by the resulting MPHF at retrieval time.
The MPHF requires, for each key, the computation of three
universal hash functions and three memory accesses
(see Eqs.~(\ref{eq:mphf}), (\ref{eq:bucketindex}) and~(\ref{eq:mphfi})).
This is not optimal: Pagh~\cite{p99} showed that any MPHF requires
at least the computation of two universal hash functions and one memory
access.
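
In code form, retrieval has the following shape; the hash functions, the
modular reductions, and the explicit lookup of $\mathit{size}[i]$ are our
illustrative assumptions, not the exact function stored by the algorithm.
\begin{verbatim}
# Shape of MPHF evaluation: one hash picks the bucket i, two hashes
# index the bucket's g_i vector (two memory accesses), and offset[i]
# (a third access) turns the in-bucket value into a global one.
# h0, h1, h2, g, offset and size are stand-ins for the stored data.
def evaluate(key, h0, h1, h2, g, offset, size):
    i = h0(key) % len(g)                   # 1st hash: bucket index
    gi = g[i]                              # g_i: c * size[i] entries
    local = (gi[h1(i, key) % len(gi)] +
             gi[h2(i, key) % len(gi)]) % size[i]
    return offset[i] + local
\end{verbatim}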

\subsection{Description size of the MPHF}

The number of bits required to store the MPHF generated by the algorithm
is computed by Eq.~(\ref{eq:newmphfbits}).
We need to store each $g_i$ vector presented in Eq.~(\ref{eq:mphfi}), where
$0\leq i < \lceil n/b \rceil$. As each bucket has at most 256 keys, each
entry in a $g_i$ vector requires 8~bits. Each $g_i$ vector has
$c \times \mathit{size}[i]$ entries (recall $c\in[0.93, 1.15]$).
Summing up the number of entries of all $\lceil n/b \rceil$ $g_i$ vectors,
we have
$c\sum_{i=0}^{\lceil n/b \rceil -1} \mathit{size}[i]=cn$ entries. We also need to
store $3 \lceil n/b \rceil$ integer numbers of
$\log_2n$ bits, referring respectively to the {\it offset} vector and the two
random seeds of
$h_{1i}$ and $h_{2i}$. In addition, we need to store the $\lceil n/b \rceil$
8-bit entries of the vector {\it size}. Therefore,
\begin{eqnarray}\label{eq:newmphfbits}
\mathrm{Required\: Space} = 8cn + \frac{n}{b}\left( 3\log_2n +8\right) \:
\mathrm{bits}.
\end{eqnarray}

Considering $c=0.93$ and $b=175$, the number of bits per key to store
the description of the resulting MPHF for a set of 1~billion keys is $8.1$.
If we set $b=128$, then the bits per key ratio increases to $8.3$.
Theoretically, the number of bits required to store the MPHF in
Eq.~(\ref{eq:newmphfbits})
is $O(n\log n)$ as~$n\to\infty$. However, for sets of up to $2^{b/3}$ keys
the number of bits per key remains below 9~bits (note that
$2^{b/3}>2^{58}>10^{17}$ for $b=175$).
Thus, in practice the resulting function is stored in $O(n)$ bits.