%% Nivio: 23/jan/06 29/jan/06
% Time-stamp: <Monday 30 Jan 2006 03:56:47am EDT yoshi@ime.usp.br>
\enlargethispage{2\baselineskip}
\section{Analytical results}
\label{sec:analytcal-results}
\vspace{-1mm}
The purpose of this section is fourfold.
First, we show that our algorithm runs in expected time $O(n)$.
Second, we present the main memory requirements for constructing the MPHF.
Third, we discuss the cost of evaluating the resulting MPHF.
Fourth, we present the space required to store the resulting MPHF.
\vspace{-2mm}
\subsection{The linear time complexity}
\label{sec:linearcomplexity}
First, we show that the partitioning step presented in
Figure~\ref{fig:partitioningstep} runs in $O(n)$ time.
Each iteration of the {\bf for} loop in statement~1
runs in $O(|B_j|)$ time, $1 \leq j \leq N$, where $|B_j|$ is the
number of keys
that fit in block $B_j$ of size $\mu$. This is because statement 1.1 just reads
$|B_j|$ keys from disk, statement 1.2 runs a bucket-sort-like algorithm,
which is well known to be linear in the number of keys it sorts (i.e., $|B_j|$ keys),
and statement 1.3 just writes the $|B_j|$ keys to File~$j$ on disk.
Thus, the {\bf for} loop runs in $O(\sum_{j=1}^{N}|B_j|)$ time.
As $\sum_{j=1}^{N}|B_j|=n$, the partitioning step runs in $O(n)$ time.
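As a concrete illustration, the partitioning step can be sketched as follows (a minimal Python sketch; the hash \texttt{bucket\_index}, the in-memory lists standing in for disk files, and all names are our own illustrative assumptions, not the paper's implementation):

```python
# Sketch of the partitioning step: read the key set in blocks,
# bucket-sort each block by bucket index (linear in the block size),
# and dump each sorted block to its own file on disk.

def bucket_index(key, n, b):
    # illustrative stand-in for the hash that maps keys to ceil(n/b) buckets
    return hash(key) % -(-n // b)

def partition(keys, n, b, block_keys):
    """Statements 1.1-1.3: each returned list plays the role of File j."""
    nbuckets = -(-n // b)                 # ceil(n/b) buckets in total
    size = [0] * nbuckets                 # per-bucket key counts
    files = []
    for start in range(0, len(keys), block_keys):
        block = keys[start:start + block_keys]       # 1.1: read |B_j| keys
        buckets = [[] for _ in range(nbuckets)]      # 1.2: bucket sort
        for k in block:
            i = bucket_index(k, n, b)
            buckets[i].append(k)
            size[i] += 1
        files.append([k for bkt in buckets for k in bkt])  # 1.3: dump File j
    return files, size
```

Each block pass touches every key a constant number of times, which is exactly the $O(\sum_{j=1}^{N}|B_j|) = O(n)$ bound argued above.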
Second, we show that the searching step presented in
Figure~\ref{fig:searchingstep} also runs in $O(n)$ time.
The heap construction in statement 1 runs in $O(N)$ time, and $N \ll n$.
We assume that insertions and deletions in the heap cost $O(1)$ because
$N$ is typically much smaller than $n$ (see \cite[Section 6.4]{bkz06t} for details).
Statement 2 runs in $O(\sum_{i=0}^{\lceil n/b \rceil - 1} \mathit{size}[i])$ time
(remember that $\mathit{size}[i]$ stores the number of keys in bucket $i$).
As $\sum_{i=0}^{\lceil n/b \rceil - 1} \mathit{size}[i] = n$, if
statements 2.1, 2.2 and 2.3 run in $O(\mathit{size}[i])$ time, then statement 2
runs in $O(n)$ time.
Statement 2.1 reads $O(\mathit{size}[i])$ keys of bucket $i$
and is detailed in Figure~\ref{fig:readingbucket}.
As we are assuming that each read or write on disk costs $O(1)$ and
each heap operation also costs $O(1)$, statement~2.1
takes $O(\mathit{size}[i])$ time.
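The heap-driven multiway merge of statement 2.1 can be sketched as follows (an illustrative Python sketch under our own naming assumptions: each file is modeled as an in-memory list already ordered by bucket index, as the partitioning step leaves it):

```python
import heapq

def bucket_index(key, n, b):
    # illustrative stand-in for the hash that assigns keys to buckets
    return hash(key) % -(-n // b)

def read_buckets(files, n, b):
    """Yield (i, keys of bucket i) for i = 0, 1, ... by merging the N
    files; the heap holds, for each file, the bucket index of its next
    unread key, so bucket i is assembled with one sequential run per
    file that contains keys of bucket i."""
    pos = [0] * len(files)                 # read position in each file
    heap = [(bucket_index(f[0], n, b), j)
            for j, f in enumerate(files) if f]
    heapq.heapify(heap)
    for i in range(-(-n // b)):            # buckets 0 .. ceil(n/b)-1
        bucket = []
        while heap and heap[0][0] == i:    # files whose next key is in bucket i
            _, j = heapq.heappop(heap)
            f = files[j]
            while pos[j] < len(f) and bucket_index(f[pos[j]], n, b) == i:
                bucket.append(f[pos[j]])   # read the whole run of bucket i
                pos[j] += 1
            if pos[j] < len(f):            # file not exhausted: requeue it
                heapq.heappush(heap, (bucket_index(f[pos[j]], n, b), j))
        yield i, bucket
```

Each key is read exactly once and, under the $N \ll n$ assumption that heap operations cost $O(1)$, assembling bucket $i$ takes $O(\mathit{size}[i])$ time.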
However, the keys of bucket $i$ are distributed in at most~$BS_{max}$ files on disk
in the worst case
(recall that $BS_{max}$ is the maximum number of keys found in any bucket).
Therefore, we need to take into account that the critical step in reading
a bucket is statement~1.3 of Figure~\ref{fig:readingbucket},
where the first read operation on File~$j$ may cause a disk seek.
In order to amortize the number of seeks performed we use a buffering technique~\cite{k73}.
We create a buffer $j$ of size \textbaht$\: = \mu/N$ for each file $j$,
where $1\leq j \leq N$
(recall that $\mu$ is the size in bytes of an a priori reserved internal memory area).
Every time a read operation is issued on file $j$ and the data is not found
in the $j$th~buffer, \textbaht~bytes are read from file $j$ into buffer $j$.
Hence, the number of seeks performed in the worst case is given by
$\beta/$\textbaht~(remember that $\beta$ is the size in bytes of $S$).
Here we make the pessimistic assumption that one seek happens every time
buffer $j$ is refilled.
Thus, the number of seeks performed in the worst case is $64n/$\textbaht, since
each URL is 64 bytes long on average. Therefore, the number of seeks is linear
in $n$ and amortized by \textbaht.
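Plugging in the figures used elsewhere in this section (1 billion keys of 64 bytes on average, an internal area of $\mu = 250$ megabytes split among $N = 248$ file buffers), a back-of-the-envelope computation of this worst-case seek count looks like the following (illustrative arithmetic only):

```python
# Worst-case number of seeks beta / buf, with one seek per buffer refill.
n = 10**9           # number of keys
beta = 64 * n       # size of S in bytes (64-byte URLs on average)
mu = 250 * 10**6    # internal memory area in bytes
N = 248             # number of files, as in the example of the next subsection
buf = mu // N       # per-file buffer size in bytes (about 1 MB)
seeks = beta // buf # roughly 63.5 thousand seeks for 64 GB of keys
```

That is, roughly one seek per megabyte read, which is why the seek cost stays manageable even for a billion keys.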
It is important to emphasize two things.
First, the operating system uses techniques
to diminish the number of seeks and the average seek time.
This makes the amortization factor greater than \textbaht~in practice.
Second, almost all main memory is available to be used as
file buffers because just a small vector
of $\lceil n/b\rceil$ one-byte entries must be maintained in main memory,
as we show in Section~\ref{sec:memconstruction}.
Statement 2.2 runs our internal-memory-based algorithm to generate a MPHF for each bucket.
That algorithm is linear, as we showed in~\cite{bkz05}. As it is applied to buckets with $\mathit{size}[i]$ keys, statement~2.2 takes $O(\mathit{size}[i])$ time.
Statement 2.3 has time complexity $O(\mathit{size}[i])$ because it writes to disk
the description of each generated MPHF and each description is stored in
$c \times \mathit{size}[i] + O(1)$ bytes, where $c\in[0.93,1.15]$.
In conclusion, our algorithm takes $O(n)$ time because both the partitioning and
the searching steps run in $O(n)$ time.
An experimental validation of the above proof and a performance comparison with
our internal memory based algorithm~\cite{bkz05} were not included here due to
space restrictions but can be found in~\cite{bkz06t} and also in the appendix.
\vspace{-1mm}
\enlargethispage{2\baselineskip}
\subsection{Space used for constructing a MPHF}
\label{sec:memconstruction}
The vector {\it size} is kept in main memory at all times.
It has $\lceil n/b \rceil$ one-byte entries, each storing the number of keys
in the corresponding bucket (a value less than or equal to 256).
For example, for a set of 1 billion keys and $b=175$, the vector {\it size} needs
$5.45$ megabytes of main memory.
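This figure follows directly from the $\lceil n/b \rceil$ one-byte entries (we assume here that a megabyte means $2^{20}$ bytes):

```python
import math

n = 10**9                     # 1 billion keys
b = 175                       # bucket capacity parameter
entries = math.ceil(n / b)    # one byte per bucket
megabytes = entries / 2**20   # about 5.45 megabytes
```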
We need an internal memory area of size $\mu$ bytes to be used in
the partitioning step and in the searching step.
The size $\mu$ is fixed a priori and depends only on the amount
of internal memory available to run the algorithm
(i.e., it does not depend on the size $n$ of the problem).
The additional space required in the searching step
is constant, since the problem is broken down
into several small problems (at most 256 keys each) and
the heap size is much smaller than $n$ ($N \ll n$).
For example, for a set of 1 billion keys and an internal area of~$\mu = 250$ megabytes,
the number of files is $N = 248$.
\vspace{-1mm}
\subsection{Evaluation cost of the MPHF}
Now we consider the amount of CPU time
required by the resulting MPHF at retrieval time.
The MPHF requires for each key the computation of three
universal hash functions and three memory accesses
(see Eqs.~(\ref{eq:mphf}), (\ref{eq:bucketindex}) and (\ref{eq:mphfi})).
This is not optimal. Pagh~\cite{p99} showed that any MPHF requires
at least the computation of two universal hash functions and one memory
access.
\subsection{Description size of the MPHF}
The number of bits required to store the MPHF generated by the algorithm
is computed by Eq.~(\ref{eq:newmphfbits}).
We need to store each $g_i$ vector presented in Eq.~(\ref{eq:mphfi}), where
$0\leq i < \lceil n/b \rceil$. As each bucket has at most 256 keys, each
entry in a $g_i$ vector has 8~bits. In each $g_i$ vector there are
$c \times \mathit{size}[i]$ entries (recall $c\in[0.93, 1.15]$).
When we sum up the number of entries of $\lceil n/b \rceil$ $g_i$ vectors we have
$c\sum_{i=0}^{\lceil n/b \rceil -1} \mathit{size}[i]=cn$ entries. We also need to
store $3 \lceil n/b \rceil$ integer numbers of
$\log_2n$ bits referring respectively to the {\it offset} vector and the two random seeds of
$h_{1i}$ and $h_{2i}$. In addition, we need to store $\lceil n/b \rceil$ 8-bit entries of
the vector {\it size}. Therefore,
\begin{eqnarray}\label{eq:newmphfbits}
\mathrm{Required\: Space} = 8cn + \frac{n}{b}\left( 3\log_2n +8\right) \:
\mathrm{bits}.
\end{eqnarray}
Considering $c=0.93$ and $b=175$, the number of bits per key to store
the description of the resulting MPHF for a set of 1~billion keys is $8.1$.
If we set $b=128$, then the bits per key ratio increases to $8.3$.
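Taking Eq.~(\ref{eq:newmphfbits}) literally, the bits-per-key figures can be reproduced as follows (the rounding conventions here are our own assumption, so the result may differ from the reported values in the last decimal place):

```python
import math

def bits_per_key(n, b, c):
    # Eq. (eq:newmphfbits): 8cn + (n/b) * (3 log2 n + 8) bits in total
    total = 8 * c * n + (n / b) * (3 * math.log2(n) + 8)
    return total / n

bpk_175 = bits_per_key(10**9, 175, 0.93)  # about 8 bits per key
bpk_128 = bits_per_key(10**9, 128, 0.93)  # slightly larger
```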
Theoretically, the number of bits required to store the MPHF in
Eq.~(\ref{eq:newmphfbits})
is $O(n\log n)$ as~$n\to\infty$. However, for sets of size up to $2^{b/3}$ keys
the number of bits per key is lower than 9~bits (note that
$2^{b/3}>2^{58}>10^{17}$ for $b=175$).
Thus, in practice the resulting function is stored in $O(n)$ bits.