%% Nivio: 23/jan/06 29/jan/06
% Time-stamp: <Monday 30 Jan 2006 03:56:47am EDT yoshi@ime.usp.br>

\enlargethispage{2\baselineskip}

\section{Analytical results}
\label{sec:analytcal-results}

\vspace{-1mm}
The purpose of this section is fourfold.
First, we show that our algorithm runs in expected time $O(n)$.
Second, we present the main memory requirements for constructing the MPHF.
Third, we discuss the cost of evaluating the resulting MPHF.
Fourth, we present the space required to store the resulting MPHF.

\vspace{-2mm}
\subsection{The linear time complexity}
\label{sec:linearcomplexity}

First, we show that the partitioning step presented in
Figure~\ref{fig:partitioningstep} runs in $O(n)$ time.
Each iteration of the {\bf for} loop in statement~1
runs in $O(|B_j|)$ time, $1 \leq j \leq N$, where $|B_j|$ is the
number of keys that fit in block $B_j$ of size $\mu$.
This is because statement~1.1 just reads $|B_j|$ keys from disk,
statement~1.2 runs a bucket-sort-like algorithm
that is well known to be linear in the number of keys it sorts
(i.e., $|B_j|$ keys),
and statement~1.3 just writes the $|B_j|$ keys to File~$j$ on disk.
Thus, the {\bf for} loop runs in $O(\sum_{j=1}^{N}|B_j|)$ time.
Since $\sum_{j=1}^{N}|B_j|=n$, the partitioning step runs in $O(n)$ time.
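
To make this accounting concrete, the following minimal Python sketch
mirrors the shape of the partitioning step. It is an illustration under
our own simplifying assumptions: \texttt{hash0} stands in for the bucket
hash function, and the files are returned as in-memory lists instead of
being written to disk as in Figure~\ref{fig:partitioningstep}.
\begin{verbatim}
# Minimal sketch of the partitioning step (illustrative names; the
# "files" are returned as lists instead of being written to disk).
def partition(keys, block_size, n_buckets, hash0):
    files = []                        # File 1 .. File N
    block, block_bytes = [], 0
    for key in keys:                  # one sequential pass: O(n)
        if block and block_bytes + len(key) > block_size:
            files.append(sort_block(block, n_buckets, hash0))
            block, block_bytes = [], 0
        block.append(key)             # statement 1.1: read block B_j
        block_bytes += len(key)
    if block:
        files.append(sort_block(block, n_buckets, hash0))
    return files

def sort_block(block, n_buckets, hash0):
    # Statement 1.2: a bucket sort is linear in |B_j|, since each
    # key is appended once to the list of its bucket.
    buckets = [[] for _ in range(n_buckets)]
    for key in block:
        buckets[hash0(key) % n_buckets].append(key)
    # Statement 1.3: dump the keys, grouped by bucket, to File j.
    return [key for bucket in buckets for key in bucket]
\end{verbatim}
Each key is handled a constant number of times, which is exactly the
$O(\sum_{j=1}^{N}|B_j|) = O(n)$ accounting above.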

Second, we show that the searching step presented in
Figure~\ref{fig:searchingstep} also runs in $O(n)$ time.
The heap construction in statement~1 runs in $O(N)$ time, for $N \ll n$.
We assume that insertions and deletions in the heap cost $O(1)$ because
$N$ is typically much smaller than $n$ (see \cite[Section 6.4]{bkz06t} for details).
Statement~2 runs in $O(\sum_{i=0}^{\lceil n/b \rceil - 1} \mathit{size}[i])$ time
(recall that $\mathit{size}[i]$ stores the number of keys in bucket $i$).
As $\sum_{i=0}^{\lceil n/b \rceil - 1} \mathit{size}[i] = n$, if
statements~2.1, 2.2 and~2.3 run in $O(\mathit{size}[i])$ time, then statement~2
runs in $O(n)$ time.

Statement~2.1 reads $O(\mathit{size}[i])$ keys of bucket $i$
and is detailed in Figure~\ref{fig:readingbucket}.
As we assume that each read or write on disk costs $O(1)$ and
each heap operation also costs $O(1)$, statement~2.1
takes $O(\mathit{size}[i])$ time.
However, in the worst case the keys of bucket $i$ are distributed among at
most~$BS_{max}$ files on disk
(recall that $BS_{max}$ is the maximum number of keys found in any bucket).
Hence, the critical step in reading a bucket is statement~1.3 of
Figure~\ref{fig:readingbucket}, where the first read operation on File~$j$
may trigger a seek.
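
The heap-driven merge performed when reading buckets can be sketched as
runnable Python. Representing the $N$ files as in-memory lists of
(bucket, key) pairs is our simplification of the on-disk layout of
Figure~\ref{fig:readingbucket}.
\begin{verbatim}
import heapq

# Sketch of statement 2.1: the heap holds one entry per file, namely
# (bucket of the file's next unread key, file index, position), so
# its size is N and each operation costs O(log N), treated as O(1)
# because N << n.
def read_buckets(files):
    heap = [(f[0][0], j, 0) for j, f in enumerate(files) if f]
    heapq.heapify(heap)                      # statement 1: O(N)
    while heap:
        i = heap[0][0]                       # next bucket to emit
        bucket = []
        while heap and heap[0][0] == i:      # merge bucket i's pieces
            _, j, pos = heapq.heappop(heap)
            while pos < len(files[j]) and files[j][pos][0] == i:
                bucket.append(files[j][pos][1])
                pos += 1
            if pos < len(files[j]):          # file j still has keys
                heapq.heappush(heap, (files[j][pos][0], j, pos))
        yield i, bucket                      # size[i] keys in total
\end{verbatim}
For example, \texttt{list(read\_buckets([[(0,'a'),(2,'c')],
[(0,'b'),(1,'d')]]))} yields buckets 0, 1 and 2, in this order.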

In order to amortize the number of seeks performed we use a buffering
technique~\cite{k73}.
We create a buffer $j$ of size \textbaht$\: = \mu/N$ for each file $j$,
where $1\leq j \leq N$
(recall that $\mu$ is the size in bytes of an a priori reserved internal
memory area).
Every time a read operation is requested on file $j$ and the data is not found
in the $j$th~buffer, \textbaht~bytes are read from file $j$ into buffer $j$.
Hence, the number of seeks performed in the worst case is given by
$\beta/$\textbaht~(recall that $\beta$ is the size in bytes of $S$),
under the pessimistic assumption that one seek happens every time
buffer $j$ is refilled.
Thus, the number of seeks performed in the worst case is $64n/$\textbaht, since
each URL is 64 bytes long on average. Therefore, the number of seeks is linear
in $n$ and amortized by \textbaht.
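
As an illustration of this technique (not the paper's implementation),
the Python sketch below charges one seek per buffer refill, matching the
pessimistic accounting above; \texttt{buf\_size} plays the role
of~\textbaht.
\begin{verbatim}
import io

# Toy buffered reader: a seek is charged only when the buffer of
# baht = mu/N bytes is refilled, so reading beta bytes costs about
# beta/baht seeks in the worst case.
class BufferedFile:
    def __init__(self, f, buf_size):
        self.f, self.buf_size = f, buf_size
        self.buf, self.pos = b"", 0
        self.refills = 0                   # pessimistic seek counter

    def read(self, nbytes):
        out = bytearray()
        while len(out) < nbytes:
            if self.pos == len(self.buf):  # buffer exhausted
                self.buf = self.f.read(self.buf_size)
                self.pos = 0
                if not self.buf:           # end of file
                    break
                self.refills += 1          # one (assumed) seek
            take = min(nbytes - len(out), len(self.buf) - self.pos)
            out += self.buf[self.pos:self.pos + take]
            self.pos += take
        return bytes(out)

# Reading 100 records of 64 bytes through a 640-byte buffer costs
# 10 refills rather than 100 seeks:
bf = BufferedFile(io.BytesIO(b"x" * 6400), buf_size=640)
while bf.read(64):
    pass
assert bf.refills == 10
\end{verbatim}
For the figures used in Section~\ref{sec:memconstruction} ($\mu = 250$
megabytes and $N = 248$), \textbaht~is close to one megabyte, and with
$\beta = 64n = 64 \times 10^9$ bytes this accounting charges on the
order of $64{,}000$ seeks in total.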

It is important to emphasize two things.
First, the operating system uses techniques
to diminish the number of seeks and the average seek time.
This makes the amortization factor greater than \textbaht~in practice.
Second, almost all main memory is available to be used as
file buffers because just a small vector
of $\lceil n/b\rceil$ one-byte entries must be maintained in main memory,
as we show in Section~\ref{sec:memconstruction}.

Statement~2.2 runs our internal-memory-based algorithm in order to generate
an MPHF for each bucket.
That algorithm is linear, as we showed in~\cite{bkz05}. As it is applied to
buckets with $\mathit{size}[i]$ keys, statement~2.2 takes
$O(\mathit{size}[i])$ time.
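
For intuition on statement~2.2, the toy below builds, for one small
bucket, a function of the form $(g[h_1(x)] + g[h_2(x)]) \bmod n$ that is
a bijection onto $\{0,\ldots,n-1\}$. It uses the simpler acyclic-graph
(CHM-style) construction with $3n$ vertices rather than the algorithm
of~\cite{bkz05}, which also handles cyclic graphs and needs only
$c \times \mathit{size}[i]$ entries per bucket; all names are ours.
\begin{verbatim}
import random

# Toy per-bucket construction: find g such that
# (g[h1(x)] + g[h2(x)]) mod n is distinct for each of the n keys.
# CHM-style: retry random seeds until the graph whose edges are
# (h1(x), h2(x)) is acyclic, then fix g by traversing it.
def build_bucket_mphf(keys, tries=100):
    n, m = len(keys), 3 * len(keys) + 1
    for _ in range(tries):
        s1, s2 = random.random(), random.random()
        edges = [(hash((s1, k)) % m, hash((s2, k)) % m) for k in keys]
        g = assign(edges, m, n)
        if g is not None:
            # Key k maps to:
            # (g[hash((s1, k)) % m] + g[hash((s2, k)) % m]) % n
            return s1, s2, g
    raise RuntimeError("no acyclic graph found")

def assign(edges, m, n):
    adj = [[] for _ in range(m)]
    for e, (u, v) in enumerate(edges):
        if u == v:
            return None                    # self-loop: retry
        adj[u].append((v, e))
        adj[v].append((u, e))
    g, visited, used = [0] * m, [False] * m, [False] * len(edges)
    for root in range(m):
        if visited[root]:
            continue
        visited[root], stack = True, [root]
        while stack:                       # traverse one component
            u = stack.pop()
            for v, e in adj[u]:
                if used[e]:
                    continue
                if visited[v]:
                    return None            # cycle: retry
                used[e], visited[v] = True, True
                g[v] = (e - g[u]) % n      # g[u] + g[v] = e (mod n)
                stack.append(v)
    return g
\end{verbatim}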

Statement~2.3 has time complexity $O(\mathit{size}[i])$ because it writes to disk
the description of each generated MPHF, and each description is stored in
$c \times \mathit{size}[i] + O(1)$ bytes, where $c\in[0.93,1.15]$.
In conclusion, our algorithm takes $O(n)$ time because both the partitioning and
the searching steps run in $O(n)$ time.

An experimental validation of the above proof and a performance comparison with
our internal-memory-based algorithm~\cite{bkz05} were not included here due to
space restrictions, but can be found in~\cite{bkz06t} and also in the appendix.

\vspace{-1mm}
\enlargethispage{2\baselineskip}
\subsection{Space used for constructing an MPHF}
\label{sec:memconstruction}

The vector {\it size} is kept in main memory at all times.
It has $\lceil n/b \rceil$ one-byte entries, each storing the number of keys
in its bucket; those values are less than or equal to 256.
For example, for a set of 1 billion keys and $b=175$, the vector {\it size}
needs $5.45$ megabytes of main memory.
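Indeed, $\lceil 10^9/175 \rceil = 5{,}714{,}286$ one-byte entries occupy
$5{,}714{,}286/2^{20} \approx 5.45$ megabytes.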

We need an internal memory area of size $\mu$ bytes to be used in
the partitioning step and in the searching step.
The size $\mu$ is fixed a priori and depends only on the amount
of internal memory available to run the algorithm
(i.e., it does not depend on the size $n$ of the problem).

The additional space required in the searching step
is constant, since the problem has been broken down
into several small problems (buckets of at most 256 keys) and
the heap size is much smaller than $n$ ($N \ll n$).
For example, for a set of 1 billion keys and an internal area of~$\mu = 250$
megabytes, the number of files is $N = 248$.

\vspace{-1mm}
\subsection{Evaluation cost of the MPHF}

Now we consider the amount of CPU time
required by the resulting MPHF at retrieval time.
The MPHF requires, for each key, the computation of three
universal hash functions and three memory accesses
(see Eqs.~(\ref{eq:mphf}), (\ref{eq:bucketindex}) and~(\ref{eq:mphfi})).
This is not optimal: Pagh~\cite{p99} showed that any MPHF requires
at least the computation of two universal hash functions and one memory
access.
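
In code form, retrieval has the following shape; the hash functions, the
modular reductions, and the explicit lookup of $\mathit{size}[i]$ are our
illustrative assumptions, not the exact function stored by the algorithm.
\begin{verbatim}
# Shape of MPHF evaluation: one hash picks the bucket i, two hashes
# index the bucket's g_i vector (two memory accesses), and offset[i]
# (a third access) turns the in-bucket value into a global one.
# h0, h1, h2, g, offset and size are stand-ins for the stored data.
def evaluate(key, h0, h1, h2, g, offset, size):
    i = h0(key) % len(g)                   # 1st hash: bucket index
    gi = g[i]                              # g_i: c * size[i] entries
    local = (gi[h1(i, key) % len(gi)] +
             gi[h2(i, key) % len(gi)]) % size[i]
    return offset[i] + local
\end{verbatim}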

\subsection{Description size of the MPHF}

The number of bits required to store the MPHF generated by the algorithm
is computed by Eq.~(\ref{eq:newmphfbits}).
We need to store each $g_i$ vector presented in Eq.~(\ref{eq:mphfi}), where
$0\leq i < \lceil n/b \rceil$. As each bucket has at most 256 keys, each
entry in a $g_i$ vector requires 8~bits. Each $g_i$ vector has
$c \times \mathit{size}[i]$ entries (recall $c\in[0.93, 1.15]$).
Summing up the number of entries of all $\lceil n/b \rceil$ $g_i$ vectors,
we have
$c\sum_{i=0}^{\lceil n/b \rceil -1} \mathit{size}[i]=cn$ entries. We also need to
store $3 \lceil n/b \rceil$ integer numbers of
$\log_2n$ bits, referring respectively to the {\it offset} vector and the two
random seeds of
$h_{1i}$ and $h_{2i}$. In addition, we need to store the $\lceil n/b \rceil$
8-bit entries of the vector {\it size}. Therefore,
\begin{eqnarray}\label{eq:newmphfbits}
\mathrm{Required\: Space} = 8cn + \frac{n}{b}\left( 3\log_2n +8\right) \:
\mathrm{bits}.
\end{eqnarray}

Considering $c=0.93$ and $b=175$, the number of bits per key to store
the description of the resulting MPHF for a set of 1~billion keys is $8.1$.
If we set $b=128$, then the bits per key ratio increases to $8.3$.
Theoretically, the number of bits required to store the MPHF in
Eq.~(\ref{eq:newmphfbits})
is $O(n\log n)$ as~$n\to\infty$. However, for sets of up to $2^{b/3}$ keys
the number of bits per key remains below 9~bits (note that
$2^{b/3}>2^{58}>10^{17}$ for $b=175$).
Thus, in practice the resulting function is stored in $O(n)$ bits.