
paper for vldb07 added

This commit is contained in:
fc_botelho 2006-08-11 17:32:31 +00:00
parent aa4b59e7c1
commit bd2e291de9
28 changed files with 4517 additions and 0 deletions

7
vldb07/acknowledgments.tex Executable file

@@ -0,0 +1,7 @@
\section{Acknowledgments}
This section is optional; it is a location for you
to acknowledge grants, funding, editing assistance and
what have you. In the present case, for example, the
authors would like to thank Gerald Murray of ACM for
his help in codifying this \textit{Author's Guide}
and the \textbf{.cls} and \textbf{.tex} files that it describes.

174
vldb07/analyticalresults.tex Executable file

@@ -0,0 +1,174 @@
%% Nivio: 23/jan/06 29/jan/06
% Time-stamp: <Monday 30 Jan 2006 03:56:47am EDT yoshi@ime.usp.br>
\enlargethispage{2\baselineskip}
\section{Analytical results}
\label{sec:analytcal-results}
\vspace{-1mm}
The purpose of this section is fourfold.
First, we show that our algorithm runs in expected time $O(n)$.
Second, we present the main memory requirements for constructing the MPHF.
Third, we discuss the cost of evaluating the resulting MPHF.
Fourth, we present the space required to store the resulting MPHF.
\vspace{-2mm}
\subsection{The linear time complexity}
\label{sec:linearcomplexity}
First, we show that the partitioning step presented in
Figure~\ref{fig:partitioningstep} runs in $O(n)$ time.
Each iteration of the {\bf for} loop in statement~1
runs in $O(|B_j|)$ time, $1 \leq j \leq N$, where $|B_j|$ is the
number of keys
that fit in block $B_j$ of size $\mu$. This is because statement 1.1 just reads
$|B_j|$ keys from disk, statement 1.2 runs a bucket-sort-like algorithm,
which is well known to be linear in the number of keys it sorts (i.e., $|B_j|$ keys),
and statement 1.3 just dumps the $|B_j|$ keys back to disk into File $j$.
Thus, the {\bf for} loop runs in $O(\sum_{j=1}^{N}|B_j|)$ time.
As $\sum_{j=1}^{N}|B_j|=n$, the partitioning step runs in $O(n)$ time.
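To make this counting argument concrete, the following C sketch mirrors the
structure of the partitioning step; the stand-in hash \texttt{h0}, the file
naming, and the representation of a block of $\mu$ bytes as \texttt{blocksz}
key pointers are illustrative assumptions, not the actual implementation.
Each key is hashed, counted, placed and written a constant number of times,
so each block costs $O(|B_j|)$ as claimed:
{\small
\begin{verbatim}
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the hash h0 that assigns a key to its bucket. */
static unsigned h0(const char *key, unsigned nbuckets)
{
    unsigned h = 2166136261u;              /* FNV-1a, illustrative */
    for (; *key; key++)
        h = (h ^ (unsigned char)*key) * 16777619u;
    return h % nbuckets;
}

/* Statement 1: for each block B_j, bucket-sort its keys (1.2)
   and dump them, grouped by bucket, to File j (1.3). */
static void partition(char **keys, unsigned n,
                      unsigned nbuckets, unsigned blocksz)
{
    unsigned *start = malloc((nbuckets + 1) * sizeof *start);
    unsigned *next  = malloc(nbuckets * sizeof *next);
    for (unsigned j = 0; j * blocksz < n; j++) {
        unsigned lo = j * blocksz;
        unsigned hi = lo + blocksz < n ? lo + blocksz : n;
        memset(start, 0, (nbuckets + 1) * sizeof *start);
        for (unsigned k = lo; k < hi; k++)   /* count per bucket */
            start[h0(keys[k], nbuckets) + 1]++;
        for (unsigned i = 0; i < nbuckets; i++)
            start[i + 1] += start[i];        /* prefix sums */
        memcpy(next, start, nbuckets * sizeof *next);
        char **sorted = malloc((hi - lo) * sizeof *sorted);
        for (unsigned k = lo; k < hi; k++)   /* linear placement */
            sorted[next[h0(keys[k], nbuckets)]++] = keys[k];
        char name[32];
        snprintf(name, sizeof name, "file_%u", j);
        FILE *f = fopen(name, "w");          /* 1.3: dump block */
        for (unsigned k = 0; k < hi - lo; k++)
            fprintf(f, "%s\n", sorted[k]);
        fclose(f);
        free(sorted);
    }
    free(start);
    free(next);
}
\end{verbatim}
}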
Second, we show that the searching step presented in
Figure~\ref{fig:searchingstep} also runs in $O(n)$ time.
The heap construction in statement 1 runs in $O(N)$ time.
We assume that insertions and deletions in the heap cost $O(1)$ because
$N$ is typically much smaller than $n$ (see \cite[Section 6.4]{bkz06t} for details).
Statement 2 runs in $O(\sum_{i=0}^{\lceil n/b \rceil - 1} \mathit{size}[i])$ time
(remember that $\mathit{size}[i]$ stores the number of keys in bucket $i$).
As $\sum_{i=0}^{\lceil n/b \rceil - 1} \mathit{size}[i] = n$, if
statements 2.1, 2.2 and 2.3 run in $O(\mathit{size}[i])$ time, then statement 2
runs in $O(n)$ time.
%Statement 2.1 runs the algorithm to read a bucket from disk. That algorithm reads $\mathit{size}[i]$
%keys of bucket $i$ that might be spread into many files or, in the worst case,
%into $|BS_{max}|$ files, where $|BS_{max}|$ is the number of keys in the bucket of maximum size.
%It uses the heap $H$ to drive a multiway merge of the sprayed bucket $i$.
%As we are considering that each read/write on disk costs $O(1)$ and
%each heap operation also costs $O(1)$ (recall $N \ll n$), then statement 2.1
%costs $O(\mathit{size}[i])$ time.
%We need to take into account that this step could generate a lot of seeks on disk.
%However, the number of seeks can be amortized (see Section~\ref{sec:contr-disk-access})
%and that is why we have been able of getting a MPHF for a set of 1 billion keys in less
%than 4 hours using a machine with just 500 MB of main memory
%(see Section~\ref{sec:performance}).
Statement 2.1 reads $O(\mathit{size}[i])$ keys of bucket $i$
and is detailed in Figure~\ref{fig:readingbucket}.
As we are assuming that each read or write on disk costs $O(1)$ and
each heap operation also costs $O(1)$, statement~2.1
takes $O(\mathit{size}[i])$ time.
However, the keys of bucket $i$ are distributed in at most~$BS_{max}$ files on disk
in the worst case
(recall that $BS_{max}$ is the maximum number of keys found in any bucket).
Therefore, we need to take into account that
the critical step in reading a bucket is in statement~1.3 of Figure~\ref{fig:readingbucket},
where a seek operation in File $j$
may be performed by the first read operation.
In order to amortize the number of seeks performed we use a buffering technique~\cite{k73}.
We create a buffer $j$ of size \textbaht$\: = \mu/N$ for each file $j$,
where $1\leq j \leq N$
(recall that $\mu$ is the size in bytes of an a priori reserved internal memory area).
Every time a read operation is requested to file $j$ and the data is not found
in the $j$th~buffer, \textbaht~bytes are read from file $j$ to buffer $j$.
Hence, the number of seeks performed in the worst case is given by
$\beta/$\textbaht~(remember that $\beta$ is the size in bytes of $S$).
Here we make the pessimistic assumption that one seek happens every time
buffer $j$ is refilled.
Thus, the number of seeks performed in the worst case is $64n/$\textbaht, since
each URL is 64 bytes long on average. Therefore, the number of seeks is linear in
$n$ and amortized by \textbaht.
It is important to emphasize two things.
First, the operating system uses techniques
to diminish the number of seeks and the average seek time.
This makes the amortization factor greater than \textbaht~in practice.
Second, almost all main memory is available to be used as
file buffers because just a small vector
of $\lceil n/b\rceil$ one-byte entries must be maintained in main memory,
as we show in Section~\ref{sec:memconstruction}.
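A minimal C sketch of this buffering scheme follows (illustrative, not the
implementation used in the experiments): requests served from the buffer touch
no disk, and each refill reads \textbaht~bytes sequentially, so at most one
seek per refill is needed.
{\small
\begin{verbatim}
#include <stdio.h>
#include <string.h>

typedef struct {
    FILE *f;                 /* one of the N files on disk */
    unsigned char *buf;      /* buffer of bsize = mu/N bytes */
    size_t bsize, len, pos;  /* capacity, bytes held, cursor */
} BufFile;

/* Read nbytes from bf into dst, refilling the buffer when it
   runs empty; returns the number of bytes actually read. */
static size_t buf_read(BufFile *bf, void *dst, size_t nbytes)
{
    size_t done = 0;
    while (done < nbytes) {
        if (bf->pos == bf->len) {       /* buffer is empty */
            bf->len = fread(bf->buf, 1, bf->bsize, bf->f);
            bf->pos = 0;                /* at most one seek here */
            if (bf->len == 0)
                break;                  /* end of file */
        }
        size_t chunk = bf->len - bf->pos;
        if (chunk > nbytes - done)
            chunk = nbytes - done;
        memcpy((char *)dst + done, bf->buf + bf->pos, chunk);
        bf->pos += chunk;
        done += chunk;
    }
    return done;
}
\end{verbatim}
}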
Statement 2.2 runs our internal memory based algorithm in order to generate a MPHF for each bucket.
That algorithm is linear, as we showed in~\cite{bkz05}. As it is applied to buckets with {\it size}$[i]$ keys, statement~2.2 takes $O(\mathit{size}[i])$ time.
Statement 2.3 has time complexity $O(\mathit{size}[i])$ because it writes to disk
the description of each generated MPHF and each description is stored in
$c \times \mathit{size}[i] + O(1)$ bytes, where $c\in[0.93,1.15]$.
In conclusion, our algorithm takes $O(n)$ time because both the partitioning and
the searching steps run in $O(n)$ time.
An experimental validation of the above proof and a performance comparison with
our internal memory based algorithm~\cite{bkz05} were not included here due to
space restrictions but can be found in~\cite{bkz06t} and also in the appendix.
\vspace{-1mm}
\enlargethispage{2\baselineskip}
\subsection{Space used for constructing a MPHF}
\label{sec:memconstruction}
The vector {\it size} is kept in main memory at all times.
It has $\lceil n/b \rceil$ one-byte entries and stores the number of keys
in each bucket, a value that is at most 256.
For example, for a set of 1~billion keys and $b=175$, the vector {\it size} needs
$5.45$ megabytes of main memory.
We need an internal memory area of size $\mu$ bytes to be used in
the partitioning step and in the searching step.
The size $\mu$ is fixed a priori and depends only on the amount
of internal memory available to run the algorithm
(i.e., it does not depend on the size $n$ of the problem).
% One could argue about the a priori reserved internal memory area and the main memory
% required to run the indirect bucket sort algorithm.
% Those internal memory requirements do not depend on the size of the problem
% (i.e., the number of keys being hashed) and can be fixed a priori.
The additional space required in the searching step
is constant, since the problem has been broken down
into several small problems (buckets of at most 256 keys) and
the heap size is much smaller than $n$ ($N \ll n$).
For example, for a set of 1 billion keys and an internal area of~$\mu = 250$ megabytes,
the number of files is $N = 248$.
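The arithmetic behind the 5.45-megabyte figure can be checked directly;
the fragment below is an illustrative calculation, not code from the paper:
{\small
\begin{verbatim}
#include <math.h>
#include <stdio.h>

int main(void)
{
    double n = 1e9, b = 175;
    double bytes = ceil(n / b);    /* one byte per bucket */
    printf("size vector: %.2f MB\n", bytes / (1 << 20));
    return 0;                      /* about 5.45 MB */
}
\end{verbatim}
}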
\vspace{-1mm}
\subsection{Evaluation cost of the MPHF}
Now we consider the amount of CPU time
required by the resulting MPHF at retrieval time.
The MPHF requires for each key the computation of three
universal hash functions and three memory accesses
(see Eqs.~(\ref{eq:mphf}), (\ref{eq:bucketindex}) and (\ref{eq:mphfi})).
This is not optimal. Pagh~\cite{p99} showed that any MPHF requires
at least the computation of two universal hash functions and one memory
access.
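For illustration, the C fragment below sketches one plausible realization of
such an evaluation; the mixing function, the field names and the recovery of
the range $c \times \mathit{size}[i]$ are our assumptions, the actual function
being given by the equations cited above. The two $g_i$ lookups plus the
\texttt{offset} lookup are the three memory accesses counted here (the sketch
also reads the per-bucket metadata once, for clarity):
{\small
\begin{verbatim}
/* Illustrative layout of the resulting MPHF; names are ours. */
typedef struct {
    unsigned nbuckets;      /* ceil(n/b) */
    unsigned char **g;      /* g[i]: 8-bit entries, bucket i */
    unsigned *offset;       /* keys placed before bucket i */
    unsigned *seed1;        /* per-bucket seed of h_{1i} */
    unsigned *seed2;        /* per-bucket seed of h_{2i} */
    unsigned *size;         /* keys in bucket i (nonempty) */
} MPHF;

/* Stand-in mix for the universal hashes of the paper. */
static unsigned mix(const char *key, unsigned seed)
{
    unsigned h = seed ^ 2166136261u;
    for (; *key; key++)
        h = (h ^ (unsigned char)*key) * 16777619u;
    return h;
}

static unsigned mphf_eval(const MPHF *f, const char *key)
{
    unsigned i = mix(key, 0) % f->nbuckets;  /* hash 1: bucket */
    unsigned m = f->size[i];
    unsigned r = (93 * m) / 100 + 1;         /* ~ c * size[i]  */
    unsigned v1 = f->g[i][mix(key, f->seed1[i]) % r]; /* access 1 */
    unsigned v2 = f->g[i][mix(key, f->seed2[i]) % r]; /* access 2 */
    return f->offset[i] + (v1 + v2) % m;     /* access 3       */
}
\end{verbatim}
}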
\subsection{Description size of the MPHF}
The number of bits required to store the MPHF generated by the algorithm
is computed by Eq.~(\ref{eq:newmphfbits}).
We need to store each $g_i$ vector presented in Eq.~(\ref{eq:mphfi}), where
$0\leq i < \lceil n/b \rceil$. As each bucket has at most 256 keys, each
entry in a $g_i$ vector has 8~bits. In each $g_i$ vector there are
$c \times \mathit{size}[i]$ entries (recall $c\in[0.93, 1.15]$).
When we sum up the number of entries of $\lceil n/b \rceil$ $g_i$ vectors we have
$c\sum_{i=0}^{\lceil n/b \rceil -1} \mathit{size}[i]=cn$ entries. We also need to
store $3 \lceil n/b \rceil$ integer numbers of
$\log_2n$ bits referring respectively to the {\it offset} vector and the two random seeds of
$h_{1i}$ and $h_{2i}$. In addition, we need to store $\lceil n/b \rceil$ 8-bit entries of
the vector {\it size}. Therefore,
\begin{eqnarray}\label{eq:newmphfbits}
\mathrm{Required\: Space} = 8cn + \frac{n}{b}\left( 3\log_2n +8\right) \:
\mathrm{bits}.
\end{eqnarray}
Considering $c=0.93$ and $b=175$, the number of bits per key to store
the description of the resulting MPHF for a set of 1~billion keys is $8.1$.
If we set $b=128$, then the bits per key ratio increases to $8.3$.
Theoretically, the number of bits required to store the MPHF in
Eq.~(\ref{eq:newmphfbits})
is $O(n\log n)$ as~$n\to\infty$. However, for sets of up to $2^{b/3}$ keys
the number of bits per key is less than 9 (note that
$2^{b/3}>2^{58}>10^{17}$ for $b=175$).
%For $b=175$, the number of bits per key will be close to 9 for a set of $2^{58}$ keys.
Thus, in practice the resulting function is stored in $O(n)$ bits.
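As a quick numerical check of Eq.~(\ref{eq:newmphfbits}), the illustrative
fragment below evaluates the bits-per-key ratio for $n = 10^9$; the slightly
lower values it prints, compared with those quoted above, come from rounding
in the constants:
{\small
\begin{verbatim}
#include <math.h>
#include <stdio.h>

int main(void)
{
    double n = 1e9, c = 0.93, bs[] = {175, 128};
    for (int k = 0; k < 2; k++) {
        /* 8cn + (n/b)(3 log2(n) + 8) bits in total */
        double bits = 8 * c * n
                    + (n / bs[k]) * (3 * log2(n) + 8);
        printf("b = %3.0f: %.2f bits/key\n", bs[k], bits / n);
    }
    return 0;  /* about 8.0 (b=175) and 8.2 (b=128) */
}
\end{verbatim}
}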

6
vldb07/appendix.tex Normal file

@@ -0,0 +1,6 @@
\appendix
\input{experimentalresults}
\input{thedataandsetup}
\input{costhashingbuckets}
\input{performancenewalgorithm}
\input{diskaccess}

42
vldb07/conclusions.tex Executable file

@@ -0,0 +1,42 @@
% Time-stamp: <Monday 30 Jan 2006 12:38:06am EST yoshi@flare>
\enlargethispage{2\baselineskip}
\section{Concluding remarks}
\label{sec:concuding-remarks}
This paper has presented a novel external memory based algorithm for
constructing MPHFs that works for sets on the order of billions of keys. The
algorithm outputs the resulting function in~$O(n)$ time and, furthermore, it
can be tuned to run only $34\%$ slower (see \cite{bkz06t} for details) than the fastest
algorithm available in the literature for constructing MPHFs~\cite{bkz05}.
In addition, the space
requirement of the resulting MPHF is of up to 9 bits per key for datasets of
up to $2^{58}\simeq10^{17.4}$ keys.
The algorithm is simple and needs just a
small vector of size approximately 5.45 megabytes in main memory to construct
a MPHF for a collection of 1 billion URLs, each one 64 bytes long on average.
Therefore, almost all main memory is available to be used as disk I/O buffer.
Making use of such a buffering scheme with an internal memory area of size
$\mu=200$ megabytes, our algorithm can produce a MPHF for a
set of 1 billion URLs in approximately 3.6 hours using a commodity PC of 2.4 gigahertz and
500 megabytes of main memory.
If we increase both the main memory
available to 1 gigabyte and the internal memory area to $\mu=500$ megabytes,
a MPHF for the set of 1 billion URLs is produced in approximately 3 hours. For any
key, the evaluation of the resulting MPHF takes three memory accesses and the
computation of three universal hash functions.
In order to allow the reproduction of our results and the utilization of both the internal memory
based algorithm and the external memory based algorithm,
the algorithms are available at \texttt{http://cmph.sf.net}
under the GNU Lesser General Public License (LGPL).
They were implemented in the C language.
In future work, we will exploit the fact that the searching step is
intrinsically highly parallel and accounts for $73\%$ of the
construction time. A parallel implementation of our algorithm will
allow both the construction and the evaluation of the resulting function to be
performed in parallel. The descriptions of the resulting MPHFs will then be
distributed across the parallel computer, allowing scalability to sets of
hundreds of billions of keys.
This is an important contribution, mainly for applications related to the Web, as
mentioned in Section~\ref{sec:intro}.

177
vldb07/costhashingbuckets.tex Executable file

@@ -0,0 +1,177 @@
% Nivio: 29/jan/06
% Time-stamp: <Monday 30 Jan 2006 12:37:22am EST yoshi@flare>
\vspace{-2mm}
\subsection{Performance of the internal memory based algorithm}
\label{sec:intern-memory-algor}
%\begin{table*}[htb]
%\begin{center}
%{\scriptsize
%\begin{tabular}{|c|c|c|c|c|c|c|c|}
%\hline
%$n$ (millions) & 1 & 2 & 4 & 8 & 16 & 32 \\
%\hline
%Average time (s)& $6.1 \pm 0.3$ & $12.2 \pm 0.6$ & $25.4 \pm 1.1$ & $51.4 \pm 2.0$ & $117.3 \pm 4.4$ & $262.2 \pm 8.7$\\
%SD (s) & $2.6$ & $5.4$ & $9.8$ & $17.6$ & $37.3$ & $76.3$ \\
%\hline
%\end{tabular}
%\vspace{-3mm}
%}
%\end{center}
%\caption{Internal memory based algorithm: average time in seconds for constructing a MPHF,
%the standard deviation (SD), and the confidence intervals considering
%a confidence level of $95\%$.}
%\label{tab:medias}
%\end{table*}
Our three-step internal memory based algorithm presented in~\cite{bkz05}
is used for constructing a MPHF for each bucket.
It is a randomized algorithm because it needs to generate a simple random graph
in its first step.
Once the graph is obtained the other two steps are deterministic.
Thus, we can consider the runtime of the algorithm to have the form~$\alpha
nZ$ for an input of~$n$ keys, where~$\alpha$ is some machine dependent
constant that further depends on the length of the keys and~$Z$ is a random
variable with geometric distribution with mean~$1/p=e^{1/c^2}$ (see
Section~\ref{sec:mphfbucket}). All results in our experiments were obtained
taking $c=1$; the value of~$c$, with~$c\in[0.93,1.15]$, in fact has little
influence in the runtime, as shown in~\cite{bkz05}.
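To illustrate how little $c$ matters here, the fragment below (an
illustrative calculation using only the formula above) evaluates the expected
number of iterations $1/p=e^{1/c^2}$ of the first step at the extremes of the
interval and at the value used:
{\small
\begin{verbatim}
#include <math.h>
#include <stdio.h>

int main(void)
{
    double cs[] = {0.93, 1.00, 1.15};
    for (int i = 0; i < 3; i++)     /* mean of Z = e^{1/c^2} */
        printf("c = %.2f: E[Z] = %.2f iterations\n",
               cs[i], exp(1 / (cs[i] * cs[i])));
    return 0;   /* about 3.18, 2.72 and 2.13, respectively */
}
\end{verbatim}
}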
The values chosen for $n$ were $1, 2, 4, 8, 16$ and $32$ million.
Although we have a dataset with 1~billion URLs, on a PC with
1~gigabyte of main memory, the algorithm is able
to handle an input with at most 32 million keys.
This is mainly because of the graph we need to keep in main memory.
The algorithm requires $25n + O(1)$ bytes for constructing
a MPHF (details about the data structures used by the algorithm can
be found at~\texttt{http://cmph.sf.net}).
% for the details about the data structures
%used by the algorithm).
In order to estimate the number of trials for each value of $n$ we use
a statistical method for determining a suitable sample size (see, e.g.,
\cite[Chapter 13]{j91}).
As the method yielded a different sample size for each $n$,
we used the largest one obtained, namely 300~trials, in order to achieve
a confidence level of $95\%$.
% \begin{figure*}[ht]
% \noindent
% \begin{minipage}[b]{0.5\linewidth}
% \centering
% \subfigure[The previous algorithm]
% {\scalebox{0.5}{\includegraphics{figs/bmz_temporegressao.eps}}}
% \end{minipage}
% \hfill
% \begin{minipage}[b]{0.5\linewidth}
% \centering
% \subfigure[The new algorithm]
% {\scalebox{0.5}{\includegraphics{figs/brz_temporegressao.eps}}}
% \end{minipage}
% \caption{Time versus number of keys in $S$. The solid line corresponds to
% a linear regression model.}
% %obtained from the experimental measurements.}
% \label{fig:temporegressao}
% \end{figure*}
Table~\ref{tab:medias} presents the runtime average for each $n$,
the respective standard deviations, and
the respective confidence intervals given by
the average time $\pm$ the distance from average time
considering a confidence level of $95\%$.
Observing the runtime averages one sees that
the algorithm runs in expected linear time,
as shown in~\cite{bkz05}.
\vspace{-2mm}
\begin{table*}[htb]
\begin{center}
{\scriptsize
\begin{tabular}{|c|c|c|c|c|c|c|c|}
\hline
$n$ (millions) & 1 & 2 & 4 & 8 & 16 & 32 \\
\hline
Average time (s)& $6.1 \pm 0.3$ & $12.2 \pm 0.6$ & $25.4 \pm 1.1$ & $51.4 \pm 2.0$ & $117.3 \pm 4.4$ & $262.2 \pm 8.7$\\
SD (s) & $2.6$ & $5.4$ & $9.8$ & $17.6$ & $37.3$ & $76.3$ \\
\hline
\end{tabular}
\vspace{-1mm}
}
\end{center}
\caption{Internal memory based algorithm: average time in seconds for constructing a MPHF,
the standard deviation (SD), and the confidence intervals considering
a confidence level of $95\%$.}
\label{tab:medias}
\vspace{-4mm}
\end{table*}
% \enlargethispage{\baselineskip}
% \begin{table*}[htb]
% \begin{center}
% {\scriptsize
% (a)
% \begin{tabular}{|c|c|c|c|c|c|c|c|}
% \hline
% $n$ (millions) & 1 & 2 & 4 & 8 & 16 & 32 \\
% \hline
% Average time (s)& $6.119 \pm 0.300$ & $12.190 \pm 0.615$ & $25.359 \pm 1.109$ & $51.408 \pm 2.003$ & $117.343 \pm 4.364$ & $262.215 \pm 8.724$\\
% SD (s) & $2.644$ & $5.414$ & $9.757$ & $17.627$ & $37.333$ & $76.271$ \\
% \hline
% \end{tabular}
% \\[5mm] (b)
% \begin{tabular}{|l|c|c|c|c|c|}
% \hline
% $n$ (millions) & 1 & 2 & 4 & 8 & 16 \\
% \hline % Part. 16 \% 16 \% 16 \% 18 \% 20\%
% Average time (s) & $6.927 \pm 0.309$ & $13.828 \pm 0.175$ & $31.936 \pm 0.663$ & $69.902 \pm 1.084$ & $140.617 \pm 2.502$ \\
% SD & $0.431$ & $0.245$ & $0.926$ & $1.515$ & $3.498$ \\
% \hline
% \hline
% $n$ (millions) & 32 & 64 & 128 & 512 & 1000 \\
% \hline % Part. 20 \% 20\% 20\% 18\% 18\%
% Average time (s) & $284.330 \pm 1.135$ & $587.880 \pm 3.945$ & $1223.581 \pm 4.864$ & $5966.402 \pm 9.465$ & $13229.540 \pm 12.670$ \\
% SD & $1.587$ & $5.514$ & $6.800$ & $13.232$ & $18.577$ \\
% \hline
% \end{tabular}
% }
% \end{center}
% \caption{The runtime averages in seconds,
% the standard deviation (SD), and
% the confidence intervals given by the average time $\pm$
% the distance from average time considering
% a confidence level of $95\%$.}
% \label{tab:medias}
% \end{table*}
\enlargethispage{2\baselineskip}
Figure~\ref{fig:bmz_temporegressao}
presents the runtime for each trial. In addition,
the solid line corresponds to a linear regression model
obtained from the experimental measurements.
As we can see, the runtime for a given $n$ has a considerable
fluctuation. However, the fluctuation also grows linearly with $n$.
\begin{figure}[htb]
\vspace{-2mm}
\begin{center}
\scalebox{0.4}{\includegraphics{figs/bmz_temporegressao.eps}}
\caption{Time versus number of keys in $S$ for the internal memory based algorithm.
The solid line corresponds to a linear regression model.}
\label{fig:bmz_temporegressao}
\end{center}
\vspace{-6mm}
\end{figure}
The observed fluctuation in the runtimes is as expected; recall that this
runtime has the form~$\alpha nZ$ with~$Z$ a geometric random variable with
mean~$1/p=e$. Thus, the runtime has mean~$\alpha n/p=\alpha en$ and standard
deviation~$\alpha n\sqrt{(1-p)/p^2}=\alpha n\sqrt{e(e-1)}$.
Therefore, the standard deviation also grows
linearly with $n$, as experimentally verified
in Table~\ref{tab:medias} and in Figure~\ref{fig:bmz_temporegressao}.
%\noindent-------------------------------------------------------------------------\\
%Comentario para Yoshi: Nao consegui reproduzir bem o que discutimos
%no paragrafo acima, acho que vc conseguira justificar melhor :-). \\
%-------------------------------------------------------------------------\\

146
vldb07/determiningb.tex Executable file

@@ -0,0 +1,146 @@
% Nivio: 29/jan/06
% Time-stamp: <Monday 30 Jan 2006 04:01:40am EDT yoshi@ime.usp.br>
\enlargethispage{2\baselineskip}
\subsection{Determining~$b$}
\label{sec:determining-b}
\begin{table*}[t]
\begin{center}
{\small %\scriptsize
\begin{tabular}{|c|ccc|ccc|}
\hline
\raisebox{-0.7em}{$n$} & \multicolumn{3}{c|}{\raisebox{-1mm}{b=128}} &
\multicolumn{3}{c|}{\raisebox{-1mm}{b=175}}\\
\cline{2-4} \cline{5-7}
& \raisebox{-0.5mm}{Worst Case} & \raisebox{-0.5mm}{Average} &\raisebox{-0.5mm}{Eq.~(\ref{eq:maxbs})}
& \raisebox{-0.5mm}{Worst Case} & \raisebox{-0.5mm}{Average} &\raisebox{-0.5mm}{Eq.~(\ref{eq:maxbs})} \\
\hline
$1.0 \times 10^6$ & 177 & 172.0 & 176 & 232 & 226.6 & 230 \\
%$2.0 \times 10^6$ & 179 & 174.0 & 178 & 236 & 228.5 & 232 \\
$4.0 \times 10^6$ & 182 & 177.5 & 179 & 241 & 231.8 & 234 \\
%$8.0 \times 10^6$ & 186 & 181.6 & 181 & 238 & 234.2 & 236 \\
$1.6 \times 10^7$ & 184 & 181.6 & 183 & 241 & 236.1 & 238 \\
%$3.2 \times 10^7$ & 191 & 183.9 & 184 & 240 & 236.6 & 240 \\
$6.4 \times 10^7$ & 195 & 185.2 & 186 & 244 & 239.0 & 242 \\
%$1.28 \times 10^8$ & 193 & 187.7 & 187 & 244 & 239.7 & 244 \\
$5.12 \times 10^8$ & 196 & 191.7 & 190 & 251 & 246.3 & 247 \\
$1.0 \times 10^9$ & 197 & 191.6 & 192 & 253 & 248.9 & 249 \\
\hline
\end{tabular}
\vspace{-1mm}
}
\end{center}
\caption{Values for $\mathit{BS}_{\mathit{max}}$: worst case and average case obtained in the experiments and using Eq.~(\ref{eq:maxbs}),
considering $b=128$ and $b=175$ for different number $n$ of keys in $S$.}
\label{tab:comparison}
\vspace{-6mm}
\end{table*}
The partitioning step can be viewed as the well-known ``balls into bins''
problem~\cite{ra98,dfm02}, where~$n$ keys (the balls) are placed independently and
uniformly into $\lceil n/b\rceil$ buckets (the bins). The question about this
problem that interests us is: what is the maximum number of keys in any bucket?
In fact, we want the largest value of $b$ that keeps the maximum number of keys in any bucket
no greater than 256 with high probability.
This is important, as we wish to use 8 bits per entry in the vector $g_i$ of
each $\mathrm{MPHF}_i$,
where $0 \leq i < \lceil n/b\rceil$.
Let $\mathit{BS}_{\mathit{max}}$ be the maximum number of keys in any bucket.
Clearly, $\BSmax$ is the maximum
of~$\lceil n/b\rceil$ random variables~$Z_i$, each with binomial
distribution~$\Bi(n,p)$ with parameters~$n$ and~$p=1/\lceil n/b\rceil$.
However, the~$Z_i$ are not independent. Note that~$\Bi(n,p)$ has mean and
variance~$\simeq b$. To give an upper estimate for the probability of the
event~$\BSmax\geq \gamma$, we can estimate the probability that we have~$Z_i\geq \gamma$
for a fixed~$i$, and then sum these estimates over all~$i$.
Let~$\gamma=b+\sigma\sqrt{b\ln(n/b)}$, where~$\sigma=\sqrt2$.
Approximating~$\Bi(n,p)$ by the normal distribution with mean and
variance~$b$, we obtain the
estimate~$(\sigma\sqrt{2\pi\ln(n/b)})^{-1}\times\exp(-(1/2)\sigma^2\ln(n/b))$ for
the probability that~$Z_i\geq \gamma$ occurs, which, summed over all~$i$, gives
that the probability that~$\BSmax\geq \gamma$ occurs is at
most~$1/(\sigma\sqrt{2\pi\ln(n/b)})$, which tends to~$0$ as~$n\to\infty$.
Thus, we have shown that, with high probability,
\begin{equation}
\label{eq:maxbs}
\BSmax\leq b+\sqrt{2b\ln{n\over b}}.
\end{equation}
% The traditional approach used to estimate $\mathit{BS}_{\mathit{max}}$ with high probability is
% to consider $\mathit{BS}_{\mathit{max}}$ as a random variable that follows a binomial distribution
% that can be approximated by a poisson distribution. This yields a good approximation
% when the number of balls is lower than or equal to the number of bins~\cite{g81}. In our case,
% the number of balls is greater than the number of buckets.
% % and that is why we have used more recent works to estimate $\mathit{BS}_{\mathit{max}}$.
% As $b > \ln (n/b)$, we can use the result by Raab and Steger~\cite{ra98} to estimate
% $\mathit{BS}_{\mathit{max}}$ with high probability.
% The following equation gives the estimation, where $\sigma=\sqrt{2}$:
% \begin{eqnarray} \label{eq:maxbs}
% \mathit{BS}_{\mathit{max}} = b + O \left( \sqrt{b\ln\frac{n}{b}} \right) = b + \sigma \times \left(\sqrt{b\ln\frac{n}{b}} \right)
% \end{eqnarray}
% In order to estimate the suitable constant $\sigma$ we did a linear
% regression suppressing the constant term.
% We use the equation $BS_{max} - b = \sigma \times \sqrt{b\ln (n/b)}$
% in the linear regression considering $y=BS_{max} - b$ and $x=\sqrt{b\ln (n/b)}$.
% In order to obtain data to be used in the linear regression we set
% b=128 and ran the new algorithm ten times
% for n equal to 1, 2, 4, 8, 16, 32, 64, 128, 512, 1000 million keys.
% Taking a confidence level equal to 95\% we got
% $\sigma = 2.11 \pm 0.03$.
% The coefficient of determination was $99.6\%$, which means that the linear
% regression explains $99.6\%$ of the data variation and only $0.4\%$
% is due to experimental errors.
% Therefore, Eq.~(\ref{eq:maxbs}) with $\sigma = 2.11 \pm 0.03$ and $b=128$
% makes a very good estimation of the maximum number of keys in any bucket.
% Repeating the same experiments for $b$ equals to $175$ and
% a confidence level of $95\%$ we got $\sigma = 2.07 \pm 0.03$.
% Again we verified that Eq.~(\ref{eq:maxbs}) with $\sigma = 2.07 \pm 0.03$ and $b=175$ is
% a very good estimation of the maximum number of keys in any bucket once the
% coefficient of determination obtained was $99.7 \%$ and $\sigma$ is in a very narrow range.
In our algorithm the maximum number of keys in any bucket must be at most 256.
Table~\ref{tab:comparison} presents the values for $\mathit{BS}_{\mathit{max}}$
obtained experimentally and using Eq.~(\ref{eq:maxbs}).
The table presents the worst case and the average case,
considering $b=128$, $b=175$ and Eq.~(\ref{eq:maxbs}),
for several numbers~$n$ of keys in $S$.
The estimation given by Eq.~(\ref{eq:maxbs}) is very close to the experimental
results.
Now we estimate the largest problem our algorithm is able to solve for
a given $b$.
Using Eq.~(\ref{eq:maxbs}) with $b=128$ and $b=175$, and imposing
that~$\mathit{BS}_{\mathit{max}}\leq256$,
the largest key sets our algorithm
can deal with have $10^{30}$ keys and $10^{10}$ keys, respectively.
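Both the bound and these limits are easy to reproduce numerically; the
fragment below is an illustrative check (not code from the paper) that
evaluates Eq.~(\ref{eq:maxbs}) and inverts it for
$\mathit{BS}_{\mathit{max}} = 256$:
{\small
\begin{verbatim}
#include <math.h>
#include <stdio.h>

/* Eq. (maxbs): BSmax <= b + sqrt(2 b ln(n/b)) w.h.p. */
static double bsmax(double n, double b)
{
    return b + sqrt(2 * b * log(n / b));
}

int main(void)
{
    printf("b = 175, n = 1e9: BSmax <= %.0f\n",
           bsmax(1e9, 175));        /* about 249, as in the table */
    double bs[] = {128, 175};
    for (int k = 0; k < 2; k++) {   /* largest n with BSmax <= 256 */
        double d = 256 - bs[k];
        double n = bs[k] * exp(d * d / (2 * bs[k]));
        printf("b = %3.0f: n up to %.1e keys\n", bs[k], n);
    }
    return 0;  /* about 8e29 and 2e10, i.e., 10^30 and 10^10 keys */
}
\end{verbatim}
}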
%It is also important to have $b$ as big as possible, once its value is
%related to the space required to store the resultant MPHF, as shown later on.
%Table~\ref{tab:bp} shows the biggest problem the algorithm can solve.
% The values were obtained from Eq.~(\ref{eq:maxbs}),
% considering $b=128$ and~$b=175$ and imposing
% that~$\mathit{BS}_{\mathit{max}}\leq256$.
% We set $\sigma=2.14$ because it was the greatest value obtained for $\sigma$
% in the two linear regression we did.
% \vspace{-3mm}
% \begin{table}[htb]
% \begin{center}
% {\small %\scriptsize
% \begin{tabular}{|c|c|}
% \hline
% b & Problem size ($n$) \\
% \hline
% 128 & $10^{30}$ keys \\
% 175 & $10^{10}$ keys \\
% \hline
% \end{tabular}
% \vspace{-1mm}
% }
% \end{center}
% \caption{Using Eq.~(\ref{eq:maxbs}) to estimate the biggest problem our algorithm can solve.}
% %considering $\sigma=\sqrt{2}$.}
% \label{tab:bp}
% \vspace{-14mm}
% \end{table}

113
vldb07/diskaccess.tex Executable file

@@ -0,0 +1,113 @@
% Nivio: 29/jan/06
% Time-stamp: <Sunday 29 Jan 2006 11:58:28pm EST yoshi@flare>
\vspace{-2mm}
\subsection{Controlling disk accesses}
\label{sec:contr-disk-access}
In order to bring down the number of seek operations on disk
we benefit from the fact that our algorithm leaves almost all main
memory available to be used as disk I/O buffer.
In this section we evaluate how much the parameter $\mu$
affects the runtime of our algorithm.
To do so, we fixed $n$ at 1 billion URLs,
set the main memory of the machine used for the experiments
to 1 gigabyte, and used $\mu$ equal to $100, 200, 300, 400, 500$ and $600$
megabytes.
\enlargethispage{2\baselineskip}
Table~\ref{tab:diskaccess} presents the number of files $N$,
the buffer size used for all files, the number of seeks in the worst case considering
the pessimistic assumption mentioned in Section~\ref{sec:linearcomplexity}, and
the time to generate a MPHF for 1 billion keys as a function of the amount of internal
memory available. Observing Table~\ref{tab:diskaccess} we notice that the construction time
decreases as the value of $\mu$ increases. However, for $\mu > 400$, the variation
in the time is not as significant as for $\mu \leq 400$.
This can be explained by the fact that the I/O scheduler of the Linux 2.6 kernel
has smart policies
for avoiding seeks and diminishing the average seek time
(see \texttt{http://www.linuxjournal.com/article/6931}).
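The buffer sizes and worst-case seek counts in Table~\ref{tab:diskaccess}
follow from \textbaht$\:=\mu/N$ and $\beta/$\textbaht~with $\beta=64n$ bytes;
the fragment below (illustrative, with the reported values of $N$ hard-coded)
reproduces them within a few percent, the residual difference coming from
rounding of the average key length:
{\small
\begin{verbatim}
#include <stdio.h>

int main(void)
{
    double beta = 64e9;   /* 1e9 URLs, 64 bytes each on average */
    double mu[] = {100, 200, 300, 400, 500, 600};  /* megabytes */
    int    N[]  = {619, 310, 207, 155, 124, 104};  /* from table */
    for (int i = 0; i < 6; i++) {
        double b = mu[i] * (1 << 20) / N[i];  /* buffer = mu/N */
        printf("mu = %3.0f MB: buf = %5.0f KB, seeks <= %.0f\n",
               mu[i], b / 1024, beta / b);
    }
    return 0;
}
\end{verbatim}
}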
\begin{table*}[ht]
\vspace{-2mm}
\begin{center}
{\scriptsize
\begin{tabular}{|l|c|c|c|c|c|c|}
\hline
$\mu$ (MB) & $100$ & $200$ & $300$ & $400$ & $500$ & $600$ \\
\hline
$N$ (files) & $619$ & $310$ & $207$ & $155$ & $124$ & $104$ \\
%\hline
\textbaht~(buffer size in KB) & $165$ & $661$ & $1,484$ & $2,643$ & $4,129$ & $5,908$ \\
%\hline
$\beta$/\textbaht~(\# of seeks in the worst case) & $384,478$ & $95,974$ & $42,749$ & $24,003$ & $15,365$ & $10,738$ \\
% \hline
% \raisebox{-0.2em}{\# of seeks performed in} & \raisebox{-0.7em}{$383,056$} & \raisebox{-0.7em}{$95,919$} & \raisebox{-0.7em}{$42,700$} & \raisebox{-0.7em}{$23,980$} & \raisebox{-0.7em}{$15,347$} & \raisebox{-0.7em}{$xx,xxx$} \\
% \raisebox{0.2em}{statement 1.3 of Figure~\ref{fig:readingbucket}} & & & & & & \\
% \hline
Time (hours) & $4.04$ & $3.64$ & $3.34$ & $3.20$ & $3.13$ & $3.09$ \\
\hline
\end{tabular}
\vspace{-1mm}
}
\end{center}
\caption{Influence of the internal memory area size ($\mu$) in our algorithm runtime.}
\label{tab:diskaccess}
\vspace{-14mm}
\end{table*}
% \begin{table*}[ht]
% \begin{center}
% {\scriptsize
% \begin{tabular}{|l|c|c|c|c|c|c|c|c|c|c|c|}
% \hline
% $\mu$ (MB) & $100$ & $150$ & $200$ & $250$ & $300$ & $350$ & $400$ & $450$ & $500$ & $550$ & $600$ \\
% \hline
% $N$ (files) & $619$ & $413$ & $310$ & $248$ & $207$ & $177$ & $155$ & $138$ & $124$ & $113$ & $103$ \\
% \hline
% \textbaht~(buffer size in KB) & $165$ & $372$ & $661$ & $1,033$ & $1,484$ & $2,025$ & $2,643$ & $3,339$ & & & \\
% \hline
% \# of seeks (Worst case) & $384,478$ & $170,535$ & $95,974$ & $61,413$ & $42,749$ & $31,328$ & $24,003$ & $19,000$ & & & \\
% \hline
% \raisebox{-0.2em}{\# of seeks performed in} & \raisebox{-0.7em}{$383,056$} & \raisebox{-0.7em}{$170,385$} & \raisebox{-0.7em}{$95,919$} & \raisebox{-0.7em}{$61,388$} & \raisebox{-0.7em}{$42,700$} & \raisebox{-0.7em}{$31,296$} & \raisebox{-0.7em}{$23,980$} & \raisebox{-0.7em}{$18,978$} & \raisebox{-0.7em}{$xx,xxx$} & \raisebox{-0.7em}{$xx,xxx$} & \raisebox{-0.7em}{$xx,xxx$} \\
% \raisebox{0.2em}{statement 1.3 of Figure~\ref{fig:readingbucket}} & & & & & & & & & & & \\
% \hline
% Time (horas) & $4.04$ & $3.93$ & $3.64$ & $3.46$ & $3.34$ & $3.26$ & $3.20$ & $3.13$ & & & \\
% \hline
% \end{tabular}
% }
% \end{center}
% \caption{Influence of the internal memory area size ($\mu$) in our algorithm runtime.}
% \label{tab:diskaccess}
% \end{table*}
% \begin{table*}[htb]
% \begin{center}
% {\scriptsize
% \begin{tabular}{|l|c|c|c|c|c|}
% \hline
% $n$ (millions) & 1 & 2 & 4 & 8 & 16 \\
% \hline % Part. 16 \% 16 \% 16 \% 18 \% 20\%
% Average time (s) & $14.124 \pm 0.128$ & $28.301 \pm 0.140$ & $56.807 \pm 0.312$ & $117.286 \pm 0.997$ & $241.086 \pm 0.936$ \\
% SD & $0.179$ & $0.196$ & $0.437$ & $1.394$ & $1.308$ \\
% \hline
% \hline
% $n$ (millions) & 32 & 64 & 128 & 512 & 1000 \\
% \hline % Part. 20 \% 20\% 20\% 18\% 18\%
% Average time (s) & $492.430 \pm 1.565$ & $1006.307 \pm 1.425$ & $2081.208 \pm 0.740$ & $9253.188 \pm 4.406$ & $19021.480 \pm 13.850$ \\
% SD & $2.188$ & $1.992$ & $1.035$ & $ 6.160$ & $18.016$ \\
% \hline
% \end{tabular}
% }
% \end{center}
% \caption{The runtime averages in seconds,
% the standard deviation (SD), and
% the confidence intervals given by the average time $\pm$
% the distance from average time considering
% a confidence level of $95\%$.
% }
% \label{tab:mediasbrz}
% \end{table*}

15
vldb07/experimentalresults.tex Executable file

@@ -0,0 +1,15 @@
%Nivio: 29/jan/06
% Time-stamp: <Sunday 29 Jan 2006 11:57:21pm EST yoshi@flare>
\vspace{-2mm}
\enlargethispage{2\baselineskip}
\section{Appendix: Experimental results}
\label{sec:experimental-results}
\vspace{-1mm}
In this section we present the experimental results.
We start by presenting the experimental setup.
We then present experimental results for
the internal memory based algorithm~\cite{bkz05}
and for our algorithm.
Finally, we discuss how the amount of internal memory available
affects the runtime of our algorithm.

Binary file not shown.


@@ -0,0 +1,107 @@
#FIG 3.2
Landscape
Center
Metric
A4
100.00
Single
-2
1200 2
0 32 #bdbebd
0 33 #bdbebd
0 34 #bdbebd
0 35 #4a4d4a
0 36 #bdbebd
0 37 #4a4d4a
0 38 #bdbebd
0 39 #bdbebd
6 225 6615 2520 7560
2 1 0 1 -1 7 50 -1 -1 0.000 0 0 -1 0 0 2
900 7133 1608 7133
2 2 0 1 0 35 50 -1 20 0.000 0 0 7 0 0 5
260 6795 474 6795 474 6965 260 6965 260 6795
2 2 0 1 0 35 50 -1 -1 0.000 0 0 7 0 0 5
474 6795 686 6795 686 6965 474 6965 474 6795
2 2 0 1 0 35 50 -1 -1 0.000 0 0 7 0 0 5
474 6626 686 6626 686 6795 474 6795 474 6626
2 2 0 1 0 32 50 -1 43 0.000 0 0 7 0 0 5
1538 6795 1750 6795 1750 6965 1538 6965 1538 6795
2 2 0 1 0 32 50 -1 43 0.000 0 0 7 0 0 5
1538 6965 1750 6965 1750 7133 1538 7133 1538 6965
2 2 0 1 -1 7 50 -1 -1 0.000 0 0 7 0 0 5
474 6965 686 6965 686 7133 474 7133 474 6965
2 2 0 1 0 7 50 -1 41 0.000 0 0 -1 0 0 5
686 6965 900 6965 900 7133 686 7133 686 6965
2 2 0 1 0 32 50 -1 43 0.000 0 0 7 0 0 5
1538 6626 1750 6626 1750 6795 1538 6795 1538 6626
2 2 0 1 0 35 50 -1 20 0.000 0 0 7 0 0 5
260 6965 474 6965 474 7133 260 7133 260 6965
2 2 0 1 0 7 50 -1 41 0.000 0 0 -1 0 0 5
686 6795 900 6795 900 6965 686 6965 686 6795
4 0 0 50 -1 0 14 0.0000 4 30 180 1148 7049 ...\001
4 0 -1 50 -1 0 7 0.0000 2 60 60 332 7260 0\001
4 0 -1 50 -1 0 7 0.0000 2 75 60 544 7260 1\001
4 0 -1 50 -1 0 7 0.0000 2 60 60 758 7260 2\001
4 0 -1 50 -1 0 7 0.0000 2 90 960 1538 7260 ${\\lceil n/b\\rceil - 1}$\001
4 0 -1 50 -1 0 7 0.0000 2 105 975 540 7515 Buckets Logical View\001
-6
6 2700 6390 4365 7830
6 3461 6445 3675 7425
6 3463 6786 3675 7245
6 3546 6893 3591 7094
4 0 -1 50 -1 0 12 0.0000 2 15 45 3546 6959 .\001
4 0 -1 50 -1 0 12 0.0000 2 15 45 3546 7027 .\001
4 0 -1 50 -1 0 12 0.0000 2 15 45 3546 7094 .\001
-6
2 2 0 1 0 35 50 -1 -1 0.000 0 0 7 0 0 5
3463 6786 3675 6786 3675 7245 3463 7245 3463 6786
-6
2 2 0 1 0 35 50 -1 -1 0.000 0 0 7 0 0 5
3461 6445 3675 6445 3675 6615 3461 6615 3461 6445
2 2 0 1 -1 7 50 -1 41 0.000 0 0 7 0 0 5
3463 6616 3675 6616 3675 6785 3463 6785 3463 6616
2 2 0 1 0 32 50 -1 43 0.000 0 0 7 0 0 5
3463 7246 3675 7246 3675 7425 3463 7425 3463 7246
-6
6 3023 6786 3235 7245
6 3106 6893 3151 7094
4 0 -1 50 -1 0 12 0.0000 2 15 45 3106 6959 .\001
4 0 -1 50 -1 0 12 0.0000 2 15 45 3106 7027 .\001
4 0 -1 50 -1 0 12 0.0000 2 15 45 3106 7094 .\001
-6
2 2 0 1 0 35 50 -1 -1 0.000 0 0 7 0 0 5
3023 6786 3235 6786 3235 7245 3023 7245 3023 6786
-6
6 4091 6425 4305 7425
6 4093 6946 4305 7255
6 4176 7018 4221 7153
4 0 -1 50 -1 0 12 0.0000 2 15 45 4176 7063 .\001
4 0 -1 50 -1 0 12 0.0000 2 15 45 4176 7108 .\001
4 0 -1 50 -1 0 12 0.0000 2 15 45 4176 7153 .\001
-6
2 2 0 1 0 35 50 -1 -1 0.000 0 0 7 0 0 5
4093 6946 4305 6946 4305 7255 4093 7255 4093 6946
-6
2 2 0 1 0 35 50 -1 -1 0.000 0 0 7 0 0 5
4091 6605 4305 6605 4305 6775 4091 6775 4091 6605
2 2 0 1 0 32 50 -1 43 0.000 0 0 7 0 0 5
4093 7256 4305 7256 4305 7425 4093 7425 4093 7256
2 2 0 1 -1 7 50 -1 41 0.000 0 0 7 0 0 5
4093 6776 4305 6776 4305 6945 4093 6945 4093 6776
2 2 0 1 0 35 50 -1 20 0.000 0 0 7 0 0 5
4091 6425 4305 6425 4305 6595 4091 6595 4091 6425
-6
2 2 0 1 0 35 50 -1 20 0.000 0 0 7 0 0 5
3021 6445 3235 6445 3235 6615 3021 6615 3021 6445
2 2 0 1 -1 7 50 -1 -1 0.000 0 0 7 0 0 5
3023 6616 3235 6616 3235 6785 3023 6785 3023 6616
2 2 0 1 0 32 50 -1 43 0.000 0 0 7 0 0 5
3023 7246 3235 7246 3235 7425 3023 7425 3023 7246
4 0 0 50 -1 0 14 0.0000 4 30 180 3780 6975 ...\001
4 0 -1 50 -1 0 7 0.0000 2 75 255 3015 7560 File 1\001
4 0 -1 50 -1 0 7 0.0000 2 75 255 3465 7560 File 2\001
4 0 -1 50 -1 0 7 0.0000 2 75 270 4095 7560 File N\001
4 0 -1 50 -1 0 7 0.0000 2 105 1020 3195 7785 Buckets Physical View\001
4 0 0 50 -1 0 10 0.0000 4 150 120 2700 7020 b)\001
-6
4 0 0 50 -1 0 10 0.0000 4 150 105 0 7020 a)\001


@@ -0,0 +1,126 @@
#FIG 3.2
Landscape
Center
Metric
A4
100.00
Single
-2
1200 2
0 32 #bebebe
0 33 #4e4e4e
6 2160 3825 2430 4365
2 2 0 1 0 33 50 -1 -1 0.000 0 0 7 0 0 5
2160 4005 2430 4005 2430 4095 2160 4095 2160 4005
2 2 0 1 0 33 50 -1 -1 0.000 0 0 7 0 0 5
2160 3825 2430 3825 2430 3915 2160 3915 2160 3825
2 2 0 1 0 33 50 -1 -1 0.000 0 0 7 0 0 5
2160 3915 2430 3915 2430 4005 2160 4005 2160 3915
2 2 0 1 0 33 50 -1 -1 0.000 0 0 7 0 0 5
2160 4275 2430 4275 2430 4365 2160 4365 2160 4275
2 2 0 1 0 33 50 -1 -1 0.000 0 0 7 0 0 5
2160 4185 2430 4185 2430 4275 2160 4275 2160 4185
2 2 0 1 0 33 50 -1 -1 0.000 0 0 7 0 0 5
2160 4095 2430 4095 2430 4185 2160 4185 2160 4095
-6
6 2430 3735 2700 4365
2 2 0 1 0 7 50 -1 41 0.000 0 0 -1 0 0 5
2430 3825 2700 3825 2700 3915 2430 3915 2430 3825
2 2 0 1 0 7 50 -1 41 0.000 0 0 -1 0 0 5
2430 4275 2700 4275 2700 4365 2430 4365 2430 4275
2 2 0 1 0 7 50 -1 41 0.000 0 0 -1 0 0 5
2430 4185 2700 4185 2700 4275 2430 4275 2430 4185
2 2 0 1 0 7 50 -1 41 0.000 0 0 -1 0 0 5
2430 4095 2700 4095 2700 4185 2430 4185 2430 4095
2 2 0 1 0 7 50 -1 41 0.000 0 0 -1 0 0 5
2430 4005 2700 4005 2700 4095 2430 4095 2430 4005
2 2 0 1 0 7 50 -1 41 0.000 0 0 -1 0 0 5
2430 3915 2700 3915 2700 4005 2430 4005 2430 3915
2 2 0 1 0 7 50 -1 41 0.000 0 0 -1 0 0 5
2430 3735 2700 3735 2700 3825 2430 3825 2430 3735
-6
6 2700 4005 2970 4365
2 2 0 1 0 32 50 -1 43 0.000 0 0 -1 0 0 5
2700 4275 2970 4275 2970 4365 2700 4365 2700 4275
2 2 0 1 0 32 50 -1 43 0.000 0 0 -1 0 0 5
2700 4185 2970 4185 2970 4275 2700 4275 2700 4185
2 2 0 1 0 32 50 -1 43 0.000 0 0 -1 0 0 5
2700 4095 2970 4095 2970 4185 2700 4185 2700 4095
2 2 0 1 -1 32 50 -1 43 0.000 0 0 -1 0 0 5
2700 4005 2970 4005 2970 4095 2700 4095 2700 4005
-6
6 2025 5625 3690 5760
4 0 0 50 -1 0 10 0.0000 4 105 360 2025 5760 File 1\001
4 0 0 50 -1 0 10 0.0000 4 105 360 2565 5760 File 2\001
4 0 0 50 -1 0 10 0.0000 4 105 405 3285 5760 File N\001
-6
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
3510 4410 3510 4590
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
3780 4410 3780 4590
2 2 0 1 0 33 50 -1 20 0.000 0 0 7 0 0 5
1890 4185 2160 4185 2160 4275 1890 4275 1890 4185
2 2 0 1 0 33 50 -1 20 0.000 0 0 7 0 0 5
1890 4275 2160 4275 2160 4365 1890 4365 1890 4275
2 2 0 1 0 33 50 -1 20 0.000 0 0 7 0 0 5
1890 4095 2160 4095 2160 4185 1890 4185 1890 4095
2 2 0 1 0 33 50 -1 20 0.000 0 0 7 0 0 5
2070 4860 2340 4860 2340 5040 2070 5040 2070 4860
2 2 0 1 0 7 50 -1 41 0.000 0 0 7 0 0 5
3330 5220 3600 5220 3600 5400 3330 5400 3330 5220
2 2 0 1 0 33 50 -1 20 0.000 0 0 7 0 0 5
3330 4860 3600 4860 3600 4950 3330 4950 3330 4860
2 2 0 1 0 33 50 -1 -1 0.000 0 0 7 0 0 5
2070 5040 2340 5040 2340 5130 2070 5130 2070 5040
2 2 0 1 0 33 50 -1 -1 0.000 0 0 7 0 0 5
3330 4950 3600 4950 3600 5220 3330 5220 3330 4950
2 2 0 1 0 7 50 -1 41 0.000 0 0 7 0 0 5
2070 5130 2340 5130 2340 5310 2070 5310 2070 5130
2 2 0 1 0 7 50 -1 10 0.000 0 0 7 0 0 5
2610 5400 2880 5400 2880 5580 2610 5580 2610 5400
2 2 0 1 0 7 50 -1 41 0.000 0 0 7 0 0 5
2610 4860 2880 4860 2880 5040 2610 5040 2610 4860
2 2 0 1 0 32 50 -1 43 0.000 0 0 7 0 0 5
2610 5040 2880 5040 2880 5130 2610 5130 2610 5040
2 2 0 1 0 7 50 -1 50 0.000 0 0 -1 0 0 5
2970 4275 3240 4275 3240 4365 2970 4365 2970 4275
2 2 0 1 0 7 50 -1 50 0.000 0 0 -1 0 0 5
2970 4185 3240 4185 3240 4275 2970 4275 2970 4185
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
3510 4410 3600 4410
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
3690 4410 3780 4410
2 2 0 1 0 7 50 -1 10 0.000 0 0 -1 0 0 5
3510 4275 3780 4275 3780 4365 3510 4365 3510 4275
2 2 0 1 0 7 50 -1 10 0.000 0 0 -1 0 0 5
3510 4185 3780 4185 3780 4275 3510 4275 3510 4185
2 2 0 1 0 32 50 -1 20 0.000 0 0 7 0 0 5
2610 5130 2880 5130 2880 5400 2610 5400 2610 5130
2 2 0 1 0 32 50 -1 43 0.000 0 0 7 0 0 5
2070 5310 2340 5310 2340 5490 2070 5490 2070 5310
2 2 0 1 0 7 50 -1 10 0.000 0 0 7 0 0 5
2070 5490 2340 5490 2340 5580 2070 5580 2070 5490
2 2 0 1 0 7 50 -1 50 0.000 0 0 7 0 0 5
3330 5400 3600 5400 3600 5490 3330 5490 3330 5400
2 2 0 1 0 32 50 -1 20 0.000 0 0 -1 0 0 5
3240 4275 3510 4275 3510 4365 3240 4365 3240 4275
2 2 0 1 0 32 50 -1 20 0.000 0 0 -1 0 0 5
3240 4185 3510 4185 3510 4275 3240 4275 3240 4185
2 2 0 1 -1 32 50 -1 20 0.000 0 0 -1 0 0 5
3240 4095 3510 4095 3510 4185 3240 4185 3240 4095
2 2 0 1 0 32 50 -1 20 0.000 0 0 -1 0 0 5
3240 4005 3510 4005 3510 4095 3240 4095 3240 4005
2 2 0 1 0 32 50 -1 20 0.000 0 0 -1 0 0 5
3240 3915 3510 3915 3510 4005 3240 4005 3240 3915
2 2 0 1 0 32 50 -1 20 0.000 0 0 -1 0 0 5
3330 5490 3600 5490 3600 5580 3330 5580 3330 5490
4 0 0 50 -1 0 10 0.0000 4 105 75 1980 4545 0\001
4 0 0 50 -1 0 10 0.0000 4 105 420 3555 4545 n/b - 1\001
4 0 0 50 -1 0 18 0.0000 4 30 180 3015 5265 ...\001
4 0 0 50 -1 0 10 0.0000 4 105 75 2250 4545 1\001
4 0 0 50 -1 0 10 0.0000 4 105 75 2520 4545 2\001
4 0 0 50 -1 0 18 0.0000 4 30 180 2880 4500 ...\001
4 0 0 50 -1 0 10 0.0000 4 135 1410 4050 5310 Buckets Physical View\001
4 0 0 50 -1 0 10 0.0000 4 135 1350 4050 4140 Buckets Logical View\001
4 0 0 50 -1 0 10 0.0000 4 135 120 1665 3780 a)\001
4 0 0 50 -1 0 10 0.0000 4 135 135 1620 4950 b)\001

183
vldb07/figs/brz.fig Executable file

@@ -0,0 +1,183 @@
#FIG 3.2 Produced by xfig version 3.2.5-alpha5
Landscape
Center
Metric
A4
100.00
Single
-2
1200 2
0 32 #bdbebd
0 33 #bdbebd
0 34 #bdbebd
0 35 #4a4d4a
0 36 #bdbebd
0 37 #4a4d4a
0 38 #bdbebd
0 39 #bdbebd
0 40 #bdbebd
6 3427 4042 3852 4211
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
3427 4041 3852 4041 3852 4211 3427 4211 3427 4041
4 0 0 50 -1 0 14 0.0000 4 30 180 3551 4140 ...\001
-6
6 3410 5689 3835 5859
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
3410 5689 3835 5689 3835 5858 3410 5858 3410 5689
4 0 0 50 -1 0 14 0.0000 4 30 180 3534 5788 ...\001
-6
6 3825 5445 4455 5535
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
4140 5445 4095 5490
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
4140 5445 4185 5490
3 0 0 1 0 7 50 -1 -1 0.000 0 0 0 8
3825 5535 3825 5490 3870 5490 3915 5490 3959 5490 4006 5490
4095 5490 4095 5490
0.000 1.000 1.000 1.000 1.000 1.000 1.000 0.000
3 0 0 1 0 7 50 -1 -1 0.000 0 0 0 7
4455 5535 4455 5490 4410 5490 4365 5490 4321 5490 4274 5490
4185 5490
0.000 1.000 1.000 1.000 1.000 1.000 0.000
-6
6 1873 5442 2323 5532
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
2098 5442 2066 5487
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
2098 5442 2130 5487
3 0 0 1 0 7 50 -1 -1 0.000 0 0 0 8
1873 5532 1873 5487 1905 5487 1937 5487 1969 5487 2002 5487
2066 5487 2066 5487
0.000 1.000 1.000 1.000 1.000 1.000 1.000 0.000
3 0 0 1 0 7 50 -1 -1 0.000 0 0 0 7
2323 5532 2323 5487 2291 5487 2259 5487 2227 5487 2194 5487
2130 5487
0.000 1.000 1.000 1.000 1.000 1.000 0.000
-6
6 2338 5442 2968 5532
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
2653 5442 2608 5487
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
2653 5442 2698 5487
3 0 0 1 0 7 50 -1 -1 0.000 0 0 0 8
2338 5532 2338 5487 2383 5487 2428 5487 2473 5487 2518 5487
2608 5487 2608 5487
0.000 1.000 1.000 1.000 1.000 1.000 1.000 0.000
3 0 0 1 0 7 50 -1 -1 0.000 0 0 0 7
2968 5532 2968 5487 2923 5487 2878 5487 2833 5487 2788 5487
2698 5487
0.000 1.000 1.000 1.000 1.000 1.000 0.000
-6
6 2475 4500 4770 5175
2 1 0 1 -1 7 50 -1 -1 0.000 0 0 -1 0 0 2
3137 5013 3845 5013
2 2 0 1 0 37 50 -1 20 0.000 0 0 7 0 0 5
2497 4675 2711 4675 2711 4845 2497 4845 2497 4675
2 2 0 1 0 37 50 -1 -1 0.000 0 0 7 0 0 5
2711 4675 2923 4675 2923 4845 2711 4845 2711 4675
2 2 0 1 0 37 50 -1 -1 0.000 0 0 7 0 0 5
2711 4506 2923 4506 2923 4675 2711 4675 2711 4506
2 2 0 1 0 36 50 -1 43 0.000 0 0 7 0 0 5
3775 4675 3987 4675 3987 4845 3775 4845 3775 4675
2 2 0 1 0 36 50 -1 43 0.000 0 0 7 0 0 5
3775 4845 3987 4845 3987 5013 3775 5013 3775 4845
2 2 0 1 -1 7 50 -1 -1 0.000 0 0 7 0 0 5
2711 4845 2923 4845 2923 5013 2711 5013 2711 4845
2 2 0 1 0 7 50 -1 41 0.000 0 0 -1 0 0 5
2923 4845 3137 4845 3137 5013 2923 5013 2923 4845
2 2 0 1 0 36 50 -1 43 0.000 0 0 7 0 0 5
3775 4506 3987 4506 3987 4675 3775 4675 3775 4506
2 2 0 1 0 37 50 -1 20 0.000 0 0 7 0 0 5
2497 4845 2711 4845 2711 5013 2497 5013 2497 4845
2 2 0 1 0 7 50 -1 41 0.000 0 0 -1 0 0 5
2923 4675 3137 4675 3137 4845 2923 4845 2923 4675
4 0 0 50 -1 0 14 0.0000 4 30 180 3385 4929 ...\001
4 0 -1 50 -1 0 7 0.0000 2 75 60 2569 5140 0\001
4 0 -1 50 -1 0 7 0.0000 2 75 60 2781 5140 1\001
4 0 -1 50 -1 0 7 0.0000 2 75 60 2995 5140 2\001
4 0 -1 50 -1 0 7 0.0000 2 75 405 4059 4845 Buckets\001
4 0 -1 50 -1 0 7 0.0000 2 105 1095 3775 5140 ${\\lceil n/b\\rceil - 1}$\001
-6
6 2983 5446 3433 5536
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
3208 5446 3176 5491
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
3208 5446 3240 5491
3 0 0 1 0 7 50 -1 -1 0.000 0 0 0 8
2983 5536 2983 5491 3015 5491 3047 5491 3079 5491 3112 5491
3176 5491 3176 5491
0.000 1.000 1.000 1.000 1.000 1.000 1.000 0.000
3 0 0 1 0 7 50 -1 -1 0.000 0 0 0 7
3433 5536 3433 5491 3401 5491 3369 5491 3337 5491 3304 5491
3240 5491
0.000 1.000 1.000 1.000 1.000 1.000 0.000
-6
2 2 0 1 0 36 50 -1 -1 0.000 0 0 7 0 0 5
3852 4041 4066 4041 4066 4211 3852 4211 3852 4041
2 2 0 1 0 36 50 -1 -1 0.000 0 0 7 0 0 5
4066 4041 4279 4041 4279 4211 4066 4211 4066 4041
2 2 0 1 0 37 50 -1 -1 0.000 0 0 7 0 0 5
1937 4041 2149 4041 2149 4211 1937 4211 1937 4041
2 2 0 1 0 37 50 -1 -1 0.000 0 0 7 0 0 5
2149 4041 2362 4041 2362 4211 2149 4211 2149 4041
2 2 0 1 0 37 50 -1 -1 0.000 0 0 7 0 0 5
2362 4041 2576 4041 2576 4211 2362 4211 2362 4041
2 2 0 1 0 37 50 -1 -1 0.000 0 0 7 0 0 5
2576 4041 2788 4041 2788 4211 2576 4211 2576 4041
2 2 0 1 0 37 50 -1 -1 0.000 0 0 7 0 0 5
2788 4041 3002 4041 3002 4211 2788 4211 2788 4041
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
3214 4041 3427 4041 3427 4211 3214 4211 3214 4041
2 2 0 1 0 36 50 -1 -1 0.000 0 0 7 0 0 5
4279 4041 4492 4041 4492 4211 4279 4211 4279 4041
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
3002 4041 3214 4041 3214 4211 3002 4211 3002 4041
2 2 0 1 0 37 50 -1 20 0.000 0 0 7 0 0 5
2132 5689 2345 5689 2345 5858 2132 5858 2132 5689
2 2 0 1 0 7 50 -1 41 0.000 0 0 -1 0 0 5
3197 5689 3410 5689 3410 5858 3197 5858 3197 5689
2 2 0 1 0 37 50 -1 -1 0.000 0 0 7 0 0 5
2771 5689 2985 5689 2985 5858 2771 5858 2771 5689
2 2 0 1 0 36 50 -1 43 0.000 0 0 7 0 0 5
4262 5689 4475 5689 4475 5858 4262 5858 4262 5689
2 2 0 1 0 36 50 -1 43 0.000 0 0 7 0 0 5
4049 5689 4262 5689 4262 5858 4049 5858 4049 5689
2 2 0 1 0 7 50 -1 41 0.000 0 0 -1 0 0 5
2985 5689 3197 5689 3197 5858 2985 5858 2985 5689
2 2 0 1 0 37 50 -1 -1 0.000 0 0 7 0 0 5
2345 5689 2559 5689 2559 5858 2345 5858 2345 5689
2 2 0 1 0 37 50 -1 20 0.000 0 0 7 0 0 5
1914 5687 2127 5687 2127 5856 1914 5856 1914 5687
2 2 0 1 0 36 50 -1 43 0.000 0 0 7 0 0 5
3835 5689 4049 5689 4049 5858 3835 5858 3835 5689
2 2 0 1 0 37 50 -1 -1 0.000 0 0 7 0 0 5
2559 5689 2771 5689 2771 5858 2559 5858 2559 5689
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 0 5
1 1 1.00 60.00 120.00
3330 4275 3330 4365 3330 4410 3330 4455 3330 4500
2 1 0 1 0 7 50 -1 -1 0.000 0 0 7 1 0 2
1 1 1.00 45.00 60.00
3880 5168 4140 5445
2 1 0 1 0 7 50 -1 -1 0.000 0 0 7 1 0 2
1 1 1.00 45.00 60.00
3025 5170 3205 5440
2 1 0 1 0 7 50 -1 -1 0.000 0 0 7 1 0 2
1 1 1.00 45.00 60.00
2805 5164 2653 5438
2 1 0 1 0 7 50 -1 -1 0.000 0 0 7 1 0 2
1 1 1.00 45.00 60.00
2577 5170 2103 5434
4 0 -1 50 -1 0 7 0.0000 2 120 645 4562 4168 Key Set $S$\001
4 0 -1 50 -1 0 7 0.0000 2 75 60 2008 3999 0\001
4 0 -1 50 -1 0 7 0.0000 2 75 60 2220 3999 1\001
4 0 -1 50 -1 0 7 0.0000 2 75 165 4314 3999 n-1\001
4 0 -1 50 -1 0 7 0.0000 2 75 60 1991 5985 0\001
4 0 -1 50 -1 0 7 0.0000 2 75 60 2203 5985 1\001
4 0 -1 50 -1 0 7 0.0000 2 75 165 4297 5985 n-1\001
4 0 -1 50 -1 0 7 0.0000 2 75 555 4545 5816 Hash Table\001
4 0 -1 50 -1 0 3 0.0000 2 75 450 1980 5625 MPHF$_0$\001
4 0 -1 50 -1 0 3 0.0000 2 75 450 2520 5625 MPHF$_1$\001
4 0 -1 50 -1 0 3 0.0000 2 75 450 3015 5625 MPHF$_2$\001
4 0 -1 50 -1 0 3 0.0000 2 75 1065 3825 5625 MPHF$_{\\lceil n/b \\rceil - 1}$\001
4 0 -1 50 -1 0 7 0.0000 2 105 585 1440 4455 Partitioning\001
4 0 -1 50 -1 0 7 0.0000 2 105 495 1440 5265 Searching\001

Binary file not shown.


153
vldb07/figs/brzfabiano.fig Executable file

@@ -0,0 +1,153 @@
#FIG 3.2 Produced by xfig version 3.2.5-alpha5
Landscape
Center
Metric
A4
100.00
Single
-2
1200 2
0 32 #bebebe
6 2025 3015 3555 3690
2 3 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 8
2025 3285 2295 3285 2295 3015 3285 3015 3285 3285 3555 3285
2790 3690 2025 3285
4 0 0 50 -1 0 10 0.0000 4 135 765 2385 3330 Partitioning\001
-6
6 1890 3735 3780 4365
6 2430 3735 2700 4365
6 2430 3915 2700 4365
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2430 4275 2700 4275 2700 4365 2430 4365 2430 4275
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2430 4185 2700 4185 2700 4275 2430 4275 2430 4185
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2430 4095 2700 4095 2700 4185 2430 4185 2430 4095
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2430 4005 2700 4005 2700 4095 2430 4095 2430 4005
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2430 3915 2700 3915 2700 4005 2430 4005 2430 3915
-6
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2430 3825 2700 3825 2700 3915 2430 3915 2430 3825
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2430 3735 2700 3735 2700 3825 2430 3825 2430 3735
-6
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
1890 4275 2160 4275 2160 4365 1890 4365 1890 4275
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
1890 4185 2160 4185 2160 4275 1890 4275 1890 4185
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2160 4275 2430 4275 2430 4365 2160 4365 2160 4275
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2160 4185 2430 4185 2430 4275 2160 4275 2160 4185
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2160 4095 2430 4095 2430 4185 2160 4185 2160 4095
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2160 4005 2430 4005 2430 4095 2160 4095 2160 4005
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2160 3915 2430 3915 2430 4005 2160 4005 2160 3915
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2700 4275 2970 4275 2970 4365 2700 4365 2700 4275
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2700 4185 2970 4185 2970 4275 2700 4275 2700 4185
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2700 4095 2970 4095 2970 4185 2700 4185 2700 4095
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2700 4005 2970 4005 2970 4095 2700 4095 2700 4005
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2160 3825 2430 3825 2430 3915 2160 3915 2160 3825
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
3240 4275 3510 4275 3510 4365 3240 4365 3240 4275
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
3510 4275 3780 4275 3780 4365 3510 4365 3510 4275
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2970 4275 3240 4275 3240 4365 2970 4365 2970 4275
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
3240 4185 3510 4185 3510 4275 3240 4275 3240 4185
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
1890 4095 2160 4095 2160 4185 1890 4185 1890 4095
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
3510 4185 3780 4185 3780 4275 3510 4275 3510 4185
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
3240 4095 3510 4095 3510 4185 3240 4185 3240 4095
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
3240 4005 3510 4005 3510 4095 3240 4095 3240 4005
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
3240 3915 3510 3915 3510 4005 3240 4005 3240 3915
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
1890 4365 3780 4365
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2970 4185 3240 4185 3240 4275 2970 4275 2970 4185
-6
6 1260 5310 4230 5580
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
1260 5400 4230 5400
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
1530 5310 1800 5310 1800 5400 1530 5400 1530 5310
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2070 5310 2340 5310 2340 5400 2070 5400 2070 5310
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2340 5310 2610 5310 2610 5400 2340 5400 2340 5310
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2610 5310 2880 5310 2880 5400 2610 5400 2610 5310
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2880 5310 3150 5310 3150 5400 2880 5400 2880 5310
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
3420 5310 3690 5310 3690 5400 3420 5400 3420 5310
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
3690 5310 3960 5310 3960 5400 3690 5400 3690 5310
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
3960 5310 4230 5310 4230 5400 3960 5400 3960 5310
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
1800 5310 2070 5310 2070 5400 1800 5400 1800 5310
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
3150 5310 3420 5310 3420 5400 3150 5400 3150 5310
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
1260 5310 1530 5310 1530 5400 1260 5400 1260 5310
4 0 0 50 -1 0 10 0.0000 4 105 210 4005 5580 n-1\001
4 0 0 50 -1 0 10 0.0000 4 105 75 1350 5580 0\001
-6
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
1260 2925 4230 2925
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
1530 2835 1800 2835 1800 2925 1530 2925 1530 2835
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2070 2835 2340 2835 2340 2925 2070 2925 2070 2835
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2340 2835 2610 2835 2610 2925 2340 2925 2340 2835
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2610 2835 2880 2835 2880 2925 2610 2925 2610 2835
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
2880 2835 3150 2835 3150 2925 2880 2925 2880 2835
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
3420 2835 3690 2835 3690 2925 3420 2925 3420 2835
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
3690 2835 3960 2835 3960 2925 3690 2925 3690 2835
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
3960 2835 4230 2835 4230 2925 3960 2925 3960 2835
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
1800 2835 2070 2835 2070 2925 1800 2925 1800 2835
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
3150 2835 3420 2835 3420 2925 3150 2925 3150 2835
2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
1260 2835 1530 2835 1530 2925 1260 2925 1260 2835
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
3510 4410 3510 4590
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
3510 4410 3600 4410
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
3690 4410 3780 4410
2 3 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 8
2025 4815 2295 4815 2295 4545 3285 4545 3285 4815 3555 4815
2790 5220 2025 4815
2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
3780 4410 3780 4590
4 0 0 50 -1 0 10 0.0000 4 135 585 2475 4860 Searching\001
4 0 0 50 -1 0 10 0.0000 4 105 75 1980 4545 0\001
4 0 0 50 -1 0 10 0.0000 4 105 690 4410 5400 Hash Table\001
4 0 0 50 -1 0 10 0.0000 4 105 480 4410 4230 Buckets\001
4 0 0 50 -1 0 10 0.0000 4 135 555 4410 2925 Key set S\001
4 0 0 50 -1 0 10 0.0000 4 105 75 1350 2745 0\001
4 0 0 50 -1 0 10 0.0000 4 105 210 4005 2745 n-1\001
4 0 0 50 -1 0 10 0.0000 4 105 420 3555 4545 n/b - 1\001

Binary file not shown.


109
vldb07/introduction.tex Executable file

@@ -0,0 +1,109 @@
%% Nivio: 22/jan/06 23/jan/06 29/jan
% Time-stamp: <Monday 30 Jan 2006 03:52:42am EDT yoshi@ime.usp.br>
\section{Introduction}
\label{sec:intro}
\enlargethispage{2\baselineskip}
Suppose~$U$ is a universe of \textit{keys} of size $u$.
Let $h:U\to M$ be a {\em hash function} that maps the keys from~$U$
to a given interval of integers $M=[0,m-1]=\{0,1,\dots,m-1\}$.
Let~$S\subseteq U$ be a set of~$n$ keys from~$U$, where $ n \ll u$.
Given a key~$x\in S$, the hash function~$h$ computes an integer in
$[0,m-1]$ for the storage or retrieval of~$x$ in a {\em hash table}.
% Hashing methods for {\em non-static sets} of keys can be used to construct
% data structures storing $S$ and supporting membership queries
% ``$x \in S$?'' in expected time $O(1)$.
% However, they involve a certain amount of wasted space owing to unused
% locations in the table and waisted time to resolve collisions when
% two keys are hashed to the same table location.
A perfect hash function maps a {\em static set} $S$ of $n$ keys from $U$ into a set of $m$
integers without collisions, where $m$ is greater than or equal to $n$.
If $m$ is equal to $n$, the function is called minimal.
% Figure~\ref{fig:minimalperfecthash-ph-mph}(a) illustrates a perfect hash function and
% Figure~\ref{fig:minimalperfecthash-ph-mph}(b) illustrates a minimal perfect hash function (MPHF).
%
% \begin{figure}
% \centering
% \scalebox{0.7}{\epsfig{file=figs/minimalperfecthash-ph-mph.ps}}
% \caption{(a) Perfect hash function (b) Minimal perfect hash function (MPHF)}
% \label{fig:minimalperfecthash-ph-mph}
% %\vspace{-5mm}
% \end{figure}
Minimal perfect hash functions are widely used for memory efficient storage and fast
retrieval of items from static sets, such as words in natural languages,
reserved words in programming languages or interactive systems, uniform resource
locators (URLs) in web search engines, or item sets in data mining techniques.
Search engines are nowadays indexing tens of billions of pages and algorithms
like PageRank~\cite{Brin1998}, which uses the web link structure to derive a
measure of popularity for Web pages, would benefit from a MPHF for storage and
retrieval of such huge sets of URLs.
For instance, the TodoBr\footnote{TodoBr ({\texttt www.todobr.com.br}) is a trademark of
Akwan Information Technologies, which was acquired by Google Inc. in July 2005.}
search engine used the algorithm proposed hereinafter to
improve and to scale its link analysis system.
The WebGraph research group~\cite{bv04} would
also benefit from a MPHF for sets on the order of billions of URLs to scale
and to improve the storage requirements of their algorithms for graph compression.
Another interesting application for MPHFs is their use as an indexing structure
for databases.
The B+ tree is very popular as an indexing structure for dynamic applications
with frequent insertions and deletions of records.
However, for applications with sporadic modifications and a huge number of
queries, the B+ tree is not the best option,
because it performs poorly with very large sets of keys
such as those required for the new frontiers of database applications~\cite{s05}.
Therefore, there are applications for MPHFs in
information retrieval systems, database systems, language translation systems,
electronic commerce systems, compilers, operating systems, among others.
Until now, because of the limitations of current algorithms,
the use of MPHFs has been restricted to scenarios where the set of keys being hashed is
relatively small.
However, in many cases it is crucial to deal in an efficient way with very large
sets of keys.
Due to the exponential growth of the Web, working with huge collections is becoming
a daily task.
For instance, the simple assignment of number identifiers to web pages of a collection
can be a challenging task.
While traditional databases simply cannot handle more traffic once the working
set of URLs does not fit in main memory anymore~\cite{s05}, the algorithm we propose here to
construct MPHFs can easily scale to billions of entries.
% using stock hardware.
As there are many applications for MPHFs, it is
important to design and implement space and time efficient algorithms for
constructing such functions.
The attractiveness of using MPHFs depends on the following issues:
\begin{enumerate}
\item The amount of CPU time required by the algorithms for constructing MPHFs.
\item The space requirements of the algorithms for constructing MPHFs.
\item The amount of CPU time required by a MPHF for each retrieval.
\item The space requirements of the description of the resulting MPHFs to be
used at retrieval time.
\end{enumerate}
\enlargethispage{2\baselineskip}
This paper presents a novel external memory based algorithm for constructing MPHFs that
is very efficient in these four requirements.
First, the algorithm constructs a MPHF in time linear in the number of keys,
which is optimal.
For instance, for a collection of 1 billion URLs
collected from the web, each one 64 characters long on average, the time to construct a
MPHF using a 2.4 gigahertz PC with 500 megabytes of available main memory
is approximately 3 hours.
Second, the algorithm needs a small, a priori defined vector of $\lceil n/b \rceil$
one-byte entries in main memory to construct a MPHF.
For the collection of 1 billion URLs and using $b=175$, the algorithm needs only
5.45 megabytes of internal memory.
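This figure is simply the number of one-byte entries (with $2^{20}$ bytes per megabyte):
\begin{displaymath}
\left\lceil \frac{10^9}{175} \right\rceil = 5\,714\,286 \mbox{ bytes} \approx 5.45 \mbox{ megabytes}.
\end{displaymath}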
Third, the evaluation of the MPHF for each retrieval requires three memory accesses and
the computation of three universal hash functions.
This is not optimal as any MPHF requires at least one memory access and the computation
of two universal hash functions.
Fourth, the description of a MPHF takes a constant number of bits for each key, which is optimal.
For the collection of 1 billion URLs, it needs 8.1 bits for each key,
while the theoretical lower bound is $1/\ln2 \approx 1.4427$ bits per
key~\cite{m84}.

17
vldb07/makefile Executable file
View File

@ -0,0 +1,17 @@
all:
	latex vldb.tex
	bibtex vldb
	latex vldb.tex
	latex vldb.tex
	dvips vldb.dvi -o vldb.ps
	ps2pdf vldb.ps
	chmod -R g+rwx *
perm:
	chmod -R g+rwx *
run: clean all
	gv vldb.ps &
clean:
	rm -f *.aux *.bbl *.blg *.log *.ps *.pdf *.dvi

141
vldb07/partitioningthekeys.tex Executable file
View File

@ -0,0 +1,141 @@
%% Nivio: 21/jan/06
% Time-stamp: <Monday 30 Jan 2006 03:57:28am EDT yoshi@ime.usp.br>
\vspace{-2mm}
\subsection{Partitioning step}
\label{sec:partitioning-keys}
The set $S$ of $n$ keys is partitioned into $\lceil n/b \rceil$ buckets,
where $b$ is a suitable parameter chosen to guarantee
that each bucket has at most 256 keys with high probability
(see Section~\ref{sec:determining-b}).
The partitioning step works as follows:
\begin{figure}[h]
\hrule
\hrule
\vspace{2mm}
\begin{tabbing}
aa\=type booleanx \== (false, true); \kill
\> $\blacktriangleright$ Let $\beta$ be the size in bytes of the set $S$ \\
\> $\blacktriangleright$ Let $\mu$ be the size in bytes of an a priori reserved \\
\> ~~~ internal memory area \\
\> $\blacktriangleright$ Let $N = \lceil \beta/\mu \rceil$ be the number of key blocks that will \\
\> ~~~ be read from disk into an internal memory area \\
\> $\blacktriangleright$ Let $\mathit{size}$ be a vector that stores the size of each bucket \\
\> $1.$ {\bf for} $j = 1$ {\bf to} $N$ {\bf do} \\
\> ~~ $1.1$ Read block $B_j$ of keys from disk \\
\> ~~ $1.2$ Cluster $B_j$ into $\lceil n/b \rceil$ buckets using a bucket sort \\
\> ~~~~~~~ algorithm and update the entries in the vector {\it size} \\
\> ~~ $1.3$ Dump $B_j$ to the disk into File $j$\\
\> $2.$ Compute the {\it offset} vector and dump it to the disk.
\end{tabbing}
\hrule
\hrule
\vspace{-1.0mm}
\caption{Partitioning step}
\vspace{-3mm}
\label{fig:partitioningstep}
\end{figure}
Statement 1.1 of the {\bf for} loop presented in Figure~\ref{fig:partitioningstep}
reads sequentially all the keys of block $B_j$ from disk into an internal area
of size $\mu$.
Statement 1.2 performs an indirect bucket sort of the keys in block $B_j$
and at the same time updates the entries in the vector {\em size}.
Let us briefly describe how~$B_j$ is partitioned among the~$\lceil n/b\rceil$
buckets.
We use a local array of $\lceil n/b \rceil$ counters to store a
count of how many keys from $B_j$ belong to each bucket.
%At the same time, the global vector {\it size} is computed based on the local
%counters.
The pointers to the keys in each bucket $i$, $0 \leq i < \lceil n/b \rceil$,
are stored in contiguous positions in an array.
For this we first reserve the required number of entries
in this array of pointers using the information from the array of counters.
Next, we place the pointers to the keys in each bucket into the respective
reserved areas in the array (i.e., we place the pointers to the keys in bucket 0,
followed by the pointers to the keys in bucket 1, and so on).
\enlargethispage{2\baselineskip}
To find the bucket address of a given key
we use the universal hash function $h_0(k)$~\cite{j97}.
Key~$k$ goes into bucket~$i$, where
%Then, for each integer $h_0(k)$ the respective bucket address is obtained
%as follows:
\begin{eqnarray} \label{eq:bucketindex}
i=h_0(k) \bmod \left \lceil \frac{n}{b} \right \rceil.
\end{eqnarray}
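To make statement 1.2 and Eq.~(\ref{eq:bucketindex}) concrete, the sketch below
shows one way to implement the indirect bucket sort in C. It is illustrative
only: all identifiers are ours, and the stand-in hash is not the Jenkins
function used in our implementation.
\begin{lstlisting}[language=C]
#include <stdlib.h>

/* Stand-in for the universal hash h0 of the paper (the real
   implementation uses Jenkins' function); illustrative only. */
static unsigned int h0(const char *k) {
    unsigned int h = 5381;
    while (*k) h = 33 * h + (unsigned char)*k++;
    return h;
}

/* Statement 1.2: indirect bucket sort of one block of keys.
   Keys are not moved; pointers to them are placed, bucket by
   bucket, into contiguous slices of ptrs[], and the global
   one-byte vector size[] is updated. */
void cluster_block(char **block, int nkeys, int nbuckets,
                   char **ptrs, unsigned char *size)
{
    int *count = calloc(nbuckets, sizeof *count);
    int *next = malloc(nbuckets * sizeof *next);
    int i, j;

    for (j = 0; j < nkeys; j++)            /* local counters */
        count[h0(block[j]) % nbuckets]++;
    next[0] = 0;                           /* reserve slices */
    for (i = 1; i < nbuckets; i++)
        next[i] = next[i - 1] + count[i - 1];
    for (j = 0; j < nkeys; j++) {          /* place pointers */
        i = h0(block[j]) % nbuckets;
        ptrs[next[i]++] = block[j];
        size[i]++;
    }
    free(count);
    free(next);
}
\end{lstlisting}
The two passes over the block keep the step linear in $|B_j|$, matching the
analysis in Section~\ref{sec:analytcal-results}.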
Figure~\ref{fig:brz-partitioning}(a) shows a \emph{logical} view of the
$\lceil n/b \rceil$ buckets generated in the partitioning step.
%In this case, the keys of each bucket are put together by the pointers to
%each key stored
%in contiguous positions in the array of pointers.
In reality, the keys belonging to each bucket are distributed among many files,
as depicted in Figure~\ref{fig:brz-partitioning}(b).
In the example of Figure~\ref{fig:brz-partitioning}(b), the keys in bucket 0
appear in files 1 and $N$, the keys in bucket 1 appear in files 1, 2
and $N$, and so on.
\vspace{-7mm}
\begin{figure}[ht]
\centering
\begin{picture}(0,0)%
\includegraphics{figs/brz-partitioning.ps}%
\end{picture}%
\setlength{\unitlength}{4144sp}%
%
\begingroup\makeatletter\ifx\SetFigFont\undefined%
\gdef\SetFigFont#1#2#3#4#5{%
\reset@font\fontsize{#1}{#2pt}%
\fontfamily{#3}\fontseries{#4}\fontshape{#5}%
\selectfont}%
\fi\endgroup%
\begin{picture}(4371,1403)(1,-6977)
\put(333,-6421){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}0}}}}
\put(545,-6421){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}1}}}}
\put(759,-6421){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}2}}}}
\put(1539,-6421){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}${\lceil n/b\rceil - 1}$}}}}
\put(541,-6676){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}Buckets Logical View}}}}
\put(3547,-6120){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}}
\put(3547,-6188){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}}
\put(3547,-6255){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}}
\put(3107,-6120){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}}
\put(3107,-6188){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}}
\put(3107,-6255){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}}
\put(4177,-6224){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}}
\put(4177,-6269){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}}
\put(4177,-6314){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}}
\put(3016,-6721){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}File 1}}}}
\put(3466,-6721){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}File 2}}}}
\put(4096,-6721){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}File N}}}}
\put(3196,-6946){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}Buckets Physical View}}}}
\end{picture}%
\caption{Situation of the buckets at the end of the partitioning step: (a) Logical view (b) Physical view}
\label{fig:brz-partitioning}
\vspace{-2mm}
\end{figure}
This scattering of the keys in the buckets could generate a performance
problem because of the potential number of seeks
needed to read the keys in each bucket from the $N$ files in disk
during the searching step.
But, as we show later in Section~\ref{sec:analytcal-results}, the number of seeks
can be kept small using buffering techniques.
Considering that only the vector {\it size}, which has $\lceil n/b \rceil$
one-byte entries (remember that each bucket has at most 256 keys),
must be maintained in main memory during the searching step,
almost all main memory is available to be used as disk I/O buffer.
The last step is to compute the {\it offset} vector and dump it to the disk.
We use the vector $\mathit{size}$ to compute the
$\mathit{offset}$ displacement vector.
The $\mathit{offset}[i]$ entry contains the number of keys
in the buckets $0, 1, \dots, i-1$.
As {\it size}$[i]$ stores the number of keys
in bucket $i$, where $0 \leq i <\lceil n/b \rceil$, we have
\begin{displaymath}
\mathit{offset}[i] = \sum_{j=0}^{i-1} \mathit{size}[j].
\end{displaymath}
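In C, this computation is a single prefix-sum pass over {\it size} (the
identifiers below are illustrative, not those of our implementation):
\begin{lstlisting}[language=C]
/* offset[i] = number of keys in buckets 0..i-1, so offset[0] = 0;
   size[] is the one-byte vector computed in the partitioning step. */
void compute_offset(const unsigned char *size, int nbuckets,
                    unsigned long *offset)
{
    unsigned long sum = 0;
    int i;
    for (i = 0; i < nbuckets; i++) {
        offset[i] = sum;   /* keys before bucket i */
        sum += size[i];
    }
}
\end{lstlisting}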

View File

@ -0,0 +1,113 @@
% Nivio: 29/jan/06
% Time-stamp: <Monday 30 Jan 2006 12:13:14pm EST yoshi@flare>
\subsection{Performance of the new algorithm}
\label{sec:performance}
%As we have done for the internal memory based algorithm,
The runtime of our algorithm is also a random variable, but now it follows a
(highly concentrated) normal distribution, as we discuss at the end of this
section. Again, we are interested in verifying the linearity claim made in
Section~\ref{sec:linearcomplexity}. Therefore, we ran the algorithm for
several values of the number $n$ of keys in $S$.
The values chosen for $n$ were $1, 2, 4, 8, 16, 32, 64, 128, 512$ and $1000$
million.
%Just the small vector {\it size} must be kept in main memory,
%as we saw in Section~\ref{sec:memconstruction}.
We limited the main memory to 500 megabytes for the experiments.
The size $\mu$ of the a priori reserved internal memory area
was set to 250 megabytes, the parameter $b$ was set to $175$ and
the building block algorithm parameter $c$ was again set to $1$.
In Section~\ref{sec:contr-disk-access} we show how $\mu$
affects the runtime of the algorithm. The other two parameters
have insignificant influence on the runtime.
We again use a statistical method for determining a suitable sample size
%~\cite[Chapter 13]{j91}
to estimate the number of trials to be run for each value of $n$. We found that
just one trial for each $n$ would be enough with a confidence level of $95\%$.
However, we ran 10 trials. This number of trials may seem rather small but, as
shown below, the behavior of our algorithm is very stable and its runtime is
almost deterministic (i.e., the standard deviation is very small).
Table~\ref{tab:mediasbrz} presents the runtime average for each $n$,
the respective standard deviations, and
the respective confidence intervals given by
the average time $\pm$ the distance from average time
considering a confidence level of $95\%$.
Observing the runtime averages we noticed that
the algorithm runs in expected linear time,
as shown in~Section~\ref{sec:linearcomplexity}. Better still,
it is only approximately $60\%$ slower than our internal memory based algorithm.
To get that value we used the linear regression model obtained for the runtime of
the internal memory based algorithm to estimate how much time it would require
for constructing a MPHF for a set of 1 billion keys.
We got 2.3 hours for the internal memory based algorithm and we measured
3.67 hours on average for our algorithm.
Increasing the size of the internal memory area
from 250 to 600 megabytes (see Section~\ref{sec:contr-disk-access}),
we have brought the time down to 3.09 hours. In this setup, our algorithm is
just $34\%$ slower.
\enlargethispage{2\baselineskip}
\begin{table*}[htb]
\vspace{-1mm}
\begin{center}
{\scriptsize
\begin{tabular}{|l|c|c|c|c|c|}
\hline
$n$ (millions) & 1 & 2 & 4 & 8 & 16 \\
\hline % Part. 16 \% 16 \% 16 \% 18 \% 20\%
Average time (s) & $6.9 \pm 0.3$ & $13.8 \pm 0.2$ & $31.9 \pm 0.7$ & $69.9 \pm 1.1$ & $140.6 \pm 2.5$ \\
SD & $0.4$ & $0.2$ & $0.9$ & $1.5$ & $3.5$ \\
\hline
\hline
$n$ (millions) & 32 & 64 & 128 & 512 & 1000 \\
\hline % Part. 20 \% 20\% 20\% 18\% 18\%
Average time (s) & $284.3 \pm 1.1$ & $587.9 \pm 3.9$ & $1223.6 \pm 4.9$ & $5966.4 \pm 9.5$ & $13229.5 \pm 12.7$ \\
SD & $1.6$ & $5.5$ & $6.8$ & $13.2$ & $18.6$ \\
\hline
\end{tabular}
\vspace{-1mm}
}
\end{center}
\caption{Our algorithm: average time in seconds for constructing a MPHF,
the standard deviation (SD), and the confidence intervals considering
a confidence level of $95\%$.
}
\label{tab:mediasbrz}
\vspace{-5mm}
\end{table*}
Figure~\ref{fig:brz_temporegressao}
presents the runtime for each trial. In addition,
the solid line corresponds to a linear regression model
obtained from the experimental measurements.
As expected, the runtime for a given $n$ shows almost no
variation.
\begin{figure}[htb]
\begin{center}
\scalebox{0.4}{\includegraphics{figs/brz_temporegressao.eps}}
\caption{Time versus number of keys in $S$ for our algorithm. The solid line corresponds to
a linear regression model.}
\label{fig:brz_temporegressao}
\end{center}
\vspace{-9mm}
\end{figure}
An intriguing observation is that the runtime of the algorithm is almost
deterministic, in spite of the fact that it uses as building block an
algorithm with a considerable fluctuation in its runtime. A given bucket~$i$,
$0 \leq i < \lceil n/b \rceil$, is a small set of keys (at most 256 keys) and,
as argued in Section~\ref{sec:intern-memory-algor}, the runtime of the
building block algorithm is a random variable~$X_i$ with high fluctuation.
However, the runtime~$Y$ of the searching step of our algorithm is given
by~$Y=\sum_{0\leq i<\lceil n/b\rceil}X_i$. Under the hypothesis that
the~$X_i$ are independent and bounded, the {\it law of large numbers} (see,
e.g., \cite{j91}) implies that the random variable $Y/\lceil n/b\rceil$
converges to a constant as~$n\to\infty$. This explains why the runtime of our
algorithm is almost deterministic.

814
vldb07/references.bib Executable file
View File

@ -0,0 +1,814 @@
@InProceedings{Brin1998,
author = "Sergey Brin and Lawrence Page",
title = "The Anatomy of a Large-Scale Hypertextual Web Search Engine",
booktitle = "Proceedings of the 7th International {World Wide Web}
Conference",
pages = "107--117",
address = "Brisbane, Australia",
month = "April",
year = 1998,
annote = "The Google paper."
}
@inproceedings{p99,
author = {R. Pagh},
title = {Hash and Displace: Efficient Evaluation of Minimal Perfect Hash Functions},
booktitle = {Workshop on Algorithms and Data Structures},
pages = {49-54},
year = 1999,
url = {citeseer.nj.nec.com/pagh99hash.html},
key = {author}
}
@article{p00,
author = {R. Pagh},
title = {Faster deterministic dictionaries},
journal = {Symposium on Discrete Algorithms (ACM SODA)},
OPTvolume = {43},
OPTnumber = {5},
pages = {487--493},
year = {2000}
}
@article{g81,
author = {G. H. Gonnet},
title = {Expected Length of the Longest Probe Sequence in Hash Code Searching},
journal = {J. ACM},
volume = {28},
number = {2},
year = {1981},
issn = {0004-5411},
pages = {289--304},
doi = {http://doi.acm.org/10.1145/322248.322254},
publisher = {ACM Press},
address = {New York, NY, USA},
}
@misc{r04,
author = "S. Rao",
title = "Combinatorial Algorithms Data Structures",
year = 2004,
howpublished = {CS 270 Spring},
url = "citeseer.ist.psu.edu/700201.html"
}
@article{ra98,
author = {Martin Raab and Angelika Steger},
title = {``{B}alls into Bins'' --- {A} Simple and Tight Analysis},
journal = {Lecture Notes in Computer Science},
volume = 1518,
pages = {159--170},
year = 1998,
url = "citeseer.ist.psu.edu/raab98balls.html"
}
@misc{mrs00,
author = "M. Mitzenmacher and A. Richa and R. Sitaraman",
title = "The power of two random choices: A survey of the techniques and results",
howpublished={In Handbook of Randomized
Computing, P. Pardalos, S. Rajasekaran, and J. Rolim, Eds. Kluwer},
year = "2000",
url = "citeseer.ist.psu.edu/article/mitzenmacher00power.html"
}
@article{dfm02,
author = {E. Drinea and A. Frieze and M. Mitzenmacher},
title = {Balls and bins models with feedback},
journal = {Symposium on Discrete Algorithms (ACM SODA)},
pages = {308--315},
year = {2002}
}
@Article{j97,
author = {Bob Jenkins},
title = {Algorithm Alley: Hash Functions},
journal = {Dr. Dobb's Journal of Software Tools},
volume = {22},
number = {9},
month = {September},
year = {1997}
}
@article{gss01,
author = {N. Galli and B. Seybold and K. Simon},
title = {Tetris-Hashing or optimal table compression},
journal = {Discrete Applied Mathematics},
volume = {110},
number = {1},
pages = {41--58},
month = {June},
publisher = {Elsevier Science},
year = {2001}
}
@article{s05,
author = {M. Seltzer},
title = {Beyond Relational Databases},
journal = {ACM Queue},
volume = {3},
number = {3},
month = {April},
year = {2005}
}
@InProceedings{ss89,
author = {P. Schmidt and A. Siegel},
title = {On aspects of universality and performance for closed hashing},
booktitle = {Proc. 21th Ann. ACM Symp. on Theory of Computing -- STOC'89},
month = {May},
year = {1989},
pages = {355--366}
}
@article{asw00,
author = {M. Atici and D. R. Stinson and R. Wei},
title = {A new practical algorithm for the construction of a perfect hash function},
journal = {Journal Combin. Math. Combin. Comput.},
volume = {35},
pages = {127--145},
year = {2000}
}
@article{swz00,
author = {D. R. Stinson and R. Wei and L. Zhu},
title = {New constructions for perfect hash families and related structures using combinatorial designs and codes},
journal = {Journal Combin. Designs.},
volume = {8},
pages = {189--200},
year = {2000}
}
@inproceedings{ht01,
author = {T. Hagerup and T. Tholey},
title = {Efficient minimal perfect hashing in nearly minimal space},
booktitle = {The 18th Symposium on Theoretical Aspects of Computer Science (STACS), volume 2010 of Lecture Notes in Computer Science},
year = 2001,
pages = {317--326},
key = {author}
}
@inproceedings{dh01,
author = {M. Dietzfelbinger and T. Hagerup},
title = {Simple minimal perfect hashing in less space},
booktitle = {The 9th European Symposium on Algorithms (ESA), volume 2161 of Lecture Notes in Computer Science},
year = 2001,
pages = {109--120},
key = {author}
}
@MastersThesis{mar00,
author = {M. S. Neubert},
title = {Algoritmos Distribu\'{\i}dos para a Constru\c{c}\~ao de Arquivos Invertidos},
school = {Departamento de Ci\^encia da Computa\c{c}\~ao, Universidade Federal de Minas Gerais},
year = 2000,
month = {March},
key = {author}
}
@Book{clrs01,
author = {T. H. Cormen and C. E. Leiserson and R. L. Rivest and C. Stein},
title = {Introduction to Algorithms},
publisher = {MIT Press},
year = {2001},
edition = {second},
}
@Book{j91,
author = {R. Jain},
title = {The art of computer systems performance analysis: techniques for experimental design, measurement, simulation, and modeling. },
publisher = {John Wiley},
year = {1991},
edition = {first}
}
@Book{k73,
author = {D. E. Knuth},
title = {The Art of Computer Programming: Sorting and Searching},
publisher = {Addison-Wesley},
volume = {3},
year = {1973},
edition = {second},
}
@inproceedings{hmwc93,
author = {G. Havas and B.S. Majewski and N.C. Wormald and Z.J. Czech},
title = {Graphs, Hypergraphs and Hashing},
booktitle = {19th International Workshop on Graph-Theoretic Concepts in Computer Science},
publisher = {Springer Lecture Notes in Computer Science vol. 790},
pages = {153-165},
year = 1993,
key = {author}
}
@inproceedings{bkz05,
author = {F.C. Botelho and Y. Kohayakawa and N. Ziviani},
title = {A Practical Minimal Perfect Hashing Method},
booktitle = {4th International Workshop on Efficient and Experimental Algorithms},
publisher = {Springer Lecture Notes in Computer Science vol. 3503},
pages = {488-500},
month = {May},
year = 2005,
key = {author}
}
@Article{chm97,
author = {Z.J. Czech and G. Havas and B.S. Majewski},
title = {Fundamental Study Perfect Hashing},
journal = {Theoretical Computer Science},
volume = {182},
year = {1997},
pages = {1-143},
key = {author}
}
@article{chm92,
author = {Z.J. Czech and G. Havas and B.S. Majewski},
title = {An Optimal Algorithm for Generating Minimal Perfect Hash Functions},
journal = {Information Processing Letters},
volume = {43},
number = {5},
pages = {257-264},
year = {1992},
url = {citeseer.nj.nec.com/czech92optimal.html},
key = {author}
}
@Article{mwhc96,
author = {B.S. Majewski and N.C. Wormald and G. Havas and Z.J. Czech},
title = {A family of perfect hashing methods},
journal = {The Computer Journal},
year = {1996},
volume = {39},
number = {6},
pages = {547-554},
key = {author}
}
@InProceedings{bv04,
author = {P. Boldi and S. Vigna},
title = {The WebGraph Framework I: Compression Techniques},
booktitle = {13th International World Wide Web Conference},
pages = {595--602},
year = {2004}
}
@Book{z04,
author = {N. Ziviani},
title = {Projeto de Algoritmos com Implementa\c{c}\~oes em Pascal e C},
publisher = {Pioneira Thompson},
year = 2004,
edition = {segunda edi\c{c}\~ao}
}
@Book{p85,
author = {E. M. Palmer},
title = {Graphical Evolution: An Introduction to the Theory of Random Graphs},
publisher = {John Wiley \& Sons},
year = {1985},
address = {New York}
}
@Book{imb99,
author = {I.H. Witten and A. Moffat and T.C. Bell},
title = {Managing Gigabytes: Compressing and Indexing Documents and Images},
publisher = {Morgan Kaufmann Publishers},
year = 1999,
edition = {second edition}
}
@Book{wfe68,
author = {W. Feller},
title = { An Introduction to Probability Theory and Its Applications},
publisher = {Wiley},
year = 1968,
volume = 1,
optedition = {second edition}
}
@Article{fhcd92,
author = {E.A. Fox and L. S. Heath and Q. Chen and A.M. Daoud},
title = {Practical Minimal Perfect Hash Functions For Large Databases},
journal = {Communications of the ACM},
year = {1992},
volume = {35},
number = {1},
pages = {105--121}
}
@inproceedings{fch92,
author = {E.A. Fox and Q.F. Chen and L.S. Heath},
title = {A Faster Algorithm for Constructing Minimal Perfect Hash Functions},
booktitle = {Proceedings of the 15th Annual International ACM SIGIR Conference
on Research and Development in Information Retrieval},
year = {1992},
pages = {266-273},
}
@article{c80,
author = {R.J. Cichelli},
title = {Minimal perfect hash functions made simple},
journal = {Communications of the ACM},
volume = {23},
number = {1},
year = {1980},
issn = {0001-0782},
pages = {17--19},
doi = {http://doi.acm.org/10.1145/358808.358813},
publisher = {ACM Press},
}
@TechReport{fhc89,
author = {E.A. Fox and L.S. Heath and Q.F. Chen},
title = {An $O(n\log n)$ algorithm for finding minimal perfect hash functions},
institution = {Virginia Polytechnic Institute and State University},
year = {1989},
OPTkey = {},
OPTtype = {},
OPTnumber = {},
address = {Blacksburg, VA},
month = {April},
OPTnote = {},
OPTannote = {}
}
@TechReport{bkz06t,
author = {F.C. Botelho and Y. Kohayakawa and N. Ziviani},
title = {An Approach for Minimal Perfect Hash Functions in Very Large Databases},
institution = {Department of Computer Science, Federal University of Minas Gerais},
note = {Available at http://www.dcc.ufmg.br/\texttt{\~ }nivio/pub/technicalreports.html},
year = {2006},
OPTkey = {},
OPTtype = {},
number = {RT.DCC.003},
address = {Belo Horizonte, MG, Brazil},
month = {April},
OPTannote = {}
}
@inproceedings{fcdh90,
author = {E.A. Fox and Q.F. Chen and A.M. Daoud and L.S. Heath},
title = {Order preserving minimal perfect hash functions and information retrieval},
booktitle = {Proceedings of the 13th annual international ACM SIGIR conference on Research and development in information retrieval},
year = {1990},
isbn = {0-89791-408-2},
pages = {279--311},
location = {Brussels, Belgium},
doi = {http://doi.acm.org/10.1145/96749.98233},
publisher = {ACM Press},
}
@Article{fkp89,
author = {P. Flajolet and D. E. Knuth and B. Pittel},
title = {The first cycles in an evolving graph},
journal = {Discrete Math},
year = {1989},
volume = {75},
pages = {167-215},
}
@Article{s77,
author = {R. Sprugnoli},
title = {Perfect Hashing Functions: A Single Probe Retrieving
Method For Static Sets},
journal = {Communications of the ACM},
year = {1977},
volume = {20},
number = {11},
pages = {841--850},
month = {November},
}
@Article{j81,
author = {G. Jaeschke},
title = {Reciprocal Hashing: A method For Generating Minimal Perfect
Hashing Functions},
journal = {Communications of the ACM},
year = {1981},
volume = {24},
number = {12},
month = {December},
pages = {829--833}
}
@Article{c84,
author = {C. C. Chang},
title = {The Study Of An Ordered Minimal Perfect Hashing Scheme},
journal = {Communications of the ACM},
year = {1984},
volume = {27},
number = {4},
month = {December},
pages = {384--387}
}
@Article{c86,
author = {C. C. Chang},
title = {Letter-Oriented Reciprocal Hashing Scheme},
journal = {Inform. Sci.},
year = {1986},
volume = {27},
pages = {243--255}
}
@Article{cl86,
author = {C. C. Chang and R. C. T. Lee},
title = {A Letter-Oriented Minimal Perfect Hashing Scheme},
journal = {Computer Journal},
year = {1986},
volume = {29},
number = {3},
month = {June},
pages = {277--281}
}
@Article{cc88,
author = {C. C. Chang and C. H. Chang},
title = {An Ordered Minimal Perfect Hashing Scheme with Single Parameter},
journal = {Inform. Process. Lett.},
year = {1988},
volume = {27},
number = {2},
month = {February},
pages = {79--83}
}
@Article{w90,
author = {V. G. Winters},
title = {Minimal Perfect Hashing in Polynomial Time},
journal = {BIT},
year = {1990},
volume = {30},
number = {2},
pages = {235--244}
}
@Article{fcdh91,
author = {E. A. Fox and Q. F. Chen and A. M. Daoud and L. S. Heath},
title = {Order Preserving Minimal Perfect Hash Functions and Information Retrieval},
journal = {ACM Trans. Inform. Systems},
year = {1991},
volume = {9},
number = {3},
month = {July},
pages = {281--308}
}
@Article{fks84,
author = {M. L. Fredman and J. Koml\'os and E. Szemer\'edi},
title = {Storing a sparse table with {O(1)} worst case access time},
journal = {J. ACM},
year = {1984},
volume = {31},
number = {3},
month = {July},
pages = {538--544}
}
@Article{dhjs83,
author = {M. W. Du and T. M. Hsieh and K. F. Jea and D. W. Shieh},
title = {The study of a new perfect hash scheme},
journal = {IEEE Trans. Software Eng.},
year = {1983},
volume = {9},
number = {3},
month = {May},
pages = {305--313}
}
@Article{bt94,
author = {M. D. Brain and A. L. Tharp},
title = {Using Tries to Eliminate Pattern Collisions in Perfect Hashing},
journal = {IEEE Trans. on Knowledge and Data Eng.},
year = {1994},
volume = {6},
number = {2},
month = {April},
pages = {239--247}
}
@Article{bt90,
author = {M. D. Brain and A. L. Tharp},
title = {Perfect hashing using sparse matrix packing},
journal = {Inform. Systems},
year = {1990},
volume = {15},
number = {3},
OPTmonth = {April},
pages = {281--290}
}
@Article{ckw93,
author = {C. C. Chang and H. C. Kowng and T. C. Wu},
title = {A refinement of a compression-oriented addressing scheme},
journal = {BIT},
year = {1993},
volume = {33},
number = {4},
OPTmonth = {April},
pages = {530--535}
}
@Article{cw91,
author = {C. C. Chang and T. C. Wu},
title = {A letter-oriented perfect hashing scheme based upon sparse table compression},
journal = {Software -- Practice Experience},
year = {1991},
volume = {21},
number = {1},
month = {January},
pages = {35--49}
}
@Article{ty79,
author = {R. E. Tarjan and A. C. C. Yao},
title = {Storing a sparse table},
journal = {Comm. ACM},
year = {1979},
volume = {22},
number = {11},
month = {November},
pages = {606--611}
}
@Article{yd85,
author = {W. P. Yang and M. W. Du},
title = {A backtracking method for constructing perfect hash functions from a set of mapping functions},
journal = {BIT},
year = {1985},
volume = {25},
number = {1},
pages = {148--164}
}
@Article{s85,
author = {T. J. Sager},
title = {A polynomial time generator for minimal perfect hash functions},
journal = {Commun. ACM},
year = {1985},
volume = {28},
number = {5},
month = {May},
pages = {523--532}
}
@Article{cm93,
author = {Z. J. Czech and B. S. Majewski},
title = {A linear time algorithm for finding minimal perfect hash functions},
journal = {The computer Journal},
year = {1993},
volume = {36},
number = {6},
pages = {579--587}
}
@Article{gbs94,
author = {R. Gupta and S. Bhaskar and S. Smolka},
title = {On randomization in sequential and distributed algorithms},
journal = {ACM Comput. Surveys},
year = {1994},
volume = {26},
number = {1},
month = {March},
pages = {7--86}
}
@InProceedings{sb84,
author = {C. Slot and P. V. E. Boas},
title = {On tape versus core; an application of space efficient perfect hash functions to the
invariance of space},
booktitle = {Proc. 16th Ann. ACM Symp. on Theory of Computing -- STOC'84},
address = {Washington},
month = {May},
year = {1984},
pages = {391--400},
}
@InProceedings{wi90,
author = {V. G. Winters},
title = {Minimal perfect hashing for large sets of data},
booktitle = {Internat. Conf. on Computing and Information -- ICCI'90},
address = {Canada},
month = {May},
year = {1990},
pages = {275--284},
}
@InProceedings{lr85,
author = {P. Larson and M. V. Ramakrishna},
title = {External perfect hashing},
booktitle = {Proc. ACM SIGMOD Conf.},
address = {Austin TX},
month = {June},
year = {1985},
pages = {190--199},
}
@Book{m84,
author = {K. Mehlhorn},
editor = {W. Brauer and G. Rozenberg and A. Salomaa},
title = {Data Structures and Algorithms 1: Sorting and Searching},
publisher = {Springer-Verlag},
year = {1984},
}
@PhdThesis{c92,
author = {Q. F. Chen},
title = {An Object-Oriented Database System for Efficient Information Retrieval Applications},
school = {Virginia Tech Dept. of Computer Science},
year = {1992},
month = {March}
}
@article {er59,
AUTHOR = {Erd{\H{o}}s, P. and R{\'e}nyi, A.},
TITLE = {On random graphs {I}},
JOURNAL = {Pub. Math. Debrecen},
VOLUME = {6},
YEAR = {1959},
PAGES = {290--297},
MRCLASS = {05.00},
MRNUMBER = {MR0120167 (22 \#10924)},
MRREVIEWER = {A. Dvoretzky},
}
@article {erdos61,
AUTHOR = {Erd{\H{o}}s, P. and R{\'e}nyi, A.},
TITLE = {On the evolution of random graphs},
JOURNAL = {Bull. Inst. Internat. Statist.},
VOLUME = 38,
YEAR = 1961,
PAGES = {343--347},
MRCLASS = {05.40 (55.10)},
MRNUMBER = {MR0148055 (26 \#5564)},
}
@article {er60,
AUTHOR = {Erd{\H{o}}s, P. and R{\'e}nyi, A.},
TITLE = {On the evolution of random graphs},
JOURNAL = {Magyar Tud. Akad. Mat. Kutat\'o Int. K\"ozl.},
VOLUME = {5},
YEAR = {1960},
PAGES = {17--61},
MRCLASS = {05.40},
MRNUMBER = {MR0125031 (23 \#A2338)},
MRREVIEWER = {J. Riordan},
}
@Article{er61,
author = {P. Erd{\H{o}}s and A. R\'enyi},
title = {On the strength of connectedness of a random graph},
journal = {Acta Mathematica Scientia Hungary},
year = {1961},
volume = {12},
pages = {261-267}
}
@Article{bp04,
author = {B. Bollob\'as and O. Pikhurko},
title = {Integer Sets with Prescribed Pairwise Differences Being Distinct},
journal = {European Journal of Combinatorics},
OPTkey = {},
OPTvolume = {},
OPTnumber = {},
OPTpages = {},
OPTmonth = {},
note = {To Appear},
OPTannote = {}
}
@Article{mr95,
author = {M. Molloy and B. Reed},
title = {A critical point for random graphs with a given degree sequence},
journal = {Random Structures and Algorithms},
year = {1995},
volume = {6},
pages = {161-179}
}
@TechReport{bmz04,
author = {F. C. Botelho and D. Menoti and N. Ziviani},
title = {A New algorithm for constructing minimal perfect hash functions},
institution = {Federal Univ. of Minas Gerais},
year = {2004},
OPTkey = {},
OPTtype = {},
number = {TR004},
OPTaddress = {},
OPTmonth = {},
note = {(http://www.dcc.ufmg.br/\texttt{\~ }nivio/pub/technicalreports.html)},
OPTannote = {}
}
@Article{mr98,
author = {M. Molloy and B. Reed},
title = {The size of the giant component of a random graph with a given degree sequence},
journal = {Combinatorics, Probability and Computing},
year = {1998},
volume = {7},
pages = {295-305}
}
@misc{h98,
author = {D. Hawking},
title = {Overview of TREC-7 Very Large Collection Track (Draft for Notebook)},
url = {citeseer.ist.psu.edu/4991.html},
year = {1998}}
@book {jlr00,
AUTHOR = {Janson, S. and {\L}uczak, T. and Ruci{\'n}ski, A.},
TITLE = {Random graphs},
PUBLISHER = {Wiley-Inter.},
YEAR = 2000,
PAGES = {xii+333},
ISBN = {0-471-17541-2},
MRCLASS = {05C80 (60C05 82B41)},
MRNUMBER = {2001k:05180},
MRREVIEWER = {Mark R. Jerrum},
}
@incollection {jlr90,
AUTHOR = {Janson, Svante and {\L}uczak, Tomasz and Ruci{\'n}ski,
Andrzej},
TITLE = {An exponential bound for the probability of nonexistence of a
specified subgraph in a random graph},
BOOKTITLE = {Random graphs '87 (Pozna\'n, 1987)},
PAGES = {73--87},
PUBLISHER = {Wiley},
ADDRESS = {Chichester},
YEAR = 1990,
MRCLASS = {05C80 (60C05)},
MRNUMBER = {91m:05168},
MRREVIEWER = {J. Spencer},
}
@book {b01,
AUTHOR = {Bollob{\'a}s, B.},
TITLE = {Random graphs},
SERIES = {Cambridge Studies in Advanced Mathematics},
VOLUME = 73,
EDITION = {Second},
PUBLISHER = {Cambridge University Press},
ADDRESS = {Cambridge},
YEAR = 2001,
PAGES = {xviii+498},
ISBN = {0-521-80920-7; 0-521-79722-5},
MRCLASS = {05C80 (60C05)},
MRNUMBER = {MR1864966 (2002j:05132)},
}
@article {pw04,
AUTHOR = {Pittel, Boris and Wormald, Nicholas C.},
TITLE = {Counting connected graphs inside-out},
JOURNAL = {J. Combin. Theory Ser. B},
FJOURNAL = {Journal of Combinatorial Theory. Series B},
VOLUME = 93,
YEAR = 2005,
NUMBER = 2,
PAGES = {127--172},
ISSN = {0095-8956},
CODEN = {JCBTB8},
MRCLASS = {05C30 (05A16 05C40 05C80)},
MRNUMBER = {MR2117934 (2005m:05117)},
MRREVIEWER = {Edward A. Bender},
}

112
vldb07/relatedwork.tex Executable file
View File

@ -0,0 +1,112 @@
% Time-stamp: <Monday 30 Jan 2006 03:06:57am EDT yoshi@ime.usp.br>
\vspace{-3mm}
\section{Related work}
\label{sec:relatedprevious-work}
\vspace{-2mm}
% Optimal speed for hashing means that each key from the key set $S$
% will map to an unique location in the hash table, avoiding time wasted
% in resolving collisions. That is achieved with a MPHF and
% because of that many algorithms for constructing static
% and dynamic MPHFs, when static or dynamic sets are involved,
% were developed. Our focus has been on static MPHFs, since
% in many applications the key sets change slowly, if at all~\cite{s05}.
\enlargethispage{2\baselineskip}
Czech, Havas and Majewski~\cite{chm97} provide a
comprehensive survey of the most important theoretical and practical results
on perfect hashing.
In this section we review some of the most important results.
%We also present more recent algorithms that share some features with
%the one presented hereinafter.
Fredman, Koml\'os and Szemer\'edi~\cite{fks84} showed that it is possible to
construct space efficient perfect hash functions that can be evaluated in
constant time with table sizes that are linear in the number of keys:
$m=O(n)$. In their model of computation, an element of the universe~$U$ fits
into one machine word, and arithmetic operations and memory accesses have unit
cost. Randomized algorithms in the FKS model can construct a perfect hash
function in expected time~$O(n)$:
this is the case of our algorithm and the works in~\cite{chm92,p99}.
Mehlhorn~\cite{m84} showed
that at least $\Omega((1/\ln 2)n + \ln\ln u)$ bits are
required to represent a MPHF (i.e., at least 1.4427 bits per
key must be stored).
To the best of our knowledge, our algorithm
is the first one capable of generating MPHFs for sets on the order
of billions of keys, and the generated functions
require less than 9 bits per key to be stored.
This is one order of magnitude larger than the largest
key set for which a MPHF has been obtained in the literature~\cite{bkz05}.
%which is close to the lower bound presented in~\cite{m84}.
Some work on minimal perfect hashing has been done under the assumption that
the algorithm can pick and store truly random functions~\cite{bkz05,chm92,p99}.
Since the space requirements for truly random functions make them unsuitable for
implementation, one has to settle for pseudo-random functions in practice.
Empirical studies show that limited randomness properties are often as good as
total randomness.
We could verify that phenomenon in our experiments by using the universal hash
function proposed by Jenkins~\cite{j97}, which is
time efficient at retrieval time and requires just an integer to be used as a
random seed (the function is completely determined by the seed).
% The works~\cite{asw00,swz00} present algorithms for constructing
% PHFs and MPHFs deterministically.
% The generated functions need $O(n \log(n) + \log(\log(u)))$ bits to be described.
% The average-case complexity of the algorithms for generating the functions is
% $O(n\log(n) \log(\log(u)))$ and the worst-case complexity is $O(n^3\log(n) \log(\log(u)))$.
% The evaluation complexity of the functions is $O(\log(n) + \log(\log(u)))$.
% Thus, the algorithms do not generate functions that can be evaluated in $O(1)$
% time, are a factor of $\log n$ away from the optimal complexity for describing
% PHFs and MPHFs (Mehlhorn shows in~\cite{m84}
% that storing a PHF requires at least
% $\Omega(n^2/(2\ln 2) m + \log\log u)$ bits), and do not generate the
% functions in linear time.
% Moreover, the key universe $U$ is restricted to integers, which may limit
% practical use.
Pagh~\cite{p99} proposed a family of randomized algorithms for
constructing MPHFs
where the form of the resulting function is $h(x) = (f(x) + d[g(x)]) \bmod n$,
where $f$ and $g$ are universal hash functions and $d$ is a set of
displacement values to resolve collisions that are caused by the function $f$.
Pagh identified a set of conditions concerning $f$ and $g$ and showed
that if these conditions are satisfied, then a minimal perfect hash
function can be computed in expected time $O(n)$ and stored in
$(2+\epsilon)n\log_2n$ bits.
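To make the form concrete, here is a one-line sketch in C (illustrative names
only: \texttt{fx} and \texttt{gx} are the precomputed values $f(x)$ and $g(x)$,
and \texttt{d} is the displacement table):
\begin{lstlisting}[language=C]
/* Pagh's hash-and-displace form: the displacement d[g(x)]
   resolves the collisions caused by f. */
unsigned int hash_and_displace(unsigned int fx, unsigned int gx,
                               const unsigned int *d, unsigned int n)
{
    return (fx + d[gx]) % n;
}
\end{lstlisting}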
Dietzfelbinger and Hagerup~\cite{dh01} improved~\cite{p99},
reducing from $(2+\epsilon)n\log_2n$ to $(1+\epsilon)n\log_2n$ the number of bits
required to store the function, but in their approach~$f$ and~$g$ must
be chosen from a class
of hash functions that meet additional requirements.
%Differently from the works in~\cite{dh01, p99}, our algorithm generates a MPHF
%$h$ in expected linear time and $h$ can be stored in $O(n)$ bits (9 bits per key).
% Galli, Seybold and Simon~\cite{gss01} proposed a randomized algorithm
% that generates MPHFs of the same form as those produced by the algorithms of
% Pagh~\cite{p99} and of Dietzfelbinger and Hagerup~\cite{dh01}. However, they fixed
% the forms $f(k) = h_c(k) \bmod n$ and $g(k) = \lfloor h_c(k)/n \rfloor$ to obtain, in expected time $O(n)$, a function that can be described in $O(n\log n)$ bits, where
% $h_c(k) = (ck \bmod p) \bmod n^2$, $1 \leq c \leq p-1$, and $p$ is a prime greater than $u$.
%Our algorithm is the first one capable of generating MPHFs for sets in the order of
%billion of keys. It happens because we do not need to keep into main memory
%at generation time complex data structures as a graph, lists and so on. We just need to maintain
%a small vector that occupies around 8MB for a set of 1 billion keys.
Fox et al.~\cite{fch92,fhcd92} studied MPHFs
%that also share features with the ones generated by our algorithm.
that bring the storage requirements down from the ones we obtained to between
2 and 4 bits per key.
However, it is shown in~\cite[Section 6.7]{chm97} that their algorithms have exponential
running times, and our implementation of their algorithm could not scale to
sets larger than 11 million keys.
Our previous work~\cite{bkz05} improves the one by Czech, Havas and Majewski~\cite{chm92}.
We obtained more compact functions in less time. Although
the algorithm in~\cite{bkz05} is the fastest algorithm
we know of, the resulting functions are stored in $O(n\log n)$ bits and
one needs to keep in main memory at generation time a random graph of $n$ edges
and $cn$ vertices,
where $c\in[0.93,1.15]$. Using the well-known divide-and-conquer approach,
we use that algorithm as a building block for the new one, whose
resulting functions are stored in $O(n)$ bits.

155
vldb07/searching.tex Executable file
View File

@ -0,0 +1,155 @@
%% Nivio: 22/jan/06
% Time-stamp: <Monday 30 Jan 2006 03:57:35am EDT yoshi@ime.usp.br>
\vspace{-7mm}
\subsection{Searching step}
\label{sec:searching}
\enlargethispage{2\baselineskip}
The searching step is responsible for generating a MPHF for each
bucket.
Figure~\ref{fig:searchingstep} presents the searching step algorithm.
\vspace{-2mm}
\begin{figure}[h]
%\centering
\hrule
\hrule
\vspace{2mm}
\begin{tabbing}
aa\=type booleanx \== (false, true); \kill
\> $\blacktriangleright$ Let $H$ be a minimum heap of size $N$, where the \\
\> ~~ order relation in $H$ is given by Eq.~(\ref{eq:bucketindex}), that is, the\\
\> ~~ remove operation removes the item with smallest $i$\\
\> $1.$ {\bf for} $j = 1$ {\bf to} $N$ {\bf do} \{ Heap construction \}\\
\> ~~ $1.1$ Read key $k$ from File $j$ on disk\\
\> ~~ $1.2$ Insert $(i, j, k)$ in $H$ \\
\> $2.$ {\bf for} $i = 0$ {\bf to} $\lceil n/b \rceil - 1$ {\bf do} \\
\> ~~ $2.1$ Read bucket $i$ from disk driven by heap $H$ \\
\> ~~ $2.2$ Generate a MPHF for bucket $i$ \\
\> ~~ $2.3$ Write the description of MPHF$_i$ to the disk
\end{tabbing}
\vspace{-1mm}
\hrule
\hrule
\caption{Searching step}
\label{fig:searchingstep}
\vspace{-4mm}
\end{figure}
Statement 1 of Figure~\ref{fig:searchingstep} inserts one key from each file
into a minimum heap $H$ of size $N$.
The order relation in $H$ is the bucket address $i$ computed by
Eq.~(\ref{eq:bucketindex}).
%\enlargethispage{-\baselineskip}
Statement 2 has two important steps.
In statement 2.1, a bucket is read from disk,
as described below.
%in Section~\ref{sec:readingbucket}.
In statement 2.2, a MPHF is generated for each bucket $i$, as described
in the following.
%in Section~\ref{sec:mphfbucket}.
The description of MPHF$_i$ is a vector $g_i$ of 8-bit integers.
Finally, statement 2.3 writes the description $g_i$ of MPHF$_i$ to disk.
\vspace{-3mm}
\label{sec:readingbucket}
\subsubsection{Reading a bucket from disk.}
In this section we present the refinement of statement 2.1 of
Figure~\ref{fig:searchingstep}.
The algorithm to read bucket $i$ from disk is presented
in Figure~\ref{fig:readingbucket}.
\begin{figure}[h]
\hrule
\hrule
\vspace{2mm}
\begin{tabbing}
aa\=type booleanx \== (false, true); \kill
\> $1.$ {\bf while} bucket $i$ is not full {\bf do} \\
\> ~~ $1.1$ Remove $(i, j, k)$ from $H$\\
\> ~~ $1.2$ Insert $k$ into bucket $i$ \\
\> ~~ $1.3$ Read sequentially all keys $k$ from File $j$ that have \\
\> ~~~~~~~ the same $i$ and insert them into bucket $i$ \\
\> ~~ $1.4$ Insert the triple $(i, j, x)$ in $H$, where $x$ is the first \\
\> ~~~~~~~ key read from File $j$ that does not have the \\
\> ~~~~~~~ same bucket index $i$
\end{tabbing}
\hrule
\hrule
\vspace{-1.0mm}
\caption{Reading a bucket}
\vspace{-4.0mm}
\label{fig:readingbucket}
\end{figure}
Bucket $i$ is distributed among many files and the heap $H$ is used to drive a
multiway merge operation.
In Figure~\ref{fig:readingbucket}, statement 1.1 extracts and removes triple
$(i, j, k)$ from $H$, where $i$ is a minimum value in $H$.
Statement 1.2 inserts key $k$ in bucket $i$.
Notice that the $k$ in the triple $(i, j, k)$ is in fact a pointer to
the first byte of the key that is kept in contiguous positions of an array of characters
(this array containing the keys is initialized during the heap construction
in statement 1 of Figure~\ref{fig:searchingstep}).
Statement 1.3 performs a seek operation in File $j$ on disk for the first
read operation and reads sequentially all keys $k$ that have the same $i$
%(obtained from Eq.~(\ref{eq:bucketindex}))
and inserts them all in bucket $i$.
Finally, statement 1.4 inserts in $H$ the triple $(i, j, x)$,
where $x$ is the first key read from File $j$ (in statement 1.3)
that does not have the same bucket address as the previous keys.
The number of seek operations on disk performed in statement 1.3 is discussed
in Section~\ref{sec:linearcomplexity},
where we present a buffering technique that brings down
the time spent with seeks.
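As an illustration, the sketch below renders the reading algorithm in C. It is
a simplification, not our actual implementation: the heap stores fixed-size
copies of the keys rather than pointers, file access is hidden behind an
assumed primitive \texttt{next\_key()}, and buffering is omitted.
\begin{lstlisting}[language=C]
#include <string.h>

#define MAXKEY 64    /* assumed maximum key length */

/* A heap item: the next pending key of File j together with its
   bucket address i. All identifiers here are illustrative. */
typedef struct { unsigned int i; int j; char key[MAXKEY]; } item;

static item heap[1024];   /* capacity >= the number of files N */
static int hn;            /* current heap size */

static void swap(int a, int b)
{
    item t = heap[a]; heap[a] = heap[b]; heap[b] = t;
}

static void heap_push(item x)        /* sift up on i */
{
    int c = hn++;
    heap[c] = x;
    while (c > 0 && heap[(c - 1) / 2].i > heap[c].i) {
        swap(c, (c - 1) / 2);
        c = (c - 1) / 2;
    }
}

static item heap_pop(void)           /* remove smallest i */
{
    item top = heap[0];
    int p = 0, s, l, r;
    heap[0] = heap[--hn];
    for (;;) {
        l = 2 * p + 1; r = l + 1; s = p;
        if (l < hn && heap[l].i < heap[s].i) s = l;
        if (r < hn && heap[r].i < heap[s].i) s = r;
        if (s == p) break;
        swap(p, s);
        p = s;
    }
    return top;
}

/* Assumed primitive: reads the next key of File j and its bucket
   address into *out; returns 0 at end of file. */
int next_key(int j, item *out);

/* Statement 2.1: fill bucket i by a heap-driven multiway merge
   over the N files; returns the number of keys read. */
int read_bucket(unsigned int i, char bucket[][MAXKEY])
{
    int cnt = 0;
    item x, y;
    while (hn > 0 && heap[0].i == i) {  /* bucket may span files */
        x = heap_pop();
        strcpy(bucket[cnt++], x.key);
        while (next_key(x.j, &y)) {     /* sequential reads */
            if (y.i != i) { heap_push(y); break; }
            strcpy(bucket[cnt++], y.key);
        }
    }
    return cnt;
}
\end{lstlisting}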
\vspace{-2mm}
\enlargethispage{2\baselineskip}
\subsubsection{Generating a MPHF for each bucket.} \label{sec:mphfbucket}
To the best of our knowledge the algorithm we have designed in
our previous work~\cite{bkz05} is the fastest published algorithm for
constructing MPHFs.
That is why we are using that algorithm as a building block for the
algorithm presented here.
%\enlargethispage{-\baselineskip}
Our previous algorithm is a three-step internal memory based algorithm
that produces a MPHF based on random graphs.
For a set of $n$ keys, the algorithm outputs the resulting MPHF in expected time $O(n)$.
For a given bucket $i$, $0 \leq i < \lceil n/b \rceil$, the corresponding MPHF$_i$
has the following form:
\begin{eqnarray}
\mathrm{MPHF}_i(k) &=& g_i[a] + g_i[b] \label{eq:mphfi}
\end{eqnarray}
where $a = h_{i1}(k) \bmod t$, $b = h_{i2}(k) \bmod t$ and
$t = c\times \mathit{size}[i]$. The functions
$h_{i1}(k)$ and $h_{i2}(k)$ are drawn from the same family of universal hash functions proposed by Jenkins~\cite{j97}
that was used in the partitioning step described in Section~\ref{sec:partitioning-keys}.
In order to generate the function above, the algorithm builds simple random graphs
$G_i = (V_i, E_i)$ with~$|V_i|=t=c\times\mathit{size}[i]$ and $|E_i|=\mathit{size}[i]$, with $c \in [0.93, 1.15]$.
To generate a simple random graph with high
probability\footnote{We use the term `with high probability'
to mean `with probability tending to~$1$ as~$n\to\infty$'.}, two vertices $a$ and $b$ are
computed for each key $k$ in bucket $i$.
Thus, each bucket $i$ has a corresponding graph~$G_i=(V_i,E_i)$, where $V_i=\{0,1,
\ldots,t-1\}$ and $E_i=\big\{\{a,b\}:k \in \mathrm{bucket}\: i\big\}$.
In order to get a simple graph,
the algorithm repeatedly selects $h_{i1}$ and $h_{i2}$ from a family of universal hash functions
until the corresponding graph is simple.
The probability of getting a simple graph is $p=e^{-1/c^2}$.
For $c=1$, this probability is $p \simeq 0.368$, and the expected number of
iterations to obtain a simple graph is~$1/p \simeq 2.72$.
The construction of MPHF$_i$ ends with a computation of a suitable labelling of the vertices
of~$G_i$. The labelling is stored into vector $g_i$.
We choose~$g_i[v]$ for each~$v\in V_i$ in such
a way that Eq.~(\ref{eq:mphfi}) is a MPHF for bucket $i$.
In order to get the values of each entry of $g_i$ we first
run a breadth-first search on the 2-\textit{core} of $G_i$, i.e., the maximal subgraph
of~$G_i$ with minimum degree at least~$2$ (see, e.g., \cite{b01,jlr00,pw04}) and
a depth-first search on the acyclic part of $G_i$ (see \cite{bkz05} for details).

77
vldb07/svglov2.clo Normal file
View File

@ -0,0 +1,77 @@
% SVJour2 DOCUMENT CLASS OPTION SVGLOV2 -- for standardised journals
%
% This is an enhancement for the LaTeX
% SVJour2 document class for Springer journals
%
%%
%%
%% \CharacterTable
%% {Upper-case \A\B\C\D\E\F\G\H\I\J\K\L\M\N\O\P\Q\R\S\T\U\V\W\X\Y\Z
%% Lower-case \a\b\c\d\e\f\g\h\i\j\k\l\m\n\o\p\q\r\s\t\u\v\w\x\y\z
%% Digits \0\1\2\3\4\5\6\7\8\9
%% Exclamation \! Double quote \" Hash (number) \#
%% Dollar \$ Percent \% Ampersand \&
%% Acute accent \' Left paren \( Right paren \)
%% Asterisk \* Plus \+ Comma \,
%% Minus \- Point \. Solidus \/
%% Colon \: Semicolon \; Less than \<
%% Equals \= Greater than \> Question mark \?
%% Commercial at \@ Left bracket \[ Backslash \\
%% Right bracket \] Circumflex \^ Underscore \_
%% Grave accent \` Left brace \{ Vertical bar \|
%% Right brace \} Tilde \~}
\ProvidesFile{svglov2.clo}
[2004/10/25 v2.1
style option for standardised journals]
\typeout{SVJour Class option: svglov2.clo for standardised journals}
\def\validfor{svjour2}
\ExecuteOptions{final,10pt,runningheads}
% No size changing allowed, hence a copy of size10.clo is included
\renewcommand\normalsize{%
\@setfontsize\normalsize{10.2pt}{4mm}%
\abovedisplayskip=3 mm plus6pt minus 4pt
\belowdisplayskip=3 mm plus6pt minus 4pt
\abovedisplayshortskip=0.0 mm plus6pt
\belowdisplayshortskip=2 mm plus4pt minus 4pt
\let\@listi\@listI}
\normalsize
\newcommand\small{%
\@setfontsize\small{8.7pt}{3.25mm}%
\abovedisplayskip 8.5\p@ \@plus3\p@ \@minus4\p@
\abovedisplayshortskip \z@ \@plus2\p@
\belowdisplayshortskip 4\p@ \@plus2\p@ \@minus2\p@
\def\@listi{\leftmargin\leftmargini
\parsep 0\p@ \@plus1\p@ \@minus\p@
\topsep 4\p@ \@plus2\p@ \@minus4\p@
\itemsep0\p@}%
\belowdisplayskip \abovedisplayskip
}
\let\footnotesize\small
\newcommand\scriptsize{\@setfontsize\scriptsize\@viipt\@viiipt}
\newcommand\tiny{\@setfontsize\tiny\@vpt\@vipt}
\newcommand\large{\@setfontsize\large\@xiipt{14pt}}
\newcommand\Large{\@setfontsize\Large\@xivpt{16dd}}
\newcommand\LARGE{\@setfontsize\LARGE\@xviipt{17dd}}
\newcommand\huge{\@setfontsize\huge\@xxpt{25}}
\newcommand\Huge{\@setfontsize\Huge\@xxvpt{30}}
%
%ALT% \def\runheadhook{\rlap{\smash{\lower5pt\hbox to\textwidth{\hrulefill}}}}
\def\runheadhook{\rlap{\smash{\lower11pt\hbox to\textwidth{\hrulefill}}}}
\AtEndOfClass{\advance\headsep by5pt}
\if@twocolumn
\setlength{\textwidth}{17.6cm}
\setlength{\textheight}{230mm}
\AtEndOfClass{\setlength\columnsep{4mm}}
\else
\setlength{\textwidth}{11.7cm}
\setlength{\textheight}{517.5dd} % 19.46cm
\fi
%
\AtBeginDocument{%
\@ifundefined{@journalname}
{\typeout{Unknown journal: specify \string\journalname\string{%
<name of your journal>\string} in preamble^^J}}{}}
%
\endinput
%%
%% End of file `svglov2.clo'.

1419
vldb07/svjour2.cls Normal file

File diff suppressed because it is too large

18
vldb07/terminology.tex Executable file
View File

@ -0,0 +1,18 @@
% Time-stamp: <Sunday 29 Jan 2006 11:55:42pm EST yoshi@flare>
\vspace{-3mm}
\section{Notation and terminology}
\vspace{-2mm}
\label{sec:notation}
\enlargethispage{2\baselineskip}
The essential notation and terminology used throughout this paper are as follows.
\begin{itemize}
\item $U$: key universe. $|U| = u$.
\item $S$: actual static key set. $S \subset U$, $|S| = n \ll u$.
\item $h: U \to M$ is a hash function that maps keys from a universe $U$ into
a given range $M = \{0,1,\dots,m-1\}$ of integers.
\item $h$ is a perfect hash function if it is one-to-one on~$S$, i.e., if
$h(k_1) \not = h(k_2)$ for all $k_1 \not = k_2$ from $S$.
\item $h$ is a minimal perfect hash function (MPHF) if it is one-to-one on~$S$
and $n=m$.
\end{itemize}

78
vldb07/thealgorithm.tex Executable file
View File

@ -0,0 +1,78 @@
%% Nivio: 13/jan/06, 21/jan/06 29/jan/06
% Time-stamp: <Sunday 29 Jan 2006 11:56:25pm EST yoshi@flare>
\vspace{-3mm}
\section{The algorithm}
\label{sec:new-algorithm}
\vspace{-2mm}
\enlargethispage{2\baselineskip}
The main idea supporting our algorithm is the classical divide-and-conquer technique.
The algorithm is a two-step external memory based algorithm
that generates a MPHF $h$ for a set $S$ of $n$ keys.
Figure~\ref{fig:new-algo-main-steps} illustrates the two steps of the
algorithm: the partitioning step and the searching step.
\vspace{-2mm}
\begin{figure}[ht]
\centering
\begin{picture}(0,0)%
\includegraphics{figs/brz.ps}%
\end{picture}%
\setlength{\unitlength}{4144sp}%
%
\begingroup\makeatletter\ifx\SetFigFont\undefined%
\gdef\SetFigFont#1#2#3#4#5{%
\reset@font\fontsize{#1}{#2pt}%
\fontfamily{#3}\fontseries{#4}\fontshape{#5}%
\selectfont}%
\fi\endgroup%
\begin{picture}(3704,2091)(1426,-5161)
\put(2570,-4301){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}0}}}}
\put(2782,-4301){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}1}}}}
\put(2996,-4301){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}2}}}}
\put(4060,-4006){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}Buckets}}}}
\put(3776,-4301){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}${\lceil n/b\rceil - 1}$}}}}
\put(4563,-3329){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}Key Set $S$}}}}
\put(2009,-3160){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}0}}}}
\put(2221,-3160){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}1}}}}
\put(4315,-3160){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}n-1}}}}
\put(1992,-5146){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}0}}}}
\put(2204,-5146){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}1}}}}
\put(4298,-5146){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}n-1}}}}
\put(4546,-4977){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}Hash Table}}}}
\put(1441,-3616){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}Partitioning}}}}
\put(1441,-4426){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}Searching}}}}
\put(1981,-4786){\makebox(0,0)[lb]{\smash{{\SetFigFont{5}{6.0}{\familydefault}{\mddefault}{\updefault}MPHF$_0$}}}}
\put(2521,-4786){\makebox(0,0)[lb]{\smash{{\SetFigFont{5}{6.0}{\familydefault}{\mddefault}{\updefault}MPHF$_1$}}}}
\put(3016,-4786){\makebox(0,0)[lb]{\smash{{\SetFigFont{5}{6.0}{\familydefault}{\mddefault}{\updefault}MPHF$_2$}}}}
\put(3826,-4786){\makebox(0,0)[lb]{\smash{{\SetFigFont{5}{6.0}{\familydefault}{\mddefault}{\updefault}MPHF$_{\lceil n/b \rceil - 1}$}}}}
\end{picture}%
\vspace{-1mm}
\caption{Main steps of our algorithm}
\label{fig:new-algo-main-steps}
\vspace{-3mm}
\end{figure}
The partitioning step takes a key set $S$ and uses a universal hash function
$h_0$ proposed by Jenkins~\cite{j97}
%for each key $k \in S$ of length $|k|$
to transform each key~$k\in S$ into an integer~$h_0(k)$.
Reducing~$h_0(k)$ modulo~$\lceil n/b\rceil$, we partition~$S$ into $\lceil n/b
\rceil$ buckets containing at most 256 keys in each bucket (with high
probability).
The searching step generates a MPHF$_i$ for each bucket $i$,
$0 \leq i < \lceil n/b \rceil$.
The resulting MPHF $h(k)$, $k \in S$, is given by
\begin{eqnarray}\label{eq:mphf}
h(k) = \mathrm{MPHF}_i (k) + \mathit{offset}[i],
\end{eqnarray}
where~$i=h_0(k)\bmod\lceil n/b\rceil$.
The $i$th entry~$\mathit{offset}[i]$ of the displacement vector
$\mathit{offset}$, $0 \leq i < \lceil n/b \rceil$, contains the total number
of keys in buckets 0 through $i-1$; that is, it gives the range of positions
in the hash table addressed by MPHF$_i$. In the following we explain
each step in detail.
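To make Equation~(\ref{eq:mphf}) concrete, the following sketch in C shows
how $h(k)$ could be evaluated and how the vector $\mathit{offset}$ could be
built from the bucket sizes. The identifiers (\texttt{h0},
\texttt{mphf\_eval}, \texttt{bucket\_mphf}, \texttt{mphf\_t}) are
illustrative assumptions, not the names used in the actual implementation
available at \texttt{http://cmph.sf.net}.
\begin{lstlisting}[language=C]
/* Illustrative interfaces (assumed, not the real API). */
typedef struct mphf mphf_t;    /* description of one bucket MPHF */
unsigned h0(const char *k);    /* hash function h_0 (Jenkins)    */
unsigned mphf_eval(mphf_t *f, const char *k);

/* offset[i] = total number of keys in buckets 0..i-1,
   i.e., the prefix sums of the bucket sizes. */
void build_offset(unsigned *size, unsigned *offset,
                  unsigned nbuckets)
{
    unsigned i;
    offset[0] = 0;
    for (i = 1; i < nbuckets; i++)
        offset[i] = offset[i - 1] + size[i - 1];
}

/* h(k) = MPHF_i(k) + offset[i], i = h0(k) mod nbuckets. */
unsigned h(const char *k, unsigned nbuckets,
           mphf_t **bucket_mphf, unsigned *offset)
{
    unsigned i = h0(k) % nbuckets;   /* bucket of key k */
    return mphf_eval(bucket_mphf[i], k) + offset[i];
}
\end{lstlisting}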
21
vldb07/thedataandsetup.tex Executable file
@@ -0,0 +1,21 @@
% Nivio: 29/jan/06
% Time-stamp: <Sunday 29 Jan 2006 11:57:40pm EST yoshi@flare>
\vspace{-2mm}
\subsection{The data and the experimental setup}
\label{sec:data-exper-set}
The algorithms were implemented in the C language and
are available at \texttt{http://\-cmph.sf.net}
under the GNU Lesser General Public License (LGPL).
% free software licence.
All experiments were carried out on
a computer running the Linux operating system, version 2.6,
with a 2.4 gigahertz processor and
1 gigabyte of main memory.
In the experiments related to the new
algorithm we limited the main memory to 500 megabytes.
Our data consists of a collection of 1 billion
URLs collected from the Web, each URL 64 characters long on average.
The collection takes 60.5 gigabytes of disk space.
194
vldb07/vldb.tex Normal file
@@ -0,0 +1,194 @@
%%%%%%%%%%%%%%%%%%%%%%% file template.tex %%%%%%%%%%%%%%%%%%%%%%%%%
%
% This is a template file for the LaTeX package SVJour2 for the
% Springer journal "The VLDB Journal".
%
% Springer Heidelberg 2004/12/03
%
% Copy it to a new file with a new name and use it as the basis
% for your article. Delete % as needed.
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% First comes an example EPS file -- just ignore it and
% proceed on the \documentclass line
% your LaTeX will extract the file if required
%\begin{filecontents*}{figs/minimalperfecthash-ph-mph.ps}
%!PS-Adobe-3.0 EPSF-3.0
%%BoundingBox: 19 19 221 221
%%CreationDate: Mon Sep 29 1997
%%Creator: programmed by hand (JK)
%%EndComments
%gsave
%newpath
% 20 20 moveto
% 20 220 lineto
% 220 220 lineto
% 220 20 lineto
%closepath
%2 setlinewidth
%gsave
% .4 setgray fill
%grestore
%stroke
%grestore
%\end{filecontents*}
%
\documentclass[twocolumn,fleqn,runningheads]{svjour2}
%
\smartqed % flush right qed marks, e.g. at end of proof
%
\usepackage{graphicx}
\usepackage{listings}
\usepackage{epsfig}
\usepackage{textcomp}
\usepackage[latin1]{inputenc}
\usepackage{amssymb}
%\DeclareGraphicsExtensions{.png}
%
% \usepackage{mathptmx} % use Times fonts if available on your TeX system
%
% insert here the call for the packages your document requires
%\usepackage{latexsym}
% etc.
%
% please place your own definitions here and don't use \def but
% \newcommand{}{}
%
\lstset{
language=Pascal,
basicstyle=\fontsize{9}{9}\selectfont,
captionpos=t,
aboveskip=1mm,
belowskip=1mm,
abovecaptionskip=1mm,
belowcaptionskip=1mm,
% numbers = left,
mathescape=true,
escapechar=@,
extendedchars=true,
showstringspaces=false,
columns=fixed,
basewidth=0.515em,
frame=single,
framesep=2mm,
xleftmargin=2mm,
xrightmargin=2mm,
framerule=0.5pt
}
\def\cG{{\mathcal G}}
\def\crit{{\rm crit}}
\def\ncrit{{\rm ncrit}}
\def\scrit{{\rm scrit}}
\def\bedges{{\rm bedges}}
\def\ZZ{{\mathbb Z}}
\journalname{The VLDB Journal}
%
\begin{document}
\title{Space and Time Efficient Minimal Perfect Hash \\[0.2cm]
Functions for Very Large Databases\thanks{
This work was supported in part by
GERINDO Project--grant MCT/CNPq/CT-INFO 552.087/02-5,
CAPES/PROF Scholarship (Fabiano C. Botelho),
FAPESP Proj.\ Tem.\ 03/09925-5 and CNPq Grant 30.0334/93-1
(Yoshiharu Kohayakawa),
and CNPq Grant 30.5237/02-0 (Nivio Ziviani).}
}
%\subtitle{Do you have a subtitle?\\ If so, write it here}
%\titlerunning{Short form of title} % if too long for running head
\author{Fabiano C. Botelho \and Davi C. Reis \and Yoshiharu Kohayakawa \and Nivio Ziviani}
%\authorrunning{Short form of author list} % if too long for running head
\institute{
F. C. Botelho \and
N. Ziviani \at
Dept. of Computer Science,
Federal Univ. of Minas Gerais,
Belo Horizonte, Brazil\\
\email{\{fbotelho,nivio\}@dcc.ufmg.br}
\and
D. C. Reis \at
Google, Brazil \\
\email{davi.reis@gmail.com}
\and
Y. Kohayakawa \at
Dept. of Computer Science,
Univ. of S\~ao Paulo,
S\~ao Paulo, Brazil\\
\email{yoshi@ime.usp.br}
}
\date{Received: date / Accepted: date}
% The correct dates will be entered by the editor
\maketitle
\begin{abstract}
We propose a novel external-memory-based algorithm for constructing minimal
perfect hash functions~$h$ for huge sets of keys.
For a set of~$n$ keys, our algorithm outputs~$h$ in time~$O(n)$.
The algorithm needs only a small vector of one-byte entries
in main memory to construct $h$.
The evaluation of~$h(x)$ requires three memory accesses for any key~$x$.
The description of~$h$ takes a constant number of bits
per key, which is asymptotically optimal; the theoretical lower bound is
$1/\ln 2 \approx 1.44$ bits per key.
In our experiments, we used a collection of 1 billion URLs collected
from the web, each URL 64 characters long on average.
For this collection, our algorithm
(i) finds a minimal perfect hash function in approximately
3 hours using a commodity PC,
(ii) needs just 5.45 megabytes of internal memory to generate $h$
and (iii) takes 8.1 bits per key for the description of~$h$.
\keywords{Minimal Perfect Hashing \and Large Databases}
\end{abstract}
% main text
\def\BSmax{\mathit{BS}_{\mathit{max}}}
\def\Bi{\mathop{\rm Bi}\nolimits}
\input{introduction}
%\input{terminology}
\input{relatedwork}
\input{thealgorithm}
\input{partitioningthekeys}
\input{searching}
%\input{computingoffset}
%\input{hashingbuckets}
\input{determiningb}
%\input{analyticalandexperimentalresults}
\input{analyticalresults}
%\input{results}
\input{conclusions}
%\input{acknowledgments}
%\begin{acknowledgements}
%If you'd like to thank anyone, place your comments here
%and remove the percent signs.
%\end{acknowledgements}
% BibTeX users please use
%\bibliographystyle{spmpsci}
%\bibliography{} % name your BibTeX data base
\bibliographystyle{plain}
\bibliography{references}
\input{appendix}
\end{document}