paper for vldb07 added

2006-08-11 17:32:31 +00:00
parent aa4b59e7c1
commit bd2e291de9
28 changed files with 4517 additions and 0 deletions
--- a/vldb07/acknowledgments.tex
+++ b/vldb07/acknowledgments.tex
@@ -0,0 +1,7 @@
 \section{Acknowledgments}
 This section is optional; it is a location for you
 to acknowledge grants, funding, editing assistance and
 what have you.  In the present case, for example, the
 authors would like to thank Gerald Murray of ACM for
 his help in codifying this \textit{Author's Guide}
 and the \textbf{.cls} and \textbf{.tex} files that it describes.
--- a/vldb07/analyticalresults.tex
+++ b/vldb07/analyticalresults.tex
@@ -0,0 +1,174 @@
 %% Nivio: 23/jan/06 29/jan/06
 % Time-stamp: <Monday 30 Jan 2006 03:56:47am EDT yoshi@ime.usp.br>
 \enlargethispage{2\baselineskip}
 \section{Analytical results}
 \label{sec:analytcal-results}
 \vspace{-1mm}
 The purpose of this section is fourfold.
 First, we show that our algorithm runs in expected time $O(n)$. 
 Second, we present the main memory requirements for constructing the MPHF.
 Third, we discuss the cost of evaluating the resulting MPHF.
 Fourth, we present the space required to store the resulting MPHF.
 \vspace{-2mm}
 \subsection{The linear time complexity}
 \label{sec:linearcomplexity}
 First, we show that the partitioning step presented in
 Figure~\ref{fig:partitioningstep} runs in $O(n)$ time.
 Each iteration of the {\bf for} loop in statement~1
 runs in $O(|B_j|)$ time, $1 \leq j \leq N$, where $|B_j|$ is the 
 number of keys 
 that fit in block $B_j$ of size $\mu$. This is because statement 1.1 just reads
 $|B_j|$ keys from disk, statement 1.2 runs a bucket-sort-like algorithm 
 that is well known to be linear in the number of keys it sorts (i.e., $|B_j|$ keys),
 and statement 1.3 just dumps $|B_j|$ keys to the disk into File $j$.
 Thus, the {\bf for} loop runs in $O(\sum_{j=1}^{N}|B_j|)$ time. 
 Since $\sum_{j=1}^{N}|B_j|=n$, the partitioning step runs in $O(n)$ time.
 Second, we show that the searching step presented in 
 Figure~\ref{fig:searchingstep} also runs in $O(n)$ time.
 The heap construction in statement 1 runs in $O(N)$ time, for $N \ll n$.
 We have assumed that insertions and deletions in the heap cost $O(1)$ because 
 $N$ is typically much smaller than $n$ (see \cite[Section 6.4]{bkz06t} for details).
 Statement 2 runs in $O(\sum_{i=0}^{\lceil n/b \rceil - 1} \mathit{size}[i])$ time
 (remember that $\mathit{size}[i]$ stores the number of keys in bucket $i$).
 As $\sum_{i=0}^{\lceil n/b \rceil - 1} \mathit{size}[i] = n$, if 
 statements 2.1, 2.2 and 2.3 run in $O(\mathit{size}[i])$ time, then statement 2
 runs in $O(n)$ time.
 %Statement 2.1 runs the algorithm to read a bucket from disk. That algorithm reads $\mathit{size}[i]$
 %keys of bucket $i$ that might be spread into many files or, in the worst case,
 %into $|BS_{max}|$ files, where $|BS_{max}|$ is the number of keys in the bucket of maximum size.
 %It uses the heap $H$ to drive a multiway merge of the sprayed bucket $i$. 
 %As we are considering that each read/write on disk costs $O(1)$ and
 %each heap operation also costs $O(1)$ (recall $N \ll n$), then  statement 2.1
 %costs $O(\mathit{size}[i])$ time.
 %We need to take into account that this step could generate a lot of seeks on disk. 
 %However, the number of seeks can be amortized (see Section~\ref{sec:contr-disk-access})
 %and that is why we have been able of getting a MPHF for a set of 1 billion keys in less 
 %than 4 hours using a machine with just 500 MB of main memory
 %(see Section~\ref{sec:performance}).
 Statement 2.1 reads $O(\mathit{size}[i])$ keys of bucket $i$
 and is detailed in Figure~\ref{fig:readingbucket}.
 As we are assuming that each read or write on disk costs $O(1)$ and
 each heap operation also costs $O(1)$, statement~2.1
 takes $O(\mathit{size}[i])$ time.
 However, in the worst case the keys of bucket $i$ are distributed over at most~$BS_{max}$ files on disk 
 (recall that $BS_{max}$ is the maximum number of keys found in any bucket).
 Therefore, we need to take into account that 
 the critical step in reading a bucket is in statement~1.3 of Figure~\ref{fig:readingbucket},
 where a seek operation in File $j$
 may be performed by the first read operation.
 In order to amortize the number of seeks performed we use a buffering technique~\cite{k73}.
 We create a buffer $j$ of size \textbaht$\: = \mu/N$ for each file $j$, 
 where $1\leq j \leq N$
 (recall that $\mu$ is the size in bytes of an a priori reserved internal memory area).
 Every time a read operation is issued to file $j$ and the data is not found 
 in the $j$th~buffer, \textbaht~bytes are read from file $j$ into buffer $j$. 
 Hence, the number of seeks performed in the worst case is given by
 $\beta/$\textbaht~(remember that $\beta$ is the size in bytes of $S$).
 Here we have made the pessimistic assumption that one seek happens every time 
 buffer $j$ is refilled. 
 Thus, the number of seeks performed in the worst case is $64n/$\textbaht, since
 each URL is 64 bytes long on average. Therefore, the number of seeks is linear in 
 $n$ and amortized by \textbaht.
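 As a rough illustration of this bound (a sketch under assumed values, not a measurement),
 take $n=10^9$ keys, so that $\beta \approx 64 \times 10^9$ bytes, and suppose
 $\mu = 200$ megabytes (read here as $200 \times 2^{20}$ bytes) and $N = 310$ files,
 as in one column of Table~\ref{tab:diskaccess}. Then
 \[
   \mbox{\textbaht} = \frac{\mu}{N} = \frac{200 \times 2^{20}}{310} \approx 6.8 \times 10^{5}\ \mathrm{bytes},
   \qquad
   \frac{\beta}{\mbox{\textbaht}} \approx \frac{64 \times 10^{9}}{6.8 \times 10^{5}} \approx 9.5 \times 10^{4}\ \mathrm{seeks}.
 \]
 The exact count depends on the true total size $\beta$, since 64 bytes is only the average URL length.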
 It is important to emphasize two things.
 First, the operating system uses techniques
 to diminish the number of seeks and the average seek time. 
 This makes the amortization factor greater than \textbaht~in practice.  
 Second, almost all main memory is available to be used as
 file buffers because just a small vector
 of $\lceil n/b\rceil$ one-byte entries must be maintained in main memory, 
 as we show in Section~\ref{sec:memconstruction}.
 Statement 2.2 runs our internal memory based algorithm in order to generate a MPHF for each bucket.
 That algorithm is linear, as we showed in~\cite{bkz05}. As it is applied to buckets with $\mathit{size}[i]$ keys, statement~2.2 takes $O(\mathit{size}[i])$ time.
 Statement 2.3 has time complexity $O(\mathit{size}[i])$ because it writes to disk 
 the description of each generated MPHF and each description is stored in
 $c \times \mathit{size}[i] + O(1)$ bytes, where $c\in[0.93,1.15]$.
 In conclusion, our algorithm takes $O(n)$ time because both the partitioning and
 the searching steps run in $O(n)$ time. 
 An experimental validation of the above proof and a performance comparison with 
 our internal memory based algorithm~\cite{bkz05} were not included here due to 
 space restrictions but can be found in~\cite{bkz06t} and also in the appendix.
 \vspace{-1mm}
 \enlargethispage{2\baselineskip}
 \subsection{Space used for constructing a MPHF} 
 \label{sec:memconstruction}
 The vector {\it size} is kept in main memory 
 all the time. 
 The vector {\it size} has $\lceil n/b \rceil$ one-byte entries.
 It stores the number of keys in each bucket and 
 those values are less than or equal to 256. 
 For example, for a set of 1 billion keys and $b=175$ the vector {\it size} needs 
 $5.45$ megabytes of main memory.
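 This figure can be checked directly, reading one megabyte as $2^{20}$ bytes (an assumption about the unit used):
 \[
   \left\lceil \frac{10^{9}}{175} \right\rceil = 5{,}714{,}286 \ \mathrm{bytes}
   \approx \frac{5{,}714{,}286}{2^{20}} \approx 5.45 \ \mathrm{megabytes}.
 \]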
 We need an internal memory area of size $\mu$ bytes to be used in
 the partitioning step and in the searching step. 
 The size $\mu$ is fixed a priori and depends only on the amount 
 of internal memory available to run the algorithm
 (i.e., it does not depend on the size $n$ of the problem).
 % One could argue about the a priori reserved internal memory area and the main memory
 % required to run the indirect bucket sort algorithm.
 % Those internal memory requirements do not depend on the size of the problem
 % (i.e., the number of keys being hashed) and can be fixed a priori.
 The additional space required in the searching step
 is constant, since the problem is broken down
 into several small problems (of at most 256 keys each) and 
 the heap size is assumed to be much smaller than $n$ ($N \ll n$).
 For example, for a set of 1 billion keys and an internal area of~$\mu = 250$ megabytes,
 the number of files is $N = 248$. 
 \vspace{-1mm}
 \subsection{Evaluation cost of the MPHF} 
 Now we consider the amount of CPU time 
 required by the resulting MPHF at retrieval time.
 The MPHF requires for each key the computation of three 
 universal hash functions and three memory accesses 
 (see Eqs.~(\ref{eq:mphf}), (\ref{eq:bucketindex}) and (\ref{eq:mphfi})).
 This is not optimal. Pagh~\cite{p99} showed that any MPHF requires 
 at least the computation of two universal hash functions and one memory
 access.
 \subsection{Description size of the MPHF} 
 The number of bits required to store the MPHF generated by the algorithm 
 is computed by Eq.~(\ref{eq:newmphfbits}). 
 We need to store each $g_i$ vector presented in Eq.~(\ref{eq:mphfi}), where
 $0\leq i < \lceil n/b \rceil$. As each bucket has at most 256 keys, each 
 entry in a $g_i$ vector has 8~bits.  In each $g_i$ vector there are 
 $c \times \mathit{size}[i]$ entries (recall $c\in[0.93, 1.15]$).
 When we sum up the number of entries of $\lceil n/b \rceil$ $g_i$ vectors we have
 $c\sum_{i=0}^{\lceil n/b \rceil -1} \mathit{size}[i]=cn$ entries.  We also need to
 store $3 \lceil n/b \rceil$ integer numbers of 
 $\log_2n$ bits referring respectively to the {\it offset} vector and the two random seeds of
 $h_{1i}$ and $h_{2i}$. In addition, we need to store $\lceil n/b \rceil$ 8-bit entries of 
 the vector {\it size}.  Therefore, 
 \begin{eqnarray}\label{eq:newmphfbits}
 \mathrm{Required\: Space} = 8cn + \frac{n}{b}\left( 3\log_2n +8\right) \:
 \mathrm{bits}. 
 \end{eqnarray}
 Considering $c=0.93$ and $b=175$, the number of bits per key to store 
 the description of the resulting MPHF for a set of 1~billion keys is $8.1$.
 If we set $b=128$, then the bits per key ratio increases to $8.3$.
 Theoretically, the number of bits required to store the MPHF in
 Eq.~(\ref{eq:newmphfbits}) 
 is $O(n\log n)$ as~$n\to\infty$. However, for sets of size up to $2^{b/3}$ keys 
 the number of bits per key is lower than 9~bits (note that
 $2^{b/3}>2^{58}>10^{17}$ for $b=175$).  
 %For $b=175$, the number of bits per key will be close to 9 for a set of $2^{58}$ keys. 
 Thus, in practice the resulting function is stored in $O(n)$ bits.
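 To see where the threshold $2^{b/3}$ comes from, one can substitute $n = 2^{b/3}$,
 that is, $3\log_2 n = b$, into Eq.~(\ref{eq:newmphfbits}) and divide by $n$.
 The following is a sketch for the values $c=0.93$ and $b=175$ considered above:
 \[
   \frac{\mathrm{Required\: Space}}{n} = 8c + \frac{3\log_2 n + 8}{b}
   = 8c + 1 + \frac{8}{b} \approx 7.44 + 1 + 0.05 = 8.49 < 9 .
 \]
 For smaller $n$ the logarithmic term is smaller, so the ratio stays below this value.
 Note that this check relies on $c=0.93$; for $c \geq 1$ the same substitution gives more than 9~bits per key.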
--- a/vldb07/appendix.tex
+++ b/vldb07/appendix.tex
@@ -0,0 +1,6 @@
 \appendix
 \input{experimentalresults}
 \input{thedataandsetup}
 \input{costhashingbuckets}
 \input{performancenewalgorithm}
 \input{diskaccess}
--- a/vldb07/conclusions.tex
+++ b/vldb07/conclusions.tex
@@ -0,0 +1,42 @@
 % Time-stamp: <Monday 30 Jan 2006 12:38:06am EST yoshi@flare>
 \enlargethispage{2\baselineskip}
 \section{Concluding remarks}
 \label{sec:concuding-remarks}
 This paper has presented a novel external memory based algorithm for
 constructing MPHFs that works for sets in the order of billions of keys.  The
 algorithm outputs the resulting function in~$O(n)$ time and, furthermore, it
 can be tuned to run only $34\%$ slower (see \cite{bkz06t} for details) than the fastest 
 algorithm available in the literature for constructing MPHFs~\cite{bkz05}.  
 In addition, the space
 requirement of the resulting MPHF is at most 9 bits per key for datasets of
 up to $2^{58}\simeq10^{17.4}$ keys. 
 The algorithm is simple and needs just a
 small vector of size approximately 5.45 megabytes in main memory to construct
 a MPHF for a collection of 1 billion URLs, each one 64 bytes long on average.
 Therefore, almost all main memory is available to be used as disk I/O buffer.
 Using such a buffering scheme with an internal memory area of size
 $\mu=200$ megabytes, our algorithm can produce a MPHF for a
 set of 1 billion URLs in approximately 3.6 hours on a 2.4 gigahertz commodity PC with
 500 megabytes of main memory.  
 If we increase both the main memory
 available to 1 gigabyte and the internal memory area to $\mu=500$ megabytes, 
 a MPHF for the set of 1 billion URLs is produced in approximately 3 hours. For any
 key, the evaluation of the resulting MPHF takes three memory accesses and the
 computation of three universal hash functions. 
 In order to allow the reproduction of our results and the utilization of both the internal memory
 based algorithm and the external memory based algorithm,
 the algorithms are available at \texttt{http://cmph.sf.net}
 under the GNU Lesser General Public License (LGPL).
 They were implemented in the C language.
 In future work, we will exploit the fact that the searching step intrinsically
 presents a high degree of parallelism and accounts for $73\%$ of the
 construction time.  A parallel implementation of our algorithm will therefore
 allow the construction and the evaluation of the resulting function in parallel.
 Moreover, the description of the resulting MPHFs will be distributed across the parallel 
 computer, allowing scalability to sets of hundreds of billions of keys. 
 This is an important contribution, mainly for applications related to the Web, as 
 mentioned in Section~\ref{sec:intro}.
--- a/vldb07/costhashingbuckets.tex
+++ b/vldb07/costhashingbuckets.tex
@@ -0,0 +1,177 @@
 % Nivio: 29/jan/06
 % Time-stamp: <Monday 30 Jan 2006 12:37:22am EST yoshi@flare>
 \vspace{-2mm}
 \subsection{Performance of the internal memory based algorithm}
 \label{sec:intern-memory-algor}
 %\begin{table*}[htb]
 %\begin{center}
 %{\scriptsize
 %\begin{tabular}{|c|c|c|c|c|c|c|c|}
 %\hline
 %$n$ (millions)  & 1                 & 2                    & 4                  & 8                  & 16      & 32 \\
 %\hline
 %Average time (s)& $6.1 \pm 0.3$ & $12.2 \pm 0.6$   & $25.4 \pm 1.1$ & $51.4 \pm 2.0$ & $117.3 \pm 4.4$ & $262.2 \pm 8.7$\\
 %SD (s)          & $2.6$           & $5.4$              & $9.8$            & $17.6$           & $37.3$            & $76.3$  \\
 %\hline
 %\end{tabular}
 %\vspace{-3mm}
 %}
 %\end{center}
 %\caption{Internal memory based algorithm: average time in seconds for constructing a MPHF,
 %the standard deviation (SD), and the confidence intervals considering
 %a confidence level of $95\%$.}
 %\label{tab:medias}
 %\end{table*}
 Our three-step internal memory based algorithm presented in~\cite{bkz05}
 is used for constructing a MPHF for each bucket.
 It is a randomized algorithm because it needs to generate a simple random graph
 in its first step. 
 Once the graph is obtained the other two steps are deterministic. 
 Thus, we can consider the runtime of the algorithm to have the form~$\alpha
 nZ$ for an input of~$n$ keys, where~$\alpha$ is some machine dependent
 constant that further depends on the length of the keys and~$Z$ is a random
 variable with geometric distribution with mean~$1/p=e^{1/c^2}$ (see
 Section~\ref{sec:mphfbucket}).  All results in our experiments were obtained
 taking $c=1$; the value of~$c$, with~$c\in[0.93,1.15]$, in fact has little
 influence on the runtime, as shown in~\cite{bkz05}.
 The values chosen for $n$ were $1, 2, 4, 8, 16$ and $32$ million.
 Although we have a dataset with 1~billion URLs, on a PC with
 1~gigabyte of main memory, the algorithm is able
 to handle an input with at most 32 million keys.
 This is mainly because of the graph we need to keep in main memory.
 The algorithm requires $25n + O(1)$ bytes for constructing 
 a MPHF (details about the data structures used by the algorithm can
 be found at~\texttt{http://cmph.sf.net}).
 % for the details about the data structures 
 %used by the algorithm).
 In order to estimate the number of trials for each value of $n$ we use
 a statistical method for determining a suitable sample size (see, e.g.,
 \cite[Chapter 13]{j91}).  
 As we obtained different values for each $n$, 
 we used the maximal value obtained, namely, 300~trials in order to have 
 a confidence level of $95\%$.
 % \begin{figure*}[ht]
 %   \noindent
 %   \begin{minipage}[b]{0.5\linewidth}
 %     \centering
 %     \subfigure[The previous algorithm]
 %     {\scalebox{0.5}{\includegraphics{figs/bmz_temporegressao.eps}}}
 %   \end{minipage}
 %   \hfill
 %   \begin{minipage}[b]{0.5\linewidth}
 %     \centering
 %     \subfigure[The new algorithm]
 %     {\scalebox{0.5}{\includegraphics{figs/brz_temporegressao.eps}}}
 %   \end{minipage}
 %     \caption{Time versus number of keys in $S$. The solid line corresponds to 
 % a linear regression model.}
 % %obtained from the experimental measurements.}
 %     \label{fig:temporegressao}
 % \end{figure*}
 Table~\ref{tab:medias} presents the runtime average for each $n$,
 the respective standard deviations, and 
 the respective confidence intervals given by 
 the average time $\pm$ the distance from average time
 considering a confidence level of $95\%$.
 Observing the runtime averages one sees that 
 the algorithm runs in expected linear time, 
 as shown in~\cite{bkz05}. 
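 As an illustration of how such an interval can be obtained, assume the usual normal approximation
 with 300 trials (the factor $1.96$ is our assumption; the exact quantile used is not stated).
 For $n = 1$ million keys, whose standard deviation is $2.6$ seconds, the half-width is then
 \[
   1.96 \times \frac{2.6}{\sqrt{300}} \approx 1.96 \times 0.150 \approx 0.29 ,
 \]
 which is consistent with the $\pm 0.3$ seconds reported in Table~\ref{tab:medias}.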
 \vspace{-2mm}
 \begin{table*}[htb]
 \begin{center}
 {\scriptsize
 \begin{tabular}{|c|c|c|c|c|c|c|}
 \hline
 $n$ (millions)  & 1                 & 2                    & 4                  & 8                  & 16      & 32 \\
 \hline
 Average time (s)& $6.1 \pm 0.3$ & $12.2 \pm 0.6$   & $25.4 \pm 1.1$ & $51.4 \pm 2.0$ & $117.3 \pm 4.4$ & $262.2 \pm 8.7$\\
 SD (s)          & $2.6$           & $5.4$              & $9.8$            & $17.6$           & $37.3$            & $76.3$  \\
 \hline
 \end{tabular}
 \vspace{-1mm}
 }
 \end{center}
 \caption{Internal memory based algorithm: average time in seconds for constructing a MPHF,
 the standard deviation (SD), and the confidence intervals considering
 a confidence level of $95\%$.}
 \label{tab:medias}
 \vspace{-4mm}
 \end{table*}
 % \enlargethispage{\baselineskip}
 % \begin{table*}[htb]
 % \begin{center}
 % {\scriptsize
 % (a)
 % \begin{tabular}{|c|c|c|c|c|c|c|c|}
 % \hline
 % $n$ (millions)  & 1                 & 2                    & 4                  & 8                  & 16                 & 32 \\
 % \hline
 % Average time (s)& $6.119 \pm 0.300$ & $12.190 \pm 0.615$   & $25.359 \pm 1.109$ & $51.408 \pm 2.003$ & $117.343 \pm 4.364$ & $262.215 \pm 8.724$\\
 % SD (s)          & $2.644$           & $5.414$              & $9.757$            & $17.627$           & $37.333$            & $76.271$  \\
 % \hline
 % \end{tabular}
 % \\[5mm]  (b)
 % \begin{tabular}{|l|c|c|c|c|c|}
 % \hline
 % $n$ (millions)   & 1                  & 2                  & 4                  & 8                   & 16             \\
 % \hline % Part.      16 \%                 16 \%                 16 \%                18 \%                 20\%           
 % Average time (s) & $6.927 \pm 0.309$  & $13.828 \pm 0.175$ & $31.936 \pm 0.663$ & $69.902 \pm 1.084$  & $140.617 \pm 2.502$  \\
 % SD               & $0.431$            & $0.245$            & $0.926$            & $1.515$             & $3.498$         \\
 % \hline
 % \hline
 % $n$ (millions)   & 32                  & 64                   & 128                    & 512                  & 1000            \\
 % \hline % Part.      20 \%                 20\%                  20\%                      18\%                    18\%
 % Average time (s) & $284.330 \pm 1.135$ & $587.880 \pm 3.945$  & $1223.581 \pm 4.864$   & $5966.402 \pm 9.465$ & $13229.540 \pm 12.670$  \\
 % SD               & $1.587$             & $5.514$              & $6.800$                & $13.232$             & $18.577$            \\
 % \hline
 % \end{tabular}
 % }
 % \end{center}
 % \caption{The runtime averages in seconds, 
 % the standard deviation (SD), and
 % the confidence intervals given by the average time $\pm$ 
 % the distance from average time considering 
 % a confidence level of $95\%$.}
 % \label{tab:medias}
 % \end{table*}
 \enlargethispage{2\baselineskip}
 Figure~\ref{fig:bmz_temporegressao} 
 presents the runtime for each trial. In addition, 
 the solid line corresponds to a linear regression model 
 obtained from the experimental measurements.
 As we can see, the runtime for a given $n$ has a considerable 
 fluctuation. However, the fluctuation also grows linearly with $n$.
 \begin{figure}[htb]
 \vspace{-2mm}
 \begin{center}
 \scalebox{0.4}{\includegraphics{figs/bmz_temporegressao.eps}}
 \caption{Time versus number of keys in $S$ for the internal memory based algorithm.
 The solid line corresponds to a linear regression model.}
 \label{fig:bmz_temporegressao}
 \end{center}
 \vspace{-6mm}
 \end{figure}
 The observed fluctuation in the runtimes is as expected; recall that this
 runtime has the form~$\alpha nZ$ with~$Z$ a geometric random variable with
 mean~$1/p=e$.  Thus, the runtime has mean~$\alpha n/p=\alpha en$ and standard
 deviation~$\alpha n\sqrt{(1-p)/p^2}=\alpha n\sqrt{e(e-1)}$. 
 Therefore, the standard deviation also grows 
 linearly with $n$, as experimentally verified 
 in Table~\ref{tab:medias} and in Figure~\ref{fig:bmz_temporegressao}.
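 The closed form of the standard deviation follows from the variance $(1-p)/p^2$ of a
 geometric random variable, here with $p = 1/e$. Spelled out:
 \[
   \frac{1-p}{p^{2}} = \left(1-\frac{1}{e}\right)e^{2} = e^{2}-e = e(e-1) \approx 4.67,
   \qquad
   \sqrt{e(e-1)} \approx 2.16 .
 \]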
 %\noindent-------------------------------------------------------------------------\\
 %Comentario para Yoshi: Nao consegui reproduzir bem o que discutimos 
 %no paragrafo acima, acho que vc conseguira justificar melhor :-). \\
 %-------------------------------------------------------------------------\\
--- a/vldb07/determiningb.tex
+++ b/vldb07/determiningb.tex
@@ -0,0 +1,146 @@
 % Nivio: 29/jan/06
 % Time-stamp: <Monday 30 Jan 2006 04:01:40am EDT yoshi@ime.usp.br>
 \enlargethispage{2\baselineskip}
 \subsection{Determining~$b$}
 \label{sec:determining-b}
 \begin{table*}[t]
 \begin{center}
 {\small %\scriptsize
 \begin{tabular}{|c|ccc|ccc|}
 \hline
 \raisebox{-0.7em}{$n$} & \multicolumn{3}{c|}{\raisebox{-1mm}{$b=128$}}  &
 \multicolumn{3}{c|}{\raisebox{-1mm}{$b=175$}}\\
 \cline{2-4} \cline{5-7}
 & \raisebox{-0.5mm}{Worst Case} & \raisebox{-0.5mm}{Average} &\raisebox{-0.5mm}{Eq.~(\ref{eq:maxbs})} 
 & \raisebox{-0.5mm}{Worst Case} & \raisebox{-0.5mm}{Average} &\raisebox{-0.5mm}{Eq.~(\ref{eq:maxbs})} \\
 \hline
 $1.0 \times 10^6$  & 177 & 172.0 & 176  & 232 & 226.6 & 230 \\
 %$2.0 \times 10^6$  & 179 & 174.0 & 178  & 236 & 228.5 & 232 \\
 $4.0 \times 10^6$  & 182 & 177.5 & 179  & 241 & 231.8 & 234 \\
 %$8.0 \times 10^6$  & 186 & 181.6 & 181  & 238 & 234.2 & 236 \\
 $1.6 \times 10^7$  & 184 & 181.6 & 183  & 241 & 236.1 & 238 \\
 %$3.2 \times 10^7$  & 191 & 183.9 & 184  & 240 & 236.6 & 240 \\
 $6.4 \times 10^7$  & 195 & 185.2 & 186  & 244 & 239.0 & 242 \\
 %$1.28 \times 10^8$ & 193 & 187.7 & 187  & 244 & 239.7 & 244 \\
 $5.12 \times 10^8$ & 196 & 191.7 & 190  & 251 & 246.3 & 247 \\
 $1.0 \times 10^9$  & 197 & 191.6 & 192  & 253 & 248.9 & 249 \\
 \hline
 \end{tabular}
 \vspace{-1mm}
 }
 \end{center}
 \caption{Values for $\mathit{BS}_{\mathit{max}}$: worst case and average case obtained in the experiments and using Eq.~(\ref{eq:maxbs}), 
 considering $b=128$ and $b=175$ for different numbers~$n$ of keys in $S$.} 
 \label{tab:comparison}
 \vspace{-6mm}
 \end{table*}
 The partitioning step can be viewed as the well known ``balls into bins'' 
 problem~\cite{ra98,dfm02} where~$n$ keys (the balls) are placed independently and 
 uniformly into $\lceil n/b\rceil$ buckets (the bins). The main question related to that problem we are interested 
 in is: what is the maximum number of keys in any bucket? 
 In fact, we want to get the maximum value for $b$ that makes the maximum number of keys in any bucket
 no greater than 256 with high probability. 
 This is important, as we wish to use 8 bits per entry in the vector $g_i$ of
 each $\mathrm{MPHF}_i$, 
 where $0 \leq i < \lceil n/b\rceil$.      
 Let $\mathit{BS}_{\mathit{max}}$ be the maximum number of keys in any bucket.
 Clearly, $\BSmax$ is the maximum
 of~$\lceil n/b\rceil$ random variables~$Z_i$, each with binomial
 distribution~$\Bi(n,p)$ with parameters~$n$ and~$p=1/\lceil n/b\rceil$.
 However, the~$Z_i$ are not independent.  Note that~$\Bi(n,p)$ has mean and
 variance~$\simeq b$.  To give an upper estimate for the probability of the
 event~$\BSmax\geq \gamma$, we can estimate the probability that we have~$Z_i\geq \gamma$
 for a fixed~$i$, and then sum these estimates over all~$i$.
 Let~$\gamma=b+\sigma\sqrt{b\ln(n/b)}$, where~$\sigma=\sqrt2$.
 Approximating~$\Bi(n,p)$ by the normal distribution with mean and
 variance~$b$, we obtain the
 estimate~$(\sigma\sqrt{2\pi\ln(n/b)})^{-1}\times\exp(-(1/2)\sigma^2\ln(n/b))$ for
 the probability that~$Z_i\geq \gamma$ occurs, which, summed over all~$i$, gives
 that the probability that~$\BSmax\geq \gamma$ occurs is at
 most~$1/(\sigma\sqrt{2\pi\ln(n/b)})$, which tends to~$0$ as~$n\to\infty$.
 Thus, we have shown that, with high probability, 
 \begin{equation}
  \label{eq:maxbs}
  \BSmax\leq b+\sqrt{2b\ln{n\over b}}.
 \end{equation}
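 The summation step can be made explicit. With $\sigma=\sqrt2$ we have
 $\exp(-(1/2)\sigma^2\ln(n/b)) = (n/b)^{-1} = b/n$, and there are $\lceil n/b\rceil \simeq n/b$ buckets.
 Ignoring the rounding in $\lceil n/b\rceil$, this gives
 \[
   \Pr[\BSmax\geq\gamma] \leq \sum_{i} \Pr[Z_i\geq\gamma]
   \simeq \frac{n}{b}\times\frac{1}{\sigma\sqrt{2\pi\ln(n/b)}}\times\frac{b}{n}
   = \frac{1}{\sigma\sqrt{2\pi\ln(n/b)}} .
 \]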
 % The traditional approach used to estimate $\mathit{BS}_{\mathit{max}}$ with high probability is 
 % to consider $\mathit{BS}_{\mathit{max}}$ as a random variable that follows a binomial distribution 
 % that can be approximated by a poisson distribution. This yields a good approximation
 % when the number of balls is lower than or equal to the number of bins~\cite{g81}. In our case,
 % the number of balls is greater than the number of buckets.
 % % and that is why we have used more recent works to estimate $\mathit{BS}_{\mathit{max}}$. 
 % As $b > \ln (n/b)$, we can use the result by Raab and Steger~\cite{ra98} to estimate
 % $\mathit{BS}_{\mathit{max}}$ with high probability. 
 % The following equation gives the estimation, where $\sigma=\sqrt{2}$:  
 % \begin{eqnarray} \label{eq:maxbs}
 % \mathit{BS}_{\mathit{max}} = b + O \left( \sqrt{b\ln\frac{n}{b}} \right) = b + \sigma \times \left(\sqrt{b\ln\frac{n}{b}} \right)
 % \end{eqnarray}
 % In order to estimate the suitable constant $\sigma$ we did a linear 
 % regression suppressing the constant term. 
 % We use the equation $BS_{max} - b = \sigma \times \sqrt{b\ln (n/b)}$ 
 % in the linear regression considering $y=BS_{max} - b$ and $x=\sqrt{b\ln (n/b)}$. 
 % In order to obtain data to be used in the linear regression we set 
 % b=128 and ran the new algorithm ten times 
 % for n equal to 1, 2, 4, 8, 16, 32, 64, 128, 512, 1000 million keys.  
 % Taking a confidence level equal to 95\% we got 
 % $\sigma = 2.11 \pm 0.03$. 
 % The coefficient of determination was $99.6\%$, which means that the linear 
 % regression explains $99.6\%$ of the data variation and only $0.4\%$ 
 % is due to experimental errors.
 % Therefore, Eq.~(\ref{eq:maxbs}) with $\sigma = 2.11 \pm 0.03$ and $b=128$ 
 % makes a very good estimation of the maximum number of keys in any bucket.
 % Repeating the same experiments for $b$ equals to $175$ and  
 % a confidence level of $95\%$ we got $\sigma = 2.07 \pm 0.03$.
 % Again we verified that Eq.~(\ref{eq:maxbs}) with $\sigma = 2.07 \pm 0.03$ and $b=175$ is 
 % a very good estimation of the maximum number of keys in any bucket once the
 % coefficient of determination obtained was $99.7 \%$ and $\sigma$ is in a very narrow range.
 In our algorithm the maximum number of keys in any bucket must be at most 256. 
 Table~\ref{tab:comparison} presents the values for $\mathit{BS}_{\mathit{max}}$
 obtained experimentally and using Eq.~(\ref{eq:maxbs}).
 The table presents the worst case and the average case, 
 considering $b=128$,  $b=175$ and Eq.~(\ref{eq:maxbs}),
 for several numbers~$n$ of keys in $S$. 
 The estimation given by Eq.~(\ref{eq:maxbs}) is very close to the experimental
 results.
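 For instance, the last row of Table~\ref{tab:comparison} can be reproduced from
 Eq.~(\ref{eq:maxbs}) as follows (values rounded; a worked check, not additional data).
 For $n = 10^9$ and $b = 175$,
 \[
   \ln\frac{n}{b} \approx 15.56, \qquad
   b + \sqrt{2b\ln\frac{n}{b}} \approx 175 + \sqrt{5445.5} \approx 175 + 73.8 \approx 249,
 \]
 and for $b=128$, where $\ln(n/b) \approx 15.87$, the same computation gives
 $128 + \sqrt{256 \times 15.87} \approx 128 + 63.7 \approx 192$.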
 Now we estimate the biggest problem our algorithm is able to solve for 
 a given $b$. 
 Using Eq.~(\ref{eq:maxbs}) considering $b=128$, $b=175$ and imposing
 that~$\mathit{BS}_{\mathit{max}}\leq256$, 
 the sizes of the biggest key set our algorithm 
 can deal with are $10^{30}$ keys and $10^{10}$ keys, respectively. 
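 These limits follow from solving Eq.~(\ref{eq:maxbs}) for~$n$ under the constraint
 $\BSmax\leq256$. A sketch of the computation, valid for $b \leq 256$:
 \[
   b+\sqrt{2b\ln\frac{n}{b}}\leq 256
   \quad\Longleftrightarrow\quad
   n \leq b\,\exp\!\left(\frac{(256-b)^2}{2b}\right).
 \]
 For $b=128$ the exponent is $128^2/256 = 64$, giving $n \leq 128\,e^{64} \approx 8 \times 10^{29}$;
 for $b=175$ the exponent is $81^2/350 \approx 18.75$, giving $n \leq 175\,e^{18.75} \approx 2.4 \times 10^{10}$.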
 %It is also important to have $b$ as big as possible, once its value is
 %related to the space required to store the resultant MPHF, as shown later on.
 %Table~\ref{tab:bp} shows the biggest problem the algorithm can solve.
 % The values were obtained from Eq.~(\ref{eq:maxbs}), 
 % considering $b=128$ and~$b=175$ and imposing
 % that~$\mathit{BS}_{\mathit{max}}\leq256$. 
 % We set $\sigma=2.14$ because it was the greatest value obtained for $\sigma$
 % in the two linear regression we did.
 % \vspace{-3mm} 
 % \begin{table}[htb]
 % \begin{center}
 % {\small %\scriptsize
 % \begin{tabular}{|c|c|}
 % \hline
 % b    & Problem size ($n$) \\
 % \hline
 % 128  & $10^{30}$ keys  \\
 % 175  & $10^{10}$ keys  \\
 % \hline
 % \end{tabular}
 % \vspace{-1mm}
 % }
 % \end{center}
 % \caption{Using Eq.~(\ref{eq:maxbs}) to estimate the biggest problem our algorithm can solve.}
 % %considering $\sigma=\sqrt{2}$.}
 % \label{tab:bp}
 % \vspace{-14mm}
 % \end{table}
--- a/vldb07/diskaccess.tex
+++ b/vldb07/diskaccess.tex
@@ -0,0 +1,113 @@
 % Nivio: 29/jan/06
 % Time-stamp: <Sunday 29 Jan 2006 11:58:28pm EST yoshi@flare>
 \vspace{-2mm}
 \subsection{Controlling disk accesses}
 \label{sec:contr-disk-access}
 In order to bring down the number of seek operations on disk
 we benefit from the fact that our algorithm leaves almost all main
 memory available to be used as disk I/O buffer. 
 In this section we evaluate how much the parameter $\mu$ 
 affects the runtime of our algorithm.
 For that we fixed $n$ at 1 billion URLs,
 set the main memory of the machine used for the experiments 
 to 1 gigabyte and used $\mu$ equal to $100, 200, 300, 400, 500$ and $600$
 megabytes. 
 \enlargethispage{2\baselineskip}
 Table~\ref{tab:diskaccess} presents the number of files $N$,
 the buffer size used for all files, the number of seeks in the worst case considering
 the pessimistic assumption mentioned in Section~\ref{sec:linearcomplexity}, and 
 the time to generate a MPHF for 1 billion keys as a function of the amount of internal 
 memory available. Table~\ref{tab:diskaccess} shows that the time spent in the construction
 decreases as the value of $\mu$ increases. However, for $\mu > 400$, the variation 
 in the time is not as significant as for $\mu \leq 400$. 
 This can be explained by the fact that the I/O scheduler of the Linux 2.6 kernel
 has smart policies  
 for avoiding seeks and diminishing the average seek time 
 (see \texttt{http://www.linuxjournal.com/article/6931}).
 \begin{table*}[ht]
 \vspace{-2mm}
 \begin{center}
 {\scriptsize
 \begin{tabular}{|l|c|c|c|c|c|c|}
 \hline
 $\mu$ (MB)                                                        & $100$                        & $200$                       & $300$                       & $400$                       & $500$                       & $600$ \\
 \hline
 $N$ (files)                                                       & $619$                        & $310$                       & $207$                       & $155$                       & $124$                       & $104$ \\
 %\hline
 \textbaht~(buffer size in KB)                                     & $165$                        & $661$                       & $1,484$                     & $2,643$                     & $4,129$                     & $5,908$ \\
 %\hline
 $\beta$/\textbaht~(\# of seeks in the worst case)                 & $384,478$                    & $95,974$                    & $42,749$                    & $24,003$                    & $15,365$                    & $10,738$ \\
 % \hline
 % \raisebox{-0.2em}{\# of seeks performed in}                       & \raisebox{-0.7em}{$383,056$} & \raisebox{-0.7em}{$95,919$} & \raisebox{-0.7em}{$42,700$} & \raisebox{-0.7em}{$23,980$} & \raisebox{-0.7em}{$15,347$} & \raisebox{-0.7em}{$xx,xxx$} \\
 % \raisebox{0.2em}{statement 1.3 of Figure~\ref{fig:readingbucket}} &                              &                             &                             &                             &                             &       \\
 % \hline
 Time (hours)                                                      & $4.04$                       & $3.64$                      & $3.34$                      & $3.20$                      & $3.13$                      & $3.09$      \\
 \hline
 \end{tabular}
 \vspace{-1mm}
 }
 \end{center}
 \caption{Influence of the internal memory area size ($\mu$) on the runtime of our algorithm.}
 \label{tab:diskaccess}
 \vspace{-14mm}
 \end{table*}
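 The buffer sizes in Table~\ref{tab:diskaccess} agree with \textbaht$\: = \mu/N$ when one
 megabyte is read as 1024 kilobytes (our reading of the units). For example,
 \[
   \frac{100 \times 1024}{619} \approx 165\ \mathrm{KB},
   \qquad
   \frac{600 \times 1024}{104} \approx 5{,}908\ \mathrm{KB}.
 \]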
 % \begin{table*}[ht]
 % \begin{center}
 % {\scriptsize
 % \begin{tabular}{|l|c|c|c|c|c|c|c|c|c|c|c|}
 % \hline
 % $\mu$ (MB)                                                        & $100$                        & $150$                        & $200$                       & $250$                       & $300$                       & $350$                       & $400$                       & $450$                       & $500$                       & $550$                       & $600$ \\
 % \hline
 % $N$ (files)                                                       & $619$                        & $413$                        & $310$                       & $248$                       & $207$                       & $177$                       & $155$                       & $138$                       & $124$                       & $113$                       & $103$ \\
 % \hline
 % \textbaht~(buffer size in KB)                                     & $165$                        & $372$                        & $661$                       & $1,033$                     & $1,484$                     & $2,025$                     & $2,643$                     & $3,339$                     &                             &                             &       \\
 % \hline
 % \# of seeks (Worst case)                                          & $384,478$                     & $170,535$                    & $95,974$                    & $61,413$                    & $42,749$                    & $31,328$                    & $24,003$                    & $19,000$                    &                             &                             &       \\
 % \hline
 % \raisebox{-0.2em}{\# of seeks performed in}                       & \raisebox{-0.7em}{$383,056$} & \raisebox{-0.7em}{$170,385$} & \raisebox{-0.7em}{$95,919$} & \raisebox{-0.7em}{$61,388$} & \raisebox{-0.7em}{$42,700$} & \raisebox{-0.7em}{$31,296$} & \raisebox{-0.7em}{$23,980$} & \raisebox{-0.7em}{$18,978$} & \raisebox{-0.7em}{$xx,xxx$} & \raisebox{-0.7em}{$xx,xxx$} & \raisebox{-0.7em}{$xx,xxx$} \\
 % \raisebox{0.2em}{statement 1.3 of Figure~\ref{fig:readingbucket}} &                              &                              &                             &                             &                             &                             &                             &                             &                             &                             &       \\
 % \hline
 % Time (hours)                                                      & $4.04$                          & $3.93$                       & $3.64$                      & $3.46$                      & $3.34$                      & $3.26$                      & $3.20$                      & $3.13$                      &                             &                             &       \\
 % \hline
 % \end{tabular}
 % }
 % \end{center}
 % \caption{Influence of the internal memory area size ($\mu$) in our algorithm runtime.}
 % \label{tab:diskaccess}
 % \end{table*}
 % \begin{table*}[htb]
 % \begin{center}
 % {\scriptsize
 % \begin{tabular}{|l|c|c|c|c|c|}
 % \hline
 % $n$ (millions)   & 1                  & 2                  & 4                  & 8                   & 16             \\
 % \hline % Part.      16 \%                 16 \%                 16 \%                18 \%                 20\%           
 % Average time (s) & $14.124 \pm 0.128$ & $28.301 \pm 0.140$ & $56.807 \pm 0.312$ & $117.286 \pm 0.997$ & $241.086 \pm 0.936$  \\
 % SD               & $0.179$            & $0.196$            & $0.437$            & $1.394$             & $1.308$         \\
 % \hline
 % \hline
 % $n$ (millions)   & 32                  & 64                   & 128                    & 512                    & 1000            \\
 % \hline % Part.      20 \%                 20\%                  20\%                      18\%                    18\%
 % Average time (s) & $492.430 \pm 1.565$ & $1006.307 \pm 1.425$ & $2081.208 \pm 0.740$   & $9253.188 \pm 4.406$ & $19021.480 \pm 13.850$  \\
 % SD               & $2.188$             & $1.992$              & $1.035$                & $ 6.160$             & $18.016$            \\
 % \hline
 % \end{tabular}
 % }
 % \end{center}
 % \caption{The runtime averages in seconds, 
 % the standard deviation (SD), and
 % the confidence intervals given by the average time $\pm$ 
 % the distance from average time considering 
 % a confidence level of $95\%$.
 % }
 % \label{tab:mediasbrz}
 % \end{table*}
--- a/vldb07/experimentalresults.tex
+++ b/vldb07/experimentalresults.tex
@@ -0,0 +1,15 @@
 %Nivio: 29/jan/06
 % Time-stamp: <Sunday 29 Jan 2006 11:57:21pm EST yoshi@flare>
 \vspace{-2mm}
 \enlargethispage{2\baselineskip}
 \section{Appendix: Experimental results}
 \label{sec:experimental-results}
 \vspace{-1mm}
 In this section we present the experimental results.
 We start by presenting the experimental setup. 
 We then present experimental results for
 the internal memory based algorithm~\cite{bkz05} 
 and for our algorithm.
 Finally, we discuss how the amount of internal memory available 
 affects the runtime of our algorithm. 
--- a/vldb07/figs/bmz_temporegressao.png
+++ b/vldb07/figs/bmz_temporegressao.png
--- a/vldb07/figs/brz-partitioning.fig
+++ b/vldb07/figs/brz-partitioning.fig
@@ -0,0 +1,107 @@
 #FIG 3.2
 Landscape
 Center
 Metric
 A4      
 100.00
 Single
 -2
 1200 2
 0 32 #bdbebd
 0 33 #bdbebd
 0 34 #bdbebd
 0 35 #4a4d4a
 0 36 #bdbebd
 0 37 #4a4d4a
 0 38 #bdbebd
 0 39 #bdbebd
 6 225 6615 2520 7560
 2 1 0 1 -1 7 50 -1 -1 0.000 0 0 -1 0 0 2
 	 900 7133 1608 7133
 2 2 0 1 0 35 50 -1 20 0.000 0 0 7 0 0 5
 	 260 6795 474 6795 474 6965 260 6965 260 6795
 2 2 0 1 0 35 50 -1 -1 0.000 0 0 7 0 0 5
 	 474 6795 686 6795 686 6965 474 6965 474 6795
 2 2 0 1 0 35 50 -1 -1 0.000 0 0 7 0 0 5
 	 474 6626 686 6626 686 6795 474 6795 474 6626
 2 2 0 1 0 32 50 -1 43 0.000 0 0 7 0 0 5
 	 1538 6795 1750 6795 1750 6965 1538 6965 1538 6795
 2 2 0 1 0 32 50 -1 43 0.000 0 0 7 0 0 5
 	 1538 6965 1750 6965 1750 7133 1538 7133 1538 6965
 2 2 0 1 -1 7 50 -1 -1 0.000 0 0 7 0 0 5
 	 474 6965 686 6965 686 7133 474 7133 474 6965
 2 2 0 1 0 7 50 -1 41 0.000 0 0 -1 0 0 5
 	 686 6965 900 6965 900 7133 686 7133 686 6965
 2 2 0 1 0 32 50 -1 43 0.000 0 0 7 0 0 5
 	 1538 6626 1750 6626 1750 6795 1538 6795 1538 6626
 2 2 0 1 0 35 50 -1 20 0.000 0 0 7 0 0 5
 	 260 6965 474 6965 474 7133 260 7133 260 6965
 2 2 0 1 0 7 50 -1 41 0.000 0 0 -1 0 0 5
 	 686 6795 900 6795 900 6965 686 6965 686 6795
 4 0 0 50 -1 0 14 0.0000 4 30 180 1148 7049 ...\001
 4 0 -1 50 -1 0 7 0.0000 2 60 60 332 7260 0\001
 4 0 -1 50 -1 0 7 0.0000 2 75 60 544 7260 1\001
 4 0 -1 50 -1 0 7 0.0000 2 60 60 758 7260 2\001
 4 0 -1 50 -1 0 7 0.0000 2 90 960 1538 7260 ${\\lceil n/b\\rceil - 1}$\001
 4 0 -1 50 -1 0 7 0.0000 2 105 975 540 7515 Buckets Logical View\001
 -6
 6 2700 6390 4365 7830
 6 3461 6445 3675 7425
 6 3463 6786 3675 7245
 6 3546 6893 3591 7094
 4 0 -1 50 -1 0 12 0.0000 2 15 45 3546 6959 .\001
 4 0 -1 50 -1 0 12 0.0000 2 15 45 3546 7027 .\001
 4 0 -1 50 -1 0 12 0.0000 2 15 45 3546 7094 .\001
 -6
 2 2 0 1 0 35 50 -1 -1 0.000 0 0 7 0 0 5
 	 3463 6786 3675 6786 3675 7245 3463 7245 3463 6786
 -6
 2 2 0 1 0 35 50 -1 -1 0.000 0 0 7 0 0 5
 	 3461 6445 3675 6445 3675 6615 3461 6615 3461 6445
 2 2 0 1 -1 7 50 -1 41 0.000 0 0 7 0 0 5
 	 3463 6616 3675 6616 3675 6785 3463 6785 3463 6616
 2 2 0 1 0 32 50 -1 43 0.000 0 0 7 0 0 5
 	 3463 7246 3675 7246 3675 7425 3463 7425 3463 7246
 -6
 6 3023 6786 3235 7245
 6 3106 6893 3151 7094
 4 0 -1 50 -1 0 12 0.0000 2 15 45 3106 6959 .\001
 4 0 -1 50 -1 0 12 0.0000 2 15 45 3106 7027 .\001
 4 0 -1 50 -1 0 12 0.0000 2 15 45 3106 7094 .\001
 -6
 2 2 0 1 0 35 50 -1 -1 0.000 0 0 7 0 0 5
 	 3023 6786 3235 6786 3235 7245 3023 7245 3023 6786
 -6
 6 4091 6425 4305 7425
 6 4093 6946 4305 7255
 6 4176 7018 4221 7153
 4 0 -1 50 -1 0 12 0.0000 2 15 45 4176 7063 .\001
 4 0 -1 50 -1 0 12 0.0000 2 15 45 4176 7108 .\001
 4 0 -1 50 -1 0 12 0.0000 2 15 45 4176 7153 .\001
 -6
 2 2 0 1 0 35 50 -1 -1 0.000 0 0 7 0 0 5
 	 4093 6946 4305 6946 4305 7255 4093 7255 4093 6946
 -6
 2 2 0 1 0 35 50 -1 -1 0.000 0 0 7 0 0 5
 	 4091 6605 4305 6605 4305 6775 4091 6775 4091 6605
 2 2 0 1 0 32 50 -1 43 0.000 0 0 7 0 0 5
 	 4093 7256 4305 7256 4305 7425 4093 7425 4093 7256
 2 2 0 1 -1 7 50 -1 41 0.000 0 0 7 0 0 5
 	 4093 6776 4305 6776 4305 6945 4093 6945 4093 6776
 2 2 0 1 0 35 50 -1 20 0.000 0 0 7 0 0 5
 	 4091 6425 4305 6425 4305 6595 4091 6595 4091 6425
 -6
 2 2 0 1 0 35 50 -1 20 0.000 0 0 7 0 0 5
 	 3021 6445 3235 6445 3235 6615 3021 6615 3021 6445
 2 2 0 1 -1 7 50 -1 -1 0.000 0 0 7 0 0 5
 	 3023 6616 3235 6616 3235 6785 3023 6785 3023 6616
 2 2 0 1 0 32 50 -1 43 0.000 0 0 7 0 0 5
 	 3023 7246 3235 7246 3235 7425 3023 7425 3023 7246
 4 0 0 50 -1 0 14 0.0000 4 30 180 3780 6975 ...\001
 4 0 -1 50 -1 0 7 0.0000 2 75 255 3015 7560 File 1\001
 4 0 -1 50 -1 0 7 0.0000 2 75 255 3465 7560 File 2\001
 4 0 -1 50 -1 0 7 0.0000 2 75 270 4095 7560 File N\001
 4 0 -1 50 -1 0 7 0.0000 2 105 1020 3195 7785 Buckets Physical View\001
 4 0 0 50 -1 0 10 0.0000 4 150 120 2700 7020 b)\001
 -6
 4 0 0 50 -1 0 10 0.0000 4 150 105 0 7020 a)\001
--- a/vldb07/figs/brz-partitioningfabiano.fig
+++ b/vldb07/figs/brz-partitioningfabiano.fig
@@ -0,0 +1,126 @@
 #FIG 3.2
 Landscape
 Center
 Metric
 A4      
 100.00
 Single
 -2
 1200 2
 0 32 #bebebe
 0 33 #4e4e4e
 6 2160 3825 2430 4365
 2 2 0 1 0 33 50 -1 -1 0.000 0 0 7 0 0 5
 	 2160 4005 2430 4005 2430 4095 2160 4095 2160 4005
 2 2 0 1 0 33 50 -1 -1 0.000 0 0 7 0 0 5
 	 2160 3825 2430 3825 2430 3915 2160 3915 2160 3825
 2 2 0 1 0 33 50 -1 -1 0.000 0 0 7 0 0 5
 	 2160 3915 2430 3915 2430 4005 2160 4005 2160 3915
 2 2 0 1 0 33 50 -1 -1 0.000 0 0 7 0 0 5
 	 2160 4275 2430 4275 2430 4365 2160 4365 2160 4275
 2 2 0 1 0 33 50 -1 -1 0.000 0 0 7 0 0 5
 	 2160 4185 2430 4185 2430 4275 2160 4275 2160 4185
 2 2 0 1 0 33 50 -1 -1 0.000 0 0 7 0 0 5
 	 2160 4095 2430 4095 2430 4185 2160 4185 2160 4095
 -6
 6 2430 3735 2700 4365
 2 2 0 1 0 7 50 -1 41 0.000 0 0 -1 0 0 5
 	 2430 3825 2700 3825 2700 3915 2430 3915 2430 3825
 2 2 0 1 0 7 50 -1 41 0.000 0 0 -1 0 0 5
 	 2430 4275 2700 4275 2700 4365 2430 4365 2430 4275
 2 2 0 1 0 7 50 -1 41 0.000 0 0 -1 0 0 5
 	 2430 4185 2700 4185 2700 4275 2430 4275 2430 4185
 2 2 0 1 0 7 50 -1 41 0.000 0 0 -1 0 0 5
 	 2430 4095 2700 4095 2700 4185 2430 4185 2430 4095
 2 2 0 1 0 7 50 -1 41 0.000 0 0 -1 0 0 5
 	 2430 4005 2700 4005 2700 4095 2430 4095 2430 4005
 2 2 0 1 0 7 50 -1 41 0.000 0 0 -1 0 0 5
 	 2430 3915 2700 3915 2700 4005 2430 4005 2430 3915
 2 2 0 1 0 7 50 -1 41 0.000 0 0 -1 0 0 5
 	 2430 3735 2700 3735 2700 3825 2430 3825 2430 3735
 -6
 6 2700 4005 2970 4365
 2 2 0 1 0 32 50 -1 43 0.000 0 0 -1 0 0 5
 	 2700 4275 2970 4275 2970 4365 2700 4365 2700 4275
 2 2 0 1 0 32 50 -1 43 0.000 0 0 -1 0 0 5
 	 2700 4185 2970 4185 2970 4275 2700 4275 2700 4185
 2 2 0 1 0 32 50 -1 43 0.000 0 0 -1 0 0 5
 	 2700 4095 2970 4095 2970 4185 2700 4185 2700 4095
 2 2 0 1 -1 32 50 -1 43 0.000 0 0 -1 0 0 5
 	 2700 4005 2970 4005 2970 4095 2700 4095 2700 4005
 -6
 6 2025 5625 3690 5760
 4 0 0 50 -1 0 10 0.0000 4 105 360 2025 5760 File 1\001
 4 0 0 50 -1 0 10 0.0000 4 105 360 2565 5760 File 2\001
 4 0 0 50 -1 0 10 0.0000 4 105 405 3285 5760 File N\001
 -6
 2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
 	 3510 4410 3510 4590
 2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
 	 3780 4410 3780 4590
 2 2 0 1 0 33 50 -1 20 0.000 0 0 7 0 0 5
 	 1890 4185 2160 4185 2160 4275 1890 4275 1890 4185
 2 2 0 1 0 33 50 -1 20 0.000 0 0 7 0 0 5
 	 1890 4275 2160 4275 2160 4365 1890 4365 1890 4275
 2 2 0 1 0 33 50 -1 20 0.000 0 0 7 0 0 5
 	 1890 4095 2160 4095 2160 4185 1890 4185 1890 4095
 2 2 0 1 0 33 50 -1 20 0.000 0 0 7 0 0 5
 	 2070 4860 2340 4860 2340 5040 2070 5040 2070 4860
 2 2 0 1 0 7 50 -1 41 0.000 0 0 7 0 0 5
 	 3330 5220 3600 5220 3600 5400 3330 5400 3330 5220
 2 2 0 1 0 33 50 -1 20 0.000 0 0 7 0 0 5
 	 3330 4860 3600 4860 3600 4950 3330 4950 3330 4860
 2 2 0 1 0 33 50 -1 -1 0.000 0 0 7 0 0 5
 	 2070 5040 2340 5040 2340 5130 2070 5130 2070 5040
 2 2 0 1 0 33 50 -1 -1 0.000 0 0 7 0 0 5
 	 3330 4950 3600 4950 3600 5220 3330 5220 3330 4950
 2 2 0 1 0 7 50 -1 41 0.000 0 0 7 0 0 5
 	 2070 5130 2340 5130 2340 5310 2070 5310 2070 5130
 2 2 0 1 0 7 50 -1 10 0.000 0 0 7 0 0 5
 	 2610 5400 2880 5400 2880 5580 2610 5580 2610 5400
 2 2 0 1 0 7 50 -1 41 0.000 0 0 7 0 0 5
 	 2610 4860 2880 4860 2880 5040 2610 5040 2610 4860
 2 2 0 1 0 32 50 -1 43 0.000 0 0 7 0 0 5
 	 2610 5040 2880 5040 2880 5130 2610 5130 2610 5040
 2 2 0 1 0 7 50 -1 50 0.000 0 0 -1 0 0 5
 	 2970 4275 3240 4275 3240 4365 2970 4365 2970 4275
 2 2 0 1 0 7 50 -1 50 0.000 0 0 -1 0 0 5
 	 2970 4185 3240 4185 3240 4275 2970 4275 2970 4185
 2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
 	 3510 4410 3600 4410
 2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
 	 3690 4410 3780 4410
 2 2 0 1 0 7 50 -1 10 0.000 0 0 -1 0 0 5
 	 3510 4275 3780 4275 3780 4365 3510 4365 3510 4275
 2 2 0 1 0 7 50 -1 10 0.000 0 0 -1 0 0 5
 	 3510 4185 3780 4185 3780 4275 3510 4275 3510 4185
 2 2 0 1 0 32 50 -1 20 0.000 0 0 7 0 0 5
 	 2610 5130 2880 5130 2880 5400 2610 5400 2610 5130
 2 2 0 1 0 32 50 -1 43 0.000 0 0 7 0 0 5
 	 2070 5310 2340 5310 2340 5490 2070 5490 2070 5310
 2 2 0 1 0 7 50 -1 10 0.000 0 0 7 0 0 5
 	 2070 5490 2340 5490 2340 5580 2070 5580 2070 5490
 2 2 0 1 0 7 50 -1 50 0.000 0 0 7 0 0 5
 	 3330 5400 3600 5400 3600 5490 3330 5490 3330 5400
 2 2 0 1 0 32 50 -1 20 0.000 0 0 -1 0 0 5
 	 3240 4275 3510 4275 3510 4365 3240 4365 3240 4275
 2 2 0 1 0 32 50 -1 20 0.000 0 0 -1 0 0 5
 	 3240 4185 3510 4185 3510 4275 3240 4275 3240 4185
 2 2 0 1 -1 32 50 -1 20 0.000 0 0 -1 0 0 5
 	 3240 4095 3510 4095 3510 4185 3240 4185 3240 4095
 2 2 0 1 0 32 50 -1 20 0.000 0 0 -1 0 0 5
 	 3240 4005 3510 4005 3510 4095 3240 4095 3240 4005
 2 2 0 1 0 32 50 -1 20 0.000 0 0 -1 0 0 5
 	 3240 3915 3510 3915 3510 4005 3240 4005 3240 3915
 2 2 0 1 0 32 50 -1 20 0.000 0 0 -1 0 0 5
 	 3330 5490 3600 5490 3600 5580 3330 5580 3330 5490
 4 0 0 50 -1 0 10 0.0000 4 105 75 1980 4545 0\001
 4 0 0 50 -1 0 10 0.0000 4 105 420 3555 4545 n/b  - 1\001
 4 0 0 50 -1 0 18 0.0000 4 30 180 3015 5265 ...\001
 4 0 0 50 -1 0 10 0.0000 4 105 75 2250 4545 1\001
 4 0 0 50 -1 0 10 0.0000 4 105 75 2520 4545 2\001
 4 0 0 50 -1 0 18 0.0000 4 30 180 2880 4500 ...\001
 4 0 0 50 -1 0 10 0.0000 4 135 1410 4050 5310 Buckets Physical View\001
 4 0 0 50 -1 0 10 0.0000 4 135 1350 4050 4140 Buckets Logical View\001
 4 0 0 50 -1 0 10 0.0000 4 135 120 1665 3780 a)\001
 4 0 0 50 -1 0 10 0.0000 4 135 135 1620 4950 b)\001
--- a/vldb07/figs/brz.fig
+++ b/vldb07/figs/brz.fig
@@ -0,0 +1,183 @@
 #FIG 3.2  Produced by xfig version 3.2.5-alpha5
 Landscape
 Center
 Metric
 A4      
 100.00
 Single
 -2
 1200 2
 0 32 #bdbebd
 0 33 #bdbebd
 0 34 #bdbebd
 0 35 #4a4d4a
 0 36 #bdbebd
 0 37 #4a4d4a
 0 38 #bdbebd
 0 39 #bdbebd
 0 40 #bdbebd
 6 3427 4042 3852 4211
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 3427 4041 3852 4041 3852 4211 3427 4211 3427 4041
 4 0 0 50 -1 0 14 0.0000 4 30 180 3551 4140 ...\001
 -6
 6 3410 5689 3835 5859
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 3410 5689 3835 5689 3835 5858 3410 5858 3410 5689
 4 0 0 50 -1 0 14 0.0000 4 30 180 3534 5788 ...\001
 -6
 6 3825 5445 4455 5535
 2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
 	 4140 5445 4095 5490
 2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
 	 4140 5445 4185 5490
 3 0 0 1 0 7 50 -1 -1 0.000 0 0 0 8
 	 3825 5535 3825 5490 3870 5490 3915 5490 3959 5490 4006 5490
 	 4095 5490 4095 5490
 	 0.000 1.000 1.000 1.000 1.000 1.000 1.000 0.000
 3 0 0 1 0 7 50 -1 -1 0.000 0 0 0 7
 	 4455 5535 4455 5490 4410 5490 4365 5490 4321 5490 4274 5490
 	 4185 5490
 	 0.000 1.000 1.000 1.000 1.000 1.000 0.000
 -6
 6 1873 5442 2323 5532
 2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
 	 2098 5442 2066 5487
 2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
 	 2098 5442 2130 5487
 3 0 0 1 0 7 50 -1 -1 0.000 0 0 0 8
 	 1873 5532 1873 5487 1905 5487 1937 5487 1969 5487 2002 5487
 	 2066 5487 2066 5487
 	 0.000 1.000 1.000 1.000 1.000 1.000 1.000 0.000
 3 0 0 1 0 7 50 -1 -1 0.000 0 0 0 7
 	 2323 5532 2323 5487 2291 5487 2259 5487 2227 5487 2194 5487
 	 2130 5487
 	 0.000 1.000 1.000 1.000 1.000 1.000 0.000
 -6
 6 2338 5442 2968 5532
 2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
 	 2653 5442 2608 5487
 2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
 	 2653 5442 2698 5487
 3 0 0 1 0 7 50 -1 -1 0.000 0 0 0 8
 	 2338 5532 2338 5487 2383 5487 2428 5487 2473 5487 2518 5487
 	 2608 5487 2608 5487
 	 0.000 1.000 1.000 1.000 1.000 1.000 1.000 0.000
 3 0 0 1 0 7 50 -1 -1 0.000 0 0 0 7
 	 2968 5532 2968 5487 2923 5487 2878 5487 2833 5487 2788 5487
 	 2698 5487
 	 0.000 1.000 1.000 1.000 1.000 1.000 0.000
 -6
 6 2475 4500 4770 5175
 2 1 0 1 -1 7 50 -1 -1 0.000 0 0 -1 0 0 2
 	 3137 5013 3845 5013
 2 2 0 1 0 37 50 -1 20 0.000 0 0 7 0 0 5
 	 2497 4675 2711 4675 2711 4845 2497 4845 2497 4675
 2 2 0 1 0 37 50 -1 -1 0.000 0 0 7 0 0 5
 	 2711 4675 2923 4675 2923 4845 2711 4845 2711 4675
 2 2 0 1 0 37 50 -1 -1 0.000 0 0 7 0 0 5
 	 2711 4506 2923 4506 2923 4675 2711 4675 2711 4506
 2 2 0 1 0 36 50 -1 43 0.000 0 0 7 0 0 5
 	 3775 4675 3987 4675 3987 4845 3775 4845 3775 4675
 2 2 0 1 0 36 50 -1 43 0.000 0 0 7 0 0 5
 	 3775 4845 3987 4845 3987 5013 3775 5013 3775 4845
 2 2 0 1 -1 7 50 -1 -1 0.000 0 0 7 0 0 5
 	 2711 4845 2923 4845 2923 5013 2711 5013 2711 4845
 2 2 0 1 0 7 50 -1 41 0.000 0 0 -1 0 0 5
 	 2923 4845 3137 4845 3137 5013 2923 5013 2923 4845
 2 2 0 1 0 36 50 -1 43 0.000 0 0 7 0 0 5
 	 3775 4506 3987 4506 3987 4675 3775 4675 3775 4506
 2 2 0 1 0 37 50 -1 20 0.000 0 0 7 0 0 5
 	 2497 4845 2711 4845 2711 5013 2497 5013 2497 4845
 2 2 0 1 0 7 50 -1 41 0.000 0 0 -1 0 0 5
 	 2923 4675 3137 4675 3137 4845 2923 4845 2923 4675
 4 0 0 50 -1 0 14 0.0000 4 30 180 3385 4929 ...\001
 4 0 -1 50 -1 0 7 0.0000 2 75 60 2569 5140 0\001
 4 0 -1 50 -1 0 7 0.0000 2 75 60 2781 5140 1\001
 4 0 -1 50 -1 0 7 0.0000 2 75 60 2995 5140 2\001
 4 0 -1 50 -1 0 7 0.0000 2 75 405 4059 4845 Buckets\001
 4 0 -1 50 -1 0 7 0.0000 2 105 1095 3775 5140 ${\\lceil n/b\\rceil - 1}$\001
 -6
 6 2983 5446 3433 5536
 2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
 	 3208 5446 3176 5491
 2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
 	 3208 5446 3240 5491
 3 0 0 1 0 7 50 -1 -1 0.000 0 0 0 8
 	 2983 5536 2983 5491 3015 5491 3047 5491 3079 5491 3112 5491
 	 3176 5491 3176 5491
 	 0.000 1.000 1.000 1.000 1.000 1.000 1.000 0.000
 3 0 0 1 0 7 50 -1 -1 0.000 0 0 0 7
 	 3433 5536 3433 5491 3401 5491 3369 5491 3337 5491 3304 5491
 	 3240 5491
 	 0.000 1.000 1.000 1.000 1.000 1.000 0.000
 -6
 2 2 0 1 0 36 50 -1 -1 0.000 0 0 7 0 0 5
 	 3852 4041 4066 4041 4066 4211 3852 4211 3852 4041
 2 2 0 1 0 36 50 -1 -1 0.000 0 0 7 0 0 5
 	 4066 4041 4279 4041 4279 4211 4066 4211 4066 4041
 2 2 0 1 0 37 50 -1 -1 0.000 0 0 7 0 0 5
 	 1937 4041 2149 4041 2149 4211 1937 4211 1937 4041
 2 2 0 1 0 37 50 -1 -1 0.000 0 0 7 0 0 5
 	 2149 4041 2362 4041 2362 4211 2149 4211 2149 4041
 2 2 0 1 0 37 50 -1 -1 0.000 0 0 7 0 0 5
 	 2362 4041 2576 4041 2576 4211 2362 4211 2362 4041
 2 2 0 1 0 37 50 -1 -1 0.000 0 0 7 0 0 5
 	 2576 4041 2788 4041 2788 4211 2576 4211 2576 4041
 2 2 0 1 0 37 50 -1 -1 0.000 0 0 7 0 0 5
 	 2788 4041 3002 4041 3002 4211 2788 4211 2788 4041
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 3214 4041 3427 4041 3427 4211 3214 4211 3214 4041
 2 2 0 1 0 36 50 -1 -1 0.000 0 0 7 0 0 5
 	 4279 4041 4492 4041 4492 4211 4279 4211 4279 4041
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 3002 4041 3214 4041 3214 4211 3002 4211 3002 4041
 2 2 0 1 0 37 50 -1 20 0.000 0 0 7 0 0 5
 	 2132 5689 2345 5689 2345 5858 2132 5858 2132 5689
 2 2 0 1 0 7 50 -1 41 0.000 0 0 -1 0 0 5
 	 3197 5689 3410 5689 3410 5858 3197 5858 3197 5689
 2 2 0 1 0 37 50 -1 -1 0.000 0 0 7 0 0 5
 	 2771 5689 2985 5689 2985 5858 2771 5858 2771 5689
 2 2 0 1 0 36 50 -1 43 0.000 0 0 7 0 0 5
 	 4262 5689 4475 5689 4475 5858 4262 5858 4262 5689
 2 2 0 1 0 36 50 -1 43 0.000 0 0 7 0 0 5
 	 4049 5689 4262 5689 4262 5858 4049 5858 4049 5689
 2 2 0 1 0 7 50 -1 41 0.000 0 0 -1 0 0 5
 	 2985 5689 3197 5689 3197 5858 2985 5858 2985 5689
 2 2 0 1 0 37 50 -1 -1 0.000 0 0 7 0 0 5
 	 2345 5689 2559 5689 2559 5858 2345 5858 2345 5689
 2 2 0 1 0 37 50 -1 20 0.000 0 0 7 0 0 5
 	 1914 5687 2127 5687 2127 5856 1914 5856 1914 5687
 2 2 0 1 0 36 50 -1 43 0.000 0 0 7 0 0 5
 	 3835 5689 4049 5689 4049 5858 3835 5858 3835 5689
 2 2 0 1 0 37 50 -1 -1 0.000 0 0 7 0 0 5
 	 2559 5689 2771 5689 2771 5858 2559 5858 2559 5689
 2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 0 5
 	1 1 1.00 60.00 120.00
 	 3330 4275 3330 4365 3330 4410 3330 4455 3330 4500
 2 1 0 1 0 7 50 -1 -1 0.000 0 0 7 1 0 2
 	1 1 1.00 45.00 60.00
 	 3880 5168 4140 5445
 2 1 0 1 0 7 50 -1 -1 0.000 0 0 7 1 0 2
 	1 1 1.00 45.00 60.00
 	 3025 5170 3205 5440
 2 1 0 1 0 7 50 -1 -1 0.000 0 0 7 1 0 2
 	1 1 1.00 45.00 60.00
 	 2805 5164 2653 5438
 2 1 0 1 0 7 50 -1 -1 0.000 0 0 7 1 0 2
 	1 1 1.00 45.00 60.00
 	 2577 5170 2103 5434
 4 0 -1 50 -1 0 7 0.0000 2 120 645 4562 4168 Key Set $S$\001
 4 0 -1 50 -1 0 7 0.0000 2 75 60 2008 3999 0\001
 4 0 -1 50 -1 0 7 0.0000 2 75 60 2220 3999 1\001
 4 0 -1 50 -1 0 7 0.0000 2 75 165 4314 3999 n-1\001
 4 0 -1 50 -1 0 7 0.0000 2 75 60 1991 5985 0\001
 4 0 -1 50 -1 0 7 0.0000 2 75 60 2203 5985 1\001
 4 0 -1 50 -1 0 7 0.0000 2 75 165 4297 5985 n-1\001
 4 0 -1 50 -1 0 7 0.0000 2 75 555 4545 5816 Hash Table\001
 4 0 -1 50 -1 0 3 0.0000 2 75 450 1980 5625 MPHF$_0$\001
 4 0 -1 50 -1 0 3 0.0000 2 75 450 2520 5625 MPHF$_1$\001
 4 0 -1 50 -1 0 3 0.0000 2 75 450 3015 5625 MPHF$_2$\001
 4 0 -1 50 -1 0 3 0.0000 2 75 1065 3825 5625 MPHF$_{\\lceil n/b \\rceil - 1}$\001
 4 0 -1 50 -1 0 7 0.0000 2 105 585 1440 4455 Partitioning\001
 4 0 -1 50 -1 0 7 0.0000 2 105 495 1440 5265 Searching\001
--- a/vldb07/figs/brz_temporegressao.png
+++ b/vldb07/figs/brz_temporegressao.png
--- a/vldb07/figs/brzfabiano.fig
+++ b/vldb07/figs/brzfabiano.fig
@@ -0,0 +1,153 @@
 #FIG 3.2  Produced by xfig version 3.2.5-alpha5
 Landscape
 Center
 Metric
 A4      
 100.00
 Single
 -2
 1200 2
 0 32 #bebebe
 6 2025 3015 3555 3690
 2 3 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 8
 	 2025 3285 2295 3285 2295 3015 3285 3015 3285 3285 3555 3285
 	 2790 3690 2025 3285
 4 0 0 50 -1 0 10 0.0000 4 135 765 2385 3330 Partitioning\001
 -6
 6 1890 3735 3780 4365
 6 2430 3735 2700 4365
 6 2430 3915 2700 4365
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 2430 4275 2700 4275 2700 4365 2430 4365 2430 4275
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 2430 4185 2700 4185 2700 4275 2430 4275 2430 4185
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 2430 4095 2700 4095 2700 4185 2430 4185 2430 4095
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 2430 4005 2700 4005 2700 4095 2430 4095 2430 4005
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 2430 3915 2700 3915 2700 4005 2430 4005 2430 3915
 -6
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 2430 3825 2700 3825 2700 3915 2430 3915 2430 3825
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 2430 3735 2700 3735 2700 3825 2430 3825 2430 3735
 -6
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 1890 4275 2160 4275 2160 4365 1890 4365 1890 4275
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 1890 4185 2160 4185 2160 4275 1890 4275 1890 4185
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 2160 4275 2430 4275 2430 4365 2160 4365 2160 4275
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 2160 4185 2430 4185 2430 4275 2160 4275 2160 4185
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 2160 4095 2430 4095 2430 4185 2160 4185 2160 4095
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 2160 4005 2430 4005 2430 4095 2160 4095 2160 4005
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 2160 3915 2430 3915 2430 4005 2160 4005 2160 3915
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 2700 4275 2970 4275 2970 4365 2700 4365 2700 4275
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 2700 4185 2970 4185 2970 4275 2700 4275 2700 4185
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 2700 4095 2970 4095 2970 4185 2700 4185 2700 4095
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 2700 4005 2970 4005 2970 4095 2700 4095 2700 4005
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 2160 3825 2430 3825 2430 3915 2160 3915 2160 3825
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 3240 4275 3510 4275 3510 4365 3240 4365 3240 4275
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 3510 4275 3780 4275 3780 4365 3510 4365 3510 4275
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 2970 4275 3240 4275 3240 4365 2970 4365 2970 4275
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 3240 4185 3510 4185 3510 4275 3240 4275 3240 4185
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 1890 4095 2160 4095 2160 4185 1890 4185 1890 4095
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 3510 4185 3780 4185 3780 4275 3510 4275 3510 4185
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 3240 4095 3510 4095 3510 4185 3240 4185 3240 4095
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 3240 4005 3510 4005 3510 4095 3240 4095 3240 4005
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 3240 3915 3510 3915 3510 4005 3240 4005 3240 3915
 2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
 	 1890 4365 3780 4365
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 2970 4185 3240 4185 3240 4275 2970 4275 2970 4185
 -6
 6 1260 5310 4230 5580
 2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
 	 1260 5400 4230 5400
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 1530 5310 1800 5310 1800 5400 1530 5400 1530 5310
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 2070 5310 2340 5310 2340 5400 2070 5400 2070 5310
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 2340 5310 2610 5310 2610 5400 2340 5400 2340 5310
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 2610 5310 2880 5310 2880 5400 2610 5400 2610 5310
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 2880 5310 3150 5310 3150 5400 2880 5400 2880 5310
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 3420 5310 3690 5310 3690 5400 3420 5400 3420 5310
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 3690 5310 3960 5310 3960 5400 3690 5400 3690 5310
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 3960 5310 4230 5310 4230 5400 3960 5400 3960 5310
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 1800 5310 2070 5310 2070 5400 1800 5400 1800 5310
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 3150 5310 3420 5310 3420 5400 3150 5400 3150 5310
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 1260 5310 1530 5310 1530 5400 1260 5400 1260 5310
 4 0 0 50 -1 0 10 0.0000 4 105 210 4005 5580 n-1\001
 4 0 0 50 -1 0 10 0.0000 4 105 75 1350 5580 0\001
 -6
 2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
 	 1260 2925 4230 2925
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 1530 2835 1800 2835 1800 2925 1530 2925 1530 2835
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 2070 2835 2340 2835 2340 2925 2070 2925 2070 2835
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 2340 2835 2610 2835 2610 2925 2340 2925 2340 2835
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 2610 2835 2880 2835 2880 2925 2610 2925 2610 2835
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 2880 2835 3150 2835 3150 2925 2880 2925 2880 2835
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 3420 2835 3690 2835 3690 2925 3420 2925 3420 2835
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 3690 2835 3960 2835 3960 2925 3690 2925 3690 2835
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 3960 2835 4230 2835 4230 2925 3960 2925 3960 2835
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 1800 2835 2070 2835 2070 2925 1800 2925 1800 2835
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 3150 2835 3420 2835 3420 2925 3150 2925 3150 2835
 2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
 	 1260 2835 1530 2835 1530 2925 1260 2925 1260 2835
 2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
 	 3510 4410 3510 4590
 2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
 	 3510 4410 3600 4410
 2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
 	 3690 4410 3780 4410
 2 3 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 8
 	 2025 4815 2295 4815 2295 4545 3285 4545 3285 4815 3555 4815
 	 2790 5220 2025 4815
 2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
 	 3780 4410 3780 4590
 4 0 0 50 -1 0 10 0.0000 4 135 585 2475 4860 Searching\001
 4 0 0 50 -1 0 10 0.0000 4 105 75 1980 4545 0\001
 4 0 0 50 -1 0 10 0.0000 4 105 690 4410 5400 Hash Table\001
 4 0 0 50 -1 0 10 0.0000 4 105 480 4410 4230 Buckets\001
 4 0 0 50 -1 0 10 0.0000 4 135 555 4410 2925 Key set S\001
 4 0 0 50 -1 0 10 0.0000 4 105 75 1350 2745 0\001
 4 0 0 50 -1 0 10 0.0000 4 105 210 4005 2745 n-1\001
 4 0 0 50 -1 0 10 0.0000 4 105 420 3555 4545 n/b  - 1\001
--- a/vldb07/figs/minimalperfecthash-ph-mph.png
+++ b/vldb07/figs/minimalperfecthash-ph-mph.png
--- a/vldb07/introduction.tex
+++ b/vldb07/introduction.tex
@@ -0,0 +1,109 @@
 %% Nivio: 22/jan/06 23/jan/06 29/jan
 % Time-stamp: <Monday 30 Jan 2006 03:52:42am EDT yoshi@ime.usp.br>
 \section{Introduction}
 \label{sec:intro}
 \enlargethispage{2\baselineskip}
 Suppose~$U$ is a universe of \textit{keys} of size $u$.
 Let $h:U\to M$ be a {\em hash function} that maps the keys from~$U$
 to a given interval of integers $M=[0,m-1]=\{0,1,\dots,m-1\}$.
 Let~$S\subseteq U$ be a set of~$n$ keys from~$U$, where $ n \ll u$.
 Given a key~$x\in S$, the hash function~$h$ computes an integer in
 $[0,m-1]$ for the storage or retrieval of~$x$ in a {\em hash table}.
 % Hashing methods for {\em non-static sets} of keys can be used to construct
 % data structures storing $S$ and supporting membership queries
 % ``$x \in S$?'' in expected time $O(1)$.
 % However, they involve a certain amount of wasted space owing to unused
 % locations in the table and wasted time to resolve collisions when
 % two keys are hashed to the same table location.
 A perfect hash function maps a {\em static set} $S$ of $n$ keys from $U$ into a set of $m$ integer 
 numbers without collisions, where $m$ is greater than or equal to $n$. 
 If $m$ is equal to $n$, the function is called minimal. 
 % Figure~\ref{fig:minimalperfecthash-ph-mph}(a) illustrates a perfect hash function and
 % Figure~\ref{fig:minimalperfecthash-ph-mph}(b) illustrates a minimal perfect hash function (MPHF).
 % 
 % \begin{figure}
 % \centering
 % \scalebox{0.7}{\epsfig{file=figs/minimalperfecthash-ph-mph.ps}}
 % \caption{(a) Perfect hash function (b) Minimal perfect hash function (MPHF)}
 % \label{fig:minimalperfecthash-ph-mph}
 % %\vspace{-5mm}
 % \end{figure}
 Minimal perfect hash functions are widely used for memory efficient storage and fast 
 retrieval of items from static sets, such as words in natural languages, 
 reserved words in programming languages or interactive systems, uniform resource 
 locators (URLs) in web search engines, or item sets in data mining techniques. 
 Search engines are nowadays indexing tens of billions of pages and algorithms
 like PageRank~\cite{Brin1998}, which uses the web link structure to derive a
 measure of popularity for Web pages, would benefit from a MPHF for storage and 
 retrieval of such huge sets of URLs. 
 For instance, the TodoBr\footnote{TodoBr (\texttt{www.todobr.com.br}) is a trademark of 
 Akwan Information Technologies, which was acquired by Google Inc. in July 2005.}
 search engine used the algorithm proposed hereinafter to 
 improve and to scale its link analysis system. 
 The WebGraph research group~\cite{bv04} would 
 also benefit from a MPHF for sets in the order of billions of URLs to scale
 and to improve the storage requirements of their algorithms on graph compression. 
 Another interesting application for MPHFs is their use as an indexing structure 
 for databases. 
 The B+ tree is very popular as an indexing structure for dynamic applications 
 with frequent insertions and deletions of records. 
 However, for applications with sporadic modifications and a huge number of 
 queries the B+ tree is not the best option, 
 because it performs poorly with very large sets of keys 
 such as those required for the new frontiers of database applications~\cite{s05}.
 Therefore, there are applications for MPHFs in 
 information retrieval systems, database systems, language translation systems, 
 electronic commerce systems, compilers, operating systems, among others.
 Until now, because of the limitations of current algorithms,
 the use of MPHFs is restricted to scenarios where the set of keys being hashed is 
 relatively small.
 However, in many cases it is crucial to deal in an efficient way with very large
 sets of keys. 
 Due to the exponential growth of the Web, working with huge collections is becoming
 a daily task. 
 For instance, the simple assignment of numeric identifiers to web pages of a collection 
 can be a challenging task. 
 While traditional databases simply cannot handle more traffic once the working 
 set of URLs does not fit in main memory anymore~\cite{s05}, the algorithm we propose here to
 construct MPHFs can easily scale to billions of entries.
 % using stock hardware.
 As there are many applications for MPHFs, it is 
 important to design and implement space- and time-efficient algorithms for 
 constructing such functions. 
 The attractiveness of using MPHFs depends on the following issues:
 \begin{enumerate}
 \item The amount of CPU time required by the algorithms for constructing MPHFs.
 \item The space requirements of the algorithms for constructing MPHFs.
 \item The amount of CPU time required by a MPHF for each retrieval.
 \item The space requirements of the description of the resulting MPHFs to be
  used at retrieval time.
 \end{enumerate}
 \enlargethispage{2\baselineskip}
 This paper presents a novel external memory based algorithm for constructing MPHFs that 
 is very efficient with respect to these four requirements.
 First, the algorithm constructs a MPHF in time linear in the number of keys,
 which is optimal.
 For instance, for a collection of 1 billion URLs 
 collected from the web, each one 64 characters long on average, the time to construct a
 MPHF using a 2.4 gigahertz PC with 500 megabytes of available main memory
 is approximately 3 hours.
 Second, the algorithm needs a small vector of $\lceil n/b \rceil$
 one-byte entries, whose size is defined a priori, in main memory to construct a MPHF.
 For the collection of 1 billion URLs and using $b=175$, the algorithm needs only
 5.45 megabytes of internal memory.
 Third, the evaluation of the MPHF for each retrieval requires three memory accesses and
 the computation of three universal hash functions.
 This is not optimal as any MPHF requires at least one memory access and the computation
 of two universal hash functions.
 Fourth, the description of a MPHF takes a constant number of bits for each key, which is optimal.
 For the collection of 1 billion URLs, it needs 8.1 bits for each key,
 while the theoretical lower bound is $1/\ln2 \approx 1.4427$ bits per 
 key~\cite{m84}.
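 As a quick check of the figures quoted above, and taking one megabyte to be
 $2^{20}$ bytes (an assumption made here about the unit), the vector of one-byte
 entries for $n = 10^9$ and $b = 175$ occupies
 \begin{displaymath}
 \left\lceil \frac{10^9}{175} \right\rceil = 5{,}714{,}286 \mbox{ bytes} \approx 5.45 \mbox{ megabytes},
 \end{displaymath}
 and the ratio between the 8.1 bits per key of the description and the lower bound is
 \begin{displaymath}
 \frac{8.1}{1/\ln 2} \approx \frac{8.1}{1.4427} \approx 5.6 .
 \end{displaymath}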
--- a/vldb07/makefile
+++ b/vldb07/makefile
@@ -0,0 +1,17 @@
 all: 
 	latex vldb.tex 
 	bibtex vldb
 	latex vldb.tex 
 	latex vldb.tex
 	dvips vldb.dvi -o vldb.ps
 	ps2pdf vldb.ps
 	chmod -R g+rwx *
 perm:
 	chmod -R g+rwx *
 run: clean all 
 	gv vldb.ps &
 clean:
 	rm -f *.aux *.bbl *.blg *.log *.ps *.pdf *.dvi
--- a/vldb07/partitioningthekeys.tex
+++ b/vldb07/partitioningthekeys.tex
@@ -0,0 +1,141 @@
 %% Nivio: 21/jan/06
 % Time-stamp: <Monday 30 Jan 2006 03:57:28am EDT yoshi@ime.usp.br>
 \vspace{-2mm}
 \subsection{Partitioning step}
 \label{sec:partitioning-keys}
 The set $S$ of $n$ keys is partitioned into $\lceil n/b \rceil$ buckets, 
 where $b$ is a suitable parameter chosen to guarantee
 that each bucket has at most 256 keys with high probability
 (see Section~\ref{sec:determining-b}).
 The partitioning step works as follows:
 \begin{figure}[h]
 \hrule 
 \hrule 
 \vspace{2mm}
 \begin{tabbing}
 aa\=type booleanx \==  (false, true); \kill
 \> $\blacktriangleright$ Let $\beta$ be the size in bytes of the set $S$ \\ 
 \> $\blacktriangleright$ Let $\mu$ be the size in bytes of an a priori reserved \\
 \> ~~~ internal memory area \\ 
 \> $\blacktriangleright$ Let $N = \lceil \beta/\mu \rceil$ be the number of key blocks that will \\
 \> ~~~ be read from disk into an internal memory area \\
 \> $\blacktriangleright$ Let $\mathit{size}$ be a vector that stores the size of each bucket \\
 \> $1.$ {\bf for} $j = 1$ {\bf to} $N$ {\bf do} \\
 \> ~~ $1.1$ Read block $B_j$ of keys from disk \\
 \> ~~ $1.2$ Cluster $B_j$ into $\lceil n/b \rceil$ buckets using a bucket sort \\
 \> ~~~~~~~ algorithm and update the entries in the vector {\it size} \\
 \> ~~ $1.3$ Dump $B_j$ to the disk into File $j$\\
 \> $2.$ Compute the {\it offset} vector and dump it to the disk.
 \end{tabbing}
 \hrule 
 \hrule 
 \vspace{-1.0mm}
 \caption{Partitioning step}
 \vspace{-3mm}
 \label{fig:partitioningstep}
 \end{figure}
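 As a rough illustration of the quantity $N$ defined in Figure~\ref{fig:partitioningstep},
 suppose (hypothetically) that $S$ holds $10^9$ keys of 64 bytes each, stored without
 separators, so that $\beta = 64 \times 10^9$ bytes, and that
 $\mu = 250 \times 2^{20}$ bytes. Under these assumed values,
 \begin{displaymath}
 N = \left\lceil \frac{\beta}{\mu} \right\rceil
   = \left\lceil \frac{64 \times 10^9}{262{,}144{,}000} \right\rceil
   = \lceil 244.14\ldots \rceil = 245,
 \end{displaymath}
 so the {\bf for} loop would read 245 blocks from disk and write 245 files.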
 Statement 1.1 of the {\bf for} loop presented in Figure~\ref{fig:partitioningstep} 
 reads sequentially all the keys of block $B_j$ from disk into an internal area
 of size $\mu$.
 Statement 1.2 performs an indirect bucket sort of the keys in block $B_j$
 and at the same time updates the entries in the vector {\em size}.
 Let us briefly describe how~$B_j$ is partitioned among the~$\lceil n/b\rceil$
 buckets. 
 We use a local array of $\lceil n/b \rceil$ counters to store a 
 count of how many keys from $B_j$ belong to each bucket.
 %At the same time, the global vector {\it size} is computed based on the local 
 %counters. 
 The pointers to the keys in each bucket $i$, $0 \leq i < \lceil n/b \rceil$,
 are stored in contiguous positions in an array.
 For this we first reserve the required number of entries
 in this array of pointers using the information from the array of counters. 
 Next, we place the pointers to the keys in each bucket into the respective
 reserved areas in the array (i.e., we place the pointers to the keys in bucket 0, 
 followed by the pointers to the keys in bucket 1, and so on).
 \enlargethispage{2\baselineskip}
 To find the bucket address of a given key
 we use the universal hash function $h_0(k)$~\cite{j97}.
 Key~$k$ goes into bucket~$i$, where
 %Then, for each integer $h_0(k)$ the respective bucket address is obtained
 %as follows:
 \begin{eqnarray} \label{eq:bucketindex}
 i=h_0(k) \bmod \left \lceil \frac{n}{b} \right \rceil.
 \end{eqnarray}
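 As a small worked instance of Eq.~(\ref{eq:bucketindex}), with values chosen only
 for illustration, take $n = 1000$ and $b = 175$, which gives
 $\lceil n/b \rceil = \lceil 5.71\ldots \rceil = 6$ buckets.
 A key $k$ with $h_0(k) = 47$ would then go into bucket
 \begin{displaymath}
 i = 47 \bmod 6 = 5.
 \end{displaymath}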
 Figure~\ref{fig:brz-partitioning}(a) shows a \emph{logical} view of the
 $\lceil n/b \rceil$ buckets generated in the partitioning step.
 %In this case, the keys of each bucket are put together by the pointers to
 %each key stored 
 %in contiguous positions in the array of pointers.
 In reality, the keys belonging to each bucket are distributed among many files,
 as depicted in Figure~\ref{fig:brz-partitioning}(b).
 In the example of Figure~\ref{fig:brz-partitioning}(b), the keys in bucket 0 
 appear in files 1 and $N$, the keys in bucket 1 appear in files 1, 2
 and $N$, and so on. 
 \vspace{-7mm}
 \begin{figure}[ht]
 \centering
 \begin{picture}(0,0)%
 \includegraphics{figs/brz-partitioning.ps}%
 \end{picture}%
 \setlength{\unitlength}{4144sp}%
 %
 \begingroup\makeatletter\ifx\SetFigFont\undefined%
 \gdef\SetFigFont#1#2#3#4#5{%
  \reset@font\fontsize{#1}{#2pt}%
  \fontfamily{#3}\fontseries{#4}\fontshape{#5}%
  \selectfont}%
 \fi\endgroup%
 \begin{picture}(4371,1403)(1,-6977)
 \put(333,-6421){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}0}}}}
 \put(545,-6421){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}1}}}}
 \put(759,-6421){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}2}}}}
 \put(1539,-6421){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}${\lceil n/b\rceil - 1}$}}}}
 \put(541,-6676){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}Buckets Logical View}}}}
 \put(3547,-6120){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}}
 \put(3547,-6188){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}}
 \put(3547,-6255){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}}
 \put(3107,-6120){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}}
 \put(3107,-6188){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}}
 \put(3107,-6255){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}}
 \put(4177,-6224){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}}
 \put(4177,-6269){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}}
 \put(4177,-6314){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}}
 \put(3016,-6721){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}File 1}}}}
 \put(3466,-6721){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}File 2}}}}
 \put(4096,-6721){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}File N}}}}
 \put(3196,-6946){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}Buckets Physical View}}}}
 \end{picture}%
 \caption{Situation of the buckets at the end of the partitioning step: (a) Logical view (b) Physical view}
 \label{fig:brz-partitioning}
 \vspace{-2mm}
 \end{figure}
 This scattering of the keys in the buckets could generate a performance
 problem because of the potential number of seeks
 needed to read the keys in each bucket from the $N$ files on disk
 during the searching step.
 But, as we show later in Section~\ref{sec:analytcal-results}, the number of seeks 
 can be kept small using buffering techniques.
 Considering that only the vector {\it size}, which has $\lceil n/b \rceil$
 one-byte entries (recall that each bucket has at most 256 keys),
 must be maintained in main memory during the searching step,
 almost all main memory is available to be used as a disk I/O buffer.
 The last step is to compute the {\it offset} vector and dump it to the disk.
 We use the vector $\mathit{size}$ to compute the
 $\mathit{offset}$ displacement vector. 
 The $\mathit{offset}[i]$ entry contains the number of keys 
 in the buckets $0, 1, \dots, i-1$.
 As {\it size}$[i]$ stores the number of keys
 in bucket $i$, where $0 \leq i <\lceil n/b \rceil$, we have
 \begin{displaymath}
 \mathit{offset}[i] = \sum_{j=0}^{i-1} \mathit{size}[j].
 \end{displaymath}
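 For example, for four hypothetical buckets with
 $\mathit{size} = (3, 1, 4, 2)$, the prefix sums give
 \begin{displaymath}
 \mathit{offset}[0] = 0, \quad
 \mathit{offset}[1] = 3, \quad
 \mathit{offset}[2] = 3 + 1 = 4, \quad
 \mathit{offset}[3] = 3 + 1 + 4 = 8,
 \end{displaymath}
 where $\mathit{offset}[0]$ is the empty sum.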
--- a/vldb07/performancenewalgorithm.tex
+++ b/vldb07/performancenewalgorithm.tex
@@ -0,0 +1,113 @@
 % Nivio: 29/jan/06
 % Time-stamp: <Monday 30 Jan 2006 12:13:14pm EST yoshi@flare>
 \subsection{Performance of the new algorithm}
 \label{sec:performance}
 %As we have done for the internal memory based algorithm, 
 The runtime of our algorithm is also a random variable, but now it follows a
 (highly concentrated) normal distribution, as we discuss at the end of this
 section.  Again, we are interested in verifying the linearity claim made in
 Section~\ref{sec:linearcomplexity}.  Therefore, we ran the algorithm for
 several numbers $n$ of keys in $S$.
 The values chosen for $n$ were $1, 2, 4, 8, 16, 32, 64, 128, 512$ and $1000$
 million. 
 %Just the small vector {\it size} must be kept in main memory,
 %as we saw in Section~\ref{sec:memconstruction}.
 We limited the main memory to 500 megabytes for the experiments.
 The size $\mu$ of the a priori reserved internal memory area 
 was set to 250 megabytes, the parameter $b$ was set to $175$ and
 the building block algorithm parameter $c$ was again set to $1$.
 In Section~\ref{sec:contr-disk-access} we show how $\mu$
 affects the runtime of the algorithm. The other two parameters
 have insignificant influence on the runtime.  
 We again use a statistical method for determining a suitable sample size
 %~\cite[Chapter 13]{j91}
 to estimate the number of trials to be run for each value of $n$.  We found that
 just one trial for each $n$ would be enough with a confidence level of $95\%$.
 Nevertheless, we ran 10 trials.  This number of trials seems rather small, but, as
 shown below, the behavior of our algorithm is very stable and its runtime is
 almost deterministic (i.e., the standard deviation is very small).
 Table~\ref{tab:mediasbrz} presents the runtime average for each $n$,
 the respective standard deviations, and 
 the respective confidence intervals given by 
 the average time $\pm$ the distance from average time
 considering a confidence level of $95\%$.
 Observing the runtime averages, we see that 
 the algorithm runs in expected linear time, 
 as shown in Section~\ref{sec:linearcomplexity}.  Better still,
 it is only approximately $60\%$  slower than our internal memory based algorithm.
 To get that value we used the linear regression model obtained for the runtime of
 the internal memory based algorithm to estimate how much time it would require
 for constructing a MPHF for a set of 1 billion keys. 
 We obtained an estimate of 2.3 hours for the internal memory based algorithm and measured
 3.67 hours on average for our algorithm.
 Increasing the size of the internal memory area
 from 250 to 600 megabytes (see Section~\ref{sec:contr-disk-access})
 brought the time down to 3.09 hours. In this setup, our algorithm is
 just $34\%$ slower.
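 These percentages follow directly from the times quoted above:
 \begin{displaymath}
 \frac{3.67}{2.3} \approx 1.60 \quad \mbox{and} \quad \frac{3.09}{2.3} \approx 1.34,
 \end{displaymath}
 where $3.67$ hours corresponds to the $13229.5$ seconds reported for
 $n = 1000$ million in Table~\ref{tab:mediasbrz} ($13229.5/3600 \approx 3.67$).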
 \enlargethispage{2\baselineskip}
 \begin{table*}[htb]
 \vspace{-1mm}
 \begin{center}
 {\scriptsize
 \begin{tabular}{|l|c|c|c|c|c|}
 \hline
 $n$ (millions)   & 1                  & 2                  & 4                  & 8                   & 16             \\
 \hline % Part.      16 \%                 16 \%                 16 \%                18 \%                 20\%           
 Average time (s) & $6.9 \pm 0.3$  & $13.8 \pm 0.2$ & $31.9 \pm 0.7$ & $69.9 \pm 1.1$  & $140.6 \pm 2.5$  \\
 SD               & $0.4$            & $0.2$            & $0.9$            & $1.5$             & $3.5$         \\
 \hline
 \hline
 $n$ (millions)   & 32                  & 64                   & 128                    & 512                  & 1000            \\
 \hline % Part.      20 \%                 20\%                  20\%                      18\%                    18\%
 Average time (s) & $284.3 \pm 1.1$ & $587.9 \pm 3.9$  & $1223.6 \pm 4.9$   & $5966.4 \pm 9.5$ & $13229.5 \pm 12.7$  \\
 SD               & $1.6$             & $5.5$              & $6.8$                & $13.2$             & $18.6$            \\
 \hline
 \end{tabular}
 \vspace{-1mm}
 }
 \end{center}
 \caption{Our algorithm: average time in seconds for constructing a MPHF,
 the standard deviation (SD), and the confidence intervals considering 
 a confidence level of $95\%$.
 }
 \label{tab:mediasbrz}
 \vspace{-5mm}
 \end{table*}
 Figure~\ref{fig:brz_temporegressao}
 presents the runtime for each trial. In addition, 
 the solid line corresponds to a linear regression model 
 obtained from the experimental measurements.
 As expected, the runtime for a given $n$ shows almost no 
 variation.
 \begin{figure}[htb]
 \begin{center}
 \scalebox{0.4}{\includegraphics{figs/brz_temporegressao.eps}}
 \caption{Time versus number of keys in $S$ for our algorithm. The solid line corresponds to 
 a linear regression model.}
 \label{fig:brz_temporegressao}
 \end{center}
 \vspace{-9mm}
 \end{figure}
 An intriguing observation is that the runtime of the algorithm is almost
 deterministic, in spite of the fact that it uses as a building block an
 algorithm with a considerable fluctuation in its runtime.  A given bucket~$i$,
 $0 \leq i < \lceil n/b \rceil$, is a small set of keys (at most 256 keys) and,
 as argued in Section~\ref{sec:intern-memory-algor}, the runtime of the
 building block algorithm is a random variable~$X_i$ with high fluctuation.
 However, the runtime~$Y$ of the searching step of our algorithm is given
 by~$Y=\sum_{0\leq i<\lceil n/b\rceil}X_i$.  Under the hypothesis that
 the~$X_i$ are independent and bounded, the {\it law of large numbers} (see,
 e.g., \cite{j91}) implies that the random variable $Y/\lceil n/b\rceil$
 converges to a constant as~$n\to\infty$.  This explains why the runtime of our
 algorithm is almost deterministic.
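 To make this argument slightly more concrete, assume in addition (a simplification
 that is not claimed above) that the~$X_i$ are identically distributed with
 mean~$m$ and standard deviation~$\sigma$, and write $M = \lceil n/b \rceil$.
 Independence then gives
 \begin{displaymath}
 \mathrm{E}[Y] = M m, \qquad
 \mathrm{SD}(Y) = \sigma \sqrt{M}, \qquad
 \frac{\mathrm{SD}(Y)}{\mathrm{E}[Y]} = \frac{\sigma}{m \sqrt{M}}.
 \end{displaymath}
 For $n = 10^9$ and $b = 175$ we have $M = 5{,}714{,}286$ and $\sqrt{M} \approx 2390$,
 so under this assumption the relative fluctuation of~$Y$ would be roughly 2390 times
 smaller than that of a single~$X_i$.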
--- a/vldb07/references.bib
+++ b/vldb07/references.bib
@@ -0,0 +1,814 @@
@InProceedings{Brin1998,
  author =       "Sergey Brin and Lawrence Page",
  title =        "The Anatomy of a Large-Scale Hypertextual Web Search Engine",
  booktitle =    "Proceedings of the 7th International {World Wide Web}
                  Conference",
  pages =        "107--117",
  address =      "Brisbane, Australia",
  month =        "April",
  year =         1998,
  annote =       "Artigo do Google."
 }
@inproceedings{p99,
    author = {R. Pagh},
    title = {Hash and Displace: Efficient Evaluation of Minimal Perfect Hash Functions},
    booktitle = {Workshop on Algorithms and Data Structures},
    pages = {49-54},
    year = 1999,
    url = {citeseer.nj.nec.com/pagh99hash.html},
    key = {author} 
 }
@article{p00,
    author = {R. Pagh},
    title = {Faster deterministic dictionaries},
    journal = {Symposium on Discrete Algorithms (ACM SODA)},
    OPTvolume = {43},
    OPTnumber = {5},
    pages = {487--493},
    year = {2000}
 }
@article{g81,
 author = {G. H. Gonnet},
 title = {Expected Length of the Longest Probe Sequence in Hash Code Searching},
 journal = {J. ACM},
 volume = {28},
 number = {2},
 year = {1981},
 issn = {0004-5411},
 pages = {289--304},
 doi = {http://doi.acm.org/10.1145/322248.322254},
 publisher = {ACM Press},
 address = {New York, NY, USA},
 }
@misc{r04,
  author = "S. Rao",
  title = "Combinatorial Algorithms Data Structures",
  year = 2004,
  howpublished = {CS 270 Spring},
  url = "citeseer.ist.psu.edu/700201.html" 
 }
@article{ra98,
    author = {Martin Raab and Angelika Steger},
    title = {``{B}alls into Bins'' --- {A} Simple and Tight Analysis},
    journal = {Lecture Notes in Computer Science},
    volume = 1518,
    pages = {159--170},
    year = 1998,
    url = "citeseer.ist.psu.edu/raab98balls.html" 
 }
@misc{mrs00,
  author = "M. Mitzenmacher and A. Richa and R. Sitaraman",
  title = "The power of two random choices: A survey of the techniques and results",
  howpublished={In Handbook of Randomized
    Computing, P. Pardalos, S. Rajasekaran, and J. Rolim, Eds. Kluwer},
  year = "2000",
  url = "citeseer.ist.psu.edu/article/mitzenmacher00power.html" 
 }
@article{dfm02,
    author = {E. Drinea and A. Frieze and M. Mitzenmacher},  	
    title = {Balls and bins models with feedback},
    journal = {Symposium on Discrete Algorithms (ACM SODA)},
    pages = {308--315},
    year = {2002}
 }
@Article{j97,
  author =       {Bob Jenkins},
  title =        {Algorithm Alley: Hash Functions},
  journal =      {Dr. Dobb's Journal of Software Tools},
  volume =       {22},
  number =       {9},
  month =        {september},
  year =         {1997}
 }
@article{gss01,
    author = {N. Galli and B. Seybold and K. Simon},
    title = {Tetris-Hashing or optimal table compression},
    journal = {Discrete Applied Mathematics},
    volume = {110},
    number = {1},
    pages = {41--58},
    month = {june},
    publisher = {Elsevier Science},
    year = {2001}
 }
@article{s05,
    author = {M. Seltzer},
    title = {Beyond Relational Databases},
    journal = {ACM Queue},
    volume = {3},
    number = {3},
    month = {April},
    year = {2005}
 }
@InProceedings{ss89,
  author = 	 {P. Schmidt and A. Siegel},
  title = 	 {On aspects of universality and performance for closed hashing},
  booktitle =    {Proc. 21st Ann. ACM Symp. on Theory of Computing -- STOC'89},
  month = 	 {May},
  year = 	 {1989},
  pages = 	 {355--366}
 }
@article{asw00,
    author = {M. Atici and D. R. Stinson and R. Wei},
    title = {A new practical algorithm for the construction of a perfect hash function},
    journal = {Journal Combin. Math. Combin. Comput.},
    volume = {35},
    pages = {127--145},
    year = {2000}
 }
@article{swz00,
    author = {D. R. Stinson and R. Wei and L. Zhu},
    title = {New constructions for perfect hash families and related structures using combinatorial designs and codes},
    journal = {Journal Combin. Designs.},
    volume = {8},
    pages = {189--200},
    year = {2000}
 }
@inproceedings{ht01,
    author = {T. Hagerup and T. Tholey},
    title = {Efficient minimal perfect hashing in nearly minimal space},
    booktitle = {The 18th Symposium on Theoretical Aspects of Computer Science (STACS), volume 2010 of Lecture Notes in Computer Science},
    year = 2001,
    pages = {317--326},
    key = {author} 
 }
@inproceedings{dh01,
    author = {M. Dietzfelbinger and T. Hagerup},
    title = {Simple minimal perfect hashing in less space},
    booktitle = {The 9th European Symposium on Algorithms (ESA), volume 2161 of Lecture Notes in Computer Science},
    year = 2001,
    pages = {109--120},
    key = {author} 
 }
@MastersThesis{mar00,
  author = 	 {M. S. Neubert},
  title = 	 {Algoritmos Distribu{\'\i}dos para a Constru{\c{c}}{\~a}o de Arquivos invertidos},
  school = 	 {Departamento de Ci{\^e}ncia da Computa{\c{c}}{\~a}o, Universidade Federal de Minas Gerais},
  year = 	 2000,
  month =	 {Mar{\c{c}}o},
  key = {author}
 }
@Book{clrs01,
  author = 	 {T. H. Cormen and C. E. Leiserson and R. L. Rivest and C. Stein},
  title = 	 {Introduction to Algorithms},
  publisher = 	 {MIT Press},
  year = 	 {2001},
  edition = 	 {second},
 }
@Book{j91,
  author = 	  {R. Jain},
  title = 	  {The art of computer systems performance analysis: techniques for experimental design, measurement, simulation, and modeling. },
  publisher = {John Wiley},
  year = 	  {1991},
  edition =   {first}
 }
@Book{k73,
  author = 	 {D. E. Knuth},
  title = 	 {The Art of Computer Programming: Sorting and Searching},
  publisher = 	 {Addison-Wesley},
  volume    =    {3},
  year = 	 {1973},
  edition = 	 {second},
 }
@inproceedings{rp99,
    author = {R. Pagh},
    title = {Hash and Displace: Efficient Evaluation of Minimal Perfect Hash Functions},
    booktitle = {Workshop on Algorithms and Data Structures},
    pages = {49-54},
    year = 1999,
    url = {citeseer.nj.nec.com/pagh99hash.html},
    key = {author} 
 }
@inproceedings{hmwc93,
    author = {G. Havas and B.S. Majewski and N.C. Wormald and Z.J. Czech},
    title = {Graphs, Hypergraphs and Hashing},
    booktitle = {19th International Workshop on Graph-Theoretic Concepts in Computer Science},
    publisher = {Springer Lecture Notes in Computer Science vol. 790},
    pages = {153-165},
    year = 1993,
    key = {author} 
 }
@inproceedings{bkz05,
    author = {F.C. Botelho and Y. Kohayakawa and N. Ziviani},
    title = {A Practical Minimal Perfect Hashing Method},
    booktitle = {4th International Workshop on Efficient and Experimental Algorithms},
    publisher = {Springer Lecture Notes in Computer Science vol. 3503},
    pages = {488-500},
    month = may,
    year = 2005,
    key = {author} 
 }
@Article{chm97,
  author = 	 {Z.J. Czech and G. Havas and B.S. Majewski},
  title = 	 {Fundamental Study Perfect Hashing},
  journal = 	 {Theoretical Computer Science},
  volume = 	{182},
  year = 	 {1997},
  pages = 	 {1-143},
  key = {author}
 }
@article{chm92,
    author = {Z.J. Czech and G. Havas and B.S. Majewski},
    title = {An Optimal Algorithm for Generating Minimal Perfect Hash Functions},
    journal = {Information Processing Letters},
    volume = {43},
    number = {5},
    pages = {257-264},
    year = {1992},
    url = {citeseer.nj.nec.com/czech92optimal.html}, 
    key = {author} 
 }
@Article{mwhc96,
  author = 	 {B.S. Majewski and N.C. Wormald and G. Havas and Z.J. Czech},
  title = 	 {A family of perfect hashing methods},
  journal = 	 {The Computer Journal},
  year = 	 {1996},
  volume = 	 {39},
  number = 	 {6},
  pages = 	 {547-554},
  key = {author}
 }
@InProceedings{bv04,
 author =         {P. Boldi and S. Vigna},
 title =          {The WebGraph Framework I: Compression Techniques},
 booktitle =      {13th International World Wide Web Conference},
 pages =          {595--602},
 year =           {2004}
 }
@Book{z04,
  author = 	 {N. Ziviani},
  title = 	 {Projeto de Algoritmos com implementa{\c{c}}{\~o}es em Pascal e C},
  publisher = 	 {Pioneira Thomson},
  year = 	 2004,
  edition = 	 {segunda edi{\c{c}}{\~a}o}
 }
@Book{p85,
  author = 	 {E. M. Palmer},
  title = 	 {Graphical Evolution: An Introduction to the Theory of Random Graphs},
  publisher = 	 {John Wiley \& Sons},
  year = 	 {1985},
  address = 	 {New York}
 }
@Book{imb99,
  author = 	 {I.H. Witten and A. Moffat and T.C. Bell},
  title = 	 {Managing Gigabytes: Compressing and Indexing Documents and Images},
  publisher = 	 {Morgan Kaufmann Publishers},
  year = 	 1999,
  edition = 	 {second edition}
 }
@Book{wfe68,
  author = 	 {W. Feller},
  title = 	 { An Introduction to Probability Theory and Its Applications},
  publisher = 	 {Wiley},
  year = 	 1968,
  volume = 1,
  optedition = 	 {second edition}
 }
@Article{fhcd92,
  author = 	 {E.A. Fox and L. S. Heath and Q. Chen and A.M. Daoud},
  title = 	 {Practical Minimal Perfect Hash Functions For Large Databases},
  journal = 	 {Communications of the ACM},
  year = 	 {1992},
  volume = 	 {35},
  number = 	 {1},
  pages =        {105--121}
 }
@inproceedings{fch92,
  author    = {E.A. Fox and Q.F. Chen and L.S. Heath},
  title     = {A Faster Algorithm for Constructing Minimal Perfect Hash Functions},
  booktitle = {Proceedings of the 15th Annual International ACM SIGIR Conference
               on Research and Development in Information Retrieval},
  year      = {1992},
  pages = {266-273},
 }
@article{c80,
 author = {R.J. Cichelli},
 title = {Minimal perfect hash functions made simple},
 journal = {Communications of the ACM},
 volume = {23},
 number = {1},
 year = {1980},
 issn = {0001-0782},
 pages = {17--19},
 doi = {http://doi.acm.org/10.1145/358808.358813},
 publisher = {ACM Press},
 }
@TechReport{fhc89,
  author = 	 {E.A. Fox and L.S. Heath and Q.F. Chen},
  title = 	 {An $O(n\log n)$ algorithm for finding minimal perfect hash functions},
  institution =  {Virginia Polytechnic Institute and State University},
  year = 	 {1989},
  OPTkey = 	 {},
  OPTtype = 	 {},
  OPTnumber = 	 {},
  address = 	 {Blacksburg, VA},
  month = 	 {April},
  OPTnote = 	 {},
  OPTannote = 	 {}
 }
@TechReport{bkz06t,
  author = 	 {F.C. Botelho and Y. Kohayakawa and N. Ziviani},
  title = 	 {An Approach for Minimal Perfect Hash Functions in Very Large Databases},
  institution =  {Department of Computer Science, Federal University of Minas Gerais},
  note = 	 {Available at http://www.dcc.ufmg.br/\texttt{\~ }nivio/pub/technicalreports.html},
  year = 	 {2006},
  OPTkey = 	 {},
  OPTtype = 	 {},
  number = 	 {RT.DCC.003},
  address = 	 {Belo Horizonte, MG, Brazil},
  month = 	 {April},
  OPTannote = 	 {}
 }
@inproceedings{fcdh90,
 author = {E.A. Fox and Q.F. Chen and A.M. Daoud and L.S. Heath},
 title = {Order preserving minimal perfect hash functions and information retrieval},
 booktitle = {Proceedings of the 13th annual international ACM SIGIR conference on Research and development in information retrieval},
 year = {1990},
 isbn = {0-89791-408-2},
 pages = {279--311},
 location = {Brussels, Belgium},
 doi = {http://doi.acm.org/10.1145/96749.98233},
 publisher = {ACM Press},
 }
@Article{fkp89,
  author = 	 {P. Flajolet and D. E. Knuth and B. Pittel},
  title = 	 {The first cycles in an evolving graph},
  journal = 	 {Discrete Math},
  year = 	 {1989},
  volume = 	 {75},
  pages = 	 {167-215},
 }
@Article{s77,
  author = 	 {R. Sprugnoli},
  title = 	 {Perfect Hashing Functions: A Single Probe Retrieving 
                  Method For Static Sets},
  journal = 	 {Communications of the ACM},
  year = 	 {1977},
  volume = 	 {20},
  number = 	 {11},
  pages =        {841--850},
  month = 	 {November},
 }
@Article{j81,
  author = 	 {G. Jaeschke},
  title = 	 {Reciprocal Hashing: A method For Generating Minimal Perfect
                  Hashing Functions},
  journal = 	 {Communications of the ACM},
  year = 	 {1981},
  volume = 	 {24},
  number = 	 {12},
  month = 	 {December},
  pages =        {829--833}
 }
@Article{c84,
  author = 	 {C. C. Chang},
  title = 	 {The Study Of An Ordered Minimal Perfect Hashing Scheme},
  journal = 	 {Communications of the ACM},
  year = 	 {1984},
  volume = 	 {27},
  number = 	 {4},
  month = 	 {April},
  pages =        {384--387}
 }
@Article{c86,
  author = 	 {C. C. Chang},
  title = 	 {Letter-Oriented Reciprocal Hashing Scheme},
  journal = 	 {Inform. Sci.},
  year = 	 {1986},
  volume = 	 {27},
  pages =        {243--255}
 }
@Article{cl86,
  author = 	 {C. C. Chang and R. C. T. Lee},
  title = 	 {A Letter-Oriented Minimal Perfect Hashing Scheme},
  journal = 	 {Computer Journal},
  year = 	 {1986},
  volume = 	 {29},
  number = 	 {3},
  month = 	 {June},
  pages =        {277--281}
 }
@Article{cc88,
  author = 	 {C. C. Chang and C. H. Chang},
  title = 	 {An Ordered Minimal Perfect Hashing Scheme with Single Parameter},
  journal = 	 {Inform. Process. Lett.},
  year = 	 {1988},
  volume = 	 {27},
  number = 	 {2},
  month = 	 {February},
  pages =        {79--83}
 }
@Article{w90,
  author = 	 {V. G. Winters},
  title = 	 {Minimal Perfect Hashing in Polynomial Time},
  journal = 	 {BIT},
  year = 	 {1990},
  volume = 	 {30},
  number = 	 {2},
  pages =        {235--244}
 }
@Article{fcdh91,
  author = 	 {E. A. Fox and Q. F. Chen and A. M. Daoud and L. S. Heath},
  title = 	 {Order Preserving Minimal Perfect Hash Functions and Information Retrieval},
  journal = 	 {ACM Trans. Inform. Systems},
  year = 	 {1991},
  volume = 	 {9},
  number = 	 {3},
  month = 	 {July},
  pages =        {281--308}
 }
@Article{fks84,
  author = 	 {M. L. Fredman and J. Koml\'os and E. Szemer\'edi},
  title = 	 {Storing a sparse table with {O(1)} worst case access time},
  journal = 	 {J. ACM},
  year = 	 {1984},
  volume = 	 {31},
  number = 	 {3},
  month = 	 {July},
  pages =        {538--544}
 }
@Article{dhjs83,
  author = 	 {M. W. Du and T. M. Hsieh and K. F. Jea and D. W. Shieh},
  title = 	 {The study of a new perfect hash scheme},
  journal = 	 {IEEE Trans. Software Eng.},
  year = 	 {1983},
  volume = 	 {9},
  number = 	 {3},
  month = 	 {May},
  pages =        {305--313}
 }
@Article{bt94,
  author = 	 {M. D. Brain and A. L. Tharp},
  title = 	 {Using Tries to Eliminate Pattern Collisions in Perfect Hashing},
  journal = 	 {IEEE Trans. on Knowledge and Data Eng.},
  year = 	 {1994},
  volume = 	 {6},
  number = 	 {2},
  month = 	 {April},
  pages =        {239--247}
 }
@Article{bt90,
  author = 	 {M. D. Brain and A. L. Tharp},
  title = 	 {Perfect hashing using sparse matrix packing},
  journal = 	 {Inform. Systems},
  year = 	 {1990},
  volume = 	 {15},
  number = 	 {3},
  OPTmonth = 	 {April},
  pages =        {281--290}
 }
@Article{ckw93,
  author = 	 {C. C. Chang and H. C. Kowng and T. C. Wu},
  title = 	 {A refinement of a compression-oriented addressing scheme},
  journal = 	 {BIT},
  year = 	 {1993},
  volume = 	 {33},
  number = 	 {4},
  OPTmonth = 	 {April},
  pages =        {530--535}
 }
@Article{cw91,
  author = 	 {C. C. Chang and T. C. Wu},
  title = 	 {A letter-oriented perfect hashing scheme based upon sparse table compression},
  journal = 	 {Software -- Practice Experience},
  year = 	 {1991},
  volume = 	 {21},
  number = 	 {1},
  month = 	 {january},
  pages =        {35--49}
 }
@Article{ty79,
  author = 	 {R. E. Tarjan and A. C. C. Yao},
  title = 	 {Storing a sparse table},
  journal = 	 {Comm. ACM},
  year = 	 {1979},
  volume = 	 {22},
  number = 	 {11},
  month = 	 {November},
  pages =        {606--611}
 }
@Article{yd85,
  author = 	 {W. P. Yang and M. W. Du},
  title = 	 {A backtracking method for constructing perfect hash functions from a set of mapping functions},
  journal = 	 {BIT},
  year = 	 {1985},
  volume = 	 {25},
  number = 	 {1},
  pages =        {148--164}
 }
@Article{s85,
  author = 	 {T. J. Sager},
  title = 	 {A polynomial time generator for minimal perfect hash functions},
  journal = 	 {Commun. ACM},
  year = 	 {1985},
  volume = 	 {28},
  number = 	 {5},
  month =        {May},
  pages =        {523--532}
 }
@Article{cm93,
  author = 	 {Z. J. Czech and B. S. Majewski},
  title = 	 {A linear time algorithm for finding minimal perfect hash functions},
  journal = 	 {The Computer Journal},
  year = 	 {1993},
  volume = 	 {36},
  number = 	 {6},
  pages =        {579--587}
 }
@Article{gbs94,
  author = 	 {R. Gupta and S. Bhaskar and S. Smolka},
  title = 	 {On randomization in sequential and distributed algorithms},
  journal = 	 {ACM Comput. Surveys},
  year = 	 {1994},
  volume = 	 {26},
  number = 	 {1},
  month =        {March},
  pages =        {7--86}
 }
@InProceedings{sb84,
  author = 	 {C. Slot and P. V. E. Boas},
  title = 	 {On tape versus core; an application of space efficient perfect hash functions to the 
                  invariance of space},
  booktitle =    {Proc. 16th Ann. ACM Symp. on Theory of Computing -- STOC'84},
  address = 	 {Washington},
  month = 	 {May},
  year = 	 {1984},
  pages = 	 {391--400},
 }
@InProceedings{wi90,
  author = 	 {V. G. Winters},
  title = 	 {Minimal perfect hashing for large sets of data},
  booktitle =    {Internat. Conf. on Computing and Information -- ICCI'90},
  address = 	 {Canada},
  month = 	 {May},
  year = 	 {1990},
  pages = 	 {275--284},
 }
@InProceedings{lr85,
  author = 	 {P. Larson and M. V. Ramakrishna},
  title = 	 {External perfect hashing},
  booktitle =    {Proc. ACM SIGMOD Conf.},
  address = 	 {Austin TX},
  month = 	 {June},
  year = 	 {1985},
  pages = 	 {190--199},
 }
@Book{m84,
  author = 	 {K. Mehlhorn},
  editor = 	 {W. Brauer and G. Rozenberg and A. Salomaa},
  title = 	 {Data Structures and Algorithms 1: Sorting and Searching},
  publisher = 	 {Springer-Verlag},
  year = 	 {1984},
 }
@PhdThesis{c92,
  author = 	 {Q. F. Chen},
   title =        {An Object-Oriented Database System for Efficient Information Retrieval Applications},
  school = 	 {Virginia Tech Dept. of Computer Science},
  year = 	 {1992},
  month = 	 {March}
 }
@article {er59,
    AUTHOR = {Erd{\H{o}}s, P. and R{\'e}nyi, A.},
     TITLE = {On random graphs {I}},
   JOURNAL = {Pub. Math. Debrecen},
    VOLUME = {6},
      YEAR = {1959},
     PAGES = {290--297},
   MRCLASS = {05.00},
  MRNUMBER = {MR0120167 (22 \#10924)},
 MRREVIEWER = {A. Dvoretzky},
 }
@article {erdos61,
    AUTHOR = {Erd{\H{o}}s, P. and R{\'e}nyi, A.},
     TITLE = {On the evolution of random graphs},
   JOURNAL = {Bull. Inst. Internat. Statist.},
    VOLUME = 38,
      YEAR = 1961,
     PAGES = {343--347},
   MRCLASS = {05.40 (55.10)},
  MRNUMBER = {MR0148055 (26 \#5564)},
 }
@article {er60,
    AUTHOR = {Erd{\H{o}}s, P. and R{\'e}nyi, A.},
     TITLE = {On the evolution of random graphs},
   JOURNAL = {Magyar Tud. Akad. Mat. Kutat\'o Int. K\"ozl.},
    VOLUME = {5},
      YEAR = {1960},
     PAGES = {17--61},
   MRCLASS = {05.40},
  MRNUMBER = {MR0125031 (23 \#A2338)},
 MRREVIEWER = {J. Riordan},
 }
@Article{er60:_Old,
  author =   {P. Erd{\H{o}}s and A. R\'enyi},
  title =    {On the evolution of random graphs},
  journal =  {Publications of the Mathematical Institute of the Hungarian
                  Academy of Sciences},  
  year =     {1960},
  volume =   {56},
   pages =    {17--61}
 }
@Article{er61,
  author =   {P. Erd{\H{o}}s and A. R\'enyi},
  title =    {On the strength of connectedness of a random graph},
  journal =  {Acta Mathematica Scientia Hungary},
  year =     {1961},
  volume =   {12},
   pages =    {261--267}
 }
@Article{bp04,
  author = 	 {B. Bollob\'as and O. Pikhurko},
  title = 	 {Integer Sets with Prescribed Pairwise Differences Being Distinct},
  journal = 	 {European Journal of Combinatorics},
  OPTkey = 	 {},
  OPTvolume = 	 {},
  OPTnumber = 	 {},
  OPTpages = 	 {},
  OPTmonth = 	 {},
  note = 	 {To Appear},
  OPTannote = 	 {}
 }
@Article{pw04:_OLD,
  author = 	 {B. Pittel and N. C. Wormald},
  title = 	 {Counting connected graphs inside-out},
  journal = 	 {Journal of Combinatorial Theory},
  OPTkey = 	 {},
  OPTvolume = 	 {},
  OPTnumber = 	 {},
  OPTpages = 	 {},
  OPTmonth = 	 {},
  note = 	 {To Appear},
  OPTannote = 	 {}
 }
@Article{mr95,
  author =   {M. Molloy and B. Reed},
  title =    {A critical point for random graphs with a given degree sequence},
  journal =  {Random Structures and Algorithms},
  year =     {1995},
  volume =   {6},
   pages =    {161--179}
 }
@TechReport{bmz04,
  author = 	 {F. C. Botelho and D. Menoti and N. Ziviani},
   title =        {A new algorithm for constructing minimal perfect hash functions},
  institution =  {Federal Univ. of Minas Gerais},
  year = 	 {2004},
  OPTkey = 	 {},
  OPTtype = 	 {},
  number = 	 {TR004},
  OPTaddress = 	 {},
  OPTmonth = 	 {},
  note = 	 {(http://www.dcc.ufmg.br/\texttt{\~ }nivio/pub/technicalreports.html)},
  OPTannote = 	 {}
 }
@Article{mr98,
  author =   {M. Molloy and B. Reed},
  title =    {The size of the giant component of a random graph with a given degree sequence},
  journal =  {Combinatorics, Probability and Computing},
  year =     {1998},
  volume =   {7},
   pages =    {295--305}
 }
@misc{h98,
  author = {D. Hawking},
  title = {Overview of TREC-7 Very Large Collection Track (Draft for Notebook)},
  url = {citeseer.ist.psu.edu/4991.html},
  year = {1998}}
@book {jlr00,
    AUTHOR = {Janson, S. and {\L}uczak, T. and Ruci{\'n}ski, A.},
     TITLE = {Random graphs},
 PUBLISHER = {Wiley-Inter.},
      YEAR = 2000,
     PAGES = {xii+333},
      ISBN = {0-471-17541-2},
   MRCLASS = {05C80 (60C05 82B41)},
  MRNUMBER = {2001k:05180},
 MRREVIEWER = {Mark R. Jerrum},
 }
@incollection {jlr90,
    AUTHOR = {Janson, Svante and {\L}uczak, Tomasz and Ruci{\'n}ski,
              Andrzej},
     TITLE = {An exponential bound for the probability of nonexistence of a
              specified subgraph in a random graph},
 BOOKTITLE = {Random graphs '87 (Pozna\'n, 1987)},
     PAGES = {73--87},
 PUBLISHER = {Wiley},
   ADDRESS = {Chichester},
      YEAR = 1990,
   MRCLASS = {05C80 (60C05)},
  MRNUMBER = {91m:05168},
 MRREVIEWER = {J. Spencer},
 }
@book {b01,
    AUTHOR = {Bollob{\'a}s, B.},
     TITLE = {Random graphs},
    SERIES = {Cambridge Studies in Advanced Mathematics},
    VOLUME = 73,
   EDITION = {Second},
 PUBLISHER = {Cambridge University Press},
   ADDRESS = {Cambridge},
      YEAR = 2001,
     PAGES = {xviii+498},
      ISBN = {0-521-80920-7; 0-521-79722-5},
   MRCLASS = {05C80 (60C05)},
  MRNUMBER = {MR1864966 (2002j:05132)},
 }
@article {pw04,
    AUTHOR = {Pittel, Boris and Wormald, Nicholas C.},
     TITLE = {Counting connected graphs inside-out},
   JOURNAL = {J. Combin. Theory Ser. B},
  FJOURNAL = {Journal of Combinatorial Theory. Series B},
    VOLUME = 93,
      YEAR = 2005,
    NUMBER = 2,
     PAGES = {127--172},
      ISSN = {0095-8956},
     CODEN = {JCBTB8},
   MRCLASS = {05C30 (05A16 05C40 05C80)},
  MRNUMBER = {MR2117934 (2005m:05117)},
 MRREVIEWER = {Edward A. Bender},
 }
--- a/vldb07/relatedwork.tex
+++ b/vldb07/relatedwork.tex
@@ -0,0 +1,112 @@
 % Time-stamp: <Monday 30 Jan 2006 03:06:57am EDT yoshi@ime.usp.br>
 \vspace{-3mm}
 \section{Related work}
 \label{sec:relatedprevious-work}
 \vspace{-2mm}
 % Optimal speed for hashing means that each key from the key set $S$
 % will map to an unique location in the hash table, avoiding time wasted 
 % in resolving collisions. That is achieved with a MPHF and
 % because of that many algorithms for constructing static 
 % and dynamic MPHFs, when static or dynamic sets are involved, 
 % were developed. Our focus has been on static MPHFs, since 
 % in many applications the key sets change slowly, if at all~\cite{s05}.
 \enlargethispage{2\baselineskip}
 Czech, Havas and Majewski~\cite{chm97} provide a
 comprehensive survey of the most important theoretical and practical results
 on perfect hashing.
 In this section we review some of the most important results.
 %We also present more recent algorithms that share some features with 
 %the one presented hereinafter.
 Fredman, Koml\'os and Szemer\'edi~\cite{FKS84} showed that it is possible to
 construct space efficient perfect hash functions that can be evaluated in
 constant time with table sizes that are linear in the number of keys:
 $m=O(n)$.  In their model of computation, an element of the universe~$U$ fits
 into one machine word, and arithmetic operations and memory accesses have unit
 cost.  Randomized algorithms in the FKS model can construct a perfect hash
 function in expected time~$O(n)$:
 this is the case of our algorithm and the works in~\cite{chm92,p99}.
 Mehlhorn~\cite{m84} showed
 that at least $\Omega((1/\ln 2)n + \ln\ln u)$ bits are 
 required to represent a MPHF (i.e., at least 1.4427 bits per
 key must be stored).
 To the best of our knowledge, our algorithm 
 is the first one capable of generating MPHFs for sets on the order
 of billions of keys, and the generated functions  
 require less than 9 bits per key to be stored.
 This increases by one order of magnitude the size of the largest 
 key set for which a MPHF has been obtained in the literature~\cite{bkz05}.
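 As a rough illustration of the gap between the lower bound of~\cite{m84} and this figure,
 take $n=10^9$ (a value we choose here only to match the scale of the key sets under
 discussion) and ignore the $\ln\ln u$ term. Then
 \begin{eqnarray*}
 (1/\ln 2)\,n &\approx& 1.4427\times 10^{9}\ \mathrm{bits} \;\approx\; 180\ \mathrm{megabytes},\\
 9\,n &=& 9\times 10^{9}\ \mathrm{bits} \;\approx\; 1.125\ \mathrm{gigabytes},
 \end{eqnarray*}
 so a function stored in less than 9 bits per key is within a factor of about
 $9/1.4427 \approx 6.2$ of the lower bound.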
 %which is close to the lower bound presented in~\cite{m84}. 
 Some work on minimal perfect hashing has been done under the assumption that
 the algorithm can pick and store truly random functions~\cite{bkz05,chm92,p99}.
 Since the space requirements for truly random functions make them unsuitable for
 implementation, one has to settle for pseudo-random functions in practice. 
 Empirical studies show that limited randomness properties are often as good as
 total randomness.
 We verified this phenomenon in our experiments by using the universal hash
 function proposed by Jenkins~\cite{j97}, which is
 time efficient at retrieval time and requires just an integer to be used as a
 random seed (the function is completely determined by the seed). 
 % The works~\cite{asw00,swz00} present algorithms for constructing
 % PHFs and MPHFs deterministically. 
 % The generated functions need $O(n \log(n) + \log(\log(u)))$ bits to be described.
 % The average-case complexity of the algorithms for generating the functions is 
 % $O(n\log(n) \log( \log (u)))$ and the worst-case complexity is $O(n^3\log(n) \log(\log(u)))$. 
 % The complexity of evaluating the functions is $O(\log(n) + \log(\log(u)))$.
 % Thus, the algorithms do not generate functions that can be evaluated in 
 % $O(1)$ time, they are a factor of $\log n$ away from the optimal complexity for describing 
 % PHFs and MPHFs (Mehlhorn shows in~\cite{m84} 
 % that storing a PHF requires at least 
 % $\Omega(n^2/(2\ln 2) m + \log\log u)$ bits), and they do not generate the 
 % functions in linear time.
 % Moreover, the universe $U$ of keys is restricted to integers, which may 
 % limit their use in practice. 
 Pagh~\cite{p99} proposed a family of randomized algorithms for
 constructing MPHFs
 where the form of the resulting function is $h(x) = (f(x) + d[g(x)]) \bmod n$,
 where $f$ and $g$ are universal hash functions and $d$ is a set of
 displacement values to resolve collisions that are caused by the function $f$.
 Pagh identified a set of conditions concerning $f$ and $g$ and showed
 that if these conditions are satisfied, then a minimal perfect hash
 function can be computed in expected time $O(n)$ and stored in
 $(2+\epsilon)n\log_2n$ bits.
 Dietzfelbinger and Hagerup~\cite{dh01} improved~\cite{p99},
 reducing from $(2+\epsilon)n\log_2n$ to $(1+\epsilon)n\log_2n$ the number of bits
 required to store the function, but in their approach~$f$ and~$g$ must
 be chosen from a class
 of hash functions that meet additional requirements.
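 To make these bounds concrete, consider the illustrative value $n=10^9$ (our choice,
 not a figure taken from~\cite{p99,dh01}) and ignore the $\epsilon$ terms.
 Since $\log_2 10^9 \approx 29.9$, we get
 \begin{eqnarray*}
 2\,n\log_2 n &\approx& 5.98\times 10^{10}\ \mathrm{bits} \;\approx\; 7.5\ \mathrm{gigabytes},\\
 n\log_2 n &\approx& 2.99\times 10^{10}\ \mathrm{bits} \;\approx\; 3.7\ \mathrm{gigabytes},
 \end{eqnarray*}
 that is, roughly 60 and 30 bits per key, respectively.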
 %Differently from the works in~\cite{dh01, p99}, our algorithm generates a MPHF
 %$h$ in expected linear time and $h$ can be stored in $O(n)$ bits (9 bits per key).
 % Galli, Seybold and Simon~\cite{gss01} proposed a randomized algorithm
 % that generates MPHFs of the same form as those generated by the algorithms of Pagh~\cite{p99}
 % and of Dietzfelbinger and Hagerup~\cite{dh01}. However, they defined the form of the 
 % functions $f(k) = h_c(k) \bmod n$ and $g(k) = \lfloor h_c(k)/n \rfloor$ to obtain in expected time $O(n)$ a function that can be described in $O(n\log n)$ bits, where
 % $h_c(k) = (ck \bmod p) \bmod n^2$, $1 \leq c \leq  p-1$ and $p$ is a prime larger than $u$.
 %Our algorithm is the first one capable of generating MPHFs for sets in the order of
 %billion of keys. It happens because we do not need to keep into main memory
 %at generation time complex data structures as a graph, lists and so on. We just need to maintain
 %a small vector that occupies around 8MB for a set of 1 billion keys.  
 Fox et al.~\cite{fch92,fhcd92} studied MPHFs 
 %that also share features with the ones generated by our algorithm. 
 whose storage requirements are lower than ours, between 2 and 4 bits per key.
 However, it is shown in~\cite[Section 6.7]{chm97} that their algorithms have exponential 
 running times, and in our implementation they do not scale to sets larger than 
 11 million keys.  
 Our previous work~\cite{bkz05} improves on the algorithm of Czech, Havas and Majewski~\cite{chm92}:
 we obtained more compact functions in less time. Although 
 the algorithm in~\cite{bkz05} is the fastest algorithm 
 we know of, the resulting functions are stored in $O(n\log n)$ bits and
 one needs to keep in main memory at generation time a random graph of $n$ edges
 and $cn$ vertices, 
 where $c\in[0.93,1.15]$.  Using the well-known divide-and-conquer approach,
 we use that algorithm as a building block for the new one, in which the
 resulting functions are stored in $O(n)$ bits.
--- a/vldb07/searching.tex
+++ b/vldb07/searching.tex
@@ -0,0 +1,155 @@
 %% Nivio: 22/jan/06
 % Time-stamp: <Monday 30 Jan 2006 03:57:35am EDT yoshi@ime.usp.br>
 \vspace{-7mm}
 \subsection{Searching step}
 \label{sec:searching}
 \enlargethispage{2\baselineskip}
 The searching step is responsible for generating a MPHF for each 
 bucket.
 Figure~\ref{fig:searchingstep} presents the searching step algorithm.
 \vspace{-2mm}
 \begin{figure}[h]
 %\centering
 \hrule 
 \hrule 
 \vspace{2mm}
 \begin{tabbing}
 aa\=type booleanx \==  (false, true); \kill
 \> $\blacktriangleright$ Let $H$ be a minimum heap of size $N$, where the \\
 \> ~~ order relation in $H$ is given by Eq.~(\ref{eq:bucketindex}), that is, the\\
 \> ~~ remove operation removes the item with smallest $i$\\ 
 \> $1.$ {\bf for} $j = 1$ {\bf to} $N$ {\bf do} \{ Heap construction \}\\
 \> ~~ $1.1$ Read key $k$ from File $j$ on disk\\
 \> ~~ $1.2$ Insert $(i, j, k)$ in $H$ \\
 \> $2.$ {\bf for} $i = 0$ {\bf to} $\lceil n/b \rceil - 1$ {\bf do} \\
 \> ~~ $2.1$ Read bucket $i$ from disk driven by heap $H$ \\
 \> ~~ $2.2$ Generate a MPHF for bucket $i$ \\
 \> ~~ $2.3$ Write the description of MPHF$_i$ to the disk 
 \end{tabbing}
 \vspace{-1mm}
 \hrule 
 \hrule 
 \caption{Searching step}
 \label{fig:searchingstep}
 \vspace{-4mm}
 \end{figure}
 Statement 1 of Figure~\ref{fig:searchingstep} inserts one key from each file
 in a minimum heap $H$ of size $N$.
 The order relation in $H$ is determined by the bucket address $i$ obtained from
 Eq.~(\ref{eq:bucketindex}).
 %\enlargethispage{-\baselineskip}
 Statement 2 has two important steps.
 In statement 2.1, a bucket is read from disk,
 as described below.
 %in Section~\ref{sec:readingbucket}. 
 In statement 2.2, a MPHF is generated for each bucket $i$, as described 
 in the following.
 %in Section~\ref{sec:mphfbucket}.
 The description of MPHF$_i$ is a vector $g_i$ of 8-bit integers.
 Finally, statement 2.3 writes the description $g_i$ of MPHF$_i$ to disk.
 \vspace{-3mm}
 \subsubsection{Reading a bucket from disk.} 
 \label{sec:readingbucket}
 In this section we present the refinement of statement 2.1 of
 Figure~\ref{fig:searchingstep}.
 The algorithm to read bucket $i$ from disk is presented 
 in Figure~\ref{fig:readingbucket}.
 \begin{figure}[h]
 \hrule 
 \hrule 
 \vspace{2mm}
 \begin{tabbing}
 aa\=type booleanx \==  (false, true); \kill
 \> $1.$ {\bf while} bucket $i$ is not full {\bf do} \\
 \> ~~ $1.1$ Remove $(i, j, k)$ from $H$\\
 \> ~~ $1.2$ Insert $k$ into bucket $i$ \\
 \> ~~ $1.3$ Read sequentially all keys $k$ from File $j$ that have \\
 \> ~~~~~~~ the same $i$ and insert them into bucket $i$ \\
 \> ~~ $1.4$ Insert the triple $(i, j, x)$ in $H$, where $x$ is the first \\
 \> ~~~~~~~ key read from File $j$ that does not have the \\ 
 \> ~~~~~~~ same bucket index $i$
 \end{tabbing}
 \hrule 
 \hrule 
 \vspace{-1.0mm}
 \caption{Reading a bucket}
 \vspace{-4.0mm}
 \label{fig:readingbucket}
 \end{figure}
 Bucket $i$ is distributed among many files and the heap $H$ is used to drive a
 multiway merge operation.
 In Figure~\ref{fig:readingbucket}, statement 1.1 extracts and removes triple 
 $(i, j, k)$ from $H$, where $i$ is the minimum bucket address in $H$.
 Statement 1.2 inserts key $k$ in bucket $i$.
 Notice that the $k$ in the triple $(i, j, k)$ is in fact a pointer to
 the first byte of the key that is kept in contiguous positions of an array of characters
 (this array containing the keys is initialized during the heap construction
 in statement 1 of Figure~\ref{fig:searchingstep}).
 Statement 1.3 performs a seek operation in File $j$ on disk for the first 
 read operation and reads sequentially all keys $k$ that have the same $i$ 
 %(obtained from Eq.~(\ref{eq:bucketindex})) 
 and inserts them all in bucket $i$.
 Finally, statement 1.4 inserts in $H$ the triple $(i, j, x)$,  
 where $x$ is the first key read from File $j$ (in statement 1.3) 
 that does not have the same bucket address as the previous keys;
 in this new triple, $i$ denotes the bucket address of~$x$.
 The number of seek operations on disk performed in statement 1.3 is discussed
 in Section~\ref{sec:linearcomplexity}, 
 where we present a buffering technique that reduces 
 the time spent on seeks.
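 As an illustration of Figure~\ref{fig:readingbucket}, consider a hypothetical instance
 (not taken from our experiments) with $N=2$ files, in which the keys of File~1 have
 bucket addresses $0,0,2$ and the keys of File~2 have bucket addresses $0,1,2$, in this order.
 After the heap construction, $H$ holds one triple with address~$0$ from each file.
 Reading bucket~$0$ then proceeds as follows, assuming the triple from File~1 is removed first:
 \begin{itemize}
 \item the triple from File~1 is removed from $H$ and its key is inserted into bucket~$0$;
 the second key of File~1 also has address~$0$ and is inserted into bucket~$0$;
 the third key of File~1 has address~$2$, so a triple with address~$2$ is inserted in $H$ for File~1;
 \item the triple from File~2 is removed from $H$ and its key is inserted into bucket~$0$;
 the next key of File~2 has address~$1$, so a triple with address~$1$ is inserted in $H$ for File~2.
 \end{itemize}
 Bucket~$0$ now holds its three keys, and the smallest address in $H$ is~$1$,
 so bucket~$1$ is the next one to be read.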
 \vspace{-2mm}
 \enlargethispage{2\baselineskip}
 \subsubsection{Generating a MPHF for each bucket.} \label{sec:mphfbucket}
 To the best of our knowledge, the algorithm we designed in
 our previous work~\cite{bkz05} is the fastest published algorithm for
 constructing MPHFs.
 This is why we use that algorithm as a building block for the 
 algorithm presented here.
 %\enlargethispage{-\baselineskip}
 Our previous algorithm is a three-step internal memory based algorithm
 that produces a MPHF based on random graphs.
 For a set of $n$ keys, the algorithm outputs the resulting MPHF in expected time $O(n)$.
 For a given bucket $i$, $0 \leq i < \lceil n/b \rceil$, the corresponding MPHF$_i$ 
 has the following form:
 \begin{eqnarray}
        \mathrm{MPHF}_i(k) &=& g_i[a] + g_i[b] \label{eq:mphfi}
 \end{eqnarray} 
 where $a = h_{i1}(k) \bmod t$, $b = h_{i2}(k) \bmod t$ and
 $t = c\times \mathit{size}[i]$. The functions
 $h_{i1}(k)$ and $h_{i2}(k)$ are the same universal function proposed by Jenkins~\cite{j97}
 that was used in the partitioning step described in Section~\ref{sec:partitioning-keys}.
 To generate the function above, the algorithm builds simple random graphs
 $G_i = (V_i, E_i)$ with~$|V_i|=t=c\times\mathit{size}[i]$ and $|E_i|=\mathit{size}[i]$, with  $c \in [0.93, 1.15]$.
 To generate a simple random graph with high 
 probability\footnote{We use the term `with high probability'
 to mean `with probability tending to~$1$ as~$n\to\infty$'.}, two vertices $a$ and $b$ are
 computed for each key $k$ in bucket $i$.
 Thus, each bucket $i$ has a corresponding graph~$G_i=(V_i,E_i)$, where $V_i=\{0,1,
 \ldots,t-1\}$ and $E_i=\big\{\{a,b\}:k \in \mathrm{bucket}\: i\big\}$.
 In order to get a simple graph, 
 the algorithm repeatedly selects $h_{i1}$ and $h_{i2}$ from a family of universal hash functions
 until the corresponding graph is simple.
 The probability of getting a simple graph is $p=e^{-1/c^2}$.
 For $c=1$, this probability is $p \simeq 0.368$, and the expected number of 
 iterations to obtain a simple graph is~$1/p \simeq 2.72$.
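 For the endpoints of the interval $c\in[0.93,1.15]$, the same formula gives
 (values rounded by us)
 \begin{eqnarray*}
 c=0.93: && p = e^{-1/0.93^2} \approx 0.315, \quad 1/p \approx 3.18,\\
 c=1.15: && p = e^{-1/1.15^2} \approx 0.469, \quad 1/p \approx 2.13.
 \end{eqnarray*}
 Since $p$ increases with~$c$, the expected number of iterations stays between
 roughly $2.1$ and $3.2$ over the whole range of~$c$.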
 The construction of MPHF$_i$ ends with a computation of a suitable labelling of the vertices
 of~$G_i$. The labelling is stored into vector $g_i$.
 We choose~$g_i[v]$ for each~$v\in V_i$ in such
 a way that Eq.~(\ref{eq:mphfi}) is a MPHF for bucket $i$.
 In order to get the value of each entry of $g_i$, we first 
 run a breadth-first search on the 2-\textit{core} of $G_i$, i.e., the maximal subgraph 
 of~$G_i$ with minimum degree at least~$2$ (see, e.g., \cite{b01,jlr00,pw04}), and then
 a depth-first search on the acyclic part of $G_i$ (see \cite{bkz05} for details).
--- a/vldb07/svglov2.clo
+++ b/vldb07/svglov2.clo
@@ -0,0 +1,77 @@
 % SVJour2 DOCUMENT CLASS OPTION SVGLOV2 -- for standardised journals
 %
 % This is an enhancement for the LaTeX
 % SVJour2 document class for Springer journals
 %
 %%
 %%
 %% \CharacterTable
 %%  {Upper-case    \A\B\C\D\E\F\G\H\I\J\K\L\M\N\O\P\Q\R\S\T\U\V\W\X\Y\Z
 %%   Lower-case    \a\b\c\d\e\f\g\h\i\j\k\l\m\n\o\p\q\r\s\t\u\v\w\x\y\z
 %%   Digits        \0\1\2\3\4\5\6\7\8\9
 %%   Exclamation   \!     Double quote  \"     Hash (number) \#
 %%   Dollar        \$     Percent       \%     Ampersand     \&
 %%   Acute accent  \'     Left paren    \(     Right paren   \)
 %%   Asterisk      \*     Plus          \+     Comma         \,
 %%   Minus         \-     Point         \.     Solidus       \/
 %%   Colon         \:     Semicolon     \;     Less than     \<
 %%   Equals        \=     Greater than  \>     Question mark \?
 %%   Commercial at \@     Left bracket  \[     Backslash     \\
 %%   Right bracket \]     Circumflex    \^     Underscore    \_
 %%   Grave accent  \`     Left brace    \{     Vertical bar  \|
 %%   Right brace   \}     Tilde         \~}
 \ProvidesFile{svglov2.clo}
              [2004/10/25 v2.1
      style option for standardised journals]
 \typeout{SVJour Class option: svglov2.clo for standardised journals}
 \def\validfor{svjour2}
 \ExecuteOptions{final,10pt,runningheads}
 % No size changing allowed, hence a copy of size10.clo is included
 \renewcommand\normalsize{%
   \@setfontsize\normalsize{10.2pt}{4mm}%
   \abovedisplayskip=3 mm plus6pt minus 4pt
   \belowdisplayskip=3 mm plus6pt minus 4pt
   \abovedisplayshortskip=0.0 mm plus6pt
   \belowdisplayshortskip=2 mm plus4pt minus 4pt
   \let\@listi\@listI}
 \normalsize
 \newcommand\small{%
   \@setfontsize\small{8.7pt}{3.25mm}%
   \abovedisplayskip 8.5\p@ \@plus3\p@ \@minus4\p@
   \abovedisplayshortskip \z@ \@plus2\p@
   \belowdisplayshortskip 4\p@ \@plus2\p@ \@minus2\p@
   \def\@listi{\leftmargin\leftmargini
               \parsep 0\p@ \@plus1\p@ \@minus\p@
               \topsep 4\p@ \@plus2\p@ \@minus4\p@
               \itemsep0\p@}%
   \belowdisplayskip \abovedisplayskip
 }
 \let\footnotesize\small
 \newcommand\scriptsize{\@setfontsize\scriptsize\@viipt\@viiipt}
 \newcommand\tiny{\@setfontsize\tiny\@vpt\@vipt}
 \newcommand\large{\@setfontsize\large\@xiipt{14pt}}
 \newcommand\Large{\@setfontsize\Large\@xivpt{16dd}}
 \newcommand\LARGE{\@setfontsize\LARGE\@xviipt{17dd}}
 \newcommand\huge{\@setfontsize\huge\@xxpt{25}}
 \newcommand\Huge{\@setfontsize\Huge\@xxvpt{30}}
 %
 %ALT% \def\runheadhook{\rlap{\smash{\lower5pt\hbox to\textwidth{\hrulefill}}}}
 \def\runheadhook{\rlap{\smash{\lower11pt\hbox to\textwidth{\hrulefill}}}}
 \AtEndOfClass{\advance\headsep by5pt}
 \if@twocolumn
 \setlength{\textwidth}{17.6cm}
 \setlength{\textheight}{230mm}
 \AtEndOfClass{\setlength\columnsep{4mm}}
 \else
 \setlength{\textwidth}{11.7cm}
 \setlength{\textheight}{517.5dd} % 19.46cm
 \fi
 %
 \AtBeginDocument{%
 \@ifundefined{@journalname}
 {\typeout{Unknown journal: specify \string\journalname\string{%
 <name of your journal>\string} in preambel^^J}}{}}
 %
 \endinput
 %%
 %% End of file `svglov2.clo'.
--- a/vldb07/svjour2.cls
+++ b/vldb07/svjour2.cls
--- a/vldb07/terminology.tex
+++ b/vldb07/terminology.tex
@@ -0,0 +1,18 @@
 % Time-stamp: <Sunday 29 Jan 2006 11:55:42pm EST yoshi@flare>
 \vspace{-3mm}
 \section{Notation and terminology}
 \vspace{-2mm}
 \label{sec:notation}
 \enlargethispage{2\baselineskip}
 The essential notation and terminology used throughout this paper are as follows.
 \begin{itemize}
 \item $U$: key universe. $|U| = u$.
 \item $S$: actual static key set. $S \subset U$, $|S| = n \ll u$.
 \item $h: U \to M$ is a hash function that maps keys from a universe $U$ into
 a given range $M = \{0,1,\dots,m-1\}$ of integer numbers.
 \item $h$ is a perfect hash function if it is one-to-one on~$S$, i.e., if
  $h(k_1) \not = h(k_2)$ for all $k_1 \not = k_2$ from $S$.
 \item $h$ is a minimal perfect hash function (MPHF) if it is one-to-one on~$S$ 
  and $n=m$. 
 \end{itemize}
--- a/vldb07/thealgorithm.tex
+++ b/vldb07/thealgorithm.tex
@@ -0,0 +1,78 @@
 %% Nivio: 13/jan/06, 21/jan/06 29/jan/06
 % Time-stamp: <Sunday 29 Jan 2006 11:56:25pm EST yoshi@flare>
 \vspace{-3mm}
 \section{The algorithm}
 \label{sec:new-algorithm}
 \vspace{-2mm}
 \enlargethispage{2\baselineskip}
 The main idea supporting our algorithm is the classical divide and conquer technique.
 The algorithm is a two-step external memory based algorithm 
 that generates a MPHF $h$ for a set $S$ of $n$ keys.
 Figure~\ref{fig:new-algo-main-steps} illustrates the two steps of the
 algorithm: the partitioning step and the searching step.
 \vspace{-2mm}
 \begin{figure}[ht]
 \centering
 \begin{picture}(0,0)%
 \includegraphics{figs/brz.ps}%
 \end{picture}%
 \setlength{\unitlength}{4144sp}%
 %
 \begingroup\makeatletter\ifx\SetFigFont\undefined%
 \gdef\SetFigFont#1#2#3#4#5{%
  \reset@font\fontsize{#1}{#2pt}%
  \fontfamily{#3}\fontseries{#4}\fontshape{#5}%
  \selectfont}%
 \fi\endgroup%
 \begin{picture}(3704,2091)(1426,-5161)
 \put(2570,-4301){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}0}}}}
 \put(2782,-4301){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}1}}}}
 \put(2996,-4301){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}2}}}}
 \put(4060,-4006){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}Buckets}}}}
 \put(3776,-4301){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}${\lceil n/b\rceil - 1}$}}}}
 \put(4563,-3329){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}Key Set $S$}}}}
 \put(2009,-3160){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}0}}}}
 \put(2221,-3160){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}1}}}}
 \put(4315,-3160){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}n-1}}}}
 \put(1992,-5146){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}0}}}}
 \put(2204,-5146){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}1}}}}
 \put(4298,-5146){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}n-1}}}}
 \put(4546,-4977){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}Hash Table}}}}
 \put(1441,-3616){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}Partitioning}}}}
 \put(1441,-4426){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}Searching}}}}
 \put(1981,-4786){\makebox(0,0)[lb]{\smash{{\SetFigFont{5}{6.0}{\familydefault}{\mddefault}{\updefault}MPHF$_0$}}}}
 \put(2521,-4786){\makebox(0,0)[lb]{\smash{{\SetFigFont{5}{6.0}{\familydefault}{\mddefault}{\updefault}MPHF$_1$}}}}
 \put(3016,-4786){\makebox(0,0)[lb]{\smash{{\SetFigFont{5}{6.0}{\familydefault}{\mddefault}{\updefault}MPHF$_2$}}}}
 \put(3826,-4786){\makebox(0,0)[lb]{\smash{{\SetFigFont{5}{6.0}{\familydefault}{\mddefault}{\updefault}MPHF$_{\lceil n/b \rceil - 1}$}}}}
 \end{picture}%
 \vspace{-1mm}
 \caption{Main steps of our algorithm}
 \label{fig:new-algo-main-steps}
 \vspace{-3mm}
 \end{figure}
 The partitioning step takes a key set $S$ and uses a universal hash function 
 $h_0$ proposed by Jenkins~\cite{j97} 
 %for each key $k \in S$ of length $|k|$ 
 to transform each key~$k\in S$ into an integer~$h_0(k)$.
 Reducing~$h_0(k)$ modulo~$\lceil n/b\rceil$, we partition~$S$ into $\lceil n/b
 \rceil$ buckets, each containing at most 256 keys (with high
 probability).  
 The searching step generates a MPHF$_i$ for each bucket $i$, 
 $0 \leq i < \lceil n/b \rceil$.
 The resulting MPHF $h(k)$, $k \in S$, is given by
 \begin{eqnarray}\label{eq:mphf}
 h(k) = \mathrm{MPHF}_i (k) + \mathit{offset}[i], 
 \end{eqnarray}
 where~$i=h_0(k)\bmod\lceil n/b\rceil$.
 The $i$th entry~$\mathit{offset}[i]$ of the displacement vector
 $\mathit{offset}$, $0 \leq i < \lceil n/b \rceil$, contains the total number
 of keys in the buckets from 0 to $i-1$, that is, it gives the first position of the
 interval of the hash table addressed by MPHF$_i$.  In the following we explain
 each step in detail.
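 As a small illustration of Eq.~(\ref{eq:mphf}), suppose (hypothetically) that there are
 only three buckets, holding $3$, $2$ and $4$ keys, respectively. Then
 \[
 \mathit{offset}[0] = 0, \quad \mathit{offset}[1] = 3, \quad \mathit{offset}[2] = 5,
 \]
 so that MPHF$_0$, MPHF$_1$ and MPHF$_2$ address the hash table positions
 $\{0,1,2\}$, $\{3,4\}$ and $\{5,6,7,8\}$, respectively.
 A key~$k$ in bucket~$1$ with $\mathrm{MPHF}_1(k)=1$ is therefore mapped to $h(k) = 1 + 3 = 4$.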
--- a/vldb07/thedataandsetup.tex
+++ b/vldb07/thedataandsetup.tex
@@ -0,0 +1,21 @@
 % Nivio: 29/jan/06
 % Time-stamp: <Sunday 29 Jan 2006 11:57:40pm EST yoshi@flare>
 \vspace{-2mm}
 \subsection{The data and the experimental setup}
 \label{sec:data-exper-set}
 The algorithms were implemented in the C language and
 are available at \texttt{http://\-cmph.sf.net}
 under the GNU Lesser General Public License (LGPL).
 % free software licence.
 All experiments were carried out on
 a computer running the Linux operating system, version 2.6,
 with a 2.4 gigahertz processor and
 1 gigabyte of main memory. 
 In the experiments related to the new
 algorithm we limited the main memory to 500 megabytes.
 Our data consists of a collection of 1 billion
 URLs collected from the Web, each URL 64 characters long on average.
 The collection is stored on disk in 60.5 gigabytes.
--- a/vldb07/vldb.tex
+++ b/vldb07/vldb.tex
@@ -0,0 +1,194 @@
 %%%%%%%%%%%%%%%%%%%%%%% file template.tex %%%%%%%%%%%%%%%%%%%%%%%%%
 %
 % This is a template file for the LaTeX package SVJour2 for the
 % Springer journal "The VLDB Journal".
 %
 %                                    Springer Heidelberg 2004/12/03
 %
 % Copy it to a new file with a new name and use it as the basis
 % for your article. Delete % as needed.
 %
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 %
 % First comes an example EPS file -- just ignore it and
 % proceed on the \documentclass line
 % your LaTeX will extract the file if required
 %\begin{filecontents*}{figs/minimalperfecthash-ph-mph.ps}
 %!PS-Adobe-3.0 EPSF-3.0
 %%BoundingBox: 19 19 221 221
 %%CreationDate: Mon Sep 29 1997
 %%Creator: programmed by hand (JK)
 %%EndComments
 %gsave
 %newpath
 %  20 20 moveto
 %  20 220 lineto
 %  220 220 lineto
 %  220 20 lineto
 %closepath
 %2 setlinewidth
 %gsave
 %  .4 setgray fill
 %grestore
 %stroke
 %grestore
 %\end{filecontents*}
 %
 \documentclass[twocolumn,fleqn,runningheads]{svjour2}
 %
 \smartqed  % flush right qed marks, e.g. at end of proof
 %
 \usepackage{graphicx}
 \usepackage{listings}
 \usepackage{epsfig}
 \usepackage{textcomp}
 \usepackage[latin1]{inputenc}
 \usepackage{amssymb}
 %\DeclareGraphicsExtensions{.png} 
 %
 % \usepackage{mathptmx}      % use Times fonts if available on your TeX system
 %
 % insert here the call for the packages your document requires
 %\usepackage{latexsym}
 % etc.
 %
 % please place your own definitions here and don't use \def but
 % \newcommand{}{}
 %
 \lstset{
  language=Pascal,
  basicstyle=\fontsize{9}{9}\selectfont,
  captionpos=t,
  aboveskip=1mm,
  belowskip=1mm,
  abovecaptionskip=1mm,
  belowcaptionskip=1mm,
 %  numbers = left,
  mathescape=true,
  escapechar=@,
  extendedchars=true,
  showstringspaces=false,
  columns=fixed,
  basewidth=0.515em,
  frame=single,
  framesep=2mm,
  xleftmargin=2mm,
  xrightmargin=2mm,
  framerule=0.5pt
 }
 \def\cG{{\mathcal G}}
 \def\crit{{\rm crit}}
 \def\ncrit{{\rm ncrit}}
 \def\scrit{{\rm scrit}}
 \def\bedges{{\rm bedges}}
 \def\ZZ{{\mathbb Z}}
 \journalname{The VLDB Journal}
 %
 \begin{document}
 \title{Space and Time Efficient Minimal Perfect Hash \\[0.2cm]
 Functions for Very Large Databases\thanks{
 This work was supported in part by
 GERINDO Project--grant MCT/CNPq/CT-INFO 552.087/02-5,
 CAPES/PROF Scholarship (Fabiano C. Botelho),
 FAPESP Proj.\ Tem.\ 03/09925-5 and CNPq Grant 30.0334/93-1
 (Yoshiharu Kohayakawa),
 and CNPq Grant 30.5237/02-0 (Nivio Ziviani).}
 }
 %\subtitle{Do you have a subtitle?\\ If so, write it here}
 %\titlerunning{Short form of title}        % if too long for running head
 \author{Fabiano C. Botelho \and Davi C. Reis \and Yoshiharu Kohayakawa \and Nivio Ziviani}
 %\authorrunning{Short form of author list} % if too long for running head
 \institute{
 F. C. Botelho \and 
 N. Ziviani \at
 Dept. of Computer Science,
 Federal Univ. of Minas Gerais,
 Belo Horizonte, Brazil\\
 \email{\{fbotelho,nivio\}@dcc.ufmg.br}
 \and
 D. C. Reis \at
 Google, Brazil \\
 \email{davi.reis@gmail.com}
 \and
 Y. Kohayakawa \at
 Dept. of Computer Science,
 Univ. of S\~ao Paulo,
 S\~ao Paulo, Brazil\\
 \email{yoshi@ime.usp.br}
 }
 \date{Received: date / Accepted: date}
 % The correct dates will be entered by the editor
 \maketitle
 \begin{abstract}
 We propose a novel external-memory-based algorithm for constructing minimal
 perfect hash functions~$h$ for huge sets of keys.
 For a set of~$n$ keys, our algorithm outputs~$h$ in time~$O(n)$.
 The algorithm needs only a small vector of one-byte entries
 in main memory to construct~$h$.
 The evaluation of~$h(x)$ requires three memory accesses for any key~$x$.
 The description of~$h$ takes a constant number of bits
 per key, which is asymptotically optimal, since the theoretical lower bound
 is $1/\ln 2 \approx 1.44$ bits per key.
 In our experiments, we used a collection of 1~billion URLs gathered
 from the web, each 64 characters long on average.
 For this collection, our algorithm
 (i) finds a minimal perfect hash function in approximately
 3~hours on a commodity PC,
 (ii) needs just 5.45 megabytes of internal memory to generate~$h$,
 and (iii) takes 8.1 bits per key for the description of~$h$.
 \keywords{Minimal Perfect Hashing \and Large Databases}
 \end{abstract}
 % main text
 \def\BSmax{\mathit{BS}_{\mathit{max}}}
 \def\Bi{\mathop{\rm Bi}\nolimits}
 \input{introduction}
 %\input{terminology}
 \input{relatedwork}
 \input{thealgorithm}
 \input{partitioningthekeys}
 \input{searching}
 %\input{computingoffset}
 %\input{hashingbuckets}
 \input{determiningb}
 %\input{analyticalandexperimentalresults}
 \input{analyticalresults}
 %\input{results}
 \input{conclusions}
 %\input{acknowledgments}
 %\begin{acknowledgements}
 %If you'd like to thank anyone, place your comments here
 %and remove the percent signs.
 %\end{acknowledgements}
 % BibTeX users please use
 %\bibliographystyle{spmpsci}
 %\bibliography{}   % name your BibTeX data base
 \bibliographystyle{plain}
 \bibliography{references}
 \input{appendix}
 \end{document}