From 80549b6ca664ba2338ebf96fbc886e145e5e40cd Mon Sep 17 00:00:00 2001 From: fc_botelho Date: Fri, 11 Aug 2006 17:32:31 +0000 Subject: [PATCH] paper for vldb07 added --- vldb07/acknowledgments.tex | 7 + vldb07/analyticalresults.tex | 174 +++ vldb07/appendix.tex | 6 + vldb07/conclusions.tex | 42 + vldb07/costhashingbuckets.tex | 177 +++ vldb07/determiningb.tex | 146 +++ vldb07/diskaccess.tex | 113 ++ vldb07/experimentalresults.tex | 15 + vldb07/figs/bmz_temporegressao.png | Bin 0 -> 5769 bytes vldb07/figs/brz-partitioning.fig | 107 ++ vldb07/figs/brz-partitioningfabiano.fig | 126 ++ vldb07/figs/brz.fig | 183 +++ vldb07/figs/brz_temporegressao.png | Bin 0 -> 5671 bytes vldb07/figs/brzfabiano.fig | 153 +++ vldb07/figs/minimalperfecthash-ph-mph.png | Bin 0 -> 3916 bytes vldb07/introduction.tex | 109 ++ vldb07/makefile | 17 + vldb07/partitioningthekeys.tex | 141 ++ vldb07/performancenewalgorithm.tex | 113 ++ vldb07/references.bib | 814 ++++++++++++ vldb07/relatedwork.tex | 112 ++ vldb07/searching.tex | 155 +++ vldb07/svglov2.clo | 77 ++ vldb07/svjour2.cls | 1419 +++++++++++++++++++++ vldb07/terminology.tex | 18 + vldb07/thealgorithm.tex | 78 ++ vldb07/thedataandsetup.tex | 21 + vldb07/vldb.tex | 194 +++ 28 files changed, 4517 insertions(+) create mode 100755 vldb07/acknowledgments.tex create mode 100755 vldb07/analyticalresults.tex create mode 100644 vldb07/appendix.tex create mode 100755 vldb07/conclusions.tex create mode 100755 vldb07/costhashingbuckets.tex create mode 100755 vldb07/determiningb.tex create mode 100755 vldb07/diskaccess.tex create mode 100755 vldb07/experimentalresults.tex create mode 100644 vldb07/figs/bmz_temporegressao.png create mode 100644 vldb07/figs/brz-partitioning.fig create mode 100755 vldb07/figs/brz-partitioningfabiano.fig create mode 100755 vldb07/figs/brz.fig create mode 100644 vldb07/figs/brz_temporegressao.png create mode 100755 vldb07/figs/brzfabiano.fig create mode 100644 vldb07/figs/minimalperfecthash-ph-mph.png create mode 100755 
vldb07/introduction.tex create mode 100755 vldb07/makefile create mode 100755 vldb07/partitioningthekeys.tex create mode 100755 vldb07/performancenewalgorithm.tex create mode 100755 vldb07/references.bib create mode 100755 vldb07/relatedwork.tex create mode 100755 vldb07/searching.tex create mode 100644 vldb07/svglov2.clo create mode 100644 vldb07/svjour2.cls create mode 100755 vldb07/terminology.tex create mode 100755 vldb07/thealgorithm.tex create mode 100755 vldb07/thedataandsetup.tex create mode 100644 vldb07/vldb.tex diff --git a/vldb07/acknowledgments.tex b/vldb07/acknowledgments.tex new file mode 100755 index 0000000..d903ceb --- /dev/null +++ b/vldb07/acknowledgments.tex @@ -0,0 +1,7 @@ +\section{Acknowledgments} +This section is optional; it is a location for you +to acknowledge grants, funding, editing assistance and +what have you. In the present case, for example, the +authors would like to thank Gerald Murray of ACM for +his help in codifying this \textit{Author's Guide} +and the \textbf{.cls} and \textbf{.tex} files that it describes. diff --git a/vldb07/analyticalresults.tex b/vldb07/analyticalresults.tex new file mode 100755 index 0000000..06ea049 --- /dev/null +++ b/vldb07/analyticalresults.tex @@ -0,0 +1,174 @@ +%% Nivio: 23/jan/06 29/jan/06 +% Time-stamp: +\enlargethispage{2\baselineskip} +\section{Analytical results} +\label{sec:analytcal-results} + +\vspace{-1mm} +The purpose of this section is fourfold. +First, we show that our algorithm runs in expected time $O(n)$. +Second, we present the main memory requirements for constructing the MPHF. +Third, we discuss the cost of evaluating the resulting MPHF. +Fourth, we present the space required to store the resulting MPHF. + +\vspace{-2mm} +\subsection{The linear time complexity} +\label{sec:linearcomplexity} + +First, we show that the partitioning step presented in +Figure~\ref{fig:partitioningstep} runs in $O(n)$ time. 
+Each iteration of the {\bf for} loop in statement~1
+runs in $O(|B_j|)$ time, $1 \leq j \leq N$, where $|B_j|$ is the
+number of keys
+that fit in block $B_j$ of size $\mu$. This is because statement 1.1 just reads
+$|B_j|$ keys from disk, statement 1.2 runs a bucket-sort-like algorithm,
+which is well known to be linear in the number of keys it sorts (i.e., $|B_j|$ keys),
+and statement 1.3 just dumps $|B_j|$ keys to the disk into File $j$.
+Thus, the {\bf for} loop runs in $O(\sum_{j=1}^{N}|B_j|)$ time.
+Since $\sum_{j=1}^{N}|B_j|=n$, the partitioning step runs in $O(n)$ time.
+
+Second, we show that the searching step presented in
+Figure~\ref{fig:searchingstep} also runs in $O(n)$ time.
+The heap construction in statement 1 runs in $O(N)$ time, for $N \ll n$.
+We have assumed that insertions and deletions in the heap cost $O(1)$ because
+$N$ is typically much smaller than $n$ (see \cite[Section 6.4]{bkz06t} for details).
+Statement 2 runs in $O(\sum_{i=0}^{\lceil n/b \rceil - 1} \mathit{size}[i])$ time
+(remember that $\mathit{size}[i]$ stores the number of keys in bucket $i$).
+Since $\sum_{i=0}^{\lceil n/b \rceil - 1} \mathit{size}[i] = n$, if
+statements 2.1, 2.2 and 2.3 run in $O(\mathit{size}[i])$ time, then statement 2
+runs in $O(n)$ time.
+
+%Statement 2.1 runs the algorithm to read a bucket from disk. That algorithm reads $\mathit{size}[i]$
+%keys of bucket $i$ that might be spread into many files or, in the worst case,
+%into $|BS_{max}|$ files, where $|BS_{max}|$ is the number of keys in the bucket of maximum size.
+%It uses the heap $H$ to drive a multiway merge of the sprayed bucket $i$.
+%As we are considering that each read/write on disk costs $O(1)$ and
+%each heap operation also costs $O(1)$ (recall $N \ll n$), then statement 2.1
+%costs $O(\mathit{size}[i])$ time.
+%We need to take into account that this step could generate a lot of seeks on disk.
+%However, the number of seeks can be amortized (see Section~\ref{sec:contr-disk-access})
+%and that is why we have been able of getting a MPHF for a set of 1 billion keys in less
+%than 4 hours using a machine with just 500 MB of main memory
+%(see Section~\ref{sec:performance}).
+Statement 2.1 reads $O(\mathit{size}[i])$ keys of bucket $i$
+and is detailed in Figure~\ref{fig:readingbucket}.
+As we are assuming that each read or write on disk costs $O(1)$ and
+each heap operation also costs $O(1)$, statement~2.1
+takes $O(\mathit{size}[i])$ time.
+However, in the worst case the keys of bucket $i$ are distributed among at
+most~$BS_{max}$ files on disk
+(recall that $BS_{max}$ is the maximum number of keys found in any bucket).
+Therefore, the critical step in reading a bucket is statement~1.3 of Figure~\ref{fig:readingbucket},
+where the first read operation on File $j$ may cause a seek.
+
+In order to amortize the number of seeks performed we use a buffering technique~\cite{k73}.
+We create a buffer $j$ of size \textbaht$\: = \mu/N$ for each file $j$,
+where $1\leq j \leq N$
+(recall that $\mu$ is the size in bytes of an a priori reserved internal memory area).
+Every time a read operation is issued on file $j$ and the data is not found
+in the $j$th~buffer, \textbaht~bytes are read from file $j$ into buffer $j$.
+Hence, the number of seeks performed in the worst case is given by
+$\beta/$\textbaht~(remember that $\beta$ is the size in bytes of $S$),
+where we make the pessimistic assumption that one seek happens every time
+buffer $j$ is refilled.
+Since each URL is 64 bytes long on average, the number of seeks performed
+in the worst case is $64n/$\textbaht. Therefore, the number of seeks is linear in
+$n$ and amortized by \textbaht.
+
+It is important to emphasize two things.
+First, the operating system uses techniques
+to diminish the number of seeks and the average seek time.
+This makes the amortization factor greater than \textbaht~in practice.
+Second, almost all main memory is available to be used as
+file buffers because just a small vector
+of $\lceil n/b\rceil$ one-byte entries must be maintained in main memory,
+as we show in Section~\ref{sec:memconstruction}.
+
+
+Statement 2.2 runs our internal memory based algorithm in order to generate a MPHF for each bucket.
+That algorithm is linear, as we showed in~\cite{bkz05}. As it is applied to buckets with {\it size}$[i]$ keys, statement~2.2 takes $O(\mathit{size}[i])$ time.
+
+Statement 2.3 has time complexity $O(\mathit{size}[i])$ because it writes to disk
+the description of each generated MPHF and each description is stored in
+$c \times \mathit{size}[i] + O(1)$ bytes, where $c\in[0.93,1.15]$.
+In conclusion, our algorithm takes $O(n)$ time because both the partitioning and
+the searching steps run in $O(n)$ time.
+
+An experimental validation of the above proof and a performance comparison with
+our internal memory based algorithm~\cite{bkz05} were not included here due to
+space restrictions but can be found in~\cite{bkz06t} and also in the appendix.
+
+\vspace{-1mm}
+\enlargethispage{2\baselineskip}
+\subsection{Space used for constructing a MPHF}
+\label{sec:memconstruction}
+
+The vector {\it size} is kept in main memory
+all the time and has $\lceil n/b \rceil$ one-byte entries.
+It stores the number of keys in each bucket, and
+those values are less than or equal to 256.
+For example, for a set of 1 billion keys and $b=175$ the vector {\it size} needs
+$5.45$ megabytes of main memory.
+
+We need an internal memory area of size $\mu$ bytes to be used in
+the partitioning step and in the searching step.
+The size $\mu$ is fixed a priori and depends only on the amount
+of internal memory available to run the algorithm
+(i.e., it does not depend on the size $n$ of the problem).
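To make the bookkeeping above concrete, here is a small illustrative computation (a Python sketch, not the paper's C implementation; the constants are the running example's: $n$ of 1 billion keys, $b=175$, $\mu = 250$ megabytes split over $N = 248$ file buffers, and 64-byte URLs):

```python
from math import ceil

n = 1_000_000_000   # number of keys in the running example
b = 175             # average number of keys per bucket
MB = 2 ** 20

# The vector "size" keeps one byte per bucket, i.e. ceil(n/b) bytes.
size_vector_mb = ceil(n / b) / MB          # ~5.45 MB, as stated in the text

# Internal memory area mu split into N per-file buffers of mu/N bytes each.
mu = 250 * MB
N = 248
buffer_bytes = mu // N                     # ~1 MB per file buffer

# Worst-case number of seeks: beta / buffer size, with beta = 64n bytes
# (URLs are 64 bytes long on average).
beta = 64 * n
worst_case_seeks = beta // buffer_bytes

print(round(size_vector_mb, 2), buffer_bytes, worst_case_seeks)
```

So even under the pessimistic one-seek-per-refill assumption, a 250-megabyte buffer area keeps the worst-case seek count in the tens of thousands for a billion keys.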
+
+% One could argue about the a priori reserved internal memory area and the main memory
+% required to run the indirect bucket sort algorithm.
+% Those internal memory requirements do not depend on the size of the problem
+% (i.e., the number of keys being hashed) and can be fixed a priori.
+
+The additional space required in the searching step
+is constant, since the problem has been broken down
+into several small problems (of at most 256 keys each) and
+the heap size is much smaller than $n$ ($N \ll n$).
+For example, for a set of 1 billion keys and an internal area of~$\mu = 250$ megabytes,
+the number of files is $N = 248$.
+
+\vspace{-1mm}
+\subsection{Evaluation cost of the MPHF}
+
+Now we consider the amount of CPU time
+required by the resulting MPHF at retrieval time.
+The MPHF requires for each key the computation of three
+universal hash functions and three memory accesses
+(see Eqs.~(\ref{eq:mphf}), (\ref{eq:bucketindex}) and (\ref{eq:mphfi})).
+This is not optimal. Pagh~\cite{p99} showed that any MPHF requires
+at least the computation of two universal hash functions and one memory
+access.
+
+\subsection{Description size of the MPHF}
+
+The number of bits required to store the MPHF generated by the algorithm
+is computed by Eq.~(\ref{eq:newmphfbits}).
+We need to store each $g_i$ vector presented in Eq.~(\ref{eq:mphfi}), where
+$0\leq i < \lceil n/b \rceil$. As each bucket has at most 256 keys, each
+entry in a $g_i$ vector has 8~bits. In each $g_i$ vector there are
+$c \times \mathit{size}[i]$ entries (recall $c\in[0.93, 1.15]$).
+When we sum up the number of entries of $\lceil n/b \rceil$ $g_i$ vectors we have
+$c\sum_{i=0}^{\lceil n/b \rceil -1} \mathit{size}[i]=cn$ entries. We also need to
+store $3 \lceil n/b \rceil$ integers of
+$\log_2n$ bits each, referring respectively to the {\it offset} vector and the two random seeds of
+$h_{1i}$ and $h_{2i}$. In addition, we need to store $\lceil n/b \rceil$ 8-bit entries of
+the vector {\it size}.
Therefore, +\begin{eqnarray}\label{eq:newmphfbits} +\mathrm{Required\: Space} = 8cn + \frac{n}{b}\left( 3\log_2n +8\right) \: +\mathrm{bits}. +\end{eqnarray} + +Considering $c=0.93$ and $b=175$, the number of bits per key to store +the description of the resulting MPHF for a set of 1~billion keys is $8.1$. +If we set $b=128$, then the bits per key ratio increases to $8.3$. +Theoretically, the number of bits required to store the MPHF in +Eq.~(\ref{eq:newmphfbits}) +is $O(n\log n)$ as~$n\to\infty$. However, for sets of size up to $2^{b/3}$ keys +the number of bits per key is lower than 9~bits (note that +$2^{b/3}>2^{58}>10^{17}$ for $b=175$). +%For $b=175$, the number of bits per key will be close to 9 for a set of $2^{58}$ keys. +Thus, in practice the resulting function is stored in $O(n)$ bits. diff --git a/vldb07/appendix.tex b/vldb07/appendix.tex new file mode 100644 index 0000000..288ad8a --- /dev/null +++ b/vldb07/appendix.tex @@ -0,0 +1,6 @@ +\appendix +\input{experimentalresults} +\input{thedataandsetup} +\input{costhashingbuckets} +\input{performancenewalgorithm} +\input{diskaccess} diff --git a/vldb07/conclusions.tex b/vldb07/conclusions.tex new file mode 100755 index 0000000..8d32741 --- /dev/null +++ b/vldb07/conclusions.tex @@ -0,0 +1,42 @@ +% Time-stamp: +\enlargethispage{2\baselineskip} +\section{Concluding remarks} +\label{sec:concuding-remarks} + +This paper has presented a novel external memory based algorithm for +constructing MPHFs that works for sets in the order of billions of keys. The +algorithm outputs the resulting function in~$O(n)$ time and, furthermore, it +can be tuned to run only $34\%$ slower (see \cite{bkz06t} for details) than the fastest +algorithm available in the literature for constructing MPHFs~\cite{bkz05}. +In addition, the space +requirement of the resulting MPHF is of up to 9 bits per key for datasets of +up to $2^{58}\simeq10^{17.4}$ keys. 
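The "up to 9 bits per key" figure can be checked directly from Eq.~(\ref{eq:newmphfbits}); the following illustrative sketch (ours, not code from the paper) evaluates the per-key ratio, which simplifies to $8c + (3\log_2 n + 8)/b$:

```python
from math import log2

# Bits per key from Eq. (eq:newmphfbits): (8cn + (n/b)(3 log2 n + 8)) / n.
def bits_per_key(n: float, b: float, c: float) -> float:
    return 8 * c + (3 * log2(n) + 8) / b

n = 1_000_000_000
print(bits_per_key(n, 175, 0.93))       # about 8.0 bits/key
print(bits_per_key(n, 128, 0.93))       # about 8.2 bits/key

# For sets of up to 2**58 keys and b = 175 the ratio stays below 9 bits.
print(bits_per_key(2 ** 58, 175, 0.93))  # about 8.5 bits/key
```

Note the ratio grows only with $\log_2 n$, which is why the function is stored in $O(n)$ bits for all practical set sizes.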
+
+The algorithm is simple and needs just a
+small vector of size approximately 5.45 megabytes in main memory to construct
+a MPHF for a collection of 1 billion URLs, each one 64 bytes long on average.
+Therefore, almost all main memory is available to be used as disk I/O buffer.
+Making use of such a buffering scheme considering an internal memory area of size
+$\mu=200$ megabytes, our algorithm can produce a MPHF for a
+set of 1 billion URLs in approximately 3.6 hours using a 2.4 gigahertz commodity PC with
+500 megabytes of main memory.
+If we increase both the main memory
+available to 1 gigabyte and the internal memory area to $\mu=500$ megabytes,
+a MPHF for the set of 1 billion URLs is produced in approximately 3 hours. For any
+key, the evaluation of the resulting MPHF takes three memory accesses and the
+computation of three universal hash functions.
+
+To allow the reproduction of our results and the use of both the internal memory
+based algorithm and the external memory based algorithm,
+the algorithms are available at \texttt{http://cmph.sf.net}
+under the GNU Lesser General Public License (LGPL).
+They were implemented in the C language.
+
+In future work, we will exploit the fact that the searching step intrinsically
+presents a high degree of parallelism and accounts for $73\%$ of the
+construction time. A parallel implementation of our algorithm will therefore
+allow both the construction and the evaluation of the resulting function to be
+performed in parallel.
+Moreover, the description of the resulting MPHFs will be distributed across the parallel
+computer, allowing the algorithm to scale to sets of hundreds of billions of keys.
+This is an important contribution, mainly for applications related to the Web, as
+mentioned in Section~\ref{sec:intro}.
\ No newline at end of file diff --git a/vldb07/costhashingbuckets.tex b/vldb07/costhashingbuckets.tex new file mode 100755 index 0000000..3a624ce --- /dev/null +++ b/vldb07/costhashingbuckets.tex @@ -0,0 +1,177 @@ +% Nivio: 29/jan/06 +% Time-stamp: +\vspace{-2mm} +\subsection{Performance of the internal memory based algorithm} +\label{sec:intern-memory-algor} + +%\begin{table*}[htb] +%\begin{center} +%{\scriptsize +%\begin{tabular}{|c|c|c|c|c|c|c|c|} +%\hline +%$n$ (millions) & 1 & 2 & 4 & 8 & 16 & 32 \\ +%\hline +%Average time (s)& $6.1 \pm 0.3$ & $12.2 \pm 0.6$ & $25.4 \pm 1.1$ & $51.4 \pm 2.0$ & $117.3 \pm 4.4$ & $262.2 \pm 8.7$\\ +%SD (s) & $2.6$ & $5.4$ & $9.8$ & $17.6$ & $37.3$ & $76.3$ \\ +%\hline +%\end{tabular} +%\vspace{-3mm} +%} +%\end{center} +%\caption{Internal memory based algorithm: average time in seconds for constructing a MPHF, +%the standard deviation (SD), and the confidence intervals considering +%a confidence level of $95\%$.} +%\label{tab:medias} +%\end{table*} + +Our three-step internal memory based algorithm presented in~\cite{bkz05} +is used for constructing a MPHF for each bucket. +It is a randomized algorithm because it needs to generate a simple random graph +in its first step. +Once the graph is obtained the other two steps are deterministic. + +Thus, we can consider the runtime of the algorithm to have the form~$\alpha +nZ$ for an input of~$n$ keys, where~$\alpha$ is some machine dependent +constant that further depends on the length of the keys and~$Z$ is a random +variable with geometric distribution with mean~$1/p=e^{1/c^2}$ (see +Section~\ref{sec:mphfbucket}). All results in our experiments were obtained +taking $c=1$; the value of~$c$, with~$c\in[0.93,1.15]$, in fact has little +influence in the runtime, as shown in~\cite{bkz05}. + +The values chosen for $n$ were $1, 2, 4, 8, 16$ and $32$ million. 
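The geometric model above is easy to sanity-check numerically. A small illustrative simulation (ours, not from the paper): each attempt to build a simple random graph succeeds with probability $p=e^{-1/c^2}$, so for $c=1$ the number of attempts $Z$ should average $1/p=e\approx2.72$:

```python
import random
from math import exp

# Z = attempts until the first simple random graph; geometric with
# success probability p = exp(-1/c^2). For c = 1, E[Z] = 1/p = e.
c = 1.0
p = exp(-1.0 / c ** 2)

rng = random.Random(42)  # fixed seed for reproducibility of the sketch
trials = 200_000
total = 0
for _ in range(trials):
    z = 1
    while rng.random() >= p:  # each attempt succeeds with probability p
        z += 1
    total += z
mean_z = total / trials
print(mean_z)  # close to e = 2.718...
```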
+Although we have a dataset with 1~billion URLs, on a PC with
+1~gigabyte of main memory, the algorithm is able
+to handle an input with at most 32 million keys.
+This is mainly because of the graph we need to keep in main memory.
+The algorithm requires $25n + O(1)$ bytes for constructing
+a MPHF (details about the data structures used by the algorithm can
+be found in~\texttt{http://cmph.sf.net}).
+
+In order to estimate the number of trials for each value of $n$ we used
+a statistical method for determining a suitable sample size (see, e.g.,
+\cite[Chapter 13]{j91}).
+As we obtained different values for each $n$,
+we used the largest value obtained, namely 300~trials, in order to have
+a confidence level of $95\%$.
+
+% \begin{figure*}[ht]
+% \noindent
+% \begin{minipage}[b]{0.5\linewidth}
+% \centering
+% \subfigure[The previous algorithm]
+% {\scalebox{0.5}{\includegraphics{figs/bmz_temporegressao.eps}}}
+% \end{minipage}
+% \hfill
+% \begin{minipage}[b]{0.5\linewidth}
+% \centering
+% \subfigure[The new algorithm]
+% {\scalebox{0.5}{\includegraphics{figs/brz_temporegressao.eps}}}
+% \end{minipage}
+% \caption{Time versus number of keys in $S$. The solid line corresponds to
+% a linear regression model.}
+% %obtained from the experimental measurements.}
+% \label{fig:temporegressao}
+% \end{figure*}
+
+Table~\ref{tab:medias} presents the runtime average for each $n$,
+the respective standard deviations, and
+the respective confidence intervals given by
+the average time $\pm$ the distance from average time
+considering a confidence level of $95\%$.
+Observing the runtime averages one sees that
+the algorithm runs in expected linear time,
+as shown in~\cite{bkz05}.
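The 32-million-key ceiling quoted above follows directly from the $25n$-byte requirement; a quick illustrative check (a sketch assuming the 1-gigabyte figure from the text, ignoring the $O(1)$ term):

```python
# The internal memory based algorithm needs 25n + O(1) bytes for the graph
# and auxiliary structures; on a 1 GB machine that caps n at a few tens of
# millions of keys.
MB = 2 ** 20

def graph_memory_mb(n: int) -> float:
    # Memory in megabytes for an input of n keys (O(1) term ignored).
    return 25 * n / MB

print(graph_memory_mb(32_000_000))  # ~763 MB: fits in 1 GB
print(graph_memory_mb(64_000_000))  # ~1526 MB: does not fit
```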
+ +\vspace{-2mm} +\begin{table*}[htb] +\begin{center} +{\scriptsize +\begin{tabular}{|c|c|c|c|c|c|c|c|} +\hline +$n$ (millions) & 1 & 2 & 4 & 8 & 16 & 32 \\ +\hline +Average time (s)& $6.1 \pm 0.3$ & $12.2 \pm 0.6$ & $25.4 \pm 1.1$ & $51.4 \pm 2.0$ & $117.3 \pm 4.4$ & $262.2 \pm 8.7$\\ +SD (s) & $2.6$ & $5.4$ & $9.8$ & $17.6$ & $37.3$ & $76.3$ \\ +\hline +\end{tabular} +\vspace{-1mm} +} +\end{center} +\caption{Internal memory based algorithm: average time in seconds for constructing a MPHF, +the standard deviation (SD), and the confidence intervals considering +a confidence level of $95\%$.} +\label{tab:medias} +\vspace{-4mm} +\end{table*} + +% \enlargethispage{\baselineskip} +% \begin{table*}[htb] +% \begin{center} +% {\scriptsize +% (a) +% \begin{tabular}{|c|c|c|c|c|c|c|c|} +% \hline +% $n$ (millions) & 1 & 2 & 4 & 8 & 16 & 32 \\ +% \hline +% Average time (s)& $6.119 \pm 0.300$ & $12.190 \pm 0.615$ & $25.359 \pm 1.109$ & $51.408 \pm 2.003$ & $117.343 \pm 4.364$ & $262.215 \pm 8.724$\\ +% SD (s) & $2.644$ & $5.414$ & $9.757$ & $17.627$ & $37.333$ & $76.271$ \\ +% \hline +% \end{tabular} +% \\[5mm] (b) +% \begin{tabular}{|l|c|c|c|c|c|} +% \hline +% $n$ (millions) & 1 & 2 & 4 & 8 & 16 \\ +% \hline % Part. 16 \% 16 \% 16 \% 18 \% 20\% +% Average time (s) & $6.927 \pm 0.309$ & $13.828 \pm 0.175$ & $31.936 \pm 0.663$ & $69.902 \pm 1.084$ & $140.617 \pm 2.502$ \\ +% SD & $0.431$ & $0.245$ & $0.926$ & $1.515$ & $3.498$ \\ +% \hline +% \hline +% $n$ (millions) & 32 & 64 & 128 & 512 & 1000 \\ +% \hline % Part. 
20 \% 20\% 20\% 18\% 18\% +% Average time (s) & $284.330 \pm 1.135$ & $587.880 \pm 3.945$ & $1223.581 \pm 4.864$ & $5966.402 \pm 9.465$ & $13229.540 \pm 12.670$ \\ +% SD & $1.587$ & $5.514$ & $6.800$ & $13.232$ & $18.577$ \\ +% \hline +% \end{tabular} +% } +% \end{center} +% \caption{The runtime averages in seconds, +% the standard deviation (SD), and +% the confidence intervals given by the average time $\pm$ +% the distance from average time considering +% a confidence level of $95\%$.} +% \label{tab:medias} +% \end{table*} + +\enlargethispage{2\baselineskip} +Figure~\ref{fig:bmz_temporegressao} +presents the runtime for each trial. In addition, +the solid line corresponds to a linear regression model +obtained from the experimental measurements. +As we can see, the runtime for a given $n$ has a considerable +fluctuation. However, the fluctuation also grows linearly with $n$. + +\begin{figure}[htb] +\vspace{-2mm} +\begin{center} +\scalebox{0.4}{\includegraphics{figs/bmz_temporegressao.eps}} +\caption{Time versus number of keys in $S$ for the internal memory based algorithm. +The solid line corresponds to a linear regression model.} +\label{fig:bmz_temporegressao} +\end{center} +\vspace{-6mm} +\end{figure} + +The observed fluctuation in the runtimes is as expected; recall that this +runtime has the form~$\alpha nZ$ with~$Z$ a geometric random variable with +mean~$1/p=e$. Thus, the runtime has mean~$\alpha n/p=\alpha en$ and standard +deviation~$\alpha n\sqrt{(1-p)/p^2}=\alpha n\sqrt{e(e-1)}$. +Therefore, the standard deviation also grows +linearly with $n$, as experimentally verified +in Table~\ref{tab:medias} and in Figure~\ref{fig:bmz_temporegressao}. + +%\noindent-------------------------------------------------------------------------\\ +%Comentario para Yoshi: Nao consegui reproduzir bem o que discutimos +%no paragrafo acima, acho que vc conseguira justificar melhor :-). 
\\ +%-------------------------------------------------------------------------\\ diff --git a/vldb07/determiningb.tex b/vldb07/determiningb.tex new file mode 100755 index 0000000..e9b3cb2 --- /dev/null +++ b/vldb07/determiningb.tex @@ -0,0 +1,146 @@ +% Nivio: 29/jan/06 +% Time-stamp: +\enlargethispage{2\baselineskip} +\subsection{Determining~$b$} +\label{sec:determining-b} +\begin{table*}[t] +\begin{center} +{\small %\scriptsize +\begin{tabular}{|c|ccc|ccc|} +\hline +\raisebox{-0.7em}{$n$} & \multicolumn{3}{c|}{\raisebox{-1mm}{b=128}} & +\multicolumn{3}{c|}{\raisebox{-1mm}{b=175}}\\ +\cline{2-4} \cline{5-7} + & \raisebox{-0.5mm}{Worst Case} & \raisebox{-0.5mm}{Average} &\raisebox{-0.5mm}{Eq.~(\ref{eq:maxbs})} + & \raisebox{-0.5mm}{Worst Case} & \raisebox{-0.5mm}{Average} &\raisebox{-0.5mm}{Eq.~(\ref{eq:maxbs})} \\ +\hline +$1.0 \times 10^6$ & 177 & 172.0 & 176 & 232 & 226.6 & 230 \\ +%$2.0 \times 10^6$ & 179 & 174.0 & 178 & 236 & 228.5 & 232 \\ +$4.0 \times 10^6$ & 182 & 177.5 & 179 & 241 & 231.8 & 234 \\ +%$8.0 \times 10^6$ & 186 & 181.6 & 181 & 238 & 234.2 & 236 \\ +$1.6 \times 10^7$ & 184 & 181.6 & 183 & 241 & 236.1 & 238 \\ +%$3.2 \times 10^7$ & 191 & 183.9 & 184 & 240 & 236.6 & 240 \\ +$6.4 \times 10^7$ & 195 & 185.2 & 186 & 244 & 239.0 & 242 \\ +%$1.28 \times 10^8$ & 193 & 187.7 & 187 & 244 & 239.7 & 244 \\ +$5.12 \times 10^8$ & 196 & 191.7 & 190 & 251 & 246.3 & 247 \\ +$1.0 \times 10^9$ & 197 & 191.6 & 192 & 253 & 248.9 & 249 \\ +\hline +\end{tabular} +\vspace{-1mm} +} +\end{center} +\caption{Values for $\mathit{BS}_{\mathit{max}}$: worst case and average case obtained in the experiments and using Eq.~(\ref{eq:maxbs}), +considering $b=128$ and $b=175$ for different number $n$ of keys in $S$.} +\label{tab:comparison} +\vspace{-6mm} +\end{table*} + +The partitioning step can be viewed as the well known ``balls into bins'' +problem~\cite{ra98,dfm02} where~$n$ keys (the balls) are placed independently and +uniformly into $\lceil n/b\rceil$ buckets (the bins). 
The main question related to that problem we are interested +in is: what is the maximum number of keys in any bucket? +In fact, we want to get the maximum value for $b$ that makes the maximum number of keys in any bucket +no greater than 256 with high probability. +This is important, as we wish to use 8 bits per entry in the vector $g_i$ of +each $\mathrm{MPHF}_i$, +where $0 \leq i < \lceil n/b\rceil$. +Let $\mathit{BS}_{\mathit{max}}$ be the maximum number of keys in any bucket. + +Clearly, $\BSmax$ is the maximum +of~$\lceil n/b\rceil$ random variables~$Z_i$, each with binomial +distribution~$\Bi(n,p)$ with parameters~$n$ and~$p=1/\lceil n/b\rceil$. +However, the~$Z_i$ are not independent. Note that~$\Bi(n,p)$ has mean and +variance~$\simeq b$. To give an upper estimate for the probability of the +event~$\BSmax\geq \gamma$, we can estimate the probability that we have~$Z_i\geq \gamma$ +for a fixed~$i$, and then sum these estimates over all~$i$. +Let~$\gamma=b+\sigma\sqrt{b\ln(n/b)}$, where~$\sigma=\sqrt2$. +Approximating~$\Bi(n,p)$ by the normal distribution with mean and +variance~$b$, we obtain the +estimate~$(\sigma\sqrt{2\pi\ln(n/b)})^{-1}\times\exp(-(1/2)\sigma^2\ln(n/b))$ for +the probability that~$Z_i\geq \gamma$ occurs, which, summed over all~$i$, gives +that the probability that~$\BSmax\geq \gamma$ occurs is at +most~$1/(\sigma\sqrt{2\pi\ln(n/b)})$, which tends to~$0$ as~$n\to\infty$. +Thus, we have shown that, with high probability, +\begin{equation} + \label{eq:maxbs} + \BSmax\leq b+\sqrt{2b\ln{n\over b}}. +\end{equation} + +% The traditional approach used to estimate $\mathit{BS}_{\mathit{max}}$ with high probability is +% to consider $\mathit{BS}_{\mathit{max}}$ as a random variable that follows a binomial distribution +% that can be approximated by a poisson distribution. This yields a good approximation +% when the number of balls is lower than or equal to the number of bins~\cite{g81}. 
In our case, +% the number of balls is greater than the number of buckets. +% % and that is why we have used more recent works to estimate $\mathit{BS}_{\mathit{max}}$. +% As $b > \ln (n/b)$, we can use the result by Raab and Steger~\cite{ra98} to estimate +% $\mathit{BS}_{\mathit{max}}$ with high probability. +% The following equation gives the estimation, where $\sigma=\sqrt{2}$: +% \begin{eqnarray} \label{eq:maxbs} +% \mathit{BS}_{\mathit{max}} = b + O \left( \sqrt{b\ln\frac{n}{b}} \right) = b + \sigma \times \left(\sqrt{b\ln\frac{n}{b}} \right) +% \end{eqnarray} + +% In order to estimate the suitable constant $\sigma$ we did a linear +% regression suppressing the constant term. +% We use the equation $BS_{max} - b = \sigma \times \sqrt{b\ln (n/b)}$ +% in the linear regression considering $y=BS_{max} - b$ and $x=\sqrt{b\ln (n/b)}$. +% In order to obtain data to be used in the linear regression we set +% b=128 and ran the new algorithm ten times +% for n equal to 1, 2, 4, 8, 16, 32, 64, 128, 512, 1000 million keys. +% Taking a confidence level equal to 95\% we got +% $\sigma = 2.11 \pm 0.03$. +% The coefficient of determination was $99.6\%$, which means that the linear +% regression explains $99.6\%$ of the data variation and only $0.4\%$ +% is due to experimental errors. +% Therefore, Eq.~(\ref{eq:maxbs}) with $\sigma = 2.11 \pm 0.03$ and $b=128$ +% makes a very good estimation of the maximum number of keys in any bucket. + +% Repeating the same experiments for $b$ equals to $175$ and +% a confidence level of $95\%$ we got $\sigma = 2.07 \pm 0.03$. +% Again we verified that Eq.~(\ref{eq:maxbs}) with $\sigma = 2.07 \pm 0.03$ and $b=175$ is +% a very good estimation of the maximum number of keys in any bucket once the +% coefficient of determination obtained was $99.7 \%$ and $\sigma$ is in a very narrow range. + +In our algorithm the maximum number of keys in any bucket must be at most 256. 
+Table~\ref{tab:comparison} presents the values for $\mathit{BS}_{\mathit{max}}$
+obtained experimentally and using Eq.~(\ref{eq:maxbs}).
+The table presents the worst case and the average case,
+considering $b=128$, $b=175$ and Eq.~(\ref{eq:maxbs}),
+for several numbers~$n$ of keys in $S$.
+The estimation given by Eq.~(\ref{eq:maxbs}) is very close to the experimental
+results.
+
+Now we estimate the largest problem our algorithm is able to solve for
+a given $b$.
+Using Eq.~(\ref{eq:maxbs}) with $b=128$ and $b=175$, and imposing
+that~$\mathit{BS}_{\mathit{max}}\leq256$,
+the largest key sets our algorithm
+can deal with have $10^{30}$ keys and $10^{10}$ keys, respectively.
+%It is also important to have $b$ as big as possible, once its value is
+%related to the space required to store the resultant MPHF, as shown later on.
+%Table~\ref{tab:bp} shows the biggest problem the algorithm can solve.
+% The values were obtained from Eq.~(\ref{eq:maxbs}),
+% considering $b=128$ and~$b=175$ and imposing
+% that~$\mathit{BS}_{\mathit{max}}\leq256$.
+
+% We set $\sigma=2.14$ because it was the greatest value obtained for $\sigma$
+% in the two linear regressions we did.
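Both the table entries and the largest-problem estimates can be reproduced directly from Eq.~(\ref{eq:maxbs}); an illustrative sketch (ours, not the paper's code):

```python
from math import exp, log, sqrt

# High-probability bound on the fullest bucket, Eq. (eq:maxbs):
# BSmax <= b + sqrt(2 b ln(n/b)).
def bs_max(n: float, b: float) -> float:
    return b + sqrt(2 * b * log(n / b))

n = 1_000_000_000
# Reproduces the last row of Table tab:comparison after rounding:
print(round(bs_max(n, 128)), round(bs_max(n, 175)))  # 192 249

# Largest instance with bs_max <= 256: solving b + sqrt(2 b ln(n/b)) = 256
# for n gives n = b * exp((256 - b)**2 / (2 b)).
def largest_n(b: float) -> float:
    return b * exp((256 - b) ** 2 / (2 * b))

print(f"{largest_n(128):.1e}", f"{largest_n(175):.1e}")  # ~1e30 and ~2e10 keys
```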
+% \vspace{-3mm}
+% \begin{table}[htb]
+% \begin{center}
+% {\small %\scriptsize
+% \begin{tabular}{|c|c|}
+% \hline
+% b & Problem size ($n$) \\
+% \hline
+% 128 & $10^{30}$ keys \\
+% 175 & $10^{10}$ keys \\
+% \hline
+% \end{tabular}
+% \vspace{-1mm}
+% }
+% \end{center}
+% \caption{Using Eq.~(\ref{eq:maxbs}) to estimate the biggest problem our algorithm can solve.}
+% %considering $\sigma=\sqrt{2}$.}
+% \label{tab:bp}
+% \vspace{-14mm}
+% \end{table} diff --git a/vldb07/diskaccess.tex b/vldb07/diskaccess.tex new file mode 100755 index 0000000..08e54b9 --- /dev/null +++ b/vldb07/diskaccess.tex @@ -0,0 +1,113 @@ +% Nivio: 29/jan/06
+% Time-stamp:
+\vspace{-2mm}
+\subsection{Controlling disk accesses}
+\label{sec:contr-disk-access}
+
+In order to bring down the number of seek operations on disk
+we benefit from the fact that our algorithm leaves almost all main
+memory available to be used as disk I/O buffer.
+In this section we evaluate how much the parameter $\mu$
+affects the runtime of our algorithm.
+For that we fixed $n$ at 1 billion URLs,
+set the main memory of the machine used for the experiments
+to 1 gigabyte, and used $\mu$ equal to $100, 200, 300, 400, 500$ and $600$
+megabytes.
+
+\enlargethispage{2\baselineskip}
+Table~\ref{tab:diskaccess} presents the number of files $N$,
+the buffer size used for all files, the number of seeks in the worst case considering
+the pessimistic assumption mentioned in Section~\ref{sec:linearcomplexity}, and
+the time to generate a MPHF for 1 billion keys as a function of the amount of internal
+memory available. Observing Table~\ref{tab:diskaccess} we notice that the construction time
+decreases as the value of $\mu$ increases. However, for $\mu > 400$, the variation
+in the time is not as significant as for $\mu \leq 400$.
+This can be explained by the fact that the kernel 2.6 I/O scheduler of Linux +has smart policies +for avoiding seeks and diminishing the average seek time +(see \texttt{http://www.linuxjournal.com/article/6931}). +\begin{table*}[ht] +\vspace{-2mm} +\begin{center} +{\scriptsize +\begin{tabular}{|l|c|c|c|c|c|c|} +\hline +$\mu$ (MB) & $100$ & $200$ & $300$ & $400$ & $500$ & $600$ \\ +\hline +$N$ (files) & $619$ & $310$ & $207$ & $155$ & $124$ & $104$ \\ +%\hline +\textbaht~(buffer size in KB) & $165$ & $661$ & $1,484$ & $2,643$ & $4,129$ & $5,908$ \\ +%\hline +$\beta$/\textbaht~(\# of seeks in the worst case) & $384,478$ & $95,974$ & $42,749$ & $24,003$ & $15,365$ & $10,738$ \\ +% \hline +% \raisebox{-0.2em}{\# of seeks performed in} & \raisebox{-0.7em}{$383,056$} & \raisebox{-0.7em}{$95,919$} & \raisebox{-0.7em}{$42,700$} & \raisebox{-0.7em}{$23,980$} & \raisebox{-0.7em}{$15,347$} & \raisebox{-0.7em}{$xx,xxx$} \\ +% \raisebox{0.2em}{statement 1.3 of Figure~\ref{fig:readingbucket}} & & & & & & \\ +% \hline +Time (hours) & $4.04$ & $3.64$ & $3.34$ & $3.20$ & $3.13$ & $3.09$ \\ +\hline +\end{tabular} +\vspace{-1mm} +} +\end{center} +\caption{Influence of the internal memory area size ($\mu$) in our algorithm runtime.} +\label{tab:diskaccess} +\vspace{-14mm} +\end{table*} + + + +% \begin{table*}[ht] +% \begin{center} +% {\scriptsize +% \begin{tabular}{|l|c|c|c|c|c|c|c|c|c|c|c|} +% \hline +% $\mu$ (MB) & $100$ & $150$ & $200$ & $250$ & $300$ & $350$ & $400$ & $450$ & $500$ & $550$ & $600$ \\ +% \hline +% $N$ (files) & $619$ & $413$ & $310$ & $248$ & $207$ & $177$ & $155$ & $138$ & $124$ & $113$ & $103$ \\ +% \hline +% \textbaht~(buffer size in KB) & $165$ & $372$ & $661$ & $1,033$ & $1,484$ & $2,025$ & $2,643$ & $3,339$ & & & \\ +% \hline +% \# of seeks (Worst case) & $384,478$ & $170,535$ & $95,974$ & $61,413$ & $42,749$ & $31,328$ & $24,003$ & $19,000$ & & & \\ +% \hline +% \raisebox{-0.2em}{\# of seeks performed in} & \raisebox{-0.7em}{$383,056$} & 
\raisebox{-0.7em}{$170,385$} & \raisebox{-0.7em}{$95,919$} & \raisebox{-0.7em}{$61,388$} & \raisebox{-0.7em}{$42,700$} & \raisebox{-0.7em}{$31,296$} & \raisebox{-0.7em}{$23,980$} & \raisebox{-0.7em}{$18,978$} & \raisebox{-0.7em}{$xx,xxx$} & \raisebox{-0.7em}{$xx,xxx$} & \raisebox{-0.7em}{$xx,xxx$} \\ +% \raisebox{0.2em}{statement 1.3 of Figure~\ref{fig:readingbucket}} & & & & & & & & & & & \\ +% \hline +% Time (horas) & $4.04$ & $3.93$ & $3.64$ & $3.46$ & $3.34$ & $3.26$ & $3.20$ & $3.13$ & & & \\ +% \hline +% \end{tabular} +% } +% \end{center} +% \caption{Influence of the internal memory area size ($\mu$) in our algorithm runtime.} +% \label{tab:diskaccess} +% \end{table*} + + + +% \begin{table*}[htb] +% \begin{center} +% {\scriptsize +% \begin{tabular}{|l|c|c|c|c|c|} +% \hline +% $n$ (millions) & 1 & 2 & 4 & 8 & 16 \\ +% \hline % Part. 16 \% 16 \% 16 \% 18 \% 20\% +% Average time (s) & $14.124 \pm 0.128$ & $28.301 \pm 0.140$ & $56.807 \pm 0.312$ & $117.286 \pm 0.997$ & $241.086 \pm 0.936$ \\ +% SD & $0.179$ & $0.196$ & $0.437$ & $1.394$ & $1.308$ \\ +% \hline +% \hline +% $n$ (millions) & 32 & 64 & 128 & 512 & 1000 \\ +% \hline % Part. 20 \% 20\% 20\% 18\% 18\% +% Average time (s) & $492.430 \pm 1.565$ & $1006.307 \pm 1.425$ & $2081.208 \pm 0.740$ & $9253.188 \pm 4.406$ & $19021.480 \pm 13.850$ \\ +% SD & $2.188$ & $1.992$ & $1.035$ & $ 6.160$ & $18.016$ \\ +% \hline + +% \end{tabular} +% } +% \end{center} +% \caption{The runtime averages in seconds, +% the standard deviation (SD), and +% the confidence intervals given by the average time $\pm$ +% the distance from average time considering +% a confidence level of $95\%$. 
+% }
+% \label{tab:mediasbrz}
+% \end{table*}
diff --git a/vldb07/experimentalresults.tex b/vldb07/experimentalresults.tex
new file mode 100755
index 0000000..58b4091
--- /dev/null
+++ b/vldb07/experimentalresults.tex
@@ -0,0 +1,15 @@
+%Nivio: 29/jan/06
+% Time-stamp:
+\vspace{-2mm}
+\enlargethispage{2\baselineskip}
+\section{Appendix: Experimental results}
+\label{sec:experimental-results}
+\vspace{-1mm}
+
+In this section we present the experimental results.
+We start by presenting the experimental setup.
+We then present experimental results for
+the internal memory based algorithm~\cite{bkz05}
+and for our algorithm.
+Finally, we discuss how the amount of internal memory available
+affects the runtime of our algorithm.
diff --git a/vldb07/figs/bmz_temporegressao.png b/vldb07/figs/bmz_temporegressao.png
new file mode 100644
index 0000000000000000000000000000000000000000..e7198c1cea2b9f5d766b7f422b23a2e259209c79
Binary files /dev/null and b/vldb07/figs/bmz_temporegressao.png differ
diff --git a/vldb07/introduction.tex b/vldb07/introduction.tex
new file mode 100755
--- /dev/null
+++ b/vldb07/introduction.tex
@@ -0,0 +1,109 @@
+\section{Introduction}
+\label{sec:intro}
+
+\enlargethispage{2\baselineskip}
+Suppose~$U$ is a universe of \textit{keys} of size $u$.
+Let $h:U\to M$ be a {\em hash function} that maps the keys from~$U$
+to a given interval of integers $M=[0,m-1]=\{0,1,\dots,m-1\}$.
+Let~$S\subseteq U$ be a set of~$n$ keys from~$U$, where $n \ll u$.
+Given a key~$x\in S$, the hash function~$h$ computes an integer in
+$[0,m-1]$ for the storage or retrieval of~$x$ in a {\em hash table}.
+% Hashing methods for {\em non-static sets} of keys can be used to construct
+% data structures storing $S$ and supporting membership queries
+% ``$x \in S$?'' in expected time $O(1)$.
+% However, they involve a certain amount of wasted space owing to unused
+% locations in the table and wasted time to resolve collisions when
+% two keys are hashed to the same table location.
+A perfect hash function maps a {\em static set} $S$ of $n$ keys from $U$ into a set of $m$ integers
+without collisions, where $m$ is greater than or equal to $n$.
+If $m$ is equal to $n$, the function is called minimal.
+
+% Figure~\ref{fig:minimalperfecthash-ph-mph}(a) illustrates a perfect hash function and
+% Figure~\ref{fig:minimalperfecthash-ph-mph}(b) illustrates a minimal perfect hash function (MPHF).
+%
+% \begin{figure}
+% \centering
+% \scalebox{0.7}{\epsfig{file=figs/minimalperfecthash-ph-mph.ps}}
+% \caption{(a) Perfect hash function (b) Minimal perfect hash function (MPHF)}
+% \label{fig:minimalperfecthash-ph-mph}
+% %\vspace{-5mm}
+% \end{figure}
+
+Minimal perfect hash functions are widely used for memory-efficient storage and fast
+retrieval of items from static sets, such as words in natural languages,
+reserved words in programming languages or interactive systems, uniform resource
+locators (URLs) in web search engines, or item sets in data mining techniques.
+Search engines nowadays index tens of billions of pages, and algorithms
+like PageRank~\cite{Brin1998}, which uses the web link structure to derive a
+measure of popularity for Web pages, would benefit from a MPHF for storage and
+retrieval of such huge sets of URLs.
+For instance, the TodoBr\footnote{TodoBr ({\texttt www.todobr.com.br}) is a trademark of
+Akwan Information Technologies, which was acquired by Google Inc. in July 2005.}
+search engine used the algorithm proposed hereinafter to
+improve and scale its link analysis system.
+The WebGraph research group~\cite{bv04} would
+also benefit from a MPHF for sets on the order of billions of URLs to scale
+and to improve the storage requirements of their graph compression algorithms.
+
+ Another interesting application for MPHFs is their use as an indexing structure
+ for databases.
+ The B+ tree is very popular as an indexing structure for dynamic applications
+ with frequent insertions and deletions of records.
+ However, for applications with sporadic modifications and a huge number of
+ queries, the B+ tree is not the best option,
+ because it performs poorly with very large sets of keys
+ such as those required for the new frontiers of database applications~\cite{s05}.
+ Therefore, there are applications for MPHFs in
+ information retrieval systems, database systems, language translation systems,
+ electronic commerce systems, compilers, and operating systems, among others.
+
+Until now, because of the limitations of current algorithms,
+the use of MPHFs has been restricted to scenarios where the set of keys being hashed is
+relatively small.
+However, in many cases it is crucial to deal efficiently with very large
+sets of keys.
+Due to the exponential growth of the Web, working with huge collections is becoming
+a daily task.
+For instance, the simple assignment of numeric identifiers to the web pages of a collection
+can be a challenging task.
+While traditional databases simply cannot handle more traffic once the working
+set of URLs does not fit in main memory anymore~\cite{s05}, the algorithm we propose here to
+construct MPHFs can easily scale to billions of entries.
+% using stock hardware.
+
+As there are many applications for MPHFs, it is
+important to design and implement space- and time-efficient algorithms for
+constructing such functions.
+The attractiveness of using MPHFs depends on the following issues:
+\begin{enumerate}
+\item The amount of CPU time required by the algorithms for constructing MPHFs.
+\item The space requirements of the algorithms for constructing MPHFs.
+\item The amount of CPU time required by a MPHF for each retrieval.
+\item The space requirements of the description of the resulting MPHFs to be
+  used at retrieval time.
+\end{enumerate}
+
+\enlargethispage{2\baselineskip}
+This paper presents a novel external memory based algorithm for constructing MPHFs that
+is very efficient with respect to these four requirements.
+First, the algorithm constructs a MPHF in time linear in the number of keys,
+which is optimal.
+For instance, for a collection of 1 billion URLs
+collected from the web, each one 64 characters long on average, the time to construct a
+MPHF using a 2.4 gigahertz PC with 500 megabytes of available main memory
+is approximately 3 hours.
+Second, the algorithm needs a small, a priori defined vector of $\lceil n/b \rceil$
+one-byte entries in main memory to construct a MPHF.
+For the collection of 1 billion URLs and using $b=175$, the algorithm needs only
+5.45 megabytes of internal memory.
+Third, the evaluation of the MPHF for each retrieval requires three memory accesses and
+the computation of three universal hash functions.
+This is not optimal, as any MPHF requires at least one memory access and the computation
+of two universal hash functions.
+Fourth, the description of a MPHF takes a constant number of bits per key, which is optimal.
+For the collection of 1 billion URLs, it needs 8.1 bits per key,
+while the theoretical lower bound is $1/\ln 2 \approx 1.4427$ bits per
+key~\cite{m84}.
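To make the definition above concrete, the following sketch (illustrative only; the toy key set and hand-built mapping are hypothetical, and this is not the construction described in this paper) checks the two defining properties of a MPHF: no collisions over the static key set, and a range of exactly $m=n$ slots.

```python
def is_perfect(h, keys, m):
    # h is perfect over the static set `keys` if it is injective on the
    # set and every hash value falls in the range [0, m-1].
    values = [h(k) for k in keys]
    return all(0 <= v < m for v in values) and len(set(values)) == len(keys)

def is_minimal_perfect(h, keys, m):
    # Minimal: perfect and m equals n, so h is a bijection
    # between the keys and {0, 1, ..., n-1}.
    return m == len(keys) and is_perfect(h, keys, m)

# A toy static key set and a hand-built (hypothetical) MPHF for it.
keys = ["http://a.com/", "http://b.com/", "http://c.com/"]
table = {"http://a.com/": 2, "http://b.com/": 0, "http://c.com/": 1}
h = table.__getitem__

assert is_perfect(h, keys, m=3)
assert is_minimal_perfect(h, keys, m=3)      # m = n: minimal
assert not is_minimal_perfect(h, keys, m=4)  # perfect over 4 slots, not minimal
```

Note that `is_perfect` evaluates `h` on every key, so it is only a verification aid for small static sets; the point of a real MPHF algorithm is to build such a bijection without storing the keys themselves.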
+ diff --git a/vldb07/makefile b/vldb07/makefile new file mode 100755 index 0000000..1b95644 --- /dev/null +++ b/vldb07/makefile @@ -0,0 +1,17 @@ +all: + latex vldb.tex + bibtex vldb + latex vldb.tex + latex vldb.tex + dvips vldb.dvi -o vldb.ps + ps2pdf vldb.ps + chmod -R g+rwx * + +perm: + chmod -R g+rwx * + +run: clean all + gv vldb.ps & +clean: + rm *.aux *.bbl *.blg *.log *.ps *.pdf *.dvi + diff --git a/vldb07/partitioningthekeys.tex b/vldb07/partitioningthekeys.tex new file mode 100755 index 0000000..e9a48c4 --- /dev/null +++ b/vldb07/partitioningthekeys.tex @@ -0,0 +1,141 @@ +%% Nivio: 21/jan/06 +% Time-stamp: +\vspace{-2mm} +\subsection{Partitioning step} +\label{sec:partitioning-keys} + +The set $S$ of $n$ keys is partitioned into $\lceil n/b \rceil$ buckets, +where $b$ is a suitable parameter chosen to guarantee +that each bucket has at most 256 keys with high probability +(see Section~\ref{sec:determining-b}). +The partitioning step works as follows: + +\begin{figure}[h] +\hrule +\hrule +\vspace{2mm} +\begin{tabbing} +aa\=type booleanx \== (false, true); \kill +\> $\blacktriangleright$ Let $\beta$ be the size in bytes of the set $S$ \\ +\> $\blacktriangleright$ Let $\mu$ be the size in bytes of an a priori reserved \\ +\> ~~~ internal memory area \\ +\> $\blacktriangleright$ Let $N = \lceil \beta/\mu \rceil$ be the number of key blocks that will \\ +\> ~~~ be read from disk into an internal memory area \\ +\> $\blacktriangleright$ Let $\mathit{size}$ be a vector that stores the size of each bucket \\ +\> $1.$ {\bf for} $j = 1$ {\bf to} $N$ {\bf do} \\ +\> ~~ $1.1$ Read block $B_j$ of keys from disk \\ +\> ~~ $1.2$ Cluster $B_j$ into $\lceil n/b \rceil$ buckets using a bucket sort \\ +\> ~~~~~~~ algorithm and update the entries in the vector {\it size} \\ +\> ~~ $1.3$ Dump $B_j$ to the disk into File $j$\\ +\> $2.$ Compute the {\it offset} vector and dump it to the disk. 
+\end{tabbing}
+\hrule
+\hrule
+\vspace{-1.0mm}
+\caption{Partitioning step}
+\vspace{-3mm}
+\label{fig:partitioningstep}
+\end{figure}
+
+Statement 1.1 of the {\bf for} loop presented in Figure~\ref{fig:partitioningstep}
+sequentially reads all the keys of block $B_j$ from disk into an internal area
+of size $\mu$.
+
+Statement 1.2 performs an indirect bucket sort of the keys in block $B_j$
+and at the same time updates the entries in the vector {\em size}.
+Let us briefly describe how~$B_j$ is partitioned among the~$\lceil n/b\rceil$
+buckets.
+We use a local array of $\lceil n/b \rceil$ counters to store a
+count of how many keys from $B_j$ belong to each bucket.
+%At the same time, the global vector {\it size} is computed based on the local
+%counters.
+The pointers to the keys in each bucket $i$, $0 \leq i < \lceil n/b \rceil$,
+are stored in contiguous positions in an array.
+To do this, we first reserve the required number of entries
+in this array of pointers using the information from the array of counters.
+Next, we place the pointers to the keys in each bucket into the respective
+reserved areas in the array (i.e., we place the pointers to the keys in bucket 0,
+followed by the pointers to the keys in bucket 1, and so on).
+
+\enlargethispage{2\baselineskip}
+To find the bucket address of a given key
+we use the universal hash function $h_0(k)$~\cite{j97}.
+Key~$k$ goes into bucket~$i$, where
+%Then, for each integer $h_0(k)$ the respective bucket address is obtained
+%as follows:
+\begin{eqnarray} \label{eq:bucketindex}
+i=h_0(k) \bmod \left \lceil \frac{n}{b} \right \rceil.
+\end{eqnarray}
+
+Figure~\ref{fig:brz-partitioning}(a) shows a \emph{logical} view of the
+$\lceil n/b \rceil$ buckets generated in the partitioning step.
+%In this case, the keys of each bucket are put together by the pointers to
+%each key stored
+%in contiguous positions in the array of pointers.
+In reality, the keys belonging to each bucket are distributed among many files, +as depicted in Figure~\ref{fig:brz-partitioning}(b). +In the example of Figure~\ref{fig:brz-partitioning}(b), the keys in bucket 0 +appear in files 1 and $N$, the keys in bucket 1 appear in files 1, 2 +and $N$, and so on. + +\vspace{-7mm} +\begin{figure}[ht] +\centering +\begin{picture}(0,0)% +\includegraphics{figs/brz-partitioning.ps}% +\end{picture}% +\setlength{\unitlength}{4144sp}% +% +\begingroup\makeatletter\ifx\SetFigFont\undefined% +\gdef\SetFigFont#1#2#3#4#5{% + \reset@font\fontsize{#1}{#2pt}% + \fontfamily{#3}\fontseries{#4}\fontshape{#5}% + \selectfont}% +\fi\endgroup% +\begin{picture}(4371,1403)(1,-6977) +\put(333,-6421){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}0}}}} +\put(545,-6421){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}1}}}} +\put(759,-6421){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}2}}}} +\put(1539,-6421){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}${\lceil n/b\rceil - 1}$}}}} +\put(541,-6676){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}Buckets Logical View}}}} +\put(3547,-6120){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}} +\put(3547,-6188){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}} +\put(3547,-6255){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}} +\put(3107,-6120){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}} +\put(3107,-6188){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}} +\put(3107,-6255){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}} 
+\put(4177,-6224){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}} +\put(4177,-6269){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}} +\put(4177,-6314){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}} +\put(3016,-6721){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}File 1}}}} +\put(3466,-6721){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}File 2}}}} +\put(4096,-6721){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}File N}}}} +\put(3196,-6946){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}Buckets Physical View}}}} +\end{picture}% +\caption{Situation of the buckets at the end of the partitioning step: (a) Logical view (b) Physical view} +\label{fig:brz-partitioning} +\vspace{-2mm} +\end{figure} + +This scattering of the keys in the buckets could generate a performance +problem because of the potential number of seeks +needed to read the keys in each bucket from the $N$ files in disk +during the searching step. +But, as we show later in Section~\ref{sec:analytcal-results}, the number of seeks +can be kept small using buffering techniques. +Considering that only the vector {\it size}, which has $\lceil n/b \rceil$ +one-byte entries (remember that each bucket has at most 256 keys), +must be maintained in main memory during the searching step, +almost all main memory is available to be used as disk I/O buffer. + +The last step is to compute the {\it offset} vector and dump it to the disk. +We use the vector $\mathit{size}$ to compute the +$\mathit{offset}$ displacement vector. +The $\mathit{offset}[i]$ entry contains the number of keys +in the buckets $0, 1, \dots, i-1$. 
+As {\it size}$[i]$ stores the number of keys
+in bucket $i$, where $0 \leq i < \lceil n/b \rceil$, we have
+\begin{displaymath}
+\mathit{offset}[i] = \sum_{j=0}^{i-1} \mathit{size}[j] \cdot
+\end{displaymath}
+
diff --git a/vldb07/performancenewalgorithm.tex b/vldb07/performancenewalgorithm.tex
new file mode 100755
index 0000000..6911282
--- /dev/null
+++ b/vldb07/performancenewalgorithm.tex
@@ -0,0 +1,113 @@
+% Nivio: 29/jan/06
+% Time-stamp:
+\subsection{Performance of the new algorithm}
+\label{sec:performance}
+%As we have done for the internal memory based algorithm,
+
+The runtime of our algorithm is also a random variable, but now it follows a
+(highly concentrated) normal distribution, as we discuss at the end of this
+section. Again, we are interested in verifying the linearity claim made in
+Section~\ref{sec:linearcomplexity}. Therefore, we ran the algorithm for
+several numbers $n$ of keys in $S$.
+
+The values chosen for $n$ were $1, 2, 4, 8, 16, 32, 64, 128, 512$ and $1000$
+million.
+%Just the small vector {\it size} must be kept in main memory,
+%as we saw in Section~\ref{sec:memconstruction}.
+We limited the main memory to 500 megabytes for the experiments.
+The size $\mu$ of the a priori reserved internal memory area
+was set to 250 megabytes, the parameter $b$ was set to $175$, and
+the building block algorithm parameter $c$ was again set to $1$.
+In Section~\ref{sec:contr-disk-access} we show how $\mu$
+affects the runtime of the algorithm. The other two parameters
+have an insignificant influence on the runtime.
+
+We again use a statistical method for determining a suitable sample size
+%~\cite[Chapter 13]{j91}
+to estimate the number of trials to be run for each value of $n$. We found that
+just one trial for each $n$ would be enough with a confidence level of $95\%$.
+However, we made 10 trials.
This number of trials seems rather small, but, as
+shown below, the behavior of our algorithm is very stable and its runtime is
+almost deterministic (i.e., the standard deviation is very small).
+
+Table~\ref{tab:mediasbrz} presents the runtime average for each $n$,
+the respective standard deviations, and
+the respective confidence intervals given by
+the average time $\pm$ the distance from the average time
+considering a confidence level of $95\%$.
+Observing the runtime averages we notice that
+the algorithm runs in expected linear time,
+as shown in~Section~\ref{sec:linearcomplexity}. Better still,
+it is only approximately $60\%$ slower than our internal memory based algorithm.
+To get that value we used the linear regression model obtained for the runtime of
+the internal memory based algorithm to estimate how much time it would require
+for constructing a MPHF for a set of 1 billion keys.
+We got 2.3 hours for the internal memory based algorithm and measured
+3.67 hours on average for our algorithm.
+Increasing the size of the internal memory area
+from 250 to 600 megabytes (see Section~\ref{sec:contr-disk-access})
+brought the time down to 3.09 hours, making our algorithm
+just $34\%$ slower in this setup.
+
+\enlargethispage{2\baselineskip}
+\begin{table*}[htb]
+\vspace{-1mm}
+\begin{center}
+{\scriptsize
+\begin{tabular}{|l|c|c|c|c|c|}
+\hline
+$n$ (millions) & 1 & 2 & 4 & 8 & 16 \\
+\hline % Part. 16 \% 16 \% 16 \% 18 \% 20\%
+Average time (s) & $6.9 \pm 0.3$ & $13.8 \pm 0.2$ & $31.9 \pm 0.7$ & $69.9 \pm 1.1$ & $140.6 \pm 2.5$ \\
+SD & $0.4$ & $0.2$ & $0.9$ & $1.5$ & $3.5$ \\
+\hline
+\hline
+$n$ (millions) & 32 & 64 & 128 & 512 & 1000 \\
+\hline % Part.
20 \% 20\% 20\% 18\% 18\%
+Average time (s) & $284.3 \pm 1.1$ & $587.9 \pm 3.9$ & $1223.6 \pm 4.9$ & $5966.4 \pm 9.5$ & $13229.5 \pm 12.7$ \\
+SD & $1.6$ & $5.5$ & $6.8$ & $13.2$ & $18.6$ \\
+\hline
+
+\end{tabular}
+\vspace{-1mm}
+}
+\end{center}
+\caption{Our algorithm: average time in seconds for constructing a MPHF,
+the standard deviation (SD), and the confidence intervals considering
+a confidence level of $95\%$.
+}
+\label{tab:mediasbrz}
+\vspace{-5mm}
+\end{table*}
+
+Figure~\ref{fig:brz_temporegressao}
+presents the runtime for each trial. In addition,
+the solid line corresponds to a linear regression model
+obtained from the experimental measurements.
+As expected, the runtime for a given $n$ shows almost no
+variation.
+
+\begin{figure}[htb]
+\begin{center}
+\scalebox{0.4}{\includegraphics{figs/brz_temporegressao.eps}}
+\caption{Time versus number of keys in $S$ for our algorithm. The solid line corresponds to
+a linear regression model.}
+\label{fig:brz_temporegressao}
+\end{center}
+\vspace{-9mm}
+\end{figure}
+
+An intriguing observation is that the runtime of the algorithm is almost
+deterministic, even though it uses as a building block an
+algorithm with considerable fluctuation in its runtime. A given bucket~$i$,
+$0 \leq i < \lceil n/b \rceil$, is a small set of keys (at most 256 keys) and,
+as argued in Section~\ref{sec:intern-memory-algor}, the runtime of the
+building block algorithm is a random variable~$X_i$ with high fluctuation.
+However, the runtime~$Y$ of the searching step of our algorithm is given
+by~$Y=\sum_{0\leq i<\lceil n/b\rceil}X_i$. Under the hypothesis that
+the~$X_i$ are independent and bounded, the {\it law of large numbers} (see,
+e.g., \cite{j91}) implies that the random variable $Y/\lceil n/b\rceil$
+converges to a constant as~$n\to\infty$. This explains why the runtime of our
+algorithm is almost deterministic.
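The concentration argument above can be illustrated with a quick simulation. The per-bucket cost distribution below is invented for illustration; it only needs to be bounded and highly variable, like the per-bucket runtimes~$X_i$. As the number of buckets grows, the normalized total $Y/\lceil n/b\rceil$ concentrates around its mean, so the total looks almost deterministic.

```python
import random

def bucket_cost(rng):
    # Bounded but highly variable per-bucket cost: usually cheap,
    # occasionally 10x more expensive (an invented stand-in for X_i).
    return 1.0 if rng.random() < 0.9 else 10.0

def normalized_total(num_buckets, rng):
    # Y / (number of buckets): the average cost per bucket over one run.
    return sum(bucket_cost(rng) for _ in range(num_buckets)) / num_buckets

rng = random.Random(42)
for nb in (100, 10_000, 100_000):
    runs = [normalized_total(nb, rng) for _ in range(20)]
    spread = max(runs) - min(runs)
    # The spread across runs shrinks roughly like 1/sqrt(nb).
    print(f"{nb:>7} buckets: spread = {spread:.4f}")
```

With 100 buckets the normalized total still fluctuates visibly between runs; with 100{,}000 buckets the 20 runs are nearly indistinguishable, mirroring the very small standard deviations in Table 3.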
diff --git a/vldb07/references.bib b/vldb07/references.bib
new file mode 100755
index 0000000..d2ea475
--- /dev/null
+++ b/vldb07/references.bib
@@ -0,0 +1,814 @@

@InProceedings{Brin1998,
  author = "Sergey Brin and Lawrence Page",
  title = "The Anatomy of a Large-Scale Hypertextual Web Search Engine",
  booktitle = "Proceedings of the 7th International {World Wide Web} Conference",
  pages = "107--117",
  address = "Brisbane, Australia",
  month = "April",
  year = 1998,
  annote = "The Google paper."
}

@inproceedings{p99,
  author = {R. Pagh},
  title = {Hash and Displace: Efficient Evaluation of Minimal Perfect Hash Functions},
  booktitle = {Workshop on Algorithms and Data Structures},
  pages = {49--54},
  year = 1999,
  url = {citeseer.nj.nec.com/pagh99hash.html},
  key = {author}
}

@article{p00,
  author = {R. Pagh},
  title = {Faster deterministic dictionaries},
  journal = {Symposium on Discrete Algorithms (ACM SODA)},
  OPTvolume = {43},
  OPTnumber = {5},
  pages = {487--493},
  year = {2000}
}

@article{g81,
  author = {G. H. Gonnet},
  title = {Expected Length of the Longest Probe Sequence in Hash Code Searching},
  journal = {J. ACM},
  volume = {28},
  number = {2},
  year = {1981},
  issn = {0004-5411},
  pages = {289--304},
  doi = {http://doi.acm.org/10.1145/322248.322254},
  publisher = {ACM Press},
  address = {New York, NY, USA},
}

@misc{r04,
  author = "S. Rao",
  title = "Combinatorial Algorithms and Data Structures",
  year = 2004,
  howpublished = {CS 270 Spring},
  url = "citeseer.ist.psu.edu/700201.html"
}

@article{ra98,
  author = {Martin Raab and Angelika Steger},
  title = {``{B}alls into Bins'' --- {A} Simple and Tight Analysis},
  journal = {Lecture Notes in Computer Science},
  volume = 1518,
  pages = {159--170},
  year = 1998,
  url = "citeseer.ist.psu.edu/raab98balls.html"
}

@misc{mrs00,
  author = "M. Mitzenmacher and A. Richa and R. Sitaraman",
  title = "The power of two random choices: A survey of the techniques and results",
  howpublished = {In Handbook of Randomized Computing, P. Pardalos, S. Rajasekaran, and J. Rolim, Eds. Kluwer},
  year = "2000",
  url = "citeseer.ist.psu.edu/article/mitzenmacher00power.html"
}

@article{dfm02,
  author = {E. Drinea and A. Frieze and M. Mitzenmacher},
  title = {Balls and bins models with feedback},
  journal = {Symposium on Discrete Algorithms (ACM SODA)},
  pages = {308--315},
  year = {2002}
}

@Article{j97,
  author = {Bob Jenkins},
  title = {Algorithm Alley: Hash Functions},
  journal = {Dr. Dobb's Journal of Software Tools},
  volume = {22},
  number = {9},
  month = {September},
  year = {1997}
}

@article{gss01,
  author = {N. Galli and B. Seybold and K. Simon},
  title = {Tetris-Hashing or optimal table compression},
  journal = {Discrete Applied Mathematics},
  volume = {110},
  number = {1},
  pages = {41--58},
  month = {June},
  publisher = {Elsevier Science},
  year = {2001}
}

@article{s05,
  author = {M. Seltzer},
  title = {Beyond Relational Databases},
  journal = {ACM Queue},
  volume = {3},
  number = {3},
  month = {April},
  year = {2005}
}

@InProceedings{ss89,
  author = {P. Schmidt and A. Siegel},
  title = {On aspects of universality and performance for closed hashing},
  booktitle = {Proc. 21st Ann. ACM Symp. on Theory of Computing -- STOC'89},
  month = {May},
  year = {1989},
  pages = {355--366}
}

@article{asw00,
  author = {M. Atici and D. R. Stinson and R. Wei},
  title = {A new practical algorithm for the construction of a perfect hash function},
  journal = {Journal Combin. Math. Combin. Comput.},
  volume = {35},
  pages = {127--145},
  year = {2000}
}

@article{swz00,
  author = {D. R. Stinson and R. Wei and L. Zhu},
  title = {New constructions for perfect hash families and related structures using combinatorial designs and codes},
  journal = {Journal Combin.
Designs.},
  volume = {8},
  pages = {189--200},
  year = {2000}
}

@inproceedings{ht01,
  author = {T. Hagerup and T. Tholey},
  title = {Efficient minimal perfect hashing in nearly minimal space},
  booktitle = {The 18th Symposium on Theoretical Aspects of Computer Science (STACS), volume 2010 of Lecture Notes in Computer Science},
  year = 2001,
  pages = {317--326},
  key = {author}
}

@inproceedings{dh01,
  author = {M. Dietzfelbinger and T. Hagerup},
  title = {Simple minimal perfect hashing in less space},
  booktitle = {The 9th European Symposium on Algorithms (ESA), volume 2161 of Lecture Notes in Computer Science},
  year = 2001,
  pages = {109--120},
  key = {author}
}

@MastersThesis{mar00,
  author = {M. S. Neubert},
  title = {Algoritmos Distribu\'{\i}dos para a Constru\c{c}\~ao de Arquivos Invertidos},
  school = {Departamento de Ci\^encia da Computa\c{c}\~ao, Universidade Federal de Minas Gerais},
  year = 2000,
  month = {March},
  key = {author}
}

@Book{clrs01,
  author = {T. H. Cormen and C. E. Leiserson and R. L. Rivest and C. Stein},
  title = {Introduction to Algorithms},
  publisher = {MIT Press},
  year = {2001},
  edition = {second},
}

@Book{j91,
  author = {R. Jain},
  title = {The art of computer systems performance analysis: techniques for experimental design, measurement, simulation, and modeling},
  publisher = {John Wiley},
  year = {1991},
  edition = {first}
}

@Book{k73,
  author = {D. E. Knuth},
  title = {The Art of Computer Programming: Sorting and Searching},
  publisher = {Addison-Wesley},
  volume = {3},
  year = {1973},
  edition = {second},
}

@inproceedings{rp99,
  author = {R. Pagh},
  title = {Hash and Displace: Efficient Evaluation of Minimal Perfect Hash Functions},
  booktitle = {Workshop on Algorithms and Data Structures},
  pages = {49--54},
  year = 1999,
  url = {citeseer.nj.nec.com/pagh99hash.html},
  key = {author}
}

@inproceedings{hmwc93,
  author = {G. Havas and B.S. Majewski and N.C. Wormald and Z.J. Czech},
  title = {Graphs, Hypergraphs and Hashing},
  booktitle = {19th International Workshop on Graph-Theoretic Concepts in Computer Science},
  publisher = {Springer Lecture Notes in Computer Science vol. 790},
  pages = {153--165},
  year = 1993,
  key = {author}
}

@inproceedings{bkz05,
  author = {F.C. Botelho and Y. Kohayakawa and N. Ziviani},
  title = {A Practical Minimal Perfect Hashing Method},
  booktitle = {4th International Workshop on Efficient and Experimental Algorithms},
  publisher = {Springer Lecture Notes in Computer Science vol. 3503},
  pages = {488--500},
  month = {May},
  year = 2005,
  key = {author}
}

@Article{chm97,
  author = {Z.J. Czech and G. Havas and B.S. Majewski},
  title = {Perfect Hashing},
  journal = {Theoretical Computer Science},
  volume = {182},
  year = {1997},
  pages = {1--143},
  key = {author}
}

@article{chm92,
  author = {Z.J. Czech and G. Havas and B.S. Majewski},
  title = {An Optimal Algorithm for Generating Minimal Perfect Hash Functions},
  journal = {Information Processing Letters},
  volume = {43},
  number = {5},
  pages = {257--264},
  year = {1992},
  url = {citeseer.nj.nec.com/czech92optimal.html},
  key = {author}
}

@Article{mwhc96,
  author = {B.S. Majewski and N.C. Wormald and G. Havas and Z.J. Czech},
  title = {A family of perfect hashing methods},
  journal = {The Computer Journal},
  year = {1996},
  volume = {39},
  number = {6},
  pages = {547--554},
  key = {author}
}

@InProceedings{bv04,
  author = {P. Boldi and S. Vigna},
  title = {The WebGraph Framework I: Compression Techniques},
  booktitle = {13th International World Wide Web Conference},
  pages = {595--602},
  year = {2004}
}

@Book{z04,
  author = {N. Ziviani},
  title = {Projeto de Algoritmos com Implementa\c{c}\~oes em Pascal e C},
  publisher = {Pioneira Thompson},
  year = 2004,
  edition = {segunda edi\c{c}\~ao}
}

@Book{p85,
  author = {E. M.
Palmer}, + title = {Graphical Evolution: An Introduction to the Theory of Random Graphs}, + publisher = {John Wiley \& Sons}, + year = {1985}, + address = {New York} +} + +@Book{imb99, + author = {I.H. Witten and A. Moffat and T.C. Bell}, + title = {Managing Gigabytes: Compressing and Indexing Documents and Images}, + publisher = {Morgan Kaufmann Publishers}, + year = 1999, + edition = {second edition} +} +@Book{wfe68, + author = {W. Feller}, + title = { An Introduction to Probability Theory and Its Applications}, + publisher = {Wiley}, + year = 1968, + volume = 1, + optedition = {second edition} +} + + +@Article{fhcd92, + author = {E.A. Fox and L. S. Heath and Q. Chen and A.M. Daoud}, + title = {Practical Minimal Perfect Hash Functions For Large Databases}, + journal = {Communications of the ACM}, + year = {1992}, + volume = {35}, + number = {1}, + pages = {105--121} +} + + +@inproceedings{fch92, + author = {E.A. Fox and Q.F. Chen and L.S. Heath}, + title = {A Faster Algorithm for Constructing Minimal Perfect Hash Functions}, + booktitle = {Proceedings of the 15th Annual International ACM SIGIR Conference + on Research and Development in Information Retrieval}, + year = {1992}, + pages = {266-273}, +} + +@article{c80, + author = {R.J. Cichelli}, + title = {Minimal perfect hash functions made simple}, + journal = {Communications of the ACM}, + volume = {23}, + number = {1}, + year = {1980}, + issn = {0001-0782}, + pages = {17--19}, + doi = {http://doi.acm.org/10.1145/358808.358813}, + publisher = {ACM Press}, + } + + +@TechReport{fhc89, + author = {E.A. Fox and L.S. Heath and Q.F. Chen}, + title = {An $O(n\log n)$ algorithm for finding minimal perfect hash functions}, + institution = {Virginia Polytechnic Institute and State University}, + year = {1989}, + OPTkey = {}, + OPTtype = {}, + OPTnumber = {}, + address = {Blacksburg, VA}, + month = {April}, + OPTnote = {}, + OPTannote = {} +} + +@TechReport{bkz06t, + author = {F.C. Botelho and Y. Kohayakawa and N. 
Ziviani}, + title = {An Approach for Minimal Perfect Hash Functions in Very Large Databases}, + institution = {Department of Computer Science, Federal University of Minas Gerais}, + note = {Available at http://www.dcc.ufmg.br/\texttt{\~ }nivio/pub/technicalreports.html}, + year = {2006}, + OPTkey = {}, + OPTtype = {}, + number = {RT.DCC.003}, + address = {Belo Horizonte, MG, Brazil}, + month = {April}, + OPTannote = {} +} + +@inproceedings{fcdh90, + author = {E.A. Fox and Q.F. Chen and A.M. Daoud and L.S. Heath}, + title = {Order preserving minimal perfect hash functions and information retrieval}, + booktitle = {Proceedings of the 13th annual international ACM SIGIR conference on Research and development in information retrieval}, + year = {1990}, + isbn = {0-89791-408-2}, + pages = {279--311}, + location = {Brussels, Belgium}, + doi = {http://doi.acm.org/10.1145/96749.98233}, + publisher = {ACM Press}, + } + +@Article{fkp89, + author = {P. Flajolet and D. E. Knuth and B. Pittel}, + title = {The first cycles in an evolving graph}, + journal = {Discrete Math}, + year = {1989}, + volume = {75}, + pages = {167-215}, +} + +@Article{s77, + author = {R. Sprugnoli}, + title = {Perfect Hashing Functions: A Single Probe Retrieving + Method For Static Sets}, + journal = {Communications of the ACM}, + year = {1977}, + volume = {20}, + number = {11}, + pages = {841--850}, + month = {November}, +} + +@Article{j81, + author = {G. Jaeschke}, + title = {Reciprocal Hashing: A method For Generating Minimal Perfect + Hashing Functions}, + journal = {Communications of the ACM}, + year = {1981}, + volume = {24}, + number = {12}, + month = {December}, + pages = {829--833} +} + +@Article{c84, + author = {C. C. Chang}, + title = {The Study Of An Ordered Minimal Perfect Hashing Scheme}, + journal = {Communications of the ACM}, + year = {1984}, + volume = {27}, + number = {4}, + month = {December}, + pages = {384--387} +} + +@Article{c86, + author = {C. C. 
Chang}, + title = {Letter-Oriented Reciprocal Hashing Scheme}, + journal = {Inform. Sci.}, + year = {1986}, + volume = {27}, + pages = {243--255} +} + +@Article{cl86, + author = {C. C. Chang and R. C. T. Lee}, + title = {A Letter-Oriented Minimal Perfect Hashing Scheme}, + journal = {Computer Journal}, + year = {1986}, + volume = {29}, + number = {3}, + month = {June}, + pages = {277--281} +} + + +@Article{cc88, + author = {C. C. Chang and C. H. Chang}, + title = {An Ordered Minimal Perfect Hashing Scheme with Single Parameter}, + journal = {Inform. Process. Lett.}, + year = {1988}, + volume = {27}, + number = {2}, + month = {February}, + pages = {79--83} +} + +@Article{w90, + author = {V. G. Winters}, + title = {Minimal Perfect Hashing in Polynomial Time}, + journal = {BIT}, + year = {1990}, + volume = {30}, + number = {2}, + pages = {235--244} +} + +@Article{fcdh91, + author = {E. A. Fox and Q. F. Chen and A. M. Daoud and L. S. Heath}, + title = {Order Preserving Minimal Perfect Hash Functions and Information Retrieval}, + journal = {ACM Trans. Inform. Systems}, + year = {1991}, + volume = {9}, + number = {3}, + month = {July}, + pages = {281--308} +} + +@Article{fks84, + author = {M. L. Fredman and J. Koml\'os and E. Szemer\'edi}, + title = {Storing a sparse table with {O(1)} worst case access time}, + journal = {J. ACM}, + year = {1984}, + volume = {31}, + number = {3}, + month = {July}, + pages = {538--544} +} + +@Article{dhjs83, + author = {M. W. Du and T. M. Hsieh and K. F. Jea and D. W. Shieh}, + title = {The study of a new perfect hash scheme}, + journal = {IEEE Trans. Software Eng.}, + year = {1983}, + volume = {9}, + number = {3}, + month = {May}, + pages = {305--313} +} + +@Article{bt94, + author = {M. D. Brain and A. L. Tharp}, + title = {Using Tries to Eliminate Pattern Collisions in Perfect Hashing}, + journal = {IEEE Trans. 
on Knowledge and Data Eng.},
  year = {1994},
  volume = {6},
  number = {2},
  month = {April},
  pages = {239--247}
}

@Article{bt90,
  author = {M. D. Brain and A. L. Tharp},
  title = {Perfect hashing using sparse matrix packing},
  journal = {Inform. Systems},
  year = {1990},
  volume = {15},
  number = {3},
  OPTmonth = {April},
  pages = {281--290}
}

@Article{ckw93,
  author = {C. C. Chang and H. C. Kowng and T. C. Wu},
  title = {A refinement of a compression-oriented addressing scheme},
  journal = {BIT},
  year = {1993},
  volume = {33},
  number = {4},
  OPTmonth = {April},
  pages = {530--535}
}

@Article{cw91,
  author = {C. C. Chang and T. C. Wu},
  title = {A letter-oriented perfect hashing scheme based upon sparse table compression},
  journal = {Software -- Practice Experience},
  year = {1991},
  volume = {21},
  number = {1},
  month = {January},
  pages = {35--49}
}

@Article{ty79,
  author = {R. E. Tarjan and A. C. C. Yao},
  title = {Storing a sparse table},
  journal = {Comm. ACM},
  year = {1979},
  volume = {22},
  number = {11},
  month = {November},
  pages = {606--611}
}

@Article{yd85,
  author = {W. P. Yang and M. W. Du},
  title = {A backtracking method for constructing perfect hash functions from a set of mapping functions},
  journal = {BIT},
  year = {1985},
  volume = {25},
  number = {1},
  pages = {148--164}
}

@Article{s85,
  author = {T. J. Sager},
  title = {A polynomial time generator for minimal perfect hash functions},
  journal = {Commun. ACM},
  year = {1985},
  volume = {28},
  number = {5},
  month = {May},
  pages = {523--532}
}

@Article{cm93,
  author = {Z. J. Czech and B. S. Majewski},
  title = {A linear time algorithm for finding minimal perfect hash functions},
  journal = {The Computer Journal},
  year = {1993},
  volume = {36},
  number = {6},
  pages = {579--587}
}

@Article{gbs94,
  author = {R. Gupta and S. Bhaskar and S. Smolka},
  title = {On randomization in sequential and distributed algorithms},
  journal = {ACM Comput. Surveys},
  year = {1994},
  volume = {26},
  number = {1},
  month = {March},
  pages = {7--86}
}

@InProceedings{sb84,
  author = {C. Slot and P. van Emde Boas},
  title = {On tape versus core: an application of space efficient perfect hash functions to the invariance of space},
  booktitle = {Proc. 16th Ann. ACM Symp. on Theory of Computing -- STOC'84},
  address = {Washington},
  month = {May},
  year = {1984},
  pages = {391--400},
}

@InProceedings{wi90,
  author = {V. G. Winters},
  title = {Minimal perfect hashing for large sets of data},
  booktitle = {Internat. Conf. on Computing and Information -- ICCI'90},
  address = {Canada},
  month = {May},
  year = {1990},
  pages = {275--284},
}

@InProceedings{lr85,
  author = {P. Larson and M. V. Ramakrishna},
  title = {External perfect hashing},
  booktitle = {Proc. ACM SIGMOD Conf.},
  address = {Austin, TX},
  month = {June},
  year = {1985},
  pages = {190--199},
}

@Book{m84,
  author = {K. Mehlhorn},
  editor = {W. Brauer and G. Rozenberg and A. Salomaa},
  title = {Data Structures and Algorithms 1: Sorting and Searching},
  publisher = {Springer-Verlag},
  year = {1984},
}

@PhdThesis{c92,
  author = {Q. F. Chen},
  title = {An Object-Oriented Database System for Efficient Information Retrieval Applications},
  school = {Virginia Tech Dept. of Computer Science},
  year = {1992},
  month = {March}
}

@article{er59,
  AUTHOR = {Erd{\H{o}}s, P. and R{\'e}nyi, A.},
  TITLE = {On random graphs {I}},
  JOURNAL = {Pub. Math. Debrecen},
  VOLUME = {6},
  YEAR = {1959},
  PAGES = {290--297},
  MRCLASS = {05.00},
  MRNUMBER = {MR0120167 (22 \#10924)},
  MRREVIEWER = {A. Dvoretzky},
}

@article{erdos61,
  AUTHOR = {Erd{\H{o}}s, P. and R{\'e}nyi, A.},
  TITLE = {On the evolution of random graphs},
  JOURNAL = {Bull. Inst. Internat.
Statist.}, + VOLUME = 38, + YEAR = 1961, + PAGES = {343--347}, + MRCLASS = {05.40 (55.10)}, + MRNUMBER = {MR0148055 (26 \#5564)}, +} + +@article {er60, + AUTHOR = {Erd{\H{o}}s, P. and R{\'e}nyi, A.}, + TITLE = {On the evolution of random graphs}, + JOURNAL = {Magyar Tud. Akad. Mat. Kutat\'o Int. K\"ozl.}, + VOLUME = {5}, + YEAR = {1960}, + PAGES = {17--61}, + MRCLASS = {05.40}, + MRNUMBER = {MR0125031 (23 \#A2338)}, +MRREVIEWER = {J. Riordan}, +} + +@Article{er60:_Old, + author = {P. Erd{\H{o}}s and A. R\'enyi}, + title = {On the evolution of random graphs}, + journal = {Publications of the Mathematical Institute of the Hungarian + Academy of Sciences}, + year = {1960}, + volume = {56}, + pages = {17-61} +} + +@Article{er61, + author = {P. Erd{\H{o}}s and A. R\'enyi}, + title = {On the strength of connectedness of a random graph}, + journal = {Acta Mathematica Scientia Hungary}, + year = {1961}, + volume = {12}, + pages = {261-267} +} + + +@Article{bp04, + author = {B. Bollob\'as and O. Pikhurko}, + title = {Integer Sets with Prescribed Pairwise Differences Being Distinct}, + journal = {European Journal of Combinatorics}, + OPTkey = {}, + OPTvolume = {}, + OPTnumber = {}, + OPTpages = {}, + OPTmonth = {}, + note = {To Appear}, + OPTannote = {} +} + +@Article{pw04:_OLD, + author = {B. Pittel and N. C. Wormald}, + title = {Counting connected graphs inside-out}, + journal = {Journal of Combinatorial Theory}, + OPTkey = {}, + OPTvolume = {}, + OPTnumber = {}, + OPTpages = {}, + OPTmonth = {}, + note = {To Appear}, + OPTannote = {} +} + + +@Article{mr95, + author = {M. Molloy and B. Reed}, + title = {A critical point for random graphs with a given degree sequence}, + journal = {Random Structures and Algorithms}, + year = {1995}, + volume = {6}, + pages = {161-179} +} + +@TechReport{bmz04, + author = {F. C. Botelho and D. Menoti and N. Ziviani}, + title = {A New algorithm for constructing minimal perfect hash functions}, + institution = {Federal Univ. 
of Minas Gerais}, + year = {2004}, + OPTkey = {}, + OPTtype = {}, + number = {TR004}, + OPTaddress = {}, + OPTmonth = {}, + note = {(http://www.dcc.ufmg.br/\texttt{\~ }nivio/pub/technicalreports.html)}, + OPTannote = {} +} + +@Article{mr98, + author = {M. Molloy and B. Reed}, + title = {The size of the giant component of a random graph with a given degree sequence}, + journal = {Combinatorics, Probability and Computing}, + year = {1998}, + volume = {7}, + pages = {295-305} +} + +@misc{h98, + author = {D. Hawking}, + title = {Overview of TREC-7 Very Large Collection Track (Draft for Notebook)}, + url = {citeseer.ist.psu.edu/4991.html}, + year = {1998}} + +@book {jlr00, + AUTHOR = {Janson, S. and {\L}uczak, T. and Ruci{\'n}ski, A.}, + TITLE = {Random graphs}, + PUBLISHER = {Wiley-Inter.}, + YEAR = 2000, + PAGES = {xii+333}, + ISBN = {0-471-17541-2}, + MRCLASS = {05C80 (60C05 82B41)}, + MRNUMBER = {2001k:05180}, +MRREVIEWER = {Mark R. Jerrum}, +} + +@incollection {jlr90, + AUTHOR = {Janson, Svante and {\L}uczak, Tomasz and Ruci{\'n}ski, + Andrzej}, + TITLE = {An exponential bound for the probability of nonexistence of a + specified subgraph in a random graph}, + BOOKTITLE = {Random graphs '87 (Pozna\'n, 1987)}, + PAGES = {73--87}, + PUBLISHER = {Wiley}, + ADDRESS = {Chichester}, + YEAR = 1990, + MRCLASS = {05C80 (60C05)}, + MRNUMBER = {91m:05168}, +MRREVIEWER = {J. Spencer}, +} + +@book {b01, + AUTHOR = {Bollob{\'a}s, B.}, + TITLE = {Random graphs}, + SERIES = {Cambridge Studies in Advanced Mathematics}, + VOLUME = 73, + EDITION = {Second}, + PUBLISHER = {Cambridge University Press}, + ADDRESS = {Cambridge}, + YEAR = 2001, + PAGES = {xviii+498}, + ISBN = {0-521-80920-7; 0-521-79722-5}, + MRCLASS = {05C80 (60C05)}, + MRNUMBER = {MR1864966 (2002j:05132)}, +} + +@article {pw04, + AUTHOR = {Pittel, Boris and Wormald, Nicholas C.}, + TITLE = {Counting connected graphs inside-out}, + JOURNAL = {J. Combin. Theory Ser. B}, + FJOURNAL = {Journal of Combinatorial Theory. 
Series B},
  VOLUME = 93,
  YEAR = 2005,
  NUMBER = 2,
  PAGES = {127--172},
  ISSN = {0095-8956},
  CODEN = {JCBTB8},
  MRCLASS = {05C30 (05A16 05C40 05C80)},
  MRNUMBER = {MR2117934 (2005m:05117)},
  MRREVIEWER = {Edward A. Bender},
}
diff --git a/vldb07/relatedwork.tex b/vldb07/relatedwork.tex
new file mode 100755
index 0000000..7693002
--- /dev/null
+++ b/vldb07/relatedwork.tex
@@ -0,0 +1,112 @@
% Time-stamp:
\vspace{-3mm}
\section{Related work}
\label{sec:relatedprevious-work}
\vspace{-2mm}

% Optimal speed for hashing means that each key from the key set $S$
% will map to an unique location in the hash table, avoiding time wasted
% in resolving collisions. That is achieved with a MPHF and
% because of that many algorithms for constructing static
% and dynamic MPHFs, when static or dynamic sets are involved,
% were developed. Our focus has been on static MPHFs, since
% in many applications the key sets change slowly, if at all~\cite{s05}.

\enlargethispage{2\baselineskip}
Czech, Havas and Majewski~\cite{chm97} provide a
comprehensive survey of the most important theoretical and practical results
on perfect hashing.
In this section we review some of the most important results.
%We also present more recent algorithms that share some features with
%the one presented hereinafter.

Fredman, Koml\'os and Szemer\'edi~\cite{fks84} showed that it is possible to
construct space efficient perfect hash functions that can be evaluated in
constant time with table sizes that are linear in the number of keys:
$m=O(n)$. In their model of computation, an element of the universe~$U$ fits
into one machine word, and arithmetic operations and memory accesses have unit
cost. Randomized algorithms in the FKS model can construct a perfect hash
function in expected time~$O(n)$:
this is the case of our algorithm and the works in~\cite{chm92,p99}.
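The FKS two-level scheme can be sketched as follows (a simplified illustration of the technique under the unit-cost model above, not the authors' code; the prime `P` and the linear hash family are our assumptions). The first level splits the $n$ keys into $n$ buckets; a bucket with $s$ keys gets a second-level table of size $s^2$, whose hash function is redrawn until it is collision-free on that bucket:

```python
import random

P = 2_147_483_647  # a prime larger than every key used below (assumption)

def _h(ab, x, m):
    """Hash from the family ((a*x + b) mod P) mod m."""
    a, b = ab
    return ((a * x + b) % P) % m

def fks_build(keys):
    """Two-level FKS-style perfect hashing for a static set of integer keys."""
    n = len(keys)
    top = (random.randrange(1, P), random.randrange(P))
    buckets = [[] for _ in range(n)]
    for k in keys:                          # level one: split into n buckets
        buckets[_h(top, k, n)].append(k)
    tables = []
    for bucket in buckets:
        m = max(1, len(bucket) ** 2)        # quadratic space per bucket
        while True:                         # redraw until injective on bucket
            ab = (random.randrange(1, P), random.randrange(P))
            if len({_h(ab, k, m) for k in bucket}) == len(bucket):
                tables.append((ab, m))
                break
    return top, tables

def fks_address(table, key):
    """O(1) evaluation: returns a (bucket, slot) pair, unique per key."""
    top, tables = table
    i = _h(top, key, len(tables))
    ab, m = tables[i]
    return i, _h(ab, key, m)
```

A full FKS construction also retries the first-level function until the total second-level space $\sum_i s_i^2$ is $O(n)$; the sketch omits that check for brevity.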
Mehlhorn~\cite{m84} showed
that at least $\Omega((1/\ln 2)n + \ln\ln u)$ bits are
required to represent a MPHF (i.e., at least 1.4427 bits per
key must be stored).
To the best of our knowledge, our algorithm
is the first one capable of generating MPHFs for sets on the order
of a billion keys, and the generated functions
require less than 9 bits per key to be stored.
This is one order of magnitude larger than the largest
key set for which a MPHF had been obtained in the literature~\cite{bkz05}.
%which is close to the lower bound presented in~\cite{m84}.

Some work on minimal perfect hashing has been done under the assumption that
the algorithm can pick and store truly random functions~\cite{bkz05,chm92,p99}.
Since the space requirements of truly random functions make them unsuitable for
implementation, one has to settle for pseudo-random functions in practice.
Empirical studies show that limited randomness properties are often as good as
total randomness.
We could verify that phenomenon in our experiments by using the universal hash
function proposed by Jenkins~\cite{j97}, which is
time efficient at retrieval time and requires just one integer to be used as a
random seed (the function is completely determined by the seed).
% Os trabalhos~\cite{asw00,swz00} apresentam algoritmos para construir
% FHPs e FHPMs deterministicamente.
% As fun\c{c}\~oes geradas necessitam de $O(n \log(n) + \log(\log(u)))$ bits para serem descritas.
% A complexidade de caso m\'edio dos algoritmos para gerar as fun\c{c}\~oes \'e
% $O(n\log(n) \log( \log (u)))$ e a de pior caso \'e $O(n^3\log(n) \log(\log(u)))$.
% A complexidade de avalia\c{c}\~ao das fun\c{c}\~oes \'e $O(\log(n) + \log(\log(u)))$.
+% Assim, os algoritmos n\~ao geram fun\c{c}\~oes que podem ser avaliadas com complexidade +% de tempo $O(1)$, est\~ao distantes a um fator de $\log n$ da complexidade \'otima para descrever +% FHPs e FHPMs (Mehlhorn mostra em~\cite{m84} +% que para armazenar uma FHP s\~ao necess\'arios no m\'{\i}nimo +% $\Omega(n^2/(2\ln 2) m + \log\log u)$ bits), e n\~ao geram as +% fun\c{c}\~oes com complexidade linear. +% Al\'em disso, o universo $U$ das chaves \'e restrito a n\'umeros inteiros, o que pode +% limitar a utiliza\c{c}\~ao na pr\'atica. + +Pagh~\cite{p99} proposed a family of randomized algorithms for +constructing MPHFs +where the form of the resulting function is $h(x) = (f(x) + d[g(x)]) \bmod n$, +where $f$ and $g$ are universal hash functions and $d$ is a set of +displacement values to resolve collisions that are caused by the function $f$. +Pagh identified a set of conditions concerning $f$ and $g$ and showed +that if these conditions are satisfied, then a minimal perfect hash +function can be computed in expected time $O(n)$ and stored in +$(2+\epsilon)n\log_2n$ bits. + +Dietzfelbinger and Hagerup~\cite{dh01} improved~\cite{p99}, +reducing from $(2+\epsilon)n\log_2n$ to $(1+\epsilon)n\log_2n$ the number of bits +required to store the function, but in their approach~$f$ and~$g$ must +be chosen from a class +of hash functions that meet additional requirements. +%Differently from the works in~\cite{dh01, p99}, our algorithm generates a MPHF +%$h$ in expected linear time and $h$ can be stored in $O(n)$ bits (9 bits per key). + +% Galli, Seybold e Simon~\cite{gss01} propuseram um algoritmo r\^andomico +% que gera FHPMs da mesma forma das geradas pelos algoritmos de Pagh~\cite{p99} +% e, Dietzfelbinger e Hagerup~\cite{dh01}. 
No entanto, eles definiram a forma das
% fun\c{c}\~oes $f(k) = h_c(k) \bmod n$ e $g(k) = \lfloor h_c(k)/n \rfloor$ para obter em tempo esperado $O(n)$ uma fun\c{c}\~ao que pode ser descrita em $O(n\log n)$ bits, onde
% $h_c(k) = (ck \bmod p) \bmod n^2$, $1 \leq c \leq p-1$ e $p$ um primo maior do que $u$.
%Our algorithm is the first one capable of generating MPHFs for sets in the order of
%billion of keys. It happens because we do not need to keep into main memory
%at generation time complex data structures as a graph, lists and so on. We just need to maintain
%a small vector that occupies around 8MB for a set of 1 billion keys.

Fox et al.~\cite{fch92,fhcd92} studied MPHFs
%that also share features with the ones generated by our algorithm.
that bring the storage requirements down to between 2 and 4 bits per key.
However, it is shown in~\cite[Section 6.7]{chm97} that their algorithms have exponential
running times, and our implementation of their algorithm could not scale
beyond 11 million keys.

Our previous work~\cite{bkz05} improves the one by Czech, Havas and Majewski~\cite{chm92}.
We obtained more compact functions in less time. Although
the algorithm in~\cite{bkz05} is the fastest algorithm
we know of, the resulting functions are stored in $O(n\log n)$ bits and
one needs to keep in main memory at generation time a random graph of $n$ edges
and $cn$ vertices,
where $c\in[0.93,1.15]$. Using the well known divide-and-conquer approach,
we use that algorithm as a building block for the new one, where the
resulting functions are stored in $O(n)$ bits.
diff --git a/vldb07/searching.tex b/vldb07/searching.tex
new file mode 100755
index 0000000..8feb6f1
--- /dev/null
+++ b/vldb07/searching.tex
@@ -0,0 +1,155 @@
%% Nivio: 22/jan/06
% Time-stamp:
\vspace{-7mm}
\subsection{Searching step}
\label{sec:searching}

\enlargethispage{2\baselineskip}
The searching step is responsible for generating a MPHF for each
bucket.
Figure~\ref{fig:searchingstep} presents the searching step algorithm.
\vspace{-2mm}
\begin{figure}[h]
%\centering
\hrule
\hrule
\vspace{2mm}
\begin{tabbing}
aa\=type booleanx \== (false, true); \kill
\> $\blacktriangleright$ Let $H$ be a minimum heap of size $N$, where the \\
\> ~~ order relation in $H$ is given by Eq.~(\ref{eq:bucketindex}), that is, the\\
\> ~~ remove operation removes the item with smallest $i$\\
\> $1.$ {\bf for} $j = 1$ {\bf to} $N$ {\bf do} \{ Heap construction \}\\
\> ~~ $1.1$ Read key $k$ from File $j$ on disk\\
\> ~~ $1.2$ Insert $(i, j, k)$ in $H$ \\
\> $2.$ {\bf for} $i = 0$ {\bf to} $\lceil n/b \rceil - 1$ {\bf do} \\
\> ~~ $2.1$ Read bucket $i$ from disk driven by heap $H$ \\
\> ~~ $2.2$ Generate a MPHF for bucket $i$ \\
\> ~~ $2.3$ Write the description of MPHF$_i$ to the disk
\end{tabbing}
\vspace{-1mm}
\hrule
\hrule
\caption{Searching step}
\label{fig:searchingstep}
\vspace{-4mm}
\end{figure}

Statement 1 of Figure~\ref{fig:searchingstep} inserts one key from each file
into a minimum heap $H$ of size $N$.
The order relation in $H$ is the bucket address $i$ given by
Eq.~(\ref{eq:bucketindex}).

%\enlargethispage{-\baselineskip}
Statement 2 has two important steps.
In statement 2.1, a bucket is read from disk,
as described below.
%in Section~\ref{sec:readingbucket}.
In statement 2.2, a MPHF is generated for each bucket $i$, as described
in the following.
%in Section~\ref{sec:mphfbucket}.
The description of MPHF$_i$ is a vector $g_i$ of 8-bit integers.
Finally, statement 2.3 writes the description $g_i$ of MPHF$_i$ to disk.

\vspace{-3mm}
\label{sec:readingbucket}
\subsubsection{Reading a bucket from disk.}

In this section we present the refinement of statement 2.1 of
Figure~\ref{fig:searchingstep}.
The algorithm to read bucket $i$ from disk is presented
in Figure~\ref{fig:readingbucket}.
+ +\begin{figure}[h] +\hrule +\hrule +\vspace{2mm} +\begin{tabbing} +aa\=type booleanx \== (false, true); \kill +\> $1.$ {\bf while} bucket $i$ is not full {\bf do} \\ +\> ~~ $1.1$ Remove $(i, j, k)$ from $H$\\ +\> ~~ $1.2$ Insert $k$ into bucket $i$ \\ +\> ~~ $1.3$ Read sequentially all keys $k$ from File $j$ that have \\ +\> ~~~~~~~ the same $i$ and insert them into bucket $i$ \\ +\> ~~ $1.4$ Insert the triple $(i, j, x)$ in $H$, where $x$ is the first \\ +\> ~~~~~~~ key read from File $j$ that does not have the \\ +\> ~~~~~~~ same bucket index $i$ +\end{tabbing} +\hrule +\hrule +\vspace{-1.0mm} +\caption{Reading a bucket} +\vspace{-4.0mm} +\label{fig:readingbucket} +\end{figure} + +Bucket $i$ is distributed among many files and the heap $H$ is used to drive a +multiway merge operation. +In Figure~\ref{fig:readingbucket}, statement 1.1 extracts and removes triple +$(i, j, k)$ from $H$, where $i$ is a minimum value in $H$. +Statement 1.2 inserts key $k$ in bucket $i$. +Notice that the $k$ in the triple $(i, j, k)$ is in fact a pointer to +the first byte of the key that is kept in contiguous positions of an array of characters +(this array containing the keys is initialized during the heap construction +in statement 1 of Figure~\ref{fig:searchingstep}). +Statement 1.3 performs a seek operation in File $j$ on disk for the first +read operation and reads sequentially all keys $k$ that have the same $i$ +%(obtained from Eq.~(\ref{eq:bucketindex})) +and inserts them all in bucket $i$. +Finally, statement 1.4 inserts in $H$ the triple $(i, j, x)$, +where $x$ is the first key read from File $j$ (in statement 1.3) +that does not have the same bucket address as the previous keys. + +The number of seek operations on disk performed in statement 1.3 is discussed +in Section~\ref{sec:linearcomplexity}, +where we present a buffering technique that brings down +the time spent with seeks. 
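The heap-driven multiway merge described above can be sketched in a few lines. This is our own simplified illustration, not the paper's implementation: each "file" is an in-memory list of (bucket address, key) pairs already sorted by bucket address, and keys are re-pushed into the heap one at a time rather than read in sequential runs as the buffering technique does:

```python
import heapq

def merge_buckets(files, num_buckets):
    """Yield (i, bucket) pairs by merging N bucket-sorted files via a heap."""
    iters = [iter(f) for f in files]
    heap = []
    for j, it in enumerate(iters):          # one key per file (heap construction)
        entry = next(it, None)
        if entry is not None:
            i, k = entry
            heapq.heappush(heap, (i, j, k))
    for i in range(num_buckets):
        bucket = []
        while heap and heap[0][0] == i:     # read bucket i driven by the heap
            _, j, k = heapq.heappop(heap)
            bucket.append(k)
            nxt = next(iters[j], None)      # fetch the next key from File j
            if nxt is not None:
                heapq.heappush(heap, (nxt[0], j, nxt[1]))
        yield i, bucket
```

Because every file is sorted by bucket address, the heap always exposes the smallest outstanding bucket index, so each bucket is assembled completely before the next one starts.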
+
+\vspace{-2mm}
+\enlargethispage{2\baselineskip}
+\subsubsection{Generating a MPHF for each bucket.} \label{sec:mphfbucket}
+
+To the best of our knowledge, the algorithm we designed in
+our previous work~\cite{bkz05} is the fastest published algorithm for
+constructing MPHFs.
+That is why we use that algorithm as a building block for the
+algorithm presented here.
+
+%\enlargethispage{-\baselineskip}
+Our previous algorithm is a three-step internal memory based algorithm
+that produces a MPHF based on random graphs.
+For a set of $n$ keys, the algorithm outputs the resulting MPHF in expected time $O(n)$.
+For a given bucket $i$, $0 \leq i < \lceil n/b \rceil$, the corresponding MPHF$_i$
+has the following form:
+\begin{eqnarray}
+  \mathrm{MPHF}_i(k) &=& g_i[a] + g_i[b] \label{eq:mphfi}
+\end{eqnarray}
+where $a = h_{i1}(k) \bmod t$, $b = h_{i2}(k) \bmod t$ and
+$t = c\times \mathit{size}[i]$. The functions
+$h_{i1}(k)$ and $h_{i2}(k)$ are the same universal hash functions proposed by Jenkins~\cite{j97}
+that were used in the partitioning step described in Section~\ref{sec:partitioning-keys}.
+
+To generate the function above, the algorithm builds simple random graphs
+$G_i = (V_i, E_i)$ with~$|V_i|=t=c\times\mathit{size}[i]$ and $|E_i|=\mathit{size}[i]$, where $c \in [0.93, 1.15]$.
+To generate a simple random graph with high
+probability\footnote{We use the term `with high probability'
+to mean `with probability tending to~$1$ as~$n\to\infty$'.}, two vertices $a$ and $b$ are
+computed for each key $k$ in bucket $i$.
+Thus, each bucket $i$ has a corresponding graph~$G_i=(V_i,E_i)$, where $V_i=\{0,1,
+\ldots,t-1\}$ and $E_i=\big\{\{a,b\}:k \in \mathrm{bucket}\: i\big\}$.
+To obtain a simple graph,
+the algorithm repeatedly selects $h_{i1}$ and $h_{i2}$ from a family of universal hash functions
+until the corresponding graph is simple.
+The probability of getting a simple graph is $p=e^{-1/c^2}$.
+For $c=1$, this probability is $p \simeq 0.368$, and the expected number of +iterations to obtain a simple graph is~$1/p \simeq 2.72$. + +The construction of MPHF$_i$ ends with a computation of a suitable labelling of the vertices +of~$G_i$. The labelling is stored into vector $g_i$. +We choose~$g_i[v]$ for each~$v\in V_i$ in such +a way that Eq.~(\ref{eq:mphfi}) is a MPHF for bucket $i$. +In order to get the values of each entry of $g_i$ we first +run a breadth-first search on the 2-\textit{core} of $G_i$, i.e., the maximal subgraph +of~$G_i$ with minimal degree at least~$2$ (see, e.g., \cite{b01,jlr00,pw04}) and +a depth-first search on the acyclic part of $G_i$ (see \cite{bkz05} for details). + diff --git a/vldb07/svglov2.clo b/vldb07/svglov2.clo new file mode 100644 index 0000000..d98306e --- /dev/null +++ b/vldb07/svglov2.clo @@ -0,0 +1,77 @@ +% SVJour2 DOCUMENT CLASS OPTION SVGLOV2 -- for standardised journals +% +% This is an enhancement for the LaTeX +% SVJour2 document class for Springer journals +% +%% +%% +%% \CharacterTable +%% {Upper-case \A\B\C\D\E\F\G\H\I\J\K\L\M\N\O\P\Q\R\S\T\U\V\W\X\Y\Z +%% Lower-case \a\b\c\d\e\f\g\h\i\j\k\l\m\n\o\p\q\r\s\t\u\v\w\x\y\z +%% Digits \0\1\2\3\4\5\6\7\8\9 +%% Exclamation \! Double quote \" Hash (number) \# +%% Dollar \$ Percent \% Ampersand \& +%% Acute accent \' Left paren \( Right paren \) +%% Asterisk \* Plus \+ Comma \, +%% Minus \- Point \. Solidus \/ +%% Colon \: Semicolon \; Less than \< +%% Equals \= Greater than \> Question mark \? 
+%% Commercial at \@ Left bracket \[ Backslash \\ +%% Right bracket \] Circumflex \^ Underscore \_ +%% Grave accent \` Left brace \{ Vertical bar \| +%% Right brace \} Tilde \~} +\ProvidesFile{svglov2.clo} + [2004/10/25 v2.1 + style option for standardised journals] +\typeout{SVJour Class option: svglov2.clo for standardised journals} +\def\validfor{svjour2} +\ExecuteOptions{final,10pt,runningheads} +% No size changing allowed, hence a copy of size10.clo is included +\renewcommand\normalsize{% + \@setfontsize\normalsize{10.2pt}{4mm}% + \abovedisplayskip=3 mm plus6pt minus 4pt + \belowdisplayskip=3 mm plus6pt minus 4pt + \abovedisplayshortskip=0.0 mm plus6pt + \belowdisplayshortskip=2 mm plus4pt minus 4pt + \let\@listi\@listI} +\normalsize +\newcommand\small{% + \@setfontsize\small{8.7pt}{3.25mm}% + \abovedisplayskip 8.5\p@ \@plus3\p@ \@minus4\p@ + \abovedisplayshortskip \z@ \@plus2\p@ + \belowdisplayshortskip 4\p@ \@plus2\p@ \@minus2\p@ + \def\@listi{\leftmargin\leftmargini + \parsep 0\p@ \@plus1\p@ \@minus\p@ + \topsep 4\p@ \@plus2\p@ \@minus4\p@ + \itemsep0\p@}% + \belowdisplayskip \abovedisplayskip +} +\let\footnotesize\small +\newcommand\scriptsize{\@setfontsize\scriptsize\@viipt\@viiipt} +\newcommand\tiny{\@setfontsize\tiny\@vpt\@vipt} +\newcommand\large{\@setfontsize\large\@xiipt{14pt}} +\newcommand\Large{\@setfontsize\Large\@xivpt{16dd}} +\newcommand\LARGE{\@setfontsize\LARGE\@xviipt{17dd}} +\newcommand\huge{\@setfontsize\huge\@xxpt{25}} +\newcommand\Huge{\@setfontsize\Huge\@xxvpt{30}} +% +%ALT% \def\runheadhook{\rlap{\smash{\lower5pt\hbox to\textwidth{\hrulefill}}}} +\def\runheadhook{\rlap{\smash{\lower11pt\hbox to\textwidth{\hrulefill}}}} +\AtEndOfClass{\advance\headsep by5pt} +\if@twocolumn +\setlength{\textwidth}{17.6cm} +\setlength{\textheight}{230mm} +\AtEndOfClass{\setlength\columnsep{4mm}} +\else +\setlength{\textwidth}{11.7cm} +\setlength{\textheight}{517.5dd} % 19.46cm +\fi +% +\AtBeginDocument{% +\@ifundefined{@journalname} + {\typeout{Unknown 
journal: specify \string\journalname\string{% +\string} in preambel^^J}}{}} +% +\endinput +%% +%% End of file `svglov2.clo'. diff --git a/vldb07/svjour2.cls b/vldb07/svjour2.cls new file mode 100644 index 0000000..56d9216 --- /dev/null +++ b/vldb07/svjour2.cls @@ -0,0 +1,1419 @@ +% SVJour2 DOCUMENT CLASS -- version 2.8 for LaTeX2e +% +% LaTeX document class for Springer journals +% +%% +%% +%% \CharacterTable +%% {Upper-case \A\B\C\D\E\F\G\H\I\J\K\L\M\N\O\P\Q\R\S\T\U\V\W\X\Y\Z +%% Lower-case \a\b\c\d\e\f\g\h\i\j\k\l\m\n\o\p\q\r\s\t\u\v\w\x\y\z +%% Digits \0\1\2\3\4\5\6\7\8\9 +%% Exclamation \! Double quote \" Hash (number) \# +%% Dollar \$ Percent \% Ampersand \& +%% Acute accent \' Left paren \( Right paren \) +%% Asterisk \* Plus \+ Comma \, +%% Minus \- Point \. Solidus \/ +%% Colon \: Semicolon \; Less than \< +%% Equals \= Greater than \> Question mark \? +%% Commercial at \@ Left bracket \[ Backslash \\ +%% Right bracket \] Circumflex \^ Underscore \_ +%% Grave accent \` Left brace \{ Vertical bar \| +%% Right brace \} Tilde \~} +\NeedsTeXFormat{LaTeX2e}[1995/12/01] +\ProvidesClass{svjour2}[2005/08/29 v2.8 +^^JLaTeX document class for Springer journals] +\newcommand\@ptsize{} +\newif\if@restonecol +\newif\if@titlepage +\@titlepagefalse +\DeclareOption{a4paper} + {\setlength\paperheight {297mm}% + \setlength\paperwidth {210mm}} +\DeclareOption{10pt}{\renewcommand\@ptsize{0}} +\DeclareOption{twoside}{\@twosidetrue \@mparswitchtrue} +\DeclareOption{draft}{\setlength\overfullrule{5pt}} +\DeclareOption{final}{\setlength\overfullrule{0pt}} +\DeclareOption{fleqn}{\input{fleqn.clo}\AtBeginDocument{\mathindent\z@}} +\DeclareOption{twocolumn}{\@twocolumntrue\ExecuteOptions{fleqn}} +\newif\if@avier\@avierfalse +\DeclareOption{onecollarge}{\@aviertrue} +\let\if@mathematic\iftrue +\let\if@numbook\iffalse +\DeclareOption{numbook}{\let\if@envcntsect\iftrue + \AtEndOfPackage{% + \renewcommand\thefigure{\thesection.\@arabic\c@figure}% + 
\renewcommand\thetable{\thesection.\@arabic\c@table}% + \renewcommand\theequation{\thesection.\@arabic\c@equation}% + \@addtoreset{figure}{section}% + \@addtoreset{table}{section}% + \@addtoreset{equation}{section}% + }% +} +\DeclareOption{openbib}{% + \AtEndOfPackage{% + \renewcommand\@openbib@code{% + \advance\leftmargin\bibindent + \itemindent -\bibindent + \listparindent \itemindent + \parsep \z@ + }% + \renewcommand\newblock{\par}}% +} +\DeclareOption{natbib}{% +\AtEndOfClass{\RequirePackage{natbib}% +% Changing some parameters of NATBIB +\setlength{\bibhang}{\parindent}% +%\setlength{\bibsep}{0mm}% +\let\bibfont=\small +\def\@biblabel#1{#1.}% +\newcommand{\etal}{et al.}% +\bibpunct{(}{)}{;}{a}{}{,}}} +% +\let\if@runhead\iffalse +\DeclareOption{runningheads}{\let\if@runhead\iftrue} +\let\if@smartrunh\iffalse +\DeclareOption{smartrunhead}{\let\if@smartrunh\iftrue} +\DeclareOption{nosmartrunhead}{\let\if@smartrunh\iffalse} +\let\if@envcntreset\iffalse +\DeclareOption{envcountreset}{\let\if@envcntreset\iftrue} +\let\if@envcntsame\iffalse +\DeclareOption{envcountsame}{\let\if@envcntsame\iftrue} +\let\if@envcntsect\iffalse +\DeclareOption{envcountsect}{\let\if@envcntsect\iftrue} +\let\if@referee\iffalse +\DeclareOption{referee}{\let\if@referee\iftrue} +\def\makereferee{\def\baselinestretch{2}} +\let\if@instindent\iffalse +\DeclareOption{instindent}{\let\if@instindent\iftrue} +\let\if@smartand\iffalse +\DeclareOption{smartand}{\let\if@smartand\iftrue} +\let\if@spthms\iftrue +\DeclareOption{nospthms}{\let\if@spthms\iffalse} +% +% language and babel dependencies +\DeclareOption{deutsch}{\def\switcht@@therlang{\switcht@deutsch}% +\gdef\svlanginfo{\typeout{Man spricht deutsch.}\global\let\svlanginfo\relax}} +\DeclareOption{francais}{\def\switcht@@therlang{\switcht@francais}% +\gdef\svlanginfo{\typeout{On parle francais.}\global\let\svlanginfo\relax}} +\let\switcht@@therlang\relax +\let\svlanginfo\relax +% +\AtBeginDocument{\@ifpackageloaded{babel}{% 
+\@ifundefined{extrasenglish}{}{\addto\extrasenglish{\switcht@albion}}% +\@ifundefined{extrasUKenglish}{}{\addto\extrasUKenglish{\switcht@albion}}% +\@ifundefined{extrasfrenchb}{}{\addto\extrasfrenchb{\switcht@francais}}% +\@ifundefined{extrasgerman}{}{\addto\extrasgerman{\switcht@deutsch}}% +\@ifundefined{extrasngerman}{}{\addto\extrasngerman{\switcht@deutsch}}% +}{\switcht@@therlang}% +} +% +\def\ClassInfoNoLine#1#2{% + \ClassInfo{#1}{#2\@gobble}% +} +\let\journalopt\@empty +\DeclareOption*{% +\InputIfFileExists{sv\CurrentOption.clo}{% +\global\let\journalopt\CurrentOption}{% +\ClassWarning{Springer-SVJour2}{Specified option or subpackage +"\CurrentOption" not found -}\OptionNotUsed}} +\ExecuteOptions{a4paper,twoside,10pt,instindent} +\ProcessOptions +% +\ifx\journalopt\@empty\relax +\ClassInfoNoLine{Springer-SVJour2}{extra/valid Springer sub-package (-> *.clo) +\MessageBreak not found in option list of \string\documentclass +\MessageBreak - autoactivating "global" style}{} +\input{svglov2.clo} +\else +\@ifundefined{validfor}{% +\ClassError{Springer-SVJour2}{Possible option clash for sub-package +\MessageBreak "sv\journalopt.clo" - option file not valid +\MessageBreak for this class}{Perhaps you used an option of the old +Springer class SVJour!} +}{} +\fi +% +\if@smartrunh\AtEndDocument{\islastpageeven\getlastpagenumber}\fi +% +\newcommand{\twocoltest}[2]{\if@twocolumn\def\@gtempa{#2}\else\def\@gtempa{#1}\fi +\@gtempa\makeatother} +\newcommand{\columncase}{\makeatletter\twocoltest} +% +\DeclareMathSymbol{\Gamma}{\mathalpha}{letters}{"00} +\DeclareMathSymbol{\Delta}{\mathalpha}{letters}{"01} +\DeclareMathSymbol{\Theta}{\mathalpha}{letters}{"02} +\DeclareMathSymbol{\Lambda}{\mathalpha}{letters}{"03} +\DeclareMathSymbol{\Xi}{\mathalpha}{letters}{"04} +\DeclareMathSymbol{\Pi}{\mathalpha}{letters}{"05} +\DeclareMathSymbol{\Sigma}{\mathalpha}{letters}{"06} +\DeclareMathSymbol{\Upsilon}{\mathalpha}{letters}{"07} +\DeclareMathSymbol{\Phi}{\mathalpha}{letters}{"08} 
+\DeclareMathSymbol{\Psi}{\mathalpha}{letters}{"09} +\DeclareMathSymbol{\Omega}{\mathalpha}{letters}{"0A} +% +\setlength\parindent{15\p@} +\setlength\smallskipamount{3\p@ \@plus 1\p@ \@minus 1\p@} +\setlength\medskipamount{6\p@ \@plus 2\p@ \@minus 2\p@} +\setlength\bigskipamount{12\p@ \@plus 4\p@ \@minus 4\p@} +\setlength\headheight{12\p@} +\setlength\headsep {16.74dd} +\setlength\topskip {10\p@} +\setlength\footskip{30\p@} +\setlength\maxdepth{.5\topskip} +% +\@settopoint\textwidth +\setlength\marginparsep {10\p@} +\setlength\marginparpush{5\p@} +\setlength\topmargin{-10pt} +\if@twocolumn + \setlength\oddsidemargin {-30\p@} + \setlength\evensidemargin{-30\p@} +\else + \setlength\oddsidemargin {\z@} + \setlength\evensidemargin{\z@} +\fi +\setlength\marginparwidth {48\p@} +\setlength\footnotesep{8\p@} +\setlength{\skip\footins}{9\p@ \@plus 4\p@ \@minus 2\p@} +\setlength\floatsep {12\p@ \@plus 2\p@ \@minus 2\p@} +\setlength\textfloatsep{20\p@ \@plus 2\p@ \@minus 4\p@} +\setlength\intextsep {20\p@ \@plus 2\p@ \@minus 2\p@} +\setlength\dblfloatsep {12\p@ \@plus 2\p@ \@minus 2\p@} +\setlength\dbltextfloatsep{20\p@ \@plus 2\p@ \@minus 4\p@} +\setlength\@fptop{0\p@} +\setlength\@fpsep{12\p@ \@plus 2\p@ \@minus 2\p@} +\setlength\@fpbot{0\p@ \@plus 1fil} +\setlength\@dblfptop{0\p@} +\setlength\@dblfpsep{12\p@ \@plus 2\p@ \@minus 2\p@} +\setlength\@dblfpbot{0\p@ \@plus 1fil} +\setlength\partopsep{2\p@ \@plus 1\p@ \@minus 1\p@} +\def\@listi{\leftmargin\leftmargini + \parsep \z@ + \topsep 6\p@ \@plus2\p@ \@minus4\p@ + \itemsep\parsep} +\let\@listI\@listi +\@listi +\def\@listii {\leftmargin\leftmarginii + \labelwidth\leftmarginii + \advance\labelwidth-\labelsep + \topsep \z@ + \parsep \topsep + \itemsep \parsep} +\def\@listiii{\leftmargin\leftmarginiii + \labelwidth\leftmarginiii + \advance\labelwidth-\labelsep + \topsep \z@ + \parsep \topsep + \itemsep \parsep} +\def\@listiv {\leftmargin\leftmarginiv + \labelwidth\leftmarginiv + \advance\labelwidth-\labelsep} +\def\@listv 
{\leftmargin\leftmarginv + \labelwidth\leftmarginv + \advance\labelwidth-\labelsep} +\def\@listvi {\leftmargin\leftmarginvi + \labelwidth\leftmarginvi + \advance\labelwidth-\labelsep} +% +\setlength\lineskip{1\p@} +\setlength\normallineskip{1\p@} +\renewcommand\baselinestretch{} +\setlength\parskip{0\p@ \@plus \p@} +\@lowpenalty 51 +\@medpenalty 151 +\@highpenalty 301 +\setcounter{topnumber}{4} +\renewcommand\topfraction{.9} +\setcounter{bottomnumber}{2} +\renewcommand\bottomfraction{.7} +\setcounter{totalnumber}{6} +\renewcommand\textfraction{.1} +\renewcommand\floatpagefraction{.85} +\setcounter{dbltopnumber}{3} +\renewcommand\dbltopfraction{.85} +\renewcommand\dblfloatpagefraction{.85} +\def\ps@headings{% + \let\@oddfoot\@empty\let\@evenfoot\@empty + \def\@evenhead{\small\csname runheadhook\endcsname + \rlap{\thepage}\hfil\leftmark\unskip}% + \def\@oddhead{\small\csname runheadhook\endcsname + \ignorespaces\rightmark\hfil\llap{\thepage}}% + \let\@mkboth\@gobbletwo + \let\sectionmark\@gobble + \let\subsectionmark\@gobble + } +% make indentations changeable +\def\setitemindent#1{\settowidth{\labelwidth}{#1}% + \leftmargini\labelwidth + \advance\leftmargini\labelsep + \def\@listi{\leftmargin\leftmargini + \labelwidth\leftmargini\advance\labelwidth by -\labelsep + \parsep=\parskip + \topsep=\medskipamount + \itemsep=\parskip \advance\itemsep by -\parsep}} +\def\setitemitemindent#1{\settowidth{\labelwidth}{#1}% + \leftmarginii\labelwidth + \advance\leftmarginii\labelsep +\def\@listii{\leftmargin\leftmarginii + \labelwidth\leftmarginii\advance\labelwidth by -\labelsep + \parsep=\parskip + \topsep=\z@ + \itemsep=\parskip \advance\itemsep by -\parsep}} +% labels of description +\def\descriptionlabel#1{\hspace\labelsep #1\hfil} +% adjusted environment "description" +% if an optional parameter (at the first two levels of lists) +% is present, its width is considered to be the widest mark +% throughout the current list. 
+\def\description{\@ifnextchar[{\@describe}{\list{}{\labelwidth\z@ + \itemindent-\leftmargin \let\makelabel\descriptionlabel}}} +\let\enddescription\endlist +% +\def\describelabel#1{#1\hfil} +\def\@describe[#1]{\relax\ifnum\@listdepth=0 +\setitemindent{#1}\else\ifnum\@listdepth=1 +\setitemitemindent{#1}\fi\fi +\list{--}{\let\makelabel\describelabel}} +% +\newdimen\logodepth +\logodepth=1.2cm +\newdimen\headerboxheight +\headerboxheight=180pt % 18 10.5dd-lines - 2\baselineskip +\advance\headerboxheight by-14.5mm +\newdimen\betweenumberspace % dimension for space between +\betweenumberspace=3.33pt % number and text of titles. +\newdimen\aftertext % dimension for space after +\aftertext=5pt % text of title. +\newdimen\headlineindent % dimension for space between +\headlineindent=1.166cm % number and text of headings. +\if@mathematic + \def\runinend{} % \enspace} + \def\floatcounterend{\enspace} + \def\sectcounterend{} +\else + \def\runinend{.} + \def\floatcounterend{.\ } + \def\sectcounterend{.} +\fi +\def\email#1{\emailname: #1} +\def\keywords#1{\par\addvspace\medskipamount{\rightskip=0pt plus1cm +\def\and{\ifhmode\unskip\nobreak\fi\ $\cdot$ +}\noindent\keywordname\enspace\ignorespaces#1\par}} +% +\def\subclassname{{\bfseries Mathematics Subject Classification +(2000)}\enspace} +\def\subclass#1{\par\addvspace\medskipamount{\rightskip=0pt plus1cm +\def\and{\ifhmode\unskip\nobreak\fi\ $\cdot$ +}\noindent\subclassname\ignorespaces#1\par}} +% +\def\PACSname{\textbf{PACS}\enspace} +\def\PACS#1{\par\addvspace\medskipamount{\rightskip=0pt plus1cm +\def\and{\ifhmode\unskip\nobreak\fi\ $\cdot$ +}\noindent\PACSname\ignorespaces#1\par}} +% +\def\CRclassname{{\bfseries CR Subject Classification}\enspace} +\def\CRclass#1{\par\addvspace\medskipamount{\rightskip=0pt plus1cm +\def\and{\ifhmode\unskip\nobreak\fi\ $\cdot$ +}\noindent\CRclassname\ignorespaces#1\par}} +% +\def\ESMname{\textbf{Electronic Supplementary Material}\enspace} +\def\ESM#1{\par\addvspace\medskipamount 
+\noindent\ESMname\ignorespaces#1\par} +% +\newcounter{inst} +\newcounter{auth} +\def\authdepth{2} +\newdimen\instindent +\newbox\authrun +\newtoks\authorrunning +\newbox\titrun +\newtoks\titlerunning +\def\authorfont{\bfseries} + +\def\combirunning#1{\gdef\@combi{#1}} +\def\@combi{} +\newbox\combirun +% +\def\ps@last{\def\@evenhead{\small\rlap{\thepage}\hfil +\lastevenhead}} +\newcounter{lastpage} +\def\islastpageeven{\@ifundefined{lastpagenumber} +{\setcounter{lastpage}{0}}{\setcounter{lastpage}{\lastpagenumber}} +\ifnum\value{lastpage}>0 + \ifodd\value{lastpage}% + \else + \if@smartrunh + \thispagestyle{last}% + \fi + \fi +\fi} +\def\getlastpagenumber{\clearpage +\addtocounter{page}{-1}% + \immediate\write\@auxout{\string\gdef\string\lastpagenumber{\thepage}}% + \immediate\write\@auxout{\string\newlabel{LastPage}{{}{\thepage}}}% + \addtocounter{page}{1}} + +\def\journalname#1{\gdef\@journalname{#1}} + +\def\dedication#1{\gdef\@dedic{#1}} +\def\@dedic{} + +\let\@date\undefined +\def\notused{~} + +\def\institute#1{\gdef\@institute{#1}} + +\def\offprints#1{\begingroup +\def\protect{\noexpand\protect\noexpand}\xdef\@thanks{\@thanks +\protect\footnotetext[0]{\unskip\hskip-15pt{\itshape Send offprint requests +to\/}: \ignorespaces#1}}\endgroup\ignorespaces} + +%\def\mail#1{\gdef\@mail{#1}} +%\def\@mail{} + +\def\@thanks{} + +\def\@fnsymbol#1{\ifcase#1\or\star\or{\star\star}\or{\star\star\star}% + \or \dagger\or \ddagger\or + \mathchar "278\or \mathchar "27B\or \|\or **\or \dagger\dagger + \or \ddagger\ddagger \else\@ctrerr\fi\relax} +% +%\def\invthanks#1{\footnotetext[0]{\kern-\bibindent#1}} +% +\def\nothanksmarks{\def\thanks##1{\protected@xdef\@thanks{\@thanks + \protect\footnotetext[0]{\kern-\bibindent##1}}}} +% +\def\subtitle#1{\gdef\@subtitle{#1}} +\def\@subtitle{} + +\def\headnote#1{\gdef\@headnote{#1}} +\def\@headnote{} + +\def\papertype#1{\gdef\paper@type{\MakeUppercase{#1}}} +\def\paper@type{} + +\def\ch@ckobl#1#2{\@ifundefined{@#1} + {\typeout{SVJour2 
warning: Missing +\expandafter\string\csname#1\endcsname}% + \csname #1\endcsname{#2}} + {}} +% +\def\ProcessRunnHead{% + \def\\{\unskip\ \ignorespaces}% + \def\thanks##1{\unskip{}}% + \instindent=\textwidth + \advance\instindent by-\headlineindent + \if!\the\titlerunning!\else + \edef\@title{\the\titlerunning}% + \fi + \global\setbox\titrun=\hbox{\small\rmfamily\unboldmath\ignorespaces\@title + \unskip}% + \ifdim\wd\titrun>\instindent + \typeout{^^JSVJour2 Warning: Title too long for running head.}% + \typeout{Please supply a shorter form with \string\titlerunning + \space prior to \string\maketitle}% + \global\setbox\titrun=\hbox{\small\rmfamily + Title Suppressed Due to Excessive Length}% + \fi + \xdef\@title{\copy\titrun}% +% + \if!\the\authorrunning! + \else + \setcounter{auth}{1}% + \edef\@author{\the\authorrunning}% + \fi + \ifnum\value{inst}>\authdepth + \def\stripauthor##1\and##2\endauthor{% + \protected@xdef\@author{##1\unskip\unskip\if!##2!\else\ et al.\fi}}% + \expandafter\stripauthor\@author\and\endauthor + \else + \gdef\and{\unskip, \ignorespaces}% + {\def\and{\noexpand\protect\noexpand\and}% + \protected@xdef\@author{\@author}} + \fi + \global\setbox\authrun=\hbox{\small\rmfamily\unboldmath\ignorespaces + \@author\unskip}% + \ifdim\wd\authrun>\instindent + \typeout{^^JSVJour2 Warning: Author name(s) too long for running head. 
+ ^^JPlease supply a shorter form with \string\authorrunning + \space prior to \string\maketitle}% + \global\setbox\authrun=\hbox{\small\rmfamily Please give a shorter version + with: {\tt\string\authorrunning\space and + \string\titlerunning\space prior to \string\maketitle}}% + \fi + \xdef\@author{\copy\authrun}% + \markboth{\@author}{\@title}% +} +% +\let\orithanks=\thanks +\def\thanks#1{\ClassWarning{SVJour2}{\string\thanks\space may only be +used inside of \string\title, \string\author,\MessageBreak +and \string\date\space prior to \string\maketitle}} +% +\def\maketitle{\par\let\thanks=\orithanks +\ch@ckobl{journalname}{Noname} +\ch@ckobl{date}{the date of receipt and acceptance should be inserted +later} +\ch@ckobl{title}{A title should be given} +\ch@ckobl{author}{Name(s) and initial(s) of author(s) should be given} +\ch@ckobl{institute}{Address(es) of author(s) should be given} +\begingroup +% + \renewcommand\thefootnote{\@fnsymbol\c@footnote}% + \def\@makefnmark{$^{\@thefnmark}$}% + \renewcommand\@makefntext[1]{% + \noindent + \hb@xt@\bibindent{\hss\@makefnmark\enspace}##1\vrule height0pt + width0pt depth8pt} +% + \def\lastand{\ifnum\value{inst}=2\relax + \unskip{} \andname\ + \else + \unskip, \andname\ + \fi}% + \def\and{\stepcounter{auth}\relax + \if@smartand + \ifnum\value{auth}=\value{inst}% + \lastand + \else + \unskip, + \fi + \else + \unskip, + \fi}% + \thispagestyle{empty} + \ifnum \col@number=\@ne + \@maketitle + \else + \twocolumn[\@maketitle]% + \fi +% + \global\@topnum\z@ + \if!\@thanks!\else + \@thanks +\insert\footins{\vskip-3pt\hrule width\columnwidth\vskip3pt}% + \fi + {\def\thanks##1{\unskip{}}% + \def\iand{\\[5pt]\let\and=\nand}% + \def\nand{\ifhmode\unskip\nobreak\fi\ $\cdot$ }% + \let\and=\nand + \def\at{\\\let\and=\iand}% + \footnotetext[0]{\kern-\bibindent + \ignorespaces\@institute}\vspace{5dd}}% +%\if!\@mail!\else +% \footnotetext[0]{\kern-\bibindent\mailname\ +% \ignorespaces\@mail}% +%\fi +% + \if@runhead + \ProcessRunnHead + \fi 
+% + \endgroup + \setcounter{footnote}{0} + \global\let\thanks\relax + \global\let\maketitle\relax + \global\let\@maketitle\relax + \global\let\@thanks\@empty + \global\let\@author\@empty + \global\let\@date\@empty + \global\let\@title\@empty + \global\let\@subtitle\@empty + \global\let\title\relax + \global\let\author\relax + \global\let\date\relax + \global\let\and\relax} + +\def\makeheadbox{{% +\hbox to0pt{\vbox{\baselineskip=10dd\hrule\hbox +to\hsize{\vrule\kern3pt\vbox{\kern3pt +\hbox{\bfseries\@journalname\ manuscript No.} +\hbox{(will be inserted by the editor)} +\kern3pt}\hfil\kern3pt\vrule}\hrule}% +\hss}}} +% +\def\rubric{\setbox0=\hbox{\small\strut}\@tempdima=\ht0\advance +\@tempdima\dp0\advance\@tempdima2\fboxsep\vrule\@height\@tempdima +\@width\z@} +\newdimen\rubricwidth +% +\def\@maketitle{\newpage +\normalfont +\vbox to0pt{\if@twocolumn\vskip-39pt\else\vskip-49pt\fi +\nointerlineskip +\makeheadbox\vss}\nointerlineskip +\vbox to 0pt{\offinterlineskip\rubricwidth=\columnwidth +\vskip-12.5pt +\if@twocolumn\else % one column journal + \divide\rubricwidth by144\multiply\rubricwidth by89 % perform golden section + \vskip-\topskip +\fi +\hrule\@height0.35mm\noindent +\advance\fboxsep by.25mm +\global\advance\rubricwidth by0pt +\rubric +\vss}\vskip19.5pt +% +\if@twocolumn\else + \gdef\footnoterule{% + \kern-3\p@ + \hrule\@width\columnwidth %rubricwidth + \kern2.6\p@} +\fi +% + \setbox\authrun=\vbox\bgroup + \hrule\@height 9mm\@width0\p@ + \pretolerance=10000 + \rightskip=0pt plus 4cm + \nothanksmarks +% \if!\@headnote!\else +% \noindent +% {\LARGE\normalfont\itshape\ignorespaces\@headnote\par}\vskip 3.5mm +% \fi + {\authorfont + \setbox0=\vbox{\setcounter{auth}{1}\def\and{\stepcounter{auth} }% + \hfuzz=2\textwidth\def\thanks##1{}\@author}% + \setcounter{footnote}{0}% + \global\value{inst}=\value{auth}% + \setcounter{auth}{1}% + \if@twocolumn + \rightskip43mm plus 4cm minus 3mm + \else % one column journal + \rightskip=\linewidth + \advance\rightskip 
by-\rubricwidth + \advance\rightskip by0pt plus 4cm minus 3mm + \fi +% +\def\and{\unskip\nobreak\enskip{\boldmath$\cdot$}\enskip\ignorespaces}% + \noindent\ignorespaces\@author\vskip7.23pt} + {\LARGE\bfseries + \noindent\ignorespaces + \@title \par}\vskip 11.24pt\relax + \if!\@subtitle!\else + {\large\bfseries + \pretolerance=10000 + \rightskip=0pt plus 3cm + \vskip-5pt + \noindent\ignorespaces\@subtitle \par}\vskip 11.24pt + \fi + \small + \if!\@dedic!\else + \par + \normalsize\it + \addvspace\baselineskip + \noindent\@dedic + \fi + \egroup % end of header box + \@tempdima=\headerboxheight + \advance\@tempdima by-\ht\authrun + \unvbox\authrun + \ifdim\@tempdima>0pt + \vrule width0pt height\@tempdima\par + \fi + \noindent{\small\@date\vskip 6.2mm} + \global\@minipagetrue + \global\everypar{\global\@minipagefalse\global\everypar{}}% +%\vskip22.47pt +} +% +\if@mathematic + \def\vec#1{\ensuremath{\mathchoice + {\mbox{\boldmath$\displaystyle\mathbf{#1}$}} + {\mbox{\boldmath$\textstyle\mathbf{#1}$}} + {\mbox{\boldmath$\scriptstyle\mathbf{#1}$}} + {\mbox{\boldmath$\scriptscriptstyle\mathbf{#1}$}}}} +\else + \def\vec#1{\ensuremath{\mathchoice + {\mbox{\boldmath$\displaystyle#1$}} + {\mbox{\boldmath$\textstyle#1$}} + {\mbox{\boldmath$\scriptstyle#1$}} + {\mbox{\boldmath$\scriptscriptstyle#1$}}}} +\fi +% +\def\tens#1{\ensuremath{\mathsf{#1}}} +% +\setcounter{secnumdepth}{3} +\newcounter {section} +\newcounter {subsection}[section] +\newcounter {subsubsection}[subsection] +\newcounter {paragraph}[subsubsection] +\newcounter {subparagraph}[paragraph] +\renewcommand\thesection {\@arabic\c@section} +\renewcommand\thesubsection {\thesection.\@arabic\c@subsection} +\renewcommand\thesubsubsection{\thesubsection.\@arabic\c@subsubsection} +\renewcommand\theparagraph {\thesubsubsection.\@arabic\c@paragraph} +\renewcommand\thesubparagraph {\theparagraph.\@arabic\c@subparagraph} +% +\def\@hangfrom#1{\setbox\@tempboxa\hbox{#1}% + \hangindent \z@\noindent\box\@tempboxa} +% 
+\def\@seccntformat#1{\csname the#1\endcsname\sectcounterend +\hskip\betweenumberspace} +% +\newif\if@sectrule +\if@twocolumn\else\let\@sectruletrue=\relax\fi +\if@avier\let\@sectruletrue=\relax\fi +\def\makesectrule{\if@sectrule\global\@sectrulefalse\null\vglue-\topskip +\hrule\nobreak\parskip=5pt\relax\fi} +% +\let\makesectruleori=\makesectrule +\def\restoresectrule{\global\let\makesectrule=\makesectruleori\global\@sectrulefalse} +\def\nosectrule{\let\makesectrule=\restoresectrule} +% +\def\@startsection#1#2#3#4#5#6{% + \if@noskipsec \leavevmode \fi + \par + \@tempskipa #4\relax + \@afterindenttrue + \ifdim \@tempskipa <\z@ + \@tempskipa -\@tempskipa \@afterindentfalse + \fi + \if@nobreak + \everypar{}% + \else + \addpenalty\@secpenalty\addvspace\@tempskipa + \fi + \ifnum#2=1\relax\@sectruletrue\fi + \@ifstar + {\@ssect{#3}{#4}{#5}{#6}}% + {\@dblarg{\@sect{#1}{#2}{#3}{#4}{#5}{#6}}}} +% +\def\@sect#1#2#3#4#5#6[#7]#8{% + \ifnum #2>\c@secnumdepth + \let\@svsec\@empty + \else + \refstepcounter{#1}% + \protected@edef\@svsec{\@seccntformat{#1}\relax}% + \fi + \@tempskipa #5\relax + \ifdim \@tempskipa>\z@ + \begingroup + #6{\makesectrule + \@hangfrom{\hskip #3\relax\@svsec}% + \raggedright + \hyphenpenalty \@M% + \interlinepenalty \@M #8\@@par}% + \endgroup + \csname #1mark\endcsname{#7}% + \addcontentsline{toc}{#1}{% + \ifnum #2>\c@secnumdepth \else + \protect\numberline{\csname the#1\endcsname\sectcounterend}% + \fi + #7}% + \else + \def\@svsechd{% + #6{\hskip #3\relax + \@svsec #8\/\hskip\aftertext}% + \csname #1mark\endcsname{#7}% + \addcontentsline{toc}{#1}{% + \ifnum #2>\c@secnumdepth \else + \protect\numberline{\csname the#1\endcsname}% + \fi + #7}}% + \fi + \@xsect{#5}} +% +\def\@ssect#1#2#3#4#5{% + \@tempskipa #3\relax + \ifdim \@tempskipa>\z@ + \begingroup + #4{\makesectrule + \@hangfrom{\hskip #1}% + \interlinepenalty \@M #5\@@par}% + \endgroup + \else + \def\@svsechd{#4{\hskip #1\relax #5}}% + \fi + \@xsect{#3}} + +% +% measures and setting of sections +% 
+\def\section{\@startsection{section}{1}{\z@}% + {-21dd plus-8pt minus-4pt}{10.5dd} + {\normalsize\bfseries\boldmath}} +\def\subsection{\@startsection{subsection}{2}{\z@}% + {-21dd plus-8pt minus-4pt}{10.5dd} + {\normalsize\upshape}} +\def\subsubsection{\@startsection{subsubsection}{3}{\z@}% + {-13dd plus-8pt minus-4pt}{10.5dd} + {\normalsize\itshape}} +\def\paragraph{\@startsection{paragraph}{4}{\z@}% + {-13pt plus-8pt minus-4pt}{\z@}{\normalsize\itshape}} + +\setlength\leftmargini {\parindent} +\leftmargin \leftmargini +\setlength\leftmarginii {\parindent} +\setlength\leftmarginiii {1.87em} +\setlength\leftmarginiv {1.7em} +\setlength\leftmarginv {.5em} +\setlength\leftmarginvi {.5em} +\setlength \labelsep {.5em} +\setlength \labelwidth{\leftmargini} +\addtolength\labelwidth{-\labelsep} +\@beginparpenalty -\@lowpenalty +\@endparpenalty -\@lowpenalty +\@itempenalty -\@lowpenalty +\renewcommand\theenumi{\@arabic\c@enumi} +\renewcommand\theenumii{\@alph\c@enumii} +\renewcommand\theenumiii{\@roman\c@enumiii} +\renewcommand\theenumiv{\@Alph\c@enumiv} +\newcommand\labelenumi{\theenumi.} +\newcommand\labelenumii{(\theenumii)} +\newcommand\labelenumiii{\theenumiii.} +\newcommand\labelenumiv{\theenumiv.} +\renewcommand\p@enumii{\theenumi} +\renewcommand\p@enumiii{\theenumi(\theenumii)} +\renewcommand\p@enumiv{\p@enumiii\theenumiii} +\newcommand\labelitemi{\normalfont\bfseries --} +\newcommand\labelitemii{\normalfont\bfseries --} +\newcommand\labelitemiii{$\m@th\bullet$} +\newcommand\labelitemiv{$\m@th\cdot$} + +\if@spthms +% definition of the "\spnewtheorem" command. +% +% Usage: +% +% \spnewtheorem{env_nam}{caption}[within]{cap_font}{body_font} +% or \spnewtheorem{env_nam}[numbered_like]{caption}{cap_font}{body_font} +% or \spnewtheorem*{env_nam}{caption}{cap_font}{body_font} +% +% New is "cap_font" and "body_font". It stands for +% fontdefinition of the caption and the text itself. +% +% "\spnewtheorem*" gives a theorem without number. 
+% +% A defined spnewthoerem environment is used as described +% by Lamport. +% +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + +\def\@thmcountersep{} +\def\@thmcounterend{} +\newcommand\nocaption{\noexpand\@gobble} +\newdimen\spthmsep \spthmsep=5pt + +\def\spnewtheorem{\@ifstar{\@sthm}{\@Sthm}} + +% definition of \spnewtheorem with number + +\def\@spnthm#1#2{% + \@ifnextchar[{\@spxnthm{#1}{#2}}{\@spynthm{#1}{#2}}} +\def\@Sthm#1{\@ifnextchar[{\@spothm{#1}}{\@spnthm{#1}}} + +\def\@spxnthm#1#2[#3]#4#5{\expandafter\@ifdefinable\csname #1\endcsname + {\@definecounter{#1}\@addtoreset{#1}{#3}% + \expandafter\xdef\csname the#1\endcsname{\expandafter\noexpand + \csname the#3\endcsname \noexpand\@thmcountersep \@thmcounter{#1}}% + \expandafter\xdef\csname #1name\endcsname{#2}% + \global\@namedef{#1}{\@spthm{#1}{\csname #1name\endcsname}{#4}{#5}}% + \global\@namedef{end#1}{\@endtheorem}}} + +\def\@spynthm#1#2#3#4{\expandafter\@ifdefinable\csname #1\endcsname + {\@definecounter{#1}% + \expandafter\xdef\csname the#1\endcsname{\@thmcounter{#1}}% + \expandafter\xdef\csname #1name\endcsname{#2}% + \global\@namedef{#1}{\@spthm{#1}{\csname #1name\endcsname}{#3}{#4}}% + \global\@namedef{end#1}{\@endtheorem}}} + +\def\@spothm#1[#2]#3#4#5{% + \@ifundefined{c@#2}{\@latexerr{No theorem environment `#2' defined}\@eha}% + {\expandafter\@ifdefinable\csname #1\endcsname + {\global\@namedef{the#1}{\@nameuse{the#2}}% + \expandafter\xdef\csname #1name\endcsname{#3}% + \global\@namedef{#1}{\@spthm{#2}{\csname #1name\endcsname}{#4}{#5}}% + \global\@namedef{end#1}{\@endtheorem}}}} + +\def\@spthm#1#2#3#4{\topsep 7\p@ \@plus2\p@ \@minus4\p@ +\labelsep=\spthmsep\refstepcounter{#1}% +\@ifnextchar[{\@spythm{#1}{#2}{#3}{#4}}{\@spxthm{#1}{#2}{#3}{#4}}} + +\def\@spxthm#1#2#3#4{\@spbegintheorem{#2}{\csname the#1\endcsname}{#3}{#4}% + \ignorespaces} + +\def\@spythm#1#2#3#4[#5]{\@spopargbegintheorem{#2}{\csname + the#1\endcsname}{#5}{#3}{#4}\ignorespaces} + 
+\def\normalthmheadings{\def\@spbegintheorem##1##2##3##4{\trivlist\normalfont + \item[\hskip\labelsep{##3##1\ ##2\@thmcounterend}]##4} +\def\@spopargbegintheorem##1##2##3##4##5{\trivlist + \item[\hskip\labelsep{##4##1\ ##2}]{##4(##3)\@thmcounterend\ }##5}} +\normalthmheadings + +\def\reversethmheadings{\def\@spbegintheorem##1##2##3##4{\trivlist\normalfont + \item[\hskip\labelsep{##3##2\ ##1\@thmcounterend}]##4} +\def\@spopargbegintheorem##1##2##3##4##5{\trivlist + \item[\hskip\labelsep{##4##2\ ##1}]{##4(##3)\@thmcounterend\ }##5}} + +% definition of \spnewtheorem* without number + +\def\@sthm#1#2{\@Ynthm{#1}{#2}} + +\def\@Ynthm#1#2#3#4{\expandafter\@ifdefinable\csname #1\endcsname + {\global\@namedef{#1}{\@Thm{\csname #1name\endcsname}{#3}{#4}}% + \expandafter\xdef\csname #1name\endcsname{#2}% + \global\@namedef{end#1}{\@endtheorem}}} + +\def\@Thm#1#2#3{\topsep 7\p@ \@plus2\p@ \@minus4\p@ +\@ifnextchar[{\@Ythm{#1}{#2}{#3}}{\@Xthm{#1}{#2}{#3}}} + +\def\@Xthm#1#2#3{\@Begintheorem{#1}{#2}{#3}\ignorespaces} + +\def\@Ythm#1#2#3[#4]{\@Opargbegintheorem{#1} + {#4}{#2}{#3}\ignorespaces} + +\def\@Begintheorem#1#2#3{#3\trivlist + \item[\hskip\labelsep{#2#1\@thmcounterend}]} + +\def\@Opargbegintheorem#1#2#3#4{#4\trivlist + \item[\hskip\labelsep{#3#1}]{#3(#2)\@thmcounterend\ }} + +% initialize theorem environment + +\if@envcntsect + \def\@thmcountersep{.} + \spnewtheorem{theorem}{Theorem}[section]{\bfseries}{\itshape} +\else + \spnewtheorem{theorem}{Theorem}{\bfseries}{\itshape} + \if@envcntreset + \@addtoreset{theorem}{section} + \else + \@addtoreset{theorem}{chapter} + \fi +\fi + +%definition of divers theorem environments +\spnewtheorem*{claim}{Claim}{\itshape}{\rmfamily} +\spnewtheorem*{proof}{Proof}{\itshape}{\rmfamily} +\if@envcntsame % all environments like "Theorem" - using its counter + \def\spn@wtheorem#1#2#3#4{\@spothm{#1}[theorem]{#2}{#3}{#4}} +\else % all environments with their own counter + \if@envcntsect % show section counter + 
\def\spn@wtheorem#1#2#3#4{\@spxnthm{#1}{#2}[section]{#3}{#4}} + \else % not numbered with section + \if@envcntreset + \def\spn@wtheorem#1#2#3#4{\@spynthm{#1}{#2}{#3}{#4} + \@addtoreset{#1}{section}} + \else + \let\spn@wtheorem=\@spynthm + \fi + \fi +\fi +% +\let\spdefaulttheorem=\spn@wtheorem +% +\spn@wtheorem{case}{Case}{\itshape}{\rmfamily} +\spn@wtheorem{conjecture}{Conjecture}{\itshape}{\rmfamily} +\spn@wtheorem{corollary}{Corollary}{\bfseries}{\itshape} +\spn@wtheorem{definition}{Definition}{\bfseries}{\rmfamily} +\spn@wtheorem{example}{Example}{\itshape}{\rmfamily} +\spn@wtheorem{exercise}{Exercise}{\bfseries}{\rmfamily} +\spn@wtheorem{lemma}{Lemma}{\bfseries}{\itshape} +\spn@wtheorem{note}{Note}{\itshape}{\rmfamily} +\spn@wtheorem{problem}{Problem}{\bfseries}{\rmfamily} +\spn@wtheorem{property}{Property}{\itshape}{\rmfamily} +\spn@wtheorem{proposition}{Proposition}{\bfseries}{\itshape} +\spn@wtheorem{question}{Question}{\itshape}{\rmfamily} +\spn@wtheorem{solution}{Solution}{\bfseries}{\rmfamily} +\spn@wtheorem{remark}{Remark}{\itshape}{\rmfamily} +% +\newenvironment{theopargself} + {\def\@spopargbegintheorem##1##2##3##4##5{\trivlist + \item[\hskip\labelsep{##4##1\ ##2}]{##4##3\@thmcounterend\ }##5} + \def\@Opargbegintheorem##1##2##3##4{##4\trivlist + \item[\hskip\labelsep{##3##1}]{##3##2\@thmcounterend\ }}}{} +\newenvironment{theopargself*} + {\def\@spopargbegintheorem##1##2##3##4##5{\trivlist + \item[\hskip\labelsep{##4##1\ ##2}]{\hspace*{-\labelsep}##4##3\@thmcounterend}##5} + \def\@Opargbegintheorem##1##2##3##4{##4\trivlist + \item[\hskip\labelsep{##3##1}]{\hspace*{-\labelsep}##3##2\@thmcounterend}}}{} +% +\fi + +\def\@takefromreset#1#2{% + \def\@tempa{#1}% + \let\@tempd\@elt + \def\@elt##1{% + \def\@tempb{##1}% + \ifx\@tempa\@tempb\else + \@addtoreset{##1}{#2}% + \fi}% + \expandafter\expandafter\let\expandafter\@tempc\csname cl@#2\endcsname + \expandafter\def\csname cl@#2\endcsname{}% + \@tempc + \let\@elt\@tempd} + 
+\def\squareforqed{\hbox{\rlap{$\sqcap$}$\sqcup$}} +\def\qed{\ifmmode\else\unskip\quad\fi\squareforqed} +\def\smartqed{\def\qed{\ifmmode\squareforqed\else{\unskip\nobreak\hfil +\penalty50\hskip1em\null\nobreak\hfil\squareforqed +\parfillskip=0pt\finalhyphendemerits=0\endgraf}\fi}} + +% Define `abstract' environment +\def\abstract{\topsep=0pt\partopsep=0pt\parsep=0pt\itemsep=0pt\relax +\trivlist\item[\hskip\labelsep +{\bfseries\abstractname}]\if!\abstractname!\hskip-\labelsep\fi} +\if@twocolumn + \if@avier + \def\endabstract{\endtrivlist\addvspace{5mm}\strich} + \def\strich{\hrule\vskip1ptplus12pt} + \else + \def\endabstract{\endtrivlist\addvspace{3mm}} + \fi +\else +\fi +% +\newenvironment{verse} + {\let\\\@centercr + \list{}{\itemsep \z@ + \itemindent -1.5em% + \listparindent\itemindent + \rightmargin \leftmargin + \advance\leftmargin 1.5em}% + \item\relax} + {\endlist} +\newenvironment{quotation} + {\list{}{\listparindent 1.5em% + \itemindent \listparindent + \rightmargin \leftmargin + \parsep \z@ \@plus\p@}% + \item\relax} + {\endlist} +\newenvironment{quote} + {\list{}{\rightmargin\leftmargin}% + \item\relax} + {\endlist} +\newcommand\appendix{\par\small + \setcounter{section}{0}% + \setcounter{subsection}{0}% + \renewcommand\thesection{\@Alph\c@section}} +\setlength\arraycolsep{1.5\p@} +\setlength\tabcolsep{6\p@} +\setlength\arrayrulewidth{.4\p@} +\setlength\doublerulesep{2\p@} +\setlength\tabbingsep{\labelsep} +\skip\@mpfootins = \skip\footins +\setlength\fboxsep{3\p@} +\setlength\fboxrule{.4\p@} +\renewcommand\theequation{\@arabic\c@equation} +\newcounter{figure} +\renewcommand\thefigure{\@arabic\c@figure} +\def\fps@figure{tbp} +\def\ftype@figure{1} +\def\ext@figure{lof} +\def\fnum@figure{\figurename~\thefigure} +\newenvironment{figure} + {\@float{figure}} + {\end@float} +\newenvironment{figure*} + {\@dblfloat{figure}} + {\end@dblfloat} +\newcounter{table} +\renewcommand\thetable{\@arabic\c@table} +\def\fps@table{tbp} +\def\ftype@table{2} 
+\def\ext@table{lot} +\def\fnum@table{\tablename~\thetable} +\newenvironment{table} + {\@float{table}} + {\end@float} +\newenvironment{table*} + {\@dblfloat{table}} + {\end@dblfloat} +% +\def \@floatboxreset {% + \reset@font + \small + \@setnobreak + \@setminipage +} +% +\newcommand{\tableheadseprule}{\noalign{\hrule height.375mm}} +% +\newlength\abovecaptionskip +\newlength\belowcaptionskip +\setlength\abovecaptionskip{10\p@} +\setlength\belowcaptionskip{0\p@} +\newcommand\leftlegendglue{} + +\def\fig@type{figure} + +\newdimen\figcapgap\figcapgap=3pt +\newdimen\tabcapgap\tabcapgap=5.5pt + +\@ifundefined{floatlegendstyle}{\def\floatlegendstyle{\bfseries}}{} + +\long\def\@caption#1[#2]#3{\par\addcontentsline{\csname + ext@#1\endcsname}{#1}{\protect\numberline{\csname + the#1\endcsname}{\ignorespaces #2}}\begingroup + \@parboxrestore + \@makecaption{\csname fnum@#1\endcsname}{\ignorespaces #3}\par + \endgroup} + +\def\capstrut{\vrule\@width\z@\@height\topskip} + +\@ifundefined{captionstyle}{\def\captionstyle{\normalfont\small}}{} + +\long\def\@makecaption#1#2{% + \captionstyle + \ifx\@captype\fig@type + \vskip\figcapgap + \fi + \setbox\@tempboxa\hbox{{\floatlegendstyle #1\floatcounterend}% + \capstrut #2}% + \ifdim \wd\@tempboxa >\hsize + {\floatlegendstyle #1\floatcounterend}\capstrut #2\par + \else + \hbox to\hsize{\leftlegendglue\unhbox\@tempboxa\hfil}% + \fi + \ifx\@captype\fig@type\else + \vskip\tabcapgap + \fi} + +\newdimen\figgap\figgap=1cc +\long\def\@makesidecaption#1#2{% + \parbox[b]{\@tempdimb}{\captionstyle{\floatlegendstyle + #1\floatcounterend}#2}} +\def\sidecaption#1\caption{% +\setbox\@tempboxa=\hbox{#1\unskip}% +\if@twocolumn + \ifdim\hsize<\textwidth\else + \ifdim\wd\@tempboxa<\columnwidth + \typeout{Double column float fits into single column - + ^^Jyou'd better switch the environment. 
}% + \fi + \fi +\fi +\@tempdimb=\hsize +\advance\@tempdimb by-\figgap +\advance\@tempdimb by-\wd\@tempboxa +\ifdim\@tempdimb<3cm + \typeout{\string\sidecaption: No sufficient room for the legend; + using normal \string\caption. }% + \unhbox\@tempboxa + \let\@capcommand=\@caption +\else + \let\@capcommand=\@sidecaption + \leavevmode + \unhbox\@tempboxa + \hfill +\fi +\refstepcounter\@captype +\@dblarg{\@capcommand\@captype}} + +\long\def\@sidecaption#1[#2]#3{\addcontentsline{\csname + ext@#1\endcsname}{#1}{\protect\numberline{\csname + the#1\endcsname}{\ignorespaces #2}}\begingroup + \@parboxrestore + \@makesidecaption{\csname fnum@#1\endcsname}{\ignorespaces #3}\par + \endgroup} + +% Define `acknowledgement' environment +\def\acknowledgement{\par\addvspace{17pt}\small\rmfamily +\trivlist\if!\ackname!\item[]\else +\item[\hskip\labelsep +{\bfseries\ackname}]\fi} +\def\endacknowledgement{\endtrivlist\addvspace{6pt}} +\newenvironment{acknowledgements}{\begin{acknowledgement}} +{\end{acknowledgement}} +% Define `noteadd' environment +\def\noteadd{\par\addvspace{17pt}\small\rmfamily +\trivlist\item[\hskip\labelsep +{\itshape\noteaddname}]} +\def\endnoteadd{\endtrivlist\addvspace{6pt}} + +\DeclareOldFontCommand{\rm}{\normalfont\rmfamily}{\mathrm} +\DeclareOldFontCommand{\sf}{\normalfont\sffamily}{\mathsf} +\DeclareOldFontCommand{\tt}{\normalfont\ttfamily}{\mathtt} +\DeclareOldFontCommand{\bf}{\normalfont\bfseries}{\mathbf} +\DeclareOldFontCommand{\it}{\normalfont\itshape}{\mathit} +\DeclareOldFontCommand{\sl}{\normalfont\slshape}{\@nomath\sl} +\DeclareOldFontCommand{\sc}{\normalfont\scshape}{\@nomath\sc} +\DeclareRobustCommand*\cal{\@fontswitch\relax\mathcal} +\DeclareRobustCommand*\mit{\@fontswitch\relax\mathnormal} +\newcommand\@pnumwidth{1.55em} +\newcommand\@tocrmarg{2.55em} +\newcommand\@dotsep{4.5} +\setcounter{tocdepth}{1} +\newcommand\tableofcontents{% + \section*{\contentsname}% + \@starttoc{toc}% + \addtocontents{toc}{\begingroup\protect\small}% + 
\AtEndDocument{\addtocontents{toc}{\endgroup}}% + } +\newcommand*\l@part[2]{% + \ifnum \c@tocdepth >-2\relax + \addpenalty\@secpenalty + \addvspace{2.25em \@plus\p@}% + \begingroup + \setlength\@tempdima{3em}% + \parindent \z@ \rightskip \@pnumwidth + \parfillskip -\@pnumwidth + {\leavevmode + \large \bfseries #1\hfil \hb@xt@\@pnumwidth{\hss #2}}\par + \nobreak + \if@compatibility + \global\@nobreaktrue + \everypar{\global\@nobreakfalse\everypar{}}% + \fi + \endgroup + \fi} +\newcommand*\l@section{\@dottedtocline{1}{0pt}{1.5em}} +\newcommand*\l@subsection{\@dottedtocline{2}{1.5em}{2.3em}} +\newcommand*\l@subsubsection{\@dottedtocline{3}{3.8em}{3.2em}} +\newcommand*\l@paragraph{\@dottedtocline{4}{7.0em}{4.1em}} +\newcommand*\l@subparagraph{\@dottedtocline{5}{10em}{5em}} +\newcommand\listoffigures{% + \section*{\listfigurename + \@mkboth{\listfigurename}% + {\listfigurename}}% + \@starttoc{lof}% + } +\newcommand*\l@figure{\@dottedtocline{1}{1.5em}{2.3em}} +\newcommand\listoftables{% + \section*{\listtablename + \@mkboth{\listtablename}{\listtablename}}% + \@starttoc{lot}% + } +\let\l@table\l@figure +\newdimen\bibindent +\setlength\bibindent{\parindent} +\def\@biblabel#1{#1.} +\def\@lbibitem[#1]#2{\item[{[#1]}\hfill]\if@filesw + {\let\protect\noexpand + \immediate + \write\@auxout{\string\bibcite{#2}{#1}}}\fi\ignorespaces} +\newenvironment{thebibliography}[1] + {\section*{\refname + \@mkboth{\refname}{\refname}}\small + \list{\@biblabel{\@arabic\c@enumiv}}% + {\settowidth\labelwidth{\@biblabel{#1}}% + \leftmargin\labelwidth + \advance\leftmargin\labelsep + \@openbib@code + \usecounter{enumiv}% + \let\p@enumiv\@empty + \renewcommand\theenumiv{\@arabic\c@enumiv}}% + \sloppy\clubpenalty4000\widowpenalty4000% + \sfcode`\.\@m} + {\def\@noitemerr + {\@latex@warning{Empty `thebibliography' environment}}% + \endlist} +% +\newcount\@tempcntc +\def\@citex[#1]#2{\if@filesw\immediate\write\@auxout{\string\citation{#2}}\fi + 
\@tempcnta\z@\@tempcntb\m@ne\def\@citea{}\@cite{\@for\@citeb:=#2\do + {\@ifundefined + {b@\@citeb}{\@citeo\@tempcntb\m@ne\@citea\def\@citea{,}{\bfseries + ?}\@warning + {Citation `\@citeb' on page \thepage \space undefined}}% + {\setbox\z@\hbox{\global\@tempcntc0\csname b@\@citeb\endcsname\relax}% + \ifnum\@tempcntc=\z@ \@citeo\@tempcntb\m@ne + \@citea\def\@citea{,\hskip0.1em\ignorespaces}\hbox{\csname b@\@citeb\endcsname}% + \else + \advance\@tempcntb\@ne + \ifnum\@tempcntb=\@tempcntc + \else\advance\@tempcntb\m@ne\@citeo + \@tempcnta\@tempcntc\@tempcntb\@tempcntc\fi\fi}}\@citeo}{#1}} +\def\@citeo{\ifnum\@tempcnta>\@tempcntb\else + \@citea\def\@citea{,\hskip0.1em\ignorespaces}% + \ifnum\@tempcnta=\@tempcntb\the\@tempcnta\else + {\advance\@tempcnta\@ne\ifnum\@tempcnta=\@tempcntb \else \def\@citea{--}\fi + \advance\@tempcnta\m@ne\the\@tempcnta\@citea\the\@tempcntb}\fi\fi} +% +\newcommand\newblock{\hskip .11em\@plus.33em\@minus.07em} +\let\@openbib@code\@empty +\newenvironment{theindex} + {\if@twocolumn + \@restonecolfalse + \else + \@restonecoltrue + \fi + \columnseprule \z@ + \columnsep 35\p@ + \twocolumn[\section*{\indexname}]% + \@mkboth{\indexname}{\indexname}% + \thispagestyle{plain}\parindent\z@ + \parskip\z@ \@plus .3\p@\relax + \let\item\@idxitem} + {\if@restonecol\onecolumn\else\clearpage\fi} +\newcommand\@idxitem{\par\hangindent 40\p@} +\newcommand\subitem{\@idxitem \hspace*{20\p@}} +\newcommand\subsubitem{\@idxitem \hspace*{30\p@}} +\newcommand\indexspace{\par \vskip 10\p@ \@plus5\p@ \@minus3\p@\relax} + +\if@twocolumn + \renewcommand\footnoterule{% + \kern-3\p@ + \hrule\@width\columnwidth + \kern2.6\p@} +\else + \renewcommand\footnoterule{% + \kern-3\p@ + \hrule\@width.382\columnwidth + \kern2.6\p@} +\fi +\newcommand\@makefntext[1]{% + \noindent + \hb@xt@\bibindent{\hss\@makefnmark\enspace}#1} +% +\def\trans@english{\switcht@albion} +\def\trans@french{\switcht@francais} +\def\trans@german{\switcht@deutsch} +\newenvironment{translation}[1]{\if!#1!\else 
+\@ifundefined{selectlanguage}{\csname trans@#1\endcsname}{\selectlanguage{#1}}% +\fi}{} +% languages +% English section +\def\switcht@albion{%\typeout{English spoken.}% + \def\abstractname{Abstract}% + \def\ackname{Acknowledgements}% + \def\andname{and}% + \def\lastandname{, and}% + \def\appendixname{Appendix}% + \def\chaptername{Chapter}% + \def\claimname{Claim}% + \def\conjecturename{Conjecture}% + \def\contentsname{Contents}% + \def\corollaryname{Corollary}% + \def\definitionname{Definition}% + \def\emailname{E-mail}% + \def\examplename{Example}% + \def\exercisename{Exercise}% + \def\figurename{Fig.}% + \def\keywordname{{\bfseries Keywords}}% + \def\indexname{Index}% + \def\lemmaname{Lemma}% + \def\contriblistname{List of Contributors}% + \def\listfigurename{List of Figures}% + \def\listtablename{List of Tables}% + \def\mailname{{\itshape Correspondence to\/}:}% + \def\noteaddname{Note added in proof}% + \def\notename{Note}% + \def\partname{Part}% + \def\problemname{Problem}% + \def\proofname{Proof}% + \def\propertyname{Property}% + \def\questionname{Question}% + \def\refname{References}% + \def\remarkname{Remark}% + \def\seename{see}% + \def\solutionname{Solution}% + \def\tablename{Table}% + \def\theoremname{Theorem}% +}\switcht@albion % make English default +% +% French section +\def\switcht@francais{\svlanginfo +%\typeout{On parle francais.}% + \def\abstractname{R\'esum\'e\runinend}% + \def\ackname{Remerciements\runinend}% + \def\andname{et}% + \def\lastandname{ et}% + \def\appendixname{Appendice}% + \def\chaptername{Chapitre}% + \def\claimname{Pr\'etention}% + \def\conjecturename{Hypoth\`ese}% + \def\contentsname{Table des mati\`eres}% + \def\corollaryname{Corollaire}% + \def\definitionname{D\'efinition}% + \def\emailname{E-mail}% + \def\examplename{Exemple}% + \def\exercisename{Exercice}% + \def\figurename{Fig.}% + \def\keywordname{{\bfseries Mots-cl\'e\runinend}}% + \def\indexname{Index}% + \def\lemmaname{Lemme}% + \def\contriblistname{Liste des 
contributeurs}% + \def\listfigurename{Liste des figures}% + \def\listtablename{Liste des tables}% + \def\mailname{{\itshape Correspondence to\/}:}% + \def\noteaddname{Note ajout\'ee \`a l'\'epreuve}% + \def\notename{Remarque}% + \def\partname{Partie}% + \def\problemname{Probl\`eme}% + \def\proofname{Preuve}% + \def\propertyname{Caract\'eristique}% +%\def\propositionname{Proposition}% + \def\questionname{Question}% + \def\refname{Bibliographie}% + \def\remarkname{Remarque}% + \def\seename{voyez}% + \def\solutionname{Solution}% +%\def\subclassname{{\it Subject Classifications\/}:}% + \def\tablename{Tableau}% + \def\theoremname{Th\'eor\`eme}% +} +% +% German section +\def\switcht@deutsch{\svlanginfo +%\typeout{Man spricht deutsch.}% + \def\abstractname{Zusammenfassung\runinend}% + \def\ackname{Danksagung\runinend}% + \def\andname{und}% + \def\lastandname{ und}% + \def\appendixname{Anhang}% + \def\chaptername{Kapitel}% + \def\claimname{Behauptung}% + \def\conjecturename{Hypothese}% + \def\contentsname{Inhaltsverzeichnis}% + \def\corollaryname{Korollar}% +%\def\definitionname{Definition}% + \def\emailname{E-Mail}% + \def\examplename{Beispiel}% + \def\exercisename{\"Ubung}% + \def\figurename{Abb.}% + \def\keywordname{{\bfseries Schl\"usselw\"orter\runinend}}% + \def\indexname{Index}% +%\def\lemmaname{Lemma}% + \def\contriblistname{Mitarbeiter}% + \def\listfigurename{Abbildungsverzeichnis}% + \def\listtablename{Tabellenverzeichnis}% + \def\mailname{{\itshape Correspondence to\/}:}% + \def\noteaddname{Nachtrag}% + \def\notename{Anmerkung}% + \def\partname{Teil}% +%\def\problemname{Problem}% + \def\proofname{Beweis}% + \def\propertyname{Eigenschaft}% +%\def\propositionname{Proposition}% + \def\questionname{Frage}% + \def\refname{Literatur}% + \def\remarkname{Anmerkung}% + \def\seename{siehe}% + \def\solutionname{L\"osung}% +%\def\subclassname{{\it Subject Classifications\/}:}% + \def\tablename{Tabelle}% +%\def\theoremname{Theorem}% +} +\newcommand\today{} 
+\edef\today{\ifcase\month\or + January\or February\or March\or April\or May\or June\or + July\or August\or September\or October\or November\or December\fi + \space\number\day, \number\year} +\setlength\columnsep{1.5cc} +\setlength\columnseprule{0\p@} +% +\frenchspacing +\clubpenalty=10000 +\widowpenalty=10000 +\def\thisbottomragged{\def\@textbottom{\vskip\z@ plus.0001fil +\global\let\@textbottom\relax}} +\pagestyle{headings} +\pagenumbering{arabic} +\if@twocolumn + \twocolumn +\fi +\if@avier + \onecolumn + \setlength{\textwidth}{156mm} + \setlength{\textheight}{226mm} +\fi +\if@referee + \makereferee +\fi +\flushbottom +\endinput +%% +%% End of file `svjour2.cls'. diff --git a/vldb07/terminology.tex b/vldb07/terminology.tex new file mode 100755 index 0000000..fd2cf1d --- /dev/null +++ b/vldb07/terminology.tex @@ -0,0 +1,18 @@ +% Time-stamp: +\vspace{-3mm} +\section{Notation and terminology} +\vspace{-2mm} +\label{sec:notation} + +\enlargethispage{2\baselineskip} +The essential notation and terminology used throughout this paper are as follows. +\begin{itemize} +\item $U$: key universe. $|U| = u$. +\item $S$: actual static key set. $S \subset U$, $|S| = n \ll u$. +\item $h: U \to M$ is a hash function that maps keys from a universe $U$ into +a given range $M = \{0,1,\dots,m-1\}$ of integer numbers. +\item $h$ is a perfect hash function if it is one-to-one on~$S$, i.e., if + $h(k_1) \not = h(k_2)$ for all $k_1 \not = k_2$ from $S$. +\item $h$ is a minimal perfect hash function (MPHF) if it is one-to-one on~$S$ + and $n=m$. +\end{itemize} diff --git a/vldb07/thealgorithm.tex b/vldb07/thealgorithm.tex new file mode 100755 index 0000000..1fb256f --- /dev/null +++ b/vldb07/thealgorithm.tex @@ -0,0 +1,78 @@ +%% Nivio: 13/jan/06, 21/jan/06 29/jan/06 +% Time-stamp: +\vspace{-3mm} +\section{The algorithm} +\label{sec:new-algorithm} +\vspace{-2mm} + +\enlargethispage{2\baselineskip} +The main idea supporting our algorithm is the classical divide and conquer technique. 
+The algorithm is a two-step external memory based algorithm +that generates a MPHF $h$ for a set $S$ of $n$ keys. +Figure~\ref{fig:new-algo-main-steps} illustrates the two steps of the +algorithm: the partitioning step and the searching step. + +\vspace{-2mm} +\begin{figure}[ht] +\centering +\begin{picture}(0,0)% +\includegraphics{figs/brz.ps}% +\end{picture}% +\setlength{\unitlength}{4144sp}% +% +\begingroup\makeatletter\ifx\SetFigFont\undefined% +\gdef\SetFigFont#1#2#3#4#5{% + \reset@font\fontsize{#1}{#2pt}% + \fontfamily{#3}\fontseries{#4}\fontshape{#5}% + \selectfont}% +\fi\endgroup% +\begin{picture}(3704,2091)(1426,-5161) +\put(2570,-4301){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}0}}}} +\put(2782,-4301){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}1}}}} +\put(2996,-4301){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}2}}}} +\put(4060,-4006){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}Buckets}}}} +\put(3776,-4301){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}${\lceil n/b\rceil - 1}$}}}} +\put(4563,-3329){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}Key Set $S$}}}} +\put(2009,-3160){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}0}}}} +\put(2221,-3160){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}1}}}} +\put(4315,-3160){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}n-1}}}} +\put(1992,-5146){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}0}}}} +\put(2204,-5146){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}1}}}} +\put(4298,-5146){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}n-1}}}} 
+\put(4546,-4977){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}Hash Table}}}}
+\put(1441,-3616){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}Partitioning}}}}
+\put(1441,-4426){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}Searching}}}}
+\put(1981,-4786){\makebox(0,0)[lb]{\smash{{\SetFigFont{5}{6.0}{\familydefault}{\mddefault}{\updefault}MPHF$_0$}}}}
+\put(2521,-4786){\makebox(0,0)[lb]{\smash{{\SetFigFont{5}{6.0}{\familydefault}{\mddefault}{\updefault}MPHF$_1$}}}}
+\put(3016,-4786){\makebox(0,0)[lb]{\smash{{\SetFigFont{5}{6.0}{\familydefault}{\mddefault}{\updefault}MPHF$_2$}}}}
+\put(3826,-4786){\makebox(0,0)[lb]{\smash{{\SetFigFont{5}{6.0}{\familydefault}{\mddefault}{\updefault}MPHF$_{\lceil n/b \rceil - 1}$}}}}
+\end{picture}%
+\vspace{-1mm}
+\caption{Main steps of our algorithm}
+\label{fig:new-algo-main-steps}
+\vspace{-3mm}
+\end{figure}
+
+The partitioning step takes a key set $S$ and uses a universal hash function
+$h_0$ proposed by Jenkins~\cite{j97}
+%for each key $k \in S$ of length $|k|$
+to transform each key~$k\in S$ into an integer~$h_0(k)$.
+Reducing~$h_0(k)$ modulo~$\lceil n/b\rceil$, we partition~$S$ into $\lceil n/b
+\rceil$ buckets, each containing at most 256 keys (with high
+probability).
+
+The searching step generates a MPHF$_i$ for each bucket $i$,
+$0 \leq i < \lceil n/b \rceil$.
+The resulting MPHF $h(k)$, $k \in S$, is given by
+\begin{eqnarray}\label{eq:mphf}
+h(k) = \mathrm{MPHF}_i (k) + \mathit{offset}[i],
+\end{eqnarray}
+where~$i=h_0(k)\bmod\lceil n/b\rceil$.
+The $i$th entry~$\mathit{offset}[i]$ of the displacement vector
+$\mathit{offset}$, $0 \leq i < \lceil n/b \rceil$, contains the total number
+of keys in buckets 0 through $i-1$; that is, it gives the starting position in
+the hash table of the interval of keys addressed by MPHF$_i$. In the following
+we explain each step in detail.
+
+
+
diff --git a/vldb07/thedataandsetup.tex b/vldb07/thedataandsetup.tex
new file mode 100755
index 0000000..8739705
--- /dev/null
+++ b/vldb07/thedataandsetup.tex
@@ -0,0 +1,21 @@
+% Nivio: 29/jan/06
+% Time-stamp:
+\vspace{-2mm}
+\subsection{The data and the experimental setup}
+\label{sec:data-exper-set}
+
+The algorithms were implemented in the C language and
+are available at \texttt{http://\-cmph.sf.net}
+under the GNU Lesser General Public License (LGPL).
+% free software licence.
+All experiments were carried out on
+a computer running the Linux operating system, version 2.6,
+with a 2.4 gigahertz processor and
+1 gigabyte of main memory.
+In the experiments related to the new
+algorithm we limited the main memory to 500 megabytes.
+
+Our data consists of a collection of 1 billion
+URLs collected from the Web, each URL 64 characters long on average.
+The collection occupies 60.5 gigabytes on disk.
+
diff --git a/vldb07/vldb.tex b/vldb07/vldb.tex
new file mode 100644
index 0000000..618c108
--- /dev/null
+++ b/vldb07/vldb.tex
@@ -0,0 +1,194 @@
+%%%%%%%%%%%%%%%%%%%%%%% file template.tex %%%%%%%%%%%%%%%%%%%%%%%%%
+%
+% This is a template file for the LaTeX package SVJour2 for the
+% Springer journal "The VLDB Journal".
+%
+% Springer Heidelberg 2004/12/03
+%
+% Copy it to a new file with a new name and use it as the basis
+% for your article. Delete % as needed.
+% +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +% +% First comes an example EPS file -- just ignore it and +% proceed on the \documentclass line +% your LaTeX will extract the file if required +%\begin{filecontents*}{figs/minimalperfecthash-ph-mph.ps} +%!PS-Adobe-3.0 EPSF-3.0 +%%BoundingBox: 19 19 221 221 +%%CreationDate: Mon Sep 29 1997 +%%Creator: programmed by hand (JK) +%%EndComments +%gsave +%newpath +% 20 20 moveto +% 20 220 lineto +% 220 220 lineto +% 220 20 lineto +%closepath +%2 setlinewidth +%gsave +% .4 setgray fill +%grestore +%stroke +%grestore +%\end{filecontents*} +% +\documentclass[twocolumn,fleqn,runningheads]{svjour2} +% +\smartqed % flush right qed marks, e.g. at end of proof +% +\usepackage{graphicx} +\usepackage{listings} +\usepackage{epsfig} +\usepackage{textcomp} +\usepackage[latin1]{inputenc} +\usepackage{amssymb} + +%\DeclareGraphicsExtensions{.png} +% +% \usepackage{mathptmx} % use Times fonts if available on your TeX system +% +% insert here the call for the packages your document requires +%\usepackage{latexsym} +% etc. 
+%
+% please place your own definitions here and don't use \def but
+% \newcommand{}{}
+%
+
+\lstset{
+  language=Pascal,
+  basicstyle=\fontsize{9}{9}\selectfont,
+  captionpos=t,
+  aboveskip=1mm,
+  belowskip=1mm,
+  abovecaptionskip=1mm,
+  belowcaptionskip=1mm,
+% numbers = left,
+  mathescape=true,
+  escapechar=@,
+  extendedchars=true,
+  showstringspaces=false,
+  columns=fixed,
+  basewidth=0.515em,
+  frame=single,
+  framesep=2mm,
+  xleftmargin=2mm,
+  xrightmargin=2mm,
+  framerule=0.5pt
+}
+
+\def\cG{{\mathcal G}}
+\def\crit{{\rm crit}}
+\def\ncrit{{\rm ncrit}}
+\def\scrit{{\rm scrit}}
+\def\bedges{{\rm bedges}}
+\def\ZZ{{\mathbb Z}}
+
+\journalname{The VLDB Journal}
+%
+
+\begin{document}
+
+\title{Space and Time Efficient Minimal Perfect Hash \\[0.2cm]
+Functions for Very Large Databases\thanks{
+This work was supported in part by
+GERINDO Project--grant MCT/CNPq/CT-INFO 552.087/02-5,
+CAPES/PROF Scholarship (Fabiano C. Botelho),
+FAPESP Proj.\ Tem.\ 03/09925-5 and CNPq Grant 30.0334/93-1
+(Yoshiharu Kohayakawa),
+and CNPq Grant 30.5237/02-0 (Nivio Ziviani).}
+}
+%\subtitle{Do you have a subtitle?\\ If so, write it here}
+
+%\titlerunning{Short form of title} % if too long for running head
+
+\author{Fabiano C. Botelho \and Davi C. Reis \and Yoshiharu Kohayakawa \and Nivio Ziviani}
+%\authorrunning{Short form of author list} % if too long for running head
+\institute{
+F. C. Botelho \and
+N. Ziviani \at
+Dept. of Computer Science,
+Federal Univ. of Minas Gerais,
+Belo Horizonte, Brazil\\
+\email{\{fbotelho,nivio\}@dcc.ufmg.br}
+\and
+D. C. Reis \at
+Google, Brazil \\
+\email{davi.reis@gmail.com}
+\and
+Y. Kohayakawa \at
+Dept. of Computer Science,
+Univ.
of S\~ao Paulo,
+S\~ao Paulo, Brazil\\
+\email{yoshi@ime.usp.br}
+}
+
+\date{Received: date / Accepted: date}
+% The correct dates will be entered by the editor
+
+
+\maketitle
+
+\begin{abstract}
+We propose a novel external-memory algorithm for constructing minimal
+perfect hash functions~$h$ for huge sets of keys.
+For a set of~$n$ keys, our algorithm outputs~$h$ in time~$O(n)$.
+The algorithm needs a small vector of one-byte entries
+in main memory to construct $h$.
+The evaluation of~$h(x)$ requires three memory accesses for any key~$x$.
+The description of~$h$ takes a constant number of bits
+per key, which is asymptotically optimal: the theoretical lower bound is
+$1/\ln 2 \approx 1.44$ bits per key.
+In our experiments, we used a collection of 1 billion URLs collected
+from the Web, each URL 64 characters long on average.
+For this collection, our algorithm
+(i) finds a minimal perfect hash function in approximately
+3 hours using a commodity PC,
+(ii) needs just 5.45 megabytes of internal memory to generate $h$,
+and (iii) takes 8.1 bits per key for the description of~$h$.
+\keywords{Minimal Perfect Hashing \and Large Databases}
+\end{abstract}
+
+% main text
+
+\def\BSmax{\mathit{BS}_{\mathit{max}}}
+\def\Bi{\mathop{\rm Bi}\nolimits}
+
+\input{introduction}
+%\input{terminology}
+\input{relatedwork}
+\input{thealgorithm}
+\input{partitioningthekeys}
+\input{searching}
+%\input{computingoffset}
+%\input{hashingbuckets}
+\input{determiningb}
+%\input{analyticalandexperimentalresults}
+\input{analyticalresults}
+%\input{results}
+\input{conclusions}
+
+
+
+
+%\input{acknowledgments}
+%\begin{acknowledgements}
+%If you'd like to thank anyone, place your comments here
+%and remove the percent signs.
+%\end{acknowledgements} + +% BibTeX users please use +%\bibliographystyle{spmpsci} +%\bibliography{} % name your BibTeX data base +\bibliographystyle{plain} +\bibliography{references} +\input{appendix} +\end{document}