From 37444720b585b15dd6b5329fc58a16a0392c87c7 Mon Sep 17 00:00:00 2001 From: "Fabiano C. Botelho" Date: Fri, 12 Jun 2009 19:42:24 -0300 Subject: [PATCH] vldb07 directory removed --- vldb07/acknowledgments.tex | 7 - vldb07/analyticalresults.tex | 174 --- vldb07/appendix.tex | 6 - vldb07/conclusions.tex | 42 - vldb07/costhashingbuckets.tex | 177 --- vldb07/determiningb.tex | 146 --- vldb07/diskaccess.tex | 113 -- vldb07/experimentalresults.tex | 15 - vldb07/figs/bmz_temporegressao.png | Bin 5769 -> 0 bytes vldb07/figs/brz-partitioning.fig | 107 -- vldb07/figs/brz-partitioningfabiano.fig | 126 -- vldb07/figs/brz.fig | 183 --- vldb07/figs/brz_temporegressao.png | Bin 5671 -> 0 bytes vldb07/figs/brzfabiano.fig | 153 --- vldb07/figs/minimalperfecthash-ph-mph.png | Bin 3916 -> 0 bytes vldb07/introduction.tex | 109 -- vldb07/makefile | 17 - vldb07/partitioningthekeys.tex | 141 -- vldb07/performancenewalgorithm.tex | 113 -- vldb07/references.bib | 814 ------------ vldb07/relatedwork.tex | 112 -- vldb07/searching.tex | 155 --- vldb07/svglov2.clo | 77 -- vldb07/svjour2.cls | 1419 --------------------- vldb07/terminology.tex | 18 - vldb07/thealgorithm.tex | 78 -- vldb07/thedataandsetup.tex | 21 - vldb07/vldb.tex | 194 --- 28 files changed, 4517 deletions(-) delete mode 100755 vldb07/acknowledgments.tex delete mode 100755 vldb07/analyticalresults.tex delete mode 100644 vldb07/appendix.tex delete mode 100755 vldb07/conclusions.tex delete mode 100755 vldb07/costhashingbuckets.tex delete mode 100755 vldb07/determiningb.tex delete mode 100755 vldb07/diskaccess.tex delete mode 100755 vldb07/experimentalresults.tex delete mode 100644 vldb07/figs/bmz_temporegressao.png delete mode 100644 vldb07/figs/brz-partitioning.fig delete mode 100755 vldb07/figs/brz-partitioningfabiano.fig delete mode 100755 vldb07/figs/brz.fig delete mode 100644 vldb07/figs/brz_temporegressao.png delete mode 100755 vldb07/figs/brzfabiano.fig delete mode 100644 vldb07/figs/minimalperfecthash-ph-mph.png delete mode 100755 vldb07/introduction.tex delete mode 100755 vldb07/makefile delete mode 100755 vldb07/partitioningthekeys.tex delete mode 100755 vldb07/performancenewalgorithm.tex delete mode 100755 vldb07/references.bib delete mode 100755 vldb07/relatedwork.tex delete mode 100755 vldb07/searching.tex delete mode 100644 vldb07/svglov2.clo delete mode 100644 vldb07/svjour2.cls delete mode 100755 vldb07/terminology.tex delete mode 100755 vldb07/thealgorithm.tex delete mode 100755 vldb07/thedataandsetup.tex delete mode 100644 vldb07/vldb.tex diff --git a/vldb07/acknowledgments.tex b/vldb07/acknowledgments.tex deleted file mode 100755 index d903ceb..0000000 --- a/vldb07/acknowledgments.tex +++ /dev/null @@ -1,7 +0,0 @@ -\section{Acknowledgments} -This section is optional; it is a location for you -to acknowledge grants, funding, editing assistance and -what have you. In the present case, for example, the -authors would like to thank Gerald Murray of ACM for -his help in codifying this \textit{Author's Guide} -and the \textbf{.cls} and \textbf{.tex} files that it describes. diff --git a/vldb07/analyticalresults.tex b/vldb07/analyticalresults.tex deleted file mode 100755 index 06ea049..0000000 --- a/vldb07/analyticalresults.tex +++ /dev/null @@ -1,174 +0,0 @@ -%% Nivio: 23/jan/06 29/jan/06 -% Time-stamp: -\enlargethispage{2\baselineskip} -\section{Analytical results} -\label{sec:analytcal-results} - -\vspace{-1mm} -The purpose of this section is fourfold. -First, we show that our algorithm runs in expected time $O(n)$. 
-Second, we present the main memory requirements for constructing the MPHF.
-Third, we discuss the cost of evaluating the resulting MPHF.
-Fourth, we present the space required to store the resulting MPHF.
-
-\vspace{-2mm}
-\subsection{The linear time complexity}
-\label{sec:linearcomplexity}
-
-First, we show that the partitioning step presented in
-Figure~\ref{fig:partitioningstep} runs in $O(n)$ time.
-Each iteration of the {\bf for} loop in statement~1
-runs in $O(|B_j|)$ time, $1 \leq j \leq N$, where $|B_j|$ is the
-number of keys that fit in block $B_j$ of size $\mu$.
-This is because statement 1.1 just reads $|B_j|$ keys from disk,
-statement 1.2 runs a bucket-sort-like algorithm,
-which is well known to be linear in the number of keys it sorts (i.e., $|B_j|$ keys),
-and statement 1.3 just dumps $|B_j|$ keys to File $j$ on disk.
-Thus, the {\bf for} loop runs in $O(\sum_{j=1}^{N}|B_j|)$ time.
-As $\sum_{j=1}^{N}|B_j|=n$, the partitioning step runs in $O(n)$ time.
-
-Second, we show that the searching step presented in
-Figure~\ref{fig:searchingstep} also runs in $O(n)$ time.
-The heap construction in statement 1 runs in $O(N)$ time, for $N \ll n$.
-We have assumed that insertions and deletions in the heap cost $O(1)$ because
-$N$ is typically much smaller than $n$ (see \cite[Section 6.4]{bkz06t} for details).
-Statement 2 runs in $O(\sum_{i=0}^{\lceil n/b \rceil - 1} \mathit{size}[i])$ time
-(remember that $\mathit{size}[i]$ stores the number of keys in bucket $i$).
-As $\sum_{i=0}^{\lceil n/b \rceil - 1} \mathit{size}[i] = n$, if
-statements 2.1, 2.2 and 2.3 run in $O(\mathit{size}[i])$ time, then statement 2
-runs in $O(n)$ time.
-
-Statement 2.1 reads $O(\mathit{size}[i])$ keys of bucket $i$
-and is detailed in Figure~\ref{fig:readingbucket}.
-As we are assuming that each read or write on disk costs $O(1)$ and
-each heap operation also costs $O(1)$, statement~2.1
-takes $O(\mathit{size}[i])$ time.
-However, the keys of bucket $i$ are distributed among at most~$BS_{max}$ files on disk
-in the worst case
-(recall that $BS_{max}$ is the maximum number of keys found in any bucket).
-Therefore, we need to take into account that
-the critical step in reading a bucket is statement~1.3 of Figure~\ref{fig:readingbucket},
-where a seek operation in File $j$
-may be performed by the first read operation.
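To make this concrete, the following sketch (ours, not the paper's; all helper names and types are hypothetical) shows one way statement~2.1 can drive the heap-based multiway merge, assuming the per-file buffers described next:

\begin{verbatim}
/* Our sketch of statement 2.1; helper names are hypothetical.
 * The heap H orders the N files by the smallest bucket index
 * they still contain; read_key() serves keys from buffer j and
 * refills it from File j (at most one seek per refill). */
void read_bucket(int i, char **bucket, heap_t *H, buffer_t *buf)
{
    int k = 0;
    while (!heap_empty(H) && heap_min_bucket(H) == i) {
        int j = heap_extract_min(H);      /* file holding the next key */
        bucket[k++] = read_key(&buf[j]);  /* buffered read from File j */
        if (!file_exhausted(&buf[j]))     /* reinsert file j keyed by  */
            heap_insert(H, j, next_bucket_index(&buf[j]));
    }
}
\end{verbatim}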
-In order to amortize the number of seeks performed we use a buffering technique~\cite{k73}.
-We create a buffer $j$ of size \textbaht$\: = \mu/N$ for each file $j$,
-where $1\leq j \leq N$
-(recall that $\mu$ is the size in bytes of an a priori reserved internal memory area).
-Every time a read operation is requested on file $j$ and the data is not found
-in the $j$th~buffer, \textbaht~bytes are read from file $j$ into buffer $j$.
-Hence, the number of seeks performed in the worst case is given by
-$\beta/$\textbaht~(remember that $\beta$ is the size in bytes of $S$).
-Here we have made the pessimistic assumption that one seek happens every time
-buffer $j$ is refilled.
-Thus, the number of seeks performed in the worst case is $64n/$\textbaht, since
-each URL is 64 bytes long on average. Therefore, the number of seeks is linear in
-$n$ and amortized by \textbaht.
-
-It is important to emphasize two things.
-First, the operating system uses techniques
-to diminish the number of seeks and the average seek time.
-In practice, this makes the amortization factor greater than \textbaht.
-Second, almost all main memory is available to be used as
-file buffers, because just a small vector
-of $\lceil n/b\rceil$ one-byte entries must be maintained in main memory,
-as we show in Section~\ref{sec:memconstruction}.
-
-Statement 2.2 runs our internal memory based algorithm in order to generate a MPHF for each bucket.
-That algorithm is linear, as we showed in~\cite{bkz05}. As it is applied to buckets with {\it size}$[i]$ keys,
-statement~2.2 takes $O(\mathit{size}[i])$ time.
-
-Statement 2.3 has time complexity $O(\mathit{size}[i])$ because it writes to disk
-the description of each generated MPHF, and each description is stored in
-$c \times \mathit{size}[i] + O(1)$ bytes, where $c\in[0.93,1.15]$.
-In conclusion, our algorithm takes $O(n)$ time because both the partitioning and
-the searching steps run in $O(n)$ time.
-
-An experimental validation of the above proof and a performance comparison with
-our internal memory based algorithm~\cite{bkz05} were not included here due to
-space restrictions, but they can be found in~\cite{bkz06t} and also in the appendix.
-
-\vspace{-1mm}
-\enlargethispage{2\baselineskip}
-\subsection{Space used for constructing a MPHF}
-\label{sec:memconstruction}
-
-The vector {\it size} is kept in main memory at all times.
-It has $\lceil n/b \rceil$ one-byte entries and stores the number of keys
-in each bucket; those values are less than or equal to 256.
-For example, for a set of 1 billion keys and $b=175$ the vector {\it size} needs
-$5.45$ megabytes of main memory.
-
-We need an internal memory area of size $\mu$ bytes to be used in
-the partitioning step and in the searching step.
-The size $\mu$ is fixed a priori and depends only on the amount
-of internal memory available to run the algorithm
-(i.e., it does not depend on the size $n$ of the problem).
-
-The additional space required in the searching step
-is constant, since the problem has been broken down
-into several small problems (buckets of at most 256 keys) and
-the heap size is much smaller than $n$ ($N \ll n$).
-For example, for a set of 1 billion keys and an internal area of~$\mu = 250$ megabytes,
-the number of files is $N = 248$.
-
-\vspace{-1mm}
-\subsection{Evaluation cost of the MPHF}
-
-Now we consider the amount of CPU time
-required by the resulting MPHF at retrieval time.
-The MPHF requires for each key the computation of three
-universal hash functions and three memory accesses
-(see Eqs.~(\ref{eq:mphf}), (\ref{eq:bucketindex}) and (\ref{eq:mphfi})).
-This is not optimal. Pagh~\cite{p99} showed that any MPHF requires
-at least the computation of two universal hash functions and one memory
-access.
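To illustrate where the three hash computations and the three memory accesses ($\mathit{offset}[i]$ plus two probes into $g_i$) come from, here is a sketch in C. It is ours, not the cmph implementation: Eqs.~(\ref{eq:mphf})--(\ref{eq:mphfi}) are not reproduced in this excerpt, so the exact rule for combining the two probes is an assumption, and \texttt{hash()} stands for any seeded universal hash function.

\begin{verbatim}
extern unsigned int nb;           /* number of buckets, ceil(n/b)   */
extern unsigned int *offset;      /* offset vector                  */
extern unsigned int *m;           /* m[i]: entries of g_i           */
extern unsigned int seed0, *seed1, *seed2;
extern unsigned char *size, **g;  /* per-bucket sizes and g_i data  */
extern unsigned int hash(unsigned int seed, const char *k,
                         unsigned int len);

unsigned int mphf(const char *key, unsigned int len)
{
    unsigned int i  = hash(seed0, key, len) % nb;       /* Eq. (2) */
    unsigned int v1 = hash(seed1[i], key, len) % m[i];  /* probe 1 */
    unsigned int v2 = hash(seed2[i], key, len) % m[i];  /* probe 2 */
    /* assumed combination rule; the exact one is Eq. (3) */
    return offset[i] + (g[i][v1] + g[i][v2]) % size[i];
}
\end{verbatim}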
-
-\subsection{Description size of the MPHF}
-
-The number of bits required to store the MPHF generated by the algorithm
-is computed by Eq.~(\ref{eq:newmphfbits}).
-We need to store each $g_i$ vector presented in Eq.~(\ref{eq:mphfi}), where
-$0\leq i < \lceil n/b \rceil$. As each bucket has at most 256 keys, each
-entry in a $g_i$ vector has 8~bits. In each $g_i$ vector there are
-$c \times \mathit{size}[i]$ entries (recall $c\in[0.93, 1.15]$).
-When we sum up the number of entries of the $\lceil n/b \rceil$ $g_i$ vectors we have
-$c\sum_{i=0}^{\lceil n/b \rceil -1} \mathit{size}[i]=cn$ entries. We also need to
-store $3 \lceil n/b \rceil$ integer numbers of
-$\log_2n$ bits, referring respectively to the {\it offset} vector and the two random seeds of
-$h_{1i}$ and $h_{2i}$. In addition, we need to store $\lceil n/b \rceil$ 8-bit entries of
-the vector {\it size}. Therefore,
-\begin{eqnarray}\label{eq:newmphfbits}
-\mathrm{Required\: Space} = 8cn + \frac{n}{b}\left( 3\log_2n +8\right) \:
-\mathrm{bits}.
-\end{eqnarray}
-
-Considering $c=0.93$ and $b=175$, the number of bits per key to store
-the description of the resulting MPHF for a set of 1~billion keys is $8.1$.
-If we set $b=128$, then the bits per key ratio increases to $8.3$.
-Theoretically, the number of bits required to store the MPHF in
-Eq.~(\ref{eq:newmphfbits})
-is $O(n\log n)$ as~$n\to\infty$. However, for sets of size up to $2^{b/3}$ keys
-the number of bits per key is lower than 9~bits (note that
-$2^{b/3}>2^{58}>10^{17}$ for $b=175$).
-Thus, in practice the resulting function is stored in $O(n)$ bits.
diff --git a/vldb07/appendix.tex b/vldb07/appendix.tex
deleted file mode 100644
index 288ad8a..0000000
--- a/vldb07/appendix.tex
+++ /dev/null
@@ -1,6 +0,0 @@
-\appendix
-\input{experimentalresults}
-\input{thedataandsetup}
-\input{costhashingbuckets}
-\input{performancenewalgorithm}
-\input{diskaccess}
diff --git a/vldb07/conclusions.tex b/vldb07/conclusions.tex
deleted file mode 100755
index 8d32741..0000000
--- a/vldb07/conclusions.tex
+++ /dev/null
@@ -1,42 +0,0 @@
-% Time-stamp:
-\enlargethispage{2\baselineskip}
-\section{Concluding remarks}
-\label{sec:concuding-remarks}
-
-This paper has presented a novel external memory based algorithm for
-constructing MPHFs that works for sets on the order of billions of keys. The
-algorithm outputs the resulting function in~$O(n)$ time and, furthermore, it
-can be tuned to run only $34\%$ slower (see \cite{bkz06t} for details) than the fastest
-algorithm available in the literature for constructing MPHFs~\cite{bkz05}.
-In addition, the space
-requirement of the resulting MPHF is up to 9 bits per key for datasets of
-up to $2^{58}\simeq10^{17.4}$ keys.
-
-The algorithm is simple and needs just a
-small vector of approximately 5.45 megabytes in main memory to construct
-a MPHF for a collection of 1 billion URLs, each one 64 bytes long on average.
-Therefore, almost all main memory is available to be used as disk I/O buffer.
-Making use of such a buffering scheme with an internal memory area of size
-$\mu=200$ megabytes, our algorithm can produce a MPHF for a
-set of 1 billion URLs in approximately 3.6 hours using a 2.4 gigahertz
-commodity PC with 500 megabytes of main memory.
-If we increase both the main memory
-available to 1 gigabyte and the internal memory area to $\mu=500$ megabytes,
-a MPHF for the set of 1 billion URLs is produced in approximately 3 hours. For any
-key, the evaluation of the resulting MPHF takes three memory accesses and the
-computation of three universal hash functions.
-
-In order to allow the reproduction of our results and the use of both the
-internal memory based and the external memory based algorithms,
-they are available at \texttt{http://cmph.sf.net}
-under the GNU Lesser General Public License (LGPL).
-They were implemented in the C language.
-
-In future work, we will exploit the fact that the searching step intrinsically
-presents a high degree of parallelism and requires $73\%$ of the
-construction time. A parallel implementation of our algorithm will
-allow the construction and the evaluation of the resulting function in parallel.
-The descriptions of the resulting MPHFs will then be distributed in the parallel
-computer, allowing the approach to scale to sets of hundreds of billions of keys.
-This is an important contribution, mainly for applications related to the Web, as
-mentioned in Section~\ref{sec:intro}.
\ No newline at end of file
diff --git a/vldb07/costhashingbuckets.tex b/vldb07/costhashingbuckets.tex
deleted file mode 100755
index 610ab6d..0000000
--- a/vldb07/costhashingbuckets.tex
+++ /dev/null
@@ -1,177 +0,0 @@
-% Nivio: 29/jan/06
-% Time-stamp:
-\vspace{-2mm}
-\subsection{Performance of the internal memory based algorithm}
-\label{sec:intern-memory-algor}
-
-Our three-step internal memory based algorithm presented in~\cite{bkz05}
-is used for constructing a MPHF for each bucket.
-It is a randomized algorithm because it needs to generate a simple random graph
-in its first step.
-Once the graph is obtained, the other two steps are deterministic.
-
-Thus, we can consider the runtime of the algorithm to have the form~$\alpha
-nZ$ for an input of~$n$ keys, where~$\alpha$ is some machine dependent
-constant that further depends on the length of the keys, and~$Z$ is a random
-variable with geometric distribution with mean~$1/p=e^{1/c^2}$ (see
-Section~\ref{sec:mphfbucket}). All results in our experiments were obtained
-taking $c=1$; the value of~$c$, with~$c\in[0.93,1.15]$, in fact has little
-influence on the runtime, as shown in~\cite{bkz05}.
-
-The values chosen for $n$ were $1, 2, 4, 8, 16$ and $32$ million.
-Although we have a dataset with 1~billion URLs, on a PC with
-1~gigabyte of main memory the algorithm is able
-to handle an input of at most 32 million keys.
-This is mainly because of the graph we need to keep in main memory.
-The algorithm requires $25n + O(1)$ bytes for constructing
-a MPHF (details about the data structures used by the algorithm can
-be found in~\texttt{http://cmph.sf.net}).
-
-In order to estimate the number of trials for each value of $n$ we use
-a statistical method for determining a suitable sample size (see, e.g.,
-\cite[Chapter 13]{j91}).
-As we obtained different values for each $n$,
-we used the maximal value obtained, namely, 300~trials, in order to have
-a confidence level of $95\%$.
-
-Table~\ref{tab:medias} presents the runtime average for each $n$,
-the respective standard deviations, and
-the respective confidence intervals given by
-the average time $\pm$ the distance from the average time,
-considering a confidence level of $95\%$.
-Observing the runtime averages one sees that
-the algorithm runs in expected linear time,
-as shown in~\cite{bkz05}.
-
-\vspace{-2mm}
-\begin{table*}[htb]
-\begin{center}
-{\scriptsize
-\begin{tabular}{|c|c|c|c|c|c|c|}
-\hline
-$n$ (millions) & 1 & 2 & 4 & 8 & 16 & 32 \\
-\hline
-Average time (s)& $6.1 \pm 0.3$ & $12.2 \pm 0.6$ & $25.4 \pm 1.1$ & $51.4 \pm 2.0$ & $117.3 \pm 4.4$ & $262.2 \pm 8.7$\\
-SD (s) & $2.6$ & $5.4$ & $9.8$ & $17.6$ & $37.3$ & $76.3$ \\
-\hline
-\end{tabular}
-\vspace{-1mm}
-}
-\end{center}
-\caption{Internal memory based algorithm: average time in seconds for constructing a MPHF,
-the standard deviation (SD), and the confidence intervals considering
-a confidence level of $95\%$.}
-\label{tab:medias}
-\vspace{-4mm}
-\end{table*}
-
-\enlargethispage{2\baselineskip}
-Figure~\ref{fig:bmz_temporegressao}
-presents the runtime for each trial. In addition,
-the solid line corresponds to a linear regression model
-obtained from the experimental measurements.
-As we can see, the runtime for a given $n$ has a considerable
-fluctuation. However, the fluctuation also grows linearly with $n$.
-
-\begin{figure}[htb]
-\vspace{-2mm}
-\begin{center}
-\scalebox{0.4}{\includegraphics{figs/bmz_temporegressao}}
-\caption{Time versus number of keys in $S$ for the internal memory based algorithm.
-The solid line corresponds to a linear regression model.}
-\label{fig:bmz_temporegressao}
-\end{center}
-\vspace{-6mm}
-\end{figure}
-
-The observed fluctuation in the runtimes is as expected; recall that this
-runtime has the form~$\alpha nZ$ with~$Z$ a geometric random variable with
-mean~$1/p=e$. Thus, the runtime has mean~$\alpha n/p=\alpha en$ and standard
-deviation~$\alpha n\sqrt{(1-p)/p^2}=\alpha n\sqrt{e(e-1)}$.
-Therefore, the standard deviation also grows
-linearly with $n$, as experimentally verified
-in Table~\ref{tab:medias} and in Figure~\ref{fig:bmz_temporegressao}.
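The geometric behavior of $Z$ comes directly from the structure of the first step; a schematic version (ours, with hypothetical helper names) of the retry loop is:

\begin{verbatim}
/* Schematic retry loop for the first step (illustration only).
 * For c = 1 the expected number of iterations is 1/p = e^(1/c^2) = e;
 * the other two steps then run deterministically. */
unsigned int Z = 0, s1, s2;
graph_t *G;
do {
    s1 = next_seed(); s2 = next_seed();  /* new seeds for h1 and h2 */
    Z++;                                 /* Z follows Geom(p)       */
    G = build_random_graph(keys, nkeys, c, s1, s2);
} while (!is_simple(G));
\end{verbatim}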
-%\noindent-------------------------------------------------------------------------\\
-%Comment for Yoshi: I could not quite reproduce what we discussed
-%above; I think you will be able to justify it better :-). \\
-%-------------------------------------------------------------------------\\
diff --git a/vldb07/determiningb.tex b/vldb07/determiningb.tex
deleted file mode 100755
index e9b3cb2..0000000
--- a/vldb07/determiningb.tex
+++ /dev/null
@@ -1,146 +0,0 @@
-% Nivio: 29/jan/06
-% Time-stamp:
-\enlargethispage{2\baselineskip}
-\subsection{Determining~$b$}
-\label{sec:determining-b}
-\begin{table*}[t]
-\begin{center}
-{\small %\scriptsize
-\begin{tabular}{|c|ccc|ccc|}
-\hline
-\raisebox{-0.7em}{$n$} & \multicolumn{3}{c|}{\raisebox{-1mm}{$b=128$}} &
-\multicolumn{3}{c|}{\raisebox{-1mm}{$b=175$}}\\
-\cline{2-4} \cline{5-7}
- & \raisebox{-0.5mm}{Worst Case} & \raisebox{-0.5mm}{Average} &\raisebox{-0.5mm}{Eq.~(\ref{eq:maxbs})}
- & \raisebox{-0.5mm}{Worst Case} & \raisebox{-0.5mm}{Average} &\raisebox{-0.5mm}{Eq.~(\ref{eq:maxbs})} \\
-\hline
-$1.0 \times 10^6$ & 177 & 172.0 & 176 & 232 & 226.6 & 230 \\
-$4.0 \times 10^6$ & 182 & 177.5 & 179 & 241 & 231.8 & 234 \\
-$1.6 \times 10^7$ & 184 & 181.6 & 183 & 241 & 236.1 & 238 \\
-$6.4 \times 10^7$ & 195 & 185.2 & 186 & 244 & 239.0 & 242 \\
-$5.12 \times 10^8$ & 196 & 191.7 & 190 & 251 & 246.3 & 247 \\
-$1.0 \times 10^9$ & 197 & 191.6 & 192 & 253 & 248.9 & 249 \\
-\hline
-\end{tabular}
-\vspace{-1mm}
-}
-\end{center}
-\caption{Values for $\mathit{BS}_{\mathit{max}}$: worst case and average case obtained in the experiments and using Eq.~(\ref{eq:maxbs}),
-considering $b=128$ and $b=175$, for different numbers $n$ of keys in $S$.}
-\label{tab:comparison}
-\vspace{-6mm}
-\end{table*}
-
-The partitioning step can be viewed as the well known ``balls into bins''
-problem~\cite{ra98,dfm02}, where~$n$ keys (the balls) are placed independently and
-uniformly into $\lceil n/b\rceil$ buckets (the bins). The main question
-we are interested in is: what is the maximum number of keys in any bucket?
-In fact, we want the maximum value of $b$ for which the maximum number of keys in any bucket
-is no greater than 256 with high probability.
-This is important, as we wish to use 8 bits per entry in the vector $g_i$ of
-each $\mathrm{MPHF}_i$,
-where $0 \leq i < \lceil n/b\rceil$.
-Let $\mathit{BS}_{\mathit{max}}$ be the maximum number of keys in any bucket.
-
-Clearly, $\BSmax$ is the maximum
-of~$\lceil n/b\rceil$ random variables~$Z_i$, each with binomial
-distribution~$\Bi(n,p)$ with parameters~$n$ and~$p=1/\lceil n/b\rceil$.
-However, the~$Z_i$ are not independent. Note that~$\Bi(n,p)$ has mean and
-variance~$\simeq b$. To give an upper estimate for the probability of the
-event~$\BSmax\geq \gamma$, we can estimate the probability that~$Z_i\geq \gamma$
-for a fixed~$i$, and then sum these estimates over all~$i$ (a union bound).
-Let~$\gamma=b+\sigma\sqrt{b\ln(n/b)}$, where~$\sigma=\sqrt2$.
-Approximating~$\Bi(n,p)$ by the normal distribution with mean and
-variance~$b$, we obtain the
-estimate~$(\sigma\sqrt{2\pi\ln(n/b)})^{-1}\times\exp(-(1/2)\sigma^2\ln(n/b))$ for
-the probability that~$Z_i\geq \gamma$ occurs, which, summed over all~$i$, gives
-that the probability that~$\BSmax\geq \gamma$ occurs is at
-most~$1/(\sigma\sqrt{2\pi\ln(n/b)})$, which tends to~$0$ as~$n\to\infty$.
-Thus, we have shown that, with high probability,
-\begin{equation}
-  \label{eq:maxbs}
-  \BSmax\leq b+\sqrt{2b\ln{n\over b}}.
-\end{equation}
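As a quick numerical check of Eq.~(\ref{eq:maxbs}) (the arithmetic is ours): for $n=10^9$ and $b=175$ it gives $\BSmax\leq 175+\sqrt{2\cdot175\cdot\ln(10^9/175)}\approx 175+\sqrt{5446}\approx 249$, and for $b=128$ it gives $128+\sqrt{2\cdot128\cdot\ln(10^9/128)}\approx 192$; both values match the last row of Table~\ref{tab:comparison}.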
-
-In our algorithm the maximum number of keys in any bucket must be at most 256.
-Table~\ref{tab:comparison} presents the values for $\mathit{BS}_{\mathit{max}}$
-obtained experimentally and using Eq.~(\ref{eq:maxbs}).
-The table presents the worst case and the average case,
-considering $b=128$, $b=175$ and Eq.~(\ref{eq:maxbs}),
-for several numbers~$n$ of keys in $S$.
-The estimation given by Eq.~(\ref{eq:maxbs}) is very close to the experimental
-results.
-
-Now we estimate the largest problem our algorithm is able to solve for
-a given $b$.
-Using Eq.~(\ref{eq:maxbs}) with $b=128$ and $b=175$, and imposing
-that~$\mathit{BS}_{\mathit{max}}\leq256$,
-the largest key sets our algorithm
-can deal with have $10^{30}$ and $10^{10}$ keys, respectively.
diff --git a/vldb07/diskaccess.tex b/vldb07/diskaccess.tex
deleted file mode 100755
index 08e54b9..0000000
--- a/vldb07/diskaccess.tex
+++ /dev/null
@@ -1,113 +0,0 @@
-% Nivio: 29/jan/06
-% Time-stamp:
-\vspace{-2mm}
-\subsection{Controlling disk accesses}
-\label{sec:contr-disk-access}
-
-In order to bring down the number of seek operations on disk,
-we benefit from the fact that our algorithm leaves almost all main
-memory available to be used as disk I/O buffer.
-In this section we evaluate how much the parameter $\mu$
-affects the runtime of our algorithm.
-For that we fixed $n$ at 1 billion URLs,
-set the main memory of the machine used for the experiments
-to 1 gigabyte, and used $\mu$ equal to $100, 200, 300, 400, 500$ and $600$
-megabytes.
-
-\enlargethispage{2\baselineskip}
-Table~\ref{tab:diskaccess} presents the number of files $N$,
-the buffer size used for all files, the number of seeks in the worst case considering
-the pessimistic assumption mentioned in Section~\ref{sec:linearcomplexity}, and
-the time to generate a MPHF for 1 billion keys as a function of the amount of internal
-memory available. Observing Table~\ref{tab:diskaccess}, we notice that the time spent on the construction
-decreases as the value of $\mu$ increases. However, for $\mu > 400$, the variation
-in time is not as significant as for $\mu \leq 400$.
-This can be explained by the fact that the Linux 2.6 kernel I/O scheduler
-has smart policies
-for avoiding seeks and diminishing the average seek time
-(see \texttt{http://www.linuxjournal.com/article/6931}).
-\begin{table*}[ht]
-\vspace{-2mm}
-\begin{center}
-{\scriptsize
-\begin{tabular}{|l|c|c|c|c|c|c|}
-\hline
-$\mu$ (MB) & $100$ & $200$ & $300$ & $400$ & $500$ & $600$ \\
-\hline
-$N$ (files) & $619$ & $310$ & $207$ & $155$ & $124$ & $104$ \\
-\textbaht~(buffer size in KB) & $165$ & $661$ & $1,484$ & $2,643$ & $4,129$ & $5,908$ \\
-$\beta$/\textbaht~(\# of seeks in the worst case) & $384,478$ & $95,974$ & $42,749$ & $24,003$ & $15,365$ & $10,738$ \\
-Time (hours) & $4.04$ & $3.64$ & $3.34$ & $3.20$ & $3.13$ & $3.09$ \\
-\hline
-\end{tabular}
-\vspace{-1mm}
-}
-\end{center}
-\caption{Influence of the internal memory area size ($\mu$) on the runtime of our algorithm.}
-\label{tab:diskaccess}
-\vspace{-14mm}
-\end{table*}
diff --git a/vldb07/experimentalresults.tex b/vldb07/experimentalresults.tex
deleted file mode 100755
index 58b4091..0000000
--- a/vldb07/experimentalresults.tex
+++ /dev/null
@@ -1,15 +0,0 @@
-%Nivio: 29/jan/06
-% Time-stamp:
-\vspace{-2mm}
-\enlargethispage{2\baselineskip}
-\section{Appendix: Experimental results}
-\label{sec:experimental-results}
-\vspace{-1mm}
-
-In this section we present the experimental results.
-We start by presenting the experimental setup.
-We then present experimental results for
-the internal memory based algorithm~\cite{bkz05}
-and for our algorithm.
-Finally, we discuss how the amount of internal memory available
-affects the runtime of our algorithm.
diff --git a/vldb07/figs/bmz_temporegressao.png b/vldb07/figs/bmz_temporegressao.png
deleted file mode 100644
index e7198c1cea2b9f5d766b7f422b23a2e259209c79..0000000000000000000000000000000000000000
Binary files a/vldb07/figs/bmz_temporegressao.png and /dev/null differ
diff --git a/vldb07/figs/brz_temporegressao.png b/vldb07/figs/brz_temporegressao.png
deleted file mode 100644
Binary files a/vldb07/figs/brz_temporegressao.png and /dev/null differ
diff --git a/vldb07/figs/minimalperfecthash-ph-mph.png b/vldb07/figs/minimalperfecthash-ph-mph.png
deleted file mode 100644
Binary files a/vldb07/figs/minimalperfecthash-ph-mph.png and /dev/null differ
diff --git a/vldb07/introduction.tex b/vldb07/introduction.tex
deleted file mode 100755
--- a/vldb07/introduction.tex
+++ /dev/null
@@ -1,109 +0,0 @@
-\section{Introduction}
-\label{sec:intro}
-
-\enlargethispage{2\baselineskip}
-Suppose~$U$ is a universe of \textit{keys} of size $u$.
-Let $h:U\to M$ be a {\em hash function} that maps the keys from~$U$
-to a given interval of integers $M=[0,m-1]=\{0,1,\dots,m-1\}$.
-Let~$S\subseteq U$ be a set of~$n$ keys from~$U$, where $n \ll u$.
-Given a key~$x\in S$, the hash function~$h$ computes an integer in
-$[0,m-1]$ for the storage or retrieval of~$x$ in a {\em hash table}.
-% Hashing methods for {\em non-static sets} of keys can be used to construct
-% data structures storing $S$ and supporting membership queries
-% ``$x \in S$?'' in expected time $O(1)$.
-% However, they involve a certain amount of wasted space owing to unused
-% locations in the table and wasted time to resolve collisions when
-% two keys are hashed to the same table location.
-A perfect hash function maps a {\em static set} $S$ of $n$ keys from $U$ into a set of $m$ integer
-numbers without collisions, where $m$ is greater than or equal to $n$.
-If $m$ is equal to $n$, the function is called minimal.
-
-% Figure~\ref{fig:minimalperfecthash-ph-mph}(a) illustrates a perfect hash function and
-% Figure~\ref{fig:minimalperfecthash-ph-mph}(b) illustrates a minimal perfect hash function (MPHF).
-%
-% \begin{figure}
-% \centering
-% \scalebox{0.7}{\epsfig{file=figs/minimalperfecthash-ph-mph.ps}}
-% \caption{(a) Perfect hash function (b) Minimal perfect hash function (MPHF)}
-% \label{fig:minimalperfecthash-ph-mph}
-% \end{figure}
-
-Minimal perfect hash functions are widely used for memory efficient storage and fast
-retrieval of items from static sets, such as words in natural languages,
-reserved words in programming languages or interactive systems, uniform resource
-locators (URLs) in web search engines, or item sets in data mining techniques.
-Search engines nowadays index tens of billions of pages, and algorithms
-like PageRank~\cite{Brin1998}, which uses the web link structure to derive a
-measure of popularity for Web pages, would benefit from a MPHF for storage and
-retrieval of such huge sets of URLs.
-For instance, the TodoBr\footnote{TodoBr ({\texttt www.todobr.com.br}) is a trademark of
-Akwan Information Technologies, which was acquired by Google Inc. in July 2005.}
-search engine used the algorithm proposed hereinafter to
-improve and to scale its link analysis system.
-The WebGraph research group~\cite{bv04} would
-also benefit from a MPHF for sets on the order of billions of URLs to scale
-and to improve the storage requirements of their graph compression algorithms.
-
-Another interesting application for MPHFs is their use as indexing structures
-for databases.
-The B+ tree is very popular as an indexing structure for dynamic applications
-with frequent insertions and deletions of records.
-However, for applications with sporadic modifications and a huge number of
-queries the B+ tree is not the best option,
-because it performs poorly with very large sets of keys
-such as those required for the new frontiers of database applications~\cite{s05}.
-Therefore, there are applications for MPHFs in
-information retrieval systems, database systems, language translation systems,
-electronic commerce systems, compilers, and operating systems, among others.
-
-Until now, because of the limitations of current algorithms,
-the use of MPHFs has been restricted to scenarios where the set of keys being hashed is
-relatively small.
-However, in many cases it is crucial to deal in an efficient way with very large
-sets of keys.
-Due to the exponential growth of the Web, working with huge collections is becoming
-a daily task.
-For instance, even the simple assignment of numeric identifiers to the web pages of a collection
-can be a challenging task.
-While traditional databases simply cannot handle more traffic once the working
-set of URLs does not fit in main memory anymore~\cite{s05}, the algorithm we propose here to
-construct MPHFs can easily scale to billions of entries.
-% using stock hardware.
-
-As there are many applications for MPHFs, it is
-important to design and implement space and time efficient algorithms for
-constructing such functions.
-The attractiveness of using MPHFs depends on the following issues:
-\begin{enumerate}
-\item The amount of CPU time required by the algorithms for constructing MPHFs.
-\item The space requirements of the algorithms for constructing MPHFs.
-\item The amount of CPU time required by a MPHF for each retrieval.
-\item The space requirements of the description of the resulting MPHFs to be
-  used at retrieval time.
-\end{enumerate}
-
-\enlargethispage{2\baselineskip}
-This paper presents a novel external memory based algorithm for constructing MPHFs that
-is very efficient in these four requirements.
-First, the algorithm takes time linear in the size of the input key set to construct a MPHF,
-which is optimal.
-For instance, for a collection of 1 billion URLs
-collected from the web, each one 64 characters long on average, the time to construct a
-MPHF using a 2.4 gigahertz PC with 500 megabytes of available main memory
-is approximately 3 hours.
-Second, the algorithm needs a small, a priori defined vector of $\lceil n/b \rceil$
-one-byte entries in main memory to construct a MPHF.
-For the collection of 1 billion URLs and using $b=175$, the algorithm needs only
-5.45 megabytes of internal memory.
-Third, the evaluation of the MPHF for each retrieval requires three memory accesses and
-the computation of three universal hash functions.
-This is not optimal, as any MPHF requires at least one memory access and the computation
-of two universal hash functions.
-Fourth, the description of a MPHF takes a constant number of bits per key, which is optimal.
-For the collection of 1 billion URLs, it needs 8.1 bits per key,
-while the theoretical lower bound is $1/\ln2 \approx 1.4427$ bits per
-key~\cite{m84}.
-
diff --git a/vldb07/makefile b/vldb07/makefile
deleted file mode 100755
index 1b95644..0000000
--- a/vldb07/makefile
+++ /dev/null
@@ -1,17 +0,0 @@
-all:
-	latex vldb.tex
-	bibtex vldb
-	latex vldb.tex
-	latex vldb.tex
-	dvips vldb.dvi -o vldb.ps
-	ps2pdf vldb.ps
-	chmod -R g+rwx *
-
-perm:
-	chmod -R g+rwx *
-
-run: clean all
-	gv vldb.ps &
-clean:
-	rm *.aux *.bbl *.blg *.log *.ps *.pdf *.dvi
-
diff --git a/vldb07/partitioningthekeys.tex b/vldb07/partitioningthekeys.tex
deleted file mode 100755
index aeaae9b..0000000
--- a/vldb07/partitioningthekeys.tex
+++ /dev/null
@@ -1,141 +0,0 @@
-%% Nivio: 21/jan/06
-% Time-stamp:
-\vspace{-2mm}
-\subsection{Partitioning step}
-\label{sec:partitioning-keys}
-
-The set $S$ of $n$ keys is partitioned into $\lceil n/b \rceil$ buckets,
-where $b$ is a suitable parameter chosen to guarantee
-that each bucket has at most 256 keys with high probability
-(see Section~\ref{sec:determining-b}).
-The partitioning step works as follows: - -\begin{figure}[h] -\hrule -\hrule -\vspace{2mm} -\begin{tabbing} -aa\=type booleanx \== (false, true); \kill -\> $\blacktriangleright$ Let $\beta$ be the size in bytes of the set $S$ \\ -\> $\blacktriangleright$ Let $\mu$ be the size in bytes of an a priori reserved \\ -\> ~~~ internal memory area \\ -\> $\blacktriangleright$ Let $N = \lceil \beta/\mu \rceil$ be the number of key blocks that will \\ -\> ~~~ be read from disk into an internal memory area \\ -\> $\blacktriangleright$ Let $\mathit{size}$ be a vector that stores the size of each bucket \\ -\> $1.$ {\bf for} $j = 1$ {\bf to} $N$ {\bf do} \\ -\> ~~ $1.1$ Read block $B_j$ of keys from disk \\ -\> ~~ $1.2$ Cluster $B_j$ into $\lceil n/b \rceil$ buckets using a bucket sort \\ -\> ~~~~~~~ algorithm and update the entries in the vector {\it size} \\ -\> ~~ $1.3$ Dump $B_j$ to the disk into File $j$\\ -\> $2.$ Compute the {\it offset} vector and dump it to the disk. -\end{tabbing} -\hrule -\hrule -\vspace{-1.0mm} -\caption{Partitioning step} -\vspace{-3mm} -\label{fig:partitioningstep} -\end{figure} - -Statement 1.1 of the {\bf for} loop presented in Figure~\ref{fig:partitioningstep} -reads sequentially all the keys of block $B_j$ from disk into an internal area -of size $\mu$. - -Statement 1.2 performs an indirect bucket sort of the keys in block $B_j$ -and at the same time updates the entries in the vector {\em size}. -Let us briefly describe how~$B_j$ is partitioned among the~$\lceil n/b\rceil$ -buckets. -We use a local array of $\lceil n/b \rceil$ counters to store a -count of how many keys from $B_j$ belong to each bucket. -%At the same time, the global vector {\it size} is computed based on the local -%counters. -The pointers to the keys in each bucket $i$, $0 \leq i < \lceil n/b \rceil$, -are stored in contiguous positions in an array. -For this we first reserve the required number of entries -in this array of pointers using the information from the array of counters. -Next, we place the pointers to the keys in each bucket into the respective -reserved areas in the array (i.e., we place the pointers to the keys in bucket 0, -followed by the pointers to the keys in bucket 1, and so on). - -\enlargethispage{2\baselineskip} -To find the bucket address of a given key -we use the universal hash function $h_0(k)$~\cite{j97}. -Key~$k$ goes into bucket~$i$, where -%Then, for each integer $h_0(k)$ the respective bucket address is obtained -%as follows: -\begin{eqnarray} \label{eq:bucketindex} -i=h_0(k) \bmod \left \lceil \frac{n}{b} \right \rceil. -\end{eqnarray} - -Figure~\ref{fig:brz-partitioning}(a) shows a \emph{logical} view of the -$\lceil n/b \rceil$ buckets generated in the partitioning step. -%In this case, the keys of each bucket are put together by the pointers to -%each key stored -%in contiguous positions in the array of pointers. -In reality, the keys belonging to each bucket are distributed among many files, -as depicted in Figure~\ref{fig:brz-partitioning}(b). -In the example of Figure~\ref{fig:brz-partitioning}(b), the keys in bucket 0 -appear in files 1 and $N$, the keys in bucket 1 appear in files 1, 2 -and $N$, and so on. 
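To make the indirect bucket sort of statement 1.2 concrete, here is a sketch in C (ours, not the paper's implementation; \texttt{nb} and \texttt{h0()} are assumed to be the number of buckets $\lceil n/b\rceil$ and the universal hash function of Eq.~(\ref{eq:bucketindex})):

\begin{verbatim}
/* Our sketch of statement 1.2 for one block B_j already in memory.
 * count/start are local arrays of nb entries; size is the global
 * vector of bucket sizes updated as the block is clustered. */
void cluster_block(char **block, int nkeys, char **ptr,
                   int *count, int *start, unsigned char *size)
{
    int k, i;
    for (i = 0; i < nb; i++) count[i] = 0;
    for (k = 0; k < nkeys; k++)
        count[h0(block[k]) % nb]++;            /* local counters */
    for (start[0] = 0, i = 1; i < nb; i++)
        start[i] = start[i-1] + count[i-1];    /* reserve areas  */
    for (k = 0; k < nkeys; k++) {
        i = h0(block[k]) % nb;
        ptr[start[i]++] = block[k];            /* place pointers */
    }
    for (i = 0; i < nb; i++) size[i] += count[i];
}
\end{verbatim}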
- -\vspace{-7mm} -\begin{figure}[ht] -\centering -\begin{picture}(0,0)% -\includegraphics{figs/brz-partitioning}% -\end{picture}% -\setlength{\unitlength}{4144sp}% -% -\begingroup\makeatletter\ifx\SetFigFont\undefined% -\gdef\SetFigFont#1#2#3#4#5{% - \reset@font\fontsize{#1}{#2pt}% - \fontfamily{#3}\fontseries{#4}\fontshape{#5}% - \selectfont}% -\fi\endgroup% -\begin{picture}(4371,1403)(1,-6977) -\put(333,-6421){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}0}}}} -\put(545,-6421){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}1}}}} -\put(759,-6421){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}2}}}} -\put(1539,-6421){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}${\lceil n/b\rceil - 1}$}}}} -\put(541,-6676){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}Buckets Logical View}}}} -\put(3547,-6120){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}} -\put(3547,-6188){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}} -\put(3547,-6255){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}} -\put(3107,-6120){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}} -\put(3107,-6188){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}} -\put(3107,-6255){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}} -\put(4177,-6224){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}} -\put(4177,-6269){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}} -\put(4177,-6314){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}.}}}} -\put(3016,-6721){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}File 1}}}} -\put(3466,-6721){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}File 2}}}} -\put(4096,-6721){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}File N}}}} -\put(3196,-6946){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}Buckets Physical View}}}} -\end{picture}% -\caption{Situation of the buckets at the end of the partitioning step: (a) Logical view (b) Physical view} -\label{fig:brz-partitioning} -\vspace{-2mm} -\end{figure} - -This scattering of the keys in the buckets could generate a performance -problem because of the potential number of seeks -needed to read the keys in each bucket from the $N$ files in disk -during the searching step. -But, as we show later in Section~\ref{sec:analytcal-results}, the number of seeks -can be kept small using buffering techniques. -Considering that only the vector {\it size}, which has $\lceil n/b \rceil$ -one-byte entries (remember that each bucket has at most 256 keys), -must be maintained in main memory during the searching step, -almost all main memory is available to be used as disk I/O buffer. - -The last step is to compute the {\it offset} vector and dump it to the disk. -We use the vector $\mathit{size}$ to compute the -$\mathit{offset}$ displacement vector. -The $\mathit{offset}[i]$ entry contains the number of keys -in the buckets $0, 1, \dots, i-1$. 
-As {\it size}$[i]$ stores the number of keys
-in bucket $i$, where $0 \leq i <\lceil n/b \rceil$, we have
-\begin{displaymath}
-\mathit{offset}[i] = \sum_{j=0}^{i-1} \mathit{size}[j] \cdot
-\end{displaymath}
-
diff --git a/vldb07/performancenewalgorithm.tex b/vldb07/performancenewalgorithm.tex
deleted file mode 100755
index 9cddc46..0000000
--- a/vldb07/performancenewalgorithm.tex
+++ /dev/null
@@ -1,113 +0,0 @@
-% Nivio: 29/jan/06
-% Time-stamp:
-\subsection{Performance of the new algorithm}
-\label{sec:performance}
-%As we have done for the internal memory based algorithm,
-
-The runtime of our algorithm is also a random variable, but now it follows a
-(highly concentrated) normal distribution, as we discuss at the end of this
-section. Again, we are interested in verifying the linearity claim made in
-Section~\ref{sec:linearcomplexity}. Therefore, we ran the algorithm for
-several numbers $n$ of keys in $S$.
-
-The values chosen for $n$ were $1, 2, 4, 8, 16, 32, 64, 128, 512$ and $1000$
-million.
-%Just the small vector {\it size} must be kept in main memory,
-%as we saw in Section~\ref{sec:memconstruction}.
-We limited the main memory to 500 megabytes for the experiments.
-The size $\mu$ of the a priori reserved internal memory area
-was set to 250 megabytes, the parameter $b$ was set to $175$ and
-the building block algorithm parameter $c$ was again set to $1$.
-In Section~\ref{sec:contr-disk-access} we show how $\mu$
-affects the runtime of the algorithm. The other two parameters
-have insignificant influence on the runtime.
-
-We again use a statistical method for determining a suitable sample size
-%~\cite[Chapter 13]{j91}
-to estimate the number of trials to be run for each value of $n$. We found that
-just one trial for each $n$ would be enough with a confidence level of $95\%$.
-However, we made 10 trials. This number of trials seems rather small, but, as
-shown below, the behavior of our algorithm is very stable and its runtime is
-almost deterministic (i.e., the standard deviation is very small).
-
-Table~\ref{tab:mediasbrz} presents the runtime average for each $n$,
-the respective standard deviations, and
-the respective confidence intervals given by
-the average time $\pm$ the distance from the average time
-considering a confidence level of $95\%$.
-Observing the runtime averages, we notice that
-the algorithm runs in expected linear time,
-as shown in~Section~\ref{sec:linearcomplexity}. Better still,
-it is only approximately $60\%$ slower than our internal memory based algorithm.
-To get that value we used the linear regression model obtained for the runtime of
-the internal memory based algorithm to estimate how much time it would require
-for constructing a MPHF for a set of 1 billion keys.
-We got 2.3 hours for the internal memory based algorithm and we measured
-3.67 hours on average for our algorithm.
-Increasing the size of the internal memory area
-from 250 to 600 megabytes (see Section~\ref{sec:contr-disk-access}),
-we brought the time down to 3.09 hours, making our algorithm
-just $34\%$ slower in that setup.
-
-\enlargethispage{2\baselineskip}
-\begin{table*}[htb]
-\vspace{-1mm}
-\begin{center}
-{\scriptsize
-\begin{tabular}{|l|c|c|c|c|c|}
-\hline
-$n$ (millions) & 1 & 2 & 4 & 8 & 16 \\
-\hline % Part.
16 \% 16 \% 16 \% 18 \% 20\%
-Average time (s) & $6.9 \pm 0.3$ & $13.8 \pm 0.2$ & $31.9 \pm 0.7$ & $69.9 \pm 1.1$ & $140.6 \pm 2.5$ \\
-SD & $0.4$ & $0.2$ & $0.9$ & $1.5$ & $3.5$ \\
-\hline
-\hline
-$n$ (millions) & 32 & 64 & 128 & 512 & 1000 \\
-\hline % Part. 20 \% 20\% 20\% 18\% 18\%
-Average time (s) & $284.3 \pm 1.1$ & $587.9 \pm 3.9$ & $1223.6 \pm 4.9$ & $5966.4 \pm 9.5$ & $13229.5 \pm 12.7$ \\
-SD & $1.6$ & $5.5$ & $6.8$ & $13.2$ & $18.6$ \\
-\hline
-
-\end{tabular}
-\vspace{-1mm}
-}
-\end{center}
-\caption{Our algorithm: average time in seconds for constructing a MPHF,
-the standard deviation (SD), and the confidence intervals considering
-a confidence level of $95\%$.
-}
-\label{tab:mediasbrz}
-\vspace{-5mm}
-\end{table*}
-
-Figure~\ref{fig:brz_temporegressao}
-presents the runtime for each trial. In addition,
-the solid line corresponds to a linear regression model
-obtained from the experimental measurements.
-As expected, the runtime for a given $n$ shows almost no
-variation.
-
-\begin{figure}[htb]
-\begin{center}
-\scalebox{0.4}{\includegraphics{figs/brz_temporegressao}}
-\caption{Time versus number of keys in $S$ for our algorithm. The solid line corresponds to
-a linear regression model.}
-\label{fig:brz_temporegressao}
-\end{center}
-\vspace{-9mm}
-\end{figure}
-
-An intriguing observation is that the runtime of the algorithm is almost
-deterministic, in spite of the fact that it uses as a building block an
-algorithm with a considerable fluctuation in its runtime. A given bucket~$i$,
-$0 \leq i < \lceil n/b \rceil$, is a small set of keys (at most 256 keys) and,
-as argued in Section~\ref{sec:intern-memory-algor}, the runtime of the
-building block algorithm is a random variable~$X_i$ with high fluctuation.
-However, the runtime~$Y$ of the searching step of our algorithm is given
-by~$Y=\sum_{0\leq i<\lceil n/b\rceil}X_i$. Under the hypothesis that
-the~$X_i$ are independent and bounded, the {\it law of large numbers} (see,
-e.g., \cite{j91}) implies that the random variable $Y/\lceil n/b\rceil$
-converges to a constant as~$n\to\infty$. This explains why the runtime of our
-algorithm is almost deterministic.
-
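-As an illustration of this concentration (a toy simulation, not part of
-the experiments above), the C program below models each~$X_i$ as the
-geometric number of random graph generation trials with success
-probability $p=e^{-1/c^2} \simeq 0.368$ for $c=1$ (see
-Section~\ref{sec:mphfbucket}), and prints $Y/\lceil n/b\rceil$ for a
-growing number of buckets; the average stabilizes around $1/p \simeq 2.72$:
-\begin{verbatim}
-#include <stdio.h>
-#include <stdlib.h>
-
-/* Number of trials until the first success of probability p. */
-static double geometric_trials(double p)
-{
-    double trials = 1.0;
-    while ((double)rand() / RAND_MAX > p)
-        trials += 1.0;
-    return trials;
-}
-
-int main(void)
-{
-    const double p = 0.368;  /* Pr[generated graph is simple], c = 1 */
-    long sizes[] = {100, 10000, 1000000};
-    for (int i = 0; i < 3; i++) {
-        double y = 0.0;      /* Y = sum of the per-bucket costs X_i */
-        for (long j = 0; j < sizes[i]; j++)
-            y += geometric_trials(p);
-        printf("%8ld buckets: Y/buckets = %.4f (1/p = %.4f)\n",
-               sizes[i], y / sizes[i], 1.0 / p);
-    }
-    return 0;
-}
-\end{verbatim}
-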
diff --git a/vldb07/references.bib b/vldb07/references.bib
deleted file mode 100755
index d2ea475..0000000
--- a/vldb07/references.bib
+++ /dev/null
@@ -1,814 +0,0 @@
-
-@InProceedings{Brin1998,
- author = "Sergey Brin and Lawrence Page",
- title = "The Anatomy of a Large-Scale Hypertextual Web Search Engine",
- booktitle = "Proceedings of the 7th International {World Wide Web} Conference",
- pages = "107--117",
- address = "Brisbane, Australia",
- month = "April",
- year = 1998,
- annote = "The Google paper."
-}
-
-@inproceedings{p99,
- author = {R. Pagh},
- title = {Hash and Displace: Efficient Evaluation of Minimal Perfect Hash Functions},
- booktitle = {Workshop on Algorithms and Data Structures},
- pages = {49--54},
- year = 1999,
- url = {citeseer.nj.nec.com/pagh99hash.html},
- key = {author}
-}
-
-@article{p00,
- author = {R. Pagh},
- title = {Faster deterministic dictionaries},
- journal = {Symposium on Discrete Algorithms (ACM SODA)},
- OPTvolume = {43},
- OPTnumber = {5},
- pages = {487--493},
- year = {2000}
-}
-@article{g81,
- author = {G. H. Gonnet},
- title = {Expected Length of the Longest Probe Sequence in Hash Code Searching},
- journal = {J. ACM},
- volume = {28},
- number = {2},
- year = {1981},
- issn = {0004-5411},
- pages = {289--304},
- doi = {http://doi.acm.org/10.1145/322248.322254},
- publisher = {ACM Press},
- address = {New York, NY, USA},
-}
-
-@misc{r04,
- author = "S. Rao",
- title = "Combinatorial Algorithms and Data Structures",
- year = 2004,
- howpublished = {CS 270 Spring},
- url = "citeseer.ist.psu.edu/700201.html"
-}
-@article{ra98,
- author = {Martin Raab and Angelika Steger},
- title = {``{B}alls into Bins'' --- {A} Simple and Tight Analysis},
- journal = {Lecture Notes in Computer Science},
- volume = 1518,
- pages = {159--170},
- year = 1998,
- url = "citeseer.ist.psu.edu/raab98balls.html"
-}
-
-@misc{mrs00,
- author = "M. Mitzenmacher and A. Richa and R. Sitaraman",
- title = "The power of two random choices: A survey of the techniques and results",
- howpublished = {In Handbook of Randomized Computing, P. Pardalos, S. Rajasekaran, and J. Rolim, Eds. Kluwer},
- year = "2000",
- url = "citeseer.ist.psu.edu/article/mitzenmacher00power.html"
-}
-
-@article{dfm02,
- author = {E. Drinea and A. Frieze and M. Mitzenmacher},
- title = {Balls and bins models with feedback},
- journal = {Symposium on Discrete Algorithms (ACM SODA)},
- pages = {308--315},
- year = {2002}
-}
-@Article{j97,
- author = {Bob Jenkins},
- title = {Algorithm Alley: Hash Functions},
- journal = {Dr. Dobb's Journal of Software Tools},
- volume = {22},
- number = {9},
- month = {September},
- year = {1997}
-}
-
-@article{gss01,
- author = {N. Galli and B. Seybold and K. Simon},
- title = {Tetris-Hashing or optimal table compression},
- journal = {Discrete Applied Mathematics},
- volume = {110},
- number = {1},
- pages = {41--58},
- month = {June},
- publisher = {Elsevier Science},
- year = {2001}
-}
-
-@article{s05,
- author = {M. Seltzer},
- title = {Beyond Relational Databases},
- journal = {ACM Queue},
- volume = {3},
- number = {3},
- month = {April},
- year = {2005}
-}
-
-@InProceedings{ss89,
- author = {P. Schmidt and A. Siegel},
- title = {On aspects of universality and performance for closed hashing},
- booktitle = {Proc. 21st Ann. ACM Symp. on Theory of Computing -- STOC'89},
- month = {May},
- year = {1989},
- pages = {355--366}
-}
-
-@article{asw00,
- author = {M. Atici and D. R. Stinson and R. Wei},
- title = {A new practical algorithm for the construction of a perfect hash function},
- journal = {Journal Combin. Math. Combin. Comput.},
- volume = {35},
- pages = {127--145},
- year = {2000}
-}
-
-@article{swz00,
- author = {D. R. Stinson and R. Wei and L. Zhu},
- title = {New constructions for perfect hash families and related structures using combinatorial designs and codes},
- journal = {Journal Combin. Designs.},
- volume = {8},
- pages = {189--200},
- year = {2000}
-}
-
-@inproceedings{ht01,
- author = {T. Hagerup and T. Tholey},
- title = {Efficient minimal perfect hashing in nearly minimal space},
- booktitle = {The 18th Symposium on Theoretical Aspects of Computer Science (STACS), volume 2010 of Lecture Notes in Computer Science},
- year = 2001,
- pages = {317--326},
- key = {author}
-}
-
-@inproceedings{dh01,
- author = {M. Dietzfelbinger and T. Hagerup},
- title = {Simple minimal perfect hashing in less space},
- booktitle = {The 9th European Symposium on Algorithms (ESA), volume 2161 of Lecture Notes in Computer Science},
- year = 2001,
- pages = {109--120},
- key = {author}
-}
-
-
-@MastersThesis{mar00,
- author = {M. S. Neubert},
- title = {Algoritmos Distribu\'{\i}dos para a Constru\c{c}\~ao de Arquivos Invertidos},
- school = {Departamento de Ci\^encia da Computa\c{c}\~ao, Universidade Federal de Minas Gerais},
- year = 2000,
- month = {March},
- key = {author}
-}
-
-
-@Book{clrs01,
- author = {T. H. Cormen and C. E. Leiserson and R. L. Rivest and C. Stein},
- title = {Introduction to Algorithms},
- publisher = {MIT Press},
- year = {2001},
- edition = {second},
-}
-
-@Book{j91,
- author = {R. Jain},
- title = {The art of computer systems performance analysis: techniques for experimental design, measurement, simulation, and modeling},
- publisher = {John Wiley},
- year = {1991},
- edition = {first}
-}
-
-@Book{k73,
- author = {D. E. Knuth},
- title = {The Art of Computer Programming: Sorting and Searching},
- publisher = {Addison-Wesley},
- volume = {3},
- year = {1973},
- edition = {second},
-}
-
-@inproceedings{rp99,
- author = {R. Pagh},
- title = {Hash and Displace: Efficient Evaluation of Minimal Perfect Hash Functions},
- booktitle = {Workshop on Algorithms and Data Structures},
- pages = {49--54},
- year = 1999,
- url = {citeseer.nj.nec.com/pagh99hash.html},
- key = {author}
-}
-
-@inproceedings{hmwc93,
- author = {G. Havas and B.S. Majewski and N.C. Wormald and Z.J. Czech},
- title = {Graphs, Hypergraphs and Hashing},
- booktitle = {19th International Workshop on Graph-Theoretic Concepts in Computer Science},
- publisher = {Springer Lecture Notes in Computer Science vol. 790},
- pages = {153--165},
- year = 1993,
- key = {author}
-}
-
-@inproceedings{bkz05,
- author = {F.C. Botelho and Y. Kohayakawa and N. Ziviani},
- title = {A Practical Minimal Perfect Hashing Method},
- booktitle = {4th International Workshop on Efficient and Experimental Algorithms},
- publisher = {Springer Lecture Notes in Computer Science vol. 3503},
- pages = {488--500},
- month = {May},
- year = 2005,
- key = {author}
-}
-
-@Article{chm97,
- author = {Z.J. Czech and G. Havas and B.S. Majewski},
- title = {Perfect Hashing},
- journal = {Theoretical Computer Science},
- volume = {182},
- year = {1997},
- pages = {1--143},
- key = {author}
-}
-
-@article{chm92,
- author = {Z.J. Czech and G. Havas and B.S. Majewski},
- title = {An Optimal Algorithm for Generating Minimal Perfect Hash Functions},
- journal = {Information Processing Letters},
- volume = {43},
- number = {5},
- pages = {257--264},
- year = {1992},
- url = {citeseer.nj.nec.com/czech92optimal.html},
- key = {author}
-}
-
-@Article{mwhc96,
- author = {B.S. Majewski and N.C. Wormald and G. Havas and Z.J. Czech},
- title = {A family of perfect hashing methods},
- journal = {The Computer Journal},
- year = {1996},
- volume = {39},
- number = {6},
- pages = {547--554},
- key = {author}
-}
-
-@InProceedings{bv04,
-author = {P. Boldi and S. Vigna},
-title = {The WebGraph Framework I: Compression Techniques},
-booktitle = {13th International World Wide Web Conference},
-pages = {595--602},
-year = {2004}
-}
-
-
-@Book{z04,
- author = {N. Ziviani},
- title = {Projeto de Algoritmos com Implementa\c{c}\~oes em Pascal e C},
- publisher = {Pioneira Thompson},
- year = 2004,
- edition = {second}
-}
-
-
-@Book{p85,
- author = {E. M. Palmer},
- title = {Graphical Evolution: An Introduction to the Theory of Random Graphs},
- publisher = {John Wiley \& Sons},
- year = {1985},
- address = {New York}
-}
-
-@Book{imb99,
- author = {I.H. Witten and A. Moffat and T.C.
Bell}, - title = {Managing Gigabytes: Compressing and Indexing Documents and Images}, - publisher = {Morgan Kaufmann Publishers}, - year = 1999, - edition = {second edition} -} -@Book{wfe68, - author = {W. Feller}, - title = { An Introduction to Probability Theory and Its Applications}, - publisher = {Wiley}, - year = 1968, - volume = 1, - optedition = {second edition} -} - - -@Article{fhcd92, - author = {E.A. Fox and L. S. Heath and Q. Chen and A.M. Daoud}, - title = {Practical Minimal Perfect Hash Functions For Large Databases}, - journal = {Communications of the ACM}, - year = {1992}, - volume = {35}, - number = {1}, - pages = {105--121} -} - - -@inproceedings{fch92, - author = {E.A. Fox and Q.F. Chen and L.S. Heath}, - title = {A Faster Algorithm for Constructing Minimal Perfect Hash Functions}, - booktitle = {Proceedings of the 15th Annual International ACM SIGIR Conference - on Research and Development in Information Retrieval}, - year = {1992}, - pages = {266-273}, -} - -@article{c80, - author = {R.J. Cichelli}, - title = {Minimal perfect hash functions made simple}, - journal = {Communications of the ACM}, - volume = {23}, - number = {1}, - year = {1980}, - issn = {0001-0782}, - pages = {17--19}, - doi = {http://doi.acm.org/10.1145/358808.358813}, - publisher = {ACM Press}, - } - - -@TechReport{fhc89, - author = {E.A. Fox and L.S. Heath and Q.F. Chen}, - title = {An $O(n\log n)$ algorithm for finding minimal perfect hash functions}, - institution = {Virginia Polytechnic Institute and State University}, - year = {1989}, - OPTkey = {}, - OPTtype = {}, - OPTnumber = {}, - address = {Blacksburg, VA}, - month = {April}, - OPTnote = {}, - OPTannote = {} -} - -@TechReport{bkz06t, - author = {F.C. Botelho and Y. Kohayakawa and N. Ziviani}, - title = {An Approach for Minimal Perfect Hash Functions in Very Large Databases}, - institution = {Department of Computer Science, Federal University of Minas Gerais}, - note = {Available at http://www.dcc.ufmg.br/\texttt{\~ }nivio/pub/technicalreports.html}, - year = {2006}, - OPTkey = {}, - OPTtype = {}, - number = {RT.DCC.003}, - address = {Belo Horizonte, MG, Brazil}, - month = {April}, - OPTannote = {} -} - -@inproceedings{fcdh90, - author = {E.A. Fox and Q.F. Chen and A.M. Daoud and L.S. Heath}, - title = {Order preserving minimal perfect hash functions and information retrieval}, - booktitle = {Proceedings of the 13th annual international ACM SIGIR conference on Research and development in information retrieval}, - year = {1990}, - isbn = {0-89791-408-2}, - pages = {279--311}, - location = {Brussels, Belgium}, - doi = {http://doi.acm.org/10.1145/96749.98233}, - publisher = {ACM Press}, - } - -@Article{fkp89, - author = {P. Flajolet and D. E. Knuth and B. Pittel}, - title = {The first cycles in an evolving graph}, - journal = {Discrete Math}, - year = {1989}, - volume = {75}, - pages = {167-215}, -} - -@Article{s77, - author = {R. Sprugnoli}, - title = {Perfect Hashing Functions: A Single Probe Retrieving - Method For Static Sets}, - journal = {Communications of the ACM}, - year = {1977}, - volume = {20}, - number = {11}, - pages = {841--850}, - month = {November}, -} - -@Article{j81, - author = {G. Jaeschke}, - title = {Reciprocal Hashing: A method For Generating Minimal Perfect - Hashing Functions}, - journal = {Communications of the ACM}, - year = {1981}, - volume = {24}, - number = {12}, - month = {December}, - pages = {829--833} -} - -@Article{c84, - author = {C. C. 
Chang},
- title = {The Study Of An Ordered Minimal Perfect Hashing Scheme},
- journal = {Communications of the ACM},
- year = {1984},
- volume = {27},
- number = {4},
- month = {December},
- pages = {384--387}
-}
-
-@Article{c86,
- author = {C. C. Chang},
- title = {Letter-Oriented Reciprocal Hashing Scheme},
- journal = {Inform. Sci.},
- year = {1986},
- volume = {27},
- pages = {243--255}
-}
-
-@Article{cl86,
- author = {C. C. Chang and R. C. T. Lee},
- title = {A Letter-Oriented Minimal Perfect Hashing Scheme},
- journal = {Computer Journal},
- year = {1986},
- volume = {29},
- number = {3},
- month = {June},
- pages = {277--281}
-}
-
-
-@Article{cc88,
- author = {C. C. Chang and C. H. Chang},
- title = {An Ordered Minimal Perfect Hashing Scheme with Single Parameter},
- journal = {Inform. Process. Lett.},
- year = {1988},
- volume = {27},
- number = {2},
- month = {February},
- pages = {79--83}
-}
-
-@Article{w90,
- author = {V. G. Winters},
- title = {Minimal Perfect Hashing in Polynomial Time},
- journal = {BIT},
- year = {1990},
- volume = {30},
- number = {2},
- pages = {235--244}
-}
-
-@Article{fcdh91,
- author = {E. A. Fox and Q. F. Chen and A. M. Daoud and L. S. Heath},
- title = {Order Preserving Minimal Perfect Hash Functions and Information Retrieval},
- journal = {ACM Trans. Inform. Systems},
- year = {1991},
- volume = {9},
- number = {3},
- month = {July},
- pages = {281--308}
-}
-
-@Article{fks84,
- author = {M. L. Fredman and J. Koml\'os and E. Szemer\'edi},
- title = {Storing a sparse table with {O(1)} worst case access time},
- journal = {J. ACM},
- year = {1984},
- volume = {31},
- number = {3},
- month = {July},
- pages = {538--544}
-}
-
-@Article{dhjs83,
- author = {M. W. Du and T. M. Hsieh and K. F. Jea and D. W. Shieh},
- title = {The study of a new perfect hash scheme},
- journal = {IEEE Trans. Software Eng.},
- year = {1983},
- volume = {9},
- number = {3},
- month = {May},
- pages = {305--313}
-}
-
-@Article{bt94,
- author = {M. D. Brain and A. L. Tharp},
- title = {Using Tries to Eliminate Pattern Collisions in Perfect Hashing},
- journal = {IEEE Trans. on Knowledge and Data Eng.},
- year = {1994},
- volume = {6},
- number = {2},
- month = {April},
- pages = {239--247}
-}
-
-@Article{bt90,
- author = {M. D. Brain and A. L. Tharp},
- title = {Perfect hashing using sparse matrix packing},
- journal = {Inform. Systems},
- year = {1990},
- volume = {15},
- number = {3},
- OPTmonth = {April},
- pages = {281--290}
-}
-
-@Article{ckw93,
- author = {C. C. Chang and H. C. Kowng and T. C. Wu},
- title = {A refinement of a compression-oriented addressing scheme},
- journal = {BIT},
- year = {1993},
- volume = {33},
- number = {4},
- OPTmonth = {April},
- pages = {530--535}
-}
-
-@Article{cw91,
- author = {C. C. Chang and T. C. Wu},
- title = {A letter-oriented perfect hashing scheme based upon sparse table compression},
- journal = {Software -- Practice Experience},
- year = {1991},
- volume = {21},
- number = {1},
- month = {January},
- pages = {35--49}
-}
-
-@Article{ty79,
- author = {R. E. Tarjan and A. C. C. Yao},
- title = {Storing a sparse table},
- journal = {Comm. ACM},
- year = {1979},
- volume = {22},
- number = {11},
- month = {November},
- pages = {606--611}
-}
-
-@Article{yd85,
- author = {W. P. Yang and M. W. Du},
- title = {A backtracking method for constructing perfect hash functions from a set of mapping functions},
- journal = {BIT},
- year = {1985},
- volume = {25},
- number = {1},
- pages = {148--164}
-}
-
-@Article{s85,
- author = {T. J.
Sager},
- title = {A polynomial time generator for minimal perfect hash functions},
- journal = {Commun. ACM},
- year = {1985},
- volume = {28},
- number = {5},
- month = {May},
- pages = {523--532}
-}
-
-@Article{cm93,
- author = {Z. J. Czech and B. S. Majewski},
- title = {A linear time algorithm for finding minimal perfect hash functions},
- journal = {The Computer Journal},
- year = {1993},
- volume = {36},
- number = {6},
- pages = {579--587}
-}
-
-@Article{gbs94,
- author = {R. Gupta and S. Bhaskar and S. Smolka},
- title = {On randomization in sequential and distributed algorithms},
- journal = {ACM Comput. Surveys},
- year = {1994},
- volume = {26},
- number = {1},
- month = {March},
- pages = {7--86}
-}
-
-@InProceedings{sb84,
- author = {C. Slot and P. van Emde Boas},
- title = {On tape versus core; an application of space efficient perfect hash functions to the invariance of space},
- booktitle = {Proc. 16th Ann. ACM Symp. on Theory of Computing -- STOC'84},
- address = {Washington},
- month = {May},
- year = {1984},
- pages = {391--400},
-}
-
-@InProceedings{wi90,
- author = {V. G. Winters},
- title = {Minimal perfect hashing for large sets of data},
- booktitle = {Internat. Conf. on Computing and Information -- ICCI'90},
- address = {Canada},
- month = {May},
- year = {1990},
- pages = {275--284},
-}
-
-@InProceedings{lr85,
- author = {P. Larson and M. V. Ramakrishna},
- title = {External perfect hashing},
- booktitle = {Proc. ACM SIGMOD Conf.},
- address = {Austin TX},
- month = {June},
- year = {1985},
- pages = {190--199},
-}
-
-@Book{m84,
- author = {K. Mehlhorn},
- editor = {W. Brauer and G. Rozenberg and A. Salomaa},
- title = {Data Structures and Algorithms 1: Sorting and Searching},
- publisher = {Springer-Verlag},
- year = {1984},
-}
-
-@PhdThesis{c92,
- author = {Q. F. Chen},
- title = {An Object-Oriented Database System for Efficient Information Retrieval Applications},
- school = {Virginia Tech Dept. of Computer Science},
- year = {1992},
- month = {March}
-}
-
-@article {er59,
- AUTHOR = {Erd{\H{o}}s, P. and R{\'e}nyi, A.},
- TITLE = {On random graphs {I}},
- JOURNAL = {Pub. Math. Debrecen},
- VOLUME = {6},
- YEAR = {1959},
- PAGES = {290--297},
- MRCLASS = {05.00},
- MRNUMBER = {MR0120167 (22 \#10924)},
-MRREVIEWER = {A. Dvoretzky},
-}
-
-
-@article {erdos61,
- AUTHOR = {Erd{\H{o}}s, P. and R{\'e}nyi, A.},
- TITLE = {On the evolution of random graphs},
- JOURNAL = {Bull. Inst. Internat. Statist.},
- VOLUME = 38,
- YEAR = 1961,
- PAGES = {343--347},
- MRCLASS = {05.40 (55.10)},
- MRNUMBER = {MR0148055 (26 \#5564)},
-}
-
-@article {er60,
- AUTHOR = {Erd{\H{o}}s, P. and R{\'e}nyi, A.},
- TITLE = {On the evolution of random graphs},
- JOURNAL = {Magyar Tud. Akad. Mat. Kutat\'o Int. K\"ozl.},
- VOLUME = {5},
- YEAR = {1960},
- PAGES = {17--61},
- MRCLASS = {05.40},
- MRNUMBER = {MR0125031 (23 \#A2338)},
-MRREVIEWER = {J. Riordan},
-}
-
-@Article{er60:_Old,
- author = {P. Erd{\H{o}}s and A. R\'enyi},
- title = {On the evolution of random graphs},
- journal = {Publications of the Mathematical Institute of the Hungarian Academy of Sciences},
- year = {1960},
- volume = {56},
- pages = {17--61}
-}
-
-@Article{er61,
- author = {P. Erd{\H{o}}s and A. R\'enyi},
- title = {On the strength of connectedness of a random graph},
- journal = {Acta Mathematica Academiae Scientiarum Hungaricae},
- year = {1961},
- volume = {12},
- pages = {261--267}
-}
-
-
-@Article{bp04,
- author = {B. Bollob\'as and O.
Pikhurko}, - title = {Integer Sets with Prescribed Pairwise Differences Being Distinct}, - journal = {European Journal of Combinatorics}, - OPTkey = {}, - OPTvolume = {}, - OPTnumber = {}, - OPTpages = {}, - OPTmonth = {}, - note = {To Appear}, - OPTannote = {} -} - -@Article{pw04:_OLD, - author = {B. Pittel and N. C. Wormald}, - title = {Counting connected graphs inside-out}, - journal = {Journal of Combinatorial Theory}, - OPTkey = {}, - OPTvolume = {}, - OPTnumber = {}, - OPTpages = {}, - OPTmonth = {}, - note = {To Appear}, - OPTannote = {} -} - - -@Article{mr95, - author = {M. Molloy and B. Reed}, - title = {A critical point for random graphs with a given degree sequence}, - journal = {Random Structures and Algorithms}, - year = {1995}, - volume = {6}, - pages = {161-179} -} - -@TechReport{bmz04, - author = {F. C. Botelho and D. Menoti and N. Ziviani}, - title = {A New algorithm for constructing minimal perfect hash functions}, - institution = {Federal Univ. of Minas Gerais}, - year = {2004}, - OPTkey = {}, - OPTtype = {}, - number = {TR004}, - OPTaddress = {}, - OPTmonth = {}, - note = {(http://www.dcc.ufmg.br/\texttt{\~ }nivio/pub/technicalreports.html)}, - OPTannote = {} -} - -@Article{mr98, - author = {M. Molloy and B. Reed}, - title = {The size of the giant component of a random graph with a given degree sequence}, - journal = {Combinatorics, Probability and Computing}, - year = {1998}, - volume = {7}, - pages = {295-305} -} - -@misc{h98, - author = {D. Hawking}, - title = {Overview of TREC-7 Very Large Collection Track (Draft for Notebook)}, - url = {citeseer.ist.psu.edu/4991.html}, - year = {1998}} - -@book {jlr00, - AUTHOR = {Janson, S. and {\L}uczak, T. and Ruci{\'n}ski, A.}, - TITLE = {Random graphs}, - PUBLISHER = {Wiley-Inter.}, - YEAR = 2000, - PAGES = {xii+333}, - ISBN = {0-471-17541-2}, - MRCLASS = {05C80 (60C05 82B41)}, - MRNUMBER = {2001k:05180}, -MRREVIEWER = {Mark R. Jerrum}, -} - -@incollection {jlr90, - AUTHOR = {Janson, Svante and {\L}uczak, Tomasz and Ruci{\'n}ski, - Andrzej}, - TITLE = {An exponential bound for the probability of nonexistence of a - specified subgraph in a random graph}, - BOOKTITLE = {Random graphs '87 (Pozna\'n, 1987)}, - PAGES = {73--87}, - PUBLISHER = {Wiley}, - ADDRESS = {Chichester}, - YEAR = 1990, - MRCLASS = {05C80 (60C05)}, - MRNUMBER = {91m:05168}, -MRREVIEWER = {J. Spencer}, -} - -@book {b01, - AUTHOR = {Bollob{\'a}s, B.}, - TITLE = {Random graphs}, - SERIES = {Cambridge Studies in Advanced Mathematics}, - VOLUME = 73, - EDITION = {Second}, - PUBLISHER = {Cambridge University Press}, - ADDRESS = {Cambridge}, - YEAR = 2001, - PAGES = {xviii+498}, - ISBN = {0-521-80920-7; 0-521-79722-5}, - MRCLASS = {05C80 (60C05)}, - MRNUMBER = {MR1864966 (2002j:05132)}, -} - -@article {pw04, - AUTHOR = {Pittel, Boris and Wormald, Nicholas C.}, - TITLE = {Counting connected graphs inside-out}, - JOURNAL = {J. Combin. Theory Ser. B}, - FJOURNAL = {Journal of Combinatorial Theory. Series B}, - VOLUME = 93, - YEAR = 2005, - NUMBER = 2, - PAGES = {127--172}, - ISSN = {0095-8956}, - CODEN = {JCBTB8}, - MRCLASS = {05C30 (05A16 05C40 05C80)}, - MRNUMBER = {MR2117934 (2005m:05117)}, -MRREVIEWER = {Edward A. 
Bender},
-}
diff --git a/vldb07/relatedwork.tex b/vldb07/relatedwork.tex
deleted file mode 100755
index 7693002..0000000
--- a/vldb07/relatedwork.tex
+++ /dev/null
@@ -1,112 +0,0 @@
-% Time-stamp:
-\vspace{-3mm}
-\section{Related work}
-\label{sec:relatedprevious-work}
-\vspace{-2mm}
-
-% Optimal speed for hashing means that each key from the key set $S$
-% will map to a unique location in the hash table, avoiding time wasted
-% in resolving collisions. That is achieved with a MPHF and
-% because of that many algorithms for constructing static
-% and dynamic MPHFs, when static or dynamic sets are involved,
-% were developed. Our focus has been on static MPHFs, since
-% in many applications the key sets change slowly, if at all~\cite{s05}.
-
-\enlargethispage{2\baselineskip}
-Czech, Havas and Majewski~\cite{chm97} provide a
-comprehensive survey of the most important theoretical and practical results
-on perfect hashing.
-In this section we review some of the most important results.
-%We also present more recent algorithms that share some features with
-%the one presented hereinafter.
-
-Fredman, Koml\'os and Szemer\'edi~\cite{fks84} showed that it is possible to
-construct space efficient perfect hash functions that can be evaluated in
-constant time with table sizes that are linear in the number of keys:
-$m=O(n)$. In their model of computation, an element of the universe~$U$ fits
-into one machine word, and arithmetic operations and memory accesses have unit
-cost. Randomized algorithms in the FKS model can construct a perfect hash
-function in expected time~$O(n)$:
-this is the case of our algorithm and the works in~\cite{chm92,p99}.
-
-Mehlhorn~\cite{m84} showed
-that at least $\Omega((1/\ln 2)n + \ln\ln u)$ bits are
-required to represent a MPHF (i.e., at least 1.4427 bits per
-key must be stored).
-To the best of our knowledge our algorithm
-is the first one capable of generating MPHFs for sets on the order
-of a billion keys, and the generated functions
-require less than 9 bits per key to be stored.
-This increases by one order of magnitude the size of the largest
-key set for which a MPHF has been obtained in the literature~\cite{bkz05}.
-%which is close to the lower bound presented in~\cite{m84}.
-
-Some work on minimal perfect hashing has been done under the assumption that
-the algorithm can pick and store truly random functions~\cite{bkz05,chm92,p99}.
-Since the space requirements for truly random functions make them unsuitable for
-implementation, one has to settle for pseudo-random functions in practice.
-Empirical studies show that limited randomness properties are often as good as
-total randomness.
-We could verify that phenomenon in our experiments by using the universal hash
-function proposed by Jenkins~\cite{j97}, which is
-time efficient at retrieval time and requires just an integer to be used as a
-random seed (the function is completely determined by the seed).
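-% The following sketch (illustrative only, not the implementation used in
-% the experiments) shows what ``the function is determined by the seed''
-% means in code: choosing a new function from the family amounts to
-% choosing a new integer seed, and only that integer has to be stored.
-\begin{verbatim}
-#include <stdint.h>
-#include <stdlib.h>
-
-/* Stand-in for Jenkins' hash: the seed fully determines the function. */
-extern uint32_t jenkins(const void *key, size_t len, uint32_t seed);
-
-/* Selecting a new hash function from the family = drawing a new seed. */
-uint32_t select_function(void)
-{
-    return (uint32_t)rand();
-}
-\end{verbatim}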
-% The works~\cite{asw00,swz00} present algorithms for constructing PHFs
-% and MPHFs deterministically.
-% The functions they generate require $O(n \log(n) + \log(\log(u)))$ bits to be described.
-% The average-case complexity of those algorithms is
-% $O(n\log(n) \log(\log(u)))$ and the worst-case one is $O(n^3\log(n) \log(\log(u)))$.
-% The evaluation complexity of the functions is $O(\log(n) + \log(\log(u)))$.
-% Thus, those algorithms do not generate functions that can be evaluated in
-% $O(1)$ time, they are a factor of $\log n$ away from the optimal complexity
-% for describing PHFs and MPHFs (Mehlhorn shows in~\cite{m84}
-% that at least $\Omega(n^2/(2\ln 2) m + \log\log u)$ bits are needed to
-% store a PHF), and they do not generate the functions in linear time.
-% Moreover, the universe $U$ of the keys is restricted to integers, which may
-% limit their use in practice.
-
-Pagh~\cite{p99} proposed a family of randomized algorithms for
-constructing MPHFs
-where the form of the resulting function is $h(x) = (f(x) + d[g(x)]) \bmod n$,
-where $f$ and $g$ are universal hash functions and $d$ is a set of
-displacement values to resolve collisions that are caused by the function $f$.
-Pagh identified a set of conditions concerning $f$ and $g$ and showed
-that if these conditions are satisfied, then a minimal perfect hash
-function can be computed in expected time $O(n)$ and stored in
-$(2+\epsilon)n\log_2n$ bits.
-
-Dietzfelbinger and Hagerup~\cite{dh01} improved the result of~\cite{p99},
-reducing from $(2+\epsilon)n\log_2n$ to $(1+\epsilon)n\log_2n$ the number of bits
-required to store the function, but in their approach~$f$ and~$g$ must
-be chosen from a class
-of hash functions that meet additional requirements.
-%Differently from the works in~\cite{dh01, p99}, our algorithm generates a MPHF
-%$h$ in expected linear time and $h$ can be stored in $O(n)$ bits (9 bits per key).
-
-% Galli, Seybold and Simon~\cite{gss01} proposed a randomized algorithm that
-% generates MPHFs of the same form as the ones generated by the algorithms of
-% Pagh~\cite{p99} and of Dietzfelbinger and Hagerup~\cite{dh01}. However, they
-% defined the forms $f(k) = h_c(k) \bmod n$ and $g(k) = \lfloor h_c(k)/n \rfloor$
-% to obtain, in expected time $O(n)$, a function that can be described in
-% $O(n\log n)$ bits, where $h_c(k) = (ck \bmod p) \bmod n^2$, $1 \leq c \leq p-1$,
-% and $p$ is a prime greater than $u$.
-%Our algorithm is the first one capable of generating MPHFs for sets on the order of
-%a billion keys. This is possible because we do not need to keep in main memory
-%at generation time complex data structures such as a graph, lists and so on. We just need to maintain
-%a small vector that occupies around 8MB for a set of 1 billion keys.
-
-Fox et al.~\cite{fch92,fhcd92} studied MPHFs
-%that also share features with the ones generated by our algorithm.
-that bring the storage requirements down to between 2 and 4 bits per key.
-However, it is shown in~\cite[Section 6.7]{chm97} that their algorithms have exponential
-running times; indeed, our implementation of their algorithm could not
-scale beyond 11 million keys.
-
-Our previous work~\cite{bkz05} improves the one by Czech, Havas and Majewski~\cite{chm92}.
-We obtained more compact functions in less time. Although
-the algorithm in~\cite{bkz05} is the fastest algorithm
-we know of, the resulting functions are stored in $O(n\log n)$ bits and
-one needs to keep in main memory at generation time a random graph of $n$ edges
-and $cn$ vertices,
-where $c\in[0.93,1.15]$. Using the well known divide-and-conquer approach,
-we use that algorithm as a building block for the new one, where the
-resulting functions are stored in $O(n)$ bits.
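-
-To make the form of Pagh's functions discussed above concrete, the
-following C sketch (illustrative only; \texttt{f} and \texttt{g} stand
-in for the universal hash functions, and the table layout is an
-assumption) evaluates $h(x) = (f(x) + d[g(x)]) \bmod n$:
-\begin{verbatim}
-#include <stdint.h>
-#include <stddef.h>
-
-/* Stand-ins for the universal hash functions f and g. */
-extern uint32_t f(const void *key, size_t len);
-extern uint32_t g(const void *key, size_t len);
-
-/* Evaluate h(x) = (f(x) + d[g(x)]) mod n, where d is the table of
- * displacement values and d_len is its number of entries. */
-uint32_t hash_and_displace(const void *key, size_t len,
-                           const uint32_t *d, uint32_t d_len, uint32_t n)
-{
-    return (f(key, len) + d[g(key, len) % d_len]) % n;
-}
-\end{verbatim}
-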
diff --git a/vldb07/searching.tex b/vldb07/searching.tex deleted file mode 100755 index 8feb6f1..0000000 --- a/vldb07/searching.tex +++ /dev/null @@ -1,155 +0,0 @@ -%% Nivio: 22/jan/06 -% Time-stamp: -\vspace{-7mm} -\subsection{Searching step} -\label{sec:searching} - -\enlargethispage{2\baselineskip} -The searching step is responsible for generating a MPHF for each -bucket. -Figure~\ref{fig:searchingstep} presents the searching step algorithm. -\vspace{-2mm} -\begin{figure}[h] -%\centering -\hrule -\hrule -\vspace{2mm} -\begin{tabbing} -aa\=type booleanx \== (false, true); \kill -\> $\blacktriangleright$ Let $H$ be a minimum heap of size $N$, where the \\ -\> ~~ order relation in $H$ is given by Eq.~(\ref{eq:bucketindex}), that is, the\\ -\> ~~ remove operation removes the item with smallest $i$\\ -\> $1.$ {\bf for} $j = 1$ {\bf to} $N$ {\bf do} \{ Heap construction \}\\ -\> ~~ $1.1$ Read key $k$ from File $j$ on disk\\ -\> ~~ $1.2$ Insert $(i, j, k)$ in $H$ \\ -\> $2.$ {\bf for} $i = 0$ {\bf to} $\lceil n/b \rceil - 1$ {\bf do} \\ -\> ~~ $2.1$ Read bucket $i$ from disk driven by heap $H$ \\ -\> ~~ $2.2$ Generate a MPHF for bucket $i$ \\ -\> ~~ $2.3$ Write the description of MPHF$_i$ to the disk -\end{tabbing} -\vspace{-1mm} -\hrule -\hrule -\caption{Searching step} -\label{fig:searchingstep} -\vspace{-4mm} -\end{figure} - -Statement 1 of Figure~\ref{fig:searchingstep} inserts one key from each file -in a minimum heap $H$ of size $N$. -The order relation in $H$ is given by the bucket address $i$ given by -Eq.~(\ref{eq:bucketindex}). - -%\enlargethispage{-\baselineskip} -Statement 2 has two important steps. -In statement 2.1, a bucket is read from disk, -as described below. -%in Section~\ref{sec:readingbucket}. -In statement 2.2, a MPHF is generated for each bucket $i$, as described -in the following. -%in Section~\ref{sec:mphfbucket}. -The description of MPHF$_i$ is a vector $g_i$ of 8-bit integers. -Finally, statement 2.3 writes the description $g_i$ of MPHF$_i$ to disk. - -\vspace{-3mm} -\label{sec:readingbucket} -\subsubsection{Reading a bucket from disk.} - -In this section we present the refinement of statement 2.1 of -Figure~\ref{fig:searchingstep}. -The algorithm to read bucket $i$ from disk is presented -in Figure~\ref{fig:readingbucket}. - -\begin{figure}[h] -\hrule -\hrule -\vspace{2mm} -\begin{tabbing} -aa\=type booleanx \== (false, true); \kill -\> $1.$ {\bf while} bucket $i$ is not full {\bf do} \\ -\> ~~ $1.1$ Remove $(i, j, k)$ from $H$\\ -\> ~~ $1.2$ Insert $k$ into bucket $i$ \\ -\> ~~ $1.3$ Read sequentially all keys $k$ from File $j$ that have \\ -\> ~~~~~~~ the same $i$ and insert them into bucket $i$ \\ -\> ~~ $1.4$ Insert the triple $(i, j, x)$ in $H$, where $x$ is the first \\ -\> ~~~~~~~ key read from File $j$ that does not have the \\ -\> ~~~~~~~ same bucket index $i$ -\end{tabbing} -\hrule -\hrule -\vspace{-1.0mm} -\caption{Reading a bucket} -\vspace{-4.0mm} -\label{fig:readingbucket} -\end{figure} - -Bucket $i$ is distributed among many files and the heap $H$ is used to drive a -multiway merge operation. -In Figure~\ref{fig:readingbucket}, statement 1.1 extracts and removes triple -$(i, j, k)$ from $H$, where $i$ is a minimum value in $H$. -Statement 1.2 inserts key $k$ in bucket $i$. 
-Notice that the $k$ in the triple $(i, j, k)$ is in fact a pointer to
-the first byte of the key, which is kept in contiguous positions of an array of characters
-(this array containing the keys is initialized during the heap construction
-in statement 1 of Figure~\ref{fig:searchingstep}).
-Statement 1.3 performs a seek operation in File $j$ on disk for the first
-read operation and reads sequentially all keys $k$ that have the same $i$
-%(obtained from Eq.~(\ref{eq:bucketindex}))
-and inserts them all in bucket $i$.
-Finally, statement 1.4 inserts in $H$ the triple $(i, j, x)$,
-where $x$ is the first key read from File $j$ (in statement 1.3)
-that does not have the same bucket address as the previous keys.
-
-The number of seek operations on disk performed in statement 1.3 is discussed
-in Section~\ref{sec:linearcomplexity},
-where we present a buffering technique that brings down
-the time spent with seeks.
-
-\vspace{-2mm}
-\enlargethispage{2\baselineskip}
-\subsubsection{Generating a MPHF for each bucket.} \label{sec:mphfbucket}
-
-To the best of our knowledge, the algorithm we designed in
-our previous work~\cite{bkz05} is the fastest published algorithm for
-constructing MPHFs.
-That is why we are using that algorithm as a building block for the
-algorithm presented here.
-
-%\enlargethispage{-\baselineskip}
-Our previous algorithm is a three-step internal memory based algorithm
-that produces a MPHF based on random graphs.
-For a set of $n$ keys, the algorithm outputs the resulting MPHF in expected time $O(n)$.
-For a given bucket $i$, $0 \leq i < \lceil n/b \rceil$, the corresponding MPHF$_i$
-has the following form:
-\begin{eqnarray}
-  \mathrm{MPHF}_i(k) &=& g_i[a] + g_i[b] \label{eq:mphfi}
-\end{eqnarray}
-where $a = h_{i1}(k) \bmod t$, $b = h_{i2}(k) \bmod t$ and
-$t = c\times \mathit{size}[i]$. The functions
-$h_{i1}(k)$ and $h_{i2}(k)$ are instances of the universal hash function proposed by Jenkins~\cite{j97}
-that was used in the partitioning step described in Section~\ref{sec:partitioning-keys}.
-
-In order to generate the function above, the algorithm generates simple random graphs
-$G_i = (V_i, E_i)$ with~$|V_i|=t=c\times\mathit{size}[i]$ and $|E_i|=\mathit{size}[i]$, with $c \in [0.93, 1.15]$.
-To generate a simple random graph with high
-probability\footnote{We use the term `with high probability'
-to mean `with probability tending to~$1$ as~$n\to\infty$'.}, two vertices $a$ and $b$ are
-computed for each key $k$ in bucket $i$.
-Thus, each bucket $i$ has a corresponding graph~$G_i=(V_i,E_i)$, where $V_i=\{0,1,
-\ldots,t-1\}$ and $E_i=\big\{\{a,b\}:k \in \mathrm{bucket}\: i\big\}$.
-In order to get a simple graph,
-the algorithm repeatedly selects $h_{i1}$ and $h_{i2}$ from a family of universal hash functions
-until the corresponding graph is simple.
-The probability of getting a simple graph is $p=e^{-1/c^2}$.
-For $c=1$, this probability is $p \simeq 0.368$, and the expected number of
-iterations to obtain a simple graph is~$1/p \simeq 2.72$.
-
-The construction of MPHF$_i$ ends with a computation of a suitable labelling of the vertices
-of~$G_i$. The labelling is stored in the vector $g_i$.
-We choose~$g_i[v]$ for each~$v\in V_i$ in such
-a way that Eq.~(\ref{eq:mphfi}) is a MPHF for bucket $i$, as the evaluation
-sketch below illustrates.
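-
-The sketch (illustrative only; \texttt{h\_i1} and \texttt{h\_i2} stand in
-for the two hash functions selected for bucket $i$, and $g$ holds the
-8-bit entries of $g_i$) evaluates Eq.~(\ref{eq:mphfi}) for a key:
-\begin{verbatim}
-#include <stdint.h>
-#include <stddef.h>
-
-/* Stand-ins for the two universal hash functions of bucket i. */
-extern uint32_t h_i1(const void *key, size_t len);
-extern uint32_t h_i2(const void *key, size_t len);
-
-/* MPHF_i(k) = g[a] + g[b], with a, b in [0, t) and t = c * size[i];
- * the labelling g is chosen so the sum falls in [0, size[i]). */
-uint32_t mphf_bucket(const void *key, size_t len,
-                     const uint8_t *g, uint32_t t)
-{
-    uint32_t a = h_i1(key, len) % t;
-    uint32_t b = h_i2(key, len) % t;
-    return (uint32_t)g[a] + (uint32_t)g[b];
-}
-\end{verbatim}
-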
-In order to get the values of each entry of $g_i$ we first -run a breadth-first search on the 2-\textit{core} of $G_i$, i.e., the maximal subgraph -of~$G_i$ with minimal degree at least~$2$ (see, e.g., \cite{b01,jlr00,pw04}) and -a depth-first search on the acyclic part of $G_i$ (see \cite{bkz05} for details). - diff --git a/vldb07/svglov2.clo b/vldb07/svglov2.clo deleted file mode 100644 index d98306e..0000000 --- a/vldb07/svglov2.clo +++ /dev/null @@ -1,77 +0,0 @@ -% SVJour2 DOCUMENT CLASS OPTION SVGLOV2 -- for standardised journals -% -% This is an enhancement for the LaTeX -% SVJour2 document class for Springer journals -% -%% -%% -%% \CharacterTable -%% {Upper-case \A\B\C\D\E\F\G\H\I\J\K\L\M\N\O\P\Q\R\S\T\U\V\W\X\Y\Z -%% Lower-case \a\b\c\d\e\f\g\h\i\j\k\l\m\n\o\p\q\r\s\t\u\v\w\x\y\z -%% Digits \0\1\2\3\4\5\6\7\8\9 -%% Exclamation \! Double quote \" Hash (number) \# -%% Dollar \$ Percent \% Ampersand \& -%% Acute accent \' Left paren \( Right paren \) -%% Asterisk \* Plus \+ Comma \, -%% Minus \- Point \. Solidus \/ -%% Colon \: Semicolon \; Less than \< -%% Equals \= Greater than \> Question mark \? -%% Commercial at \@ Left bracket \[ Backslash \\ -%% Right bracket \] Circumflex \^ Underscore \_ -%% Grave accent \` Left brace \{ Vertical bar \| -%% Right brace \} Tilde \~} -\ProvidesFile{svglov2.clo} - [2004/10/25 v2.1 - style option for standardised journals] -\typeout{SVJour Class option: svglov2.clo for standardised journals} -\def\validfor{svjour2} -\ExecuteOptions{final,10pt,runningheads} -% No size changing allowed, hence a copy of size10.clo is included -\renewcommand\normalsize{% - \@setfontsize\normalsize{10.2pt}{4mm}% - \abovedisplayskip=3 mm plus6pt minus 4pt - \belowdisplayskip=3 mm plus6pt minus 4pt - \abovedisplayshortskip=0.0 mm plus6pt - \belowdisplayshortskip=2 mm plus4pt minus 4pt - \let\@listi\@listI} -\normalsize -\newcommand\small{% - \@setfontsize\small{8.7pt}{3.25mm}% - \abovedisplayskip 8.5\p@ \@plus3\p@ \@minus4\p@ - \abovedisplayshortskip \z@ \@plus2\p@ - \belowdisplayshortskip 4\p@ \@plus2\p@ \@minus2\p@ - \def\@listi{\leftmargin\leftmargini - \parsep 0\p@ \@plus1\p@ \@minus\p@ - \topsep 4\p@ \@plus2\p@ \@minus4\p@ - \itemsep0\p@}% - \belowdisplayskip \abovedisplayskip -} -\let\footnotesize\small -\newcommand\scriptsize{\@setfontsize\scriptsize\@viipt\@viiipt} -\newcommand\tiny{\@setfontsize\tiny\@vpt\@vipt} -\newcommand\large{\@setfontsize\large\@xiipt{14pt}} -\newcommand\Large{\@setfontsize\Large\@xivpt{16dd}} -\newcommand\LARGE{\@setfontsize\LARGE\@xviipt{17dd}} -\newcommand\huge{\@setfontsize\huge\@xxpt{25}} -\newcommand\Huge{\@setfontsize\Huge\@xxvpt{30}} -% -%ALT% \def\runheadhook{\rlap{\smash{\lower5pt\hbox to\textwidth{\hrulefill}}}} -\def\runheadhook{\rlap{\smash{\lower11pt\hbox to\textwidth{\hrulefill}}}} -\AtEndOfClass{\advance\headsep by5pt} -\if@twocolumn -\setlength{\textwidth}{17.6cm} -\setlength{\textheight}{230mm} -\AtEndOfClass{\setlength\columnsep{4mm}} -\else -\setlength{\textwidth}{11.7cm} -\setlength{\textheight}{517.5dd} % 19.46cm -\fi -% -\AtBeginDocument{% -\@ifundefined{@journalname} - {\typeout{Unknown journal: specify \string\journalname\string{% -\string} in preambel^^J}}{}} -% -\endinput -%% -%% End of file `svglov2.clo'. 
diff --git a/vldb07/svjour2.cls b/vldb07/svjour2.cls deleted file mode 100644 index 56d9216..0000000 --- a/vldb07/svjour2.cls +++ /dev/null @@ -1,1419 +0,0 @@ -% SVJour2 DOCUMENT CLASS -- version 2.8 for LaTeX2e -% -% LaTeX document class for Springer journals -% -%% -%% -%% \CharacterTable -%% {Upper-case \A\B\C\D\E\F\G\H\I\J\K\L\M\N\O\P\Q\R\S\T\U\V\W\X\Y\Z -%% Lower-case \a\b\c\d\e\f\g\h\i\j\k\l\m\n\o\p\q\r\s\t\u\v\w\x\y\z -%% Digits \0\1\2\3\4\5\6\7\8\9 -%% Exclamation \! Double quote \" Hash (number) \# -%% Dollar \$ Percent \% Ampersand \& -%% Acute accent \' Left paren \( Right paren \) -%% Asterisk \* Plus \+ Comma \, -%% Minus \- Point \. Solidus \/ -%% Colon \: Semicolon \; Less than \< -%% Equals \= Greater than \> Question mark \? -%% Commercial at \@ Left bracket \[ Backslash \\ -%% Right bracket \] Circumflex \^ Underscore \_ -%% Grave accent \` Left brace \{ Vertical bar \| -%% Right brace \} Tilde \~} -\NeedsTeXFormat{LaTeX2e}[1995/12/01] -\ProvidesClass{svjour2}[2005/08/29 v2.8 -^^JLaTeX document class for Springer journals] -\newcommand\@ptsize{} -\newif\if@restonecol -\newif\if@titlepage -\@titlepagefalse -\DeclareOption{a4paper} - {\setlength\paperheight {297mm}% - \setlength\paperwidth {210mm}} -\DeclareOption{10pt}{\renewcommand\@ptsize{0}} -\DeclareOption{twoside}{\@twosidetrue \@mparswitchtrue} -\DeclareOption{draft}{\setlength\overfullrule{5pt}} -\DeclareOption{final}{\setlength\overfullrule{0pt}} -\DeclareOption{fleqn}{\input{fleqn.clo}\AtBeginDocument{\mathindent\z@}} -\DeclareOption{twocolumn}{\@twocolumntrue\ExecuteOptions{fleqn}} -\newif\if@avier\@avierfalse -\DeclareOption{onecollarge}{\@aviertrue} -\let\if@mathematic\iftrue -\let\if@numbook\iffalse -\DeclareOption{numbook}{\let\if@envcntsect\iftrue - \AtEndOfPackage{% - \renewcommand\thefigure{\thesection.\@arabic\c@figure}% - \renewcommand\thetable{\thesection.\@arabic\c@table}% - \renewcommand\theequation{\thesection.\@arabic\c@equation}% - \@addtoreset{figure}{section}% - \@addtoreset{table}{section}% - \@addtoreset{equation}{section}% - }% -} -\DeclareOption{openbib}{% - \AtEndOfPackage{% - \renewcommand\@openbib@code{% - \advance\leftmargin\bibindent - \itemindent -\bibindent - \listparindent \itemindent - \parsep \z@ - }% - \renewcommand\newblock{\par}}% -} -\DeclareOption{natbib}{% -\AtEndOfClass{\RequirePackage{natbib}% -% Changing some parameters of NATBIB -\setlength{\bibhang}{\parindent}% -%\setlength{\bibsep}{0mm}% -\let\bibfont=\small -\def\@biblabel#1{#1.}% -\newcommand{\etal}{et al.}% -\bibpunct{(}{)}{;}{a}{}{,}}} -% -\let\if@runhead\iffalse -\DeclareOption{runningheads}{\let\if@runhead\iftrue} -\let\if@smartrunh\iffalse -\DeclareOption{smartrunhead}{\let\if@smartrunh\iftrue} -\DeclareOption{nosmartrunhead}{\let\if@smartrunh\iffalse} -\let\if@envcntreset\iffalse -\DeclareOption{envcountreset}{\let\if@envcntreset\iftrue} -\let\if@envcntsame\iffalse -\DeclareOption{envcountsame}{\let\if@envcntsame\iftrue} -\let\if@envcntsect\iffalse -\DeclareOption{envcountsect}{\let\if@envcntsect\iftrue} -\let\if@referee\iffalse -\DeclareOption{referee}{\let\if@referee\iftrue} -\def\makereferee{\def\baselinestretch{2}} -\let\if@instindent\iffalse -\DeclareOption{instindent}{\let\if@instindent\iftrue} -\let\if@smartand\iffalse -\DeclareOption{smartand}{\let\if@smartand\iftrue} -\let\if@spthms\iftrue -\DeclareOption{nospthms}{\let\if@spthms\iffalse} -% -% language and babel dependencies -\DeclareOption{deutsch}{\def\switcht@@therlang{\switcht@deutsch}% -\gdef\svlanginfo{\typeout{Man spricht 
deutsch.}\global\let\svlanginfo\relax}} -\DeclareOption{francais}{\def\switcht@@therlang{\switcht@francais}% -\gdef\svlanginfo{\typeout{On parle francais.}\global\let\svlanginfo\relax}} -\let\switcht@@therlang\relax -\let\svlanginfo\relax -% -\AtBeginDocument{\@ifpackageloaded{babel}{% -\@ifundefined{extrasenglish}{}{\addto\extrasenglish{\switcht@albion}}% -\@ifundefined{extrasUKenglish}{}{\addto\extrasUKenglish{\switcht@albion}}% -\@ifundefined{extrasfrenchb}{}{\addto\extrasfrenchb{\switcht@francais}}% -\@ifundefined{extrasgerman}{}{\addto\extrasgerman{\switcht@deutsch}}% -\@ifundefined{extrasngerman}{}{\addto\extrasngerman{\switcht@deutsch}}% -}{\switcht@@therlang}% -} -% -\def\ClassInfoNoLine#1#2{% - \ClassInfo{#1}{#2\@gobble}% -} -\let\journalopt\@empty -\DeclareOption*{% -\InputIfFileExists{sv\CurrentOption.clo}{% -\global\let\journalopt\CurrentOption}{% -\ClassWarning{Springer-SVJour2}{Specified option or subpackage -"\CurrentOption" not found -}\OptionNotUsed}} -\ExecuteOptions{a4paper,twoside,10pt,instindent} -\ProcessOptions -% -\ifx\journalopt\@empty\relax -\ClassInfoNoLine{Springer-SVJour2}{extra/valid Springer sub-package (-> *.clo) -\MessageBreak not found in option list of \string\documentclass -\MessageBreak - autoactivating "global" style}{} -\input{svglov2.clo} -\else -\@ifundefined{validfor}{% -\ClassError{Springer-SVJour2}{Possible option clash for sub-package -\MessageBreak "sv\journalopt.clo" - option file not valid -\MessageBreak for this class}{Perhaps you used an option of the old -Springer class SVJour!} -}{} -\fi -% -\if@smartrunh\AtEndDocument{\islastpageeven\getlastpagenumber}\fi -% -\newcommand{\twocoltest}[2]{\if@twocolumn\def\@gtempa{#2}\else\def\@gtempa{#1}\fi -\@gtempa\makeatother} -\newcommand{\columncase}{\makeatletter\twocoltest} -% -\DeclareMathSymbol{\Gamma}{\mathalpha}{letters}{"00} -\DeclareMathSymbol{\Delta}{\mathalpha}{letters}{"01} -\DeclareMathSymbol{\Theta}{\mathalpha}{letters}{"02} -\DeclareMathSymbol{\Lambda}{\mathalpha}{letters}{"03} -\DeclareMathSymbol{\Xi}{\mathalpha}{letters}{"04} -\DeclareMathSymbol{\Pi}{\mathalpha}{letters}{"05} -\DeclareMathSymbol{\Sigma}{\mathalpha}{letters}{"06} -\DeclareMathSymbol{\Upsilon}{\mathalpha}{letters}{"07} -\DeclareMathSymbol{\Phi}{\mathalpha}{letters}{"08} -\DeclareMathSymbol{\Psi}{\mathalpha}{letters}{"09} -\DeclareMathSymbol{\Omega}{\mathalpha}{letters}{"0A} -% -\setlength\parindent{15\p@} -\setlength\smallskipamount{3\p@ \@plus 1\p@ \@minus 1\p@} -\setlength\medskipamount{6\p@ \@plus 2\p@ \@minus 2\p@} -\setlength\bigskipamount{12\p@ \@plus 4\p@ \@minus 4\p@} -\setlength\headheight{12\p@} -\setlength\headsep {16.74dd} -\setlength\topskip {10\p@} -\setlength\footskip{30\p@} -\setlength\maxdepth{.5\topskip} -% -\@settopoint\textwidth -\setlength\marginparsep {10\p@} -\setlength\marginparpush{5\p@} -\setlength\topmargin{-10pt} -\if@twocolumn - \setlength\oddsidemargin {-30\p@} - \setlength\evensidemargin{-30\p@} -\else - \setlength\oddsidemargin {\z@} - \setlength\evensidemargin{\z@} -\fi -\setlength\marginparwidth {48\p@} -\setlength\footnotesep{8\p@} -\setlength{\skip\footins}{9\p@ \@plus 4\p@ \@minus 2\p@} -\setlength\floatsep {12\p@ \@plus 2\p@ \@minus 2\p@} -\setlength\textfloatsep{20\p@ \@plus 2\p@ \@minus 4\p@} -\setlength\intextsep {20\p@ \@plus 2\p@ \@minus 2\p@} -\setlength\dblfloatsep {12\p@ \@plus 2\p@ \@minus 2\p@} -\setlength\dbltextfloatsep{20\p@ \@plus 2\p@ \@minus 4\p@} -\setlength\@fptop{0\p@} -\setlength\@fpsep{12\p@ \@plus 2\p@ \@minus 2\p@} -\setlength\@fpbot{0\p@ \@plus 1fil} 
-\setlength\@dblfptop{0\p@} -\setlength\@dblfpsep{12\p@ \@plus 2\p@ \@minus 2\p@} -\setlength\@dblfpbot{0\p@ \@plus 1fil} -\setlength\partopsep{2\p@ \@plus 1\p@ \@minus 1\p@} -\def\@listi{\leftmargin\leftmargini - \parsep \z@ - \topsep 6\p@ \@plus2\p@ \@minus4\p@ - \itemsep\parsep} -\let\@listI\@listi -\@listi -\def\@listii {\leftmargin\leftmarginii - \labelwidth\leftmarginii - \advance\labelwidth-\labelsep - \topsep \z@ - \parsep \topsep - \itemsep \parsep} -\def\@listiii{\leftmargin\leftmarginiii - \labelwidth\leftmarginiii - \advance\labelwidth-\labelsep - \topsep \z@ - \parsep \topsep - \itemsep \parsep} -\def\@listiv {\leftmargin\leftmarginiv - \labelwidth\leftmarginiv - \advance\labelwidth-\labelsep} -\def\@listv {\leftmargin\leftmarginv - \labelwidth\leftmarginv - \advance\labelwidth-\labelsep} -\def\@listvi {\leftmargin\leftmarginvi - \labelwidth\leftmarginvi - \advance\labelwidth-\labelsep} -% -\setlength\lineskip{1\p@} -\setlength\normallineskip{1\p@} -\renewcommand\baselinestretch{} -\setlength\parskip{0\p@ \@plus \p@} -\@lowpenalty 51 -\@medpenalty 151 -\@highpenalty 301 -\setcounter{topnumber}{4} -\renewcommand\topfraction{.9} -\setcounter{bottomnumber}{2} -\renewcommand\bottomfraction{.7} -\setcounter{totalnumber}{6} -\renewcommand\textfraction{.1} -\renewcommand\floatpagefraction{.85} -\setcounter{dbltopnumber}{3} -\renewcommand\dbltopfraction{.85} -\renewcommand\dblfloatpagefraction{.85} -\def\ps@headings{% - \let\@oddfoot\@empty\let\@evenfoot\@empty - \def\@evenhead{\small\csname runheadhook\endcsname - \rlap{\thepage}\hfil\leftmark\unskip}% - \def\@oddhead{\small\csname runheadhook\endcsname - \ignorespaces\rightmark\hfil\llap{\thepage}}% - \let\@mkboth\@gobbletwo - \let\sectionmark\@gobble - \let\subsectionmark\@gobble - } -% make indentations changeable -\def\setitemindent#1{\settowidth{\labelwidth}{#1}% - \leftmargini\labelwidth - \advance\leftmargini\labelsep - \def\@listi{\leftmargin\leftmargini - \labelwidth\leftmargini\advance\labelwidth by -\labelsep - \parsep=\parskip - \topsep=\medskipamount - \itemsep=\parskip \advance\itemsep by -\parsep}} -\def\setitemitemindent#1{\settowidth{\labelwidth}{#1}% - \leftmarginii\labelwidth - \advance\leftmarginii\labelsep -\def\@listii{\leftmargin\leftmarginii - \labelwidth\leftmarginii\advance\labelwidth by -\labelsep - \parsep=\parskip - \topsep=\z@ - \itemsep=\parskip \advance\itemsep by -\parsep}} -% labels of description -\def\descriptionlabel#1{\hspace\labelsep #1\hfil} -% adjusted environment "description" -% if an optional parameter (at the first two levels of lists) -% is present, its width is considered to be the widest mark -% throughout the current list. -\def\description{\@ifnextchar[{\@describe}{\list{}{\labelwidth\z@ - \itemindent-\leftmargin \let\makelabel\descriptionlabel}}} -\let\enddescription\endlist -% -\def\describelabel#1{#1\hfil} -\def\@describe[#1]{\relax\ifnum\@listdepth=0 -\setitemindent{#1}\else\ifnum\@listdepth=1 -\setitemitemindent{#1}\fi\fi -\list{--}{\let\makelabel\describelabel}} -% -\newdimen\logodepth -\logodepth=1.2cm -\newdimen\headerboxheight -\headerboxheight=180pt % 18 10.5dd-lines - 2\baselineskip -\advance\headerboxheight by-14.5mm -\newdimen\betweenumberspace % dimension for space between -\betweenumberspace=3.33pt % number and text of titles. -\newdimen\aftertext % dimension for space after -\aftertext=5pt % text of title. -\newdimen\headlineindent % dimension for space between -\headlineindent=1.166cm % number and text of headings. 
-\if@mathematic - \def\runinend{} % \enspace} - \def\floatcounterend{\enspace} - \def\sectcounterend{} -\else - \def\runinend{.} - \def\floatcounterend{.\ } - \def\sectcounterend{.} -\fi -\def\email#1{\emailname: #1} -\def\keywords#1{\par\addvspace\medskipamount{\rightskip=0pt plus1cm -\def\and{\ifhmode\unskip\nobreak\fi\ $\cdot$ -}\noindent\keywordname\enspace\ignorespaces#1\par}} -% -\def\subclassname{{\bfseries Mathematics Subject Classification -(2000)}\enspace} -\def\subclass#1{\par\addvspace\medskipamount{\rightskip=0pt plus1cm -\def\and{\ifhmode\unskip\nobreak\fi\ $\cdot$ -}\noindent\subclassname\ignorespaces#1\par}} -% -\def\PACSname{\textbf{PACS}\enspace} -\def\PACS#1{\par\addvspace\medskipamount{\rightskip=0pt plus1cm -\def\and{\ifhmode\unskip\nobreak\fi\ $\cdot$ -}\noindent\PACSname\ignorespaces#1\par}} -% -\def\CRclassname{{\bfseries CR Subject Classification}\enspace} -\def\CRclass#1{\par\addvspace\medskipamount{\rightskip=0pt plus1cm -\def\and{\ifhmode\unskip\nobreak\fi\ $\cdot$ -}\noindent\CRclassname\ignorespaces#1\par}} -% -\def\ESMname{\textbf{Electronic Supplementary Material}\enspace} -\def\ESM#1{\par\addvspace\medskipamount -\noindent\ESMname\ignorespaces#1\par} -% -\newcounter{inst} -\newcounter{auth} -\def\authdepth{2} -\newdimen\instindent -\newbox\authrun -\newtoks\authorrunning -\newbox\titrun -\newtoks\titlerunning -\def\authorfont{\bfseries} - -\def\combirunning#1{\gdef\@combi{#1}} -\def\@combi{} -\newbox\combirun -% -\def\ps@last{\def\@evenhead{\small\rlap{\thepage}\hfil -\lastevenhead}} -\newcounter{lastpage} -\def\islastpageeven{\@ifundefined{lastpagenumber} -{\setcounter{lastpage}{0}}{\setcounter{lastpage}{\lastpagenumber}} -\ifnum\value{lastpage}>0 - \ifodd\value{lastpage}% - \else - \if@smartrunh - \thispagestyle{last}% - \fi - \fi -\fi} -\def\getlastpagenumber{\clearpage -\addtocounter{page}{-1}% - \immediate\write\@auxout{\string\gdef\string\lastpagenumber{\thepage}}% - \immediate\write\@auxout{\string\newlabel{LastPage}{{}{\thepage}}}% - \addtocounter{page}{1}} - -\def\journalname#1{\gdef\@journalname{#1}} - -\def\dedication#1{\gdef\@dedic{#1}} -\def\@dedic{} - -\let\@date\undefined -\def\notused{~} - -\def\institute#1{\gdef\@institute{#1}} - -\def\offprints#1{\begingroup -\def\protect{\noexpand\protect\noexpand}\xdef\@thanks{\@thanks -\protect\footnotetext[0]{\unskip\hskip-15pt{\itshape Send offprint requests -to\/}: \ignorespaces#1}}\endgroup\ignorespaces} - -%\def\mail#1{\gdef\@mail{#1}} -%\def\@mail{} - -\def\@thanks{} - -\def\@fnsymbol#1{\ifcase#1\or\star\or{\star\star}\or{\star\star\star}% - \or \dagger\or \ddagger\or - \mathchar "278\or \mathchar "27B\or \|\or **\or \dagger\dagger - \or \ddagger\ddagger \else\@ctrerr\fi\relax} -% -%\def\invthanks#1{\footnotetext[0]{\kern-\bibindent#1}} -% -\def\nothanksmarks{\def\thanks##1{\protected@xdef\@thanks{\@thanks - \protect\footnotetext[0]{\kern-\bibindent##1}}}} -% -\def\subtitle#1{\gdef\@subtitle{#1}} -\def\@subtitle{} - -\def\headnote#1{\gdef\@headnote{#1}} -\def\@headnote{} - -\def\papertype#1{\gdef\paper@type{\MakeUppercase{#1}}} -\def\paper@type{} - -\def\ch@ckobl#1#2{\@ifundefined{@#1} - {\typeout{SVJour2 warning: Missing -\expandafter\string\csname#1\endcsname}% - \csname #1\endcsname{#2}} - {}} -% -\def\ProcessRunnHead{% - \def\\{\unskip\ \ignorespaces}% - \def\thanks##1{\unskip{}}% - \instindent=\textwidth - \advance\instindent by-\headlineindent - \if!\the\titlerunning!\else - \edef\@title{\the\titlerunning}% - \fi - \global\setbox\titrun=\hbox{\small\rmfamily\unboldmath\ignorespaces\@title - 
\unskip}% - \ifdim\wd\titrun>\instindent - \typeout{^^JSVJour2 Warning: Title too long for running head.}% - \typeout{Please supply a shorter form with \string\titlerunning - \space prior to \string\maketitle}% - \global\setbox\titrun=\hbox{\small\rmfamily - Title Suppressed Due to Excessive Length}% - \fi - \xdef\@title{\copy\titrun}% -% - \if!\the\authorrunning! - \else - \setcounter{auth}{1}% - \edef\@author{\the\authorrunning}% - \fi - \ifnum\value{inst}>\authdepth - \def\stripauthor##1\and##2\endauthor{% - \protected@xdef\@author{##1\unskip\unskip\if!##2!\else\ et al.\fi}}% - \expandafter\stripauthor\@author\and\endauthor - \else - \gdef\and{\unskip, \ignorespaces}% - {\def\and{\noexpand\protect\noexpand\and}% - \protected@xdef\@author{\@author}} - \fi - \global\setbox\authrun=\hbox{\small\rmfamily\unboldmath\ignorespaces - \@author\unskip}% - \ifdim\wd\authrun>\instindent - \typeout{^^JSVJour2 Warning: Author name(s) too long for running head. - ^^JPlease supply a shorter form with \string\authorrunning - \space prior to \string\maketitle}% - \global\setbox\authrun=\hbox{\small\rmfamily Please give a shorter version - with: {\tt\string\authorrunning\space and - \string\titlerunning\space prior to \string\maketitle}}% - \fi - \xdef\@author{\copy\authrun}% - \markboth{\@author}{\@title}% -} -% -\let\orithanks=\thanks -\def\thanks#1{\ClassWarning{SVJour2}{\string\thanks\space may only be -used inside of \string\title, \string\author,\MessageBreak -and \string\date\space prior to \string\maketitle}} -% -\def\maketitle{\par\let\thanks=\orithanks -\ch@ckobl{journalname}{Noname} -\ch@ckobl{date}{the date of receipt and acceptance should be inserted -later} -\ch@ckobl{title}{A title should be given} -\ch@ckobl{author}{Name(s) and initial(s) of author(s) should be given} -\ch@ckobl{institute}{Address(es) of author(s) should be given} -\begingroup -% - \renewcommand\thefootnote{\@fnsymbol\c@footnote}% - \def\@makefnmark{$^{\@thefnmark}$}% - \renewcommand\@makefntext[1]{% - \noindent - \hb@xt@\bibindent{\hss\@makefnmark\enspace}##1\vrule height0pt - width0pt depth8pt} -% - \def\lastand{\ifnum\value{inst}=2\relax - \unskip{} \andname\ - \else - \unskip, \andname\ - \fi}% - \def\and{\stepcounter{auth}\relax - \if@smartand - \ifnum\value{auth}=\value{inst}% - \lastand - \else - \unskip, - \fi - \else - \unskip, - \fi}% - \thispagestyle{empty} - \ifnum \col@number=\@ne - \@maketitle - \else - \twocolumn[\@maketitle]% - \fi -% - \global\@topnum\z@ - \if!\@thanks!\else - \@thanks -\insert\footins{\vskip-3pt\hrule width\columnwidth\vskip3pt}% - \fi - {\def\thanks##1{\unskip{}}% - \def\iand{\\[5pt]\let\and=\nand}% - \def\nand{\ifhmode\unskip\nobreak\fi\ $\cdot$ }% - \let\and=\nand - \def\at{\\\let\and=\iand}% - \footnotetext[0]{\kern-\bibindent - \ignorespaces\@institute}\vspace{5dd}}% -%\if!\@mail!\else -% \footnotetext[0]{\kern-\bibindent\mailname\ -% \ignorespaces\@mail}% -%\fi -% - \if@runhead - \ProcessRunnHead - \fi -% - \endgroup - \setcounter{footnote}{0} - \global\let\thanks\relax - \global\let\maketitle\relax - \global\let\@maketitle\relax - \global\let\@thanks\@empty - \global\let\@author\@empty - \global\let\@date\@empty - \global\let\@title\@empty - \global\let\@subtitle\@empty - \global\let\title\relax - \global\let\author\relax - \global\let\date\relax - \global\let\and\relax} - -\def\makeheadbox{{% -\hbox to0pt{\vbox{\baselineskip=10dd\hrule\hbox -to\hsize{\vrule\kern3pt\vbox{\kern3pt -\hbox{\bfseries\@journalname\ manuscript No.} -\hbox{(will be inserted by the editor)} 
-\kern3pt}\hfil\kern3pt\vrule}\hrule}% -\hss}}} -% -\def\rubric{\setbox0=\hbox{\small\strut}\@tempdima=\ht0\advance -\@tempdima\dp0\advance\@tempdima2\fboxsep\vrule\@height\@tempdima -\@width\z@} -\newdimen\rubricwidth -% -\def\@maketitle{\newpage -\normalfont -\vbox to0pt{\if@twocolumn\vskip-39pt\else\vskip-49pt\fi -\nointerlineskip -\makeheadbox\vss}\nointerlineskip -\vbox to 0pt{\offinterlineskip\rubricwidth=\columnwidth -\vskip-12.5pt -\if@twocolumn\else % one column journal - \divide\rubricwidth by144\multiply\rubricwidth by89 % perform golden section - \vskip-\topskip -\fi -\hrule\@height0.35mm\noindent -\advance\fboxsep by.25mm -\global\advance\rubricwidth by0pt -\rubric -\vss}\vskip19.5pt -% -\if@twocolumn\else - \gdef\footnoterule{% - \kern-3\p@ - \hrule\@width\columnwidth %rubricwidth - \kern2.6\p@} -\fi -% - \setbox\authrun=\vbox\bgroup - \hrule\@height 9mm\@width0\p@ - \pretolerance=10000 - \rightskip=0pt plus 4cm - \nothanksmarks -% \if!\@headnote!\else -% \noindent -% {\LARGE\normalfont\itshape\ignorespaces\@headnote\par}\vskip 3.5mm -% \fi - {\authorfont - \setbox0=\vbox{\setcounter{auth}{1}\def\and{\stepcounter{auth} }% - \hfuzz=2\textwidth\def\thanks##1{}\@author}% - \setcounter{footnote}{0}% - \global\value{inst}=\value{auth}% - \setcounter{auth}{1}% - \if@twocolumn - \rightskip43mm plus 4cm minus 3mm - \else % one column journal - \rightskip=\linewidth - \advance\rightskip by-\rubricwidth - \advance\rightskip by0pt plus 4cm minus 3mm - \fi -% -\def\and{\unskip\nobreak\enskip{\boldmath$\cdot$}\enskip\ignorespaces}% - \noindent\ignorespaces\@author\vskip7.23pt} - {\LARGE\bfseries - \noindent\ignorespaces - \@title \par}\vskip 11.24pt\relax - \if!\@subtitle!\else - {\large\bfseries - \pretolerance=10000 - \rightskip=0pt plus 3cm - \vskip-5pt - \noindent\ignorespaces\@subtitle \par}\vskip 11.24pt - \fi - \small - \if!\@dedic!\else - \par - \normalsize\it - \addvspace\baselineskip - \noindent\@dedic - \fi - \egroup % end of header box - \@tempdima=\headerboxheight - \advance\@tempdima by-\ht\authrun - \unvbox\authrun - \ifdim\@tempdima>0pt - \vrule width0pt height\@tempdima\par - \fi - \noindent{\small\@date\vskip 6.2mm} - \global\@minipagetrue - \global\everypar{\global\@minipagefalse\global\everypar{}}% -%\vskip22.47pt -} -% -\if@mathematic - \def\vec#1{\ensuremath{\mathchoice - {\mbox{\boldmath$\displaystyle\mathbf{#1}$}} - {\mbox{\boldmath$\textstyle\mathbf{#1}$}} - {\mbox{\boldmath$\scriptstyle\mathbf{#1}$}} - {\mbox{\boldmath$\scriptscriptstyle\mathbf{#1}$}}}} -\else - \def\vec#1{\ensuremath{\mathchoice - {\mbox{\boldmath$\displaystyle#1$}} - {\mbox{\boldmath$\textstyle#1$}} - {\mbox{\boldmath$\scriptstyle#1$}} - {\mbox{\boldmath$\scriptscriptstyle#1$}}}} -\fi -% -\def\tens#1{\ensuremath{\mathsf{#1}}} -% -\setcounter{secnumdepth}{3} -\newcounter {section} -\newcounter {subsection}[section] -\newcounter {subsubsection}[subsection] -\newcounter {paragraph}[subsubsection] -\newcounter {subparagraph}[paragraph] -\renewcommand\thesection {\@arabic\c@section} -\renewcommand\thesubsection {\thesection.\@arabic\c@subsection} -\renewcommand\thesubsubsection{\thesubsection.\@arabic\c@subsubsection} -\renewcommand\theparagraph {\thesubsubsection.\@arabic\c@paragraph} -\renewcommand\thesubparagraph {\theparagraph.\@arabic\c@subparagraph} -% -\def\@hangfrom#1{\setbox\@tempboxa\hbox{#1}% - \hangindent \z@\noindent\box\@tempboxa} -% -\def\@seccntformat#1{\csname the#1\endcsname\sectcounterend -\hskip\betweenumberspace} -% -\newif\if@sectrule 
-\if@twocolumn\else\let\@sectruletrue=\relax\fi -\if@avier\let\@sectruletrue=\relax\fi -\def\makesectrule{\if@sectrule\global\@sectrulefalse\null\vglue-\topskip -\hrule\nobreak\parskip=5pt\relax\fi} -% -\let\makesectruleori=\makesectrule -\def\restoresectrule{\global\let\makesectrule=\makesectruleori\global\@sectrulefalse} -\def\nosectrule{\let\makesectrule=\restoresectrule} -% -\def\@startsection#1#2#3#4#5#6{% - \if@noskipsec \leavevmode \fi - \par - \@tempskipa #4\relax - \@afterindenttrue - \ifdim \@tempskipa <\z@ - \@tempskipa -\@tempskipa \@afterindentfalse - \fi - \if@nobreak - \everypar{}% - \else - \addpenalty\@secpenalty\addvspace\@tempskipa - \fi - \ifnum#2=1\relax\@sectruletrue\fi - \@ifstar - {\@ssect{#3}{#4}{#5}{#6}}% - {\@dblarg{\@sect{#1}{#2}{#3}{#4}{#5}{#6}}}} -% -\def\@sect#1#2#3#4#5#6[#7]#8{% - \ifnum #2>\c@secnumdepth - \let\@svsec\@empty - \else - \refstepcounter{#1}% - \protected@edef\@svsec{\@seccntformat{#1}\relax}% - \fi - \@tempskipa #5\relax - \ifdim \@tempskipa>\z@ - \begingroup - #6{\makesectrule - \@hangfrom{\hskip #3\relax\@svsec}% - \raggedright - \hyphenpenalty \@M% - \interlinepenalty \@M #8\@@par}% - \endgroup - \csname #1mark\endcsname{#7}% - \addcontentsline{toc}{#1}{% - \ifnum #2>\c@secnumdepth \else - \protect\numberline{\csname the#1\endcsname\sectcounterend}% - \fi - #7}% - \else - \def\@svsechd{% - #6{\hskip #3\relax - \@svsec #8\/\hskip\aftertext}% - \csname #1mark\endcsname{#7}% - \addcontentsline{toc}{#1}{% - \ifnum #2>\c@secnumdepth \else - \protect\numberline{\csname the#1\endcsname}% - \fi - #7}}% - \fi - \@xsect{#5}} -% -\def\@ssect#1#2#3#4#5{% - \@tempskipa #3\relax - \ifdim \@tempskipa>\z@ - \begingroup - #4{\makesectrule - \@hangfrom{\hskip #1}% - \interlinepenalty \@M #5\@@par}% - \endgroup - \else - \def\@svsechd{#4{\hskip #1\relax #5}}% - \fi - \@xsect{#3}} - -% -% measures and setting of sections -% -\def\section{\@startsection{section}{1}{\z@}% - {-21dd plus-8pt minus-4pt}{10.5dd} - {\normalsize\bfseries\boldmath}} -\def\subsection{\@startsection{subsection}{2}{\z@}% - {-21dd plus-8pt minus-4pt}{10.5dd} - {\normalsize\upshape}} -\def\subsubsection{\@startsection{subsubsection}{3}{\z@}% - {-13dd plus-8pt minus-4pt}{10.5dd} - {\normalsize\itshape}} -\def\paragraph{\@startsection{paragraph}{4}{\z@}% - {-13pt plus-8pt minus-4pt}{\z@}{\normalsize\itshape}} - -\setlength\leftmargini {\parindent} -\leftmargin \leftmargini -\setlength\leftmarginii {\parindent} -\setlength\leftmarginiii {1.87em} -\setlength\leftmarginiv {1.7em} -\setlength\leftmarginv {.5em} -\setlength\leftmarginvi {.5em} -\setlength \labelsep {.5em} -\setlength \labelwidth{\leftmargini} -\addtolength\labelwidth{-\labelsep} -\@beginparpenalty -\@lowpenalty -\@endparpenalty -\@lowpenalty -\@itempenalty -\@lowpenalty -\renewcommand\theenumi{\@arabic\c@enumi} -\renewcommand\theenumii{\@alph\c@enumii} -\renewcommand\theenumiii{\@roman\c@enumiii} -\renewcommand\theenumiv{\@Alph\c@enumiv} -\newcommand\labelenumi{\theenumi.} -\newcommand\labelenumii{(\theenumii)} -\newcommand\labelenumiii{\theenumiii.} -\newcommand\labelenumiv{\theenumiv.} -\renewcommand\p@enumii{\theenumi} -\renewcommand\p@enumiii{\theenumi(\theenumii)} -\renewcommand\p@enumiv{\p@enumiii\theenumiii} -\newcommand\labelitemi{\normalfont\bfseries --} -\newcommand\labelitemii{\normalfont\bfseries --} -\newcommand\labelitemiii{$\m@th\bullet$} -\newcommand\labelitemiv{$\m@th\cdot$} - -\if@spthms -% definition of the "\spnewtheorem" command. 
-% -% Usage: -% -% \spnewtheorem{env_nam}{caption}[within]{cap_font}{body_font} -% or \spnewtheorem{env_nam}[numbered_like]{caption}{cap_font}{body_font} -% or \spnewtheorem*{env_nam}{caption}{cap_font}{body_font} -% -% New is "cap_font" and "body_font". It stands for -% fontdefinition of the caption and the text itself. -% -% "\spnewtheorem*" gives a theorem without number. -% -% A defined spnewthoerem environment is used as described -% by Lamport. -% -%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% - -\def\@thmcountersep{} -\def\@thmcounterend{} -\newcommand\nocaption{\noexpand\@gobble} -\newdimen\spthmsep \spthmsep=5pt - -\def\spnewtheorem{\@ifstar{\@sthm}{\@Sthm}} - -% definition of \spnewtheorem with number - -\def\@spnthm#1#2{% - \@ifnextchar[{\@spxnthm{#1}{#2}}{\@spynthm{#1}{#2}}} -\def\@Sthm#1{\@ifnextchar[{\@spothm{#1}}{\@spnthm{#1}}} - -\def\@spxnthm#1#2[#3]#4#5{\expandafter\@ifdefinable\csname #1\endcsname - {\@definecounter{#1}\@addtoreset{#1}{#3}% - \expandafter\xdef\csname the#1\endcsname{\expandafter\noexpand - \csname the#3\endcsname \noexpand\@thmcountersep \@thmcounter{#1}}% - \expandafter\xdef\csname #1name\endcsname{#2}% - \global\@namedef{#1}{\@spthm{#1}{\csname #1name\endcsname}{#4}{#5}}% - \global\@namedef{end#1}{\@endtheorem}}} - -\def\@spynthm#1#2#3#4{\expandafter\@ifdefinable\csname #1\endcsname - {\@definecounter{#1}% - \expandafter\xdef\csname the#1\endcsname{\@thmcounter{#1}}% - \expandafter\xdef\csname #1name\endcsname{#2}% - \global\@namedef{#1}{\@spthm{#1}{\csname #1name\endcsname}{#3}{#4}}% - \global\@namedef{end#1}{\@endtheorem}}} - -\def\@spothm#1[#2]#3#4#5{% - \@ifundefined{c@#2}{\@latexerr{No theorem environment `#2' defined}\@eha}% - {\expandafter\@ifdefinable\csname #1\endcsname - {\global\@namedef{the#1}{\@nameuse{the#2}}% - \expandafter\xdef\csname #1name\endcsname{#3}% - \global\@namedef{#1}{\@spthm{#2}{\csname #1name\endcsname}{#4}{#5}}% - \global\@namedef{end#1}{\@endtheorem}}}} - -\def\@spthm#1#2#3#4{\topsep 7\p@ \@plus2\p@ \@minus4\p@ -\labelsep=\spthmsep\refstepcounter{#1}% -\@ifnextchar[{\@spythm{#1}{#2}{#3}{#4}}{\@spxthm{#1}{#2}{#3}{#4}}} - -\def\@spxthm#1#2#3#4{\@spbegintheorem{#2}{\csname the#1\endcsname}{#3}{#4}% - \ignorespaces} - -\def\@spythm#1#2#3#4[#5]{\@spopargbegintheorem{#2}{\csname - the#1\endcsname}{#5}{#3}{#4}\ignorespaces} - -\def\normalthmheadings{\def\@spbegintheorem##1##2##3##4{\trivlist\normalfont - \item[\hskip\labelsep{##3##1\ ##2\@thmcounterend}]##4} -\def\@spopargbegintheorem##1##2##3##4##5{\trivlist - \item[\hskip\labelsep{##4##1\ ##2}]{##4(##3)\@thmcounterend\ }##5}} -\normalthmheadings - -\def\reversethmheadings{\def\@spbegintheorem##1##2##3##4{\trivlist\normalfont - \item[\hskip\labelsep{##3##2\ ##1\@thmcounterend}]##4} -\def\@spopargbegintheorem##1##2##3##4##5{\trivlist - \item[\hskip\labelsep{##4##2\ ##1}]{##4(##3)\@thmcounterend\ }##5}} - -% definition of \spnewtheorem* without number - -\def\@sthm#1#2{\@Ynthm{#1}{#2}} - -\def\@Ynthm#1#2#3#4{\expandafter\@ifdefinable\csname #1\endcsname - {\global\@namedef{#1}{\@Thm{\csname #1name\endcsname}{#3}{#4}}% - \expandafter\xdef\csname #1name\endcsname{#2}% - \global\@namedef{end#1}{\@endtheorem}}} - -\def\@Thm#1#2#3{\topsep 7\p@ \@plus2\p@ \@minus4\p@ -\@ifnextchar[{\@Ythm{#1}{#2}{#3}}{\@Xthm{#1}{#2}{#3}}} - -\def\@Xthm#1#2#3{\@Begintheorem{#1}{#2}{#3}\ignorespaces} - -\def\@Ythm#1#2#3[#4]{\@Opargbegintheorem{#1} - {#4}{#2}{#3}\ignorespaces} - -\def\@Begintheorem#1#2#3{#3\trivlist - \item[\hskip\labelsep{#2#1\@thmcounterend}]} - 
-\def\@Opargbegintheorem#1#2#3#4{#4\trivlist - \item[\hskip\labelsep{#3#1}]{#3(#2)\@thmcounterend\ }} - -% initialize theorem environment - -\if@envcntsect - \def\@thmcountersep{.} - \spnewtheorem{theorem}{Theorem}[section]{\bfseries}{\itshape} -\else - \spnewtheorem{theorem}{Theorem}{\bfseries}{\itshape} - \if@envcntreset - \@addtoreset{theorem}{section} - \else - \@addtoreset{theorem}{chapter} - \fi -\fi - -%definition of divers theorem environments -\spnewtheorem*{claim}{Claim}{\itshape}{\rmfamily} -\spnewtheorem*{proof}{Proof}{\itshape}{\rmfamily} -\if@envcntsame % all environments like "Theorem" - using its counter - \def\spn@wtheorem#1#2#3#4{\@spothm{#1}[theorem]{#2}{#3}{#4}} -\else % all environments with their own counter - \if@envcntsect % show section counter - \def\spn@wtheorem#1#2#3#4{\@spxnthm{#1}{#2}[section]{#3}{#4}} - \else % not numbered with section - \if@envcntreset - \def\spn@wtheorem#1#2#3#4{\@spynthm{#1}{#2}{#3}{#4} - \@addtoreset{#1}{section}} - \else - \let\spn@wtheorem=\@spynthm - \fi - \fi -\fi -% -\let\spdefaulttheorem=\spn@wtheorem -% -\spn@wtheorem{case}{Case}{\itshape}{\rmfamily} -\spn@wtheorem{conjecture}{Conjecture}{\itshape}{\rmfamily} -\spn@wtheorem{corollary}{Corollary}{\bfseries}{\itshape} -\spn@wtheorem{definition}{Definition}{\bfseries}{\rmfamily} -\spn@wtheorem{example}{Example}{\itshape}{\rmfamily} -\spn@wtheorem{exercise}{Exercise}{\bfseries}{\rmfamily} -\spn@wtheorem{lemma}{Lemma}{\bfseries}{\itshape} -\spn@wtheorem{note}{Note}{\itshape}{\rmfamily} -\spn@wtheorem{problem}{Problem}{\bfseries}{\rmfamily} -\spn@wtheorem{property}{Property}{\itshape}{\rmfamily} -\spn@wtheorem{proposition}{Proposition}{\bfseries}{\itshape} -\spn@wtheorem{question}{Question}{\itshape}{\rmfamily} -\spn@wtheorem{solution}{Solution}{\bfseries}{\rmfamily} -\spn@wtheorem{remark}{Remark}{\itshape}{\rmfamily} -% -\newenvironment{theopargself} - {\def\@spopargbegintheorem##1##2##3##4##5{\trivlist - \item[\hskip\labelsep{##4##1\ ##2}]{##4##3\@thmcounterend\ }##5} - \def\@Opargbegintheorem##1##2##3##4{##4\trivlist - \item[\hskip\labelsep{##3##1}]{##3##2\@thmcounterend\ }}}{} -\newenvironment{theopargself*} - {\def\@spopargbegintheorem##1##2##3##4##5{\trivlist - \item[\hskip\labelsep{##4##1\ ##2}]{\hspace*{-\labelsep}##4##3\@thmcounterend}##5} - \def\@Opargbegintheorem##1##2##3##4{##4\trivlist - \item[\hskip\labelsep{##3##1}]{\hspace*{-\labelsep}##3##2\@thmcounterend}}}{} -% -\fi - -\def\@takefromreset#1#2{% - \def\@tempa{#1}% - \let\@tempd\@elt - \def\@elt##1{% - \def\@tempb{##1}% - \ifx\@tempa\@tempb\else - \@addtoreset{##1}{#2}% - \fi}% - \expandafter\expandafter\let\expandafter\@tempc\csname cl@#2\endcsname - \expandafter\def\csname cl@#2\endcsname{}% - \@tempc - \let\@elt\@tempd} - -\def\squareforqed{\hbox{\rlap{$\sqcap$}$\sqcup$}} -\def\qed{\ifmmode\else\unskip\quad\fi\squareforqed} -\def\smartqed{\def\qed{\ifmmode\squareforqed\else{\unskip\nobreak\hfil -\penalty50\hskip1em\null\nobreak\hfil\squareforqed -\parfillskip=0pt\finalhyphendemerits=0\endgraf}\fi}} - -% Define `abstract' environment -\def\abstract{\topsep=0pt\partopsep=0pt\parsep=0pt\itemsep=0pt\relax -\trivlist\item[\hskip\labelsep -{\bfseries\abstractname}]\if!\abstractname!\hskip-\labelsep\fi} -\if@twocolumn - \if@avier - \def\endabstract{\endtrivlist\addvspace{5mm}\strich} - \def\strich{\hrule\vskip1ptplus12pt} - \else - \def\endabstract{\endtrivlist\addvspace{3mm}} - \fi -\else -\fi -% -\newenvironment{verse} - {\let\\\@centercr - \list{}{\itemsep \z@ - \itemindent -1.5em% - \listparindent\itemindent - 
\rightmargin \leftmargin - \advance\leftmargin 1.5em}% - \item\relax} - {\endlist} -\newenvironment{quotation} - {\list{}{\listparindent 1.5em% - \itemindent \listparindent - \rightmargin \leftmargin - \parsep \z@ \@plus\p@}% - \item\relax} - {\endlist} -\newenvironment{quote} - {\list{}{\rightmargin\leftmargin}% - \item\relax} - {\endlist} -\newcommand\appendix{\par\small - \setcounter{section}{0}% - \setcounter{subsection}{0}% - \renewcommand\thesection{\@Alph\c@section}} -\setlength\arraycolsep{1.5\p@} -\setlength\tabcolsep{6\p@} -\setlength\arrayrulewidth{.4\p@} -\setlength\doublerulesep{2\p@} -\setlength\tabbingsep{\labelsep} -\skip\@mpfootins = \skip\footins -\setlength\fboxsep{3\p@} -\setlength\fboxrule{.4\p@} -\renewcommand\theequation{\@arabic\c@equation} -\newcounter{figure} -\renewcommand\thefigure{\@arabic\c@figure} -\def\fps@figure{tbp} -\def\ftype@figure{1} -\def\ext@figure{lof} -\def\fnum@figure{\figurename~\thefigure} -\newenvironment{figure} - {\@float{figure}} - {\end@float} -\newenvironment{figure*} - {\@dblfloat{figure}} - {\end@dblfloat} -\newcounter{table} -\renewcommand\thetable{\@arabic\c@table} -\def\fps@table{tbp} -\def\ftype@table{2} -\def\ext@table{lot} -\def\fnum@table{\tablename~\thetable} -\newenvironment{table} - {\@float{table}} - {\end@float} -\newenvironment{table*} - {\@dblfloat{table}} - {\end@dblfloat} -% -\def \@floatboxreset {% - \reset@font - \small - \@setnobreak - \@setminipage -} -% -\newcommand{\tableheadseprule}{\noalign{\hrule height.375mm}} -% -\newlength\abovecaptionskip -\newlength\belowcaptionskip -\setlength\abovecaptionskip{10\p@} -\setlength\belowcaptionskip{0\p@} -\newcommand\leftlegendglue{} - -\def\fig@type{figure} - -\newdimen\figcapgap\figcapgap=3pt -\newdimen\tabcapgap\tabcapgap=5.5pt - -\@ifundefined{floatlegendstyle}{\def\floatlegendstyle{\bfseries}}{} - -\long\def\@caption#1[#2]#3{\par\addcontentsline{\csname - ext@#1\endcsname}{#1}{\protect\numberline{\csname - the#1\endcsname}{\ignorespaces #2}}\begingroup - \@parboxrestore - \@makecaption{\csname fnum@#1\endcsname}{\ignorespaces #3}\par - \endgroup} - -\def\capstrut{\vrule\@width\z@\@height\topskip} - -\@ifundefined{captionstyle}{\def\captionstyle{\normalfont\small}}{} - -\long\def\@makecaption#1#2{% - \captionstyle - \ifx\@captype\fig@type - \vskip\figcapgap - \fi - \setbox\@tempboxa\hbox{{\floatlegendstyle #1\floatcounterend}% - \capstrut #2}% - \ifdim \wd\@tempboxa >\hsize - {\floatlegendstyle #1\floatcounterend}\capstrut #2\par - \else - \hbox to\hsize{\leftlegendglue\unhbox\@tempboxa\hfil}% - \fi - \ifx\@captype\fig@type\else - \vskip\tabcapgap - \fi} - -\newdimen\figgap\figgap=1cc -\long\def\@makesidecaption#1#2{% - \parbox[b]{\@tempdimb}{\captionstyle{\floatlegendstyle - #1\floatcounterend}#2}} -\def\sidecaption#1\caption{% -\setbox\@tempboxa=\hbox{#1\unskip}% -\if@twocolumn - \ifdim\hsize<\textwidth\else - \ifdim\wd\@tempboxa<\columnwidth - \typeout{Double column float fits into single column - - ^^Jyou'd better switch the environment. }% - \fi - \fi -\fi -\@tempdimb=\hsize -\advance\@tempdimb by-\figgap -\advance\@tempdimb by-\wd\@tempboxa -\ifdim\@tempdimb<3cm - \typeout{\string\sidecaption: No sufficient room for the legend; - using normal \string\caption. 
}% - \unhbox\@tempboxa - \let\@capcommand=\@caption -\else - \let\@capcommand=\@sidecaption - \leavevmode - \unhbox\@tempboxa - \hfill -\fi -\refstepcounter\@captype -\@dblarg{\@capcommand\@captype}} - -\long\def\@sidecaption#1[#2]#3{\addcontentsline{\csname - ext@#1\endcsname}{#1}{\protect\numberline{\csname - the#1\endcsname}{\ignorespaces #2}}\begingroup - \@parboxrestore - \@makesidecaption{\csname fnum@#1\endcsname}{\ignorespaces #3}\par - \endgroup} - -% Define `acknowledgement' environment -\def\acknowledgement{\par\addvspace{17pt}\small\rmfamily -\trivlist\if!\ackname!\item[]\else -\item[\hskip\labelsep -{\bfseries\ackname}]\fi} -\def\endacknowledgement{\endtrivlist\addvspace{6pt}} -\newenvironment{acknowledgements}{\begin{acknowledgement}} -{\end{acknowledgement}} -% Define `noteadd' environment -\def\noteadd{\par\addvspace{17pt}\small\rmfamily -\trivlist\item[\hskip\labelsep -{\itshape\noteaddname}]} -\def\endnoteadd{\endtrivlist\addvspace{6pt}} - -\DeclareOldFontCommand{\rm}{\normalfont\rmfamily}{\mathrm} -\DeclareOldFontCommand{\sf}{\normalfont\sffamily}{\mathsf} -\DeclareOldFontCommand{\tt}{\normalfont\ttfamily}{\mathtt} -\DeclareOldFontCommand{\bf}{\normalfont\bfseries}{\mathbf} -\DeclareOldFontCommand{\it}{\normalfont\itshape}{\mathit} -\DeclareOldFontCommand{\sl}{\normalfont\slshape}{\@nomath\sl} -\DeclareOldFontCommand{\sc}{\normalfont\scshape}{\@nomath\sc} -\DeclareRobustCommand*\cal{\@fontswitch\relax\mathcal} -\DeclareRobustCommand*\mit{\@fontswitch\relax\mathnormal} -\newcommand\@pnumwidth{1.55em} -\newcommand\@tocrmarg{2.55em} -\newcommand\@dotsep{4.5} -\setcounter{tocdepth}{1} -\newcommand\tableofcontents{% - \section*{\contentsname}% - \@starttoc{toc}% - \addtocontents{toc}{\begingroup\protect\small}% - \AtEndDocument{\addtocontents{toc}{\endgroup}}% - } -\newcommand*\l@part[2]{% - \ifnum \c@tocdepth >-2\relax - \addpenalty\@secpenalty - \addvspace{2.25em \@plus\p@}% - \begingroup - \setlength\@tempdima{3em}% - \parindent \z@ \rightskip \@pnumwidth - \parfillskip -\@pnumwidth - {\leavevmode - \large \bfseries #1\hfil \hb@xt@\@pnumwidth{\hss #2}}\par - \nobreak - \if@compatibility - \global\@nobreaktrue - \everypar{\global\@nobreakfalse\everypar{}}% - \fi - \endgroup - \fi} -\newcommand*\l@section{\@dottedtocline{1}{0pt}{1.5em}} -\newcommand*\l@subsection{\@dottedtocline{2}{1.5em}{2.3em}} -\newcommand*\l@subsubsection{\@dottedtocline{3}{3.8em}{3.2em}} -\newcommand*\l@paragraph{\@dottedtocline{4}{7.0em}{4.1em}} -\newcommand*\l@subparagraph{\@dottedtocline{5}{10em}{5em}} -\newcommand\listoffigures{% - \section*{\listfigurename - \@mkboth{\listfigurename}% - {\listfigurename}}% - \@starttoc{lof}% - } -\newcommand*\l@figure{\@dottedtocline{1}{1.5em}{2.3em}} -\newcommand\listoftables{% - \section*{\listtablename - \@mkboth{\listtablename}{\listtablename}}% - \@starttoc{lot}% - } -\let\l@table\l@figure -\newdimen\bibindent -\setlength\bibindent{\parindent} -\def\@biblabel#1{#1.} -\def\@lbibitem[#1]#2{\item[{[#1]}\hfill]\if@filesw - {\let\protect\noexpand - \immediate - \write\@auxout{\string\bibcite{#2}{#1}}}\fi\ignorespaces} -\newenvironment{thebibliography}[1] - {\section*{\refname - \@mkboth{\refname}{\refname}}\small - \list{\@biblabel{\@arabic\c@enumiv}}% - {\settowidth\labelwidth{\@biblabel{#1}}% - \leftmargin\labelwidth - \advance\leftmargin\labelsep - \@openbib@code - \usecounter{enumiv}% - \let\p@enumiv\@empty - \renewcommand\theenumiv{\@arabic\c@enumiv}}% - \sloppy\clubpenalty4000\widowpenalty4000% - \sfcode`\.\@m} - {\def\@noitemerr - {\@latex@warning{Empty 
`thebibliography' environment}}% - \endlist} -% -\newcount\@tempcntc -\def\@citex[#1]#2{\if@filesw\immediate\write\@auxout{\string\citation{#2}}\fi - \@tempcnta\z@\@tempcntb\m@ne\def\@citea{}\@cite{\@for\@citeb:=#2\do - {\@ifundefined - {b@\@citeb}{\@citeo\@tempcntb\m@ne\@citea\def\@citea{,}{\bfseries - ?}\@warning - {Citation `\@citeb' on page \thepage \space undefined}}% - {\setbox\z@\hbox{\global\@tempcntc0\csname b@\@citeb\endcsname\relax}% - \ifnum\@tempcntc=\z@ \@citeo\@tempcntb\m@ne - \@citea\def\@citea{,\hskip0.1em\ignorespaces}\hbox{\csname b@\@citeb\endcsname}% - \else - \advance\@tempcntb\@ne - \ifnum\@tempcntb=\@tempcntc - \else\advance\@tempcntb\m@ne\@citeo - \@tempcnta\@tempcntc\@tempcntb\@tempcntc\fi\fi}}\@citeo}{#1}} -\def\@citeo{\ifnum\@tempcnta>\@tempcntb\else - \@citea\def\@citea{,\hskip0.1em\ignorespaces}% - \ifnum\@tempcnta=\@tempcntb\the\@tempcnta\else - {\advance\@tempcnta\@ne\ifnum\@tempcnta=\@tempcntb \else \def\@citea{--}\fi - \advance\@tempcnta\m@ne\the\@tempcnta\@citea\the\@tempcntb}\fi\fi} -% -\newcommand\newblock{\hskip .11em\@plus.33em\@minus.07em} -\let\@openbib@code\@empty -\newenvironment{theindex} - {\if@twocolumn - \@restonecolfalse - \else - \@restonecoltrue - \fi - \columnseprule \z@ - \columnsep 35\p@ - \twocolumn[\section*{\indexname}]% - \@mkboth{\indexname}{\indexname}% - \thispagestyle{plain}\parindent\z@ - \parskip\z@ \@plus .3\p@\relax - \let\item\@idxitem} - {\if@restonecol\onecolumn\else\clearpage\fi} -\newcommand\@idxitem{\par\hangindent 40\p@} -\newcommand\subitem{\@idxitem \hspace*{20\p@}} -\newcommand\subsubitem{\@idxitem \hspace*{30\p@}} -\newcommand\indexspace{\par \vskip 10\p@ \@plus5\p@ \@minus3\p@\relax} - -\if@twocolumn - \renewcommand\footnoterule{% - \kern-3\p@ - \hrule\@width\columnwidth - \kern2.6\p@} -\else - \renewcommand\footnoterule{% - \kern-3\p@ - \hrule\@width.382\columnwidth - \kern2.6\p@} -\fi -\newcommand\@makefntext[1]{% - \noindent - \hb@xt@\bibindent{\hss\@makefnmark\enspace}#1} -% -\def\trans@english{\switcht@albion} -\def\trans@french{\switcht@francais} -\def\trans@german{\switcht@deutsch} -\newenvironment{translation}[1]{\if!#1!\else -\@ifundefined{selectlanguage}{\csname trans@#1\endcsname}{\selectlanguage{#1}}% -\fi}{} -% languages -% English section -\def\switcht@albion{%\typeout{English spoken.}% - \def\abstractname{Abstract}% - \def\ackname{Acknowledgements}% - \def\andname{and}% - \def\lastandname{, and}% - \def\appendixname{Appendix}% - \def\chaptername{Chapter}% - \def\claimname{Claim}% - \def\conjecturename{Conjecture}% - \def\contentsname{Contents}% - \def\corollaryname{Corollary}% - \def\definitionname{Definition}% - \def\emailname{E-mail}% - \def\examplename{Example}% - \def\exercisename{Exercise}% - \def\figurename{Fig.}% - \def\keywordname{{\bfseries Keywords}}% - \def\indexname{Index}% - \def\lemmaname{Lemma}% - \def\contriblistname{List of Contributors}% - \def\listfigurename{List of Figures}% - \def\listtablename{List of Tables}% - \def\mailname{{\itshape Correspondence to\/}:}% - \def\noteaddname{Note added in proof}% - \def\notename{Note}% - \def\partname{Part}% - \def\problemname{Problem}% - \def\proofname{Proof}% - \def\propertyname{Property}% - \def\questionname{Question}% - \def\refname{References}% - \def\remarkname{Remark}% - \def\seename{see}% - \def\solutionname{Solution}% - \def\tablename{Table}% - \def\theoremname{Theorem}% -}\switcht@albion % make English default -% -% French section -\def\switcht@francais{\svlanginfo -%\typeout{On parle francais.}% - 
\def\abstractname{R\'esum\'e\runinend}% - \def\ackname{Remerciements\runinend}% - \def\andname{et}% - \def\lastandname{ et}% - \def\appendixname{Appendice}% - \def\chaptername{Chapitre}% - \def\claimname{Pr\'etention}% - \def\conjecturename{Hypoth\`ese}% - \def\contentsname{Table des mati\`eres}% - \def\corollaryname{Corollaire}% - \def\definitionname{D\'efinition}% - \def\emailname{E-mail}% - \def\examplename{Exemple}% - \def\exercisename{Exercice}% - \def\figurename{Fig.}% - \def\keywordname{{\bfseries Mots-cl\'e\runinend}}% - \def\indexname{Index}% - \def\lemmaname{Lemme}% - \def\contriblistname{Liste des contributeurs}% - \def\listfigurename{Liste des figures}% - \def\listtablename{Liste des tables}% - \def\mailname{{\itshape Correspondence to\/}:}% - \def\noteaddname{Note ajout\'ee \`a l'\'epreuve}% - \def\notename{Remarque}% - \def\partname{Partie}% - \def\problemname{Probl\`eme}% - \def\proofname{Preuve}% - \def\propertyname{Caract\'eristique}% -%\def\propositionname{Proposition}% - \def\questionname{Question}% - \def\refname{Bibliographie}% - \def\remarkname{Remarque}% - \def\seename{voyez}% - \def\solutionname{Solution}% -%\def\subclassname{{\it Subject Classifications\/}:}% - \def\tablename{Tableau}% - \def\theoremname{Th\'eor\`eme}% -} -% -% German section -\def\switcht@deutsch{\svlanginfo -%\typeout{Man spricht deutsch.}% - \def\abstractname{Zusammenfassung\runinend}% - \def\ackname{Danksagung\runinend}% - \def\andname{und}% - \def\lastandname{ und}% - \def\appendixname{Anhang}% - \def\chaptername{Kapitel}% - \def\claimname{Behauptung}% - \def\conjecturename{Hypothese}% - \def\contentsname{Inhaltsverzeichnis}% - \def\corollaryname{Korollar}% -%\def\definitionname{Definition}% - \def\emailname{E-Mail}% - \def\examplename{Beispiel}% - \def\exercisename{\"Ubung}% - \def\figurename{Abb.}% - \def\keywordname{{\bfseries Schl\"usselw\"orter\runinend}}% - \def\indexname{Index}% -%\def\lemmaname{Lemma}% - \def\contriblistname{Mitarbeiter}% - \def\listfigurename{Abbildungsverzeichnis}% - \def\listtablename{Tabellenverzeichnis}% - \def\mailname{{\itshape Correspondence to\/}:}% - \def\noteaddname{Nachtrag}% - \def\notename{Anmerkung}% - \def\partname{Teil}% -%\def\problemname{Problem}% - \def\proofname{Beweis}% - \def\propertyname{Eigenschaft}% -%\def\propositionname{Proposition}% - \def\questionname{Frage}% - \def\refname{Literatur}% - \def\remarkname{Anmerkung}% - \def\seename{siehe}% - \def\solutionname{L\"osung}% -%\def\subclassname{{\it Subject Classifications\/}:}% - \def\tablename{Tabelle}% -%\def\theoremname{Theorem}% -} -\newcommand\today{} -\edef\today{\ifcase\month\or - January\or February\or March\or April\or May\or June\or - July\or August\or September\or October\or November\or December\fi - \space\number\day, \number\year} -\setlength\columnsep{1.5cc} -\setlength\columnseprule{0\p@} -% -\frenchspacing -\clubpenalty=10000 -\widowpenalty=10000 -\def\thisbottomragged{\def\@textbottom{\vskip\z@ plus.0001fil -\global\let\@textbottom\relax}} -\pagestyle{headings} -\pagenumbering{arabic} -\if@twocolumn - \twocolumn -\fi -\if@avier - \onecolumn - \setlength{\textwidth}{156mm} - \setlength{\textheight}{226mm} -\fi -\if@referee - \makereferee -\fi -\flushbottom -\endinput -%% -%% End of file `svjour2.cls'. 
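The class file deleted above documents its theorem-environment factory in the comments: \spnewtheorem{env_nam}{caption}[within]{cap_font}{body_font}. For reference, here is a minimal usage sketch of that interface. It assumes svjour2.cls is on the TeX path; the environment name remark2, the caption text, and the font choices are invented for the example and are not taken from the paper sources.

% Hypothetical sketch exercising the \spnewtheorem interface
% documented in the comments of svjour2.cls above.
\documentclass{svjour2}
\journalname{Noname}
% Declare "remark2": numbered within sections, bold caption font,
% roman body font, matching the documented signature.
\spnewtheorem{remark2}{Remark}[section]{\bfseries}{\rmfamily}
\begin{document}
\title{A Short Example}
\author{A. Author}
\institute{A. Author \at Some University}
\date{Received: date / Accepted: date}
\maketitle
\section{A section}
\begin{remark2}
This text is typeset by the environment declared above.
\end{remark2}
\end{document}

The starred form \spnewtheorem*, also documented above, declares the same kind of environment without a number.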
diff --git a/vldb07/terminology.tex b/vldb07/terminology.tex
deleted file mode 100755
index fd2cf1d..0000000
--- a/vldb07/terminology.tex
+++ /dev/null
@@ -1,18 +0,0 @@
-% Time-stamp: 
-\vspace{-3mm}
-\section{Notation and terminology}
-\vspace{-2mm}
-\label{sec:notation}
-
-\enlargethispage{2\baselineskip}
-The essential notation and terminology used throughout this paper are as follows.
-\begin{itemize}
-\item $U$: key universe. $|U| = u$.
-\item $S$: actual static key set. $S \subset U$, $|S| = n \ll u$.
-\item $h: U \to M$ is a hash function that maps keys from a universe $U$ into
-a given range $M = \{0,1,\dots,m-1\}$ of integers.
-\item $h$ is a perfect hash function if it is one-to-one on~$S$, i.e., if
-  $h(k_1) \not = h(k_2)$ for all $k_1 \not = k_2$ in $S$.
-\item $h$ is a minimal perfect hash function (MPHF) if it is one-to-one on~$S$
-  and $n=m$.
-\end{itemize}
diff --git a/vldb07/thealgorithm.tex b/vldb07/thealgorithm.tex
deleted file mode 100755
index 673544b..0000000
--- a/vldb07/thealgorithm.tex
+++ /dev/null
@@ -1,78 +0,0 @@
-%% Nivio: 13/jan/06, 21/jan/06 29/jan/06
-% Time-stamp: 
-\vspace{-3mm}
-\section{The algorithm}
-\label{sec:new-algorithm}
-\vspace{-2mm}
-
-\enlargethispage{2\baselineskip}
-The main idea behind our algorithm is the classical divide-and-conquer
-technique. It is a two-step external memory algorithm
-that generates a MPHF $h$ for a set $S$ of $n$ keys.
-Figure~\ref{fig:new-algo-main-steps} illustrates the two steps of the
-algorithm: the partitioning step and the searching step.
-
-\vspace{-2mm}
-\begin{figure}[ht]
-\centering
-\begin{picture}(0,0)%
-\includegraphics{figs/brz}%
-\end{picture}%
-\setlength{\unitlength}{4144sp}%
-%
-\begingroup\makeatletter\ifx\SetFigFont\undefined%
-\gdef\SetFigFont#1#2#3#4#5{%
-  \reset@font\fontsize{#1}{#2pt}%
-  \fontfamily{#3}\fontseries{#4}\fontshape{#5}%
-  \selectfont}%
-\fi\endgroup%
-\begin{picture}(3704,2091)(1426,-5161)
-\put(2570,-4301){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}0}}}}
-\put(2782,-4301){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}1}}}}
-\put(2996,-4301){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}2}}}}
-\put(4060,-4006){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}Buckets}}}}
-\put(3776,-4301){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}${\lceil n/b\rceil - 1}$}}}}
-\put(4563,-3329){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}Key Set $S$}}}}
-\put(2009,-3160){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}0}}}}
-\put(2221,-3160){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}1}}}}
-\put(4315,-3160){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}n-1}}}}
-\put(1992,-5146){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}0}}}}
-\put(2204,-5146){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}1}}}}
-\put(4298,-5146){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}n-1}}}}
-\put(4546,-4977){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}Hash Table}}}}
-\put(1441,-3616){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}Partitioning}}}}
-\put(1441,-4426){\makebox(0,0)[lb]{\smash{{\SetFigFont{7}{8.4}{\familydefault}{\mddefault}{\updefault}Searching}}}}
-\put(1981,-4786){\makebox(0,0)[lb]{\smash{{\SetFigFont{5}{6.0}{\familydefault}{\mddefault}{\updefault}MPHF$_0$}}}}
-\put(2521,-4786){\makebox(0,0)[lb]{\smash{{\SetFigFont{5}{6.0}{\familydefault}{\mddefault}{\updefault}MPHF$_1$}}}}
-\put(3016,-4786){\makebox(0,0)[lb]{\smash{{\SetFigFont{5}{6.0}{\familydefault}{\mddefault}{\updefault}MPHF$_2$}}}}
-\put(3826,-4786){\makebox(0,0)[lb]{\smash{{\SetFigFont{5}{6.0}{\familydefault}{\mddefault}{\updefault}MPHF$_{\lceil n/b \rceil - 1}$}}}}
-\end{picture}%
-\vspace{-1mm}
-\caption{Main steps of our algorithm}
-\label{fig:new-algo-main-steps}
-\vspace{-3mm}
-\end{figure}
-
-The partitioning step takes a key set $S$ and uses a universal hash function
-$h_0$ proposed by Jenkins~\cite{j97}
-%for each key $k \in S$ of length $|k|$
-to transform each key~$k\in S$ into an integer~$h_0(k)$.
-Reducing~$h_0(k)$ modulo~$\lceil n/b\rceil$, we partition~$S$ into
-$\lceil n/b\rceil$ buckets, each containing at most 256 keys (with high
-probability).
-
-The searching step generates a MPHF$_i$ for each bucket $i$,
-$0 \leq i < \lceil n/b \rceil$.
-The resulting MPHF $h(k)$, $k \in S$, is given by
-\begin{eqnarray}\label{eq:mphf}
-h(k) = \mathrm{MPHF}_i (k) + \mathit{offset}[i],
-\end{eqnarray}
-where~$i=h_0(k)\bmod\lceil n/b\rceil$.
-The $i$th entry~$\mathit{offset}[i]$ of the displacement vector
-$\mathit{offset}$, $0 \leq i < \lceil n/b \rceil$, contains the total number
-of keys in buckets $0$ through $i-1$; that is, it gives the starting position
-of the range of the hash table addressed by MPHF$_i$. In the following we
-explain each step in detail.
-% (A small worked instance of this lookup is traced after the end of this patch.)
-
-
-
diff --git a/vldb07/thedataandsetup.tex b/vldb07/thedataandsetup.tex
deleted file mode 100755
index 8739705..0000000
--- a/vldb07/thedataandsetup.tex
+++ /dev/null
@@ -1,21 +0,0 @@
-% Nivio: 29/jan/06
-% Time-stamp: 
-\vspace{-2mm}
-\subsection{The data and the experimental setup}
-\label{sec:data-exper-set}
-
-The algorithms were implemented in the C language and
-are available at \texttt{http://\-cmph.sf.net}
-under the GNU Lesser General Public License (LGPL).
-% free software licence.
-All experiments were carried out on
-a computer running the Linux operating system, version 2.6,
-with a 2.4 gigahertz processor and
-1 gigabyte of main memory.
-In the experiments related to the new
-algorithm we limited the main memory to 500 megabytes.
-
-Our data set is a collection of 1 billion
-URLs from the Web, each URL 64 characters long on average.
-The collection occupies 60.5 gigabytes on disk.
-
diff --git a/vldb07/vldb.tex b/vldb07/vldb.tex
deleted file mode 100644
index f3eee30..0000000
--- a/vldb07/vldb.tex
+++ /dev/null
@@ -1,194 +0,0 @@
-%%%%%%%%%%%%%%%%%%%%%%% file template.tex %%%%%%%%%%%%%%%%%%%%%%%%%
-%
-% This is a template file for the LaTeX package SVJour2 for the
-% Springer journal "The VLDB Journal".
-%
-% Springer Heidelberg 2004/12/03
-%
-% Copy it to a new file with a new name and use it as the basis
-% for your article. Delete % as needed.
-%
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-%
-% First comes an example EPS file -- just ignore it and
-% proceed on the \documentclass line
-% your LaTeX will extract the file if required
-%\begin{filecontents*}{figs/minimalperfecthash-ph-mph.ps}
-%!PS-Adobe-3.0 EPSF-3.0
-%%BoundingBox: 19 19 221 221
-%%CreationDate: Mon Sep 29 1997
-%%Creator: programmed by hand (JK)
-%%EndComments
-%gsave
-%newpath
-% 20 20 moveto
-% 20 220 lineto
-% 220 220 lineto
-% 220 20 lineto
-%closepath
-%2 setlinewidth
-%gsave
-% .4 setgray fill
-%grestore
-%stroke
-%grestore
-%\end{filecontents*}
-%
-\documentclass[twocolumn,fleqn,runningheads]{svjour2}
-%
-\smartqed % flush right qed marks, e.g. at end of proof
-%
-\usepackage{graphicx}
-\usepackage{listings}
-\usepackage{epsfig}
-\usepackage{textcomp}
-\usepackage[latin1]{inputenc}
-\usepackage{amssymb}
-
-\DeclareGraphicsExtensions{.png}
-%
-% \usepackage{mathptmx} % use Times fonts if available on your TeX system
-%
-% insert here the call for the packages your document requires
-%\usepackage{latexsym}
-% etc.
-%
-% please place your own definitions here and don't use \def but
-% \newcommand{}{}
-%
-
-\lstset{
- language=Pascal,
- basicstyle=\fontsize{9}{9}\selectfont,
- captionpos=t,
- aboveskip=1mm,
- belowskip=1mm,
- abovecaptionskip=1mm,
- belowcaptionskip=1mm,
-% numbers = left,
- mathescape=true,
- escapechar=@,
- extendedchars=true,
- showstringspaces=false,
- columns=fixed,
- basewidth=0.515em,
- frame=single,
- framesep=2mm,
- xleftmargin=2mm,
- xrightmargin=2mm,
- framerule=0.5pt
-}
-
-\def\cG{{\mathcal G}}
-\def\crit{{\rm crit}}
-\def\ncrit{{\rm ncrit}}
-\def\scrit{{\rm scrit}}
-\def\bedges{{\rm bedges}}
-\def\ZZ{{\mathbb Z}}
-
-\journalname{The VLDB Journal}
-%
-
-\begin{document}
-
-\title{Space and Time Efficient Minimal Perfect Hash \\[0.2cm]
-Functions for Very Large Databases\thanks{
-This work was supported in part by
-GERINDO Project--grant MCT/CNPq/CT-INFO 552.087/02-5,
-CAPES/PROF Scholarship (Fabiano C. Botelho),
-FAPESP Proj.\ Tem.\ 03/09925-5 and CNPq Grant 30.0334/93-1
-(Yoshiharu Kohayakawa),
-and CNPq Grant 30.5237/02-0 (Nivio Ziviani).}
-}
-%\subtitle{Do you have a subtitle?\\ If so, write it here}
-
-%\titlerunning{Short form of title} % if too long for running head
-
-\author{Fabiano C. Botelho \and Davi C. Reis \and Yoshiharu Kohayakawa \and Nivio Ziviani}
-%\authorrunning{Short form of author list} % if too long for running head
-\institute{
-F. C. Botelho \and
-N. Ziviani \at
-Dept. of Computer Science,
-Federal Univ. of Minas Gerais,
-Belo Horizonte, Brazil\\
-\email{\{fbotelho,nivio\}@dcc.ufmg.br}
-\and
-D. C. Reis \at
-Google, Brazil \\
-\email{davi.reis@gmail.com}
-\and
-Y. Kohayakawa \at
-Dept. of Computer Science,
-Univ. of S\~ao Paulo,
-S\~ao Paulo, Brazil\\
-\email{yoshi@ime.usp.br}
-}
-
-\date{Received: date / Accepted: date}
-% The correct dates will be entered by the editor
-
-
-\maketitle
-
-\begin{abstract}
-We propose a novel external memory algorithm for constructing minimal
-perfect hash functions~$h$ for huge sets of keys.
-For a set of~$n$ keys, our algorithm outputs~$h$ in time~$O(n)$.
-The algorithm needs a small vector of one-byte entries
-in main memory to construct $h$.
-The evaluation of~$h(x)$ requires three memory accesses for any key~$x$.
-The description of~$h$ takes a constant number of bits
-for each key, which is optimal, i.e., the theoretical lower bound is
-$1/\ln 2 \approx 1.44$ bits per key.
-In our experiments, we used a collection of 1 billion URLs
-from the Web, each URL 64 characters long on average.
-For this collection, our algorithm
-(i) finds a minimal perfect hash function in approximately
-3 hours using a commodity PC,
-(ii) needs just 5.45 megabytes of internal memory to generate $h$,
-and (iii) takes 8.1 bits per key for the description of~$h$.
-\keywords{Minimal Perfect Hashing \and Large Databases}
-\end{abstract}
-
-% main text
-
-\def\cG{{\mathcal G}}
-\def\crit{{\rm crit}}
-\def\ncrit{{\rm ncrit}}
-\def\scrit{{\rm scrit}}
-\def\bedges{{\rm bedges}}
-\def\ZZ{{\mathbb Z}}
-\def\BSmax{\mathit{BS}_{\mathit{max}}}
-\def\Bi{\mathop{\rm Bi}\nolimits}
-
-\input{introduction}
-%\input{terminology}
-\input{relatedwork}
-\input{thealgorithm}
-\input{partitioningthekeys}
-\input{searching}
-%\input{computingoffset}
-%\input{hashingbuckets}
-\input{determiningb}
-%\input{analyticalandexperimentalresults}
-\input{analyticalresults}
-%\input{results}
-\input{conclusions}
-
-
-
-
-%\input{acknowledgments}
-%\begin{acknowledgements}
-%If you'd like to thank anyone, place your comments here
-%and remove the percent signs.
-%\end{acknowledgements}
-
-% BibTeX users please use
-%\bibliographystyle{spmpsci}
-%\bibliography{} % name your BibTeX data base
-\bibliographystyle{plain}
-\bibliography{references}
-\input{appendix}
-\end{document}
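The two-level lookup defined in thealgorithm.tex above is easy to trace on a toy instance. The sketch below uses invented numbers, purely for illustration; the bucket sizes and hash values are hypothetical and do not come from the paper's experiments.

% Hypothetical worked instance of the lookup from thealgorithm.tex:
%   h(k) = MPHF_i(k) + offset[i],  where i = h_0(k) mod \lceil n/b \rceil.
% Suppose n = 10 keys fall into 3 buckets of sizes 4, 3 and 3, so the
% displacement vector is offset = (0, 4, 7). For a key k with
% h_0(k) = 14, the bucket index is i = 14 mod 3 = 2, and if the
% bucket-local function yields MPHF_2(k) = 1, then
\begin{displaymath}
  h(k) = \mathrm{MPHF}_2(k) + \mathit{offset}[2] = 1 + 7 = 8.
\end{displaymath}
% Bucket 2 owns slots {7, 8, 9}. Because each MPHF_i is one-to-one on
% its own bucket and the offsets are cumulative bucket sizes, the 10
% keys receive the 10 distinct slots 0..9, i.e., h is minimal and perfect.

The displacement vector has one entry per bucket, so this equation stitches the $\lceil n/b \rceil$ independent per-bucket functions into a single MPHF over all of $S$.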