paper for vldb07 added

2006-08-11 17:32:31 +00:00
parent 00c049787a
commit 80549b6ca6
28 changed files with 4517 additions and 0 deletions
--- a/vldb07/costhashingbuckets.tex
+++ b/vldb07/costhashingbuckets.tex
@@ -0,0 +1,177 @@
+% Nivio: 29/jan/06
+% Time-stamp: <Monday 30 Jan 2006 12:37:22am EST yoshi@flare>
+\vspace{-2mm}
+\subsection{Performance of the internal memory based algorithm}
+\label{sec:intern-memory-algor}
+
+%\begin{table*}[htb]
+%\begin{center}
+%{\scriptsize
+%\begin{tabular}{|c|c|c|c|c|c|c|c|}
+%\hline
+%$n$ (millions)  & 1                 & 2                    & 4                  & 8                  & 16      & 32 \\
+%\hline
+%Average time (s)& $6.1 \pm 0.3$ & $12.2 \pm 0.6$   & $25.4 \pm 1.1$ & $51.4 \pm 2.0$ & $117.3 \pm 4.4$ & $262.2 \pm 8.7$\\
+%SD (s)          & $2.6$           & $5.4$              & $9.8$            & $17.6$           & $37.3$            & $76.3$  \\
+%\hline
+%\end{tabular}
+%\vspace{-3mm}
+%}
+%\end{center}
+%\caption{Internal memory based algorithm: average time in seconds for constructing a MPHF,
+%the standard deviation (SD), and the confidence intervals considering
+%a confidence level of $95\%$.}
+%\label{tab:medias}
+%\end{table*}
+
+Our three-step internal memory based algorithm presented in~\cite{bkz05}
+is used for constructing a MPHF for each bucket.
+It is a randomized algorithm because it needs to generate a simple random graph
+in its first step. 
+Once the graph is obtained the other two steps are deterministic. 
+
+Thus, we can consider the runtime of the algorithm to have the form~$\alpha
+nZ$ for an input of~$n$ keys, where~$\alpha$ is some machine dependent
+constant that further depends on the length of the keys and~$Z$ is a random
+variable with geometric distribution with mean~$1/p=e^{1/c^2}$ (see
+Section~\ref{sec:mphfbucket}).  All results in our experiments were obtained
+taking $c=1$; the value of~$c$, with~$c\in[0.93,1.15]$, in fact has little
+influence in the runtime, as shown in~\cite{bkz05}.
+
+The values chosen for $n$ were $1, 2, 4, 8, 16$ and $32$ million.
+Although we have a dataset with 1~billion URLs, on a PC with
+1~gigabyte of main memory, the algorithm is able
+to handle an input with at most 32 million keys.
+This is mainly because of the graph we need to keep in main memory.
+The algorithm requires $25n + O(1)$ bytes for constructing 
+a MPHF (details about the data structures used by the algorithm can
+be found in~\texttt{http://cmph.sf.net}.
+% for the details about the data structures 
+%used by the algorithm).
+
+In order to estimate the number of trials for each value of $n$ we use
+a statistical method for determining a suitable sample size (see, e.g.,
+\cite[Chapter 13]{j91}).  
+As we obtained different values for each $n$, 
+we used the maximal value obtained, namely, 300~trials in order to have 
+a confidence level of $95\%$.
+ 
+% \begin{figure*}[ht]
+%   \noindent
+%   \begin{minipage}[b]{0.5\linewidth}
+%     \centering
+%     \subfigure[The previous algorithm]
+%     {\scalebox{0.5}{\includegraphics{figs/bmz_temporegressao.eps}}}
+%   \end{minipage}
+%   \hfill
+%   \begin{minipage}[b]{0.5\linewidth}
+%     \centering
+%     \subfigure[The new algorithm]
+%     {\scalebox{0.5}{\includegraphics{figs/brz_temporegressao.eps}}}
+%   \end{minipage}
+%     \caption{Time versus number of keys in $S$. The solid line corresponds to 
+% a linear regression model.}
+% %obtained from the experimental measurements.}
+%     \label{fig:temporegressao}
+% \end{figure*}
+
+Table~\ref{tab:medias} presents the runtime average for each $n$,
+the respective standard deviations, and 
+the respective confidence intervals given by 
+the average time $\pm$ the distance from average time
+considering a confidence level of $95\%$.
+Observing the runtime averages one sees that 
+the algorithm runs in expected linear time, 
+as shown in~\cite{bkz05}. 
+
+\vspace{-2mm}
+\begin{table*}[htb]
+\begin{center}
+{\scriptsize
+\begin{tabular}{|c|c|c|c|c|c|c|c|}
+\hline
+$n$ (millions)  & 1                 & 2                    & 4                  & 8                  & 16      & 32 \\
+\hline
+Average time (s)& $6.1 \pm 0.3$ & $12.2 \pm 0.6$   & $25.4 \pm 1.1$ & $51.4 \pm 2.0$ & $117.3 \pm 4.4$ & $262.2 \pm 8.7$\\
+SD (s)          & $2.6$           & $5.4$              & $9.8$            & $17.6$           & $37.3$            & $76.3$  \\
+\hline
+\end{tabular}
+\vspace{-1mm}
+}
+\end{center}
+\caption{Internal memory based algorithm: average time in seconds for constructing a MPHF,
+the standard deviation (SD), and the confidence intervals considering
+a confidence level of $95\%$.}
+\label{tab:medias}
+\vspace{-4mm}
+\end{table*}
+
+% \enlargethispage{\baselineskip}
+% \begin{table*}[htb]
+% \begin{center}
+% {\scriptsize
+% (a)
+% \begin{tabular}{|c|c|c|c|c|c|c|c|}
+% \hline
+% $n$ (millions)  & 1                 & 2                    & 4                  & 8                  & 16                 & 32 \\
+% \hline
+% Average time (s)& $6.119 \pm 0.300$ & $12.190 \pm 0.615$   & $25.359 \pm 1.109$ & $51.408 \pm 2.003$ & $117.343 \pm 4.364$ & $262.215 \pm 8.724$\\
+% SD (s)          & $2.644$           & $5.414$              & $9.757$            & $17.627$           & $37.333$            & $76.271$  \\
+% \hline
+% \end{tabular}
+% \\[5mm]  (b)
+% \begin{tabular}{|l|c|c|c|c|c|}
+% \hline
+% $n$ (millions)   & 1                  & 2                  & 4                  & 8                   & 16             \\
+% \hline % Part.      16 \%                 16 \%                 16 \%                18 \%                 20\%           
+% Average time (s) & $6.927 \pm 0.309$  & $13.828 \pm 0.175$ & $31.936 \pm 0.663$ & $69.902 \pm 1.084$  & $140.617 \pm 2.502$  \\
+% SD               & $0.431$            & $0.245$            & $0.926$            & $1.515$             & $3.498$         \\
+% \hline
+% \hline
+% $n$ (millions)   & 32                  & 64                   & 128                    & 512                  & 1000            \\
+% \hline % Part.      20 \%                 20\%                  20\%                      18\%                    18\%
+% Average time (s) & $284.330 \pm 1.135$ & $587.880 \pm 3.945$  & $1223.581 \pm 4.864$   & $5966.402 \pm 9.465$ & $13229.540 \pm 12.670$  \\
+% SD               & $1.587$             & $5.514$              & $6.800$                & $13.232$             & $18.577$            \\
+% \hline
+% \end{tabular}
+% }
+% \end{center}
+% \caption{The runtime averages in seconds, 
+% the standard deviation (SD), and
+% the confidence intervals given by the average time $\pm$ 
+% the distance from average time considering 
+% a confidence level of $95\%$.}
+% \label{tab:medias}
+% \end{table*}
+
+\enlargethispage{2\baselineskip}
+Figure~\ref{fig:bmz_temporegressao} 
+presents the runtime for each trial. In addition, 
+the solid line corresponds to a linear regression model 
+obtained from the experimental measurements.
+As we can see, the runtime for a given $n$ has a considerable 
+fluctuation. However, the fluctuation also grows linearly with $n$.
+
+\begin{figure}[htb]
+\vspace{-2mm}
+\begin{center}
+\scalebox{0.4}{\includegraphics{figs/bmz_temporegressao.eps}}
+\caption{Time versus number of keys in $S$ for the internal memory based algorithm.
+The solid line corresponds to a linear regression model.}
+\label{fig:bmz_temporegressao}
+\end{center}
+\vspace{-6mm}
+\end{figure}
+
+The observed fluctuation in the runtimes is as expected; recall that this
+runtime has the form~$\alpha nZ$ with~$Z$ a geometric random variable with
+mean~$1/p=e$.  Thus, the runtime has mean~$\alpha n/p=\alpha en$ and standard
+deviation~$\alpha n\sqrt{(1-p)/p^2}=\alpha n\sqrt{e(e-1)}$. 
+Therefore, the standard deviation also grows 
+linearly with $n$, as experimentally verified 
+in Table~\ref{tab:medias} and in Figure~\ref{fig:bmz_temporegressao}.
+
+%\noindent-------------------------------------------------------------------------\\
+%Comentario para Yoshi: Nao consegui reproduzir bem o que discutimos 
+%no paragrafo acima, acho que vc conseguira justificar melhor :-). \\
+%-------------------------------------------------------------------------\\