turbonss/vldb07/costhashingbuckets.tex

% Nivio: 29/jan/06
% Time-stamp: <Monday 30 Jan 2006 12:37:22am EST yoshi@flare>
\vspace{-2mm}
\subsection{Performance of the internal memory based algorithm}
\label{sec:intern-memory-algor}

%\begin{table*}[htb]
%\begin{center}
%{\scriptsize
%\begin{tabular}{|c|c|c|c|c|c|c|c|}
%\hline
%$n$ (millions)  & 1                 & 2                    & 4                  & 8                  & 16      & 32 \\
%\hline
%Average time (s)& $6.1 \pm 0.3$ & $12.2 \pm 0.6$   & $25.4 \pm 1.1$ & $51.4 \pm 2.0$ & $117.3 \pm 4.4$ & $262.2 \pm 8.7$\\
%SD (s)          & $2.6$           & $5.4$              & $9.8$            & $17.6$           & $37.3$            & $76.3$  \\
%\hline
%\end{tabular}
%\vspace{-3mm}
%}
%\end{center}
%\caption{Internal memory based algorithm: average time in seconds for constructing a MPHF,
%the standard deviation (SD), and the confidence intervals considering
%a confidence level of $95\%$.}
%\label{tab:medias}
%\end{table*}

Our three-step internal memory based algorithm presented in~\cite{bkz05}
is used for constructing a MPHF for each bucket.
It is a randomized algorithm because it needs to generate a simple random graph
in its first step.
Once the graph is obtained the other two steps are deterministic.

Thus, we can consider the runtime of the algorithm to have the form~$\alpha
nZ$ for an input of~$n$ keys, where~$\alpha$ is some machine dependent
constant that further depends on the length of the keys and~$Z$ is a random
variable with geometric distribution with mean~$1/p=e^{1/c^2}$ (see
Section~\ref{sec:mphfbucket}).  All results in our experiments were obtained
taking $c=1$; the value of~$c$, with~$c\in[0.93,1.15]$, in fact has little
influence in the runtime, as shown in~\cite{bkz05}.

The values chosen for $n$ were $1, 2, 4, 8, 16$ and $32$ million.
Although we have a dataset with 1~billion URLs, on a PC with
1~gigabyte of main memory, the algorithm is able
to handle an input with at most 32 million keys.
This is mainly because of the graph we need to keep in main memory.
The algorithm requires $25n + O(1)$ bytes for constructing
a MPHF (details about the data structures used by the algorithm can
be found in~\texttt{http://cmph.sf.net}.
% for the details about the data structures
%used by the algorithm).

In order to estimate the number of trials for each value of $n$ we use
a statistical method for determining a suitable sample size (see, e.g.,
\cite[Chapter 13]{j91}).
As we obtained different values for each $n$,
we used the maximal value obtained, namely, 300~trials in order to have
a confidence level of $95\%$.

% \begin{figure*}[ht]
%   \noindent
%   \begin{minipage}[b]{0.5\linewidth}
%     \centering
%     \subfigure[The previous algorithm]
%     {\scalebox{0.5}{\includegraphics{figs/bmz_temporegressao.eps}}}
%   \end{minipage}
%   \hfill
%   \begin{minipage}[b]{0.5\linewidth}
%     \centering
%     \subfigure[The new algorithm]
%     {\scalebox{0.5}{\includegraphics{figs/brz_temporegressao.eps}}}
%   \end{minipage}
%     \caption{Time versus number of keys in $S$. The solid line corresponds to
% a linear regression model.}
% %obtained from the experimental measurements.}
%     \label{fig:temporegressao}
% \end{figure*}

Table~\ref{tab:medias} presents the runtime average for each $n$,
the respective standard deviations, and
the respective confidence intervals given by
the average time $\pm$ the distance from average time
considering a confidence level of $95\%$.
Observing the runtime averages one sees that
the algorithm runs in expected linear time,
as shown in~\cite{bkz05}.

\vspace{-2mm}
\begin{table*}[htb]
\begin{center}
{\scriptsize
\begin{tabular}{|c|c|c|c|c|c|c|c|}
\hline
$n$ (millions)  & 1                 & 2                    & 4                  & 8                  & 16      & 32 \\
\hline
Average time (s)& $6.1 \pm 0.3$ & $12.2 \pm 0.6$   & $25.4 \pm 1.1$ & $51.4 \pm 2.0$ & $117.3 \pm 4.4$ & $262.2 \pm 8.7$\\
SD (s)          & $2.6$           & $5.4$              & $9.8$            & $17.6$           & $37.3$            & $76.3$  \\
\hline
\end{tabular}
\vspace{-1mm}
}
\end{center}
\caption{Internal memory based algorithm: average time in seconds for constructing a MPHF,
the standard deviation (SD), and the confidence intervals considering
a confidence level of $95\%$.}
\label{tab:medias}
\vspace{-4mm}
\end{table*}

% \enlargethispage{\baselineskip}
% \begin{table*}[htb]
% \begin{center}
% {\scriptsize
% (a)
% \begin{tabular}{|c|c|c|c|c|c|c|c|}
% \hline
% $n$ (millions)  & 1                 & 2                    & 4                  & 8                  & 16                 & 32 \\
% \hline
% Average time (s)& $6.119 \pm 0.300$ & $12.190 \pm 0.615$   & $25.359 \pm 1.109$ & $51.408 \pm 2.003$ & $117.343 \pm 4.364$ & $262.215 \pm 8.724$\\
% SD (s)          & $2.644$           & $5.414$              & $9.757$            & $17.627$           & $37.333$            & $76.271$  \\
% \hline
% \end{tabular}
% \\[5mm]  (b)
% \begin{tabular}{|l|c|c|c|c|c|}
% \hline
% $n$ (millions)   & 1                  & 2                  & 4                  & 8                   & 16             \\
% \hline % Part.      16 \%                 16 \%                 16 \%                18 \%                 20\%
% Average time (s) & $6.927 \pm 0.309$  & $13.828 \pm 0.175$ & $31.936 \pm 0.663$ & $69.902 \pm 1.084$  & $140.617 \pm 2.502$  \\
% SD               & $0.431$            & $0.245$            & $0.926$            & $1.515$             & $3.498$         \\
% \hline
% \hline
% $n$ (millions)   & 32                  & 64                   & 128                    & 512                  & 1000            \\
% \hline % Part.      20 \%                 20\%                  20\%                      18\%                    18\%
% Average time (s) & $284.330 \pm 1.135$ & $587.880 \pm 3.945$  & $1223.581 \pm 4.864$   & $5966.402 \pm 9.465$ & $13229.540 \pm 12.670$  \\
% SD               & $1.587$             & $5.514$              & $6.800$                & $13.232$             & $18.577$            \\
% \hline
% \end{tabular}
% }
% \end{center}
% \caption{The runtime averages in seconds,
% the standard deviation (SD), and
% the confidence intervals given by the average time $\pm$
% the distance from average time considering
% a confidence level of $95\%$.}
% \label{tab:medias}
% \end{table*}

\enlargethispage{2\baselineskip}
Figure~\ref{fig:bmz_temporegressao}
presents the runtime for each trial. In addition,
the solid line corresponds to a linear regression model
obtained from the experimental measurements.
As we can see, the runtime for a given $n$ has a considerable
fluctuation. However, the fluctuation also grows linearly with $n$.

\begin{figure}[htb]
\vspace{-2mm}
\begin{center}
\scalebox{0.4}{\includegraphics{figs/bmz_temporegressao.eps}}
\caption{Time versus number of keys in $S$ for the internal memory based algorithm.
The solid line corresponds to a linear regression model.}
\label{fig:bmz_temporegressao}
\end{center}
\vspace{-6mm}
\end{figure}

The observed fluctuation in the runtimes is as expected; recall that this
runtime has the form~$\alpha nZ$ with~$Z$ a geometric random variable with
mean~$1/p=e$.  Thus, the runtime has mean~$\alpha n/p=\alpha en$ and standard
deviation~$\alpha n\sqrt{(1-p)/p^2}=\alpha n\sqrt{e(e-1)}$.
Therefore, the standard deviation also grows
linearly with $n$, as experimentally verified
in Table~\ref{tab:medias} and in Figure~\ref{fig:bmz_temporegressao}.

%\noindent-------------------------------------------------------------------------\\
%Comentario para Yoshi: Nao consegui reproduzir bem o que discutimos
%no paragrafo acima, acho que vc conseguira justificar melhor :-). \\
%-------------------------------------------------------------------------\\