% Nivio: 29/jan/06
% Time-stamp: <Monday 30 Jan 2006 12:37:22am EST yoshi@flare>
\vspace{-2mm}
\subsection{Performance of the internal memory based algorithm}
\label{sec:intern-memory-algor}

Our three-step internal memory based algorithm presented in~\cite{bkz05}
is used for constructing a MPHF for each bucket.
It is a randomized algorithm because its first step must generate
a simple random graph.
Once the graph is obtained, the other two steps are deterministic.

Thus, we can consider the runtime of the algorithm to have the form~$\alpha
nZ$ for an input of~$n$ keys, where~$\alpha$ is a machine-dependent
constant that also depends on the length of the keys and~$Z$ is a random
variable with geometric distribution with mean~$1/p=e^{1/c^2}$ (see
Section~\ref{sec:mphfbucket}). All results in our experiments were obtained
taking $c=1$; the value of~$c$, with~$c\in[0.93,1.15]$, in fact has little
influence on the runtime, as shown in~\cite{bkz05}.

The values chosen for $n$ were $1, 2, 4, 8, 16$ and $32$ million.
Although we have a dataset with 1~billion URLs, on a PC with
1~gigabyte of main memory the algorithm is able
to handle an input with at most 32 million keys.
This is mainly because of the graph we need to keep in main memory.
The algorithm requires $25n + O(1)$ bytes for constructing
a MPHF (details about the data structures used by the algorithm can
be found at~\texttt{http://cmph.sf.net}).
% for the details about the data structures
%used by the algorithm).

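
As a rough check of this limit, ignoring the $O(1)$ term and any memory
taken by the operating system (a simplifying assumption made only for
this estimate), the $25n$ bytes amount to
\[
  25 \times 32\times 10^{6} \;=\; 8\times 10^{8}\ \mbox{bytes}
  \;\approx\; 0.8\ \mbox{gigabytes}
\]
for $n$ equal to 32~million, which fits in 1~gigabyte of main memory,
whereas the next value in the doubling sequence would need
\[
  25 \times 64\times 10^{6} \;=\; 1.6\times 10^{9}\ \mbox{bytes}
  \;\approx\; 1.6\ \mbox{gigabytes},
\]
which does not.
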
In order to estimate the number of trials for each value of $n$ we used
a statistical method for determining a suitable sample size (see, e.g.,
\cite[Chapter 13]{j91}).
As we obtained a different value for each $n$,
we used the maximal value obtained, namely, 300~trials, in order to have
a confidence level of $95\%$.

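
The text does not state which formula was used. As an illustration only,
one standard rule of this kind (an assumption here, not necessarily the
one applied in the experiments) estimates the number of trials~$N$ needed
for the sample mean to lie within $\pm r\%$ of the true mean, given a
preliminary sample with mean~$\bar{x}$ and standard deviation~$s$:
\[
  N \;=\; \left(\frac{100\, z\, s}{r\,\bar{x}}\right)^{2},
\]
where $z$ is the normal quantile for the chosen confidence level
($z\approx 1.96$ for $95\%$). The symbols $N$, $r$, $\bar{x}$, $s$ and $z$
are introduced here for this sketch. The rule follows from requiring the
half-width $z s/\sqrt{N}$ of the confidence interval to be at most
$r\bar{x}/100$. Since $\bar{x}$ and $s$ differ for each~$n$, so does the
resulting~$N$, which is consistent with taking the maximum over all~$n$.
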
% \begin{figure*}[ht]
% \noindent
% \begin{minipage}[b]{0.5\linewidth}
% \centering
% \subfigure[The previous algorithm]
% {\scalebox{0.5}{\includegraphics{figs/bmz_temporegressao.eps}}}
% \end{minipage}
% \hfill
% \begin{minipage}[b]{0.5\linewidth}
% \centering
% \subfigure[The new algorithm]
% {\scalebox{0.5}{\includegraphics{figs/brz_temporegressao.eps}}}
% \end{minipage}
% \caption{Time versus number of keys in $S$. The solid line corresponds to
% a linear regression model.}
% %obtained from the experimental measurements.}
% \label{fig:temporegressao}
% \end{figure*}

Table~\ref{tab:medias} presents the average runtime for each $n$,
the respective standard deviations, and
the respective confidence intervals, given by
the average time $\pm$ the distance from the average time
considering a confidence level of $95\%$.
Observing the runtime averages, one sees that
they roughly double each time $n$ doubles, which is consistent with
the expected linear time shown in~\cite{bkz05}.

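
As a rough consistency check on the intervals in
Table~\ref{tab:medias}, suppose they are the usual normal-approximation
intervals for a mean over $N=300$ trials (an assumption; the exact
procedure is not stated in the text). The half-width is then
$z s/\sqrt{N}$, where $s$ is the standard deviation and $z\approx 1.96$
for a confidence level of $95\%$. For $n$ equal to 1~million, where
$s=2.6$, this gives
\[
  \frac{1.96 \times 2.6}{\sqrt{300}} \;\approx\; 0.29,
\]
in line with the reported $6.1 \pm 0.3$.
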
\vspace{-2mm}
\begin{table*}[htb]
\begin{center}
{\scriptsize
\begin{tabular}{|c|c|c|c|c|c|c|}
\hline
$n$ (millions) & 1 & 2 & 4 & 8 & 16 & 32 \\
\hline
Average time (s)& $6.1 \pm 0.3$ & $12.2 \pm 0.6$ & $25.4 \pm 1.1$ & $51.4 \pm 2.0$ & $117.3 \pm 4.4$ & $262.2 \pm 8.7$\\
SD (s) & $2.6$ & $5.4$ & $9.8$ & $17.6$ & $37.3$ & $76.3$ \\
\hline
\end{tabular}
\vspace{-1mm}
}
\end{center}
\caption{Internal memory based algorithm: average time in seconds for constructing a MPHF,
the standard deviation (SD), and the confidence intervals considering
a confidence level of $95\%$.}
\label{tab:medias}
\vspace{-4mm}
\end{table*}

% \enlargethispage{\baselineskip}
% \begin{table*}[htb]
% \begin{center}
% {\scriptsize
% (a)
% \begin{tabular}{|c|c|c|c|c|c|c|c|}
% \hline
% $n$ (millions) & 1 & 2 & 4 & 8 & 16 & 32 \\
% \hline
% Average time (s)& $6.119 \pm 0.300$ & $12.190 \pm 0.615$ & $25.359 \pm 1.109$ & $51.408 \pm 2.003$ & $117.343 \pm 4.364$ & $262.215 \pm 8.724$\\
% SD (s) & $2.644$ & $5.414$ & $9.757$ & $17.627$ & $37.333$ & $76.271$ \\
% \hline
% \end{tabular}
% \\[5mm] (b)
% \begin{tabular}{|l|c|c|c|c|c|}
% \hline
% $n$ (millions) & 1 & 2 & 4 & 8 & 16 \\
% \hline % Part. 16 \% 16 \% 16 \% 18 \% 20\%
% Average time (s) & $6.927 \pm 0.309$ & $13.828 \pm 0.175$ & $31.936 \pm 0.663$ & $69.902 \pm 1.084$ & $140.617 \pm 2.502$ \\
% SD & $0.431$ & $0.245$ & $0.926$ & $1.515$ & $3.498$ \\
% \hline
% \hline
% $n$ (millions) & 32 & 64 & 128 & 512 & 1000 \\
% \hline % Part. 20 \% 20\% 20\% 18\% 18\%
% Average time (s) & $284.330 \pm 1.135$ & $587.880 \pm 3.945$ & $1223.581 \pm 4.864$ & $5966.402 \pm 9.465$ & $13229.540 \pm 12.670$ \\
% SD & $1.587$ & $5.514$ & $6.800$ & $13.232$ & $18.577$ \\
% \hline
% \end{tabular}
% }
% \end{center}
% \caption{The runtime averages in seconds,
% the standard deviation (SD), and
% the confidence intervals given by the average time $\pm$
% the distance from average time considering
% a confidence level of $95\%$.}
% \label{tab:medias}
% \end{table*}

\enlargethispage{2\baselineskip}
Figure~\ref{fig:bmz_temporegressao}
presents the runtime for each trial. In addition,
the solid line corresponds to a linear regression model
obtained from the experimental measurements.
As we can see, the runtime for a given $n$ fluctuates
considerably. However, the fluctuation also grows linearly with $n$.

\begin{figure}[htb]
\vspace{-2mm}
\begin{center}
\scalebox{0.4}{\includegraphics{figs/bmz_temporegressao.eps}}
\caption{Time versus number of keys in $S$ for the internal memory based algorithm.
The solid line corresponds to a linear regression model.}
\label{fig:bmz_temporegressao}
\end{center}
\vspace{-6mm}
\end{figure}

The observed fluctuation in the runtimes is as expected; recall that this
runtime has the form~$\alpha nZ$ with~$Z$ a geometric random variable with
mean~$1/p=e$ (since~$c=1$). Thus, the runtime has mean~$\alpha n/p=\alpha en$ and standard
deviation~$\alpha n\sqrt{(1-p)/p^2}=\alpha n\sqrt{e(e-1)}$.
Therefore, the standard deviation also grows
linearly with $n$, as observed experimentally
in Table~\ref{tab:medias} and in Figure~\ref{fig:bmz_temporegressao}.

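
For completeness, the step from the geometric distribution to these
expressions can be spelled out as follows, assuming that $Z$ counts the
trials up to and including the first success, each trial succeeding
independently with probability~$p$. The standard facts
\[
  \mathrm{E}[Z] \;=\; \frac{1}{p},
  \qquad
  \mathrm{Var}[Z] \;=\; \frac{1-p}{p^{2}}
\]
give, for $p=1/e$,
\[
  \mathrm{E}[\alpha n Z] \;=\; \alpha e n \;\approx\; 2.72\,\alpha n,
  \qquad
  \sqrt{\mathrm{Var}[\alpha n Z]}
  \;=\; \alpha n\,\frac{\sqrt{1-p}}{p}
  \;=\; \alpha n\, e\sqrt{1-\frac{1}{e}}
  \;=\; \alpha n\sqrt{e(e-1)}
  \;\approx\; 2.16\,\alpha n .
\]
Both quantities are proportional to~$n$ for fixed~$\alpha$.
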
%\noindent-------------------------------------------------------------------------\\
%Comment for Yoshi: I could not reproduce well what we discussed
%in the paragraph above; I think you will be able to justify it better :-). \\
%-------------------------------------------------------------------------\\