\section{The Algorithms}
We present a novel algorithm that extends our previous work~\cite{bkz05}.
We first describe the previous work and then the new algorithm.
To the best of our knowledge, this is the first work that makes it possible
to efficiently construct minimal perfect hash functions for sets on the order of
billions of keys.
Moreover, the generated functions are very compact and can be represented
using approximately nine bits per key.
\subsection{A Main Memory Based Algorithm}
\subsection{An External Memory Based Algorithm}
The idea behind the new algorithm is the traditional divide-and-conquer approach.
The new algorithm consists of two steps that are presented in Fig.~\ref{fig:new-algo-main-steps}:
\begin{enumerate}
\item Using a universal hash function~\cite{ss89} $h_1: S \to B$, the keys from $S$ are segmented into
a bucket set $B$, where $|B| = b$. We choose the parameter $b$ in such a way that any bucket will
contain more than 256 keys.
This choice is crucial to make the new algorithm work, and we give details about it later.
\item The keys in each bucket are separately spread into a hash table.
\end{enumerate}
\begin{figure}
% Figure file not available in the source.
\caption{Main steps of the new algorithm.}
\label{fig:new-algo-main-steps}
\end{figure}
The main novelties are in the way the keys are segmented using external memory and spread using
minimal perfect hash functions for each bucket. The next two sections describe each step in detail.
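The two steps above can be made concrete with a small formalization.
This is only a sketch under stated assumptions: the symbols $B_i$, $\mathit{mphf}_i$ and $\mathit{offset}$
are illustrative names, the bucket set $B$ is identified with the indices $[0,b-1]$, and combining the
per-bucket functions through prefix sums of the bucket sizes is an assumption that the text above does not spell out.
Writing $n=|S|$, the first step partitions $S$ into the buckets
\begin{displaymath}
B_i = \{x \in S : h_1(x) = i\}, \qquad 0 \le i < b, \qquad \sum_{i=0}^{b-1} |B_i| = n .
\end{displaymath}
If the second step yields, for each bucket, a minimal perfect hash function
$\mathit{mphf}_i : B_i \to [0,|B_i|-1]$, then one way to obtain a single function on $S$ would be
\begin{displaymath}
h(x) = \mathit{mphf}_{h_1(x)}(x) + \mathit{offset}[h_1(x)], \qquad
\mathit{offset}[i] = \sum_{j=0}^{i-1} |B_j| .
\end{displaymath}
Under these assumptions, bucket $i$ is mapped bijectively onto
$[\mathit{offset}[i], \mathit{offset}[i]+|B_i|-1]$; these intervals are pairwise disjoint and together
cover $[0,n-1]$, so $h$ would be a bijection from $S$ to $[0,n-1]$.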
% Let us show how the minimal perfect hash function~$h$
% will be constructed.
% We make use of three auxiliary random functions~$h_1$, $h_2$ and~$h_3:U\to V$,
% where~$V=[0,t-1]$ for some suitably chosen integer~$t=cn$, where
% $n=|S|$.
% We build a random graph~$G=G(h_1,h_2)$ on~$V$,
% whose edge set is~$\big\{\{h_1(x),h_2(x)\}:x\in S\big\}$.
% There is an edge in~$G$ for each key in the set of keys~$S$.
% In what follows, we shall be interested in the \textit{2-core} of
% the random graph~$G$, that is, the maximal subgraph of~$G$ with minimal
% degree at least~$2$
% (see, e.g., \cite{b01,jlr00}).
% Because of its importance in our context, we call the 2-core the
% \textit{critical} subgraph of~$G$ and denote it by~$G_\crit$.
% The vertices and edges in~$G_\crit$ are said to be \textit{critical}.
% We let~$V_\crit=V(G_\crit)$ and~$E_\crit=E(G_\crit)$.
% Moreover, we let~$V_\ncrit=V-V_\crit$ be the set of {\em non-critical}
% vertices in~$G$.
% We also let~$V_\scrit\subseteq V_\crit$ be the set of all critical
% vertices that have at least one non-critical vertex as a neighbour.
% Let $E_\ncrit=E(G)-E_\crit$ be the set of {\em non-critical} edges in~$G$.
% Finally, we let~$G_\ncrit=(V_\ncrit\cup V_\scrit,E_\ncrit)$ be the
% {\em non-critical} subgraph of~$G$.
% The non-critical subgraph $G_\ncrit$ corresponds to the ``acyclic part''
% of~$G$.
% We have $G=G_\crit\cup G_\ncrit$.
% We then construct a suitable labelling $g:V\to\ZZ$ of the vertices
% of~$G$: we choose~$g(v)$ for each~$v\in V(G)$ in such
% a way that~$h(x)=g(h_1(x))+g(h_2(x))$ ($x\in S$) is a
% minimal perfect hash function for~$S$.
% We will see later on that this labelling~$g$ can be found in linear time
% if the number of edges in $G_\crit$ is at most $\frac{1}{2}|E(G)|$.
% Figure~\ref{prog:mainsteps} presents a pseudo code for the algorithm.
% The procedure GenerateMPHF ($S$, $g$) receives as input the set of
% keys~$S$ and produces the labelling~$g$.
% The method uses a mapping, ordering and searching approach.
% We now describe each step.
% \enlargethispage{\baselineskip}
% \enlargethispage{\baselineskip}
% \vspace{-11pt}
% \begin{figure}[htb]
% \begin{center}
% \begin{lstlisting}[
% ]
% procedure @GenerateMPHF@ (@$S$@, @$g$@)
% Mapping (@$S$@, @$G$@);
% Ordering (@$G$@, @$G_\crit$@, @$G_\ncrit$@);
% Searching (@$G$@, @$G_\crit$@, @$G_\ncrit$@, @$g$@);
% \end{lstlisting}
% \end{center}
% \vspace{-12pt}
% \caption{Main steps of the algorithm for constructing a minimal
% perfect hash function}
% \vspace{-26pt}
% \label{prog:mainsteps}
% \end{figure}
% \subsection{Mapping Step}
% \label{sec:mapping}
% The procedure Mapping ($S$, $G$) receives as input the set of keys~$S$ and
% generates the random graph $G=G(h_1,h_2)$, by generating two auxiliary
% functions~$h_1$, $h_2:U\to[0,t-1]$.
% \def\tabela{\hbox{table}}
% %
% The functions~$h_1$ and~$h_2$ are constructed as follows.
% We impose some upper bound~$L$ on the lengths of the keys in~$S$.
% To define~$h_j$ ($j=1$,$2$), we generate an~$L\times\Sigma$ table
% of random integers~$\tabela_j$.
% For a key~$x\in S$ of length~$|x|\leq L$ and~$j\in\{1,2\}$, we let
% \begin{displaymath} \nonumber
% h_j(x) = \Big (\textstyle\sum_{i=1}^{|x|} \tabela_j[i, x[i]] \Big) \bmod t.
% \end{displaymath}
% The random graph~$G=G(h_1,h_2)$ has vertex set~$V=[0,t-1]$ and edge set
% $\big\{\{h_1(x),h_2(x)\}:x\in S\big\}$. We need~$G$ to be
% simple, i.e.,
% $G$~should have neither loops nor multiple edges.
% A loop occurs when $h_1(x) = h_2(x)$ for some~$x\in S$.
% We solve this in an ad hoc manner: we simply let~$h_2(x)=(2h_1(x)+1)\bmod
% t$ in this case.
% If we still find a loop after this,
% we generate another pair $(h_1,h_2)$.
% When a multiple edge occurs we abort and generate a new pair~$(h_1,h_2)$.
% \vspace{-10pt}
% \subsubsection{Analysis of the Mapping Step. }
% We start by discussing some facts on random graphs.
% Let~$G=(V,E)$ with $|V|=t$ and $|E|=n$ be a random graph in the uniform
% model~$\cG(t,n)$, the model in which all the~${{t\choose2}\choose n}$ graphs
% on~$V$ with~$n$ edges are equiprobable.
% The study of~$\cG(t,n)$ goes back to the classical
% work of Erd\H os and R\'enyi~\cite{er59,er60,er61} (for a modern treatment,
% see~\cite{b01,jlr00}).
% Let $d=2n/t$ be the average degree of $G$.
% It is well known that, if~$d>1$, or, equivalently,
% if~$c<2$ (recall that we have $t=cn$),
% then, almost every~$G$
% contains\footnote{As is usual in the theory of random graphs, we use
% the terms `almost every' and `almost surely' to mean `with probability
% tending to~$1$ as~$t\to\infty$'.} a ``giant'' component of
% order~$(1+o(1))bt$, where~$b=1-T/d$, and~$0<T<1$ is the unique solution
% to the equation~$Te^{-T}=de^{-d}$.
% Moreover, all the other components of~$G$ have~$O(\log t)$ vertices.
% Also, the number of vertices in the 2-core of~$G$ (the maximal subgraph of $G$
% with minimal degree at least~$2$) that do not belong to the giant component
% is~$o(t)$ almost surely.
% Pittel and Wormald~\cite{pw04} present detailed results
% for the 2-core of the giant component of the random graph~$G$.
% Since~$\tabela_j$ ($j\in\{1,2\}$) are random, $G=G(h_1,h_2)$~is a random
% graph.
% In what follows, we work under the hypothesis that~$G=G(h_1,h_2)$ is drawn
% from~$\cG(t,n)$.
% Thus, following~\cite{pw04}, the number of vertices of~$G_\crit$ is
% \begin{eqnarray} \label{eq:nvertices2core}
% |V(G_\crit)| = (1+o(1))(1-T)bt
% \end{eqnarray}
% almost surely. Moreover, the number of edges in this 2-core is
% \begin{eqnarray} \label{eq:nedges2core}
% |E(G_\crit)| = (1+o(1))\Big((1-T)b+b(d+T-2)/2\Big)t \\[-4mm]\nonumber
% \end{eqnarray}
% almost surely.
% Let~$d_\crit=2|E(G_\crit)|/|V(G_\crit)|$ be the average degree of~$G_\crit$.
% We are interested in the case in which~$d_\crit$ is a constant.
% \enlargethispage{\baselineskip}
% \enlargethispage{\baselineskip}
% As mentioned before, for us to find
% the labelling $g:V\to\ZZ$ of the vertices of~$G=G(h_1,h_2)$ in linear time,
% we require that~$|E(G_\crit)|\leq\frac{1}{2}|E(G)|=\frac12|S|=n/2$.
% The crucial step now is to determine the value
% of~$c$ (in $t=cn$) to obtain a random graph $G=G_\crit\cup G_\ncrit$ with
% $|E(G_\crit)|\leq\frac{1}{2}|E(G)|$.
% Table~\ref{tab:values} gives some values for~$|V(G_\crit)|$
% and~$|E(G_\crit)|$ using Eqs~(\ref{eq:nvertices2core})
% and~(\ref{eq:nedges2core}).
% The theoretical value for~$c$ is around~$1.152$, which is remarkably
% close to the empirical results presented in
% Table~\ref{tab:probability_cve1}.
% In this table, generated from real data, the probability $P_{|E(G_\crit)|}$
% that $|E(G_\crit)| \le \frac{1}{2}|E(G)|$ tends to~$0$ when $c < 1.15$ and it
% tends to $1$ when $c \ge 1.15$ and $n$ increases. We found this match between
% the empirical and the theoretical results most pleasant,
% and this
% is why we consider that this random graph, conditioned on being simple,
% strongly resembles the random graph from the uniform model~$\cG(t,n)$.
% \vspace{-8pt}
% \begin{table}[!htb]
% {\footnotesize
% \begin{center}
% \begin{tabular}{|c|c|c|c|c|c|}
% \hline
% $d$ & $T$ & $b$ & $|V(G_\crit)|$ & $|E(G_\crit)|$ & $c$ \\
% \hline
% %1.730 & 0.512 & 0.704 & 0.398$n$ & 0.496$n$ & 1.156 \\
% %1.732 & 0.511 & 0.705 & 0.398$n$ & 0.497$n$ & 1.155 \\
% %1.733 & 0.510 & 0.706 & 0.399$n$ & 0.498$n$ & 1.154 \\
% 1.734 & 0.510 & 0.706 & 0.399$n$ & 0.498$n$ & 1.153 \\
% 1.736 & 0.509 & 0.707 & 0.400$n$ & 0.500$n$ & 1.152 \\
% 1.738 & 0.508 & 0.708 & 0.401$n$ & 0.501$n$ & 1.151 \\
% 1.739 & 0.508 & 0.708 & 0.401$n$ & 0.501$n$ & 1.150 \\
% 1.740 & 0.507 & 0.709 & 0.401$n$ & 0.502$n$ & 1.149 \\
% %1.742 & 0.506 & 0.709 & 0.402$n$ & 0.503$n$ & 1.148 \\
% %1.744 & 0.505 & 0.710 & 0.403$n$ & 0.504$n$ & 1.147 \\
% %1.746 & 0.505 & 0.711 & 0.404$n$ & 0.506$n$ & 1.145 \\
% \hline
% \end{tabular}
% \end{center}
% \caption{Determining the $c$ value theoretically}
% \vspace{-42pt}
% \label{tab:values}
% }
% \end{table}
% \begin{table}
% {\footnotesize
% \begin{center}
% \begin{tabular}{|c|c|c|c|c|c|c|c|}
% \hline
% \raisebox{-0.7em}{$c$} & \multicolumn{7}{c|}{\raisebox{-1mm}{URLs ($n$)}} \\
% \cline{2-8}
% & \raisebox{-1mm}{$10^3$} &\raisebox{-1mm}{$10^4$} &\raisebox{-1mm}{$10^5$} & \raisebox{-1mm}{$10^6$} & \raisebox{-1mm}{$2 \times 10^6$} & \raisebox{-1mm}{$3 \times 10^6$} & \raisebox{-1mm}{$4 \times 10^6$} \\
% \hline
% %1.10 & 0.01 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 \\
% %1.11 & 0.04 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 \\
% %1.12 & 0.12 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 \\
% 1.13 & 0.22 & 0.02 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 \\
% 1.14 & 0.35 & 0.15 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 \\
% 1.15 & 0.46 & 0.55 & 0.65 & 0.87 & 0.95 & 0.97 & 1.00 \\
% 1.16 & 0.67 & 0.90 & 1.00 & 1.00 & 1.00 & 1.00 & 1.00 \\
% 1.17 & 0.82 & 0.99 & 1.00 & 1.00 & 1.00 & 1.00 & 1.00 \\
% %1.18 & 0.91 & 0.97 & 0.98 & 1.00 & 1.00 & 1.00 & 1.00 \\
% %1.19 & 0.94 & 1.00 & 1.00 & 1.00 & 1.00 & 1.00 & 1.00 \\
% %1.20 & 0.98 & 1.00 & 1.00 & 1.00 & 1.00 & 1.00 & 1.00 \\[1mm]
% \hline
% \end{tabular}
% \end{center}
% \caption{Probability $P_{|E_\crit|}$ that $|E(G_\crit)| \le n/2$
% for different $c$ values and different numbers of keys for a collection of URLs}
% \vspace{-25pt}
% \label{tab:probability_cve1}
% }
% \end{table}
% We now briefly argue that the expected number of iterations to obtain a simple
% graph~$G=G(h_1,h_2)$ is constant for $t=cn$ and $c=1.15$. Let~$p$ be the
% probability of generating a random graph~$G$ without loops and without
% multiple edges. If~$p$ is bounded from below by some positive constant, then
% we are done, because the expected number of iterations to obtain such a graph
% is then~$1/p=O(1)$. To estimate~$p$, we estimate the probability of
% obtaining~$n$ \textit{distinct} objects when we independently draw $n$~objects
% from a universe of cardinality~${t\choose2}={cn\choose2}\sim c^2n^2/2$, with
% replacement. This latter probability is about~$e^{-{n\choose2}/{t\choose2}}$
% for large~$n$. As~$e^{-{n\choose2}/{t\choose2}}\to e^{-1/c^2}>0$
% as~$n\to\infty$, the expected number of iterations is~$e^{1/c^2}=2.13$ (recall
% $c=1.15$).
% As the expected number of iterations is $O(1)$, the mapping step takes
% $O(n)$ time.
% \vspace{-5pt}
% \subsection{Ordering Step}
% \label{sec:ordering}
% The procedure Ordering ($G$, $G_\crit$, $G_\ncrit$) receives as
% input the graph~$G$ and partitions~$G$ into the two subgraphs
% $G_\crit$ and $G_\ncrit$, so that~$G=G_\crit\cup G_\ncrit$.
% For that, the procedure iteratively removes all vertices of degree 1 until none remain.
% \enlargethispage{\baselineskip}
% Figure~\ref{fig:grafordering}(a) presents a sample graph with 9 vertices
% and 8 edges, where the degree of a vertex is shown beside each vertex.
% Applying the ordering step to this graph, the $5$-vertex graph shown in
% Figure~\ref{fig:grafordering}(b) is obtained.
% All vertices with degree 0 are non-critical vertices and the others are
% critical vertices. In order to determine the vertices in $V_\scrit$ we collect all vertices
% $v \in V(G_\crit)$ with at least one vertex $u$ that is in Adj$(v)$ and
% in $V(G_\ncrit)$, as the vertex 8 in Figure~\ref{fig:grafordering}(b).
% \vspace{-5pt}
% \begin{figure*}[!htb]
% \begin{center}
% \scalebox{0.85}{\psfig{file=figs/grafordering.ps}}
% \end{center}
% \vspace{-10pt}
% \caption{Ordering step for a graph with 9 vertices and 8 edges}
% \vspace{-30pt}
% \label{fig:grafordering}
% \end{figure*}
% \subsubsection{Analysis of the Ordering Step. }
% The time complexity of the ordering step is $O(|V(G)|)$ (see \cite{chm97}).
% As $|V(G)| = t = cn$, the ordering step takes $O(n)$ time.
% \vspace{-5pt}
% \subsection{Searching Step}
% \label{sec:searching}
% In the searching step, the key part is
% the {\em perfect assignment problem}: find $g:V(G)\to\ZZ$ such that
% the function $h:E(G)\to\ZZ$ defined by
% \begin{eqnarray}
% \label{eq:phf}
% h(e) = g(a)+g(b) \qquad(e=\{a,b\})
% \end{eqnarray}
% is a bijection from~$E(G)$ to~$[0,n-1]$ (recall~$n=|S|=|E(G)|$).
% We are interested in a labelling $g:V\to\ZZ$ of
% the vertices of the graph~$G=G(h_1,h_2)$ with
% the property that if~$x$ and~$y$ are keys in~$S$, then
% $g(h_1(x))+g(h_2(x))\neq g(h_1(y))+g(h_2(y))$; that is, if we associate
% to each edge the sum of the labels on its endpoints, then these values
% should be all distinct.
% Moreover, we require that all the sums $g(h_1(x))+g(h_2(x))$ ($x\in S$)
% fall between~$0$ and~$|E(G)|-1=n-1$, so that we have a bijection
% between~$S$ and~$[0,n-1]$.
% The procedure Searching ($G$, $G_\crit$, $G_\ncrit$, $g$) receives
% as input~$G$, $G_\crit$, $G_\ncrit$ and finds a suitable
% $\log_2 |V(G)| + 1$ bit value for each vertex $v \in V(G)$, stored in the
% array~$g$.
% This step is first performed for the vertices in the
% critical subgraph~$G_\crit$ of $G$ (the 2-core of~$G$) and then it is
% performed for the vertices in $G_\ncrit$ (the non-critical subgraph
% of~$G$ that contains the ``acyclic part'' of $G$).
% The reason the assignment of the $g$~values is first
% performed on the vertices in~$G_\crit$ is to resolve reassignments
% as early as possible (such reassignments are consequences of the cycles
% in~$G_\crit$ and are depicted hereinafter).
% \vspace{-8pt}
% \subsubsection{Assignment of Values to Critical Vertices. }
% \label{sec:assignmentcv}
% The labels~$g(v)$ ($v\in V(G_\crit)$)
% are assigned in increasing order following a greedy
% strategy where the critical vertices~$v$ are considered one at a time,
% according to a breadth-first search on~$G_\crit$.
% If a candidate value~$x$ for~$g(v)$ is forbidden
% because setting~$g(v)=x$ would create two edges with the same sum,
% we try~$x+1$ for~$g(v)$. This fact is referred to as a {\em reassignment}.
% \enlargethispage{\baselineskip}
% Let $A_E$ be the set of addresses assigned to edges in $E(G_\crit)$.
% Initially $A_E = \emptyset$.
% Let $x$ be a candidate value for $g(v)$.
% Initially $x = 0$.
% Considering the subgraph $G_\crit$ in Figure~\ref{fig:grafordering}(b),
% a step by step example of the assignment of values to vertices in $G_\crit$
% is presented in Figure~\ref{fig:searching}.
% Initially, a vertex $v$ is chosen, the assignment $g(v)=x$ is made
% and $x$ is set to $x + 1$.
% For example, suppose that vertex $8$ in Figure~\ref{fig:searching}(a) is
% chosen, the assignment $g(8)=0$ is made and $x$ is set to $1$.
% \vspace{-12pt}
% \begin{figure*}[!htb]
% \begin{center}
% \scalebox{0.85}{\psfig{file=figs/grafsearching.ps}}
% \end{center}
% \vspace{-13pt}
% \caption{Example of the assignment of values to critical vertices}
% \vspace{-15pt}
% \label{fig:searching}
% \end{figure*}
% In Figure~\ref{fig:searching}(b), following the adjacency list of vertex $8$,
% the unassigned vertex $0$ is reached.
% At this point, we collect in
% the temporary variable $Y$ all adjacencies of vertex $0$ that have been assigned
% an $x$ value, and $Y = \{8\}$.
% Next, for all $u \in Y$, we check if $g(u)+x \not \in A_E$.
% Since $g(8) + 1 = 1 \not \in A_E$, then $g(0)$ is set to $1$, $x$ is incremented
% by 1 (now $x=2$) and $A_E = A_E \cup \{1\}=\{1\}$.
% Next, vertex $3$ is reached, $g(3)$ is set to $2$,
% $x$ is set to $3$ and $A_E = A_E \cup \{2\}=\{1,2\}$.
% Next, vertex $4$ is reached and $Y=\{3, 8\}$.
% Since $g(3) + 3 = 5 \not \in A_E$ and $g(8) + 3 = 3 \not \in A_E$, then
% $g(4)$ is set to $3$, $x$ is set to $4$ and $A_E = A_E \cup \{3,5\} = \{1,2,3,5\}$.
% Finally, vertex $7$ is reached and $Y=\{0, 8\}$.
% Since $g(0) + 4 = 5 \in A_E$, $x$ is incremented by 1 and set to 5, as depicted in
% Figure~\ref{fig:searching}(c).
% Since $g(8) + 5 = 5 \in A_E$, $x$ is again incremented by 1 and set to 6,
% as depicted in Figure~\ref{fig:searching}(d).
% These two reassignments are indicated by the arrows in Figure~\ref{fig:searching}.
% Since $g(0) + 6 = 7 \not \in A_E$ and $g(8) + 6 = 6 \not \in A_E$, then
% $g(7)$ is set to $6$ and $A_E = A_E \cup \{6,7\} = \{1,2,3,5,6,7\}$.
% This finishes the algorithm.
% \vspace{-15pt}
% \subsubsection{Assignment of Values to Non-Critical Vertices. }
% \label{sec:assignmentncv}
% As $G_\ncrit$ is acyclic, we can impose the order in which addresses are
% associated with edges in $G_\ncrit$, making this step simple to solve
% by a standard depth first search algorithm.
% Therefore, in the assignment of values to vertices in $G_\ncrit$ we
% benefit from the unused addresses in the gaps left by the assignment of values
% to vertices in $G_\crit$.
% For that, we start the depth-first search from the vertices in $V_\scrit$
% because the $g$ values for these critical vertices have already been assigned
% and cannot be changed.
% Considering the subgraph $G_\ncrit$ in Figure~\ref{fig:grafordering}(b),
% a step by step example of the assignment of values to vertices in
% $G_\ncrit$ is presented in Figure~\ref{fig:searchingncv}.
% Figure~\ref{fig:searchingncv}(a) presents the initial state of the
% algorithm.
% The critical vertex~$8$ is the only one that has non-critical
% neighbours.
% In the example presented in Figure~\ref{fig:searching}, the addresses
% $\{0, 4\}$ were not used.
% So, taking the first unused address $0$ and the vertex $1$, which is
% reached from the vertex $8$, $g(1)$ is set to
% $0 - g(8) = 0$, as shown in Figure~\ref{fig:searchingncv}(b).
% The only vertex that is reached from vertex $1$ is vertex $2$, so
% taking the unused address $4$ we set $g(2)$ to $4 - g(1) = 4$,
% as shown in Figure~\ref{fig:searchingncv}(c).
% This process is repeated until the UnAssignedAddresses list becomes empty.
% \vspace{-8pt}
% \begin{figure*}[!htb]
% \begin{center}
% \scalebox{0.85}{\psfig{file=figs/grafsearchingncv.ps}}
% \end{center}
% \vspace{-12pt}
% \caption{Example of the assignment of values to non-critical vertices}
% \vspace{-30pt}
% \label{fig:searchingncv}
% \end{figure*}
% \subsubsection{Analysis of the Searching Step. }
% We shall demonstrate that
% (i) the maximum value assigned to an edge is at most $n-1$ (that is, we
% generate a minimal perfect hash function), and
% (ii) the perfect assignment problem (determination of~$g$)
% can be solved in expected time $O(n)$ if the number of edges
% in $G_\crit$ is at most $\frac{1}{2}|E(G)|$.
% \enlargethispage{\baselineskip}
% We focus on the analysis of the assignment of values to critical vertices
% because the assignment of values to non-critical vertices
% can be solved in linear time by a depth first search algorithm.
% We now define certain complexity measures.
% Let $I(v)$ be the number of times a candidate value $x$ for
% $g(v)$ is incremented.
% Let $N_t$ be the total number of times that candidate values
% $x$ are incremented.
% Thus, we have~$N_t=\sum I(v)$, where the sum is over all~$v\in
% V(G_\crit)$.
% For simplicity, we shall suppose that $G_\crit$, the 2-core of $G$, is
% connected.\footnote{The number of vertices in~$G_\crit$ outside the giant
% component is provably very small for~$c=1.15$;
% see~\cite{b01,jlr00,pw04}.} The fact that
% every edge is either a tree edge or a back edge (see, e.g., \cite{clrs01})
% then implies the following.
% \begin{theorem} \label{th:nbedg}
% The number of back edges $N_\bedges$ of $G = G_\crit \cup G_\ncrit$
% is given by $N_\bedges = |E(G_\crit)| - |V(G_\crit)| + 1$.\qed
% \end{theorem}
% \def\maxx{{\rm max}}
% Our next result concerns the maximal value $A_\maxx$ assigned to an edge $e
% \in E(G_\crit)$ after the assignment of $g$ values to critical vertices.
% \begin{theorem} \label{th:Agrt}
% We have $A_\maxx\le 2|V(G_\crit)| - 3 + 2N_{t}$.
% \end{theorem}
% \vspace{-15pt}
% \enlargethispage{\baselineskip}
% \begin{proof}(Sketch)
% The assignment of $g$ values to critical vertices starts from 0,
% and each edge~$e$ receives the label $h(e)$
% as given by Eq.~(\ref{eq:phf}).
% The $g$ value for each vertex $v$ in $V(G_\crit)$ is assigned only once.
% A little thought shows that~$\max_v g(v)\leq |V(G_\crit)|-1+N_t$, where the
% maximum is taken over all vertices~$v$ in~$V(G_\crit)$. Moreover, two
% distinct vertices get distinct~$g$ values. Hence,
% $A_\maxx\le(|V(G_\crit)|-1+N_t)+(|V(G_\crit)|-2+N_t)
% \le2|V(G_\crit)|-3+2N_t$, as required.\qed
% \end{proof}
% \vspace{-15pt}
% \subsubsection{Maximal Value Assigned to an Edge. }
% In this section we present the following conjecture.
% \begin{conjecture} \label{conj:gretestaddr}
% For a random graph $G$ with $|E(G_\crit)|\leq n/2$ and
% $|V(G)| = 1.15n$,
% it is always possible to generate a minimal perfect hash function
% because the maximal value $A_\maxx$ assigned to an edge
% $e \in E(G_\crit)$ is at most $n - 1$.
% \end{conjecture}
% Let us assume for the moment that $N_{t} \le N_\bedges$.
% Then, from Theorems~\ref{th:nbedg} and~\ref{th:Agrt},
% we have
% $A_\maxx\le2|V(G_\crit)|-3+2N_t\leq2|V(G_\crit)|-3+2N_\bedges
% \leq2|V(G_\crit)|-3+2(|E(G_\crit)|-|V(G_\crit)|+1)\le2|E(G_\crit)|-1$.
% As by hypothesis $|E(G_\crit)|\leq n/2$, we have
% $A_\maxx \le n - 1$, as required.
% \textit{In the mathematical analysis of our algorithm, what is left
% open is a single problem:
% prove that $N_{t} \le N_\bedges$.}\footnote{%
% Bollob\'as and Pikhurko~\cite{bp04} have investigated
% a very close vertex labelling problem for random graphs.
% However, their interest was on denser random graphs, and it seems that
% different methods will have to be used to attack the sparser case that
% we are interested in here.}
% We now show experimental evidence that $N_{t} \le N_\bedges$.
% Considering Eqs~(\ref{eq:nvertices2core}) and~(\ref{eq:nedges2core}),
% the expected values for $|V(G_\crit)|$ and $|E(G_\crit)|$ for $c=1.15$ are
% $0.401 n$ and $0.501n$, respectively.
% From Theorem~\ref{th:nbedg},
% $N_\bedges = 0.501n - 0.401n + 1 = 0.1n + 1$.
% Table~\ref{tab:collisions1} presents the maximal value of $N_t$ obtained
% during 10,000 executions of the algorithm for different sizes of $S$.
% The maximal value of $N_t$ was always smaller than $N_\bedges = 0.1 n + 1$ and
% tends to $0.059n$ for $n\ge1{,}000{,}000$.
% \vspace{-5pt}
% \begin{table}[!htb]
% {\footnotesize%\small
% \begin{center}
% \begin{tabular}{|c|c|}
% \hline
% $n$ & Maximal value of $N_t$\\
% \hline
% %$1{,}000$ & $0.091 n$ \\
% $10{,}000$ & $0.067 n$ \\
% $100{,}000$ & $0.061 n$ \\
% $1{,}000{,}000$ & $0.059 n$ \\
% $2{,}000{,}000$ & $0.059 n$ \\
% %$\vdots$ & $\vdots$ \\
% \hline
% \end{tabular}
% \end{center}
% }
% \caption{The maximal value of $N_t$ for different number of URLs}
% \vspace{-40pt}
% \label{tab:collisions1}
% \end{table}
% \subsubsection{Time Complexity. }
% We now show that the time complexity of determining~$g(v)$
% for all critical vertices~$v\in V(G_\crit)$ is
% $O(|V(G_\crit)|)=O(n)$.
% For each unassigned vertex $v$, the adjacency list of $v$, which we
% call Adj($v$), must be traversed
% to collect the set $Y$ of adjacent vertices that have already been assigned a
% value.
% Then, for each vertex in $Y$, we check if the current candidate value $x$ is
% forbidden because setting $g(v)=x$ would create two edges with the same
% endpoint sum.
% Finally, the edge linking $v$ and $u$, for all $u \in Y$, is
% associated with
% the address that corresponds to the sum of its endpoints.
% Let $d_\crit=2|E(G_\crit)|/|V(G_\crit)|$ be the average degree of $G_\crit$,
% note that~$|Y|\leq|\mathrm{Adj}(v)|$, and suppose for simplicity
% that~$|\mathrm{Adj}(v)|=O(d_\crit)$.
% Then, putting all these together, we see that the time complexity of this
% procedure is
% \begin{eqnarray}
% &C(|V(G_\crit)|) = \sum_{v\in V(G_\crit)} \big[\:|\mathrm{Adj}(v)| +
% (I(v) \times|Y|) + |Y|\big]\nonumber\\
% &\qquad\qquad\qquad\leq\sum_{v\in V(G_\crit)}(2+I(v))|\mathrm{Adj}(v)|
% =4|E(G_\crit)|+O(N_t d_\crit).\nonumber
% \end{eqnarray}
% As $d_\crit=2\times0.501n/0.401n\simeq2.499$ (a constant) we have
% $O(|E(G_\crit)|)=O(|V(G_\crit)|)$.
% Supposing that $N_{t}\le N_\bedges$, we have, from Theorem~\ref{th:nbedg},
% that
% $
% N_{t}\le|E(G_\crit)|-|V(G_\crit)|+1
% =O(|E(G_\crit)|)$.
% We conclude that
% $C(|V(G_\crit)|)=O(|E(G_\crit)|) = O(|V(G_\crit)|)$.
% As $|V(G_\crit)| \le |V(G)|$ and $|V(G)| = cn$,
% the time required to determine~$g$ on the critical vertices is $O(n)$.
% \enlargethispage{\baselineskip}
% \vspace{-8pt}

% We have presented a practical method for constructing minimal perfect
% hash functions for static sets that is efficient and may be tuned
% to yield a function with a very economical description.

\section{Experimental Results}
% We now present some experimental results.
% The same experiments were run with our algorithm and
% the algorithm due to Czech, Havas and Majewski~\cite{chm92}, referred to as
% the CHM algorithm.
% The two algorithms were implemented in the C language and
% are available at \texttt{http://cmph.sf.net}.
% Our data consists
% of a collection of 100 million
% uniform resource locators (URLs) collected from the Web.
% The average length of a URL in the collection is 63 bytes.
% All experiments were carried out on
% a computer running the Linux operating system, version 2.6.7,
% with a 2.4 gigahertz processor and
% 4 gigabytes of main memory.
% Table~\ref{tab:characteristics} presents the main characteristics
% of the two algorithms.
% The number of edges in the graph $G=(V,E)$ is~$|S|=n$,
% the number of keys in the input set~$S$.
% The number of vertices of $G$ is equal to $1.15n$ and $2.09n$
% for our algorithm and the CHM algorithm, respectively.
% This measure is related to the amount of space to store the array $g$.
% This improves the space required to store a function in our algorithm to
% $55\%$ of the space required by the CHM algorithm.
% The number of critical edges
% is $\frac{1}{2}|E(G)|$ and 0 for our algorithm and the CHM algorithm,
% respectively.
% Our algorithm generates random graphs that contain cycles with high
% probability and the
% CHM algorithm
% generates
% acyclic random graphs.
% Finally, the CHM algorithm generates order preserving functions
% while our algorithm does not preserve order.
% \vspace{-10pt}
% \begin{table}[htb]
% {\footnotesize
% \begin{center}
% \begin{tabular}{|c|c|c|c|c|c|c|}
% \hline
% & $c$ & $|E(G)|$ & $|V(G)|=|g|$ & $|E(G_\crit)|$ & $G$ & Order preserving \\
% \hline
% Our algorithm & 1.15 & $n$ & $cn$ & $0.5|E(G)|$ & cyclic & no \\
% \hline
% CHM algorithm & 2.09 & $n$ & $cn$ & 0 & acyclic & yes \\
% \hline
% \end{tabular}
% \end{center}
% }
% \caption{Main characteristics of the algorithms}
% \vspace{-25pt}
% \label{tab:characteristics}
% \end{table}
% Table~\ref{tab:timeresults} presents time measurements.
% All times are in seconds.
% The table entries are averages over 50 trials.
% The column labelled $N_i$ gives
% the number of iterations to generate the random graph $G$
% in the mapping step of the algorithms.
% The next columns give the running times
% for the mapping plus ordering steps together and the searching
% step for each algorithm.
% The last column gives the percentage gain of our algorithm
% over the CHM algorithm.
% \begin{table*}
% {\footnotesize
% \begin{center}
% \begin{tabular}{|c|cccc|cccc|c|}
% \hline
% \raisebox{-0.7em}{$n$} & \multicolumn{4}{c|}{\raisebox{-1mm}{Our algorithm}} &
% \multicolumn{4}{c|}{\raisebox{-1mm}{CHM algorithm}}& \raisebox{-0.2em}{Gain}\\
% \cline{2-5} \cline{6-9}
% & \raisebox{-1mm}{$N_i$} &\raisebox{-1mm}{Map+Ord} &
% \raisebox{-1mm}{Search} &\raisebox{-1mm}{Total} &
% \raisebox{-1mm}{$N_i$} &\raisebox{-1mm}{Map+Ord} &\raisebox{-1mm}{Search} &
% \raisebox{-1mm}{Total} & \raisebox{0.2em}{(\%)}\\
% \hline
% %1,562,500 & 2.28 & 8.54 & 2.37 & 10.91 & 2.70 & 14.56 & 1.57 & 16.13 & 48 \\ %[1mm]
% %3,125,000 & 2.16 & 15.92 & 4.88 & 20.80 & 2.85 & 30.36 & 3.20 & 33.56 & 61 \\ %[1mm]
% 6,250,000 & 2.20 & 33.09 & 10.48 & 43.57 & 2.90 & 62.26 & 6.76 & 69.02 & 58 \\ %[1mm]
% 12,500,000 & 2.00 & 63.26 & 23.04 & 86.30 & 2.60 & 117.99 & 14.94 & 132.92 & 54 \\ %[1mm]
% 25,000,000 & 2.00 & 130.79 & 51.55 & 182.34 & 2.80 & 262.05 & 33.68 & 295.73 & 62 \\ %[1mm]
% %50,000,000 & 2.07 & 273.75 & 114.12 & 387.87 & 2.90 & 577.59 & 73.97 & 651.56 & 68 \\ %[1mm]
% 100,000,000 & 2.07 & 567.47 & 243.13 & 810.60 & 2.80 & 1,131.06 & 157.23 & 1,288.29 & 59 \\ %[1mm]
% \hline
% \end{tabular}
% \end{center}
% \caption{Time measurements
% for our algorithm and the CHM algorithm}
% \vspace{-25pt}
% \label{tab:timeresults}
% }\end{table*}
% \enlargethispage{\baselineskip}
% The mapping step of the new algorithm is faster because
% the expected numbers of iterations in the mapping step to generate
% $G$ are 2.13 and 2.92 for our algorithm and the CHM algorithm, respectively.
% The graph $G$ generated by our algorithm
% has $1.15n$ vertices, against $2.09n$ for the CHM algorithm.
% These two facts make our algorithm faster in the mapping step.
% The ordering step of our algorithm is approximately equal to
% the time to check if $G$ is acyclic for the CHM algorithm.
% The searching step of the CHM algorithm is faster, but the total
% time of our algorithm is, on average, approximately 58\% faster
% than the CHM algorithm.
% The experimental results fully back the theoretical results.
% It is important to notice the times for the searching step:
% for both algorithms they are not the dominant times,
% and the experimental results clearly show
% a linear behavior for the searching step.
% We now present a heuristic that reduces the space requirement
% to any given value between $1.15n$ words and $0.93n$ words.
% The heuristic reuses, when possible, the set
% of $x$ values that caused reassignments, just before trying $x+1$
% (see Section~\ref{sec:searching}).
% The lower limit $c=0.93$ was obtained experimentally.
% We generate $10{,}000$ random graphs for
% each size $n$ ($n=10^5$, $5 \times 10^5$, $10^6$, $2\times 10^6$).
% With $c=0.93$ we were always able to generate~$h$, but with $c=0.92$ we never
% succeeded.
% Decreasing the value of $c$ leads to an increase in the number of
% iterations to generate $G$.
% For example, for $c=1$ and $c=0.93$, the analytical expected number
% of iterations are $2.72$ and $3.17$, respectively
% (for $n=12{,}500{,}000$, the number of iterations are 2.78 for $c=1$ and 3.04
% for $c=0.93$).
% Table~\ref{tab:timeresults2} presents the total times to construct a
% function for $n=12{,}500{,}000$, with an increase from $86.30$ seconds
% for $c=1.15$ (see Table~\ref{tab:timeresults}) to
% $101.74$ seconds for $c=1$ and to $102.19$ seconds for $c=0.93$.
% \vspace{-5pt}
% \begin{table*}
% {\footnotesize
% \begin{center}
% \begin{tabular}{|c|cccc|cccc|}
% \hline
% \raisebox{-0.7em}{$n$} & \multicolumn{4}{c|}{\raisebox{-1mm}{Our algorithm $c=1.00$}} &
% \multicolumn{4}{c|}{\raisebox{-1mm}{Our algorithm $c=0.93$}} \\
% \cline{2-5} \cline{6-9}
% & \raisebox{-1mm}{$N_i$} &\raisebox{-1mm}{Map+Ord} &
% \raisebox{-1mm}{Search} &\raisebox{-1mm}{Total} &
% \raisebox{-1mm}{$N_i$} &\raisebox{-1mm}{Map+Ord} &\raisebox{-1mm}{Search} &
% \raisebox{-1mm}{Total} \\%[0.3mm]
% \hline%\\[-2mm]
% 12,500,000 & 2.78 & 76.68 & 25.06 & 101.74 & 3.04 & 76.39 & 25.80 & 102.19 \\ %[1mm]
% \hline
% \end{tabular}
% \end{center}
% \caption{Time measurements
% for our tuned algorithm with $c=1.00$ and $c=0.93$}
% \vspace{-25pt}
% \label{tab:timeresults2}
% }
% \end{table*}
% We compared our algorithm with the ones proposed by Pagh~\cite{p99} and
% by Dietzfelbinger and Hagerup~\cite{dh01}. The authors sent us their
% source code. In their implementations the set of keys is a set of random integers.
% We modified our implementation to generate our~$h$ from a set of random
% integers in order to make a fair comparison. For a set of $10^6$ random integers,
% the times to generate a minimal perfect hash function were $2.7$\,s, $4$\,s and $4.5$\,s for
% our algorithm, Pagh's algorithm and Dietzfelbinger and Hagerup's algorithm, respectively.
% Thus, our algorithm was 48\% faster than Pagh's algorithm and 67\% faster than
% Dietzfelbinger and Hagerup's algorithm, on average. This gain was maintained for sets of different
% sizes.
% Our algorithm needs $kn$ ($k \in [0.93, 1.15]$) words to store
% the resulting function, while Pagh's algorithm needs $kn$ ($k > 2$) words and
% Dietzfelbinger and Hagerup's algorithm needs $kn$ ($k \in [1.13, 1.15]$) words.
% The time to generate the functions is inversely proportional to the value of $k$.
% \enlargethispage{\baselineskip}

Suppose~$U$ is a universe of \textit{keys}.
Let $h:U\to M$ be a {\em hash function} that maps the keys from~$U$
to a given interval of integers $M=[0,m-1]=\{0,1,\dots,m-1\}$.
Let~$S\subseteq U$ be a set of~$n$ keys from~$U$.
Given a key~$x\in S$, the hash function~$h$ computes an integer in
$[0,m-1]$ for the storage or retrieval of~$x$ in a {\em hash table}.
Hashing methods for {\em non-static sets} of keys can be used to construct
data structures storing $S$ and supporting membership queries
``$x \in S$?'' in expected time $O(1)$.
However, they involve a certain amount of wasted space owing to unused
locations in the table and wasted time to resolve collisions when
two keys are hashed to the same table location.
For {\em static sets} of keys it is possible to compute a function
to find any key in a table in one probe; such hash functions are called
\textit{perfect}.
Given a set of keys~$S$, we shall say that a hash function~$h:U\to M$ is a
\textit{perfect hash function} for~$S$ if~$h$ is an injection on~$S$,
that is, there are no \textit{collisions} among the keys in~$S$: if~$x$
and~$y$ are in~$S$ and~$x\neq y$, then~$h(x)\neq h(y)$.
Figure~\ref{fig:minimalperfecthash-ph-mph}(a) illustrates a perfect hash
function.
Since no collisions occur, each key can be retrieved from the table
with a single probe.
If~$m=n$, that is, the table has the same size as~$S$,
then~$h$ is a \textit{minimal perfect hash function} for~$S$.
Figure~\ref{fig:minimalperfecthash-ph-mph}(b) illustrates
a~minimal perfect hash function.
Minimal perfect hash functions totally avoid the problem of wasted
space and time.
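As a small illustration (the key set and the two functions below are chosen
only for this example and do not come from any experiment), take
$S=\{10,21,32\}$, so that $n=3$:
\[
\begin{array}{c|ccc}
x & 10 & 21 & 32\\ \hline
x \bmod 5 & 0 & 1 & 2\\
x \bmod 3 & 1 & 0 & 2
\end{array}
\]
The function $x \mapsto x \bmod 5$ is injective on~$S$ and is therefore a
perfect hash function for~$S$ with $m=5$, but two of the five table locations
remain unused. The function $x \mapsto x \bmod 3$ is also injective on~$S$ and
has $m=n=3$, so it is a minimal perfect hash function for~$S$.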
\caption{(a) Perfect hash function\quad (b) Minimal perfect hash function}
Minimal perfect hash functions are widely used for memory-efficient
and fast retrieval of items from static sets, such as words in natural
languages, reserved words in programming languages or interactive systems,
uniform resource locators (URLs) in Web search engines, or item sets in
data mining techniques.
The aim of this paper is to describe a new way of constructing minimal perfect
hash functions. Our algorithm shares several features with the one due to
Czech, Havas and Majewski~\cite{chm92}. In particular, our algorithm is also
based on the generation of random graphs~$G=(V,E)$, where~$E$ is in one-to-one
correspondence with the key set~$S$ for which we wish to generate the hash
function.
The two main differences between our algorithm and theirs
are as follows:
(\textit{i})~we generate random graphs
$G = (V, E)$ with $|V|=cn$ and $|E|=|S|=n$, where~$c=1.15$, and hence~$G$
contains cycles with high probability,
while they generate \textit{acyclic} random graphs
$G = (V, E)$ with the same number of edges, $|E|=|S|=n$,
but with a greater number of vertices: $|V|=cn\ge2.09n$;
(\textit{ii})~they generate order preserving minimal perfect hash functions
while our algorithm does not preserve order (a perfect hash function $h$ is
\textit{order preserving} if the keys in~$S$ are arranged in some given order
and~$h$ preserves this order in the hash table). Thus, our algorithm improves
the space requirement at the expense of generating functions that are not
order preserving.
Our algorithm is efficient and may be tuned to yield a function~$h$
with a very economical description.
Like the algorithm in~\cite{chm92}, our algorithm produces~$h$
in~$O(n)$ expected time for a set of~$n$ keys.
The description of~$h$ requires~$1.15n$ computer words,
and evaluating~$h(x)$
requires two accesses to an array of~$1.15n$ integers.
We further derive a heuristic that improves the space requirement
from~$1.15n$ words down to~$0.93n$ words.
Our scheme is very practical: to generate a minimal perfect hash function for
a collection of 100~million uniform resource locators (URLs), each 63 bytes
long on average, our algorithm running on a commodity PC takes 811 seconds on

\section{Related Work}
Czech, Havas and Majewski~\cite{chm97} provide a
comprehensive survey of the most important theoretical results
on perfect hashing.
In the following, we review some of those results.
Fredman, Koml\'os and Szemer\'edi~\cite{FKS84} showed that it is possible to
construct space efficient perfect hash functions that can be evaluated in
constant time with table sizes that are linear in the number of keys:
$m=O(n)$. In their model of computation, an element of the universe~$U$ fits
into one machine word, and arithmetic operations and memory accesses have unit
cost. Randomized algorithms in the FKS model can construct a perfect hash
function in expected time~$O(n)$:
this is the case for our algorithm and for the works in~\cite{chm92,p99}.
Many methods for generating minimal perfect hash functions use a
{\em mapping}, {\em ordering} and {\em searching}
(MOS) approach,
a description coined by Fox, Chen and Heath~\cite{fch92}.
In the MOS approach, the construction of a minimal perfect hash function
is accomplished in three steps.
First, the mapping step transforms the key set from the original universe
to a new universe.
Second, the ordering step places the keys in a sequential order that
determines the order in which hash values are assigned to keys.
Third, the searching step attempts to assign hash values to the keys.
Our algorithm and the algorithm presented in~\cite{chm92} use the
MOS approach.
Pagh~\cite{p99} proposed a family of randomized algorithms for
constructing minimal perfect hash functions.
The form of the resulting function is $h(x) = (f(x) + d_{g(x)}) \bmod n$,
where $f$ and $g$ are universal hash functions and $d$ is a set of
displacement values to resolve collisions that are caused by the function $f$.
Pagh identified a set of conditions concerning $f$ and $g$ and showed
that if these conditions are satisfied, then a minimal perfect hash
function can be computed in expected time $O(n)$ and stored in
$(2+\epsilon)n$ computer words.
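To make the role of the displacements concrete, consider a toy instance
(the values of $f$, $g$ and $d$ below are hypothetical and chosen by hand;
they are not produced by Pagh's construction) with $n=4$ keys
$x_1,\dots,x_4$:
\[
\begin{array}{c|cccc}
 & x_1 & x_2 & x_3 & x_4\\ \hline
f(x) & 0 & 2 & 0 & 2\\
g(x) & 0 & 0 & 1 & 1
\end{array}
\qquad d_0 = 0,\quad d_1 = 1 .
\]
Here $f$ alone collides on $\{x_1,x_3\}$ and on $\{x_2,x_4\}$, but
$h(x) = (f(x) + d_{g(x)}) \bmod 4$ yields $h(x_1)=0$, $h(x_2)=2$, $h(x_3)=1$
and $h(x_4)=3$, which is a bijection onto $\{0,1,2,3\}$.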
Dietzfelbinger and Hagerup~\cite{dh01} improved~\cite{p99},
reducing from $(2+\epsilon)n$ to $(1+\epsilon)n$ the number of computer
words required to store the function, but in their approach~$f$ and~$g$ must
be chosen from a class
of hash functions that meet additional requirements.
Unlike the works in~\cite{p99,dh01}, our algorithm uses two
universal hash functions $h_1$ and $h_2$ randomly selected from a class
of universal hash functions that do not need to meet any additional
restrictions.
The work in~\cite{chm92} presents an efficient and practical algorithm
for generating order preserving minimal perfect hash functions.
Their method involves the generation of acyclic random graphs
$G = (V, E)$ with~$|V|=cn$ and $|E|=n$, with $c \ge 2.09$.
They showed that an order preserving minimal perfect hash function
can be found in optimal time if~$G$ is acyclic.
To generate an acyclic graph, two vertices $h_1(x)$ and $h_2(x)$ are
computed for each key $x \in S$.
Thus, each set~$S$ has a corresponding graph~$G=(V,E)$, where $V=\{0,1,
\ldots,cn-1\}$ and $E=\big\{\{h_1(x),h_2(x)\}:x \in S\big\}$.
In order to guarantee the acyclicity of~$G$, the algorithm repeatedly selects
$h_1$ and $h_2$ from a family of universal hash functions
until the corresponding graph is acyclic.
Havas et al.~\cite{hmwc93} proved that if $|V(G)|=cn$ and $c>2$,
then the probability that~$G$ is acyclic is $p=e^{1/c}\sqrt{(c-2)/c}$.
For $c=2.09$, this probability is
$p \simeq 0.342$, and
the expected number of iterations to obtain an acyclic graph
is~$1/p \simeq 2.92$.
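The value $1/p$ follows from a standard argument, sketched here under the
assumption that successive choices of $(h_1,h_2)$ are independent and that
each choice yields an acyclic graph with probability~$p$. The number of
iterations~$N$ is then geometrically distributed, and
\[
\mathrm{E}[N] \;=\; \sum_{k\ge 1} k\,p\,(1-p)^{k-1} \;=\; \frac{1}{p},
\]
so that $p \simeq 0.342$ gives $\mathrm{E}[N] \simeq 1/0.342 \simeq 2.92$.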

\section{The Algorithms}
In this section we present \cite{bkz05}
\subsection{A Main Memory Based Algorithm}
\subsection{An External Memory Based Algorithm}
\caption{Main steps of the new algorithm.}

Hash functions are widely used in several areas of
Computer Science.
A \textit{hash function} $h: U \to M$ maps keys from a universe $U$, $|U|=u$,
to a given interval of integers $M=[0,m-1]=\{0,1,\dots,m-1\}$.
Let~$S\subseteq U$ be a subset of $n$ keys from the universe $U$.
Given a key~$k\in S$, a hash function $h$ computes an integer in
$M$ for the storage or retrieval of $k$ in a \textit{hash table}.
In this paper we assume that the keys are bit strings of maximum
length $L$. Therefore $u = 2^L$.
Hashing methods for {\em non-static sets} of keys can be used to
construct data structures that store $S$ and support queries of the form
``$k \in S$?'' in expected time $O(1)$.
However, they involve a certain waste of space and time, owing
to unused locations in the table and to the time needed to resolve collisions when two
keys are mapped to the same table location.
For {\em static sets} of keys it is possible to compute a function
that finds any key in the table in a single probe; such functions
are called \textit{perfect}.
Given a set of keys $S$, we say that a hash function $h:U\to M$ is a
\textit{perfect hash function} (PHF) for $S$ if $h$ is injective on $S$,
that is, there are no {\em collisions} among the keys in $S$: if $x$
and $y$ are in $S$ and $x\neq y$, then $h(x)\neq h(y)$.
Figure~\ref{fig:minimalperfecthash-ph-mph}(a) illustrates a perfect hash function.
If $m=n$, that is, the table has the same size as $S$,
then $h$ is a \textit{minimal perfect hash function} (MPHF).
Figure~\ref{fig:minimalperfecthash-ph-mph}(b) illustrates a
minimal perfect hash function.
MPHFs can completely avoid the problem of wasted space and time.
\includegraphics[width=0.45\textwidth, height=0.3\textheight]{figs/minimalperfecthash-ph-mph.ps}
\caption{(a) Perfect hash function\quad (b) Minimal perfect hash function}
The practical applicability of minimal perfect hash functions (MPHFs), and consequently of the algorithms used to generate them, is directly related to the following metrics:
\item The amount of time spent to find an MPHF $h$.
\item The amount of memory required to find $h$.
\item The amount of time needed to evaluate or compute $h$ for a given key.
\item The amount of memory required to store the description of the function $h$.
\item The scalability of the algorithms as $S$ grows.
In this paper we present ...

\section{Related Work}
Perfect hash functions (PHFs) and minimal perfect hash functions (MPHFs)
received much attention from the scientific community in the 1980s and 1990s.
A complete survey of the area up to 1997 is presented in~\cite{chm97}.
In this section we revisit the works covered by that survey that
are directly related to the algorithms proposed here, and
we survey the algorithms proposed since then.
Fredman, Koml\'os and Szemer\'edi~\cite{FKS84} showed that it is possible to construct
PHFs that can be described space-efficiently and evaluated in
constant time using table sizes that are linear in the number of keys:
$m=O(n)$.
In their model of computation, an element of the universe~$U$ fits into one
machine word, and arithmetic operations and memory accesses have unit
cost.
Randomized algorithms in the FKS model can construct PHFs in expected
time $O(n)$:
this is the case for our algorithms and for the works in~\cite{chm92,p99}.
The works in~\cite{asw00,swz00} present algorithms that construct
perfect hash functions (PHFs) and minimal perfect hash functions (MPHFs) deterministically.
The generated functions need $O(n \log(n) + \log(\log(u)))$ bits to be described.
The average-case complexity of the algorithms that generate the functions is
$O(n\log(n) \log( \log (u)))$ and the worst-case complexity is $O(n^3\log(n) \log(\log(u)))$.
The evaluation complexity of the functions is $O(\log(n) + \log(\log(u)))$.
Thus, the algorithms do not generate functions that can be evaluated in
$O(1)$ time, the functions are a factor of $\log n$ away from the optimal space complexity for describing
PHFs and MPHFs (Mehlhorn shows in~\cite{m84}
that storing a PHF requires
$\Omega(n^2/(2m\ln 2) + \log\log u)$ bits), and the functions are not
generated in linear time.
Moreover, the universe $U$ of keys is restricted to integers, which may
limit their use in practice.
Pagh~\cite{p99} proposed a family of randomized algorithms for constructing
minimal perfect hash functions (MPHFs).
The form of the resulting function is $h(k) = (f(k) + d_{g(k)}) \bmod n$,
where $f$ and $g$ are universal hash functions~\cite{ss89} and $d$ is a set of
displacement values that resolve the collisions caused by the function $f$.
Pagh identified a set of conditions concerning $f$ and $g$, and showed
that if these conditions are satisfied, then an MPHF can be computed
in expected time $O(n)$ and stored in $(2+\epsilon)n$ computer words
(or $O((2+\epsilon)n \log n)$ bits).
Dietzfelbinger and Hagerup~\cite{dh01} improved~\cite{p99},
reducing from $(2+\epsilon)n$ to $(1+\epsilon)n$ (or $O((1+\epsilon)n \log n)$ bits)
the number of
computer words required to store the function, but in their approach $f$ and $g$
must be chosen from a class of hash functions that meet additional
requirements.
Galli, Seybold and Simon~\cite{gss01} proposed a randomized algorithm
that generates minimal perfect hash functions (MPHFs) of the same form as those generated by the algorithms of Pagh~\cite{p99}
and of Dietzfelbinger and Hagerup~\cite{dh01}. However, they defined the form of the
functions $f(k) = h_c(k) \bmod n$ and $g(k) = \lfloor h_c(k)/n \rfloor$ to obtain, in expected time $O(n)$, a function that can be described in $O(n\log n)$ bits, where
$h_c(k) = (ck \bmod p) \bmod n^2$, $1 \leq c \leq p-1$ and $p$ is a prime greater than $u$.
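As a hedged illustration of these definitions (the parameter values below are
hypothetical and far smaller than any realistic instance), note that
$h_c(k) = g(k)\,n + f(k)$ with $0 \le f(k) < n$ and $0 \le g(k) < n$, so that
$f(k)$ and $g(k)$ are the two base-$n$ digits of $h_c(k)$.
For $n=3$, $p=11$, $c=2$ and $k=7$:
\[
h_c(7) = (14 \bmod 11) \bmod 9 = 3, \qquad
f(7) = 3 \bmod 3 = 0, \qquad
g(7) = \lfloor 3/3 \rfloor = 1 .
\]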
The algorithms proposed in~\cite{p99,dh01,gss01} do not scale with the growth of the
key set $S$. This is due to the restrictions imposed on the universal hash
functions used to compute the minimal perfect hash functions (MPHFs). Typically, a
prime number greater than the universe size $u$ is required, which in general is much larger
than $n=|S|$, or operations involving $n^2$ appear in the computation of the MPHF.
Moreover, all these functions are a factor of $\log n$ away from the optimal
space complexity for describing MPHFs.
Unlike the works in~\cite{p99,dh01,gss01}, our algorithms use
universal hash functions that are randomly selected from a class
of functions that do not need to meet additional restrictions.
Moreover, the minimal perfect hash functions (MPHFs) are generated in expected time $O(n)$, are evaluated
at cost $O(1)$, and are described in $O(n)$ bits, which is very close to the
optimal complexity.
To the best of our knowledge, the algorithms proposed in this paper are
the first in the literature capable of generating MPHFs for key sets on the
order of billions of keys using a simple PC with 1~GB of main memory.

