% Time-stamp: <Monday 30 Jan 2006 03:06:57am EDT yoshi@ime.usp.br>
\vspace{-3mm}
\section{Related work}
\label{sec:relatedprevious-work}
\vspace{-2mm}

% Optimal speed for hashing means that each key from the key set $S$
% will map to a unique location in the hash table, avoiding time wasted
% in resolving collisions. That is achieved with an MPHF and,
% because of that, many algorithms for constructing static
% and dynamic MPHFs, when static or dynamic sets are involved,
% were developed. Our focus has been on static MPHFs, since
% in many applications the key sets change slowly, if at all~\cite{s05}.

\enlargethispage{2\baselineskip}
Czech, Havas and Majewski~\cite{chm97} provide a
comprehensive survey of the most important theoretical and practical results
on perfect hashing.
In this section we review some of the most important results.
%We also present more recent algorithms that share some features with
%the one presented hereinafter.

Fredman, Koml\'os and Szemer\'edi~\cite{FKS84} showed that it is possible to
construct space-efficient perfect hash functions that can be evaluated in
constant time, with table sizes linear in the number of keys:
$m=O(n)$. In their model of computation, an element of the universe~$U$ fits
into one machine word, and arithmetic operations and memory accesses have unit
cost. Randomized algorithms in the FKS model can construct a perfect hash
function in expected time~$O(n)$;
this is the case of our algorithm and of the works in~\cite{chm92,p99}.

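The two-level FKS construction can be sketched as follows. This is an illustrative Python sketch, not the paper's implementation: the universal family $((ax+b) \bmod p) \bmod m$, the balance threshold $\sum b_i^2 < 4n$, and all helper names are conventional textbook choices rather than details taken from~\cite{FKS84}.

```python
# Sketch of the FKS two-level scheme: a first-level universal hash splits
# the n keys into n buckets, and each bucket of size s gets its own
# collision-free table of size s^2, so total space stays O(n) while
# lookups cost O(1) (two hash evaluations and one probe).
import random

P = (1 << 31) - 1  # a prime larger than any key in the universe U

def universal(a, b, m):
    """h(x) = ((a*x + b) mod P) mod m, a classic universal hash."""
    return lambda x: ((a * x + b) % P) % m

def build_fks(keys):
    n = len(keys)
    while True:  # retry until bucket sizes are balanced (succeeds w.p. >= 1/2)
        h = universal(random.randrange(1, P), random.randrange(P), n)
        buckets = [[] for _ in range(n)]
        for x in keys:
            buckets[h(x)].append(x)
        if sum(len(b) ** 2 for b in buckets) < 4 * n:
            break
    tables = []
    for b in buckets:
        size = max(len(b) ** 2, 1)
        while True:  # retry until the second level is injective on the bucket
            g = universal(random.randrange(1, P), random.randrange(P), size)
            slots = {}
            if all(slots.setdefault(g(x), x) == x for x in b):
                break
        table = [None] * size
        for x in b:
            table[g(x)] = x
        tables.append((g, table))
    def lookup(x):  # O(1) membership test
        g, table = tables[h(x)]
        return table[g(x)] == x
    total_slots = sum(len(t) for _, t in tables)
    return lookup, total_slots

keys = random.sample(range(1, P), 1000)
lookup, total_slots = build_fks(keys)
assert all(lookup(x) for x in keys)
assert total_slots < 5 * len(keys)  # linear space overall
```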
Mehlhorn~\cite{m84} showed
that at least $\Omega((1/\ln 2)n + \ln\ln u)$ bits are
required to represent an MPHF (i.e., at least 1.4427 bits per
key must be stored).
To the best of our knowledge, our algorithm
is the first one capable of generating MPHFs for sets on the order
of a billion keys, and the generated functions
require less than 9 bits per key to be stored.
This is one order of magnitude larger than the greatest
key set for which an MPHF had been obtained in the literature~\cite{bkz05}.
%which is close to the lower bound presented in~\cite{m84}.

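To put the two figures above side by side, the following arithmetic compares Mehlhorn's per-key lower bound with the 9 bits per key reported here; the totals for $n = 10^9$ keys are illustrative.

```python
# Space accounting for the bounds quoted above: the lower bound charges
# about 1/ln 2 ~= 1.4427 bits per key, our functions need under 9.
import math

lower_bound_bits_per_key = 1 / math.log(2)   # ~= 1.4427
our_bits_per_key = 9                         # upper bound reported above

n = 10 ** 9  # a billion keys
lower_bound_total = lower_bound_bits_per_key * n / 8 / 2 ** 20  # MiB
our_total = our_bits_per_key * n / 8 / 2 ** 20                  # MiB

print(f"lower bound: {lower_bound_bits_per_key:.4f} bits/key,"
      f" {lower_bound_total:.0f} MiB for n = 10^9")
print(f"this work:   {our_bits_per_key} bits/key,"
      f" {our_total:.0f} MiB for n = 10^9")
```

So the generated functions are roughly a factor of six above the information-theoretic minimum, while remaining small enough to hold in main memory.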
Some work on minimal perfect hashing has been done under the assumption that
the algorithm can pick and store truly random functions~\cite{bkz05,chm92,p99}.
Since the space requirements of truly random functions make them unsuitable for
implementation, one has to settle for pseudo-random functions in practice.
Empirical studies show that limited randomness properties are often as good as
total randomness.
We verified that phenomenon in our experiments by using the universal hash
function proposed by Jenkins~\cite{j97}, which is
time-efficient at retrieval time and requires just an integer to be used as a
random seed (the function is completely determined by the seed).
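The point about the seed can be illustrated as follows: a hash function drawn from a seeded family is completely determined by one stored integer. The mixer below is a generic multiply-xorshift stand-in, not a reproduction of Jenkins' actual function.

```python
# A hash family parameterized by a single integer seed: storing the seed
# is enough to reproduce the function exactly at retrieval time.
def seeded_hash(seed, m):
    """Return a hash function U -> [0, m) fixed entirely by `seed`."""
    mask = 0xFFFFFFFFFFFFFFFF
    def h(x):
        x = ((x ^ seed) * 0x9E3779B97F4A7C15) & mask  # mix the seed in
        x ^= x >> 29
        x = (x * 0xBF58476D1CE4E5B9) & mask           # diffuse the bits
        x ^= x >> 32
        return x % m
    return h

h1 = seeded_hash(42, 1000)
h2 = seeded_hash(42, 1000)  # same seed: identical function
h3 = seeded_hash(43, 1000)  # new seed: a fresh function from the family
assert [h1(x) for x in range(100)] == [h2(x) for x in range(100)]
assert [h1(x) for x in range(100)] != [h3(x) for x in range(100)]
```

Retrying a failed construction thus costs only one new integer, which matters when the construction algorithm must draw fresh functions until some structural condition holds.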
% The works~\cite{asw00,swz00} present algorithms for constructing
% PHFs and MPHFs deterministically.
% The generated functions require $O(n \log(n) + \log(\log(u)))$ bits to be described.
% The average-case complexity of the algorithms that generate the functions is
% $O(n\log(n) \log( \log (u)))$ and the worst-case one is $O(n^3\log(n) \log(\log(u)))$.
% The evaluation complexity of the functions is $O(\log(n) + \log(\log(u)))$.
% Thus, the algorithms do not generate functions that can be evaluated in
% $O(1)$ time, are a factor of $\log n$ away from the optimal complexity for describing
% PHFs and MPHFs (Mehlhorn shows in~\cite{m84}
% that storing a PHF requires at least
% $\Omega(n^2/(2\ln 2) m + \log\log u)$ bits), and do not generate the
% functions in linear time.
% Moreover, the key universe $U$ is restricted to integers, which may
% limit their use in practice.

Pagh~\cite{p99} proposed a family of randomized algorithms for
constructing MPHFs
in which the resulting function has the form $h(x) = (f(x) + d[g(x)]) \bmod n$,
where $f$ and $g$ are universal hash functions and $d$ is a set of
displacement values used to resolve the collisions caused by the function $f$.
Pagh identified a set of conditions concerning $f$ and $g$ and showed
that if these conditions are satisfied, then a minimal perfect hash
function can be computed in expected time $O(n)$ and stored in
$(2+\epsilon)n\log_2n$ bits.

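The hash-and-displace form above can be sketched as follows. This is an illustrative toy, under assumptions of our own: Pagh's conditions on $f$ and $g$ are replaced by a plain retry loop, the range $r = 2n$ for $g$ and the greedy largest-bucket-first search are simplifications, and the helper names are not from~\cite{p99}.

```python
# Sketch of hash-and-displace for h(x) = (f(x) + d[g(x)]) mod n:
# keys sharing a g-value form a bucket, and each bucket gets one
# displacement that shifts all its f-values into free slots at once.
import random

def try_build(keys, r):
    n = len(keys)
    p = (1 << 31) - 1
    a1, b1, a2, b2 = (random.randrange(1, p) for _ in range(4))
    f = lambda x: ((a1 * x + b1) % p) % n
    g = lambda x: ((a2 * x + b2) % p) % r
    buckets = [[] for _ in range(r)]
    for x in keys:
        buckets[g(x)].append(x)
    d = [0] * r
    used = [False] * n
    # place large buckets first, trying displacements until one fits
    for i in sorted(range(r), key=lambda i: -len(buckets[i])):
        b = buckets[i]
        if not b:
            continue
        for disp in range(n):
            slots = [(f(x) + disp) % n for x in b]
            if len(set(slots)) == len(b) and not any(used[s] for s in slots):
                d[i] = disp
                for s in slots:
                    used[s] = True
                break
        else:
            return None  # no displacement fits; caller draws new f and g
    return lambda x: (f(x) + d[g(x)]) % n

keys = random.sample(range(1, 1 << 30), 200)
h = None
while h is None:  # a few retries expected for these toy parameters
    h = try_build(keys, r=2 * len(keys))
assert sorted(h(x) for x in keys) == list(range(len(keys)))  # minimal, perfect
```

Note that the stored state is exactly the seeds of $f$ and $g$ plus the array $d$ of $r$ displacements, which is where the $O(n\log n)$-bit cost of this representation comes from.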
Dietzfelbinger and Hagerup~\cite{dh01} improved on~\cite{p99},
reducing from $(2+\epsilon)n\log_2n$ to $(1+\epsilon)n\log_2n$ the number of bits
required to store the function, but in their approach~$f$ and~$g$ must
be chosen from a class
of hash functions that meet additional requirements.
%Differently from the works in~\cite{dh01, p99}, our algorithm generates an MPHF
%$h$ in expected linear time, and $h$ can be stored in $O(n)$ bits (9 bits per key).

% Galli, Seybold and Simon~\cite{gss01} proposed a randomized algorithm
% that generates MPHFs of the same form as those produced by the algorithms of Pagh~\cite{p99}
% and of Dietzfelbinger and Hagerup~\cite{dh01}. However, they defined the
% functions as $f(k) = h_c(k) \bmod n$ and $g(k) = \lfloor h_c(k)/n \rfloor$ to obtain, in expected time $O(n)$, a function that can be described in $O(n\log n)$ bits, where
% $h_c(k) = (ck \bmod p) \bmod n^2$, $1 \leq c \leq p-1$, and $p$ is a prime greater than $u$.
%Our algorithm is the first one capable of generating MPHFs for sets on the order of
%a billion keys. This is because we do not need to keep in main memory
%at generation time complex data structures such as a graph, lists and so on. We just need to maintain
%a small vector that occupies around 8MB for a set of 1 billion keys.

Fox et al.~\cite{fch92,fhcd92} studied MPHFs
%that also share features with the ones generated by our algorithm.
that bring the storage requirements down to between 2 and 4 bits per key.
However, it is shown in~\cite[Section 6.7]{chm97} that their algorithms have exponential
running times, and our implementation of their algorithm could not scale to sets
larger than 11 million keys.

Our previous work~\cite{bkz05} improves the one by Czech, Havas and Majewski~\cite{chm92}:
we obtained more compact functions in less time. Although
the algorithm in~\cite{bkz05} is the fastest algorithm
we know of, the resulting functions are stored in $O(n\log n)$ bits, and
one needs to keep in main memory at generation time a random graph of $n$ edges
and $cn$ vertices,
where $c\in[0.93,1.15]$. Using the well-known divide-and-conquer approach,
we use that algorithm as a building block for the new one, whose
resulting functions are stored in $O(n)$ bits.
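The graph-based construction of~\cite{chm92}, on which~\cite{bkz05} builds, can be sketched as follows: each key becomes an edge $(h_1(x), h_2(x))$ of a random graph, and if the graph is acyclic a vertex labelling $g$ is found so that $h(x) = (g[h_1(x)] + g[h_2(x)]) \bmod n$ numbers the edges $0,\ldots,n-1$. The constant $c = 2.09$ and the simple retry loop below are the classic CHM choices, illustrative only; they match neither the $c\in[0.93,1.15]$ regime of~\cite{bkz05} nor the new algorithm.

```python
# Sketch of the CHM graph-based MPHF: keys are edges, and a traversal of
# an acyclic graph assigns vertex labels g so that the labels of an
# edge's endpoints sum to that edge's index modulo n.
import random

def try_build(keys, m):
    n = len(keys)
    p = (1 << 31) - 1
    a1, b1, a2, b2 = (random.randrange(1, p) for _ in range(4))
    h1 = lambda x: ((a1 * x + b1) % p) % m
    h2 = lambda x: ((a2 * x + b2) % p) % m
    adj = [[] for _ in range(m)]  # vertex -> [(neighbour, edge id)]
    for i, x in enumerate(keys):
        u, v = h1(x), h2(x)
        if u == v:
            return None           # self-loop: retry with new hash functions
        adj[u].append((v, i))
        adj[v].append((u, i))
    g = [0] * m
    seen = [False] * m
    for root in range(m):         # label one connected component at a time
        if seen[root] or not adj[root]:
            continue
        seen[root] = True
        stack = [root]
        while stack:
            u = stack.pop()
            for v, i in adj[u]:
                if not seen[v]:
                    seen[v] = True
                    g[v] = (i - g[u]) % n  # force g[u] + g[v] = i (mod n)
                    stack.append(v)
                elif (g[u] + g[v]) % n != i:
                    return None            # inconsistent cycle: retry
    return lambda x: (g[h1(x)] + g[h2(x)]) % n

keys = random.sample(range(1, 1 << 30), 500)
mphf = None
while mphf is None:  # acyclic with constant probability for m >= 2.09 n
    mphf = try_build(keys, m=int(2.09 * len(keys)) + 1)
assert sorted(mphf(x) for x in keys) == list(range(len(keys)))
```

The $O(n\log n)$-bit cost mentioned above is visible here: the array $g$ holds $cn$ entries of $\log_2 n$ bits each, and the whole graph must sit in memory during generation, which is precisely what limits this approach at the billion-key scale.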