% Time-stamp: <Monday 30 Jan 2006 03:06:57am EDT yoshi@ime.usp.br>
\vspace{-3mm}
\section{Related work}
\label{sec:relatedprevious-work}
\vspace{-2mm}

% Optimal speed for hashing means that each key from the key set $S$
% will map to a unique location in the hash table, avoiding time wasted
% in resolving collisions. That is achieved with an MPHF and,
% because of that, many algorithms for constructing static
% and dynamic MPHFs, when static or dynamic sets are involved,
% were developed. Our focus has been on static MPHFs, since
% in many applications the key sets change slowly, if at all~\cite{s05}.

\enlargethispage{2\baselineskip}
Czech, Havas and Majewski~\cite{chm97} provide a
comprehensive survey of the most important theoretical and practical results
on perfect hashing.
In this section we review some of the most important results.
%We also present more recent algorithms that share some features with
%the one presented hereinafter.

Fredman, Koml\'os and Szemer\'edi~\cite{FKS84} showed that it is possible to
construct space-efficient perfect hash functions that can be evaluated in
constant time, with table sizes linear in the number of keys:
$m=O(n)$. In their model of computation, an element of the universe~$U$ fits
into one machine word, and arithmetic operations and memory accesses have unit
cost. Randomized algorithms in the FKS model can construct a perfect hash
function in expected time~$O(n)$;
this is the case of our algorithm and of the works in~\cite{chm92,p99}.

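The two-level FKS construction can be sketched as follows. This is an illustrative Python sketch, not the paper's implementation: the universal family $((ax+b) \bmod p) \bmod m$, the balance threshold $\sum b_i^2 < 4n$, and all helper names are conventional textbook choices rather than details taken from~\cite{FKS84}.

```python
# Sketch of the FKS two-level scheme: a first-level universal hash splits
# the n keys into n buckets, and each bucket of size s gets its own
# collision-free table of size s^2, so total space stays O(n) while
# lookups cost O(1) (two hash evaluations and one probe).
import random

P = (1 << 31) - 1  # a prime larger than any key in the universe U

def universal(a, b, m):
    """h(x) = ((a*x + b) mod P) mod m, a classic universal hash."""
    return lambda x: ((a * x + b) % P) % m

def build_fks(keys):
    n = len(keys)
    while True:  # retry until bucket sizes are balanced (succeeds w.p. >= 1/2)
        h = universal(random.randrange(1, P), random.randrange(P), n)
        buckets = [[] for _ in range(n)]
        for x in keys:
            buckets[h(x)].append(x)
        if sum(len(b) ** 2 for b in buckets) < 4 * n:
            break
    tables = []
    for b in buckets:
        size = max(len(b) ** 2, 1)
        while True:  # retry until the second level is injective on the bucket
            g = universal(random.randrange(1, P), random.randrange(P), size)
            slots = {}
            if all(slots.setdefault(g(x), x) == x for x in b):
                break
        table = [None] * size
        for x in b:
            table[g(x)] = x
        tables.append((g, table))
    def lookup(x):  # O(1) membership test
        g, table = tables[h(x)]
        return table[g(x)] == x
    total_slots = sum(len(t) for _, t in tables)
    return lookup, total_slots

keys = random.sample(range(1, P), 1000)
lookup, total_slots = build_fks(keys)
assert all(lookup(x) for x in keys)
assert total_slots < 5 * len(keys)  # linear space overall
```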
Mehlhorn~\cite{m84} showed
that at least $\Omega((1/\ln 2)n + \ln\ln u)$ bits are
required to represent an MPHF (i.e., at least 1.4427 bits per
key must be stored).
To the best of our knowledge, our algorithm
is the first one capable of generating MPHFs for sets on the order
of a billion keys, and the generated functions
require less than 9 bits per key to be stored.
This is one order of magnitude larger than the greatest
key set for which an MPHF had been obtained in the literature~\cite{bkz05}.
%which is close to the lower bound presented in~\cite{m84}.

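To put the two figures above side by side, the following arithmetic compares Mehlhorn's per-key lower bound with the 9 bits per key reported here; the totals for $n = 10^9$ keys are illustrative.

```python
# Space accounting for the bounds quoted above: the lower bound charges
# about 1/ln 2 ~= 1.4427 bits per key, our functions need under 9.
import math

lower_bound_bits_per_key = 1 / math.log(2)   # ~= 1.4427
our_bits_per_key = 9                         # upper bound reported above

n = 10 ** 9  # a billion keys
lower_bound_total = lower_bound_bits_per_key * n / 8 / 2 ** 20  # MiB
our_total = our_bits_per_key * n / 8 / 2 ** 20                  # MiB

print(f"lower bound: {lower_bound_bits_per_key:.4f} bits/key,"
      f" {lower_bound_total:.0f} MiB for n = 10^9")
print(f"this work:   {our_bits_per_key} bits/key,"
      f" {our_total:.0f} MiB for n = 10^9")
```

So the generated functions are roughly a factor of six above the information-theoretic minimum, while remaining small enough to hold in main memory.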
Some work on minimal perfect hashing has been done under the assumption that
the algorithm can pick and store truly random functions~\cite{bkz05,chm92,p99}.
Since the space requirements of truly random functions make them unsuitable for
implementation, one has to settle for pseudo-random functions in practice.
Empirical studies show that limited randomness properties are often as good as
total randomness.
We verified that phenomenon in our experiments by using the universal hash
function proposed by Jenkins~\cite{j97}, which is
time-efficient at retrieval time and requires just an integer to be used as a
random seed (the function is completely determined by the seed).
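The point about the seed can be illustrated as follows: a hash function drawn from a seeded family is completely determined by one stored integer. The mixer below is a generic multiply-xorshift stand-in, not a reproduction of Jenkins' actual function.

```python
# A hash family parameterized by a single integer seed: storing the seed
# is enough to reproduce the function exactly at retrieval time.
def seeded_hash(seed, m):
    """Return a hash function U -> [0, m) fixed entirely by `seed`."""
    mask = 0xFFFFFFFFFFFFFFFF
    def h(x):
        x = ((x ^ seed) * 0x9E3779B97F4A7C15) & mask  # mix the seed in
        x ^= x >> 29
        x = (x * 0xBF58476D1CE4E5B9) & mask           # diffuse the bits
        x ^= x >> 32
        return x % m
    return h

h1 = seeded_hash(42, 1000)
h2 = seeded_hash(42, 1000)  # same seed: identical function
h3 = seeded_hash(43, 1000)  # new seed: a fresh function from the family
assert [h1(x) for x in range(100)] == [h2(x) for x in range(100)]
assert [h1(x) for x in range(100)] != [h3(x) for x in range(100)]
```

Retrying a failed construction thus costs only one new integer, which matters when the construction algorithm must draw fresh functions until some structural condition holds.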
% The works~\cite{asw00,swz00} present algorithms for constructing
% PHFs and MPHFs deterministically.
% The generated functions require $O(n \log(n) + \log(\log(u)))$ bits to be described.
% The average-case complexity of the algorithms that generate the functions is
% $O(n\log(n) \log( \log (u)))$ and the worst-case one is $O(n^3\log(n) \log(\log(u)))$.
% The evaluation complexity of the functions is $O(\log(n) + \log(\log(u)))$.
% Thus, the algorithms do not generate functions that can be evaluated in
% $O(1)$ time, are a factor of $\log n$ away from the optimal complexity for describing
% PHFs and MPHFs (Mehlhorn shows in~\cite{m84}
% that storing a PHF requires at least
% $\Omega(n^2/(2\ln 2) m + \log\log u)$ bits), and do not generate the
% functions in linear time.
% Moreover, the key universe $U$ is restricted to integers, which may
% limit their use in practice.

Pagh~\cite{p99} proposed a family of randomized algorithms for
constructing MPHFs
in which the resulting function has the form $h(x) = (f(x) + d[g(x)]) \bmod n$,
where $f$ and $g$ are universal hash functions and $d$ is a set of
displacement values used to resolve the collisions caused by the function $f$.
Pagh identified a set of conditions concerning $f$ and $g$ and showed
that if these conditions are satisfied, then a minimal perfect hash
function can be computed in expected time $O(n)$ and stored in
$(2+\epsilon)n\log_2n$ bits.

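The hash-and-displace form above can be sketched as follows. This is an illustrative toy, under assumptions of our own: Pagh's conditions on $f$ and $g$ are replaced by a plain retry loop, the range $r = 2n$ for $g$ and the greedy largest-bucket-first search are simplifications, and the helper names are not from~\cite{p99}.

```python
# Sketch of hash-and-displace for h(x) = (f(x) + d[g(x)]) mod n:
# keys sharing a g-value form a bucket, and each bucket gets one
# displacement that shifts all its f-values into free slots at once.
import random

def try_build(keys, r):
    n = len(keys)
    p = (1 << 31) - 1
    a1, b1, a2, b2 = (random.randrange(1, p) for _ in range(4))
    f = lambda x: ((a1 * x + b1) % p) % n
    g = lambda x: ((a2 * x + b2) % p) % r
    buckets = [[] for _ in range(r)]
    for x in keys:
        buckets[g(x)].append(x)
    d = [0] * r
    used = [False] * n
    # place large buckets first, trying displacements until one fits
    for i in sorted(range(r), key=lambda i: -len(buckets[i])):
        b = buckets[i]
        if not b:
            continue
        for disp in range(n):
            slots = [(f(x) + disp) % n for x in b]
            if len(set(slots)) == len(b) and not any(used[s] for s in slots):
                d[i] = disp
                for s in slots:
                    used[s] = True
                break
        else:
            return None  # no displacement fits; caller draws new f and g
    return lambda x: (f(x) + d[g(x)]) % n

keys = random.sample(range(1, 1 << 30), 200)
h = None
while h is None:  # a few retries expected for these toy parameters
    h = try_build(keys, r=2 * len(keys))
assert sorted(h(x) for x in keys) == list(range(len(keys)))  # minimal, perfect
```

Note that the stored state is exactly the seeds of $f$ and $g$ plus the array $d$ of $r$ displacements, which is where the $O(n\log n)$-bit cost of this representation comes from.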
Dietzfelbinger and Hagerup~\cite{dh01} improved on~\cite{p99},
reducing from $(2+\epsilon)n\log_2n$ to $(1+\epsilon)n\log_2n$ the number of bits
required to store the function, but in their approach~$f$ and~$g$ must
be chosen from a class
of hash functions that meet additional requirements.
%Differently from the works in~\cite{dh01, p99}, our algorithm generates an MPHF
%$h$ in expected linear time, and $h$ can be stored in $O(n)$ bits (9 bits per key).

% Galli, Seybold and Simon~\cite{gss01} proposed a randomized algorithm
% that generates MPHFs of the same form as those produced by the algorithms of Pagh~\cite{p99}
% and of Dietzfelbinger and Hagerup~\cite{dh01}. However, they defined the
% functions as $f(k) = h_c(k) \bmod n$ and $g(k) = \lfloor h_c(k)/n \rfloor$ to obtain, in expected time $O(n)$, a function that can be described in $O(n\log n)$ bits, where
% $h_c(k) = (ck \bmod p) \bmod n^2$, $1 \leq c \leq p-1$, and $p$ is a prime greater than $u$.
%Our algorithm is the first one capable of generating MPHFs for sets on the order of
%a billion keys. This is because we do not need to keep in main memory
%at generation time complex data structures such as a graph, lists and so on. We just need to maintain
%a small vector that occupies around 8MB for a set of 1 billion keys.

Fox et al.~\cite{fch92,fhcd92} studied MPHFs
%that also share features with the ones generated by our algorithm.
that bring the storage requirements down to between 2 and 4 bits per key.
However, it is shown in~\cite[Section 6.7]{chm97} that their algorithms have exponential
running times, and our implementation of their algorithm could not scale to sets
larger than 11 million keys.

Our previous work~\cite{bkz05} improves the one by Czech, Havas and Majewski~\cite{chm92}:
we obtained more compact functions in less time. Although
the algorithm in~\cite{bkz05} is the fastest algorithm
we know of, the resulting functions are stored in $O(n\log n)$ bits, and
one needs to keep in main memory at generation time a random graph of $n$ edges
and $cn$ vertices,
where $c\in[0.93,1.15]$. Using the well-known divide-and-conquer approach,
we use that algorithm as a building block for the new one, whose
resulting functions are stored in $O(n)$ bits.
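The graph-based construction of~\cite{chm92}, on which~\cite{bkz05} builds, can be sketched as follows: each key becomes an edge $(h_1(x), h_2(x))$ of a random graph, and if the graph is acyclic a vertex labelling $g$ is found so that $h(x) = (g[h_1(x)] + g[h_2(x)]) \bmod n$ numbers the edges $0,\ldots,n-1$. The constant $c = 2.09$ and the simple retry loop below are the classic CHM choices, illustrative only; they match neither the $c\in[0.93,1.15]$ regime of~\cite{bkz05} nor the new algorithm.

```python
# Sketch of the CHM graph-based MPHF: keys are edges, and a traversal of
# an acyclic graph assigns vertex labels g so that the labels of an
# edge's endpoints sum to that edge's index modulo n.
import random

def try_build(keys, m):
    n = len(keys)
    p = (1 << 31) - 1
    a1, b1, a2, b2 = (random.randrange(1, p) for _ in range(4))
    h1 = lambda x: ((a1 * x + b1) % p) % m
    h2 = lambda x: ((a2 * x + b2) % p) % m
    adj = [[] for _ in range(m)]  # vertex -> [(neighbour, edge id)]
    for i, x in enumerate(keys):
        u, v = h1(x), h2(x)
        if u == v:
            return None           # self-loop: retry with new hash functions
        adj[u].append((v, i))
        adj[v].append((u, i))
    g = [0] * m
    seen = [False] * m
    for root in range(m):         # label one connected component at a time
        if seen[root] or not adj[root]:
            continue
        seen[root] = True
        stack = [root]
        while stack:
            u = stack.pop()
            for v, i in adj[u]:
                if not seen[v]:
                    seen[v] = True
                    g[v] = (i - g[u]) % n  # force g[u] + g[v] = i (mod n)
                    stack.append(v)
                elif (g[u] + g[v]) % n != i:
                    return None            # inconsistent cycle: retry
    return lambda x: (g[h1(x)] + g[h2(x)]) % n

keys = random.sample(range(1, 1 << 30), 500)
mphf = None
while mphf is None:  # acyclic with constant probability for m >= 2.09 n
    mphf = try_build(keys, m=int(2.09 * len(keys)) + 1)
assert sorted(mphf(x) for x in keys) == list(range(len(keys)))
```

The $O(n\log n)$-bit cost mentioned above is visible here: the array $g$ holds $cn$ entries of $\log_2 n$ bits each, and the whole graph must sit in memory during generation, which is precisely what limits this approach at the billion-key scale.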