\vspace{-3mm}
\section{Related work}
\label{sec:relatedprevious-work}
\vspace{-2mm}

\enlargethispage{2\baselineskip}

Czech, Havas and Majewski~\cite{chm97} provide a comprehensive survey of the
most important theoretical and practical results on perfect hashing. In this
section we review some of those results.

Fredman, Koml\'os and Szemer\'edi~\cite{FKS84} showed that it is possible to
construct space-efficient perfect hash functions that can be evaluated in
constant time, with table sizes that are linear in the number of keys:
$m=O(n)$. In their model of computation, an element of the universe~$U$ fits
into one machine word, and arithmetic operations and memory accesses have unit
cost. Randomized algorithms in the FKS model can construct a perfect hash
function in expected time~$O(n)$: this is the case of our algorithm and of the
ones in~\cite{chm92,p99}.

Mehlhorn~\cite{m84} showed that at least $\Omega((1/\ln 2)n + \ln\ln u)$ bits
are required to represent a MPHF (i.e., at least $1.4427$ bits per key must be
stored). To the best of our knowledge, our algorithm is the first one capable
of generating MPHFs for sets on the order of a billion keys, and the generated
functions require less than 9 bits per key to be stored. This increases by one
order of magnitude the size of the largest key set for which a MPHF has been
obtained in the literature~\cite{bkz05}.

Some work on minimal perfect hashing has been done under the assumption that
the algorithm can pick and store truly random
functions~\cite{bkz05,chm92,p99}. Since the space requirements of truly random
functions make them unsuitable for implementation, one has to settle for
pseudo-random functions in practice. Empirical studies show that limited
randomness properties are often as good as total randomness. We verified this
phenomenon in our experiments by using the universal hash function proposed by
Jenkins~\cite{j97}, which is efficient at retrieval time and requires just one
integer to be stored as a random seed (the function is completely determined
by the seed).
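For illustration only, the sketch below (in C) shows the interface such a
seeded hash function provides: the function is completely determined by a
single integer seed, so picking a ``random'' function amounts to drawing a new
seed, and only that seed has to be stored. The name \texttt{seeded\_hash} and
the simple multiplicative mixing are ours and merely stand in for Jenkins'
actual function.

\begin{verbatim}
#include <stdint.h>

/* Sketch of a seeded hash mapping a key to {0, ..., m-1}.  The mixing
 * step is a simple stand-in for Jenkins' function; the function is
 * fully determined by `seed', which is all that has to be stored. */
static uint32_t seeded_hash(const char *key, uint32_t len,
                            uint32_t seed, uint32_t m)
{
    uint32_t h = seed;
    for (uint32_t i = 0; i < len; i++) {
        h ^= (uint8_t) key[i];   /* mix in the next key byte        */
        h *= 2654435761u;        /* golden-ratio multiplicative mix */
    }
    return h % m;                /* map into a table of size m      */
}
\end{verbatim}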
Pagh~\cite{p99} proposed a family of randomized algorithms for constructing
MPHFs where the form of the resulting function is
$h(x) = (f(x) + d[g(x)]) \bmod n$, where $f$ and $g$ are universal hash
functions and $d$ is a set of displacement values used to resolve the
collisions caused by the function $f$. Pagh identified a set of conditions
concerning $f$ and $g$ and showed that, if these conditions are satisfied,
then a minimal perfect hash function can be computed in expected time $O(n)$
and stored in $(2+\epsilon)n\log_2n$ bits (a sketch of how a function of this
form is evaluated is given at the end of this section). Dietzfelbinger and
Hagerup~\cite{dh01} improved the result in~\cite{p99}, reducing from
$(2+\epsilon)n\log_2n$ to $(1+\epsilon)n\log_2n$ the number of bits required
to store the function, but in their approach~$f$ and~$g$ must be chosen from a
class of hash functions that meet additional requirements.

Fox et al.~\cite{fch92,fhcd92} studied MPHFs that bring the storage
requirements down to between 2 and 4 bits per key. However, it is shown
in~\cite[Section 6.7]{chm97} that their algorithms have exponential running
times, and in our implementation the algorithm did not scale to sets larger
than 11 million keys.

Our previous work~\cite{bkz05} improves the algorithm by Czech, Havas and
Majewski~\cite{chm92}: we obtained more compact functions in less time.
Although the algorithm in~\cite{bkz05} is the fastest algorithm we know of,
the resulting functions are stored in $O(n\log n)$ bits, and one needs to keep
in main memory at generation time a random graph of $n$ edges and $cn$
vertices, where $c\in[0.93,1.15]$. Using the well-known divide-and-conquer
approach, we use that algorithm as a building block for the new one, whose
resulting functions are stored in $O(n)$ bits.
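As mentioned above in connection with Pagh's construction, a function of the
form $h(x) = (f(x) + d[g(x)]) \bmod n$ is evaluated in constant time once its
description (the seeds determining $f$ and $g$ plus the displacement table
$d$) has been generated. The sketch below illustrates this evaluation; the
structure layout and names are ours, it assumes the \texttt{seeded\_hash}
sketch given earlier, and it is not Pagh's implementation.

\begin{verbatim}
#include <stdint.h>

/* Illustrative description of h(x) = (f(x) + d[g(x)]) mod n, where f
 * and g are seeded hash functions (see the earlier sketch) and d holds
 * the displacement values that resolve the collisions caused by f. */
typedef struct {
    uint32_t  f_seed, g_seed; /* seeds determining f and g        */
    uint32_t  n;              /* number of keys                   */
    uint32_t  r;              /* size of the displacement table d */
    uint32_t *d;              /* displacement values              */
} displacement_mphf;

static uint32_t mphf_eval(const displacement_mphf *mphf,
                          const char *key, uint32_t len)
{
    uint32_t f = seeded_hash(key, len, mphf->f_seed, mphf->n);
    uint32_t g = seeded_hash(key, len, mphf->g_seed, mphf->r);
    /* h(x) = (f(x) + d[g(x)]) mod n, using a 64-bit sum to avoid
     * overflow in the addition. */
    return (uint32_t)(((uint64_t) f + mphf->d[g]) % mphf->n);
}
\end{verbatim}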