
\section{Introduction}
\label{sec:introduction}

Suppose~$U$ is a universe of \textit{keys}.
Let $h:U\to M$ be a {\em hash function} that maps the keys from~$U$
to a given interval of integers $M=[0,m-1]=\{0,1,\dots,m-1\}$.
Let~$S\subseteq U$ be a set of~$n$ keys from~$U$.
Given a key~$x\in S$, the hash function~$h$ computes an integer in
$[0,m-1]$ for the storage or retrieval of~$x$ in a {\em hash table}.
Hashing methods for {\em non-static sets} of keys can be used to construct
data structures storing $S$ and supporting membership queries
``$x \in S$?'' in expected time $O(1)$.
However, they involve a certain amount of wasted space owing to unused
locations in the table and wasted time to resolve collisions when
two keys are hashed to the same table location.

For {\em static sets} of keys it is possible to compute a function
to find any key in a table in one probe; such hash functions are called
\textit{perfect}.
Given a set of keys~$S$, we shall say that a hash function~$h:U\to M$ is a
\textit{perfect hash function} for~$S$ if~$h$ is an injection on~$S$,
that is, there are no \textit{collisions} among the keys in~$S$: if~$x$
and~$y$ are in~$S$ and~$x\neq y$, then~$h(x)\neq h(y)$.
Figure~\ref{fig:minimalperfecthash-ph-mph}(a) illustrates a perfect hash
function.
Since no collisions occur, each key can be retrieved from the table
with a single probe.
If~$m=n$, that is, the table has the same size as~$S$,
then~$h$ is a \textit{minimal perfect hash function} for~$S$.
Figure~\ref{fig:minimalperfecthash-ph-mph}(b) illustrates
a~minimal perfect hash function.
Minimal perfect hash functions totally avoid the problem of wasted
space and time.
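
As a small illustration of these definitions (the keys and functions below are
chosen only for the example and are not produced by our algorithm),
take~$U$ to be the non-negative integers and~$S=\{3,7,12,18\}$, so that~$n=4$.
The function~$h_1(x)=x \bmod 7$ maps~$U$ to~$[0,6]$ and gives
\[
  h_1(3)=3,\quad h_1(7)=0,\quad h_1(12)=5,\quad h_1(18)=4,
\]
so~$h_1$ is a perfect hash function for~$S$, but it is not minimal,
since~$m=7>n$ and three table locations remain unused.
The function~$h_2(x)=\lfloor x/5\rfloor \bmod 4$ maps~$U$ to~$[0,3]$ and gives
\[
  h_2(3)=0,\quad h_2(7)=1,\quad h_2(12)=2,\quad h_2(18)=3,
\]
so~$h_2$ is a minimal perfect hash function for~$S$ with~$m=n=4$.
By contrast, $h_3(x)=x \bmod 4$ is not perfect for~$S$,
because~$h_3(3)=h_3(7)=3$ is a collision.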

\begin{figure*}
\centering
  \includegraphics{figs/minimalperfecthash-ph-mph.ps}
\caption{(a) Perfect hash function\quad  (b) Minimal perfect hash function}
\label{fig:minimalperfecthash-ph-mph}
\end{figure*}

Minimal perfect hash functions are widely used for memory-efficient
storage
and fast retrieval of items from static sets, such as words in natural
languages, reserved words in programming languages or interactive systems,
uniform resource locators (URLs) in Web search engines, or item sets in
data mining techniques.

The aim of this paper is to describe a new way of constructing minimal perfect
hash functions.  Our algorithm shares several features with the one due to
Czech, Havas and Majewski~\cite{chm92}.  In particular, our algorithm is also
based on the generation of random graphs~$G=(V,E)$, where~$E$ is in one-to-one
correspondence with the key set~$S$ for which we wish to generate the hash
function.
The two main differences between our algorithm and theirs
are as follows:
(\textit{i})~we generate random graphs
$G = (V, E)$ with $|V|=cn$ and $|E|=|S|=n$, where~$c=1.15$, and hence~$G$
contains cycles with high probability,
while they generate \textit{acyclic} random graphs
$G = (V, E)$ with $|E|=|S|=n$ and a greater number of vertices,
namely $|V|=cn$ with~$c\ge2.09$;
(\textit{ii})~they generate order-preserving minimal perfect hash functions
while our algorithm does not preserve order (a perfect hash function $h$ is
\textit{order preserving} if the keys in~$S$ are arranged in some given order
and~$h$ preserves this order in the hash table).  Thus, our algorithm improves
the space requirement at the expense of generating functions that are not
order preserving.
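
To make difference~(\textit{i}) concrete, the following back-of-the-envelope
calculation uses only the constants quoted above; the value~$n=10^6$ is an
arbitrary choice made for illustration.
Our graphs have
\[
  |V| = 1.15n = 1.15\times 10^{6}
\]
vertices, whereas the acyclic graphs of~\cite{chm92} need
\[
  |V| \ge 2.09n = 2.09\times 10^{6}
\]
vertices, so that our vertex set is smaller by a factor of
about~$1.15/2.09\approx 0.55$.
Since~$|E|=n$ and~$|V|=cn$, the average vertex degree is
\[
  \frac{2|E|}{|V|} = \frac{2}{c},
\]
which is approximately~$1.74$ for~$c=1.15$ and at most
approximately~$0.96$ for~$c\ge 2.09$.
This is in line with the observation above that our denser graphs contain
cycles with high probability.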

Our algorithm is efficient and may be tuned to yield a function~$h$
with a very economical description.
Like the algorithm in~\cite{chm92}, our algorithm produces~$h$
in~$O(n)$ expected time for a set of~$n$ keys.
The description of~$h$ requires~$1.15n$ computer words,
and evaluating~$h(x)$
requires two accesses to an array of~$1.15n$ integers.
We further derive a heuristic that improves the space requirement
from~$1.15n$ words down to~$0.93n$ words.
Our scheme is very practical: to generate a minimal perfect hash function for
a collection of 100~million uniform resource locators (URLs), each 63 bytes
long on average, our algorithm running on a commodity PC takes 811 seconds on
average.
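
For a rough sense of scale, these figures can be combined as follows.
The word size of 4~bytes used here is an assumption made only for this
illustration; the remaining numbers are those quoted above.
With~$n=10^{8}$ keys, the description of~$h$ occupies about
\[
  1.15n \times 4 = 4.6\times 10^{8} \mbox{ bytes}
  \qquad\mbox{or}\qquad
  0.93n \times 4 = 3.72\times 10^{8} \mbox{ bytes}
\]
with the heuristic, that is, $1.15\times 32 = 36.8$ and
$0.93\times 32 = 29.76$ bits per key, respectively.
For comparison, the keys themselves occupy about
$63 \times 10^{8} = 6.3\times 10^{9}$ bytes, and the reported construction
time corresponds to
\[
  \frac{811 \mbox{ s}}{10^{8} \mbox{ keys}} = 8.11\times 10^{-6} \mbox{ s per key}.
\]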