turbonss/tex/chd/introduction.tex
2009-06-12 21:49:26 -03:00

\section{Introduction} \label{sec:introduction}
The important performance parameters of a PHF are representation size, evaluation time and construction time. The representation size plays an important role when the whole function fits in a faster memory and the actual data is stored in a slower memory. For instance, a compact PHF can fit entirely in a CPU cache, which makes its evaluation very fast because cache misses are avoided. The CHD algorithm plays an important role in this context. It was designed by Djamal Belazzougui, Fabiano C. Botelho, and Martin Dietzfelbinger \cite{bbd09}.
The CHD algorithm makes it possible to obtain PHFs with representation size very close to optimal while retaining $O(n)$ construction time and $O(1)$ evaluation time. For example, in the case $m=2n$ we obtain a PHF that uses $0.67$ bits per key, and for $m=1.23n$ we obtain $1.4$ bits per key, which was not achievable with previously known methods. The CHD algorithm is inspired by several known algorithms;
the main new feature is that it combines a modification of Pagh's ``hash-and-displace'' approach
with data compression on a sequence of hash function indices.
That combination makes it possible to significantly reduce space usage
while retaining linear construction time and constant query time.
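As a rough sketch of this idea, with illustrative notation that is assumed here (see \cite{bbd09} for the actual construction), suppose a first-level hash function $g$ splits the key set $S$ into buckets $B_0,\dots,B_{r-1}$, and let $\phi_0,\phi_1,\phi_2,\dots$ be a fixed sequence of hash functions with range $\{0,\dots,m-1\}$. For each bucket $B_i$ the construction searches for an index $\sigma(i)$ such that $\phi_{\sigma(i)}$ maps the keys of $B_i$ to distinct positions that are still free. The resulting function is then evaluated as
\[
  f(x) = \phi_{\sigma(g(x))}(x),
\]
so only the index sequence $\sigma(0),\dots,\sigma(r-1)$ has to be stored, and it is this sequence that is compressed.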
The CHD algorithm can also be used for $k$-perfect hashing,
where at most $k$ keys may be mapped to the same value.
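Stated as a formula, with $S$ denoting the key set and $h : S \to \{0,\dots,m-1\}$ the hash function (notation assumed here for illustration), $h$ is $k$-perfect on $S$ if
\[
  |\{x \in S : h(x) = i\}| \le k \quad \mbox{for all } i \in \{0,\dots,m-1\};
\]
the case $k=1$ is an ordinary PHF.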
For the analysis we assume that fully random hash functions are given for free;
such assumptions can be justified and were made in previous papers.
The compact PHFs generated by the CHD algorithm can be used in many applications in which we want to assign a unique identifier to each key without storing any information about the key. One of the most obvious applications of those functions
(or $k$-perfect hash functions) is when we have a small fast memory in which we can store the perfect hash function while the keys and associated satellite data are stored in slower but larger memory.
The size of a block or a transfer unit may be chosen so that $k$ data items can be retrieved in
one read access. In this case we can ensure that data associated with a key can be retrieved in a single probe to slower memory. This has been used for example in hardware routers~\cite{pb06}.
% Perfect hashing has also been found to be competitive with traditional hashing in internal memory~\cite{blmz08} on standard computers. Recently perfect hashing has been used to accelerate algorithms on graphs~\cite{ESS08} when the graph representation does not fit in main memory.
The CHD algorithm generates the most compact PHFs and MPHFs we know of in~$O(n)$ time.
The time required to evaluate the generated functions is constant (in practice less than $1.4$ microseconds).
The storage space of the resulting PHFs and MPHFs is within a factor of $1.43$ of the
information-theoretic lower bound.
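To put this factor in concrete terms, a rough calculation for the minimal case ($m=n$) may help. It relies on the standard asymptotic lower bound of $\log_2 e$ bits per key for MPHFs, which is not stated above and is used here as an assumption:
\[
  1.43 \cdot \log_2 e \approx 1.43 \cdot 1.4427 \approx 2.06 \mbox{ bits per key}.
\]
For PHFs with $m>n$ the lower bound itself is smaller, so this figure applies only to the minimal case.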
The closest competitor is the algorithm by Dietzfelbinger and Pagh \cite{dp08}, but
their algorithm does not run in linear time.
Furthermore, the CHD algorithm
can be tuned to run faster than the BPZ algorithm \cite{bpz07} (the fastest algorithm
available in the literature so far) and to obtain more compact functions.
Its most notable characteristic is that it can, in principle,
approach the information-theoretic lower bound while remaining practical.
A detailed description of the CHD algorithm can be found in \cite{bbd09}.