CMPH - C Minimal Perfect Hashing Library

-------------------------------------------------------------------

Description

C Minimal Perfect Hashing Library is a portable LGPLed library for creating
and working with minimal perfect hashing functions. The cmph library
encapsulates the newest and most efficient algorithms available in the
literature in an easy-to-use, production-quality and fast API. The library is
designed to work with large key sets that cannot fit in main memory. It has
been used successfully to construct minimal perfect hashing functions for sets
with more than 100 million keys. Although there is a lack of similar libraries
in the free software world (gperf is a bit different, see gperf.html), we can
point out some of the distinguishing features of cmph:

- Fast.
- Space-efficient, with main memory usage carefully documented.
- The best modern algorithms are available (or at least scheduled for
  implementation :-)).
- Works with on-disk key sets through the adapter pattern.
- Serialization of hash functions.
- Portable C code (currently works on GNU/Linux and WIN32).
- Object-oriented implementation.
- Easily extensible.
- Well-encapsulated API aiming at binary compatibility across releases.
- Free Software.

----------------------------------------

Supported Algorithms

- BMZ Algorithm. A very fast algorithm based on cyclic random graphs to
  construct minimal perfect hash functions in linear time. The resulting
  functions are not order preserving and can be stored in only 4cn bytes,
  where c is between 0.93 and 1.15.
- CHM Algorithm. An algorithm based on acyclic random graphs to construct
  minimal perfect hash functions in linear time. The resulting functions are
  order preserving and are stored in 4cn bytes, where c is greater than 2.

----------------------------------------

News for version 0.3

- A new heuristic added to the bmz algorithm makes it possible to generate a
  minimal perfect hash function (mphf) using only 24.61*n + O(1) bytes.
  The resulting function can be stored in 3.72*n bytes. See bmz.html for
  details.

----------------------------------------

Examples

Using cmph is quite simple. Take a look.

    // Create minimal perfect hash function from in-memory vector
    #include <cmph.h>
    ...

    const char **vector;
    unsigned int nkeys;
    // Fill vector
    // ...

    // Create minimal perfect hashing function using the default (chm) algorithm.
    cmph_config_t *config = cmph_config_new(cmph_io_vector_adapter(vector, nkeys));
    cmph_t *hash = cmph_new(config);
    cmph_config_destroy(config);

    // Find key
    const char *key = "sample key";
    unsigned int id = cmph_search(hash, key);

    // Destroy hash
    cmph_destroy(hash);

-------------------------------

    // Create minimal perfect hash function from on-disk keys using BMZ algorithm
    #include <stdio.h>
    #include <cmph.h>
    ...

    // Open file with newline-separated list of keys
    FILE *fd = fopen("keysfile_newline_separated", "r");
    // Check for errors
    // ...

    cmph_config_t *config = cmph_config_new(cmph_io_nlfile_adapter(fd));
    cmph_config_set_algo(config, CMPH_BMZ);
    cmph_t *hash = cmph_new(config);
    cmph_config_destroy(config);
    fclose(fd);

    // Find key
    const char *key = "sample key";
    unsigned int id = cmph_search(hash, key);

    // Destroy hash
    cmph_destroy(hash);

--------------------------------------

The cmph application

cmph is the name of both the library and the utility application that comes
with this package. You can use the cmph application to construct minimal
perfect hashing functions from the command line. The cmph utility comes with a
number of flags, but creating and querying minimal perfect hashing functions
is very simple:

    $ # Using the chm algorithm (default one) for constructing a mphf for keys in file keys_file
    $ ./cmph -g keys_file
    $ # Query id of keys in the file keys_query
    $ ./cmph -m keys_file.mph keys_query

The additional options let you set most of the parameters available through
the C API. Below you can see the full help message for the utility.

    usage: cmph [-v] [-h] [-V] [-k nkeys] [-f hash_function] [-g [-c value][-s seed] ] [-m file.mph] [-a algorithm] keysfile
    Minimum perfect hashing tool

      -h        print this help message
      -c        c value that determines the number of vertices in the graph
      -a        algorithm - valid values are
                 * bmz
                 * chm
      -f        hash function (may be used multiple times) - valid values are
                 * djb2
                 * fnv
                 * glib
                 * jenkins
                 * pjw
                 * sdbm
      -V        print version number and exit
      -v        increase verbosity (may be used multiple times)
      -k        number of keys
      -g        generation mode
      -s        random seed
      -m        minimum perfect hash function file
      keysfile  line separated file with keys

Additional Documentation

FAQ (faq.html)

Downloads

Use the project page at sourceforge: http://sf.net/projects/cmph

License Stuff

Code is under the LGPL.

----------------------------------------

Enjoy!

Davi de Castro Reis
Fabiano Cupertino Botelho

Last Updated: Thu Jan 27 10:54:36 2005