external memory based algorithm documentation added
parent
27cd2b7978
commit
bc1dac6891
|
@ -0,0 +1,6 @@
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
----------------------------------------
|
||||||
|
| [Home index.html] | [CHM chm.html] | [BMZ bmz.html] | [External Memory Based Algorithm brz.html]
|
||||||
|
----------------------------------------
|
6
BMZ.t2t
6
BMZ.t2t
|
@ -395,11 +395,9 @@ Again we have:
|
||||||
|
|
||||||
+ [F. C. Botelho http://www.dcc.ufmg.br/~fbotelho], D. Menoti, [N. Ziviani http://www.dcc.ufmg.br/~nivio]. [A New algorithm for constructing minimal perfect hash functions papers/bmz_tr004_04.ps], Technical Report TR004/04, Department of Computer Science, Federal University of Minas Gerais, 2004.
|
+ [F. C. Botelho http://www.dcc.ufmg.br/~fbotelho], D. Menoti, [N. Ziviani http://www.dcc.ufmg.br/~nivio]. [A New algorithm for constructing minimal perfect hash functions papers/bmz_tr004_04.ps], Technical Report TR004/04, Department of Computer Science, Federal University of Minas Gerais, 2004.
|
||||||
|
|
||||||
+ [F. C. Botelho http://www.dcc.ufmg.br/~fbotelho], Y. Kohayakawa, and [N. Ziviani http://www.dcc.ufmg.br/~nivio]. [A Practical Minimal Perfect Hashing Method papers/bmz_wea2005.ps] (Submitted).
|
+ [F. C. Botelho http://www.dcc.ufmg.br/~fbotelho], Y. Kohayakawa, and [N. Ziviani http://www.dcc.ufmg.br/~nivio]. [A Practical Minimal Perfect Hashing Method papers/wea05.pdf]. //4th International Workshop on efficient and Experimental Algorithms (WEA05),// Springer-Verlag Lecture Notes in Computer Science, vol. 3505, Santorini Island, Greece, May 2005, 488-500.
|
||||||
|
|
||||||
|
|
||||||
----------------------------------------
|
%!include: ALGORITHMS.t2t
|
||||||
| [Home index.html] | [CHM chm.html] | [BMZ bmz.html]
|
|
||||||
----------------------------------------
|
|
||||||
|
|
||||||
%!include: FOOTER.t2t
|
%!include: FOOTER.t2t
|
||||||
|
|
|
@ -0,0 +1,323 @@
|
||||||
|
External Memory Based Algorithm
|
||||||
|
|
||||||
|
|
||||||
|
%!includeconf: CONFIG.t2t
|
||||||
|
|
||||||
|
----------------------------------------
|
||||||
|
==Introduction==
|
||||||
|
|
||||||
|
Until now, because of the limitations of current algorithms,
the use of minimal perfect hash functions (MPHFs) has been restricted to scenarios
where the set of keys being hashed is relatively small.
However, in many cases it is crucial to deal efficiently with very large
sets of keys.
Due to the exponential growth of the Web, working with huge collections is becoming
a daily task.
For instance, simply assigning numeric identifiers to the web pages of a collection
can be a challenging task.
While traditional databases simply cannot handle more traffic once the working
set of URLs no longer fits in main memory [[4 #papers]], the algorithm we propose here to
construct MPHFs can easily scale to billions of entries.
|
||||||
|
|
||||||
|
As there are many applications for MPHFs, it is
important to design and implement space- and time-efficient algorithms for
constructing such functions.
The attractiveness of using MPHFs depends on the following issues:

+ The amount of CPU time required by the algorithms for constructing MPHFs.
+ The space requirements of the algorithms for constructing MPHFs.
+ The amount of CPU time required by an MPHF for each retrieval.
+ The space requirements of the description of the resulting MPHFs to be used at retrieval time.
|
||||||
|
|
||||||
|
|
||||||
|
We present here a novel external memory based algorithm for constructing MPHFs that
is very efficient with respect to the four requirements mentioned above.
First, the time to construct an MPHF is linear in the size of the key set,
which is optimal.
For instance, for a collection of 1 billion URLs
collected from the Web, each one 64 characters long on average, the time to construct an
MPHF using a 2.4 gigahertz PC with 500 megabytes of available main memory
is approximately 3 hours.
Second, the algorithm needs a small vector of [figs/brz/img23.png] one-byte
entries, whose size is defined a priori, in main memory to construct an MPHF.
For the collection of 1 billion URLs and using [figs/brz/img4.png], the algorithm needs only
5.45 megabytes of internal memory.
Third, the evaluation of the MPHF for each retrieval requires three memory accesses and
the computation of three universal hash functions.
This is not optimal, as any MPHF requires at least one memory access and the computation
of two universal hash functions.
Fourth, the description of an MPHF takes a constant number of bits for each key, which is optimal.
For the collection of 1 billion URLs, it needs 8.1 bits for each key,
while the theoretical lower bound is [figs/brz/img24.png] bits per key.
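
The figures quoted above can be sanity-checked with a little arithmetic. The sketch below is not part of the algorithm. It assumes that the in-memory vector has ceil(n/b) one-byte entries and that b = 175, a value that is not stated in the text but reproduces the quoted 5.45 megabytes. It also takes the lower bound to be 1/ln 2 (about 1.44) bits per key, which is the standard information-theoretic bound for MPHFs and is assumed to be what the figure shows.

```python
import math

n = 1_000_000_000   # number of keys, from the text
b = 175             # ASSUMPTION: bucket parameter, not stated in the text

# Assumed number of one-byte entries in the in-memory vector: ceil(n / b).
num_buckets = (n + b - 1) // b
size_vector_mib = num_buckets / 2**20   # one byte per entry, in MiB

# Space taken by the description of the MPHF at 8.1 bits per key.
bits_per_key = 8.1
description_gib = n * bits_per_key / 8 / 2**30

# Information-theoretic lower bound: log2(e) = 1 / ln 2 bits per key.
lower_bound_bits = 1 / math.log(2)

print(f"entries in the vector: {num_buckets}")
print(f"vector size: {size_vector_mib:.2f} MiB")
print(f"MPHF description: {description_gib:.2f} GiB")
print(f"lower bound: {lower_bound_bits:.4f} bits per key")
```

Under these assumptions the description of the function for 1 billion keys takes a little under 1 gigabyte, while the memory needed during construction stays at a few megabytes.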
|
||||||
|
|
||||||
|
----------------------------------------
|
||||||
|
|
||||||
|
|
||||||
|
==The Algorithm==
|
||||||
|
|
||||||
|
The main idea supporting our algorithm is the classical divide and conquer technique.
|
||||||
|
The algorithm is a two-step external memory based algorithm
|
||||||
|
that generates a MPHF //h// for a set //S// of //n// keys.
|
||||||
|
Figure 1 illustrates the two steps of the
|
||||||
|
algorithm: the partitioning step and the searching step.
|
||||||
|
|
||||||
|
| [figs/brz/brz.png]
|
||||||
|
| **Figure 1:** Main steps of our algorithm.
|
||||||
|
|
||||||
|
The partitioning step takes a key set //S// and uses a universal hash
|
||||||
|
function [figs/brz/img42.png] proposed by Jenkins[[5 #papers]]
|
||||||
|
to transform each key [figs/brz/img43.png] into an integer [figs/brz/img44.png].
|
||||||
|
Reducing [figs/brz/img44.png] modulo [figs/brz/img23.png], we partition //S//
into [figs/brz/img23.png] buckets, each containing at most 256 keys (with high
probability).
|
||||||
|
|
||||||
|
The searching step generates a MPHF[figs/brz/img46.png] for each bucket //i//, [figs/brz/img47.png].
|
||||||
|
The resulting MPHF //h(k)//, [figs/brz/img43.png], is given by
|
||||||
|
|
||||||
|
| [figs/brz/img49.png]
|
||||||
|
|
||||||
|
where [figs/brz/img50.png].
The //i//th entry //offset[i]// of the displacement vector
//offset//, [figs/brz/img47.png], contains the total number
of keys in the buckets from 0 to //i-1//, that is, it gives the starting position of the
interval of the hash table addressed by the MPHF[figs/brz/img46.png]. In the following we explain
each step in detail.
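
A minimal sketch of how the pieces fit together, assuming the formula above has the form h(k) = MPHF_i(k) + offset[i], with i equal to the first-level hash of k reduced modulo the number of buckets. Everything in it is a stand-in chosen for illustration: zlib.crc32 replaces the Jenkins function, a per-bucket dictionary replaces the real MPHF of each bucket, all keys are held in memory, and the names build and h are hypothetical.

```python
import zlib


def h0(key: bytes) -> int:
    # Stand-in for the first-level hash function (NOT the Jenkins function).
    return zlib.crc32(key)


def build(keys, num_buckets):
    # Partitioning: send each key to bucket h0(key) mod num_buckets.
    buckets = [[] for _ in range(num_buckets)]
    for k in keys:
        buckets[h0(k) % num_buckets].append(k)
    # Searching: one "MPHF" per bucket. A dict that maps each key to its rank
    # inside the bucket stands in for the real per-bucket function.
    mphf = [{k: rank for rank, k in enumerate(bucket)} for bucket in buckets]
    # offset[i] = number of keys in buckets 0 .. i-1.
    offset = [0] * num_buckets
    for i in range(1, num_buckets):
        offset[i] = offset[i - 1] + len(buckets[i - 1])
    return mphf, offset


def h(key, mphf, offset):
    # Final function: value inside the bucket plus the bucket's displacement.
    i = h0(key) % len(offset)
    return mphf[i][key] + offset[i]


keys = [f"http://example.org/page{j}".encode() for j in range(1000)]
mphf, offset = build(keys, 8)
values = sorted(h(k, mphf, offset) for k in keys)
print(values == list(range(len(keys))))
```

Because each bucket maps its keys onto a contiguous range and the displacements make those ranges adjacent, the combined function is a bijection onto 0 .. n-1 regardless of how the first-level hash distributes the keys.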
|
||||||
|
|
||||||
|
----------------------------------------
|
||||||
|
|
||||||
|
=== Partitioning step ===
|
||||||
|
|
||||||
|
The set //S// of //n// keys is partitioned into [figs/brz/img23.png] buckets,
where //b// is a suitable parameter chosen to guarantee
that each bucket has at most 256 keys with high probability
(see [[2 #papers]] for details).
|
||||||
|
The partitioning step works as follows:
|
||||||
|
|
||||||
|
| [figs/brz/img54.png]
|
||||||
|
| **Figure 2:** Partitioning step.
|
||||||
|
|
||||||
|
Statement 1.1 of the **for** loop presented in Figure 2
|
||||||
|
reads sequentially all the keys of block [figs/brz/img55.png] from disk into an internal area
|
||||||
|
of size [figs/brz/img8.png].
|
||||||
|
|
||||||
|
Statement 1.2 performs an indirect bucket sort of the keys in block [figs/brz/img55.png] and
|
||||||
|
at the same time updates the entries in the vector //size//.
|
||||||
|
Let us briefly describe how [figs/brz/img55.png] is partitioned among
|
||||||
|
the [figs/brz/img23.png] buckets.
|
||||||
|
We use a local array of [figs/brz/img23.png] counters to store a
|
||||||
|
count of how many keys from [figs/brz/img55.png] belong to each bucket.
|
||||||
|
The pointers to the keys in each bucket //i//, [figs/brz/img47.png],
|
||||||
|
are stored in contiguous positions in an array.
|
||||||
|
For this we first reserve the required number of entries
|
||||||
|
in this array of pointers using the information from the array of counters.
|
||||||
|
Next, we place the pointers to the keys in each bucket into the respective
|
||||||
|
reserved areas in the array (i.e., we place the pointers to the keys in bucket 0,
|
||||||
|
followed by the pointers to the keys in bucket 1, and so on).
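
The indirect bucket sort described above can be sketched as a counting sort over pointers. This is an illustration under stated assumptions: list indices play the role of pointers to keys, zlib.crc32 stands in for the hash function used to compute bucket addresses, and the names bucket_of and indirect_bucket_sort are hypothetical.

```python
import zlib


def bucket_of(key: bytes, num_buckets: int) -> int:
    # Stand-in bucket address (the text uses the Jenkins function instead).
    return zlib.crc32(key) % num_buckets


def indirect_bucket_sort(block, num_buckets):
    # Pass 1: count how many keys of the block fall into each bucket.
    addresses = [bucket_of(k, num_buckets) for k in block]
    counts = [0] * num_buckets
    for i in addresses:
        counts[i] += 1
    # Reserve a contiguous area per bucket in the array of pointers.
    start = [0] * num_buckets
    for i in range(1, num_buckets):
        start[i] = start[i - 1] + counts[i - 1]
    # Pass 2: place each pointer into the area reserved for its bucket.
    pointers = [0] * len(block)
    next_free = list(start)
    for p, i in enumerate(addresses):
        pointers[next_free[i]] = p
        next_free[i] += 1
    return pointers, counts


block = [f"key{j}".encode() for j in range(200)]
pointers, counts = indirect_bucket_sort(block, 16)
print(counts)
```

The keys themselves are never moved; only the pointers are rearranged, and the per-block counts are what would be added to the vector size.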
|
||||||
|
|
||||||
|
To find the bucket address of a given key
|
||||||
|
we use the universal hash function [figs/brz/img44.png][[5 #papers]].
|
||||||
|
Key //k// goes into bucket //i//, where
|
||||||
|
|
||||||
|
| [figs/brz/img57.png] (1)
|
||||||
|
|
||||||
|
Figure 3(a) shows a //logical// view of the [figs/brz/img23.png] buckets
|
||||||
|
generated in the partitioning step.
|
||||||
|
In reality, the keys belonging to each bucket are distributed among many files,
|
||||||
|
as depicted in Figure 3(b).
|
||||||
|
In the example of Figure 3(b), the keys in bucket 0
|
||||||
|
appear in files 1 and //N//, the keys in bucket 1 appear in files 1, 2
|
||||||
|
and //N//, and so on.
|
||||||
|
|
||||||
|
| [figs/brz/brz-partitioning.png]
|
||||||
|
| **Figure 3:** Situation of the buckets at the end of the partitioning step: (a) Logical view (b) Physical view.
|
||||||
|
|
||||||
|
This scattering of the keys in the buckets could generate a performance
problem because of the potential number of seeks
needed to read the keys in each bucket from the //N// files on disk
during the searching step.
However, as we show in [[2 #papers]], the number of seeks
can be kept small using buffering techniques.
Since only the vector //size//, which has [figs/brz/img23.png] one-byte
entries (remember that each bucket has at most 256 keys),
must be maintained in main memory during the searching step,
almost all main memory is available to be used as a disk I/O buffer.
|
||||||
|
|
||||||
|
The last step is to compute the //offset// vector and dump it to the disk.
|
||||||
|
We use the vector //size// to compute the
|
||||||
|
//offset// displacement vector.
|
||||||
|
The //offset[i]// entry contains the number of keys
|
||||||
|
in the buckets //0, 1, ..., i-1//.
|
||||||
|
As //size[i]// stores the number of keys
|
||||||
|
in bucket //i//, where [figs/brz/img47.png], we have
|
||||||
|
|
||||||
|
| [figs/brz/img63.png]
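
Assuming the formula above is the usual prefix sum, that is, offset[0] = 0 and offset[i] = size[0] + ... + size[i-1], the displacement vector can be computed in one pass over size. The function name compute_offset and the sample values are hypothetical.

```python
def compute_offset(size):
    # offset[i] holds the number of keys in buckets 0 .. i-1.
    offset = []
    total = 0
    for s in size:
        offset.append(total)
        total += s
    return offset


size = [3, 0, 2, 5]            # hypothetical bucket sizes
offset = compute_offset(size)  # prefix sums of size

# Bucket i owns the hash table positions offset[i] .. offset[i] + size[i] - 1.
intervals = [range(offset[i], offset[i] + size[i]) for i in range(len(size))]
positions = [p for interval in intervals for p in interval]
print(offset, positions)
```

The intervals are disjoint and adjacent, so together they cover every position from 0 to n-1 exactly once.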
|
||||||
|
|
||||||
|
----------------------------------------
|
||||||
|
|
||||||
|
=== Searching step ===
|
||||||
|
|
||||||
|
The searching step is responsible for generating a MPHF for each
|
||||||
|
bucket. Figure 4 presents the searching step algorithm.
|
||||||
|
|
||||||
|
| [figs/brz/img64.png]
|
||||||
|
| **Figure 4:** Searching step.
|
||||||
|
|
||||||
|
Statement 1 of Figure 4 inserts one key from each file
into a minimum heap //H// of size //N//.
The order relation in //H// is determined by the bucket address //i// defined in
Eq. (1).
|
||||||
|
|
||||||
|
Statement 2 has two important steps.
In statement 2.1, a bucket is read from disk,
as described in the section on reading a bucket from disk.
In statement 2.2, an MPHF is generated for each bucket //i//, as described
in the section on generating the MPHF of each bucket.
The description of MPHF[figs/brz/img46.png] is a vector [figs/brz/img66.png] of 8-bit integers.
Finally, statement 2.3 writes the description [figs/brz/img66.png] of MPHF[figs/brz/img46.png] to disk.
|
||||||
|
|
||||||
|
----------------------------------------
|
||||||
|
|
||||||
|
==== Reading a bucket from disk ====
|
||||||
|
|
||||||
|
In this section we present the refinement of statement 2.1 of
|
||||||
|
Figure 4.
|
||||||
|
The algorithm to read bucket //i// from disk is presented
|
||||||
|
in Figure 5.
|
||||||
|
|
||||||
|
| [figs/brz/img67.png]
|
||||||
|
| **Figure 5:** Reading a bucket.
|
||||||
|
|
||||||
|
Bucket //i// is distributed among many files and the heap //H// is used to drive a
|
||||||
|
multiway merge operation.
|
||||||
|
In Figure 5, statement 1.1 extracts and removes triple
|
||||||
|
//(i, j, k)// from //H//, where //i// is a minimum value in //H//.
|
||||||
|
Statement 1.2 inserts key //k// in bucket //i//.
|
||||||
|
Notice that the //k// in the triple //(i, j, k)// is in fact a pointer to
|
||||||
|
the first byte of the key that is kept in contiguous positions of an array of characters
|
||||||
|
(this array containing the keys is initialized during the heap construction
|
||||||
|
in statement 1 of Figure 4).
|
||||||
|
Statement 1.3 performs a seek operation in File //j// on disk for the first
|
||||||
|
read operation and reads sequentially all keys //k// that have the same //i//
|
||||||
|
and inserts them all in bucket //i//.
|
||||||
|
Finally, statement 1.4 inserts in //H// the triple //(i, j, x)//,
|
||||||
|
where //x// is the first key read from File //j// (in statement 1.3)
|
||||||
|
that does not have the same bucket address as the previous keys.
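
The multiway merge described above can be sketched as follows. This is an in-memory illustration under stated assumptions: each "file" is a Python list of (bucket address, key) pairs already sorted by bucket address, a list index plays the role of the file position, and the function name read_buckets is hypothetical. The comments map each step to the statements of Figure 5 as they are described in the text.

```python
import heapq


def read_buckets(files):
    """Yield (i, keys of bucket i) in increasing order of bucket address."""
    pos = [0] * len(files)  # next unread entry of each file
    heap = []
    # Heap construction: one triple (i, j, k) from each non-empty file j.
    for j, f in enumerate(files):
        if f:
            i, k = f[0]
            heap.append((i, j, k))
            pos[j] = 1
    heapq.heapify(heap)
    while heap:
        current = heap[0][0]
        bucket = []
        while heap and heap[0][0] == current:
            i, j, k = heapq.heappop(heap)  # 1.1: remove a minimum triple
            bucket.append(k)               # 1.2: insert key k in bucket i
            f = files[j]
            # 1.3: read sequentially the keys of file j with the same address.
            while pos[j] < len(f) and f[pos[j]][0] == i:
                bucket.append(f[pos[j]][1])
                pos[j] += 1
            # 1.4: push the first key of file j with a different address.
            if pos[j] < len(f):
                next_i, x = f[pos[j]]
                heapq.heappush(heap, (next_i, j, x))
                pos[j] += 1
        yield current, bucket


files = [
    [(0, b"a"), (1, b"b"), (3, b"c")],
    [(1, b"d"), (2, b"e")],
    [(0, b"f"), (1, b"g"), (1, b"h")],
]
for i, keys in read_buckets(files):
    print(i, sorted(keys))
```

Each file contributes at most one triple to the heap at any time, so the heap never holds more than N entries.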
|
||||||
|
|
||||||
|
The number of seek operations on disk performed in statement 1.3 is discussed
in [[2, Section 5.1 #papers]],
where we present a buffering technique that reduces
the time spent on seeks.
|
||||||
|
|
||||||
|
----------------------------------------
|
||||||
|
|
||||||
|
==== Generating a MPHF for each bucket ====
|
||||||
|
|
||||||
|
To the best of our knowledge, the [BMZ algorithm bmz.html] we designed in
our previous works [[1,3 #papers]] is the fastest published algorithm for
constructing MPHFs.
That is why we use that algorithm as a building block for the
algorithm presented here. More precisely, we use
an optimized version of BMZ (BMZ8) for small sets of keys (at most 256 keys).
[Click here to see details about the BMZ algorithm bmz.html].
|
||||||
|
|
||||||
|
----------------------------------------
|
||||||
|
|
||||||
|
==Analysis of the Algorithm==
|
||||||
|
|
||||||
|
Analytical results and the complete analysis of the external memory based algorithm
|
||||||
|
can be found in [[2 #papers]].
|
||||||
|
|
||||||
|
----------------------------------------
|
||||||
|
|
||||||
|
==Experimental Results==
|
||||||
|
|
||||||
|
In this section we present the experimental results.
|
||||||
|
We start by presenting the experimental setup.
|
||||||
|
We then present experimental results for
|
||||||
|
the internal memory based algorithm ([the BMZ algorithm bmz.html])
|
||||||
|
and for our external memory based algorithm.
|
||||||
|
Finally, we discuss how the amount of internal memory available
|
||||||
|
affects the runtime of the external memory based algorithm.
|
||||||
|
|
||||||
|
----------------------------------------
|
||||||
|
|
||||||
|
===The data and the experimental setup===
|
||||||
|
|
||||||
|
All experiments were carried out on
|
||||||
|
a computer running the Linux operating system, version 2.6,
|
||||||
|
with a 2.4 gigahertz processor and
|
||||||
|
1 gigabyte of main memory.
|
||||||
|
In the experiments related to the new
algorithm we limited the main memory to 500 megabytes.
|
||||||
|
|
||||||
|
Our data consists of a collection of 1 billion
|
||||||
|
URLs collected from the Web, each URL 64 characters long on average.
|
||||||
|
The collection occupies 60.5 gigabytes on disk.
|
||||||
|
|
||||||
|
----------------------------------------
|
||||||
|
|
||||||
|
===Performance of the BMZ Algorithm===
|
||||||
|
|
||||||
|
[The BMZ algorithm bmz.html] is used for constructing a MPHF for each bucket.
|
||||||
|
It is a randomized algorithm because it needs to generate a simple random graph
|
||||||
|
in its first step.
|
||||||
|
Once the graph is obtained the other two steps are deterministic.
|
||||||
|
|
||||||
|
Thus, we can consider the runtime of the algorithm to have
the form [figs/brz/img159.png] for an input of //n// keys,
where [figs/brz/img160.png] is some machine-dependent
constant that further depends on the length of the keys and //Z// is a random
variable with geometric distribution with mean [figs/brz/img162.png]. All results
in our experiments were obtained taking //c=1//; the value of //c//, with //c// in //[0.93,1.15]//,
in fact has little influence on the runtime, as shown in [[3 #papers]].
|
||||||
|
|
||||||
|
The values chosen for //n// were 1, 2, 4, 8, 16 and 32 million.
|
||||||
|
Although we have a dataset with 1 billion URLs, on a PC with
|
||||||
|
1 gigabyte of main memory, the algorithm is able
|
||||||
|
to handle an input with at most 32 million keys.
|
||||||
|
This is mainly because of the graph we need to keep in main memory.
|
||||||
|
The algorithm requires //25n + O(1)// bytes for constructing
|
||||||
|
a MPHF ([click here to get details about the data structures used by the BMZ algorithm bmz.html]).
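
A rough check of the memory figures quoted above, ignoring the O(1) term. The function name bmz_construction_bytes is hypothetical, and the sketch does not account for memory used by the operating system or by the keys themselves.

```python
def bmz_construction_bytes(n: int) -> int:
    # 25n bytes for constructing the function; the O(1) term is ignored.
    return 25 * n


for n in (32_000_000, 64_000_000, 1_000_000_000):
    mib = bmz_construction_bytes(n) / 2**20
    print(f"n = {n}: about {mib:.0f} MiB")
```

With 32 million keys the estimate stays below 1 gigabyte, while doubling the input already exceeds it, which is consistent with the limit reported in the text.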
|
||||||
|
|
||||||
|
In order to estimate the number of trials for each value of //n// we use
|
||||||
|
a statistical method for determining a suitable sample size (see, e.g., [[6, Chapter 13 #papers]]).
|
||||||
|
As we obtained different values for each //n//,
|
||||||
|
we used the maximal value obtained, namely, 300 trials in order to have
|
||||||
|
a confidence level of 95 %.
|
||||||
|
|
||||||
|
|
||||||
|
Table 1 presents the runtime average for each //n//,
|
||||||
|
the respective standard deviations, and
|
||||||
|
the respective confidence intervals given by
|
||||||
|
the average time [figs/brz/img167.png] the distance from average time
|
||||||
|
considering a confidence level of 95 %.
|
||||||
|
Observing the runtime averages one sees that
|
||||||
|
the algorithm runs in expected linear time,
|
||||||
|
as shown in [[3 #papers]].
|
||||||
|
|
||||||
|
%!include(html): ''TABLEBRZ1.t2t''
|
||||||
|
| **Table 1:** Internal memory based algorithm: average time in seconds for constructing a MPHF, the standard deviation (SD), and the confidence intervals considering a confidence level of 95 %.
|
||||||
|
|
||||||
|
----------------------------------------
|
||||||
|
|
||||||
|
==Papers==[papers]
|
||||||
|
|
||||||
|
+ [F. C. Botelho http://www.dcc.ufmg.br/~fbotelho], D. Menoti, [N. Ziviani http://www.dcc.ufmg.br/~nivio]. [A New algorithm for constructing minimal perfect hash functions papers/bmz_tr004_04.ps], Technical Report TR004/04, Department of Computer Science, Federal University of Minas Gerais, 2004.
+ [F. C. Botelho http://www.dcc.ufmg.br/~fbotelho], Y. Kohayakawa, [N. Ziviani http://www.dcc.ufmg.br/~nivio]. [An Approach for Minimal Perfect Hash Functions for Very Large Databases papers/tr06.pdf], Technical Report TR003/06, Department of Computer Science, Federal University of Minas Gerais, 2006.
+ [F. C. Botelho http://www.dcc.ufmg.br/~fbotelho], Y. Kohayakawa, [N. Ziviani http://www.dcc.ufmg.br/~nivio]. [A Practical Minimal Perfect Hashing Method papers/wea05.pdf]. //4th International Workshop on Efficient and Experimental Algorithms (WEA05),// Springer-Verlag Lecture Notes in Computer Science, vol. 3505, Santorini Island, Greece, May 2005, 488-500.
+ [M. Seltzer. Beyond relational databases. ACM Queue, 3(3), April 2005. http://acmqueue.com/modules.php?name=Content&pa=showpage&pid=299]
+ [Bob Jenkins. Algorithm alley: Hash functions. Dr. Dobb's Journal of Software Tools, 22(9), September 1997. http://burtleburtle.net/bob/hash/doobs.html]
+ R. Jain. The art of computer systems performance analysis: techniques for experimental design, measurement, simulation, and modeling. John Wiley, first edition, 1991.
|
||||||
|
|
||||||
|
|
||||||
|
%!include: ALGORITHMS.t2t
|
||||||
|
|
||||||
|
%!include: FOOTER.t2t
|
4
CHM.t2t
4
CHM.t2t
|
@ -81,8 +81,6 @@ Again we have:
|
||||||
The Computer Journal, 39(6):547--554, 1996.
|
The Computer Journal, 39(6):547--554, 1996.
|
||||||
|
|
||||||
|
|
||||||
----------------------------------------
|
%!include: ALGORITHMS.t2t
|
||||||
| [Home index.html] | [CHM chm.html] | [BMZ bmz.html]
|
|
||||||
----------------------------------------
|
|
||||||
|
|
||||||
%!include: FOOTER.t2t
|
%!include: FOOTER.t2t
|
||||||
|
|
|
@ -103,8 +103,7 @@ to [figs/img250.png] seconds for [figs/img6.png].
|
||||||
|
|
||||||
%!include(html): ''TABLE5.t2t''
|
%!include(html): ''TABLE5.t2t''
|
||||||
| **Table 5:** Time measurements for BMZ tuned algorithm with [figs/img5.png] and [figs/img6.png].
|
| **Table 5:** Time measurements for BMZ tuned algorithm with [figs/img5.png] and [figs/img6.png].
|
||||||
----------------------------------------
|
|
||||||
| [Home index.html] | [CHM chm.html] | [BMZ bmz.html]
|
%!include: ALGORITHMS.t2t
|
||||||
----------------------------------------
|
|
||||||
|
|
||||||
%!include: FOOTER.t2t
|
%!include: FOOTER.t2t
|
||||||
|
|
|
@ -49,8 +49,6 @@ languages, reserved words in programming languages or interactive systems,
|
||||||
universal resource locations (URLs) in Web search engines, or item sets in
|
universal resource locations (URLs) in Web search engines, or item sets in
|
||||||
data mining techniques.
|
data mining techniques.
|
||||||
|
|
||||||
----------------------------------------
|
%!include: ALGORITHMS.t2t
|
||||||
| [Home index.html] | [CHM chm.html] | [BMZ bmz.html]
|
|
||||||
----------------------------------------
|
|
||||||
|
|
||||||
%!include: FOOTER.t2t
|
%!include: FOOTER.t2t
|
||||||
|
|
4
FAQ.t2t
4
FAQ.t2t
|
@ -26,8 +26,6 @@ one is executed?
|
||||||
is reset when you call the cmph_config_set_algo function.
|
is reset when you call the cmph_config_set_algo function.
|
||||||
|
|
||||||
|
|
||||||
----------------------------------------
|
%!include: ALGORITHMS.t2t
|
||||||
| [Home index.html] | [CHM chm.html] | [BMZ bmz.html]
|
|
||||||
----------------------------------------
|
|
||||||
|
|
||||||
%!include: FOOTER.t2t
|
%!include: FOOTER.t2t
|
||||||
|
|
|
@ -32,8 +32,6 @@ gperf. The first problem is common in the information retrieval field (e.g.
|
||||||
assigning ids to millions of documents), while the former is usually found in
|
assigning ids to millions of documents), while the former is usually found in
|
||||||
the compiler programming area (detect reserved keywords).
|
the compiler programming area (detect reserved keywords).
|
||||||
|
|
||||||
----------------------------------------
|
%!include: ALGORITHMS.t2t
|
||||||
| [Home index.html] | [CHM chm.html] | [BMZ bmz.html]
|
|
||||||
----------------------------------------
|
|
||||||
|
|
||||||
%!include: FOOTER.t2t
|
%!include: FOOTER.t2t
|
||||||
|
|
|
@ -46,7 +46,12 @@ The CMPH Library encapsulates the newest and more efficient algorithms in an eas
|
||||||
%txt% - BMZ Algorithm.
|
%txt% - BMZ Algorithm.
|
||||||
A very fast algorithm based on cyclic random graphs to construct minimal
|
A very fast algorithm based on cyclic random graphs to construct minimal
|
||||||
perfect hash functions in linear time. The resulting functions are not order preserving and
|
perfect hash functions in linear time. The resulting functions are not order preserving and
|
||||||
can be stored in only //4cn// bytes, where //c// is between 0.93 and 1.15.
|
can be stored in only //4cn// bytes, where //c// is between 0.93 and 1.15.
|
||||||
|
%html% - [External Memory Based Algorithm for sets on the order of billions of keys brz.html]
%txt% - External Memory Based Algorithm.
A very fast external memory based algorithm for constructing minimal perfect hash functions
for sets on the order of billions of keys in linear time. The resulting functions are not order preserving and
can be stored using just 8.1 bits per key. **This algorithm is currently available only in CVS**.
|
||||||
%html% - [CHM Algorithm chm.html].
|
%html% - [CHM Algorithm chm.html].
|
||||||
%txt% - CHM Algorithm.
|
%txt% - CHM Algorithm.
|
||||||
An algorithm based on acyclic random graphs to construct minimal
|
An algorithm based on acyclic random graphs to construct minimal
|
||||||
|
|
|
@ -0,0 +1,72 @@
|
||||||
|
<TABLE CELLPADDING=3 BORDER="1" ALIGN="CENTER">
|
||||||
|
<TR><TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE">
|
||||||
|
<SPAN CLASS="MATH"><IMG
|
||||||
|
WIDTH="14" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
||||||
|
SRC="figs/brz/img5.png"
|
||||||
|
ALT="$n$"></SPAN> (millions) </SMALL></TD>
|
||||||
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 1 </SMALL></TD>
|
||||||
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 2 </SMALL></TD>
|
||||||
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 4 </SMALL></TD>
|
||||||
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 8 </SMALL></TD>
|
||||||
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 16 </SMALL></TD>
|
||||||
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> 32 </SMALL></TD>
|
||||||
|
<TD></TD>
|
||||||
|
</TR>
|
||||||
|
<TR><TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE">
|
||||||
|
|
||||||
|
Average time (s)</SMALL></TD>
|
||||||
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||||
|
WIDTH="64" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
||||||
|
SRC="figs/brz/img168.png"
|
||||||
|
ALT="$6.1 \pm 0.3$"></SPAN> </SMALL></TD>
|
||||||
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||||
|
WIDTH="72" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
||||||
|
SRC="figs/brz/img169.png"
|
||||||
|
ALT="$12.2 \pm 0.6$"></SPAN> </SMALL></TD>
|
||||||
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||||
|
WIDTH="72" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
||||||
|
SRC="figs/brz/img170.png"
|
||||||
|
ALT="$25.4 \pm 1.1$"></SPAN> </SMALL></TD>
|
||||||
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||||
|
WIDTH="72" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
||||||
|
SRC="figs/brz/img171.png"
|
||||||
|
ALT="$51.4 \pm 2.0$"></SPAN> </SMALL></TD>
|
||||||
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||||
|
WIDTH="80" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
||||||
|
SRC="figs/brz/img172.png"
|
||||||
|
ALT="$117.3 \pm 4.4$"></SPAN> </SMALL></TD>
|
||||||
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||||
|
WIDTH="80" HEIGHT="29" ALIGN="MIDDLE" BORDER="0"
|
||||||
|
SRC="figs/brz/img173.png"
|
||||||
|
ALT="$262.2 \pm 8.7$"></SPAN></SMALL></TD>
|
||||||
|
<TD></TD>
|
||||||
|
</TR>
|
||||||
|
<TR><TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE">
|
||||||
|
SD (s) </SMALL></TD>
|
||||||
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||||
|
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
||||||
|
SRC="figs/brz/img174.png"
|
||||||
|
ALT="$2.6$"></SPAN> </SMALL></TD>
|
||||||
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||||
|
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
||||||
|
SRC="figs/brz/img175.png"
|
||||||
|
ALT="$5.4$"></SPAN> </SMALL></TD>
|
||||||
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||||
|
WIDTH="24" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
||||||
|
SRC="figs/brz/img176.png"
|
||||||
|
ALT="$9.8$"></SPAN> </SMALL></TD>
|
||||||
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||||
|
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
||||||
|
SRC="figs/brz/img177.png"
|
||||||
|
ALT="$17.6$"></SPAN> </SMALL></TD>
|
||||||
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||||
|
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
||||||
|
SRC="figs/brz/img178.png"
|
||||||
|
ALT="$37.3$"></SPAN> </SMALL></TD>
|
||||||
|
<TD ALIGN="CENTER"><SMALL CLASS="SCRIPTSIZE"> <SPAN CLASS="MATH"><IMG
|
||||||
|
WIDTH="32" HEIGHT="13" ALIGN="BOTTOM" BORDER="0"
|
||||||
|
SRC="figs/brz/img179.png"
|
||||||
|
ALT="$76.3$"></SPAN> </SMALL></TD>
|
||||||
|
<TD></TD>
|
||||||
|
</TR>
|
||||||
|
</TABLE>
|
2
gendocs
2
gendocs
|
@ -1,5 +1,6 @@
|
||||||
txt2tags -t html --mask-email -i README.t2t -o index.html
|
txt2tags -t html --mask-email -i README.t2t -o index.html
|
||||||
txt2tags -t html -i BMZ.t2t -o bmz.html
|
txt2tags -t html -i BMZ.t2t -o bmz.html
|
||||||
|
txt2tags -t html -i BRZ.t2t -o brz.html
|
||||||
txt2tags -t html -i CHM.t2t -o chm.html
|
txt2tags -t html -i CHM.t2t -o chm.html
|
||||||
txt2tags -t html -i COMPARISON.t2t -o comparison.html
|
txt2tags -t html -i COMPARISON.t2t -o comparison.html
|
||||||
txt2tags -t html -i GPERF.t2t -o gperf.html
|
txt2tags -t html -i GPERF.t2t -o gperf.html
|
||||||
|
@ -8,6 +9,7 @@ txt2tags -t html -i CONCEPTS.t2t -o concepts.html
|
||||||
|
|
||||||
txt2tags -t txt --mask-email -i README.t2t -o README
|
txt2tags -t txt --mask-email -i README.t2t -o README
|
||||||
txt2tags -t txt -i BMZ.t2t -o BMZ
|
txt2tags -t txt -i BMZ.t2t -o BMZ
|
||||||
|
txt2tags -t txt -i BRZ.t2t -o BRZ
|
||||||
txt2tags -t txt -i CHM.t2t -o CHM
|
txt2tags -t txt -i CHM.t2t -o CHM
|
||||||
txt2tags -t txt -i COMPARISON.t2t -o COMPARISON
|
txt2tags -t txt -i COMPARISON.t2t -o COMPARISON
|
||||||
txt2tags -t txt -i GPERF.t2t -o GPERF
|
txt2tags -t txt -i GPERF.t2t -o GPERF
|
||||||
|
|