From 7e37e5009ee401cfd942893ea627e7323dde2cb2 Mon Sep 17 00:00:00 2001 From: fc_botelho Date: Tue, 25 Apr 2006 19:34:06 +0000 Subject: [PATCH] external memory based algorithm documentation updated --- BRZ.t2t | 117 +++++++++++++++++++++++++++++++++++++++- TABLEBRZ2.t2t | 133 +++++++++++++++++++++++++++++++++++++++++++++ TABLEBRZ3.t2t | 147 ++++++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 396 insertions(+), 1 deletion(-) create mode 100644 TABLEBRZ2.t2t create mode 100644 TABLEBRZ3.t2t diff --git a/BRZ.t2t b/BRZ.t2t index 250504c..079029a 100644 --- a/BRZ.t2t +++ b/BRZ.t2t @@ -16,7 +16,7 @@ a daily task. For instance, the simple assignment of number identifiers to web pages of a collection can be a challenging task. While traditional databases simply cannot handle more traffic once the working -set of URLs does not fit in main memory anymorei[[4 #papers]], the algorithm we propose here to +set of URLs does not fit in main memory anymore[[4 #papers]], the algorithm we propose here to construct MPHFs can easily scale to billions of entries. As there are many applications for MPHFs, it is @@ -301,6 +301,121 @@ as shown in [[3 #papers]]. %!include(html): ''TABLEBRZ1.t2t'' | **Table 1:** Internal memory based algorithm: average time in seconds for constructing a MPHF, the standard deviation (SD), and the confidence intervals considering a confidence level of 95 %. +Figure 6 presents the runtime for each trial. In addition, +the solid line corresponds to a linear regression model +obtained from the experimental measurements. +As we can see, the runtime for a given //n// has a considerable +fluctuation. However, the fluctuation also grows linearly with //n//. + + | [figs/brz/bmz_temporegressao.png] + | **Figure 6:** Time versus number of keys in //S// for the internal memory based algorithm. The solid line corresponds to a linear regression model. 
+ +The observed fluctuation in the runtimes is as expected; recall that this +runtime has the form [figs/brz/img159.png] with //Z// a geometric random variable with +mean //1/p=e//. Thus, the runtime has mean [figs/brz/img181.png] and standard +deviation [figs/brz/img182.png]. +Therefore, the standard deviation also grows +linearly with //n//, as experimentally verified +in Table 1 and in Figure 6. + +---------------------------------------- + +===Performance of the External Memory Based Algorithm=== + +The runtime of the external memory based algorithm is also a random variable, +but now it follows a (highly concentrated) normal distribution, as we discuss at the end of this +section. Again, we are interested in verifying the linearity claim made in +[[2, Section 5.1 #papers]]. Therefore, we ran the algorithm for +several numbers //n// of keys in //S//. + +The values chosen for //n// were 1, 2, 4, 8, 16, 32, 64, 128, 512 and 1000 +million. +We limited the main memory to 500 megabytes for the experiments. +The size [figs/brz/img8.png] of the a priori reserved internal memory area +was set to 250 megabytes, the parameter //b// was set to //175// and +the building block algorithm parameter //c// was again set to //1//. +We show later on how [figs/brz/img8.png] affects the runtime of the algorithm. The other two parameters +have insignificant influence on the runtime. + +We again use a statistical method for determining a suitable sample size +to estimate the number of trials to be run for each value of //n//. We found that +just one trial for each //n// would be enough with a confidence level of 95 %. +However, we ran 10 trials. This number of trials seems rather small, but, as +shown below, the behavior of our algorithm is very stable and its runtime is +almost deterministic (i.e., the standard deviation is very small). 
+ +Table 2 presents the runtime average for each //n//, +the respective standard deviations, and +the respective confidence intervals given by +the average time [figs/brz/img167.png] the distance from average time +considering a confidence level of 95 %. +Observing the runtime averages, we see that +the algorithm runs in expected linear time, +as shown in [[2, Section 5.1 #papers]]. Better still, +it is only approximately 60 % slower than the BMZ algorithm. +To get that value we used the linear regression model obtained for the runtime of +the internal memory based algorithm to estimate how much time it would require +for constructing a MPHF for a set of 1 billion keys. +We got 2.3 hours for the internal memory based algorithm and we measured +3.67 hours on average for the external memory based algorithm. +Increasing the size of the internal memory area +from 250 to 600 megabytes, +we brought the time down to 3.09 hours. In this setup, the external memory based algorithm is +just 34 % slower. + +%!include(html): ''TABLEBRZ2.t2t'' + | **Table 2:** The external memory based algorithm: average time in seconds for constructing a MPHF, the standard deviation (SD), and the confidence intervals considering a confidence level of 95 %. + +Figure 7 presents the runtime for each trial. In addition, +the solid line corresponds to a linear regression model +obtained from the experimental measurements. +As expected, the runtime for a given //n// has almost no +variation. + + | [figs/brz/brz_temporegressao.png] + | **Figure 7:** Time versus number of keys in //S// for our algorithm. The solid line corresponds to a linear regression model. + +An intriguing observation is that the runtime of the algorithm is almost +deterministic, in spite of the fact that it uses as building block an +algorithm with a considerable fluctuation in its runtime. 
A given bucket +//i//, [figs/brz/img47.png], is a small set of keys (at most 256 keys) and, +as argued in the previous section, the runtime of the +building block algorithm is a random variable [figs/brz/img207.png] with high fluctuation. +However, the runtime //Y// of the searching step of the external memory based algorithm is given +by [figs/brz/img209.png]. Under the hypothesis that +the [figs/brz/img207.png] are independent and bounded, the //law of large numbers// (see, +e.g., [[6 #papers]]) implies that the random variable [figs/brz/img210.png] converges +to a constant as [figs/brz/img83.png]. This explains why the runtime of our +algorithm is almost deterministic. + +---------------------------------------- + +=== Controlling disk accesses === + +In order to bring down the number of seek operations on disk +we benefit from the fact that our algorithm leaves almost all main +memory available to be used as disk I/O buffer. +In this section we evaluate how much the parameter [figs/brz/img8.png] affects the runtime of our algorithm. +For that we fixed //n// at 1 billion URLs, +set the main memory of the machine used for the experiments +to 1 gigabyte and used [figs/brz/img8.png] equal to 100, 200, 300, 400, 500 and 600 +megabytes. + +Table 3 presents the number of files //N//, +the buffer size used for all files, the number of seeks in the worst case considering +the pessimistic assumption mentioned in [[2, Section 5.1 #papers]], and +the time to generate a MPHF for 1 billion keys as a function of the amount of internal +memory available. Observing Table 3, we see that the time spent in the construction +decreases as the value of [figs/brz/img8.png] increases. However, for [figs/brz/img213.png], the variation +in the time is not as significant as for [figs/brz/img214.png]. 
+This can be explained by the fact that the I/O scheduler of the Linux 2.6 kernel +has smart policies for avoiding seeks and diminishing the average seek time +(see [http://www.linuxjournal.com/article/6931 http://www.linuxjournal.com/article/6931]). + +%!include(html): ''TABLEBRZ3.t2t'' + | **Table 3:** Influence of the internal memory area size ([figs/brz/img8.png]) in the external memory based algorithm runtime. + + ---------------------------------------- ==Papers==[papers] diff --git a/TABLEBRZ2.t2t b/TABLEBRZ2.t2t new file mode 100644 index 0000000..a72094c --- /dev/null +++ b/TABLEBRZ2.t2t @@ -0,0 +1,133 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+$n$ (millions) 1 2 4 8 16
+ Average time (s) $6.9 \pm 0.3$ $13.8 \pm 0.2$ $31.9 \pm 0.7$ $69.9 \pm 1.1$ $140.6 \pm 2.5$
+ SD $0.4$ $0.2$ $0.9$ $1.5$ $3.5$
+ + $n$ (millions) 32 64 128 512 1000
+ Average time (s) $284.3 \pm 1.1$ $587.9 \pm 3.9$ + $1223.6 \pm 4.9$ + $5966.4 \pm 9.5$ + $13229.5 \pm 12.7$
+ SD $1.6$ $5.5$ $6.8$ $13.2$ $18.6$
diff --git a/TABLEBRZ3.t2t b/TABLEBRZ3.t2t new file mode 100644 index 0000000..516dcab --- /dev/null +++ b/TABLEBRZ3.t2t @@ -0,0 +1,147 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+$\mu $ (MB) $100$ $200$ $300$ $400$ $500$ $600$
+ + $N$ (files) $619$ $310$ $207$ $155$ $124$ $104$
+  (buffer size in KB) $165$ $661$ $1,484$ $2,643$ $4,129$ $5,908$
+ $\beta$/ (# of seeks in the worst case) $384,478$ $95,974$ $42,749$ $24,003$ $15,365$ $10,738$
+ Time (hours) $4.04$ $3.64$ $3.34$ $3.20$ $3.13$ $3.09$