Tuesday, 30 January 2018

Estimating Hash Table Bucket Size Requirement


Problem Statement


How big do my hash buckets need to be? What kind of a question is that anyway?

Let's stand back a bit...

As every computer scientist ought to know, a hash table is a wonderfully efficient, O(1), way to look something up in a table. The idea is simple. For each thing you want to look up, you calculate a hash, which is simply a way to scrunch together the bits of its key such that they all get mixed together and any hash value is as likely as any other. The range of the hash should equal the size of the table, so you can use it to index into the table. That is where your object should be. Simple.

The complication is that it's possible for two keys to hash to the same value. Classically, there are two ways to deal with this:
  • Just go to the next entry, and if that is full, the next one, and so on. This works, but gives disastrous performance if the table gets anywhere close to full. There are techniques for improving this, or at least making it less disastrous, for example quadratic rehash. But it is never good.
  • Instead of keeping the actual values in the table, keep a pointer. Then when collisions occur, just chain together the values in a linked list. This works well and doesn't have the performance problems of the first method, but it is seriously cache unfriendly since the chances are that the chained overflow entries won't be in cache, and there is no easy way to prefetch them.
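For concreteness, the two classic schemes can be sketched in a few lines of Python (illustrative code of my own, not from any particular library; the names are invented):

```python
class LinearProbingTable:
    """Open addressing: on a collision, try the next slot, and so on."""

    def __init__(self, size):
        self.slots = [None] * size  # each slot holds a (key, value) pair

    def insert(self, key, value):
        i = hash(key) % len(self.slots)
        for _ in range(len(self.slots)):
            if self.slots[i] is None or self.slots[i][0] == key:
                self.slots[i] = (key, value)
                return
            i = (i + 1) % len(self.slots)  # linear probe to the next slot
        raise RuntimeError("table full")

    def lookup(self, key):
        i = hash(key) % len(self.slots)
        for _ in range(len(self.slots)):
            if self.slots[i] is None:
                return None
            if self.slots[i][0] == key:
                return self.slots[i][1]
            i = (i + 1) % len(self.slots)
        return None


class ChainedTable:
    """Chaining: each slot heads a linked list of colliding entries."""

    def __init__(self, size):
        self.heads = [None] * size  # each entry: (key, value, next) or None

    def insert(self, key, value):
        i = hash(key) % len(self.heads)
        self.heads[i] = (key, value, self.heads[i])  # prepend to the chain

    def lookup(self, key):
        node = self.heads[hash(key) % len(self.heads)]
        while node is not None:  # each step of the walk risks a cache miss
            if node[0] == key:
                return node[1]
            node = node[2]
        return None
```

The probing version degrades as the table fills because probe sequences lengthen; the chained version keeps chains short on average but chases pointers, which is where the cache misses come from.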
Our system uses a hash table to associate incoming network packets with other packets from the same TCP or UDP session (a flow), using the IP addresses and TCP/UDP ports as a key. This is intense real-time code, intended to handle up to 15 million packets per second. This kind of performance requires attention to the tiniest details. For example, absolutely everything has to be in level one cache before it is accessed.

To achieve this, we use a different scheme to handle hash collisions. Each entry in the table is a fixed size bucket. If two or more flows collide to the same bucket, they are placed in successive slots in the bucket. It's easy to prefetch a whole bucket, so this is cache friendly. As long as the buckets are big enough, this always gives excellent performance.
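To make the scheme concrete, here is a minimal Python sketch of the fixed-size-bucket idea (the real code is low-level and cache-aware; this is just an illustration, and the names are mine):

```python
BUCKET_SIZE = 4  # illustrative; choosing this number well is the point of this article

class BucketHashTable:
    """Each table entry is a fixed-size bucket; colliding keys occupy
    successive slots within the bucket (a sketch, not the real
    packet-processing code)."""

    def __init__(self, num_buckets, bucket_size=BUCKET_SIZE):
        self.num_buckets = num_buckets
        self.buckets = [[None] * bucket_size for _ in range(num_buckets)]

    def insert(self, key, value):
        bucket = self.buckets[hash(key) % self.num_buckets]
        for i, slot in enumerate(bucket):
            if slot is None or slot[0] == key:
                bucket[i] = (key, value)  # first free (or matching) slot
                return True
        return False  # bucket full: in our system, the new flow is dropped

    def lookup(self, key):
        bucket = self.buckets[hash(key) % self.num_buckets]
        for slot in bucket:  # the whole bucket can be prefetched at once
            if slot is not None and slot[0] == key:
                return slot[1]
        return None
```

Note that `insert` returns False rather than raising when the bucket is full, mirroring the silent packet drop described below.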

Which brings us back to the first question - how big do the buckets need to be? If there is no free slot in a bucket when a new flow arrives for it, the packets of the flow are simply dropped. So the probability of this happening has to be extremely low, so low that the risk is no higher than for all the other bad things that can happen to a packet in a network.

We could always make the buckets so enormously huge that this just never happens. But there is a cost both in memory and in performance - since a bucket has to be prefetched even if most of it is empty. So we really need to know the optimum size.

So the question can be stated: given a hash table of size k, and a number n of concurrent flows occupying it, what is the probability that slot b of a bucket will be occupied? In the text that follows, f is the load factor of the table, i.e. n/k.

Results


The answer to the question is given by the following formula, where f=n/k:
\[P(n,k,b) < e^{-f}\cdot \frac{f^{b}}{b!}\frac{b}{b-f}\]
This formula involves several approximations, and is a close upper bound on the actual probability.

To know whether a bucket of size b will overflow, we need to know whether the (b+1)th slot will be full, i.e. the first slot beyond the bucket size.

The following table gives the probability of bucket overflow, for various bucket sizes and various load factors. (Some values are absent for small bucket sizes because the above approximation is not applicable when the load factor is greater than or equal to the bucket size.)

Bucket Size   Load Factor
              0.1         0.5         1           1.5         2
     1        0.0050269   0.1516327
     2        0.0001587   0.0168481   0.1226265   0.5020429
     3        0.0000039   0.0018954   0.0229925   0.0941330   0.2706706
     4        0.0000001   0.0001805   0.0040875   0.0225919   0.0721788
     5        0.0000000   0.0000146   0.0006387   0.0050428   0.0200497
     6        0.0000000   0.0000010   0.0000876   0.0010086   0.0051556
     7        0.0000000   0.0000001   0.0000106   0.0001805   0.0012030
     8        0.0000000   0.0000000   0.0000012   0.0000291   0.0002546
     9        0.0000000   0.0000000   0.0000001   0.0000043   0.0000491
    10        0.0000000   0.0000000   0.0000000   0.0000006   0.0000087
    11        0.0000000   0.0000000   0.0000000   0.0000001   0.0000014
    12        0.0000000   0.0000000   0.0000000   0.0000000   0.0000002
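The values above can be regenerated with a few lines of Python (a sketch of mine): the overflow probability for a bucket of size b is the bound evaluated for b+1 entries, keeping the geometric factor b/(b-f):

```python
import math

def overflow_prob(f, b):
    """Upper bound on the probability that a bucket of size b overflows
    at load factor f, i.e. that b+1 or more entries land in it.
    Only applicable for f < b."""
    return math.exp(-f) * f**(b + 1) / math.factorial(b + 1) * b / (b - f)

# Regenerate one row, e.g. bucket size 3, matching the table above:
print(["%.7f" % overflow_prob(f, 3) for f in (0.1, 0.5, 1, 1.5, 2)])
```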

These probabilities are independent of table size. But the probability of at least one bucket in the table overflowing depends on the table size as well. So for example, in a full table of 10,000 entries and a bucket size of 7, there is a probability of about 0.1 that some bucket will overflow (i.e. need to be of size 8 or greater), which is probably acceptable. But if the table size is increased to 1,000,000, that probability increases to practically 1. To achieve the same 0.1 probability as for the smaller table, the bucket size must be increased to 9.

To be precise, the probability that at least one bucket in a table of size k, having n entries and a bucket size of b, will overflow, is given by:
\[ p_{overflow}(n,k,b)=1-(1-p(n,k,b))^{k} \]
The following table shows the probability of at least one bucket overflowing, for the given table and bucket sizes, at a load factor of 1 (i.e. n=k).

Bucket Size   Table Size
              1000       10000      100000     1000000    10000000
     2        1.000000   1.000000   1.000000   1.000000   1.000000
     3        1.000000   1.000000   1.000000   1.000000   1.000000
     4        0.983360   1.000000   1.000000   1.000000   1.000000
     5        0.472119   0.998320   1.000000   1.000000   1.000000
     6        0.083867   0.583530   0.999843   1.000000   1.000000
     7        0.010588   0.100977   0.655090   0.999976   1.000000
     8        0.001158   0.011519   0.109400   0.686076   0.999991
     9        0.000114   0.001140   0.011340   0.107787   0.680341
    10        0.000010   0.000102   0.001023   0.010188   0.097333
    11        0.000001   0.000008   0.000084   0.000844   0.008413
    12        0.000000   0.000001   0.000006   0.000064   0.000644
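These whole-table probabilities are easy to compute; the following Python sketch (mine, for illustration) builds on the per-bucket overflow bound:

```python
import math

def bucket_overflow_prob(f, b):
    """Upper bound on the probability that one given bucket of size b
    overflows at load factor f (valid for f < b)."""
    return math.exp(-f) * f**(b + 1) / math.factorial(b + 1) * b / (b - f)

def table_overflow_prob(k, b, f=1.0):
    """Probability that at least one of the k buckets overflows."""
    return 1.0 - (1.0 - bucket_overflow_prob(f, b)) ** k

# Compare with the bucket-size-7 row of the table above:
print(["%.6f" % table_overflow_prob(k, 7)
       for k in (10**3, 10**4, 10**5, 10**6, 10**7)])
```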

Derivation: Some Basics


There's a well known analysis concerning the number of empty buckets you can expect to find in a hash table. Considering a single bucket, we ask: what is the probability that not a single entry will have hit it? This is the same as asking whether every entry has hit somewhere else. For a single entry, this is (k-1)/k, so for every entry to have hit somewhere else, the probability is:
\[p(n,k,0)=\left ( \frac{k-1}{k} \right )^{n} \]
We'll use the notation p(n,k,b) to mean the probability that a bucket in a table of size k with n entries has exactly b slots used - in this case, b is zero. With a bit of approximation, this can be simplified to:
\[p(n,k,0)= e^{-f} \]
In particular, for a table where n=k, i.e. f=1, this is 1/e, or about 0.368. So in a table of size 1,000,000, with that many entries, about 368,000 will be empty. The rest will have at least one entry. How many will have more than one?

We can consider the probability of a single key finding itself alone in a single bucket. This is the probability that it will go to the bucket, 1/k, times the probability that nobody else will, which is ((k-1)/k)^(n-1). Now we have to consider that this can happen for every one of the n entries, so we multiply by n. This gives:
\[p(n,k,1)=\left ( \frac{k-1}{k} \right )^{n-1}\cdot \frac{1}{k} \cdot n \]
which can be simplified, with a little approximation, to:
\[p(n,k,1)= f e^{-f} \]
For the particular case of n=k, this gives exactly the same result as for an empty bucket: 0.368.
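Both results are easy to confirm by simulation. The sketch below (my own, illustrative) throws n keys into k buckets at random and counts the buckets that end up with exactly zero and exactly one entry:

```python
import math
import random

random.seed(1)        # make the run reproducible
k = n = 100_000       # a full table: load factor f = 1

counts = [0] * k
for _ in range(n):
    counts[random.randrange(k)] += 1  # an ideal hash picks a bucket uniformly

frac_empty = counts.count(0) / k
frac_single = counts.count(1) / k
print(frac_empty, frac_single)  # both should be close to 1/e, about 0.368
```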

The Equation We're Looking For


All we need to do now is to generalise the above result for a single entry, to cover b entries. We use the same logic: considering any b specific keys, what is the probability that they will end up together, and with no other keys, in a given bucket? The probability that no other keys will be there is ((k-1)/k)^(n-b), while the probability that they will all end up in this particular bucket is k^(-b). Then we need to multiply this by the number of ways of choosing b keys from n, which is given by the binomial coefficient nCb. Putting this all together gives:
\[p(n,k,b)=\left ( \frac{k-1}{k} \right )^{n-b}\cdot k^{-b}\cdot \binom{n}{b} \]
This can be simplified, again with a little approximation, to:
\[p(n,k,b) = e^{-f}\cdot \frac{f^{b}}{b!}\]
This tells us the probability that a bucket will have exactly b entries. What we need, though, is the probability that it will have b or more entries. Calling this function P, we have:
\[P(n,k,b)=\sum_{i=b}^{\infty}p(n,k,i)=\sum_{i=b}^{\infty}e^{-f}\cdot \frac{f^{i}}{i!}\]
There is no closed form for this sum, but there is a good upper bound:
\[P(n,k,b) < e^{-f}\cdot \frac{f^{b}}{b!}\frac{b}{b-f}\]
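The accuracy of these simplified formulae can be checked numerically; the short Python comparison below (mine) pits the exact expression against the simplified approximation:

```python
import math

def p_exact(n, k, b):
    """Exact probability that a given bucket receives exactly b of n keys."""
    return math.comb(n, b) * ((k - 1) / k) ** (n - b) * k ** (-b)

def p_approx(n, k, b):
    """The approximation e^(-f) * f^b / b! with f = n/k."""
    f = n / k
    return math.exp(-f) * f**b / math.factorial(b)

# For a large full table the two agree very closely:
for b in range(5):
    print(b, p_exact(10**6, 10**6, b), p_approx(10**6, 10**6, b))
```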

The Math: Obtaining the Simplified Formulae


First, recall that by the definition of e:
\[ e^{x}=\lim_{k\rightarrow \infty }\left ( 1+\frac{1}{k} \right )^{xk} \]
Hence, substituting -k for k, and considering that k is close enough to infinity:
\[ \left ( \frac{k-1}{k} \right )^{n}=\left ( 1+\frac{1}{-k} \right )^{(-k)\cdot \left ( -\frac{n}{k} \right )}\simeq e^{-\frac{n}{k}}=e^{-f} \]
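The quality of this approximation is easy to check numerically (a quick Python sanity check of my own):

```python
import math

k = n = 1_000_000             # a full table, so f = n/k = 1
exact = ((k - 1) / k) ** n    # the exact value
approx = math.exp(-n / k)     # the e^(-f) approximation
print(exact, approx)          # the two agree to about six decimal places
```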
Now consider the equation for b=1:
\[ p(n,k,1)=\left ( \frac{k-1}{k} \right )^{n-1}\cdot \frac{1}{k}\cdot n \]
We rearrange things to get the same approximation for e^(-f), and observe that since k is large, (k-1)/k is approximately 1. And so:
\[\left ( \frac{k-1}{k} \right )^{n-1}\cdot \frac{1}{k}\cdot n\simeq \left ( \frac{k-1}{k} \right )^{n}\cdot \left ( \frac{k-1}{k} \right )^{-1}\cdot \frac{n}{k} \simeq f e^{-f} \]
For the general case, b>1:
\[p(n,k,b)=\left ( \frac{k-1}{k} \right )^{n-b}\cdot k^{-b}\cdot \binom{n}{b} \]
we consider first the combinatorial term on the right. Since n ≫ b, we can treat the product in the numerator as n^b:
\[\binom{n}{b}=\frac{n!}{(n-b)!b!}=\frac{n(n-1)\ldots (n-b+1)}{b!}\simeq \frac{n^{b}}{b!}\]
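For the sizes involved here this is a very good approximation, as a one-line check (mine) shows:

```python
import math

n, b = 10**6, 8
ratio = (math.comb(n, b) * math.factorial(b)) / n**b  # exact over approximate
print(ratio)  # about 0.99997, i.e. equal to a few parts in 100,000
```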
Incorporating this approximation, noting as before that (k-1)/k can be treated as 1, and writing f for n/k:
\[\left ( \frac{k-1}{k} \right )^{n-b}\cdot k^{-b}\cdot \binom{n}{b}\simeq \left ( \frac{k-1}{k} \right )^{n}\cdot \left ( \frac{k-1}{k} \right )^{-b}\cdot k^{-b}\cdot \frac{n^{b}}{b!} \simeq e^{-f}\cdot \frac{f^{b}}{b!}\]
Consider now the required value: the probability that a bucket contains b or more entries. This is given by:
\[P(n,k,b)=\sum_{i=b}^{\infty}p(n,k,b)=\sum_{i=b}^{\infty}e^{-f}\cdot \frac{f^{i}}{i!}=e^{-f}\cdot\sum_{i=b}^{\infty}\frac{f^{i}}{i!}\]
There is no closed form for such a sum, but we can obtain an upper bound. Writing t_i for the term in i:
\[t_{i}=t_{i-1}\cdot \frac{f}{i}<t_{i-1}\cdot \frac{f}{b}\:\:\:\:\text{(since }\textit{b}<\textit{i}\text{)}\]
Hence by induction:
\[t_{i}<\frac{f^b}{b!}\cdot\frac{f^{i-b}}{b^{i-b}}\]
And hence:
\[P(n,k,b)<e^{-f}\frac{f^b}{b!}\sum_{i=0}^{\infty}\frac{f^{i}}{b^{i}}\]
And so by the usual formula for an infinite geometric progression:
\[P(n,k,b)<e^{-f}\cdot\frac{f^b}{b!}\cdot\frac{1}{1-\frac{f}{b}}=e^{-f}\cdot\frac{f^b}{b!}\cdot\frac{b}{b-f}\]