Problem
About one year ago, in the post Coupon collector’s problem:An infinite-series perspective I come up with a solution to asymptotically answer the following problem:
The cardinality of the data is n, for each distinct group, there are a duplicates. Suppose we want to use hash aggregation algorithm to do some computation, each time you uniformly sample one data and push it in the hash table (capacity is just n), exactly right after sampling m data, for the first time you find that the hash table is full. What is the expectation of m?
The above problem is a model of sampling without replacement. One year ago’s solution is with replacement. Recently, I am reading the book Probability Theory: The Logic of Science. It gives me some idea so I come up with a solution.
Solution
This kind of problem is to compute the expectation of an event’s waiting time. For this kind of problem, I always use the skill learned from the course by Robert Sedgewick:
EXP\{waiting\_time\} = \sum_{m\ge 0} Prob\{m\ times\ no\ happen\}Following the above idea, the core task here is to compute: after m sampling without replacement, what is the probability that the hash table is not full? Then I will use the below skill to model this:
HashTable\_Not\_Full = Missing\_Group\_1 \cup Missing\_Group\_2 \cup \dots \cup Missing\_Group\_nAccording to the addition rule of probability, we have:
Prob\{HashTable\_Not\_Full\} = \sum_{k=1}^{n} (-1)^{k+1} {n \choose k} \frac{{an-ak \choose m}}{an \choose m}The multiplicity {n \choose k} is when computing missing k groups from n groups, we have that many choices.
\frac{{an-ak \choose m}}{an \choose m} is the probability that sampling m elements that they are not from some specific k groups.
Thus we have our final result:
EXP = \sum_{m\ge 0} {an \choose m} ^{-1} \sum_{k=1}^n (-1)^{k+1} {n \choose k} {an-ak \choose m}In fact the above summation is not infinite, because when m > n, {an-ak \choose m} = 0. So we can rewrite the sum above:
EXP = \sum_{m= 0}^{an} {an \choose m} ^{-1} \sum_{k=1}^n (-1)^{k+1} {n \choose k} {an-ak \choose m}Verify for very simple case
We can verify some very simple case to get more confidence that we are not wrong.
Let’s see when each group only contains one element, which means a = 1 . By common sense, we know that when the sampling number is less than the group number, we cannot win. This leads to the conclusion that when a=1 :
\sum_{k=1}^n (-1)^{k+1} {n \choose k} {n-k \choose m} = 1\ if\ m < n\ else\ 0We can verify the above formula as:
\sum_{k=0}^n (-1)^{k+1} {n \choose k} {n-k \choose m} + {n \choose m} = {n \choose m}-\sum_{k=0}^n(-1)^{k}{n \choose k}{n-k \choose m}Then Table 169 in Concrete Mathematics is the correct place to find the answer:

Based on this rule (5.25), we have
{n \choose m}-\sum_{k=0}^n(-1)^{k}{n \choose k}{n-k \choose m} = {n \choose m} - (-1)^{n+m}{n-m-1 \choose n-m} = {n \choose m}This just satisfies our common sense.
But unfortunately, I cannot go a step further to get a closed form of the final result for a>1.
Python simulation
Finally, let’s write a python program to simulate our result:
from scipy.special import comb
from random import randint
import itertools
def compute_exp(a, n):
"""
n boxes(1, 2, 3, ...n)
each box i contains a balls colored with number i
sample m times and just the first time get full
"""
s = 0.0
for m in range(0, a*n + 1):
p = 0
for k in range(1, n+1):
sign = (-1)**((k+1)%2)
p += sign * comb(n, k) * comb(a*n-a*k, m)
s += p / float(comb(a*n, m))
return s
def simulate_once(a, n):
d = [a*[i] for i in range(n)]
data = list(itertools.chain(*d))
s = set()
p = 0
while True:
i = randint(0, len(data)-1)
c = data.pop(i)
s.add(c)
p += 1
if len(s) == n:
break
return p
def simulate(a, n, loops=1000000):
return sum(simulate_once(a, n) for i in range(loops))/float(loops)
def nhn(n):
s = 0
for i in range(1, n+1):
s += 1.0/i
return n *s
def verify():
for a in range(1, 5):
for n in range(1, 9):
sim_val = simulate(a, n)
com_val = compute_exp(a, n)
nhn_val = nhn(n)
diff = abs(sim_val - com_val)*100.0/com_val
diff1 = abs(nhn_val - com_val)*100.0/com_val
print("a = %d, n = %d, exp=%lf, sim=%lf, nhn=%lf, diff=%lf%%, diff1=%lf%%"
% (a, n, com_val, sim_val, nhn_val, diff, diff1))
if __name__ == "__main__":
verify()
Running the code it prints out:
a = 1, n = 1, exp=1.000000, sim=1.000000, nhn=1.000000, diff=0.000000%, diff1=0.000000%
a = 1, n = 2, exp=2.000000, sim=2.000000, nhn=3.000000, diff=0.000000%, diff1=50.000000%
a = 1, n = 3, exp=3.000000, sim=3.000000, nhn=5.500000, diff=0.000000%, diff1=83.333333%
a = 1, n = 4, exp=4.000000, sim=4.000000, nhn=8.333333, diff=0.000000%, diff1=108.333333%
a = 1, n = 5, exp=5.000000, sim=5.000000, nhn=11.416667, diff=0.000000%, diff1=128.333333%
a = 1, n = 6, exp=6.000000, sim=6.000000, nhn=14.700000, diff=0.000000%, diff1=145.000000%
a = 1, n = 7, exp=7.000000, sim=7.000000, nhn=18.150000, diff=0.000000%, diff1=159.285714%
a = 1, n = 8, exp=8.000000, sim=8.000000, nhn=21.742857, diff=0.000000%, diff1=171.785714%
a = 2, n = 1, exp=1.000000, sim=1.000000, nhn=1.000000, diff=0.000000%, diff1=0.000000%
a = 2, n = 2, exp=2.333333, sim=2.333447, nhn=3.000000, diff=0.004871%, diff1=28.571429%
a = 2, n = 3, exp=3.800000, sim=3.799148, nhn=5.500000, diff=0.022421%, diff1=44.736842%
a = 2, n = 4, exp=5.342857, sim=5.342818, nhn=8.333333, diff=0.000733%, diff1=55.971480%
a = 2, n = 5, exp=6.936508, sim=6.936896, nhn=11.416667, diff=0.005595%, diff1=64.588101%
a = 2, n = 6, exp=8.567100, sim=8.569420, nhn=14.700000, diff=0.027085%, diff1=71.586660%
a = 2, n = 7, exp=10.226107, sim=10.227214, nhn=18.150000, diff=0.010823%, diff1=77.486893%
a = 2, n = 8, exp=11.907848, sim=11.910093, nhn=21.742857, diff=0.018856%, diff1=82.592671%
a = 3, n = 1, exp=1.000000, sim=1.000000, nhn=1.000000, diff=0.000000%, diff1=0.000000%
a = 3, n = 2, exp=2.500000, sim=2.500712, nhn=3.000000, diff=0.028480%, diff1=20.000000%
a = 3, n = 3, exp=4.214286, sim=4.211930, nhn=5.500000, diff=0.055898%, diff1=30.508475%
a = 3, n = 4, exp=6.057143, sim=6.056313, nhn=8.333333, diff=0.013700%, diff1=37.578616%
a = 3, n = 5, exp=7.989011, sim=7.989819, nhn=11.416667, diff=0.010114%, diff1=42.904631%
a = 3, n = 6, exp=9.987637, sim=9.987228, nhn=14.700000, diff=0.004099%, diff1=47.181956%
a = 3, n = 7, exp=12.038968, sim=12.043247, nhn=18.150000, diff=0.035546%, diff1=50.760435%
a = 3, n = 8, exp=14.133419, sim=14.128767, nhn=21.742857, diff=0.032916%, diff1=53.840036%
a = 4, n = 1, exp=1.000000, sim=1.000000, nhn=1.000000, diff=0.000000%, diff1=0.000000%
a = 4, n = 2, exp=2.600000, sim=2.598348, nhn=3.000000, diff=0.063538%, diff1=15.384615%
a = 4, n = 3, exp=4.466667, sim=4.465587, nhn=5.500000, diff=0.024172%, diff1=23.134328%
a = 4, n = 4, exp=6.497436, sim=6.497815, nhn=8.333333, diff=0.005835%, diff1=28.255722%
a = 4, n = 5, exp=8.644042, sim=8.643690, nhn=11.416667, diff=0.004075%, diff1=32.075554%
a = 4, n = 6, exp=10.878905, sim=10.877494, nhn=14.700000, diff=0.012974%, diff1=35.123888%
a = 4, n = 7, exp=13.184374, sim=13.186509, nhn=18.150000, diff=0.016193%, diff1=37.662963%
a = 4, n = 8, exp=15.548275, sim=15.549086, nhn=21.742857, diff=0.005217%, diff1=39.840962%
We see our result is very very close to the simulation result, and the with replace result NH_N is not that close for small cases. This also matches our common sense. Only when a \to \infty, without replace and with replace should have the same answer.