Problem
About one year ago, in the post Coupon collector’s problem:An infinite-series perspective I come up with a solution to asymptotically answer the following problem:
The cardinality of the data is n, for each distinct group, there are a duplicates. Suppose we want to use hash aggregation algorithm to do some computation, each time you uniformly sample one data and push it in the hash table (capacity is just n), exactly right after sampling m data, for the first time you find that the hash table is full. What is the expectation of m?
The above problem is a model of sampling without replacement. One year ago’s solution is with replacement. Recently, I am reading the book Probability Theory: The Logic of Science. It gives me some idea so I come up with a solution.
Continue reading

