Coupon collector’s problem:An infinite-series perspective

Background

If you are not familiar with Coupon collector’s problem, please first read the following page:

Six years ago when I was in graduate school, I encountered this problem. I did not want to use Indicator variable method to solve it, so I tried my best to solve it using infinite series. I came up with a complicated formula which was not accurate and I am not satisfied because this means it is wrong.

the wrong formula of six years ago

Recently, one of my colleagues raised a very hard problem on probability, and I spent much time on it and cannot solve it. From the end of July, I am working on this problem every day after work. And I find a perfect course by Dr. Robert Sedgewick and believe by understanding the materials one day finally I would solve this problem. The course is called Analytic Combinatorics. For details, please refer the website: https://aofa.cs.princeton.edu/home/.

The correct formula

Today, when I am learning the material of Analytic Combinatorics of strings by Dr. Robert Sedgewick, I find a very interesting skill:

The wonderful skill

The last step of the equations is wonderful. It took me several minutes to understand it. After understanding it, I find the key to the problem six years ago.

The\ expectation\ steps = \sum_{k\ge 0} Pr\{in\ k\ steps,\ the\ task\ is\ not\ finished\}

So the final task is to compute the probability of the event “in k steps, the task is not finished”. This event is some small events’ union. And then we can use inclusion–exclusion principle to deduce the probability (which can also be known as probability sum).

Pr\{in\ k\ steps,\ the\ task\ is\ not\ finished\} = Pr\{\{lack 1\}\cup\{lack 2\}\dots \cup \{lack N\}\} = \sum_{m=1}^{N}(-1)^{m+1}{N \choose m}(\frac{N-m}{N})^k

So the coupon collector’s problem can be solved by the infinite series:

\sum_{k\ge 0}\sum_{m=1}^{N}(-1)^{m+1}{N \choose m}(\frac{N-m}{N})^k

And as we already know the famous result: NH_N, we get an equation:

NH_N= \sum_{k\ge 0}\sum_{m=1}^{N}(-1)^{m+1}{N \choose m}(\frac{N-m}{N})^k

Let’s write some code to verify this:

from scipy.special import comb


def c(n, k):
    return comb(n, k, exact=True)

def nhn(N):
    return N*sum((1/i) for i in range(1, N+1))

def formula(N, up=1000):
    def p(N, k):
        return sum(((-1)**(m+1)*c(N, m)*((N-m)/N)**k)
                   for m in range(1, N+1))

    return sum(p(N,k) for k in range(up))

def verify():
    for N in range(1, 15):
        a = nhn(N)
        b = formula(N)
        e = abs(a-b)
        assert(e < 1e-8)

if __name__ == "__main__":
    verify()

Running this test it shows that the formula is accurate. But for large N, the python code shows that there are some errors that cannot be eliminated even if we enlarge up. I believe the reason is the precision of float.

We also have a formal proof by induction for this formula. The skills involving the proof are not hard, please try it if interested. The key step is {n+1 \choose m} = {n \choose m} + {n \choose m-1}. Another skill is to switch the sum order so that things become naive.

In concrete mathematics, Donald Knuth provided a direct solution:

1 thought on “Coupon collector’s problem:An infinite-series perspective

  1. Pingback: Coupon collector’s problem: sample without replacement | Thoughts on Computing

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.