Coupon collector’s problem: sample without replacement

Posted on September 19, 2020 by kainwenlv

Problem

About one year ago, in the post Coupon collector’s problem：An infinite-series perspective I come up with a solution to asymptotically answer the following problem:

The cardinality of the data is $n$ , for each distinct group, there are $a$ duplicates. Suppose we want to use hash aggregation algorithm to do some computation, each time you uniformly sample one data and push it in the hash table (capacity is just $n$ ), exactly right after sampling $m$ data, for the first time you find that the hash table is full. What is the expectation of $m$ ?

The above problem is a model of sampling without replacement. One year ago’s solution is with replacement. Recently, I am reading the book Probability Theory: The Logic of Science. It gives me some idea so I come up with a solution.

Continue reading →

Elementary sampling theory

Posted on July 11, 2020 by kainwenlv

The problems are from the Book Jaynes, Edwin T. Probability theory: The logic of science. Cambridge university press, 2003. Chapter 3 of the book is on the topic of Elementary sampling theory. These problems can be easily solved using the symbolic method of analytical combinatorics.

Continue reading →

Analysis of the number of swapping in partition of Quicksort

Posted on November 25, 2019 by kainwenlv

Quicksort is an elegant and efficient sorting algorithm. Analysis of quick sort is a lot of fun. The core part of quick sort algorithm is partition the original array into two parts. This part of code is shown below (in java):

private void quicksort(int[] a, int lo, int hi) {
    if (hi <= lo) return;
    int i = lo - 1, j = hi;
    int t, v = a[hi];
    while (true) {
        while (a[++i] < v);
        while (v < a[--j])
            if (j == lo) break;
        if (i >= j) break;
        t = a[i]; a[i] = a[j]; a[j] = t;
    }
    t = a[i];
    a[i] = a[hi];
    a[hi] = t;
    quicksort(a, lo, i - 1);
    quicksort(a, i + 1, hi);
}

Continue reading →

Coupon collector’s problem：An infinite-series perspective

Posted on August 20, 2019 by kainwenlv

Background

If you are not familiar with Coupon collector’s problem, please first read the following page:

https://en.wikipedia.org/wiki/Coupon_collector%27s_problem

Six years ago when I was in graduate school, I encountered this problem. I did not want to use Indicator variable method to solve it, so I tried my best to solve it using infinite series. I came up with a complicated formula which was not accurate and I am not satisfied because this means it is wrong.

the wrong formula of six years ago

Continue reading →

Pivotal-hash: a new consistent hash algorithm

Posted on February 13, 2019 by kainwenlv

My previous posts on consistent hash are:

The above algorithm is beautiful while its time complexity is $O(n)$ will introduce some CPU overhead in some extreme conditions compared to simple modulo hashing.

Continue reading →

Proof of select-sample problem

Posted on October 28, 2018 by kainwenlv

The problem and algorithm

My colleague asks me for a proof of the following algorithm:

Continue reading →

Jump Consistent Hash Algorithm

Posted on June 25, 2018 by kainwenlv

One of my colleagues introduced me a new consistent hash algorithm: jump consistent hash. It is a concise algorithm, just in 5 line of code. The paper can be found here: A Fast, Minimal Memory, Consistent Hash Algorithm. The proofs in the paper are not that detailed, so in this blog, I will try to provide a more formal way to help you understand it and build it from scratch.

Continue reading →

A Lemma on Number Theory

Posted on August 12, 2017 by kainwenlv

In today’s techtalk in Pivotal Inc. one of my colleagues asked a question about the benefits of double the batch size of one step in a hashjoin algorithm. I think it is not obvious enough to understand it at first glimpse. So I offer the formal description and proof here to make sure that myself understand it well.

Continue reading →

Thoughts on Computing

英雄无岁，江湖无辈

Category Archives: concrete mathematics

Coupon collector’s problem: sample without replacement

Problem

Elementary sampling theory

Analysis of the number of swapping in partition of Quicksort

Coupon collector’s problem：An infinite-series perspective

Background

Pivotal-hash: a new consistent hash algorithm

Previous Posts

Proof of select-sample problem

The problem and algorithm

Jump Consistent Hash Algorithm

A Lemma on Number Theory