The PR (Implement two-stage sampling for AO/AOC analyze #11190) of Greenplum catches my attention. The PR, like many other commits of Greenplum, is perfectly documented and at the top of the commit message it introduces two algorithms of sampling:

- Algorithm S from Knuth
- Algorithm Z from Vitter

Then I become interested in these algorithms, search some resource in the Internet and understand quite a bit of the topic so here comes this blog.

## The big picture

- We should only scan one pass of the data.
- Each algorithm should maintain a state of the past:
- how many records have been scanned
- the candidate set until now

- Based on the state, when touching a new record, design a probability that this tuple be taken into the candidate set
- Do a binary random experiment with the above probability, if it succeeds, take that tuple into the set, otherwise skip it
- Update the state

By carefully designing the state and probability, we can get different sampling algorithms and the key idea should always keep in mind is to maintain the ** loop invariants**.

## Algorithm S: the data size N is known

I have written a blog for the proof of this algorithm, so just leave a link here: ** Proof of select-sample problem**. Algorithm S and Algorithm R is also discussed by Knuth in

**.**

*The Art of Computer Programming, Vol 2, 3.4.2 p.142*## Algorithm R: the data size N is unknown

Let me re-state the above algorithm in informal language first. The problem is to randomly sample n records in a file without knowledge of how many records in that file. Only when getting the EOF can we know we scan all the content of the file. N is the number of records in the file, but we do not know it.

The idea of algorithm R is:

- initialize a candidate set (called a reservoir) with the first n records
- scan the remaining records until EOF
- for ith (start from n+1) each record, the corresponding probability to take it into the candidate set is \frac{n}{i}
- If the binary random experiment shows we should take the record into the set, then uniform-randomly replace one that already in the set

- When EOF, the n records in the reservoir are the sample result

To prove the sample algorithm is correct, we should prove two points:

- every time the algorithm ends, it output exactly n records
- any record in the file is in the final result set with the same probability, that is if the file contains N records, each record has the probability \frac{n}{N} being in the result

The first point of the algorithm is quite easy to verify. Let’s spend some time on the proof of the second point.

For one of the first n records, say the record i, i \le n if it is still in the reservoir finally, it means that all the following binary random experiments do not get the index of it. Let R_k be the result of the Record k’s binary random experiment, so the probability is:

Pr\{i\ in\ final\ set\} = \prod_{k=n+1}^{N} Pr\{R_k \ne i\} = \prod_{k=n+1}^{N} \frac{k-1}{k} = \frac{n}{N}For one of the following record, say the record i, i > n , if it is in the final reservoir, it means that:

- 1 \le R_i \le n , choose in
- \forall j > i, R_i \ne R_j , not replaced by later records

We have shown the algorithm’s correctness. We scan the file record by record, use a binary random experiment to decide if we take it into the reservoir and if we decide to replace then uniform-randomly choose one to replace. **We also will skip many tuples without touching them during the whole algorithm.** This is an important view of point to develop the following algorithms.

## Algorithm X,Y,Z: make Algorithm R faster

Now let’s develop faster algorithms with the knowledge of the analysis of Algorithm R. Since we will skip records, let’s use this point to re-state the Algorithm R in a new framework.

- Initialize the reservoir to contain the first n records of the file
- Loop until EOF
- Compute the jump steps s based on the current state (how many records have been scanned or skipped)
- Skip the following s records (means do not need to do the binary random experiment for them at all, or we can imagine that all of them’s experiments tell us not to take into them)
- If not EOF, take into the next record and uniform-randomly replace one record in the reservoir

- The n records in the reservoir are just the result

*The new perspective emphasize the concept of skipping records. The essence is just Algorithm R but a new perspective may lead to fast implementation. The algorithm R simulates the skip process one record by one record, for each record, it has to play the binary random experiment once. Can we get the probability distribution function of the skip steps and based on the probability distribution function generate the value of skip? This may be a faster simulation of the random process.*

Now let’s do some math to get the correct probability distribution function of the skip step given the number of records we have processed. The valid range of the value of skip steps is the natural numbers \mathcal{N} . We are going to develop a function Pr\{skip\ s\ records\ when\ processed\ t\ records\ already\} = f_{t} (s) . Started from t, the following records t,\mathbf{t+1,t+2,\dots,t+s},t+s+1, the records in bold are the skipping records.

f_t(s) = \frac{t+1-n}{t+1} \frac{t+2-n}{t+2}\dots \frac{t+s-n}{t+s} \frac{n}{t+s+1}It is easy to verify that the above is a valid probability distribution function. We might re-write the above equation using falling power notation, but to me, the above formula is the best for proving the algorithm.

A useful recursive relation of f_t(s) is:

\frac{f_t(s+1)}{f_t(s)} = \frac{t+s+1-n}{t+s+2}We will use the above formula later.

To prove the above framework using the formula f_t(s) is correct, the most important part is to prove that for any later records, the probability to choose it in (not skip) is \frac{n}{k}, k>n for the Record k.

Pr\{k\ choosed\ in\} = \sum_{i=n}^{k-1} Pr\{i\ choosed\ in\} Pr\{start\ from\ i, skip\ k-1-i\ steps\} = \sum_{i=n}^{k-1} Pr\{i\ choosed\ in\} f_i(k-1-i)We want to show the above formula is just \frac{n}{k} .

Here I skip the fundamental case since it is easy to verify the above formula for k=n+1 . By induction, we assume that for all n \le m \le k, Pr\{m\ choosed\ in\} = \frac{n}{m} , let’s try to prove Pr\{k+1\ choosed\ in\} = \frac{n}{k+1} .

Pr\{k+1\ choosed\ in\} = \sum_{i=n}^{k} Pr\{i\ choosed\ in\} f_i(k-i) = \sum_{i=n}^{k}\frac{n}{i}f_i(k-i) = \frac{n}{k}f_k(0) + \sum_{i=n}^{k-1}\frac{n}{i}f_i(k-i) f_i(k-i) = f_i(k-1-i) \frac{k-n}{k+1} f_k(0) = 1 - \frac{k+1-n}{k+1}With all the above equations, we can get Pr\{k+1\ choosed\ in\} = \frac{n}{k+1}, and finish the induction proof.

*Next, to develop the faster algorithms than Algorithm R, in fact, we are just developing the faster algorithms for generate random integer number s given the probability distribution function f_t(s) .*

The detailed analysis and optimization of Algorithm Z involves much math work. I am not going to elaborate here. It deserves a long tutorial in a single PDF version (like hyperloglog I did last year).

Postgres’s implementation of Algorithm Z can be found in: `src/backend/utils/misc/sampling.c:reservoir_init_selection_state`

and `src/backend/utils/misc/sampling.c:reservoir_get_next_S`

.

## My thoughts

The pattern of the algorithms is scanning the records and doing some binary random experiment (Bernoulli experiment). This makes jump consistent hashing algorithm comes to my mind immediately. I have a post on ** Jump Consistent Hash Algorithm** and the analysis method is just the same as here.

From the very fundamental principle, we design the state of the past and the probability distribution function for each Bernoulli experiment. Then, after analysis of the correctness, we simulate the process in another view of the point: jumping (or skipping) to save Bernoulli experiment times. This will improve the efficiency of the algorithm.