The problem and algorithm
My colleague asked me for a proof of the following algorithm:

Let me re-state the problem and algorithm using pseudo-code:
/*
 * select_sample : List -> int -> int -> List
 * - Data : the dataset to sample from; indices start from 1
 * - N    : the number of elements in Data
 * - n    : we sample exactly n elements from Data
 */
List select_sample(List Data, int N, int n)
{
    List result = new List;
    int i = 1;                  // index of the element currently being scanned
    int already_scanned = 0;
    int already_picked = 0;
    while (true) {
        int remain_to_pick = n - already_picked;
        int remain_to_scan = N - already_scanned;
        if (remain_to_pick == 0)
            return result;
        double u = random_uniform(0, 1);
        if (remain_to_scan * u < remain_to_pick) {
            // pick the element Data[i]
            result.append(Data[i]);
            already_picked++;
        }
        already_scanned++;
        i++;                    // advance to the next element
    }
}
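For readers who want to run the algorithm directly, here is a minimal Python translation of the pseudo-code above (a sketch: it assumes random.random() plays the role of random_uniform(0, 1), takes N from len(data), and drops the explicit index i because Python lists are 0-indexed):

import random

def select_sample(data, n):
    # selection sampling: scan data in order and return exactly n elements
    N = len(data)
    result = []
    already_scanned = 0
    already_picked = 0
    while True:
        remain_to_pick = n - already_picked
        remain_to_scan = N - already_scanned
        if remain_to_pick == 0:
            return result
        u = random.random()  # uniform draw from [0, 1)
        if remain_to_scan * u < remain_to_pick:
            # pick the element currently being scanned
            result.append(data[already_scanned])
            already_picked += 1
        already_scanned += 1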
The above pseudo-code is easy to understand, and it comes with two claims:
- Claim 1: If S = select_sample(Data, N, n), then length(S) = n
- Claim 2: For any 1 <= i <= N, P{Data[i] in select_sample(Data, N, n)} = n/N
Let's focus on Claim 2 and leave Claim 1 to the reader.
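Before diving into the proof, both claims can be checked empirically. The sketch below reuses the Python translation above, with hypothetical parameters N = 10 and n = 3 (chosen only for illustration), and estimates the inclusion frequency of every element by simulation:

from collections import Counter

def estimate_inclusion(N=10, n=3, trials=100_000):
    data = list(range(1, N + 1))
    counts = Counter()
    for _ in range(trials):
        sample = select_sample(data, n)
        assert len(sample) == n      # Claim 1: exactly n elements are returned
        counts.update(sample)
    # Claim 2: every element should appear with frequency close to n / N
    return {x: counts[x] / trials for x in data}

print(estimate_inclusion())  # every value should be close to 3/10 = 0.3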
Proof that P{Data[i] in select_sample(Data, N, n)} = n/N
Let’s first see some small cases.
i = 1: what is the probability of picking Data[1]?
Here remain_to_pick = n and remain_to_scan = N, so P_1 = n / N.
i = 2: what is the probability of picking Data[2]?
This probability depends on whether Data[1] has been picked:
- If Data[1] has been picked (probability \frac{n}{N}), then remain_to_pick = n-1 and remain_to_scan = N-1, contributing \frac{n}{N} * \frac{n-1}{N-1}
- If Data[1] has not been picked (probability \frac{N-n}{N}), then remain_to_pick = n and remain_to_scan = N-1, contributing \frac{N-n}{N} * \frac{n}{N-1}
So P_2 = \frac{n}{N} * \frac{n-1}{N-1} + \frac{N-n}{N} * \frac{n}{N-1} = \frac{n}{N}
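As a quick sanity check with concrete numbers (say N = 5 and n = 2, values chosen only for illustration): P_2 = \frac{2}{5} * \frac{1}{4} + \frac{3}{5} * \frac{2}{4} = \frac{2}{20} + \frac{6}{20} = \frac{8}{20} = \frac{2}{5} = \frac{n}{N}.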
……
Let's stop here and examine the reasoning process in general: what is the probability of picking Data[k], for k >= 1?
The picking event {remain_to_scan * u < remain_to_pick} depends on the current state, which is fully described by the pair (remain_to_scan, remain_to_pick). When we scan Data[k], clearly remain_to_scan = N-k+1, so we only have to consider the possible values of remain_to_pick.
It is more convenient to condition on already_picked. Let m denote the number of elements already picked from Data[1], Data[2], …, Data[k-1], and assume 0 <= m <= k-1 < n (because otherwise the algorithm would have terminated).
So the formula has the pattern:
P_k = \sum_{m=0}^{k-1} P\{we\_have\_picked\_exact\_m\_already\} \cdot \frac{n-m}{N-k+1}

Our task now is to determine the formula for P\{we\_have\_picked\_exact\_m\_already\}. It is very similar to a binomial-style sampling problem. There are k-1 positions in the scanning path. To each position j we attach a probability of the form \frac{n-p}{N-j+1} for picking (where p is the number of elements picked before position j) or 1-\frac{n-p}{N-j+1} for not picking, and we multiply these probabilities along the path.
We observe that the denominator of this product is always N^{\underline {k-1}} (a falling factorial power). The remaining job is to find the numerator when we pick exactly m among the first k-1 elements. If a position is the i-th pick, it contributes n-i+1 as a factor to the numerator; if it is the i-th non-pick, it contributes N-n-i+1. So
P\{we\_have\_picked\_exact\_m\_already\} = \frac{{k-1 \choose m}n^{\underline m}(N-n)^{\underline {k-1-m}}}{N^{\underline {k-1}}}.
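This closed form can be cross-checked mechanically. The sketch below uses hypothetical parameters N = 7, n = 3, k = 3 (chosen so that k-1 < n), computes P\{we\_have\_picked\_exact\_m\_already\} by exact dynamic programming over the scanning process, and compares it to the formula:

from fractions import Fraction
from math import comb

def ff(x, j):
    # falling factorial power: x * (x-1) * ... * (x-j+1)
    out = 1
    for t in range(j):
        out *= x - t
    return out

def picked_exactly(N, n, k):
    # P{exactly m picked among the first k-1 scans}, for m = 0 .. k-1,
    # computed by stepping through the process with exact fractions
    probs = {0: Fraction(1)}
    for j in range(1, k):                    # scanning position j
        remain_to_scan = N - j + 1
        nxt = {}
        for m, p in probs.items():
            pick = Fraction(n - m, remain_to_scan)
            nxt[m + 1] = nxt.get(m + 1, Fraction(0)) + p * pick
            nxt[m] = nxt.get(m, Fraction(0)) + p * (1 - pick)
        probs = nxt
    return probs

N, n, k = 7, 3, 3
dp = picked_exactly(N, n, k)
for m in range(k):
    closed = Fraction(comb(k - 1, m) * ff(n, m) * ff(N - n, k - 1 - m), ff(N, k - 1))
    print(m, dp.get(m, Fraction(0)) == closed)   # True for every m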
And we finally get the complete probability formula:
P_k=\frac{\sum_{m=0}^{k-1}{k-1 \choose m}n^{\underline m}(N-n)^{\underline {k-1-m}}(n-m)}{N^{\underline k}}

Since n^{\underline m}(n-m) = n\,(n-1)^{\underline m} and N^{\underline k} = N\,(N-1)^{\underline {k-1}}, this is just

P_k=\frac{n}{N}\cdot\frac{\sum_{m=0}^{k-1}{k-1 \choose m}(n-1)^{\underline m}(N-n)^{\underline {k-1-m}}}{(N-1)^{\underline {k-1}}}

And falling factorial powers also satisfy the famous binomial theorem:

(x+y)^{\underline r} = \sum_{j=0}^{r}{r \choose j}x^{\underline j}y^{\underline {r-j}}

Please refer to Falling Factorial Series for more background.
Applying this theorem with x = n-1, y = N-n and r = k-1 (note that (n-1)+(N-n) = N-1), the sum in the numerator equals (N-1)^{\underline {k-1}}. So P_k = \frac{n}{N}, and we have proved the claim.
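The same kind of exact-arithmetic check also confirms the final formula numerically. The sketch below evaluates P_k from the derived formula for every k, with hypothetical N = 7 and n = 3, and verifies that it always equals n/N:

from fractions import Fraction
from math import comb

def ff(x, j):
    # falling factorial power: x * (x-1) * ... * (x-j+1)
    out = 1
    for t in range(j):
        out *= x - t
    return out

def P_k(N, n, k):
    # P{Data[k] is picked}, evaluated from the derived formula
    num = sum(comb(k - 1, m) * ff(n, m) * ff(N - n, k - 1 - m) * (n - m)
              for m in range(k))
    return Fraction(num, ff(N, k))

N, n = 7, 3
print(all(P_k(N, n, k) == Fraction(n, N) for k in range(1, N + 1)))  # True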
Proof of the falling factorial binomial theorem
Let me finish the blog by proving the binomial theorem for falling factorial powers (the identity stated above).
This sort of equation is easy to prove by induction. Let’s do induction on n.
Base case: n=0 holds trivially, since both sides equal 1 (empty products).
Induction Step: given (x+y)^{\underline n} = \sum_{k=0}^{n}{n \choose k}x^{\underline k}y^{\underline {n-k}}, our goal is to prove (x+y)^{\underline {n+1}} = \sum_{k=0}^{n+1}{n+1 \choose k}x^{\underline k}y^{\underline {n+1-k}}.
(x+y)^{\underline {n+1}} = (x+y)^{\underline n}(x+y-n) = \sum_{k=0}^{n}{n \choose k}x^{\underline k}y^{\underline {n-k}}\big((x-k)+(y-n+k)\big)

= \sum_{k=0}^{n}{n \choose k}x^{\underline {k+1}}y^{\underline {n-k}} + \sum_{k=0}^{n}{n \choose k}x^{\underline k}y^{\underline {n-k+1}}

= x^{\underline {n+1}} + \sum_{k=0}^{n-1}{n \choose k}x^{\underline {k+1}}y^{\underline {n-k}} + \sum_{k=1}^{n}{n \choose k}x^{\underline k}y^{\underline {n-k+1}} + y^{\underline {n+1}}

= x^{\underline {n+1}} + \sum_{k=0}^{n-1}{n \choose k}x^{\underline {k+1}}y^{\underline {n-k}} + \sum_{k=0}^{n-1}{n \choose {k+1}}x^{\underline {k+1}}y^{\underline {n-k}} + y^{\underline {n+1}}

where the third step separates out the k = n term of the first sum and the k = 0 term of the second, and the fourth step re-indexes the second sum by k -> k+1. And since {n \choose k}+{n \choose {k+1}} = {{n+1} \choose {k+1}}, the two middle sums combine into \sum_{k=0}^{n-1}{{n+1} \choose {k+1}}x^{\underline {k+1}}y^{\underline {n-k}}; together with the boundary terms x^{\underline {n+1}} and y^{\underline {n+1}}, this is exactly \sum_{k=0}^{n+1}{{n+1} \choose k}x^{\underline k}y^{\underline {n+1-k}}. We have proved the theorem.
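As a quick mechanical confirmation of the identity (independent of the induction proof), the small sketch below checks it for a range of non-negative integer values of x, y, and n:

from math import comb

def ff(x, j):
    # falling factorial power: x * (x-1) * ... * (x-j+1)
    out = 1
    for t in range(j):
        out *= x - t
    return out

ok = all(
    ff(x + y, n) == sum(comb(n, k) * ff(x, k) * ff(y, n - k) for k in range(n + 1))
    for x in range(8) for y in range(8) for n in range(8)
)
print(ok)  # True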
In fact, this algorithm resembles one of the basic ideas behind the famous family of Markov Chain Monte Carlo (MCMC) algorithms, namely acceptance-rejection sampling. Although the connection may not be obvious, they share the same logic. Furthermore, via acceptance-rejection sampling we can generate samples following essentially arbitrary distributions, which in your case might mean a set of samples weighted by the magnitude of each record, if every record is merely a real number: the larger a number is, the more likely it is to be chosen. This technique is used in many sampling algorithms, including MCMC itself, which has changed the entire face of stochastic simulation and happened to be the given topic of my course project in Stochastic Processes a few weeks ago.
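To make the connection concrete, here is a minimal sketch of acceptance-rejection sampling (the function names and the triangular target density are only illustrative assumptions, not part of the algorithm above): draw a candidate x from a proposal density g with f(x) <= M * g(x), and accept it with probability f(x) / (M * g(x)).

import random

def rejection_sample(target_pdf, proposal_sampler, proposal_pdf, M, size):
    # acceptance-rejection sampling: keep a proposal draw x with
    # probability target_pdf(x) / (M * proposal_pdf(x))
    out = []
    while len(out) < size:
        x = proposal_sampler()
        if random.random() * M * proposal_pdf(x) < target_pdf(x):
            out.append(x)
    return out

# Example: sample from the triangular density f(x) = 2x on [0, 1]
# using the uniform density on [0, 1] as the proposal (g(x) = 1, M = 2),
# so larger values are more likely to be chosen.
samples = rejection_sample(lambda x: 2 * x, random.random, lambda x: 1.0, 2.0, 10_000)
print(sum(samples) / len(samples))  # should be close to E[X] = 2/3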
The pseudo code seems broken, where is i incremented?
You are correct, the index `i` should increase every loop.