Papers and stories
This month, most of my spare time has been spent on the problem of cardinality estimation. When I finished the analysis of HyperLogLog (see the blog A Tour of Understanding Hyperloglog), I remembered a recent SIGMOD 2019 paper, Approximate Distinct Counts for Billions of Datasets, whose idea is to combine Count-Min and HyperLogLog. Before diving into that paper, I decided to study Count-Min first. Another paper, VLDB 2017's survey Cardinality Estimation: An Experimental Survey, also mentions a method called MinCount.
The CM sketch was introduced in the paper An Improved Data Stream Summary: The Count-Min Sketch and its Applications. I do not want to dive deep into that paper; I just want to reword the proof here (maybe in a better way).
CM sketch
There are n distinct events, named a_1, a_2, a_3, \dots, a_n. Suppose we want to count their frequencies; the job is very similar to asking for an approximate result of the following SQL:
select a, count(a) from table group by a
For an accurate result, we might use a hash aggregate, which consumes a lot of memory if n is huge, and its execution time is also long. With the help of the CM sketch, we can finish the job using only a small amount of memory, with acceptable accuracy.
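For contrast, the exact hash-aggregate approach is essentially the following (a toy Python illustration; the dictionary grows with the number of distinct events):

```python
from collections import Counter

# Exact hash aggregate: one dictionary entry per distinct event,
# so memory grows linearly with n.
stream = ["a1", "a2", "a1", "a3", "a1", "a2"]
print(Counter(stream))  # Counter({'a1': 3, 'a2': 2, 'a3': 1})
```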
The idea of the CM sketch is very similar to a Bloom filter. It maintains d different hash functions and assumes that these hash functions are pairwise independent. Each hash function maps the index of an event to a hash value range: h_j: \{1,2,\dots, n\} \to \{1,2,\dots, w\}. The Bloom-filter-like sketch is a two-dimensional array c[d][w]: d means there are d different hash functions, w is the size of the hash range, and the array is initialized with all 0.
The update operation is quite simple. For each appearance of a_i, the algorithm computes d hash values for the index i (because we have d hash functions). Then for every j \in \{1,2,\dots, d\}, c[j][h_{j}(i)] += 1.
By this design, after only one scan of the whole data, the array contains enough information to estimate the result of a, count(a): count(a_i) \approx \min_j c[j][h_j(i)].
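To make the structure concrete, here is a minimal Python sketch of the whole thing (my own illustration, not code from the paper). The hash family ((a x + b) mod p) mod w with random a, b is one standard way to get the pairwise independence the analysis assumes; note it maps to \{0,\dots,w-1\} instead of \{1,\dots,w\}:

```python
import random

class CountMinSketch:
    # A large prime; ((a*x + b) % P) % w with random a, b is a standard
    # pairwise-independent hash family for integer keys x < P.
    P = (1 << 61) - 1

    def __init__(self, d, w, seed=42):
        rng = random.Random(seed)
        self.d, self.w = d, w
        # One random (a, b) pair per hash function h_j.
        self.params = [(rng.randrange(1, self.P), rng.randrange(self.P))
                       for _ in range(d)]
        # The two-dimensional array c[d][w], initialized with all 0.
        self.c = [[0] * w for _ in range(d)]

    def _h(self, j, x):
        a, b = self.params[j]
        return ((a * x + b) % self.P) % self.w

    def update(self, i):
        # One appearance of a_i: c[j][h_j(i)] += 1 for every row j.
        for j in range(self.d):
            self.c[j][self._h(j, i)] += 1

    def query(self, i):
        # count(a_i) ~ min_j c[j][h_j(i)]: collisions only add, never
        # subtract, so the minimum over rows is the least-contaminated cell.
        return min(self.c[j][self._h(j, i)] for j in range(self.d))

# One pass over the stream is enough. d=5, w=272 correspond roughly to
# epsilon = 0.01, delta = 0.01 in the bound proved below.
cms = CountMinSketch(d=5, w=272)
for i in [1, 2, 1, 3, 1, 2]:
    cms.update(i)
print(cms.query(1))  # 3 here; in general it may overestimate, never underestimate
```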
Proof
I do not like the demonstration in the original paper. Let me rewrite it here based on my own understanding.
Let's first take a look at the value c[j][h_j(i)], which will be used for counting a_i. The first question to ask is: can other events also update this cell of the array, and how?
Another event, say a_k, also updates the above cell exactly when h_j(k) = h_j(i), i \ne k. For a specific i, j, the probability of this is Pr(h_j(k) = h_j(i), i \ne k) = \frac{1}{w} (based on the assumption that the hash functions are pairwise independent).
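To spell that step out (my own filling-in): pairwise independence means the pair (h_j(i), h_j(k)) is uniform over the w^2 possible value pairs, so

Pr(h_j(k) = h_j(i)) = \sum_{v=1}^{w} Pr(h_j(i) = v, h_j(k) = v) = \sum_{v=1}^{w} \frac{1}{w^2} = \frac{1}{w}.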
So c[j][h_j(i)] = a_i + \sum_{1\le k \le n} I_{i,j,k} a_k, where (abusing notation slightly) a_k now denotes the count of event a_k. I_{i,j,k} is the indicator variable for whether a_k interferes with a_i: I_{i,j,k} = 1 if i \ne k and h_j(k) = h_j(i); otherwise it is 0.
Since every a_k is non-negative, every cell only overestimates, so choosing the minimum value provides the closest result. Now let's bound the error \sum_{1\le k \le n} I_{i,j,k} a_k. Finding the expectation is always the first step.
E(\sum_{1\le k \le n} I_{i,j,k} a_k) = \sum_{1\le k \le n} E(I_{i,j,k} a_k) = \sum_{1\le k \le n} a_k E(I_{i,j,k}) = \sum_{k \ne i} a_k \frac{1}{w} \le \frac{1}{w} ||a||_1.
Based on the Markov inequality, the error cannot be too far above its expectation. Write X_{i,j} = \sum_{1\le k \le n} I_{i,j,k} a_k for the error of row j, then

Pr(X_{i,j} \ge \epsilon ||a||_1) \le \frac{E(X_{i,j})}{\epsilon ||a||_1} \le \frac{1}{\epsilon w}.

Choosing w = \lceil e/\epsilon \rceil makes this probability at most 1/e. Since the d hash functions are chosen independently, all d rows overshoot at the same time with probability at most e^{-d}, so with d = \lceil \ln(1/\delta) \rceil:

Pr(\min_j c[j][h_j(i)] \ge a_i + \epsilon ||a||_1) \le \delta.

That is, the estimate exceeds the true count by more than \epsilon ||a||_1 with probability at most \delta.
The bound is not tight, and ||a||_1 can be huge.
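For a concrete feel of the numbers (my own example, not from the paper): take \epsilon = 0.001 and \delta = 0.01. Then w = \lceil e/\epsilon \rceil = 2719 and d = \lceil \ln(1/\delta) \rceil = 5, i.e. 13595 counters in total, independent of n; the estimate is off by more than 0.1% of ||a||_1 with probability at most 1%. But if the stream has billions of items, 0.1% of ||a||_1 is still millions, which is useless for low-frequency events.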
End
This blog is just a note from my scan of this paper. The technique and the bound here are not that interesting.