Papers and stories
This month, most of my spare time has been spent on the problem of cardinality estimation. When I finished the analysis of HyperLogLog (see the blog A Tour of Understanding Hyperloglog), I remembered a recent SIGMOD 2019 paper, Approximate Distinct Counts for Billions of Datasets, whose idea is to combine Count-Min and HyperLogLog. Before diving into that paper, I decided to study Count-Min first. Another paper, VLDB 2017's survey Cardinality Estimation: An Experimental Survey, also mentions a method called MinCount.
The CM sketch was introduced in the paper An Improved Data Stream Summary: The Count-Min Sketch and its Applications. I do not want to dive deep into that paper; I just want to reword the proof here (maybe in a better way).
CM sketch
There are n distinct events, named a_1, a_2, a_3, \dots, a_n. Suppose we want to count their frequencies; the job is very similar to asking for an approximate result of the following SQL:
select a, count(a) from table group by a
For an accurate result, we might use a hash aggregate, which consumes a lot of memory if n is huge, and its execution time is also long. With the help of the CM sketch, we can finish the job using only a small amount of memory, with acceptable accuracy.
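For contrast, the exact hash-aggregate approach is essentially the following (a toy Python illustration; the dictionary grows with the number of distinct events):

```python
from collections import Counter

# Exact hash aggregate: one dictionary entry per distinct event,
# so memory grows linearly with n.
stream = ["a1", "a2", "a1", "a3", "a1", "a2"]
print(Counter(stream))  # Counter({'a1': 3, 'a2': 2, 'a3': 1})
```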
The idea of the CM sketch is very similar to a Bloom filter. It maintains d different hash functions and assumes that these hash functions are pairwise independent. Each hash function maps the index of an event to a hash value range: h_j: \{1,2,\dots, n\} \to \{1,2,\dots, w\}. The Bloom-filter-like sketch is a two-dimensional array c[d][w]: d means there are d different hash functions, w is the size of the hash range, and the array is initialized with all 0.
The update operation is quite simple. For each appearance of a_i, the algorithm computes d hash values for the index i (because we have d hash functions). Then for every j \in \{1,2,\dots, d\}, c[j][h_{j}(i)] += 1.
By this design, after only one scan of the whole data, the array contains enough information to estimate the result of a, count(a): count(a_i) \approx \min_j c[j][h_j(i)].
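To make the structure concrete, here is a minimal Python sketch of the whole thing (my own illustration, not code from the paper). The hash family ((a x + b) mod p) mod w with random a, b is one standard way to get the pairwise independence the analysis assumes; note it maps to \{0,\dots,w-1\} instead of \{1,\dots,w\}:

```python
import random

class CountMinSketch:
    # A large prime; ((a*x + b) % P) % w with random a, b is a standard
    # pairwise-independent hash family for integer keys x < P.
    P = (1 << 61) - 1

    def __init__(self, d, w, seed=42):
        rng = random.Random(seed)
        self.d, self.w = d, w
        # One random (a, b) pair per hash function h_j.
        self.params = [(rng.randrange(1, self.P), rng.randrange(self.P))
                       for _ in range(d)]
        # The two-dimensional array c[d][w], initialized with all 0.
        self.c = [[0] * w for _ in range(d)]

    def _h(self, j, x):
        a, b = self.params[j]
        return ((a * x + b) % self.P) % self.w

    def update(self, i):
        # One appearance of a_i: c[j][h_j(i)] += 1 for every row j.
        for j in range(self.d):
            self.c[j][self._h(j, i)] += 1

    def query(self, i):
        # count(a_i) ~ min_j c[j][h_j(i)]: collisions only add, never
        # subtract, so the minimum over rows is the least-contaminated cell.
        return min(self.c[j][self._h(j, i)] for j in range(self.d))

# One pass over the stream is enough. d=5, w=272 correspond roughly to
# epsilon = 0.01, delta = 0.01 in the bound proved below.
cms = CountMinSketch(d=5, w=272)
for i in [1, 2, 1, 3, 1, 2]:
    cms.update(i)
print(cms.query(1))  # 3 here; in general it may overestimate, never underestimate
```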
Proof
I do not like the demonstration in the original paper. Let me rewrite it here based on my own understanding.
Let's first take a look at the value c[j][h_j(i)], which will be used for counting a_i. The first question to ask is: can other events also update this cell of the array, and how?
Another event, say a_k, also updates the above cell exactly when h_j(k) = h_j(i), i \ne k. For a specific i, j, the probability of this is Pr(h_j(k) = h_j(i), i \ne k) = \frac{1}{w} (based on the assumption that the hash functions are pairwise independent).
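To spell that step out (my own filling-in): pairwise independence means the pair (h_j(i), h_j(k)) is uniform over the w^2 possible value pairs, so

Pr(h_j(k) = h_j(i)) = \sum_{v=1}^{w} Pr(h_j(i) = v, h_j(k) = v) = \sum_{v=1}^{w} \frac{1}{w^2} = \frac{1}{w}.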
So c[j][h_j(i)] = a_i + \sum_{1\le k \le n} I_{i,j,k} a_k, where (abusing notation slightly) a_k now denotes the count of event a_k. I_{i,j,k} is the indicator variable for whether a_k interferes with a_i: I_{i,j,k} = 1 if i \ne k and h_j(k) = h_j(i); otherwise it is 0.
Since every a_k is non-negative, every cell only overestimates, so choosing the minimum value provides the closest result. Now let's bound the error \sum_{1\le k \le n} I_{i,j,k} a_k. Finding the expectation is always the first step.
E(\sum_{1\le k \le n} I_{i,j,k} a_k) = \sum_{1\le k \le n} E(I_{i,j,k} a_k) = \sum_{1\le k \le n} a_k E(I_{i,j,k}) = \sum_{k \ne i} a_k \frac{1}{w} \le \frac{1}{w} ||a||_1.
Based on the Markov inequality, the error cannot be too far above its expectation. Write X_{i,j} = \sum_{1\le k \le n} I_{i,j,k} a_k for the error of row j, then

Pr(X_{i,j} \ge \epsilon ||a||_1) \le \frac{E(X_{i,j})}{\epsilon ||a||_1} \le \frac{1}{\epsilon w}.

Choosing w = \lceil e/\epsilon \rceil makes this probability at most 1/e. Since the d hash functions are chosen independently, all d rows overshoot at the same time with probability at most e^{-d}, so with d = \lceil \ln(1/\delta) \rceil:

Pr(\min_j c[j][h_j(i)] \ge a_i + \epsilon ||a||_1) \le \delta.

That is, the estimate exceeds the true count by more than \epsilon ||a||_1 with probability at most \delta.
The bound is not tight, and ||a||_1 can be huge.
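For a concrete feel of the numbers (my own example, not from the paper): take \epsilon = 0.001 and \delta = 0.01. Then w = \lceil e/\epsilon \rceil = 2719 and d = \lceil \ln(1/\delta) \rceil = 5, i.e. 13595 counters in total, independent of n; the estimate is off by more than 0.1% of ||a||_1 with probability at most 1%. But if the stream has billions of items, 0.1% of ||a||_1 is still millions, which is useless for low-frequency events.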
End
This blog is just a note from my scan of this paper. The technique and the bound here are not that interesting.