A Lemma on Number Theory

Contents hide

In today’s techtalk in Pivotal Inc. one of my colleagues asked a question about the benefits of double the batch size of one step in a hashjoin algorithm. I think it is not obvious enough to understand it at first glimpse. So I offer the formal description and proof here to make sure that myself understand it well.

The problem in informal language

We have a hash function $h(x) = x$ , and we have a hashtable whose bucket number is $N$ , and we decide to divide each number into different batches. Let $B$ be the number of all the batches, then for number $x$ , its corresponding batch ID is $batch_{id}(x) = (h(x) / N) \b mod B$ .

At some point, we find that the number of numbers in the batch is larger than $N$ , and this means that we need to enlarge the total batch numbers. In the algorithm, we set $B$ to $2B$ . And the benefit is that recalculating the batch id of this batch, the new batch ids are are larger than current batch id. In other words, redistribution can only move current batch elements to the latter batches.

The lemma in formal language

The problem above can be represented in formal language as follows:

$\forall a, b \in N, b > 0, a \bmod 2b \ge a \bmod b$ .

Let’s prove this lemma:

$a \bmod 2b \ge a \bmod b \iff a - 2b \lfloor \frac{a}{2b} \rfloor \ge a - b \lfloor \frac{a}{b} \rfloor \iff \lfloor \frac{a}{b} \rfloor \ge 2\lfloor \frac{a}{2b} \rfloor$