In today’s techtalk in Pivotal Inc. one of my colleagues asked a question about the benefits of double the batch size of one step in a hashjoin algorithm. I think it is not obvious enough to understand it at first glimpse. So I offer the formal description and proof here to make sure that myself understand it well.
The problem in informal language
We have a hash function , and we have a hashtable whose bucket number is
, and we decide to divide each number into different batches. Let
be the number of all the batches, then for number
, its corresponding batch ID is
.
At some point, we find that the number of numbers in the batch is larger than , and this means that we need to enlarge the total batch numbers. In the algorithm, we set
to
. And the benefit is that recalculating the batch id of this batch, the new batch ids are are larger than current batch id. In other words, redistribution can only move current batch elements to the latter batches.
The lemma in formal language
The problem above can be represented in formal language as follows:
.
Let’s prove this lemma:
let , so
, according to lemma in concrete mathematics Page 69’s (3.5), we have:
Now, according to concrete mathematics Page 69’s (3.7), we have
Proof end.
Other enlarging strategies?
If we just increase the batch size by , would we still enjoy the good property
?
The answer is no. To disprove it, all we need to do is to offer a bad case: . Then
.