Pivotal-hash: a new consistent hash algorithm

Previous Posts

My previous posts on consistent hash are:

The above algorithm is beautiful while its time complexity is O(n) will introduce some CPU overhead in some extreme conditions compared to simple modulo hashing.

Maglev Hahsing Paper

One of my colleagues searches Google to find another consistent algorithm called maglev (中文这个单词是”磁悬浮”的意思). The original paper is here: Maglev: A Fast and Reliable Software Network Load Balancer. The hash algorithm is just a small piece section of the original paper and lack of mathematical analysis. And there is a golang implementation on Github.

I don’t want to analyze the algorithm here. The algorithm is not beautiful. The key idea is to compute the target segment by looking up in a pre-computed array (called lookup table). The paper suggests that the size of lookup table need to be much larger than the number of the segments (more than 100 times). The computing progress needs to maintain a huge permutation table (size of lookup table times the number of segments). The paper also introduces a permutation table generating algorithm which is complicated (at least for the fist-eye). It mentions that Knuth-reshuffle will do the work well also.

The biggest flaw of the paper in my mind is that the hash algorithm is not monotonous. See the example of the paper:

Lookup tables of different nodes in the original paper

After removing the nodes B1, look at the last hash mapping, B0 changes to B2. This means that if this algorithm is used for some storage system, some data movement will happen among untouched nodes, which in my mind is bad.

Pivotal-Hash algorithm

Based on the idea of lookup table, a beautiful mathematic formula comes to my mind.

n = \lceil \frac{n}{m} \rceil + \lceil \frac{n-1}{m} \rceil + \dots + \lceil \frac{n-m+1}{m} \rceil = \sum_{k=0}^{k=m-1} \lceil \frac{n-k}{m} \rceil

This formula is from the famous book 《Concrete mathematics : a foundation for computer science》. It solves the problem:

This is just the requirement of balance for consistent hashing. So using the above formula we can compute a best configuration given the number of segments and the size of the lookup table. Suppose at first we have n segments, and we decide to add one more segment.

  • Configuration for n: \{s_1, s_2, \dots, s_n\}
  • Configuration for n+1: \{t_1, t_2, \dots, t_n, t_{n+1}\}

It is easy to verify that s_i \ge t_i . We can explain t_{n+1} like this:

t_{n+1} = (s_1-t_1) + (s_2 - t_2) + \dots + (s_n-t_n)

The above formula tells us that we could find a consistent algorithm that is monotonous.

Below is my python code to implement this algorithm:

#!/usr/bin/env python3
#-*- coding: utf-8 -*-

from copy import deepcopy
from math import ceil
from collections import defaultdict

class Phash:

    def __init__(self, Nsegs, M):
        self.Nsegs = Nsegs
        self.M = M

    def build_lookup(self):
        lookup = [0] * self.M
        for i in range(1, self.Nsegs):
            lookup = self._build_bookup(i, lookup)
        return lookup

    def _build_bookup(self, i, lookup):
        from i-1 enlarge to i
        prev_config = self.compute_config(i-1)
        curr_config = self.compute_config(i)
        delta = [(n1-n2)
                 for n1, n2 in zip(prev_config, curr_config)]

        cp_lookup = deepcopy(lookup)
        for index, seg in enumerate(lookup):
            d = delta[seg]
            if d == 0:
            cp_lookup[index] = i
            delta[seg] -= 1

        self.is_good(lookup, cp_lookup)

        return cp_lookup

    def compute_config(self, i):
        ii = i + 1
        return [int(ceil(float(self.M - x) / ii))
                for x in range(ii)]

    def report(self, lookup):
        d = defaultdict(int)
        for i in lookup:
            d[i] += 1

        for k, v in d.items():
            print(k, v)

    def is_good(self, lk1, lk2):
        new_seg = max(lk2)
        for a, b in zip(lk1, lk2):
            if b != new_seg:
                assert(a == b)

if __name__ == "__main__":
    p = Phash(5, 17)

Some notes


Jump consistent algorithm contains almost pure-computing instructions. However, lookup table method has to access memory. If the size of lookup table is large (in fact, it should be large), the table cannot be put into cache. Accessing memory is much slower than accessing registers.

How to chooces M?

At the first eye, M should be larger than the number of segments. But how larger? Let’s do some mathematical reasoning.

For the best balance algorithm, pivotal-hash, if we denote the number of slots in lookup table occupied by segment i as nslot(i) then max(nslot(i) - nslot(j)) = 1.

So the skew-ratio can be modeld as \frac{1}{\frac{M}{N}} = \frac{N}{M}. We surely want the skew-ratio to be as smaller as possible. If the tolerate-boundary is a, then \frac{N}{M} < a \implies M > \frac{N}{a}.

For tolerate-boundary is a=1\%, for the best algorithm, we have to set M > 100N.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.