Projects

In recent years I have focused on developing Greenplum; my other projects can be found on my GitHub page.

https://github.com/greenplum-db/gpdb

My Greenplum commits:

Improve the cost model of hash aggregation for Greenplum

  • Implement an algorithm to estimate the number of tuples that may be spilled to disk when a two-stage hash aggregation plan is chosen. The algorithm is based on tools from analytic combinatorics
  • Implement an algorithm to compute the number of groups on each segment when the group keys do not match the distribution keys
  • Enable the Greenplum planner to choose intelligently between one-stage and two-stage aggregation plans
  • Code: https://github.com/greenplum-db/gpdb/pull/8439
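
To give a feel for the per-segment group estimate, here is a minimal sketch of a textbook occupancy estimate. It is not the algorithm from the PR: it assumes every group has the same number of rows and that rows land on segments uniformly at random, and the function and parameter names are made up for illustration.

```python
def expected_groups_per_segment(total_rows: float,
                                total_groups: float,
                                num_segments: int) -> float:
    """Expected number of distinct groups that show up on one segment.

    Assumptions (illustrative, not taken from Greenplum): all groups have
    the same size, and each row is placed on a segment uniformly at random,
    independently of its group key.
    """
    if total_rows <= 0 or total_groups <= 0:
        return 0.0
    rows_per_group = total_rows / total_groups
    # Probability that none of a group's rows land on a given segment.
    p_absent = (1.0 - 1.0 / num_segments) ** rows_per_group
    return total_groups * (1.0 - p_absent)


if __name__ == "__main__":
    # Few large groups: almost every group is present on every segment.
    print(expected_groups_per_segment(1_000_000, 10, 4))
    # One row per group: each segment sees about 1/4 of the groups.
    print(expected_groups_per_segment(1000, 1000, 4))
```

When almost every group appears on every segment, a first-stage aggregation on each segment removes few groups but can still shrink the row count a lot; when groups are nearly unique, the first stage does little work of value. A planner can weigh this kind of estimate when choosing between the two plan shapes.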

Merge with PostgreSQL v12


Global Deadlock Detector

  • Implement the global deadlock detector and prove the correctness of the algorithm
  • Downgrade the lock level on master so that Greenplum can be used in OLTP scenarios
  • UPDATE|DELETE TPC-B-like benchmarks become about 100 times faster
  • A patent
  • Code: https://github.com/greenplum-db/gpdb/pull/4810 (there are several later commits, but this is the main one)
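
The general idea of detecting a deadlock from a global wait-for graph can be sketched as follows. This is a deliberately simplified illustration, not the detector from the PR: it assumes the wait-for edges of all segments have already been collected into one list of (waiter, holder) pairs, and the function name is hypothetical.

```python
def blocked_forever(edges):
    """Return the transactions that can never make progress.

    `edges` is an iterable of (waiter, holder) pairs, assumed to have been
    gathered from every segment. The graph is reduced by repeatedly
    dropping edges whose holder waits for nobody: such a holder can
    finish and release its locks. Any edges left over lie on a cycle or
    lead into one.
    """
    remaining = set(edges)
    while True:
        waiters = {waiter for waiter, _ in remaining}
        # Keep only edges whose holder is itself still waiting.
        reduced = {(w, h) for (w, h) in remaining if h in waiters}
        if reduced == remaining:
            break
        remaining = reduced
    return {waiter for waiter, _ in remaining}


if __name__ == "__main__":
    # 1 waits for 2 and 2 waits for 1: a deadlock; 3 is stuck behind it.
    print(blocked_forever([(1, 2), (2, 1), (3, 1)]))
    # A plain chain 1 -> 2 -> 3 resolves once 3 finishes.
    print(blocked_forever([(1, 2), (2, 3)]))
```

In a distributed database, no single segment may see a whole cycle, which is why the edges have to be combined before a reduction of this kind can find it.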

Consistent hashing in distributed systems and the Pivotal hash algorithm

  • Introduce jump consistent hashing in Greenplum to make cluster expansion faster
  • Develop a new consistent hash, the pivotal hash algorithm, which is much better than Google’s Maglev. This is the best consistent hash algorithm based on a lookup table
  • Develop an index-based algorithm to speed up scanning for the tuples that need to move
  • Code: https://github.com/greenplum-db/gpdb/pull/5426 (for the pivotal hash and index algorithms, please refer to my blog posts)
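
For readers unfamiliar with jump consistent hashing, below is a small Python port of the published algorithm by Lamping and Veach. It is only a sketch of the technique: the Greenplum code is written in C and may differ in details, and the `moved_fraction` helper is a hypothetical addition for illustration.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF  # emulate 64-bit unsigned arithmetic


def jump_consistent_hash(key: int, num_buckets: int) -> int:
    """Map a 64-bit key to a bucket in [0, num_buckets)."""
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    key &= MASK64
    b, j = -1, 0
    while j < num_buckets:
        b = j
        # Linear congruential step used as a per-key pseudo-random stream.
        key = (key * 2862933555777941757 + 1) & MASK64
        j = int((b + 1) * (float(1 << 31) / float((key >> 33) + 1)))
    return b


def moved_fraction(keys, old_n: int, new_n: int) -> float:
    """Fraction of keys whose bucket changes when growing old_n -> new_n."""
    keys = list(keys)
    moved = sum(
        1 for k in keys
        if jump_consistent_hash(k, old_n) != jump_consistent_hash(k, new_n)
    )
    return moved / len(keys)


if __name__ == "__main__":
    # Growing from 4 to 5 buckets should move roughly 1/5 of the keys.
    print(moved_fraction(range(100_000), 4, 5))
```

The property that matters for expansion is that when a bucket is added, a key either stays where it is or moves to the new bucket; no key moves between two old buckets. This keeps the amount of data to redistribute small.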

Optimization of locks and OLTP performance for Greenplum

  • Investigate SELECT statements with a locking clause in an MPP database and optimize some simple cases
  • Investigate the split-update technique and adopt it for the MPP database
  • Investigate locks, partitioned tables, and transactions in depth
  • Code: https://github.com/greenplum-db/gpdb/pull/7635 (there are also later commits, but this is the main one)

Serialize DMLs for GreenplumDB


Online expansion for Greenplum

  • Expand a Greenplum cluster without restarting it
  • Introduce numsegments in each table’s distribution information so that queries can still take advantage of the co-location of the original tables, and so that the planner can generate correct distributed plans
  • Implement an algorithm, based on split-update, to reshuffle only the tuples that need to move
  • Implement an algorithm, based on two writer gangs, to reshuffle only the tuples that need to move

Improve join performance for Greenplum