Worktable as Continuations

Posted on August 27, 2023 by kainwenlv

Not sure why I came again to this interesting paper One WITH RECURSIVE is Worth Many GOTOs. Daily work on Greenplum makes me think and inspires me to learn knowledge in different areas of computer science. This paper was accepted in the same SIGMOD as Greenplum’s HTAP paper and I have a vague impression of it. Recently three tasks in my work made me peruse the paper:

Fix a customer JIRA on recursive CTE
Answer questions from my colleague on the implementation of KNN in Greenplum (I advise using lateral join with some enhancement of Optimizer)
My own spike on the labels as values skill in Greenplum(Postgres)’s expression valuation

This paper uses recursive CTE, Lateral Join and Union All with filters to implement control structure such as Loop, Condition in PLPGSQL. So now is a good time to start a new tour.

Continue reading →

Interconnect Deadlock Problem in Greenplum

Posted on June 28, 2023 by kainwenlv

Recently I see a GitHub Issue motion deadlock, previously we have a bunch of PRs to fix related issues, and now it comes again. I feel it is time to look deeply into the problem and jump out of the patches over patches old circles.

A thread is als created in gpdb-dev mailing list: https://groups.google.com/a/greenplum.org/g/gpdb-dev/c/Y4ajINeKeUw

Continue reading →

探囊取物——理解，诊断，修复Greenplum中的Opclass和Opfamily相关问题

Posted on March 3, 2023 by kainwenlv

我已经不止一次有这样的感觉了: 当遇到一个特性类型的bug后，短时间内会频繁遇到本质一样的bug。这次是Greenplum关于Opclass和Opfamily的话题。这种模式不是偶然，我有一个猜想来解释：遇到第一个问题的时候，如果能够深入仔细的研究思考，由点及面，然后这部分储备的知识和思维的训练，让你更容易从代码和issue里看出同类问题。

本文介绍Greenplum Database里的Opclass和Opfamily相关概念，然后列出Greenplum开源社区仓库 greenplum-db/gpdb 里一些关于这个话题的issue并逐一诊断分析。各小节独立自成一个话题，小节之间可能会显得跳跃。不同的参考资料也将散落在各个小节内部。Issue修复的过程是动态的，我会把在开源社区出现过的东西online session里讨论清楚并公布录屏。

Continue reading →

Squelch Logic in Greenplum

Posted on September 18, 2021 by kainwenlv

Recently I have reviewed a PR from my colleague: Fix crash on QD when squelching CTE producer with a Motion child. This PR from Alexandra Wang (a very smart engineer from Greenplum Team) is well documented in the PR message and has detailed comments, also contains carefully designed test cases. Readers can feel the style of how Greenplum’s open-source community works. Next, let’s come back to the technical details of Greenplum’s Squelch logic (All best ideas are from Alex).

Continue reading →

Volatile Functions in Query on Broadcast-like Relation in Greenplum

Posted on July 5, 2020 by kainwenlv

Some background knowledge

Greenplum is MPP database. The SQL query’s planning and executing model is a distributed computing model. Greenplum introduces the concept of locus to model a path’s data distribution.

Function scan of generate_series is General locus (because the information is complete at every single place, but in Greenplum 5 this function is not treated as General, this may be the historical reason)
the scan path of a replicated table is segmenteGeneral locus (because data only stored in segments, not master, and the information is complete at every single segment)
……

For more details, please refer to one of my talk Greenplum 查询优化器关键技术介绍.

Continue reading →

CM (count min) is not MC (min count)

Posted on June 28, 2020 by kainwenlv

Papers and stories

This month, my most spare time has been spent on the problem Cardinality Estimate. When I finished the analysis of hyperloglog (see the blog A Tour of Understanding Hyperloglog), I just remember a recent paper of SIGMOD2019: Approximate Distinct Counts for Billions of Datasets. Its idea is to combine CountMin and Hyperloglog. Before diving into that paper, I decide to study CountMin. Another paper on cardinality estimate survey is VLDB2017’s Cardinality Estimation: An Experimental Survey mentioned a method called MinCount.

CM sketch was introduced by the paper: An Improved Data Stream Summary: The Count-Min Sketch and its Applications. I do not want to dive into that paper, but just reword the proof here (maybe in a better way).

Continue reading →

Tour of analyzing of a bug of Greenplum: Locus, Partial Distribution, Motion and Interconnect

Posted on May 3, 2020 by kainwenlv

Overview

The bad plan and wrong result

When reviewing the code of the PR Improve Append plan when General children exist #9692, I have come to this line of code in the function set_append_path_locus:

/*
 * CDB: If all the scans are distributed alike, set
 * the result locus to match.  Otherwise, if all are partitioned,
 * set it to strewn.  A mixture of partitioned and non-partitioned
 * scans should not occur after above correction;
 *
 * CDB TODO: When the scans are not all partitioned alike, and the
 * result is joined with another rel, consider pushing the join
 * below the Append so that child tables that are properly
 * distributed can be joined in place.
 */
if (isfirst)
{
	targetlocus = projectedlocus;
	isfirst = false;
}
else if (cdbpathlocus_equal(targetlocus, projectedlocus))
{
	/* compatible */
}
else
{
	/*
	 * subpaths have different distributed policy, mark it as random
	 * distributed and set the numsegments to the maximum of all
	 * subpaths to not missing any tuples.
	 */
	CdbPathLocus_MakeStrewn(&targetlocus,
							Max(CdbPathLocus_NumSegments(targetlocus),
								CdbPathLocus_NumSegments(projectedlocus)));
	break;
}

Continue reading →

三国赵云传数据+Postgres解析一篇SIGMOD2019论文的agg递归CTE

Posted on March 21, 2020 by kainwenlv

标题党实锤了。这也是我博客中第一篇用中文写的技术文章。坚持用英文写博客，并不是崇洋媚外，只是为了跟各国同事之间交流技术方便一些。那么这次为啥选择母语，因为原始论文里的例子一点也不好玩，我选择三国赵云传的一个数据来作例子，那不妨就写得幽默一代呢，放飞自我放飞的根本一些。

论文是这篇: RaSQL: Greater Power and Performance for Big Data Analytics with Recursive-aggregate-SQL on Spark, SIGMOD 19.

Continue reading →

SML Module, Postgres Operator Class and Operator Family

Posted on March 9, 2020 by kainwenlv

Recently I found a Bug of Postgres, to fully understand it and fix it, I spend some time on Postgres operator class and operator family concepts. I find it very similar to SML’s module system. So this blog records my thoughts.

Continue reading →

Distributed Lock in Greenplum

Posted on January 27, 2020 by kainwenlv

I have been working on locks in Greenplum since first half of 2018. The related projects are: Global Deadlock Detector, lock mode optimization and other minor things. This blog reviews all the lessons that learned.

Continue reading →

Thoughts on Computing

英雄无岁，江湖无辈

Category Archives: database

Worktable as Continuations

Interconnect Deadlock Problem in Greenplum

探囊取物——理解，诊断，修复Greenplum中的Opclass和Opfamily相关问题

Squelch Logic in Greenplum

Volatile Functions in Query on Broadcast-like Relation in Greenplum

Some background knowledge

CM (count min) is not MC (min count)

Papers and stories

Tour of analyzing of a bug of Greenplum: Locus, Partial Distribution, Motion and Interconnect

Overview

The bad plan and wrong result

三国赵云传数据+Postgres解析一篇SIGMOD2019论文的agg递归CTE

SML Module, Postgres Operator Class and Operator Family

Distributed Lock in Greenplum