Recently I reviewed a PR from my colleague: Fix crash on QD when squelching CTE producer with a Motion child. This PR from Alexandra Wang (a very smart engineer on the Greenplum team) is well documented in the PR message, has detailed comments, and contains carefully designed test cases. Readers can get a feel for how Greenplum's open-source community works. Next, let's come back to the technical details of Greenplum's squelch logic (all the best ideas here are from Alex).
Why do we need Squelch?
This part of the code mainly lives in ExecSquelchNode. Readers can follow this function's call chain to understand what the logic is doing. But Postgres does not have this logic, so why did Greenplum invent it? The comments around the function give some clues; let's use a practical example to show the reason.
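To get a feel for that call chain, here is a heavily simplified sketch of the dispatch shape. It is not the actual gpdb source (the real ExecSquelchNode handles many more node types), and the helper name SquelchNodeSketch is made up for illustration:

```
/*
 * Heavily simplified sketch (not the actual gpdb source) of how the
 * squelch walk is shaped: node types that own external resources get
 * a dedicated handler, and the walk recurses into children so the
 * whole subtree stops early.
 */
static void
SquelchNodeSketch(PlanState *node)
{
    if (node == NULL)
        return;

    switch (nodeTag(node))
    {
        case T_MotionState:
            /* Tell the interconnect we need no more tuples (shown later).  */
            /* The subtree below a Motion runs in a different slice, so we  */
            /* do not walk into it from here.                               */
            ExecSquelchMotion((MotionState *) node);
            return;

        /* ShareInputScan also needs special handling, discussed below */

        default:
            break;
    }

    /* recurse into the standard left/right children */
    SquelchNodeSketch(outerPlanState(node));
    SquelchNodeSketch(innerPlanState(node));
}
```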
Consider the join of two tables, with the following setup and plan:
```
gpadmin=# create table t1(a int, b int) distributed by(a);
CREATE TABLE
gpadmin=# create table t2(a int, b int) distributed randomly;
CREATE TABLE
gpadmin=# select gp_segment_id, * from t1;
 gp_segment_id | a | b
---------------+---+---
             1 | 1 | 1
             0 | 2 | 2
             0 | 3 | 3
(3 rows)

gpadmin=# select gp_segment_id, * from t2;
 gp_segment_id | a  | b
---------------+----+----
             0 |  2 |  2
             0 |  3 |  3
             0 | 10 | 10
             1 |  1 |  1
             1 |  7 |  7
             1 |  9 |  9
             2 |  4 |  4
             2 |  5 |  5
             2 |  6 |  6
             2 |  8 |  8
(10 rows)

gpadmin=# explain (costs off) select * from t1 join t2 using (a);
                        QUERY PLAN
------------------------------------------------------------
 Gather Motion 3:1  (slice2; segments: 3)
   ->  Hash Join
         Hash Cond: (t2.a = t1.a)
         ->  Redistribute Motion 3:3  (slice1; segments: 3)
               Hash Key: t2.a
               ->  Seq Scan on t2
         ->  Hash
               ->  Seq Scan on t1
```
From the above output we can see that on segment 2, t1 contains no data. Now look at slice2, the slice that executes the Hash Join; this part of the logic is almost the same as in single-process Postgres. The executor first tries to build the hash table, which means it executes the inner plan first. For an inner join, if the hash table happens to be empty, Postgres shortcuts the execution of the outer plan. A specific QE in Greenplum still has this shortcut logic, but things get more complicated in an MPP database where network communication is involved.
If we simply did nothing for the outer plan (the Redistribute Motion) when executing slice2 on segment 2, we would never consume the tuples that slice1 redistributes to it. The senders of that motion could then never clear their buffers, which might cause a motion deadlock.
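To make the picture concrete, here is a hedged sketch of that shortcut, loosely modeled on the empty-hash-table check in nodeHashjoin.c; the condition is simplified and this is not the literal gpdb code:

```
/*
 * Simplified sketch of the hash join shortcut (not the literal gpdb
 * code): the hash table was just built from the inner plan and turned
 * out to be empty, so an inner join can never produce a row.
 */
if (hashtable->totalTuples == 0 && node->js.jointype == JOIN_INNER)
{
    /*
     * Single-process Postgres can simply stop here.  In Greenplum the
     * outer plan may be a Motion that other slices are still sending
     * into; if we never read from it, the senders cannot drain their
     * buffers and the query can deadlock.  So squelch the outer
     * subtree before bailing out.
     */
    ExecSquelchNode(outerPlanState(node));
    return NULL;
}
```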
So Greenplum cannot just skip the outer plan when it takes this shortcut; it must introduce some logic to shut down the interconnect traffic. This is what the function ExecSquelchMotion does:
```
void
ExecSquelchMotion(MotionState *node)
{
    Motion     *motion;

    AssertArg(node != NULL);
    motion = (Motion *) node->ps.plan;

    /* remember on the receiver side that no more tuples are wanted */
    node->stopRequested = true;
    node->ps.state->active_recv_id = -1;

    /* pass down: tell the senders of this motion to stop sending */
    SendStopMessage(node->ps.state->motionlayer_context,
                    node->ps.state->interconnect_context,
                    motion->motionID);
}
```
Shared Scan is a node that Greenplum introduced a long time ago. It is used to implement Common Table Expression semantics and the grouping sets model. Orca often combines a Shared Scan with a Sequence node to take advantage of it.
The commit Ensure Execution of Shared Scan Writer On Squelch contains an excellent example (I am very proud of Greenplum's development style; see the detailed commit message) for learning the Shared Scan model, and it is also the first place where squelching a Shared Scan matters. I paste the example below:
```
CREATE TABLE foo (a int, b int);
CREATE TABLE bar (c int, d int);
CREATE TABLE jazz(e int, f int);

INSERT INTO bar VALUES (1, 1), (2, 2), (3, 3);
INSERT INTO jazz VALUES (2, 2), (3, 3);

ANALYZE foo;
ANALYZE bar;
ANALYZE jazz;

SET statement_timeout = '15s';

SELECT * FROM
(
    WITH cte AS (SELECT * FROM foo)
    SELECT * FROM (SELECT * FROM cte UNION ALL SELECT * FROM cte)
    AS X
    JOIN bar ON b = c
) AS XY
JOIN jazz on c = e AND b = f;
```
This query leads to a plan that will expose the problem:
```
                                                  QUERY PLAN
------------------------------------------------------------------------------------------------------------
 Gather Motion 3:1  (slice2; segments: 3)  (cost=0.00..2155.00 rows=1 width=24)
   ->  Hash Join  (cost=0.00..2155.00 rows=1 width=24)
         Hash Cond: bar.c = jazz.e AND share0_ref2.b = jazz.f AND share0_ref2.b = jazz.e AND bar.c = jazz.f
         ->  Sequence  (cost=0.00..1724.00 rows=1 width=16)
               ->  Shared Scan (share slice:id 2:0)  (cost=0.00..431.00 rows=1 width=1)
                     ->  Materialize  (cost=0.00..431.00 rows=1 width=1)
                           ->  Table Scan on foo  (cost=0.00..431.00 rows=1 width=8)
               ->  Hash Join  (cost=0.00..1293.00 rows=1 width=16)
                     Hash Cond: share0_ref2.b = bar.c
                     ->  Redistribute Motion 3:3  (slice1; segments: 3)  (cost=0.00..862.00 rows=1 width=8)
                           Hash Key: share0_ref2.b
                           ->  Append  (cost=0.00..862.00 rows=1 width=8)
                                 ->  Shared Scan (share slice:id 1:0)  (cost=0.00..431.00 rows=1 width=8)
                                 ->  Shared Scan (share slice:id 1:0)  (cost=0.00..431.00 rows=1 width=8)
                     ->  Hash  (cost=431.00..431.00 rows=1 width=8)
                           ->  Table Scan on bar  (cost=0.00..431.00 rows=1 width=8)
         ->  Hash  (cost=431.00..431.00 rows=1 width=8)
               ->  Table Scan on jazz  (cost=0.00..431.00 rows=1 width=8)
                     Filter: e = f
```
The shared scan's producer sits in the first subplan of the Sequence and is read by other slices. If the whole Sequence is squelched on one segment and we simply stop the shared scan without letting it finish, the consumers in other slices will wait for the producer forever, which leads to a deadlock. The commit above fixes this by making the producer finish its job (generate all the data) even when it is squelched. The commit also points out that we cannot simply return NULL to wake the readers up, because that would lead to wrong results.
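As a rough illustration of that fix, here is a hedged sketch of what "finish the job even when squelched" means on the producer side. The type name SharedScanSketchState, the helper name, and the fields (is_writer, material_done, childState, store) are made up for illustration; this is not the actual gpdb ShareInputScan code:

```
/*
 * Hedged sketch only -- not the actual gpdb ShareInputScan squelch
 * code.  The idea matches the commit above: a squelched producer must
 * still materialize everything before it stops.
 */
static void
SquelchSharedScanProducerSketch(SharedScanSketchState *node)
{
    if (node->is_writer && !node->material_done)
    {
        TupleTableSlot *slot;

        /* Drain the child (the Materialize over the CTE) to completion, */
        /* so the shared tuplestore holds the full result.                */
        while (!TupIsNull(slot = ExecProcNode(node->childState)))
            tuplestore_puttupleslot(node->store, slot);

        node->material_done = true;

        /*
         * Only now is it safe to let cross-slice readers proceed.
         * Returning NULL early instead would hand them truncated data,
         * which is exactly the wrong-result case the commit describes.
         */
    }
}
```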
What issue does this PR solve?
On QEs, the squelch logic kicks in when no more tuples are needed. On the QD, however, the function is called by mppExecutorFinishup. When asked why we would get wrong data if we simply returned NULL while squelching a shared scan's producer, Alex gave a perfect answer: https://github.com/greenplum-db/gpdb/pull/12550#discussion_r711195354.
Based on the above case, the core idea is this: a producer in a specific slice does not only serve consumers on the same segment. It has to make sure its data is complete, because its consumers may send tuples to other segments where no squelch has happened.
Thoughts
- Things become much more complicated when moving from a single-process database to an MPP environment.
- It is much harder for someone to reason about all the cases completely.
- Even if we want to prove that some algorithm is correct, we do not know what to prove.