Diagnosis: a snapshot bug in Greenplum

I have given a series of internal tech talks called Diagnosis, each featuring a picture of Dr. House. This time let me introduce how I used rr to find the root cause of GitHub Issue 17275.

rr

rr is an excellent tool for debugging hard problems. The official website is https://rr-project.org/; please refer to it for installation instructions and basic usage. We learned about the tool from Heikki back in the Pivotal days.

rr records a process’s whole life and lets programmers debug in reverse. This is very important: I have fixed several very hard problems in Greenplum using rr, and without it, finding the root cause would have been extremely hard.

We can reproduce the bug, record the reproduction with rr, and then debug both forwards and backwards. Life is easy with rr.

Simplify Issue 17275

The local reproduction steps mentioned in https://github.com/greenplum-db/gpdb/issues/17275#issuecomment-2029679376 contain a test suite with many parallel SQL scripts. I suspected some of them were not needed, so I removed as many as I could (to save some work for rr); in the end only 4 scripts are needed to trigger the bug locally.

gpadmin@zlyu-rr:~/workspace/gpdb/src/test/regress$ cat myschedule
test: test_setup
test: function_extensions gp_sort gp_prepared_xacts gp_backend_info

Then I modified the gp_sort.sql provided on the issue page, adding the loop index as a comment in each of plpy’s SQL statements so that I could tell which one triggered the error. It turned out to be just the second truncate in the first loop, so I continued to simplify the SQL file to a much simpler version:

-- start_ignore
create language plpython3u;
-- end_ignore

create or replace function gp_mksort_test() returns int as $$
def compare_results(query):
    qsort_res = plpy.execute(query)

for i in range(2):
    plpy.execute("truncate gp_mksort_test_table;")
    insert = "insert into gp_mksort_test_table \
        select floor(random()*10), floor(random()*100), left(md5(g::text),4) \
        from generate_series(1,%d) g;" % (i+1)
    plpy.execute(insert)
    compare_results("select * from gp_mksort_test_table order by a,c")
return 0
$$ language plpython3u;

-- start_ignore
select pg_backend_pid();
-- end_ignore

create table gp_mksort_test_table(a int, b int, c varchar);
select gp_mksort_test();

I added an ignore block (select pg_backend_pid()) to print the session’s QD process id, so that later we can rr replay the target process.

Then let’s just run the test and modify the expected file to make the case pass. Later, if we trigger the bug, the test fails and we know the bug was hit.

Then make sure we can trigger the bug:

gpadmin@zlyu-rr:~/workspace/gpdb/src/test/regress$ cat myschedule
test: test_setup
test: function_extensions gp_sort gp_prepared_xacts gp_backend_info

gpadmin@zlyu-rr:~/workspace/gpdb/src/test/regress$ ./pg_regress --init-file=init_file --load-extension=gp_inject_fault --schedule=myschedule
(using postmaster on Unix socket, port 7000)
============== dropping database "regression"         ==============
DROP DATABASE
============== faultinjector enabled                  ==============
============== creating database "regression"         ==============
CREATE DATABASE
ALTER DATABASE
============== installing gp_inject_fault             ==============
CREATE EXTENSION
============== installing gp_toolkit                  ==============
CREATE EXTENSION
============== installing pageinspect                 ==============
CREATE EXTENSION
============== checking optimizer status              ==============
Optimizer enabled. Using optimizer answer files whenever possible
============== checking gp_resource_manager status    ==============
Resource group disabled. Using default answer files
============== running regression test queries        ==============
test test_setup                   ... ok         1244 ms (diff   54 ms)
parallel group (4 tests):  gp_prepared_xacts gp_sort gp_backend_info function_extensions
     function_extensions          ... ok         2868 ms (diff   90 ms)
     gp_sort                      ... FAILED      166 ms (diff  104 ms)
     gp_prepared_xacts            ... ok           17 ms (diff   50 ms)
     gp_backend_info              ... ok          193 ms (diff   61 ms)

======================
 1 of 5 tests failed.
======================

gpadmin@zlyu-rr:~/workspace/gpdb/src/test/regress$ cat regression.diffs
diff -I HINT: -I CONTEXT: -I GP_IGNORE: -U3 /home/gpadmin/workspace/gpdb/src/test/regress/expected/gp_sort.out /home/gpadmin/workspace/gpdb/src/test/regress/results/gp_sort.out
--- /home/gpadmin/workspace/gpdb/src/test/regress/expected/gp_sort.out	2024-04-05 14:35:46.491123607 +0000
+++ /home/gpadmin/workspace/gpdb/src/test/regress/results/gp_sort.out	2024-04-05 14:35:46.491123607 +0000
@@ -19,14 +19,14 @@
 GP_IGNORE:select pg_backend_pid();
 GP_IGNORE: pg_backend_pid
 GP_IGNORE:----------------
-GP_IGNORE:         133755
+GP_IGNORE:         134388
 GP_IGNORE:(1 row)
 GP_IGNORE:
 GP_IGNORE:-- end_ignore
 create table gp_mksort_test_table(a int, b int, c varchar);
 select gp_mksort_test();
- gp_mksort_test
-------------
-              0
-(1 row)
-
+ERROR:  spiexceptions.ObjectNotInPrerequisiteState: attempted to update invisible tuple
+CONTEXT:  Traceback (most recent call last):
+  PL/Python function "gp_mksort_test", line 6, in <module>
+    plpy.execute("truncate gp_mksort_test_table;")
+PL/Python function "gp_mksort_test"

Use rr to record the bug

rr records the whole process tree, by default under ~/.local/share/rr (you can use an environment variable to change the path). Before recording, we clear the directory, stop the coordinator’s postmaster, start it again under rr record, and then trigger the bug.

gpadmin@zlyu-rr:~/workspace/gpdb$ rm ~/.local/share/rr/* -rf
gpadmin@zlyu-rr:~/workspace/gpdb$ pg_ctl stop -D $COORDINATOR_DATA_DIRECTORY
waiting for server to shut down.... done
server stopped
gpadmin@zlyu-rr:~/workspace/gpdb$ rr record /home/gpadmin/local/gpdb/bin/postgres -D $COORDINATOR_DATA_DIRECTORY -c gp_role=dispatch
rr: Saving execution to trace directory `/home/gpadmin/.local/share/rr/postgres-0'.
2024-04-05 14:40:44.901223 UTC,,,p134839,th-740680192,,,,0,,,seg-1,,,,,"LOG","00000","starting PostgreSQL 12.12 (Greenplum Database 7.1.0+dev.244.g3673829e3
7 build dev) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit",,,,,,,,"PostmasterMain","postmaster.c",1242,
2024-04-05 14:40:44.924868 UTC,,,p134839,th-740680192,,,,0,,,seg-1,,,,,"LOG","00000","listening on IPv4 address ""0.0.0.0"", port 7000",,,,,,,,"StreamServer
Port","pqcomm.c",639,
2024-04-05 14:40:44.932814 UTC,,,p134839,th-740680192,,,,0,,,seg-1,,,,,"LOG","00000","listening on IPv6 address ""::"", port 7000",,,,,,,,"StreamServerPort"
,"pqcomm.c",639,
2024-04-05 14:40:44.940933 UTC,,,p134839,th-740680192,,,,0,,,seg-1,,,,,"LOG","00000","listening on Unix socket ""/tmp/.s.PGSQL.7000""",,,,,,,,"StreamServerP
ort","pqcomm.c",634,
2024-04-05 14:40:44.973937 UTC,,,p134839,th-740680192,,,,0,,,seg-1,,,,,"LOG","00000","redirecting log output to logging collector process",,"Future log outp
ut will appear in directory ""log"".",,,,,,"SysLogger_Start","syslogger.c",920,

Then keep the above terminal open and go to a new terminal to run the regress test to trigger the bug. Of course, the bug does not trigger every time, so we may need several attempts.

gpadmin@zlyu-rr:~/workspace/gpdb/src/test/regress$ ./pg_regress --init-file=init_file --load-extension=gp_inject_fault --schedule=myschedule
(using postmaster on Unix socket, port 7000)
============== dropping database "regression"         ==============
DROP DATABASE
============== faultinjector enabled                  ==============
============== creating database "regression"         ==============
CREATE DATABASE
ALTER DATABASE
============== installing gp_inject_fault             ==============
CREATE EXTENSION
============== installing gp_toolkit                  ==============
CREATE EXTENSION
============== installing pageinspect                 ==============
CREATE EXTENSION
============== checking optimizer status              ==============
Optimizer enabled. Using optimizer answer files whenever possible
============== checking gp_resource_manager status    ==============
Resource group disabled. Using default answer files
============== running regression test queries        ==============
test test_setup                   ... ok         1162 ms (diff   53 ms)
parallel group (4 tests):  gp_prepared_xacts gp_sort gp_backend_info function_extensions
     function_extensions          ... ok         5449 ms (diff   90 ms)
     gp_sort                      ... FAILED      917 ms (diff  105 ms)
     gp_prepared_xacts            ... ok          166 ms (diff   51 ms)
     gp_backend_info              ... ok         1064 ms (diff   61 ms)

======================
 1 of 5 tests failed.
======================

We have caught the bug, so now we can terminate the cluster and the rr record process. Go to results/gp_sort.out to find the pid of the QD process that errored out.

gpadmin@zlyu-rr:~/workspace/gpdb/src/test/regress$ cat results/gp_sort.out
-- start_ignore
create language plpython3u;
ERROR:  duplicate key value violates unique constraint "pg_proc_proname_args_nsp_index"
DETAIL:  Key (proname, proargtypes, pronamespace)=(plpython3_call_handler, , 11) already exists.
-- end_ignore
create or replace function gp_mksort_test() returns int as $$
def compare_results(query):
    qsort_res = plpy.execute(query)

for i in range(2):
    plpy.execute("truncate gp_mksort_test_table;")
    insert = "insert into gp_mksort_test_table \
        select floor(random()*10), floor(random()*100), left(md5(g::text),4) \
        from generate_series(1,%d) g;" % (i+1)
    plpy.execute(insert)
    compare_results("select * from gp_mksort_test_table order by a,c")
return 0
$$ language plpython3u;
-- start_ignore
select pg_backend_pid();
 pg_backend_pid
----------------
         140179
(1 row)

-- end_ignore
create table gp_mksort_test_table(a int, b int, c varchar);
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'a' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
select gp_mksort_test();
ERROR:  spiexceptions.ObjectNotInPrerequisiteState: attempted to update invisible tuple
CONTEXT:  Traceback (most recent call last):
  PL/Python function "gp_mksort_test", line 6, in <module>
    plpy.execute("truncate gp_mksort_test_table;")
PL/Python function "gp_mksort_test"

In my local env it is 140179; let’s now use rr replay to play with it.

Use rr replay to debug

The command is rr replay -f <pid>; you then enter a GDB session started at the fork point of the process, with reverse debugging available. Grep for the error message “attempted to update invisible tuple” and find that it is at heapam.c:3229:

gpadmin@zlyu-rr:~/workspace/gpdb/src/backend$ ack "attempted to update invisible tuple"
access/heap/heapam.c
3229:                            errmsg("attempted to update invisible tuple")));

 3224     if (result == TM_Invisible)
 3225     {
 3226         UnlockReleaseBuffer(buffer);
 3227         ereport(ERROR,
 3228                 (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 3229                  errmsg("attempted to update invisible tuple")));
 3230     }

Looking at the code, line 3226 is a good breakpoint to catch the scenario just before erroring out:

gpadmin@zlyu-rr:~/workspace/gpdb/src/test/regress$ rr replay -f 140179
... <many terminal logs...>
... <many terminal logs...>
(rr) b heapam.c:3226
Breakpoint 1 at 0x55de97fd50a9: file heapam.c, line 3226.
(rr) c
Continuing.
Thread 1 hit Breakpoint 1, heap_update_internal (relation=0x55dea06be550, otid=0x55dea09917bc, newtup=0x55dea09917b8, cid=6, crosscheck=0x0, wait=true, tmfd=0x7ffd66af8a30, lockmode=0x7ffd66af8a28, simple=true) at heapam.c:3226
3226                    UnlockReleaseBuffer(buffer);
(rr)

Reading the code, the call to HeapTupleSatisfiesUpdate at heapam.c:3219 must have returned TM_Invisible. We can reverse-continue to understand why:

(rr) b HeapTupleSatisfiesUpdate
Breakpoint 2 at 0x55de97feb6e5: file heapam_visibility.c, line 541.
(rr) rc
Continuing.

Thread 1 hit Breakpoint 2, HeapTupleSatisfiesUpdate (relation=0x55dea06be550, htup=0x7ffd66af89a0, curcid=6, buffer=478) at heapam_visibility.c:541
(rr)

Then, stepping into and over several lines, we can see it reaches heapam_visibility.c:549:

 TM_Result
HeapTupleSatisfiesUpdate(Relation relation, HeapTuple htup, CommandId curcid,
                                                 Buffer buffer)
{
        HeapTupleHeader tuple = htup->t_data;

        Assert(ItemPointerIsValid(&htup->t_self));
        Assert(htup->t_tableOid != InvalidOid);

        if (!HeapTupleHeaderXminCommitted(tuple))
        {
                if (HeapTupleHeaderXminInvalid(tuple))
                        return TM_Invisible;

Look at the tuple’s infomask and print it in hex:

(rr) p tuple->t_infomask
$2 = 10753
(rr) printf "%x\n", $2
2a01
#define HEAP_XMIN_COMMITTED		0x0100	/* t_xmin committed */
#define HEAP_XMIN_INVALID		0x0200	/* t_xmin invalid/aborted */
#define HEAP_XMIN_FROZEN		(HEAP_XMIN_COMMITTED|HEAP_XMIN_INVALID)
#define HEAP_XMAX_COMMITTED		0x0400	/* t_xmax committed */
#define HEAP_XMAX_INVALID		0x0800	/* t_xmax invalid/aborted */

The “a” nibble is 8 + 2, i.e. HEAP_XMAX_INVALID | HEAP_XMIN_INVALID, so the infomask does say the tuple’s xmin is invalid, which leads the process to hit the error later.
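As a sanity check on the bit arithmetic, here is the same decoding in plain Python (constants copied from htup_details.h):

```python
# Hint-bit flags from src/include/access/htup_details.h
HEAP_XMIN_COMMITTED = 0x0100  # t_xmin committed
HEAP_XMIN_INVALID   = 0x0200  # t_xmin invalid/aborted
HEAP_XMAX_COMMITTED = 0x0400  # t_xmax committed
HEAP_XMAX_INVALID   = 0x0800  # t_xmax invalid/aborted

infomask = 10753              # the value gdb printed
assert hex(infomask) == "0x2a01"

# The xmin-invalid bit is set, the xmin-committed bit is not:
assert infomask & HEAP_XMIN_INVALID
assert not (infomask & HEAP_XMIN_COMMITTED)
# The xmax-invalid bit is also set (the 8 in the "a" nibble):
assert infomask & HEAP_XMAX_INVALID
```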

The key question of the bug is: who (which code) set the HEAP_XMIN_INVALID bit in the above infomask? Let’s first grep the code to see the possible code paths:

gpadmin@zlyu-rr:~/workspace/gpdb/src$ ack HEAP_XMIN_INVALID
backend/access/heap/heapam.c
6551:                   Assert(!(tuple->t_infomask & HEAP_XMIN_INVALID));

backend/access/heap/heapam_visibility.c
199:            case HEAP_XMIN_INVALID:
273:                                    SetHintBits(tuple, buffer, relation, HEAP_XMIN_INVALID,
295:                                    SetHintBits(tuple, buffer, relation, HEAP_XMIN_INVALID,
343:                    SetHintBits(tuple, buffer, relation, HEAP_XMIN_INVALID,
466:                                    SetHintBits(tuple, buffer, relation, HEAP_XMIN_INVALID,
488:                                    SetHintBits(tuple, buffer, relation, HEAP_XMIN_INVALID,
562:                                    SetHintBits(tuple, buffer, relation, HEAP_XMIN_INVALID,
584:                                    SetHintBits(tuple, buffer, relation, HEAP_XMIN_INVALID,
677:                    SetHintBits(tuple, buffer, relation, HEAP_XMIN_INVALID,
854:                                    SetHintBits(tuple, buffer, relation, HEAP_XMIN_INVALID,
876:                                    SetHintBits(tuple, buffer, relation, HEAP_XMIN_INVALID,
942:                    SetHintBits(tuple, buffer, relation, HEAP_XMIN_INVALID,
1074:                                   SetHintBits(tuple, buffer, relation, HEAP_XMIN_INVALID,
1100:                                   SetHintBits(tuple, buffer, relation, HEAP_XMIN_INVALID,
1167:                           SetHintBits(tuple, buffer, relation, HEAP_XMIN_INVALID,
1333:                           SetHintBits(tuple, buffer, relation, HEAP_XMIN_INVALID,
1354:                           SetHintBits(tuple, buffer, relation, HEAP_XMIN_INVALID,
1393:                   SetHintBits(tuple, buffer, relation, HEAP_XMIN_INVALID,

backend/storage/buffer/README
59:values HEAP_XMIN_COMMITTED, HEAP_XMIN_INVALID, HEAP_XMAX_COMMITTED, or
68:Note, however, that a tuple is frozen by setting both HEAP_XMIN_INVALID

test/regress/sql/vacuum_gp.sql
304:-- is marked frozen which is (HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID)=(0x0100 | 0x0200)=768

test/regress/expected/vacuum_gp.out
431:-- is marked frozen which is (HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID)=(0x0100 | 0x0200)=768

test/isolation/specs/heap-repeatable-read-vacuum-freeze.spec
33:                 WHEN (t_infomask & 512) = 512 THEN 'InvalidXid'    -- HEAP_XMIN_INVALID

include/access/htup_details.h
205:#define HEAP_XMIN_INVALID           0x0200  /* t_xmin invalid/aborted */
206:#define HEAP_XMIN_FROZEN            (HEAP_XMIN_COMMITTED|HEAP_XMIN_INVALID)
342:    ((tup)->t_infomask & (HEAP_XMIN_COMMITTED|HEAP_XMIN_INVALID)) == \
343:            HEAP_XMIN_INVALID \
360:    ((tup)->t_infomask |= HEAP_XMIN_INVALID) \

Ignore the Assert, the hits in test cases, and the README. The macro at htup_details.h:360 is dead code that nothing uses. Thus the only code that sets the bit is in heapam_visibility.c, and every occurrence is a call to SetHintBits.

Hint bits, in short, are transaction status information that a scan caches in the tuple’s header, so that later MVCC visibility checks can skip the work of reading the CLOG. To go deeper, we would need to study the basic theory of Postgres MVCC.
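To make the caching idea concrete, here is a toy model of hint bits (illustrative only; the names and the commit rule are made up, this is not Postgres code). The first visibility check pays for a CLOG lookup; the result is cached in the tuple header by a SetHintBits-style update, so later checks skip the CLOG entirely:

```python
# Toy model of hint bits. The first xmin-status check consults the
# (expensive) CLOG; the answer is then cached in the tuple header,
# so the second check never touches the CLOG.
HEAP_XMIN_COMMITTED = 0x0100
HEAP_XMIN_INVALID   = 0x0200

clog_lookups = 0

def clog_xid_committed(xid):
    """Stand-in for a CLOG read; counts how often we pay the cost."""
    global clog_lookups
    clog_lookups += 1
    return xid % 2 == 0  # arbitrary toy rule: even xids committed

def xmin_resolved_committed(tup):
    """Resolve xmin's commit status, caching the answer as a hint bit."""
    if tup["infomask"] & HEAP_XMIN_COMMITTED:
        return True
    if tup["infomask"] & HEAP_XMIN_INVALID:
        return False
    if clog_xid_committed(tup["xmin"]):
        tup["infomask"] |= HEAP_XMIN_COMMITTED  # like SetHintBits
        return True
    tup["infomask"] |= HEAP_XMIN_INVALID        # like SetHintBits
    return False

t = {"xmin": 1689, "infomask": 0}
xmin_resolved_committed(t)  # first check reads the "CLOG"
xmin_resolved_committed(t)  # second check answered by the hint bit
assert clog_lookups == 1
```

The danger, as this bug shows, is that a hint bit written from a wrong judgment is persistent: every later reader trusts it without re-checking the CLOG.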

By now we know some other backend scanned the pg_class catalog and wrongly set the hint bits. We could use pg_waldump to search the XLOG for the transaction information. Or, given that rr has already recorded everything, we can go to each process and set a breakpoint at SetHintBits conditioned on the tuple id and infomask, and then understand why.

Go back to the rr recording of the erroring QD to find the tuple’s CTID:

(rr) l
544             Assert(htup->t_tableOid != InvalidOid);
545
546             if (!HeapTupleHeaderXminCommitted(tuple))
547             {
548                     if (HeapTupleHeaderXminInvalid(tuple))
549                             return TM_Invisible;
550
551                     /* Used by pre-9.0 binary upgrades */
552                     if (tuple->t_infomask &amp; HEAP_MOVED_OFF)
553                     {
(rr) p tuple[0]
$3 = {t_choice = {t_heap = {t_xmin = 1689, t_xmax = 0, t_field3 = {t_cid = 0, t_xvac = 0}}, t_datum = {datum_len_ = 1689, datum_typmod = 0, datum_typeid = 0}}, t_ctid = {ip_blkid = {
      bi_hi = 0, bi_lo = 6}, ip_posid = 17}, t_infomask2 = 33, t_infomask = 10753, t_hoff = 32 ' ', t_bits = 0x7fde4fec2037 "\377\377\377?"}
(rr) info source
Current source file is heapam_visibility.c

The CTID is (6, 17). So the break point of SetHintBits should be:

b heapam_visibility.c:182 if rel->rd_rel->oid == 1259 && (infomask & 0x0200) && tuple->t_ctid->ip_blkid->bi_lo == 6  &&  tuple->t_ctid->ip_posid == 17 && tuple->t_choice->t_heap->t_xmin == 1689
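As an aside, the condition can compare bi_lo directly only because the block number is small: a BlockIdData stores the 32-bit block number as two 16-bit halves. A quick sketch of the decoding (plain Python, mirroring the BlockIdGetBlockNumber macro):

```python
def block_number(bi_hi, bi_lo):
    # BlockIdGetBlockNumber: recombine the two 16-bit halves of a
    # BlockIdData into the 32-bit block number.
    return (bi_hi << 16) | bi_lo

# The tuple printed above has ip_blkid = {bi_hi = 0, bi_lo = 6} and
# ip_posid = 17, i.e. CTID (6,17):
assert block_number(0, 6) == 6
# For blocks beyond 65535 the high half matters:
assert block_number(1, 0) == 65536
```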

Use rr ps to list all processes recorded by rr. I then set the breakpoint and continued, one process at a time (this does not take much time, and could be refined with scripts), and finally found the culprit: process 140217:

gpadmin@zlyu-rr:~/workspace/gpdb/src/backend$ rr replay -f 140217
(rr) b heapam_visibility.c:182 if rel->rd_rel->oid == 1259 && (infomask & 0x0200) && tuple->t_ctid->ip_blkid->bi_lo == 6  &&  tuple->t_ctid->ip_posid == 17 && tuple->t_choice->t_heap->t_xmin == 1689
Breakpoint 1 at 0x55de97feac6f: file heapam_visibility.c, line 182.
(rr) c
Continuing.
[New Thread 140217.140219]

Thread 1 hit Breakpoint 1, SetHintBits (tuple=0x7fde4fec2020, buffer=478, rel=0x55dea065c230, infomask=512, xid=0) at heapam_visibility.c:182
182             if (TransactionIdIsValid(xid))
(rr) bt
#0  SetHintBits (tuple=0x7fde4fec2020, buffer=478, rel=0x55dea065c230, infomask=512, xid=0) at heapam_visibility.c:182
#1  0x000055de97fecf6a in HeapTupleSatisfiesMVCC (relation=0x55dea065c230, htup=0x7ffd66af9bc0, snapshot=0x55dea06a94f8, buffer=478) at heapam_visibility.c:1167
#2  0x000055de97fee385 in HeapTupleSatisfiesVisibility (relation=0x55dea065c230, tup=0x7ffd66af9bc0, snapshot=0x55dea06a94f8, buffer=478) at heapam_visibility.c:1837
#3  0x000055de97fcdebb in heapgetpage (sscan=0x55dea06df040, page=6) at heapam.c:493
#4  0x000055de97fd00d0 in heapgettup_pagemode (scan=0x55dea06df040, dir=ForwardScanDirection, nkeys=0, key=0x0) at heapam.c:1137
#5  0x000055de97fd073c in heap_getnextslot (sscan=0x55dea06df040, direction=ForwardScanDirection, slot=0x55dea069e100) at heapam.c:1428
#6  0x000055de9834d7e5 in table_scan_getnextslot (sscan=0x55dea06df040, direction=ForwardScanDirection, slot=0x55dea069e100) at ../../../src/include/access/tableam.h:1053
#7  0x000055de9834d8cb in SeqNext (node=0x55dea069df28) at nodeSeqscan.c:89
#8  0x000055de9830915d in ExecScanFetch (node=0x55dea069df28, accessMtd=0x55de9834d819 <SeqNext>, recheckMtd=0x55de9834d8dc <SeqRecheck>) at execScan.c:137
#9  0x000055de98309237 in ExecScan (node=0x55dea069df28, accessMtd=0x55de9834d819 <SeqNext>, recheckMtd=0x55de9834d8dc <SeqRecheck>) at execScan.c:206
#10 0x000055de9834d935 in ExecSeqScan (pstate=0x55dea069df28) at nodeSeqscan.c:121
#11 0x000055de98305c15 in ExecProcNodeGPDB (node=0x55dea069df28) at execProcnode.c:645
#12 0x000055de98368294 in ExecProcNode (node=0x55dea069df28) at ../../../src/include/executor/executor.h:271
#13 0x000055de983686e4 in execMotionSender (node=0x55dea069dc18) at nodeMotion.c:227
#14 0x000055de983685ab in ExecMotion (pstate=0x55dea069dc18) at nodeMotion.c:191
#15 0x000055de98305c15 in ExecProcNodeGPDB (node=0x55dea069dc18) at execProcnode.c:645
#16 0x000055de98305aa9 in ExecProcNodeFirst (node=0x55dea069dc18) at execProcnode.c:600
#17 0x000055de982f6a19 in ExecProcNode (node=0x55dea069dc18) at ../../../src/include/executor/executor.h:271
#18 0x000055de982fb68d in ExecutePlan (estate=0x55dea069d990, planstate=0x55dea069dc18, use_parallel_mode=false, operation=CMD_SELECT, sendTuples=false, numberTuples=0,
    direction=ForwardScanDirection, dest=0x55dea0613448, execute_once=false) at execMain.c:2672
#19 0x000055de982f84b9 in standard_ExecutorRun (queryDesc=0x55dea060d530, direction=ForwardScanDirection, count=0, execute_once=false) at execMain.c:922
#20 0x000055de982f8043 in ExecutorRun (queryDesc=0x55dea060d530, direction=ForwardScanDirection, count=0, execute_once=false) at execMain.c:793
#21 0x000055de986014a5 in PortalRunSelect (portal=0x55dea066f5c0, forward=true, count=0, dest=0x55dea0613448) at pquery.c:1147
#22 0x000055de98601071 in PortalRun (portal=0x55dea066f5c0, count=9223372036854775807, isTopLevel=true, run_once=false, dest=0x55dea0613448, altdest=0x55dea0613448,
    completionTag=0x7ffd66afa490 "") at pquery.c:985
#23 0x000055de985f7498 in exec_mpp_query (query_string=0x55dea05db1a2 "SELECT * FROM temp JOIN (SELECT oid FROM pg_class) temp2 on (temp = temp2);",
    serializedPlantree=0x55dea05db1ee "(\265/\375`Y\b\275\031", serializedPlantreelen=833, serializedQueryDispatchDesc=0x55dea05db52f "(\265/\375`e", serializedQueryDispatchDesclen=167)
    at postgres.c:1438
#24 0x000055de985fe54b in PostgresMain (argc=1, argv=0x55dea0617ba8, dbname=0x55dea0607bc8 "regression", username=0x55dea0607ba8 "gpadmin") at postgres.c:5575
#25 0x000055de9851f871 in BackendRun (port=0x55dea0603910) at postmaster.c:4985
#26 0x000055de9851ee86 in BackendStartup (port=0x55dea0603910) at postmaster.c:4662
#27 0x000055de9851a347 in ServerLoop () at postmaster.c:2016
#28 0x000055de9851984f in PostmasterMain (argc=5, argv=0x55dea0545680) at postmaster.c:1631
#29 0x000055de983b5769 in main (argc=5, argv=0x55dea0545680) at main.c:240

Reading the SQL, we know it is an entryDB QE that scans the catalog as a normal heap table. And we know the xid is a subtransaction xid, but we don’t see any subtransaction id information in the snapshot!

(rr) up
#1  0x000055de97fecf6a in HeapTupleSatisfiesMVCC (relation=0x55dea065c230, htup=0x7ffd66af9bc0, snapshot=0x55dea06a94f8, buffer=478) at heapam_visibility.c:1167
1167                                    SetHintBits(tuple, buffer, relation, HEAP_XMIN_INVALID,
(rr) p snapshot[0]
$1 = {snapshot_type = SNAPSHOT_MVCC, xmin = 1688, xmax = 1692, xip = 0x55dea06a95a8, xcnt = 1, subxip = 0x0, subxcnt = -1, suboverflowed = false, takenDuringRecovery = false,
  copied = true, haveDistribSnapshot = true, curcid = 0, speculativeToken = 0, active_count = 1, regd_count = 2, ph_node = {first_child = 0x55de996ba2a0 <CatalogSnapshotData+64>,
    next_sibling = 0x0, prev_or_parent = 0x0}, whenTaken = 0, lsn = 0, distribSnapshotWithLocalMapping = {ds = {xminAllDistributedSnapshots = 649, distribSnapshotId = 376, xmin = 651,
      xmax = 653, count = 2, inProgressXidArray = 0x55dea06a95ac}, minCachedLocalXid = 0, maxCachedLocalXid = 0, currentLocalXidsCount = 0, inProgressMappedLocalXids = 0x55dea06a95bc}}
(rr) p snapshot->subxip
$2 = (TransactionId *) 0x0
(rr)

We now understand the root cause of this bug; for more details please refer to the comments from me and Ashwin on the GitHub issue page.
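The failure mode can be sketched in a toy model (illustrative only, not Postgres code; the xid values are loosely based on the snapshot recorded above). An xid inside [xmin, xmax) that appears in neither xip nor subxip is treated as already finished; the CLOG then says “not committed”, and the tuple is wrongly, and persistently, hinted as xmin-invalid:

```python
# Toy model of the bug: a snapshot that lost its subxip array treats
# an in-progress subtransaction xid as aborted and stamps the tuple
# with HEAP_XMIN_INVALID.
HEAP_XMIN_INVALID = 0x0200

def xid_in_snapshot(xid, snap):
    if xid < snap["xmin"]:
        return False               # finished before the snapshot
    if xid >= snap["xmax"]:
        return True                # started after: still in progress
    return xid in snap["xip"] or xid in snap["subxip"]

def mvcc_check(tup, snap, clog_committed):
    if xid_in_snapshot(tup["xmin"], snap):
        return False               # in progress: invisible, no hint bit
    if clog_committed(tup["xmin"]):
        return True
    tup["infomask"] |= HEAP_XMIN_INVALID   # like SetHintBits
    return False

# xid 1689 is an in-progress *sub*transaction: a correct snapshot
# carries it in subxip, the entryDB QE's snapshot does not.
clog = lambda xid: False           # subxact 1689 has not committed yet
good = {"xmin": 1688, "xmax": 1692, "xip": [1688], "subxip": [1689]}
bad  = {"xmin": 1688, "xmax": 1692, "xip": [1688], "subxip": []}

t = {"xmin": 1689, "infomask": 0}
mvcc_check(t, good, clog)
assert t["infomask"] == 0          # correct: no hint bit written

mvcc_check(t, bad, clog)
assert t["infomask"] & HEAP_XMIN_INVALID   # the corrupting hint bit
```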

Hints for local reproduction

To reproduce the bug, we need two sessions:

  • Session 1: starts several subtransactions, each of which truncates the same table
  • Session 2: an entryDB QE that scans pg_class at the same time, after the subtransaction id has been written into the tuple.

We also need session 2 to be unable to quick-return from the MVCC test via the snapshot’s xmin and xmax: it has to scan the xip and subxip arrays (but the entryDB QE has no subxip). This is easy to arrange:

  • start a session, force it to generate an XID, commit, and keep the session alive
  • after the subtransaction XID is written into the tuple, start a new session, force it to generate an XID, and leave it uncommitted.

The UDF for session 1 is:

create table test_t(a int);

create function test() returns int as
$$
for i in range(2):
    plpy.execute("truncate test_t;")
return 0
$$
language plpython3u;

The SQL that has an entryDB QE is: select * from t join (select oid from pg_class) x (a) on x.a = t.a;.

I built a simple reproduction script using the isolation2 framework and savepoints:

create table t1(a int);
create table t2(a int);

1: begin;
1: select txid_current();
1: end;

2: begin;
2: select txid_current();

3: begin;
3: savepoint s1;
3: truncate t2;

4: begin;
4: select txid_current();
4: end;

5: begin;
5: select txid_current();

6: select * from t1 join (select oid from pg_class) x(a) on x.a = t1.a;

3: savepoint s2;
3: truncate t2;

1q:
2q:
3q:
4q:
5q:
6q:

More to investigate

  • MVCC fundamental of Postgres
  • Distributed Transaction Management in Greenplum
  • Shared Buffer Pool management in Postgres
  • XLOG dump way to debug this problem

Each topic requires an engineer to spend considerable time investigating and thinking.
