The HashSubplan implementation for NOT-IN Sublink in Postgres

NOT-IN expressions with subquery in SQL are notorious but very common. Not only do programmers write such kinds of SQLs but also many BI applications generate the kinds of SQLs. With NULL values, things get much more complex and even much much more complex for MPP databases. In this blog, we focus on single-node Postgres to understand the data structures, semantics, and algorithms of this topic.

I will talk more about GPDB’s LASJ implementation of NOT-IN later.

Subquery Expressions in Postgres

Postgres document is a very good material to start https://www.postgresql.org/docs/current/functions-subquery.html, I will first try to use math reasoning to get the semantics and then see if it matches the document and finally match the algorithm (executor code of Postgres).

create table t1(a int, b int, c int);
create table t2(a int, b int, c int);

select * from t1 where a not in (select a from t2);
select * from t1 where not (a = ANY(select a from t2));

select * from t1 where (a,b,c) not in (select a,b,c from t2);
select * from t1 where not ((a,b,c) = ANY(select a,b,c from t2));

C programming language representation of the qual: single-column

We can use gdb with the tool gdbpg to print the c structure of the expressions (I remove some uninteresting fields).

select * from t1 where a not in (select a from t2):

(gdb) p debug_query_string
$1 = 0x559f23b774b8 "select * from t1 where a not in (select a from t2);"
(gdb) pgprint result->jointree->quals
BoolExpr [boolop=NOT_EXPR]
    SubLink [subLinkType=ANY_SUBLINK subLinkId=1]
        [testexpr]
           OpExpr [opno=96 opfuncid=65 opresulttype=16 opretset=false]
                   Var [varno=1 varattno=1 vartype=23 varnoold=1 varoattno=1]
                   Param [paramkind=PARAM_SUBLINK paramid=1 paramtype=23]
        [operName]
                String ["="]
        [subselect]
           Query [commandType=CMD_SELECT]
               [rtable]
                   RangeTblEntry [rtekind=RTE_RELATION relid=32792 relkind='r' rellockmode=1 jointype=JOIN_INNER]
               [jointree]
                   FromExpr []
                       [fromlist]
                           RangeTblRef [rtindex=1]
               [targetList]
                   TargetEntry [resno=1 resname="a" resorigtbl=32792 resorigcol=1]
                       Var [varno=1 varattno=1 vartype=23 varnoold=1 varoattno=1]

select * from t1 where not (a = ANY(select a from t2)):

(gdb) p debug_query_string
$2 = 0x559f23b774b8 "select * from t1 where not (a = ANY(select a from t2));"
(gdb) pgprint result->jointree->quals
BoolExpr [boolop=NOT_EXPR]
   SubLink [subLinkType=ANY_SUBLINK subLinkId=0]
      [testexpr]
         OpExpr [opno=96 opfuncid=65 opresulttype=16 opretset=false]
                 Var [varno=1 varattno=1 vartype=23 varnoold=1 varoattno=1]
                 Param [paramkind=PARAM_SUBLINK paramid=1 paramtype=23]
      [operName]
         String ["="]
      [subselect]
         Query [commandType=CMD_SELECT]
            [rtable]
               RangeTblEntry [rtekind=RTE_RELATION relid=32792 relkind='r' rellockmode=1 jointype=JOIN_INNER]
            [jointree]
               FromExpr []
                  [fromlist]
                      RangeTblRef [rtindex=1]
            [targetList]
               TargetEntry [resno=1 resname="a" resorigtbl=32792 resorigcol=1]
                   Var [varno=1 varattno=1 vartype=23 varnoold=1 varoattno=1]

select * from t1 where not (a in (select a from t2)):

(gdb) p debug_query_string
$3 = 0x559f23b774b8 "select * from t1 where not (a in (select a from t2));"
(gdb) pgprint result->jointree->quals
BoolExpr [boolop=NOT_EXPR]
    SubLink [subLinkType=ANY_SUBLINK subLinkId=0]
     [testexpr]
         OpExpr [opno=96 opfuncid=65 opresulttype=16 opretset=false]
            Var [varno=1 varattno=1 vartype=23 varnoold=1 varoattno=1]
            Param [paramkind=PARAM_SUBLINK paramid=1 paramtype=23]
     [operName]
         String ["="]
     [subselect]
        Query [commandType=CMD_SELECT]
          [rtable]
                  RangeTblEntry [rtekind=RTE_RELATION relid=32792 relkind='r' rellockmode=1 jointype=JOIN_INNER]
          [jointree]
                  FromExpr []
                          [fromlist]
                                  RangeTblRef [rtindex=1]
          [targetList]
                  TargetEntry [resno=1 resname="a" resorigtbl=32792 resorigcol=1]
                          Var [varno=1 varattno=1 vartype=23 varnoold=1 varoattno=1]

Compare the C structure representation, we know that the following two expressions are equivalent:

  • a not in (select a from t1)
  • not (a in (select a from t1))
  • not (a = ANY (select a from t1))

Evaluation Semantics of single-column NOT-IN

An expression here can be evaluated to 3 results: bool true, bool false and null.

A special case is that the subquery generates no data at all. In this case, the expression is true no matter what the left side expression is (even if the left side of not in is null). So a not in (select a from t1 where false) is always true for any a (at the left of not in).

Next, let’s consider the cases that which the subquery does generate some data (not empty). Suppose its result set is {a1,a2,a3,a4,...an}, then the expression can be:

  • not (a = ANY (select a from t1)) ==>
  • not (a = ANY (a1, a2, a3, a4, ..., an)) ==>
  • not (a=a1 or a=a2 or ... or a=an) ==>
  • a<>a1 and a<>a2 and ... a<>an

The final expression is AND together all small clauses. The truth table tells us:

  • if we see one clause is false and the whole is false (even include null),
  • the clause that is true can be removed
  • the 1st step shortcircuit to the result, and the 2nd step removes useless clauses, after both, we still get a clause that evaluated to null then the result is null

So we can have the following conclusion:

  1. if the left side of NOT-IN a is null, then the whole expression is null
  2. if given the left side of NOT-IN a is not null, then
    • the subquery result contains at least one null,
    • and all non-null values don’t equal the left side a
    • the expression result is null
  3. if given the left side of NOT-IN a is not null, then
    • one of the non-null values equals the left side a
    • the expression result is false
  4. if given the left side of NOT-IN a is not null, then
    • there is no null in the subquery result
    • none of the values in the result equals the left side a
    • the result is true

the above matches the Postgres document:

Evaluation Semantics of multiple-columns NOT-IN

Almost the same reasoning progress as the previous section’s single-column, just each clause in the final AND series is something like vector-not-equal. The analysis progress is also: look into the c structure representation, do math analysis, and read the Postgres document.

(gdb) p debug_query_string
$1 = 0x559f23b774b8 "select * from t1 where (a,b,c) not in (select a,b,c from t2);"
(gdb) pgprint result->jointree->quals
BoolExpr [boolop=NOT_EXPR]
   SubLink [subLinkType=ANY_SUBLINK subLinkId=0]
      [testexpr]
         BoolExpr [boolop=AND_EXPR]
            OpExpr [opno=96 opfuncid=65 opresulttype=16 opretset=false]
              Var [varno=1 varattno=1 vartype=23 varnoold=1 varoattno=1]
              Param [paramkind=PARAM_SUBLINK paramid=1 paramtype=23]
            OpExpr [opno=96 opfuncid=65 opresulttype=16 opretset=false]
              Var [varno=1 varattno=2 vartype=23 varnoold=1 varoattno=2]
              Param [paramkind=PARAM_SUBLINK paramid=2 paramtype=23]
            OpExpr [opno=96 opfuncid=65 opresulttype=16 opretset=false]
              Var [varno=1 varattno=3 vartype=23 varnoold=1 varoattno=3]
              Param [paramkind=PARAM_SUBLINK paramid=3 paramtype=23]
      [operName]
         String ["="]
      [subselect]
         Query [commandType=CMD_SELECT]
             [rtable]
                RangeTblEntry [rtekind=RTE_RELATION relid=32792 relkind='r' rellockmode=1 jointype=JOIN_INNER]
             [jointree]
                FromExpr []
                  [fromlist]
                     RangeTblRef [rtindex=1]
             [targetList]
                TargetEntry [resno=1 resname="a" resorigtbl=32792 resorigcol=1]
                        Var [varno=1 varattno=1 vartype=23 varnoold=1 varoattno=1]
                TargetEntry [resno=2 resname="b" resorigtbl=32792 resorigcol=2]
                        Var [varno=1 varattno=2 vartype=23 varnoold=1 varoattno=2]
                TargetEntry [resno=3 resname="c" resorigtbl=32792 resorigcol=3]
                        Var [varno=1 varattno=3 vartype=23 varnoold=1 varoattno=3]

The difference from the single-column case is the target list of the subquery and the test_expr of the sublink (the clause) which now is a vector-equal test. If the subquery generates an empty result, then the expression of NOT IN is true no matter what its left-side part expression is.

  • not ((a,b,c) = ANY (select a,b,c from t1)) ==>
  • (a <> a1 or b <> b1 or c <> c1) AND (a <> a2 or b <> b2 or c <> c2) AND ... AND (a <> an or b <> bn or c <> cn)

Recall the truth table:

  • for AND, any sub-clause is false, the whole is false short-circuitly.
  • for OR, any sub-clause is true, the whole is true short-circuitly.

We just need to answer in which case it might be evaluated to null:

  • Given no field from the left side part of NOT-IN is null:
    • no results from the subquery make any clause like (a <> a1 or b <> b1 or c <> c1) return false (means no top-level short-circuit), no partial fields eq match at all.
    • at least one of the fields of subquery results is null
  • Given all fields from the left side part of NOT-IN is null, then the results must be null
  • Given some fields of the left side part are null and the other fields of the left part are not null:
    • no results from the subquery make any clause like (a <> a1 or b <> b1 or c <> c1) return false (means no top-level short-circuit), no partial fields eq match at all.
    • meanwhile, at least one clause does not short-circuit for OR: all non-null fields of the left part have eq match.

Postgres Plan and Executor for NOT-IN

Postgreds planner does not choose to pull up such kinds of SubLink, not pull-up-ed sublinks will be created as SubPlan to evaluate, based on the configuration of memory limit and some other conditions, Postgres might choose HashSubPlan, look at the code at subselect.c:build_subplan:

/*
 * We can't convert subplans of ALL_SUBLINK or ANY_SUBLINK types to
 * initPlans, even when they are uncorrelated or undirect correlated,
 * because we need to scan the output of the subplan for each outer
 * tuple.  But if it's a not-direct-correlated IN (= ANY) test, we
 * might be able to use a hashtable to avoid comparing all the tuples.
 */
if (subLinkType == ANY_SUBLINK &&
	splan->parParam == NIL &&
	subplan_is_hashable(plan) &&
	testexpr_is_hashable(splan->testexpr, splan->paramIds))
	splan->useHashTable = true;

We focus on hash subplan way here:

gpadmin=# explain (costs off, verbose)
gpadmin-#  select * from t1 where (a,b,c) not in (select a,b,c from t2);
             QUERY PLAN
------------------------------------
 Seq Scan on public.t1
   Output: t1.a, t1.b, t1.c
   Filter: (NOT (hashed SubPlan 1))
   SubPlan 1
     ->  Seq Scan on public.t2
           Output: t2.a, t2.b, t2.c

Next lets go through the executor code of nodeSubplan.c:ExecHashSubPlan. The 1st step is to build a hashtable, in order to handle null result from subquery, it will create two hash tables (in most cases):

// code piece from nodeSubplan.c:buildSubPlanHash

/*
 * If result contains any nulls, store separately or not at all.
 */
if (slotNoNulls(slot))
{
	(void) LookupTupleHashEntry(node->hashtable, slot, &isnew, NULL);
	node->havehashrows = true;
}
else if (node->hashnulls)
{
	(void) LookupTupleHashEntry(node->hashnulls, slot, &isnew, NULL);
	node->havenullrows = true;
}

null values cannot use = operator to compare. Thus if a tuple from the subquery contains any fileds that are null, it cannot be found via hash method. The executor puts them into another data structure node->hashnull and only uses the function findPartialMatch to access it.

After hash tables are built, to evaluate the HashSubPlan, the 1st thing is to test empty or not:

/*
 * The result for an empty subplan is always FALSE; no need to evaluate
 * lefthand side.
 */
*isNull = false;
if (!node->havehashrows && !node->havenullrows)
    return BoolGetDatum(false);

Then handle the case that all fields of the left part slot are non-null:

if (slotNoNulls(slot))
{
    if (node->havehashrows &&
        FindTupleHashEntry(node->hashtable,
                                           slot,
                                           node->cur_eq_comp,
                                           node->lhs_hash_funcs) != NULL)
    {
        ExecClearTuple(slot);
        return BoolGetDatum(true);
    }
    if (node->havenullrows &&
        findPartialMatch(node->hashnulls, slot, node->cur_eq_funcs))
    {
        ExecClearTuple(slot);
        *isNull = true;
        return BoolGetDatum(false);
    }
    ExecClearTuple(slot);
    return BoolGetDatum(false);
}

The above shows the only case the HashSubPlan might return true: find an exact match in the non-null hash table. This is the only way that return true. When leaving this piece of code, anyway it must return false, all jobs are for null setting.

Given left part slot is non-null, and we don’t find an exact match tuple from non-null hash table, then we need to use findPartialMatch to see if the whole result is null or not. NOTE: this function is doing whole hash table scan. If it cannot reject (not short-circulted) then it is called partial matched and the result will be null.

For left part slot also contains some nulls, the analysis stage is almost the same.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.