NOT-IN expressions with subquery in SQL are notorious but very common. Not only do programmers write such kinds of SQLs but also many BI applications generate the kinds of SQLs. With NULL values, things get much more complex and even much much more complex for MPP databases. In this blog, we focus on single-node Postgres to understand the data structures, semantics, and algorithms of this topic.
I will talk more about GPDB’s LASJ implementation of NOT-IN later.
Subquery Expressions in Postgres
Postgres document is a very good material to start https://www.postgresql.org/docs/current/functions-subquery.html, I will first try to use math reasoning to get the semantics and then see if it matches the document and finally match the algorithm (executor code of Postgres).
create table t1(a int, b int, c int);
create table t2(a int, b int, c int);
select * from t1 where a not in (select a from t2);
select * from t1 where not (a = ANY(select a from t2));
select * from t1 where (a,b,c) not in (select a,b,c from t2);
select * from t1 where not ((a,b,c) = ANY(select a,b,c from t2));
C programming language representation of the qual: single-column
We can use gdb with the tool gdbpg to print the c structure of the expressions (I remove some uninteresting fields).
select * from t1 where a not in (select a from t2):
(gdb) p debug_query_string
$1 = 0x559f23b774b8 "select * from t1 where a not in (select a from t2);"
(gdb) pgprint result->jointree->quals
BoolExpr [boolop=NOT_EXPR]
SubLink [subLinkType=ANY_SUBLINK subLinkId=1]
[testexpr]
OpExpr [opno=96 opfuncid=65 opresulttype=16 opretset=false]
Var [varno=1 varattno=1 vartype=23 varnoold=1 varoattno=1]
Param [paramkind=PARAM_SUBLINK paramid=1 paramtype=23]
[operName]
String ["="]
[subselect]
Query [commandType=CMD_SELECT]
[rtable]
RangeTblEntry [rtekind=RTE_RELATION relid=32792 relkind='r' rellockmode=1 jointype=JOIN_INNER]
[jointree]
FromExpr []
[fromlist]
RangeTblRef [rtindex=1]
[targetList]
TargetEntry [resno=1 resname="a" resorigtbl=32792 resorigcol=1]
Var [varno=1 varattno=1 vartype=23 varnoold=1 varoattno=1]
select * from t1 where not (a = ANY(select a from t2)):
(gdb) p debug_query_string
$2 = 0x559f23b774b8 "select * from t1 where not (a = ANY(select a from t2));"
(gdb) pgprint result->jointree->quals
BoolExpr [boolop=NOT_EXPR]
SubLink [subLinkType=ANY_SUBLINK subLinkId=0]
[testexpr]
OpExpr [opno=96 opfuncid=65 opresulttype=16 opretset=false]
Var [varno=1 varattno=1 vartype=23 varnoold=1 varoattno=1]
Param [paramkind=PARAM_SUBLINK paramid=1 paramtype=23]
[operName]
String ["="]
[subselect]
Query [commandType=CMD_SELECT]
[rtable]
RangeTblEntry [rtekind=RTE_RELATION relid=32792 relkind='r' rellockmode=1 jointype=JOIN_INNER]
[jointree]
FromExpr []
[fromlist]
RangeTblRef [rtindex=1]
[targetList]
TargetEntry [resno=1 resname="a" resorigtbl=32792 resorigcol=1]
Var [varno=1 varattno=1 vartype=23 varnoold=1 varoattno=1]
select * from t1 where not (a in (select a from t2)):
(gdb) p debug_query_string
$3 = 0x559f23b774b8 "select * from t1 where not (a in (select a from t2));"
(gdb) pgprint result->jointree->quals
BoolExpr [boolop=NOT_EXPR]
SubLink [subLinkType=ANY_SUBLINK subLinkId=0]
[testexpr]
OpExpr [opno=96 opfuncid=65 opresulttype=16 opretset=false]
Var [varno=1 varattno=1 vartype=23 varnoold=1 varoattno=1]
Param [paramkind=PARAM_SUBLINK paramid=1 paramtype=23]
[operName]
String ["="]
[subselect]
Query [commandType=CMD_SELECT]
[rtable]
RangeTblEntry [rtekind=RTE_RELATION relid=32792 relkind='r' rellockmode=1 jointype=JOIN_INNER]
[jointree]
FromExpr []
[fromlist]
RangeTblRef [rtindex=1]
[targetList]
TargetEntry [resno=1 resname="a" resorigtbl=32792 resorigcol=1]
Var [varno=1 varattno=1 vartype=23 varnoold=1 varoattno=1]
Compare the C structure representation, we know that the following two expressions are equivalent:
a not in (select a from t1)not (a in (select a from t1))not (a = ANY (select a from t1))
Evaluation Semantics of single-column NOT-IN
An expression here can be evaluated to 3 results: bool true, bool false and null.
A special case is that the subquery generates no data at all. In this case, the expression is true no matter what the left side expression is (even if the left side of not in is null). So a not in (select a from t1 where false) is always true for any a (at the left of not in).
Next, let’s consider the cases that which the subquery does generate some data (not empty). Suppose its result set is {a1,a2,a3,a4,...an}, then the expression can be:
not (a = ANY (select a from t1))==>not (a = ANY (a1, a2, a3, a4, ..., an))==>not (a=a1 or a=a2 or ... or a=an) ==>a<>a1 and a<>a2 and ... a<>an
The final expression is AND together all small clauses. The truth table tells us:
- if we see one clause is
falseand the whole isfalse(even includenull), - the clause that is
truecan be removed - the 1st step shortcircuit to the result, and the 2nd step removes useless clauses, after both, we still get a clause that evaluated to
nullthen the result isnull
So we can have the following conclusion:
- if the left side of NOT-IN
aisnull, then the whole expression isnull - if given the left side of NOT-IN
ais notnull, then- the subquery result contains at least one
null, - and all non-null values don’t equal the left side
a - the expression result is
null
- the subquery result contains at least one
- if given the left side of NOT-IN
ais notnull, then- one of the non-null values equals the left side
a - the expression result is
false
- one of the non-null values equals the left side
- if given the left side of NOT-IN
ais notnull, then- there is no
nullin the subquery result - none of the values in the result equals the left side
a - the result is
true
- there is no
the above matches the Postgres document:

Evaluation Semantics of multiple-columns NOT-IN
Almost the same reasoning progress as the previous section’s single-column, just each clause in the final AND series is something like vector-not-equal. The analysis progress is also: look into the c structure representation, do math analysis, and read the Postgres document.
(gdb) p debug_query_string
$1 = 0x559f23b774b8 "select * from t1 where (a,b,c) not in (select a,b,c from t2);"
(gdb) pgprint result->jointree->quals
BoolExpr [boolop=NOT_EXPR]
SubLink [subLinkType=ANY_SUBLINK subLinkId=0]
[testexpr]
BoolExpr [boolop=AND_EXPR]
OpExpr [opno=96 opfuncid=65 opresulttype=16 opretset=false]
Var [varno=1 varattno=1 vartype=23 varnoold=1 varoattno=1]
Param [paramkind=PARAM_SUBLINK paramid=1 paramtype=23]
OpExpr [opno=96 opfuncid=65 opresulttype=16 opretset=false]
Var [varno=1 varattno=2 vartype=23 varnoold=1 varoattno=2]
Param [paramkind=PARAM_SUBLINK paramid=2 paramtype=23]
OpExpr [opno=96 opfuncid=65 opresulttype=16 opretset=false]
Var [varno=1 varattno=3 vartype=23 varnoold=1 varoattno=3]
Param [paramkind=PARAM_SUBLINK paramid=3 paramtype=23]
[operName]
String ["="]
[subselect]
Query [commandType=CMD_SELECT]
[rtable]
RangeTblEntry [rtekind=RTE_RELATION relid=32792 relkind='r' rellockmode=1 jointype=JOIN_INNER]
[jointree]
FromExpr []
[fromlist]
RangeTblRef [rtindex=1]
[targetList]
TargetEntry [resno=1 resname="a" resorigtbl=32792 resorigcol=1]
Var [varno=1 varattno=1 vartype=23 varnoold=1 varoattno=1]
TargetEntry [resno=2 resname="b" resorigtbl=32792 resorigcol=2]
Var [varno=1 varattno=2 vartype=23 varnoold=1 varoattno=2]
TargetEntry [resno=3 resname="c" resorigtbl=32792 resorigcol=3]
Var [varno=1 varattno=3 vartype=23 varnoold=1 varoattno=3]
The difference from the single-column case is the target list of the subquery and the test_expr of the sublink (the clause) which now is a vector-equal test. If the subquery generates an empty result, then the expression of NOT IN is true no matter what its left-side part expression is.
not ((a,b,c) = ANY (select a,b,c from t1))==>(a <> a1 or b <> b1 or c <> c1) AND (a <> a2 or b <> b2 or c <> c2) AND ... AND (a <> an or b <> bn or c <> cn)
Recall the truth table:
- for
AND, any sub-clause isfalse, the whole isfalseshort-circuitly. - for
OR, any sub-clause istrue, the whole istrueshort-circuitly.
We just need to answer in which case it might be evaluated to null:
- Given no field from the left side part of
NOT-INisnull:- no results from the subquery make any clause like
(a <> a1 or b <> b1 or c <> c1)returnfalse(means no top-level short-circuit), no partial fields eq match at all. - at least one of the fields of subquery results is
null
- no results from the subquery make any clause like
- Given all fields from the left side part of
NOT-INisnull, then the results must benull - Given some fields of the left side part are
nulland the other fields of the left part are notnull:- no results from the subquery make any clause like
(a <> a1 or b <> b1 or c <> c1)returnfalse(means no top-level short-circuit), no partial fields eq match at all. - meanwhile, at least one clause does not short-circuit for
OR: all non-null fields of the left part have eq match.
- no results from the subquery make any clause like

Postgres Plan and Executor for NOT-IN
Postgreds planner does not choose to pull up such kinds of SubLink, not pull-up-ed sublinks will be created as SubPlan to evaluate, based on the configuration of memory limit and some other conditions, Postgres might choose HashSubPlan, look at the code at subselect.c:build_subplan:
/*
* We can't convert subplans of ALL_SUBLINK or ANY_SUBLINK types to
* initPlans, even when they are uncorrelated or undirect correlated,
* because we need to scan the output of the subplan for each outer
* tuple. But if it's a not-direct-correlated IN (= ANY) test, we
* might be able to use a hashtable to avoid comparing all the tuples.
*/
if (subLinkType == ANY_SUBLINK &&
splan->parParam == NIL &&
subplan_is_hashable(plan) &&
testexpr_is_hashable(splan->testexpr, splan->paramIds))
splan->useHashTable = true;
We focus on hash subplan way here:
gpadmin=# explain (costs off, verbose)
gpadmin-# select * from t1 where (a,b,c) not in (select a,b,c from t2);
QUERY PLAN
------------------------------------
Seq Scan on public.t1
Output: t1.a, t1.b, t1.c
Filter: (NOT (hashed SubPlan 1))
SubPlan 1
-> Seq Scan on public.t2
Output: t2.a, t2.b, t2.c
Next lets go through the executor code of nodeSubplan.c:ExecHashSubPlan. The 1st step is to build a hashtable, in order to handle null result from subquery, it will create two hash tables (in most cases):
// code piece from nodeSubplan.c:buildSubPlanHash
/*
* If result contains any nulls, store separately or not at all.
*/
if (slotNoNulls(slot))
{
(void) LookupTupleHashEntry(node->hashtable, slot, &isnew, NULL);
node->havehashrows = true;
}
else if (node->hashnulls)
{
(void) LookupTupleHashEntry(node->hashnulls, slot, &isnew, NULL);
node->havenullrows = true;
}
null values cannot use = operator to compare. Thus if a tuple from the subquery contains any fileds that are null, it cannot be found via hash method. The executor puts them into another data structure node->hashnull and only uses the function findPartialMatch to access it.

After hash tables are built, to evaluate the HashSubPlan, the 1st thing is to test empty or not:
/*
* The result for an empty subplan is always FALSE; no need to evaluate
* lefthand side.
*/
*isNull = false;
if (!node->havehashrows && !node->havenullrows)
return BoolGetDatum(false);
Then handle the case that all fields of the left part slot are non-null:
if (slotNoNulls(slot))
{
if (node->havehashrows &&
FindTupleHashEntry(node->hashtable,
slot,
node->cur_eq_comp,
node->lhs_hash_funcs) != NULL)
{
ExecClearTuple(slot);
return BoolGetDatum(true);
}
if (node->havenullrows &&
findPartialMatch(node->hashnulls, slot, node->cur_eq_funcs))
{
ExecClearTuple(slot);
*isNull = true;
return BoolGetDatum(false);
}
ExecClearTuple(slot);
return BoolGetDatum(false);
}
The above shows the only case the HashSubPlan might return true: find an exact match in the non-null hash table. This is the only way that return true. When leaving this piece of code, anyway it must return false, all jobs are for null setting.
Given left part slot is non-null, and we don’t find an exact match tuple from non-null hash table, then we need to use findPartialMatch to see if the whole result is null or not. NOTE: this function is doing whole hash table scan. If it cannot reject (not short-circulted) then it is called partial matched and the result will be null.
For left part slot also contains some nulls, the analysis stage is almost the same.