Recently I have been working on several issues involving relation_size in Greenplum; this post summarizes some interesting parts of the topic. I also posted a thread on the gpdb-dev mailing list: https://groups.google.com/a/greenplum.org/g/gpdb-dev/c/DkTx4O-kuH0.
The problem
Q: given a table that stores some other tables' OIDs, how do we write a SQL query that computes the table size for each OID?
This problem is trivial in Postgres (a single instance), but working it out efficiently in Greenplum turns out to be much more complex than it looks at first glance.
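For concreteness, here is a minimal sketch of the setup assumed in the rest of this post (the column name oid and the inserted table names are only illustrative):

create table t (oid oid);
insert into t select oid from pg_class where relname in ('foo', 'bar');  -- 'foo', 'bar': hypothetical user tables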
Greenplum’s MPP architecture
If you are familiar with Greenplum, please skip this section. A Greenplum cluster consists of a coordinator and many segments; the catalog is replicated everywhere (with some differences between the coordinator's catalog and the segments'). Data is stored on the segments following the shared-nothing design. Clients connect directly to the coordinator, and the coordinator forks a worker process called the QD to handle the client connection. Since the data lives on the segments, the query has to be executed by many distributed worker processes; these are forked by the segments and are called QEs. The QD generates the distributed plan, starts the distributed transaction, creates gangs (groups of QEs), and dispatches the plan to the gangs to execute the SQL.
Limitations:
- only the QD can create gangs and dispatch
- during a query, the QD can manage at most one dispatch context
pg_relation_size’s MPP pattern
pg_relation_size is from Postgres; Greenplum makes it an MPP function, and many other functions are made MPP using the same code pattern:
- if executed on the QD, dispatch the statement to the QEs, fetch all the results, then combine them on the QD to form the final result
- if executed on a QE, just behave like Postgres
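A quick illustration of the pattern (a sketch; I use gp_dist_random('gp_id') here only as a convenient one-row-per-segment source):

select pg_relation_size('pg_class'::regclass);   -- on the QD: dispatched to the segments, results combined into one total
select pg_relation_size('pg_class'::regclass)
from gp_dist_random('gp_id');                    -- on the QEs: each segment reports only its local size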
Solutions
S1
select oid, pg_relation_size(oid) from t;
This will not work, because pg_relation_size(oid) is in the target list and, for a given oid stored in t, is only executed by the one QE that holds that row, so the result reflects only that segment's local portion of the table.
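To see where the target list actually runs, we can project the gp_segment_id system column next to it:

select gp_segment_id, oid, pg_relation_size(oid) from t;  -- each size comes from the single segment holding that row of t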
S2
select oid, pg_relation_size(oid) from
(select oid from t limit 1000)x;
This will not work either. The limit in the subquery forces a gather to the QD, so pg_relation_size is executed on the QD and tries to dispatch. Unfortunately, Greenplum does not support dispatching twice within a single query's context.
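For contrast, a bare call issued directly by the client is fine, because then pg_relation_size runs on the QD with no other dispatch in flight and is free to perform its single dispatch:

select pg_relation_size('t'::regclass);  -- works: the function's own dispatch is the only one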
S3
select oid, pg_relation_size(oid) from
unnest(array(select oid from t)) x(oid);
This might work under the planner using an InitPlan, but the performance is poor because pg_relation_size dispatches to all segments every time it is executed, i.e. once per oid. This method was found by one of my colleagues.
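A sketch of trying this under the Postgres-based planner (setting optimizer = off disables ORCA):

set optimizer = off;
select oid, pg_relation_size(oid)
from unnest(array(select oid from t)) x(oid);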
S4
alter table t set distributed replicated;
select oid, sum(pg_relation_size(oid))
from gp_dist_random('t') group by oid;
This method forces t to be a replicated table and then uses the gp_dist_random trick to compute the size on the QEs, manually summing the results from all QEs. But this method is not correct if the planner chooses a single-stage agg: in that case rows are first redistributed by oid and the size is computed afterwards, so pg_relation_size only runs on the one segment each oid was sent to instead of on every segment. We might use GUCs to force a multi-stage agg.
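A sketch of the GUC approach; the GUC names below are my guesses from memory and may differ by Greenplum version and optimizer:

set gp_enable_multiphase_agg = on;        -- planner: allow the two-stage aggregate (assumed name)
set optimizer_force_multistage_agg = on;  -- ORCA: force a multi-stage aggregate (assumed name)
select oid, sum(pg_relation_size(oid))
from gp_dist_random('t') group by oid;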
S5
This is the most efficient correct method I have found so far. It solves the drawback of S4 using an upstream feature called security barrier. The idea came to mind because I had fixed a security-barrier case before (Greenplum picked up this feature while merging PostgreSQL 9.2).
alter table t set distributed replicated;
create view mv with (security_barrier) as
select oid, pg_relation_size(oid) size from gp_dist_random('t');
select oid, sum(size) from mv group by oid;
Thanks to the security barrier, pg_relation_size will be executed before the motion. The SQL dispatches only once to get all the tables' sizes, so it is efficient.
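One way to convince yourself is to look at the plan: the pg_relation_size projection should appear in the subquery scan of the view, below the Gather Motion, i.e. it is evaluated on the QEs before rows move to the QD:

explain verbose select oid, sum(size) from mv group by oid;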
S6
S5 above seems overkill (using a security barrier) just to prevent subquery pullup; in fact a subquery with a volatile projection fulfills the same goal: pg_relation_size is volatile, so a subquery that projects it won't be pulled up:
create table t(id oid) distributed replicated;
select id, sum(size) as size
from
(
select id, pg_relation_size(id) from gp_dist_random('t')
) x(id, size)
group by id;
The above SQL always computes the target list before the later motion, so it is correct even when multi-stage agg is disabled.
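The volatility that blocks the pullup can be checked directly in the catalog (provolatile = 'v' means volatile):

select proname, provolatile from pg_proc where proname = 'pg_relation_size';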