Implement 3-phase aggregation with DEDUP HashAgg for DISTINCT. #676

avamingli · 2024-10-16T13:39:31Z

When DISTINCT on distribution keys or DISTINCT with GROUP BY on distributions keys, we could also add a HashAgg node to do a pre-dedup before aggregation.

A 3-phase aggregation is as:

explain(costs off)
select count(distinct a) from t_issue_659;
                   QUERY PLAN
-------------------------------------------------
 Finalize Aggregate
   ->  Gather Motion 3:1  (slice1; segments: 3)
         ->  Partial Aggregate
               ->  HashAggregate
                     Group Key: a
                     ->  Seq Scan on t_issue_659
 Optimizer: Postgres query optimizer
(7 rows)

HashAggregate node may eliminate tuples and have a win compared to two-phase aggregation with Partial and Final Aggregate.

The effect is closely related to the data distribution of distinct values across segments, and we introduce a new GUC to make use win.

set gp_eager_distinct_dedup = on;

If set, planner will eager to use our 3-phase aggregate plan.

This also works for join:

explain (costs off) select count(distinct dqa_t1.d) from dqa_t1, dqa_t2 where dqa_t1.d = dqa_t2.d group by dqa_t2.dt;
                            QUERY PLAN
------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)
   ->  Finalize HashAggregate
         Group Key: dqa_t2.dt
         ->  Redistribute Motion 3:3  (slice2; segments: 3)
               Hash Key: dqa_t2.dt
               ->  Partial HashAggregate
                     Group Key: dqa_t2.dt
                     ->  HashAggregate
                           Group Key: dqa_t2.dt, dqa_t1.d
                           ->  Hash Join
                                 Hash Cond: (dqa_t1.d = dqa_t2.d)
                                 ->  Seq Scan on dqa_t1
                                 ->  Hash
                                       ->  Seq Scan on dqa_t2
 Optimizer: Postgres query optimizer
(15 rows)

Authored-by: Zhang Mingli [email protected]

fix #ISSUE_Number

Change logs

Describe your change clearly, including what problem is being solved or what feature is being added.

If it has some breaking backward or forward compatibility, please clary.

Why are the changes needed?

Describe why the changes are necessary.

Does this PR introduce any user-facing change?

If yes, please clarify the previous behavior and the change this PR proposes.

How was this patch tested?

Please detail how the changes were tested, including manual tests and any relevant unit or integration tests.

Contributor's Checklist

Here are some reminders and checklists before/when submitting your pull request, please check them:

Make sure your Pull Request has a clear title and commit message. You can take git-commit template as a reference.
Sign the Contributor License Agreement as prompted for your first-time contribution(One-time setup).
Learn the coding contribution guide, including our code conventions, workflow and more.
List your communication in the GitHub Issues or Discussions (if has or needed).
Document changes.
Add tests for the change
Pass make installcheck
Pass make -C src/test installcheck-cbdb-parallel
Feel free to request cloudberrydb/dev team for review and approval when your PR is ready🥳

When DISTINCT on distribution keys or DISTINCT with GROUP BY on distributions keys, we could also add a HashAgg node to do a pre-dedup before aggregation. A 3-phase aggregation is as: explain(costs off) select count(distinct a) from t_issue_659; QUERY PLAN ------------------------------------------------- Finalize Aggregate -> Gather Motion 3:1 (slice1; segments: 3) -> Partial Aggregate -> HashAggregate Group Key: a -> Seq Scan on t_issue_659 Optimizer: Postgres query optimizer (7 rows) HashAggregate node may eliminate tuples and have a win compared to two-phase aggregation with Partial and Final Aggregate. The effect is closely related to the data distribution of distinct values across segments, and we introduce a new GUC to make use win. set gp_eager_distinct_dedup = on; If set, planner will eager to use our 3-phase aggregate plan. This also works for join: explain (costs off) select count(distinct dqa_t1.d) from dqa_t1, dqa_t2 where dqa_t1.d = dqa_t2.d group by dqa_t2.dt; QUERY PLAN ------------------------------------------------------------------ Gather Motion 3:1 (slice1; segments: 3) -> Finalize HashAggregate Group Key: dqa_t2.dt -> Redistribute Motion 3:3 (slice2; segments: 3) Hash Key: dqa_t2.dt -> Partial HashAggregate Group Key: dqa_t2.dt -> HashAggregate Group Key: dqa_t2.dt, dqa_t1.d -> Hash Join Hash Cond: (dqa_t1.d = dqa_t2.d) -> Seq Scan on dqa_t1 -> Hash -> Seq Scan on dqa_t2 Optimizer: Postgres query optimizer (15 rows) Authored-by: Zhang Mingli [email protected]

avamingli · 2024-10-16T13:45:44Z

Fix #659

src/backend/cdb/cdbgroupingpaths.c

avamingli mentioned this pull request Oct 16, 2024

[Bug] Optimizer could choose HashAggregate to dedup based on cost. #659

Closed

2 tasks

my-ship-it reviewed Oct 16, 2024

View reviewed changes

src/backend/cdb/cdbgroupingpaths.c Show resolved Hide resolved

my-ship-it approved these changes Oct 17, 2024

View reviewed changes

my-ship-it merged commit bcbb1c5 into apache:main Oct 17, 2024

avamingli mentioned this pull request Oct 17, 2024

Get back gp_distinct_plans tests. #680

Merged

9 tasks

avamingli deleted the distinct_3phase_agg branch October 18, 2024 02:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Implement 3-phase aggregation with DEDUP HashAgg for DISTINCT. #676

Implement 3-phase aggregation with DEDUP HashAgg for DISTINCT. #676

Uh oh!

avamingli commented Oct 16, 2024

Uh oh!

avamingli commented Oct 16, 2024

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Implement 3-phase aggregation with DEDUP HashAgg for DISTINCT. #676

Implement 3-phase aggregation with DEDUP HashAgg for DISTINCT. #676

Uh oh!

Conversation

avamingli commented Oct 16, 2024

Change logs

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Contributor's Checklist

Uh oh!

avamingli commented Oct 16, 2024

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants