Handle dicts for distinct count by blaginin · Pull Request #15871 · apache/datafusion

blaginin · 2025-04-27T21:22:21Z

Which issue does this PR close?

Closes Improve performance of COUNT (distinct x) for dictionary columns #258

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

blaginin · 2025-05-08T21:40:50Z

group                                              main                                   pr
-----                                              ----                                   --
count low cardinality dict 20% nulls, no filter    50.55    12.2±0.39ms        ? ?/sec    1.00  240.9±327.44µs        ? ?/sec

🚀

blaginin · 2025-05-08T22:49:20Z

datafusion/functions-aggregate/src/count.rs

    }
+
+    #[test]
+    fn test_nested_dictionary() -> Result<()> {


Worried about edge cases like dict of dicts, dict of lists, etc., but couldn't come up with anything that breaks the function. Happy to be challenged 🙏
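To make the "dict of dicts" case concrete, here is a minimal std-only sketch of what such a column looks like and what a distinct count over it should return. It does not use arrow types; `Values`, `Dict`, `get`, and `count_distinct` are hypothetical names for illustration only, and keys are assumed to be in range.

```rust
use std::collections::HashSet;

/// Values of a toy dictionary column: either plain strings or another dictionary.
enum Values {
    Plain(Vec<Option<String>>),
    Nested(Box<Dict>),
}

/// Hypothetical stand-in for a dictionary array; a `None` key is a null row.
struct Dict {
    keys: Vec<Option<usize>>,
    values: Values,
}

impl Dict {
    /// Resolves a row to its final value, following nested dictionaries.
    fn get(&self, row: usize) -> Option<&str> {
        let key = self.keys[row]?;
        match &self.values {
            Values::Plain(v) => v[key].as_deref(),
            Values::Nested(inner) => inner.get(key),
        }
    }

    /// Counts distinct non-null resolved values across all rows.
    fn count_distinct(&self) -> usize {
        (0..self.keys.len())
            .filter_map(|row| self.get(row))
            .collect::<HashSet<&str>>()
            .len()
    }
}
```

A null can appear at either level (a null outer key, or an outer key that points at a null inner row), and both should be excluded from the count.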

I recommend adding something to the aggregate fuzzer:

datafusion/datafusion/core/tests/fuzz_cases/aggregate_fuzz.rs

Line 4 in 7d3c7d8

// regarding copyright ownership. The ASF licenses this file

If you add coverage for Dictionary arrays and verify COUNT(DISTINCT ..) against them, that would likely give good results

That was a great suggestion - it helped catch a related bug: #16228

Testing itself is in #16232
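The fuzzing idea is essentially differential testing: generate random dictionary-encoded input, compute the distinct count on the encoded form, and compare it with a plain count over the decoded rows. The sketch below shows that shape with std-only Rust. It is not the code in `aggregate_fuzz.rs`; `Lcg`, `distinct_encoded`, `distinct_decoded`, and `fuzz_distinct` are hypothetical names, and the generator is a simple LCG chosen only to avoid external crates.

```rust
use std::collections::HashSet;

/// Tiny deterministic pseudo-random generator (an LCG), so the sketch needs no crates.
struct Lcg(u64);

impl Lcg {
    /// Returns a pseudo-random number in `0..bound`.
    fn below(&mut self, bound: u64) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (self.0 >> 33) % bound
    }
}

/// Distinct count on the encoded form: look only at referenced dictionary slots.
fn distinct_encoded(keys: &[Option<usize>], values: &[Option<u32>]) -> usize {
    let used: HashSet<usize> = keys.iter().flatten().copied().collect();
    let distinct: HashSet<u32> = used.iter().filter_map(|&k| values[k]).collect();
    distinct.len()
}

/// Reference answer: decode every row, then count distinct non-null values.
fn distinct_decoded(keys: &[Option<usize>], values: &[Option<u32>]) -> usize {
    let rows: Vec<Option<u32>> = keys.iter().map(|k| k.and_then(|k| values[k])).collect();
    rows.into_iter().flatten().collect::<HashSet<u32>>().len()
}

/// Runs `rounds` random cases and returns the number of mismatches found.
fn fuzz_distinct(seed: u64, rounds: usize) -> usize {
    let mut rng = Lcg(seed);
    let mut mismatches = 0;
    for _ in 0..rounds {
        let dict_len = 1 + rng.below(8) as usize;
        // Low-cardinality values, with duplicates and nulls inside the dictionary.
        let values: Vec<Option<u32>> = (0..dict_len)
            .map(|_| if rng.below(5) == 0 { None } else { Some(rng.below(4) as u32) })
            .collect();
        let rows = rng.below(50) as usize;
        // Roughly 20% null keys; the rest point at a random dictionary slot.
        let keys: Vec<Option<usize>> = (0..rows)
            .map(|_| if rng.below(5) == 0 { None } else { Some(rng.below(dict_len as u64) as usize) })
            .collect();
        if distinct_encoded(&keys, &values) != distinct_decoded(&keys, &values) {
            mismatches += 1;
        }
    }
    mismatches
}
```

Generating nulls inside the dictionary values, and not only in the keys, matters here: that is the kind of input the related bug report (#16228) is about.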

alamb

Thanks @blaginin -- I think this looks good overall. Is there any way you can add some slt tests as well?

Perhaps following the model in https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/test_files/dictionary.slt ?

alamb · 2025-05-31T18:10:23Z

datafusion/functions-aggregate-common/src/aggregate/count_distinct/dict.rs

+            .map(|dict| {
+                downcast_dictionary_array! {
+                    dict => {
+                        let buff: BooleanArray = dict.occupancy().into();


TIL: occupancy
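For anyone else meeting `occupancy` for the first time: the idea is to mark which dictionary values are referenced by at least one non-null key, so only those values need to reach the distinct counter. Below is a minimal sketch of that idea using plain vectors instead of arrow arrays. `DictColumn`, `occupancy`, and `count_distinct` here are hypothetical illustration names, not the implementation in this PR, and keys are assumed to be valid indices into `values`.

```rust
use std::collections::HashSet;

/// A toy dictionary-encoded column (hypothetical stand-in for a dictionary array).
/// `keys[i]` is `None` for a null row, otherwise an index into `values`.
/// `values[j]` is `None` when the dictionary slot itself holds a null.
struct DictColumn {
    keys: Vec<Option<usize>>,
    values: Vec<Option<String>>,
}

/// Marks which dictionary slots are referenced by at least one non-null key.
fn occupancy(col: &DictColumn) -> Vec<bool> {
    let mut used = vec![false; col.values.len()];
    for key in col.keys.iter().flatten() {
        used[*key] = true;
    }
    used
}

/// Counts distinct non-null values by scanning only the occupied dictionary slots.
fn count_distinct(col: &DictColumn) -> usize {
    let used = occupancy(col);
    let mut seen: HashSet<&str> = HashSet::new();
    for (slot, value) in col.values.iter().enumerate() {
        if used[slot] {
            if let Some(v) = value {
                seen.insert(v.as_str());
            }
        }
    }
    seen.len()
}
```

The work is proportional to the dictionary size plus one pass over the keys, rather than hashing every row's value, which is where a low-cardinality column would be expected to benefit.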

alamb · 2025-05-31T18:11:51Z

datafusion/functions-aggregate/src/count.rs

        }
    }
 }
+fn get_primitive_type_accumulator(data_type: &DataType) -> Box<dyn Accumulator> {


Minor: technically this does much more than primitive types

For example, Utf8 is not a primitive type

Maybe it could more precisely be called get_count_accumulator

alamb · 2025-05-31T18:12:45Z

datafusion/functions-aggregate-common/src/aggregate/count_distinct/dict.rs

+
+    fn merge_batch(&mut self, states: &[ArrayRef]) -> datafusion_common::Result<()> {
+        self.inner.merge_batch(states)
+    }


If we really want to juice performance, we would also implement a GroupsAccumulator for Dictionary as well

Yes, for sure - I think we can do it on top of this one?
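As a rough picture of what a groups-aware version would do differently: a single accumulator holds the distinct state for every group and receives each batch together with a group index per row, instead of one accumulator object per group. The sketch below is std-only and only loosely modelled on that idea. `GroupedDistinctCount` and its methods are hypothetical and do not reproduce DataFusion's `GroupsAccumulator` trait or its signatures.

```rust
use std::collections::HashSet;

/// Toy per-group distinct counter (hypothetical; not DataFusion's trait).
/// One instance holds the state for every group, indexed by group id.
struct GroupedDistinctCount {
    seen: Vec<HashSet<String>>,
}

impl GroupedDistinctCount {
    fn new() -> Self {
        Self { seen: Vec::new() }
    }

    /// `values[i]` belongs to group `group_indices[i]`; null values are skipped.
    fn update_batch(
        &mut self,
        values: &[Option<&str>],
        group_indices: &[usize],
        total_num_groups: usize,
    ) {
        assert_eq!(values.len(), group_indices.len());
        // Grow the per-group state when new groups appear.
        if self.seen.len() < total_num_groups {
            self.seen.resize_with(total_num_groups, HashSet::new);
        }
        for (value, &group) in values.iter().zip(group_indices) {
            if let Some(v) = value {
                self.seen[group].insert(v.to_string());
            }
        }
    }

    /// Emits one distinct count per group.
    fn evaluate(&self) -> Vec<usize> {
        self.seen.iter().map(|set| set.len()).collect()
    }
}
```

A dictionary-aware variant could, under the same assumptions, track dictionary keys per group and resolve them to values only once per batch.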

blaginin · 2025-05-31T18:15:46Z

thank you for the review! i really like your fuzz testing idea - will push soon (and respond to the new comments)

blaginin · 2025-06-02T17:14:17Z

Is there any way you can add some slt tests as well

Sure! Found the existing test and extended it

datafusion/datafusion/sqllogictest/test_files/aggregate.slt

Lines 5027 to 5036 in 8ed4259

    
    ## Multiple distinct aggregates and dictionaries
    statement ok
    create table dict_test as values (1, arrow_cast('foo', 'Dictionary(Int32, Utf8)')), (2, arrow_cast('bar', 'Dictionary(Int32, Utf8)')), (1, arrow_cast('bar', 'Dictionary(Int32, Utf8)'));

    query IT
    select * from dict_test;
    ----
    1 foo
    2 bar
    1 bar

* Handle dicts for distinct count
* Fix sqllogictests
* Add bench
* Fix no fix the bench
* Do not panic if error type is bad
* Add full bench query
* Set the bench
* Add dict of dict test
* Fix tests
* Rename method
* Increase the grouping test
* Increase the grouping test a bit more :)
* Fix flakiness

---------

Co-authored-by: Dmitrii Blaginin <blaginin@bmac.local>

Handle dicts for distinct count

9004db2

github-actions bot added the functions Changes to functions implementation label Apr 27, 2025

blaginin added 6 commits April 28, 2025 22:30

Fix sqllogictests

8c5b623

Add bench

9a0dad6

Merge branch 'main' into use-dicting-count

06819dc

Fix no fix the bench

1dbe235

Do not panic if error type is bad

d2e61e8

Add full bench query

e51f766

github-actions bot added the core Core DataFusion crate label May 6, 2025

Set the bench

933f19d

github-actions bot removed the core Core DataFusion crate label May 8, 2025

Add dict of dict test

47fc9b5

blaginin self-assigned this May 8, 2025

blaginin commented May 8, 2025


blaginin requested a review from Dandandan May 8, 2025 22:50

blaginin marked this pull request as ready for review May 13, 2025 10:18

Merge main

2978e7a

blaginin mentioned this pull request May 21, 2025

Improve performance of COUNT (distinct x) for dictionary columns #258

Closed

Merge branch 'main' into use-dicting-count

161c1bb

alamb approved these changes May 31, 2025


blaginin mentioned this pull request Jun 2, 2025

Add dicts to aggregation fuzz testing #16232

Merged

blaginin added 4 commits June 2, 2025 16:06

Merge branch 'main' into use-dicting-count

a6f9507

Fix tests

e7e8cef

Rename method

c8c94d1

Increase the grouping test

8ed4259

github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Jun 2, 2025

Increase the grouping test a bit more :)

dc7d508

blaginin added 3 commits June 2, 2025 20:07

Fix flakiness

d4f2a0c

Merge branch 'main' into use-dicting-count

1c44c57

Merge branch 'main' into use-dicting-count

f5b09b8

blaginin merged commit 5e307b3 into apache:main Jun 5, 2025
27 checks passed

This was referenced Jun 9, 2025

Incorrect count null in dict values #16228

Closed

Fix distinct count for DictionaryArray to correctly account for nulls in values array #16258

Merged

blaginin merged 19 commits into apache:main from blaginin:use-dicting-count
