Handle dicts for distinct count#15871
Conversation
🚀 |
| } | ||
|
|
||
| #[test] | ||
| fn test_nested_dictionary() -> Result<()> { |
There was a problem hiding this comment.
Worried about edge cases like dict of dicts, dict of lists, etc., but couldn't come up with anything that breaks the function. Happy to be challenged 🙏
There was a problem hiding this comment.
I recommend adding something to the aggregate fuzzer:
If you add coverage for adding Dictionary arrays and verify COUNT(DISTINCT ..) that would generate some good results
alamb
left a comment
There was a problem hiding this comment.
Thanks @blaginin -- I think this looks good overall. Is there any way you can add soem slt tests as well
Perhaps following the model in https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/test_files/dictionary.slt ?
| .map(|dict| { | ||
| downcast_dictionary_array! { | ||
| dict => { | ||
| let buff: BooleanArray = dict.occupancy().into(); |
| } | ||
| } | ||
| } | ||
| fn get_primitive_type_accumulator(data_type: &DataType) -> Box<dyn Accumulator> { |
There was a problem hiding this comment.
Minor: technically this does much more than primitive types
For example Utf8 is not a primitive types
Maybe it could be more precisely be called get_count_accumulator
|
|
||
| fn merge_batch(&mut self, states: &[ArrayRef]) -> datafusion_common::Result<()> { | ||
| self.inner.merge_batch(states) | ||
| } |
There was a problem hiding this comment.
If we really want to juice performance, we would also implement a GroupsAccumulator for Dictionary as well
There was a problem hiding this comment.
Yes, for sure - I think we can do it on top of this one?
|
thank you for the review! i really like your fuzzy testing idea - will push soon (and respond to the new comments) |
Sure! Found the exiting test and extended it datafusion/datafusion/sqllogictest/test_files/aggregate.slt Lines 5027 to 5036 in 8ed4259 |
* Handle dicts for distinct count * Fix sqllogictests * Add bench * Fix no fix the bench * Do not panic if error type is bad * Add full bench query * Set the bench * Add dict of dict test * Fix tests * Rename method * Increase the grouping test * Increase the grouping test a bit more :) * Fix flakiness --------- Co-authored-by: Dmitrii Blaginin <blaginin@bmac.local>
Which issue does this PR close?
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?