Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I have large amounts of low-cardinality string data (for example, 200M rows with only 20 distinct values). DictionaryArrays are very good for such data because they are space efficient.
#256 adds basic query support for distinct dictionary columns, but the implementation is not very computationally efficient. It effectively unpacks the (likely mostly deduplicated) dictionary's values row by row into a hash set in order to deduplicate them again, which is a lot of extra hashing work.
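
For illustration, a rough sketch of that row-by-row pattern (this is not the actual code from #256; the array and type names follow the arrow crate, and the helper function is hypothetical):

```rust
use std::collections::HashSet;

use arrow::array::{Array, DictionaryArray, StringArray};
use arrow::datatypes::Int32Type;

/// Hypothetical helper illustrating the row-by-row approach: every row's
/// string is looked up through the dictionary and re-hashed into a set,
/// even though the dictionary is already mostly deduplicated.
fn distinct_row_by_row(array: &DictionaryArray<Int32Type>) -> HashSet<String> {
    let values = array
        .values()
        .as_any()
        .downcast_ref::<StringArray>()
        .expect("string dictionary values");

    let mut distinct = HashSet::new();
    // With 200M rows and ~20 distinct values this is 200M key lookups
    // and 200M string hash insertions.
    for key in array.keys().iter().flatten() {
        distinct.insert(values.value(key as usize).to_string());
    }
    distinct
}
```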
Describe the solution you'd like
It would likely be much more efficient (especially for arrays with a small number of distinct values in their dictionary) to read the values from the dictionary directly, first checking that each entry in the dictionary is actually referenced by at least one key.
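
A minimal sketch of what I have in mind, assuming string dictionary values and `Int32` keys (the helper name is hypothetical and this is not meant as the final implementation):

```rust
use std::collections::HashSet;

use arrow::array::{Array, DictionaryArray, StringArray};
use arrow::datatypes::Int32Type;

/// Hypothetical helper: compute the distinct string values of a dictionary
/// array by inspecting the dictionary directly instead of hashing every row.
fn distinct_from_dictionary(array: &DictionaryArray<Int32Type>) -> Vec<String> {
    let values = array
        .values()
        .as_any()
        .downcast_ref::<StringArray>()
        .expect("string dictionary values");

    // One pass over the (integer) keys to mark which dictionary slots
    // are actually referenced by at least one row.
    let mut used = vec![false; values.len()];
    for key in array.keys().iter().flatten() {
        used[key as usize] = true;
    }

    // The dictionary itself may contain duplicates, so deduplicate the
    // (small) set of used values rather than all of the rows.
    let mut seen = HashSet::new();
    let mut distinct = Vec::new();
    for (i, is_used) in used.iter().enumerate() {
        if *is_used && !values.is_null(i) {
            let v = values.value(i);
            if seen.insert(v) {
                distinct.push(v.to_string());
            }
        }
    }
    distinct
}
```

For the 200M-row / 20-value case above, this does a single integer scan over the keys plus hashing of only the handful of used dictionary entries, instead of hashing every row's string.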