Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I have large amounts of low-cardinality string data (for example, 200M rows with only 20 distinct values). DictionaryArrays are very good for such data because they are space efficient.
#256 adds basic query support for distinct dictionary columns, but the implementation is not very computationally efficient. It effectively unpacks the (likely mostly deduplicated) dictionary's values row by row into a hash set in order to deduplicate them again, which is a lot of extra hashing work.
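
For illustration, a rough sketch of that row-by-row pattern (this is not the actual code from #256; the array and type names follow the arrow crate, and the helper function is hypothetical):

```rust
use std::collections::HashSet;

use arrow::array::{Array, DictionaryArray, StringArray};
use arrow::datatypes::Int32Type;

/// Hypothetical helper illustrating the row-by-row approach: every row's
/// string is looked up through the dictionary and re-hashed into a set,
/// even though the dictionary is already mostly deduplicated.
fn distinct_row_by_row(array: &DictionaryArray<Int32Type>) -> HashSet<String> {
    let values = array
        .values()
        .as_any()
        .downcast_ref::<StringArray>()
        .expect("string dictionary values");

    let mut distinct = HashSet::new();
    // With 200M rows and ~20 distinct values this is 200M key lookups
    // and 200M string hash insertions.
    for key in array.keys().iter().flatten() {
        distinct.insert(values.value(key as usize).to_string());
    }
    distinct
}
```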
Describe the solution you'd like
It would likely be much more efficient (especially for arrays with a small number of distinct values in their dictionary) to read the values from the dictionary directly, first checking that each entry in the dictionary is actually referenced by at least one key.
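
A minimal sketch of what I have in mind, assuming string dictionary values and `Int32` keys (the helper name is hypothetical and this is not meant as the final implementation):

```rust
use std::collections::HashSet;

use arrow::array::{Array, DictionaryArray, StringArray};
use arrow::datatypes::Int32Type;

/// Hypothetical helper: compute the distinct string values of a dictionary
/// array by inspecting the dictionary directly instead of hashing every row.
fn distinct_from_dictionary(array: &DictionaryArray<Int32Type>) -> Vec<String> {
    let values = array
        .values()
        .as_any()
        .downcast_ref::<StringArray>()
        .expect("string dictionary values");

    // One pass over the (integer) keys to mark which dictionary slots
    // are actually referenced by at least one row.
    let mut used = vec![false; values.len()];
    for key in array.keys().iter().flatten() {
        used[key as usize] = true;
    }

    // The dictionary itself may contain duplicates, so deduplicate the
    // (small) set of used values rather than all of the rows.
    let mut seen = HashSet::new();
    let mut distinct = Vec::new();
    for (i, is_used) in used.iter().enumerate() {
        if *is_used && !values.is_null(i) {
            let v = values.value(i);
            if seen.insert(v) {
                distinct.push(v.to_string());
            }
        }
    }
    distinct
}
```

For the 200M-row / 20-value case above, this does a single integer scan over the keys plus hashing of only the handful of used dictionary entries, instead of hashing every row's string.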