GH-37055: [C++] Optimize hash kernels for Dictionary ChunkedArrays #38394
felipecrv merged 6 commits into apache:main
Conversation
Benchmark result: https://gist.github.com/js8544/d64f289313c814a2df6a87a945a0b382#file-pr38394-txt There is also a significant improvement in the user's Python case: #37055 (comment)
It was mentioned in #9683 (comment) that we can compute the result of each chunk and then merge them with the dictionary unifier. This PR saves calls to the dictionary unifier; we can also further optimize the unifying process itself. This will be done once we have a faster hashtable: #38372.
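The saving can be illustrated with a minimal, self-contained sketch (hypothetical `Dictionary` type and cost model, not Arrow's actual `DictionaryUnifier` API): before this change, every chunk re-unifies the already-accumulated dictionary, while a reused unifier feeds each chunk's dictionary in exactly once:

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for a chunk's dictionary values.
using Dictionary = std::vector<std::string>;

// Count how many dictionary values each strategy feeds to the unifier.
// Before this PR: every chunk re-unifies the accumulated dictionary.
size_t NaiveCost(const std::vector<Dictionary>& chunks) {
  std::set<std::string> merged;
  size_t cost = 0;
  for (const auto& chunk : chunks) {
    cost += merged.size();  // re-feeding the existing merged dictionary
    cost += chunk.size();   // plus the new chunk's dictionary
    merged.insert(chunk.begin(), chunk.end());
  }
  return cost;
}

// After this PR: one unifier persists across chunks, so each chunk's
// dictionary is fed to it exactly once.
size_t ReusedUnifierCost(const std::vector<Dictionary>& chunks) {
  size_t cost = 0;
  for (const auto& chunk : chunks) cost += chunk.size();
  return cost;
}
```

With many chunks the naive cost grows with the size of the accumulated dictionary on every step, while the reused unifier's cost is linear in the total dictionary input.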
@felipecrv Do you mind having a look at this?
Soon! |
Do we even need dictionary_ to be a member variable now? Wouldn't it suffice to perform a single DictionaryUnifier::GetResult call at the end?
Documentation for DictionaryUnifier::GetResult says the unifier can't be re-used after a call to GetResult [1].
My suggestion (that I think will work well):
Rename dictionary_ to first_dictionary_ and change Append(arr) to transition through this state machine on each call:
```
// ---------------------------------------------------------------------------------------------
// Current State                                  Next State
// ---------------------------------------------------------------------------------------------
// !first_dictionary_ && !dictionary_unifier_ --> first_dictionary_ = arr_dict_
//                                                UNCHANGED dictionary_unifier_
// ---------------------------------------------------------------------------------------------
// first_dictionary_ && !dictionary_unifier_  --> if !first_dictionary_.Equals(arr_dict) then
//                                                  dictionary_unifier_ = unify(first_dictionary_, arr_dict)
//                                                  first_dictionary_ = nullptr
//                                                else
//                                                  UNCHANGED first_dictionary_, dictionary_unifier_
//                                                end
// ---------------------------------------------------------------------------------------------
// dictionary_unifier_                        --> dictionary_unifier_ = unify(dictionary_unifier_, arr_dict)
// ---------------------------------------------------------------------------------------------
```
You will then have to re-think how `dictionary_value_type` and `dictionary` work below.
[1] https://github.com/apache/arrow/blob/main/cpp/src/arrow/array/array_dict.h#L169
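A self-contained sketch of this suggested state machine (simplified `Dictionary`/`Unifier` stand-ins, not Arrow's real types — the actual code would hold a `DictionaryUnifier` and call `GetResult` once at the end):

```cpp
#include <memory>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration only.
using Dictionary = std::vector<std::string>;

struct Unifier {
  std::set<std::string> values;
  void Unify(const Dictionary& d) { values.insert(d.begin(), d.end()); }
};

class Appender {
 public:
  void Append(const Dictionary& arr_dict) {
    if (unifier_) {
      // State 3: a unifier already exists; keep folding dictionaries in.
      unifier_->Unify(arr_dict);
    } else if (!first_dictionary_) {
      // State 1: first chunk seen; just remember its dictionary.
      first_dictionary_ = std::make_unique<Dictionary>(arr_dict);
    } else if (*first_dictionary_ != arr_dict) {
      // State 2: dictionaries differ; create the unifier lazily and
      // release first_dictionary_.
      unifier_ = std::make_unique<Unifier>();
      unifier_->Unify(*first_dictionary_);
      unifier_->Unify(arr_dict);
      first_dictionary_.reset();
    }
    // else: identical dictionary, nothing to do (fast path).
  }

  bool unified() const { return unifier_ != nullptr; }

 private:
  std::unique_ptr<Dictionary> first_dictionary_;
  std::unique_ptr<Unifier> unifier_;
};
```

The point of the lazy transition is that a chunked array whose chunks all share one dictionary never constructs a unifier at all.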
@js8544 to clarify, `dictionary_unifier_ = unify(dictionary_unifier_, arr_dict)` means something like `unifier->unify(arr_dict)` in the actual code :)
Force-pushed 001a846 to f09ab5c
felipecrv left a comment
I don't see any changes other than the rebase.
Still working on it :)

@felipecrv Updated as you suggested.

@js8544 I won't merge my own suggestions before you have a chance to reply and apply them yourself. If you're OK with them and CI passes with them applied I will merge.
Co-authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
The macOS build is failing for some unknown reason, and this has been rebased very recently, so I'm merging.

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit ec41209. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.
GH-37055: [C++] Optimize hash kernels for Dictionary ChunkedArrays (apache#38394)

### Rationale for this change
When merging dictionaries across chunks, the hash kernels unnecessarily unify the existing dictionary, dragging down the performance.

### What changes are included in this PR?
Reuse the dictionary unifier across chunks.

### Are these changes tested?
Yes, with a new benchmark for dictionary chunked arrays.

### Are there any user-facing changes?
No.

* Closes: apache#37055

Lead-authored-by: Jin Shang <shangjin1997@gmail.com>
Co-authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Signed-off-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
`value_counts` extremely slow for chunked `DictionaryArray` #37055