ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join #9116
Dandandan wants to merge 49 commits into apache:master from Dandandan:vectorized_hashing
Conversation
👍 I think that this could also be beneficial for the hash aggregate!
```rust
/// `Hasher` that returns the same `u64` value as a hash, to avoid re-hashing
/// it when inserting/indexing or regrowing the `HashMap`
struct IdHasher {
    hash: u64,
}
```

We just return the key value here, so we don't "rehash" when growing or indexing the `HashMap`.
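A minimal sketch of how such a pass-through hasher can be wired up; the `Hasher` impl and the `BuildHasherDefault` type alias below are illustrative assumptions, not necessarily the exact code in the PR:

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

/// Pass-through hasher: the key is already a hash, so just return it.
#[derive(Default)]
struct IdHasher {
    hash: u64,
}

impl Hasher for IdHasher {
    fn finish(&self) -> u64 {
        self.hash
    }

    fn write_u64(&mut self, i: u64) {
        // The key is the precomputed hash; store it unchanged.
        self.hash = i;
    }

    fn write(&mut self, _bytes: &[u8]) {
        unreachable!("IdHasher is only meant for u64 keys");
    }
}

/// A map keyed by precomputed `u64` hashes that never re-hashes them,
/// even when the map grows and keys are re-inserted.
type IdHashMap<V> = HashMap<u64, V, BuildHasherDefault<IdHasher>>;
```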
```rust
// Combines two hashes into one hash
fn combine_hashes(l: u64, r: u64) -> u64 {
    // ...
}
```

Just a simple hashing combination function. Source: http://myeyesareblind.com/2017/02/06/Combine-hash-values/

Could be improved upon later.
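The linked page surveys several combiners; a boost-style `hash_combine` adapted to `u64` could look like the sketch below. The constant and shift amounts are illustrative, not necessarily what the PR ended up using:

```rust
// Boost-style hash_combine ported to u64. Wrapping arithmetic avoids
// overflow panics in debug builds; the constant is the 64-bit golden ratio.
fn combine_hashes(l: u64, r: u64) -> u64 {
    let hash = 0x9e3779b97f4a7c15u64
        .wrapping_add(r)
        .wrapping_add(l << 6)
        .wrapping_add(l >> 2);
    l ^ hash
}
```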
```rust
/// Left and right row have equal values
fn equal_rows(
```

This is not vectorized (one row at a time). It could be vectorized in the future, checking an array of "matched candidates" instead of one by one.
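As a rough illustration of the row-at-a-time shape being described, a per-pair comparison might look like this; the signature, the single `Int64` arm, and the helper name are assumptions for the sketch:

```rust
use arrow::array::{Array, ArrayRef, Int64Array};
use arrow::datatypes::DataType;

// Compare one (left_row, right_row) pair column by column. Only Int64 is
// shown; real code dispatches on every supported key type.
fn equal_rows(left_row: usize, right_row: usize, left: &[ArrayRef], right: &[ArrayRef]) -> bool {
    left.iter().zip(right.iter()).all(|(l, r)| match l.data_type() {
        DataType::Int64 => {
            let l = l.as_any().downcast_ref::<Int64Array>().unwrap();
            let r = r.as_any().downcast_ref::<Int64Array>().unwrap();
            l.value(left_row) == r.value(right_row)
        }
        _ => unimplemented!("sketch only handles Int64 keys"),
    })
}
```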
```rust
let hashes = create_hashes(&[left.columns()[0].clone()], &random_state)?;

// Create hash collisions
hashmap_left.insert(hashes[0], vec![0, 1]);
```

Maps to both indices; the test makes sure the extra "collisions" are removed.
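A toy, self-contained restatement of what the test exercises (none of these names come from the PR): two left rows are forced into the same hash bucket, so a hash match alone is not enough and the join must still compare the actual key values before emitting index pairs.

```rust
use std::collections::HashMap;

fn main() {
    // Rows 0 and 1 artificially share one hash bucket.
    let mut hashmap_left: HashMap<u64, Vec<usize>> = HashMap::new();
    let forced_hash = 42u64;
    hashmap_left.insert(forced_hash, vec![0, 1]);

    let left_keys = [10i64, 20];
    let right_key = 20i64;

    // The value comparison is what removes the fake "collision" entry.
    let matches: Vec<usize> = hashmap_left[&forced_hash]
        .iter()
        .copied()
        .filter(|&row| left_keys[row] == right_key)
        .collect();
    assert_eq!(matches, vec![1]);
}
```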
```diff
 for (row, hash_value) in hash_values.iter().enumerate() {
     hash.raw_entry_mut()
-        .from_key(&key)
+        .from_key_hashed_nocheck(*hash_value, hash_value)
```
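For reference, the raw-entry pattern used here lets the build side reuse the hash that `create_hashes` already produced. A hedged sketch, assuming a hashbrown version that exposes the raw-entry API and the pass-through `IdHasher` sketched earlier (exact types in the PR may differ):

```rust
use std::hash::BuildHasherDefault;
use hashbrown::HashMap;

// Map keyed by the precomputed hash itself; IdHasher (sketched above) makes
// the map reuse that u64 instead of hashing it again.
type JoinHashMap = HashMap<u64, Vec<u64>, BuildHasherDefault<IdHasher>>;

fn add_batch(map: &mut JoinHashMap, hash_values: &[u64]) {
    for (row, hash_value) in hash_values.iter().enumerate() {
        map.raw_entry_mut()
            // Pass the precomputed hash and the key (they are the same value).
            .from_key_hashed_nocheck(*hash_value, hash_value)
            .or_insert_with(|| (*hash_value, Vec::new()))
            .1 // &mut Vec<u64> holding the row indices in this bucket
            .push(row as u64);
    }
}
```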
FYI @jorgecarleitao @alamb @andygrove
Hi @Dandandan, thanks! I will try and review this over the next few days.
jorgecarleitao left a comment
I finally got the time to review this. IMO this looks great. Thanks a lot for taking this on!
I left some comments; all except one are optional / future PRs. I would really like your opinion with regard to the null values.
```rust
for (i, hash) in $hashes.iter_mut().enumerate() {
    let mut hasher = $random_state.build_hasher();
    hasher.$f(array.value(i));
    // ...
}
```

A potential simplification here is to use `let values = array.values();` and zip them with `$hashes`.
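A sketch of that suggested simplification, assuming a primitive array whose `values()` exposes the native slice and ignoring nulls; the wrapper function and its name are illustrative:

```rust
use std::hash::{BuildHasher, Hasher};
use arrow::array::Int64Array;

// Zip the raw values buffer with the output hashes instead of calling
// array.value(i) per index (primitive, non-null arrays assumed).
fn hash_i64_column<S: BuildHasher>(array: &Int64Array, random_state: &S, hashes: &mut [u64]) {
    for (hash, value) in hashes.iter_mut().zip(array.values().iter()) {
        let mut hasher = random_state.build_hasher();
        hasher.write_i64(*value);
        *hash = hasher.finish();
    }
}
```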
```rust
/// Creates hash values for every element in the row based on the values in the columns
fn create_hashes(arrays: &[ArrayRef], random_state: &RandomState) -> Result<Vec<u64>> {
    // ...
}
```

This could be an iterator, which would allow it to be plugged into the iteration where it is used: that avoids allocating a vector only to iterate over it later.

Not 100% sure how that would work for multiple columns? An earlier version of the PR also tried to reuse a bit of the allocation, but that didn't seem to have a large impact on performance.

I think this can be looked at in a following PR. I am also wondering if there are some cool SIMD hash algorithms that work on multiple elements of arrays (rather than the common `Vec<u8>`) if we want to optimize `create_hashes`. Or maybe we can write one ourselves. I am not sure how that would work with an iterator?
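For context, the overall shape of the function under discussion is roughly the following. The sketch handles a single integer type, uses a generic `BuildHasher`, drops the `Result` wrapper, and reuses the `combine_hashes` sketched earlier, so it is illustrative rather than the PR's exact code:

```rust
use std::hash::{BuildHasher, Hasher};
use arrow::array::{ArrayRef, Int64Array};

// One u64 per row; every key column is folded into the same slot so that
// multi-column keys end up with a single combined hash.
fn create_hashes<S: BuildHasher>(arrays: &[ArrayRef], random_state: &S) -> Vec<u64> {
    let mut hashes = vec![0u64; arrays[0].len()];
    for col in arrays {
        // Real code dispatches on col.data_type(); only Int64 shown here.
        let array = col.as_any().downcast_ref::<Int64Array>().unwrap();
        for (i, hash) in hashes.iter_mut().enumerate() {
            let mut hasher = random_state.build_hasher();
            hasher.write_i64(array.value(i));
            *hash = combine_hashes(hasher.finish(), *hash); // combine_hashes: see sketch above
        }
    }
    hashes
}
```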
```rust
macro_rules! hash_array {
    ($array_type:ident, $column: ident, $f: ident, $hashes: ident, $random_state: ident) => {
```

IMO we should probably move this to the arrow crate at some point: hashing and equality are operations that need to be implemented with some invariants for consistency. I see that you had to implement an `equal_rows_elem`, which already indicates that.
```rust
for (i, hash) in $hashes.iter_mut().enumerate() {
    let mut hasher = $random_state.build_hasher();
    hasher.$f(array.value(i));
```

Isn't this ignoring nulls? Maybe because they are discarded in joins? If so, isn't this a problem, as we are reading the value of a null slot?

I think so; I added some null handling in `hash_array` and `equal_rows_elem`. I think it makes sense that we have some more testing for nulls as well, as this can be very tricky.

I am not sure this is handled 100% correctly with the previous code either; I think it is actually not correct. For example, the inner join doesn't check on null values.

I added an issue here to add (more) tests on null values in keys:
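A sketch of what null-aware hashing inside that loop can look like; checking the validity bitmap first and hashing a fixed sentinel for null slots is one option, not necessarily the exact treatment the PR adopted:

```rust
use std::hash::{BuildHasher, Hasher};
use arrow::array::{Array, Int64Array};

// Null-aware variant of the per-column hashing loop: consult the validity
// bitmap before touching a slot so we never read the value of a null entry.
fn hash_nullable_i64<S: BuildHasher>(array: &Int64Array, random_state: &S, hashes: &mut [u64]) {
    for (i, hash) in hashes.iter_mut().enumerate() {
        let mut hasher = random_state.build_hasher();
        if array.is_null(i) {
            hasher.write_u8(0); // sentinel; row equality must still reject null keys
        } else {
            hasher.write_i64(array.value(i));
        }
        *hash = hasher.finish();
    }
}
```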
```rust
let (l, r) = build_join_indexes(
    &left_data,
    &right,
    JoinType::Inner,
```

I recommend adding a test that covers collisions for all the `JoinType` variants -- the path for collision handling is different for those different join type variants.
alamb left a comment
Nice work @Dandandan -- I reviewed the logic and the tests in this PR carefully (sorry it took so long) and they look good to me. I think the submodule issue (described below) needs to be sorted out before this is ready to merge. I think the test coverage for nulls could be improved (there is basically none now from what I can tell) but you have filed a follow on PR for that.
It looks to me like there is a submodule update which may not be related to the changes contemplated in this PR -- I think we need to resolve:
Otherwise this is looking great. I vote we ship it and continue improving joins / hashing later as follow on PRs.
Thanks @alamb, submodule change reverted; not sure what happened there.
I apologize for the delay in merging Rust PRs -- the 3.0 release is being finalized now and we are planning to minimize entropy by postponing the merging of changes not critical for the release until the process is complete. I hope the process will be complete in the next few days. There is more discussion on the mailing list.
This one is next in line for merging; @jorgecarleitao and I have our eyes on it... Once a few more tests have completed on https://github.com/apache/arrow/commits/master we'll get it in.
I am calling it a day, so feel free to continue the big flush. Thanks a lot for taking this, @alamb!
Master is looking pretty good: https://github.com/apache/arrow/runs/1729927062 and I merged this branch locally into master; it compiles and passes tests. Merging it in.
Create hashes vectorized in hash join

This is one step for a fully vectorized hash join: https://issues.apache.org/jira/browse/ARROW-11112

The idea of the PR is as follows:

* We still use a `HashMap`, but rather than using the row data as the key we use a hash value (`u64`) both as key and as hash. We use a custom `Hasher` to avoid (re)computing hashes in the hash map and while doing lookups.
* Only the hash value creation is vectorized in this PR; the rest is still on a row basis.
* A test for hash collision detection needs to be added.

TPC-H query 12 is, without the remaining part, ~10% faster than the other PR: ~180ms vs ~200ms.
TPC-H query 5 is >40% faster (332ms vs 624ms).

Closes #9116 from Dandandan/vectorized_hashing

Authored-by: Heres, Daniel <danielheres@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>

Create hashes vectorized in hash join
This is one step for a fully vectorized hash join: https://issues.apache.org/jira/browse/ARROW-11112
The idea of the PR is as follows:
HashMapbut rather than using the row data as key we use a hash value (u64) both as key and as hash. We use a customHasherto avoid (re)computing hashes in the hash map and while doing lookups.TCPH 12 is without the remaining part ~10% faster than the other PR: ~180ms vs ~200ms.
TCPH 5 is >40% faster (332ms vs 624ms).