ARROW-11156: [Rust][DataFusion] Create hashes vectorized in hash join #9116
Dandandan wants to merge 49 commits into apache:master from Dandandan:vectorized_hashing
Conversation
👍 I think that this could also be beneficial for the hash aggregate!
```rust
/// `Hasher` that returns the same `u64` value as a hash, to avoid re-hashing
/// it when inserting/indexing or regrowing the `HashMap`
struct IdHasher {
    hash: u64,
}
```

We just return the key value here, so we don't "rehash" when growing or indexing the `HashMap`.
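A minimal sketch of how such a pass-through hasher can be wired up; the `Hasher` impl and the `BuildHasherDefault` type alias below are illustrative assumptions, not necessarily the exact code in the PR:

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

/// Pass-through hasher: the key is already a hash, so just return it.
#[derive(Default)]
struct IdHasher {
    hash: u64,
}

impl Hasher for IdHasher {
    fn finish(&self) -> u64 {
        self.hash
    }

    fn write_u64(&mut self, i: u64) {
        // The key is the precomputed hash; store it unchanged.
        self.hash = i;
    }

    fn write(&mut self, _bytes: &[u8]) {
        unreachable!("IdHasher is only meant for u64 keys");
    }
}

/// A map keyed by precomputed `u64` hashes that never re-hashes them,
/// even when the map grows and keys are re-inserted.
type IdHashMap<V> = HashMap<u64, V, BuildHasherDefault<IdHasher>>;
```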
```rust
// Combines two hashes into one hash
fn combine_hashes(l: u64, r: u64) -> u64 {
    // ...
}
```

Just a simple hashing combination function. Source: http://myeyesareblind.com/2017/02/06/Combine-hash-values/

Could be improved upon later.
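The linked page surveys several combiners; a boost-style `hash_combine` adapted to `u64` could look like the sketch below. The constant and shift amounts are illustrative, not necessarily what the PR ended up using:

```rust
// Boost-style hash_combine ported to u64. Wrapping arithmetic avoids
// overflow panics in debug builds; the constant is the 64-bit golden ratio.
fn combine_hashes(l: u64, r: u64) -> u64 {
    let hash = 0x9e3779b97f4a7c15u64
        .wrapping_add(r)
        .wrapping_add(l << 6)
        .wrapping_add(l >> 2);
    l ^ hash
}
```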
```rust
/// Left and right row have equal values
fn equal_rows(
```

This is not vectorized (one row at a time). It could be vectorized in the future, checking an array of "matched candidates" instead of one by one.
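As a rough illustration of the row-at-a-time shape being described, a per-pair comparison might look like this; the signature, the single `Int64` arm, and the helper name are assumptions for the sketch:

```rust
use arrow::array::{Array, ArrayRef, Int64Array};
use arrow::datatypes::DataType;

// Compare one (left_row, right_row) pair column by column. Only Int64 is
// shown; real code dispatches on every supported key type.
fn equal_rows(left_row: usize, right_row: usize, left: &[ArrayRef], right: &[ArrayRef]) -> bool {
    left.iter().zip(right.iter()).all(|(l, r)| match l.data_type() {
        DataType::Int64 => {
            let l = l.as_any().downcast_ref::<Int64Array>().unwrap();
            let r = r.as_any().downcast_ref::<Int64Array>().unwrap();
            l.value(left_row) == r.value(right_row)
        }
        _ => unimplemented!("sketch only handles Int64 keys"),
    })
}
```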
```rust
let hashes = create_hashes(&[left.columns()[0].clone()], &random_state)?;

// Create hash collisions
hashmap_left.insert(hashes[0], vec![0, 1]);
```

Maps to both indices; the test makes sure the extra "collisions" are removed.
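A toy, self-contained restatement of what the test exercises (none of these names come from the PR): two left rows are forced into the same hash bucket, so a hash match alone is not enough and the join must still compare the actual key values before emitting index pairs.

```rust
use std::collections::HashMap;

fn main() {
    // Rows 0 and 1 artificially share one hash bucket.
    let mut hashmap_left: HashMap<u64, Vec<usize>> = HashMap::new();
    let forced_hash = 42u64;
    hashmap_left.insert(forced_hash, vec![0, 1]);

    let left_keys = [10i64, 20];
    let right_key = 20i64;

    // The value comparison is what removes the fake "collision" entry.
    let matches: Vec<usize> = hashmap_left[&forced_hash]
        .iter()
        .copied()
        .filter(|&row| left_keys[row] == right_key)
        .collect();
    assert_eq!(matches, vec![1]);
}
```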
```diff
 for (row, hash_value) in hash_values.iter().enumerate() {
     hash.raw_entry_mut()
-        .from_key(&key)
+        .from_key_hashed_nocheck(*hash_value, hash_value)
```
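For reference, the raw-entry pattern used here lets the build side reuse the hash that `create_hashes` already produced. A hedged sketch, assuming a hashbrown version that exposes the raw-entry API and the pass-through `IdHasher` sketched earlier (exact types in the PR may differ):

```rust
use std::hash::BuildHasherDefault;
use hashbrown::HashMap;

// Map keyed by the precomputed hash itself; IdHasher (sketched above) makes
// the map reuse that u64 instead of hashing it again.
type JoinHashMap = HashMap<u64, Vec<u64>, BuildHasherDefault<IdHasher>>;

fn add_batch(map: &mut JoinHashMap, hash_values: &[u64]) {
    for (row, hash_value) in hash_values.iter().enumerate() {
        map.raw_entry_mut()
            // Pass the precomputed hash and the key (they are the same value).
            .from_key_hashed_nocheck(*hash_value, hash_value)
            .or_insert_with(|| (*hash_value, Vec::new()))
            .1 // &mut Vec<u64> holding the row indices in this bucket
            .push(row as u64);
    }
}
```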
FYI @jorgecarleitao @alamb @andygrove
Hi @Dandandan, thanks! I will try and review this over the next few days.
jorgecarleitao left a comment
I finally got the time to review this. IMO this looks great. Thanks a lot for taking this on!
I left some comments; all except one are optional / future PRs. I would really like your opinion with regard to the null values.
```rust
for (i, hash) in $hashes.iter_mut().enumerate() {
    let mut hasher = $random_state.build_hasher();
    hasher.$f(array.value(i));
    // ...
}
```

A potential simplification here is to use `let values = array.values();` and zip them with `$hashes`.
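A sketch of that suggested simplification, assuming a primitive array whose `values()` exposes the native slice and ignoring nulls; the wrapper function and its name are illustrative:

```rust
use std::hash::{BuildHasher, Hasher};
use arrow::array::Int64Array;

// Zip the raw values buffer with the output hashes instead of calling
// array.value(i) per index (primitive, non-null arrays assumed).
fn hash_i64_column<S: BuildHasher>(array: &Int64Array, random_state: &S, hashes: &mut [u64]) {
    for (hash, value) in hashes.iter_mut().zip(array.values().iter()) {
        let mut hasher = random_state.build_hasher();
        hasher.write_i64(*value);
        *hash = hasher.finish();
    }
}
```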
```rust
/// Creates hash values for every element in the row based on the values in the columns
fn create_hashes(arrays: &[ArrayRef], random_state: &RandomState) -> Result<Vec<u64>> {
    // ...
}
```

This could be an iterator, which would allow it to be plugged into the iteration where it is used: that avoids allocating a vector only to iterate over it later.

Not 100% sure how that would work for multiple columns? An earlier version of the PR also tried to reuse a bit of the allocation, but that didn't seem to have a large impact on performance.

I think this can be looked at in a following PR. I am also wondering if there are some cool SIMD hash algorithms that work on multiple elements of arrays (rather than the common `Vec<u8>`) if we want to optimize `create_hashes`. Or maybe we can write one ourselves. I am not sure how that would work with an iterator?
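For context, the overall shape of the function under discussion is roughly the following. The sketch handles a single integer type, uses a generic `BuildHasher`, drops the `Result` wrapper, and reuses the `combine_hashes` sketched earlier, so it is illustrative rather than the PR's exact code:

```rust
use std::hash::{BuildHasher, Hasher};
use arrow::array::{ArrayRef, Int64Array};

// One u64 per row; every key column is folded into the same slot so that
// multi-column keys end up with a single combined hash.
fn create_hashes<S: BuildHasher>(arrays: &[ArrayRef], random_state: &S) -> Vec<u64> {
    let mut hashes = vec![0u64; arrays[0].len()];
    for col in arrays {
        // Real code dispatches on col.data_type(); only Int64 shown here.
        let array = col.as_any().downcast_ref::<Int64Array>().unwrap();
        for (i, hash) in hashes.iter_mut().enumerate() {
            let mut hasher = random_state.build_hasher();
            hasher.write_i64(array.value(i));
            *hash = combine_hashes(hasher.finish(), *hash); // combine_hashes: see sketch above
        }
    }
    hashes
}
```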
```rust
macro_rules! hash_array {
    ($array_type:ident, $column: ident, $f: ident, $hashes: ident, $random_state: ident) => {
```

IMO we should probably move this to the arrow crate at some point: hashing and equality are operations that need to be implemented with some invariants for consistency. I see that you had to implement an `equal_rows_elem`, which already indicates that.
```rust
for (i, hash) in $hashes.iter_mut().enumerate() {
    let mut hasher = $random_state.build_hasher();
    hasher.$f(array.value(i));
```

Isn't this ignoring nulls? Maybe because they are discarded in joins? If so, isn't this a problem, as we are reading the value of a null slot?

I think so; I added some null handling in `hash_array` and `equal_rows_elem`. I think it makes sense that we have some more testing for nulls as well, as this can be very tricky.

I am not sure this is handled 100% correctly with the previous code either; I think it is actually not correct. For example, the inner join doesn't check on null values.

I added an issue here to add (more) tests on null values in keys:
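A sketch of what null-aware hashing inside that loop can look like; checking the validity bitmap first and hashing a fixed sentinel for null slots is one option, not necessarily the exact treatment the PR adopted:

```rust
use std::hash::{BuildHasher, Hasher};
use arrow::array::{Array, Int64Array};

// Null-aware variant of the per-column hashing loop: consult the validity
// bitmap before touching a slot so we never read the value of a null entry.
fn hash_nullable_i64<S: BuildHasher>(array: &Int64Array, random_state: &S, hashes: &mut [u64]) {
    for (i, hash) in hashes.iter_mut().enumerate() {
        let mut hasher = random_state.build_hasher();
        if array.is_null(i) {
            hasher.write_u8(0); // sentinel; row equality must still reject null keys
        } else {
            hasher.write_i64(array.value(i));
        }
        *hash = hasher.finish();
    }
}
```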
```rust
let (l, r) = build_join_indexes(
    &left_data,
    &right,
    JoinType::Inner,
```

I recommend adding a test that covers collisions for all the `JoinType` variants -- the path for collision handling is different for those different join type variants.
alamb left a comment
Nice work @Dandandan -- I reviewed the logic and the tests in this PR carefully (sorry it took so long) and they look good to me. I think the submodule issue (described below) needs to be sorted out before this is ready to merge. I think the test coverage for nulls could be improved (there is basically none now from what I can tell) but you have filed a follow on PR for that.
It looks to me like there is a submodule update which may not be related to the changes contemplated in this PR -- I think we need to resolve:
Otherwise this is looking great. I vote we ship it and continue improving joins / hashing later as follow on PRs.
Thanks @alamb, submodule change reverted; not sure what happened there.
I apologize for the delay in merging Rust PRs -- the 3.0 release is being finalized now and we are planning to minimize entropy by postponing the merging of changes not critical for the release until the process is complete. I hope the process will be complete in the next few days. There is more discussion on the mailing list.
This one is next in line for merging; @jorgecarleitao and I have our eyes on it... Once a few more tests have completed on https://github.com/apache/arrow/commits/master we'll get it in.
I am calling it a day, so feel free to continue the big flush. Thanks a lot for taking this, @alamb!
Master is looking pretty good: https://github.com/apache/arrow/runs/1729927062 and I merged this branch locally into master; it compiles and passes tests. Merging it in.
Create hashes vectorized in hash join

This is one step for a fully vectorized hash join: https://issues.apache.org/jira/browse/ARROW-11112

The idea of the PR is as follows:

* We still use a `HashMap`, but rather than using the row data as the key we use a hash value (`u64`) both as key and as hash. We use a custom `Hasher` to avoid (re)computing hashes in the hash map and while doing lookups.
* Only the hash value creation is vectorized in this PR; the rest is still on a row basis.
* A test for hash collision detection needs to be added.

TPC-H query 12 is, without the remaining part, ~10% faster than the other PR: ~180ms vs ~200ms.
TPC-H query 5 is >40% faster (332ms vs 624ms).

Closes #9116 from Dandandan/vectorized_hashing

Authored-by: Heres, Daniel <danielheres@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>

Create hashes vectorized in hash join
This is one step for a fully vectorized hash join: https://issues.apache.org/jira/browse/ARROW-11112
The idea of the PR is as follows:
HashMapbut rather than using the row data as key we use a hash value (u64) both as key and as hash. We use a customHasherto avoid (re)computing hashes in the hash map and while doing lookups.TCPH 12 is without the remaining part ~10% faster than the other PR: ~180ms vs ~200ms.
TCPH 5 is >40% faster (332ms vs 624ms).