Minor: Introduce utils::hash for StructArray#8552
Conversation
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
waynexia
left a comment
There was a problem hiding this comment.
Looks good to me, thanks @jayzhan211
datafusion/common/src/hash_utils.rs
Outdated
| }; | ||
|
|
||
| for i in valid_indices { | ||
| let mut values_hashes = vec![0u64; array.len()]; |
There was a problem hiding this comment.
Why do you create a new vec for each iteration?
There was a problem hiding this comment.
create each values_hashes for each column. And combine those hashed element in column to one
There was a problem hiding this comment.
The values_hashes can be created upfront, outside of the loop.
| let column_values = values[i].clone(); | ||
| // create hashes for each column | ||
| create_hashes(&[column_values], random_state, &mut values_hashes)?; | ||
| let hash = &mut hashes_buffer[i]; |
There was a problem hiding this comment.
This seems not correct, as i here to a column in the struct and hashes_buffer to a hash of one value?
There was a problem hiding this comment.
@Dandandan. Not fully got your point. I do consider each column as one value (iterate each row in one column with get one final hash hashes_buffer[i]).
i is the column index in the struct array and is also the hashes_buffer index.
For example, column1, null, column3, column4. We iterate 0, 2, and 3. And hash each of them to hashes_buffer[0], hashes_buffer[2], and hashes_buffer[3] respectively. A column-wise hashes.
There was a problem hiding this comment.
I think I mistake how the null buffer works in StructArray. It seems to be row-wise, but I consider it as column-wise...
There was a problem hiding this comment.
It looks pretty much better now :)
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
3d7b5d3 to
fc36232
Compare
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
| let valid_indices: Vec<usize> = if let Some(nulls) = nulls { | ||
| nulls.valid_indices().collect() | ||
| } else { | ||
| (0..num_columns).collect() |
There was a problem hiding this comment.
This doesn't yet seem correct, valid_indices contains the rows that are valid, not the columns.
There was a problem hiding this comment.
I hash each row now, so we iterate the valid row and hash each column at that row to one hash value.
| }; | ||
|
|
||
| let mut values_hashes = vec![0u64; array.len()]; | ||
| create_hashes(array.columns(), random_state, &mut values_hashes)?; |
There was a problem hiding this comment.
We should be able to use hashes_buffer here directly?
There was a problem hiding this comment.
If we do that, we should be able to remove the code related to valid_indices, as this already takes care of nulls.
There was a problem hiding this comment.
I dont think so.
values_hashes: [9258723240401068087, 9258723240401068087, 8502738074356456021, 8502738074356456021, 4222447303697976283, 9753707356376286577]
hashes_buffer: [9258723240401091360, 9258723240401091360, 0, 8502738074356479294, 0, 0]
We can see that create_hashes does not consider if the row is valid or not. Therefore, we need to iterate valid_indices once .
| } | ||
| DataType::Struct(_) => { | ||
| let array = as_struct_array(array)?; | ||
| hash_struct_array(array, random_state, hashes_buffer)?; |
There was a problem hiding this comment.
This should work?
| hash_struct_array(array, random_state, hashes_buffer)?; | |
| create_hashes(array.columns(), random_state, hashes_buffer)?; |
There was a problem hiding this comment.
If we consider nulls, this doesn't work
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
|
@Dandandan do you think this PR is ready to merge? |
|
@alamb @Dandandan Any update on this? |
|
Thank you @jayzhan211 @waynexia and @Dandandan |
|
I found the current implementation failed this Let me checkout the reason |
* hash struct Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fmt Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * row-wise hash Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fmt Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * create hashes once Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * cleanup Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * add comment Signed-off-by: jayzhan211 <jayzhan211@gmail.com> --------- Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Which issue does this PR close?
Closes #.
We will need this for #7893
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?