ARROW-10585: [Rust] [DataFusion] Add join support to DataFrame and LogicalPlan#8720
andygrove wants to merge 7 commits into apache:master from
31f7b89 to def6c52
  /// The on clause of the join, as vector of (left, right) columns.
- pub type JoinOn<'a> = [(&'a str, &'a str)];
+ pub type JoinOn = [(String, String)];
I ran into ownership issues that I couldn't figure out. I am happy to change this back if someone can show me how.
This is too small compared to the other ops to justify the effort at this point, IMO.
JoinType::Inner => {
    // inner: all fields are there
    let on_right = &on.iter().map(|on| on.1.to_string()).collect::<HashSet<_>>();
    // remove right-side join keys if they have the same names as the left-side
The rustdoc test for DataFrame.join was failing until I made this change.
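The behavior described in the diff comment above can be sketched roughly as follows. This is a simplified illustration on plain column names, not the PR's code: the helper `inner_join_columns` is hypothetical, and the real implementation works on schema fields rather than string slices.

```rust
use std::collections::HashSet;

/// Hypothetical helper: output column names of an inner join.
/// Right-side join keys are dropped only when they share a name
/// with the left-side key they are joined to.
fn inner_join_columns(left: &[&str], right: &[&str], on: &[(String, String)]) -> Vec<String> {
    // Right-side key names that are identical to their left-side counterpart.
    let duplicated: HashSet<&str> = on
        .iter()
        .filter(|(l, r)| l == r)
        .map(|(_, r)| r.as_str())
        .collect();

    // All left columns, then the right columns that are not duplicated keys.
    left.iter()
        .map(|c| c.to_string())
        .chain(
            right
                .iter()
                .filter(|c| !duplicated.contains(**c))
                .map(|c| c.to_string()),
        )
        .collect()
}
```

With this sketch, joining on `("id", "id")` yields a single `id` column, while joining on `("id", "customer_id")` keeps both key columns.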
I would use Vec<(left_i, right_i)> because it automatically enforces the invariant that left_keys.len() == right_keys.len(). We can still keep the public interface left_keys, right_keys and perform the check before passing them to the builder.
At the moment, when we use left.zip(right), iteration stops at the end of the shorter vector, which may hide a bug in the code.
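To illustrate the concern, here is a small sketch of how `zip` silently drops unmatched keys. The function name `zip_keys` is made up for this example and does not appear in the PR.

```rust
/// Hypothetical helper: pairs keys with `zip`, which stops at the end
/// of the shorter input and reports no error for the leftover keys.
fn zip_keys(left: &[&str], right: &[&str]) -> Vec<(String, String)> {
    left.iter()
        .zip(right.iter())
        .map(|(l, r)| (l.to_string(), r.to_string()))
        .collect()
}
```

Calling `zip_keys(&["a", "b"], &["x"])` returns a single pair, so the second left key disappears without any error being raised.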
Thanks. I agree and I have made this change. The user-facing DataFrame method is now the only place that accepts the two separate lists of column names, and we verify they are the same length when creating the logical plan.
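A minimal sketch of that kind of check, assuming a hypothetical `build_join_on` function and a plain `String` error (the PR's actual function names and error type may differ):

```rust
/// Hypothetical helper: validates the two key lists once, at the boundary,
/// and returns the paired representation used from then on.
fn build_join_on(
    left_keys: &[&str],
    right_keys: &[&str],
) -> Result<Vec<(String, String)>, String> {
    // Reject mismatched lengths up front instead of letting `zip` truncate.
    if left_keys.len() != right_keys.len() {
        return Err(format!(
            "join requires the same number of left and right keys, got {} and {}",
            left_keys.len(),
            right_keys.len()
        ));
    }
    Ok(left_keys
        .iter()
        .zip(right_keys.iter())
        .map(|(l, r)| (l.to_string(), r.to_string()))
        .collect())
}
```

Once the pairs exist, code further down the plan cannot receive key lists of different lengths, which is the invariant the review comment asked for.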
The test currently fails and I need to debug why.
The test was invalid. Fixed now.
I will go ahead and merge once CI is green and then start implementing TPC-H queries to really test this out. @alamb fyi
alamb left a comment
Thanks for this @andygrove -- I am catching up with reviews and this is super exciting to see
This PR adds DataFrame.join and plumbs it through to the physical join plan.