feat: Implement equality = and inequality <> support for StringView#10985
feat: Implement equality = and inequality <> support for StringView#10985alamb merged 8 commits intoapache:string-viewfrom
Conversation
XiangpengHao
left a comment
There was a problem hiding this comment.
The pr looks good to me, left some minor comments
datafusion/common/src/scalar/mod.rs
Outdated
| )?; | ||
| Arc::new(array) | ||
| } | ||
| DataType::Utf8View => { |
There was a problem hiding this comment.
Maybe just build_array_string!(StringViewArray, Utf8View)?
Can you also move this case closer to where we handle Utf8 and LargeUtf8?
| (LargeUtf8, Utf8) => Some(LargeUtf8), | ||
| (Utf8, LargeUtf8) => Some(LargeUtf8), | ||
| (LargeUtf8, LargeUtf8) => Some(LargeUtf8), | ||
| (Utf8View, Utf8View) => Some(Utf8View), |
There was a problem hiding this comment.
Should we handle (Utf8View, Utf8), (Utf8, Utf8View) etc?
It's possible that we read a StringViewArray from parquet, and compare it against a Utf8. What do you think the result type should look like? @alamb
There was a problem hiding this comment.
I agree we should permit to/coercion from any "string" / "binary" type to the associated view type.
It is probably good to make sure the coercion does the cheap direction when possible
For example, we shouldn't be converting StringViewArray --> StringArray as that will copy all the strings. Instead we should convert StringArray --> StringViewArray which is relatively cheap (we just have to copy the views)
I think this feature -- coercion -- is probably worth its own ticket and doesn't need to be done in this PR. I'll file a follow on ticket in the morning
There was a problem hiding this comment.
I handled the Utf8View coercion in this PR, but it would be better to handle the BinaryView in a follow-up.
There was a problem hiding this comment.
The problem with (Utf8View, Utf8) is that it can be quite common to have the following queries:
select * from t where c1 <> 'small';Where the c1 is read to be StringViewArray, then we need to decide what is the result type (in order to implement the coercion). I agree with @alamb that StringViewArray is generally cheaper than StringArray, so we might just use Utf8View if at least one operand is using Utf8View.
There was a problem hiding this comment.
I handled the
Utf8Viewcoercion in this PR, but it would be better to handle theBinaryViewin a follow-up.
Filed #10996 to track Binary view support
There was a problem hiding this comment.
The problem with
(Utf8View, Utf8)is that it can be quite common to have the following queries:select * from t where c1 <> 'small';Where the c1 is read to be
StringViewArray, then we need to decide what is the result type (in order to implement the coercion). I agree with @alamb thatStringViewArrayis generally cheaper thanStringArray, so we might just useUtf8Viewif at least one operand is usingUtf8View.
I wrote a bunch of tests in #10997 and the coercion mostly seems to work well. There is one exception which i think is minor and i will file a ticket about
alamb
left a comment
There was a problem hiding this comment.
Thanks @Weijun-H -- I think once @XiangpengHao 's comments are addressed this PR will be good to go. 🙏
| (LargeUtf8, Utf8) => Some(LargeUtf8), | ||
| (Utf8, LargeUtf8) => Some(LargeUtf8), | ||
| (LargeUtf8, LargeUtf8) => Some(LargeUtf8), | ||
| (Utf8View, Utf8View) => Some(Utf8View), |
There was a problem hiding this comment.
I agree we should permit to/coercion from any "string" / "binary" type to the associated view type.
It is probably good to make sure the coercion does the cheap direction when possible
For example, we shouldn't be converting StringViewArray --> StringArray as that will copy all the strings. Instead we should convert StringArray --> StringViewArray which is relatively cheap (we just have to copy the views)
I think this feature -- coercion -- is probably worth its own ticket and doesn't need to be done in this PR. I'll file a follow on ticket in the morning
| Andrew X | ||
|
|
||
| query ?? | ||
| select * from test where column1 = 'Andrew'; |
There was a problem hiding this comment.
I think for this test to run, we need to add (Utf8View, Utf8) => Utf8View, see #10985 (comment)
alamb
left a comment
There was a problem hiding this comment.
Thank you @XiangpengHao -- I think this pR is a good step forward and so I think we should merge it into the string-view branch to keep making progress.
I will do so and file some follow on issues to track remaining work (specifically for coercion and BinaryView)
…velopment branch (#11402) * Update `string-view` branch to arrow-rs main (#10966) * Pin to arrow main * Fix clippy with latest arrow * Uncomment test that needs new arrow-rs to work * Update datafusion-cli Cargo.lock * Update Cargo.lock * tapelo * feat: Implement equality = and inequality <> support for StringView (#10985) * feat: Implement equality = and inequality <> support for StringView * chore: Add tests for the StringView * chore * chore: Update tests for NULL * fix: Used build_array_string! * chore: Update string_coercion function to handle Utf8View type in binary.rs * chore: add tests * chore: ci * Add more StringView comparison test coverage (#10997) * Add more StringView comparison test coverage * add reference * Add another test showing casting on columns works correctly * feat: Implement equality = and inequality <> support for BinaryView (#11004) * feat: Implement equality = and inequality <> support for BinaryView Signed-off-by: Chojan Shang <psiace@apache.org> * chore: make fmt happy Signed-off-by: Chojan Shang <psiace@apache.org> --------- Signed-off-by: Chojan Shang <psiace@apache.org> * Implement support for LargeString and LargeBinary for StringView and BinaryView (#11034) * implement large binary * add tests for large string * better comments for string coercion * Improve filter predicates with `Utf8View` literals (#11043) * refactor: Improve type coercion logic in TypeCoercionRewriter * refactor: Improve type coercion logic in TypeCoercionRewriter * chore * chore: Update test * refactor: Improve type coercion logic in TypeCoercionRewriter * refactor: Remove unused import and update code formatting in unwrap_cast_in_comparison.rs * Remove arrow-patch --------- Signed-off-by: Chojan Shang <psiace@apache.org> Co-authored-by: Alex Huang <huangweijun1001@gmail.com> Co-authored-by: Chojan Shang <psiace@apache.org> Co-authored-by: Xiangpeng Hao <haoxiangpeng123@gmail.com>
…velopment branch (apache#11402) * Update `string-view` branch to arrow-rs main (apache#10966) * Pin to arrow main * Fix clippy with latest arrow * Uncomment test that needs new arrow-rs to work * Update datafusion-cli Cargo.lock * Update Cargo.lock * tapelo * feat: Implement equality = and inequality <> support for StringView (apache#10985) * feat: Implement equality = and inequality <> support for StringView * chore: Add tests for the StringView * chore * chore: Update tests for NULL * fix: Used build_array_string! * chore: Update string_coercion function to handle Utf8View type in binary.rs * chore: add tests * chore: ci * Add more StringView comparison test coverage (apache#10997) * Add more StringView comparison test coverage * add reference * Add another test showing casting on columns works correctly * feat: Implement equality = and inequality <> support for BinaryView (apache#11004) * feat: Implement equality = and inequality <> support for BinaryView Signed-off-by: Chojan Shang <psiace@apache.org> * chore: make fmt happy Signed-off-by: Chojan Shang <psiace@apache.org> --------- Signed-off-by: Chojan Shang <psiace@apache.org> * Implement support for LargeString and LargeBinary for StringView and BinaryView (apache#11034) * implement large binary * add tests for large string * better comments for string coercion * Improve filter predicates with `Utf8View` literals (apache#11043) * refactor: Improve type coercion logic in TypeCoercionRewriter * refactor: Improve type coercion logic in TypeCoercionRewriter * chore * chore: Update test * refactor: Improve type coercion logic in TypeCoercionRewriter * refactor: Remove unused import and update code formatting in unwrap_cast_in_comparison.rs * Remove arrow-patch --------- Signed-off-by: Chojan Shang <psiace@apache.org> Co-authored-by: Alex Huang <huangweijun1001@gmail.com> Co-authored-by: Chojan Shang <psiace@apache.org> Co-authored-by: Xiangpeng Hao <haoxiangpeng123@gmail.com>
…velopment branch (apache#11402) * Update `string-view` branch to arrow-rs main (apache#10966) * Pin to arrow main * Fix clippy with latest arrow * Uncomment test that needs new arrow-rs to work * Update datafusion-cli Cargo.lock * Update Cargo.lock * tapelo * feat: Implement equality = and inequality <> support for StringView (apache#10985) * feat: Implement equality = and inequality <> support for StringView * chore: Add tests for the StringView * chore * chore: Update tests for NULL * fix: Used build_array_string! * chore: Update string_coercion function to handle Utf8View type in binary.rs * chore: add tests * chore: ci * Add more StringView comparison test coverage (apache#10997) * Add more StringView comparison test coverage * add reference * Add another test showing casting on columns works correctly * feat: Implement equality = and inequality <> support for BinaryView (apache#11004) * feat: Implement equality = and inequality <> support for BinaryView Signed-off-by: Chojan Shang <psiace@apache.org> * chore: make fmt happy Signed-off-by: Chojan Shang <psiace@apache.org> --------- Signed-off-by: Chojan Shang <psiace@apache.org> * Implement support for LargeString and LargeBinary for StringView and BinaryView (apache#11034) * implement large binary * add tests for large string * better comments for string coercion * Improve filter predicates with `Utf8View` literals (apache#11043) * refactor: Improve type coercion logic in TypeCoercionRewriter * refactor: Improve type coercion logic in TypeCoercionRewriter * chore * chore: Update test * refactor: Improve type coercion logic in TypeCoercionRewriter * refactor: Remove unused import and update code formatting in unwrap_cast_in_comparison.rs * Remove arrow-patch --------- Signed-off-by: Chojan Shang <psiace@apache.org> Co-authored-by: Alex Huang <huangweijun1001@gmail.com> Co-authored-by: Chojan Shang <psiace@apache.org> Co-authored-by: Xiangpeng Hao <haoxiangpeng123@gmail.com>
…velopment branch (apache#11402) * Update `string-view` branch to arrow-rs main (apache#10966) * Pin to arrow main * Fix clippy with latest arrow * Uncomment test that needs new arrow-rs to work * Update datafusion-cli Cargo.lock * Update Cargo.lock * tapelo * feat: Implement equality = and inequality <> support for StringView (apache#10985) * feat: Implement equality = and inequality <> support for StringView * chore: Add tests for the StringView * chore * chore: Update tests for NULL * fix: Used build_array_string! * chore: Update string_coercion function to handle Utf8View type in binary.rs * chore: add tests * chore: ci * Add more StringView comparison test coverage (apache#10997) * Add more StringView comparison test coverage * add reference * Add another test showing casting on columns works correctly * feat: Implement equality = and inequality <> support for BinaryView (apache#11004) * feat: Implement equality = and inequality <> support for BinaryView Signed-off-by: Chojan Shang <psiace@apache.org> * chore: make fmt happy Signed-off-by: Chojan Shang <psiace@apache.org> --------- Signed-off-by: Chojan Shang <psiace@apache.org> * Implement support for LargeString and LargeBinary for StringView and BinaryView (apache#11034) * implement large binary * add tests for large string * better comments for string coercion * Improve filter predicates with `Utf8View` literals (apache#11043) * refactor: Improve type coercion logic in TypeCoercionRewriter * refactor: Improve type coercion logic in TypeCoercionRewriter * chore * chore: Update test * refactor: Improve type coercion logic in TypeCoercionRewriter * refactor: Remove unused import and update code formatting in unwrap_cast_in_comparison.rs * Remove arrow-patch --------- Signed-off-by: Chojan Shang <psiace@apache.org> Co-authored-by: Alex Huang <huangweijun1001@gmail.com> Co-authored-by: Chojan Shang <psiace@apache.org> Co-authored-by: Xiangpeng Hao <haoxiangpeng123@gmail.com>
Note: targets
string-viewbranchWhich issue does this PR close?
Closes #10919
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?