Improve performance of trim for string view (10%)#12395
Conversation
|
FYI @Rachelint that #12383 is modifying |
Thanks! I will push forward this until #12383 merged. |
f6c83bf to
325fac6
Compare
325fac6 to
48cb4db
Compare
alamb
left a comment
There was a problem hiding this comment.
Thank you @Rachelint -- this looks really nice and quite close 🙏
I left some comments, but I don't think they are required to merge this.
I do think we should have benchmark numbers showing this makes things faster in order to merge it. Could you please make a StringView based benchmark for trim -- perhaps in
?Then we can run that benchmark and show that this PR improves the performance.
Thanks again!
| fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> { | ||
| utf8_to_str_type(&arg_types[0], "btrim") | ||
| if arg_types[0] == DataType::Utf8View { | ||
| Ok(DataType::Utf8View) |
There was a problem hiding this comment.
👍
Also eventually it would also be possible to return Utf8View when the input was Utf8 and save a copy as well
|
|
||
| /// Make a `u128` based on the given substr, start(offset to view.offset), and | ||
| /// push into to the given buffers | ||
| pub(crate) fn make_and_append_view( |
There was a problem hiding this comment.
🤔 I wonder if we should (as a follow on PR) propose adding this upstream to arrow-rs as it seems valuable for any trim related kernels on stringview
There was a problem hiding this comment.
It sounds great! and #12383 (comment) can be solved if it is function in arrow-rs.
There was a problem hiding this comment.
It seems like what would be really useful is a StringViewBuilder that could be modified perhaps 🤔
I started to write a ticket in arrow-rs but I didn't know exactly what API to suggest. I think we would have to try it out
|
I think maybe we should place the LTrim/RTrim/BTrim into a same place(like trim.rs)? |
|
For benchmarking, I would recommend this PR #12111. for what it's worth |
Thanks, it is really helpful! |
4e092d4 to
dbd0f25
Compare
|
Run benchmark introduced in #12513, about 10~20% improvement for the long string(64 bytes). Highlights, as we expected, the string view trim mainly reduces copyings when the trimmed result > 12: The detailed sorted out benchmark result: |
|
I merged this PR up to main and am running another round of benchmarks. Thank you @Rachelint |
Looks like a reasonable improvement to me |
alamb
left a comment
There was a problem hiding this comment.
Thanks @Rachelint -- I went thought this PR again and it looks good
Since I had this PR checked out locally for review, I went ahead and remove the unsafe pointer calculation to try and move this PR along (I know it has been outstanding for too long)
Thanks again!
| let views_buf = ScalarBuffer::from(views_buf); | ||
| let nulls_buf = null_builder.finish(); | ||
|
|
||
| // Safety: |
| // Safety: | ||
| // `trim_str` is computed from `str::trim_xxx_matches`, | ||
| // and its addr is ensured to be >= `origin_str`'s | ||
| let start = unsafe { trim_str.as_ptr().offset_from(src_str.as_ptr()) as u32 }; |
There was a problem hiding this comment.
I ran this diff:
diff --git a/datafusion/functions/src/string/common.rs b/datafusion/functions/src/string/common.rs
index 4f70374b7..f796d10c2 100644
--- a/datafusion/functions/src/string/common.rs
+++ b/datafusion/functions/src/string/common.rs
@@ -204,10 +204,7 @@ fn trim_and_append_str<'a>(
if let (Some(src_str), Some(characters)) = (src_str_opt, trim_characters_opt) {
let trim_str = trim_func(src_str, characters);
- // Safety:
- // `trim_str` is computed from `str::trim_xxx_matches`,
- // and its addr is ensured to be >= `origin_str`'s
- let start = unsafe { trim_str.as_ptr().offset_from(src_str.as_ptr()) as u32 };
+ let start = (src_str.as_bytes().len() - trim_str.as_bytes().len()) as u32;
make_and_append_view(views_buf, null_builder, raw, trim_str, start);
} else {And all tests passed.
|
How about we merge this PR and then you can continue work on the optimizations as follow on PRs? |
I am checking about #12395 (comment) |
Sure -- no worries -- we can wait too. I just feel bad about how long this PR has been outstanding |
Hmm, some newly added tests seem to be failing
Thanks. I added some case to check it. And unfortunately, I found maybe we can't remove the unsafe codes currently. |
I just can't explain why pointer arithmetic is needed -- I think it is important to fix (or really understand) before merging |
Maybe disscussion in #12387 can help. The logic of But unforunately, the needed feature So eventually, we can just through the pointer arithmetic to get the start index currently... Update: |
|
I have filed an issue #12597 to track the introduced unsafe codes. And added a todo to mention this issue for tracking and explaining why we introduce unsafe codes here. |
|
@alamb I found we indeed don't need the unsafe pointer arithmetic to get the |
alamb
left a comment
There was a problem hiding this comment.
Thanks @Rachelint -- very nice. Thank you for sticking with it
I reran the benchmarks one more time and they looks good to me. Nice work.
I merged up from main and removed some redundant tests and plan to merge this PR when it passes CI.
++ critcmp main string-view-trim
group main string-view-trim
----- ---- ----------------
INPUT LEN <= 12/large_string [size=1024, len_before=12, len_after=8] 1.03 42.9±0.05µs ? ?/sec 1.00 41.7±0.06µs ? ?/sec
INPUT LEN <= 12/large_string [size=4096, len_before=12, len_after=8] 1.02 165.5±0.56µs ? ?/sec 1.00 161.5±0.06µs ? ?/sec
INPUT LEN <= 12/large_string [size=8192, len_before=12, len_after=8] 1.02 327.8±0.19µs ? ?/sec 1.00 320.2±0.19µs ? ?/sec
INPUT LEN <= 12/string [size=1024, len_before=12, len_after=8] 1.01 41.6±0.03µs ? ?/sec 1.00 41.2±0.12µs ? ?/sec
INPUT LEN <= 12/string [size=4096, len_before=12, len_after=8] 1.01 160.3±0.11µs ? ?/sec 1.00 159.4±0.09µs ? ?/sec
INPUT LEN <= 12/string [size=8192, len_before=12, len_after=8] 1.01 318.5±1.89µs ? ?/sec 1.00 316.4±0.48µs ? ?/sec
INPUT LEN <= 12/string_view [size=1024, len_before=12, len_after=8] 1.06 41.6±0.01µs ? ?/sec 1.00 39.4±0.02µs ? ?/sec
INPUT LEN <= 12/string_view [size=4096, len_before=12, len_after=8] 1.03 160.5±0.07µs ? ?/sec 1.00 155.1±0.13µs ? ?/sec
INPUT LEN <= 12/string_view [size=8192, len_before=12, len_after=8] 1.03 318.0±0.20µs ? ?/sec 1.00 309.3±1.15µs ? ?/sec
INPUT LEN > 12, OUTPUT LEN <= 12/large_string [size=1024, len_before=64, len_after=4] 1.10 184.3±0.27µs ? ?/sec 1.00 167.4±0.10µs ? ?/sec
INPUT LEN > 12, OUTPUT LEN <= 12/large_string [size=4096, len_before=64, len_after=4] 1.09 727.0±0.67µs ? ?/sec 1.00 665.8±0.41µs ? ?/sec
INPUT LEN > 12, OUTPUT LEN <= 12/large_string [size=8192, len_before=64, len_after=4] 1.09 1448.7±1.76µs ? ?/sec 1.00 1329.6±1.77µs ? ?/sec
INPUT LEN > 12, OUTPUT LEN <= 12/string [size=1024, len_before=64, len_after=4] 1.09 181.8±0.18µs ? ?/sec 1.00 167.4±0.11µs ? ?/sec
INPUT LEN > 12, OUTPUT LEN <= 12/string [size=4096, len_before=64, len_after=4] 1.09 725.8±11.87µs ? ?/sec 1.00 664.9±0.62µs ? ?/sec
INPUT LEN > 12, OUTPUT LEN <= 12/string [size=8192, len_before=64, len_after=4] 1.09 1443.8±2.64µs ? ?/sec 1.00 1324.8±1.25µs ? ?/sec
INPUT LEN > 12, OUTPUT LEN <= 12/string_view [size=1024, len_before=64, len_after=4] 1.11 181.8±0.11µs ? ?/sec 1.00 163.8±0.22µs ? ?/sec
INPUT LEN > 12, OUTPUT LEN <= 12/string_view [size=4096, len_before=64, len_after=4] 1.11 724.6±0.84µs ? ?/sec 1.00 651.6±0.22µs ? ?/sec
INPUT LEN > 12, OUTPUT LEN <= 12/string_view [size=8192, len_before=64, len_after=4] 1.11 1444.8±0.63µs ? ?/sec 1.00 1302.7±0.98µs ? ?/sec
INPUT LEN > 12, OUTPUT LEN > 12/large_string [size=1024, len_before=64, len_after=60] 1.01 44.4±0.12µs ? ?/sec 1.00 44.0±0.07µs ? ?/sec
INPUT LEN > 12, OUTPUT LEN > 12/large_string [size=4096, len_before=64, len_after=60] 1.01 171.7±0.30µs ? ?/sec 1.00 170.4±0.15µs ? ?/sec
INPUT LEN > 12, OUTPUT LEN > 12/large_string [size=8192, len_before=64, len_after=60] 1.01 349.8±0.83µs ? ?/sec 1.00 347.1±0.53µs ? ?/sec
INPUT LEN > 12, OUTPUT LEN > 12/string [size=1024, len_before=64, len_after=60] 1.00 43.0±0.06µs ? ?/sec 1.00 43.1±0.20µs ? ?/sec
INPUT LEN > 12, OUTPUT LEN > 12/string [size=4096, len_before=64, len_after=60] 1.00 166.2±0.21µs ? ?/sec 1.01 167.5±0.32µs ? ?/sec
INPUT LEN > 12, OUTPUT LEN > 12/string [size=8192, len_before=64, len_after=60] 1.00 336.5±1.03µs ? ?/sec 1.01 341.1±0.42µs ? ?/sec
INPUT LEN > 12, OUTPUT LEN > 12/string_view [size=1024, len_before=64, len_after=60] 1.08 43.8±0.17µs ? ?/sec 1.00 40.6±0.03µs ? ?/sec
INPUT LEN > 12, OUTPUT LEN > 12/string_view [size=4096, len_before=64, len_after=60] 1.06 169.0±0.55µs ? ?/sec 1.00 159.0±0.21µs ? ?/sec
INPUT LEN > 12, OUTPUT LEN > 12/string_view [size=8192, len_before=64, len_after=60] 1.08 342.9±0.96µs ? ?/sec 1.00 316.5±0.57µs ? ?/sec
| 01)Projection: temp.column2 || temp.column3 | ||
| 02)--TableScan: temp projection=[column2, column3] | ||
|
|
||
| ################################################ |
There was a problem hiding this comment.
I double checked and @goldmedal 's recent changes (I think created after this PR) make these tests redundant
WHich then runs the tests in https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/test_files/string/string_query.slt.part
I removed these tests from this PR to keep things moving
trim for string viewtrim for string view (10%)
|
🚀 |
Which issue does this PR close?
Closes #12387
Rationale for this change
Similar as the string view version substr, we can impl the string view version trim to improve performance.
What changes are included in this PR?
Are these changes tested?
Test by new unit test and exist other tests.
Are there any user-facing changes?
No.