Faster strpos() string function for ASCII-only case by goldmedal · Pull Request #12401 · apache/datafusion

goldmedal · 2024-09-09T18:27:22Z

Which issue does this PR close?

Rationale for this change

For ASCII-only input, I use a sliding window over the byte slice to find the position instead of string pattern matching.
The ASCII-only StringViewArray case runs about 50% faster than before. ~~However, I'm not really sure why the normal string array can't be improved. 🤔~~

What changes are included in this PR?

If both arguments are ASCII-only, we can perform this faster method.
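As a rough illustration of the idea (not the code in this PR), the fast path can be sketched on plain `&str` values. The function names `strpos_ascii`, `strpos_utf8`, and `strpos` are hypothetical, and the sketch assumes the 1-based, 0-if-absent result convention:

```rust
/// ASCII fast path (hypothetical helper): every char is one byte, so the
/// byte offset of the first matching window is also the character offset.
fn strpos_ascii(string: &str, substring: &str) -> usize {
    let needle = substring.as_bytes();
    if needle.is_empty() {
        // `windows(0)` panics, so handle the empty needle explicitly.
        return 1;
    }
    string
        .as_bytes()
        .windows(needle.len())
        .position(|window| window == needle)
        .map(|idx| idx + 1) // convert to a 1-based position
        .unwrap_or(0) // 0 means "not found"
}

/// General path (hypothetical helper): find the byte offset, then count
/// the chars before it to get a character position.
fn strpos_utf8(string: &str, substring: &str) -> usize {
    string
        .find(substring)
        .map(|byte_idx| string[..byte_idx].chars().count() + 1)
        .unwrap_or(0)
}

/// Take the fast path only when both arguments are ASCII.
fn strpos(string: &str, substring: &str) -> usize {
    if substring.is_ascii() && string.is_ascii() {
        strpos_ascii(string, substring)
    } else {
        strpos_utf8(string, substring)
    }
}
```

The fast path skips the char-counting step entirely, which is presumably where most of the saving in the ASCII rows comes from.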

The benchmark result:

group                                        after                                   before
-----                                        -----                                   ------
strpos_StringArray_ascii_str_len_128         1.00  1585.2±49.78µs        ? ?/sec     1.74      2.8±0.07ms        ? ?/sec
strpos_StringArray_ascii_str_len_32          1.00   558.2±24.13µs        ? ?/sec     2.22  1240.1±51.67µs        ? ?/sec
strpos_StringArray_ascii_str_len_4096        1.00     47.3±2.79ms        ? ?/sec     1.88     89.0±1.68ms        ? ?/sec
strpos_StringArray_ascii_str_len_8           1.00    199.8±6.46µs        ? ?/sec     2.42   483.6±16.36µs        ? ?/sec
strpos_StringArray_utf8_str_len_128          1.00      4.9±0.14ms        ? ?/sec     1.00      4.9±0.12ms        ? ?/sec
strpos_StringArray_utf8_str_len_32           1.00  1852.6±47.05µs        ? ?/sec     1.01  1875.0±66.89µs        ? ?/sec
strpos_StringArray_utf8_str_len_4096         1.00    136.1±2.14ms        ? ?/sec     1.01    136.8±2.63ms        ? ?/sec
strpos_StringArray_utf8_str_len_8            1.00   831.5±43.31µs        ? ?/sec     1.00   830.9±35.23µs        ? ?/sec
strpos_StringViewArray_ascii_str_len_128     1.00  1700.8±63.55µs        ? ?/sec     1.66      2.8±0.12ms        ? ?/sec
strpos_StringViewArray_ascii_str_len_32      1.00   626.3±27.34µs        ? ?/sec     2.02  1264.6±88.59µs        ? ?/sec
strpos_StringViewArray_ascii_str_len_4096    1.00     47.4±1.11ms        ? ?/sec     1.88     88.9±1.96ms        ? ?/sec
strpos_StringViewArray_ascii_str_len_8       1.00   275.8±12.18µs        ? ?/sec     1.69   467.0±17.50µs        ? ?/sec
strpos_StringViewArray_utf8_str_len_128      1.00      4.9±0.14ms        ? ?/sec     1.00      4.9±0.21ms        ? ?/sec
strpos_StringViewArray_utf8_str_len_32       1.00  1898.4±121.22µs        ? ?/sec    1.00  1889.9±51.64µs        ? ?/sec
strpos_StringViewArray_utf8_str_len_4096     1.00    136.5±2.14ms        ? ?/sec     1.00    136.4±1.78ms        ? ?/sec
strpos_StringViewArray_utf8_str_len_8        1.00   848.3±35.50µs        ? ?/sec     1.02  861.6±172.97µs        ? ?/sec

Are these changes tested?

yes

Are there any user-facing changes?

2010YOUY01

Great 🚀 The implementation looks good to me

> However, I'm not really sure why the normal string array can't be improved. 🤔

I think it's not executed and triggered some error: in the benchmark, the StringArray cases take only several ns, while the StringViewArray cases take at least several µs.
I guess it's because, in the micro benchmark, the arguments are of different physical string types, which breaks the function implementation (strpos(string_col, string_view_col)).

~~With #12415 this kind of bug can be more thoroughly tested~~
Update: It's not a bug in the SQL API; it can only be triggered through the raw expression API, as in the benchmark.

2010YOUY01 · 2024-09-11T10:47:05Z

datafusion/functions/src/unicode/strpos.rs

 {
-    let string_iter = ArrayIter::new(string_array);
-    let substring_iter = ArrayIter::new(substring_array);
+    let ascii_only = string_array.is_ascii() && substring_array.is_ascii();


Suggested change

-    let ascii_only = string_array.is_ascii() && substring_array.is_ascii();
+    let ascii_only = substring_array.is_ascii() && string_array.is_ascii();

I think we can check the substring first: it's cheaper in common cases, and it lets us short-circuit the check on string_array in some cases.
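To make the short-circuit argument concrete, here is a small sketch using plain slices instead of Arrow arrays. `all_ascii` and `check_ascii_only` are hypothetical stand-ins for the arrays' `is_ascii()` calls, and the counter exists only to show how much of the larger input gets scanned:

```rust
use std::cell::Cell;

/// Hypothetical stand-in for an array-level ASCII check. It bumps
/// `scanned` once per value it inspects and stops at the first
/// non-ASCII value.
fn all_ascii(values: &[&str], scanned: &Cell<usize>) -> bool {
    values.iter().all(|v| {
        scanned.set(scanned.get() + 1);
        v.is_ascii()
    })
}

/// Checks the (usually short) substrings first. Returns the combined
/// result and how many values of `strings` had to be scanned.
fn check_ascii_only(strings: &[&str], substrings: &[&str]) -> (bool, usize) {
    let substring_scans = Cell::new(0);
    let string_scans = Cell::new(0);
    // `&&` evaluates its right-hand side only if the left-hand side is true,
    // so a non-ASCII substring means `strings` is never scanned.
    let ascii_only =
        all_ascii(substrings, &substring_scans) && all_ascii(strings, &string_scans);
    (ascii_only, string_scans.get())
}
```

With a non-ASCII substring, the second element of the result is 0: the potentially long haystack values are never touched.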

Nice suggestion! I ran the benchmark again, and the performance improved!

group                                        before                                  after
-----                                        ------                                  -----
strpos_StringArray_ascii_str_len_128         1.06    370.5±9.46ns        ? ?/sec     1.00   348.6±11.38ns        ? ?/sec
strpos_StringArray_ascii_str_len_32          1.07   372.5±10.23ns        ? ?/sec     1.00    346.9±9.45ns        ? ?/sec
strpos_StringArray_ascii_str_len_4096        1.08   378.4±12.05ns        ? ?/sec     1.00   349.4±16.83ns        ? ?/sec
strpos_StringArray_ascii_str_len_8           1.07   371.4±14.53ns        ? ?/sec     1.00   346.1±28.44ns        ? ?/sec
strpos_StringArray_utf8_str_len_128          1.06   377.2±18.04ns        ? ?/sec     1.00   356.4±21.92ns        ? ?/sec
strpos_StringArray_utf8_str_len_32           1.08   374.9±34.32ns        ? ?/sec     1.00   345.8±11.05ns        ? ?/sec
strpos_StringArray_utf8_str_len_4096         1.09   381.6±16.68ns        ? ?/sec     1.00   351.4±23.14ns        ? ?/sec
strpos_StringArray_utf8_str_len_8            1.09   372.9±20.83ns        ? ?/sec     1.00   343.3±11.78ns        ? ?/sec
strpos_StringViewArray_ascii_str_len_128     1.79      3.2±0.15ms        ? ?/sec     1.00  1763.8±44.16µs        ? ?/sec
strpos_StringViewArray_ascii_str_len_32      1.03   648.7±18.27µs        ? ?/sec     1.00   628.1±21.69µs        ? ?/sec
strpos_StringViewArray_ascii_str_len_4096    1.24     62.8±7.27ms        ? ?/sec     1.00     50.5±2.05ms        ? ?/sec
strpos_StringViewArray_ascii_str_len_8       1.00   280.2±10.44µs        ? ?/sec     1.05   294.8±42.47µs        ? ?/sec
strpos_StringViewArray_utf8_str_len_128      1.03      5.3±0.13ms        ? ?/sec     1.00      5.1±0.12ms        ? ?/sec
strpos_StringViewArray_utf8_str_len_32       1.01  1961.9±57.34µs        ? ?/sec     1.00  1944.1±73.19µs        ? ?/sec
strpos_StringViewArray_utf8_str_len_4096     1.03    147.4±9.24ms        ? ?/sec     1.00    142.5±3.61ms        ? ?/sec
strpos_StringViewArray_utf8_str_len_8        1.01   874.5±26.02µs        ? ?/sec     1.00   863.4±40.68µs        ? ?/sec

goldmedal · 2024-09-11T16:04:35Z

> I think it's not executed and triggered some error: in the benchmark, the StringArray cases take only several ns, while the StringViewArray cases take at least several µs. I guess it's because, in the micro benchmark, the arguments are of different physical string types, which breaks the function implementation (strpos(string_col, string_view_col)).

You're right. There are some errors here. I ran the benchmark in the strpos test case and got the following error:

Error: Execution("Unsupported data type combination (Utf8, Utf8View) for function strpos")

It seems that strpos doesn't allow the second argument to be a StringView if the first one is a StringArray. I'm not sure if this case makes sense. Maybe we can file another issue for it if it's reasonable.

Then, I realized it was my mistake 😢. When I generated the substring for the StringArray, I mistakenly set its type to StringView. After fixing it, the benchmark for StringArray works fine.
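For readers unfamiliar with how such an error arises, here is a minimal sketch of dispatching on the pair of argument types. The `StrType` enum and `check_strpos_types` are hypothetical and do not reflect the actual DataFusion code; the accepted set below covers only the two pairs the benchmark exercises, and the real function may accept more combinations:

```rust
/// Hypothetical stand-in for the physical string types discussed here.
#[derive(Debug, Clone, Copy, PartialEq)]
enum StrType {
    Utf8,
    Utf8View,
}

/// Accepts only the type pairs this sketch knows how to handle and
/// reports everything else in the style of the error shown above.
fn check_strpos_types(string: StrType, substring: StrType) -> Result<(), String> {
    match (string, substring) {
        // Assumption for illustration: only same-type pairs are handled.
        (StrType::Utf8, StrType::Utf8) | (StrType::Utf8View, StrType::Utf8View) => Ok(()),
        other => Err(format!(
            "Unsupported data type combination {:?} for function strpos",
            other
        )),
    }
}
```

A benchmark that ignores the returned `Result` finishes almost immediately on the error path, which would explain timings in the nanosecond range.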

goldmedal · 2024-09-11T16:32:00Z

datafusion/functions/benches/strpos.rs

+        ]
+    } else {
+        let string_array: StringArray = output_string_vec.clone().into_iter().collect();
+        let sub_string_array: StringArray =


I rebased to put all benchmark-related changes in the first commit. Previously, this line was:

let sub_string_array: StringViewArray = output_string_vec.clone().into_iter().collect();

After changing to StringArray, the benchmark works fine.

goldmedal · 2024-09-11T16:37:02Z

@2010YOUY01 Thank you for the review. I updated the benchmark result. After fixing the benchmark, the StringArray case also improves.

alamb

Thank you @goldmedal -- this PR looks really nice 🙏

🚀

goldmedal · 2024-09-18T01:54:50Z

Thanks @alamb @2010YOUY01

github-actions bot added the functions Changes to functions implementation label Sep 9, 2024

goldmedal marked this pull request as ready for review September 10, 2024 01:52

2010YOUY01 approved these changes Sep 11, 2024

alamb mentioned this pull request Sep 11, 2024

DataFusion weekly project plan (Andrew Lamb) - Sep 9, 2024 #12391

Closed

5 tasks

goldmedal added 4 commits September 12, 2024 00:07

add strpos benchmark

9c60c63

add faster path for strpos in ascii-only case

4aebfe4

clippy

a854dd4

compare substring first

62f2307

goldmedal force-pushed the feature/12366-strpos-ascii branch from 5aae0fb to 62f2307 Compare September 11, 2024 16:28

cargo fmt

81c3fa1

alamb mentioned this pull request Sep 16, 2024

DataFusion weekly project plan (Andrew Lamb) - Sep 16, 2024 #12494

Closed

8 tasks

alamb approved these changes Sep 16, 2024

alamb merged commit 269a473 into apache:main Sep 17, 2024

goldmedal deleted the feature/12366-strpos-ascii branch September 18, 2024 01:54

alamb merged 5 commits into apache:main from goldmedal:feature/12366-strpos-ascii
