Which issue does this PR close?
Closes #3613.
Rationale for this change
@Dandandan noticed in #3518 that regex_replace with a known pattern was taking an extremely long time during the ClickBench suite. Several factors contribute to this, but the main one is how generic the regex_replace implementation is (it can handle 2⁴ scalar/array combinations of its arguments). Having a generic version is good for compatibility, but it makes us pay that overhead even in common cases (like the example in #3518, where the pattern is static).
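To make the overhead concrete, here is a minimal, hypothetical Rust sketch of what a fully generic path ends up doing when every argument is treated as a per-row value. It uses the regex crate on plain Rust types rather than the actual DataFusion kernel or Arrow arrays, and the function name is made up: the point is only that the pattern is recompiled for every row, even when it is the same literal on every row.

```rust
use regex::Regex;

/// Fully generic path (illustration only, not the DataFusion kernel):
/// every argument is per-row, so the regex is recompiled on every
/// iteration even if the pattern never actually changes.
fn regexp_replace_generic(
    values: &[Option<String>],
    patterns: &[Option<String>],
    replacements: &[Option<String>],
) -> Result<Vec<Option<String>>, regex::Error> {
    values
        .iter()
        .zip(patterns)
        .zip(replacements)
        .map(|((value, pattern), replacement)| match (value, pattern, replacement) {
            (Some(v), Some(p), Some(r)) => {
                // Compiled once per row: this is where the time goes for a static pattern.
                let re = Regex::new(p)?;
                Ok(Some(re.replace_all(v, r.as_str()).into_owned()))
            }
            // Any NULL input produces a NULL output.
            _ => Ok(None),
        })
        .collect()
}
```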
What changes are included in this PR?
This PR adds a scalarity-based (not sure if that is a real word) specialization system where, at runtime, the best regex_replace variation is picked and executed for the given set of inputs. The system here is just a start; if the gains are large enough, we might add a third case where the replacement is also known.
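As a rough sketch of the idea (again illustrative, with made-up names, not the code in this PR): the scalarity of each argument is inspected once, before the row loop, and a kernel is chosen that hoists the invariant work out of that loop. This continues the sketch above and reuses its regexp_replace_generic as the fallback.

```rust
// Continues the sketch above (same `use regex::Regex;` and regexp_replace_generic).

/// Whether an argument is a single value for the whole batch or one value per row.
enum Scalarity<'a> {
    Scalar(&'a str),
    Array(&'a [Option<String>]),
}

/// Specialized kernel: pattern and replacement are scalars, so the regex is
/// compiled exactly once and reused for every row.
fn replace_scalar_pattern(
    values: &[Option<String>],
    pattern: &str,
    replacement: &str,
) -> Result<Vec<Option<String>>, regex::Error> {
    let re = Regex::new(pattern)?; // hoisted out of the row loop
    Ok(values
        .iter()
        .map(|v| v.as_deref().map(|s| re.replace_all(s, replacement).into_owned()))
        .collect())
}

/// Runtime dispatch: pick the cheapest kernel that matches the inputs.
/// Only the case from #3518 (scalar pattern and replacement) and the fully
/// generic case are shown; the other combinations are omitted in this sketch.
fn regexp_replace_dispatch(
    values: &[Option<String>],
    pattern: Scalarity<'_>,
    replacement: Scalarity<'_>,
) -> Result<Vec<Option<String>>, regex::Error> {
    match (pattern, replacement) {
        (Scalarity::Scalar(p), Scalarity::Scalar(r)) => replace_scalar_pattern(values, p, r),
        (Scalarity::Array(p), Scalarity::Array(r)) => regexp_replace_generic(values, p, r),
        _ => unimplemented!("remaining scalar/array combinations"),
    }
}
```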
Are there any user-facing changes?
This is mainly an optimization, and there shouldn't be any user-facing changes.
Benchmarks
New benchmarks are in #3614 (comment); overall they show a speed-up in the range of 20-35x depending on the query and input.
Old benchmarks
All benchmarks were run in --release mode (using the datafusion-cli crate with the -f option).
The initial benchmark is Query 28 from ClickBench:
SELECT
    REGEXP_REPLACE("Referer", '^https?://(?:www\.)?([^/]+)/.*$', '\1') AS k,
    AVG(length("Referer")) AS l,
    COUNT(*) AS c,
    MIN("Referer")
FROM hits_1
WHERE "Referer" <> ''
GROUP BY k
HAVING COUNT(*) > 100000
ORDER BY l DESC
LIMIT 25;
| | Master | This Branch | Factor |
| --- | --- | --- | --- |
| Cold Run | 2.875 seconds | 0.318 seconds | 9.04x speed-up |
| Hot Run (6th consecutive run) | 2.252 seconds | 0.266 seconds | 8.46x speed-up |
| Average | 2.408 seconds | 0.277 seconds | 8.69x speed-up |
(Note: I don't have the full ClickBench data, just a 1/100-scale partition of it, so these numbers might not be fully representative.)
A second benchmark is one where both the source and the replacements are arrays; it shows a speed-up factor of 1.7x.
-- Generate data
--
-- import secrets
-- import random
--
-- rows = 1_000_000
--
-- data = {"user_id": [], "website": []}
-- for _ in range(rows):
--     data["user_id"].append(secrets.token_hex(8))
--
--     # Sometimes it is a proper URL, and sometimes it is not.
--     data["website"].append(
--         random.choice(["http", "https", "unknown", ""])
--         + random.choice([":", "://"])
--         + random.choice(["google", "facebook"])
--         + random.choice([".com", ".org", ""])
--     )
--
-- import pandas as pd
-- df = pd.DataFrame(data)
-- df.to_parquet("data.parquet")
CREATE EXTERNAL TABLE generated_data
STORED AS PARQUET
LOCATION 'data.parquet';
-- Query 1
EXPLAIN ANALYZE
SELECT
REGEXP_REPLACE("website", '^https?://(?:www.)?([^/]+)$', "user_id") AS encoded_website
FROM generated_data;
Codecov report: merging #3614 (bbb8c8b) into master (ebb28f5) will decrease coverage by 0.07%. The diff coverage is 85.23%.