Enable reading StringViewArray by default from Parquet (8% improvement for entire ClickBench suite) #13101
I think it would be interesting to run some more Parquet benchmarks as well, to detect any regressions. Query 18 of TPC-H still looks a tiny bit slower (I ran it a few times in a row). The rest is as fast or faster:
Btw, I don't think this should hold off the merge / release, but it would be good to track/note any regressions, however small.
Dandandan
left a comment
Probably makes sense to run some more benchmarks just to be sure
    /// If true, will use StringView/BinaryViewArray instead of String/BinaryArray
    /// when reading ParquetFiles
    #[structopt(long)]
    pub force_view_types: bool,
Should we keep this (or a differently-named) flag as a kill-switch?
👍 There is a kill switch (in the description of this PR)
set datafusion.execution.parquet.schema_force_view_types = false;
0 row(s) fetched.
Elapsed 0.000 seconds.
This particular code is for the benchmark drivers, and I don't think it is super valuable to retain the benchmark in both configurations.
I will do so
My plan for this PR is to hedge against disruptions by making a stable DataFusion 42.2.0 and then merging this PR into main for inclusion in #13065. I will review the benchmark results again and look at what is going on with TPCH Q18.
I filed this one It may be an instance that Could help with
This PR / project has been outstanding long enough and I desperately need to close off concurrent projects. Let's merge it in and keep iterating on main.
Thanks again @findepi @Dandandan (and @Rachelint and @goldmedal and @XiangpengHao and @jayzhan211 and so many others)
…improvement for entire ClickBench suite) (apache#13101)" This reverts commit 2d7892b.
add `schema_force_view_types=true` in query session config as it improves query performance as explained here - apache/datafusion#13101
Replacement for #12092 which had too much history on it
Which issue does this PR close?
Closes #11682
Rationale for this change
Reading data as StringViewArray is significantly faster than StringArray. We have been testing this behind a feature flag but it is now stable enough to enable by default.
See blog post #11603:
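As background for the rationale above, here is a minimal sketch of the idea behind view arrays. It is not taken from this PR or from the blog post. It assumes the Arrow view layout: a 16-byte view holds the string length, plus either the whole string inline (up to 12 bytes) or a 4-byte prefix and a location in a shared data buffer. The names `View`, `make_view`, and `quick_eq` are hypothetical and are not the arrow-rs API.

```rust
// Illustrative only: a simplified model of a 16-byte string view.
// `View`, `make_view` and `quick_eq` are hypothetical names, not arrow-rs types.

#[derive(Debug, Clone, Copy, PartialEq)]
enum View {
    /// Strings of at most 12 bytes live entirely inside the view (zero padded).
    Inline { len: u32, data: [u8; 12] },
    /// Longer strings keep a 4-byte prefix plus a location in a shared buffer.
    Buffer { len: u32, prefix: [u8; 4], buffer_index: u32, offset: u32 },
}

/// Builds a view for `s`, appending to `buffer` only when `s` exceeds 12 bytes.
fn make_view(s: &str, buffer_index: u32, buffer: &mut Vec<u8>) -> View {
    let bytes = s.as_bytes();
    let len = bytes.len() as u32;
    if bytes.len() <= 12 {
        let mut data = [0u8; 12];
        data[..bytes.len()].copy_from_slice(bytes);
        View::Inline { len, data }
    } else {
        let offset = buffer.len() as u32;
        buffer.extend_from_slice(bytes);
        let mut prefix = [0u8; 4];
        prefix.copy_from_slice(&bytes[..4]);
        View::Buffer { len, prefix, buffer_index, offset }
    }
}

/// Tries to decide equality from the views alone.
/// Returns `None` when the shared buffer would have to be consulted.
fn quick_eq(a: &View, b: &View) -> Option<bool> {
    match (a, b) {
        // Both strings are fully inline, so the views decide equality.
        (View::Inline { .. }, View::Inline { .. }) => Some(a == b),
        (
            View::Buffer { len: la, prefix: pa, .. },
            View::Buffer { len: lb, prefix: pb, .. },
        ) => {
            if la != lb || pa != pb {
                Some(false) // length or prefix differs: no buffer access needed
            } else {
                None // same length and prefix: the full bytes must be compared
            }
        }
        // One string is at most 12 bytes and the other is longer.
        _ => Some(false),
    }
}
```

In this model, short strings never touch the data buffer, and many comparisons between long strings are settled by the length and prefix alone. That is one plausible reason filters over string columns get cheaper; the actual gains reported in this PR come from the benchmarks, not from this sketch.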
Benchmark Results
(Note: I believe the changes for Q1 and Q2 are noise; there are no corresponding changes for the clickbench_partitioned table.)
Details for `clickbench`
What changes are included in this PR?
Set the default of schema_force_view_types to true
Are these changes tested?
Yes, by CI tests
Are there any user-facing changes?
If you see an error related to StringView use, you can disable this feature by setting the datafusion.execution.parquet.schema_force_view_types option to false.