Conversation
alamb
left a comment
There was a problem hiding this comment.
Thank you @peter-toth -- I think this looks like a really nice improvement in readability.
I am going to run the planning benchmarks on this PR to see if it has any impact on performance, but even if not I think the readability improvements alone make it a valuable contribution
cc @waynexia as well
| .expr_stats | ||
| .entry(expr_id.clone()) | ||
| .or_insert((0, data_type)); | ||
| let count = self.expr_stats.entry(expr_id.clone()).or_insert(0); |
There was a problem hiding this comment.
it would be great to avoid this clone (which I think is a String) by using a slightly different API. Maybe something like
let count = if let Some(count) = self.expr_stats.get(expr_id) {
count
} else {
self.expr_stats.insert(expr_id.clone(), 0);
0
};Which while less elegant and requires hashing twice doens't clone the string 🤔
There may be a more clever way to avoid cloning the id
Also, I see the current code clones expr_id as well, so this change is no worse.
There was a problem hiding this comment.
The reason why I wouldn't change this in this PR is because my #10473 (https://github.com/apache/datafusion/pull/10473/files#diff-351499880963d6a383c92e156e75019cd9ce33107724a9635853d7d4cd1898d0R761) will completely eliminate string identifiers and remove these clones.
| let expected = "Projection: UInt32(1) + test.a, UInt32(1) + {AVG({UInt32(1) + test.a|{test.a}|{UInt32(1)}})|{{UInt32(1) + test.a|{test.a}|{UInt32(1)}}}} AS col1, UInt32(1) - {AVG({UInt32(1) + test.a|{test.a}|{UInt32(1)}})|{{UInt32(1) + test.a|{test.a}|{UInt32(1)}}}} AS col2, {AVG({UInt32(1) + test.a|{test.a}|{UInt32(1)}} AS UInt32(1) + test.a)} AS AVG(UInt32(1) + test.a), UInt32(1) + {my_agg({UInt32(1) + test.a|{test.a}|{UInt32(1)}})|{{UInt32(1) + test.a|{test.a}|{UInt32(1)}}}} AS col3, UInt32(1) - {my_agg({UInt32(1) + test.a|{test.a}|{UInt32(1)}})|{{UInt32(1) + test.a|{test.a}|{UInt32(1)}}}} AS col4, {my_agg({UInt32(1) + test.a|{test.a}|{UInt32(1)}} AS UInt32(1) + test.a)} AS my_agg(UInt32(1) + test.a)\ | ||
| \n Aggregate: groupBy=[[{UInt32(1) + test.a|{test.a}|{UInt32(1)}} AS UInt32(1) + test.a]], aggr=[[AVG({UInt32(1) + test.a|{test.a}|{UInt32(1)}}) AS {AVG({UInt32(1) + test.a|{test.a}|{UInt32(1)}})|{{UInt32(1) + test.a|{test.a}|{UInt32(1)}}}}, my_agg({UInt32(1) + test.a|{test.a}|{UInt32(1)}}) AS {my_agg({UInt32(1) + test.a|{test.a}|{UInt32(1)}})|{{UInt32(1) + test.a|{test.a}|{UInt32(1)}}}}, AVG({UInt32(1) + test.a|{test.a}|{UInt32(1)}} AS UInt32(1) + test.a) AS {AVG({UInt32(1) + test.a|{test.a}|{UInt32(1)}} AS UInt32(1) + test.a)}, my_agg({UInt32(1) + test.a|{test.a}|{UInt32(1)}} AS UInt32(1) + test.a) AS {my_agg({UInt32(1) + test.a|{test.a}|{UInt32(1)}} AS UInt32(1) + test.a)}]]\ | ||
| \n Projection: UInt32(1) + test.a AS {UInt32(1) + test.a|{test.a}|{UInt32(1)}}, test.a, test.b, test.c\ | ||
| let expected = "Projection: UInt32(1) + test.a, UInt32(1) + __cse_2 AS col1, UInt32(1) - __cse_2 AS col2, __cse_4 AS AVG(UInt32(1) + test.a), UInt32(1) + __cse_3 AS col3, UInt32(1) - __cse_3 AS col4, __cse_5 AS my_agg(UInt32(1) + test.a)\ |
There was a problem hiding this comment.
That certainly looks much nicer
| /// A map that contains the common expressions extracted during the second, rewriting | ||
| /// traversal. |
There was a problem hiding this comment.
perhaps we can also update the docstring here to mention that the map contains the alias
| /// A map that contains the common expressions extracted during the second, rewriting | |
| /// traversal. | |
| /// A map that contains the common expressions and their alias extracted during the second, rewriting | |
| /// traversal. |
There was a problem hiding this comment.
Thanks! Fixed in be2e560 together with ExprStats docstring.
|
Running benchmarks now... |
|
My benchmark runs show a small overall improvement (1% or so) so I feel like this PR is ready to go I plan to leave this PR open for another day or so to allow @MohamedAbdeen21 to comment if the would like to Benchmar Details
|
|
We did a thorough review of the changes in the old, now-closed, PR. FWIW, I'm happy with this change 👍 . |
|
Looking great! The CSE alias has changed many times, I think it's much closer to its final form now (I've skimmed #10473, and it also looks very 🆒!) One thing I'd like to discuss is the prefix name. Unlike (#10868 (comment)) |
…n expression aliases, remove `DataType` from `ExprStats` as not needed, store aliases in `CommonExprs`, revert unnecessary changes
a592e69 to
bfe814b
Compare
|
Sorry, I had rebase this PR. The commits remained the same as before, except the last one that changes the prefix to |
|
Thanks again @peter-toth @waynexia and @MohamedAbdeen21 |
|
Thank you all for the review! |
* initial change * test renaming * use counter instead of indexmap * order slt tests * change cse tests * restore slt tests * fix slt test * formatting * ensure no alias collision * keep original alias numbers for collision * ensure no collision in aggregate cse * use `AliasGenerator` to generate aliases, use `__cse` prefix in common expression aliases, remove `DataType` from `ExprStats` as not needed, store aliases in `CommonExprs`, revert unnecessary changes * use `into_values()` instead of `into_iter()` where possible * fix docstring of `ExprStats` and `CommonExprs` * use `__common_expr` prefix --------- Co-authored-by: Mohamed Abdeen <[email protected]>
Which issue does this PR close?
Closes #10280.
Rationale for this change
This PR is based on @MohamedAbdeen21's #10868 and shortens aliases generated by CSE for readability.
What changes are included in this PR?
Use shorthand numeric aliases (
__cse_1,__cse_2, ...) for common subexpressions.Are these changes tested?
Yes, with existing and new UTs.
Are there any user-facing changes?
Yes, better and more concise logical plans.