make common expression alias human-readable#10333
Conversation
|
I don't think removing the accumulation and a couple of clones will increase perf, but let's see /benchmark |
Benchmark resultsBenchmarks comparing b412dba (main) and dc5cf09 (PR) |
alamb
left a comment
There was a problem hiding this comment.
Thank you @MohamedAbdeen21 I went through this code carefully and it seems really nice to me -- not only is it more efficient (less copying of Identifier) the code is also simpler.
I think it is good to go from my perspective given all the existing tests pass 🚀
cc @wiedld as I think you have worked on this code recently and @waynexia and @peter-toth who I think is working to improve this code in terms of copies
|
Thanks for the review @alamb. I was trying to implement it duckdb style with The biggest problem I found was that the rule is applied twice, meaning that we need a shared/global mapping of expr -> id between rule calls. Also, the ordering of the created projections should be incremental ( |
|
can we please wait with merging this PR until we sort out the issues of the rule in #10396 |
|
Per the discussion on #10396, we merged that one first and now this one needs to be rebased / resolved. Marking it as draft until we can do that I believe @MohamedAbdeen21 plans to do it this weekend (per #10396 (comment)) |
alamb
left a comment
There was a problem hiding this comment.
Makes sense (again) to me -- thank you @MohamedAbdeen21 and @peter-toth
| let mut desc = String::new(); | ||
|
|
||
| while let Some(item) = self.visit_stack.pop() { | ||
| fn pop_enter_mark(&mut self) -> Option<usize> { |
There was a problem hiding this comment.
We shoudn't change this part.
The logic that builds up an identifier using visit_stack / 3 kinds of VisitRecord is neccessary and actually a very clever and way to build up an identifier from the current node and sub-identifiers. (An identifier to be a String was not that a clever decision and will be fixed in #10426, but that's a different issue).
This PR shouldn't change what an identifier is / how it is built up otherwise we end up with identifier colliding bugs again. The IdArray, ExprStats and CommonExprs datastructures require an identifier to represent a full expression subtreee. This means that:
fn expr_identifier(expr: &Expr) -> Identifier {
format!("#{{{expr}}}")
}
would cause bugs as shown in 1. of #10396.
I.e. if we encountered both col("a") + col("b") and col("a + b") in the expression list to be CSEd and we used "{expr}" (the non-unique stringified representation) as identifiers then the equal identifier ("a + b") of those 2 different expressions would collide and we counted 2 for the occurance of one of the 2 expressions (and the other expression's count would be lost) resulting wrong CSE.
Please note that currently the identifier of col("a") + col("b") is "{a + b|b|a}" so it doesn't collide with col("a + b")'s identifier: "{a + b}".
Again, this is hard to test now because of the resolution bug: #10413.
I.e. if we write a test where we have
select a + b, "a + b" from (
select 1 as a, 2 as b, 1 as "a + b"
)
then currently it gets resolved as
select "a + b", "a + b" from (
select 1 as a, 2 as b, 1 as "a + b"
)
and this prevents us to create a test case for CSE identifier collision.
(Please note that I'm simplifying the identifier collision exmple as simple columns (col("a + b")) are not subject to CSE.)
What this PR can do is to change the aliases (use something else than identifiers) to make the plans more readable.
There was a problem hiding this comment.
@peter-toth -- thank you -- I see now the danger of the lack of test coverage in this area.
I am still a little unclear on if the bug you described above is something that can actually be hit in practice (due to lack of a test case), or if it will be masked by #10413
What shall we do now? I can see three possibilities:
- Revert this PR and try and make a follow on one that doesn't change it
- Create a PR to revert just ID change
- You can fix the ID change in your fix for Make
CommonSubexprEliminatefaster by stop copying so many strings #10426
Please let me know your thoughts. I can potentially help with 1 or 2.
There was a problem hiding this comment.
I am still a little unclear on if the bug you described above is something that can actually be hit in practice (due to lack of a test case), or if it will be masked by #10413
That's a good question. I think once #10413 gets resolved we can hit the identifier collision issue due to this PR. But actually, let's try to add an identifier collision test case after #10413.
I would suggest 1., revert the commit and a new version of the PR where only the aliases are made simpler.
There was a problem hiding this comment.
Revert PR is merged: #10436
And I am pretty stoked that @jonahgao is looking at #10413 (comment)
| let expected = vec![ | ||
| (8, "{(SUM(a + Int32(1)) - AVG(c)) * Int32(2)|{Int32(2)}|{SUM(a + Int32(1)) - AVG(c)|{AVG(c)|{c}}|{SUM(a + Int32(1))|{a + Int32(1)|{Int32(1)}|{a}}}}}"), | ||
| (6, "{SUM(a + Int32(1)) - AVG(c)|{AVG(c)|{c}}|{SUM(a + Int32(1))|{a + Int32(1)|{Int32(1)}|{a}}}}"), | ||
| (8, "#{(SUM(a + Int32(1)) - AVG(c)) * Int32(2)}"), |
There was a problem hiding this comment.
IdArray shouldn't change in this PR. Actually, new fields can be added to the tuples but the identifier field should remain as it is.
|
At this point I think we'd better revert this, close it and revisit it after #10413. @alamb @peter-toth |
|
Sounds like the consensus is to revert this PR -- could you possible make a revert PR @MohamedAbdeen21 ? |
This reverts commit 2a15614.
This reverts commit 2a15614.
|
Revert PR: #10436 |
* use only current expr symbol + remove visit stack * Fix tpch tests/benchmarks * update docs * removing some clones * fix clippy * update docs * rebase * restore docs * fix clippy
…10436) * Revert "make common expression alias human-readable (apache#10333)" This reverts commit 2a15614. * keep some ok parts
Which issue does this PR close?
Closes #10280.
Thanks @JasonLi-cn for pointing out the relevant code snippet and the detailed description, saved me some time ❤️
Rationale for this change
Making the plans human-readable when common subexpression elimination is applied (check example in ticket and edited tests in this PR).
What changes are included in this PR?
Changing
alias_symbolfromcurr_expr_identifier + sub_expr_identifierto#{curr_expr_identifier}, making it more readable.Also, since we no longer need the sub_expr for the alias, we can remove the entire visit stack + visit record enum.
Are these changes tested?
Changed existing tests to reflect the change.
Are there any user-facing changes?
Readable plans when CSE is applied