Implement prettier SQL unparsing (more human readable)#11186
Implement prettier SQL unparsing (more human readable)#11186alamb merged 17 commits intoapache:mainfrom
Conversation
|
Hi @alamb this can use more testing, match cases and moving stuff around as I'm not too familiar with the unparser. Appreciate if you can provide other cases/expressions that should be tested. |
|
I plan to review this tomorrow |
alamb
left a comment
There was a problem hiding this comment.
This is looking very cool @MohamedAbdeen21 -- thank you
I left a few suggestions for other tests.
|
FYI @phillipleblanc |
phillipleblanc
left a comment
There was a problem hiding this comment.
Looks great @MohamedAbdeen21! Very neat idea and execution. I have a few thoughts on simplifying/unifying the code.
Thanks @alamb for tagging me. 🙏
datafusion/sql/src/unparser/expr.rs
Outdated
| pub fn pretty_expr_to_sql(&self, expr: &Expr) -> Result<ast::Expr> { | ||
| let root_expr = self.expr_to_sql(expr)?; | ||
| Ok(self.pretty(root_expr, LOWEST, LOWEST)) | ||
| } |
There was a problem hiding this comment.
I wouldn't have an extra method here and would combine it with expr_to_sql. The ast::Expr it produces is logically the same as the input one, just with unnecessary nesting removed. In fact, you could even think about this as serving the same purpose as an optimizer rewrite pass for LogicalPlan - it should produce logically the same thing as the input, just more efficient.
datafusion/sql/src/unparser/expr.rs
Outdated
| /// | ||
| /// Also note that when fetching the precedence of a nested expression, we ignore other nested | ||
| /// expressions, so precedence of expr `(a * (b + c))` equals `*` and not `+`. | ||
| fn pretty( |
There was a problem hiding this comment.
If we just have a single expr_to_sql method, then it might make sense to rename this method as well to something like remove_unnecessary_nesting to more accurately describe what it does. Then if we added more "rewrite passes" to make it prettier or achieve some other functional goal, then they would just be added as separate functions after this one.
| /// Get the operator precedence | ||
| /// use <https://www.postgresql.org/docs/7.0/operators.htm#AEN2026> as a reference | ||
| /// use <https://www.postgresql.org/docs/7.2/sql-precedence.html> as a reference | ||
| pub fn precedence(&self) -> u8 { |
There was a problem hiding this comment.
It looks like this is only used in a few places for formatting the parenthesis in a Display trait on BinaryExpr. I don't think that will have any major impact on library users.
|
Thanks @alamb and @phillipleblanc. Updated PR accordingly I'm aware of a failing test, will look into it later today. |
phillipleblanc
left a comment
There was a problem hiding this comment.
Looks great to me, thanks @MohamedAbdeen21!
alamb
left a comment
There was a problem hiding this comment.
THanks @MohamedAbdeen21 and @phillipleblanc -- I think this is looking close
| async fn round_trip_parse_sql_expr_demo() -> Result<()> { | ||
| let sql = "((int_col < 5) OR (double_col = 8))"; | ||
| // unparser can also remove extra parentheses, | ||
| // so `((int_col < 5) OR (double_col = 8))` will also produce the same SQL |
| &mut PlannerContext::new(), | ||
| )?; | ||
|
|
||
| assert_eq!(expr.to_string(), pretty_expr.to_string()); |
There was a problem hiding this comment.
Shouldn't this me veriying that the expr are the same (not just that the string is the same)?
There was a problem hiding this comment.
But they can't be the same, the whole idea behind removing parenthesis is to maintain the final result regardless of which operations get executed first/has the parenthesis aka associativity.
Here's an example from the tests:
("((id + 5) * (age * 8))", "(id + 5) * age * 8"),
Here's the tree of the first/original expr:
*
/ \
+ *
/ \ / \
id 5 age 8
And here's the cleaned one
*
/ \
* 8
/ \
+ age
/ \
id 5
These obviously don't match, but it doesn't matter because multiplication is associative. Almost all operators are associative except for division and subtraction, which are handled.
Comparing the string representation seemed like the best solution.
There was a problem hiding this comment.
If you want to maintain such equivalence then the only case we remove parenthesis would be that the parenthesis give priority to the left side, which is the natural associativity of all binary operators (except = I think)*. Or to the operation with higher (but not equal) precedence. Something like (a + b) + c or true AND (a > b) for example
There was a problem hiding this comment.
Thank you for the response @MohamedAbdeen21 -- this makes sense
🤔 I am not sure if maintaining the structural equivalence of the input/output is important or if just the semantics are important to preserve
Perhaps @goldmedal @seddonm1 or @phillipleblanc or @devinjdangelo has thoughts on what the preferred behavior is
There was a problem hiding this comment.
I think it's related to #11186 (comment). If our main purpose is providing SQL for full pushdown to other data sources, this behavior makes sense to me. For my current project (Wren), it's acceptable.
However, if our purpose is human readability, I think preserving the semantics is important. Maybe we can just remove the outermost brackets (they're exactly redundant) for a calculation expression like:
((3 + (5 + 6)) + 3) -> (3 + (5 + 6)) + 3
I'm not sure, but I think it might be more similar to what a human would write.
There was a problem hiding this comment.
I don't think maintaining structural equivalence is important as long as the correct answer would still be generated if the SQL were run.
But maybe this is exposing two opposing(?) goals of the unparser - the goal for using it to generate SQL that will be run in another SQL query engine. And another goal for displaying it for human readability/debuggability. I'm in the first camp - it doesn't matter to me what the resulting SQL looks like as long as it runs correctly (although having nicer SQL does make debugging easier).
However, I would expect that we could serve both goals with a single unparser implementation. Any work that is done to improve the SQL for human readability shouldn't come at the expense of making it invalid SQL for running in a separate query engine. That is why I thought it would be better to have just a single expr_to_sql function instead of splitting the functionality.
There was a problem hiding this comment.
Since the primary goal is pushdown for other systems, we have two options.
-
Completely ignore this issue. Systems can have very different precedence rules, and therefore, we can not guarantee correctness once some parentheses are removed. Also, it can be hard to decide which parentheses are "safe" to remove.
-
Informing DF of your target system operator precedence through implementing a dialect. This can be a lot of work for the dev, but I can see this being a desired option for some people; as Phillip pointed out, It can make debugging easier.
There was a problem hiding this comment.
But maybe this is exposing two opposing(?) goals of the unparser - the goal for using it to generate SQL that will be run in another SQL query engine. And another goal for displaying it for human readability/debuggability.
I think this is an excellent summary of the issue and given these two different goals there would be two different ideal outcomes.
Perhaps as @MohamedAbdeen21 suggests we can add some sort of API to allow the user to decide. We could potentially use Unparser for this (though I don't think Unparser is part of the public API yet).
datafusion/datafusion/sql/src/unparser/mod.rs
Lines 30 to 32 in 45599ce
So something like
// default unparser generates fully parenthesized sql:
let unparser = Unparser::default();
// can also create an unparser that generates pretty sql
let unparser = Unparser::default()
// enable pretty-printing of the sql easier suited to human consumption
.with_pretty(true);There was a problem hiding this comment.
That makes sense to me. I didn't think about how different query engines might have different precedence rules, so having this functionality split now makes a lot more sense.
Making it explicit: Removing parenthesis according to the precedence rules of DataFusion might make it invalid SQL for other query engines with different precedence rules, even if its valid for DataFusion. That would break one of the use-cases for the unparser, so allowing people to opt-in to that behavior is desirable.
The proposed API by @alamb looks good to me.
There was a problem hiding this comment.
pushed the proposed API, updated docs, examples and tests
| ("((3 + (5 + 6)) * 3)", "(3 + 5 + 6) * 3"), | ||
| ("((3 + (5 + 6)) + 3)", "3 + 5 + 6 + 3"), | ||
| ("3 + 5 + 6 + 3", "3 + 5 + 6 + 3"), | ||
| ("3 + (5 + (6 + 3))", "3 + 5 + 6 + 3"), |
There was a problem hiding this comment.
this doesn't seem right to me -- this implies that (3 + 5) + (6 + 3) and 3 + ((5 + 6) + 3) are logically equivalent to each other as well as with 3 + 5 + 6 + 3
I think (3 + 5) + (6 + 3) and 3 + ((5 + 6) + 3) are different, right? Aka the parenthesis are needed here
There was a problem hiding this comment.
resolved as now pretty unparsing is done conditionally
|
thanks @MohamedAbdeen21 -- I am very behind on reviews. I put this one on my list again |
alamb
left a comment
There was a problem hiding this comment.
I took a look at this @MohamedAbdeen21 and TLDR is I think it looks great now,
While reviewing the code, I noticed some documentation that is missing / non ideal. I'll make a follow on PR to improve things
|
|
||
| assert_eq!(sql, round_trip_sql); | ||
|
|
||
| // enable pretty-unparsing. This make the output more human-readable |
| ("((3 + (5 + 6)) * 3)", "(3 + 5 + 6) * 3"), | ||
| ("((3 + (5 + 6)) + 3)", "3 + 5 + 6 + 3"), | ||
| ("3 + 5 + 6 + 3", "3 + 5 + 6 + 3"), | ||
| ("3 + (5 + (6 + 3))", "3 + 5 + 6 + 3"), |
There was a problem hiding this comment.
resolved as now pretty unparsing is done conditionally
|
cc @backkem and @devinjdangelo |
|
Here is a PR to improve the documentation #11395 (based on this one) |
|
Thanks again @MohamedAbdeen21 |
* initial prettier unparse * bug fix * handling minus and divide * cleaning references and comments * moved tests * Update precedence of BETWEEN * rerun CI * Change precedence to match PGSQLs * more pretty unparser tests * Update operator precedence to match latest PGSQL * directly prettify expr_to_sql * handle IS operator * correct IS precedence * update unparser tests * update unparser example * update more unparser examples * add with_pretty builder to unparser
* initial prettier unparse * bug fix * handling minus and divide * cleaning references and comments * moved tests * Update precedence of BETWEEN * rerun CI * Change precedence to match PGSQLs * more pretty unparser tests * Update operator precedence to match latest PGSQL * directly prettify expr_to_sql * handle IS operator * correct IS precedence * update unparser tests * update unparser example * update more unparser examples * add with_pretty builder to unparser
* initial prettier unparse * bug fix * handling minus and divide * cleaning references and comments * moved tests * Update precedence of BETWEEN * rerun CI * Change precedence to match PGSQLs * more pretty unparser tests * Update operator precedence to match latest PGSQL * directly prettify expr_to_sql * handle IS operator * correct IS precedence * update unparser tests * update unparser example * update more unparser examples * add with_pretty builder to unparser
* initial prettier unparse * bug fix * handling minus and divide * cleaning references and comments * moved tests * Update precedence of BETWEEN * rerun CI * Change precedence to match PGSQLs * more pretty unparser tests * Update operator precedence to match latest PGSQL * directly prettify expr_to_sql * handle IS operator * correct IS precedence * update unparser tests * update unparser example * update more unparser examples * add with_pretty builder to unparser
Which issue does this PR close?
Takes a shot at #10633 .
Rationale for this change
Algorithm adapted from this SO answer.
What changes are included in this PR?
Make unparsed binary expressions easier to read.
Are these changes tested?
Yes, but can use more tests/statements/plans to unparse.
Are there any user-facing changes?
New
.with_pretty()method for the unparser to remove extra parenthesis from an expr.