fix(prqlc, append): first attempt at applying column order on CTEs too #5317
|
Thanks for taking a look at this! If we make progress here, it would greatly help with one of my project's backlog items. Did you look at all at #5165? Do you have any thoughts about that question? |
|
First version dc6273b had the flaw of removing too many columns, and losing track of more complex columns (like
Second version 42f5af0 does not change the select at
About related issues:
I will wait for feedback before working more on this. |
|
Hi @max-sixty! Would this contribution be welcome? It seems to me the fix should be at the anchoring stage because some of the column reordering/removal happens at that stage. Column order seems done. Now, I'm trying to find a reliable way to remove, in the child relation, the columns that were added or discarded in the parent relation. The issue is that the lineage is not available at this point since it's been discarded in the lowering stage. There is a disconnect between columns in each relation. Do you have an idea, by any chance? |
|
hi @Fanaen ! sorry for missing this, was heads down elsewhere for a while. greatly appreciate the effort!
do you know why we get test failures? is that just sqlite? if we can fix bugs but lose some sqlite support, that's a worthwhile tradeoff
I do think the existing compiler makes this non-trivial, though hopefully not impossible. if we need to reduce the scope of what's supported (for example compel queries to specify columns when using append), then that's OK... |
Hi @max-sixty! No worries, thanks a lot for your answer!
The first reason is that I threw together a few integration tests to make sure we're covered, and I overlooked some column-typing shenanigans. I'll look into this tomorrow. The second reason is that at the time, the fix was not yet behaving as intended. The good news is that my third attempt seems to be really reliable!
Good to know, thanks! I think we're almost good so hopefully, it won't come to that. |
|
Closing this one in favor of my third attempt. It's different, so I opened a new PR: #5323 |
Hi there!
TL;DR. We recently had more convoluted cases of `append` where PRQL outputs valid SQL that fails at runtime.
Here is a proposal to fix common explicit uses of `append`.
PR's not complete yet: it's an early bird to make sure this goes in the right direction.
Requesting comments!
Problem
`UNION` needs the same number of columns in both queries
`Query Results` tab in the playground:
Input space
This means right now, only a tiny subset of valid `append` uses works properly.
To be clear about what we are trying to fix, let's talk PRQL inputs.
Explicit `append`: ✅ covered by this PR
If we put `select` everywhere, PRQL can work without assumptions, and that's what I tried to fix here.
Wildcard `append`: 🚫 not covered by this PR
As long as there is no selection whatsoever, this is fine:
If we do a select at the end, the main query inherits that select and it fails at SQL runtime.
I'm not sure how much we want to support wildcard `append`: PRQL can't make assumptions about input tables, so it seems really difficult to get it right, and this PR does not affect those.
If we do, maybe we should first reduce the use of CTEs to avoid pulling whole tables.
Hybrid `append`: 🚫 not covered by this PR
If we specify `select` on one side and not the other, we trigger a legit PRQL error.
This seems OK, especially since the sub-query may use other column names, so this PR does not affect those either.
Columns being unknown probably makes it really difficult to get this right.
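To make the three input categories above concrete, here is a minimal Rust sketch. It is only an illustration of the taxonomy, not code from the compiler: `AppendKind` and `classify_append` are hypothetical names, and it simplifies by looking only at whether each side of the `append` has an explicit `select`.

```rust
/// Hypothetical classification of an `append`, based only on whether each
/// side carries an explicit `select`. Not part of prqlc.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AppendKind {
    /// Both sides select their columns: the case this PR tries to fix.
    Explicit,
    /// Neither side selects columns: not covered by this PR.
    Wildcard,
    /// Only one side selects columns: PRQL reports an error.
    Hybrid,
}

/// Classify an `append` from two flags (assumed inputs for illustration).
pub fn classify_append(top_has_select: bool, bottom_has_select: bool) -> AppendKind {
    match (top_has_select, bottom_has_select) {
        (true, true) => AppendKind::Explicit,
        (false, false) => AppendKind::Wildcard,
        _ => AppendKind::Hybrid,
    }
}
```

Only the `Explicit` case is in scope here; the other two are left as they are.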
Solutions
✅ Apply column projection on CTE
To apply whatever reorder/removal in the main relation on the secondary one. Detailed below.
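As a rough sketch of the idea (not the actual prqlc code), the projection can be thought of as translating the main relation's final column list through the `(top, bottom)` pairs recorded for the `append`. Everything here is an assumption for illustration: `ColumnRef` and `project_bottom` are hypothetical names, and columns are modelled as `(name, id)` tuples like the `("invoice_idA", 134)` examples in the details below. Matching is done by name, since ids may differ by the time the mapping is applied.

```rust
/// A column as a (name, id) pair, e.g. ("invoice_idA", 134).
/// Hypothetical simplification of the real lineage types.
pub type ColumnRef = (String, usize);

/// Hypothetical helper: given the columns the main (top) relation ends up
/// selecting, in order, and the (top, bottom) pairs recorded for the
/// `append`, return the columns the secondary (bottom) relation should
/// select, in the same order.
///
/// Lookup is by column name, not id (assumption: ids can change between
/// stages). Returns `None` if a selected column has no counterpart.
pub fn project_bottom(
    top_selected: &[ColumnRef],
    pairs: &[(ColumnRef, ColumnRef)],
) -> Option<Vec<ColumnRef>> {
    top_selected
        .iter()
        .map(|(name, _id)| {
            pairs
                .iter()
                // Find the pair whose top column has the same name.
                .find(|((top_name, _), _)| top_name == name)
                // Keep the bottom half of that pair.
                .map(|(_, bottom)| bottom.clone())
        })
        // Collecting into Option<Vec<_>> yields None on the first miss.
        .collect()
}
```

With this shape, both a reorder and a removal in the main relation carry over: the output follows the order of `top_selected` and omits any pair that was not selected.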
🚫 Leave CTE as is, but specify the columns in `UNION ALL` instead of `*`
Backup plan in case solution 1 fails due to CTE dependencies. It seems `append` is isolated enough to not warrant this solution.
Downside of this approach is that it means potentially bigger CTE than needed.
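For comparison, a minimal sketch of what this backup plan would produce: a `UNION ALL` whose two sides name their columns explicitly. This is not how prqlc generates SQL; `union_all_explicit` is a hypothetical helper, the table and column names are placeholders, and identifier quoting is deliberately ignored.

```rust
/// Hypothetical helper: build a `UNION ALL` that lists the columns of each
/// side explicitly instead of using `*`. No identifier quoting is done.
pub fn union_all_explicit(
    top_table: &str,
    top_cols: &[&str],
    bottom_table: &str,
    bottom_cols: &[&str],
) -> Result<String, String> {
    // Both sides must have the same number of columns.
    if top_cols.len() != bottom_cols.len() {
        return Err(format!(
            "column count mismatch: {} vs {}",
            top_cols.len(),
            bottom_cols.len()
        ));
    }
    Ok(format!(
        "SELECT {} FROM {} UNION ALL SELECT {} FROM {}",
        top_cols.join(", "),
        top_table,
        bottom_cols.join(", "),
        bottom_table
    ))
}
```

The CTEs themselves stay untouched in this variant, which is why they may carry more columns than the final query needs.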
🚫 Force column alignment by forcing intermediate CTE
Backup-backup plan. IMHO, the fewer CTEs the better.
This method may be required to handle wildcard `append`.
Details for the first version
Here, we will use this example:
Phase III. Semantic resolver
When the lineage inference occurs, there is a point where we clearly have the matching columns with `top` and `bottom` lineages for the `append`. The column pairs (e.g. `(invoice_idA, invoice_idB)`) are then stored in the resulting `Lineage` alongside column lineage.
So, we get a `top` with columns like `[("invoice_idA", 134), ..]` and `bottom` with `[("invoice_idB", 124), ..]` and the resulting lineage will have the `top` columns like before. Now, it will also have the mapping `[(("invoice_idA", 134), ("invoice_idB", 124)), ..]`.
`bottom` lineage is discarded, so making the link from the other side seemed trickier.
Phase IV. Semantic-lowering
`push_select` adds a `select` at the end of each relation. Before this PR, the reordering happens for the main relation (which directly uses `top` lineage) and not for the additional one.
`push_select` is triggered by the end of `lower_relation`, which is recursive in a sense:
a. `lower_relation(main)` triggers other methods which lead to...
b. `lower_relation(secondary)`.
c. `push_select` happens for `secondary`
d. `push_select` happens for `main`
Fortunately, the source of the `push_select` is available from the start of the `lower_relation`, so we can leverage the fact that `lower_relation(main)` happens first to store which columns are needed, then apply the mapping at the start of `lower_relation(secondary)`, then let `push_select` do its job.
State is added to the `Lowerer` as a simple `lineage_stack` to access the latest relevant mapping.
A bit of extra logic is needed:
`assign`, `lower_relation(main)` is provided with a new `("invoice_idA", 139)` instead of `("invoice_idA", 134)`, so AFAIK, we have to check by name.
Phase VI. Post-process
Should we want to use the second solution (leave CTE as is, but specify the columns in `UNION ALL` instead of `*`), we would have to get the extra state across Phase V. Anchoring. Possible but not my first choice.
Related issues
This PR intends to fix (at least partially) the following issues:
#2680 ✅ Fixed in 42f5af0
#3579 🚧 Not working yet.
group:
#4724 🚫 Unrelated after all
`(col1, col2)`, second is `(col2, col1)`)
Conclusion
I am still working on the edge cases.
What do you think?