Skip to content

fix(prqlc, append): first attempt at applying column order on CTEs too#5317

Closed
Fanaen wants to merge 4 commits intoPRQL:mainfrom
Fanaen:add-positional-mapping-for-append
Closed

fix(prqlc, append): first attempt at applying column order on CTEs too#5317
Fanaen wants to merge 4 commits intoPRQL:mainfrom
Fanaen:add-positional-mapping-for-append

Conversation

@Fanaen
Copy link
Copy Markdown
Contributor

@Fanaen Fanaen commented Jun 5, 2025

Hi there!

TL;DR. We recently had more convoluted cases of append where PRQL output valid SQL that fails at runtime.
Here is a proposal to fix common explicit uses of append.
PR's not complete yet: it's an early bird to make sure this goes in the right direction.
Requesting for comments!

Problem

  • SQL's UNION need the same number of column and in both queries
  • PRQL currently changes main query's columns without changing sub query as well
  • PRQL outputs a SQL query that leads to a type mismatch or a column count mismatch error. More elaborate cases are a mix of those two. You may try a simplified version of either with the Query Results tab in the playground:
# Column swap leading to type mismatch
from invoices | select { invoice_id, invoice_date }
append (
  from invoices | select { invoice_id, invoice_date }
)
select { invoice_date, invoice_id }
# -> E: Unimplemented type for cast (BIGINT -> TIMESTAMP)
# Column count mismatch
from invoices | select { invoice_id, customer_id, invoice_date }
append (
  from invoices | select { invoice_id, customer_id, invoice_date }
)
group { customer_id } (aggregate { `count` = count `invoice_id` })
# -> E: Set operations can only apply to expressions with the same number of result columns

Input space

This means right now, only a tiny subset of valid append are working properly.
To be clear about what we are trying to fix, let's talk PRQL inputs.

Explicit append: ✅ covered by this PR

If we put select everywhere, PRQL can work without asumptions and that's what I tried to fix here.

from invoices | select { invoice_id, invoice_date } 
append (from invoices | select { invoice_id, invoice_date })
# -> WITH table_0 AS (SELECT invoice_id, invoice_date FROM invoices) SELECT invoice_id, ...

Wildcard append: 🚫 not covered by this PR

As long as there is no selection whatsoever, this is fine:

from invoices | append (from invoices)
# -> SELECT * FROM invoices UNION ALL SELECT * FROM invoices AS table_0

from invoices | derive { c = "a"} | append (from invoices | derive { c = "b"}) | sort { +c } 
# A wildcard CTE appears! Not sure those are great 
# -> WITH table_0 AS (SELECT *, 'b' AS c FROM invoices), ...

If we do a select at the end, the main query inherit those select and it fails at SQL runtime.

from invoices | append (from invoices) | select { invoice_date, invoice_id }
# -> SELECT invoice_date, invoice_id FROM invoices UNION ALL SELECT * FROM invoices AS table_0
# -> E: Set operations can only apply to expressions with the same number of result columns

I'm not sure how much we want to support wildcard append since PRQL can't make assumptions on input tables, so it seems really difficult to get it right, so this PR does not affect those.

If we do, maybe we should first reduce the use of CTEs to avoid pulling whole tables.

Hybrid append: 🚫 not covered by this PR

If we specify select on one side and not the other, we trigger a legit PRQL error.

from invoices | select { invoice_id, invoice_date } | append (from invoices)
# -> E: cannot append two relations with non-matching number of columns.

from invoices | append (from invoices | select { invoice_id, invoice_date })
# -> E: cannot append two relations with non-matching number of columns.

This seems OK, especially since sub-query may use other columns names, so this PR does not affect those either.
Columns being unknown probably makes it really difficult to get this right.

Solutions

  1. Apply column projection on CTE
    To apply whatever reorder/removal in the main relation on the secondary one. Detailed below.

  2. 🚫 Leave CTE as is, but specify the columns in UNION ALL instead of *
    Backup plan in case solution 1 fails due to CTE dependencies. It seems append is isolated enough to not warrant this solution.
    Downside of this approach is that it means potentially bigger CTE than needed.

  3. 🚫 Force column alignment by forcing intermediate CTE
    Backup-backup plan. IMHO, the less CTE the better.
    This method may be required to handle wildcard append.

Details for the first version

Here, we will use this example:

from invoicesA | select { invoice_idA, invoice_dateA }
append (
  from invoicesB | select { invoice_idB, invoice_dateB }
)
select { invoice_dateA, invoice_idA }

Phase III. Semantic resolver

When the lineage inference occurs, there is a point where we have clearly the matching columns with top and bottom lineages for the append. The columns pairs (e.g. (invoice_idA, invoice_idB)) is then stored in the resulting Lineage alongside column lineage.

So, we get a top with columns like [("invoice_idA", 134), ..] and bottom with [("invoice_idB", 124), ..] and the resulting lineage will have the top columns like before. Now, it will also have the mapping [(("invoice_idA", 134), ("invoice_idB", 124)), ..].

bottom lineage is discarded, so make the link from the other side seemed trickier.

Phase IV. Semantic-lowering

push_select add a select at the end of each relation. Before this PR, the reordering happens for the main relation (which directly uses top lineage) and not for the additional one.

push_select is triggered by the end of lower_relation which is recursive in a sense:
a. lower_relation(main) triggers other methods wich leads to...
b. lower_relation(secondary).
c. push_select happens for secondary
d. push_select happens for main

Fortunately, the source of the push_select is available from the start of the lower_relation so we can leverage the fact that lower_relation(main) happens first to store which columns are needed, then apply the mapping at the start of lower_relation(secondary) then let push_select do its job.

State is added to the Lowerer as a simple lineage_stack to access the latest relevant mapping.

A bit of extra logic is needed:

  • to cirmuvent the fact that due to assign, lower_relation(main) is provided with a new ("invoice_idA", 139) instead of ("invoice_idA", 134), so AFAIK, we have to check by name.
  • to avoid losing form of lineage. In fact, there are still some kinks to work out here, where the link is broken and this leads to incorrect SQL queries.

Phase VI. Post-process

Should we want to use the second solution (leave CTE as is, but specify the columns in UNION ALL instead of *), we would have to get the extra state across Phase V. Anchoring. Possible but not my first choice.

Related issues

This PR intends to fix (at least partially) the following issues:

#2680 ✅ Fixed in 42f5af0

  • Column mismatch by redefinition
  • This PR does not intends to fix the initial PRQL input, only the workaround
  • This PR's fix isn't triggered (yet)

#3579 🚧 Not working yet.

  • Column mismatch by reordering
  • This PR's fix is triggered but it removes a column that's important for the group:
WITH table_0 AS (
  SELECT
    substr(user, length('u:') + 1) AS user_id
  FROM
    table2
),
table_1 AS (
  SELECT
    substr(user, length('u:') + 1) AS user_id
  FROM
    table1
  UNION
  ALL
  SELECT
    *
  FROM
    table_0
)
SELECT
  user_id,
  COUNT(*) AS daily_interaction
FROM
  table_1
GROUP BY
  user_id

#4724 🚫 Unrelated after all

  • Input is flawed: first table is (col1, col2), second is (col2, col1))

Conclusion

I am still working on the edge cases.
What do you think?

@Fanaen Fanaen marked this pull request as ready for review June 5, 2025 15:59
@kgutwin
Copy link
Copy Markdown
Collaborator

kgutwin commented Jun 5, 2025

Thanks for taking a look at this! If we make progress here, it would help greatly one of my project's backlog items.

Did you look at all at #5165? Do you have any thoughts about that question?

@Fanaen
Copy link
Copy Markdown
Contributor Author

Fanaen commented Jun 6, 2025

First version dc6273b had the flaw of removing too much columns, and losing track of more complex columns (like compute ones).

Second version 42f5af0 does not change the select at lowering phase but does it at anchoring instead which is a lot more reliable. It needs some work because it removes too little, which means it often fails to fix the issue. Unfortunately, this spread more additional state across three layer, and for that I'm sorry 😅

About related issues:

I will wait for feedback before working more on this.

@Fanaen
Copy link
Copy Markdown
Contributor Author

Fanaen commented Jun 9, 2025

Hi @max-sixty! Would this contribution be welcome?

It seems to me the fix should be at the anchoring stage because some of the column reordering/removal happens at that stage.

Column order seems done. Now, I'm trying to find a reliable way to remove, in the child relation, the columns that were added or discarded in the parent relation. The issue is that the lineage is not available at this point since it's been discarded in the lowering stage. There is a disconnect between columns in each relation. Do you have an idea, by any chance?

@max-sixty
Copy link
Copy Markdown
Member

hi @Fanaen ! sorry for missing this, was heads down elsewhere for a while. greatly appreciate the effort!

do you know why we get test failures? is that just sqlite? if we can fix bugs but lose some sqlite support, that's a worthwhile tradeoff

I do think the existing compiler makes this non-trivial, though hopefully not impossible. if we need to reduce the scope of what's supported (for example compel queries to specify columns when using append), then that's OK...

@Fanaen Fanaen changed the title fix(prqlc, append): apply column order on CTEs too fix(prqlc, append): first attempt at applying column order on CTEs too Jun 10, 2025
@Fanaen Fanaen marked this pull request as draft June 10, 2025 18:44
@Fanaen
Copy link
Copy Markdown
Contributor Author

Fanaen commented Jun 10, 2025

hi @Fanaen ! sorry for missing this, was heads down elsewhere for a while. greatly appreciate the effort!

Hi @max-sixty! No worries, thanks a lot for your answer!

do you know why we get test failures? is that just sqlite? if we can fix bugs but lose some sqlite support, that's a worthwhile tradeoff

The first reason is that I threw together a few integration tests to make sure we're covered, and I overlooked some column-typing shenanigans. I'll look into this tomorrow.

The second reason is that at the time, the fix was not yet behaving as intended.

The good news is that my third attempt seems to be really reliable!
I started a new, clean PR, and it should be ready soon.

I do think the existing compiler makes this non-trivial, though hopefully not impossible. if we need to reduce the scope of what's supported (for example compel queries to specify columns when using append), then that's OK...

Good to know, thanks! I think we're almost good so hopefully, it won't come to that.

@Fanaen
Copy link
Copy Markdown
Contributor Author

Fanaen commented Jun 11, 2025

Closing this one in favor of my third attempt. It's different, so I opened a new PR: #5323

@Fanaen Fanaen closed this Jun 11, 2025
@Fanaen Fanaen deleted the add-positional-mapping-for-append branch June 12, 2025 08:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants