fix(prqlc, append): first attempt at applying column order on CTEs too by Fanaen · Pull Request #5317 · PRQL/prql

Fanaen · 2025-06-05T15:32:03Z

Hi there!

TL;DR. We recently had more convoluted cases of append where PRQL output valid SQL that fails at runtime.
Here is a proposal to fix common explicit uses of append.
PR's not complete yet: it's an early bird to make sure this goes in the right direction.
Requesting for comments!

Problem

SQL's UNION need the same number of column and in both queries
PRQL currently changes main query's columns without changing sub query as well
PRQL outputs a SQL query that leads to a type mismatch or a column count mismatch error. More elaborate cases are a mix of those two. You may try a simplified version of either with the Query Results tab in the playground:

# Column swap leading to type mismatch
from invoices | select { invoice_id, invoice_date }
append (
  from invoices | select { invoice_id, invoice_date }
)
select { invoice_date, invoice_id }
# -> E: Unimplemented type for cast (BIGINT -> TIMESTAMP)

# Column count mismatch
from invoices | select { invoice_id, customer_id, invoice_date }
append (
  from invoices | select { invoice_id, customer_id, invoice_date }
)
group { customer_id } (aggregate { `count` = count `invoice_id` })
# -> E: Set operations can only apply to expressions with the same number of result columns

Input space

This means right now, only a tiny subset of valid append are working properly.
To be clear about what we are trying to fix, let's talk PRQL inputs.

Explicit `append`: ✅ covered by this PR

If we put select everywhere, PRQL can work without asumptions and that's what I tried to fix here.

from invoices | select { invoice_id, invoice_date } 
append (from invoices | select { invoice_id, invoice_date })
# -> WITH table_0 AS (SELECT invoice_id, invoice_date FROM invoices) SELECT invoice_id, ...

Wildcard `append`: 🚫 not covered by this PR

As long as there is no selection whatsoever, this is fine:

from invoices | append (from invoices)
# -> SELECT * FROM invoices UNION ALL SELECT * FROM invoices AS table_0

from invoices | derive { c = "a"} | append (from invoices | derive { c = "b"}) | sort { +c } 
# A wildcard CTE appears! Not sure those are great 
# -> WITH table_0 AS (SELECT *, 'b' AS c FROM invoices), ...

If we do a select at the end, the main query inherit those select and it fails at SQL runtime.

from invoices | append (from invoices) | select { invoice_date, invoice_id }
# -> SELECT invoice_date, invoice_id FROM invoices UNION ALL SELECT * FROM invoices AS table_0
# -> E: Set operations can only apply to expressions with the same number of result columns

I'm not sure how much we want to support wildcard append since PRQL can't make assumptions on input tables, so it seems really difficult to get it right, so this PR does not affect those.

If we do, maybe we should first reduce the use of CTEs to avoid pulling whole tables.

Hybrid `append`: 🚫 not covered by this PR

If we specify select on one side and not the other, we trigger a legit PRQL error.

from invoices | select { invoice_id, invoice_date } | append (from invoices)
# -> E: cannot append two relations with non-matching number of columns.

from invoices | append (from invoices | select { invoice_id, invoice_date })
# -> E: cannot append two relations with non-matching number of columns.

This seems OK, especially since sub-query may use other columns names, so this PR does not affect those either.
Columns being unknown probably makes it really difficult to get this right.

Solutions

✅ Apply column projection on CTE
To apply whatever reorder/removal in the main relation on the secondary one. Detailed below.
🚫 Leave CTE as is, but specify the columns in UNION ALL instead of *
Backup plan in case solution 1 fails due to CTE dependencies. It seems append is isolated enough to not warrant this solution.
Downside of this approach is that it means potentially bigger CTE than needed.
🚫 Force column alignment by forcing intermediate CTE
Backup-backup plan. IMHO, the less CTE the better.
This method may be required to handle wildcard append.

Details for the first version

Here, we will use this example:

from invoicesA | select { invoice_idA, invoice_dateA }
append (
  from invoicesB | select { invoice_idB, invoice_dateB }
)
select { invoice_dateA, invoice_idA }

Phase III. Semantic resolver

When the lineage inference occurs, there is a point where we have clearly the matching columns with top and bottom lineages for the append. The columns pairs (e.g. (invoice_idA, invoice_idB)) is then stored in the resulting Lineage alongside column lineage.

So, we get a top with columns like [("invoice_idA", 134), ..] and bottom with [("invoice_idB", 124), ..] and the resulting lineage will have the top columns like before. Now, it will also have the mapping [(("invoice_idA", 134), ("invoice_idB", 124)), ..].

bottom lineage is discarded, so make the link from the other side seemed trickier.

Phase IV. Semantic-lowering

push_select add a select at the end of each relation. Before this PR, the reordering happens for the main relation (which directly uses top lineage) and not for the additional one.

push_select is triggered by the end of lower_relation which is recursive in a sense:
a. lower_relation(main) triggers other methods wich leads to...
b. lower_relation(secondary).
c. push_select happens for secondary
d. push_select happens for main

Fortunately, the source of the push_select is available from the start of the lower_relation so we can leverage the fact that lower_relation(main) happens first to store which columns are needed, then apply the mapping at the start of lower_relation(secondary) then let push_select do its job.

State is added to the Lowerer as a simple lineage_stack to access the latest relevant mapping.

A bit of extra logic is needed:

to cirmuvent the fact that due to assign, lower_relation(main) is provided with a new ("invoice_idA", 139) instead of ("invoice_idA", 134), so AFAIK, we have to check by name.
to avoid losing form of lineage. In fact, there are still some kinks to work out here, where the link is broken and this leads to incorrect SQL queries.

Phase VI. Post-process

Should we want to use the second solution (leave CTE as is, but specify the columns in UNION ALL instead of *), we would have to get the extra state across Phase V. Anchoring. Possible but not my first choice.

Related issues

This PR intends to fix (at least partially) the following issues:

#2680 ✅ Fixed in 42f5af0

Column mismatch by redefinition
This PR does not intends to fix the initial PRQL input, only the workaround
This PR's fix isn't triggered (yet)

#3579 🚧 Not working yet.

Column mismatch by reordering
This PR's fix is triggered but it removes a column that's important for the group:

WITH table_0 AS (
  SELECT
    substr(user, length('u:') + 1) AS user_id
  FROM
    table2
),
table_1 AS (
  SELECT
    substr(user, length('u:') + 1) AS user_id
  FROM
    table1
  UNION
  ALL
  SELECT
    *
  FROM
    table_0
)
SELECT
  user_id,
  COUNT(*) AS daily_interaction
FROM
  table_1
GROUP BY
  user_id

#4724 🚫 Unrelated after all

Input is flawed: first table is (col1, col2), second is (col2, col1))

Conclusion

I am still working on the edge cases.
What do you think?

for more information, see https://pre-commit.ci

kgutwin · 2025-06-05T17:01:28Z

Thanks for taking a look at this! If we make progress here, it would help greatly one of my project's backlog items.

Did you look at all at #5165? Do you have any thoughts about that question?

Fanaen · 2025-06-06T15:16:45Z

First version dc6273b had the flaw of removing too much columns, and losing track of more complex columns (like compute ones).

Second version 42f5af0 does not change the select at lowering phase but does it at anchoring instead which is a lot more reliable. It needs some work because it removes too little, which means it often fails to fix the issue. Unfortunately, this spread more additional state across three layer, and for that I'm sorry 😅

About related issues:

This new version fixes the explicit workaround suggested in select clause after append clause #2680
Aggregate doesn't respect union's names #3579 still needs some work (append + count(*) edge case)
Upon closer inspection, the append doesn't check column, column order matters #4724 issue is not related, its about not having to manually align columns

I will wait for feedback before working more on this.

Fanaen · 2025-06-09T18:04:06Z

Hi @max-sixty! Would this contribution be welcome?

It seems to me the fix should be at the anchoring stage because some of the column reordering/removal happens at that stage.

Column order seems done. Now, I'm trying to find a reliable way to remove, in the child relation, the columns that were added or discarded in the parent relation. The issue is that the lineage is not available at this point since it's been discarded in the lowering stage. There is a disconnect between columns in each relation. Do you have an idea, by any chance?

max-sixty · 2025-06-09T19:55:06Z

hi @Fanaen ! sorry for missing this, was heads down elsewhere for a while. greatly appreciate the effort!

do you know why we get test failures? is that just sqlite? if we can fix bugs but lose some sqlite support, that's a worthwhile tradeoff

I do think the existing compiler makes this non-trivial, though hopefully not impossible. if we need to reduce the scope of what's supported (for example compel queries to specify columns when using append), then that's OK...

Fanaen · 2025-06-10T19:38:06Z

hi @Fanaen ! sorry for missing this, was heads down elsewhere for a while. greatly appreciate the effort!

Hi @max-sixty! No worries, thanks a lot for your answer!

do you know why we get test failures? is that just sqlite? if we can fix bugs but lose some sqlite support, that's a worthwhile tradeoff

The first reason is that I threw together a few integration tests to make sure we're covered, and I overlooked some column-typing shenanigans. I'll look into this tomorrow.

The second reason is that at the time, the fix was not yet behaving as intended.

The good news is that my third attempt seems to be really reliable!
I started a new, clean PR, and it should be ready soon.

I do think the existing compiler makes this non-trivial, though hopefully not impossible. if we need to reduce the scope of what's supported (for example compel queries to specify columns when using append), then that's OK...

Good to know, thanks! I think we're almost good so hopefully, it won't come to that.

Fanaen · 2025-06-11T09:43:07Z

Closing this one in favor of my third attempt. It's different, so I opened a new PR: #5323

Fanaen and others added 2 commits June 5, 2025 17:18

fix(prqlc, append): apply column order on CTEs too

dc6273b

[pre-commit.ci] auto fixes from pre-commit.com hooks

8608595

for more information, see https://pre-commit.ci

Fanaen marked this pull request as ready for review June 5, 2025 15:59

Fanaen mentioned this pull request Jun 6, 2025

Should append sometimes or always create UNION ALL BY NAME? #5165

Open

fix(prqlc, append): do it at anchoring instead

42f5af0

chore(prqlc): apply linter suggestion

0356384

Fanaen changed the title ~~fix(prqlc, append): apply column order on CTEs too~~ fix(prqlc, append): first attempt at applying column order on CTEs too Jun 10, 2025

Fanaen marked this pull request as draft June 10, 2025 18:44

Fanaen closed this Jun 11, 2025

Fanaen deleted the add-positional-mapping-for-append branch June 12, 2025 08:40

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(prqlc, append): first attempt at applying column order on CTEs too#5317

fix(prqlc, append): first attempt at applying column order on CTEs too#5317
Fanaen wants to merge 4 commits intoPRQL:mainfrom
Fanaen:add-positional-mapping-for-append

Fanaen commented Jun 5, 2025 •

edited

Loading

Uh oh!

kgutwin commented Jun 5, 2025

Uh oh!

Fanaen commented Jun 6, 2025

Uh oh!

Fanaen commented Jun 9, 2025

Uh oh!

max-sixty commented Jun 9, 2025

Uh oh!

Fanaen commented Jun 10, 2025

Uh oh!

Fanaen commented Jun 11, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

Fanaen commented Jun 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Input space

Explicit append: ✅ covered by this PR

Wildcard append: 🚫 not covered by this PR

Hybrid append: 🚫 not covered by this PR

Solutions

Details for the first version

Phase III. Semantic resolver

Phase IV. Semantic-lowering

Phase VI. Post-process

Related issues

#2680 ✅ Fixed in 42f5af0

#3579 🚧 Not working yet.

#4724 🚫 Unrelated after all

Conclusion

Uh oh!

kgutwin commented Jun 5, 2025

Uh oh!

Fanaen commented Jun 6, 2025

Uh oh!

Fanaen commented Jun 9, 2025

Uh oh!

max-sixty commented Jun 9, 2025

Uh oh!

Fanaen commented Jun 10, 2025

Uh oh!

Fanaen commented Jun 11, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Fanaen commented Jun 5, 2025 •

edited

Loading

Explicit `append`: ✅ covered by this PR

Wildcard `append`: 🚫 not covered by this PR

Hybrid `append`: 🚫 not covered by this PR