Skip to content

arrow-util: canonicalize ranges decoded from Parquet#36499

Open
DAlperin wants to merge 1 commit into
mainfrom
claude/fix-parquet-range-canonicalization-LghhN
Open

arrow-util: canonicalize ranges decoded from Parquet#36499
DAlperin wants to merge 1 commit into
mainfrom
claude/fix-parquet-range-canonicalization-LghhN

Conversation

@DAlperin
Copy link
Copy Markdown
Member

The reader for range columns called push_range_with, which writes the range to the row verbatim. For discrete range types (int4/int8/date), Parquet files authored by external engines may encode ranges with non-canonical bounds — e.g. [1,10] for int4range, which MZ stores internally as [1,11). Without canonicalization those rows do not compare or hash equal to logically-identical values constructed inside MZ, so COPY FROM PARQUET rows mismatch their pure-SQL counterparts.

Switch the decode path to push_range, which canonicalizes the range before packing. Add a unit test that decodes a hand-built non-canonical arrow StructArray and a testdrive regression test that round-trips a DuckDB-authored Parquet file through COPY FROM ... (FORMAT PARQUET).

Fixes DB-55.

https://claude.ai/code/session_01KXioUsT2r5o2tRwrsHm4iR

The reader for range columns called `push_range_with`, which writes the
range to the row verbatim. For discrete range types (int4/int8/date),
Parquet files authored by external engines may encode ranges with
non-canonical bounds — e.g. `[1,10]` for `int4range`, which MZ stores
internally as `[1,11)`. Without canonicalization those rows do not
compare or hash equal to logically-identical values constructed inside
MZ, so `COPY FROM PARQUET` rows mismatch their pure-SQL counterparts.

Switch the decode path to `push_range`, which canonicalizes the range
before packing. Add a unit test that decodes a hand-built non-canonical
arrow StructArray and a testdrive regression test that round-trips a
DuckDB-authored Parquet file through `COPY FROM ... (FORMAT PARQUET)`.

Fixes DB-55.

https://claude.ai/code/session_01KXioUsT2r5o2tRwrsHm4iR
@DAlperin DAlperin requested a review from def- May 11, 2026 01:48
Copy link
Copy Markdown
Contributor

@def- def- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding a test, lgtm!

Comment thread test/iceberg/mzcompose.py
Comment on lines +287 to +293
key = _setup(c)

c.run_testdrive_files(
f"--var=s3-access-key={key}",
"--var=aws-endpoint=minio:9000",
"range-noncanonical.td",
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many workflows in test/iceberg do just this, we could put them all in a workflow_cdc instead that iterates over the files? But more of a test cleanup, not that important for this PR itself.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants