General virtual columns support + row numbers as a first use-case #8715
alamb merged 64 commits into apache:main
Conversation
Co-authored-by: scovich <scovich@users.noreply.github.com>
…eature tests pass
```rust
fn skip_records(&mut self, num_records: usize) -> Result<usize> {
    // TODO: Use advance_by when it stabilizes to improve performance
```
TODO from original PR
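For context on the TODO above: `Iterator::advance_by` is still unstable, so stable code skips records by draining the iterator one element at a time. A minimal, self-contained sketch of that fallback (illustrative only, not the actual parquet-rs implementation; the function name and shape are assumptions):

```rust
// Sketch of record skipping via a plain iterator. `Iterator::advance_by`
// is unstable, so stable code counts `next()` calls instead, which is what
// the TODO in the diff refers to.
fn skip_records<I: Iterator>(iter: &mut I, num_records: usize) -> usize {
    let mut skipped = 0;
    while skipped < num_records {
        if iter.next().is_none() {
            break;
        }
        skipped += 1;
    }
    skipped // number of records actually skipped (may be < num_records at EOF)
}

fn main() {
    let mut it = 0..10;
    assert_eq!(skip_records(&mut it, 3), 3);
    assert_eq!(it.next(), Some(3));
    // Skipping past the end reports how many records were available.
    let mut short = 0..2;
    assert_eq!(skip_records(&mut short, 5), 2);
    println!("ok");
}
```

Once `advance_by` stabilizes, the loop body collapses to a single call that advances the iterator in bulk, which is the performance win the TODO anticipates.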
```rust
fn row_groups(&self) -> Box<dyn Iterator<Item = &RowGroupMetaData> + '_> {
    Box::new(std::iter::once(self.metadata.row_group(self.row_group_idx)))
```
this duplicates a lot, not sure if anything can be done here
parquet/src/arrow/schema/complex.rs (Outdated)
```rust
/// - If nullable: def_level = parent_def_level + 1
/// - If required: def_level = parent_def_level
/// - rep_level = parent_rep_level (virtual fields are not repeated)
fn convert_virtual_field(
```
The name used here is not aligned with what the other `convert_` functions do.
…hen metadata parsing may skip row groups
…ef/arrow-rs into feature/parquet-virtual-row-numbers
````rust
/// # Ok(())
/// # }
/// ```
pub fn with_virtual_columns(self, virtual_columns: Vec<FieldRef>) -> Self {
````
@vustef I think this is where we'd be able to detect row group filtering. There would be a with_row_group_selection() or some such function added to control skipping, and a check could be added both here and in the new function to disallow setting both.
Thanks for figuring this out. I guess there's no action that we can take right now then, please let me know if it's otherwise.
Also thanks to @jkylling who started this project

Once the CI is green I'll merge this PR. Thank you @vustef

gogoogogogogo!!!
The 57.1.0 patch release may be the most epic minor release we have ever had |
scovich left a comment
Post-merge drive by review
```rust
// Sort ranges by ordinal to maintain original row group order
ranges.sort_by_key(|(ordinal, _)| *ordinal);
```
I don't understand this part? The row groups were supplied in some particular order (by the row_groups iterator), and we're reordering by row group ordinal instead? Wouldn't that cause row number mismatches with other columns that continue reading in the original order? It seems like we actually need:

```rust
let selected_ordinals: HashMap<i16, usize> = row_groups
    .enumerate()
    .map(...)
    .collect::<Result<_>>()?;
```

and then ranges needs to use that enumeration ordinal (not the row group ordinal):

```rust
if let Some(i) = selected_ordinals.get(&ordinal) {
    ranges.push((i, ...));
}
```

... so that the sorted ranges match the original row_group iterator's order?
I messed this up when I started computing first row indexes... thank you for catching this. Will follow up shortly with tests and the fix.
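The fix scovich describes can be sketched with a small, self-contained example (names and types are illustrative, not the actual parquet-rs code): map each selected row-group ordinal to its position in the user-supplied iteration order, then sort ranges by that position rather than by the file ordinal.

```rust
use std::collections::HashMap;

// Illustrative sketch: sort each (ordinal, byte-range) entry by the position
// its ordinal holds in the user-requested order, not by the ordinal itself.
// Sorting by file ordinal would silently reorder row groups requested as
// e.g. [2, 0] into [0, 2], mismatching the other columns' read order.
fn order_ranges(
    requested: &[i16],
    ranges: Vec<(i16, std::ops::Range<u64>)>,
) -> Vec<(i16, std::ops::Range<u64>)> {
    // Position of each ordinal in the user-requested iteration order.
    let positions: HashMap<i16, usize> = requested
        .iter()
        .enumerate()
        .map(|(pos, &ord)| (ord, pos))
        .collect();
    let mut keyed: Vec<(usize, (i16, std::ops::Range<u64>))> = ranges
        .into_iter()
        .filter_map(|(ord, r)| positions.get(&ord).map(|&pos| (pos, (ord, r))))
        .collect();
    // Sort by request position, preserving the user's row group order.
    keyed.sort_by_key(|(pos, _)| *pos);
    keyed.into_iter().map(|(_, entry)| entry).collect()
}

fn main() {
    let requested = [2, 0];
    let ranges = vec![(0, 0..10u64), (2, 20..30u64)];
    let ordered = order_ranges(&requested, ranges);
    // Row group 2 comes first, matching the user-requested order.
    assert_eq!(ordered, vec![(2, 20..30), (0, 0..10)]);
    println!("ok");
}
```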
```rust
.extension_type_name()
.map(|name| name.starts_with(VIRTUAL_PREFIX!()))
.unwrap_or(false)
}
```
Suggested change:

```rust
.map_or(false, |name| name.starts_with(VIRTUAL_PREFIX!()))
```
```rust
if !is_virtual_column(field) {
    panic!(
```
not a fan of panics, but if we're going to panic why not just

```rust
assert!(
    is_virtual_column(field),
    "...",
    field.name()
);
```
Me neither, but it seemed like an unwritten rule that all these `with_` methods in `ArrowReaderOptions` return `Self` rather than a `Result`. Please comment in the new PR if I should change that behaviour.
@alamb also checking for your opinion on this.
Yes, the builder-like with_ functions don't give much opportunity for validity checking. We could probably use a proper ArrowReaderOptionsBuilder and do that kind of checking in build(). Or just change this one to a setter (so it's obvious it can't be chained) and return a Result<()>.
I think it is ok to return errors when trying to build options, like

```rust
let options = options.with_virtual_columns(cols)?;
```

If there is an error it is unlikely the code wants to continue configuring the options anyways.
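The shape under discussion can be sketched with a simplified, self-contained example (illustrative only; `Options` and the string-based validity check are stand-ins, not the actual `ArrowReaderOptions` API): a chainable `with_` method that validates its input and returns a `Result` instead of panicking, so `?` still composes in builder chains.

```rust
// Simplified stand-in for a builder-style options type.
#[derive(Debug, Default)]
struct Options {
    virtual_columns: Vec<String>,
}

impl Options {
    // Chainable setter that rejects invalid input with an error rather than
    // a panic. The real code validates a virtual-column extension type on
    // each Field; a string prefix is used here only for illustration.
    fn with_virtual_columns(mut self, cols: Vec<String>) -> Result<Self, String> {
        for c in &cols {
            if !c.starts_with("virtual.") {
                return Err(format!("{c} is not a virtual column"));
            }
        }
        self.virtual_columns = cols;
        Ok(self)
    }
}

fn main() -> Result<(), String> {
    // `?` keeps the builder chain readable even though the method is fallible.
    let opts = Options::default()
        .with_virtual_columns(vec!["virtual.row_number".to_string()])?;
    assert_eq!(opts.virtual_columns.len(), 1);
    // Invalid input surfaces as an Err instead of a panic.
    assert!(Options::default()
        .with_virtual_columns(vec!["plain".to_string()])
        .is_err());
    println!("ok");
    Ok(())
}
```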
# Which issue does this PR close?

Closes #8864.

# Rationale for this change

#8715 introduced the row numbers feature last week. However, it had a bug, which luckily for us @scovich pointed out soon after the merge. The issue is that the row numbers are produced in ordinal-based order of the row groups, instead of the user-requested order of row groups. The former is wrong, and is fixed here by switching to the user-requested order.

# What changes are included in this PR?

Just fixing the bug as explained above, and adding a test. Also addressing two small comments from the post-merge review: #8715 (review)

# Are these changes tested?

Yes.

# Are there any user-facing changes?

No, this wasn't released yet.

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
…ic on invalid input (#8867)

# Which issue does this PR close?

- Follow on to #8715
- Related to #8863

# Rationale for this change

Per #8715 (comment), @scovich rightly says:

> not a fan of panics

It is much better for a user-facing API to return an error on invalid input than panic.

# What changes are included in this PR?

1. Make `ArrowReaderOptions::with_virtual_columns` error rather than panic on invalid input
2. Update tests to match

# Are these changes tested?

Yes, by CI.

# Are there any user-facing changes?

While this is an API change, it was introduced in #8715 which has not yet been released. Therefore, we can make this change without breaking semver.
…rquet

It would be useful to expose the virtual columns of the arrow Parquet reader in the datasource-parquet `ParquetSource` added in apache/arrow-rs#8715. Then engines can use both DataFusion's partition value machinery and the virtual columns.

I made a go at it in this PR, but hit some rough edges. This is closer to an issue than a PR, but it is easier to explain with code.

The virtual columns we added are a bit difficult to integrate cleanly today. They are part of the physical schema of the Parquet reader, but cannot currently be projected. We need some additional handling to avoid predicate pushdown for virtual columns, to build the correct projection mask, and to build the correct stream schema. See the changes to `opener.rs` in this PR. One alternative would be to modify the arrow-rs implementation to remove these workarounds. Then the only change to `opener.rs` would be `.with_virtual_columns(virtual_columns.to_vec())?` (and maybe even that could be avoided? See the discussion below). What would be the best way forward here?

It is redundant that the user needs to specify both `Field::new("row_index", DataType::Int64, false).with_extension_type(RowNumber)` and add the column in a special way to the reader options with `.with_virtual_columns(virtual_columns.to_vec())?`. When the extension type `RowNumber` is added, we know that it is a virtual column.

All users of the `TableSchema/ParquetSource` must know that a schema is built out of three parts: the physical Parquet columns, the virtual columns and the partition columns. From a user perspective, the user would just like to supply a schema. One alternative is to only indicate the column kind using extension types, and the user only supplies a schema. That is, there would be an extension type indicating that a column is a partition column or virtual column, instead of the user supplying this information piecemeal. This may have a performance impact, as we would likely need to extract different extension type columns during planning, which could be problematic for large schemas.

Signed-off-by: Jonas Irgens Kylling <jkylling@gmail.com>
Upstream has a solution to this in arrow v57 as [apache#8715](apache#8715)
Based on #7307.
Which issue does this PR close?
Rationale for this change
We need row numbers for many of the downstream features, e.g. computing unique row identifier in iceberg.
What changes are included in this PR?
New API to get row numbers as a virtual column:
This column is defined as an extension type.
Parquet metadata is propagated to the array builder to compute first row indexes.
New Virtual column is included in addition to Primitive and Group.
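The first-row-index mechanics mentioned above can be sketched with a small, self-contained example (illustrative only; function names and structure are assumptions, not the actual parquet-rs implementation): prefix-sum the row-group row counts from the metadata to get each group's first row index, then emit absolute row numbers for the selected groups in the requested order.

```rust
// Illustrative sketch of computing a row-number virtual column from
// row-group metadata (not the actual parquet-rs code).

// Prefix-sum of row counts: the absolute index of each group's first row.
fn first_row_indexes(row_counts: &[i64]) -> Vec<i64> {
    let mut first = Vec::with_capacity(row_counts.len());
    let mut acc = 0;
    for &n in row_counts {
        first.push(acc);
        acc += n;
    }
    first
}

// Absolute row numbers for the selected row groups, in the requested order.
fn row_numbers(row_counts: &[i64], selected: &[usize]) -> Vec<i64> {
    let first = first_row_indexes(row_counts);
    selected
        .iter()
        .flat_map(|&g| first[g]..first[g] + row_counts[g])
        .collect()
}

fn main() {
    // Three row groups with 3, 2 and 4 rows respectively.
    let counts = [3, 2, 4];
    // Selecting groups [2, 0] yields their absolute row numbers, in that order.
    assert_eq!(row_numbers(&counts, &[2, 0]), vec![5, 6, 7, 8, 0, 1, 2]);
    println!("ok");
}
```

Note the row numbers follow the user-requested group order, which is exactly the ordering concern raised in the post-merge review and fixed in the follow-up PR.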
Are these changes tested?
Yes
Are there any user-facing changes?
This is a user-facing feature, and docstrings have been added.
No breaking changes, at least I tried not to introduce any, by creating a duplicate of a public method to add more parameters.