Skip to content

Comments

Add Fields abstraction (#3955)#3965

Merged
tustvold merged 6 commits intoapache:masterfrom
tustvold:add-fields
Mar 30, 2023
Merged

Add Fields abstraction (#3955)#3965
tustvold merged 6 commits intoapache:masterfrom
tustvold:add-fields

Conversation

@tustvold
Copy link
Contributor

@tustvold tustvold commented Mar 27, 2023

Which issue does this PR close?

Part of #3955

Rationale for this change

This adds a cheaply cloneable Fields abstraction, that internally contains FieldRef within a reference counted slice.

This achieves a couple of things:

  • The use of FieldRef allows projecting / reconstructing schema without needing to copy Field
  • The use of Arc<[FieldRef]> allows cheap cloning of DataType, construction of DataType::Struct, etc...
  • Allows cheap pointer comparisons between Schema and DataType
  • Provides a potential extension point for faster field name lookups

What changes are included in this PR?

Are there any user-facing changes?

Yes, this makes a fundamental change to the schema representation

@github-actions github-actions bot added the arrow Changes to the arrow crate label Mar 27, 2023
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally defined Fields = Vec<FieldPtr>

Whilst simple the lack of a newtype made for a more convoluted migration, with a newtype we can define conversions From<Vec<Field>>, etc... to help reduce friction

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this is a better formulation than a typedef and will allow for more flexibility

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A quick follow PR would then replace Box<Field> with FieldRef

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another followup could be done for Union, although that would also benefit from a Vec<(Field, i8)> instead of two separate vectors. I think that also currently makes it the largest variant, which increases the needed size of all datatypes.

A slightly hacky improvement for union could also be to move the type_id into Field and leave it unused in most places. That should basically be free since Field already has a few bits of padding left.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I plan to do the other variants in a follow up. I think changing it to (Fields, Arc<[i8]>, UnionMode) may be sufficient and would keep things simple

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is perhaps worth highlighting that this is implemented as a memove, it cannot reuse the allocation

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it cannot reuse the allocation

If the vector is allocation is oversized. I think it will reuse the allocation if the vector is at capacity (which is rare though).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly the implementation will always move regardless, I think it is some limitation of unsized coercion

@tustvold
Copy link
Contributor Author

FYI @alamb @crepererum @viirya I would appreciate your thoughts on this

Copy link
Contributor

@crepererum crepererum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it 👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it cannot reuse the allocation

If the vector is allocation is oversized. I think it will reuse the allocation if the vector is at capacity (which is rare though).

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also like it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this is a better formulation than a typedef and will allow for more flexibility

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While constructing / Modifying lists of fields, I think it would be great if we could also add functions like

/// Maybe something more generic to allow adding a Field and FieldREf
pub fn push(mut &self, field: Field...) {
...
}

Copy link
Contributor Author

@tustvold tustvold Mar 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think we can make SchemaBuilder public and add such a method to it, Fields itself is inherently immutable

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This I think is one of the most expensive operations in DataFusion planning now: apache/datafusion#5157 (comment)

So 👍

@github-actions github-actions bot added arrow-flight Changes to the arrow-flight crate parquet Changes to the parquet crate labels Mar 28, 2023
},
);

let iter = v.into_iter();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope to rework these once StructArray::new exists as part of #3880

}
}

impl From<RecordBatch> for StructArray {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This replaces the existing implementation in record_batch.rs with a more optimal implementation

}
}

impl From<RecordBatch> for StructArray {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to struct_array.rs

self.iter().map(|field| field.size()).sum()
}

/// Searches for a field by name, returning it along with its index if found
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be an obvious place to add hash based lookup or similar

@tustvold tustvold marked this pull request as ready for review March 28, 2023 18:37
Comment on lines 787 to +788
let struct_type =
DataType::Struct(vec![Field::new("data", DataType::Int64, false)]);
DataType::Struct(vec![Field::new("data", DataType::Int64, false)].into());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

into works like Fields::from here ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I switched between the two to make rustfmt happy 😅


/// A cheaply cloneable, owned slice of [`FieldRef`]
///
/// Similar to `Arc<Vec<FieldPtr>>` or `Arc<[FieldPtr]>`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FieldPtr? Do you mean FieldRef?

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This abstraction looks good and datatype/schema manipulation can be more efficient.

@tustvold
Copy link
Contributor Author

Notified the mailing list about this - https://lists.apache.org/thread/pmxq5j864qlkp36lvxg8kvk0kct56r8m

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @tustvold -- I think this looks in general very good.

My biggest concern is on the amount of API churn that this will generate -- I think there may be a way to reduce the churn and make this PR smaller, and I left comments to that effect.

Once we sort it out and get this merged, I think we should then try (almost immediately) to upgrade some other project that makes significant use of arrow-rs to see how painful the upgrade is (and if there are other ergonomic things that could be done to ease the transition pain)

Thank you again for pushing this through

Field::new(
"c25",
DataType::Struct(vec![
DataType::Struct(Fields::from(vec![
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For anyone who uses DataType::Struct this is now getting complicated to construct

I wonder if we can ease the pain by having something

impl DataType {
  fn new_struct(fields: impl Into<Fields>) -> Self {
 ..
}

So then this could be

Suggested change
DataType::Struct(Fields::from(vec![
DataType::new_struct(vec![

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced by this, the major reason for using the more verbose DataType::Struct(Fields::from(..)) was to reduce formatting churn, most downstreams will just be able to use .into().

I'll have a go upgrading DataFusion to assess the churn required

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess in my mind it is about the cognative load. Now I need to know what a Fields is, import it, construct one, etc.

Maybe the magic into() will make it ok

@tustvold
Copy link
Contributor Author

DataFusion upgrade PR - apache/datafusion#5782

@tustvold
Copy link
Contributor Author

There don't appear to be any objections to this, and there is plenty of time until the next release, and so I am going to get this in before it develops merge conflicts. We can continue to iterate from there

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate arrow-flight Changes to the arrow-flight crate parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants