Add Fields abstraction (#3955) #3965
arrow-schema/src/fields.rs
I originally defined Fields = Vec<FieldPtr>
Whilst simple, the lack of a newtype made for a more convoluted migration. With a newtype we can define conversions such as From<Vec<Field>> to help reduce friction
I agree this is a better formulation than a typedef and will allow for more flexibility
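To make the newtype argument concrete, here is a minimal sketch of the shape being discussed. `Field` is a stand-in struct defined only for this example (it is not arrow's type), and the exact set of conversions is an assumption; it is not necessarily what the PR implements.

```rust
use std::ops::Deref;
use std::sync::Arc;

/// Stand-in for arrow's `Field`; only a name is modelled (assumption).
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
}

pub type FieldRef = Arc<Field>;

/// Newtype over a reference-counted slice of fields.
/// A plain type alias could not carry the `From` impls below.
#[derive(Debug, Clone, PartialEq)]
pub struct Fields(Arc<[FieldRef]>);

impl From<Vec<Field>> for Fields {
    fn from(fields: Vec<Field>) -> Self {
        // Wrap each owned Field in an Arc, then collect into the shared slice.
        Self(fields.into_iter().map(Arc::new).collect())
    }
}

impl From<Vec<FieldRef>> for Fields {
    fn from(fields: Vec<FieldRef>) -> Self {
        // The Arcs are moved as-is; no Field is copied.
        Self(fields.into())
    }
}

impl Deref for Fields {
    type Target = [FieldRef];

    fn deref(&self) -> &[FieldRef] {
        &self.0
    }
}
```

With these impls, call sites that previously passed a `Vec<Field>` only need a trailing `.into()`, which is the friction reduction mentioned above.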
arrow-schema/src/datatype.rs
A quick follow-up PR would then replace Box<Field> with FieldRef
Another follow-up could be done for Union, although that would also benefit from a Vec<(Field, i8)> instead of two separate vectors. I think Union is also currently the largest variant, which increases the size of every DataType value.
A slightly hacky improvement for union could also be to move the type_id into Field and leave it unused in most places. That should basically be free since Field already has a few bits of padding left.
Yes, I plan to do the other variants in a follow up. I think changing it to (Fields, Arc<[i8]>, UnionMode) may be sufficient and would keep things simple
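The size point can be checked with `std::mem::size_of`. The sketch below uses simplified stand-in types (`Field`, `UnionMode`, `WideType`, `NarrowType` are all hypothetical and much smaller than arrow's real `DataType`); it only illustrates that an enum is at least as large as its largest variant, and that two reference-counted slices take less inline space than two vectors.

```rust
use std::mem::size_of;
use std::sync::Arc;

/// Stand-in field type (assumption, not arrow's `Field`).
#[allow(dead_code)]
pub struct Field {
    name: String,
    nullable: bool,
}

#[allow(dead_code)]
pub enum UnionMode {
    Sparse,
    Dense,
}

/// Union payload held as two separate vectors (pointer, capacity, length each).
#[allow(dead_code)]
pub enum WideType {
    Int64,
    Union(Vec<Field>, Vec<i8>, UnionMode),
}

/// Union payload held as two reference-counted slices (pointer, length each).
#[allow(dead_code)]
pub enum NarrowType {
    Int64,
    Union(Arc<[Arc<Field>]>, Arc<[i8]>, UnionMode),
}

/// Returns the in-memory size of each enum, in bytes.
pub fn sizes() -> (usize, usize) {
    (size_of::<WideType>(), size_of::<NarrowType>())
}
```

Every value of the enum pays for the largest variant, including a plain `Int64`, so shrinking the Union payload shrinks all of them.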
arrow-schema/src/fields.rs
It is perhaps worth highlighting that this is implemented as a memmove; it cannot reuse the allocation
it cannot reuse the allocation
If the vector's allocation is oversized. I think it will reuse the allocation if the vector is at capacity (which is rare, though).
Sadly the implementation will always move regardless; I think it is some limitation of unsized coercion
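One way to observe the move is to compare the vector's buffer address with the address of the data inside the resulting `Arc<[T]>`. This is a small standalone sketch using only the standard library; the function name is made up for illustration. `Arc` keeps its reference counts in the same allocation as the data, so the elements end up in a fresh allocation even when the vector is exactly at capacity.

```rust
use std::sync::Arc;

/// Returns true if the `Arc<[u64]>` data lives at the same address as the
/// vector's original buffer, i.e. if the allocation was reused.
pub fn arc_reuses_vec_buffer(v: Vec<u64>) -> bool {
    // Record where the vector's elements live before converting.
    let vec_ptr = v.as_ptr();
    // `From<Vec<T>> for Arc<[T]>` copies the elements into a new allocation.
    let shared: Arc<[u64]> = Arc::from(v);
    // Only the addresses are compared; the old pointer is never dereferenced.
    std::ptr::eq(vec_ptr, shared.as_ptr())
}
```

Calling it with `vec![1, 2, 3]` (length equal to capacity) or with an oversized vector should report that the buffer was not reused in either case.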
FYI @alamb @crepererum @viirya I would appreciate your thoughts on this
arrow-schema/src/fields.rs
While constructing / modifying lists of fields, I think it would be great if we could also add functions like
/// Maybe something more generic to allow adding a Field and FieldRef
pub fn push(&mut self, field: Field...) {
...
}
Yeah, I think we can make SchemaBuilder public and add such a method to it; Fields itself is inherently immutable
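A rough sketch of how a mutable builder could sit in front of an immutable `Fields` follows. Everything here is illustrative: `Field` is a stand-in struct, and this `SchemaBuilder` with its `push` and `finish` signatures is an assumption about the shape such an API might take, not the actual arrow-schema implementation.

```rust
use std::sync::Arc;

/// Stand-in for arrow's `Field` (assumption).
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
}

pub type FieldRef = Arc<Field>;

/// Immutable, cheaply cloneable list of fields.
#[derive(Debug, Clone)]
pub struct Fields(Arc<[FieldRef]>);

impl Fields {
    pub fn len(&self) -> usize {
        self.0.len()
    }

    pub fn get(&self, index: usize) -> Option<&FieldRef> {
        self.0.get(index)
    }
}

/// Hypothetical builder that owns a growable `Vec` until `finish` is called.
#[derive(Debug, Default)]
pub struct SchemaBuilder {
    fields: Vec<FieldRef>,
}

impl SchemaBuilder {
    pub fn new() -> Self {
        Self::default()
    }

    /// Accepts either an owned `Field` or an existing `FieldRef`.
    pub fn push(&mut self, field: impl Into<FieldRef>) {
        self.fields.push(field.into());
    }

    /// Freezes the accumulated fields into an immutable `Fields`.
    pub fn finish(self) -> Fields {
        Fields(self.fields.into())
    }
}
```

Taking `impl Into<FieldRef>` covers both cases from the request above, since the standard library already provides `From<T> for Arc<T>` and every type converts into itself.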
arrow-schema/src/schema.rs
This I think is one of the most expensive operations in DataFusion planning now: apache/datafusion#5157 (comment)
So 👍
},
);

let iter = v.into_iter();
I hope to rework these once StructArray::new exists as part of #3880
}
}

impl From<RecordBatch> for StructArray {
This replaces the existing implementation in record_batch.rs with a more optimal implementation
}
}

impl From<RecordBatch> for StructArray {
Moved to struct_array.rs
self.iter().map(|field| field.size()).sum()
}

/// Searches for a field by name, returning it along with its index if found
This will be an obvious place to add hash-based lookup or similar
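For illustration, the two lookup strategies could look roughly like this. `Field` is a stand-in struct, and `find` and `build_index` are hypothetical free functions written for this sketch; they are not the methods in the PR. The index keeps the first occurrence of a duplicated name so that it agrees with the linear scan.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Stand-in for arrow's `Field` (assumption).
pub struct Field {
    pub name: String,
}

pub type FieldRef = Arc<Field>;

/// Linear scan: returns the first field with a matching name and its index.
pub fn find<'a>(fields: &'a [FieldRef], name: &str) -> Option<(usize, &'a FieldRef)> {
    fields
        .iter()
        .enumerate()
        .find(|(_, field)| field.name.as_str() == name)
}

/// Hypothetical hash-based index from field name to position.
/// On duplicate names the first position wins, matching `find`.
pub fn build_index(fields: &[FieldRef]) -> HashMap<&str, usize> {
    let mut index = HashMap::with_capacity(fields.len());
    for (position, field) in fields.iter().enumerate() {
        index.entry(field.name.as_str()).or_insert(position);
    }
    index
}
```

The index costs one pass and some memory up front, so it would mainly pay off for wide schemas that are searched repeatedly.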
  let struct_type =
- DataType::Struct(vec![Field::new("data", DataType::Int64, false)]);
+ DataType::Struct(vec![Field::new("data", DataType::Int64, false)].into());
into works like Fields::from here?
Yeah, I switched between the two to make rustfmt happy 😅
/// A cheaply cloneable, owned slice of [`FieldRef`]
///
/// Similar to `Arc<Vec<FieldPtr>>` or `Arc<[FieldPtr]>`
viirya left a comment
This abstraction looks good and datatype/schema manipulation can be more efficient.
Notified the mailing list about this - https://lists.apache.org/thread/pmxq5j864qlkp36lvxg8kvk0kct56r8m
alamb left a comment
Thank you @tustvold -- I think this looks in general very good.
My biggest concern is the amount of API churn that this will generate -- I think there may be a way to reduce the churn and make this PR smaller, and I left comments to that effect.
Once we sort it out and get this merged, I think we should then try (almost immediately) to upgrade some other project that makes significant use of arrow-rs to see how painful the upgrade is (and if there are other ergonomic things that could be done to ease the transition pain)
Thank you again for pushing this through
  Field::new(
  "c25",
- DataType::Struct(vec![
+ DataType::Struct(Fields::from(vec![
For anyone who uses DataType::Struct, this is now getting complicated to construct.
I wonder if we can ease the pain by having something like
impl DataType {
    fn new_struct(fields: impl Into<Fields>) -> Self {
        ..
    }
}
So then this could be
- DataType::Struct(Fields::from(vec![
+ DataType::new_struct(vec![
I'm not convinced by this. The major reason for using the more verbose DataType::Struct(Fields::from(..)) was to reduce formatting churn; most downstreams will just be able to use .into().
I'll have a go upgrading DataFusion to assess the churn required
I guess in my mind it is about the cognitive load. Now I need to know what a Fields is, import it, construct one, etc.
Maybe the magic into() will make it ok
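To compare the three spellings discussed in this thread side by side, here is a self-contained sketch. `Field`, `Fields` and `DataType` are cut-down stand-ins, and `new_struct` is the hypothetical helper from the suggestion above; none of this is arrow's actual code.

```rust
use std::sync::Arc;

/// Stand-in for arrow's `Field` (assumption).
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Fields(Arc<[Arc<Field>]>);

impl From<Vec<Field>> for Fields {
    fn from(fields: Vec<Field>) -> Self {
        Self(fields.into_iter().map(Arc::new).collect())
    }
}

/// Cut-down stand-in for arrow's `DataType`.
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Int64,
    Struct(Fields),
}

impl DataType {
    /// Hypothetical constructor from the review suggestion: the caller never
    /// has to name or import `Fields`.
    pub fn new_struct(fields: impl Into<Fields>) -> Self {
        DataType::Struct(fields.into())
    }
}

/// Builds the same struct type three ways.
pub fn three_spellings() -> (DataType, DataType, DataType) {
    let make = || vec![Field { name: "data".to_string() }];
    (
        DataType::Struct(Fields::from(make())), // explicit conversion
        DataType::Struct(make().into()),        // inferred via `Into`
        DataType::new_struct(make()),           // helper hides the conversion
    )
}
```

All three produce equal values; the difference is only in how much the caller needs to know about `Fields`.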
DataFusion upgrade PR - apache/datafusion#5782
There don't appear to be any objections to this, and there is plenty of time until the next release, so I am going to get this in before it develops merge conflicts. We can continue to iterate from there.
Which issue does this PR close?
Part of #3955
Rationale for this change
This adds a cheaply cloneable Fields abstraction, that internally contains FieldRef within a reference counted slice.
This achieves a couple of things:
- FieldRef allows projecting / reconstructing schema without needing to copy Field
- Arc<[FieldRef]> allows cheap cloning of DataType, construction of DataType::Struct, etc...
- ... Schema and DataType
What changes are included in this PR?
Are there any user-facing changes?
Yes, this makes a fundamental change to the schema representation
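The two properties named in the rationale (cheap cloning, and projection without copying `Field`) can be sketched as follows. `Field` is a stand-in struct and `project` is a hypothetical method written for this example; the point is only that cloning bumps a reference count and projecting clones `Arc` handles rather than field data.

```rust
use std::sync::Arc;

/// Stand-in for arrow's `Field` (assumption).
#[derive(Debug)]
pub struct Field {
    pub name: String,
}

pub type FieldRef = Arc<Field>;

/// Cloning this only increments the outer reference count.
#[derive(Debug, Clone)]
pub struct Fields(Arc<[FieldRef]>);

impl Fields {
    pub fn new(fields: Vec<Field>) -> Self {
        Self(fields.into_iter().map(Arc::new).collect())
    }

    pub fn as_slice(&self) -> &[FieldRef] {
        &self.0
    }

    /// Hypothetical projection: picks fields by index, sharing each `Field`.
    /// Returns `None` if any index is out of bounds.
    pub fn project(&self, indices: &[usize]) -> Option<Fields> {
        indices
            .iter()
            .map(|&i| self.0.get(i).cloned())
            .collect::<Option<Arc<[FieldRef]>>>()
            .map(Fields)
    }
}
```

A projected `Fields` holds a new slice of pointers, but every entry still points at the original `Field` allocation.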