-
Notifications
You must be signed in to change notification settings - Fork 113
Description
This is a tracking issue for adding vectors to Vortex.
Related: #4492
RFC: TODO LINK (#4726)
Vectors
Since "vector" is a heavily overloaded term, we we define a vector in Vortex specifically as:
- A fully decompressed (which we call "Canonical" in Vortex) array of contiguous data
- A datatype that can hold itself as children (a vector can hold other child vectors)
- The data is recursively decompressed, which means all child vectors are also fully decompressed (this is different from the current canonical encodings, which allow compressed children)
- Vectors are either borrowed or owned (
VectorandVectorMut), mimicking the popularBytesandBytesMutfrom thebytescrate. - The owned version of vectors (
VectorMut) is fully mutatable
See RFC 0001 (TODO LINK) for more information on why we need vectors. To summarize, instead of potentially compressed Arrays being the intermediate representation (IR) between edges in our decompression plans (which is the model we use for our current compute functions), we want to enforce that fully decompressed arrays (which will be vectors) are the IR so that we can:
- Guarantee progress in decompression at every step (right now, compute functions can do whatever they want)
- Redefine
Arrays to represent logical query plans to decompress data. This allows us to perform better optimizations (pushdown and common subexpression elimination) if we know exactly what data is flowing between nodes in our decompression plan
Why not Arrow?
Copied and pasted from the RFC
There are a few reasons we are not directly using Arrow arrays:
- Vortex vectors are recursively built around variable-width primitives, allowing us to use the narrowest type possible
(e.g. we can narrow dictionary codes, run-end offsets, list view sizes, etc.). - Internally we use the
vortex-buffercrate, meaning we preserve our capabilities around runtime alignment. - We reserve the right to diverge from Arrow layouts in the future. For example, a
VarBinVectorcould use in-memory
pointers like DuckDB, rather than logical offsets. This makes the vectors cheaper to concatenate and slice, but
incurs a conversion cost to/from Arrow.
We may re-visit this decision in the future prior to finalizing the Vortex extension API.
Steps / History
- Feature: Add
vortex-vector#5010 - Fix: incorrect
FromIterator<bool>implementation #5056 - Feature: Add
IntoIterimplementation for vectors #5093 - Implement all
Vectorvariants (1 perDType)-
Null -
Bool -
Primitive -
Decimal -
VarBin/Binary -
List -
FixedSizeList -
Struct -
Extension
-
- Add conversion methods to Arrow (feature flags)
- Replace the current
ArrayBuilders withVectors - Refactor
vortex-scalarto be wrappers aroundVectorof length 1 (we could also potentially flip this and make the scalars the elements of vectors instead)
Chores
- Chore: Remove nullability from Vectors #5029
- Chore: Add more docs +
from_option_iter#5031 - Chore: add more vector constructors #5070
- Chore: Move stuff around in vectors #5072
- Chore: fix from_iter impls in vectors #5075
- Chore: un-ignore macro doc tests #5112
- Chore: add some more docs to varbin vector #5118
- Chore: Add
match_vector_pairmacro #5123 - Chore: clean up struct vector #5126
- Chore: Decimal vector cleanup #5165
- Chore: add docs and tests for decimal vectors #5185
- Chore: Reorganize documentation and modules for vectors #5186
- Chore: add immutable
validityfor mutable vectors #5188
Unresolved Questions
- N/A
Public API
// Borrowed Vector
pub enum Vector {
Null(NullVector),
Bool(BoolVector),
Primitive(PrimitiveVector),
// ...
}
// Example Vector
pub struct BoolVector {
bits: BitBuffer,
validity: Option<Mask>,
}
// Owned and mutable Vector
enum VectorMut {
Null(NullVectorMut),
Bool(BoolVectorMut),
Primitive(PrimitiveVectorMut),
// ...
}
impl VectorMut {
fn freeze(self) -> Vector { ... }
fn split_off(&mut self, at: usize) -> VectorMut { ... }
fn unsplit(&mut self, other: VectorMut) -> VectorMut { ... }
}All vectors will implement one of these two helper traits:
/// Common operations for immutable vectors (all the variants of [`Vector`]).
pub trait VectorOps: private::Sealed + Into<Vector> {
/// The mutable equivalent of this immutable vector.
type Mutable: VectorMutOps<Immutable = Self>;
/// Returns the number of elements in the vector, also referred to as its "length".
fn len(&self) -> usize;
/// Returns the validity mask of the vector, where `true` represents a _valid_ element and
/// `false` represents a `null` element.
fn validity(&self) -> &Mask;
/// Tries to convert `self` into a mutable vector (implementing [`VectorMutOps`]).
fn try_into_mut(self) -> Result<Self::Mutable, Self>
where
Self: Sized;
}
/// Common operations for mutable vectors (all the variants of [`VectorMut`]).
pub trait VectorMutOps: private::Sealed + Into<VectorMut> {
/// The immutable equivalent of this mutable vector.
type Immutable: VectorOps<Mutable = Self>;
/// Returns the number of elements in the vector, also referred to as its "length".
fn len(&self) -> usize;
/// Returns the total number of elements the vector can hold without reallocating.
fn capacity(&self) -> usize;
/// Reserves capacity for at least `additional` more elements to be inserted in the given
/// vector.
fn reserve(&mut self, additional: usize);
/// Extends the vector by appending elements from another vector.
fn extend_from_vector(&mut self, other: &Self::Immutable);
/// Appends `n` null elements to the vector (if it is nullable).
fn append_nulls(&mut self, n: usize);
/// Converts `self` into an immutable vector.
fn freeze(self) -> Self::Immutable;
/// Splits the vector into two at the given index.
fn split_off(&mut self, at: usize) -> Self;
/// Absorbs a mutable vector that was previously split off.
fn unsplit(&mut self, other: Self);
}