Skip to content

Tracking Issue: Vectors #5028

@connortsui20

Description

@connortsui20

This is a tracking issue for adding vectors to Vortex.

Related: #4492
RFC: TODO LINK (#4726)

Vectors

Since "vector" is a heavily overloaded term, we we define a vector in Vortex specifically as:

  • A fully decompressed (which we call "Canonical" in Vortex) array of contiguous data
  • A datatype that can hold itself as children (a vector can hold other child vectors)
  • The data is recursively decompressed, which means all child vectors are also fully decompressed (this is different from the current canonical encodings, which allow compressed children)
  • Vectors are either borrowed or owned (Vector and VectorMut), mimicking the popular Bytes and BytesMut from the bytes crate.
  • The owned version of vectors (VectorMut) is fully mutatable

See RFC 0001 (TODO LINK) for more information on why we need vectors. To summarize, instead of potentially compressed Arrays being the intermediate representation (IR) between edges in our decompression plans (which is the model we use for our current compute functions), we want to enforce that fully decompressed arrays (which will be vectors) are the IR so that we can:

  • Guarantee progress in decompression at every step (right now, compute functions can do whatever they want)
  • Redefine Arrays to represent logical query plans to decompress data. This allows us to perform better optimizations (pushdown and common subexpression elimination) if we know exactly what data is flowing between nodes in our decompression plan

Why not Arrow?

Copied and pasted from the RFC

There are a few reasons we are not directly using Arrow arrays:

  • Vortex vectors are recursively built around variable-width primitives, allowing us to use the narrowest type possible
    (e.g. we can narrow dictionary codes, run-end offsets, list view sizes, etc.).
  • Internally we use the vortex-buffer crate, meaning we preserve our capabilities around runtime alignment.
  • We reserve the right to diverge from Arrow layouts in the future. For example, a VarBinVector could use in-memory
    pointers like DuckDB, rather than logical offsets. This makes the vectors cheaper to concatenate and slice, but
    incurs a conversion cost to/from Arrow.

We may re-visit this decision in the future prior to finalizing the Vortex extension API.

Steps / History

Chores

Unresolved Questions

  • N/A

Public API

// Borrowed Vector
pub enum Vector {
    Null(NullVector),
    Bool(BoolVector),
    Primitive(PrimitiveVector),
    // ...
}

// Example Vector
pub struct BoolVector {
    bits: BitBuffer,
    validity: Option<Mask>,
}

// Owned and mutable Vector
enum VectorMut {
    Null(NullVectorMut),
    Bool(BoolVectorMut),
    Primitive(PrimitiveVectorMut),
    // ...
}

impl VectorMut {
    fn freeze(self) -> Vector { ... }
    fn split_off(&mut self, at: usize) -> VectorMut { ... }
    fn unsplit(&mut self, other: VectorMut) -> VectorMut { ... }
}

All vectors will implement one of these two helper traits:

/// Common operations for immutable vectors (all the variants of [`Vector`]).
pub trait VectorOps: private::Sealed + Into<Vector> {
    /// The mutable equivalent of this immutable vector.
    type Mutable: VectorMutOps<Immutable = Self>;

    /// Returns the number of elements in the vector, also referred to as its "length".
    fn len(&self) -> usize;

    /// Returns the validity mask of the vector, where `true` represents a _valid_ element and
    /// `false` represents a `null` element.
    fn validity(&self) -> &Mask;

    /// Tries to convert `self` into a mutable vector (implementing [`VectorMutOps`]).
    fn try_into_mut(self) -> Result<Self::Mutable, Self>
    where
        Self: Sized;
}

/// Common operations for mutable vectors (all the variants of [`VectorMut`]).
pub trait VectorMutOps: private::Sealed + Into<VectorMut> {
    /// The immutable equivalent of this mutable vector.
    type Immutable: VectorOps<Mutable = Self>;

    /// Returns the number of elements in the vector, also referred to as its "length".
    fn len(&self) -> usize;

    /// Returns the total number of elements the vector can hold without reallocating.
    fn capacity(&self) -> usize;

    /// Reserves capacity for at least `additional` more elements to be inserted in the given
    /// vector.
    fn reserve(&mut self, additional: usize);

    /// Extends the vector by appending elements from another vector.
    fn extend_from_vector(&mut self, other: &Self::Immutable);

    /// Appends `n` null elements to the vector (if it is nullable).
    fn append_nulls(&mut self, n: usize);

    /// Converts `self` into an immutable vector.
    fn freeze(self) -> Self::Immutable;

    /// Splits the vector into two at the given index.
    fn split_off(&mut self, at: usize) -> Self;

    /// Absorbs a mutable vector that was previously split off.
    fn unsplit(&mut self, other: Self);
}

Metadata

Metadata

Assignees

Labels

epicfeatureRelease label indicating a new feature or request

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions