compute/correction_v2: shrink Cursor to reduce per-split memory #36477

Draft
antiguru wants to merge 1 commit into MaterializeInc:main from antiguru:clu-77-repro

Conversation

Member

@antiguru antiguru commented May 8, 2026

Motivation

Cursor::advance_by in the MV sink correction buffer produces one cursor per distinct pre-advance timestamp <= since.
Per-cursor memory therefore multiplies linearly with the number of distinct uncompacted timestamps in a chain.
At one self-managed customer running v26.10/v26.11 this contributed to a 1.1 TB allocation inside Cursor::advance_by and a cluster OOM (see CLU-77, database-issues#11198).

Description

Cursor previously owned a VecDeque<Rc<Chunk<D>>> and was 72 B on the stack plus a per-cursor heap buffer.
This PR replaces the owned VecDeque with a shared Rc<Vec<Rc<Chunk<D>>>> and tracks the cursor's range as a (pos, end) pair, where each endpoint is a (chunk: u32, offset: u32) position.
cursor.clone() becomes a single Rc::clone with no allocations; the stack footprint drops from 72 to 40 bytes and the per-cursor VecDeque heap buffer is gone.
The chunk-reuse fast path in try_unwrap is preserved by checking the strong counts of both the outer Rc<Vec<...>> and each inner Rc<Chunk>.
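
A minimal sketch of the layout described above, for illustration only. `SharedCursor`, `Pos`, `sharers`, and the simplified `Chunk` are hypothetical stand-ins and not the code in this PR; the real cursor likely carries more state, and the full-range check in `try_unwrap` is an assumption of this sketch.

```rust
use std::rc::Rc;

/// Stand-in for the real chunk type, which this description does not show.
pub struct Chunk<D> {
    pub data: Vec<D>,
}

/// A position in the shared chunk list: chunk index plus offset inside that chunk.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Pos {
    pub chunk: u32,
    pub offset: u32,
}

/// Hypothetical cursor layout: a shared chunk list plus a half-open `[pos, end)` range.
pub struct SharedCursor<D> {
    chunks: Rc<Vec<Rc<Chunk<D>>>>,
    pos: Pos,
    end: Pos,
}

impl<D> Clone for SharedCursor<D> {
    /// Cloning bumps one reference count and copies two small positions.
    fn clone(&self) -> Self {
        SharedCursor { chunks: Rc::clone(&self.chunks), pos: self.pos, end: self.end }
    }
}

impl<D> SharedCursor<D> {
    /// Builds a cursor spanning every element of `chunks`.
    pub fn new(chunks: Vec<Chunk<D>>) -> Self {
        let end = Pos { chunk: chunks.len() as u32, offset: 0 };
        let list: Vec<Rc<Chunk<D>>> = chunks.into_iter().map(Rc::new).collect();
        SharedCursor { chunks: Rc::new(list), pos: Pos { chunk: 0, offset: 0 }, end }
    }

    /// Number of cursors (including this one) that share the chunk list.
    pub fn sharers(&self) -> usize {
        Rc::strong_count(&self.chunks)
    }

    /// Reclaims the chunks by value when nothing else references them;
    /// otherwise hands the cursor back so the caller can fall back to copying.
    pub fn try_unwrap(self) -> Result<Vec<Chunk<D>>, Self> {
        // Assumption of this sketch: only a cursor covering the whole list may reclaim it.
        let covers_all = self.pos == Pos { chunk: 0, offset: 0 }
            && self.end == Pos { chunk: self.chunks.len() as u32, offset: 0 };
        // Check the outer list and every inner chunk for sole ownership.
        let unique = Rc::strong_count(&self.chunks) == 1
            && self.chunks.iter().all(|c| Rc::strong_count(c) == 1);
        if !(covers_all && unique) {
            return Err(self);
        }
        let list = Rc::try_unwrap(self.chunks).ok().expect("outer count checked above");
        Ok(list
            .into_iter()
            .map(|c| Rc::try_unwrap(c).ok().expect("inner count checked above"))
            .collect())
    }
}
```

In this sketch, cloning never touches the chunk data or allocates, and `try_unwrap` only succeeds once every other clone has been dropped.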

This is a constant-factor mitigation.
The algorithmic property of producing one cursor per distinct timestamp is unchanged, so the underlying blowup remains and must be addressed separately at the compaction or backpressure layer.
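
To make "constant factor" concrete, here is a small arithmetic sketch using only the stack sizes quoted above (72 and 40 bytes). `stack_bytes` is a hypothetical helper, and the calculation ignores the removed per-cursor heap buffer, so it understates the actual saving.

```rust
/// Total stack bytes held by `cursors` cursors of `bytes_each` bytes.
/// Growth stays linear in the number of cursors for any per-cursor size.
pub fn stack_bytes(cursors: u64, bytes_each: u64) -> u64 {
    cursors * bytes_each
}
```

For one million distinct timestamps this gives 72,000,000 bytes before and 40,000,000 bytes after. Doubling the number of timestamps still doubles both figures.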

Verification

A unit test (tests::advance_by_splits_per_distinct_time) exercises the per-distinct-time splitting and reports the per-cursor stack footprint. It guards against size regressions and pins down the algorithmic property described above.
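
The property the test pins down can be modelled with a toy function. This is a sketch and not the test in this PR: `split_per_distinct_time` is a hypothetical helper, and it assumes the chain's timestamps are sorted ascending and that updates beyond `since` stay together in one trailing range.

```rust
use std::ops::Range;

/// Toy model of the splitting rule. `times` holds the timestamp of each
/// update in a chain, sorted ascending. Returns one index range per distinct
/// time `<= since`, plus one trailing range for any times beyond `since`.
pub fn split_per_distinct_time(times: &[u64], since: u64) -> Vec<Range<usize>> {
    let mut ranges = Vec::new();
    let mut start = 0;
    while start < times.len() && times[start] <= since {
        let t = times[start];
        // Extend the range over every update that shares this timestamp.
        let mut end = start + 1;
        while end < times.len() && times[end] == t {
            end += 1;
        }
        ranges.push(start..end);
        start = end;
    }
    // Updates not yet covered by `since` are left in a single remainder.
    if start < times.len() {
        ranges.push(start..times.len());
    }
    ranges
}
```

With times `[1, 1, 2, 3, 3, 3, 7, 9]` and `since = 3`, the model yields three ranges for the distinct times 1, 2, and 3, and one remainder range. The range count grows with the number of distinct timestamps, not with the number of updates.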

Release notes

This PR adds the following user-facing behavior changes:

  • Reduce memory consumption in the MV sink correction buffer when materialized views accumulate many distinct uncompacted timestamps.

`Cursor::advance_by` produces one cursor per distinct pre-advance
timestamp `<= since`, so per-cursor memory multiplies linearly with the
number of distinct uncompacted timestamps in a chain. At one
self-managed customer this contributed to a 1.1 TB allocation inside
`Cursor::advance_by` and a cluster OOM (CLU-77, database-issues#11198).

Replace the owned `VecDeque<Rc<Chunk<D>>>` with a shared
`Rc<Vec<Rc<Chunk<D>>>>` and track the cursor's range as a `(pos, end)`
pair of `(chunk: u32, offset: u32)`. `cursor.clone()` becomes a single
`Rc::clone` with no allocations; the stack footprint drops from 72 to
40 bytes. The chunk-reuse fast path in `try_unwrap` is preserved by
checking the strong counts of both the outer `Rc<Vec<...>>` and each
inner `Rc<Chunk>`.

This is a constant-factor mitigation. The algorithmic property of
producing one cursor per distinct timestamp is unchanged, so the
underlying blowup remains and must be addressed at the compaction
layer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>