compute/correction_v2: shrink Cursor to reduce per-split memory by antiguru · Pull Request #36477 · MaterializeInc/materialize

antiguru · 2026-05-08T15:34:38Z

Motivation

Cursor::advance_by in the MV sink correction buffer produces one cursor per distinct pre-advance timestamp <= since.
Total cursor memory therefore grows linearly with the number of distinct uncompacted timestamps in a chain.
At one self-managed customer running v26.10/v26.11 this contributed to a 1.1 TB allocation inside Cursor::advance_by and a cluster OOM (see CLU-77, database-issues#11198).

Description

Cursor previously owned a VecDeque<Rc<Chunk<D>>>, occupying 72 bytes on the stack plus a per-cursor heap buffer.
This PR replaces the owned VecDeque with a shared Rc<Vec<Rc<Chunk<D>>>> and tracks the cursor's range as a (pos, end) pair of (chunk: u32, offset: u32).
cursor.clone() becomes a single Rc::clone with no allocations; the stack footprint drops from 72 to 40 bytes and the per-cursor VecDeque heap buffer is gone.
The chunk-reuse fast path in try_unwrap is preserved by checking the strong counts of both the outer Rc<Vec<...>> and each inner Rc<Chunk>.
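
The layout described above can be sketched roughly as follows. This is an illustrative reduction, not the code in this PR: `Chunk` is a hypothetical stand-in, `Pos` is an invented name for the `(chunk: u32, offset: u32)` pair, and the real `Cursor` carries further fields that are omitted here, so the sketch is smaller than the 40 bytes quoted above. The `try_unwrap` shown ignores the cursor's range and only demonstrates the two-level strong-count check.

```rust
use std::rc::Rc;

/// Hypothetical stand-in for the real chunk type; only the sharing matters here.
struct Chunk<D> {
    data: Vec<D>,
}

/// Assumed name for the `(chunk: u32, offset: u32)` pair: which chunk, and where in it.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Pos {
    chunk: u32,
    offset: u32,
}

/// Sketch of the slimmed-down cursor: a shared chunk list plus a `(pos, end)` range.
/// The real `Cursor` has additional fields that are not shown.
struct Cursor<D> {
    chunks: Rc<Vec<Rc<Chunk<D>>>>,
    pos: Pos,
    end: Pos,
}

impl<D> Clone for Cursor<D> {
    /// Cloning bumps one reference count and copies two small positions; no allocation.
    fn clone(&self) -> Self {
        Cursor {
            chunks: Rc::clone(&self.chunks),
            pos: self.pos,
            end: self.end,
        }
    }
}

impl<D> Cursor<D> {
    /// Reclaims the chunks for reuse, but only if nothing else shares them:
    /// the outer `Rc` and every inner `Rc<Chunk>` must have a strong count of 1.
    /// Simplification: the `(pos, end)` range is ignored and all chunks are returned.
    fn try_unwrap(self) -> Result<Vec<Chunk<D>>, Self> {
        let unique = Rc::strong_count(&self.chunks) == 1
            && self.chunks.iter().all(|c| Rc::strong_count(c) == 1);
        if !unique {
            return Err(self);
        }
        let Cursor { chunks, .. } = self;
        let outer = Rc::try_unwrap(chunks)
            .unwrap_or_else(|_| unreachable!("outer strong count was checked"));
        Ok(outer
            .into_iter()
            .map(|c| {
                Rc::try_unwrap(c)
                    .unwrap_or_else(|_| unreachable!("inner strong count was checked"))
            })
            .collect())
    }
}
```

Checking only the outer `Rc` would not be enough in this sketch: an inner `Rc<Chunk>` can be shared independently of the list that holds it, in which case the chunk must not be handed out for reuse.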

This is a constant-factor mitigation.
The algorithmic property of producing one cursor per distinct timestamp is unchanged, so the underlying blowup remains and must be addressed separately at the compaction or backpressure layer.
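
To make the "constant factor" point concrete, here is a toy model of the cursor count. It is not the real `advance_by`: the function names are hypothetical, timestamps are assumed to be `u64`, and only the stack footprint is modeled (the removed per-cursor heap buffer is left out).

```rust
use std::collections::BTreeSet;

/// Counts the cursors a split would yield under the property described above:
/// one per distinct timestamp `<= since`. Models the count only.
fn cursors_after_advance(times: &[u64], since: u64) -> usize {
    times
        .iter()
        .copied()
        .filter(|t| *t <= since)
        .collect::<BTreeSet<u64>>()
        .len()
}

/// Stack bytes across all cursors, given a per-cursor footprint in bytes.
fn total_stack_bytes(cursors: usize, per_cursor: usize) -> usize {
    cursors * per_cursor
}
```

For any input, the ratio of `total_stack_bytes(n, 40)` to `total_stack_bytes(n, 72)` is the same 40/72, while `n` itself still grows with the number of distinct timestamps.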

Verification

A unit test (tests::advance_by_splits_per_distinct_time) exercises the per-distinct-time splitting and reports the per-cursor stack footprint. It catches regressions in cursor size and locks in the algorithmic property documented above.

Release notes

This PR adds the following user-facing behavior changes:

Reduce memory consumption in the MV sink correction buffer when materialized views accumulate many distinct uncompacted timestamps.

`Cursor::advance_by` produces one cursor per distinct pre-advance timestamp `<= since`, so per-cursor memory multiplies linearly with the number of distinct uncompacted timestamps in a chain. At one self-managed customer this contributed to a 1.1 TB allocation inside `Cursor::advance_by` and a cluster OOM (CLU-77, database-issues#11198).

Replace the owned `VecDeque<Rc<Chunk<D>>>` with a shared `Rc<Vec<Rc<Chunk<D>>>>` and track the cursor's range as a `(pos, end)` pair of `(chunk: u32, offset: u32)`. `cursor.clone()` becomes a single `Rc::clone` with no allocations; the stack footprint drops from 72 to 40 bytes. The chunk-reuse fast path in `try_unwrap` is preserved by checking the strong counts of both the outer `Rc<Vec<...>>` and each inner `Rc<Chunk>`.

This is a constant-factor mitigation. The algorithmic property of producing one cursor per distinct timestamp is unchanged, so the underlying blowup remains and must be addressed at the compaction layer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

antiguru wants to merge 1 commit into MaterializeInc:main from antiguru:clu-77-repro
