compute/correction_v2: shrink Cursor to reduce per-split memory#36477
Draft
antiguru wants to merge 1 commit into
Draft
compute/correction_v2: shrink Cursor to reduce per-split memory#36477antiguru wants to merge 1 commit into
antiguru wants to merge 1 commit into
Conversation
`Cursor::advance_by` produces one cursor per distinct pre-advance timestamp `<= since`, so per-cursor memory multiplies linearly with the number of distinct uncompacted timestamps in a chain. At one self-managed customer this contributed to a 1.1 TB allocation inside `Cursor::advance_by` and a cluster OOM (CLU-77, database-issues#11198). Replace the owned `VecDeque<Rc<Chunk<D>>>` with a shared `Rc<Vec<Rc<Chunk<D>>>>` and track the cursor's range as a `(pos, end)` pair of `(chunk: u32, offset: u32)`. `cursor.clone()` becomes a single `Rc::clone` with no allocations; the stack footprint drops from 72 to 40 bytes. The chunk-reuse fast path in `try_unwrap` is preserved by checking the strong counts of both the outer `Rc<Vec<...>>` and each inner `Rc<Chunk>`. This is a constant-factor mitigation. The algorithmic property of producing one cursor per distinct timestamp is unchanged, so the underlying blowup remains and must be addressed at the compaction layer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
Cursor::advance_byin the MV sink correction buffer produces one cursor per distinct pre-advance timestamp<= since.Per-cursor memory therefore multiplies linearly with the number of distinct uncompacted timestamps in a chain.
At one self-managed customer running v26.10/v26.11 this contributed to a 1.1 TB allocation inside
Cursor::advance_byand a cluster OOM (see CLU-77, database-issues#11198).Description
Cursorpreviously owned aVecDeque<Rc<Chunk<D>>>and was 72 B on the stack plus a per-cursor heap buffer.This PR replaces the owned
VecDequewith a sharedRc<Vec<Rc<Chunk<D>>>>and tracks the cursor's range as a(pos, end)pair of(chunk: u32, offset: u32).cursor.clone()becomes a singleRc::clonewith no allocations; the stack footprint drops from 72 to 40 bytes and the per-cursorVecDequeheap buffer is gone.The chunk-reuse fast path in
try_unwrapis preserved by checking the strong counts of both the outerRc<Vec<...>>and each innerRc<Chunk>.This is a constant-factor mitigation.
The algorithmic property of producing one cursor per distinct timestamp is unchanged, so the underlying blowup remains and must be addressed separately at the compaction or backpressure layer.
Verification
A unit test (
tests::advance_by_splits_per_distinct_time) exercises the per-distinct-time splitting and reports the per-cursor stack footprint, to catch regressions on the size and to lock in the algorithmic property documented above.Release notes
This PR adds the following user-facing behavior changes: