raft: prevent snapshot disk exhaustion#987
Conversation
|
Warning Review limit reached
Next review available in: 4 minutes Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available. How can I continue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews. How do review limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window. Please refer docs for additional details. Review details⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (6)
📝 WalkthroughWalkthroughAdds raft engine lifecycle signaling and watcher wiring, and extends snapshot persistence and gRPC snapshot receive paths with prewrite cleanup, protection tracking, and related tests. ChangesEngine lifecycle and snapshot protection
Estimated code review effort: 4 (Complex) | ~45 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (3)
main.go (1)
516-516: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueLoop variable
runtimeshadows the stdlibruntimepackage.Line 417 in this same file already uses
rtfor the identical iteration pattern (for _, rt := range runtimes). Usingruntimehere is inconsistent and shadows the standard library package name.♻️ Suggested rename
- for _, runtime := range runtimes { + for _, rt := range runtimes { - if runtime == nil { + if rt == nil { continue } - engine := runtime.snapshotEngine() + engine := rt.snapshotEngine() lifecycle, ok := engine.(raftengine.Lifecycle) if !ok { continue } done := lifecycle.Done() if done == nil { continue } - groupID := runtime.spec.id + groupID := rt.spec.id🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@main.go` at line 516, The loop variable in the runtimes iteration shadows the stdlib runtime package and is inconsistent with the same pattern elsewhere in main.go. Rename the iterator variable in the affected loop to a non-conflicting name such as rt, matching the existing runtimes loop, and update any references inside that block accordingly.main_raft_lifecycle_test.go (1)
22-48: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueConsider consolidating into a table-driven test.
Both tests exercise
startRaftEngineLifecycleWatcherswith a similar setup shape (stub engine + runtimes slice + errgroup), differing mainly in trigger (close(engine.done)vscancel()) and expected outcome. A table-driven form with subtests would reduce duplication while keeping both scenarios covered.As per coding guidelines, "Unit tests should be co-located with packages in
*_test.gofiles; prefer table-driven test cases."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@main_raft_lifecycle_test.go` around lines 22 - 48, The two lifecycle watcher tests duplicate the same `startRaftEngineLifecycleWatchers` setup, so consolidate them into one table-driven test with subtests. Keep the shared arrangement around `lifecycleEngineStub`, `raftGroupRuntime`, and `errgroup.WithContext`, and vary only the trigger (`close(engine.done)` vs `cancel()`) and assertions for the failure/cancellation cases.Source: Coding guidelines
internal/raftengine/etcd/engine.go (1)
2796-2801: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick winPrepare failures are only logged, not surfaced anywhere else.
Both call sites swallow
prepareFSMSnapshotWriteerrors with aslog.Warnand continue writing the new snapshot regardless. Since this PR's stated goal is preventing snapshot disk exhaustion, a persistently failing cleanup pass (e.g., permission or filesystem errors) would silently degrade back to the original unbounded-growth behavior with only a log line as a signal. Consider incrementing a metric/counter on failure so operators can alert on it.Also applies to: 4224-4229
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/raftengine/etcd/engine.go` around lines 2796 - 2801, The prepareFSMSnapshotWrite failure is only logged in the snapshot write path and never surfaced to operators. Update the call sites that handle prepareFSMSnapshotWrite so a failure also increments an existing or new metric/counter for prepare failures, using the surrounding snapshot write logic in engine.go to locate both occurrences; keep the Warn log, but add the observable failure signal so repeated cleanup errors can be alerted on.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/raftengine/etcd/fsm_snapshot_file.go`:
- Around line 576-590: The snapshot retention logic in
collectPrewriteSnapCandidates and the purge loop in the prewrite snapshot
cleanup path currently keeps the newest .snap entries without verifying that
each snapshot is restorable. Update the keep-set logic around
collectPrewriteSnapCandidates/sort.Slice/purgeSnapPair to preserve a verified
.snap/.fsm pair (or otherwise build candidates from restorable snapshots only)
so a missing .fsm on the newest token snapshot cannot cause older valid pairs to
be deleted. Add a regression test covering the case where the newest .snap
exists but its .fsm payload is missing, and ensure the kept snapshot remains
restartable.
In `@internal/raftengine/etcd/grpc_transport.go`:
- Around line 752-755: The receive-side cleanup is happening too late in the
snapshot flow, allowing spooling under fsmSnapDir to fail before stale FSM files
are removed. Move the preparation/cleanup step from finalization into the early
snapshot receive path in grpc_transport.go so it runs as soon as snapshot
metadata/index is available, before drainSnapshotChunks starts writing payload
bytes; use the existing prepareReceivedFSMSnapshotWrite logic and the
surrounding receive helpers to wire this in.
---
Nitpick comments:
In `@internal/raftengine/etcd/engine.go`:
- Around line 2796-2801: The prepareFSMSnapshotWrite failure is only logged in
the snapshot write path and never surfaced to operators. Update the call sites
that handle prepareFSMSnapshotWrite so a failure also increments an existing or
new metric/counter for prepare failures, using the surrounding snapshot write
logic in engine.go to locate both occurrences; keep the Warn log, but add the
observable failure signal so repeated cleanup errors can be alerted on.
In `@main_raft_lifecycle_test.go`:
- Around line 22-48: The two lifecycle watcher tests duplicate the same
`startRaftEngineLifecycleWatchers` setup, so consolidate them into one
table-driven test with subtests. Keep the shared arrangement around
`lifecycleEngineStub`, `raftGroupRuntime`, and `errgroup.WithContext`, and vary
only the trigger (`close(engine.done)` vs `cancel()`) and assertions for the
failure/cancellation cases.
In `@main.go`:
- Line 516: The loop variable in the runtimes iteration shadows the stdlib
runtime package and is inconsistent with the same pattern elsewhere in main.go.
Rename the iterator variable in the affected loop to a non-conflicting name such
as rt, matching the existing runtimes loop, and update any references inside
that block accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d020f13b-925a-4a41-8015-336cb27004c5
📒 Files selected for processing (7)
internal/raftengine/engine.gointernal/raftengine/etcd/engine.gointernal/raftengine/etcd/fsm_snapshot_file.gointernal/raftengine/etcd/fsm_snapshot_file_test.gointernal/raftengine/etcd/grpc_transport.gomain.gomain_raft_lifecycle_test.go
|
Warning Gemini encountered an error creating the review. You can try again by commenting |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b0ca6c8370
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| } | ||
|
|
||
| var combined error | ||
| combined = errors.CombineErrors(combined, removeFSMTmpOrphans(fsmSnapDir)) |
There was a problem hiding this comment.
Avoid deleting in-flight FSM snapshot temp files
When this helper runs at runtime (for example while receiving a snapshot) it removes every *.fsm.tmp, but local snapshot writes create their active temp files with that same suffix in writeFSMSnapshotFile; persistLocalSnapshot releases snapshotMu after its prepare step before doing the write. If another prepare runs during that write, it can unlink the active temp file, letting the write continue on an unlinked fd and then fail the final rename with ENOENT, which propagates as a snapshot result error and stops the engine. Restrict this cleanup to startup or ensure active snapshot temp files cannot be removed.
Useful? React with 👍 / 👎.
| } | ||
| index := metadata.Snapshot.Metadata.Index | ||
| if fsmSnapDir != "" && index > 0 { | ||
| prepareReceivedFSMSnapshotWrite(fsmSnapDir, index, prepareFn) |
There was a problem hiding this comment.
Free space before spooling received snapshots
On the receive path this prepare call does not run until the final chunk is processed, but the spool has already been written into fsmSnapDir by then. If a follower is near-full because old snapshot pairs are occupying the directory, the stream can hit ENOSPC while writing chunks before this line is reached, so the old files are never purged and raft will keep retrying the same failing snapshot transfer. Run the prewrite purge before creating/draining the spool when fsmSnapDir is set.
Useful? React with 👍 / 👎.
| if err != nil || idx >= nextIndex || liveIndexes[idx] { | ||
| continue | ||
| } | ||
| removeWithWarn(filepath.Join(fsmSnapDir, e.Name()), "orphan fsm snapshot") |
There was a problem hiding this comment.
Preserve FSM files until their raft snap is durable
This treats any .fsm without a matching file in snapDir as removable during runtime prepares, but received snapshots create the .fsm before the engine later persists the corresponding .snap (persistReady applies the snapshot before calling persistReadyToWAL). If another snapshot prepare with a higher index runs in that window, it can delete the just-applied FSM file as an “orphan”; the subsequent SaveSnap then durably references a missing payload, breaking restart or future snapshot sends. Only do this orphan sweep at startup, or hold the same lock through .snap persistence.
Useful? React with 👍 / 👎.
|
@codex review |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
internal/raftengine/etcd/fsm_snapshot_file.go (2)
571-593: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick winReclaim orphan FSM artifacts during prewrite cleanup.
collectPrewriteSnapCandidatesis driven only by.snapfiles, so stale.fsmpayloads without a token and.fsm.tmpleftovers are never reclaimed before the next write. Those files can be snapshot-sized and still cause the disk-exhaustion failure this PR is targeting; re-add scoped orphan cleanup before syncing/writing.Also applies to: 609-625
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/raftengine/etcd/fsm_snapshot_file.go` around lines 571 - 593, The prewrite cleanup in collectPrewriteSnapCandidates/purgeSnapPair only handles tokenized snapshot pairs, so orphan .fsm and .fsm.tmp artifacts can survive and keep consuming disk. Add a scoped orphan cleanup step in the prewrite path before syncing/writing that scans the FSM snapshot directory for unpaired stale payloads and removes them alongside the existing candidate purge logic, reusing the same cleanup flow used by purgeSnapPair where possible.
596-633: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick winValidate FSM files before treating them as restorable.
hasFSMcurrently means “path exists,” butkeepRestorablePrewriteSnapshotsuses it as “safe to restart from.” A corrupt/truncated newest.fsmcan still evict older valid pairs; use the existing FSM verification path or keep an additional verified pair.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/raftengine/etcd/fsm_snapshot_file.go` around lines 596 - 633, The restorable prewrite snapshot selection in keepRestorablePrewriteSnapshots/collectPrewriteSnapCandidates is treating any existing FSM file as valid, which can let a corrupt or truncated newest .fsm displace older usable pairs. Update the candidate tracking to use a verified FSM check instead of just fsmSnapshotFileExists, ideally by reusing the existing FSM validation path, or retain an additional verified snapshot pair when choosing what to keep. Ensure the logic in keepRestorablePrewriteSnapshots still prefers the newest valid candidates but only marks snapshots as restorable when their FSM has been validated.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/raftengine/etcd/grpc_transport.go`:
- Around line 828-839: The snapshot chunk handling in append/receive flow
currently allows payload bytes to be written before metadata has been observed,
which breaks the prewrite guarantee. Update the chunk-processing logic around
appendSnapshotChunkMetadata, maybePrepareReceivedFSMSnapshotWrite, and
writeSnapshotChunkPayload to reject any payload-before-metadata chunk with an
error before spooling bytes. Ensure the guard runs before preparedFSMWrite can
be skipped, so cleanup/preparation always happens before payload writes.
---
Outside diff comments:
In `@internal/raftengine/etcd/fsm_snapshot_file.go`:
- Around line 571-593: The prewrite cleanup in
collectPrewriteSnapCandidates/purgeSnapPair only handles tokenized snapshot
pairs, so orphan .fsm and .fsm.tmp artifacts can survive and keep consuming
disk. Add a scoped orphan cleanup step in the prewrite path before
syncing/writing that scans the FSM snapshot directory for unpaired stale
payloads and removes them alongside the existing candidate purge logic, reusing
the same cleanup flow used by purgeSnapPair where possible.
- Around line 596-633: The restorable prewrite snapshot selection in
keepRestorablePrewriteSnapshots/collectPrewriteSnapCandidates is treating any
existing FSM file as valid, which can let a corrupt or truncated newest .fsm
displace older usable pairs. Update the candidate tracking to use a verified FSM
check instead of just fsmSnapshotFileExists, ideally by reusing the existing FSM
validation path, or retain an additional verified snapshot pair when choosing
what to keep. Ensure the logic in keepRestorablePrewriteSnapshots still prefers
the newest valid candidates but only marks snapshots as restorable when their
FSM has been validated.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 5a82a1ac-05d0-47f3-ac4f-c8ebbdfb50e5
📒 Files selected for processing (5)
internal/raftengine/etcd/fsm_snapshot_file.gointernal/raftengine/etcd/fsm_snapshot_file_test.gointernal/raftengine/etcd/grpc_transport.gointernal/raftengine/etcd/grpc_transport_test.gomain.go
🚧 Files skipped from review as they are similar to previous changes (1)
- main.go
|
@codex review |
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2c1c553322
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if err != nil { | ||
| return errors.WithStack(err) | ||
| } else if liveIndexes != nil { | ||
| return removeStaleFSMFilesBelowIndex(fsmSnapDir, liveIndexes, retention.restorableFloor) |
There was a problem hiding this comment.
Remove stale orphans up to the incoming snapshot
When the retained restorable snapshot is older than the snapshot being prepared, this only deletes orphan .fsm files below restorableFloor. A stale unreferenced payload at an index between the kept pair and nextIndex (for example keep 200, prepare 400, orphan 300 left by a failed receive) survives the prewrite pass, so the file most likely consuming space is not freed and the following snapshot write can still fail with ENOSPC. The live-index check already prevents deleting files with surviving .snap tokens; stale orphans below the snapshot being written need to be eligible for cleanup rather than capped at the kept snapshot index.
Useful? React with 👍 / 👎.
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bbf7185875
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if err != nil { | ||
| return errors.WithStack(err) | ||
| } else if liveIndexes != nil { | ||
| return removeStaleFSMFilesBelowIndex(fsmSnapDir, liveIndexes, nextIndex) |
There was a problem hiding this comment.
Preserve in-flight received FSM snapshots
When this prewrite cleanup runs after a streaming snapshot has been finalized but before the raft loop applies it, the new .fsm has no matching .snap yet: receiveSnapshotStream renames the spool to <index>.fsm, while handleTransportMessage only enqueues the MsgSnap on stepCh and the .snap token is not written until later Ready processing. If another received/local snapshot prepares a higher nextIndex in that window, collectLiveSnapIndexes will not include the queued snapshot and this call removes its .fsm as an orphan, so the queued restore later fails with a missing FSM file and shuts down the engine. Please exclude in-flight receive indexes or defer orphan cleanup until after apply/startup.
Useful? React with 👍 / 👎.
|
@codex review |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/raftengine/etcd/grpc_transport.go (1)
904-911: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick winProtect the received FSM index before publishing the final
.fsmpath.Line 905 makes
<index>.fsmvisible before Line 910 registers protection. A concurrent prewrite cleanup can see that file as an unprotected orphan and remove it before the engine consumes the token. Register protection beforeFinalizeAsFSMFile, and unprotect if finalization fails.Suggested direction
- if err := spool.FinalizeAsFSMFile(fsmSnapDir, index, crc32c); err != nil { + protected := false + if protectFn != nil { + protectFn(index) + protected = true + } + if err := spool.FinalizeAsFSMFile(fsmSnapDir, index, crc32c); err != nil { + if protected && unprotectFn != nil { + unprotectFn(index) + } return raftpb.Message{}, err } metadata.Snapshot.Data = encodeSnapshotToken(index, crc32c) - if protectFn != nil { - protectFn(index) - }This requires threading
unprotectFnalongsideprotectFnthrough the receive/finalize helpers.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/raftengine/etcd/grpc_transport.go` around lines 904 - 911, The FSM finalize flow in grpc_transport.go publishes the <index>.fsm path before the index is marked protected, which can race with cleanup. Update the receive/finalize helpers around FinalizeAsFSMFile to register protection with protectFn before making the file visible, and add an unprotectFn path to roll back the protection if finalization fails. Ensure the change is wired through the related helper methods that handle the FSM snapshot receive and finalize sequence.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/raftengine/etcd/engine.go`:
- Around line 2820-2830: The protectReceivedFSMSnapshot fast-path can race with
snapshot release and leave a stale entry in protectedReceivedFSMSnaps. Update
protectReceivedFSMSnapshot to recheck appliedIndex after acquiring snapshotMu,
before initializing or incrementing the map, and return if the snapshot index is
no longer ahead of the current applied index. Use the existing
protectReceivedFSMSnapshot method and protectedReceivedFSMSnaps map to keep the
fix localized.
---
Outside diff comments:
In `@internal/raftengine/etcd/grpc_transport.go`:
- Around line 904-911: The FSM finalize flow in grpc_transport.go publishes the
<index>.fsm path before the index is marked protected, which can race with
cleanup. Update the receive/finalize helpers around FinalizeAsFSMFile to
register protection with protectFn before making the file visible, and add an
unprotectFn path to roll back the protection if finalization fails. Ensure the
change is wired through the related helper methods that handle the FSM snapshot
receive and finalize sequence.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: c2e4058f-38e1-454c-be96-19fa04944a2d
📒 Files selected for processing (6)
internal/raftengine/etcd/engine.gointernal/raftengine/etcd/engine_applied_index_test.gointernal/raftengine/etcd/fsm_snapshot_file.gointernal/raftengine/etcd/fsm_snapshot_file_test.gointernal/raftengine/etcd/grpc_transport.gointernal/raftengine/etcd/grpc_transport_test.go
There was a problem hiding this comment.
💡 Codex Review
When a higher-index snapshot prepare runs in the gap after FinalizeAsFSMFile renames the received N.fsm but before protectFn records N, prepareFSMSnapshotWriteProtected sees the new file as an unprotected orphan because no .snap token exists yet and can delete it. The receiver then enqueues/applies a snapshot token that points at a missing .fsm, so restore fails and the raft engine can stop; the protection needs to cover the file before it is visible to prewrite cleanup, or the rename/protect sequence must share the cleanup lock.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| spoolPlacement, fsmSnapDir, prepareFn, protectFn := t.snapshotSpoolPlacement() | ||
| spool, err := newSnapshotSpool(spoolPlacement) |
There was a problem hiding this comment.
Prepare received snapshot cleanup before creating the spool
When fsmSnapDir is already full enough that creating a new temp file needs an inode or directory block, newSnapshotSpool returns before the newly added prewrite cleanup in drainSnapshotChunks ever gets a chance to delete older snapshots. That leaves the receive path still unable to recover from exactly the disk-exhaustion state this change is meant to handle; read the metadata/prepare first (buffering the first chunk if needed) or run cleanup before creating the spool file.
Useful? React with 👍 / 👎.
| } | ||
| metadata.Snapshot.Data = encodeSnapshotToken(index, crc32c) | ||
| if protectFn != nil { | ||
| protectFn(index) |
There was a problem hiding this comment.
Release protection for snapshots raft ignores
When a received snapshot is accepted by the transport but later ignored by raft (for example, the follower catches up before the queued MsgSnap is stepped, or raft fast-forwards because it already has the snapshot index/term), no Ready.Snapshot is persisted and the new release calls never run. Because the success path keeps the .fsm protected indefinitely, those orphan files are skipped by later prewrite cleanup and can accumulate despite this disk-exhaustion fix.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: af915e1ca0
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| metadata.Snapshot.Data = encodeSnapshotToken(index, crc32c) | ||
| if protectFn != nil { | ||
| protectFn(index) |
There was a problem hiding this comment.
Reserve received snapshots before finalizing them
Fresh evidence in this patch is that protectFn(index) runs only after FinalizeAsFSMFile has already renamed the spool into <index>.fsm. When two snapshot receives overlap (or a receive overlaps a local snapshot), a prepare for a higher index can acquire snapshotMu in that gap, see no .snap and no protected index for the just-finalized file, and delete it as an orphan; the first receive then enqueues a token pointing at a missing file and restore fails. Reserve/protect the index before or while publishing the .fsm, and unprotect it on finalize/handle failure.
Useful? React with 👍 / 👎.
| if candidates[i].restorable { | ||
| retention.keep[candidates[i].name] = true |
There was a problem hiding this comment.
Keep a WAL-valid snapshot as the prewrite fallback
When a node has a newer .snap left by a crash after persistReadyToWAL's SaveSnap but before its WAL Save, this retention code can keep that newer file because its .fsm passes CRC, then purge the older snapshot that loadPersistedSnapshot can actually load via wal.ValidSnapshotEntries. If the incoming/local snapshot write or persist then fails, restart has neither the new snapshot nor the older WAL-valid fallback, so filter retention by WAL-valid snapshots or keep at least one WAL-referenced pair.
Useful? React with 👍 / 👎.
| } | ||
|
|
||
| func (e *Engine) protectReceivedFSMSnapshot(index uint64) { | ||
| if index == 0 || index <= e.appliedIndex.Load() { |
There was a problem hiding this comment.
Release protections for snapshots raft ignores
When a streamed snapshot loses a race to log catch-up, its index can still be above appliedIndex here but already at or below raft's committed index, so raft ignores the MsgSnap and no Ready snapshot is ever persisted. The file was already finalized and protected, and normal committed-entry apply never calls releaseProtectedReceivedFSMSnapshotsUpToLocked, so that orphan remains protected from later prewrite cleanup and can keep disk full indefinitely; either avoid protecting snapshots raft will ignore or unprotect when Step does not produce a snapshot restore.
Useful? React with 👍 / 👎.
|
@codex review Addressed the latest snapshot lifecycle findings on fadd997:
Validation:
|
|
@codex review Addressed the latest P1 findings on aebbbc2:
Validation:
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: aebbbc2185
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if e.unprotectReceivedFSMSnapshotTokenIfCommitted(msg) { | ||
| return |
There was a problem hiding this comment.
Keep accepted snapshot files protected until Ready applies
When a MsgSnap is accepted, etcd/raft advances Status().Commit to the snapshot index inside Step before this engine drains the resulting Ready, so this branch releases the protection for the just-finalized .fsm while applyReadySnapshotLocked still needs to open it. If another receive/local snapshot prewrite runs in that window with a higher index, it sees no live .snap yet and no protected index, can remove this .fsm as an orphan, and the subsequent restore fails with a missing snapshot file; compare against the pre-Step commit or wait for applied/Ready processing before unprotecting.
Useful? React with 👍 / 👎.
| if retention.restorableFloor > 0 { | ||
| liveIndexes, err := collectLiveSnapIndexes(snapDir) | ||
| if err != nil { | ||
| return errors.WithStack(err) | ||
| } else if liveIndexes != nil { | ||
| return removeStaleFSMFilesBelowIndex(fsmSnapDir, liveIndexes, protectedIndexes, nextIndex) |
There was a problem hiding this comment.
Drop orphan FSM files even without a retained snapshot
When this is the first snapshot attempt on a node, or all existing .snap candidates are non-restorable, restorableFloor stays zero and this guard skips orphan cleanup entirely. A large unpaired .fsm left by a failed receive before the next local snapshot can therefore remain on disk and still cause the prewrite to hit ENOSPC, even though removeStaleFSMFilesBelowIndex already filters out live and protected indexes and could safely reclaim those orphans.
Useful? React with 👍 / 👎.
|
@codex review Addressed the latest review on ea55b71:
Validation:
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ea55b717c9
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if !e.rawNode.HasReady() { | ||
| e.unprotectReceivedFSMSnapshotToken(msg) | ||
| return | ||
| } | ||
| e.unprotectReceivedFSMSnapshotTokenIfApplied(msg) |
There was a problem hiding this comment.
Unprotect ignored snapshots that still produce a response Ready
When etcd/raft ignores a MsgSnap with an index above the current commit, for example because the snapshot ConfState no longer contains this node, it still queues a MsgAppResp, so HasReady() is true even though no snapshot will be applied. This branch then skips the unconditional unprotect, appliedIndex never reaches the token index, and the finalized .fsm remains in protectedReceivedFSMSnaps, preventing later prewrite cleanup from reclaiming the orphaned file.
Useful? React with 👍 / 👎.
| if err := lifecycle.Err(); err != nil { | ||
| return errors.Wrapf(err, "raft group %d engine stopped", groupID) | ||
| } | ||
| return errors.Errorf("raft group %d engine stopped", groupID) |
There was a problem hiding this comment.
Don't fail the process on intentional group removal
When this process is intentionally removed from a Raft group, the etcd engine calls requestShutdown() from the remove-conf-change apply path and exits with Err()==nil; this watcher now turns that normal per-group shutdown into a non-nil errgroup error, which cancels the whole server and marks it as failed to serve. In multi-group deployments or decommission flows, removing one local group member should not be reported as a fatal engine failure for every other group hosted by the process.
Useful? React with 👍 / 👎.
| indexes := make(map[uint64]bool, len(walSnaps)) | ||
| for _, snap := range walSnaps { | ||
| if snap.Index > 0 { | ||
| indexes[snap.Index] = true | ||
| } | ||
| } |
There was a problem hiding this comment.
Preserve the snapshot term when filtering WAL-valid pairs
This WAL-valid set records only indexes, but etcd's snapshot loader accepts a snapshot only when both term and index match a WAL snapshot entry. If the snap directory contains stale files with the same index but a different term, prewrite cleanup can mark the stale pair as walValid, keep it as the fallback, and purge the actual term-matching pair; the next restart then cannot load the retained .snap even though cleanup intended to leave a restartable fallback.
Useful? React with 👍 / 👎.
|
@codex review Addressed the latest P2 findings on 7f884f4:
Validation:
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7f884f4ea8
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| candidates = append(candidates, snapFileCandidate{ | ||
| name: e.Name(), | ||
| index: index, | ||
| restorable: fsmSnapshotFileRestorable(fsmSnapDir, index), |
There was a problem hiding this comment.
Verify the retained .fsm against the snap token
In a same-index snapshot case, this marks every .snap at that index as restorable whenever <index>.fsm is internally well-formed, because fsmSnapshotFileRestorable passes tokenCRC=0 and never checks the CRC embedded in that particular .snap's token. If a retry or another node produced a different valid payload for the same index, prewrite cleanup can keep the WAL-valid .snap whose token does not match the shared .fsm and delete the older matching fallback; the next restart then fails in restoreSnapshotStateFromToken with ErrFSMSnapshotTokenCRC. Please read the candidate snap token and verify the .fsm against that CRC before treating the pair as restorable.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/raftengine/etcd/fsm_snapshot_file.go (1)
680-697: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick winWAL-preferred fallback can retain a non-restorable snapshot over a genuinely restorable one.
When the WAL filter (
walFiltered) excludes the only restorable candidate (because it isn't WAL-valid) and the WAL-valid candidate itself isn't restorable (e.g. missing/corrupted.fsm),retention.keepends up empty. The fallback at Line 693-695 then blindly keepscandidates[len(candidates)-1](highest index) with no restorable check, andpurgeUnretainedPrewriteSnapshotswill delete the actually-restorable older pair since it isn't retained. This reproduces the exact "kept snapshot becomes non-restorable" scenario flagged in the prior review, via the new WAL-filter interaction instead of the previously-fixed path.🩹 Proposed fix: retry with restorable-only requirement before the unverified fallback
func keepRestorablePrewriteSnapshots(candidates []snapFileCandidate) prewriteSnapshotRetention { retention := prewriteSnapshotRetention{ keep: make(map[string]bool, prewriteSnapKeep), } walFiltered := prewriteCandidatesHaveWALFilter(candidates) for i := len(candidates) - 1; i >= 0 && len(retention.keep) < prewriteSnapKeep; i-- { if candidates[i].restorable && (!walFiltered || candidates[i].walValid) { retention.keep[candidates[i].name] = true if retention.restorableFloor == 0 || candidates[i].index < retention.restorableFloor { retention.restorableFloor = candidates[i].index } } } + // If the WAL-preferred filter excluded every candidate, fall back to + // the newest restorable pair regardless of WAL validity before + // resorting to an unverified snapshot. + if len(retention.keep) == 0 { + for i := len(candidates) - 1; i >= 0; i-- { + if candidates[i].restorable { + retention.keep[candidates[i].name] = true + retention.restorableFloor = candidates[i].index + break + } + } + } if len(retention.keep) == 0 && len(candidates) > 0 { retention.keep[candidates[len(candidates)-1].name] = true } return retention }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/raftengine/etcd/fsm_snapshot_file.go` around lines 680 - 697, The fallback in keepRestorablePrewriteSnapshots is too permissive: when walFiltered excludes the only restorable candidate and no WAL-valid restorable snapshot is found, it currently keeps the newest candidate without checking restorable. Update keepRestorablePrewriteSnapshots to retry selection with a restorable-only rule before using the final fallback, so the kept snapshot is always restorable if any restorable candidate exists; use the existing symbols walFiltered, retention.keep, and candidates to locate the change.
🧹 Nitpick comments (2)
main_raft_lifecycle_test.go (2)
22-60: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winConsider a table-driven test to align with repo conventions.
The three scenarios (engine failure, clean stop, context cancellation) differ only in stub setup and expected outcome, and are good candidates for a single table-driven test.
As per coding guidelines, "Unit tests should be co-located with packages in
*_test.gofiles; prefer table-driven test cases."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@main_raft_lifecycle_test.go` around lines 22 - 60, The three lifecycle watcher tests are duplicated scenario cases and should be consolidated into a table-driven test to match repo conventions. Refactor the existing TestStartRaftEngineLifecycleWatchersReportsEngineFailure, TestStartRaftEngineLifecycleWatchersIgnoresCleanEngineStop, and TestStartRaftEngineLifecycleWatchersIgnoresContextCancellation cases into one table covering the different lifecycleEngineStub setups, runtime specs, and expected assertions while still exercising startRaftEngineLifecycleWatchers and eg.Wait. Keep the individual scenario names as table entries or subtests so the behavior remains clear and the assertions for engine failure, clean stop, and context cancellation stay explicit.Source: Coding guidelines
22-60: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winMissing coverage for the non-Lifecycle/nil-Done guard branches.
None of the three tests exercise
main.goLines 522 (okfalse, engine doesn't implementLifecycle) or 526-528 (Done()returns nil). Adding a stub without aDone/Erroverride, or one returning nil fromDone(), would confirm the watcher skips these runtimes without blockingeg.Wait().🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@main_raft_lifecycle_test.go` around lines 22 - 60, Add test coverage for the guard paths in startRaftEngineLifecycleWatchers that are currently untested: one runtime whose engine does not satisfy the Lifecycle interface, and one Lifecycle stub whose Done() returns nil. Use the existing lifecycleEngineStub or a new minimal stub to exercise these branches and assert eg.Wait() returns cleanly without blocking or reporting an error.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@internal/raftengine/etcd/fsm_snapshot_file.go`:
- Around line 680-697: The fallback in keepRestorablePrewriteSnapshots is too
permissive: when walFiltered excludes the only restorable candidate and no
WAL-valid restorable snapshot is found, it currently keeps the newest candidate
without checking restorable. Update keepRestorablePrewriteSnapshots to retry
selection with a restorable-only rule before using the final fallback, so the
kept snapshot is always restorable if any restorable candidate exists; use the
existing symbols walFiltered, retention.keep, and candidates to locate the
change.
---
Nitpick comments:
In `@main_raft_lifecycle_test.go`:
- Around line 22-60: The three lifecycle watcher tests are duplicated scenario
cases and should be consolidated into a table-driven test to match repo
conventions. Refactor the existing
TestStartRaftEngineLifecycleWatchersReportsEngineFailure,
TestStartRaftEngineLifecycleWatchersIgnoresCleanEngineStop, and
TestStartRaftEngineLifecycleWatchersIgnoresContextCancellation cases into one
table covering the different lifecycleEngineStub setups, runtime specs, and
expected assertions while still exercising startRaftEngineLifecycleWatchers and
eg.Wait. Keep the individual scenario names as table entries or subtests so the
behavior remains clear and the assertions for engine failure, clean stop, and
context cancellation stay explicit.
- Around line 22-60: Add test coverage for the guard paths in
startRaftEngineLifecycleWatchers that are currently untested: one runtime whose
engine does not satisfy the Lifecycle interface, and one Lifecycle stub whose
Done() returns nil. Use the existing lifecycleEngineStub or a new minimal stub
to exercise these branches and assert eg.Wait() returns cleanly without blocking
or reporting an error.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 31e352ab-4630-4aa7-8fbe-be1d90baabd0
📒 Files selected for processing (9)
internal/raftengine/etcd/engine.gointernal/raftengine/etcd/engine_applied_index_test.gointernal/raftengine/etcd/engine_test.gointernal/raftengine/etcd/fsm_snapshot_file.gointernal/raftengine/etcd/fsm_snapshot_file_test.gointernal/raftengine/etcd/grpc_transport.gointernal/raftengine/etcd/grpc_transport_test.gomain.gomain_raft_lifecycle_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
- internal/raftengine/etcd/engine_applied_index_test.go
- internal/raftengine/etcd/engine.go
|
@codex review Addressed the latest P2 finding on 5d1f5f0:
Validation:
|
|
@codex review Addressed the WAL-filter fallback issue on 3c5e994:
Validation:
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3c5e994274
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if len(retention.keep) == 0 { | ||
| keepNewestMatchingPrewriteSnapshots(candidates, &retention, func(candidate snapFileCandidate) bool { | ||
| return candidate.restorable | ||
| }) |
There was a problem hiding this comment.
Keep fallback snapshots loadable by the WAL
When the WAL filter is active and the only WAL-valid candidate is not restorable, this fallback keeps an older restorable snap/fsm pair even though it is not present in wal.ValidSnapshotEntries. If the subsequent snapshot write fails or the process crashes before the new .snap is saved, restart still goes through loadPersistedSnapshot -> snapshotter.LoadNewestAvailable(walSnaps), which ignores that non-WAL-valid fallback and can leave the node with no loadable snapshot after this prewrite pass has deleted the WAL-valid .snap. The fallback needs to keep something the restart path can actually select, or the startup path needs a matching fallback.
Useful? React with 👍 / 👎.
|
@codex review Addressed the latest P1 finding on f24c731:
Validation:
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f24c731215
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if index == 0 || index <= e.appliedIndex.Load() { | ||
| return |
There was a problem hiding this comment.
Do not bypass snapshotMu after applied advances
When a duplicate MsgSnap for the same index arrives while persistReadyWithSnapshotLocked is still saving the corresponding .snap, applyReadySnapshotLocked has already advanced appliedIndex, so this fast path returns without taking snapshotMu. The receive path can then rename a new .fsm over fsmSnapPath(index) while SaveSnap is about to persist a token for the previous CRC; a sender retry after an ack/stream failure is enough to create a .snap/.fsm CRC mismatch that prevents restart from that snapshot. Please serialize this case with the snapshot mutex or reject same/older indexes before publishing the received file.
Useful? React with 👍 / 👎.
|
@codex review Addressed the latest P1 finding on 282304e:
Validation:
|
|
Codex Review: Didn't find any major issues. 🎉 Reviewed commit: ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
Summary by CodeRabbit