Skip to content

raft: prevent snapshot disk exhaustion#987

Open
bootjp wants to merge 14 commits into
mainfrom
fix-raft-snapshot-lifecycle
Open

raft: prevent snapshot disk exhaustion#987
bootjp wants to merge 14 commits into
mainfrom
fix-raft-snapshot-lifecycle

Conversation

@bootjp

@bootjp bootjp commented Jul 1, 2026

Copy link
Copy Markdown
Owner

Summary by CodeRabbit

  • New Features
    • Raft engines now expose lifecycle status (completion and terminal error) to support orderly shutdown monitoring.
    • Snapshot reception adds a “prepare before payload write” flow with protection/unprotection hooks for FSM snapshot files.
  • Bug Fixes
    • Improved coordination and crash-safety for FSM snapshot file cleanup/retention, including safer handling of protected received snapshots during persistence and apply.
  • Tests
    • Added end-to-end coverage for lifecycle propagation, snapshot prepare/protection ordering, retention rules, and stream validation (rejects payload-before-metadata).

@coderabbitai

coderabbitai Bot commented Jul 1, 2026

Copy link
Copy Markdown

Review Change Stack

Warning

Review limit reached

@bootjp, you've reached your PR review limit, so we couldn't start this review.

Next review available in: 4 minutes

Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available.
You're only billed for reviews past your plan's rate limits ($0.25/file).

How can I continue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews.

How do review limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please refer docs for additional details.

Review details
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 20a509b5-9034-4a66-8d0e-c14acdb06084

📥 Commits

Reviewing files that changed from the base of the PR and between 7f884f4 and 282304e.

📒 Files selected for processing (6)
  • internal/raftengine/etcd/engine.go
  • internal/raftengine/etcd/engine_applied_index_test.go
  • internal/raftengine/etcd/fsm_snapshot_file.go
  • internal/raftengine/etcd/fsm_snapshot_file_test.go
  • internal/raftengine/etcd/grpc_transport.go
  • internal/raftengine/etcd/grpc_transport_test.go
📝 Walkthrough

Walkthrough

Adds raft engine lifecycle signaling and watcher wiring, and extends snapshot persistence and gRPC snapshot receive paths with prewrite cleanup, protection tracking, and related tests.

Changes

Engine lifecycle and snapshot protection

Layer / File(s) Summary
Lifecycle interface and engine methods
internal/raftengine/engine.go, internal/raftengine/etcd/engine.go
Defines Lifecycle and adds nil-safe Done() and Err() on the etcd Engine.
Lifecycle watchers and tests
main.go, main_raft_lifecycle_test.go
Starts watcher goroutines from run() that wait on engine completion and surface terminal errors through the errgroup, with tests for error propagation and cancellation behavior.
Engine snapshot locking and release
internal/raftengine/etcd/engine.go, internal/raftengine/etcd/engine_applied_index_test.go, internal/raftengine/etcd/engine_test.go
Tracks protected received FSM snapshots, wires transport callbacks, serializes snapshot persistence under snapshotMu, and releases protected indices after ready, config, and local snapshot persistence.
FSM snapshot prewrite cleanup
internal/raftengine/etcd/fsm_snapshot_file.go, internal/raftengine/etcd/fsm_snapshot_file_test.go
Adds .fsm prewrite cleanup helpers that retain newer restorable pairs, purge older pairs, remove stale orphans, and sync snapshot directories, with tests for pruning and protected retention.
gRPC snapshot receive preparation
internal/raftengine/etcd/grpc_transport.go, internal/raftengine/etcd/grpc_transport_test.go
Adds transport hooks for snapshot preparation and protection, threads them through snapshot receive and finalization, and verifies preparation/protection ordering in tests.

Estimated code review effort: 4 (Complex) | ~45 minutes

Possibly related PRs

  • bootjp/elastickv#747: Both PRs modify the etcd snapshot receive/finalization path in grpc_transport.go, including tokenized .fsm handling and cleanup around finalized snapshot files.
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 29.41% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly matches the main change: fixing raft snapshot lifecycle issues to prevent disk exhaustion.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix-raft-snapshot-lifecycle

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (3)
main.go (1)

516-516: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Loop variable runtime shadows the stdlib runtime package.

Line 417 in this same file already uses rt for the identical iteration pattern (for _, rt := range runtimes). Using runtime here is inconsistent and shadows the standard library package name.

♻️ Suggested rename
-	for _, runtime := range runtimes {
+	for _, rt := range runtimes {

-		if runtime == nil {
+		if rt == nil {
 			continue
 		}
-		engine := runtime.snapshotEngine()
+		engine := rt.snapshotEngine()
 		lifecycle, ok := engine.(raftengine.Lifecycle)
 		if !ok {
 			continue
 		}
 		done := lifecycle.Done()
 		if done == nil {
 			continue
 		}
-		groupID := runtime.spec.id
+		groupID := rt.spec.id
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@main.go` at line 516, The loop variable in the runtimes iteration shadows the
stdlib runtime package and is inconsistent with the same pattern elsewhere in
main.go. Rename the iterator variable in the affected loop to a non-conflicting
name such as rt, matching the existing runtimes loop, and update any references
inside that block accordingly.
main_raft_lifecycle_test.go (1)

22-48: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Consider consolidating into a table-driven test.

Both tests exercise startRaftEngineLifecycleWatchers with a similar setup shape (stub engine + runtimes slice + errgroup), differing mainly in trigger (close(engine.done) vs cancel()) and expected outcome. A table-driven form with subtests would reduce duplication while keeping both scenarios covered.

As per coding guidelines, "Unit tests should be co-located with packages in *_test.go files; prefer table-driven test cases."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@main_raft_lifecycle_test.go` around lines 22 - 48, The two lifecycle watcher
tests duplicate the same `startRaftEngineLifecycleWatchers` setup, so
consolidate them into one table-driven test with subtests. Keep the shared
arrangement around `lifecycleEngineStub`, `raftGroupRuntime`, and
`errgroup.WithContext`, and vary only the trigger (`close(engine.done)` vs
`cancel()`) and assertions for the failure/cancellation cases.

Source: Coding guidelines

internal/raftengine/etcd/engine.go (1)

2796-2801: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick win

Prepare failures are only logged, not surfaced anywhere else.

Both call sites swallow prepareFSMSnapshotWrite errors with a slog.Warn and continue writing the new snapshot regardless. Since this PR's stated goal is preventing snapshot disk exhaustion, a persistently failing cleanup pass (e.g., permission or filesystem errors) would silently degrade back to the original unbounded-growth behavior with only a log line as a signal. Consider incrementing a metric/counter on failure so operators can alert on it.

Also applies to: 4224-4229

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/raftengine/etcd/engine.go` around lines 2796 - 2801, The
prepareFSMSnapshotWrite failure is only logged in the snapshot write path and
never surfaced to operators. Update the call sites that handle
prepareFSMSnapshotWrite so a failure also increments an existing or new
metric/counter for prepare failures, using the surrounding snapshot write logic
in engine.go to locate both occurrences; keep the Warn log, but add the
observable failure signal so repeated cleanup errors can be alerted on.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/raftengine/etcd/fsm_snapshot_file.go`:
- Around line 576-590: The snapshot retention logic in
collectPrewriteSnapCandidates and the purge loop in the prewrite snapshot
cleanup path currently keeps the newest .snap entries without verifying that
each snapshot is restorable. Update the keep-set logic around
collectPrewriteSnapCandidates/sort.Slice/purgeSnapPair to preserve a verified
.snap/.fsm pair (or otherwise build candidates from restorable snapshots only)
so a missing .fsm on the newest token snapshot cannot cause older valid pairs to
be deleted. Add a regression test covering the case where the newest .snap
exists but its .fsm payload is missing, and ensure the kept snapshot remains
restartable.

In `@internal/raftengine/etcd/grpc_transport.go`:
- Around line 752-755: The receive-side cleanup is happening too late in the
snapshot flow, allowing spooling under fsmSnapDir to fail before stale FSM files
are removed. Move the preparation/cleanup step from finalization into the early
snapshot receive path in grpc_transport.go so it runs as soon as snapshot
metadata/index is available, before drainSnapshotChunks starts writing payload
bytes; use the existing prepareReceivedFSMSnapshotWrite logic and the
surrounding receive helpers to wire this in.

---

Nitpick comments:
In `@internal/raftengine/etcd/engine.go`:
- Around line 2796-2801: The prepareFSMSnapshotWrite failure is only logged in
the snapshot write path and never surfaced to operators. Update the call sites
that handle prepareFSMSnapshotWrite so a failure also increments an existing or
new metric/counter for prepare failures, using the surrounding snapshot write
logic in engine.go to locate both occurrences; keep the Warn log, but add the
observable failure signal so repeated cleanup errors can be alerted on.

In `@main_raft_lifecycle_test.go`:
- Around line 22-48: The two lifecycle watcher tests duplicate the same
`startRaftEngineLifecycleWatchers` setup, so consolidate them into one
table-driven test with subtests. Keep the shared arrangement around
`lifecycleEngineStub`, `raftGroupRuntime`, and `errgroup.WithContext`, and vary
only the trigger (`close(engine.done)` vs `cancel()`) and assertions for the
failure/cancellation cases.

In `@main.go`:
- Line 516: The loop variable in the runtimes iteration shadows the stdlib
runtime package and is inconsistent with the same pattern elsewhere in main.go.
Rename the iterator variable in the affected loop to a non-conflicting name such
as rt, matching the existing runtimes loop, and update any references inside
that block accordingly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d020f13b-925a-4a41-8015-336cb27004c5

📥 Commits

Reviewing files that changed from the base of the PR and between 85f9890 and b0ca6c8.

📒 Files selected for processing (7)
  • internal/raftengine/engine.go
  • internal/raftengine/etcd/engine.go
  • internal/raftengine/etcd/fsm_snapshot_file.go
  • internal/raftengine/etcd/fsm_snapshot_file_test.go
  • internal/raftengine/etcd/grpc_transport.go
  • main.go
  • main_raft_lifecycle_test.go

Comment thread internal/raftengine/etcd/fsm_snapshot_file.go Outdated
Comment thread internal/raftengine/etcd/grpc_transport.go Outdated
@gemini-code-assist

Copy link
Copy Markdown
Contributor

Warning

Gemini encountered an error creating the review. You can try again by commenting /gemini review.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b0ca6c8370

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

}

var combined error
combined = errors.CombineErrors(combined, removeFSMTmpOrphans(fsmSnapDir))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid deleting in-flight FSM snapshot temp files

When this helper runs at runtime (for example while receiving a snapshot) it removes every *.fsm.tmp, but local snapshot writes create their active temp files with that same suffix in writeFSMSnapshotFile; persistLocalSnapshot releases snapshotMu after its prepare step before doing the write. If another prepare runs during that write, it can unlink the active temp file, letting the write continue on an unlinked fd and then fail the final rename with ENOENT, which propagates as a snapshot result error and stops the engine. Restrict this cleanup to startup or ensure active snapshot temp files cannot be removed.

Useful? React with 👍 / 👎.

}
index := metadata.Snapshot.Metadata.Index
if fsmSnapDir != "" && index > 0 {
prepareReceivedFSMSnapshotWrite(fsmSnapDir, index, prepareFn)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Free space before spooling received snapshots

On the receive path this prepare call does not run until the final chunk is processed, but the spool has already been written into fsmSnapDir by then. If a follower is near-full because old snapshot pairs are occupying the directory, the stream can hit ENOSPC while writing chunks before this line is reached, so the old files are never purged and raft will keep retrying the same failing snapshot transfer. Run the prewrite purge before creating/draining the spool when fsmSnapDir is set.

Useful? React with 👍 / 👎.

Comment on lines +625 to +628
if err != nil || idx >= nextIndex || liveIndexes[idx] {
continue
}
removeWithWarn(filepath.Join(fsmSnapDir, e.Name()), "orphan fsm snapshot")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve FSM files until their raft snap is durable

This treats any .fsm without a matching file in snapDir as removable during runtime prepares, but received snapshots create the .fsm before the engine later persists the corresponding .snap (persistReady applies the snapshot before calling persistReadyToWAL). If another snapshot prepare with a higher index runs in that window, it can delete the just-applied FSM file as an “orphan”; the subsequent SaveSnap then durably references a missing payload, breaking restart or future snapshot sends. Only do this orphan sweep at startup, or hold the same lock through .snap persistence.

Useful? React with 👍 / 👎.

@bootjp

bootjp commented Jul 2, 2026

Copy link
Copy Markdown
Owner Author

@codex review

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
internal/raftengine/etcd/fsm_snapshot_file.go (2)

571-593: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Reclaim orphan FSM artifacts during prewrite cleanup.

collectPrewriteSnapCandidates is driven only by .snap files, so stale .fsm payloads without a token and .fsm.tmp leftovers are never reclaimed before the next write. Those files can be snapshot-sized and still cause the disk-exhaustion failure this PR is targeting; re-add scoped orphan cleanup before syncing/writing.

Also applies to: 609-625

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/raftengine/etcd/fsm_snapshot_file.go` around lines 571 - 593, The
prewrite cleanup in collectPrewriteSnapCandidates/purgeSnapPair only handles
tokenized snapshot pairs, so orphan .fsm and .fsm.tmp artifacts can survive and
keep consuming disk. Add a scoped orphan cleanup step in the prewrite path
before syncing/writing that scans the FSM snapshot directory for unpaired stale
payloads and removes them alongside the existing candidate purge logic, reusing
the same cleanup flow used by purgeSnapPair where possible.

596-633: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Validate FSM files before treating them as restorable.

hasFSM currently means “path exists,” but keepRestorablePrewriteSnapshots uses it as “safe to restart from.” A corrupt/truncated newest .fsm can still evict older valid pairs; use the existing FSM verification path or keep an additional verified pair.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/raftengine/etcd/fsm_snapshot_file.go` around lines 596 - 633, The
restorable prewrite snapshot selection in
keepRestorablePrewriteSnapshots/collectPrewriteSnapCandidates is treating any
existing FSM file as valid, which can let a corrupt or truncated newest .fsm
displace older usable pairs. Update the candidate tracking to use a verified FSM
check instead of just fsmSnapshotFileExists, ideally by reusing the existing FSM
validation path, or retain an additional verified snapshot pair when choosing
what to keep. Ensure the logic in keepRestorablePrewriteSnapshots still prefers
the newest valid candidates but only marks snapshots as restorable when their
FSM has been validated.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/raftengine/etcd/grpc_transport.go`:
- Around line 828-839: The snapshot chunk handling in append/receive flow
currently allows payload bytes to be written before metadata has been observed,
which breaks the prewrite guarantee. Update the chunk-processing logic around
appendSnapshotChunkMetadata, maybePrepareReceivedFSMSnapshotWrite, and
writeSnapshotChunkPayload to reject any payload-before-metadata chunk with an
error before spooling bytes. Ensure the guard runs before preparedFSMWrite can
be skipped, so cleanup/preparation always happens before payload writes.

---

Outside diff comments:
In `@internal/raftengine/etcd/fsm_snapshot_file.go`:
- Around line 571-593: The prewrite cleanup in
collectPrewriteSnapCandidates/purgeSnapPair only handles tokenized snapshot
pairs, so orphan .fsm and .fsm.tmp artifacts can survive and keep consuming
disk. Add a scoped orphan cleanup step in the prewrite path before
syncing/writing that scans the FSM snapshot directory for unpaired stale
payloads and removes them alongside the existing candidate purge logic, reusing
the same cleanup flow used by purgeSnapPair where possible.
- Around line 596-633: The restorable prewrite snapshot selection in
keepRestorablePrewriteSnapshots/collectPrewriteSnapCandidates is treating any
existing FSM file as valid, which can let a corrupt or truncated newest .fsm
displace older usable pairs. Update the candidate tracking to use a verified FSM
check instead of just fsmSnapshotFileExists, ideally by reusing the existing FSM
validation path, or retain an additional verified snapshot pair when choosing
what to keep. Ensure the logic in keepRestorablePrewriteSnapshots still prefers
the newest valid candidates but only marks snapshots as restorable when their
FSM has been validated.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5a82a1ac-05d0-47f3-ac4f-c8ebbdfb50e5

📥 Commits

Reviewing files that changed from the base of the PR and between b0ca6c8 and 9c8aaaa.

📒 Files selected for processing (5)
  • internal/raftengine/etcd/fsm_snapshot_file.go
  • internal/raftengine/etcd/fsm_snapshot_file_test.go
  • internal/raftengine/etcd/grpc_transport.go
  • internal/raftengine/etcd/grpc_transport_test.go
  • main.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • main.go

Comment thread internal/raftengine/etcd/grpc_transport.go
@bootjp

bootjp commented Jul 2, 2026

Copy link
Copy Markdown
Owner Author

@codex review

@bootjp

bootjp commented Jul 2, 2026

Copy link
Copy Markdown
Owner Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2c1c553322

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

if err != nil {
return errors.WithStack(err)
} else if liveIndexes != nil {
return removeStaleFSMFilesBelowIndex(fsmSnapDir, liveIndexes, retention.restorableFloor)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Remove stale orphans up to the incoming snapshot

When the retained restorable snapshot is older than the snapshot being prepared, this only deletes orphan .fsm files below restorableFloor. A stale unreferenced payload at an index between the kept pair and nextIndex (for example keep 200, prepare 400, orphan 300 left by a failed receive) survives the prewrite pass, so the file most likely consuming space is not freed and the following snapshot write can still fail with ENOSPC. The live-index check already prevents deleting files with surviving .snap tokens; stale orphans below the snapshot being written need to be eligible for cleanup rather than capped at the kept snapshot index.

Useful? React with 👍 / 👎.

@bootjp

bootjp commented Jul 2, 2026

Copy link
Copy Markdown
Owner Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bbf7185875

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

if err != nil {
return errors.WithStack(err)
} else if liveIndexes != nil {
return removeStaleFSMFilesBelowIndex(fsmSnapDir, liveIndexes, nextIndex)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve in-flight received FSM snapshots

When this prewrite cleanup runs after a streaming snapshot has been finalized but before the raft loop applies it, the new .fsm has no matching .snap yet: receiveSnapshotStream renames the spool to <index>.fsm, while handleTransportMessage only enqueues the MsgSnap on stepCh and the .snap token is not written until later Ready processing. If another received/local snapshot prepares a higher nextIndex in that window, collectLiveSnapIndexes will not include the queued snapshot and this call removes its .fsm as an orphan, so the queued restore later fails with a missing FSM file and shuts down the engine. Please exclude in-flight receive indexes or defer orphan cleanup until after apply/startup.

Useful? React with 👍 / 👎.

@bootjp

bootjp commented Jul 2, 2026

Copy link
Copy Markdown
Owner Author

@codex review

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/raftengine/etcd/grpc_transport.go (1)

904-911: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Protect the received FSM index before publishing the final .fsm path.

Line 905 makes <index>.fsm visible before Line 910 registers protection. A concurrent prewrite cleanup can see that file as an unprotected orphan and remove it before the engine consumes the token. Register protection before FinalizeAsFSMFile, and unprotect if finalization fails.

Suggested direction
-		if err := spool.FinalizeAsFSMFile(fsmSnapDir, index, crc32c); err != nil {
+		protected := false
+		if protectFn != nil {
+			protectFn(index)
+			protected = true
+		}
+		if err := spool.FinalizeAsFSMFile(fsmSnapDir, index, crc32c); err != nil {
+			if protected && unprotectFn != nil {
+				unprotectFn(index)
+			}
 			return raftpb.Message{}, err
 		}
 		metadata.Snapshot.Data = encodeSnapshotToken(index, crc32c)
-		if protectFn != nil {
-			protectFn(index)
-		}

This requires threading unprotectFn alongside protectFn through the receive/finalize helpers.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/raftengine/etcd/grpc_transport.go` around lines 904 - 911, The FSM
finalize flow in grpc_transport.go publishes the <index>.fsm path before the
index is marked protected, which can race with cleanup. Update the
receive/finalize helpers around FinalizeAsFSMFile to register protection with
protectFn before making the file visible, and add an unprotectFn path to roll
back the protection if finalization fails. Ensure the change is wired through
the related helper methods that handle the FSM snapshot receive and finalize
sequence.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/raftengine/etcd/engine.go`:
- Around line 2820-2830: The protectReceivedFSMSnapshot fast-path can race with
snapshot release and leave a stale entry in protectedReceivedFSMSnaps. Update
protectReceivedFSMSnapshot to recheck appliedIndex after acquiring snapshotMu,
before initializing or incrementing the map, and return if the snapshot index is
no longer ahead of the current applied index. Use the existing
protectReceivedFSMSnapshot method and protectedReceivedFSMSnaps map to keep the
fix localized.

---

Outside diff comments:
In `@internal/raftengine/etcd/grpc_transport.go`:
- Around line 904-911: The FSM finalize flow in grpc_transport.go publishes the
<index>.fsm path before the index is marked protected, which can race with
cleanup. Update the receive/finalize helpers around FinalizeAsFSMFile to
register protection with protectFn before making the file visible, and add an
unprotectFn path to roll back the protection if finalization fails. Ensure the
change is wired through the related helper methods that handle the FSM snapshot
receive and finalize sequence.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c2e4058f-38e1-454c-be96-19fa04944a2d

📥 Commits

Reviewing files that changed from the base of the PR and between 9c8aaaa and af915e1.

📒 Files selected for processing (6)
  • internal/raftengine/etcd/engine.go
  • internal/raftengine/etcd/engine_applied_index_test.go
  • internal/raftengine/etcd/fsm_snapshot_file.go
  • internal/raftengine/etcd/fsm_snapshot_file_test.go
  • internal/raftengine/etcd/grpc_transport.go
  • internal/raftengine/etcd/grpc_transport_test.go

Comment thread internal/raftengine/etcd/engine.go Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

if err := spool.FinalizeAsFSMFile(fsmSnapDir, index, crc32c); err != nil {

P1 Badge Protect received snapshots before making them visible

When a higher-index snapshot prepare runs in the gap after FinalizeAsFSMFile renames the received N.fsm but before protectFn records N, prepareFSMSnapshotWriteProtected sees the new file as an unprotected orphan because no .snap token exists yet and can delete it. The receiver then enqueues/applies a snapshot token that points at a missing .fsm, so restore fails and the raft engine can stop; the protection needs to cover the file before it is visible to prewrite cleanup, or the rename/protect sequence must share the cleanup lock.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines 780 to 781
spoolPlacement, fsmSnapDir, prepareFn, protectFn := t.snapshotSpoolPlacement()
spool, err := newSnapshotSpool(spoolPlacement)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Prepare received snapshot cleanup before creating the spool

When fsmSnapDir is already full enough that creating a new temp file needs an inode or directory block, newSnapshotSpool returns before the newly added prewrite cleanup in drainSnapshotChunks ever gets a chance to delete older snapshots. That leaves the receive path still unable to recover from exactly the disk-exhaustion state this change is meant to handle; read the metadata/prepare first (buffering the first chunk if needed) or run cleanup before creating the spool file.

Useful? React with 👍 / 👎.

}
metadata.Snapshot.Data = encodeSnapshotToken(index, crc32c)
if protectFn != nil {
protectFn(index)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Release protection for snapshots raft ignores

When a received snapshot is accepted by the transport but later ignored by raft (for example, the follower catches up before the queued MsgSnap is stepped, or raft fast-forwards because it already has the snapshot index/term), no Ready.Snapshot is persisted and the new release calls never run. Because the success path keeps the .fsm protected indefinitely, those orphan files are skipped by later prewrite cleanup and can accumulate despite this disk-exhaustion fix.

Useful? React with 👍 / 👎.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: af915e1ca0

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +908 to +910
metadata.Snapshot.Data = encodeSnapshotToken(index, crc32c)
if protectFn != nil {
protectFn(index)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Reserve received snapshots before finalizing them

Fresh evidence in this patch is that protectFn(index) runs only after FinalizeAsFSMFile has already renamed the spool into <index>.fsm. When two snapshot receives overlap (or a receive overlaps a local snapshot), a prepare for a higher index can acquire snapshotMu in that gap, see no .snap and no protected index for the just-finalized file, and delete it as an orphan; the first receive then enqueues a token pointing at a missing file and restore fails. Reserve/protect the index before or while publishing the .fsm, and unprotect it on finalize/handle failure.

Useful? React with 👍 / 👎.

Comment on lines +657 to +658
if candidates[i].restorable {
retention.keep[candidates[i].name] = true

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep a WAL-valid snapshot as the prewrite fallback

When a node has a newer .snap left by a crash after persistReadyToWAL's SaveSnap but before its WAL Save, this retention code can keep that newer file because its .fsm passes CRC, then purge the older snapshot that loadPersistedSnapshot can actually load via wal.ValidSnapshotEntries. If the incoming/local snapshot write or persist then fails, restart has neither the new snapshot nor the older WAL-valid fallback, so filter retention by WAL-valid snapshots or keep at least one WAL-referenced pair.

Useful? React with 👍 / 👎.

Comment thread internal/raftengine/etcd/engine.go Outdated
}

func (e *Engine) protectReceivedFSMSnapshot(index uint64) {
if index == 0 || index <= e.appliedIndex.Load() {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Release protections for snapshots raft ignores

When a streamed snapshot loses a race to log catch-up, its index can still be above appliedIndex here but already at or below raft's committed index, so raft ignores the MsgSnap and no Ready snapshot is ever persisted. The file was already finalized and protected, and normal committed-entry apply never calls releaseProtectedReceivedFSMSnapshotsUpToLocked, so that orphan remains protected from later prewrite cleanup and can keep disk full indefinitely; either avoid protecting snapshots raft will ignore or unprotect when Step does not produce a snapshot restore.

Useful? React with 👍 / 👎.

@bootjp

bootjp commented Jul 2, 2026

Copy link
Copy Markdown
Owner Author

@codex review

Addressed the latest snapshot lifecycle findings on fadd997:

  • protect received FSM snapshot indexes before publishing the final .fsm path, and unprotect on finalize failure
  • recheck appliedIndex while holding snapshotMu before adding protection
  • run received snapshot prewrite cleanup before creating the receive spool
  • release stale received snapshot protection when raft ignores/already-applied MsgSnap tokens or appliedIndex advances

Validation:

  • go test ./internal/raftengine/etcd -run 'Test(ReceiveSnapshotStreamPreparesBeforeSpoolCreation|DrainSnapshotChunksProtectsBeforePublishingFSMFile|DrainSnapshotChunksUnprotectsWhenFinalizeFails|ProtectReceivedFSMSnapshotRechecksAppliedIndexUnderLock|UnprotectReceivedFSMSnapshotTokenIfApplied|UnprotectReceivedFSMSnapshotTokenIfAppliedKeepsFutureSnapshot|SendSnapshotProtectsFinalizedFSMFileUntilEngineRelease|SendSnapshot_ApplyFailureRemovesFinalizedFSMFile)'
  • go test ./internal/raftengine/...
  • go test .
  • golangci-lint --config=.golangci.yaml run ./internal/raftengine/etcd
  • git diff --check

@bootjp

bootjp commented Jul 2, 2026

Copy link
Copy Markdown
Owner Author

@codex review

Addressed the latest P1 findings on aebbbc2:

  • prewrite retention now prefers a restorable snapshot pair whose index is referenced by wal.ValidSnapshotEntries, so a crash-leftover .snap cannot displace the restartable WAL-backed fallback
  • received snapshot protection is released when raft has already committed past the token index, when Step yields no Ready, when the peer is rejected, or when appliedIndex later advances

Validation:

  • go test ./internal/raftengine/etcd -run 'Test(PrepareFSMSnapshotWriteKeepsWALValidFallbackPair|PrepareFSMSnapshotWriteKeepsNewestRestorablePair|HandleStepUnprotectsSnapshotTokenWhenCommittedAlreadyCoversIt|UnprotectReceivedFSMSnapshotTokenIfApplied|ProtectReceivedFSMSnapshotRechecksAppliedIndexUnderLock)'
  • go test ./internal/raftengine/...
  • go test .
  • golangci-lint --config=.golangci.yaml run ./internal/raftengine/etcd
  • git diff --check

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: aebbbc2185

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread internal/raftengine/etcd/engine.go Outdated
Comment on lines +1887 to +1888
if e.unprotectReceivedFSMSnapshotTokenIfCommitted(msg) {
return

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep accepted snapshot files protected until Ready applies

When a MsgSnap is accepted, etcd/raft advances Status().Commit to the snapshot index inside Step before this engine drains the resulting Ready, so this branch releases the protection for the just-finalized .fsm while applyReadySnapshotLocked still needs to open it. If another receive/local snapshot prewrite runs in that window with a higher index, it sees no live .snap yet and no protected index, can remove this .fsm as an orphan, and the subsequent restore fails with a missing snapshot file; compare against the pre-Step commit or wait for applied/Ready processing before unprotecting.

Useful? React with 👍 / 👎.

Comment on lines +648 to +653
if retention.restorableFloor > 0 {
liveIndexes, err := collectLiveSnapIndexes(snapDir)
if err != nil {
return errors.WithStack(err)
} else if liveIndexes != nil {
return removeStaleFSMFilesBelowIndex(fsmSnapDir, liveIndexes, protectedIndexes, nextIndex)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Drop orphan FSM files even without a retained snapshot

When this is the first snapshot attempt on a node, or all existing .snap candidates are non-restorable, restorableFloor stays zero and this guard skips orphan cleanup entirely. A large unpaired .fsm left by a failed receive before the next local snapshot can therefore remain on disk and still cause the prewrite to hit ENOSPC, even though removeStaleFSMFilesBelowIndex already filters out live and protected indexes and could safely reclaim those orphans.

Useful? React with 👍 / 👎.

@bootjp

bootjp commented Jul 2, 2026

Copy link
Copy Markdown
Owner Author

@codex review

Addressed the latest review on ea55b71:

  • compare received snapshot tokens against the pre-Step raft commit index, so accepted MsgSnap files stay protected until Ready applies the snapshot
  • continue to unprotect snapshots that were already committed before Step or that produce no Ready
  • reclaim unpaired .fsm files during prewrite cleanup even when there is no retained restorable snapshot, while still honoring live .snap indexes and protected received indexes

Validation:

  • go test ./internal/raftengine/etcd -run 'Test(HandleStepUnprotectsSnapshotTokenWhenCommittedAlreadyCoversIt|HandleStepKeepsAcceptedSnapshotTokenProtectedUntilReady|PrepareFSMSnapshotWriteRemovesOrphansWithoutRetainedSnapshot|PrepareFSMSnapshotWriteKeepsWALValidFallbackPair)'
  • go test ./internal/raftengine/...
  • go test .
  • golangci-lint --config=.golangci.yaml run ./internal/raftengine/etcd
  • git diff --check

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ea55b717c9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +1891 to +1895
if !e.rawNode.HasReady() {
e.unprotectReceivedFSMSnapshotToken(msg)
return
}
e.unprotectReceivedFSMSnapshotTokenIfApplied(msg)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Unprotect ignored snapshots that still produce a response Ready

When etcd/raft ignores a MsgSnap with an index above the current commit, for example because the snapshot ConfState no longer contains this node, it still queues a MsgAppResp, so HasReady() is true even though no snapshot will be applied. This branch then skips the unconditional unprotect, appliedIndex never reaches the token index, and the finalized .fsm remains in protectedReceivedFSMSnaps, preventing later prewrite cleanup from reclaiming the orphaned file.

Useful? React with 👍 / 👎.

Comment thread main.go Outdated
if err := lifecycle.Err(); err != nil {
return errors.Wrapf(err, "raft group %d engine stopped", groupID)
}
return errors.Errorf("raft group %d engine stopped", groupID)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Don't fail the process on intentional group removal

When this process is intentionally removed from a Raft group, the etcd engine calls requestShutdown() from the remove-conf-change apply path and exits with Err()==nil; this watcher now turns that normal per-group shutdown into a non-nil errgroup error, which cancels the whole server and marks it as failed to serve. In multi-group deployments or decommission flows, removing one local group member should not be reported as a fatal engine failure for every other group hosted by the process.

Useful? React with 👍 / 👎.

Comment on lines +695 to +700
indexes := make(map[uint64]bool, len(walSnaps))
for _, snap := range walSnaps {
if snap.Index > 0 {
indexes[snap.Index] = true
}
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve the snapshot term when filtering WAL-valid pairs

This WAL-valid set records only indexes, but etcd's snapshot loader accepts a snapshot only when both term and index match a WAL snapshot entry. If the snap directory contains stale files with the same index but a different term, prewrite cleanup can mark the stale pair as walValid, keep it as the fallback, and purge the actual term-matching pair; the next restart then cannot load the retained .snap even though cleanup intended to leave a restartable fallback.

Useful? React with 👍 / 👎.

@bootjp

bootjp commented Jul 2, 2026

Copy link
Copy Markdown
Owner Author

@codex review

Addressed the latest P2 findings on 7f884f4:

  • track received snapshot tokens until the next Ready, preserving protection for snapshot Ready and releasing tokens for ignored/non-snapshot Ready
  • treat lifecycle Done with Err()==nil as a clean raft group stop instead of a fatal server error
  • match WAL-valid snapshot fallback by term+index, and avoid deleting shared .fsm files when pruning same-index stale .snap files

Validation:

  • go test ./internal/raftengine/etcd -run 'Test(ReleaseIgnoredReceivedFSMSnapshotStepsUnprotectsNonSnapshotReady|ReleaseIgnoredReceivedFSMSnapshotStepsKeepsSnapshotReadyProtected|HandleStepKeepsAcceptedSnapshotTokenProtectedUntilReady|PrepareFSMSnapshotWriteKeepsWALTermMatchingFallbackPair|PrepareFSMSnapshotWriteKeepsWALValidFallbackPair)'
  • go test . -run 'TestStartRaftEngineLifecycleWatchers'
  • go test ./internal/raftengine/...
  • go test .
  • golangci-lint --config=.golangci.yaml run ./internal/raftengine/etcd
  • git diff --check

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7f884f4ea8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

candidates = append(candidates, snapFileCandidate{
name: e.Name(),
index: index,
restorable: fsmSnapshotFileRestorable(fsmSnapDir, index),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Verify the retained .fsm against the snap token

In a same-index snapshot case, this marks every .snap at that index as restorable whenever <index>.fsm is internally well-formed, because fsmSnapshotFileRestorable passes tokenCRC=0 and never checks the CRC embedded in that particular .snap's token. If a retry or another node produced a different valid payload for the same index, prewrite cleanup can keep the WAL-valid .snap whose token does not match the shared .fsm and delete the older matching fallback; the next restart then fails in restoreSnapshotStateFromToken with ErrFSMSnapshotTokenCRC. Please read the candidate snap token and verify the .fsm against that CRC before treating the pair as restorable.

Useful? React with 👍 / 👎.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/raftengine/etcd/fsm_snapshot_file.go (1)

680-697: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

WAL-preferred fallback can retain a non-restorable snapshot over a genuinely restorable one.

When the WAL filter (walFiltered) excludes the only restorable candidate (because it isn't WAL-valid) and the WAL-valid candidate itself isn't restorable (e.g. missing/corrupted .fsm), retention.keep ends up empty. The fallback at Line 693-695 then blindly keeps candidates[len(candidates)-1] (highest index) with no restorable check, and purgeUnretainedPrewriteSnapshots will delete the actually-restorable older pair since it isn't retained. This reproduces the exact "kept snapshot becomes non-restorable" scenario flagged in the prior review, via the new WAL-filter interaction instead of the previously-fixed path.

🩹 Proposed fix: retry with restorable-only requirement before the unverified fallback
 func keepRestorablePrewriteSnapshots(candidates []snapFileCandidate) prewriteSnapshotRetention {
 	retention := prewriteSnapshotRetention{
 		keep: make(map[string]bool, prewriteSnapKeep),
 	}
 	walFiltered := prewriteCandidatesHaveWALFilter(candidates)
 	for i := len(candidates) - 1; i >= 0 && len(retention.keep) < prewriteSnapKeep; i-- {
 		if candidates[i].restorable && (!walFiltered || candidates[i].walValid) {
 			retention.keep[candidates[i].name] = true
 			if retention.restorableFloor == 0 || candidates[i].index < retention.restorableFloor {
 				retention.restorableFloor = candidates[i].index
 			}
 		}
 	}
+	// If the WAL-preferred filter excluded every candidate, fall back to
+	// the newest restorable pair regardless of WAL validity before
+	// resorting to an unverified snapshot.
+	if len(retention.keep) == 0 {
+		for i := len(candidates) - 1; i >= 0; i-- {
+			if candidates[i].restorable {
+				retention.keep[candidates[i].name] = true
+				retention.restorableFloor = candidates[i].index
+				break
+			}
+		}
+	}
 	if len(retention.keep) == 0 && len(candidates) > 0 {
 		retention.keep[candidates[len(candidates)-1].name] = true
 	}
 	return retention
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/raftengine/etcd/fsm_snapshot_file.go` around lines 680 - 697, The
fallback in keepRestorablePrewriteSnapshots is too permissive: when walFiltered
excludes the only restorable candidate and no WAL-valid restorable snapshot is
found, it currently keeps the newest candidate without checking restorable.
Update keepRestorablePrewriteSnapshots to retry selection with a restorable-only
rule before using the final fallback, so the kept snapshot is always restorable
if any restorable candidate exists; use the existing symbols walFiltered,
retention.keep, and candidates to locate the change.
🧹 Nitpick comments (2)
main_raft_lifecycle_test.go (2)

22-60: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Consider a table-driven test to align with repo conventions.

The three scenarios (engine failure, clean stop, context cancellation) differ only in stub setup and expected outcome, and are good candidates for a single table-driven test.

As per coding guidelines, "Unit tests should be co-located with packages in *_test.go files; prefer table-driven test cases."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@main_raft_lifecycle_test.go` around lines 22 - 60, The three lifecycle
watcher tests are duplicated scenario cases and should be consolidated into a
table-driven test to match repo conventions. Refactor the existing
TestStartRaftEngineLifecycleWatchersReportsEngineFailure,
TestStartRaftEngineLifecycleWatchersIgnoresCleanEngineStop, and
TestStartRaftEngineLifecycleWatchersIgnoresContextCancellation cases into one
table covering the different lifecycleEngineStub setups, runtime specs, and
expected assertions while still exercising startRaftEngineLifecycleWatchers and
eg.Wait. Keep the individual scenario names as table entries or subtests so the
behavior remains clear and the assertions for engine failure, clean stop, and
context cancellation stay explicit.

Source: Coding guidelines


22-60: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Missing coverage for the non-Lifecycle/nil-Done guard branches.

None of the three tests exercise main.go Lines 522 (ok false, engine doesn't implement Lifecycle) or 526-528 (Done() returns nil). Adding a stub without a Done/Err override, or one returning nil from Done(), would confirm the watcher skips these runtimes without blocking eg.Wait().

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@main_raft_lifecycle_test.go` around lines 22 - 60, Add test coverage for the
guard paths in startRaftEngineLifecycleWatchers that are currently untested: one
runtime whose engine does not satisfy the Lifecycle interface, and one Lifecycle
stub whose Done() returns nil. Use the existing lifecycleEngineStub or a new
minimal stub to exercise these branches and assert eg.Wait() returns cleanly
without blocking or reporting an error.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@internal/raftengine/etcd/fsm_snapshot_file.go`:
- Around line 680-697: The fallback in keepRestorablePrewriteSnapshots is too
permissive: when walFiltered excludes the only restorable candidate and no
WAL-valid restorable snapshot is found, it currently keeps the newest candidate
without checking restorable. Update keepRestorablePrewriteSnapshots to retry
selection with a restorable-only rule before using the final fallback, so the
kept snapshot is always restorable if any restorable candidate exists; use the
existing symbols walFiltered, retention.keep, and candidates to locate the
change.

---

Nitpick comments:
In `@main_raft_lifecycle_test.go`:
- Around line 22-60: The three lifecycle watcher tests are duplicated scenario
cases and should be consolidated into a table-driven test to match repo
conventions. Refactor the existing
TestStartRaftEngineLifecycleWatchersReportsEngineFailure,
TestStartRaftEngineLifecycleWatchersIgnoresCleanEngineStop, and
TestStartRaftEngineLifecycleWatchersIgnoresContextCancellation cases into one
table covering the different lifecycleEngineStub setups, runtime specs, and
expected assertions while still exercising startRaftEngineLifecycleWatchers and
eg.Wait. Keep the individual scenario names as table entries or subtests so the
behavior remains clear and the assertions for engine failure, clean stop, and
context cancellation stay explicit.
- Around line 22-60: Add test coverage for the guard paths in
startRaftEngineLifecycleWatchers that are currently untested: one runtime whose
engine does not satisfy the Lifecycle interface, and one Lifecycle stub whose
Done() returns nil. Use the existing lifecycleEngineStub or a new minimal stub
to exercise these branches and assert eg.Wait() returns cleanly without blocking
or reporting an error.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 31e352ab-4630-4aa7-8fbe-be1d90baabd0

📥 Commits

Reviewing files that changed from the base of the PR and between af915e1 and 7f884f4.

📒 Files selected for processing (9)
  • internal/raftengine/etcd/engine.go
  • internal/raftengine/etcd/engine_applied_index_test.go
  • internal/raftengine/etcd/engine_test.go
  • internal/raftengine/etcd/fsm_snapshot_file.go
  • internal/raftengine/etcd/fsm_snapshot_file_test.go
  • internal/raftengine/etcd/grpc_transport.go
  • internal/raftengine/etcd/grpc_transport_test.go
  • main.go
  • main_raft_lifecycle_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • internal/raftengine/etcd/engine_applied_index_test.go
  • internal/raftengine/etcd/engine.go

@bootjp

bootjp commented Jul 2, 2026

Copy link
Copy Markdown
Owner Author

@codex review

Addressed the latest P2 finding on 5d1f5f0:

  • prewrite retention now reads each candidate .snap token and verifies the shared .fsm against that token CRC before treating the pair as restorable
  • same-index stale .snap files are pruned without deleting the shared .fsm when a token-matching retained candidate remains
  • existing verifyFSMSnapshotFile semantics are preserved; tokenCRC=0 still skips comparison for existing callers, while prewrite uses an explicit token-check helper

Validation:

  • go test ./internal/raftengine/etcd -run 'Test(PrepareFSMSnapshotWriteKeepsTokenMatchingFallbackPair|PrepareFSMSnapshotWriteKeepsWALTermMatchingFallbackPair|PrepareFSMSnapshotWriteKeepsWALValidFallbackPair|PrepareFSMSnapshotWriteKeepsNewestRestorablePair|PrepareFSMSnapshotWriteRemovesOrphansWithoutRetainedSnapshot)'
  • go test ./internal/raftengine/...
  • go test .
  • golangci-lint --config=.golangci.yaml run ./internal/raftengine/etcd
  • git diff --check

@bootjp

bootjp commented Jul 2, 2026

Copy link
Copy Markdown
Owner Author

@codex review

Addressed the WAL-filter fallback issue on 3c5e994:

  • when no WAL-valid restorable snapshot exists, retention now retries with the newest restorable pair before falling back to an unverified newest .snap
  • added coverage for a WAL-valid but broken candidate that would otherwise evict the only restorable fallback
  • kept the token-matching validation from 5d1f5f0

Validation:

  • go test ./internal/raftengine/etcd -run 'Test(PrepareFSMSnapshotWriteKeepsRestorablePairWhenWALValidCandidateIsBroken|PrepareFSMSnapshotWriteKeepsTokenMatchingFallbackPair|PrepareFSMSnapshotWriteKeepsWALTermMatchingFallbackPair|PrepareFSMSnapshotWriteKeepsWALValidFallbackPair|PrepareFSMSnapshotWriteKeepsNewestRestorablePair)'
  • go test ./internal/raftengine/...
  • go test .
  • golangci-lint --config=.golangci.yaml run ./internal/raftengine/etcd
  • git diff --check

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3c5e994274

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +693 to +696
if len(retention.keep) == 0 {
keepNewestMatchingPrewriteSnapshots(candidates, &retention, func(candidate snapFileCandidate) bool {
return candidate.restorable
})

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep fallback snapshots loadable by the WAL

When the WAL filter is active and the only WAL-valid candidate is not restorable, this fallback keeps an older restorable snap/fsm pair even though it is not present in wal.ValidSnapshotEntries. If the subsequent snapshot write fails or the process crashes before the new .snap is saved, restart still goes through loadPersistedSnapshot -> snapshotter.LoadNewestAvailable(walSnaps), which ignores that non-WAL-valid fallback and can leave the node with no loadable snapshot after this prewrite pass has deleted the WAL-valid .snap. The fallback needs to keep something the restart path can actually select, or the startup path needs a matching fallback.

Useful? React with 👍 / 👎.

@bootjp

bootjp commented Jul 2, 2026

Copy link
Copy Markdown
Owner Author

@codex review

Addressed the latest P1 finding on f24c731:

  • when WAL filtering is active and no WAL-valid restorable pair exists, retention now preserves a WAL-valid candidate so restart can still select a WAL-referenced .snap
  • it also preserves the newest restorable pair when one exists, avoiding eviction of the only verified .snap/.fsm fallback
  • updated the regression test to assert both the WAL-valid .snap and restorable pair are retained in this split fallback case

Validation:

  • go test ./internal/raftengine/etcd -run 'Test(PrepareFSMSnapshotWriteKeepsWALValidAndRestorableFallbacksWhenWALValidCandidateIsBroken|PrepareFSMSnapshotWriteKeepsTokenMatchingFallbackPair|PrepareFSMSnapshotWriteKeepsWALTermMatchingFallbackPair|PrepareFSMSnapshotWriteKeepsWALValidFallbackPair|PrepareFSMSnapshotWriteKeepsNewestRestorablePair)'
  • go test ./internal/raftengine/...
  • go test .
  • golangci-lint --config=.golangci.yaml run ./internal/raftengine/etcd
  • git diff --check

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f24c731215

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread internal/raftengine/etcd/engine.go Outdated
Comment on lines +2836 to +2837
if index == 0 || index <= e.appliedIndex.Load() {
return

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Do not bypass snapshotMu after applied advances

When a duplicate MsgSnap for the same index arrives while persistReadyWithSnapshotLocked is still saving the corresponding .snap, applyReadySnapshotLocked has already advanced appliedIndex, so this fast path returns without taking snapshotMu. The receive path can then rename a new .fsm over fsmSnapPath(index) while SaveSnap is about to persist a token for the previous CRC; a sender retry after an ack/stream failure is enough to create a .snap/.fsm CRC mismatch that prevents restart from that snapshot. Please serialize this case with the snapshot mutex or reject same/older indexes before publishing the received file.

Useful? React with 👍 / 👎.

@bootjp

bootjp commented Jul 2, 2026

Copy link
Copy Markdown
Owner Author

@codex review

Addressed the latest P1 finding on 282304e:

  • received snapshot protection now always serializes index acceptance with snapshotMu, including same/older indexes after appliedIndex has advanced
  • protect callbacks now return whether the snapshot was accepted; stale/already-applied received snapshots are rejected before publishing the final .fsm path
  • added regression coverage for waiting on snapshotMu when the index is already applied and for rejecting stale receive finalization without calling unprotect or creating the final .fsm

Validation:

  • go test ./internal/raftengine/etcd -run 'Test(ProtectReceivedFSMSnapshotRechecksAppliedIndexUnderLock|ProtectReceivedFSMSnapshotWaitsOnSnapshotMuForAlreadyAppliedIndex|DrainSnapshotChunksProtectsBeforePublishingFSMFile|DrainSnapshotChunksRejectsStaleFSMProtection|DrainSnapshotChunksUnprotectsWhenFinalizeFails)'
  • go test ./internal/raftengine/...
  • go test .
  • golangci-lint --config=.golangci.yaml run ./internal/raftengine/etcd
  • git diff --check

@chatgpt-codex-connector

Copy link
Copy Markdown

Codex Review: Didn't find any major issues. 🎉

Reviewed commit: 282304e93f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant