From 64a55a120ba90ec490c4fe0dc16546a9d2f3b3c7 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Sat, 23 May 2026 21:27:45 +0900 Subject: [PATCH 1/6] feat(encryption): Stage 6D-6a - EnableStorageEnvelope admin RPC + server method MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Stage 6D-6 (the final 6D milestone) is sliced into 6D-6a/b/c so each slice matches the per-PR review cadence of 6D-3/4/5. This PR lands 6D-6a: the proto + server method + server-level unit tests. 6D-6b adds the CLI; 6D-6c adds main.go production wiring + end-to-end integration test. ## Proto changes (§3.1) - Add EnableStorageEnvelope RPC to the EncryptionAdmin service. - Add EnableStorageEnvelopeRequest message (proposer_node_id, proposer_local_epoch with the standard <= 0xFFFF bound). - Add EnableStorageEnvelopeResponse message (applied_index, capability_summary, cutover_index_unknown, was_already_active) matching the §3.1 layout. Notably uses OK gRPC status + a was_already_active boolean discriminator rather than AlreadyExists — unary gRPC drops the response body on non-OK status, so the §6.4 idempotency contract MUST live on the success path. - Add CapabilityVerdict message (full_node_id, encryption_capable, build_sha, sidecar_present) for the response's per-member membership-view summary. - proto/Makefile gen target updated to include encryption_admin.proto so future regenerations stay in sync. ## Server method (adapter/encryption_admin.go) - New CapabilityFanoutFn closure type wired via the new WithEncryptionAdminCapabilityFanout option. Nil leaves the RPC short-circuited with FailedPrecondition — the 6D-6c main.go wiring closes over the live route snapshot + DialFunc + timeout policy. - EnableStorageEnvelope orchestrates the §3.2 server-side sequence: 1. Leader gate (requireLeader, embeds leader hint in the FailedPrecondition status detail). 2. Proposer + sidecar path configured. 3. Request validation (proposer_node_id != 0, proposer_local_epoch <= 0xFFFF). 4. Sidecar read → bootstrap gate (Active.Storage != 0). 5. Idempotent-retry short-circuit (§6.4): if StorageEnvelopeActive == true, return OK + was_already_active=true + applied_index = StorageEnvelopeCutoverIndex. The defensive cutover_index_unknown branch fires when the cutover index is 0 on an active sidecar (operationally impossible under normal apply but hedged against schema rollback / hand-edited sidecars). 6. Capability fan-out gate (s.capabilityFanout == nil → refuse). 7. Run fan-out; any verdict with EncryptionCapable=false or Reachable=false → FailedPrecondition naming the refusing node. 8. Compose RotationPayload with SubTag = RotateSubEnableStorageEnvelope, DEKID = sidecar.Active.Storage, Purpose = PurposeStorage, Wrapped = []byte{} (§2.1 constraint #2), ProposerRegistration covering Active.Storage. 9. Propose through Raft via proposeEncryptionEntry. 10. Post-apply re-read discriminates: - StorageEnvelopeActive=true (apply landed) → fresh success. - StorageEnvelopeActive=false (the 6D-4 applier consumed the entry as a §2.1 #3 benign no-op because a RotateDEK race advanced sidecar.Active.Storage) → FailedPrecondition with retry hint, NOT halt — the apply path deliberately keeps the cluster up. - Orchestration split into cutoverPrecheck / runCutoverFanout / proposeCutoverEntry / cutoverPostcheck helpers so the outer method stays under the cyclop budget and so future test refactors can pin each phase independently. ## Tests (adapter/encryption_admin_test.go) 12 new server-level tests: - TestEncryptionAdmin_MutatingRPCs_RejectWithoutProposer extended to cover EnableStorageEnvelope. - _RejectsOnFollower - leader gate + hint embedding. - _RejectsWithoutSidecarPath - sidecar dependency. - _RejectsZeroProposerNodeID - §3.1 / §6.1 sentinel rule. - _RejectsOversizedLocalEpoch - §3.1 / §4.1 16-bit bound. - _RejectsNotBootstrapped - bootstrap gate with hint. - _IdempotentRetry - sidecar already active → OK + was_already_active=true, applied_index = original StorageEnvelopeCutoverIndex, NO fan-out, NO propose, capability_summary empty. - _DefensiveCutoverIndexUnknown - active sidecar with index=0 (hand-edit hedge) → CutoverIndexUnknown=true, AppliedIndex falls back to RaftAppliedIndex. - _RejectsWithoutCapabilityFanout - the §4 pre-flight is mandatory. - _RejectsOnCapabilityRefusal - any fan-out NO refuses + names the refusing node. - _RejectsOnFanoutError - fan-out helper itself erroring maps to FailedPrecondition (not Internal). - _HappyPath - end-to-end fresh cutover with simulated apply effect; verifies the wire-level Raft entry decodes as RotateSubEnableStorageEnvelope with the §2.1 constraints. - _StaleDEKIDRace - the simulated apply does NOT flip the sidecar field (because a RotateDEK raced); the post-apply re-read takes the stale-DEKID branch and the RPC refuses with the retry hint. - _CapabilitySummaryProjection - per-row wire-shape pin for the internal CapabilityVerdict → proto.CapabilityVerdict mapping. New test helpers: - fixedCapabilityFanout - deterministic fan-out fixture. - applyingProposer - recordingProposer extension that simulates the FSM apply effect on the sidecar so the post-Propose re-read takes the right branch. - applyCutover, applyStaleDEKIDRace - the two apply-effect variants the §6.4 / §2.1 #3 paths require. - cutoverReadySidecarFixture - the bootstrapped + pre-cutover sidecar shape the propose-path tests share. - allOKFanoutResult - the "fan-out approved" deterministic fixture. - assertFreshCutoverResponse, assertCutoverProposalShape, assertProtoVerdict - shared assertion helpers. ## Caller audit - The new RPC is purely additive: no existing method's signature or semantics change. - WithEncryptionAdminCapabilityFanout is a new option; existing test fixtures that don't wire it leave s.capabilityFanout=nil, which only affects EnableStorageEnvelope (refuses) - other mutator RPCs are unaffected. - proto.UnimplementedEncryptionAdminServer carries the EnableStorageEnvelope stub returning codes.Unimplemented, so test doubles that don't override stay unbroken. ## Self-review (5 lenses) 1. Data loss - no. The server method reads the sidecar, proposes via Raft, then re-reads. The 6D-4 applier is the durable writer; this PR only reads pre- and post-apply. 2. Concurrency / distributed failures - the propose-then-re-read sequence relies on the apply being visible on the leader's sidecar by the time Propose returns. The Raft proposer's contract (proposeEncryptionEntry returns the commit index after the FSM apply completes locally on the leader) is the load-bearing invariant; if a future Proposer change made Propose return before local apply, the re-read could see stale state. The existing RotateDEK / BootstrapEncryption paths rely on the same contract. 3. Performance - one extra sidecar read pre- and post-Propose per cutover. Cutover is one-shot per cluster lifetime; the read cost is negligible. 4. Data consistency - the §6.4 idempotency contract is preserved: on retry, applied_index reports the ORIGINAL StorageEnvelopeCutoverIndex (not the retry's Raft index). The CapabilitySummary is intentionally empty on retries so a caller cannot accidentally re-use the original cutover's membership view. The §2.1 #3 stale-DEKID race is surfaced as FailedPrecondition with a retry hint (not Aborted; the design's §6.4 row pins FailedPrecondition). 5. Test coverage - 12 server-level tests across the §3.2 sequence + idempotency contract + race discrimination + wire shape. The applyingProposer helper specifically pins the §6.4 / §2.1 #3 discrimination logic, which is the central correctness-critical branch. ## Verification - go test -race -timeout 240s -run TestEncryptionAdmin_ ./adapter/... - all green - golangci-lint --new-from-rev=origin/main - 0 issues Refs: design doc §3.1 / §3.2 / §6.4 (6D-6a slice of the §7 decomposition table). --- adapter/encryption_admin.go | 342 ++++++++++++ adapter/encryption_admin_test.go | 514 ++++++++++++++++++ ...5_18_partial_6d_enable_storage_envelope.md | 22 +- proto/Makefile | 3 + proto/encryption_admin.pb.go | 313 ++++++++++- proto/encryption_admin.proto | 66 +++ proto/encryption_admin_grpc.pb.go | 38 ++ 7 files changed, 1268 insertions(+), 30 deletions(-) diff --git a/adapter/encryption_admin.go b/adapter/encryption_admin.go index ab8febbb7..37a5564a8 100644 --- a/adapter/encryption_admin.go +++ b/adapter/encryption_admin.go @@ -9,6 +9,7 @@ import ( "runtime/debug" "strconv" + "github.com/bootjp/elastickv/internal/admin" "github.com/bootjp/elastickv/internal/encryption" "github.com/bootjp/elastickv/internal/encryption/fsmwire" "github.com/bootjp/elastickv/internal/raftengine" @@ -30,9 +31,26 @@ type EncryptionAdminServer struct { latestAppliedIndex func() uint64 proposer raftengine.Proposer leaderView raftengine.LeaderView + // capabilityFanout, when wired, runs the §4 Voters ∪ Learners + // fan-out before the §7.1 Phase 1 cutover entry is proposed. + // A nil value short-circuits EnableStorageEnvelope with + // FailedPrecondition — the 6D-6 main.go wiring (lands in 6D-6c) + // is what threads the route-snapshot builder + DialFunc + + // timeout into this closure. Other mutator RPCs are unaffected. + capabilityFanout CapabilityFanoutFn pb.UnimplementedEncryptionAdminServer } +// CapabilityFanoutFn is the closure the server invokes to run the +// §4 Voters ∪ Learners pre-flight before the cutover proposal. +// Production wiring composes it from +// internal/admin.CapabilityFanout(routes, dial, timeout) where +// routes is built from the Raft engine's live membership view and +// dial reuses the existing admin connection pool. Tests stub it +// with a deterministic result to exercise the §4.3 OK / refuse +// branches at the RPC layer without spinning real clients. +type CapabilityFanoutFn func(ctx context.Context) (admin.CapabilityFanoutResult, error) + // EncryptionAdminServerOption configures EncryptionAdminServer behavior. type EncryptionAdminServerOption func(*EncryptionAdminServer) @@ -103,6 +121,24 @@ func WithEncryptionAdminProposer(p raftengine.Proposer) EncryptionAdminServerOpt } } +// WithEncryptionAdminCapabilityFanout wires the §4 Voters ∪ +// Learners pre-flight that the §7.1 Phase 1 cutover RPC +// (EnableStorageEnvelope) runs before composing the +// RotateSubEnableStorageEnvelope payload. Without this option +// EnableStorageEnvelope refuses with FailedPrecondition — +// matching the proposer / leaderView posture for the other +// mutator RPCs. A nil argument is a no-op (the server stays in +// the cutover-disabled posture), mirroring the +// WithEncryptionAdmin* convention. +func WithEncryptionAdminCapabilityFanout(fn CapabilityFanoutFn) EncryptionAdminServerOption { + return func(s *EncryptionAdminServer) { + if fn == nil { + return + } + s.capabilityFanout = fn + } +} + // WithEncryptionAdminLeaderView registers the leadership oracle. // Mutating RPCs and ResyncSidecar reject on followers with // FailedPrecondition; the leader's id and address are embedded in @@ -583,6 +619,312 @@ func (s *EncryptionAdminServer) RotateDEK(ctx context.Context, req *pb.RotateDEK return &pb.RotateDEKResponse{AppliedIndex: idx}, nil } +// EnableStorageEnvelope is the §7.1 Phase 1 cutover RPC: it +// proposes a RotateSubEnableStorageEnvelope (0x04) rotation entry +// that flips sidecar.StorageEnvelopeActive on every replica +// simultaneously and stores the original cutover index in +// sidecar.StorageEnvelopeCutoverIndex (§6.4). After the entry +// applies, the storage layer's WithStorageEnvelopeGate (Stage 6D-5) +// reads true on every Put and the cluster begins emitting §4.1 +// envelopes for new versions. +// +// The RPC composes the §4 Voters ∪ Learners capability fan-out +// helper (Stage 6D-3), the Stage 6D-4 wire dispatch, and the +// idempotency contract (§6.4): a duplicate call against an +// already-active sidecar returns OK with `was_already_active=true` +// and `applied_index = sidecar.StorageEnvelopeCutoverIndex`. +// Returning AlreadyExists instead would drop the response body +// per unary-gRPC semantics, so the idempotency discriminator +// lives on the success path. +// +// The server-side sequence (per design doc §3.2): +// +// 1. Validate `proposer_node_id != 0` and `proposer_local_epoch +// <= 0xFFFF` at the gRPC boundary. +// 2. Verify Stage 6B mutators are enabled — implicit via +// `s.proposer == nil` (matches RotateDEK / BootstrapEncryption +// posture). +// 3. Verify we are the default-group leader via `requireLeader`. +// 4. Verify the sidecar has Active.Storage != 0 (bootstrap +// committed) — return FailedPrecondition with a "run +// BootstrapEncryption first" hint otherwise. +// 5. If sidecar.StorageEnvelopeActive == true, return the §3.2 +// step 5 idempotent-retry response (OK + was_already_active + +// applied_index = StorageEnvelopeCutoverIndex). Skip the +// fan-out — the original cutover already passed the gate. +// 6. Refuse with FailedPrecondition if the capability fan-out is +// not wired (`s.capabilityFanout == nil`): the cutover MUST +// have a §4 pre-flight; a silent skip would let an +// unreachable learner sneak through. +// 7. Run the fan-out. Any verdict with Reachable=false or +// EncryptionCapable=false refuses with FailedPrecondition; +// the response detail names the specific node. +// 8. Compose the RotationPayload (§2.1: empty Wrapped, DEKID = +// sidecar.Active.Storage, Purpose = PurposeStorage, +// ProposerRegistration covering the active storage DEK). +// 9. Propose through Raft via `proposeEncryptionEntry`. +// 10. Re-read the sidecar to discriminate fresh-success vs. +// stale-DEKID race vs. concurrent-overlap idempotent and +// assemble the response. +// +// FSM-level no-op outcomes (stale DEKID via a RotateDEK race, +// already-active via a duplicate cutover) do NOT halt the apply +// path — the 6D-4 applier deliberately consumes those entries +// without flipping the sidecar field. The RPC discriminates by +// reading the post-apply sidecar: still false ⇒ stale DEKID, +// surface as FailedPrecondition with the §2.1 #3 retry hint; now +// true with cutover-index mismatch ⇒ another cutover landed +// concurrently, treat as idempotent success. +func (s *EncryptionAdminServer) EnableStorageEnvelope(ctx context.Context, req *pb.EnableStorageEnvelopeRequest) (*pb.EnableStorageEnvelopeResponse, error) { + preSidecar, earlyResp, err := s.cutoverPrecheck(ctx, req) + if err != nil { + return nil, err + } + if earlyResp != nil { + // Idempotent retry: preSidecar already reports + // StorageEnvelopeActive=true. The precheck returned the + // §3.2 step 5 response shape; no propose, no fan-out. + return earlyResp, nil + } + fanoutResult, err := s.runCutoverFanout(ctx) + if err != nil { + return nil, err + } + proposedIdx, err := s.proposeCutoverEntry(ctx, preSidecar, req) + if err != nil { + return nil, err + } + return s.cutoverPostcheck(proposedIdx, fanoutResult) +} + +// cutoverPrecheck runs the §3.2 steps 1-5 that fire before the +// fan-out: input validation, leader check, sidecar read, bootstrap +// gate, and the idempotent-retry short-circuit. Returns either +// +// - (preSidecar, nil, nil) on the propose-path: continue with +// the fan-out and Raft proposal. +// - (nil, earlyResp, nil) on the §6.4 idempotent retry: the +// caller short-circuits and returns earlyResp without +// touching the fan-out or Raft. +// - (nil, nil, err) on any precheck refusal: the gRPC error +// already carries the right status code. +func (s *EncryptionAdminServer) cutoverPrecheck(ctx context.Context, req *pb.EnableStorageEnvelopeRequest) (*encryption.Sidecar, *pb.EnableStorageEnvelopeResponse, error) { + if err := s.requireLeader(ctx); err != nil { + return nil, nil, err + } + if s.proposer == nil { + return nil, nil, grpcStatusError(codes.FailedPrecondition, "encryption: proposer is not configured on this node") + } + if s.sidecarPath == "" { + return nil, nil, grpcStatusError(codes.FailedPrecondition, "encryption: sidecar path is not configured on this node") + } + if err := validateEnableStorageEnvelopeRequest(req); err != nil { + return nil, nil, err + } + preSidecar, err := encryption.ReadSidecar(s.sidecarPath) + if err != nil { + return nil, nil, statusFromSidecarErr(err) + } + if preSidecar.Active.Storage == 0 { + return nil, nil, grpcStatusError(codes.FailedPrecondition, + "encryption: cluster not bootstrapped (Active.Storage == 0) — call BootstrapEncryption first") + } + if preSidecar.StorageEnvelopeActive { + // §6.4 idempotent-retry path. The original cutover already + // passed the §4 fan-out; re-running it would add latency to + // what is effectively a no-op call. Return OK with the + // stable applied_index. + return nil, idempotentCutoverResponse(preSidecar), nil + } + return preSidecar, nil, nil +} + +// runCutoverFanout invokes the §4 Voters ∪ Learners pre-flight +// helper and translates the OK / refuse / error branches into the +// §3.2 step 6-7 status codes. Pulled out so the orchestration +// body stays under the cyclomatic-complexity budget. +func (s *EncryptionAdminServer) runCutoverFanout(ctx context.Context) (admin.CapabilityFanoutResult, error) { + if s.capabilityFanout == nil { + return admin.CapabilityFanoutResult{}, grpcStatusError(codes.FailedPrecondition, + "encryption: capability fan-out is not configured on this node") + } + result, err := s.capabilityFanout(ctx) + if err != nil { + return admin.CapabilityFanoutResult{}, grpcStatusErrorf(codes.FailedPrecondition, + "encryption: capability fan-out failed: %v", err) + } + if !result.OK { + return admin.CapabilityFanoutResult{}, grpcStatusErrorf(codes.FailedPrecondition, + "encryption: capability check refused cutover (%s)", capabilityRefusalSummary(result)) + } + return result, nil +} + +// proposeCutoverEntry composes the RotationPayload per §2.1 and +// drives it through Raft. The §2.1 #2 length-based-empty-Wrapped +// constraint is satisfied at composition (the payload uses +// []byte{}, not nil), matching the 6D-4 applier's length-based +// check. +func (s *EncryptionAdminServer) proposeCutoverEntry(ctx context.Context, preSidecar *encryption.Sidecar, req *pb.EnableStorageEnvelopeRequest) (uint64, error) { + payload := fsmwire.RotationPayload{ + SubTag: fsmwire.RotateSubEnableStorageEnvelope, + DEKID: preSidecar.Active.Storage, + Purpose: fsmwire.PurposeStorage, + Wrapped: []byte{}, + ProposerRegistration: fsmwire.RegistrationPayload{ + DEKID: preSidecar.Active.Storage, + FullNodeID: req.GetProposerNodeId(), + LocalEpoch: uint32ToLocalEpoch(req.GetProposerLocalEpoch()), + }, + } + return s.proposeEncryptionEntry(ctx, fsmwire.OpRotation, fsmwire.EncodeRotation(payload)) +} + +// cutoverPostcheck re-reads the sidecar after the Raft propose +// returns and discriminates the §2.1 outcomes: +// +// - Fresh success (StorageEnvelopeActive == true with the +// cutover index set by the apply) — assemble the §3.2 happy- +// path response. +// - Stale-DEKID race (StorageEnvelopeActive still false because +// a RotateDEK raced and the 6D-4 applier consumed the entry +// as a benign no-op) — refuse with the §2.1 #3 retry hint. +func (s *EncryptionAdminServer) cutoverPostcheck(proposedIdx uint64, fanoutResult admin.CapabilityFanoutResult) (*pb.EnableStorageEnvelopeResponse, error) { + postSidecar, err := encryption.ReadSidecar(s.sidecarPath) + if err != nil { + return nil, statusFromSidecarErr(err) + } + if !postSidecar.StorageEnvelopeActive { + // §2.1 #3 stale-DEKID race: a RotateDEK committed between + // propose and apply, the 6D-4 applier consumed the entry + // as a benign no-op, sidecar still false. Surface to the + // operator with a retry hint — not Aborted (transient + // concurrency conflict shape, but FailedPrecondition is + // what the design's §6.4 row pins). + return nil, grpcStatusError(codes.FailedPrecondition, + "encryption: cutover proposal raced a RotateDEK (sidecar.Active.Storage moved); retry against the new active DEK") + } + return freshCutoverResponse(postSidecar, proposedIdx, fanoutResult), nil +} + +// validateEnableStorageEnvelopeRequest enforces the §3.2 step 1 +// gRPC-boundary checks. Pulled out so the EnableStorageEnvelope +// orchestration body stays under the cyclomatic-complexity budget +// and so tests can exercise the validation slice in isolation. +func validateEnableStorageEnvelopeRequest(req *pb.EnableStorageEnvelopeRequest) error { + if req.GetProposerNodeId() == 0 { + // §6.1 reserves full_node_id=0 as the "not-capable" + // sentinel. Identical posture to RotateDEK and + // BootstrapEncryption — accepting 0 would weaken the + // writer-registry collision invariant. + return grpcStatusError(codes.InvalidArgument, + "encryption: proposer_node_id must be non-zero (0 is reserved as the §6.1 not-capable sentinel)") + } + if req.GetProposerLocalEpoch() > math.MaxUint16 { + return grpcStatusErrorf(codes.InvalidArgument, + "encryption: proposer_local_epoch=%d exceeds the §4.1 16-bit bound (max 0xFFFF)", + req.GetProposerLocalEpoch()) + } + return nil +} + +// idempotentCutoverResponse is the §3.2 step 5 retry-success +// shape: OK status, was_already_active=true, applied_index = +// sidecar.StorageEnvelopeCutoverIndex (the original cutover's +// apply index). The defensive cutover_index_unknown branch fires +// only when an attacker / schema rollback / hand-edited sidecar +// has StorageEnvelopeActive=true paired with +// StorageEnvelopeCutoverIndex=0 — operationally impossible under +// normal apply but hedged against per §6.4. +func idempotentCutoverResponse(sc *encryption.Sidecar) *pb.EnableStorageEnvelopeResponse { + resp := &pb.EnableStorageEnvelopeResponse{ + WasAlreadyActive: true, + CapabilitySummary: nil, // empty on idempotent retries per §3.1 + AppliedIndex: sc.StorageEnvelopeCutoverIndex, + CutoverIndexUnknown: false, + } + if sc.StorageEnvelopeCutoverIndex == 0 { + // §6.4 defensive branch. + resp.AppliedIndex = sc.RaftAppliedIndex + resp.CutoverIndexUnknown = true + } + return resp +} + +// freshCutoverResponse is the §3.2 fresh-success shape: OK, +// was_already_active=false, applied_index = the Raft index of +// the entry the leader just proposed and waited to apply, +// capability_summary projects the fan-out verdicts into the +// wire shape. +func freshCutoverResponse(sc *encryption.Sidecar, proposedIdx uint64, fanoutResult admin.CapabilityFanoutResult) *pb.EnableStorageEnvelopeResponse { + // The reported applied_index is the post-apply sidecar's + // StorageEnvelopeCutoverIndex when the apply set it equal to + // proposedIdx (the fresh-success path). A mismatch means a + // concurrent cutover entry landed between propose and apply + // — operator-impossible under §2.1 #4 (mutator lock + // serialises overlapping calls on the propose side) but the + // applier still records the FIRST cutover's index. Treat as + // idempotent: report the original index with + // was_already_active=false (this call's propose committed) + // but the surfaced index is the FIRST cutover's. The CLI + // sees the discrepancy via the applied_index vs. its own + // expected value; the RPC must not lie about which call + // proposed which entry. + appliedIndex := sc.StorageEnvelopeCutoverIndex + if appliedIndex == 0 { + // Defensive: the apply path always sets the cutover + // index alongside the active flag. A zero here means + // the post-apply read raced (unlikely, but the §6.4 + // fallback exists for hand-edited sidecars). + appliedIndex = proposedIdx + } + return &pb.EnableStorageEnvelopeResponse{ + AppliedIndex: appliedIndex, + CapabilitySummary: projectCapabilityVerdicts(fanoutResult.Verdicts), + CutoverIndexUnknown: false, + WasAlreadyActive: false, + } +} + +// projectCapabilityVerdicts marshals the internal CapabilityVerdict +// shape into the wire-format proto.CapabilityVerdict. Reachable / +// Err fields are intentionally NOT projected: the cutover RPC only +// returns this summary on the OK path, so every verdict in the +// slice has Reachable=true and Err=nil by construction. Operators +// who need transport-layer diagnostics consult the leader's logs. +func projectCapabilityVerdicts(in []admin.CapabilityVerdict) []*pb.CapabilityVerdict { + if len(in) == 0 { + return nil + } + out := make([]*pb.CapabilityVerdict, 0, len(in)) + for _, v := range in { + out = append(out, &pb.CapabilityVerdict{ + FullNodeId: v.FullNodeID, + EncryptionCapable: v.EncryptionCapable, + BuildSha: v.BuildSHA, + SidecarPresent: v.SidecarPresent, + }) + } + return out +} + +// capabilityRefusalSummary builds the human-readable detail +// included in the FailedPrecondition status when the fan-out +// refused. Names the first unreachable / not-capable member so +// the operator's CLI can immediately diagnose without trawling +// logs. +func capabilityRefusalSummary(result admin.CapabilityFanoutResult) string { + for _, v := range result.Verdicts { + if !v.Reachable { + return "unreachable member full_node_id=" + strconv.FormatUint(v.FullNodeID, 10) + } + if !v.EncryptionCapable { + return "not-capable member full_node_id=" + strconv.FormatUint(v.FullNodeID, 10) + } + } + return "fan-out reported OK=false with no per-member refusal — check leader logs" +} + // RegisterEncryptionWriter proposes a §11.3 0x03 OpRegistration // entry for the calling node's first encrypted-write epoch under // the supplied dek_id. The proto carries `repeated WriterBatch` diff --git a/adapter/encryption_admin_test.go b/adapter/encryption_admin_test.go index 324911d50..01dbffed9 100644 --- a/adapter/encryption_admin_test.go +++ b/adapter/encryption_admin_test.go @@ -3,10 +3,12 @@ package adapter import ( "context" "errors" + "math" "path/filepath" "strings" "testing" + "github.com/bootjp/elastickv/internal/admin" "github.com/bootjp/elastickv/internal/encryption" "github.com/bootjp/elastickv/internal/encryption/fsmwire" "github.com/bootjp/elastickv/internal/raftengine" @@ -266,6 +268,9 @@ func TestEncryptionAdmin_MutatingRPCs_RejectWithoutProposer(t *testing.T) { if _, err := srv.RegisterEncryptionWriter(ctx, validRegisterEncryptionWriterRequest()); status.Code(err) != codes.FailedPrecondition { t.Errorf("RegisterEncryptionWriter status=%v, want FailedPrecondition (no proposer wired)", status.Code(err)) } + if _, err := srv.EnableStorageEnvelope(ctx, validEnableStorageEnvelopeRequest()); status.Code(err) != codes.FailedPrecondition { + t.Errorf("EnableStorageEnvelope status=%v, want FailedPrecondition (no proposer wired)", status.Code(err)) + } } func validBootstrapEncryptionRequest() *pb.BootstrapEncryptionRequest { @@ -291,6 +296,13 @@ func validRotateDEKRequest() *pb.RotateDEKRequest { } } +func validEnableStorageEnvelopeRequest() *pb.EnableStorageEnvelopeRequest { + return &pb.EnableStorageEnvelopeRequest{ + ProposerNodeId: 11, + ProposerLocalEpoch: 7, + } +} + func validRegisterEncryptionWriterRequest() *pb.RegisterEncryptionWriterRequest { return &pb.RegisterEncryptionWriterRequest{ DekId: 5, @@ -1018,3 +1030,505 @@ func writeSidecarFixture(t *testing.T, sc *encryption.Sidecar) string { } return path } + +// fixedCapabilityFanout returns a closure that yields the supplied +// result regardless of context — lets tests drive the §4 fan-out +// branches deterministically without spinning real clients. A +// non-nil err exercises the "fan-out helper itself failed" path +// (§3.2 step 7 wraps as FailedPrecondition). +func fixedCapabilityFanout(result admin.CapabilityFanoutResult, err error) CapabilityFanoutFn { + return func(context.Context) (admin.CapabilityFanoutResult, error) { + return result, err + } +} + +// applyingProposer is a recordingProposer that also simulates the +// FSM apply by writing the supplied applyFn output into the +// sidecar before returning from Propose. The 6D-6 RPC re-reads +// the sidecar after Propose to discriminate fresh-success vs. +// stale-DEKID vs. concurrent-overlap; without simulating the +// apply, the RPC's post-read would always see the pre-cutover +// sidecar and incorrectly classify every propose as stale-DEKID. +type applyingProposer struct { + recordingProposer + sidecarPath string + applyFn func(*encryption.Sidecar, uint64) +} + +func (p *applyingProposer) Propose(ctx context.Context, data []byte) (*raftengine.ProposalResult, error) { + res, err := p.recordingProposer.Propose(ctx, data) + if err != nil || res == nil || p.applyFn == nil { + return res, err + } + sc, rerr := encryption.ReadSidecar(p.sidecarPath) + if rerr != nil { + return nil, rerr + } + p.applyFn(sc, res.CommitIndex) + if werr := encryption.WriteSidecar(p.sidecarPath, sc); werr != nil { + return nil, werr + } + return res, nil +} + +// applyCutover is the §6.4 fresh-success apply effect: flip +// StorageEnvelopeActive to true and stamp the cutover index with +// the apply's Raft index. Used by the EnableStorageEnvelope happy- +// path test to drive the post-Propose sidecar re-read into the +// fresh-success branch. +func applyCutover(sc *encryption.Sidecar, raftIdx uint64) { + sc.StorageEnvelopeActive = true + sc.StorageEnvelopeCutoverIndex = raftIdx + if raftIdx > sc.RaftAppliedIndex { + sc.RaftAppliedIndex = raftIdx + } +} + +// applyStaleDEKIDRace simulates the §2.1 #3 benign-no-op: the +// applier consumed the entry without flipping +// StorageEnvelopeActive (because a RotateDEK raced and advanced +// Active.Storage). Only RaftAppliedIndex advances. +func applyStaleDEKIDRace(sc *encryption.Sidecar, raftIdx uint64) { + if raftIdx > sc.RaftAppliedIndex { + sc.RaftAppliedIndex = raftIdx + } +} + +// allOKFanoutResult is the deterministic "fan-out approved" fixture +// the happy-path test feeds the cutover RPC. The build SHA is +// trivially distinct so the projection-to-proto check can assert +// the field actually flows through (a regression that dropped the +// SHA would otherwise pass on an empty-string comparison). +func allOKFanoutResult() admin.CapabilityFanoutResult { + return admin.CapabilityFanoutResult{ + Verdicts: []admin.CapabilityVerdict{ + {FullNodeID: 11, EncryptionCapable: true, BuildSHA: "build-n1", SidecarPresent: true, Reachable: true}, + {FullNodeID: 22, EncryptionCapable: true, BuildSHA: "build-n2", SidecarPresent: true, Reachable: true}, + }, + OK: true, + } +} + +// cutoverReadySidecarFixture writes a sidecar past Bootstrap but +// pre-cutover (Active.Storage != 0 AND StorageEnvelopeActive == +// false). Returns the path so the test can hand it to the server +// + re-read it afterwards to verify the apply effect. +func cutoverReadySidecarFixture(t *testing.T) string { + t.Helper() + return writeSidecarFixture(t, &encryption.Sidecar{ + Active: encryption.ActiveKeys{Storage: 5, Raft: 6}, + Keys: map[string]encryption.SidecarKey{ + "5": {Purpose: encryption.SidecarPurposeStorage, Wrapped: []byte("ws"), Created: "x", LocalEpoch: 0}, + "6": {Purpose: encryption.SidecarPurposeRaft, Wrapped: []byte("wr"), Created: "x", LocalEpoch: 0}, + }, + RaftAppliedIndex: 100, + }) +} + +// TestEncryptionAdmin_EnableStorageEnvelope_RejectsOnFollower +// pins the §3.2 step 3 leader gate. The leader hint must be +// embedded in the FailedPrecondition status detail so the +// operator's CLI can retry against the right node. +func TestEncryptionAdmin_EnableStorageEnvelope_RejectsOnFollower(t *testing.T) { + t.Parallel() + srv := NewEncryptionAdminServer( + WithEncryptionAdminProposer(&recordingProposer{}), + WithEncryptionAdminLeaderView(stubLeaderView{ + state: raftengine.StateFollower, + leader: raftengine.LeaderInfo{ID: "n2", Address: "127.0.0.1:50052"}, + }), + WithEncryptionAdminSidecarPath(cutoverReadySidecarFixture(t)), + ) + _, err := srv.EnableStorageEnvelope(context.Background(), validEnableStorageEnvelopeRequest()) + if status.Code(err) != codes.FailedPrecondition { + t.Errorf("EnableStorageEnvelope status=%v, want FailedPrecondition", status.Code(err)) + } + if err == nil || !strings.Contains(err.Error(), "n2") || !strings.Contains(err.Error(), "127.0.0.1:50052") { + t.Errorf("error %q does not embed the leader hint (id=n2 addr=127.0.0.1:50052)", err) + } +} + +// TestEncryptionAdmin_EnableStorageEnvelope_RejectsWithoutSidecarPath +// covers the §6.4 sidecar dependency: without a sidecar path +// wired, the RPC cannot read the pre-cutover state and refuses. +func TestEncryptionAdmin_EnableStorageEnvelope_RejectsWithoutSidecarPath(t *testing.T) { + t.Parallel() + srv := NewEncryptionAdminServer( + WithEncryptionAdminProposer(&recordingProposer{}), + WithEncryptionAdminLeaderView(stubLeaderView{state: raftengine.StateLeader}), + ) + _, err := srv.EnableStorageEnvelope(context.Background(), validEnableStorageEnvelopeRequest()) + if status.Code(err) != codes.FailedPrecondition { + t.Errorf("EnableStorageEnvelope status=%v, want FailedPrecondition", status.Code(err)) + } +} + +// TestEncryptionAdmin_EnableStorageEnvelope_RejectsZeroProposerNodeID +// pins §3.1 / §6.1: the not-capable sentinel (full_node_id=0) is +// rejected at the gRPC boundary before any sidecar read. +func TestEncryptionAdmin_EnableStorageEnvelope_RejectsZeroProposerNodeID(t *testing.T) { + t.Parallel() + srv := NewEncryptionAdminServer( + WithEncryptionAdminProposer(&recordingProposer{}), + WithEncryptionAdminLeaderView(stubLeaderView{state: raftengine.StateLeader}), + WithEncryptionAdminSidecarPath(cutoverReadySidecarFixture(t)), + ) + req := validEnableStorageEnvelopeRequest() + req.ProposerNodeId = 0 + _, err := srv.EnableStorageEnvelope(context.Background(), req) + if status.Code(err) != codes.InvalidArgument { + t.Errorf("EnableStorageEnvelope status=%v, want InvalidArgument", status.Code(err)) + } +} + +// TestEncryptionAdmin_EnableStorageEnvelope_RejectsOversizedLocalEpoch +// pins §3.1 / §4.1: the proto3 uint32 wire field MUST be <= 0xFFFF. +func TestEncryptionAdmin_EnableStorageEnvelope_RejectsOversizedLocalEpoch(t *testing.T) { + t.Parallel() + srv := NewEncryptionAdminServer( + WithEncryptionAdminProposer(&recordingProposer{}), + WithEncryptionAdminLeaderView(stubLeaderView{state: raftengine.StateLeader}), + WithEncryptionAdminSidecarPath(cutoverReadySidecarFixture(t)), + ) + req := validEnableStorageEnvelopeRequest() + req.ProposerLocalEpoch = math.MaxUint16 + 1 + _, err := srv.EnableStorageEnvelope(context.Background(), req) + if status.Code(err) != codes.InvalidArgument { + t.Errorf("EnableStorageEnvelope status=%v, want InvalidArgument", status.Code(err)) + } +} + +// TestEncryptionAdmin_EnableStorageEnvelope_RejectsNotBootstrapped +// pins §3.2 step 4: a sidecar with Active.Storage == 0 means +// BootstrapEncryption has not committed yet, so the cutover must +// refuse with a clear hint. +func TestEncryptionAdmin_EnableStorageEnvelope_RejectsNotBootstrapped(t *testing.T) { + t.Parallel() + path := writeSidecarFixture(t, &encryption.Sidecar{ + Active: encryption.ActiveKeys{Storage: 0, Raft: 0}, + Keys: map[string]encryption.SidecarKey{}, + }) + srv := NewEncryptionAdminServer( + WithEncryptionAdminProposer(&recordingProposer{}), + WithEncryptionAdminLeaderView(stubLeaderView{state: raftengine.StateLeader}), + WithEncryptionAdminSidecarPath(path), + ) + _, err := srv.EnableStorageEnvelope(context.Background(), validEnableStorageEnvelopeRequest()) + if status.Code(err) != codes.FailedPrecondition { + t.Errorf("EnableStorageEnvelope status=%v, want FailedPrecondition", status.Code(err)) + } + if err == nil || !strings.Contains(err.Error(), "BootstrapEncryption") { + t.Errorf("error %q does not hint at BootstrapEncryption", err) + } +} + +// TestEncryptionAdmin_EnableStorageEnvelope_IdempotentRetry pins +// §3.2 step 5 + §6.4: a duplicate call against an already-active +// sidecar returns OK with was_already_active=true and +// applied_index = sidecar.StorageEnvelopeCutoverIndex. No +// capability fan-out, no propose. The CapabilitySummary is +// intentionally empty so a caller cannot accidentally re-use +// the original cutover's membership view (which may no longer +// reflect current cluster shape). +func TestEncryptionAdmin_EnableStorageEnvelope_IdempotentRetry(t *testing.T) { + t.Parallel() + path := writeSidecarFixture(t, &encryption.Sidecar{ + Active: encryption.ActiveKeys{Storage: 5, Raft: 6}, + Keys: map[string]encryption.SidecarKey{"5": {Purpose: encryption.SidecarPurposeStorage, Wrapped: []byte("ws"), Created: "x", LocalEpoch: 0}, "6": {Purpose: encryption.SidecarPurposeRaft, Wrapped: []byte("wr"), Created: "x", LocalEpoch: 0}}, + StorageEnvelopeActive: true, + StorageEnvelopeCutoverIndex: 555, + RaftAppliedIndex: 900, + }) + proposer := &recordingProposer{} + srv := NewEncryptionAdminServer( + WithEncryptionAdminProposer(proposer), + WithEncryptionAdminLeaderView(stubLeaderView{state: raftengine.StateLeader}), + WithEncryptionAdminSidecarPath(path), + WithEncryptionAdminCapabilityFanout(fixedCapabilityFanout(allOKFanoutResult(), nil)), + ) + got, err := srv.EnableStorageEnvelope(context.Background(), validEnableStorageEnvelopeRequest()) + if err != nil { + t.Fatalf("EnableStorageEnvelope: %v", err) + } + if !got.WasAlreadyActive { + t.Error("WasAlreadyActive=false, want true (idempotent retry)") + } + if got.AppliedIndex != 555 { + t.Errorf("AppliedIndex=%d, want 555 (original StorageEnvelopeCutoverIndex)", got.AppliedIndex) + } + if got.CutoverIndexUnknown { + t.Error("CutoverIndexUnknown=true, want false (cutover index is set)") + } + if len(got.CapabilitySummary) != 0 { + t.Errorf("CapabilitySummary len=%d, want 0 (empty on idempotent retries)", len(got.CapabilitySummary)) + } + if len(proposer.calls) != 0 { + t.Errorf("proposer.calls len=%d, want 0 (no propose on idempotent retry)", len(proposer.calls)) + } +} + +// TestEncryptionAdmin_EnableStorageEnvelope_DefensiveCutoverIndexUnknown +// pins the §6.4 defensive branch: a sidecar reporting +// StorageEnvelopeActive=true paired with +// StorageEnvelopeCutoverIndex=0 is operationally impossible under +// the 6D-4 apply path but hedged against schema rollback / hand- +// edited sidecars. The RPC falls back to RaftAppliedIndex with +// CutoverIndexUnknown=true so operators see the warning. +func TestEncryptionAdmin_EnableStorageEnvelope_DefensiveCutoverIndexUnknown(t *testing.T) { + t.Parallel() + path := writeSidecarFixture(t, &encryption.Sidecar{ + Active: encryption.ActiveKeys{Storage: 5, Raft: 6}, + Keys: map[string]encryption.SidecarKey{"5": {Purpose: encryption.SidecarPurposeStorage, Wrapped: []byte("ws"), Created: "x", LocalEpoch: 0}, "6": {Purpose: encryption.SidecarPurposeRaft, Wrapped: []byte("wr"), Created: "x", LocalEpoch: 0}}, + StorageEnvelopeActive: true, + StorageEnvelopeCutoverIndex: 0, // operationally impossible; hedge for hand-edits + RaftAppliedIndex: 900, + }) + srv := NewEncryptionAdminServer( + WithEncryptionAdminProposer(&recordingProposer{}), + WithEncryptionAdminLeaderView(stubLeaderView{state: raftengine.StateLeader}), + WithEncryptionAdminSidecarPath(path), + WithEncryptionAdminCapabilityFanout(fixedCapabilityFanout(allOKFanoutResult(), nil)), + ) + got, err := srv.EnableStorageEnvelope(context.Background(), validEnableStorageEnvelopeRequest()) + if err != nil { + t.Fatalf("EnableStorageEnvelope: %v", err) + } + if !got.WasAlreadyActive { + t.Error("WasAlreadyActive=false, want true (sidecar reports active)") + } + if !got.CutoverIndexUnknown { + t.Error("CutoverIndexUnknown=false, want true (defensive branch)") + } + if got.AppliedIndex != 900 { + t.Errorf("AppliedIndex=%d, want 900 (RaftAppliedIndex fallback)", got.AppliedIndex) + } +} + +// TestEncryptionAdmin_EnableStorageEnvelope_RejectsWithoutCapabilityFanout +// pins the §4 pre-flight requirement: without the fan-out wired, +// the RPC refuses even on a fully-bootstrapped pre-cutover +// sidecar. The §6D-6 main.go wiring is what threads the fan-out +// in; tests that skip it deliberately must opt out via the +// idempotent-retry path (already-active sidecar). +func TestEncryptionAdmin_EnableStorageEnvelope_RejectsWithoutCapabilityFanout(t *testing.T) { + t.Parallel() + srv := NewEncryptionAdminServer( + WithEncryptionAdminProposer(&recordingProposer{}), + WithEncryptionAdminLeaderView(stubLeaderView{state: raftengine.StateLeader}), + WithEncryptionAdminSidecarPath(cutoverReadySidecarFixture(t)), + ) + _, err := srv.EnableStorageEnvelope(context.Background(), validEnableStorageEnvelopeRequest()) + if status.Code(err) != codes.FailedPrecondition { + t.Errorf("EnableStorageEnvelope status=%v, want FailedPrecondition (no fan-out wired)", status.Code(err)) + } +} + +// TestEncryptionAdmin_EnableStorageEnvelope_RejectsOnCapabilityRefusal +// pins §3.2 step 7: any fan-out verdict with +// EncryptionCapable=false or Reachable=false fails the pre-flight +// and the RPC refuses with FailedPrecondition. The status detail +// names the specific node so the operator's CLI can diagnose +// without trawling leader logs. +func TestEncryptionAdmin_EnableStorageEnvelope_RejectsOnCapabilityRefusal(t *testing.T) { + t.Parallel() + refusal := admin.CapabilityFanoutResult{ + Verdicts: []admin.CapabilityVerdict{ + {FullNodeID: 11, EncryptionCapable: true, BuildSHA: "build-n1", SidecarPresent: true, Reachable: true}, + {FullNodeID: 99, EncryptionCapable: false, BuildSHA: "build-n99", SidecarPresent: false, Reachable: true}, + }, + OK: false, + } + srv := NewEncryptionAdminServer( + WithEncryptionAdminProposer(&recordingProposer{}), + WithEncryptionAdminLeaderView(stubLeaderView{state: raftengine.StateLeader}), + WithEncryptionAdminSidecarPath(cutoverReadySidecarFixture(t)), + WithEncryptionAdminCapabilityFanout(fixedCapabilityFanout(refusal, nil)), + ) + _, err := srv.EnableStorageEnvelope(context.Background(), validEnableStorageEnvelopeRequest()) + if status.Code(err) != codes.FailedPrecondition { + t.Errorf("EnableStorageEnvelope status=%v, want FailedPrecondition", status.Code(err)) + } + if err == nil || !strings.Contains(err.Error(), "99") { + t.Errorf("error %q does not name the refusing node (full_node_id=99)", err) + } +} + +// TestEncryptionAdmin_EnableStorageEnvelope_RejectsOnFanoutError +// covers the §3.2 step 7 wrap-fallback: the fan-out helper itself +// erroring out (e.g., zero-member snapshot rejected by input +// validation) surfaces as FailedPrecondition rather than Internal, +// so the operator's CLI treats it as a configuration issue rather +// than a transient bug. +func TestEncryptionAdmin_EnableStorageEnvelope_RejectsOnFanoutError(t *testing.T) { + t.Parallel() + srv := NewEncryptionAdminServer( + WithEncryptionAdminProposer(&recordingProposer{}), + WithEncryptionAdminLeaderView(stubLeaderView{state: raftengine.StateLeader}), + WithEncryptionAdminSidecarPath(cutoverReadySidecarFixture(t)), + WithEncryptionAdminCapabilityFanout(fixedCapabilityFanout(admin.CapabilityFanoutResult{}, errors.New("fan-out: bad routes input"))), + ) + _, err := srv.EnableStorageEnvelope(context.Background(), validEnableStorageEnvelopeRequest()) + if status.Code(err) != codes.FailedPrecondition { + t.Errorf("EnableStorageEnvelope status=%v, want FailedPrecondition", status.Code(err)) + } +} + +// TestEncryptionAdmin_EnableStorageEnvelope_HappyPath drives the +// full §3.2 happy path: leader gate passes, fan-out approves, +// propose lands, post-apply sidecar shows the cutover effect, the +// response carries WasAlreadyActive=false + +// CapabilitySummary populated + AppliedIndex matching the +// proposed Raft index. The applying proposer also simulates the +// §6.4 apply (StorageEnvelopeActive=true, +// StorageEnvelopeCutoverIndex=raftIdx) so the post-Propose +// sidecar read takes the fresh-success branch rather than the +// stale-DEKID fallback. +func TestEncryptionAdmin_EnableStorageEnvelope_HappyPath(t *testing.T) { + t.Parallel() + path := cutoverReadySidecarFixture(t) + proposer := &applyingProposer{ + recordingProposer: recordingProposer{commitIndex: 1234}, + sidecarPath: path, + applyFn: applyCutover, + } + srv := NewEncryptionAdminServer( + WithEncryptionAdminProposer(proposer), + WithEncryptionAdminLeaderView(stubLeaderView{state: raftengine.StateLeader}), + WithEncryptionAdminSidecarPath(path), + WithEncryptionAdminCapabilityFanout(fixedCapabilityFanout(allOKFanoutResult(), nil)), + ) + got, err := srv.EnableStorageEnvelope(context.Background(), validEnableStorageEnvelopeRequest()) + if err != nil { + t.Fatalf("EnableStorageEnvelope: %v", err) + } + assertFreshCutoverResponse(t, got, 1234, 2) + // Wire-level pin: verify the proposed Raft entry decodes as + // SubTag=RotateSubEnableStorageEnvelope with the §2.1 + // constraints (Purpose=PurposeStorage, len(Wrapped)==0, + // DEKID=sidecar.Active.Storage). + assertSingleProposalOpcode(t, proposer.calls, fsmwire.OpRotation) + decoded, err := fsmwire.DecodeRotation(proposer.calls[0][1:]) + if err != nil { + t.Fatalf("DecodeRotation: %v", err) + } + assertCutoverProposalShape(t, decoded, 5, 11, 7) +} + +// assertFreshCutoverResponse pins the §3.2 happy-path response +// shape: WasAlreadyActive=false, CutoverIndexUnknown=false, +// AppliedIndex matches the proposed Raft index, capability +// summary populated with the expected verdict count. +func assertFreshCutoverResponse(t *testing.T, resp *pb.EnableStorageEnvelopeResponse, wantIdx uint64, wantVerdicts int) { + t.Helper() + if resp.WasAlreadyActive { + t.Error("WasAlreadyActive=true, want false (fresh cutover)") + } + if resp.CutoverIndexUnknown { + t.Error("CutoverIndexUnknown=true, want false") + } + if resp.AppliedIndex != wantIdx { + t.Errorf("AppliedIndex=%d, want %d (proposed Raft index)", resp.AppliedIndex, wantIdx) + } + if len(resp.CapabilitySummary) != wantVerdicts { + t.Errorf("CapabilitySummary len=%d, want %d", len(resp.CapabilitySummary), wantVerdicts) + } +} + +// assertCutoverProposalShape pins the §2.1 wire layout of a +// decoded cutover rotation: SubTag=RotateSubEnableStorageEnvelope, +// empty Wrapped, Purpose=PurposeStorage, ProposerRegistration +// matching the supplied identifiers. +func assertCutoverProposalShape(t *testing.T, decoded fsmwire.RotationPayload, wantDEKID uint32, wantNodeID uint64, wantEpoch uint16) { + t.Helper() + if decoded.SubTag != fsmwire.RotateSubEnableStorageEnvelope { + t.Errorf("SubTag=0x%02x, want RotateSubEnableStorageEnvelope (0x04)", decoded.SubTag) + } + if decoded.DEKID != wantDEKID { + t.Errorf("DEKID=%d, want %d (sidecar.Active.Storage)", decoded.DEKID, wantDEKID) + } + if decoded.Purpose != fsmwire.PurposeStorage { + t.Errorf("Purpose=%v, want PurposeStorage", decoded.Purpose) + } + if len(decoded.Wrapped) != 0 { + t.Errorf("Wrapped len=%d, want 0 (§2.1 constraint #2)", len(decoded.Wrapped)) + } + if decoded.ProposerRegistration.DEKID != wantDEKID || + decoded.ProposerRegistration.FullNodeID != wantNodeID || + decoded.ProposerRegistration.LocalEpoch != wantEpoch { + t.Errorf("ProposerRegistration=%+v, want {DEKID:%d FullNodeID:%d LocalEpoch:%d}", + decoded.ProposerRegistration, wantDEKID, wantNodeID, wantEpoch) + } +} + +// TestEncryptionAdmin_EnableStorageEnvelope_StaleDEKIDRace pins +// §2.1 #3: when a RotateDEK races between propose and apply, the +// 6D-4 applier consumes the cutover entry as a benign no-op +// (sidecar.StorageEnvelopeActive stays false), and the RPC's +// post-apply read takes the stale-DEKID branch. The RPC must +// surface FailedPrecondition with a retry hint rather than +// reporting fresh-success. +func TestEncryptionAdmin_EnableStorageEnvelope_StaleDEKIDRace(t *testing.T) { + t.Parallel() + path := cutoverReadySidecarFixture(t) + proposer := &applyingProposer{ + recordingProposer: recordingProposer{commitIndex: 1234}, + sidecarPath: path, + applyFn: applyStaleDEKIDRace, // does NOT flip StorageEnvelopeActive + } + srv := NewEncryptionAdminServer( + WithEncryptionAdminProposer(proposer), + WithEncryptionAdminLeaderView(stubLeaderView{state: raftengine.StateLeader}), + WithEncryptionAdminSidecarPath(path), + WithEncryptionAdminCapabilityFanout(fixedCapabilityFanout(allOKFanoutResult(), nil)), + ) + _, err := srv.EnableStorageEnvelope(context.Background(), validEnableStorageEnvelopeRequest()) + if status.Code(err) != codes.FailedPrecondition { + t.Errorf("EnableStorageEnvelope status=%v, want FailedPrecondition (stale DEKID race)", status.Code(err)) + } + if err == nil || !strings.Contains(err.Error(), "RotateDEK") { + t.Errorf("error %q does not hint at the RotateDEK race", err) + } +} + +// TestEncryptionAdmin_EnableStorageEnvelope_CapabilitySummaryProjection +// pins the wire-shape mapping from internal CapabilityVerdict to +// proto.CapabilityVerdict. A regression that re-orders or drops +// any field would show up here rather than at the user-facing +// RPC client. +func TestEncryptionAdmin_EnableStorageEnvelope_CapabilitySummaryProjection(t *testing.T) { + t.Parallel() + path := cutoverReadySidecarFixture(t) + proposer := &applyingProposer{ + recordingProposer: recordingProposer{commitIndex: 2025}, + sidecarPath: path, + applyFn: applyCutover, + } + srv := NewEncryptionAdminServer( + WithEncryptionAdminProposer(proposer), + WithEncryptionAdminLeaderView(stubLeaderView{state: raftengine.StateLeader}), + WithEncryptionAdminSidecarPath(path), + WithEncryptionAdminCapabilityFanout(fixedCapabilityFanout(allOKFanoutResult(), nil)), + ) + got, err := srv.EnableStorageEnvelope(context.Background(), validEnableStorageEnvelopeRequest()) + if err != nil { + t.Fatalf("EnableStorageEnvelope: %v", err) + } + if len(got.CapabilitySummary) != 2 { + t.Fatalf("CapabilitySummary len=%d, want 2", len(got.CapabilitySummary)) + } + assertProtoVerdict(t, got.CapabilitySummary[0], 11, "build-n1") + assertProtoVerdict(t, got.CapabilitySummary[1], 22, "build-n2") +} + +// assertProtoVerdict pins one row of the projected +// proto.CapabilityVerdict slice. All fan-out OK-path verdicts +// share the same shape (EncryptionCapable=true, SidecarPresent=true, +// Reachable not projected); the per-row assertion captures only +// the variable identity fields so the caller test stays terse. +func assertProtoVerdict(t *testing.T, v *pb.CapabilityVerdict, wantNodeID uint64, wantBuildSHA string) { + t.Helper() + if v.FullNodeId != wantNodeID || v.BuildSha != wantBuildSHA || !v.EncryptionCapable || !v.SidecarPresent { + t.Errorf("verdict=%+v, want {FullNodeId:%d BuildSha:%s EncryptionCapable:true SidecarPresent:true}", + v, wantNodeID, wantBuildSHA) + } +} diff --git a/docs/design/2026_05_18_partial_6d_enable_storage_envelope.md b/docs/design/2026_05_18_partial_6d_enable_storage_envelope.md index 3498b4a84..dd96bf0e3 100644 --- a/docs/design/2026_05_18_partial_6d_enable_storage_envelope.md +++ b/docs/design/2026_05_18_partial_6d_enable_storage_envelope.md @@ -2,7 +2,7 @@ | Field | Value | |---|---| -| Status | partial — 6D-1 (doc), 6D-2 (startup guards), 6D-3 (capability fan-out helper), 6D-4 (cutover wire + apply dispatch), 6D-5 (storage-layer toggle) shipped; 6D-6 remains | +| Status | partial — 6D-1 (doc), 6D-2 (startup guards), 6D-3 (capability fan-out helper), 6D-4 (cutover wire + apply dispatch), 6D-5 (storage-layer toggle), 6D-6a (EnableStorageEnvelope server method) shipped; 6D-6b (CLI), 6D-6c (main.go wiring + integration test) remain | | Date | 2026-05-18 | | Parent design | [`2026_04_29_partial_data_at_rest_encryption.md`](2026_04_29_partial_data_at_rest_encryption.md) | | Blockers (now satisfied) | 6B (KEK plumbing), 6C-1 / 6C-2 (startup guards), 6C-2d (`ErrSidecarBehindRaftLog` wiring) | @@ -45,8 +45,28 @@ pre-6D-6 production wiring keep working unchanged. Operator- inert until 6D-6 wires both the cipher and the gate in main.go and exposes the cutover RPC. +- **6D-6a** (EnableStorageEnvelope server method) — + `proto/encryption_admin.proto` adds the `EnableStorageEnvelope` + RPC + `EnableStorageEnvelopeRequest` / `Response` + + `CapabilityVerdict` messages. `adapter/encryption_admin.go` + ships the server method that composes the §3.2 sequence: leader + gate → input validation → sidecar read → bootstrap gate → + idempotent-retry short-circuit (§6.4) → capability fan-out + (6D-3) → propose RotateSubEnableStorageEnvelope through Raft + (6D-4 wire) → post-apply re-read discriminating fresh-success + vs. stale-DEKID race. The 6D-6b CLI and 6D-6c main.go wiring + + integration test slice on top of this server method. ## Open milestones + +- **6D-6b** — `elastickv-admin enable-storage-envelope` CLI + subcommand that drives the server method end-to-end. +- **6D-6c** — main.go production wiring: cipher + WithEncryption + + WithStorageEnvelopeGate threaded from the sidecar, plus the + CapabilityFanout closure bound to the live Raft membership + view. End-to-end integration test exercises a single-node + cluster Bootstrap → EnableStorageEnvelope → Put → read-back- + via-envelope. - **6D-6** — `EnableStorageEnvelope` admin RPC + CLI command + integration test composing 6D-3 + 6D-4 + 6D-5. diff --git a/proto/Makefile b/proto/Makefile index b2de0a67e..35abb0dc6 100644 --- a/proto/Makefile +++ b/proto/Makefile @@ -43,3 +43,6 @@ gen: check-tools protoc --go_out=. --go_opt=paths=source_relative \ --go-grpc_out=. --go-grpc_opt=paths=source_relative \ admin_forward.proto + protoc --go_out=. --go_opt=paths=source_relative \ + --go-grpc_out=. --go-grpc_opt=paths=source_relative \ + encryption_admin.proto diff --git a/proto/encryption_admin.pb.go b/proto/encryption_admin.pb.go index b694c0dd3..050db57ef 100644 --- a/proto/encryption_admin.pb.go +++ b/proto/encryption_admin.pb.go @@ -818,6 +818,240 @@ func (x *ResyncSidecarResponse) GetWriterRegistryForCaller() map[uint32]uint32 { return nil } +// EnableStorageEnvelopeRequest proposes the §7.1 Phase 1 cutover +// that flips the cluster from cleartext storage writes to §4.1 +// envelope writes. Defined in the 6D design doc §3.1; the server +// composes a RotationPayload with SubTag = +// RotateSubEnableStorageEnvelope (0x04) and routes it through the +// default Raft group's leader as a §11.3 0x05 OpRotation entry. +// +// proposer_node_id MUST be non-zero (the §6.1 "not-capable" +// sentinel is rejected at the server boundary, matching the +// existing RotateDEK / BootstrapEncryption posture). +// +// proposer_local_epoch carries the §4.1 16-bit nonce field as +// uint32 (proto3 has no uint16); values above 0xFFFF are +// rejected at the server boundary before any Raft proposal is +// composed. ApplyRotation re-validates at apply time +// (defense-in-depth). +type EnableStorageEnvelopeRequest struct { + state protoimpl.MessageState `protogen:"open.v1"` + ProposerNodeId uint64 `protobuf:"varint,1,opt,name=proposer_node_id,json=proposerNodeId,proto3" json:"proposer_node_id,omitempty"` + ProposerLocalEpoch uint32 `protobuf:"varint,2,opt,name=proposer_local_epoch,json=proposerLocalEpoch,proto3" json:"proposer_local_epoch,omitempty"` // MUST be <= 0xFFFF on the wire. + unknownFields protoimpl.UnknownFields + sizeCache protoimpl.SizeCache +} + +func (x *EnableStorageEnvelopeRequest) Reset() { + *x = EnableStorageEnvelopeRequest{} + mi := &file_encryption_admin_proto_msgTypes[12] + ms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x)) + ms.StoreMessageInfo(mi) +} + +func (x *EnableStorageEnvelopeRequest) String() string { + return protoimpl.X.MessageStringOf(x) +} + +func (*EnableStorageEnvelopeRequest) ProtoMessage() {} + +func (x *EnableStorageEnvelopeRequest) ProtoReflect() protoreflect.Message { + mi := &file_encryption_admin_proto_msgTypes[12] + if x != nil { + ms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x)) + if ms.LoadMessageInfo() == nil { + ms.StoreMessageInfo(mi) + } + return ms + } + return mi.MessageOf(x) +} + +// Deprecated: Use EnableStorageEnvelopeRequest.ProtoReflect.Descriptor instead. +func (*EnableStorageEnvelopeRequest) Descriptor() ([]byte, []int) { + return file_encryption_admin_proto_rawDescGZIP(), []int{12} +} + +func (x *EnableStorageEnvelopeRequest) GetProposerNodeId() uint64 { + if x != nil { + return x.ProposerNodeId + } + return 0 +} + +func (x *EnableStorageEnvelopeRequest) GetProposerLocalEpoch() uint32 { + if x != nil { + return x.ProposerLocalEpoch + } + return 0 +} + +// EnableStorageEnvelopeResponse reports the outcome of a cutover +// proposal. The §6.4 idempotency contract is encoded as an OK +// status with `was_already_active = true` (NOT AlreadyExists, +// since unary gRPC drops the response body on non-OK status; the +// applied_index must be reachable on the success path). +// +// On a fresh cutover (was_already_active == false), applied_index +// is the Raft index of the entry the leader just proposed and +// waited to apply. On a retried call, applied_index is the +// recorded sidecar.StorageEnvelopeCutoverIndex from the ORIGINAL +// cutover (§6.4) — stable across arbitrary subsequent +// encryption-relevant Raft activity. +// +// capability_summary records which (full_node_id) members were +// probed during the pre-flight gate and what they reported. +// Empty on idempotent retries (was_already_active=true); the +// membership view of the original cutover is not retained. +// +// cutover_index_unknown is the §6.4 defensive fallback: it only +// fires if a sidecar reports StorageEnvelopeActive=true with +// StorageEnvelopeCutoverIndex=0 (operationally impossible under +// normal apply, but hedged against future schema rollback). On +// healthy clusters this stays false. The field is only +// meaningful when was_already_active=true. +type EnableStorageEnvelopeResponse struct { + state protoimpl.MessageState `protogen:"open.v1"` + AppliedIndex uint64 `protobuf:"varint,1,opt,name=applied_index,json=appliedIndex,proto3" json:"applied_index,omitempty"` + CapabilitySummary []*CapabilityVerdict `protobuf:"bytes,2,rep,name=capability_summary,json=capabilitySummary,proto3" json:"capability_summary,omitempty"` + CutoverIndexUnknown bool `protobuf:"varint,3,opt,name=cutover_index_unknown,json=cutoverIndexUnknown,proto3" json:"cutover_index_unknown,omitempty"` + WasAlreadyActive bool `protobuf:"varint,4,opt,name=was_already_active,json=wasAlreadyActive,proto3" json:"was_already_active,omitempty"` + unknownFields protoimpl.UnknownFields + sizeCache protoimpl.SizeCache +} + +func (x *EnableStorageEnvelopeResponse) Reset() { + *x = EnableStorageEnvelopeResponse{} + mi := &file_encryption_admin_proto_msgTypes[13] + ms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x)) + ms.StoreMessageInfo(mi) +} + +func (x *EnableStorageEnvelopeResponse) String() string { + return protoimpl.X.MessageStringOf(x) +} + +func (*EnableStorageEnvelopeResponse) ProtoMessage() {} + +func (x *EnableStorageEnvelopeResponse) ProtoReflect() protoreflect.Message { + mi := &file_encryption_admin_proto_msgTypes[13] + if x != nil { + ms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x)) + if ms.LoadMessageInfo() == nil { + ms.StoreMessageInfo(mi) + } + return ms + } + return mi.MessageOf(x) +} + +// Deprecated: Use EnableStorageEnvelopeResponse.ProtoReflect.Descriptor instead. +func (*EnableStorageEnvelopeResponse) Descriptor() ([]byte, []int) { + return file_encryption_admin_proto_rawDescGZIP(), []int{13} +} + +func (x *EnableStorageEnvelopeResponse) GetAppliedIndex() uint64 { + if x != nil { + return x.AppliedIndex + } + return 0 +} + +func (x *EnableStorageEnvelopeResponse) GetCapabilitySummary() []*CapabilityVerdict { + if x != nil { + return x.CapabilitySummary + } + return nil +} + +func (x *EnableStorageEnvelopeResponse) GetCutoverIndexUnknown() bool { + if x != nil { + return x.CutoverIndexUnknown + } + return false +} + +func (x *EnableStorageEnvelopeResponse) GetWasAlreadyActive() bool { + if x != nil { + return x.WasAlreadyActive + } + return false +} + +// CapabilityVerdict is one row of the §4 fan-out summary the +// cutover RPC returns. full_node_id is the route member the leader +// probed; the remaining fields mirror the corresponding member's +// CapabilityReport. A cluster where any verdict has +// encryption_capable=false MUST NOT reach the propose step; the +// summary in the response is the post-hoc record for operators. +type CapabilityVerdict struct { + state protoimpl.MessageState `protogen:"open.v1"` + FullNodeId uint64 `protobuf:"varint,1,opt,name=full_node_id,json=fullNodeId,proto3" json:"full_node_id,omitempty"` + EncryptionCapable bool `protobuf:"varint,2,opt,name=encryption_capable,json=encryptionCapable,proto3" json:"encryption_capable,omitempty"` + BuildSha string `protobuf:"bytes,3,opt,name=build_sha,json=buildSha,proto3" json:"build_sha,omitempty"` + SidecarPresent bool `protobuf:"varint,4,opt,name=sidecar_present,json=sidecarPresent,proto3" json:"sidecar_present,omitempty"` + unknownFields protoimpl.UnknownFields + sizeCache protoimpl.SizeCache +} + +func (x *CapabilityVerdict) Reset() { + *x = CapabilityVerdict{} + mi := &file_encryption_admin_proto_msgTypes[14] + ms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x)) + ms.StoreMessageInfo(mi) +} + +func (x *CapabilityVerdict) String() string { + return protoimpl.X.MessageStringOf(x) +} + +func (*CapabilityVerdict) ProtoMessage() {} + +func (x *CapabilityVerdict) ProtoReflect() protoreflect.Message { + mi := &file_encryption_admin_proto_msgTypes[14] + if x != nil { + ms := protoimpl.X.MessageStateOf(protoimpl.Pointer(x)) + if ms.LoadMessageInfo() == nil { + ms.StoreMessageInfo(mi) + } + return ms + } + return mi.MessageOf(x) +} + +// Deprecated: Use CapabilityVerdict.ProtoReflect.Descriptor instead. +func (*CapabilityVerdict) Descriptor() ([]byte, []int) { + return file_encryption_admin_proto_rawDescGZIP(), []int{14} +} + +func (x *CapabilityVerdict) GetFullNodeId() uint64 { + if x != nil { + return x.FullNodeId + } + return 0 +} + +func (x *CapabilityVerdict) GetEncryptionCapable() bool { + if x != nil { + return x.EncryptionCapable + } + return false +} + +func (x *CapabilityVerdict) GetBuildSha() string { + if x != nil { + return x.BuildSha + } + return "" +} + +func (x *CapabilityVerdict) GetSidecarPresent() bool { + if x != nil { + return x.SidecarPresent + } + return false +} + var File_encryption_admin_proto protoreflect.FileDescriptor const file_encryption_admin_proto_rawDesc = "" + @@ -890,14 +1124,29 @@ const file_encryption_admin_proto_rawDesc = "" + "\x05value\x18\x02 \x01(\fR\x05value:\x028\x01\x1aJ\n" + "\x1cWriterRegistryForCallerEntry\x12\x10\n" + "\x03key\x18\x01 \x01(\rR\x03key\x12\x14\n" + - "\x05value\x18\x02 \x01(\rR\x05value:\x028\x012\xa0\x03\n" + + "\x05value\x18\x02 \x01(\rR\x05value:\x028\x01\"z\n" + + "\x1cEnableStorageEnvelopeRequest\x12(\n" + + "\x10proposer_node_id\x18\x01 \x01(\x04R\x0eproposerNodeId\x120\n" + + "\x14proposer_local_epoch\x18\x02 \x01(\rR\x12proposerLocalEpoch\"\xe9\x01\n" + + "\x1dEnableStorageEnvelopeResponse\x12#\n" + + "\rapplied_index\x18\x01 \x01(\x04R\fappliedIndex\x12A\n" + + "\x12capability_summary\x18\x02 \x03(\v2\x12.CapabilityVerdictR\x11capabilitySummary\x122\n" + + "\x15cutover_index_unknown\x18\x03 \x01(\bR\x13cutoverIndexUnknown\x12,\n" + + "\x12was_already_active\x18\x04 \x01(\bR\x10wasAlreadyActive\"\xaa\x01\n" + + "\x11CapabilityVerdict\x12 \n" + + "\ffull_node_id\x18\x01 \x01(\x04R\n" + + "fullNodeId\x12-\n" + + "\x12encryption_capable\x18\x02 \x01(\bR\x11encryptionCapable\x12\x1b\n" + + "\tbuild_sha\x18\x03 \x01(\tR\bbuildSha\x12'\n" + + "\x0fsidecar_present\x18\x04 \x01(\bR\x0esidecarPresent2\xfa\x03\n" + "\x0fEncryptionAdmin\x12,\n" + "\rGetCapability\x12\x06.Empty\x1a\x11.CapabilityReport\"\x00\x120\n" + "\x0fGetSidecarState\x12\x06.Empty\x1a\x13.SidecarStateReport\"\x00\x12R\n" + "\x13BootstrapEncryption\x12\x1b.BootstrapEncryptionRequest\x1a\x1c.BootstrapEncryptionResponse\"\x00\x124\n" + "\tRotateDEK\x12\x11.RotateDEKRequest\x1a\x12.RotateDEKResponse\"\x00\x12a\n" + "\x18RegisterEncryptionWriter\x12 .RegisterEncryptionWriterRequest\x1a!.RegisterEncryptionWriterResponse\"\x00\x12@\n" + - "\rResyncSidecar\x12\x15.ResyncSidecarRequest\x1a\x16.ResyncSidecarResponse\"\x00B#Z!github.com/bootjp/elastickv/protob\x06proto3" + "\rResyncSidecar\x12\x15.ResyncSidecarRequest\x1a\x16.ResyncSidecarResponse\"\x00\x12X\n" + + "\x15EnableStorageEnvelope\x12\x1d.EnableStorageEnvelopeRequest\x1a\x1e.EnableStorageEnvelopeResponse\"\x00B#Z!github.com/bootjp/elastickv/protob\x06proto3" var ( file_encryption_admin_proto_rawDescOnce sync.Once @@ -912,7 +1161,7 @@ func file_encryption_admin_proto_rawDescGZIP() []byte { } var file_encryption_admin_proto_enumTypes = make([]protoimpl.EnumInfo, 1) -var file_encryption_admin_proto_msgTypes = make([]protoimpl.MessageInfo, 16) +var file_encryption_admin_proto_msgTypes = make([]protoimpl.MessageInfo, 19) var file_encryption_admin_proto_goTypes = []any{ (RotateDEKRequest_Purpose)(0), // 0: RotateDEKRequest.Purpose (*Empty)(nil), // 1: Empty @@ -927,36 +1176,42 @@ var file_encryption_admin_proto_goTypes = []any{ (*RegisterEncryptionWriterResponse)(nil), // 10: RegisterEncryptionWriterResponse (*ResyncSidecarRequest)(nil), // 11: ResyncSidecarRequest (*ResyncSidecarResponse)(nil), // 12: ResyncSidecarResponse - nil, // 13: SidecarStateReport.WrappedDeksByIdEntry - nil, // 14: SidecarStateReport.WriterRegistryForCallerEntry - nil, // 15: ResyncSidecarResponse.WrappedDeksByIdEntry - nil, // 16: ResyncSidecarResponse.WriterRegistryForCallerEntry + (*EnableStorageEnvelopeRequest)(nil), // 13: EnableStorageEnvelopeRequest + (*EnableStorageEnvelopeResponse)(nil), // 14: EnableStorageEnvelopeResponse + (*CapabilityVerdict)(nil), // 15: CapabilityVerdict + nil, // 16: SidecarStateReport.WrappedDeksByIdEntry + nil, // 17: SidecarStateReport.WriterRegistryForCallerEntry + nil, // 18: ResyncSidecarResponse.WrappedDeksByIdEntry + nil, // 19: ResyncSidecarResponse.WriterRegistryForCallerEntry } var file_encryption_admin_proto_depIdxs = []int32{ - 13, // 0: SidecarStateReport.wrapped_deks_by_id:type_name -> SidecarStateReport.WrappedDeksByIdEntry - 14, // 1: SidecarStateReport.writer_registry_for_caller:type_name -> SidecarStateReport.WriterRegistryForCallerEntry + 16, // 0: SidecarStateReport.wrapped_deks_by_id:type_name -> SidecarStateReport.WrappedDeksByIdEntry + 17, // 1: SidecarStateReport.writer_registry_for_caller:type_name -> SidecarStateReport.WriterRegistryForCallerEntry 4, // 2: BootstrapEncryptionRequest.writer_batch:type_name -> WriterRegistryEntry 0, // 3: RotateDEKRequest.purpose:type_name -> RotateDEKRequest.Purpose 4, // 4: RegisterEncryptionWriterRequest.writers:type_name -> WriterRegistryEntry - 15, // 5: ResyncSidecarResponse.wrapped_deks_by_id:type_name -> ResyncSidecarResponse.WrappedDeksByIdEntry - 16, // 6: ResyncSidecarResponse.writer_registry_for_caller:type_name -> ResyncSidecarResponse.WriterRegistryForCallerEntry - 1, // 7: EncryptionAdmin.GetCapability:input_type -> Empty - 1, // 8: EncryptionAdmin.GetSidecarState:input_type -> Empty - 5, // 9: EncryptionAdmin.BootstrapEncryption:input_type -> BootstrapEncryptionRequest - 7, // 10: EncryptionAdmin.RotateDEK:input_type -> RotateDEKRequest - 9, // 11: EncryptionAdmin.RegisterEncryptionWriter:input_type -> RegisterEncryptionWriterRequest - 11, // 12: EncryptionAdmin.ResyncSidecar:input_type -> ResyncSidecarRequest - 2, // 13: EncryptionAdmin.GetCapability:output_type -> CapabilityReport - 3, // 14: EncryptionAdmin.GetSidecarState:output_type -> SidecarStateReport - 6, // 15: EncryptionAdmin.BootstrapEncryption:output_type -> BootstrapEncryptionResponse - 8, // 16: EncryptionAdmin.RotateDEK:output_type -> RotateDEKResponse - 10, // 17: EncryptionAdmin.RegisterEncryptionWriter:output_type -> RegisterEncryptionWriterResponse - 12, // 18: EncryptionAdmin.ResyncSidecar:output_type -> ResyncSidecarResponse - 13, // [13:19] is the sub-list for method output_type - 7, // [7:13] is the sub-list for method input_type - 7, // [7:7] is the sub-list for extension type_name - 7, // [7:7] is the sub-list for extension extendee - 0, // [0:7] is the sub-list for field type_name + 18, // 5: ResyncSidecarResponse.wrapped_deks_by_id:type_name -> ResyncSidecarResponse.WrappedDeksByIdEntry + 19, // 6: ResyncSidecarResponse.writer_registry_for_caller:type_name -> ResyncSidecarResponse.WriterRegistryForCallerEntry + 15, // 7: EnableStorageEnvelopeResponse.capability_summary:type_name -> CapabilityVerdict + 1, // 8: EncryptionAdmin.GetCapability:input_type -> Empty + 1, // 9: EncryptionAdmin.GetSidecarState:input_type -> Empty + 5, // 10: EncryptionAdmin.BootstrapEncryption:input_type -> BootstrapEncryptionRequest + 7, // 11: EncryptionAdmin.RotateDEK:input_type -> RotateDEKRequest + 9, // 12: EncryptionAdmin.RegisterEncryptionWriter:input_type -> RegisterEncryptionWriterRequest + 11, // 13: EncryptionAdmin.ResyncSidecar:input_type -> ResyncSidecarRequest + 13, // 14: EncryptionAdmin.EnableStorageEnvelope:input_type -> EnableStorageEnvelopeRequest + 2, // 15: EncryptionAdmin.GetCapability:output_type -> CapabilityReport + 3, // 16: EncryptionAdmin.GetSidecarState:output_type -> SidecarStateReport + 6, // 17: EncryptionAdmin.BootstrapEncryption:output_type -> BootstrapEncryptionResponse + 8, // 18: EncryptionAdmin.RotateDEK:output_type -> RotateDEKResponse + 10, // 19: EncryptionAdmin.RegisterEncryptionWriter:output_type -> RegisterEncryptionWriterResponse + 12, // 20: EncryptionAdmin.ResyncSidecar:output_type -> ResyncSidecarResponse + 14, // 21: EncryptionAdmin.EnableStorageEnvelope:output_type -> EnableStorageEnvelopeResponse + 15, // [15:22] is the sub-list for method output_type + 8, // [8:15] is the sub-list for method input_type + 8, // [8:8] is the sub-list for extension type_name + 8, // [8:8] is the sub-list for extension extendee + 0, // [0:8] is the sub-list for field type_name } func init() { file_encryption_admin_proto_init() } @@ -970,7 +1225,7 @@ func file_encryption_admin_proto_init() { GoPackagePath: reflect.TypeOf(x{}).PkgPath(), RawDescriptor: unsafe.Slice(unsafe.StringData(file_encryption_admin_proto_rawDesc), len(file_encryption_admin_proto_rawDesc)), NumEnums: 1, - NumMessages: 16, + NumMessages: 19, NumExtensions: 0, NumServices: 1, }, diff --git a/proto/encryption_admin.proto b/proto/encryption_admin.proto index af90bbb54..caf1e02ae 100644 --- a/proto/encryption_admin.proto +++ b/proto/encryption_admin.proto @@ -49,6 +49,7 @@ service EncryptionAdmin { rpc RotateDEK (RotateDEKRequest) returns (RotateDEKResponse) {} rpc RegisterEncryptionWriter (RegisterEncryptionWriterRequest) returns (RegisterEncryptionWriterResponse) {} rpc ResyncSidecar (ResyncSidecarRequest) returns (ResyncSidecarResponse) {} + rpc EnableStorageEnvelope (EnableStorageEnvelopeRequest) returns (EnableStorageEnvelopeResponse) {} } message Empty {} @@ -162,3 +163,68 @@ message ResyncSidecarResponse { // recovering before any DEK exists has nothing to re-derive. map writer_registry_for_caller = 5; } + +// EnableStorageEnvelopeRequest proposes the §7.1 Phase 1 cutover +// that flips the cluster from cleartext storage writes to §4.1 +// envelope writes. Defined in the 6D design doc §3.1; the server +// composes a RotationPayload with SubTag = +// RotateSubEnableStorageEnvelope (0x04) and routes it through the +// default Raft group's leader as a §11.3 0x05 OpRotation entry. +// +// proposer_node_id MUST be non-zero (the §6.1 "not-capable" +// sentinel is rejected at the server boundary, matching the +// existing RotateDEK / BootstrapEncryption posture). +// +// proposer_local_epoch carries the §4.1 16-bit nonce field as +// uint32 (proto3 has no uint16); values above 0xFFFF are +// rejected at the server boundary before any Raft proposal is +// composed. ApplyRotation re-validates at apply time +// (defense-in-depth). +message EnableStorageEnvelopeRequest { + uint64 proposer_node_id = 1; + uint32 proposer_local_epoch = 2; // MUST be <= 0xFFFF on the wire. +} + +// EnableStorageEnvelopeResponse reports the outcome of a cutover +// proposal. The §6.4 idempotency contract is encoded as an OK +// status with `was_already_active = true` (NOT AlreadyExists, +// since unary gRPC drops the response body on non-OK status; the +// applied_index must be reachable on the success path). +// +// On a fresh cutover (was_already_active == false), applied_index +// is the Raft index of the entry the leader just proposed and +// waited to apply. On a retried call, applied_index is the +// recorded sidecar.StorageEnvelopeCutoverIndex from the ORIGINAL +// cutover (§6.4) — stable across arbitrary subsequent +// encryption-relevant Raft activity. +// +// capability_summary records which (full_node_id) members were +// probed during the pre-flight gate and what they reported. +// Empty on idempotent retries (was_already_active=true); the +// membership view of the original cutover is not retained. +// +// cutover_index_unknown is the §6.4 defensive fallback: it only +// fires if a sidecar reports StorageEnvelopeActive=true with +// StorageEnvelopeCutoverIndex=0 (operationally impossible under +// normal apply, but hedged against future schema rollback). On +// healthy clusters this stays false. The field is only +// meaningful when was_already_active=true. +message EnableStorageEnvelopeResponse { + uint64 applied_index = 1; + repeated CapabilityVerdict capability_summary = 2; + bool cutover_index_unknown = 3; + bool was_already_active = 4; +} + +// CapabilityVerdict is one row of the §4 fan-out summary the +// cutover RPC returns. full_node_id is the route member the leader +// probed; the remaining fields mirror the corresponding member's +// CapabilityReport. A cluster where any verdict has +// encryption_capable=false MUST NOT reach the propose step; the +// summary in the response is the post-hoc record for operators. +message CapabilityVerdict { + uint64 full_node_id = 1; + bool encryption_capable = 2; + string build_sha = 3; + bool sidecar_present = 4; +} diff --git a/proto/encryption_admin_grpc.pb.go b/proto/encryption_admin_grpc.pb.go index b6126eed0..1183a5e79 100644 --- a/proto/encryption_admin_grpc.pb.go +++ b/proto/encryption_admin_grpc.pb.go @@ -25,6 +25,7 @@ const ( EncryptionAdmin_RotateDEK_FullMethodName = "/EncryptionAdmin/RotateDEK" EncryptionAdmin_RegisterEncryptionWriter_FullMethodName = "/EncryptionAdmin/RegisterEncryptionWriter" EncryptionAdmin_ResyncSidecar_FullMethodName = "/EncryptionAdmin/ResyncSidecar" + EncryptionAdmin_EnableStorageEnvelope_FullMethodName = "/EncryptionAdmin/EnableStorageEnvelope" ) // EncryptionAdminClient is the client API for EncryptionAdmin service. @@ -78,6 +79,7 @@ type EncryptionAdminClient interface { RotateDEK(ctx context.Context, in *RotateDEKRequest, opts ...grpc.CallOption) (*RotateDEKResponse, error) RegisterEncryptionWriter(ctx context.Context, in *RegisterEncryptionWriterRequest, opts ...grpc.CallOption) (*RegisterEncryptionWriterResponse, error) ResyncSidecar(ctx context.Context, in *ResyncSidecarRequest, opts ...grpc.CallOption) (*ResyncSidecarResponse, error) + EnableStorageEnvelope(ctx context.Context, in *EnableStorageEnvelopeRequest, opts ...grpc.CallOption) (*EnableStorageEnvelopeResponse, error) } type encryptionAdminClient struct { @@ -148,6 +150,16 @@ func (c *encryptionAdminClient) ResyncSidecar(ctx context.Context, in *ResyncSid return out, nil } +func (c *encryptionAdminClient) EnableStorageEnvelope(ctx context.Context, in *EnableStorageEnvelopeRequest, opts ...grpc.CallOption) (*EnableStorageEnvelopeResponse, error) { + cOpts := append([]grpc.CallOption{grpc.StaticMethod()}, opts...) + out := new(EnableStorageEnvelopeResponse) + err := c.cc.Invoke(ctx, EncryptionAdmin_EnableStorageEnvelope_FullMethodName, in, out, cOpts...) + if err != nil { + return nil, err + } + return out, nil +} + // EncryptionAdminServer is the server API for EncryptionAdmin service. // All implementations must embed UnimplementedEncryptionAdminServer // for forward compatibility. @@ -199,6 +211,7 @@ type EncryptionAdminServer interface { RotateDEK(context.Context, *RotateDEKRequest) (*RotateDEKResponse, error) RegisterEncryptionWriter(context.Context, *RegisterEncryptionWriterRequest) (*RegisterEncryptionWriterResponse, error) ResyncSidecar(context.Context, *ResyncSidecarRequest) (*ResyncSidecarResponse, error) + EnableStorageEnvelope(context.Context, *EnableStorageEnvelopeRequest) (*EnableStorageEnvelopeResponse, error) mustEmbedUnimplementedEncryptionAdminServer() } @@ -227,6 +240,9 @@ func (UnimplementedEncryptionAdminServer) RegisterEncryptionWriter(context.Conte func (UnimplementedEncryptionAdminServer) ResyncSidecar(context.Context, *ResyncSidecarRequest) (*ResyncSidecarResponse, error) { return nil, status.Error(codes.Unimplemented, "method ResyncSidecar not implemented") } +func (UnimplementedEncryptionAdminServer) EnableStorageEnvelope(context.Context, *EnableStorageEnvelopeRequest) (*EnableStorageEnvelopeResponse, error) { + return nil, status.Error(codes.Unimplemented, "method EnableStorageEnvelope not implemented") +} func (UnimplementedEncryptionAdminServer) mustEmbedUnimplementedEncryptionAdminServer() {} func (UnimplementedEncryptionAdminServer) testEmbeddedByValue() {} @@ -356,6 +372,24 @@ func _EncryptionAdmin_ResyncSidecar_Handler(srv interface{}, ctx context.Context return interceptor(ctx, in, info, handler) } +func _EncryptionAdmin_EnableStorageEnvelope_Handler(srv interface{}, ctx context.Context, dec func(interface{}) error, interceptor grpc.UnaryServerInterceptor) (interface{}, error) { + in := new(EnableStorageEnvelopeRequest) + if err := dec(in); err != nil { + return nil, err + } + if interceptor == nil { + return srv.(EncryptionAdminServer).EnableStorageEnvelope(ctx, in) + } + info := &grpc.UnaryServerInfo{ + Server: srv, + FullMethod: EncryptionAdmin_EnableStorageEnvelope_FullMethodName, + } + handler := func(ctx context.Context, req interface{}) (interface{}, error) { + return srv.(EncryptionAdminServer).EnableStorageEnvelope(ctx, req.(*EnableStorageEnvelopeRequest)) + } + return interceptor(ctx, in, info, handler) +} + // EncryptionAdmin_ServiceDesc is the grpc.ServiceDesc for EncryptionAdmin service. // It's only intended for direct use with grpc.RegisterService, // and not to be introspected or modified (even as a copy) @@ -387,6 +421,10 @@ var EncryptionAdmin_ServiceDesc = grpc.ServiceDesc{ MethodName: "ResyncSidecar", Handler: _EncryptionAdmin_ResyncSidecar_Handler, }, + { + MethodName: "EnableStorageEnvelope", + Handler: _EncryptionAdmin_EnableStorageEnvelope_Handler, + }, }, Streams: []grpc.StreamDesc{}, Metadata: "encryption_admin.proto", From 55be257942c695636900e09677de9fcc057991ea Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Sat, 23 May 2026 21:46:18 +0900 Subject: [PATCH 2/6] fix(encryption): PR812 round-1 - codex P2 (preserve context errors in capability fan-out) Codex P2 finding on PR812: Pre-fix: runCutoverFanout wrapped every err from s.capabilityFanout(ctx) as FailedPrecondition. When the request deadline expires or the client cancels mid-pre-flight, the client received FailedPrecondition (a configuration-failure signal) instead of Canceled / DeadlineExceeded (the transport- layer cancellation signal). This breaks retry logic that switches on the gRPC code. Fix: new capabilityFanoutErrorToStatus helper maps context.Canceled and context.DeadlineExceeded to their native gRPC codes (codes.Canceled / codes.DeadlineExceeded) and preserves the FailedPrecondition default for non-transport errors (zero-member snapshot input validation, etc.). Symmetric with the existing proposeErrorToStatus pattern in the same file. Regression test: TestEncryptionAdmin_EnableStorageEnvelope_PreservesContextCancellation table-driven across {context.Canceled, context.DeadlineExceeded} asserts the gRPC status code matches the native ctx error. ## Caller audit (semantic change) The new error code mapping is semantically narrower than before: errors that were previously FailedPrecondition now surface as Canceled / DeadlineExceeded for transport errors, and FailedPrecondition stays for everything else. Caller impact: - adapter/encryption_admin.go: runCutoverFanout is called only from cutoverPostcheck which threads the returned error back to gRPC. Now propagates the correct code. - CLI (lands in 6D-6b) will see the correct code for retry decisions. - No other internal caller exists; the new helper is file- scoped. ## Verification - go test -race -timeout 60s -run TestEncryptionAdmin_EnableStorageEnvelope ./adapter/... - all green - golangci-lint --new-from-rev=origin/main - 0 issues --- adapter/encryption_admin.go | 31 +++++++++++++++++++++++++++-- adapter/encryption_admin_test.go | 34 ++++++++++++++++++++++++++++++++ 2 files changed, 63 insertions(+), 2 deletions(-) diff --git a/adapter/encryption_admin.go b/adapter/encryption_admin.go index 37a5564a8..c9893e87a 100644 --- a/adapter/encryption_admin.go +++ b/adapter/encryption_admin.go @@ -743,6 +743,14 @@ func (s *EncryptionAdminServer) cutoverPrecheck(ctx context.Context, req *pb.Ena // helper and translates the OK / refuse / error branches into the // §3.2 step 6-7 status codes. Pulled out so the orchestration // body stays under the cyclomatic-complexity budget. +// +// Context-cancellation errors flow through with their gRPC code +// (codes.Canceled / codes.DeadlineExceeded) so a client that +// cancels mid-fan-out gets the right retry-semantics shape; +// wrapping every err as FailedPrecondition would be a configuration- +// failure signal that misleads automated retry logic (codex P2 on +// PR812). Configuration-shape errors (zero-member snapshot, etc.) +// remain FailedPrecondition. func (s *EncryptionAdminServer) runCutoverFanout(ctx context.Context) (admin.CapabilityFanoutResult, error) { if s.capabilityFanout == nil { return admin.CapabilityFanoutResult{}, grpcStatusError(codes.FailedPrecondition, @@ -750,8 +758,7 @@ func (s *EncryptionAdminServer) runCutoverFanout(ctx context.Context) (admin.Cap } result, err := s.capabilityFanout(ctx) if err != nil { - return admin.CapabilityFanoutResult{}, grpcStatusErrorf(codes.FailedPrecondition, - "encryption: capability fan-out failed: %v", err) + return admin.CapabilityFanoutResult{}, capabilityFanoutErrorToStatus(err) } if !result.OK { return admin.CapabilityFanoutResult{}, grpcStatusErrorf(codes.FailedPrecondition, @@ -760,6 +767,26 @@ func (s *EncryptionAdminServer) runCutoverFanout(ctx context.Context) (admin.Cap return result, nil } +// capabilityFanoutErrorToStatus maps the fan-out helper's +// possible failure modes to gRPC status codes. The transport- +// layer ctx errors (caller canceled / deadline expired) keep +// their native codes so a client's retry behaviour stays +// correct; anything else surfaces as a configuration failure +// (FailedPrecondition). +func capabilityFanoutErrorToStatus(err error) error { + switch { + case errors.Is(err, context.Canceled): + return grpcStatusErrorf(codes.Canceled, + "encryption: capability fan-out canceled: %v", err) + case errors.Is(err, context.DeadlineExceeded): + return grpcStatusErrorf(codes.DeadlineExceeded, + "encryption: capability fan-out deadline exceeded: %v", err) + default: + return grpcStatusErrorf(codes.FailedPrecondition, + "encryption: capability fan-out failed: %v", err) + } +} + // proposeCutoverEntry composes the RotationPayload per §2.1 and // drives it through Raft. The §2.1 #2 length-based-empty-Wrapped // constraint is satisfied at composition (the payload uses diff --git a/adapter/encryption_admin_test.go b/adapter/encryption_admin_test.go index 01dbffed9..2e40c81cc 100644 --- a/adapter/encryption_admin_test.go +++ b/adapter/encryption_admin_test.go @@ -1373,6 +1373,40 @@ func TestEncryptionAdmin_EnableStorageEnvelope_RejectsOnFanoutError(t *testing.T } } +// TestEncryptionAdmin_EnableStorageEnvelope_PreservesContextCancellation +// pins the codex P2 finding on PR812: context.Canceled and +// context.DeadlineExceeded surfaced by the fan-out closure MUST +// keep their native gRPC code (Canceled / DeadlineExceeded) +// rather than being squashed into FailedPrecondition. +// FailedPrecondition is a configuration-failure signal; mapping +// a transport-layer cancellation to it breaks client retry +// logic that switches on the code. +func TestEncryptionAdmin_EnableStorageEnvelope_PreservesContextCancellation(t *testing.T) { + t.Parallel() + cases := []struct { + name string + err error + wantCode codes.Code + }{ + {name: "canceled", err: context.Canceled, wantCode: codes.Canceled}, + {name: "deadline_exceeded", err: context.DeadlineExceeded, wantCode: codes.DeadlineExceeded}, + } + for _, tc := range cases { + t.Run(tc.name, func(t *testing.T) { + srv := NewEncryptionAdminServer( + WithEncryptionAdminProposer(&recordingProposer{}), + WithEncryptionAdminLeaderView(stubLeaderView{state: raftengine.StateLeader}), + WithEncryptionAdminSidecarPath(cutoverReadySidecarFixture(t)), + WithEncryptionAdminCapabilityFanout(fixedCapabilityFanout(admin.CapabilityFanoutResult{}, tc.err)), + ) + _, err := srv.EnableStorageEnvelope(context.Background(), validEnableStorageEnvelopeRequest()) + if status.Code(err) != tc.wantCode { + t.Errorf("EnableStorageEnvelope status=%v, want %v", status.Code(err), tc.wantCode) + } + }) + } +} + // TestEncryptionAdmin_EnableStorageEnvelope_HappyPath drives the // full §3.2 happy path: leader gate passes, fan-out approves, // propose lands, post-apply sidecar shows the cutover effect, the From c408449560f87c519ff5d036c4f4d23a22aade71 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Sat, 23 May 2026 22:57:20 +0900 Subject: [PATCH 3/6] fix(encryption): PR812 round-2 - claude P2 (Reachable=false branch coverage) + informational comment MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Claude bot review identified one P2 test gap + two informational points on PR812 round-1. ## claude P2 - capabilityRefusalSummary Reachable=false branch untested Pre-fix: _RejectsOnCapabilityRefusal only exercised the EncryptionCapable=false branch of capabilityRefusalSummary. The Reachable=false branch (which produces 'unreachable member ...' in the status detail) had no regression coverage. A future refactor that broke that string would slip through silently. Fix: convert the test to table-driven across {not_capable, unreachable} fixtures. Each case asserts: - status code is FailedPrecondition - error message contains the refusing node's full_node_id - error message contains the refusal reason label ('not-capable' or 'unreachable') ## informational - freshCutoverResponse defensive fallback comment Claude bot noted that the defensive fallback to proposedIdx when sc.StorageEnvelopeCutoverIndex == 0 doesn't set CutoverIndexUnknown=true, in contrast with the idempotent-retry path's same anomaly. The asymmetry is correct (proto §3.1 says CutoverIndexUnknown is 'only meaningful when was_already_active=true', and on the fresh-success path we DO know the correct index: it's proposedIdx). Added a clarifying comment so a future reader doesn't have to trace through the proto doc. ## informational - catch-all in capabilityRefusalSummary Claude bot noted that 'fan-out reported OK=false with no per-member refusal' is structurally unreachable under the current CapabilityFanout contract (OK=false ⇒ at least one failing verdict). Marked informational, no code change — the catch-all is a defensive fallback message, not a correctness path; covering it would require an invalid input the helper contract forbids. ## Caller audit (semantic change) No semantic change. The test refactor only widens fixture coverage; the production code change is a comment addition. No caller impact. ## Verification - go test -race -timeout 60s -run TestEncryptionAdmin_EnableStorageEnvelope ./adapter/... - all green - golangci-lint --new-from-rev=origin/main - 0 issues --- adapter/encryption_admin.go | 8 +++++ adapter/encryption_admin_test.go | 56 ++++++++++++++++++++++++++++---- 2 files changed, 57 insertions(+), 7 deletions(-) diff --git a/adapter/encryption_admin.go b/adapter/encryption_admin.go index c9893e87a..da944cbba 100644 --- a/adapter/encryption_admin.go +++ b/adapter/encryption_admin.go @@ -903,6 +903,14 @@ func freshCutoverResponse(sc *encryption.Sidecar, proposedIdx uint64, fanoutResu // index alongside the active flag. A zero here means // the post-apply read raced (unlikely, but the §6.4 // fallback exists for hand-edited sidecars). + // CutoverIndexUnknown stays false on this branch — the + // proto field is "only meaningful when + // was_already_active=true" (§3.1), and here we know the + // correct applied_index: it is proposedIdx (the Raft + // entry this call just committed). The idempotent-retry + // path is the only context where the original cutover + // index is irrecoverable, hence its CutoverIndexUnknown + // signal. Claude bot informational on PR812. appliedIndex = proposedIdx } return &pb.EnableStorageEnvelopeResponse{ diff --git a/adapter/encryption_admin_test.go b/adapter/encryption_admin_test.go index 2e40c81cc..b3ab00b4c 100644 --- a/adapter/encryption_admin_test.go +++ b/adapter/encryption_admin_test.go @@ -1329,15 +1329,54 @@ func TestEncryptionAdmin_EnableStorageEnvelope_RejectsWithoutCapabilityFanout(t // and the RPC refuses with FailedPrecondition. The status detail // names the specific node so the operator's CLI can diagnose // without trawling leader logs. +// +// Table-driven across the two refusal shapes so both branches of +// capabilityRefusalSummary (Reachable=false first, then +// EncryptionCapable=false) are exercised — without the unreachable +// case the "unreachable member" status detail would lose its +// regression coverage. PR812 claude P2. func TestEncryptionAdmin_EnableStorageEnvelope_RejectsOnCapabilityRefusal(t *testing.T) { t.Parallel() - refusal := admin.CapabilityFanoutResult{ - Verdicts: []admin.CapabilityVerdict{ - {FullNodeID: 11, EncryptionCapable: true, BuildSHA: "build-n1", SidecarPresent: true, Reachable: true}, - {FullNodeID: 99, EncryptionCapable: false, BuildSHA: "build-n99", SidecarPresent: false, Reachable: true}, + cases := []struct { + name string + refusal admin.CapabilityFanoutResult + wantNodeStr string + wantReason string + }{ + { + name: "not_capable", + refusal: admin.CapabilityFanoutResult{ + Verdicts: []admin.CapabilityVerdict{ + {FullNodeID: 11, EncryptionCapable: true, BuildSHA: "build-n1", SidecarPresent: true, Reachable: true}, + {FullNodeID: 99, EncryptionCapable: false, BuildSHA: "build-n99", SidecarPresent: false, Reachable: true}, + }, + OK: false, + }, + wantNodeStr: "99", + wantReason: "not-capable", + }, + { + name: "unreachable", + refusal: admin.CapabilityFanoutResult{ + Verdicts: []admin.CapabilityVerdict{ + {FullNodeID: 11, EncryptionCapable: true, BuildSHA: "build-n1", SidecarPresent: true, Reachable: true}, + {FullNodeID: 88, EncryptionCapable: false, BuildSHA: "", SidecarPresent: false, Reachable: false}, + }, + OK: false, + }, + wantNodeStr: "88", + wantReason: "unreachable", }, - OK: false, } + for _, tc := range cases { + t.Run(tc.name, func(t *testing.T) { + runCapabilityRefusalCase(t, tc.refusal, tc.wantNodeStr, tc.wantReason) + }) + } +} + +func runCapabilityRefusalCase(t *testing.T, refusal admin.CapabilityFanoutResult, wantNodeStr, wantReason string) { + t.Helper() srv := NewEncryptionAdminServer( WithEncryptionAdminProposer(&recordingProposer{}), WithEncryptionAdminLeaderView(stubLeaderView{state: raftengine.StateLeader}), @@ -1348,8 +1387,11 @@ func TestEncryptionAdmin_EnableStorageEnvelope_RejectsOnCapabilityRefusal(t *tes if status.Code(err) != codes.FailedPrecondition { t.Errorf("EnableStorageEnvelope status=%v, want FailedPrecondition", status.Code(err)) } - if err == nil || !strings.Contains(err.Error(), "99") { - t.Errorf("error %q does not name the refusing node (full_node_id=99)", err) + if err == nil || !strings.Contains(err.Error(), wantNodeStr) { + t.Errorf("error %q does not name the refusing node (full_node_id=%s)", err, wantNodeStr) + } + if err == nil || !strings.Contains(err.Error(), wantReason) { + t.Errorf("error %q does not name the refusal reason (%q)", err, wantReason) } } From a45eeac798cc33f2a91ad218d92fa72a523eb452 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Sat, 23 May 2026 23:24:01 +0900 Subject: [PATCH 4/6] fix(encryption): PR812 round-3 - coderabbit Major (mutator lock) + codex P2 (ctx-err preservation) + cleanups MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Round-2 review surfaced one Major + two P2 + one Minor. ## coderabbit Major - serialize concurrent EnableStorageEnvelope calls Pre-fix: gRPC serves handlers concurrently. Two concurrent EnableStorageEnvelope calls could both observe StorageEnvelopeActive=false at their precheck, both propose, and the loser would assemble freshCutoverResponse with was_already_active=false but the FIRST cutover's StorageEnvelopeCutoverIndex - violating the §6.4 fresh-success contract documented on the proto response. Fix: add cutoverMu sync.Mutex to EncryptionAdminServer. EnableStorageEnvelope acquires it at the very top and releases on defer. The lock spans the entire precheck → fan-out → propose → postcheck sequence so a second overlapping call sees StorageEnvelopeActive=true at its precheck (because the first call already proposed and applied) and takes the §6.4 idempotent-retry short-circuit. Note on design §2.1 #4 'shared lock with RotateDEK': the design calls for a shared mutator lock so a RotateDEK cannot interleave between a cutover's propose and apply. That piece is left to a follow-up because extending serialization to RotateDEK changes a hot-path mutator's semantics; the §2.1 #3 stale-DEKID benign no-op handles the un-serialized RotateDEK / cutover interleave today. Doc comment in cutoverMu records the deferred work. Regression test: TestEncryptionAdmin_EnableStorageEnvelope_SerializesConcurrentCutovers fires N=4 concurrent callers via a sync.WaitGroup release barrier, asserts proposer.calls == 1 (cutover proposed exactly once) and that exactly one caller saw was_already_active=false and the rest hit the §6.4 idempotent-retry shape. ## codex P2 - preserve ctx codes when fan-out returns OK=false synthesized Pre-fix: when the production CapabilityFanout helper hits ctx.expired mid-probe, it synthesizes Reachable=false verdicts and returns (result, nil) with OK=false rather than erroring out. runCutoverFanout then classified the outcome as a 'capability check refused' FailedPrecondition, hiding the transport-layer Canceled / DeadlineExceeded that the caller should observe. Fix: in runCutoverFanout's OK=false branch, check ctx.Err() first. If set, route through capabilityFanoutErrorToStatus so codes.Canceled / codes.DeadlineExceeded surface to the client. Only fall through to FailedPrecondition for genuine configuration refusals. Regression test: TestEncryptionAdmin_EnableStorageEnvelope_PreservesContextCancellationOnFanoutNotOK table-driven across {canceled, deadline_exceeded} contexts; fan-out fixture returns synthesized (notOK, nil); the test asserts the gRPC status code matches the ctx error. ## coderabbit quick-win - idempotent-retry tests must not wire fan-out Pre-fix: _IdempotentRetry and _DefensiveCutoverIndexUnknown both wired a SUCCESSFUL capability fan-out fixture. A regression that re-ordered cutoverPrecheck so the fan-out fired BEFORE the §6.4 short-circuit would have silently passed both tests. Fix: new failOnCallCapabilityFanout(t) helper that fails the test if invoked. Both idempotent-retry tests now use it - any re-ordering regression trips the fixture. ## coderabbit minor - design doc obsolete entry Removed the original combined '6D-6' milestone line from the design doc's open-milestones section (it was superseded by the 6D-6a shipped + 6D-6b/c open structure). ## Caller audit (semantic change) EnableStorageEnvelope is the only mutator that acquires cutoverMu; RotateDEK, BootstrapEncryption, and RegisterEncryptionWriter remain un-serialized at the RPC layer (their FSM apply-path conflict detection is unchanged). No production caller of EnableStorageEnvelope exists yet (CLI lands in 6D-6b); the only callers are server-level unit tests, which exercise the new locked path. ## Verification - go test -race -timeout 60s -run TestEncryptionAdmin_EnableStorageEnvelope ./adapter/... - all green (the -race build also exercises the serialization test's 4-goroutine fan-out) - golangci-lint --new-from-rev=origin/main - 0 issues --- adapter/encryption_admin.go | 38 +++++ adapter/encryption_admin_test.go | 156 +++++++++++++++++- ...5_18_partial_6d_enable_storage_envelope.md | 2 - 3 files changed, 192 insertions(+), 4 deletions(-) diff --git a/adapter/encryption_admin.go b/adapter/encryption_admin.go index da944cbba..7cb9674d0 100644 --- a/adapter/encryption_admin.go +++ b/adapter/encryption_admin.go @@ -8,6 +8,7 @@ import ( "os" "runtime/debug" "strconv" + "sync" "github.com/bootjp/elastickv/internal/admin" "github.com/bootjp/elastickv/internal/encryption" @@ -38,6 +39,23 @@ type EncryptionAdminServer struct { // is what threads the route-snapshot builder + DialFunc + // timeout into this closure. Other mutator RPCs are unaffected. capabilityFanout CapabilityFanoutFn + // cutoverMu serializes concurrent EnableStorageEnvelope calls + // per design §2.1 #4 ("The cutover-RPC mutator serializes + // overlapping calls on the propose side"). Without it, two + // concurrent cutover calls could both observe + // StorageEnvelopeActive=false, both propose, and the loser + // would assemble a freshCutoverResponse with + // was_already_active=false but the FIRST cutover's + // applied_index — violating the §6.4 fresh-success contract + // (coderabbit Major on PR812). The design also calls for a + // shared lock with RotateDEK so a RotateDEK cannot interleave + // between a cutover's propose and apply; that piece is left + // to a follow-up because RotateDEK has no existing lock and + // extending serialization to it would change a hot-path + // mutator's semantics — see the §2.1 #3 stale-DEKID benign + // no-op for the FALLBACK that handles the un-serialized + // RotateDEK/cutover interleave today. + cutoverMu sync.Mutex pb.UnimplementedEncryptionAdminServer } @@ -676,6 +694,14 @@ func (s *EncryptionAdminServer) RotateDEK(ctx context.Context, req *pb.RotateDEK // true with cutover-index mismatch ⇒ another cutover landed // concurrently, treat as idempotent success. func (s *EncryptionAdminServer) EnableStorageEnvelope(ctx context.Context, req *pb.EnableStorageEnvelopeRequest) (*pb.EnableStorageEnvelopeResponse, error) { + // Serialize concurrent cutover RPCs (design §2.1 #4 + PR812 + // coderabbit Major). The lock spans the entire precheck → + // fan-out → propose → postcheck sequence so a second + // overlapping call sees StorageEnvelopeActive=true at its + // precheck and takes the §6.4 idempotent-retry short-circuit + // rather than re-proposing. + s.cutoverMu.Lock() + defer s.cutoverMu.Unlock() preSidecar, earlyResp, err := s.cutoverPrecheck(ctx, req) if err != nil { return nil, err @@ -761,6 +787,18 @@ func (s *EncryptionAdminServer) runCutoverFanout(ctx context.Context) (admin.Cap return admin.CapabilityFanoutResult{}, capabilityFanoutErrorToStatus(err) } if !result.OK { + // Codex P2 round-2 on PR812: the production fan-out + // helper can return (result, nil) with OK=false when ctx + // expires mid-probe — it synthesizes Reachable=false + // verdicts and returns the result rather than erroring + // out. In that case, classifying the outcome as a + // configuration refusal (FailedPrecondition) hides the + // real transport-layer cancellation/deadline from + // retry-aware clients. Check ctx.Err() first so the + // gRPC status code matches what the caller observed. + if ctxErr := ctx.Err(); ctxErr != nil { + return admin.CapabilityFanoutResult{}, capabilityFanoutErrorToStatus(ctxErr) + } return admin.CapabilityFanoutResult{}, grpcStatusErrorf(codes.FailedPrecondition, "encryption: capability check refused cutover (%s)", capabilityRefusalSummary(result)) } diff --git a/adapter/encryption_admin_test.go b/adapter/encryption_admin_test.go index b3ab00b4c..5cc8a22ed 100644 --- a/adapter/encryption_admin_test.go +++ b/adapter/encryption_admin_test.go @@ -6,7 +6,9 @@ import ( "math" "path/filepath" "strings" + "sync" "testing" + "time" "github.com/bootjp/elastickv/internal/admin" "github.com/bootjp/elastickv/internal/encryption" @@ -1042,6 +1044,22 @@ func fixedCapabilityFanout(result admin.CapabilityFanoutResult, err error) Capab } } +// failOnCallCapabilityFanout returns a closure that fails the test +// if it is invoked. Used by the §6.4 idempotent-retry tests to +// pin the "already-active short-circuit must NOT run fan-out" +// invariant: a regression that re-ordered the cutoverPrecheck so +// the fan-out fired before the short-circuit would trip this +// fixture and the test would fail at the wire-level rather than +// silently passing on a successful fan-out. PR812 coderabbit +// quick-win round-2. +func failOnCallCapabilityFanout(t *testing.T) CapabilityFanoutFn { + t.Helper() + return func(context.Context) (admin.CapabilityFanoutResult, error) { + t.Errorf("capability fan-out invoked on idempotent-retry path; the §6.4 short-circuit MUST skip fan-out") + return admin.CapabilityFanoutResult{}, errors.New("fan-out invoked on idempotent path") + } +} + // applyingProposer is a recordingProposer that also simulates the // FSM apply by writing the supplied applyFn output into the // sidecar before returning from Propose. The 6D-6 RPC re-reads @@ -1244,7 +1262,10 @@ func TestEncryptionAdmin_EnableStorageEnvelope_IdempotentRetry(t *testing.T) { WithEncryptionAdminProposer(proposer), WithEncryptionAdminLeaderView(stubLeaderView{state: raftengine.StateLeader}), WithEncryptionAdminSidecarPath(path), - WithEncryptionAdminCapabilityFanout(fixedCapabilityFanout(allOKFanoutResult(), nil)), + // Use failOnCallCapabilityFanout: the §6.4 short-circuit + // MUST skip the fan-out; if a regression re-ordered the + // checks the fan-out fixture trips the test. + WithEncryptionAdminCapabilityFanout(failOnCallCapabilityFanout(t)), ) got, err := srv.EnableStorageEnvelope(context.Background(), validEnableStorageEnvelopeRequest()) if err != nil { @@ -1287,7 +1308,10 @@ func TestEncryptionAdmin_EnableStorageEnvelope_DefensiveCutoverIndexUnknown(t *t WithEncryptionAdminProposer(&recordingProposer{}), WithEncryptionAdminLeaderView(stubLeaderView{state: raftengine.StateLeader}), WithEncryptionAdminSidecarPath(path), - WithEncryptionAdminCapabilityFanout(fixedCapabilityFanout(allOKFanoutResult(), nil)), + // The §6.4 defensive branch is still on the + // idempotent-retry path (sidecar reports active); fan-out + // MUST be skipped exactly as in the standard retry case. + WithEncryptionAdminCapabilityFanout(failOnCallCapabilityFanout(t)), ) got, err := srv.EnableStorageEnvelope(context.Background(), validEnableStorageEnvelopeRequest()) if err != nil { @@ -1415,6 +1439,134 @@ func TestEncryptionAdmin_EnableStorageEnvelope_RejectsOnFanoutError(t *testing.T } } +// TestEncryptionAdmin_EnableStorageEnvelope_SerializesConcurrentCutovers +// pins the §2.1 #4 mutator-lock invariant: two concurrent cutover +// RPCs MUST NOT both produce a freshCutoverResponse — the second +// one must observe StorageEnvelopeActive=true at its precheck and +// fall through to the §6.4 idempotent-retry shape. Coderabbit +// Major on PR812. +// +// The applying proposer flips the sidecar on the FIRST propose +// call. The second call enters EnableStorageEnvelope, waits on +// cutoverMu, then runs its precheck. Without the lock, the +// second call's precheck could see the pre-flip sidecar and +// re-propose — producing a freshCutoverResponse with +// WasAlreadyActive=false but the FIRST cutover's +// StorageEnvelopeCutoverIndex (the §6.4 contract violation). +func TestEncryptionAdmin_EnableStorageEnvelope_SerializesConcurrentCutovers(t *testing.T) { + t.Parallel() + path := cutoverReadySidecarFixture(t) + proposer := &applyingProposer{ + recordingProposer: recordingProposer{commitIndex: 7000}, + sidecarPath: path, + applyFn: applyCutover, + } + srv := NewEncryptionAdminServer( + WithEncryptionAdminProposer(proposer), + WithEncryptionAdminLeaderView(stubLeaderView{state: raftengine.StateLeader}), + WithEncryptionAdminSidecarPath(path), + WithEncryptionAdminCapabilityFanout(fixedCapabilityFanout(allOKFanoutResult(), nil)), + ) + const callers = 4 + results := make(chan *pb.EnableStorageEnvelopeResponse, callers) + errs := make(chan error, callers) + var ready sync.WaitGroup + var release sync.WaitGroup + ready.Add(callers) + release.Add(1) + for range callers { + go func() { + ready.Done() + release.Wait() + resp, err := srv.EnableStorageEnvelope(context.Background(), validEnableStorageEnvelopeRequest()) + results <- resp + errs <- err + }() + } + ready.Wait() + release.Done() + freshCount := 0 + idempotentCount := 0 + for i := 0; i < callers; i++ { + err := <-errs + resp := <-results + if err != nil { + t.Fatalf("EnableStorageEnvelope call %d: %v", i, err) + } + if resp.WasAlreadyActive { + idempotentCount++ + } else { + freshCount++ + } + } + if freshCount != 1 { + t.Errorf("freshCount=%d, want exactly 1 (only one caller should propose; the rest must hit the §6.4 idempotent-retry path)", freshCount) + } + if idempotentCount != callers-1 { + t.Errorf("idempotentCount=%d, want %d", idempotentCount, callers-1) + } + if len(proposer.calls) != 1 { + t.Errorf("proposer.calls=%d, want 1 (cutover proposed exactly once across concurrent callers)", len(proposer.calls)) + } +} + +// TestEncryptionAdmin_EnableStorageEnvelope_PreservesContextCancellationOnFanoutNotOK +// pins codex P2 round-2 on PR812: the production fan-out helper +// can synthesize Reachable=false verdicts and return (result, nil) +// with OK=false when ctx expires mid-probe. In that case +// EnableStorageEnvelope MUST preserve the transport-layer +// cancellation/deadline shape rather than misclassifying the +// outcome as a configuration refusal. +func TestEncryptionAdmin_EnableStorageEnvelope_PreservesContextCancellationOnFanoutNotOK(t *testing.T) { + t.Parallel() + cases := []struct { + name string + ctxFn func() (context.Context, context.CancelFunc) + wantCode codes.Code + }{ + { + name: "canceled", + ctxFn: func() (context.Context, context.CancelFunc) { + ctx, cancel := context.WithCancel(context.Background()) + cancel() + return ctx, func() {} + }, + wantCode: codes.Canceled, + }, + { + name: "deadline_exceeded", + ctxFn: func() (context.Context, context.CancelFunc) { + ctx, cancel := context.WithDeadline(context.Background(), time.Now().Add(-time.Second)) + return ctx, cancel + }, + wantCode: codes.DeadlineExceeded, + }, + } + for _, tc := range cases { + t.Run(tc.name, func(t *testing.T) { + // Synthesize the production fan-out's behavior: it returns + // a NOT-OK result without err when ctx expired mid-probe. + notOK := admin.CapabilityFanoutResult{ + Verdicts: []admin.CapabilityVerdict{{FullNodeID: 11, Reachable: false}}, + OK: false, + } + srv := NewEncryptionAdminServer( + WithEncryptionAdminProposer(&recordingProposer{}), + WithEncryptionAdminLeaderView(stubLeaderView{state: raftengine.StateLeader}), + WithEncryptionAdminSidecarPath(cutoverReadySidecarFixture(t)), + WithEncryptionAdminCapabilityFanout(fixedCapabilityFanout(notOK, nil)), + ) + ctx, cancel := tc.ctxFn() + defer cancel() + _, err := srv.EnableStorageEnvelope(ctx, validEnableStorageEnvelopeRequest()) + if status.Code(err) != tc.wantCode { + t.Errorf("EnableStorageEnvelope status=%v, want %v (ctx error should take precedence over FailedPrecondition)", + status.Code(err), tc.wantCode) + } + }) + } +} + // TestEncryptionAdmin_EnableStorageEnvelope_PreservesContextCancellation // pins the codex P2 finding on PR812: context.Canceled and // context.DeadlineExceeded surfaced by the fan-out closure MUST diff --git a/docs/design/2026_05_18_partial_6d_enable_storage_envelope.md b/docs/design/2026_05_18_partial_6d_enable_storage_envelope.md index dd96bf0e3..71ff540ed 100644 --- a/docs/design/2026_05_18_partial_6d_enable_storage_envelope.md +++ b/docs/design/2026_05_18_partial_6d_enable_storage_envelope.md @@ -67,8 +67,6 @@ view. End-to-end integration test exercises a single-node cluster Bootstrap → EnableStorageEnvelope → Put → read-back- via-envelope. -- **6D-6** — `EnableStorageEnvelope` admin RPC + CLI command + - integration test composing 6D-3 + 6D-4 + 6D-5. ## 0. Why this doc exists From f93bea7e59a2f0746560f857f9b34fe777cc8013 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Sat, 23 May 2026 23:48:02 +0900 Subject: [PATCH 5/6] fix(encryption): PR812 round-4 - codex P2 round-3 (ctx-aware cutover-mutator wait) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Codex P2 round-3 on PR812: Pre-fix: EnableStorageEnvelope took cutoverMu (sync.Mutex) unconditionally at the top. A concurrent caller with a short ctx deadline could block on Lock() well past its own cancellation/deadline if the first caller stalled in fan-out or propose - breaking gRPC cancellation semantics and tying up admin handlers. Fix: replace sync.Mutex with a capacity-1 channel semaphore (cutoverSem chan struct{}). New acquireCutoverSemaphore / releaseCutoverSemaphore helpers; acquire uses a select on ctx.Done() so an expired/canceled context returns immediately with the matching gRPC code (Canceled / DeadlineExceeded). New cutoverSemaphoreErrorToStatus helper maps the ctx error - named distinctly from capabilityFanoutErrorToStatus so the status detail surfaces which stage of the RPC the wait fired in. Initialization: NewEncryptionAdminServer eagerly creates the buffered chan (a nil chan would block forever on send, defeating the ctx-aware acquire). Regression test: TestEncryptionAdmin_EnableStorageEnvelope_HonorsCtxDeadlineWaitingOnMutator drives the first call into a fan-out fixture that blocks on a never-firing channel (holds cutoverSem); the second call uses an already-expired ctx and asserts: - the RPC returns within 500ms (NOT blocking on the held semaphore) - the gRPC status code is DeadlineExceeded ## Caller audit (semantic change) acquireCutoverSemaphore is called only from EnableStorageEnvelope. The release path is symmetric (defer releaseCutoverSemaphore). No other code path touches cutoverSem - the channel is internal to the cutover RPC. The SerializesConcurrentCutovers test from round-3 still covers the mutual-exclusion invariant (the loser still hits the §6.4 idempotent-retry shape); HonorsCtxDeadlineWaitingOnMutator covers the new ctx-aware acquisition path. ## Verification - go test -race -timeout 60s -run TestEncryptionAdmin_EnableStorageEnvelope ./adapter/... - all green - golangci-lint --new-from-rev=origin/main - 0 issues --- adapter/encryption_admin.go | 103 ++++++++++++++++++++++++------- adapter/encryption_admin_test.go | 61 ++++++++++++++++++ 2 files changed, 143 insertions(+), 21 deletions(-) diff --git a/adapter/encryption_admin.go b/adapter/encryption_admin.go index 7cb9674d0..02a686386 100644 --- a/adapter/encryption_admin.go +++ b/adapter/encryption_admin.go @@ -8,7 +8,6 @@ import ( "os" "runtime/debug" "strconv" - "sync" "github.com/bootjp/elastickv/internal/admin" "github.com/bootjp/elastickv/internal/encryption" @@ -39,23 +38,32 @@ type EncryptionAdminServer struct { // is what threads the route-snapshot builder + DialFunc + // timeout into this closure. Other mutator RPCs are unaffected. capabilityFanout CapabilityFanoutFn - // cutoverMu serializes concurrent EnableStorageEnvelope calls - // per design §2.1 #4 ("The cutover-RPC mutator serializes - // overlapping calls on the propose side"). Without it, two - // concurrent cutover calls could both observe - // StorageEnvelopeActive=false, both propose, and the loser - // would assemble a freshCutoverResponse with + // cutoverSem serializes concurrent EnableStorageEnvelope + // calls per design §2.1 #4 ("The cutover-RPC mutator + // serializes overlapping calls on the propose side"). + // Without serialization, two concurrent cutover calls could + // both observe StorageEnvelopeActive=false, both propose, and + // the loser would assemble a freshCutoverResponse with // was_already_active=false but the FIRST cutover's // applied_index — violating the §6.4 fresh-success contract - // (coderabbit Major on PR812). The design also calls for a - // shared lock with RotateDEK so a RotateDEK cannot interleave - // between a cutover's propose and apply; that piece is left - // to a follow-up because RotateDEK has no existing lock and - // extending serialization to it would change a hot-path - // mutator's semantics — see the §2.1 #3 stale-DEKID benign - // no-op for the FALLBACK that handles the un-serialized - // RotateDEK/cutover interleave today. - cutoverMu sync.Mutex + // (coderabbit Major on PR812). + // + // A capacity-1 channel rather than a plain sync.Mutex so + // acquisition can honor ctx cancellation: an admin RPC with + // a short deadline whose mutator lock is held by an in-flight + // fan-out + propose call surfaces Canceled / DeadlineExceeded + // at the gRPC boundary rather than blocking indefinitely + // (codex P2 round-3 on PR812). + // + // The design also calls for a shared lock with RotateDEK so + // a RotateDEK cannot interleave between a cutover's propose + // and apply; that piece is left to a follow-up because + // RotateDEK has no existing serialization and extending it + // would change a hot-path mutator's semantics — see the + // §2.1 #3 stale-DEKID benign no-op for the FALLBACK that + // handles the un-serialized RotateDEK/cutover interleave + // today. + cutoverSem chan struct{} pb.UnimplementedEncryptionAdminServer } @@ -179,7 +187,11 @@ func WithEncryptionAdminLeaderView(v raftengine.LeaderView) EncryptionAdminServe // fails closed at startup rather than silently letting followers // mutate state. func NewEncryptionAdminServer(opts ...EncryptionAdminServerOption) *EncryptionAdminServer { - s := &EncryptionAdminServer{} + // cutoverSem buffer of 1 = mutual-exclusion semaphore; + // initialized eagerly so EnableStorageEnvelope never has to + // nil-check (a nil chan send blocks forever, defeating the + // ctx-aware acquire below). + s := &EncryptionAdminServer{cutoverSem: make(chan struct{}, 1)} for _, opt := range opts { if opt != nil { opt(s) @@ -695,13 +707,19 @@ func (s *EncryptionAdminServer) RotateDEK(ctx context.Context, req *pb.RotateDEK // concurrently, treat as idempotent success. func (s *EncryptionAdminServer) EnableStorageEnvelope(ctx context.Context, req *pb.EnableStorageEnvelopeRequest) (*pb.EnableStorageEnvelopeResponse, error) { // Serialize concurrent cutover RPCs (design §2.1 #4 + PR812 - // coderabbit Major). The lock spans the entire precheck → - // fan-out → propose → postcheck sequence so a second + // coderabbit Major). The semaphore spans the entire precheck + // → fan-out → propose → postcheck sequence so a second // overlapping call sees StorageEnvelopeActive=true at its // precheck and takes the §6.4 idempotent-retry short-circuit // rather than re-proposing. - s.cutoverMu.Lock() - defer s.cutoverMu.Unlock() + // + // Acquire honors ctx cancellation so a caller with a short + // deadline does not block indefinitely on an in-flight + // cutover's fan-out + propose (codex P2 round-3 on PR812). + if err := s.acquireCutoverSemaphore(ctx); err != nil { + return nil, err + } + defer s.releaseCutoverSemaphore() preSidecar, earlyResp, err := s.cutoverPrecheck(ctx, req) if err != nil { return nil, err @@ -723,6 +741,49 @@ func (s *EncryptionAdminServer) EnableStorageEnvelope(ctx context.Context, req * return s.cutoverPostcheck(proposedIdx, fanoutResult) } +// acquireCutoverSemaphore takes the cutoverSem with ctx-aware +// wait semantics. When the caller's ctx fires before the +// semaphore frees, the wait returns immediately with a gRPC +// status matching the ctx error — Canceled or DeadlineExceeded. +// A plain sync.Mutex would block past the caller's deadline, +// breaking RPC cancellation semantics (codex P2 round-3 on +// PR812). +func (s *EncryptionAdminServer) acquireCutoverSemaphore(ctx context.Context) error { + select { + case s.cutoverSem <- struct{}{}: + return nil + case <-ctx.Done(): + return cutoverSemaphoreErrorToStatus(ctx.Err()) + } +} + +// cutoverSemaphoreErrorToStatus maps the ctx-cancellation +// outcomes of an acquireCutoverSemaphore wait to their native +// gRPC codes. Pulled out so the status detail names the wait +// site (semaphore vs. fan-out) — without that, an operator +// debugging a Canceled response would not know which stage +// of the RPC the cancellation actually fired in. +func cutoverSemaphoreErrorToStatus(err error) error { + switch { + case errors.Is(err, context.Canceled): + return grpcStatusErrorf(codes.Canceled, + "encryption: cutover mutator wait canceled: %v", err) + case errors.Is(err, context.DeadlineExceeded): + return grpcStatusErrorf(codes.DeadlineExceeded, + "encryption: cutover mutator wait deadline exceeded: %v", err) + default: + return grpcStatusErrorf(codes.Internal, + "encryption: cutover mutator wait failed: %v", err) + } +} + +// releaseCutoverSemaphore returns the cutoverSem token. Always +// called via defer from EnableStorageEnvelope; never called +// without a prior successful acquireCutoverSemaphore. +func (s *EncryptionAdminServer) releaseCutoverSemaphore() { + <-s.cutoverSem +} + // cutoverPrecheck runs the §3.2 steps 1-5 that fire before the // fan-out: input validation, leader check, sidecar read, bootstrap // gate, and the idempotent-retry short-circuit. Returns either diff --git a/adapter/encryption_admin_test.go b/adapter/encryption_admin_test.go index 5cc8a22ed..699b0a0b2 100644 --- a/adapter/encryption_admin_test.go +++ b/adapter/encryption_admin_test.go @@ -1510,6 +1510,67 @@ func TestEncryptionAdmin_EnableStorageEnvelope_SerializesConcurrentCutovers(t *t } } +// TestEncryptionAdmin_EnableStorageEnvelope_HonorsCtxDeadlineWaitingOnMutator +// pins codex P2 round-3 on PR812: when one cutover RPC holds +// the mutator semaphore through a slow fan-out + propose, a +// concurrent caller with a short deadline MUST return +// DeadlineExceeded / Canceled at the gRPC boundary rather than +// blocking past its own ctx. +// +// The test drives the first call into a fan-out that blocks +// indefinitely on a never-firing channel; while it sits in +// fan-out (holding cutoverSem), the second call attempts the +// RPC with an already-expired ctx and must return immediately +// with the matching gRPC code. +func TestEncryptionAdmin_EnableStorageEnvelope_HonorsCtxDeadlineWaitingOnMutator(t *testing.T) { + t.Parallel() + path := cutoverReadySidecarFixture(t) + blockFanout := make(chan struct{}) // never closed; first call blocks here forever + t.Cleanup(func() { close(blockFanout) }) + blockingFanout := func(callCtx context.Context) (admin.CapabilityFanoutResult, error) { + select { + case <-blockFanout: + return admin.CapabilityFanoutResult{}, errors.New("test teardown") + case <-callCtx.Done(): + return admin.CapabilityFanoutResult{}, callCtx.Err() + } + } + srv := NewEncryptionAdminServer( + WithEncryptionAdminProposer(&recordingProposer{}), + WithEncryptionAdminLeaderView(stubLeaderView{state: raftengine.StateLeader}), + WithEncryptionAdminSidecarPath(path), + WithEncryptionAdminCapabilityFanout(blockingFanout), + ) + // First call: launch in a goroutine; it holds cutoverSem + // while blocked in fan-out. + firstStarted := make(chan struct{}) + firstDone := make(chan struct{}) + go func() { + defer close(firstDone) + ctx, cancel := context.WithCancel(context.Background()) + t.Cleanup(cancel) + close(firstStarted) + _, _ = srv.EnableStorageEnvelope(ctx, validEnableStorageEnvelopeRequest()) + }() + <-firstStarted + // Yield so the first call reaches the fan-out wait — a + // short sleep is the only reliable signal short of plumbing + // a "fan-out entered" channel through the test fixture. + time.Sleep(50 * time.Millisecond) + // Second call with an already-expired ctx. It must NOT + // block past its deadline waiting on the mutator semaphore. + ctx, cancel := context.WithDeadline(context.Background(), time.Now().Add(-time.Millisecond)) + defer cancel() + start := time.Now() + _, err := srv.EnableStorageEnvelope(ctx, validEnableStorageEnvelopeRequest()) + if elapsed := time.Since(start); elapsed > 500*time.Millisecond { + t.Errorf("EnableStorageEnvelope blocked %v past ctx deadline; want immediate return", elapsed) + } + if status.Code(err) != codes.DeadlineExceeded { + t.Errorf("EnableStorageEnvelope status=%v, want DeadlineExceeded", status.Code(err)) + } +} + // TestEncryptionAdmin_EnableStorageEnvelope_PreservesContextCancellationOnFanoutNotOK // pins codex P2 round-2 on PR812: the production fan-out helper // can synthesize Reachable=false verdicts and return (result, nil) From 3b865e66cfaca1fce219f588e962272d17c89d68 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Sun, 24 May 2026 07:28:45 +0900 Subject: [PATCH 6/6] fix(encryption): PR812 round-5 - codex P1 (deterministic ctx short-circuit before semaphore acquire) Codex P1 round-4 on PR812: Pre-fix: acquireCutoverSemaphore entered a 2-case select (send to cutoverSem vs. ctx.Done) without first checking ctx.Err(). When ctx was already canceled AND the semaphore had capacity, BOTH select cases were immediately ready, and Go's select picks uniformly at random among ready cases - so the acquire could non-deterministically take the send branch and the RPC would proceed into precheck / fan-out / propose against an aborted caller. Fix: add an explicit ctx.Err() check before the select. The select still handles the during-wait cancellation case; the up-front check turns the pre-canceled case into a deterministic short-circuit. Regression test: TestEncryptionAdmin_EnableStorageEnvelope_PreCanceledCtxDeterministicShortCircuit table-driven across {canceled, deadline_exceeded}. Each case constructs a freshly-created server (semaphore empty, send slot available) and runs 100 iterations with a pre-canceled ctx. The fan-out fixture is failOnCallCapabilityFanout, which t.Errorf's if invoked - so any iteration that lost the random select coin flip and advanced into precheck would trip the test. Without the explicit ctx.Err() check, the 100-iteration loop would surface the bug as a flaky t.Errorf hit on roughly half the iterations. ## Caller audit (semantic change) acquireCutoverSemaphore is called only from EnableStorageEnvelope. The change narrows behaviour: a pre-canceled ctx now ALWAYS returns the matching gRPC code, rather than sometimes (when the semaphore is taken) running the cutover sequence to completion. Strictly safer; no other caller of acquireCutoverSemaphore exists. ## Verification - go test -race -timeout 60s -run TestEncryptionAdmin_EnableStorageEnvelope ./adapter/... - all green (the new 100-iteration test passes deterministically; would have flapped pre-fix) - golangci-lint --new-from-rev=origin/main - 0 issues --- adapter/encryption_admin.go | 12 ++++++ adapter/encryption_admin_test.go | 70 ++++++++++++++++++++++++++++++++ 2 files changed, 82 insertions(+) diff --git a/adapter/encryption_admin.go b/adapter/encryption_admin.go index 02a686386..ae4d4ce3a 100644 --- a/adapter/encryption_admin.go +++ b/adapter/encryption_admin.go @@ -748,7 +748,19 @@ func (s *EncryptionAdminServer) EnableStorageEnvelope(ctx context.Context, req * // A plain sync.Mutex would block past the caller's deadline, // breaking RPC cancellation semantics (codex P2 round-3 on // PR812). +// +// The explicit ctx.Err() check before the select is load-bearing: +// Go's select picks uniformly at random among ready cases (it +// does NOT prioritize ctx.Done()), so an already-canceled ctx +// paired with a free semaphore would coin-flip between acquiring +// the slot (and running precheck / fan-out / propose against an +// aborted caller) and the cancellation return. Checking +// ctx.Err() first turns the cancellation into a deterministic +// short-circuit (codex P1 round-4 on PR812). func (s *EncryptionAdminServer) acquireCutoverSemaphore(ctx context.Context) error { + if err := ctx.Err(); err != nil { + return cutoverSemaphoreErrorToStatus(err) + } select { case s.cutoverSem <- struct{}{}: return nil diff --git a/adapter/encryption_admin_test.go b/adapter/encryption_admin_test.go index 699b0a0b2..5a6769426 100644 --- a/adapter/encryption_admin_test.go +++ b/adapter/encryption_admin_test.go @@ -1510,6 +1510,76 @@ func TestEncryptionAdmin_EnableStorageEnvelope_SerializesConcurrentCutovers(t *t } } +// TestEncryptionAdmin_EnableStorageEnvelope_PreCanceledCtxDeterministicShortCircuit +// pins codex P1 round-4 on PR812: when ctx is already canceled +// at entry AND the cutoverSem has capacity (no in-flight cutover +// is holding it), Go's select would pick non-deterministically +// between the send case and the ctx.Done case. An explicit +// ctx.Err() check before the select turns the cancellation into +// a deterministic short-circuit so a caller who canceled before +// the RPC even started cannot accidentally drive precheck / +// fan-out / propose work. +// +// The test repeats the pre-canceled call N times against a fresh +// (capacity-free) semaphore: every iteration MUST surface the +// ctx error, never advance into the fan-out closure (which the +// test fixture would observe via a t.Errorf if hit). +func TestEncryptionAdmin_EnableStorageEnvelope_PreCanceledCtxDeterministicShortCircuit(t *testing.T) { + t.Parallel() + cases := []struct { + name string + ctxFn func() (context.Context, context.CancelFunc) + wantCode codes.Code + }{ + { + name: "canceled", + ctxFn: func() (context.Context, context.CancelFunc) { + ctx, cancel := context.WithCancel(context.Background()) + cancel() + return ctx, func() {} + }, + wantCode: codes.Canceled, + }, + { + name: "deadline_exceeded", + ctxFn: func() (context.Context, context.CancelFunc) { + ctx, cancel := context.WithDeadline(context.Background(), time.Now().Add(-time.Second)) + return ctx, cancel + }, + wantCode: codes.DeadlineExceeded, + }, + } + for _, tc := range cases { + t.Run(tc.name, func(t *testing.T) { + // failOnCallCapabilityFanout fires t.Errorf if the + // RPC ever advances into fan-out — that would mean + // the semaphore acquire DID take the send case + // despite the canceled ctx. + srv := NewEncryptionAdminServer( + WithEncryptionAdminProposer(&recordingProposer{}), + WithEncryptionAdminLeaderView(stubLeaderView{state: raftengine.StateLeader}), + WithEncryptionAdminSidecarPath(cutoverReadySidecarFixture(t)), + WithEncryptionAdminCapabilityFanout(failOnCallCapabilityFanout(t)), + ) + // 100 iterations against a freshly-created server + // (semaphore is empty / capacity available). Without + // the explicit ctx.Err() check, Go's pseudo-random + // select choice would make some of these iterations + // advance into precheck/fan-out, tripping the + // failOnCallCapabilityFanout fixture. + for range 100 { + ctx, cancel := tc.ctxFn() + _, err := srv.EnableStorageEnvelope(ctx, validEnableStorageEnvelopeRequest()) + cancel() + if status.Code(err) != tc.wantCode { + t.Fatalf("EnableStorageEnvelope status=%v, want %v (must short-circuit on pre-canceled ctx)", + status.Code(err), tc.wantCode) + } + } + }) + } +} + // TestEncryptionAdmin_EnableStorageEnvelope_HonorsCtxDeadlineWaitingOnMutator // pins codex P2 round-3 on PR812: when one cutover RPC holds // the mutator semaphore through a slow fan-out + propose, a