From a246a56259feaccb540baccc641e4d95ee350d4a Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Mon, 18 May 2026 18:26:03 +0900 Subject: [PATCH 1/6] feat(encryption): Stage 6D-2 - ErrNodeIDCollision + ErrLocalEpochRollback primitives + probe-node-id CLI Stage 6D-2 per the 6D design doc 5 (PR #786, merged). Ships the two bundled 6C-3 startup-guard primitives as pure functions in the encryption package, plus the probe-node-id CLI helper from 5.1 mitigation #2. Operator-inert until 6D-6 wires the primitives into main.go's startup-guard phase. ## What this ships ### Two new sentinel errors - ErrNodeIDCollision (errors.go): two distinct full_node_id values in the local route-catalog hash to the same 16-bit node_id, which would reuse GCM nonces under the active DEK. Triage message points at the probe-node-id CLI for pre-join verification. - ErrLocalEpochRollback (errors.go): sidecar local_epoch for the active storage DEK is <= the local writer-registry last_seen_local_epoch (strict-ahead invariant from 5.2 of the 6D design doc). Triage message points at resync-sidecar or DEK rotation. ### Two pure-function guard primitives - CheckNodeIDCollision(fullNodeIDs []uint64) error (node_id_collision.go): narrows each full_node_id via uint16(full_node_id & 0xFFFF) - same shipped mask as applier.go's nodeIDMask - and returns ErrNodeIDCollision if any two DISTINCT values share a node_id. Order- independent. Same-value-twice is NOT a collision (duplicate-membership-entry quirk; the route catalog watcher should dedupe upstream but the primitive defends correctness here). Local-state-only; no RPC, no engine access. - CheckLocalEpochRollback(registry, fullNodeID, dekID, sidecarLocalEpoch) error (local_epoch_rollback.go): reads the local Pebble writer-registry row at RegistryKey(dekID, uint16(fullNodeID & 0xFFFF)). Returns nil if no row exists (freshly-joined learner; 4.1 case-1 first-seen creates the row on first encrypted write). Returns ErrLocalEpochRollback when sidecar <= registry (the strict-ahead invariant - equality is NOT safe because the same (node_id, local_epoch) prefix would be replayed). Pebble I/O errors propagate as wrapped errors NOT marked with ErrLocalEpochRollback so the operator triages transport failure separately from a real rollback. ### CLI: probe-node-id - elastickv-admin encryption probe-node-id --full-node-id= (cmd/elastickv-admin/encryption.go): derives the 16-bit node_id from a candidate full_node_id using the same shipped narrowing convention. Operator runs it against candidates BEFORE joining a new node to verify the value is free in the cluster's allocated node_id space - the recommended mitigation per the 6D design doc 5.1 mitigation menu. Accepts decimal or 0x-prefixed hex input; prints both the 16-hex-digit canonical full_node_id and the derived node_id. ## Test coverage 11 new tests pin the load-bearing properties: node_id_collision_test.go (8 tests): - Empty - skip - SingleNode - skip - NoCollision - nil - Collision - ErrNodeIDCollision - DuplicateNotCollision - same full_node_id twice is NOT a collision - OrderIndependent - either ordering of the colliding pair catches the hit - ErrorIncludesBothIDs - operator-triage shape names both colliding values and the shared node_id local_epoch_rollback_test.go (6 tests): - NoRow - freshly-joined-learner skip - SidecarStrictlyAhead - nil (healthy case) - SidecarLessThan - ErrLocalEpochRollback (classic rollback) - SidecarEqual - ErrLocalEpochRollback (strict-ahead invariant; replay scenario) - PebbleError - underlying err propagated, NOT classified as ErrLocalEpochRollback - NarrowingMatchesShippedConvention - looks up registry row at uint16-narrowed key matching applier.go:230 encryption_test.go (5 new tests for probe-node-id): - Hex - 0x-prefixed input - Decimal - decimal input - RequiresFlag - missing --full-node-id is an error - RejectsBadInput - non-numeric/non-hex input is an error - EncryptionMain_ProbeNodeIDSubcommand - dispatch reaches runEncryptionProbeNodeID (not the unknown-subcommand default branch) ## Caller audit No semantic changes to existing functions; only additions. - New sentinels in errors.go - exported package surface additions, no caller audit needed (no existing callers to update). - New pure functions in node_id_collision.go and local_epoch_rollback.go - no callers yet (operator-inert until 6D-6 wires them). - New CLI subcommand probe-node-id - additive; no existing subcommand changed. ## Five-lens self-review 1. Data loss - net-positive. Both guards catch potential nonce-reuse scenarios BEFORE the first encrypted write that would corrupt confidentiality. 2. Concurrency / distributed failures - primitives are pure functions over local snapshots; no shared state. Run in the startup-before-serving phase before any goroutine fan-out. 3. Performance - O(N) hash map walk for CheckNodeIDCollision (N = cluster member count). Single Pebble Get for CheckLocalEpochRollback. Both paid once per process start on the happy path. 4. Data consistency - matches the shipped 16-bit narrowing convention from applier.go (nodeIDMask = 0xFFFF) and the writer-registry keying. The narrowing-matches test pins this against a future change that would diverge. 5. Test coverage - 11 new unit tests cover every documented branch + the operationally important edge cases (equality for strict-ahead, narrowing equivalence, error propagation). ## Verification - go test -race -timeout=60s ./internal/encryption/ ./cmd/elastickv-admin/ - PASS - golangci-lint run ./internal/encryption/ ./cmd/elastickv-admin/ - 0 issues - go build ./... - PASS ## Plan Stage 6D-2 ships the primitives + CLI as the bundled 6C-3 piece per the design doc 7 decomposition. Next sub-PRs per 7: - 6D-3: Voters union Learners CapabilityFanout helper - 6D-4: Wire format + FSM apply path - 6D-5: 6.2 storage-layer toggle - 6D-6: EnableStorageEnvelope admin RPC + CLI + integration test + main.go wiring of THIS PR's primitives The two primitives in this PR are operator-inert until 6D-6 wires them into main.go's startup-guard phase. The probe-node-id CLI is reachable today but only computes locally - no operator action required against the cluster. --- cmd/elastickv-admin/encryption.go | 65 +++++++- cmd/elastickv-admin/encryption_test.go | 99 ++++++++++++ internal/encryption/errors.go | 55 +++++++ internal/encryption/local_epoch_rollback.go | 91 +++++++++++ .../encryption/local_epoch_rollback_test.go | 152 ++++++++++++++++++ internal/encryption/node_id_collision.go | 73 +++++++++ internal/encryption/node_id_collision_test.go | 125 ++++++++++++++ 7 files changed, 658 insertions(+), 2 deletions(-) create mode 100644 internal/encryption/local_epoch_rollback.go create mode 100644 internal/encryption/local_epoch_rollback_test.go create mode 100644 internal/encryption/node_id_collision.go create mode 100644 internal/encryption/node_id_collision_test.go diff --git a/cmd/elastickv-admin/encryption.go b/cmd/elastickv-admin/encryption.go index de2702d6b..e093b782b 100644 --- a/cmd/elastickv-admin/encryption.go +++ b/cmd/elastickv-admin/encryption.go @@ -42,21 +42,82 @@ func encryptionMain(args []string) error { return runEncryptionRegisterWriter(rest, os.Stdout) case "bootstrap": return runEncryptionBootstrap(rest, os.Stdout) + case "probe-node-id": + return runEncryptionProbeNodeID(rest, os.Stdout) case "-h", "--help", "help": // `-h` is the universal "show usage" affordance for CLI // subcommands; returning nil keeps the exit code at 0 // so shell scripts using $? to detect success do not // trip on a help request. - _, err := fmt.Fprintln(os.Stdout, "usage: elastickv-admin encryption [flags]\n\nsubcommands:\n status\n rotate-dek\n register-writer\n bootstrap") + _, err := fmt.Fprintln(os.Stdout, "usage: elastickv-admin encryption [flags]\n\nsubcommands:\n status\n rotate-dek\n register-writer\n bootstrap\n probe-node-id") if err != nil { return errors.Wrap(err, "write usage") } return nil default: - return errors.Errorf("encryption: unknown subcommand %q (supported: status, rotate-dek, register-writer, bootstrap)", sub) + return errors.Errorf("encryption: unknown subcommand %q (supported: status, rotate-dek, register-writer, bootstrap, probe-node-id)", sub) } } +// runEncryptionProbeNodeID implements the §5.1 collision- +// mitigation helper from the 6D design doc: it derives the +// 16-bit `node_id` from a candidate `full_node_id` so the +// operator can verify the value is free in the cluster's +// allocated node_id space BEFORE joining the node and +// triggering an `ErrNodeIDCollision` refusal. +// +// The narrowing — `uint16(full_node_id & 0xFFFF)` — matches the +// shipped writer-registry keying and §4.1 GCM nonce prefix +// (see `internal/encryption/applier.go::nodeIDMask`). Anything +// other than this narrowing would compute a value that diverges +// from what the runtime uses, defeating the purpose of the +// probe. +// +// No RPC — this is a pure local computation. The operator runs +// it against a candidate `full_node_id` and a current cluster's +// allocated node_id list (out of scope for this subcommand; the +// operator can read `gh ... encryption status` on each member +// to gather them). +func runEncryptionProbeNodeID(args []string, out io.Writer) error { + fs := flag.NewFlagSet("encryption probe-node-id", flag.ContinueOnError) + fullNodeIDStr := fs.String("full-node-id", "", "candidate 64-bit full_node_id to probe (decimal or 0x-prefixed hex)") + if err := fs.Parse(args); err != nil { + return errors.Wrap(err, "parse probe-node-id flags") + } + if *fullNodeIDStr == "" { + return errors.New("--full-node-id is required (decimal or 0x-prefixed hex)") + } + full, err := parseUint64WithRadix(*fullNodeIDStr) + if err != nil { + return errors.Wrapf(err, "parse --full-node-id=%q", *fullNodeIDStr) + } + const nodeIDMask = 0xFFFF + narrowed := uint16(full & nodeIDMask) //nolint:gosec // masked to 16 bits; matches applier.go convention + if _, err := fmt.Fprintf(out, "full_node_id: %#016x (%d)\nnode_id: %#04x (%d)\n", + full, full, narrowed, narrowed); err != nil { + return errors.Wrap(err, "write probe-node-id result") + } + return nil +} + +// parseUint64WithRadix accepts either decimal ("12345") or +// 0x-prefixed hex ("0xDEADBEEF") so operators can paste values +// in whichever form their inventory uses. +func parseUint64WithRadix(s string) (uint64, error) { + if len(s) >= 2 && (s[0:2] == "0x" || s[0:2] == "0X") { + var v uint64 + if _, err := fmt.Sscanf(s[2:], "%x", &v); err != nil { + return 0, errors.Wrap(err, "hex parse") + } + return v, nil + } + var v uint64 + if _, err := fmt.Sscanf(s, "%d", &v); err != nil { + return 0, errors.Wrap(err, "decimal parse") + } + return v, nil +} + type encryptionEndpointFlags struct { endpoint *string timeout *time.Duration diff --git a/cmd/elastickv-admin/encryption_test.go b/cmd/elastickv-admin/encryption_test.go index d4e44239e..06661ec37 100644 --- a/cmd/elastickv-admin/encryption_test.go +++ b/cmd/elastickv-admin/encryption_test.go @@ -462,3 +462,102 @@ func startCustomEncryptionAdminTestServer(t *testing.T, impl pb.EncryptionAdminS }) return lis.Addr().String() } + +// TestRunEncryptionProbeNodeID_Hex pins the §5.1 collision- +// mitigation helper from the 6D design doc: a hex full_node_id +// is parsed and narrowed via uint16(full_node_id & 0xFFFF) — the +// same mask that applier.go (writer-registry keying) and §4.1 +// (GCM nonce prefix) already use. The CLI prints both the +// full_node_id (canonical 16-hex-digit form) and the derived +// node_id so the operator can verify the value is free in the +// cluster's allocated node_id space before joining. +func TestRunEncryptionProbeNodeID_Hex(t *testing.T) { + t.Parallel() + var buf bytes.Buffer + err := runEncryptionProbeNodeID([]string{"--full-node-id=0xDEADBEEFCAFE1234"}, &buf) + if err != nil { + t.Fatalf("runEncryptionProbeNodeID: %v", err) + } + out := buf.String() + for _, needle := range []string{ + "full_node_id: 0xdeadbeefcafe1234", + "node_id: 0x1234", + } { + if !strings.Contains(out, needle) { + t.Errorf("output missing %q:\n%s", needle, out) + } + } +} + +// TestRunEncryptionProbeNodeID_Decimal pins the dual-radix +// parser: decimal input works alongside the 0x-prefixed hex case. +func TestRunEncryptionProbeNodeID_Decimal(t *testing.T) { + t.Parallel() + var buf bytes.Buffer + err := runEncryptionProbeNodeID([]string{"--full-node-id=305419896"}, &buf) + if err != nil { + t.Fatalf("runEncryptionProbeNodeID: %v", err) + } + // 305419896 = 0x12345678; low 16 bits = 0x5678 + out := buf.String() + for _, needle := range []string{ + "full_node_id: 0x0000000012345678", + "node_id: 0x5678", + } { + if !strings.Contains(out, needle) { + t.Errorf("output missing %q:\n%s", needle, out) + } + } +} + +// TestRunEncryptionProbeNodeID_RequiresFlag pins the +// required-flag contract: omitting --full-node-id returns an +// explicit error rather than producing a misleading "probe of +// zero" output. +func TestRunEncryptionProbeNodeID_RequiresFlag(t *testing.T) { + t.Parallel() + var buf bytes.Buffer + err := runEncryptionProbeNodeID([]string{}, &buf) + if err == nil { + t.Fatal("missing --full-node-id: want error, got nil") + } + if !strings.Contains(err.Error(), "--full-node-id is required") { + t.Errorf("error message: want mention of required flag, got %v", err) + } + if buf.Len() != 0 { + t.Errorf("must not print output on error: got %q", buf.String()) + } +} + +// TestRunEncryptionProbeNodeID_RejectsBadInput pins the +// invalid-input case: non-numeric, non-hex input returns an +// error rather than silently truncating. +func TestRunEncryptionProbeNodeID_RejectsBadInput(t *testing.T) { + t.Parallel() + var buf bytes.Buffer + err := runEncryptionProbeNodeID([]string{"--full-node-id=notanumber"}, &buf) + if err == nil { + t.Fatal("invalid input: want error, got nil") + } +} + +// TestEncryptionMain_ProbeNodeIDSubcommand pins the dispatch in +// encryptionMain: the new probe-node-id case must route to the +// runner. Without this, a typo in encryptionMain's switch +// statement would route the subcommand to the default branch +// and fail with "unknown subcommand", and the dispatch test +// would not catch it. +func TestEncryptionMain_ProbeNodeIDSubcommand(t *testing.T) { + t.Parallel() + // Help-flag path: probe-node-id without flags errors with + // "required" rather than "unknown subcommand". A successful + // dispatch route reaches runEncryptionProbeNodeID's + // missing-flag branch. + err := encryptionMain([]string{"probe-node-id"}) + if err == nil { + t.Fatal("probe-node-id with no flags: want error, got nil") + } + if !strings.Contains(err.Error(), "--full-node-id is required") { + t.Errorf("dispatch reached wrong handler: got %v", err) + } +} diff --git a/internal/encryption/errors.go b/internal/encryption/errors.go index c905494e5..09e6675cb 100644 --- a/internal/encryption/errors.go +++ b/internal/encryption/errors.go @@ -208,4 +208,59 @@ var ( // from forcing a spurious refusal. See §5.5 for the // IsEncryptionRelevantOpcode rationale. ErrSidecarBehindRaftLog = errors.New("encryption: sidecar raft_applied_index is behind the raftengine's persisted applied index and the gap covers an encryption-relevant entry; refusing to start (run `encryption resync-sidecar` to advance the sidecar past the encryption-relevant entries before retrying — see §9.1 + §5.5)") + + // ErrNodeIDCollision is the §9.1 / 6C-3 startup-refusal guard + // raised when two distinct `full_node_id` values in the local + // route-catalog snapshot map to the same 16-bit `node_id`. The + // derivation is `uint16(full_node_id & 0xFFFF)` — same narrowing + // the writer-registry keying and §4.1 GCM nonce prefix already + // use (see `applier.go` `nodeIDMask`). A collision means the two + // members would write under the same `(node_id, local_epoch)` + // prefix and reuse the GCM counter under the same DEK, which + // breaks AES-GCM's nonce-uniqueness requirement and is a + // catastrophic confidentiality + integrity failure. + // + // The guard reads the LOCAL route-catalog snapshot only — no + // RPC fan-out — because the startup-guard phase runs BEFORE the + // gRPC server is up, and the local Raft log is authoritative + // for cluster membership (every node applies the same + // `ConfChange` entries). See `2026_05_18_proposed_6d_enable_storage_envelope.md` + // §5.1 for the lifecycle rationale and the operator-mitigation + // menu (`probe-node-id` CLI, full_node_id re-roll, etc.). + // + // Skip conditions: encryption disabled (no nonce-reuse risk), + // empty membership snapshot (single-node pre-bootstrap; nothing + // to compare yet). + ErrNodeIDCollision = errors.New("encryption: two distinct full_node_id values in the local route-catalog hash to the same 16-bit node_id, which would reuse GCM nonces under the active DEK; refusing to start (re-roll one of the colliding full_node_id values — use `elastickv-admin encryption probe-node-id --full-node-id=` to verify before joining — see §9.1 + §5.1 of the 6D design doc)") + + // ErrLocalEpochRollback is the §9.1 / 6C-3 startup-refusal + // guard raised when this node's sidecar `local_epoch` for the + // active storage DEK is less than OR equal to the local Pebble + // writer-registry's `LastSeenLocalEpoch` for the + // `(full_node_id, active_storage_dek_id)` row. The strict-ahead + // posture — `sidecar > registry`, not `>=` — is the §5.2 + // nonce-monotonicity invariant: a node may only resume issuing + // nonces when its sidecar is STRICTLY ahead of the registry + // record. The equality case would replay the same + // `(node_id, local_epoch)` prefix and reuse the GCM counter + // under the same DEK, identical to the collision scenario + // `ErrNodeIDCollision` catches but at the single-node-restart + // timescale rather than the cluster-membership timescale. + // + // Classic trigger: operator restored the sidecar from an old + // backup, leaving `local_epoch` behind the registry record + // that the node has already used for prior writes. + // + // The guard reads LOCAL Pebble state only — no RPC. The + // writer-registry is local state on every node post-bootstrap + // (every node applies the same OpRegistration and OpBootstrap + // entries), so the local row is authoritative for this node's + // own monotonicity contract. + // + // Skip conditions: encryption disabled, sidecar's + // `Active.Storage == 0` (bootstrap not yet committed), no + // registry row for `(full_node_id, active_storage_dek_id)` (a + // freshly-joined learner that hasn't proposed a + // `RegisterEncryptionWriter` yet). + ErrLocalEpochRollback = errors.New("encryption: sidecar local_epoch for the active storage DEK is at or below the local writer-registry's last_seen_local_epoch, which would replay the GCM nonce prefix under the same DEK; refusing to start (verify the sidecar was not restored from an old backup; run `encryption resync-sidecar` if appropriate, or rotate the affected DEK — see §9.1 + §5.2 of the 6D design doc)") ) diff --git a/internal/encryption/local_epoch_rollback.go b/internal/encryption/local_epoch_rollback.go new file mode 100644 index 000000000..f1431965a --- /dev/null +++ b/internal/encryption/local_epoch_rollback.go @@ -0,0 +1,91 @@ +package encryption + +import ( + pkgerrors "github.com/cockroachdb/errors" +) + +// CheckLocalEpochRollback is the §9.1 / 6C-3 startup-guard +// primitive for `ErrLocalEpochRollback`. It compares this node's +// sidecar `local_epoch` for the active storage DEK against the +// local writer-registry's `LastSeenLocalEpoch` for the +// `(full_node_id, active_storage_dek_id)` row, and returns +// `ErrLocalEpochRollback` when `sidecar <= registry` — the +// strict-ahead invariant from §5.2 of the 6D design doc. +// +// Why strict-ahead and not "strictly less than": +// +// - `sidecar < registry`: classic rollback (sidecar restored +// from an old backup); the node would reissue +// `(node_id, local_epoch)` prefixes already consumed by prior +// writes under the same DEK. +// +// - `sidecar == registry`: replay of the same epoch; the node +// would reissue the SAME prefix and reuse the GCM counter, +// identical to the collision scenario but at the +// single-node-restart timescale rather than the +// cluster-membership timescale (which `ErrNodeIDCollision` +// handles). +// +// - `sidecar > registry`: the healthy case; the node has +// advanced its sidecar past the last replicated registration, +// so the next nonce prefix is fresh. +// +// The primitive consults LOCAL state only (sidecar + Pebble +// writer-registry on this node). No RPC. This matches the +// startup-before-serving phase where the gRPC server is not up. +// +// Skip conditions handled by the CALLER (the startup-guard +// wiring), not by this primitive: +// +// - Encryption disabled. +// - Sidecar's `Active.Storage == 0` (bootstrap not yet +// committed; no DEK to compare against). +// +// Skip condition handled HERE (inside the primitive): +// +// - The writer-registry has no row for +// `(full_node_id, active_storage_dek_id)`. A freshly-joined +// learner that has not yet proposed a +// `RegisterEncryptionWriter` legitimately lacks a registry +// row; the §4.1 case-1 first-seen monotonicity branch will +// create it on the next encrypted write. Returning nil here +// preserves that lifecycle. +// +// Returns `ErrLocalEpochRollback` wrapped with both observed +// values when `sidecar <= registry`. Returns the wrapped +// underlying error (NOT marked with `ErrLocalEpochRollback`) on +// a Pebble I/O error from the registry read so the operator +// triages a transport failure separately from a real rollback. +func CheckLocalEpochRollback( + registry WriterRegistryStore, + fullNodeID uint64, + activeStorageDEKID uint32, + sidecarLocalEpoch uint16, +) error { + key := RegistryKey(activeStorageDEKID, uint16(fullNodeID&nodeIDMask)) //nolint:gosec // masked to 16 bits; matches applier.go convention + rawVal, ok, err := registry.GetRegistryRow(key) + if err != nil { + return pkgerrors.Wrapf(err, + "local_epoch rollback guard: read writer-registry row for full_node_id=%#x dek_id=%d", + fullNodeID, activeStorageDEKID) + } + if !ok { + // Freshly-joined learner with no registry row yet. Per + // §5.2 skip-condition, this is a legitimate lifecycle + // state — the first encrypted write will create the row + // via §4.1 case-1 first-seen. + return nil + } + registryRow, err := DecodeRegistryValue(rawVal) + if err != nil { + return pkgerrors.Wrapf(err, + "local_epoch rollback guard: decode writer-registry row for full_node_id=%#x dek_id=%d", + fullNodeID, activeStorageDEKID) + } + if sidecarLocalEpoch <= registryRow.LastSeenLocalEpoch { + return pkgerrors.Wrapf(ErrLocalEpochRollback, + "sidecar local_epoch=%d <= registry last_seen_local_epoch=%d (full_node_id=%#x dek_id=%d) — strict-ahead invariant requires sidecar > registry", + sidecarLocalEpoch, registryRow.LastSeenLocalEpoch, fullNodeID, activeStorageDEKID) + } + return nil +} diff --git a/internal/encryption/local_epoch_rollback_test.go b/internal/encryption/local_epoch_rollback_test.go new file mode 100644 index 000000000..2833c20c9 --- /dev/null +++ b/internal/encryption/local_epoch_rollback_test.go @@ -0,0 +1,152 @@ +package encryption_test + +import ( + "errors" + "testing" + + "github.com/bootjp/elastickv/internal/encryption" +) + +// stubRegistryStore is an in-memory WriterRegistryStore for the +// rollback-guard tests. Keyed by string so the test can populate +// rows by RegistryKey output without exporting unexported state. +type stubRegistryStore struct { + rows map[string][]byte + getErr error +} + +func (s *stubRegistryStore) GetRegistryRow(key []byte) ([]byte, bool, error) { + if s.getErr != nil { + return nil, false, s.getErr + } + v, ok := s.rows[string(key)] + return v, ok, nil +} + +func (s *stubRegistryStore) SetRegistryRow(key []byte, value []byte) error { + if s.rows == nil { + s.rows = map[string][]byte{} + } + s.rows[string(key)] = value + return nil +} + +// seedRegistry constructs a stub registry pre-populated with one +// row at the registry key derived from `full` (narrowed to its +// low 16 bits) under DEK id 1. All current tests probe the same +// active-DEK because the rollback contract is per-DEK and the +// table-of-DEKs case is exercised by `applier_test.go`'s +// case-1/2/3 dispatch tests. Future tests that need a +// different DEK id should accept a dekID parameter — for now +// hardcoding keeps the test fixtures tight. +func seedRegistry(t *testing.T, full uint64, lastSeen uint16) *stubRegistryStore { + t.Helper() + const seededDEKID uint32 = 1 + reg := &stubRegistryStore{rows: map[string][]byte{}} + key := encryption.RegistryKey(seededDEKID, uint16(full&0xFFFF)) //nolint:gosec // masked to 16 bits + row := encryption.EncodeRegistryValue(encryption.RegistryValue{ + FullNodeID: full, + FirstSeenLocalEpoch: 0, + LastSeenLocalEpoch: lastSeen, + }) + if err := reg.SetRegistryRow(key, row); err != nil { + t.Fatalf("seedRegistry: %v", err) + } + return reg +} + +// TestCheckLocalEpochRollback_NoRow pins the freshly-joined- +// learner skip: the registry has no row for +// (full_node_id, active_storage_dek_id), so the primitive must +// return nil per §5.2 skip-condition. The §4.1 case-1 first-seen +// branch will create the row on the next encrypted write. +func TestCheckLocalEpochRollback_NoRow(t *testing.T) { + t.Parallel() + reg := &stubRegistryStore{rows: map[string][]byte{}} + err := encryption.CheckLocalEpochRollback(reg, 0xAAAA, 1, 42) + if err != nil { + t.Errorf("no-row case: want nil, got %v", err) + } +} + +// TestCheckLocalEpochRollback_SidecarStrictlyAhead pins the +// healthy case: sidecar > registry (the strict-ahead invariant) +// returns nil. +func TestCheckLocalEpochRollback_SidecarStrictlyAhead(t *testing.T) { + t.Parallel() + reg := seedRegistry(t, 0xAAAA, 10) + if err := encryption.CheckLocalEpochRollback(reg, 0xAAAA, 1, 11); err != nil { + t.Errorf("sidecar=11 > registry=10: want nil, got %v", err) + } +} + +// TestCheckLocalEpochRollback_SidecarLessThan pins the classic +// rollback case (sidecar restored from old backup): sidecar < +// registry returns ErrLocalEpochRollback. +func TestCheckLocalEpochRollback_SidecarLessThan(t *testing.T) { + t.Parallel() + reg := seedRegistry(t, 0xAAAA, 42) + err := encryption.CheckLocalEpochRollback(reg, 0xAAAA, 1, 10) + if !errors.Is(err, encryption.ErrLocalEpochRollback) { + t.Fatalf("sidecar=10 < registry=42: want ErrLocalEpochRollback, got %v", err) + } +} + +// TestCheckLocalEpochRollback_SidecarEqual pins the equality +// case: the strict-ahead invariant requires sidecar > registry, +// so equality is also a rollback (the SAME (node_id, +// local_epoch) prefix would be replayed under the same DEK). +// This is the case codex r6 flagged as a load-bearing edge. +func TestCheckLocalEpochRollback_SidecarEqual(t *testing.T) { + t.Parallel() + reg := seedRegistry(t, 0xAAAA, 42) + err := encryption.CheckLocalEpochRollback(reg, 0xAAAA, 1, 42) + if !errors.Is(err, encryption.ErrLocalEpochRollback) { + t.Fatalf("sidecar=42 == registry=42: want ErrLocalEpochRollback, got %v", err) + } +} + +// TestCheckLocalEpochRollback_PebbleError pins the I/O-error +// propagation path: a Pebble error from GetRegistryRow MUST NOT +// be classified as ErrLocalEpochRollback (the operator triages +// transport failure separately from a real rollback). +func TestCheckLocalEpochRollback_PebbleError(t *testing.T) { + t.Parallel() + pebbleErr := errors.New("simulated pebble corruption") + reg := &stubRegistryStore{getErr: pebbleErr} + err := encryption.CheckLocalEpochRollback(reg, 0xAAAA, 1, 10) + if err == nil { + t.Fatal("pebble error: want error, got nil") + } + if errors.Is(err, encryption.ErrLocalEpochRollback) { + t.Errorf("pebble error must NOT be classified as ErrLocalEpochRollback: %v", err) + } + if !errors.Is(err, pebbleErr) { + t.Errorf("underlying pebble error must be in chain: %v", err) + } +} + +// TestCheckLocalEpochRollback_NarrowingMatchesShippedConvention +// pins the load-bearing 16-bit narrowing: the primitive must +// look up the registry row at RegistryKey(dekID, +// uint16(full_node_id & 0xFFFF)), matching applier.go:230's +// shipped convention. A 6D-2 implementor that derives the +// registry key via xxhash (or any other 16-bit projection) +// would lookup at a different key and the no-row branch would +// fire spuriously, masking a real rollback. +// +// Verify by seeding the row at uint16-narrowed key and looking +// up with a full_node_id whose HIGH bits differ — narrowing +// must hit the same row. +func TestCheckLocalEpochRollback_NarrowingMatchesShippedConvention(t *testing.T) { + t.Parallel() + // Seed with full_node_id 0xDEADBEEF_0000AAAA (narrowed=0xAAAA). + reg := seedRegistry(t, 0xDEADBEEF_0000AAAA, 50) + // Look up with a different full_node_id that ALSO narrows + // to 0xAAAA. Same registry row should be hit; sidecar=10 < + // registry=50 should fire ErrLocalEpochRollback. + err := encryption.CheckLocalEpochRollback(reg, 0xCAFEBABE_0000AAAA, 1, 10) + if !errors.Is(err, encryption.ErrLocalEpochRollback) { + t.Fatalf("narrowing-match: want ErrLocalEpochRollback, got %v", err) + } +} diff --git a/internal/encryption/node_id_collision.go b/internal/encryption/node_id_collision.go new file mode 100644 index 000000000..9c6f7b2bc --- /dev/null +++ b/internal/encryption/node_id_collision.go @@ -0,0 +1,73 @@ +package encryption + +import ( + pkgerrors "github.com/cockroachdb/errors" +) + +// CheckNodeIDCollision is the §9.1 / 6C-3 startup-guard primitive +// for `ErrNodeIDCollision`. It walks the supplied list of +// `full_node_id` values (typically every voter + learner in the +// default group's local route-catalog snapshot), narrows each to +// its 16-bit `node_id` via the same `uint16(full_node_id & 0xFFFF)` +// mask that the writer-registry keying and §4.1 GCM nonce prefix +// use, and returns `ErrNodeIDCollision` if any two DISTINCT +// `full_node_id` values share the same `node_id`. +// +// Skip conditions handled by the caller (the startup-guard wiring +// in main.go), not by this primitive: +// +// - Encryption disabled (no nonce-reuse risk). +// - Membership snapshot empty (single-node pre-bootstrap; +// nothing to compare yet). +// +// This primitive does NOT consult any sidecar, registry, or RPC +// transport. It runs in the startup-before-serving phase where +// none of those are available yet (the gRPC server is not up). +// +// Determinism: the caller may pass full_node_ids in any order; +// detection is symmetric. On a hit, the returned error wraps +// the two colliding `full_node_id` values and the shared +// `node_id` so the operator triage line names the conflict +// concretely. +// +// Returns nil when: +// +// - The slice has fewer than two unique values (no possible +// collision). +// - No two distinct values map to the same 16-bit `node_id`. +// +// Returns `ErrNodeIDCollision` wrapped with offending IDs when +// any two distinct `full_node_id` values share a `node_id`. +// minMembersForCollision is the smallest membership-set size at +// which two distinct full_node_id values can exist (and therefore +// at which a 16-bit narrowing collision becomes possible). With +// 0 or 1 member there is nothing to compare; skip early. +const minMembersForCollision = 2 + +func CheckNodeIDCollision(fullNodeIDs []uint64) error { + if len(fullNodeIDs) < minMembersForCollision { + return nil + } + // node_id -> first full_node_id that mapped to it. On second + // hit at the same node_id with a DIFFERENT full_node_id, we + // have a real collision; same full_node_id appearing twice is + // not a collision (duplicate-membership-entry quirk; the + // route catalog watcher should already dedupe, but defending + // here keeps the primitive correct under any input ordering). + seen := make(map[uint16]uint64, len(fullNodeIDs)) + for _, fnid := range fullNodeIDs { + nodeID := uint16(fnid & nodeIDMask) //nolint:gosec // masked to 16 bits; matches applier.go/encryption_admin.go convention + prev, exists := seen[nodeID] + if !exists { + seen[nodeID] = fnid + continue + } + if prev == fnid { + continue + } + return pkgerrors.Wrapf(ErrNodeIDCollision, + "full_node_id=%#x and full_node_id=%#x both narrow to node_id=%#04x", + prev, fnid, nodeID) + } + return nil +} diff --git a/internal/encryption/node_id_collision_test.go b/internal/encryption/node_id_collision_test.go new file mode 100644 index 000000000..94030cbc2 --- /dev/null +++ b/internal/encryption/node_id_collision_test.go @@ -0,0 +1,125 @@ +package encryption_test + +import ( + "errors" + "strings" + "testing" + + "github.com/bootjp/elastickv/internal/encryption" +) + +// TestCheckNodeIDCollision_Empty pins the skip-on-empty case: +// with no full_node_ids the primitive cannot detect a collision +// and MUST return nil. The caller (startup-guard wiring) is +// responsible for the encryption-disabled and +// pre-bootstrap-empty-membership skip conditions. +func TestCheckNodeIDCollision_Empty(t *testing.T) { + t.Parallel() + if err := encryption.CheckNodeIDCollision(nil); err != nil { + t.Errorf("nil slice: want nil, got %v", err) + } + if err := encryption.CheckNodeIDCollision([]uint64{}); err != nil { + t.Errorf("empty slice: want nil, got %v", err) + } +} + +// TestCheckNodeIDCollision_SingleNode pins the single-member +// case: one full_node_id alone cannot collide with itself, so +// the primitive returns nil. +func TestCheckNodeIDCollision_SingleNode(t *testing.T) { + t.Parallel() + if err := encryption.CheckNodeIDCollision([]uint64{0xAAAA}); err != nil { + t.Errorf("single node: want nil, got %v", err) + } +} + +// TestCheckNodeIDCollision_NoCollision pins the healthy +// multi-node case: every distinct full_node_id maps to a +// distinct 16-bit node_id. +func TestCheckNodeIDCollision_NoCollision(t *testing.T) { + t.Parallel() + // Three values whose low 16 bits are clearly distinct + // (0x0001 / 0x0002 / 0x0003). + fnids := []uint64{ + 0xDEADBEEF_00000001, + 0xDEADBEEF_00000002, + 0xDEADBEEF_00000003, + } + if err := encryption.CheckNodeIDCollision(fnids); err != nil { + t.Errorf("no collision: want nil, got %v", err) + } +} + +// TestCheckNodeIDCollision_Collision pins the fire path: two +// DISTINCT full_node_id values whose low 16 bits match. The +// classic hit is two values that differ only above bit 16. +func TestCheckNodeIDCollision_Collision(t *testing.T) { + t.Parallel() + // 0xDEADBEEF_0000_AAAA and 0xCAFEBABE_0000_AAAA both narrow + // to node_id 0xAAAA. + fnids := []uint64{ + 0xDEADBEEF_0000AAAA, + 0xCAFEBABE_0000AAAA, + } + err := encryption.CheckNodeIDCollision(fnids) + if !errors.Is(err, encryption.ErrNodeIDCollision) { + t.Fatalf("collision: want ErrNodeIDCollision, got %v", err) + } +} + +// TestCheckNodeIDCollision_DuplicateNotCollision verifies that +// the SAME full_node_id appearing twice in the slice (which a +// route-catalog watcher should already dedupe but the primitive +// defends against under any input ordering) is NOT reported as +// a collision. Only DISTINCT full_node_ids mapping to the same +// node_id count. +func TestCheckNodeIDCollision_DuplicateNotCollision(t *testing.T) { + t.Parallel() + fnids := []uint64{0xAAAA, 0xAAAA, 0xBBBB} + if err := encryption.CheckNodeIDCollision(fnids); err != nil { + t.Errorf("duplicates of same full_node_id: want nil, got %v", err) + } +} + +// TestCheckNodeIDCollision_OrderIndependent pins the symmetry +// property: detection does not depend on which colliding value +// appeared first in the slice. The route catalog may emit +// members in any order; the guard must catch the collision +// either way. +func TestCheckNodeIDCollision_OrderIndependent(t *testing.T) { + t.Parallel() + a := []uint64{0xDEADBEEF_0000AAAA, 0xCAFEBABE_0000AAAA} + b := []uint64{0xCAFEBABE_0000AAAA, 0xDEADBEEF_0000AAAA} + if err := encryption.CheckNodeIDCollision(a); !errors.Is(err, encryption.ErrNodeIDCollision) { + t.Fatalf("order A: want ErrNodeIDCollision, got %v", err) + } + if err := encryption.CheckNodeIDCollision(b); !errors.Is(err, encryption.ErrNodeIDCollision) { + t.Fatalf("order B: want ErrNodeIDCollision, got %v", err) + } +} + +// TestCheckNodeIDCollision_ErrorIncludesBothIDs pins the +// operator-triage shape: the wrapped error must name BOTH +// colliding full_node_id values and the shared node_id so the +// operator can re-roll the right one without guessing. +func TestCheckNodeIDCollision_ErrorIncludesBothIDs(t *testing.T) { + t.Parallel() + fnids := []uint64{ + 0xDEADBEEF_0000AAAA, + 0xCAFEBABE_0000AAAA, + } + err := encryption.CheckNodeIDCollision(fnids) + if err == nil { + t.Fatal("want error, got nil") + } + msg := strings.ToLower(err.Error()) + for _, needle := range []string{ + "0xdeadbeef0000aaaa", + "0xcafebabe0000aaaa", + "0xaaaa", + } { + if !strings.Contains(msg, needle) { + t.Errorf("error message missing %q: %s", needle, msg) + } + } +} From 9eecdb559c22ce43fa69d0b5dc6aeee4119f492d Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Mon, 18 May 2026 18:36:07 +0900 Subject: [PATCH 2/6] fix(encryption): PR788 round-1 - claude high (strconv.ParseUint) + medium (go-doc attribution) Two findings from claude r1 review on PR #788: ## claude high - parseUint64WithRadix silently accepts partial input cmd/elastickv-admin/encryption.go's parseUint64WithRadix used fmt.Sscanf for both hex and decimal parsing. Sscanf returns success once it matches the format, even if the input has trailing unparsed characters: "0x1234ZZZZ" -> parses as 0x1234 without error "1234abc" -> parses as 1234 without error This was unsafe for an operator-visible identifier where every digit matters - a typo'd full_node_id with a non-hex tail would silently return a different value than the operator meant, leading the operator to join a node under the wrong node_id. Fix: switched to strconv.ParseUint, which requires the ENTIRE input to be valid for the chosen radix. Updated import set and added TestRunEncryptionProbeNodeID_RejectsPartialInput with three regression cases (hex prefix + non-hex tail, decimal prefix + letter tail, hex with embedded space). ## claude medium - minMembersForCollision placement broke go doc internal/encryption/node_id_collision.go had the minMembersForCollision constant declared IMMEDIATELY before CheckNodeIDCollision (no blank line). In Go, a doc comment preceding a declaration is attributed to that declaration, so the 38-line doc comment block was being attributed to the CONSTANT by go doc, not to the function. go doc encryption.CheckNodeIDCollision would have produced no documentation. Fix: hoisted the constant declaration ABOVE the function's doc comment, with its own short comment. Both declarations now have correct go-doc attribution. ## Gemini medium NOT addressed (verified false) Gemini r1 flagged %#016x as producing 14 hex digits (claiming the 0x prefix counted toward the width). Verified directly: Go emits 16 hex digits after the 0x prefix for %#016x (output 0x0000000012345678 = 18 chars total). My tests already assert this exact form and pass. Gemini's claim is incorrect per Go's actual format behavior; no change needed. ## Caller audit (per cron directive) parseUint64WithRadix is package-private; the only caller is runEncryptionProbeNodeID (one call site). The semantic change is "strict parse vs. silent prefix" - the strict parse is the safer posture and no existing test relied on the silent-prefix behavior (the existing RejectsBadInput test used "notanumber" which fails both implementations). minMembersForCollision is a package-private constant only used within node_id_collision.go. Hoisting the declaration is structural; no semantic change. ## Verification - go test -race -timeout=60s ./internal/encryption/ ./cmd/elastickv-admin/ - PASS - golangci-lint run ./internal/encryption/ ./cmd/elastickv-admin/ - 0 issues - go build ./... - PASS No semantic changes to existing functions outside the parseUint64WithRadix bug-fix. --- cmd/elastickv-admin/encryption.go | 17 ++++++++++++----- cmd/elastickv-admin/encryption_test.go | 23 +++++++++++++++++++++++ internal/encryption/node_id_collision.go | 12 ++++++------ 3 files changed, 41 insertions(+), 11 deletions(-) diff --git a/cmd/elastickv-admin/encryption.go b/cmd/elastickv-admin/encryption.go index e093b782b..7409cb42d 100644 --- a/cmd/elastickv-admin/encryption.go +++ b/cmd/elastickv-admin/encryption.go @@ -7,6 +7,7 @@ import ( "io" "os" "slices" + "strconv" "time" pb "github.com/bootjp/elastickv/proto" @@ -102,17 +103,23 @@ func runEncryptionProbeNodeID(args []string, out io.Writer) error { // parseUint64WithRadix accepts either decimal ("12345") or // 0x-prefixed hex ("0xDEADBEEF") so operators can paste values -// in whichever form their inventory uses. +// in whichever form their inventory uses. Uses strconv.ParseUint +// rather than fmt.Sscanf: ParseUint requires the ENTIRE input to +// be valid for the chosen radix, whereas Sscanf stops at the +// first non-matching character and silently returns the prefix. +// The silent-prefix behaviour would let "0x1234ZZZZ" parse as +// 0x1234 (or "1234abc" as 1234), which is unsafe for an +// operator-visible identifier where every digit matters. func parseUint64WithRadix(s string) (uint64, error) { if len(s) >= 2 && (s[0:2] == "0x" || s[0:2] == "0X") { - var v uint64 - if _, err := fmt.Sscanf(s[2:], "%x", &v); err != nil { + v, err := strconv.ParseUint(s[2:], 16, 64) + if err != nil { return 0, errors.Wrap(err, "hex parse") } return v, nil } - var v uint64 - if _, err := fmt.Sscanf(s, "%d", &v); err != nil { + v, err := strconv.ParseUint(s, 10, 64) + if err != nil { return 0, errors.Wrap(err, "decimal parse") } return v, nil diff --git a/cmd/elastickv-admin/encryption_test.go b/cmd/elastickv-admin/encryption_test.go index 06661ec37..bbbb4bc5a 100644 --- a/cmd/elastickv-admin/encryption_test.go +++ b/cmd/elastickv-admin/encryption_test.go @@ -541,6 +541,29 @@ func TestRunEncryptionProbeNodeID_RejectsBadInput(t *testing.T) { } } +// TestRunEncryptionProbeNodeID_RejectsPartialInput pins the +// strconv.ParseUint vs fmt.Sscanf distinction: parseUint64WithRadix +// MUST reject inputs where only a prefix is parseable, because +// "0x1234ZZZZ" silently parsing as 0x1234 would mislead the +// operator into joining a node under a different node_id than +// the one they meant to probe. Sscanf accepts partial input; +// ParseUint rejects it. +func TestRunEncryptionProbeNodeID_RejectsPartialInput(t *testing.T) { + t.Parallel() + cases := []string{ + "--full-node-id=0x1234ZZZZ", // hex prefix with non-hex tail + "--full-node-id=1234abc", // decimal prefix with letter tail + "--full-node-id=0xDEAD GHI", // hex with embedded space + } + for _, tc := range cases { + var buf bytes.Buffer + err := runEncryptionProbeNodeID([]string{tc}, &buf) + if err == nil { + t.Errorf("partial-input %q: want error, got nil (output=%q)", tc, buf.String()) + } + } +} + // TestEncryptionMain_ProbeNodeIDSubcommand pins the dispatch in // encryptionMain: the new probe-node-id case must route to the // runner. Without this, a typo in encryptionMain's switch diff --git a/internal/encryption/node_id_collision.go b/internal/encryption/node_id_collision.go index 9c6f7b2bc..a13d6d034 100644 --- a/internal/encryption/node_id_collision.go +++ b/internal/encryption/node_id_collision.go @@ -4,6 +4,12 @@ import ( pkgerrors "github.com/cockroachdb/errors" ) +// minMembersForCollision is the smallest membership-set size at +// which two distinct full_node_id values can exist (and therefore +// at which a 16-bit narrowing collision becomes possible). With +// 0 or 1 member there is nothing to compare; the guard skips early. +const minMembersForCollision = 2 + // CheckNodeIDCollision is the §9.1 / 6C-3 startup-guard primitive // for `ErrNodeIDCollision`. It walks the supplied list of // `full_node_id` values (typically every voter + learner in the @@ -38,12 +44,6 @@ import ( // // Returns `ErrNodeIDCollision` wrapped with offending IDs when // any two distinct `full_node_id` values share a `node_id`. -// minMembersForCollision is the smallest membership-set size at -// which two distinct full_node_id values can exist (and therefore -// at which a 16-bit narrowing collision becomes possible). With -// 0 or 1 member there is nothing to compare; skip early. -const minMembersForCollision = 2 - func CheckNodeIDCollision(fullNodeIDs []uint64) error { if len(fullNodeIDs) < minMembersForCollision { return nil From eab8c49dd4adc5230a781be486f132939e941454 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Mon, 18 May 2026 20:05:05 +0900 Subject: [PATCH 3/6] fix(encryption): PR788 round-2 - codex P2 (preserve missing-row signal split per 5.2) Codex r1 P2 on PR788: CheckLocalEpochRollback collapsed two distinct states - "safe strict-ahead" and "no rollback anchor exists" - into the same nil return. The 6D design doc 5.2 explicitly splits these on storage_envelope_active: - PRE-cutover (storage_envelope_active == false): missing row is the freshly-joined-learner case (allow startup; 4.1 case-1 first-seen creates the row on first encrypted write) - POST-cutover (storage_envelope_active == true): missing row means encrypted writes are happening cluster-wide with no rollback anchor on this node (refuse startup; the guard's raison d'etre) The previous implementation collapsed both to nil, meaning a 6D-6 implementor wiring the guard would have no way to enforce the post-cutover requirement without reading the registry again. Codex's flag is valid. ## Fix: storageEnvelopeActive bool parameter Added a `storageEnvelopeActive` bool to CheckLocalEpochRollback. The missing-row branch now splits internally: - false + no row -> nil (pre-cutover freshly-joined-learner) - true + no row -> ErrLocalEpochRollback wrapped with "storage_envelope_active=true but no registry row" diagnostic This matches 5.2 exactly. The caller (main.go wiring in 6D-6 per the design doc 7) passes sidecar.StorageEnvelopeActive as the bool. ## Caller audit (per cron directive - SEMANTIC CHANGE) CheckLocalEpochRollback is currently called only from tests (operator-inert per 6D-2 scope). grep verified: - internal/encryption/local_epoch_rollback_test.go (6 callers) - No production callers (the main.go wiring is deferred to 6D-6 per the design doc 7) All six test callers updated to pass `false` (pre-cutover) as the new bool, preserving the existing assertion semantics. Added one new test for the post-cutover branch: TestCheckLocalEpochRollback_NoRow_PostCutover - verifies missing-row + active=true returns ErrLocalEpochRollback with the "storage_envelope_active=true" context message. Renamed the existing TestCheckLocalEpochRollback_NoRow to TestCheckLocalEpochRollback_NoRow_PreCutover to make the split explicit at the test level. ## Gemini r1 format-width finding - confirmed false, no change Format width %#016x produces 16 hex digits after the 0x prefix (output 0x0000000012345678 = 18 chars total). Verified via direct Go execution. No change needed. ## Verification - go test -race -timeout=60s ./internal/encryption/ ./cmd/elastickv-admin/ - PASS - golangci-lint run ./internal/encryption/ ./cmd/elastickv-admin/ - 0 issues - go build ./... - PASS No semantic changes outside CheckLocalEpochRollback's signature. --- internal/encryption/local_epoch_rollback.go | 60 ++++++++++++++----- .../encryption/local_epoch_rollback_test.go | 52 ++++++++++++---- 2 files changed, 83 insertions(+), 29 deletions(-) diff --git a/internal/encryption/local_epoch_rollback.go b/internal/encryption/local_epoch_rollback.go index f1431965a..ee315455a 100644 --- a/internal/encryption/local_epoch_rollback.go +++ b/internal/encryption/local_epoch_rollback.go @@ -41,26 +41,44 @@ import ( // - Sidecar's `Active.Storage == 0` (bootstrap not yet // committed; no DEK to compare against). // -// Skip condition handled HERE (inside the primitive): +// Missing-registry-row split — `storageEnvelopeActive` parameter: // -// - The writer-registry has no row for -// `(full_node_id, active_storage_dek_id)`. A freshly-joined -// learner that has not yet proposed a -// `RegisterEncryptionWriter` legitimately lacks a registry -// row; the §4.1 case-1 first-seen monotonicity branch will -// create it on the next encrypted write. Returning nil here -// preserves that lifecycle. +// §5.2 of the 6D design doc splits the missing-row behaviour on +// whether the storage envelope cutover has fired: +// +// - PRE-cutover (`storageEnvelopeActive == false`): missing +// row is the freshly-joined-learner lifecycle state. The +// node has not yet proposed a `RegisterEncryptionWriter` +// and the §4.1 case-1 first-seen branch will create the +// row on the next encrypted write. Allow startup. +// +// - POST-cutover (`storageEnvelopeActive == true`): missing +// row means there is NO rollback anchor to compare the +// sidecar against, but encrypted writes are already +// happening cluster-wide under the active storage DEK. +// The node could start issuing nonces with no +// monotonicity guard, which is the exact failure mode the +// guard exists to prevent. Refuse startup with +// `ErrLocalEpochRollback` wrapped with a "missing +// registry row with active envelope" diagnostic. +// +// Pebble I/O errors from the registry read propagate as wrapped +// errors NOT classified as `ErrLocalEpochRollback` (operator +// triages transport failure separately from a real rollback). // // Returns `ErrLocalEpochRollback` wrapped with both observed -// values when `sidecar <= registry`. Returns the wrapped -// underlying error (NOT marked with `ErrLocalEpochRollback`) on -// a Pebble I/O error from the registry read so the operator -// triages a transport failure separately from a real rollback. +// values when `sidecar <= registry`. Returns +// `ErrLocalEpochRollback` wrapped with a missing-row +// diagnostic when the row is absent AND +// `storageEnvelopeActive == true`. Returns nil when the row is +// absent AND `storageEnvelopeActive == false` (pre-cutover +// freshly-joined learner) or when `sidecar > registry`. func CheckLocalEpochRollback( registry WriterRegistryStore, fullNodeID uint64, activeStorageDEKID uint32, sidecarLocalEpoch uint16, + storageEnvelopeActive bool, ) error { key := RegistryKey(activeStorageDEKID, uint16(fullNodeID&nodeIDMask)) //nolint:gosec // masked to 16 bits; matches applier.go convention rawVal, ok, err := registry.GetRegistryRow(key) @@ -70,10 +88,20 @@ func CheckLocalEpochRollback( fullNodeID, activeStorageDEKID) } if !ok { - // Freshly-joined learner with no registry row yet. Per - // §5.2 skip-condition, this is a legitimate lifecycle - // state — the first encrypted write will create the row - // via §4.1 case-1 first-seen. + if storageEnvelopeActive { + // Post-cutover: missing row means no rollback + // anchor exists, but encrypted writes are happening. + // Refuse to start; the node cannot prove nonce + // monotonicity without a registry record to compare + // against. + return pkgerrors.Wrapf(ErrLocalEpochRollback, + "writer-registry has no row for full_node_id=%#x dek_id=%d but storage_envelope_active=true; cannot prove nonce monotonicity for the active DEK", + fullNodeID, activeStorageDEKID) + } + // Pre-cutover: freshly-joined learner that has not yet + // proposed a `RegisterEncryptionWriter`. The §4.1 + // case-1 first-seen branch will create the row on the + // next encrypted write. return nil } registryRow, err := DecodeRegistryValue(rawVal) diff --git a/internal/encryption/local_epoch_rollback_test.go b/internal/encryption/local_epoch_rollback_test.go index 2833c20c9..dadee8054 100644 --- a/internal/encryption/local_epoch_rollback_test.go +++ b/internal/encryption/local_epoch_rollback_test.go @@ -2,6 +2,7 @@ package encryption_test import ( "errors" + "strings" "testing" "github.com/bootjp/elastickv/internal/encryption" @@ -55,17 +56,42 @@ func seedRegistry(t *testing.T, full uint64, lastSeen uint16) *stubRegistryStore return reg } -// TestCheckLocalEpochRollback_NoRow pins the freshly-joined- -// learner skip: the registry has no row for -// (full_node_id, active_storage_dek_id), so the primitive must -// return nil per §5.2 skip-condition. The §4.1 case-1 first-seen -// branch will create the row on the next encrypted write. -func TestCheckLocalEpochRollback_NoRow(t *testing.T) { +// TestCheckLocalEpochRollback_NoRow_PreCutover pins the +// freshly-joined-learner skip per §5.2 of the 6D design doc: +// with storageEnvelopeActive=false (pre-cutover) AND no +// registry row, the primitive returns nil. The §4.1 case-1 +// first-seen branch will create the row on the next +// encrypted write. +func TestCheckLocalEpochRollback_NoRow_PreCutover(t *testing.T) { t.Parallel() reg := &stubRegistryStore{rows: map[string][]byte{}} - err := encryption.CheckLocalEpochRollback(reg, 0xAAAA, 1, 42) + err := encryption.CheckLocalEpochRollback(reg, 0xAAAA, 1, 42, false) if err != nil { - t.Errorf("no-row case: want nil, got %v", err) + t.Errorf("no-row + pre-cutover: want nil, got %v", err) + } +} + +// TestCheckLocalEpochRollback_NoRow_PostCutover pins the +// post-cutover refusal per §5.2 of the 6D design doc: with +// storageEnvelopeActive=true (post-cutover) AND no registry +// row, encrypted writes are happening cluster-wide but this +// node has no rollback anchor to compare its sidecar against. +// The primitive returns ErrLocalEpochRollback wrapped with a +// missing-row diagnostic. +// +// Codex r1 P2 specifically flagged that collapsing the +// missing-row case to nil unconditionally would let a node +// start without a nonce-reuse guardrail when the cutover is +// active. This test pins the fix. +func TestCheckLocalEpochRollback_NoRow_PostCutover(t *testing.T) { + t.Parallel() + reg := &stubRegistryStore{rows: map[string][]byte{}} + err := encryption.CheckLocalEpochRollback(reg, 0xAAAA, 1, 42, true) + if !errors.Is(err, encryption.ErrLocalEpochRollback) { + t.Fatalf("no-row + post-cutover: want ErrLocalEpochRollback, got %v", err) + } + if !strings.Contains(err.Error(), "storage_envelope_active=true") { + t.Errorf("error must mention storage_envelope_active=true context: %v", err) } } @@ -75,7 +101,7 @@ func TestCheckLocalEpochRollback_NoRow(t *testing.T) { func TestCheckLocalEpochRollback_SidecarStrictlyAhead(t *testing.T) { t.Parallel() reg := seedRegistry(t, 0xAAAA, 10) - if err := encryption.CheckLocalEpochRollback(reg, 0xAAAA, 1, 11); err != nil { + if err := encryption.CheckLocalEpochRollback(reg, 0xAAAA, 1, 11, false); err != nil { t.Errorf("sidecar=11 > registry=10: want nil, got %v", err) } } @@ -86,7 +112,7 @@ func TestCheckLocalEpochRollback_SidecarStrictlyAhead(t *testing.T) { func TestCheckLocalEpochRollback_SidecarLessThan(t *testing.T) { t.Parallel() reg := seedRegistry(t, 0xAAAA, 42) - err := encryption.CheckLocalEpochRollback(reg, 0xAAAA, 1, 10) + err := encryption.CheckLocalEpochRollback(reg, 0xAAAA, 1, 10, false) if !errors.Is(err, encryption.ErrLocalEpochRollback) { t.Fatalf("sidecar=10 < registry=42: want ErrLocalEpochRollback, got %v", err) } @@ -100,7 +126,7 @@ func TestCheckLocalEpochRollback_SidecarLessThan(t *testing.T) { func TestCheckLocalEpochRollback_SidecarEqual(t *testing.T) { t.Parallel() reg := seedRegistry(t, 0xAAAA, 42) - err := encryption.CheckLocalEpochRollback(reg, 0xAAAA, 1, 42) + err := encryption.CheckLocalEpochRollback(reg, 0xAAAA, 1, 42, false) if !errors.Is(err, encryption.ErrLocalEpochRollback) { t.Fatalf("sidecar=42 == registry=42: want ErrLocalEpochRollback, got %v", err) } @@ -114,7 +140,7 @@ func TestCheckLocalEpochRollback_PebbleError(t *testing.T) { t.Parallel() pebbleErr := errors.New("simulated pebble corruption") reg := &stubRegistryStore{getErr: pebbleErr} - err := encryption.CheckLocalEpochRollback(reg, 0xAAAA, 1, 10) + err := encryption.CheckLocalEpochRollback(reg, 0xAAAA, 1, 10, false) if err == nil { t.Fatal("pebble error: want error, got nil") } @@ -145,7 +171,7 @@ func TestCheckLocalEpochRollback_NarrowingMatchesShippedConvention(t *testing.T) // Look up with a different full_node_id that ALSO narrows // to 0xAAAA. Same registry row should be hit; sidecar=10 < // registry=50 should fire ErrLocalEpochRollback. - err := encryption.CheckLocalEpochRollback(reg, 0xCAFEBABE_0000AAAA, 1, 10) + err := encryption.CheckLocalEpochRollback(reg, 0xCAFEBABE_0000AAAA, 1, 10, false) if !errors.Is(err, encryption.ErrLocalEpochRollback) { t.Fatalf("narrowing-match: want ErrLocalEpochRollback, got %v", err) } From 189cd59614c00531b1d547e4ac641a8300f7a961 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Mon, 18 May 2026 20:15:05 +0900 Subject: [PATCH 4/6] fix(encryption): PR788 round-3 - coderabbit minors (stale doc + ErrHelp handling) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Two CodeRabbit minor findings from round-2 review. ## CodeRabbit minor #1 (errors.go:260-264) - stale skip-condition doc The ErrLocalEpochRollback skip-conditions comment in errors.go still listed "no registry row" as an unconditional skip. After the round-2 storageEnvelopeActive split, the missing-row case is no longer unconditional: pre-cutover it's a skip (returns nil), post-cutover it's the refusal that the guard exists to catch. Fix: rewrote the skip-conditions paragraph to enumerate just the two unconditional skips (encryption disabled, no active storage DEK) and cross-reference local_epoch_rollback.go for the §5.2 missing-row split. The comprehensive doc on CheckLocalEpochRollback is the authoritative source; errors.go now points there. ## CodeRabbit minor #2 (encryption.go:85-87) - ErrHelp handling probe-node-id wrapped flag.ErrHelp as an error, breaking the exit-zero-on-help convention used by every other encryption subcommand: - runEncryptionStatus (encryption.go:169-172) - runEncryptionRotateDEK (encryption_mutators.go:74) - runEncryptionRegisterWriter (encryption_mutators.go:119) - runEncryptionBootstrap (encryption_mutators.go:261) Operators running `elastickv-admin encryption probe-node-id -h` would get exit code 2 (the flag package default for help) which breaks shell scripts that test $?. Fix: added the standard `if errors.Is(err, flag.ErrHelp) { return nil }` short-circuit, matching the existing pattern. Added TestRunEncryptionProbeNodeID_HelpFlagExitsZero to pin the behavior. ## Verification - go test -race -timeout=60s ./internal/encryption/ ./cmd/elastickv-admin/ - PASS (24 tests total in these files) - golangci-lint run ./internal/encryption/ ./cmd/elastickv-admin/ - 0 issues No semantic changes to function signatures or behavior outside of the help-flag exit code. --- cmd/elastickv-admin/encryption.go | 10 ++++++++++ cmd/elastickv-admin/encryption_test.go | 15 +++++++++++++++ internal/encryption/errors.go | 14 ++++++++++---- 3 files changed, 35 insertions(+), 4 deletions(-) diff --git a/cmd/elastickv-admin/encryption.go b/cmd/elastickv-admin/encryption.go index 7409cb42d..54e978925 100644 --- a/cmd/elastickv-admin/encryption.go +++ b/cmd/elastickv-admin/encryption.go @@ -83,6 +83,16 @@ func runEncryptionProbeNodeID(args []string, out io.Writer) error { fs := flag.NewFlagSet("encryption probe-node-id", flag.ContinueOnError) fullNodeIDStr := fs.String("full-node-id", "", "candidate 64-bit full_node_id to probe (decimal or 0x-prefixed hex)") if err := fs.Parse(args); err != nil { + // flag.ContinueOnError reports `-h`/`--help` via the + // sentinel flag.ErrHelp after writing usage to + // fs.Output(). Match the convention used by every other + // encryption subcommand (status, rotate-dek, + // register-writer, bootstrap): treat ErrHelp as a usage + // request, exit 0 so shell scripts that test $? on help + // invocations don't trip. + if errors.Is(err, flag.ErrHelp) { + return nil + } return errors.Wrap(err, "parse probe-node-id flags") } if *fullNodeIDStr == "" { diff --git a/cmd/elastickv-admin/encryption_test.go b/cmd/elastickv-admin/encryption_test.go index bbbb4bc5a..08f63bf01 100644 --- a/cmd/elastickv-admin/encryption_test.go +++ b/cmd/elastickv-admin/encryption_test.go @@ -541,6 +541,21 @@ func TestRunEncryptionProbeNodeID_RejectsBadInput(t *testing.T) { } } +// TestRunEncryptionProbeNodeID_HelpFlagExitsZero pins the +// flag.ErrHelp special-case: every other encryption subcommand +// (status, rotate-dek, register-writer, bootstrap) returns nil +// on `-h`/`--help` so shell scripts that test $? on help +// invocations don't trip. probe-node-id must match the +// convention. Codex r2 flagged the absent ErrHelp branch. +func TestRunEncryptionProbeNodeID_HelpFlagExitsZero(t *testing.T) { + t.Parallel() + var buf bytes.Buffer + err := runEncryptionProbeNodeID([]string{"-h"}, &buf) + if err != nil { + t.Fatalf("-h: want nil, got %v", err) + } +} + // TestRunEncryptionProbeNodeID_RejectsPartialInput pins the // strconv.ParseUint vs fmt.Sscanf distinction: parseUint64WithRadix // MUST reject inputs where only a prefix is parseable, because diff --git a/internal/encryption/errors.go b/internal/encryption/errors.go index 09e6675cb..218bc4586 100644 --- a/internal/encryption/errors.go +++ b/internal/encryption/errors.go @@ -258,9 +258,15 @@ var ( // own monotonicity contract. // // Skip conditions: encryption disabled, sidecar's - // `Active.Storage == 0` (bootstrap not yet committed), no - // registry row for `(full_node_id, active_storage_dek_id)` (a - // freshly-joined learner that hasn't proposed a - // `RegisterEncryptionWriter` yet). + // `Active.Storage == 0` (bootstrap not yet committed). The + // missing-registry-row case is NOT an unconditional skip — + // the §5.2 split fires based on `sidecar.StorageEnvelopeActive`: + // pre-cutover (active=false) the missing-row case is a + // freshly-joined-learner skip and `CheckLocalEpochRollback` + // returns nil; post-cutover (active=true) the missing-row + // case is a missing rollback anchor and the function fires + // `ErrLocalEpochRollback`. See the comprehensive doc on + // `CheckLocalEpochRollback` (`local_epoch_rollback.go`) for + // the full split. ErrLocalEpochRollback = errors.New("encryption: sidecar local_epoch for the active storage DEK is at or below the local writer-registry's last_seen_local_epoch, which would replay the GCM nonce prefix under the same DEK; refusing to start (verify the sidecar was not restored from an old backup; run `encryption resync-sidecar` if appropriate, or rotate the affected DEK — see §9.1 + §5.2 of the 6D design doc)") ) From 2ca075ac6fff10f27a355bf6d88d928c41126676 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Mon, 18 May 2026 20:19:58 +0900 Subject: [PATCH 5/6] fix(encryption): PR788 round-4 - codex r3 P2 (validate registry row FullNodeID match) Codex r3 P2 on PR788: CheckLocalEpochRollback looked up the registry row by uint16(fullNodeID&nodeIDMask) and compared epochs without verifying that the decoded RegistryValue's FullNodeID matched the caller's argument. If a different writer previously occupied the same 16-bit slot (historical occupancy that the route-catalog watcher would not surface to ErrNodeIDCollision), this function would treat that foreign row as THIS node's rollback anchor. When sidecarLocalEpoch happens to exceed the foreign LastSeenLocalEpoch, the function would return nil and the missing-row-post-cutover refusal would be bypassed. ## Fix: defense-in-depth FullNodeID equality check After decoding the registry row, compare registryRow.FullNodeID against the caller's fullNodeID. On mismatch, route through the existing 5.2 split via a new shared helper: - missingRegistryRowResult(storageEnvelopeActive, ...) - pre-cutover allow startup, post-cutover refuse with ErrLocalEpochRollback. The mismatch is treated as "no row for THIS node" because the node's rollback anchor genuinely doesn't exist - the foreign writer's anchor doesn't apply to THIS node's monotonicity contract. This is layered defense: ErrNodeIDCollision is the primary guard for active collisions, but a since-removed member can leave its row behind. The FullNodeID check handles that case correctly without depending on ErrNodeIDCollision's route-catalog snapshot. ## Test coverage Two new tests pin the foreign-row scenarios: - TestCheckLocalEpochRollback_ForeignRowPreCutover - foreign row + pre-cutover -> allow startup (next OpRegistration will overwrite via 4.1 case 1). - TestCheckLocalEpochRollback_ForeignRowPostCutover - foreign row + post-cutover -> refuse with ErrLocalEpochRollback AND uses the missing-row context message ("storage_envelope_active=true"). Includes the "sidecar > foreign.LastSeen" sub-case which would have masked the bug without the FullNodeID check. The existing TestCheckLocalEpochRollback_NarrowingMatchesShippedConvention was confusingly using the foreign-row scenario as its fixture. Rewrote it to use the SAME full_node_id for seed and lookup (still confirms the high-bit-aware uint16 narrowing), and moved the cross-full_node_id case to the dedicated foreign-row tests where it belongs. ## Caller audit (per cron directive) CheckLocalEpochRollback's exported signature is unchanged - the semantic change is internal (a new branch that routes through the existing 5.2 split). No production callers (operator-inert per 6D-2). All 6 existing test callers continue to work unchanged. ## Verification - go test -race -timeout=60s ./internal/encryption/ ./cmd/elastickv-admin/ - PASS (29 tests total in these files) - golangci-lint run ./internal/encryption/ ./cmd/elastickv-admin/ - 0 issues --- internal/encryption/local_epoch_rollback.go | 48 +++++++---- .../encryption/local_epoch_rollback_test.go | 85 +++++++++++++++++-- 2 files changed, 110 insertions(+), 23 deletions(-) diff --git a/internal/encryption/local_epoch_rollback.go b/internal/encryption/local_epoch_rollback.go index ee315455a..334429371 100644 --- a/internal/encryption/local_epoch_rollback.go +++ b/internal/encryption/local_epoch_rollback.go @@ -73,6 +73,21 @@ import ( // `storageEnvelopeActive == true`. Returns nil when the row is // absent AND `storageEnvelopeActive == false` (pre-cutover // freshly-joined learner) or when `sidecar > registry`. +// missingRegistryRowResult is the shared §5.2 split for "no +// registry row for THIS node": pre-cutover allow startup, +// post-cutover refuse. The function is invoked both when the +// Pebble Get returns ok=false AND when a row exists but its +// FullNodeID does not match the caller's expectation (a +// historical occupant of the same 16-bit slot). +func missingRegistryRowResult(storageEnvelopeActive bool, fullNodeID uint64, activeStorageDEKID uint32) error { + if storageEnvelopeActive { + return pkgerrors.Wrapf(ErrLocalEpochRollback, + "writer-registry has no row for full_node_id=%#x dek_id=%d but storage_envelope_active=true; cannot prove nonce monotonicity for the active DEK", + fullNodeID, activeStorageDEKID) + } + return nil +} + func CheckLocalEpochRollback( registry WriterRegistryStore, fullNodeID uint64, @@ -88,21 +103,7 @@ func CheckLocalEpochRollback( fullNodeID, activeStorageDEKID) } if !ok { - if storageEnvelopeActive { - // Post-cutover: missing row means no rollback - // anchor exists, but encrypted writes are happening. - // Refuse to start; the node cannot prove nonce - // monotonicity without a registry record to compare - // against. - return pkgerrors.Wrapf(ErrLocalEpochRollback, - "writer-registry has no row for full_node_id=%#x dek_id=%d but storage_envelope_active=true; cannot prove nonce monotonicity for the active DEK", - fullNodeID, activeStorageDEKID) - } - // Pre-cutover: freshly-joined learner that has not yet - // proposed a `RegisterEncryptionWriter`. The §4.1 - // case-1 first-seen branch will create the row on the - // next encrypted write. - return nil + return missingRegistryRowResult(storageEnvelopeActive, fullNodeID, activeStorageDEKID) } registryRow, err := DecodeRegistryValue(rawVal) if err != nil { @@ -110,6 +111,23 @@ func CheckLocalEpochRollback( "local_epoch rollback guard: decode writer-registry row for full_node_id=%#x dek_id=%d", fullNodeID, activeStorageDEKID) } + // Defense-in-depth: the registry key narrows full_node_id to + // its low 16 bits, so a row at the same key MIGHT belong to a + // DIFFERENT writer that previously occupied the same 16-bit + // slot. ErrNodeIDCollision is the primary layer that catches + // active collisions, but historical occupancy (a since-removed + // member that left its row behind) is not in the route-catalog + // snapshot the collision guard reads. Treat a row whose + // FullNodeID does NOT match the caller's expectation as if no + // row exists for THIS node — route through the §5.2 + // storage_envelope_active split. Pre-cutover this is a + // freshly-joined-learner-with-historical-collision case (allow + // startup; the next OpRegistration will overwrite the foreign + // row at §4.1 case 1). Post-cutover it is an unrecoverable + // missing rollback anchor for THIS node (refuse startup). + if registryRow.FullNodeID != fullNodeID { + return missingRegistryRowResult(storageEnvelopeActive, fullNodeID, activeStorageDEKID) + } if sidecarLocalEpoch <= registryRow.LastSeenLocalEpoch { return pkgerrors.Wrapf(ErrLocalEpochRollback, "sidecar local_epoch=%d <= registry last_seen_local_epoch=%d (full_node_id=%#x dek_id=%d) — strict-ahead invariant requires sidecar > registry", diff --git a/internal/encryption/local_epoch_rollback_test.go b/internal/encryption/local_epoch_rollback_test.go index dadee8054..f2811c499 100644 --- a/internal/encryption/local_epoch_rollback_test.go +++ b/internal/encryption/local_epoch_rollback_test.go @@ -152,6 +152,69 @@ func TestCheckLocalEpochRollback_PebbleError(t *testing.T) { } } +// TestCheckLocalEpochRollback_ForeignRowPreCutover pins the +// defense-in-depth FullNodeID-mismatch case for pre-cutover: +// a registry row exists at the 16-bit-narrowed key but belongs +// to a DIFFERENT writer (historical occupant). Pre-cutover +// this is the freshly-joined-learner-with-historical-collision +// case — the next OpRegistration will overwrite the foreign +// row. Allow startup. +// +// Codex r3 P2 flagged that without this check, the comparison +// would happen against the FOREIGN writer's LastSeenLocalEpoch, +// potentially returning nil (sidecar > foreign) when in fact +// THIS node has no rollback anchor at all. +func TestCheckLocalEpochRollback_ForeignRowPreCutover(t *testing.T) { + t.Parallel() + // Seed with full_node_id 0xDEADBEEF_0000AAAA (narrowed=0xAAAA). + reg := seedRegistry(t, 0xDEADBEEF_0000AAAA, 99) + // Look up with a DIFFERENT full_node_id whose low 16 bits + // ALSO narrow to 0xAAAA. Pre-cutover: the foreign row is + // not THIS node's anchor; treat as no-row + allow startup. + err := encryption.CheckLocalEpochRollback(reg, 0xCAFEBABE_0000AAAA, 1, 10, false) + if err != nil { + t.Errorf("foreign-row + pre-cutover: want nil, got %v", err) + } +} + +// TestCheckLocalEpochRollback_ForeignRowPostCutover pins the +// post-cutover refusal for the FullNodeID-mismatch case: a +// foreign row at the same 16-bit-narrowed key MUST NOT be +// treated as this node's rollback anchor. With +// storage_envelope_active=true, the absence of THIS node's +// own row means no rollback anchor exists and startup must +// refuse — same as the no-row + post-cutover case. +// +// Without the FullNodeID check, the function would compare +// sidecarLocalEpoch against the foreign writer's +// LastSeenLocalEpoch and (when sidecar happens to be ahead) +// return nil, allowing the node to start without an anchor. +func TestCheckLocalEpochRollback_ForeignRowPostCutover(t *testing.T) { + t.Parallel() + // Seed with full_node_id 0xDEADBEEF_0000AAAA (narrowed=0xAAAA), + // large LastSeenLocalEpoch so sidecar=10 would be BELOW it + // (the "natural" rollback scenario) — but the foreign-row + // check should fire FIRST and route through the + // missing-row-post-cutover branch. + reg := seedRegistry(t, 0xDEADBEEF_0000AAAA, 99) + err := encryption.CheckLocalEpochRollback(reg, 0xCAFEBABE_0000AAAA, 1, 10, true) + if !errors.Is(err, encryption.ErrLocalEpochRollback) { + t.Fatalf("foreign-row + post-cutover: want ErrLocalEpochRollback, got %v", err) + } + if !strings.Contains(err.Error(), "storage_envelope_active=true") { + t.Errorf("must use the missing-row-post-cutover message (storage_envelope_active=true): %v", err) + } + + // Even when sidecar > foreign's LastSeenLocalEpoch (the + // case that would have masked the missing anchor without + // the FullNodeID check), the refusal must still fire. + reg2 := seedRegistry(t, 0xDEADBEEF_0000AAAA, 5) + err2 := encryption.CheckLocalEpochRollback(reg2, 0xCAFEBABE_0000AAAA, 1, 100, true) + if !errors.Is(err2, encryption.ErrLocalEpochRollback) { + t.Fatalf("foreign-row + post-cutover + sidecar>foreign.LastSeen: want ErrLocalEpochRollback, got %v", err2) + } +} + // TestCheckLocalEpochRollback_NarrowingMatchesShippedConvention // pins the load-bearing 16-bit narrowing: the primitive must // look up the registry row at RegistryKey(dekID, @@ -161,17 +224,23 @@ func TestCheckLocalEpochRollback_PebbleError(t *testing.T) { // would lookup at a different key and the no-row branch would // fire spuriously, masking a real rollback. // -// Verify by seeding the row at uint16-narrowed key and looking -// up with a full_node_id whose HIGH bits differ — narrowing -// must hit the same row. +// Verify by seeding the row at uint16-narrowed key with a +// full_node_id that has non-trivial high bits and looking up +// with the SAME full_node_id — confirming that the key +// derivation handles high-bit-aware values correctly. (The +// foreign-row case, where two distinct full_node_ids share the +// same 16-bit slice, is covered by the dedicated +// `ForeignRowPreCutover` / `ForeignRowPostCutover` tests +// above.) func TestCheckLocalEpochRollback_NarrowingMatchesShippedConvention(t *testing.T) { t.Parallel() // Seed with full_node_id 0xDEADBEEF_0000AAAA (narrowed=0xAAAA). - reg := seedRegistry(t, 0xDEADBEEF_0000AAAA, 50) - // Look up with a different full_node_id that ALSO narrows - // to 0xAAAA. Same registry row should be hit; sidecar=10 < - // registry=50 should fire ErrLocalEpochRollback. - err := encryption.CheckLocalEpochRollback(reg, 0xCAFEBABE_0000AAAA, 1, 10, false) + const fullID uint64 = 0xDEADBEEF_0000AAAA + reg := seedRegistry(t, fullID, 50) + // Look up with the SAME full_node_id. The key derivation + // must use uint16(fullID & 0xFFFF) and find the row. + // sidecar=10 < registry=50 fires ErrLocalEpochRollback. + err := encryption.CheckLocalEpochRollback(reg, fullID, 1, 10, false) if !errors.Is(err, encryption.ErrLocalEpochRollback) { t.Fatalf("narrowing-match: want ErrLocalEpochRollback, got %v", err) } From 5aaa3e0b744f172e1e349f7101b6173f0dc7df5c Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Wed, 20 May 2026 05:24:49 +0900 Subject: [PATCH 6/6] fix(encryption): PR788 round-5 - drop probe-node-id gosec nolint (claude r4 deferred item) Use the uint64 mask result directly for %#04x / %d formatting; the uint16 cast was only there for display and pulled in a gosec G115 nolint that did not exist anywhere else in the operator-tool path. The single-source narrow-and-cast convention (applier.go:230, local_epoch_rollback.go:98, node_id_collision.go:59) is unaffected - those sites genuinely need uint16 to build the 6-byte registry key. Display-only sites do not, and now do not pay the nolint cost. --- cmd/elastickv-admin/encryption.go | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/cmd/elastickv-admin/encryption.go b/cmd/elastickv-admin/encryption.go index 54e978925..0d2bb3c38 100644 --- a/cmd/elastickv-admin/encryption.go +++ b/cmd/elastickv-admin/encryption.go @@ -103,9 +103,10 @@ func runEncryptionProbeNodeID(args []string, out io.Writer) error { return errors.Wrapf(err, "parse --full-node-id=%q", *fullNodeIDStr) } const nodeIDMask = 0xFFFF - narrowed := uint16(full & nodeIDMask) //nolint:gosec // masked to 16 bits; matches applier.go convention + // keep masked as uint64: the value is only formatted for display, so the uint16 cast (and its gosec G115 nolint) is not load-bearing here. + masked := full & nodeIDMask if _, err := fmt.Fprintf(out, "full_node_id: %#016x (%d)\nnode_id: %#04x (%d)\n", - full, full, narrowed, narrowed); err != nil { + full, full, masked, masked); err != nil { return errors.Wrap(err, "write probe-node-id result") } return nil