From 90dad5d1339ed64f2e815992fa1a2e08e4b1dfa6 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Mon, 25 May 2026 22:33:09 +0900 Subject: [PATCH 1/4] docs(encryption): Stage 6D-6c-3a proposed - capability fan-out closure wiring MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Proposes wiring the §4 capability fan-out closure into main.go so the EnableStorageEnvelope cutover RPC can run its pre-flight GetCapability fan-out (today s.capabilityFanout is nil → cutover refuses with the §4 "not wired" FailedPrecondition). Design: build an all-groups admin.RouteSnapshot from each runtime's engine.Configuration(ctx), a DialFunc over a dedicated kv.GRPCConnCache returning pb.EncryptionAdminClient (no-op cleanup, cache owns conns), and assemble adapter.CapabilityFanoutFn; wire via WithEncryptionAdminCapabilityFanout gated on enableMutators. Fail-closed when the membership view can't be fully enumerated. Dedicated conn cache (not the --adminEnabled-gated admin-forward cache) so the fan-out works whenever encryption mutators are enabled. e2e test deferred to 6D-6c-3b. Doc-first per CLAUDE.md; implementation commits follow on the same PR. --- ...roposed_6d6c3a_capability_fanout_wiring.md | 112 ++++++++++++++++++ 1 file changed, 112 insertions(+) create mode 100644 docs/design/2026_05_25_proposed_6d6c3a_capability_fanout_wiring.md diff --git a/docs/design/2026_05_25_proposed_6d6c3a_capability_fanout_wiring.md b/docs/design/2026_05_25_proposed_6d6c3a_capability_fanout_wiring.md new file mode 100644 index 000000000..8bdc0e2f9 --- /dev/null +++ b/docs/design/2026_05_25_proposed_6d6c3a_capability_fanout_wiring.md @@ -0,0 +1,112 @@ +# Stage 6D-6c-3a — capability fan-out closure wiring + +| Field | Value | +|---|---| +| Status | proposed | +| Date | 2026-05-25 | +| Parent designs | [`2026_05_18_partial_6d_enable_storage_envelope.md`](2026_05_18_partial_6d_enable_storage_envelope.md) (6D-6c-3 milestone; §4 capability fan-out), [`2026_04_29_partial_data_at_rest_encryption.md`](2026_04_29_partial_data_at_rest_encryption.md) (§7.1 rollout) | +| Builds on | 6D-3 (`internal/admin.CapabilityFanout` helper), 6D-6a (`adapter.WithEncryptionAdminCapabilityFanout` option) | + +## 0. Why this doc exists + +The `EnableStorageEnvelope` cutover RPC (6D-6a) runs the §4 capability +fan-out before proposing the cutover — but only when a +`CapabilityFanoutFn` closure is wired via +`WithEncryptionAdminCapabilityFanout`. `main.go` does **not** wire it +today, so `s.capabilityFanout == nil` and the cutover RPC refuses with +the §4 "fan-out not wired" `FailedPrecondition`. 6D-6c-3a builds and +wires that closure; 6D-6c-3b adds the end-to-end test on top. + +The fan-out helper (`admin.CapabilityFanout(ctx, routes, dial, +timeout)`) and the RPC option already exist and are tested. This slice +is pure wiring of those existing pieces into `main.go` — no new RPC, no +wire-format change. + +## 1. Scope (6D-6c-3a) + +- `buildCapabilityFanoutFn(runtimes, connCache, timeout) + adapter.CapabilityFanoutFn` in a new `main_encryption_fanout.go`: + - **Snapshot builder** — for every runtime, call + `rt.engine.Configuration(ctx)` and map each `raftengine.Server` + to an `admin.RouteMember{FullNodeID: etcd.DeriveNodeID(srv.ID), + Address: srv.Address}`, splitting on `srv.Suffrage` + (`SuffrageVoter` → Voters, else Learners) into one + `admin.RouteGroup{GroupID: rt.spec.id}`. The §4.1 contract is + "every (voter ∪ learner) of **every** Raft group", so the + snapshot spans **all** runtimes, not just the cutover RPC's group. + - **DialFunc** — `connCache.ConnFor(addr)` → `pb.NewEncryptionAdminClient(conn)`, + with a **no-op cleanup** (the cache owns the conn lifecycle and + reuses it across probes; closing per-probe would defeat pooling). + - The closure runs `admin.CapabilityFanout(ctx, snapshot, dial, + timeout)`; a snapshot-build error (any `engine.Configuration` + failure) is returned as the closure error so the cutover RPC + fails closed (§4 maps it to a refusal). +- A **dedicated** `kv.GRPCConnCache` for the fan-out, created in + `startRaftServers` and closed via the cleanup stack. Not the + admin-forward cache: that one is gated on `--adminEnabled`, but the + cutover fan-out must dial whenever encryption mutators are enabled, + independent of the admin HTTP surface. `kv.GRPCConnCache.ConnFor` + already uses the shared `internalutil.GRPCDialOptions()` so the + fan-out dials with the same transport posture as every other + intra-cluster gRPC client. +- Wire the closure into `registerEncryptionAdminServer` via + `WithEncryptionAdminCapabilityFanout`, gated on the same + `enableMutators` boolean that gates the Proposer/LeaderView (the + fan-out is only meaningful when the cutover mutator is reachable). +- Fan-out timeout: a `const` in `main.go` (start at 5s — generous for + a small cluster GetCapability round-trip; the helper bounds the + whole fan-out by it regardless of member count). + +### Out of scope (6D-6c-3b) + +The end-to-end integration test (single-node cluster: Bootstrap → +EnableStorageEnvelope → Put → read-back-via-envelope) lands in 3b on +top of this wiring. + +## 2. Fail-closed posture + +- **Snapshot build error** (`engine.Configuration` fails on any + group) → closure returns the error → cutover RPC refuses. Never + propose a cutover against a membership view we could not fully + enumerate. +- **Unreachable / not-capable member** → handled inside + `CapabilityFanout` (verdict `Reachable=false` / `EncryptionCapable=false` + → `OK=false`) → cutover refuses. No partial success (§4.3). +- **Closure not wired** (encryption mutators disabled) → unchanged + existing behavior: `s.capabilityFanout == nil` → cutover refuses + with the §4 "not wired" `FailedPrecondition`. + +## 3. Why a dedicated conn cache (not the admin-forward one) + +The admin-forward `connCache` is constructed only when `--adminEnabled` +(it backs the follower→leader admin write forwarder). The cutover +capability fan-out is orthogonal: it must dial every member's +`EncryptionAdmin` endpoint whenever the operator has enabled +encryption mutators, regardless of whether the admin HTTP API is +served. Coupling the two would make the fan-out silently inert on a +`--encryption-enabled` cluster that left `--adminEnabled` off. A +separate cache keeps the lifecycles independent and is cheap (one +idle `*grpc.ClientConn` per peer, closed on shutdown). + +## 4. Self-review checklist (for the implementation PR) + +- **Data loss / consistency** — read-only control-plane wiring; issues + no writes and changes no apply path. The cutover it gates is + unchanged (6D-6a). +- **Concurrency** — `CapabilityFanout` is already concurrent + timeout + bounded; the snapshot builder runs per-RPC, single-shot; the conn + cache is already concurrency-safe. +- **Performance** — fan-out runs once per cutover RPC (a rare operator + action), not on the data path. +- **Test coverage** — unit tests for the snapshot builder (Server → + RouteMember mapping, voter/learner split, multi-group, Configuration + error → closure error) and the DialFunc (cache reuse, nil-conn + guard). The full e2e is 3b. + +## 5. Open questions + +1. Fan-out timeout value — 5s proposed; revisit if large clusters need + more headroom (the helper bounds the whole fan-out, not per-probe). +2. Should learners that are mid-snapshot-catchup be probed? Yes — the + §4.1 contract is unconditional (voter ∪ learner); an unreachable + learner is a hard refusal, matching the parent design §8. From cdf5021734d308319aa326a2927ef2b23603764d Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Mon, 25 May 2026 22:39:02 +0900 Subject: [PATCH 2/4] feat(encryption): Stage 6D-6c-3a - wire capability fan-out closure into EnableStorageEnvelope MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Wires the §4 capability fan-out into main.go so the EnableStorageEnvelope cutover RPC can run its pre-flight GetCapability fan-out (previously s.capabilityFanout was nil → cutover refused with the §4 "not wired" FailedPrecondition). - main_encryption_fanout.go: buildCapabilityFanoutFn assembles the adapter.CapabilityFanoutFn. The closure snapshots all groups' membership via each runtime's engine.Configuration(ctx) (routeGroupFromConfiguration maps raftengine.Server → admin.RouteMember keyed by etcd.DeriveNodeID, split voter/learner on Suffrage with empty == voter), and dials each peer's EncryptionAdmin endpoint over a dedicated kv.GRPCConnCache (no-op cleanup; the cache owns conns). A snapshot-build error fails the closure → cutover refuses (fail-closed). - buildEncryptionCapabilityFanout owns the dedicated conn cache (NOT the --adminEnabled-gated admin-forward cache, so the fan-out works whenever encryption mutators are enabled), drained on ctx cancellation via eg. Returns nil when mutators are disabled. Extracted as a helper to keep startRaftServers under the cyclop budget. - registerEncryptionAdminServer threads the closure through WithEncryptionAdminCapabilityFanout, gated on the same enableMutators boolean as the Proposer/LeaderView. The end-to-end Bootstrap→cutover→Put→read-back test lands in 6D-6c-3b. Tests: routeGroupFromConfiguration voter/learner split (incl. empty suffrage == voter), DeriveNodeID node identity, address mapping, multi-server, empty configuration; builder returns non-nil. Self-review (5 lenses): - Data loss/consistency: read-only control-plane wiring; no writes, no apply-path change; cutover semantics unchanged (6D-6a). - Concurrency: CapabilityFanout already concurrent + timeout-bounded; snapshot builder per-RPC single-shot; conn cache concurrency-safe; cleanup via eg on ctx.Done. - Performance: fan-out runs once per (rare) cutover RPC, off the data path. - Test coverage: mapping helper unit-tested; full fan-out behavior covered by the 6D-6c-3b e2e. Doc commit (6D-6c-3a proposed) landed first; 6D milestone doc updated. --- ...5_18_partial_6d_enable_storage_envelope.md | 36 +++--- main.go | 3 +- main_encryption_admin.go | 9 +- main_encryption_admin_test.go | 4 +- main_encryption_fanout.go | 121 ++++++++++++++++++ main_encryption_fanout_test.go | 64 +++++++++ 6 files changed, 216 insertions(+), 21 deletions(-) create mode 100644 main_encryption_fanout.go create mode 100644 main_encryption_fanout_test.go diff --git a/docs/design/2026_05_18_partial_6d_enable_storage_envelope.md b/docs/design/2026_05_18_partial_6d_enable_storage_envelope.md index 4eafa7061..58b8878d3 100644 --- a/docs/design/2026_05_18_partial_6d_enable_storage_envelope.md +++ b/docs/design/2026_05_18_partial_6d_enable_storage_envelope.md @@ -97,25 +97,27 @@ into `store.WithEncryption` and `WithStorageEnvelopeGate` in 6D-6c-2. +- **6D-6c-2** (shipped, PR #826) — main.go production wiring: + `encryption.NewCipher(keystore)` + a single `encryption.StateCache` + threaded via `WithStateCache` into every per-shard `Applier`, with + `cache.ActiveStorageKeyID` / `cache.StorageEnvelopeActive` passed + into `store.WithEncryption` + `store.WithStorageEnvelopeGate` for + each shard's PebbleStore. Pulled forward the deterministic-nonce + core of Stage 7 (keystore hydration, local_epoch bump, + DeterministicNonceFactory) — see + [`2026_05_25_partial_6d6c2_production_storage_envelope_wiring.md`](2026_05_25_partial_6d6c2_production_storage_envelope_wiring.md). +- **6D-6c-3a** — main.go CapabilityFanout closure bound to the live + Raft membership view (`engine.Configuration` route snapshot across + all groups + a `DialFunc` over a dedicated `kv.GRPCConnCache`), + wired into the EnableStorageEnvelope server via + `WithEncryptionAdminCapabilityFanout` gated on encryption mutators. + See [`2026_05_25_proposed_6d6c3a_capability_fanout_wiring.md`](2026_05_25_proposed_6d6c3a_capability_fanout_wiring.md). + ## Open milestones -- **6D-6c-2** — main.go production wiring: build - `encryption.NewCipher(keystore)` and construct a single - `encryption.StateCache` at startup (parallel to the shared - `*Keystore`). Thread the cache via `WithStateCache` into every - per-shard `Applier` inside `buildShardGroups`, and pass - `cache.ActiveStorageKeyID` / `cache.StorageEnvelopeActive` - (NOT the per-shard `Applier` delegates) into - `store.WithEncryption` + `store.WithStorageEnvelopeGate` for - each shard's PebbleStore. Reading via the StateCache directly - ensures every shard's storage layer sees the post-apply state - regardless of which shard's leader accepted the encryption - proposal. -- **6D-6c-3** — main.go CapabilityFanout closure bound to the - live Raft membership view (etcd engine route snapshot + admin - client DialFunc), and end-to-end integration test exercising - a single-node cluster Bootstrap → EnableStorageEnvelope → - Put → read-back-via-envelope. +- **6D-6c-3b** — end-to-end integration test exercising a single-node + cluster Bootstrap → EnableStorageEnvelope → Put → + read-back-via-envelope, on top of the 6D-6c-3a fan-out wiring. ## 0. Why this doc exists diff --git a/main.go b/main.go index 89cb13786..2cc4fc0f1 100644 --- a/main.go +++ b/main.go @@ -1460,6 +1460,7 @@ func startRaftServers( // linter does not complain. const extraOptsCap = 2 enableMutators := encryptionMutatorsEnabled() + encryptionCapabilityFanout := buildEncryptionCapabilityFanout(ctx, eg, runtimes, enableMutators) for _, rt := range runtimes { baseOpts := internalutil.GRPCServerOptions() opts := make([]grpc.ServerOption, 0, len(baseOpts)+extraOptsCap) @@ -1500,7 +1501,7 @@ func startRaftServers( // per-shard loop. Each shard's own engine is the // Proposer + LeaderView so the mutator proposes through // the correct Raft group. - registerEncryptionAdminServer(gs, etcdraftengine.DeriveNodeID(*raftId), *encryptionSidecarPath, enableMutators, rt.engine) + registerEncryptionAdminServer(gs, etcdraftengine.DeriveNodeID(*raftId), *encryptionSidecarPath, enableMutators, rt.engine, encryptionCapabilityFanout) registerAdminForwardServer(gs, forwardDeps, forwardLogger) rt.registerGRPC(gs) internalraftadmin.RegisterOperationalServices(ctx, gs, rt.engine, []string{"RawKV"}) diff --git a/main_encryption_admin.go b/main_encryption_admin.go index f8794fc01..db1c81d77 100644 --- a/main_encryption_admin.go +++ b/main_encryption_admin.go @@ -128,7 +128,7 @@ type encryptionAdminEngine interface { // (the §7.1 cutover refuses with ErrCapabilityCheckFailed); // when set, capability probing reads the §5.1 keys.json and // reports encryption_capable=true. -func registerEncryptionAdminServer(gs *grpc.Server, fullNodeID uint64, sidecarPath string, enableMutators bool, engine encryptionAdminEngine) { +func registerEncryptionAdminServer(gs *grpc.Server, fullNodeID uint64, sidecarPath string, enableMutators bool, engine encryptionAdminEngine, capabilityFanout adapter.CapabilityFanoutFn) { opts := []adapter.EncryptionAdminServerOption{ adapter.WithEncryptionAdminFullNodeID(fullNodeID), } @@ -140,6 +140,13 @@ func registerEncryptionAdminServer(gs *grpc.Server, fullNodeID uint64, sidecarPa adapter.WithEncryptionAdminProposer(engine), adapter.WithEncryptionAdminLeaderView(engine), ) + // The §4 capability fan-out is only meaningful when the + // cutover mutator is reachable (same enableMutators gate as + // the Proposer/LeaderView). Without it the cutover RPC + // refuses with the §4 "fan-out not wired" FailedPrecondition. + if capabilityFanout != nil { + opts = append(opts, adapter.WithEncryptionAdminCapabilityFanout(capabilityFanout)) + } } srv := adapter.NewEncryptionAdminServer(opts...) if err := srv.Validate(); err != nil { diff --git a/main_encryption_admin_test.go b/main_encryption_admin_test.go index a4e6cc360..fa7c103f7 100644 --- a/main_encryption_admin_test.go +++ b/main_encryption_admin_test.go @@ -135,7 +135,7 @@ func callBootstrapAgainstNewServer(t *testing.T, enableMutators bool, engine enc t.Helper() listener := bufconn.Listen(1024 * 1024) gs := grpc.NewServer() - registerEncryptionAdminServer(gs, 1, "", enableMutators, engine) + registerEncryptionAdminServer(gs, 1, "", enableMutators, engine, nil) go func() { _ = gs.Serve(listener) }() t.Cleanup(gs.Stop) conn, err := grpc.NewClient( @@ -161,7 +161,7 @@ func callBootstrapAgainstNewServer(t *testing.T, enableMutators bool, engine enc func TestRegisterEncryptionAdminServer_Registers(t *testing.T) { gs := grpc.NewServer() - registerEncryptionAdminServer(gs, 1, "", false, nil) + registerEncryptionAdminServer(gs, 1, "", false, nil, nil) info := gs.GetServiceInfo() if _, ok := info["EncryptionAdmin"]; !ok { var registered []string diff --git a/main_encryption_fanout.go b/main_encryption_fanout.go new file mode 100644 index 000000000..aeb621c4f --- /dev/null +++ b/main_encryption_fanout.go @@ -0,0 +1,121 @@ +package main + +import ( + "context" + "time" + + "github.com/bootjp/elastickv/adapter" + "github.com/bootjp/elastickv/internal/admin" + "github.com/bootjp/elastickv/internal/raftengine" + etcdraftengine "github.com/bootjp/elastickv/internal/raftengine/etcd" + "github.com/bootjp/elastickv/kv" + pb "github.com/bootjp/elastickv/proto" + "github.com/cockroachdb/errors" + "golang.org/x/sync/errgroup" +) + +// capabilityFanoutTimeout bounds the whole §4 GetCapability fan-out the +// EnableStorageEnvelope cutover runs before proposing. The helper +// caps the entire fan-out (not per-probe) by this value regardless of +// member count, so a slow/unreachable peer cannot stall the operator's +// cutover RPC indefinitely — it surfaces as a Reachable=false verdict +// and the cutover fails closed. 5s is generous for a small-cluster +// GetCapability round-trip. +const capabilityFanoutTimeout = 5 * time.Second + +// buildEncryptionCapabilityFanout builds the §4 capability fan-out +// closure shared across every shard's EncryptionAdmin server, or nil +// when encryption mutators are disabled (the cutover RPC is then +// unreachable, so the fan-out would never run). It owns a dedicated +// kv.GRPCConnCache — NOT the --adminEnabled-gated admin-forward cache — +// so the fan-out dials whenever encryption mutators are enabled, +// independent of the admin HTTP surface; the cache is drained on +// context cancellation via eg. +func buildEncryptionCapabilityFanout(ctx context.Context, eg *errgroup.Group, runtimes []*raftGroupRuntime, enableMutators bool) adapter.CapabilityFanoutFn { + if !enableMutators { + return nil + } + fanoutConnCache := &kv.GRPCConnCache{} + eg.Go(func() error { + <-ctx.Done() + if err := fanoutConnCache.Close(); err != nil { + return errors.Wrap(err, "close encryption capability fan-out gRPC connection cache") + } + return nil + }) + return buildCapabilityFanoutFn(runtimes, fanoutConnCache, capabilityFanoutTimeout) +} + +// buildCapabilityFanoutFn assembles the adapter.CapabilityFanoutFn the +// EnableStorageEnvelope server invokes for its §4 pre-flight check. The +// returned closure snapshots the live membership of every Raft group +// (the §4.1 "voter ∪ learner of every group" contract) and fans +// GetCapability out to each unique node. +// +// dial reuses connCache so repeated cutover attempts share one +// *grpc.ClientConn per peer; the cleanup closure is a no-op because the +// cache owns the connection lifecycle (closed on shutdown by the +// caller, not per-probe). A snapshot-build error fails the closure so +// the cutover refuses rather than proposing against a membership view +// it could not fully enumerate (fail-closed, §4.3 no-partial-success). +func buildCapabilityFanoutFn(runtimes []*raftGroupRuntime, connCache *kv.GRPCConnCache, timeout time.Duration) adapter.CapabilityFanoutFn { + dial := func(_ context.Context, address string) (pb.EncryptionAdminClient, func(), error) { + conn, err := connCache.ConnFor(address) + if err != nil { + return nil, nil, errors.Wrapf(err, "capability fan-out: dial %s", address) + } + return pb.NewEncryptionAdminClient(conn), func() {}, nil + } + return func(ctx context.Context) (admin.CapabilityFanoutResult, error) { + snapshot, err := capabilityRouteSnapshot(ctx, runtimes) + if err != nil { + return admin.CapabilityFanoutResult{}, err + } + return admin.CapabilityFanout(ctx, snapshot, dial, timeout) + } +} + +// capabilityRouteSnapshot builds the all-groups admin.RouteSnapshot +// from each runtime's live Raft Configuration. Every server maps to a +// RouteMember keyed by the same etcd.DeriveNodeID(raftID) value the +// rest of the encryption stack uses for node identity, split into +// Voters / Learners on Suffrage (empty suffrage is treated as voter, +// matching raftengine's own convention). A nil runtime/engine is +// skipped; any Configuration error aborts the whole snapshot so the +// caller fails closed. +func capabilityRouteSnapshot(ctx context.Context, runtimes []*raftGroupRuntime) (admin.RouteSnapshot, error) { + groups := make([]admin.RouteGroup, 0, len(runtimes)) + for _, rt := range runtimes { + if rt == nil || rt.engine == nil { + continue + } + cfg, err := rt.engine.Configuration(ctx) + if err != nil { + return admin.RouteSnapshot{}, errors.Wrapf(err, "capability fan-out: configuration for group %d", rt.spec.id) + } + groups = append(groups, routeGroupFromConfiguration(rt.spec.id, cfg)) + } + return admin.RouteSnapshot{Groups: groups}, nil +} + +// routeGroupFromConfiguration maps one Raft group's live Configuration +// to an admin.RouteGroup: each server becomes a RouteMember keyed by +// etcd.DeriveNodeID(srv.ID) (the node identity the rest of the +// encryption stack uses), split into Voters / Learners on Suffrage. +// Empty suffrage counts as voter, matching raftengine's convention +// that an unannotated peer (e.g. mid-WAL-replay) is a voter. +func routeGroupFromConfiguration(groupID uint64, cfg raftengine.Configuration) admin.RouteGroup { + group := admin.RouteGroup{GroupID: groupID} + for _, srv := range cfg.Servers { + member := admin.RouteMember{ + FullNodeID: etcdraftengine.DeriveNodeID(srv.ID), + Address: srv.Address, + } + if srv.Suffrage == etcdraftengine.SuffrageLearner { + group.Learners = append(group.Learners, member) + } else { + group.Voters = append(group.Voters, member) + } + } + return group +} diff --git a/main_encryption_fanout_test.go b/main_encryption_fanout_test.go new file mode 100644 index 000000000..b7a0880db --- /dev/null +++ b/main_encryption_fanout_test.go @@ -0,0 +1,64 @@ +package main + +import ( + "testing" + + "github.com/bootjp/elastickv/internal/raftengine" + etcdraftengine "github.com/bootjp/elastickv/internal/raftengine/etcd" +) + +func TestRouteGroupFromConfiguration_SplitsVotersAndLearners(t *testing.T) { + t.Parallel() + cfg := raftengine.Configuration{Servers: []raftengine.Server{ + {ID: "n1", Address: "127.0.0.1:5051", Suffrage: etcdraftengine.SuffrageVoter}, + {ID: "n2", Address: "127.0.0.1:5052", Suffrage: etcdraftengine.SuffrageLearner}, + {ID: "n3", Address: "127.0.0.1:5053", Suffrage: ""}, // empty == voter + }} + g := routeGroupFromConfiguration(7, cfg) + if g.GroupID != 7 { + t.Errorf("GroupID = %d, want 7", g.GroupID) + } + if len(g.Voters) != 2 { + t.Fatalf("Voters = %d, want 2 (n1 + empty-suffrage n3)", len(g.Voters)) + } + if len(g.Learners) != 1 { + t.Fatalf("Learners = %d, want 1 (n2)", len(g.Learners)) + } + // FullNodeID must match the canonical etcd derivation so the + // fan-out dedup + verdict matching line up with the rest of the + // encryption stack's node identity. + if g.Voters[0].FullNodeID != etcdraftengine.DeriveNodeID("n1") { + t.Errorf("Voters[0].FullNodeID = %d, want DeriveNodeID(n1)=%d", + g.Voters[0].FullNodeID, etcdraftengine.DeriveNodeID("n1")) + } + if g.Voters[0].Address != "127.0.0.1:5051" { + t.Errorf("Voters[0].Address = %q, want 127.0.0.1:5051", g.Voters[0].Address) + } + if g.Learners[0].FullNodeID != etcdraftengine.DeriveNodeID("n2") { + t.Errorf("Learners[0].FullNodeID mismatch") + } +} + +func TestRouteGroupFromConfiguration_Empty(t *testing.T) { + t.Parallel() + g := routeGroupFromConfiguration(1, raftengine.Configuration{}) + if g.GroupID != 1 { + t.Errorf("GroupID = %d, want 1", g.GroupID) + } + if len(g.Voters) != 0 || len(g.Learners) != 0 { + t.Errorf("empty configuration should yield no members, got %d voters / %d learners", + len(g.Voters), len(g.Learners)) + } +} + +// TestBuildCapabilityFanoutFn_NonNil pins that the builder returns a +// usable closure (it is only wired when encryption mutators are +// enabled; the closure's fan-out behavior is exercised end-to-end in +// 6D-6c-3b). +func TestBuildCapabilityFanoutFn_NonNil(t *testing.T) { + t.Parallel() + fn := buildCapabilityFanoutFn(nil, nil, capabilityFanoutTimeout) + if fn == nil { + t.Fatal("buildCapabilityFanoutFn returned nil") + } +} From 26ecce502342be221239650cbec6f145a86ff74f Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Mon, 25 May 2026 22:47:29 +0900 Subject: [PATCH 3/4] fix(encryption): PR830 round-1 - data-race fix (snapshotEngine), test gaps, doc lifecycle MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Addresses the claude review on PR #830: - must-fix (data race): capabilityRouteSnapshot read rt.engine directly, but the fan-out closure runs during live EnableStorageEnvelope RPCs concurrently with a possible Close() that clears the engine field under engineMu.Lock(). Now reads via rt.snapshotEngine() (engineMu.RLock, nil-safe), matching the documented contract for post-startup engine reads. Not a semantic change — same value, race-safe access. - should-fix (test gaps): extracted routeSnapshotFromSources behind a narrow configReader interface so the fail-closed path is unit testable without a full engine. Added: - TestRouteSnapshotFromSources_ConfigurationErrorFailsClosed — a group Configuration error aborts the whole snapshot (cutover refuses rather than proposing against a partial membership view). - TestRouteSnapshotFromSources_MapsAllGroups — multi-group mapping. - TestBuildEncryptionCapabilityFanout_DisabledReturnsNil — mutators off → nil closure. - should-fix (doc lifecycle): git mv the per-PR design doc proposed -> partial with Status: partial + a lifecycle note (the fan-out wiring shipped; 6D-6c-3b e2e remains). Minor non-blocking observations (WaitForReady bounded by the 5s timeout, const placement, gating comment, shutdown error surfacing) were confirmed correct as-is and left unchanged. --- ...artial_6d6c3a_capability_fanout_wiring.md} | 6 +- main_encryption_fanout.go | 50 ++++++++++---- main_encryption_fanout_test.go | 69 +++++++++++++++++++ 3 files changed, 112 insertions(+), 13 deletions(-) rename docs/design/{2026_05_25_proposed_6d6c3a_capability_fanout_wiring.md => 2026_05_25_partial_6d6c3a_capability_fanout_wiring.md} (96%) diff --git a/docs/design/2026_05_25_proposed_6d6c3a_capability_fanout_wiring.md b/docs/design/2026_05_25_partial_6d6c3a_capability_fanout_wiring.md similarity index 96% rename from docs/design/2026_05_25_proposed_6d6c3a_capability_fanout_wiring.md rename to docs/design/2026_05_25_partial_6d6c3a_capability_fanout_wiring.md index 8bdc0e2f9..940598e90 100644 --- a/docs/design/2026_05_25_proposed_6d6c3a_capability_fanout_wiring.md +++ b/docs/design/2026_05_25_partial_6d6c3a_capability_fanout_wiring.md @@ -2,11 +2,15 @@ | Field | Value | |---|---| -| Status | proposed | +| Status | partial | | Date | 2026-05-25 | | Parent designs | [`2026_05_18_partial_6d_enable_storage_envelope.md`](2026_05_18_partial_6d_enable_storage_envelope.md) (6D-6c-3 milestone; §4 capability fan-out), [`2026_04_29_partial_data_at_rest_encryption.md`](2026_04_29_partial_data_at_rest_encryption.md) (§7.1 rollout) | | Builds on | 6D-3 (`internal/admin.CapabilityFanout` helper), 6D-6a (`adapter.WithEncryptionAdminCapabilityFanout` option) | +**Lifecycle:** the §1 fan-out closure wiring shipped in this PR; the +6D-6c-3b end-to-end test (Bootstrap → EnableStorageEnvelope → Put → +read-back) remains open. Flips to `*_implemented_*` when 3b lands. + ## 0. Why this doc exists The `EnableStorageEnvelope` cutover RPC (6D-6a) runs the §4 capability diff --git a/main_encryption_fanout.go b/main_encryption_fanout.go index aeb621c4f..0883c18ef 100644 --- a/main_encryption_fanout.go +++ b/main_encryption_fanout.go @@ -75,25 +75,51 @@ func buildCapabilityFanoutFn(runtimes []*raftGroupRuntime, connCache *kv.GRPCCon } } +// configReader is the narrow membership-view surface +// capabilityRouteSnapshot needs from a Raft engine. raftengine.Engine +// satisfies it structurally; defining it here lets the snapshot +// builder be unit-tested with a stub that returns a Configuration +// error (the fail-closed path) without standing up a full engine. +type configReader interface { + Configuration(ctx context.Context) (raftengine.Configuration, error) +} + +// groupConfigSource pairs a group id with its membership reader. +type groupConfigSource struct { + groupID uint64 + engine configReader +} + // capabilityRouteSnapshot builds the all-groups admin.RouteSnapshot -// from each runtime's live Raft Configuration. Every server maps to a -// RouteMember keyed by the same etcd.DeriveNodeID(raftID) value the -// rest of the encryption stack uses for node identity, split into -// Voters / Learners on Suffrage (empty suffrage is treated as voter, -// matching raftengine's own convention). A nil runtime/engine is -// skipped; any Configuration error aborts the whole snapshot so the -// caller fails closed. +// from each runtime's live Raft Configuration. It reads each engine +// via snapshotEngine() (engineMu.RLock) because the fan-out closure +// runs during live EnableStorageEnvelope RPCs, concurrently with a +// possible Close() that clears the engine field — a direct field read +// would be a data race. A nil runtime/engine is skipped. func capabilityRouteSnapshot(ctx context.Context, runtimes []*raftGroupRuntime) (admin.RouteSnapshot, error) { - groups := make([]admin.RouteGroup, 0, len(runtimes)) + sources := make([]groupConfigSource, 0, len(runtimes)) for _, rt := range runtimes { - if rt == nil || rt.engine == nil { + eng := rt.snapshotEngine() + if eng == nil { continue } - cfg, err := rt.engine.Configuration(ctx) + sources = append(sources, groupConfigSource{groupID: rt.spec.id, engine: eng}) + } + return routeSnapshotFromSources(ctx, sources) +} + +// routeSnapshotFromSources maps each group's Configuration into the +// admin.RouteSnapshot. Any Configuration error aborts the whole +// snapshot so the caller fails closed — the cutover must never +// proceed against a membership view it could not fully enumerate. +func routeSnapshotFromSources(ctx context.Context, sources []groupConfigSource) (admin.RouteSnapshot, error) { + groups := make([]admin.RouteGroup, 0, len(sources)) + for _, s := range sources { + cfg, err := s.engine.Configuration(ctx) if err != nil { - return admin.RouteSnapshot{}, errors.Wrapf(err, "capability fan-out: configuration for group %d", rt.spec.id) + return admin.RouteSnapshot{}, errors.Wrapf(err, "capability fan-out: configuration for group %d", s.groupID) } - groups = append(groups, routeGroupFromConfiguration(rt.spec.id, cfg)) + groups = append(groups, routeGroupFromConfiguration(s.groupID, cfg)) } return admin.RouteSnapshot{Groups: groups}, nil } diff --git a/main_encryption_fanout_test.go b/main_encryption_fanout_test.go index b7a0880db..2051edc9e 100644 --- a/main_encryption_fanout_test.go +++ b/main_encryption_fanout_test.go @@ -1,12 +1,24 @@ package main import ( + "context" + "errors" "testing" "github.com/bootjp/elastickv/internal/raftengine" etcdraftengine "github.com/bootjp/elastickv/internal/raftengine/etcd" ) +// stubConfigReader is a configReader for the snapshot-builder tests. +type stubConfigReader struct { + cfg raftengine.Configuration + err error +} + +func (s stubConfigReader) Configuration(context.Context) (raftengine.Configuration, error) { + return s.cfg, s.err +} + func TestRouteGroupFromConfiguration_SplitsVotersAndLearners(t *testing.T) { t.Parallel() cfg := raftengine.Configuration{Servers: []raftengine.Server{ @@ -62,3 +74,60 @@ func TestBuildCapabilityFanoutFn_NonNil(t *testing.T) { t.Fatal("buildCapabilityFanoutFn returned nil") } } + +// TestBuildEncryptionCapabilityFanout_DisabledReturnsNil pins that the +// closure is not built when encryption mutators are off — the cutover +// RPC is then unreachable, so wiring a fan-out would be dead weight. +func TestBuildEncryptionCapabilityFanout_DisabledReturnsNil(t *testing.T) { + t.Parallel() + if fn := buildEncryptionCapabilityFanout(context.Background(), nil, nil, false); fn != nil { + t.Error("buildEncryptionCapabilityFanout(enableMutators=false) = non-nil, want nil") + } +} + +func TestRouteSnapshotFromSources_MapsAllGroups(t *testing.T) { + t.Parallel() + sources := []groupConfigSource{ + {groupID: 1, engine: stubConfigReader{cfg: raftengine.Configuration{Servers: []raftengine.Server{ + {ID: "n1", Address: "a:1", Suffrage: etcdraftengine.SuffrageVoter}, + }}}}, + {groupID: 2, engine: stubConfigReader{cfg: raftengine.Configuration{Servers: []raftengine.Server{ + {ID: "n2", Address: "b:2", Suffrage: etcdraftengine.SuffrageLearner}, + }}}}, + } + snap, err := routeSnapshotFromSources(context.Background(), sources) + if err != nil { + t.Fatalf("routeSnapshotFromSources: %v", err) + } + if len(snap.Groups) != 2 { + t.Fatalf("Groups = %d, want 2", len(snap.Groups)) + } + if snap.Groups[0].GroupID != 1 || len(snap.Groups[0].Voters) != 1 { + t.Errorf("group 1 mapping wrong: %+v", snap.Groups[0]) + } + if snap.Groups[1].GroupID != 2 || len(snap.Groups[1].Learners) != 1 { + t.Errorf("group 2 mapping wrong: %+v", snap.Groups[1]) + } +} + +// TestRouteSnapshotFromSources_ConfigurationErrorFailsClosed pins the +// load-bearing fail-closed invariant: if any group's Configuration +// errors, the whole snapshot is abandoned so the cutover refuses +// rather than proposing against a partially-enumerated membership. +func TestRouteSnapshotFromSources_ConfigurationErrorFailsClosed(t *testing.T) { + t.Parallel() + boom := errors.New("configuration unavailable") + sources := []groupConfigSource{ + {groupID: 1, engine: stubConfigReader{cfg: raftengine.Configuration{Servers: []raftengine.Server{ + {ID: "n1", Address: "a:1", Suffrage: etcdraftengine.SuffrageVoter}, + }}}}, + {groupID: 2, engine: stubConfigReader{err: boom}}, + } + _, err := routeSnapshotFromSources(context.Background(), sources) + if err == nil { + t.Fatal("expected error when a group's Configuration fails, got nil") + } + if !errors.Is(err, boom) { + t.Errorf("error does not wrap the Configuration failure: %v", err) + } +} From 664d8f994455f0d5437ba7489b188388391bd6dc Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Mon, 25 May 2026 22:55:43 +0900 Subject: [PATCH 4/4] fix(encryption): PR830 round-2 - explicit nil-runtime guard + doc corrections MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit claude round-2 approved to merge; this folds in the late-arriving coderabbit findings and confirms the codex P1. - coderabbit Minor (nil runtime guard): add an explicit `if rt == nil { continue }` before rt.snapshotEngine() in capabilityRouteSnapshot. snapshotEngine is already nil-receiver-safe (returns nil → skipped), so this is defensive clarity, not a behavior change. - coderabbit Minor (doc): correct the 6D-6c-3a suffrage mapping text to match the implementation — SuffrageLearner → Learners, else → Voters (empty/unannotated suffrage counts as voter). - coderabbit Minor (doc): fix the parent 6D doc's reference link to point at the renamed *_partial_* 6D-6c-3a doc. codex P1 ("read engine through snapshot accessor") was already resolved in round-1 (commit 26ecce50): capabilityRouteSnapshot reads via rt.snapshotEngine() at main_encryption_fanout.go:105 — the P1 was filed against the pre-fix commit. claude round-2 independently confirmed the race fix is correct. --- docs/design/2026_05_18_partial_6d_enable_storage_envelope.md | 2 +- .../2026_05_25_partial_6d6c3a_capability_fanout_wiring.md | 3 ++- main_encryption_fanout.go | 3 +++ 3 files changed, 6 insertions(+), 2 deletions(-) diff --git a/docs/design/2026_05_18_partial_6d_enable_storage_envelope.md b/docs/design/2026_05_18_partial_6d_enable_storage_envelope.md index 58b8878d3..ac2cf5ffa 100644 --- a/docs/design/2026_05_18_partial_6d_enable_storage_envelope.md +++ b/docs/design/2026_05_18_partial_6d_enable_storage_envelope.md @@ -111,7 +111,7 @@ all groups + a `DialFunc` over a dedicated `kv.GRPCConnCache`), wired into the EnableStorageEnvelope server via `WithEncryptionAdminCapabilityFanout` gated on encryption mutators. - See [`2026_05_25_proposed_6d6c3a_capability_fanout_wiring.md`](2026_05_25_proposed_6d6c3a_capability_fanout_wiring.md). + See [`2026_05_25_partial_6d6c3a_capability_fanout_wiring.md`](2026_05_25_partial_6d6c3a_capability_fanout_wiring.md). ## Open milestones diff --git a/docs/design/2026_05_25_partial_6d6c3a_capability_fanout_wiring.md b/docs/design/2026_05_25_partial_6d6c3a_capability_fanout_wiring.md index 940598e90..ba605f1a9 100644 --- a/docs/design/2026_05_25_partial_6d6c3a_capability_fanout_wiring.md +++ b/docs/design/2026_05_25_partial_6d6c3a_capability_fanout_wiring.md @@ -34,7 +34,8 @@ wire-format change. `rt.engine.Configuration(ctx)` and map each `raftengine.Server` to an `admin.RouteMember{FullNodeID: etcd.DeriveNodeID(srv.ID), Address: srv.Address}`, splitting on `srv.Suffrage` - (`SuffrageVoter` → Voters, else Learners) into one + (`SuffrageLearner` → Learners, else → Voters, so empty/unannotated + suffrage counts as a voter per raftengine's convention) into one `admin.RouteGroup{GroupID: rt.spec.id}`. The §4.1 contract is "every (voter ∪ learner) of **every** Raft group", so the snapshot spans **all** runtimes, not just the cutover RPC's group. diff --git a/main_encryption_fanout.go b/main_encryption_fanout.go index 0883c18ef..a225317f2 100644 --- a/main_encryption_fanout.go +++ b/main_encryption_fanout.go @@ -99,6 +99,9 @@ type groupConfigSource struct { func capabilityRouteSnapshot(ctx context.Context, runtimes []*raftGroupRuntime) (admin.RouteSnapshot, error) { sources := make([]groupConfigSource, 0, len(runtimes)) for _, rt := range runtimes { + if rt == nil { + continue + } eng := rt.snapshotEngine() if eng == nil { continue