test(ncclfrreceiver): factory→consumer e2e with golden fixtures (closes #330)#435
Conversation
Adds nccl_fr_integration_test.go covering five scenarios via the
public NewFactory().CreateLogs path (the codepath OCB stitches at
runtime):
* TestIntegration_E2E_FactoryToConsumer — operator-rich config
(rank, communicator, hw.id, k8s.pod.*) → committed pkl fixture
→ consumer; asserts source-path matches the watched file +
diffs against operator-attrs golden.
* TestIntegration_E2E_AllFixtures — table-driven over three
committed slugs (2.29 healthy, 2.30 healthy, fr-is-p2p);
bare-defaults config to pin the contract for `dump_dir`-only
operators.
* TestIntegration_EdgeCase_UnsafeOpcode — adversarial PROTO 5 +
REDUCE pickle is rejected without emit.
* TestIntegration_EdgeCase_EmptyFile — zero-byte .pkl from a
mid-fsync writer is rejected without emit.
* TestIntegration_EdgeCase_NonMatchingGlob — receiver's glob
filter precedes the parse path.
Golden-file pattern: emitted plog.Logs are normalized (temp-dir
paths redacted; records sorted on a stable composite key) and
compared against testdata/integration/<slug>.emitted.golden.json.
UPDATE_GOLDEN=1 regenerates after intentional schema changes.
Suite total: 1.23s (well inside the 2s ci-fast budget). Passes
under -race.
Refs #330
Signed-off-by: Tri Lam <tree@lumalabs.ai>
Independent Adversarial Review: PR #435B/A/A+ Criteria for THIS PR
Findings1. componenttestHost: unnecessary stub with false justification 2. logsSink: redundant reimplementation of consumertest.LogsSink 3. padInt function: latent negative-number sort bug 4. Fixture redundancy: 2.29 vs 2.30 healthy cover identical code path 5. UPDATE_GOLDEN env var: undocumented new convention Simplification SweepTrim targets:
Sweep result: 3 load-bearing deletions + 2 clarity fixes — achievable in one follow-up commit. VERDICT: BLOCKThe PR ships two custom abstractions ( Path forward: Apply the 5 trim targets (surgical edits, <30 lines net), re-test, and resubmit. The underlying test structure is solid; it needs only cleanup. |
- Drop 6-line componenttestHost stub; use upstream componenttest.NewNopHost(). - Drop 19-line custom logsSink; use upstream consumertest.LogsSink directly (sink.LogRecordCount(), sink.AllLogs()[0] at call sites). - padInt now panics on negative input (fixtures pin keys to non-negative by NCCL semantics); removes latent lexicographic-sort bug for negatives. - Drop redundant nccl-2.29.x-healthy bare-defaults golden (same codepath as 2.30.x-healthy; differs only in nccl.version + op_id offset). The 2.29.x operator-attrs golden remains and pins the 2.29 .pkl through TestIntegration_E2E_FactoryToConsumer. - Trim header doc to 1-2 lines covering UPDATE_GOLDEN regen + the separate make generate-fixtures parser-fixture path. - Promote componenttest + consumertest from indirect to direct deps in module/go.mod (go mod tidy clean). Signed-off-by: Tri Lam <tree@lumalabs.ai>
|
All 5 reviewer findings applied (commit 4423077):
Net: +67 / -191 (124 lines deleted). Verification:
Re-requesting review. |
## Summary Closes #329 — adds `module/receiver/ncclfrreceiver/noop_fallback_test.go` pinning the selftel noop fallback paths the audit flagged as 0% func-cov. **Root cause of the 0% audit reading.** The five noop methods on `noopSelfTelemetry` have empty bodies (`selftel.go:90-94`): ```go func (noopSelfTelemetry) IncError(kind) {} func (noopSelfTelemetry) IncEmissions(int64) {} func (noopSelfTelemetry) ObserveLatency(time.Duration) {} func (noopSelfTelemetry) SetDegraded(bool) {} func (noopSelfTelemetry) MarkActivity() {} ``` Go's cover tool reports 0% func cov on these because there are zero statements to instrument — not because they're un-invoked. `TestSelfTelemetry_NoopAlwaysSafe` already calls every one of them. The audit's "≥80% line cov" gate is structurally unreachable here without adding non-empty bodies (production change, out of scope for this PR per task constraint and follow-up #432). This PR guards the **behavioral contract** instead: a noop method replaced by `panic` fails every test below. ## Tests (all in `noop_fallback_test.go`) - **`TestNoopFallback_AllMethodsSafe_TableDriven`** — 16 sub-tests across every `kind` enum (`kindEnumerate/Read/Parse/Downstream/Panic` + an unknown-kind), every `int64` emission delta (zero / positive / negative / max), every `time.Duration` (zero / positive / negative), both `SetDegraded` transitions, `MarkActivity`. Guards `selftel.go:90-94`. - **`TestNoopFallback_FactoryWithNilMeterProvider`** — drives `factory.go:57` with `set.MeterProvider = nil`. The factory must skip `newSelfTelemetry` entirely and leave the receiver holding the noop assigned in `newReceiver` (`nccl_fr.go:79`). Then exercises every hot-path method against `recv.telemetry`. The existing `TestSelfTelemetry_NewReceiver_NilProviderErrors` covered the unit; this covers the factory wiring. - **`TestNoopFallback_FactoryFailedMeter_AllHotPathsSafe`** — extends the existing `TestFactory_FallsBackToNoopWhenMeterFails` (which only invokes `IncError`) by exercising every noop method after the factory falls back to noop via the failing-meter path (`factory.go:58-66`), then asserts no datapoints leaked into `otelcol.receiver.ncclfr.*` counters/histograms (noop must discard). ## Noop-site references | Site | File:Line | Guarded by | |---|---|---| | `noopSelfTelemetry.IncError` | `selftel.go:90` | TableDriven `IncError/*` + Factory tests | | `noopSelfTelemetry.IncEmissions` | `selftel.go:91` | TableDriven `IncEmissions/*` + Factory tests | | `noopSelfTelemetry.ObserveLatency` | `selftel.go:92` | TableDriven `ObserveLatency/*` + Factory tests | | `noopSelfTelemetry.SetDegraded` | `selftel.go:93` | TableDriven `SetDegraded/*` + Factory tests | | `noopSelfTelemetry.MarkActivity` | `selftel.go:94` | TableDriven `MarkActivity` + Factory tests | | `newReceiver` noop init | `nccl_fr.go:79` | both Factory tests | | factory nil-MP branch | `factory.go:57,67-69` | `FactoryWithNilMeterProvider` | | factory failed-meter branch | `factory.go:58-66` | `FactoryFailedMeter_AllHotPathsSafe` | ## TDD red-first verification Locally substituted each noop body with `panic("noop hit: <name>")`, re-ran: ``` --- FAIL: TestNoopFallback_AllMethodsSafe_TableDriven/IncError/enumerate noop_fallback_test.go:78: noop IncError/enumerate panicked: noop hit: IncError --- FAIL: TestNoopFallback_FactoryWithNilMeterProvider noop_fallback_test.go:116: noop hot-path call panicked: noop hit: IncError --- FAIL: TestNoopFallback_FactoryFailedMeter_AllHotPathsSafe noop_fallback_test.go:165: noop hot-path call panicked: noop hit: IncError ``` Reverted `selftel.go` (no diff), tests GREEN. ## Sibling lane This PR is the noop-edges complement to PR #435 (factory→consumer e2e, closes #330). Disjoint files: #435 only touches `nccl_fr_integration_test.go` + golden fixtures. ## Sibling receiver with same pattern (A+ follow-up) `module/processor/patterndetectorprocessor/selftel.go` uses the same noop pattern (`IncVerdict` at 0% func cov for the same empty-body reason). Will file a follow-up issue post-merge. ## Coverage `cd module && GOWORK=off go test -cover ./receiver/ncclfrreceiver/`: 77.3% → 78.3%. Empty-body noop methods remain at 0% func-cov by Go-cover-tool design — that gap is structural, not a test gap. The audit's ≥80% target needs a production change (non-empty noop bodies, or a different surface shape); tracked as follow-up. ## Test plan - [x] `cd module && GOWORK=off go test -race -count=1 ./receiver/ncclfrreceiver/...` green (2.449s) - [x] `TestNoopFallback_*` runs in 0.04s (<500ms target) - [x] TDD red-first: panic-substitution causes failures with matching messages - [x] DCO sign-off + `.githooks` lint/vet/attr-namespace-check all green pre-commit - [x] Diff = `noop_fallback_test.go` only (no production-code touched) ## Release notes ```release-notes NONE ``` Test-only PR; no operator-visible change. Signed-off-by: Tri Lam <tree@lumalabs.ai>
Independent Adversarial Review: Commit 4423077 (Post-Block-Fix)Prior BLOCK Resolution (5 Findings)1. componenttestHost stub (✓ RESOLVED)
2. logsSink custom reimplementation (✓ RESOLVED)
3. padInt negative-sort bug (✓ RESOLVED)
4. Fixture redundancy (✓ RESOLVED)
5. UPDATE_GOLDEN convention (✓ RESOLVED)
Final Simplification SweepHeader doc: Trimmed to 5 lines (L21-25), explains issue closure + regen mechanism + distinction from parser fixture path. No multi-paragraph cruft. Imports: Clean — only upstream deps (componenttest, consumertest) + standard lib (context, encoding/json, errors, os, path/filepath, reflect, sort, testing, time). No custom abstractions: All stubs removed; upstream-only pattern throughout. Test performance: ✓
Code quality: ✓
Merge conflict risk (PR #435 + #440): ✓ CLEAR
New Findings (Simplification Sweep)NONE. The commit is clean. All 5 prior findings resolved; no new code noise introduced; net -124 lines (67 additions / 191 deletions). VERDICT: AJustification:
RECOMMENDATION: MERGE |
Summary
Closes #330 —
ncclfrreceiverpreviously had no test that exercised the full factory → file-watch → safe-pickle parse → consumer chain against a binary.pklfixture. This PR addsnccl_fr_integration_test.gowith five scenarios wired through the publicNewFactory().CreateLogscodepath (the same path OCB stitches at runtime).TestIntegration_E2E_FactoryToConsumersets rank / communicator / hw.id / k8s.pod.* on the receiver config, drops a committed.pklfixture into a temp watch dir, asserts the emittednccl.fr.source_pathpoints at the file we wrote, and diffs the full log shape against a committed operator-attrs golden.TestIntegration_E2E_AllFixturesruns three slugs (nccl-2.29.x-healthy,nccl-2.30.x-healthy,nccl-fr-is-p2p) with bare-defaults config to pin the contract operators get when they configure onlydump_dir..pkl, and non-matching-glob file are each asserted to produce zero log records without panicking.Golden-file pattern
Emitted
plog.Logsare normalized (temp-dir prefix innccl.fr.source_pathredacted to<watch_dir>/; records sorted oncollective_seq_id:p2p_seq_id:op_idso map iteration order can't flake) and compared against committedtestdata/integration/<slug>.emitted.golden.json. The pattern mirrors the parser's existingTestFixtures_MatchGoldens—UPDATE_GOLDEN=1regenerates after an intentional schema change.Performance
Well inside the 2s
ci-fastbudget. Passes under-race(full package: 3.15s).Notes for reviewer
NewFactory(), the existinglogsSinkfromnccl_fr_test.go, and a minimal localcomponenttestHost(mirroringpatterndetectorprocessor's pattern) so the test stays inside themodule/go.mod's direct-dependency surface.Validate()). The existingnccl_fr_test.goruns at 50ms because it callsnewReceiverdirectly and bypasses Validate; the new tests go through the factory so they're subject to the floor.emit()callsMarkActivity()but neverIncEmissions(n), so the audit's "assertIncEmissionsfired once" criterion can't be satisfied without first fixing the production telemetry gap. Filed as bug(ncclfrreceiver): emit() never calls IncEmissions; emissions_total stays 0 #432 — separate scope, separate PR, test-only PR rules apply here.Test plan
go test -count=1 -race ./module/receiver/ncclfrreceiver/...— passes (3.15s)go test -count=1 -run TestIntegration ./module/receiver/ncclfrreceiver/...— 5 tests pass in 1.23sTestReceiver_*,TestConfig_Validate,TestSelfTelemetry_*,TestFactory_*still passUPDATE_GOLDEN=1 go test ...regenerates fixtures deterministically (verified by re-running and confirming no diff)