test(ncclfrreceiver): factory→consumer e2e with golden fixtures (closes #330) #435

Merged
trilamsr merged 2 commits into main from test/330-ncclfrreceiver-e2e
Jun 2, 2026

@trilamsr trilamsr commented Jun 2, 2026

Summary

Closes #330. ncclfrreceiver previously had no test that exercised the full factory → file-watch → safe-pickle parse → consumer chain against a binary .pkl fixture. This PR adds nccl_fr_integration_test.go with five scenarios wired through the public NewFactory().CreateLogs codepath (the same path OCB stitches at runtime).

  • Happy path with operator-rich config: TestIntegration_E2E_FactoryToConsumer sets rank / communicator / hw.id / k8s.pod.* on the receiver config, drops a committed .pkl fixture into a temp watch dir, asserts that the emitted nccl.fr.source_path points at the file we wrote, and diffs the full log shape against a committed operator-attrs golden.
  • Table-driven happy-path matrix: TestIntegration_E2E_AllFixtures runs three slugs (nccl-2.29.x-healthy, nccl-2.30.x-healthy, nccl-fr-is-p2p) with a bare-defaults config to pin the contract operators get when they configure only dump_dir.
  • Edge cases: an unsafe-opcode pickle (PROTO 5 + REDUCE), a zero-byte mid-fsync .pkl, and a file that does not match the glob are each asserted to produce zero log records without panicking.
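
The two cheapest edge cases (non-matching glob, zero-byte file) can be pictured as a filter that runs before any parsing. The following is only a sketch under assumptions: the function name shouldParse, the "*.pkl" pattern, and the size check are hypothetical and are not taken from the receiver's code, which may reject empty files elsewhere (for example in the parser).

```go
package main

import (
	"fmt"
	"path/filepath"
)

// shouldParse reports whether a candidate dump file should be handed to the
// parser. Hypothetical helper: the pattern handling and the size check are
// illustrative assumptions, not the receiver's actual implementation.
func shouldParse(pattern, name string, size int64) (bool, error) {
	matched, err := filepath.Match(pattern, filepath.Base(name))
	if err != nil {
		return false, fmt.Errorf("bad glob %q: %w", pattern, err)
	}
	if !matched {
		return false, nil // the glob filter runs before the parse path
	}
	if size == 0 {
		return false, nil // zero-byte file: the writer may not have flushed yet
	}
	return true, nil
}

func main() {
	candidates := []struct {
		name string
		size int64
	}{
		{"/dumps/rank0.pkl", 4096},
		{"/dumps/rank0.pkl", 0},
		{"/dumps/notes.txt", 128},
	}
	for _, c := range candidates {
		ok, err := shouldParse("*.pkl", c.name, c.size)
		fmt.Println(c.name, c.size, ok, err)
	}
}
```

A filter of this shape emits nothing for skipped files, which is the property the edge-case tests assert from the outside (zero log records, no panic).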

Golden-file pattern

Emitted plog.Logs are normalized (temp-dir prefix in nccl.fr.source_path redacted to <watch_dir>/; records sorted on collective_seq_id:p2p_seq_id:op_id so map iteration order can't flake) and compared against the committed testdata/integration/<slug>.emitted.golden.json. The pattern mirrors the parser's existing TestFixtures_MatchGoldens; UPDATE_GOLDEN=1 regenerates the goldens after an intentional schema change.
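
As a rough illustration of the normalization step, the sketch below redacts a temp-dir prefix and orders records deterministically. The record struct and the normalize function are hypothetical stand-ins for the real plog.Logs handling, and the sketch compares the three IDs numerically rather than building the padded string key the PR describes.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// record is a hypothetical flattened view of one emitted log record.
type record struct {
	SourcePath      string
	CollectiveSeqID int64
	P2PSeqID        int64
	OpID            int64
}

// normalize returns a copy of recs with the watch-dir prefix redacted and the
// records sorted on (collective_seq_id, p2p_seq_id, op_id), so the result does
// not depend on temp-dir names or on emission order.
func normalize(recs []record, watchDir string) []record {
	out := make([]record, len(recs))
	copy(out, recs)
	prefix := strings.TrimSuffix(watchDir, "/") + "/"
	for i := range out {
		if strings.HasPrefix(out[i].SourcePath, prefix) {
			out[i].SourcePath = "<watch_dir>/" + strings.TrimPrefix(out[i].SourcePath, prefix)
		}
	}
	sort.SliceStable(out, func(i, j int) bool {
		a, b := out[i], out[j]
		if a.CollectiveSeqID != b.CollectiveSeqID {
			return a.CollectiveSeqID < b.CollectiveSeqID
		}
		if a.P2PSeqID != b.P2PSeqID {
			return a.P2PSeqID < b.P2PSeqID
		}
		return a.OpID < b.OpID
	})
	return out
}

func main() {
	recs := []record{
		{SourcePath: "/tmp/watch123/rank0.pkl", CollectiveSeqID: 2, OpID: 1},
		{SourcePath: "/tmp/watch123/rank0.pkl", CollectiveSeqID: 1, OpID: 5},
	}
	for _, r := range normalize(recs, "/tmp/watch123") {
		fmt.Printf("%+v\n", r)
	}
}
```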

Performance

TestIntegration_E2E_FactoryToConsumer            0.05s
TestIntegration_E2E_AllFixtures                  0.08s  (3 sub-tests)
TestIntegration_EdgeCase_UnsafeOpcode            0.25s
TestIntegration_EdgeCase_EmptyFile               0.25s
TestIntegration_EdgeCase_NonMatchingGlob         0.25s
-------------------------------------------------------
total                                            1.23s

Well inside the 2s ci-fast budget. Passes under -race (full package: 3.15s).

Notes for reviewer

  • No production code touched. The integration tests use the existing NewFactory(), the existing logsSink from nccl_fr_test.go, and a minimal local componenttestHost (mirroring patterndetectorprocessor's pattern) so the test stays inside module/go.mod's direct-dependency surface.
  • Poll interval is 100ms (the receiver's enforced floor via Validate()). The existing nccl_fr_test.go runs at 50ms because it calls newReceiver directly and bypasses Validate; the new tests go through the factory so they're subject to the floor.
  • Follow-up: #432 (bug(ncclfrreceiver): emit() never calls IncEmissions; emissions_total stays 0). While drafting this test I confirmed that the receiver's emit() calls MarkActivity() but never IncEmissions(n), so the audit's "assert IncEmissions fired once" criterion can't be satisfied without first fixing the production telemetry gap. That fix is separate scope and a separate PR; test-only PR rules apply here.
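
To make the poll-interval note concrete, here is a minimal sketch of a Validate() method that enforces a 100ms floor. The config struct, its field names, and the poll_interval key in the error text are assumptions for illustration; only the 100ms floor and the dump_dir setting come from the description above.

```go
package main

import (
	"fmt"
	"time"
)

// minPollInterval mirrors the 100ms floor mentioned in the notes.
const minPollInterval = 100 * time.Millisecond

// config is a hypothetical stand-in for the receiver config.
type config struct {
	DumpDir      string
	PollInterval time.Duration
}

// Validate rejects poll intervals below the floor. Code that builds the
// receiver directly, without calling Validate, is not subject to this check.
func (c config) Validate() error {
	if c.PollInterval < minPollInterval {
		return fmt.Errorf("poll_interval %s is below the %s floor", c.PollInterval, minPollInterval)
	}
	return nil
}

func main() {
	fast := config{DumpDir: "/dumps", PollInterval: 50 * time.Millisecond}
	ok := config{DumpDir: "/dumps", PollInterval: minPollInterval}
	fmt.Println(fast.Validate())
	fmt.Println(ok.Validate())
}
```
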
- Add `ncclfrreceiver` factory→consumer end-to-end integration test (#330)

Test plan

  • go test -count=1 -race ./module/receiver/ncclfrreceiver/... — passes (3.15s)
  • go test -count=1 -run TestIntegration ./module/receiver/ncclfrreceiver/... — 5 tests pass in 1.23s
  • Pre-existing TestReceiver_*, TestConfig_Validate, TestSelfTelemetry_*, TestFactory_* still pass
  • gofmt + go vet clean
  • UPDATE_GOLDEN=1 go test ... regenerates fixtures deterministically (verified by re-running and confirming no diff)

Adds nccl_fr_integration_test.go covering five scenarios via the
public NewFactory().CreateLogs path (the codepath OCB stitches at
runtime):

  * TestIntegration_E2E_FactoryToConsumer — operator-rich config
    (rank, communicator, hw.id, k8s.pod.*) → committed pkl fixture
    → consumer; asserts source-path matches the watched file +
    diffs against operator-attrs golden.
  * TestIntegration_E2E_AllFixtures — table-driven over three
    committed slugs (2.29 healthy, 2.30 healthy, fr-is-p2p);
    bare-defaults config to pin the contract for `dump_dir`-only
    operators.
  * TestIntegration_EdgeCase_UnsafeOpcode — adversarial PROTO 5 +
    REDUCE pickle is rejected without emit.
  * TestIntegration_EdgeCase_EmptyFile — zero-byte .pkl from a
    mid-fsync writer is rejected without emit.
  * TestIntegration_EdgeCase_NonMatchingGlob — receiver's glob
    filter precedes the parse path.

Golden-file pattern: emitted plog.Logs are normalized (temp-dir
paths redacted; records sorted on a stable composite key) and
compared against testdata/integration/<slug>.emitted.golden.json.
UPDATE_GOLDEN=1 regenerates after intentional schema changes.

Suite total: 1.23s (well inside the 2s ci-fast budget). Passes
under -race.

Refs #330

Signed-off-by: Tri Lam <tree@lumalabs.ai>

trilamsr commented Jun 2, 2026

Independent Adversarial Review: PR #435

B/A/A+ Criteria for THIS PR

  1. Acceptance criteria closure: Exercise factory→consumer pipeline against committed .pkl fixtures; assert emitted logs match golden JSON; verify nccl.fr.source_path attribute.
  2. No production-code modifications: Test-only code; leaves receiver/factory/parser untouched.
  3. Race-safe, CI-budget compliant: 5 tests (7 sub-tests) complete in 1.23s; passes under -race at 3.15s.

Findings

1. componenttestHost: unnecessary stub with false justification
Lines ~291–292 — Builder claims the local stub is needed "to stay inside module's direct-dependency surface." False: go.opentelemetry.io/collector/component/componenttest v0.130.0 is already listed as indirect in module/go.mod (line 53), and componenttest.NewNopHost() is available. Delete stub; import and use the upstream function directly (1 import + 1 call swap).

2. logsSink: redundant reimplementation of consumertest.LogsSink
Lines ~298–316 — The local struct duplicates upstream consumertest.LogsSink (which provides AllLogs(), LogRecordCount(), Reset(), Contexts()). Replace with direct use of upstream; change calls from sink.recordCount() to sink.LogRecordCount() and inline a helper for first() if needed. Net ~−20 lines.

3. padInt function: latent negative-number sort bug
Lines ~377–393: padInt pads integers for lexicographic sorting but fails on negatives. Input -5 → "-00000000000000000005", which sorts before "00000000000000000001"; that is correct only by accident of the - prefix. The test data has no negative IDs, so the bug is hidden. Either (a) add an assertion that all seq IDs are non-negative, or (b) use fmt.Sprintf("%+021d"...) to handle signs correctly.
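
Option (a) could be sketched as follows. The name padKey is hypothetical (the PR's helper is padInt, whose body is not shown here), and the 20-digit width is inferred from the example strings in this finding.

```go
package main

import (
	"fmt"
	"sort"
)

// padKey zero-pads a non-negative integer to 20 digits so that lexicographic
// order of the resulting strings matches numeric order. It panics on negative
// input instead of producing a key that would sort incorrectly.
func padKey(n int64) string {
	if n < 0 {
		panic(fmt.Sprintf("padKey: negative input %d not supported", n))
	}
	return fmt.Sprintf("%020d", n)
}

func main() {
	keys := []string{padKey(10), padKey(9), padKey(100)}
	sort.Strings(keys)
	fmt.Println(keys)
}
```

Rejecting negatives sidesteps the ordering problem entirely. Zero-padded negative keys would compare in reverse among themselves (the key for -5 sorts before the key for -10), and a sign flag alone would not repair that, since "+" also sorts before "-" in ASCII.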

4. Fixture redundancy: 2.29 vs 2.30 healthy cover identical code path
nccl-2.29.x-healthy vs nccl-2.30.x-healthy — Both test the collective-ops path (is_p2p=false); they differ only in nccl.version string and op_id offsets. This violates the PR summary's claim of "different code paths." The nccl-fr-is-p2p fixture legitimately covers is_p2p=true. Drop one of the 2.29/2.30 pair (keep 2.30); saves ~27ms test time and ~70 lines of redundant JSON.

5. UPDATE_GOLDEN env var: undocumented new convention
Lines ~424–446 — The UPDATE_GOLDEN=1 pattern is a new convention not present elsewhere in the codebase. The parser test suite uses make generate-fixtures. Either document UPDATE_GOLDEN in a Makefile comment or align with the parser convention for consistency.


Simplification Sweep

Trim targets:

  • Delete componenttestHost struct; use componenttest.NewNopHost() (net −6 lines, 1 import).
  • Delete local logsSink; use consumertest.LogsSink (net −20 lines, adjust call sites).
  • Drop nccl-2.29.x-healthy golden and corresponding test case (net −70 lines, reduces redundancy).
  • Fix or assert on padInt negative-number case (correctness, not size).
  • Document UPDATE_GOLDEN=1 convention (clarity).

Sweep result: 3 load-bearing deletions + 2 clarity fixes — achievable in one follow-up commit.


VERDICT: BLOCK

The PR ships two custom abstractions (componenttestHost, logsSink) when upstream equivalents already exist in the dependency tree. The builder's stated justification for the stub is factually incorrect — the assertion "staying inside direct-dependency surface" ignores that componenttest is already transitive. The padInt negative-sort bug is latent but breaks the contract if fixture data ever includes negative seq IDs. The 2.29/2.30 redundancy contradicts the claimed "different code paths" coverage. These are not cosmetic: they represent design decisions that fail the simplicity-first review criterion.

Path forward: Apply the 5 trim targets (surgical edits, <30 lines net), re-test, and resubmit. The underlying test structure is solid; it needs only cleanup.

- Drop 6-line componenttestHost stub; use upstream componenttest.NewNopHost().
- Drop 19-line custom logsSink; use upstream consumertest.LogsSink directly
  (sink.LogRecordCount(), sink.AllLogs()[0] at call sites).
- padInt now panics on negative input (fixtures pin keys to non-negative
  by NCCL semantics); removes latent lexicographic-sort bug for negatives.
- Drop redundant nccl-2.29.x-healthy bare-defaults golden (same codepath
  as 2.30.x-healthy; differs only in nccl.version + op_id offset).
  The 2.29.x operator-attrs golden remains and pins the 2.29 .pkl
  through TestIntegration_E2E_FactoryToConsumer.
- Trim header doc to 1-2 lines covering UPDATE_GOLDEN regen + the
  separate make generate-fixtures parser-fixture path.
- Promote componenttest + consumertest from indirect to direct deps in
  module/go.mod (go mod tidy clean).

Signed-off-by: Tri Lam <tree@lumalabs.ai>

trilamsr commented Jun 2, 2026

All 5 reviewer findings applied (commit 4423077):

  1. componenttestHost stub deleted (6 lines) — now uses componenttest.NewNopHost(); promoted to direct dep in module/go.mod.
  2. Custom logsSink deleted (19 lines) — now uses consumertest.LogsSink directly (LogRecordCount(), AllLogs()[0]); promoted to direct dep.
  3. padInt negative-sort bug fixed — now panics on n < 0 with explanatory message; fixtures pin keys to non-negative by NCCL semantics.
  4. Redundant 2.29.x bare-defaults golden dropped (70 lines) — 2.30.x-healthy retained as "current" version; 2.29.x operator-attrs golden remains (distinct codepath).
  5. Header doc tightened to 5 lines covering UPDATE_GOLDEN regen + the separate make generate-fixtures parser-fixture path.

Net: +67 / -191 (net −124 lines). Verification:

  • go test -race -count=1 ./module/receiver/ncclfrreceiver/... → ok (3.299s)
  • go test -count=1 -run TestIntegration ./module/receiver/ncclfrreceiver/... → ok (1.236s)
  • gofmt -l module/receiver/ncclfrreceiver/ → empty
  • go vet ./module/receiver/ncclfrreceiver/... → clean
  • pre-commit hooks (lint, vet, mod-verify, attribute-namespace, slo-rules, deprecation, no-autoupdate) → all green

Re-requesting review.

trilamsr added a commit that referenced this pull request Jun 2, 2026
## Summary

Closes #329 — adds
`module/receiver/ncclfrreceiver/noop_fallback_test.go` pinning the
selftel noop fallback paths the audit flagged as 0% func-cov.

**Root cause of the 0% audit reading.** The five noop methods on
`noopSelfTelemetry` have empty bodies (`selftel.go:90-94`):

```go
func (noopSelfTelemetry) IncError(kind)                {}
func (noopSelfTelemetry) IncEmissions(int64)           {}
func (noopSelfTelemetry) ObserveLatency(time.Duration) {}
func (noopSelfTelemetry) SetDegraded(bool)             {}
func (noopSelfTelemetry) MarkActivity()                {}
```

Go's cover tool reports 0% func cov on these because there are zero
statements to instrument — not because they're un-invoked.
`TestSelfTelemetry_NoopAlwaysSafe` already calls every one of them. The
audit's "≥80% line cov" gate is structurally unreachable here without
adding non-empty bodies (production change, out of scope for this PR per
task constraint and follow-up #432).

This PR guards the **behavioral contract** instead: a noop method
replaced by `panic` fails every test below.
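
A minimal sketch of that kind of guard, assuming a trimmed-down interface: the `telemetry` interface, `noopTelemetry` type, and `panicked` helper below are hypothetical names for illustration, not the identifiers used in `noop_fallback_test.go`.

```go
package main

import "fmt"

// telemetry is a hypothetical, trimmed-down self-telemetry surface.
type telemetry interface {
	IncEmissions(n int64)
	MarkActivity()
}

// noopTelemetry discards everything; its methods have empty bodies.
type noopTelemetry struct{}

func (noopTelemetry) IncEmissions(int64) {}
func (noopTelemetry) MarkActivity()      {}

// panicked runs fn and reports whether it panicked, along with the recovered
// value. A test can call it once per noop method and fail when did is true.
func panicked(fn func()) (did bool, val any) {
	defer func() {
		if r := recover(); r != nil {
			did, val = true, r
		}
	}()
	fn()
	return false, nil
}

func main() {
	var tel telemetry = noopTelemetry{}
	did, val := panicked(func() { tel.IncEmissions(-1) })
	fmt.Println(did, val)
	did, val = panicked(func() { panic("noop hit: IncEmissions") })
	fmt.Println(did, val)
}
```

Because the check is about behavior rather than statement coverage, it stays meaningful even though the cover tool has nothing to instrument in an empty body.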

## Tests (all in `noop_fallback_test.go`)

- **`TestNoopFallback_AllMethodsSafe_TableDriven`** — 16 sub-tests
across every `kind` enum (`kindEnumerate/Read/Parse/Downstream/Panic` +
an unknown-kind), every `int64` emission delta (zero / positive /
negative / max), every `time.Duration` (zero / positive / negative),
both `SetDegraded` transitions, `MarkActivity`. Guards
`selftel.go:90-94`.

- **`TestNoopFallback_FactoryWithNilMeterProvider`** — drives
`factory.go:57` with `set.MeterProvider = nil`. The factory must skip
`newSelfTelemetry` entirely and leave the receiver holding the noop
assigned in `newReceiver` (`nccl_fr.go:79`). Then exercises every
hot-path method against `recv.telemetry`. The existing
`TestSelfTelemetry_NewReceiver_NilProviderErrors` covered the unit; this
covers the factory wiring.

- **`TestNoopFallback_FactoryFailedMeter_AllHotPathsSafe`** — extends
the existing `TestFactory_FallsBackToNoopWhenMeterFails` (which only
invokes `IncError`) by exercising every noop method after the factory
falls back to noop via the failing-meter path (`factory.go:58-66`), then
asserts no datapoints leaked into `otelcol.receiver.ncclfr.*`
counters/histograms (noop must discard).

## Noop-site references

| Site | File:Line | Guarded by |
|---|---|---|
| `noopSelfTelemetry.IncError` | `selftel.go:90` | TableDriven `IncError/*` + Factory tests |
| `noopSelfTelemetry.IncEmissions` | `selftel.go:91` | TableDriven `IncEmissions/*` + Factory tests |
| `noopSelfTelemetry.ObserveLatency` | `selftel.go:92` | TableDriven `ObserveLatency/*` + Factory tests |
| `noopSelfTelemetry.SetDegraded` | `selftel.go:93` | TableDriven `SetDegraded/*` + Factory tests |
| `noopSelfTelemetry.MarkActivity` | `selftel.go:94` | TableDriven `MarkActivity` + Factory tests |
| `newReceiver` noop init | `nccl_fr.go:79` | both Factory tests |
| factory nil-MP branch | `factory.go:57,67-69` | `FactoryWithNilMeterProvider` |
| factory failed-meter branch | `factory.go:58-66` | `FactoryFailedMeter_AllHotPathsSafe` |

## TDD red-first verification

Locally substituted each noop body with `panic("noop hit: <name>")`,
re-ran:

```
--- FAIL: TestNoopFallback_AllMethodsSafe_TableDriven/IncError/enumerate
    noop_fallback_test.go:78: noop IncError/enumerate panicked: noop hit: IncError
--- FAIL: TestNoopFallback_FactoryWithNilMeterProvider
    noop_fallback_test.go:116: noop hot-path call panicked: noop hit: IncError
--- FAIL: TestNoopFallback_FactoryFailedMeter_AllHotPathsSafe
    noop_fallback_test.go:165: noop hot-path call panicked: noop hit: IncError
```

Reverted `selftel.go` (no diff), tests GREEN.

## Sibling lane

This PR is the noop-edges complement to PR #435 (factory→consumer e2e,
closes #330). Disjoint files: #435 only touches
`nccl_fr_integration_test.go` + golden fixtures.

## Sibling receiver with same pattern (A+ follow-up)

`module/processor/patterndetectorprocessor/selftel.go` uses the same
noop pattern (`IncVerdict` at 0% func cov for the same empty-body
reason). Will file a follow-up issue post-merge.

## Coverage

`cd module && GOWORK=off go test -cover ./receiver/ncclfrreceiver/`:
77.3% → 78.3%. Empty-body noop methods remain at 0% func-cov by
Go-cover-tool design — that gap is structural, not a test gap. The
audit's ≥80% target needs a production change (non-empty noop bodies, or
a different surface shape); tracked as follow-up.

## Test plan

- [x] `cd module && GOWORK=off go test -race -count=1
./receiver/ncclfrreceiver/...` green (2.449s)
- [x] `TestNoopFallback_*` runs in 0.04s (<500ms target)
- [x] TDD red-first: panic-substitution causes failures with matching
messages
- [x] DCO sign-off + `.githooks` lint/vet/attr-namespace-check all green
pre-commit
- [x] Diff = `noop_fallback_test.go` only (no production-code touched)

## Release notes

```release-notes
NONE
```

Test-only PR; no operator-visible change.

Signed-off-by: Tri Lam <tree@lumalabs.ai>

trilamsr commented Jun 2, 2026

Independent Adversarial Review: Commit 4423077 (Post-Block-Fix)

Prior BLOCK Resolution (5 Findings)

1. componenttestHost stub (✓ RESOLVED)

  • Deleted entirely; replaced with componenttest.NewNopHost() in all call sites (3 calls: nccl_fr_integration_test.go L78, L239, L280).
  • Promoted to direct dep in go.mod (line 37).
  • Verification: 0 references to deleted struct remain across test files.

2. logsSink custom reimplementation (✓ RESOLVED)

  • Deleted entirely (~19 lines).
  • All call sites now use upstream consumertest.LogsSink directly.
  • Method migrations: sink.recordCount() → sink.LogRecordCount() (8 call sites); sink.first() → sink.AllLogs()[0] (3 call sites).
  • Promoted to direct dep in go.mod (line 38); also added import to selftel_test.go for consistency.
  • Verification: no newLogsSink() or recordCount() references remain.

3. padInt negative-sort bug (✓ RESOLVED)

  • Now panics on n < 0 with clear message: "padInt: negative input not supported; sort keys must be >= 0" (L416).
  • Documentation explains why: fixtures pin collective_seq_id/p2p_seq_id/op_id to non-negative monotonic values by NCCL semantics (L408-413).
  • Acceptable choice (panic in test ≈ test failure, which is correct behavior; assert.GreaterOrEqual would require testify import, overkill for fixtures with guaranteed non-negative keys).

4. Fixture redundancy (✓ RESOLVED)

  • Deleted nccl-2.29.x-healthy.emitted.golden.json (70 lines).
  • Rationale correctly documented in integrationFixtureSlugs (L204-207): "2.29.x-healthy is intentionally omitted: it tests the same collective-ops codepath as 2.30.x-healthy (differs only in nccl.version + op_id offset), and the operator-attrs golden in TestIntegration_E2E_FactoryToConsumer already covers the 2.29 .pkl."
  • Retained nccl-2.29.x-healthy.operator-attrs.emitted.golden.json (used in TestIntegration_E2E_FactoryToConsumer, L52, L117) — correct: distinct from bare-defaults golden (covers operator-set resource attrs path).
  • integrationFixtureSlugs() now returns only ["nccl-2.30.x-healthy", "nccl-fr-is-p2p"] (L209-212).

5. UPDATE_GOLDEN convention (✓ RESOLVED)

  • Documented in file header (L21-25): "Regenerate goldens after an intentional schema change with UPDATE_GOLDEN=1 go test ./module/receiver/ncclfrreceiver/... (parser .pkl fixtures regenerate separately via make generate-fixtures)."
  • Also documented in assertGolden comment (L434-436): "When UPDATE_GOLDEN=1 is set the file is rewritten instead — that is the only mutating path and it's gated behind an env var so CI never accidentally rewrites goldens."
  • Establishes clear convention + distinguishes from separate parser fixture regen path.
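
The env-var gating described here can be sketched as below. This is an illustration under assumptions, not the PR's assertGolden: the names checkGolden and newTempGoldenPath are hypothetical, and the comparison is a plain byte comparison rather than whatever diffing the real helper performs.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"path/filepath"
)

// checkGolden compares got against the golden file at path. When update is
// true it rewrites the file instead; that is the only mutating path, and the
// caller is expected to derive update from an env var such as UPDATE_GOLDEN=1.
func checkGolden(path string, got []byte, update bool) error {
	if update {
		return os.WriteFile(path, got, 0o644)
	}
	want, err := os.ReadFile(path)
	if err != nil {
		return fmt.Errorf("read golden (set UPDATE_GOLDEN=1 to create it): %w", err)
	}
	if !bytes.Equal(want, got) {
		return fmt.Errorf("golden mismatch for %s:\nwant: %s\ngot:  %s", path, want, got)
	}
	return nil
}

// newTempGoldenPath returns a not-yet-existing file path inside a fresh temp
// directory. Demo helper only.
func newTempGoldenPath() (string, error) {
	dir, err := os.MkdirTemp("", "golden-demo")
	if err != nil {
		return "", err
	}
	return filepath.Join(dir, "example.emitted.golden.json"), nil
}

func main() {
	update := os.Getenv("UPDATE_GOLDEN") == "1"
	path, err := newTempGoldenPath()
	if err != nil {
		fmt.Println("setup:", err)
		return
	}
	defer os.RemoveAll(filepath.Dir(path))
	fmt.Println(checkGolden(path, []byte(`{"records":[]}`), update))
}
```

With the flag unset, a missing or differing golden is reported as an error and nothing on disk changes, which is what keeps CI from rewriting goldens by accident.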

Final Simplification Sweep

Header doc: Trimmed to 5 lines (L21-25), explains issue closure + regen mechanism + distinction from parser fixture path. No multi-paragraph cruft.

Imports: Clean — only upstream deps (componenttest, consumertest) + standard lib (context, encoding/json, errors, os, path/filepath, reflect, sort, testing, time).

No custom abstractions: All stubs removed; upstream-only pattern throughout.

Test performance: ✓

  • TestIntegration suite: 1.240s (integration only: -run TestIntegration)
  • Full package under -race: 3.235s
  • All within budget.

Code quality: ✓

  • go vet: clean
  • gofmt: clean
  • padInt logic: correct (negative inputs panic; non-negative inputs are padded correctly)
  • All waitForRecords, recordCount, first() references updated

Merge conflict risk (PR #435 + #440): ✓ CLEAR


New Findings (Simplification Sweep)

NONE. The commit is clean. All 5 prior findings resolved; no new code noise introduced; net -124 lines (67 additions / 191 deletions).


VERDICT: A

Justification:

  • All 5 reviewer findings applied correctly.
  • Upstream abstractions now used throughout (componenttest, consumertest).
  • Correct simplification rationale documented (integrationFixtureSlugs comment, file header, assertGolden comment).
  • Test structure solid; golden pattern follows precedent (UPDATE_GOLDEN convention).
  • No regressions; full test suite green under -race.
  • Closes #330, "[rc1-prep] test-gap: add ncclfrreceiver end-to-end integration test" (factory→consumer e2e on committed fixtures); satisfies acceptance criteria.

RECOMMENDATION: MERGE

@trilamsr trilamsr enabled auto-merge (squash) June 2, 2026 01:41
@trilamsr trilamsr merged commit 4e0d800 into main Jun 2, 2026
12 checks passed
@trilamsr trilamsr deleted the test/330-ncclfrreceiver-e2e branch June 2, 2026 01:41
Development

Successfully merging this pull request may close these issues.

[rc1-prep] test-gap: add ncclfrreceiver end-to-end integration test