Skip to content

refactor: extract shared selftel + test helpers#312

Merged
trilamsr merged 2 commits into
mainfrom
refactor/dedup-selftel
Jun 1, 2026
Merged

refactor: extract shared selftel + test helpers#312
trilamsr merged 2 commits into
mainfrom
refactor/dedup-selftel

Conversation

@trilamsr

@trilamsr trilamsr commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

Summary

Collapse four hand-rolled self-telemetry packages into one shared module/pkg/selftel + a module/pkg/testutil/selftel test helper. Each per-component selftel.go shrinks to a thin wrapper that pins scope name + instrument names + the package-local kind enum; the OTel counter / histogram / observable wiring, the degraded-state atomic machinery, and the init_errors_total fallback counter all live in one place.

Root cause

Four sibling ports landed independently (RFC-0013 PR-B1/PR-B2/PR-F.1) and each copy chose to re-implement the v0.1.x internal/selftelemetry shape locally rather than share. The result was ~860 LOC of verbatim duplication across components/exporters/{otlphttp,stdoutexporter}/selftel.go + components/receivers/pyspy/selftel.go + module/receiver/ncclfrreceiver/selftel.go, plus ~1600 LOC of test boilerplate (newTestMeterProvider, collectRM, findInstrument, scopeOf, attr-equality, and four near-identical failing*MP wrappers). This PR removes the duplication.

What stays per-component

  • The kind string newtype + named consts (wire-format strings are the operator-facing contract).
  • The instrumentationScope constant (each component's Go import path — OTel scope-name standard).
  • The instrument name choices (e.g. otelcol.receiver.ncclfr.errors_total).
  • A thin selfExporter / selfTelemetry interface that carries the package-local kind type at the seam (zero-cost cast to string).
  • A package-local noopSelfExporter{} / noopSelfTelemetry{} that satisfies that local interface (a shared noop typed to selftel.Exporter / selftel.Receiver wouldn't fit through the kind-carrying interface).
  • Test-side dumpNames/testID/testSettings/factory-shape tests stay where the factory lives.

What's now shared

  • module/pkg/selftel.NewExporter(componentID, scope, callsTotalName, mp) + Exporter interface.
  • module/pkg/selftel.NewReceiver(componentID, scope, names, mp) + Receiver interface + ReceiverInstrumentNames struct (named-field, not positional).
  • module/pkg/selftel.RecordInitError(ctx, mp, scope, kind, componentID, reason) + ErrNilMeterProvider sentinel + ReasonInstrumentRegister const.
  • module/pkg/testutil/selftel: NewTestMeterProvider, CollectRM, FindInstrument, ScopeOf, KVMatch, DumpNames, NewFailingMeterProvider(real, prefix).

Behavior contracts preserved

  • Same metric names, same scope names, same label shapes (component_id, result, kind, etc.) — every pre-existing shape test passes unchanged.
  • errors.Is(err, errNilMeterProvider) still holds in factories (sentinel aliased).
  • recordInitError is still package-local; thin wrapper forwards to shared with the package's scope.
  • The receiver state machine (lock-free CAS on degradedAt + accumulated nanoseconds + activityUnix gauge) is identical — moved verbatim into one place.
  • Latency histogram bucket boundaries unchanged.

LOC delta

components/exporters/otlphttp/selftel.go            172 -> 117
components/exporters/otlphttp/selftel_test.go       417 -> 308
components/exporters/stdoutexporter/selftel.go      168 -> 110
components/exporters/stdoutexporter/selftel_test.go 404 -> 295
components/receivers/pyspy/selftel.go               249 -> 117
components/receivers/pyspy/selftel_test.go          389 -> 249
module/receiver/ncclfrreceiver/selftel.go           269 -> 130
module/receiver/ncclfrreceiver/selftel_test.go      388 -> 248
module/pkg/selftel/{selftel,exporter,receiver}.go   +384  (new)
module/pkg/selftel/selftel_test.go                  +244  (new)
module/pkg/testutil/selftel/selftel.go              +187  (new)

Net ~2456 -> ~2387 LOC, but ~860 LOC of verbatim duplication is gone — future changes to the receiver state machine / init-error counter / failing-meter wrapper touch one file instead of four.

Self-review pass (second commit)

Reviewing my own first commit caught dead exports: selftel.NoopExporter() and selftel.NoopReceiver() were unused by every caller (each per-component package owns its own noop typed to its package-local kind-carrying interface). Dropped both + their tests; added a comment at each removal site documenting why the export is intentionally absent so a future contributor doesn't re-add it under the same aspirational reasoning.

Test plan

  • make check (fmt + tidy + golangci-lint + vet + mod-verify) — clean.
  • go test ./components/... -count=1 — otlphttp / stdoutexporter / pyspy all green.
  • cd module && go test ./... -count=1 — including new module/pkg/selftel shared tests + ncclfrreceiver + every other module package green.
  • Per-package shape assertions (scope name, instrument name, attribute keys, datapoint shapes) pass — same tests that locked the v0.1.x contract still lock it.
NONE

Internal refactor — no operator-visible change. Same metric names, scope names, label shapes, and noop-fallback semantics.

Tri Lam added 2 commits May 31, 2026 23:37
Collapse four hand-rolled self-telemetry packages into a single shared
module/pkg/selftel + module/pkg/testutil/selftel. Per-component
selftel.go shrinks to a thin wrapper that pins scope name + instrument
names + the package-local kind enum; shared plumbing (counters, the
degraded-state atomic machinery, noop fallbacks, init_errors_total)
lives in one place.

Operator-facing contracts unchanged: same scope names, same instrument
names, same label shapes. Existing per-package shape tests pass before
and after; a new module/pkg/selftel test pins the shared contract.

LOC delta:
- components/exporters/otlphttp/selftel.go        172 -> 117
- components/exporters/otlphttp/selftel_test.go   417 -> 308
- components/exporters/stdoutexporter/selftel.go  168 -> 110
- components/exporters/stdoutexporter/selftel_t   404 -> 295
- components/receivers/pyspy/selftel.go           249 -> 117
- components/receivers/pyspy/selftel_test.go      389 -> 249
- module/receiver/ncclfrreceiver/selftel.go       269 -> 130
- module/receiver/ncclfrreceiver/selftel_test.go  388 -> 248
- module/pkg/selftel/{selftel,exporter,receiver}.go  +397 (new)
- module/pkg/selftel/selftel_test.go              +267 (new)
- module/pkg/testutil/selftel/selftel.go          +187 (new)

Net: ~2456 -> ~2425 LOC, but ~860 LOC of verbatim duplication is gone.

Signed-off-by: Tri Lam <tri@maydow.com>
Self-review finding: shared NoopExporter() / NoopReceiver() were
unused by every caller. Each per-component selftel.go owns a
package-local noop (typed to its package-local selfExporter /
selfTelemetry interface that carries the package-local kind newtype) —
the shared selftel.Exporter-typed / selftel.Receiver-typed noop
wouldn't fit through that interface, so production code can't use
them. They were dead exported surface kept alive only by the shared
package's own noop-panic-safety tests.

Drop both Noop* exports + their tests; per-component
TestX_NoopAlwaysSafe covers the same contract at the actual point of
use. Add a comment at each removal site documenting why the export is
intentionally absent so a future contributor doesn't add it back
under the same aspirational reasoning.

Signed-off-by: Tri Lam <tri@maydow.com>
@trilamsr trilamsr enabled auto-merge (squash) June 1, 2026 06:42
@trilamsr trilamsr merged commit b65f04c into main Jun 1, 2026
14 of 15 checks passed
@trilamsr trilamsr deleted the refactor/dedup-selftel branch June 1, 2026 06:54
trilamsr pushed a commit that referenced this pull request Jun 1, 2026
Resolve 5 conflicts post-PR #310 / #312 / #313:
- factory.go deleted on main (merged into patterndetector.go);
  port wave's selftel wiring (#261) into the merged createLogs
- VerdictAttr* unexported per #310; rename 16 wave-added consts
  + all callers across cuda_oom + ib_link_flap + pcie_aer tests
- docs/{MILESTONES,FOLLOWUPS,patterns/README}.md path + content
  reconcile after MILESTONES.md moved to docs/

Address reviewer findings before PR:
- docs/THREAT-MODEL.md case-mismatch -> docs/threat-model.md
  (Linux CI is case-sensitive)
- pattern.id schema drift: 8 specs said `ib_link_flap`/`cuda_oom`,
  code emits "2"/"10"/.../"13"; rewrite spec attribute tables to
  match shipped customer-stable namespace
- pattern.confidence: 8 specs said `high|partial`, code emits
  `full|partial`; rewrite
- 02-ib-link-flap.md attribute drift: spec said
  tracecore.alert.ib_link_flap.{hca_device,port}, code emits
  hw.network.ib.{device,port.num}; align spec to shipped code
- v1-rc1-cut-criteria criterion #1 status stale-on-arrival
  ("6 patterns shipped" -> "8 patterns shipped, 4 remaining")
- NetPol UX trap: NOTES.txt warning when networkPolicy.enabled=true
  with empty allowedEgressEndpoints (silently kills OTLP exporter)
  + warning when ServiceMonitor scraper in different namespace
- File #337 for missing OTTL recipe projecting DCGM FB_USED/FREE
  -> hw.gpu.memory.{free,total} log shape (CUDA OOM detector
  consumes but recipe gap means it ships dark)

Tests: ./module/processor/patterndetectorprocessor/... +
./module/pkg/patterns/... both ok.

Signed-off-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request Jun 1, 2026
…ts (#338)

## Summary

15-agent parallel wave bridging v1.0-rc1 knowledge gaps + closing
horizon backlog. 31 commits, 81 files, +8650/-180.

**Code (5 detectors / features):**
- `feat(iblinkflap)` pattern #2 IB link flap detector — 13 tests,
cross-rank helper extracted for reuse by patterns #7/#9
- `feat(cudaoom)` pattern #10 CUDA OOM detector +
fragmentation-vs-true-OOM discriminator — 35 tests, 0/6 false-positive
rate on fixture corpus (#303 wiring — recipe gap tracked at #337)
- `feat(verdict)` deprecate EvictedPod, co-emit PodName + PodNamespace
(#277) with regression-pinning test
- `feat(chart)` opt-in default-deny NetworkPolicy + cert-manager mTLS
reference (#301); ServiceMonitor + scrape annotations (#296); NOTES.txt
UX warnings for empty-egress / cross-ns scraper traps
- `feat(bench)` per-detector allocs/event harness + soft ratchet gate,
graduation criterion documented (#302)
- `feat(patterndetector)` verdict counter metric for dashboard panels
(#261)
- `fix(slo-rules)` correct otelcol_* label set + drop silent-no-op
`unless on (instance)` join (#298)

**8 pattern design specs (`docs/patterns/{02,07-13}-*.md`):**
- Per pattern: symptom, layers crossed, signal sources, detector
evaluation rule, verdict attrs, edge cases, open questions.
- 7 load-bearing spec gaps flagged for future TDD red-test work
(multi-vendor SDC signal, cohort grouping, processor metrics path, etc).

**9 v1.0-rc1 audit / knowledge-gap docs:**
- `docs/v1-rc1-cut-criteria.md` — 12 falsifiable cut gates derived from
O1-O7
- `docs/v1-rc1-operational-gaps.md` — SLSA L3 + air-gap +
upgrade-rollback audit (8 issues filed #314-#321)
- `docs/v1-rc1-governance-gaps.md` — CODEOWNERS 0%, lint-principles
4/16, retros, `make ci` 148s (5 issues #322-#325, #327)
- `docs/v1-rc1-test-audit.md` — 82.9% coverage, fuzz harness inventory
(5 issues #328-#332)
- `docs/v1-rc1-simplification-audit.md` — top deletion candidates ~9.6K
LOC (3 issues #333-#335)
- `docs/threat-model.md` — STRIDE per trust boundary + audit RFP scope
(#336)
- `docs/reference-environments.md` — Tier 1 kind + Tier 2 32×H100
binding spec for O2 hero KPI
- `docs/adoption-pipeline.md` — S0-S3 funnel + comms templates for O5
hero KPI
- `docs/standards-roadmap.md` — 10 `gen_ai.training.*` attributes
proposed upstream (#326)

**Doc-drift cleanup:** 11 issues closed (#265, #268, #269, #276, #283,
#287, #292-295, #299).

**OTTL recipe wiring:** 6 issues closed (#260, #261, #273, #282, #284,
#285); #272 deferred to standards-roadmap.

**Multi-cluster auth:** bearer-token + mTLS examples (#297).

**Merge resolution + reviewer fixes:**
- Resolved 5 conflicts post-PR #310/#312/#313 (factory.go delete,
VerdictAttr* unexport, MILESTONES.md → docs/, FOLLOWUPS, patterns
README)
- Adversarial reviewer found 1 BLOCKER + 6 MAJOR; all addressed before
push:
  - Renamed 16 `VerdictAttr*` → `verdictAttr*` per #310 convention
  - Re-ported selftel wiring (#261) into main's merged `createLogs`
- Fixed case-mismatch `docs/THREAT-MODEL.md` → `docs/threat-model.md`
(Linux CI is case-sensitive)
- 8 pattern specs schema drift: `pattern.id` slug → numeric (`"2"`,
`"7"`...`"13"`), `pattern.confidence` `high` → `full`
- `02-ib-link-flap.md` attribute drift: spec said
`tracecore.alert.ib_link_flap.{hca_device,port}`, code emits
`hw.network.ib.{device,port.num}`
- `v1-rc1-cut-criteria` criterion #1 status stale-on-arrival ("6
patterns shipped" → "8 patterns shipped, 4 remaining")
- NetPol UX trap: NOTES.txt warns when `enabled=true` with empty
`allowedEgressEndpoints` (silently kills OTLP) or cross-ns Prometheus
- Filed #337 for missing OTTL recipe projecting `DCGM_FI_DEV_FB_*` →
`hw.gpu.memory.{free,total}` (CUDA OOM detector consumes but recipe gap)
- Post-merge stale-relative-path sweep: 6 wave docs + NORTHSTARS.md +
MILESTONES.md (`docs/`, `../`, `docs/docs/` drift after MILESTONES +
NORTHSTARS moved to docs/)
- Documented 5 newly-emitted attributes in ATTRIBUTES.md (drop_ratio +
IB tier — `attribute-namespace-check` now 67/67)

## Test plan

- [x] `go test ./module/processor/patterndetectorprocessor/...
./module/pkg/patterns/...` — ok
- [x] `make lint` (golangci-lint via goreleaser-style gate) — 0 issues
- [x] `go vet ./...` — clean
- [x] `make doc-check` — passes after stale-link sweep
- [x] `scripts/attribute-namespace-check.sh` — 67/67 documented
- [x] `helm lint install/kubernetes/tracecore` — 0 chart(s) failed
- [x] `promtool check rules` on slo-rules.yaml — 13 rules / SUCCESS
- [ ] CI compat-matrix (rc1 criterion #6) — gated on next wave
- [ ] manual smoke install on real cluster — owner clearance pending

```release-notes
Lands two new pattern detectors (#2 IB link flap, #10 CUDA OOM
fragmentation-vs-true discriminator), 8 pattern design specs for the
remaining v1.0 root-cause patterns, opt-in default-deny NetworkPolicy
+ Prometheus Operator ServiceMonitor on the Helm chart, the
EvictedPod → PodName/PodNamespace verdict-attribute deprecation
co-emit, per-detector allocs/event bench harness, SLO-rules label
fix, and the v1.0-rc1 knowledge-gap audit set (cut criteria, ops gaps,
governance gaps, test audit, simplification audit, threat model,
reference envs, adoption pipeline, standards roadmap).
```

---------

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant