refactor: extract shared selftel + test helpers#312
Merged
Conversation
added 2 commits
May 31, 2026 23:37
Collapse four hand-rolled self-telemetry packages into a single shared
module/pkg/selftel + module/pkg/testutil/selftel. Per-component
selftel.go shrinks to a thin wrapper that pins scope name + instrument
names + the package-local kind enum; shared plumbing (counters, the
degraded-state atomic machinery, noop fallbacks, init_errors_total)
lives in one place.
Operator-facing contracts unchanged: same scope names, same instrument
names, same label shapes. Existing per-package shape tests pass before
and after; a new module/pkg/selftel test pins the shared contract.
LOC delta:
- components/exporters/otlphttp/selftel.go 172 -> 117
- components/exporters/otlphttp/selftel_test.go 417 -> 308
- components/exporters/stdoutexporter/selftel.go 168 -> 110
- components/exporters/stdoutexporter/selftel_t 404 -> 295
- components/receivers/pyspy/selftel.go 249 -> 117
- components/receivers/pyspy/selftel_test.go 389 -> 249
- module/receiver/ncclfrreceiver/selftel.go 269 -> 130
- module/receiver/ncclfrreceiver/selftel_test.go 388 -> 248
- module/pkg/selftel/{selftel,exporter,receiver}.go +397 (new)
- module/pkg/selftel/selftel_test.go +267 (new)
- module/pkg/testutil/selftel/selftel.go +187 (new)
Net: ~2456 -> ~2425 LOC, but ~860 LOC of verbatim duplication is gone.
Signed-off-by: Tri Lam <tri@maydow.com>
Self-review finding: shared NoopExporter() / NoopReceiver() were unused by every caller. Each per-component selftel.go owns a package-local noop (typed to its package-local selfExporter / selfTelemetry interface that carries the package-local kind newtype) — the shared selftel.Exporter-typed / selftel.Receiver-typed noop wouldn't fit through that interface, so production code can't use them. They were dead exported surface kept alive only by the shared package's own noop-panic-safety tests. Drop both Noop* exports + their tests; per-component TestX_NoopAlwaysSafe covers the same contract at the actual point of use. Add a comment at each removal site documenting why the export is intentionally absent so a future contributor doesn't add it back under the same aspirational reasoning. Signed-off-by: Tri Lam <tri@maydow.com>
trilamsr
pushed a commit
that referenced
this pull request
Jun 1, 2026
Resolve 5 conflicts post-PR #310 / #312 / #313: - factory.go deleted on main (merged into patterndetector.go); port wave's selftel wiring (#261) into the merged createLogs - VerdictAttr* unexported per #310; rename 16 wave-added consts + all callers across cuda_oom + ib_link_flap + pcie_aer tests - docs/{MILESTONES,FOLLOWUPS,patterns/README}.md path + content reconcile after MILESTONES.md moved to docs/ Address reviewer findings before PR: - docs/THREAT-MODEL.md case-mismatch -> docs/threat-model.md (Linux CI is case-sensitive) - pattern.id schema drift: 8 specs said `ib_link_flap`/`cuda_oom`, code emits "2"/"10"/.../"13"; rewrite spec attribute tables to match shipped customer-stable namespace - pattern.confidence: 8 specs said `high|partial`, code emits `full|partial`; rewrite - 02-ib-link-flap.md attribute drift: spec said tracecore.alert.ib_link_flap.{hca_device,port}, code emits hw.network.ib.{device,port.num}; align spec to shipped code - v1-rc1-cut-criteria criterion #1 status stale-on-arrival ("6 patterns shipped" -> "8 patterns shipped, 4 remaining") - NetPol UX trap: NOTES.txt warning when networkPolicy.enabled=true with empty allowedEgressEndpoints (silently kills OTLP exporter) + warning when ServiceMonitor scraper in different namespace - File #337 for missing OTTL recipe projecting DCGM FB_USED/FREE -> hw.gpu.memory.{free,total} log shape (CUDA OOM detector consumes but recipe gap means it ships dark) Tests: ./module/processor/patterndetectorprocessor/... + ./module/pkg/patterns/... both ok. Signed-off-by: Tri Lam <tri@maydow.com>
9 tasks
trilamsr
added a commit
that referenced
this pull request
Jun 1, 2026
…ts (#338) ## Summary 15-agent parallel wave bridging v1.0-rc1 knowledge gaps + closing horizon backlog. 31 commits, 81 files, +8650/-180. **Code (5 detectors / features):** - `feat(iblinkflap)` pattern #2 IB link flap detector — 13 tests, cross-rank helper extracted for reuse by patterns #7/#9 - `feat(cudaoom)` pattern #10 CUDA OOM detector + fragmentation-vs-true-OOM discriminator — 35 tests, 0/6 false-positive rate on fixture corpus (#303 wiring — recipe gap tracked at #337) - `feat(verdict)` deprecate EvictedPod, co-emit PodName + PodNamespace (#277) with regression-pinning test - `feat(chart)` opt-in default-deny NetworkPolicy + cert-manager mTLS reference (#301); ServiceMonitor + scrape annotations (#296); NOTES.txt UX warnings for empty-egress / cross-ns scraper traps - `feat(bench)` per-detector allocs/event harness + soft ratchet gate, graduation criterion documented (#302) - `feat(patterndetector)` verdict counter metric for dashboard panels (#261) - `fix(slo-rules)` correct otelcol_* label set + drop silent-no-op `unless on (instance)` join (#298) **8 pattern design specs (`docs/patterns/{02,07-13}-*.md`):** - Per pattern: symptom, layers crossed, signal sources, detector evaluation rule, verdict attrs, edge cases, open questions. - 7 load-bearing spec gaps flagged for future TDD red-test work (multi-vendor SDC signal, cohort grouping, processor metrics path, etc). **9 v1.0-rc1 audit / knowledge-gap docs:** - `docs/v1-rc1-cut-criteria.md` — 12 falsifiable cut gates derived from O1-O7 - `docs/v1-rc1-operational-gaps.md` — SLSA L3 + air-gap + upgrade-rollback audit (8 issues filed #314-#321) - `docs/v1-rc1-governance-gaps.md` — CODEOWNERS 0%, lint-principles 4/16, retros, `make ci` 148s (5 issues #322-#325, #327) - `docs/v1-rc1-test-audit.md` — 82.9% coverage, fuzz harness inventory (5 issues #328-#332) - `docs/v1-rc1-simplification-audit.md` — top deletion candidates ~9.6K LOC (3 issues #333-#335) - `docs/threat-model.md` — STRIDE per trust boundary + audit RFP scope (#336) - `docs/reference-environments.md` — Tier 1 kind + Tier 2 32×H100 binding spec for O2 hero KPI - `docs/adoption-pipeline.md` — S0-S3 funnel + comms templates for O5 hero KPI - `docs/standards-roadmap.md` — 10 `gen_ai.training.*` attributes proposed upstream (#326) **Doc-drift cleanup:** 11 issues closed (#265, #268, #269, #276, #283, #287, #292-295, #299). **OTTL recipe wiring:** 6 issues closed (#260, #261, #273, #282, #284, #285); #272 deferred to standards-roadmap. **Multi-cluster auth:** bearer-token + mTLS examples (#297). **Merge resolution + reviewer fixes:** - Resolved 5 conflicts post-PR #310/#312/#313 (factory.go delete, VerdictAttr* unexport, MILESTONES.md → docs/, FOLLOWUPS, patterns README) - Adversarial reviewer found 1 BLOCKER + 6 MAJOR; all addressed before push: - Renamed 16 `VerdictAttr*` → `verdictAttr*` per #310 convention - Re-ported selftel wiring (#261) into main's merged `createLogs` - Fixed case-mismatch `docs/THREAT-MODEL.md` → `docs/threat-model.md` (Linux CI is case-sensitive) - 8 pattern specs schema drift: `pattern.id` slug → numeric (`"2"`, `"7"`...`"13"`), `pattern.confidence` `high` → `full` - `02-ib-link-flap.md` attribute drift: spec said `tracecore.alert.ib_link_flap.{hca_device,port}`, code emits `hw.network.ib.{device,port.num}` - `v1-rc1-cut-criteria` criterion #1 status stale-on-arrival ("6 patterns shipped" → "8 patterns shipped, 4 remaining") - NetPol UX trap: NOTES.txt warns when `enabled=true` with empty `allowedEgressEndpoints` (silently kills OTLP) or cross-ns Prometheus - Filed #337 for missing OTTL recipe projecting `DCGM_FI_DEV_FB_*` → `hw.gpu.memory.{free,total}` (CUDA OOM detector consumes but recipe gap) - Post-merge stale-relative-path sweep: 6 wave docs + NORTHSTARS.md + MILESTONES.md (`docs/`, `../`, `docs/docs/` drift after MILESTONES + NORTHSTARS moved to docs/) - Documented 5 newly-emitted attributes in ATTRIBUTES.md (drop_ratio + IB tier — `attribute-namespace-check` now 67/67) ## Test plan - [x] `go test ./module/processor/patterndetectorprocessor/... ./module/pkg/patterns/...` — ok - [x] `make lint` (golangci-lint via goreleaser-style gate) — 0 issues - [x] `go vet ./...` — clean - [x] `make doc-check` — passes after stale-link sweep - [x] `scripts/attribute-namespace-check.sh` — 67/67 documented - [x] `helm lint install/kubernetes/tracecore` — 0 chart(s) failed - [x] `promtool check rules` on slo-rules.yaml — 13 rules / SUCCESS - [ ] CI compat-matrix (rc1 criterion #6) — gated on next wave - [ ] manual smoke install on real cluster — owner clearance pending ```release-notes Lands two new pattern detectors (#2 IB link flap, #10 CUDA OOM fragmentation-vs-true discriminator), 8 pattern design specs for the remaining v1.0 root-cause patterns, opt-in default-deny NetworkPolicy + Prometheus Operator ServiceMonitor on the Helm chart, the EvictedPod → PodName/PodNamespace verdict-attribute deprecation co-emit, per-detector allocs/event bench harness, SLO-rules label fix, and the v1.0-rc1 knowledge-gap audit set (cut criteria, ops gaps, governance gaps, test audit, simplification audit, threat model, reference envs, adoption pipeline, standards roadmap). ``` --------- Signed-off-by: Tri Lam <tri@maydow.com> Co-authored-by: Tri Lam <tri@maydow.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Collapse four hand-rolled self-telemetry packages into one shared
module/pkg/selftel+ amodule/pkg/testutil/selfteltest helper. Each per-componentselftel.goshrinks to a thin wrapper that pins scope name + instrument names + the package-localkindenum; the OTel counter / histogram / observable wiring, the degraded-state atomic machinery, and theinit_errors_totalfallback counter all live in one place.Root cause
Four sibling ports landed independently (RFC-0013 PR-B1/PR-B2/PR-F.1) and each copy chose to re-implement the v0.1.x
internal/selftelemetryshape locally rather than share. The result was ~860 LOC of verbatim duplication acrosscomponents/exporters/{otlphttp,stdoutexporter}/selftel.go+components/receivers/pyspy/selftel.go+module/receiver/ncclfrreceiver/selftel.go, plus ~1600 LOC of test boilerplate (newTestMeterProvider,collectRM,findInstrument,scopeOf, attr-equality, and four near-identicalfailing*MPwrappers). This PR removes the duplication.What stays per-component
kind stringnewtype + named consts (wire-format strings are the operator-facing contract).instrumentationScopeconstant (each component's Go import path — OTel scope-name standard).otelcol.receiver.ncclfr.errors_total).selfExporter/selfTelemetryinterface that carries the package-localkindtype at the seam (zero-cost cast tostring).noopSelfExporter{}/noopSelfTelemetry{}that satisfies that local interface (a shared noop typed toselftel.Exporter/selftel.Receiverwouldn't fit through the kind-carrying interface).dumpNames/testID/testSettings/factory-shape tests stay where the factory lives.What's now shared
module/pkg/selftel.NewExporter(componentID, scope, callsTotalName, mp)+Exporterinterface.module/pkg/selftel.NewReceiver(componentID, scope, names, mp)+Receiverinterface +ReceiverInstrumentNamesstruct (named-field, not positional).module/pkg/selftel.RecordInitError(ctx, mp, scope, kind, componentID, reason)+ErrNilMeterProvidersentinel +ReasonInstrumentRegisterconst.module/pkg/testutil/selftel:NewTestMeterProvider,CollectRM,FindInstrument,ScopeOf,KVMatch,DumpNames,NewFailingMeterProvider(real, prefix).Behavior contracts preserved
component_id,result,kind, etc.) — every pre-existing shape test passes unchanged.errors.Is(err, errNilMeterProvider)still holds in factories (sentinel aliased).recordInitErroris still package-local; thin wrapper forwards to shared with the package's scope.degradedAt+accumulatednanoseconds +activityUnixgauge) is identical — moved verbatim into one place.LOC delta
Net ~2456 -> ~2387 LOC, but ~860 LOC of verbatim duplication is gone — future changes to the receiver state machine / init-error counter / failing-meter wrapper touch one file instead of four.
Self-review pass (second commit)
Reviewing my own first commit caught dead exports:
selftel.NoopExporter()andselftel.NoopReceiver()were unused by every caller (each per-component package owns its own noop typed to its package-local kind-carrying interface). Dropped both + their tests; added a comment at each removal site documenting why the export is intentionally absent so a future contributor doesn't re-add it under the same aspirational reasoning.Test plan
make check(fmt + tidy + golangci-lint + vet + mod-verify) — clean.go test ./components/... -count=1— otlphttp / stdoutexporter / pyspy all green.cd module && go test ./... -count=1— including newmodule/pkg/selftelshared tests + ncclfrreceiver + every other module package green.Internal refactor — no operator-visible change. Same metric names, scope names, label shapes, and noop-fallback semantics.