diff --git a/docs/patterns/10-cuda-oom-deceptive.md b/docs/patterns/10-cuda-oom-deceptive.md
index 4a88ab88..2f5b215e 100644
--- a/docs/patterns/10-cuda-oom-deceptive.md
+++ b/docs/patterns/10-cuda-oom-deceptive.md
@@ -71,7 +71,7 @@ Per issue #303 scalar-promotion checklist:
 
 ## Open questions
 
-1. ~~**`DCGM_FI_DEV_FB_*` OTTL recipe extension.**~~ Resolved by issue [#337](https://github.com/tracecoreai/tracecore/issues/337): metric-side projection (`DCGM_FI_DEV_FB_USED` → `hw.gpu.memory.used`, `DCGM_FI_DEV_FB_FREE` → `hw.gpu.memory.free`) ships in [docs/integrations/prometheus-scrape.md §Pattern #10](../integrations/prometheus-scrape.md#pattern-10--cuda-oom-framebuffer). The `hw.gpu.memory.total = used + free` derivation + log-record emission belongs to the RFC-0014 PR-B `WithMetrics` bridge; the log-shape spec the bridge MUST honor is pinned in the recipe's [§Pattern #10 — `hw.gpu.memory.{free,total}`](../integrations/prometheus-scrape.md#pattern-10--hwgpumemoryfreetotal-issue-337) section.
+1. **`DCGM_FI_DEV_FB_*` OTTL recipe extension.** Resolved by issue [#337](https://github.com/tracecoreai/tracecore/issues/337): metric-side projection (`DCGM_FI_DEV_FB_USED` → `hw.gpu.memory.used`, `DCGM_FI_DEV_FB_FREE` → `hw.gpu.memory.free`) ships in [docs/integrations/prometheus-scrape.md §Pattern #10](../integrations/prometheus-scrape.md#pattern-10--cuda-oom-framebuffer). The `hw.gpu.memory.total = used + free` derivation + log-record emission belongs to the RFC-0014 PR-B `WithMetrics` bridge; the log-shape spec the bridge MUST honor is pinned in the recipe's [§Pattern #10 — `hw.gpu.memory.{free,total}`](../integrations/prometheus-scrape.md#pattern-10--hwgpumemoryfreetotal-issue-337) section.
 2. **filelogreceiver OTTL stanza for the OOM regex.** Sibling to #285. Multi-line `RuntimeError` traceback handling: OTTL recipe stops at the first stanza match or continues into the traceback?
 3. **Metrics-path on patterndetectorprocessor.** Per ADR-0001 PR-B — the processor today consumes logs only. CUDA-OOM joins a log to a metric. Either the metric is projected to a log via the metrics→logs OTTL bridge (RFC-0014 PR-B, also blocking pattern #3 today), or the processor grows a metrics input.
 4. **`cuda_oom.kind` enum namespace.** Should this be `pattern.cuda_oom.kind` or top-level `cuda_oom.kind`? ATTRIBUTES.md prefers `pattern.*` for tracecore-internal verdict scalars, but issue #303 used `cuda_oom.kind` directly. Reconcile.
diff --git a/docs/v1-rc1-simplification-audit.md b/docs/v1-rc1-simplification-audit.md
index fdc2eeda..86661408 100644
--- a/docs/v1-rc1-simplification-audit.md
+++ b/docs/v1-rc1-simplification-audit.md
@@ -182,8 +182,8 @@ commitment), 5 = load-bearing.
 
 | Rank | Candidate | LOC saved | Risk | Reason |
 |---:|---|---:|---:|---|
-| 1 | ~~`components/exporters/otlphttp/` (wrapper)~~ | ~~2,904 (1,168 non-test)~~ | — | **Done** — deleted in #333. Upstream `otlphttpexporter` v0.130.0 in builder-config.yaml now owns this surface. |
-| 2 | ~~`components/exporters/stdoutexporter/` (wrapper)~~ | ~~1,077 (392 non-test)~~ | — | **Done** — deleted in #334. Upstream `debugexporter` v0.130.0 in builder-config.yaml now owns this surface. |
+| 1 | `components/exporters/otlphttp/` (wrapper) | 2,904 (1,168 non-test) | — | **Done** — deleted in #333. Upstream `otlphttpexporter` v0.130.0 in builder-config.yaml now owns this surface. |
+| 2 | `components/exporters/stdoutexporter/` (wrapper) | 1,077 (392 non-test) | — | **Done** — deleted in #334. Upstream `debugexporter` v0.130.0 in builder-config.yaml now owns this surface. |
 | 3 | `components/receivers/pyspy/` + `python/tracecore_pyspy/` + `tools/pyspy-lint/` | 5,617 total | **4** | Not in OCB build (zero Go importers from `_build/components.go`). BUT — RFC-0013's v0.3.0 pyspy-delete row was **deferred to v0.4.0+ per [#222](https://github.com/TraceCoreAI/tracecore/issues/222)**. `docs/migration/v0.2-to-v0.3.md` explicitly states "pyspy ships as-is in v0.3.0." Delete is blocked on #222, not on this audit. |
 | 4 | `docs/followups/M*.md` shards where every item is `[STRIKE]` or `landed` | ~3,552 total / shard-by-shard ~50-300 | **2** | Cosmetic; consolidate landed deferrals into RFC-0013 audit-trail footer. Per `feedback_no_bloat`: don't track when fix-now is in scope, but these are post-merge artifacts so they're not blocking. |
 | 5 | `docs/research/m15-container-stdout.md` + `m16-kueue.md` + `m16-kueue-production-followups.md` (already `[STRIKE]`-banner'd) | ~2,500 | **2** | All carry RFC-0013-supersession banners; bodies are retained as decision history. Could collapse into RFC-0013 §audit-trail. Same cosmetic-cleanup class as row 4. |
diff --git a/module/doc.go b/module/doc.go
index d6d620f8..6444e01c 100644
--- a/module/doc.go
+++ b/module/doc.go
@@ -2,13 +2,5 @@
 // SPDX-License-Identifier: Apache-2.0
 
 // Package module hosts the in-repo Go submodule for receivers/processors/exporters
-// moved out of the root module per RFC-0013. Contents land in PR-I.1b (nccl_fr move)
-// and PR-I.2 (rankjoinprocessor + patterndetectorprocessor).
-//
-// This file exists so that `module/v0.0.1` (the genesis tag cut after PR-I.1a
-// merges) resolves through the Go module proxy — proxies typically require at
-// least one .go file in the module root to validate the module and surface it
-// in `go list`. Without it, `GOPROXY=https://proxy.golang.org go list -m
-// github.com/tracecoreai/tracecore/module@v0.0.1` may fail with a 404 or
-// "no Go source files" error, breaking the PR-I.1b release-tag-cut workflow.
+// moved out of the root module per RFC-0013.
 package module
diff --git a/module/processor/patterndetectorprocessor/config.go b/module/processor/patterndetectorprocessor/config.go
index 2524aa5a..943fc989 100644
--- a/module/processor/patterndetectorprocessor/config.go
+++ b/module/processor/patterndetectorprocessor/config.go
@@ -1,5 +1,20 @@
 // SPDX-License-Identifier: Apache-2.0
 
+// Knob-naming convention (post-wave audit #379):
+//
+// New pattern knobs MUST use the `<Pattern><Knob>` Go identifier shape,
+// which maps to `<pattern>_<knob>` in YAML — e.g. `CheckpointerHangBackwardWindow`
+// → `checkpointer_hang_backward_window`, `NCCLBootstrapDeadline` →
+// `nccl_bootstrap_deadline`. This avoids cross-pattern collisions on the
+// generic `*_window` / `*_threshold` axes once N detectors coexist.
+//
+// Bare-name knobs (`JoinWindow`, `NCCLHangThreshold`, `XidCorrelationWindow`,
+// `HBMECCWindow`, `HBMECCDeltaThreshold`, `ThermalThrottleWindow`,
+// `PCIeAERWindow`, `IBLinkFlapWindow`, `CUDAOOMCorrelationWindow`,
+// `EmitPartialVerdicts`) predate the convention and are retained for
+// backward-compat with pre-v0.4 `values.yaml` files. They will be nested
+// into `<pattern>: { <knob>: ... }` blocks in v2.0 with a migration helper;
+// do NOT add new bare-named knobs.
 package patterndetectorprocessor
 
 import (