Merged
2 changes: 1 addition & 1 deletion docs/patterns/10-cuda-oom-deceptive.md
@@ -71,7 +71,7 @@ Per issue #303 scalar-promotion checklist:

## Open questions

1. ~~**`DCGM_FI_DEV_FB_*` OTTL recipe extension.**~~ Resolved by issue [#337](https://github.com/tracecoreai/tracecore/issues/337): metric-side projection (`DCGM_FI_DEV_FB_USED` → `hw.gpu.memory.used`, `DCGM_FI_DEV_FB_FREE` → `hw.gpu.memory.free`) ships in [docs/integrations/prometheus-scrape.md §Pattern #10](../integrations/prometheus-scrape.md#pattern-10--cuda-oom-framebuffer). The `hw.gpu.memory.total = used + free` derivation + log-record emission belongs to the RFC-0014 PR-B `WithMetrics` bridge; the log-shape spec the bridge MUST honor is pinned in the recipe's [§Pattern #10 — `hw.gpu.memory.{free,total}`](../integrations/prometheus-scrape.md#pattern-10--hwgpumemoryfreetotal-issue-337) section.
1. **`DCGM_FI_DEV_FB_*` OTTL recipe extension.** Resolved by issue [#337](https://github.com/tracecoreai/tracecore/issues/337): metric-side projection (`DCGM_FI_DEV_FB_USED` → `hw.gpu.memory.used`, `DCGM_FI_DEV_FB_FREE` → `hw.gpu.memory.free`) ships in [docs/integrations/prometheus-scrape.md §Pattern #10](../integrations/prometheus-scrape.md#pattern-10--cuda-oom-framebuffer). The `hw.gpu.memory.total = used + free` derivation + log-record emission belongs to the RFC-0014 PR-B `WithMetrics` bridge; the log-shape spec the bridge MUST honor is pinned in the recipe's [§Pattern #10 — `hw.gpu.memory.{free,total}`](../integrations/prometheus-scrape.md#pattern-10--hwgpumemoryfreetotal-issue-337) section.
2. **filelogreceiver OTTL stanza for the OOM regex.** Sibling to #285. Multi-line `RuntimeError` traceback handling: does the OTTL recipe stop at the first stanza match, or continue into the traceback?
3. **Metrics-path on patterndetectorprocessor.** Per ADR-0001 PR-B — the processor today consumes logs only. CUDA-OOM joins a log to a metric. Either the metric is projected to a log via the metrics→logs OTTL bridge (RFC-0014 PR-B, also blocking pattern #3 today), or the processor grows a metrics input.
4. **`cuda_oom.kind` enum namespace.** Should this be `pattern.cuda_oom.kind` or top-level `cuda_oom.kind`? ATTRIBUTES.md prefers `pattern.*` for tracecore-internal verdict scalars, but issue #303 used `cuda_oom.kind` directly. Reconcile.
4 changes: 2 additions & 2 deletions docs/v1-rc1-simplification-audit.md
@@ -182,8 +182,8 @@ commitment), 5 = load-bearing.

| Rank | Candidate | LOC saved | Risk | Reason |
|---:|---|---:|---:|---|
| 1 | ~~`components/exporters/otlphttp/` (wrapper)~~ | ~~2,904 (1,168 non-test)~~ | — | **Done** — deleted in #333. Upstream `otlphttpexporter` v0.130.0 in builder-config.yaml now owns this surface. |
| 2 | ~~`components/exporters/stdoutexporter/` (wrapper)~~ | ~~1,077 (392 non-test)~~ | — | **Done** — deleted in #334. Upstream `debugexporter` v0.130.0 in builder-config.yaml now owns this surface. |
| 1 | `components/exporters/otlphttp/` (wrapper) | 2,904 (1,168 non-test) | — | **Done** — deleted in #333. Upstream `otlphttpexporter` v0.130.0 in builder-config.yaml now owns this surface. |
| 2 | `components/exporters/stdoutexporter/` (wrapper) | 1,077 (392 non-test) | — | **Done** — deleted in #334. Upstream `debugexporter` v0.130.0 in builder-config.yaml now owns this surface. |
| 3 | `components/receivers/pyspy/` + `python/tracecore_pyspy/` + `tools/pyspy-lint/` | 5,617 total | **4** | Not in OCB build (zero Go importers from `_build/components.go`). BUT — RFC-0013's v0.3.0 pyspy-delete row was **deferred to v0.4.0+ per [#222](https://github.com/TraceCoreAI/tracecore/issues/222)**. `docs/migration/v0.2-to-v0.3.md` explicitly states "pyspy ships as-is in v0.3.0." Delete is blocked on #222, not on this audit. |
| 4 | `docs/followups/M*.md` shards where every item is `[STRIKE]` or `landed` | ~3,552 total / shard-by-shard ~50-300 | **2** | Cosmetic; consolidate landed deferrals into RFC-0013 audit-trail footer. Per `feedback_no_bloat`: don't track when fix-now is in scope, but these are post-merge artifacts so they're not blocking. |
| 5 | `docs/research/m15-container-stdout.md` + `m16-kueue.md` + `m16-kueue-production-followups.md` (already `[STRIKE]`-banner'd) | ~2,500 | **2** | All carry RFC-0013-supersession banners; bodies are retained as decision history. Could collapse into RFC-0013 §audit-trail. Same cosmetic-cleanup class as row 4. |
10 changes: 1 addition & 9 deletions module/doc.go
@@ -2,13 +2,5 @@
// SPDX-License-Identifier: Apache-2.0

// Package module hosts the in-repo Go submodule for receivers/processors/exporters
// moved out of the root module per RFC-0013. Contents land in PR-I.1b (nccl_fr move)
// and PR-I.2 (rankjoinprocessor + patterndetectorprocessor).
//
// This file exists so that `module/v0.0.1` (the genesis tag cut after PR-I.1a
// merges) resolves through the Go module proxy — proxies typically require at
// least one .go file in the module root to validate the module and surface it
// in `go list`. Without it, `GOPROXY=https://proxy.golang.org go list -m
// github.com/tracecoreai/tracecore/module@v0.0.1` may fail with a 404 or
// "no Go source files" error, breaking the PR-I.1b release-tag-cut workflow.
// moved out of the root module per RFC-0013.
package module
15 changes: 15 additions & 0 deletions module/processor/patterndetectorprocessor/config.go
@@ -1,5 +1,20 @@
// SPDX-License-Identifier: Apache-2.0

// Knob-naming convention (post-wave audit #379):
//
// New pattern knobs MUST use the `<Pattern><Knob>` Go identifier shape,
// which maps to `<pattern>_<knob>` in YAML — e.g. `CheckpointerHangBackwardWindow`
// → `checkpointer_hang_backward_window`, `NCCLBootstrapDeadline` →
// `nccl_bootstrap_deadline`. This avoids cross-pattern collisions on the
// generic `*_window` / `*_threshold` axes once N detectors coexist.
//
// Bare-name knobs (`JoinWindow`, `NCCLHangThreshold`, `XidCorrelationWindow`,
// `HBMECCWindow`, `HBMECCDeltaThreshold`, `ThermalThrottleWindow`,
// `PCIeAERWindow`, `IBLinkFlapWindow`, `CUDAOOMCorrelationWindow`,
// `EmitPartialVerdicts`) predate the convention and are retained for
// backward-compat with pre-v0.4 `values.yaml` files. They will be nested
// into `<pattern>: { <knob>: ... }` blocks in v2.0 with a migration helper;
// do NOT add new bare-named knobs.
package patterndetectorprocessor

import (
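
The `<Pattern><Knob>` to `<pattern>_<knob>` mapping described in the `config.go` comment can be sketched as a small Go helper, for example as the basis of a lint-style check that a field name and its YAML key agree. `yamlKey` is a hypothetical function introduced for illustration; the processor itself may well rely on explicit struct tags rather than any derived mapping.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// yamlKey converts a Go knob identifier in <Pattern><Knob> form into its
// <pattern>_<knob> snake_case key. Hypothetical helper, not a tracecore API.
func yamlKey(ident string) string {
	runes := []rune(ident)
	var b strings.Builder
	for i, r := range runes {
		if i > 0 && unicode.IsUpper(r) {
			prev := runes[i-1]
			nextLower := i+1 < len(runes) && unicode.IsLower(runes[i+1])
			// Insert a word break after a lowercase letter or digit
			// ("rH" in "CheckpointerHang"), or where an acronym run ends
			// and a new capitalized word starts ("LB" in "NCCLBootstrap").
			if unicode.IsLower(prev) || unicode.IsDigit(prev) ||
				(unicode.IsUpper(prev) && nextLower) {
				b.WriteByte('_')
			}
		}
		b.WriteRune(unicode.ToLower(r))
	}
	return b.String()
}

func main() {
	for _, id := range []string{"CheckpointerHangBackwardWindow", "NCCLBootstrapDeadline"} {
		fmt.Printf("%s -> %s\n", id, yamlKey(id))
	}
}
```

This heuristic handles the two examples given in the comment, but it mis-splits mixed-case acronyms such as the `PCIe` in `PCIeAERWindow`, so explicit per-field keys remain the safer source of truth.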