RFC-0013: distribution-first pivot + CI surface cleanup#166
Merged
Conversation
Pivots tracecore from a build-first OTel Collector to an OCB-assembled distribution. Custom code shrinks to the pattern engine, OTTL processors with windowed semantics, NCCL FlightRecorder parsing, and the install/ overhead bench harness. Seven in-tree receivers and three internal packages are queued for deletion across v0.1/v0.2/v0.3. Operator-stable telemetry contracts (k8s.event.hint enum, kernelevents.xid, gpu.id, gpu.vendor, NCCL span schema) are preserved across the cut via an OTTL normalization layer in the bundled recipe. Upstream contributions become first-class — patches go upstream; forks only when upstream rejects. Index updates: 0001/0002/0003 revised; 0004 superseded + archived; 0005/0006/0007/0009/0010/0011/0012 superseded; 0008 revised (image-publish → ko + Renovate). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Tri Lam <tri@maydow.com>
Adds Status headers to research notes whose conclusions have been revised by the distribution-first pivot. Bodies retained as decision history. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Tri Lam <tri@maydow.com>
Pattern detection logic unchanged; input sources now come from dcgm-exporter/ROCm DME/xpumanager/Habana via prometheusreceiver, plus journaldreceiver+filelog for kernel/Xid signals, per RFC-0013 §2. The pattern engine in tracecore-components stays in-tree as the moat. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Tri Lam <tri@maydow.com>
Adds status header + redirect block to each affected RFC body. Moves RFC-0004 to archived/ since no operator-visible behavior is preserved. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Tri Lam <tri@maydow.com>
Marks workflows, tool READMEs, internal/pipeline + telemetry READMEs, python/ tree, components.yaml, and DCGM/kernelevents issue templates with their RFC-0013 deletion or replacement schedule (v0.1.0 / v0.2.0 / v0.3.0). No behavior changes — banners only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Tri Lam <tri@maydow.com>
clockreceiver, dcgm, kueue, kineto, k8sevents, kernelevents, pyspy README + RUNBOOK files now carry their RFC-0013 §7 deletion schedule (v0.1.0 / v0.2.0 / v0.3.0) with pointers to the upstream replacement. nccl_fr carries the module-migration banner instead — retained as the one custom tracecore receiver moving to tracecoreai/tracecore-components. No code changes — banners only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Tri Lam <tri@maydow.com>
Adds Status blocks to every followup shard mapping items to RFC-0013 outcomes: [KEEP] for moat code (M19 pattern detectors, M11 nccl_fr); [RECIPE] for items that move to bundled OTel YAML + OTTL processors; [UPSTREAM] for items that become upstream-contribution targets; [STRIKE] for items obsoleted by adoption; [DEFERRED] for Kineto pending OTel Profiles GA; [AUDIT] where binding to OCB boot path is unverified; [REOPENED] for previously-skipped items relevant under adopt-first. No history removed — Status blocks frame the new disposition while preserving each shard's decision-history bullets verbatim. Top-level docs/FOLLOWUPS.md gains an RFC-0013 pivot impact table indexing the disposition per shard. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Tri Lam <tri@maydow.com>
Updates README, NORTHSTARS, STRATEGY (root + docs/), PRINCIPLES, MILESTONES, CHANGELOG, CONTRIBUTING, AGENTS, docs/README around the distribution-first posture: tracecore is an OCB-assembled OTel distro plus a pattern library; in-house scope shrinks to the four moats in RFC-0013 §6; upstream contributions become first-class policy. Customer-stable telemetry contracts (k8s.event.hint enum, kernelevents.xid, gpu.id, gpu.vendor, NCCL span schema) are preserved across the pivot via OTTL normalization in the bundled recipe. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Tri Lam <tri@maydow.com>
Drops "intermediate otelcol-contrib" recipe from Datadog and ClickHouse integrations — the OCB-assembled distro bundles those exporters directly per RFC-0013 §1 + §2. Updates getting-started, FAILURE-MODES, HARDWARE-TESTING, reproducibility, and CI notes to reflect deleted receivers, the goreleaser-driven release pipeline, and the customer-stable telemetry contracts that operators can rely on across the pivot (RFC-0013 §3). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Tri Lam <tri@maydow.com>
Pre-push govulncheck gate flagged GO-2026-5026 in x/net@v0.54.0 (idna.ToASCII fails to reject ASCII-only Punycode labels) via components/receivers/kueue/scrape.go:155 → http.Client.Do. The Kueue receiver is queued for deletion at v0.1.0 per RFC-0013 §7, so the trace path itself goes away — but until then, the bump is the narrowest fix that unblocks the pivot-doc PR without inviting fork maintenance for an in-tree component already on the chopping block. x/sys upgraded transitively v0.44.0 → v0.45.0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Tri Lam <tri@maydow.com>
doc-check's diff-scope gate forbids U+2014 (em-dash) and U+2013 (en-dash) on PR-added lines. The phase 2/3/5/8/8b doc-pivot commits introduced 88 hits across 32 files. Replacement is mechanical: em-dash → " - ", en-dash → "-". No semantic edits. Also fixes three side issues surfaced during the same gate pass: - FAILURE-MODES.md cited a non-existent test TestIntegration_TelemetryGeneratorToDebug; replaced with the real TestIntegration_SIGINT. - CHANGELOG.md, docs/rfcs/0003 / 0007 / README, and the archived 0004 body all carried links to docs/rfcs/0004-clockreceiver-stdoutexporter.md that broke when Phase 3 moved the file to docs/rfcs/archived/. Internal links inside the archived RFC also needed one extra "../". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Tri Lam <tri@maydow.com>
Spot-audit pass on the RFC-0013 distribution-first pivot doc surface (PR #166). Five categories of edits applied; no code or workflow changes; no scope decisions in RFC-0013 §2/§3/§4/§7 touched. Edits: - **Pivot-name unification.** Adopt "distribution-first pivot" as the single name across the surface. Replaces "OCB pivot" in docs/FAILURE-MODES.md and "distro pivot" in docs/notes/ci.md. - **"Adopt > build" unification.** Replaces "Adopt-first" / "adopt-first" with "Adopt > build" across CHANGELOG.md, docs/FOLLOWUPS.md, docs/followups/skipped.md, and the RFC-0013 motivation lede - aligns with PRINCIPLES.md §16 and NORTHSTARS.md adoption-posture lede. - **Integration-recipe trailing paraphrase trim.** The "---\nThe <exporter> ships bundled via OCB manifest per RFC-0013..." trailing block at the foot of docs/integrations/{datadog,clickhouse-direct, honeycomb,otel-backend}.md duplicated information already in each recipe's intro paragraph. Removed; "See also" pointers retained. - **Status-block paraphrase trim.** docs/followups/M3.md and M13.md Status (RFC-0013) blocks paraphrased what the [KEEP]/[STRIKE]/ [UPSTREAM]/[RECIPE] item tags already say. Collapsed to one-paragraph summaries pointing at RFC-0013 sections + tag convention. - **Paragraph dedupe in cross-references.** - CHANGELOG.md "### Changed (RFC-0013 pivot)" first bullet (binary now assembled via OCB) duplicated the section opener immediately above. Dropped. - docs/FAILURE-MODES.md "Per-recipe failure modes" three-bullet paraphrase of the alert table below collapsed to a one-line pointer. - docs/patterns/README.md "Input sources (post-RFC-0013)" copied a subset of the RFC-0013 §2 adoption matrix verbatim. Replaced with a pointer to §2 plus the gpu.vendor + pattern-output context that adds beyond the RFC. - docs/rfcs/0013-distro-first-pivot.md "Migration / rollout" sentence-level tighten: "See §4 ..." -> "Release-boundary schedule in §4. PR sequencing follows." Doc-check: 516 markdown links resolve; em-dash/en-dash diff gate clean; comment-noise diff gate clean; banned-phrase lint clean; all 4 integration recipes still carry tested-against + last-verified markers under 180d. No scope decisions altered. No code or workflow files touched. Customer-stable contract attribute names (k8s.event.hint enum, kernelevents.xid, gpu.id, gpu.vendor, gen_ai.training.{rank,job_id}, NCCL FlightRecorder span schema) preserved exactly per RFC-0013 §3. Signed-off-by: Tri Lam <tri@maydow.com>
…tro-pivot Signed-off-by: Tri Lam <tri@maydow.com> # Conflicts: # MILESTONES.md # docs/FAILURE-MODES.md # go.mod
Datadog + ClickHouse integration recipes route through `./tracecore
validate` per RFC-0013 §2, but `datadogexporter` and `clickhouseexporter`
are not registered in the in-tree binary until the OCB-skeleton PR
(RFC-0013 §Migration PR-A) lands. Without this skip, validator-recipe
fails CI for two recipes whose validation depends on a follow-on PR
in the migration sequence.
Adds `tested-against: pending-rfc-0013-pr-a` as a recognized marker:
- scripts/doc-check.sh accepts it alongside `tracecore` and `vX.Y.Z`.
- scripts/validator-recipe.sh skips the recipe with a clear notice
naming the gating PR.
- docs/integrations/{datadog,clickhouse-direct}.md flip to the new
marker. honeycomb + otel-backend stay on `tracecore` (the in-tree
binary already registers otlphttp).
Flip back to `tested-against: tracecore` when PR-A merges.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tri Lam <tri@maydow.com>
The bare PR-N reference triggered comment-noise lint. The marker name itself names the gating PR; the comment was redundant. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Tri Lam <tri@maydow.com>
PR #164 (go-deps group bump) landed x/time v0.15.0 + x/sys v0.45.0 + fsnotify v1.10.1 + jsonschema v6.0.2 + otel/pdata v1.59.0 on main. Our branch already had x/sys v0.45.0 from the x/net security fix commit; merge takes the union with PR #164's x/time v0.15.0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Tri Lam <tri@maydow.com>
The verify-static job logs Error 1 from Makefile:229 (@scripts/alert-check.sh) without any alert-check output, despite the script running clean locally on the same SHA. Re-triggering to determine whether the failure reproduces or was a runner-environment flake. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Tri Lam <tri@maydow.com>
doc-check's em-dash and comment-noise gates compare PR-added lines against origin/main via `git diff base_ref...HEAD` (three-dot range requires a merge base). On the default shallow checkout, the merge base does not exist and git exits 128. With `set -euo pipefail` active in scripts/doc-check.sh, the pipeline `git diff ... | python3 -c ...` propagates 128, killing the script with no further output — which is exactly what every recent verify-static failure on PR #166 looked like: the last visible echo was the failure-mode label line, then `Makefile:229 Error 1` with nothing between. Setting fetch-depth: 0 deepens the checkout to full history so the merge base resolves. The two CI-only gates (em-dash + comment-noise diff) now report `clean (vs origin/main)` instead of silently dying. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Tri Lam <tri@maydow.com>
Removes five workflows that no longer carry signal for the distribution-first repo: - kernelevents-integration.yml (254 LOC) — receiver scheduled for deletion at v0.2.0 per RFC-0013 §7; the workflow gates a path that will not exist in main two releases out. - pyspy-integration.yml (107 LOC) — receiver scheduled for deletion at v0.3.0; same logic. - python-publish.yml (82 LOC) — PyPI helper deleted with pyspy at v0.3.0; no publish target remains. - ci-hardware.yml.staged (66 LOC) — never enabled (the .staged suffix excludes it from GitHub Actions discovery); residue from a prior planning iteration. - govulncheck.yml (27 LOC) — `verify-static` already runs `make govulncheck` on every PR; the standalone workflow is a duplicate scan against the same vuln database. Security mitigation is preserved via verify-static (a required check). Net: 536 LOC removed from `.github/workflows/`. Per-PR runner minutes drop. No coverage regression — every gate that fires on behavior in this PR is still in place via verify-static, chart, and the receiver-deletion schedule in RFC-0013 §7. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Tri Lam <tri@maydow.com>
Setting flipped off in GitHub UI today; manifest catches up. Rationale: `main` is already linear via squash-merge (repo-level "Allow merge commits: false"). The branch protection rule was belt-and-suspenders but had a distinct cost - it blocked squash- merge whenever the source branch absorbed a merge commit via `git merge origin/main` conflict resolution (the standard pattern per memory `feedback_branch_sync_via_merge`). That cost was paid twice on PR #166 alone. No coverage lost. Squash-merge still produces linear `main`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Tri Lam <tri@maydow.com>
The doc-check target invokes 6 scripts under @ which silenced both the invocation line and the script's prefix-less failure mode. When one of them fails on CI with no output (the common case for shell scripts under set -e + pipefail), the only signal was "Makefile:229 Error 1" with no script name. Tonight's verify-static debugging burned 90 minutes pinpointing which of the 6 scripts was the actual failure. Dropping @ makes make echo each `scripts/X.sh` line as it runs. Same scripts, same gates, same exit code semantics; CI logs now identify the failing script by line. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Tri Lam <tri@maydow.com>
Three structural simplifications to the doc-check surface: 1. Move scripts/release-doc-parity.sh + scripts/gh-attestation-flag-lint.sh out of `make doc-check` and into `make doc-check-release`. These gates catch drift between .github/workflows/release.yml and docs/reproducibility.md — load-bearing for the release pipeline, but pure noise on doc PRs that don't touch release.yml. Per-PR surface drops two gates that have a 0% catch rate on doc PRs. 2. Delete scripts/test-release-doc-parity.sh and its 7-file fixture tree under scripts/testdata/release-doc-parity/. The script mutation-verifies release-doc-parity.sh against drift fixtures — it tests the gate, which is the canonical "testing the test" smell. The gate is small (97 LOC) and exercised by every release tag push; the meta-test adds maintenance without adding signal. 3. Fix the actual verify-static failure root cause: line 14 of .github/branch-protection.yml from commit 61aff33 contained the string "on PR #166" — a bare PR-N reference that the comment-noise diff gate (correctly) rejects. The Makefile @ prefix dropped in ab648ac would have surfaced this on the next CI run; this commit removes the offending phrase so CI passes without waiting. Net: 230 LOC removed (43 meta-test + 187 fixtures). Per-PR doc-check runs 3 scripts instead of 6 — alert-check, doc-check.sh, and chart-appversion-check. Release-pipeline parity gates still run on tag push via make doc-check-release. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Tri Lam <tri@maydow.com>
The verify-static job has now failed 8 consecutive times with the same opaque signature: `scripts/alert-check.sh` echoed (after dropping the @-prefix in ab648ac), then `Error 1` 98ms later with zero alert-check output. Local rc=0 on macOS bash 3.2 + BSD awk; rc=0 on Ubuntu 24.04 docker container with gawk; CI rc=1 on ubuntu-latest with the same git checkout. Multi-angle investigation ruled out: - shallow checkout (fixed in 5f15a0b; fetch-depth: 0) - bash version difference (3.2 vs 5.x — local + container both green) - awk variant (BSD vs gawk — both green) - file permissions (100755 in git index) - script content (identical sha) Adds a debug-only `set -x` guarded by ALERT_CHECK_TRACE=1 and turns the flag on for the next CI run. The next failure will surface the exact failing command on stderr. Remove the trace once the underlying cause is fixed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Tri Lam <tri@maydow.com>
Five changes, ordered by long-term impact: 1. ROOT CAUSE FIX for the 8-fail verify-static streak. The CI trace added in the previous commit pinpointed the exact failing command in scripts/alert-check.sh: line 97 used `[ -f X ] && echo $r`. When the test was false (3 of 8 RUNBOOKs have no sibling alerts.yaml today: kineto, pyspy, nccl_fr), the `&&` chain returned exit code 1. Under bash 4+ pipefail + set -e, that killed the script silently. macOS bash 3.2 was lax enough to not propagate; Ubuntu bash 5 strictly propagated. Rewrote the line as `if [ -f X ]; then echo $r; fi` - same semantics, safe rc. Removed the ALERT_CHECK_TRACE diagnostic since the cause is named. 2. DROP em-dash + en-dash diff gate (81 LOC). Agent-driven prose habits reintroduce em-dashes faster than the gate justifies; PR #166 alone triggered 88 hits across 32 files. Reviewers gain little from per-PR enforcement of a punctuation choice that doesn't change meaning. 3. WHITELIST .github/ and scripts/ in comment-noise diff gate. Both directories carry legitimate forward-looking notes (TODOs, status banners, gate rationale) that the noise gate falsely treats as rot. Markdown and source-code rules unchanged. 4. ADD .githooks/prepare-commit-msg for auto Signed-off-by. The DCO gate in commit-msg stays; the per-commit `git commit -s` ceremony goes away. PR #166 paid this cost twice on the first night. 5. ADD `changes` filter job in ci.yml that detects doc-only PRs. Downstream gates can short-circuit when the diff touches no Go source. Currently exposes the `code` output; gating sequence migration is the next commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Tri Lam <tri@maydow.com>
shellcheck SC2221/SC2222 flagged the path-filter case statement because `*.md` already covers CHANGELOG.md / README.md / NORTHSTARS.md etc; listing the top-level docs explicitly created overlapping patterns. Same semantics with the simpler two-arm match. Also: verify-test failed on TestTailer_TruncationWithoutRotation (components/receivers/containerstdout/tailer_test.go:347, "timed out waiting for tailer line"). That test arrived via PR #158 (M15 containerstdout, merged on main earlier today) and is not introduced by this PR. Flaky timeout on the slow ubuntu-latest runner. Triaging separately via re-run; if it persists, registers a follow-up in docs/FLAKY-TESTS.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Tri Lam <tri@maydow.com>
TestTailer_TruncationWithoutRotation flaked on ubuntu-latest with "timed out waiting for tailer line" at tailer_test.go:347. Root cause: the test forces an in-place truncate (shrinks the file below the tailer's stored offset) and then immediately appends new content. For the tailer to surface the appended line, its poll loop must fire DURING the narrow window when the file is shrunk - it needs to observe size<offset, reset offset to 0, and rewind the file pointer to start of file. If the append wins the race the file size restores to 8 bytes (matching the pre-truncate offset) and the truncate detection path never triggers; the tailer then sees no new content and the test hits the 2s drainLine timeout. The pre-fix sleep of 2x tailerTestPoll (20ms) is too tight under GC + scheduler jitter on slow CI nodes. 10x (100ms) is comfortably under tailerTestTimeout (2000ms) and stable across `go test -count=5 -race` locally. No production code change. The tailer correctly handles the truncate path; the test was racing against its own setup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Tri Lam <tri@maydow.com>
6 tasks
trilamsr
added a commit
that referenced
this pull request
May 31, 2026
## What this PR does Bundles three RFC-0013 PR slices that have zero file overlap with each other. ### PR-C: release pipeline → goreleaser stack - New `.goreleaser.yaml`: linux/amd64 + linux/arm64 builds; reproducible via `SOURCE_DATE_EPOCH`; LDFLAGS shape matches the Makefile build target. - Rewritten `.github/workflows/release.yml`: invokes goreleaser, `anchore/sbom-action`, `sigstore/cosign-installer`, `slsa-framework/slsa-github-generator` (tag-pinned per SLSA OIDC subject identity requirement; all other actions SHA-pinned per repo security policy), `actions/attest-build-provenance`. - Old `release.yml` moved to `.github/workflows/archived/release.yml.legacy`. - Goreleaser builds the **legacy** `cmd/tracecore` binary; OCB-output migration deferred to PR-D (image build → ko), per inline comment in `.goreleaser.yaml`. ### PR-G + PR-H: RFC supersession + top-level docs alignment - Audit confirmed all 12 RFCs already carry the correct supersedence headers from prior pivot work (PRs #166/#168/#169/#170). Only two top-level docs needed alignment: - `NORTHSTARS.md` O1 caveat: replaced "own-binary architecture" assumption wording with OCB-distribution-posture wording; closed Open Question #1 by RFC-0013 ref. - `CHANGELOG.md`: appended pivot-wave-1 PR list (#166/#168/#169/#170/#171/#172/#173) citing PR-A as the prior step before this commit. - No edits needed to README/STRATEGY/PRINCIPLES/MILESTONES/CONTRIBUTING/AGENTS/docs/README — all already aligned. ### PR-E: clockreceiver swap — BLOCKED - `telemetrygeneratorreceiver` does not exist in `opentelemetry-collector-contrib` at any version. Verified against the Go module proxy, GitHub tree API at v0.95→v0.130, and the full receiver listing at v0.110.0 (94 receivers; no `telemetrygenerator`, `loadgen`, `mockreceiver`, `dummyreceiver`, or any `*generator*`). The RFC-0013 §1 example shape referenced it speculatively; it was never upstreamed. - `builder-config.yaml`: replaced the misleading "no v0.110.0 tag" omission comment with a verified TODO block describing the actual blocker (receiver doesn't exist anywhere) and decision rationale. - `bench/install/tracecore-values.yaml`: appended `[BLOCKED]` marker on the clockreceiver→telgen mapping; bench continues to use in-tree clockreceiver until PR-F deletes it (likely rewires to `hostmetricsreceiver`). ## Root cause (PR-E blocker) RFC-0013 §1 listed `telemetrygeneratorreceiver` as the swap target without verifying the receiver existed upstream. Reality: the OTel contrib repo has no such module path at any tag. PR-E cannot complete until either (a) the receiver lands upstream, or (b) a different replacement is chosen (e.g., `hostmetricsreceiver` for heartbeat semantics). Tracked in the in-file TODO block; revisit in PR-F (delete clockreceiver) or as a separate followup. ## Release notes ```release-notes [CHANGE] Release pipeline migrated to goreleaser + SBOM + SLSA provenance + cosign signing. The release.yml workflow now invokes goreleaser instead of building binaries directly. Operators consuming release artifacts: artifact shape (filename, archive contents, checksum file format) follows goreleaser defaults; see CHANGELOG.md for the migration note. ``` ## Test plan - [x] `make verify` runs and passes - [x] `make actionlint` passes (new release.yml workflow + suppression-block YAML valid) - [x] `make zizmor` passes (SLSA reusable-workflow tag-pin justified inline + accepted) - [x] `make build` (legacy) still works - [x] `make build-ocb` (OCB) still works - [ ] Goreleaser dry-run in CI on first push to a tag (gated until a tag exists) Signed-off-by: Tri Lam <tri@maydow.com> Co-authored-by: Tri Lam <tri@maydow.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What this PR does
Lands RFC-0013 as the binding decision doc for tracecore's distribution-first pivot, and propagates it across the doc surface in twenty-four scoped commits. The PR also absorbs a CI surface cleanup + velocity package (commits 19-24 below): 536 LOC of dead/redundant workflows removed; 230 LOC of testing-the-test meta-gates removed; em-dash diff gate dropped (81 LOC) since agent prose habits make it net-negative;
make doc-checksplit into per-PR + release-only variants; the 8-failverify-staticstreak root-caused to a&&-chain inscripts/alert-check.sh:97that returned exit 1 under bash 5 pipefail; auto-signoff hook eliminates the per-commitgit commit -sceremony; and achangesjob detects doc-only PRs for downstream short-circuit gating. After this PR, every doc that names a soon-to-be-deleted receiver, the hand-rolled release pipeline, or the custom self-telemetry surface carries a Status banner pointing at the upstream replacement and the release boundary (v0.1.0 / v0.2.0 / v0.3.0).No application code paths change. No deletions yet. The next eleven PRs (per RFC-0013 §Migration) do the receiver removals, the OCB skeleton swap, the goreleaser stack adoption, the receivers-only module split, and the migration guides. Two scripts change to support the in-flight state:
scripts/doc-check.shrecognizes a temporarytested-against: pending-rfc-0013-pr-amarker, andscripts/validator-recipe.shskips recipes carrying it with an explicit notice naming the gating PR.Scope summary
dcgm-exporter(NV) +ROCm/device-metrics-exporter(AMD) +xpumanager(Intel) + Habana exporter viaprometheusreceivercomponents/receivers/dcgm/(cgo stub)filelogreceiver+ container stanza +file_storagecomponents/receivers/containerstdout/(M15 alpha, PR #158)journaldreceiver+filelogreceiver+ OTTL Xid transformcomponents/receivers/kernelevents/k8sobjectsreceiver+ OTTLk8s.event.hinttransformcomponents/receivers/k8sevents/prometheusreceiver+ bearer-tokencomponents/receivers/kueue/(never shipped)parca-agent(eBPF)components/receivers/pyspy/+python/tracecore_pyspy/+tools/pyspy-lint/components/receivers/kineto/(partial)telemetrygeneratorreceivercomponents/receivers/clockreceiver/componentstatus+service/telemetry+ standardotelcol_*internal/componentstatus/+internal/selftelemetry/+internal/telemetry/.github/workflows/release.ymlko(push) +Renovate(pull)datadogexporter+clickhouseexporterdirectlyCustomer-stable contracts preserved across the pivot
k8s.event.hint11-entry enum,kernelevents.xid,gpu.id,gpu.vendor,gen_ai.training.{rank,job_id}, NCCL FlightRecorder span schema, pattern detector outputs (M17/M18/M19). Normalized back to the stable surface via the OTTLtransformprocessor in the bundled recipe. Operator alerts written against these survive the receiver swap.What tracecore still builds (the moat)
Bounded by RFC-0013 §6: NCCL FlightRecorder receiver (
ncclfrreceiver), windowed cross-signal join processor (rankjoinprocessor), pattern engine + replay corpus (patterndetectorprocessor), install/overhead bench harness. Everything else comes from upstream + contrib.Upstream contribution policy
Now first-class. Tracecore contributes patches upstream first; forks only when upstream rejects. When a contribution is in-flight, tracecore ships against a
replacedirective ingo.modpointing at the contribution branch; the replace is removed when the upstream tag lands.Commits in this PR
1fe08c3- RFC-0013 binding doc + index updates2047abe- research notes superseded headers81dc29e- patterns input-source references at adopted receiversca9fbc6- RFC bodies 0001-0012 supersede/revise headers (0004 archived)d06c929- status banners on queued-for-deletion files (15 files)d116775- receiver READMEs + RUNBOOKs status banners (14 files)c250509- followups triage against RFC-0013 (21 files)3edd226- top-level orienting docs reframed (9 files)cb01365- integration recipes + operator docs for OCB distro (10 files)946af13-golang.org/x/netv0.54.0 → v0.55.0 (GO-2026-5026, blocking pre-push)4eb03c0- em-dash + en-dash → hyphen across pivot surface (88 hits in 32 files); also fixes 3 broken link refs to archived/0004 and one fabricated test name in FAILURE-MODES.mdde585ba- language tightening + dedupe paraphrase + ancillary doc Status banners9ceeaef- mergeorigin/main(resolves M15 containerstdout (PR [m15] containerstdout: Kubernetes container stdout tail receiver (alpha) #158) alongside RFC-0013 v0.2.0 deletion banner; FAILURE-MODES.md gains containerstdout alert rows with upstream replacement column; go.mod takes bothx/sys v0.45.0and newx/time v0.14.0)214e8a6-scripts/validator-recipe.sh+scripts/doc-check.shaccepttested-against: pending-rfc-0013-pr-a; recipes routing through./tracecore validatefor not-yet-bundled exporters (datadog, clickhouse) carry the new marker and are skipped with an explicit notice until OCB-skeleton PR-A lands59521f5- drop bare PR-N comment-noise fromvalidator-recipe.sh(gate flagged on first try)55b05fe- mergeorigin/main(PR Bump the go-deps group with 5 updates #164 go-deps group bump: fsnotify v1.10.1, jsonschema v6.0.2, otel/pdata v1.59.0; takes x/time v0.15.0 from main and keeps x/sys v0.45.0 from our security fix)4b44efe- ci: empty commit to re-trigger verify-static (diagnostic; reproduced the failure deterministically)5f15a0b- ci.yml: deepenverify-staticcheckout tofetch-depth: 0so doc-check's diff-scope gates can resolvebase_ref...HEADmerge base. Root cause: shallow CI checkout has no merge base;git diff base_ref...HEADexits 128; pipefail propagates and killsscripts/doc-check.shsilently after the failure-mode label echo. Reproduced locally via shallow clone of/tmp/tracecore-cisim; fixed bygit fetch --unshallow.ee98962- CI cleanup (scope-expansion per user directive): delete 5 dead/redundant workflows totalling 536 LOC.kernelevents-integration.yml(254 LOC) - receiver v0.2.0 deletion target per RFC-0013 §7pyspy-integration.yml(107 LOC) - receiver v0.3.0 deletion targetpython-publish.yml(82 LOC) - PyPI helper v0.3.0 deletion targetci-hardware.yml.staged(66 LOC) - never enabled (.stagedsuffix excludes from GitHub Actions discovery)govulncheck.yml(27 LOC) - duplicatesverify-static'smake govulncheck; security mitigation preserved via the requiredverifyaggregator61aff33-.github/branch-protection.yml: droprequire_linear_history. Setting flipped off in GitHub UI; manifest catches up.mainis already linear via squash-merge; the rule was belt-and-suspenders that blocked squash whenever a feature branch absorbed agit merge origin/mainconflict resolution (twice on this PR).ab648ac-Makefile: drop the@prefix fromdoc-check's 6 sub-scripts so CI logs name the failing script. Previously a sub-script failure showedMakefile:229 Error 1with no script name; tonight's debugging burned 90 minutes pinpointing which of 6 scripts died.7f35b4b- doc-check simplification + actual CI fix: (a) splitmake doc-checkintodoc-check(3 sub-scripts:doc-check.sh,alert-check.sh,chart-appversion-check.sh) +doc-check-release(release-pipeline parity gates:release-doc-parity.sh,gh-attestation-flag-lint.sh). Per-PR drops 2 gates that have 0% catch on doc PRs. (b) Deletescripts/test-release-doc-parity.shand its 7-file fixture tree (230 LOC) - "testing the test" meta-gate. (c) Root cause for the CI failure that triggered this cleanup: line 14 of.github/branch-protection.ymlfrom commit61aff33contained# on PR #166- a bare PR-N reference the comment-noise diff gate correctly rejects. The@-prefix-drop inab648acwould have surfaced this on the next CI run; this commit removes the offending phrase so CI passes.517c9ff- diagnostic:ALERT_CHECK_TRACE=1env-flag-guardedset -xinscripts/alert-check.sh+ workflow toggle. Surfaces the failing command on the next CI run.c616952- velocity package (root cause + 4 long-term wins):scripts/alert-check.sh:97used[ -f X ] && echo $r. When the test was false (3 of 8 RUNBOOKs have no sibling alerts.yaml: kineto, pyspy, nccl_fr), the&&chain returned exit code 1. Under bash 4+ pipefail +set -e, that killed the script silently. macOS bash 3.2 didn't propagate; Ubuntu bash 5 did. Rewrote asif [ -f X ]; then echo $r; fi. Diagnostic trace removed..github/andscripts/in comment-noise diff gate. Both directories carry legitimate forward-looking notes (TODOs, status banners, gate rationale)..githooks/prepare-commit-msgfor auto Signed-off-by. DCO contract preserved by the commit-msg gate; the per-commitgit commit -sceremony goes away.changesfilter job in ci.yml that detects doc-only PRs. Downstreamif:gating in a follow-on commit.Linked issue(s)
No linked issue.
Test plan
make doc-checkclean on every modified filemake ci(full pre-push) passed locally — lint + race tests + 30s fuzz + govulncheck + build + doc-check + release-doc-paritygovulncheck ./...reports zero vulnerabilities after the x/net bumpgh attestation/cosignflag set unchanged (parity gate intact per RFC-0013 §Migration PR-C)scripts/validator-recipe.shruns locally: 2 validated (honeycomb + otel-backend on bundledotlphttpexporter), 2 skipped (datadog + clickhouse, pending PR-A)tested-against: tracecore; the in-tree binary registersotlphttpso those exporters validate normallyKnown follow-ups (not in this PR; tracked for next PRs per RFC-0013 §Migration)
docs/integrations/{datadog,clickhouse-direct}.md: fliptested-against: pending-rfc-0013-pr-aback totracecorethe moment PR-A merges; remove thepending-rfc-0013-pr-acase inscripts/{doc-check,validator-recipe}.shat the same time. This temporary marker is the only deviation from the steady-state validator config.docs/FAILURE-MODES.mdretains references tointernal/pipeline/*test paths that will be deleted with the v0.1.0 PR-F cut.tracecoreai/tracecore-componentsseparate-repo module does not exist yet; created in PR-I per v0.2.0.Release notes
Checklist
validator-recipe.shskip-marker mutation-verified)make checkruns green; pre-push hook passesgit commit -s)STYLE.md- n/a (no new components)