RFC-0013: distribution-first pivot + CI surface cleanup #166

Merged: trilamsr merged 26 commits into main from worktree-rfc-0013-distro-pivot on May 30, 2026

trilamsr (Contributor) commented May 30, 2026

What this PR does

Lands RFC-0013 as the binding decision doc for tracecore's distribution-first pivot, and propagates it across the doc surface in twenty-four scoped commits. The PR also absorbs a CI surface cleanup + velocity package (commits 19-24 below): 536 LOC of dead/redundant workflows removed; 230 LOC of testing-the-test meta-gates removed; em-dash diff gate dropped (81 LOC) since agent prose habits make it net-negative; make doc-check split into per-PR + release-only variants; the 8-fail verify-static streak root-caused to a &&-chain in scripts/alert-check.sh:97 that returned exit 1 under bash 5 pipefail; auto-signoff hook eliminates the per-commit git commit -s ceremony; and a changes job detects doc-only PRs for downstream short-circuit gating. After this PR, every doc that names a soon-to-be-deleted receiver, the hand-rolled release pipeline, or the custom self-telemetry surface carries a Status banner pointing at the upstream replacement and the release boundary (v0.1.0 / v0.2.0 / v0.3.0).

No application code paths change. No deletions yet. The next eleven PRs (per RFC-0013 §Migration) do the receiver removals, the OCB skeleton swap, the goreleaser stack adoption, the receivers-only module split, and the migration guides. Two scripts change to support the in-flight state: scripts/doc-check.sh recognizes a temporary tested-against: pending-rfc-0013-pr-a marker, and scripts/validator-recipe.sh skips recipes carrying it with an explicit notice naming the gating PR.

Scope summary

| Concern | Adopt | Delete | When |
| --- | --- | --- | --- |
| GPU telemetry | dcgm-exporter (NV) + ROCm/device-metrics-exporter (AMD) + xpumanager (Intel) + Habana exporter via prometheusreceiver | components/receivers/dcgm/ (cgo stub) | v0.1.0 |
| Container stdout | filelogreceiver + container stanza + file_storage | components/receivers/containerstdout/ (M15 alpha, PR #158) | v0.2.0 |
| Kernel events | journaldreceiver + filelogreceiver + OTTL Xid transform | components/receivers/kernelevents/ | v0.2.0 |
| K8s events | k8sobjectsreceiver + OTTL k8s.event.hint transform | components/receivers/k8sevents/ | v0.2.0 |
| Kueue scheduler | prometheusreceiver + bearer-token | components/receivers/kueue/ (never shipped) | v0.1.0 |
| Python profiling | parca-agent (eBPF) | components/receivers/pyspy/ + python/tracecore_pyspy/ + tools/pyspy-lint/ | v0.3.0 |
| Kineto | Deferred pending OTel Profiles GA | components/receivers/kineto/ (partial) | v0.1.0 |
| Heartbeat | telemetrygeneratorreceiver | components/receivers/clockreceiver/ | v0.1.0 |
| Self-telemetry | upstream componentstatus + service/telemetry + standard `otelcol_*` | internal/componentstatus/ + internal/selftelemetry/ + internal/telemetry/ | v0.1.0 |
| Release | goreleaser + slsa-github-generator + cosign-installer + anchore/sbom-action + actions/attest-build-provenance | hand-rolled .github/workflows/release.yml | v0.1.0 |
| Image publish | ko (push) + Renovate (pull) | (refactor) | v0.1.0 |
| Datadog / ClickHouse | OCB bundles datadogexporter + clickhouseexporter directly | "intermediate otelcol-contrib" recipe | v0.1.0 |

Customer-stable contracts preserved across the pivot

The following contracts are preserved: the k8s.event.hint 11-entry enum, kernelevents.xid, gpu.id, gpu.vendor, gen_ai.training.{rank,job_id}, the NCCL FlightRecorder span schema, and the pattern detector outputs (M17/M18/M19). Telemetry from the adopted receivers is normalized back to this stable surface by the OTTL transform processor in the bundled recipe, so operator alerts written against these contracts survive the receiver swap.

What tracecore still builds (the moat)

Bounded by RFC-0013 §6: NCCL FlightRecorder receiver (ncclfrreceiver), windowed cross-signal join processor (rankjoinprocessor), pattern engine + replay corpus (patterndetectorprocessor), install/overhead bench harness. Everything else comes from upstream + contrib.

Upstream contribution policy

Now first-class. Tracecore contributes patches upstream first; forks only when upstream rejects. When a contribution is in-flight, tracecore ships against a replace directive in go.mod pointing at the contribution branch; the replace is removed when the upstream tag lands.

Commits in this PR

  1. 1fe08c3 - RFC-0013 binding doc + index updates
  2. 2047abe - research notes superseded headers
  3. 81dc29e - patterns input-source references at adopted receivers
  4. ca9fbc6 - RFC bodies 0001-0012 supersede/revise headers (0004 archived)
  5. d06c929 - status banners on queued-for-deletion files (15 files)
  6. d116775 - receiver READMEs + RUNBOOKs status banners (14 files)
  7. c250509 - followups triage against RFC-0013 (21 files)
  8. 3edd226 - top-level orienting docs reframed (9 files)
  9. cb01365 - integration recipes + operator docs for OCB distro (10 files)
  10. 946af13 - golang.org/x/net v0.54.0 → v0.55.0 (GO-2026-5026, blocking pre-push)
  11. 4eb03c0 - em-dash + en-dash → hyphen across pivot surface (88 hits in 32 files); also fixes 3 broken link refs to archived/0004 and one fabricated test name in FAILURE-MODES.md
  12. de585ba - language tightening + dedupe paraphrase + ancillary doc Status banners
  13. 9ceeaef - merge origin/main (resolves M15 containerstdout (PR #158) alongside RFC-0013 v0.2.0 deletion banner; FAILURE-MODES.md gains containerstdout alert rows with upstream replacement column; go.mod takes both x/sys v0.45.0 and new x/time v0.14.0)
  14. 214e8a6 - scripts/validator-recipe.sh + scripts/doc-check.sh accept tested-against: pending-rfc-0013-pr-a; recipes routing through ./tracecore validate for not-yet-bundled exporters (datadog, clickhouse) carry the new marker and are skipped with an explicit notice until OCB-skeleton PR-A lands
  15. 59521f5 - drop bare PR-N comment-noise from validator-recipe.sh (gate flagged on first try)
  16. 55b05fe - merge origin/main (PR #164 go-deps group bump: fsnotify v1.10.1, jsonschema v6.0.2, otel/pdata v1.59.0; takes x/time v0.15.0 from main and keeps x/sys v0.45.0 from our security fix)
  17. 4b44efe - ci: empty commit to re-trigger verify-static (diagnostic; reproduced the failure deterministically)
  18. 5f15a0b - ci.yml: deepen verify-static checkout to fetch-depth: 0 so doc-check's diff-scope gates can resolve base_ref...HEAD merge base. Root cause: shallow CI checkout has no merge base; git diff base_ref...HEAD exits 128; pipefail propagates and kills scripts/doc-check.sh silently after the failure-mode label echo. Reproduced locally via shallow clone of /tmp/tracecore-cisim; fixed by git fetch --unshallow.
  19. ee98962 - CI cleanup (scope-expansion per user directive): delete 5 dead/redundant workflows totalling 536 LOC.
    • kernelevents-integration.yml (254 LOC) - receiver v0.2.0 deletion target per RFC-0013 §7
    • pyspy-integration.yml (107 LOC) - receiver v0.3.0 deletion target
    • python-publish.yml (82 LOC) - PyPI helper v0.3.0 deletion target
    • ci-hardware.yml.staged (66 LOC) - never enabled (.staged suffix excludes from GitHub Actions discovery)
    • govulncheck.yml (27 LOC) - duplicates verify-static's make govulncheck; security mitigation preserved via the required verify aggregator
  20. 61aff33 - .github/branch-protection.yml: drop require_linear_history. Setting flipped off in GitHub UI; manifest catches up. main is already linear via squash-merge; the rule was belt-and-suspenders that blocked squash whenever a feature branch absorbed a git merge origin/main conflict resolution (twice on this PR).
  21. ab648ac - Makefile: drop the @ prefix from doc-check's 6 sub-scripts so CI logs name the failing script. Previously a sub-script failure showed Makefile:229 Error 1 with no script name; tonight's debugging burned 90 minutes pinpointing which of 6 scripts died.
  22. 7f35b4b - doc-check simplification + actual CI fix: (a) split make doc-check into doc-check (3 sub-scripts: doc-check.sh, alert-check.sh, chart-appversion-check.sh) + doc-check-release (release-pipeline parity gates: release-doc-parity.sh, gh-attestation-flag-lint.sh). Per-PR drops 2 gates that have 0% catch on doc PRs. (b) Delete scripts/test-release-doc-parity.sh and its 7-file fixture tree (230 LOC) - "testing the test" meta-gate. (c) Root cause for the CI failure that triggered this cleanup: line 14 of .github/branch-protection.yml from commit 61aff33 contained # on PR #166 - a bare PR-N reference the comment-noise diff gate correctly rejects. The @-prefix-drop in ab648ac would have surfaced this on the next CI run; this commit removes the offending phrase so CI passes.
  23. 517c9ff - diagnostic: ALERT_CHECK_TRACE=1 env-flag-guarded set -x in scripts/alert-check.sh + workflow toggle. Surfaces the failing command on the next CI run.
  24. c616952 - velocity package (root cause + 4 long-term wins):
    • ROOT CAUSE for the 8-fail verify-static streak: scripts/alert-check.sh:97 used [ -f X ] && echo $r. When the test was false (3 of 8 RUNBOOKs have no sibling alerts.yaml: kineto, pyspy, nccl_fr), the && chain returned exit code 1. Under bash 4+ pipefail + set -e, that killed the script silently. macOS bash 3.2 didn't propagate; Ubuntu bash 5 did. Rewrote as if [ -f X ]; then echo $r; fi. Diagnostic trace removed.
    • Drop em-dash + en-dash diff gate (81 LOC). Agent prose habits reintroduce em-dashes faster than the gate justifies; this PR alone triggered 88 hits across 32 files. Operator alerts don't change meaning when an em-dash is used.
    • Whitelist .github/ and scripts/ in comment-noise diff gate. Both directories carry legitimate forward-looking notes (TODOs, status banners, gate rationale).
    • Add .githooks/prepare-commit-msg for auto Signed-off-by. DCO contract preserved by the commit-msg gate; the per-commit git commit -s ceremony goes away.
    • Add changes filter job in ci.yml that detects doc-only PRs. Downstream if: gating in a follow-on commit.
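
The root cause named in commit 24 can be reproduced in isolation. What follows is a minimal sketch, not the contents of scripts/alert-check.sh: the function names and the temp-directory layout are assumptions made for illustration. It shows only that a loop whose last iteration ends on a false `[ -f ... ] && ...` leaves exit status 1, while the `if` form leaves 0.

```shell
#!/usr/bin/env bash
# Hypothetical reproduction; names are illustrative and not taken from
# scripts/alert-check.sh.

# Prints each directory that has an alerts.yaml, using the && form.
# If the LAST directory lacks the file, the loop (and so the function)
# finishes with status 1.
list_with_and_chain() {
  for r in "$@"; do
    [ -f "$r/alerts.yaml" ] && echo "$r"
  done
}

# Same output, but an `if` with a false condition and no else yields 0.
list_with_if() {
  for r in "$@"; do
    if [ -f "$r/alerts.yaml" ]; then echo "$r"; fi
  done
}

demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/dcgm" "$demo_dir/kineto"
: > "$demo_dir/dcgm/alerts.yaml"   # kineto deliberately has no alerts.yaml

and_status=0
list_with_and_chain "$demo_dir/dcgm" "$demo_dir/kineto" >/dev/null || and_status=$?
if_status=0
list_with_if "$demo_dir/dcgm" "$demo_dir/kineto" >/dev/null || if_status=$?

echo "and-chain status: $and_status, if-form status: $if_status"
rm -rf "$demo_dir"
```

With `set -e` active, a bare call to the first function would abort the caller with no output of its own, which is consistent with the silent exit described above. The difference between bash 3.2 and bash 5 reported in the commit is not reproduced by this sketch.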

Linked issue(s)

No linked issue.

Test plan

  • make doc-check clean on every modified file
  • make ci (full pre-push) passed locally — lint + race tests + 30s fuzz + govulncheck + build + doc-check + release-doc-parity
  • govulncheck ./... reports zero vulnerabilities after the x/net bump
  • gh attestation / cosign flag set unchanged (parity gate intact per RFC-0013 §Migration PR-C)
  • scripts/validator-recipe.sh runs locally: 2 validated (honeycomb + otel-backend on bundled otlphttpexporter), 2 skipped (datadog + clickhouse, pending PR-A)
  • honeycomb + otel-backend recipes still carry tested-against: tracecore; the in-tree binary registers otlphttp so those exporters validate normally

Known follow-ups (not in this PR; tracked for next PRs per RFC-0013 §Migration)

  • docs/integrations/{datadog,clickhouse-direct}.md: flip tested-against: pending-rfc-0013-pr-a back to tracecore the moment PR-A merges; remove the pending-rfc-0013-pr-a case in scripts/{doc-check,validator-recipe}.sh at the same time. This temporary marker is the only deviation from the steady-state validator config.
  • docs/FAILURE-MODES.md retains references to internal/pipeline/* test paths that will be deleted with the v0.1.0 PR-F cut.
  • The tracecoreai/tracecore-components separate-repo module does not exist yet; created in PR-I per v0.2.0.

Release notes

[CHANGE] RFC-0013 adopted: tracecore pivots to a distribution-first posture. The binary is assembled via OpenTelemetry Collector Builder (OCB) from upstream + contrib components plus a thin tracecore-components module containing only the moat (NCCL FlightRecorder receiver, OTTL join processor, pattern engine, bench harness). Seven in-tree receivers and three internal self-telemetry packages are queued for deletion across v0.1.0 / v0.2.0 / v0.3.0; the release pipeline migrates to goreleaser + SLSA + cosign + ko. Customer-stable telemetry contracts (k8s.event.hint enum, kernelevents.xid, gpu.id, gpu.vendor, NCCL span schema) are preserved across the pivot via an OTTL normalization layer in the bundled recipe; operator alerts written against these survive the swap.

Checklist

  • Tests added or updated (doc + script gates updated transitively; validator-recipe.sh skip-marker mutation-verified)
  • make check runs green; pre-push hook passes
  • Commits are signed off (git commit -s)
  • For new components, follows the layout required by STYLE.md - n/a (no new components)

Tri Lam and others added 13 commits May 29, 2026 22:39
Pivots tracecore from a build-first OTel Collector to an OCB-assembled
distribution. Custom code shrinks to the pattern engine, OTTL processors
with windowed semantics, NCCL FlightRecorder parsing, and the install/
overhead bench harness. Seven in-tree receivers and three internal
packages are queued for deletion across v0.1/v0.2/v0.3.

Operator-stable telemetry contracts (k8s.event.hint enum, kernelevents.xid,
gpu.id, gpu.vendor, NCCL span schema) are preserved across the cut via an
OTTL normalization layer in the bundled recipe. Upstream contributions
become first-class — patches go upstream; forks only when upstream rejects.

Index updates: 0001/0002/0003 revised; 0004 superseded + archived;
0005/0006/0007/0009/0010/0011/0012 superseded; 0008 revised
(image-publish → ko + Renovate).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tri Lam <tri@maydow.com>
Adds Status headers to research notes whose conclusions have been
revised by the distribution-first pivot. Bodies retained as decision
history.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tri Lam <tri@maydow.com>
Pattern detection logic unchanged; input sources now come from
dcgm-exporter/ROCm DME/xpumanager/Habana via prometheusreceiver,
plus journaldreceiver+filelog for kernel/Xid signals, per RFC-0013 §2.
The pattern engine in tracecore-components stays in-tree as the moat.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tri Lam <tri@maydow.com>
Adds status header + redirect block to each affected RFC body.
Moves RFC-0004 to archived/ since no operator-visible behavior is preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tri Lam <tri@maydow.com>
Marks workflows, tool READMEs, internal/pipeline + telemetry READMEs,
python/ tree, components.yaml, and DCGM/kernelevents issue templates
with their RFC-0013 deletion or replacement schedule (v0.1.0 / v0.2.0 /
v0.3.0). No behavior changes — banners only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tri Lam <tri@maydow.com>
clockreceiver, dcgm, kueue, kineto, k8sevents, kernelevents, pyspy
README + RUNBOOK files now carry their RFC-0013 §7 deletion schedule
(v0.1.0 / v0.2.0 / v0.3.0) with pointers to the upstream replacement.
nccl_fr carries the module-migration banner instead — retained as the
one custom tracecore receiver moving to tracecoreai/tracecore-components.

No code changes — banners only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tri Lam <tri@maydow.com>
Adds Status blocks to every followup shard mapping items to RFC-0013
outcomes: [KEEP] for moat code (M19 pattern detectors, M11 nccl_fr);
[RECIPE] for items that move to bundled OTel YAML + OTTL processors;
[UPSTREAM] for items that become upstream-contribution targets;
[STRIKE] for items obsoleted by adoption; [DEFERRED] for Kineto
pending OTel Profiles GA; [AUDIT] where binding to OCB boot path is
unverified; [REOPENED] for previously-skipped items relevant under
adopt-first.

No history removed — Status blocks frame the new disposition while
preserving each shard's decision-history bullets verbatim. Top-level
docs/FOLLOWUPS.md gains an RFC-0013 pivot impact table indexing the
disposition per shard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tri Lam <tri@maydow.com>
Updates README, NORTHSTARS, STRATEGY (root + docs/), PRINCIPLES,
MILESTONES, CHANGELOG, CONTRIBUTING, AGENTS, docs/README around the
distribution-first posture: tracecore is an OCB-assembled OTel distro
plus a pattern library; in-house scope shrinks to the four moats in
RFC-0013 §6; upstream contributions become first-class policy.

Customer-stable telemetry contracts (k8s.event.hint enum,
kernelevents.xid, gpu.id, gpu.vendor, NCCL span schema) are preserved
across the pivot via OTTL normalization in the bundled recipe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tri Lam <tri@maydow.com>
Drops "intermediate otelcol-contrib" recipe from Datadog and ClickHouse
integrations — the OCB-assembled distro bundles those exporters
directly per RFC-0013 §1 + §2. Updates getting-started, FAILURE-MODES,
HARDWARE-TESTING, reproducibility, and CI notes to reflect deleted
receivers, the goreleaser-driven release pipeline, and the
customer-stable telemetry contracts that operators can rely on across
the pivot (RFC-0013 §3).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tri Lam <tri@maydow.com>
Pre-push govulncheck gate flagged GO-2026-5026 in x/net@v0.54.0
(idna.ToASCII fails to reject ASCII-only Punycode labels) via
components/receivers/kueue/scrape.go:155 → http.Client.Do.

The Kueue receiver is queued for deletion at v0.1.0 per RFC-0013 §7,
so the trace path itself goes away — but until then, the bump is the
narrowest fix that unblocks the pivot-doc PR without inviting fork
maintenance for an in-tree component already on the chopping block.

x/sys upgraded transitively v0.44.0 → v0.45.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tri Lam <tri@maydow.com>
doc-check's diff-scope gate forbids U+2014 (em-dash) and U+2013 (en-dash)
on PR-added lines. The phase 2/3/5/8/8b doc-pivot commits introduced 88
hits across 32 files. Replacement is mechanical: em-dash → " - ",
en-dash → "-". No semantic edits.
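
The commit does not show the command used for the replacement. One way the mechanical substitution described above could be written is sketched below; the function name is hypothetical, and the dash characters are built from their UTF-8 byte sequences so the script itself stays ASCII.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the mechanical replacement; not the command the
# commit actually used.

# UTF-8 byte sequences, written as octal escapes for printf.
EM_DASH=$(printf '\342\200\224')   # U+2014
EN_DASH=$(printf '\342\200\223')   # U+2013

# Filters stdin to stdout: em-dash -> " - ", en-dash -> "-".
replace_dashes() {
  sed -e "s/$EM_DASH/ - /g" -e "s/$EN_DASH/-/g"
}
```

Because the em-dash rule always inserts surrounding spaces, an em-dash that already had spaces around it ends up with doubled spaces; a real pass would probably need a follow-up whitespace cleanup.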

Also fixes three side issues surfaced during the same gate pass:
- FAILURE-MODES.md cited a non-existent test
  TestIntegration_TelemetryGeneratorToDebug; replaced with the real
  TestIntegration_SIGINT.
- CHANGELOG.md, docs/rfcs/0003 / 0007 / README, and the archived 0004
  body all carried links to docs/rfcs/0004-clockreceiver-stdoutexporter.md
  that broke when Phase 3 moved the file to docs/rfcs/archived/.
  Internal links inside the archived RFC also needed one extra "../".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tri Lam <tri@maydow.com>
Spot-audit pass on the RFC-0013 distribution-first pivot doc surface
(PR #166). Five categories of edits applied; no code or workflow
changes; no scope decisions in RFC-0013 §2/§3/§4/§7 touched.

Edits:

- **Pivot-name unification.** Adopt "distribution-first pivot" as the
  single name across the surface. Replaces "OCB pivot" in
  docs/FAILURE-MODES.md and "distro pivot" in docs/notes/ci.md.

- **"Adopt > build" unification.** Replaces "Adopt-first" /
  "adopt-first" with "Adopt > build" across CHANGELOG.md,
  docs/FOLLOWUPS.md, docs/followups/skipped.md, and the RFC-0013
  motivation lede - aligns with PRINCIPLES.md §16 and NORTHSTARS.md
  adoption-posture lede.

- **Integration-recipe trailing paraphrase trim.** The "---\nThe
  <exporter> ships bundled via OCB manifest per RFC-0013..." trailing
  block at the foot of docs/integrations/{datadog,clickhouse-direct,
  honeycomb,otel-backend}.md duplicated information already in each
  recipe's intro paragraph. Removed; "See also" pointers retained.

- **Status-block paraphrase trim.** docs/followups/M3.md and M13.md
  Status (RFC-0013) blocks paraphrased what the [KEEP]/[STRIKE]/
  [UPSTREAM]/[RECIPE] item tags already say. Collapsed to one-paragraph
  summaries pointing at RFC-0013 sections + tag convention.

- **Paragraph dedupe in cross-references.**
  - CHANGELOG.md "### Changed (RFC-0013 pivot)" first bullet (binary
    now assembled via OCB) duplicated the section opener immediately
    above. Dropped.
  - docs/FAILURE-MODES.md "Per-recipe failure modes" three-bullet
    paraphrase of the alert table below collapsed to a one-line
    pointer.
  - docs/patterns/README.md "Input sources (post-RFC-0013)" copied a
    subset of the RFC-0013 §2 adoption matrix verbatim. Replaced with
    a pointer to §2 plus the gpu.vendor + pattern-output context that
    adds beyond the RFC.
  - docs/rfcs/0013-distro-first-pivot.md "Migration / rollout"
    sentence-level tighten: "See §4 ..." -> "Release-boundary schedule
    in §4. PR sequencing follows."

Doc-check: 516 markdown links resolve; em-dash/en-dash diff gate
clean; comment-noise diff gate clean; banned-phrase lint clean;
all 4 integration recipes still carry tested-against + last-verified
markers under 180d.

No scope decisions altered. No code or workflow files touched.
Customer-stable contract attribute names (k8s.event.hint enum,
kernelevents.xid, gpu.id, gpu.vendor, gen_ai.training.{rank,job_id},
NCCL FlightRecorder span schema) preserved exactly per RFC-0013 §3.

Signed-off-by: Tri Lam <tri@maydow.com>
…tro-pivot

Signed-off-by: Tri Lam <tri@maydow.com>

# Conflicts:
#	MILESTONES.md
#	docs/FAILURE-MODES.md
#	go.mod
trilamsr enabled auto-merge (squash) May 30, 2026 06:40
Tri Lam and others added 2 commits May 29, 2026 23:44
Datadog + ClickHouse integration recipes route through `./tracecore
validate` per RFC-0013 §2, but `datadogexporter` and `clickhouseexporter`
are not registered in the in-tree binary until the OCB-skeleton PR
(RFC-0013 §Migration PR-A) lands. Without this skip, validator-recipe
fails CI for two recipes whose validation depends on a follow-on PR
in the migration sequence.

Adds `tested-against: pending-rfc-0013-pr-a` as a recognized marker:
- scripts/doc-check.sh accepts it alongside `tracecore` and `vX.Y.Z`.
- scripts/validator-recipe.sh skips the recipe with a clear notice
  naming the gating PR.
- docs/integrations/{datadog,clickhouse-direct}.md flip to the new
  marker. honeycomb + otel-backend stay on `tracecore` (the in-tree
  binary already registers otlphttp).

Flip back to `tested-against: tracecore` when PR-A merges.
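
The skip behavior described above might look roughly like the following. This is a sketch under stated assumptions, not the actual scripts/validator-recipe.sh: the function names and the notice wording are hypothetical, and the real validation step (`./tracecore validate`) is only referenced in a comment.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; not the actual scripts/validator-recipe.sh.

PENDING_MARKER='tested-against: pending-rfc-0013-pr-a'

# Prints "skip" if the recipe carries the temporary marker, else "validate".
recipe_action() {
  if grep -qF -- "$PENDING_MARKER" "$1"; then
    echo "skip"
  else
    echo "validate"
  fi
}

# Reports the planned action per recipe. A real script would run
# `./tracecore validate` in the VALIDATE branch instead of echoing.
plan_recipes() {
  for recipe in "$@"; do
    case "$(recipe_action "$recipe")" in
      skip)     echo "SKIP $recipe (pending RFC-0013 PR-A)" ;;
      validate) echo "VALIDATE $recipe" ;;
    esac
  done
}
```

Keeping the marker string in one variable means the later flip back to `tested-against: tracecore` only requires deleting the skip branch.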

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tri Lam <tri@maydow.com>
The bare PR-N reference triggered comment-noise lint. The marker name
itself names the gating PR; the comment was redundant.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tri Lam <tri@maydow.com>
trilamsr changed the title from "RFC-0013: distribution-first pivot (doc-only)" to "RFC-0013: distribution-first pivot" on May 30, 2026
Tri Lam and others added 4 commits May 29, 2026 23:54
PR #164 (go-deps group bump) landed x/time v0.15.0 + x/sys v0.45.0 +
fsnotify v1.10.1 + jsonschema v6.0.2 + otel/pdata v1.59.0 on main.
Our branch already had x/sys v0.45.0 from the x/net security fix
commit; merge takes the union with PR #164's x/time v0.15.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tri Lam <tri@maydow.com>
The verify-static job logs Error 1 from Makefile:229 (@scripts/alert-check.sh)
without any alert-check output, despite the script running clean locally
on the same SHA. Re-triggering to determine whether the failure reproduces
or was a runner-environment flake.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tri Lam <tri@maydow.com>
doc-check's em-dash and comment-noise gates compare PR-added lines
against origin/main via `git diff base_ref...HEAD` (three-dot range
requires a merge base). On the default shallow checkout, the merge
base does not exist and git exits 128. With `set -euo pipefail`
active in scripts/doc-check.sh, the pipeline `git diff ... | python3
-c ...` propagates 128, killing the script with no further output —
which is exactly what every recent verify-static failure on PR #166
looked like: the last visible echo was the failure-mode label line,
then `Makefile:229 Error 1` with nothing between.

Setting fetch-depth: 0 deepens the checkout to full history so the
merge base resolves. The two CI-only gates (em-dash + comment-noise
diff) now report `clean (vs origin/main)` instead of silently dying.
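
The propagation described above can be shown without a shallow clone. In this sketch a stand-in function exits 128 in place of `git diff base_ref...HEAD`, and `cat` stands in for the `python3 -c ...` consumer; both stand-ins and all function names are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Stand-in for `git diff base_ref...HEAD` on a shallow checkout, which the
# commit message reports as exiting 128. No git call is made here.
failing_diff() { return 128; }

# Stand-in for the downstream consumer of the diff.
consume() { cat >/dev/null; }

# Without pipefail, the pipeline reports the consumer's status (0).
status_without_pipefail() {
  (
    set +o pipefail
    rc=0
    failing_diff | consume || rc=$?
    echo "$rc"
  )
}

# With pipefail, the rightmost non-zero status in the pipeline is reported.
status_with_pipefail() {
  (
    set -o pipefail
    rc=0
    failing_diff | consume || rc=$?
    echo "$rc"
  )
}

echo "without pipefail: $(status_without_pipefail)"
echo "with pipefail:    $(status_with_pipefail)"
```

Under `set -e`, the 128 in the second case ends the script at that line, so nothing after the pipeline prints.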

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tri Lam <tri@maydow.com>
Removes five workflows that no longer carry signal for the
distribution-first repo:

- kernelevents-integration.yml (254 LOC) — receiver scheduled for
  deletion at v0.2.0 per RFC-0013 §7; the workflow gates a path
  that will not exist in main two releases out.
- pyspy-integration.yml (107 LOC) — receiver scheduled for deletion
  at v0.3.0; same logic.
- python-publish.yml (82 LOC) — PyPI helper deleted with pyspy at
  v0.3.0; no publish target remains.
- ci-hardware.yml.staged (66 LOC) — never enabled (the .staged
  suffix excludes it from GitHub Actions discovery); residue from a
  prior planning iteration.
- govulncheck.yml (27 LOC) — `verify-static` already runs
  `make govulncheck` on every PR; the standalone workflow is a
  duplicate scan against the same vuln database. Security mitigation
  is preserved via verify-static (a required check).

Net: 536 LOC removed from `.github/workflows/`. Per-PR runner
minutes drop. No coverage regression — every gate that fires on
behavior in this PR is still in place via verify-static, chart, and
the receiver-deletion schedule in RFC-0013 §7.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tri Lam <tri@maydow.com>
trilamsr changed the title from "RFC-0013: distribution-first pivot" to "RFC-0013: distribution-first pivot + CI surface cleanup" on May 30, 2026
Tri Lam and others added 7 commits May 30, 2026 00:30
Setting flipped off in GitHub UI today; manifest catches up.

Rationale: `main` is already linear via squash-merge (repo-level
"Allow merge commits: false"). The branch protection rule was
belt-and-suspenders but had a distinct cost - it blocked squash-
merge whenever the source branch absorbed a merge commit via
`git merge origin/main` conflict resolution (the standard pattern
per memory `feedback_branch_sync_via_merge`). That cost was paid
twice on PR #166 alone.

No coverage lost. Squash-merge still produces linear `main`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tri Lam <tri@maydow.com>
The doc-check target invokes 6 scripts under @ which silenced both
the invocation line and the script's prefix-less failure mode. When
one of them fails on CI with no output (the common case for shell
scripts under set -e + pipefail), the only signal was
"Makefile:229 Error 1" with no script name. Tonight's verify-static
debugging burned 90 minutes pinpointing which of the 6 scripts was
the actual failure.

Dropping @ makes make echo each `scripts/X.sh` line as it runs.
Same scripts, same gates, same exit code semantics; CI logs now
identify the failing script by line.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tri Lam <tri@maydow.com>
Three structural simplifications to the doc-check surface:

1. Move scripts/release-doc-parity.sh + scripts/gh-attestation-flag-lint.sh
   out of `make doc-check` and into `make doc-check-release`. These
   gates catch drift between .github/workflows/release.yml and
   docs/reproducibility.md — load-bearing for the release pipeline,
   but pure noise on doc PRs that don't touch release.yml. Per-PR
   surface drops two gates that have a 0% catch rate on doc PRs.

2. Delete scripts/test-release-doc-parity.sh and its 7-file fixture
   tree under scripts/testdata/release-doc-parity/. The script
   mutation-verifies release-doc-parity.sh against drift fixtures —
   it tests the gate, which is the canonical "testing the test"
   smell. The gate is small (97 LOC) and exercised by every release
   tag push; the meta-test adds maintenance without adding signal.

3. Fix the actual verify-static failure root cause: line 14 of
   .github/branch-protection.yml from commit 61aff33 contained the
   string "on PR #166" — a bare PR-N reference that the comment-noise
   diff gate (correctly) rejects. The Makefile @ prefix dropped in
   ab648ac would have surfaced this on the next CI run; this commit
   removes the offending phrase so CI passes without waiting.

Net: 230 LOC removed (43 meta-test + 187 fixtures). Per-PR doc-check
runs 3 scripts instead of 6 — alert-check, doc-check.sh, and
chart-appversion-check. Release-pipeline parity gates still run on
tag push via make doc-check-release.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tri Lam <tri@maydow.com>
The verify-static job has now failed 8 consecutive times with the same
opaque signature: `scripts/alert-check.sh` echoed (after dropping the
@-prefix in ab648ac), then `Error 1` 98ms later with zero alert-check
output. Local rc=0 on macOS bash 3.2 + BSD awk; rc=0 on Ubuntu 24.04
docker container with gawk; CI rc=1 on ubuntu-latest with the same git
checkout.

Multi-angle investigation ruled out:
- shallow checkout (fixed in 5f15a0b; fetch-depth: 0)
- bash version difference (3.2 vs 5.x — local + container both green)
- awk variant (BSD vs gawk — both green)
- file permissions (100755 in git index)
- script content (identical sha)

Adds a debug-only `set -x` guarded by ALERT_CHECK_TRACE=1 and turns
the flag on for the next CI run. The next failure will surface the
exact failing command on stderr. Remove the trace once the underlying
cause is fixed.
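
The commit message does not show the exact lines added to scripts/alert-check.sh. A flag-guarded trace of the kind described could be sketched as follows; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of an env-flag-guarded trace; not the actual diff.

enable_trace_if_requested() {
  # The if form is used so that a false test leaves status 0 behind.
  if [ "${ALERT_CHECK_TRACE:-0}" = "1" ]; then
    set -x   # echo each command to stderr before it runs
  fi
}

enable_trace_if_requested
echo "alert-check body would run here"
```

Setting `ALERT_CHECK_TRACE=1` in the workflow environment turns the trace on for one run without changing the script's default behavior.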

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tri Lam <tri@maydow.com>
Five changes, ordered by long-term impact:

1. ROOT CAUSE FIX for the 8-fail verify-static streak. The CI trace
   added in the previous commit pinpointed the exact failing command
   in scripts/alert-check.sh: line 97 used
   `[ -f X ] && echo $r`. When the test was false (3 of 8 RUNBOOKs
   have no sibling alerts.yaml today: kineto, pyspy, nccl_fr), the
   `&&` chain returned exit code 1. Under bash 4+ pipefail + set -e,
   that killed the script silently. macOS bash 3.2 was lax enough to
   not propagate; Ubuntu bash 5 strictly propagated. Rewrote the
   line as `if [ -f X ]; then echo $r; fi` - same semantics, safe rc.
   Removed the ALERT_CHECK_TRACE diagnostic since the cause is named.

2. DROP em-dash + en-dash diff gate (81 LOC). Agent-driven prose
   habits reintroduce em-dashes faster than the gate justifies; PR
   #166 alone triggered 88 hits across 32 files. Reviewers gain
   little from per-PR enforcement of a punctuation choice that
   doesn't change meaning.

3. WHITELIST .github/ and scripts/ in comment-noise diff gate. Both
   directories carry legitimate forward-looking notes (TODOs, status
   banners, gate rationale) that the noise gate falsely treats as
   rot. Markdown and source-code rules unchanged.

4. ADD .githooks/prepare-commit-msg for auto Signed-off-by. The DCO
   gate in commit-msg stays; the per-commit `git commit -s` ceremony
   goes away. PR #166 paid this cost twice on the first night.

5. ADD `changes` filter job in ci.yml that detects doc-only PRs.
   Downstream gates can short-circuit when the diff touches no Go
   source. Currently exposes the `code` output; gating sequence
   migration is the next commit.
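
For item 4, the contents of .githooks/prepare-commit-msg are not shown in this PR. A minimal sketch of an idempotent sign-off helper, with a hypothetical function name and pure shell text handling, might look like this:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a prepare-commit-msg style helper; the real
# .githooks/prepare-commit-msg may differ.

# Appends "Signed-off-by: NAME <EMAIL>" to the message file unless that
# exact line is already present.
add_signoff() {
  msg_file=$1
  trailer="Signed-off-by: $2 <$3>"
  if grep -qxF -- "$trailer" "$msg_file"; then
    return 0
  fi
  printf '\n%s\n' "$trailer" >> "$msg_file"
}

# In a hook, git passes the message file path as "$1", for example:
#   add_signoff "$1" "$(git config user.name)" "$(git config user.email)"
```

The existence check keeps amended commits from accumulating duplicate trailers. `git interpret-trailers` is likely the more robust tool for real use, since it understands trailer blocks rather than matching whole lines.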

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tri Lam <tri@maydow.com>
shellcheck SC2221/SC2222 flagged the path-filter case statement
because `*.md` already covers CHANGELOG.md / README.md / NORTHSTARS.md
etc; listing the top-level docs explicitly created overlapping
patterns. Same semantics with the simpler two-arm match.
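
A two-arm match of the kind described might look like the sketch below. The pattern list, the function names, and the choice to treat every non-doc path as code are assumptions; the actual filter in ci.yml is not shown in this PR.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a doc-only detector; not the contents of ci.yml.

# Two arms: docs vs. everything else. `*.md` already matches CHANGELOG.md,
# README.md, and nested Markdown files, so listing those names separately
# would only add overlapping patterns.
classify_path() {
  case "$1" in
    *.md|docs/*) echo "docs" ;;
    *)           echo "code" ;;
  esac
}

# Reads changed paths on stdin, one per line; prints code=true if any
# path is not a doc path, else code=false.
detect_code_changes() {
  code=false
  while IFS= read -r path; do
    [ -n "$path" ] || continue
    if [ "$(classify_path "$path")" = "code" ]; then
      code=true
    fi
  done
  echo "code=$code"
}
```

In a workflow step, the resulting `code=...` line could be appended to `$GITHUB_OUTPUT` so downstream jobs can gate on it.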

Also: verify-test failed on TestTailer_TruncationWithoutRotation
(components/receivers/containerstdout/tailer_test.go:347, "timed out
waiting for tailer line"). That test arrived via PR #158 (M15
containerstdout, merged on main earlier today) and is not introduced
by this PR. Flaky timeout on the slow ubuntu-latest runner. Triaging
separately via re-run; if it persists, a follow-up will be registered in
docs/FLAKY-TESTS.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tri Lam <tri@maydow.com>
TestTailer_TruncationWithoutRotation flaked on ubuntu-latest with
"timed out waiting for tailer line" at tailer_test.go:347.

Root cause: the test forces an in-place truncate (shrinks the file
below the tailer's stored offset) and then immediately appends new
content. For the tailer to surface the appended line, its poll loop
must fire DURING the narrow window when the file is shrunk - it
needs to observe size<offset, reset offset to 0, and rewind the file
pointer to start of file. If the append wins the race the file size
restores to 8 bytes (matching the pre-truncate offset) and the
truncate detection path never triggers; the tailer then sees no new
content and the test hits the 2s drainLine timeout.
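
The race can be modeled as a single decision the poll loop makes from two numbers. This is a simplified sketch under stated assumptions, not the tailer's Go code: `poll_action` is a hypothetical name, and the shrunk size of 0 is chosen for illustration (the test only requires a size below the stored offset).

```shell
#!/bin/sh
# Hypothetical model of one poll tick: given the stored offset and the
# file size observed at that instant, print the action taken.
poll_action() {
    offset=$1
    size=$2
    if [ "$size" -lt "$offset" ]; then
        echo "truncated: reset offset to 0"
    elif [ "$size" -gt "$offset" ]; then
        echo "read $((size - offset)) new bytes"
    else
        echo "idle: no new content"
    fi
}

stored_offset=8   # offset held by the tailer before the truncate

# Poll fires inside the window, while the file is still shrunk.
in_window=$(poll_action "$stored_offset" 0)

# Append wins the race: the size is back to 8 when the poll fires.
append_won=$(poll_action "$stored_offset" 8)

echo "$in_window"    # prints: truncated: reset offset to 0
echo "$append_won"   # prints: idle: no new content
```

In the second case the size equals the stored offset, so a size-only check cannot distinguish "nothing happened" from "truncated and refilled". That is the state in which the test waits out the drainLine timeout.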

The pre-fix sleep of 2x tailerTestPoll (20ms) is too tight under GC +
scheduler jitter on slow CI nodes. 10x (100ms) is comfortably under
tailerTestTimeout (2000ms) and stable across `go test -count=5 -race`
locally.

No production code change. The tailer correctly handles the truncate
path; the test was racing against its own setup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tri Lam <tri@maydow.com>
@trilamsr merged commit 3686576 into main May 30, 2026
18 checks passed
@trilamsr deleted the worktree-rfc-0013-distro-pivot branch May 30, 2026 08:13
trilamsr added a commit that referenced this pull request May 31, 2026
## What this PR does

Bundles three RFC-0013 PR slices that have zero file overlap with each
other.

### PR-C: release pipeline → goreleaser stack

- New `.goreleaser.yaml`: linux/amd64 + linux/arm64 builds; reproducible
via `SOURCE_DATE_EPOCH`; LDFLAGS shape matches the Makefile build
target.
- Rewritten `.github/workflows/release.yml`: invokes goreleaser,
`anchore/sbom-action`, `sigstore/cosign-installer`,
`slsa-framework/slsa-github-generator` (tag-pinned per SLSA OIDC subject
identity requirement; all other actions SHA-pinned per repo security
policy), `actions/attest-build-provenance`.
- Old `release.yml` moved to
`.github/workflows/archived/release.yml.legacy`.
- Goreleaser builds the **legacy** `cmd/tracecore` binary; OCB-output
migration deferred to PR-D (image build → ko), per inline comment in
`.goreleaser.yaml`.

### PR-G + PR-H: RFC supersession + top-level docs alignment

- Audit confirmed all 12 RFCs already carry the correct supersession
headers from prior pivot work (PRs #166/#168/#169/#170). Only two
top-level docs needed alignment:
  - `NORTHSTARS.md` O1 caveat: replaced "own-binary architecture"
    assumption wording with OCB-distribution-posture wording; closed Open
    Question #1 by reference to RFC-0013.
  - `CHANGELOG.md`: appended pivot-wave-1 PR list
    (#166/#168/#169/#170/#171/#172/#173) citing PR-A as the prior step
    before this commit.
- No edits needed to
README/STRATEGY/PRINCIPLES/MILESTONES/CONTRIBUTING/AGENTS/docs/README;
all are already aligned.

### PR-E: clockreceiver swap — BLOCKED

- `telemetrygeneratorreceiver` does not exist in
`opentelemetry-collector-contrib` at any version. Verified against the
Go module proxy, GitHub tree API at v0.95→v0.130, and the full receiver
listing at v0.110.0 (94 receivers; no `telemetrygenerator`, `loadgen`,
`mockreceiver`, `dummyreceiver`, or any `*generator*`). The RFC-0013 §1
example shape referenced it speculatively; it was never upstreamed.
- `builder-config.yaml`: replaced the misleading "no v0.110.0 tag"
omission comment with a verified TODO block describing the actual
blocker (receiver doesn't exist anywhere) and decision rationale.
- `bench/install/tracecore-values.yaml`: appended `[BLOCKED]` marker on
the clockreceiver→telgen mapping; bench continues to use in-tree
clockreceiver until PR-F deletes it (likely rewires to
`hostmetricsreceiver`).

## Root cause (PR-E blocker)

RFC-0013 §1 listed `telemetrygeneratorreceiver` as the swap target
without verifying the receiver existed upstream. Reality: the OTel
contrib repo has no such module path at any tag. PR-E cannot complete
until either (a) the receiver lands upstream, or (b) a different
replacement is chosen (e.g., `hostmetricsreceiver` for heartbeat
semantics). Tracked in the in-file TODO block; revisit in PR-F (delete
clockreceiver) or as a separate follow-up.

## Release notes

```release-notes
[CHANGE] Release pipeline migrated to goreleaser + SBOM + SLSA provenance + cosign signing. The release.yml workflow now invokes goreleaser instead of building binaries directly. Operators consuming release artifacts: artifact shape (filename, archive contents, checksum file format) follows goreleaser defaults; see CHANGELOG.md for the migration note.
```

## Test plan

- [x] `make verify` runs and passes
- [x] `make actionlint` passes (new release.yml workflow +
suppression-block YAML valid)
- [x] `make zizmor` passes (SLSA reusable-workflow tag-pin justified
inline + accepted)
- [x] `make build` (legacy) still works
- [x] `make build-ocb` (OCB) still works
- [ ] Goreleaser dry-run in CI on first push to a tag (gated until a tag
exists)

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>