
[ci+docs] integration paths: substrate + MILESTONES rubric backfill#149

Merged
trilamsr merged 2 commits into
mainfrom
integration-paths-and-rubric-backfill
May 21, 2026


@trilamsr


Summary

Two commits, both follow-on from PR #147:

  1. [ci] integration paths: add substrate to kernelevents + pyspy (closes the workflow-paths audit gap surfaced in PR #147's docs/followups/otlphttp.md "Workflow paths trigger" row).
  2. [docs] MILESTONES: backfill rubric blocks for M1, M2, M4, M9 — closes docs/followups/M3.md "Backfill Foundation milestone rubrics" row.

Both follow-up rows are marked [x] in their respective shards with strike-through and ship-evidence.

Commit 1: integration workflow paths

PR #147's audit found that .github/workflows/kernelevents-integration.yml and pyspy-integration.yml paths: filters cover only:

  • components/receivers/<name>/**
  • internal/runtime/lifecycle/**

So a change to cmd/tracecore factory wiring, internal/pipeline contract, or internal/selftelemetry surface could land without re-running either integration suite — even though receiver behavior depends on all three substrates.

This commit adds the three substrate path patterns to both push: and pull_request: filters on both workflows. Symmetric with install-bench.yml (P3-Rev1 #10 fix) and chart.yml.

chaos.yml was audited and intentionally not changed — its substrate coupling runs via tools/failure-inject/** + internal/synthesis/** only.

actionlint clean on both workflows.
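
The audit above can be approximated with a small shell helper. This is a minimal sketch, not the check PR #147 actually ran: `missing_paths` is a hypothetical name, the glob spellings in the usage comment (`cmd/tracecore/**` and so on) are assumed from the directory names above, and the match is plain text rather than a YAML parse. It expects each pattern at least twice per workflow, once for push: and once for pull_request:.

```shell
#!/bin/sh
# Hypothetical helper: print every required path pattern that appears
# fewer than twice in a workflow file (once per push: filter, once per
# pull_request: filter). Fixed-string match only; it does not parse YAML.
missing_paths() {
  workflow_file=$1
  shift
  for pattern in "$@"; do
    # grep -c prints the number of matching lines, including 0.
    count=$(grep -Fc -- "$pattern" "$workflow_file" 2>/dev/null)
    count=${count:-0}   # unreadable file: treat as zero matches
    [ "$count" -ge 2 ] || printf '%s\n' "$pattern"
  done
}

# Usage (pattern spellings are assumptions, not copied from the workflows):
#   missing_paths .github/workflows/pyspy-integration.yml \
#     'cmd/tracecore/**' 'internal/pipeline/**' 'internal/selftelemetry/**'
```

Empty output means every pattern is present in both filters; any printed line is a pattern to add.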

Commit 2: MILESTONES rubric backfill

M1, M2, M4, M9 predate the per-rubric convention adopted in PR #53 and shipped as prose-only delivery summaries. This commit reformats each section to match M3 / M5b / M10+ shape:

  • Functional rubrics: block with bullets citing RFC sections or shipped file paths.
  • Non-functional rubrics: block for budget / policy / overhead guarantees.

Every claim was extracted from the existing prose summary; no new guarantees added. Source citations:

  • M1: RFC-0003 (Component / Host / Factory contracts, two-phase shutdown, push-based consumers, factory map, safe.Call, operator UX).
  • M2: RFC-0006 (/metrics + /healthz + /readyz, selftelemetry.Receiver, O2 SLO gauges, three OTel divergences closed).
  • M4: .golangci.yml + Makefile + scripts/ (no RFC; convention is the tooling files themselves).
  • M9: RFC-0007 (/dev/kmsg + journald via one source interface, NVRM Xid extraction, RE2 filters compile at Start, trace context propagation, non-Linux stubs, overhead budget).

Also fixed three stale docs/FOLLOWUPS.md references that survived the shard split (PR #132):

  • MILESTONES.md L210 M21 carry-forward → docs/followups/M3.md
  • MILESTONES.md L269 benchstat → docs/followups/opportunistic.md
  • MILESTONES.md L552 M8 carry-forward → docs/followups/M8.md

Files changed

| File | Commit | LOC |
|---|---|---|
| .github/workflows/kernelevents-integration.yml | 1 | +6 |
| .github/workflows/pyspy-integration.yml | 1 | +6 |
| docs/followups/otlphttp.md | 1 | +5/-5 (strike + audit closure) |
| MILESTONES.md | 2 | +48/-12 (4 rubric blocks + 3 link fixes) |
| docs/followups/M3.md | 2 | +5/-4 (strike) |

Release notes

NONE

Test plan

  • make ci green (verify, verify-lint, verify-static, verify-test, build, vet, golangci-lint, zizmor, actionlint, govulncheck, fuzz 30s).
  • bash scripts/doc-check.sh green (em-dash + en-dash diff gate clean, comment-noise diff gate clean, 452 markdown links resolve).
  • actionlint clean on both modified workflows.
  • bash scripts/no-autoupdate-check_test.sh 10/10 assertions pass.
  • release-doc-parity clean, chart-appversion-check clean, alert-check clean.
  • CI green on this branch.

Sequencing

Builds on PR #147 (merged) which surfaced both follow-up items. Independent of any currently-open PRs.

🤖 Generated with Claude Code

Tri Lam and others added 2 commits May 20, 2026 20:59
PR #147 audit found that kernelevents-integration.yml and
pyspy-integration.yml `paths:` filters cover only
`components/receivers/<name>/**` + `internal/runtime/lifecycle/**`.
A change to `cmd/tracecore` factory wiring, `internal/pipeline`
contract, or `internal/selftelemetry` surface could land without
re-running either integration suite, even though the receiver's
behavior depends on all three substrates.

Add the three substrate path patterns to both push and pull_request
filters on both workflows. Symmetric with install-bench.yml
(P3-Rev1 #10 fix) and chart.yml.

`chaos.yml` audited and intentionally not changed — its
substrate-coupling is via `tools/failure-inject/**` +
`internal/synthesis/**` only.

Mark `docs/followups/otlphttp.md` "Workflow paths trigger" row
shipped with the explicit list of paths added.

`actionlint` clean on both workflows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tri Lam <tri@maydow.com>
M1, M2, M4, M9 predate the per-rubric `☑` convention adopted in
PR #53 and shipped as prose-only delivery summaries. Reformat each
section to match M3 / M5b / M10+ shape:

- **Functional rubrics:** block with `☑` bullets citing RFC sections
  or shipped file paths.
- **Non-functional rubrics:** block for budget / policy / overhead
  guarantees.

Every claim was extracted from the existing prose summary; no new
guarantees added. Source citations point at RFC-0003 (M1),
RFC-0006 (M2), `.golangci.yml` + `Makefile` + `scripts/` (M4 has
no RFC; convention is the tooling files themselves), RFC-0007 (M9).

Strike `docs/followups/M3.md` "Backfill Foundation milestone
rubrics (M1, M2, M4, M9)" row — landed.

Also fix three stale `docs/FOLLOWUPS.md` references that survived
the shard split (PR #132):
- L210 M21 carry-forward → `docs/followups/M3.md`
- L269 benchstat → `docs/followups/opportunistic.md`
- L552 M8 carry-forward → `docs/followups/M8.md`

`make doc-check` green: em-dash + en-dash diff gate clean,
comment-noise diff gate clean, 437+ markdown links resolve.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tri Lam <tri@maydow.com>
@trilamsr trilamsr merged commit 00caaad into main May 21, 2026
17 checks passed
@trilamsr trilamsr deleted the integration-paths-and-rubric-backfill branch May 21, 2026 04:18
trilamsr added a commit that referenced this pull request May 21, 2026
## Summary

Closes the long-standing chart-default-image gap. The chart's
`install/kubernetes/tracecore/values.yaml` has shipped with
`image.repository: ghcr.io/tracecoreai/tracecore` as the default since
M5b, but `release.yml` only ever published the binary + SBOM +
cosign-bundle + provenance as GitHub Release artifacts. Operators
following the chart's defaults could not `helm install`. RFC-0008 names
this path as the target operator-pull surface.

### Root cause

The chart's default `image.repository` and `release.yml`'s output set
drifted. The chart was deliberately specified against a future-state
image registry; the registry-publish job was tracked as a M3 follow-up
and not yet built. This PR closes the gap at the source by adding the
publish job, not by walking back the chart default.

### Architecture

- **Dockerfile** pins `gcr.io/distroless/static-debian12:nonroot` by
digest (`sha256:d093aa3e30...`). Non-root UID 65532 matches the chart's
`runAsUser`. CGO_ENABLED=0 makes `scratch` viable too, but distroless
gives a working CA bundle for the `otlphttp` exporter's HTTPS path and
tzdata for RFC3339 stamping with zero shell-attack surface. The
Dockerfile also declares `ARG SOURCE_DATE_EPOCH` so the determinism
contract is visible to a Dockerfile-only reader.
- The image consumes the **pre-built reproducible binary** from the
`build` job (`COPY release/$BINARY_BASENAME`), not a recompile. Image
reproducibility reduces to binary reproducibility (already gated) plus
the digest-pinned base layer plus `SOURCE_DATE_EPOCH` threaded through
buildkit's layer-rewrite (via the step `env:` block, not just
`--build-arg`).

### `release.yml` `image` job

- `needs: build` (downloads binary artifact, verifies SHA-256 matches
`build.outputs.digest` before push).
- `docker/build-push-action@v6.19.2` with `SOURCE_DATE_EPOCH` set via
both the step `env:` block AND `--build-arg` so buildkit's layer-rewrite
kicks in.
- Always tags `:TAG`. Floats `:latest` **only** on stable releases (no
`-` in the SemVer pre-release field), so a pre-release cannot silently
promote alpha bits to the chart's default-pull surface.
- `cosign sign --yes "$IMAGE_REPO@$DIGEST"`: signs by **digest**, not
tag. A registry rebuild of a floating tag would otherwise let an
attacker replace what `cosign verify` resolves.
- `cosign verify` smoke check pins the same identity binding the binary
blob already uses (`--certificate-github-workflow-ref refs/tags/$TAG`,
`--trigger push`).
- `attest-build-provenance` with `push-to-registry: true` attaches the
SLSA v1.0 provenance to the manifest in the registry, so a verifier
pulls everything from one place via `gh attestation verify oci://`.

Permissions: `id-token: write`, `attestations: write`, `packages:
write`. No long-lived registry credentials (GHCR auth uses the
workflow's `GITHUB_TOKEN`); no long-lived signing keys (cosign keyless
via OIDC).
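
The `:latest` rule above can be sketched in a few lines of shell. This is an illustration under assumptions, not the step in `release.yml`: `is_stable_tag` and `image_tags_for` are hypothetical names, and the check treats any `-` anywhere in the tag as a pre-release marker, which is stricter than SemVer (build metadata after `+` may also contain `-`).

```shell
#!/bin/sh
# Hypothetical helper: succeed when the tag carries no "-" at all,
# i.e. no SemVer pre-release field (simplification: build metadata
# containing "-" would also be treated as a pre-release).
is_stable_tag() {
  case $1 in
    *-*) return 1 ;;
    *)   return 0 ;;
  esac
}

# Print the image tags a release would push, one per line:
# always the release tag, plus "latest" only for stable releases.
image_tags_for() {
  printf '%s\n' "$1"
  if is_stable_tag "$1"; then
    printf '%s\n' latest
  fi
}

# Usage:
#   image_tags_for v0.1.0        # release tag and latest
#   image_tags_for v0.1.0-rc.1   # release tag only
```

Erring toward "pre-release" on ambiguous tags keeps the failure mode safe: a stable release might miss its `:latest` float, but alpha bits never reach the chart's default pull path.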

### Docs

- `docs/reproducibility.md` grows two steps (8: resolve digest with
`crane digest`, then `cosign verify` by digest; 9: `gh attestation
verify oci://`) with the same identity-binding flags as the binary-side
steps. `crane` added to prerequisites.
- `install/kubernetes/tracecore/README.md` "Pre-release note" replaced
with the live-publish contract. Troubleshooting "ImagePullBackOff on
first install" entry updated with the Dockerfile-based local-build
workaround (was: "M3 release stream has not landed yet").
- `docs/followups/M3.md` "Container-image publish" item closed with the
HTML-comment + struck-italic convention used by the rows already closed
in that shard. New section "Items impossible to accomplish locally"
added for the three M21-trigger items (end-to-end push, oci://
attestation smoke, two-build image-digest equality) so a future
contributor does not file a "missing test" issue assuming the gap is
oversight.
- `CHANGELOG.md [Unreleased] ### Added` gains an M3 entry.

### Self-review fixes (commits 2-4)

Two rounds of self-review surfaced and closed:

**Round 1 (commit 2 — `7578feb`):**
- **F3:** `cosign triangulate --type digest` was the wrong tool. It
resolves the signature reference for a subject, not the subject's own
digest. Replaced with `crane digest` (canonical tag→digest resolver);
added `crane` to prerequisites.
- **F5:** `SOURCE_DATE_EPOCH` did not actually reach buildkit.
Build-args undeclared in the Dockerfile are silently ignored, so the
COPY layer's mtime was non-deterministic. Now threaded through both
`env:` block (buildkit layer-rewrite) and `ARG SOURCE_DATE_EPOCH`
(Dockerfile contract).
- **F1:** `release-doc-parity.sh` only covered the binary surface.
Extended with a parallel block for image-side `cosign verify`.
Mutation-verified.
- **F4:** Force-push comment overstated the SHA pin's guarantee.
Reworded to match the actual (binary-digest guard + tree-checkout)
closure.

**Round 2 (commit 3 — `7034e1a`, commit 4 — `459b686`):**
- **R1 (gh CLI semantic drift):** New
`scripts/gh-attestation-flag-lint.sh` parses `gh attestation verify
--help` and asserts every long flag used in `release.yml` +
`reproducibility.md` is still recognised by the installed CLI. Wired
into `make doc-check`. Mutation-verified (mutated `--help` output that
drops one flag → script exits 1 with fix hint).
- **R2 (distroless base digest rotation):** New
`scripts/base-digest-check.sh` compares the Dockerfile pin against
`crane digest gcr.io/distroless/static-debian12:nonroot`. Two modes:
`--warn` (default, exits 0) for periodic cadences and `--strict` (exits
non-zero) for M21 release-prep via `make base-digest-check`.
Deliberately NOT in `doc-check` (network + legitimate-lag).
Mutation-verified.
- **A++ #1 (gate-the-gate):**
`scripts/testdata/release-doc-parity/{intact,drift-binary,drift-image}/`
fixtures exercise both parity blocks;
`scripts/test-release-doc-parity.sh` drives them with WORKFLOW/DOC env
overrides and asserts expected exit codes. Mutation-verified: breaking
the image-side awk anchor in the gate makes the `intact` fixture fail.
- **R3 (`timeout-minutes`):** Out of scope (no per-job timeouts exist
anywhere in `release.yml` today). Documented as a M3 follow-up with
concrete per-job minute suggestions.
- Commit 4 fixes one em-dash the doc-check em-dash gate flagged in the
fixture README.
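
The R1 gate's core idea, checking used flags against `--help` output, might look roughly like the sketch below. It is not the contents of `scripts/gh-attestation-flag-lint.sh`: `unknown_flags` is a hypothetical name, the help text is passed in as a string so the check can be exercised without the CLI installed, and the flags in the usage comment are illustrative rather than the six the real gate covers.

```shell
#!/bin/sh
# Hypothetical helper: given help text ($1) and a list of long flags,
# print every flag the help text does not mention as a whole word.
# Assumes flags contain only letters, digits, and hyphens.
unknown_flags() {
  help_text=$1
  shift
  for flag in "$@"; do
    # The boundary classes stop --repo from matching inside --repo-foo.
    printf '%s\n' "$help_text" \
      | grep -Eq -- "(^|[^[:alnum:]-])${flag}([^[:alnum:]-]|\$)" \
      || printf '%s\n' "$flag"
  done
}

# Usage (flag names are examples only):
#   bad=$(unknown_flags "$(gh attestation verify --help)" --owner --repo)
#   [ -z "$bad" ] || { echo "unrecognised flags: $bad" >&2; exit 1; }
```

As the PR notes, this kind of check catches renamed or removed flags only; a flag that keeps its name but changes meaning passes unnoticed.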

### Items impossible to accomplish locally (documented in
`docs/followups/M3.md`)

Three checks only become exercisable at M21 v0.1.0 (or any `vX.Y.Z` tag)
push time, because the `image` job is tag-triggered:

1. **End-to-end image push smoke against
`ghcr.io/tracecoreai/tracecore`.** Mitigations in place: `actionlint`,
`release-doc-parity.sh` image block, `gh-attestation-flag-lint.sh`,
binary-digest guard.
2. **`gh attestation verify "oci://$DIGEST"` against a real attestation
in the shape this pipeline emits.** No public OCI image carries a GitHub
Actions provenance attestation in matching shape, so the verifier
walkthrough cannot be smoke-tested end-to-end before M21.
`gh-attestation-flag-lint.sh` partially covers this by asserting
flag-name compatibility; semantic flag changes are the residual risk.
3. **Two-build digest equality for the image.** The `SOURCE_DATE_EPOCH`
plumbing claims image reproducibility, but the claim is only verifiable
by building twice at the same SHA and diff'ing the manifest digests. The
local dev environment currently lacks a working `docker buildx`; CI has
buildx but doubling the runner-time at every tag is a tradeoff worth
revisiting post-M21.

## Release notes

```release-notes
[FEATURE] Container images publish to ghcr.io/tracecoreai/tracecore:<TAG> on every release tag, signed and attested (cosign keyless + SLSA v1.0 provenance, both pushed to the registry). The Helm chart's default image.repository is now a live pull path. Verification walkthrough in docs/reproducibility.md steps 8-9.
```

## Test plan

- [x] `make ci` clean: golangci-lint, govulncheck, vet, mod-verify, RCE
gate, register-lint, actionlint, zizmor, all unit/race tests.
- [x] `make doc-check` clean (14 sub-gates including 3 new ones from
round 2: `test-release-doc-parity` (3 fixtures),
`gh-attestation-flag-lint` (6 flags), image-side `release-doc-parity`
block).
- [x] `actionlint` clean on `release.yml` after the `env:` block
addition.
- [x] `make base-digest-check` clean against live gcr.io (pinned digest
is current).
- [x] Mutation-verified: every new gate's failure mode (gh CLI flag
rename, Dockerfile digest forge, parity-script regex break).
- [x] Dockerfile validates by inspection: distroless base pinned by
digest; UID 65532 matches chart `runAsUser`; ENTRYPOINT/CMD shape allows
the chart's `args: [collect, --config=/etc/tracecore/config.yaml]` to
override cleanly; `ARG SOURCE_DATE_EPOCH` declared for
local-reproducibility.
- [ ] End-to-end image push exercise + `gh attestation verify oci://`
against a real attestation + two-build image-digest equality: impossible
locally; see "Items impossible to accomplish locally" above. First real
exercise will be M21 v0.1.0 (or any pre-release tag).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

### Update (commits 5-6, after main moved further)

While this PR was open, `main` advanced by six more PRs (#143, #144,
#146, #147, #148, #149). The branch caught up via `git merge origin/main`
per the merge-not-rebase policy this PR also documents (see commit
`ddf86f7`).

- **Commit 5 — `ddf86f7`:** Adds the explicit branch-sync guidance to
`CONTRIBUTING.md`. Triggered by direct observation in this session that
the implicit "rebase to keep main linear" assumption was wrong
(`required_linear_history` on `main` only governs PR landing;
squash-merge collapses any feature-branch shape).
- **Commit 6 — `59e675c`:** Merge commit resolving conflicts in
`CHANGELOG.md` (additive) and `docs/followups/M3.md` (PR #143
partially-shipped row vs my closure HTML comment). `Makefile`
auto-merged cleanly with the new gate wires from commits 2-4 plus the
`validator-recipe` target from #144.

All 14 doc-check gates still green post-merge. The merge commit is
preserved on the branch (`git merge` with `--no-ff`); GitHub
squash-merge on the PR button will collapse it into the same
single-commit-on-main shape every other tracecore PR lands as.


### Update (commit 7 — `285640c`, A+ polish)

A self-review pass after the merge surfaced two cross-cutting hardening
gaps. Each costs about one line per job to close, and leaving either
open would have left the release surface incomplete:

1. **`timeout-minutes` on every `release.yml` job** (build=20, sbom=15,
sign=10, provenance=10, image=20, release=10). GitHub's default ceiling
is 360m / 6h; a wedged push or hung Sigstore round-trip now fails fast
inside the per-job cap rather than burning a runner-hour. Caps chosen at
2-4x observed real wall-clock so transient ghcr/Sigstore weather doesn't
trip on healthy runs. Closes the M3.md row that previously held this out
as "opportunistic."
2. **`cosign verify-attestation --type slsaprovenance1` smoke check** in
the `image` job after `attest-build-provenance` pushes the SLSA v1
attestation to the registry. Uses the same identity binding
(refs/tags/$TAG + release.yml workflow path + push trigger) the
manifest-signature verify already enforces. Now every artifact this
pipeline publishes — binary blob, image manifest, image provenance — is
CI-verified inside the same run that produced it, against the same
identity claims a third-party verifier would reproduce offline.

`docs/followups/M3.md` also gains an explicit "Out of scope for M3"
section with rows for three items the self-review asked about: multi-arch
image build (`linux/arm64`), container vulnerability scan gate
(trivy/grype), and image SBOM sub-attestation (syft/cyclonedx with
`--upload`). Each row names a trigger, so a future audit can find the
item without commit archaeology instead of treating it as ambiguously
deferred.

`actionlint` clean on `release.yml`; `make doc-check` clean across all
gates including the new `release-doc-parity` image block,
`test-release-doc-parity` (3/3 fixtures), `gh-attestation-flag-lint` (6
flags), and `chart-appversion-check`.

---------

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request Jun 2, 2026
…460) (#466)

## Summary

Closes #460. The `exit 0` in `scripts/doc-check.sh` fired whenever
`docs/FAILURE-MODES.md` carried no `Test*`/`Fuzz*`/`Benchmark*`
identifiers (its current state on `main`: `grep -c` = 0), silently
bypassing every gate below it. The fix scopes the skip to the Go-test
parity block only (if/else, not `exit`), then surfaces and fixes the dead
refs the gates were supposed to be catching.

## Root cause

Commit a57883f (#13) shipped `doc-check.sh` with one gate — the Go-test
name parity check — so `[ -z "$referenced" ] && exit 0` was correct
then. PRs #28, #56, #115, #131, #144, #149, #195, #234, #241, #443,
#455, #459 (and others) appended gates **below** that line without
recognising they'd become dead code whenever `FAILURE-MODES.md` lost its
`Test*` references. PR #459 worked around the bug by placing its new
YAML gate *above* line 99 and tracked the root cause separately as #460.
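
The failure mode is easier to see side by side. The sketch below is a reduced model of the two shapes, not an excerpt from `scripts/doc-check.sh`: the function names and status-line texts are made up, and each function body runs in a subshell so that `exit` ends only that function, the way it would end the real script.

```shell
#!/bin/sh
# Buggy shape: once the parity input is empty, the early exit also
# skips every gate that was appended below it.
run_gates_buggy() (
  referenced=$1
  echo "doc-check: gate above the parity block"
  [ -z "$referenced" ] && exit 0
  echo "doc-check: parity ok"
  echo "doc-check: link-rot ok"
)

# Fixed shape: only the parity block is conditional; later gates
# always run.
run_gates_fixed() (
  referenced=$1
  echo "doc-check: gate above the parity block"
  if [ -z "$referenced" ]; then
    echo "doc-check: parity skipped (no Test*/Fuzz*/Benchmark* refs)"
  else
    echo "doc-check: parity ok"
  fi
  echo "doc-check: link-rot ok"
)

# Usage: count status lines with an empty reference list.
#   run_gates_buggy '' | grep -c '^doc-check: '
#   run_gates_fixed '' | grep -c '^doc-check: '
```

With an empty reference list the buggy shape emits one status line and the fixed shape emits three, which is the same signal the gate-count table in this PR relies on.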

## What surfaced

Once `exit 0` was removed, three real issues fired:

1. **Dead `.md` link**: `docs/FOLLOWUPS.md` → `followups/otlphttp.md`.
The shard was never committed to `main`'s ancestry. Folded into the
existing "Shards deleted post-v0.2.0 as fully resolved-via-pivot" prose
block (sibling treatment to M9, M14, M16).
2. **Banned-phrase hits** (3x `production-grade`): reworded in
`docs/cut-criteria.yaml.md` (2x) and
`install/kubernetes/tracecore/README.md` (1x) to falsifiable language.
3. **`docs/getting-started.md` block cap**: 7 fenced bash/sh blocks. The
M6 cap of 5 was set for the quickstart only — `## Install via Helm` and
`## Air-gapped install` are alternate deployment paths that landed
post-M6 and aren't part of the quickstart budget. Rescoped the gate to
count blocks inside the `## Walkthrough` H2 section only (1 block, well
under cap).
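
Scoping the block count to one H2 section (item 3) can be done with a short `awk` pass. This is a sketch of the idea, not the gate's actual implementation: `count_section_blocks` is a hypothetical name, the heading must match the whole line exactly, and a line starting with `## ` inside a fenced block would be misread as a heading. The fence marker is built with `sprintf` so the example does not embed literal backtick fences.

```shell
#!/bin/sh
# Hypothetical helper: count fenced bash/sh blocks that open inside
# one H2 section of a markdown file.
#   $1 = markdown file, $2 = exact heading line, e.g. '## Walkthrough'
count_section_blocks() {
  awk -v heading="$2" '
    BEGIN { fence = sprintf("%c%c%c", 96, 96, 96) }  # three backticks
    # Every H2 line either enters the target section or leaves it.
    /^## / { in_section = ($0 == heading); next }
    # Count only opening fences tagged bash or sh.
    in_section && ($0 == fence "bash" || $0 == fence "sh") { n++ }
    END { print n + 0 }
  ' "$1"
}

# Usage:
#   blocks=$(count_section_blocks docs/getting-started.md '## Walkthrough')
#   [ "$blocks" -le 5 ] || { echo "quickstart block cap exceeded" >&2; exit 1; }
```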

## Gate count

Empirically verified via `grep -c '^doc-check: '` on `make doc-check`
output on a clean tree:

| State | Status lines emitted | Gates the early-exit was hiding |
|---|---|---|
| Pre-fix on `main` (post-#459) | 3 (trust-posture, YAML cross-link,
parity-skip) | 14 |
| Post-fix this PR (post-rebase) | 17 | 0 |

The "14 gates hidden" number is invariant across the rebase: it counts
gates placed below the early-exit line. The "3 → 17" total reflects
post-#459 reality on `main`; pre-#459 baseline was "2 → 16" (the figure
originally in this PR body), and #459 itself worked around the bug by
placing its YAML gate above line 99.

## Mutation tests

Each gate below the original early-exit was confirmed to fire post-fix:

| Mutation | Gate expected to fire | Exit code post-mutation | Exit code
post-restore |
|---|---|---|---|
| Inject `[bad](nonexistent-ghost.md)` into `docs/FOLLOWUPS.md` |
markdown link-rot | 1 | 0 |
| Append `blazing-fast` + `rock-solid` to `docs/getting-started.md` |
banned-phrase lint | 1 | 0 |
| Delete `<!-- tested-against: ... -->` from
`docs/integrations/datadog.md` | M6 recipe markers | 1 | 0 |
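
The mutate / restore loop behind the table can be sketched as a small harness. This is an assumed shape, not the procedure actually used for the table: `mutation_test` is a hypothetical name, and it only covers append-style mutations (like the banned-phrase row), not the deletion in the last row.

```shell
#!/bin/sh
# Hypothetical harness: append a mutation to a file, run a gate command,
# restore the file, and run the gate again.
#   $1 = file to mutate, $2 = line to append, rest = gate command
# Succeeds only if the gate passes clean, fails mutated, passes restored.
mutation_test() {
  target=$1
  mutation=$2
  shift 2
  backup=$(mktemp) || return 2
  cp "$target" "$backup"
  # A gate that fails on the clean tree cannot prove anything.
  "$@" >/dev/null 2>&1 || { rm -f "$backup"; echo "gate already failing"; return 2; }
  printf '%s\n' "$mutation" >> "$target"
  "$@" >/dev/null 2>&1
  mutated_status=$?
  cp "$backup" "$target"
  rm -f "$backup"
  "$@" >/dev/null 2>&1
  restored_status=$?
  echo "mutated=$mutated_status restored=$restored_status"
  [ "$mutated_status" -ne 0 ] && [ "$restored_status" -eq 0 ]
}

# Usage, mirroring the banned-phrase row:
#   mutation_test docs/getting-started.md 'blazing-fast' make doc-check
```

Checking the clean-tree run first matters: a gate that is already red would otherwise look like a successful mutation.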

## Test plan

- [x] `make doc-check` exits 0 on clean tree (re-run post-rebase onto
origin/main; 17 status lines)
- [x] 3 mutation tests above each toggle exit 1 → 0 across mutate /
restore
- [x] Pre-push hooks green: golangci-lint (0 issues), `go vet ./...`,
`go mod verify`, `attribute-namespace-check` (100 attrs, all
documented), `register-lint`, `actionlint`, `zizmor`,
`deprecation-check`, `no-autoupdate-check`
- [x] Rebased onto current `origin/main` (includes #459, #461, #462,
#456); no conflicts; gate count re-verified empirically post-rebase
- [x] No changes to gates above line 99 (the trust-posture callout +
YAML cross-link gate from #459 still run and emit unchanged status
lines)

## Self-grade

**A+** — root cause named in commit body (a57883f #13 with one gate;
gates appended below without exit-path awareness); 3 mutation tests
(success criteria required 1–2); rescoped the getting-started gate to
match M6 intent rather than papering over the surfaced overflow; the `[
-z "$referenced" ]` legitimate skip is preserved via if/else (not `:`
no-op, which would have left the `defined=` / `orphans=` block running
on empty input); gate count corrected empirically post-rebase per
reviewer B feedback.

```release-notes
- fix(ci): `scripts/doc-check.sh` no longer exits 0 at the Go-test parity gate when `docs/FAILURE-MODES.md` carries no `Test*` references. 14 gates below that line (link-rot, banned-phrase, M6 recipe markers, etc.) are now actually enforced on every `make doc-check` invocation. Closes #460.
```

---------

Signed-off-by: Tri Lam <tree@lumalabs.ai>