Skip to content

[m6] integration recipes: 4 backend recipes + validator-recipe CI#144

Merged
trilamsr merged 1 commit into
mainfrom
worktree-m6-integration-recipes
May 21, 2026
Merged

[m6] integration recipes: 4 backend recipes + validator-recipe CI#144
trilamsr merged 1 commit into
mainfrom
worktree-m6-integration-recipes

Conversation

@trilamsr

@trilamsr trilamsr commented May 20, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Ships docs/integrations/{otel-backend,honeycomb,datadog,clickhouse-direct}.md + matching docs/integrations/examples/*.yaml. Each recipe carries <!-- tested-against: ... --> + <!-- last-verified: YYYY-MM-DD --> HTML markers.
  • Extends scripts/doc-check.sh with eight new gates: (a) every recipe contains >=1 fenced yaml block whose first non-blank line names a file under docs/integrations/examples/; (b) tested-against marker present; (c) last-verified <=180 days old; (d) docs/README.md indexes every recipe; (e) docs/nps.md carries three canonical H3 survey headings (Recommend / Biggest change / Best part); (f) docs/FAILURE-MODES.md carries rows for vendor SDK failure / exporter unreachable / config invalid each citing a real Test* identifier; (g) docs/getting-started.md has <=5 fenced bash/sh blocks; (h) every docs/integrations/examples/*.yaml uses REPLACE_WITH_* placeholders, rejecting ${VAR} interpolation (tracecore does not expand env vars in YAML).
  • Adds scripts/validator-recipe.sh + new validator-recipe CI job. Tracecore-tagged recipes are validated by the in-tree ./tracecore validate --config=...; contrib-tagged recipes are validated by otelcol-contrib validate --config=... after SHA-256-verifying the pinned v0.152.0 archive against the upstream opentelemetry-collector-releases checksums file. The job feeds the existing verify aggregator.

Flips 8 M6 rubrics to shipped in MILESTONES.md. Carry-forward: wiring scripts/smoke.sh to exercise the docs/getting-started.md bash blocks (the <=5-count gate ships here; smoke today runs the dcgm example).

Why this shape

The M6 rubric splits per-recipe validation by scope: tracecore validate only knows in-tree components (here: clockreceiver, otlphttp); contrib-only exporters (datadog, clickhouse) need the upstream collector binary. The tested-against marker drives the dispatch deterministically. Pinning the contrib binary by SHA-256 against the upstream checksums file means a contrib re-release cannot silently change behavior; bumping the pin is a deliberate one-PR change.

The placeholder-style gate exists because tracecore does NOT expand environment variables in YAML — a recipe writing ${VAR} would deploy the literal string and fail at runtime in a way validate cannot catch. Mandating REPLACE_WITH_* makes the failure mode loud (the API rejects the literal) and forces the rendering step into the deploy pipeline.

Root cause for the rebase

The branch was opened ahead of main moving 5 commits forward (M19 chaos, M3 GHCR publish, follow-up sweep, comment-trim anti-regression, M6 doc index). GitHub reported mergeStateStatus: DIRTY and silently dropped all CI runs; gh pr checks 144 returned no checks reported. Resolved by rebasing onto origin/main, which produced one content conflict in docs/README.md (both branches added a new row under ## Subdirectories — kept both: followups/ from main, integrations/ from this PR).

Test plan

  • make doc-check exit 0 (15 gate lines clean including em-dash + comment-noise diff gates)
  • make validator-recipe exit 0 (2 validated locally, 2 skipped on darwin with linux-only notice)
  • make ci exit 0 locally (full path including zizmor + actionlint + tests)
  • Pre-push hook make ci exit 0 (verified via --force-with-lease push completing)
  • All 8 new doc-check gates mutation-verified (remove marker, stale date, missing example file, missing nps heading, missing FAILURE-MODES row, missing README index row, >5 getting-started blocks, ${VAR} placeholder in any example YAML)
  • All 4 example YAMLs empirically validated against their target binary: in-tree tracecore validate on otel-backend.yaml + honeycomb.yaml; otelcol-contrib 0.152.0 on datadog.yaml + clickhouse-direct.yaml
  • otelcol-contrib v0.152.0 SHA-256 pin verified against upstream opentelemetry-collector-releases_otelcol-contrib_checksums.txt
  • First CI run of validator-recipe job on a linux runner downloads + extracts the contrib binary, validates both contrib examples (the local check used the darwin_arm64 binary at the same source tag)

Carry-forward + follow-ups

Tracked in docs/followups/M6.md (new file in this PR):

  • docs/getting-started.md blocks exercised by scripts/smoke.sh (M6 rubric half). The <=5-count gate ships here; making smoke.sh execute the page's commands is a separate change with its own assertion surface. Noted in MILESTONES.md §M6 carry-forward.
  • Sandbox smoke test against real backends (validate-only catches schema; not credential rejection, region mismatch, quota throttling).
  • Per-recipe operator checklist (DNS / API key / quota match).
  • Additional backends (Grafana Cloud OTLP, AWS X-Ray + CloudWatch, Splunk SignalFx, Tempo / Mimir / Loki self-hosted).
  • CI contrib-binary cache via actions/cache keyed on tested-against marker.
  • Deprecation lint for tested-against markers older than ~18 months independent of last-verified.

🤖 Generated with Claude Code

docs/integrations/{otel-backend,honeycomb,datadog,clickhouse-direct}.md
plus matching docs/integrations/examples/*.yaml. Each recipe carries
tested-against + last-verified HTML markers. scripts/doc-check.sh adds
8 new gates (examples-file resolver, both markers, 180-day staleness,
README index parity, nps H3 set, getting-started <=5 blocks,
FAILURE-MODES required rows, placeholder-style lint that rejects
${VAR} interpolation since tracecore does not expand env vars in
YAML); scripts/validator-recipe.sh + new CI job run validate against
both binaries (in-tree tracecore for tracecore-tagged recipes; pinned
otelcol-contrib v0.152.0 by SHA-256 for contrib-tagged recipes). All
4 examples empirically validated against their target binary. All 8
new gates mutation-verified.

MILESTONES M6: 8 rubrics flipped to shipped; getting-started block
exercise wiring stays as carry-forward. Follow-ups (smoke.sh wiring,
sandbox smoke test, per-recipe operator checklist, additional
backends, CI contrib-binary cache, deprecation lint for
tested-against) tracked in docs/followups/M6.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tri Lam <trilamsr@gmail.com>
@trilamsr trilamsr force-pushed the worktree-m6-integration-recipes branch from 33b4322 to 4fa63a6 Compare May 21, 2026 03:31
@trilamsr trilamsr merged commit 052e0d5 into main May 21, 2026
10 checks passed
@trilamsr trilamsr deleted the worktree-m6-integration-recipes branch May 21, 2026 03:43
trilamsr added a commit that referenced this pull request May 21, 2026
#147)

## Summary

Single-PR bundle of 10 low-risk follow-up actions. Each row was
anchor-verified on `main` before editing; no production behavior change.
Diff is 10 files, +91/-37, dominated by markdown.

**Breakdown:**
- 3 strikes (anchor shipped, row was stale)
- 1 test-only struct add (k8sevents `NodeWatchErrors`)
- 1 bash test add (`no-autoupdate-check` hit-line format lock)
- 5 doc-only clarifications / partial-ship / audit notes

## Items applied

### Strikes — anchor on `main` confirms shipped

1. **M3.md L188 `make doc-check`.** `scripts/doc-check.sh` header reads
"verify every Test\*/Fuzz\*/Benchmark\* name referenced in docs"; wired
into `make doc-check` and `make ci`.
2. **M8.md L103 `docs/HARDWARE-TESTING.md` libdcgm + nv-hostengine
setup.** File exists (28 hits for libdcgm/dcgm/nv-hostengine); covers
Ubuntu 22.04 driver / `libdcgm-dev` / `nv-hostengine` provisioning,
x86_64 + aarch64-SBSA build matrix, and the `//go:build dcgm,hardware`
distinction. Doc shipped ahead of cgo client to unblock GPU-less
contributors.
3. **M19.md L18 `nodeWatchErrCount` not in SnapshotCounters.** Closed by
item 6 below — added `NodeWatchErrors` field symmetrically.

### Test-only struct add

4. **components/receivers/k8sevents/export_test.go.** Added
`NodeWatchErrors int64` field on `CountersForTest`; `SnapshotCounters`
now reads `rr.nodeWatchErrCount.Load()` symmetrically with
`rr.watchErrCount.Load()`. Both call sites are keyed-init inside the
same file; no external positional callers to break (grep confirmed: only
2 hits, both in `export_test.go`).

### Bash test add (M23 grep-gate format lock)

5. **scripts/no-autoupdate-check_test.sh "hit-line-format-stable".** New
assertion that runs the gate against the hyphenated-go-update fixture,
captures stdout (existing tests discard it), and asserts at least one
line matches `^[^:]+:[0-9]+:`. Locks the parseable hit-line shape
*before* the first automation consumer (CI summary, dashboard, Slack
notifier) wires up — a cosmetic tweak to the gate's message body now
fails CI instead of silently breaking downstream parsers. M23.md row
struck.

### Doc-only clarifications

6. **M15.md L185 falsifying-check backfill.** Anchored the
"/var/lib/tracecore/ subdir governance" row's grep-falsifying-check to
RFC-0010 §Proposal — `docs/rfcs/0010-containerstdout-receiver-scope.md`
L177/L217/L274/L393/L407 already carry the convention ("M15 owns
`/var/lib/tracecore/container_stdout/`. Future siblings reserve their
own subdirectories."). Row marked `[x]`.

7. **M15.md L192 + RFC-0010 §Pod-attribution forward-pointer.** Appended
one-line cross-reference at RFC-0010 L158 → `docs/followups/M15.md`
"Cross-receiver rank-label reconciliation" so the deferred audit trail
is discoverable from the RFC. Row marked `[x]`.

8. **M8.md L30 `tracecore debug dump` partial-ship.**
`cmd/tracecore/debug.go::runDebugDump` already writes version + revision
+ branch + build date + Go runtime stats + registered components +
redacted config to `tracecore debug dump > diagnostic.txt`. Remaining
gap is "last N samples" — needs receiver-side ring buffer (M2
carry-forward). Row kept open with partial-ship line +
remaining-trigger.

9. **M3.md L153 SUPPLY-CHAIN-IDENTITY.md scope clarification.** Added
one sentence noting the consolidation is a copy-and-deduplicate pass
against existing `release.yml` comment blocks (cosign-sign-blob,
gh-attestation-sign), not net-new authoring — so the next reader sees
the actual scope of work, not a misleading "30-min write" estimate that
implies green-field.

10. **otlphttp.md L182 workflow paths audit + M14.md L88 test pointer.**
- **otlphttp**: inlined audit findings (2026-05-20). `chart.yml` and
`install-bench.yml` are substrate-aware (include `cmd/tracecore/**`,
`internal/**`); `kernelevents-integration.yml` and
`pyspy-integration.yml` cover only `components/receivers/<name>/**` +
`internal/runtime/lifecycle/**` — a `cmd/tracecore` factory wiring or
`internal/pipeline` contract change can land without re-running these
integration jobs. `chaos.yml` covers `tools/failure-inject/**` +
`internal/synthesis/**` only (indirect coupling, acceptable). Remaining:
6-line YAML edit per integration workflow.
- **M14**: added inline pointer from the multi-retry slow-write fixture
row to the existing single-retry baseline at
`components/receivers/kineto/shutdown_test.go::TestIngest_RetryOnTruncated`
so the future author has the test-shape anchor.

## Files changed

| File | LOC | Kind |
|---|---|---|
| `components/receivers/k8sevents/export_test.go` | +2 | test struct
field |
| `scripts/no-autoupdate-check_test.sh` | +20 | bash test add |
| `docs/rfcs/0010-containerstdout-receiver-scope.md` | +1/-1 | inline
cross-ref |
| `docs/followups/M3.md` | +9/-5 | strike + scope clarification |
| `docs/followups/M8.md` | +16/-5 | strike + partial-ship |
| `docs/followups/M14.md` | +1/-1 | test pointer |
| `docs/followups/M15.md` | +15/-8 | 2 strikes |
| `docs/followups/M19.md` | +5/-9 | strike (anchored to test add) |
| `docs/followups/M23.md` | +9/-7 | strike |
| `docs/followups/otlphttp.md` | +13/-1 | audit findings inline |

## Test plan

- [x] `go test ./components/receivers/k8sevents/...` green.
- [x] `bash scripts/no-autoupdate-check_test.sh` 10/10 assertions pass
(added "hit-line-format-stable" — the 10th).
- [x] `bash scripts/doc-check.sh` green (437 markdown links resolve,
em-dash + en-dash diff gate clean, comment-noise diff gate clean).
- [x] Pre-commit hook ran full `make check` + `make ci` (all package
tests cached/passing).
- [ ] CI green on this branch.

## Release notes

```release-notes
NONE
```

## Sequencing

Builds on `main` after PRs #132 (shard split), #133 (RUNBOOK +
chart-appversion), #142 (opportunistic curation), #134 (chaos.yml row),
#143 (cross-shard audit). Independent of currently-open PRs #144 (m6
integration recipes) and #145 (m3 GHCR image publish).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
trilamsr pushed a commit that referenced this pull request May 21, 2026
Sync feature branch with main per the merge-not-rebase policy
documented in CONTRIBUTING.md (commit ddf86f7).

Main moved 5 PRs ahead during this branch's lifetime:
- PR #143 (followups sweep)
- PR #134 (chaos.yml pattern-pod-evicted)
- PR #142 (follow-up curation)
- PR #144 (M6 integration recipes)
- PR #146 (kineto MaxEvents stub)
- PR #147 (followups bundle)

Conflicts expected in CHANGELOG.md and docs/followups/M3.md
(both additive).

# Conflicts:
#	CHANGELOG.md
trilamsr added a commit that referenced this pull request May 21, 2026
## Summary

Closes the long-standing chart-default-image gap. The chart's
`install/kubernetes/tracecore/values.yaml` has shipped with
`image.repository: ghcr.io/tracecoreai/tracecore` as the default since
M5b, but `release.yml` only ever published the binary + SBOM +
cosign-bundle + provenance as GitHub Release artifacts. Operators
following the chart's defaults could not `helm install`. RFC-0008 names
this path as the target operator-pull surface.

### Root cause

The chart's default `image.repository` and `release.yml`'s output set
drifted. The chart was deliberately specified against a future-state
image registry; the registry-publish job was tracked as a M3 follow-up
and not yet built. This PR closes the gap at the source by adding the
publish job, not by walking back the chart default.

### Architecture

- **Dockerfile** pins `gcr.io/distroless/static-debian12:nonroot` by
digest (`sha256:d093aa3e30...`). Non-root UID 65532 matches the chart's
`runAsUser`. CGO_ENABLED=0 makes `scratch` viable too, but distroless
gives a working CA bundle for the `otlphttp` exporter's HTTPS path and
tzdata for RFC3339 stamping with zero shell-attack surface. The
Dockerfile also declares `ARG SOURCE_DATE_EPOCH` so the determinism
contract is visible to a Dockerfile-only reader.
- The image consumes the **pre-built reproducible binary** from the
`build` job (`COPY release/$BINARY_BASENAME`), not a recompile. Image
reproducibility reduces to binary reproducibility (already gated) plus
the digest-pinned base layer plus `SOURCE_DATE_EPOCH` threaded through
buildkit's layer-rewrite (via the step `env:` block, not just
`--build-arg`).

### `release.yml` `image` job

- `needs: build` (downloads binary artifact, verifies SHA-256 matches
`build.outputs.digest` before push).
- `docker/build-push-action@v6.19.2` with `SOURCE_DATE_EPOCH` set via
both the step `env:` block AND `--build-arg` so buildkit's layer-rewrite
kicks in.
- Always tags `:TAG`. Floats `:latest` **only** on stable releases (no
`-` in the SemVer pre-release field), so a pre-release cannot silently
promote alpha bits to the chart's default-pull surface.
- `cosign sign --yes "$IMAGE_REPO@$DIGEST"`: signs by **digest**, not
tag. A registry rebuild of a floating tag would otherwise let an
attacker replace what `cosign verify` resolves.
- `cosign verify` smoke check pins the same identity binding the binary
blob already uses (`--certificate-github-workflow-ref refs/tags/$TAG`,
`--trigger push`).
- `attest-build-provenance` with `push-to-registry: true` attaches the
SLSA v1.0 provenance to the manifest in the registry, so a verifier
pulls everything from one place via `gh attestation verify oci://`.

Permissions: `id-token: write`, `attestations: write`, `packages:
write`. No long-lived registry credentials (GHCR auth uses the
workflow's `GITHUB_TOKEN`); no long-lived signing keys (cosign keyless
via OIDC).

### Docs

- `docs/reproducibility.md` grows two steps (8: resolve digest with
`crane digest`, then `cosign verify` by digest; 9: `gh attestation
verify oci://`) with the same identity-binding flags as the binary-side
steps. `crane` added to prerequisites.
- `install/kubernetes/tracecore/README.md` "Pre-release note" replaced
with the live-publish contract. Troubleshooting "ImagePullBackOff on
first install" entry updated with the Dockerfile-based local-build
workaround (was: "M3 release stream has not landed yet").
- `docs/followups/M3.md` "Container-image publish" item closed with the
HTML-comment + struck-italic convention used by the rows already closed
in that shard. New section "Items impossible to accomplish locally"
added for the three M21-trigger items (end-to-end push, oci://
attestation smoke, two-build image-digest equality) so a future
contributor does not file a "missing test" issue assuming the gap is
oversight.
- `CHANGELOG.md [Unreleased] ### Added` gains an M3 entry.

### Self-review fixes (commits 2-4)

Two rounds of self-review surfaced and closed:

**Round 1 (commit 2 — `7578feb`):**
- **F3:** `cosign triangulate --type digest` was the wrong tool. It
resolves the signature reference for a subject, not the subject's own
digest. Replaced with `crane digest` (canonical tag→digest resolver);
added `crane` to prerequisites.
- **F5:** `SOURCE_DATE_EPOCH` did not actually reach buildkit.
Build-args undeclared in the Dockerfile are silently ignored, so the
COPY layer's mtime was non-deterministic. Now threaded through both
`env:` block (buildkit layer-rewrite) and `ARG SOURCE_DATE_EPOCH`
(Dockerfile contract).
- **F1:** `release-doc-parity.sh` only covered the binary surface.
Extended with a parallel block for image-side `cosign verify`.
Mutation-verified.
- **F4:** Force-push comment overstated the SHA pin's guarantee.
Reworded to match the actual (binary-digest guard + tree-checkout)
closure.

**Round 2 (commit 3 — `7034e1a`, commit 4 — `459b686`):**
- **R1 (gh CLI semantic drift):** New
`scripts/gh-attestation-flag-lint.sh` parses `gh attestation verify
--help` and asserts every long flag used in `release.yml` +
`reproducibility.md` is still recognised by the installed CLI. Wired
into `make doc-check`. Mutation-verified (mutated `--help` output that
drops one flag → script exits 1 with fix hint).
- **R2 (distroless base digest rotation):** New
`scripts/base-digest-check.sh` compares the Dockerfile pin against
`crane digest gcr.io/distroless/static-debian12:nonroot`. Two modes:
`--warn` (default, exits 0) for periodic cadences and `--strict` (exits
non-zero) for M21 release-prep via `make base-digest-check`.
Deliberately NOT in `doc-check` (network + legitimate-lag).
Mutation-verified.
- **A++ #1 (gate-the-gate):**
`scripts/testdata/release-doc-parity/{intact,drift-binary,drift-image}/`
fixtures exercise both parity blocks;
`scripts/test-release-doc-parity.sh` drives them with WORKFLOW/DOC env
overrides and asserts expected exit codes. Mutation-verified: breaking
the image-side awk anchor in the gate makes the `intact` fixture fail.
- **R3 (`timeout-minutes`):** Out of scope (no per-job timeouts exist
anywhere in `release.yml` today). Documented as a M3 follow-up with
concrete per-job minute suggestions.
- Commit 4 fixes one em-dash the doc-check em-dash gate flagged in the
fixture README.

### Items impossible to accomplish locally (documented in
`docs/followups/M3.md`)

Three checks only become exercisable at M21 v0.1.0 (or any `vX.Y.Z` tag)
push time, because the `image` job is tag-triggered:

1. **End-to-end image push smoke against
`ghcr.io/tracecoreai/tracecore`.** Mitigations in place: `actionlint`,
`release-doc-parity.sh` image block, `gh-attestation-flag-lint.sh`,
binary-digest guard.
2. **`gh attestation verify "oci://$DIGEST"` against a real attestation
in the shape this pipeline emits.** No public OCI image carries a GitHub
Actions provenance attestation in matching shape, so the verifier
walkthrough cannot be smoke-tested end-to-end before M21.
`gh-attestation-flag-lint.sh` partially covers this by asserting
flag-name compatibility; semantic flag changes are the residual risk.
3. **Two-build digest equality for the image.** The `SOURCE_DATE_EPOCH`
plumbing claims image reproducibility, but the claim is only verifiable
by building twice at the same SHA and diff'ing the manifest digests. The
local dev environment currently lacks a working `docker buildx`; CI has
buildx but doubling the runner-time at every tag is a tradeoff worth
revisiting post-M21.

## Release notes

```release-notes
[FEATURE] Container images publish to ghcr.io/tracecoreai/tracecore:<TAG> on every release tag, signed and attested (cosign keyless + SLSA v1.0 provenance, both pushed to the registry). The Helm chart's default image.repository is now a live pull path. Verification walkthrough in docs/reproducibility.md steps 8-9.
```

## Test plan

- [x] `make ci` clean: golangci-lint, govulncheck, vet, mod-verify, RCE
gate, register-lint, actionlint, zizmor, all unit/race tests.
- [x] `make doc-check` clean (14 sub-gates including 3 new ones from
round 2: `test-release-doc-parity` (3 fixtures),
`gh-attestation-flag-lint` (6 flags), image-side `release-doc-parity`
block).
- [x] `actionlint` clean on `release.yml` after the `env:` block
addition.
- [x] `make base-digest-check` clean against live gcr.io (pinned digest
is current).
- [x] Mutation-verified: every new gate's failure mode (gh CLI flag
rename, Dockerfile digest forge, parity-script regex break).
- [x] Dockerfile validates by inspection: distroless base pinned by
digest; UID 65532 matches chart `runAsUser`; ENTRYPOINT/CMD shape allows
the chart's `args: [collect, --config=/etc/tracecore/config.yaml]` to
override cleanly; `ARG SOURCE_DATE_EPOCH` declared for
local-reproducibility.
- [ ] End-to-end image push exercise + `gh attestation verify oci://`
against a real attestation + two-build image-digest equality: impossible
locally; see "Items impossible to accomplish locally" above. First real
exercise will be M21 v0.1.0 (or any pre-release tag).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

### Update (commits 5-6, after main moved further)

While this PR was open, `main` advanced an additional 4 PRs (#143, #144,
#146, #147, #148, #149). Branch caught up via `git merge origin/main`
per the merge-not-rebase policy this PR also documents (see commit
`ddf86f7`).

- **Commit 5 — `ddf86f7`:** Adds the explicit branch-sync guidance to
`CONTRIBUTING.md`. Triggered by direct observation in this session that
the implicit "rebase to keep main linear" assumption was wrong
(`required_linear_history` on `main` only governs PR landing;
squash-merge collapses any feature-branch shape).
- **Commit 6 — `59e675c`:** Merge commit resolving conflicts in
`CHANGELOG.md` (additive) and `docs/followups/M3.md` (PR #143
partially-shipped row vs my closure HTML comment). `Makefile`
auto-merged cleanly with the new gate wires from commits 2-4 plus the
`validator-recipe` target from #144.

All 14 doc-check gates still green post-merge. The merge commit is
preserved on the branch (`git merge` with `--no-ff`); GitHub
squash-merge on the PR button will collapse it into the same
single-commit-on-main shape every other tracecore PR lands as.


### Update (commit 7 — `285640c`, A+ polish)

Self-review pass after the merge surfaced two cross-cutting hardening
items both worth one-line-per-job to land, and both gaps that would have
made the surface incomplete:

1. **`timeout-minutes` on every `release.yml` job** (build=20, sbom=15,
sign=10, provenance=10, image=20, release=10). GitHub's default ceiling
is 360m / 6h; a wedged push or hung Sigstore round-trip now fails fast
inside the per-job cap rather than burning a runner-hour. Caps chosen at
2-4x observed real wall-clock so transient ghcr/Sigstore weather doesn't
trip on healthy runs. Closes the M3.md row that previously held this out
as "opportunistic."
2. **`cosign verify-attestation --type slsaprovenance1` smoke check** in
the `image` job after `attest-build-provenance` pushes the SLSA v1
attestation to the registry. Uses the same identity binding
(refs/tags/$TAG + release.yml workflow path + push trigger) the
manifest-signature verify already enforces. Now every artifact this
pipeline publishes — binary blob, image manifest, image provenance — is
CI-verified inside the same run that produced it, against the same
identity claims a third-party verifier would reproduce offline.

`docs/followups/M3.md` also gains a new explicit "Out of scope for M3"
section rowing three items the self-review asked about: multi-arch image
build (`linux/arm64`), container vulnerability scan gate (trivy/grype),
and image SBOM sub-attestation (syft/cyclonedx with `--upload`). Each is
rowed with a trigger so a future audit can find them without commit
archaeology rather than ambiguously deferred.

`actionlint` clean on `release.yml`; `make doc-check` clean across all
gates including the new `release-doc-parity` image block,
`test-release-doc-parity` (3/3 fixtures), `gh-attestation-flag-lint` (6
flags), and `chart-appversion-check`.

---------

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request Jun 2, 2026
…460) (#466)

## Summary

Closes #460. The `exit 0` on `scripts/doc-check.sh` ran unconditionally
whenever `docs/FAILURE-MODES.md` carried no `Test*`/`Fuzz*`/`Benchmark*`
identifiers (its current state on `main` — `grep -c` = 0), silently
bypassing every gate below it. Fix scopes the skip to the Go-test parity
block only (if/else, not `exit`), then surfaces and fixes the dead refs
the gates were supposed to be catching.

## Root cause

Commit a57883f (#13) shipped `doc-check.sh` with one gate — the Go-test
name parity check — so `[ -z "$referenced" ] && exit 0` was correct
then. PRs #28, #56, #115, #131, #144, #149, #195, #234, #241, #443,
#455, #459 (and others) appended gates **below** that line without
recognising they'd become dead code whenever `FAILURE-MODES.md` lost its
`Test*` references. PR #459 worked around the bug by placing its new
YAML gate *above* line 99 and tracked the root cause separately as #460.

## What surfaced

Once `exit 0` was removed, three real issues fired:

1. **Dead `.md` link**: `docs/FOLLOWUPS.md` → `followups/otlphttp.md`.
The shard was never committed to `main`'s ancestry. Folded into the
existing "Shards deleted post-v0.2.0 as fully resolved-via-pivot" prose
block (sibling treatment to M9, M14, M16).
2. **Banned-phrase hits** (3x `production-grade`): reworded in
`docs/cut-criteria.yaml.md` (2x) and
`install/kubernetes/tracecore/README.md` (1x) to falsifiable language.
3. **`docs/getting-started.md` block cap**: 7 fenced bash/sh blocks. The
M6 cap of 5 was set for the quickstart only — `## Install via Helm` and
`## Air-gapped install` are alternate deployment paths that landed
post-M6 and aren't part of the quickstart budget. Rescoped the gate to
count blocks inside the `## Walkthrough` H2 section only (1 block, well
under cap).

## Gate count

Empirically verified via `grep -c '^doc-check: '` on `make doc-check`
output on a clean tree:

| State | Status lines emitted | Gates the early-exit was hiding |
|---|---|---|
| Pre-fix on `main` (post-#459) | 3 (trust-posture, YAML cross-link,
parity-skip) | 14 |
| Post-fix this PR (post-rebase) | 17 | 0 |

The "14 gates hidden" number is invariant across the rebase: it counts
gates placed below the early-exit line. The "3 → 17" total reflects
post-#459 reality on `main`; pre-#459 baseline was "2 → 16" (the figure
originally in this PR body), and #459 itself worked around the bug by
placing its YAML gate above line 99.

## Mutation tests

Each gate below the original early-exit was confirmed to fire post-fix:

| Mutation | Gate expected to fire | Exit code post-mutation | Exit code
post-restore |
|---|---|---|---|
| Inject `[bad](nonexistent-ghost.md)` into `docs/FOLLOWUPS.md` |
markdown link-rot | 1 | 0 |
| Append `blazing-fast` + `rock-solid` to `docs/getting-started.md` |
banned-phrase lint | 1 | 0 |
| Delete `<!-- tested-against: ... -->` from
`docs/integrations/datadog.md` | M6 recipe markers | 1 | 0 |

## Test plan

- [x] `make doc-check` exits 0 on clean tree (re-run post-rebase onto
origin/main; 17 status lines)
- [x] 3 mutation tests above each toggle exit 1 → 0 across mutate /
restore
- [x] Pre-push hooks green: golangci-lint (0 issues), `go vet ./...`,
`go mod verify`, `attribute-namespace-check` (100 attrs, all
documented), `register-lint`, `actionlint`, `zizmor`,
`deprecation-check`, `no-autoupdate-check`
- [x] Rebased onto current `origin/main` (includes #459, #461, #462,
#456); no conflicts; gate count re-verified empirically post-rebase
- [x] No changes to gates above line 99 (the trust-posture callout +
YAML cross-link gate from #459 still run and emit unchanged status
lines)

## Self-grade

**A+** — root cause named in commit body (a57883f #13 with one gate;
gates appended below without exit-path awareness); 3 mutation tests
(success criteria required 1–2); rescoped the getting-started gate to
match M6 intent rather than papering over the surfaced overflow; the `[
-z "$referenced" ]` legitimate skip is preserved via if/else (not `:`
no-op, which would have left the `defined=` / `orphans=` block running
on empty input); gate count corrected empirically post-rebase per
reviewer B feedback.

```release-notes
- fix(ci): `scripts/doc-check.sh` no longer exits 0 at the Go-test parity gate when `docs/FAILURE-MODES.md` carries no `Test*` references. 14 gates below that line (link-rot, banned-phrase, M6 recipe markers, etc.) are now actually enforced on every `make doc-check` invocation. Closes #460.
```

---------

Signed-off-by: Tri Lam <tree@lumalabs.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant