Skip to content

[chore] kineto: stub MaxEvents in cap test, cut verify-test by ~390s#146

Merged
trilamsr merged 1 commit into
mainfrom
worktree-ci-kineto-maxevents-cap-stub
May 21, 2026
Merged

[chore] kineto: stub MaxEvents in cap test, cut verify-test by ~390s#146
trilamsr merged 1 commit into
mainfrom
worktree-ci-kineto-maxevents-cap-stub

Conversation

@trilamsr

@trilamsr trilamsr commented May 21, 2026

Copy link
Copy Markdown
Contributor

Summary

  • TestParse_MaxEventsCap built a 16M-event / ~1.1 GB JSON string and ran ~398s under -race, dominating verify-test wallclock (pkg/kineto = 392s of ~470s total). MaxEvents is excluded from -coverpkg, so the test contributed zero coverage signal; only the cap-path assertion.
  • Promote MaxEvents from const to var (mirrors the existing MaxBytes test-override idiom in the same file), have the test lower the cap to 100 via t.Cleanup, and drop the testing.Short() skip since the cheap form always belongs in CI.
  • Update RFC-0012 §"Resource bounds" wording from "package-level constants" to "package-level bounds" with a note that MaxBytes and MaxEvents are var (test overrides only); MaxStringBytes and MaxDepth remain const.

Receipts (local; M-series Mac)

Step Before After
pkg/kineto race tests 392s 1.5s
TestParse_MaxEventsCap ~398s <0.01s
make coverage-check (full) ~470s ~25s

Per-package coverage % is identical pre/post; verified by diffing both runs (only wallclock columns and a 0.1% pyspy attribution flake — a pre-existing wobble unrelated to this diff).

On CI ubuntu-latest, verify-test's coverage-check step (~125s) is expected to drop ~80–90s. pkg/kineto no longer dominates; the new tail is cmd/tracecore, components/receivers/kineto, components/receivers/kernelevents at ~17s each.

Risk surface

  • MaxEvents is now mutable in production binaries (same as MaxBytes has been). RFC-0012 contract: production does not mutate; no caller does. golangci-lint, go vet, gofumpt, go mod tidy, make doc-check all clean.
  • No t.Parallel() in pkg/kineto, so the swap is race-safe under the existing test layout. Future contributors adding t.Parallel() to a test that reads MaxEvents would race against TestParse_MaxEventsCap — same hazard the existing MaxBytes override has had since day one.
  • components/receivers/kineto/stress_2gb_test.go (14M-event synthesis) was audited: gated with //go:build !race + testing.Short() + a 3 GiB disk pre-check, so it is not in the race path.

Test plan

  • go test -race ./pkg/kineto/ — 1.5s, all tests pass
  • make coverage-check — 25s locally, all gates pass, coverage % identical to pre-PR
  • go vet ./pkg/kineto/...
  • go tool golangci-lint run ./pkg/kineto/...
  • go tool gofumpt -l pkg/kineto/ (no diff)
  • make doc-check (RFC-0012 reference resolves; em-dash gate clean)
  • make ci end-to-end via pre-push hook
  • CI verify-test job time recorded against the pre-PR baseline

Release notes

NONE

TestParse_MaxEventsCap built a 16M-event / ~1.1 GB JSON string and ran
~398s under -race, dominating verify-test wallclock (pkg/kineto = 392s
of ~470s total). MaxEvents is excluded from -coverpkg, so the test
contributed zero coverage signal — only the cap-path assertion.

Promote MaxEvents from const to var (mirrors existing MaxBytes idiom),
have the test lower the cap to 100, and remove the testing.Short skip
since the cheap form always belongs in CI. Override pattern matches
TestParse_MaxBytesCap_DistinguishedFromTruncated.

Receipts (local, M-series):
  pkg/kineto race tests:      392s -> 1.5s
  TestParse_MaxEventsCap:     398s -> <0.01s
  make coverage-check total:  470s -> 25s
Per-package coverage % identical pre/post (verified by diffing both
runs). On CI ubuntu-latest, verify-test's coverage-check step (~125s)
is expected to drop ~80-90s; pkg/kineto no longer dominates and the
tail (cmd/tracecore, receivers/kineto, receivers/kernelevents at ~17s
each) bounds the job.

RFC-0012 wording updated: "package-level constants" -> "package-level
bounds" with a note that MaxBytes and MaxEvents are vars (test
overrides only); MaxStringBytes and MaxDepth remain const.

stress_2gb_test.go (14M events) was audited — already gated with
//go:build !race and testing.Short(), so it is not in the race path.

Signed-off-by: Tri Lam <trilamsr@gmail.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tri Lam <trilamsr@gmail.com>
@trilamsr trilamsr merged commit 82bdc56 into main May 21, 2026
8 of 9 checks passed
@trilamsr trilamsr deleted the worktree-ci-kineto-maxevents-cap-stub branch May 21, 2026 03:32
trilamsr pushed a commit that referenced this pull request May 21, 2026
Sync feature branch with main per the merge-not-rebase policy
documented in CONTRIBUTING.md (commit ddf86f7).

Main moved 5 PRs ahead during this branch's lifetime:
- PR #143 (followups sweep)
- PR #134 (chaos.yml pattern-pod-evicted)
- PR #142 (follow-up curation)
- PR #144 (M6 integration recipes)
- PR #146 (kineto MaxEvents stub)
- PR #147 (followups bundle)

Conflicts expected in CHANGELOG.md and docs/followups/M3.md
(both additive).

# Conflicts:
#	CHANGELOG.md
trilamsr added a commit that referenced this pull request May 21, 2026
## Summary

Closes the long-standing chart-default-image gap. The chart's
`install/kubernetes/tracecore/values.yaml` has shipped with
`image.repository: ghcr.io/tracecoreai/tracecore` as the default since
M5b, but `release.yml` only ever published the binary + SBOM +
cosign-bundle + provenance as GitHub Release artifacts. Operators
following the chart's defaults could not `helm install`. RFC-0008 names
this path as the target operator-pull surface.

### Root cause

The chart's default `image.repository` and `release.yml`'s output set
drifted. The chart was deliberately specified against a future-state
image registry; the registry-publish job was tracked as a M3 follow-up
and not yet built. This PR closes the gap at the source by adding the
publish job, not by walking back the chart default.

### Architecture

- **Dockerfile** pins `gcr.io/distroless/static-debian12:nonroot` by
digest (`sha256:d093aa3e30...`). Non-root UID 65532 matches the chart's
`runAsUser`. CGO_ENABLED=0 makes `scratch` viable too, but distroless
gives a working CA bundle for the `otlphttp` exporter's HTTPS path and
tzdata for RFC3339 stamping with zero shell-attack surface. The
Dockerfile also declares `ARG SOURCE_DATE_EPOCH` so the determinism
contract is visible to a Dockerfile-only reader.
- The image consumes the **pre-built reproducible binary** from the
`build` job (`COPY release/$BINARY_BASENAME`), not a recompile. Image
reproducibility reduces to binary reproducibility (already gated) plus
the digest-pinned base layer plus `SOURCE_DATE_EPOCH` threaded through
buildkit's layer-rewrite (via the step `env:` block, not just
`--build-arg`).

### `release.yml` `image` job

- `needs: build` (downloads binary artifact, verifies SHA-256 matches
`build.outputs.digest` before push).
- `docker/build-push-action@v6.19.2` with `SOURCE_DATE_EPOCH` set via
both the step `env:` block AND `--build-arg` so buildkit's layer-rewrite
kicks in.
- Always tags `:TAG`. Floats `:latest` **only** on stable releases (no
`-` in the SemVer pre-release field), so a pre-release cannot silently
promote alpha bits to the chart's default-pull surface.
- `cosign sign --yes "$IMAGE_REPO@$DIGEST"`: signs by **digest**, not
tag. A registry rebuild of a floating tag would otherwise let an
attacker replace what `cosign verify` resolves.
- `cosign verify` smoke check pins the same identity binding the binary
blob already uses (`--certificate-github-workflow-ref refs/tags/$TAG`,
`--trigger push`).
- `attest-build-provenance` with `push-to-registry: true` attaches the
SLSA v1.0 provenance to the manifest in the registry, so a verifier
pulls everything from one place via `gh attestation verify oci://`.

Permissions: `id-token: write`, `attestations: write`, `packages:
write`. No long-lived registry credentials (GHCR auth uses the
workflow's `GITHUB_TOKEN`); no long-lived signing keys (cosign keyless
via OIDC).

### Docs

- `docs/reproducibility.md` grows two steps (8: resolve digest with
`crane digest`, then `cosign verify` by digest; 9: `gh attestation
verify oci://`) with the same identity-binding flags as the binary-side
steps. `crane` added to prerequisites.
- `install/kubernetes/tracecore/README.md` "Pre-release note" replaced
with the live-publish contract. Troubleshooting "ImagePullBackOff on
first install" entry updated with the Dockerfile-based local-build
workaround (was: "M3 release stream has not landed yet").
- `docs/followups/M3.md` "Container-image publish" item closed with the
HTML-comment + struck-italic convention used by the rows already closed
in that shard. New section "Items impossible to accomplish locally"
added for the three M21-trigger items (end-to-end push, oci://
attestation smoke, two-build image-digest equality) so a future
contributor does not file a "missing test" issue assuming the gap is
oversight.
- `CHANGELOG.md [Unreleased] ### Added` gains an M3 entry.

### Self-review fixes (commits 2-4)

Two rounds of self-review surfaced and closed:

**Round 1 (commit 2 — `7578feb`):**
- **F3:** `cosign triangulate --type digest` was the wrong tool. It
resolves the signature reference for a subject, not the subject's own
digest. Replaced with `crane digest` (canonical tag→digest resolver);
added `crane` to prerequisites.
- **F5:** `SOURCE_DATE_EPOCH` did not actually reach buildkit.
Build-args undeclared in the Dockerfile are silently ignored, so the
COPY layer's mtime was non-deterministic. Now threaded through both
`env:` block (buildkit layer-rewrite) and `ARG SOURCE_DATE_EPOCH`
(Dockerfile contract).
- **F1:** `release-doc-parity.sh` only covered the binary surface.
Extended with a parallel block for image-side `cosign verify`.
Mutation-verified.
- **F4:** Force-push comment overstated the SHA pin's guarantee.
Reworded to match the actual (binary-digest guard + tree-checkout)
closure.

**Round 2 (commit 3 — `7034e1a`, commit 4 — `459b686`):**
- **R1 (gh CLI semantic drift):** New
`scripts/gh-attestation-flag-lint.sh` parses `gh attestation verify
--help` and asserts every long flag used in `release.yml` +
`reproducibility.md` is still recognised by the installed CLI. Wired
into `make doc-check`. Mutation-verified (mutated `--help` output that
drops one flag → script exits 1 with fix hint).
- **R2 (distroless base digest rotation):** New
`scripts/base-digest-check.sh` compares the Dockerfile pin against
`crane digest gcr.io/distroless/static-debian12:nonroot`. Two modes:
`--warn` (default, exits 0) for periodic cadences and `--strict` (exits
non-zero) for M21 release-prep via `make base-digest-check`.
Deliberately NOT in `doc-check` (network + legitimate-lag).
Mutation-verified.
- **A++ #1 (gate-the-gate):**
`scripts/testdata/release-doc-parity/{intact,drift-binary,drift-image}/`
fixtures exercise both parity blocks;
`scripts/test-release-doc-parity.sh` drives them with WORKFLOW/DOC env
overrides and asserts expected exit codes. Mutation-verified: breaking
the image-side awk anchor in the gate makes the `intact` fixture fail.
- **R3 (`timeout-minutes`):** Out of scope (no per-job timeouts exist
anywhere in `release.yml` today). Documented as a M3 follow-up with
concrete per-job minute suggestions.
- Commit 4 fixes one em-dash the doc-check em-dash gate flagged in the
fixture README.

### Items impossible to accomplish locally (documented in
`docs/followups/M3.md`)

Three checks only become exercisable at M21 v0.1.0 (or any `vX.Y.Z` tag)
push time, because the `image` job is tag-triggered:

1. **End-to-end image push smoke against
`ghcr.io/tracecoreai/tracecore`.** Mitigations in place: `actionlint`,
`release-doc-parity.sh` image block, `gh-attestation-flag-lint.sh`,
binary-digest guard.
2. **`gh attestation verify "oci://$DIGEST"` against a real attestation
in the shape this pipeline emits.** No public OCI image carries a GitHub
Actions provenance attestation in matching shape, so the verifier
walkthrough cannot be smoke-tested end-to-end before M21.
`gh-attestation-flag-lint.sh` partially covers this by asserting
flag-name compatibility; semantic flag changes are the residual risk.
3. **Two-build digest equality for the image.** The `SOURCE_DATE_EPOCH`
plumbing claims image reproducibility, but the claim is only verifiable
by building twice at the same SHA and diff'ing the manifest digests. The
local dev environment currently lacks a working `docker buildx`; CI has
buildx but doubling the runner-time at every tag is a tradeoff worth
revisiting post-M21.

## Release notes

```release-notes
[FEATURE] Container images publish to ghcr.io/tracecoreai/tracecore:<TAG> on every release tag, signed and attested (cosign keyless + SLSA v1.0 provenance, both pushed to the registry). The Helm chart's default image.repository is now a live pull path. Verification walkthrough in docs/reproducibility.md steps 8-9.
```

## Test plan

- [x] `make ci` clean: golangci-lint, govulncheck, vet, mod-verify, RCE
gate, register-lint, actionlint, zizmor, all unit/race tests.
- [x] `make doc-check` clean (14 sub-gates including 3 new ones from
round 2: `test-release-doc-parity` (3 fixtures),
`gh-attestation-flag-lint` (6 flags), image-side `release-doc-parity`
block).
- [x] `actionlint` clean on `release.yml` after the `env:` block
addition.
- [x] `make base-digest-check` clean against live gcr.io (pinned digest
is current).
- [x] Mutation-verified: every new gate's failure mode (gh CLI flag
rename, Dockerfile digest forge, parity-script regex break).
- [x] Dockerfile validates by inspection: distroless base pinned by
digest; UID 65532 matches chart `runAsUser`; ENTRYPOINT/CMD shape allows
the chart's `args: [collect, --config=/etc/tracecore/config.yaml]` to
override cleanly; `ARG SOURCE_DATE_EPOCH` declared for
local-reproducibility.
- [ ] End-to-end image push exercise + `gh attestation verify oci://`
against a real attestation + two-build image-digest equality: impossible
locally; see "Items impossible to accomplish locally" above. First real
exercise will be M21 v0.1.0 (or any pre-release tag).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

### Update (commits 5-6, after main moved further)

While this PR was open, `main` advanced an additional 4 PRs (#143, #144,
#146, #147, #148, #149). Branch caught up via `git merge origin/main`
per the merge-not-rebase policy this PR also documents (see commit
`ddf86f7`).

- **Commit 5 — `ddf86f7`:** Adds the explicit branch-sync guidance to
`CONTRIBUTING.md`. Triggered by direct observation in this session that
the implicit "rebase to keep main linear" assumption was wrong
(`required_linear_history` on `main` only governs PR landing;
squash-merge collapses any feature-branch shape).
- **Commit 6 — `59e675c`:** Merge commit resolving conflicts in
`CHANGELOG.md` (additive) and `docs/followups/M3.md` (PR #143
partially-shipped row vs my closure HTML comment). `Makefile`
auto-merged cleanly with the new gate wires from commits 2-4 plus the
`validator-recipe` target from #144.

All 14 doc-check gates still green post-merge. The merge commit is
preserved on the branch (`git merge` with `--no-ff`); GitHub
squash-merge on the PR button will collapse it into the same
single-commit-on-main shape every other tracecore PR lands as.


### Update (commit 7 — `285640c`, A+ polish)

Self-review pass after the merge surfaced two cross-cutting hardening
items both worth one-line-per-job to land, and both gaps that would have
made the surface incomplete:

1. **`timeout-minutes` on every `release.yml` job** (build=20, sbom=15,
sign=10, provenance=10, image=20, release=10). GitHub's default ceiling
is 360m / 6h; a wedged push or hung Sigstore round-trip now fails fast
inside the per-job cap rather than burning a runner-hour. Caps chosen at
2-4x observed real wall-clock so transient ghcr/Sigstore weather doesn't
trip on healthy runs. Closes the M3.md row that previously held this out
as "opportunistic."
2. **`cosign verify-attestation --type slsaprovenance1` smoke check** in
the `image` job after `attest-build-provenance` pushes the SLSA v1
attestation to the registry. Uses the same identity binding
(refs/tags/$TAG + release.yml workflow path + push trigger) the
manifest-signature verify already enforces. Now every artifact this
pipeline publishes — binary blob, image manifest, image provenance — is
CI-verified inside the same run that produced it, against the same
identity claims a third-party verifier would reproduce offline.

`docs/followups/M3.md` also gains a new explicit "Out of scope for M3"
section rowing three items the self-review asked about: multi-arch image
build (`linux/arm64`), container vulnerability scan gate (trivy/grype),
and image SBOM sub-attestation (syft/cyclonedx with `--upload`). Each is
rowed with a trigger so a future audit can find them without commit
archaeology rather than ambiguously deferred.

`actionlint` clean on `release.yml`; `make doc-check` clean across all
gates including the new `release-doc-parity` image block,
`test-release-doc-parity` (3/3 fixtures), `gh-attestation-flag-lint` (6
flags), and `chart-appversion-check`.

---------

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Tri Lam <tri@maydow.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant