diff --git a/docs/2026-Q1-retrospective.md b/docs/2026-Q1-retrospective.md index 6e38850d..98a25a1a 100644 --- a/docs/2026-Q1-retrospective.md +++ b/docs/2026-Q1-retrospective.md @@ -1,6 +1,6 @@ # 2026-Q1 retrospective -**Scope.** Backfilled per [`v1-rc1-governance-gaps.md`](v1-rc1-governance-gaps.md) §3. +**Scope.** Backfilled per [`history/v1-rc1/governance-gaps.md`](history/v1-rc1/governance-gaps.md) §3. Covers the project's first epoch: repo scaffold (2026-05-07) through end of the v0.2.0 / v0.3.x pivot wave (2026-05-31). The strict calendar quarter (Jan–Mar 2026) is pre-repo; this retro adopts the project-quarter @@ -62,7 +62,7 @@ targets, then this retro for the snapshot. ≥80% by M12; current state is governance-doc-only coverage with zero rules touching `components/`, `internal/`, `module/`, `install/`, `scripts/`, `tools/`, `python/`, `docs/`, or `bench/`. - Audited in [`v1-rc1-governance-gaps.md`](v1-rc1-governance-gaps.md) §1. + Audited in [`history/v1-rc1/governance-gaps.md`](history/v1-rc1/governance-gaps.md) §1. Remediation lands inside the rc1 cap as an in-repo issue. 2. **Lint-enforced principles at 4 of 16, not the ≥6 target.** Strict @@ -70,7 +70,7 @@ targets, then this retro for the snapshot. (errcheck/errorlint), §13 (loggercheck/contextcheck). PRINCIPLES.md grew §16 ("Adopt > build") this quarter without the KPI denominator being updated from the NORTHSTARS-original "15" → "16" — drift - recorded in [`v1-rc1-governance-gaps.md`](v1-rc1-governance-gaps.md) §2. + recorded in [`history/v1-rc1/governance-gaps.md`](history/v1-rc1/governance-gaps.md) §2. 3. **Quarterly retro discipline missed (this doc is the backfill).** NORTHSTARS O6 lists the retro as a supporting KPI; absence is a @@ -156,17 +156,17 @@ memory). Source for each row in parentheses. 
| Quarterly ship-commitment hit-rate (O6 hero) | ≥80% | v0.1.0-m1, v0.2.0, module/v0.3.0 cut; pattern library exceeded 6-pattern goal (12 docs shipped) | `git tag --sort=creatordate`; [`docs/patterns/`](patterns/) | | Quarterly retro published | published | this doc (backfilled) | [`docs/2026-Q1-retrospective.md`](2026-Q1-retrospective.md) | | RFC accept/reject/supersede log | logged per quarter in retro | RFC section above | RFC `Status:` headers | -| `make ci` runtime | <60s on dev laptop | 148s measured (split mitigation in #362) | [`v1-rc1-governance-gaps.md`](v1-rc1-governance-gaps.md) §5 | -| CODEOWNERS coverage of code paths (O7) | ≥80% by M12 | 0% (governance docs only) | [`v1-rc1-governance-gaps.md`](v1-rc1-governance-gaps.md) §1 | -| Lint-enforced principles (O7) | ≥6 of 16 | 4 of 16 strict; 9 of 16 if counting scripted gates | [`v1-rc1-governance-gaps.md`](v1-rc1-governance-gaps.md) §2 | -| Maintainer count with merge authority (O7) | ≥3 by M9 | 1 | [`docs/maintainership.md`](maintainership.md); [`v1-rc1-governance-gaps.md`](v1-rc1-governance-gaps.md) §6 | +| `make ci` runtime | <60s on dev laptop | 148s measured (split mitigation in #362) | [`history/v1-rc1/governance-gaps.md`](history/v1-rc1/governance-gaps.md) §5 | +| CODEOWNERS coverage of code paths (O7) | ≥80% by M12 | 0% (governance docs only) | [`history/v1-rc1/governance-gaps.md`](history/v1-rc1/governance-gaps.md) §1 | +| Lint-enforced principles (O7) | ≥6 of 16 | 4 of 16 strict; 9 of 16 if counting scripted gates | [`history/v1-rc1/governance-gaps.md`](history/v1-rc1/governance-gaps.md) §2 | +| Maintainer count with merge authority (O7) | ≥3 by M9 | 1 | [`docs/maintainership.md`](maintainership.md); [`history/v1-rc1/governance-gaps.md`](history/v1-rc1/governance-gaps.md) §6 | | Release cadence (O6) | ≥1 minor / quarter | 1 minor (v0.2.0) + 1 module-tag (module/v0.3.0) | `git tag` | | Time-to-merge p50 / p90 (O6) | <7d / <14d | not measured this quarter; instrument in 2026-Q2 retro | gh PR API 
(deferred) | **Carry-forward into 2026-Q2 retro:** - Close CODEOWNERS coverage gap (in-repo, follow-up issue from - `v1-rc1-governance-gaps.md` §1). + `history/v1-rc1/governance-gaps.md` §1). - Reconcile PRINCIPLES.md §10 line with current `make ci` budget. - Measure PR time-to-merge p50/p90 across the v0.3.x wave. - Track RFC-0015+ decisions (if any) and re-snapshot the RFC table. diff --git a/docs/MILESTONES.md b/docs/MILESTONES.md index 680356f7..09f32c2b 100644 --- a/docs/MILESTONES.md +++ b/docs/MILESTONES.md @@ -149,7 +149,7 @@ M20a/b/c are gates against the same artifact (`bench/install/run.sh`) at progres - **Depends on:** M3, M5b, M6, ≥3 receivers at alpha (M8 partial; M10/M13/M15/M16 from Lanes 4-5; M11/M12 from Lane 6 if flood gate open) - **NORTHSTARS coupling:** NORTHSTARS.md O1 targets 3 patterns covered at M6/v0. If the flood gate has not opened by M21, only M19 (pattern #14, GPU-independent) is guaranteed; M17 (pattern #1) and M18 (pattern #6, build-time coupled to M17's `cross_rank.go`) are at risk. Either the flood gate opens before M21 or NORTHSTARS O1 is explicitly relaxed in M21's release notes with written reason - silent divergence is a blocker per PRINCIPLES §15. - **Carry-forward from M3:** asset-shape reconciliation owed at the v0.1.0 cut. M3's `release.yml` publishes raw `tracecore__linux_amd64` (not the `.tar.gz` line 156 names), `*.cosign.bundle` (not the detached `*.sig` line 156 names), and `*.intoto.jsonl` (file is a Sigstore bundle JSON, not in-toto JSONL; extension is the de-facto convention but misleads sniff-by-extension tooling). M21 decides: keep raw-binary + bundle, switch to tar.gz + detached `.sig`, and pick a stable name for the Sigstore-bundle artifact. 
The hardening backlog (SLSA L3, build-env sanitization, CycloneDX `mod`→`app`, cosign / `gh attestation` flag tightening, nightly drift cron, repo tag-protection on `v*`, CI Actions linter, github-actions Dependabot, Rekor log-index in release notes) lives in [`docs/followups/M3.md`](followups/M3.md) "M3 release-pipeline hardening (post-PR #28)". -- **v1.0-rc1 operational-gap dependency:** the three gates between the current pipeline and a `v1.0-rc1` cut (SLSA L3 prerequisites, air-gapped install path, DaemonSet upgrade-rollback) are audited in [`docs/v1-rc1-operational-gaps.md`](v1-rc1-operational-gaps.md) with per-section remediation steps, effort estimates, and blockers. The doc's "Cross-cut" minimum bar (air-gap docs + M20 row; upgrade-rollback doc-reconcile + `minReadySeconds`) gates rc1; SLSA L3 stays an O3 stretch goal. +- **v1.0-rc1 operational-gap dependency:** the three gates between the current pipeline and a `v1.0-rc1` cut (SLSA L3 prerequisites, air-gapped install path, DaemonSet upgrade-rollback) are audited in [`docs/history/v1-rc1/operational-gaps.md`](history/v1-rc1/operational-gaps.md) with per-section remediation steps, effort estimates, and blockers. The doc's "Cross-cut" minimum bar (air-gap docs + M20 row; upgrade-rollback doc-reconcile + `minReadySeconds`) gates rc1; SLSA L3 stays an O3 stretch goal. **Rubric summary:** Signed annotated tag `v0.1.0` (`git tag -v` passes). GitHub release ships `tracecore_v0.1.0_linux_amd64.tar.gz` + CycloneDX SBOM + SLSA `*.intoto.jsonl` provenance + cosign `*.sig`; post-release CI asserts presence + `cosign verify-blob` + `slsa-verifier verify-artifact` succeed on fresh checkout. `CHANGELOG.md` `## [0.1.0]` with `### Added`; release notes link `getting-started.md` + ≥1 `integrations/*`; ISSUE_TEMPLATE YAML lints clean; synthesis gate enumerates contributing milestones. 
NFR: `make release` on `v0.1.0` byte-identical via `diffoscope` (P0); pinned action SHAs + `id-token:write`/`contents:write` only + zero `zizmor`/`actionlint` findings; SBOM `len(components) >= direct_dep_count`; `SECURITY.md` referenced + private-vuln reporting verified enabled at release time. diff --git a/docs/NORTHSTARS.md b/docs/NORTHSTARS.md index 679706df..c359153d 100644 --- a/docs/NORTHSTARS.md +++ b/docs/NORTHSTARS.md @@ -197,7 +197,7 @@ Seven lines of work. Each has one accountable owner role, one hero KPI, supporti **Operating rule:** *Trust under load is the product* ([`PRINCIPLES.md`](../PRINCIPLES.md) §1). Any supply-chain regression - broken reproducibility, missing signatures, lapsed SBOM, missed disclosure SLA - is P0 and blocks the next release. **Caveats:** -- SLSA L3 requires hermetic, parameterless builds signed by trusted infrastructure. The build system needs the work - not just the policy. **rc1 posture (2026-06):** L2 build platform + L3-grade provenance attestation infrastructure (Sigstore-Fulcio keyless under GHA OIDC, recorded in Rekor) — see [`v1-rc1-operational-gaps.md`](v1-rc1-operational-gaps.md#1-slsa-l3-prerequisites) §1 for the full posture and the upstream blocker (OCB-style generated entrypoints + missing pre-build hook in the trusted reusable workflow, tracked at [slsa-framework/slsa-github-generator#2483](https://github.com/slsa-framework/slsa-github-generator/issues/2483)). The architectural gap applies to both the build job and the user-defined sign job (`sign` consumes `package`'s artifact across a job boundary — the same "build influences signing" pattern L3 forbids), so both inherit the same deferral against M12. +- SLSA L3 requires hermetic, parameterless builds signed by trusted infrastructure. The build system needs the work - not just the policy. 
**rc1 posture (2026-06):** L2 build platform + L3-grade provenance attestation infrastructure (Sigstore-Fulcio keyless under GHA OIDC, recorded in Rekor) — see [`history/v1-rc1/operational-gaps.md`](history/v1-rc1/operational-gaps.md#1-slsa-l3-prerequisites) §1 for the full posture and the upstream blocker (OCB-style generated entrypoints + missing pre-build hook in the trusted reusable workflow, tracked at [slsa-framework/slsa-github-generator#2483](https://github.com/slsa-framework/slsa-github-generator/issues/2483)). The architectural gap applies to both the build job and the user-defined sign job (`sign` consumes `package`'s artifact across a job boundary — the same "build influences signing" pattern L3 forbids), so both inherit the same deferral against M12. - Reproducibility policy lives here (P0 designation, response); the CI gate that *enforces* it lives in O2. Same artifact, two homes - by design. - Disclosure SLAs are aspirational until tracecore has a security inbox in place; [`SECURITY.md`](../SECURITY.md) must name the contact before the SLA clock starts publicly. diff --git a/docs/README.md b/docs/README.md index 97e5e7bf..26a0650f 100644 --- a/docs/README.md +++ b/docs/README.md @@ -28,7 +28,7 @@ Legend: 👤 operator · 🛠️ contributor · 🏛️ maintainer · 🌐 exter | [maintainership.md](maintainership.md) | 🏛️ 🛠️ | Governance: who has commit access, how RFCs are sponsored, how security issues are handled. | | [ATTRIBUTES.md](ATTRIBUTES.md) | 👤 🛠️ | Customer-stable attribute namespace inventory + soft-lock policy. Every `pattern.*` / `tracecore.*` / `hw.gpu.*` / `k8s.*` / `nccl.fr.*` / `kernelevents.*` / `gen_ai.training.*` key the collector emits or consumes, with stability tags and the v0.4-advisory → v1.0-enforced rename policy. | | [v1-rc1-cut-criteria.md](v1-rc1-cut-criteria.md) | 🏛️ | Twelve falsifiable rubrics for the `v1.0.0-rc1` cut (deriving from NORTHSTARS O1-O7) + Tier-2 GA path-clearing items + out-of-scope deferrals. 
Authoritative rubric source for `MILESTONES.md` M22. | -| [v1-rc1-governance-gaps.md](v1-rc1-governance-gaps.md) | 🏛️ | Audit of O6 velocity + O7 governance supporting KPIs (CODEOWNERS coverage, lint-enforced principles, quarterly retros, RFC log, `make ci` budget, maintainer count). One section per gap; closing action list. | +| [history/v1-rc1/](history/v1-rc1/) | 🏛️ | Archived v1.0-rc1 audit snapshots — governance-gaps + operational-gaps. See [history/v1-rc1/README.md](history/v1-rc1/README.md). | | [standards-roadmap.md](standards-roadmap.md) | 🏛️ | NORTHSTARS O4 tracking artifact for the `gen_ai.training.*` semconv upstream motion. Inventory of upstream + tracecore-emitted training keys, proposal set for PR-1/PR-2, SIG cadence (Tuesdays 09:00 PT), competing-proposal risk (`rl.*` Issue #88), and cross-ref to in-repo work that depends on each PR landing. | ## Subdirectories diff --git a/docs/audits/wave-2026-06-01.md b/docs/audits/wave-2026-06-01.md index 955d3c35..0c76a223 100644 --- a/docs/audits/wave-2026-06-01.md +++ b/docs/audits/wave-2026-06-01.md @@ -14,7 +14,7 @@ Audit of 27 PRs merged this session (#339–#374) per `feedback_review_disciplin | 6 | **`TrainingStepStallRecord` is the only intentionally shared `Record` type.** No other accidental sharing detected. Other shared types in `model.go` are intentional core abstractions. **Clean.** | none | `module/pkg/patterns/checkpointer_hang.go:105` (canonical) | **No action.** | | 7 | **12 stale "future PR-B" comments** across `module/` + `docs/` reference RFC-0014 `WithMetrics` bridge. Resolved: ADR-0001 PR-B shipped for cuda_oom (#10) via [#437](https://github.com/TraceCoreAI/tracecore/issues/437) / PR #461; comment sweep landed via [#380](https://github.com/TraceCoreAI/tracecore/issues/380) — surviving references now point at the cuda_oom precedent and name patterns #1 / #2 / #3 / #4 / #5 as pending sibling-consumer follow-ups under #260. 
| low | 10 hits in `module/`, 2 in `docs/` | **Closed.** | | 8 | **`module/doc.go` references PR-I.1a / PR-I.1b / PR-I.2 as "Contents land in…"** — these PRs all landed long ago. Comment is a historical artifact. | low | `module/doc.go:5-13` | **Trim** — 8-line deletion. **Issue [#381](https://github.com/TraceCoreAI/tracecore/issues/381).** | -| 9 | **`docs/v1-rc1-simplification-audit.md` is now historical** (references #333/#334 which closed). Compare to `v1-rc1-cut-criteria.md` (current source of truth) + `v1-rc1-test-audit.md` + `v1-rc1-governance-gaps.md` + `v1-rc1-operational-gaps.md`. | low | `docs/v1-rc1-simplification-audit.md` | **No action pre-RC1** — flip to "Status: ☑ shipped (historical)" at RC1 tag time. Bundled into #383. | +| 9 | **`docs/v1-rc1-simplification-audit.md` is now historical** (references #333/#334 which closed). Compare to `v1-rc1-cut-criteria.md` (current source of truth) + `v1-rc1-test-audit.md` + `history/v1-rc1/governance-gaps.md` + `history/v1-rc1/operational-gaps.md`. | low | `docs/v1-rc1-simplification-audit.md` | **No action pre-RC1** — flip to "Status: ☑ shipped (historical)" at RC1 tag time. Bundled into #383. | | 10 | **`chart.yml` workflow is 496 lines with 3 jobs that each re-setup helm + kind + image-build.** Triple `Install helm`, double `Build tracecore image`, double `Create kind cluster`, double `Load image into kind`. | medium | `.github/workflows/chart.yml:292,378` | **Refactor** — extract `.github/actions/kind-tracecore-up/`. ~60-line reduction. **Issue [#382](https://github.com/TraceCoreAI/tracecore/issues/382).** Not blocking RC1. | | 11 | **`docs/rfcs/archived/0004-clockreceiver-stdoutexporter.md`** — only archived RFC. Convention OK. | none | — | **No action.** | | 12 | **No `.bak` / `_unused` / `.orig` files** outside `archived/`. 
| none | — | **No action.** | diff --git a/docs/followups/M3.md b/docs/followups/M3.md index cd7549eb..e04666bf 100644 --- a/docs/followups/M3.md +++ b/docs/followups/M3.md @@ -53,16 +53,16 @@ tracked at slsa-framework/slsa-github-generator#2483 ("Pre and Post Build Actions for BYOB" — `slsa-prebuild-action-path` proposal; open, type:feature, area:BYOB). (Corrects an earlier cite of #3033, which is actually a Maven E2E test issue, not the pre-build hook -proposal — docs/v1-rc1-operational-gaps.md §1 corrected in the same +proposal — docs/history/v1-rc1/operational-gaps.md §1 corrected in the same PR.) The rc1 posture is documented in -docs/v1-rc1-operational-gaps.md §1 (L2 build platform + L3-grade +docs/history/v1-rc1/operational-gaps.md §1 (L2 build platform + L3-grade provenance attestation infrastructure), and docs/reproducibility.md steps 6 + 9 now cite Build L2 rather than Build L1. Re-open if upstream #2483 ships, or if M12's L3 binding becomes load-bearing before then. --> - *Closed (see comment above): L3 re-evaluated at rc1 cut and deferred upstream-blocked (slsa-framework/slsa-github-generator#2483). - rc1 posture documented in docs/v1-rc1-operational-gaps.md §1; the + rc1 posture documented in docs/history/v1-rc1/operational-gaps.md §1; the same architectural gap applies to the `sign` job migration, so both inherit the same deferral. docs/reproducibility.md steps 6 + 9 now cite SLSA Build L2 (build platform) + L3-grade provenance diff --git a/docs/history/v1-rc1/README.md b/docs/history/v1-rc1/README.md new file mode 100644 index 00000000..6dee7b1c --- /dev/null +++ b/docs/history/v1-rc1/README.md @@ -0,0 +1,13 @@ +# v1.0-rc1 Audit History + +Archived audit snapshots from the `v1.0.0-rc1` cut window. Preserved +for traceability; do not edit historical claims. 
+ +| File | Scope | +|---|---| +| [governance-gaps.md](governance-gaps.md) | O6 velocity + O7 governance KPI audit (CODEOWNERS, lint-enforced principles, retros, RFC log, `make ci` budget, maintainer count). | +| [operational-gaps.md](operational-gaps.md) | SLSA L3 prerequisites, air-gapped install path, DaemonSet upgrade-rollback. | + +The live rubric `docs/v1-rc1-cut-criteria.md` is generated by +`make cut-criteria-render` (path pinned in `scripts/cut_criteria.py`) +and stays in `docs/` for the render gate. diff --git a/docs/v1-rc1-governance-gaps.md b/docs/history/v1-rc1/governance-gaps.md similarity index 96% rename from docs/v1-rc1-governance-gaps.md rename to docs/history/v1-rc1/governance-gaps.md index 942857de..2f44cb79 100644 --- a/docs/v1-rc1-governance-gaps.md +++ b/docs/history/v1-rc1/governance-gaps.md @@ -11,8 +11,8 @@ five in-repo issues filed per the audit charter — items beyond the cap are listed under "Deferred to follow-up issues" so future audits do not rediscover them cold. -This doc is a sibling to [`v1-rc1-cut-criteria.md`](v1-rc1-cut-criteria.md) -(the rubric source) and [`v1-rc1-operational-gaps.md`](v1-rc1-operational-gaps.md) +This doc is a sibling to [`v1-rc1-cut-criteria.md`](../../v1-rc1-cut-criteria.md) +(the rubric source) and [`operational-gaps.md`](operational-gaps.md) (SLO + runbook drift). When a row here is satisfied, mark it `[x]` in place with the closing PR / issue reference; do not delete rows. @@ -22,7 +22,7 @@ place with the closing PR / issue reference; do not delete rows. **Target** (NORTHSTARS O7): `CODEOWNERS covers ≥80% of code paths` by M12. -**Current state.** [`CODEOWNERS`](../CODEOWNERS) covers nine governance +**Current state.** [`CODEOWNERS`](../../../CODEOWNERS) covers nine governance files (LICENSE, SECURITY.md, CODE_OF_CONDUCT.md, PRINCIPLES.md, STYLE.md, NORTHSTARS.md, MILESTONES.md, CONTRIBUTING.md, CODEOWNERS) plus four CI / supply-chain anchors (`/.github/`, `/Makefile`, @@ -77,7 +77,7 @@ KPIs. 
golangci-lint; new rules added when a principle is violated once in code`. -**Current state.** [`PRINCIPLES.md`](../PRINCIPLES.md) contains 16 +**Current state.** [`PRINCIPLES.md`](../../../PRINCIPLES.md) contains 16 numbered principles (the doc lists §1–§16; the original NORTHSTARS phrasing said "15", written before §16 "Adopt > build" was added — treating the KPI denominator as "all numbered principles" the count is @@ -99,7 +99,7 @@ treating the KPI denominator as "all numbered principles" the count is | 12 | Reproducibility is a feature | **scripted gate** — `-trimpath`, `SOURCE_DATE_EPOCH`, `tidy-check`, `mod-verify`, `generate-fixtures-check`, `base-digest-check` | | 13 | Operability is owed to the operator | **partial lint** — `loggercheck` (slog), `contextcheck` | | 14 | Honest commits, honest history | **scripted gate** — DCO sign-off check (CI), `.github/workflows/pr-lint.yml` (release-notes block + subject style); not a `golangci-lint` rule | -| 15 | Decide late, write it down | scripted via [`docs/rfcs/README.md`](rfcs/README.md) status index — no `golangci-lint` rule | +| 15 | Decide late, write it down | scripted via [`docs/rfcs/README.md`](../../rfcs/README.md) status index — no `golangci-lint` rule | | 16 | Adopt > build | **scripted gate** — `register-lint.sh` (factory-location), `no-autoupdate-check.sh` (RFC-0008), `depguard` `inconshreveable/go-update` family bans | **Count: 4 principles enforced strictly by `golangci-lint`** (§3, §8, @@ -160,7 +160,7 @@ unwritten). 2. Schedule `docs/2026-Q2-retrospective.md` for end-of-Q2 cut (target: 2026-06-30); write it concurrently with the rc1 cut so the rc1 audit is in scope. -3. Add a one-line entry to [`MILESTONES.md`](MILESTONES.md)'s +3. Add a one-line entry to [`MILESTONES.md`](../../MILESTONES.md)'s "Keeping this document current" pointer noting the quarterly retro cadence. @@ -175,7 +175,7 @@ by a calendar entry, not code). 
**Target** (NORTHSTARS O7): `RFC accept/reject/supersede status logged in each O6 quarterly retro` — Quarterly. -**Current state.** [`docs/rfcs/README.md`](rfcs/README.md) carries a +**Current state.** [`docs/rfcs/README.md`](../../rfcs/README.md) carries a status index that is the de-facto log: every RFC (`0001`–`0014` plus `0000-template.md`) has a `Status:` row and a `Last updated` date. Audit of individual RFC bodies confirms every file has a `Status:` @@ -309,7 +309,7 @@ failure without an explicit operator advisory. **Remediation steps** (none are code edits in this audit): 1. Continue partner outreach per - [`docs/adoption-pipeline.md`](adoption-pipeline.md). The maintainer + [`docs/adoption-pipeline.md`](../../adoption-pipeline.md). The maintainer pipeline runs through landed PRs from non-author contributors — at least three landed PRs (per `docs/maintainership.md` § "Proposed: bar to join") qualifies a candidate. diff --git a/docs/v1-rc1-operational-gaps.md b/docs/history/v1-rc1/operational-gaps.md similarity index 85% rename from docs/v1-rc1-operational-gaps.md rename to docs/history/v1-rc1/operational-gaps.md index 639a1c26..37a40436 100644 --- a/docs/v1-rc1-operational-gaps.md +++ b/docs/history/v1-rc1/operational-gaps.md @@ -1,14 +1,14 @@ # v1.0-rc1 operational gaps Three gates between `main` and a `v1.0-rc1` cut, audited against the -binding targets in [`NORTHSTARS.md`](NORTHSTARS.md) O2 (operator +binding targets in [`NORTHSTARS.md`](../../NORTHSTARS.md) O2 (operator experience) and O3 (supply chain). Each section captures the **current state** from concrete file evidence, the **gap to target** in falsifiable terms, **remediation steps** tagged by work type, an **effort estimate**, and **blockers**. This doc is the source-of-truth for the rc1 cut-criteria checklist; -[`MILESTONES.md`](MILESTONES.md) M21 references it for the +[`MILESTONES.md`](../../MILESTONES.md) M21 references it for the release-prep gate. 
--- @@ -23,35 +23,35 @@ infrastructure** (per [slsa.dev/spec/v1.0/requirements#build-l3](https://slsa.de - **Provenance generation.** Two parallel trails: `slsa-framework/slsa-github-generator@v2.1.0` (generic-generator - reusable workflow, [`.github/workflows/release.yml:350`](../.github/workflows/release.yml#L350)) + reusable workflow, [`.github/workflows/release.yml:350`](../../../.github/workflows/release.yml#L350)) and `actions/attest-build-provenance@v4.1.0` (GitHub-native, - [`release.yml:377,570`](../.github/workflows/release.yml#L377)). + [`release.yml:377,570`](../../../.github/workflows/release.yml#L377)). - **Signer identity.** Sigstore Fulcio keyless via GHA OIDC token (`sigstore/cosign-installer@v4.1.2`, - [`release.yml:280,442`](../.github/workflows/release.yml#L280)). + [`release.yml:280,442`](../../../.github/workflows/release.yml#L280)). No long-lived signing keys held by the project. - **Action SHA pinning.** Every third-party action pinned by commit SHA (verified: `actions/checkout`, `actions/setup-go`, `ko-build/setup-ko`, `sigstore/cosign-installer`, `actions/attest-build-provenance`, `anchore/sbom-action`, `helm/kind-action`, `azure/setup-helm`, `github/codeql-action` — - see [`.github/zizmor.yml`](../.github/zizmor.yml) for the enforcement + see [`.github/zizmor.yml`](../../../.github/zizmor.yml) for the enforcement gate). One documented exception: `slsa-framework/slsa-github-generator` is pinned by tag - ([`release.yml:350`](../.github/workflows/release.yml#L350)), + ([`release.yml:350`](../../../.github/workflows/release.yml#L350)), required by SLSA's verifier-identity model. - **Build environment.** `ubuntu-latest` GitHub-hosted runner; no custom image, no self-hosted runner, no nix/bazel hermetic sandbox. 
OCB regenerates `./_build/` per architecture - ([`release.yml:128`](../.github/workflows/release.yml#L128)); ko + ([`release.yml:128`](../../../.github/workflows/release.yml#L128)); ko image build runs from the same submodule - ([`release.yml:518`](../.github/workflows/release.yml#L518)). + ([`release.yml:518`](../../../.github/workflows/release.yml#L518)). - **Determinism.** `SOURCE_DATE_EPOCH` exported from tag commit timestamp, `-trimpath`, `CGO_ENABLED=0`, `tar --sort=name --numeric-owner --mtime=@${SOURCE_DATE_EPOCH} | gzip -n` - ([`release.yml:115,142`](../.github/workflows/release.yml#L115)). + ([`release.yml:115,142`](../../../.github/workflows/release.yml#L115)). Base image pinned by sha256 digest - ([`.ko.yaml:47`](../.ko.yaml#L47)). + ([`.ko.yaml:47`](../../../.ko.yaml#L47)). ### Gap to target @@ -63,7 +63,7 @@ satisfied**: | Provenance non-falsifiable, signed by trusted infra | met | `attest-build-provenance` uses GitHub-OIDC + Sigstore; signer identity is the workflow ref bound to the tag | | Build run as a discrete, hosted service (not on a self-controlled machine) | met | `ubuntu-latest`; no self-hosted runners; cf. 
`gh api /repos/tracecoreai/tracecore/actions/runners` | | **Hermetic build** (network egress disallowed except to declared inputs; build steps cannot reach the public internet) | **not met** | `ko build`, `go build`, `actions/setup-go`, `anchore/sbom-action`, ko's base-image fetch, the SLSA generator itself all fetch from the public internet at build time; no isolated network policy on the runner | -| **Parameterless build** (no operator-supplied build parameters that change the produced artifact) | **partial** | `TRACECORE_VERSION` is injected from the tag, which is itself the build subject; `KO_DOCKER_REPO` is operator-controlled at env-line scope ([`release.yml:423`](../.github/workflows/release.yml#L423)); `TAGS` partly derived from `TAG` | +| **Parameterless build** (no operator-supplied build parameters that change the produced artifact) | **partial** | `TRACECORE_VERSION` is injected from the tag, which is itself the build subject; `KO_DOCKER_REPO` is operator-controlled at env-line scope ([`release.yml:423`](../../../.github/workflows/release.yml#L423)); `TAGS` partly derived from `TAG` | Falsifiable cut criterion: `gh attestation verify --predicate-type https://slsa.dev/provenance/v1` returns a predicate whose @@ -124,7 +124,7 @@ rc1; the M3 follow-up has been closed as upstream-blocked. spec-strict terms) and the redundant GitHub-native trail via `actions/attest-build-provenance`. **Re-evaluated 2026-06 at rc1 cut: upstream gap unchanged; M3 follow-up - ([`docs/followups/M3.md`](followups/M3.md)) closed as + ([`docs/followups/M3.md`](../../followups/M3.md)) closed as upstream-blocked. NORTHSTARS §O3 binds L3 to M12, not rc1, so L2-with-L3-attestation is the documented rc1 posture.** 2. **[ops]** ✅ **Done (#315 — `step-security/harden-runner@v2.19.4` @@ -138,7 +138,7 @@ rc1; the M3 follow-up has been closed as upstream-blocked. 3. 
**[code]** ✅ **Done (#316).** `KO_DOCKER_REPO` lifted out of the job-level `env:` block and hardcoded at the step level so the build is parameterless w.r.t. operator-controllable inputs. -4. **[doc]** ~~Update [`docs/reproducibility.md`](reproducibility.md) +4. **[doc]** ~~Update [`docs/reproducibility.md`](../../reproducibility.md) "What this verifies" table to mark steps 6 + 9 with the generic-generator's actual posture (provenance attestation on L3 infra / build platform L2) rather than waiting for the @@ -184,7 +184,7 @@ single pre-staged bundle. - **Binary tarball.** Per-arch `tracecore__linux_{amd64,arm64}.tar.gz` published on every release tag - ([`release.yml:137`](../.github/workflows/release.yml#L137)). + ([`release.yml:137`](../../../.github/workflows/release.yml#L137)). Archive contains `tracecore` binary + `LICENSE` + `README.md`. Cosign-bundle (`.cosign.bundle`) and per-archive CycloneDX SBOM (`.sbom.cdx.json`) ship alongside, plus @@ -192,20 +192,20 @@ single pre-staged bundle. - **Verification offline-capable.** `cosign verify-blob --bundle` works air-gapped (bundle carries Rekor inclusion proof); `gh attestation verify --bundle ` works air-gapped per - [`docs/reproducibility.md:36`](reproducibility.md#L36). + [`docs/reproducibility.md:36`](../../reproducibility.md#L36). - **Container image.** Multi-arch image at `ghcr.io/tracecoreai/tracecore:` (ko-built; - [`release.yml:412`](../.github/workflows/release.yml#L412)). + [`release.yml:412`](../../../.github/workflows/release.yml#L412)). Air-gapped operators must mirror to an internal registry - ([`install/kubernetes/tracecore/README.md:250`](../install/kubernetes/tracecore/README.md#L250)). + ([`install/kubernetes/tracecore/README.md:250`](../../../install/kubernetes/tracecore/README.md#L250)). 
- **Helm chart.** In-repo at - [`install/kubernetes/tracecore/`](../install/kubernetes/tracecore/); + [`install/kubernetes/tracecore/`](../../../install/kubernetes/tracecore/); `image.repository` is operator-overridable - ([`values.yaml:14`](../install/kubernetes/tracecore/values.yaml#L14)); + ([`values.yaml:14`](../../../install/kubernetes/tracecore/values.yaml#L14)); `imagePullSecrets` exposed - ([`values.yaml:20`](../install/kubernetes/tracecore/values.yaml#L20)). + ([`values.yaml:20`](../../../install/kubernetes/tracecore/values.yaml#L20)). No OCI-published chart yet (per - [`docs/getting-started.md:96`](getting-started.md#L96), the OCI + [`docs/getting-started.md:96`](../../getting-started.md#L96), the OCI chart `oci://ghcr.io/tracecoreai/charts/tracecore-recipes` is a post-PR-L target). @@ -223,13 +223,13 @@ single pre-staged bundle. Falsifiable cut criterion: a tester on a network with `iptables -A OUTPUT -j DROP` (except to a private registry and a local file mirror) can complete every step from -[`docs/getting-started.md`](getting-started.md) §"Install via Helm". +[`docs/getting-started.md`](../../getting-started.md) §"Install via Helm". ### Remediation steps 1. **[ops]** Publish the chart as an OCI artifact to `oci://ghcr.io/tracecoreai/charts/tracecore` on every release; - add a `chart-publish` job to [`release.yml`](../.github/workflows/release.yml) + add a `chart-publish` job to [`release.yml`](../../../.github/workflows/release.yml) that runs `helm package` + `helm push`. Sign the chart with cosign keyless using the same identity binding as the binary + image. (Tracked separately as RFC-0013 PR-L "Recipe chart OCI @@ -239,15 +239,15 @@ mirror) can complete every step from binary tarball, image OCI-layout tarball (`crane export -`), chart tarball, all signatures, and a `verify.sh` script that runs the full - [`docs/reproducibility.md`](reproducibility.md) verification + [`docs/reproducibility.md`](../../reproducibility.md) verification chain offline. 
3. **[doc]** New section in - [`docs/getting-started.md`](getting-started.md) — "Air-gapped + [`docs/getting-started.md`](../../getting-started.md) — "Air-gapped install" — covering: download bundle on a connected host, copy to air-gapped host, verify, `docker load` / `crane push` to private registry, `helm install --set image.repository=...`. 4. **[doc]** Add an air-gap test row to the M20 install benchmark - harness in [`MILESTONES.md:226`](MILESTONES.md#L226) so a + harness in [`MILESTONES.md:226`](../../MILESTONES.md#L226) so a regression in offline installability is CI-visible. ### Estimated effort @@ -276,34 +276,34 @@ documentation + tests that demonstrate skew tolerance. ### Current state - **updateStrategy.** `RollingUpdate` with `maxUnavailable: 1` - ([`values.yaml:158`](../install/kubernetes/tracecore/values.yaml#L158)). + ([`values.yaml:158`](../../../install/kubernetes/tracecore/values.yaml#L158)). Operator-overridable - ([`daemonset.yaml:10`](../install/kubernetes/tracecore/templates/daemonset.yaml#L10)). + ([`daemonset.yaml:10`](../../../install/kubernetes/tracecore/templates/daemonset.yaml#L10)). No `maxSurge` (DaemonSet API only supports `maxSurge` as alpha in k8s ≥1.22 via `DaemonSetUpdateStrategy.RollingUpdate.MaxSurge`; chart does not expose). - **Readiness probe.** `readinessProbe.httpGet.path: {{ .Values.telemetry.healthPath }}` against the `health` port - ([`daemonset.yaml:80-86`](../install/kubernetes/tracecore/templates/daemonset.yaml#L80)). + ([`daemonset.yaml:80-86`](../../../install/kubernetes/tracecore/templates/daemonset.yaml#L80)). Default `healthPath: /` - ([`values.yaml:87`](../install/kubernetes/tracecore/values.yaml#L87)). + ([`values.yaml:87`](../../../install/kubernetes/tracecore/values.yaml#L87)). Probe timing: `initialDelaySeconds: 5`, `periodSeconds: 10`, `failureThreshold: 4` → ~45s grace window - ([`values.yaml:182-185`](../install/kubernetes/tracecore/values.yaml#L182)). 
+ ([`values.yaml:182-185`](../../../install/kubernetes/tracecore/values.yaml#L182)). - **`/readyz` wiring.** **Broken in two places.** (a) The chart's values.yaml comment at - [`values.yaml:74`](../install/kubernetes/tracecore/values.yaml#L74) + [`values.yaml:74`](../../../install/kubernetes/tracecore/values.yaml#L74) states *"The legacy single-listener `telemetry:` block (one port serving /metrics + /healthz + /readyz) is gone; OCB doesn't recognise it."* OCB's `healthcheckextension` serves a single path (default `/`) and **does not distinguish liveness from readiness**. (b) Despite (a), the chart README at - [`README.md:78`](../install/kubernetes/tracecore/README.md#L78) - and [`README.md:121`](../install/kubernetes/tracecore/README.md#L121) + [`README.md:78`](../../../install/kubernetes/tracecore/README.md#L78) + and [`README.md:121`](../../../install/kubernetes/tracecore/README.md#L121) still references `/readyz` as the rollback gate signal. The README is stale relative to the chart. - **Liveness probe.** Same endpoint as readiness - ([`daemonset.yaml:73-79`](../install/kubernetes/tracecore/templates/daemonset.yaml#L73)) — + ([`daemonset.yaml:73-79`](../../../install/kubernetes/tracecore/templates/daemonset.yaml#L73)) — no semantic distinction. - **`minReadySeconds`.** Not set (verified: zero hits in chart for `minReadySeconds`). DaemonSet treats a pod as ready the moment @@ -311,8 +311,8 @@ documentation + tests that demonstrate skew tolerance. - **`progressDeadlineSeconds`.** Not set; DaemonSet API does not support it (Deployment-only). - **Version-skew testing.** No N→N+1 or N+1→N upgrade test in CI. - [`chart.yml`](../.github/workflows/chart.yml) and - [`install-bench.yml`](../.github/workflows/install-bench.yml) + [`chart.yml`](../../../.github/workflows/chart.yml) and + [`install-bench.yml`](../../../.github/workflows/install-bench.yml) validate fresh installs; neither covers `helm upgrade` from a previous tag. 
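Reviewer note on the probe arithmetic in this hunk: the `~45s grace window` follows from `initialDelaySeconds + periodSeconds × failureThreshold`. The same term feeds the rollout bound `node_count × (initialDelay + periodSeconds × failureThreshold + minReadySeconds)` that the remediation steps of this document use for the `helm upgrade` test. The following is a minimal Python sketch of both calculations, assuming the default values quoted above and a serial rollout under `maxUnavailable: 1`. The names `ProbeTiming`, `grace_window_s`, and `rollout_bound_s` are hypothetical and are not part of the chart or the CI harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeTiming:
    """Readiness-probe timings; defaults mirror the values quoted in the doc."""
    initial_delay_s: int = 5
    period_s: int = 10
    failure_threshold: int = 4


def grace_window_s(probe: ProbeTiming) -> int:
    """Seconds before a never-ready pod exhausts its probe failures."""
    return probe.initial_delay_s + probe.period_s * probe.failure_threshold


def rollout_bound_s(node_count: int, probe: ProbeTiming, min_ready_s: int = 0) -> int:
    """Upper bound on rollout duration, one node at a time (assumed serial).

    Implements node_count * (initialDelay + period * failureThreshold
    + minReadySeconds) as written in the remediation step.
    """
    if node_count < 0 or min_ready_s < 0:
        raise ValueError("node_count and min_ready_s must be non-negative")
    return node_count * (grace_window_s(probe) + min_ready_s)
```

With the defaults, `grace_window_s(ProbeTiming())` evaluates to 45, matching the figure in the doc. For a three-node cluster with a 10-second `minReadySeconds` (the hold time the migration note in this diff describes), `rollout_bound_s(3, ProbeTiming(), min_ready_s=10)` gives 165 seconds.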
@@ -354,20 +354,20 @@ fail injected via `failure-inject`. race. 3. **[code]** Expose `minReadySeconds` as a values knob and document its effect on the rollout-duration formula in - [`install/kubernetes/tracecore/README.md`](../install/kubernetes/tracecore/README.md). + [`install/kubernetes/tracecore/README.md`](../../../install/kubernetes/tracecore/README.md). 4. **[code]** Add a `helm upgrade` test to - [`install-bench.yml`](../.github/workflows/install-bench.yml): + [`install-bench.yml`](../../../.github/workflows/install-bench.yml): install rc1−1, upgrade to rc1, assert `kubectl rollout status daemonset/tracecore` succeeds within `(node_count × (initialDelay + periodSeconds × failureThreshold + minReadySeconds))`. Cover both N→N+1 and N+1→N rollback. 5. **[doc]** New §"Version-skew policy" in - [`docs/getting-started.md`](getting-started.md) §"Install via + [`docs/getting-started.md`](../../getting-started.md) §"Install via Helm" stating the supported skew (e.g. "any v1.x → v1.x+1 is supported, v0.x → v1.0 is a breaking upgrade") and the procedure. 6. **[code]** Add a chaos-test row in - [`.github/workflows/chaos.yml`](../.github/workflows/chaos.yml) + [`.github/workflows/chaos.yml`](../../../.github/workflows/chaos.yml) that kills the pod immediately after readiness flips green and asserts the DaemonSet does not flap. diff --git a/docs/migration/v0.x-to-v1.0.md b/docs/migration/v0.x-to-v1.0.md index a8a49d08..e793f707 100644 --- a/docs/migration/v0.x-to-v1.0.md +++ b/docs/migration/v0.x-to-v1.0.md @@ -370,7 +370,7 @@ so the rollout treated a pod as Ready the moment its readiness probe returned 200 for the first time. v1.0.0-rc1 requires the pod hold Ready for 10 seconds before the rollout counts it. 
-**Why.** Per [`docs/v1-rc1-operational-gaps.md` §2](../v1-rc1-operational-gaps.md) +**Why.** Per [`docs/history/v1-rc1/operational-gaps.md` §2](../history/v1-rc1/operational-gaps.md) upgrade-rollback row: a pod that flips Ready then crashes mid-rollout can serialize through `maxUnavailable: 1` without ever serving a scrape, producing a "successful" rollout with a data gap. The 10-second diff --git a/docs/reproducibility.md b/docs/reproducibility.md index 45029d5f..7ca80a30 100644 --- a/docs/reproducibility.md +++ b/docs/reproducibility.md @@ -191,7 +191,7 @@ gh attestation verify "oci://$DIGEST" \ |---|---|---| | 4 | Byte-identical rebuild at the same SHA | [PRINCIPLES.md §12](../PRINCIPLES.md) | | 5 | Binary signature traces to a GitHub Actions OIDC identity, no long-lived key | [Sigstore keyless](https://docs.sigstore.dev/cosign/verifying/verify/) | -| 6 | Binary provenance matches `predicateType: https://slsa.dev/provenance/v1`, signed by this repo's `release.yml` on a tag-ref. Provenance infrastructure is L3-grade (Sigstore-Fulcio keyless under GHA OIDC, recorded in Rekor); build platform is [SLSA Build L2](https://slsa.dev/spec/v1.0/levels#build-l2) — the `package` job runs outside the generic generator's controlled env. [`docs/v1-rc1-operational-gaps.md`](v1-rc1-operational-gaps.md#1-slsa-l3-prerequisites) §1 explains the upstream blocker for the L3 path (OCB-style generated entrypoints + missing pre-build hook in the trusted reusable workflow, tracked at [slsa-framework/slsa-github-generator#2483](https://github.com/slsa-framework/slsa-github-generator/issues/2483)). | [SLSA v1.0 Build L2](https://slsa.dev/spec/v1.0/levels#build-l2) + [GitHub artifact attestations](https://docs.github.com/en/actions/security-for-github-actions/using-artifact-attestations) | +| 6 | Binary provenance matches `predicateType: https://slsa.dev/provenance/v1`, signed by this repo's `release.yml` on a tag-ref. 
Provenance infrastructure is L3-grade (Sigstore-Fulcio keyless under GHA OIDC, recorded in Rekor); build platform is [SLSA Build L2](https://slsa.dev/spec/v1.0/levels#build-l2) — the `package` job runs outside the generic generator's controlled env. [`docs/history/v1-rc1/operational-gaps.md`](history/v1-rc1/operational-gaps.md#1-slsa-l3-prerequisites) §1 explains the upstream blocker for the L3 path (OCB-style generated entrypoints + missing pre-build hook in the trusted reusable workflow, tracked at [slsa-framework/slsa-github-generator#2483](https://github.com/slsa-framework/slsa-github-generator/issues/2483)). | [SLSA v1.0 Build L2](https://slsa.dev/spec/v1.0/levels#build-l2) + [GitHub artifact attestations](https://docs.github.com/en/actions/security-for-github-actions/using-artifact-attestations) | | 7 | SBOM coverage of every direct module | [CycloneDX spec](https://cyclonedx.org/specification/overview/) | | 8 | Image manifest signature traces to the same OIDC identity as the binary | [Sigstore keyless](https://docs.sigstore.dev/cosign/verifying/verify/) | | 9 | Image provenance matches `predicateType: https://slsa.dev/provenance/v1`, attestation stored in the OCI registry. Same L2 build-platform / L3 provenance-infrastructure posture as step 6. | [SLSA v1.0 Build L2](https://slsa.dev/spec/v1.0/levels#build-l2) |