refactor(cut-criteria): single Python script, drop base64 (#386)#397
Merged
Conversation
Replace `scripts/cut-criteria-{status,render,status_test}.sh` with a
single `scripts/cut_criteria.py` exposing `status`, `render`, `check`,
and `test` subcommands. Output is byte-identical for the same YAML +
repo state (the only deliberate diff is the script-name reference line
in the rendered legend).
Drops:
- The base64-over-TSV plumbing in cut-criteria-status.sh. It existed
only to survive embedded newlines through a bash `read` loop, a
macOS-3.2 compat tax. subprocess.run(snippet, shell=True) passes
multi-line YAML block scalars to /bin/bash directly, no encoding
hop. A regression fixture (`multiline-shell-no-base64`) locks this
in: a future refactor that reintroduces a TSV/base64 round-trip
without honoring newlines goes red.
- The per-shell-invocation Python heredocs in render.sh (196 LOC) and
status.sh (75 LOC). Three Makefile shell-outs to one module now
replace three bash scripts that each shelled out to python3.
- The separate `*_test.sh` test harness (160 LOC). The fixture suite
is now a `cut_criteria.py test` subcommand using the same in-process
status_rows() the production path uses.
Adds (per issue ask):
- YAML schema validation at parse time (required keys, allowed tiers,
duplicate-id check) — surfaces malformed spec at `make
cut-criteria-status` instead of producing a quietly-wrong markdown.
- `check` subcommand that renders in-process and diffs against the
on-disk markdown; no `mktemp`/`diff -u`/`rm -f` choreography. The
Makefile target collapsed from 11 lines to 1.
Closes #386.
Signed-off-by: Tri Lam <tri@maydow.com>
7 tasks
3 tasks
trilamsr
added a commit
that referenced
this pull request
Jun 1, 2026
## Summary Bundles #364 (per-driver OTTL stanzas for `dataloader.error_class`) and #365 (training-step-stalled metrics-to-logs bridge contract) — both pattern-7, both touch the same recipe surface. This PR REPLACES #401 (which was opened against pre-wave main and accumulated unrelated scope-creep via merge-resolution). ### #364 — `dataloader.error_class` per-driver regex `docs/integrations/filelog-container.md` + `examples/filelog-container.yaml` ship a `transform/dataloader_errors` processor with one stanza per driver class. Each stanza stamps the customer-stable `dataloader.error_class` value the runbook branches on, plus an optional `dataloader.worker_pid` int when the line carries one. Six driver classes: `DataLoader worker killed`, `FUSE transport error`, `S3 throttle`, `Stale file handle`, `DataLoader queue empty`, `Connection reset by peer`. Resolves pattern #7 spec Open Q#3. ### #365 — training-step-stalled bridge contract `docs/integrations/prometheus-scrape.md` grows a §"Pattern #7" section documenting the load-bearing wire contract for the bridge log record (`tracecore.alert.training_step_stalled.no_progress_seconds` + `last_step_ns` + `gen_ai.training.step` + `phase`). Per RFC-0014 the emitter half stays upstream-blocked until WithMetrics PR-B or `metricthresholdconnector` lands; the detector reads the contract today. ### Test wiring 9 sub-tests in `module/processor/patterndetectorprocessor/dataloader_hang_test.go` pin every per-driver class value + every bridge attribute guard against live detector wiring (eval-phase guard + warmup `step >= 2` guard). ## Why a redo PR #401 was branched from pre-wave main (commit `636c2a2`). When it merged main back in, it accumulated re-deletions of recently-merged #389 (composite action) + re-adds of #379/#381 already shipped via #392 + #386 already shipped via #397. Cleanest path: cherry-pick the recipe-only commit onto current main + resolve the one true conflict (DCGM helm-template comment in prometheus-scrape.md vs the new pattern-7 NOTE block). ## Test plan - [x] `golangci-lint`, `go vet`, `attribute-namespace-check` — green - [x] `go test ./module/processor/patterndetectorprocessor/... -run DataLoader` — pass - [x] Pre-push hook gates all green Closes #364. Closes #365. Supersedes #401. ```release-notes feat(recipes): pattern #7 OTTL stanzas for `dataloader.error_class` per-driver classification + documented metrics-to-logs bridge contract for `training_step_stalled` log record. ``` Signed-off-by: Tri Lam <tri@maydow.com> Co-authored-by: Tri Lam <tri@maydow.com>
This was referenced Jun 1, 2026
trilamsr
added a commit
that referenced
this pull request
Jun 2, 2026
## Summary Closes #429. Two-layer defense for the hand-edit foot-gun that audit finding #421-4 surfaced on PR #409 (single-char glyph flip on a COMPUTED rendered doc that the next `make cut-criteria-render` would have silently clobbered). - **CI**: dedicated named workflow step `cut-criteria-check (issue #429 — forbid hand-edits to rendered docs)` in `.github/workflows/ci.yml` (verify-static job). Same double-invocation pattern slo-rules-check uses — the underlying `make cut-criteria-check` already runs transitively under `make doc-check`, but the second call surfaces the gate as a separately named PR-check line item so the author who edits the rendered file sees the failure named for the gate that catches it. Cost: ~500 ms (pure-Python re-render + diff). - **Local pre-commit**: `.githooks/pre-commit` now runs the drift check inline when `docs/v1-rc1-cut-criteria.md` or `docs/v1-ga-cut-criteria.md` is in the staged set, with a tailored failure message pointing at the YAML source-of-truth. Catches the foot-gun before the commit lands — CI is the backstop, not the only line of defense. Zero overhead when those files aren't staged. ## Root cause Not a missing gate — `cmd_check` in `scripts/cut_criteria.py` has been the drift gate since PR #397 (May 2026), wired into CI via `make doc-check`. The actual gap is **author cognitive visibility**: the gate's failure surfaces inside an aggregated `doc-check` step that runs half a dozen other checks, so a PR author hand-editing the rendered file sees a generic "doc-check failed" line on the PR page and may not connect it to "the rendered file you edited is computed". PR #409's flip sneaked through because no human caught the YAML-vs-rendered drift in review — the gate would have caught the next render, but #409 merged before that render happened. This PR keeps the same underlying gate; it just names it loud enough that the next author sees the signal at three checkpoints (pre-commit, CI step name, failure message body). ## A+ scan: other COMPUTED outputs Audited `docs/` and root-config files for render-output documents lacking a drift gate: | File | Gate | Status | | --- | --- | --- | | `docs/v1-rc1-cut-criteria.md` | `make cut-criteria-check` | gated (this PR makes it explicit) | | `docs/v1-ga-cut-criteria.md` | `make cut-criteria-check` | gated (same) | | `install/kubernetes/tracecore/Chart.yaml` (appVersion) | `make chart-appversion-check` | already gated | | `install/kubernetes/tracecore/dashboards/slo-rules.yaml` | `make slo-rules-check` | already gated | | `_build/tracecore/*` (OCB-generated) | gitignored | not a hand-edit risk | | `cmd/tracecore/components.go` (RFC-0013 deleted) | n/a | deleted | No new gaps. The cut-criteria pair is the only computed Markdown surface that ships in-tree. ## Test plan - [x] `make cut-criteria-check` on clean main → exit 0 - [x] Inject one-glyph hand-edit to `docs/v1-rc1-cut-criteria.md` → `make cut-criteria-check` exit 2, diff + actionable message ("Run `make cut-criteria-render` and commit the result") - [x] `.githooks/pre-commit` with hand-edit staged → exit 1 + tailored message ("docs/v1-*-cut-criteria.md is a COMPUTED render …") - [x] `.githooks/pre-commit` with rendered file NOT staged → no-op, falls through to `make check` (zero overhead path verified) - [x] `make doc-check` end-to-end → exit 0 - [x] `make actionlint` on the modified workflow → 0 issues - [x] `make check` (commit-time gate) → 0 issues, including DCO sign-off ```release-notes ci(cut-criteria): forbid hand-edits to rendered docs/v1-{rc1,ga}-cut-criteria.md — dedicated `cut-criteria-check` PR-check + local pre-commit catch. Source-of-truth is docs/cut-criteria.yaml; edits go through `make cut-criteria-render`. ``` Signed-off-by: Tri Lam <tree@lumalabs.ai>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Replace
scripts/cut-criteria-{status,render,status_test}.sh(522 LOC bash) with a singlescripts/cut_criteria.py(642 LOC Python) exposingstatus,render,check, andtestsubcommands. Output is byte-identical for the same YAML + repo state (the only deliberate diff is the script-name reference line in the rendered legend).Root cause
PR #374 used base64-over-TSV plumbing in
cut-criteria-status.shto survive embedded newlines through a bashreadloop — a macOS-3.2 compat tax.subprocess.run(snippet, shell=True)passes multi-line YAML block scalars to/bin/bashdirectly with no encoding hop. A regression fixture (multiline-shell-no-base64) locks this in.Adds (per issue ask)
make cut-criteria-statusinstead of producing quietly-wrong markdown.checksubcommand renders in-process and diffs against on-disk markdown; nomktemp/diff -u/rm -fchoreography. Makefile target collapsed 11 lines → 1.LOC delta
Net: +115 LOC but the bash → Python migration accounts for the bulk; the actual logic shrinks since base64/TSV plumbing is gone. Issue acceptance was "~200 LOC delete"; actual is 522 bash deleted vs 642 Python added, net +120 — but readability + correctness wins justify it.
Test plan
make cut-criteria-renderproduces byte-identicaldocs/v1-rc1-cut-criteria.md(verified viagit diff --quiet).make cut-criteria-statusprints the same TSV table (computed status: 8 shipped / 2 in-progress / 1 planned tier-1; 1 shipped / 2 planned tier-2).make cut-criteria-checkexit code 0 on clean tree.python3 scripts/cut_criteria.py testpasses all fixture cases.Closes #386.