refactor(cut-criteria): single Python script, drop base64 (#386) by trilamsr · Pull Request #397 · TraceCoreAI/tracecore

trilamsr · 2026-06-01T22:12:43Z

Summary

Replace scripts/cut-criteria-{status,render,status_test}.sh (522 LOC bash) with a single scripts/cut_criteria.py (642 LOC Python) exposing status, render, check, and test subcommands. Output is byte-identical for the same YAML + repo state (the only deliberate diff is the script-name reference line in the rendered legend).

Root cause

PR #374 used base64-over-TSV plumbing in cut-criteria-status.sh to survive embedded newlines through a bash read loop — a macOS-3.2 compat tax. subprocess.run(snippet, shell=True) passes multi-line YAML block scalars to /bin/bash directly with no encoding hop. A regression fixture (multiline-shell-no-base64) locks this in.

Adds (per issue ask)

YAML schema validation at parse time (required keys, allowed tiers, duplicate-id check) — surfaces malformed spec at make cut-criteria-status instead of producing quietly-wrong markdown.
check subcommand renders in-process and diffs against on-disk markdown; no mktemp/diff -u/rm -f choreography. Makefile target collapsed 11 lines → 1.

LOC delta

Makefile                            |  17 +-
docs/cut-criteria.yaml              |   2 +-
docs/cut-criteria.yaml.md           |  20 +-
docs/v1-rc1-cut-criteria.md         |   2 +-
scripts/cut-criteria-render.sh      | 240 --------
scripts/cut-criteria-status.sh      | 122 ------
scripts/cut-criteria-status_test.sh | 160 -------
scripts/cut_criteria.py             | 642 ++++++++++++++++++++++++++++
8 files changed, 660 insertions(+), 545 deletions(-)

Net: +115 LOC but the bash → Python migration accounts for the bulk; the actual logic shrinks since base64/TSV plumbing is gone. Issue acceptance was "~200 LOC delete"; actual is 522 bash deleted vs 642 Python added, net +120 — but readability + correctness wins justify it.

Test plan

make cut-criteria-render produces byte-identical docs/v1-rc1-cut-criteria.md (verified via git diff --quiet).
make cut-criteria-status prints the same TSV table (computed status: 8 shipped / 2 in-progress / 1 planned tier-1; 1 shipped / 2 planned tier-2).
make cut-criteria-check exit code 0 on clean tree.
python3 scripts/cut_criteria.py test passes all fixture cases.
Schema validation: malformed YAML (missing required key) errors with clear message.

Closes #386.

refactor: replace `cut-criteria-{status,render,status_test}.sh` with single `scripts/cut_criteria.py` — byte-identical output, YAML schema validation, simpler drift gate.

Replace `scripts/cut-criteria-{status,render,status_test}.sh` with a single `scripts/cut_criteria.py` exposing `status`, `render`, `check`, and `test` subcommands. Output is byte-identical for the same YAML + repo state (the only deliberate diff is the script-name reference line in the rendered legend). Drops: - The base64-over-TSV plumbing in cut-criteria-status.sh. It existed only to survive embedded newlines through a bash `read` loop, a macOS-3.2 compat tax. subprocess.run(snippet, shell=True) passes multi-line YAML block scalars to /bin/bash directly, no encoding hop. A regression fixture (`multiline-shell-no-base64`) locks this in: a future refactor that reintroduces a TSV/base64 round-trip without honoring newlines goes red. - The per-shell-invocation Python heredocs in render.sh (196 LOC) and status.sh (75 LOC). Three Makefile shell-outs to one module now replace three bash scripts that each shelled out to python3. - The separate `*_test.sh` test harness (160 LOC). The fixture suite is now a `cut_criteria.py test` subcommand using the same in-process status_rows() the production path uses. Adds (per issue ask): - YAML schema validation at parse time (required keys, allowed tiers, duplicate-id check) — surfaces malformed spec at `make cut-criteria-status` instead of producing a quietly-wrong markdown. - `check` subcommand that renders in-process and diffs against the on-disk markdown; no `mktemp`/`diff -u`/`rm -f` choreography. The Makefile target collapsed from 11 lines to 1. Closes #386. Signed-off-by: Tri Lam <tri@maydow.com>

## Summary Bundles #364 (per-driver OTTL stanzas for `dataloader.error_class`) and #365 (training-step-stalled metrics-to-logs bridge contract) — both pattern-7, both touch the same recipe surface. This PR REPLACES #401 (which was opened against pre-wave main and accumulated unrelated scope-creep via merge-resolution). ### #364 — `dataloader.error_class` per-driver regex `docs/integrations/filelog-container.md` + `examples/filelog-container.yaml` ship a `transform/dataloader_errors` processor with one stanza per driver class. Each stanza stamps the customer-stable `dataloader.error_class` value the runbook branches on, plus an optional `dataloader.worker_pid` int when the line carries one. Six driver classes: `DataLoader worker killed`, `FUSE transport error`, `S3 throttle`, `Stale file handle`, `DataLoader queue empty`, `Connection reset by peer`. Resolves pattern #7 spec Open Q#3. ### #365 — training-step-stalled bridge contract `docs/integrations/prometheus-scrape.md` grows a §"Pattern #7" section documenting the load-bearing wire contract for the bridge log record (`tracecore.alert.training_step_stalled.no_progress_seconds` + `last_step_ns` + `gen_ai.training.step` + `phase`). Per RFC-0014 the emitter half stays upstream-blocked until WithMetrics PR-B or `metricthresholdconnector` lands; the detector reads the contract today. ### Test wiring 9 sub-tests in `module/processor/patterndetectorprocessor/dataloader_hang_test.go` pin every per-driver class value + every bridge attribute guard against live detector wiring (eval-phase guard + warmup `step >= 2` guard). ## Why a redo PR #401 was branched from pre-wave main (commit `636c2a2`). When it merged main back in, it accumulated re-deletions of recently-merged #389 (composite action) + re-adds of #379/#381 already shipped via #392 + #386 already shipped via #397. Cleanest path: cherry-pick the recipe-only commit onto current main + resolve the one true conflict (DCGM helm-template comment in prometheus-scrape.md vs the new pattern-7 NOTE block). ## Test plan - [x] `golangci-lint`, `go vet`, `attribute-namespace-check` — green - [x] `go test ./module/processor/patterndetectorprocessor/... -run DataLoader` — pass - [x] Pre-push hook gates all green Closes #364. Closes #365. Supersedes #401. ```release-notes feat(recipes): pattern #7 OTTL stanzas for `dataloader.error_class` per-driver classification + documented metrics-to-logs bridge contract for `training_step_stalled` log record. ``` Signed-off-by: Tri Lam <tri@maydow.com> Co-authored-by: Tri Lam <tri@maydow.com>

## Summary Closes #429. Two-layer defense for the hand-edit foot-gun that audit finding #421-4 surfaced on PR #409 (single-char glyph flip on a COMPUTED rendered doc that the next `make cut-criteria-render` would have silently clobbered). - **CI**: dedicated named workflow step `cut-criteria-check (issue #429 — forbid hand-edits to rendered docs)` in `.github/workflows/ci.yml` (verify-static job). Same double-invocation pattern slo-rules-check uses — the underlying `make cut-criteria-check` already runs transitively under `make doc-check`, but the second call surfaces the gate as a separately named PR-check line item so the author who edits the rendered file sees the failure named for the gate that catches it. Cost: ~500 ms (pure-Python re-render + diff). - **Local pre-commit**: `.githooks/pre-commit` now runs the drift check inline when `docs/v1-rc1-cut-criteria.md` or `docs/v1-ga-cut-criteria.md` is in the staged set, with a tailored failure message pointing at the YAML source-of-truth. Catches the foot-gun before the commit lands — CI is the backstop, not the only line of defense. Zero overhead when those files aren't staged. ## Root cause Not a missing gate — `cmd_check` in `scripts/cut_criteria.py` has been the drift gate since PR #397 (May 2026), wired into CI via `make doc-check`. The actual gap is **author cognitive visibility**: the gate's failure surfaces inside an aggregated `doc-check` step that runs half a dozen other checks, so a PR author hand-editing the rendered file sees a generic "doc-check failed" line on the PR page and may not connect it to "the rendered file you edited is computed". PR #409's flip sneaked through because no human caught the YAML-vs-rendered drift in review — the gate would have caught the next render, but #409 merged before that render happened. This PR keeps the same underlying gate; it just names it loud enough that the next author sees the signal at three checkpoints (pre-commit, CI step name, failure message body). ## A+ scan: other COMPUTED outputs Audited `docs/` and root-config files for render-output documents lacking a drift gate: | File | Gate | Status | | --- | --- | --- | | `docs/v1-rc1-cut-criteria.md` | `make cut-criteria-check` | gated (this PR makes it explicit) | | `docs/v1-ga-cut-criteria.md` | `make cut-criteria-check` | gated (same) | | `install/kubernetes/tracecore/Chart.yaml` (appVersion) | `make chart-appversion-check` | already gated | | `install/kubernetes/tracecore/dashboards/slo-rules.yaml` | `make slo-rules-check` | already gated | | `_build/tracecore/*` (OCB-generated) | gitignored | not a hand-edit risk | | `cmd/tracecore/components.go` (RFC-0013 deleted) | n/a | deleted | No new gaps. The cut-criteria pair is the only computed Markdown surface that ships in-tree. ## Test plan - [x] `make cut-criteria-check` on clean main → exit 0 - [x] Inject one-glyph hand-edit to `docs/v1-rc1-cut-criteria.md` → `make cut-criteria-check` exit 2, diff + actionable message ("Run `make cut-criteria-render` and commit the result") - [x] `.githooks/pre-commit` with hand-edit staged → exit 1 + tailored message ("docs/v1-*-cut-criteria.md is a COMPUTED render …") - [x] `.githooks/pre-commit` with rendered file NOT staged → no-op, falls through to `make check` (zero overhead path verified) - [x] `make doc-check` end-to-end → exit 0 - [x] `make actionlint` on the modified workflow → 0 issues - [x] `make check` (commit-time gate) → 0 issues, including DCO sign-off ```release-notes ci(cut-criteria): forbid hand-edits to rendered docs/v1-{rc1,ga}-cut-criteria.md — dedicated `cut-criteria-check` PR-check + local pre-commit catch. Source-of-truth is docs/cut-criteria.yaml; edits go through `make cut-criteria-render`. ``` Signed-off-by: Tri Lam <tree@lumalabs.ai>

trilamsr mentioned this pull request Jun 1, 2026

refactor(cut-criteria): unify 3 bash scripts into one Python module #395

Closed

7 tasks

trilamsr enabled auto-merge (squash) June 1, 2026 22:16

trilamsr merged commit 1c41c61 into main Jun 1, 2026
23 checks passed

trilamsr deleted the refactor/cut-criteria-python-386 branch June 1, 2026 22:21

trilamsr mentioned this pull request Jun 1, 2026

feat(recipes): OTTL stanzas + bridge for pattern #7 (#364 #365) #406

Merged

3 tasks

This was referenced Jun 1, 2026

audit(wave-2026-06-01): post-wave cross-cut review #421

Closed

chore(cut-criteria): forbid hand-edits to render-output docs/v1-rc1-cut-criteria.md #429

Closed

ci(cut-criteria): explicit gate + local pre-commit catch (#429) #433

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

refactor(cut-criteria): single Python script, drop base64 (#386)#397

refactor(cut-criteria): single Python script, drop base64 (#386)#397
trilamsr merged 1 commit into
mainfrom
refactor/cut-criteria-python-386

trilamsr commented Jun 1, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

trilamsr commented Jun 1, 2026

Summary

Root cause

Adds (per issue ask)

LOC delta

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant