Skip to content

refactor(cut-criteria): single Python script, drop base64 (#386)#397

Merged
trilamsr merged 1 commit into
mainfrom
refactor/cut-criteria-python-386
Jun 1, 2026
Merged

refactor(cut-criteria): single Python script, drop base64 (#386)#397
trilamsr merged 1 commit into
mainfrom
refactor/cut-criteria-python-386

Conversation

@trilamsr

@trilamsr trilamsr commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

Summary

Replace scripts/cut-criteria-{status,render,status_test}.sh (522 LOC bash) with a single scripts/cut_criteria.py (642 LOC Python) exposing status, render, check, and test subcommands. Output is byte-identical for the same YAML + repo state (the only deliberate diff is the script-name reference line in the rendered legend).

Root cause

PR #374 used base64-over-TSV plumbing in cut-criteria-status.sh to survive embedded newlines through a bash read loop — a macOS-3.2 compat tax. subprocess.run(snippet, shell=True) passes multi-line YAML block scalars to /bin/bash directly with no encoding hop. A regression fixture (multiline-shell-no-base64) locks this in.

Adds (per issue ask)

  • YAML schema validation at parse time (required keys, allowed tiers, duplicate-id check) — surfaces malformed spec at make cut-criteria-status instead of producing quietly-wrong markdown.
  • check subcommand renders in-process and diffs against on-disk markdown; no mktemp/diff -u/rm -f choreography. Makefile target collapsed 11 lines → 1.

LOC delta

Makefile                            |  17 +-
docs/cut-criteria.yaml              |   2 +-
docs/cut-criteria.yaml.md           |  20 +-
docs/v1-rc1-cut-criteria.md         |   2 +-
scripts/cut-criteria-render.sh      | 240 --------
scripts/cut-criteria-status.sh      | 122 ------
scripts/cut-criteria-status_test.sh | 160 -------
scripts/cut_criteria.py             | 642 ++++++++++++++++++++++++++++
8 files changed, 660 insertions(+), 545 deletions(-)

Net: +115 LOC but the bash → Python migration accounts for the bulk; the actual logic shrinks since base64/TSV plumbing is gone. Issue acceptance was "~200 LOC delete"; actual is 522 bash deleted vs 642 Python added, net +120 — but readability + correctness wins justify it.

Test plan

  • make cut-criteria-render produces byte-identical docs/v1-rc1-cut-criteria.md (verified via git diff --quiet).
  • make cut-criteria-status prints the same TSV table (computed status: 8 shipped / 2 in-progress / 1 planned tier-1; 1 shipped / 2 planned tier-2).
  • make cut-criteria-check exit code 0 on clean tree.
  • python3 scripts/cut_criteria.py test passes all fixture cases.
  • Schema validation: malformed YAML (missing required key) errors with clear message.

Closes #386.

refactor: replace `cut-criteria-{status,render,status_test}.sh` with single `scripts/cut_criteria.py` — byte-identical output, YAML schema validation, simpler drift gate.

Replace `scripts/cut-criteria-{status,render,status_test}.sh` with a
single `scripts/cut_criteria.py` exposing `status`, `render`, `check`,
and `test` subcommands. Output is byte-identical for the same YAML +
repo state (the only deliberate diff is the script-name reference line
in the rendered legend).

Drops:
- The base64-over-TSV plumbing in cut-criteria-status.sh. It existed
  only to survive embedded newlines through a bash `read` loop, a
  macOS-3.2 compat tax. subprocess.run(snippet, shell=True) passes
  multi-line YAML block scalars to /bin/bash directly, no encoding
  hop. A regression fixture (`multiline-shell-no-base64`) locks this
  in: a future refactor that reintroduces a TSV/base64 round-trip
  without honoring newlines goes red.
- The per-shell-invocation Python heredocs in render.sh (196 LOC) and
  status.sh (75 LOC). Three Makefile shell-outs to one module now
  replace three bash scripts that each shelled out to python3.
- The separate `*_test.sh` test harness (160 LOC). The fixture suite
  is now a `cut_criteria.py test` subcommand using the same in-process
  status_rows() the production path uses.

Adds (per issue ask):
- YAML schema validation at parse time (required keys, allowed tiers,
  duplicate-id check) — surfaces malformed spec at `make
  cut-criteria-status` instead of producing a quietly-wrong markdown.
- `check` subcommand that renders in-process and diffs against the
  on-disk markdown; no `mktemp`/`diff -u`/`rm -f` choreography. The
  Makefile target collapsed from 11 lines to 1.

Closes #386.

Signed-off-by: Tri Lam <tri@maydow.com>
@trilamsr trilamsr enabled auto-merge (squash) June 1, 2026 22:16
@trilamsr trilamsr merged commit 1c41c61 into main Jun 1, 2026
23 checks passed
@trilamsr trilamsr deleted the refactor/cut-criteria-python-386 branch June 1, 2026 22:21
trilamsr added a commit that referenced this pull request Jun 1, 2026
## Summary

Bundles #364 (per-driver OTTL stanzas for `dataloader.error_class`) and
#365 (training-step-stalled metrics-to-logs bridge contract) — both
pattern-7, both touch the same recipe surface. This PR REPLACES #401
(which was opened against pre-wave main and accumulated unrelated
scope-creep via merge-resolution).

### #364 — `dataloader.error_class` per-driver regex

`docs/integrations/filelog-container.md` +
`examples/filelog-container.yaml` ship a `transform/dataloader_errors`
processor with one stanza per driver class. Each stanza stamps the
customer-stable `dataloader.error_class` value the runbook branches on,
plus an optional `dataloader.worker_pid` int when the line carries one.

Six driver classes: `DataLoader worker killed`, `FUSE transport error`,
`S3 throttle`, `Stale file handle`, `DataLoader queue empty`,
`Connection reset by peer`. Resolves pattern #7 spec Open Q#3.

### #365 — training-step-stalled bridge contract

`docs/integrations/prometheus-scrape.md` grows a §"Pattern #7" section
documenting the load-bearing wire contract for the bridge log record
(`tracecore.alert.training_step_stalled.no_progress_seconds` +
`last_step_ns` + `gen_ai.training.step` + `phase`). Per RFC-0014 the
emitter half stays upstream-blocked until WithMetrics PR-B or
`metricthresholdconnector` lands; the detector reads the contract today.

### Test wiring

9 sub-tests in
`module/processor/patterndetectorprocessor/dataloader_hang_test.go` pin
every per-driver class value + every bridge attribute guard against live
detector wiring (eval-phase guard + warmup `step >= 2` guard).

## Why a redo PR

#401 was branched from pre-wave main (commit `636c2a2`). When it merged
main back in, it accumulated re-deletions of recently-merged #389
(composite action) + re-adds of #379/#381 already shipped via #392 +
#386 already shipped via #397. Cleanest path: cherry-pick the
recipe-only commit onto current main + resolve the one true conflict
(DCGM helm-template comment in prometheus-scrape.md vs the new pattern-7
NOTE block).

## Test plan

- [x] `golangci-lint`, `go vet`, `attribute-namespace-check` — green
- [x] `go test ./module/processor/patterndetectorprocessor/... -run
DataLoader` — pass
- [x] Pre-push hook gates all green

Closes #364. Closes #365. Supersedes #401.

```release-notes
feat(recipes): pattern #7 OTTL stanzas for `dataloader.error_class` per-driver classification + documented metrics-to-logs bridge contract for `training_step_stalled` log record.
```

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request Jun 2, 2026
## Summary

Closes #429. Two-layer defense for the hand-edit foot-gun that audit
finding #421-4 surfaced on PR #409 (single-char glyph flip on a COMPUTED
rendered doc that the next `make cut-criteria-render` would have
silently
clobbered).

- **CI**: dedicated named workflow step `cut-criteria-check (issue #429
  — forbid hand-edits to rendered docs)` in `.github/workflows/ci.yml`
  (verify-static job). Same double-invocation pattern slo-rules-check
  uses — the underlying `make cut-criteria-check` already runs
  transitively under `make doc-check`, but the second call surfaces the
  gate as a separately named PR-check line item so the author who edits
  the rendered file sees the failure named for the gate that catches it.
  Cost: ~500 ms (pure-Python re-render + diff).

- **Local pre-commit**: `.githooks/pre-commit` now runs the drift check
inline when `docs/v1-rc1-cut-criteria.md` or
`docs/v1-ga-cut-criteria.md`
  is in the staged set, with a tailored failure message pointing at the
  YAML source-of-truth. Catches the foot-gun before the commit lands —
  CI is the backstop, not the only line of defense. Zero overhead when
  those files aren't staged.

## Root cause

Not a missing gate — `cmd_check` in `scripts/cut_criteria.py` has been
the drift gate since PR #397 (May 2026), wired into CI via
`make doc-check`. The actual gap is **author cognitive visibility**: the
gate's failure surfaces inside an aggregated `doc-check` step that runs
half a dozen other checks, so a PR author hand-editing the rendered file
sees a generic "doc-check failed" line on the PR page and may not
connect it to "the rendered file you edited is computed". PR #409's flip
sneaked through because no human caught the YAML-vs-rendered drift in
review — the gate would have caught the next render, but #409 merged
before that render happened.

This PR keeps the same underlying gate; it just names it loud enough
that the next author sees the signal at three checkpoints (pre-commit,
CI step name, failure message body).

## A+ scan: other COMPUTED outputs

Audited `docs/` and root-config files for render-output documents
lacking a drift gate:

| File | Gate | Status |
| --- | --- | --- |
| `docs/v1-rc1-cut-criteria.md` | `make cut-criteria-check` | gated
(this PR makes it explicit) |
| `docs/v1-ga-cut-criteria.md` | `make cut-criteria-check` | gated
(same) |
| `install/kubernetes/tracecore/Chart.yaml` (appVersion) | `make
chart-appversion-check` | already gated |
| `install/kubernetes/tracecore/dashboards/slo-rules.yaml` | `make
slo-rules-check` | already gated |
| `_build/tracecore/*` (OCB-generated) | gitignored | not a hand-edit
risk |
| `cmd/tracecore/components.go` (RFC-0013 deleted) | n/a | deleted |

No new gaps. The cut-criteria pair is the only computed Markdown surface
that ships in-tree.

## Test plan

- [x] `make cut-criteria-check` on clean main → exit 0
- [x] Inject one-glyph hand-edit to `docs/v1-rc1-cut-criteria.md` →
      `make cut-criteria-check` exit 2, diff + actionable message
      ("Run `make cut-criteria-render` and commit the result")
- [x] `.githooks/pre-commit` with hand-edit staged → exit 1 + tailored
      message ("docs/v1-*-cut-criteria.md is a COMPUTED render …")
- [x] `.githooks/pre-commit` with rendered file NOT staged → no-op,
      falls through to `make check` (zero overhead path verified)
- [x] `make doc-check` end-to-end → exit 0
- [x] `make actionlint` on the modified workflow → 0 issues
- [x] `make check` (commit-time gate) → 0 issues, including DCO sign-off

```release-notes
ci(cut-criteria): forbid hand-edits to rendered docs/v1-{rc1,ga}-cut-criteria.md — dedicated `cut-criteria-check` PR-check + local pre-commit catch. Source-of-truth is docs/cut-criteria.yaml; edits go through `make cut-criteria-render`.
```

Signed-off-by: Tri Lam <tree@lumalabs.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[refactor] simplify cut-criteria scripts (drop base64, single Python)

1 participant