feat(nccl-boot): pattern-9 bootstrap-timeout detector by trilamsr · Pull Request #347 · TraceCoreAI/tracecore

trilamsr · 2026-06-01T08:41:36Z

Summary

Ships pattern-#9 (NCCL bootstrap timeout) detector end-to-end — the first job-start-time pattern in the library (sibling to pattern #8 which fires mid-run). A training-job cohort whose pods are Ready past BootstrapDeadline (default 5min) but where at least one rank never emitted any NCCL FlightRecorder record is stuck in NCCL bootstrap; a same-namespace K8s CNI / network-readiness event in the correlation window promotes the verdict to confidence=full and stamps discriminator=cni_error.

Spec: docs/patterns/09-nccl-bootstrap-timeout.md. Status flipped from planned → shipped; Implementation notes section captures how each spec open-question resolved with the most-conservative reading.

What landed

module/pkg/patterns/nccl_bootstrap.go — detector + TrainingPodRecord / CNINetworkEventRecord / NCCLBootstrapTimeoutVerdict types. Reuses NCCLFRRecord from nccl_hang.go.
module/pkg/patterns/nccl_bootstrap_test.go — 11 detector tests + schema-conformance + 10-falsifier drift battery. Covers: full-correlation fires, partial-when-no-CNI, normal-startup-no-fire, deadline-not-yet-reached, heterogeneous failure, multi-job cohorts don't merge, namespace-only fallback, cross-namespace CNI doesn't join, deadline-configurable, deterministic ordering, max(ReadyAt) drives age.
module/pkg/patterns/testdata/nccl_bootstrap_verdict.schema.json — JSON Schema with additionalProperties:false and full enum guards.
module/processor/patterndetectorprocessor/nccl_bootstrap.go: projections (projectTrainingPodRecord gates on k8s.pod.ready_time + gen_ai.training.rank; projectCNINetworkEventRecord gates on k8s.event.reason ∈ {FailedCreatePodSandBox, NetworkNotReady, CNIError}), verdict writer with promoted scalars (issue #270 contract: promote operator-facing scalar attrs onto verdict records), and runner that consumes NCCL FR records from the existing cross-cutting collectInputs (no double-projection).
module/processor/patterndetectorprocessor/nccl_bootstrap_test.go — 6 wiring tests (full verdict, partial verdict, partial-suppressed-by-flag, normal-startup-no-fire, sub-1s deadline rejection, sub-1s window rejection).
Config.NCCLBootstrapDeadline + Config.NCCLBootstrapCorrelationWindow with Validate guards (≥1s) and withDefaults / defaultConfig wiring; example_config.yaml updated.
docs/ATTRIBUTES.md — 3 new tracecore.alert.nccl_bootstrap_timeout.* rows, new k8s.pod.ready_time row, updated gen_ai.training.job_id row (now consumed with fallback), new per-pattern matrix row for nccl_bootstrap.

Design calls (load-bearing)

Cohort key. (gen_ai.training.job_id, k8s.namespace.name) when stamped; (k8s.namespace.name)-only fallback when job_id is absent (spec open question #1). Empty gen_ai.training.job_id on the verdict signals the fallback path to operators.
Bootstrap-failed-rank index key. (node, rank) not (namespace, rank) — avoids cross-cohort contamination when two jobs in the same namespace land on different nodes. FR records with empty Node are skipped from the index (a wiring gap should NOT cause false-negatives — i.e. mask real bootstrap failures — even at the cost of cross-job false-positives that are unlikely in practice).
CNI vocab. v0 ships the K8s-control-plane vocabulary only (FailedCreatePodSandBox / NetworkNotReady / CNIError). Per-CNI raw-error parsing (Cilium / Calico / multus distinct strings) is the discriminator-branch follow-up that lights up socket_ifname_mismatch / rendezvous_unreachable.
Cohort size. Count of distinct ranks the detector observed pod-Ready signals for. Pods that never reached Ready (image-pull stuck) don't enter the cohort; they belong to pattern #15. Per the spec's "slow image pull" edge case, this produces no false positive.
max(ReadyAt) drives deadline. A late-joining rank pushes the effective ready timestamp forward, preventing false-positives during rolling pod-Ready scenarios on cold-cache clusters.
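The cohort-key and cohort-size calls above can be illustrated with a small grouping sketch. `TrainingPod`, `cohortKey`, `groupCohorts`, and `cohortSize` are hypothetical names for illustration (the real type is TrainingPodRecord); the point is that an empty job ID naturally collapses the key to namespace-only, and two jobs in one namespace stay separate.

```go
package main

import "fmt"

// TrainingPod is a hypothetical, simplified stand-in for TrainingPodRecord.
type TrainingPod struct {
	JobID     string // gen_ai.training.job_id; empty when not stamped
	Namespace string // k8s.namespace.name
	Rank      int    // gen_ai.training.rank
}

// cohortKey is (job_id, namespace); with an empty JobID it degrades to the
// namespace-only fallback.
type cohortKey struct {
	JobID     string
	Namespace string
}

// groupCohorts buckets Ready pods by cohort key.
func groupCohorts(pods []TrainingPod) map[cohortKey][]TrainingPod {
	out := make(map[cohortKey][]TrainingPod)
	for _, p := range pods {
		k := cohortKey{JobID: p.JobID, Namespace: p.Namespace}
		out[k] = append(out[k], p)
	}
	return out
}

// cohortSize counts the distinct ranks observed in one cohort.
func cohortSize(pods []TrainingPod) int {
	ranks := make(map[int]struct{})
	for _, p := range pods {
		ranks[p.Rank] = struct{}{}
	}
	return len(ranks)
}

func main() {
	pods := []TrainingPod{
		{JobID: "job-a", Namespace: "train", Rank: 0},
		{JobID: "job-a", Namespace: "train", Rank: 1},
		{JobID: "job-b", Namespace: "train", Rank: 0},
		{Namespace: "adhoc", Rank: 0}, // no job_id: namespace-only fallback
	}
	groups := groupCohorts(pods)
	fmt.Println(len(groups))
	fmt.Println(cohortSize(groups[cohortKey{JobID: "job-a", Namespace: "train"}]))
}
```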

Test plan

cd module && go test ./pkg/patterns/... ./processor/patterndetectorprocessor/... — clean
cd module && go vet ./... — clean
Pre-commit hook: golangci-lint run ./... — 0 issues; attribute-namespace-check — 72/72 documented
TDD discipline: test(nccl-boot): RED → feat(nccl-boot): GREEN commits
CI green on PR (full matrix)

feat(patterns): pattern-9 (NCCL bootstrap timeout) detector — fires when a training-job cohort has at least one rank with no NCCL FR record past `BootstrapDeadline` from pod-ready (default 5min); a same-namespace `FailedCreatePodSandBox` / `NetworkNotReady` / `CNIError` event promotes to `confidence=full` with `discriminator=cni_error`. New YAML knobs: `nccl_bootstrap_deadline` (default 5m), `nccl_bootstrap_correlation_window` (default 10m). Verdict shape pinned by `nccl_bootstrap_verdict.schema.json`.

The previous max(ReadyAt) deadline gate silently SUPPRESSED verdicts for genuinely-stuck early ranks whenever a late-joining rank pushed max(ReadyAt) forward: a 15-min-ready stuck rank would be masked by a peer ready 2min ago.

Switch the deadline gate to min(ReadyAt) so the bootstrap window is measured from the FIRST-ready rank's perspective, which is the rank whose bootstrap is genuinely stuck. max(ReadyAt) is retained for evidence-trail timestamp anchoring as the cohort's last-known-good Ready signal (the operator-visible "most recent Ready event on this cohort" surface).

The slow-image-pull guard the original max() phrasing was meant to provide is naturally handled upstream: pods that haven't reached Ready don't enter the cohort at all (spec edge case "Slow image pull").

Tests:
* Rename TestNCCLBootstrapDetector_MaxPodReadyDrivesAge → MinPodReadyDrivesAge with inverted assertion (verdict now fires).
* New TestNCCLBootstrapDetector_LateJoinerDoesNotMaskStuckRank pins the load-bearing property with heterogeneous ReadyAt.
* New TestNCCLBootstrapDetector_MaxPodReadyAnchorsEvidence pins max(ReadyAt) as the evidence-trail anchor.

Signed-off-by: Tri Lam <tri@maydow.com>

Reviewer cleanup pass on yellow-tier findings:
* Schema: tighten gen_ai.training.job_id to minLength:1 and document the fallback-grouping semantic explicitly. The field is OMITTED (not empty-string) on the namespace-only fallback path. Downstream consumers must treat ABSENCE as the explicit fallback signal, not silent exclusion. The processor already uses putStrIfSet to suppress empty-string variants.
* Spec eval rule: clarify Pattern #8 (NCCL hang) vs Pattern #9 (NCCL bootstrap) trigger disjoint-ness. Hang fires on PRESENCE of non-completed FR records mid-run; bootstrap fires on ABSENCE of any FR record past deadline. Both can fire on the same cohort during a heterogeneous bootstrap by design.
* Spec impl-notes: add note #5 documenting the min(ReadyAt) / max(ReadyAt) split with the late-joiner-masks-stuck-rank scenario as the rationale.
* Empty-Node skip comment in detector: the previous comment claimed the skip biases toward false-negatives; in fact it STRENGTHENS the absence signal (rank stays "no FR seen" → counted as failed), biasing toward false-positives. Correct the directionality and call out the fallback-to-(namespace, rank) escape hatch.

Signed-off-by: Tri Lam <tri@maydow.com>
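The omit-versus-empty-string contract for gen_ai.training.job_id might be pictured like this. `putStringIfSet` below is a hypothetical analogue written over a plain map, not the processor's actual putStrIfSet (whose signature is not shown in this PR), and `usedNamespaceFallback` is an illustrative consumer-side check.

```go
package main

import "fmt"

// putStringIfSet writes key only when val is non-empty, so an unset value is
// omitted from the record rather than emitted as "".
// Hypothetical analogue of the processor's putStrIfSet.
func putStringIfSet(attrs map[string]string, key, val string) {
	if val != "" {
		attrs[key] = val
	}
}

// usedNamespaceFallback shows the consumer-side reading: ABSENCE of the
// job_id attribute is the explicit fallback signal.
func usedNamespaceFallback(attrs map[string]string) bool {
	_, ok := attrs["gen_ai.training.job_id"]
	return !ok
}

func main() {
	stamped := map[string]string{}
	putStringIfSet(stamped, "gen_ai.training.job_id", "job-a")

	fallback := map[string]string{}
	putStringIfSet(fallback, "gen_ai.training.job_id", "")

	fmt.Println(usedNamespaceFallback(stamped), usedNamespaceFallback(fallback))
}
```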

trilamsr · 2026-06-01T08:54:09Z

Addressed reviewer BLOCKER + several yellows:

🔴 Fixed: deadline now uses min(ReadyAt) — a stuck early rank is detected even when later ranks join after the bootstrap window. max(ReadyAt) retained for evidence-trail anchoring only. New test TestNCCLBootstrapDetector_LateJoinerDoesNotMaskStuckRank pins the property (rank0 ReadyAt = T−10min, rank1 ReadyAt = T+2min, eval at T+8min, deadline 5min → verdict fires because min-age = 18min). TestNCCLBootstrapDetector_MaxPodReadyDrivesAge renamed → MinPodReadyDrivesAge with inverted assertion. New TestNCCLBootstrapDetector_MaxPodReadyAnchorsEvidence pins max(ReadyAt) as the evidence-trail anchor. Spec impl-note #5 documents the min/max split + rationale.

🟡 Fixed: empty-Node skip comment corrected — was claiming FN bias, actually STRENGTHENS the absence signal (rank stays "no FR seen" → counted as failed), biasing toward FPs. Comment now names the directionality + the (namespace, rank) fallback escape hatch.

🟡 Fixed: gen_ai.training.job_id schema tightened (minLength:1) + description clarifies that the verdict OMITS the field on the namespace-only fallback path (rather than emitting empty-string) — downstream consumers must treat ABSENCE as the explicit fallback signal. The processor already uses putStrIfSet to suppress the empty-string variant on the log-record attribute path, so no code change there.

🟡 Spec: Pattern #8 (NCCL hang) vs #9 (NCCL bootstrap) trigger disjoint-ness called out in eval rule — hang fires on PRESENCE of non-completed FR records mid-run; bootstrap fires on ABSENCE of any FR record past deadline. Both can fire concurrently on a heterogeneous bootstrap by design.
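To make the presence-versus-absence distinction concrete, here is a deliberately simplified sketch of the two triggers. `FRRecord`, `hangTrigger`, and `bootstrapTrigger` are hypothetical names, and the real pattern #8 and #9 rules carry more conditions than shown; the example only illustrates why both can fire on one heterogeneous cohort.

```go
package main

import "fmt"

// FRRecord is a hypothetical, simplified stand-in for NCCLFRRecord.
type FRRecord struct {
	Rank      int
	Completed bool
}

// hangTrigger models pattern #8: at least one FR record is PRESENT and not
// completed.
func hangTrigger(records []FRRecord) bool {
	for _, r := range records {
		if !r.Completed {
			return true
		}
	}
	return false
}

// bootstrapTrigger models pattern #9: past the deadline, at least one expected
// rank has emitted NO FR record at all.
func bootstrapTrigger(expectedRanks []int, records []FRRecord, pastDeadline bool) bool {
	if !pastDeadline {
		return false
	}
	seen := make(map[int]bool)
	for _, r := range records {
		seen[r.Rank] = true
	}
	for _, rank := range expectedRanks {
		if !seen[rank] {
			return true
		}
	}
	return false
}

func main() {
	// Heterogeneous bootstrap: rank 0 has a non-completed record, rank 1 has none.
	records := []FRRecord{{Rank: 0, Completed: false}}
	ranks := []int{0, 1}
	fmt.Println(hangTrigger(records), bootstrapTrigger(ranks, records, true)) // true true
}
```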

Out of scope (acknowledged risks, not addressed this PR):

Cohort-key namespace-only fallback collision in multi-job-per-namespace clusters: tracked as spec Open Q#1, follow-up issue to file.
Closed-set CNI vocab: explicitly punted in spec impl-note #3.
Duplicate-projection tree walks (O(n)×2): micro-perf only, deferred.

Commits: 071a3e8 (BLOCKER fix) + cf66328 (schema/spec/comment hygiene). All local hooks + go test ./pkg/patterns/... ./processor/patterndetectorprocessor/... + go vet ./... green.

## Summary

CI `changes` pre-flight job intermittently fails with exit 128 when `origin/$base` ref isn't fully fetched (shallow-clone race / fresh runner). `git diff origin/$base...HEAD` then exits non-zero; `bash -e` propagates and fails the whole workflow.

## Root cause

`set -e` from `bash -e` causes the command-substitution `changed=$(git diff ...)` to abort on non-zero exit even with `2>/dev/null` redirecting stderr. Append `|| true` so failure falls through to the existing "treat as code-changed" default.

## Test plan

- [x] yaml.safe_load parses cleanly
- [x] actionlint + zizmor clean
- [x] golangci-lint + go vet + attribute-namespace-check + doc-check + alert-check + chart-appversion-check + deprecation-check + no-autoupdate-check all green
- [ ] Verified by next PR's pre-flight running green

Reproduced flake on PRs #347 + #357.

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>

Tri Lam added 5 commits June 1, 2026 01:26

test(nccl-boot): RED — pattern-9 bootstrap-timeout detector tests + s…

e24e72e

…chema Signed-off-by: Tri Lam <tri@maydow.com>

feat(nccl-boot): GREEN — pattern-9 bootstrap-timeout detector

e66982a

Signed-off-by: Tri Lam <tri@maydow.com>

feat(nccl-boot): wire pattern-9 detector + docs + ATTRIBUTES

156e90a

Signed-off-by: Tri Lam <tri@maydow.com>

Tri Lam added 2 commits June 1, 2026 02:57

Merge remote-tracking branch 'origin/main' into feat/pattern-9-nccl-b…

a86ad68

…ootstrap # Conflicts: # docs/ATTRIBUTES.md # module/processor/patterndetectorprocessor/config.go # module/processor/patterndetectorprocessor/example_config.yaml

Merge remote-tracking branch 'origin/main' into feat/pattern-9-nccl-b…

3112bdf

…ootstrap # Conflicts: # docs/ATTRIBUTES.md # module/processor/patterndetectorprocessor/config.go # module/processor/patterndetectorprocessor/patterndetector.go

trilamsr mentioned this pull request Jun 1, 2026

ci(changes): make pre-flight diff permissive on missing base ref #363 (Merged)

Merge branch 'main' into feat/pattern-9-nccl-bootstrap

dc18ca2

trilamsr merged commit cd29f1b into main Jun 1, 2026
22 checks passed

trilamsr deleted the feat/pattern-9-nccl-bootstrap branch June 1, 2026 20:11

trilamsr merged 8 commits into main from feat/pattern-9-nccl-bootstrap
