feat(nccl-boot): pattern-9 bootstrap-timeout detector#347

Merged
trilamsr merged 8 commits into main from feat/pattern-9-nccl-bootstrap on Jun 1, 2026

trilamsr (Contributor) commented Jun 1, 2026

Summary

Ships pattern-#9 (NCCL bootstrap timeout) detector end-to-end — the first job-start-time pattern in the library (sibling to pattern #8 which fires mid-run). A training-job cohort whose pods are Ready past BootstrapDeadline (default 5min) but where at least one rank never emitted any NCCL FlightRecorder record is stuck in NCCL bootstrap; a same-namespace K8s CNI / network-readiness event in the correlation window promotes the verdict to confidence=full and stamps discriminator=cni_error.

Spec: docs/patterns/09-nccl-bootstrap-timeout.md. Status flipped from planned → shipped; the Implementation notes section captures how each spec open question was resolved with the most conservative reading.

What landed

  • module/pkg/patterns/nccl_bootstrap.go — detector + TrainingPodRecord / CNINetworkEventRecord / NCCLBootstrapTimeoutVerdict types. Reuses NCCLFRRecord from nccl_hang.go.
  • module/pkg/patterns/nccl_bootstrap_test.go — 11 detector tests + schema-conformance + 10-falsifier drift battery. Covers: full-correlation fires, partial-when-no-CNI, normal-startup-no-fire, deadline-not-yet-reached, heterogeneous failure, multi-job cohorts don't merge, namespace-only fallback, cross-namespace CNI doesn't join, deadline-configurable, deterministic ordering, max(ReadyAt) drives age.
  • module/pkg/patterns/testdata/nccl_bootstrap_verdict.schema.json — JSON Schema with additionalProperties:false and full enum guards.
  • module/processor/patterndetectorprocessor/nccl_bootstrap.go: projections (projectTrainingPodRecord gates on k8s.pod.ready_time + gen_ai.training.rank; projectCNINetworkEventRecord gates on k8s.event.reason ∈ {FailedCreatePodSandBox, NetworkNotReady, CNIError}), a verdict writer with promoted scalars (per the issue #270 contract, "patterndetectorprocessor: promote operator-facing scalar attrs onto verdict records"), and a runner that consumes NCCL FR records from the existing cross-cutting collectInputs (no double-projection).
  • module/processor/patterndetectorprocessor/nccl_bootstrap_test.go — 6 wiring tests (full verdict, partial verdict, partial-suppressed-by-flag, normal-startup-no-fire, sub-1s deadline rejection, sub-1s window rejection).
  • Config.NCCLBootstrapDeadline + Config.NCCLBootstrapCorrelationWindow with Validate guards (≥1s) and withDefaults / defaultConfig wiring; example_config.yaml updated.
  • docs/ATTRIBUTES.md — 3 new tracecore.alert.nccl_bootstrap_timeout.* rows, new k8s.pod.ready_time row, updated gen_ai.training.job_id row (now consumed with fallback), new per-pattern matrix row for nccl_bootstrap.

Design calls (load-bearing)

  • Cohort key. (gen_ai.training.job_id, k8s.namespace.name) when stamped; (k8s.namespace.name)-only fallback when job_id is absent (spec open question #1). Empty gen_ai.training.job_id on the verdict signals the fallback path to operators.
  • Bootstrap-failed-rank index key. (node, rank) not (namespace, rank) — avoids cross-cohort contamination when two jobs in the same namespace land on different nodes. FR records with empty Node are skipped from the index (a wiring gap should NOT cause false-negatives — i.e. mask real bootstrap failures — even at the cost of cross-job false-positives that are unlikely in practice).
  • CNI vocab. v0 ships the K8s-control-plane vocabulary only (FailedCreatePodSandBox / NetworkNotReady / CNIError). Per-CNI raw-error parsing (Cilium / Calico / multus distinct strings) is the discriminator-branch follow-up that lights up socket_ifname_mismatch / rendezvous_unreachable.
  • Cohort size. Count of distinct ranks for which the detector observed pod-Ready signals. Pods that never reached Ready (e.g. stuck on image pull) don't enter the cohort; they belong to pattern #15. This matches the spec's "slow image pull" edge case: no false positive.
  • max(ReadyAt) drives deadline. A late-joining rank pushes the effective ready timestamp forward, preventing false-positives during rolling pod-Ready scenarios on cold-cache clusters.
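
To make the cohort-key and cohort-size calls above concrete, here is a minimal sketch of grouping Ready pods. `Pod`, `cohortKey`, and `groupCohorts` are hypothetical names for illustration; only the attribute names in the comments come from the PR.

```go
package main

import "fmt"

// Pod is a hypothetical, trimmed-down stand-in for TrainingPodRecord.
// Only pods that reached Ready are expected to be passed in.
type Pod struct {
	JobID     string // gen_ai.training.job_id; empty when not stamped
	Namespace string // k8s.namespace.name
	Rank      int    // gen_ai.training.rank
}

// cohortKey degrades to namespace-only grouping when JobID is empty,
// because every unstamped pod in a namespace shares the same key.
type cohortKey struct {
	JobID     string
	Namespace string
}

// groupCohorts buckets pods by cohort key and tracks the distinct ranks seen,
// so len(ranks) is the cohort size.
func groupCohorts(pods []Pod) map[cohortKey]map[int]struct{} {
	out := make(map[cohortKey]map[int]struct{})
	for _, p := range pods {
		k := cohortKey{JobID: p.JobID, Namespace: p.Namespace}
		if out[k] == nil {
			out[k] = make(map[int]struct{})
		}
		out[k][p.Rank] = struct{}{}
	}
	return out
}

func main() {
	pods := []Pod{
		{JobID: "job-a", Namespace: "train", Rank: 0},
		{JobID: "job-a", Namespace: "train", Rank: 1},
		{JobID: "job-b", Namespace: "train", Rank: 0}, // same namespace, different job: separate cohort
		{JobID: "", Namespace: "train", Rank: 0},      // unstamped: namespace-only fallback cohort
		{JobID: "", Namespace: "train", Rank: 0},      // duplicate rank does not grow the cohort
	}
	cohorts := groupCohorts(pods)
	fmt.Println(len(cohorts))
	fmt.Println(len(cohorts[cohortKey{JobID: "job-a", Namespace: "train"}]))
	fmt.Println(len(cohorts[cohortKey{Namespace: "train"}]))
}
```

The fallback cohort is also where the collision risk lives: two unstamped jobs in one namespace would land in the same bucket.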

Test plan

  • cd module && go test ./pkg/patterns/... ./processor/patterndetectorprocessor/... — clean
  • cd module && go vet ./... — clean
  • Pre-commit hook: golangci-lint run ./... — 0 issues; attribute-namespace-check — 72/72 documented
  • TDD discipline: test(nccl-boot): RED → feat(nccl-boot): GREEN commits
  • CI green on PR (full matrix)
feat(patterns): pattern-9 (NCCL bootstrap timeout) detector — fires when a training-job cohort has at least one rank with no NCCL FR record past `BootstrapDeadline` from pod-ready (default 5min); a same-namespace `FailedCreatePodSandBox` / `NetworkNotReady` / `CNIError` event promotes to `confidence=full` with `discriminator=cni_error`. New YAML knobs: `nccl_bootstrap_deadline` (default 5m), `nccl_bootstrap_correlation_window` (default 10m). Verdict shape pinned by `nccl_bootstrap_verdict.schema.json`.

Tri Lam added 5 commits June 1, 2026 01:26
…chema

Signed-off-by: Tri Lam <tri@maydow.com>
Signed-off-by: Tri Lam <tri@maydow.com>
Signed-off-by: Tri Lam <tri@maydow.com>
The previous max(ReadyAt) deadline gate silently SUPPRESSED verdicts
for genuinely-stuck early ranks whenever a late-joining rank pushed
max(ReadyAt) forward — a 15-min-ready stuck rank would be masked by
a peer ready 2min ago. Switch the deadline gate to min(ReadyAt) so
the bootstrap window is measured from the FIRST-ready rank's
perspective — which is the rank whose bootstrap is genuinely stuck.

max(ReadyAt) is retained for evidence-trail timestamp anchoring as
the cohort's last-known-good Ready signal — the operator-visible
"most recent Ready event on this cohort" surface.

The slow-image-pull guard the original max() phrasing was meant to
provide is naturally handled upstream: pods that haven't reached
Ready don't enter the cohort at all (spec edge case "Slow image
pull").

Tests:
 * Rename TestNCCLBootstrapDetector_MaxPodReadyDrivesAge → MinPodReadyDrivesAge
   with inverted assertion (verdict now fires).
 * New TestNCCLBootstrapDetector_LateJoinerDoesNotMaskStuckRank pins
   the load-bearing property with heterogeneous ReadyAt.
 * New TestNCCLBootstrapDetector_MaxPodReadyAnchorsEvidence pins
   max(ReadyAt) as the evidence-trail anchor.

Signed-off-by: Tri Lam <tri@maydow.com>
Reviewer cleanup pass on yellow-tier findings:

* Schema: tighten gen_ai.training.job_id to minLength:1 and document
  the fallback-grouping semantic explicitly — the field is OMITTED
  (not empty-string) on the namespace-only fallback path. Downstream
  consumers must treat ABSENCE as the explicit fallback signal, not
  silent exclusion. processor already uses putStrIfSet to suppress
  empty-string variants.

* Spec eval rule: clarify Pattern #8 (NCCL hang) vs Pattern #9 (NCCL
  bootstrap) trigger disjoint-ness — hang fires on PRESENCE of
  non-completed FR records mid-run; bootstrap fires on ABSENCE of
  any FR record past deadline. Both can fire on the same cohort
  during a heterogeneous bootstrap by design.

* Spec impl-notes: add note #5 documenting the min(ReadyAt) /
  max(ReadyAt) split with the late-joiner-masks-stuck-rank
  scenario as the rationale.

* Empty-Node skip comment in detector: previous comment claimed the
  skip biases toward false-negatives; in fact it STRENGTHENS the
  absence signal (rank stays "no FR seen" → counted as failed),
  biasing toward false-positives. Correct the directionality and
  call out the fallback-to-(namespace, rank) escape hatch.

Signed-off-by: Tri Lam <tri@maydow.com>
trilamsr (Contributor, Author) commented Jun 1, 2026

Addressed reviewer BLOCKER + several yellows:

🔴 Fixed: deadline now uses min(ReadyAt) — a stuck early rank is detected even when later ranks join after the bootstrap window. max(ReadyAt) retained for evidence-trail anchoring only. New test TestNCCLBootstrapDetector_LateJoinerDoesNotMaskStuckRank pins the property (rank0 ReadyAt = T−10min, rank1 ReadyAt = T+2min, eval at T+8min, deadline 5min → verdict fires because min-age = 18min). TestNCCLBootstrapDetector_MaxPodReadyDrivesAge renamed → MinPodReadyDrivesAge with inverted assertion. New TestNCCLBootstrapDetector_MaxPodReadyAnchorsEvidence pins max(ReadyAt) as the evidence-trail anchor. Spec impl-note #5 documents the min/max split + rationale.

🟡 Fixed: empty-Node skip comment corrected — was claiming FN bias, actually STRENGTHENS the absence signal (rank stays "no FR seen" → counted as failed), biasing toward FPs. Comment now names the directionality + the (namespace, rank) fallback escape hatch.

🟡 Fixed: gen_ai.training.job_id schema tightened (minLength:1) + description clarifies that the verdict OMITS the field on the namespace-only fallback path (rather than emitting empty-string) — downstream consumers must treat ABSENCE as the explicit fallback signal. The processor already uses putStrIfSet to suppress the empty-string variant on the log-record attribute path, so no code change there.
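
A minimal sketch of the omit-rather-than-empty contract, using a plain map in place of the real log-record attributes. `putIfSet` is a hypothetical analog of the processor's putStrIfSet (whose actual signature is not shown in this PR), and `verdictAttrs` / `usedFallback` are illustrative names.

```go
package main

import "fmt"

const jobIDKey = "gen_ai.training.job_id"

// putIfSet writes the attribute only when the value is non-empty.
// Hypothetical analog of putStrIfSet; the real helper's signature is assumed.
func putIfSet(attrs map[string]string, key, val string) {
	if val != "" {
		attrs[key] = val
	}
}

// verdictAttrs builds a toy attribute set for a verdict.
func verdictAttrs(jobID, namespace string) map[string]string {
	attrs := map[string]string{"k8s.namespace.name": namespace}
	putIfSet(attrs, jobIDKey, jobID)
	return attrs
}

// usedFallback is the consumer-side check: absence of the key, not an empty
// string, signals the namespace-only grouping path.
func usedFallback(attrs map[string]string) bool {
	_, present := attrs[jobIDKey]
	return !present
}

func main() {
	fmt.Println(usedFallback(verdictAttrs("job-a", "train")))
	fmt.Println(usedFallback(verdictAttrs("", "train")))
}
```

Checking for presence keeps the consumer aligned with the schema's minLength:1 guard, under which an empty-string job_id would be invalid.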

🟡 Spec: Pattern #8 (NCCL hang) vs #9 (NCCL bootstrap) trigger disjoint-ness called out in eval rule — hang fires on PRESENCE of non-completed FR records mid-run; bootstrap fires on ABSENCE of any FR record past deadline. Both can fire concurrently on a heterogeneous bootstrap by design.

Out of scope (acknowledged risks, not addressed this PR):

  • Cohort-key namespace-only fallback collision in multi-job-per-namespace clusters: tracked as spec Open Q#1, follow-up issue to file.
  • Closed-set CNI vocab: explicitly punted in spec impl-note #3.
  • Duplicate-projection tree walks (O(n)×2): micro-perf only, deferred.

Commits: 071a3e8 (BLOCKER fix) + cf66328 (schema/spec/comment hygiene). All local hooks + go test ./pkg/patterns/... ./processor/patterndetectorprocessor/... + go vet ./... green.

Tri Lam added 2 commits June 1, 2026 02:57
…ootstrap

# Conflicts:
#	docs/ATTRIBUTES.md
#	module/processor/patterndetectorprocessor/config.go
#	module/processor/patterndetectorprocessor/example_config.yaml
…ootstrap

# Conflicts:
#	docs/ATTRIBUTES.md
#	module/processor/patterndetectorprocessor/config.go
#	module/processor/patterndetectorprocessor/patterndetector.go
trilamsr added a commit that referenced this pull request Jun 1, 2026
## Summary

CI `changes` pre-flight job intermittently fails with exit 128 when
`origin/$base` ref isn't fully fetched (shallow-clone race / fresh
runner). `git diff origin/$base...HEAD` then exits non-zero; `bash
-e` propagates and fails the whole workflow.

## Root cause

`set -e` from `bash -e` causes the command-substitution
`changed=$(git diff ...)` to abort on non-zero exit even with
`2>/dev/null` redirecting stderr. Append `|| true` so failure falls
through to the existing "treat as code-changed" default.

## Test plan

- [x] yaml.safe_load parses cleanly
- [x] actionlint + zizmor clean
- [x] golangci-lint + go vet + attribute-namespace-check + doc-check +
alert-check + chart-appversion-check + deprecation-check +
no-autoupdate-check all green
- [ ] Verified by next PR's pre-flight running green

Reproduced flake on PRs #347 + #357.

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
@trilamsr trilamsr merged commit cd29f1b into main Jun 1, 2026
22 checks passed
@trilamsr trilamsr deleted the feat/pattern-9-nccl-bootstrap branch June 1, 2026 20:11