feat(nccl-boot): pattern-9 bootstrap-timeout detector#347
Conversation
…chema Signed-off-by: Tri Lam <tri@maydow.com>
Signed-off-by: Tri Lam <tri@maydow.com>
Signed-off-by: Tri Lam <tri@maydow.com>
The previous max(ReadyAt) deadline gate silently SUPPRESSED verdicts for genuinely-stuck early ranks whenever a late-joining rank pushed max(ReadyAt) forward — a 15-min-ready stuck rank would be masked by a peer ready 2min ago. Switch the deadline gate to min(ReadyAt) so the bootstrap window is measured from the FIRST-ready rank's perspective — which is the rank whose bootstrap is genuinely stuck. max(ReadyAt) is retained for evidence-trail timestamp anchoring as the cohort's last-known-good Ready signal — the operator-visible "most recent Ready event on this cohort" surface. The slow-image-pull guard the original max() phrasing was meant to provide is naturally handled upstream: pods that haven't reached Ready don't enter the cohort at all (spec edge case "Slow image pull"). Tests: * Rename TestNCCLBootstrapDetector_MaxPodReadyDrivesAge → MinPodReadyDrivesAge with inverted assertion (verdict now fires). * New TestNCCLBootstrapDetector_LateJoinerDoesNotMaskStuckRank pins the load-bearing property with heterogeneous ReadyAt. * New TestNCCLBootstrapDetector_MaxPodReadyAnchorsEvidence pins max(ReadyAt) as the evidence-trail anchor. Signed-off-by: Tri Lam <tri@maydow.com>
Reviewer cleanup pass on yellow-tier findings: * Schema: tighten gen_ai.training.job_id to minLength:1 and document the fallback-grouping semantic explicitly — the field is OMITTED (not empty-string) on the namespace-only fallback path. Downstream consumers must treat ABSENCE as the explicit fallback signal, not silent exclusion. processor already uses putStrIfSet to suppress empty-string variants. * Spec eval rule: clarify Pattern #8 (NCCL hang) vs Pattern #9 (NCCL bootstrap) trigger disjoint-ness — hang fires on PRESENCE of non-completed FR records mid-run; bootstrap fires on ABSENCE of any FR record past deadline. Both can fire on the same cohort during a heterogeneous bootstrap by design. * Spec impl-notes: add note #5 documenting the min(ReadyAt) / max(ReadyAt) split with the late-joiner-masks-stuck-rank scenario as the rationale. * Empty-Node skip comment in detector: previous comment claimed the skip biases toward false-negatives; in fact it STRENGTHENS the absence signal (rank stays "no FR seen" → counted as failed), biasing toward false-positives. Correct the directionality and call out the fallback-to-(namespace, rank) escape hatch. Signed-off-by: Tri Lam <tri@maydow.com>
|
Addressed reviewer BLOCKER + several yellows: 🔴 Fixed: deadline now uses 🟡 Fixed: empty-Node skip comment corrected — was claiming FN bias, actually STRENGTHENS the absence signal (rank stays "no FR seen" → counted as failed), biasing toward FPs. Comment now names the directionality + the (namespace, rank) fallback escape hatch. 🟡 Fixed: 🟡 Spec: Pattern #8 (NCCL hang) vs #9 (NCCL bootstrap) trigger disjoint-ness called out in eval rule — hang fires on PRESENCE of non-completed FR records mid-run; bootstrap fires on ABSENCE of any FR record past deadline. Both can fire concurrently on a heterogeneous bootstrap by design. Out of scope (acknowledged risks, not addressed this PR):
Commits: |
…ootstrap # Conflicts: # docs/ATTRIBUTES.md # module/processor/patterndetectorprocessor/config.go # module/processor/patterndetectorprocessor/example_config.yaml
…ootstrap # Conflicts: # docs/ATTRIBUTES.md # module/processor/patterndetectorprocessor/config.go # module/processor/patterndetectorprocessor/patterndetector.go
## Summary CI \`changes\` pre-flight job intermittently fails with exit 128 when \`origin/\$base\` ref isn't fully fetched (shallow-clone race / fresh runner). \`git diff origin/\$base...HEAD\` then exits non-zero; \`bash -e\` propagates and fails the whole workflow. ## Root cause \`set -e\` from \`bash -e\` causes the command-substitution \`changed=\$(git diff ...)\` to abort on non-zero exit even with \`2>/dev/null\` redirecting stderr. Append \`|| true\` so failure falls through to the existing "treat as code-changed" default. ## Test plan - [x] yaml.safe_load parses cleanly - [x] actionlint + zizmor clean - [x] golangci-lint + go vet + attribute-namespace-check + doc-check + alert-check + chart-appversion-check + deprecation-check + no-autoupdate-check all green - [ ] Verified by next PR's pre-flight running green Reproduced flake on PRs #347 + #357. Signed-off-by: Tri Lam <tri@maydow.com> Co-authored-by: Tri Lam <tri@maydow.com>
Summary
Ships pattern-#9 (NCCL bootstrap timeout) detector end-to-end — the first job-start-time pattern in the library (sibling to pattern #8 which fires mid-run). A training-job cohort whose pods are Ready past
BootstrapDeadline(default 5min) but where at least one rank never emitted any NCCL FlightRecorder record is stuck in NCCL bootstrap; a same-namespace K8s CNI / network-readiness event in the correlation window promotes the verdict toconfidence=fulland stampsdiscriminator=cni_error.Spec:
docs/patterns/09-nccl-bootstrap-timeout.md. Status flipped fromplanned→shipped;Implementation notessection captures how each spec open-question resolved with the most-conservative reading.What landed
module/pkg/patterns/nccl_bootstrap.go— detector +TrainingPodRecord/CNINetworkEventRecord/NCCLBootstrapTimeoutVerdicttypes. ReusesNCCLFRRecordfromnccl_hang.go.module/pkg/patterns/nccl_bootstrap_test.go— 11 detector tests + schema-conformance + 10-falsifier drift battery. Covers: full-correlation fires, partial-when-no-CNI, normal-startup-no-fire, deadline-not-yet-reached, heterogeneous failure, multi-job cohorts don't merge, namespace-only fallback, cross-namespace CNI doesn't join, deadline-configurable, deterministic ordering, max(ReadyAt) drives age.module/pkg/patterns/testdata/nccl_bootstrap_verdict.schema.json— JSON Schema withadditionalProperties:falseand full enum guards.module/processor/patterndetectorprocessor/nccl_bootstrap.go— projections (projectTrainingPodRecordgates onk8s.pod.ready_time+gen_ai.training.rank;projectCNINetworkEventRecordgates onk8s.event.reason∈{FailedCreatePodSandBox, NetworkNotReady, CNIError}), verdict writer with promoted scalars (issue patterndetectorprocessor: promote operator-facing scalar attrs onto verdict records #270 contract), and runner that consumes NCCL FR records from the existing cross-cuttingcollectInputs(no double-projection).module/processor/patterndetectorprocessor/nccl_bootstrap_test.go— 6 wiring tests (full verdict, partial verdict, partial-suppressed-by-flag, normal-startup-no-fire, sub-1s deadline rejection, sub-1s window rejection).Config.NCCLBootstrapDeadline+Config.NCCLBootstrapCorrelationWindowwith Validate guards (≥1s) andwithDefaults/defaultConfigwiring;example_config.yamlupdated.docs/ATTRIBUTES.md— 3 newtracecore.alert.nccl_bootstrap_timeout.*rows, newk8s.pod.ready_timerow, updatedgen_ai.training.job_idrow (now consumed with fallback), new per-pattern matrix row fornccl_bootstrap.Design calls (load-bearing)
(gen_ai.training.job_id, k8s.namespace.name)when stamped;(k8s.namespace.name)-only fallback when job_id is absent (spec open question ci(deps): bump the gh-actions group with 5 updates #1). Emptygen_ai.training.job_idon the verdict signals the fallback path to operators.(node, rank)not(namespace, rank)— avoids cross-cohort contamination when two jobs in the same namespace land on different nodes. FR records with empty Node are skipped from the index (a wiring gap should NOT cause false-negatives — i.e. mask real bootstrap failures — even at the cost of cross-job false-positives that are unlikely in practice).FailedCreatePodSandBox/NetworkNotReady/CNIError). Per-CNI raw-error parsing (Cilium / Calico / multus distinct strings) is the discriminator-branch follow-up that lights upsocket_ifname_mismatch/rendezvous_unreachable.max(ReadyAt)drives deadline. A late-joining rank pushes the effective ready timestamp forward, preventing false-positives during rolling pod-Ready scenarios on cold-cache clusters.Test plan
cd module && go test ./pkg/patterns/... ./processor/patterndetectorprocessor/...— cleancd module && go vet ./...— cleangolangci-lint run ./...— 0 issues;attribute-namespace-check— 72/72 documentedtest(nccl-boot): RED→feat(nccl-boot): GREENcommits