28 changes: 27 additions & 1 deletion .github/workflows/ci.yml
@@ -160,10 +160,32 @@ jobs:
      - name: smoke-quickstart
        run: make smoke-quickstart

  # v1-rc1 cut criterion 12: Python verdict-consumption SDK.
  # The Go half (module/sdk/verdict/) is already covered by verify-test
  # because `go test -race ./...` resolves it via go.work; the Python
  # half needs its own minimal job. Lightweight on purpose: single
  # Python version + pytest run, no matrix. If the SDK grows, this is
  # where a Python-version matrix lands.
  sdk-python:
    name: sdk-python
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
      - uses: actions/setup-python@0b93645e9fea7318ecaed2b359559ac225c90a2b # v5.3.0
        with:
          python-version: "3.11"
          cache: "pip"
      - name: install
        run: |
          python -m pip install --upgrade pip
          pip install -e 'python/tracecore_verdict[test]'
      - name: pytest
        run: pytest python/tracecore_verdict/

  verify:
    name: verify
    runs-on: ubuntu-latest
-   needs: [verify-test, verify-lint, verify-static, validator-recipe]
+   needs: [verify-test, verify-lint, verify-static, validator-recipe, sdk-python]
    if: always()
    steps:
      - name: aggregator
@@ -185,6 +207,10 @@ jobs:
            echo "::error::validator-recipe did not succeed (result=${{ needs.validator-recipe.result }})"
            fail=1
          fi
          if [ "${{ needs.sdk-python.result }}" != "success" ]; then
            echo "::error::sdk-python did not succeed (result=${{ needs.sdk-python.result }})"
            fail=1
          fi
          if [ "$fail" -eq 1 ]; then
            echo "one or more verify-* sub-jobs did not succeed; failing aggregator"
            exit 1
1 change: 1 addition & 0 deletions docs/README.md
@@ -40,6 +40,7 @@ Legend: 👤 operator · 🛠️ contributor · 🏛️ maintainer · 🌐 exter
| [proposals/](proposals/) | 🏛️ | Drafts pending upstream (semconv extensions, etc.). |
| [research/](research/) | 🛠️ | Synthesized findings from reading external sources (OTel collector internals, benchmark baselines). |
| [schemas/](schemas/) | 🛠️ | Receiver schema documents pointed at by emitted `SchemaURL`. |
| [sdk/](sdk/README.md) | 🌐 👤 | Verdict-consumption SDKs (Python + Go) — typed clients for the v1.0-rc1 envelope. Closes v1-rc1 cut criterion 12. |
| [examples/](examples/) | 👤 | Reference operator artifacts (Prometheus alerts, Grafana dashboard, with-telemetry config). |
| [followups/](followups/) | 🏛️ | Per-milestone follow-up shards + cross-cutting `_needs-prod-data` / `_needs-gpu` buckets. See [followups/README.md](followups/README.md) for filing convention. |
| [integrations/](integrations/) | 👤 | Validated recipes for shipping tracecore output to specific backends. See per-recipe rows below. |
118 changes: 118 additions & 0 deletions docs/sdk/README.md
@@ -0,0 +1,118 @@
# Verdict-consumption SDKs

Typed client SDKs for consuming **TraceCore Verdict v1.0-rc1**
envelopes. These ship alongside every release tagged at the matching
schema version (e.g. `v1.0.0-rc1`) and close
[v1-rc1 cut-criterion 12](../v1-rc1-cut-criteria.md#12-verdict-consumption-sdks-python--go).

## Which one do I pick?

| If you are… | Use |
| --- | --- |
| Writing a Prometheus Alertmanager webhook in Python, or routing verdicts into Slack via FastAPI | [`tracecore-verdict`](../../python/tracecore_verdict/) (Python) |
| Writing an OTel-collector processor, an exporter, or any downstream Go consumer | [`module/sdk/verdict`](../../module/sdk/verdict/) (Go) |
| Calling from another language | Validate against the published JSON Schema at `https://schema.tracecore.io/verdict/1.0.0-rc1.json` (mirrored in [`docs/schemas/verdict-1.0.0-rc1.json`](../schemas/verdict-1.0.0-rc1.json)) |

Both SDKs:

- **Target schema v1.0-rc1 only.** A future envelope ships a new
sibling package — never a silent breaking change.
- **Validate before they hand back a typed object.** A schema-invalid
payload surfaces as an error, not a half-populated struct.
- **Preserve per-pattern extensions** (`gpu_id`, `xid_code`, `kind`,
`k8s.pod.name`, …) via an `Extras` / `extras` map. The envelope
sets `additionalProperties: true`; the SDK rides that through
unchanged.
- **Pin a byte-identical copy of `docs/schemas/verdict-1.0.0-rc1.json`**
as the source of truth. CI's drift test fails the build if the SDK
copy strays from the canonical.
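
The validate-before-typing contract and the extras pass-through can be pictured with a small standard-library sketch. This is not the SDK's implementation (the SDKs validate against the pinned JSON Schema); `decode_envelope`, `Envelope`, and `EnvelopeError` are hypothetical names, and only the envelope rules stated in this README are checked.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

REQUIRED = ("pattern.id", "headline", "remediation", "evidence_trail")
TYPED = REQUIRED + ("confidence", "missing_layers")


class EnvelopeError(ValueError):
    """Hypothetical error type: the payload broke a rule checked below."""


@dataclass
class Envelope:
    pattern_id: str
    headline: str
    remediation: str
    evidence_trail: List[Dict[str, Any]]
    confidence: Optional[str] = None
    missing_layers: List[str] = field(default_factory=list)
    extras: Dict[str, Any] = field(default_factory=dict)


def decode_envelope(raw) -> Envelope:
    """Parse, check the envelope rules, then build the typed object."""
    try:
        doc = json.loads(raw)
    except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
        raise EnvelopeError(f"malformed JSON: {exc}") from exc
    if not isinstance(doc, dict):
        raise EnvelopeError("envelope must be a JSON object")
    missing = [name for name in REQUIRED if name not in doc]
    if missing:
        raise EnvelopeError(f"missing required field(s): {', '.join(missing)}")
    trail = doc["evidence_trail"]
    if not isinstance(trail, list) or not trail:
        raise EnvelopeError("evidence_trail must be a non-empty list")
    if "confidence" in doc and doc["confidence"] not in ("full", "partial"):
        raise EnvelopeError("confidence must be 'full' or 'partial'")
    return Envelope(
        pattern_id=doc["pattern.id"],
        headline=doc["headline"],
        remediation=doc["remediation"],
        evidence_trail=trail,
        confidence=doc.get("confidence"),
        missing_layers=doc.get("missing_layers", []),
        # Everything not bound to a typed field rides through untouched.
        extras={k: v for k, v in doc.items() if k not in TYPED},
    )
```

Because every check runs before the `Envelope` is constructed, a caller either gets a fully populated object or an exception, never a partially filled one.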

## What's in the envelope

| Field | Required | Notes |
| --- | --- | --- |
| `pattern.id` | yes | String-typed numeric (e.g. `"14"`). Keeping it a string lets future namespaced IDs (`"vendor.42"`) land without changing the wire format. |
| `headline` | yes | One-line operator-facing summary. Alert title. |
| `remediation` | yes | Detector-controlled remediation hint. Alert body. |
| `evidence_trail` | yes | Ordered list; `minItems: 1`. Reads top-to-bottom as the timeline. |
| `confidence` | no | `full` \| `partial`. Patterns whose emission rule is "all layers or no verdict" omit it. |
| `missing_layers` | no | Names of evidence layers that did not join. Empty when `confidence == full`. |
| `<per-pattern fields>` | no | Anything else — `k8s.pod.name`, `gpu_id`, `xid_code`, `kind`, … — rides through on Extras. |
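
As a rough illustration of the Notes column, the sketch below maps an already-validated envelope (as a plain `dict`) onto an alert: `headline` becomes the title, `remediation` opens the body, and `evidence_trail` is rendered in order as the timeline. `render_alert` and the body layout are assumptions made for this example, not part of either SDK; the evidence-row keys (`kind`, `timestamp`, `description`) are the ones the envelope requires.

```python
from typing import Any, Dict


def render_alert(envelope: Dict[str, Any]) -> Dict[str, str]:
    """Build a title/body pair from a validated verdict envelope."""
    lines = [envelope["remediation"], "", "Timeline:"]
    # evidence_trail is ordered, so iterating top to bottom gives the timeline.
    for row in envelope["evidence_trail"]:
        lines.append(f"- {row['timestamp']} [{row['kind']}] {row['description']}")
    # confidence is optional; only a partial join gets an extra note.
    if envelope.get("confidence") == "partial":
        missing = ", ".join(envelope.get("missing_layers", []))
        lines.append(f"Partial join; missing layers: {missing or 'unspecified'}")
    return {"title": envelope["headline"], "body": "\n".join(lines)}
```

The returned dictionary is deliberately backend-neutral, so the same function could feed a Slack message or a webhook payload.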

## Install

```bash
# Python
pip install tracecore-verdict

# Go
go get github.com/tracecoreai/tracecore/module/sdk/verdict
```

## Quick start (Python)

```python
from tracecore_verdict import decode

v = decode(raw_bytes)
print(v.pattern_id, v.headline)
pod = v.extras.get("k8s.pod.name") # per-pattern extension
```

## Quick start (Go)

```go
import "github.com/tracecoreai/tracecore/module/sdk/verdict"

v, err := verdict.Decode(rawJSON)
if err != nil { return err }
fmt.Println(v.PatternID, v.Headline)
if pod, ok := v.Extras["k8s.pod.name"].(string); ok { /* … */ }
```

## Patterns shipped at v1.0-rc1

The envelope is the **union-superset** of every pattern's emitted
shape. The nine verdicts shipped at v1.0-rc1:

| ID | Name | Per-pattern extensions |
| --- | --- | --- |
| 13 | `silent_data_corruption` | `kind`, `gen_ai.training.job_id`, `accuracy_drop`, `baseline_accuracy`, `observed_accuracy`, `suspect_gpu_id`, `suspect_node`, `sdc_counter_delta` |
| 14 | `pod_evicted` | `k8s.pod.name`, `k8s.pod.namespace`, `k8s.node.name`, `k8s.event.reason` |
| 15 | `nccl_hang` | `pg_id`, `collective_seq_id`, `hanging_ranks` |
| 16 | `xid_correlation` | `xid_code`, `node`, `evicted_pod` |
| 17 | `hbm_ecc` | `xid_code`, `gpu_id`, `ecc_delta`, `node` |
| 18 | `thermal_throttle` | `node`, `gpu_count`, `gpu_ids` |
| 19 | `pcie_aer` | `gpu_id`, `severity`, `aer_type`, `drop_ratio`, `node` |
| 20 | `cuda_oom` | `gpu_id`, `node`, `kind`, `tried_alloc_bytes`, `fb_free_bytes`, `fb_free_ratio` |
| 21 | `ib_link_flap` | `node`, `hca_device`, `port`, `transition_count` |

Each SDK ships a parametrized test that round-trips one fixture per
row above. Adding a pattern in a future PR MUST extend that table.
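
The fixture rule above can be sketched as two small checks: one that every shipped pattern ID has a fixture, and one that a fixture survives a serialize/parse round trip with its extension fields intact. This is a simplified stand-in using plain `json` rather than the SDKs' own decode path; `missing_fixtures` and `round_trips` are hypothetical helper names.

```python
import json
from typing import Any, Dict, List

# Pattern IDs "13" through "21", one per row of the table above.
SHIPPED_PATTERN_IDS = [str(n) for n in range(13, 22)]


def missing_fixtures(fixtures: Dict[str, Any]) -> List[str]:
    """Return shipped pattern IDs that have no fixture registered."""
    return [pid for pid in SHIPPED_PATTERN_IDS if pid not in fixtures]


def round_trips(envelope: Dict[str, Any]) -> bool:
    """Serialize and re-parse; True when no field, typed or extra, was lost."""
    return json.loads(json.dumps(envelope)) == envelope
```

A test suite would loop over the fixtures, assert `round_trips(...)` for each, and assert that `missing_fixtures(...)` is empty, so a new pattern without a fixture fails the build.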

## Versioning policy

- **`v1.0.0-rc1`** is the first tagged schema version. Both SDKs key
off the schema's `$id`
(`https://schema.tracecore.io/verdict/1.0.0-rc1.json`).
- A backwards-compatible field-add (still rides on `additionalProperties`)
is a **patch** bump at the schema level; SDKs that bind only typed
fields keep working without an upgrade.
- A breaking change (new required field, type change, narrowed enum)
is a **major** bump and lands as a new sibling SDK package
(`tracecore-verdict-v2` / `module/sdk/verdict/v2`).

See [`docs/schemas/README.md`](../schemas/README.md) for the schema
evolution policy and [`docs/DEPRECATION.md`](../DEPRECATION.md) for
the rollout cadence.
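
A consumer that wants to act on this policy could compare the major version embedded in a schema `$id` against the one it was built for. The sketch below assumes future `$id` URLs keep the same `/verdict/<version>.json` shape, which the policy does not guarantee, and it ignores the pre-release suffix because the policy does not say how `-rc1` relates to later tags. `parse_schema_id` and `same_major` are hypothetical helpers.

```python
import re
from typing import Optional, Tuple

PINNED_ID = "https://schema.tracecore.io/verdict/1.0.0-rc1.json"

# Assumed URL shape: .../verdict/MAJOR.MINOR.PATCH[-PRERELEASE].json
_ID_RE = re.compile(r"/verdict/(\d+)\.(\d+)\.(\d+)(?:-([0-9A-Za-z.]+))?\.json$")


def parse_schema_id(schema_id: str) -> Tuple[int, int, int, Optional[str]]:
    """Split a schema $id into (major, minor, patch, prerelease)."""
    match = _ID_RE.search(schema_id)
    if match is None:
        raise ValueError(f"unrecognized schema $id: {schema_id!r}")
    major, minor, patch = (int(match.group(i)) for i in (1, 2, 3))
    return major, minor, patch, match.group(4)


def same_major(schema_id: str, pinned: str = PINNED_ID) -> bool:
    """True when both IDs share a major version (no breaking change expected)."""
    return parse_schema_id(schema_id)[0] == parse_schema_id(pinned)[0]
```

Under the policy above, a `False` result would mean the payload belongs to a different sibling package rather than to this one.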

## Source layout

```
docs/schemas/verdict-1.0.0-rc1.json # canonical source of truth
module/sdk/verdict/ # Go SDK (embeds schema.json)
python/tracecore_verdict/ # Python SDK (package-data schema.json)
```

Both SDKs include a sync-check test that fails the build if either
copy drifts from the canonical.
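
A byte-level sync check of this kind can be very small. The sketch below is an illustration only, not the repository's actual test: `drifted_copies` is a hypothetical helper, and the caller supplies the paths because the exact location of each SDK's `schema.json` is not spelled out here.

```python
from pathlib import Path
from typing import Iterable, List


def drifted_copies(canonical: Path, copies: Iterable[Path]) -> List[Path]:
    """Return every copy whose bytes differ from the canonical schema."""
    want = canonical.read_bytes()
    # Compare raw bytes, not parsed JSON: whitespace or key-order changes
    # also count as drift under a byte-identical rule.
    return [path for path in copies if path.read_bytes() != want]
```

A test would assert that the returned list is empty and print the offending paths otherwise.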
13 changes: 12 additions & 1 deletion docs/v1-rc1-cut-criteria.md
@@ -263,7 +263,18 @@ Status is set against the repo state as of 2026-05-31.
RC gate "Verdict-consumption SDKs tagged at RC schema version" and
[`NORTHSTARS.md` §O4 hero KPI](NORTHSTARS.md#o4-standards)
(external `gen_ai.training.*` implementations).
- **Status:** ☐ — neither SDK exists yet.
- **Status:** ☑ — both SDKs shipped. Go: [`module/sdk/verdict/`](../module/sdk/verdict);
Python: [`python/tracecore_verdict/`](../python/tracecore_verdict).
Index + version-policy: [`docs/sdk/README.md`](sdk/README.md). Each
SDK embeds a byte-identical copy of
[`docs/schemas/verdict-1.0.0-rc1.json`](schemas/verdict-1.0.0-rc1.json)
pinned by a drift test; `Decode` / `decode` validate before
returning a typed `Verdict`; per-pattern extensions ride through
on `Extras` / `extras` because the envelope sets
`additionalProperties: true`; both ship a parametrized test that
round-trips all 9 v1.0-rc1 patterns. CI: Go SDK runs through
`verify-test`; Python SDK has its own `sdk-python` job wired into
the `verify` aggregator.

---

105 changes: 105 additions & 0 deletions module/sdk/verdict/README.md
@@ -0,0 +1,105 @@
# TraceCore Verdict SDK (Go)

Client-side SDK for consuming **TraceCore Verdict v1.0-rc1** envelopes.

This is the Go half of v1.0-rc1 cut-criterion 12 ("Verdict-consumption
SDKs"). The Python sibling lives at [`python/tracecore_verdict/`](../../../python/tracecore_verdict).

## Install

```bash
go get github.com/tracecoreai/tracecore/module/sdk/verdict
```

The SDK lives inside the main `tracecore` module; it has no separate
`go.mod`. Pin your dependency to the `v1.0.0-rc1` repo tag.

## Use

```go
package main

import (
	"fmt"
	"log"

	"github.com/tracecoreai/tracecore/module/sdk/verdict"
)

func main() {
	raw := []byte(`{
		"pattern.id": "14",
		"headline": "Pod ml-prod/trainer-3 evicted at 2026-05-18T10:00:00Z due to disk pressure",
		"remediation": "relocate workload to a node with NVMe-backed ephemeral storage",
		"confidence": "full",
		"evidence_trail": [
			{"kind": "pod_event", "uid": "evt-1",
			 "timestamp": "2026-05-18T10:00:00Z",
			 "description": "Evicted: disk pressure"}
		],
		"k8s.pod.name": "trainer-3"
	}`)

	v, err := verdict.Decode(raw)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("pattern.id:", v.PatternID)
	fmt.Println("headline:", v.Headline)
	fmt.Println("confidence:", v.Confidence)
	if pod, ok := v.Extras["k8s.pod.name"].(string); ok {
		fmt.Println("pod:", pod)
	}
}
```

## What's in the box

| Symbol | Purpose |
| --- | --- |
| `Verdict` | Typed envelope: `PatternID`, `Headline`, `Remediation`, `Confidence`, `EvidenceTrail`, `MissingLayers`, `Extras` |
| `EvidenceRef` | One row in the evidence trail (`Kind`, `UID`, `Timestamp`, `Description`) |
| `Confidence` (`ConfidenceFull` \| `ConfidencePartial`) | Join-completeness signal |
| `Decode(rawJSON []byte) (*Verdict, error)` | JSON → schema-validated typed Verdict |

## Pattern-specific fields

The envelope sets `additionalProperties: true`, so every per-pattern
extension (`gpu_id`, `xid_code`, `kind`, `k8s.pod.name`, …) rides
through on `Verdict.Extras` (`map[string]any`). Use Go type assertions
to read them:

```go
if pod, ok := v.Extras["k8s.pod.name"].(string); ok { /* … */ }
if xid, ok := v.Extras["xid_code"].(float64); ok { /* … */ } // JSON numbers
```

## Versioning

This package targets **schema v1.0-rc1 only**.

A future envelope revision lands as a new sibling package (e.g.
`module/sdk/verdict/v2`) — never a silent breaking change here. The
SDK's `schema.json` is byte-pinned against
`docs/schemas/verdict-1.0.0-rc1.json` by
`TestEmbeddedSchemaMatchesCanonical`; drift fails the build.

## Errors

`Decode` returns an error (not a panic) for:

- malformed JSON
- missing envelope-required field (`pattern.id`, `headline`,
`remediation`, `evidence_trail`)
- `confidence` outside the `full|partial` enum
- empty `evidence_trail`
- evidence-row missing `kind` / `uid` / `timestamp` / `description`
- any other v1.0-rc1 envelope violation

Per-pattern field constraints (e.g. `pcie_aer.severity ∈ {Fatal,
Non-Fatal, Corrected}`) are out of scope — those live on per-pattern
schemas under `module/pkg/patterns/testdata/`.

## License

Apache-2.0.