[standards] file first gen_ai.training.* PR upstream #326

Description

@trilamsr

Tracking issue for NORTHSTARS O4 ("Standards" objective): file the first
gen_ai.training.* semconv PR upstream at open-telemetry/semantic-conventions-genai.

Roadmap (binding): docs/standards-roadmap.md.
Technical proposal body (ready to copy-paste into the PR description):
docs/proposals/gen-ai-training-semconv.md.

Hero KPI gate

External (non-tracecore) implementations of gen_ai.training.*:

  • M6: 0 (first PR merged)
  • M12: ≥1
  • M18: ≥3
  • M24: ≥5

Pre-filing prerequisites

  • Attend ≥1 OpenTelemetry GenAI Instrumentation SIG meeting (general track, Tuesdays 09:00 PT) and add the proposal to the meeting-notes agenda.
  • Comment on semantic-conventions-genai Issue #88 (rl.* proposal) about scope overlap, stating whether we propose alignment or graceful coexistence.
  • Recruit one non-tracecore co-author or sponsor from a SIG-regular vendor (Pyroscope, Datadog, Honeycomb, Microsoft, Google, Splunk approvers per CONTRIBUTING.md).
  • Reconcile two internal naming inconsistencies before PR-1:
    • gen_ai.training.job_id (current emit) → gen_ai.training.job.id (dotted, consistent with gen_ai.tool.call.id).
    • gen_ai.training.step_id (M14 plan) → gen_ai.training.step (int counter, consistent with proposal).
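The two renames above can be sketched as a small key-mapping shim. This is only an illustration: `LEGACY_TO_PROPOSED` and `migrate_attributes` are hypothetical names, not part of tracecore or any OpenTelemetry package, and the sketch assumes attributes are available as a flat dict. It does not attempt any value conversion for step_id.

```python
# Hypothetical migration shim: maps the legacy attribute keys named above
# onto the proposed spellings. Values are passed through unchanged.
LEGACY_TO_PROPOSED = {
    "gen_ai.training.job_id": "gen_ai.training.job.id",
    "gen_ai.training.step_id": "gen_ai.training.step",
}


def migrate_attributes(attrs):
    """Return a copy of attrs with legacy keys renamed to the proposed keys."""
    migrated = {}
    for key, value in attrs.items():
        new_key = LEGACY_TO_PROPOSED.get(key, key)
        # If both spellings are present, keep the value already stored
        # under the proposed key and drop the legacy one.
        if new_key != key and new_key in attrs:
            continue
        migrated[new_key] = value
    return migrated
```

Letting the proposed key win when both spellings appear is a design choice made for this sketch; an emitter that dual-writes during the transition might prefer to flag the conflict instead.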

PR-1 — minimal viable scope (M1 target)

  • gen_ai.training.run.id (string, development) — logical training run id, stable across restart/pre-emption.
  • gen_ai.training.job.id (string, development) — orchestrator-assigned job id.
  • gen_ai.training.rank (int, development) — global zero-indexed process rank.
  • gen_ai.training.world_size (int, development) — total process count (constant per job.id).
  • gen_ai.training.local_rank (int, development) — per-node rank.
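As a rough illustration of the PR-1 attribute set, the sketch below builds the five attributes as a plain dict and applies the constraints implied by the descriptions above (zero-indexed rank, world_size as total process count). The function name and the choice to raise `ValueError` are assumptions for the example; where the values come from (launcher, orchestrator, environment) is deliberately left to the caller.

```python
def training_identity_attributes(run_id, job_id, rank, world_size, local_rank):
    """Build the five PR-1 attributes as a flat dict (hypothetical helper)."""
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    # rank is global and zero-indexed, so it must fall in [0, world_size).
    if not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    if local_rank < 0:
        raise ValueError("local_rank must be non-negative")
    return {
        "gen_ai.training.run.id": str(run_id),
        "gen_ai.training.job.id": str(job_id),
        "gen_ai.training.rank": int(rank),
        "gen_ai.training.world_size": int(world_size),
        "gen_ai.training.local_rank": int(local_rank),
    }
```

For example, `training_identity_attributes("run-a", "job-1", 3, 8, 3)` yields a dict that could be attached to telemetry by whatever SDK is in use. Whether these belong on a resource or on individual spans is not decided here.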

PR-2 — step and collective (M3 target)

  • gen_ai.training.step (int, development) — current optimizer step (monotonic per run.id).
  • gen_ai.training.collective.op (string, development) — collective op enum (all_reduce, all_gather, reduce_scatter, broadcast, send, recv, barrier).
  • gen_ai.training.collective.tag (string, development) — application-supplied tag distinguishing collectives in a step.
  • gen_ai.training.group_rank (int, development) — rank within a parallelism group (DP/TP/PP/EP).
  • gen_ai.training.group_kind (string, development) — required when group_rank present; enum data/tensor/pipeline/expert.
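The enum and conditional-requirement rules in the PR-2 list can be sketched as a small validator. This is a minimal sketch under stated assumptions: `validate_step_attributes` is a hypothetical name, attributes are assumed to arrive as a flat dict, and only the rules stated above are checked (the monotonicity of step per run.id needs state across events and is left out).

```python
# Enum members copied from the PR-2 attribute list above.
COLLECTIVE_OPS = frozenset(
    {"all_reduce", "all_gather", "reduce_scatter", "broadcast", "send", "recv", "barrier"}
)
GROUP_KINDS = frozenset({"data", "tensor", "pipeline", "expert"})
PREFIX = "gen_ai.training."


def validate_step_attributes(attrs):
    """Return a list of problems; an empty list means no rule was violated."""
    problems = []
    op = attrs.get(PREFIX + "collective.op")
    if op is not None and op not in COLLECTIVE_OPS:
        problems.append(f"unknown collective.op: {op!r}")
    kind = attrs.get(PREFIX + "group_kind")
    if kind is not None and kind not in GROUP_KINDS:
        problems.append(f"unknown group_kind: {kind!r}")
    # group_kind is required whenever group_rank is present.
    if PREFIX + "group_rank" in attrs and kind is None:
        problems.append("group_kind is required when group_rank is present")
    return problems
```

Returning a list rather than raising lets a caller log every violation for one event at once, which seemed more useful for a conformance check than stopping at the first problem.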

Cadence

  • SIG attendance: ≥1 meeting per fortnight (general track, Tuesdays 09:00 PT), logged in the docs/standards-roadmap.md §6 meeting log after each meeting attended.
  • PR-1 filed: M1.
  • PR-1 merged: M6 (NORTHSTARS O4 hero-KPI gate).
  • PR-2 filed: M3 (parallel review if upstream backlog allows).
  • First external implementation: M12.

Risk fallbacks (per O4 Operating Rule "External implementations matter, not the chair seat")

Cross-ref — in-repo dependents

| Tracecore deliverable | Depends on | Status |
| --- | --- | --- |
| rankjoinprocessor rank stamping (live, M19) | PR-1 | PROPOSED-flagged emit; rename on merge |
| Helm-chart OTTL recipe gen_ai.training.{rank,job_id} (live, RFC-0013 §3) | PR-1 | PROPOSED; job_id → job.id migration follows |
| M13 pyspy gen_ai.training.world_size (deferred) | PR-1 | Ships Phase-3 under PROPOSED |
| M14 Kineto gen_ai.training.step_id → .step (planned) | PR-2 | Naming follows PR-2 |
| M18 straggler detector (consumer) | PR-1 | Depends only on rank-key landing |
| patterndetectorprocessor nccl_hang rank fallback chain (live) | PR-1 | Fallback nccl.rank/nccl.fr.rank preserved |

Closing condition: PR-1 merged upstream + ≥1 external implementation observed (verified per docs/adoption-pipeline.md methodology).

Metadata

Assignees

No one assigned

    Labels

    documentation (Improvements or additions to documentation), external-clock (Blocked on out-of-repo state)
