Sibling-detector _real_world/ infra: generalize PR #484's anonymizer (or land per-pattern slots on first capture) #485

Description

@trilamsr

Context

PR #484 shipped scripts/anonymize-pod-evicted-fixture.sh (deterministic PII rewrite + mutation-tested verifier) and one _real_world/ slot under module/pkg/replay/pod_evicted/. The verifier is universal (IPv4 / email / cloud-instance / image-ref regex set in prose), but the rewrite layer is pod_evicted-specific — it knows about regarding.{namespace,name,uid}, reporting_instance, node_{name,uid}.
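To make the "universal verifier" idea concrete, here is a minimal sketch of a pattern-agnostic check. It is not the verifier shipped in PR #484: the function name verify_no_pii and both regexes are assumptions for illustration, and it covers only the IPv4 and email cases (the cloud-instance and image-ref patterns are left out because their exact form is not given here).

```shell
#!/bin/sh
# Hypothetical sketch; not the verifier shipped in PR #484.

# Illustrative patterns only; the shipped regex set may differ.
IPV4_RE='([0-9]{1,3}\.){3}[0-9]{1,3}'
EMAIL_RE='[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'

# verify_no_pii FILE: returns 0 when neither pattern matches, 1 otherwise.
# A message naming the matching pattern is written to stderr.
verify_no_pii() {
  vnp_status=0
  for vnp_re in "$IPV4_RE" "$EMAIL_RE"; do
    if grep -Eq -e "$vnp_re" -- "$1"; then
      echo "possible PII in $1: matches $vnp_re" >&2
      vnp_status=1
    fi
  done
  return "$vnp_status"
}
```

Because the check only scans text, it does not need to know which record type a fixture holds, which is why it can be shared across detectors while the rewrite layer cannot.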

Sibling detectors with no _real_world/ slot today:

  • module/pkg/replay/cuda_oom/
  • module/pkg/replay/nccl_hang/
  • module/pkg/replay/hbm_ecc/
  • module/pkg/replay/checkpointer_hang/
  • module/pkg/replay/dataloader_hang/
  • module/pkg/replay/ib_link_flap/
  • module/pkg/replay/pcie_aer/
  • module/pkg/replay/nccl_bootstrap/
  • module/pkg/replay/silent_data_corruption/
  • module/pkg/replay/thermal_throttle/
  • module/pkg/replay/xid_correlation/

Each consumes a different input record type (patterns.TrainingStepStallRecord, patterns.CheckpointPhaseRecord, etc.) — so the structured-field rewrite map is per-detector.

Proposed shape

Two non-exclusive options:

  1. Generalize the anonymizer: pull the IPv4/email/cloud-node/image regex set into a shared anonymize-replay-common.sh library; each pattern's anonymizer sources it and adds its own JSON field map. The rule of three in PRINCIPLES §3 says to wait until two captures land before extracting. We have one shipped and one synthetic, so this issue is the placeholder for the second.

  2. Per-detector _real_world/ slots: only land them when an operator capture is actually ready to contribute. Today's directories are well-formed and don't waste maintainability budget on empty slots.
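As a rough illustration of option 1, the split between a shared layer and a per-detector field map could look like the sketch below. Everything in it is an assumption: the function names, the placeholder strings, and the node_name / node_uid field list stand in for whatever a real detector's record would need. The shared part is what anonymize-replay-common.sh might hold; it is inlined here only to keep the example self-contained.

```shell
#!/bin/sh
# Hypothetical sketch of option 1; names and placeholders are assumptions.

# --- shared part (what anonymize-replay-common.sh might hold) ---
# anonymize_common: filter stdin -> stdout, masking IPv4 addresses and emails
# with fixed placeholders so repeated runs give identical output.
anonymize_common() {
  sed -E \
    -e 's/([0-9]{1,3}\.){3}[0-9]{1,3}/IPV4_REDACTED/g' \
    -e 's/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/EMAIL_REDACTED/g'
}

# rewrite_string_fields FIELD...: filter stdin -> stdout, replacing the string
# value of each named JSON key with "anon-<key>". Line-oriented sed, so it
# assumes values contain no escaped quotes.
rewrite_string_fields() {
  rsf_script=''
  for rsf_field in "$@"; do
    rsf_script="${rsf_script}s/\"${rsf_field}\"[[:space:]]*:[[:space:]]*\"[^\"]*\"/\"${rsf_field}\":\"anon-${rsf_field}\"/g;"
  done
  sed -E -e "$rsf_script"
}

# --- per-detector part (would source the shared file, then add its map) ---
# Hypothetical field names for an imaginary detector record.
anonymize_example_detector() {
  anonymize_common | rewrite_string_fields node_name node_uid
}
```

A per-detector script would then shrink to one line naming its fields, for example `anonymize_example_detector < capture.json > fixture.json`. A sed-based field rewrite is a simplification; a real implementation would more likely use a JSON-aware tool.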

Recommended: wait for the second contributed capture (any detector) and bundle the extract + new slot in one PR. Until then this issue captures the deferred work.

Acceptance

  • Either one operator capture lands in a non-pod_evicted _real_world/ slot (whichever detector receives the first contribution), or the per-pattern milestone in docs/MILESTONES.md gains a note clarifying that anonymized contributions are welcome but no slot exists yet, so the contributor opens the slot in the same PR that lands the capture.
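The first branch of this criterion can be checked mechanically. The sketch below assumes the layout described above (one directory per detector under module/pkg/replay/, each optionally holding a _real_world/ subdirectory); the function name list_sibling_slots is hypothetical. It only lists slots and does not validate that a capture inside one is properly anonymized.

```shell
#!/bin/sh
# Hypothetical helper; assumes ROOT/<detector>/_real_world as the slot layout.

# list_sibling_slots ROOT: print every ROOT/<detector>/_real_world directory
# except pod_evicted's, one per line, in glob (sorted) order.
list_sibling_slots() {
  for lss_dir in "$1"/*/_real_world; do
    # An unmatched glob stays literal; the -d test skips it.
    [ -d "$lss_dir" ] || continue
    case "$lss_dir" in
      */pod_evicted/_real_world) continue ;;
    esac
    printf '%s\n' "$lss_dir"
  done
}
```

Running `list_sibling_slots module/pkg/replay` from the repository root and getting any output would indicate that at least one sibling slot has landed.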

Cross-refs
