diff --git a/components/receivers/pyspy/README.md b/components/receivers/pyspy/README.md index 5468bef9..bb6ae822 100644 --- a/components/receivers/pyspy/README.md +++ b/components/receivers/pyspy/README.md @@ -60,7 +60,6 @@ When something goes wrong, the receiver bumps one of these `IncError` kinds and | `faulthandler_missing` | Helper hello said `unsupported: true` | Workload runtime must be CPython 3.3+; PyPy/MicroPython unsupported | | `uds_dir_permission_denied` | Receiver can't `stat`/read `uds_dir` at Start | Check container UID + mount permissions | | `helper_oom_mid_dump` | Helper replied with `reason="MemoryError"` | Workload near OOM; lower `max_threads_per_dump` or raise pod limit | -| `sidecar_uid_drift` | Workload UID ≠ sidecar UID (Phase 4 chart guards this) | Align `runAsUser` between containers | ## CI gates diff --git a/components/receivers/pyspy/RUNBOOK.md b/components/receivers/pyspy/RUNBOOK.md index 95d72160..faba58ca 100644 --- a/components/receivers/pyspy/RUNBOOK.md +++ b/components/receivers/pyspy/RUNBOOK.md @@ -30,7 +30,7 @@ Degraded mode never restarts the receiver; absent data is the operator-visible s **Likely causes.** - Helper crashed and left a stale socket inode. -- Sidecar/workload UID drift (see `sidecar_uid_drift` for the typed variant). +- Sidecar/workload UID drift (collector and helper containers running as different UIDs leaves the helper-bound `0700` UDS unreachable for the collector). **Investigation.** 1. `stat /pyspy..sock` - confirm file exists. @@ -123,14 +123,6 @@ Degraded mode never restarts the receiver; absent data is the operator-visible s **Remediation.** Align directory permissions or `securityContext.runAsUser` between helper and receiver. Distinct from `target_not_attached` (directory readable but empty). -## kind=sidecar_uid_drift - -**Trigger.** Helper bound the UDS with the workload's UID and mode `0700`; sidecar collector has a different UID and gets `EACCES` on `connect()`. 
Receiver enters `target_not_listening` posture for that PID; this kind distinguishes it from a generic stale socket. - -**Investigation.** Compare `runAsUser` in the helper container's `securityContext` vs the collector's. The Helm chart (Phase 4) defaults both from one variable; manual installs need explicit alignment. - -**Remediation.** Set both containers to the same `runAsUser`, or set the workload's helper to mode `0770` and group both UIDs into the same `runAsGroup`. - ## kind=panic **Trigger.** A panic was recovered inside `internal/safe.Call`. diff --git a/components/receivers/pyspy/factory.go b/components/receivers/pyspy/factory.go index e22968c3..36524f0c 100644 --- a/components/receivers/pyspy/factory.go +++ b/components/receivers/pyspy/factory.go @@ -8,7 +8,6 @@ import ( "github.com/tracecoreai/tracecore/internal/consumer" "github.com/tracecoreai/tracecore/internal/pipeline" - "github.com/tracecoreai/tracecore/internal/selftelemetry" ) // ComponentType is the canonical receiver-factory ID. 
Centralized @@ -56,11 +55,11 @@ func (*factory) CreateLogs(ctx context.Context, set pipeline.CreateSettings, cfg } r := newReceiver(set, c, next) if set.Telemetry.MeterProvider != nil { - if rt, err := selftelemetry.NewReceiver(set.ID, set.Telemetry.MeterProvider); err == nil { + if rt, err := newSelfTelemetry(set.ID, set.Telemetry.MeterProvider); err == nil { r.telemetry = rt } else { - selftelemetry.RecordInitError(ctx, set.Telemetry.MeterProvider, - "receiver", set.ID.String(), selftelemetry.ReasonInstrumentRegister) + recordInitError(ctx, set.Telemetry.MeterProvider, + "receiver", set.ID.String(), reasonInstrumentRegister) if set.Telemetry.Logger != nil { set.Telemetry.Logger.Warn("pyspy self-telemetry init failed; using noop", "err", err) } diff --git a/components/receivers/pyspy/factory_test.go b/components/receivers/pyspy/factory_test.go index 3789fe51..350a9a19 100644 --- a/components/receivers/pyspy/factory_test.go +++ b/components/receivers/pyspy/factory_test.go @@ -11,38 +11,38 @@ import ( "github.com/tracecoreai/tracecore/internal/pipeline" ) -func TestFactory_Type(t *testing.T) { +func TestPyspy_Type(t *testing.T) { require.Equal(t, ComponentType, Factory.Type().String()) } -func TestFactory_DefaultConfigValidates(t *testing.T) { +func TestPyspy_DefaultConfigValidates(t *testing.T) { cfg := Factory.CreateDefaultConfig() require.NoError(t, cfg.Validate()) _, ok := cfg.(*Config) require.True(t, ok, "default config must be *Config") } -// TestFactory_CreateMetricsReturnsErrSignalNotSupported pins that +// TestPyspy_CreateMetricsReturnsErrSignalNotSupported pins that // the metrics signal returns the canonical sentinel; the pipeline // runtime uses errors.Is to surface a clear operator message. 
-func TestFactory_CreateMetricsReturnsErrSignalNotSupported(t *testing.T) { +func TestPyspy_CreateMetricsReturnsErrSignalNotSupported(t *testing.T) { _, err := Factory.CreateMetrics(context.Background(), pipeline.CreateSettings{}, defaultConfig(), nil) require.ErrorIs(t, err, pipeline.ErrSignalNotSupported, "CreateMetrics must return pipeline.ErrSignalNotSupported") } -func TestFactory_CreateTracesReturnsErrSignalNotSupported(t *testing.T) { +func TestPyspy_CreateTracesReturnsErrSignalNotSupported(t *testing.T) { _, err := Factory.CreateTraces(context.Background(), pipeline.CreateSettings{}, defaultConfig(), nil) require.ErrorIs(t, err, pipeline.ErrSignalNotSupported, "CreateTraces must return pipeline.ErrSignalNotSupported") } -// TestFactory_CreateLogsReturnsRealReceiver pins that the logs +// TestPyspy_CreateLogsReturnsRealReceiver pins that the logs // signal actually constructs a working Receiver. The Phase 3 // pprof-dictionary emission path may move this registration to // CreateProfiles per RFC-0009 §6 footnote; until then logs is the // registered signal. -func TestFactory_CreateLogsReturnsRealReceiver(t *testing.T) { +func TestPyspy_CreateLogsReturnsRealReceiver(t *testing.T) { r, err := Factory.CreateLogs(context.Background(), pipeline.CreateSettings{ID: pipeline.MustNewID(pipeline.MustNewType(ComponentType), "")}, defaultConfig(), @@ -51,11 +51,11 @@ func TestFactory_CreateLogsReturnsRealReceiver(t *testing.T) { require.NotNil(t, r) } -// TestFactory_CreateLogsRejectsWrongConfigType pins the runtime's +// TestPyspy_CreateLogsRejectsWrongConfigType pins the runtime's // type-safety expectation: a factory handed a config of the wrong // type returns an error naming the actual type so the operator can // chase the misconfigured pipeline. 
-func TestFactory_CreateLogsRejectsWrongConfigType(t *testing.T) { +func TestPyspy_CreateLogsRejectsWrongConfigType(t *testing.T) { type wrongConfig struct{ pipeline.Config } _, err := Factory.CreateLogs(context.Background(), pipeline.CreateSettings{}, &wrongConfig{}, nil) require.Error(t, err) @@ -63,11 +63,11 @@ func TestFactory_CreateLogsRejectsWrongConfigType(t *testing.T) { require.ErrorContains(t, err, "config type") } -// TestNewFactory_ReturnsTheSamePackageVar pins that NewFactory() +// TestPyspy_NewFactory_ReturnsTheSamePackageVar pins that NewFactory() // is the codegen seam: tools/components-gen emits calls to it, and // it must return the package-private Factory var. Otherwise an // external consumer (rare) could end up with a stale factory // instance that didn't pick up package-level wiring. -func TestNewFactory_ReturnsTheSamePackageVar(t *testing.T) { +func TestPyspy_NewFactory_ReturnsTheSamePackageVar(t *testing.T) { require.Same(t, Factory, NewFactory()) } diff --git a/components/receivers/pyspy/fake_telemetry_test.go b/components/receivers/pyspy/fake_telemetry_test.go new file mode 100644 index 00000000..caaf4a34 --- /dev/null +++ b/components/receivers/pyspy/fake_telemetry_test.go @@ -0,0 +1,75 @@ +// SPDX-License-Identifier: Apache-2.0 + +package pyspy + +import ( + "sync" + "sync/atomic" + "time" +) + +// fakeTelemetry is a test-only captor for the receiver-scoped +// selfTelemetry surface. Replaces the v0.1.x dependency on +// `internal/selftelemetry.CapturingReceiver` so the receiver package +// stays decoupled from internal/* (PR-F unblock). Methods match the +// selfTelemetry interface 1:1; the captured-state accessors are the +// surface pyspy_test.go asserts against. 
+type fakeTelemetry struct { + emissions atomic.Int64 + latency atomic.Int32 + activity atomic.Int32 + + mu sync.Mutex + errorKinds []kind + degradedSet []bool +} + +func newFakeTelemetry() *fakeTelemetry { return &fakeTelemetry{} } + +func (f *fakeTelemetry) IncError(k kind) { + f.mu.Lock() + f.errorKinds = append(f.errorKinds, k) + f.mu.Unlock() +} + +func (f *fakeTelemetry) IncEmissions(n int64) { + if n < 0 { + return + } + f.emissions.Add(n) +} + +func (f *fakeTelemetry) ObserveLatency(time.Duration) { f.latency.Add(1) } + +func (f *fakeTelemetry) SetDegraded(b bool) { + f.mu.Lock() + f.degradedSet = append(f.degradedSet, b) + f.mu.Unlock() +} + +func (f *fakeTelemetry) MarkActivity() { f.activity.Add(1) } + +// Errors returns a snapshot of every IncError kind in call order so +// tests can assert "kindX was recorded" or "kindX recorded exactly N +// times". Returned slice is safe to mutate. +func (f *fakeTelemetry) Errors() []kind { + f.mu.Lock() + defer f.mu.Unlock() + out := make([]kind, len(f.errorKinds)) + copy(out, f.errorKinds) + return out +} + +// DegradedTransitions returns a snapshot of every SetDegraded value in +// call order — tests assert "first transition was true" + "degraded +// did not flip back to false during Phase 1". +func (f *fakeTelemetry) DegradedTransitions() []bool { + f.mu.Lock() + defer f.mu.Unlock() + out := make([]bool, len(f.degradedSet)) + copy(out, f.degradedSet) + return out +} + +// Verify fakeTelemetry satisfies the interface at compile time. +var _ selfTelemetry = (*fakeTelemetry)(nil) diff --git a/components/receivers/pyspy/kinds.go b/components/receivers/pyspy/kinds.go index fdf4bb35..dde36b7b 100644 --- a/components/receivers/pyspy/kinds.go +++ b/components/receivers/pyspy/kinds.go @@ -2,29 +2,37 @@ package pyspy -import "github.com/tracecoreai/tracecore/internal/selftelemetry" +// kind is a low-cardinality error-class identifier. 
Mirrors the +// internal/selftelemetry.Kind type so the migration is mechanical; +// receiver-local because the canonical-Kind enforcement that the +// internal package owned moves into RFC-0013 PR-I's submodule. +type kind string // Degraded-mode Kind values for RFC-0009 §Degraded modes. Operator // semantics live in RUNBOOK.md; only the wire strings are stable here. // Adding a kind: declare it, document in RUNBOOK.md, and add a // prometheus-alerts.example.yaml entry if it has an actionable threshold. const ( - kindTargetNotAttached selftelemetry.Kind = "target_not_attached" - kindTargetNotListening selftelemetry.Kind = "target_not_listening" - kindTargetGone selftelemetry.Kind = "target_gone" - kindDumpOverlap selftelemetry.Kind = "dump_overlap" - kindProtocolVersion selftelemetry.Kind = "protocol_version" - kindDumpFailed selftelemetry.Kind = "dump_failed" - kindFrameTooLarge selftelemetry.Kind = "frame_too_large" - kindParseError selftelemetry.Kind = "parse_error" - kindFaulthandlerMissing selftelemetry.Kind = "faulthandler_missing" - kindUDSDirPermissionDenied selftelemetry.Kind = "uds_dir_permission_denied" - kindHelperOOMMidDump selftelemetry.Kind = "helper_oom_mid_dump" - kindSidecarUIDDrift selftelemetry.Kind = "sidecar_uid_drift" + kindTargetNotAttached kind = "target_not_attached" + kindTargetNotListening kind = "target_not_listening" + kindTargetGone kind = "target_gone" + kindDumpOverlap kind = "dump_overlap" + kindProtocolVersion kind = "protocol_version" + kindDumpFailed kind = "dump_failed" + kindFrameTooLarge kind = "frame_too_large" + kindParseError kind = "parse_error" + kindFaulthandlerMissing kind = "faulthandler_missing" + kindUDSDirPermissionDenied kind = "uds_dir_permission_denied" + kindHelperOOMMidDump kind = "helper_oom_mid_dump" + // kindPanic mirrors the canonical selftelemetry.KindPanic — the + // receiver-scoped sibling has no separate canonical-Kind enforcement + // (that lives in the deleted internal package); the panic kind is + 
// declared locally so the lifecycle's onPanic callback can tick it. + kindPanic kind = "panic" ) // allKinds enforces parity with RFC-0009 §Degraded modes via kinds_test.go. -var allKinds = []selftelemetry.Kind{ +var allKinds = []kind{ kindTargetNotAttached, kindTargetNotListening, kindTargetGone, @@ -36,5 +44,4 @@ var allKinds = []selftelemetry.Kind{ kindFaulthandlerMissing, kindUDSDirPermissionDenied, kindHelperOOMMidDump, - kindSidecarUIDDrift, } diff --git a/components/receivers/pyspy/kinds_test.go b/components/receivers/pyspy/kinds_test.go index 5f385749..47ec2f71 100644 --- a/components/receivers/pyspy/kinds_test.go +++ b/components/receivers/pyspy/kinds_test.go @@ -27,7 +27,6 @@ func TestKinds_AllRFC0009DegradedModesCovered(t *testing.T) { "faulthandler_missing": {}, "uds_dir_permission_denied": {}, "helper_oom_mid_dump": {}, - "sidecar_uid_drift": {}, } got := map[string]struct{}{} for _, k := range allKinds { diff --git a/components/receivers/pyspy/lifecycle.go b/components/receivers/pyspy/lifecycle.go new file mode 100644 index 00000000..bde15ed7 --- /dev/null +++ b/components/receivers/pyspy/lifecycle.go @@ -0,0 +1,140 @@ +// SPDX-License-Identifier: Apache-2.0 + +// Receiver-scoped streaming-source lifecycle helper. Replaces the +// v0.1.x dependency on `internal/runtime/lifecycle`, which is slated +// for deletion in RFC-0013 PR-F. Owns the cancel + WaitGroup + +// panic-recovery bookkeeping pyspy's runAll fan-in goroutine needs, so +// the receiver author writes the body function, not the plumbing. +// Slimmer than the internal helper: no Add() (pyspy is single-source +// from lifecycle's POV — runAll spawns the three sub-goroutines under +// its OWN sync.WaitGroup), no post-Shutdown Add silent-no-op path. + +package pyspy + +import ( + "context" + "errors" + "fmt" + "log/slog" + "runtime" + "sync" + "sync/atomic" +) + +// errLifecycleAlreadyStarted is returned by lifecycle.Start when called +// twice without an intervening Shutdown. 
errors.Is-comparable so callers +// don't string-match the message. +var errLifecycleAlreadyStarted = errors.New("pyspy lifecycle: already started") + +// panicCallback is invoked once if the Run function panics. The helper +// recovers the panic so the receiver never crashes the workload +// (PRINCIPLES.md §1). pyspy wires this to IncError(kindPanic) + +// SetDegraded(true). +type panicCallback func(panicValue any) + +// lifecycle bundles cancel + WaitGroup + started-flag for a streaming +// source. Zero-value is NOT useful; use newLifecycle. +// +// Shutdown is idempotent. The FIRST Shutdown's error (typically a +// caller-ctx deadline) is stashed + returned by every subsequent +// Shutdown so deadline failures aren't silently swallowed. +type lifecycle struct { + logger *slog.Logger + onPanic panicCallback + + mu sync.Mutex + cancel context.CancelFunc + closed bool + shutdownErr error + wg sync.WaitGroup + started atomic.Bool +} + +// newLifecycle constructs a lifecycle. logger may be nil (replaced with +// slog.Default for tests). onPanic may be nil — panics are still +// recovered + logged at ERROR but no callback fires. +func newLifecycle(logger *slog.Logger, onPanic panicCallback) *lifecycle { + if logger == nil { + logger = slog.Default() + } + return &lifecycle{logger: logger, onPanic: onPanic} +} + +// Start spawns run in a goroutine. The ctx passed to run is derived +// from parent via context.WithCancel — so the receiver-level parent's +// cancellation cascades into the goroutine without an explicit Shutdown +// call. Idempotent: a second Start without an intervening Shutdown +// returns errLifecycleAlreadyStarted. 
+func (l *lifecycle) Start(parent context.Context, run func(context.Context)) error { + if !l.started.CompareAndSwap(false, true) { + return errLifecycleAlreadyStarted + } + l.mu.Lock() + internalCtx, cancel := context.WithCancel(parent) + l.cancel = cancel + // wg.Add(1) MUST happen under the same mutex as cancel so a + // concurrent Shutdown sees the post-Add state. Without this lock, + // Shutdown could observe cancel != nil, call wg.Wait at counter=0, + // return immediately, and either trigger `sync: WaitGroup misuse` + // OR orphan the goroutine. (Mirrors the internal helper's fix.) + l.wg.Add(1) + l.mu.Unlock() + go l.safeRun(internalCtx, run) + return nil +} + +// safeRun wraps run with panic recovery + wg.Done bookkeeping. +func (l *lifecycle) safeRun(ctx context.Context, run func(context.Context)) { + defer l.wg.Done() + defer func() { + if rec := recover(); rec != nil { + l.logger.Error("pyspy lifecycle: run panic recovered", "panic", fmt.Sprintf("%v", rec)) + if l.onPanic != nil { + l.onPanic(rec) + } + } + }() + run(ctx) +} + +// Shutdown cancels the internal ctx + waits for the goroutine to exit, +// honoring the caller's ctx deadline. Idempotent: subsequent calls +// return the FIRST call's error so a missed deadline isn't silently +// swallowed. +func (l *lifecycle) Shutdown(ctx context.Context) error { + l.mu.Lock() + if l.closed { + err := l.shutdownErr + l.mu.Unlock() + return err + } + cancel := l.cancel + l.cancel = nil + l.closed = true + l.mu.Unlock() + if cancel == nil { + return nil + } + cancel() + + done := make(chan struct{}) + go func() { + l.wg.Wait() + close(done) + }() + select { + case <-done: + return nil + case <-ctx.Done(): + // NumGoroutine is process-wide, not lifecycle-local; surfacing + // it here lets operators eyeball whether the leak is plausibly + // ours. 
+ l.logger.Warn("pyspy lifecycle: shutdown deadline elapsed before goroutine exited", + "process_goroutines", runtime.NumGoroutine()) + err := fmt.Errorf("pyspy lifecycle shutdown: %w", ctx.Err()) + l.mu.Lock() + l.shutdownErr = err + l.mu.Unlock() + return err + } +} diff --git a/components/receivers/pyspy/pyspy.go b/components/receivers/pyspy/pyspy.go index 4ce37cb5..74a9aef8 100644 --- a/components/receivers/pyspy/pyspy.go +++ b/components/receivers/pyspy/pyspy.go @@ -15,8 +15,6 @@ import ( "github.com/tracecoreai/tracecore/internal/consumer" "github.com/tracecoreai/tracecore/internal/pipeline" - "github.com/tracecoreai/tracecore/internal/runtime/lifecycle" - "github.com/tracecoreai/tracecore/internal/selftelemetry" ) // pyspyReceiver is the M13 Phase 2 receiver. It scans the configured @@ -35,9 +33,9 @@ type pyspyReceiver struct { set pipeline.CreateSettings cfg *Config next consumer.Logs - telemetry selftelemetry.Receiver + telemetry selfTelemetry - lc *lifecycle.Lifecycle + lc *lifecycle disabledReason disabledReason @@ -61,12 +59,12 @@ type pyspyReceiver struct { } // receiverOption mutates the receiver during newReceiver. Used by -// tests to inject the selftelemetry capturer and override +// tests to inject the fakeTelemetry capturer and override // scan-time hooks. type receiverOption func(*pyspyReceiver) //nolint:unused // exported via export_test.go for external tests; production callers land with M2's TelemetrySettings wiring. 
-func withSelfTelemetry(t selftelemetry.Receiver) receiverOption { +func withSelfTelemetry(t selfTelemetry) receiverOption { return func(r *pyspyReceiver) { if t == nil { return @@ -83,7 +81,7 @@ func newReceiver(set pipeline.CreateSettings, cfg *Config, next consumer.Logs, o set: set, cfg: cfg, next: next, - telemetry: selftelemetry.NewNoopReceiver(), + telemetry: newNoopSelfTelemetry(), } for _, opt := range opts { opt(r) @@ -130,8 +128,8 @@ func (r *pyspyReceiver) Start(ctx context.Context, host pipeline.Host) error { return err } - r.lc = lifecycle.New(r.logger(), func(_ any) { - r.telemetry.IncError(selftelemetry.KindPanic) + r.lc = newLifecycle(r.logger(), func(_ any) { + r.telemetry.IncError(kindPanic) r.telemetry.SetDegraded(true) }) diff --git a/components/receivers/pyspy/pyspy_test.go b/components/receivers/pyspy/pyspy_test.go index 5602d0d3..c73c312a 100644 --- a/components/receivers/pyspy/pyspy_test.go +++ b/components/receivers/pyspy/pyspy_test.go @@ -17,7 +17,6 @@ import ( "github.com/tracecoreai/tracecore/internal/consumer" "github.com/tracecoreai/tracecore/internal/pipeline" "github.com/tracecoreai/tracecore/internal/pipeline/pipelinetest" - "github.com/tracecoreai/tracecore/internal/selftelemetry" ) // logsSink is a thread-safe consumer.Logs implementation used by the @@ -76,7 +75,7 @@ func configForTest(t *testing.T) *Config { // - shuts down within the 1s Phase-1 budget func TestStart_EmptyUDSDir_LatchesTargetNotAttached(t *testing.T) { cfg := configForTest(t) - capturer := selftelemetry.NewCapturingReceiver() + capturer := newFakeTelemetry() r := newReceiver(settings(t), cfg, newLogsSink(), withSelfTelemetry(capturer)) ctx := context.Background() @@ -125,7 +124,7 @@ func TestStart_NonExistentUDSDir_TreatedAsTargetNotAttached(t *testing.T) { cfg.Target.UDSDir = filepath.Join(t.TempDir(), "does-not-exist") cfg.RetryInterval = 10 * time.Millisecond - capturer := selftelemetry.NewCapturingReceiver() + capturer := newFakeTelemetry() r := 
newReceiver(settings(t), cfg, newLogsSink(), withSelfTelemetry(capturer)) require.NoError(t, r.Start(context.Background(), pipelinetest.NewHost())) @@ -149,7 +148,7 @@ func TestStart_TopLevelDisabled_IsNoOp(t *testing.T) { disabled := false cfg.Enabled = &disabled - capturer := selftelemetry.NewCapturingReceiver() + capturer := newFakeTelemetry() r := newReceiver(settings(t), cfg, newLogsSink(), withSelfTelemetry(capturer)) require.NoError(t, r.Start(context.Background(), pipelinetest.NewHost())) @@ -199,7 +198,7 @@ func TestStart_ScanFindsCandidates_StaysInTargetNotAttached(t *testing.T) { // doesn't depend on the platform's UDS support. require.NoError(t, os.WriteFile(filepath.Join(cfg.Target.UDSDir, "pyspy.123.sock"), []byte{}, 0o600)) - capturer := selftelemetry.NewCapturingReceiver() + capturer := newFakeTelemetry() r := newReceiver(settings(t), cfg, newLogsSink(), withSelfTelemetry(capturer)) require.NoError(t, r.Start(context.Background(), pipelinetest.NewHost())) diff --git a/components/receivers/pyspy/selftel.go b/components/receivers/pyspy/selftel.go new file mode 100644 index 00000000..583796a4 --- /dev/null +++ b/components/receivers/pyspy/selftel.go @@ -0,0 +1,247 @@ +// SPDX-License-Identifier: Apache-2.0 + +// Receiver-scoped self-telemetry surface. Replaces the v0.1.x +// dependency on `internal/selftelemetry`, which is slated for deletion +// in RFC-0013 PR-F. Metric names + label shape are preserved +// (`tracecore.receiver.errors_total{kind,component_id}` and siblings) +// so dashboards / alerts don't regress. The instrumentation scope name +// is THIS receiver's Go import path — when the receiver moves to +// `module/receiver/pyspyreceiver/` in PR-I.1, the scope name moves +// with it, matching OTel convention. 
+ +package pyspy + +import ( + "context" + "errors" + "fmt" + "sync/atomic" + "time" + + "go.opentelemetry.io/otel/attribute" + "go.opentelemetry.io/otel/metric" + + "github.com/tracecoreai/tracecore/internal/pipeline" +) + +// reasonInstrumentRegister labels init_errors_total ticks when OTel +// instrument registration failed at construction time. +const reasonInstrumentRegister = "instrument_register" + +// instrumentationScope pins the OTel scope name. Per OTel convention, +// the scope is the package's Go import path; PR-I.1 changes this when +// the receiver moves to module/receiver/pyspyreceiver/. +const instrumentationScope = "github.com/tracecoreai/tracecore/components/receivers/pyspy" + +// errNilMeterProvider mirrors selftelemetry.ErrNilMeterProvider — the +// factory is responsible for substituting the noop fallback + ticking +// init_errors_total. Returning a sentinel rather than a generic error +// lets the factory distinguish "wire-up bug" from "instrument register +// failure" if it ever needs to. +var errNilMeterProvider = errors.New("pyspy: MeterProvider is nil") + +// selfTelemetry is the receiver-scoped self-health surface. Methods are +// non-blocking + safe for concurrent use; the noop impl discards. +// Mirrors the internal/selftelemetry.Receiver interface 1:1 so the +// migration is mechanical — pyspy hot paths already call IncError / +// IncEmissions / ObserveLatency / SetDegraded / MarkActivity. +type selfTelemetry interface { + IncError(k kind) + IncEmissions(n int64) + ObserveLatency(d time.Duration) + SetDegraded(degraded bool) + MarkActivity() +} + +// noopSelfTelemetry discards every call. 
+type noopSelfTelemetry struct{} + +func newNoopSelfTelemetry() selfTelemetry { return noopSelfTelemetry{} } + +func (noopSelfTelemetry) IncError(kind) {} +func (noopSelfTelemetry) IncEmissions(int64) {} +func (noopSelfTelemetry) ObserveLatency(time.Duration) {} +func (noopSelfTelemetry) SetDegraded(bool) {} +func (noopSelfTelemetry) MarkActivity() {} + +var _ selfTelemetry = noopSelfTelemetry{} + +// newSelfTelemetry returns a real selfTelemetry backed by OTel metric +// instruments acquired from mp. The component's id is attached as the +// `component_id` label on every emission. Registers the same five +// instruments the v0.1.x internal selftelemetry package registered, so +// scraped metric names + label shape are unchanged. +func newSelfTelemetry(id pipeline.ID, mp metric.MeterProvider) (selfTelemetry, error) { + if mp == nil { + return nil, errNilMeterProvider + } + meter := mp.Meter(instrumentationScope) + attrSet := attribute.NewSet(attribute.String("component_id", id.String())) + + errsCtr, err := meter.Int64Counter( + "tracecore.receiver.errors_total", + metric.WithDescription("Errors observed by a receiver, partitioned by kind"), + ) + if err != nil { + return nil, fmt.Errorf("errors_total counter: %w", err) + } + emissionsCtr, err := meter.Int64Counter( + "tracecore.receiver.emissions_total", + metric.WithDescription("Data points / events emitted by a receiver"), + ) + if err != nil { + return nil, fmt.Errorf("emissions_total counter: %w", err) + } + latencyHist, err := meter.Float64Histogram( + "tracecore.receiver.collection_latency_seconds", + metric.WithDescription("Receiver collection cycle latency in seconds"), + metric.WithUnit("s"), + // Bucket boundaries chosen for sub-millisecond dump-poll cycles + // up to 10s slow paths; mirrors the internal/selftelemetry + // shape so histograms remain comparable across the migration. 
+ metric.WithExplicitBucketBoundaries( + 0.0001, 0.001, 0.005, 0.01, 0.05, + 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, + ), + ) + if err != nil { + return nil, fmt.Errorf("collection_latency_seconds histogram: %w", err) + } + + st := &selfTelemetryImpl{ + componentID: id.String(), + attrs: attrSet, + errors: errsCtr, + emissions: emissionsCtr, + latency: latencyHist, + } + // Seed last-activity to construction time so a `time() - last_activity + // > N` alert doesn't fire on the zero-valued gauge during boot. + st.activityUnix.Store(time.Now().Unix()) + + if _, err := meter.Float64ObservableCounter( + "tracecore.receiver.degraded_seconds_total", + metric.WithDescription("Cumulative seconds the receiver has been in the degraded state"), + metric.WithUnit("s"), + metric.WithFloat64Callback(func(_ context.Context, obs metric.Float64Observer) error { + obs.Observe(st.degradedTotalSeconds(), metric.WithAttributeSet(attrSet)) + return nil + }), + ); err != nil { + return nil, fmt.Errorf("degraded_seconds_total observable: %w", err) + } + + if _, err := meter.Int64ObservableGauge( + "tracecore.receiver.last_activity_unix_seconds", + metric.WithDescription("Unix-second timestamp of the receiver's last successful activity"), + metric.WithInt64Callback(func(_ context.Context, obs metric.Int64Observer) error { + obs.Observe(st.activityUnix.Load(), metric.WithAttributeSet(attrSet)) + return nil + }), + ); err != nil { + return nil, fmt.Errorf("last_activity_unix_seconds observable: %w", err) + } + + return st, nil +} + +var _ selfTelemetry = (*selfTelemetryImpl)(nil) + +type selfTelemetryImpl struct { + componentID string + attrs attribute.Set + errors metric.Int64Counter + emissions metric.Int64Counter + latency metric.Float64Histogram + + // degradedAt holds the time of the most recent SetDegraded(true); + // nil pointer = not currently degraded. Atomic so SetDegraded is + // lock-free and the observable callback reads a stable snapshot. 
+ degradedAt atomic.Pointer[time.Time] + + // accumulated holds nanoseconds spent degraded across completed + // degrade→recover cycles; degradedTotalSeconds adds the open + // interval at observation time. + accumulated atomic.Uint64 + + // activityUnix holds the Unix-second timestamp of the most recent + // MarkActivity (seeded to construction time). + activityUnix atomic.Int64 +} + +func (s *selfTelemetryImpl) IncError(k kind) { + // Emit component_id + kind in one WithAttributes call rather than + // merging two attribute sets — avoids relying on SDK merge semantics + // that vary across OTel versions. + s.errors.Add(context.Background(), 1, metric.WithAttributes( + attribute.String("component_id", s.componentID), + attribute.String("kind", string(k)), + )) +} + +func (s *selfTelemetryImpl) IncEmissions(n int64) { + if n < 0 { + return + } + s.emissions.Add(context.Background(), n, metric.WithAttributeSet(s.attrs)) +} + +func (s *selfTelemetryImpl) ObserveLatency(d time.Duration) { + s.latency.Record(context.Background(), d.Seconds(), metric.WithAttributeSet(s.attrs)) +} + +// SetDegraded transitions degraded state. Lock-free: enter via +// CAS(nil → &now), exit via Swap → nil + accumulate the elapsed +// interval. Microsecond-scale under-count on concurrent transitions is +// tolerated; self-corrects on the next scrape. 
+func (s *selfTelemetryImpl) SetDegraded(degraded bool) { + if degraded { + now := time.Now() + s.degradedAt.CompareAndSwap(nil, &now) + return + } + if old := s.degradedAt.Swap(nil); old != nil { + elapsed := time.Since(*old) + if elapsed > 0 { + s.accumulated.Add(uint64(elapsed.Nanoseconds())) + } + } +} + +func (s *selfTelemetryImpl) MarkActivity() { + s.activityUnix.Store(time.Now().Unix()) +} + +func (s *selfTelemetryImpl) degradedTotalSeconds() float64 { + acc := time.Duration(s.accumulated.Load()) + if openStart := s.degradedAt.Load(); openStart != nil { + acc += time.Since(*openStart) + } + return acc.Seconds() +} + +// recordInitError ticks tracecore.selftelemetry.init_errors_total when +// receiver wiring falls back to noop telemetry. Operators alert on +// `> 0` to learn that self-telemetry isn't really plugged in. Panics +// from a broken MeterProvider are swallowed — recordInitError IS the +// degraded-path fallback; crashing here would turn a partial outage +// into a process kill. 
+func recordInitError(ctx context.Context, mp metric.MeterProvider, kindLabel, componentID, reason string) { + defer func() { _ = recover() }() + if mp == nil { + return + } + meter := mp.Meter(instrumentationScope) + c, err := meter.Int64Counter( + "tracecore.selftelemetry.init_errors_total", + metric.WithDescription("Counter of self-telemetry construction failures that fell back to the noop implementation."), + ) + if err != nil { + return + } + c.Add(ctx, 1, metric.WithAttributes( + attribute.String("kind", kindLabel), + attribute.String("component_id", componentID), + attribute.String("reason", reason), + )) +} diff --git a/components/receivers/pyspy/selftel_test.go b/components/receivers/pyspy/selftel_test.go new file mode 100644 index 00000000..303d418b --- /dev/null +++ b/components/receivers/pyspy/selftel_test.go @@ -0,0 +1,392 @@ +// SPDX-License-Identifier: Apache-2.0 + +package pyspy + +import ( + "context" + "errors" + "fmt" + "strings" + "testing" + "time" + + "go.opentelemetry.io/otel/attribute" + "go.opentelemetry.io/otel/metric" + "go.opentelemetry.io/otel/metric/embedded" + sdkmetric "go.opentelemetry.io/otel/sdk/metric" + "go.opentelemetry.io/otel/sdk/metric/metricdata" + + "github.com/tracecoreai/tracecore/internal/pipeline" +) + +// newTestMeterProvider builds an SDK MeterProvider backed by a ManualReader +// so tests can collect metricdata.ResourceMetrics deterministically without +// the Prometheus exporter or any internal/telemetry plumbing — the receiver +// package must stay decoupled from internal/* so PR-F can delete those +// packages without touching this test file. 
+func newTestMeterProvider(t *testing.T) (*sdkmetric.MeterProvider, *sdkmetric.ManualReader) {
+	t.Helper()
+	rdr := sdkmetric.NewManualReader()
+	mp := sdkmetric.NewMeterProvider(sdkmetric.WithReader(rdr))
+	t.Cleanup(func() { _ = mp.Shutdown(context.Background()) })
+	return mp, rdr
+}
+
+func collectMetrics(t *testing.T, rdr *sdkmetric.ManualReader) metricdata.ResourceMetrics {
+	t.Helper()
+	var rm metricdata.ResourceMetrics
+	if err := rdr.Collect(context.Background(), &rm); err != nil {
+		t.Fatalf("collect: %v", err)
+	}
+	return rm
+}
+
+// findInstrument returns the first metricdata.Metrics whose Name matches the
+// supplied OTel-dot name (e.g. "tracecore.receiver.errors_total"). Returns
+// the zero value and false if absent. Scope-agnostic: walks all scope metrics.
+func findInstrument(rm metricdata.ResourceMetrics, name string) (metricdata.Metrics, bool) {
+	for _, sm := range rm.ScopeMetrics {
+		for _, m := range sm.Metrics {
+			if m.Name == name {
+				return m, true
+			}
+		}
+	}
+	return metricdata.Metrics{}, false
+}
+
+// scopeOf returns the instrumentation scope name that emitted the supplied
+// metric name. Used to pin the scope-name standard for PR-F sibling ports.
+func scopeOf(rm metricdata.ResourceMetrics, name string) (string, bool) {
+	for _, sm := range rm.ScopeMetrics {
+		for _, m := range sm.Metrics {
+			if m.Name == name {
+				return sm.Scope.Name, true
+			}
+		}
+	}
+	return "", false
+}
+
+// kvMatch returns true if every want key's value matches the int64
+// datapoint's attribute set.
+func kvMatch(dp metricdata.DataPoint[int64], want map[string]string) bool {
+	for k, v := range want {
+		got, ok := dp.Attributes.Value(attribute.Key(k))
+		if !ok || got.AsString() != v {
+			return false
+		}
+	}
+	return true
+}
+
+// TestPyspy_NoopAlwaysSafe pins: newNoopSelfTelemetry returns a
+// value whose hot-path methods never panic and silently discard.
+// Every receiver hot path calls into the selfTelemetry surface; nil-checks at
+// each call site are forbidden, so the noop must be a real value.
+func TestPyspy_NoopAlwaysSafe(t *testing.T) {
+	st := newNoopSelfTelemetry()
+	defer func() {
+		if r := recover(); r != nil {
+			t.Fatalf("noop panicked: %v", r)
+		}
+	}()
+	st.IncError(kindTargetNotAttached)
+	st.IncError(kindTargetGone)
+	st.IncError(kindDumpOverlap)
+	st.IncError(kindPanic)
+	st.IncEmissions(42)
+	st.IncEmissions(-1)
+	st.ObserveLatency(15 * time.Millisecond)
+	st.SetDegraded(true)
+	st.SetDegraded(false)
+	st.MarkActivity()
+}
+
+// TestPyspy_NewReceiver_NilProviderErrors pins: newSelfTelemetry
+// returns errNilMeterProvider when called with a nil provider rather than
+// silently substituting noop; the factory is responsible for the fallback
+// + the recordInitError tick.
+func TestPyspy_NewReceiver_NilProviderErrors(t *testing.T) {
+	_, err := newSelfTelemetry(testID(), nil)
+	if !errors.Is(err, errNilMeterProvider) {
+		t.Fatalf("err = %v, want errNilMeterProvider", err)
+	}
+}
+
+// TestPyspy_EmitsErrorsTotal_WithKindAndComponentID pins the M2
+// metric contract. After IncError(kindTargetGone) ×2 + IncError(kindParseError) ×1,
+// the ManualReader collects tracecore.receiver.errors_total with
+// datapoints partitioned by kind and labeled with the component_id. A
+// regression that drops the kind label, the component_id label, or the
+// metric-name prefix fails here.
+func TestPyspy_EmitsErrorsTotal_WithKindAndComponentID(t *testing.T) {
+	mp, rdr := newTestMeterProvider(t)
+	st, err := newSelfTelemetry(testID(), mp)
+	if err != nil {
+		t.Fatalf("newSelfTelemetry: %v", err)
+	}
+	st.IncError(kindTargetGone)
+	st.IncError(kindTargetGone)
+	st.IncError(kindParseError)
+
+	rm := collectMetrics(t, rdr)
+	m, ok := findInstrument(rm, "tracecore.receiver.errors_total")
+	if !ok {
+		t.Fatalf("metric tracecore.receiver.errors_total absent; have: %s", dumpNames(rm))
+	}
+	sum, ok := m.Data.(metricdata.Sum[int64])
+	if !ok {
+		t.Fatalf("errors_total data shape: got %T, want metricdata.Sum[int64]", m.Data)
+	}
+	gotGone, foundGone := 0, false
+	gotParse, foundParse := 0, false
+	for _, dp := range sum.DataPoints {
+		if !kvMatch(dp, map[string]string{"component_id": "pyspy/test"}) {
+			t.Errorf("datapoint missing component_id=pyspy/test: %v", dp.Attributes)
+			continue
+		}
+		kind, _ := dp.Attributes.Value("kind")
+		switch kind.AsString() {
+		case "target_gone":
+			gotGone = int(dp.Value)
+			foundGone = true
+		case "parse_error":
+			gotParse = int(dp.Value)
+			foundParse = true
+		}
+	}
+	if !foundGone || gotGone != 2 {
+		t.Errorf("target_gone count: got %d (found=%v), want 2", gotGone, foundGone)
+	}
+	if !foundParse || gotParse != 1 {
+		t.Errorf("parse_error count: got %d (found=%v), want 1", gotParse, foundParse)
+	}
+}
+
+// TestPyspy_ScopeNameIsReceiverImportPath pins the OTel scope-name
+// standard: instrumentation scope = receiver's Go import path. This anchors
+// the PR-F sibling-port decision so a future drift back to the deleted
+// internal/selftelemetry scope fails here.
+func TestPyspy_ScopeNameIsReceiverImportPath(t *testing.T) {
+	mp, rdr := newTestMeterProvider(t)
+	st, err := newSelfTelemetry(testID(), mp)
+	if err != nil {
+		t.Fatalf("newSelfTelemetry: %v", err)
+	}
+	st.IncEmissions(1)
+	rm := collectMetrics(t, rdr)
+	scope, ok := scopeOf(rm, "tracecore.receiver.emissions_total")
+	if !ok {
+		t.Fatalf("emissions_total absent")
+	}
+	const wantScope = "github.com/tracecoreai/tracecore/components/receivers/pyspy"
+	if scope != wantScope {
+		t.Errorf("instrumentation scope: got %q, want %q", scope, wantScope)
+	}
+}
+
+// TestPyspy_RecordInitError_TicksInitErrorsCounter pins: when factory wiring
+// fails (newSelfTelemetry returns an error), recordInitError surfaces a
+// tracecore.selftelemetry.init_errors_total tick with kind="receiver",
+// the component_id label, and reason="instrument_register". This is the
+// only signal that a receiver fell back to noop telemetry; dropping the
+// recordInitError call must fail this test.
+func TestPyspy_RecordInitError_TicksInitErrorsCounter(t *testing.T) {
+	mp, rdr := newTestMeterProvider(t)
+	recordInitError(context.Background(), mp, "receiver", testID().String(), reasonInstrumentRegister)
+
+	rm := collectMetrics(t, rdr)
+	m, ok := findInstrument(rm, "tracecore.selftelemetry.init_errors_total")
+	if !ok {
+		t.Fatalf("init_errors_total absent; have: %s", dumpNames(rm))
+	}
+	sum, ok := m.Data.(metricdata.Sum[int64])
+	if !ok {
+		t.Fatalf("init_errors_total data shape: got %T, want metricdata.Sum[int64]", m.Data)
+	}
+	if len(sum.DataPoints) != 1 {
+		t.Fatalf("init_errors datapoints: got %d, want 1", len(sum.DataPoints))
+	}
+	dp := sum.DataPoints[0]
+	want := map[string]string{
+		"kind":         "receiver",
+		"component_id": "pyspy/test",
+		"reason":       reasonInstrumentRegister,
+	}
+	if !kvMatch(dp, want) {
+		t.Errorf("init_errors attrs: got %v, want %v", dp.Attributes, want)
+	}
+	if dp.Value != 1 {
+		t.Errorf("init_errors value: got %d, want 1", dp.Value)
+	}
+}
+
+//
+// TestPyspy_RecordInitError_NilProviderIsSafe pins: a nil MeterProvider must
+// not panic: recordInitError IS the fallback path; crashing here would
+// turn a partial degradation into a process kill.
+func TestPyspy_RecordInitError_NilProviderIsSafe(t *testing.T) {
+	defer func() {
+		if r := recover(); r != nil {
+			t.Fatalf("recordInitError(nil) panicked: %v", r)
+		}
+	}()
+	recordInitError(context.Background(), nil, "receiver", "x/y", reasonInstrumentRegister)
+}
+
+// TestPyspy_FallsBackToNoopWhenMeterFails pins the factory
+// observability contract end-to-end: when newSelfTelemetry returns an
+// error (synthetic register failure for every tracecore.receiver.*
+// instrument), the factory MUST (1) leave the receiver with a working
+// noop telemetry field (no nil, no panic on hot-path calls), AND (2)
+// tick tracecore.selftelemetry.init_errors_total via recordInitError.
+func TestPyspy_FallsBackToNoopWhenMeterFails(t *testing.T) {
+	mp, rdr := newTestMeterProvider(t)
+	failing := &failingReceiverMP{real: mp}
+
+	set := pipeline.CreateSettings{
+		ID: pipeline.MustNewID(pipeline.MustNewType(ComponentType), "test"),
+	}
+	set.Telemetry.MeterProvider = failing
+	cfg := defaultConfig()
+	cfg.Target.UDSDir = t.TempDir()
+	r, err := Factory.CreateLogs(context.Background(), set, cfg, newLogsSink())
+	if err != nil {
+		t.Fatalf("CreateLogs: %v", err)
+	}
+	recv, ok := r.(*pyspyReceiver)
+	if !ok {
+		t.Fatalf("receiver type: got %T, want *pyspyReceiver", r)
+	}
+	if recv.telemetry == nil {
+		t.Fatal("telemetry field nil after failed wiring; must fall back to noop")
+	}
+	// Hot-path call must not panic + must not surface (noop discards).
+	recv.telemetry.IncError(kindTargetGone)
+
+	rm := collectMetrics(t, rdr)
+	if m, ok := findInstrument(rm, "tracecore.receiver.errors_total"); ok {
+		if sum, ok := m.Data.(metricdata.Sum[int64]); ok && len(sum.DataPoints) > 0 {
+			t.Errorf("noop fallback leaked IncError into errors_total datapoints: %v", sum.DataPoints)
+		}
+	}
+	m, ok := findInstrument(rm, "tracecore.selftelemetry.init_errors_total")
+	if !ok {
+		t.Fatalf("init_errors_total absent after factory fallback; have: %s", dumpNames(rm))
+	}
+	sum, ok := m.Data.(metricdata.Sum[int64])
+	if !ok {
+		t.Fatalf("init_errors_total data shape: got %T", m.Data)
+	}
+	if len(sum.DataPoints) != 1 || sum.DataPoints[0].Value != 1 {
+		t.Errorf("init_errors_total: want 1 datapoint value=1, got %v", sum.DataPoints)
+	}
+}
+
+func testID() pipeline.ID {
+	return pipeline.MustNewID(pipeline.MustNewType(ComponentType), "test")
+}
+
+func dumpNames(rm metricdata.ResourceMetrics) string {
+	var b strings.Builder
+	for _, sm := range rm.ScopeMetrics {
+		for _, m := range sm.Metrics {
+			fmt.Fprintf(&b, " %s@%s", m.Name, sm.Scope.Name)
+		}
+	}
+	return b.String()
+}
+
+// failingReceiverMP wraps a real MeterProvider but fails every instrument
+// registration whose name starts with "tracecore.receiver.". Mirrors the
+// nccl_fr sibling test seam so a future refactor that reorders the
+// newSelfTelemetry constructor doesn't silently bypass coverage.
+type failingReceiverMP struct {
+	embedded.MeterProvider
+	real metric.MeterProvider
+}
+
+func (p *failingReceiverMP) Meter(name string, opts ...metric.MeterOption) metric.Meter {
+	return &failingReceiverMeter{Meter: p.real.Meter(name, opts...)}
+}
+
+type failingReceiverMeter struct {
+	metric.Meter
+}
+
+const receiverInstrumentPrefix = "tracecore.receiver."
+
+var errSyntheticReceiverFailure = errors.New("synthetic: receiver instrument registration failed")
+
+func (m *failingReceiverMeter) Int64Counter(name string, opts ...metric.Int64CounterOption) (metric.Int64Counter, error) {
+	if strings.HasPrefix(name, receiverInstrumentPrefix) {
+		return nil, errSyntheticReceiverFailure
+	}
+	c, err := m.Meter.Int64Counter(name, opts...)
+	if err != nil {
+		return nil, fmt.Errorf("failingReceiverMeter passthrough: %w", err)
+	}
+	return c, nil
+}
+
+func (m *failingReceiverMeter) Float64Histogram(name string, opts ...metric.Float64HistogramOption) (metric.Float64Histogram, error) {
+	if strings.HasPrefix(name, receiverInstrumentPrefix) {
+		return nil, errSyntheticReceiverFailure
+	}
+	h, err := m.Meter.Float64Histogram(name, opts...)
+	if err != nil {
+		return nil, fmt.Errorf("failingReceiverMeter passthrough: %w", err)
+	}
+	return h, nil
+}
+
+func (m *failingReceiverMeter) Float64ObservableCounter(name string, opts ...metric.Float64ObservableCounterOption) (metric.Float64ObservableCounter, error) {
+	if strings.HasPrefix(name, receiverInstrumentPrefix) {
+		return nil, errSyntheticReceiverFailure
+	}
+	c, err := m.Meter.Float64ObservableCounter(name, opts...)
+	if err != nil {
+		return nil, fmt.Errorf("failingReceiverMeter passthrough: %w", err)
+	}
+	return c, nil
+}
+
+func (m *failingReceiverMeter) Int64ObservableGauge(name string, opts ...metric.Int64ObservableGaugeOption) (metric.Int64ObservableGauge, error) {
+	if strings.HasPrefix(name, receiverInstrumentPrefix) {
+		return nil, errSyntheticReceiverFailure
+	}
+	g, err := m.Meter.Int64ObservableGauge(name, opts...)
+	if err != nil {
+		return nil, fmt.Errorf("failingReceiverMeter passthrough: %w", err)
+	}
+	return g, nil
+}
+
+// asSelfTelemetry is a compile-time pin: it accepts the package-local
+// selfTelemetry interface only. If a future refactor moves the type
+// back into internal/selftelemetry (e.g.
+// reintroduces a selftelemetry.Receiver alias), this function's signature breaks +
+// every caller fails compile. Pairs with the kind-value asserts below
+// to pin the sibling-types contract that PR-B1 established. Mirrors
+// the stdoutexporter `asSelfExporter` pattern.
+func asSelfTelemetry(s selfTelemetry) selfTelemetry { return s }
+
+// TestPyspy_SiblingTypesArePackageLocal pins the PR-B1 sibling
+// contract: the pyspy package owns its own selfTelemetry + kind types;
+// they must NOT come from internal/selftelemetry. If a future
+// refactor reintroduces that import, the asSelfTelemetry signature
+// changes type → break compile here. The kind-value asserts pin a
+// representative wire-format string ("target_gone") that operators
+// query.
+func TestPyspy_SiblingTypesArePackageLocal(t *testing.T) {
+	iface := asSelfTelemetry(newNoopSelfTelemetry())
+	iface.IncError(kindTargetGone)
+	iface.IncEmissions(1)
+
+	if string(kindTargetGone) != "target_gone" {
+		t.Errorf("kindTargetGone: got %q, want %q", string(kindTargetGone), "target_gone")
+	}
+	if string(kindPanic) != "panic" {
+		t.Errorf("kindPanic: got %q, want %q", string(kindPanic), "panic")
+	}
+}
diff --git a/docs/rfcs/0009-pyspy-receiver-scope.md b/docs/rfcs/0009-pyspy-receiver-scope.md
index 7e84ae66..c8dc8015 100644
--- a/docs/rfcs/0009-pyspy-receiver-scope.md
+++ b/docs/rfcs/0009-pyspy-receiver-scope.md
@@ -168,7 +168,6 @@ Each row maps to one `IncError(kind)` invocation and one `FAILURE-MODES.md` entr
 | Helper signals `faulthandler` unavailable (CPython built without it) | `faulthandler_missing` | Helper sends `{"kind":"hello","version":1,"unsupported":true}`; receiver idles with self-metric. |
 | `uds_dir` exists but receiver lacks read+execute permission | `uds_dir_permission_denied` | Validate-at-Start fails with named-field error per M1's config-error contract; receiver does not start. Operator sees error in pod startup logs.
Distinct from `target_not_attached` (which is "dir exists and is readable but empty"). |
 | Helper-side OOM during a `dump_traceback` call (Python interpreter OOM, not workload OOM) | `helper_oom_mid_dump` | Helper's reader thread catches the resulting `MemoryError` via its `BaseException` handler and replies `{"kind":"dump_failed","reason":"MemoryError"}`. Distinct from `dump_failed` only in the `reason` payload; counts under the broader `dump_failed` kind. Documented separately so operators know `MemoryError` is the surface to alert on. |
-| Workload image rebuilt with a different `runAsUser` than the sidecar | `sidecar_uid_drift` | Helper binds UDS with the workload's UID + mode `0700`; sidecar with different UID gets `EACCES` on connect. Receiver enters `target_not_listening` posture for that PID, but the cause is operator-visible UID misconfig. Phase 4 chart `values.yaml` defaults both UIDs from one variable to make this loud; in Phase 1 the receiver self-metric `disabled_reason="sidecar_uid_drift"` distinguishes from `target_not_listening`. |
 | Receiver shutdown lands mid-frame (in-flight response not drained) | `target_gone` (on next Start) | Current Shutdown is silent (no per-shutdown error counter); next Start observes the helper-side socket closure and posts the `target_gone` row. Operator-visible signal: gap in dump records bracketed by the receiver restart timestamp. |
 
 There is no `cap_missing` row. The receiver requires no capability addition.