
OTel Hook: Cross-SDK inconsistencies and proposal for a shared test harness (like flagd-testbed) #373

@aepfli

Description


Background

Appendix D defines the observability conventions for OpenFeature telemetry hooks — attribute names, event naming, hook lifecycle placement, and metadata mappings. Multiple SDK-contrib repositories already ship OTel hook implementations:

| SDK | Implementation |
| --- | --- |
| Java | java-sdk-contrib – `hooks/open-telemetry` |
| Go | go-sdk-contrib – `hooks/open-telemetry` |
| Python | python-sdk-contrib – `hooks/openfeature-hooks-opentelemetry` |
| .NET | dotnet-sdk-contrib – `src/OpenFeature.Contrib.Hooks.Otel` |

Each has its own isolated unit tests, but there is no shared, cross-SDK test harness that verifies compliance with Appendix D — analogous to how flagd-testbed provides Gherkin suites plus a Docker container for validating flagd provider behavior across all SDKs.

An audit of current implementations reveals multiple spec-compliance gaps that a shared harness would catch automatically. Notably, some core SDKs have also defined their own telemetry helpers (Go SDK, Python SDK, .NET SDK) with varying degrees of alignment to the spec.


Inconsistencies Found

1. Attribute Names — Out of Sync with Current OTel Semconv

Appendix D defines the current attribute names following OTel semconv renames in v1.32–v1.34. The table below shows the full picture across hook implementations and core SDK telemetry helpers:

(Implementations audited: Java hook, Go hook, Python hook, .NET hook, Python SDK core, .NET SDK core, Go SDK core.)

| Attribute (Appendix D) | Observed across hook implementations and core SDK helpers |
| --- | --- |
| `feature_flag.key` | ✅ emitted (one implementation uses the unqualified key `"key"` ⚠️) |
| `feature_flag.result.variant` | ❌ old `feature_flag.variant` (four implementations) |
| `feature_flag.result.reason` | ❌ not emitted in traces; ❌ not emitted; ❌ unqualified `reason` ⚠️; ❌ old `feature_flag.evaluation.reason` |
| `feature_flag.result.value` | ❌ not emitted (four implementations) |
| `feature_flag.provider.name` | ❌ old `feature_flag.provider_name` (three implementations); ❌ old unqualified `provider_name` ⚠️ |
| `feature_flag.context.id` | ❌ not emitted (three implementations) |
| `feature_flag.set.id` | ❌ not emitted (four implementations) |
| `feature_flag.version` | ❌ not emitted (four implementations) |
| `error.type` | ❌ not emitted (three implementations) |
| `error.message` | ❌ not emitted (three implementations); ❌ old `feature_flag.evaluation.error.message` |

⚠️ = .NET contrib MetricsHook uses entirely unqualified keys ("key", "provider_name", "variant", "reason") rather than the feature_flag.* namespace.

A particularly striking disconnect: the .NET SDK core (TelemetryConstants.cs) defines correct up-to-date attribute names, while the .NET contrib hook (TracingHook.cs / MetricsHook.cs) in the same ecosystem still uses the old ones. This shows how quickly implementations drift from the spec without automated validation.
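As an illustration, the core attribute-name check a harness would run fits in a few lines. The attribute sets below are transcribed from the table above; `check_attributes` is a hypothetical helper, not part of any SDK:

```python
# Sketch of a harness check: emitted attributes must use the current
# Appendix D names and must not use any deprecated pre-rename names.
# Attribute lists are transcribed from the tables in this issue.

CURRENT_ATTRS = {
    "feature_flag.key",
    "feature_flag.result.variant",
    "feature_flag.result.reason",
    "feature_flag.result.value",
    "feature_flag.provider.name",
    "feature_flag.context.id",
    "feature_flag.set.id",
    "feature_flag.version",
    "error.type",
    "error.message",
}

DEPRECATED_ATTRS = {
    "feature_flag.variant",
    "feature_flag.evaluation.reason",
    "feature_flag.provider_name",
    "feature_flag.evaluation.error.message",
    # unqualified keys observed in the .NET contrib MetricsHook
    "key", "provider_name", "variant", "reason",
}

def check_attributes(emitted: dict) -> list[str]:
    """Return a list of compliance violations for one emitted event."""
    violations = []
    for name in emitted:
        if name in DEPRECATED_ATTRS:
            violations.append(f"deprecated attribute emitted: {name}")
        elif name not in CURRENT_ATTRS:
            violations.append(f"unknown attribute emitted: {name}")
    return violations

# Example: an event still using the old variant name fails the check
print(check_attributes({"feature_flag.key": "my-flag",
                        "feature_flag.variant": "on"}))
# → ['deprecated attribute emitted: feature_flag.variant']
```

The same check, driven from Gherkin step definitions, would have caught every row in the table above.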

2. Span / Log Event Name

Appendix D and the spec's Hook Lifecycle guidance recommend using "feature_flag.evaluation" as the event name. Implementations diverge:

| SDK | Event name |
| --- | --- |
| Java hook | `"feature_flag"` |
| Go hook | `"feature_flag.evaluation"` |
| Python hook | `"feature_flag"` |
| .NET hook | `"feature_flag"` (ActivityEvent) ❌ |
| Go SDK core | `"feature_flag.evaluation"` |

3. Hook Lifecycle Stage for Emitting Telemetry

Appendix D states:

"The finally hook stage is where telemetry signals are emitted with complete evaluation details."

| SDK | Hook | Stage used |
| --- | --- | --- |
| Java | TracesHook | `after` |
| Go | traceHook | `finally` |
| Python | TracingHook | `after` |
| .NET | TracingHook | `AfterAsync` |

Using after skips the error path — an evaluation that throws will never emit a trace event in Java, Python, or .NET.
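The difference can be sketched with a minimal stand-in evaluation pipeline. The `TelemetryHook` and `evaluate` names here are illustrative, not the real OpenFeature API; the point is that only a `finally`-stage emission fires on the error path:

```python
# Sketch: telemetry emitted in the finally stage fires on BOTH the
# success and the error path; an after-stage emission would be skipped
# whenever the resolver throws.

class TelemetryHook:
    def __init__(self):
        self.events = []

    def after(self, details):          # success path only
        pass

    def error(self, exc):              # failure path only
        pass

    def finally_after(self, details):  # runs on both paths
        self.events.append(details)

def evaluate(flag_key, resolver, hook):
    """Minimal stand-in for an SDK evaluation with hook stages."""
    details = {"feature_flag.key": flag_key}
    try:
        details["feature_flag.result.value"] = resolver(flag_key)
        hook.after(details)
    except Exception as exc:
        details["error.type"] = type(exc).__name__
        details["error.message"] = str(exc)
        hook.error(exc)
        raise
    finally:
        hook.finally_after(details)    # emitted even when resolver throws

hook = TelemetryHook()
try:
    evaluate("broken-flag", lambda key: 1 / 0, hook)
except ZeroDivisionError:
    pass

# The finally stage still captured the failed evaluation:
assert hook.events[0]["error.type"] == "ZeroDivisionError"
```

A Gherkin scenario asserting "an erroring evaluation still produces a span event" would fail today against the Java, Python, and .NET hooks.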

4. OTel Semconv Version Drift — Even Within a Single Repo

In go-sdk-contrib, metrics.go imports semconv/v1.34.0 while traces.go imports semconv/v1.37.0. This kind of intra-repo drift is invisible without a contract-based harness.
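Even without a full harness, a lightweight CI check could flag this class of drift. A minimal sketch, with the two offending import lines inlined for illustration:

```python
# Sketch: fail CI when one repo imports more than one semconv version.
# File contents are inlined here; a real check would glob *.go files.
import re

files = {
    "metrics.go": 'import semconv "go.opentelemetry.io/otel/semconv/v1.34.0"',
    "traces.go":  'import semconv "go.opentelemetry.io/otel/semconv/v1.37.0"',
}

versions = {
    m.group(1)
    for content in files.values()
    for m in re.finditer(r"semconv/(v\d+\.\d+\.\d+)", content)
}

if len(versions) > 1:
    print(f"semconv version drift: {sorted(versions)}")
# → semconv version drift: ['v1.34.0', 'v1.37.0']
```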

5. Metrics — No Formal Spec Coverage

Java, Go, and .NET all ship a MetricsHook emitting:

  • feature_flag.evaluation_active_count
  • feature_flag.evaluation_requests_total
  • feature_flag.evaluation_success_total
  • feature_flag.evaluation_error_total

These names are consistent across SDKs, but Appendix D defines no metrics conventions. Python has no metrics hook at all. Without a normative spec for metric names, dimensions, and instrument types (counter vs. updown-counter), future implementations will diverge.
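A metrics convention would need to pin down instrument types as well. A minimal sketch of the intended semantics, using illustrative stand-in classes rather than the OTel metrics API, and assuming `evaluation_active_count` is an up/down counter while the `*_total` metrics are monotonic counters:

```python
# Sketch: instrument semantics a metrics convention would make normative.
# active_count rises when an evaluation starts and falls when it ends
# (up/down counter); the *_total metrics only ever increase (counters).

class Counter:                 # monotonic: add() only
    def __init__(self):
        self.value = 0
    def add(self, n=1):
        self.value += n

class UpDownCounter(Counter):  # may also decrease
    def sub(self, n=1):
        self.value -= n

active = UpDownCounter()       # feature_flag.evaluation_active_count
requests = Counter()           # feature_flag.evaluation_requests_total
success = Counter()            # feature_flag.evaluation_success_total
errors = Counter()             # feature_flag.evaluation_error_total

def record_evaluation(resolver):
    active.add()
    requests.add()
    try:
        result = resolver()
        success.add()
        return result
    except Exception:
        errors.add()
        raise
    finally:
        active.sub()           # active count returns to steady state

record_evaluation(lambda: "on")
print(requests.value, success.value, errors.value, active.value)
# → 1 1 0 0
```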


Proposal: Shared OTel Hook Test Harness

The flagd-testbed model is a proven pattern:

  • A central repo defines language-agnostic Gherkin feature files
  • Each SDK imports it as a git submodule and implements step definitions using its language's test tooling
  • Automated CI in each SDK repo gates PRs on compliance

We propose the same approach for OTel hooks, extending the spec repo itself (since Appendix B already hosts evaluation.feature, hooks.feature, etc.) with OTel-specific Gherkin suites:

specification/assets/gherkin/otel-traces-hook.feature
specification/assets/gherkin/otel-metrics-hook.feature

Each scenario would:

  • Set up a minimal in-memory OTel exporter (spans or metrics)
  • Invoke a flag evaluation through the hook under test
  • Assert the correct current attribute keys and values are emitted
  • Assert deprecated/renamed attribute names are NOT present
  • Cover all hook lifecycle stages (before, after, error, finally)
  • Cover error scenarios (error.type, error.message)
  • Cover metadata → attribute mapping (feature_flag.context.id, feature_flag.set.id, feature_flag.version)
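As a sketch, a scenario in the proposed `otel-traces-hook.feature` might read as follows (the step wording and flag name are illustrative, not final):

```gherkin
Feature: OTel tracing hook compliance

  Scenario: Successful evaluation emits current attribute names
    Given an in-memory span exporter is attached
    And a flag "boolean-flag" that resolves to "true" with variant "on"
    When the flag is evaluated with the tracing hook registered
    Then a span event named "feature_flag.evaluation" is recorded
    And the event has attribute "feature_flag.key" with value "boolean-flag"
    And the event has attribute "feature_flag.result.variant" with value "on"
    And the event does not have attribute "feature_flag.variant"
```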

Each SDK's OTel hook would consume these feature files and implement step definitions using its language's OTel test SDK:

  • Java: OpenTelemetryExtension (JUnit 5)
  • Go: tracetest.NewInMemoryExporter / metric.NewManualReader
  • Python: opentelemetry-sdk in-memory exporters
  • .NET: AddInMemoryExporter

Why This Matters

OTel semconv for feature flags has seen 5 breaking attribute renames between v1.32 and v1.34. Without a shared test harness, SDKs silently fall behind — and users instrumenting multiple services in different languages get inconsistent telemetry. Dashboards, alerts, and correlation queries built on one SDK's attribute names silently break when querying data from another SDK.

Appendix D already contains the normative rules. Gherkin suites would make those rules machine-verifiable across every SDK implementation, closing the loop between spec and implementation — exactly as flagd-testbed does for provider conformance.

