Background
Appendix D defines the observability conventions for OpenFeature telemetry hooks — attribute names, event naming, hook lifecycle placement, and metadata mappings. Multiple SDK-contrib repositories already ship OTel hook implementations:
Each has its own isolated unit tests, but there is no shared, cross-SDK test harness that verifies compliance with Appendix D — analogous to how flagd-testbed provides Gherkin suites plus a Docker container for validating flagd provider behavior across all SDKs.
An audit of current implementations reveals multiple spec-compliance gaps that a shared harness would catch automatically. Notably, some core SDKs have also defined their own telemetry helpers (Go SDK, Python SDK, .NET SDK) with varying degrees of alignment to the spec.
Inconsistencies Found
1. Attribute Names — Out of Sync with Current OTel Semconv
Appendix D defines the current attribute names following OTel semconv renames in v1.32–v1.34. The table below shows the full picture across hook implementations and core SDK telemetry helpers:
| Attribute (Appendix D) | Java hook | Go hook | Python hook | .NET hook | Python SDK core | .NET SDK core | Go SDK core |
|---|---|---|---|---|---|---|---|
| `feature_flag.key` | ✅ | ✅ | ✅ | ✅ (as `"key"` ⚠️) | ✅ | ✅ | ✅ |
| `feature_flag.result.variant` | ❌ old `feature_flag.variant` | ✅ | ❌ old `feature_flag.variant` | ❌ old `feature_flag.variant` | ❌ old `feature_flag.variant` | ✅ | ✅ |
| `feature_flag.result.reason` | ❌ not emitted in traces | ✅ | ❌ not emitted | ❌ unnamespaced `reason` ⚠️ | ❌ old `feature_flag.evaluation.reason` | ✅ | ✅ |
| `feature_flag.result.value` | ❌ not emitted | ✅ | ❌ not emitted | ❌ not emitted | ❌ not emitted | ✅ | ✅ |
| `feature_flag.provider.name` | ❌ old `feature_flag.provider_name` | ✅ | ❌ old `feature_flag.provider_name` | ❌ old `provider_name` ⚠️ | ❌ old `feature_flag.provider_name` | ✅ | ✅ |
| `feature_flag.context.id` | ❌ not emitted | ✅ | ❌ not emitted | ❌ not emitted | ✅ | ✅ | ✅ |
| `feature_flag.set.id` | ❌ not emitted | ❌ not emitted | ❌ not emitted | ❌ not emitted | ✅ | ✅ | ✅ |
| `feature_flag.version` | ❌ not emitted | ❌ not emitted | ❌ not emitted | ❌ not emitted | ✅ | ✅ | ✅ |
| `error.type` | ❌ not emitted | ✅ | ❌ not emitted | ❌ not emitted | ✅ | ✅ | ✅ |
| `error.message` | ❌ not emitted | ✅ | ❌ not emitted | ❌ not emitted | ❌ old `feature_flag.evaluation.error.message` | ✅ | ✅ |

⚠️ = the .NET contrib `MetricsHook` uses entirely unqualified keys (`"key"`, `"provider_name"`, `"variant"`, `"reason"`) rather than the `feature_flag.*` namespace.
A particularly striking disconnect: the .NET SDK core (`TelemetryConstants.cs`) defines correct, up-to-date attribute names, while the .NET contrib hook (`TracingHook.cs` / `MetricsHook.cs`) in the same ecosystem still uses the old ones. This shows how quickly implementations drift from the spec without automated validation.
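As a sketch of the kind of check a harness could automate, the renames above can be encoded as a deprecated-to-current map and applied to a captured span's attributes (the `find_violations` helper and its wording are hypothetical, not part of any SDK):

```python
# Sketch: detect deprecated feature-flag attribute keys on an emitted span.
# The rename map reflects the semconv changes listed in the table above;
# the helper itself is an illustrative assumption, not an existing API.

DEPRECATED_TO_CURRENT = {
    "feature_flag.variant": "feature_flag.result.variant",
    "feature_flag.evaluation.reason": "feature_flag.result.reason",
    "feature_flag.provider_name": "feature_flag.provider.name",
    "feature_flag.evaluation.error.message": "error.message",
}

def find_violations(attributes: dict) -> list[str]:
    """Return a message per deprecated key found in a span's attributes."""
    return [
        f"{old} should be {new}"
        for old, new in DEPRECATED_TO_CURRENT.items()
        if old in attributes
    ]

# Example: a span emitted by an out-of-date hook.
span_attributes = {
    "feature_flag.key": "my-flag",
    "feature_flag.variant": "on",           # deprecated name
    "feature_flag.provider_name": "flagd",  # deprecated name
}
print(find_violations(span_attributes))  # flags both deprecated keys
```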
2. Span / Log Event Name
Appendix D and the spec's Hook Lifecycle guidance recommend using `feature_flag.evaluation` as the event name. Implementations diverge:
| SDK | Event name |
|---|---|
| Java hook | `"feature_flag"` ❌ |
| Go hook | `"feature_flag.evaluation"` ✅ |
| Python hook | `"feature_flag"` ❌ |
| .NET hook | `"feature_flag"` (ActivityEvent) ❌ |
| Go SDK core | `"feature_flag.evaluation"` ✅ |
3. Hook Lifecycle Stage for Emitting Telemetry
Appendix D states:
"The finally hook stage is where telemetry signals are emitted with complete evaluation details."
| SDK | Stage used |
|---|---|
| Java `TracesHook` | `after` ❌ |
| Go `traceHook` | `finally` ✅ |
| Python `TracingHook` | `after` ❌ |
| .NET `TracingHook` | `AfterAsync` ❌ |
Using `after` skips the error path — an evaluation that throws will never emit a trace event in Java, Python, or .NET.
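The difference can be seen in a minimal simulation of the hook stages (hypothetical code, not the OpenFeature SDK API — the stage ordering follows the spec's before/after/error/finally lifecycle):

```python
# Minimal sketch of the hook lifecycle: telemetry emitted in `after`
# misses failed evaluations, while `finally` sees every evaluation.

def evaluate(resolve, hooks):
    """Run a flag evaluation through simplified hook stages."""
    for h in hooks:
        h.get("before", lambda: None)()
    try:
        value = resolve()
        for h in hooks:
            h.get("after", lambda: None)()    # skipped when resolve() raises
        return value
    except Exception:
        for h in hooks:
            h.get("error", lambda: None)()
        raise
    finally:
        for h in hooks:
            h.get("finally", lambda: None)()  # always runs

events = []
after_hook = {"after": lambda: events.append("after-span")}
finally_hook = {"finally": lambda: events.append("finally-span")}

try:
    evaluate(lambda: 1 / 0, [after_hook, finally_hook])  # provider throws
except ZeroDivisionError:
    pass

print(events)  # only the finally-stage hook emitted: ['finally-span']
```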
4. OTel Semconv Version Drift — Even Within a Single Repo
In `go-sdk-contrib`, `metrics.go` imports `semconv/v1.34.0` while `traces.go` imports `semconv/v1.37.0`. This kind of intra-repo drift is invisible without a contract-based harness.
5. Metrics — No Formal Spec Coverage
Java, Go, and .NET all ship a `MetricsHook` emitting:

- `feature_flag.evaluation_active_count`
- `feature_flag.evaluation_requests_total`
- `feature_flag.evaluation_success_total`
- `feature_flag.evaluation_error_total`
These names are consistent across SDKs, but Appendix D defines no metrics conventions. Python has no metrics hook at all. Without a normative spec for metric names, dimensions, and instrument types (Counter vs. UpDownCounter), future implementations will diverge.
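The instrument-type question can be illustrated with a minimal sketch. The classes below are illustrative stand-ins, not the OTel API, and the mapping of metric names to instrument kinds is an assumption a future spec would need to pin down:

```python
# Sketch of the Counter vs. UpDownCounter distinction, using the metric
# names the MetricsHook implementations already emit. Hypothetical classes.

class Counter:
    """Monotonic: only increases (e.g. feature_flag.evaluation_requests_total)."""
    def __init__(self):
        self.value = 0
    def add(self, n):
        if n < 0:
            raise ValueError("counters are monotonic")
        self.value += n

class UpDownCounter:
    """May decrease (e.g. feature_flag.evaluation_active_count, which should
    go up when an evaluation starts and back down when it completes)."""
    def __init__(self):
        self.value = 0
    def add(self, n):
        self.value += n

requests = Counter()
active = UpDownCounter()

# One evaluation: active goes up around the evaluation, requests only up.
active.add(1)
requests.add(1)
active.add(-1)

print(requests.value, active.value)  # 1 0
```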
Proposal: Shared OTel Hook Test Harness
The flagd-testbed model is a proven pattern:
- A central repo defines language-agnostic Gherkin feature files
- Each SDK imports it as a git submodule and implements step definitions using its language's test tooling
- Automated CI in each SDK repo gates PRs on compliance
We propose the same approach for OTel hooks, extending the spec repo itself (since Appendix B already hosts `evaluation.feature`, `hooks.feature`, etc.) with OTel-specific Gherkin suites:

- `specification/assets/gherkin/otel-traces-hook.feature`
- `specification/assets/gherkin/otel-metrics-hook.feature`
Each scenario would:
- Set up a minimal in-memory OTel exporter (spans or metrics)
- Invoke a flag evaluation through the hook under test
- Assert the correct current attribute keys and values are emitted
- Assert deprecated/renamed attribute names are NOT present
- Cover all hook lifecycle stages (`before`, `after`, `error`, `finally`)
- Cover error scenarios (`error.type`, `error.message`)
- Cover metadata → attribute mapping (`feature_flag.context.id`, `feature_flag.set.id`, `feature_flag.version`)
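A step definition behind those assertions might look like the following sketch, where a plain dict stands in for a span captured by an in-memory exporter (the `check_span` helper and the exact required/forbidden sets are hypothetical):

```python
# Hypothetical step-definition logic: assert current attribute keys are
# present and deprecated ones are absent on a captured span's attributes.

REQUIRED_KEYS = {
    "feature_flag.key",
    "feature_flag.result.variant",
    "feature_flag.result.reason",
    "feature_flag.provider.name",
}
FORBIDDEN_KEYS = {
    "feature_flag.variant",
    "feature_flag.evaluation.reason",
    "feature_flag.provider_name",
}

def check_span(attributes: dict) -> list[str]:
    """Return one failure message per missing current key or present deprecated key."""
    failures = [f"missing {k}" for k in sorted(REQUIRED_KEYS - attributes.keys())]
    failures += [f"deprecated {k}" for k in sorted(FORBIDDEN_KEYS & attributes.keys())]
    return failures

compliant = {
    "feature_flag.key": "my-flag",
    "feature_flag.result.variant": "on",
    "feature_flag.result.reason": "TARGETING_MATCH",
    "feature_flag.provider.name": "flagd",
}
print(check_span(compliant))  # []
```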
Each SDK's OTel hook would consume these feature files and implement step definitions using its language's OTel test SDK:
- Java: `OpenTelemetryExtension` (JUnit 5)
- Go: `tracetest.NewInMemoryExporter` / `metric.NewManualReader`
- Python: `opentelemetry-sdk` in-memory exporters
- .NET: `AddInMemoryExporter`
Why This Matters
OTel semconv for feature flags has seen 5 breaking attribute renames between v1.32 and v1.34. Without a shared test harness, SDKs silently fall behind — and users instrumenting multiple services in different languages get inconsistent telemetry. Dashboards, alerts, and correlation queries built on one SDK's attribute names silently break when querying data from another SDK.
Appendix D already contains the normative rules. Gherkin suites would make those rules machine-verifiable across every SDK implementation, closing the loop between spec and implementation — exactly as flagd-testbed does for provider conformance.
Related