Skip to content

fix(install-bench): swap to OCB binary for hostmetricsreceiver#190

Merged
trilamsr merged 1 commit into
mainfrom
pr-q-install-bench-ocb-binary
May 31, 2026
Merged

fix(install-bench): swap to OCB binary for hostmetricsreceiver#190
trilamsr merged 1 commit into
mainfrom
pr-q-install-bench-ocb-binary

Conversation

@trilamsr

@trilamsr trilamsr commented May 31, 2026

Copy link
Copy Markdown
Contributor

Root cause

PR #180 (chore(pivot): PR-E unblock — bench heartbeat to hostmetricsreceiver) enabled the hostmetrics receiver in bench/install/tracecore-values.yaml and added it to builder-config.yaml, but the install-bench Dockerfile was still building from ./cmd/tracecore. The generated cmd/tracecore/components.go only registers in-tree receivers — hostmetricsreceiver is upstream OTel-contrib and is only bundled by the OCB-assembled binary at _build/tracecore. The daemonset pod failed config load with unknown component type hostmetrics, kubectl rollout status timed out at 5 m, and bash -e aborted run.sh before any diagnostics fired — so CI showed a bare red with no actionable log.

install-bench has been red on main since 2026-05-31T02:28:40Z and on every PR opened after #180.

Affected PRs (open as of this writing): #186, #187, #188, #189.

Fix

Switch install/kubernetes/tracecore/Dockerfile to build via OCB:

make build-ocb           # generates ./_build/{main.go,go.mod,...} + compiles
cd _build && go build .  # re-link with CGO_ENABLED=0 -trimpath -ldflags "-s -w"

The re-link with our flags guarantees the static binary the distroless base can exec; OCB's intermediate compile uses its own defaults. The final image still uses gcr.io/distroless/static-debian12:nonroot at the same pinned digest.

This is a tactical bridge: PR-A2 (#189) makes _build/tracecore the canonical binary for all builds. Once that lands, the in-tree cmd/tracecore path retires entirely (RFC-0013 PR-F) and this Dockerfile change becomes the new normal across every image, not just install-bench.

Why not the alternatives?

  • Revert hostmetrics in bench values → walks back PR chore(pivot): PR-E unblock — bench heartbeat to hostmetricsreceiver #180's pivot intent ("no custom receiver where upstream satisfies"); the legacy clockreceiver is on its way out in PR-K.
  • Add hostmetricsreceiver to cmd/tracecore/components.go → diverges the in-tree component list from the OCB-managed one; the whole point of PR-A2 is to delete that divergence.

Bonus: surface root cause on rollout-status failure

bench/install/run.sh had post-deadline diagnostics for the first-data path, but the rollout-status path (the actual failure mode of this regression) just exited via set -e. Added dump_failure_diagnostics() (pod state, kubectl describe, current + previous container logs, rendered config) wired to both failure paths; refactor eliminates the duplicated tracecore-pod spelunking that lived inline. Future regressions surface root cause in the CI log without re-running.

Verification

$ make check
… 0 issues, all modules verified

$ docker build -f install/kubernetes/tracecore/Dockerfile -t tracecore:bench-test .
… exporting to image done

$ docker run --rm tracecore:bench-test components | grep -E "hostmetrics|otlp"
    - name: hostmetrics
      module: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/hostmetricsreceiver v0.110.0
    - name: otlp
      module: go.opentelemetry.io/collector/receiver/otlpreceiver v0.110.0
    - name: otlphttp
      module: go.opentelemetry.io/collector/exporter/otlphttpexporter v0.110.0

End-to-end install-bench (kind cluster + helm install) runs on this PR via the workflow itself.

Cost

Docker build stage adds ~100 s (OCB compile inside Alpine). Bench Docker rebuild only fires on chart/bench/builder-config changes — acceptable.

[CI] install-bench Dockerfile now builds via OpenTelemetry Collector Builder so the bench daemonset can load hostmetricsreceiver; also dumps pod state, logs, and rendered config on rollout-status failure. Unblocks every PR opened after #180.

PR #180 enabled `hostmetrics` in `bench/install/tracecore-values.yaml`
but the install-bench Dockerfile was still building from
`./cmd/tracecore`, whose generated `components.go` only registers
in-tree receivers. hostmetricsreceiver is upstream OTel-contrib —
only the OCB-assembled binary at `_build/tracecore` (per
`builder-config.yaml`) carries it. So the daemonset pod failed config
load with "unknown component type hostmetrics" and `kubectl rollout
status` timed out, turning install-bench red on main since
2026-05-31T02:28:40Z and on every PR opened since.

Switch the Dockerfile to `make build-ocb` + re-link from `./_build`
with distroless-friendly flags. Tactical bridge until PR-A2 (#189)
makes `_build/tracecore` the canonical binary for all builds.

Bonus: `bench/install/run.sh` previously aborted via `set -e` on
rollout-status timeout without dumping diagnostics — the actual
failure mode for this regression. Add `dump_failure_diagnostics()`
(pod state, describe, logs, previous-container logs, rendered
config) wired to both the rollout-status path and the existing
first-data-deadline path. Refactor eliminates duplicated tracecore-pod
spelunking.

Verified locally: `docker build -f install/kubernetes/tracecore/Dockerfile -t tracecore:bench-test .` succeeds and `docker run --rm tracecore:bench-test components | grep hostmetrics` shows hostmetricsreceiver v0.110.0 embedded.

Signed-off-by: Tri Lam <tri@maydow.com>
@trilamsr trilamsr enabled auto-merge (squash) May 31, 2026 04:44
@trilamsr trilamsr merged commit 07cbf5f into main May 31, 2026
12 of 15 checks passed
@trilamsr trilamsr deleted the pr-q-install-bench-ocb-binary branch May 31, 2026 04:48
trilamsr pushed a commit that referenced this pull request May 31, 2026
Two-file conflict from main landings since last sync:

- install/kubernetes/tracecore/Dockerfile: both sides build the
  OCB-generated _build/tracecore binary. Took PR-A2's canonical
  `make build` invocation (the OCB target) and kept #190's
  defense-in-depth re-link inside ./_build/ with distroless flags
  (CGO_ENABLED=0, -trimpath, -s -w) so the resulting binary is
  guaranteed-static for distroless/static-debian12.
- docs/migration/v0.1-to-v0.2.md: PR-A2 added two self-telemetry
  rows (otelcol_* rename + telemetry.listen split), #186 added a
  stdoutexporter row. All three rows belong; kept all three.

values.yaml + .github/workflows/chart.yml had no actual conflicts
after fetch — must have auto-merged in the prior sync.

Gates: make check / go test ./... / helm lint / helm template /
make build / ./_build/tracecore validate against rendered chart —
all green.

Signed-off-by: Tri Lam <tri@maydow.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant