fix(install-bench): swap to OCB binary for hostmetricsreceiver#190
Merged
Conversation
PR #180 enabled `hostmetrics` in `bench/install/tracecore-values.yaml` but the install-bench Dockerfile was still building from `./cmd/tracecore`, whose generated `components.go` only registers in-tree receivers. hostmetricsreceiver is upstream OTel-contrib — only the OCB-assembled binary at `_build/tracecore` (per `builder-config.yaml`) carries it. So the daemonset pod failed config load with "unknown component type hostmetrics" and `kubectl rollout status` timed out, turning install-bench red on main since 2026-05-31T02:28:40Z and on every PR opened since. Switch the Dockerfile to `make build-ocb` + re-link from `./_build` with distroless-friendly flags. Tactical bridge until PR-A2 (#189) makes `_build/tracecore` the canonical binary for all builds. Bonus: `bench/install/run.sh` previously aborted via `set -e` on rollout-status timeout without dumping diagnostics — the actual failure mode for this regression. Add `dump_failure_diagnostics()` (pod state, describe, logs, previous-container logs, rendered config) wired to both the rollout-status path and the existing first-data-deadline path. Refactor eliminates duplicated tracecore-pod spelunking. Verified locally: `docker build -f install/kubernetes/tracecore/Dockerfile -t tracecore:bench-test .` succeeds and `docker run --rm tracecore:bench-test components | grep hostmetrics` shows hostmetricsreceiver v0.110.0 embedded. Signed-off-by: Tri Lam <tri@maydow.com>
trilamsr
pushed a commit
that referenced
this pull request
May 31, 2026
Two-file conflict from main landings since last sync: - install/kubernetes/tracecore/Dockerfile: both sides build the OCB-generated _build/tracecore binary. Took PR-A2's canonical `make build` invocation (the OCB target) and kept #190's defense-in-depth re-link inside ./_build/ with distroless flags (CGO_ENABLED=0, -trimpath, -s -w) so the resulting binary is guaranteed-static for distroless/static-debian12. - docs/migration/v0.1-to-v0.2.md: PR-A2 added two self-telemetry rows (otelcol_* rename + telemetry.listen split), #186 added a stdoutexporter row. All three rows belong; kept all three. values.yaml + .github/workflows/chart.yml had no actual conflicts after fetch — must have auto-merged in the prior sync. Gates: make check / go test ./... / helm lint / helm template / make build / ./_build/tracecore validate against rendered chart — all green. Signed-off-by: Tri Lam <tri@maydow.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Root cause
PR #180 (
chore(pivot): PR-E unblock — bench heartbeat to hostmetricsreceiver) enabled thehostmetricsreceiver inbench/install/tracecore-values.yamland added it tobuilder-config.yaml, but the install-benchDockerfilewas still building from./cmd/tracecore. The generatedcmd/tracecore/components.goonly registers in-tree receivers —hostmetricsreceiveris upstream OTel-contrib and is only bundled by the OCB-assembled binary at_build/tracecore. The daemonset pod failed config load withunknown component type hostmetrics,kubectl rollout statustimed out at 5 m, andbash -eabortedrun.shbefore any diagnostics fired — so CI showed a bare red with no actionable log.install-benchhas been red onmainsince 2026-05-31T02:28:40Z and on every PR opened after #180.Affected PRs (open as of this writing): #186, #187, #188, #189.
Fix
Switch
install/kubernetes/tracecore/Dockerfileto build via OCB:The re-link with our flags guarantees the static binary the distroless base can exec; OCB's intermediate compile uses its own defaults. The final image still uses
gcr.io/distroless/static-debian12:nonrootat the same pinned digest.This is a tactical bridge: PR-A2 (#189) makes
_build/tracecorethe canonical binary for all builds. Once that lands, the in-treecmd/tracecorepath retires entirely (RFC-0013 PR-F) and this Dockerfile change becomes the new normal across every image, not just install-bench.Why not the alternatives?
clockreceiveris on its way out in PR-K.cmd/tracecore/components.go→ diverges the in-tree component list from the OCB-managed one; the whole point of PR-A2 is to delete that divergence.Bonus: surface root cause on rollout-status failure
bench/install/run.shhad post-deadline diagnostics for the first-data path, but the rollout-status path (the actual failure mode of this regression) just exited viaset -e. Addeddump_failure_diagnostics()(pod state,kubectl describe, current + previous container logs, rendered config) wired to both failure paths; refactor eliminates the duplicated tracecore-pod spelunking that lived inline. Future regressions surface root cause in the CI log without re-running.Verification
End-to-end install-bench (kind cluster + helm install) runs on this PR via the workflow itself.
Cost
Docker build stage adds ~100 s (OCB compile inside Alpine). Bench Docker rebuild only fires on chart/bench/builder-config changes — acceptable.