ci(kb): guard cleanup trap docker logs on container existence #417
The cleanup trap unconditionally ran `docker logs` for both ES and Kibana, emitting "No such container" on stderr whenever the script failed before the container was created (npm ci, build, image pull, ES bootstrap). That diagnostic masked the real upstream failure in Buildkite output.
Wrap each `docker logs` call in a `docker inspect` existence check, printing "(container never started)" when the container does not exist. Also fold the three `docker rm -f` calls into one.
Refs #416
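A minimal sketch of the guard described above, not the script's actual code. To keep it runnable without a Docker daemon, `docker` is stubbed as a shell function, and the container names `demo-es` and `demo-kibana` and the helper `dump_logs` are hypothetical.

```shell
#!/usr/bin/env bash
# Stub standing in for the docker CLI so the sketch runs without a daemon.
# In this stub only a container named "demo-es" exists (a made-up name).
docker() {
  case "$1" in
    inspect) [ "$2" = "demo-es" ] ;;            # exit 0 only for the known container
    logs)    printf 'line 1\nline 2\n' ;;       # pretend log output
  esac
}

# Print the last 50 log lines of a container, or a placeholder if it does not exist.
dump_logs() {
  local name="$1"
  echo "--- ${name} logs ---"
  if docker inspect "$name" >/dev/null 2>&1; then
    docker logs "$name" 2>&1 | tail -50 || true
  else
    echo "(container never started)"
  fi
}

dump_logs demo-es
dump_logs demo-kibana
```

With the stub removed, the same function could be called from a `trap ... EXIT` handler; the existence check keeps "No such container" out of stderr when the script fails early.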
✅MegaLinter analysis: Success
Resolves shellcheck SC2012 on line 71. The preceding `compgen -G` already confirms at least one match, so an array assignment is sufficient.
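As an illustration of that pattern, here is a small sketch that picks the first glob match through an array instead of parsing `ls` output. The temporary directory and the `pkg-*.tgz` file names are invented for the demonstration and do not come from the script.

```shell
#!/usr/bin/env bash
# Pick the first file matching a glob without `ls ... | head -1` (shellcheck SC2012).
# The directory and file names are placeholders created just for this demo.
workdir="$(mktemp -d)"
touch "$workdir/pkg-1.0.tgz" "$workdir/pkg-1.1.tgz"

first=""
# compgen -G succeeds only if the pattern matches at least one path.
if compgen -G "$workdir/pkg-*.tgz" >/dev/null; then
  # The glob expands in sorted order, so index 0 is the first match.
  matches=( "$workdir"/pkg-*.tgz )
  first="${matches[0]}"
  echo "first match: ${first##*/}"
fi

rm -rf "$workdir"
```

Because the `compgen -G` check runs first, the array is never left holding the unexpanded pattern.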
The KB functional tests intermittently failed in setup-kibana.cjs with 6 minutes of `failed to retrieve password hash for reserved user [elastic]` errors. The underlying cause is in the ES container logs:

    UnavailableShardsException: at least one primary shard for the index [.security-7] is unavailable

When the Buildkite host disk is above the default 85% low watermark, ES refuses to allocate the security index's primary shard, so the reserved-user store never becomes readable and every `elastic` auth attempt fails. Intermittent because it depends on the agent's disk state at run time.
For a single-node trial test cluster on ephemeral CI storage, the disk allocator buys nothing, so turn it off.
Refs #416
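One way such a setting might be passed to a throwaway container is sketched below. This is an assumption about the shape of the command, not the line from the script: the container name `demo-es`, the `ES_IMAGE` variable, and the helper `print_es_command` are hypothetical, and the command is only printed so the sketch runs without Docker.

```shell
#!/usr/bin/env bash
# Settings for a disposable single-node Elasticsearch container, passed as
# environment variables. Both setting names are real Elasticsearch settings;
# everything else here is a placeholder.
es_env=(
  -e discovery.type=single-node
  -e cluster.routing.allocation.disk.threshold_enabled=false
)

# Print the command instead of executing it, so no Docker daemon is needed.
print_es_command() {
  echo docker run --rm --name demo-es "${es_env[@]}" \
    "${ES_IMAGE:-elasticsearch-image-placeholder}"
}

print_es_command
```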
Previously `retry` did `catch { /* not ready yet */ }` and threw a
generic "did not become ready in time" error with no detail. Diagnosing
the recent ES allocation flake required pulling container logs from the
cleanup trap because the script itself revealed nothing.
Track the last failure reason — HTTP status + body, cluster status, or
exception message — log it on each progress tick, and include it in the
final timeout error. Call sites now throw descriptive errors instead of
returning booleans.
Refs #416
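The idea of carrying the last failure reason through to the timeout error is not specific to Node. The following is a shell analogue written for illustration only; it is not the code in setup-kibana.cjs, and the `retry` signature and the `probe_fail` probe are hypothetical.

```shell
#!/usr/bin/env bash
# retry ATTEMPTS DELAY CMD...: run CMD until it succeeds, remembering why the
# most recent attempt failed and reporting that reason on every tick and at timeout.
retry() {
  local attempts="$1" delay="$2"
  shift 2
  local last="" i
  for ((i = 1; i <= attempts; i++)); do
    # Capture the probe's output; the assignment keeps the probe's exit status.
    if last="$("$@" 2>&1)"; then
      return 0
    fi
    echo "attempt ${i}/${attempts} not ready: ${last}" >&2
    if ((i < attempts)); then
      sleep "$delay"
    fi
  done
  echo "did not become ready in time; last failure: ${last}" >&2
  return 1
}

# Made-up probe that always fails with a descriptive message.
probe_fail() {
  echo "HTTP 503: security index unavailable"
  return 1
}

retry 2 0 probe_fail || echo "gave up"
```

The point of the sketch is the final message: the timeout error repeats the last observed reason, so the cause is visible without pulling container logs.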
margaretjgu left a comment
LGTM. The description does a good job tracing the root cause chain (trap masking real errors, disk watermark blocking shard allocation, retry swallowing context) and each fix is targeted and correct.
One minor note: ES starts with `--rm`, so if it crashes and Docker auto-removes it before `cleanup()` fires, `docker inspect` fails and the trap prints "(container never started)" even though the container did start. The logs are gone anyway so there is nothing actionable, but a short comment in the cleanup function would save a future reader some confusion. Not a blocker.
Worth noting this PR is good defense-in-depth and complements #280. Once the ES image is pre-baked on the agent, the disk pressure and cold-start timing issues that produce these failures in the first place are eliminated at the infrastructure level. Both are worth shipping.
Summary
Closes #416. Three commits, each targeting one symptom of the same flaky job.
1. Cleanup trap (`.buildkite/run-kb-tests.sh`)
The trap ran `docker logs` for both ES and Kibana unconditionally, so any failure before line 154 (npm ci, build, image pull, ES bootstrap, kibana_system setup) made it emit `Error response from daemon: No such container: elastic-cli-kb` to stderr. `|| true` only suppressed the exit code, not the message. Now each `docker logs` is guarded by `docker inspect`; missing containers print `(container never started)`.

2. ES disk watermark (`.buildkite/run-kb-tests.sh`)
Once the trap stopped masking failures, the real cause appeared in the ES container log:

    UnavailableShardsException: at least one primary shard for the index [.security-7] is unavailable

When the Buildkite host disk is above ES's default 85% low watermark, the security index's primary shard refuses to allocate, the reserved-user store is unreadable, and every `elastic` auth attempt fails forever. Intermittent because it depends on the agent's disk state. Fix: set `cluster.routing.allocation.disk.threshold_enabled=false` on the single-node test cluster. The disk allocator has no value on ephemeral CI storage.

3. Setup-kibana diagnostics (`.buildkite/setup-kibana.cjs`)
`retry` swallowed every error with `catch { /* not ready yet */ }` and threw a generic "did not become ready in time" with no detail. Pinning down (2) required pulling ES container logs from the trap output because the Node script itself revealed nothing. Now the retry records the last failure reason (HTTP status + body, cluster status, or exception message), logs it on each progress tick, and includes it in the final timeout error. Future flakes from a different cause will be diagnosable in one CI run.

4. SC2012 cleanup (`.buildkite/run-kb-tests.sh`)
While touching the file, fix the pre-existing shellcheck SC2012 on line 71 by replacing `ls ... | head -1` with a bash array; `compgen -G` on the line above already guarantees a match.

Implementation note
The issue's option 1 suggested chaining `docker inspect ... && docker logs ... || echo`. I used `if/then/else/fi` instead: with `set -o pipefail` in effect, the chained form would print "(container never started)" if `docker logs | tail -50` itself returned non-zero on a real container.

Test plan
- `shellcheck .buildkite/run-kb-tests.sh`: clean (the pre-existing SC2012 is also resolved)
- `bash -n .buildkite/run-kb-tests.sh`: syntax OK
- `node --check .buildkite/setup-kibana.cjs`: syntax OK
- `retry` failures show the last seen HTTP status / cluster status

Acceptance (per #416)
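Returning to the implementation note, the difference between the chained form and `if/then/else/fi` under `set -o pipefail` can be reproduced with a stub. This is a sketch under assumptions: `docker` is replaced by a shell function whose `logs` subcommand prints a line and then exits non-zero, and `demo-es`, `chained`, and `guarded` are hypothetical names.

```shell
#!/usr/bin/env bash
set -o pipefail

# Stub: the container exists, but `logs` exits non-zero after printing output.
docker() {
  case "$1" in
    inspect) return 0 ;;
    logs)    echo "real log line"; return 1 ;;
  esac
}

# Chained form: the `||` branch also fires when the logs pipeline fails.
chained() {
  docker inspect demo-es >/dev/null 2>&1 \
    && docker logs demo-es 2>&1 | tail -50 \
    || echo "(container never started)"
}

# if/then/else form: the placeholder depends only on the inspect result.
guarded() {
  if docker inspect demo-es >/dev/null 2>&1; then
    docker logs demo-es 2>&1 | tail -50 || true
  else
    echo "(container never started)"
  fi
}

echo "chained:"; chained
echo "guarded:"; guarded
```

In this sketch the chained form prints the log line followed by the misleading placeholder, while the guarded form prints only the log line.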