Symptom
Buildkite job buildkite/elastic-cli/pr (Kibana functional tests) intermittently fails with this in the cleanup output:
Error response from daemon: No such container: elastic-cli-kb
Examples (open PRs at time of filing):
The PRs themselves don't touch .buildkite/, build config, or Kibana code, so the failure is not caused by their content.
Root cause (proximate)
The error comes from the cleanup trap in .buildkite/run-kb-tests.sh:36:
docker logs "$KB_CONTAINER_NAME" 2>&1 | tail -50 || true
docker logs writes the "No such container" diagnostic to stderr and exits non-zero. The 2>&1 redirect merges that diagnostic into the pipe, so tail prints it, and || true only masks the exit status; nothing on the line hides the message. The container is created at line 154, so any failure earlier in the script (npm ci, npm run build, setup-kibana.cjs, image pulls, ES bootstrap) hits the trap before Kibana ever exists.
The real failure happens earlier in the log — this message is the alarm, not the cause.
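To make the mechanism concrete, here is a minimal sketch that reproduces the behaviour without Docker. fake_docker_logs is a hypothetical stand-in for docker logs on a missing container; everything else has the same shape as the trap line.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for `docker logs` on a missing container:
# it prints a diagnostic to stderr and exits non-zero.
fake_docker_logs() {
  echo "Error response from daemon: No such container: $1" >&2
  return 1
}

# Same shape as the trap line. 2>&1 merges stderr into the pipe, so tail
# prints the diagnostic; `|| true` only affects the exit status.
output=$(fake_docker_logs elastic-cli-kb 2>&1 | tail -50 || true)
status=$?

echo "exit status: $status"
echo "captured:    $output"
```

The exit status is 0 while the captured text still contains the "No such container" line, which is why the message shows up in otherwise unrelated failures.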
Suggested fixes
1. Make the cleanup trap robust — guard docker logs on container existence so the output points at the actual failing step instead of adding noise:
cleanup() {
  echo "--- ES logs (last 50 lines)"
  docker inspect "$ES_CONTAINER_NAME" >/dev/null 2>&1 \
    && docker logs "$ES_CONTAINER_NAME" 2>&1 | tail -50 \
    || echo "(container never started)"
  echo "--- Kibana logs (last 50 lines)"
  docker inspect "$KB_CONTAINER_NAME" >/dev/null 2>&1 \
    && docker logs "$KB_CONTAINER_NAME" 2>&1 | tail -50 \
    || echo "(container never started)"
  echo "--- Cleaning up"
  docker rm -f "$TEST_RUNNER_NAME" "$KB_CONTAINER_NAME" "$ES_CONTAINER_NAME" 2>/dev/null || true
  docker network rm "$NETWORK_NAME" 2>/dev/null || true
}
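One possible refinement of the cleanup function above, offered as a sketch rather than a required change: the two log sections can share a helper, and an explicit if/else avoids the usual `A && B || C` caveat, where C also runs if B fails (for example under set -o pipefail). The helper name dump_logs is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the last 50 log lines of a container,
# or a short note if the container does not exist.
# Usage: dump_logs <label> <container-name>
dump_logs() {
  local label=$1 name=$2
  echo "--- $label logs (last 50 lines)"
  if docker inspect "$name" >/dev/null 2>&1; then
    docker logs "$name" 2>&1 | tail -50
  else
    echo "(container never started)"
  fi
}

# Inside cleanup() this would be called as:
#   dump_logs ES "$ES_CONTAINER_NAME"
#   dump_logs Kibana "$KB_CONTAINER_NAME"
```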
2. Diagnose the underlying flake. Likely candidates without log access:
- setup-kibana.cjs (line 145) races ES bootstrap. The script waits on the Node image pull, but not on ES readiness; ES's security index can take minutes to become ready after the container reports healthy. A retry loop on 401 / connection-refused inside setup-kibana.cjs would be more robust.
- Backgrounded image pulls (wait "$NODE_PULL_PID" / wait "$KB_PULL_PID") timing out on slow agents.
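For the first candidate, the retry could also live in the shell script as a gate before setup-kibana.cjs runs, instead of inside the .cjs file. The following is a sketch under assumptions, not the script's actual code: retry_until and es_ready are hypothetical names, ES_URL and ES_PASSWORD are assumed variables, and the attempt count and delay are placeholders.

```shell
#!/usr/bin/env bash
# Retry a command until it succeeds or the attempt budget runs out.
# Usage: retry_until <max_attempts> <delay_seconds> <command...>
retry_until() {
  local max_attempts=$1 delay=$2
  shift 2
  local attempt=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Hypothetical readiness probe. ES_URL and ES_PASSWORD are assumed names.
# curl -f turns HTTP errors such as 401 into a non-zero exit, and a refused
# connection also fails, so both cases are retried.
es_ready() {
  curl -fsS -u "elastic:${ES_PASSWORD:-changeme}" \
    "${ES_URL:-http://localhost:9200}/_security/_authenticate" >/dev/null 2>&1
}

# Example gate (commented out so the sketch has no side effects):
#   retry_until 60 5 es_ready || exit 1
#   ...then run setup-kibana.cjs as the script does today
```

If the budget runs out, the gate fails with its own message, so the log names ES readiness as the failing step.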
Step 1 is cheap and would make step 2 much easier to diagnose from the public Buildkite output.
Acceptance
- Cleanup trap no longer emits "No such container" lines.
- KB tests either pass reliably or fail with a clear, actionable error pointing at the real root cause.