ci(kb): cleanup trap masks real failure in Buildkite KB functional tests #416

Description

@MattDevy

Symptom

Buildkite job buildkite/elastic-cli/pr (Kibana functional tests) intermittently fails with this in the cleanup output:

Error response from daemon: No such container: elastic-cli-kb

Examples (open PRs at time of filing):

  • #415 — Build #639
  • #414 — Build #635

The PRs themselves don't touch .buildkite/, build config, or Kibana code, so the failure is not caused by their content.

Root cause (proximate)

The error comes from the cleanup trap in .buildkite/run-kb-tests.sh:36:

docker logs "$KB_CONTAINER_NAME" 2>&1 | tail -50 || true

docker logs writes the "No such container" diagnostic to stderr and exits non-zero. The 2>&1 merges that diagnostic into the pipe, so tail prints it, and || true only forces a zero exit status; it does not suppress the message. The container is created at line 154, so any failure earlier in the script (npm ci, npm run build, setup-kibana.cjs, image pulls, ES bootstrap) runs the trap before the Kibana container exists.

The real failure appears earlier in the log; this message is a side effect of cleanup, not the cause.
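The behaviour of `|| true` can be reproduced without a Docker daemon. This is a minimal sketch: `fake_docker_logs` is a hypothetical stand-in that mimics the failure mode described above (diagnostic on stderr, non-zero exit) and is not part of the real script.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for `docker logs` on a missing container:
# prints a diagnostic to stderr and exits non-zero.
fake_docker_logs() {
  echo "Error response from daemon: No such container: $1" >&2
  return 1
}

# Same shape as the trap line: 2>&1 merges the message into the pipe,
# tail prints it, and `|| true` only forces a zero exit status.
noisy_output=$(fake_docker_logs elastic-cli-kb 2>&1 | tail -50 || true)
noisy_status=$?

# Discarding stderr instead hides the message; the status is still zero.
quiet_output=$(fake_docker_logs elastic-cli-kb 2>/dev/null | tail -50 || true)
quiet_status=$?

echo "noisy: status=$noisy_status output='$noisy_output'"
echo "quiet: status=$quiet_status output='$quiet_output'"
```

Both pipelines exit 0, but only the first one carries the daemon error into the job output.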

Suggested fixes

1. Make the cleanup trap robust — guard docker logs on container existence so the output points at the actual failing step instead of adding noise:

cleanup() {
  # Print logs only for containers that exist, so a missing container
  # yields a short note instead of a daemon error.
  echo "--- ES logs (last 50 lines)"
  if docker inspect "$ES_CONTAINER_NAME" >/dev/null 2>&1; then
    docker logs "$ES_CONTAINER_NAME" 2>&1 | tail -50 || true
  else
    echo "(container never started)"
  fi
  echo "--- Kibana logs (last 50 lines)"
  if docker inspect "$KB_CONTAINER_NAME" >/dev/null 2>&1; then
    docker logs "$KB_CONTAINER_NAME" 2>&1 | tail -50 || true
  else
    echo "(container never started)"
  fi
  echo "--- Cleaning up"
  # Removal is best-effort: ignore containers or networks that do not exist.
  docker rm -f "$TEST_RUNNER_NAME" "$KB_CONTAINER_NAME" "$ES_CONTAINER_NAME" 2>/dev/null || true
  docker network rm "$NETWORK_NAME" 2>/dev/null || true
}
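The inspect-then-logs guard can be exercised locally by stubbing `docker` with a shell function. This is a sketch under assumptions: the stub, the container names `es` and `kb`, and the helper `tail_logs_if_exists` are all hypothetical and exist only for this demo.

```bash
#!/usr/bin/env bash
# Stub `docker` so the guard can be exercised without a daemon.
# In this stub only the container named "es" exists.
docker() {
  case "$1" in
    inspect) [ "$2" = "es" ] ;;   # succeeds only for "es"
    logs)
      if [ "$2" = "es" ]; then
        echo "es log line"
      else
        echo "Error response from daemon: No such container: $2" >&2
        return 1
      fi ;;
  esac
}

# Hypothetical helper wrapping the same inspect-then-logs guard.
tail_logs_if_exists() {
  if docker inspect "$1" >/dev/null 2>&1; then
    docker logs "$1" 2>&1 | tail -50
  else
    echo "(container never started)"
  fi
}

es_out=$(tail_logs_if_exists es)
kb_out=$(tail_logs_if_exists kb)
echo "es: $es_out"
echo "kb: $kb_out"
```

With the guard in place, the missing container produces the short note and the daemon error never reaches the output.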

2. Diagnose the underlying flake. Likely candidates, identified without log access:

  • setup-kibana.cjs (line 145) may race ES bootstrap. The script waits on the Node image pull but not on ES readiness, and ES's security index can take minutes to become available after the container reports healthy. A retry loop on 401 / connection-refused inside setup-kibana.cjs would be more robust.
  • Backgrounded image pulls (wait "$NODE_PULL_PID" / wait "$KB_PULL_PID") timing out on slow agents.
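The retry idea in the first bullet can be sketched at the shell level. The issue proposes putting the loop inside setup-kibana.cjs; this bash version only illustrates the same pattern, and `retry` and `flaky_probe` are hypothetical names.

```bash
#!/usr/bin/env bash
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
retry() {
  local attempts=$1 delay=$2 n=1
  shift 2
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "retry: '$*' still failing after $n attempts" >&2
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# Demo with a hypothetical probe that fails twice, then succeeds.
probe_calls=0
flaky_probe() {
  probe_calls=$((probe_calls + 1))
  [ "$probe_calls" -ge 3 ]
}

retry 5 0 flaky_probe && echo "ready after $probe_calls attempts"
```

In the CI script the probe could be an authenticated request to Elasticsearch, for example `retry 30 5 curl -fsS -u "$ES_USER:$ES_PASS" "$ES_URL/_security/_authenticate"`, where all three variables are assumptions. `curl -f` exits non-zero on a 401 as well as on connection refused, which covers both cases named above.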

Step 1 is cheap and would make step 2 much easier to diagnose from the public Buildkite output.

Acceptance

  • Cleanup trap no longer emits "No such container" lines.
  • KB tests either pass reliably or fail with a clear, actionable error pointing at the real root cause.
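One possible way to approach the second criterion is an ERR trap that names the failing command before the cleanup output appears. This is a sketch, not the script's current behaviour: whether run-kb-tests.sh already uses `set -e` or an EXIT trap is not stated here, and the child script below is purely illustrative, with `false` standing in for the step that actually broke.

```bash
#!/usr/bin/env bash
# Run an illustrative failing script in a child bash so this demo keeps going.
child_status=0
report=$(bash 2>&1 <<'EOF'
set -Eeuo pipefail
# ERR fires on the failing command; -E makes functions inherit the trap.
trap 'echo "FAILED (exit $?) at line $LINENO: $BASH_COMMAND"' ERR
# EXIT still runs afterwards, so cleanup output follows the failure report.
trap 'echo "--- Cleaning up"' EXIT
echo "step 1 ok"
false
echo "never reached"
EOF
) || child_status=$?

echo "$report"
echo "child exit status: $child_status"
```

The failure line is printed before the cleanup section, and the child's exit status is still that of the failing command, so the job fails on the step that broke.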

Metadata

Labels: bug
