ci(kb): cleanup trap masks real failure in Buildkite KB functional tests #416

## Symptom

Buildkite job `buildkite/elastic-cli/pr` (Kibana functional tests) intermittently fails with this in the cleanup output:

```
Error response from daemon: No such container: elastic-cli-kb
```

Examples (open PRs at time of filing):
- [#415](https://github.com/elastic/cli/pull/415) — Build #639
- [#414](https://github.com/elastic/cli/pull/414) — Build #635

The PRs themselves don't touch `.buildkite/`, build config, or Kibana code, so the failure is not caused by their content.

## Root cause (proximate)

The error comes from the `cleanup` trap in [`.buildkite/run-kb-tests.sh:36`](https://github.com/elastic/cli/blob/main/.buildkite/run-kb-tests.sh#L36):

```bash
docker logs "$KB_CONTAINER_NAME" 2>&1 | tail -50 || true
```

`docker logs` prints the "No such container" diagnostic to stderr and exits non-zero. The `2>&1` merges that message into the pipe, so `tail` passes it straight through to the build log; `|| true` only forces the exit status to zero and does nothing to the text. The container is created at line 154, so any failure earlier in the script (`npm ci`, `npm run build`, `setup-kibana.cjs`, image pulls, ES bootstrap) triggers the trap before the Kibana container exists.

The real failure appears earlier in the log; this message is a side effect of cleanup, not the cause.
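The behaviour can be reproduced without Docker. The sketch below uses a hypothetical stand-in, `fake_docker_logs`, that mimics what `docker logs` does for a missing container (message on stderr, non-zero exit) and runs it through the same `2>&1 | tail -50 || true` shape as the trap:

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for `docker logs` on a missing container, so the
# demo runs without Docker: message on stderr, non-zero exit status.
fake_docker_logs() {
  echo "Error response from daemon: No such container: $1" >&2
  return 1
}

# Same shape as the line in the trap: stderr is merged into the pipe,
# tail prints it unchanged, and `|| true` only affects the exit status.
unguarded_output=$(fake_docker_logs elastic-cli-kb 2>&1 | tail -50 || true)
unguarded_status=$?

echo "status=$unguarded_status"
echo "output=$unguarded_output"
```

The status comes back as zero while the error text is still captured, which matches what shows up in the Buildkite cleanup output.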

## Suggested fixes

**1. Make the cleanup trap robust** — guard `docker logs` on container existence so the output points at the actual failing step instead of adding noise:

```bash
cleanup() {
  # if/else instead of `a && b || c`: with `set -o pipefail`, a failing
  # `docker logs` would otherwise be mislabeled as "never started".
  echo "--- ES logs (last 50 lines)"
  if docker inspect "$ES_CONTAINER_NAME" >/dev/null 2>&1; then
    docker logs "$ES_CONTAINER_NAME" 2>&1 | tail -50 || true
  else
    echo "(container never started)"
  fi
  echo "--- Kibana logs (last 50 lines)"
  if docker inspect "$KB_CONTAINER_NAME" >/dev/null 2>&1; then
    docker logs "$KB_CONTAINER_NAME" 2>&1 | tail -50 || true
  else
    echo "(container never started)"
  fi
  echo "--- Cleaning up"
  docker rm -f "$TEST_RUNNER_NAME" "$KB_CONTAINER_NAME" "$ES_CONTAINER_NAME" 2>/dev/null || true
  docker network rm "$NETWORK_NAME" 2>/dev/null || true
}
```

**2. Diagnose the underlying flake.** Likely candidates, listed without access to the full logs:

- `setup-kibana.cjs` (line 145) races ES bootstrap. The script `wait`s on the Node image pull but not on ES readiness, and ES's security index can take minutes to become available after the container reports healthy. A retry loop on 401 / connection-refused inside `setup-kibana.cjs` would be more robust.
- Backgrounded image pulls (`wait "$NODE_PULL_PID"` / `wait "$KB_PULL_PID"`) timing out on slow agents.

Step 1 is cheap and would make step 2 much easier to diagnose from the public Buildkite output.
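As a shell-side alternative to retrying inside `setup-kibana.cjs`, the readiness wait from the first candidate could be sketched as a generic retry helper. This is a minimal sketch, not the script's current behaviour: `retry_until`, `es_ready`, `ES_URL`, `ES_USER` and `ES_PASSWORD` are hypothetical names, and probing `_security/_authenticate` is an assumption about what "ready" should mean here.

```bash
#!/usr/bin/env bash
# retry_until MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is exhausted.
retry_until() {
  local max_attempts="$1"
  local delay="$2"
  local attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "gave up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Hypothetical readiness probe (ES_URL, ES_USER, ES_PASSWORD are assumed
# names). `curl --fail` exits non-zero on HTTP errors such as 401 and on
# connection refused, so both cases lead to another attempt.
es_ready() {
  curl --silent --fail --user "$ES_USER:$ES_PASSWORD" \
    "$ES_URL/_security/_authenticate" >/dev/null
}

# Possible call site, placed before setup-kibana.cjs runs:
#   retry_until 30 5 es_ready
```

Keeping the helper generic means the same loop could also wrap other startup steps that fail transiently, and the "gave up" line gives the build log an explicit failing step to point at.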

## Acceptance

- Cleanup trap no longer emits "No such container" lines.
- KB tests either pass reliably or fail with a clear, actionable error pointing at the real root cause.
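The first criterion can be checked locally without a Docker daemon. The sketch below is one possible way to do that, assuming a guarded log dump in the spirit of fix 1; the `docker` shell function is a stub that shadows the real CLI and treats every container as missing, and `dump_logs` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Stub that shadows the real CLI: every container is treated as missing.
docker() {
  case "$1" in
    inspect) return 1 ;;
    logs)
      echo "Error response from daemon: No such container: $2" >&2
      return 1
      ;;
    *) return 0 ;;
  esac
}

# Guarded log dump (hypothetical helper): only call `docker logs` when
# the container exists.
dump_logs() {
  if docker inspect "$1" >/dev/null 2>&1; then
    docker logs "$1" 2>&1 | tail -50 || true
  else
    echo "(container never started)"
  fi
}

cleanup_output=$(dump_logs elastic-cli-kb 2>&1)

if printf '%s\n' "$cleanup_output" | grep -q "No such container"; then
  echo "FAIL: cleanup still leaks the docker error"
else
  echo "PASS: $cleanup_output"
fi
```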

