ci(kb): guard cleanup trap docker logs on container existence by MattDevy · Pull Request #417 · elastic/cli

MattDevy · 2026-06-16T12:37:49Z

Summary

Closes #416. Three commits, each targets one symptom of the same flaky job.

1. Cleanup trap (.buildkite/run-kb-tests.sh)
The trap ran docker logs for both ES and Kibana unconditionally, so any failure before line 154 (npm ci, build, image pull, ES bootstrap, kibana_system setup) made it emit Error response from daemon: No such container: elastic-cli-kb to stderr. || true only suppressed the exit code, not the message. Now each docker logs is guarded by docker inspect; missing containers print (container never started).

2. ES disk watermark (.buildkite/run-kb-tests.sh)
Once the trap stopped masking failures, the real cause appeared in the ES container log:

UnavailableShardsException: at least one primary shard for the index [.security-7] is unavailable
failed to retrieve password hash for reserved user [elastic]

When the Buildkite host disk is above ES's default 85% low watermark, the security index's primary shard refuses to allocate, the reserved-user store is unreadable, and every elastic auth attempt fails forever. Intermittent because it depends on the agent's disk state. Fix: set cluster.routing.allocation.disk.threshold_enabled=false on the single-node test cluster. The disk allocator has no value on ephemeral CI storage.

3. Setup-kibana diagnostics (.buildkite/setup-kibana.cjs)
retry swallowed every error with catch { /* not ready yet */ } and threw a generic "did not become ready in time" with no detail. Pinning down (2) required pulling ES container logs from the trap output because the Node script itself revealed nothing. Now the retry records the last failure reason — HTTP status + body, cluster status, or exception message — logs it on each progress tick, and includes it in the final timeout error. Future flakes from a different cause will be diagnosable in one CI run.

4. SC2012 cleanup (.buildkite/run-kb-tests.sh)
While touching the file, fix the pre-existing shellcheck SC2012 on line 71 by replacing ls ... | head -1 with a bash array — compgen -G on the line above already guarantees a match.

Implementation note

The issue's option 1 suggested chaining docker inspect ... && docker logs ... || echo. I used if/then/else/fi instead: with set -o pipefail in effect, the chained form would print "(container never started)" if docker logs | tail -50 itself returned non-zero on a real container.

Test plan

shellcheck .buildkite/run-kb-tests.sh — clean (the pre-existing SC2012 is also resolved)
bash -n .buildkite/run-kb-tests.sh — syntax OK
node --check .buildkite/setup-kibana.cjs — syntax OK
Buildkite KB job runs to completion without spurious "No such container" output
On a clean agent, ES reaches yellow and Kibana starts (verifies the watermark fix)
If a different upstream step fails, cleanup output points at the real cause and retry failures show the last seen HTTP status / cluster status

Acceptance (per #416)

Cleanup trap no longer emits "No such container" lines.
KB tests either pass reliably or fail with a clear, actionable error pointing at the real root cause.

The cleanup trap unconditionally ran `docker logs` for both ES and Kibana, emitting "No such container" on stderr whenever the script failed before the container was created (npm ci, build, image pull, ES bootstrap). That diagnostic masked the real upstream failure in Buildkite output. Wrap each `docker logs` call in a `docker inspect` existence check, printing "(container never started)" when the container does not exist. Also fold the three `docker rm -f` calls into one. Refs #416

github-actions · 2026-06-16T12:43:13Z

✅MegaLinter analysis: Success

Descriptor	Linter	Files	Errors	Warnings	Elapsed time
✅ BASH	shellcheck	1	0	0	0.33s
✅ COPYPASTE	jscpd	yes	no	no	8.22s
✅ REPOSITORY	gitleaks	yes	no	no	57.85s
✅ REPOSITORY	git_diff	yes	no	no	0.84s
✅ REPOSITORY	secretlint	yes	no	no	33.59s
✅ REPOSITORY	trivy	yes	no	no	18.16s

Notices

📣 MegaLinter 9.5.0 is out! Discover the new features and security recommendations in the release announcement. (Skip this info by defining SECURITY_SUGGESTIONS: false)

See detailed reports in MegaLinter artifacts
Set VALIDATE_ALL_CODEBASE: true in mega-linter.yml to validate all sources, not only the diff

MegaLinter is graciously provided by OX Security
Show us your support by starring ⭐ the repository

Resolves shellcheck SC2012 on line 71. The preceding `compgen -G` already confirms at least one match, so an array assignment is sufficient.

The KB functional tests intermittently failed in setup-kibana.cjs with 6 minutes of `failed to retrieve password hash for reserved user [elastic]` errors. The underlying cause is in the ES container logs: UnavailableShardsException: at least one primary shard for the index [.security-7] is unavailable When the Buildkite host disk is above the default 85% low watermark, ES refuses to allocate the security index's primary shard, so the reserved-user store never becomes readable and every `elastic` auth attempt fails. Intermittent because it depends on the agent's disk state at run time. For a single-node trial test cluster on ephemeral CI storage, the disk allocator buys nothing, so turn it off. Refs #416

Previously `retry` did `catch { /* not ready yet */ }` and threw a generic "did not become ready in time" error with no detail. Diagnosing the recent ES allocation flake required pulling container logs from the cleanup trap because the script itself revealed nothing. Track the last failure reason — HTTP status + body, cluster status, or exception message — log it on each progress tick, and include it in the final timeout error. Call sites now throw descriptive errors instead of returning booleans. Refs #416

margaretjgu

LGTM. The description does a good job tracing the root cause chain (trap masking real errors, disk watermark blocking shard allocation, retry swallowing context) and each fix is targeted and correct.

One minor note: ES starts with --rm, so if it crashes and Docker auto-removes it before cleanup() fires, docker inspect returns false and prints "(container never started)" even though it did start. The logs are gone anyway so there is nothing actionable, but a short comment in the cleanup function would save a future reader some confusion. Not a blocker.

Worth noting this PR is good defense-in-depth and complements #280. Once the ES image is pre-baked on the agent, the disk pressure and cold-start timing issues that produce these failures in the first place are eliminated at the infrastructure level. Both are worth shipping.

MattDevy added 2 commits June 16, 2026 13:37

chore: update NOTICE.txt

6bfbf02

MattDevy requested a review from margaretjgu June 16, 2026 12:46

MattDevy marked this pull request as ready for review June 16, 2026 12:46

MattDevy added 3 commits June 16, 2026 13:54

ci(kb): use bash glob instead of ls for ES tarball lookup

9ae56ef

Resolves shellcheck SC2012 on line 71. The preceding `compgen -G` already confirms at least one match, so an array assignment is sufficient.

MattDevy enabled auto-merge (squash) June 16, 2026 15:22

margaretjgu approved these changes Jun 16, 2026

View reviewed changes

MattDevy merged commit 60522aa into main Jun 16, 2026
25 checks passed

MattDevy deleted the fix/kb-tests-cleanup-trap branch June 16, 2026 15:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

ci(kb): guard cleanup trap docker logs on container existence#417

ci(kb): guard cleanup trap docker logs on container existence#417
MattDevy merged 5 commits into
mainfrom
fix/kb-tests-cleanup-trap

MattDevy commented Jun 16, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 16, 2026 •

edited

Loading

Uh oh!

margaretjgu left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

MattDevy commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Implementation note

Test plan

Acceptance (per #416)

Uh oh!

github-actions Bot commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅MegaLinter analysis: Success

Notices

Uh oh!

margaretjgu left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

MattDevy commented Jun 16, 2026 •

edited

Loading

github-actions Bot commented Jun 16, 2026 •

edited

Loading