Skip to content

ci(kb): guard cleanup trap docker logs on container existence#417

Merged
MattDevy merged 5 commits into
mainfrom
fix/kb-tests-cleanup-trap
Jun 16, 2026
Merged

ci(kb): guard cleanup trap docker logs on container existence#417
MattDevy merged 5 commits into
mainfrom
fix/kb-tests-cleanup-trap

Conversation

@MattDevy

@MattDevy MattDevy commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Summary

Closes #416. Three commits, each targets one symptom of the same flaky job.

1. Cleanup trap (.buildkite/run-kb-tests.sh)
The trap ran docker logs for both ES and Kibana unconditionally, so any failure before line 154 (npm ci, build, image pull, ES bootstrap, kibana_system setup) made it emit Error response from daemon: No such container: elastic-cli-kb to stderr. || true only suppressed the exit code, not the message. Now each docker logs is guarded by docker inspect; missing containers print (container never started).

2. ES disk watermark (.buildkite/run-kb-tests.sh)
Once the trap stopped masking failures, the real cause appeared in the ES container log:

UnavailableShardsException: at least one primary shard for the index [.security-7] is unavailable
failed to retrieve password hash for reserved user [elastic]

When the Buildkite host disk is above ES's default 85% low watermark, the security index's primary shard refuses to allocate, the reserved-user store is unreadable, and every elastic auth attempt fails forever. Intermittent because it depends on the agent's disk state. Fix: set cluster.routing.allocation.disk.threshold_enabled=false on the single-node test cluster. The disk allocator has no value on ephemeral CI storage.

3. Setup-kibana diagnostics (.buildkite/setup-kibana.cjs)
retry swallowed every error with catch { /* not ready yet */ } and threw a generic "did not become ready in time" with no detail. Pinning down (2) required pulling ES container logs from the trap output because the Node script itself revealed nothing. Now the retry records the last failure reason — HTTP status + body, cluster status, or exception message — logs it on each progress tick, and includes it in the final timeout error. Future flakes from a different cause will be diagnosable in one CI run.

4. SC2012 cleanup (.buildkite/run-kb-tests.sh)
While touching the file, fix the pre-existing shellcheck SC2012 on line 71 by replacing ls ... | head -1 with a bash array — compgen -G on the line above already guarantees a match.

Implementation note

The issue's option 1 suggested chaining docker inspect ... && docker logs ... || echo. I used if/then/else/fi instead: with set -o pipefail in effect, the chained form would print "(container never started)" if docker logs | tail -50 itself returned non-zero on a real container.

Test plan

  • shellcheck .buildkite/run-kb-tests.sh — clean (the pre-existing SC2012 is also resolved)
  • bash -n .buildkite/run-kb-tests.sh — syntax OK
  • node --check .buildkite/setup-kibana.cjs — syntax OK
  • Buildkite KB job runs to completion without spurious "No such container" output
  • On a clean agent, ES reaches yellow and Kibana starts (verifies the watermark fix)
  • If a different upstream step fails, cleanup output points at the real cause and retry failures show the last seen HTTP status / cluster status

Acceptance (per #416)

  • Cleanup trap no longer emits "No such container" lines.
  • KB tests either pass reliably or fail with a clear, actionable error pointing at the real root cause.

MattDevy added 2 commits June 16, 2026 13:37
The cleanup trap unconditionally ran `docker logs` for both ES and Kibana,
emitting "No such container" on stderr whenever the script failed before
the container was created (npm ci, build, image pull, ES bootstrap). That
diagnostic masked the real upstream failure in Buildkite output.

Wrap each `docker logs` call in a `docker inspect` existence check, printing
"(container never started)" when the container does not exist. Also fold
the three `docker rm -f` calls into one.

Refs #416
@github-actions

github-actions Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

MegaLinter analysis: Success

Descriptor Linter Files Fixed Errors Warnings Elapsed time
✅ BASH shellcheck 1 0 0 0.33s
✅ COPYPASTE jscpd yes no no 8.22s
✅ REPOSITORY gitleaks yes no no 57.85s
✅ REPOSITORY git_diff yes no no 0.84s
✅ REPOSITORY secretlint yes no no 33.59s
✅ REPOSITORY trivy yes no no 18.16s

Notices

📣 MegaLinter 9.5.0 is out! Discover the new features and security recommendations in the release announcement. (Skip this info by defining SECURITY_SUGGESTIONS: false)

See detailed reports in MegaLinter artifacts
Set VALIDATE_ALL_CODEBASE: true in mega-linter.yml to validate all sources, not only the diff

MegaLinter is graciously provided by OX Security
Show us your support by starring ⭐ the repository

@MattDevy MattDevy requested a review from margaretjgu June 16, 2026 12:46
@MattDevy MattDevy marked this pull request as ready for review June 16, 2026 12:46
MattDevy added 3 commits June 16, 2026 13:54
Resolves shellcheck SC2012 on line 71. The preceding `compgen -G` already
confirms at least one match, so an array assignment is sufficient.
The KB functional tests intermittently failed in setup-kibana.cjs with
6 minutes of `failed to retrieve password hash for reserved user [elastic]`
errors. The underlying cause is in the ES container logs:

  UnavailableShardsException: at least one primary shard for the index
  [.security-7] is unavailable

When the Buildkite host disk is above the default 85% low watermark,
ES refuses to allocate the security index's primary shard, so the
reserved-user store never becomes readable and every `elastic` auth
attempt fails. Intermittent because it depends on the agent's disk
state at run time.

For a single-node trial test cluster on ephemeral CI storage, the
disk allocator buys nothing, so turn it off.

Refs #416
Previously `retry` did `catch { /* not ready yet */ }` and threw a
generic "did not become ready in time" error with no detail. Diagnosing
the recent ES allocation flake required pulling container logs from the
cleanup trap because the script itself revealed nothing.

Track the last failure reason — HTTP status + body, cluster status, or
exception message — log it on each progress tick, and include it in the
final timeout error. Call sites now throw descriptive errors instead of
returning booleans.

Refs #416
@MattDevy MattDevy enabled auto-merge (squash) June 16, 2026 15:22

@margaretjgu margaretjgu left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. The description does a good job tracing the root cause chain (trap masking real errors, disk watermark blocking shard allocation, retry swallowing context) and each fix is targeted and correct.

One minor note: ES starts with --rm, so if it crashes and Docker auto-removes it before cleanup() fires, docker inspect returns false and prints "(container never started)" even though it did start. The logs are gone anyway so there is nothing actionable, but a short comment in the cleanup function would save a future reader some confusion. Not a blocker.

Worth noting this PR is good defense-in-depth and complements #280. Once the ES image is pre-baked on the agent, the disk pressure and cold-start timing issues that produce these failures in the first place are eliminated at the infrastructure level. Both are worth shipping.

@MattDevy MattDevy merged commit 60522aa into main Jun 16, 2026
25 checks passed
@MattDevy MattDevy deleted the fix/kb-tests-cleanup-trap branch June 16, 2026 15:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ci(kb): cleanup trap masks real failure in Buildkite KB functional tests

2 participants