perf: tighten buildkitd readiness polling and gate it in regression workflow by taha-au · Pull Request #100 · useblacksmith/setup-docker-builder

taha-au · 2026-04-25T06:22:28Z

Summary

Two related improvements, plus an extension to the regression workflow:

1. Tighten the buildkitd readiness polling

After buildkitd is launched, the action polls buildctl debug workers until the OCI worker is up. The loop used a flat 1s sleep between polls, so we paid up to ~1s of pure polling discretization on every job, regardless of how fast buildkitd actually came up.

Replace that with exponential backoff: 100→200→400→800ms, capped at 1000ms. Same 30s hard timeout. Steady-state startups observe the worker up to ~900ms sooner; cold/slow startups behave no worse than before.
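The backoff described above can be sketched as follows. This is a minimal illustration, not the action's actual code: `backoffSchedule`, `waitForWorkers`, the injected `probe`/`sleep`/`now` parameters, and the poll-then-sleep ordering are all assumptions made for the example. Only the 100ms start, the doubling, the 1000ms cap, and the 30s timeout come from the description.

```typescript
// Hypothetical names throughout; the real identifiers are not shown in this PR description.
const INITIAL_DELAY_MS = 100;
const MAX_DELAY_MS = 1000;
const TIMEOUT_MS = 30_000;

// Delays between consecutive polls: 100, 200, 400, 800, then 1000 for every later poll.
function backoffSchedule(count: number): number[] {
  const delays: number[] = [];
  let delay = INITIAL_DELAY_MS;
  for (let i = 0; i < count; i++) {
    delays.push(delay);
    delay = Math.min(delay * 2, MAX_DELAY_MS);
  }
  return delays;
}

interface ReadyResult {
  elapsedMs: number;
  polls: number;
}

// Polls `probe` until it reports ready, sleeping with exponential backoff in between.
// `probe` stands in for "run buildctl debug workers and check for the OCI worker".
async function waitForWorkers(
  probe: () => Promise<boolean>,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
  now: () => number = Date.now,
): Promise<ReadyResult> {
  const start = now();
  let delay = INITIAL_DELAY_MS;
  let polls = 0;
  while (now() - start < TIMEOUT_MS) {
    polls += 1;
    if (await probe()) {
      return { elapsedMs: now() - start, polls };
    }
    await sleep(delay);
    delay = Math.min(delay * 2, MAX_DELAY_MS);
  }
  throw new Error(`buildkitd workers not ready after ${TIMEOUT_MS}ms`);
}
```

Because the delay is capped at the old flat interval, a slow startup never waits longer between polls than it did before; only the early polls are denser.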

2. Surface the slow path

Today, when readiness takes a long time, the action's stdout gives you no idea why — the explanation lives in /tmp/buildkitd.log on the runner.

- Add a "buildkitd workers ready in <N>ms after <K> poll(s)" log line so the readiness window is directly measurable from the action's own output.
- When readiness takes >2s, automatically tail the last 50 lines of /tmp/buildkitd.log so the slow path is self-explanatory in CI logs without spamming the fast path.
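A rough sketch of the two behaviours above, assuming the caller has already read /tmp/buildkitd.log into a string. The function and constant names (`formatReadyLine`, `tailLines`, `readinessReport`, `SLOW_READY_THRESHOLD_MS`, `TAIL_LINES`) are hypothetical; only the log line format, the 2s threshold, and the 50-line tail come from the description.

```typescript
// Hypothetical names; thresholds taken from the PR description.
const SLOW_READY_THRESHOLD_MS = 2000;
const TAIL_LINES = 50;

// Builds the telemetry line in the format quoted in the description.
function formatReadyLine(elapsedMs: number, polls: number): string {
  return `buildkitd workers ready in ${elapsedMs}ms after ${polls} poll(s)`;
}

// Returns the last `n` lines of `text`, ignoring a single trailing newline.
function tailLines(text: string, n: number): string[] {
  const lines = text.split("\n");
  if (lines.length > 0 && lines[lines.length - 1] === "") {
    lines.pop();
  }
  return lines.slice(-n);
}

// Lines to print after readiness: always the telemetry line,
// plus the tail of the buildkitd log only on the slow path.
function readinessReport(elapsedMs: number, polls: number, buildkitdLog: string): string[] {
  const out = [formatReadyLine(elapsedMs, polls)];
  if (elapsedMs > SLOW_READY_THRESHOLD_MS) {
    out.push(...tailLines(buildkitdLog, TAIL_LINES));
  }
  return out;
}
```

Keeping the tail behind the threshold means a fast run prints a single extra line, while a slow run carries its own explanation in the CI log.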

3. Extend the step-duration regression workflow

Adds a third gated metric to step-duration-regression.yml alongside setup/post step durations: the buildkitd readiness window in milliseconds, parsed out of the new telemetry line. Threshold defaults to BUILDKITD_READY_MAX_MS=8000.

This catches both:

- regressions in the action's polling backoff (would push readiness ~1s+ higher than necessary), and
- regressions in buildkitd warm-up itself (would blow past the 8s ceiling regardless of polling).

The check is informational on older action versions that don't emit the telemetry line.
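The gating logic can be illustrated with a short sketch. The workflow's actual implementation is not shown here, so `checkReadiness`, the `Verdict` type, and the regular expression are assumptions; the telemetry format, the 8000ms default, and the skip-when-missing behaviour follow the description.

```typescript
// Hypothetical sketch of the readiness gate; not the workflow's actual script.
const BUILDKITD_READY_MAX_MS = 8000;

// Matches the telemetry line "buildkitd workers ready in <N>ms after <K> poll(s)".
const READY_LINE = /buildkitd workers ready in (\d+)ms after (\d+) poll\(s\)/;

type Verdict =
  | { status: "pass" | "fail"; readyMs: number; polls: number }
  | { status: "skipped" };

// Parses the job log and applies the threshold.
// A missing telemetry line (older action versions) skips enforcement instead of failing.
function checkReadiness(jobLog: string, maxMs: number = BUILDKITD_READY_MAX_MS): Verdict {
  const match = READY_LINE.exec(jobLog);
  if (match === null) {
    return { status: "skipped" };
  }
  const readyMs = Number(match[1]);
  const polls = Number(match[2]);
  return { status: readyMs > maxMs ? "fail" : "pass", readyMs, polls };
}
```

For example, a log containing "buildkitd workers ready in 169ms after 1 poll(s)" passes under the default threshold, while a log with no such line yields `skipped`, which a caller could surface as a warning.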

Validation

Manual run on this branch via workflow_dispatch: https://github.com/useblacksmith/setup-docker-builder/actions/runs/24924523539

Setup step ("Setup Docker Builder under test"): 3s   (threshold 5s)
Post  step ("Post Setup Docker Builder under test"):  2s   (threshold 5s)
buildkitd readiness:          169ms in 1 poll(s)   (threshold 8000ms)

All thresholds passed.

A real production job (useblacksmith/web, large sticky disk with ~382 GB of pre-existing cache) was previously seeing ~6.8s of buildkitd readiness time. We can't shrink the buildkitd warm-up itself from this action, but we can now (a) get to the worker as soon as it's actually ready (no ~1s overshoot) and (b) see directly from CI logs how much of the readiness window was buildkitd vs polling.

Test plan

- Unit tests pass (pnpm test)
- Typecheck passes (pnpm typecheck)
- dist/ rebuilt and committed
- Manual workflow_dispatch run on this branch shows the new telemetry line, the parsed value reaches the validator, and the threshold check fires
- Regression workflow runs green against this PR (will appear in checks below)

Note

Low Risk
Changes are limited to a CI validation workflow; no runtime action or production path is modified in this diff.

Overview
Extends step-duration-regression.yml so CI tracks a third performance metric alongside setup/post step duration: buildkitd readiness time parsed from the action log line buildkitd workers ready in <N>ms after <K> poll(s).

Adds BUILDKITD_READY_MAX_MS=8000 and updates the validate job to resolve the exercise job id, fetch its logs via the GitHub API, extract readiness ms and poll count, and surface them in the console and Step Durations job summary (with n/a when telemetry is missing on older action versions). Setup/post checks are unchanged in spirit but renamed to assert_step_under_threshold with clearer threshold labels; readiness fails the job only when a numeric value exceeds 8s, otherwise emits a warning and skips enforcement.

Reviewed by Cursor Bugbot for commit 41c9238.

The setup step's "wait for buildkitd workers" loop polled with a flat 1s backoff. In practice buildkitd's OCI worker comes up in well under a second on most runs, so we paid up to ~1s of polling discretization for nothing on every job. On slow runs, we had no signal at all about what buildkitd was doing during the wait - it all lived in /tmp/buildkitd.log on the runner.

Changes:
- Replace the fixed 1000ms backoff with exponential 100->200->400->800 ms, capped at 1000ms. Same 30s hard timeout. Steady-state startups observe the worker up to ~900ms sooner; cold/slow startups behave no worse than before.
- Emit a "buildkitd workers ready in <N>ms after <K> poll(s)" log line so the readiness window is directly measurable from the action's own output instead of having to subtract two adjacent log timestamps by hand.
- When readiness takes >2s, automatically tail the last 50 lines of /tmp/buildkitd.log so the slow path is self-explanatory in CI logs without spamming the fast path.

Also extends the step-duration-regression workflow to parse the new telemetry line out of the job log and gate it on BUILDKITD_READY_MAX_MS (default 8000ms). This catches both:
- a regression in the action's polling backoff (would push readiness ~1s+ higher than necessary), and
- a regression in buildkitd warm-up itself (would blow past the 8s ceiling regardless of polling).

The check is informational on older action versions that don't emit the telemetry line.

The BUILDKITD_READY_MAX_MS comment described the gated window as "buildkitd version" to "Found N workers", but the telemetry actually measures from buildkitd launch until the OCI worker is registered. Reword the comment to match what is measured.

Also render the readiness value as "n/a (no telemetry line found)" when the target job has no telemetry line (older action versions) instead of printing a misleading "MISSINGms" in both the console line and the step summary table.

Gating logic is unchanged; this is output-only.

Co-authored-by: Codesmith <codesmith-bot@users.noreply.github.com>

taha-au marked this pull request as draft April 25, 2026 06:23

taha-au wants to merge 2 commits into main from perf/buildkitd-readiness-polling
