perf: tighten buildkitd readiness polling and gate it in regression workflow #100
Draft
taha-au wants to merge 2 commits into
The setup step's "wait for buildkitd workers" loop polled with a flat 1s backoff. In practice buildkitd's OCI worker comes up in well under a second on most runs, so we paid up to ~1s of polling discretization for nothing on every job. On slow runs, we had no signal at all about what buildkitd was doing during the wait - it all lived in /tmp/buildkitd.log on the runner.

Changes:
- Replace the fixed 1000ms backoff with exponential 100->200->400->800 ms, capped at 1000ms. Same 30s hard timeout. Steady-state startups observe the worker up to ~900ms sooner; cold/slow startups behave no worse than before.
- Emit a "buildkitd workers ready in <N>ms after <K> poll(s)" log line so the readiness window is directly measurable from the action's own output instead of having to subtract two adjacent log timestamps by hand.
- When readiness takes >2s, automatically tail the last 50 lines of /tmp/buildkitd.log so the slow path is self-explanatory in CI logs without spamming the fast path.

Also extends the step-duration-regression workflow to parse the new telemetry line out of the job log and gate it on BUILDKITD_READY_MAX_MS (default 8000ms). This catches both:
- a regression in the action's polling backoff (would push readiness ~1s+ higher than necessary), and
- a regression in buildkitd warm-up itself (would blow past the 8s ceiling regardless of polling).

The check is informational on older action versions that don't emit the telemetry line.
The BUILDKITD_READY_MAX_MS comment described the gated window as "buildkitd version" to "Found N workers", but the telemetry actually measures from buildkitd launch until the OCI worker is registered. Reword the comment to match what is measured.

Also render the readiness value as "n/a (no telemetry line found)" when the target job has no telemetry line (older action versions) instead of printing a misleading "MISSINGms" in both the console line and the step summary table. Gating logic is unchanged; this is output-only.

Co-authored-by: Codesmith <codesmith-bot@users.noreply.github.com>
Summary
Two related improvements + a regression-workflow extension:
1. Tighten the buildkitd readiness polling
After `buildkitd` is launched, the action polls `buildctl debug workers` until the OCI worker is up. The loop used a flat 1s sleep between polls, so we paid up to ~1s of pure polling discretization on every job, regardless of how fast buildkitd actually came up.

Replace that with exponential backoff: 100→200→400→800ms, capped at 1000ms. Same 30s hard timeout. Steady-state startups observe the worker up to ~900ms sooner; cold/slow startups behave no worse than before.
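To make the "up to ~900ms sooner" figure concrete, here is a minimal sketch that simulates both polling schedules. It is not the action's actual code: the function names are hypothetical, and it assumes the first poll happens immediately after launch and that each poll takes zero time.

```typescript
// Hypothetical simulation of the two schedules described above.
// Assumptions: first poll at t=0, polls are instantaneous, 30s hard timeout.
const TIMEOUT_MS = 30000;

// Exponential schedule from this PR: 100, 200, 400, 800, then capped at 1000.
function exponentialDelayMs(attempt: number): number {
  return Math.min(100 * Math.pow(2, attempt), 1000);
}

// Previous behaviour: a flat 1s sleep between polls.
function flatDelayMs(_attempt: number): number {
  return 1000;
}

// Returns the time of the first poll that sees a worker which became ready
// at readyAtMs, or null if the timeout elapses first.
function firstObservedAtMs(
  readyAtMs: number,
  delayFor: (attempt: number) => number,
): { observedAtMs: number; polls: number } | null {
  let t = 0;
  let attempt = 0;
  while (t < TIMEOUT_MS) {
    if (t >= readyAtMs) {
      return { observedAtMs: t, polls: attempt + 1 };
    }
    t += delayFor(attempt);
    attempt += 1;
  }
  return null;
}

// A worker that is ready 50ms after launch:
const withBackoff = firstObservedAtMs(50, exponentialDelayMs); // seen at 100ms, 2 polls
const withFlatSleep = firstObservedAtMs(50, flatDelayMs); // seen at 1000ms, 2 polls
```

Under these assumptions the fast case is observed 900ms earlier, and because no gap between polls ever exceeds the 1000ms cap, the worst-case overshoot stays the same as with the flat loop.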
2. Surface the slow path
Today, when readiness takes a long time, the action's stdout gives you no idea why: the explanation lives in `/tmp/buildkitd.log` on the runner.

- Emit a `buildkitd workers ready in <N>ms after <K> poll(s)` log line so the readiness window is directly measurable from the action's own output.
- When readiness takes >2s, automatically tail the last 50 lines of `/tmp/buildkitd.log` so the slow path is self-explanatory in CI logs without spamming the fast path.

3. Extend the step-duration regression workflow

Adds a third gated metric to `step-duration-regression.yml` alongside setup/post step durations: the buildkitd readiness window in milliseconds, parsed out of the new telemetry line. Threshold defaults to `BUILDKITD_READY_MAX_MS=8000`.

This catches both:
- a regression in the action's polling backoff (would push readiness ~1s+ higher than necessary), and
- a regression in buildkitd warm-up itself (would blow past the 8s ceiling regardless of polling).
The check is informational on older action versions that don't emit the telemetry line.
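The gating rule can be sketched as follows. This is an illustrative TypeScript rendering, not the workflow's own implementation: `parseReadiness` and `checkReadiness` are hypothetical names, and the regular expression is an assumption derived from the log line format quoted above.

```typescript
// Hypothetical sketch of parsing the telemetry line and applying the gate.
// The pattern is inferred from the documented log line format.
const READY_LINE = /buildkitd workers ready in (\d+)ms after (\d+) poll\(s\)/;

interface Readiness {
  readyMs: number;
  polls: number;
}

// Extracts the readiness window and poll count from a job log,
// or returns null when the line is absent (older action versions).
function parseReadiness(jobLog: string): Readiness | null {
  const match = READY_LINE.exec(jobLog);
  if (match === null) {
    return null;
  }
  return { readyMs: parseInt(match[1], 10), polls: parseInt(match[2], 10) };
}

type Verdict = "pass" | "fail" | "skipped";

// Fails only when a numeric value exceeds the threshold; a missing line is
// reported as n/a and enforcement is skipped.
function checkReadiness(
  jobLog: string,
  maxMs: number = 8000, // mirrors the BUILDKITD_READY_MAX_MS default
): { verdict: Verdict; display: string } {
  const readiness = parseReadiness(jobLog);
  if (readiness === null) {
    return { verdict: "skipped", display: "n/a (no telemetry line found)" };
  }
  const verdict: Verdict = readiness.readyMs > maxMs ? "fail" : "pass";
  return { verdict, display: `${readiness.readyMs}ms` };
}
```

The pattern is left unanchored so that any prefix the log adds to each line does not prevent a match.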
Validation
Manual run on this branch via `workflow_dispatch`: https://github.com/useblacksmith/setup-docker-builder/actions/runs/24924523539

A real production job (`useblacksmith/web`, large sticky disk with ~382 GB of pre-existing cache) was previously seeing ~6.8s of buildkitd readiness time. We can't shrink the buildkitd warm-up itself from this action, but we can now (a) get to the worker as soon as it's actually ready (no ~1s overshoot) and (b) see directly from CI logs how much of the readiness window was buildkitd vs polling.

Test plan

- `pnpm test`
- `pnpm typecheck`
- `dist/` rebuilt and committed

Note
Low Risk
Changes are limited to a CI validation workflow; no runtime action or production path is modified in this diff.
Overview
Extends `step-duration-regression.yml` so CI tracks a third performance metric alongside setup/post step duration: buildkitd readiness time parsed from the action log line `buildkitd workers ready in <N>ms after <K> poll(s)`.

Adds `BUILDKITD_READY_MAX_MS=8000` and updates the validate job to resolve the exercise job id, fetch its logs via the GitHub API, extract readiness ms and poll count, and surface them in the console and Step Durations job summary (with n/a when telemetry is missing on older action versions). Setup/post checks are unchanged in spirit but renamed to `assert_step_under_threshold` with clearer threshold labels; readiness fails the job only when a numeric value exceeds 8s, otherwise emits a warning and skips enforcement.

Reviewed by Cursor Bugbot for commit 41c9238.