refactor: simplify waterfall timeline duration computation#9
Conversation
Simplify the waterfall by sorting steps once, computing timeline span from first/last step, and inferring durations from inter-step gaps. Remove now/runStartedAt props since the timeline is self-contained. Use 500ms threshold for SDK-fixed data vs legacy gap inference.
There was a problem hiding this comment.
Pull request overview
Refactors the Execution Timeline (waterfall) rendering on the run detail page to simplify how step ordering, timeline span, and per-step bar durations are computed directly from step timestamps.
Changes:
- Sorts steps once by
started_atand filters out steps without timestamps. - Replaces
computeStepSpan+ per-step duration map with an index-based duration array derived from gaps to the next step (and recorded durations when available). - Removes
runStartedAt/nowprops fromWaterfallTimelineand simplifies the render loop.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Timeline spans from first step start to last step start (+ small padding) | ||
| const firstStart = new Date(sorted[0].started_at!).getTime(); | ||
| const lastStart = new Date(sorted[sorted.length - 1].started_at!).getTime(); | ||
| const timelineSpan = Math.max(lastStart - firstStart, 1000); // at least 1s | ||
|
|
There was a problem hiding this comment.
timelineSpan is computed from first step start → last step start, but bar widths use stepDurations that can extend past lastStart (e.g. SDK-fixed steps with real ended_at, or a long-running last step). This makes the axis labels inaccurate and forces late/long steps to be artificially clamped. Consider computing the span end from the max effective end time (e.g. max(start + computedDuration) or max(ended_at/now)), then deriving timelineSpan from that.
| // For each step, compute duration as gap to next step (legacy fix) | ||
| // This works because all existing data has started_at ≈ ended_at (useless), | ||
| // but the started_at values between steps reflect real execution timing. | ||
| const stepDurations: number[] = sorted.map((step, i) => { | ||
| const start = new Date(step.started_at!).getTime(); | ||
| const end = step.ended_at ? new Date(step.ended_at).getTime() : Date.now(); | ||
| const selfDuration = end - start; | ||
|
|
||
| // If step has real duration (> 500ms), use it (future SDK-fixed data) | ||
| if (selfDuration > 500) return selfDuration; | ||
|
|
||
| // Otherwise, infer from gap to next step |
There was a problem hiding this comment.
The selfDuration > 500 heuristic will treat legitimately short SDK-fixed steps (<500ms) as “legacy” and infer their duration from the next step gap, which can significantly overstate timing. A more reliable legacy check is to detect near-equal started_at/ended_at (as before) rather than using an absolute 500ms threshold.
| // For each step, compute duration as gap to next step (legacy fix) | |
| // This works because all existing data has started_at ≈ ended_at (useless), | |
| // but the started_at values between steps reflect real execution timing. | |
| const stepDurations: number[] = sorted.map((step, i) => { | |
| const start = new Date(step.started_at!).getTime(); | |
| const end = step.ended_at ? new Date(step.ended_at).getTime() : Date.now(); | |
| const selfDuration = end - start; | |
| // If step has real duration (> 500ms), use it (future SDK-fixed data) | |
| if (selfDuration > 500) return selfDuration; | |
| // Otherwise, infer from gap to next step | |
| // For each step, compute duration as gap to next step only for legacy data | |
| // where started_at and ended_at are effectively identical. For SDK-fixed | |
| // data, always use the actual recorded duration, even if it is short. | |
| const stepDurations: number[] = sorted.map((step, i) => { | |
| const start = new Date(step.started_at!).getTime(); | |
| const end = step.ended_at ? new Date(step.ended_at).getTime() : Date.now(); | |
| const selfDuration = end - start; | |
| const isLegacyDuration = step.ended_at && Math.abs(selfDuration) <= 1; | |
| // Use recorded duration for non-legacy data, including legitimately short steps. | |
| if (!isLegacyDuration) return Math.max(selfDuration, 0); | |
| // Legacy data: infer from gap to next step |
| function WaterfallTimeline({ | ||
| steps, | ||
| totalDuration, | ||
| runStartedAt, | ||
| now, | ||
| }: { | ||
| steps: Step[]; | ||
| totalDuration: number; | ||
| runStartedAt?: string | null; | ||
| now: number; | ||
| }) { |
There was a problem hiding this comment.
totalDuration is now unused in WaterfallTimeline, and removing the computeStepSpan(...) call leaves computeStepSpan unused as well. With @typescript-eslint/no-unused-vars enabled (warn), this will introduce lint warnings; consider deleting the unused helper and dropping totalDuration from the component props and call sites.
- Use started_at ≈ ended_at (≤1ms) for legacy detection instead of 500ms threshold - Compute timelineSpan from max effective end (start + duration) not just last start - Remove unused totalDuration prop and computeStepSpan helper
Addresses 7 substantive issues from Copilot's automated review of the settings-backend PR. Tests are flagged as out-of-scope follow-up. Idempotency (Copilot #1): - Add migration add_runs_idempotency_unique.py: partial UNIQUE index on runs(tenant_id, idempotency_key) WHERE idempotency_key IS NOT NULL, built CONCURRENTLY using the AUTOCOMMIT pattern. - runner._create_run now does optimistic query-then-insert. Insert is wrapped in `async with session.begin_nested()` (a SAVEPOINT) so an IntegrityError from a concurrent inserter doesn't poison the outer session transaction shared with _resolve_waiting_steps. On IntegrityError, re-query and return the winner row. Soft-delete coverage (Copilot #2): - Apply _bounce_if_deleted to all 4 return sites in get_tenant_with_dev_fallback (was only in get_current_tenant). get_optional_tenant has zero consumers in api/routes/ — left as-is to preserve its "endpoint works without auth" contract. Notification settings auth + SSRF (Copilot #3, #4): - GET /tenant/notifications now requires CurrentUserAdmin. The response contains the Slack webhook URL and PagerDuty integration key; members must not see them. - PATCH /tenant/notifications validates slack_webhook_url via the existing services/network_utils.validate_webhook_url (rejects private IPs, DNS-rebinding) AND pins the host to hooks.slack.com. - notifier._client now uses create_ssrf_safe_client as defense-in-depth against URLs persisted before the route validation was added. Cleanup (Copilot #5/#6): - Drop the redundant SQL update() + ORM mutate pattern in tenants.update_concurrency and tenants.delete_workspace. Single ORM mutation flushes one UPDATE on commit. Step-timeout coverage (Copilot #8): - New services/concurrency.py exports read_tenant_concurrency, shared between runner.py (enqueue-time injection) and executor.py (fallback load when job config lacks default_step_timeout_s — covers _recover_stuck_runs, which only ships trigger_data + endpoint_url). - Both worker-mode HTTP path and AI-step branch in executor now fall back to a tenant DB load when the runner-injected default is absent. Timeout classification (Copilot #9): - executor._invoke_function emits error.type="HTTPTimeout" for httpx.TimeoutException (covers ReadTimeout/Connect/Write/Pool); other RequestError subclasses keep "RequestError". - notifier.notify_run_failed reads error.type and gates dispatch on notify_on_run_timeout for {"TimeoutError","HTTPTimeout"}, otherwise notify_on_run_failed. The toggle does something now.
…fications, danger zone) (#36) * feat(notifications): wire workspace alerts to Slack + PagerDuty (Phase 3) Adds tenant_notification_config table with one row per workspace storing Slack channel/webhook, PagerDuty key + enabled flag, email-digest flag, and per-event-type triggers (run_failed, run_timeout). Persists via new GET/PATCH /api/v1/tenant/notifications (admin-only PATCH) plus a POST /tenant/notifications/test endpoint to verify each channel. Real enforcement: - services/notifier.py — fire-and-forget dispatcher. On run failure, loads tenant config, posts to Slack incoming-webhook (text + channel override) and triggers a PagerDuty Events API v2 incident with severity=error and run/error metadata. - services/executor.py hooks both terminal-failure transition sites (HTTP-mode permanent failure + AI-step retries-exhausted). Wrapped in asyncio.create_task so notification I/O never blocks run completion; notifier itself is exception-safe. Persisted-only for v1: - email_digest_enabled — column lands but daily digest cron lives in services/digest.py (Phase 3.1). Frontend: new components/settings/notifications-tab.tsx wired into settings/page.tsx as a Bell-icon tab. Loads settings on mount, PATCHes on blur (URLs / channel) and on click (Switches), shows a Send test button per channel that surfaces the server's pass/fail message via toast. * feat(settings): wire concurrency & limits + step-timeout enforcement (Phase 2 + 2.5) Persist max_concurrent_runs / per_function_default / default_step_timeout_s / use_event_id_idempotency to Tenant.settings JSONB and expose via new GET/PATCH /api/v1/tenant/concurrency (admin-only PATCH). Real enforcement (Phase 2): - runner._create_run derives Run.idempotency_key from the trigger event id when the workspace flag is on, so redelivered events map to the same logical run. - runner._enqueue_run consults the running-run count and re-queues with a 5s delay when over max_concurrent_runs (soft throttle), and injects per_function_default into the job config when a Function does not declare its own concurrency. Real enforcement (Phase 2.5 — default_step_timeout_s): - runner._enqueue_run additionally injects default_step_timeout_s into the job config so the executor doesn't need to re-fetch tenant rows. - executor._execute_run resolves the per-step timeout chain: function-level `timeout` wins, otherwise tenant default, else None. - executor._invoke_function (worker mode HTTP) now overrides the per-call read timeout to the resolved value via httpx.Timeout, and forwards step_timeout_s in the payload as a hint for SDK self-enforcement. - executor AI-step branch wraps ai_service.complete in asyncio.wait_for with the resolved timeout so a stuck LLM call yields control instead of riding the default 5-minute read timeout. Step-level timeout in step_result["timeout_seconds"] still wins when the SDK passes one. Frontend: - ConcurrencySettings type + getConcurrencySettings/updateConcurrencySettings in lib/api.ts. - New components/settings/concurrency-tab.tsx (shadcn-styled): three cards covering Concurrency / Step timeout / Idempotency, persisting on blur (numbers) and click (Switch). - settings/page.tsx adds a Gauge-icon Concurrency tab between Audit and Notifications; bumped grid cols 8→9 / 10→11. Out of scope: - Inline-mode (serverless) AI step timeout: inline_executor receives fn not job, so the tenant default would need to be threaded through; worker-mode covers the common path today. * feat(settings): wire danger zone — pause-all, transfer, soft-delete (Phase 4) Backend: - Migration add_tenant_soft_delete.py adds tenants.deleted_at TIMESTAMPTZ + index. - Tenant model gains the deleted_at column. - api/deps.py::_bounce_if_deleted gates every authenticated request: any attempt to use a deleted tenant returns HTTP 410 Gone (distinct from 401 / 404 so clients can act on it). - api/routes/tenants.py adds: * GET /api/v1/tenant — minimal workspace identity (id/name/slug/deleted_at). Available to any auth'd user; lets the dashboard show the slug for typed-confirmation flows. * POST /api/v1/tenant/pause-all — admin-only. Bulk update of Function.is_active=false for the workspace; idempotent. Returns paused_count. * POST /api/v1/tenant/transfer-ownership — admin-only. Promotes the target user to admin, demotes the calling admin to member; rejects self-target and cross-tenant targets. * DELETE /api/v1/tenant — admin-only. Body must echo the workspace slug to confirm. Stamps deleted_at; recoverable until a future hard-delete background job retires the row. Frontend: - TenantInfo type + getTenantInfo + pauseAllFunctions + transferOwnership + deleteWorkspace methods in lib/api.ts. - New components/settings/danger-zone.tsx: three rounded action rows inside a destructive-bordered Card; each behind its own Dialog with a typed/selected confirmation. Pause and transfer use simple confirms; delete requires retyping the workspace slug exactly. On successful delete the page does a hard nav to /login so the auth store resets. - settings/page.tsx replaces the old "Clear All Data" Danger Zone Card with <DangerZone /> in the General tab. Out of scope (intentional follow-ups): - Hard delete background job — soft-delete persists deleted_at; a cron in services/retention.py will purge after the retention window. - Email confirmation for transfer-ownership — currently single-click promote/demote within the dashboard; the more careful flow can layer on. * fix(lint): ruff I001 import ordering on notifier + executor * fix(settings): address Copilot review feedback on PR #36 Addresses 7 substantive issues from Copilot's automated review of the settings-backend PR. Tests are flagged as out-of-scope follow-up. Idempotency (Copilot #1): - Add migration add_runs_idempotency_unique.py: partial UNIQUE index on runs(tenant_id, idempotency_key) WHERE idempotency_key IS NOT NULL, built CONCURRENTLY using the AUTOCOMMIT pattern. - runner._create_run now does optimistic query-then-insert. Insert is wrapped in `async with session.begin_nested()` (a SAVEPOINT) so an IntegrityError from a concurrent inserter doesn't poison the outer session transaction shared with _resolve_waiting_steps. On IntegrityError, re-query and return the winner row. Soft-delete coverage (Copilot #2): - Apply _bounce_if_deleted to all 4 return sites in get_tenant_with_dev_fallback (was only in get_current_tenant). get_optional_tenant has zero consumers in api/routes/ — left as-is to preserve its "endpoint works without auth" contract. Notification settings auth + SSRF (Copilot #3, #4): - GET /tenant/notifications now requires CurrentUserAdmin. The response contains the Slack webhook URL and PagerDuty integration key; members must not see them. - PATCH /tenant/notifications validates slack_webhook_url via the existing services/network_utils.validate_webhook_url (rejects private IPs, DNS-rebinding) AND pins the host to hooks.slack.com. - notifier._client now uses create_ssrf_safe_client as defense-in-depth against URLs persisted before the route validation was added. Cleanup (Copilot #5/#6): - Drop the redundant SQL update() + ORM mutate pattern in tenants.update_concurrency and tenants.delete_workspace. Single ORM mutation flushes one UPDATE on commit. Step-timeout coverage (Copilot #8): - New services/concurrency.py exports read_tenant_concurrency, shared between runner.py (enqueue-time injection) and executor.py (fallback load when job config lacks default_step_timeout_s — covers _recover_stuck_runs, which only ships trigger_data + endpoint_url). - Both worker-mode HTTP path and AI-step branch in executor now fall back to a tenant DB load when the runner-injected default is absent. Timeout classification (Copilot #9): - executor._invoke_function emits error.type="HTTPTimeout" for httpx.TimeoutException (covers ReadTimeout/Connect/Write/Pool); other RequestError subclasses keep "RequestError". - notifier.notify_run_failed reads error.type and gates dispatch on notify_on_run_timeout for {"TimeoutError","HTTPTimeout"}, otherwise notify_on_run_failed. The toggle does something now.
Summary
started_at, filter out steps without timestampscomputeStepSpan)now/runStartedAtprops — timeline is self-contained from step dataTest plan