refactor: simplify waterfall timeline duration computation by hoootan · Pull Request #9 · hoootan/flowforge

hoootan · 2026-04-07T22:59:17Z

Summary

Sort steps once by started_at, filter out steps without timestamps
Compute timeline span from first-to-last step start (simpler than computeStepSpan)
Infer step duration from gap to next step for legacy data; use real duration (>500ms) for SDK-fixed data
Remove now/runStartedAt props — timeline is self-contained from step data
Cleaner render loop using sorted array index for duration lookup

Test plan

Waterfall renders correctly for old runs (legacy started_at == ended_at)
Waterfall renders correctly for new runs (SDK timing)
Steps without started_at are excluded (pending steps don't break layout)
Expand/collapse still works with >15 steps

Simplify the waterfall by sorting steps once, computing timeline span from first/last step, and inferring durations from inter-step gaps. Remove now/runStartedAt props since the timeline is self-contained. Use 500ms threshold for SDK-fixed data vs legacy gap inference.

Copilot

Pull request overview

Refactors the Execution Timeline (waterfall) rendering on the run detail page to simplify how step ordering, timeline span, and per-step bar durations are computed directly from step timestamps.

Changes:

Sorts steps once by started_at and filters out steps without timestamps.
Replaces computeStepSpan + per-step duration map with an index-based duration array derived from gaps to the next step (and recorded durations when available).
Removes runStartedAt/now props from WaterfallTimeline and simplifies the render loop.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-07T23:03:13Z

+  // Timeline spans from first step start to last step start (+ small padding)
+  const firstStart = new Date(sorted[0].started_at!).getTime();
+  const lastStart = new Date(sorted[sorted.length - 1].started_at!).getTime();
+  const timelineSpan = Math.max(lastStart - firstStart, 1000); // at least 1s
+


timelineSpan is computed from first step start → last step start, but bar widths use stepDurations that can extend past lastStart (e.g. SDK-fixed steps with real ended_at, or a long-running last step). This makes the axis labels inaccurate and forces late/long steps to be artificially clamped. Consider computing the span end from the max effective end time (e.g. max(start + computedDuration) or max(ended_at/now)), then deriving timelineSpan from that.

Copilot · 2026-04-07T23:03:13Z

+  // For each step, compute duration as gap to next step (legacy fix)
+  // This works because all existing data has started_at ≈ ended_at (useless),
+  // but the started_at values between steps reflect real execution timing.
+  const stepDurations: number[] = sorted.map((step, i) => {
+    const start = new Date(step.started_at!).getTime();
+    const end = step.ended_at ? new Date(step.ended_at).getTime() : Date.now();
+    const selfDuration = end - start;
+
+    // If step has real duration (> 500ms), use it (future SDK-fixed data)
+    if (selfDuration > 500) return selfDuration;
+
+    // Otherwise, infer from gap to next step


The selfDuration > 500 heuristic will treat legitimately short SDK-fixed steps (<500ms) as “legacy” and infer their duration from the next step gap, which can significantly overstate timing. A more reliable legacy check is to detect near-equal started_at/ended_at (as before) rather than using an absolute 500ms threshold.

Suggested change

// For each step, compute duration as gap to next step (legacy fix)

// This works because all existing data has started_at ≈ ended_at (useless),

// but the started_at values between steps reflect real execution timing.

const stepDurations: number[] = sorted.map((step, i) => {

const start = new Date(step.started_at!).getTime();

const end = step.ended_at ? new Date(step.ended_at).getTime() : Date.now();

const selfDuration = end - start;

// If step has real duration (> 500ms), use it (future SDK-fixed data)

if (selfDuration > 500) return selfDuration;

// Otherwise, infer from gap to next step

// For each step, compute duration as gap to next step only for legacy data

// where started_at and ended_at are effectively identical. For SDK-fixed

// data, always use the actual recorded duration, even if it is short.

const stepDurations: number[] = sorted.map((step, i) => {

const start = new Date(step.started_at!).getTime();

const end = step.ended_at ? new Date(step.ended_at).getTime() : Date.now();

const selfDuration = end - start;

const isLegacyDuration = step.ended_at && Math.abs(selfDuration) <= 1;

// Use recorded duration for non-legacy data, including legitimately short steps.

if (!isLegacyDuration) return Math.max(selfDuration, 0);

// Legacy data: infer from gap to next step

Copilot · 2026-04-07T23:03:14Z

 function WaterfallTimeline({
  steps,
  totalDuration,
-  runStartedAt,
-  now,
 }: {
  steps: Step[];
  totalDuration: number;
-  runStartedAt?: string | null;
-  now: number;
 }) {


totalDuration is now unused in WaterfallTimeline, and removing the computeStepSpan(...) call leaves computeStepSpan unused as well. With @typescript-eslint/no-unused-vars enabled (warn), this will introduce lint warnings; consider deleting the unused helper and dropping totalDuration from the component props and call sites.

- Use started_at ≈ ended_at (≤1ms) for legacy detection instead of 500ms threshold - Compute timelineSpan from max effective end (start + duration) not just last start - Remove unused totalDuration prop and computeStepSpan helper

Addresses 7 substantive issues from Copilot's automated review of the settings-backend PR. Tests are flagged as out-of-scope follow-up. Idempotency (Copilot #1): - Add migration add_runs_idempotency_unique.py: partial UNIQUE index on runs(tenant_id, idempotency_key) WHERE idempotency_key IS NOT NULL, built CONCURRENTLY using the AUTOCOMMIT pattern. - runner._create_run now does optimistic query-then-insert. Insert is wrapped in `async with session.begin_nested()` (a SAVEPOINT) so an IntegrityError from a concurrent inserter doesn't poison the outer session transaction shared with _resolve_waiting_steps. On IntegrityError, re-query and return the winner row. Soft-delete coverage (Copilot #2): - Apply _bounce_if_deleted to all 4 return sites in get_tenant_with_dev_fallback (was only in get_current_tenant). get_optional_tenant has zero consumers in api/routes/ — left as-is to preserve its "endpoint works without auth" contract. Notification settings auth + SSRF (Copilot #3, #4): - GET /tenant/notifications now requires CurrentUserAdmin. The response contains the Slack webhook URL and PagerDuty integration key; members must not see them. - PATCH /tenant/notifications validates slack_webhook_url via the existing services/network_utils.validate_webhook_url (rejects private IPs, DNS-rebinding) AND pins the host to hooks.slack.com. - notifier._client now uses create_ssrf_safe_client as defense-in-depth against URLs persisted before the route validation was added. Cleanup (Copilot #5/#6): - Drop the redundant SQL update() + ORM mutate pattern in tenants.update_concurrency and tenants.delete_workspace. Single ORM mutation flushes one UPDATE on commit. Step-timeout coverage (Copilot #8): - New services/concurrency.py exports read_tenant_concurrency, shared between runner.py (enqueue-time injection) and executor.py (fallback load when job config lacks default_step_timeout_s — covers _recover_stuck_runs, which only ships trigger_data + endpoint_url). - Both worker-mode HTTP path and AI-step branch in executor now fall back to a tenant DB load when the runner-injected default is absent. Timeout classification (Copilot #9): - executor._invoke_function emits error.type="HTTPTimeout" for httpx.TimeoutException (covers ReadTimeout/Connect/Write/Pool); other RequestError subclasses keep "RequestError". - notifier.notify_run_failed reads error.type and gates dispatch on notify_on_run_timeout for {"TimeoutError","HTTPTimeout"}, otherwise notify_on_run_failed. The toggle does something now.

…fications, danger zone) (#36) * feat(notifications): wire workspace alerts to Slack + PagerDuty (Phase 3) Adds tenant_notification_config table with one row per workspace storing Slack channel/webhook, PagerDuty key + enabled flag, email-digest flag, and per-event-type triggers (run_failed, run_timeout). Persists via new GET/PATCH /api/v1/tenant/notifications (admin-only PATCH) plus a POST /tenant/notifications/test endpoint to verify each channel. Real enforcement: - services/notifier.py — fire-and-forget dispatcher. On run failure, loads tenant config, posts to Slack incoming-webhook (text + channel override) and triggers a PagerDuty Events API v2 incident with severity=error and run/error metadata. - services/executor.py hooks both terminal-failure transition sites (HTTP-mode permanent failure + AI-step retries-exhausted). Wrapped in asyncio.create_task so notification I/O never blocks run completion; notifier itself is exception-safe. Persisted-only for v1: - email_digest_enabled — column lands but daily digest cron lives in services/digest.py (Phase 3.1). Frontend: new components/settings/notifications-tab.tsx wired into settings/page.tsx as a Bell-icon tab. Loads settings on mount, PATCHes on blur (URLs / channel) and on click (Switches), shows a Send test button per channel that surfaces the server's pass/fail message via toast. * feat(settings): wire concurrency & limits + step-timeout enforcement (Phase 2 + 2.5) Persist max_concurrent_runs / per_function_default / default_step_timeout_s / use_event_id_idempotency to Tenant.settings JSONB and expose via new GET/PATCH /api/v1/tenant/concurrency (admin-only PATCH). Real enforcement (Phase 2): - runner._create_run derives Run.idempotency_key from the trigger event id when the workspace flag is on, so redelivered events map to the same logical run. - runner._enqueue_run consults the running-run count and re-queues with a 5s delay when over max_concurrent_runs (soft throttle), and injects per_function_default into the job config when a Function does not declare its own concurrency. Real enforcement (Phase 2.5 — default_step_timeout_s): - runner._enqueue_run additionally injects default_step_timeout_s into the job config so the executor doesn't need to re-fetch tenant rows. - executor._execute_run resolves the per-step timeout chain: function-level `timeout` wins, otherwise tenant default, else None. - executor._invoke_function (worker mode HTTP) now overrides the per-call read timeout to the resolved value via httpx.Timeout, and forwards step_timeout_s in the payload as a hint for SDK self-enforcement. - executor AI-step branch wraps ai_service.complete in asyncio.wait_for with the resolved timeout so a stuck LLM call yields control instead of riding the default 5-minute read timeout. Step-level timeout in step_result["timeout_seconds"] still wins when the SDK passes one. Frontend: - ConcurrencySettings type + getConcurrencySettings/updateConcurrencySettings in lib/api.ts. - New components/settings/concurrency-tab.tsx (shadcn-styled): three cards covering Concurrency / Step timeout / Idempotency, persisting on blur (numbers) and click (Switch). - settings/page.tsx adds a Gauge-icon Concurrency tab between Audit and Notifications; bumped grid cols 8→9 / 10→11. Out of scope: - Inline-mode (serverless) AI step timeout: inline_executor receives fn not job, so the tenant default would need to be threaded through; worker-mode covers the common path today. * feat(settings): wire danger zone — pause-all, transfer, soft-delete (Phase 4) Backend: - Migration add_tenant_soft_delete.py adds tenants.deleted_at TIMESTAMPTZ + index. - Tenant model gains the deleted_at column. - api/deps.py::_bounce_if_deleted gates every authenticated request: any attempt to use a deleted tenant returns HTTP 410 Gone (distinct from 401 / 404 so clients can act on it). - api/routes/tenants.py adds: * GET /api/v1/tenant — minimal workspace identity (id/name/slug/deleted_at). Available to any auth'd user; lets the dashboard show the slug for typed-confirmation flows. * POST /api/v1/tenant/pause-all — admin-only. Bulk update of Function.is_active=false for the workspace; idempotent. Returns paused_count. * POST /api/v1/tenant/transfer-ownership — admin-only. Promotes the target user to admin, demotes the calling admin to member; rejects self-target and cross-tenant targets. * DELETE /api/v1/tenant — admin-only. Body must echo the workspace slug to confirm. Stamps deleted_at; recoverable until a future hard-delete background job retires the row. Frontend: - TenantInfo type + getTenantInfo + pauseAllFunctions + transferOwnership + deleteWorkspace methods in lib/api.ts. - New components/settings/danger-zone.tsx: three rounded action rows inside a destructive-bordered Card; each behind its own Dialog with a typed/selected confirmation. Pause and transfer use simple confirms; delete requires retyping the workspace slug exactly. On successful delete the page does a hard nav to /login so the auth store resets. - settings/page.tsx replaces the old "Clear All Data" Danger Zone Card with <DangerZone /> in the General tab. Out of scope (intentional follow-ups): - Hard delete background job — soft-delete persists deleted_at; a cron in services/retention.py will purge after the retention window. - Email confirmation for transfer-ownership — currently single-click promote/demote within the dashboard; the more careful flow can layer on. * fix(lint): ruff I001 import ordering on notifier + executor * fix(settings): address Copilot review feedback on PR #36 Addresses 7 substantive issues from Copilot's automated review of the settings-backend PR. Tests are flagged as out-of-scope follow-up. Idempotency (Copilot #1): - Add migration add_runs_idempotency_unique.py: partial UNIQUE index on runs(tenant_id, idempotency_key) WHERE idempotency_key IS NOT NULL, built CONCURRENTLY using the AUTOCOMMIT pattern. - runner._create_run now does optimistic query-then-insert. Insert is wrapped in `async with session.begin_nested()` (a SAVEPOINT) so an IntegrityError from a concurrent inserter doesn't poison the outer session transaction shared with _resolve_waiting_steps. On IntegrityError, re-query and return the winner row. Soft-delete coverage (Copilot #2): - Apply _bounce_if_deleted to all 4 return sites in get_tenant_with_dev_fallback (was only in get_current_tenant). get_optional_tenant has zero consumers in api/routes/ — left as-is to preserve its "endpoint works without auth" contract. Notification settings auth + SSRF (Copilot #3, #4): - GET /tenant/notifications now requires CurrentUserAdmin. The response contains the Slack webhook URL and PagerDuty integration key; members must not see them. - PATCH /tenant/notifications validates slack_webhook_url via the existing services/network_utils.validate_webhook_url (rejects private IPs, DNS-rebinding) AND pins the host to hooks.slack.com. - notifier._client now uses create_ssrf_safe_client as defense-in-depth against URLs persisted before the route validation was added. Cleanup (Copilot #5/#6): - Drop the redundant SQL update() + ORM mutate pattern in tenants.update_concurrency and tenants.delete_workspace. Single ORM mutation flushes one UPDATE on commit. Step-timeout coverage (Copilot #8): - New services/concurrency.py exports read_tenant_concurrency, shared between runner.py (enqueue-time injection) and executor.py (fallback load when job config lacks default_step_timeout_s — covers _recover_stuck_runs, which only ships trigger_data + endpoint_url). - Both worker-mode HTTP path and AI-step branch in executor now fall back to a tenant DB load when the runner-injected default is absent. Timeout classification (Copilot #9): - executor._invoke_function emits error.type="HTTPTimeout" for httpx.TimeoutException (covers ReadTimeout/Connect/Write/Pool); other RequestError subclasses keep "RequestError". - notifier.notify_run_failed reads error.type and gates dispatch on notify_on_run_timeout for {"TimeoutError","HTTPTimeout"}, otherwise notify_on_run_failed. The toggle does something now.

Copilot AI review requested due to automatic review settings April 7, 2026 22:59

Copilot started reviewing on behalf of hoootan April 7, 2026 22:59 View session

Copilot AI reviewed Apr 7, 2026

View reviewed changes

fix: address Copilot review on waterfall timeline

793d194

- Use started_at ≈ ended_at (≤1ms) for legacy detection instead of 500ms threshold - Compute timelineSpan from max effective end (start + duration) not just last start - Remove unused totalDuration prop and computeStepSpan helper

hoootan merged commit 5d2809a into main Apr 7, 2026
3 checks passed

hoootan deleted the refactor/waterfall-simplify branch April 7, 2026 23:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor: simplify waterfall timeline duration computation#9

refactor: simplify waterfall timeline duration computation#9
hoootan merged 2 commits into
mainfrom
refactor/waterfall-simplify

hoootan commented Apr 7, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 7, 2026

Uh oh!

Copilot AI Apr 7, 2026

Uh oh!

Copilot AI Apr 7, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

hoootan commented Apr 7, 2026

Summary

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants