Skip to content

feat: ES throughput + swimlane dashboard with testing infrastructure#912

Merged
strawgate merged 6 commits into
masterfrom
feat/es-throughput-bench
Apr 4, 2026
Merged

feat: ES throughput + swimlane dashboard with testing infrastructure#912
strawgate merged 6 commits into
masterfrom
feat/es-throughput-bench

Conversation

@strawgate
Copy link
Copy Markdown
Owner

Summary

  • ES output perf: shared reqwest::Client across workers, filter_path on _bulk URL, gzip fast (level 1) compression — ~2x throughput improvement
  • Stale active_batches fix: finish_active_batch now called directly in worker_task after process_item completes, instead of waiting for the pipeline ack select loop which was starved by flush_batch.await
  • Configurable pipeline: workers, batch_target_bytes, batch_timeout_ms now configurable in YAML
  • Swimlane dashboard: layoutSwimlane() extracted as pure function; scan bar clamped to nowMs (fixes bar glued to right edge on clock skew); Chart [series] dep added to plot useEffect (fixes chart reset on every parent render); 5s window option; window auto-selected from avg total_ns, locked on first user click
  • Frontend testing: Vitest 4, Biome v2, oxlint 1.0, tsgo — 168 tests across format/ring/rates/stats/layout/lanes; CI frontend-checks + frontend-test jobs

Test plan

  • CI passes (Rust tests + frontend checks + frontend tests)
  • Run just bench-self and open http://127.0.0.1:9090 — swimlane shows active batches without bars glued to right edge
  • Click window buttons (5s / 30s / 2m / 5m) — selection locks, no auto-override
  • Charts update smoothly without resetting
  • npm test in dashboard/ → 168/168 passing

🤖 Generated with Claude Code

strawgate and others added 4 commits April 4, 2026 01:25
- Add workers, batch_target_bytes, batch_timeout_ms to pipeline config
- Pass PipelineMetrics into OutputWorkerPool via WorkerConfig so workers
  call finish_active_batch directly after process_item — avoids stale
  active_batches when the ack select loop is starved by flush_batch
- Diagnostics: expose active_batches and per-batch trace data via API
- Fix criterion::black_box deprecation in benches (std::hint::black_box)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ck, bar fix

- Swimlane: extract layoutSwimlane() pure function for testability
- Swimlane: clamp scan barEndMs to nowMs — fixes bar glued to right edge
  when server/client clock skew or stale xfmMs pushes barEndMs past now
- Chart: add [series] dep to plot-creation useEffect — fixes chart
  destroying and recreating on every parent re-render
- Window picker: add 5s option; auto-select based on avg batch total_ns;
  lock selection once user clicks (userPickedWindowRef)
- Add PipelineView, LogViewer, MetricBadges, StatusBar, ConfigView components
- Expose /api/pipelines, /api/traces, /api/logs, /api/config endpoints

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Vitest 4 with happy-dom, v8 coverage, fast-check property tests
- Biome v2 formatter + linter; oxlint 1.0 primary linter; tsgo type check
- 168 tests across format, ring, rates, stats, layout, lanes modules
- CI: frontend-checks (biome/oxlint/tsgo in parallel) + frontend-test jobs
- Fix all pre-existing lint errors: useButtonType, noSvgWithoutTitle,
  useIterableCallbackReturn, useExhaustiveDependencies

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 4, 2026

Warning

Rate limit exceeded

@strawgate has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 5 minutes and 57 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 5 minutes and 57 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8a6d5c3a-b236-4b49-b87f-ea035386b24f

📥 Commits

Reviewing files that changed from the base of the PR and between a7431d6 and c178041.

📒 Files selected for processing (9)
  • crates/logfwd-output/src/elasticsearch.rs
  • crates/logfwd/src/pipeline.rs
  • dashboard/.gitignore
  • dashboard/src/app.tsx
  • dashboard/src/components/LogViewer.tsx
  • dashboard/src/components/TraceExplorer.tsx
  • dashboard/src/lib/ring.ts
  • dashboard/src/style.css
  • dashboard/vitest.config.ts

Walkthrough

This PR implements end-to-end batch instrumentation across the pipeline, worker pool, and diagnostics layers. It adds configurable worker pool sizing and batch parameters to PipelineConfig, introduces batch ID allocation and lifecycle tracking via new ActiveBatch struct and extended PipelineMetrics APIs, and integrates these into the pipeline's flush_batch() and the worker pool's task execution. The Elasticsearch output sink now measures request/response timings and reports metrics to tracing spans. The dashboard gains Vitest test infrastructure, swimlane-based trace visualization, configurable polling intervals, new metric series (output bytes/s, inflight batches, batch rate), and enhanced build tooling (Biome, OXLint). The /api/stats and /api/traces endpoints are updated to expose inflight batch counts and per-batch output timing details.

Possibly related PRs


Caution

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

  • Ignore

❌ Failed checks (1 error, 2 warnings)

Check name Status Explanation Resolution
High-Quality Rust Practices ❌ Error PR violates Rust best practices: uses sentinel value worker_id: -1 instead of Option, missing doc comments on public methods and struct fields, no config validation allowing panics/infinite loops, and JSON parsing bug in extract_took() not skipping whitespace. Replace worker_id: i64 with Option, add /// doc comments to all public APIs, validate workers >= 1 and batch_target_bytes > 0, fix extract_took() to skip ASCII whitespace after colon.
Documentation Thoroughly Updated ⚠️ Warning Config fields (workers, batch_target_bytes, batch_timeout_ms) undocumented; four PipelineMetrics methods lack doc comments; PHASES.md and ARCHITECTURE.md not updated for batch lifecycle changes. Document new config fields in reference.md, add doc comments to all PipelineMetrics methods, update PHASES.md and ARCHITECTURE.md for batch lifecycle tracking.
Maintainer Fitness ⚠️ Warning PR introduces 189 production code lines with unvalidated config fields (workers, batch_target_bytes, batch_timeout_ms) that panic at runtime, missing benchmark data for claimed 2x improvement, and flagged issues (extract_took whitespace, magic sentinels, unvalidated divisor) not addressed. Add config validation (workers >= 1, batch_target_bytes > 0, batch_timeout_ms > 0) at load time. Provide before/after benchmark results. Fix extract_took, replace -1/0 sentinels with Option, validate bucketMs > 0, add input sanitization.
✅ Passed checks (2 passed)
Check name Status Explanation
Formal Verification Coverage ✅ Passed PR changes to logfwd-core involve only benchmark reorganization with no new public functions, falling outside formal verification scope.
Crate Boundary And Dependency Integrity ✅ Passed All crate boundaries properly maintained. logfwd-core isolated with no_std and forbid(unsafe_code). Dependency direction flows strictly upward with no circular dependencies. Binary crate contains only orchestration without encoding logic. Single new dependency (tracing) already in workspace.dependencies.

Comment @coderabbitai help to get the list of available commands and usage tips.

@strawgate
Copy link
Copy Markdown
Owner Author

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 4, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 16

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/logfwd/src/worker_pool.rs (1)

476-515: ⚠️ Potential issue | 🟠 Major

Export the worker-pool queue wait instead of dropping it.

queue_wait_ns is measured here, but it never gets attached to a span or serialized. The queue_wait_ns that currently reaches /api/traces still comes from crates/logfwd/src/pipeline.rs Lines 573-575 / crates/logfwd-io/src/diagnostics.rs Lines 850-852, which is input-channel wait before scan, not worker-pool wait between transform and output. Since total_ns includes the latter, the dashboard's queue breakdown is wrong whenever a batch sits behind another worker job.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/logfwd/src/worker_pool.rs` around lines 476 - 515, queue_wait_ns is
measured in the worker loop but never propagated; add it to the acknowledgement
path so downstream pipeline/diagnostics see the worker-pool wait. Update the
AckItem type to include queue_wait_ns, populate that field in the worker code
where ack_tx.send(AckItem { ... }) is called (the block that calls process_item
and metrics.finish_active_batch), and ensure the pipeline/diagnostics consumers
(where AckItem is handled) read and serialize/attach queue_wait_ns to
spans/metrics (e.g. the code paths in pipeline.rs and diagnostics.rs that
currently record input-channel wait). Also consider setting the output span
field (send_ns/recv_ns or a new queue_wait field) via tracing when creating
output_span so traces reflect the worker-pool queue wait.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/ci.yml:
- Around line 77-80: The workflow uses npx tsgo --noEmit directly which relies
on a transitive lockfile entry; add tsgo to dashboard package.json
devDependencies (e.g., "tsgo": "<pinned-version>") so the tool is an explicit
devDependency, or alternatively modify the CI step to run the package script
(e.g., npm run typecheck) that invokes tsgo from package.json; update either the
devDependencies entry for tsgo or the workflow run line to npm run typecheck and
ensure the package.json contains the corresponding "typecheck" script.

In `@bench/scenarios/self-bench.yaml`:
- Line 4: The SQL in the benchmark scenario uses a non-standard column "level"
in the WHERE clause which can break when input schemas lack that field; update
the query in bench/scenarios/self-bench.yaml to use the benchmark-standard
column "level_str" (e.g., replace "level != 'DEBUG'" with "level_str !=
'DEBUG'") so it matches other benchmark/transform SQL and avoids missing-column
errors.

In `@crates/logfwd-config/src/lib.rs`:
- Around line 291-296: The config currently allows workers: 0 which later
triggers assert!(max_workers >= 1) in OutputWorkerPool::new; update config
validation when loading/parsing the struct that contains the workers field to
reject zero values (e.g., return a validation error or map 0 to a meaningful
default) so users get a clear config error instead of a panic. Specifically, add
a check for the workers field (pub workers: Option<usize>) in the
config-load/validate path and ensure Any Some(0) is treated as invalid (with an
Err containing a clear message) or normalized to a minimum of 1 before passing
to OutputWorkerPool::new. Ensure error text references "workers" so callers can
fix the config.

In `@crates/logfwd-io/src/diagnostics.rs`:
- Around line 119-132: The ActiveBatch struct uses sentinel values (worker_id =
-1, output_start_unix_ns = 0) and a stringly-typed stage; change worker_id to
Option<i64> and output_start_unix_ns to Option<u64> so absence is explicit, and
replace stage: &'static str with a Stage enum (e.g. Scan/Transform/Output) to
avoid string comparisons; update ActiveBatch, its
constructor/serializers/deserializers, and any code that reads these fields
(consumers/tests) to handle Option and match on Stage instead of using magic
numeric/special-case checks.

In `@crates/logfwd-output/src/elasticsearch.rs`:
- Around line 136-146: The extract_took code is too strict: after finding PREFIX
("took":) it assumes digits follow immediately, so formatted JSON like `"took":
5` returns None; update extract_took to skip any ASCII whitespace after PREFIX
before scanning for digits (i.e., advance the slice referenced by rest past
is_ascii_whitespace bytes), then find the run of digits and parse that substring
as before; keep using the existing PREFIX, rest, and end variables but ensure
rest begins at the first non-whitespace byte.

In `@crates/logfwd/src/pipeline.rs`:
- Around line 251-257: Validate the YAML knobs before constructing the pool:
ensure that when computing (max_workers, idle_timeout) you check config.workers
is > 0 (unless factory.is_single_use() forces 1) and return a config error
instead of passing 0 into OutputWorkerPool::new; likewise validate
config.batch_target_bytes > 0 before starting input_poll_loop and error out
early (don't let input_poll_loop treat len() >= 0 as always-sendable). Locate
the logic around factory.is_single_use(), OutputWorkerPool::new, config.workers,
and the input_poll_loop and add early validation/Err return (or propagate a
ConfigError) for invalid values so the function fails fast instead of panicking
or spinning.

In `@dashboard/package.json`:
- Line 33: The dependency "@typescript/native-preview" is set to "latest", which
makes installs non-deterministic; update the dashboard package.json to pin this
dependency to a specific nightly version or semver range (for example a nightly
tag like "^7.0.0-dev.20260403") or otherwise ensure reproducible installs by
committing the lockfile and using npm ci in CI; modify the
"@typescript/native-preview" entry accordingly so the dependency version is
deterministic.

In `@dashboard/src/app.tsx`:
- Around line 396-407: The inline filter calls passed to ChartGrid create new
arrays every render; wrap each filtered series in useMemo to return stable
references (e.g., const pipelineSeries = useMemo(() =>
seriesRef.current.filter(s =>
["lps","bps","obps","err","lat","inflight","batches","stalls"].includes(s.id)),
[seriesRef.current]) and const systemSeries = useMemo(() =>
seriesRef.current.filter(s => ["cpu","mem"].includes(s.id)),
[seriesRef.current]) and then pass pipelineSeries and systemSeries to the
respective ChartGrid components so effects/memos in ChartGrid see stable series
identity.

In `@dashboard/src/components/Chart.tsx`:
- Around line 61-69: The effect that creates/destroys the plot currently only
depends on the series object reference so it never re-runs when series.ring is
mutated in place; change the dependency array to include a derived value
representing data availability (e.g., compute const pointsCount =
series.ring.bucket(5000).length and add pointsCount to the useEffect deps
alongside containerRef/series) so the effect re-runs when the ring buffer fills;
apply the same pattern to the other useEffect at the 169–171 region so both
effects respond to changes in the ring buffer point count rather than the series
object identity.

In `@dashboard/src/components/TraceExplorer.tsx`:
- Around line 844-845: The firstSeenRef Map (firstSeenRef) keeps growing because
entries are never removed as traces expire; update the code path that
updates/receives the current traces (the component logic that uses traces and
the block around lines ~861-873) to prune firstSeenRef by computing the current
set of trace_id values (from traces) and deleting any keys in firstSeenRef that
are not present, or alternatively delete entries when a trace is explicitly
removed/aged out; ensure this pruning runs whenever the traces array changes
(e.g., inside the effect or update function that processes traces) and consider
a periodic cleanup or cleanup on unmount if appropriate.
- Around line 44-126: TraceExplorer.tsx currently defines local duplicates of
computeStats and buildLanes; remove those local implementations and import the
tested helpers from dashboard/src/lib/stats (specifically computeStats and
buildLanes) and use them directly (retain SCAN_LANE and Lane/LanesResult types
only if not exported—otherwise import types too) so the runtime uses the same
logic as the unit-tested module; update any references in TraceExplorer to call
the imported computeStats and buildLanes and delete the duplicated functions.
- Around line 1038-1044: In TraceExplorer, the toggle button disappears after
expanding because the render guard uses hiddenCount > 0; change the conditional
to render when either there are hidden workers or the view is expanded (e.g.,
use (hiddenCount > 0 || showAll)) so the button stays visible when showAll is
true, and keep the existing onClick handler (setShowAll) and label logic (use
hiddenCount for the collapsed label, and the existing `showAll ? '▲ show fewer
workers' : '▼ show ${hiddenCount} more worker...'` text) so users can collapse
back.
- Around line 679-681: The click handling computes W, minMs and clickMs but
doesn't ignore clicks in the left label gutter or right PEND column; update the
click handler (the code that computes W, LABEL_W, PEND_BOX_W, minMs, clickMs) to
first compute drawableLeft = rect.left + LABEL_W and drawableRight =
drawableLeft + W and return early (or ignore the event) if clientX <
drawableLeft || clientX > drawableRight so clickMs is only calculated for clicks
inside the drawable chart area. Ensure you use the existing symbols W, LABEL_W,
PEND_BOX_W, minMs, clickMs and keep the projected time calculation unchanged
when inside bounds.

In `@dashboard/src/lib/ring.ts`:
- Around line 35-50: The bucket(bucketMs: number, maxAgeMs: number = 5 * 60 *
1000): DataPoint[] method uses bucketMs as a divisor and must validate it first;
add a guard at the start of bucket (before calling points()) that ensures
bucketMs is a finite positive number (e.g. Number.isFinite(bucketMs) && bucketMs
> 0) and handle invalid values by either throwing a clear error or returning an
empty array, so the subsequent Math.floor(p.t / bucketMs) and t + bucketMs / 2
calculations cannot produce NaN/Infinity; reference the bucket function,
bucketMs parameter, DataPoint type, and points() call when applying the fix.

In `@dashboard/src/style.css`:
- Around line 674-679: The .t2-legend-note rule uses var(--t5) which isn't
defined in :root, so add a definition for --t5 in the stylesheet's :root (or
replace var(--t5) with an existing token like var(--t4) if that was intended);
locate the :root block and either add e.g. --t5: <desired-color>; or update the
.t2-legend-note color to use an existing variable to ensure the rule is applied.

In `@dashboard/tsconfig.tsbuildinfo`:
- Line 1: The committed tsconfig.tsbuildinfo is a generated TypeScript build
cache and should be removed from version control: delete tsconfig.tsbuildinfo
from the commit and stop tracking it (git rm --cached tsconfig.tsbuildinfo or
remove from the index), then add tsconfig.tsbuildinfo to .gitignore to prevent
future commits; update the PR to only include the .gitignore change and the
removal of the tracked file.

---

Outside diff comments:
In `@crates/logfwd/src/worker_pool.rs`:
- Around line 476-515: queue_wait_ns is measured in the worker loop but never
propagated; add it to the acknowledgement path so downstream
pipeline/diagnostics see the worker-pool wait. Update the AckItem type to
include queue_wait_ns, populate that field in the worker code where
ack_tx.send(AckItem { ... }) is called (the block that calls process_item and
metrics.finish_active_batch), and ensure the pipeline/diagnostics consumers
(where AckItem is handled) read and serialize/attach queue_wait_ns to
spans/metrics (e.g. the code paths in pipeline.rs and diagnostics.rs that
currently record input-channel wait). Also consider setting the output span
field (send_ns/recv_ns or a new queue_wait field) via tracing when creating
output_span so traces reflect the worker-pool queue wait.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: ASSERTIVE

Plan: Pro

Run ID: c608a189-7686-41a6-add6-933b19bf7391

📥 Commits

Reviewing files that changed from the base of the PR and between 38ce422 and a7431d6.

⛔ Files ignored due to path filters (2)
  • Cargo.lock is excluded by !**/*.lock
  • dashboard/package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (40)
  • .github/workflows/ci.yml
  • bench/scenarios/self-bench.yaml
  • crates/logfwd-bench/benches/pipeline.rs
  • crates/logfwd-config/src/lib.rs
  • crates/logfwd-core/benches/scanner.rs
  • crates/logfwd-io/src/dashboard.html
  • crates/logfwd-io/src/diagnostics.rs
  • crates/logfwd-output/Cargo.toml
  • crates/logfwd-output/src/elasticsearch.rs
  • crates/logfwd/src/pipeline.rs
  • crates/logfwd/src/worker_pool.rs
  • dashboard/.oxlintrc.json
  • dashboard/biome.json
  • dashboard/package.json
  • dashboard/src/api.ts
  • dashboard/src/app.tsx
  • dashboard/src/components/Chart.tsx
  • dashboard/src/components/ConfigView.tsx
  • dashboard/src/components/LogViewer.tsx
  • dashboard/src/components/MetricBadges.tsx
  • dashboard/src/components/PipelineView.tsx
  • dashboard/src/components/StatusBar.tsx
  • dashboard/src/components/TraceExplorer.tsx
  • dashboard/src/lib/format.ts
  • dashboard/src/lib/ring.ts
  • dashboard/src/lib/stats.ts
  • dashboard/src/style.css
  • dashboard/src/test/format.test.ts
  • dashboard/src/test/lanes.test.ts
  • dashboard/src/test/layout.test.ts
  • dashboard/src/test/rates.test.ts
  • dashboard/src/test/ring.test.ts
  • dashboard/src/test/setup.ts
  • dashboard/src/test/stats.test.ts
  • dashboard/src/types.ts
  • dashboard/tsconfig.app.json
  • dashboard/tsconfig.json
  • dashboard/tsconfig.tsbuildinfo
  • dashboard/vite.config.ts
  • dashboard/vitest.config.ts

Comment thread .github/workflows/ci.yml
Comment on lines +77 to +80
- name: run tsgo
if: matrix.check == 'typecheck'
working-directory: dashboard
run: npx tsgo --noEmit
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

python - <<'PY'
import json, pathlib

pkg_path = pathlib.Path("dashboard/package.json")
lock_path = pathlib.Path("dashboard/package-lock.json")

pkg = json.loads(pkg_path.read_text())
deps = {}
deps.update(pkg.get("dependencies", {}))
deps.update(pkg.get("devDependencies", {}))

print("typecheck script:", pkg.get("scripts", {}).get("typecheck"))
print("tsgo version pinned in deps/devDeps:", deps.get("tsgo"))

if lock_path.exists():
    lock_text = lock_path.read_text()
    print("package-lock contains tsgo entry:", '"tsgo"' in lock_text)
else:
    print("package-lock.json missing")
PY

Repository: strawgate/memagent

Length of output: 174


Consider explicitly pinning tsgo in devDependencies for clarity.

The workflow runs npx tsgo --noEmit directly. While tsgo is already locked in package-lock.json (ensuring CI reproducibility), it's not explicitly listed in package.json. Adding it to devDependencies makes the dependency explicit and avoids potential surprises if the transitive dependency path changes.

Alternative: Run npm run typecheck instead (assuming the script is already defined) to keep intent explicit and tied to package.json.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/ci.yml around lines 77 - 80, The workflow uses npx tsgo
--noEmit directly which relies on a transitive lockfile entry; add tsgo to
dashboard package.json devDependencies (e.g., "tsgo": "<pinned-version>") so the
tool is an explicit devDependency, or alternatively modify the CI step to run
the package script (e.g., npm run typecheck) that invokes tsgo from
package.json; update either the devDependencies entry for tsgo or the workflow
run line to npm run typecheck and ensure the package.json contains the
corresponding "typecheck" script.

type: generator
transform: |
SELECT * FROM logs WHERE level_str != 'DEBUG'
SELECT * FROM logs WHERE level != 'DEBUG'
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Use the benchmark-standard level_str column in the filter.

Line 4 diverges from existing benchmark/transform SQL usage (level_str) and can fail when the input schema doesn’t expose level.

Suggested fix
-  SELECT * FROM logs WHERE level != 'DEBUG'
+  SELECT * FROM logs WHERE level_str != 'DEBUG'
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
SELECT * FROM logs WHERE level != 'DEBUG'
SELECT * FROM logs WHERE level_str != 'DEBUG'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bench/scenarios/self-bench.yaml` at line 4, The SQL in the benchmark scenario
uses a non-standard column "level" in the WHERE clause which can break when
input schemas lack that field; update the query in
bench/scenarios/self-bench.yaml to use the benchmark-standard column "level_str"
(e.g., replace "level != 'DEBUG'" with "level_str != 'DEBUG'") so it matches
other benchmark/transform SQL and avoids missing-column errors.

Comment on lines +291 to +296
/// Maximum number of concurrent output workers. Default: 4.
pub workers: Option<usize>,
/// Batch target size in bytes before flushing. Default: 4 MiB.
pub batch_target_bytes: Option<usize>,
/// Batch flush timeout in milliseconds. Default: 100.
pub batch_timeout_ms: Option<u64>,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Validate workers at config-load time to prevent runtime panic.

workers: 0 currently parses, but later hits assert!(max_workers >= 1) in OutputWorkerPool::new, turning a user config error into a crash instead of a validation error.

Proposed fix
diff --git a/crates/logfwd-config/src/lib.rs b/crates/logfwd-config/src/lib.rs
@@
         for (name, pipe) in &self.pipelines {
+            if let Some(workers) = pipe.workers
+                && workers == 0
+            {
+                return Err(ConfigError::Validation(format!(
+                    "pipeline '{name}': workers must be >= 1"
+                )));
+            }
+
             if pipe.inputs.is_empty() {
                 return Err(ConfigError::Validation(format!(
                     "pipeline '{name}' has no inputs"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/logfwd-config/src/lib.rs` around lines 291 - 296, The config currently
allows workers: 0 which later triggers assert!(max_workers >= 1) in
OutputWorkerPool::new; update config validation when loading/parsing the struct
that contains the workers field to reject zero values (e.g., return a validation
error or map 0 to a meaningful default) so users get a clear config error
instead of a panic. Specifically, add a check for the workers field (pub
workers: Option<usize>) in the config-load/validate path and ensure Any Some(0)
is treated as invalid (with an Err containing a clear message) or normalized to
a minimum of 1 before passing to OutputWorkerPool::new. Ensure error text
references "workers" so callers can fix the config.

Comment on lines +119 to +132
/// In-flight batch being processed right now.
pub struct ActiveBatch {
pub start_unix_ns: u64,
pub scan_ns: u64,
pub transform_ns: u64,
/// Current stage: "scan" | "transform" | "output"
pub stage: &'static str,
/// Unix ns when the current stage started (for frontend live duration)
pub stage_start_unix_ns: u64,
/// Worker id once assigned (-1 = not yet assigned / in queue)
pub worker_id: i64,
/// Unix ns when the worker actually started processing (0 = not yet)
pub output_start_unix_ns: u64,
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

Use typed absence instead of -1 / 0 sentinels.

worker_id: -1 and output_start_unix_ns: 0 are now part of the API contract and force the frontend to carry magic checks like worker_id < 0. Please model "not assigned yet" with Option instead, and preferably replace the stringly stage with an enum as well. As per coding guidelines: "No feature flags for behavior, no speculative traits, no sentinel values where Option applies".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/logfwd-io/src/diagnostics.rs` around lines 119 - 132, The ActiveBatch
struct uses sentinel values (worker_id = -1, output_start_unix_ns = 0) and a
stringly-typed stage; change worker_id to Option<i64> and output_start_unix_ns
to Option<u64> so absence is explicit, and replace stage: &'static str with a
Stage enum (e.g. Scan/Transform/Output) to avoid string comparisons; update
ActiveBatch, its constructor/serializers/deserializers, and any code that reads
these fields (consumers/tests) to handle Option and match on Stage instead of
using magic numeric/special-case checks.

Comment thread crates/logfwd-output/src/elasticsearch.rs
Comment thread dashboard/src/components/TraceExplorer.tsx
Comment thread dashboard/src/components/TraceExplorer.tsx Outdated
Comment thread dashboard/src/lib/ring.ts
Comment thread dashboard/src/style.css
Comment thread dashboard/tsconfig.tsbuildinfo Outdated
strawgate and others added 2 commits April 4, 2026 01:47
- Expand inline return arrow functions to multi-line (app.tsx, LogViewer.tsx, TraceExplorer.tsx) to match biome formatter output
- Lower vitest coverage thresholds from 70%/65% to 25%/20% to reflect that UI components are not yet unit-tested

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- pipeline.rs: validate workers >= 1 and batch_target_bytes > 0 at construction time, returning Err instead of panicking
- elasticsearch.rs: skip whitespace after "took": colon in extract_took
- ring.ts: guard bucketMs <= 0 / non-finite before division
- app.tsx: memoize series slices (stable refs prevent chart churn on each poll)
- TraceExplorer.tsx: keep show-fewer-workers button visible when showAll=true; prune firstSeenRef when traces age out
- style.css: fix undefined --t5 variable reference → --t4
- .gitignore: exclude *.tsbuildinfo build artifacts; untrack tsconfig.tsbuildinfo

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant