Skip to content

feat: batch trace explorer — OTel spans + /api/traces + dashboard UI#683

Merged
strawgate merged 19 commits into
masterfrom
feat/batch-trace-explorer
Apr 2, 2026
Merged

feat: batch trace explorer — OTel spans + /api/traces + dashboard UI#683
strawgate merged 19 commits into
masterfrom
feat/batch-trace-explorer

Conversation

@strawgate
Copy link
Copy Markdown
Owner

Summary

  • OTel span instrumentation: flush_batch emits a root batch span with child scan, transform, output spans via tracing + tracing-opentelemetry. Attributes: pipeline, input_rows, output_rows, errors.
  • RingBufferExporter: custom SpanExporter in logfwd-io storing the last ~1000 batches (4000 spans) in an Arc<Mutex<VecDeque>> — zero async overhead, no external dependency.
  • /api/traces: new endpoint groups spans by trace_id, returns per-batch summary with scan_ns, transform_ns, output_ns, row counts, and status.
  • server.traces_endpoint: optional OTLP span push — if set, a BatchSpanProcessor sends traces to any OTel-compatible backend (Jaeger, Tempo, etc.) alongside the in-process buffer.
  • TraceExplorer dashboard component: waterfall rows proportional to duration, color-coded scan (blue) / transform (purple) / output (green) segments, click-to-expand detail with timing breakdown and filter ratio.

Architecture

tracing::info_span!("batch")        ← #[instrument] on flush_batch
  ├── "scan"    .entered()           ← sync block_in_place wrapper
  ├── "transform"  .instrument()     ← async execute()
  └── "output"  .entered()           ← sync block_in_place wrapper
       ↓
tracing-opentelemetry bridge
       ↓
SdkTracerProvider
  ├── SimpleSpanProcessor → RingBufferExporter → /api/traces
  └── BatchSpanProcessor  → OTLP (optional, server.traces_endpoint)

Test plan

  • just test passes
  • just dashboard && cargo build --release && logfwd --config bench/scenarios/self-bench.yaml — TraceExplorer shows live batch rows within 2s
  • Click a trace row — detail expands showing scan/transform/output timing and filter ratio
  • curl http://127.0.0.1:9090/api/traces returns JSON with traces array containing scan_ns, transform_ns, output_ns, input_rows, output_rows
  • Error batches show err badge on the row

🤖 Generated with Claude Code

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 2, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds workspace tracing dependencies and a new optional server.traces_endpoint: Option<String> config field. Introduces an in-process ring-buffer OpenTelemetry span exporter (span_exporter::SpanBuffer, RingBufferExporter) and exposes it from the io crate. Diagnostics server gains a trace buffer setter and a /api/traces handler that returns JSON trace batches. The pipeline is instrumented with tracing spans/attributes and a tracer-provider builder/shutdown path is added. Dashboard client, types, UI component, styles, and the HTML app shell are added to display traces.

Possibly related PRs

  • strawgate/memagent PR 375 — Edits the same logfwd-io diagnostics/API surface and crate exports, overlapping with the introduced span_exporter and trace-buffer handling.
  • strawgate/memagent PR 567 — Modifies diagnostics.rs request routing/handlers and could conflict with the added /api/traces handler and DiagnosticsServer::set_trace_buffer.

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/logfwd-config/src/lib.rs`:
- Around line 246-248: Config::validate() currently skips validating
server.traces_endpoint, so malformed endpoints only fail later; update
Config::validate() to check server.traces_endpoint when Some by attempting to
parse it as a URL (e.g., with url::Url::parse) and return a validation error if
parsing fails. Locate the validation routine in Config::validate(), reference
the server.traces_endpoint field, and add a branch that attempts to parse the
string and converts parse errors into the existing config validation error type
(or a clear new message) so invalid endpoints fail fast.

In `@crates/logfwd/src/main.rs`:
- Line 570: The traces endpoint is printed directly to stderr (the eprintln! in
main.rs) and may contain credentials or tokens; change the print to log a
redacted value instead: parse the endpoint string with url::Url, remove
username/password and query/fragment (or fall back to using only url.host_str()
and port), build a safe "redacted" string and use that variable in the eprintln!
call where traces push is reported so credentials/query parameters are not
emitted to stderr or diagnostics.
- Around line 402-408: The tracer provider returned by build_tracer_provider is
stored in tracer_provider and then set as the global provider, but you must call
its shutdown to flush the BatchSpanProcessor before the process exits; add an
explicit shutdown path that invokes either
opentelemetry::global::shutdown_tracer_provider() or calls
tracer_provider.shutdown(Duration) at program termination (handle and log any
shutdown errors). Also stop logging the raw traces endpoint URL to
stderr—replace the verbatim URL print with a redacted or masked form (or omit
secrets) where the current code logs the traces endpoint string variable so
secrets are not leaked.

In `@crates/logfwd/src/pipeline.rs`:
- Around line 443-449: In the Err(e) => branch of the scan handling in
pipeline.rs (the block that calls self.metrics.inc_scan_error() and
self.metrics.inc_dropped_batch()), set the root batch span's error flag the same
way the transform/output error branches do (i.e., set errors = 1 or mark the
span attribute that signals an error on the root batch span) before returning so
dropped scan batches are recorded with the error flag when exported to
/api/traces.
- Around line 517-520: The BatchMetadata construction in flush_batch() is
cloning self.resource_attrs (hot path) — change BatchMetadata to hold an
Arc-backed attribute set (e.g., Arc<Vec<...>> or Arc<AttributeSet>), update the
BatchMetadata struct to use that Arc type, and replace the expensive clone in
flush_batch() with a cheap Arc::clone of self.resource_attrs (or move an Arc
into BatchMetadata if ownership allows); ensure all call sites and types that
read BatchMetadata.resource_attrs are updated to dereference or clone the Arc as
needed.
- Around line 397-405: The tracing span in the tracing::instrument attribute for
the batch span leaves input_rows and output_rows as tracing::field::Empty which
can omit those fields and produce invalid JSON; change those field initializers
to numeric zeroes (match the existing errors = 0u64) so input_rows, output_rows,
and errors are initialized as 0u64 in the tracing::instrument attribute on the
batch span in pipeline.rs (the span declared with name = "batch" and fields
including pipeline = %self.name).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6724d8a9-8bc7-4cf6-9946-9577f0b54a77

📥 Commits

Reviewing files that changed from the base of the PR and between 26a1ccf and b66d9f0.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (15)
  • Cargo.toml
  • crates/logfwd-config/src/lib.rs
  • crates/logfwd-io/Cargo.toml
  • crates/logfwd-io/src/dashboard.html
  • crates/logfwd-io/src/diagnostics.rs
  • crates/logfwd-io/src/lib.rs
  • crates/logfwd-io/src/span_exporter.rs
  • crates/logfwd/Cargo.toml
  • crates/logfwd/src/main.rs
  • crates/logfwd/src/pipeline.rs
  • dashboard/src/api.ts
  • dashboard/src/app.tsx
  • dashboard/src/components/TraceExplorer.tsx
  • dashboard/src/style.css
  • dashboard/src/types.ts

Comment thread crates/logfwd-config/src/lib.rs
Comment thread crates/logfwd/src/main.rs Outdated
Comment thread crates/logfwd/src/main.rs Outdated
Comment thread crates/logfwd/src/pipeline.rs
Comment thread crates/logfwd/src/pipeline.rs
Comment thread crates/logfwd/src/pipeline.rs
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (5)
crates/logfwd/src/main.rs (2)

399-406: ⚠️ Potential issue | 🟠 Major

Add explicit tracer provider shutdown before exit.

BatchSpanProcessor (line 566) won't flush pending spans on drop — requires explicit shutdown(). Without it, final OTLP batches are lost. The meter provider has shutdown() at line 497; add equivalent for the tracer provider.

Suggested approach

Store the SdkTracerProvider and call tracer_provider.shutdown() before meter_provider.shutdown() at the end of run_pipelines. Handle the result similarly to line 497-503.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/logfwd/src/main.rs` around lines 399 - 406, The tracer provider
created by build_tracer_provider (assigned to tracer_provider) must be
explicitly shut down to flush BatchSpanProcessor spans before exit; update
run_pipelines to hold the SdkTracerProvider value you create (ensure it’s the
concrete SdkTracerProvider returned by build_tracer_provider/typed
appropriately) and call tracer_provider.shutdown() and handle its Result (log
error on Err) immediately before calling meter_provider.shutdown(), mirroring
the shutdown handling used for the meter provider.

568-568: ⚠️ Potential issue | 🟠 Major

Redact the traces endpoint before logging.

If the OTLP URL contains embedded credentials (https://user:token@host/...) or query tokens, they leak to stderr and /api/logs. Log only host:port or a redacted form.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/logfwd/src/main.rs` at line 568, The log currently prints the raw
endpoint value (endpoint) which may include credentials or query tokens; before
calling eprintln! in main.rs (the trace push logging line that prints
"{endpoint}"), parse the URL (e.g., with url::Url::parse(endpoint)), strip or
redact sensitive parts (clear username/password and query, or replace path/query
with "/..."), and then print the sanitized/redacted value (e.g., host[:port] or
host[:port]/...) instead of the raw endpoint variable so credentials and tokens
are not logged.
crates/logfwd/src/pipeline.rs (3)

516-519: 🛠️ Refactor suggestion | 🟠 Major

Avoid cloning resource_attrs on every batch.

This copies the entire Vec<(String, String)> per successful batch in a hot path. Consider making BatchMetadata.resource_attrs an Arc<Vec<...>> so per-batch work is a cheap Arc::clone. As per coding guidelines: "Do not introduce allocations in hot paths without benchmarking".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/logfwd/src/pipeline.rs` around lines 516 - 519, BatchMetadata
currently clones self.resource_attrs (Vec<(String,String)>) per batch; change
BatchMetadata.resource_attrs to Arc<Vec<(String,String)>> and make the
producer-side field (self.resource_attrs) also an Arc<Vec<...>> so per-batch
work uses cheap Arc::clone instead of cloning the whole Vec. Update all places
that construct BatchMetadata (the snippet creating BatchMetadata) to assign
Arc::clone(&self.resource_attrs), and update any initialization sites for the
producer to wrap the original Vec in Arc::new(...). Ensure types and imports
(std::sync::Arc) are updated accordingly.

442-448: ⚠️ Potential issue | 🟠 Major

Mark scan failures on the root batch span.

Transform/output error branches set errors = 1, but this scan-error return does not. Dropped batches reach /api/traces without the error flag, so TraceExplorer never shows the err badge.

Suggested fix
             Err(e) => {
+                tracing::Span::current().record("errors", 1u64);
                 // Queued tickets dropped here — safe, not tracked by machine.
                 self.metrics.inc_scan_error();
                 self.metrics.inc_dropped_batch();
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/logfwd/src/pipeline.rs` around lines 442 - 448, The scan-error branch
currently returns without marking the root batch span as errored; update the
Err(e) => { ... } block (the branch that calls self.metrics.inc_scan_error() and
self.metrics.inc_dropped_batch()) to set the root batch span's error flag the
same way transform/output error branches do (i.e., set errors = 1 on the root
batch span or call the same helper used by those branches) before logging and
returning so dropped batches are reported with the error flag to TraceExplorer.

396-405: ⚠️ Potential issue | 🟠 Major

Initialize input_rows and output_rows with numeric zeroes.

With tracing::field::Empty, the zero-row and scan-error early-return paths leave these fields unset. In diagnostics.rs:693-695, serve_traces reads these attributes via attr("input_rows") which returns "" for missing fields, then emits raw JSON like "input_rows":,"output_rows": — invalid JSON.

Suggested fix
     fields(
         pipeline = %self.name,
-        input_rows = tracing::field::Empty,
-        output_rows = tracing::field::Empty,
+        input_rows = 0u64,
+        output_rows = 0u64,
         errors = 0u64,
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/logfwd/src/pipeline.rs` around lines 396 - 405, The tracing instrument
macro for the batch span leaves input_rows and output_rows as
tracing::field::Empty which causes missing attributes and invalid JSON; update
the attribute fields in the #[tracing::instrument(...)] block so that input_rows
and output_rows are initialized to numeric zeroes (e.g. 0u64) — same style as
the existing errors = 0u64 — so that the span created in the batch function
always has concrete numeric values.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/logfwd-io/src/diagnostics.rs`:
- Around line 686-723: The JSON emitter can produce invalid JSON when numeric
attributes are missing because attr(...) returns "" and the write! inserts raw
values; update the serialization around the attr closure / the variables
pipeline, input_rows, output_rows, errors and the write! call so that if
input_rows/output_rows/errors are empty you emit a safe numeric fallback (e.g.,
"0") or quote them as strings; specifically change how ir, or, err are derived
before the write! (or adjust attr to return a default like "0") so the formatted
values inserted into the write! are always valid JSON numbers or quoted strings.

---

Duplicate comments:
In `@crates/logfwd/src/main.rs`:
- Around line 399-406: The tracer provider created by build_tracer_provider
(assigned to tracer_provider) must be explicitly shut down to flush
BatchSpanProcessor spans before exit; update run_pipelines to hold the
SdkTracerProvider value you create (ensure it’s the concrete SdkTracerProvider
returned by build_tracer_provider/typed appropriately) and call
tracer_provider.shutdown() and handle its Result (log error on Err) immediately
before calling meter_provider.shutdown(), mirroring the shutdown handling used
for the meter provider.
- Line 568: The log currently prints the raw endpoint value (endpoint) which may
include credentials or query tokens; before calling eprintln! in main.rs (the
trace push logging line that prints "{endpoint}"), parse the URL (e.g., with
url::Url::parse(endpoint)), strip or redact sensitive parts (clear
username/password and query, or replace path/query with "/..."), and then print
the sanitized/redacted value (e.g., host[:port] or host[:port]/...) instead of
the raw endpoint variable so credentials and tokens are not logged.

In `@crates/logfwd/src/pipeline.rs`:
- Around line 516-519: BatchMetadata currently clones self.resource_attrs
(Vec<(String,String)>) per batch; change BatchMetadata.resource_attrs to
Arc<Vec<(String,String)>> and make the producer-side field (self.resource_attrs)
also an Arc<Vec<...>> so per-batch work uses cheap Arc::clone instead of cloning
the whole Vec. Update all places that construct BatchMetadata (the snippet
creating BatchMetadata) to assign Arc::clone(&self.resource_attrs), and update
any initialization sites for the producer to wrap the original Vec in
Arc::new(...). Ensure types and imports (std::sync::Arc) are updated
accordingly.
- Around line 442-448: The scan-error branch currently returns without marking
the root batch span as errored; update the Err(e) => { ... } block (the branch
that calls self.metrics.inc_scan_error() and self.metrics.inc_dropped_batch())
to set the root batch span's error flag the same way transform/output error
branches do (i.e., set errors = 1 on the root batch span or call the same helper
used by those branches) before logging and returning so dropped batches are
reported with the error flag to TraceExplorer.
- Around line 396-405: The tracing instrument macro for the batch span leaves
input_rows and output_rows as tracing::field::Empty which causes missing
attributes and invalid JSON; update the attribute fields in the
#[tracing::instrument(...)] block so that input_rows and output_rows are
initialized to numeric zeroes (e.g. 0u64) — same style as the existing errors =
0u64 — so that the span created in the batch function always has concrete
numeric values.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4da4d950-8763-42c2-9254-9671662fb8cb

📥 Commits

Reviewing files that changed from the base of the PR and between b66d9f0 and 70dc96f.

📒 Files selected for processing (5)
  • crates/logfwd-io/src/diagnostics.rs
  • crates/logfwd-io/src/span_exporter.rs
  • crates/logfwd-transform/src/udf/json_extract.rs
  • crates/logfwd/src/main.rs
  • crates/logfwd/src/pipeline.rs

Comment thread crates/logfwd-io/src/diagnostics.rs
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (2)
crates/logfwd/src/pipeline.rs (1)

517-520: 🛠️ Refactor suggestion | 🟠 Major

Hot-path allocation: resource_attrs.clone() still copies the entire Vec per batch.

This was flagged previously. Consider wrapping in Arc<Vec<...>> to make cloning cheap.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/logfwd/src/pipeline.rs` around lines 517 - 520, BatchMetadata is
allocating on the hot path because resource_attrs.clone() copies the entire Vec
for every batch; change the type of resource_attrs (where defined on the
pipeline/state struct and in BatchMetadata) from Vec<T> to Arc<Vec<T>> (or
Arc<[T]>) so cloning becomes a cheap Arc::clone; update places that
construct/consume BatchMetadata and any code that mutates resource_attrs to
clone into a new Arc when mutation is needed (or use Arc::make_mut) and replace
the call here (resource_attrs.clone()) with Arc::clone(&self.resource_attrs) to
avoid per-batch heap copies.
crates/logfwd/src/main.rs (1)

575-575: ⚠️ Potential issue | 🟠 Major

Redact the traces endpoint before logging.

If the OTLP URL contains embedded credentials or query tokens, they leak to stderr (and potentially to /api/logs). Log host:port only.

🔒 Proposed fix
+    fn redact_url(url: &str) -> String {
+        url::Url::parse(url)
+            .map(|u| format!("{}:{}", u.host_str().unwrap_or("?"), u.port().unwrap_or(443)))
+            .unwrap_or_else(|_| "<invalid-url>".to_string())
+    }
+
     if let Some(ref endpoint) = config.server.traces_endpoint {
         // ... exporter setup ...
-        eprintln!("  {}traces push{}: {endpoint}", dim(), reset());
+        eprintln!("  {}traces push{}: {}", dim(), reset(), redact_url(endpoint));
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/logfwd/src/main.rs` at line 575, The eprintln! is printing the full
OTLP endpoint (variable endpoint) which can leak credentials and query tokens;
change the logging to redact sensitive parts by parsing endpoint as a URL (e.g.,
using url::Url::parse or equivalent), then extract and print only the host and
port (or host if no port) and not the scheme, userinfo, path, or query; update
the eprintln! that references endpoint to print the sanitized host[:port] string
instead and fall back to a safe placeholder like "<redacted-host>" if parsing
fails.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/logfwd/src/main.rs`:
- Line 566: The traces_endpoint and metrics_endpoint Option<String> can be
Some("") and should be guarded; update the usage that currently does `if let
Some(ref endpoint) = config.server.traces_endpoint {` to only run when the
endpoint is non-empty (e.g., `if let Some(ref endpoint) = ... if
!endpoint.trim().is_empty()`), or add validation in Config::validate() to call
the existing validate_endpoint_url() (or a new non-empty check) for
server.traces_endpoint and server.metrics_endpoint so empty/whitespace strings
are rejected earlier; ensure you apply the same change to metrics_endpoint and
keep behavior consistent with SpanExporter::builder() error handling.

---

Duplicate comments:
In `@crates/logfwd/src/main.rs`:
- Line 575: The eprintln! is printing the full OTLP endpoint (variable endpoint)
which can leak credentials and query tokens; change the logging to redact
sensitive parts by parsing endpoint as a URL (e.g., using url::Url::parse or
equivalent), then extract and print only the host and port (or host if no port)
and not the scheme, userinfo, path, or query; update the eprintln! that
references endpoint to print the sanitized host[:port] string instead and fall
back to a safe placeholder like "<redacted-host>" if parsing fails.

In `@crates/logfwd/src/pipeline.rs`:
- Around line 517-520: BatchMetadata is allocating on the hot path because
resource_attrs.clone() copies the entire Vec for every batch; change the type of
resource_attrs (where defined on the pipeline/state struct and in BatchMetadata)
from Vec<T> to Arc<Vec<T>> (or Arc<[T]>) so cloning becomes a cheap Arc::clone;
update places that construct/consume BatchMetadata and any code that mutates
resource_attrs to clone into a new Arc when mutation is needed (or use
Arc::make_mut) and replace the call here (resource_attrs.clone()) with
Arc::clone(&self.resource_attrs) to avoid per-batch heap copies.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 04e4de92-acba-44c7-8f4d-41b1e53bf56e

📥 Commits

Reviewing files that changed from the base of the PR and between 70dc96f and 9b77934.

📒 Files selected for processing (2)
  • crates/logfwd/src/main.rs
  • crates/logfwd/src/pipeline.rs

Comment thread crates/logfwd/src/main.rs

let mut builder = SdkTracerProvider::builder().with_span_processor(ring_processor);

if let Some(ref endpoint) = config.server.traces_endpoint {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if ServerConfig has any custom deserialization that handles empty strings
rg -n "traces_endpoint|impl.*Deserialize.*ServerConfig" crates/logfwd-config/src/

Repository: strawgate/memagent

Length of output: 138


🏁 Script executed:

# Check ServerConfig struct definition and any custom deserialization
rg -B5 -A10 "pub struct ServerConfig" crates/logfwd-config/src/

Repository: strawgate/memagent

Length of output: 1333


🏁 Script executed:

# Read the specific line in main.rs where the issue is reported
sed -n '560,580p' crates/logfwd/src/main.rs

Repository: strawgate/memagent

Length of output: 860


🏁 Script executed:

# Check how traces_endpoint is used in exporter creation
rg -B5 -A10 "traces_endpoint" crates/logfwd/src/main.rs

Repository: strawgate/memagent

Length of output: 1403


🏁 Script executed:

# Check for any validation of traces_endpoint in config or elsewhere
rg -n "traces_endpoint" crates/ --type rs

Repository: strawgate/memagent

Length of output: 88


🏁 Script executed:

# Look for any tests or examples of opentelemetry_otlp with invalid endpoints
rg -n "with_endpoint|SpanExporter" crates/ --type rs -B2 -A2

Repository: strawgate/memagent

Length of output: 88


🏁 Script executed:

# Check Cargo.toml to see opentelemetry_otlp version
rg -n "opentelemetry" crates/logfwd/Cargo.toml

Repository: strawgate/memagent

Length of output: 238


🏁 Script executed:

# Check for any validation of traces_endpoint in config or elsewhere
rg -n "traces_endpoint" crates/

Repository: strawgate/memagent

Length of output: 331


🏁 Script executed:

# Look for any tests or examples of opentelemetry_otlp with invalid endpoints
rg -n "with_endpoint|SpanExporter" crates/ -B2 -A2

Repository: strawgate/memagent

Length of output: 2552


🏁 Script executed:

# Check the workspace Cargo.toml for opentelemetry versions
rg -n "opentelemetry" Cargo.toml

Repository: strawgate/memagent

Length of output: 326


🏁 Script executed:

# Check metrics_endpoint handling to see if there's a pattern for validation
rg -B5 -A10 "metrics_endpoint" crates/logfwd/src/main.rs

Repository: strawgate/memagent

Length of output: 789


🏁 Script executed:

# Search for any validation or filtering of empty strings elsewhere in the codebase
rg -n "trim|is_empty|filter" crates/logfwd/src/main.rs

Repository: strawgate/memagent

Length of output: 44


🏁 Script executed:

# Check if there's any validation in the config module
cat crates/logfwd-config/src/lib.rs | head -100

Repository: strawgate/memagent

Length of output: 3048


🏁 Script executed:

# Check if there are any existing validations or guards on config options
rg -n "\.map\(|\.filter\(|\.ok_or" crates/logfwd-config/src/lib.rs

Repository: strawgate/memagent

Length of output: 479


🏁 Script executed:

# Look for any test files that might show how empty endpoints behave
fd -e rs test crates/logfwd/src/

Repository: strawgate/memagent

Length of output: 44


🏁 Script executed:

# Check if there's a validate function or validation logic on Config struct
rg -B5 -A15 "impl Config|fn validate|fn load" crates/logfwd-config/src/lib.rs

Repository: strawgate/memagent

Length of output: 2769


🏁 Script executed:

# Check where validate_endpoint_url is called and if it's applied to traces_endpoint
rg -B5 -A10 "validate_endpoint_url" crates/logfwd-config/src/lib.rs

Repository: strawgate/memagent

Length of output: 1719


🏁 Script executed:

# Get the full validate function and see all validations
rg -A50 "fn validate\(" crates/logfwd-config/src/lib.rs | head -80

Repository: strawgate/memagent

Length of output: 2213


🏁 Script executed:

# Check the complete validate_endpoint_url function
rg -A20 "fn validate_endpoint_url" crates/logfwd-config/src/lib.rs

Repository: strawgate/memagent

Length of output: 730


🏁 Script executed:

# Check if server.traces_endpoint or server.metrics_endpoint are validated anywhere
rg -B3 -A3 "server\.(traces|metrics)_endpoint" crates/logfwd-config/src/lib.rs

Repository: strawgate/memagent

Length of output: 44


Guard traces_endpoint against empty strings.

ServerConfig.traces_endpoint: Option<String> deserializes "" as Some(""), which passes the if let Some(...) check at line 566 but passes an invalid endpoint to opentelemetry_otlp::SpanExporter::builder().

Either filter empty/whitespace at usage:

-    if let Some(ref endpoint) = config.server.traces_endpoint {
+    if let Some(ref endpoint) = config.server.traces_endpoint.as_ref().filter(|e| !e.trim().is_empty()) {

Or add validation in Config::validate() alongside existing validate_endpoint_url() calls.

Note: metrics_endpoint has the same vulnerability.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if let Some(ref endpoint) = config.server.traces_endpoint {
if let Some(ref endpoint) = config.server.traces_endpoint.as_ref().filter(|e| !e.trim().is_empty()) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/logfwd/src/main.rs` at line 566, The traces_endpoint and
metrics_endpoint Option<String> can be Some("") and should be guarded; update
the usage that currently does `if let Some(ref endpoint) =
config.server.traces_endpoint {` to only run when the endpoint is non-empty
(e.g., `if let Some(ref endpoint) = ... if !endpoint.trim().is_empty()`), or add
validation in Config::validate() to call the existing validate_endpoint_url()
(or a new non-empty check) for server.traces_endpoint and
server.metrics_endpoint so empty/whitespace strings are rejected earlier; ensure
you apply the same change to metrics_endpoint and keep behavior consistent with
SpanExporter::builder() error handling.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/logfwd-io/src/diagnostics.rs`:
- Around line 639-736: Add contract tests for the /api/traces endpoint to lock
the wire format: create tests that exercise diagnostics::serve_traces (via an
HTTP request to /api/traces) covering (1) an empty trace_buf case returning
{"traces":[]} and (2) a single-batch case with one root TraceSpan and its child
spans to assert the exact JSON structure produced (fields: trace_id, pipeline,
start_unix_ns, total_ns, scan_ns, transform_ns, output_ns, input_rows,
output_rows, errors, status), confirm newest-first ordering, that child grouping
uses root_marker "0000000000000000", and that attributes parsed from
TraceSpan.attrs are serialized correctly; add these tests alongside other
diagnostics tests and build assertions around serve_traces, trace_buf, and
TraceSpan to prevent regressions.
- Around line 697-715: The emitted JSON currently writes root.start_unix_ns as a
raw number (see root.start_unix_ns in the write! call) which loses nanosecond
precision in JS; change the serialization to emit start_unix_ns as a string
(e.g., format the integer into a quoted string) or add a new start_unix_ms field
computed from root.start_unix_ns/1_000_000 and emit that as a number, and update
the frontend TraceRecord type (dashboard/src/types.ts) to accept the chosen
representation (start_unix_ns: string or new start_unix_ms: number) so the
client and server agree.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1a937320-d8c5-4a33-9fdb-52896961a331

📥 Commits

Reviewing files that changed from the base of the PR and between 9b77934 and cda73d3.

📒 Files selected for processing (1)
  • crates/logfwd-io/src/diagnostics.rs

Comment thread crates/logfwd-io/src/diagnostics.rs
Comment on lines +697 to +715
let _ = write!(
out,
"{{\
\"trace_id\":\"{tid}\",\
\"pipeline\":\"{pl}\",\
\"start_unix_ns\":{st},\
\"total_ns\":{tot},\
\"scan_ns\":{scan},\
\"transform_ns\":{xfm},\
\"output_ns\":{out_ns},\
\"input_rows\":{ir},\
\"output_rows\":{or},\
\"errors\":{err},\
\"status\":\"{status}\"\
}}",
tid = root.trace_id,
pl = esc(pipeline),
st = root.start_unix_ns,
tot = root.duration_ns,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

sed -n '63,79p' dashboard/src/types.ts

python - <<'PY'
import time

limit = 2**53 - 1
now_ns = time.time_ns()

print(f"now_ns={now_ns}")
print(f"max_safe_integer={limit}")
print(f"unsafe_for_js_number={now_ns > limit}")

for delta in range(1, 2048):
    if float(now_ns) == float(now_ns + delta):
        print(f"js_number_collapses_{delta}ns_at_current_epoch=true")
        break
PY

Repository: strawgate/memagent

Length of output: 537


JSON number precision loss in start_unix_ns field.

The code emits root.start_unix_ns as a raw JSON number at lines 706. The frontend TraceRecord interface defines this as a TypeScript number, but IEEE-754 doubles cannot represent nanosecond precision at the current epoch—consecutive nanosecond values collapse to identical representations. This causes traces to become unorderable and timestamps to silently round.

Serialize as a string or expose a millisecond-precision field instead:

Server-side fix
-                        \"start_unix_ns\":{st},\
+                        \"start_unix_ns\":\"{st}\",\

Update dashboard/src/types.ts to match (start_unix_ns: string), or introduce start_unix_ms: number for browser consumption.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/logfwd-io/src/diagnostics.rs` around lines 697 - 715, The emitted JSON
currently writes root.start_unix_ns as a raw number (see root.start_unix_ns in
the write! call) which loses nanosecond precision in JS; change the
serialization to emit start_unix_ns as a string (e.g., format the integer into a
quoted string) or add a new start_unix_ms field computed from
root.start_unix_ns/1_000_000 and emit that as a number, and update the frontend
TraceRecord type (dashboard/src/types.ts) to accept the chosen representation
(start_unix_ns: string or new start_unix_ms: number) so the client and server
agree.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/logfwd/src/pipeline.rs (1)

36-43: ⚠️ Potential issue | 🔴 Critical

Update the in-file test call sites for the new ChannelMsg::Data / helper signature.

This change makes the local tests stop compiling: ChannelMsg::Data now requires queued_at, and blocking_send_channel_msg now requires input_name. Line 1498, Line 1510, and Line 2403 still build the old variant, and Line 1507 still calls the old helper signature.

Minimal test fixes
 tx.try_send(ChannelMsg::Data {
     bytes: vec![1],
     checkpoints: vec![],
+    queued_at: Instant::now(),
 })
 .unwrap();

 let handle = std::thread::spawn(move || {
     blocking_send_channel_msg(
+        "test",
         &tx2,
         &metrics2,
         ChannelMsg::Data {
             bytes: vec![2],
             checkpoints: vec![],
+            queued_at: Instant::now(),
         },
     )
 });

 tx.try_send(ChannelMsg::Data {
     bytes: b"test\n".to_vec(),
     checkpoints: vec![(SourceId(42), ByteOffset(1000))],
+    queued_at: Instant::now(),
 })
 .unwrap();

Also applies to: 764-769

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/logfwd/src/pipeline.rs` around lines 36 - 43, Tests still construct
ChannelMsg::Data and call blocking_send_channel_msg using the old signatures;
update all test sites to pass the new fields: construct ChannelMsg::Data with
bytes, checkpoints and the new queued_at Instant value, and update
blocking_send_channel_msg calls to include the new input_name argument. Search
for ChannelMsg::Data usages and the helper blocking_send_channel_msg (references
around the former call sites noted in the comment) and add a queued_at (e.g.,
Instant::now() or a deterministic Instant) when building Data, and supply the
appropriate input_name string when calling blocking_send_channel_msg so all
tests compile against the new signatures.
♻️ Duplicate comments (2)
crates/logfwd/src/pipeline.rs (1)

476-487: ⚠️ Potential issue | 🟠 Major

Record scan failures on the batch span, not the entered scan span.

This branch runs while scan_span is entered, so tracing::Span::current() resolves to the child span. crates/logfwd-io/src/diagnostics.rs reads errors from the root batch span, so scan-error traces still render as clean. Capture the batch span before entering scan_span and record on that handle here.

One way to fix it
-        let batch = {
+        let batch_span = tracing::Span::current();
+        let batch = {
             let scan_span = tracing::info_span!("scan", pipeline = %self.name, rows = 0u64);
             let _entered = scan_span.enter();
             let b = match tokio::task::block_in_place(|| self.scanner.scan(combined.into())) {
                 Ok(b) => b,
                 Err(e) => {
                     // Queued tickets dropped here — safe, not tracked by machine.
                     self.metrics.inc_scan_error();
                     self.metrics.inc_dropped_batch();
                     eprintln!("pipeline: scan error (batch dropped): {e}");
-                    tracing::Span::current().record("errors", 1u64);
+                    batch_span.record("errors", 1u64);
                     return;
                 }
             };
In Rust's `tracing` crate, when code runs inside `let _entered = scan_span.enter()`, does `tracing::Span::current()` resolve to `scan_span` or the surrounding parent span?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/logfwd/src/pipeline.rs` around lines 476 - 487, The scan-error branch
is recording errors on the currently entered child span (scan_span) instead of
the batch root span; before entering the scan_span capture the batch span handle
(e.g., let batch_span = tracing::Span::current() or obtain it prior to let
_entered = scan_span.enter()), then in the Err(e) branch call
batch_span.record("errors", 1u64) (and/or use batch_span.record_debug/record_str
as appropriate) instead of tracing::Span::current().record(...), leaving the
rest of the error handling (metrics.inc_scan_error(), inc_dropped_batch(),
eprintln!, return) unchanged so diagnostics read the error count from the batch
span.
crates/logfwd-io/src/diagnostics.rs (1)

711-733: ⚠️ Potential issue | 🟠 Major

start_unix_ns loses precision in the browser.

This emits a nanosecond epoch timestamp as a JSON number, but the dashboard contract models it as a TypeScript number; at the current epoch that already exceeds JS’s safe-integer range, so nearby trace timestamps collapse. Emit this as a string, or add a millisecond-precision field, and update dashboard/src/types.ts to match.

Possible server-side fix
-                        \"start_unix_ns\":{st},\
+                        \"start_unix_ns\":\"{st}\",\
#!/bin/bash
python - <<'PY'
import time

limit = 2**53 - 1
now_ns = time.time_ns()

print(f"now_ns={now_ns}")
print(f"max_safe_integer={limit}")
print(f"unsafe_for_js_number={now_ns > limit}")

for delta in range(1, 2048):
    if float(now_ns) == float(now_ns + delta):
        print(f"js_number_collapses_{delta}ns_at_current_epoch=true")
        break
PY
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/logfwd-io/src/diagnostics.rs` around lines 711 - 733, The JSON
currently emits start_unix_ns as a numeric literal (root.start_unix_ns) which
exceeds JS safe integers; change the emitter in the write! call to output
start_unix_ns as a string (e.g. use start_unix_ns =
root.start_unix_ns.to_string() and put it inside quotes in the format) or
alternatively add a new start_unix_ms field (compute root.start_unix_ns /
1_000_000) and keep ms as a number; update the corresponding dashboard type in
dashboard/src/types.ts to expect the chosen type (string for start_unix_ns or
number for start_unix_ms) so the contract matches.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/logfwd/src/pipeline.rs`:
- Around line 288-289: The batch_queued_at variable is only set once in the
receive loops and drain loop, which lets the first-seen message incorrectly
become the batch oldest time; update those spots (where batch_queued_at is
assigned from a message's queued_at) to take the minimum of the existing
batch_queued_at and the incoming queued_at (i.e., batch_queued_at =
Some(min(existing, queued_at))) so the tracked oldest queued_at across the
entire accumulating batch is preserved and queue_wait_ns is not underreported;
apply this change in both receive loops and the drain loop where batch_queued_at
is updated.

In `@dashboard/src/components/TraceExplorer.tsx`:
- Around line 108-112: The trace row div in TraceExplorer.tsx (the element keyed
by t.trace_id that calls setSelected and uses isSelected) must be made
keyboard-operable: either replace the <div> with a semantic <button> or add
role="button", tabIndex={0}, aria-expanded={isSelected} and an onKeyDown handler
that toggles via setSelected when Enter or Space is pressed; also ensure a
visible focus style (e.g., add a focus class like "trace-row-focus" or include
:focus styles alongside "trace-row-selected") so keyboard users can see focus.

---

Outside diff comments:
In `@crates/logfwd/src/pipeline.rs`:
- Around line 36-43: Tests still construct ChannelMsg::Data and call
blocking_send_channel_msg using the old signatures; update all test sites to
pass the new fields: construct ChannelMsg::Data with bytes, checkpoints and the
new queued_at Instant value, and update blocking_send_channel_msg calls to
include the new input_name argument. Search for ChannelMsg::Data usages and the
helper blocking_send_channel_msg (references around the former call sites noted
in the comment) and add a queued_at (e.g., Instant::now() or a deterministic
Instant) when building Data, and supply the appropriate input_name string when
calling blocking_send_channel_msg so all tests compile against the new
signatures.

---

Duplicate comments:
In `@crates/logfwd-io/src/diagnostics.rs`:
- Around line 711-733: The JSON currently emits start_unix_ns as a numeric
literal (root.start_unix_ns) which exceeds JS safe integers; change the emitter
in the write! call to output start_unix_ns as a string (e.g. use start_unix_ns =
root.start_unix_ns.to_string() and put it inside quotes in the format) or
alternatively add a new start_unix_ms field (compute root.start_unix_ns /
1_000_000) and keep ms as a number; update the corresponding dashboard type in
dashboard/src/types.ts to expect the chosen type (string for start_unix_ns or
number for start_unix_ms) so the contract matches.

In `@crates/logfwd/src/pipeline.rs`:
- Around line 476-487: The scan-error branch is recording errors on the
currently entered child span (scan_span) instead of the batch root span; before
entering the scan_span capture the batch span handle (e.g., let batch_span =
tracing::Span::current() or obtain it prior to let _entered =
scan_span.enter()), then in the Err(e) branch call batch_span.record("errors",
1u64) (and/or use batch_span.record_debug/record_str as appropriate) instead of
tracing::Span::current().record(...), leaving the rest of the error handling
(metrics.inc_scan_error(), inc_dropped_batch(), eprintln!, return) unchanged so
diagnostics read the error count from the batch span.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 2313122a-5181-4a7f-a6a7-eb1039595a00

📥 Commits

Reviewing files that changed from the base of the PR and between cda73d3 and ae20519.

📒 Files selected for processing (5)
  • crates/logfwd-io/src/diagnostics.rs
  • crates/logfwd/src/pipeline.rs
  • dashboard/src/components/TraceExplorer.tsx
  • dashboard/src/style.css
  • dashboard/src/types.ts

Comment on lines +288 to +289
// Oldest queued_at in the current accumulating batch — reset after each flush.
let mut batch_queued_at: Option<Instant> = None;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Track the oldest queued_at across the whole batch.

Both receive loops only initialize batch_queued_at once. With multiple inputs or blocked senders, the first message received is not guaranteed to be the oldest message that ends up in the batch, so queue_wait_ns gets underreported exactly when backlog builds. Reduce with min(existing, queued_at) in both spots.

Suggested fix
-                            if batch_queued_at.is_none() {
-                                batch_queued_at = Some(queued_at);
-                            }
+                            batch_queued_at = Some(match batch_queued_at {
+                                Some(oldest) => oldest.min(queued_at),
+                                None => queued_at,
+                            });

Apply the same change in the drain loop at Line 349-Line 351.

Also applies to: 301-304, 349-351

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/logfwd/src/pipeline.rs` around lines 288 - 289, The batch_queued_at
variable is only set once in the receive loops and drain loop, which lets the
first-seen message incorrectly become the batch oldest time; update those spots
(where batch_queued_at is assigned from a message's queued_at) to take the
minimum of the existing batch_queued_at and the incoming queued_at (i.e.,
batch_queued_at = Some(min(existing, queued_at))) so the tracked oldest
queued_at across the entire accumulating batch is preserved and queue_wait_ns is
not underreported; apply this change in both receive loops and the drain loop
where batch_queued_at is updated.

Comment on lines +108 to +112
<div
key={t.trace_id}
class={`trace-row${isSelected ? " trace-row-selected" : ""}`}
onClick={() => setSelected(isSelected ? null : t.trace_id)}
>
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Make the trace row keyboard-operable.

This is a clickable <div> with no role, tab stop, or key handling, so keyboard users cannot expand a trace at all. Please use a real button for the row header, or add role="button", tabIndex={0}, aria-expanded, onKeyDown, and a visible focus state.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dashboard/src/components/TraceExplorer.tsx` around lines 108 - 112, The trace
row div in TraceExplorer.tsx (the element keyed by t.trace_id that calls
setSelected and uses isSelected) must be made keyboard-operable: either replace
the <div> with a semantic <button> or add role="button", tabIndex={0},
aria-expanded={isSelected} and an onKeyDown handler that toggles via setSelected
when Enter or Space is pressed; also ensure a visible focus style (e.g., add a
focus class like "trace-row-focus" or include :focus styles alongside
"trace-row-selected") so keyboard users can see focus.

strawgate and others added 11 commits April 2, 2026 12:32
Pipeline batches are now traced end-to-end using OpenTelemetry spans
routed through tracing-opentelemetry. Each flush_batch emits a root
"batch" span with three child spans (scan, transform, output) carrying
timing and row-count attributes.

Rust:
- Workspace: add tracing, tracing-subscriber, tracing-opentelemetry 0.32
- logfwd-config: add server.traces_endpoint (optional OTLP trace push)
- logfwd-io: RingBufferExporter (SpanExporter impl) + SpanBuffer; stores
  last 4000 spans (≈1000 batches) in an Arc<Mutex<VecDeque>>
- logfwd-io/diagnostics: /api/traces endpoint groups spans by trace_id,
  returns per-batch summary (scan/transform/output ns, row counts, status)
- logfwd/main: build_tracer_provider() wires RingBufferExporter +
  optional OTLP BatchSpanProcessor; initialises global tracing subscriber
- logfwd/pipeline: #[tracing::instrument] on flush_batch (root "batch"
  span); sync scan/output wrapped with .entered() spans; async transform
  wrapped with .instrument(); input_rows/output_rows/errors recorded

Dashboard:
- TraceExplorer component: waterfall rows, proportional scan/transform/
  output segments, click to expand detail (timing breakdown, filter ratio)
- /api/traces polled every 2s alongside other endpoints
- Trace explorer shown by default below the log viewer

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- diagnostics: replace push_str(&format!()) with write!(), fix
  explicit_iter_loop, remove needless raw string hashes
- pipeline: remove #[expect(dead_code)] on name field now that
  tracing spans reference it
- cargo fmt across all changed files

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Use 0u64 instead of tracing::field::Empty for input_rows/output_rows
  so the fields always appear in spans even on early-exit paths
- Record errors=1 on scan error path (was missing, transform path already had it)
- Retain tracer_provider and call shutdown() on exit alongside meter_provider

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
attr() returns &str; interpolating it bare into JSON works when the
OTel bridge serializes integers as decimal strings, but would produce
invalid JSON if the attribute were ever absent or malformed. Parse with
an explicit u64 type and unwrap_or(0) fallback so the output is always
valid regardless of bridge behavior.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Batch span gains:
- bytes_in: raw bytes fed to scanner → enables scan MB/s
- flush_reason: "size" | "timeout" | "drain" → shows if throughput- or latency-bound
- queue_wait_ns: time data sat in channel before flush_batch ran

Child spans gain:
- scan.rows: rows extracted by scanner
- transform.rows_in: rows entering SQL (from existing root input_rows)
- output.rows: rows sent to sink

Input thread gains tracing::event! for:
- input.poll_error (warn, with error field)
- input.file_rotated / input.file_truncated (info)
- input.new_files (info, with files count)
- input.backpressure (warn) — emitted when channel is full

Dashboard surfaces:
- bytes_in badge per trace row
- "timeout" flush badge (amber) to distinguish timeout vs size flushes
- Expanded detail: throughput (MB/s), queue wait, flush reason, scan rows

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three-zone layout replacing the POC single-list design:

Stats strip: batch/min, MB/s, avg duration, per-stage % breakdown,
idle-flush %, error count — all computed from the buffered window.

Timeline canvas: one bar per batch positioned by start_unix_ns,
height = duration, stacked scan/transform/output/overhead segments,
white outline on selected batch. Click to select from timeline.

Batch list (100 visible, rest in timeline):
- 100% stacked bar per row — shows breakdown as share of total time,
  not relative to other batches (fixes the global-maxNs problem)
- "slow" amber left-border for batches above P90 (self-calibrating)
- "idle" badge for timeout flushes, "err" for errors

Detail panel (click to expand):
- Three side-by-side stage boxes: duration + rows + derived throughput
- "overhead" box when unaccounted time > 5% of batch
- queue_wait, flush_reason, bytes_in, trace_id in meta row

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ines

Expand state no longer resets on poll:
- Store full TraceRecord in expanded state instead of just trace_id
- useEffect refreshes expanded record when traces update (preserves
  detail even if the trace shifts position in the list)
- If expanded trace scrolls past LIST_LIMIT, it's pinned at top of
  list so the detail panel is never orphaned

Input/output types shown in pipeline nodes:
- pn-type now shows the actual type (File, Generator, TCP, OTLP, etc.)
  instead of the hardcoded "Input"/"Output" role label
- Auto-generated names (input_0, output_1) suppressed from node cards
  since they add no information; user-defined names still shown
- Inspector title updated to "Input · File · my-name" pattern

TraceExplorer moved under each pipeline:
- TraceExplorer is now a presentational component (no polling)
- traces fetched once in App poll alongside pipelines/stats
- Passed pre-filtered to PipelineView by pipeline name
- PipelineView shows "Batch Traces" toggle below the node graph
  - Collapsed: stats strip + timeline only
  - Expanded: full batch list with detail panels
- Removed standalone <TraceExplorer /> from App

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The eprintln confirming each pipeline is ready was accidentally dropped
during the tracing setup refactor.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…annel_msg input_name

Test call sites were not updated when ChannelMsg::Data gained the
queued_at field and blocking_send_channel_msg gained the input_name param.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… endpoint validation

- BatchMetadata.resource_attrs: Vec -> Arc<Vec> to avoid per-batch clone
- Redact traces_endpoint URL before printing to stderr (strip credentials/query)
- Validate server.traces_endpoint in Config::validate() for fast-fail

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…r's startup summary)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@strawgate strawgate force-pushed the feat/batch-trace-explorer branch from 741be52 to a255f87 Compare April 2, 2026 17:33
strawgate and others added 5 commits April 2, 2026 12:56
- Use tracing::field::Empty for all span fields updated via record() — OTel
  appends on record() so pre-initialized 0u64 fields shadowed the real values.
  This fixes bytes_in, scan_rows, input_rows, output_rows, queue_wait_ns all
  showing 0 in the trace explorer.
- Show DetailPanel in collapsed mode when a trace is selected via timeline click.
- Compute Batch Latency from rolling avg of trace total_ns instead of the
  cumulative processing rate (which was wrong units and had a startup spike).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The PR changed BatchMetadata::resource_attrs from Vec to Arc<Vec> but
missed 9 call sites that still used bare vec![]. Fixes compile errors
on both Linux and macOS CI.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…w(vec![])

Arc<Vec<T>> implements Default (via Vec: Default, Arc<T: Default>: Default),
so Arc::default() is equivalent but more concise at the 9 empty call sites.
Non-empty sites keep Arc::new(vec![...]) as-is.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Four construction sites in bench/fuzz/test-utils were missed when
resource_attrs changed to Arc<Vec<...>>; also fixes a rustfmt line-length
issue in pipeline.rs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@strawgate
Copy link
Copy Markdown
Owner Author

@coderabbitai please do a very thorough review both as a user for expectations and as a maintainer.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 2, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Gather key files for thorough review
echo "=== span_exporter.rs ==="
fd "span_exporter.rs" --exec cat {}

echo "=== diagnostics.rs (traces section) ==="
fd "diagnostics.rs" crates/logfwd-io/src --exec cat {}

echo "=== pipeline.rs (tracing section) ==="
fd "pipeline.rs" crates/logfwd/src --exec cat {}

echo "=== main.rs ==="
fd "main.rs" crates/logfwd/src --exec cat {}

echo "=== TraceExplorer.tsx ==="
fd "TraceExplorer.tsx" --exec cat {}

echo "=== types.ts ==="
fd "types.ts" dashboard/src --exec cat {}

echo "=== logfwd-io Cargo.toml ==="
fd "Cargo.toml" crates/logfwd-io --exec cat {}

echo "=== logfwd Cargo.toml ==="
fd "Cargo.toml" crates/logfwd --exec cat {}

echo "=== workspace Cargo.toml ==="
cat Cargo.toml

Length of output: 205505


@strawgate Here's a thorough review covering correctness, observability, resilience, and UX across the full diff.


🔴 Critical

1. start_unix_ns loses precision in TypeScript (types.ts)

start_unix_ns: number;  // JavaScript Number = IEEE 754 double

Current Unix time in nanoseconds is ~1.75 × 10¹⁸. JavaScript's safe integer limit is Number.MAX_SAFE_INTEGER = 9.007 × 10¹⁵. Any nanosecond timestamp beyond that will be rounded, meaning two different batches can appear to have the same start_unix_ns, breaking the timeline ordering and hit-testing in drawTimeline/hitTestTimeline.

Fix: Use BigInt or, more practically, accept the value as a string from the API and convert only to milliseconds for display:

start_unix_ns: string;  // keep as string; parse with BigInt when needed
// or: serialize as ms in the API response

Alternatively, emit start_unix_ms: f64 from the Rust side instead of u64 nanoseconds, accepting the minor precision trade-off (sub-millisecond ordering is rarely needed in the list).


2. TraceExplorer prop mismatch — compile-time type error

app.tsx renders <TraceExplorer /> with no props, but the component signature is:

export function TraceExplorer({ traces, collapsed = false }: Props)
// where Props = { traces: TraceRecord[]; collapsed?: boolean }

traces is required here. This is a TypeScript error and will likely produce a runtime crash (traces.length on undefined). Either:

  • The app.tsx diff was not shown and it already passes traces (likely — app.tsx polls the API), or
  • The component needs a default: traces = [].

Please verify app.tsx passes the traces prop; if not, add traces = [] as TraceRecord[] as a default.


🟠 High

3. Missing unit test for /api/traces

Every other endpoint (/health, /api/pipelines, /api/stats, /ready, /api/logs) has at least one test in diagnostics.rs. The new serve_traces handler has none. Given its bespoke JSON construction (not using serde), this is the highest risk of silent regressions (wrong field names, missed escaping, wrong grouping).

Suggested minimum cases:

  • No trace_buf set → {"traces":[]} with status 200
  • Buffer with one root + three child spans → correct scan_ns, transform_ns, output_ns in output
  • Root span with no children → zero stage durations

4. Mutex poison silently drops spans (span_exporter.rs)

if let Ok(mut buf) = self.buf.inner.lock() { ... }
// panics in another thread → all subsequent spans silently discarded

If any thread panics while holding the buffer lock, lock() returns Err(PoisonError) forever and the ring buffer silently stops accepting spans — the dashboard shows no new traces with no indication of the problem.

Fix:

let mut buf = self.buf.inner.lock().unwrap_or_else(|e| e.into_inner());

Or use parking_lot::Mutex which doesn't poison.

5. get_spans() clones entire buffer on every poll (span_exporter.rs)

pub fn get_spans(&self) -> Vec<TraceSpan> {
    match self.inner.lock() {
        Ok(buf) => buf.iter().cloned().collect(),
        ...
    }
}

With 4 000 spans, each containing several String/Vec<[String;2]> fields, this is a non-trivial allocation triggered every 2 s by the dashboard. The lock is also held for the full duration of the clone.

Consider returning only the last N spans needed by serve_traces (it caps at 500 traces × ~4 spans = 2 000), or snapshotting only the trace_id, parent_id, name, duration_ns, start_unix_ns, attrs subset needed for the API.


🟡 Medium

6. windowSec in computeStats is effectively zero for 1-batch windows (TraceExplorer.tsx)

const windowSec = Math.max(1,
  (traces[0].start_unix_ns - traces[traces.length - 1].start_unix_ns) / 1e9,
);

traces[0] is the newest batch, traces[last] is the oldest. With start_unix_ns in nanoseconds > Number.MAX_SAFE_INTEGER (see issue 1), this subtraction will already be wrong. Even without that bug: when only one batch is in the buffer (traces.length < 2 returns null from computeStats, so guarded), but when exactly 2 batches exist and they started at the same nanosecond, windowSec = 1 and batchPerMin = 120, which is misleading. Dependent on fix for issue 1.

7. serve_traces escapes pipeline and flush_reason but not trace_id/status (diagnostics.rs)

\"trace_id\":\"{tid}\",        // no esc()
\"status\":\"{status}\"\       // no esc()

trace_id is always lowercase hex (safe), and status is one of "ok"/"error"/"unset" (safe). But a future refactor that changes how trace_id is derived (e.g., custom sampler returning arbitrary strings) could silently introduce JSON injection. Defensive esc() wrapping is cheap.

8. SimpleSpanProcessor exports on the hot async path (main.rs)

let ring_processor = SimpleSpanProcessor::new(RingBufferExporter::new(buf));

SimpleSpanProcessor calls exporter.export() synchronously on the thread that closes each span. Since flush_batch runs on the tokio executor and spans are closed inline, this means the executor thread briefly acquires the Mutex<VecDeque> per span. Under high batch rates this is low-latency (just a deque push), but consider BatchSpanProcessor for the ring buffer too if the executor shows contention under profiling.

9. tracer_provider shutdown not guarded by pipeline join (main.rs)

if let Err(e) = tracer_provider.shutdown() { ... }

This is called after result? (which may return early on error) and after pipeline handles are joined. However, it's called before the shutdown warning for the meter provider in some code paths. The ordering is correct here, but worth a comment explaining that shutdown must come after all spans are closed.

10. collapsed prop not used in app.tsx integration (TraceExplorer.tsx)

TraceExplorer accepts collapsed?: boolean but the initial integration in app.tsx likely always passes collapsed={false} (or omits it). The collapsed mode (stats + timeline only, no batch list) appears designed for a future dashboard layout. Consider documenting or marking it @internal until it has a wiring point.


🔵 Low / Nit

11. tracing-opentelemetry = "0.32" with opentelemetry = "0.31" — version skew

tracing-opentelemetry 0.32 targets opentelemetry 0.28+. This should compile cleanly against opentelemetry 0.31, but it's worth a comment in Cargo.toml noting the intentional pinning, since the OTel Rust ecosystem has historically had tight semver coupling between these crates.

12. flush_reason at drain-time uses queued_at: Instant::now() (pipeline.rs)

ChannelMsg::Data {
    bytes: data,
    checkpoints,
    queued_at: Instant::now(),  // set at shutdown drain, not at original enqueue
}

The queue_wait_ns for drain flushes will be near-zero and flush_reason = "drain", which is accurate but means queue_wait_ns loses its meaning for shutdown-triggered drains. A comment explaining this trade-off would prevent future confusion.

13. Instrument import unused warning (pipeline.rs)

use tracing::Instrument;

This is used for .instrument(info_span!("transform", ...)), so it is needed — just verifying no accidental dead import.

14. src/dashboard.html contains a large inlined module script

The build artifact embedded in dashboard.html contains what appears to be the built dashboard JS. Ensure the CI/CD pipeline regenerates this from dashboard/src rather than allowing manual edits to the HTML file, to prevent drift between the source and the embedded build.


Summary table

# Severity Area Issue
1 🔴 Critical types.ts start_unix_ns: number loses precision (>2⁵³)
2 🔴 Critical TraceExplorer.tsx / app.tsx traces required prop not passed from parent
3 🟠 High diagnostics.rs No test for /api/traces
4 🟠 High span_exporter.rs Mutex poison silently drops all future spans
5 🟠 High span_exporter.rs Full 4K-span clone on every 2s poll
6 🟡 Medium TraceExplorer.tsx windowSec subtraction wrong with ns precision loss
7 🟡 Medium diagnostics.rs trace_id/status not escaped in JSON template
8 🟡 Medium main.rs SimpleSpanProcessor exports sync on tokio executor
9 🟡 Medium main.rs tracer_provider shutdown ordering needs comment
10 🔵 Low TraceExplorer.tsx collapsed prop has no live wiring
11 🔵 Low Cargo.toml OTel crate version skew lacks comment
12 🔵 Low pipeline.rs queue_wait_ns misleading for drain flushes

strawgate and others added 3 commits April 2, 2026 14:55
PipelineView was re-filtering traces that app.tsx already filtered by
pipeline name — the second pass was always a no-op. Removed it.

computeStats() calculated window as (newest start − oldest start),
cutting off the newest batch's duration. Window now ends at
newest.start + newest.total_ns so batch/min is accurate.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
TcpListener::bind(":0") has a TOCTOU race: previous test servers stay
alive (JoinHandle drop detaches threads, not stops them), and on macOS
the OS re-assigns the same port on the next :0 bind, causing EADDRINUSE.

Replace with a static AtomicU16 counter in the 19100+ range so each
test gets a strictly-unique port that can never collide with a still-
running server from an earlier test.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Scan-error path called tracing::Span::current() while the scan child span
was still entered, so errors=1 landed on the wrong span and never appeared
in /api/traces. Fixed to use the saved batch-root `span` handle instead.

Added two contract tests for /api/traces:
- empty buffer → {"traces":[]}
- one root + one child span → full JSON shape with correct field values

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@strawgate strawgate merged commit ae7f09e into master Apr 2, 2026
7 of 8 checks passed
@strawgate strawgate deleted the feat/batch-trace-explorer branch April 2, 2026 20:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant