Migrate diagnostics server from tiny_http to axum with WebSocket telemetry#1707
Conversation
Replace the tiny_http-based diagnostics server with axum, adding a
WebSocket endpoint at /admin/v1/telemetry that pushes OTLP JSON
(metrics, traces, logs) every 2s via a broadcast channel. This
eliminates dashboard polling and enables live span-start visibility.
- Rewrite server.rs from tiny_http to axum (1104 lines, down from 2010)
- New telemetry.rs: OTLP JSON serialization + sampling/collection
- Remove /admin/v1/{stats,traces,logs,history} (replaced by WebSocket)
- Simplify BackgroundHttpTask (remove dead TinyHttp variant)
- Add ws feature to axum dependency
- All 32 diagnostics tests pass, zero warnings
https://claude.ai/code/session_01WsySSDSHHCWA55S2XHHku7
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
|
Warning Rate limit exceeded
Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 0 minutes and 34 seconds. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Repository YAML (base), Organization UI (inherited) Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (4)
WalkthroughThe diagnostics HTTP server was migrated from tiny_http to axum and now runs on a dedicated OS thread hosting an internal Tokio runtime. Shutdown handling was simplified: the former enum-based Possibly related PRs
Caution Pre-merge checks failedPlease resolve all errors before merging. Addressing warnings is optional.
❌ Failed checks (2 errors, 2 warnings)
✅ Passed checks (1 passed)
Comment |
There was a problem hiding this comment.
Pull request overview
Migrates the logfwd-io diagnostics server from a blocking tiny_http implementation to an async axum server and introduces a WebSocket endpoint for streaming OTLP JSON telemetry (metrics/spans/logs) to connected clients.
Changes:
- Replace the diagnostics HTTP server with an
axum+ Tokio background thread setup, including graceful shutdown viaoneshot. - Add
/admin/v1/telemetryWebSocket endpoint with a periodic sampler that broadcasts OTLP JSON payloads. - Introduce a new
telemetry.rsmodule that collects pipeline/process/span/log data and serializes it to OTLP JSON.
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| crates/logfwd-io/src/diagnostics/telemetry.rs | New telemetry collection + OTLP JSON serialization (with unit tests). |
| crates/logfwd-io/src/diagnostics/server.rs | Axum-based diagnostics server, WebSocket endpoint, async sampler task, and updated endpoint handlers/tests. |
| crates/logfwd-io/src/diagnostics.rs | Wires in the new telemetry module. |
| crates/logfwd-io/src/background_http_task.rs | Simplifies shutdown handling to a single oneshot::Sender<()> for axum-based servers. |
| crates/logfwd-io/Cargo.toml | Enables axum ws feature for WebSocket support. |
| Cargo.lock | Locks additional transitive deps required by axum WebSocket support. |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/logfwd-io/src/diagnostics/server.rs`:
- Around line 215-227: The background thread currently swallows errors when
building the runtime and converting the std_listener, making failures invisible;
change the two Err(_) arms in the tokio::runtime::Builder::new_current_thread()
match and the tokio::net::TcpListener::from_std(std_listener) match to capture
the error (e.g., Err(e)) and log it with context (using the project's
logging/tracing facility) before returning so failures in runtime creation and
listener conversion are recorded along with the error details.
In `@crates/logfwd-io/src/diagnostics/telemetry.rs`:
- Line 322: The current write! call computes the span end as s.start_unix_ns +
s.duration_ns which can overflow; change the calculation to use saturating_add
(e.g., s.start_unix_ns.saturating_add(s.duration_ns)) before formatting so the
end time is clamped instead of overflowing, leaving the surrounding write!(out,
"{}", ...) invocation intact.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Organization UI (inherited)
Review profile: ASSERTIVE
Plan: Pro
Run ID: 862ec23b-f6a6-48a3-a75b-ea2edf15ea9f
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (5)
crates/logfwd-io/Cargo.tomlcrates/logfwd-io/src/background_http_task.rscrates/logfwd-io/src/diagnostics.rscrates/logfwd-io/src/diagnostics/server.rscrates/logfwd-io/src/diagnostics/telemetry.rs
ApprovabilityVerdict: Needs human review This PR is a major infrastructure refactor that migrates from tiny_http to axum, removes several REST endpoints (stats, traces, logs, history, telemetry/*), adds WebSocket-based telemetry push, and changes the threading model from blocking std threads to async tokio. These are significant runtime behavior changes that warrant human review despite the clean implementation. You can customize Macroscope's approvability policy. Learn more. |
- Fix inaccurate comment about unknown span attributes (dropped, not stored) - Use full 32-hex-char trace_id to prevent collisions past 16M batches - Use saturating_add for endTimeUnixNano to prevent overflow - Log errors on diagnostics runtime/listener failures instead of silent return - Skip OTLP serialization+broadcast when no WebSocket subscribers - Capture start_time.elapsed() once per status request - Emit null for non-finite gauge values instead of invalid JSON - Fix collect_new_logs stall after stderr ring buffer eviction - Run cargo fmt Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
@coderabbitai please do a full review |
|
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/logfwd-io/src/diagnostics/server.rs`:
- Around line 212-246: The thread currently returns success as soon as it is
spawned even if the Tokio runtime or TcpListener setup later fails; add a
startup acknowledgement oneshot channel created in start() (e.g., let
(startup_tx, startup_rx) = tokio::sync::oneshot::channel()) and move startup_tx
into the spawned thread; inside the thread, after successfully building the
runtime and converting the std_listener via tokio::net::TcpListener::from_std
and right after spawning the sampler_loop/starting the axum server task, send
startup_tx.send(Ok(())) (or send Err(e) on failures) to confirm the background
server is alive; in start(), await startup_rx (or block on it) and only return
the ServerHandle/Ok(...) when the receiver yields success, propagating any error
from the thread if it fails during initialization (ensure the thread still logs
errors via tracing::error as now).
In `@crates/logfwd-io/src/diagnostics/telemetry.rs`:
- Around line 289-317: The current span IDs are built from the per-pipeline
batch key alone (the loop over pm in pipelines reading active_batches and using
id), which allows identical trace_id/span_id across different PipelineMetrics;
fix it by deriving trace_id/span_id from both the pipeline identity and the
batch id (e.g., hash or namespace the tuple (pm.name, id)) or, preferably,
persist real trace_id/span_id on ActiveBatch and use those when building
SpanRecord; update the SpanRecord construction in telemetry.rs to use the
combined/ persisted IDs (reference symbols: PipelineMetrics, active_batches,
next_batch_id, ActiveBatch, SpanRecord, trace_id, span_id).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Organization UI (inherited)
Review profile: ASSERTIVE
Plan: Pro
Run ID: d25411ef-106b-4940-872b-18505bc550b4
📒 Files selected for processing (2)
crates/logfwd-io/src/diagnostics/server.rscrates/logfwd-io/src/diagnostics/telemetry.rs
…nc, span IDs - Guard i64 cast for gauge values outside i64 range (emit as float) - Synchronize diagnostics server startup via channel to propagate errors - Mix pipeline name hash into synthetic trace/span IDs for uniqueness Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Resolve conflicts with merged #1706 (telemetry HTTP endpoints). Keep #1707's axum server rewrite as the primary diagnostics server. Retain telemetry_buffer.rs from #1706 on disk (already in main). Remove orphaned tiny_http handler methods that were auto-merged into the axum code, and suppress dead-code warnings on the telemetry_buffer module since it is unused by the axum server.
…metry (#1707) Co-authored-by: Claude <noreply@anthropic.com>
Summary
Replaces the tiny_http-based diagnostics server with an axum-based implementation that adds WebSocket support for real-time telemetry streaming. The new server pushes OTLP JSON metrics, spans, and logs every 2 seconds to connected WebSocket clients.
Key Changes
/admin/v1/telemetryendpoint that streams OTLP JSON signals (resourceMetrics, resourceSpans, resourceLogs) to clientstelemetry.rsmodule with OTLP JSON serializationAtomicBoolflag withoneshotchannel for cleaner shutdown signaling/admin/v1/stats,/admin/v1/history,/admin/v1/logs, and/admin/v1/traces(functionality now available via WebSocket)DiagnosticsStatestruct passed via axum'sStateextractorArchitecture
The new server runs a single tokio runtime on a background thread with:
Closes
Test plan
just cipasses/admin/v1/telemetryreceives OTLP JSON payloads/,/live,/ready,/admin/v1/status,/admin/v1/config) continue to workhttps://claude.ai/code/session_01WsySSDSHHCWA55S2XHHku7
Note
Migrate diagnostics HTTP server from tiny_http to axum with WebSocket telemetry
tiny_httpblocking-thread server in server.rs with anaxum+ tokio async server running on a dedicated background thread with graceful oneshot shutdown./admin/v1/telemetrythat pushes OTLP JSON payloads (metrics, spans, logs) to subscribers every 2 seconds via a tokio broadcast channel.telemetrymodule (telemetry.rs) with sampling, collection, and OTLP JSON serialization for metrics, spans, and logs./admin/v1/stats,/admin/v1/history,/admin/v1/logs,/admin/v1/traces; existing endpoints (/,/live,/ready,/admin/v1/status,/admin/v1/config) are preserved with axum semantics.BackgroundHttpTaskby removing theShutdownHandleenum and storing the oneshot sender directly.Macroscope summarized fd0fe52.