Migrate diagnostics server from tiny_http to axum with WebSocket telemetry by strawgate · Pull Request #1707 · strawgate/fastforward

strawgate · 2026-04-09T03:00:10Z

Summary

Replaces the tiny_http-based diagnostics server with an axum-based implementation that adds WebSocket support for real-time telemetry streaming. The new server pushes OTLP JSON metrics, spans, and logs every 2 seconds to connected WebSocket clients.

Key Changes

HTTP Framework: Migrated from tiny_http (blocking, single-threaded) to axum (async, tokio-based)
WebSocket Support: Added /admin/v1/telemetry endpoint that streams OTLP JSON signals (resourceMetrics, resourceSpans, resourceLogs) to clients
Telemetry Module: Extracted metric sampling, span collection, and log gathering into a new telemetry.rs module with OTLP JSON serialization
Async Sampler: Converted the metric sampler from a blocking thread to an async tokio task
Simplified Shutdown: Replaced AtomicBool flag with oneshot channel for cleaner shutdown signaling
Removed Endpoints: Dropped /admin/v1/stats, /admin/v1/history, /admin/v1/logs, and /admin/v1/traces (functionality now available via WebSocket)
State Management: Centralized server state in DiagnosticsState struct passed via axum's State extractor
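
For readers unfamiliar with the payload shape, the following is a minimal sketch of an OTLP JSON `resourceMetrics` envelope carrying a single gauge point. The helper name `gauge_envelope`, the scope name `diagnostics`, and the hand-rolled string building are assumptions for illustration and are not taken from telemetry.rs. Field names follow the OTLP JSON mapping, in which 64-bit integers such as `timeUnixNano` are encoded as strings.

```rust
/// Build a minimal OTLP JSON metrics envelope with one gauge data point.
/// Hypothetical helper: it assumes `metric_name` needs no JSON escaping
/// and that `value` is finite.
fn gauge_envelope(metric_name: &str, time_unix_nano: u64, value: f64) -> String {
    let mut out = String::new();
    // Envelope: resourceMetrics -> scopeMetrics -> metrics.
    out.push_str(r#"{"resourceMetrics":[{"scopeMetrics":[{"scope":{"name":"diagnostics"},"metrics":[{"name":""#);
    out.push_str(metric_name);
    // Gauge with a single data point; the timestamp is a JSON string.
    out.push_str(r#"","gauge":{"dataPoints":[{"timeUnixNano":""#);
    out.push_str(&time_unix_nano.to_string());
    out.push_str(r#"","asDouble":"#);
    out.push_str(&value.to_string());
    // Close data point, dataPoints, gauge, metric, metrics, scope entry,
    // scopeMetrics, resource entry, resourceMetrics, and the root object.
    out.push_str("}]}}]}]}]}");
    out
}
```

Spans and logs would use the same nesting under `resourceSpans` and `resourceLogs`.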

Architecture

The new server runs a single tokio runtime on a background thread with:

Main axum router handling HTTP GET requests
Async sampler task that collects metrics/spans/logs every 2s and broadcasts to WebSocket subscribers
Graceful shutdown via oneshot channel
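
The sampler-plus-shutdown shape described above can be sketched with the standard library alone. This is not the PR's code: `std::sync::mpsc` stands in for the tokio oneshot channel, a blocking `recv_timeout` stands in for the async interval, and `run_sampler` and the `sample` callback are hypothetical names.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Run a periodic sampler until a shutdown signal arrives and return the
/// number of samples taken. `sample` stands in for the real
/// collect-and-broadcast step.
fn run_sampler(interval: Duration, shutdown: Receiver<()>, mut sample: impl FnMut()) -> u64 {
    let mut ticks = 0;
    loop {
        match shutdown.recv_timeout(interval) {
            // The interval elapsed with no shutdown signal: take a sample.
            Err(RecvTimeoutError::Timeout) => {
                sample();
                ticks += 1;
            }
            // An explicit signal, or a dropped sender, stops the loop.
            Ok(()) | Err(RecvTimeoutError::Disconnected) => return ticks,
        }
    }
}
```

Waiting on the channel doubles as the sleep between samples, so the loop reacts to shutdown immediately instead of polling a flag once per iteration. That is the main practical difference from an AtomicBool check.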

Closes

Test plan

just ci passes
Manual WebSocket connection to /admin/v1/telemetry receives OTLP JSON payloads
Existing HTTP endpoints (/, /live, /ready, /admin/v1/status, /admin/v1/config) continue to work

https://claude.ai/code/session_01WsySSDSHHCWA55S2XHHku7

Note

Migrate diagnostics HTTP server from tiny_http to axum with WebSocket telemetry

Replaces the tiny_http blocking-thread server in server.rs with an axum + tokio async server running on a dedicated background thread with graceful oneshot shutdown.
Adds a new WebSocket endpoint at /admin/v1/telemetry that pushes OTLP JSON payloads (metrics, spans, logs) to subscribers every 2 seconds via a tokio broadcast channel.
Adds a new telemetry module (telemetry.rs) with sampling, collection, and OTLP JSON serialization for metrics, spans, and logs.
Removes legacy polling endpoints: /admin/v1/stats, /admin/v1/history, /admin/v1/logs, /admin/v1/traces; existing endpoints (/, /live, /ready, /admin/v1/status, /admin/v1/config) are preserved with axum semantics.
Simplifies BackgroundHttpTask by removing the ShutdownHandle enum and storing the oneshot sender directly.

Macroscope summarized fd0fe52.

Replace the tiny_http-based diagnostics server with axum, adding a WebSocket endpoint at /admin/v1/telemetry that pushes OTLP JSON (metrics, traces, logs) every 2s via a broadcast channel. This eliminates dashboard polling and enables live span-start visibility.

- Rewrite server.rs from tiny_http to axum (1104 lines, down from 2010)
- New telemetry.rs: OTLP JSON serialization + sampling/collection
- Remove /admin/v1/{stats,traces,logs,history} (replaced by WebSocket)
- Simplify BackgroundHttpTask (remove dead TinyHttp variant)
- Add ws feature to axum dependency
- All 32 diagnostics tests pass, zero warnings

https://claude.ai/code/session_01WsySSDSHHCWA55S2XHHku7

chatgpt-codex-connector · 2026-04-09T03:00:17Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

coderabbitai · 2026-04-09T03:00:25Z

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: ASSERTIVE

Plan: Pro

Run ID: 31ffa53d-d46e-4376-8d82-7614cc078a53

📥 Commits

Reviewing files that changed from the base of the PR and between c81aa72 and fd0fe52.

📒 Files selected for processing (4)

crates/logfwd-io/src/diagnostics.rs
crates/logfwd-io/src/diagnostics/server.rs
crates/logfwd-io/src/diagnostics/telemetry.rs
crates/logfwd-io/src/lib.rs

Walkthrough

The diagnostics HTTP server was migrated from tiny_http to axum and now runs on a dedicated OS thread hosting an internal Tokio runtime. Shutdown handling was simplified: the former enum-based ShutdownHandle was removed in favor of a tokio::sync::oneshot::Sender<()> used for graceful shutdown. A WebSocket telemetry endpoint at /admin/v1/telemetry and a new telemetry module (OTLP JSON serializers for metrics, spans, logs) were added. The sampler loop became an async task broadcasting telemetry to clients; legacy tiny_http endpoints and associated sampler/thread handling were removed.

Possibly related PRs

Migrate OTLP/OTAP/Arrow receivers to axum #1591: Introduced the ShutdownHandle/new_axum support for axum alongside tiny_http, which this change removes in favor of a oneshot-based shutdown.
fix(diagnostics): stop metric-sampler thread on drop; add double-start guard #1524: Modified diagnostics server lifecycle with Arc and sampler thread handling that was refactored into an async sampler and oneshot shutdown here.
feat: logfwd-core is now #![no_std] + #![forbid(unsafe_code)] #375: Earlier addition of the diagnostics HTTP server and BackgroundHttpTask code that this PR rewrites for the axum migration.

Caution

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (2 errors, 2 warnings)

Check name	Status	Explanation	Resolution
High-Quality Rust Practices	❌ Error	Four new pub(super) types in telemetry.rs lack required /// doc comments describing behavior.	Add /// doc comments to MetricValue, MetricPoint, SpanRecord, and LogRecord explaining their purpose and field semantics.
Formal Verification Coverage	❌ Error	Six new pub(super) functions with non-trivial logic lack required KANI proofs in #[cfg(kani)] blocks despite formal verification guidance requirements.	Add KANI proofs for all six functions verifying no panic on bounded inputs, JSON correctness, arithmetic invariants, and loop termination with #[kani::unwind(N)] annotations.
Documentation Thoroughly Updated	⚠️ Warning	PR lacks documentation for public methods and structs, no ADR for architectural migration, and architecture docs not updated.	Add doc comments to BackgroundHttpTask methods and telemetry structs; create ADR documenting tiny_http to axum migration; update ARCHITECTURE.md.
Maintainer Fitness	⚠️ Warning	PR exceeds 500 non-test lines without explicit atomicity justification and omits required disclosures: risk surface, detailed test coverage breakdown, and known limitations acknowledgment.	Add scope justification, risk surface section, concrete test coverage details, and known limitations section with span ID collision and startup channel tracking.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Crate Boundary And Dependency Integrity	✅ Passed	PR satisfies all five crate boundary requirements: no new external dependencies added, logfwd-core maintains no-std/no-unsafe contract, dependency direction flows correctly with no cycles, binary contains only CLI orchestration, and telemetry module appropriately placed in logfwd-io.

Copilot

Pull request overview

Migrates the logfwd-io diagnostics server from a blocking tiny_http implementation to an async axum server and introduces a WebSocket endpoint for streaming OTLP JSON telemetry (metrics/spans/logs) to connected clients.

Changes:

Replace the diagnostics HTTP server with an axum + Tokio background thread setup, including graceful shutdown via oneshot.
Add /admin/v1/telemetry WebSocket endpoint with a periodic sampler that broadcasts OTLP JSON payloads.
Introduce a new telemetry.rs module that collects pipeline/process/span/log data and serializes it to OTLP JSON.

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 7 comments.

Show a summary per file

File	Description
crates/logfwd-io/src/diagnostics/telemetry.rs	New telemetry collection + OTLP JSON serialization (with unit tests).
crates/logfwd-io/src/diagnostics/server.rs	Axum-based diagnostics server, WebSocket endpoint, async sampler task, and updated endpoint handlers/tests.
crates/logfwd-io/src/diagnostics.rs	Wires in the new telemetry module.
crates/logfwd-io/src/background_http_task.rs	Simplifies shutdown handling to a single `oneshot::Sender<()>` for axum-based servers.
crates/logfwd-io/Cargo.toml	Enables axum `ws` feature for WebSocket support.
Cargo.lock	Locks additional transitive deps required by axum WebSocket support.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/logfwd-io/src/diagnostics/server.rs`:
- Around line 215-227: The background thread currently swallows errors when
building the runtime and converting the std_listener, making failures invisible;
change the two Err(_) arms in the tokio::runtime::Builder::new_current_thread()
match and the tokio::net::TcpListener::from_std(std_listener) match to capture
the error (e.g., Err(e)) and log it with context (using the project's
logging/tracing facility) before returning so failures in runtime creation and
listener conversion are recorded along with the error details.

In `@crates/logfwd-io/src/diagnostics/telemetry.rs`:
- Line 322: The current write! call computes the span end as s.start_unix_ns +
s.duration_ns which can overflow; change the calculation to use saturating_add
(e.g., s.start_unix_ns.saturating_add(s.duration_ns)) before formatting so the
end time is clamped instead of overflowing, leaving the surrounding write!(out,
"{}", ...) invocation intact.

ℹ️ Review info

Run ID: 862ec23b-f6a6-48a3-a75b-ea2edf15ea9f

📥 Commits

Reviewing files that changed from the base of the PR and between 2f7d9e9 and 32102cd.

⛔ Files ignored due to path filters (1)

Cargo.lock is excluded by !**/*.lock

📒 Files selected for processing (5)

crates/logfwd-io/Cargo.toml
crates/logfwd-io/src/background_http_task.rs
crates/logfwd-io/src/diagnostics.rs
crates/logfwd-io/src/diagnostics/server.rs
crates/logfwd-io/src/diagnostics/telemetry.rs

macroscopeapp · 2026-04-09T03:07:11Z

Approvability

Verdict: Needs human review

This PR is a major infrastructure refactor that migrates from tiny_http to axum, removes several REST endpoints (stats, traces, logs, history, telemetry/*), adds WebSocket-based telemetry push, and changes the threading model from blocking std threads to async tokio. These are significant runtime behavior changes that warrant human review despite the clean implementation.

- Fix inaccurate comment about unknown span attributes (dropped, not stored)
- Use full 32-hex-char trace_id to prevent collisions past 16M batches
- Use saturating_add for endTimeUnixNano to prevent overflow
- Log errors on diagnostics runtime/listener failures instead of silent return
- Skip OTLP serialization+broadcast when no WebSocket subscribers
- Capture start_time.elapsed() once per status request
- Emit null for non-finite gauge values instead of invalid JSON
- Fix collect_new_logs stall after stderr ring buffer eviction
- Run cargo fmt

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
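
Two of the fixes in this commit, the saturating end timestamp and the null gauge value, are small enough to sketch in isolation. The helper names `span_end_unix_ns` and `write_gauge_value` and the `u64`/`f64` field types are assumptions for illustration, not the signatures used in telemetry.rs.

```rust
use std::fmt::Write;

/// Compute a span's end timestamp, clamping at u64::MAX. A plain `+` would
/// panic in debug builds and wrap around in release builds on overflow.
fn span_end_unix_ns(start_unix_ns: u64, duration_ns: u64) -> u64 {
    start_unix_ns.saturating_add(duration_ns)
}

/// Append a gauge value as a JSON token. NaN and the infinities have no JSON
/// number representation, so they are written as `null`.
fn write_gauge_value(out: &mut String, value: f64) {
    if value.is_finite() {
        // Writing to a String cannot fail, so the Result is ignored.
        let _ = write!(out, "{}", value);
    } else {
        out.push_str("null");
    }
}
```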

strawgate · 2026-04-09T04:44:32Z

@coderabbitai please do a full review

coderabbitai · 2026-04-09T04:44:43Z

@strawgate Sure, kicking off a full review of PR #1707 now!

✅ Actions performed

Full review triggered.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/logfwd-io/src/diagnostics/server.rs`:
- Around line 212-246: The thread currently returns success as soon as it is
spawned even if the Tokio runtime or TcpListener setup later fails; add a
startup acknowledgement oneshot channel created in start() (e.g., let
(startup_tx, startup_rx) = tokio::sync::oneshot::channel()) and move startup_tx
into the spawned thread; inside the thread, after successfully building the
runtime and converting the std_listener via tokio::net::TcpListener::from_std
and right after spawning the sampler_loop/starting the axum server task, send
startup_tx.send(Ok(())) (or send Err(e) on failures) to confirm the background
server is alive; in start(), await startup_rx (or block on it) and only return
the ServerHandle/Ok(...) when the receiver yields success, propagating any error
from the thread if it fails during initialization (ensure the thread still logs
errors via tracing::error as now).
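
A minimal sketch of the startup-acknowledgement pattern this comment asks for, using only the standard library. `std::sync::mpsc` stands in for `tokio::sync::oneshot`, and `start_background` and its `init` closure are hypothetical names; the real `start()` would build the runtime and convert the listener where `init` is called.

```rust
use std::sync::mpsc;
use std::thread::{self, JoinHandle};

/// Spawn a background thread and block until it reports whether its
/// fallible setup succeeded. `init` stands in for runtime and listener setup.
fn start_background<F>(init: F) -> Result<JoinHandle<()>, String>
where
    F: FnOnce() -> Result<(), String> + Send + 'static,
{
    let (startup_tx, startup_rx) = mpsc::channel::<Result<(), String>>();
    let handle = thread::spawn(move || {
        // Run setup on the background thread and report the outcome.
        let result = init();
        let failed = result.is_err();
        let _ = startup_tx.send(result);
        if failed {
            return;
        }
        // A real server would enter its serve loop here.
    });
    // Only hand back the handle once the thread has acknowledged startup.
    match startup_rx.recv() {
        Ok(Ok(())) => Ok(handle),
        Ok(Err(e)) => Err(e),
        // The sender was dropped without a message, e.g. the thread panicked.
        Err(_) => Err("background thread exited before acknowledging startup".to_string()),
    }
}
```

With this shape, `start_background(|| Err("bind failed".to_string()))` returns the error to the caller instead of reporting success for a server that never came up.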

In `@crates/logfwd-io/src/diagnostics/telemetry.rs`:
- Around line 289-317: The current span IDs are built from the per-pipeline
batch key alone (the loop over pm in pipelines reading active_batches and using
id), which allows identical trace_id/span_id across different PipelineMetrics;
fix it by deriving trace_id/span_id from both the pipeline identity and the
batch id (e.g., hash or namespace the tuple (pm.name, id)) or, preferably,
persist real trace_id/span_id on ActiveBatch and use those when building
SpanRecord; update the SpanRecord construction in telemetry.rs to use the
combined/persisted IDs (reference symbols: PipelineMetrics, active_batches,
next_batch_id, ActiveBatch, SpanRecord, trace_id, span_id).
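
One way to namespace synthetic IDs by pipeline, sketched under assumptions: `synthetic_ids` is a hypothetical helper, and `DefaultHasher` is used only because it ships with the standard library. Its output is not guaranteed to be stable across Rust releases, so this is unsuitable wherever IDs must match across builds.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Derive a 32-hex-char trace ID and a 16-hex-char span ID from the
/// (pipeline name, batch id) pair, so that equal batch ids in different
/// pipelines do not share identifiers.
fn synthetic_ids(pipeline_name: &str, batch_id: u64) -> (String, String) {
    // Hash of the pipeline name alone forms the high half of the trace ID.
    let mut name_hasher = DefaultHasher::new();
    pipeline_name.hash(&mut name_hasher);
    let name_hash = name_hasher.finish();

    // Hash of the full pair forms the span ID.
    let mut pair_hasher = DefaultHasher::new();
    (pipeline_name, batch_id).hash(&mut pair_hasher);
    let pair_hash = pair_hasher.finish();

    let trace_id = format!("{:016x}{:016x}", name_hash, batch_id);
    let span_id = format!("{:016x}", pair_hash);
    (trace_id, span_id)
}
```

OTLP treats all-zero trace and span IDs as invalid, so a real implementation would also need to handle the unlikely case where a hash comes out as zero.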

ℹ️ Review info

Run ID: d25411ef-106b-4940-872b-18505bc550b4

📥 Commits

Reviewing files that changed from the base of the PR and between c5d76b8 and c81aa72.

📒 Files selected for processing (2)

crates/logfwd-io/src/diagnostics/server.rs
crates/logfwd-io/src/diagnostics/telemetry.rs

…nc, span IDs

- Guard i64 cast for gauge values outside i64 range (emit as float)
- Synchronize diagnostics server startup via channel to propagate errors
- Mix pipeline name hash into synthetic trace/span IDs for uniqueness

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
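
The i64 guard can be illustrated as follows. This is a sketch under assumptions: `GaugeRepr` and `classify_gauge` are hypothetical names, and the rule that integral values are emitted as integers is inferred from the commit message rather than read from the code. In Rust, a bare `as i64` cast saturates at the i64 bounds, so an unguarded cast would silently clamp large values instead of failing.

```rust
/// How a gauge value would be emitted: as an integer or as a float.
#[derive(Debug, PartialEq)]
enum GaugeRepr {
    Int(i64),
    Double(f64),
}

/// Choose the integer form only when the value is integral and fits in i64;
/// everything else (fractions, huge values, NaN, infinities) stays a float.
fn classify_gauge(value: f64) -> GaugeRepr {
    // 2^63 as f64. i64::MAX itself is not exactly representable in f64,
    // so the upper bound is exclusive.
    const I64_UPPER: f64 = 9_223_372_036_854_775_808.0;
    let in_range = value >= -I64_UPPER && value < I64_UPPER;
    if value.fract() == 0.0 && in_range {
        GaugeRepr::Int(value as i64)
    } else {
        GaugeRepr::Double(value)
    }
}
```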

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Resolve conflicts with merged #1706 (telemetry HTTP endpoints). Keep #1707's axum server rewrite as the primary diagnostics server. Retain telemetry_buffer.rs from #1706 on disk (already in main). Remove orphaned tiny_http handler methods that were auto-merged into the axum code, and suppress dead-code warnings on the telemetry_buffer module since it is unused by the axum server.

…metry (#1707) Co-authored-by: Claude <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings April 9, 2026 03:00

Copilot started reviewing on behalf of strawgate April 9, 2026 03:00

Copilot AI reviewed Apr 9, 2026

coderabbitai Bot requested changes Apr 9, 2026

Comment thread crates/logfwd-io/src/diagnostics/server.rs

Comment thread crates/logfwd-io/src/diagnostics/telemetry.rs Outdated

macroscopeapp Bot reviewed Apr 9, 2026

Comment thread crates/logfwd-io/src/diagnostics/telemetry.rs Outdated

Comment thread crates/logfwd-io/src/diagnostics/telemetry.rs

Comment thread crates/logfwd-io/src/diagnostics/telemetry.rs Outdated

strawgate and others added 2 commits April 8, 2026 23:39

Merge branch 'main' into claude/refine-local-plan-1flIk

c5d76b8

macroscopeapp Bot reviewed Apr 9, 2026

Comment thread crates/logfwd-io/src/diagnostics/telemetry.rs Outdated

coderabbitai Bot requested changes Apr 9, 2026

Comment thread crates/logfwd-io/src/diagnostics/server.rs

Comment thread crates/logfwd-io/src/diagnostics/telemetry.rs

github-actions Bot mentioned this pull request Apr 9, 2026

work-unit: transport observability parity #1716

Closed

3 tasks

strawgate and others added 4 commits April 9, 2026 00:12

Merge branch 'main' into claude/refine-local-plan-1flIk

2206fdf

fix(clippy): invert if-not-else in platform_sensor_beta

cd6547d

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

strawgate merged commit c1d8227 into main Apr 9, 2026
16 of 17 checks passed

strawgate deleted the claude/refine-local-plan-1flIk branch April 9, 2026 05:42

strawgate added a commit that referenced this pull request Apr 9, 2026

Migrate diagnostics server from tiny_http to axum with WebSocket tele…

11d28af

…metry (#1707) Co-authored-by: Claude <noreply@anthropic.com>
