
Migrate diagnostics server from tiny_http to axum with WebSocket telemetry #1707

Merged
strawgate merged 7 commits into main from claude/refine-local-plan-1flIk
Apr 9, 2026

Conversation

@strawgate
Owner

@strawgate strawgate commented Apr 9, 2026

Summary

Replaces the tiny_http-based diagnostics server with an axum-based implementation that adds WebSocket support for real-time telemetry streaming. The new server pushes OTLP JSON metrics, spans, and logs every 2 seconds to connected WebSocket clients.

Key Changes

  • HTTP Framework: Migrated from tiny_http (blocking, single-threaded) to axum (async, tokio-based)
  • WebSocket Support: Added /admin/v1/telemetry endpoint that streams OTLP JSON signals (resourceMetrics, resourceSpans, resourceLogs) to clients
  • Telemetry Module: Extracted metric sampling, span collection, and log gathering into a new telemetry.rs module with OTLP JSON serialization
  • Async Sampler: Converted the metric sampler from a blocking thread to an async tokio task
  • Simplified Shutdown: Replaced AtomicBool flag with oneshot channel for cleaner shutdown signaling
  • Removed Endpoints: Dropped /admin/v1/stats, /admin/v1/history, /admin/v1/logs, and /admin/v1/traces (functionality now available via WebSocket)
  • State Management: Centralized server state in DiagnosticsState struct passed via axum's State extractor

Architecture

The new server runs a single tokio runtime on a background thread with:

  • Main axum router handling HTTP GET requests
  • Async sampler task that collects metrics/spans/logs every 2s and broadcasts to WebSocket subscribers
  • Graceful shutdown via oneshot channel
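The sampler-plus-shutdown shape above can be illustrated without axum or tokio. The following is a std-only analogue, not the PR's implementation: it uses a thread and `std::sync::mpsc` where the PR uses a tokio task and a oneshot channel, and the `Sampler` type is hypothetical. The idea it shows is the same: the wait on the shutdown channel doubles as the sampling interval, so no separate flag needs polling.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Handle to a background sampler thread (hypothetical type for illustration).
pub struct Sampler {
    shutdown_tx: mpsc::Sender<()>,
    handle: JoinHandle<u64>,
}

impl Sampler {
    /// Spawn a thread that calls `sample` once per `interval` until shut down.
    pub fn spawn<F>(interval: Duration, mut sample: F) -> Self
    where
        F: FnMut() + Send + 'static,
    {
        let (shutdown_tx, shutdown_rx) = mpsc::channel::<()>();
        let handle = thread::spawn(move || {
            let mut ticks = 0u64;
            loop {
                // Waiting on the shutdown channel doubles as the interval timer.
                match shutdown_rx.recv_timeout(interval) {
                    Err(RecvTimeoutError::Timeout) => {
                        sample();
                        ticks += 1;
                    }
                    // A message or a dropped sender both mean "stop".
                    Ok(()) | Err(RecvTimeoutError::Disconnected) => return ticks,
                }
            }
        });
        Sampler { shutdown_tx, handle }
    }

    /// Signal shutdown and wait for the thread; returns the number of samples taken.
    pub fn shutdown(self) -> u64 {
        let _ = self.shutdown_tx.send(());
        self.handle.join().expect("sampler thread panicked")
    }
}
```

Compared with an `AtomicBool` flag, the channel wakes the loop immediately on shutdown instead of waiting for the next poll.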

Closes

Test plan

  • just ci passes
  • Manual WebSocket connection to /admin/v1/telemetry receives OTLP JSON payloads
  • Existing HTTP endpoints (/, /live, /ready, /admin/v1/status, /admin/v1/config) continue to work

https://claude.ai/code/session_01WsySSDSHHCWA55S2XHHku7

Note

Migrate diagnostics HTTP server from tiny_http to axum with WebSocket telemetry

  • Replaces the tiny_http blocking-thread server in server.rs with an axum + tokio async server running on a dedicated background thread with graceful oneshot shutdown.
  • Adds a new WebSocket endpoint at /admin/v1/telemetry that pushes OTLP JSON payloads (metrics, spans, logs) to subscribers every 2 seconds via a tokio broadcast channel.
  • Adds a new telemetry module (telemetry.rs) with sampling, collection, and OTLP JSON serialization for metrics, spans, and logs.
  • Removes legacy polling endpoints: /admin/v1/stats, /admin/v1/history, /admin/v1/logs, /admin/v1/traces; existing endpoints (/, /live, /ready, /admin/v1/status, /admin/v1/config) are preserved with axum semantics.
  • Simplifies BackgroundHttpTask by removing the ShutdownHandle enum and storing the oneshot sender directly.

Macroscope summarized fd0fe52.

Replace the tiny_http-based diagnostics server with axum, adding a
WebSocket endpoint at /admin/v1/telemetry that pushes OTLP JSON
(metrics, traces, logs) every 2s via a broadcast channel. This
eliminates dashboard polling and enables live span-start visibility.

- Rewrite server.rs from tiny_http to axum (1104 lines, down from 2010)
- New telemetry.rs: OTLP JSON serialization + sampling/collection
- Remove /admin/v1/{stats,traces,logs,history} (replaced by WebSocket)
- Simplify BackgroundHttpTask (remove dead TinyHttp variant)
- Add ws feature to axum dependency
- All 32 diagnostics tests pass, zero warnings

https://claude.ai/code/session_01WsySSDSHHCWA55S2XHHku7
Copilot AI review requested due to automatic review settings April 9, 2026 03:00
@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

@coderabbitai

coderabbitai Bot commented Apr 9, 2026

Warning

Rate limit exceeded

@strawgate has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 0 minutes and 34 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 0 minutes and 34 seconds.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: ASSERTIVE

Plan: Pro

Run ID: 31ffa53d-d46e-4376-8d82-7614cc078a53

📥 Commits

Reviewing files that changed from the base of the PR and between c81aa72 and fd0fe52.

📒 Files selected for processing (4)
  • crates/logfwd-io/src/diagnostics.rs
  • crates/logfwd-io/src/diagnostics/server.rs
  • crates/logfwd-io/src/diagnostics/telemetry.rs
  • crates/logfwd-io/src/lib.rs

Walkthrough

The diagnostics HTTP server was migrated from tiny_http to axum and now runs on a dedicated OS thread hosting an internal Tokio runtime. Shutdown handling was simplified: the former enum-based ShutdownHandle was removed in favor of a tokio::sync::oneshot::Sender<()> used for graceful shutdown. A WebSocket telemetry endpoint at /admin/v1/telemetry and a new telemetry module (OTLP JSON serializers for metrics, spans, logs) were added. The sampler loop became an async task broadcasting telemetry to clients; legacy tiny_http endpoints and associated sampler/thread handling were removed.

Possibly related PRs


Caution

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (2 errors, 2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| High-Quality Rust Practices | ❌ Error | Four new pub(super) types in telemetry.rs lack required /// doc comments describing behavior. | Add /// doc comments to MetricValue, MetricPoint, SpanRecord, and LogRecord explaining their purpose and field semantics. |
| Formal Verification Coverage | ❌ Error | Six new pub(super) functions with non-trivial logic lack required KANI proofs in #[cfg(kani)] blocks despite formal verification guidance requirements. | Add KANI proofs for all six functions verifying no panic on bounded inputs, JSON correctness, arithmetic invariants, and loop termination with #[kani::unwind(N)] annotations. |
| Documentation Thoroughly Updated | ⚠️ Warning | PR lacks documentation for public methods and structs, no ADR for architectural migration, and architecture docs not updated. | Add doc comments to BackgroundHttpTask methods and telemetry structs; create ADR documenting tiny_http to axum migration; update ARCHITECTURE.md. |
| Maintainer Fitness | ⚠️ Warning | PR exceeds 500 non-test lines without explicit atomicity justification and omits required disclosures: risk surface, detailed test coverage breakdown, and known limitations acknowledgment. | Add scope justification, risk surface section, concrete test coverage details, and known limitations section with span ID collision and startup channel tracking. |
✅ Passed checks (1 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Crate Boundary And Dependency Integrity | ✅ Passed | PR satisfies all five crate boundary requirements: no new external dependencies added, logfwd-core maintains no-std/no-unsafe contract, dependency direction flows correctly with no cycles, binary contains only CLI orchestration, and telemetry module appropriately placed in logfwd-io. |


Copilot AI left a comment


Pull request overview

Migrates the logfwd-io diagnostics server from a blocking tiny_http implementation to an async axum server and introduces a WebSocket endpoint for streaming OTLP JSON telemetry (metrics/spans/logs) to connected clients.

Changes:

  • Replace the diagnostics HTTP server with an axum + Tokio background thread setup, including graceful shutdown via oneshot.
  • Add /admin/v1/telemetry WebSocket endpoint with a periodic sampler that broadcasts OTLP JSON payloads.
  • Introduce a new telemetry.rs module that collects pipeline/process/span/log data and serializes it to OTLP JSON.

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
crates/logfwd-io/src/diagnostics/telemetry.rs New telemetry collection + OTLP JSON serialization (with unit tests).
crates/logfwd-io/src/diagnostics/server.rs Axum-based diagnostics server, WebSocket endpoint, async sampler task, and updated endpoint handlers/tests.
crates/logfwd-io/src/diagnostics.rs Wires in the new telemetry module.
crates/logfwd-io/src/background_http_task.rs Simplifies shutdown handling to a single oneshot::Sender<()> for axum-based servers.
crates/logfwd-io/Cargo.toml Enables axum ws feature for WebSocket support.
Cargo.lock Locks additional transitive deps required by axum WebSocket support.

Comment thread crates/logfwd-io/src/diagnostics/telemetry.rs
Comment thread crates/logfwd-io/src/diagnostics/telemetry.rs Outdated
Comment thread crates/logfwd-io/src/diagnostics/telemetry.rs Outdated
Comment thread crates/logfwd-io/src/diagnostics/server.rs
Comment thread crates/logfwd-io/src/diagnostics/server.rs
Comment thread crates/logfwd-io/src/diagnostics/server.rs Outdated
Comment thread crates/logfwd-io/src/diagnostics/server.rs

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/logfwd-io/src/diagnostics/server.rs`:
- Around line 215-227: The background thread currently swallows errors when
building the runtime and converting the std_listener, making failures invisible;
change the two Err(_) arms in the tokio::runtime::Builder::new_current_thread()
match and the tokio::net::TcpListener::from_std(std_listener) match to capture
the error (e.g., Err(e)) and log it with context (using the project's
logging/tracing facility) before returning so failures in runtime creation and
listener conversion are recorded along with the error details.

In `@crates/logfwd-io/src/diagnostics/telemetry.rs`:
- Line 322: The current write! call computes the span end as s.start_unix_ns +
s.duration_ns which can overflow; change the calculation to use saturating_add
(e.g., s.start_unix_ns.saturating_add(s.duration_ns)) before formatting so the
end time is clamped instead of overflowing, leaving the surrounding write!(out,
"{}", ...) invocation intact.
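
A minimal sketch of the suggested fix, assuming both fields are `u64` nanosecond values (the comment names `s.start_unix_ns` and `s.duration_ns` but not their types); the helper name `span_end_unix_ns` is hypothetical:

```rust
/// Compute a span's end timestamp in nanoseconds.
/// `saturating_add` clamps at `u64::MAX`, whereas plain `+` would panic in
/// debug builds and wrap around in release builds.
pub fn span_end_unix_ns(start_unix_ns: u64, duration_ns: u64) -> u64 {
    start_unix_ns.saturating_add(duration_ns)
}
```

A clamped end time is still wrong in absolute terms, but it keeps the span ordered after its start, which a wrapped value would not.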

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: ASSERTIVE

Plan: Pro

Run ID: 862ec23b-f6a6-48a3-a75b-ea2edf15ea9f

📥 Commits

Reviewing files that changed from the base of the PR and between 2f7d9e9 and 32102cd.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (5)
  • crates/logfwd-io/Cargo.toml
  • crates/logfwd-io/src/background_http_task.rs
  • crates/logfwd-io/src/diagnostics.rs
  • crates/logfwd-io/src/diagnostics/server.rs
  • crates/logfwd-io/src/diagnostics/telemetry.rs

Comment thread crates/logfwd-io/src/diagnostics/server.rs
Comment thread crates/logfwd-io/src/diagnostics/telemetry.rs Outdated
Comment thread crates/logfwd-io/src/diagnostics/telemetry.rs Outdated
Comment thread crates/logfwd-io/src/diagnostics/telemetry.rs
Comment thread crates/logfwd-io/src/diagnostics/telemetry.rs Outdated
@macroscopeapp

macroscopeapp Bot commented Apr 9, 2026

Approvability

Verdict: Needs human review

This PR is a major infrastructure refactor that migrates from tiny_http to axum, removes several REST endpoints (stats, traces, logs, history, telemetry/*), adds WebSocket-based telemetry push, and changes the threading model from blocking std threads to async tokio. These are significant runtime behavior changes that warrant human review despite the clean implementation.


strawgate and others added 2 commits April 8, 2026 23:39
- Fix inaccurate comment about unknown span attributes (dropped, not stored)
- Use full 32-hex-char trace_id to prevent collisions past 16M batches
- Use saturating_add for endTimeUnixNano to prevent overflow
- Log errors on diagnostics runtime/listener failures instead of silent return
- Skip OTLP serialization+broadcast when no WebSocket subscribers
- Capture start_time.elapsed() once per status request
- Emit null for non-finite gauge values instead of invalid JSON
- Fix collect_new_logs stall after stderr ring buffer eviction
- Run cargo fmt

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
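
As an illustration of the "emit null for non-finite gauge values" item in the commit above: JSON has no representation for NaN or infinity, and Rust's `Display` for `f64` prints them as `NaN` and `inf`, which would corrupt the payload. A minimal sketch, with a hypothetical helper name rather than the PR's code:

```rust
use std::fmt::Write;

/// Append a gauge value to a JSON buffer, substituting `null` for values
/// that JSON cannot represent (NaN, +inf, -inf).
pub fn write_gauge_value(out: &mut String, v: f64) {
    if v.is_finite() {
        // Writing to a String cannot fail, so the result is ignored.
        let _ = write!(out, "{v}");
    } else {
        out.push_str("null");
    }
}
```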
@strawgate
Owner Author

@coderabbitai please do a full review

@coderabbitai

coderabbitai Bot commented Apr 9, 2026

@strawgate Sure, kicking off a full review of PR #1707 now!

✅ Actions performed

Full review triggered.

Comment thread crates/logfwd-io/src/diagnostics/telemetry.rs Outdated

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/logfwd-io/src/diagnostics/server.rs`:
- Around line 212-246: The thread currently returns success as soon as it is
spawned even if the Tokio runtime or TcpListener setup later fails; add a
startup acknowledgement oneshot channel created in start() (e.g., let
(startup_tx, startup_rx) = tokio::sync::oneshot::channel()) and move startup_tx
into the spawned thread; inside the thread, after successfully building the
runtime and converting the std_listener via tokio::net::TcpListener::from_std
and right after spawning the sampler_loop/starting the axum server task, send
startup_tx.send(Ok(())) (or send Err(e) on failures) to confirm the background
server is alive; in start(), await startup_rx (or block on it) and only return
the ServerHandle/Ok(...) when the receiver yields success, propagating any error
from the thread if it fails during initialization (ensure the thread still logs
errors via tracing::error as now).
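
The startup acknowledgement described above can be sketched with the standard library alone. This is an analogue under stated assumptions, not the reviewed code: it uses `std::sync::mpsc` where the comment proposes a tokio oneshot, and the `setup` and `serve` closures are hypothetical stand-ins for building the runtime, converting the listener, and running the server.

```rust
use std::io;
use std::sync::mpsc;
use std::thread::{self, JoinHandle};

/// Start a background thread and return only after its setup has succeeded.
///
/// `setup` stands in for runtime and listener initialization; `serve` stands
/// in for running the server. Both are placeholders for illustration.
pub fn start<S, R>(setup: S, serve: R) -> io::Result<JoinHandle<()>>
where
    S: FnOnce() -> io::Result<()> + Send + 'static,
    R: FnOnce() + Send + 'static,
{
    let (startup_tx, startup_rx) = mpsc::channel::<io::Result<()>>();
    let handle = thread::spawn(move || {
        if let Err(e) = setup() {
            // Report the failure to the caller instead of exiting silently.
            let _ = startup_tx.send(Err(e));
            return;
        }
        let _ = startup_tx.send(Ok(()));
        serve();
    });
    match startup_rx.recv() {
        Ok(Ok(())) => Ok(handle),
        Ok(Err(e)) => {
            let _ = handle.join();
            Err(e)
        }
        // The sender was dropped without a message, e.g. the thread panicked.
        Err(_) => Err(io::Error::new(
            io::ErrorKind::Other,
            "diagnostics thread exited before reporting startup",
        )),
    }
}
```

With this shape, the caller of `start` sees a bind or runtime failure as an `Err` rather than receiving a handle to a thread that has already exited.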

In `@crates/logfwd-io/src/diagnostics/telemetry.rs`:
- Around line 289-317: The current span IDs are built from the per-pipeline
batch key alone (the loop over pm in pipelines reading active_batches and using
id), which allows identical trace_id/span_id across different PipelineMetrics;
fix it by deriving trace_id/span_id from both the pipeline identity and the
batch id (e.g., hash or namespace the tuple (pm.name, id)) or, preferably,
persist real trace_id/span_id on ActiveBatch and use those when building
SpanRecord; update the SpanRecord construction in telemetry.rs to use the
combined/ persisted IDs (reference symbols: PipelineMetrics, active_batches,
next_batch_id, ActiveBatch, SpanRecord, trace_id, span_id).
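
One way to namespace the IDs by pipeline, as the first option above suggests, is sketched below. It is an assumption-laden illustration: the comment does not say which hash to use, so FNV-1a is a stand-in, and `synthetic_ids` is a hypothetical helper. The trace ID is 32 hex characters and the span ID 16, matching the OTLP ID widths of 16 and 8 bytes.

```rust
/// 64-bit FNV-1a hash (a stand-in; the source does not name a hash function).
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for &b in bytes {
        h ^= u64::from(b);
        h = h.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    h
}

/// Derive a (trace_id, span_id) pair from the pipeline name and batch id.
/// The same batch id in two pipelines yields different IDs unless the two
/// pipeline names collide under the hash.
pub fn synthetic_ids(pipeline: &str, batch_id: u64) -> (String, String) {
    let name_hash = fnv1a64(pipeline.as_bytes());
    // 16 hex chars of name hash followed by 16 hex chars of batch id.
    let trace_id = format!("{name_hash:016x}{batch_id:016x}");
    // An all-zero span id is invalid in OTLP, so clamp to at least 1.
    let span = (name_hash ^ batch_id.rotate_left(32)).max(1);
    let span_id = format!("{span:016x}");
    (trace_id, span_id)
}
```

Persisting real IDs on the batch, the option the comment prefers, avoids relying on a hash at all; the sketch covers only the cheaper derivation.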

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: ASSERTIVE

Plan: Pro

Run ID: d25411ef-106b-4940-872b-18505bc550b4

📥 Commits

Reviewing files that changed from the base of the PR and between c5d76b8 and c81aa72.

📒 Files selected for processing (2)
  • crates/logfwd-io/src/diagnostics/server.rs
  • crates/logfwd-io/src/diagnostics/telemetry.rs

Comment thread crates/logfwd-io/src/diagnostics/server.rs
Comment thread crates/logfwd-io/src/diagnostics/telemetry.rs
strawgate and others added 4 commits April 9, 2026 00:12
…nc, span IDs

- Guard i64 cast for gauge values outside i64 range (emit as float)
- Synchronize diagnostics server startup via channel to propagate errors
- Mix pipeline name hash into synthetic trace/span IDs for uniqueness

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Resolve conflicts with merged #1706 (telemetry HTTP endpoints).
Keep #1707's axum server rewrite as the primary diagnostics server.
Retain telemetry_buffer.rs from #1706 on disk (already in main).
Remove orphaned tiny_http handler methods that were auto-merged
into the axum code, and suppress dead-code warnings on the
telemetry_buffer module since it is unused by the axum server.
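
The gauge cast guard listed in the commits above ("emit as float" when a value is outside the i64 range) might look roughly like the following. This is a sketch with a hypothetical helper name, and it assumes whole-valued gauges are otherwise printed as integers. The bounds are written as `f64` constants because `i64::MAX` is not exactly representable as `f64`, while 2^63 is.

```rust
use std::fmt::Write;

/// Append a gauge to a JSON buffer: as an integer when it is whole-valued and
/// fits in i64, as a float otherwise, and as `null` when non-finite.
pub fn write_gauge_number(out: &mut String, v: f64) {
    // -2^63 and 2^63 are exact in f64; valid i64 values lie in [LO, HI).
    const LO: f64 = -9_223_372_036_854_775_808.0;
    const HI: f64 = 9_223_372_036_854_775_808.0;
    if !v.is_finite() {
        out.push_str("null");
    } else if v.fract() == 0.0 && v >= LO && v < HI {
        // In range, so the cast is exact rather than silently saturating.
        let _ = write!(out, "{}", v as i64);
    } else {
        let _ = write!(out, "{v}");
    }
}
```

Without the range check, `v as i64` would saturate to `i64::MAX` or `i64::MIN` for large magnitudes and report a value the gauge never held.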
@strawgate strawgate merged commit c1d8227 into main Apr 9, 2026
16 of 17 checks passed
@strawgate strawgate deleted the claude/refine-local-plan-1flIk branch April 9, 2026 05:42
strawgate added a commit that referenced this pull request Apr 9, 2026
…metry (#1707)

Co-authored-by: Claude <noreply@anthropic.com>

3 participants