fix: improve file tailing reliability by strawgate · Pull Request #805 · strawgate/fastforward

strawgate · 2026-04-03T05:52:10Z

Summary

Addresses three key reliability issues in the file tailing path:

Checkpoint validation (bug: checkpoint seek not validated against current file size — stale offsets corrupt tailing after truncation #656): Compare restored offsets with file sizes, reset to 0 on rotation/truncation with a warning
Exclusive lock (bug: no lock file — multiple logfwd instances processing same file produce duplicate output with no warning #737): Acquire flock on {data_dir}/logfwd.lock at startup to prevent duplicate instances
Glob warnings (bug: missing/unmatched input file produces no warning — pipeline silently waits with zero output #730): Warn when configured glob matches no files

Additional: replaced eprintln! with tracing macros in tail.rs, added unit tests for checkpoint validation and glob warnings, integration test for exclusive lock.

Replaces #789 which was closed due to an accidentally committed .env file. This PR contains a clean squashed commit with no secrets in history.

Fixes #780

Test plan

cargo test -p logfwd-io — tail tests
cargo test -p logfwd --test test_lock — exclusive lock integration test
Manual: run two logfwd instances on same data_dir, second should fail

🤖 Generated with Claude Code

1. Checkpoint validation (#656): compare restored offsets with file sizes, reset to 0 on rotation/truncation with a warning. 2. Exclusive lock (#737): acquire flock on {data_dir}/logfwd.lock at startup to prevent duplicate instances. 3. Glob warnings (#730): warn when configured glob matches no files. Additional: replaced eprintln! with tracing macros in tail.rs, added unit tests for checkpoint validation and glob warnings, added integration test for exclusive lock. Co-Authored-By: Jules <noreply@google.com>

coderabbitai · 2026-04-03T05:52:26Z

Walkthrough

This PR implements file-tailing reliability features and performs extensive code modernization. Key changes include: (1) checkpoint offset validation against current file size on startup, with automatic reset to offset 0 if stale; (2) exclusive lock file acquisition at startup to prevent duplicate instance processing; (3) glob pattern match warnings when configured file patterns yield no files; (4) column naming updates from message_str to msg_str across documentation and examples; (5) grok UDF planning-time hook renamed from return_field_from_args to return_type_from_args returning ReturnInfo; (6) systematic use of fully-qualified type paths throughout codebase (e.g., std::process::Command instead of imported Command); (7) removal of #[serde(deny_unknown_fields)] from multiple config structs; (8) workspace dependency changes (arrow 55→54, explicit datafusion 45 with reduced feature set). New integration test test_lock.rs validates exclusive instance locking behavior. Deletion of dev-docs/research/file-tailing-audit.md.

Possibly related PRs

build: upgrade arrow 54→55, datafusion 45→48 #794: Modifies grok UDF planning-time return type from return_field_from_args to return_type_from_args with ReturnInfo wrapper; updates DataFusion dependency version and feature configuration
lint: enable unused_qualifications workspace-wide #783: Updates workspace lint configuration in Cargo.toml; applies opposite direction to unused_qualifications lint (removed here vs. enabled there)
fix: deny unknown fields in all config structs #782: Modifies #[serde(deny_unknown_fields)] attributes on config structs in opposite direction (removed here vs. added there)

Caution

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

Ignore

❌ Failed checks (2 errors, 3 warnings)

Check name	Status	Explanation	Resolution
Formal Verification Coverage	❌ Error	PR modifies public FileTailer::set_offset API and adds checkpoint validation logic without Kani proofs, proptest oracles, or verification documentation.	Add Kani proofs for set_offset boundary conditions, proptest harness for checkpoint restoration, and update dev-docs/VERIFICATION.md with proof status.
Crate Boundary And Dependency Integrity	❌ Error	PR violates workspace dependency management by pinning arrow-json and datafusion directly in child crates instead of using [workspace.dependencies].	Add arrow-json and datafusion to root [workspace.dependencies], then update child crates to use { workspace = true } instead of pinning versions.
Out of Scope Changes check	⚠️ Warning	The PR includes extensive out-of-scope changes beyond the three stated objectives: qualified path imports, raw string delimiter updates, trait object signature changes, and unrelated crate dependency version updates.	Isolate PR to file-tailing objectives only. Move unrelated changes (dependency updates, import qualification refactors, string literal reformatting) to a separate maintenance PR.
Documentation Thoroughly Updated	⚠️ Warning	PR implements three critical file-tailing reliability features but fails to document them comprehensively. Removed 306-line audit document not consolidated into DEVELOPING.md or ARCHITECTURE.md; missing ADR in DESIGN.md, entries in DEVELOPING.md, resolution marker in PHASES.md, and feature documentation in CONFIG_REFERENCE.md.	Add ADR to DESIGN.md for exclusive lock, document bugs/lessons in DEVELOPING.md, consolidate audit content into ARCHITECTURE.md, update PHASES.md `#780` resolution, and document new features in CONFIG_REFERENCE.md with proper code doc comments.
Maintainer Fitness	⚠️ Warning	PR bundles 3 core reliability objectives with ~50 files of unrelated refactoring (message_str renames, import qualifications, delimiter changes). Review comments explicitly flag unrelated changes.	Split into focused PRs: reliability-only PR with risk documentation and benchmark results; separate refactoring PR for cosmetic changes.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	All three objectives from `#780` are addressed: checkpoint validation on startup resets offsets exceeding file size (`#656`), exclusive lock acquisition prevents duplicate instances (`#737`), and glob warnings emit on zero matches (`#730`).
High-Quality Rust Practices	✅ Passed	PR adheres to all six Rust code quality criteria: unwrap/expect calls have clear messages or are test-only, structural allocations are justified, unsafe blocks have comprehensive SAFETY comments, all public items are documented, no behavioral feature flags present, and no speculative patterns detected.

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 12

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

crates/logfwd/tests/compliance.rs (1)

150-168: ⚠️ Potential issue | 🟠 Major

Source-specific verification can be bypassed when source_id column is missing.

With non-empty source_id, batch.column_by_name("source_id") returning None makes source_filter become None, so all rows from that batch are counted. That can produce false pass/fail in per-source compliance checks.

🔧 Proposed fix

-        let source_filter: Option<Vec<bool>> = if !source_id.is_empty() {
-            batch.column_by_name("source_id").map(|col| {
+        let source_filter: Option<Vec<bool>> = if !source_id.is_empty() {
+            Some(if let Some(col) = batch.column_by_name("source_id") {
                 (0..batch.num_rows())
                     .map(|row| {
                         if col.is_null(row) {
                             return false;
                         }
                         let val = match col.data_type() {
                             arrow::datatypes::DataType::Utf8 => col.as_string::<i32>().value(row),
                             arrow::datatypes::DataType::Utf8View => col.as_string_view().value(row),
                             _ => "",
                         };
                         val == source_id
                     })
                     .collect()
-            })
+            } else {
+                vec![false; batch.num_rows()]
+            })
         } else {
             None
         };

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@crates/logfwd/tests/compliance.rs` around lines 150 - 168, The current logic
sets source_filter to None when batch.column_by_name("source_id") returns None,
which lets rows pass when source_id is non-empty; change this so that when
source_id is non-empty but the "source_id" column is missing you produce
Some(Vec<bool>) of length batch.num_rows() with all false (i.e., no rows match)
instead of None; update the branch around batch.column_by_name("source_id") used
to build source_filter (and preserve the existing null handling and
UTF8/Utf8View matching logic) so missing column yields a false mask rather than
allowing all rows.

docs/CONFIG_REFERENCE.md (1)

137-143: ⚠️ Potential issue | 🟡 Minor

Add generator input type to documentation table.

The Generator variant exists in the InputType enum (implemented, used for benchmarking) but is missing from the Input types table in docs/CONFIG_REFERENCE.md. Add a row for generator with appropriate status and description.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/CONFIG_REFERENCE.md` around lines 137 - 143, The Input types table in
CONFIG_REFERENCE.md is missing the Generator variant from the InputType enum;
add a new table row for `generator` describing its status (e.g., Implemented)
and a concise description like "Produce synthetic logs for benchmarking/testing"
so documentation matches the implemented InputType::Generator used for
benchmarking. Ensure the row format matches the existing table style (Value |
Status | Description) and use the same backtick formatting for `generator`.

♻️ Duplicate comments (1)

crates/logfwd-io/tests/transport_e2e.rs (1)
159-159: ⚠️ Potential issue | 🟡 Minor

Same unnecessary change as Line 106.

Identical issue: writeln! and write! with explicit \n produce the same output. This change is unrelated to the PR scope and should be reverted.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/logfwd-io/tests/transport_e2e.rs` at line 159, The change replaces
writeln! with write!(client, "...\\n") in the test (the write!(client,
"{\"client\":{id},\"seq\":{i}}}\\n").unwrap() call), which is unnecessary and
out of scope; revert this to using writeln!(client,
"{{\"client\":{id},\"seq\":{i}}}").unwrap() (restore the original writeln!
invocation for the client write in the transport_e2e test) so behavior remains
unchanged and the unrelated edit is removed.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/logfwd-arrow/src/streaming_builder.rs`:
- Line 467: The change to the raw byte string delimiter on the variable `json`
in `streaming_builder.rs` is a cosmetic, unrelated edit; revert the line that
sets `let json = br#"not used directly"#;` back to the original `let json =
br"not used directly";` (or remove the unrelated test change entirely) so the PR
remains focused on tailing reliability; locate the `json` binding in the
`streaming_builder.rs` file (inside the Arrow builder code) and restore the
original delimiter form.

In `@crates/logfwd-core/Cargo.toml`:
- Line 16: Replace the direct pinned dev-dependency "arrow-json" in the
logfwd-core crate with a workspace-managed dependency: add an entry for
arrow-json (with the explicit version and default-features setting) under the
root [workspace.dependencies], then change the arrow-json line in the
logfwd-core Cargo.toml to reference the workspace (e.g., arrow-json = {
workspace = true }) so the crate uses the workspace-managed version instead of
pinning "54" locally.

In `@crates/logfwd-io/src/tail.rs`:
- Around line 247-249: The current logic only warns when the union of matches is
empty (using initial_paths.is_empty()), so change it to warn per-glob: iterate
over each pattern in patterns and call the existing warn_on_empty behavior for
each pattern when it yields no matches (use warn_on_empty = true), and remove
the aggregate initial_paths.is_empty() checks used at startup and in the rescan
path; ensure both the startup code and the rescan code invoke the per-pattern
warning path so each empty glob emits its own tracing::warn identifying the
failing pattern.
- Around line 1398-1417: The tests (e.g., test_glob_no_match_warning and the
related tests around FileTailer::new_with_globs) only assert tailer state and do
not verify that tracing::warn! was emitted; update the tests to install a test
tracing subscriber (or use the tracing_test helpers) that captures logs and
assert that a warning containing the expected message is emitted both at startup
and after rescan, and add a new mixed-pattern test that calls
FileTailer::new_with_globs with one matching glob and one non-matching glob to
assert the unmatched-glob warning is still logged; ensure you reference the
existing test names (test_glob_no_match_warning) and methods
(FileTailer::new_with_globs, tailer.poll) when adding assertions so tests fail
if the warnings are removed.

In `@crates/logfwd-io/tests/transport_e2e.rs`:
- Line 106: The change from writeln!(client, "{\"seq\":{i}}") to write!(client,
"{\"seq\":{i}}\\n") is unnecessary and unrelated to the PR goal; revert to using
writeln!(client, "{\"seq\":{i}}") (or otherwise restore the original writeln!
invocation that writes the formatted string with newline) where the client is
written to in transport_e2e.rs so the test code remains semantically identical
and focused on the tailing/checkpoint behaviors.

In `@crates/logfwd-transform/Cargo.toml`:
- Around line 11-16: The crate is pinning datafusion locally in
crates/logfwd-transform/Cargo.toml; instead add a datafusion entry with the
required features under the root [workspace.dependencies] and then replace the
local datafusion declaration here with an inherited dependency (e.g., datafusion
= { workspace = true, default-features = false } or similar) so this crate
(crates/logfwd-transform) uses the workspace-managed datafusion with the same
feature set (string_expressions, unicode_expressions, regex_expressions,
recursive_protection).

In `@crates/logfwd-transform/src/udf/grok.rs`:
- Line 309: The test uses the fully-qualified
arrow::record_batch::RecordBatch::try_new call even though RecordBatch is
already imported; replace arrow::record_batch::RecordBatch::try_new(schema,
vec![msgs]).unwrap() with the imported RecordBatch::try_new(schema,
vec![msgs]).unwrap() (i.e., use the existing RecordBatch symbol) to remove the
redundant qualification.

In `@crates/logfwd/tests/allocation_scaling.rs`:
- Line 54: The fixed 30s deadline in allocation_scaling.rs (the line assigning
let deadline = std::time::Instant::now() + Duration::from_secs(30)) makes the
500K scaling test flaky; change the timeout to be row-count-aware or overridable
via env var: compute the deadline using a base-per-row multiplier (e.g.,
duration_per_row * expected_row_count) with a minimum floor, and fall back to an
environment variable (e.g., TEST_TIMEOUT_MS) if provided, then replace the
current Duration::from_secs(30) use in the test's deadline calculation so the
timeout scales for large workloads while remaining bounded for small tests.

In `@crates/logfwd/tests/test_lock.rs`:
- Around line 5-68: The test_exclusive_lock test is fragile and platform-unsafe:
add a #[cfg(unix)] attribute to the test to avoid running on Windows; replace
the fixed std::thread::sleep(Duration::from_secs(10)) used after spawning child1
with a polling loop that checks for the lock file (e.g.,
data_dir.join("logfwd.lock")) or a readiness signal before proceeding (poll at
short intervals with a bounded number of attempts); enforce a hard timeout for
the polling so the test fails fast if the first instance never acquires the lock
and ensure child1 is killed/reaped in all failure paths to avoid zombies (use
child1.kill()/child1.wait() in the timeout cleanup).
- Around line 28-38: Replace launching the test helper with `cargo run` by
invoking the pre-built binary via the CARGO_BIN_EXE mechanism: use
env!("CARGO_BIN_EXE_logfwd") as the program passed to Command::new (the code
creating `child1`), and pass the same args (e.g., "--config",
config_path.to_str().unwrap()) to avoid recompilation in CI; ensure your crate
exposes a binary target (add a [[bin]] entry or otherwise build the binary in
test setup) so env!("CARGO_BIN_EXE_logfwd") is available at test runtime.

In `@docs/CONFIG_REFERENCE.md`:
- Line 422: The docs entry for the `diagnostics` option is incomplete and
misleading; update the description to list all active endpoints implemented by
the diagnostics router (e.g., `/health`, `/ready`, `/api/stats` served by
serve_stats, `/api/config` served by serve_config, `/api/logs`, `/api/history`,
`/api/traces`, etc.) and note that `/metrics` currently returns 410 Gone with a
removed-message rather than providing metrics; ensure the text references the
diagnostics route dispatcher (diagnostics) so readers can find the true
behavior.

In `@README.md`:
- Around line 161-178: The README example uses singular keys `input`/`output`
but the config struct is defined as `PipelineConfig` in
crates/logfwd-config/src/lib.rs with plural fields `inputs` and `outputs`, so
update the README snippet to use `inputs:` and `outputs:` (and similarly for the
`debug` block) to match `PipelineConfig`'s field names and allow serde
deserialization to succeed; ensure the YAML mirrors the plural usage used in
book/src/config/reference.md to keep docs consistent.

---

Outside diff comments:
In `@crates/logfwd/tests/compliance.rs`:
- Around line 150-168: The current logic sets source_filter to None when
batch.column_by_name("source_id") returns None, which lets rows pass when
source_id is non-empty; change this so that when source_id is non-empty but the
"source_id" column is missing you produce Some(Vec<bool>) of length
batch.num_rows() with all false (i.e., no rows match) instead of None; update
the branch around batch.column_by_name("source_id") used to build source_filter
(and preserve the existing null handling and UTF8/Utf8View matching logic) so
missing column yields a false mask rather than allowing all rows.

In `@docs/CONFIG_REFERENCE.md`:
- Around line 137-143: The Input types table in CONFIG_REFERENCE.md is missing
the Generator variant from the InputType enum; add a new table row for
`generator` describing its status (e.g., Implemented) and a concise description
like "Produce synthetic logs for benchmarking/testing" so documentation matches
the implemented InputType::Generator used for benchmarking. Ensure the row
format matches the existing table style (Value | Status | Description) and use
the same backtick formatting for `generator`.

---

Duplicate comments:
In `@crates/logfwd-io/tests/transport_e2e.rs`:
- Line 159: The change replaces writeln! with write!(client, "...\\n") in the
test (the write!(client, "{\"client\":{id},\"seq\":{i}}}\\n").unwrap() call),
which is unnecessary and out of scope; revert this to using writeln!(client,
"{{\"client\":{id},\"seq\":{i}}}").unwrap() (restore the original writeln!
invocation for the client write in the transport_e2e test) so behavior remains
unchanged and the unrelated edit is removed.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: ASSERTIVE

Plan: Pro

Run ID: cf1d383a-b4d9-4350-a814-113b942c9b3c

📥 Commits

Reviewing files that changed from the base of the PR and between e282864 and 8f5d08f.

⛔ Files ignored due to path filters (1)

Cargo.lock is excluded by !**/*.lock

📒 Files selected for processing (55)

Cargo.toml
README.md
book/src/README.md
book/src/config/reference.md
book/src/deployment/kubernetes.md
crates/logfwd-arrow/src/streaming_builder.rs
crates/logfwd-arrow/tests/allocation_regression.rs
crates/logfwd-bench/benches/elasticsearch_arrow.rs
crates/logfwd-bench/benches/pipeline.rs
crates/logfwd-competitive-bench/src/main.rs
crates/logfwd-competitive-bench/src/rate_bench.rs
crates/logfwd-competitive-bench/src/runner.rs
crates/logfwd-config/src/lib.rs
crates/logfwd-core/Cargo.toml
crates/logfwd-core/benches/scanner.rs
crates/logfwd-core/examples/arrow_ipc_roundtrip.rs
crates/logfwd-core/src/json_scanner.rs
crates/logfwd-core/src/pipeline/lifecycle.rs
crates/logfwd-core/tests/compliance_data.rs
crates/logfwd-core/tests/scanner_conformance.rs
crates/logfwd-io/src/diagnostics.rs
crates/logfwd-io/src/enrichment.rs
crates/logfwd-io/src/generator.rs
crates/logfwd-io/src/metric_history.rs
crates/logfwd-io/src/otlp_receiver.rs
crates/logfwd-io/src/span_exporter.rs
crates/logfwd-io/src/tail.rs
crates/logfwd-io/src/tcp_input.rs
crates/logfwd-io/tests/transport_e2e.rs
crates/logfwd-output/src/elasticsearch.rs
crates/logfwd-output/src/lib.rs
crates/logfwd-output/src/loki.rs
crates/logfwd-output/src/null.rs
crates/logfwd-output/src/otlp_sink.rs
crates/logfwd-output/tests/elasticsearch_arrow_ipc.rs
crates/logfwd-transform/Cargo.toml
crates/logfwd-transform/src/lib.rs
crates/logfwd-transform/src/udf/grok.rs
crates/logfwd/Cargo.toml
crates/logfwd/src/main.rs
crates/logfwd/src/pipeline.rs
crates/logfwd/src/worker_pool.rs
crates/logfwd/tests/allocation_e2e.rs
crates/logfwd/tests/allocation_scaling.rs
crates/logfwd/tests/compliance.rs
crates/logfwd/tests/compliance_file.rs
crates/logfwd/tests/integration.rs
crates/logfwd/tests/test_lock.rs
dev-docs/research/file-tailing-audit.md
docs/COLUMN_NAMING.md
docs/CONFIG_REFERENCE.md
docs/DEPLOYMENT.md
docs/TROUBLESHOOTING.md
examples/elasticsearch/README.md
examples/elasticsearch/with-auth-and-filter.yaml

💤 Files with no reviewable changes (1)

dev-docs/research/file-tailing-audit.md

coderabbitai · 2026-04-03T06:03:34Z

    #[test]
    fn test_basic_string_and_int() {
-        let json = br"not used directly";
+        let json = br#"not used directly"#;


🧹 Nitpick | 🔵 Trivial

Cosmetic test change unrelated to PR objectives.

This raw byte string delimiter change (br"..." → br#"..."#) is purely syntactic with zero functional impact. The PR focuses on file tailing reliability (checkpoint validation, exclusive lock, glob warnings), but this file is in the Arrow builder crate. Consider whether unrelated cleanup belongs in a focused reliability fix.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@crates/logfwd-arrow/src/streaming_builder.rs` at line 467, The change to the raw byte string delimiter on the variable `json` in `streaming_builder.rs` is a cosmetic, unrelated edit; revert the line that sets `let json = br#"not used directly"#;` back to the original `let json = br"not used directly";` (or remove the unrelated test change entirely) so the PR remains focused on tailing reliability; locate the `json` binding in the `streaming_builder.rs` file (inside the Arrow builder code) and restore the original delimiter form.

coderabbitai · 2026-04-03T06:03:34Z

 logfwd-test-utils = { version = "0.1.0", path = "../logfwd-test-utils" }
 arrow = { workspace = true }
-arrow-json = { workspace = true }
+arrow-json = "54"


🧹 Nitpick | 🔵 Trivial

Dev-dependency bypasses workspace management.

arrow-json = "54" is pinned directly instead of using { workspace = true }. As per coding guidelines, all new dependencies must be in [workspace.dependencies] with explicit version. This creates version drift risk when arrow is updated workspace-wide.

Suggested fix

Add to root Cargo.toml under [workspace.dependencies]:

arrow-json = { version = "54", default-features = false }

Then here:

-arrow-json = "54" +arrow-json = { workspace = true }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@crates/logfwd-core/Cargo.toml` at line 16, Replace the direct pinned dev-dependency "arrow-json" in the logfwd-core crate with a workspace-managed dependency: add an entry for arrow-json (with the explicit version and default-features setting) under the root [workspace.dependencies], then change the arrow-json line in the logfwd-core Cargo.toml to reference the workspace (e.g., arrow-json = { workspace = true }) so the crate uses the workspace-managed version instead of pinning "54" locally.

coderabbitai · 2026-04-03T06:03:34Z

+        if initial_paths.is_empty() && !patterns.is_empty() {
+            tracing::warn!("glob patterns match no files at startup: {:?}", patterns);
+        }


⚠️ Potential issue | 🟠 Major

Warn per glob, not only when the whole pattern set is empty.

These branches only fire when the union of matches is empty. If one configured glob matches files and another matches none, the empty glob is silent, which misses the #730 behavior and hides which pattern is broken.

Suggested fix

-fn expand_glob_patterns(patterns: &[&str]) -> Vec<PathBuf> { +fn expand_glob_patterns(patterns: &[&str], warn_on_empty: bool) -> Vec<PathBuf> { let mut paths = Vec::new(); for pattern in patterns { match glob::glob(pattern) { Ok(entries) => { - for entry in entries.flatten() { - paths.push(entry); + let mut matched = false; + for entry in entries { + match entry { + Ok(path) => { + matched = true; + paths.push(path); + } + Err(err) => { + tracing::warn!("error expanding glob pattern {pattern:?}: {err}"); + } + } + } + if warn_on_empty && !matched { + tracing::warn!("glob pattern matched no files: {pattern:?}"); } } Err(e) => { tracing::warn!("invalid glob pattern {pattern:?}: {e}"); } } } paths }

Then call it with warn_on_empty = true from both startup and rescan paths, and drop the aggregate is_empty() checks.

Also applies to: 280-285

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@crates/logfwd-io/src/tail.rs` around lines 247 - 249, The current logic only warns when the union of matches is empty (using initial_paths.is_empty()), so change it to warn per-glob: iterate over each pattern in patterns and call the existing warn_on_empty behavior for each pattern when it yields no matches (use warn_on_empty = true), and remove the aggregate initial_paths.is_empty() checks used at startup and in the rescan path; ensure both the startup code and the rescan code invoke the per-pattern warning path so each empty glob emits its own tracing::warn identifying the failing pattern.

coderabbitai · 2026-04-03T06:03:34Z

+    #[test]
+    fn test_glob_no_match_warning() {
+        let dir = tempfile::tempdir().unwrap();
+        let pattern = format!("{}/*.nonexistent", dir.path().display());
+
+        let config = TailConfig {
+            poll_interval_ms: 10,
+            glob_rescan_interval_ms: 50,
+            ..Default::default()
+        };
+
+        // This should trigger the "match no files at startup" warning.
+        let mut tailer = FileTailer::new_with_globs(&[&pattern], config).unwrap();
+        assert_eq!(tailer.num_files(), 0);
+
+        // Wait for rescan. This should trigger the "still match no files" warning.
+        std::thread::sleep(Duration::from_millis(150));
+        tailer.poll().unwrap();
+        assert_eq!(tailer.num_files(), 0);
+    }


⚠️ Potential issue | 🟠 Major

These tests don't actually verify the new warnings.

Both tests validate state, but neither asserts the tracing::warn! side effect. A future regression could drop the warning emission and still pass. For the glob case, please also add a mixed-pattern scenario so one unmatched glob is covered even when another glob has matches.

Also applies to: 1419-1457

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@crates/logfwd-io/src/tail.rs` around lines 1398 - 1417, The tests (e.g., test_glob_no_match_warning and the related tests around FileTailer::new_with_globs) only assert tailer state and do not verify that tracing::warn! was emitted; update the tests to install a test tracing subscriber (or use the tracing_test helpers) that captures logs and assert that a warning containing the expected message is emitted both at startup and after rescan, and add a new mixed-pattern test that calls FileTailer::new_with_globs with one matching glob and one non-matching glob to assert the unmatched-glob warning is still logged; ensure you reference the existing test names (test_glob_no_match_warning) and methods (FileTailer::new_with_globs, tailer.poll) when adding assertions so tests fail if the warnings are removed.

coderabbitai · 2026-04-03T06:03:34Z

    let mut client = TcpStream::connect(addr).unwrap();
    for i in 0..100 {
-        writeln!(client, "{{\"seq\":{i}}}").unwrap();
+        write!(client, "{{\"seq\":{i}}}\n").unwrap();


⚠️ Potential issue | 🟡 Minor

Unnecessary change unrelated to PR objectives.

writeln!(client, "{{\"seq\":{i}}}") and write!(client, "{{\"seq\":{i}}}\n") produce identical output—both write the formatted string followed by \n. This change adds no functional value and is unrelated to the file tailing reliability objectives (checkpoint validation, exclusive lock, glob warnings). Consider reverting.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@crates/logfwd-io/tests/transport_e2e.rs` at line 106, The change from writeln!(client, "{\"seq\":{i}}") to write!(client, "{\"seq\":{i}}\\n") is unnecessary and unrelated to the PR goal; revert to using writeln!(client, "{\"seq\":{i}}") (or otherwise restore the original writeln! invocation that writes the formatted string with newline) where the client is written to in transport_e2e.rs so the test code remains semantically identical and focused on the tailing/checkpoint behaviors.

coderabbitai · 2026-04-03T06:03:35Z

    // This makes the test data-driven, not time-driven.
    std::thread::spawn(move || {
-        let deadline = std::time::Instant::now() + Duration::from_secs(60);
+        let deadline = std::time::Instant::now() + Duration::from_secs(30);


⚠️ Potential issue | 🟠 Major

30s fixed deadline can make the 500K scaling case flaky in slower CI environments.

This test hard-asserts full row completion; halving the safety window increases false failures. Prefer a row-count-aware timeout (or env-overridable timeout) to preserve reliability.

Suggested fix

- let deadline = std::time::Instant::now() + Duration::from_secs(30); + // Scale safety timeout with workload size to reduce CI flakiness. + let timeout_secs = ((row_count as u64 / 100_000).max(1) * 30).min(180); + let deadline = std::time::Instant::now() + Duration::from_secs(timeout_secs);

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

let deadline = std::time::Instant::now() + Duration::from_secs(30);

// Scale safety timeout with workload size to reduce CI flakiness.

let timeout_secs = ((row_count as u64 / 100_000).max(1) * 30).min(180);

let deadline = std::time::Instant::now() + Duration::from_secs(timeout_secs);

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@crates/logfwd/tests/allocation_scaling.rs` at line 54, The fixed 30s deadline in allocation_scaling.rs (the line assigning let deadline = std::time::Instant::now() + Duration::from_secs(30)) makes the 500K scaling test flaky; change the timeout to be row-count-aware or overridable via env var: compute the deadline using a base-per-row multiplier (e.g., duration_per_row * expected_row_count) with a minimum floor, and fall back to an environment variable (e.g., TEST_TIMEOUT_MS) if provided, then replace the current Duration::from_secs(30) use in the test's deadline calculation so the timeout scales for large workloads while remaining bounded for small tests.

coderabbitai · 2026-04-03T06:03:35Z

+#[test]
+fn test_exclusive_lock() {
+    let dir = tempdir().unwrap();
+    let data_dir = dir.path().to_path_buf();
+
+    let config_path = dir.path().join("config.yaml");
+    std::fs::write(
+        &config_path,
+        format!(
+            r#"
+input:
+  type: generator
+output:
+  type: "null"
+storage:
+  data_dir: {}
+"#,
+            data_dir.display()
+        ),
+    )
+    .unwrap();
+
+    // Start the first instance.
+    let mut child1 = Command::new("cargo")
+        .args([
+            "run",
+            "-p",
+            "logfwd",
+            "--",
+            "--config",
+            config_path.to_str().unwrap(),
+        ])
+        .spawn()
+        .expect("failed to start first logfwd instance");
+
+    // Wait a bit for it to start and acquire the lock.
+    std::thread::sleep(Duration::from_secs(10));
+
+    // Try to start a second instance.
+    let output2 = Command::new("cargo")
+        .args([
+            "run",
+            "-p",
+            "logfwd",
+            "--",
+            "--config",
+            config_path.to_str().unwrap(),
+        ])
+        .output()
+        .expect("failed to start second logfwd instance");
+
+    // Clean up first instance — wait after kill to reap the process and prevent zombies.
+    let _ = child1.kill();
+    let _ = child1.wait();
+
+    let stderr = String::from_utf8_lossy(&output2.stderr);
+    assert!(
+        stderr.contains("another instance of logfwd is already running"),
+        "expected error message about another instance, got stderr: {}\nstdout: {}",
+        stderr,
+        String::from_utf8_lossy(&output2.stdout)
+    );
+    assert_eq!(output2.status.code(), Some(2), "expected exit code 2");
+}


⚠️ Potential issue | 🟡 Minor

Test has robustness concerns.

Unix-only but not gated: The flock implementation is Unix-only, but this test lacks #[cfg(unix)]. It will fail on Windows.

10-second sleep is fragile: If the binary isn't cached, cargo run triggers a rebuild. Consider polling for lock file existence or a readiness signal.

No test timeout: If first instance never acquires lock, the test hangs indefinitely.

Suggested improvements

+#[cfg(unix)] #[test] fn test_exclusive_lock() {

For robustness, consider polling:

// Poll for lock file instead of fixed sleep let lock_path = data_dir.join("logfwd.lock"); for _ in 0..100 { if lock_path.exists() { break; } std::thread::sleep(Duration::from_millis(100)); }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@crates/logfwd/tests/test_lock.rs` around lines 5 - 68, The test_exclusive_lock test is fragile and platform-unsafe: add a #[cfg(unix)] attribute to the test to avoid running on Windows; replace the fixed std::thread::sleep(Duration::from_secs(10)) used after spawning child1 with a polling loop that checks for the lock file (e.g., data_dir.join("logfwd.lock")) or a readiness signal before proceeding (poll at short intervals with a bounded number of attempts); enforce a hard timeout for the polling so the test fails fast if the first instance never acquires the lock and ensure child1 is killed/reaped in all failure paths to avoid zombies (use child1.kill()/child1.wait() in the timeout cleanup).

coderabbitai · 2026-04-03T06:03:35Z

+    let mut child1 = Command::new("cargo")
+        .args([
+            "run",
+            "-p",
+            "logfwd",
+            "--",
+            "--config",
+            config_path.to_str().unwrap(),
+        ])
+        .spawn()
+        .expect("failed to start first logfwd instance");


🧹 Nitpick | 🔵 Trivial

Consider using pre-built binary path.

Using cargo run may trigger recompilation. For faster, more reliable CI:

let binary = env!("CARGO_BIN_EXE_logfwd"); Command::new(binary) .args(["--config", config_path.to_str().unwrap()]) // ...

This requires adding [[bin]] target or using assert_cmd crate patterns.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@crates/logfwd/tests/test_lock.rs` around lines 28 - 38, Replace launching the test helper with `cargo run` by invoking the pre-built binary via the CARGO_BIN_EXE mechanism: use env!("CARGO_BIN_EXE_logfwd") as the program passed to Command::new (the code creating `child1`), and pass the same args (e.g., "--config", config_path.to_str().unwrap()) to avoid recompilation in CI; ensure your crate exposes a binary target (add a [[bin]] entry or otherwise build the binary in test setup) so env!("CARGO_BIN_EXE_logfwd") is available at test runtime.

coderabbitai · 2026-04-03T06:03:35Z

 | Field | Type | Default | Description |
 |-------|------|---------|-------------|
-| `diagnostics` | string | none | `host:port` to listen for HTTP diagnostics. See [Diagnostics API](#diagnostics-api) for available routes. |
+| `diagnostics` | string | none | `host:port` to listen for HTTP diagnostics. Exposes `/metrics` and `/api/pipelines`. |


⚠️ Potential issue | 🟠 Major

Documentation inaccurately describes diagnostics endpoints.

The documentation claims diagnostics "exposes /metrics and /api/pipelines", but the code shows many additional active endpoints are served:

/health, /ready (health checks)

/api/stats (pipeline statistics - uptime, line counts, bytes, timing)

/api/config (configuration YAML)

/api/logs (logs endpoint)

/api/history (history endpoint)

/api/traces (traces endpoint)

Additionally, /metrics returns HTTP 410 Gone with a message stating the endpoint was removed, not working metrics.

📝 Evidence from diagnostics.rs route handler

The route dispatcher in crates/logfwd-io/src/diagnostics.rs:468-500 shows all active endpoints, and serve_stats (lines 555-625) and serve_config (lines 627-638) implementations confirm these endpoints are fully functional.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/CONFIG_REFERENCE.md` at line 422, The docs entry for the `diagnostics` option is incomplete and misleading; update the description to list all active endpoints implemented by the diagnostics router (e.g., `/health`, `/ready`, `/api/stats` served by serve_stats, `/api/config` served by serve_config, `/api/logs`, `/api/history`, `/api/traces`, etc.) and note that `/metrics` currently returns 410 Gone with a removed-message rather than providing metrics; ensure the text references the diagnostics route dispatcher (diagnostics) so readers can find the true behavior.

coderabbitai · 2026-04-03T06:03:35Z

+    input:
      type: file
      path: /var/log/pods/**/*.log
      format: cri
    transform: SELECT * FROM logs WHERE level_str = 'ERROR'
-    outputs:
+    output:
      type: otlp
      endpoint: http://otel-collector:4318

  debug:
-    inputs:
+    input:
      type: file
      path: /var/log/pods/**/*.log
      format: cri
-    outputs:
+    output:
      type: stdout
      format: console
 ```


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Verify PipelineConfig field names in the actual struct definition rg -n -A5 'pub struct PipelineConfig' crates/logfwd-config/src/ # Check if there's any serde rename or alias for input/output rg -n 'serde.*alias.*input|serde.*rename.*input' crates/logfwd-config/src/

Repository: strawgate/memagent

Length of output: 546

🏁 Script executed:

# Find the deserialize_one_or_many function definition rg -n -B2 -A10 'fn deserialize_one_or_many' crates/logfwd-config/src/

Repository: strawgate/memagent

Length of output: 857

🏁 Script executed:

# Get the exact lines of PipelineConfig struct with all attributes sed -n '264,270p' crates/logfwd-config/src/lib.rs

Repository: strawgate/memagent

Length of output: 382

🏁 Script executed:

# Check for any serde alias or rename on the inputs/outputs fields rg -B5 -A1 'pub inputs:|pub outputs:' crates/logfwd-config/src/lib.rs

Repository: strawgate/memagent

Length of output: 483

Invalid YAML field names: input/output should be inputs/outputs.

The PipelineConfig struct in crates/logfwd-config/src/lib.rs defines plural field names (inputs, outputs) with no serde rename attributes. The example in README.md uses singular field names, which will fail to deserialize.

This creates inconsistency with book/src/config/reference.md (lines 40-63), which correctly uses plural forms.

Proposed fix

pipelines: errors: - input: + inputs: + - type: file - type: file - path: /var/log/pods/**/*.log - format: cri + path: /var/log/pods/**/*.log + format: cri transform: SELECT * FROM logs WHERE level_str = 'ERROR' - output: + outputs: + - type: otlp - type: otlp - endpoint: http://otel-collector:4318 + endpoint: http://otel-collector:4318 debug: - input: + inputs: + - type: file - type: file - path: /var/log/pods/**/*.log - format: cri - output: + path: /var/log/pods/**/*.log + format: cri + outputs: + - type: stdout - type: stdout - format: console + format: console

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

input:

type: file

path: /var/log/pods/**/*.log

format: cri

transform: SELECT * FROM logs WHERE level_str = 'ERROR'

outputs:

output:

type: otlp

endpoint: http://otel-collector:4318

debug:

inputs:

input:

type: file

path: /var/log/pods/**/*.log

format: cri

outputs:

output:

type: stdout

format: console

```

inputs:

- type: file

path: /var/log/pods/**/*.log

format: cri

transform: SELECT * FROM logs WHERE level_str = 'ERROR'

outputs:

- type: otlp

endpoint: http://otel-collector:4318

debug:

inputs:

- type: file

path: /var/log/pods/**/*.log

format: cri

outputs:

- type: stdout

format: console

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@README.md` around lines 161 - 178, The README example uses singular keys `input`/`output` but the config struct is defined as `PipelineConfig` in crates/logfwd-config/src/lib.rs with plural fields `inputs` and `outputs`, so update the README snippet to use `inputs:` and `outputs:` (and similarly for the `debug` block) to match `PipelineConfig`'s field names and allow serde deserialization to succeed; ensure the YAML mirrors the plural usage used in book/src/config/reference.md to keep docs consistent.

strawgate · 2026-04-03T06:31:00Z

Superseded by focused PRs:

fix: file tailing reliability — truncation, OOM, glob dedup, checkpoint validation, warnings (#796, #800, #799, #656, #730) #808 (merged): checkpoint validation, glob warnings, eprintln→tracing, truncation, read cap
feat: exclusive instance lock — prevent duplicate file processing (#737) #814: exclusive instance lock (flock) — extracted with bug fix (original had drop(lock_file) releasing lock immediately)

The 50+ unrelated file changes in this PR were from a rebase that pulled in master changes.

coderabbitai Bot requested changes Apr 3, 2026

View reviewed changes

strawgate mentioned this pull request Apr 3, 2026

feat: exclusive instance lock — prevent duplicate file processing (#737) #814

Merged

3 tasks

strawgate closed this Apr 3, 2026

github-actions Bot mentioned this pull request Apr 6, 2026

work-unit: control plane — input lifecycle status wiring #1415

Closed

4 tasks

-        let deadline = std::time::Instant::now() + Duration::from_secs(30);
+        // Scale safety timeout with workload size to reduce CI flakiness.
+        let timeout_secs = ((row_count as u64 / 100_000).max(1) * 30).min(180);
+        let deadline = std::time::Instant::now() + Duration::from_secs(timeout_secs);

Conversation

strawgate commented Apr 3, 2026

Summary

Test plan

Uh oh!

coderabbitai Bot commented Apr 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Possibly related PRs

Pre-merge checks failed

❌ Failed checks (2 errors, 3 warnings)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

strawgate commented Apr 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

coderabbitai Bot commented Apr 3, 2026 •

edited

Loading