feat: wire FileCheckpointStore into pipeline — persist and restore file offsets (#588)#784
Walkthrough
This change adds durable checkpoint persistence and offset restore by source fingerprint: Pipeline now opens a
Possibly related PRs
Caution: Pre-merge checks failed
Please resolve all errors before merging. Addressing warnings is optional.
❌ Failed checks (1 error, 2 warnings)
✅ Passed checks (4 passed)
@coderabbitai please do a thorough review |
|
Overall Assessment
The core checkpoint machinery (
🔴 Critical
1. Test env-var mutation is not thread-safe
Both new tests use
// test_checkpoint_persisted_after_clean_shutdown AND
// test_pipeline_resumes_from_checkpoint both do:
unsafe { std::env::set_var("LOGFWD_DATA_DIR", dir.path()); }
// ... pipeline runs with internal threads ...
unsafe { std::env::remove_var("LOGFWD_DATA_DIR"); }
Fix: add
2.
| # | Severity | File | Description |
|---|---|---|---|
| 1 | 🔴 Critical | pipeline.rs tests | Env-var race across parallel test threads |
| 2 | 🔴 Critical | pipeline.rs | path: None checkpoint silently ignored on restore |
| 3 | 🟡 Significant | pipeline.rs | Flush storm on persistent write errors |
| 4 | 🟡 Significant | pipeline.rs | Single checkpoints.json clobbers multi-pipeline configs |
| 5 | 🟡 Significant | checkpoint.rs | Missing directory fsync after atomic rename |
| 6 | 🟡 Significant | pipeline.rs | Checkpoint store opened for non-file inputs |
| 7 | 🔵 Minor | pipeline.rs tests | Inaccurate SAFETY comments |
| 8 | 🔵 Minor | test-utils/lib.rs | append_json_lines reads entire file to count lines |
| 9 | 🔵 Minor | test-utils/sinks.rs | Ordering::Relaxed undocumented |
| 10 | 🔵 Minor | pipeline.rs | Missing just bench results for hot path |
Items 1 and 2 are correctness bugs that affect the primary feature this PR delivers. Items 3–6 should be addressed before merge. The rest can be follow-ups if preferred.
🧠 Learnings used
Learnt from: CR
Repo: strawgate/memagent PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2026-04-03T02:15:17.278Z
Learning: Applies to src/crate/logfwd-core/**/pipeline/lifecycle.rs : Any change to pipeline/lifecycle.rs, PipelineMachine state transitions, drain protocol, checkpoint ordering, batch sequence numbering, or ordered-ACK mechanism MUST address TLA+ coverage
Learnt from: CR
Repo: strawgate/memagent PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2026-04-03T02:15:17.278Z
Learning: Applies to src/crate/logfwd-core/**/pipeline/lifecycle.rs : State machine lifecycle.rs MUST prove all valid state transitions preserve invariants: no batch creation after drain begins, committed checkpoint monotonically non-decreasing, stopped state is terminal
Learnt from: CR
Repo: strawgate/memagent PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2026-04-03T02:15:17.278Z
Learning: Applies to src/crate/logfwd-core/**/pipeline/**/*.rs : New async pipeline code MUST have proptest tests for: event sequences in arbitrary order, acks arriving in different order than submits, drain requested while batches in-flight
Learnt from: CR
Repo: strawgate/memagent PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2026-04-03T02:15:17.278Z
Learning: Applies to tla/PipelineMachine.tla : TLA+ spec must verify safety invariants: TypeOK, NoDoubleComplete, DrainCompleteness, CheckpointOrderingInvariant, CommittedNeverAheadOfAcked, InFlightImpliesCreated, AckedImpliesCreated with PipelineMachine.cfg
Learnt from: CR
Repo: strawgate/memagent PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2026-04-03T02:15:17.278Z
Learning: Applies to src/crate/logfwd-core/**/*.rs : Hot-path code changes include before/after benchmark results from `just bench`
Learnt from: CR
Repo: strawgate/memagent PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2026-04-03T02:15:17.278Z
Learning: Applies to src/crate/logfwd-core/**/pipeline/registry.rs : State machine registry.rs MUST prove: sequence numbers are strictly monotonic, no duplicate batch IDs
Learnt from: CR
Repo: strawgate/memagent PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2026-04-03T02:15:17.278Z
Learning: Applies to src/crate/logfwd-core/**/pipeline/batch.rs : State machine batch.rs MUST prove typestate: Queued → Sending is only allowed first transition, Sending → Acked and Sending → Rejected are only allowed completions
- **Flush storm** (critical): reset last_checkpoint_flush unconditionally
on every flush attempt so a persistent write error doesn't bypass the
5-second throttle and hammer the disk.
- **PathUpdate ordering hole** (critical): two-part fix.
(1) The drain path in input_poll_loop now sends PathUpdate for any
sources not yet announced before sending their Data — mirrors the
normal-path logic, prevents path:None checkpoints on first-cycle
shutdown.
(2) ack_all_tickets and the shutdown flush fall back to the previously
persisted path via store.load() as defense-in-depth, so an existing
valid checkpoint is never overwritten with path:None.
- **Multi-pipeline namespace** (significant): checkpoint store is now
opened at default_data_dir()/{pipeline_name}/ so multiple pipelines
don't share a single checkpoints.json file.
- **Directory fsync** (significant): FileCheckpointStore::flush() now
fsyncs the parent directory after rename so the new directory entry is
durable on filesystems that require it (ext4 without data=ordered, XFS).
- **Non-file inputs** (significant): checkpoint store is only opened when
at least one input is InputType::File; generator/OTLP/TCP/UDP pipelines
no longer create unnecessary checkpoint files.
- **TCP test race** (pre-existing): add connections_accepted monotonic
counter to TcpInput and use it in tcp_max_line_length tests — same fix
applied to the other TCP tests in a prior commit but missed here.
- **CountingSink comment**: document why Ordering::Relaxed is correct.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
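The directory-fsync item in the commit message above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the actual `FileCheckpointStore::flush()`; the helper name `atomic_write_durable` and the `.tmp` sibling file are assumptions made for the example.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Atomically replace `target` with `contents`, then fsync the parent
/// directory so the rename itself survives a crash.
/// Hypothetical helper; not the real FileCheckpointStore::flush().
fn atomic_write_durable(target: &Path, contents: &[u8]) -> io::Result<()> {
    let tmp = target.with_extension("tmp");

    // 1. Write the new contents to a temporary sibling and fsync the data.
    let mut file = File::create(&tmp)?;
    file.write_all(contents)?;
    file.sync_all()?;
    drop(file);

    // 2. Atomically swap the temporary file into place.
    fs::rename(&tmp, target)?;

    // 3. Fsync the parent directory so the new directory entry is durable.
    //    Opening a directory as a File works on Unix; skipped elsewhere.
    #[cfg(unix)]
    {
        if let Some(parent) = target.parent() {
            // An empty parent means a bare relative file name: use ".".
            let dir = if parent.as_os_str().is_empty() {
                Path::new(".")
            } else {
                parent
            };
            File::open(dir)?.sync_all()?;
        }
    }
    Ok(())
}
```

Without step 3 the file contents are durable but the directory entry pointing at them may not be, which is the gap the review item describes.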
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/logfwd-test-utils/src/lib.rs`:
- Around line 36-56: The docstring for append_json_lines misleadingly refers to
`start_seq` as if it's a parameter; update the comment to state that sequence
IDs start from the current line count of the target file (computed by reading
the file) rather than an input parameter. Mention that the function auto-detects
the starting sequence by counting existing lines and then appends `count` NDJSON
lines with IDs from that computed `start_seq`, and remove or reword the phrase
implying `start_seq` is provided by the caller.
In `@crates/logfwd/src/pipeline.rs`:
- Around line 2654-2657: Update the misleading SAFETY comments around the unsafe
std::env::set_var("LOGFWD_DATA_DIR", ...) block and the other comment near the
"single-threaded test": state that CHECKPOINT_ENV_MUTEX serializes access to
environment variable mutation for this test, and clarify that spawned threads
(e.g., the metrics thread started later) do not read or rely on environment
variables rather than claiming there are "no concurrent env reads" or that the
test is "single-threaded"; reference the unsafe block around set_var and the
comment near the later test to make these clarifications.
📒 Files selected for processing (5)
crates/logfwd-io/src/checkpoint.rs
crates/logfwd-io/src/tcp_input.rs
crates/logfwd-test-utils/src/lib.rs
crates/logfwd-test-utils/src/sinks.rs
crates/logfwd/src/pipeline.rs
- Fix docstring: append_json_lines auto-detects start sequence from current line count, not a parameter
- Fix SAFETY comments: clarify that CHECKPOINT_ENV_MUTEX serialises env var access, and spawned threads only access metrics/shutdown
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/logfwd/src/pipeline.rs (1)
886-923: 🧹 Nitpick | 🔵 Trivial
Consider extracting PathUpdate-before-Data logic.
Lines 895-915 duplicate the send pattern from lines 843-871. A helper like send_data_with_path_update(...) could reduce duplication.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/logfwd/src/pipeline.rs` around lines 886 - 923, Extract the duplicated "send PathUpdate before Data" logic into a helper (e.g., send_data_with_path_update) that accepts the source (or input), tx, metrics, known_sources and the data/checkpoints to send; move the block that computes current_sources, new_sources, filters source_paths and calls blocking_send_channel_msg with ChannelMsg::PathUpdate into that helper and then have it send ChannelMsg::Data (using blocking_send_channel_msg) so both the earlier occurrence (lines ~843-871) and this shutdown-path (lines ~895-915) call the same function, reducing duplication and keeping calls to input.source.checkpoint_data(), input.source.source_paths(), blocking_send_channel_msg, ChannelMsg::PathUpdate and ChannelMsg::Data centralized.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/logfwd/src/pipeline.rs`:
- Around line 723-745: The code repeatedly calls store.load(advance.source.0) as
a fallback when self.source_paths lacks an entry; cache the persisted path on
first load to avoid repeated I/O/lookup. Change the fallback flow in the
machine.apply_ack handling: when you call store.load(...) and get a
SourceCheckpoint with a path, insert that path into self.source_paths (or a
dedicated persisted-path cache) keyed by advance.source so subsequent
checkpoints use the in-memory path instead of calling store.load again; ensure
you still prefer PathUpdate entries from self.source_paths and only
consult/update the cache when store.load returns a path.
📒 Files selected for processing (2)
crates/logfwd-test-utils/src/lib.rs
crates/logfwd/src/pipeline.rs
let advance = machine.apply_ack(receipt);
if advance.advanced {
    if let (Some(ref mut store), Some(offset)) =
        (self.checkpoint_store.as_mut(), advance.checkpoint)
    {
        // Prefer the live path from PathUpdate messages. If the
        // PathUpdate hasn't arrived yet (e.g. shutdown happened on
        // the very first poll cycle before any flush threshold was
        // reached), fall back to the previously persisted path so
        // we don't overwrite a valid checkpoint with path: None.
        let path = self
            .source_paths
            .get(&advance.source)
            .cloned()
            .or_else(|| store.load(advance.source.0).and_then(|c| c.path));
        store.update(SourceCheckpoint {
            source_id: advance.source.0,
            path,
            offset,
        });
        any_advanced = true;
    }
}
🧹 Nitpick | 🔵 Trivial
Consider caching previously persisted paths.
store.load(advance.source.0) inside the ticket loop performs I/O (or at least a hash lookup) on every advancing checkpoint. If source_paths frequently lacks entries, this fallback runs repeatedly.
Consider caching the path on first load or ensuring PathUpdate always arrives before any Data referencing that source.
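The caching idea can be sketched as below. This is only an illustration under simplified assumptions: `Store`, `load_path`, and `resolve_path` are hypothetical stand-ins, source IDs are plain `u64` values, and the load counter exists purely to make the effect observable.

```rust
use std::collections::HashMap;
use std::path::PathBuf;

/// Stand-in for the persisted store; counts loads so the effect is visible.
struct Store {
    persisted: HashMap<u64, PathBuf>,
    loads: usize,
}

impl Store {
    fn load_path(&mut self, source_id: u64) -> Option<PathBuf> {
        self.loads += 1;
        self.persisted.get(&source_id).cloned()
    }
}

/// Resolve the path for a source: prefer the live map, else fall back to
/// the store once and remember the result for later acks.
fn resolve_path(
    live: &mut HashMap<u64, PathBuf>,
    store: &mut Store,
    source_id: u64,
) -> Option<PathBuf> {
    if let Some(path) = live.get(&source_id) {
        return Some(path.clone());
    }
    // Only reached when no live entry exists for this source.
    let path = store.load_path(source_id)?;
    live.insert(source_id, path.clone());
    Some(path)
}
```

A source with no persisted path still hits the store on every call in this sketch; whether that matters depends on how often such sources advance.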
36f2414 to
f6c9aa8
Compare
|
@coderabbitai please do a thorough code review |
|
🧠 Learnings used
✅ Actions performed
Full review triggered. |
|
Reviewed: all CodeRabbit nitpicks resolved. The remaining caching suggestion is a valid optimization but not blocking. Branch is mergeable and up to date with master. Ready to merge. |
Actionable comments posted: 4
♻️ Duplicate comments (1)
crates/logfwd-test-utils/src/lib.rs (1)
36-38: ⚠️ Potential issue | 🟡 Minor
Fix the stale start_seq doc comment.
start_seq is derived inside append_json_lines; it is not a caller-supplied input.
📝 Suggested doc fix
-/// Append `count` NDJSON lines to an existing file.
-/// Sequence IDs start from `start_seq`. Used by resume tests to simulate
-/// new data appearing after the initial batch.
+/// Append `count` NDJSON lines to an existing file.
+/// Sequence IDs continue from the current line count. Used by resume tests
+/// to simulate new data appearing after the initial batch.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/logfwd-test-utils/src/lib.rs` around lines 36 - 38, The doc comment incorrectly states that `start_seq` is a caller-supplied parameter; update the comment for the function append_json_lines to explain that sequence IDs are computed within the function (derived from existing file contents or an internal counter) rather than passed in by the caller, and remove or rephrase any wording implying `start_seq` is an external argument so the docs accurately reflect the function's behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/logfwd/src/pipeline.rs`:
- Around line 199-205: The restore loop currently only uses the unstable
fingerprint SourceId (via set_offset_by_source) which breaks when files grow;
change checkpoint persistence to store the canonical file path (not None) and in
the restore in pipeline.rs iterate saved_checkpoints and first try to restore by
that canonical path (e.g. call the source API to set offset by path, use
something like input.source.set_offset_by_path(canonical_path, cp.offset)), then
only if that fails fall back to the fingerprint-based restore using SourceId and
set_offset_by_source(SourceId(cp.source_id), cp.offset); update the checkpoint
write path to persist canonical_path for each checkpoint so the primary restore
key is the stable path and fingerprint remains a fallback (references:
saved_checkpoints, cp.source_id, cp.offset, SourceId, inputs,
set_offset_by_source, set_offset_by_path).
- Around line 157-169: Pipeline::from_config currently calls
FileCheckpointStore::open(default_data_dir()) which causes all pipelines to
share the same checkpoints.json and allows one pipeline's
FileCheckpointStore::flush to overwrite another's in-memory map; change this to
open a pipeline-scoped store (e.g. default_data_dir().join(pipeline_name) or
default_data_dir().join(format!("checkpoints-{}.json", name))) so each pipeline
gets its own file/dir. Specifically, update the call site in
Pipeline::from_config to pass a per-pipeline path/namespace into
FileCheckpointStore::open (or add a new FileCheckpointStore::open_at(path) if
needed), and ensure FileCheckpointStore::load_all and FileCheckpointStore::flush
read/write that per-pipeline path so checkpoints are namespaced by the pipeline
`name` (or other stable pipeline identity).
- Around line 704-712: The current logic in the flush throttle uses
last_checkpoint_flush and only updates it on successful store.flush(), so after
the first flush error the timestamp never advances and the code repeatedly
retries and floods logs; modify the block around any_advanced /
self.checkpoint_store so that last_checkpoint_flush = Instant::now() is set
regardless of whether store.flush() returns Err or Ok (or implement an
increasing backoff), i.e., move the last_checkpoint_flush update out of the
success branch (or update on error with appropriate backoff) while keeping the
existing error logging for checkpoint_store.flush().
- Around line 2527-2531: Tests mutate the shared env var LOGFWD_DATA_DIR causing
races; either add the serial_test crate and mark the affected tests with
#[serial] or refactor the tested constructors to accept a data_dir so tests
don't set env globals. To apply the serial approach: add serial_test to
Cargo.toml, import the serial attribute (use serial_test::serial) and annotate
test_checkpoint_persisted_after_clean_shutdown,
test_pipeline_resumes_from_checkpoint and the two other tests that set
LOGFWD_DATA_DIR with #[serial]. Alternatively, implement a
from_config/from_data_dir constructor for the Pipeline (or relevant builder used
in those tests) and update the tests to instantiate Pipeline with an explicit
data_dir instead of calling std::env::set_var("LOGFWD_DATA_DIR").
📒 Files selected for processing (6)
crates/logfwd-io/src/framed.rs
crates/logfwd-io/src/input.rs
crates/logfwd-io/src/tail.rs
crates/logfwd-test-utils/src/lib.rs
crates/logfwd-test-utils/src/sinks.rs
crates/logfwd/src/pipeline.rs
// Restore previously saved file offsets by fingerprint (SourceId).
// No path needed; the tailer finds the matching file by fingerprint.
for cp in &saved_checkpoints {
    let source_id = SourceId(cp.source_id);
    for input in &mut inputs {
        input.source.set_offset_by_source(source_id, cp.offset);
    }
Fingerprint-only restore is not stable across restarts.
set_offset_by_source() is now the only restore path, and both persistence paths write path: None. Here SourceId comes from the tailer's fingerprint of the first fingerprint_bytes bytes, so a file checkpointed while smaller than that window stops matching after it grows before the next start; restore silently no-ops and the file replays from offset 0. Persist the canonical path and use it as the primary restore key, or switch to a source identity that is stable across growth and duplicate prefixes.
Also applies to: 463-468, 695-699
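The growth problem described above can be illustrated with a toy fingerprint. This sketch assumes the fingerprint is a hash over at most the first `window` bytes, as the comment describes; the tailer's real hashing is not shown in this thread, so `fingerprint` and its use of `DefaultHasher` are stand-ins.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical fingerprint: hash of at most the first `window` bytes.
fn fingerprint(contents: &[u8], window: usize) -> u64 {
    // A file shorter than the window contributes all of its bytes, so
    // appending to it changes the hashed prefix.
    let prefix = &contents[..contents.len().min(window)];
    let mut hasher = DefaultHasher::new();
    prefix.hash(&mut hasher);
    hasher.finish()
}
```

With a 1024-byte window, a 9-byte file and the same file after an append hash different prefixes, so a checkpoint keyed on the earlier value no longer matches. Once the file is at least as long as the window, the prefix (and the fingerprint) stops changing.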
// Override data dir so checkpoints land in our temp dir.
// SAFETY: single-threaded test; no concurrent env reads.
unsafe {
    std::env::set_var("LOGFWD_DATA_DIR", dir.path());
}
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n 'LOGFWD_DATA_DIR|CHECKPOINT_ENV_MUTEX|serial_test'
Repository: strawgate/memagent
Length of output: 1123
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Extract the test functions and their context around the identified lines
echo "=== Extracting test functions around LOGFWD_DATA_DIR mutations ==="
# Get lines around 2527-2531
echo "--- Lines 2520-2535 (first mutation) ---"
sed -n '2520,2535p' crates/logfwd/src/pipeline.rs | cat -n
echo ""
echo "--- Lines 2560-2575 (second mutation) ---"
sed -n '2560,2575p' crates/logfwd/src/pipeline.rs | cat -n
echo ""
echo "--- Lines 2573-2590 (third mutation) ---"
sed -n '2573,2590p' crates/logfwd/src/pipeline.rs | cat -n
echo ""
echo "--- Lines 2650-2670 (fourth mutation) ---"
sed -n '2650,2670p' crates/logfwd/src/pipeline.rs | cat -n
Repository: strawgate/memagent
Length of output: 3572
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Check if logfwd has serial_test dependency
echo "=== Checking logfwd crate dependencies ==="
grep -A 20 '^\[dev-dependencies\]' crates/logfwd/Cargo.toml || echo "No [dev-dependencies] section found"
echo ""
echo "=== Checking for any #[serial] usage in pipeline.rs ==="
rg -n '#\[serial\]' crates/logfwd/src/pipeline.rs || echo "No #[serial] attributes found"
echo ""
echo "=== Checking test module attributes in pipeline.rs ==="
sed -n '2500,2530p' crates/logfwd/src/pipeline.rs | cat -n
Repository: strawgate/memagent
Length of output: 1813
Serialize environment mutations in tests to prevent race conditions.
Tests test_checkpoint_persisted_after_clean_shutdown (line 2522), test_pipeline_resumes_from_checkpoint (line 2576), and two others mutate the shared process environment (LOGFWD_DATA_DIR) without serialization. Under cargo test parallelism, these calls race with other tests; a panic before cleanup leaks state. The logfwd crate lacks a serial_test dependency, so add it and decorate these tests with #[serial], or refactor to pass data_dir explicitly via a from_config variant.
Applies to lines: 2527–2531, 2568–2570, 2580–2583, 2659–2661.
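As a std-only alternative to `serial_test`, the serialization could look roughly like the sketch below. The names `ENV_LOCK` and `DataDirGuard` are hypothetical, and the SAFETY reasoning only holds under the stated assumption that every test touching `LOGFWD_DATA_DIR` goes through the guard.

```rust
use std::env;
use std::ffi::OsString;
use std::path::Path;
use std::sync::{Mutex, MutexGuard};

/// Process-wide lock for tests that touch LOGFWD_DATA_DIR (hypothetical name).
static ENV_LOCK: Mutex<()> = Mutex::new(());

/// Holds the lock and restores the previous value on drop, even on panic.
struct DataDirGuard {
    previous: Option<OsString>,
    _lock: MutexGuard<'static, ()>,
}

impl DataDirGuard {
    // set_var is an unsafe fn only from the 2024 edition onwards.
    #[allow(unused_unsafe)]
    fn set(dir: &Path) -> Self {
        // A poisoned lock only means another test panicked; keep going.
        let lock = ENV_LOCK.lock().unwrap_or_else(|e| e.into_inner());
        let previous = env::var_os("LOGFWD_DATA_DIR");
        // SAFETY (assumption): every writer of this variable takes ENV_LOCK,
        // and no other thread reads the environment while it is held.
        unsafe { env::set_var("LOGFWD_DATA_DIR", dir) };
        DataDirGuard { previous, _lock: lock }
    }
}

impl Drop for DataDirGuard {
    #[allow(unused_unsafe)]
    fn drop(&mut self) {
        // SAFETY: ENV_LOCK is still held through `_lock` while this runs.
        unsafe {
            match self.previous.take() {
                Some(value) => env::set_var("LOGFWD_DATA_DIR", value),
                None => env::remove_var("LOGFWD_DATA_DIR"),
            }
        }
    }
}
```

A test would bind `let _guard = DataDirGuard::set(dir.path());` for its whole body. Passing the data directory explicitly to the constructor, as the comment also suggests, avoids touching the environment at all.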
049eaa8 to
0d22876
Compare
|
CI failing with 2 test failures:
Both need investigation against the new checkpoint persistence logic. |
|
@coderabbitai can you please fully review this change? |
CI Failure Analysis🔴 Failure 1 —
|
| # | Status | Notes |
|---|---|---|
| 1 (env-var race) | ❓ Unknown | Didn't see serial_test added; please confirm |
| 2 (path:None hole) | ✅ Fixed | set_offset_by_source used directly — no path lookup needed on restore |
| 3 (flush-throttle on error) | 🔴 Still present | last_checkpoint_flush reset only in else branch of flush error check (line 710); persistent errors cause flush storms |
| 4 (single checkpoints.json) | 🟡 Open | All pipelines share {data_dir}/checkpoints.json; multi-pipeline configs will clobber each other |
| 5 (parent dir fsync) | ✅ Fixed (per user) | |
| 6 (store for non-file inputs) | 🟡 Open | Also root cause of allocation test failure |
| 7–9 (minor polish) | ✅ Resolved (per user) | |
| 10 (bench results) | 🟡 Open | ack_all_tickets hot-path now does extra work; benchmark numbers still missing |
Recommendation
The two CI failures are both caused by the checkpoint persistence code being active in test paths where it shouldn't be. Fixing Issue #6 (open store only for file inputs, and ideally make it suppressible in tests) would likely resolve the allocation test. The transform-error test needs a different SQL that actually reaches the transform executor.
🧠 Learnings used
Learnt from: CR
Repo: strawgate/memagent PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2026-04-03T03:35:42.884Z
Learning: Applies to logfwd-core/src/lifecycle.rs : PRs touching lifecycle.rs, PipelineMachine transitions, drain, checkpoint ordering, or batch sequencing must address TLA+ coverage with updated tla/PipelineMachine.tla and TLC verification output
Learnt from: CR
Repo: strawgate/memagent PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2026-04-03T03:35:42.884Z
Learning: Applies to logfwd-io/src/**/*.rs : New async pipeline code must have proptest coverage for: arbitrary event ordering; acks out of order; drain while in-flight
Learnt from: CR
Repo: strawgate/memagent PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2026-04-03T03:35:42.884Z
Learning: Applies to logfwd-core/src/**/*.rs : New scanner behavior must have proptest coverage for: escape sequences crossing 64-byte boundaries; fields in different orders; duplicate keys same type (first-write-wins); duplicate keys different types (two suffixed columns)
✅ Actions performed
Full review triggered.
|
Investigated test failures:
The test uses file input → null output with Areas to investigate:
The |
…le offsets (#588)
Completes Phase 5c: the PipelineMachine was already wiring begin_send/ack/reject correctly; this adds durable checkpoint persistence so logfwd resumes where it left off after a restart.
Changes:
- pipeline.rs: open FileCheckpointStore on startup (LOGFWD_DATA_DIR or default), apply saved offsets to file inputs via set_offset() before first poll, persist checkpoints when CommitAdvance::advanced fires (throttled to once per 5s), flush final_checkpoints() on clean Stopped shutdown
- source_paths moved from run_async() local to Pipeline struct field so ack_all_tickets() can include paths in persisted SourceCheckpoints
- logfwd-test-utils: add CountingSink (AtomicU64 row counter) and append_json_lines() (append to existing file) for checkpoint tests
- Two new integration tests: checkpoint_persisted_after_clean_shutdown and pipeline_resumes_from_checkpoint
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wire FileCheckpointStore into pipeline: persist and restore file offsets.
Checkpoints are keyed by SourceId (file fingerprint), not path, eliminating the PathUpdate channel message and source_paths tracking.
Key changes:
- Pipeline opens FileCheckpointStore on startup (scoped to pipeline name), restores offsets via set_offset_by_source(SourceId, offset)
- Only creates data dir if LOGFWD_DATA_DIR is set or dir already exists
- ack_all_tickets persists checkpoints on advance (throttled 5s flush)
- Shutdown flushes final_checkpoints() unconditionally
- Remove ChannelMsg::PathUpdate variant entirely
- Remove source_paths HashMap from Pipeline
- Remove known_sources tracking from input_poll_loop
- Remove dead source_paths()/set_offset()/file_paths() from traits
- Add set_offset_by_source to InputSource trait, FileInput, FramedInput
- Add FileTailer::set_offset_by_source (scans files by fingerprint)
Tests: checkpoint_persisted_after_clean_shutdown, pipeline_resumes_from_checkpoint, channel_msg_data_carries_checkpoints
Closes #588, closes #787
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Per-pipeline checkpoint namespace (join(name)): prevents clobbering
- Only create data dir when LOGFWD_DATA_DIR is set or dir exists
- Flush throttle advances timer on failure too (prevents retry flood)
- Remove dead source_paths()/set_offset()/file_paths() from traits
- Fix unused Path/PathBuf imports
- Fix docstring (start_seq → current line count)
- Fix SAFETY comments (describe CHECKPOINT_ENV_MUTEX)
- Fix test assertion for pipeline-scoped checkpoint directory
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
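The "flush throttle advances timer on failure too" change can be sketched as follows. This is a simplified model, not the pipeline's actual code: `FlushThrottle` and `maybe_flush` are hypothetical names, and the caller passes `now` explicitly so the behaviour is easy to exercise.

```rust
use std::time::{Duration, Instant};

/// Hypothetical throttle: allows one flush attempt per `interval`.
struct FlushThrottle {
    interval: Duration,
    last_attempt: Option<Instant>,
}

impl FlushThrottle {
    fn new(interval: Duration) -> Self {
        FlushThrottle { interval, last_attempt: None }
    }

    /// Runs `flush` if the interval has elapsed. The timer advances before
    /// the result is known, so a failing flush is not retried immediately.
    fn maybe_flush<E>(
        &mut self,
        now: Instant,
        flush: impl FnOnce() -> Result<(), E>,
    ) -> Option<Result<(), E>> {
        let due = match self.last_attempt {
            Some(prev) => now.duration_since(prev) >= self.interval,
            None => true,
        };
        if !due {
            return None;
        }
        // Record the attempt regardless of whether it succeeds.
        self.last_attempt = Some(now);
        Some(flush())
    }
}
```

Updating the timestamp only on success is what produced the retry flood: after the first error, every subsequent call would be "due" again.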
d1af8ac to
958d18f
Compare
Detailed review
Architecture: correct
Checkpoints keyed by SourceId (fingerprint), not path. Restore on startup by matching fingerprint to open files. This is the right approach: it matches industry practice and our TLA+ spec.
Bug: checkpoint advances past remainder (TLA+ finding 002)
`checkpoint_data()` returns `tailed.offset` (the read position, where the fd is). `FramedInput` passes this through without subtracting `remainder.len()`. This means the checkpoint includes bytes that haven't been processed into complete lines. On crash, the remainder is lost. On restart, the tailer seeks to the checkpoint position, which is past the partial line. Those bytes are permanently skipped.
This is exactly TLA+ finding 002 (documented in `dev-docs/research/tla-finding-002.md` on the `docs/file-tailing-audit` branch).
The fix is: `FramedInput.checkpoint_data()` should override the inner source's offsets by subtracting the per-source remainder length.
Bug: checkpoint advances at send time, not ack time
The checkpoint snapshot is taken in `input_poll_loop` when the batch bytes are sent through the channel. If the downstream output fails (batch dropped by transform error, output error, etc.), the checkpoint still advances. On restart, the dropped batch's data is skipped.
The checkpoint should only advance when the batch is durably delivered (acked). This requires the checkpoint to travel with the batch through the pipeline and only be persisted after ack.
Missing: per-source remainder (#797)
The shared `FramedInput` remainder bug is still present. Multiple files sharing one remainder can corrupt data across files.
Missing: crash+partial line test
The three tests cover clean shutdown and basic resume, but not the case where a crash loses partial lines in the remainder.
Recommendation
This PR has the right architecture (SourceId-keyed checkpoints, fingerprint-based restore). But the checkpoint value itself is wrong (read_offset instead of processed_offset). I'd recommend:
These fixes align with the TLA+ spec and `CheckpointTracker` implementation in PR #802. |
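The `processed_offset = read_offset - remainder.len()` relationship from the review can be illustrated with a small model. This is a sketch under stated assumptions, not `FramedInput` or `CheckpointTracker`: `FramedState`, `feed`, and `checkpoint_offset` are hypothetical names, framing is newline-delimited, and there is one remainder per source.

```rust
/// Hypothetical per-source framing state.
struct FramedState {
    /// Bytes consumed from the file so far (where the fd sits).
    read_offset: u64,
    /// Trailing bytes of an incomplete line, not yet emitted.
    remainder: Vec<u8>,
}

impl FramedState {
    /// Feed a chunk; return the complete lines it finishes.
    fn feed(&mut self, chunk: &[u8]) -> Vec<Vec<u8>> {
        self.read_offset += chunk.len() as u64;
        self.remainder.extend_from_slice(chunk);
        let mut lines = Vec::new();
        while let Some(pos) = self.remainder.iter().position(|&b| b == b'\n') {
            let mut line: Vec<u8> = self.remainder.drain(..=pos).collect();
            line.pop(); // drop the trailing newline
            lines.push(line);
        }
        lines
    }

    /// Offset that is safe to persist: everything before the partial line.
    fn checkpoint_offset(&self) -> u64 {
        self.read_offset - self.remainder.len() as u64
    }
}
```

Persisting `read_offset` directly would, after a crash, resume past the buffered partial line; persisting `checkpoint_offset()` re-reads it instead.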
CodeRabbit review assessment
Additionally, our TLA+ analysis found that this PR checkpoints `read_offset` instead of `processed_offset` (read_offset - remainder.len()). See TLA+ finding 002 in PR #802. This causes data loss on crash when partial lines are in the remainder buffer.
Recommendation: Fix #3 and #4 (small changes). For #1 and the TLA+ finding, consider adopting the per-source remainder from PR #837 which provides the corrected checkpoint calculation. #2 is a test infrastructure issue: add `serial_test` or pass `data_dir` explicitly. |
Summary
Completes Phase 5c — durable checkpoint persistence with fingerprint-keyed restore.
Design: Checkpoints are keyed by SourceId (file fingerprint), not path. This eliminates the PathUpdate channel message, source_paths tracking, and store.load() fallback, matching how shippers handle checkpoints in practice.
Changes
- logfwd-io/src/tail.rs: set_offset_by_source(SourceId, u64)
- logfwd-io/src/input.rs: source_paths/set_offset
- logfwd-io/src/framed.rs
- logfwd/src/pipeline.rs: PathUpdate/source_paths
- logfwd-test-utils/src/: CountingSink, append_json_lines
Removed
- ChannelMsg::PathUpdate variant
- source_paths: HashMap<SourceId, PathBuf> field
- known_sources tracking in input loop
- source_paths()/set_offset()/file_paths() (dead after PathUpdate removal)
Tests
- test_checkpoint_persisted_after_clean_shutdown
- test_pipeline_resumes_from_checkpoint
- test_channel_msg_data_carries_checkpoints
Closes #588, closes #787
Test plan
- cargo clippy -p logfwd -p logfwd-io -- -D warnings: clean
- cargo fmt --check: clean
🤖 Generated with Claude Code