fix: compound SourceId + per-connection TCP identity (#798, #803)#891
Conversation
SourceId was previously derived from just the xxh64 fingerprint of a file's first N bytes, causing collisions between files with identical prefixes. Now hashes (device, inode, fingerprint) together via FileIdentity::source_id(). TCP connections previously emitted source_id: None, so FramedInput's per-source remainder tracking could not distinguish interleaved data from different peers. Each accepted connection now receives a unique SourceId derived from a monotonic counter with domain-separated hashing. Fixes #798, Fixes #803 Part of work-unit #872, #873 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
WalkthroughThis pull request fixes checkpoint misattribution and cross-connection data corruption in file and TCP input handling. In Possibly related PRs
Caution Pre-merge checks failedPlease resolve all errors before merging. Addressing warnings is optional.
❌ Failed checks (3 errors, 2 warnings)
✅ Passed checks (2 passed)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
crates/logfwd-io/src/tail.rs (2)
1549-1579: 🛠️ Refactor suggestion | 🟠 MajorMissing test for collision prevention (the core fix).
The test verifies distinct files have distinct SourceIds, but doesn't exercise the specific scenario from issue
#798: two files with identical first N bytes (same fingerprint) but different inodes. Consider adding a test that creates two files with identical content and verifies they get distinct SourceIds.💚 Suggested test
/// Regression test for `#798`: files with identical fingerprints must have distinct SourceIds. #[test] fn test_same_fingerprint_different_inode_distinct_source_id() { let dir = tempfile::tempdir().unwrap(); let path_a = dir.path().join("a.log"); let path_b = dir.path().join("b.log"); // Write identical content to both files. let content = "identical prefix content\n"; std::fs::write(&path_a, content).unwrap(); std::fs::write(&path_b, content).unwrap(); let id_a = identify_file(&path_a, 1024).unwrap(); let id_b = identify_file(&path_b, 1024).unwrap(); // Fingerprints should be identical. assert_eq!(id_a.fingerprint, id_b.fingerprint, "files have same content"); // But inodes differ. assert_ne!(id_a.inode, id_b.inode, "files have different inodes"); // And SourceIds must be distinct. assert_ne!(id_a.source_id(), id_b.source_id(), "same-fingerprint files must have distinct SourceIds (`#798`)"); }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/logfwd-io/src/tail.rs` around lines 1549 - 1579, Add a regression test that reproduces issue `#798` by creating two separate files with identical first-N-byte content and asserting that identify_file(path, N) yields identical fingerprints but different inodes and therefore different SourceIds; specifically call identify_file for both paths, assert id_a.fingerprint == id_b.fingerprint, assert id_a.inode != id_b.inode, and assert id_a.source_id() != id_b.source_id() to ensure the collision-prevention logic for SourceId is exercised.
754-760: 🧹 Nitpick | 🔵 TrivialHot-path hash recomputation on every call.
file_offsets()is documented as hot path (~100ms cadence). Each call now recomputesxxh64(device || inode || fingerprint)for every tailed file. Consider caching theSourceIdonTailedFileat open/rotate time to avoid repeated hashing.♻️ Suggested caching approach
struct TailedFile { path: PathBuf, identity: FileIdentity, + /// Cached compound SourceId, updated when identity changes. + cached_source_id: SourceId, file: File, offset: u64, // ... }Then initialize
cached_source_idinopen_file_at()and update it when fingerprint changes (truncation, re-fingerprinting).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/logfwd-io/src/tail.rs` around lines 754 - 760, file_offsets currently recomputes the SourceId hash for every tailed file on each call; add a cached_source_id: Option<SourceId> field to the TailedFile struct and populate it in open_file_at() (and update it whenever identity.fingerprint changes: on rotate/truncate/re-fingerprinting) so file_offsets can directly read tailed.cached_source_id (falling back to computing only if None) instead of calling identity.source_id() for every file; update all rotate/reopen/re-fingerprint code paths to refresh cached_source_id when they change the fingerprint.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/logfwd-io/src/tail.rs`:
- Around line 35-53: FileIdentity::source_id() changed the SourceId derivation
to hash device+inode+fingerprint which breaks restore because persisted
checkpoints used fingerprint-only IDs; update restore/migration logic (e.g., in
set_offset_by_source or the checkpoint loader) to also consider the legacy ID by
computing the fingerprint-only SourceId and attempting a lookup with that before
treating as missing, and add a one-time migration path to rewrite checkpoints to
the new ID (or log and preserve both IDs) so upgrades continue restoring offsets
correctly.
In `@crates/logfwd-io/src/tcp_input.rs`:
- Around line 36-43: The source_id_for_connection function can theoretically
return the sentinel value SourceId(0) used for empty files; add a defensive
check after computing the hash in source_id_for_connection to detect a zero
result and replace it with a non-zero SourceId (e.g., increment to 1 or rehash
with a different domain string/seed until non-zero) so TCP-generated SourceId
never equals the empty-file sentinel used in tail.rs.
---
Outside diff comments:
In `@crates/logfwd-io/src/tail.rs`:
- Around line 1549-1579: Add a regression test that reproduces issue `#798` by
creating two separate files with identical first-N-byte content and asserting
that identify_file(path, N) yields identical fingerprints but different inodes
and therefore different SourceIds; specifically call identify_file for both
paths, assert id_a.fingerprint == id_b.fingerprint, assert id_a.inode !=
id_b.inode, and assert id_a.source_id() != id_b.source_id() to ensure the
collision-prevention logic for SourceId is exercised.
- Around line 754-760: file_offsets currently recomputes the SourceId hash for
every tailed file on each call; add a cached_source_id: Option<SourceId> field
to the TailedFile struct and populate it in open_file_at() (and update it
whenever identity.fingerprint changes: on rotate/truncate/re-fingerprinting) so
file_offsets can directly read tailed.cached_source_id (falling back to
computing only if None) instead of calling identity.source_id() for every file;
update all rotate/reopen/re-fingerprint code paths to refresh cached_source_id
when they change the fingerprint.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Organization UI (inherited)
Review profile: ASSERTIVE
Plan: Pro
Run ID: a24767a7-9002-42c2-a4a4-7f42110d7991
📒 Files selected for processing (2)
crates/logfwd-io/src/tail.rscrates/logfwd-io/src/tcp_input.rs
Summary
SourceIdwas derived from just the xxh64 fingerprint of a file's first N bytes, so two files with identical prefixes collided. NowFileIdentity::source_id()hashes(device, inode, fingerprint)together, making collision impossible for files on the same filesystem.source_id: None, breakingFramedInput's per-source remainder tracking. Each accepted connection now gets a uniqueSourceIdfrom a monotonic counter with domain-separated hashing (prefixtcp:prevents collision with file-based ids).MAX_READ_PER_POLLcap is unchanged.Fixes #798, Fixes #803
Part of work-unit #872, #873
Extracted from #868 — only the collision fix and TCP identity changes, without the 256 KiB read fairness budget.
Test plan
logfwd-iotests pass (cargo test -p logfwd-io)distinct_source_ids_per_connectionverifies two concurrent TCP connections produce differentSourceIdvaluestest_file_paths_matches_offsetsconfirmsfile_offsets()andfile_paths()still agree on SourceId valuescargo clippy -p logfwd-io -- -D warningscleancargo check -p logfwd) compiles cleanly🤖 Generated with Claude Code