Skip to content

fix: compound SourceId + per-connection TCP identity (#798, #803)#891

Merged
strawgate merged 1 commit into
masterfrom
fix/source-id-compound-key
Apr 4, 2026
Merged

fix: compound SourceId + per-connection TCP identity (#798, #803)#891
strawgate merged 1 commit into
masterfrom
fix/source-id-compound-key

Conversation

@strawgate
Copy link
Copy Markdown
Owner

Summary

  • Compound SourceId for files: SourceId was derived from just the xxh64 fingerprint of a file's first N bytes, so two files with identical prefixes collided. Now FileIdentity::source_id() hashes (device, inode, fingerprint) together, making collision impossible for files on the same filesystem.
  • Per-connection TCP identity: All TCP connections emitted source_id: None, breaking FramedInput's per-source remainder tracking. Each accepted connection now gets a unique SourceId from a monotonic counter with domain-separated hashing (prefix tcp: prevents collision with file-based ids).
  • No read budget changes: The existing 4 MiB MAX_READ_PER_POLL cap is unchanged.

Fixes #798, Fixes #803
Part of work-unit #872, #873

Extracted from #868 — only the collision fix and TCP identity changes, without the 256 KiB read fairness budget.

Test plan

  • All 184 logfwd-io tests pass (cargo test -p logfwd-io)
  • New test distinct_source_ids_per_connection verifies two concurrent TCP connections produce different SourceId values
  • Existing test_file_paths_matches_offsets confirms file_offsets() and file_paths() still agree on SourceId values
  • cargo clippy -p logfwd-io -- -D warnings clean
  • Binary crate (cargo check -p logfwd) compiles cleanly

🤖 Generated with Claude Code

SourceId was previously derived from just the xxh64 fingerprint of a file's
first N bytes, causing collisions between files with identical prefixes.
Now hashes (device, inode, fingerprint) together via FileIdentity::source_id().

TCP connections previously emitted source_id: None, so FramedInput's
per-source remainder tracking could not distinguish interleaved data from
different peers. Each accepted connection now receives a unique SourceId
derived from a monotonic counter with domain-separated hashing.

Fixes #798, Fixes #803
Part of work-unit #872, #873

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 4, 2026

Walkthrough

This pull request fixes checkpoint misattribution and cross-connection data corruption in file and TCP input handling. In tail.rs, SourceId derivation shifts from fingerprint-only (first 1024 bytes) to a composite hash of device, inode, and fingerprint; checkpoint matching now uses this compound identity to prevent collisions when files share identical headers. In tcp_input.rs, each accepted TCP connection receives a unique SourceId from a monotonic sequence counter, and data accumulation shifts from a shared per-poll buffer to per-client buffers, with separate InputEvent::Data emission per connection to prevent cross-connection data merging.

Possibly related PRs


Caution

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

  • Ignore

❌ Failed checks (3 errors, 2 warnings)

Check name Status Explanation Resolution
High-Quality Rust Practices ❌ Error Code uses sentinel value SourceId(0) instead of Option for empty files, violating high-quality-rust-practices.md directive. Change FileIdentity::source_id() to return Option and add defensive collision check in source_id_for_connection() to prevent SourceId(0) production.
Formal Verification Coverage ❌ Error New public functions FileIdentity::source_id() and source_id_for_connection() lack required Kani formal verification proofs and cfg(kani) test modules. Add #[cfg(kani)] verification modules with proofs for both functions, ensuring hash collision avoidance and stability. Update dev-docs/VERIFICATION.md with proof coverage details.
Crate Boundary And Dependency Integrity ❌ Error PR introduces xxhash-rust as new dependency directly in crates/logfwd-io/Cargo.toml instead of [workspace.dependencies], violating crate boundary integrity requirement. Move xxhash-rust to [workspace.dependencies] in root Cargo.toml, then reference as workspace = true in crates/logfwd-io/Cargo.toml.
Documentation Thoroughly Updated ⚠️ Warning PR implements compound SourceId (device+inode+fingerprint) and per-connection TCP identity but lacks comprehensive documentation coverage across DESIGN.md, PHASES.md, VERIFICATION.md, ARCHITECTURE.md, and DEVELOPING.md. No migration path documented for breaking checkpoint format change. Add ADR to DESIGN.md, mark issues #798/#803 complete in PHASES.md, update VERIFICATION.md for source_id() method, document compound identity in ARCHITECTURE.md checkpoint section, add fingerprint collision lesson to DEVELOPING.md, and document checkpoint migration strategy.
Maintainer Fitness ⚠️ Warning PR lacks benchmark results, migration strategy for breaking checkpoint format change, and defensive collision check. Provide benchmark results, checkpoint migration strategy with tests, update commit message, add defensive SourceId(0) check, and expand PR description with risk surface and limitations.
✅ Passed checks (2 passed)
Check name Status Explanation
Linked Issues check ✅ Passed The pull request implements both issue requirements: (1) compound SourceId for files using (device, inode, fingerprint) hashing to prevent collisions [#798], and (2) per-connection TCP identity via monotonic counter hashing with 'tcp:' prefix [#803]. All code changes directly address the stated objectives.
Out of Scope Changes check ✅ Passed All changes are tightly scoped to the two linked issues: FileIdentity::source_id() implementation, checkpoint/offset lookup updates, and per-connection TCP identity assignment. No unrelated modifications detected in either file.

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
crates/logfwd-io/src/tail.rs (2)

1549-1579: 🛠️ Refactor suggestion | 🟠 Major

Missing test for collision prevention (the core fix).

The test verifies distinct files have distinct SourceIds, but doesn't exercise the specific scenario from issue #798: two files with identical first N bytes (same fingerprint) but different inodes. Consider adding a test that creates two files with identical content and verifies they get distinct SourceIds.

💚 Suggested test
/// Regression test for `#798`: files with identical fingerprints must have distinct SourceIds.
#[test]
fn test_same_fingerprint_different_inode_distinct_source_id() {
    let dir = tempfile::tempdir().unwrap();
    let path_a = dir.path().join("a.log");
    let path_b = dir.path().join("b.log");
    
    // Write identical content to both files.
    let content = "identical prefix content\n";
    std::fs::write(&path_a, content).unwrap();
    std::fs::write(&path_b, content).unwrap();
    
    let id_a = identify_file(&path_a, 1024).unwrap();
    let id_b = identify_file(&path_b, 1024).unwrap();
    
    // Fingerprints should be identical.
    assert_eq!(id_a.fingerprint, id_b.fingerprint, "files have same content");
    // But inodes differ.
    assert_ne!(id_a.inode, id_b.inode, "files have different inodes");
    // And SourceIds must be distinct.
    assert_ne!(id_a.source_id(), id_b.source_id(), 
        "same-fingerprint files must have distinct SourceIds (`#798`)");
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/logfwd-io/src/tail.rs` around lines 1549 - 1579, Add a regression test
that reproduces issue `#798` by creating two separate files with identical
first-N-byte content and asserting that identify_file(path, N) yields identical
fingerprints but different inodes and therefore different SourceIds;
specifically call identify_file for both paths, assert id_a.fingerprint ==
id_b.fingerprint, assert id_a.inode != id_b.inode, and assert id_a.source_id()
!= id_b.source_id() to ensure the collision-prevention logic for SourceId is
exercised.

754-760: 🧹 Nitpick | 🔵 Trivial

Hot-path hash recomputation on every call.

file_offsets() is documented as hot path (~100ms cadence). Each call now recomputes xxh64(device || inode || fingerprint) for every tailed file. Consider caching the SourceId on TailedFile at open/rotate time to avoid repeated hashing.

♻️ Suggested caching approach
 struct TailedFile {
     path: PathBuf,
     identity: FileIdentity,
+    /// Cached compound SourceId, updated when identity changes.
+    cached_source_id: SourceId,
     file: File,
     offset: u64,
     // ...
 }

Then initialize cached_source_id in open_file_at() and update it when fingerprint changes (truncation, re-fingerprinting).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/logfwd-io/src/tail.rs` around lines 754 - 760, file_offsets currently
recomputes the SourceId hash for every tailed file on each call; add a
cached_source_id: Option<SourceId> field to the TailedFile struct and populate
it in open_file_at() (and update it whenever identity.fingerprint changes: on
rotate/truncate/re-fingerprinting) so file_offsets can directly read
tailed.cached_source_id (falling back to computing only if None) instead of
calling identity.source_id() for every file; update all
rotate/reopen/re-fingerprint code paths to refresh cached_source_id when they
change the fingerprint.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/logfwd-io/src/tail.rs`:
- Around line 35-53: FileIdentity::source_id() changed the SourceId derivation
to hash device+inode+fingerprint which breaks restore because persisted
checkpoints used fingerprint-only IDs; update restore/migration logic (e.g., in
set_offset_by_source or the checkpoint loader) to also consider the legacy ID by
computing the fingerprint-only SourceId and attempting a lookup with that before
treating as missing, and add a one-time migration path to rewrite checkpoints to
the new ID (or log and preserve both IDs) so upgrades continue restoring offsets
correctly.

In `@crates/logfwd-io/src/tcp_input.rs`:
- Around line 36-43: The source_id_for_connection function can theoretically
return the sentinel value SourceId(0) used for empty files; add a defensive
check after computing the hash in source_id_for_connection to detect a zero
result and replace it with a non-zero SourceId (e.g., increment to 1 or rehash
with a different domain string/seed until non-zero) so TCP-generated SourceId
never equals the empty-file sentinel used in tail.rs.

---

Outside diff comments:
In `@crates/logfwd-io/src/tail.rs`:
- Around line 1549-1579: Add a regression test that reproduces issue `#798` by
creating two separate files with identical first-N-byte content and asserting
that identify_file(path, N) yields identical fingerprints but different inodes
and therefore different SourceIds; specifically call identify_file for both
paths, assert id_a.fingerprint == id_b.fingerprint, assert id_a.inode !=
id_b.inode, and assert id_a.source_id() != id_b.source_id() to ensure the
collision-prevention logic for SourceId is exercised.
- Around line 754-760: file_offsets currently recomputes the SourceId hash for
every tailed file on each call; add a cached_source_id: Option<SourceId> field
to the TailedFile struct and populate it in open_file_at() (and update it
whenever identity.fingerprint changes: on rotate/truncate/re-fingerprinting) so
file_offsets can directly read tailed.cached_source_id (falling back to
computing only if None) instead of calling identity.source_id() for every file;
update all rotate/reopen/re-fingerprint code paths to refresh cached_source_id
when they change the fingerprint.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: ASSERTIVE

Plan: Pro

Run ID: a24767a7-9002-42c2-a4a4-7f42110d7991

📥 Commits

Reviewing files that changed from the base of the PR and between 4c955ef and 714ba5c.

📒 Files selected for processing (2)
  • crates/logfwd-io/src/tail.rs
  • crates/logfwd-io/src/tcp_input.rs

Comment thread crates/logfwd-io/src/tail.rs
Comment thread crates/logfwd-io/src/tcp_input.rs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

1 participant