Skip to content

feat: optimize restoring duplicate files from TAR chunk#73

Merged
woutervanranst merged 8 commits into
masterfrom
feat/fix12
Apr 28, 2026
Merged

feat: optimize restoring duplicate files from TAR chunk#73
woutervanranst merged 8 commits into
masterfrom
feat/fix12

Conversation

@woutervanranst

@woutervanranst woutervanranst commented Apr 28, 2026

Copy link
Copy Markdown
Owner

Fix for #12

Summary by CodeRabbit

  • Improvements

    • Reduced memory use during restore by streaming primary output and creating duplicate files via on-disk copies.
    • Restored file size is now reported from the actual on-disk length.
  • Tests

    • Added tests for restoring duplicate archive entries to multiple paths, asserting binary content, pointer files, and timestamps.
    • Improved temp-directory and file-operation test helpers and updated invalid-snapshot test to the new restore command type.
  • Chores

    • Updated CI benchmark artifact handling and added benchmark report files.

@coderabbitai

coderabbitai Bot commented Apr 28, 2026

Copy link
Copy Markdown
Contributor
📝 Walkthrough

Walkthrough

Restore loop now pattern-matches tar entries, streams the first destination directly from the tar entry DataStream, creates duplicates by copying the written file (or an empty file when no stream), and derives restored size from on-disk file length. Tests switched to direct RestoreCommand, use temp-dir helpers, add duplicate-entry restore scenarios, and CI now stages benchmark-tail.md.

Changes

Cohort / File(s) Summary
Restore Handler Core Logic
src/Arius.Core/Features/RestoreCommand/RestoreCommandHandler.cs
Iterates TarEntry via pattern matching; writes first output directly from tarEntry.DataStream, copies that file for duplicate paths, handles absent streams by creating empty files, and sets FileRestoredEvent.Size using on-disk FileInfo.Length.
Tests (Restore Command)
src/Arius.Core.Tests/Features/RestoreCommand/RestoreCommandHandlerTests.cs
Tests now instantiate Core.Features.RestoreCommand.RestoreCommand directly, use shared temp-dir helpers (CreateTempDirectory, WriteFile), refactor large-chunk test, add duplicate-tar-entry restore scenarios (assert binary content, *.pointer.arius presence, UTC timestamps with creation checks skipped on Linux), and consolidate cleanup of temp roots and repo cache.
CI / Benchmarks
.github/workflows/ci.yml, src/Arius.Benchmarks/benchmark-tail.md, src/Arius.Benchmarks/raw/.../results/*
CI benchmark commit step updated to stage benchmark-tail.md instead of benchmark-tail.log; benchmark MD/HTML/RAW report files added/updated for latest run.

Sequence Diagram(s)

sequenceDiagram
    participant Handler as RestoreCommandHandler
    participant Tar as TarEntry
    participant FS as FileSystem
    participant Bus as EventBus

    Handler->>Tar: enumerate tar entries (pattern match)
    alt entry has DataStream
        Handler->>FS: stream-write first destination from DataStream
        FS-->>Handler: file written
        Handler->>FS: copy written file for duplicate paths
        FS-->>Handler: duplicates created
    else no DataStream
        Handler->>FS: create empty file at path
        FS-->>Handler: empty file created
    end
    Handler->>FS: stat file -> get on-disk length
    FS-->>Handler: FileInfo.Length
    Handler->>Bus: publish FileRestoredEvent (size from disk)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

Possibly related PRs

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly describes the main optimization: restoring duplicate files from TAR chunks more efficiently by writing the first file from the data stream and copying for duplicates.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/fix12

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/Arius.Core.Tests/Features/RestoreCommand/RestoreCommandHandlerTests.cs`:
- Around line 108-166: The test
Handle_Restores_DuplicateTarEntryContent_ToAllPaths_WithPerPathMetadata should
be extended to add a pair of zero-byte files (e.g., WriteFile(localRoot,
"zero/a.txt", Array.Empty<byte>(), created, modified) and WriteFile(localRoot,
"zero/b.txt", Array.Empty<byte>(), created2, modified2)) so the archive/restore
flow hits the tarEntry.DataStream is null branch in RestoreCommandHandler and
exercises pointer-file creation and per-path timestamps; after archiving and
restoring assert restoreResult.FilesRestored increments accordingly and add
AssertRestoredFile checks for both zero-byte paths with their respective
created/modified timestamps to validate the per-path metadata behavior.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 61924f74-ffe2-4875-9055-5b2ca91ac7d7

📥 Commits

Reviewing files that changed from the base of the PR and between ff1446f and c6938d0.

📒 Files selected for processing (2)
  • src/Arius.Core.Tests/Features/RestoreCommand/RestoreCommandHandlerTests.cs
  • src/Arius.Core/Features/RestoreCommand/RestoreCommandHandler.cs

@codecov

codecov Bot commented Apr 28, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 78.24%. Comparing base (ff1446f) to head (9a6de3a).
⚠️ Report is 10 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master      #73   +/-   ##
=======================================
  Coverage   78.23%   78.24%           
=======================================
  Files          70       70           
  Lines        4931     4932    +1     
  Branches      680      679    -1     
=======================================
+ Hits         3858     3859    +1     
  Misses        909      909           
  Partials      164      164           
Flag Coverage Δ
linux 81.34% <100.00%> (+<0.01%) ⬆️
windows 75.99% <100.00%> (+0.08%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
src/Arius.Core.Tests/Features/RestoreCommand/RestoreCommandHandlerTests.cs (1)

164-165: Also assert pointer file payload content, not only existence/timestamps.

Right now these tests can pass even if the .pointer.arius file contains the wrong hash. Add a payload assertion to tighten regression protection.

♻️ Proposed test assertion enhancement
-            AssertRestoredFile("archives/duplicates/copy-a.bin", content, firstCreated,  firstModified);
-            AssertRestoredFile("nested/deep/a/b/c/copy-b.bin",   content, secondCreated, secondModified);
+            var expectedPointerHash = ContentHash.Parse(encryption.ComputeHash(content).ToString()).ToString();
+            AssertRestoredFile("archives/duplicates/copy-a.bin", content, expectedPointerHash, firstCreated,  firstModified);
+            AssertRestoredFile("nested/deep/a/b/c/copy-b.bin",   content, expectedPointerHash, secondCreated, secondModified);

-        void AssertRestoredFile(string relativePath, byte[] expectedContent, DateTime expectedCreated, DateTime expectedModified)
+        void AssertRestoredFile(string relativePath, byte[] expectedContent, string expectedPointerHash, DateTime expectedCreated, DateTime expectedModified)
         {
             var restoredPath = Path.Combine(restoreRoot, relativePath.Replace('/', Path.DirectorySeparatorChar));
             var pointerPath  = restoredPath + ".pointer.arius";

             File.ReadAllBytes(restoredPath).ShouldBe(expectedContent);
             File.Exists(pointerPath).ShouldBeTrue($"Pointer file should exist for {relativePath}");
+            File.ReadAllText(pointerPath).Trim().ShouldBe(expectedPointerHash, $"Pointer payload for {relativePath}");

As per coding guidelines: **/*.Tests/**: Test projects. Focus on test coverage gaps and assertion quality rather than style.

Also applies to: 174-190, 248-249, 258-274

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/Arius.Core.Tests/Features/RestoreCommand/RestoreCommandHandlerTests.cs`
around lines 164 - 165, The tests currently only check pointer file
existence/timestamps; update RestoreCommandHandlerTests to also read and assert
the pointer file payload matches the expected payload/hash for each restored
file: after calling AssertRestoredFile("archives/duplicates/copy-a.bin", ...)
and AssertRestoredFile("nested/deep/a/b/c/copy-b.bin", ...) open the
corresponding ".pointer.arius" file, deserialize or parse its payload and assert
it equals the expected payload/hash (use the same expected value you derive for
content e.g. expectedHash or expectedPayload variable), and apply the same
additional assertion for the other test cases referenced (lines 174-190,
248-249, 258-274) so each pointer file's content is validated not just its
existence/timestamps.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/Arius.Core.Tests/Features/RestoreCommand/RestoreCommandHandlerTests.cs`:
- Around line 164-165: The tests currently only check pointer file
existence/timestamps; update RestoreCommandHandlerTests to also read and assert
the pointer file payload matches the expected payload/hash for each restored
file: after calling AssertRestoredFile("archives/duplicates/copy-a.bin", ...)
and AssertRestoredFile("nested/deep/a/b/c/copy-b.bin", ...) open the
corresponding ".pointer.arius" file, deserialize or parse its payload and assert
it equals the expected payload/hash (use the same expected value you derive for
content e.g. expectedHash or expectedPayload variable), and apply the same
additional assertion for the other test cases referenced (lines 174-190,
248-249, 258-274) so each pointer file's content is validated not just its
existence/timestamps.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d17d4239-9a29-4e9d-a642-23133550a1e9

📥 Commits

Reviewing files that changed from the base of the PR and between c6938d0 and 9a27686.

📒 Files selected for processing (2)
  • .github/workflows/ci.yml
  • src/Arius.Core.Tests/Features/RestoreCommand/RestoreCommandHandlerTests.cs

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/Arius.Benchmarks/benchmark-tail.md`:
- Line 8: Edit the markdown row that starts with "| woutbook6 |
2026-04-28T12:24:00.9455930+00:00 | f053596521cec1c1b5322c39cec1cc3d89d7d7d5 | 1
| 3 | 1.404 m | 0.1645 m | 0.0090 m | 409000.0000 | 73000.0000 | 43000.0000 |
4.77 GB | 459541.0000 | 7.0000 | src/Arius.Benchmarks/raw/20260428T122400.945Z"
so it ends with a trailing pipe ("|") to keep the correct 15 cells, and move the
inline comment "# Larger woutbook run after refactor to hashes" off the table
(place it on the following line as a plain paragraph or an HTML comment) so it
is not parsed as a table cell; finally ensure the file ends with a trailing
newline.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5bb5c649-94fb-42b3-900c-21bc5fb142be

📥 Commits

Reviewing files that changed from the base of the PR and between d6051c5 and 9a6de3a.

📒 Files selected for processing (1)
  • src/Arius.Benchmarks/benchmark-tail.md

| runnervmhkfpo | 2026-04-28T04:51:52.2915132+00:00 | 0ee7db665f5f425a1a77a69e1c63778789648460 | 8 | 3 | 38.54 s | 6.373 s | 0.349 s | 49000.0000 | 17000.0000 | 14000.0000 | 829.57 MB | 70331.0000 | 21.0000 | src/Arius.Benchmarks/raw/20260428T045152.291Z | # Run on GH runner after refactor

| woutbook6 | 2026-04-28T12:24:00.9455930+00:00 | f053596521cec1c1b5322c39cec1cc3d89d7d7d5 | 1 | 3 | 1.404 m | 0.1645 m | 0.0090 m | 409000.0000 | 73000.0000 | 43000.0000 | 4.77 GB | 459541.0000 | 7.0000 | src/Arius.Benchmarks/raw/20260428T122400.945Z | # Larger woutbook run after refactor to hashes
| woutbook6 | 2026-04-28T12:24:00.9455930+00:00 | f053596521cec1c1b5322c39cec1cc3d89d7d7d5 | 1 | 3 | 1.404 m | 0.1645 m | 0.0090 m | 409000.0000 | 73000.0000 | 43000.0000 | 4.77 GB | 459541.0000 | 7.0000 | src/Arius.Benchmarks/raw/20260428T122400.945Z | # Larger woutbook run after refactor to hashes No newline at end of file

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Fix markdown table formatting.

The table row has two formatting issues:

  1. Missing trailing pipe (|) at the end of the row
  2. The comment is being treated as a table cell (causing 16 cells instead of the expected 15)

The file also ends without a trailing newline, which violates common text file best practices.

📝 Proposed fix for table formatting
-| woutbook6 | 2026-04-28T12:24:00.9455930+00:00 | f053596521cec1c1b5322c39cec1cc3d89d7d7d5 | 1 | 3 | 1.404 m | 0.1645 m | 0.0090 m | 409000.0000 | 73000.0000 | 43000.0000 | 4.77 GB | 459541.0000 | 7.0000 | src/Arius.Benchmarks/raw/20260428T122400.945Z | # Larger woutbook run after refactor to hashes
+| woutbook6     | 2026-04-28T12:24:00.9455930+00:00 | f053596521cec1c1b5322c39cec1cc3d89d7d7d5 | 1                          | 3          | 1.404 m | 0.1645 m | 0.0090 m | 409000.0000 | 73000.0000  | 43000.0000 | 4.77 GB   | 459541.0000          | 7.0000           | src/Arius.Benchmarks/raw/20260428T122400.945Z | # Larger woutbook run after refactor to hashes
+

Note: Added trailing newline and consistent spacing to match the formatting of other rows.

🧰 Tools
🪛 LanguageTool

[grammar] ~8-~8: Ensure spelling is correct
Context: ...rks/raw/20260428T122400.945Z | # Larger woutbook run after refactor to hashes

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🪛 markdownlint-cli2 (0.22.1)

[warning] 8-8: Table pipe style
Expected: leading_and_trailing; Actual: leading_only; Missing trailing pipe

(MD055, table-pipe-style)


[warning] 8-8: Table column count
Expected: 15; Actual: 16; Too many cells, extra data will be missing

(MD056, table-column-count)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/Arius.Benchmarks/benchmark-tail.md` at line 8, Edit the markdown row that
starts with "| woutbook6 | 2026-04-28T12:24:00.9455930+00:00 |
f053596521cec1c1b5322c39cec1cc3d89d7d7d5 | 1 | 3 | 1.404 m | 0.1645 m | 0.0090 m
| 409000.0000 | 73000.0000 | 43000.0000 | 4.77 GB | 459541.0000 | 7.0000 |
src/Arius.Benchmarks/raw/20260428T122400.945Z" so it ends with a trailing pipe
("|") to keep the correct 15 cells, and move the inline comment "# Larger
woutbook run after refactor to hashes" off the table (place it on the following
line as a plain paragraph or an HTML comment) so it is not parsed as a table
cell; finally ensure the file ends with a trailing newline.

@woutervanranst

Copy link
Copy Markdown
Owner Author

fixes #12

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant