Skip to content

feat: chunk-index variable but fixed length, chunk-index repair and updated thin chunk format#95

Merged
woutervanranst merged 79 commits into
masterfrom
improve-chunk-index2
Jun 2, 2026
Merged

feat: chunk-index variable but fixed length, chunk-index repair and updated thin chunk format#95
woutervanranst merged 79 commits into
masterfrom
improve-chunk-index2

Conversation

@woutervanranst
Copy link
Copy Markdown
Owner

@woutervanranst woutervanranst commented May 29, 2026

Summary by CodeRabbit

  • New Features

    • Added repair-index CLI command to run an explicit chunk-index repair.
    • Archive/restore/ls now surface clear errors instructing users to run repair when index corruption or incomplete repair is detected.
    • Improved chunk upload semantics for thin/large chunks to enable reliable repair and metadata recovery.
  • Documentation

    • README and docs updated with chunk-index repair guidance and updated architecture/ADR notes.

…TreeValidationResult, archive invalidates chunk-index on mismatch, parallel flush uses an internal worker const and partial failure keeps snapshot unpublished, and prefix length lives as a const on ChunkIndexService.
…Shared/ChunkIndex coverage acceptance criterion
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 29, 2026

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Introduces explicit chunk-index repair and refactors chunk-index internals (shard cache, reader, write-session), makes blob listing metadata-aware, standardizes thin-chunk metadata (empty body + parent hash in metadata), changes FileTreeService validation to return mismatch info, runs chunk-index flush and filetree sync concurrently in archive tail, adds a repair-index CLI/handler, and updates specs, tests, and docs.

Changes

Chunk index, storage API, pipeline, CLI, specs

Layer / File(s) Summary
Blob listing API, blob metadata keys, Azure container changes, and test fakes
src/Arius.Core/Shared/Storage/*, src/Arius.AzureBlob/*, src/**/Fakes/*, src/**/Tests/*
IBlobContainerService.ListAsync now yields BlobListItem with optional Metadata/ContentLength via includeMetadata flag; BlobNotFoundException + TryDownloadAsync added; Azure and all test fakes updated; tests enumerate item.Name and new BlobListItem tests added.
ChunkIndex refactor, repair API, exceptions, and tests
src/Arius.Core/Shared/ChunkIndex/*, src/Arius.Core.Tests/Shared/ChunkIndex/*
Splits responsibilities into ChunkIndexShardCache, ChunkIndexReader, ChunkIndexWriteSession; introduces repair marker enforcement, RepairAsync returning ChunkIndexRepairResult, new exceptions (ChunkIndexCorruptException, ChunkIndexRepairIncompleteException, ChunkIndexRepairException), Shard mutation API (AddOrUpdate*), ContentHash.Prefix(int), and extensive lookup/repair/write-session/shard-cache tests.
Chunk storage and thin-chunk semantics
src/Arius.Core/Shared/ChunkStorage/*, related tests
Thin chunks stored as empty-body blobs with parent_chunk_hash metadata; ChunkUploadResult gains OriginalSize; UploadThin uploads empty content + metadata in a single call; upload recovery uses arius_type sentinel; tests updated to assert metadata and empty payloads.
FileTreeService validation and archive/restore orchestration
src/Arius.Core/Shared/FileTree/*, src/Arius.Core/Features/*, src/Arius.Core/ServiceCollectionExtensions.cs
FileTreeService no longer depends on ChunkIndexService; ValidateAsync returns FileTreeValidationResult(SnapshotMismatch) and must not invalidate chunk-index caches; ArchiveCommandHandler invalidates caches on mismatch, then runs ChunkIndexService.FlushAsync and FileTreeBuilder.SynchronizeAsync concurrently and publishes snapshot after both; RestoreCommandHandler fails on unresolved content hashes.
CLI repair verb and handler
src/Arius.Cli/*, src/Arius.Core/Features/RepairChunkIndexCommand/*, src/Arius.Cli.Tests/*, README.md
Adds repair-index CLI verb, RepairChunkIndexCommand/handler, DI registration, CLI tests covering success/failure/missing-arg, and README documentation for repair command and thin-chunk/index notes.
Specs, ADR, docs, and tasks
openspec/**, docs/*, docs/decisions/*
Adds ADR and many spec updates covering shard-prefixing (service-owned), explicit repair, metadata-aware listing, thin-chunk semantics, archive/restore/ls audit-logging, architecture tests enforcing facade boundary, and tasks/checklists; docs/cache.md and README updated accordingly.

Sequence Diagram(s)

sequenceDiagram
  participant User
  participant CLI as arius repair-index
  participant Mediator
  participant RepairHandler
  participant ChunkIndexService
  participant BlobStore
  User->>CLI: invoke
  CLI->>Mediator: RepairChunkIndexCommand
  Mediator->>RepairHandler: Dispatch
  RepairHandler->>ChunkIndexService: RepairAsync
  ChunkIndexService->>BlobStore: ListAsync(chunks/, includeMetadata:true)
  ChunkIndexService->>BlobStore: Upload rebuilt shards / Delete stale
  ChunkIndexService-->>RepairHandler: Repair result
  RepairHandler-->>CLI: Success/Failure
  CLI-->>User: Output
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

Possibly related PRs

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch improve-chunk-index2

@codecov
Copy link
Copy Markdown

codecov Bot commented May 29, 2026

Codecov Report

❌ Patch coverage is 86.60287% with 56 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.91%. Comparing base (251c60e) to head (0d35646).
⚠️ Report is 80 commits behind head on master.

Files with missing lines Patch % Lines
...ius.Core/Shared/ChunkIndex/ChunkIndexShardCache.cs 88.99% 8 Missing and 4 partials ⚠️
...hunkIndexCommand/RepairChunkIndexCommandHandler.cs 31.25% 11 Missing ⚠️
src/Arius.AzureBlob/AzureBlobContainerService.cs 44.44% 9 Missing and 1 partial ⚠️
src/Arius.Cli/Commands/Repair/RepairVerb.cs 90.16% 4 Missing and 2 partials ⚠️
.../Arius.Core/Shared/ChunkIndex/ChunkIndexService.cs 95.09% 1 Missing and 4 partials ⚠️
...s.Core/Shared/ChunkIndex/ChunkIndexWriteSession.cs 86.20% 2 Missing and 2 partials ⚠️
...e/Features/ArchiveCommand/ArchiveCommandHandler.cs 70.00% 1 Missing and 2 partials ⚠️
...e/Features/RestoreCommand/RestoreCommandHandler.cs 0.00% 2 Missing and 1 partial ⚠️
...c/Arius.Core/Shared/ChunkIndex/ChunkIndexReader.cs 87.50% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master      #95      +/-   ##
==========================================
- Coverage   78.52%   77.91%   -0.61%     
==========================================
  Files          78       70       -8     
  Lines        5300     4900     -400     
  Branches      731      626     -105     
==========================================
- Hits         4162     3818     -344     
+ Misses        940      899      -41     
+ Partials      198      183      -15     
Flag Coverage Δ
linux 77.91% <86.60%> (-3.73%) ⬇️
windows 77.91% <86.60%> (+0.93%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/Arius.Core/Shared/ChunkIndex/ChunkIndexService.cs (1)

239-248: ⚠️ Potential issue | 🔴 Critical | 🏗️ Heavy lift

Legacy thin chunks without ParentChunkHash metadata will break repair.

CreateThinRepairEntry requires ParentChunkHash in blob metadata (line 242) and throws ChunkIndexRepairException if missing or invalid. According to the past review comment (marked "✅ Addressed") and the PR description ("updated thin chunk format"), legacy thin chunks stored the parent hash in the blob body, not metadata. Repairing a repository with legacy thin chunks will throw here, leave the repair marker on disk, and block all archive/restore/list operations until the marker is manually removed.

Add a fallback that downloads and parses the legacy thin-body format when ParentChunkHash metadata is absent:

🔧 Proposed fix to support legacy thin chunks
-    private static ShardEntry CreateThinRepairEntry(BlobListItem item)
+    private static async Task<ShardEntry> CreateThinRepairEntryAsync(BlobListItem item, IBlobContainerService blobs, CancellationToken cancellationToken)
     {
         var contentHash = ContentHash.Parse(item.Name.Name.ToString());
-        if (!item.Metadata!.TryGetValue(BlobMetadataKeys.ParentChunkHash, out var parentChunkHashValue) || !ChunkHash.TryParse(parentChunkHashValue, out var parentChunkHash))
-            throw new ChunkIndexRepairException(item.Name, $"missing or invalid {BlobMetadataKeys.ParentChunkHash} metadata");
+        
+        ChunkHash parentChunkHash;
+        if (item.Metadata!.TryGetValue(BlobMetadataKeys.ParentChunkHash, out var parentChunkHashValue) && ChunkHash.TryParse(parentChunkHashValue, out var parsedHash))
+        {
+            // New format: parent hash in metadata
+            parentChunkHash = parsedHash;
+        }
+        else
+        {
+            // Legacy format: parent hash in body (empty body was not used previously)
+            if (item.ContentLength == 0)
+                throw new ChunkIndexRepairException(item.Name, $"missing or invalid {BlobMetadataKeys.ParentChunkHash} metadata and empty body (cannot parse legacy format)");
+            
+            await using var stream = await blobs.DownloadAsync(item.Name, cancellationToken);
+            using var reader = new StreamReader(stream, System.Text.Encoding.UTF8);
+            var parentHashValue = await reader.ReadToEndAsync(cancellationToken);
+            if (!ChunkHash.TryParse(parentHashValue.Trim(), out var legacyHash))
+                throw new ChunkIndexRepairException(item.Name, $"missing or invalid {BlobMetadataKeys.ParentChunkHash} metadata and failed to parse legacy thin-body format");
+            
+            parentChunkHash = legacyHash;
+        }

         var originalSize = ReadRequiredLongMetadata(item, BlobMetadataKeys.OriginalSize);
         var compressedSize = ReadRequiredLongMetadata(item, BlobMetadataKeys.CompressedSize);
         return new ShardEntry(contentHash, parentChunkHash, originalSize, compressedSize);
     }

Update the caller at line 226 and the return type at line 217 to await the async call.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/Arius.Core/Shared/ChunkIndex/ChunkIndexService.cs` around lines 239 -
248, CreateThinRepairEntry currently throws if BlobMetadataKeys.ParentChunkHash
is missing; change it to an async CreateThinRepairEntry that first tries
metadata then falls back to downloading and parsing the legacy thin-body to
extract the parent chunk hash when metadata is absent or invalid, then return
the same ShardEntry using ContentHash.Parse(item.Name.Name.ToString()),
ReadRequiredLongMetadata(item, BlobMetadataKeys.OriginalSize) and
ReadRequiredLongMetadata(item, BlobMetadataKeys.CompressedSize); if the legacy
body cannot be read or parsed still throw ChunkIndexRepairException with
context. Also update the caller that invokes CreateThinRepairEntry to await the
new async method and adjust its return type accordingly so the async flow
compiles.
src/Arius.Core/Shared/ChunkIndex/ShardSerializer.cs (1)

10-10: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Make ShardSerializer internal to align with architecture boundary.

ShardSerializer is a chunk-index implementation detail and is only used by other internal chunk-index components (e.g., ChunkIndexShardCache) and tests (which have InternalsVisibleTo access). Per the architecture requirement in the spec, extracted chunk-index components should remain internal behind the ChunkIndexService facade. Change the class declaration to internal static class ShardSerializer. As per coding guidelines, make non-test classes internal and only make them public when they must be consumed by another non-test assembly.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/Arius.Core/Shared/ChunkIndex/ShardSerializer.cs` at line 10, Change the
public visibility of the ShardSerializer class to internal since it is an
implementation detail of the chunk-index; update the class declaration for
ShardSerializer from "public static class ShardSerializer" to "internal static
class ShardSerializer" so it remains accessible to internal consumers like
ChunkIndexShardCache and tests (which use InternalsVisibleTo) but is not exposed
publicly beyond the assembly.
♻️ Duplicate comments (1)
src/Arius.Core.Tests/Shared/ChunkIndex/ChunkIndexServiceRepairTests.cs (1)

82-86: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Verify that rebuilt shards were actually persisted, not just cached in memory.

The test currently only checks LookupAsync on the same ChunkIndexService instance that performed the repair. To definitively prove the rebuilt shard was persisted to blob storage (L2/L3) and not just held in memory (L1), create a fresh ChunkIndexService instance after repair and verify the lookup succeeds on that new instance.

🧪 Proposed fix to add persistence verification
         var result = await index.RepairAsync();
 
         result.RebuiltShardCount.ShouldBe(1);
-        (await index.LookupAsync(rebuiltHash)).ShouldBe(new ShardEntry(rebuiltHash, ChunkHash.Parse(rebuiltHash), 100, 3));
-        (await index.LookupAsync(staleHash)).ShouldBeNull();
+        
+        // Verify rebuilt shard was persisted by loading fresh instance
+        index.Dispose();
+        using var freshIndex = new ChunkIndexService(blobs, s_encryption, repositoryKey, repositoryKey);
+        (await freshIndex.LookupAsync(rebuiltHash)).ShouldBe(new ShardEntry(rebuiltHash, ChunkHash.Parse(rebuiltHash), 100, 3));
+        (await freshIndex.LookupAsync(staleHash)).ShouldBeNull();

As per coding guidelines, **/*.Tests/**: Test projects. Focus on test coverage gaps and assertion quality rather than style.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/Arius.Core.Tests/Shared/ChunkIndex/ChunkIndexServiceRepairTests.cs`
around lines 82 - 86, The test currently verifies rebuilt shards only against
the same ChunkIndexService instance that performed RepairAsync, so update the
test to assert persistence by instantiating a fresh ChunkIndexService after
calling index.RepairAsync(); then call LookupAsync(rebuiltHash) and
LookupAsync(staleHash) on the new service and assert the rebuilt shard returns
the expected ShardEntry (new ShardEntry(rebuiltHash,
ChunkHash.Parse(rebuiltHash), 100, 3)) and the staleHash returns null; ensure
you dispose/close the original index if required before creating the new
instance so the lookup truly hits persisted storage.
🧹 Nitpick comments (1)
src/Arius.Core.Tests/Shared/ChunkIndex/ChunkIndexServiceLookupTests.cs (1)

216-255: 💤 Low value

Consider moving ThrowOnMetadataBlobContainerService into a local Fakes/ subfolder.

This scenario-specific test double is declared inline in the test file. The repository convention places such doubles in a Fakes/ subfolder beside the tests that use them (as done for the other blob-container fakes in this PR), which improves discoverability and reuse.

As per coding guidelines: "Put scenario-specific test doubles in a local Fakes/ subfolder beside the tests that use them."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/Arius.Core.Tests/Shared/ChunkIndex/ChunkIndexServiceLookupTests.cs`
around lines 216 - 255, The inline test double
ThrowOnMetadataBlobContainerService should be moved into a local Fakes/
subfolder beside these tests; create a new file named
ThrowOnMetadataBlobContainerService.cs under the test's Fakes/ directory, move
the class implementation there, adjust the namespace to match the existing
tests' namespace (or add an appropriate using) and ensure the test references
compile (update accessibility if needed); no behavior changes—just relocate the
class to follow the repository convention for scenario-specific fakes.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/Arius.Core/Shared/ChunkIndex/ChunkIndexWriteSession.cs`:
- Around line 13-19: AddEntry has a TOCTOU race with FlushAsync: it checks
Volatile.Read(ref _flushInProgress) then writes into the shared dictionary
_sessionEntries so FlushAsync may snapshot and clear entries while AddEntry
writes after the snapshot, causing silent loss; fix by making _sessionEntries
non-readonly and changing FlushAsync to atomically swap the current dictionary
with a new empty one (e.g., Interlocked.Exchange on the _sessionEntries
reference), then operate on the snapshot for clearing/persisting; keep the
Volatile.Read check in AddEntry and continue to write into the current
_sessionEntries reference so any concurrent flush will operate on the swapped
snapshot, eliminating the TOCTOU window (refer to AddEntry, FlushAsync,
_sessionEntries, and _flushInProgress).

---

Outside diff comments:
In `@src/Arius.Core/Shared/ChunkIndex/ChunkIndexService.cs`:
- Around line 239-248: CreateThinRepairEntry currently throws if
BlobMetadataKeys.ParentChunkHash is missing; change it to an async
CreateThinRepairEntry that first tries metadata then falls back to downloading
and parsing the legacy thin-body to extract the parent chunk hash when metadata
is absent or invalid, then return the same ShardEntry using
ContentHash.Parse(item.Name.Name.ToString()), ReadRequiredLongMetadata(item,
BlobMetadataKeys.OriginalSize) and ReadRequiredLongMetadata(item,
BlobMetadataKeys.CompressedSize); if the legacy body cannot be read or parsed
still throw ChunkIndexRepairException with context. Also update the caller that
invokes CreateThinRepairEntry to await the new async method and adjust its
return type accordingly so the async flow compiles.

In `@src/Arius.Core/Shared/ChunkIndex/ShardSerializer.cs`:
- Line 10: Change the public visibility of the ShardSerializer class to internal
since it is an implementation detail of the chunk-index; update the class
declaration for ShardSerializer from "public static class ShardSerializer" to
"internal static class ShardSerializer" so it remains accessible to internal
consumers like ChunkIndexShardCache and tests (which use InternalsVisibleTo) but
is not exposed publicly beyond the assembly.

---

Duplicate comments:
In `@src/Arius.Core.Tests/Shared/ChunkIndex/ChunkIndexServiceRepairTests.cs`:
- Around line 82-86: The test currently verifies rebuilt shards only against the
same ChunkIndexService instance that performed RepairAsync, so update the test
to assert persistence by instantiating a fresh ChunkIndexService after calling
index.RepairAsync(); then call LookupAsync(rebuiltHash) and
LookupAsync(staleHash) on the new service and assert the rebuilt shard returns
the expected ShardEntry (new ShardEntry(rebuiltHash,
ChunkHash.Parse(rebuiltHash), 100, 3)) and the staleHash returns null; ensure
you dispose/close the original index if required before creating the new
instance so the lookup truly hits persisted storage.

---

Nitpick comments:
In `@src/Arius.Core.Tests/Shared/ChunkIndex/ChunkIndexServiceLookupTests.cs`:
- Around line 216-255: The inline test double
ThrowOnMetadataBlobContainerService should be moved into a local Fakes/
subfolder beside these tests; create a new file named
ThrowOnMetadataBlobContainerService.cs under the test's Fakes/ directory, move
the class implementation there, adjust the namespace to match the existing
tests' namespace (or add an appropriate using) and ensure the test references
compile (update accessibility if needed); no behavior changes—just relocate the
class to follow the repository convention for scenario-specific fakes.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b06e78e8-8aff-40e9-805c-81dff31dc094

📥 Commits

Reviewing files that changed from the base of the PR and between 4ce3437 and 3f5dc0e.

📒 Files selected for processing (53)
  • .agents/skills/grill-me/SKILL.md
  • .agents/skills/grill-with-openspec/SKILL.md
  • docs/cache.md
  • openspec/changes/archive/2026-06-02-split-chunk-index-responsibilities/.openspec.yaml
  • openspec/changes/archive/2026-06-02-split-chunk-index-responsibilities/design.md
  • openspec/changes/archive/2026-06-02-split-chunk-index-responsibilities/proposal.md
  • openspec/changes/archive/2026-06-02-split-chunk-index-responsibilities/specs/architecture-tests/spec.md
  • openspec/changes/archive/2026-06-02-split-chunk-index-responsibilities/specs/chunk-index-service/spec.md
  • openspec/changes/archive/2026-06-02-split-chunk-index-responsibilities/tasks.md
  • openspec/specs/architecture-tests/spec.md
  • openspec/specs/chunk-index-service/spec.md
  • src/Arius.Architecture.Tests/DependencyTests.cs
  • src/Arius.AzureBlob/AzureBlobContainerService.cs
  • src/Arius.Cli.Tests/MediatorEventRoutingIntegrationTests.cs
  • src/Arius.Core.Tests/Fakes/FakeMetadataOnlyBlobContainerService.cs
  • src/Arius.Core.Tests/Fakes/FakeRecordingBlobContainerService.cs
  • src/Arius.Core.Tests/Fakes/FakeSeededBlobContainerService.cs
  • src/Arius.Core.Tests/Fakes/ThrowOnCreateBlobContainerService.cs
  • src/Arius.Core.Tests/Features/ArchiveCommand/ArchiveRecoveryTests.cs
  • src/Arius.Core.Tests/Features/ChunkHydrationStatusQuery/ResolveFileHydrationStatusesHandlerTests.cs
  • src/Arius.Core.Tests/Features/ListQuery/ListQueryHandlerTests.cs
  • src/Arius.Core.Tests/Shared/ChunkIndex/ChunkIndexReaderTests.cs
  • src/Arius.Core.Tests/Shared/ChunkIndex/ChunkIndexServiceLookupTests.cs
  • src/Arius.Core.Tests/Shared/ChunkIndex/ChunkIndexServiceRepairTests.cs
  • src/Arius.Core.Tests/Shared/ChunkIndex/ChunkIndexShardCacheTests.cs
  • src/Arius.Core.Tests/Shared/ChunkIndex/ChunkIndexWriteSessionTests.cs
  • src/Arius.Core.Tests/Shared/ChunkIndex/ShardSerializerLocalTests.cs
  • src/Arius.Core.Tests/Shared/ChunkIndex/ShardSerializerTests.cs
  • src/Arius.Core.Tests/Shared/ChunkIndex/ShardTests.cs
  • src/Arius.Core.Tests/Shared/ChunkStorage/Fakes/BlobAlreadyExistsOnSetMetadataOnceBlobContainerService.cs
  • src/Arius.Core.Tests/Shared/ChunkStorage/Fakes/BlockingDeleteBlobContainerService.cs
  • src/Arius.Core.Tests/Shared/ChunkStorage/Fakes/ContentTypeCapturingBlobContainerService.cs
  • src/Arius.Core.Tests/Shared/FileTree/Fakes/BlockingFileTreeUploadBlobContainerService.cs
  • src/Arius.Core.Tests/Shared/FileTree/Fakes/FaultingAndBlockingFileTreeUploadBlobContainerService.cs
  • src/Arius.Core.Tests/Shared/FileTree/Fakes/SlowDownloadBlobContainerService.cs
  • src/Arius.Core.Tests/Shared/FileTree/Fakes/ThrowingDownloadBlobContainerService.cs
  • src/Arius.Core.Tests/Shared/FileTree/Fakes/UnsortedSnapshotBlobContainerService.cs
  • src/Arius.Core.Tests/Shared/Storage/FakeInMemoryBlobContainerServiceTests.cs
  • src/Arius.Core/Features/ArchiveCommand/ArchiveCommandHandler.cs
  • src/Arius.Core/ServiceCollectionExtensions.cs
  • src/Arius.Core/Shared/ChunkIndex/ChunkIndexReader.cs
  • src/Arius.Core/Shared/ChunkIndex/ChunkIndexService.cs
  • src/Arius.Core/Shared/ChunkIndex/ChunkIndexShardCache.cs
  • src/Arius.Core/Shared/ChunkIndex/ChunkIndexWriteSession.cs
  • src/Arius.Core/Shared/ChunkIndex/CustomExceptions.cs
  • src/Arius.Core/Shared/ChunkIndex/Shard.cs
  • src/Arius.Core/Shared/ChunkIndex/ShardSerializer.cs
  • src/Arius.Core/Shared/Storage/IBlobContainerService.cs
  • src/Arius.Explorer/Infrastructure/RepositorySession.cs
  • src/Arius.Integration.Tests/Pipeline/Fakes/FaultingBlobService.cs
  • src/Arius.Integration.Tests/Pipeline/Fakes/RehydrationSimulatingBlobService.cs
  • src/Arius.Integration.Tests/Storage/BlobStorageServiceTests.cs
  • src/Arius.Tests.Shared/Storage/FakeInMemoryBlobContainerService.cs
💤 Files with no reviewable changes (1)
  • .agents/skills/grill-me/SKILL.md
✅ Files skipped from review due to trivial changes (5)
  • openspec/changes/archive/2026-06-02-split-chunk-index-responsibilities/proposal.md
  • src/Arius.Core.Tests/Features/ChunkHydrationStatusQuery/ResolveFileHydrationStatusesHandlerTests.cs
  • openspec/changes/archive/2026-06-02-split-chunk-index-responsibilities/.openspec.yaml
  • openspec/changes/archive/2026-06-02-split-chunk-index-responsibilities/specs/architecture-tests/spec.md
  • src/Arius.Core.Tests/Shared/ChunkIndex/ShardSerializerLocalTests.cs
🚧 Files skipped from review as they are similar to previous changes (9)
  • src/Arius.Core.Tests/Features/ListQuery/ListQueryHandlerTests.cs
  • src/Arius.Explorer/Infrastructure/RepositorySession.cs
  • src/Arius.Core.Tests/Fakes/ThrowOnCreateBlobContainerService.cs
  • src/Arius.Core.Tests/Shared/ChunkStorage/Fakes/ContentTypeCapturingBlobContainerService.cs
  • src/Arius.Core.Tests/Shared/FileTree/Fakes/BlockingFileTreeUploadBlobContainerService.cs
  • src/Arius.Core.Tests/Shared/FileTree/Fakes/FaultingAndBlockingFileTreeUploadBlobContainerService.cs
  • src/Arius.Core.Tests/Fakes/FakeSeededBlobContainerService.cs
  • src/Arius.Core.Tests/Shared/ChunkStorage/Fakes/BlobAlreadyExistsOnSetMetadataOnceBlobContainerService.cs
  • src/Arius.Integration.Tests/Pipeline/Fakes/RehydrationSimulatingBlobService.cs

Comment on lines +13 to +19
public void AddEntry(ShardEntry entry)
{
if (Volatile.Read(ref _flushInProgress) != 0)
throw new InvalidOperationException("Cannot record chunk-index entries while a flush is in progress.");

_sessionEntries[entry.ContentHash] = entry;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | 🏗️ Heavy lift

TOCTOU race can cause silent data loss when AddEntry/FlushAsync run concurrently.

AddEntry checks _flushInProgress (line 15) then writes to _sessionEntries (line 18). If FlushAsync sets the flag and snapshots entries between those two operations, the new entry will be written after the snapshot but cleared at line 44, causing silent data loss instead of throwing the expected exception.

The archive handler upholds the "drain workers before flush" contract (ArchiveCommandHandler lines 506–523), so this race won't occur in current usage. However, the fail-fast check exists to catch future contract violations, and the TOCTOU window makes it unreliable.

Fix by atomically swapping _sessionEntries in FlushAsync:

🔒 Proposed fix to eliminate TOCTOU race

Remove readonly from the field declaration and swap the dictionary atomically in FlushAsync:

-    private readonly ConcurrentDictionary<ContentHash, ShardEntry> _sessionEntries = [];
+    private ConcurrentDictionary<ContentHash, ShardEntry> _sessionEntries = new();
     public async Task FlushAsync(ChunkIndexShardCache shardCache, CancellationToken cancellationToken = default)
     {
         if (Interlocked.Exchange(ref _flushInProgress, 1) != 0)
             throw new InvalidOperationException("Chunk-index flush is already in progress.");

         try
         {
-            if (_sessionEntries.IsEmpty)
+            var entriesToFlush = Interlocked.Exchange(ref _sessionEntries, new ConcurrentDictionary<ContentHash, ShardEntry>());
+            
+            if (entriesToFlush.IsEmpty)
                 return;

-            var byPrefix = _sessionEntries.Values
+            var byPrefix = entriesToFlush.Values
                 .GroupBy(e => Shard.PrefixOf(e.ContentHash))
                 .Select(g => new KeyValuePair<PathSegment, ShardEntry[]>(g.Key, [.. g]))
                 .ToArray();

             await Parallel.ForEachAsync(
                 byPrefix,
                 new ParallelOptions { MaxDegreeOfParallelism = ChunkIndexService.FlushWorkers, CancellationToken = cancellationToken },
                 async (group, ct) =>
                 {
                     await shardCache.UpdateShardAsync(group.Key, group.Value, ct);
                 });

-            _sessionEntries.Clear();
+            // No need to clear - already swapped to new dictionary
         }
         finally
         {
             Volatile.Write(ref _flushInProgress, 0);
         }
     }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/Arius.Core/Shared/ChunkIndex/ChunkIndexWriteSession.cs` around lines 13 -
19, AddEntry has a TOCTOU race with FlushAsync: it checks Volatile.Read(ref
_flushInProgress) then writes into the shared dictionary _sessionEntries so
FlushAsync may snapshot and clear entries while AddEntry writes after the
snapshot, causing silent loss; fix by making _sessionEntries non-readonly and
changing FlushAsync to atomically swap the current dictionary with a new empty
one (e.g., Interlocked.Exchange on the _sessionEntries reference), then operate
on the snapshot for clearing/persisting; keep the Volatile.Read check in
AddEntry and continue to write into the current _sessionEntries reference so any
concurrent flush will operate on the swapped snapshot, eliminating the TOCTOU
window (refer to AddEntry, FlushAsync, _sessionEntries, and _flushInProgress).

@woutervanranst woutervanranst merged commit 335b850 into master Jun 2, 2026
1 check passed
@woutervanranst woutervanranst deleted the improve-chunk-index2 branch June 2, 2026 16:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant