Replace CAS retry loop with Interlocked.Add in MemoryCache size tracking#124430
Replace CAS retry loop with Interlocked.Add in MemoryCache size tracking#124430lufen wants to merge 1 commit into
Conversation
Fix dotnet#111959 — UpdateCacheSizeExceedsCapacity uses a CompareExchange loop with 100 retries to atomically update _cacheSize. Under high thread contention this causes CPU spikes, retry storms, and silent entry drops when retries are exhausted. Replace with Interlocked.Add (wait-free) + post-check rollback: - Optimistically add delta via Interlocked.Add - Check capacity; if exceeded, roll back with Interlocked.Add(-delta) Benchmark results (isolated contention microbenchmark, 50k ops/thread): | Threads | BEFORE (CAS loop) | AFTER (Interlocked.Add) | Speedup | |---------|-------------------|------------------------|---------| | 1 | 447 us | 312 us | 1.4x | | 2 | 3,549 us | 1,729 us | 2.1x | | 4 | 13,208 us | 4,148 us | 3.2x | | 8 | 26,885 us | 8,119 us | 3.3x | | 16 | 105,903 us | 15,365 us | 6.9x | Zero allocation difference. CAS scales ~O(N^2), Add scales ~O(N). Performance trade-off assessment: The only trade-off is a nanosecond-scale transient over-count window in _cacheSize between Add and rollback. This is acceptable because: - _cacheSize is documented as "eventually consistent" (line 91-92) - The other two write sites (SetEntry rollback line 182, RemoveEntry line 790) already use plain Interlocked.Add without CAS protection - The old CAS code has the same race: two threads can both pass the capacity check and together exceed the limit - The over-count is bounded to entry.Size per thread (~1-2 CPU instructions of exposure) Additional benefits beyond throughput: - Fixes correctness bug: old code silently drops entries after 100 CAS failures even when capacity is available - 30% faster single-threaded (less branching, no loop) - Simpler code: 14 lines vs 20 lines Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR aims to improve MemoryCache performance under high thread contention by replacing a CompareExchange retry loop with an optimistic Interlocked.Add approach for cache size tracking. The change is motivated by CPU spikes observed when many threads compete to update cache size, with benchmark results showing 2-7x speedup under contention.
Changes:
- Replaced CompareExchange loop (with 100-retry limit) with optimistic Interlocked.Add + post-check rollback pattern
- Simplified the logic from ~20 lines to ~14 lines
- Added inline documentation referencing the motivating GitHub issue
| long newSize = Interlocked.Add(ref coherentState._cacheSize, delta); | ||
|
|
||
| long original = Interlocked.CompareExchange(ref coherentState._cacheSize, newSize, sizeRead); | ||
| if (sizeRead == original) | ||
| { | ||
| return false; | ||
| } | ||
| sizeRead = original; | ||
| if ((ulong)newSize > (ulong)sizeLimit) | ||
| { | ||
| // Exceeded capacity — roll back the optimistic addition | ||
| Interlocked.Add(ref coherentState._cacheSize, -delta); | ||
| return true; |
There was a problem hiding this comment.
Critical bug: When arithmetic overflow occurs in the optimistic Interlocked.Add, the rollback cannot restore the original _cacheSize value.
Example: If _cacheSize is long.MaxValue - 10 and delta is 100, the Add overflows to long.MinValue + 89. The rollback with Interlocked.Add(ref _cacheSize, -100) makes it long.MinValue - 11, not the original long.MaxValue - 10. This corrupts the cache size tracking permanently.
The old CompareExchange loop avoided this by computing newSize locally first and only updating _cacheSize if no overflow would occur. To maintain correctness, you need to check for potential overflow BEFORE calling Interlocked.Add, similar to the old approach but without the retry loop.
| if ((ulong)newSize > (ulong)sizeLimit) | ||
| { | ||
| // Exceeded capacity — roll back the optimistic addition | ||
| Interlocked.Add(ref coherentState._cacheSize, -delta); | ||
| return true; |
There was a problem hiding this comment.
The optimistic approach introduces false rejections under concurrent load. When multiple threads add simultaneously, later additions see the temporarily inflated size and may incorrectly reject entries that should fit.
Example: With _cacheSize=50 and limit=100, Thread A adds 60 (making size 110), then Thread B adds 30 (seeing size 140). Thread B checks 140 > 100 and rolls back, but the entry (size 30) should have been accepted since 50+30=80 < 100.
The old CompareExchange loop avoided this by checking capacity BEFORE committing the update, ensuring entries are rejected based on the actual cache state, not temporarily inflated values. This change could increase cache miss rates under high concurrency, undermining the performance improvement goal.
…size-limited cache (#129215) ## Summary Fixes #129186. When `MemoryCacheOptions.SizeLimit` is set, the internal `_cacheSize` counter can drift **negative** under concurrent `Set`/`Get`/`Remove` on string keys. Once negative, the capacity check `(ulong)... > (ulong)sizeLimit` is permanently true, so **every subsequent `Set` is silently rejected** (no exception) and nothing is retained for the lifetime of the cache — `CurrentEntryCount` stuck at 0, every `Get` a miss. This is permanent, silent data loss. ## Regression source Introduced by #103931 (merged for 9.0, "Subtract prior entry size when adding entry to cache"), which fixed #36039 — allowing an entry at the size limit to be replaced by a same-or-smaller one. That PR made two changes to the replace path: 1. It made the capacity **check** prior-aware (`newSize -= priorEntry.Size`) so a same-or-smaller replace fits at the limit. This is correct and desirable. 2. It also **moved the `_cacheSize` decrement of the prior entry** out of the post-swap `if (entryAdded)` block and into the speculative capacity computation, *before* the `TryUpdate` swap. Change (2) is the bug. Pre-#103931 (i.e., 8.x), the prior entry's size was decremented **only after** a successful `TryUpdate`, atomically tied to the swap, so a concurrent removal of the prior entry could not double-count it. ## Root cause In `SetEntry`, the prior entry's size is subtracted **speculatively** inside `UpdateCacheSizeExceedsCapacity`, before the entry is actually swapped in by `TryUpdate`. If another thread runs `RemoveEntry(priorEntry)` (expiration scan, explicit `Remove`, or eviction) in the window between that subtraction and the `TryUpdate`, `priorEntry.Size` is subtracted **twice** — once speculatively and once by the real removal in `CoherentState.RemoveEntry`. Each leaked decrement drives `_cacheSize` negative, and the `(ulong)` cast then latches the cache permanently. ## Fix Keep #103931's prior-aware capacity **check** (so the #36039 behavior and its test are preserved), but commit only `+entry.Size` to `_cacheSize` there. Decrement `priorEntry.Size` exactly once, atomically with the `TryUpdate` that performs the swap — restoring the pre-#103931 (8.x) ordering. On the failure path, roll back only `entry.Size`. This intentionally keeps the existing `CompareExchange` retry loop and the "check capacity *before* committing" ordering, so it avoids both problems that closed #124430 (overflow-rollback corruption, and false rejections from a transiently-inflated size). The only transient window now over-counts by `priorEntry.Size` (positive), which is self-correcting and can never latch the cache. ## Validation Reproduced and validated against the genuine `MemoryCache` source compiled standalone (concurrent Set/Get/Remove storm on a small string keyspace under a generous `SizeLimit`, with a fresh-key retention probe): - **Before**: `_cacheSize` goes negative, 0/512 fresh entries retained. - **After**: no drift (size stays >= 0), 512/512 retained across repeated runs; `SizeLimit` eviction still enforced (a tiny limit still bounds count/size); and the #36039 replace-at-limit behavior (`ReplaceOldEntryWithSameSizeOrLessNewEntryAtSizeLimitCapacity`, new value size 6/5/2 at limit 6) still passes. Bisection (same harness): **8.0.x not affected; 9.0.x and 10.0.x affected**, on both `net462` and `net10.0` builds — consistent with #103931 shipping in 9.0. A single-threaded pure-replace microbenchmark (the only path this change touches) shows the difference vs the current code is within run-to-run noise (~100 ns/op either way) — per-replace cost is dominated by entry allocation and dictionary operations; the fix only restores one `Interlocked.Add` on the replace path. Happy to add a `Set`-throughput benchmark to dotnet/performance if desired (per the discussion on #111959). ## Tests Adds `MemoryCacheConcurrentSizeTrackingTests` with: - a concurrency regression test asserting the tracked size never goes negative and the cache keeps retaining entries after a Set/Get/Remove storm, and - a guard that `SizeLimit` eviction is still enforced. Related: #36039 (behavior preserved), #103931 (regression source), #111959, #124430.
Fix #111959 — UpdateCacheSizeExceedsCapacity uses a CompareExchange loop with 100 retries to atomically update _cacheSize. Under high thread contention this causes CPU spikes, retry storms, and silent entry drops when retries are exhausted.
Replace with Interlocked.Add (wait-free) + post-check rollback:
Benchmark results (isolated contention microbenchmark, 50k ops/thread):
| Threads | BEFORE (CAS loop) | AFTER (Interlocked.Add) | Speedup | |---------|-------------------|------------------------|---------|
| 1 | 447 us | 312 us | 1.4x |
| 2 | 3,549 us | 1,729 us | 2.1x |
| 4 | 13,208 us | 4,148 us | 3.2x |
| 8 | 26,885 us | 8,119 us | 3.3x |
| 16 | 105,903 us | 15,365 us | 6.9x |
Zero allocation difference. CAS scales ~O(N^2), Add scales ~O(N).
Benchmark PR: dotnet/performance#5120
Performance trade-off assessment:
The only trade-off is a nanosecond-scale transient over-count window in _cacheSize between Add and rollback. This is acceptable because:
Additional benefits beyond throughput: