Bump+reset arena allocator for ZK proving workloads.
zk-alloc replaces the system allocator (glibc malloc) via Rust's #[global_allocator]. It mmaps a large virtual region with MAP_NORESERVE (no physical memory committed up front), splits it into per-thread slabs (8 GiB each by default), and bumps a pointer for every allocation. dealloc is a no-op for arena-owned memory. Between proofs, begin_phase() resets all bump pointers so that already-touched physical pages are reused without demand-paging costs.

    use zk_alloc::ZkAllocator;

    #[global_allocator]
    static ALLOC: ZkAllocator = ZkAllocator;

    fn main() {
        loop {
            let proof = zk_alloc::phase(|| generate_proof()); // arena on inside
            let output = proof.clone(); // detach to System
            submit(output);
        }
    }

phase(|| { ... }) activates the arena, runs the closure, and deactivates on return, including during panic unwinding. It is an RAII wrapper around begin_phase() / end_phase(), which are also exposed for callers that need finer-grained control.
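As a rough illustration of the RAII pattern described above, the sketch below models phase activation with a single atomic flag. Every name in it (ARENA_ACTIVE, begin_phase_sketch, end_phase_sketch, PhaseGuardSketch, phase_sketch, arena_active) is a hypothetical stand-in rather than the crate's implementation, and the real begin_phase() additionally resets bump pointers.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical stand-in for the allocator's "arena active" switch.
static ARENA_ACTIVE: AtomicBool = AtomicBool::new(false);

// Stand-ins for begin_phase() / end_phase(); assumed to only flip the flag here.
fn begin_phase_sketch() {
    ARENA_ACTIVE.store(true, Ordering::SeqCst);
}

fn end_phase_sketch() {
    ARENA_ACTIVE.store(false, Ordering::SeqCst);
}

fn arena_active() -> bool {
    ARENA_ACTIVE.load(Ordering::SeqCst)
}

// Guard whose Drop ends the phase on normal return and during unwinding.
struct PhaseGuardSketch;

impl PhaseGuardSketch {
    fn new() -> Self {
        begin_phase_sketch();
        PhaseGuardSketch
    }
}

impl Drop for PhaseGuardSketch {
    fn drop(&mut self) {
        end_phase_sketch();
    }
}

// Runs `f` with the phase active; the guard is dropped when `f` returns or panics.
fn phase_sketch<R>(f: impl FnOnce() -> R) -> R {
    let _guard = PhaseGuardSketch::new();
    f()
}
```

Because the deactivation lives in Drop, a panic inside the closure still switches the arena off as the stack unwinds. This does not hold when the binary is built with panic = "abort", since no destructors run in that mode.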
ZkAllocator routes each request to one of two backends:
- Arena: bump-pointer slab, used during an active phase for allocations ≥ ZK_ALLOC_MIN_BYTES (default 4096). Reset on the next begin_phase().
- System: glibc malloc, used for everything else. That covers allocations made outside any phase, allocations under the size-routing threshold (small library bookkeeping like rayon's injector blocks, tracing-subscriber registry slots, hashbrown HashMap entries), and realloc of any pointer that originated in System (sticky-System routing: System allocations never silently migrate to arena on growth).
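The routing rules above can be written as a small decision function. This is a minimal sketch, not the crate's code: Backend, route_alloc, route_realloc, and MIN_BYTES_DEFAULT are hypothetical names. The text does not say how realloc treats arena-origin pointers, so the sketch assumes the fresh-allocation rule is re-applied in that case.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    Arena,
    System,
}

// Default threshold from the text; the real value comes from ZK_ALLOC_MIN_BYTES.
const MIN_BYTES_DEFAULT: usize = 4096;

// Fresh allocation: arena only when a phase is active and the request
// meets the size threshold; everything else goes to System.
fn route_alloc(size: usize, phase_active: bool, min_bytes: usize) -> Backend {
    if phase_active && size >= min_bytes {
        Backend::Arena
    } else {
        Backend::System
    }
}

// Realloc: a System-origin pointer stays in System regardless of the new size.
// ASSUMPTION: arena-origin pointers are re-routed like a fresh allocation.
fn route_realloc(
    origin: Backend,
    new_size: usize,
    phase_active: bool,
    min_bytes: usize,
) -> Backend {
    match origin {
        Backend::System => Backend::System,
        Backend::Arena => route_alloc(new_size, phase_active, min_bytes),
    }
}
```

Note that with min_bytes set to 0, every in-phase allocation satisfies size >= min_bytes and lands in the arena, matching the documented behavior of ZK_ALLOC_MIN_BYTES=0.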
Allocations made during phase N must not be held past begin_phase() of
phase N+1 — that call recycles the slab, and the next allocation at the
same offset overwrites the retained bytes. Violating this contract is
undefined behavior (the old pointer becomes invalid the moment the
overwrite happens). In practice:
- Drop arena-allocated values before the next begin_phase(), or clone() them after the phase has ended so the copy lives in System.
- Construct long-lived state (thread pools, channels, registries) before any phase begins so it lives in System.
- Use phase(|| { ... }) (or a PhaseGuard) instead of paired calls so the phase ends correctly even on panic.
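To see why a retained pointer goes stale, consider a toy bump arena that hands out offsets instead of raw pointers. This is a sketch under simplifying assumptions (single slab, no threads, no real memory); BumpSketch and its methods are hypothetical names, not part of zk-alloc.

```rust
// Toy arena: hands out offsets into a fixed-size slab instead of raw pointers.
struct BumpSketch {
    capacity: usize,
    offset: usize,
}

impl BumpSketch {
    fn new(capacity: usize) -> Self {
        BumpSketch { capacity, offset: 0 }
    }

    // Rounds the cursor up to `align` (a power of two), then bumps it by `size`.
    // Returns the start offset, or None when the slab is exhausted.
    fn alloc(&mut self, size: usize, align: usize) -> Option<usize> {
        debug_assert!(align.is_power_of_two());
        let start = self.offset.checked_add(align - 1)? & !(align - 1);
        let end = start.checked_add(size)?;
        if end > self.capacity {
            return None;
        }
        self.offset = end;
        Some(start)
    }

    // Models the start of a new phase: rewind the cursor.
    // Nothing is freed or zeroed, so old offsets now alias new allocations.
    fn reset(&mut self) {
        self.offset = 0;
    }
}
```

After reset(), the first allocation receives the same offset that the previous phase's first allocation received. Any value still referencing that range from the earlier phase now shares its bytes with the new allocation, which is the hazard the contract above rules out.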
| Variable | Default | Effect |
|---|---|---|
| ZK_ALLOC_SLAB_GB | 8 | Per-thread slab size, in GiB. Raise for workloads that overflow (overflow_stats() reports the count). Total virtual reservation = ZK_ALLOC_SLAB_GB × thread_count (e.g., 8 GiB × 16 threads = 128 GiB virtual). Physical RAM is only consumed on touch. |
| ZK_ALLOC_MIN_BYTES | 4096 | Size-routing threshold. Allocations smaller than this go to System even during a phase. Set to 0 to send everything to arena (loses size-routing protection against library-internal pooled allocations). |
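The reservation arithmetic in the table can be checked with a few lines of Rust. This is an illustrative sketch only: parse_or_default, total_virtual_bytes, and reservation_from_env are hypothetical helpers, and falling back to the default on an unparsable value is an assumption, since the text does not say how the crate handles malformed input.

```rust
use std::env;
use std::thread;

const GIB: u64 = 1 << 30;

// Parses a variable's value, falling back to `default` when unset or unparsable.
// ASSUMPTION: the crate's handling of malformed values is not documented here.
fn parse_or_default(raw: Option<&str>, default: u64) -> u64 {
    raw.and_then(|s| s.trim().parse::<u64>().ok())
        .unwrap_or(default)
}

// Total virtual reservation = slab size (GiB) × thread count, in bytes.
fn total_virtual_bytes(slab_gib: u64, threads: u64) -> u64 {
    slab_gib * GIB * threads
}

// Combines the environment variable with the detected thread count.
fn reservation_from_env() -> u64 {
    let slab_gib = parse_or_default(env::var("ZK_ALLOC_SLAB_GB").ok().as_deref(), 8);
    let threads = thread::available_parallelism()
        .map(|n| n.get() as u64)
        .unwrap_or(1);
    total_virtual_bytes(slab_gib, threads)
}
```

For the table's example, total_virtual_bytes(8, 16) evaluates to 128 GiB of virtual address space, none of which is backed by physical RAM until touched.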
| Platform | Path | Notes |
|---|---|---|
| Linux x86_64 | direct syscalls (mmap, madvise) | Fastest path. No libc allocator reentrancy concerns. |
| Linux aarch64 | direct syscalls | Requires vm.overcommit_memory=1 for MAP_NORESERVE to behave (Asahi/server-aarch64). Without it, large reservations SIGABRT. |
| Other Unix (macOS, *BSD) | libc fallback (mmap via libc, madvise no-op) | Functional, slightly slower setup; no MADV_NOHUGEPAGE hint. |
| Windows | no-op stubs | Allocator routes everything through System; arena is inert. Use System allocator directly here. |
Minimum RAM: at least one slab's worth (default 8 GiB) of working set per active thread when phases run. On memory-constrained machines (e.g., 16 GiB M-series Macs), set ZK_ALLOC_SLAB_GB lower or limit thread count.
| Prover | Architecture | vs glibc | Mechanism |
|---|---|---|---|
| leanMultisig | FFT-based (Plonky3/WHIR, KoalaBear) | -27% warm proof | Page reuse eliminates demand-paging |
| Plonky3 | FFT-based (BabyBear, FRI) | -12% to -17% | Same mechanism, Poseidon1/2 and Keccak |
| Jolt | Sumcheck-based (Dory/BN254) | +1% to +4% (null) | Compute-bound; allocator overhead <1% |
FFT-based provers are memory-bound and benefit significantly. Sumcheck-based provers are compute-bound and unaffected.
- mmap with MAP_NORESERVE: reserves virtual address space without committing physical memory
- MADV_NOHUGEPAGE: 4KB pages are faster for bump+reset than 2MB THP (lower per-fault cost, no compaction)
- Thread detection via available_parallelism(): auto-sizes to the machine
- Overflow to System: allocations that exceed the slab fall back to glibc
- overflow_stats(): reports how many allocations fell through (useful for tuning)
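The slab layout and overflow fallback listed above can be modeled without touching real memory. The sketch below tracks one cursor per thread and returns offsets into an imagined contiguous region. RegionSketch, Placement, and the method names are hypothetical, and the real allocator's per-thread state and statistics are assumed, not known, to work along these lines.

```rust
use std::thread;

#[derive(Debug, PartialEq, Eq)]
enum Placement {
    // Offset into the whole reserved region (slab base + cursor).
    Arena { offset: usize },
    // Request did not fit in the thread's slab; served by the system allocator.
    System,
}

struct RegionSketch {
    slab_bytes: usize,
    cursors: Vec<usize>, // one bump cursor per thread slab
    overflows: usize,    // how many requests fell through to System
}

impl RegionSketch {
    fn new(slab_bytes: usize, thread_count: usize) -> Self {
        RegionSketch { slab_bytes, cursors: vec![0; thread_count], overflows: 0 }
    }

    // Sizes the region to the machine, as available_parallelism() suggests.
    fn auto(slab_bytes: usize) -> Self {
        let threads = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
        Self::new(slab_bytes, threads)
    }

    // Bumps the given thread's cursor, or counts an overflow and falls back.
    fn alloc(&mut self, thread: usize, size: usize) -> Placement {
        let cur = self.cursors[thread];
        match cur.checked_add(size) {
            Some(end) if end <= self.slab_bytes => {
                self.cursors[thread] = end;
                Placement::Arena { offset: thread * self.slab_bytes + cur }
            }
            _ => {
                self.overflows += 1;
                Placement::System
            }
        }
    }

    fn overflow_count(&self) -> usize {
        self.overflows
    }
}
```

A nonzero overflow count in this model corresponds to the situation where raising ZK_ALLOC_SLAB_GB would keep more allocations in the arena.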
The technique is a 1990s bump allocator (Hanson, 1990) applied to a domain where nobody questioned malloc. The novelty is the application, not the technique.
Hanson, D.R. (1990). "Fast allocation and deallocation of memory based on object lifetimes." Software: Practice and Experience, 20(1), 5-12.
Apache-2.0 — see LICENSE for the full text.