GPE in prove_execution only when opt-level="z" AND codegen-units=1 (x86_64 Linux) #198

@ch4r10t33r

Filing this mostly for visibility. We've been chasing a deterministic General Protection Exception on x86_64-unknown-linux-gnu that only reproduces when a downstream consumer builds leanMultisig under opt-level = "z" and codegen-units = 1 simultaneously. Either knob on its own is fine. The crash lands inside rec_aggregation::xmss_aggregate → lean_prover::prove_execution → prove_generic_logup → Poseidon16Precompile::bus.

Our downstream (blockblaz/zeam) already shipped a workaround in #759 by switching to opt-level = "s", so we're unblocked. But the interesting part sits in leanMultisig, so I wanted to write it up here.

What actually faults

Debug symbols point at Poseidon16Precompile::bus, but the real body at the faulting PC is a monomorphization of lean_vm::tables::utils::eval_virtual_bus_column that got inlined into lean_prover's CGU:

pub(crate) fn eval_virtual_bus_column<AB: AirBuilder, EF: ExtensionField<PF<EF>>>(
    extra_data: &ExtraDataForBuses<EF>, flag: AB::IF, data: &[AB::IF],
) -> AB::EF {
    let (logup_alphas_eq_poly, bus_beta) = extra_data.transmute_bus_data::<AB::EF>();
    assert!(data.len() < logup_alphas_eq_poly.len());
    (logup_alphas_eq_poly.iter().zip(data).map(|(c, d)| *c * *d).sum::<AB::EF>()
     + *logup_alphas_eq_poly.last().unwrap() * AB::F::from_usize(LOGUP_PRECOMPILE_DOMAINSEP))
     * *bus_beta + flag
}

A handful of lines of safe Rust, but at the crash site objdump shows a single basic block keeping all 16 YMM registers live through vpblendd/vpbroadcastd/vpmuludq/vinserti128/vpaddq, and dying on a vpinsrd $0x1, 0x14(%rsp), %xmm5, %xmm5, which is a stack-relative reload of a spilled vector lane. The whole iterator chain has been inlined into one monstrous SIMD fold, and at opt-level = "z" the frame layout for that fold looks wrong. Bumping codegen-units to ≥ 2 appears to keep the function outlined at the crate boundary, which would explain why that mitigation works.
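For anyone who wants to poke at the shape of this fold outside leanMultisig, here is a rough, self-contained stand-in. It is only a sketch: the modulus P, the DOMAINSEP value, and the plain u64 representation are placeholders I made up, not leanMultisig's real field, extension field, or LOGUP_PRECOMPILE_DOMAINSEP, and I'm not claiming this toy reproduces the crash. It mirrors the zip/map/sum chain and checks it against a plain index loop, which is the kind of reference comparison a minimized reproducer would need.

```rust
// Toy stand-in for the *shape* of eval_virtual_bus_column.
// Hypothetical placeholders: P, DOMAINSEP, and the u64 element type.
const P: u64 = (1 << 31) - 1; // assumed toy prime modulus (2^31 - 1)
const DOMAINSEP: u64 = 3; // placeholder, not the real LOGUP_PRECOMPILE_DOMAINSEP

/// Modular multiplication; both operands are reduced first, so the
/// product stays below 2^62 and cannot overflow u64.
fn mul(a: u64, b: u64) -> u64 {
    (a % P) * (b % P) % P
}

/// Modular addition.
fn add(a: u64, b: u64) -> u64 {
    (a % P + b % P) % P
}

/// Iterator-chain version: (sum(alpha_i * data_i) + alpha_last * DOMAINSEP) * beta + flag.
pub fn eval_bus_column_toy(alphas: &[u64], beta: u64, flag: u64, data: &[u64]) -> u64 {
    assert!(data.len() < alphas.len());
    let dot = alphas
        .iter()
        .zip(data)
        .map(|(c, d)| mul(*c, *d))
        .fold(0, add);
    let sep = mul(*alphas.last().unwrap(), DOMAINSEP);
    add(mul(add(dot, sep), beta), flag)
}

/// Index-loop reference computing the same value without iterator adapters.
pub fn eval_bus_column_reference(alphas: &[u64], beta: u64, flag: u64, data: &[u64]) -> u64 {
    assert!(data.len() < alphas.len());
    let mut acc = 0u64;
    for i in 0..data.len() {
        acc = add(acc, mul(alphas[i], data[i]));
    }
    acc = add(acc, mul(alphas[alphas.len() - 1], DOMAINSEP));
    add(mul(acc, beta), flag)
}
```

Comparing the two functions on the same inputs under each profile setting is one way to tell a miscompiled fold from a logic bug, assuming the toy ever gets vectorized the same way as the real monomorphization.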

(For what it's worth: the unsafe transmute in ExtraDataForBuses::transmute_bus_data isn't the culprit here — at this monomorphization AB::EF == EF so it's an identity transmute. nm confirms the affected symbol only exists in lean_vm's CGU. Still a pattern I'd love to see go away for readability reasons, but it's sound.)

What we tried

Bisected on an AMD Zen 4 VM with stable rustc 1.95.0. Everything below runs against leanMultisig rev 2eb4b9d983171139af36749f127dd9890c9109e6:

  • Per-crate opt-level = "s" overrides across mt-*, rec_aggregation, backend, lean_prover, lean_vm, utils, sub_protocols — none of those combos fixed it. A lean_compiler-only override did, but it turned out to be a CGU-partitioning side-effect that's unstable under small cache changes.
  • RUSTFLAGS="-Cllvm-args=-enable-machine-outliner=never" — no effect.
  • #[inline(never)] on eval_virtual_bus_column itself — no effect.
  • codegen-units = 16 with opt-level = "z" — clean.
  • codegen-units = 1 with opt-level = "s" — clean.

So the condition is genuinely the conjunction of { "z", 1 }, not either alone.
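The matrix above can be written down as a small helper, sketched here in case someone wants to script the bisect. Assumptions: the type and function names are hypothetical, the helper only builds Cargo `--config` override strings (it doesn't run anything), and I haven't verified that these overrides propagate through zeam's zig-driven build. The ("s", 16) cell is included for completeness even though it isn't one of the combinations listed above.

```rust
/// One cell of the bisect matrix: an opt-level and a codegen-units count.
#[derive(Debug, Clone, PartialEq)]
pub struct Cell {
    pub opt_level: &'static str,
    pub codegen_units: u32,
}

/// All combinations of the two knobs discussed in this issue.
pub fn matrix() -> Vec<Cell> {
    let mut cells = Vec::new();
    for opt_level in ["s", "z"] {
        for codegen_units in [1, 16] {
            cells.push(Cell { opt_level, codegen_units });
        }
    }
    cells
}

/// Cargo `--config` arguments overriding the given profile for one cell.
/// `profile` is whatever profile the consumer builds with.
pub fn config_args(profile: &str, cell: &Cell) -> Vec<String> {
    vec![
        "--config".to_string(),
        format!("profile.{profile}.opt-level=\"{}\"", cell.opt_level),
        "--config".to_string(),
        format!("profile.{profile}.codegen-units={}", cell.codegen_units),
    ]
}

/// True only for the combination reported as crashing in this issue.
pub fn crash_reported(cell: &Cell) -> bool {
    cell.opt_level == "z" && cell.codegen_units == 1
}
```

Passing the result of `config_args` to a `cargo build` invocation for each cell of `matrix()` should walk the same grid by hand-free means, with `crash_reported` recording what was observed so far.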

Reproducer

I don't have a standalone leanMultisig-only repro yet — it needs a realistic witness flowing into prove_execution. Via zeam it's:

git clone https://github.com/blockblaz/zeam.git
cd zeam && git checkout 17f1083   # pre-#759, opt-level="z" still present
./zig build run -Dprover=risc0 -- prove -z risc0
# crashes in ~25s with "General protection exception" on Linux x86_64

Flipping either opt-level to "s" or codegen-units to anything ≥ 2 in rust/Cargo.toml's [profile.risc0-release] clears it. Also reproduces on GitHub's 2-core ubuntu-latest runners.

If it would help, I'm happy to try to carve a standalone #[test] inside leanMultisig that builds under { "z", 1 } and triggers the same codegen path — something around prove_generic_logup with a trivial witness. Let me know if you'd want that shape or something different.

Probably worth doing regardless of a fix

A short note somewhere in the README / a CODEGEN.md saying 'consumers using opt-level = "z" need codegen-units >= 2, or should use opt-level = "s"; the { "z", 1 } combo miscompiles on x86_64 Linux (observed with rustc 1.95.0)' would save anyone else hitting this a lot of bisecting. Happy to PR that once you confirm the preferred wording/location.

A proper rustc/LLVM upstream issue is the long-term fix, but needs a minimized reproducer first.

Environment

  • rustc 1.95.0 (59807616e 2026-04-14) stable
  • AMD EPYC-Genoa (Zen 4), also on GitHub ubuntu-latest
  • Ubuntu 24.04, Linux 6.8
  • -Ctarget-cpu=x86-64-v3 (so AVX2, not AVX-512)
  • leanMultisig 2eb4b9d, zeam pre-fix ref 17f1083, downstream fix blockblaz/zeam#759
