build: default rust-target-cpu to x86-64-v3 on x86_64 #756
Cap the default Rust target CPU at AVX2 instead of `native` so `zig build` produces portable binaries on x86_64 Linux. `target-cpu=native` on AVX-512 capable hosts has been causing hard-to-diagnose runtime GPFs in the deeper Rust dependency graph (LLVM codegen bugs, XSAVE/microcode quirks). Users who want machine-specific performance can still opt in via `-Drust-target-cpu=native` (or `x86-64-v4` for AVX-512).
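The selection rule described above can be sketched in a few lines. This is only an illustration in Rust, not the project's actual build logic (which lives in the Zig build script): the function names `resolve_target_cpu` and `rustflags_for` are hypothetical, and the behaviour on non-x86_64 architectures (no flag emitted) is an assumption.

```rust
/// Hypothetical helper: picks the value passed to `-Ctarget-cpu`.
/// `user_choice` models an explicit `-Drust-target-cpu=<value>` option.
fn resolve_target_cpu(arch: &str, user_choice: Option<&str>) -> Option<String> {
    match (user_choice, arch) {
        // An explicit choice always wins (e.g. "native" or "x86-64-v4").
        (Some(cpu), _) => Some(cpu.to_string()),
        // Default on x86_64: AVX2 baseline, no AVX-512.
        (None, "x86_64") => Some("x86-64-v3".to_string()),
        // Assumption: other architectures get no target-cpu flag at all.
        (None, _) => None,
    }
}

/// Renders the rustc flag, or an empty string when no CPU is selected.
fn rustflags_for(arch: &str, user_choice: Option<&str>) -> String {
    resolve_target_cpu(arch, user_choice)
        .map(|cpu| format!("-Ctarget-cpu={}", cpu))
        .unwrap_or_default()
}
```

Under these assumptions, `rustflags_for("x86_64", None)` yields the portable default, while passing `Some("native")` reproduces the old behaviour on request.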
zclawz
left a comment
LGTM — the right conclusion to the AVX-512 saga (#729 → #735 → #737 → #743).
What this does
Flips the default for -Drust-target-cpu from native → x86-64-v3. Previously native was the fallback for local and CI builds; this PR makes x86-64-v3 (AVX2, no AVX-512) the universal default, with opt-in escape hatches documented in the help string.
Why this is correct
The recurring CI SIGILL/rustc-crash pattern is exactly the symptom described in the comment: LLVM codegen emitting AVX-512 instructions that then fault on the kernel's XSAVE/XRSTOR path or on VMs that advertise AVX-512 but don't implement all variants cleanly. GitHub's ubuntu-latest runners are known to vary in AVX-512 support across the fleet — capping at AVX2 eliminates the whole class of runner-lottery failures.
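To see which side of the runner lottery a given machine falls on, a small diagnostic along these lines could be run on the host. It is a rough sketch, not part of this PR: `HostLevel` and `host_level` are hypothetical names, and the check covers only a subset of the x86-64-v3 feature set (AVX2, FMA, BMI1, BMI2, LZCNT), so it approximates the level rather than verifying it.

```rust
/// Coarse classification of the CPU this program is running on.
#[allow(dead_code)]
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum HostLevel {
    /// Not an x86_64 host; the x86-64-v3 cap does not apply.
    OtherArch,
    /// Missing at least one of the checked x86-64-v3 features.
    BelowV3,
    /// Has the checked x86-64-v3 features but no AVX-512F.
    V3,
    /// Has the checked x86-64-v3 features and also AVX-512F.
    V3WithAvx512f,
}

#[cfg(target_arch = "x86_64")]
fn host_level() -> HostLevel {
    // Runtime CPUID-based detection from the standard library.
    let v3 = is_x86_feature_detected!("avx2")
        && is_x86_feature_detected!("fma")
        && is_x86_feature_detected!("bmi1")
        && is_x86_feature_detected!("bmi2")
        && is_x86_feature_detected!("lzcnt");
    if !v3 {
        HostLevel::BelowV3
    } else if is_x86_feature_detected!("avx512f") {
        HostLevel::V3WithAvx512f
    } else {
        HostLevel::V3
    }
}

#[cfg(not(target_arch = "x86_64"))]
fn host_level() -> HostLevel {
    HostLevel::OtherArch
}
```

A binary built with `native` on a `V3WithAvx512f` host may contain AVX-512 instructions and fault on a `V3` host; a binary capped at x86-64-v3 should run on both.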
Interaction with existing PRs
The Docker workflow (#729/#735) and auto-release CI already explicitly pass -Drust-target-cpu=x86-64-v3, so those paths are unaffected. This PR closes the gap for local builds and any CI job that didn't previously pass the flag — which is where the flaky SIGILL was originating.
Anything missed?
No. The aarch64 guard (ring 0.17 skip) is untouched. The comment is accurate and well-scoped. Approved. 🚢
The `risc0-release` and `openvm-release` profiles used `opt-level = "z"`
for size reduction. On x86_64 Linux this triggers a codegen interaction
in leanMultisig's prover (`rec_aggregation` / `lean_prover` / `backend`)
that produces a runtime General Protection Exception inside
`xmss_aggregate` on the first aggregation call from `genMockChain`.
Bisection against `zig build run -Dprover=risc0 -- prove -z risc0` on
an AMD EPYC Genoa guest (Linux 6.x, x86-64-v3 rustflags, fresh rebuild):
- opt-level = z : crashes at `pkgs/xmss/src/aggregation.zig:139` (first
xmss_aggregate call), identical stack to CI runs.
- opt-level = s : completes all 5 mock blocks; libmultisig_glue.a = 63 MB.
- opt-level = 1 : also clean.
Ruled out independently: stack overflow (`ulimit -s unlimited` did not
help), AVX-512 (reproduces with `-Ctarget-cpu=x86-64-v3`), stale
`rust-cache`, and CPU vendor (crashes on both Intel CI runners and AMD
Zen 4). PR #756's default change is also ruled out as the trigger: the
risc0 workflow has been failing on every push to main since Apr 9 with
this exact stack.
Keeps size-optimization focus (still "s", not "1"/"2") while avoiding
the aggressive inlining / machine-outliner passes that expose the issue.
Root cause in leanMultisig still needs upstream investigation; #734
remains open.
Summary
Cap the default Rust `target-cpu` at `x86-64-v3` (AVX2, no AVX-512) on x86_64 instead of `native`, so `zig build` produces portable binaries across all x86_64 Linux hosts by default. Users who want machine-specific codegen can still opt in via `-Drust-target-cpu=native` (or `x86-64-v4` for AVX-512).
Why
`-Ctarget-cpu=native` on AVX-512-capable build hosts has been a persistent source of hard-to-diagnose failures:
- Runtime GPFs / SIGILLs for users whose CPU supports a different AVX-512 subset than the builder's (LLVM codegen bugs in deeper dependencies, inline-asm clobber-list issues, kernel/microcode XSAVE quirks).
- Specifically, today the CI job `test (ubuntu-latest)` on PR #750 (ci(hive): retry on transient docker-build failures) crashed `rustc` itself with `SIGILL: illegal instruction` inside the `thiserror_impl` proc-macro `.so`.
Root cause: `Swatinem/rust-cache` restored a proc-macro `.so` compiled with `-Ctarget-cpu=native` on a previous runner that had AVX-512, then today's runner lacked those opcodes. All 3 retry attempts hit the same poisoned cache.
Capping the default at AVX2 makes the output safe to share across any modern x86_64 machine and eliminates this entire class of flake. As a side effect, flipping the default in `RUSTFLAGS` bumps the `rust-cache` key (it hashes `RUSTFLAGS`), which also invalidates the currently-poisoned cache entry on first CI run.
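The cache-invalidation side effect follows from how a flag-sensitive cache key behaves. The sketch below is a toy model in Rust and not `Swatinem/rust-cache`'s actual key derivation; `cache_key`, the `rust-cache-` prefix, and the toolchain string used in the usage note are hypothetical. It only shows that any key which hashes `RUSTFLAGS` changes when the default flag changes.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy cache key: hashes the toolchain identifier and the RUSTFLAGS value.
/// Any change to either input produces a different key, so entries built
/// under the old flags are no longer restored.
fn cache_key(toolchain: &str, rustflags: &str) -> String {
    let mut hasher = DefaultHasher::new();
    toolchain.hash(&mut hasher);
    rustflags.hash(&mut hasher);
    format!("rust-cache-{:016x}", hasher.finish())
}
```

With a placeholder toolchain string such as `"stable"`, `cache_key("stable", "-Ctarget-cpu=native")` and `cache_key("stable", "-Ctarget-cpu=x86-64-v3")` differ, which is why the first CI run after the default flips starts from a fresh cache rather than the poisoned entry.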
Scope
Test plan