
build: default rust-target-cpu to x86-64-v3 on x86_64 #756

Merged: ch4r10t33r merged 1 commit into main from fix/default-rust-target-cpu-to-x86-64-v3 on Apr 17, 2026

@ch4r10t33r

Summary

Cap the default Rust target-cpu at `x86-64-v3` (AVX2, no AVX-512) on x86_64 instead of `native`, so `zig build` produces portable binaries across all x86_64 Linux hosts by default. Users who want machine-specific codegen can still opt in via `-Drust-target-cpu=native` (or `x86-64-v4` for AVX-512).

Why

-Ctarget-cpu=native on AVX-512-capable build hosts has been a persistent source of hard-to-diagnose failures:

  • Runtime GPFs / SIGILLs for users whose CPU supports a different AVX-512 subset than the builder's (LLVM codegen bugs in deeper dependencies, inline-asm clobber-list issues, kernel/microcode XSAVE quirks).

  • Specifically, today's CI job `test (ubuntu-latest)` on PR #750 (ci(hive): retry on transient docker-build failures) crashed `rustc` itself with `SIGILL: illegal instruction` inside the `thiserror_impl` proc-macro `.so`:

    could not compile hashsig-glue (lib)
    Caused by: rustc ... -Ctarget-cpu=native ... (signal: 4, SIGILL: illegal instruction)
    

    Root cause: `Swatinem/rust-cache` restored a proc-macro `.so` that had been compiled with `-Ctarget-cpu=native` on an earlier runner with AVX-512, and today's runner lacked those opcodes. All 3 retry attempts hit the same poisoned cache.

Capping the default at AVX2 makes the output safe to share across any modern x86_64 machine and eliminates this entire class of flake. As a side effect, flipping the default in `RUSTFLAGS` bumps the `rust-cache` key (it hashes `RUSTFLAGS`), which also invalidates the currently-poisoned cache entry on first CI run.
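
To illustrate the cache-key side effect: any key derived from the `RUSTFLAGS` string changes when the flags change, so an entry stored under the old flags is no longer matched. The sketch below is not the actual key algorithm used by `Swatinem/rust-cache`; `cache_key` is a hypothetical helper that simply hashes the flag string with the standard library's `DefaultHasher`.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for a cache key derived from RUSTFLAGS.
/// Assumption: the real action hashes more inputs than this.
fn cache_key(rustflags: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    rustflags.hash(&mut hasher);
    hasher.finish()
}
```

Under this model, `cache_key("-Ctarget-cpu=native")` and `cache_key("-Ctarget-cpu=x86-64-v3")` differ, so the first CI run after the change starts from a cold cache instead of restoring the poisoned `.so`.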

Scope

  • aarch64 is unchanged (ring 0.17 fails compile-time feature assertions under `target-cpu=native` on aarch64-apple-darwin, so we were already not setting target-cpu there).
  • Opt-in escape hatch preserved: `-Drust-target-cpu=native` restores the old behaviour per-build.
  • This is a cherry-pick of commit 840bb36 from PR #743 (fix: x86_64 GPF by using C ABI for Zig FFI), extracted so it can merge independently of the broader libp2p-glue C-ABI work.
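
The scope above can be modelled roughly as follows. This is a sketch only: the real logic lives in `build.zig` (Zig), and the function names `rust_target_cpu` and `target_cpu_flag` are hypothetical. It also assumes that no target-cpu is set on non-x86_64 architectures even when an override is given, which the PR text does not spell out.

```rust
/// Hypothetical model of the target-cpu selection described in this PR.
/// `user_override` stands for the value of `-Drust-target-cpu=<value>`, if any.
fn rust_target_cpu(arch: &str, user_override: Option<&str>) -> Option<String> {
    match (arch, user_override) {
        // Explicit opt-in wins on x86_64, e.g. "native" or "x86-64-v4".
        ("x86_64", Some(cpu)) => Some(cpu.to_string()),
        // New default: AVX2 baseline, no AVX-512.
        ("x86_64", None) => Some("x86-64-v3".to_string()),
        // Assumption: other architectures (e.g. aarch64) get no target-cpu.
        _ => None,
    }
}

/// Renders the selection as a rustc codegen flag, or None when nothing is set.
fn target_cpu_flag(arch: &str, user_override: Option<&str>) -> Option<String> {
    rust_target_cpu(arch, user_override).map(|cpu| format!("-Ctarget-cpu={}", cpu))
}
```

For example, `target_cpu_flag("x86_64", None)` yields `-Ctarget-cpu=x86-64-v3`, while `target_cpu_flag("x86_64", Some("native"))` reproduces the old behaviour.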

Test plan

  • `zig fmt --check build.zig` passes locally.
  • CI (`test (ubuntu-latest)` in particular) compiles Rust crates without SIGILL and is reproducible across runners.
  • `zig build` local developer flow still works; `-Drust-target-cpu=native` still accepted.

Cap the default Rust target CPU at AVX2 instead of `native` so `zig build`
produces portable binaries on x86_64 Linux. `target-cpu=native` on AVX-512
capable hosts has been causing hard-to-diagnose runtime GPFs in the deeper
Rust dependency graph (LLVM codegen bugs, XSAVE/microcode quirks). Users
who want machine-specific performance can still opt in via
`-Drust-target-cpu=native` (or `x86-64-v4` for AVX-512).

@zclawz zclawz left a comment


LGTM, the right conclusion to the AVX-512 saga (#729, #735, #737, #743).

What this does

Flips the default for `-Drust-target-cpu` from `native` to `x86-64-v3`. Previously `native` was the fallback for local and CI builds; this PR makes `x86-64-v3` (AVX2, no AVX-512) the universal default, with opt-in escape hatches documented in the help string.

Why this is correct

The recurring CI SIGILL/rustc-crash pattern is exactly the symptom described in the comment: LLVM codegen emitting AVX-512 instructions that then fault on the kernel's XSAVE/XRSTOR path or on VMs that advertise AVX-512 but don't implement all variants cleanly. GitHub's ubuntu-latest runners are known to vary in AVX-512 support across the fleet — capping at AVX2 eliminates the whole class of runner-lottery failures.
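
One way to see the runner lottery for yourself is to query the host CPU at runtime. The sketch below uses the standard library's `is_x86_feature_detected!` macro; the helper names are hypothetical, and the first function checks only a subset of the x86-64-v3 feature list (AVX2, FMA, BMI1, BMI2, LZCNT), not the full level definition.

```rust
/// True when the host reports the x86-64-v3 features checked here.
/// Note: this is a subset of the v3 level, chosen for illustration.
#[cfg(target_arch = "x86_64")]
fn host_has_v3_subset() -> bool {
    std::arch::is_x86_feature_detected!("avx2")
        && std::arch::is_x86_feature_detected!("fma")
        && std::arch::is_x86_feature_detected!("bmi1")
        && std::arch::is_x86_feature_detected!("bmi2")
        && std::arch::is_x86_feature_detected!("lzcnt")
}

/// True when the host reports the AVX-512 Foundation feature.
#[cfg(target_arch = "x86_64")]
fn host_has_avx512f() -> bool {
    std::arch::is_x86_feature_detected!("avx512f")
}

// Fallbacks so the sketch also compiles on non-x86_64 hosts.
#[cfg(not(target_arch = "x86_64"))]
fn host_has_v3_subset() -> bool {
    false
}

#[cfg(not(target_arch = "x86_64"))]
fn host_has_avx512f() -> bool {
    false
}
```

A binary built with `-Ctarget-cpu=native` on a host where `host_has_avx512f()` is true may contain AVX-512 instructions, and running it on a host where it is false can then fail with SIGILL. A build capped at `x86-64-v3` does not depend on that answer.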

Interaction with existing PRs

The Docker workflow (#729/#735) and auto-release CI already explicitly pass -Drust-target-cpu=x86-64-v3, so those paths are unaffected. This PR closes the gap for local builds and any CI job that didn't previously pass the flag — which is where the flaky SIGILL was originating.

Anything missed?

No. The aarch64 guard (ring 0.17 skip) is untouched. The comment is accurate and well-scoped. Approved. 🚢

@ch4r10t33r ch4r10t33r merged commit a40714a into main Apr 17, 2026
13 checks passed
@ch4r10t33r ch4r10t33r deleted the fix/default-rust-target-cpu-to-x86-64-v3 branch April 17, 2026 15:39
ch4r10t33r added a commit that referenced this pull request Apr 17, 2026
The `risc0-release` and `openvm-release` profiles used `opt-level = "z"`
for size reduction. On x86_64 Linux this triggers a codegen interaction
in leanMultisig's prover (`rec_aggregation` / `lean_prover` / `backend`)
that produces a runtime General Protection Exception inside
`xmss_aggregate` on the first aggregation call from `genMockChain`.

Bisection against `zig build run -Dprover=risc0 -- prove -z risc0` on
an AMD EPYC Genoa guest (Linux 6.x, x86-64-v3 rustflags, fresh rebuild):

- opt-level = z : crashes at `pkgs/xmss/src/aggregation.zig:139` (first
                  xmss_aggregate call), identical stack to CI runs.
- opt-level = s : completes all 5 mock blocks; libmultisig_glue.a = 63 MB.
- opt-level = 1 : also clean.

Ruled out independently: stack overflow (`ulimit -s unlimited` did not
help), AVX-512 (reproduces with `-Ctarget-cpu=x86-64-v3`), stale
`rust-cache`, and CPU vendor (crashes on both Intel CI runners and AMD
Zen 4). Also ruled out as coincidence with PR #756's default change:
the risc0 workflow has been failing on every push to main since Apr 9
with this exact stack.

Keeps size-optimization focus (still "s", not "1"/"2") while avoiding
the aggressive inlining / machine-outliner passes that expose the issue.
Root cause in leanMultisig still needs upstream investigation; #734
remains open.
gballet pushed a commit that referenced this pull request Apr 17, 2026
#759)