feat: metal backend by dcvz · Pull Request #1175 · tracel-ai/cubecl

dcvz · 2026-02-04T23:08:50Z

No description provided.

Copilot

Pull request overview

Adds a native Metal backend to CubeCL and wires it into the workspace’s build/test/doc tooling and codegen so Metal can be selected as a runtime and supported in CI workflows.

Changes:

- Introduces a new `cubecl-metal` crate implementing a Metal runtime, server, stream, memory, and storage stack.
- Exposes the Metal runtime through the `cubecl` feature flag `metal` and the `TestRuntime` cfg plumbing.
- Updates xtask and the CI workflow to support `doc --ci` and to exclude the Metal, CUDA, and HIP crates in CI.

Reviewed changes

Copilot reviewed 32 out of 32 changed files in this pull request and generated 4 comments.


| File | Description |
| --- | --- |
| `xtask/src/main.rs` | Adds `doc` subcommand wiring to the xtask CLI. |
| `xtask/src/commands/mod.rs` | Registers new `doc` module. |
| `xtask/src/commands/doc.rs` | Adds `--ci` handling for docs, excluding unsupported crates. |
| `xtask/src/commands/validate.rs` | Routes doc validation through the new CubeCL doc command args. |
| `xtask/src/commands/build.rs` | Excludes `cubecl-metal` (and CUDA/HIP) in CI builds. |
| `xtask/src/commands/test.rs` | Excludes `cubecl-metal` (and CUDA/HIP) in CI tests. |
| `xtask/src/commands/check.rs` | Adds CI-specific workspace clippy that excludes platform-specific crates. |
| `.github/workflows/ci.yml` | Switches CI invocations to use `--ci` for `check` and `doc`. |
| `crates/cubecl/src/lib.rs` | Re-exports `cubecl_metal` behind `feature = "metal"` and adds `test_runtime_metal`. |
| `crates/cubecl/Cargo.toml` | Re-points `metal` feature to `cubecl-metal` and adds the dependency. |
| `crates/cubecl/build.rs` | Adds `test_runtime_metal` check-cfg and feature wiring. |
| `crates/cubecl-metal/*` | New Metal runtime implementation (device selection, compilation, server, stream/event sync, storage). |
| `crates/cubecl-cpp/src/shared/variable.rs` | Updates pointer/atomic formatting to include Metal address spaces. |
| `crates/cubecl-cpp/src/shared/unary.rs` | Adjusts unary function formatting for BF16 under Metal math constraints. |
| `crates/cubecl-cpp/src/shared/instruction.rs` | Casts float literals to support BF16 paths. |
| `crates/cubecl-cpp/src/shared/base.rs` | Ensures Metal extensions get registered for `hypot/rhypot`. |
| `crates/cubecl-cpp/src/metal/extension.rs` | Adds `hypot/rhypot` extensions and improves BF16-safe casts. |
| `crates/cubecl-cpp/src/metal/dialect.rs` | Emits `hypot/rhypot` calls and fixes atomic compare-exchange emission for MSL semantics. |
| `crates/cubecl-cpp/src/metal/arch.rs` | Adjusts Metal “warp” size to 32. |
| `crates/cubecl-cpp/src/metal/address_space.rs` | Ensures atomic bindings/pointers use `device` address space in MSL. |


Copilot · 2026-04-08T21:47:20Z

+        _type_id: u16,
+        _info: &<Self::Server as cubecl_core::server::ComputeServer>::Info,
+    ) -> Vec<DeviceId> {
+        let devices = crate::device::all_devices();
+        (0..devices.len())
+            .map(|i| DeviceId {
+                type_id: 0,
+                index_id: i as u32,
+            })
+            .collect()


enumerate_devices ignores the type_id filter and always returns DeviceId { type_id: 0, ... }. With the current MetalDevice::from_id mapping, this means every enumerated device resolves to DefaultDevice, and callers can’t enumerate/select discrete vs integrated devices correctly. Consider implementing type_id filtering similar to cubecl-wgpu (e.g., return only DefaultDevice for type 0, discrete list for type 1, integrated list for type 2, etc.), and ensure returned DeviceId.type_id matches the MetalDevice variants you support.

Suggested change

-        _type_id: u16,
-        _info: &<Self::Server as cubecl_core::server::ComputeServer>::Info,
-    ) -> Vec<DeviceId> {
-        let devices = crate::device::all_devices();
-        (0..devices.len())
-            .map(|i| DeviceId {
-                type_id: 0,
-                index_id: i as u32,
-            })
-            .collect()
+        type_id: u16,
+        _info: &<Self::Server as cubecl_core::server::ComputeServer>::Info,
+    ) -> Vec<DeviceId> {
+        let devices = crate::device::all_devices();
+        match type_id {
+            0 => vec![DeviceId {
+                type_id: 0,
+                index_id: 0,
+            }],
+            1 => devices
+                .iter()
+                .enumerate()
+                .filter_map(|(i, device)| {
+                    (!device.is_low_power()).then_some(DeviceId {
+                        type_id: 1,
+                        index_id: i as u32,
+                    })
+                })
+                .collect(),
+            2 => devices
+                .iter()
+                .enumerate()
+                .filter_map(|(i, device)| {
+                    device.is_low_power().then_some(DeviceId {
+                        type_id: 2,
+                        index_id: i as u32,
+                    })
+                })
+                .collect(),
+            _ => Vec::new(),
+        }
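The filtering idea in this comment can be tried in isolation. The sketch below is a self-contained illustration, not the `cubecl-metal` code: `DeviceId` and `FakeDevice` are hypothetical stand-ins, and the mapping of type ids (0 = default, 1 = discrete, 2 = integrated) is assumed from the comment above rather than taken from the crate.

```rust
/// Hypothetical stand-in for the device id type; not the real cubecl type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct DeviceId {
    type_id: u16,
    index_id: u32,
}

/// Hypothetical device description carrying only a low-power flag.
struct FakeDevice {
    low_power: bool,
}

/// Returns only the ids that match `type_id`.
/// Assumed mapping: 0 = default, 1 = discrete, 2 = integrated (low power).
fn enumerate_devices(type_id: u16, devices: &[FakeDevice]) -> Vec<DeviceId> {
    match type_id {
        // The default device is always reported as a single entry.
        0 => vec![DeviceId { type_id: 0, index_id: 0 }],
        1 | 2 => {
            let want_low_power = type_id == 2;
            devices
                .iter()
                .enumerate()
                .filter(|(_, device)| device.low_power == want_low_power)
                // index_id is the position in the full device list.
                .map(|(i, _)| DeviceId { type_id, index_id: i as u32 })
                .collect()
        }
        // Unknown type ids yield no devices.
        _ => Vec::new(),
    }
}

fn main() {
    let devices = [
        FakeDevice { low_power: false },
        FakeDevice { low_power: true },
        FakeDevice { low_power: false },
    ];
    println!("discrete: {:?}", enumerate_devices(1, &devices));
    println!("integrated: {:?}", enumerate_devices(2, &devices));
}
```

Note that `index_id` here indexes the full device list, as in the suggestion, so whatever maps ids back to devices would need to use the same indexing.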

Copilot · 2026-04-08T21:47:20Z

+        let dispatch_info = match count {
+            CubeCount::Static(x, y, z) => DispatchInfo::Static(x, y, z),
+            CubeCount::Dynamic(binding) => DispatchInfo::Dynamic(binding),
+        };
+
+        let mut resolved = match self
+            .streams
+            .resolve(stream_id, bindings.buffers.iter(), false)
+        {
+            Ok(r) => r,
+            Err(_) => return,
+        };
+


MultiStream::resolve is called only with bindings.buffers.iter(). When count is CubeCount::Dynamic, the indirect-dispatch Binding used for dispatchThreadgroupsWithIndirectBuffer... is not included in the resolve handles, so cross-stream synchronization for that buffer can be skipped. Consider including the dynamic dispatch Binding in the iterator passed to resolve (e.g., chain the buffers iterator with std::iter::once(binding) when CubeCount::Dynamic).

Suggested change

-        let dispatch_info = match count {
-            CubeCount::Static(x, y, z) => DispatchInfo::Static(x, y, z),
-            CubeCount::Dynamic(binding) => DispatchInfo::Dynamic(binding),
-        };
-
-        let mut resolved = match self
-            .streams
-            .resolve(stream_id, bindings.buffers.iter(), false)
-        {
-            Ok(r) => r,
-            Err(_) => return,
-        };
+        let mut resolved = match &count {
+            CubeCount::Dynamic(binding) => self.streams.resolve(
+                stream_id,
+                bindings
+                    .buffers
+                    .iter()
+                    .chain(std::iter::once(binding)),
+                false,
+            ),
+            CubeCount::Static(_, _, _) => {
+                self.streams.resolve(stream_id, bindings.buffers.iter(), false)
+            }
+        } {
+            Ok(r) => r,
+            Err(_) => return,
+        };
+
+        let dispatch_info = match count {
+            CubeCount::Static(x, y, z) => DispatchInfo::Static(x, y, z),
+            CubeCount::Dynamic(binding) => DispatchInfo::Dynamic(binding),
+        };
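One way to avoid two separate `resolve` call sites is to chain an `Option` onto the buffer iterator, since `Option<&T>` is itself iterable. This is a minimal sketch under assumed types: `Binding`, `CubeCount`, and `handles_to_resolve` below are hypothetical stand-ins, and the real `MultiStream::resolve` signature is not reproduced here.

```rust
/// Hypothetical stand-in for a buffer binding.
#[derive(Debug, PartialEq)]
struct Binding {
    id: u32,
}

/// Hypothetical stand-in for the launch size, static or read from a buffer.
#[allow(dead_code)]
enum CubeCount {
    Static(u32, u32, u32),
    Dynamic(Binding),
}

/// Collects every binding a launch touches, including the indirect-dispatch
/// buffer when the count is dynamic.
fn handles_to_resolve<'a>(
    buffers: &'a [Binding],
    count: &'a CubeCount,
) -> Vec<&'a Binding> {
    let indirect = match count {
        CubeCount::Dynamic(binding) => Some(binding),
        CubeCount::Static(..) => None,
    };
    // `Option<&Binding>` yields zero or one item, so both cases share one chain.
    buffers.iter().chain(indirect).collect()
}

fn main() {
    let buffers = [Binding { id: 1 }, Binding { id: 2 }];
    let fixed = CubeCount::Static(8, 1, 1);
    let indirect = CubeCount::Dynamic(Binding { id: 99 });

    let ids = |count: &CubeCount| -> Vec<u32> {
        handles_to_resolve(&buffers, count)
            .iter()
            .map(|binding| binding.id)
            .collect()
    };
    println!("static launch resolves: {:?}", ids(&fixed));
    println!("dynamic launch resolves: {:?}", ids(&indirect));
}
```

Both arms produce the same iterator type this way, which sidesteps the need to duplicate the `resolve` call per variant.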

Copilot · 2026-04-08T21:47:21Z

+    fn handle_cursor(_stream: &Self::Stream, handle: &Binding) -> u64 {
+        // Metal uses shared memory so cursor tracking is minimal
+        handle.size


handle_cursor currently returns handle.size, which is constant for the lifetime of an allocation and does not reflect when the binding was last written on a stream. This breaks MultiStream’s shared-binding synchronization logic (it relies on a monotonic cursor, like other backends’ memory_management.get_cursor(binding.memory)), and can lead to missing waits between streams. Consider returning the cursor from stream.memory_management.get_cursor(handle.memory.clone()) (and falling back safely if that errors).

Suggested change

-    fn handle_cursor(_stream: &Self::Stream, handle: &Binding) -> u64 {
-        // Metal uses shared memory so cursor tracking is minimal
-        handle.size
+    fn handle_cursor(stream: &Self::Stream, handle: &Binding) -> u64 {
+        stream
+            .memory_management
+            .get_cursor(handle.memory.clone())
+            .unwrap_or(handle.size)
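To see why a constant value cannot stand in for a cursor, consider a toy model in which a consumer waits only when the producer's cursor is newer than the point it last synchronized to. How `MultiStream` actually compares cursors is an assumption here; `WriteCursors` and `needs_wait` are hypothetical names used purely for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical per-stream tracker: each write bumps a monotonic counter.
#[derive(Default)]
struct WriteCursors {
    next: u64,
    last_write: HashMap<u32, u64>, // binding id -> cursor of its last write
}

impl WriteCursors {
    /// Records a write to `binding` and returns the new cursor value.
    fn record_write(&mut self, binding: u32) -> u64 {
        self.next += 1;
        self.last_write.insert(binding, self.next);
        self.next
    }
}

/// Assumed rule: wait only if the producer has written since the last sync.
fn needs_wait(synced_up_to: u64, producer_cursor: u64) -> bool {
    producer_cursor > synced_up_to
}

fn main() {
    let mut cursors = WriteCursors::default();
    let binding = 7;

    let first = cursors.record_write(binding);
    let second = cursors.record_write(binding);

    // A consumer that synced after the first write sees the second write.
    println!("monotonic cursor: wait = {}", needs_wait(first, second));

    // With the allocation size as the "cursor", the value never advances,
    // so the second write is invisible to the comparison.
    let size: u64 = 1024;
    println!("constant size: wait = {}", needs_wait(size, size));
}
```

Under this model the constant-size variant never reports a pending write, which is the missed-wait scenario the comment describes.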

Copilot · 2026-04-08T21:47:21Z

 cubecl-cuda = { path = "../cubecl-cuda", version = "=0.10.0-pre.3", default-features = false, optional = true }
 cubecl-hip = { path = "../cubecl-hip", version = "=0.10.0-pre.3", default-features = false, optional = true }
 cubecl-ir = { path = "../cubecl-ir", version = "=0.10.0-pre.3", default-features = false }
+cubecl-metal = { path = "../cubecl-metal", version = "=0.10.0-pre.3", optional = true }


cubecl-metal is added as an optional dependency without default-features = false, unlike the other runtime dependencies in this crate (cubecl-cuda, cubecl-hip, cubecl-wgpu, etc.). This can unintentionally pull in cubecl-metal default features when consumers enable cubecl/metal, which may be inconsistent with the crate’s feature strategy. Consider adding default-features = false for cubecl-metal for parity with the other runtime deps.

Suggested change

-cubecl-metal = { path = "../cubecl-metal", version = "=0.10.0-pre.3", optional = true }
+cubecl-metal = { path = "../cubecl-metal", version = "=0.10.0-pre.3", default-features = false, optional = true }

dcvz force-pushed the feat/native-metal-backend branch 7 times, most recently from 3bed613 to d379550 Compare February 5, 2026 21:33

dcvz marked this pull request as ready for review February 6, 2026 14:47

dcvz force-pushed the feat/native-metal-backend branch 3 times, most recently from e68df50 to 622c366 Compare March 16, 2026 18:02

dcvz mentioned this pull request Mar 16, 2026

add missing math + address space qualifiers #1162

Closed

4 tasks

AdrianEddy reviewed Mar 22, 2026


Comment thread crates/cubecl-metal/src/runtime.rs Outdated

dcvz added 17 commits April 8, 2026 21:45

Add native Metal backend

8ae36ef

Add bf16 support + math op fixes

4fbcfbe

Add memory-aware command buffer batching

2b2749b

Use async completion handler for buffer cleanup

166a6b2

Implement async read operations

f790637

Use MTLSharedEvent for synchronization

ca6cd9e

Use device-specific batch thresholds

722dd6b

Use concurrent encoder for batched kernel dispatch

d061989

Use serial dispatch for encoder

4bfde55

Disable bf16 tests

844e00d

Add Metal backend to cubecl crate

0670ade

Add NonUniformControlFlow plane feature for Metal

d6a7ec8

Add max_global_line_size for Metal

35a0965

Cleanup pass

b6152eb

Fix max_streams and wait_async

16849d0

Update README and Cargo.toml

a618ae2

Some cleanup

6f65603

dcvz added 11 commits April 8, 2026 21:49

Make Metal3 explicit requirement

1165d31

Disable metal crate on CI

fc72970

Add backpressure

b3c26d2

Free temporaries in completion handler

f7caa0a

Finish rebasing

9cfbca2

Rebase one more time

41176d0

Improvements

84e35da

Fixup fmt

072d64c

Fix up rebase

7f5f7f3

Check tensor layout

34d3bd0

Fix toml deps after rebase

4a63e18

dcvz force-pushed the feat/native-metal-backend branch from d212496 to 4a63e18 Compare April 8, 2026 20:18

antimora requested a review from Copilot April 8, 2026 21:39

Copilot started reviewing on behalf of antimora April 8, 2026 21:39

Copilot AI reviewed Apr 8, 2026

dcvz wants to merge 28 commits into tracel-ai:main from oxiglade:feat/native-metal-backend

3 participants
