Accelerate `Half` with FP16 ISA by anthonycanino · Pull Request #122649 · dotnet/runtime

anthonycanino · 2025-12-18T20:45:27Z

Draft PR for in-progress work to accelerate System.Half with FP16 ISA.

Current work done:

Add a TYP_HALF to the .NET runtime, which is treated like a TYP_SIMDXX, but with some notable differences. Namely, a TYP_HALF is passed around via the xmm registers, and while it will pass a varTypeIsStruct test, it must be treated as a primitive in other places.
Accelerate System.Half operations with the TYP_HALF and some FP16 intrinsics. Not every possible function has been accelerated yet.

For discussion:

I have currently worked around some checks to make TYP_HALF behave like a struct and a primitive. It's very ad-hoc at the moment.
Much of the work to transform the named System.Half intrinsics into a sequence of intrinsic nodes is done in importercalls.cpp and might be better moved up into some of the gtNewSimdXX nodes.

anthonycanino · 2025-12-18T20:50:41Z

@tannergooding @jakobbotsch please take a look when you get a chance.

src/coreclr/jit/codegencommon.cpp

anthonycanino · 2026-01-06T12:47:42Z

@dotnet/intel @tannergooding may I get some high level feedback on the structure of the PR?

src/coreclr/jit/codegenxarch.cpp

src/coreclr/jit/compiler.cpp

src/coreclr/jit/emitxarch.cpp

src/coreclr/jit/gentree.cpp

src/coreclr/jit/gentree.h

src/coreclr/jit/hwintrinsiccodegenxarch.cpp

src/coreclr/jit/importer.cpp

src/coreclr/jit/importercalls.cpp

src/coreclr/jit/instr.cpp

src/coreclr/jit/lower.cpp

src/coreclr/jit/lsrabuild.cpp

tannergooding · 2026-01-06T17:24:59Z

src/coreclr/jit/lsrabuild.cpp

                     // We ONLY want the valid double register in the RBM_DOUBLERET mask.
 #ifdef TARGET_AMD64
                     useCandidates = (RBM_DOUBLERET & RBM_ALLDOUBLE).GetFloatRegSet();
 #else
                     useCandidates = (RBM_DOUBLERET & RBM_ALLDOUBLE).GetFloatRegSet();
 #endif // TARGET_AMD64


Not related to this PR, but both branches of this #ifdef are identical.

Copilot

Pull request overview

Copilot reviewed 46 out of 47 changed files in this pull request and generated 2 comments.

Copilot · 2026-03-03T19:20:59Z

src/libraries/System.Private.CoreLib/src/System/Half.cs

 using System.Numerics;
 using System.Runtime.CompilerServices;
 using System.Runtime.InteropServices;
+using System.Runtime.Intrinsics;



using System.Runtime.Intrinsics; appears to be unused in this file (the [Intrinsic] attribute is defined under System.Runtime.CompilerServices). If unused-usings warnings are treated as errors for CoreLib, this will break the build. Remove the using or reference a type from that namespace if it’s required.

Copilot · 2026-03-03T19:21:00Z

src/coreclr/jit/importercalls.cpp

+            case NI_System_Half_FusedMultiplyAdd:
+            {
+#if defined(TARGET_XARCH)
+                if (compOpportunisticallyDependsOn(InstructionSet_AVX10v1))
+                {
+                    // We are constructing a chain of intrinsics similar to:
+                    //    return FMA.MultiplyAddScalar(
+                    //        Vector128.CreateScalarUnsafe(x),
+                    //        Vector128.CreateScalarUnsafe(y),
+                    //        Vector128.CreateScalarUnsafe(z)
+                    //    ).ToScalar();
+


This PR introduces new JIT named intrinsics for System.Half (including lowering to AVX10v1 scalar FP16 instructions) but there don’t appear to be any JIT/HardwareIntrinsics tests exercising Half + AVX10v1 codegen paths. Adding targeted tests under src/tests/JIT/HardwareIntrinsics/X86_Avx10v1 (or an equivalent location) would help catch regressions in recognition, codegen, and calling convention handling.

Our intent is to have these tests as the APIs are directly implemented.

anthonycanino · 2026-03-03T19:24:10Z

Our new repository configuration requires all comments to be resolved before a PR can merge. There are a lot of old unresolved comments on this PR, can you go through and resolve everything you think has been dealt with?

I've gone through and resolved most of the old comments. There are a few of yours with responses from me I think you should check before resolution.

src/coreclr/jit/emitxarch.cpp

src/coreclr/jit/float16.h

kg · 2026-03-03T20:06:23Z

src/coreclr/jit/hwintrinsic.h

+#if defined(TARGET_XARCH)
+    uint16_t ins[11]; // 11 * 2-bytes
+#else
+    uint16_t ins[10]; // 10 * 2-bytes
+#endif


This is a question that you don't necessarily have to answer. Why make it 22 bytes on xarch and 20 on other arches, instead of 22 bytes on every arch for simplicity? Are there already 20-byte insns on, e.g., arm32, riscv, loongarch, or arm64? If we're defining this based on how big insns are on a given architecture, we could potentially make ins[] much smaller and reduce memory usage and/or file size a bit.

But I don't think this needs to change, it just came to mind when reading the diff.

kg · 2026-03-03T20:08:58Z

src/coreclr/jit/importercalls.cpp

    if (simdReturnType != call->TypeGet())
    {
-        assert(varTypeIsSIMD(simdReturnType));
+        assert(varTypeIsAccelerated(simdReturnType));


The name varTypeIsAccelerated feels ambiguous in a way that varTypeIsSIMD was not. "Accelerated" makes me think the target I'm jitting for has native instructions for it, while "SIMD" makes me think it's a vector type. I would expect "accelerated" to be false in some cases where "SIMD" is true, e.g. when dealing with a Vector512 on an arch that only does 128 bits, such as wasm. When I look at this I immediately wonder whether accelerated means "actually hardware accelerated", "type that might be hardware accelerated", or "acceleratable".

We could use something like varTypeIsStructPrimitive, since that's really what this is.

That is, we have the regular built-in primitives and then the struct primitives that map to ABI concepts beyond the built-in ones.
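The naming idea in this thread can be illustrated with a minimal sketch. It assumes a simplified, hypothetical `VarType` enum rather than the JIT's real `var_types`, and `varTypeIsStructPrimitive` here is only the name proposed above, not an existing JIT helper.

```cpp
#include <cstdint>

// Hypothetical, simplified stand-in for the JIT's var_types (assumption, not the real enum).
enum class VarType : uint8_t
{
    Int,
    Double,
    Struct, // ordinary user-defined value type
    Half,   // 2-byte struct passed in a floating-point register
    Simd8,
    Simd16,
    Simd32,
    Simd64
};

// "SIMD" answers one question only: is this a vector type?
constexpr bool varTypeIsSIMD(VarType t)
{
    return (t >= VarType::Simd8) && (t <= VarType::Simd64);
}

// "Struct primitive" covers struct-shaped types that map to ABI concepts beyond the
// built-in primitives: the vector types plus Half. It says nothing about whether the
// current target has native instructions for them.
constexpr bool varTypeIsStructPrimitive(VarType t)
{
    return (t == VarType::Half) || varTypeIsSIMD(t);
}

static_assert(varTypeIsStructPrimitive(VarType::Half), "Half is a struct primitive");
static_assert(!varTypeIsSIMD(VarType::Half), "Half is not a vector type");
static_assert(!varTypeIsStructPrimitive(VarType::Struct), "plain structs are excluded");
```

Under this naming, a separate query would still be needed to ask whether the target actually accelerates a given type.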

src/coreclr/jit/importercalls.cpp

kg · 2026-03-03T20:33:15Z

src/coreclr/jit/importercalls.cpp

+int Compiler::lookupHalfRoundingMode(NamedIntrinsic ni)
+{
+    switch (ni)
+    {
+        case NI_System_Half_Round:
+            return 0; // Round to nearest
+        case NI_System_Half_Ceiling:
+            return 1; // Round towards +infinity
+        case NI_System_Half_Floor:
+            return 2; // Round towards -infinity
+        case NI_System_Half_Truncate:
+            return 3; // Round towards zero
+        default:
+            noway_assert(!"Should have one of the above Half intrinsics");
+            return -1;
+    }
+}
+


Should this be xarch ifdef'd? Though I assume if all its usages are ifdef'd it'll get pruned by the linker anyway

kg · 2026-03-03T20:53:57Z

Went over everything, deferring the greencheck to tanner. Thanks for your hard work :)

Copilot

Pull request overview

Copilot reviewed 46 out of 47 changed files in this pull request and generated no new comments.

src/coreclr/jit/importercalls.cpp

src/coreclr/jit/hwintrinsic.h

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 46 out of 47 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (6)

src/coreclr/jit/utils.cpp:1

The “jam” bit computation in shiftRightJam is incorrect due to operator precedence/parenthesization: it currently shifts by either 0 or 1, rather than testing whether any bits were shifted out. This will produce incorrect rounding for many values in convertDoubleToFloat16. Consider rewriting the expression to explicitly compute ((l << ((-dist) & 63)) != 0) (or equivalent) and OR in 1 when any discarded bits are non-zero.
src/coreclr/jit/utils.cpp:1
HALF_POSITIVE_INFINITY_BITS/HALF_NEGATIVE_INFINITY_BITS are declared as uint64_t but are used as 16-bit Half bit patterns and returned from a float16_t (uint16_t) function. This implicit narrowing is easy to miss and makes the code harder to reason about (and may trigger warnings under stricter builds). Prefer declaring these constants as uint16_t (or float16_t) to match semantics and avoid silent truncation.
src/coreclr/jit/vartype.h:1
This branch is used for non-x86 targets (per the surrounding #if/#else and the comment stating “Other targets pass them as regular structs”), but it unconditionally treats TYP_HALF as using float argument registers. If TYP_HALF can ever appear on non-xarch targets (including during cross-target JIT builds or shared utilities), this risks ABI/calling convention mismatches. Consider guarding the TYP_HALF clause behind TARGET_XARCH (or otherwise ensuring TYP_HALF cannot reach this path).
src/libraries/System.Private.CoreLib/src/System/Half.cs:1
Only Asin(Half) was changed to use AggressiveInlining, while nearby Half math wrappers are being marked with [Intrinsic] (or left unchanged). This inconsistency makes it unclear whether the intent is “intrinsic expansion” or “inline the wrapper”. Consider either removing this attribute (if incidental) or aligning it with the other accelerated Half APIs (e.g., mark as [Intrinsic] if it’s intended to be lowered).
src/coreclr/vm/reflectioninvocation.cpp:1
Minor: there’s trailing whitespace on the #if line. Also, many other hunks use TARGET_XARCH for this combined case; using a consistent macro improves readability and reduces duplication.
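The jam-bit fix described in the first suppressed comment can be sketched as follows. This is a standalone illustration of the usual "shift right and jam" technique, not the code in utils.cpp; the name `shiftRightJam64` and the explicit handling of a zero or oversized shift distance are assumptions made for the example.

```cpp
#include <cstdint>

// Shifts `value` right by `dist` bits and ORs 1 into the least significant bit of the
// result if any of the discarded bits were non-zero (the "sticky" or "jam" bit).
// Rounding code later uses that bit to tell an exact result from an inexact one.
constexpr uint64_t shiftRightJam64(uint64_t value, unsigned dist)
{
    return (dist == 0) ? value // nothing is shifted out, so nothing to jam
         : (dist < 64)
               // (value << (64 - dist)) keeps exactly the low `dist` bits that the
               // right shift discards; test them as a whole instead of shifting by 0 or 1.
               ? ((value >> dist) | static_cast<uint64_t>((value << (64 - dist)) != 0))
               // Every bit is shifted out: the result is just the jam bit.
               : static_cast<uint64_t>(value != 0);
}

static_assert(shiftRightJam64(0x10, 3) == 0x2, "no discarded bits set, no jam");
static_assert(shiftRightJam64(0x11, 3) == 0x3, "a discarded bit was set, jam bit ORed in");
```

The key point is that the comparison against zero is applied to the whole shifted-out field, which is what the parenthesization in the reviewed expression failed to do.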

Copilot · 2026-03-05T15:34:45Z

src/coreclr/jit/compiler.h

+        if (structSize == 2)
+            return true;
+
 #ifdef FEATURE_SIMD
        return (structSize >= getMinVectorByteLength()) && (structSize <= getMaxVectorByteLength());


Returning true for all 2-byte structs makes many unrelated valuetypes (e.g., arbitrary 2-byte user structs) go down “accelerated type” code paths, frequently triggering additional EE queries (e.g., impNormStructType, VN field selector normalization). Since Half acceleration here is gated to xarch+AVX10v1 and intrinsic types, consider tightening this predicate (e.g., only treat size==2 as accelerated when building for xarch and when Half acceleration is enabled), or splitting into a separate Half-specific fast-path to avoid broad 2-byte struct churn.

Suggested change

-        if (structSize == 2)
-            return true;
-#ifdef FEATURE_SIMD
-        return (structSize >= getMinVectorByteLength()) && (structSize <= getMaxVectorByteLength());
+#if defined(TARGET_XARCH) && defined(FEATURE_SIMD)
+        // On xarch with AVX10v1 support, 2-byte structs may represent accelerated Half types.
+        // Restrict the 2-byte fast-path to this configuration to avoid sending arbitrary 2-byte
+        // user structs down accelerated-type code paths on other targets.
+        if ((structSize == 2) && compSupports(InstructionSet_AVX10v1))
+        {
+            return true;
+        }
+        return (structSize >= getMinVectorByteLength()) && (structSize <= getMaxVectorByteLength());
+#elif defined(FEATURE_SIMD)
+        return (structSize >= getMinVectorByteLength()) && (structSize <= getMaxVectorByteLength());

Copilot · 2026-03-05T15:34:46Z

src/coreclr/jit/simd.h

+#ifndef SIZE_UNKNOWN
+#define SIZE_UNKNOWN UINT8_MAX
+#endif


Defining a global macro like SIZE_UNKNOWN in a widely included header increases the risk of collisions and makes it harder to track the canonical meaning/value across the JIT. If possible, prefer a scoped constant (e.g., static constexpr unsigned) in an appropriate shared header/namespace, or include the header that already owns this sentinel to avoid multiple “fallback” definitions.

Suggested change

-#ifndef SIZE_UNKNOWN
-#define SIZE_UNKNOWN UINT8_MAX
-#endif
+static constexpr unsigned SIZE_UNKNOWN = UINT8_MAX;

github-actions bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) Dec 18, 2025

dotnet-policy-service bot added the community-contribution label (Indicates that the PR has been added by a community member) Dec 18, 2025

jakobbotsch reviewed Jan 5, 2026

src/coreclr/jit/codegencommon.cpp

anthonycanino force-pushed the half-xmm-struct-abi branch from 3b8abaa to f633726 Compare January 5, 2026 19:52
