Accelerate Half with FP16 ISA#122649

Open
anthonycanino wants to merge 5 commits intodotnet:mainfrom
anthonycanino:half-xmm-struct-abi

@anthonycanino
Contributor

Draft PR for in-progress work to accelerate System.Half with FP16 ISA.

Current work done:

  1. Add a TYP_HALF to the .NET runtime, which is treated like a TYP_SIMDXX, but with some notable differences. Namely, a TYP_HALF is passed around via the xmm registers, and while it will pass a varTypeIsStruct test, it must be treated as a primitive in other places.

  2. Accelerate System.Half operations with the TYP_HALF and some FP16 intrinsics. Not every possible function has been accelerated yet.

For discussion:

  1. I have currently worked around some checks to make TYP_HALF behave like a struct and a primitive. It's very ad-hoc at the moment.

  2. Much of the work to transform the named System.Half intrinsics into a sequence of intrinsic nodes is done in importcall.cpp and might want to be moved up into some of the gtNewSimdXX nodes.

@github-actions github-actions bot added the area-CodeGen-coreclr label Dec 18, 2025
@anthonycanino
Contributor Author

@tannergooding @jakobbotsch please take a look when you get a chance.

@anthonycanino
Contributor Author

@dotnet/intel @tannergooding may I get some high level feedback on the structure of the PR?

Comment on lines 4402 to 4407
// We ONLY want the valid double register in the RBM_DOUBLERET mask.
#ifdef TARGET_AMD64
useCandidates = (RBM_DOUBLERET & RBM_ALLDOUBLE).GetFloatRegSet();
#else
useCandidates = (RBM_DOUBLERET & RBM_ALLDOUBLE).GetFloatRegSet();
#endif // TARGET_AMD64
Member

not related to this PR, but these two paths are the same

Copilot AI review requested due to automatic review settings March 3, 2026 19:13
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 46 out of 47 changed files in this pull request and generated 2 comments.

Comment on lines 8 to 12
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

Copilot AI Mar 3, 2026

using System.Runtime.Intrinsics; appears to be unused in this file (the [Intrinsic] attribute is defined under System.Runtime.CompilerServices). If unused-usings warnings are treated as errors for CoreLib, this will break the build. Remove the using or reference a type from that namespace if it’s required.

Comment on lines +4429 to +4440
case NI_System_Half_FusedMultiplyAdd:
{
#if defined(TARGET_XARCH)
if (compOpportunisticallyDependsOn(InstructionSet_AVX10v1))
{
// We are constructing a chain of intrinsics similar to:
// return FMA.MultiplyAddScalar(
// Vector128.CreateScalarUnsafe(x),
// Vector128.CreateScalarUnsafe(y),
// Vector128.CreateScalarUnsafe(z)
// ).ToScalar();

Copilot AI Mar 3, 2026

This PR introduces new JIT named intrinsics for System.Half (including lowering to AVX10v1 scalar FP16 instructions) but there don’t appear to be any JIT/HardwareIntrinsics tests exercising Half + AVX10v1 codegen paths. Adding targeted tests under src/tests/JIT/HardwareIntrinsics/X86_Avx10v1 (or an equivalent location) would help catch regressions in recognition, codegen, and calling convention handling.

Contributor Author

Our intent is to have these tests as the APIs are directly implemented.

@anthonycanino
Contributor Author

> Our new repository configuration requires all comments to be resolved before a PR can merge. There are a lot of old unresolved comments on this PR, can you go through and resolve everything you think has been dealt with?

I've gone through and resolved most of the old comments. There are a few of yours with responses from me I think you should check before resolution.

Comment on lines +520 to +524
#if defined(TARGET_XARCH)
uint16_t ins[11]; // 11 * 2-bytes
#else
uint16_t ins[10]; // 10 * 2-bytes
#endif
Member

This is a question that you don't necessarily have to answer. Why make it 22 bytes on xarch and 20 on other arches, instead of 22 bytes on every arch for simplicity? Are there already 20-byte insns on, e.g., arm32, riscv, loongarch, arm64? If we're defining this based on how big insns are on a given architecture, we could potentially make ins[] much smaller and reduce memory usage and/or file size a bit.

But I don't think this needs to change, it just came to mind when reading the diff.

if (simdReturnType != call->TypeGet())
{
assert(varTypeIsSIMD(simdReturnType));
assert(varTypeIsAccelerated(simdReturnType));
Member

The name varTypeIsAccelerated feels ambiguous in a way that varTypeIsSIMD was not. "Accelerated" makes me think the target I'm jitting for has native instructions for it, while "SIMD" makes me think it's a vector type. I would expect "accelerated" to be false in some cases where "SIMD" is true, e.g. when dealing with a Vector512 on an arch that only does 128, such as wasm. When I look at this I immediately wonder whether accelerated means "actually hardware accelerated", or just "type that might be hardware accelerated", or "acceleratable".

Member

We could use something like varTypeIsStructPrimitive, since that's really what this is.

That is, we have the regular built-in primitives and then the struct primitives that map to ABI concepts beyond the built-in ones.

Comment on lines +12494 to +12511
int Compiler::lookupHalfRoundingMode(NamedIntrinsic ni)
{
    switch (ni)
    {
        case NI_System_Half_Round:
            return 0; // Round to nearest
        case NI_System_Half_Ceiling:
            return 1; // Round towards +infinity
        case NI_System_Half_Floor:
            return 2; // Round towards -infinity
        case NI_System_Half_Truncate:
            return 3; // Round towards zero
        default:
            noway_assert(!"Should have one of the above Half intrinsics");
            return -1;
    }
}

Member

Should this be xarch ifdef'd? Though I assume if all its usages are ifdef'd it'll get pruned by the linker anyway
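
To make the question above concrete, here is a minimal standalone sketch of what guarding the helper with the target macro could look like. `HalfIntrinsic` and `LookupHalfRoundingMode` are hypothetical stand-ins for the JIT's `NamedIntrinsic` and the PR's function, and `TARGET_XARCH` is defined by hand only so the sketch compiles on its own. The return values are copied from the snippet under review; how the PR consumes them is not shown here.

```cpp
#include <cassert>

// Assumption for this sketch only: pretend the build targets x86/x64.
#define TARGET_XARCH 1

// Hypothetical stand-in for the JIT's NamedIntrinsic enum.
enum class HalfIntrinsic
{
    Round,
    Ceiling,
    Floor,
    Truncate,
    Sqrt, // an intrinsic with no associated rounding mode
};

#if defined(TARGET_XARCH)
// Compiled only for xarch, so non-xarch builds never see the definition
// and do not rely on the linker to discard it.
inline int LookupHalfRoundingMode(HalfIntrinsic ni)
{
    switch (ni)
    {
        case HalfIntrinsic::Round:
            return 0; // value taken from the reviewed snippet
        case HalfIntrinsic::Ceiling:
            return 1; // value taken from the reviewed snippet
        case HalfIntrinsic::Floor:
            return 2; // value taken from the reviewed snippet
        case HalfIntrinsic::Truncate:
            return 3; // value taken from the reviewed snippet
        default:
            return -1; // not a rounding intrinsic
    }
}
#endif // TARGET_XARCH
```

With this shape, every call site has to sit under the same `#if`, which turns an accidental use on another target into a compile error rather than dead code.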

@kg
Member

kg commented Mar 3, 2026

Went over everything, deferring the greencheck to tanner. Thanks for your hard work :)

Copilot AI review requested due to automatic review settings March 4, 2026 18:35
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 46 out of 47 changed files in this pull request and generated no new comments.

Copilot AI review requested due to automatic review settings March 4, 2026 20:45
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 46 out of 47 changed files in this pull request and generated 2 comments.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 5, 2026 15:23
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@tannergooding tannergooding changed the title from "[Draft] Accelerate Half with FP16 ISA" to "Accelerate Half with FP16 ISA" Mar 5, 2026
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 46 out of 47 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (6)

src/coreclr/jit/utils.cpp:1

  • The “jam” bit computation in shiftRightJam is incorrect due to operator precedence/parenthesization: it currently shifts by either 0 or 1, rather than testing whether any bits were shifted out. This will produce incorrect rounding for many values in convertDoubleToFloat16. Consider rewriting the expression to explicitly compute ((l << ((-dist) & 63)) != 0) (or equivalent) and OR in 1 when any discarded bits are non-zero.
    src/coreclr/jit/utils.cpp:1
  • HALF_POSITIVE_INFINITY_BITS/HALF_NEGATIVE_INFINITY_BITS are declared as uint64_t but are used as 16-bit Half bit patterns and returned from a float16_t (uint16_t) function. This implicit narrowing is easy to miss and makes the code harder to reason about (and may trigger warnings under stricter builds). Prefer declaring these constants as uint16_t (or float16_t) to match semantics and avoid silent truncation.
    src/coreclr/jit/utils.cpp:1
  • HALF_POSITIVE_INFINITY_BITS/HALF_NEGATIVE_INFINITY_BITS are declared as uint64_t but are used as 16-bit Half bit patterns and returned from a float16_t (uint16_t) function. This implicit narrowing is easy to miss and makes the code harder to reason about (and may trigger warnings under stricter builds). Prefer declaring these constants as uint16_t (or float16_t) to match semantics and avoid silent truncation.
    src/coreclr/jit/vartype.h:1
  • This branch is used for non-x86 targets (per the surrounding #if/#else and the comment stating “Other targets pass them as regular structs”), but it unconditionally treats TYP_HALF as using float argument registers. If TYP_HALF can ever appear on non-xarch targets (including during cross-target JIT builds or shared utilities), this risks ABI/calling convention mismatches. Consider guarding the TYP_HALF clause behind TARGET_XARCH (or otherwise ensuring TYP_HALF cannot reach this path).
    src/libraries/System.Private.CoreLib/src/System/Half.cs:1
  • Only Asin(Half) was changed to use AggressiveInlining, while nearby Half math wrappers are being marked with [Intrinsic] (or left unchanged). This inconsistency makes it unclear whether the intent is “intrinsic expansion” or “inline the wrapper”. Consider either removing this attribute (if incidental) or aligning it with the other accelerated Half APIs (e.g., mark as [Intrinsic] if it’s intended to be lowered).
    src/coreclr/vm/reflectioninvocation.cpp:1
  • Minor: there’s trailing whitespace on the #if line. Also, many other hunks use TARGET_XARCH for this combined case; using a consistent macro improves readability and reduces duplication.
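
The first two suppressed comments for utils.cpp can be illustrated with a small standalone sketch. It is not the PR's code: `ShiftRightJam64` and the `kHalf...` constant names are hypothetical, and the helper follows the usual soft-float definition of a "jamming" shift, where the low bit of the result becomes a sticky bit recording whether any discarded bit was non-zero. The constants are the IEEE 754 binary16 infinity bit patterns, declared with a 16-bit type.

```cpp
#include <cassert>
#include <cstdint>

// IEEE 754 binary16 infinities: exponent bits all ones, fraction zero.
// Declared as uint16_t so no narrowing happens when they are returned
// from a function whose result type is a 16-bit pattern.
constexpr uint16_t kHalfPositiveInfinityBits = 0x7C00;
constexpr uint16_t kHalfNegativeInfinityBits = 0xFC00;

// Hypothetical helper: shift `value` right by `dist` bits and OR a sticky 1
// into bit 0 if any of the bits shifted out were non-zero.
inline uint64_t ShiftRightJam64(uint64_t value, unsigned dist)
{
    if (dist == 0)
    {
        return value; // nothing is discarded
    }
    if (dist >= 64)
    {
        return (value != 0) ? 1 : 0; // everything is discarded
    }

    // The bits that fall off the right end, moved to the top of the word.
    const uint64_t discarded = value << (64 - dist);

    // Parenthesize the comparison so the sticky bit tests the discarded
    // bits instead of changing the shift amount.
    return (value >> dist) | ((discarded != 0) ? 1u : 0u);
}
```

For example, `ShiftRightJam64(16, 3)` discards only zero bits and yields 2, while `ShiftRightJam64(17, 3)` discards a set bit and yields 3.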

Comment on lines +9982 to 9986
if (structSize == 2)
return true;

#ifdef FEATURE_SIMD
return (structSize >= getMinVectorByteLength()) && (structSize <= getMaxVectorByteLength());
Copilot AI Mar 5, 2026

Returning true for all 2-byte structs makes many unrelated valuetypes (e.g., arbitrary 2-byte user structs) go down “accelerated type” code paths, frequently triggering additional EE queries (e.g., impNormStructType, VN field selector normalization). Since Half acceleration here is gated to xarch+AVX10v1 and intrinsic types, consider tightening this predicate (e.g., only treat size==2 as accelerated when building for xarch and when Half acceleration is enabled), or splitting into a separate Half-specific fast-path to avoid broad 2-byte struct churn.

Suggested change
- if (structSize == 2)
-     return true;
- #ifdef FEATURE_SIMD
-     return (structSize >= getMinVectorByteLength()) && (structSize <= getMaxVectorByteLength());
+ #if defined(TARGET_XARCH) && defined(FEATURE_SIMD)
+     // On xarch with AVX10v1 support, 2-byte structs may represent accelerated Half types.
+     // Restrict the 2-byte fast-path to this configuration to avoid sending arbitrary 2-byte
+     // user structs down accelerated-type code paths on other targets.
+     if ((structSize == 2) && compSupports(InstructionSet_AVX10v1))
+     {
+         return true;
+     }
+     return (structSize >= getMinVectorByteLength()) && (structSize <= getMaxVectorByteLength());
+ #elif defined(FEATURE_SIMD)
+     return (structSize >= getMinVectorByteLength()) && (structSize <= getMaxVectorByteLength());
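
The effect of tightening the predicate can be seen in a simplified standalone model. Everything here is hypothetical: `AccelConfig` and `IsAcceleratedStructSize` are illustrative names, and the two byte-length fields stand in for `getMinVectorByteLength()` and `getMaxVectorByteLength()`.

```cpp
#include <cassert>

// Hypothetical, simplified model of the configuration the JIT would consult.
struct AccelConfig
{
    bool     halfAccelerationEnabled; // assumption: true only on xarch with AVX10v1
    unsigned minVectorBytes;          // stand-in for getMinVectorByteLength()
    unsigned maxVectorBytes;          // stand-in for getMaxVectorByteLength()
};

// Returns true when a struct of the given size may be an accelerated type.
inline bool IsAcceleratedStructSize(unsigned structSize, const AccelConfig& cfg)
{
    if (structSize == 2)
    {
        // Only consider 2-byte structs when Half acceleration is in play.
        return cfg.halfAccelerationEnabled;
    }
    return (structSize >= cfg.minVectorBytes) && (structSize <= cfg.maxVectorBytes);
}
```

Note that a size check alone still cannot tell `Half` apart from an arbitrary 2-byte user struct; the gate only limits how often the more expensive type queries are reached.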

Comment on lines +9 to +11
#ifndef SIZE_UNKNOWN
#define SIZE_UNKNOWN UINT8_MAX
#endif
Copilot AI Mar 5, 2026

Defining a global macro like SIZE_UNKNOWN in a widely included header increases the risk of collisions and makes it harder to track the canonical meaning/value across the JIT. If possible, prefer a scoped constant (e.g., static constexpr unsigned) in an appropriate shared header/namespace, or include the header that already owns this sentinel to avoid multiple “fallback” definitions.

Suggested change
- #ifndef SIZE_UNKNOWN
- #define SIZE_UNKNOWN UINT8_MAX
- #endif
+ static constexpr unsigned SIZE_UNKNOWN = UINT8_MAX;
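
A minimal sketch of the scoped-constant alternative, assuming a namespace is acceptable in the header in question. The namespace name `jitsketch` is hypothetical and used only to show the scoping.

```cpp
#include <cassert>
#include <cstdint>

namespace jitsketch // hypothetical namespace, for illustration only
{
// A typed, scoped sentinel instead of a global preprocessor macro.
static constexpr unsigned SIZE_UNKNOWN = UINT8_MAX;
} // namespace jitsketch

// The value is checked at compile time.
static_assert(jitsketch::SIZE_UNKNOWN == 255u, "sentinel should equal UINT8_MAX");
```

One caveat: if another header in the same translation unit still defines `SIZE_UNKNOWN` as a macro, the preprocessor rewrites the identifier in this declaration, so the macro and the constant cannot coexist under the same name.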


Labels

area-CodeGen-coreclr, community-contribution

6 participants