@tannergooding @jakobbotsch please take a look when you get a chance.
3b8abaa to f633726
@dotnet/intel @tannergooding may I get some high-level feedback on the structure of the PR?
// We ONLY want the valid double register in the RBM_DOUBLERET mask.
#ifdef TARGET_AMD64
useCandidates = (RBM_DOUBLERET & RBM_ALLDOUBLE).GetFloatRegSet();
#else
useCandidates = (RBM_DOUBLERET & RBM_ALLDOUBLE).GetFloatRegSet();
#endif // TARGET_AMD64
Not related to this PR, but these two paths are identical.
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics; appears to be unused in this file (the [Intrinsic] attribute is defined under System.Runtime.CompilerServices). If unused-usings warnings are treated as errors for CoreLib, this will break the build. Remove the using or reference a type from that namespace if it’s required.
case NI_System_Half_FusedMultiplyAdd:
{
#if defined(TARGET_XARCH)
    if (compOpportunisticallyDependsOn(InstructionSet_AVX10v1))
    {
        // We are constructing a chain of intrinsics similar to:
        //   return FMA.MultiplyAddScalar(
        //       Vector128.CreateScalarUnsafe(x),
        //       Vector128.CreateScalarUnsafe(y),
        //       Vector128.CreateScalarUnsafe(z)
        //   ).ToScalar();
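The chain above lowers Half.FusedMultiplyAdd to a scalar FMA, whose defining property is that x * y + z is rounded once rather than twice. A minimal C++ sketch of that property, using double and std::fma rather than Half (standard C++ before C++23 has no portable half type); the helper name productRoundingError is hypothetical:

```cpp
#include <cassert>
#include <cmath>

// Sketch only: demonstrates single-rounding fused multiply-add with double,
// not System.Half. productRoundingError is a hypothetical helper name.
//
// Returns the exact rounding error of the product a * b. Because std::fma
// rounds only once, fma(a, b, -p) equals a * b - p exactly (barring
// overflow/underflow), recovering the bits the separate multiply discarded.
double productRoundingError(double a, double b) {
    const double p = a * b;    // product rounded to double
    return std::fma(a, b, -p); // exact a * b - p, rounded once
}
```

For x = 1 + 2^-30, the rounded product x * x drops the 2^-60 term (it lies below half an ulp of 1.0), and productRoundingError(x, x) returns exactly 2^-60; a separate multiply followed by a subtract rounds the product first and cannot recover it.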
This PR introduces new JIT named intrinsics for System.Half (including lowering to AVX10v1 scalar FP16 instructions) but there don’t appear to be any JIT/HardwareIntrinsics tests exercising Half + AVX10v1 codegen paths. Adding targeted tests under src/tests/JIT/HardwareIntrinsics/X86_Avx10v1 (or an equivalent location) would help catch regressions in recognition, codegen, and calling convention handling.
Our intent is to have these tests as the APIs are directly implemented.
b89ec75 to 0b01700
I've gone through and resolved most of the old comments. There are a few of yours with responses from me that I think you should check before resolving.
#if defined(TARGET_XARCH)
uint16_t ins[11]; // 11 * 2-bytes
#else
uint16_t ins[10]; // 10 * 2-bytes
#endif
This is a question that you don't necessarily have to answer. Why make it 22 bytes on xarch and 20 on other arches, instead of 22 bytes on every arch for simplicity? Are there already 20-byte insns on e.g. arm32, riscv, loongarch, or arm64? If we're defining this based on how big insns are on a given architecture, we could potentially make ins[] much smaller and reduce memory usage and/or file size a bit.
But I don't think this needs to change, it just came to mind when reading the diff.
if (simdReturnType != call->TypeGet())
{
    assert(varTypeIsSIMD(simdReturnType));
    assert(varTypeIsAccelerated(simdReturnType));
The name varTypeIsAccelerated feels ambiguous in a way that varTypeIsSIMD was not. "Accelerated" makes me think the target I'm jitting for has native instructions for it, while "SIMD" makes me think it's a vector type. I would expect "accelerated" to be false in some cases where "SIMD" is true, e.g. a Vector512 on an arch that only does 128-bit vectors, such as wasm. When I look at this I immediately wonder whether accelerated means "actually hardware accelerated", "a type that might be hardware accelerated", or "acceleratable".
We could use something like varTypeIsStructPrimitive, since that's really what this is.
That is, we have the regular built-in primitives and then the struct primitives that map to ABI concepts beyond the built-in ones.
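To make the naming distinction concrete, here is a hypothetical sketch: a trimmed-down var_types enum and predicate definitions that are illustrative only, not the JIT's actual ones. It separates "is a vector type" from "is a struct-shaped type the JIT treats as a primitive"; whether the target actually has hardware support would remain a separate, target-dependent query.

```cpp
#include <cassert>

// Hypothetical, trimmed-down var_types; the real JIT enum is much larger.
enum var_types {
    TYP_INT, TYP_FLOAT, TYP_DOUBLE, TYP_STRUCT, TYP_HALF,
    TYP_SIMD8, TYP_SIMD12, TYP_SIMD16, TYP_SIMD32, TYP_SIMD64
};

// "SIMD" answers only: is this a vector type?
constexpr bool varTypeIsSIMD(var_types t) {
    return (t >= TYP_SIMD8) && (t <= TYP_SIMD64);
}

// "Struct primitive" answers: is this a struct-shaped type that maps to an
// ABI concept beyond the built-in primitives (e.g. passed in a register as a
// whole rather than field by field)? It says nothing about whether the
// current target can accelerate operations on it.
constexpr bool varTypeIsStructPrimitive(var_types t) {
    return varTypeIsSIMD(t) || (t == TYP_HALF);
}
```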
int Compiler::lookupHalfRoundingMode(NamedIntrinsic ni)
{
    switch (ni)
    {
        case NI_System_Half_Round:
            return 0; // Round to nearest
        case NI_System_Half_Ceiling:
            return 1; // Round towards +infinity
        case NI_System_Half_Floor:
            return 2; // Round towards -infinity
        case NI_System_Half_Truncate:
            return 3; // Round towards zero
        default:
            noway_assert(!"Should have one of the above Half intrinsics");
            return -1;
    }
}
Should this be xarch ifdef'd? Though I assume that if all its usages are ifdef'd, it'll get pruned by the linker anyway.
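For reference, the four modes selected by lookupHalfRoundingMode behave as in the following C++ sketch, computed in float under the default floating-point environment (round-to-nearest), where std::nearbyint breaks ties to even like the default MidpointRounding.ToEven of Math.Round. The HalfRounding enum and applyRounding helper are hypothetical. One thing that may be worth double-checking, assuming these return values are ever used directly as an x86 rounding-control immediate (e.g. for VRNDSCALE): that encoding uses 1 for toward -infinity and 2 for toward +infinity (_MM_FROUND_TO_NEG_INF = 0x01, _MM_FROUND_TO_POS_INF = 0x02), the opposite of the Ceiling = 1 / Floor = 2 assignment above.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical labels for the four Half rounding intrinsics; these values are
// NOT instruction immediates.
enum class HalfRounding { Round, Ceiling, Floor, Truncate };

// Reference semantics for each mode, computed in float.
// Assumes the default floating-point environment (FE_TONEAREST), under which
// std::nearbyint rounds halfway cases to the nearest even integer.
float applyRounding(HalfRounding mode, float value) {
    switch (mode) {
        case HalfRounding::Round:    return std::nearbyint(value); // nearest, ties to even
        case HalfRounding::Ceiling:  return std::ceil(value);      // toward +infinity
        case HalfRounding::Floor:    return std::floor(value);     // toward -infinity
        case HalfRounding::Truncate: return std::trunc(value);     // toward zero
    }
    return value; // unreachable for valid enumerators
}
```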
Went over everything; deferring the green check to Tanner. Thanks for your hard work :)
0b01700 to aea6da2
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Half with FP16 ISA
Pull request overview
Copilot reviewed 46 out of 47 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (6)
src/coreclr/jit/utils.cpp:1 - The “jam” bit computation in shiftRightJam is incorrect due to operator precedence/parenthesization: it currently shifts by either 0 or 1, rather than testing whether any bits were shifted out. This will produce incorrect rounding for many values in convertDoubleToFloat16. Consider rewriting the expression to explicitly compute ((l << ((-dist) & 63)) != 0) (or equivalent) and OR in 1 when any discarded bits are non-zero.
src/coreclr/jit/utils.cpp:1 - HALF_POSITIVE_INFINITY_BITS/HALF_NEGATIVE_INFINITY_BITS are declared as uint64_t but are used as 16-bit Half bit patterns and returned from a float16_t (uint16_t) function. This implicit narrowing is easy to miss and makes the code harder to reason about (and may trigger warnings under stricter builds). Prefer declaring these constants as uint16_t (or float16_t) to match semantics and avoid silent truncation.
src/coreclr/jit/utils.cpp:1 - HALF_POSITIVE_INFINITY_BITS/HALF_NEGATIVE_INFINITY_BITS are declared as uint64_t but are used as 16-bit Half bit patterns and returned from a float16_t (uint16_t) function. This implicit narrowing is easy to miss and makes the code harder to reason about (and may trigger warnings under stricter builds). Prefer declaring these constants as uint16_t (or float16_t) to match semantics and avoid silent truncation.
src/coreclr/jit/vartype.h:1 - This branch is used for non-x86 targets (per the surrounding #if/#else and the comment stating “Other targets pass them as regular structs”), but it unconditionally treats TYP_HALF as using float argument registers. If TYP_HALF can ever appear on non-xarch targets (including during cross-target JIT builds or shared utilities), this risks ABI/calling convention mismatches. Consider guarding the TYP_HALF clause behind TARGET_XARCH (or otherwise ensuring TYP_HALF cannot reach this path).
src/libraries/System.Private.CoreLib/src/System/Half.cs:1 - Only Asin(Half) was changed to use AggressiveInlining, while nearby Half math wrappers are being marked with [Intrinsic] (or left unchanged). This inconsistency makes it unclear whether the intent is “intrinsic expansion” or “inline the wrapper”. Consider either removing this attribute (if incidental) or aligning it with the other accelerated Half APIs (e.g., mark as [Intrinsic] if it’s intended to be lowered).
src/coreclr/vm/reflectioninvocation.cpp:1 - Minor: there’s trailing whitespace on the #if line. Also, many other hunks use TARGET_XARCH for this combined case; using a consistent macro improves readability and reduces duplication.
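The two utils.cpp comments above can be illustrated together. Below is a sketch of a shift-right-with-jam modeled on Berkeley SoftFloat's softfloat_shiftRightJam64 (not the PR's code), with the dist == 0 case handled explicitly, plus the Half infinity constants declared at 16-bit width, assuming they hold the standard IEEE 754 binary16 encodings 0x7C00 and 0xFC00:

```cpp
#include <cassert>
#include <cstdint>

// Shifts value right by dist and, if any nonzero bits were shifted out, ORs a 1
// into the least-significant ("sticky") bit. Modeled on Berkeley SoftFloat's
// softfloat_shiftRightJam64; dist == 0 is handled explicitly here.
uint64_t shiftRightJam64(uint64_t value, uint32_t dist) {
    if (dist == 0) {
        return value; // nothing is discarded
    }
    if (dist < 63) {
        // value << (64 - dist) keeps exactly the bits that value >> dist drops.
        const uint64_t discarded = value << (64 - dist);
        return (value >> dist) | (discarded != 0 ? 1u : 0u);
    }
    // dist == 63: the surviving top bit ORed with the sticky bit is (value != 0).
    // dist >= 64: everything is discarded, so only the sticky bit remains.
    return (value != 0) ? 1u : 0u;
}

// Half bit patterns declared at the width they are used (assumed standard
// IEEE 754 binary16 encodings: exponent bits all ones, fraction zero).
constexpr uint16_t HALF_POSITIVE_INFINITY_BITS = 0x7C00;
constexpr uint16_t HALF_NEGATIVE_INFINITY_BITS = 0xFC00; // same, with the sign bit set
static_assert(sizeof(HALF_POSITIVE_INFINITY_BITS) == 2, "Half bit patterns are 16 bits wide");
```

For example, shifting 0b10001 right by 3 yields 0b11: the quotient 0b10 with the sticky bit set because a nonzero 0b001 was discarded. That sticky bit is what lets a later round-to-nearest-even step tell an exact halfway case apart from one slightly above it.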
if (structSize == 2)
    return true;
#ifdef FEATURE_SIMD
return (structSize >= getMinVectorByteLength()) && (structSize <= getMaxVectorByteLength());
Returning true for all 2-byte structs makes many unrelated valuetypes (e.g., arbitrary 2-byte user structs) go down “accelerated type” code paths, frequently triggering additional EE queries (e.g., impNormStructType, VN field selector normalization). Since Half acceleration here is gated to xarch+AVX10v1 and intrinsic types, consider tightening this predicate (e.g., only treat size==2 as accelerated when building for xarch and when Half acceleration is enabled), or splitting into a separate Half-specific fast-path to avoid broad 2-byte struct churn.
- if (structSize == 2)
-     return true;
- #ifdef FEATURE_SIMD
- return (structSize >= getMinVectorByteLength()) && (structSize <= getMaxVectorByteLength());
+ #if defined(TARGET_XARCH) && defined(FEATURE_SIMD)
+ // On xarch with AVX10v1 support, 2-byte structs may represent accelerated Half types.
+ // Restrict the 2-byte fast-path to this configuration to avoid sending arbitrary 2-byte
+ // user structs down accelerated-type code paths on other targets.
+ if ((structSize == 2) && compSupports(InstructionSet_AVX10v1))
+ {
+     return true;
+ }
+ return (structSize >= getMinVectorByteLength()) && (structSize <= getMaxVectorByteLength());
+ #elif defined(FEATURE_SIMD)
+ return (structSize >= getMinVectorByteLength()) && (structSize <= getMaxVectorByteLength());
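The comment also mentions a separate Half-specific fast path as an alternative. A sketch of that option as a pure function; TargetInfo, isAcceleratedStructSize, and the isHalfType flag are hypothetical stand-ins for the real JIT/EE queries:

```cpp
#include <cassert>

// Hypothetical inputs standing in for JIT/EE queries; none of these names are
// real JIT APIs.
struct TargetInfo {
    bool     isXarch;
    bool     supportsAvx10v1;
    unsigned minVectorByteLength;
    unsigned maxVectorByteLength;
};

// Half-specific fast path: a 2-byte struct qualifies only when it is the Half
// intrinsic type AND the target can accelerate it. Arbitrary 2-byte user
// structs are rejected immediately instead of going down accelerated-type paths.
bool isAcceleratedStructSize(const TargetInfo& target, unsigned structSize, bool isHalfType) {
    if (structSize == 2) {
        return isHalfType && target.isXarch && target.supportsAvx10v1;
    }
    // Everything else keeps the existing vector-size range check.
    return (structSize >= target.minVectorByteLength) &&
           (structSize <= target.maxVectorByteLength);
}
```

Keying on the type identity rather than only the size is the design point here: the size check alone cannot distinguish Half from an unrelated 2-byte struct.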
#ifndef SIZE_UNKNOWN
#define SIZE_UNKNOWN UINT8_MAX
#endif
Defining a global macro like SIZE_UNKNOWN in a widely included header increases the risk of collisions and makes it harder to track the canonical meaning/value across the JIT. If possible, prefer a scoped constant (e.g., static constexpr unsigned) in an appropriate shared header/namespace, or include the header that already owns this sentinel to avoid multiple “fallback” definitions.
- #ifndef SIZE_UNKNOWN
- #define SIZE_UNKNOWN UINT8_MAX
- #endif
+ static constexpr unsigned SIZE_UNKNOWN = UINT8_MAX;
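A minimal sketch of the scoped-constant alternative; the jitsketch namespace and isSizeKnown helper are hypothetical. One practical caveat with the one-line replacement: if a SIZE_UNKNOWN macro is still visible where the constexpr is declared, the declaration is macro-expanded and fails to compile, so every fallback #define would need to be removed together with it.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical namespace; the real owning header/namespace may differ.
namespace jitsketch {

// Typed, scoped sentinel instead of a global #define. Unlike a macro, it obeys
// namespace scoping and has a single definition to find.
static constexpr unsigned SIZE_UNKNOWN = UINT8_MAX;

constexpr bool isSizeKnown(unsigned size) {
    return size != SIZE_UNKNOWN;
}

} // namespace jitsketch
```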
Draft PR for in-progress work to accelerate System.Half with FP16 ISA.

Current work done:
- Add a TYP_HALF to the .NET runtime, which is treated like a TYP_SIMDXX, but with some notable differences. Namely, a TYP_HALF is passed around via the xmm registers, and while it will pass a varTypeIsStruct test, it must be treated as a primitive in other places.
- Accelerate System.Half operations with the TYP_HALF and some FP16 intrinsics. Not every possible function has been accelerated yet.

For discussion:
- I have currently worked around some checks to make TYP_HALF behave like both a struct and a primitive. It's very ad hoc at the moment.
- Much of the work to transform the named System.Half intrinsics into a sequence of intrinsic nodes is done in importcall.cpp and might want to be moved up into some of the gtNewSimdXX nodes.