JIT: ARM64 SVE format encodings, SVE_IF_4A to SVE_JK_4B#97739
JIT: ARM64 SVE format encodings, SVE_IF_4A to SVE_JK_4B#97739TIHan merged 19 commits intodotnet:mainfrom
SVE_IF_4A to SVE_JK_4B#97739Conversation
…er SVE format group.
…w for encoding elem size.
|
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch Issue DetailsContributes to #94549 Adds 35 formats. This is a large one, but it is better to do these all at once. Progress:
|
Diff results for #97739Throughput diffsThroughput diffs for linux/arm64 ran on windows/x64MinOpts (-0.00% to +0.01%)
Details here Throughput diffs for linux/arm64 ran on linux/x64Overall (+0.02% to +0.04%)
MinOpts (+0.05% to +0.07%)
FullOpts (+0.02% to +0.03%)
Details here |
Diff results for #97739Throughput diffsThroughput diffs for linux/arm64 ran on windows/x64MinOpts (-0.01% to +0.00%)
Throughput diffs for windows/arm64 ran on windows/x64MinOpts (-0.00% to +0.01%)
Details here |
Diff results for #97739Throughput diffsThroughput diffs for linux/arm64 ran on windows/x64MinOpts (-0.01% to +0.00%)
Throughput diffs for windows/arm64 ran on windows/x64MinOpts (-0.00% to +0.01%)
Details here Throughput diffs for linux/arm64 ran on linux/x64Overall (+0.02% to +0.05%)
MinOpts (+0.06% to +0.10%)
FullOpts (+0.02% to +0.03%)
Details here |
|
@dotnet/jit-contrib @dotnet/arm64-contrib @kunalspathak @a74nh this is ready. |
Diff results for #97739Throughput diffsThroughput diffs for linux/arm64 ran on windows/x64MinOpts (-0.01% to -0.00%)
Throughput diffs for windows/arm64 ran on windows/x64MinOpts (-0.01% to +0.00%)
Details here |
Diff results for #97739Throughput diffsThroughput diffs for linux/arm64 ran on linux/x64Overall (+0.02% to +0.06%)
MinOpts (+0.07% to +0.12%)
FullOpts (+0.02% to +0.03%)
Details here |
|
I can look. My first guess is the additional cases in emitIns_R_R_R_R. |
a74nh
left a comment
There was a problem hiding this comment.
I need another pass through (there's a lot of code!), but so far this is looking good.
| INS_OPTS_SCALABLE_S); // LDNT1SB {<Zt>.S }, <Pg>/Z, [<Zn>.S{, <Xm>}] | ||
| theEmitter->emitIns_R_R_R_R(INS_sve_ldnt1sh, EA_SCALABLE, REG_V3, REG_P4, REG_V1, REG_R2, | ||
| INS_OPTS_SCALABLE_S); // LDNT1SH {<Zt>.S }, <Pg>/Z, [<Zn>.S{, <Xm>}] | ||
| // REG_ZR can be used due to the optional {, <Xm>} of the format. |
There was a problem hiding this comment.
Alternatively, we could use emitIns_R_R_R() for these, but I think your way is better
There was a problem hiding this comment.
I did think about it. We still have to pass and encode REG_ZR regardless so it just made sense to keep it as emitIns_R_R_R_R.
| */ | ||
|
|
||
| void emitter::emitIns_BARR(instruction ins, insBarrier barrier) |
There was a problem hiding this comment.
Am I right in thinking you've not touched emitIns_BARR() or emitIns_R_R_R_COND() and this is just the diff getting confused?
Which emitIns_ functions have you changed?
There was a problem hiding this comment.
The diff is confused... never touched those. I only touched emitIns_R_R_R_R.
There was a problem hiding this comment.
I'm probably going to create a new emitIns called emitInsSve_R_R_R_R and put all the handling of the sve instructions in that and have emitIns_R_R_R_R call into it as a fallback. I'm hopeful that would limit the TP regressions and have it be more organized.
Diff results for #97739Throughput diffsThroughput diffs for linux/arm64 ran on windows/x64MinOpts (-0.01% to +0.00%)
Throughput diffs for windows/arm64 ran on windows/x64MinOpts (-0.01% to +0.00%)
Details here Throughput diffs for linux/arm64 ran on linux/x64Overall (+0.02% to +0.06%)
MinOpts (+0.07% to +0.12%)
FullOpts (+0.02% to +0.03%)
Details here |
Diff results for #97739Throughput diffsThroughput diffs for linux/arm64 ran on linux/x64Overall (+0.02% to +0.06%)
MinOpts (+0.07% to +0.12%)
FullOpts (+0.02% to +0.03%)
Details here |
Diff results for #97739Throughput diffsThroughput diffs for linux/arm64 ran on windows/x64Overall (-0.66% to -0.26%)
MinOpts (-1.28% to -0.76%)
FullOpts (-0.43% to -0.23%)
Throughput diffs for osx/arm64 ran on windows/x64Overall (-0.59% to -0.27%)
MinOpts (-1.29% to -0.76%)
FullOpts (-0.43% to -0.23%)
Throughput diffs for windows/arm64 ran on windows/x64Overall (-0.57% to -0.26%)
MinOpts (-1.29% to -0.76%)
FullOpts (-0.43% to -0.24%)
Details here |
|
@kunalspathak looks like I solved the TP regression :) |
Diff results for #97739Throughput diffsThroughput diffs for linux/arm64 ran on linux/x64Overall (-0.01% to -0.00%)
MinOpts (-0.04% to -0.01%)
FullOpts (-0.01% to -0.00%)
Details here |
a74nh
left a comment
There was a problem hiding this comment.
Other than the one thing below, LGTM.
Disclaimer: I've not checked checked every single pattern by hand as there are a lot.
| INS_OPTS_SCALABLE_D); // LDFF1W {<Zt>.D }, <Pg>/Z, [<Xn|SP>, <Zm>.D] | ||
|
|
||
| // IF_SVE_IF_4A | ||
| theEmitter->emitIns_R_R_R_R(INS_sve_ldnt1b, EA_SCALABLE, REG_V3, REG_P2, REG_V1, REG_R0, |
There was a problem hiding this comment.
Should probably call emitInsSve_R_R_R_R() directly here and remove the changes from emitIns_R_R_R_R() ?
@kunalspathak : not sure what your future plans for splitting things up was?
There was a problem hiding this comment.
I could go either way. The benefit of calling emitInsSve is intent, and we could remove the scalable options from the emitIns.
kunalspathak
left a comment
There was a problem hiding this comment.
Thanks for covering many formats. Added some suggestions.
| INS_SCALABLE_OPTS_LSL_N); // LD3Q {<Zt1>.Q, <Zt2>.Q, <Zt3>.Q }, <Pg>/Z, [<Xn|SP>, | ||
| // <Xm>, | ||
| // LSL #4] | ||
| theEmitter->emitIns_R_R_R_R(INS_sve_ld4q, EA_SCALABLE, REG_V5, REG_P1, REG_R4, REG_R3, INS_OPTS_SCALABLE_Q, |
There was a problem hiding this comment.
hhm, lot of instructions that needs consecutive registers.
src/coreclr/jit/emitarm64.cpp
Outdated
| * for the 'dtype' field. | ||
| */ | ||
|
|
||
| /*static*/ emitter::code_t emitter::insEncodeSveElemsize_dtype_ld1w(instruction ins, insFormat fmt, emitAttr size, code_t code) |
There was a problem hiding this comment.
is there a reason why this can't be part of insEncodeSveElemsize_dtype()? For each size, you can have a case INS_sve_ld1w and then a switch/case for the formats?
There was a problem hiding this comment.
Yea, ld1w has to be handled differently for two formats while the other cases in insEncodeSveElemsize_dtype don't need to be.
There was a problem hiding this comment.
sure, so can you send the fmt to insEncodeSveElemsize_dtype and handle it accordingly instead of creating a new method?
Edit: possibly, we might need fmt to handle future instructions as well, so should be ok to send it in.
There was a problem hiding this comment.
fmt isn't needed for the other instructions, it's only needed for ld1w. Passing in fmt for insEncodeSveElemsize_dtype would imply that it will check the format for the other instructions which I didn't want.
There was a problem hiding this comment.
What makes ld1w special is that we have to set specific bits that are actually unrelated to dtype, but related to elemsize.
There was a problem hiding this comment.
Passing in fmt for insEncodeSveElemsize_dtype would imply that it will check the format for the other instructions which I didn't want
You can just a case ld1w and do whatever you are doing in insEncodeSveElemsize_dtype_ld1w?
There was a problem hiding this comment.
discussed this offline - we decided to keep insEncodeSveElemsize_dtype_ld1w but if more instructions need similar special handling, then we will merge it in insEncodeSveElemsize_dtype .
| break; | ||
|
|
||
| default: | ||
| assert(!"Invalid instruction"); |
There was a problem hiding this comment.
this is little odd that we have 4 cases that can land us inside this switch/case and here too we need to add default and assert.
There was a problem hiding this comment.
The default case should never be hit here. The assert is more for sanity.
There was a problem hiding this comment.
I know, I am just pointing that we are writing odd looking pattern and testing C++ compiler here :) But honestly, not sure what's the best alternative here.
switch (ins)
{
case insA:
case insB:
case insC:
case insD:
...
...
switch (ins)
{
case insA:
case insB:
case insC:
case insD:
...
default:
assert("c++ compiler messed up :)");
}
}| fmt = IF_SVE_IK_4A_I; | ||
| break; | ||
|
|
||
| default: |
There was a problem hiding this comment.
likewise here and down below.
There was a problem hiding this comment.
Same with this, default case should never be hit.
| * Returns true if the SVE instruction has a LSL addr. | ||
| * This is for formats that have [<Xn|SP>, <Xm>, LSL #N], [<Xn|SP>{, <Xm>, LSL #N}] | ||
| */ | ||
| /*static*/ bool emitter::insSveIsLslN(instruction ins, insFormat fmt) |
There was a problem hiding this comment.
this kind of methods should really be converted to a table driven lookup.
There was a problem hiding this comment.
Makes sense to me, though we shouldn't do it in this PR. A follow-up will be better.
src/coreclr/jit/emit.h
Outdated
| void emitAdvanceInstrDesc(instrDesc** id, size_t idSize) const; | ||
| size_t emitIssue1Instr(insGroup* ig, instrDesc* id, BYTE** dp); | ||
| size_t emitOutputInstr(insGroup* ig, instrDesc* id, BYTE** dp); | ||
| BYTE* emitInstrSve(instrDesc* id, BYTE* dst); |
There was a problem hiding this comment.
rename it to emitOutputInstrSve() . Also should be guarded by #ifdef TARGET_ARM64.
There was a problem hiding this comment.
I originally wanted to name it that, but emitOutputInstrSve can't be called independently of emitOutputInstr, and the signature is different, such as emitOutputInstr returns a size_t where as the emitOutputInstrSve would return a BYTE*.
Maybe I'm being too picky on the name. I noticed we also have a emitOutput_Instr which has a similar signature to emitInstrSve. Maybe I can just call it emitOutput_InstrSve ... just putting an underscore in there.
There was a problem hiding this comment.
Yea I wasn't thinking, this should have just been defined in emitarm64.h rather than in emit.h.
| if (isVectorRegister(reg1)) | ||
| { | ||
| // If the overall instruction is working on 128-bit | ||
| // registers, the size of this register for |
There was a problem hiding this comment.
given the capstone matches with JITDisasm, I just skimmed through the display code.
Diff results for #97739Throughput diffsThroughput diffs for linux/arm64 ran on windows/x64Overall (-0.62% to -0.26%)
MinOpts (-1.27% to -0.76%)
FullOpts (-0.44% to -0.25%)
Throughput diffs for osx/arm64 ran on windows/x64Overall (-0.71% to -0.27%)
MinOpts (-1.29% to -0.76%)
FullOpts (-0.43% to -0.24%)
Throughput diffs for windows/arm64 ran on windows/x64Overall (-0.58% to -0.26%)
MinOpts (-1.29% to -0.76%)
FullOpts (-0.44% to -0.24%)
Details here |
Diff results for #97739Throughput diffsThroughput diffs for linux/arm64 ran on linux/x64Overall (-0.02% to -0.00%)
MinOpts (-0.08% to -0.01%)
FullOpts (-0.01% to -0.00%)
Details here |
Diff results for #97739Throughput diffsThroughput diffs for linux/arm64 ran on linux/x64Overall (-0.62% to -0.26%)
MinOpts (-1.28% to -0.76%)
FullOpts (-0.44% to -0.25%)
Throughput diffs for osx/arm64 ran on linux/x64Overall (-0.71% to -0.27%)
MinOpts (-1.29% to -0.76%)
FullOpts (-0.44% to -0.24%)
Throughput diffs for windows/arm64 ran on linux/x64Overall (-0.58% to -0.26%)
MinOpts (-1.29% to -0.76%)
FullOpts (-0.44% to -0.24%)
Details here |
Diff results for #97739Throughput diffsThroughput diffs for linux/arm64 ran on windows/x64Overall (-0.62% to -0.26%)
MinOpts (-1.28% to -0.76%)
FullOpts (-0.44% to -0.25%)
Throughput diffs for osx/arm64 ran on windows/x64Overall (-0.71% to -0.27%)
MinOpts (-1.29% to -0.76%)
FullOpts (-0.44% to -0.24%)
Throughput diffs for windows/arm64 ran on windows/x64Overall (-0.58% to -0.26%)
MinOpts (-1.29% to -0.76%)
FullOpts (-0.44% to -0.24%)
Details here Throughput diffs for linux/arm64 ran on linux/x64Overall (-0.01% to -0.00%)
MinOpts (-0.06% to -0.01%)
FullOpts (-0.01% to -0.00%)
Details here |
Diff results for #97739Throughput diffsThroughput diffs for linux/arm64 ran on linux/x64Overall (-0.01% to -0.00%)
MinOpts (-0.06% to -0.01%)
FullOpts (-0.01% to -0.00%)
Details here |

Contributes to #94549
Adds 35 formats. This is a large one, but it is better to do these all at once.
Progress:
Left: Capstone,

Right: Jit