Load with extension operation

There is no efficient way to express loading a vector of a narrow type and extending it to a vector of a wider type, e.g. **load 4 uint16_t values and extend them to a 4 x uint32_t vector**. To simulate this operation with the current API, we'd need to load the values as a 64-bit scalar (potentially occupying two registers on 32-bit architectures), transfer it to a SIMD register (expensive!), and then use shuffles to move the elements into their proper places. With the native SIMD ISA, it can be implemented more efficiently:

- `PMOVZXWD xmm, [mem]` on x86 with SSE4.1
- `MOVQ xmm, [mem] + PXOR xmm0, xmm0 + PUNPCKLWD xmm, xmm0` on SSE2
- `VLD1.16 {dX}, [rAddr] + VMOVL.U16 qX, dX` on ARMv7+NEON
- `LD1 {Vx.4H}, [xAddr] + UXTL Vx.4S, Vx.4H` on ARM64
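For illustration, here is a rough sketch of how the instruction sequences above could be written with C intrinsics. The helper name `load_extend_u16x4` is hypothetical and not part of any proposed API. The sketch assumes the compiler defines the usual feature macros (`__SSE4_1__`, `__SSE2__`, `__ARM_NEON`) and falls back to a scalar loop otherwise. Whether the compiler actually folds the load into `PMOVZXWD` is up to the code generator.

```c
#include <stdint.h>

#if defined(__SSE4_1__)
#include <smmintrin.h>
#elif defined(__SSE2__)
#include <emmintrin.h>
#elif defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper (illustration only): load 4 uint16_t values from src
 * and store them zero-extended as 4 uint32_t values in dst. */
static void load_extend_u16x4(const uint16_t *src, uint32_t *dst)
{
#if defined(__SSE4_1__)
    /* 64-bit load into the low half, then zero-extend (PMOVZXWD). */
    __m128i narrow = _mm_loadl_epi64((const __m128i *)src);
    __m128i wide = _mm_cvtepu16_epi32(narrow);
    _mm_storeu_si128((__m128i *)dst, wide);
#elif defined(__SSE2__)
    /* MOVQ + PXOR + PUNPCKLWD: interleave the low words with zeros. */
    __m128i narrow = _mm_loadl_epi64((const __m128i *)src);
    __m128i wide = _mm_unpacklo_epi16(narrow, _mm_setzero_si128());
    _mm_storeu_si128((__m128i *)dst, wide);
#elif defined(__ARM_NEON)
    /* VLD1.16 + VMOVL.U16 on ARMv7, LD1 + UXTL on ARM64. */
    vst1q_u32(dst, vmovl_u16(vld1_u16(src)));
#else
    /* Scalar fallback when no supported SIMD extension is available. */
    for (int i = 0; i < 4; i++) {
        dst[i] = src[i];
    }
#endif
}
```

Every path gives the same result, so the helper can be checked against the scalar loop on any target.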

Load with extension operation #23