Epic: ResizeProcessor performance improvements (Memory & CPU) #733

I'm opening this issue to present and track my plan for the improvements we should implement to get a highly optimized `ResizeProcessor` for 1.0.

### Goals
If we implement all the tasks, I expect that:
- The memory usage of `ResizeProcessor` will drop dramatically: by a factor of 20 to 50 for typical images.
- The single-threaded execution time of (non-companding) resize operations will be about 70% of System.Drawing's resize time, while our quality is better. ([Current status](https://gist.github.com/antonfirsov/ba8af1589bcbd2489c336c063498b1ea#file-after-md): 120-130%)

### Tasks
- [ ] (1) Implement new data types for the [SOA](https://en.wikipedia.org/wiki/AOS_and_SOA) representation of `Vector4` (or other 4-channel) buffers, to serve as SOA counterparts of the AOS types `IMemoryOwner<Vector4>`, `Buffer2D<Vector4>`, and `Span<Vector4>`. Something like `class BufferOf4Channels<T>`, `class BufferOf4Channels2D<T>` and `ref struct BufferSegmentOf4Channels<T>`.

- [ ] (2) Implement bulk packing methods in `PixelOperations<TPixel>` for the `BufferSegmentOf4Channels<float>`. `Rgba32` packing/unpacking should be optimized the same way it's done for `Span<Vector4>` packing. 

- [ ] (3) Integrate the `SRgb` companding (Compress/Expand) operations into the APIs defined in (2).

- [ ] (4) *Optional CPU optimization.* Optimize the implementation of (3) for `Rgba32`, using lookup tables.

- [ ] (5) Replace all `Vector4` AOS buffers with SOA counterparts in `ResizeProcessor`

- [ ] (6) *Memory optimization.* Implement #642, preferably when (5) is fully implemented. Parallelization should be dropped.

- [ ] (7) *CPU optimization.* Implement vectorized convolution in `ResizeKernel`, using `Vector4` by default, and `Vector<float>` if AVX2 is detected (`Vector<float>.Count == 8`)

- [ ] (8) *Optional CPU optimization.* Vectorized `Premultiply` and `UnPremultiply` (both `Vector4` and AVX2 variants)
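
To make the idea behind (1) and (7) more concrete, here is a minimal sketch of how an SOA layout lets a convolution run over `Vector<float>` lanes. All type and method names below (`FourChannelBufferSketch`, `SoaConvolutionSketch`, `ConvolveRows`) are hypothetical and only for illustration; they are not existing ImageSharp APIs, and the real `ResizeKernel` integration would likely look different. The sketch assumes a vertical pass where each output value is a weighted sum of the same column in several source rows of one channel.

```csharp
using System;
using System.Numerics;

// Hypothetical SOA container: one contiguous float array per channel
// instead of a single array of interleaved Vector4 values (AOS).
public sealed class FourChannelBufferSketch
{
    public FourChannelBufferSketch(int length)
    {
        this.X = new float[length];
        this.Y = new float[length];
        this.Z = new float[length];
        this.W = new float[length];
    }

    public float[] X { get; }

    public float[] Y { get; }

    public float[] Z { get; }

    public float[] W { get; }

    // Converts an AOS span of Vector4 values into the SOA layout.
    public static FourChannelBufferSketch FromVector4(ReadOnlySpan<Vector4> source)
    {
        var result = new FourChannelBufferSketch(source.Length);
        for (int i = 0; i < source.Length; i++)
        {
            Vector4 v = source[i];
            result.X[i] = v.X;
            result.Y[i] = v.Y;
            result.Z[i] = v.Z;
            result.W[i] = v.W;
        }

        return result;
    }
}

public static class SoaConvolutionSketch
{
    // Computes destination[x] = sum over k of weights[k] * rows[k][x]
    // for a single channel. Because a channel row is contiguous in SOA form,
    // Vector<float>.Count neighbouring pixels are processed per iteration.
    public static void ConvolveRows(float[][] rows, float[] weights, float[] destination)
    {
        int width = destination.Length;
        int lanes = Vector<float>.Count; // 4 with SSE, 8 with AVX2
        int x = 0;

        for (; x <= width - lanes; x += lanes)
        {
            Vector<float> sum = Vector<float>.Zero;
            for (int k = 0; k < weights.Length; k++)
            {
                sum += new Vector<float>(rows[k], x) * weights[k];
            }

            sum.CopyTo(destination, x);
        }

        // Scalar tail for widths that are not a multiple of the lane count.
        for (; x < width; x++)
        {
            float sum = 0F;
            for (int k = 0; k < weights.Length; k++)
            {
                sum += rows[k][x] * weights[k];
            }

            destination[x] = sum;
        }
    }
}
```

With the AOS layout, the same weighted sum handles one pixel (four floats) per `Vector4` operation regardless of the hardware vector width. In the SOA form the loop scales with `Vector<float>.Count`, which is the motivation for doing (5) before (7).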

### Outlook
~If we manage to implement these points, the bottleneck would be the `Rgba32` <-> `4 x float` unpacking/repacking operation. If we can optimize it, we can reach even better performance.~ **Update:** Done in #742.

Alternatively, we can try implementing fixed-point math using `Vector<ushort>`.
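
As a rough illustration of what fixed-point math over `Vector<ushort>` could look like, the sketch below blends two rows of 8-bit channel values using Q8 weights (where 256 represents 1.0). The class and method names are hypothetical, and this is deliberately simplified: it assumes non-negative weights that sum to 1.0 and inputs in the 0..255 range, so every intermediate value fits in 16 bits. Real resampling kernels (bicubic, Lanczos) have negative lobes and would need a signed or wider representation.

```csharp
using System;
using System.Numerics;

public static class FixedPointBlendSketch
{
    // Q8 fixed point: a weight of 256 represents 1.0.
    private const ushort One = 256;

    // destination[i] = (a[i] * weightA + b[i] * (256 - weightA)) / 256
    // Assumes a[i] and b[i] are at most 255, so the sum never exceeds 255 * 256 = 65280.
    public static void Blend(ushort[] a, ushort[] b, ushort weightA, ushort[] destination)
    {
        if (weightA > One)
        {
            throw new ArgumentOutOfRangeException(nameof(weightA));
        }

        var wa = new Vector<ushort>(weightA);
        var wb = new Vector<ushort>((ushort)(One - weightA));
        var one = new Vector<ushort>(One);
        int lanes = Vector<ushort>.Count; // twice as many lanes as Vector<float>
        int i = 0;

        for (; i <= destination.Length - lanes; i += lanes)
        {
            Vector<ushort> sum = (new Vector<ushort>(a, i) * wa) + (new Vector<ushort>(b, i) * wb);

            // Division by 256 is used for portability; a vector right shift by 8
            // would be the natural choice where the runtime offers one.
            (sum / one).CopyTo(destination, i);
        }

        // Scalar tail for the remaining elements.
        for (; i < destination.Length; i++)
        {
            destination[i] = (ushort)(((a[i] * weightA) + (b[i] * (One - weightA))) / One);
        }
    }
}
```

The appeal is that `Vector<ushort>` holds twice as many elements as `Vector<float>` for the same register width, at the cost of reduced precision and careful overflow management.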

*As always, community help is welcome!*
