Epic: ResizeProcessor performance improvements (Memory & CPU) #733

@antonfirsov

I'm opening this issue to present and track my plan for the improvements we should implement to get a highly optimized ResizeProcessor for 1.0.

Goals

If we implement all the tasks, I expect that:

  • The memory usage of ResizeProcessor will drop dramatically (by a factor of 20-50 for typical images).
  • The single-threaded execution time of (non-companding) resize operations will be about 70% of System.Drawing's resize time, while our quality is better. (Current status: 120-130%.)

Tasks

  • (1) Implement new data types for the SOA representation of Vector4 (or other 4-channel) buffers, to be used as SOA counterparts of the AOS types IMemoryOwner<Vector4>, Buffer2D<Vector4>, and Span<Vector4>. Something like class BufferOf4Channels<T>, class BufferOf4Channels2D<T> and ref struct BufferSegmentOf4Channels<T>.

  • (2) Implement bulk packing methods in PixelOperations<TPixel> for the BufferSegmentOf4Channels<float>. Rgba32 packing/unpacking should be optimized the same way it's done for Span<Vector4> packing.

  • (3) Integrate the SRgb companding (Compress/Expand) operations into the APIs defined in (2).

  • (4) Optional CPU optimization. Optimize the implementation of (3) for Rgba32, using lookup tables.

  • (5) Replace all Vector4 AOS buffers with SOA counterparts in ResizeProcessor

  • (6) Memory optimization. Implement #642 (Optimize memory consumption of ResizeProcessor), preferably once (5) is fully implemented. Parallelization should be dropped.

  • (7) CPU optimization. Implement vectorized convolution in ResizeKernel, using Vector4 by default, and Vector<float> if AVX2 is detected (Vector<float>.Count == 8)

  • (8) Optional CPU optimization. Vectorized Premultiply and UnPremultiply (both Vector4 and AVX2 variants)
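
To make tasks (1), (5) and (7) more concrete, here is a minimal sketch of what an SOA row and a Vector<float> convolution over one channel could look like. FourChannelRow and KernelSketch are hypothetical names made up for this illustration. They are not the BufferOf4Channels<T> types proposed above and not the actual ResizeKernel code. The sketch assumes the weighted sum is vectorized across kernel taps, which is only one of several possible arrangements.

```csharp
using System;
using System.Numerics;

// Hypothetical SOA row: four parallel float arrays instead of one Vector4 per pixel.
// Illustration only; not the BufferOf4Channels<T> type proposed in task (1).
public sealed class FourChannelRow
{
    public readonly float[] R, G, B, A;

    public FourChannelRow(int length)
    {
        R = new float[length];
        G = new float[length];
        B = new float[length];
        A = new float[length];
    }

    public int Length => R.Length;

    // Transposes an AOS buffer (one Vector4 per pixel) into the four channel arrays.
    public static FourChannelRow FromAos(Vector4[] pixels)
    {
        var row = new FourChannelRow(pixels.Length);
        for (int i = 0; i < pixels.Length; i++)
        {
            row.R[i] = pixels[i].X;
            row.G[i] = pixels[i].Y;
            row.B[i] = pixels[i].Z;
            row.A[i] = pixels[i].W;
        }

        return row;
    }
}

public static class KernelSketch
{
    // Weighted sum of weights.Length consecutive samples of one channel, beginning at 'start'.
    // The caller must ensure that channel.Length >= start + weights.Length.
    public static float Convolve(float[] weights, float[] channel, int start)
    {
        int lanes = Vector<float>.Count; // 8 when AVX2 is available, typically 4 otherwise
        Vector<float> acc = Vector<float>.Zero;
        int i = 0;

        // Vector part: multiply-accumulate 'lanes' taps per iteration.
        for (; i <= weights.Length - lanes; i += lanes)
        {
            acc += new Vector<float>(weights, i) * new Vector<float>(channel, start + i);
        }

        // Horizontal add of the accumulator lanes.
        float sum = Vector.Dot(acc, Vector<float>.One);

        // Scalar tail for the taps that do not fill a whole vector.
        for (; i < weights.Length; i++)
        {
            sum += weights[i] * channel[start + i];
        }

        return sum;
    }
}
```

With this layout, each channel is a contiguous float array, so the same Convolve call can be run on R, G, B and A in turn without shuffling data between lanes.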

Outlook

If we manage to implement these points, the bottleneck will be the Rgba32 <-> 4 x float unpacking/repacking operation. If we can optimize that as well, we can improve performance even further. Update: Done in #742.
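
For readers unfamiliar with the conversion in question, the following scalar sketch shows what Rgba32 <-> 4 x float unpacking and repacking amounts to when the floats are stored as separate channels. PackedRgba and PackingSketch are hypothetical stand-ins written for this example. This is neither ImageSharp's Rgba32 nor the optimized code from #742, only an unoptimized baseline of the operation.

```csharp
using System;

// Stand-in for a packed 8-bit-per-channel pixel; not ImageSharp's Rgba32.
public struct PackedRgba
{
    public byte R, G, B, A;
}

public static class PackingSketch
{
    // Unpacks bytes in 0..255 to floats in 0..1, one array per channel.
    public static void Unpack(PackedRgba[] source, float[] r, float[] g, float[] b, float[] a)
    {
        for (int i = 0; i < source.Length; i++)
        {
            r[i] = source[i].R / 255f;
            g[i] = source[i].G / 255f;
            b[i] = source[i].B / 255f;
            a[i] = source[i].A / 255f;
        }
    }

    // Repacks floats to bytes, clamping to 0..1 and rounding to nearest.
    public static void Pack(float[] r, float[] g, float[] b, float[] a, PackedRgba[] destination)
    {
        for (int i = 0; i < destination.Length; i++)
        {
            destination[i].R = ToByte(r[i]);
            destination[i].G = ToByte(g[i]);
            destination[i].B = ToByte(b[i]);
            destination[i].A = ToByte(a[i]);
        }
    }

    private static byte ToByte(float value)
    {
        float clamped = Math.Min(Math.Max(value, 0f), 1f);
        return (byte)(clamped * 255f + 0.5f); // adding 0.5 before truncation rounds to nearest
    }
}
```

Every pixel passes through both loops once per resize, which is why this step dominates once the convolution itself is vectorized.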

Alternatively, we can try implementing fixed-point math using Vector<ushort>.
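
As a rough idea of what fixed-point math would mean here, the scalar sketch below stores kernel weights as Q14 integers and accumulates byte samples in an int. All names (FixedPointSketch, ToFixed, Convolve) and the choice of 14 fractional bits are assumptions made for illustration. A Vector<ushort> version would additionally need widening before the multiply, which this sketch leaves out.

```csharp
using System;

public static class FixedPointSketch
{
    private const int Shift = 14;              // Q14: the weight 1.0 is stored as 16384
    private const int Half = 1 << (Shift - 1); // added before shifting to round to nearest

    // Converts float weights to Q14. Assumes |weight| < 2 so the result fits in a short.
    public static short[] ToFixed(float[] weights)
    {
        var result = new short[weights.Length];
        for (int i = 0; i < weights.Length; i++)
        {
            result[i] = (short)Math.Round(weights[i] * (1 << Shift));
        }

        return result;
    }

    // Weighted sum of byte samples with integer arithmetic only.
    public static byte Convolve(short[] weights, byte[] samples, int start)
    {
        int acc = Half;
        for (int i = 0; i < weights.Length; i++)
        {
            acc += weights[i] * samples[start + i];
        }

        int value = acc >> Shift; // back from Q14 to the 0..255 range
        // Kernels with negative lobes can overshoot, so clamp before narrowing.
        return (byte)Math.Min(Math.Max(value, 0), 255);
    }
}
```

This avoids the byte <-> float conversion entirely, at the cost of reduced precision in the weights.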

As always, community help is welcome!
