Description
I'm opening this issue to present and track my plan regarding the improvements we should implement in order to have a highly optimized ResizeProcessor for 1.0.
Goals
If we implement all the tasks, I expect that:
- The memory usage of `ResizeProcessor` will drop dramatically (reduced by a factor of 20x - 50x for typical images).
- The single-threaded execution time of (non-companding) resize operations will be about 70% of System.Drawing's resize time, while our quality is better. (Current status: 120-130%)
Tasks
- (1) Implement new data types for SOA representation of `Vector4` (or other 4 channel) buffers that could be used as SOA counterparts of the AOS types `IMemoryOwner<Vector4>`, `Buffer2D<Vector4>`, and `Span<Vector4>`. Something like `class BufferOf4Channels<T>`, `class BufferOf4Channels2D<T>` and `ref struct BufferSegmentOf4Channels<T>`.
- (2) Implement bulk packing methods in `PixelOperations<TPixel>` for the `BufferSegmentOf4Channels<float>`. `Rgba32` packing/unpacking should be optimized the same way it's done for `Span<Vector4>` packing.
- (3) Integrate the `SRgb` companding (Compress/Expand) operations into the APIs defined in (2).
- (4) Optional CPU optimization. Optimize the implementation of (3) for `Rgba32`, using lookup tables.
- (5) Replace all `Vector4` AOS buffers with SOA counterparts in `ResizeProcessor`.
- (6) Memory optimization. Implement "Optimize memory consumption of ResizeProcessor" (#642), preferably when (5) is fully implemented. Parallelization should be dropped.
- (7) CPU optimization. Implement vectorized convolution in `ResizeKernel`, using `Vector4` by default, and `Vector<float>` if AVX2 is detected (`Vector<float>.Count == 8`).
- (8) Optional CPU optimization. Vectorized `Premultiply` and `UnPremultiply` (both `Vector4` and AVX2 variants).
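
To make (1) and (7) a bit more concrete, here is a rough sketch of what an SOA buffer and a per-channel vectorized weighted sum could look like. The names `FourChannelBufferSketch` and `SoaConvolutionSketch` are made up for illustration and are not existing ImageSharp types. The sketch also assumes a runtime where `Vector<T>` has a `ReadOnlySpan<T>` constructor (.NET Core 2.1 or later).

```csharp
using System;
using System.Numerics;

// Hypothetical SOA buffer: one contiguous float array per channel,
// instead of a single array of interleaved Vector4 values (AOS).
public sealed class FourChannelBufferSketch
{
    public float[] X { get; }
    public float[] Y { get; }
    public float[] Z { get; }
    public float[] W { get; }
    public int Length { get; }

    public FourChannelBufferSketch(int length)
    {
        Length = length;
        X = new float[length];
        Y = new float[length];
        Z = new float[length];
        W = new float[length];
    }

    // Converts an AOS span of Vector4 into the SOA layout.
    public static FourChannelBufferSketch FromVector4(ReadOnlySpan<Vector4> source)
    {
        var result = new FourChannelBufferSketch(source.Length);
        for (int i = 0; i < source.Length; i++)
        {
            Vector4 v = source[i];
            result.X[i] = v.X;
            result.Y[i] = v.Y;
            result.Z[i] = v.Z;
            result.W[i] = v.W;
        }

        return result;
    }
}

public static class SoaConvolutionSketch
{
    // Weighted sum of a single channel: sum(values[i] * weights[i]).
    // Processes Vector<float>.Count elements per iteration
    // (8 when AVX2 is available), then finishes the tail with scalar code.
    public static float Convolve(ReadOnlySpan<float> values, ReadOnlySpan<float> weights)
    {
        if (values.Length != weights.Length)
        {
            throw new ArgumentException("Spans must have the same length.");
        }

        int width = Vector<float>.Count;
        Vector<float> acc = Vector<float>.Zero;
        int i = 0;
        for (; i <= values.Length - width; i += width)
        {
            acc += new Vector<float>(values.Slice(i, width))
                 * new Vector<float>(weights.Slice(i, width));
        }

        // Horizontal sum of the accumulator lanes.
        float sum = Vector.Dot(acc, Vector<float>.One);
        for (; i < values.Length; i++)
        {
            sum += values[i] * weights[i];
        }

        return sum;
    }
}
```

With this layout each channel can be convolved independently, e.g. `SoaConvolutionSketch.Convolve(buffer.X, kernelWeights)`, without shuffling interleaved components first.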
Outlook
If we manage to implement these points, the bottleneck would be the `Rgba32` <-> 4 x `float` unpacking/repacking operation (Update: done in #742). If we can optimize it, we can reach even better performance.
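
For reference, the unpacking/repacking step boils down to something like the scalar sketch below. This is only an illustration of the operation, not the optimized code from #742. `RgbaPackingSketch` is a hypothetical name, and the pixels are assumed to arrive as raw bytes in R, G, B, A order.

```csharp
using System;

public static class RgbaPackingSketch
{
    // Unpacks interleaved RGBA bytes into four float channels scaled to [0, 1].
    public static void Unpack(
        ReadOnlySpan<byte> rgba,
        Span<float> r, Span<float> g, Span<float> b, Span<float> a)
    {
        const float Scale = 1f / 255f;
        int count = rgba.Length / 4;
        for (int i = 0; i < count; i++)
        {
            int o = i * 4;
            r[i] = rgba[o] * Scale;
            g[i] = rgba[o + 1] * Scale;
            b[i] = rgba[o + 2] * Scale;
            a[i] = rgba[o + 3] * Scale;
        }
    }

    // Packs four float channels back into interleaved RGBA bytes,
    // rounding to nearest and clamping to the byte range.
    public static void Pack(
        ReadOnlySpan<float> r, ReadOnlySpan<float> g,
        ReadOnlySpan<float> b, ReadOnlySpan<float> a,
        Span<byte> rgba)
    {
        for (int i = 0; i < r.Length; i++)
        {
            int o = i * 4;
            rgba[o] = ToByte(r[i]);
            rgba[o + 1] = ToByte(g[i]);
            rgba[o + 2] = ToByte(b[i]);
            rgba[o + 3] = ToByte(a[i]);
        }
    }

    private static byte ToByte(float value)
    {
        // Adding 0.5 before truncation rounds to the nearest integer.
        float scaled = (value * 255f) + 0.5f;
        return (byte)Math.Clamp(scaled, 0f, 255f);
    }
}
```

Every pixel goes through this twice per resize (once in, once out), which is why it ends up dominating once the convolution itself is vectorized.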
Alternatively, we can try implementing fixed-point math using Vector<ushort>.
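
A very rough sketch of the fixed-point idea, assuming 8-bit channel values stored as `ushort` and weights in Q8 format (256 == 1.0). `FixedPointSketch` is a hypothetical name. Note the big limitation: the accumulator only stays within 16 bits if all weights are non-negative and sum to at most 256, so kernels with negative lobes (bicubic, Lanczos) would need a different representation.

```csharp
using System;
using System.Numerics;

public static class FixedPointSketch
{
    // Vertical weighted sum over several rows of one channel.
    // rows[k][i] holds 8-bit values (0..255) stored as ushort,
    // weightsQ8[k] holds non-negative Q8 weights that sum to at most 256.
    public static void ConvolveRows(ushort[][] rows, ushort[] weightsQ8, Span<ushort> destination)
    {
        int width = Vector<ushort>.Count;
        var half = new Vector<ushort>((ushort)128);   // rounding term: 0.5 in Q8
        var scale = new Vector<ushort>((ushort)256);  // 1.0 in Q8

        int i = 0;
        for (; i <= destination.Length - width; i += width)
        {
            Vector<ushort> acc = Vector<ushort>.Zero;
            for (int k = 0; k < rows.Length; k++)
            {
                acc += new Vector<ushort>(rows[k], i) * new Vector<ushort>(weightsQ8[k]);
            }

            // Division keeps the sketch portable; a logical right shift by 8
            // would be the natural choice where the runtime offers one.
            ((acc + half) / scale).CopyTo(destination.Slice(i, width));
        }

        // Scalar tail for the remaining elements.
        for (; i < destination.Length; i++)
        {
            int acc = 0;
            for (int k = 0; k < rows.Length; k++)
            {
                acc += rows[k][i] * weightsQ8[k];
            }

            destination[i] = (ushort)((acc + 128) >> 8);
        }
    }
}
```

Compared to `Vector<float>`, this processes twice as many elements per register, at the cost of precision and the weight restrictions above.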
As always, community help is welcome!