avfilter/tonemapx: add simd optimized tonemapx

This includes NEON for ARMv8, SSE for x86-64-v2 and AVX+FMA for x86-64-v3

Test results with 4K HEVC 10-bit HLG input, encoding with libx264 veryfast using bt2390:

Intel Core i9-12900:

tonemapx.c: 57fps
tonemapx.sse: 74fps
tonemapx.avx: 77fps

Apple M1 Max:

tonemapx.c: 43fps
tonemapx.neon: 57fps

For comparison, results of the original zscale+tonemap SIMD path:

Intel Core i9-12900:

tonemap.avx: 40fps
tonemap.sse: 40fps
tonemap.c: 32fps

Apple M1 Max:

tonemap.neon: 44fps
tonemap.c: 35fps

The original implementation is so memory-heavy that dual-channel
desktop CPUs easily become memory-bound, because an intermediate
RGBF32 framebuffer is shared with zscale. Tonemapx lowers the
bandwidth requirement, which brings a significant performance gain
on bandwidth-limited platforms. Even on the bandwidth-rich M1 Max
it still provides a significant performance boost thanks to a better
cache hit rate.
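
As a rough illustration of the bandwidth argument, the sketch below
compares per-frame buffer sizes. It rests on assumptions that are not
stated in this commit: a 3840x2160 frame, 4:2:0 10-bit input stored as
16-bit samples, and a three-channel float32 intermediate. The function
names are hypothetical and exist only for this example.

```python
# Back-of-the-envelope estimate of per-frame buffer sizes.
# Assumptions (not taken from the commit): 3840x2160 frames, 4:2:0 10-bit
# input stored in 16-bit samples, 3-channel float32 intermediate frame.

WIDTH, HEIGHT = 3840, 2160


def yuv420p10_bytes(width: int, height: int) -> int:
    """Luma plane plus two quarter-size chroma planes, 2 bytes per sample."""
    luma = width * height
    chroma = 2 * (width // 2) * (height // 2)
    return (luma + chroma) * 2


def rgbf32_bytes(width: int, height: int) -> int:
    """Three float32 channels per pixel, 4 bytes per channel."""
    return width * height * 3 * 4


src = yuv420p10_bytes(WIDTH, HEIGHT)
mid = rgbf32_bytes(WIDTH, HEIGHT)

print(f"10-bit 4:2:0 frame: {src / 2**20:.1f} MiB")
print(f"float32 RGB frame:  {mid / 2**20:.1f} MiB")
print(f"ratio: {mid / src:.1f}x")
```

Under these assumptions the float intermediate is four times the size
of the source frame, and it is far larger than typical CPU caches, so
every pass over it goes through main memory.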
gnattu 2024-06-25 20:40:06 +08:00
parent 443d842d1e
commit d37d7386e6
