third_party_ffmpeg/libavcodec/aarch64
Martin Storsjö 9f10cff610 aarch64: Add NEON optimizations for 10 and 12 bit vp9 loop filter
This work is sponsored by, and copyright, Google.

This is similar to the arm version, but due to the larger registers
on aarch64, we can do 8 pixels at a time for all filter sizes.

Examples of runtimes vs the 32-bit version, on a Cortex-A53:
                                             ARM AArch64
vp9_loop_filter_h_4_8_10bpp_neon:          213.2   172.6
vp9_loop_filter_h_8_8_10bpp_neon:          281.2   244.2
vp9_loop_filter_h_16_8_10bpp_neon:         657.0   444.5
vp9_loop_filter_h_16_16_10bpp_neon:       1280.4   877.7
vp9_loop_filter_mix2_h_44_16_10bpp_neon:   397.7   358.0
vp9_loop_filter_mix2_h_48_16_10bpp_neon:   465.7   429.0
vp9_loop_filter_mix2_h_84_16_10bpp_neon:   465.7   428.0
vp9_loop_filter_mix2_h_88_16_10bpp_neon:   533.7   499.0
vp9_loop_filter_mix2_v_44_16_10bpp_neon:   271.5   244.0
vp9_loop_filter_mix2_v_48_16_10bpp_neon:   330.0   305.0
vp9_loop_filter_mix2_v_84_16_10bpp_neon:   329.0   306.0
vp9_loop_filter_mix2_v_88_16_10bpp_neon:   386.0   365.0
vp9_loop_filter_v_4_8_10bpp_neon:          150.0   115.2
vp9_loop_filter_v_8_8_10bpp_neon:          209.0   175.5
vp9_loop_filter_v_16_8_10bpp_neon:         492.7   345.2
vp9_loop_filter_v_16_16_10bpp_neon:        951.0   682.7

This is significantly faster than the ARM version in almost
all cases; the exceptions are the mix2 functions, where the
gain is smaller.
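
The per-function gain can be derived from the table above. The following is a minimal sketch, assuming both columns are runtimes in the same (unstated) unit where lower is faster; the names `RUNTIMES`, `speedup` and `ratios` are illustrative and not part of the commit.

```python
# Runtimes copied from the table above as (ARM, AArch64) pairs.
# Assumption: both columns use the same unit and lower means faster.
RUNTIMES = {
    "h_4_8": (213.2, 172.6),
    "h_8_8": (281.2, 244.2),
    "h_16_8": (657.0, 444.5),
    "h_16_16": (1280.4, 877.7),
    "mix2_h_44_16": (397.7, 358.0),
    "mix2_h_48_16": (465.7, 429.0),
    "mix2_h_84_16": (465.7, 428.0),
    "mix2_h_88_16": (533.7, 499.0),
    "mix2_v_44_16": (271.5, 244.0),
    "mix2_v_48_16": (330.0, 305.0),
    "mix2_v_84_16": (329.0, 306.0),
    "mix2_v_88_16": (386.0, 365.0),
    "v_4_8": (150.0, 115.2),
    "v_8_8": (209.0, 175.5),
    "v_16_8": (492.7, 345.2),
    "v_16_16": (951.0, 682.7),
}


def speedup(arm, aarch64):
    """Return how many times faster the AArch64 runtime is than the ARM one."""
    return arm / aarch64


# Ratio above 1.0 means the AArch64 version is faster.
ratios = {name: speedup(arm, a64) for name, (arm, a64) in RUNTIMES.items()}

# Print the functions from smallest to largest gain.
for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"vp9_loop_filter_{name}_10bpp_neon: {ratio:.2f}x")
```

With these numbers, the mix2 functions come out at roughly 1.06x to 1.11x, while the remaining functions range from roughly 1.15x to 1.48x.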

Based on START_TIMER/STOP_TIMER wrapping around a few individual
functions, the speedup vs C code is around 2-3x.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-01-24 22:36:11 +02:00
asm-offsets.h Merge commit '705f5e5e155f6f280a360af220fc5b30cfcee702' 2016-01-02 11:14:28 +01:00
cabac.h Merge commit 'dfe224f377be3e45758c69d881ca7874b82d647a' 2014-03-09 13:27:04 +01:00
fft_init_aarch64.c Merge commit '97aec6e75ef36ed0402653519daa8e1fc8ddb555' 2016-04-12 15:43:09 +01:00
fft_neon.S Merge commit '780cd20b00a69e26bbfffbb8eec16fbe999ea793' 2014-12-09 12:08:29 +01:00
fmtconvert_init.c Merge commit 'a0fc780a2093784e8664f88205ee1b215e109cee' 2016-01-02 11:21:16 +01:00
fmtconvert_neon.S Merge commit 'a0fc780a2093784e8664f88205ee1b215e109cee' 2016-01-02 11:21:16 +01:00
h264chroma_init_aarch64.c
h264cmc_neon.S avcodec: fix vc1dsp dependencies 2016-09-25 13:11:45 +02:00
h264dsp_init_aarch64.c lavc/aarch64: Do not use the neon horizontal chroma loop filter for H.264 4:2:2. 2015-01-31 10:05:10 +01:00
h264dsp_neon.S
h264idct_neon.S aarch64: h264idct: Use the offset parameter to movrel 2016-12-08 18:11:07 +01:00
h264pred_init.c Merge commit 'f56d8d8dd72b1ab52aa814c5a0fccabf8040ef68' 2015-07-21 01:39:30 +02:00
h264pred_neon.S Merge commit 'f56d8d8dd72b1ab52aa814c5a0fccabf8040ef68' 2015-07-21 01:39:30 +02:00
h264qpel_init_aarch64.c arm64: constify src in h264qpel dsp function definitions 2015-06-24 08:41:32 +02:00
h264qpel_neon.S
hpeldsp_init_aarch64.c
hpeldsp_neon.S
Makefile aarch64: Add NEON optimizations for 10 and 12 bit vp9 loop filter 2017-01-24 22:36:11 +02:00
mdct_neon.S Merge commit 'ee2bc5974fe64fd214f52574400ae01c85f4b855' 2014-04-22 23:27:02 +02:00
mpegaudiodsp_init.c Merge commit '8f9fe6ae3461ce270bce6b7083fda5ec314cdad4' 2014-04-22 23:45:50 +02:00
mpegaudiodsp_neon.S Merge commit '41ed7ab45fc693f7d7fc35664c0233f4c32d69bb' 2016-06-21 21:55:34 +02:00
neon.S Merge commit 'cdb1665f70def544ddab3e3ed3763ef99c8b3873' 2016-04-24 12:51:42 +01:00
neontest.c avcodec: fix arguments on xmm/neon clobber test wrappers 2016-10-02 02:15:47 -03:00
rv40dsp_init_aarch64.c
synth_filter_init.c avcodec/synth_filter: split off remaining code from dcadec files 2016-01-25 14:57:38 -03:00
synth_filter_neon.S Merge commit '705f5e5e155f6f280a360af220fc5b30cfcee702' 2016-01-02 11:14:28 +01:00
vc1dsp_init_aarch64.c
videodsp_init.c Merge commit 'd3789eeeed3423bd1ca9dc40030a2f7a21ea5332' 2014-04-07 02:51:05 +02:00
videodsp.S Merge commit 'd3789eeeed3423bd1ca9dc40030a2f7a21ea5332' 2014-04-07 02:51:05 +02:00
vorbisdsp_init.c Merge commit '3956a5e0ea46ed7e27ca888fe11c47986ad99261' 2014-04-22 23:51:19 +02:00
vorbisdsp_neon.S Merge commit '3956a5e0ea46ed7e27ca888fe11c47986ad99261' 2014-04-22 23:51:19 +02:00
vp9dsp_init_10bpp_aarch64.c aarch64: Add NEON optimizations for 10 and 12 bit vp9 MC 2017-01-24 22:36:05 +02:00
vp9dsp_init_12bpp_aarch64.c aarch64: Add NEON optimizations for 10 and 12 bit vp9 MC 2017-01-24 22:36:05 +02:00
vp9dsp_init_16bpp_aarch64_template.c aarch64: Add NEON optimizations for 10 and 12 bit vp9 loop filter 2017-01-24 22:36:11 +02:00
vp9dsp_init_aarch64.c aarch64: Add NEON optimizations for 10 and 12 bit vp9 MC 2017-01-24 22:36:05 +02:00
vp9dsp_init.h aarch64: Add NEON optimizations for 10 and 12 bit vp9 MC 2017-01-24 22:36:05 +02:00
vp9itxfm_16bpp_neon.S aarch64: Add NEON optimizations for 10 and 12 bit vp9 itxfm 2017-01-24 22:36:08 +02:00
vp9itxfm_neon.S aarch64: vp9itxfm: Skip empty slices in the first pass of idct_idct 16x16 and 32x32 2017-01-14 21:13:32 +01:00
vp9lpf_16bpp_neon.S aarch64: Add NEON optimizations for 10 and 12 bit vp9 loop filter 2017-01-24 22:36:11 +02:00
vp9lpf_neon.S aarch64: vp9: loop filter: replace 'orr; cbn?z' with 'adds; b.{eq,ne}; 2017-01-14 21:13:10 +01:00
vp9mc_16bpp_neon.S aarch64: Add NEON optimizations for 10 and 12 bit vp9 MC 2017-01-24 22:36:05 +02:00
vp9mc_neon.S aarch64: vp9mc: Fix a comment to refer to a register with the right name 2017-01-14 21:13:43 +01:00