
890 Commits

Author SHA1 Message Date
James Almer
f9c3fbc00c Merge commit '3d69dd65c6771c28d3bf4e8e53a905aa8cd01fd9'
* commit '3d69dd65c6771c28d3bf4e8e53a905aa8cd01fd9':
  hevc: Add support for bitdepth 10 for IDCT DC

Merged-by: James Almer <jamrial@gmail.com>
2017-10-30 16:03:27 -03:00
James Almer
cc8c2d3609 Merge commit '358adef0305618219522858e471edf7e0cb4043e'
* commit '358adef0305618219522858e471edf7e0cb4043e':
  hevc: Add NEON IDCT DC functions for bitdepth 8

See 03cecf45c1

Merged-by: James Almer <jamrial@gmail.com>
2017-10-30 15:58:40 -03:00
James Almer
9840ca70e7 Merge commit '89d9869d2491d4209d707a8e7f29c58227ae5a4e'
* commit '89d9869d2491d4209d707a8e7f29c58227ae5a4e':
  hevc: Add NEON 16x16 IDCT

Merged-by: James Almer <jamrial@gmail.com>
2017-10-27 18:22:39 -03:00
James Almer
c0683dce89 Merge commit '0b9a237b2386ff84a6f99716bd58fa27a1b767e7'
* commit '0b9a237b2386ff84a6f99716bd58fa27a1b767e7':
  hevc: Add NEON 4x4 and 8x8 IDCT

[15:12:59] <@ubitux> hevc_idct_4x4_8_c: 389.1
[15:13:00] <@ubitux> hevc_idct_4x4_8_neon: 126.6
[15:13:02] <@ubitux> our ^
[15:13:06] <@ubitux> hevc_idct_4x4_8_c: 389.3
[15:13:08] <@ubitux> hevc_idct_4x4_8_neon: 107.8
[15:13:10] <@ubitux> hevc_idct_4x4_10_c: 418.6
[15:13:12] <@ubitux> hevc_idct_4x4_10_neon: 108.1
[15:13:14] <@ubitux> libav ^
[15:13:30] <@ubitux> so yeah, we can probably trash our versions here

Merged-by: James Almer <jamrial@gmail.com>
2017-10-24 19:10:22 -03:00
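Note: the timings quoted in the IRC excerpt above can be turned into ratios with a few lines of arithmetic. This is only a sketch; the helper name `speedup` is hypothetical, and the figures are copied from the log as-is (lower is better).

```python
def speedup(c_time, neon_time):
    """Return how many times faster the NEON version is than the C version."""
    return c_time / neon_time


# Figures copied from the IRC log in the commit message.
ours_8bit = speedup(389.1, 126.6)     # pair marked "our ^"
libav_8bit = speedup(389.3, 107.8)    # pair marked "libav ^", bitdepth 8
libav_10bit = speedup(418.6, 108.1)   # pair marked "libav ^", bitdepth 10
libav_vs_ours = 126.6 / 107.8         # NEON vs NEON, bitdepth 8

if __name__ == "__main__":
    print(f"ours 8-bit:   {ours_8bit:.2f}x over C")
    print(f"libav 8-bit:  {libav_8bit:.2f}x over C")
    print(f"libav 10-bit: {libav_10bit:.2f}x over C")
    print(f"libav NEON vs ours: {libav_vs_ours:.2f}x")
```

On these numbers the Libav 4x4 8-bit NEON function is roughly 1.17x faster than the previous one, which is consistent with the conclusion in the log.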
Muhammad Faiz
0780ad9c68 avcodec/rdft: remove sintable
It is redundant with costable. The first half of sintable is
identical to the second half of costable. The second half
of sintable is the negation of the first half of sintable.

The computation is changed to handle the sign of the sin values, in
both the C code and the ARM assembly code.

Signed-off-by: Muhammad Faiz <mfcc64@gmail.com>
2017-07-11 13:22:02 +07:00
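Note: the redundancy described in the commit above can be checked numerically. The sketch below is not the libavcodec code; it assumes a table layout in which costable holds n/2 entries mirrored around index n/4, and sintable holds n/4 values of sin(2*pi*i/n) followed by their negations. All function names are hypothetical.

```python
import math


def make_costable(n):
    """Cosine table with n/2 entries, mirrored around index n/4 (assumed layout)."""
    freq = 2.0 * math.pi / n
    tab = [0.0] * (n // 2)
    for i in range(n // 4 + 1):
        tab[i] = math.cos(i * freq)
    for i in range(1, n // 4):
        tab[n // 2 - i] = tab[i]
    return tab


def make_sintable(n):
    """Sine table: n/4 positive-angle values, then n/4 negated values (assumed layout)."""
    freq = 2.0 * math.pi / n
    first = [math.sin(i * freq) for i in range(n // 4)]
    second = [math.sin(-i * freq) for i in range(n // 4)]
    return first + second


def sin_from_costable(costab, n, i):
    """Recover sintable[i] using only costable and a sign flip."""
    quarter = n // 4
    if i < quarter:
        # First half of sintable equals the second half of costable.
        return costab[quarter + i]
    # Second half of sintable is the negated first half.
    return -costab[i]
```

Under this layout every sintable entry is reachable from costable with at most a sign change, which is the property the commit relies on.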
Clément Bœsch
b12a36170b lavc/aacpsdsp: use ptrdiff_t for stride in hybrid_analysis
2017-06-28 12:22:39 +02:00
Clément Bœsch
e4a27e2f2d lavc/arm: fix lack of precision in ff_ps_stereo_interpolate_neon
The code originally pre-multiplied the steps by 2, causing the running
sum of the h factors to drift away due to the lack of precision. It
quickly causes an inaccuracy > 0.01.

I tried diverse approaches such as multiplying by 2.0 (instead of adding
the value itself) without success.

I'm unable to bench the impact of this change, feel free to compare.

This commit fixes the incoming aacpsdsp tests.

Following is an alternative simplified function (matching the incoming
AArch64 code) that may be used:

function ff_ps_stereo_interpolate_neon, export=1
        vld1.32         {q0}, [r2]
        vld1.32         {q1}, [r3]
        ldr             r12, [sp]
        vmov.f32        q8, q0
        vmov.f32        q9, q1
        vzip.32         q8, q0
        vzip.32         q9, q1
1:
        vld1.32         {d4}, [r0,:64]
        vld1.32         {d6}, [r1,:64]
        vadd.f32        q8, q8, q9
        vadd.f32        q0, q0, q1
        vmov.f32        d5, d4
        vmov.f32        d7, d6
        vmul.f32        q2, q2, q8
        vmla.f32        q2, q3, q0
        vst1.32         {d4}, [r0,:64]!
        vst1.32         {d5}, [r1,:64]!
        subs            r12, r12, #1
        bgt             1b
        bx              lr
endfunc
2017-06-28 11:59:34 +02:00
Alexandra Hájková
3d69dd65c6 hevc: Add support for bitdepth 10 for IDCT DC
Signed-off-by: Martin Storsjö <martin@martin.st>
2017-04-25 22:48:45 +03:00
Seppo Tomperi
358adef030 hevc: Add NEON IDCT DC functions for bitdepth 8
Signed-off-by: Alexandra Hájková <alexandra@khirnov.net>
Signed-off-by: Martin Storsjö <martin@martin.st>
2017-04-25 22:48:45 +03:00
Alexandra Hájková
89d9869d24 hevc: Add NEON 16x16 IDCT
The speedup vs C code is around 6-13x.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-04-12 22:40:54 +03:00
Ronald S. Bultje
40cbd686dc idct_arm: remove use of ff_put/add_pixels_clamped function pointer.
Instead, hardcode the use of the _arm implementation of add_pixels,
and use the C version for put_pixels (as no arm-optimized version
exists). Since there's separate implementations of idct{,_put,_add}
for neon, this has no practical impact on performance.
2017-04-06 10:03:27 -04:00
Ronald S. Bultje
0c46641784 vp9: split out generic decoding skeleton interface API from VP9 types.
This allows vp9dsp.h to only include the VP9 types header, and not the
decoder skeleton interface which is for hardware decoders (dxva2/vaapi).
2017-03-28 18:04:27 -04:00
Ronald S. Bultje
f8c019944d vp9: re-split the decoder/format/dsp interface header files.
The advantage here is that the internal software decoder interface is
not exposed to the DSP functions or the hardware accelerations.
2017-03-28 18:04:26 -04:00
Martin Storsjö
fbc6f190a6 arm: Always build the hevcdsp_init_arm.c file
The main hevcdsp.c file calls this init function if HAVE_ARM is set,
regardless of whether neon support is available or not.

This fixes builds where neon isn't supported by the build tools at all.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-28 11:36:01 +03:00
Alexandra Hájková
0b9a237b23 hevc: Add NEON 4x4 and 8x8 IDCT
Optimized by Martin Storsjö <martin@martin.st>.

The speedup vs C code is around 3.2-4.4x.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-27 22:56:23 +03:00
Clément Bœsch
1c9f4b5078 lavc/vp9: split into vp9{block,data,mvs}
This is following Libav layout to ease merges.
2017-03-27 21:38:21 +02:00
James Almer
9a0fbb9ca9 Merge commit '2caa93b813adc5dbb7771dfe615da826a2947d18'
* commit '2caa93b813adc5dbb7771dfe615da826a2947d18':
  mpegaudiodsp: Change type of array stride parameters to ptrdiff_t

Merged-by: James Almer <jamrial@gmail.com>
2017-03-21 16:04:22 -03:00
James Almer
a8474df944 Merge commit 'e4a94d8b36c48d95a7d412c40d7b558422ff659c'
* commit 'e4a94d8b36c48d95a7d412c40d7b558422ff659c':
  h264chroma: Change type of stride parameters to ptrdiff_t

Merged-by: James Almer <jamrial@gmail.com>
2017-03-21 15:20:45 -03:00
James Almer
5a49097b42 Merge commit '2ec9fa5ec60dcd10e1cb10d8b4e4437e634ea428'
* commit '2ec9fa5ec60dcd10e1cb10d8b4e4437e634ea428':
  idct: Change type of array stride parameters to ptrdiff_t

Merged-by: James Almer <jamrial@gmail.com>
2017-03-21 14:29:52 -03:00
Clément Bœsch
51b5672f49 Merge commit '92c5755a185086067fe49e7e64c23a8e7011be31'
* commit '92c5755a185086067fe49e7e64c23a8e7011be31':
  hpeldsp: arm: Update comments left behind in 25841dfe80

Merged-by: Clément Bœsch <u@pkh.me>
2017-03-21 15:10:46 +01:00
Clément Bœsch
ad98af27f7 Merge commit 'de2ae3c1fae5a2eb539b9abd7bc2a9ca8c286ff0'
* commit 'de2ae3c1fae5a2eb539b9abd7bc2a9ca8c286ff0':
  lavc: add clobber tests for the new encoding/decoding API

The merge only re-orders what we already have.

Merged-by: Clément Bœsch <u@pkh.me>
2017-03-21 14:43:53 +01:00
Clément Bœsch
83cd80d10a Merge commit '12004a9a7f20e44f4da2ee6c372d5e1794c8d6c5'
* commit '12004a9a7f20e44f4da2ee6c372d5e1794c8d6c5':
  audiodsp/x86: yasmify vector_clipf_sse
  audiodsp: reorder arguments for vector_clipf

Merged the version from Libav after a discussion with James Almer on
IRC:

19:22 <ubitux> jamrial: opinion on 12004a9a7f20e44f4da2ee6c372d5e1794c8d6c5?
19:23 <ubitux> it was apparently yasmified differently
19:23 <ubitux> (it depends on the previous commit arg shuffle)
19:24 <ubitux> i don't see the magic movsxdifnidn in your port btw
19:24 <ubitux> it's a port from 1d36defe94
19:25 <jamrial> seems better thanks to said arg shuffle
19:25 <jamrial> the loop is the same, but init is simpler
19:25 <jamrial> probably worth merging
19:25 <ubitux> OK
19:25 <ubitux> thanks
19:26 <jamrial> curious they didn't make len ptrdiff_t after the previous bunch of commits, heh
19:26 <ubitux> yeah indeed

Both commits are merged at the same time to prevent a conflict with our
existing yasmified ff_vector_clipf_sse.

Merged-by: Clément Bœsch <u@pkh.me>
2017-03-20 22:35:07 +01:00
Clément Bœsch
b78243c504 lavc/arm: fix indent in blockdsp_init_neon
2017-03-20 19:01:25 +01:00
Clément Bœsch
e07fa3008b Merge commit 'de452e503734ebb0fdbce86e9d16693b3530fad3'
* commit 'de452e503734ebb0fdbce86e9d16693b3530fad3':
  pixblockdsp: Change type of stride parameters to ptrdiff_t

Merged-by: Clément Bœsch <u@pkh.me>
2017-03-20 15:58:32 +01:00
Martin Storsjö
eabc5abf94 arm: vp9itxfm16: Do a simpler half/quarter idct16/idct32 when possible
This work is sponsored by, and copyright, Google.

This avoids loading and calculating coefficients that we know will
be zero, and avoids filling the temp buffer with zeros in places
where we know the second pass won't read.

This gives a pretty substantial speedup for the smaller subpartitions.

The code size increases from 14516 bytes to 22484 bytes.

The idct16/32_end macros are moved above the individual functions; the
instructions themselves are unchanged, but since new functions are added
at the same place where the code is moved from, the diff looks rather
messy.

Before:                                 Cortex A7       A8       A9      A53
vp9_inv_dct_dct_16x16_sub1_add_10_neon:     454.0    270.7    418.5    295.4
vp9_inv_dct_dct_16x16_sub2_add_10_neon:    3840.2   3244.8   3700.1   2337.9
vp9_inv_dct_dct_16x16_sub4_add_10_neon:    4212.5   3575.4   3996.9   2571.6
vp9_inv_dct_dct_16x16_sub8_add_10_neon:    5174.4   4270.5   4615.5   3031.9
vp9_inv_dct_dct_16x16_sub12_add_10_neon:   5676.0   4908.5   5226.5   3491.3
vp9_inv_dct_dct_16x16_sub16_add_10_neon:   6403.9   5589.0   5839.8   3948.5
vp9_inv_dct_dct_32x32_sub1_add_10_neon:    1710.7    944.7   1582.1   1045.4
vp9_inv_dct_dct_32x32_sub2_add_10_neon:   21040.7  16706.1  18687.7  13193.1
vp9_inv_dct_dct_32x32_sub4_add_10_neon:   22197.7  18282.7  19577.5  13918.6
vp9_inv_dct_dct_32x32_sub8_add_10_neon:   24511.5  20911.5  21472.5  15367.5
vp9_inv_dct_dct_32x32_sub12_add_10_neon:  26939.5  24264.3  23239.1  16830.3
vp9_inv_dct_dct_32x32_sub16_add_10_neon:  29419.5  26845.1  25020.6  18259.9
vp9_inv_dct_dct_32x32_sub20_add_10_neon:  31146.4  29633.5  26803.3  19721.7
vp9_inv_dct_dct_32x32_sub24_add_10_neon:  33376.3  32507.8  28642.4  21174.2
vp9_inv_dct_dct_32x32_sub28_add_10_neon:  35629.4  35439.6  30416.5  22625.7
vp9_inv_dct_dct_32x32_sub32_add_10_neon:  37269.9  37914.9  32271.9  24078.9

After:
vp9_inv_dct_dct_16x16_sub1_add_10_neon:     454.0    276.0    418.5    295.1
vp9_inv_dct_dct_16x16_sub2_add_10_neon:    2336.2   1886.0   2251.0   1458.6
vp9_inv_dct_dct_16x16_sub4_add_10_neon:    2531.0   2054.7   2402.8   1591.1
vp9_inv_dct_dct_16x16_sub8_add_10_neon:    3848.6   3491.1   3845.7   2554.8
vp9_inv_dct_dct_16x16_sub12_add_10_neon:   5703.8   4831.6   5230.8   3493.4
vp9_inv_dct_dct_16x16_sub16_add_10_neon:   6399.5   5567.0   5832.4   3951.5
vp9_inv_dct_dct_32x32_sub1_add_10_neon:    1722.1    938.5   1577.3   1044.5
vp9_inv_dct_dct_32x32_sub2_add_10_neon:   15003.5  11576.8  13105.8   9602.2
vp9_inv_dct_dct_32x32_sub4_add_10_neon:   15768.5  12677.2  13726.0  10138.1
vp9_inv_dct_dct_32x32_sub8_add_10_neon:   17278.8  14825.4  14907.5  11185.7
vp9_inv_dct_dct_32x32_sub12_add_10_neon:  22335.7  21544.5  20379.5  15019.8
vp9_inv_dct_dct_32x32_sub16_add_10_neon:  24165.6  23881.7  21938.6  16308.2
vp9_inv_dct_dct_32x32_sub20_add_10_neon:  31082.2  30860.9  26835.3  19711.3
vp9_inv_dct_dct_32x32_sub24_add_10_neon:  33102.6  31922.8  28638.3  21161.0
vp9_inv_dct_dct_32x32_sub28_add_10_neon:  35104.9  34867.5  30411.7  22621.2
vp9_inv_dct_dct_32x32_sub32_add_10_neon:  37438.1  39103.4  32217.8  24067.6

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-19 22:54:33 +02:00
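Note: the idea behind the half/quarter variants in the commit above can be illustrated with a toy transform. The sketch below uses a naive floating-point inverse DCT (DCT-III), not the VP9 integer transform or the NEON code; the function name and the `nonzero` parameter are hypothetical. It shows that when the trailing coefficients are known to be zero, reading only the leading ones gives the same output with fewer multiply-accumulates.

```python
import math


def idct_1d(coeffs, nonzero=None):
    """Naive inverse DCT; only the first `nonzero` coefficients are read."""
    n = len(coeffs)
    if nonzero is None:
        nonzero = n
    out = []
    for x in range(n):
        acc = 0.0
        for k in range(nonzero):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            acc += scale * coeffs[k] * math.cos(math.pi * (2 * x + 1) * k / (2 * n))
        out.append(acc)
    return out


if __name__ == "__main__":
    # 16 coefficients, only the first quarter non-zero.
    coeffs = [10.0, -3.0, 2.0, 1.0] + [0.0] * 12
    full = idct_1d(coeffs)
    quarter = idct_1d(coeffs, nonzero=4)
    print(max(abs(a - b) for a, b in zip(full, quarter)))
```

The inner loop of the quarter variant runs 4 times per output instead of 16, which mirrors (in spirit only) why the sub2/sub4 cases in the benchmark table improve the most.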
Martin Storsjö
0ea603203d arm: vp9itxfm16: Make the larger core transforms standalone functions
This work is sponsored by, and copyright, Google.

This reduces the code size of libavcodec/arm/vp9itxfm_16bpp_neon.o from
17500 to 14516 bytes.

This gives a small slowdown of a couple tens of cycles, up to around
150 cycles for the full case of the largest transform, but makes
it more feasible to add more optimized versions of these transforms.

Before:                                 Cortex A7       A8       A9      A53
vp9_inv_dct_dct_16x16_sub4_add_10_neon:    4237.4   3561.5   3971.8   2525.3
vp9_inv_dct_dct_16x16_sub16_add_10_neon:   6371.9   5452.0   5779.3   3910.5
vp9_inv_dct_dct_32x32_sub4_add_10_neon:   22068.8  17867.5  19555.2  13871.6
vp9_inv_dct_dct_32x32_sub32_add_10_neon:  37268.9  38684.2  32314.2  23969.0

After:
vp9_inv_dct_dct_16x16_sub4_add_10_neon:    4375.1   3571.9   4283.8   2567.2
vp9_inv_dct_dct_16x16_sub16_add_10_neon:   6415.6   5578.9   5844.6   3948.3
vp9_inv_dct_dct_32x32_sub4_add_10_neon:   22653.7  18079.7  19603.7  13905.3
vp9_inv_dct_dct_32x32_sub32_add_10_neon:  37593.2  38862.2  32235.8  24070.9

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-19 22:54:19 +02:00
Martin Storsjö
32e273c111 arm: vp9itxfm16: Avoid reloading the idct32 coefficients
Keep the idct32 coefficients in narrow form in q6-q7, and idct16
coefficients in lengthened 32 bit form in q0-q3. Avoid clobbering
q0-q3 in the pass1 function, and squeeze the idct16 coefficients
into q0-q1 in the pass2 function to avoid reloading them.

The idct16 coefficients are clobbered and reloaded within idct32_odd
though, since that turns out to be faster than narrowing them and
swapping them into q6-q7.

Before:                            Cortex       A7        A8        A9      A53
vp9_inv_dct_dct_32x32_sub4_add_10_neon:    22653.8   18268.4   19598.0  14079.0
vp9_inv_dct_dct_32x32_sub32_add_10_neon:   37699.0   38665.2   32542.3  24472.2
After:
vp9_inv_dct_dct_32x32_sub4_add_10_neon:    22270.8   18159.3   19531.0  13865.0
vp9_inv_dct_dct_32x32_sub32_add_10_neon:   37523.3   37731.6   32181.7  24071.2

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-19 22:53:57 +02:00
Martin Storsjö
c1619318e5 arm: vp9itxfm16: Fix vertical alignment
Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-19 22:53:48 +02:00
Martin Storsjö
b46d37e93a arm: vp9itxfm16: Use the right lane size
This makes the code slightly clearer, but doesn't make any functional
difference.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-19 22:53:43 +02:00
Martin Storsjö
21c89f3a26 arm/aarch64: vp9: Fix vertical alignment
Align the second/third operands as they usually are.

Due to the wildly varying sizes of the written out operands
in aarch64 assembly, the column alignment is usually not as clear
as in arm assembly.

This is cherrypicked from libav commit
7995ebfad1.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-19 22:53:32 +02:00
Martin Storsjö
70317b25aa arm/aarch64: vp9itxfm: Skip loading the min_eob pointer when it won't be used
In the half/quarter cases where we don't use the min_eob array, defer
loading the pointer until we know it will be needed.

This is cherrypicked from libav commit
3a0d5e206d.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-19 22:53:28 +02:00
Martin Storsjö
b7a565fe71 arm: vp9itxfm: Template the quarter/half idct32 function
This reduces the number of lines and reduces the duplication.

Also simplify the eob check for the half case.

If we are in the half case, we know we will need to do at least the
first three slices; we only need to check eob for the fourth one,
so we can hardcode the value to check against instead of loading
from the min_eob array.

Since at most one slice can be skipped in the first pass, we can
unroll the loop for filling zeros completely, as it was done for
the quarter case before.

This allows skipping loading the min_eob pointer when using the
quarter/half cases.

This is cherrypicked from libav commit
98ee855ae0.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-19 22:53:22 +02:00
James Almer
6966a5e4d7 Merge commit '721d57e608dc4fd6c86f27c5ae76ef559d646220'
* commit '721d57e608dc4fd6c86f27c5ae76ef559d646220':
  vp56: Separate VP5 and VP6 dsp initialization

Merged-by: James Almer <jamrial@gmail.com>
2017-03-19 17:15:24 -03:00
James Almer
4e4dfcac58 Merge commit '802727b538b484e3f9d1345bfcc4ab24cfea8898'
* commit '802727b538b484e3f9d1345bfcc4ab24cfea8898':
  vp8: Update some assembly comments left unchanged in bd66f073fe

Merged-by: James Almer <jamrial@gmail.com>
2017-03-19 15:18:31 -03:00
James Almer
4004d33fcb Merge commit 'd9d26a3674f31f482f54e936fcb382160830877a'
* commit 'd9d26a3674f31f482f54e936fcb382160830877a':
  vp56: Change type of stride parameters to ptrdiff_t

Merged-by: James Almer <jamrial@gmail.com>
2017-03-19 14:54:25 -03:00
Clément Bœsch
6a42a54b9d Merge commit '6892df9294d93322d43255ada299507465bc93c8'
* commit '6892df9294d93322d43255ada299507465bc93c8':
  vp3: Change type of stride parameters to ptrdiff_t

Merged-by: Clément Bœsch <u@pkh.me>
2017-03-19 18:41:26 +01:00
Clément Bœsch
a754fae4a7 Merge commit '014852e932dab6e9cf2a53e7a17ce8321f3e922c'
* commit '014852e932dab6e9cf2a53e7a17ce8321f3e922c':
  simple_idct: arm: Drop disabled code variant

Merged-by: Clément Bœsch <u@pkh.me>
2017-03-19 16:12:07 +01:00
Martin Storsjö
7995ebfad1 arm/aarch64: vp9: Fix vertical alignment
Align the second/third operands as they usually are.

Due to the wildly varying sizes of the written out operands
in aarch64 assembly, the column alignment is usually not as clear
as in arm assembly.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-16 23:09:00 +02:00
Martin Storsjö
3a0d5e206d arm/aarch64: vp9itxfm: Skip loading the min_eob pointer when it won't be used
In the half/quarter cases where we don't use the min_eob array, defer
loading the pointer until we know it will be needed.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 22:07:30 +02:00
Martin Storsjö
98ee855ae0 arm: vp9itxfm: Template the quarter/half idct32 function
This reduces the number of lines and reduces the duplication.

Also simplify the eob check for the half case.

If we are in the half case, we know we will need to do at least the
first three slices; we only need to check eob for the fourth one,
so we can hardcode the value to check against instead of loading
from the min_eob array.

Since at most one slice can be skipped in the first pass, we can
unroll the loop for filling zeros completely, as it was done for
the quarter case before.

This allows skipping loading the min_eob pointer when using the
quarter/half cases.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 22:07:12 +02:00
Martin Storsjö
b2e20d8984 arm: vp9itxfm: Reorder iadst16 coeffs
This matches the order they are in the 16 bpp version.

There they are in this order, to make sure we access them in the
same order they are declared, easing loading only half of the
coefficients at a time.

This makes the 8 bpp version match the 16 bpp version better.

This is cherrypicked from libav commit
08074c092d.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:52 +02:00
Martin Storsjö
4f693b56bd arm: vp9itxfm: Reorder the idct coefficients for better pairing
All elements are used pairwise, except for the first one.
Previously, the 16th element was unused. Move the unused element
to the second slot, to make the later element pairs not split
across registers.

This simplifies loading only parts of the coefficients,
reducing the difference to the 16 bpp version.

This is cherrypicked from libav commit
de06bdfe6c.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:51 +02:00
Martin Storsjö
600f4c9b03 arm: vp9itxfm: Avoid reloading the idct32 coefficients
The idct32x32 function actually pushed q4-q7 onto the stack even
though it didn't clobber them; there are plenty of registers that
can be used to allow keeping all the idct coefficients in registers
without having to reload different subsets of them at different
stages in the transform.

Since the idct16 core transform avoids clobbering q4-q7 (but clobbers
q2-q3 instead, to avoid needing to back up and restore q4-q7 at all
in the idct16 function), and the lanewise vmul needs a register in
the q0-q3 range, we move the stored coefficients from q2-q3 into q4-q5
while doing idct16.

While keeping these coefficients in registers, we still can skip pushing
q7.

Before:                              Cortex A7       A8       A9      A53
vp9_inv_dct_dct_32x32_sub32_add_neon:  18553.8  17182.7  14303.3  12089.7
After:
vp9_inv_dct_dct_32x32_sub32_add_neon:  18470.3  16717.7  14173.6  11860.8

This is cherrypicked from libav commit
402546a172.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:51 +02:00
Martin Storsjö
a88db8b9a0 arm: vp9lpf: Implement the mix2_44 function with one single filter pass
For this case, with 8 inputs but only changing 4 of them, we can fit
all 16 input pixels into a q register, and still have enough temporary
registers for doing the loop filter.

The wd=8 filters would require too many temporary registers for
processing all 16 pixels at once though.

Before:                          Cortex A7      A8     A9     A53
vp9_loop_filter_mix2_v_44_16_neon:   289.7   256.2  237.5   181.2
After:
vp9_loop_filter_mix2_v_44_16_neon:   221.2   150.5  177.7   138.0

This is cherrypicked from libav commit
575e31e931.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:51 +02:00
Martin Storsjö
3fbbad2984 arm/aarch64: vp9lpf: Keep the comparison to E within 8 bit
The theoretical maximum value of E is 193, so we can just
saturate the addition to 255.

Before:                     Cortex A7      A8      A9     A53  A53/AArch64
vp9_loop_filter_v_4_8_neon:     143.0   127.7   114.8    88.0         87.7
vp9_loop_filter_v_8_8_neon:     241.0   197.2   173.7   140.0        136.7
vp9_loop_filter_v_16_8_neon:    497.0   419.5   379.7   293.0        275.7
vp9_loop_filter_v_16_16_neon:   965.2   818.7   731.4   579.0        452.0
After:
vp9_loop_filter_v_4_8_neon:     136.0   125.7   112.6    84.0         83.0
vp9_loop_filter_v_8_8_neon:     234.0   195.5   171.5   136.0        133.7
vp9_loop_filter_v_16_8_neon:    490.0   417.5   377.7   289.0        271.0
vp9_loop_filter_v_16_16_neon:   951.2   814.7   732.3   571.0        446.7

This is cherrypicked from libav commit
c582cb8537.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:50 +02:00
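Note: the saturation argument in the commit above can be checked exhaustively. The sketch below assumes the compared quantity is abs(p0 - q0) * 2 + (abs(p1 - q1) >> 1) for 8-bit pixels, and models the 8-bit saturating add with `min(a + b, 255)`; it is not the NEON code, and all function names are hypothetical.

```python
def sat_add_u8(a, b):
    """Unsigned 8-bit saturating add."""
    return min(a + b, 255)


def mask_wide(d0, d1, e):
    """Comparison done with enough bits to never overflow."""
    return 2 * d0 + (d1 >> 1) <= e


def mask_u8(d0, d1, e):
    """Same comparison, with every intermediate kept within 8 bits."""
    doubled = sat_add_u8(d0, d0)
    return sat_add_u8(doubled, d1 >> 1) <= e


def masks_agree(e):
    """Check all absolute differences d0, d1 in 0..255 for a given E."""
    for d0 in range(256):
        for d1 in range(256):
            if mask_wide(d0, d1, e) != mask_u8(d0, d1, e):
                return False
    return True
```

A saturated sum of 255 still compares as greater than any E below 255, so the two forms agree for E up to the stated maximum of 193; they would only diverge if E could reach 255.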
Martin Storsjö
83399cf569 arm: vp9lpf: Interleave the start of flat8in into the calculation above
This adds lots of extra .ifs, but speeds it up by a couple cycles,
by avoiding stalls.

This is cherrypicked from libav commit
e18c39005a.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:49 +02:00
Martin Storsjö
92ab8374b1 arm: vp9lpf: Use orrs instead of orr+cmp
This is cherrypicked from libav commit
435cd7bc99.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:49 +02:00
Martin Storsjö
f0ecbb13cf arm/aarch64: vp9lpf: Calculate !hev directly
Previously we first calculated hev, and then negated it.

Since we were able to schedule the negation in the middle
of another calculation, we don't see a gain in every case.

Before:                     Cortex A7      A8      A9     A53  A53/AArch64
vp9_loop_filter_v_4_8_neon:     147.0   129.0   115.8    89.0         88.7
vp9_loop_filter_v_8_8_neon:     242.0   198.5   174.7   140.0        136.7
vp9_loop_filter_v_16_8_neon:    500.0   419.5   382.7   293.0        275.7
vp9_loop_filter_v_16_16_neon:   971.2   825.5   731.5   579.0        453.0
After:
vp9_loop_filter_v_4_8_neon:     143.0   127.7   114.8    88.0         87.7
vp9_loop_filter_v_8_8_neon:     241.0   197.2   173.7   140.0        136.7
vp9_loop_filter_v_16_8_neon:    497.0   419.5   379.7   293.0        275.7
vp9_loop_filter_v_16_16_neon:   965.2   818.7   731.4   579.0        452.0

This is cherrypicked from libav commit
e1f9de86f4.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:48 +02:00
Martin Storsjö
758302e4bc arm: vp9itxfm: Optimize 16x16 and 32x32 idct dc by unrolling
This work is sponsored by, and copyright, Google.

Before:                            Cortex A7      A8      A9     A53
vp9_inv_dct_dct_16x16_sub1_add_neon:   273.0   189.5   211.7   235.8
vp9_inv_dct_dct_32x32_sub1_add_neon:   752.0   459.2   862.2   553.9
After:
vp9_inv_dct_dct_16x16_sub1_add_neon:   226.5   145.0   225.1   171.8
vp9_inv_dct_dct_32x32_sub1_add_neon:   721.2   415.7   727.6   475.0

This is cherrypicked from libav commit
a76bf8cf12.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:48 +02:00
Martin Storsjö
bff0771590 arm: vp9mc: Calculate less unused data in the 4 pixel wide horizontal filter
Before:                    Cortex A7      A8     A9     A53
vp9_put_8tap_smooth_4h_neon:   378.1   273.2  340.7   229.5
After:
vp9_put_8tap_smooth_4h_neon:   352.1   222.2  290.5   229.5

This is cherrypicked from libav commit
fea92a4b57.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-11 13:14:47 +02:00