The two halves are provided as two uint64_t values that must not be
sign-extended when combined. Treat them as uint64_t until they are
merged into a single int128_t. Fixes long signed divide.
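A minimal sketch of the idea, using GCC/Clang's `__int128` extension
(names are illustrative, not the actual implementation):
```cpp
#include <cstdint>

// The pitfall: widening the low half as a *signed* value smears its
// top bit across the upper 64 bits before the real high half is OR'd
// in. Keeping both halves unsigned while combining avoids that.
__int128 CombineHalves(uint64_t High, uint64_t Low) {
  unsigned __int128 Combined =
      (static_cast<unsigned __int128>(High) << 64) | Low;
  // Only reinterpret as signed once all 128 bits are in place.
  return static_cast<__int128>(Combined);
}
```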
This fails on current main with blocksize=500 due to the mentioned RA
bug; it passes with blocksize=1.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
At the moment we always run ctest with the maximum number of CPUs. If
TEST_JOB_COUNT is undefined, the current behaviour is kept; otherwise
ctest honours TEST_JOB_COUNT.
Therefore, to run ctest one test at a time, use
`cmake ... -DTEST_JOB_COUNT=1`
We had failed to enable these implementations for the
`ExtendedMemOperand` helpers. We had already implemented the non-helper
forms, which are tested in CI; these helpers just never got updated.
Noticed this when running libaom's SSE4.1 tests, which managed to
execute a pmovzxbq instruction with a reg+reg memory source, breaking
the test results.
There are /very/ few vector register operations that access only 8-bit
or 16-bit elements, so this flew under the radar for quite a while.
Fixes libaom's unit tests.
Also adds a unit test using SSE4.1 pmovzxbq to ensure we support the
reg+reg case, along with a few other instructions to exercise 8-bit and
16-bit vector loads and stores.
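For reference, a minimal sketch of what pmovzxbq with a reg+reg memory
source computes (plain C++, names illustrative, not the emulator's
implementation):
```cpp
#include <cstdint>

struct Vec128 {
  uint64_t Lanes[2];
};

// pmovzxbq with a reg+reg address: read two bytes from [Base + Index]
// and zero-extend each byte into its own 64-bit lane.
Vec128 Pmovzxbq(const uint8_t* Base, uint64_t Index) {
  const uint8_t* Addr = Base + Index; // the reg+reg memory source
  return Vec128{{Addr[0], Addr[1]}};  // byte -> uint64_t, zero-extended
}
```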
We didn't have a unit test for this, and we weren't implementing it at
all: we accidentally treated it as vmovhps/vmovhpd. Once again caught
by the libaom Intrinsics unit tests.
This is a very minor performance change. Cortex CPUs that support SVE
fuse movprfx+<instruction> pairs, removing two cycles and a dependency
from the backend.
Because of this, converting from ASIMD mov+bsl to SVE movprfx+bsl is a
minor win, saving two cycles and a dependency on Cortex-A710 and A715.
It is slightly less of a win on Cortex-A720/A725 because those cores
support zero-cycle vector register renames, but it is still a win on
Cortex-X925 because that is an older core design that doesn't support
zero-cycle vector register renames.
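A rough sketch of the before/after sequences in AArch64 assembly
(register choices illustrative, and operand roles differ slightly
between the ASIMD and SVE2 encodings):
```
// ASIMD form: bsl is destructive, so a non-destructive select needs an
// explicit register copy first.
mov  v0.16b, v1.16b
bsl  v0.16b, v2.16b, v3.16b

// SVE2 form: movprfx provides the copy; Cortex-A710/A715 fuse it with
// the following bsl, removing the copy's two cycles and its dependency.
movprfx z0, z1
bsl     z0.d, z0.d, z2.d, z3.d
```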
Very silly little thing.
The canonical way to generate a zeroed vector register in x86 is to xor
it with itself. Capture this idiom and convert it to the canonical zero
register instead, which can get zero-cycle renamed on the latest CPUs.
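An illustrative peephole sketch of the idiom capture; the IR types and
names here are hypothetical, not the actual implementation:
```cpp
#include <cstdint>
#include <optional>

enum class Op { VXor, Other };

struct Inst {
  Op Code;
  uint32_t Dest, Src1, Src2;
};

// `pxor xmm0, xmm0` (and friends) always produce zero regardless of
// the register's prior contents, so the xor can be rewritten as a move
// from a canonical zero vector, which the newest cores rename in zero
// cycles instead of executing.
std::optional<uint32_t> MatchXorZeroIdiom(const Inst& I) {
  if (I.Code == Op::VXor && I.Src1 == I.Src2) {
    return I.Dest; // rewrite as: Dest = canonical zero
  }
  return std::nullopt;
}
```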
x86 doesn't have a lot of non-temporal vector stores, but there are a
few:
- MMX: MOVNTQ
- SSE2: MOVNTDQ, MOVNTPS, MOVNTPD
- AVX: VMOVNTDQ (128-bit & 256-bit), VMOVNTPD
Additionally, SSE4a adds 32-bit and 64-bit scalar vector non-temporal
stores, which we keep as regular stores since ARM doesn't have matching
semantics for those.
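As a hedged illustration of what the stores listed above mean on the
x86 side (x86-only code, names illustrative): `_mm_stream_si128`
compiles to MOVNTDQ. How the emulator lowers it on AArch64 is not shown
here.
```cpp
#include <emmintrin.h> // SSE2

void StreamStore128(__m128i* Dest, __m128i Value) {
  // MOVNTDQ: a write-combining store that bypasses the cache hierarchy
  // and is weakly ordered relative to normal stores.
  _mm_stream_si128(Dest, Value);
}
```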
Additionally, SSE4.1 adds non-temporal vector LOADS, which this change
doesn't touch:
- SSE4.1: MOVNTDQA
- AVX: VMOVNTDQA (128-bit)
- AVX2: VMOVNTDQA (256-bit)
Fixes #3364