At the moment we always run ctest with the maximum number of CPUs. If
TEST_JOB_COUNT is undefined we keep the current behaviour, otherwise we
honour it.
So to run ctest one test at a time, use
`cmake ... -DTEST_JOB_COUNT=1`
We had failed to enable these implementations for the
`ExtendedMemOperand` helpers. The non-helper forms were already
implemented and are already tested in CI; these helpers just never got
updated.
Noticed this when running libaom's SSE4.1 tests, where it managed to
execute a pmovzxbq instruction with a reg+reg memory source, which was
breaking the test results.
There are /very/ few vector register operations that access only 8 or
16 bits of a vector, so this flew under the radar for quite a while.
Fixes libaom's unit tests.
Also adds a unit test using SSE4.1 pmovzxbq to ensure we support the
reg+reg case, plus a few other instructions to cover 8-bit and 16-bit
vector loads and stores.
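For illustration only (this isn't the FEX unit test, and `widen_low_bytes` is a made-up name), this is the kind of SSE4.1 intrinsics pattern that produces the problematic encoding; with optimizations on, compilers typically fold the load into pmovzxbq with a [base + index] (reg+reg) memory source:

```cpp
#include <smmintrin.h> // SSE4.1 intrinsics
#include <cstddef>
#include <cstdint>

// Hypothetical kernel: zero-extend two bytes at buf[i] into the two 64-bit
// lanes of an XMM register. Assumes at least 16 readable bytes at buf + i so
// the unfolded 128-bit load is also safe.
__m128i widen_low_bytes(const uint8_t* buf, size_t i) {
  __m128i bytes = _mm_loadu_si128(reinterpret_cast<const __m128i*>(buf + i));
  // Optimizing compilers usually fold the load into the conversion, emitting
  // something like: pmovzxbq xmm0, word ptr [rdi + rsi]
  return _mm_cvtepu8_epi64(bytes);
}
```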
We didn't have a unit test for this and we weren't implementing it at
all; we accidentally treated it as vmovhps/vmovhpd. Once again caught by
the libaom intrinsics unit tests.
This is a very minor performance change. Cortex CPUs that support SVE
fuse movprfx+<instruction> to remove two cycles and a dependency from
the backend.
Because of this, converting from ASIMD mov+bsl to SVE movprfx+bsl is a
minor win, saving two cycles and a dependency on Cortex-A710 and
Cortex-A715. It is slightly less of a win on Cortex-A720/A725 since
those support zero-cycle vector register renames, but it is still a win
on Cortex-X925 because that is an older core design that doesn't
support zero-cycle vector register renames.
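For context, bsl is a destructive bitwise select (one source shares a register with the destination), which is why the non-fused sequence needs a register copy in front of it. A minimal scalar model of the select itself, with made-up operand names:

```cpp
#include <cstdint>

// Bitwise select: take bits from `a` where `mask` is set, from `b` elsewhere.
// The vector forms are destructive, so emitting the operation with a distinct
// destination needs the destination seeded first (ASIMD mov, or SVE movprfx,
// which the listed cores can fuse with the following instruction).
uint64_t bitwise_select(uint64_t mask, uint64_t a, uint64_t b) {
  return (a & mask) | (b & ~mask);
}
```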
Very silly little thing.
The canonical way to generate a zeroed vector register in x86 is to XOR
the register with itself. Capture this and convert it to the canonical
zero register instead, which can get zero-cycle renamed on the latest
CPUs.
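A minimal sketch of the guest-side idiom (the function name is made up); both `_mm_setzero_si128()` and XORing a register with itself typically lower to the same self-XOR instruction:

```cpp
#include <emmintrin.h> // SSE2 intrinsics

// Canonical x86 zeroing idiom: XOR a register with itself (pxor xmm, xmm).
// The result is all-zero regardless of the incoming value, so the JIT can
// substitute its canonical zero register instead of emitting a real XOR.
__m128i zero_vector(__m128i v) {
  return _mm_xor_si128(v, v);
}
```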
x86 doesn't have a lot of non-temporal vector stores, but there are a
few of them:
- MMX: MOVNTQ
- SSE2: MOVNTDQ, MOVNTPS, MOVNTPD
- AVX: VMOVNTDQ (128-bit & 256-bit), VMOVNTPD
Additionally, SSE4a adds 32-bit and 64-bit scalar vector non-temporal
stores, which we keep as regular stores since ARM doesn't have matching
semantics for those.
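For context, a minimal sketch of how guest code typically reaches these stores through the stock Intel intrinsics (the copy routine itself is made up for illustration):

```cpp
#include <emmintrin.h> // SSE2: _mm_stream_si128 lowers to MOVNTDQ
#include <cstddef>
#include <cstdint>

// Hypothetical streaming copy: non-temporal stores bypass the cache via
// write-combining. dst must be 16-byte aligned and len a multiple of 16.
void stream_copy(uint8_t* dst, const uint8_t* src, size_t len) {
  for (size_t i = 0; i < len; i += 16) {
    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
    _mm_stream_si128(reinterpret_cast<__m128i*>(dst + i), v); // MOVNTDQ
  }
  _mm_sfence(); // order the non-temporal stores before anything that follows
}
```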
Additionally, SSE4.1 adds non-temporal vector LOADS, which this change
doesn't touch:
- SSE4.1: MOVNTDQA
- AVX: VMOVNTDQA (128-bit)
- AVX2: VMOVNTDQA (256-bit)
Fixes #3364
Arm64's SVE load instruction can be minorly optimized in the case where
a base GPR isn't provided, as there is a form of the instruction that
doesn't require one.
The limitation of that form is that it doesn't support scaling at all,
so it only works if the offset scale is 1.
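As a sanity check on why the scale matters, here's a scalar model of what a gathered load computes (`gather_u64` is a hypothetical helper, not FEX code); the form without a base register uses the index values as-is, which only lines up when the scale is 1:

```cpp
#include <cstddef>
#include <cstdint>

// Scalar model of a gathered 64-bit load: lane i comes from
// base + index[i] * scale. A load form without a base register has to use the
// index values unmodified, so it can only be substituted when the base is
// absent and scale == 1.
void gather_u64(uint64_t* dst, const uint8_t* base, const uint64_t* index,
                size_t lanes, size_t scale) {
  for (size_t i = 0; i < lanes; ++i) {
    dst[i] = *reinterpret_cast<const uint64_t*>(base + index[i] * scale);
  }
}
```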
When FEX hits the optimal case where the destination isn't one of the
incoming sources (other than the incomingDest source), we can optimize
out two moves per 128-bit lane.
Cuts 256-bit non-SVE gather loads from 50 instructions down to 46.
Some applications don't measure rdtsc correctly and instead use cpuinfo
to get the CPU core's base clock speed, which for most x86 CPUs also
matches their cycle counter speed.
Did this as a quick test to see if it would help with `Unbound: Worlds
Apart` stuttering while BinaryNinja was disassembling the binary.
Turns out the game doesn't use cpuinfo for its cycle counter speed
determination, but it is good to implement this regardless.
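A rough sketch of the heuristic such applications rely on, assuming they scrape the advertised base clock out of /proc/cpuinfo's "model name" line (the exact parsing varies per application; this snippet is illustrative only):

```cpp
#include <fstream>
#include <iostream>
#include <string>

// Hypothetical sketch: read the advertised base clock, e.g. "... @ 3.60GHz",
// from /proc/cpuinfo and treat it as the cycle counter (TSC) frequency.
int main() {
  std::ifstream cpuinfo("/proc/cpuinfo");
  std::string line;
  while (std::getline(cpuinfo, line)) {
    if (line.find("model name") == std::string::npos) continue;
    auto at = line.rfind('@');
    auto ghz = line.rfind("GHz");
    if (at == std::string::npos || ghz == std::string::npos || ghz <= at) break;
    double base_ghz = std::stod(line.substr(at + 1, ghz - at - 1));
    std::cout << "Assumed cycle counter speed: " << base_ghz << " GHz\n";
    return 0;
  }
  std::cout << "No base clock advertised in cpuinfo\n";
  return 1;
}
```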
We can support a few combinations of guest and host vector sizes, with
the host and guest each being 128-bit or 256-bit:
- Host=128-bit, Guest=256-bit: the typical case now that AVX is
  implemented.
- Host=128-bit, Guest=128-bit: 32-bit guests, because we disable AVX
  there.
- Host=256-bit, Guest=128-bit: 32-bit guests under the vixl simulator.
- Host=256-bit, Guest=256-bit: 64-bit guests under the vixl simulator.
We cover all four combinations of guest and host vector register sizes!
Fixes a few assumptions that SVE256 = AVX256 basically.