The two halves are provided as two uint64_t values and must not be sign extended individually. Treat them as uint64_t until they are combined into a single int128_t. Fixes long signed divide.
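A minimal sketch of the combining step, assuming GCC/Clang's `__int128` extension (hypothetical helper names, not FEX's actual code):

```cpp
#include <cstdint>

// Widen both halves as unsigned so neither half gets sign extended on its
// own; only the combined 128-bit value is treated as signed.
__int128 CombineHalves(uint64_t High, uint64_t Low) {
  unsigned __int128 Combined = (static_cast<unsigned __int128>(High) << 64) | Low;
  return static_cast<__int128>(Combined);
}

// A long signed divide then operates on the combined value.
int64_t SignedDivide(uint64_t High, uint64_t Low, int64_t Divisor) {
  return static_cast<int64_t>(CombineHalves(High, Low) / Divisor);
}
```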
At the moment we always run ctest with the maximum number of CPUs. If TEST_JOB_COUNT is undefined, the current behaviour is kept; otherwise it is honoured.
Therefore, to run ctest one test at a time, use
`cmake ... -DTEST_JOB_COUNT=1`
We didn't have a unit test for this and we weren't implementing it at
all.
We treated it as vmovhps/vmovhpd accidentally. Once again caught by the
libaom Intrinsics unit tests.
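For reference, the low/high half-move semantics look like this with standard SSE2 intrinsics, assuming the instruction at issue is the low-half counterpart (vmovlps/vmovlpd):

```cpp
#include <emmintrin.h>

// movlpd/vmovlpd: replace element [0] from memory, keep element [1].
__m128d LoadLow(__m128d Vec, const double* Mem) {
  return _mm_loadl_pd(Vec, Mem);
}

// movhpd/vmovhpd: keep element [0], replace element [1] from memory.
__m128d LoadHigh(__m128d Vec, const double* Mem) {
  return _mm_loadh_pd(Vec, Mem);
}
```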
This is a very minor performance change. Cortex CPUs that support SVE fuse
movprfx+<instruction>, removing two cycles and a dependency from the backend.
Because of this, converting from ASIMD mov+bsl to SVE movprfx+bsl is a minor
win, saving two cycles and a dependency on Cortex-A710 and A715. It is
slightly less of a win on Cortex-A720/A725 because those cores support
zero-cycle vector register renames, but it is still a win on Cortex-X925,
which doesn't support zero-cycle vector register renames.
Very silly little thing.
The canonical way to generate a zeroed vector register in x86 is to xor it
with itself. Capture this and convert it to the canonical zero register
instead, which can get zero-cycle renamed on the latest CPUs.
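A minimal sketch of the pattern capture, using a hypothetical simplified IR node rather than FEX's actual IR API:

```cpp
#include <cstdint>

// Hypothetical, simplified IR node; the real FEX IR looks different.
struct VectorOp {
  enum class Kind { Xor, Zero };
  Kind Type;
  uint32_t Src1, Src2;
};

// Peephole: "xor reg, reg" / "vpxor x, x, x" always produces zero, so
// replace it with a canonical zero vector that the Arm64 backend can map to
// a zero register the CPU can rename at zero cost.
VectorOp FoldSelfXor(VectorOp Op) {
  if (Op.Type == VectorOp::Kind::Xor && Op.Src1 == Op.Src2) {
    return {VectorOp::Kind::Zero, 0, 0};
  }
  return Op;
}
```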
x86 doesn't have a lot of non-temporal vector stores but we do have a
few of them.
- MMX: MOVNTQ
- SSE2: MOVNTDQ, MOVNTPS, MOVNTPD
- AVX: VMOVNTDQ (128-bit & 256-bit), VMOVNTPD
Additionally, SSE4a adds 32-bit and 64-bit scalar vector non-temporal
stores, which we keep as regular stores since ARM doesn't have matching
semantics for those.
Additionally, SSE4.1 adds non-temporal vector LOADS, which this doesn't
touch:
- SSE4.1: MOVNTDQA
- AVX: VMOVNTDQA (128-bit)
- AVX2: VMOVNTDQA (256-bit)
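For reference, this is how guest code typically reaches these instructions, using standard Intel intrinsics (nothing FEX-specific; the streaming load needs SSE4.1 enabled):

```cpp
#include <cstddef>
#include <emmintrin.h>  // SSE2: _mm_stream_si128 -> MOVNTDQ
#include <smmintrin.h>  // SSE4.1: _mm_stream_load_si128 -> MOVNTDQA

void StreamCopy(__m128i* Dst, __m128i* Src, size_t Count) {
  for (size_t i = 0; i < Count; ++i) {
    __m128i Data = _mm_stream_load_si128(&Src[i]);  // non-temporal load
    _mm_stream_si128(&Dst[i], Data);                // non-temporal store
  }
  _mm_sfence();  // order the streaming stores before later accesses
}
```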
Fixes #3364
Arm64's SVE load instructions can be slightly optimized in the case that a
base GPR register isn't provided, as there is a version of the instruction
that doesn't require one.
The limitation of that form is that it doesn't support scaling at all, so it
only works if the offset scale is 1.
When FEX hits the optimal case where the destination isn't one of the
incoming sources (other than the incomingDest source), we can optimize out
two moves per 128-bit lane.
This cuts 256-bit non-SVE gather loads from 50 instructions down to 46.
We can support a few combinations of guest and host vector sizes:
- Host: 128-bit or 256-bit
- Guest: 128-bit or 256-bit
The typical case is Host = 128-bit and Guest = 256-bit now that AVX is
implemented.
On 32-bit this changes to Host=128-bit and Guest=128-bit because we
disable AVX.
In the vixl simulator, 32-bit turns into Host=256-bit and Guest=128-bit, and
64-bit turns into Host=256-bit and Guest=256-bit.
We cover all four combinations of guest and host vector register sizes!
This basically fixes a few assumptions that SVE256 = AVX256.
In regular SIB land, an index register encoding of 0b100 encodes to "no
register"; this feature lets you get SIB encodings without an index register
for flexibility.
In VSIB encoding this isn't the expected behaviour; instead there are no
encodings where the index register is missing, allowing you to encode all
sixteen registers as an index register.
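A rough sketch of the decode difference, with hypothetical helpers rather than FEX's actual decoder:

```cpp
#include <cstdint>

// Regular SIB: a 4-bit index of 0b0100 (index=100 with REX.X/VEX.X clear)
// means "no index register"; every other value selects a GPR.
int DecodeSIBIndex(uint8_t IndexField) {
  return IndexField == 0b0100 ? -1 : IndexField;
}

// VSIB: there is no "no index" encoding; all sixteen values select a vector
// index register, so 0b0100 must map to xmm4/ymm4 instead of aborting.
int DecodeVSIBIndex(uint8_t IndexField) {
  return IndexField;
}
```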
This was causing an abort in `AVX128_LoadVSIB` because the index turned into
an invalid register.
Working instruction:
`vgatherdps ymm2, dword [eax+ymm5*4], ymm7`
Broken instruction:
`vgatherdps ymm0, dword [eax+ymm4*4], ymm7`
This fixes a crash in libfmod, which uses gathers in the wild, and with it a
crash in Ender Lilies.
When the source or destination is a register, the address size override
doesn't apply. We were accidentally applying it on all sources regardless of
type, which caused us to zero-extend on operations that aren't affected by
the address size override.
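A sketch of the corrected rule, with a hypothetical helper rather than FEX's decoder API:

```cpp
#include <cstdint>

// The 0x67 address size override narrows addresses, so it should only be
// applied when the operand is actually a memory address; register operands
// must be left untouched.
uint64_t ApplyAddressSizeOverride(uint64_t Value, bool IsMemoryOperand, bool HasOverride) {
  if (IsMemoryOperand && HasOverride) {
    return static_cast<uint32_t>(Value);  // zero-extend the 32-bit address
  }
  return Value;
}
```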
This fixes the OpenSSL cert error in every application, but most
importantly Steam.