Also fixes a bug where the AVX128 implementation was failing to zero the
upper bits of the destination register, which the updated unit tests now
check for.
Fixes a minor precision issue that was reported in #2995. We still don't
return correct values on overflow: x86 always returns the maximum negative
int32_t, while ARM returns the maximum negative or positive value
depending on the sign of the double.
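For reference, here's a scalar sketch of that divergence. This is not FEX code; X86StyleConvert and ArmStyleConvert are made-up reference helpers, with the ARM side modelling FCVTZS-style saturation.

```cpp
#include <cstdint>
#include <cstdio>

int32_t X86StyleConvert(double Value) {
  // x86 CVTTSD2SI returns the "integer indefinite" value (INT32_MIN) for any
  // out-of-range or NaN input.
  if (!(Value > -2147483649.0 && Value < 2147483648.0)) {
    return INT32_MIN;
  }
  return static_cast<int32_t>(Value); // truncate toward zero
}

int32_t ArmStyleConvert(double Value) {
  // ARM's FCVTZS saturates toward the sign of the input; NaN becomes 0.
  if (Value != Value) return 0;
  if (Value >= 2147483648.0) return INT32_MAX;
  if (Value <= -2147483649.0) return INT32_MIN;
  return static_cast<int32_t>(Value);
}

int main() {
  const double Big = 1e12;
  printf("x86: %d, ARM: %d\n", X86StyleConvert(Big), ArmStyleConvert(Big));
  // Prints: x86: -2147483648, ARM: 2147483647
}
```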
The codepath from #3826 only targeted 256-bit operations, which missed the
128-bit vpgatherdq/vgatherdpd variants. Extending the codepath to
understand 128-bit operations means we now hit these instruction variants
as well.
With this PR, we now have SVE128 codepaths that handle ALL variants of
x86 gather instructions! There are zero ASIMD fallbacks used in this
case!
Of course, depending on the instruction, the performance still leaves a
lot to be desired, and there is no way to emulate x86 TSO behaviour
without an ASIMD fallback, which we will likely need to add back at some
point.
Based on #3836 until that is merged.
The introduction of the wide gathers in #3828 has opened new avenues for
optimizing cases that would typically fall back to ASIMD. In cases where
32-bit SVE scaling doesn't fit, we can instead sign extend the elements
into double-width address registers.
This then feeds naturally into the SVE path, even though we end up needing
to allocate 512 bits worth of address registers. The result is still
significantly better than the ASIMD path.
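As a rough scalar model of what the widened addressing computes (illustration only, not the JIT lowering), GatherDPSReference below is a hypothetical helper for the 256-bit, 32-bit-element case: eight 32-bit indices are sign extended and scaled as 64-bit quantities, which is why 512 bits worth of address registers are needed.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

void GatherDPSReference(float* Dst, uint64_t Base, const int32_t* Indices, uint64_t Scale) {
  for (size_t i = 0; i < 8; ++i) {
    // Sign extend the 32-bit index, then apply base and scale as 64-bit math.
    const uint64_t Address =
      Base + static_cast<uint64_t>(static_cast<int64_t>(Indices[i])) * Scale;
    std::memcpy(&Dst[i], reinterpret_cast<const void*>(Address), sizeof(float));
  }
}
```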
Relies on #3828 being merged first.
Fixes #3829
SVE has a special variant of its gather instruction with behaviour similar
to x86's VGATHERQPS/VPGATHERQD instructions.
The quirk that the previous SVE implementation didn't handle, and which
forced an ASIMD fallback, is that most gather instructions require the
data element size and the address element size to match. These x86
instructions use 64-bit addresses while loading 32-bit elements. That
matches this specific SVE variant, but the data is zero-extended once
loaded, requiring us to shuffle it afterwards.
This isn't the worst, but the implementation is different enough that
stuffing it into the other gather load path would cause headaches.
Basically gets 32 instruction variants to use the SVE version!
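To make the quirk concrete, here's a scalar reference (not the actual implementation) of the VGATHERQPS-style behaviour: 64-bit address elements, 32-bit loads that land zero extended in 64-bit lanes, and a compacting shuffle afterwards.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// NumElements is at most 4 for these instruction forms.
void GatherQPSReference(uint32_t* Dst, uint64_t Base, const int64_t* Indices,
                        uint64_t Scale, size_t NumElements) {
  uint64_t WideLanes[4]{};
  for (size_t i = 0; i < NumElements; ++i) {
    uint32_t Loaded{};
    const uint64_t Address = Base + static_cast<uint64_t>(Indices[i]) * Scale;
    std::memcpy(&Loaded, reinterpret_cast<const void*>(Address), sizeof(uint32_t));
    WideLanes[i] = Loaded; // zero extended into a 64-bit lane, like the SVE load
  }
  // The post-load shuffle: pack the low 32 bits of each 64-bit lane together
  // to match x86's packed 32-bit result layout.
  for (size_t i = 0; i < NumElements; ++i) {
    Dst[i] = static_cast<uint32_t>(WideLanes[i]);
  }
}
```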
Fixes #3827
If the destination isn't any of the incoming sources then we can avoid
one of the moves at the end. This works around half of the problem raised
in #3794, but doesn't solve it entirely.
Solving the other half of the moving problem means solving the SRA
allocation problem for the addsub/subadd temporary register, so that it
gets allocated for both the FMA operation and the XOR operation.
When loading 256 bits of data with only 128 bits of address indices, we
can sign extend the source indices to 64-bit. This falls down the ideal
path for SVE, where each 128-bit lane loads data from addresses in a 1:1
element ratio, so we hit the SVE path more often.
Based on top of #3825 because the prescaling behaviour was introduced
there. This implements its own prescaling when the sign extension occurs,
since ARM's SSHLL{,2} instructions give us that for free.
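As an illustration of why the prescale comes for free (NEON intrinsics used purely as a sketch, not the actual JIT lowering): SSHLL/SSHLL2 widen each 32-bit index to 64 bits and shift it left in the same instruction, so shifting by log2(scale) folds the x86 scale into the widening step. WidenAndPrescale is a hypothetical helper.

```cpp
#include <arm_neon.h>

// Widen four 32-bit indices into four 64-bit byte offsets for a scale of 8
// (log2(8) == 3). The shift is folded into the widening, no separate multiply.
void WidenAndPrescale(int32x4_t Indices, int64x2_t& Low, int64x2_t& High) {
  Low  = vshll_n_s32(vget_low_s32(Indices), 3);  // SSHLL  - lower two indices
  High = vshll_high_n_s32(Indices, 3);           // SSHLL2 - upper two indices
}
```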
This additionally fixes a bug where we were accidentally loading the top
128-bit half of the addresses for gathers when it was unnecessary; on the
AVX256 side it was also duplicating and doing some additional work when it
shouldn't have.
It'll be good to walk the commits when looking at this one, as there are
a couple of incremental changes that are easier to follow that way.
Fixes #3806
We were missing a clamp on the high half when encountering a 128-bit
gather instruction, which was causing us to unconditionally load the top
half when it was unnecessary.
FEX allocations can get in the way of allocations that are 4GB-limited
even in 64-bit mode (i.e. those from LuaJIT), so allocate starting from
the top of the address space to prevent conflicts.
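A minimal sketch of the idea, assuming a plain mmap hint flow rather than FEX's actual allocator; AllocateFromTop and the hint value are made up for illustration.

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <cstdint>

void* AllocateFromTop(size_t Size) {
  // Assumed hint near the top of a 47-bit userspace address space; the kernel
  // treats this as a suggestion and may still place the mapping elsewhere.
  constexpr uintptr_t TopHint = 0x700000000000ULL;
  void* Ptr = mmap(reinterpret_cast<void*>(TopHint), Size,
                   PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  return Ptr == MAP_FAILED ? nullptr : Ptr;
}
```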
If the host supports SVE128, the address element size and data size are 64-bit, and the scale is not one of the two supported by SVE, then prescale the addresses.
64-bit address overflow masks the top bits, so it is well defined that we
can scale the vector elements and still execute the SVE code path in that
case. This removes the ASIMD code paths from a lot of gathers.
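A scalar illustration of why that's safe (not FEX code): x86 gather address arithmetic wraps modulo 2^64, so multiplying the address elements by the scale up front produces the same final addresses as letting the addressing mode apply the scale, even when the multiply overflows.

```cpp
#include <cassert>
#include <cstdint>

int main() {
  const uint64_t Base  = 0x1000;
  const uint64_t Scale = 4; // assume this is one of the scales SVE can't encode directly
  const uint64_t Indices[2] = {0x10, 0xF000000000000000ULL}; // second overflows when scaled

  for (uint64_t Index : Indices) {
    // What a scaled addressing mode would compute (wraps modulo 2^64, like x86).
    const uint64_t Scaled = Base + Index * Scale;
    // Prescale the element, then use an unscaled addressing mode instead.
    const uint64_t Prescaled = Index * Scale;
    assert(Base + Prescaled == Scaled);
  }
}
```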
Fixes #3805
The special case here lines up with the special case of using a larger shift for
a smaller result, so we can just grab CF from the larger result.
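The exact lowering isn't spelled out here, but as a hedged illustration of the kind of alignment being described: for a logical right shift, x86's CF is the last bit shifted out, i.e. bit (Count - 1) of the source, and that bit sits in the same position whether the smaller operation is computed at its own width or as a wider shift on a zero-extended value.

```cpp
#include <cassert>
#include <cstdint>

bool ShrCarry32(uint32_t Value, unsigned Count) {
  return (Value >> (Count - 1)) & 1; // Count assumed in [1, 31]
}

bool ShrCarryVia64(uint32_t Value, unsigned Count) {
  const uint64_t Wide = Value;       // zero extend, as a wider register would hold it
  return (Wide >> (Count - 1)) & 1;  // same bit, read from the larger computation
}

int main() {
  for (unsigned Count = 1; Count < 32; ++Count) {
    assert(ShrCarry32(0xDEADBEEFu, Count) == ShrCarryVia64(0xDEADBEEFu, Count));
  }
}
```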
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>