Commit Graph

9883 Commits

Author SHA1 Message Date
Ryan Houdek
b3a7a973a1
AVX128: Extends the 32-bit index path to 128-bit operations
The codepath from #3826 only targeted 256-bit operations. This missed
the vpgatherdq/vgatherdpd 128-bit operations. By extending the codepath
to understand 128-bit operations, we now hit these instruction variants.

With this PR, we now have SVE128 codepaths that handle ALL variants of
x86 gather instructions! There are zero ASIMD fallbacks used in this
case!

Of course, depending on the instruction, the performance still leaves a
lot to be desired, and there is no way to emulate x86 TSO behaviour
without an ASIMD fallback, which we will likely need to add at some
point.

Based on #3836 until that is merged.
2024-07-08 18:44:07 -07:00
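
For reference, here is a minimal sketch of the 128-bit guest instruction forms the commit above refers to, written with AVX2 intrinsics. The table arrays and function names are illustrative only and are not FEX code:

```cpp
#include <immintrin.h>

// vpgatherdq/vgatherdpd xmm forms: 32-bit indices gathering 64-bit elements.
// Requires AVX2; arrays are placeholders.
long long table_q[256];
double table_d[256];

__m128i gather_dq(__m128i idx32) {
  // Two 64-bit integer loads addressed by the low two 32-bit indices.
  return _mm_i32gather_epi64(table_q, idx32, 8);
}

__m128d gather_dpd(__m128i idx32) {
  // Two double-precision loads addressed by the low two 32-bit indices.
  return _mm_i32gather_pd(table_d, idx32, 8);
}
```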
Mai
22b26696ba
Merge pull request #3836 from Sonicadvance1/optimize_sve_vpgatherdd
AVX128: Optimize the vpgatherdd/vgatherdps cases that would fall back to ASIMD
2024-07-08 21:43:36 -04:00
Ryan Houdek
495241f8ca
InstcountCI: Update for wide gather vpgatherdd SVE usage 2024-07-08 18:12:28 -07:00
Ryan Houdek
4afbfcae17
AVX128: Optimize the vpgatherdd/vgatherdps cases that would fall back to ASIMD
The introduction of the wide gathers in #3828 opened new avenues for
optimizing these cases, which would typically fall back to ASIMD. In the
cases where 32-bit SVE scaling doesn't fit, we can instead sign extend
the elements into double-width address registers.

This then feeds naturally into the SVE path, even though we end up
needing to allocate 512 bits worth of address registers. It is still
significantly better than the ASIMD path.

Relies on #3828 to be merged first
Fixes #3829
2024-07-08 18:12:28 -07:00
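
As a rough illustration of the approach described in the commit above, here is a scalar reference sketch (assumed semantics, not FEX's JIT code; the function name is hypothetical) of gathering 32-bit data once the indices have been sign extended into double-width address elements:

```cpp
#include <cstdint>
#include <cstddef>
#include <cstring>
#include <array>

// When the 32-bit scaled-offset SVE form can't express the x86 scale, sign
// extend each 32-bit index to a 64-bit address element and scale there instead.
std::array<uint32_t, 8> GatherDDRef(const uint8_t* base,
                                    const std::array<int32_t, 8>& idx,
                                    uint32_t scale /* 1, 2, 4 or 8 */) {
  std::array<uint32_t, 8> result{};
  for (std::size_t i = 0; i < idx.size(); ++i) {
    const int64_t wide_index = static_cast<int64_t>(idx[i]);  // sign extend to 64-bit
    const uint8_t* addr = base + wide_index * scale;          // double-width address element
    std::memcpy(&result[i], addr, sizeof(uint32_t));
  }
  return result;
}
```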
Mai
3627de4cbc
Merge pull request #3828 from Sonicadvance1/optimize_wide_gathers
AVX128: Optimize QPS/QD variant of gather loads!
2024-07-08 21:11:36 -04:00
Ryan Houdek
007c07e612
InstcountCI: Update for wide gathers 2024-07-08 17:19:18 -07:00
Ryan Houdek
ec7c8fd922
AVX128: Optimize QPS/QD variant of gather loads!
SVE has a special version of its gather instruction that gets similar
behaviour to x86's VGATHERQPS/VPGATHERQD instructions.

The quirk that the previous SVE implementation didn't handle, which
forced an ASIMD fallback, is that most gather instructions require the
data element size and the address element size to match. These x86
instructions use a 64-bit address size while loading 32-bit elements.
That matches this specific variant of the SVE instruction, but the data
is zero-extended once loaded, requiring us to shuffle it afterwards.

This isn't the worst, but the implementation is different enough that
stuffing it into the other gather load path would cause headaches.

Basically gets 32 instruction variants to use the SVE version!

Fixes #3827
2024-07-08 17:19:18 -07:00
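
To make the quirk above concrete, here is a scalar reference sketch of the VGATHERQPS/VPGATHERQD shape (assumed semantics, not FEX's code; names are hypothetical): 64-bit address elements gather 32-bit data, each element comes back zero-extended to 64 bits, and the results then have to be shuffled back down to packed 32-bit lanes.

```cpp
#include <cstdint>
#include <cstddef>
#include <cstring>
#include <array>

std::array<uint32_t, 2> GatherQDRef(const uint8_t* base,
                                    const std::array<int64_t, 2>& idx,
                                    uint32_t scale) {
  std::array<uint64_t, 2> widened{};  // what the SVE gather produces per lane
  for (std::size_t i = 0; i < idx.size(); ++i) {
    uint32_t element;
    std::memcpy(&element, base + idx[i] * scale, sizeof(element));
    widened[i] = element;  // 32-bit data zero-extended into a 64-bit lane
  }
  // The post-load shuffle: keep only the low 32 bits of each 64-bit lane.
  return {static_cast<uint32_t>(widened[0]), static_cast<uint32_t>(widened[1])};
}
```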
Ryan Houdek
c5a0ae7b34
IR: Adds new QPS gather load variant! 2024-07-08 17:19:18 -07:00
Ryan Houdek
4bd207ebf3
Arm64: Moves 128Bit gather ASIMD emulation to its own helper
It is going to get reused.
2024-07-08 17:19:18 -07:00
Tony Wasserka
45011234d9
Merge pull request #3845 from pmatos/TESTJOBCOUNTFix
Use nproc only if TEST_JOB_COUNT not specified
2024-07-08 22:31:23 +02:00
Paulo Matos
24017f379e Use nproc only if TEST_JOB_COUNT not specified 2024-07-08 21:38:56 +02:00
Mai
aad7656b38
Merge pull request #3826 from Sonicadvance1/scale_32bit_gather
AVX128: Extend 32-bit address indices when possible
2024-07-08 15:29:44 -04:00
Mai
95a9f32bf0
Merge pull request #3840 from Sonicadvance1/extend_vinsert128_tests
unittests: Extends vinsert{i,f}128 tests for garbage data
2024-07-07 13:39:20 -04:00
Mai
c4ae761a0e
Merge pull request #3841 from Sonicadvance1/add_missing_cpu_names
CPUID: Adds a few missing CPU names for new CPU cores
2024-07-07 13:38:27 -04:00
Ryan Houdek
0653b346e0
CPUID: Adds a few missing CPU names for new CPU cores
These should be making their way to market sooner rather than later, so
make sure we have the descriptor text for them.
2024-07-07 02:40:19 -07:00
Ryan Houdek
fa587398bd
unittests: Extends vinsert{i,f}128 tests for garbage data
Just to ensure we don't hit an issue with masking the immediate bits.

Fixes #3753
2024-07-07 02:16:21 -07:00
Ryan Houdek
6b67857151
InstcountCI: Adds a missing gather instruction invariant
Oops, must have accidentally deleted this while copying things around.
2024-07-06 18:32:36 -07:00
Ryan Houdek
81165f0c40
InstcountCI: Update for 32-bit gather sign extend optimization 2024-07-06 18:32:35 -07:00
Ryan Houdek
df40515087
AVX128: Extend 32-bit address indices when possible
When loading 256 bits of data with only 128 bits of address indices, we
can sign extend the source indices to 64-bit. This falls down the ideal
path for SVE, where each 128-bit lane loads data from addresses in a 1:1
element ratio.

This means we use the SVE path more often.

Based on top of #3825 because the prescaling behaviour was introduced
there. This implements its own prescaling when the sign extension occurs,
because ARM's SSHLL{,2} instruction gives us that for free.

This additionally fixes a bug where we were accidentally loading the top
128-bit half of the addresses for gathers when it was unnecessary, and
on the AVX256 side it was duplicating and doing some additional work
that it shouldn't have.

It'll be good to walk the commits when looking at this one, as there are
a couple of incremental changes that are easier to follow that way.

Fixes #3806
2024-07-06 18:32:35 -07:00
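
A small illustrative sketch of why SSHLL{,2} gives the prescaling for free (assumed semantics, not FEX's emitter code; the function name is hypothetical): the sign extension and the scale-by-shift happen in one step.

```cpp
#include <cstdint>
#include <cstddef>
#include <array>

// SSHLL-like step: sign extend each 32-bit index to 64 bits and shift left by
// an immediate, so shift = log2(scale) yields ready-to-use byte offsets.
std::array<uint64_t, 2> SshllLike(const std::array<int32_t, 2>& idx, unsigned shift) {
  std::array<uint64_t, 2> out{};
  for (std::size_t i = 0; i < idx.size(); ++i) {
    const uint64_t wide = static_cast<uint64_t>(static_cast<int64_t>(idx[i]));  // sign extend
    out[i] = wide << shift;  // the shift bakes in the x86 scale (e.g. 3 for scale 8)
  }
  return out;
}
```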
Ryan Houdek
c77922e3e5
InstcountCI: Update for previous fix 2024-07-06 18:32:35 -07:00
Ryan Houdek
0f9abe68b9
AVX128: Fixes accidentally loading high addr register when unnecessary
Was missing a clamp on the high half when encountering a 128-bit gather
instruction. This was causing us to unconditionally load the top half
when it was unnecessary.
2024-07-06 18:32:35 -07:00
Ryan Houdek
c168ee6940
Arm64: Implements VSSHLL{,2} IR ops 2024-07-06 18:32:35 -07:00
Ryan Houdek
0d4414fdd0
AVX128: Removes templated AddrElementSize and adds it as an argument
NFC
2024-07-06 18:32:35 -07:00
Ryan Houdek
968d5e0d8f
Merge pull request #3774 from bylaws/win-ci
FEXCore ARM64EC CI support
2024-07-06 18:22:57 -07:00
Ryan Houdek
635182b57c
Merge pull request #3832 from bylaws/wow64-wine
WOW64: Mark the FEX dll as a wine builtin
2024-07-06 17:58:00 -07:00
Ryan Houdek
9d0b6ce75e
Merge pull request #3835 from bylaws/ec-topdown
AllocatorHooks: Allocate from the top down on windows
2024-07-06 17:40:36 -07:00
Ryan Houdek
2fdd80fe3a
Merge pull request #3833 from bylaws/common-tso
Windows: Commonise TSOHandlerConfig
2024-07-06 17:38:45 -07:00
Ryan Houdek
dbac23b749
Merge pull request #3834 from bylaws/ec-amd64
Windows: Report as an AMD64 processor when targeting ARM64EC
2024-07-06 17:38:13 -07:00
Billy Laws
7fa7061aa5 Windows: Report as an AMD64 processor when targeting ARM64EC 2024-07-06 20:37:15 +00:00
Billy Laws
e45e631199 AllocatorHooks: Allocate from the top down on windows
FEX allocations can get in the way of allocations that are 4GB-limited
even in 64-bit mode (e.g. those from LuaJIT), so allocate starting from
the top of the address space to prevent conflicts.
2024-07-06 20:35:38 +00:00
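
A hedged sketch of the idea (not FEX's actual allocator hook; the function name is hypothetical): on Windows, MEM_TOP_DOWN asks the kernel to allocate from the high end of the address space, keeping these allocations clear of the low 4GB that guest allocators such as LuaJIT depend on.

```cpp
#include <windows.h>
#include <cstddef>

// Illustrative only: reserve and commit memory from the top of the address
// space so it stays out of the low 4GB region the guest may need.
void* AllocateTopDown(std::size_t size) {
  return VirtualAlloc(nullptr, size,
                      MEM_RESERVE | MEM_COMMIT | MEM_TOP_DOWN,
                      PAGE_READWRITE);
}
```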
Billy Laws
b21e77c1e0 Windows: Commonise TSOHandlerConfig 2024-07-06 19:20:49 +00:00
Billy Laws
ba33294225 WOW64: Mark the FEX dll as a wine builtin
Allows it to be automatically picked up by wine during prefix setup,
without a manual dll override.

Thanks to AndreRH for pointing me to this.
2024-07-06 19:19:36 +00:00
Billy Laws
97c21cc3a7 CI: Add ARM64EC build CI 2024-07-06 17:27:41 +01:00
Billy Laws
7d7e6f5326 CMake: Disable WOW64 module for ARM64EC 2024-07-06 17:27:41 +01:00
Billy Laws
5e15bd935e CMake: Disable glibc jemalloc for MinGW builds 2024-07-06 17:27:41 +01:00
Ryan Houdek
9bad09c45f
Merge pull request #3823 from alyssarosenzweig/bug/shl-var-small
Fix CF with small shifts
2024-07-06 01:33:57 -07:00
Ryan Houdek
47d077ff22
Merge pull request #3825 from Sonicadvance1/scale_64bit_gather
AVX128: Prescale addresses in gathers if possible
2024-07-05 19:10:43 -07:00
Ryan Houdek
bbf8dde3ca
Merge pull request #3824 from alyssarosenzweig/bug/rc2
OpcodeDispatcher: Fix 8/16-bit rcr masking
2024-07-05 17:01:16 -07:00
Ryan Houdek
6e8ca3bc6c
InstcountCI: Update for gather prescaling 2024-07-05 16:47:11 -07:00
Ryan Houdek
11a494d7b3
AVX128: Prescale addresses in gathers if possible
If the host supports SVE128, the address element size and data size are
both 64-bit, and the scale is not one of the two supported by SVE, then
prescale the addresses.

64-bit address overflow masks the top bits, so it is well defined that we
can scale the vector elements and still execute the SVE code path in
that case. This removes the ASIMD code paths from a lot of gathers.

Fixes #3805
2024-07-05 16:47:11 -07:00
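
To illustrate the overflow argument in the commit above, a scalar sketch (not the JIT code; the function name is hypothetical): x86 gather address math wraps modulo 2^64, so multiplying the 64-bit indices by the scale up front and then using SVE's unscaled 64-bit offset addressing computes the same addresses.

```cpp
#include <cstdint>
#include <cstddef>
#include <array>

std::array<uint64_t, 2> PrescaleIndices(const std::array<uint64_t, 2>& idx, uint64_t scale) {
  std::array<uint64_t, 2> out{};
  for (std::size_t i = 0; i < idx.size(); ++i) {
    out[i] = idx[i] * scale;  // unsigned multiply wraps mod 2^64, matching x86 address math
  }
  return out;
}
```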
Alyssa Rosenzweig
9b570de33f InstCountCI: Update
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
2024-07-05 18:44:21 -04:00
Ryan Houdek
b67343fc5a unittests: Adds a test for small shift flags calculation
Currently we calculate CF incorrectly in the case of small shifts with
large shift amounts.
2024-07-05 18:38:12 -04:00
Alyssa Rosenzweig
5a3c0eb83c OpcodeDispatcher: fix shl with 8/16-bit variable
The special case here lines up with the existing special case of using a
larger shift for a smaller result, so we can just grab CF from the larger result.

Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
2024-07-05 18:38:12 -04:00
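
A hedged interpretation of the trick described in the commit above, as a scalar sketch (assumed semantics, not the actual OpcodeDispatcher code; the function name is hypothetical): perform the shift at the larger width, and the bit that crossed the small operand's boundary lands at a fixed position in the wide result, which is exactly CF.

```cpp
#include <cstdint>

bool SmallShlCF(uint32_t value, unsigned width /* 8 or 16 */, unsigned count) {
  count &= 0x1f;  // x86 masks shift counts to 5 bits even for 8/16-bit operands
  if (count == 0) {
    return false;  // real hardware leaves the flags untouched for a zero count
  }
  const uint32_t wide = value << count;  // shift performed at the larger width
  return (wide >> width) & 1u;           // last bit shifted out of the small operand
}
```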
Alyssa Rosenzweig
10391608a0 InstCountCI: Update
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
2024-07-05 18:34:18 -04:00
Ryan Houdek
51c57cc5ae unittests: More rotate with carry unit tests
Looks like we missed some edge cases with small rotate-with-carry
operations. Adds even more unit tests.
2024-07-05 18:34:18 -04:00
Alyssa Rosenzweig
05e4678e65 OpcodeDispatcher: fix missing masking on smaller RCR
I probably broke this when working on eliminating crossblock liveness.

Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
2024-07-05 18:34:18 -04:00
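
For context, a scalar sketch of the masking rule involved (standard documented x86 behaviour, not the dispatcher code itself; the function name is hypothetical): rotate-through-carry treats CF as an extra bit, so the effective count wraps at width + 1.

```cpp
#include <cstdint>

// x86 masks rotate counts to 5 bits; for RCL/RCR on 8/16-bit operands the
// effective count is then reduced modulo width + 1 because CF takes part in
// the rotation.
unsigned RcrEffectiveCount(unsigned count, unsigned width /* 8 or 16 */) {
  count &= 0x1f;
  return count % (width + 1);
}
```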
Alyssa Rosenzweig
0f0e402db4 OpcodeDispatcher: fix CF with 8/16-bit immediate
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
2024-07-05 18:24:34 -04:00
Ryan Houdek
653bf04db0
Merge pull request #3819 from alyssarosenzweig/bug/rcr-smol
Fix 8/16-bit RCR
2024-07-05 12:49:23 -07:00
Ryan Houdek
b77a25b21a
Merge pull request #3818 from alyssarosenzweig/jit/shiftbymaskstozero
JIT: fix ShiftFlags masking
2024-07-05 12:49:16 -07:00
Alyssa Rosenzweig
9db6931cea InstCountCI: Update
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
2024-07-05 10:49:12 -04:00