Commit Graph

1656 Commits

Ryan Houdek
5ef0db994d
VDSO: Stop using a vector for a static
This causes a global initializer that registers an atexit handler.

Be smarter: use a std::array and pass its data around using a span
instead.

Removes the global initializer and the atexit installation
2024-07-11 23:53:57 -07:00
Ryan Houdek
fc0b233046
Merge pull request #3859 from neobrain/refactor_opdispatch_templates
OpcodeDispatcher: Replace hand-written wrapper templates with a generic utility
2024-07-11 18:18:23 -07:00
Mai
e25918d846
Merge pull request #3858 from Sonicadvance1/implement_nt_load
Implement support for SSE4.1/AVX NT loads
2024-07-11 14:22:41 -04:00
Tony Wasserka
b9829ed316 OpcodeDispatcher: Replace even more hand-written wrapper templates 2024-07-11 16:19:15 +02:00
Tony Wasserka
4ccec17676 OpcodeDispatcher: Replace more hand-written wrapper templates 2024-07-11 16:19:15 +02:00
Tony Wasserka
f45082043b OpcodeDispatcher: Replace hand-written wrapper templates with a generic utility 2024-07-11 16:19:14 +02:00
Tony Wasserka
3222f13dde Fix comment formatting 2024-07-11 16:19:14 +02:00
Mai
b282620a48
Merge pull request #3857 from Sonicadvance1/sve_bitperm
Arm64: Implement support for SVE bitperm
2024-07-11 05:05:41 -04:00
Ryan Houdek
e24b01b6cb
Arm64: Implement support for SVE bitperm 2024-07-11 01:46:35 -07:00
Tony Wasserka
9a8694c2f3
Merge pull request #3853 from neobrain/refactor_warn_fixes
Fix all the warnings
2024-07-11 10:12:41 +02:00
Tony Wasserka
070a9148aa
Merge pull request #3852 from neobrain/refactor_opdispatch_codesize
OpcodeDispatcher: Avoid template monomorphization to reduce FEXLoader binary size
2024-07-11 09:58:49 +02:00
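The size-reduction technique this PR title names can be sketched like so (function names are illustrative, not FEX's): a handler templated on element size is stamped out once per instantiation, so moving the large body into a runtime-parameter helper leaves only a thin wrapper per instantiation.

```cpp
#include <cstddef>
#include <cstdint>

// The large body lives here exactly once in the binary, taking the element
// size as a runtime parameter instead of a template parameter.
int64_t HandleOpImpl(size_t ElementSize, int64_t Value) {
  // Imagine a large dispatch body here.
  return Value * static_cast<int64_t>(ElementSize);
}

// Each instantiation is now only a thin shim, avoiding monomorphization of
// the big body per element size.
template <size_t ElementSize>
int64_t HandleOp(int64_t Value) {
  return HandleOpImpl(ElementSize, Value);
}
```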
Tony Wasserka
f19fe3b6f3 Fix warning about an expression with side effects being passed to __builtin_assume
LOGMAN_THROW_AA_FMT has no benefit over LOGMAN_THROW_A_FMT here, so just use
the latter.
2024-07-11 09:54:31 +02:00
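The warning this commit fixes can be modeled in scalar code (the macro bodies below are illustrative stand-ins, not FEX's actual definitions): clang's `__builtin_assume` discards its argument without evaluating it, so a side effect inside the expression is silently lost, whereas an assert-style macro evaluates it.

```cpp
#include <cassert>

// Illustrative stand-ins for the two macro styles the commit contrasts.
#define THROW_A_STYLE(cond)  assert(cond)  // condition is evaluated
#define THROW_AA_STYLE(cond) ((void)0)     // models __builtin_assume discarding the expression

inline int CountEvaluations() {
  int n = 0;
  THROW_A_STYLE(++n > 0);   // side effect kept: n becomes 1
  THROW_AA_STYLE(++n > 0);  // expression discarded: n stays 1
  return n;
}
```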
Tony Wasserka
8d2b15665d Fix unused-variable warnings 2024-07-11 09:54:30 +02:00
Tony Wasserka
5dc4ab062d Fix invalid-offsetof warnings due to InternalThreadState not being standard layout
See https://github.com/llvm/llvm-project/issues/53021 for more information
about unique_ptr turning non-standard-layout.
2024-07-11 09:54:30 +02:00
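The underlying rule can be shown with a small sketch (the struct below is illustrative, not FEX's `InternalThreadState`): `offsetof` is only guaranteed for standard-layout types, and per the linked LLVM issue a `std::unique_ptr` member can drop the enclosing struct out of standard layout, while a plain pointer preserves the guarantee.

```cpp
#include <cstddef>
#include <type_traits>

// A plain pointer member keeps the struct standard-layout, so
// offsetof on it is well-defined and clang's -Winvalid-offsetof stays quiet.
struct ThreadStateSketch {
  int* JitState;  // plain pointer instead of std::unique_ptr
  long Counter;
};

static_assert(std::is_standard_layout_v<ThreadStateSketch>,
              "offsetof(ThreadStateSketch, Counter) is well-defined");
```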
Ryan Houdek
548fd9daf8
OpcodeDispatcher: Implement support for SSE4.1 NT load 2024-07-10 23:07:37 -07:00
Ryan Houdek
f831f5a0e1
AVX128: Implement support for NT Load 2024-07-10 23:07:14 -07:00
Ryan Houdek
4c21aa2604
Arm64: Implement support for NT Loads with ASIMD fallback 2024-07-10 23:06:46 -07:00
Ryan Houdek
c9efb75714
CodeEmitter: Implement support for SVE NT loads 2024-07-10 23:06:19 -07:00
Ryan Houdek
3554d5c2f7
HostFeatures: Check for SVE bit permute extension 2024-07-10 21:45:07 -07:00
Mai
5fe405e1fb
Merge pull request #3855 from neobrain/fix_aotir_uniqueptr
AOTIR: Change std::unique_ptr to fextl::unique_ptr
2024-07-10 17:04:12 -04:00
Tony Wasserka
56bb3744a5 AOTIR: Change std::unique_ptr to fextl::unique_ptr 2024-07-10 19:34:24 +02:00
Tony Wasserka
470b435afd fextl: Properly handle nullptr arguments in fextl::default_delete
This reflects behavior of std::default_delete.
2024-07-10 19:17:50 +02:00
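A sketch of the fix (fextl's internals are an assumption here): a deleter built on a custom allocator must explicitly skip nullptr, since invoking a destructor through a null pointer is undefined behaviour, whereas `std::default_delete` is already nullptr-safe because `delete nullptr` is a no-op.

```cpp
#include <cassert>
#include <cstdlib>
#include <new>

static int destructions = 0;
struct Tracked {
  ~Tracked() { ++destructions; }
};

// Hypothetical custom-allocator deleter mirroring the commit's fix:
// nullptr must be a no-op before the explicit destructor call and free.
template <typename T>
struct default_delete_sketch {
  void operator()(T* ptr) const {
    if (!ptr) {
      return;  // the fix: handle nullptr like std::default_delete would
    }
    ptr->~T();
    std::free(ptr);
  }
};
```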
Tony Wasserka
441187470e OpcodeDispatcher: Avoid monomorphization of some AVX functions 2024-07-10 17:01:30 +02:00
Tony Wasserka
59fd13cc2f OpcodeDispatcher: Avoid monomorphization of even more functions 2024-07-10 17:01:30 +02:00
Tony Wasserka
c9e7bfdf16 OpcodeDispatcher: Avoid monomorphization of more functions 2024-07-10 17:01:30 +02:00
Tony Wasserka
2d700c381e OpcodeDispatcher: Avoid monomorphization of large functions 2024-07-10 17:01:30 +02:00
Ryan Houdek
72d6c8ebd6
Merge pull request #3820 from alyssarosenzweig/ir/drop-deferred
Drop deferred flag infrastructure
2024-07-09 17:06:25 -07:00
Ryan Houdek
991c6941c1
Merge pull request #3849 from alyssarosenzweig/ir/drop-parser-2
Scripts: drop remnant of IR parser
2024-07-09 16:48:36 -07:00
Alyssa Rosenzweig
f974696e34 Scripts: drop remnant of IR parser
unused.

Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
2024-07-09 16:08:38 -04:00
Mai
af6a0be832
Merge pull request #3842 from Sonicadvance1/fix_f64_to_i32
VCVT{T,}PD2DQ fixes and optimization
2024-07-09 03:49:31 -04:00
Ryan Houdek
b9c214e6e8
OpcodeDispatcher: Use new IR op for vcvt{t,}pd2dq
Also fixes a bug where it was failing to zero the upper bits of the
destination register in the AVX128 implementation, which the updated
unit tests now check against.

Fixes a minor precision issue that was reported in #2995. We still don't
return correct values for overflow. x86 always returns maximum negative
int32_t on overflow, ARM will return maximum negative or positive
depending on sign of the double.
2024-07-09 00:38:47 -07:00
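The semantic gap the commit describes can be modeled in scalar code (helper names are illustrative): x86's CVTTPD2DQ returns the "integer indefinite" value, INT32_MIN, for NaN or any out-of-range double, while AArch64's fcvtzs saturates toward the sign of the input and sends NaN to 0.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// x86 semantics: NaN or out-of-range truncation yields the integer
// indefinite value (maximum negative int32_t).
int32_t X86TruncF64ToI32(double v) {
  constexpr int32_t Min = std::numeric_limits<int32_t>::min();
  constexpr int32_t Max = std::numeric_limits<int32_t>::max();
  if (std::isnan(v)) {
    return Min;
  }
  const double t = std::trunc(v);
  if (t < static_cast<double>(Min) || t > static_cast<double>(Max)) {
    return Min;  // integer indefinite
  }
  return static_cast<int32_t>(t);
}

// AArch64 fcvtzs semantics: saturate toward the sign; NaN converts to 0.
int32_t Arm64TruncF64ToI32(double v) {
  constexpr int32_t Min = std::numeric_limits<int32_t>::min();
  constexpr int32_t Max = std::numeric_limits<int32_t>::max();
  if (std::isnan(v)) {
    return 0;
  }
  const double t = std::trunc(v);
  if (t <= static_cast<double>(Min)) {
    return Min;
  }
  if (t >= static_cast<double>(Max)) {
    return Max;
  }
  return static_cast<int32_t>(t);
}
```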
Ryan Houdek
d3d76aa8ce
IR: Adds new F64 -> I32 operation that changes behaviour depending on SVE
SVE added the ability to do F64 -> I32 conversions directly without an
fcvtn in between, so make sure to support them.
2024-07-09 00:38:47 -07:00
Ryan Houdek
3bea08da5f
Merge pull request #3843 from Sonicadvance1/remove_half_moves_fma3
Arm64: Remove one move if possible in FMA operations
2024-07-09 00:25:07 -07:00
Ryan Houdek
b3a7a973a1
AVX128: Extends 32-bit indexes path for 128-bit operations
The codepath from #3826 was only targeting 256-bit sized operations.
This missed the vpgatherdq/vgatherdpd 128-bit operations. By extending
the codepath to understand 128-bit operations, we now hit these
instruction variants.

With this PR, we now have SVE128 codepaths that handle ALL variants of
x86 gather instructions! There are zero ASIMD fallbacks used in this
case!

Of course, depending on the instruction, the performance still leaves a
lot to be desired, and there is no way to emulate x86 TSO behaviour
without an ASIMD fallback, which we will likely need to add at some
point.

Based on #3836 until that is merged.
2024-07-08 18:44:07 -07:00
Ryan Houdek
4afbfcae17
AVX128: Optimize the vpgatherdd/vgatherdps cases that would fall back to ASIMD
With the introduction of the wide gathers in #3828 this has opened new
avenues for optimizing these cases that would typically fall back to
ASIMD. In the cases that 32-bit SVE scaling doesn't fit, we can instead
sign extend the elements in to double-width address registers.

This then feeds naturally in to the SVE path even though we end up
needing to allocate 512-bits worth of address registers. This ends up
being significantly better than the ASIMD path still.

Relies on #3828 to be merged first
Fixes #3829
2024-07-08 18:12:28 -07:00
Ryan Houdek
ec7c8fd922
AVX128: Optimize QPS/QD variant of gather loads!
SVE has a special version of their gather instruction that gets similar
behaviour to x86's VGATHERQPS/VPGATHERQD instructions.

The quirk of these instructions, which the previous SVE implementation
didn't handle and so required the ASIMD fallback, is that most gather
instructions require the data element size and address element size to
match. These x86 instructions use a 64-bit address size while loading
32-bit elements. That matches this specific variant of the SVE
instruction, but the data is zero-extended once loaded, requiring us to
shuffle it after the load.

This isn't the worst, but the implementation is different enough that
stuffing it into the other gather load path would cause headaches.

Basically gets 32 instruction variants to use the SVE version!

Fixes #3827
2024-07-08 17:19:18 -07:00
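The post-load shuffle the commit describes can be modeled per lane (a lane-semantics sketch, not FEX's emitter): the SVE 64-bit-address/32-bit-data gather zero-extends each loaded word into a 64-bit lane, while x86's VGATHERQPS/VPGATHERQD expect the words packed together, so a narrowing shuffle follows the load.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Scalar model: keep the low 32 bits of each 64-bit lane, packing the four
// zero-extended loads back into contiguous 32-bit elements.
std::array<uint32_t, 4> CompactGatheredWords(const std::array<uint64_t, 4>& lanes) {
  std::array<uint32_t, 4> packed{};
  for (size_t i = 0; i < lanes.size(); ++i) {
    packed[i] = static_cast<uint32_t>(lanes[i]);  // drop the zero-extended upper half
  }
  return packed;
}
```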
Ryan Houdek
c5a0ae7b34
IR: Adds new QPS gather load variant! 2024-07-08 17:19:18 -07:00
Ryan Houdek
4bd207ebf3
Arm64: Moves 128Bit gather ASIMD emulation to its own helper
It is going to get reused.
2024-07-08 17:19:18 -07:00
Mai
aad7656b38
Merge pull request #3826 from Sonicadvance1/scale_32bit_gather
AVX128: Extend 32-bit address indices when possible
2024-07-08 15:29:44 -04:00
Ryan Houdek
62cec7b6b2
Arm64: Remove one move if possible in FMA operations
If the destination isn't any of the incoming sources then we can avoid
one of the moves at the end. This works around half of the problem
described in #3794, but doesn't solve the entire problem.

To solve the other half of the move problem, we need to solve the SRA
allocation problem for the addsub/subadd temporary register, so it gets
allocated for both the FMA operation and the XOR operation.
2024-07-08 04:44:40 -07:00
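The rule can be sketched as follows (register numbers and emit strings are illustrative, not FEX's emitter): AArch64 FMLA accumulates destructively into its destination, so when the result register doesn't alias either multiplicand we can move the addend into it and accumulate in place; only when it aliases a source do we need a temporary plus a trailing move.

```cpp
#include <string>
#include <vector>

// Sketch: count the moves needed around an FMLA depending on aliasing.
std::vector<std::string> EmitFMA(int dst, int src1, int src2, int addend) {
  std::vector<std::string> insts;
  if (dst != src1 && dst != src2) {
    // Destination is free: accumulate in place, one move total.
    insts.push_back("mov v" + std::to_string(dst) + ", v" + std::to_string(addend));
    insts.push_back("fmla v" + std::to_string(dst) + ", v" + std::to_string(src1) + ", v" + std::to_string(src2));
  } else {
    // Destination aliases a source: go through a temporary, two moves.
    insts.push_back("mov vTmp, v" + std::to_string(addend));
    insts.push_back("fmla vTmp, v" + std::to_string(src1) + ", v" + std::to_string(src2));
    insts.push_back("mov v" + std::to_string(dst) + ", vTmp");  // the move this commit avoids when possible
  }
  return insts;
}
```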
Ryan Houdek
0653b346e0
CPUID: Adds a few missing CPU names for new CPU cores
These should be making their way to the market sooner rather than later
so make sure we have the descriptor text for them.
2024-07-07 02:40:19 -07:00
Ryan Houdek
df40515087
AVX128: Extend 32-bit address indices when possible
When loading 256-bits of data with only 128-bits of address indices, we
can sign extend the source indices to be 64-bit. Thus falling down the
ideal path for SVE where each 128-bit lane is loading the data to
addresses in a 1:1 element ratio.

This means we use the SVE path more often.

Based on top of #3825 because the prescaling behaviour was introduced
there. This implements its own prescaling when the sign extension occurs
because ARM's SSHLL{,2} instruction gives us that for free.

This additionally fixes a bug where we were accidentally loading the top
128-bit half of the addresses for gathers when it was unnecessary, and
on the AVX256 side it was duplicating and doing some additional work
when it shouldn't have.

It'll be good to walk the commits when looking at this one, as there are
a couple of incremental changes that are easier to follow that way.

Fixes #3806
2024-07-06 18:32:35 -07:00
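The "prescaling for free" point can be modeled per lane (a sketch of the lane semantics, not FEX's emitter): ARM's SSHLL{,2} sign-extends each 32-bit index to 64 bits while shifting left, so widening and scaling happen in one instruction.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Scalar model of SSHLL: widen each signed 32-bit index to 64 bits and
// scale by 2^shift in the same step.
std::array<int64_t, 4> SignExtendAndScale(const std::array<int32_t, 4>& indices, unsigned shift) {
  std::array<int64_t, 4> out{};
  for (size_t i = 0; i < indices.size(); ++i) {
    out[i] = static_cast<int64_t>(indices[i]) * (int64_t{1} << shift);  // widen, then scale
  }
  return out;
}
```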
Ryan Houdek
0f9abe68b9
AVX128: Fixes accidentally loading high addr register when unnecessary
Was missing a clamp on the high half when encountering a 128-bit gather
instruction. This was causing us to unconditionally load the top half
when it was unnecessary.
2024-07-06 18:32:35 -07:00
Ryan Houdek
c168ee6940
Arm64: Implements VSSHLL{,2} IR ops 2024-07-06 18:32:35 -07:00
Ryan Houdek
0d4414fdd0
AVX128: Removes templated AddrElementSize and add as argument
NFC
2024-07-06 18:32:35 -07:00
Billy Laws
e45e631199 AllocatorHooks: Allocate from the top down on windows
FEX allocations can get in the way of allocations that are 4GB-limited
even in 64-bit mode (i.e. those from LuaJIT), so allocate starting from
the top of the address space to prevent conflicts.
2024-07-06 20:35:38 +00:00
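The placement policy can be sketched portably (FEX's Windows hook presumably relies on VirtualAlloc-style MEM_TOP_DOWN placement; this only models the policy, with hypothetical names): given free regions, pick the highest-addressed fit so the low 4 GiB stays available for address-limited consumers like LuaJIT.

```cpp
#include <cstdint>
#include <map>

// Walk free regions (base -> size) from the top of the address space down,
// returning the highest base that fits the request, or 0 on failure.
uint64_t PickTopDown(const std::map<uint64_t, uint64_t>& FreeRegions, uint64_t Size) {
  for (auto it = FreeRegions.rbegin(); it != FreeRegions.rend(); ++it) {
    if (it->second >= Size) {
      return it->first + it->second - Size;  // top end of the highest region that fits
    }
  }
  return 0;  // no region large enough
}
```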
Ryan Houdek
9bad09c45f
Merge pull request #3823 from alyssarosenzweig/bug/shl-var-small
Fix CF with small shifts
2024-07-06 01:33:57 -07:00
Ryan Houdek
47d077ff22
Merge pull request #3825 from Sonicadvance1/scale_64bit_gather
AVX128: Prescale addresses in gathers if possible
2024-07-05 19:10:43 -07:00
Ryan Houdek
11a494d7b3
AVX128: Prescale addresses in gathers if possible
If the host supports SVE128, the address element size and data size are
64-bit, and the scale is not one of the two supported by SVE, then
prescale the addresses.
64-bit address overflow masks the top bits, so it is well defined that
we can scale the vector elements and still execute the SVE code path in
that case. This removes the ASIMD code paths from a lot of gathers.

Fixes #3805
2024-07-05 16:47:11 -07:00
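The prescale itself can be modeled in scalar code (a sketch, not FEX's emitter): x86 gathers compute base + index*scale with scale in {1, 2, 4, 8}, while SVE's 64-bit gather only supports the two addressing modes of unscaled or scaled-by-element-size, so for an unsupported scale the index vector is multiplied up front. Since 64-bit wraparound is well defined on both sides, the transformation is safe even on overflow.

```cpp
#include <cstdint>
#include <vector>

// Multiply each 64-bit index by the x86 scale ahead of the gather, so the
// SVE path can use unscaled addressing.
std::vector<uint64_t> PrescaleIndices(const std::vector<uint64_t>& indices, uint64_t scale) {
  std::vector<uint64_t> out;
  out.reserve(indices.size());
  for (uint64_t index : indices) {
    out.push_back(index * scale);  // wraps modulo 2^64, matching hardware behaviour
  }
  return out;
}
```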
Alyssa Rosenzweig
5a3c0eb83c OpcodeDispatcher: fix shl with 8/16-bit variable
the special case here lines up with the special case of using a larger shift for
a smaller result, so we can just grab CF from the larger result.

Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
2024-07-05 18:38:12 -04:00
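The observation in this commit can be modeled in scalar code (a sketch of the flag semantics, not FEX's dispatcher): for an 8-bit shl, x86's CF is the last bit shifted out, and if the shift is computed on the zero-extended value in a wider register, that bit is simply bit 8 of the wide result.

```cpp
#include <cstdint>

// Compute the 8-bit shl in a 32-bit register; the last bit shifted out of
// the 8-bit value lands at bit 8 of the wide result, which is CF.
bool Shl8CarryFlag(uint8_t value, unsigned count /* 1..8 */) {
  const uint32_t wide = static_cast<uint32_t>(value) << count;
  return (wide >> 8) & 1;
}
```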