1357 Commits

Author SHA1 Message Date
Ryan Houdek
ec3e7ceeb5 IR: Removes implicit sized EXTR 2023-08-28 05:02:00 -07:00
Ryan Houdek
c8c8ddbd4f IR: Removes implicit sized PDEP/PEXT 2023-08-28 05:02:00 -07:00
Ryan Houdek
35013bda37 IR: Adds helper to convert between an integer size and IR::OpSize
This is a nop operation and will get optimized away in release builds.
2023-08-28 05:02:00 -07:00
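A minimal sketch of what such a size-to-OpSize helper can look like (the enum values and the function name here are illustrative stand-ins, not FEX's actual definitions): the enumerators are chosen to equal the byte count, so the conversion is a plain cast that disappears in release builds.

```cpp
#include <cstdint>

// Illustrative stand-in for IR::OpSize; not the real FEX definition.
enum class OpSize : uint8_t {
  i8Bit   = 1,
  i16Bit  = 2,
  i32Bit  = 4,
  i64Bit  = 8,
  i128Bit = 16,
};

// Because the enumerator values match the byte count, this is a nop
// conversion and compiles away entirely in release builds.
constexpr OpSize SizeToOpSize(uint8_t Bytes) {
  return static_cast<OpSize>(Bytes);
}

static_assert(SizeToOpSize(4) == OpSize::i32Bit);
```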
Ryan Houdek
62a9a075b7 Arm64: Leave a comment that 32-bit division shouldn't leave garbage in upper 64-bits 2023-08-28 05:02:00 -07:00
Ryan Houdek
ea6d068cc5 IR: Removes implicit sized LDIV/LREM 2023-08-28 05:02:00 -07:00
Ryan Houdek
5013473ec0 IR: Removes implicit sized LUDIV/LUREM 2023-08-28 05:02:00 -07:00
Ryan Houdek
6f2b3e76ac
Merge pull request #3013 from Sonicadvance1/32bit_sra
IR/Passes/RA: Enable SRA for 32-bit GPRs
2023-08-27 21:30:39 -07:00
Ryan Houdek
1d7c280367
Merge pull request #3012 from Sonicadvance1/optimize_movmskps
OpcodeDispatcher: Optimizes SSE movmaskps
2023-08-27 21:29:04 -07:00
Ryan Houdek
514a8223d9 OpcodeDispatcher: Optimizes SSE movmaskps
This improves the instruction implementation from 17 instructions
down to 5 or 6, depending on whether the host supports SVE.

I would say this is now optimal.
2023-08-27 21:07:20 -07:00
Ryan Houdek
8d110738ac IR: Add option to disable vector shift range clamping
The range check and clamping are necessary when passing x86 shift
amounts directly through VUSHL/VSSHR.

Some AVX operations are still using these with range clamping. A future
investigation task should be to check whether they can be switched over
to the wide variants that we implemented for the SSE instructions.

When consuming our own controlled data, we don't want the range clamping
to be enabled.
2023-08-27 21:07:20 -07:00
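To illustrate why the clamping matters, here is a scalar model of one lane (plain C++, not FEX code): x86 vector arithmetic right shifts saturate the shift amount, so a shift of element width or more must behave like a shift by width-1 rather than whatever an out-of-range amount would do on the host shifter.

```cpp
#include <algorithm>
#include <cstdint>

// Scalar model of one 16-bit lane of an x86 arithmetic right shift (psraw).
// Shift amounts >= 16 must act as a shift by 15 (sign fill), so the amount
// is clamped before it reaches the underlying shifter.
int16_t ArithShiftRight16(int16_t Value, uint64_t ShiftAmount) {
  uint64_t Clamped = std::min<uint64_t>(ShiftAmount, 15);
  return static_cast<int16_t>(Value >> Clamped);
}
```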
Ryan Houdek
e4bb0df486 IR: Convert all Move+Atomic+ALU ops from implicit to explicit size
The number of times the implicit size calculation in GPR operations has
bitten us is immeasurable; it was a mistake from the start of the project.
The vector operations never had this problem since they have been
explicitly sized for a long time now.

This converts the base IR operations to be explicitly sized, but adds
implicit sized helpers for the moment while we work on removing implicit
usage from the OpcodeDispatcher.

Should be NFC at this moment but it is a big enough change that I want
it in before the "real" work starts.
2023-08-27 01:35:08 -07:00
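A sketch of the transitional pattern described above, with hypothetical names rather than FEX's actual emitter API: the explicitly sized op takes the width as an argument, while an implicit-size helper merely derives the width from its operands and forwards, so existing OpcodeDispatcher code keeps building while it is migrated.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical IR emitter sketch; names and types are illustrative only.
struct IRValue;

class IREmitter {
public:
  // New, explicitly sized form: the operation width is always spelled out.
  IRValue* _Add(uint8_t Size, IRValue* Src1, IRValue* Src2);

  // Transitional implicit-size helper: derives the width from the larger
  // operand, matching the old implicit behaviour, then forwards.
  IRValue* _Add(IRValue* Src1, IRValue* Src2) {
    return _Add(std::max(GetOpSize(Src1), GetOpSize(Src2)), Src1, Src2);
  }

private:
  uint8_t GetOpSize(IRValue* Node);
};
```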
Ryan Houdek
7146691360 IR/Passes/RA: Enable SRA for 32-bit GPRs
Noticed that we had never enabled this, which was a concern back when our
GPR operations weren't as strict about not leaving garbage in the upper
bits when operating as a 32-bit operation.

Now that our ALU operations are more strict about enforcing upper bit
zeroing we can enable this.

This finally gets Half-Life: Source above 200 FPS. It causes significant
performance improvements for 32-bit games because we're no longer
redundantly moving registers before and after every operation, converting
a bunch of 3-4 instruction sequences into 1.
2023-08-26 18:22:50 -07:00
Ryan Houdek
572d6cd3e6 OpcodeDispatcher: Fixes ADC and SBB 2023-08-26 18:22:50 -07:00
Ryan Houdek
fcc37bf6a8 OpcodeDispatcher: Fixes RCR and ADOX 32-bit
Automatic size inheritance was breaking these operations.
2023-08-26 18:22:50 -07:00
Ryan Houdek
8f7925d06f Arm64: Simple typo fix 2023-08-26 18:22:50 -07:00
Ryan Houdek
eace648fa9 OpcodeDispatcher: Fixes bug in UMUL
This was trying to operate on a 32-bit value while BFE-ing out the upper
32 bits.

Actually fixes this so it is operating on the 64-bit multiply result.
2023-08-26 18:22:50 -07:00
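A scalar illustration of the fix (plain C++, not the JIT code): the upper 32 bits only exist on the full 64-bit product, so the bitfield extract has to be applied to that product rather than to a 32-bit value.

```cpp
#include <cstdint>

// 32x32 -> 64 unsigned multiply; the upper half must be extracted from the
// 64-bit product. Extracting bits [63:32] of a value that is only 32 bits
// wide would always yield zero.
uint32_t MulHigh32(uint32_t A, uint32_t B) {
  uint64_t Product = static_cast<uint64_t>(A) * B;
  return static_cast<uint32_t>(Product >> 32);
}
```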
Ryan Houdek
a4ac21a4e4 OpcodeDispatcher: Fixes bug in GetRFLAG with CachedNZCV
This was only operating at byte size but was attempting to get bit
offsets greater than the operating size.
Change the operating size over to 64-bit.
2023-08-26 18:22:50 -07:00
Ryan Houdek
a01e69092d Arm64: Ensure Bfe and Sbfe operate at 32-bit or 64-bit op size
For Sbfe at least, it ensures the upper bits don't get filled with
garbage.
For Bfe it doesn't change behaviour, but it's best to be correct.
2023-08-26 18:22:50 -07:00
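A scalar model of why the op size matters for the signed extract (illustrative, not the JIT's emitter): when Sbfe produces a 64-bit result, the field's sign bit has to be replicated all the way through bit 63, so the operation must run at 64-bit width instead of leaving the bits above 31 unspecified.

```cpp
#include <cstdint>

// Signed bitfield extract of `Width` bits starting at `Lsb`, producing a
// 64-bit result (assumes Lsb + Width <= 64). Shifting the field up to bit 63
// and arithmetic-shifting back down guarantees the upper bits are a clean
// sign extension, never garbage.
int64_t Sbfe64(uint64_t Value, unsigned Lsb, unsigned Width) {
  uint64_t Shifted = Value << (64 - Lsb - Width);
  return static_cast<int64_t>(Shifted) >> (64 - Width);
}
```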
Ryan Houdek
2fde2140ef Arm64: Ensure assert is testing correct array 2023-08-26 18:22:50 -07:00
Ryan Houdek
e10afefb2b IR: Fixes RAValidation for 32-bit applications
RAValidation was assuming that the GPR register class would only
have up to 16 registers for either SRA or dynamic registers.

When running a 32-bit application we allow 17 GPRs to be dynamically
allocated, since we can take 8 back from SRA in that case.

Just split the two classes in the RAValidation pass since they will
never overlap their allocation.

Fixes validation in `32Bit_Secondary/15_XX_0.asm` locally that changed
behaviour due to tinkering.
2023-08-26 16:18:03 -07:00
Ryan Houdek
9ba46f429e X8764: Ensure frndint uses host rounding mode
This previously used `Round_Nearest`, which had a bug on Arm64 where it
was actually always using `Round_Host` aka frinti.
Ever since 393cea2e8ba47a15a3ce31d07a6088a2ff91653c[1] this has been fixed
so that `Round_Nearest` actually uses frintn for nearest.

This instruction actually wants to use the host rounding mode.
One issue with this is that x87 and SSE have different rounding mode
flags and currently we conflate the two in our JIT. This will need to be
fixed in the future.

In the meantime this restores the behaviour of actually using the host
rounding mode, which fixes the black screen and broken vertices in Grim
Fandango Remastered.

[1] e89321dc602e35cbb1382b35ac2b35e7e417ef92 for scalar.
2023-08-25 16:04:01 -07:00
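The distinction being restored can be demonstrated with standard C++ (an analogy for frinti vs frintn, not FEX code): `std::nearbyint` rounds using whatever rounding mode is currently installed, which is the behaviour frndint wants, whereas a fixed round-to-nearest would ignore the mode the guest configured.

```cpp
#include <cfenv>
#include <cmath>
#include <cstdio>

int main() {
  volatile double Value = 2.5;  // volatile keeps the compiler from folding

  // nearbyint() follows the currently installed rounding mode,
  // analogous to frinti / "Round_Host".
  std::fesetround(FE_TONEAREST);
  std::printf("%f\n", std::nearbyint(Value));  // 2.0 (ties to even)

  std::fesetround(FE_UPWARD);
  std::printf("%f\n", std::nearbyint(Value));  // 3.0 (mode respected)

  std::fesetround(FE_TONEAREST);
  return 0;
}
```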
Ryan Houdek
a76c2c57b0 OpcodeDispatcher: Optimize PSHUF{LW, HW, D}!
This is way more optimal!
2023-08-25 12:59:40 -07:00
Ryan Houdek
7f63d87295 IR: Adds support for new LoadNamedVectorIndexedConstant IR 2023-08-25 12:59:40 -07:00
Mai
bf12f08218
Merge pull request #3002 from Sonicadvance1/optimize_movmaskpd
OpcodeDispatcher: Optimize 128-bit movmaskpd
2023-08-25 08:50:28 -04:00
Mai
1f7d138d2a
Merge pull request #3008 from Sonicadvance1/optimize_movddup
OpcodeDispatcher: Optimize movddup from register
2023-08-25 08:48:40 -04:00
Mai
f36f07055a
Merge pull request #3007 from Sonicadvance1/optimize_cvtdq2pd
OpcodeDispatcher: Optimize cvtdq2pd from register source
2023-08-25 08:47:47 -04:00
Mai
30a1a382c4
Merge pull request #3006 from Sonicadvance1/optimize_movq
OpcodeDispatcher: Optimizes movq
2023-08-25 08:46:02 -04:00
Mai
631655dd81
Merge pull request #3005 from Sonicadvance1/nontemporalmoves
OpcodeDispatcher: Optimize nontemporal moves
2023-08-25 08:44:48 -04:00
Ryan Houdek
2fbcf2e4a9 OpcodeDispatcher: Optimize movddup from register
This is now optimal
2023-08-25 03:44:10 -07:00
Ryan Houdek
00124205e5 OpcodeDispatcher: Optimize cvtdq2pd from register source
This is now optimal
2023-08-25 03:39:24 -07:00
Ryan Houdek
81281e2115 OpcodeDispatcher: Optimizes movq
Removes a redundant move between registers and makes it optimal.
Also removes a couple redundant moves on the avx version.
2023-08-25 03:29:58 -07:00
Ryan Houdek
1cb2b084b3 OpcodeDispatcher: Generate more optimal code for scalar GPR converts
1) In the case that we are converting a GPR, don't zero extend it first.
2) In the case that the scalar comes from memory, load it into an FPR
   first and convert it in-place.

These are now optimal in the case that AFP is unsupported.
2023-08-25 03:19:11 -07:00
Ryan Houdek
189b0da68f JIT/Int: Add support for scalar conversion as well 2023-08-25 03:19:11 -07:00
Ryan Houdek
f3679a99ec OpcodeDispatcher: Optimize nontemporal moves
These are now optimal.
2023-08-25 03:13:10 -07:00
Ryan Houdek
62156f2152 ARM64JIT: Adds support for scalar cvt 2023-08-25 02:34:30 -07:00
Ryan Houdek
9a54898429 OpcodeDispatcher: Optimize 128-bit movmaskpd
I'd consider this optimal now.

Thanks to @dougallj for the optimization idea again!
2023-08-24 17:27:54 -07:00
Ryan Houdek
80d871fb18
Merge pull request #3001 from Sonicadvance1/optimize_cvtps2pd
OpcodeDispatcher: Optimize cvtps2pd
2023-08-24 16:09:12 -07:00
Ryan Houdek
3731e6d88b OpcodeDispatcher: Optimize cvtps2pd
SSE version is now optimal and AVX version gets rid of a redundant move.
2023-08-24 15:55:11 -07:00
Ryan Houdek
c441b238c7 OpcodeDispatcher: Optimize MMX conversion operation
These instructions are now optimal
2023-08-24 15:46:19 -07:00
Ryan Houdek
72ce7ddf2d Arm64: Optimize CVT operations for 64-bit variants
Using 128-bit converts for 64-bit versions cuts their throughput in half
on Cortex. Ensure we use the 64-bit version when possible.
2023-08-24 15:45:07 -07:00
Ryan Houdek
a1210f892a OpcodeDispatcher: Optimize addsubp{s,d} using fcadd
This extension was seemingly added with the Cortex-A710 and turns this
instruction into two instructions, which is quite good.

Needs #2994 merged first.

Huge thanks to @dougallj for the optimization idea!
2023-08-24 15:00:41 -07:00
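For context on how a complex-arithmetic add can implement addsub, here is a scalar model of FCADD's rotate-90 behaviour as I understand it; the exact two-instruction sequence FEX emits may differ. With the second source swapped within each element pair, FCADD #90's subtract-then-add pattern lines up with addsubps.

```cpp
#include <array>

// Scalar model of one 2-element pair of FCADD with rotate == 90:
//   result[0] = a[0] - b[1]
//   result[1] = a[1] + b[0]
// If b is first swapped within the pair (b' = {b[1], b[0]}), this becomes
//   result[0] = a[0] - b[0], result[1] = a[1] + b[1]
// which is exactly the addsubps pattern, hence "two instructions".
std::array<float, 2> AddSubViaFcadd90(std::array<float, 2> A,
                                      std::array<float, 2> B) {
  std::array<float, 2> Swapped{B[1], B[0]};
  return {A[0] - Swapped[1], A[1] + Swapped[0]};
}
```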
Ryan Houdek
ba01eac467 IR: Adds support for ARM's FCMA FCADD instruction 2023-08-24 15:00:41 -07:00
Ryan Houdek
c5d147322f HostFeatures: Adds support for FCMA 2023-08-24 15:00:41 -07:00
Ryan Houdek
565b30e15e OpcodeDispatcher: Cache named vector constants in the block
If the named constant of that size gets used multiple times, just reuse
the previous value if it is still in scope.

Makes addsubp{s,d} and phminposuw more optimal for each occurrence within
a block.

Needs #2993 merged first.
2023-08-24 14:46:37 -07:00
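A sketch of the caching idea with hypothetical types and names (not FEX's code): key the cache by constant identity and size, hand back the already-materialized value while it remains valid, and clear the cache at block boundaries.

```cpp
#include <cstdint>
#include <map>
#include <utility>

struct IRValue;  // placeholder for an IR SSA value

class NamedConstantCache {
public:
  // Returns the cached load for (Constant, Size) if one exists in this
  // block, otherwise materializes it once and remembers it.
  IRValue* Get(uint32_t Constant, uint8_t Size) {
    auto Key = std::make_pair(Constant, Size);
    auto It = Cache.find(Key);
    if (It != Cache.end()) {
      return It->second;
    }
    IRValue* Loaded = LoadNamedVectorConstant(Constant, Size);
    Cache.emplace(Key, Loaded);
    return Loaded;
  }

  // Called at block boundaries: cached values don't outlive the block.
  void Reset() { Cache.clear(); }

private:
  IRValue* LoadNamedVectorConstant(uint32_t Constant, uint8_t Size);
  std::map<std::pair<uint32_t, uint8_t>, IRValue*> Cache;
};
```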
Ryan Houdek
f300196d90 OpcodeDispatcher: Optimize AddSubP{S,D}
Use a named constant for loading the sign inversion, then EOR the second
source and just FAdd it all.
In a vacuum it isn't a significant improvement, but as soon as more than
one instruction is in a block it will eventually get optimized with
named constant caching and be a significant win.

Thanks to @rygorous for the idea!
2023-08-23 20:32:51 -07:00
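A scalar model of the trick on one pair of lanes (not the emitted vector code): flipping the sign bit of the even lane of the second source turns the subtract/add pattern into one uniform add, and the sign-mask constant is exactly the kind of named constant that the caching change above can reuse.

```cpp
#include <array>
#include <bit>
#include <cstdint>

// Scalar model of addsubps on one pair of lanes:
//   result[0] = a[0] - b[0]   (even lane: subtract)
//   result[1] = a[1] + b[1]   (odd lane:  add)
// XOR-ing the sign bit into the even lane of b negates it, after which a
// plain add of both lanes produces the same result.
std::array<float, 2> AddSubPs(std::array<float, 2> A, std::array<float, 2> B) {
  uint32_t EvenBits = std::bit_cast<uint32_t>(B[0]) ^ 0x80000000u;
  float NegatedEven = std::bit_cast<float>(EvenBits);
  return {A[0] + NegatedEven, A[1] + B[1]};
}
```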
Lioncache
42ccc18606 x86_64/MemoryOps: Fix mislabeled IR op messages 2023-08-23 22:54:36 -04:00
Mai
66c6f96120
Merge pull request #2990 from Sonicadvance1/optimize_pmulh
OpcodeDispatcher: Optimize PMULH{U,}W using new IR operations
2023-08-23 22:06:14 -04:00
Ryan Houdek
77b6d854b9 OpcodeDispatcher: Optimize PMULH{U,}W using new IR operations
SSE implementations are now optimal.
The SVE 128-bit operation makes it even more optimal.
2023-08-23 18:38:05 -07:00
Ryan Houdek
05b9651279 IR: Implements new vector multiply returning high bits
SVE implemented a new instruction that does this explicitly, so we
should support it directly.
2023-08-23 18:38:05 -07:00
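A scalar model of the per-lane operation the new IR op represents (plain C++, one 16-bit lane): widen, multiply, and keep only the upper half of the product, which is what pmulhw/pmulhuw need.

```cpp
#include <cstdint>

// One lane of "multiply returning the high half" for 16-bit elements.
int16_t MulHighSigned16(int16_t A, int16_t B) {
  int32_t Product = static_cast<int32_t>(A) * static_cast<int32_t>(B);
  return static_cast<int16_t>(Product >> 16);
}

uint16_t MulHighUnsigned16(uint16_t A, uint16_t B) {
  uint32_t Product = static_cast<uint32_t>(A) * static_cast<uint32_t>(B);
  return static_cast<uint16_t>(Product >> 16);
}
```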
Lioncache
26c81224ac OpcodeDispatcher: Remove redundant moves from AESIMC
Zero-extension will occur automatically upon storing if necessary.

We can also join the SSE and AVX implementations together.
2023-08-23 21:34:37 -04:00