This now improves the instruction implementation from 17 instructions
down to 5 or 6, depending on whether the host supports SVE.
I would say this is now optimal.
The range check and clamping are necessary in the cases where x86 shift
amounts are passed directly through VUSHL/VSSHR.
Some AVX operations are still using these with range clamping. A future
investigation task should be to check whether they can be switched over to
the wide variants that we implemented for the SSE instructions.
When consuming our own controlled data, we don't want the range clamping
to be enabled.
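As a rough illustration of that distinction (helper names are hypothetical, not FEX's actual API, and clamping to the element width is an assumption of this sketch):

```cpp
#include <algorithm>
#include <cstdint>

// When the shift count comes straight from guest x86 state it can be any
// value, so saturate it to the element width before handing it to the
// per-element Arm64 shift; x86 treats counts >= the element width the same
// as shifting by the full element width, so the clamp keeps the
// guest-visible result.
uint64_t ClampGuestShiftCount(uint64_t Count, uint64_t ElementSizeInBits) {
  return std::min(Count, ElementSizeInBits);
}

// When the count was produced by our own IR it is already in range, so the
// clamp is pure overhead and gets skipped.
uint64_t TrustedShiftCount(uint64_t Count) {
  return Count;
}
```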
The number of times the implicit size calculation in GPR operations has
bitten us is immeasurable; it was a mistake from the start of the project.
The vector-based operations never had this problem since they have been
explicitly sized for a long time now.
This converts the base IR operations to be explicitly sized, but adds
implicitly sized helpers for the moment while we work on removing implicit
usage from the OpcodeDispatcher.
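A minimal sketch of the shape of that split, with hypothetical emitter names:

```cpp
#include <cstdint>

// Sketch only; the real IR emitter differs. The base operation takes an
// explicit size, and a temporary implicit-size overload stays around while
// the OpcodeDispatcher is migrated off of it.
enum class OpSize : uint8_t { i32Bit = 4, i64Bit = 8 };

struct OrderedNode;  // opaque IR SSA value

struct IREmitter {
  // Explicitly sized base operation: the caller always spells out the size.
  OrderedNode* _Add(OpSize Size, OrderedNode* Src1, OrderedNode* Src2);

  // Implicit-size helper kept for the moment; it guesses a size on the
  // caller's behalf, which is exactly the behaviour being phased out.
  OrderedNode* _Add(OrderedNode* Src1, OrderedNode* Src2) {
    return _Add(GuessOpSize(Src1, Src2), Src1, Src2);
  }

  OpSize GuessOpSize(OrderedNode* Src1, OrderedNode* Src2);
};
```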
Should be NFC at this moment but it is a big enough change that I want
it in before the "real" work starts.
Noticed that we hadn't ever enabled this, which was a concern back when
our GPR operations weren't as strict and could leave garbage in the upper
bits when operating as a 32-bit operation.
Now that our ALU operations are stricter about enforcing upper-bit
zeroing, we can enable this.
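For illustration of the invariant being relied on (this is not FEX code): a 32-bit ALU result must leave the upper 32 bits of the 64-bit register zeroed.

```cpp
#include <cinttypes>
#include <cstdint>
#include <cstdio>

// A 32-bit ALU result written back to a 64-bit guest register must leave
// bits 63:32 zeroed, matching both x86-64 and AArch64 "w register"
// semantics; the cast below models that zeroing.
uint64_t Add32(uint64_t Rn, uint64_t Rm) {
  return static_cast<uint32_t>(Rn + Rm);
}

int main() {
  uint64_t dirty = 0xdeadbeef00000001ULL;
  // If the zeroing were skipped, later code could observe the stale
  // 0xdeadbeef garbage in the upper half instead of zeroes.
  std::printf("%016" PRIx64 "\n", Add32(dirty, 1));  // 0000000000000002
  return 0;
}
```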
This finally gets Half-Life: Source above 200 FPS. It causes significant
performance improvements for 32-bit games because we're no longer
redundantly moving registers before and after every operation, converting
a bunch of 3-4 instruction sequences into 1.
RAValidation was assuming that the GPR register class would only have up
to 16 registers for either SRA or dynamic registers.
When running a 32-bit application we allow 17 GPRs to be dynamically
allocated, since we can take 8 back from SRA in that case.
Just split the two classes in the RAValidation pass since they will
never overlap their allocation.
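A hedged sketch of that split, with illustrative names and sizes only:

```cpp
#include <bitset>
#include <cstdint>

// The SRA and dynamically allocated GPR classes get validated against
// independent sets, so the 17th dynamic register in the 32-bit case no
// longer trips a shared 16-entry assumption.
struct GPRUsage {
  std::bitset<32> DynamicGPRs{};  // up to 17 when running 32-bit apps
  std::bitset<32> StaticGPRs{};   // SRA registers, tracked on their own

  void MarkDynamic(uint32_t Reg) { DynamicGPRs.set(Reg); }
  void MarkStatic(uint32_t Reg) { StaticGPRs.set(Reg); }

  // The two classes never overlap their allocation, so each register is
  // only checked against its own class.
  bool DynamicInUse(uint32_t Reg) const { return DynamicGPRs.test(Reg); }
  bool StaticInUse(uint32_t Reg) const { return StaticGPRs.test(Reg); }
};
```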
Fixes validation in `32Bit_Secondary/15_XX_0.asm` locally that changed
behaviour due to tinkering.
This previously used `Round_Nearest`, which had a bug on Arm64 where it
was actually always using `Round_Host`, aka frinti.
Ever since 393cea2e8ba47a15a3ce31d07a6088a2ff91653c[1] this has been fixed
so that `Round_Nearest` actually uses frintn for nearest.
This instruction actually wants to use the host rounding mode.
One issue with this is that x87 and SSE have different rounding mode
flags and currently we conflate the two in our JIT. This will need to be
fixed in the future.
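A minimal sketch of the mapping in question, with hypothetical names:

```cpp
#include <cstdint>

// Round-to-nearest must lower to frintn (round to nearest, ties to even),
// while the host mode must lower to frinti (use the current FPCR rounding
// mode). The instruction fixed here wants the frinti behaviour.
enum class RoundType : uint8_t {
  Nearest,  // lowers to frintn
  Host,     // lowers to frinti
};

constexpr const char* LoweredRounding(RoundType Type) {
  return Type == RoundType::Nearest ? "frintn" : "frinti";
}
```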
In the meantime this restores the behaviour of actually using the host
rounding mode, which fixes the black screen and broken vertices in Grim
Fandango Remastered.
[1] e89321dc602e35cbb1382b35ac2b35e7e417ef92 for scalar.
1) In the case that we are converting a GPR, don't zero-extend it first.
2) In the case that the scalar comes from memory, load it into an FPR
first and convert it in place.
These are now optimal in the case where AFP is unsupported.
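A hedged sketch of the two cases above, assuming a scalar integer-to-float conversion; the helper names and the Arm64 sequences in the comments are illustrative only:

```cpp
#include <cstdio>

enum class SrcKind { GPR, Memory };

void DescribeScalarConvert(SrcKind Kind) {
  if (Kind == SrcKind::GPR) {
    // Case 1: the integer already lives in a GPR, so convert it directly
    // without the previously emitted zero-extension of the source,
    // e.g. `scvtf s0, w1`.
    std::puts("convert straight from the GPR");
  } else {
    // Case 2: the integer lives in memory, so load it straight into an
    // FPR and convert it in place instead of bouncing through a GPR,
    // e.g. `ldr s0, [x1]` then `scvtf s0, s0`.
    std::puts("load into an FPR, convert in place");
  }
}
```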
This extension was seemingly added with Cortex-A710 and turns this
instruction into two instructions, which is quite good.
Needs #2994 merged first.
Huge thanks to @dougallj for the optimization idea!
If the named constant of that size gets used multiple times, just reuse
the previous value if it is still in scope.
Makes addsubp{s,d} and phminposuw more optimal when more than one of them
is in a block.
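A minimal sketch of the caching idea, with hypothetical types and names:

```cpp
#include <cstdint>
#include <map>
#include <utility>

struct OrderedNode;  // opaque IR value

// The first use of a (constant, size) pair in a block materializes the
// load; later uses reuse the earlier value while it is still in scope.
struct NamedConstantCache {
  std::map<std::pair<uint32_t, uint8_t>, OrderedNode*> Cache;

  OrderedNode* Get(uint32_t Constant, uint8_t Size,
                   OrderedNode* (*EmitLoad)(uint32_t, uint8_t)) {
    const auto Key = std::make_pair(Constant, Size);
    if (auto It = Cache.find(Key); It != Cache.end()) {
      return It->second;  // Already loaded in this block, reuse it.
    }
    OrderedNode* Node = EmitLoad(Constant, Size);
    Cache.emplace(Key, Node);
    return Node;
  }

  // Cleared when the cached values go out of scope, e.g. at block ends.
  void Clear() { Cache.clear(); }
};
```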
Needs #2993 merged first.
Use a named constant for loading the sign inversion, then EOR the second
source with it and just FAdd it all.
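A scalar model of the trick, for illustration only:

```cpp
#include <array>
#include <bit>
#include <cstddef>
#include <cstdint>
#include <cstdio>

// XOR-ing the sign bit of a float negates it, so one EOR against a named
// constant with the sign bit set in the "subtract" lanes turns the single
// FADD that follows into the add/subtract pattern addsubps wants.
float FlipSign(float v) {
  return std::bit_cast<float>(std::bit_cast<uint32_t>(v) ^ 0x80000000u);
}

int main() {
  std::array<float, 4> a{1.0f, 2.0f, 3.0f, 4.0f};
  std::array<float, 4> b{10.0f, 20.0f, 30.0f, 40.0f};

  // addsubps subtracts in the even lanes and adds in the odd lanes.
  for (std::size_t i = 0; i < a.size(); ++i) {
    const float src2 = (i % 2 == 0) ? FlipSign(b[i]) : b[i];
    std::printf("%g\n", a[i] + src2);  // -9, 22, -27, 44
  }
  return 0;
}
```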
In a vacuum it isn't a significant improvement, but as soon as more than
one instruction is in a block it will eventually get optimized with
named constant caching and be a significant win.
Thanks to @rygorous for the idea!