Aside from its own self-test, the parser is unused and should remain that way,
since it's a maintenance burden with no real benefit. Burn it.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
SRA is fundamentally about hardware registers, not stores into a
software-defined context. So, it should take a register instead of an offset.
This makes all the unaligned special cases unrepresentable (by design).
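
As a rough illustration of why a register-typed interface rules out the unaligned cases, here is a minimal sketch. Every name in it (`Context`, `StaticReg`, `StoreContextByOffset`, `StoreRegister`) is hypothetical and is not FEX's actual API; it only assumes a context holding 16 64-bit registers.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical guest context: 16 general-purpose registers of 64 bits each.
struct Context {
  std::array<uint64_t, 16> gpr{};
};

// Hypothetical register name type. Only whole registers can be named
// (four names listed for brevity).
enum class StaticReg : uint8_t { R0 = 0, R1, R2, R3 };

// Offset-based interface: any byte offset is representable, so the
// implementation has to detect and handle unaligned or out-of-range offsets.
inline bool StoreContextByOffset(Context& ctx, size_t offset, uint64_t value) {
  if (offset % sizeof(uint64_t) != 0 || offset + sizeof(uint64_t) > sizeof(ctx.gpr)) {
    return false; // the special case that the register-based form avoids
  }
  std::memcpy(reinterpret_cast<unsigned char*>(ctx.gpr.data()) + offset, &value, sizeof(value));
  return true;
}

// Register-based interface: an unaligned store cannot even be expressed.
inline void StoreRegister(Context& ctx, StaticReg reg, uint64_t value) {
  ctx.gpr[static_cast<size_t>(reg)] = value;
}
```

With the second form there is no argument value that names "half of one register and half of the next", so no code path is needed for it.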
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
While the ARM64EC ABI mostly matches FEX's SRA, the stack still needs to
be switched to the emulator stack and target RIP stored into the FEX
context before jumping to the dispatcher loop.
These callbacks are used for code invalidation and setting the right
emulated CPU features, neither of which are necessary for syscalls made
from within FEX. Avoid calling them to prevent deadlocks caused by
nested locks during compilation.
Moves the FEX version string out of the CPU model name and into the hypervisor leaves.
Before:
```bash
$ FEXBash 'cat /proc/cpuinfo | grep "model name"'
model name : FEX-2404-101-gf9effcb Cortex-A78C
model name : FEX-2404-101-gf9effcb Cortex-A78C
model name : FEX-2404-101-gf9effcb Cortex-A78C
model name : FEX-2404-101-gf9effcb Cortex-A78C
model name : FEX-2404-101-gf9effcb Cortex-X1C
model name : FEX-2404-101-gf9effcb Cortex-X1C
model name : FEX-2404-101-gf9effcb Cortex-X1C
model name : FEX-2404-101-gf9effcb Cortex-X1C
```
After:
```bash
$ FEXBash 'cat /proc/cpuinfo | grep "model name"'
model name : Cortex-A78C
model name : Cortex-A78C
model name : Cortex-A78C
model name : Cortex-A78C
model name : Cortex-X1C
model name : Cortex-X1C
model name : Cortex-X1C
model name : Cortex-X1C
```
Now the FEX string is exposed as a hypervisor function leaf, so any
utility that wants the FEX version can query it directly.
Ex:
```bash
$ ./Bin/FEXInterpreter get_cpuid_fex
Maximum 4000_0001h sub-leaf: 2
We are running under FEX on host: 2
FEX version string is: 'FEX-2404-113-g820494d'
```
We were previously generating nonsense code if the destination != source:
faddp v2.4s, v4.4s, v4.4s
faddp s2, v4.2s
The result of the first faddp is ignored, so the second merely calculates the
sum of the first 2 sources (not all 4 as needed).
The correct fix is to feed the first add into the second, regardless of the
final destination:
faddp v2.4s, v4.4s, v4.4s
faddp s2, v2.2s
Hit in an ASM test with new RA.
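
The difference between the two sequences can be checked with a small scalar model of the two `faddp` forms. This is only a sketch for reasoning about the data flow, not FEX code; the helper names (`FaddpVector`, `FaddpScalar`, `BuggyReduce`, `FixedReduce`) are made up for illustration.

```cpp
#include <array>

using Vec4 = std::array<float, 4>;

// Models `faddp vd.4s, vn.4s, vm.4s`: adjacent pairs of vn fill the low half
// of the result, adjacent pairs of vm fill the high half.
inline Vec4 FaddpVector(const Vec4& n, const Vec4& m) {
  return {n[0] + n[1], n[2] + n[3], m[0] + m[1], m[2] + m[3]};
}

// Models `faddp sd, vn.2s`: adds the two lowest lanes of vn.
inline float FaddpScalar(const Vec4& n) {
  return n[0] + n[1];
}

// Old sequence: the second faddp reads the original source, so the result of
// the first faddp is dead and only lanes 0 and 1 are summed.
inline float BuggyReduce(const Vec4& src) {
  Vec4 tmp = FaddpVector(src, src);
  (void)tmp;
  return FaddpScalar(src);
}

// Fixed sequence: the second faddp consumes the first one's result, giving
// (src[0] + src[1]) + (src[2] + src[3]).
inline float FixedReduce(const Vec4& src) {
  Vec4 tmp = FaddpVector(src, src);
  return FaddpScalar(tmp);
}
```

For a source of `{1, 2, 4, 8}`, `BuggyReduce` yields 3 while `FixedReduce` yields 15.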
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
A feature of FEX's JIT is that when an unaligned atomic load/store
operation occurs, the instructions will be backpatched into a barrier
plus a non-atomic memory instruction. This is the half-barrier technique
that still ensures correct visibility of loads and stores in an unaligned
context.
The problem with this approach is that the dmb instructions are HEAVY,
because they effectively stop the world until all memory operations in
flight are visible. But it is a necessary evil since unaligned atomics
aren't a thing on ARM processors. FEAT_LSE2 only gives you unaligned
atomics within a 16-byte granule, which doesn't match the x86
behaviour of cacheline granularity (effectively always 64B).
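
To make the granularity mismatch concrete, the sketch below checks whether an access stays inside a single naturally aligned block. The function names are hypothetical, and the 64-byte figure is simply the cacheline size quoted above.

```cpp
#include <cstdint>

// True if the access [addr, addr + size) lies entirely inside one naturally
// aligned block of `granule` bytes. Assumes size > 0.
constexpr bool FitsInGranule(uint64_t addr, uint64_t size, uint64_t granule) {
  return (addr / granule) == ((addr + size - 1) / granule);
}

// x86-style rule from the text: fine as long as one 64B cacheline is touched.
constexpr bool WithinCacheline64(uint64_t addr, uint64_t size) {
  return FitsInGranule(addr, size, 64);
}

// ARM-style rule from the text: fine only inside one 16-byte granule.
constexpr bool WithinGranule16(uint64_t addr, uint64_t size) {
  return FitsInGranule(addr, size, 16);
}
```

A 4-byte access at address `0x0E` covers bytes `0x0E..0x11`: it stays within one 64-byte line but crosses a 16-byte boundary, which is exactly the kind of access that needs the backpatched sequence.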
This adds a new TSO option to disable the half-barrier on unaligned
atomics and instead only convert them to a regular load/store instruction,
omitting the half-barrier. This gives more insight into how good a
CPU's LRCPC implementation is by not stalling on DMB instructions when
possible.
Originally implemented as a test to see if this makes Sonic Adventure 2
run full speed with TSO enabled (but all available TSO options disabled)
on NVIDIA Orin. Unfortunately this basically makes the code no longer
stall on dmb instructions and instead just shows how bad the LRCPC
implementation is, since the stalls show up on `ldapur` instructions
instead.
Tested Sonic Adventure 2 on X13s and it ran at 60FPS there without the
hack anyway.
Instead of only enabling enhanced rep movs if software TSO is disabled,
enable it if software TSO is disabled OR memcpysettso is disabled. This
is because we now hit the fast path when memcpysettso alone is disabled
but global TSO is enabled.
Retested Hades and performance was fine in this configuration.
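
The change in the enabling condition can be written out as two predicates. This is a sketch only: `TSOConfig` and both function names are hypothetical stand-ins for the two options described above, not FEX's configuration API.

```cpp
// Hypothetical view of the two relevant options.
struct TSOConfig {
  bool SoftwareTSO;  // global software TSO emulation is enabled
  bool MemcpySetTSO; // TSO semantics for memcpy/memset-style operations are enabled
};

// Old condition: enhanced rep movs only when software TSO is off entirely.
constexpr bool OldEnhancedRepMovs(const TSOConfig& c) {
  return !c.SoftwareTSO;
}

// New condition: also enabled when only memcpysettso is off.
constexpr bool NewEnhancedRepMovs(const TSOConfig& c) {
  return !c.SoftwareTSO || !c.MemcpySetTSO;
}
```

The only configuration whose result changes is software TSO enabled with memcpysettso disabled, which previously missed the fast path.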
Found out that Far Cry uses this instruction and it is viable to use in
CPL-3. This only returns constant data but its behaviour is a little
quirky.
This instruction has the odd behaviour that the 32-bit operation does an
insert into the 64-bit destination, which might be an Intel versus AMD
difference. I don't have an Intel machine available to test whether that
theory is true, though. This assumption would match similar behaviour
where segment registers are inserted instead of zero-extended.
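
For reference, the two possible 32-bit write behaviours differ only in what happens to the upper half of the destination. The helpers below are an illustrative sketch with made-up names and do not model any specific instruction.

```cpp
#include <cstdint>

// Zero-extending write: the upper 32 bits of the destination are cleared.
constexpr uint64_t WriteZext32(uint64_t /*old_dest*/, uint32_t value) {
  return value;
}

// Inserting write: the upper 32 bits of the old destination are preserved.
constexpr uint64_t WriteInsert32(uint64_t old_dest, uint32_t value) {
  return (old_dest & 0xFFFFFFFF00000000ULL) | value;
}
```

Starting from a destination of `0x1122334455667788` and writing `0xAABBCCDD`, the zero-extending form gives `0x00000000AABBCCDD` and the inserting form gives `0x11223344AABBCCDD`.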
Gets the game farther but then it crashes in a `___ascii_strnicmp`
function where the arguments end up being `___ascii_strnicmp(nullptr, "Color", 5);`.
Functional revert of 92f31648b ("RCLSE: optimize out pointless stores"), which
reportedly regressed some titles due to RA doom. We'll revisit later, leaving in
the code for when RA is ready to light this up.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>