SRA is fundamentally about hardware registers, not stores into a
software-defined context. So, it should take a register instead of an offset.
This makes all the unaligned special cases unrepresentable (by design).
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
While the ARM64EC ABI mostly matches FEX's SRA, the stack still needs to
be switched to the emulator stack and target RIP stored into the FEX
context before jumping to the dispatcher loop.
These callbacks are used for code invalidation and setting the right
emulated CPU features, neither of which are necessary for syscalls made
from within FEX. Avoid calling them to prevent deadlocks caused by
nested locks during compilation.
This is required by recent wine changes to use longjmp for user
callbacks. Switch to saving the context at every simulate call and
setting the unwind SP/PC to that context with a small SEH trampoline
for the syscall handler.
A bit of refactoring is necessary before we can move the remaining
Linux-specific code to the frontend.
Most of this is taken from #3535 while attempting to be NFC as much as
possible.
From `man 2 open`:
> The mode argument must be supplied if O_CREAT or O_TMPFILE is
> specified in flags; if it is not supplied, some arbitrary bytes
> from the stack will be applied as the file mode.
The implementation of this has been brittle and is architecturally
incompatible with 32-bit guests. It's unlikely this could be fixed with
incremental improvements.
Since libGL and libvulkan can be forwarded independently of libX11 now,
these libX11 bits can be dropped without negative impact on compatibility.
Some applications create multiple Vulkan instances with different sets of
extensions, so we might miss some of these pointers during the initial
function pointer query.
X11 displays and xcb connections managed by the guest libX11 can't be used by
the host, but we can create intermediary objects using the host libX11. This
allows to connect guest-managed objects to the host window system integration
APIs in OpenGL/Vulkan.
Moves it to the hypervisor leaves.
Before:
```bash
$ FEXBash 'cat /proc/cpuinfo | grep "model name"'
model name : FEX-2404-101-gf9effcb Cortex-A78C
model name : FEX-2404-101-gf9effcb Cortex-A78C
model name : FEX-2404-101-gf9effcb Cortex-A78C
model name : FEX-2404-101-gf9effcb Cortex-A78C
model name : FEX-2404-101-gf9effcb Cortex-X1C
model name : FEX-2404-101-gf9effcb Cortex-X1C
model name : FEX-2404-101-gf9effcb Cortex-X1C
model name : FEX-2404-101-gf9effcb Cortex-X1C
```
After:
```bash
$ FEXBash 'cat /proc/cpuinfo | grep "model name"'
model name : Cortex-A78C
model name : Cortex-A78C
model name : Cortex-A78C
model name : Cortex-A78C
model name : Cortex-X1C
model name : Cortex-X1C
model name : Cortex-X1C
model name : Cortex-X1C
```
Now the FEX string is in the hypervisor functions as a leaf, so if some
utility wants the FEX version it can query that directly.
Ex:
```bash
$ ./Bin/FEXInterpreter get_cpuid_fex
Maximum 4000_0001h sub-leaf: 2
We are running under FEX on host: 2
FEX version string is: 'FEX-2404-113-g820494d'
```
We were previously generating nonsense code if the destination != source:
```
faddp v2.4s, v4.4s, v4.4s
faddp s2, v4.2s
```
The result of the first faddp is ignored, so the second merely calculates the
sum of the first 2 sources (not all 4 as needed).
The correct fix is to feed the first add into the second, regardless of the
final destination:
```
faddp v2.4s, v4.4s, v4.4s
faddp s2, v2.2s
```
Hit in an ASM test with new RA.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
A feature of FEX's JIT is that when an unaligned atomic load/store
operation occurs, the instructions will be backpatched into a barrier
plus a non-atomic memory instruction. This is the half-barrier technique
that still ensures correct visibility of loadstores in an unaligned
context.
The problem with this approach is that the dmb instructions are HEAVY,
because they effectively stop the world until all memory operations in
flight are visible. But it is a necessary evil since unaligned atomics
aren't a thing on ARM processors. FEAT_LSE only gives you unaligned
atomics inside of a 16-byte granularity, which doesn't match x86
behaviour of cacheline size (effectively always 64B).
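As an illustration of the half-barrier backpatch described above (simplified; the exact instruction choice and barrier placement are FEX implementation details not reproduced here):
```
// Emitted for an atomic acquire load; faults if the address is unaligned:
ldar  w1, [x0]

// After backpatching: a plain load plus a one-sided barrier keeps
// load visibility ordering while tolerating the unaligned address:
ldr   w1, [x0]
dmb   ishld
```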
This adds a new TSO option to disable the half-barrier on unaligned
atomics and instead only convert them to regular loadstore instructions,
omitting the half-barrier. This gives more insight into how good a
CPU's LRCPC implementation is by not stalling on DMB instructions when
possible.
Originally implemented as a test to see if this makes Sonic Adventure 2
run full speed with TSO enabled (but all available TSO options disabled)
on NVIDIA Orin. Unfortunately this basically makes the code no longer
stall on dmb instructions and instead just shows how bad the LRCPC
implementation is, since the stalls show up on `ldapur` instructions
instead.
Tested Sonic Adventure 2 on X13s and it ran at 60FPS there without the
hack anyway.