This moves the CPU feature querying to the frontend. The primary purpose
here is for the wow64 frontend to not require Linux-isms for querying
these features. This is required since non-Linux environments don't have
the "CPUID" feature for reading EL1 system registers from EL0.
Wiring up the remaining wow64 registry querying is left for a future
exercise.
This also technically removes an xbyak requirement from FEXCore when
building the x86 Test harness runner, but that doesn't really matter for
regular use cases.
An exception in JIT code acts as a transition to ARM64EC code (in
NTDLL, for exception handling etc.). As such, much like in ExitFunction,
InSimulation must be unset. InSyscallCallback is also unset for
robustness against exceptions in the JIT itself.
Only NZCV and TF are passed through to BeginSimulation as the rest are
lost when converting to a native context and back on the ntdll side. To
prevent thread suspension from wiping out the rest of the flags, only
copy these specific flags into the current JIT EFlags state.
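The masked copy described above can be sketched as follows. The bit
positions are the architectural x86 EFLAGS positions; the helper name
and the mask constant are hypothetical and are not taken from FEX.

```cpp
#include <cstdint>

// Architectural x86 EFLAGS bit positions.
constexpr uint32_t EFLAGS_CF = 1u << 0;  // carry    (pairs with ARM64 C)
constexpr uint32_t EFLAGS_ZF = 1u << 6;  // zero     (pairs with ARM64 Z)
constexpr uint32_t EFLAGS_SF = 1u << 7;  // sign     (pairs with ARM64 N)
constexpr uint32_t EFLAGS_TF = 1u << 8;  // trap flag
constexpr uint32_t EFLAGS_OF = 1u << 11; // overflow (pairs with ARM64 V)

// Assumption: these are the only flags that survive the round trip.
constexpr uint32_t PASSED_THROUGH_MASK =
  EFLAGS_CF | EFLAGS_ZF | EFLAGS_SF | EFLAGS_TF | EFLAGS_OF;

// Hypothetical helper: take only the passed-through flags from `incoming`
// and keep every other bit of the current JIT EFlags state untouched.
constexpr uint32_t MergeIncomingFlags(uint32_t current, uint32_t incoming) {
  return (current & ~PASSED_THROUGH_MASK) | (incoming & PASSED_THROUGH_MASK);
}
```

With this shape, a flag such as DF that is set in the current state
stays set even when the incoming context reports it as clear.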
This is required for handling SMC with the ResetToConsistentState
arguments as used in Windows, as using the NTDLL exported NtContinue
would wipe out any reserved registers in the ARM64EC ABI.
For Windows the syscall numbers are somewhat stable, and the SVC
instruction can be called directly. Since wine doesn't handle that on
ARM64, hardcode the system call number and manually call into the wine
dispatcher. Once wine gains proper syscall thunks, those can be
parsed to get the number and the hardcoding dropped.
OutputDebugString etc. are exception based and thus don't really work
for FEX's needs, as logs can often happen in places where exceptions
cannot be thrown.
FEX emulates faulting instructions (e.g. ud2 or int 2d) by jumping to
the dispatcher and filling out a structure with fault details in the
thread context. Parse this out into a windows exception record structure
so the correct fault information can be seen by the guest.
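A minimal sketch of such a translation is shown below. Every type and
field name here is hypothetical and only stands in for FEX's fault
structure and the Windows EXCEPTION_RECORD; the NTSTATUS values are the
documented Windows codes, and the mapping of int 2d to a breakpoint
status is an assumption for illustration.

```cpp
#include <cstdint>

// Documented Windows NTSTATUS values.
constexpr uint32_t STATUS_BREAKPOINT_CODE = 0x80000003u;
constexpr uint32_t STATUS_ILLEGAL_INSTRUCTION_CODE = 0xC000001Du;

// Hypothetical stand-in for the fault details left in the thread context.
enum class FaultKind { InvalidOpcode, DebugService };
struct SynchronousFaultSketch {
  FaultKind Kind;
  uint64_t GuestRIP; // guest address of the faulting instruction
};

// Hypothetical, reduced stand-in for a Windows exception record.
struct ExceptionRecordSketch {
  uint32_t ExceptionCode;
  uint64_t ExceptionAddress;
};

// Translate the emulator-side fault description into the guest-visible record.
inline ExceptionRecordSketch ToExceptionRecord(const SynchronousFaultSketch& fault) {
  ExceptionRecordSketch record{};
  record.ExceptionAddress = fault.GuestRIP;
  switch (fault.Kind) {
  case FaultKind::InvalidOpcode: // e.g. ud2
    record.ExceptionCode = STATUS_ILLEGAL_INSTRUCTION_CODE;
    break;
  case FaultKind::DebugService: // e.g. int 2d (assumed mapping)
    record.ExceptionCode = STATUS_BREAKPOINT_CODE;
    break;
  }
  return record;
}
```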
As the exception dispatcher is initially invoked on the emulator stack,
control needs to be transferred to the dispatcher on the guest stack
after recovering the x86 RSP to allow for invoking x86 exception
handlers.
This is used by the kernel (or UNIX side of ntdll in wine) to jump into
x86 code with the given context as is necessary when e.g. returning from
an exception.
FEX is unable to deal with reentrant compilation of any x64 hotpatches,
so they need to be ignored by bypassing FFSs and calling directly into
the native target.
ARM64 requires that SP is always 16-byte aligned for memory accesses,
but ARM64EC shares the SP between x64 code and ARM64 code, the former
of which doesn't enforce such a restriction. This causes crashes in
programs such as HITMAN 3 that don't correctly follow the Windows ABI
and call into system library functions with SP only 8-byte aligned.
Fix up stack alignment in such cases by leaving the 8-byte return
address on the stack and returning to a lone 'ret' instruction instead.
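The arithmetic behind this fixup can be modelled as below. This is only
a sketch of the stack bookkeeping, not FEX's thunk code: the struct, the
function name and the stub address are hypothetical, and it assumes the
normal entry path pops the return address into LR.

```cpp
#include <cstdint>

struct EntryStateSketch {
  uint64_t SP; // stack pointer seen by the ARM64EC callee
  uint64_t LR; // address the ARM64EC callee returns to
};

// Hypothetical address of a lone x64 `ret` instruction.
constexpr uint64_t LONE_RET_STUB = 0x1000;

// sp_after_call is the x64 RSP right after `call` pushed the return address.
inline EntryStateSketch EnterArm64ec(uint64_t sp_after_call, uint64_t return_address) {
  const uint64_t popped_sp = sp_after_call + 8;
  if ((popped_sp & 0xF) == 0) {
    // ABI-conformant caller: popping the return address leaves SP aligned.
    return {popped_sp, return_address};
  }
  // Misaligned caller: keep the return address on the stack so SP stays
  // 16-byte aligned, and return to the lone `ret`, which pops it later.
  return {sp_after_call, LONE_RET_STUB};
}
```

In both cases the callee runs with a 16-byte aligned SP, and the x64
caller still ends up at its original return address.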
This allows for running x64 applications under wine without having to run all
of wine under FEX. The JIT is invoked when ARM64EC code performs an indirect
branch to x64 code, and left whenever the x64 code calls into ARM64EC
code.
This is required by recent wine changes to use longjmp for user
callbacks. Switch to saving the context at every simulate call and
setting the unwind SP/PC to that context with a small SEH trampoline
for the syscall handler.
A feature of FEX's JIT is that when an unaligned atomic load/store
operation occurs, the instructions will be backpatched into a barrier
plus a non-atomic memory instruction. This is the half-barrier technique
that still ensures correct visibility of loads and stores in an
unaligned context.
The problem with this approach is that the dmb instructions are HEAVY,
because they effectively stop the world until all memory operations in
flight are visible. But it is a necessary evil since unaligned atomics
aren't a thing on ARM processors. FEAT_LSE2 only gives you unaligned
atomics within a 16-byte granule, which doesn't match the x86
behaviour of cacheline granularity (effectively always 64B).
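The mismatch between the two granularities can be illustrated with a
small check. The helper name is hypothetical; it only tests whether an
access stays inside one naturally aligned granule of the given size.

```cpp
#include <cstdint>

// True when [address, address + size) lies within a single aligned granule.
// Assumes size >= 1 and granule >= 1.
constexpr bool StaysWithinGranule(uint64_t address, uint64_t size, uint64_t granule) {
  return (address / granule) == ((address + size - 1) / granule);
}
```

For example, a 4-byte access at 0x100E stays inside one 64-byte
cacheline, which x86 handles atomically, but it crosses a 16-byte
boundary, so the ARM side cannot rely on it being atomic.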
This adds a new TSO option to disable the half-barrier on unaligned
atomics and instead only convert them to a regular load/store
instruction, omitting the half-barrier. This gives more insight into how
good a CPU's LRCPC implementation is by not stalling on DMB instructions
when possible.
Originally implemented as a test to see if this makes Sonic Adventure 2
run full speed with TSO enabled (but all available TSO options disabled)
on NVIDIA Orin. Unfortunately, this just makes the code no longer stall
on dmb instructions and instead shows how bad the LRCPC implementation
is, since the stalls show up on `ldapur` instructions instead.
Tested Sonic Adventure 2 on X13s and it ran at 60FPS there without the
hack anyway.
Unmapping a section will unmap the whole size initially allocated,
irrespective of how its protections are changed afterwards. Make sure
to follow this logic for invalidation too.
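One way to follow this rule is to remember the size of each section at
map time and use that size, rather than any later per-page state, when
the section is unmapped. The class below is a hypothetical sketch of
that bookkeeping and does not reflect FEX's actual data structures.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <utility>

class SectionTrackerSketch {
public:
  // Record the size the section had when it was first mapped.
  void OnMap(uint64_t base, uint64_t size) { Sections[base] = size; }

  // Returns the {base, size} range to invalidate, covering the whole
  // original allocation, or nothing if the base is not tracked.
  std::optional<std::pair<uint64_t, uint64_t>> OnUnmap(uint64_t base) {
    auto it = Sections.find(base);
    if (it == Sections.end()) {
      return std::nullopt;
    }
    std::pair<uint64_t, uint64_t> range{it->first, it->second};
    Sections.erase(it);
    return range;
  }

private:
  std::map<uint64_t, uint64_t> Sections; // base address -> original size
};
```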
When thread management was moved to the frontend, invalidation moved
from being a global operation to per-thread but the WOW64 backend wasn't
updated to account for this. Now, for any invalidation event, loop over
all threads tracked by the frontend and invalidate the appropriate
range.
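The per-thread loop can be sketched as below. The thread type, its
InvalidateRange method and the registry class are hypothetical
placeholders for whatever the frontend actually tracks; the mutex is an
assumption about how the thread list would be protected.

```cpp
#include <cstdint>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical per-thread state; it only records what it was asked to drop.
struct ThreadSketch {
  std::vector<std::pair<uint64_t, uint64_t>> Invalidated;
  void InvalidateRange(uint64_t start, uint64_t length) {
    Invalidated.emplace_back(start, length);
  }
};

class ThreadRegistrySketch {
public:
  void Register(ThreadSketch* thread) {
    std::lock_guard<std::mutex> lock(Mutex);
    Threads.push_back(thread);
  }

  // Invalidation is per-thread, so every tracked thread must be visited.
  void InvalidateAll(uint64_t start, uint64_t length) {
    std::lock_guard<std::mutex> lock(Mutex);
    for (ThreadSketch* thread : Threads) {
      thread->InvalidateRange(start, length);
    }
  }

private:
  std::mutex Mutex;
  std::vector<ThreadSketch*> Threads;
};
```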