alternative to #3638. this is theoretically better for side-by-side diffs. in
practice it may make other diffs worse since all the \'s change when part of the
macro change.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Arm64ec introduced the InterruptFaultPage which is lower overhead since
instead of ldr+str it just turns in to a single str. We were already
allocating the space, FEXCore and the frontend signal delegator just
needed to be updated to understand the new location.
We can additionally use this in the future if we want to make deferred
async signals INSIDE the JIT only cost a single str as well.
SRA is fundamentally about hardware registers, not stores into a
software-defined context. So, it should take a register instead of an offset.
This makes all the unaligned special cases unrepresentable (by design).
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
While the ARM64EC ABI mostly matches FEX's SRA, the stack still needs to
be switched to the emulator stack and target RIP stored into the FEX
context before jumping to the dispatcher loop.
A feature of FEX's JIT is that when an unaligned atomic load/store
operation occurs, the instructions will be backpatched in to a barrier
plus a non-atomic memory instruction. This is the half-barrier technique
that still ensures correct visibility of loadstores in an unaligned
context.
The problem with this approach is that the dmb instructions are HEAVY,
because they effectively stop the world until all memory operations in
flight are visible. But it is a necessary evil since unaligned atomics
aren't a thing on ARM processors. FEAT_LSE only gives you unaligned
atomics inside of a 16-byte granularity, which doesn't match x86
behaviour of cacheline size (effectively always 64B).
This adds a new TSO option to disable the half-barrier on unaligned
atomic and instead only convert it to a regular loadstore instruction,
ommiting the half-barrier. This gives more insight in to how well a
CPU's LRCPC implementation is by not stalling on DMB instructions when
possible.
Originally implemented as a test to see if this makes Sonic Adventure 2
run full speed with TSO enabled (but all available TSO options disabled)
on NVIDIA Orin. Unfortunately this basically makes the code no longer
stall on dmb instructions and instead just showing how bad the LRCPC
implementation is, since the stalls show up on `ldapur` instructions
instead.
Tested Sonic Adventure 2 on X13s and it ran at 60FPS there without the
hack anyway.
The frontend will provide the return logic via ExitFunctionEC, which
will be jumped to whenever there is an indirect branch/return to an addr
such that RtlIsEcCode(addr) returns true.
Executable mapped memory is treated as x86 code by default when
running under EC, VirtualAlloc2 needs to be used together with a
special flag to map JIT arm64 code.
This environment variable had an incorrect priority on the configuration
system. The expectation was higher priority than most other layers.
Now the only layer that has higher priority is the environment
variables.
It has been a long time coming that FEX no longer needed to leak IR
implementation details to the frontend, this was legacy due to IR CI and
various other problems.
Now that the last bits of IR leaking has been removed, move everything
that we can internally to the implementation.
We still have a couple of minor details in the exposed IR.h to the
frontend, but these are limited to a few enums and some thunking struct
information rather than all the implementation details.
No functional change with this, just moving headers around.
FEXCore includes was including an FHU header which would result in
compilation failure for external projects trying to link to libFEXCore.
Moves it over to fix this, it was the only FHU usage in FEXCore/include
NFC
This is no longer necessary to be part of the public API. Moves the
header internally.
Needed to pass through `IsAddressInCodeBuffer` from CPUBackend through
the Context object, but otherwise no functional change.
This may be useful for tracking TSO faulting when it manages to fetch
stale data. While most TSO crashes are due to nullptr dereferences, this
can still check for the corruption case.
I was looking at some other JIT overheads and this cropped up as some
overhead. Instead of materializing a constant using mov+movk+movk+movk,
load it from the named vector constant array.
In a micro-benchmark this improved performance by 34%.
In bytemark this improved on subbench by 0.82%
This function can be unit-tested more easily, and the stack special is more
cleanly handled as a post-collection step.
There is a minor functional change: The stack special case didn't trigger
previously if the range end was within the stack mapping. This is now fixed.
FEXCore doesn't need track the TLS state of the SignalDelegator, this is
a frontend concept.
Removes the tracking from the backend and keeps it in the frontend.
ARM64EC has a shared SRA mapping between ARM64 and X64 code, so there
needs to be a public way to enter the dispatcher without refilling SRA
from the in-memory context struct.
When code invalidation is happening we currently have the issue that a
thread can acquire the code invalidation mutex in the middle of
invalidation. This is due to us acquiring and releasing the mutex
between each thread's code invalidation.
We need to hold the mutex for the entire duration for all thread's code
invalidation.
This fixes a rare hang on proton startup and resolves a consistent hang
on Proton application shutdown.
This now puts us on par with FEX-2312.1 with hanging.
This does not fix a relatively rare hang on fork (which also existed with FEX-2312.1).
This also does not fix the issue that the intersection of our mutexes
between frontend and backend are very convoluted. In part of the work
that is going to fix the rare fork mutex hang will change more of this.
This is used for instcountci to ensure instruction counts don't change
when a compiler supports this feature or not. Always runtime disable
when running in instcountci.
CMake option from #3394 can still be useful so leaving that in place.