Alternative to #3638. This is theoretically better for side-by-side diffs; in
practice it may make other diffs worse, since all the \'s change when part of
the macro changes.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Depending on where the assembly was getting loaded into memory, it was
causing slight code generation differences.
Map the entire file to the same fixed offset as our ASM tests to ensure
consistency and remove flakes in CI.
ARM64EC introduced the InterruptFaultPage, which is lower overhead since
instead of an ldr+str pair it turns into a single str. We were already
allocating the space; FEXCore and the frontend signal delegator just
needed to be updated to understand the new location.
We can additionally use this in the future if we want to make deferred
async signals INSIDE the JIT only cost a single str as well.
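A hedged sketch of how a fault-page interrupt check can work in general; the names, page size, and emitted instruction shown here are illustrative, not FEX's exact implementation:

```cpp
#include <cstddef>
#include <sys/mman.h>

// Illustrative sketch: the JIT emits a single store to a per-thread page,
// e.g. `str xzr, [ThreadState, #InterruptFaultPageOffset]`. While no
// interrupt is pending the page is writable and the store is nearly free;
// arming the page makes the next store fault into the signal delegator.
// Assumes a page-aligned, 4KiB fault page.
constexpr size_t FaultPageSize = 4096;

void ArmInterrupt(void* InterruptFaultPage) {
  // The next JIT check-store to this page now faults.
  mprotect(InterruptFaultPage, FaultPageSize, PROT_NONE);
}

void DisarmInterrupt(void* InterruptFaultPage) {
  mprotect(InterruptFaultPage, FaultPageSize, PROT_READ | PROT_WRITE);
}
```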
This is required by recent Wine changes that use longjmp for user
callbacks. Switch to saving the context at every simulate call and
setting the unwind SP/PC to that context with a small SEH trampoline
for the syscall handler.
A bit of refactoring is necessary before we can move the remaining
Linux-specific code to the frontend.
Most of this is taken from #3535 while attempting to be NFC as much as
possible.
From `man 2 open`:
> The mode argument must be supplied if O_CREAT or O_TMPFILE is
> specified in flags; if it is not supplied, some arbitrary bytes
> from the stack will be applied as the file mode.
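A minimal sketch of the resulting fix pattern, assuming a wrapper shaped like glibc's `open`; the wrapper itself is illustrative, not FEX's exact handler:

```cpp
#include <cstdarg>
#include <fcntl.h>

// Only read the variadic mode argument when the flags say one is present,
// and always forward it in that case, so the kernel never sees stack
// garbage. Note O_TMPFILE is multiple bits, hence the full-mask comparison.
int OpenWrapper(const char* Path, int Flags, ...) {
  mode_t Mode = 0;
  if ((Flags & O_CREAT) || (Flags & O_TMPFILE) == O_TMPFILE) {
    va_list Args;
    va_start(Args, Flags);
    Mode = va_arg(Args, mode_t);
    va_end(Args);
  }
  return ::open(Path, Flags, Mode);
}
```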
A feature of FEX's JIT is that when an unaligned atomic load/store
operation occurs, the instructions will be backpatched into a barrier
plus a non-atomic memory instruction. This is the half-barrier technique
that still ensures correct visibility of load/stores in an unaligned
context.
The problem with this approach is that the dmb instructions are HEAVY,
because they effectively stop the world until all memory operations in
flight are visible. But it is a necessary evil, since unaligned atomics
aren't a thing on ARM processors: FEAT_LSE only gives you unaligned
atomics inside a 16-byte granule, which doesn't match x86's
cacheline-sized behaviour (effectively always 64B).
This adds a new TSO option to disable the half-barrier on unaligned
atomics, instead converting them to regular load/store instructions with
the half-barrier omitted. This gives more insight into how good a CPU's
LRCPC implementation is, by not stalling on DMB instructions when
possible.
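A plain C++ analogue of the two fallbacks for the load side (the JIT emits the equivalent instructions directly; these helper names are illustrative):

```cpp
#include <atomic>
#include <cstdint>

// Half-barrier fallback: a plain ldr followed by a dmb recovers visibility
// for an unaligned "atomic" load. The store side mirrors this with the
// barrier placed before the plain str.
uint64_t HalfBarrierLoad(const uint64_t* Addr) {
  uint64_t Value = *reinterpret_cast<const volatile uint64_t*>(Addr); // plain ldr
  std::atomic_thread_fence(std::memory_order_seq_cst);                // dmb ish
  return Value;
}

// New TSO option: the same plain load with the half-barrier omitted entirely.
uint64_t UnalignedPlainLoad(const uint64_t* Addr) {
  return *reinterpret_cast<const volatile uint64_t*>(Addr);
}
```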
Originally implemented as a test to see if this makes Sonic Adventure 2
run at full speed with TSO enabled (but all available TSO options disabled)
on NVIDIA Orin. Unfortunately this just stops the code from stalling on
dmb instructions and instead shows how bad the LRCPC implementation is,
since the stalls show up on `ldapur` instructions instead.
Tested Sonic Adventure 2 on X13s and it ran at 60FPS there without the
hack anyway.
Unmapping a section will unmap the whole size initially allocated,
irrespective of how its protections were changed afterwards. Make sure
to follow this logic for invalidation too.
When thread management was moved to the frontend, invalidation moved
from being a global operation to per-thread, but the WOW64 backend wasn't
updated to account for this. Now, for any invalidation event, loop over
all threads tracked by the frontend and invalidate the appropriate
range.
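A hedged sketch of the shape of the fix; all names here are illustrative stand-ins rather than the exact WOW64/FEXCore types:

```cpp
#include <cstdint>
#include <mutex>
#include <vector>

struct ThreadObject; // per-thread state tracked by the frontend
struct EmulatorContext {
  void InvalidateGuestCodeRange(ThreadObject* Thread, uint64_t Start, uint64_t Length);
};

std::mutex ThreadsMutex;
std::vector<ThreadObject*> TrackedThreads;
EmulatorContext* CTX;

// On any invalidation event, walk every frontend-tracked thread and
// invalidate the range per thread instead of assuming a global flush.
void InvalidateRange(uint64_t Start, uint64_t Length) {
  std::scoped_lock Lock {ThreadsMutex};
  for (ThreadObject* Thread : TrackedThreads) {
    CTX->InvalidateGuestCodeRange(Thread, Start, Length);
  }
}
```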
This environment variable had an incorrect priority in the configuration
system; the expectation was that it would be higher priority than most
other layers. Now the only layer with higher priority is the environment
variables.
FEXCore's public include directory was including an FHU header, which
would result in compilation failures for external projects trying to link
against libFEXCore. Moves it over to fix this; it was the only FHU usage
in FEXCore/include.
NFC
This is no longer necessary to be part of the public API. Moves the
header internally.
Needed to pass through `IsAddressInCodeBuffer` from CPUBackend through
the Context object, but otherwise no functional change.
Same situation as the last stack memory leak fix: this is fairly tricky
since it deals with stack pivoting. Fixes the memory leak around
pthread stack allocations, making memory usage lower for applications
that constantly spin up and destroy threads (like Steam).
We need to let glibc allocate a minimum-sized stack (128KB, which we can't
control) to work around a race condition with DTV/TLS regions. This
means we need to do a stack pivot once the thread starts executing.
We also need to be careful because the `PThread` object is deleted
inside of the execution thread, which was resulting in a use-after-free
bug.
There are definitely more memory leaks that I'm still fighting, and I have
noticed with my abusive thread-creation program that we might want to
change some jemalloc options to more aggressively cut down on residency.
This is just one fix of many.
I remember seeing some application last year that closed a FEX-owned FD,
but now I don't remember what it was. This can really mess us up, so add
some debug tracking so we can try to find it again.
Might be something specifically around Flatpak, AppImage, or Chrome's
sandbox. I have some ideas about how to work around these problems if
they crop up, but I need to find the problem applications again.
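A minimal sketch of what such debug tracking can look like; the function names and logging are illustrative, not the actual implementation:

```cpp
#include <cstdio>
#include <mutex>
#include <unordered_set>

std::mutex FDLock;
std::unordered_set<int> OwnedFDs; // FDs FEX opened for its own use

// Called wherever FEX opens an FD for itself.
void TrackFEXFD(int FD) {
  std::scoped_lock Lock {FDLock};
  OwnedFDs.insert(FD);
}

// Called from the guest close() handler; returns true if the guest is
// about to close an FD that FEX owns, so we can log and investigate.
bool GuestClosingOwnedFD(int FD) {
  std::scoped_lock Lock {FDLock};
  if (OwnedFDs.count(FD)) {
    fprintf(stderr, "[FEX] Guest attempted to close FEX-owned FD %d\n", FD);
    return true;
  }
  return false;
}
```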
This may be useful for tracking TSO faulting when it manages to fetch
stale data. While most TSO crashes are due to nullptr dereferences, this
can still check for the corruption case.
It was a funny joke that this was here, but it is fundamentally
incompatible with what we're doing. All those users are running proot
anyway because of how broken running directly under Termux is.
Just remove it.
The NVIF ioctl isn't publicly described in the nouveau headers, and it is
required for anything to work with Nouveau.
Pass the ioctl command through without modification and hope that this
ioctl is architecture-agnostic.
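The passthrough itself is tiny; a sketch with an illustrative handler name, assuming the NVIF argument layout really is identical across architectures:

```cpp
#include <cstdint>
#include <sys/ioctl.h>

// Forward the command and argument pointer to the kernel untouched; no
// struct translation is possible since the NVIF layout isn't public.
int NVIFPassthrough(int FD, uint32_t Cmd, void* Args) {
  return ::ioctl(FD, Cmd, Args);
}
```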
This is what we'll actually ship (I hope), so that's the config we want to
track long-term. It also results in much more manageable asm.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Reimagining of #3355 without any JSON generators or new concepts.
Fixes some mislabeling of system calls: some were getting inlined when
they shouldn't be, and a lot weren't getting inlined when they could be.
This really cleans up the syscall implementation; all syscalls that can
be passthrough implementations now require only a very small two-line
declaration.
Additionally cleans up a bit of implementation cruft where some
passthrough syscalls were using the glibc syscall handler and some were
using the glibc implementation. We have had multiple issues in the past
where the glibc implementation does something subtly different from the
raw syscall and breaks things. Now all passthrough handlers do a system
call directly, removing at least one indirection and some ambiguity.
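A hedged sketch of the pattern described above; the macro and naming are illustrative rather than FEX's exact API:

```cpp
#include <cerrno>
#include <cstdint>
#include <sys/syscall.h>
#include <unistd.h>

// Each passthrough handler invokes the raw syscall directly instead of a
// glibc wrapper, so glibc can't change behaviour underneath us. Errors
// are folded into the "result or -errno" return convention.
#define REGISTER_PASSTHROUGH(Name, Nr)                             \
  uint64_t Handler_##Name(uint64_t A0, uint64_t A1, uint64_t A2) { \
    long Result = ::syscall(Nr, A0, A1, A2);                       \
    return Result == -1 ? static_cast<uint64_t>(-errno)            \
                        : static_cast<uint64_t>(Result);           \
  }

// Per-syscall registration stays tiny, e.g.:
REGISTER_PASSTHROUGH(dup3, SYS_dup3)
```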
This makes it significantly easier to add new passthrough syscalls as
well: only a version check and three lines per syscall are needed. And
there are new syscalls incoming that we will want to add.
Tangible improvements:
- Syscalls are lower overhead than ever.
- When I'm adding more syscalls I have less chance of mucking it up.