- Res is unused
- SrcSize doesn't matter since we ignore the high bits, so we might as well
  always use 32-bit
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
ARM64EC has a shared SRA mapping between ARM64 and X64 code, so there
needs to be a public way to enter the dispatcher without refilling SRA
from the in-memory context struct.
When code invalidation is happening we currently have the issue that a
thread can acquire the code invalidation mutex in the middle of
invalidation. This is due to us acquiring and releasing the mutex
between each thread's code invalidation.
We need to hold the mutex for the entire duration of all threads' code
invalidation.
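A minimal sketch of the difference, assuming hypothetical `CountingMutex` and `ThreadState` types (neither is a FEX name). It only illustrates the locking shape: one acquisition per thread versus one acquisition that spans every thread.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical wrapper that counts acquisitions so the two shapes can be compared.
struct CountingMutex {
  std::mutex Inner;
  int Acquisitions = 0;
  void lock() { Inner.lock(); ++Acquisitions; }
  void unlock() { Inner.unlock(); }
};

// Hypothetical per-thread state holding cached code block addresses.
struct ThreadState {
  std::vector<uint64_t> CachedBlocks;
};

// Old shape: the mutex is released between threads, so another thread can
// acquire it while invalidation is only partially done.
inline void InvalidatePerThreadLock(CountingMutex& M, std::vector<ThreadState>& Threads) {
  for (auto& T : Threads) {
    std::lock_guard<CountingMutex> Lk(M);
    T.CachedBlocks.clear();
  }
}

// New shape: a single acquisition covers the invalidation of every thread.
inline void InvalidateHoldingLock(CountingMutex& M, std::vector<ThreadState>& Threads) {
  std::lock_guard<CountingMutex> Lk(M);
  for (auto& T : Threads) {
    T.CachedBlocks.clear();
  }
}
```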
This fixes a rare hang on proton startup and resolves a consistent hang
on Proton application shutdown.
This now puts us on par with FEX-2312.1 with regard to hangs.
This does not fix a relatively rare hang on fork (which also existed with FEX-2312.1).
This also does not fix the issue that the intersection of our mutexes
between frontend and backend is very convoluted. Part of the work
that is going to fix the rare fork mutex hang will change more of this.
RCLSE ignores NZCV, and it doesn't optimize stores, which doesn't help us with PF/AF
either. So, we add a new pass for dead flag elimination (cannibalizing the old
and broken dead flag elimination pass). This is a simple local optimizer that
walks each block backwards, converging in linear time & constant space in a
single iteration.
Right now, it doesn't do a ton (other than a nice reduction in silliness in
the hot Sonic block), but it provides the framework to fuse comparisons.
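The backward walk can be sketched roughly as follows. This is an illustration only: the `Op` record and flag constants are hypothetical, not FEX's IR, and the sketch conservatively assumes every flag is live at the end of the block.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical flag bits for the sketch.
constexpr uint8_t FlagN = 1, FlagZ = 2, FlagC = 4, FlagV = 8;
constexpr uint8_t FlagsAll = FlagN | FlagZ | FlagC | FlagV;

// Hypothetical IR op: which flags it reads, which it fully overwrites,
// and whether it does anything besides writing flags.
struct Op {
  uint8_t Reads;
  uint8_t Writes;
  bool SideEffects;
  bool Dead;
};

// Single backward pass over one block. The only state is one byte of
// live flags, so this is linear time and constant space.
inline void EliminateDeadFlags(std::vector<Op>& Block) {
  uint8_t Live = FlagsAll; // assumption: all flags are live-out
  for (auto It = Block.rbegin(); It != Block.rend(); ++It) {
    const bool WritesOnlyDeadFlags = It->Writes != 0 && (It->Writes & Live) == 0;
    if (WritesOnlyDeadFlags && !It->SideEffects) {
      It->Dead = true; // nothing later observes these flags
      continue;        // a removed op contributes no reads
    }
    // Flags written here are dead above this op; flags read here are live above it.
    Live = static_cast<uint8_t>((Live & ~It->Writes) | It->Reads);
  }
}
```

With two back-to-back compares followed by a branch on ZF, the first compare is marked dead because the second overwrites every flag before anything reads them.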
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
This is required for host-side calls to guest functions on 32-bit guests.
Since the host stack is allocated before FEX blocks memory inaccessible to
the guest, the guest would otherwise fail to read the packed argument data.
With a contended unique lock, we forgot to reset the `Expected` value to
zero. This was causing a contended mutex to incorrectly succeed.
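A minimal sketch of this class of bug, using `std::atomic` rather than FEX's actual lock code (the function names are hypothetical). `compare_exchange_strong` writes the observed value back into `Expected` on failure, so a retry that does not reset it compares against "locked" and succeeds.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Buggy shape: Expected is initialized once. After the first failed CAS it
// holds 1 (the locked value), so the next CAS "succeeds" on a held lock.
inline bool TryLockBoundedBuggy(std::atomic<uint32_t>& Lock, int MaxSpins) {
  uint32_t Expected = 0;
  for (int i = 0; i < MaxSpins; ++i) {
    if (Lock.compare_exchange_strong(Expected, 1, std::memory_order_acquire)) {
      return true;
    }
  }
  return false;
}

// Fixed shape: reset Expected to zero before every attempt.
inline bool TryLockBounded(std::atomic<uint32_t>& Lock, int MaxSpins) {
  for (int i = 0; i < MaxSpins; ++i) {
    uint32_t Expected = 0;
    if (Lock.compare_exchange_strong(Expected, 1, std::memory_order_acquire)) {
      return true;
    }
  }
  return false;
}
```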
Noticed this when converting some pthread mutexes over to spinloops to
remove strace noise.
The reference wfe_mutex library I wrote didn't have this problem since
the implementation is slightly different.
We only need each part of W extracted in the corresponding round, so sink the
extract into the round to reduce pressure.
Further, W and E are added and then never used again. So, by reassociating we
can do the add upfront, killing W and E at the start and further reducing
pressure.
Eliminates spilling in sha1rnds4.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
This is used for instcountci to ensure instruction counts don't change
whether or not a compiler supports this feature. Always disable it at
runtime when running in instcountci.
CMake option from #3394 can still be useful so leaving that in place.
When 4d109c9ce0 fixed parsing strenum types in the json, it also added
`ArgumentHandler` types to the json
parsing. This was incorrect as those types are already stored in the
json in their decoded numerical format.
Without this change, all config options with `ArgumentHandler` will
decode as "0" which is incorrect. The main killer here is that SMCChecks
gets disabled (visible in both FEXConfig and when applications are
running) which was causing spurious failures.
Forces disabling use of __attribute__((preserve_all)).
Until CI uses clang17, where this attribute was added, instcountci fails
when FEX is compiled with clang>=17.
as far as flags go, they're identical: set ZF for zero output, set CF for output
= DestSize, undef the rest. merge the impls, so we get the optimized lzcnt impl
for tzcnt.
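The shared flag rule can be sketched like this for a 32-bit destination. The struct and function names are hypothetical, and the counting loops are deliberately naive; only the common flag helper is the point.

```cpp
#include <cassert>
#include <cstdint>

struct CountResult {
  uint32_t Value;
  bool ZF;
  bool CF;
};

// Shared by both instructions: ZF if the output is zero, CF if the output
// equals the destination size. The other flags are left undefined.
inline CountResult WithFlags(uint32_t Count, uint32_t DestSize) {
  return {Count, Count == 0, Count == DestSize};
}

// Count leading zeros, scanning from the top bit down.
inline CountResult Lzcnt32(uint32_t Src) {
  uint32_t N = 0;
  for (uint32_t Bit = 1u << 31; Bit != 0 && !(Src & Bit); Bit >>= 1) {
    ++N;
  }
  return WithFlags(N, 32);
}

// Count trailing zeros, scanning from the bottom bit up.
inline CountResult Tzcnt32(uint32_t Src) {
  uint32_t N = 0;
  for (uint32_t Bit = 1; Bit != 0 && !(Src & Bit); Bit <<= 1) {
    ++N;
  }
  return WithFlags(N, 32);
}
```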
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
this skips the constant folding, which saves the branching in the rotate
immediate implementations.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
- use better algorithm that is O(# set bits) instead of O(# total bits)
- eliminate spilling by careful management of our temporaries
- fix nzcv clobber bug (whoops)
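The message does not name the operation, so the following is only a generic illustration of the O(# set bits) idea, assuming a bit-deposit style operation (both function names are hypothetical). The naive loop visits all 64 bit positions; the second loop runs once per set bit in the mask by repeatedly isolating and clearing the lowest set bit.

```cpp
#include <cassert>
#include <cstdint>

// O(# total bits): test every bit position of the mask.
inline uint64_t DepositNaive(uint64_t Src, uint64_t Mask) {
  uint64_t Result = 0;
  unsigned K = 0; // index of the next source bit to consume
  for (unsigned I = 0; I < 64; ++I) {
    if (Mask & (1ull << I)) {
      if (Src & (1ull << K)) {
        Result |= 1ull << I;
      }
      ++K;
    }
  }
  return Result;
}

// O(# set bits): one iteration per set bit in the mask.
inline uint64_t DepositSetBits(uint64_t Src, uint64_t Mask) {
  uint64_t Result = 0;
  while (Mask != 0) {
    const uint64_t Lowest = Mask & (~Mask + 1); // isolate lowest set bit
    if (Src & 1) {
      Result |= Lowest;
    }
    Src >>= 1;
    Mask &= Mask - 1; // clear lowest set bit
  }
  return Result;
}
```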
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
The ARM64EC SRA layout will use x0-3 for x86_64 registers. As such, any
arguments passed to C ABI functions need to proxy their arguments
through the temporaries and move as appropriate.
We had a chance of doing an additional bogus wfe if the expected value
was hit in one iteration of a loop. Not the biggest problem on current
hardware where WFE only ever sleeps for 1-4 system cycles, but on future
hardware where WFE might actually sleep for longer, this could have
been an issue.
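The exact loop is not shown here, so this is a sketch of one way such an extra wait can arise, with a hypothetical `FakeCPU` whose `WaitForEvent` stands in for WFE and simply counts calls. Waiting before checking costs one wait even when the value already matches; checking first avoids it.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the WFE instruction: records how often we waited.
struct FakeCPU {
  int WfeCount = 0;
  void WaitForEvent() { ++WfeCount; }
};

// Wait-then-check: always waits at least once, even if the value is already there.
inline void SpinUntilEqualWaitFirst(FakeCPU& CPU, const std::atomic<uint32_t>& Value, uint32_t Expected) {
  do {
    CPU.WaitForEvent();
  } while (Value.load(std::memory_order_acquire) != Expected);
}

// Check-then-wait: no wait at all when the expected value is already visible.
inline void SpinUntilEqual(FakeCPU& CPU, const std::atomic<uint32_t>& Value, uint32_t Expected) {
  while (Value.load(std::memory_order_acquire) != Expected) {
    CPU.WaitForEvent();
  }
}
```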
Noticed this while writing #3342.
Fixes #3343
The documentation defines the syscall instruction as setting RCX to the
next instruction's RIP and R11 to RFLAGS. We entirely skipped this,
which I noticed while writing unit tests.
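As a rough sketch of the 64-bit register effect, assuming a hypothetical `GuestRegs` struct (not FEX's state layout) and ignoring everything else the instruction does:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical guest register file, reduced to the fields this sketch needs.
struct GuestRegs {
  uint64_t RIP;    // address of the syscall instruction itself
  uint64_t RCX;
  uint64_t R11;
  uint64_t RFLAGS;
};

// The syscall opcode is the two bytes 0F 05.
constexpr uint64_t SyscallInstructionLength = 2;

// Apply the register side effects: RCX gets the address of the next
// instruction and R11 gets a copy of RFLAGS.
inline void ApplySyscallRegisterEffects(GuestRegs& Regs) {
  Regs.RCX = Regs.RIP + SyscallInstructionLength;
  Regs.R11 = Regs.RFLAGS;
}
```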
Adds unittests to test both 32-bit and 64-bit behaviour because our
helper shares code with both.
I don't know if anything actually relied on this behaviour but we should
definitely support it.
Primary goal for this is to ensure that the delinker doesn't need to
allocate any memory. This delinker can end up getting hit heavily with
JIT code so we don't want it to be allocating memory.
Currently all uses of the forward label call into jemalloc to allocate
memory. This allows a forward label that doesn't require any memory
allocation, which is the common case in FEX.
The delinker step of the JIT was using std::function with capturing
lambdas, which allocated memory unnecessarily.
Because the compiler can't see through our std::function usage it could
never decompose these by itself.
By passing the Thread's frame and record to the function as arguments,
we can have the signature be a raw function pointer.
This fixes an area of concern from:
https://github.com/FEX-Emu/FEX/blob/main/docs/ProgrammingConcerns.md#stdfunction-and-lambdas
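A minimal sketch of the two shapes, with hypothetical `CpuFrame` and `LinkRecord` types standing in for the thread's frame and the delinker record. Whether a given `std::function` allocates depends on the capture size and the standard library, so the "before" version is only potentially allocating.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical stand-ins; these are not FEX's types.
struct CpuFrame {
  uint64_t DelinkCount = 0;
};
struct LinkRecord {
  uint64_t HostBranch;
};

// Before: state is captured in a lambda and type-erased behind std::function,
// which the compiler cannot see through and which may heap-allocate.
inline std::function<void()> MakeDelinker(CpuFrame* Frame, LinkRecord* Record) {
  return [Frame, Record] {
    Record->HostBranch = 0;
    ++Frame->DelinkCount;
  };
}

// After: state is passed as arguments, so a raw function pointer is enough
// and nothing needs to be allocated to store the callback.
using DelinkerFn = void (*)(CpuFrame* Frame, LinkRecord* Record);

inline void Delink(CpuFrame* Frame, LinkRecord* Record) {
  Record->HostBranch = 0;
  ++Frame->DelinkCount;
}
```

A capture-less lambda with the same parameters also converts to `DelinkerFn`, so call sites can stay lambda-shaped if that reads better.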
If the Dst register is allocated as VectorIndices or VectorTable,
using Dst as an operand to perform the tbx operation will result in an error.
For example:
%131(FPR0) i128 = LoadNamedVectorIndexedConstant u8:Tmp:RegisterSize, #0x6, #0xaa0
%132(FPR0) i128 = VTBX1 u8:Tmp:RegisterSize, %129(FPRFixed6) i32v4, %126(FPRFixed10) i16v8, %131(FPR0) i128
Since the tbx instruction's destination register is also the original operand,
this is consistent with the semantics of VTBX1. Therefore,
directly using VectorSrcDst as the destination operand for the tbx instruction is safe.
While locking a shared_lock and doing an empty table lookup is fairly
fast, just remove them from the hot path entirely if no custom IR
handlers are installed.
This is only used for our IRLoader, which is losing its importance
significantly and should probably be removed anyway.
This unit test hasn't really served any purpose for a while now and
mostly just causes pain when reworking things in the IR.
Just remove the IRLoader, its unit tests, the github action steps and
the public FEXCore interface to it, since it isn't used by anything
other than Thunks.
Also moves some IR definitions from the public API to the backend.
Need #3348 merged first.
As I was casually thinking, this code made me realize that it was quite
branch heavy and could likely be optimized to logic.
The previous code generated some fairly nasty branch heavy code. This
can be optimized to be branchless and take roughly five instructions
per flag. Using a bitfield for each feature would turn each calculation
into 3-4 instructions but that seems overkill.
Very minor thing.
We only used this so that our Xavier CI systems, which were running old
kernels, could run unit tests. We have now removed the Xaviers from CI
and this is no longer necessary.
Stop pretending that we support kernels older than 5.0 and allowing this
fallback.
The 32-bit allocator is still used for the MAP_32BIT mmap flag, so the
load bearing code can't be fully removed. Just remove the config and the
frontend things using it.
Currently no functional change but public API breaks should come early.
The thread state object will be used for looking up thread specific
codebuffers in the future when we support MDWE with code mirrors.
We can safely call virtual functions through the JIT with a little bit
of work.
FEX's JIT has quite a few steps before it gets to a syscall handler.
Before this commit:
JIT->static HandleSyscall->SyscallHandler::HandleSyscall->SyscallHandler
After this commit:
JIT->SyscallHandler::HandleSyscall->SyscallHandler
A bit hard to notice this when this interface can spin at 67-million
calls per second though.
This has the Frontend and OpcodeDispatcher select their operating mode
depending on the incoming code segment long-mode flag.
Adds some asserts since currently it is unexpected if the configuration
changes at runtime.
This is fairly straightforward for an initial setup but isn't fully
fleshed out.
Right now FEX's x86 tables aren't set up in a way to support choosing a
different instruction decoding depending on runtime operating mode
change, so that would break in interesting ways.
Primarily this just gets FEX set up to start piping the operating mode
through from the frontend to the backend. This is a long term task, so
it is going to take a long time to iron out all the issues.
Previously we were only storing the 32-bit base address which isn't
actually how segment descriptors work.
In reality segment descriptors are 64-bit values whose layout depends
on the 4-bit type value. We only care about code and data segment
layouts since the rest are bonkers.
Describe these descriptors correctly and set up a default code descriptor
for the operating mode that FEX is starting in.
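For reference, a sketch of decoding the code/data descriptor layout from the raw 64-bit value. The `SegmentDescriptor` wrapper and its method names are hypothetical, not the structure added by this change; the bit positions follow the standard x86 code/data descriptor format.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical wrapper around a raw 64-bit code/data segment descriptor.
struct SegmentDescriptor {
  uint64_t Raw;

  // Base is split: bits 16-39 hold base[23:0], bits 56-63 hold base[31:24].
  uint32_t Base() const {
    return static_cast<uint32_t>(((Raw >> 16) & 0xFFFFFF) | (((Raw >> 56) & 0xFF) << 24));
  }
  // Limit is split: bits 0-15 hold limit[15:0], bits 48-51 hold limit[19:16].
  uint32_t Limit() const {
    return static_cast<uint32_t>((Raw & 0xFFFF) | (((Raw >> 48) & 0xF) << 16));
  }
  // 4-bit type field in bits 40-43.
  uint8_t Type() const { return static_cast<uint8_t>((Raw >> 40) & 0xF); }
  // S bit (bit 44): set for code and data segments.
  bool IsCodeOrData() const { return ((Raw >> 44) & 1) != 0; }
  // P bit (bit 47): segment present.
  bool Present() const { return ((Raw >> 47) & 1) != 0; }
  // L bit (bit 53): 64-bit code segment.
  bool LongMode() const { return ((Raw >> 53) & 1) != 0; }
};
```

For example, the flat 64-bit code descriptor `0x00AF9A000000FFFF` decodes to base 0, limit 0xFFFFF, type 0xA, with the long-mode bit set.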
This will result in FEX not being able to allocate executable memory.
We can use shared memory in the future to work around this but for now
we don't support that as a fix.
Lots going on here.
This moves OS thread object lifetime management and internal thread
state lifetime management to the frontend. This causes a bunch of thread
handling to move from the FEXCore Context to the frontend.
Looking at `FEXCore/include/FEXCore/Core/Context.h` really shows how
much of the API has moved to the frontend that FEXCore no longer needs
to manage. Primarily this makes FEXCore itself no longer need to care
about most of the management of the emulation state.
A large amount of the behaviour moved wholesale from Core.cpp to
LinuxEmulation's ThreadManager.cpp, which manages the lifetimes of
both the OS threads and the FEXCore thread state objects.
One feature lost was the instruction capability, but this was already
buggy and is going to be rewritten/fixed when gdbserver work continues.
Now that all of this management is moved to the frontend, the gdbserver
can start improving since it can start managing all thread state
directly.
Similar to #3284 but works around some of the bugs that one introduced.
This is the minimal amount of changes to move the ownership from FEXCore
to the frontend. Since the frontends don't yet have full thread state
tracking, there is an opaque pointer that needs to be managed.
In the followup commits this will be changed so that the syscall handler
is the thread object manager.
This was a temporary header to help while this header was migrated
to our public API headers.
Its temporary nature is no longer necessary, just get rid of it.
No need to wait on initialization for this anymore.
Ever since Init was refactored to do basically no work, this hasn't been
necessary.
CPUID does need to still be initialized after HostFeatures though, so
need to ensure correct member ordering there.
When the address calculation for SIB has both index and base then we can
optimize this to an add with a shifted register. This will convert a
three instruction sequence into one instruction in most cases.
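The address math being folded is just base plus scaled index. A small sketch, with a hypothetical helper name and the scale given as a shift amount (0 to 3 for x1, x2, x4, x8):

```cpp
#include <cassert>
#include <cstdint>

// Effective address for a SIB operand with both base and index present.
// On AArch64 this shape maps onto a single add with a shifted register,
// for example: add x0, x1, x2, lsl #3
inline uint64_t SibBaseIndex(uint64_t Base, uint64_t Index, unsigned ScaleShift) {
  assert(ScaleShift <= 3); // SIB scale is 1, 2, 4 or 8
  return Base + (Index << ScaleShift);
}
```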
While we were calling this function, its asserting nature hasn't been
used for a long time.
This used to trigger more frequently when CompileBlock would fail to
compile code, either due to not being able to decode an instruction or
hitting an instruction that FEX doesn't understand.
When these cases are hit today we still generate code blocks which
generate SIGILL. This means that this code was actually never hit.
Completely remove this function and have the JIT's dispatcher call the
CompileBlock function directly. Signature is slightly different since we
need to set x3 to be 0.
Reduces the ELF's VM size from 9.8MB down to 9.37MB and should reduce
initialization time a smidge.
Slammed this out while waiting for other PRs to get reviewed.
Fairly lightweight since it is almost 1:1 transplanting the code from
FEXCore in to the SyscallHandler's thread creation code.
Minor changes:
- ExecutionThreadHandler gets freed before executing the thread
  - Saves 16-bytes of memory per thread
- Start all threads paused by default
  - Since I moved the code to the frontend, I noticed we needed to do
    some post thread-creation setup.
  - Without the pause we were racing code execution with TLS setup and
    a few other things.