This does duplicate the _Constant(1) but it doesn't matter because it
gets inlined into the eor anyway. There is no functional change here.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
We store garbage in the upper bits. That's ok, but it means we need to
mask on read for correct behaviour.
Closes #2767
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
We can fold the Not into the And. This requires flipping the arguments
to Andn, but we do not flip the order of the assignments since that
requires an extra register in a test I'm looking at.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
WIN32 already has a define called `GetObject`, which will cause our
symbol to have an A appended to it and break linking.
Just rename it to `GetTelemetryValue`.
Noticed during introspection that we were generating zero constants
redundantly. Bunch of single cycle hits or zero-register renames.
Every time a `SetRFLAG` helper was called, it was /always/ doing a BFE
on everything passed in to extract the lowest bit. In nearly all cases
the data getting passed in is already only the lowest bit.
Instead, stop the helper from doing this BFE, and ensure the
OpcodeDispatcher does BFE in the couple of cases it still needs to do.
As I was skimming through all these to ensure BFE isn't necessary, I did
notice that some of the BCD instructions are wrong or questionable. So I
left a comment on those so we can come back to it.
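For reference, the BFE the helper was always performing is just a bitfield extract of the lowest bit; a minimal sketch of the semantics (the function name is hypothetical, not FEX's actual IR helper):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of a bitfield extract: pull `width` bits starting
// at bit `lsb` out of `value`.
uint64_t Bfe(uint64_t value, unsigned width, unsigned lsb) {
  uint64_t mask = (width >= 64) ? ~0ull : ((1ull << width) - 1);
  return (value >> lsb) & mask;
}
// The old SetRFLAG helper effectively did Bfe(value, 1, 0) -- i.e.
// `value & 1` -- on every input, which is redundant when the input is
// already only a single bit.
```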
These address calculations were failing to understand that they can be
optimized. When TSO emulation is disabled these were fine, but with TSO
we were eating one more instruction.
Before:
```
add x20, x12, #0x4 (4)
dmb ish
ldr s16, [x20]
dmb ish
```
After:
```
dmb ish
ldr s16, [x12, #4]
dmb ish
```
Also left a note that once LRCPC3 is supported in hardware that we can do a similar optimization there.
When this instruction returns the index into the ecx register, this is
defined as a 32-bit result. This means it actually gets zero-extended to
the full 64-bit GPR size on 64-bit processes.
Previously FEX was doing a 32-bit insert which leaves garbage data in
the upper 32-bits of the RCX register.
Adds a unit test to ensure the result is zero extended.
Fixes running Java games under FEX now that SSE4.2 is exposed.
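The x86-64 rule at play: a write to a 32-bit GPR zero-extends into the full 64-bit register. A minimal sketch of the difference between the buggy insert and the correct behavior (function names are illustrative, not FEX's actual IR ops):

```cpp
#include <cassert>
#include <cstdint>

// Buggy: a 32-bit insert preserves whatever garbage sits in the upper
// 32 bits of RCX.
uint64_t Insert32(uint64_t rcx, uint32_t result) {
  return (rcx & 0xFFFFFFFF00000000ull) | result;
}

// Correct: a 32-bit destination write zero-extends to 64 bits.
uint64_t ZeroExtend32(uint64_t /*rcx*/, uint32_t result) {
  return static_cast<uint64_t>(result);
}
```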
ARM64 BFI doesn't allow you to encode two source registers here to match
our SSA semantics. Also since we don't support RA constraints to ensure
that these match, just do the optimal case in the backend.
Leave a comment for future RA constraint excavators to make this more
optimal.
When a fork occurs FEX needs to be incredibly careful as any thread
(that isn't forking) that holds a lock will vanish when the fork occurs.
At this point if the newly forked process tries to use these mutexes
then the process hangs indefinitely.
The three major mutexes that need to be held during a fork:
- Code Invalidation mutex
- This is the highest priority and causes us to hang frequently.
- This is highly likely to occur when one thread is loading shared
libraries and another thread is forking.
- Happens frequently with Wine and steam.
- VMA tracking mutex
- This one happens when one thread is allocating memory while a fork
occurs.
- This closely relates to the code invalidation mutex, just happens at
the syscall layer instead of the FEXCore layer.
- Happens as frequently as the code invalidation mutex.
- Allocation mutex
- This mutex is used for FEX's 64-bit Allocator, this happens when FEX
is allocating memory on one thread and a fork occurs.
- Fairly infrequent because jemalloc doesn't allocate VMA regions that
often.
While this likely doesn't hit all of the FEX mutexes, this hits the ones
that are burning fires and are happening frequently.
- FEXCore: Adds forkable mutex/locks
Necessary since we have a few locations in FEX that need to be locked
before and after a fork.
When a fork occurs the locks must be locked prior to the fork. Then
afterwards they either need to unlock or be set to default
initialization state.
- Parent
- Does an unlock
- Child
- Sets the lock to default initialization state
- This is because pthreads does TID-based ownership checking on
unique locks and refcount-based waiting for shared locks.
- No way to "unlock" after fork in this case other than default
initializing.
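The scheme above can be sketched with `pthread_atfork`: lock in the prepare handler, unlock in the parent, and reinitialize in the child (since pthreads tracks the owning TID, the child cannot simply unlock). This is a minimal illustration, not FEX's actual implementation; all names are hypothetical.

```cpp
#include <cassert>
#include <pthread.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical forkable mutex: the child can't unlock a mutex locked by
// a pre-fork thread, so it reinitializes to default state instead.
class ForkableMutex {
public:
  void lock()   { pthread_mutex_lock(&Mutex); }
  void unlock() { pthread_mutex_unlock(&Mutex); }
  // Child side after fork: reset to default-initialized state.
  void ReinitAfterFork() { pthread_mutex_init(&Mutex, nullptr); }
private:
  pthread_mutex_t Mutex = PTHREAD_MUTEX_INITIALIZER;
};

ForkableMutex CodeInvalidationMutex;

void InstallForkHandlers() {
  pthread_atfork(
    []() { CodeInvalidationMutex.lock(); },            // prepare: before fork
    []() { CodeInvalidationMutex.unlock(); },          // parent: unlock
    []() { CodeInvalidationMutex.ReinitAfterFork(); }  // child: reinit
  );
}

// Demo: after installing handlers, a forked child can take the lock
// without hanging.
int ForkDemo() {
  InstallForkHandlers();
  pid_t pid = fork();
  if (pid == 0) {
    CodeInvalidationMutex.lock();
    CodeInvalidationMutex.unlock();
    _exit(0);
  }
  int status = 0;
  waitpid(pid, &status, 0);
  return (WIFEXITED(status)) ? WEXITSTATUS(status) : -1;
}
```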
This has been around since the initial commit. Bad idea that wasn't ever
thought through. Something about remapping guest virtual and host
virtual memory which will never be a thing.
Currently RegisterClassType and FenceType are passed into logs, which
fmt 10.0.0 is more strict about. Adds the formatters that were missing
so that compilation can succeed without needing to change all log sites.
We can handle this in the dispatcher itself, so that we don't need to pass along
the register size as a member of the opcode. This gets rid of some unnecessary duplication
of functionality in the backends and makes it so potential backends don't need to deal
with this.
Previously, the bits that we support in the MXCSR weren't being saved,
which means that some opcode patterns may fail to restore the rounding mode
properly.
e.g. FXSAVE, followed by FNINIT, followed by FXRSTOR wouldn't restore the
rounding mode properly
This fixes that.
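The MXCSR rounding control (RC) field lives in bits 13-14 (per the Intel SDM); if those bits aren't captured on save, a later restore can't bring the rounding mode back. A sketch of the save/restore arithmetic (helper names hypothetical):

```cpp
#include <cassert>
#include <cstdint>

// MXCSR rounding control (RC) field: bits 13-14 per the Intel SDM.
constexpr uint32_t MXCSR_RC_SHIFT = 13;
constexpr uint32_t MXCSR_RC_MASK  = 0b11u << MXCSR_RC_SHIFT;

uint32_t GetRoundingMode(uint32_t mxcsr) {
  return (mxcsr & MXCSR_RC_MASK) >> MXCSR_RC_SHIFT;
}

// Restore only the RC bits into an existing MXCSR value.
uint32_t SetRoundingMode(uint32_t mxcsr, uint32_t rc) {
  return (mxcsr & ~MXCSR_RC_MASK) | ((rc & 0b11u) << MXCSR_RC_SHIFT);
}
```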
FEX's current implementation of RIP reconstruction is limited to the
entrypoint that a single block has. This will cause the RIP to be
incorrect past the first instruction in that block.
While this is fine for a decent number of games, especially since fault
handling isn't super common, it doesn't work for all situations.
When testing Ultimate Chicken Horse, we found out that changing the
block size to 1 worked around an early crash in the game's startup.
This game is likely relying on Mono/Unity's AOT compilation step, which
does some more robust faulting than the runtime JIT, needing the RIP to
be correct since it does some sort of checking for where the code came
from.
This fixes Ultimate Chicken Horse specifically, but will likely fix
other games that are built the same way.
When executing a 32-bit application we were failing to allocate a single
GPR pair. This meant we only had 7 pairs when we could have had 8.
This was because r30 was ending up in the middle of the allocation
arrays so we couldn't safely create a sequential pair of registers.
Organize the register allocation arrays to be unique for each bitness
being executed and then access them through spans instead.
Also works around a bug where the RA validation doesn't understand when pair
indexes don't correlate directly to GPR indexes. So while the previous
PR fixed the RA pass, it didn't fix the RA validation pass.
Noticed this when the pr57018 32-bit gcc test was run with the #2700 PR
which improved the RA allocation a bit.
When FEX was updated to reclaim 64-bit registers in #2494, I had
mistakenly messed up pair register class conflicts.
The problem is that FEX has r30 stuck in the middle of the RA which
causes the paired registers to need to offset their index half way.
This meant that the conflict index was incorrect, so pair allocation
was broken on 32-bit applications ever since that PR.
Keep the intersection indexes in their own array so they can be
correctly indexed at runtime.
Thanks to @asahilina for finding out that Osmos started crashing a few
months ago; I finally just got around to bisecting what the problem
was.
This now fixes Osmos from crashing, although the motes are still
invisible in the 32-bit application. Not sure what other havoc this has
been causing.
So, uh, this was a little silly to track down. Having the upper limit
as unsigned was a mistake, since this would cause negative valid lengths
to convert into an unsigned value within the first two flag comparison
cases.
A -1 valid length can occur if one of the strings starts with a null character
in a vector's first element. (It will be zero and we then subtract it to
make the length zero-based).
Fixes this edge-case up and expands a test to check for this in the future.
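The bug class, reduced to a minimal sketch (names hypothetical): comparing a possibly negative signed length against an unsigned limit promotes -1 to a huge unsigned value, so every index looks valid.

```cpp
#include <cassert>
#include <cstdint>

// Buggy shape: valid_length == -1 wraps to 0xFFFFFFFF when the limit is
// held as unsigned, so every index appears in range.
bool IndexValidBuggy(int32_t index, int32_t valid_length) {
  uint32_t limit = static_cast<uint32_t>(valid_length);  // -1 -> 0xFFFFFFFF
  return static_cast<uint32_t>(index) < limit;
}

// Fixed shape: a signed compare rejects everything for a -1 length.
bool IndexValidFixed(int32_t index, int32_t valid_length) {
  return index < valid_length;
}
```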
Allows us to generate a header at compile time for OS specific features.
Should fix compiling on Android since they have a different function
declaration for `malloc_usable_size` compared to Linux.
We spent a bit of effort removing 8 bits from this header to get it
down to three bytes. This ended up in PRs #2319 and #2320.
There was no explicit need to go down to three bytes; the other two
arguments we were removing were just better served as lookups instead
of adding IR overhead for each operation.
This introduced alignment issues that were brought up in #2472.
Apparently the Android NDK's clang will pad nested structs like this,
maybe to match alignment? Regardless, we should just make it 32-bit.
This fixes Android execution of FEXCore.
This fixes #2472.
Pros:
- Initialization now turns into a single str because it's 32-bit
- We have 8 more bits of space that we can abuse in the IR op now
- If we need more bits, 64-bit and 128-bit are easy bumps in the
  future
Cons:
- Each IR operation takes at minimum 25% more space in the intrusive
allocators
- Not really that big of a deal since we are talking 3 bytes versus
4.
FEXCore has no need to understand how to load these layers, which
requires JSON parsing.
Move these to the frontend which is already doing the configuration
layer setup and initialization tasks anyway.
Means FEXCore itself no longer needs to link to tiny-json which can be
left to the frontend.
Regular LoadStoreTSO operations have gained support for LRCPC and LRCPC2
which changes the semantics of the operation by letting it support
immediate offsets.
The paranoid version of these operations didn't support the immediate
offsets yet which was causing incorrect memory loadstores.
Bring over the new semantics from the regular LoadStoreTSO but without
any nop padding.
`eor <reg>, <reg>, #-1` can't be encoded as an instruction. Instead use
mvn which does the same thing.
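The rewrite relies on the bitwise identity behind it: XOR with all-ones is bitwise NOT, which is exactly what AArch64 MVN computes. A quick check:

```cpp
#include <cassert>
#include <cstdint>

// `eor dst, src, #-1` (not encodable as an AArch64 immediate) is
// equivalent to `mvn dst, src`: XOR with all-ones is bitwise NOT.
uint64_t EorAllOnes(uint64_t x) { return x ^ ~0ull; }
uint64_t Mvn(uint64_t x)        { return ~x; }
```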
Removes a single instruction from each OF calculation for ADC and ADD.
Also no reason to use a switch statement for the source size, just use
_Bfe and calculate the offset based on operation size.
SBB caught in the crossfire to ensure it also isn't using a switch
statement.
This is part of FEXCore since it pulls in InternalThreadData, but is
related to the FHU signal mutex class.
Necessary to allow deferring signals in C++ code rather than right in
the JIT.
When a signal handler is not installed and is a terminal failure, make
sure to save telemetry before faulting.
We know when an application is going down in this case so we can make
sure to have the telemetry data saved.
Adds a telemetry signal mask data point as well to know which signal
took it down.
These two extensions rely on AVX being supported to be used, primarily
because they are VEX encoded.
GTA5 is using these flags to determine if it should enable its AVX
support.
Some code in FEX's Arm64 emitter was making an assumption that once
SpillStaticRegs was called that it was safe to still use the SRA
register state.
This wasn't actually true since FEX was using one SRA register to
optimize FPR stores, assuming that the SRA registers were safe to use
since they were just saved and no longer necessary.
Correct this assumption hell by forcing users of the function to provide
the temporary register directly. In all cases the users have a temporary
available that it can use.
Probably fixes some very weird edge case bugs.
This returns the `XFEATURE_ENABLED_MASK` register which reports what
features are enabled on the CPU.
This behaves similarly to CPUID where it uses an index register in ecx.
This is a prerequisite to enabling XSAVE/XRSTOR and AVX since
applications will expect this to exist.
xsetbv is a privileged instruction and doesn't need to be implemented.
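A minimal sketch of how xgetbv emulation can look: like CPUID, ecx selects the register; index 0 returns XCR0 (`XFEATURE_ENABLED_MASK`) split across edx:eax. Bit positions follow the Intel SDM; the function and the chosen feature set are illustrative, not FEX's actual implementation.

```cpp
#include <cassert>
#include <cstdint>

// XCR0 feature bits (Intel SDM): bit 0 = x87, bit 1 = SSE, bit 2 = AVX.
constexpr uint64_t XCR0_X87 = 1ull << 0;
constexpr uint64_t XCR0_SSE = 1ull << 1;
constexpr uint64_t XCR0_AVX = 1ull << 2;

// Hypothetical emulation: ecx selects the XCR; the result goes in edx:eax.
bool EmulateXGetBV(uint32_t ecx, uint32_t& eax, uint32_t& edx) {
  if (ecx != 0) return false;  // Only XCR0 is handled here.
  uint64_t xcr0 = XCR0_X87 | XCR0_SSE;  // AVX not yet exposed.
  eax = static_cast<uint32_t>(xcr0);
  edx = static_cast<uint32_t>(xcr0 >> 32);
  return true;
}
```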
I forgot that x11 was part of the custom ABI of thunks. #2672 had broken
thunks on ARM64. I thought I had tested a game with them enabled but
apparently I tested the wrong game.
Not a full revert since we can still ldr with a literal, but we also
still need to adr x11 and nop pad. At least removes the data dependency
on x11 from the ldr.
Currently WINE's longjump doesn't work, so instead set a flag so that
if HLT is attempted, we just exit the JIT.
This will get our unittests executing at least.
InferFromOS doesn't work under WINE.
InferFromIDRegisters doesn't work under Windows but it will under Wine.
Since we don't support Windows, just use InferFromIDRegisters.
No need to use adr for getting the PC relative literal, we can use LDR
(literal) to load the PC relative address directly.
Reduces trampoline instructions from 3 to 2, and also reduces
trampoline size from 24 bytes to 16 bytes.
Wine syscalls need to end the code block at the point of the syscall.
This is because syscalls may update RIP which means the JIT loop needs
to immediately restart.
Additionally since they can update CPU state, make wine syscalls not
return a result and instead refill the register state from the CPU
state. This will mean the syscall handler will need to update their
result register (RAX?) before returning.
Disabling SRA has been broken for quite a while. Disabling this was
instrumental in figuring out the VC redistributable crash.
Ensure it works by reintroducing non-SRA load/store register handlers,
and by supporting runtime selectable dispatch pointers for the JIT.
Side-bonus, moves the {LOAD,STORE}MEMTSO ops over to this dispatch as
well to make it consistent and probably slightly quicker.
From https://github.com/AsahiLinux/linux/commits/bits/220-tso
This fails gracefully in the case the upstream kernel doesn't support
this feature, so can go in early.
This feature allows FEX to use hardware's TSO emulation capability to
reduce emulation overhead from our atomic/lrcpc implementation.
In the case that the TSO emulation feature is enabled in FEX, we will
check if the hardware supports this feature and then enable it.
If the hardware feature is supported it will then use regular memory
accesses with the expectation that these are x86-TSO in strength.
The only hardware that anyone cares about that supports this is Apple's
M class SoCs. Theoretically NVIDIA Denver/Carmel supports sequential
consistency, which isn't quite the same thing. I haven't cared to check
if multithreaded SC has as strong of guarantees. But also since
Carmel/Denver hardware is fairly rare, it's hard to care about for our
use case.
This can be done in an OS-agnostic fashion. FEXCore knows the details
of its JIT, so this should be done in FEXCore itself.
The frontend is only necessary to inform FEXCore where the fault occurred
and provide the array of GPRs for accessing and modifying the signal
state.
This is necessary for supporting both Linux and Wine signal contexts
with their unaligned access handlers.
We don't have a sane way to query cpu index under wine. We could
technically still use the syscall since we know that we are still
executing under Linux, but that seems a bit terrible.
Disable for now until something can be worked out. Not like it is used
heavily anyway.
This will be used with the TestHarnessRunner in the future to map
specific memory regions.
This is only used as a hint rather than exact placement with failure on
inability to map. This also hits the fun quirk of 64k allocation
granularity which developers need to be careful about.
Related to #2659 but not necessary directly.
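The granularity quirk, sketched: Windows places mappings on 64 KiB allocation granularity (per Windows documentation), so a placement hint needs aligning down to that boundary. The helper name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Windows allocation granularity is 64 KiB; mapping hints must be
// aligned to it or placement behaves unexpectedly.
constexpr uint64_t AllocationGranularity = 64 * 1024;

uint64_t AlignHintDown(uint64_t hint) {
  return hint & ~(AllocationGranularity - 1);
}
```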
Currently x30(LR) is unused in our RA. In all locations that call out to
code, we are already preserving LR and bringing it back after the fact.
This was just a missed opportunity since we aren't doing any call-ret
stack manipulations that would facilitate LR needing to stick around.
Since x18 is a reserved platform register on win32, we can replace its
usage with r19, and then replace r19 usage with x30 and everything just
works happily. Now x18 is the unused register instead of x30 and we can
come back in the future to gain one more register for RA on Linux
platforms.
All code paths to this are already guaranteed to own the lock.
The rest of the codepaths haven't been vetted to determine whether they
actually need a recursive_mutex, but it seems likely that this can be
converted to a regular mutex with some more work.
All variants of the PCMPXSTRX instructions will take their arguments in
the same manner, so we don't need to specify them for each handler.
We can also rename the function to PCMPXSTRXOpImpl, since this will
be extended to handle the masking variants of the string instructions.
This is a very OS specific operation and it living in FEXCore doesn't
make much sense. This still requires some strong collaboration between
FEXCore and the frontend but it is now split between the locations.
There's still a bit more cleanup work that can be done after this is
merged, but we need to get this burning fire out of the way.
This is necessary for llvm-mingw, this requires all previous PRs to be
merged first.
After this is merged, most of the llvm-mingw work is complete, just some
minor cleanups.
To be merged first:
- #2602
- #2604
- #2605
- #2607
- #2610
- #2615
- #2619
- #2621
- #2622
- #2624
- #2625
- #2626
- #2627
- #2628
- #2629
We can reuse the same helper we have for handling VMASKMOVPD and VMASKMOVPS,
though we need to move some handling around to account for the fact that
VPMASKMOVD and VPMASKMOVQ 'hijack' the REX.W bit to signify the element
size of the operation.
This was only used for the unit test fuzzing framework, which has been
removed and was unused for pretty much its entire lifespan.
These can now be internal only.
Adds in the handling of destination type size differences with AVX.
Also fixes cases where the SSE operations would load 128-bit vectors
from memory, rather than only loading 64-bit vectors with VCVTPS2PD.
In order to implement the SSE4.2 string instructions in a reasonable
manner, we can make use of a fallback implementation for the time
being.
This implementation just returns the intermediate result and leaves it
up to the function making use of it to derive the final result from said
intermediate result. This is fine, considering we have the immediate
control byte that tells us exactly what is desired as far as output
formats go.
Given that the result of this IR op will never take up more than
16 bits, we store the flags we need to set in the upper 16 bits of the
result to avoid needing to implement multiple return values in the JIT.
Also, since the IR op just returns the intermediate result, this can be
used to implement all of the explicit string instructions with a single IR op.
The implementation is pretty heavily documented to help make heads or
tails of these monster instructions.
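The packing scheme described above can be sketched as follows: intermediate result in the low 16 bits, flags in the upper 16, so a single IR result carries both. The exact field layout here is illustrative, not FEX's actual encoding.

```cpp
#include <cassert>
#include <cstdint>

// Pack the 16-bit intermediate result and the flags into one 32-bit
// value, avoiding multiple return values in the JIT.
uint32_t PackResult(uint16_t intermediate, uint16_t flags) {
  return (static_cast<uint32_t>(flags) << 16) | intermediate;
}
uint16_t UnpackIntermediate(uint32_t packed) { return packed & 0xFFFF; }
uint16_t UnpackFlags(uint32_t packed)        { return packed >> 16; }
```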
This will use the correct `__cpuid` define, either in cpuid.h or
self-defined depending on environment.
Otherwise we would need to define our own cpuid helpers to match the
difference between mingw and linux.
This lets all the path generation for the config to be in the frontend.
This then informs FEXCore where things should live.
This is for llvm-mingw. While paths aren't quite generated correctly,
this gets the code closer to compiling.
This is not an attempt to clean up the various issues with the pthread
logic; it just moves the pthread-specific logic out of FEXCore into
FEXLoader.
FEXCore needs to know how to create threads in an agnostic way, which
is why we obfuscate the details with this interface.
Initially this was implemented with the pthread handlers in FEXCore and
expected eventually for those to get moved to the frontend. This is the
time when it has been moved.
This is the first step towards compiling with llvm-mingw.
Still a long way to go.
We still need to hook glibc for thunks to work with
`IsHostHeapAllocation`.
So now we link in two jemalloc allocators in different namespaces.
As usual we have multiple heap allocators that we need to be careful about.
1. jemalloc with `je_` namespace.
- This is FEX's regular heap allocator and what gets used for all the
fextl objects.
- This allocator is the one that the FEX mmap/munmap hooks hook in to
- This mmap hooking gives this allocator the full 48-bit VA even in
32-bit space.
2. jemalloc with `glibc_je_` namespace.
- This is the allocator that overrides the glibc allocator routines
- This is the allocator that thunks will use.
- This is what `IsHostHeapAllocation` will check for.
3. Guest glibc allocator
- We don't touch this one. But it is distinct from the host side
allocators.
- The guest side of thunks will use this heap allocator.
4. Host glibc allocator
- #2 replaces this one unless explicitly disabled.
- Always expected to override the allocator, so this configuration
isn't expected.
Already tested this with Dota Underlords to ensure this works with
thunks.
The dispatcher was saving AVX state even though FEX doesn't support it
currently. This is due to it checking for the config option rather than
the HostFeatures option.
The `EnableAVX` config option is supposed to be used to inform FEXCore
if we want AVX disabled or not when the host supports the feature. In
this case it is universally enabled because we haven't encountered any
games that have issues with AVX state being saved with signals. (We know
they exist, we just don't have configurations for them).
The HostFeatures option `SupportsAVX` is the option that is supposed to
be getting used for determining if the runtime AVX feature is enabled.
This also had an issue though that this was **also** always enabled if
running on an x86 host with AVX, or an ARM host with SVE2-256bit.
It was then disabled if the config option was disabled; but since
FEX-Emu doesn't support AVX fully yet, we need to ensure it isn't
enabled.
But this only solves half the problem. In order for our CI to test AVX
features before fully supporting AVX, it needs to be able to enable AVX
so that the CPU state is correctly saved.
So we need to change the default configuration option to be false, and
have CI enable it for the tests that matter before AVX is fully
implemented.
Every time we are calling a function in `FEXCore::Allocator::` this is a
pointer indirection. Which means on x86 it is always a `call [rdi]` and
on AArch64 it is a `ldr x17, [x0]; blr x17;`.
Instead of doing this, use inline functions in the header that call the
correct allocation function directly. This function gets inlined and is
no longer an indirect call.
When compiling with jemalloc, we forward declare the jemalloc function
definitions so we don't have to pull in the entire jemalloc interface in
to the public header definitions.
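The shape of the change, sketched (names hypothetical, not FEX's actual allocator API): an indirect call through a function pointer replaced by an inline header function that calls its target directly and can be inlined.

```cpp
#include <cassert>
#include <cstdlib>

// Before: every call site pays a pointer indirection
// (`call [rdi]` on x86, `ldr x17, [x0]; blr x17` on AArch64).
namespace Indirect {
  static void* (*malloc_ptr)(std::size_t) = ::malloc;
  inline void* Malloc(std::size_t size) { return malloc_ptr(size); }
}

// After: the inline function calls the target directly and the
// compiler can inline it, eliminating the indirect call.
namespace Direct {
  inline void* Malloc(std::size_t size) { return ::malloc(size); }
}
```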
`std::stoul` and `std::stoull` take a std::string, which meant
converting the string_view to a std::string first. glibc fault testing
caught this since not much uses this path.
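For the record, `std::from_chars` parses directly from a character range, so a `std::string_view` never needs converting to `std::string`; a sketch of the allocation-free pattern:

```cpp
#include <cassert>
#include <charconv>
#include <cstdint>
#include <string_view>

// Parse an unsigned integer from a string_view without allocating,
// unlike std::stoul/std::stoull which require a std::string.
uint64_t ParseU64(std::string_view sv, uint64_t fallback = 0) {
  uint64_t value = 0;
  auto result = std::from_chars(sv.data(), sv.data() + sv.size(), value);
  return result.ec == std::errc{} ? value : fallback;
}
```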
These will be added to the documentation.
fwrite allocates some backing memory for buffering outputs which FEX
can't track.
Switch to using `fileno` to get the fd from the FILE and write directly.
This will need to be changed for llvm-mingw support but that will come after
this.
This will be added to the documentation that we can't use fwrite.
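The replacement pattern, sketched: take the fd out of the FILE with `fileno` and `write(2)` directly, bypassing stdio's buffer allocation entirely. The wrapper name is hypothetical.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>
#include <unistd.h>

// Write a string through the FILE's underlying fd directly; no stdio
// buffering, so no backing allocation that FEX can't track.
ssize_t WriteUnbuffered(FILE* file, const char* str) {
  return write(fileno(file), str, strlen(str));
}
```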
This is done by consuming a single page at the end of the current sbrk
memory region. Then consuming any remaining bytes that could have
potentially ended up in it.
This ensures that glibc won't be able to return 64-bit pointers to
32-bit thunks once the remaining work is in place.
The LOAD_LIB and EXPORTS macros behave slightly differently in this regard:
* Use LOAD_LIB(libwayland-client) in Guest.cpp (library name with dash)
* Use EXPORTS(libwayland_client) in Host.cpp (library name with underscore)
The previous log in the frontend was super useful when an instruction
decoding wasn't supported.
Now that most of AVX is covered, a game will crash on SIGILL (and
usually catch it) and close without any indication.
Now if the instruction is decoded but it is invalid for the
configuration, still output a message as a good indicator that the game
is using instructions that the host doesn't support.
Will let us still pick up on games crashing due to lack of SVE very
easily.
This is an undocumented but supported instruction. It behaves just like
an `sbb al, al` but doesn't set flags and is one byte shorter.
The end result is that al is set to 0xFF or 0 depending on if CF is set
or not.
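SALC's behavior reduces to a one-liner: AL becomes 0xFF if CF is set, 0 otherwise, with flags left untouched (unlike `sbb al, al`, which sets flags).

```cpp
#include <cassert>
#include <cstdint>

// SALC: AL = CF ? 0xFF : 0x00, without modifying any flags.
uint8_t Salc(bool cf) {
  return cf ? 0xFF : 0x00;
}
```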
And with that, we support all of the AVX1-only instructions.
The remaining instructions for full AVX1 support is now just the SSE4.2
string instructions.
Will be used to implement the load variants of VMASKMOVP{D, S} and
VPMASKMOV{D, Q}
Particularly useful, since with SVE this behavior can be collapsed into
two instructions (CMPGT followed by the relevant LD1 load instruction)
These conditionals were accidentally inverted and were treating 32-bit
elements as 64-bit ones, which was unintended.
Also add missing tests to ensure this doesn't slip through in the
future.