Inspired by: https://github.com/dotnet/runtime/issues/8072
Currently FEX is /very/ heavy handed with our backpatching where we wrap
every backpatched loadstore with `dmb ish`.
This can be relaxed slightly according to the linked issue.
For TSO load instructions the instruction sequence changes to:
ldr <args>;
dmb ld; <-- Slightly less strict dmb
For TSO store instructions the instruction sequence changes to:
dmb ish; <-- Still the all encompassing dmb
str <args>;
For backpatching loadstores this does the same thing where only one side
needs the nop and it uses the same instruction sequence when
backpatched.
The minor change is that on load backpatching, we are no longer backing
up a single instruction, instead just re-executing the instruction we
patched directly.
Took a long time to come back to this (Last looked in August 2020).
Previously when I was implementing this idea it didn't work, but that
was because our CompareExchange operation was broken back then. With the
CAS now, it should just work.
We expect that PF is written more often than it's read, so we want to
get the expensive popcount out of the hot path. (Thank you to Dougall
for suggesting that.)
There are two cases:
1. PF is written by an integer instruction. In this case, we calculate
with the formula `popcount(x ^ 1) & 1`.
2. PF is written by a float instruction, copying a host flag.
What we really want is to defer the relatively expensive popcount. So,
to unify these cases, we have integer instructions write `x ^ 1` and
(unchanged) float instructions write the host flag. Then, when reading
PF, we do `popcount(value) & 1` on the byte read in.
If PF is written but not read, this saves the expensive popcount and
leaves only the cheap xor.
If PF is written by an integer op and read, this maybe shuffles some
code but does not materially change anything.
If PF is written by a float op and read, this is worse because now we're
doing an extra pointless popcount. This is a tradeoff... However, this
is only relevant to unordered float comparisons, which I expect to be
obscure for games. So this should be worth it overall (for games, if
not weird numerical computing workloads).
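A minimal sketch of the deferred scheme described above, assuming the raw
PF byte is stored as-is and only decoded on read. The helper names are
hypothetical and not taken from the FEX sources.

```cpp
#include <cassert>
#include <cstdint>

// Portable population count of a byte (kept local so the sketch has no
// dependency on a particular language standard or builtin).
inline unsigned PopCount8(uint8_t Value) {
  unsigned Count = 0;
  while (Value != 0) {
    Count += Value & 1u;
    Value = static_cast<uint8_t>(Value >> 1);
  }
  return Count;
}

// Hypothetical: integer instructions store `x ^ 1`, truncated to a byte.
inline uint8_t WritePFFromInteger(uint64_t Result) {
  return static_cast<uint8_t>(Result ^ 1);
}

// Hypothetical: float instructions store the host flag unchanged.
inline uint8_t WritePFFromHostFlag(bool HostFlag) {
  return HostFlag ? 1 : 0;
}

// Reading PF pays for the popcount, and only here.
inline unsigned ReadPF(uint8_t Raw) {
  return PopCount8(Raw) & 1u;
}
```

For a zero result (the `xor eax, eax` case) the stored byte is the
constant 1, so the read side sees PF set without any popcount on the
write side.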
How does this connect to my register zeroing quest? The constant folding
code doesn't currently deal with FPRs and I'm not in a mood to change
this. So before, a block ending with `xor eax, eax` would still do a
popcount for PF. Now it just writes a constant 1 since the xor constant
folds and the popcount never happens at all.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
For flag calculation after moving a constant. This cleans up the code
generated for zeroing at the end of a block.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Lets us deduplicate the asserts and put them in one spot. We can also
improve the assert message to also indicate the valid range.
While we're in the area, we can collapse a few case paths
as a result of this assert movement.
For #2804 so it can compare if GPRs match more easily.
Only adding to GPR and Literal types, since it's ambiguous what this
would mean for the memory accessing types. Going to leave those other
ones alone for now.
Now that we have the helper for encoding immediate shifts,
we can trivially implement the remaining missing instruction
in the bitwise logical unpredicated group.
Prevents the code invalidation mutex from being locked as shared recursively,
since it is locked before entering ThreadExitFunctionLink and would end up
being locked again by ThreadAddBlockLink.
This fixes a deadlock on Windows.
This lets us deduplicate the behavior rather than open-coding it everywhere
and also makes it nicer to implement instructions that make use of this
encoding pattern.
Allows us to consume an array of strings and convert it to a mask of
enum values. This is a quality of life change that allows us to specify
a mask of options.
The first configuration option added to support this is to control the
vixl disassembler. Now by default the vixl disassembler doesn't
disassemble any blocks and needs to be enabled individually.
eg:
```
FEXLoader --disassemble=blocks <args>
FEXLoader --disassemble=dispatcher <args>
FEXLoader --disassemble=blocks,dispatcher <args>
```
Has the additional convenience option of just passing in numbers as
well.
```
FEXLoader --disassemble=2 <args>
FEXLoader --disassemble=1 <args>
FEXLoader --disassemble=3 <args>
```
Also of course all of this works through environment variables.
```
FEX_DISASSEMBLE=blocks FEXInterpreter <args>
FEX_DISASSEMBLE=dispatcher FEXInterpreter <args>
FEX_DISASSEMBLE=blocks,dispatcher FEXInterpreter <args>
```
While only used fairly sparingly now, more configuration options are
likely to use this in the future, since we already have some configs
that are basically enums but are implemented with string comparisons.
This was asked for by a developer, so I figured I would throw it
together quickly.
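A rough sketch of how a comma-separated option string could be folded
into a mask. The bit values (dispatcher = 1, blocks = 2) are an
assumption inferred from the ordering of the examples above, and the
function name is hypothetical rather than the actual FEX config code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Assumed bit assignments; not confirmed by the commit message.
constexpr uint64_t DISASSEMBLE_DISPATCHER = 1;
constexpr uint64_t DISASSEMBLE_BLOCKS = 2;

// Returns true if the token consists only of decimal digits.
inline bool IsNumber(const std::string& Token) {
  if (Token.empty()) return false;
  for (char C : Token) {
    if (C < '0' || C > '9') return false;
  }
  return true;
}

// Hypothetical parser: accepts names, raw numbers, or a mix of both.
inline uint64_t ParseDisassembleMask(const std::string& Option) {
  uint64_t Mask = 0;
  std::size_t Begin = 0;
  while (Begin <= Option.size()) {
    std::size_t End = Option.find(',', Begin);
    if (End == std::string::npos) End = Option.size();
    const std::string Token = Option.substr(Begin, End - Begin);
    if (Token == "blocks") {
      Mask |= DISASSEMBLE_BLOCKS;
    } else if (Token == "dispatcher") {
      Mask |= DISASSEMBLE_DISPATCHER;
    } else if (IsNumber(Token)) {
      Mask |= std::stoull(Token);
    }
    // Unknown tokens are silently ignored in this sketch.
    Begin = End + 1;
  }
  return Mask;
}
```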
Ensures that we handle the AVX2 VSIB byte in a decent way.
As is, we can't compute the [index * scale] variant portion
of the entire address operand, since the scale needs to act
on every element of the vector after sign extension.
What we can do, though, is compute the base address and add
the displacement to it ahead of time.
All of these IR operations were being fairly inefficient in their
address calculation. All of them are known to use power-of-2 stride
indexing, so each can be converted from three instructions to
one.
These are always used for x87 stack accesses so each one gets an
improvement.
Before:
```asm
0x0000ffff6a800248 d2800200 mov x0, #0x10
0x0000ffff6a80024c 9b007e80 mul x0, x20, x0
0x0000ffff6a800250 8b000380 add x0, x28, x0
0x0000ffff6a800254 fd417805 ldr d5, [x0, #752]
```
After:
```asm
0x0000ffff91e80240 8b141380 add x0, x28, x20, lsl #4
0x0000ffff91e80244 fd417805 ldr d5, [x0, #752]
```
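The equivalence behind the before/after sequences can be sketched in
C++. The 16-byte stride matches the `#0x10` constant in the listing; the
function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Mirrors the "before" sequence: materialize the stride, multiply, add.
inline uint64_t SlotAddressMul(uint64_t Base, uint64_t Index) {
  const uint64_t Stride = 0x10;
  return Base + Index * Stride;
}

// Mirrors the "after" sequence: a single add with a shifted register
// operand (`add x0, x28, x20, lsl #4`).
inline uint64_t SlotAddressShift(uint64_t Base, uint64_t Index) {
  return Base + (Index << 4);
}

// Helper for the general case: log2 of a power-of-2 stride.
inline unsigned Log2OfStride(uint64_t Stride) {
  assert(Stride != 0 && (Stride & (Stride - 1)) == 0);
  unsigned Shift = 0;
  while ((Stride >> Shift) != 1) {
    ++Shift;
  }
  return Shift;
}
```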
Currently we're clearing icache including the data that lives on the
tail of the block. Instead only clear the code that was emitted and
not tail data.
Additionally only disasm the code rather than all the tail data as well,
as it gets unwieldy if viewing.
If we are loading exactly the flags we need from the RFLAGS (ensuring we
don't load the reserved flag in bit 1) then we don't need to do a mask
on the result.
Additionally there is some bad code-motion around selects that was
causing SBFE operations to occur on constants. Ensure that we const-prop
any SBFE operations to clean this up.
This PR along with #2783 causes FMOV blow-up to go from 41 instructions
to 31 instructions.
Noticed this when inspecting some code that was moving constant
`0x80808080` in to a register. Was using two move instructions when it
could have used a single bitmask move.
This now checks to see if a constant can be 32-bit encoded in a logical
bitmask move and uses that.
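For reference, a sketch of one way to test whether a 32-bit constant is
encodable as an AArch64 logical (bitmask) immediate: the value must be a
replicated 2/4/8/16/32-bit element that is a rotated run of contiguous
ones. This is an illustrative check, not the emitter's actual code, and
the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

inline unsigned PopCount32(uint32_t Value) {
  unsigned Count = 0;
  for (; Value != 0; Value &= Value - 1) {
    ++Count;
  }
  return Count;
}

inline bool IsLogicalImmediate32(uint32_t Value) {
  // All-zeros and all-ones are not encodable.
  if (Value == 0 || Value == 0xFFFFFFFFu) return false;

  for (unsigned Size = 2; Size <= 32; Size *= 2) {
    const uint32_t Mask = (Size == 32) ? 0xFFFFFFFFu : ((1u << Size) - 1);
    const uint32_t Element = Value & Mask;

    // The element must repeat across the whole 32-bit value.
    bool Replicated = true;
    for (unsigned Shift = Size; Shift < 32; Shift += Size) {
      if (((Value >> Shift) & Mask) != Element) {
        Replicated = false;
        break;
      }
    }
    if (!Replicated) continue;

    // A rotated run of ones has exactly two bit transitions when the
    // element is viewed as a ring of `Size` bits.
    const uint32_t Rotated = ((Element >> 1) | (Element << (Size - 1))) & Mask;
    if (PopCount32(Element ^ Rotated) == 2) return true;
  }
  return false;
}
```

`0x80808080` passes with an 8-bit element of `0x80`, which is why a
single bitmask move is enough for it.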
This instruction doesn't match ARM semantics very well since it returns
the position of the minimum element.
But at the very least the insert in to the final instruction can be a
bit more optimal. Converts a 5-instruction eor+mov+mov+mov+mov sequence
in to a 2-instruction mov+mov.
This works because `VUMinV` already zero extends the vector so the
position only needs to be inserted at the end.
32-bit and 64-bit SH{L,R}D matches behaviour of EXTR. Optimize to using
this op in that case.
This converts the lsl+lsr+orr sequence in to a single extr instruction.
16-bit still goes down the old path.
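A sketch of why the mapping works for the 64-bit case, modelling EXTR as
"take 64 bits out of the concatenation hi:lo starting at bit lsb". The
function names are illustrative and shift counts are assumed to be in the
range 1..63.

```cpp
#include <cassert>
#include <cstdint>

// Model of `extr xd, xn, xm, #lsb`: bits [lsb+63:lsb] of Xn:Xm.
inline uint64_t Extr64(uint64_t Hi, uint64_t Lo, unsigned Lsb) {
  assert(Lsb < 64);
  if (Lsb == 0) return Lo; // Avoid an undefined shift by 64 below.
  return (Lo >> Lsb) | (Hi << (64 - Lsb));
}

// SHRD dst, src, count: shift dst right, filling from the low bits of src.
inline uint64_t Shrd64(uint64_t Dst, uint64_t Src, unsigned Count) {
  assert(Count >= 1 && Count <= 63);
  return Extr64(Src, Dst, Count);
}

// SHLD dst, src, count: shift dst left, filling from the high bits of src.
inline uint64_t Shld64(uint64_t Dst, uint64_t Src, unsigned Count) {
  assert(Count >= 1 && Count <= 63);
  return Extr64(Dst, Src, 64 - Count);
}

// The old lsl+lsr+orr formulation of SHLD, for comparison.
inline uint64_t Shld64Reference(uint64_t Dst, uint64_t Src, unsigned Count) {
  return (Dst << Count) | (Src >> (64 - Count));
}
```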
Weirdly this code manages to have a bad insert for no reason? But
unrelated since this happens in the old code as well.
```
%4(GPRFixed3) i64 = LoadRegister #0x0, #0x20, GPR, GPRFixed, u8:Tmp:Size
%5(GPR0) i64 = LoadRegister #0x0, #0x8, GPR, GPRFixed, u8:Tmp:Size
%6(GPRFixed0) i64 = Extr %5(GPR0) i64, %4(GPRFixed3) i64, #0x3e
```
Not sure why the SRA fails on that second LoadRegister.
There was one holdout variable that was in a TLS object in FEXCore. Move
it to the frontend with the rest of the TLS variables.
Allows us to remove the FEXCore TLS management, leaving the "Frontend"
TLS management as the only TLS management.
This pass is currently doing nothing in main.
Ever since we have enforced that LoadContext/StoreContext doesn't touch
GPRs and FPRs, this has only been eliminating flags.
Remove that usage of LoadContext/StoreContext and replace with
their replacement of LoadRegister/StoreRegister for tracking GPR and FPR
accesses.
Stripped from #2700 since this is safe to merge.
Noticed while looking at #2700.
Testing doesn't currently see this as a bug but will once #2700 starts
optimizing StoreRegister+LoadRegister pairs.
Doesn't fix the issues in that PR, but this is one.
While we're in the area implementing the Scalar + Vector variants,
we may as well cross off the Vector + Immediate variants and
complete all of the load variants for the regular LD1{*} loads.
This is less noisy with no loss of clarity, and follows the notation
used by both LLVM IR and NIR. (So, it should be familiar.)
Change done with:
sed -i -e 's/%ssa/%/g' $(git grep -l '%ssa')
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
This does duplicate the _Constant(1) but it doesn't matter because it
gets inlined into the eor anyway. There is no functional change here.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
We store garbage in the upper bits. That's ok, but it means we need to
mask on read for correct behaviour.
Closes #2767
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
We can fold the Not into the And. This requires flipping the arguments
to Andn, but we do not flip the order of the assignments since that
requires an extra register in a test I'm looking at.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
WIN32 has a define already called `GetObject` and will cause our
symbol to have an A appended to it and break linking.
Just rename it to `GetTelemetryValue`
Noticed during introspection that we were generating zero constants
redundantly. Bunch of single cycle hits or zero-register renames.
Every time a `SetRFLAG` helper was called, it was /always/ doing a BFE
on everything passed in to extract the lowest bit. In nearly all cases
the data getting passed in is already only the lowest bit.
Instead, stop the helper from doing this BFE, and ensure the
OpcodeDispatcher does BFE in the couple of cases it still needs to do.
As I was skimming through all these to ensure BFE isn't necessary, I did
notice that some of the BCD instructions are wrong or questionable. So I
left a comment on those so we can come back to it.
These address calculations were failing to understand that they can be
optimized. When TSO emulation is disabled these were fine, but with TSO
we were eating one more instruction.
Before:
```
add x20, x12, #0x4 (4)
dmb ish
ldr s16, [x20]
dmb ish
```
After:
```
dmb ish
ldr s16, [x12, #4]
dmb ish
```
Also left a note that once LRCPC3 is supported in hardware that we can do a similar optimization there.
When this instruction returns the index in to the ecx register, this is
defined as a 32-bit result. This means it actually gets zero-extended to
the full 64-bit GPR size on 64-bit processes.
Previously FEX was doing a 32-bit insert which leaves garbage data in
the upper 32-bits of the RCX register.
Adds a unit test to ensure the result is zero extended.
Fixes running Java games under FEX now that SSE4.2 is exposed.
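The difference between the old and new behaviour can be sketched as
follows. The function names are hypothetical; only the bit manipulation
is the point.

```cpp
#include <cassert>
#include <cstdint>

// Old behaviour: a 32-bit insert keeps whatever was in the upper half
// of RCX.
inline uint64_t Insert32(uint64_t OldRCX, uint32_t Index) {
  return (OldRCX & 0xFFFFFFFF00000000ull) | Index;
}

// New behaviour: a 32-bit result is zero-extended to the full 64-bit GPR.
inline uint64_t ZeroExtend32(uint32_t Index) {
  return static_cast<uint64_t>(Index);
}
```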
ARM64 BFI doesn't allow you to encode two source registers here to match
our SSA semantics. Also since we don't support RA constraints to ensure
that these match, just do the optimal case in the backend.
Leave a comment for future RA constraint excavators to make this more
optimal.
When a fork occurs FEX needs to be incredibly careful as any thread
(that isn't forking) that holds a lock will vanish when the fork occurs.
At this point if the newly forked process tries to use these mutexes
then the process hangs indefinitely.
The three major mutexes that need to be held during a fork:
- Code Invalidation mutex
- This is the highest priority and causes us to hang frequently.
- This is highly likely to occur when one thread is loading shared
libraries and another thread is forking.
- Happens frequently with Wine and Steam.
- VMA tracking mutex
- This one happens when one thread is allocating memory while a fork
occurs.
- This closely relates to the code invalidation mutex, just happens at
the syscall layer instead of the FEXCore layer.
- Happens as frequently as the code invalidation mutex.
- Allocation mutex
- This mutex is used for FEX's 64-bit Allocator, this happens when FEX
is allocating memory on one thread and a fork occurs.
- Fairly infrequent because jemalloc doesn't allocate VMA regions that
often.
While this likely doesn't hit all of the FEX mutexes, this hits the ones
that are burning fires and are happening frequently.
- FEXCore: Adds forkable mutex/locks
Necessary since we have a few locations in FEX that need to be locked
before and after a fork.
When a fork occurs the locks must be locked prior to the fork. Then
afterwards they either need to unlock or be set to default
initialization state.
- Parent
- Does an unlock
- Child
- Sets the lock to default initialization state
- This is because pthreads does TID based ownership checking on
unique locks and refcount based waiting for shared locks.
- No way to "unlock" after fork in this case other than default
initializing.
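A minimal sketch of the parent/child handling described above, assuming
a plain pthread mutex underneath. The class and method names are
hypothetical, and re-initializing a held mutex leans on implementation
behaviour rather than anything POSIX guarantees.

```cpp
#include <cassert>
#include <pthread.h>

// Hypothetical forkable unique mutex wrapper.
class ForkableUniqueMutex {
public:
  bool Lock() { return pthread_mutex_lock(&Mutex) == 0; }
  bool Unlock() { return pthread_mutex_unlock(&Mutex) == 0; }

  // Parent side after fork: the owning thread still exists, so a normal
  // unlock is enough.
  bool ReleaseAfterForkInParent() { return Unlock(); }

  // Child side after fork: ownership information recorded in the mutex
  // refers to the parent, so the lock is put back in its default
  // initialization state instead of being unlocked.
  bool ResetAfterForkInChild() {
    return pthread_mutex_init(&Mutex, nullptr) == 0;
  }

private:
  pthread_mutex_t Mutex = PTHREAD_MUTEX_INITIALIZER;
};
```

The intended call order is: lock every forkable mutex, fork, then call
the parent or child hook depending on which side of the fork is running.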
This has been around since the initial commit. Bad idea that wasn't ever
thought through. Something about remapping guest virtual and host
virtual memory which will never be a thing.
Currently RegisterClassType and FenceType are passed into logs, which
fmt 10.0.0 is more strict about. Adds the formatters that were missing
so that compilation can succeed without needing to change all log sites.
We can handle this in the dispatcher itself, so that we don't need to pass along
the register size as a member of the opcode. This gets rid of some unnecessary duplication
of functionality in the backends and makes it so potential backends don't need to deal
with this.
Previously, the bits that we support in the MXCSR weren't being saved,
which means that some opcode patterns may fail to restore the rounding mode
properly.
e.g. FXSAVE, followed by FNINIT, followed by FXRSTOR wouldn't restore the
rounding mode properly
This fixes that.
FEX's current implementation of RIP reconstruction is limited to the
entrypoint that a single block has. This will cause the RIP to be
incorrect past the first instruction in that block.
While this is fine for a decent number of games, especially since fault
handling isn't super common, it doesn't work for all situations.
When testing Ultimate Chicken Horse, we found out that changing the
block size to 1 worked around an early crash in the game's startup.
This game is likely relying on Mono/Unity's AOT compilation step, which
does some more robust faulting than the runtime JIT, needing the RIP to
be correct since they do some sort of checking for where the code came
from.
This fixes Ultimate Chicken Horse specifically, but will likely fix
other games that are built the same way.
When executing a 32-bit application we were failing to allocate a single
GPR pair. This meant we only have 7 pairs when we could have had 8.
This was because r30 was ending up in the middle of the allocation
arrays so we couldn't safely create a sequential pair of registers.
Organize the register allocation arrays to be unique for each bitness
being executed and then access them through spans instead.
Also works around bug where the RA validation doesn't understand when pair
indexes don't correlate directly to GPR indexes. So while the previous
PR fixed the RA pass, it didn't fix the RA validation pass.
Noticed this when pr57018 32-bit gcc test was run with the #2700 PR
which improved the RA allocation a bit.
When FEX was updated to reclaim 64-bit registers in #2494, I had
mistakenly messed up pair register class conflicts.
The problem is that FEX has r30 stuck in the middle of the RA which
causes the paired registers to need to offset their index half way.
This meant that the conflict index has been incorrect on
32-bit applications ever since that PR.
Keep the intersection indexes in their own array so they can be correctly
indexed at runtime.
Thanks to @asahilina finding out that Osmos started crashing a few
months ago and I finally just got around to bisecting what the problem
was.
This now fixes Osmos from crashing, although the motes are still
invisible on the 32-bit application. Not sure what other havoc this has
been causing.
So, uh, this was a little silly to track down. So, having the upper limit
as unsigned was a mistake, since this would cause negative valid lengths to
convert into an unsigned value within the first two flag comparison cases.
A -1 valid length can occur if one of the strings starts with a null character
in a vector's first element. (It will be zero and we then subtract it to
make the length zero-based).
Fixes this edge-case up and expands a test to check for this in the future.
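The class of bug can be reproduced in isolation. The comparison below is
a stand-in for the real flag checks, which are not shown in this message;
the names and the limit of 15 are illustrative only.

```cpp
#include <cassert>
#include <cstdint>

// With an unsigned upper limit, a valid length of -1 is converted to a
// huge unsigned value before comparing. The explicit cast spells out what
// the implicit conversion does.
inline bool BelowLimitUnsigned(int32_t ValidLength, uint32_t UpperLimit) {
  return static_cast<uint32_t>(ValidLength) < UpperLimit;
}

// With a signed upper limit, -1 compares as expected.
inline bool BelowLimitSigned(int32_t ValidLength, int32_t UpperLimit) {
  return ValidLength < UpperLimit;
}
```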
Allows us to generate a header at compile time for OS specific features.
Should fix compiling on Android since they have a different function
declaration for `malloc_usable_size` compared to Linux.
We spent a bit of effort removing 8-bits from this header to get it down
to three bytes. This ended up in PRs #2319 and #2320
There was no explicit need to go down to three bytes, the other two
arguments we were removing were just better served to be lookups instead
of adding IR overhead for each operation.
This now introduced alignment issues that were brought up in #2472.
Apparently the Android NDK's clang will pad nested structs like this,
maybe to match alignment? Regardless we should just make it be 32-bit.
This fixes Android execution of FEXCore.
This fixes #2472
Pros:
- Initialization now turns in to a single str because it's 32-bit
- We have 8-bits more space that we can abuse in the IR op now
- If we need more, then 64-bit and 128-bit are easy bumps in the
future
Cons:
- Each IR operation takes at minimum 25% more space in the intrusive
allocators
- Not really that big of a deal since we are talking 3 bytes versus
4.
FEXCore has no need to understand how to load these layers, which
requires JSON parsing.
Move these to the frontend which is already doing the configuration
layer setup and initialization tasks anyway.
Means FEXCore itself no longer needs to link to tiny-json which can be
left to the frontend.
Regular LoadStoreTSO operations have gained support for LRCPC and LRCPC2
which changes the semantics of the operation by letting it support
immediate offsets.
The paranoid version of these operations didn't support the immediate
offsets yet which was causing incorrect memory loadstores.
Bring over the new semantics from the regular LoadStoreTSO but without
any nop padding.
`eor <reg>, <reg>, #-1` can't be encoded as an instruction. Instead use
mvn which does the same thing.
Removes a single instruction from each OF calculation for ADC and ADD.
Also no reason to use a switch statement for the source size, just use
_Bfe and calculate the offset based on operation size.
SBB caught in the crossfire to ensure it also isn't using a switch
statement.
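A sketch of a size-generic OF calculation for ADD, using the standard
identity OF = sign bit of `~(a ^ b) & (a ^ result)`. That this is the
exact form used in the OpcodeDispatcher is an assumption, and the `Bfe`
helper here is a stand-in for the IR op of the same name.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in bitfield extract: `Width` bits starting at bit `Lsb`.
inline uint64_t Bfe(uint64_t Value, unsigned Width, unsigned Lsb) {
  const uint64_t Mask = (Width == 64) ? ~0ull : ((1ull << Width) - 1);
  return (Value >> Lsb) & Mask;
}

// Overflow flag for ADD at any operand size. The bit offset is derived
// from the operation size instead of switching on it.
inline uint64_t AddOverflowFlag(uint64_t Src1, uint64_t Src2, unsigned SizeBytes) {
  assert(SizeBytes == 1 || SizeBytes == 2 || SizeBytes == 4 || SizeBytes == 8);
  const uint64_t Result = Src1 + Src2;
  // The bitwise NOT corresponds to the single `mvn` mentioned above.
  const uint64_t Overflow = ~(Src1 ^ Src2) & (Src1 ^ Result);
  return Bfe(Overflow, 1, SizeBytes * 8 - 1);
}
```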
This is part of FEXCore since it pulls in InternalThreadData, but is
related to the FHU signal mutex class.
Necessary to allow deferring signals in C++ code rather than right in
the JIT.
When a signal handler is not installed and is a terminal failure, make
sure to save telemetry before faulting.
We know when an application is going down in this case so we can make
sure to have the telemetry data saved.
Adds a telemetry signal mask data point as well to know which signal
took it down.
These two extensions rely on AVX being supported to be used, primarily
because they are VEX encoded.
GTA5 is using these flags to determine if it should enable its AVX
support.
Some code in FEX's Arm64 emitter was making an assumption that once
SpillStaticRegs was called that it was safe to still use the SRA
register state.
This wasn't actually true since FEX was using one SRA register to
optimize FPR stores, assuming that the SRA registers were safe to use
since they were just saved and no longer necessary.
Correct this assumption hell by forcing users of the function to provide
the temporary register directly. In all cases the users have a temporary
available that it can use.
Probably fixes some very weird edge case bugs.
This returns the `XFEATURE_ENABLED_MASK` register which reports what
features are enabled on the CPU.
This behaves similarly to CPUID where it uses an index register in ecx.
This is a prerequisite to enabling XSAVE/XRSTOR and AVX since
applications will expect this to exist.
xsetbv is a privileged instruction and doesn't need to be implemented.
I forgot that x11 was part of the custom ABI of thunks. #2672 had broken
thunks on ARM64. I thought I had tested a game with them enabled but
apparently I tested the wrong game.
Not a full revert since we can still ldr with a literal, but we also
still need to adr x11 and nop pad. At least removes the data dependency
on x11 from the ldr.
Currently WINE's longjump doesn't work, so instead set a flag that if
HLT is attempted, just exit the JIT.
This will get our unittests executing at least.
InferFromOS doesn't work under WINE.
InferFromIDRegisters doesn't work under Windows but it will under Wine.
Since we don't support Windows, just use InferFromIDRegisters.
No need to use adr for getting the PC relative literal, we can use LDR
(literal) to load the PC relative address directly.
Reduces trampoline instructions from 3 to 2, also reduces trampoline size
from 24-bytes to 16-bytes.
Wine syscalls need to end the code block at the point of the syscall.
This is because syscalls may update RIP which means the JIT loop needs
to immediately restart.
Additionally since they can update CPU state, make wine syscalls not
return a result and instead refill the register state from the CPU
state. This will mean the syscall handler will need to update their
result register (RAX?) before returning.
Disabling SRA has been broken for quite a while. Disabling this was
instrumental in figuring out the VC redistributable crash.
Ensure it works by reintroducing non-SRA load/store register handlers,
and by supporting runtime selectable dispatch pointers for the JIT.
Side-bonus, moves the {LOAD,STORE}MEMTSO ops over to this dispatch as
well to make it consistent and probably slightly quicker.
From https://github.com/AsahiLinux/linux/commits/bits/220-tso
This fails gracefully in the case the upstream kernel doesn't support
this feature, so can go in early.
This feature allows FEX to use hardware's TSO emulation capability to
reduce emulation overhead from our atomic/lrcpc implementation.
In the case that the TSO emulation feature is enabled in FEX, we will
check if the hardware supports this feature and then enable it.
If the hardware feature is supported it will then use regular memory
accesses with the expectation that these are x86-TSO in strength.
The only hardware that anyone cares about that supports this is Apple's
M class SoCs. Theoretically NVIDIA Denver/Carmel supports sequential
consistency, which isn't quite the same thing. I haven't cared to check
if multithreaded SC has as strong of guarantees. But also since
Carmel/Denver hardware is fairly rare, it's hard to care about for our
use case.
This can be done in an OS agnostic fashion. FEXCore knows the details of
its JIT and should be done in FEXCore itself.
The frontend is only necessary to inform FEXCore where the fault occurred
and provide the array of GPRs for accessing and modifying the signal
state.
This is necessary for supporting both Linux and Wine signal contexts
with their unaligned access handlers.
We don't have a sane way to query cpu index under wine. We could
technically still use the syscall since we know that we are still
executing under Linux, but that seems a bit terrible.
Disable for now until something can be worked out. Not like it is used
heavily anyway.
This will be used with the TestHarnessRunner in the future to map
specific memory regions.
This is only used as a hint rather than exact placement with failure on
inability to map. This also hits the fun quirk of 64k allocation
granularity which developers need to be careful about.
Related to #2659 but not necessary directly.
Currently x30(LR) is unused in our RA. In all locations that call out to
code, we are already preserving LR and bringing it back after the fact.
This was just a missed opportunity since we aren't doing any call-ret
stack manipulations that would facilitate LR needing to stick around.
Since x18 is a reserved platform register on win32, we can replace its
usage with x19, and then replace x19 usage with x30 and everything just
works happily. Now x18 is the unused register instead of x30 and we can
come back in the future to gain one more register for RA on Linux
platforms.
All code paths to this are already guaranteed to own the lock.
The rest of the codepaths haven't been vetted to actually need
recursive_mutex yet, but seems likely that it will be able to get
converted to a regular mutex with some more work.
All variants of the PCMPXSTRX instructions will take their arguments in
the same manner, so we don't need to specify them for each handler.
We can also rename the function to PCMPXSTRXOpImpl, since this will
be extended to handle the masking variants of the string instructions.
This is a very OS specific operation and it living in FEXCore doesn't
make much sense. This still requires some strong collaboration between
FEXCore and the frontend but it is now split between the locations.
There's still a bit more cleanup work that can be done after this is
merged, but we need to get this burning fire out of the way.
This is necessary for llvm-mingw, this requires all previous PRs to be
merged first.
After this is merged, most of the llvm-mingw work is complete, just some
minor cleanups.
To be merged first:
- #2602
- #2604
- #2605
- #2607
- #2610
- #2615
- #2619
- #2621
- #2622
- #2624
- #2625
- #2626
- #2627
- #2628
- #2629
We can reuse the same helper we have for handling VMASKMOVPD and VMASKMOVPS,
though we need to move some handling around to account for the fact that
VPMASKMOVD and VPMASKMOVQ 'hijack' the REX.W bit to signify the element
size of the operation.
This was only used for the unit test fuzzing framework, which has been
removed and was unused for pretty much its entire lifespan.
These can now be internal only.
Adds in the handling of destination type size differences with AVX.
Also fixes cases where the SSE operations would load 128-bit vectors
from memory, rather than only loading 64-bit vectors with VCVTPS2PD.
In order to implement the SSE4.2 string instructions in a reasonable
manner, we can make use of a fallback implementation for the time
being.
This implementation just returns the intermediate result and leaves it
up to the function making use of it to derive the final result from said
intermediate result. This is fine, considering we have the immediate
control byte that tells us exactly what is desired as far as output
formats go.
Given that the result of this IR op will never take up more than
16-bits, we store the flags we need to set in the upper 16 bits of the
result to avoid needing to implement multiple return values in the JIT.
Also, since the IR op just returns the intermediate result, this can be
used to implement all of the explicit string instructions with a single IR op.
The implementation is pretty heavily documented to help make heads or
tails of these monster instructions.
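The packing of flags into the upper 16 bits can be sketched like this.
The function names and the exact flag bit layout are hypothetical; only
the 16/16 split comes from the description above.

```cpp
#include <cassert>
#include <cstdint>

// Combine the 16-bit intermediate result with the flags to set, so a
// single 32-bit return value carries both.
inline uint32_t PackStringResult(uint16_t Intermediate, uint16_t Flags) {
  return (static_cast<uint32_t>(Flags) << 16) | Intermediate;
}

// The consumer splits the value again to derive the final result.
inline uint16_t UnpackIntermediate(uint32_t Packed) {
  return static_cast<uint16_t>(Packed & 0xFFFF);
}

inline uint16_t UnpackFlags(uint32_t Packed) {
  return static_cast<uint16_t>(Packed >> 16);
}
```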