Moves the FEX version string to the hypervisor CPUID leaves.
Before:
```bash
$ FEXBash 'cat /proc/cpuinfo | grep "model name"'
model name : FEX-2404-101-gf9effcb Cortex-A78C
model name : FEX-2404-101-gf9effcb Cortex-A78C
model name : FEX-2404-101-gf9effcb Cortex-A78C
model name : FEX-2404-101-gf9effcb Cortex-A78C
model name : FEX-2404-101-gf9effcb Cortex-X1C
model name : FEX-2404-101-gf9effcb Cortex-X1C
model name : FEX-2404-101-gf9effcb Cortex-X1C
model name : FEX-2404-101-gf9effcb Cortex-X1C
```
After:
```bash
$ FEXBash 'cat /proc/cpuinfo | grep "model name"'
model name : Cortex-A78C
model name : Cortex-A78C
model name : Cortex-A78C
model name : Cortex-A78C
model name : Cortex-X1C
model name : Cortex-X1C
model name : Cortex-X1C
model name : Cortex-X1C
```
Now the FEX version string lives in the hypervisor leaves, so if some
utility wants the FEX version it can query that directly.
For example:
```bash
$ ./Bin/FEXInterpreter get_cpuid_fex
Maximum 4000_0001h sub-leaf: 2
We are running under FEX on host: 2
FEX version string is: 'FEX-2404-113-g820494d'
```
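A guest-side utility can reach the same information with a plain `cpuid`. A minimal C++ sketch (guest x86 code compiled and run under FEX; leaf 4000_0000h returning the maximum hypervisor leaf in EAX and a vendor string in EBX/ECX/EDX is the standard hypervisor-range convention, while the FEX-specific data sits in the 4000_0001h sub-leaves shown above):
```cpp
// Minimal sketch: query the hypervisor CPUID range from guest x86 code.
// Leaf 4000_0000h is the conventional hypervisor ID leaf: EAX holds the
// maximum hypervisor leaf, EBX/ECX/EDX a 12-byte vendor string.
#include <cpuid.h>
#include <cstdio>
#include <cstring>

int main() {
  unsigned int eax, ebx, ecx, edx;
  __cpuid(0x40000000, eax, ebx, ecx, edx);

  char vendor[13] = {};
  memcpy(vendor + 0, &ebx, sizeof(ebx));
  memcpy(vendor + 4, &ecx, sizeof(ecx));
  memcpy(vendor + 8, &edx, sizeof(edx));
  printf("Maximum hypervisor leaf: %08xh, vendor: '%s'\n", eax, vendor);
  return 0;
}
```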
We were previously generating nonsense code if the destination != source:
```
faddp v2.4s, v4.4s, v4.4s
faddp s2, v4.2s
```
The result of the first faddp is ignored, so the second merely calculates the
sum of the first 2 sources (not all 4 as needed).
The correct fix is to feed the first add into the second, regardless of the
final destination:
```
faddp v2.4s, v4.4s, v4.4s
faddp s2, v2.2s
```
Hit in an ASM test with new RA.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
A feature of FEX's JIT is that when an unaligned atomic load/store
operation occurs, the instructions will be backpatched into a barrier
plus a non-atomic memory instruction. This is the half-barrier technique
that still ensures correct visibility of loadstores in an unaligned
context.
The problem with this approach is that the dmb instructions are HEAVY,
because they effectively stop the world until all memory operations in
flight are visible. But it is a necessary evil since unaligned atomics
aren't a thing on ARM processors. FEAT_LSE2 only gives you unaligned
atomics inside of a 16-byte granularity, which doesn't match the x86
behaviour of cacheline granularity (effectively always 64B).
This adds a new TSO option to disable the half-barrier on unaligned
atomics and instead only convert them to regular loadstore instructions,
omitting the half-barrier. This gives more insight into how good a
CPU's LRCPC implementation is by not stalling on DMB instructions when
possible.
Originally implemented as a test to see if this makes Sonic Adventure 2
run full speed with TSO enabled (but all available TSO options disabled)
on NVIDIA Orin. Unfortunately this basically means the code no longer
stalls on dmb instructions and instead just shows how bad the LRCPC
implementation is, since the stalls show up on `ldapur` instructions
instead.
Tested Sonic Adventure 2 on X13s and it ran at 60FPS there without the
hack anyway.
Instead of only enabling enhanced rep movs if software TSO is disabled,
enable it if software TSO is disabled OR memcpysettso is disabled. This
is because we now hit the fast path when memcpysettso alone is disabled
while global TSO is still enabled.
Retested Hades and performance was fine in this configuration.
Found out that Far Cry uses this instruction and it is viable to use in
CPL-3. This only returns constant data but its behaviour is a little
quirky.
This instruction has a weird behaviour where the 32-bit operation does an
insert into the 64-bit destination, which might be an Intel versus AMD
behaviour difference. I don't have an Intel machine available to test
that theory though. This assumption would match similar behaviour
where segment registers are inserted instead of zero-extended.
Gets the game farther but then it crashes in a `___ascii_strnicmp`
function where the arguments end up being `___ascii_strnicmp(nullptr, "Color", 5);`.
Functional revert of 92f31648b ("RCLSE: optimize out pointless stores"), which
reportedly regressed some titles due to RA doom. We'll revisit later, leaving in
the code for when RA is ready to light this up.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
There are quite a few places where the segment offset appending is open-coded
throughout the opcode dispatcher, but we can pull these out into a few
helpers to make the sites a little more compact and declarative.
Now that we have successfully eliminated crossblock liveness from the IR we
generate, validate as much to ensure it doesn't come back. We will take
advantage of this new invariant in RA in the future.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
This ensures we put the StoreNZCV in the right block, which will fix validation
failures later in the series.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Rather than checking the actual EC bitmap in the dispatcher (~6 instrs), this
indirection through the code cache allows just 1 instr on the hot path
of repeated calls into EC code/x64 code.
The frontend will provide the return logic via ExitFunctionEC, which
will be jumped to whenever there is an indirect branch/return to an addr
such that RtlIsEcCode(addr) returns true.
Executable mapped memory is treated as x86 code by default when
running under EC; VirtualAlloc2 needs to be used together with a
special flag to map JIT'd arm64 code.
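A hedged sketch of what that allocation looks like, following the documented VirtualAlloc2 extended-parameter pattern rather than FEX's actual code:
```cpp
// Sketch: allocate memory the kernel will treat as arm64 EC code rather
// than x86 code. Assumes a recent Windows SDK on ARM64 and a library that
// exports VirtualAlloc2 (e.g. onecore.lib).
#include <windows.h>

void* AllocateECCode(size_t Size) {
  MEM_EXTENDED_PARAMETER Param{};
  Param.Type = MemExtendedParameterAttributeFlags;
  Param.ULong64 = MEM_EXTENDED_PARAMETER_EC_CODE;

  return VirtualAlloc2(GetCurrentProcess(), nullptr, Size,
                       MEM_RESERVE | MEM_COMMIT,
                       PAGE_EXECUTE_READWRITE,
                       &Param, 1);
}
```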
Generates flags for a variable shift as a dedicated IR op. This lets us optimize
around it (without generating control flow, relying on deferred flag infra,
etc.), and it neatly solves our RA problem for shifts.
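The wrinkle is that x86 leaves all flags untouched when the masked shift count is zero, which is what would otherwise force control flow or deferred flags. A rough C++ model of the 32-bit SHL case (illustrative only; not the IR op itself, and OF/AF/PF are omitted):
```cpp
#include <cstdint>

struct Flags { bool CF, ZF, SF; };

// Rough model of 32-bit SHL flag behaviour: if the masked count is zero the
// flags are left untouched, otherwise CF is the last bit shifted out and
// ZF/SF come from the result.
uint32_t ShlWithFlags(uint32_t Value, uint32_t Count, Flags& F) {
  Count &= 31;                 // x86 masks the count for 32-bit operands
  if (Count == 0) {
    return Value;              // flags preserved - the awkward case
  }
  F.CF = (Value >> (32 - Count)) & 1;
  uint32_t Result = Value << Count;
  F.ZF = Result == 0;
  F.SF = Result >> 31;
  return Result;
}
```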
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
This is something the new shift flag code will do. Backporting the opt since
that's stalled and this reduces the diff.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
This environment variable had an incorrect priority in the configuration
system. The expectation was that it would have higher priority than most
other layers. Now the only layer with higher priority is the environment
variables.
1. pull flag calculation out of the loop body for perf
2. fully rotate the inner loop to save an instruction per iteration
3. hoist the rcx=0 jump to avoid computing df when rcx=0
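For illustration, (2) and (3) amount to the following shape (a C++ sketch of the rewritten control flow, not the JIT's actual output; the names are made up):
```cpp
#include <cstdint>

// Stand-in for the per-element work of the string op.
static void Body(int64_t /*Stride*/) {}

// Illustrative shape of the rewritten loop.
static void RepOp(uint64_t Rcx, bool DF) {
  if (Rcx == 0) return;                // (3) hoisted rcx==0 exit: DF never computed for empty copies
  const int64_t Stride = DF ? -1 : 1;  // (1) flag/direction work pulled out of the body
  do {
    Body(Stride);
  } while (--Rcx != 0);                // (2) rotated loop: the decrement feeds the backwards branch
}
```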
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Single unified implementation for ROL & ROR (instead of 4 cases). No more
deferred flags, because it's easy to shoot ourselves in the foot with deferred
flags w.r.t. the new RA design, and rotates are rare enough, with very efficient
flag calculations, that the extra JIT overhead should be minimal to DCE the
resulting calculations later.
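The flag calculations are cheap because CF and OF fall straight out of the rotated result. A rough 32-bit C++ model (illustrative only, not the JIT code; OF is architecturally defined only for a count of 1 but is computed unconditionally here for brevity):
```cpp
#include <cstdint>

struct Flags { bool CF, OF; };

// With a non-zero masked count, CF is the bit that was rotated around
// (LSB of the result for ROL, MSB for ROR) and OF is derived from the
// result and CF.
uint32_t Rol32(uint32_t Value, uint32_t Count, Flags& F) {
  Count &= 31;
  if (Count == 0) return Value;  // flags untouched
  uint32_t Result = (Value << Count) | (Value >> (32 - Count));
  F.CF = Result & 1;
  F.OF = ((Result >> 31) & 1) ^ F.CF;
  return Result;
}

uint32_t Ror32(uint32_t Value, uint32_t Count, Flags& F) {
  Count &= 31;
  if (Count == 0) return Value;  // flags untouched
  uint32_t Result = (Value >> Count) | (Value << (32 - Count));
  F.CF = (Result >> 31) & 1;
  F.OF = ((Result >> 31) ^ (Result >> 30)) & 1;
  return Result;
}
```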
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Just like #3508, clang-18 complains about VLA usage.
This vector is relatively small, only around 18 elements, but is
semi-dynamic depending on the arch and whether FEXCore is targeting Linux or
Win32.
It has been a long time coming that FEX no longer needs to leak IR
implementation details to the frontend; this was legacy behaviour due to IR CI
and various other problems.
Now that the last bits of IR leaking have been removed, move everything
that we can internally to the implementation.
We still have a couple of minor details in the IR.h exposed to the
frontend, but these are limited to a few enums and some thunking struct
information rather than all the implementation details.
No functional change with this, just moving headers around.
FEXCore's include directory was including an FHU header, which would result in
compilation failure for external projects trying to link to libFEXCore.
Moves it over to fix this; it was the only FHU usage in FEXCore/include.
NFC
This no longer needs to be part of the public API. Moves the
header internally.
Needed to pass through `IsAddressInCodeBuffer` from CPUBackend through
the Context object, but otherwise no functional change.
In the old case:
* if we take the branch, 1 instruction
* if we don't take the branch, 3 instructions
* branch predictor fun
* 3 instructions of icache pressure
In the new case:
* unconditionally 2 instructions
* no branch predictor dependence
* 2 instructions of icache pressure
This should not be non-negligibly worse, and it simplifies things for RA.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Exhaustively checked against the Intel pseudocode since this is tricky:
```python
def intel(AL, CF, AF):
    old_AL = AL
    old_CF = CF
    CF = False
    if (AL & 0x0F) > 9 or AF:
        Borrow = AL < 6
        AL = (AL - 6) & 0xff
        CF = old_CF or Borrow
        AF = True
    else:
        AF = False
    if (old_AL > 0x99) or old_CF:
        AL = (AL - 0x60) & 0xff
        CF = True
    return (AL & 0xff, CF, AF)

def fex(AL, CF, AF):
    AF = AF | ((AL & 0xf) > 9)
    CF = CF | (AL > 0x99)
    NewCF = CF | (AF if (AL < 6) else CF)
    AL = (AL - 6) if AF else AL
    AL = (AL - 0x60) if CF else AL
    return (AL & 0xff, NewCF, AF)

for AL in range(256):
    for CF in [False, True]:
        for AF in [False, True]:
            ref = intel(AL, CF, AF)
            test = fex(AL, CF, AF)
            print(AL, "CF" if CF else "", "AF" if AF else "", ref, test)
            assert(ref == test)
```
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Based on https://www.righto.com/2023/01/
The new implementation is branchless, which is theoretically easier to RA. It's also
massively simpler, which is good for a demon opcode.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Since we do an immediate overwrite of the file we are copying, we can
instead do a rename. Failure on rename is fine; it will either mean the
telemetry file didn't exist initially, or some other permission error occurred,
so the telemetry will get lost regardless.
This may be useful for tracking TSO faulting when it manages to fetch
stale data. While most TSO crashes are due to nullptr dereferences, this
can still check for the corruption case.
In 64-bit mode, the LOOP instruction's RCX register usage is 64-bit or
32-bit.
In 32-bit mode, the LOOP instruction's RCX register usage is 32-bit or
16-bit.
FEX wasn't handling the 16-bit case at all, which caused the LOOP
instruction to effectively always operate at 32-bit size. Now this is
correctly supported, and it also stops treating the operation as 64-bit.
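For illustration, the counter handling reduces to masking and writing back at the width chosen by the address-size attribute; a rough C++ model (not FEX code):
```cpp
#include <cstdint>

// Rough model of the LOOP counter update: the address-size attribute picks
// the register width, and the decrement/compare happens at that width only.
// Widths: 8 = RCX, 4 = ECX, 2 = CX.
bool LoopTaken(uint64_t& Rcx, unsigned CounterWidthBytes) {
  const uint64_t Mask = CounterWidthBytes == 8 ? ~0ULL
                      : CounterWidthBytes == 4 ? 0xFFFFFFFFULL
                                               : 0xFFFFULL;
  const uint64_t Count = (Rcx - 1) & Mask;
  // A 32-bit counter write zero-extends into RCX; a 16-bit write leaves the
  // upper bits intact, matching normal x86 partial-register behaviour.
  Rcx = CounterWidthBytes == 2 ? (Rcx & ~Mask) | Count : Count;
  return Count != 0;  // the branch is taken while the counter is non-zero
}
```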
It was a funny joke that this was here, but it is fundamentally
incompatible with what we're doing. All those users are running proot
anyway because of how broken running directly under termux is.
Just remove this from here.
Take e.g. a forward rep movsb copy from addr 0 to 1; since this is a bytewise
copy, the expected behaviour is:
before: aaabbbb...
after:  aaaaaaa...
but by copying in 32-byte chunks we end up with:
after:  aaaabbbb...
due to the self-overwrites not occurring within a single 32-byte copy.
When TSO is disabled, vector LDP/STP can be used for a two-instruction
32-byte memory copy, which is significantly faster than the current
byte-by-byte copy. Performing two such copies directly after
one another also marginally increases copy speed for all sizes >= 64.
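The overlap hazard above is easy to reproduce with a plain C++ model of the two copy strategies (illustrative only, not FEX's emitted code):
```cpp
#include <cstdio>
#include <cstring>

int main() {
  // before: "aaabbbb..."; copy 32 bytes forward from offset 0 to offset 1.
  char bytewise[64], chunked[64];
  memset(bytewise, 'b', sizeof(bytewise));
  memcpy(bytewise, "aaa", 3);
  memcpy(chunked, bytewise, sizeof(chunked));

  // Byte-wise, matching rep movsb semantics: each write is visible to the
  // next read, so the leading 'a' smears forward.
  for (size_t i = 0; i < 32; ++i)
    bytewise[1 + i] = bytewise[i];

  // 32-byte chunk: the whole block is read before any of it is written,
  // so the self-overwrite never happens.
  char tmp[32];
  memcpy(tmp, chunked, 32);
  memcpy(chunked + 1, tmp, 32);

  printf("byte-wise: %.8s...\n", bytewise);  // aaaaaaaa...
  printf("chunked:   %.8s...\n", chunked);   // aaaabbbb...
  return 0;
}
```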
I was looking at some other JIT overheads and this cropped up.
Instead of materializing a constant using mov+movk+movk+movk,
load it from the named vector constant array.
In a micro-benchmark this improved performance by 34%.
In bytemark this improved a subbench by 0.82%.
Missed this instruction when implementing rdtscp. Returns the same ID
result in a register just like rdtscp, but without the cycle counter
results. Doesn't touch any flags just like rdtscp.
x86 has a few prefetch instructions.
- prefetch - One of two classic 3DNow! instructions
  - Prefetch into the L1 data cache
- prefetchw - One of two classic 3DNow! instructions
  - Implies a prefetch into the L1 data cache
  - Prefetch the cacheline with intent to write and exclusive ownership
- prefetchnta
  - Prefetch non-temporal data with respect to /all/ cache levels
  - Assumes inclusive caches?
- prefetch{t0,t1,t2}
  - Prefetch data with respect to each cache level
    - T0 = L1 and higher
    - T1 = L2 and higher
    - T2 = L3 and higher

**Some silly duplicates**
- prefetchwt1
  - Duplicate of prefetchw but explicitly the L1 data cache
- prefetch_exclusive
  - Duplicate of prefetch
God Of War 2018 uses prefetchw as a hint for exclusive ownership of the
cacheline in some very aggressive spin-loops. Let's implement the
operations to help it along.
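For reference, these hints are reachable from C/C++ through the compiler builtin; a hedged sketch of the rough mapping (the exact instruction the compiler picks depends on target flags and available extensions):
```cpp
// __builtin_prefetch(addr, rw, locality): rw 0 = read, 1 = write;
// locality 0 = non-temporal ... 3 = keep in all cache levels. On x86 these
// map roughly onto the instructions listed above.
void WarmSpinLoopLine(void* Line) {
  __builtin_prefetch(Line, 1, 3);  // write intent, high locality ~ prefetchw
}

void WarmReadOnly(const void* P) {
  __builtin_prefetch(P, 0, 3);     // read, keep in caches ~ prefetcht0
}

void StreamOnce(const void* P) {
  __builtin_prefetch(P, 0, 0);     // read, non-temporal ~ prefetchnta
}
```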
This function can be unit-tested more easily, and the stack special case is more
cleanly handled as a post-collection step.
There is a minor functional change: The stack special case didn't trigger
previously if the range end was within the stack mapping. This is now fixed.
Can help a lot of x86 code because x86 is 2-address and a64 is 3-address, so x86
ends up with piles of movs that end up dead after translation.
It's not a win across the board because our RA isn't aware of tied registers so
sometimes we regress moves. But it's a win on average, and the RA bits can be
improved with time.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Arguments and the conditional don't get optimized out in release builds
for the inline function call versus the define.
Was showing up as an annoying amount of time when testing.
Folds a reg+const memory address into the addressing mode
if the constant is within 16KB.
Update instcountci files.
Add test 32Bit_ASM/FEX_bugs/SubAddrBug.asm
Fixes an issue where TestHarnessRunner was managing to reserve the space
below the stack again, resulting in stack growth breaking. Would typically
only show up when using the vixl simulator under gdb for some reason.
This is likely the last bandage on this code before it gets completely
rewritten to be more readable.
ldswpal doesn't overwrite the source register and only reads the bits
required for the sized operation.
Not sure exactly why we were doing a copy here.
Removing it improves Skyrim's hottest code block, as seen in #3472.