1. pull flag calculation out of the loop body for perf
2. fully rotate the inner loop to save an instruction per iteration
3. hoist the rcx=0 jump to avoid computing df when rcx=0
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Just like #3508, clang-18 complains about VLA usage.
This vector is relatively small, only around 18 elements, but its size is
semi-dynamic depending on arch and whether FEXCore is targeting Linux or
Win32.
In the old case:
* if we take the branch, 1 instruction
* if we don't take the branch, 3 instructions
* branch predictor fun
* 3 instructions of icache pressure
In the new case:
* unconditionally 2 instructions
* no branch predictor dependence
* 2 instructions of icache pressure
This should not be non-negligibly worse, and it simplifies things for RA.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
exhaustively checked against the Intel pseudocode since this is tricky:
def intel(AL, CF, AF):
    old_AL = AL
    old_CF = CF
    CF = False
    if (AL & 0x0F) > 9 or AF:
        Borrow = AL < 6
        AL = (AL - 6) & 0xff
        CF = old_CF or Borrow
        AF = True
    else:
        AF = False
    if (old_AL > 0x99) or old_CF:
        AL = (AL - 0x60) & 0xff
        CF = True
    return (AL & 0xff, CF, AF)

def fex(AL, CF, AF):
    AF = AF | ((AL & 0xf) > 9)
    CF = CF | (AL > 0x99)
    NewCF = CF | (AF if (AL < 6) else CF)
    AL = (AL - 6) if AF else AL
    AL = (AL - 0x60) if CF else AL
    return (AL & 0xff, NewCF, AF)

for AL in range(256):
    for CF in [False, True]:
        for AF in [False, True]:
            ref = intel(AL, CF, AF)
            test = fex(AL, CF, AF)
            print(AL, "CF" if CF else "", "AF" if AF else "", ref, test)
            assert(ref == test)
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Based on https://www.righto.com/2023/01/
The new implementation is branchless, which is theoretically easier to RA. It's
also massively simpler, which is good for a demon opcode.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Since we do an immediate overwrite of the file we are copying, we can
instead do a rename. Failure on rename is fine: it either means the
telemetry file didn't exist initially, or there was some other permission
error, in which case the telemetry would be lost regardless.
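As a rough illustration of the rename-instead-of-copy idea (not the actual FEX code, which is C++), a minimal Python sketch follows. The function name and both paths are hypothetical; `os.replace` stands in for the rename call.

```python
import os

def rotate_telemetry(current_path, previous_path):
    """Move the existing telemetry file aside before a new one is written.

    Hypothetical helper: returns True if the old file was kept, False if the
    rename failed. Failure is deliberately not treated as an error.
    """
    try:
        # A rename is a single metadata operation, unlike copy-then-overwrite.
        os.replace(current_path, previous_path)
        return True
    except OSError:
        # Either the file did not exist yet, or we lack permission; in both
        # cases there is nothing more to preserve.
        return False
```

The caller would then create the new telemetry file at `current_path` as before.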
This may be useful for tracking TSO faulting when it manages to fetch
stale data. While most TSO crashes are due to nullptr dereferences, this
can still check for the corruption case.
In 64-bit mode, the LOOP instruction's RCX register usage is 64-bit or
32-bit.
In 32-bit mode, the LOOP instruction's RCX register usage is 32-bit or
16-bit.
FEX wasn't handling the 16-bit case at all which was causing the LOOP
instruction to effectively always operate at 32-bit size. Now this is
correctly supported, and it also stops treating the operation as 64-bit.
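The effect of the counter width can be sketched with a small Python model. It assumes the narrower width in each mode is selected by the address-size override prefix, and it only models the counter value and the branch decision, not how the upper register bits are written back. The function names are illustrative, not FEX's.

```python
def loop_counter_bits(is_64bit_mode, addr_size_override):
    """Width of the LOOP counter register (assumed selection rule)."""
    if is_64bit_mode:
        return 32 if addr_size_override else 64
    return 16 if addr_size_override else 32

def loop_step(counter, bits):
    """Decrement the counter at the given width.

    Returns (new_counter, branch_taken); the branch is taken while the
    truncated counter is non-zero after the decrement.
    """
    mask = (1 << bits) - 1
    new_counter = ((counter & mask) - 1) & mask
    return new_counter, new_counter != 0

# With ECX = 0x10001, a 16-bit LOOP sees CX = 1 and falls through,
# while a 32-bit LOOP would keep looping.
print(loop_step(0x10001, 16))
print(loop_step(0x10001, 32))
```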
Having this here was a funny joke, but it is fundamentally
incompatible with what we're doing. All those users are running proot
anyway because of how broken running under termux directly is.
Just remove it.
Take e.g. a forward rep movsb copy from addr 0 to 1. Since this is a
bytewise copy, the expected behaviour is:
before: aaabbbb...
after: aaaaaaa...
but by copying in 32-byte chunks we end up with:
after: aaaabbbb...
due to the self overwrites not occurring within a single 32-byte copy.
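The difference can be reproduced with a small Python model of the two strategies. This is only a sketch: `copy_chunked` loads a whole chunk before storing it, which is how a wide load/store pair is assumed to behave, and both function names are illustrative.

```python
def copy_bytewise(mem, src, dst, count):
    """Forward byte-by-byte copy, as rep movsb is defined."""
    for i in range(count):
        mem[dst + i] = mem[src + i]

def copy_chunked(mem, src, dst, count, chunk=32):
    """Forward copy that loads up to `chunk` bytes, then stores them."""
    i = 0
    while i < count:
        n = min(chunk, count - i)
        block = bytes(mem[src + i:src + i + n])  # load happens before store
        mem[dst + i:dst + i + n] = block
        i += n

a = bytearray(b"aaabbbb")
copy_bytewise(a, 0, 1, 6)
print(a.decode())  # aaaaaaa

b = bytearray(b"aaabbbb")
copy_chunked(b, 0, 1, 6)
print(b.decode())  # aaaabbb
```

In the bytewise case each store is visible to the next load, so the first byte propagates through the buffer. In the chunked case the data is simply shifted by one.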
When TSO is disabled, vector LDP/STP can be used for a two
instruction 32-byte memory copy, which is significantly faster than the
current byte-by-byte copy. Performing two such copies directly after
one another also marginally increases copy speed for all sizes >=64.
I was looking at some other JIT overheads and this cropped up as a
source of overhead. Instead of materializing a constant using
mov+movk+movk+movk, load it from the named vector constant array.
In a micro-benchmark this improved performance by 34%.
In bytemark this improved one subbench by 0.82%.
Missed this instruction when implementing rdtscp. It returns the same ID
result in a register, just like rdtscp, but without the cycle counter
results. Like rdtscp, it doesn't touch any flags.