Six of the EFLAGS can't be read directly as a bitmask because they are
either stored in a different location or have multiple bits of state
backing them.
SF, ZF, CF, and OF are stored in ARM's NZCV format at offset 24.
The PF calculation is deferred, with its raw data stored at the regular offset.
AF is also deferred, relative to PF, but stored at the regular offset.
These flags /need/ to be reconstructed with the `ReconstructCompactedEFLAGS`
function when reading EFLAGS, and they /need/ to be set with
`SetFlagsFromCompactedEFLAGS` when writing it.
If either of these functions is bypassed when managing EFLAGS then the
internal representation will get mangled and the state will be
corrupted.
Having a little `_RAW` suffix on these to signify that they aren't just
regular single-bit representations like the other flags in EFLAGS should
make us pause over this issue before writing more broken code that
accesses them directly.
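To make the compacted layout concrete, here is a minimal reconstruction
sketch; the struct layout, field names, and the AF handling are simplified
assumptions for illustration, not the actual `ReconstructCompactedEFLAGS`
implementation:

```cpp
#include <bit>
#include <cstdint>

// Hypothetical compacted flag storage, for illustration only.
struct CompactedFlags {
  uint32_t NZCV;      // SF/ZF/CF/OF packed in ARM NZCV layout (bits 31..28).
  uint8_t PF_RAW;     // Deferred PF: raw result byte, parity computed on read.
  uint8_t AF_RAW;     // Deferred AF: simplified to a plain bit in this sketch.
  uint8_t Flags[32];  // Remaining EFLAGS stored one byte per flag.
};

// Sketch of reconstructing an architectural EFLAGS value for reads.
uint32_t ReconstructEFLAGSSketch(const CompactedFlags& F) {
  uint32_t EFLAGS = 0;

  // Map ARM NZCV (N=31, Z=30, C=29, V=28) back to x86 SF/ZF/CF/OF.
  EFLAGS |= ((F.NZCV >> 31) & 1) << 7;   // SF <- N
  EFLAGS |= ((F.NZCV >> 30) & 1) << 6;   // ZF <- Z
  EFLAGS |= ((F.NZCV >> 29) & 1) << 0;   // CF <- C
  EFLAGS |= ((F.NZCV >> 28) & 1) << 11;  // OF <- V

  // Deferred PF: x86 sets PF when the low result byte has an even number
  // of set bits.
  EFLAGS |= static_cast<uint32_t>(~std::popcount(static_cast<unsigned>(F.PF_RAW)) & 1) << 2;

  // AF is more involved in practice (it is stored relative to PF); treated
  // as a plain bit here to keep the sketch short.
  EFLAGS |= static_cast<uint32_t>(F.AF_RAW & 1) << 4;

  // The remaining flags are plain single-bit values and copy over directly;
  // only TF (bit 8), IF (bit 9), and DF (bit 10) are shown for brevity.
  for (int Bit = 8; Bit <= 10; ++Bit) {
    EFLAGS |= static_cast<uint32_t>(F.Flags[Bit] & 1) << Bit;
  }
  return EFLAGS;
}
```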
This allows us to use the reciprocal instructions, which match the precision
that x86 expects, rather than converting everything to float divides.
Currently no hardware supports this, and even the upcoming X4/A720/A520
won't support it, but it was trivial to implement so wire it up.
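For illustration, a hedged sketch of the two lowering strategies for
something like RCPPS using NEON intrinsics; the function name and the
capability flag are made up for the example:

```cpp
#include <arm_neon.h>

// Illustrative only: picks between the reciprocal-estimate path and the
// divide fallback depending on whether the estimate meets x86's precision.
float32x4_t RcppsLoweringSketch(float32x4_t Src, bool EstimateMatchesX86) {
  if (EstimateMatchesX86) {
    // A single FRECPE is enough when the hardware's estimate is at least as
    // precise as what x86 guarantees for RCPPS.
    return vrecpeq_f32(Src);
  }
  // Otherwise fall back to a full-precision divide of 1.0 by the source.
  return vdivq_f32(vdupq_n_f32(1.0f), Src);
}
```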
The motivation for only having a pointer array in CpuState was that
initialization was fairly cheap and that we have limited space inside
the encoding depending on what we want to do.
Initialization cost is still a concern, but doing a memcpy of 128 bytes
isn't that big of a deal.
Limited space in CpuState, while a concern, isn't a significant one:
- Needs to currently be less than 1 page in size.
- Needs to stay under the architectural limitation of load-store scaled
  offsets, which is 65KB for 128-bit vectors.
Still keeps the pointer array around for cases where we would need to
synthesize an address offset and it's just easier to load the
process-wide table.
The performance improvement here comes from removing the ldr+ldr
dependency chain. In microbenchmarks on Cortex-X1C, removing this
dependency chain has shown an improvement of ~4%.
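For illustration, a compile-time sanity check along these lines could
capture both constraints; the struct and member names are hypothetical
stand-ins for the real CpuState layout:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the real CpuState with the table embedded.
struct CpuStateSketch {
  uint64_t GPRs[16];
  uint8_t EmbeddedTable[128];  // Previously reached through a pointer load.
  // ... remaining state ...
};

// Must stay under a page so the whole state fits in one page-sized mapping.
static_assert(sizeof(CpuStateSketch) < 4096, "CpuState must stay below one page");

// LDR (unsigned scaled immediate) for 128-bit vectors encodes imm12 * 16,
// so any member accessed that way must sit below 4095 * 16 bytes.
static_assert(offsetof(CpuStateSketch, EmbeddedTable) <= 4095u * 16u,
              "Embedded table must be reachable with a scaled 128-bit offset");
```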
It is scarcely used today, and like the x86 JIT, it is a significant
maintenance burden complicating work on FEXCore and arm64 optimization. Remove
it, bringing us down to 2 backends.
1 down, 1 to go.
Some interpreter scaffolding remains for x87 fallbacks. That is not a problem
here.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
While this interface is usually pretty fast because it is a
write-and-forget operation, it has issues when multiple threads hit the
perf map file at the same time. In particular this interface becomes a
bottleneck due to a mutex taken on writes in the kernel.
This bottleneck occurs when a bunch of threads get spawned and they are
all jitting code as quickly as possible. Geekbench's clang benchmark
hits this hard: all eight CPU threads spend ~40% of their CPU time
stalled waiting for this mutex to unlock.
To work around this issue, buffer the writes a small amount: either up
to a page-ish of data or 100ms of time. This completely eliminates
threads waiting on the kernel mutex.
- Around a page of buffer space was chosen by profiling Geekbench's
clang benchmark and seeing how frequently it was still writing.
- 1024 bytes was still fairly aggressive, 4096 seemed fine.
- 100ms was chosen to ensure we don't wait /too/ long to write JIT
symbols.
- In most cases 100ms is enough that you won't notice the blip in
perf.
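Roughly, the buffering scheme looks like the sketch below, with thresholds
taken from the numbers above; the class and member names are invented for
the example rather than taken from the actual implementation:

```cpp
#include <chrono>
#include <cstddef>
#include <string>
#include <unistd.h>

// Illustrative buffered writer for a /tmp/perf-<pid>.map style file.
// Flushes when roughly a page of text has accumulated or 100ms has passed,
// so threads rarely contend on the kernel-side write path.
class BufferedPerfMapWriterSketch {
public:
  explicit BufferedPerfMapWriterSketch(int FD) : FD_{FD} {}

  void Append(const std::string& Line) {
    Buffer_ += Line;
    const auto Now = std::chrono::steady_clock::now();
    if (Buffer_.size() >= FlushBytes || Now - LastFlush_ >= FlushInterval) {
      Flush(Now);
    }
  }

private:
  void Flush(std::chrono::steady_clock::time_point Now) {
    if (!Buffer_.empty()) {
      // One write syscall per flush instead of one per JIT block.
      (void)::write(FD_, Buffer_.data(), Buffer_.size());
      Buffer_.clear();
    }
    LastFlush_ = Now;
  }

  static constexpr std::size_t FlushBytes = 4096;
  static constexpr auto FlushInterval = std::chrono::milliseconds{100};

  int FD_;
  std::string Buffer_;
  std::chrono::steady_clock::time_point LastFlush_{std::chrono::steady_clock::now()};
};
```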
One thing of note is that with profiling enabled, checking the time on
every JIT block still ends up costing 2-3% CPU time in vDSO
clock_gettime. We could improve this by using the cycle counter directly
since that is still guaranteed to be monotonic. Maybe we'll come back to
that if it turns out to actually be an issue here.
Changes the helper which all the source uses so that it still calculates
the size implicitly. Converting all the implicit uses over to the
explicit operation is going to take a while; this gets us started by at
least making the IR operation itself explicit.
When the source registers are sequential this turns into a load of the
vector constant (2 instructions) plus the single tbl instruction.
If the registers aren't sequential then this turns into 2 moves and
then the single tbl, which with zero-cycle rename isn't too bad.
Even as a worst-case option this is significantly better than the
previous implementation, which did a bunch of inserts and was always 9
instructions.
We should still strive to implement faster versions without the use of
TBL2 where possible, but this makes it less of a concern.
Skips implementing it for the x86 JIT because that's a bit of a
nightmare to think about.
The ARM64 implementation requires sequential registers, which means if
the incoming sources aren't sequential then we need to move them into
the two vector temporaries. This is fine since we have zero-cycle
vector renames and the alternative is slower.
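For reference, the generated TBL2 corresponds to the two-table NEON lookup
sketched below; this is just an intrinsics-level illustration of the
semantics, not the JIT's emitter code:

```cpp
#include <arm_neon.h>

// Two-register table lookup: each index byte selects from the 32 bytes
// formed by the pair {Lo, Hi}; out-of-range indices produce zero.
// The tbl instruction requires Lo and Hi to live in sequential registers,
// which is why the JIT may need two moves first.
uint8x16_t Tbl2Sketch(uint8x16_t Lo, uint8x16_t Hi, uint8x16_t Indices) {
  uint8x16x2_t Table = {{Lo, Hi}};
  return vqtbl2q_u8(Table, Indices);
}
```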
This wasn't implemented initially for the interpreter and x86 JIT, which
meant we were maintaining two codepaths. Implement these operations in
the interpreter and x86 JIT so we no longer need to do that.
The emitted code in the x86 JIT is hot garbage, but it's only needed for
correctness testing there, not performance testing.
It turns out that pure SSA isn't a great choice for the sort of emulation we do.
On one hand, it discards information from the guest binary's register allocation
that would let us skip stuff. On the other hand, it doesn't have nearly as many
benefits in this setting as in a traditional compiler... We really *don't* want
to do global RA or really any global optimization. We assume the guest optimizer
did its job for x86; we just need to clean up the mess left from going x86 ->
arm. So we just need enough SSA to peephole optimize.
My concrete IR proposals are that:
* SSA values must be killed in the same block that they are defined.
* Explicit LoadGPR/StoreGPR instructions can be used for global persistence.
* LoadGPR/StoreGPR are eliminated in favour of SSA within a block.
This has a lot of nice properties for our setting:
* Except for some internal REP instruction emulation (etc), we already have
  registers for everything that escapes block boundaries, so this form is very
  easy to go into -- straightforward local value numbering, not a full into-SSA
  pass.
* Spilling is entirely local (if it happens at all), since everything is in
registers at block boundaries. This is excellent, because Belady's algorithm
lets us spill nearly optimally in linear-time for individual blocks. (And
the global version of Belady's algorithm is massively more complicated...)
A nice fit for a JIT.
Relatedly, it turns out allowing spilling is probably a decent decision,
since the same spiller code can be used to rematerialize constants in a
straightforward way. This is an issue with the current RA.
* Register assignment is entirely local. For the same reason, we can assign
  registers "optimally" in linear time & memory (e.g. with linear scan; see the
  sketch after this list). And the impl is massively simpler than a full blown
  SSA-based tree scan RA. For example, we don't have to worry about parallel
  copies or coalescing phis or anything. Massively nicer algorithm to deal with.
* SSA value names can be block local which makes the validation implicit :~)
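To make the "linear time & memory" point concrete, here is a minimal
block-local linear scan sketch; the interval representation and register
pool are invented for the example and this is not FEX's actual RA:

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// A value's live range within a single block: defined at Start, dead after End.
struct Interval {
  uint32_t Start, End;
  std::optional<uint8_t> Reg;  // Assigned register, if any.
};

// Block-local linear scan over intervals sorted by Start.
// With every value killed inside its defining block, one pass suffices.
void LinearScanSketch(std::vector<Interval>& Intervals, uint8_t NumRegs) {
  std::vector<Interval*> Active;  // Currently live, holding a register.
  std::vector<uint8_t> FreeRegs;
  for (uint8_t R = 0; R < NumRegs; ++R) FreeRegs.push_back(R);

  for (auto& I : Intervals) {
    // Expire intervals that died before this one starts, freeing their registers.
    for (auto It = Active.begin(); It != Active.end();) {
      if ((*It)->End < I.Start) {
        FreeRegs.push_back(*(*It)->Reg);
        It = Active.erase(It);
      } else {
        ++It;
      }
    }

    if (FreeRegs.empty()) {
      // No register free: a real implementation would spill here, which stays
      // block-local as well (this is where Belady-style choices come in).
      continue;
    }

    I.Reg = FreeRegs.back();
    FreeRegs.pop_back();
    Active.push_back(&I);
  }
}
```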
It also has remarkably few drawbacks, because we didn't want to do CFG global
optimization anyway given our time budget and the diminishing returns. The few
global optimizations we might want (flag escape analysis?) don't necessarily
benefit from pure SSA anyway.
Anyway, we explicitly don't want phi nodes in any of this. They're currently
unused. Let's just remove them so nobody gets the bright idea of changing that.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
The FXSAVE and FSAVE tag words are written out in different formats,
with FXSAVE using an abridged version that lacks the zero/special/valid
distinction. Switch to using this abridged version internally for
simplicity, and to allow the zero/special/valid calculation to be
deferred until a full tag word is actually needed (in the future;
currently the distinction is ignored and only valid/empty states are
possible).
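For illustration, expanding the abridged tag word into the full FSAVE form
while ignoring the zero/special distinction looks roughly like this (names
invented for the example):

```cpp
#include <cstdint>

// Abridged (FXSAVE) tag word: one bit per x87 register, 1 = valid, 0 = empty.
// Full (FSAVE) tag word: two bits per register:
//   00 = valid, 01 = zero, 10 = special, 11 = empty.
// This sketch ignores the zero/special distinction, matching the current
// behaviour where only valid/empty are tracked.
uint16_t ExpandAbridgedTagWordSketch(uint8_t Abridged) {
  uint16_t Full = 0;
  for (int Reg = 0; Reg < 8; ++Reg) {
    const bool Valid = (Abridged >> Reg) & 1;
    const uint16_t TwoBit = Valid ? 0b00 : 0b11;
    Full |= TwoBit << (Reg * 2);
  }
  return Full;
}
```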
Currently FEX's internal EFLAGS representation is a perfect 1:1 mapping
between bit offset and byte offset. This is going to change with #3038.
There is no reason the frontend needs to understand how to reconstruct
the compacted flags from the internal representation.
Adds context helpers and moves all the logic to FEXCore. The locations
that previously needed to handle this have been converted over to use
the helpers.
We can load the swizzle table from our constant pool now. This removes
the only usage of VTMP3 from our Arm64 JIT.
I would say this is now optimal for the version without RCON set.
With RCON we could technically make some of the constant moves more
optimal.
Saw a few locations in here where we operate at 64-bit unconditionally
around pointer calculations. Will be coming back to those for 32-bit
mode.
This is the last of the implicitly sized ALU operations! After this I'll
be going through the IR ops more individually to try and remove any
stragglers.
Then we should be able to start cleaning up and actually optimizing GPR
operations.