There is no reason to have a separate pass for this; merging should be a bit
faster since it eliminates an IR walk.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Flag DCE needs to do general DCE anyway to converge in one pass. So we can move
the special syscall/atomic logic over to flag DCE and then drop the second DCE
pass altogether. Now local dead code of both kinds is eliminated in a single
pass. Flag DCE is carefully written to converge in a single iteration, which
makes this scheme work.
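The single-iteration convergence comes from walking the block backwards: removing a dead instruction can only newly kill its *operands*, which appear earlier, so every cascade is caught in one reverse sweep. A minimal sketch of the idea (illustrative Python with a made-up instruction tuple format; the real pass is FEXCore C++):

```python
def dce_single_pass(block):
    """Eliminate dead code in one backward pass over a basic block.

    Each instruction is (dest, sources, has_side_effects). Walking in
    reverse means an instruction's users are always processed before the
    instruction itself, so a single pass converges."""
    used = set()
    kept = []
    for dest, sources, has_side_effects in reversed(block):
        if has_side_effects or dest in used:
            kept.append((dest, sources, has_side_effects))
            used.update(sources)
    kept.reverse()
    return kept

block = [
    ("a", [], False),     # dead: only feeds "b"
    ("b", ["a"], False),  # dead: never used
    ("c", [], False),     # live: feeds the store
    (None, ["c"], True),  # store, has side effects, always kept
]
# dce_single_pass(block) keeps only "c" and the store
```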
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
New RA does not need it for correctness, and the slight slowdown to new RA from
not compacting first is much smaller than the cost of compaction. Overall this speeds
up node.js start time by ~6% on top of new RA.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
All we actually need to validate is that each source has been previously defined
within the block. That checks everything we care about now.
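That check reduces to a single forward scan over the block. A minimal sketch (illustrative Python with a made-up instruction tuple format, not the actual FEXCore validator):

```python
def validate_block(block):
    """Check that every source was defined earlier in the same block.

    Each instruction is (dest, sources), where dest may be None for
    instructions that produce no value. Returns True if the block obeys
    the defined-before-used rule."""
    defined = set()
    for dest, sources in block:
        if any(src not in defined for src in sources):
            return False
        if dest is not None:
            defined.add(dest)
    return True
```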
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Aside from its own self-test, the parser is unused and should remain that way,
since it's a maintenance burden with no real benefit. Burn it.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Executable mapped memory is treated as x86 code by default when
running under EC; VirtualAlloc2 needs to be used together with a
special flag to map JIT arm64 code.
As we are moving more and more OS-specific code to the frontend, this is
another set of functions that can be moved to FEXLoader from FEXCore.
No functional change here, only code moved from protected to private and
to FEXLoader's SignalDelegator.
Once more thread handling is moved to the frontend we can move even more
out of FEXCore. As follows:
- CheckXIDHandler can get moved.
  - The first pthread FEX makes would just call this.
- Register/UnregisterTLSState can get moved.
  - This can happen in the clone/thread handler once the frontend
    handles it.
This leaves very little in the backend and is mostly an interface for
passing signal data to the frontend that it needs once a signal has
occurred.
It is additionally used for `SignalThread`.
This has long been unused. It was originally implemented for some fuzzing
tests, but those have been abandoned and should likely be implemented some
other way.
This fixes an issue where CPU tunables were ending up in the thunk
generator, which meant that if the CPU of the *builder* didn't support all
the features, it would crash with SIGILL. This was happening on
Canonical's runners because they typically only support ARMv8.2 while we
are compiling packages to run on ARMv8.4 devices.
cc: FEX-2311.1
Requires #3249 to be merged first
Library alerting has been disabled for now, and storing IR while
gdbserver is running has been removed.
Otherwise no functional change.
Suggested by Alyssa. Adding an IR operation can be a little tedious
since you need to add the definition to JIT.cpp for the dispatch switch,
to JITClass.h for the function declaration, and then actually define the
implementation in the correct file.
Instead, support the common case where an IR operation just gets
dispatched through to the regular handler. This lets the developer just
put the function definition into the json and the relevant cpp file and
it just gets picked up.
Some minor things:
- Needs to support dynamic dispatch for {Load,Store}Register and
  {Load,Store}Mem
  - This is just a bool in the json
- It needs to not output JIT dispatch for some IR operations
  - SSE4.2 string instructions and x87 operations
  - These go down the "Unhandled" path
- Needs to support a Dispatcher function override
  - This is just for handling NoOp IR operations that get used for
    other reasons
- Finally removes VSMul and VUMul, consolidating to VMul
  - Unlike V{U,S}Mull, signed versus unsigned doesn't change behaviour here
- Fixed a couple of random handler names not matching the IR operation
  name
With the removal of the x86 JIT, there is no need to have these be
independent classes.
Merges the Arm64Dispatcher into the base Dispatcher class.
No functional change, just moving code.
This is blocking performance improvements. This backend is almost
universally unused, except when I'm testing whether games run on Radeon
video drivers.
Hopefully AmpereOne and Orin/Grace can fulfill this role when they
launch next year.
It is scarcely used today, and like the x86 JIT, it is a significant
maintenance burden complicating work on FEXCore and arm64 optimization. Remove
it, bringing us down to 2 backends.
1 down, 1 to go.
Some interpreter scaffolding remains for x87 fallbacks. That is not a problem
here.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
It turns out that pure SSA isn't a great choice for the sort of emulation we do.
On one hand, it discards information from the guest binary's register allocation
that would let us skip stuff. On the other hand, it doesn't have nearly as many
benefits in this setting as in a traditional compiler... We really *don't* want
to do global RA or really any global optimization. We assume the guest optimizer
did its job for x86; we just need to clean up the mess left from going x86 ->
arm. So we just need enough SSA to peephole optimize.
My concrete IR proposals are that:
* SSA values must be killed in the same block that they are defined.
* Explicit LoadGPR/StoreGPR instructions can be used for global persistence.
* LoadGPR/StoreGPR are eliminated in favour of SSA within a block.
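Eliminating LoadGPR/StoreGPR within a block is simple store-to-load forwarding. A sketch of the idea (illustrative Python with a made-up IR tuple format; FEX's actual IR and passes are C++ and differ):

```python
def forward_gpr_values(block):
    """Forward StoreGPR values to later LoadGPRs in the same block,
    keeping only the final StoreGPR per guest register.

    Instructions are tuples:
      ("store", reg, value)  |  ("load", reg, dest)  |  ("op", dest, sources)
    """
    current = {}  # guest register -> SSA value currently holding it
    rename = {}   # load dest -> forwarded SSA value
    out = []
    for inst in block:
        if inst[0] == "store":
            _, reg, value = inst
            current[reg] = rename.get(value, value)  # defer the store
        elif inst[0] == "load":
            _, reg, dest = inst
            if reg in current:
                rename[dest] = current[reg]  # forward, drop the load
            else:
                out.append(inst)             # first load stays
                current[reg] = dest
        else:
            _, dest, sources = inst
            out.append(("op", dest, [rename.get(s, s) for s in sources]))
    # materialize the final register state at block end
    for reg, value in current.items():
        out.append(("store", reg, value))
    return out
```

On a block that loads rax, modifies it, stores it, and loads it again, the intermediate store/load pair collapses, leaving pure SSA between the block's entry load and exit store.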
This has a lot of nice properties for our setting:
* Except for some internal REP instruction emulation (etc), we already have
registers for everything that escapes block boundaries, so this form is very
easy to go into -- straightforward local value numbering, not a full into-SSA
pass.
* Spilling is entirely local (if it happens at all), since everything is in
registers at block boundaries. This is excellent, because Belady's algorithm
lets us spill nearly optimally in linear time for individual blocks. (And
the global version of Belady's algorithm is massively more complicated...)
A nice fit for a JIT.
Relatedly, it turns out allowing spilling is probably a decent decision,
since the same spiller code can be used to rematerialize constants in a
straightforward way. This is an issue with the current RA.
* Register assignment is entirely local. For the same reason, we can assign
registers "optimally" in linear time & memory (e.g. with linear scan). And
the impl is massively simpler than a full-blown SSA-based tree scan RA. For
example, we don't have to worry about parallel copies or coalescing phis or
anything. Massively nicer algorithm to deal with.
* SSA value names can be block local which makes the validation implicit :~)
It also has remarkably few drawbacks, because we didn't want to do CFG global
optimization anyway given our time budget and the diminishing returns. The few
global optimizations we might want (flag escape analysis?) don't necessarily
benefit from pure SSA anyway.
Anyway, we explicitly don't want phi nodes in any of this. They're currently
unused. Let's just remove them so nobody gets the bright idea of changing that.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
It is not an external component, and it makes paths needlessly long.
Ryan seemed amenable to this when we discussed on IRC earlier.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>