108 Commits

Alyssa Rosenzweig
04e4993d9b OpcodeDispatcher: Add a kludge to save NZCV less
Some opcodes only clobber NZCV under certain circumstances, and we don't yet
have a good way of encoding that. In the meantime this hot-fixes some
would-be instcountci regressions.

Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
2023-11-09 09:40:51 -04:00
Ryan Houdek
e2c65189ff GDBServer: Preparation work to get this moved to the frontend
GDBServer is inherently OS specific which is why all this code is
removed when compiling for mingw/win32. This should get moved to the
frontend before we start landing more work to clean this interface up.

Not really any functional change.

Changes:

FEXCore/Context: Adds new public interfaces that were previously
private.
- WaitForIdle
   - If `Pause` was called or the process is shutting down then this
     will wait until all threads have paused or exited.
- WaitForThreadsToRun
   - If `Pause` was previously called and then `Run` was called to get
     them running again, this waits until all the threads have come out
     of idle to avoid races.
- GetThreads
   - Returns the `InternalThreadData` for all the current threads.
   - GDBServer needs to know all the internal thread data state when the
     threads are paused, which is what this gives it.

GDBServer:
- Removes usages of internal data structures where possible.
   - This gets it clean enough that moving it out of FEXCore is now
     possible.
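
A rough sketch of how a frontend GDB stub might drive the new public
interfaces above once the move happens. Only the interface names come from
this change; the header path, signatures, and return types are assumptions
for illustration, not FEXCore's actual API.
```cpp
#include <FEXCore/Core/Context.h> // Header path assumed for illustration.

// Hypothetical frontend-side usage; signatures are assumed, not FEXCore's real API.
void HaltAndInspect(FEXCore::Context::Context *CTX) {
  CTX->Pause();                // Ask all guest threads to stop running.
  CTX->WaitForIdle();          // Block until every thread has paused or exited.

  for (auto *Thread : CTX->GetThreads()) {
    // Inspect the thread's InternalThreadData here (register packing for GDB, etc.)
    // while everything is known to be stopped.
    (void)Thread;
  }

  CTX->Run();                  // Resume the guest threads.
  CTX->WaitForThreadsToRun();  // Wait until they've left idle to avoid races.
}
```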
2023-11-02 20:11:01 -07:00
Ryan Houdek
03f63f99a8 FEXCore: Moves StringUtils to FEXCore headers
Once gdbserver gets moved to the frontend this will need to be in the
includes.
2023-11-02 20:09:12 -07:00
Alyssa Rosenzweig
d4a6b031ea
Merge pull request #3245 from Sonicadvance1/remove_gdbpausecheck
FEXCore: Removes gdb pause check handler
2023-11-01 15:46:40 -04:00
Ryan Houdek
e5df636efd OpcodeDispatcher: Optimize pblendw
Requires #3238 to be merged first since this uses the tbx IR operation.

Worst case is now a three-instruction sequence of ldr+ldr+tbx.
Some operations are special-cased, which definitely doesn't cover every
case we could handle without tbx, but as a worst-case bound this is a
significant improvement.
2023-10-31 21:18:02 -07:00
Ryan Houdek
de10cbad98 OpcodeDispatcher: Optimize blendps
A bunch of blendps swizzles weren't optimal. This makes every swizzle
optimal.

Two of the swizzles can be more optimal without a tbx, but the rest require
tbx to be optimal since they don't match ARM's swizzle mechanics.
2023-10-31 20:06:45 -07:00
Mai
77d92872bc
Merge pull request #3212 from Sonicadvance1/dpp_opt
OpcodeDispatcher: Optimize 128-bit DPPS and DPPD
2023-11-01 04:01:05 +01:00
Ryan Houdek
460f13be71 FEXCore: Removes gdb pause check handler
gdbserver is currently entirely broken so this doesn't change behaviour.
The gdb pause check that we originally had added an excessive amount of
overhead.
Instead use the pending interrupt fault check that was wired
up for wine.
This makes the check very lightweight and makes it more reasonable to
implement a way to have gdbserver support attaching to a process.
2023-10-31 18:40:00 -07:00
Ryan Houdek
09e3371a0d Config: Fixes string enum parser with multiple arguments
Messed up when originally implementing this: substr's second argument is
the requested substring length, not the ending position.

Noticed this while trying to parse multiple FEX_HOSTFEATURES options.
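
For reference, std::string::substr takes (position, length); a minimal
generic sketch of splitting a comma-separated option string, not the actual
FEX config parser:
```cpp
#include <string>
#include <vector>

// Splits "OPT1,OPT2,OPT3" into its individual options.
std::vector<std::string> SplitOptions(const std::string &Input) {
  std::vector<std::string> Result;
  size_t Begin = 0;
  for (;;) {
    const size_t Comma = Input.find(',', Begin);
    const size_t End = (Comma == std::string::npos) ? Input.size() : Comma;
    // Correct: the second argument is the length of the piece.
    // Passing the end position instead (the original bug) grabs too much of
    // the string once more than one option is present.
    Result.push_back(Input.substr(Begin, End - Begin));
    if (Comma == std::string::npos) break;
    Begin = Comma + 1;
  }
  return Result;
}
```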
2023-10-27 12:04:04 -07:00
Ryan Houdek
99465faf63 IR: Implements support for subtract with shifted register
Will be used soon.
2023-10-23 09:27:41 -07:00
Alyssa Rosenzweig
d87155e4ee IR: Add infrastructure for modelling flag clobbers
Lots of instructions clobber NZCV inadvertently but are not intended to write to
the host flags from the IR point-of-view. As an example, Abs logically has no
side effects but physically clobbers NZCV due to its cmp/csneg impl on non-CSSC
hw. Add infrastructure to model this in the IR so we can deal with it when we
start using NZCV for things.

Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
2023-10-23 10:21:47 -04:00
Ryan Houdek
b3d76bd2f1 IR: Adds DPPS and DPPD source masks
These source masks will get used by those instructions soon.
2023-10-19 16:36:19 +02:00
Ryan Houdek
2671246fef IR: Adds scalar vector insert operations
These IR operations are required to support AFP's NEP mode, which does a
vector insert into the destination register. Additionally they give us
tracking information to allow optimizing out redundant inserts on
devices that don't support AFP natively.

In order to match x86 semantics we need to support binary and unary
scalar operations that do a final insert into a vector, with optional
zeroing of the top 128 bits for AVX variants.

A tricky thing is that in binary operations the destination and first
source have an intrinsically linked property depending on whether the
instruction is SSE or AVX.

SSE example:
- addss xmm0, xmm1
   - xmm0 is both the destination and the first source.
   - This means xmm0[31:0] = xmm0[31:0] + xmm1[31:0]
   - Bits [127:32] are UNMODIFIED.

FEX's JIT jumps through some hoops so that if the destination register
equals the first source register, it hits the optimal path where AFP.NEP
will insert into the result. AVX throws a small wrench into this due to
its changed behaviour.

AVX example:
- vaddss xmm0, xmm1, xmm2
  - xmm0 is ONLY the destination, xmm1 and xmm2 are the sources
  - This operation copies the bits above the scalar result from the
    first source (xmm1).
  - Additionally this will zero bits above the original 128-bit xmm
    register.
  - xmm0[31:0] = xmm1[31:0] + xmm2[31:0]
  - xmm0[127:32] = xmm1[127:32]
  - ymm0[255:128] = 0

This means these instructions need a fairly large handling table depending
on whether the instruction is an SSE or AVX instruction, plus whether the
host CPU supports AFP or not.
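
A rough C++ model of the two behaviours above, purely for illustration (the
types and function names here are made up, not FEX's IR helpers):
```cpp
struct Vec256 { float Lane[8]; }; // Illustrative 256-bit register model.

// SSE addss: destination doubles as first source, upper bits untouched.
void AddSS_SSE(Vec256 &Dst, const Vec256 &Src) {
  Dst.Lane[0] += Src.Lane[0];  // Only bits [31:0] change.
  // Lane[1..7] stay as they were, including the upper ymm half.
}

// AVX vaddss: insert above-scalar bits from the first source, zero the ymm top half.
void AddSS_AVX(Vec256 &Dst, const Vec256 &Src1, const Vec256 &Src2) {
  Dst.Lane[0] = Src1.Lane[0] + Src2.Lane[0];
  for (int i = 1; i < 4; ++i) Dst.Lane[i] = Src1.Lane[i]; // Copy [127:32] from Src1.
  for (int i = 4; i < 8; ++i) Dst.Lane[i] = 0.0f;         // Zero ymm[255:128].
}
```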

So while fairly complex, this handles all the edge cases and gives us
optimization opportunities as we move forward. On devices without AFP
support there is already a minor benefit: these IR operations remove one
temporary register, lowering the Register Allocation overhead.

In the coming weeks I am likely to introduce an optimization pass that
removes redundant inserts because FEX currently does /really/ badly with
scalar code loops.

Needs #3184 merged first.
2023-10-10 03:17:19 -07:00
Ryan Houdek
cd83d3eb24 InstCountCI: Support multiple instructions in the tests
There are some cases where we want to test multiple instructions so we can
exercise optimizations that would otherwise be hard to see.

eg:
```asm
; Can be optimized to a single stp
push eax
push ebx

; Can remove half of the copy since we know the direction
cld
rep movsb

; Can remove a redundant insert
addss xmm0, xmm1
addss xmm0, xmm2
```

This lets us have arbitrarily sized code in instruction count CI, with the
original json key becoming only a label if the instruction array is
provided.

There are still some major limitations to this: instructions that
generate side-effects might have "garbage" after the end of the block
that isn't correctly accounted for, so care must be taken.

Example in the json
```json
"push ax, bx": {
  "ExpectedInstructionCount": 4,
  "Optimal": "No",
  "Comment": "0x50",
  "x86Insts": [
    "push ax",
    "push bx"
  ],
  "ExpectedArm64ASM": [
    "uxth w20, w4",
    "strh w20, [x8, #-2]!",
    "uxth w20, w7",
    "strh w20, [x8, #-2]!"
  ]
}
```
2023-10-09 21:49:53 -07:00
Ryan Houdek
6403290019 FEXCore: Renames raw FLAGS location names to signify they can't be used directly
Six of the EFLAGS can't be used directly in a bitmask because they are
either contained in a different flags location or have multiple bits
stored in them.

SF, ZF, CF, OF are stored in ARM's NZCV format in offset 24.
PF calculation is deferred but stored in the regular offset.
AF is also deferred in relation to the PF but stored in the regular
offset.

These /need/ to be reconstructed using the `ReconstructCompactedEFLAGS`
function when wanting to read the EFLAGS.

When setting these flags they /need/ to be set using
`SetFlagsFromCompactedEFLAGS`.

If either of these functions is not used when managing EFLAGS then the
internal representation will get mangled and the state will be
corrupted.
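
As a loose illustration of why the raw locations can't be read as plain
bits, a sketch of mapping an ARM NZCV word onto the architectural
SF/ZF/CF/OF bit positions. This is not the actual
`ReconstructCompactedEFLAGS` implementation, and it ignores PF/AF as well
as details like carry-flag inversion on subtracts:
```cpp
#include <cstdint>

// Illustrative only: extract N/Z/C/V from a raw NZCV word and place them at
// the x86 EFLAGS bit positions. PF and AF are omitted since they are stored
// as deferred calculations internally.
uint32_t ReconstructStatusFlags(uint32_t NZCV) {
  const uint32_t N = (NZCV >> 31) & 1;
  const uint32_t Z = (NZCV >> 30) & 1;
  const uint32_t C = (NZCV >> 29) & 1;
  const uint32_t V = (NZCV >> 28) & 1;

  uint32_t EFLAGS = 0;
  EFLAGS |= C << 0;   // CF
  EFLAGS |= Z << 6;   // ZF
  EFLAGS |= N << 7;   // SF
  EFLAGS |= V << 11;  // OF
  return EFLAGS;
}
```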

Having a little `_RAW` on these to signify that they aren't just regular
single-bit representations like the other flags in EFLAGS should make us
stop and think before writing more broken code that tries to access them
directly.
2023-10-08 11:51:11 -07:00
Ryan Houdek
22590dde77 FEXCore: Implements support for RPRES
This allows us to use reciprocal instructions that match the precision x86
expects rather than converting everything to float divides.

Currently no hardware supports this, and even the upcoming X4/A720/A520
won't support it, but it was trivial to implement so wire it up.
2023-10-07 23:13:47 -07:00
Ryan Houdek
559cf6491a InstCountCI: Support overriding AFP features
Also disable AFP under the vixl simulator by default since it doesn't support it.
2023-10-07 11:48:42 -07:00
Ryan Houdek
8a51bb7a61 FEXCore: Support CpuState relative vector named constants
The motivation for just having a pointer array in CpuState was that
initialization was fairly cheap and that we have limited space inside
the encoding depending on what we want to do.

Initialization cost is still a concern but doing a memcpy of 128-bytes
isn't that big of a deal.

Limited space in CpuState, while a concern, isn't a significant one:
   - Needs to currently be less than 1 page in size
   - Needs to be under the architectural offset limitations of load-store
     scaled offsets, which is 65KB for 128-bit vectors

Still keeps the pointer array around for cases when we would need to
synthesize an address offset and it's just easier to load the
process-wide table.

The performance improvement here comes from removing the dependency in the
ldr+ldr chain. In microbenchmarks on Cortex-X1C, removing this dependency
chain has shown an improvement of ~4%.
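
A simplified sketch of the two layouts being compared; the struct and field
names are invented for illustration, only the ldr+ldr versus single-ldr
distinction comes from the commit:
```cpp
#include <cstdint>

struct Vec128 { uint64_t Lo, Hi; };

// Old scheme: CpuState holds a pointer to a process-wide constant table.
//   ldr x0,  [state, #offsetof(CpuStateSketch, ConstantsPointer)]
//   ldr q16, [x0, #Index * 16]            ; second load depends on the first
//
// New scheme: the constants are embedded directly in CpuState.
//   ldr q16, [state, #offsetof(CpuStateSketch, NamedConstants) + Index * 16]
//
// The single scaled-offset load removes the load-to-load dependency, which
// is where the ~4% microbenchmark improvement comes from.
struct CpuStateSketch {
  // ... architectural state ...
  const Vec128 *ConstantsPointer; // old: requires the dependent double load
  Vec128 NamedConstants[8];       // new: directly addressable, memcpy'd at init
};
```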
2023-10-04 20:56:29 -07:00
Ryan Houdek
ee6debe8fd FEXCore: Adds DividePow2 helper 2023-10-04 20:56:29 -07:00
Ryan Houdek
98789a8039 FEXCore: Implement support for AVX2 feature detection 2023-09-28 19:57:08 -07:00
Billy Laws
51f8c83c76 Context: Add an alternative thread-oriented execute function 2023-09-22 10:12:40 -07:00
Ryan Houdek
31564354b1 FEXCore: Removes vestigial Interpreter code 2023-09-21 15:49:49 -07:00
Ryan Houdek
fea72ce19c
Merge pull request #3120 from Sonicadvance1/more_optimal_x87
FEXCore: Support preserve_all ABI for interpreter fallbacks
2023-09-21 15:35:37 -07:00
Alyssa Rosenzweig
c52741c813 FEXCore: Gut interpreter
It is scarcely used today, and like the x86 jit, it is a significant
maintenance burden complicating work on FEXCore and arm64 optimization. Remove
it, bringing us down to 2 backends.

1 down, 1 to go.

Some interpreter scaffolding remains for x87 fallbacks. That is not a problem
here.

Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
2023-09-21 12:48:12 -04:00
Ryan Houdek
ca6570d5de FEXCore/Include: Adds SPDX identifier 2023-09-18 22:13:10 -07:00
Ryan Houdek
838293c2f0 FEXCore: Remove unused FallbackhandlerIndex LoadFCW
This became unused once we started passing in FCW explicitly.
2023-09-18 17:06:46 -07:00
Ryan Houdek
0c5c146fcf FEXCore/JitSymbols: Buffer writes to reduce overhead
While this interface is usually pretty fast because it is a write and
forget operation, this has issues when there are multiple threads
hitting the perf map file at the same time. In particular this interface
becomes a bottleneck due to a locking mutex on writes in the kernel.

This bottleneck occurs when a bunch of threads get spawned and they are
all jitting code as quickly as possible. In particular Geekbench's clang
benchmark hits this hard, with each of the eight CPU threads spending
~40% of its CPU time stalled waiting for this mutex to unlock.

To work around this issue, buffer the writes a small amount. Either up
to a page-ish of data or 100ms of time. This completely eliminates
threads waiting on the kernel mutex.
- Around a page of buffer space was chosen by profiling Geekbench's
  clang benchmark and seeing how frequently it was still writing.
   - 1024 bytes was still fairly aggressive, 4096 seemed fine.
- 100ms was chosen to ensure we don't wait /too/ long to write JIT
  symbols.
   - In most cases 100ms is enough that you won't notice the blip in
     perf.

One thing of note: with profiling enabled, checking the time on every JIT
block still ends up with 2-3% CPU time in vdso clock_gettime. We can
improve this by using the cycle counter directly since that is still
guaranteed to be monotonic. Maybe we'll come back to that if it is
actually an issue here.
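
A hedged sketch of the buffering strategy (the ~4096-byte and 100ms
thresholds come from the commit; the class, its members, and the raw
write() usage are illustrative, not the actual JITSymbols implementation):
```cpp
#include <chrono>
#include <string>
#include <unistd.h>

// Illustrative buffered perf-map writer: flush on ~4KiB of pending data or 100ms.
class BufferedSymbolWriter {
  static constexpr size_t FlushBytes = 4096;
  static constexpr std::chrono::milliseconds FlushInterval{100};

  int FD;
  std::string Buffer;
  std::chrono::steady_clock::time_point LastFlush = std::chrono::steady_clock::now();

  void Flush() {
    if (Buffer.empty()) return;
    // One larger write instead of one write per JIT block keeps contention on
    // the kernel-side file mutex low when many threads are jitting at once.
    (void)write(FD, Buffer.data(), Buffer.size());
    Buffer.clear();
  }

public:
  explicit BufferedSymbolWriter(int MapFD) : FD(MapFD) {}

  void AppendSymbol(const char *Line, size_t Size) {
    Buffer.append(Line, Size);
    const auto Now = std::chrono::steady_clock::now();
    if (Buffer.size() >= FlushBytes || (Now - LastFlush) >= FlushInterval) {
      Flush();
      LastFlush = Now;
    }
  }

  ~BufferedSymbolWriter() { Flush(); }
};
```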
2023-09-16 17:52:46 -07:00
Ryan Houdek
d5782567e8
Merge pull request #3077 from Sonicadvance1/x86_shifted
FEXCore: Implements support for shifted bitwise ops
2023-09-15 08:09:35 -07:00
Ryan Houdek
d81d89c4fb IR: Changes Select operation to not have implicit sizes
The helper which all the source uses still calculates the size implicitly,
since it is going to take a while to convert all implicit uses over to the
explicit operation.

This gets us started by at least having the IR operation itself be explicit.
2023-09-14 20:48:16 -07:00
Ryan Houdek
db5056f275 OpcodeDispatcher: Implement shufps with VTBL2 in worst case
In the case that the source registers are sequential, this turns into a
load of the vector constant (2 instructions) plus the single tbl
instruction.

If the registers aren't sequential then the tbl turns into 2 moves and
then the single tbl, which with zero-cycle rename isn't too bad.

As a worst-case option this is significantly better than the previous
implementation, which always did a bunch of inserts totalling 9
instructions.
We should still strive to implement faster versions without the use of
TBL2 if possible but this makes it less of a concern.
2023-09-13 11:31:20 -07:00
Ryan Houdek
e9d96ce538 IR: Implements support for VTBL2
Skips implementing it for the x86 JIT because that's a bit of a
nightmare to think about.

The ARM64 implementation requires sequential registers, which means that if
the incoming sources aren't sequential then we need to move the sources
into the two vector temporaries. This is fine since we have zero-cycle
vector renames and the alternative is slower.
2023-09-13 11:31:20 -07:00
Ryan Houdek
b453439968 HostFeatures: Detect FlagM/2
Currently unused but at least detect the feature so that our Arm64 JIT
can use it in the future.
2023-09-11 16:41:30 -07:00
Ryan Houdek
863331b117 FEXCore: Implements support for shifted bitwise ops
This wasn't implemented initially for the interpreter and x86 JIT.

This meant we were maintaining two codepaths. Implement these operations
in the interpreter and x86 JIT so we no longer need to do that.

The emitted code in the x86 JIT is hot garbage, but it's only necessary
for correctness testing, not performance testing there.
2023-09-11 13:17:35 -07:00
Alyssa Rosenzweig
e6db2d0b96 IR: Remove phi nodes
It turns out that pure SSA isn't a great choice for the sort of emulation we do.
On one hand, it discards information from the guest binary's register allocation
that would let us skip stuff. On the other hand, it doesn't have nearly as many
benefits in this setting as in a traditional compiler... We really *don't* want
to do global RA or really any global optimization. We assume the guest optimizer
did its job for x86, we just need to clean up the mess left from going x86 ->
arm. So we just need enough SSA to peephole optimize.

My concrete IR proposals are that:

  * SSA values must be killed in the same block that they are defined.
  * Explicit LoadGPR/StoreGPR instructions can be used for global persistence.
  * LoadGPR/StoreGPR are eliminated in favour of SSA within a block.

This has a lot of nice properties for our setting:

  * Except for some internal REP instruction emulation (etc), we already have
    registers for everything that escapes block boundaries, so this form is very
    easy to go into -- straightforward local value numbering, not a full into
    SSA pass.

  * Spilling is entirely local (if it happens at all), since everything is in
    registers at block boundaries. This is excellent, because Belady's algorithm
    lets us spill nearly optimally in linear-time for individual blocks. (And
    the global version of Belady's algorithm is massively more complicated...)
    A nice fit for a JIT.

    Relatedly, it turns out allowing spilling is probably a decent decision,
    since the same spiller code can be used to rematerialize constants in a
    straightforward way. This is an issue with the current RA.

  * Register assignment is entirely local. For the same reason, we can assign
    registers "optimally" in linear time & memory (e.g. with linear scan). And
    the impl is massively simpler than a full blown SSA-based tree scan RA. For
    example, we don't have to worry about parallel copies or coalescing phis or
    anything. Massively nicer algorithm to deal with.

  * SSA value names can be block local which makes the validation implicit :~)

It also has remarkably few drawbacks, because we didn't want to do CFG global
optimization anyway given our time budget and the diminishing returns. The few
global optimizations we might want (flag escape analysis?) don't necessarily
benefit from pure SSA anyway.

Anyway, we explicitly don't want phi nodes in any of this. They're currently
unused. Let's just remove them so nobody gets the bright idea of changing that.

Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
2023-09-05 16:35:12 -04:00
Billy Laws
3792f707dc FEXLoader: Convert between abridged/full tag fmts in signal dispatch
X86 fpstate expects FTW to be saved in the FSAVE format, whereas X64
fpstate expects it to be saved in the abridged format used by FXSAVE.
2023-09-02 09:17:33 -07:00
Billy Laws
cb49373f47 FEXCore: Rework X87 tag word handling
The FXSAVE and FSAVE tag words are written out in different formats,
with FXSAVE using an abridged version that lacks the zero/special/valid
distinction. Switch to using this abridged version internally for
simplicity, and to allow the calculation of the zero/special/valid
distinction to be deferred until an fxsave instruction in the future
(currently the distinction is ignored and only valid/empty states are
possible).
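
For context, a sketch of how the abridged FXSAVE tag byte expands back into
the full FSAVE tag word when only the valid/empty distinction is tracked,
as this commit describes; a generic illustration, not FEX's conversion code:
```cpp
#include <cstdint>

// FSAVE stores 2 bits per x87 register: 00 = valid, 01 = zero, 10 = special, 11 = empty.
// FXSAVE's abridged form stores 1 bit per register: 1 = non-empty, 0 = empty.
uint16_t AbridgedToFullTagWord(uint8_t Abridged) {
  uint16_t Full = 0;
  for (int Reg = 0; Reg < 8; ++Reg) {
    const bool NonEmpty = (Abridged >> Reg) & 1;
    // Without tracking zero/special, every non-empty register is reported as
    // plain "valid" (00); empty registers stay 11.
    Full |= static_cast<uint16_t>(NonEmpty ? 0b00 : 0b11) << (Reg * 2);
  }
  return Full;
}
```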
2023-09-02 09:17:33 -07:00
Ryan Houdek
435f03c703 Context: Adds helper to reconstruct and consume packed EFLAGS
Currently FEX's internal EFLAGS representation is a perfect 1:1 mapping
between bit offset and byte offset. This is going to change with #3038.
There should be no reason that the frontend needs to understand how to
reconstruct the compacted flags from the internal representation.

Adds context helpers and moves all the logic to FEXCore. The locations
that previously needed to handle this have been converted over to use
this.
2023-09-02 07:05:54 -07:00
Ryan Houdek
1446d4fe12 IR: Adds support for named vector zero
This is useful for caching a zero register vector which we use in
various locations. This will be abused soon.
2023-08-30 18:59:38 -07:00
Ryan Houdek
81a32c3998 FEXCore: Allows disabling telemetry at runtime
This is useful for InstCountCI so you can disable telemetry gathering even
if it's enabled, ensuring it doesn't affect the CI system.
2023-08-30 12:59:41 -07:00
Ryan Houdek
d8f131fa3d Arm64: Optimize AESKeyGenAssist
We can load the swizzle table from our constant pool now. This removes
the only usage of VTMP3 from our Arm64 JIT.

I would say this is now optimal for the version without RCON set.
With RCON we could technically make some of the constant moves more
optimal.
2023-08-30 12:15:09 -07:00
Ryan Houdek
572cc57aa3 IR: Adds printer for OpSize 2023-08-30 11:32:43 -07:00
Ryan Houdek
f741ebf970 IR: Removes implicit sized add
Saw a few locations in here where we operate at 64-bit unconditionally
around pointer calculation. Will be coming back for those when running
in 32-bit mode.

This is the last of the implicit sized ALU operations! After this I'll
be going through the IR more individually to try and remove any
stragglers.
Then we should be able to start cleaning up and actually optimizing GPR
operations.
2023-08-29 22:26:51 -07:00
Ryan Houdek
e8b767b553 IR: Removes implicit sized bfe
This one is a bit of a mess, looking forward to coming back and cleaning
this up.
2023-08-29 19:43:39 -07:00
Ryan Houdek
9e70aa4192 IR: Removes implicit sized and 2023-08-28 22:43:21 -07:00
Ryan Houdek
b5dc6a69c7 IR: Removes implicit sized sub 2023-08-28 22:05:02 -07:00
Ryan Houdek
a276b37252 IR: Removes bfi from variable size
This one was already explicitly sized. Just converts it over to OpSize.
2023-08-28 21:31:37 -07:00
Ryan Houdek
8bc84c202c IR: Removes implicit sized xor 2023-08-28 19:51:14 -07:00
Ryan Houdek
e9a3848602
Merge pull request #3027 from Sonicadvance1/remove_implicit_andn
IR: Removes implicit sized andn
2023-08-28 19:39:49 -07:00
Ryan Houdek
1699ec9a76 IR: Removes implicit sized andn 2023-08-28 19:16:16 -07:00
Ryan Houdek
db6c8852fc IR: Removes implicit sized or 2023-08-28 19:06:05 -07:00