Commit Graph

4083 Commits

Author SHA1 Message Date
Alyssa Rosenzweig
4c1f53c1ff ConstProp: Handle constant Bfi
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
2023-07-25 10:19:56 -04:00
Ryan Houdek
5821175ddb
Merge pull request #2813 from Sonicadvance1/fix_ra_lr
Arm64: Fixes LR corruption in 128-bit divides
2023-07-24 15:08:45 -07:00
Alyssa Rosenzweig
27a1ebc2f5 IR: Expand to 16-bit opcodes
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
2023-07-24 17:21:11 -04:00
Ryan Houdek
61e905a339 Arm64: Fixes LR corruption in 128-bit divides
Need to save and restore LR before branching out to the helpers.
Confirmed that the rest of the JIT handles this correctly.
2023-07-24 12:04:16 -07:00
Lioncache
10fdcaa109 Arm64/Emitter: Simplify SVE immediate shift helper
We can collapse the entire if statement to be much simpler.
2023-07-24 15:04:14 -04:00
Ryan Houdek
ba672f868c Arm64: Switch to using half barriers
Inspired from: https://github.com/dotnet/runtime/issues/8072

Currently FEX is /very/ heavy handed with our backpatching where we wrap
every backpatched loadstore with `dmb ish`.

This can be relaxed slightly according to the linked issue.

For TSO load instructions the instruction sequence changes to:
  ldr <args>;
  dmb ld; <-- Slightly less strict dmb

For TSO store instructions the instruction sequence changes to:
  dmb ish; <-- Still the all encompassing dmb
  str <args>;

For backpatching loadstores this does the same thing where only one side
needs the nop and it uses the same instruction sequence when
backpatched.

The minor change is that on load backpatching, we are no longer backing
up a single instruction, instead just re-executing the instruction we
patched directly.

Took a long time to come back to this (Last looked in August 2020).
Previously when I was implementing this idea it didn't work, but that
was because our CompareExchange operation was broken back then. With the
CAS now, it should just work.
2023-07-24 11:06:20 -07:00
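A note on ba672f868c: the half-barrier sequences can be sketched in portable C++ with fences. This is only an illustration under stated assumptions, not FEX's emitter code. `tso_load` and `tso_store` are hypothetical names, and how a compiler lowers each fence to a specific `dmb` variant is an assumption the commit message does not make.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical helper: a load followed by a barrier, mirroring the
// "ldr; dmb" ordering from the commit message. The acquire fence keeps
// later loads and stores from being reordered before this load.
uint64_t tso_load(const std::atomic<uint64_t>& addr) {
  uint64_t value = addr.load(std::memory_order_relaxed);
  std::atomic_thread_fence(std::memory_order_acquire);
  return value;
}

// Hypothetical helper: a barrier followed by a store, mirroring the
// "dmb; str" ordering. The release fence keeps earlier loads and stores
// from being reordered after this store.
void tso_store(std::atomic<uint64_t>& addr, uint64_t value) {
  std::atomic_thread_fence(std::memory_order_release);
  addr.store(value, std::memory_order_relaxed);
}
```

Only store-to-load reordering is left unconstrained by this pairing, which is the one reordering x86-TSO permits, so a barrier on one side of each access is enough.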
Lioncache
9c3a843df7 Arm64/Emitter: Move indexed dup handling into SVEDup
SVEDup is only used by dup(), so we can move all the implementation
details into it instead of keeping it all in the public function.
2023-07-24 13:53:57 -04:00
Lioncache
8defa2b55f Arm64/Emitter: Collapse encoding cases for indexed dup
Lets us hoist out the asserts and also collapse all the
branching into one series of operations.
2023-07-24 13:53:42 -04:00
Ryan Houdek
77c88ffe53
Merge pull request #2804 from alyssarosenzweig/eor-zero
Optimize `xor %eax, %eax`
2023-07-24 09:01:53 -07:00
Alyssa Rosenzweig
2ce15ddc89 OpcodeDispatcher: Partially defer PF calculation
We expect that PF is written more often than it's read, so we want to
get the expensive popcount out of the hot path. (Thank you to Dougall
for suggesting that.)

There are two cases:

1. PF is written by an integer instruction. In this case, we calculate
   with the formula `popcount(x ^ 1) & 1`.
2. PF is written by a float instruction, copying a host flag.

What we really want is to defer the relatively expensive popcount. So,
to unify these cases, we have integer instructions write `x ^ 1` and
(unchanged) float instructions write the host flag. Then, when reading
PF, we do `popcount(value) & 1` on the byte read in.

If PF is written but not read, this saves the expensive popcount and
leaves only the cheap xor.

If PF is written by an integer op and read, this maybe shuffles some
code but does not materially change anything.

If PF is written by a float op and read, this is worse because now we're
doing an extra pointless popcount. This is a tradeoff... However, this
is only relevant to unordered float comparisons, which I expect to be
obscure for games. So this should be worth it overall (for games, if
not weird numerical computing workloads).

How does this connect to my register zeroing quest? The constant folding
code doesn't currently deal with FPRs and I'm not in a mood to change
this. So before, a block ending with `xor eax, eax` would still do a
popcount for PF. Now it just writes a constant 1 since the xor constant
folds and the popcount never happens at all.

Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
2023-07-24 11:17:31 -04:00
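A note on 2ce15ddc89: the deferred PF scheme can be sketched as follows. This is a minimal model of the idea in the commit message, not the OpcodeDispatcher code; `GuestFlags` and the three function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the guest flag storage (not FEX's real layout).
struct GuestFlags {
  uint8_t pf_raw = 0;  // deferred PF: a byte whose bit parity encodes the flag
};

// Population count of one byte, written out so no intrinsics are needed.
static unsigned popcount8(uint8_t v) {
  unsigned n = 0;
  for (; v != 0; v &= static_cast<uint8_t>(v - 1)) ++n;
  return n;
}

// Integer instructions: store the low result byte xor 1 (the cheap part).
void write_pf_from_integer(GuestFlags& flags, uint64_t result) {
  flags.pf_raw = static_cast<uint8_t>((result ^ 1) & 0xff);
}

// Float instructions: store the host flag (0 or 1) unchanged.
void write_pf_from_float(GuestFlags& flags, bool host_flag) {
  flags.pf_raw = host_flag ? 1 : 0;
}

// Readers pay for the popcount, and only when PF is actually read.
bool read_pf(const GuestFlags& flags) {
  return (popcount8(flags.pf_raw) & 1u) != 0;
}
```

With a constant zero result, as after `xor eax, eax`, the stored byte is the constant 1 and no popcount runs unless PF is read.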
Alyssa Rosenzweig
3d1b55383e ConstProp: Handle Select::EQ
For flag calculation after moving a constant. This cleans up the code
generated for zeroing at the end of a block.

Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
2023-07-24 11:17:31 -04:00
Alyssa Rosenzweig
69deaa0976 LongDivideRemovalPass: Don't detect xor zero
It is now optimized out (canonicalized) in the frontend.

Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
2023-07-24 11:17:31 -04:00
Alyssa Rosenzweig
fc72fa9e5f OpcodeDispatcher: Optimize xor zeroing
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
2023-07-24 11:17:31 -04:00
Lioncache
362a5a019a Arm64/VectorOps: Move VInsElement size assert checks to be first
Ensures the asserts will be hit first before emitting code.
2023-07-24 11:08:35 -04:00
Lioncache
d529a58893 Arm64/VectorOps: Hoist asserts out of VInsElement cases
Lets us deduplicate the asserts and put them in one spot. We can also
improve the assert message to also indicate the valid range.

While we're in the area, we can collapse a few case paths
as a result of this assert movement.
2023-07-24 11:05:13 -04:00
Ryan Houdek
036a196984 X86Tables: Adds spaceship operator to couple op types
For #2804 so it can compare if GPRs match more easily.

Only adding to GPR and Literal types, since it's ambiguous what this
would mean for the memory accessing types. Going to leave those other
ones alone for now.
2023-07-21 14:42:01 -07:00
Ryan Houdek
e633ef7cf5
Merge pull request #2801 from lioncash/cvt
Arm64/Emitter: Handle SVE FP convert precision group
2023-07-19 13:45:56 -07:00
Lioncache
801106faf4 Arm64/Emitter: Handle SVE FP convert precision group
2023-07-19 15:24:54 -04:00
Ryan Houdek
536b2ed495
Merge pull request #2783 from Sonicadvance1/optimize_loadstorecontextindexed
Arm64: Optimize {Load,Store}ContextIndexed address generation
2023-07-19 12:16:39 -07:00
Alyssa Rosenzweig
072f027885
Merge pull request #2784 from Sonicadvance1/optimize_small_rcr
OpcodeDispatcher: Optimize 8/16-bit RCR
2023-07-19 15:07:58 -04:00
Lioncache
ef257418d3 Arm64/Emitter: Handle SVE FP arithmetic with immediate (predicated) group
2023-07-19 13:37:33 -04:00
Ryan Houdek
842b71cf83
Merge pull request #2799 from lioncash/xar
Arm64/Emitter: Handle SVE XAR
2023-07-19 09:39:30 -07:00
Ryan Houdek
8d8b64d2b7
Merge pull request #2798 from bylaws/dealock
Jit: Add block links directly through the lookup cache on thread exit
2023-07-19 09:37:57 -07:00
Lioncache
85c6ef8097 Arm64/Emitter: Handle SVE XAR
Now that we have the helper for encoding immediate shifts,
we can trivially implement the remaining missing instruction
in the bitwise logical unpredicated group.
2023-07-19 12:26:05 -04:00
Billy Laws
2be16d9054 Jit: Add block links directly through the lookup cache on thread exit
Prevents the code invalidation mutex from being locked as shared recursively,
since it is locked before entering ThreadExitFunctionLink and would end up
being locked again by ThreadAddBlockLink.

This fixes a deadlock on Windows.
2023-07-19 17:13:17 +01:00
Lioncache
f1f50b7a98 Arm64/Emitter: Add helper for encoding SVE shift immediates
This lets us deduplicate the behavior rather than open-coding it everywhere
and also makes it nicer to implement instructions that make use of this
encoding pattern.
2023-07-19 11:39:28 -04:00
Lioncache
3f884fe2d0 Arm64/Emitter: Deduplicate ops in SVE Bitwise Logical - Unpredicated
Same behavior, with less code.
2023-07-19 09:27:13 -04:00
Lioncache
1d453f10a0 Arm64/Emitter: Remove unnecessary qualifiers from integer unary arith ops
While we're in the area, we can make these much quicker to read by
removing the unnecessary qualifiers.
2023-07-19 09:18:36 -04:00
Lioncache
5ebd21ca6a Arm64/Emitter: Deduplicate integer unary arithmetic instructions
We can move the opcodes into the underlying implementation along with
the asserts to deduplicate a bit of code.
2023-07-19 09:16:33 -04:00
Ryan Houdek
79f7dcbaa5 FEXCore/Config: Adds support for enum mask configuration array
Allows us to consume an array of strings and convert it to a mask of
enum values. This is a quality of life change that allows us to specify
a mask of options.

The first configuration option added to support this is to control the
vixl disassembler. Now by default the vixl disassembler doesn't
disassemble any blocks and needs to be enabled individually.

eg:
```
FEXLoader --disassemble=blocks <args>
FEXLoader --disassemble=dispatcher <args>
FEXLoader --disassemble=blocks,dispatcher <args>
```

Has the additional convenience option of just passing in numbers as
well.

```
FEXLoader --disassemble=2 <args>
FEXLoader --disassemble=1 <args>
FEXLoader --disassemble=3 <args>
```

Also of course all of this works through environment variables.
```
FEX_DISASSEMBLE=blocks FEXInterpreter <args>
FEX_DISASSEMBLE=dispatcher FEXInterpreter <args>
FEX_DISASSEMBLE=blocks,dispatcher FEXInterpreter <args>
```

While only used fairly sparingly now, additional configurations are
likely to use this in the future, since we already have some configs
that are basically enums implemented through string comparisons.

This was asked for by a developer, so I figured I would throw it
together quick.
2023-07-18 18:17:43 -07:00
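A note on 79f7dcbaa5: converting a comma-separated list of names, or a plain number, into an enum mask can be sketched as below. This is not the FEXCore/Config implementation. `parse_disassemble_mask` is a hypothetical name, and the bit values (`dispatcher` = 1, `blocks` = 2) are an assumption inferred from the order of the numeric examples in the commit message.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <string>

// Assumed bit assignments; the commit message does not state them.
enum DisassembleMask : uint64_t {
  kDisassembleDispatcher = 1,
  kDisassembleBlocks = 2,
};

// Hypothetical parser: returns false on an empty string or an unknown name.
bool parse_disassemble_mask(const std::string& text, uint64_t* out) {
  if (text.empty()) return false;

  // Convenience path: a plain decimal number is taken as the mask itself.
  if (text.find_first_not_of("0123456789") == std::string::npos) {
    *out = std::strtoull(text.c_str(), nullptr, 10);
    return true;
  }

  uint64_t mask = 0;
  std::string::size_type start = 0;
  while (true) {
    const std::string::size_type comma = text.find(',', start);
    const std::string name = text.substr(
        start, comma == std::string::npos ? std::string::npos : comma - start);
    if (name == "blocks") {
      mask |= kDisassembleBlocks;
    } else if (name == "dispatcher") {
      mask |= kDisassembleDispatcher;
    } else {
      return false;  // unknown option name
    }
    if (comma == std::string::npos) break;
    start = comma + 1;
  }
  *out = mask;
  return true;
}
```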
Ryan Houdek
f7b7997c77
Merge pull request #2786 from Sonicadvance1/minor_fcmov_opt2
OpcodeDispatcher: Another FCMov minor optimization
2023-07-18 18:14:17 -07:00
Ryan Houdek
0674dfab0a
Merge pull request #2787 from Sonicadvance1/icache_only_code
Arm64: Only clear icache for code
2023-07-18 18:14:05 -07:00
Lioncache
fe5f17d92e OpcodeDispatcher: Handle VSIB byte
Ensures that we handle the AVX2 VSIB byte in a decent way.

As is, we can't compute the [index * scale] variant portion
of the entire address operand, since the scale needs to act
on every element of the vector after sign extension.

What we can do though, is compute the base address and add
the displacement to it ahead of time though.
2023-07-18 13:56:38 -04:00
Lioncache
5043e5fbc0 OpcodeDispatcher: Move ShouldDump member into private section
Like with HandledLock, we can move this into the private section
and put an API around it for consistency.
2023-07-18 11:03:25 -04:00
Lioncache
2acfde3cad OpcodeDispatcher: Move CTX member into private section
This isn't used outside of the class.
2023-07-18 11:00:04 -04:00
Lioncache
c1eeeaf688 OpcodeDispatcher: Move flag-related variables into private section
These are only used within the opcode dispatcher, so they can be private.
2023-07-18 10:58:09 -04:00
Lioncache
f2b3229a87 OpcodeDispatcher: Move JumpTargets into private section
This is only used within the class, so it can be made private.
2023-07-18 10:52:48 -04:00
Lioncache
e20bfc0701 OpcodeDispatcher: Move HandledLock boolean into private class section
This can be trivially hidden and have an API put around it.
2023-07-18 10:43:52 -04:00
Lioncache
4caee5c9be OpcodeDispatcher: Remove unused Current_Header and Current_HeaderNode variables
These aren't used outside of being assigned to, so they can be removed.
2023-07-18 10:40:26 -04:00
Lioncache
cee5512a56 Arm64/Emitter: Handle unsized contiguous STR variants
We handle the unsized load variants, so we should do the same with the stores.
2023-07-18 08:38:23 -04:00
Ryan Houdek
80ae3e632d Arm64: Optimize {Load,Store}ContextIndexed address generation
All of these IR operations were being fairly inefficient in their
address calculation. All of these are known to use power-of-2 stride
indexing, so all of them can be converted from three instructions to
one.

These are always used for x87 stack accesses so each one gets an
improvement.

Before:
```asm
0x0000ffff6a800248  d2800200    mov x0, #0x10
0x0000ffff6a80024c  9b007e80    mul x0, x20, x0
0x0000ffff6a800250  8b000380    add x0, x28, x0
0x0000ffff6a800254  fd417805    ldr d5, [x0, #752]
```

After:
```asm
0x0000ffff91e80240  8b141380    add x0, x28, x20, lsl #4
0x0000ffff91e80244  fd417805    ldr d5, [x0, #752]
```
2023-07-17 22:59:33 -07:00
Ryan Houdek
ed75c19324 OpcodeDispatcher: Optimize 8/16-bit RCR
The BFI cascades in this particular instruction weren't optimal.
Biggest improvement is the 8-bit version, while the 16-bit version gets
a minor improvement.

8-bit instruction count reduced from 38 to 29.
16-bit instruction count reduced from 34 to 28.

RCL can have a similar optimization done to it.
```asm
Before 16-bit:
0x0000ffff80a801e0  10ffffe0    adr x0, #-0x4 (addr 0xffff80a801dc)
0x0000ffff80a801e4  f9005f80    str x0, [x28, #184]
0x0000ffff80a801e8  d3403cb4    uxth x20, w5
0x0000ffff80a801ec  d3403cf5    uxth x21, w7
0x0000ffff80a801f0  394b0396    ldrb w22, [x28, #704]
0x0000ffff80a801f4  12001294    and w20, w20, #0x1f
0x0000ffff80a801f8  d2800017    mov x23, #0x0
0x0000ffff80a801fc  b3403eb7    bfxil x23, x21, #0, #16
0x0000ffff80a80200  b37002d7    bfi x23, x22, #16, #1
0x0000ffff80a80204  b36f3eb7    bfi x23, x21, #17, #16
0x0000ffff80a80208  b35f02d7    bfi x23, x22, #33, #1
0x0000ffff80a8020c  aa1703e0    mov x0, x23
0x0000ffff80a80210  b35e3ea0    bfi x0, x21, #34, #16
0x0000ffff80a80214  aa0003f5    mov x21, x0
0x0000ffff80a80218  b34e02d5    bfi x21, x22, #50, #1
0x0000ffff80a8021c  9ad426b7    lsr x23, x21, x20
0x0000ffff80a80220  b3403ee7    bfxil x7, x23, #0, #16
0x0000ffff80a80224  51000698    sub w24, w20, #0x1 (1)
0x0000ffff80a80228  9ad826b5    lsr x21, x21, x24
0x0000ffff80a8022c  d34002b5    ubfx x21, x21, #0, #1
0x0000ffff80a80230  7100069f    cmp w20, #0x1 (1)
0x0000ffff80a80234  9a9622b4    csel x20, x21, x22, hs
0x0000ffff80a80238  390b0394    strb w20, [x28, #704]
0x0000ffff80a8023c  d34f3ef4    ubfx x20, x23, #15, #1
0x0000ffff80a80240  d34e3af5    ubfx x21, x23, #14, #1
0x0000ffff80a80244  ca150294    eor x20, x20, x21
0x0000ffff80a80248  390b2f94    strb w20, [x28, #715]
0x0000ffff80a8024c  58000040    ldr x0, pc+8 (addr 0xffff80a80254)
0x0000ffff80a80250  d63f0000    blr x0
0x0000ffff80a80254  967da128    bl #-0x6097b60 (addr 0xffff7a9e86f4)
0x0000ffff80a80258  0000ffff    udf #0xffff
0x0000ffff80a8025c  00010023    unallocated (Unallocated)
0x0000ffff80a80260  00000000    udf #0x0
[DEBUG] RIP: 0x10020
[DEBUG] Guest Code instructions: 1
[DEBUG] Host Code instructions: 34
[DEBUG] Blow-up Amt: 34x

After 16-bit:
0x0000ffffa7c801e0  10ffffe0            adr x0, #-0x4 (addr 0xffffa7c801dc)
0x0000ffffa7c801e4  f9005f80            str x0, [x28, #184]
0x0000ffffa7c801e8  d3403cb4            uxth x20, w5
0x0000ffffa7c801ec  d3403cf5            uxth x21, w7
0x0000ffffa7c801f0  394b0396            ldrb w22, [x28, #704]
0x0000ffffa7c801f4  12001294            and w20, w20, #0x1f
0x0000ffffa7c801f8  b37002d5            bfi x21, x22, #16, #1
0x0000ffffa7c801fc  b36f42b5            bfi x21, x21, #17, #17
0x0000ffffa7c80200  b35e42b5            bfi x21, x21, #34, #17
0x0000ffffa7c80204  9ad426b7            lsr x23, x21, x20
0x0000ffffa7c80208  b3403ee7            bfxil x7, x23, #0, #16
0x0000ffffa7c8020c  51000698            sub w24, w20, #0x1 (1)
0x0000ffffa7c80210  9ad826b5            lsr x21, x21, x24
0x0000ffffa7c80214  d34002b5            ubfx x21, x21, #0, #1
0x0000ffffa7c80218  7100069f            cmp w20, #0x1 (1)
0x0000ffffa7c8021c  9a9622b4            csel x20, x21, x22, hs
0x0000ffffa7c80220  390b0394            strb w20, [x28, #704]
0x0000ffffa7c80224  d34f3ef4            ubfx x20, x23, #15, #1
0x0000ffffa7c80228  d34e3af5            ubfx x21, x23, #14, #1
0x0000ffffa7c8022c  ca150294            eor x20, x20, x21
0x0000ffffa7c80230  390b2f94            strb w20, [x28, #715]
0x0000ffffa7c80234  58000040            ldr x0, pc+8 (addr 0xffffa7c8023c)
0x0000ffffa7c80238  d63f0000            blr x0
0x0000ffffa7c8023c  bd9cc128            unallocated (Unallocated)
0x0000ffffa7c80240  0000ffff            udf #0xffff
0x0000ffffa7c80244  00010023            unallocated (Unallocated)
0x0000ffffa7c80248  00000000            udf #0x0
[DEBUG] RIP: 0x10020
[DEBUG] Guest Code instructions: 1
[DEBUG] Host Code instructions: 28
[DEBUG] Blow-up Amt: 28x

Before 8-bit:
0x0000ffffa92801e0  10ffffe0            adr x0, #-0x4 (addr 0xffffa92801dc)
0x0000ffffa92801e4  f9005f80            str x0, [x28, #184]
0x0000ffffa92801e8  d3401cb4            uxtb x20, w5
0x0000ffffa92801ec  d3401cf5            uxtb x21, w7
0x0000ffffa92801f0  394b0396            ldrb w22, [x28, #704]
0x0000ffffa92801f4  12001294            and w20, w20, #0x1f
0x0000ffffa92801f8  d2800017            mov x23, #0x0
0x0000ffffa92801fc  b3401eb7            bfxil x23, x21, #0, #8
0x0000ffffa9280200  b37802d7            bfi x23, x22, #8, #1
0x0000ffffa9280204  b3771eb7            bfi x23, x21, #9, #8
0x0000ffffa9280208  b36f02d7            bfi x23, x22, #17, #1
0x0000ffffa928020c  b36e1eb7            bfi x23, x21, #18, #8
0x0000ffffa9280210  b36602d7            bfi x23, x22, #26, #1
0x0000ffffa9280214  b3651eb7            bfi x23, x21, #27, #8
0x0000ffffa9280218  b35d02d7            bfi x23, x22, #35, #1
0x0000ffffa928021c  aa1703e0            mov x0, x23
0x0000ffffa9280220  b35c1ea0            bfi x0, x21, #36, #8
0x0000ffffa9280224  aa0003f5            mov x21, x0
0x0000ffffa9280228  b35402d5            bfi x21, x22, #44, #1
0x0000ffffa928022c  9ad426b7            lsr x23, x21, x20
0x0000ffffa9280230  b3401ee7            bfxil x7, x23, #0, #8
0x0000ffffa9280234  51000698            sub w24, w20, #0x1 (1)
0x0000ffffa9280238  9ad826b5            lsr x21, x21, x24
0x0000ffffa928023c  d34002b5            ubfx x21, x21, #0, #1
0x0000ffffa9280240  7100069f            cmp w20, #0x1 (1)
0x0000ffffa9280244  9a9622b4            csel x20, x21, x22, hs
0x0000ffffa9280248  390b0394            strb w20, [x28, #704]
0x0000ffffa928024c  d3471ef4            ubfx x20, x23, #7, #1
0x0000ffffa9280250  d3461af5            ubfx x21, x23, #6, #1
0x0000ffffa9280254  ca150294            eor x20, x20, x21
0x0000ffffa9280258  390b2f94            strb w20, [x28, #715]
0x0000ffffa928025c  58000040            ldr x0, pc+8 (addr 0xffffa9280264)
0x0000ffffa9280260  d63f0000            blr x0
0x0000ffffa9280264  bf062128            unallocated (Unallocated)
0x0000ffffa9280268  0000ffff            udf #0xffff
0x0000ffffa928026c  00010022            unallocated (Unallocated)
0x0000ffffa9280270  00000000            udf #0x0
[DEBUG] RIP: 0x10020
[DEBUG] Guest Code instructions: 1
[DEBUG] Host Code instructions: 38
[DEBUG] Blow-up Amt: 38x

After 8-bit:
0x0000ffff9cc801e0  10ffffe0    adr x0, #-0x4 (addr 0xffff9cc801dc)
0x0000ffff9cc801e4  f9005f80    str x0, [x28, #184]
0x0000ffff9cc801e8  d3401cb4    uxtb x20, w5
0x0000ffff9cc801ec  d3401cf5    uxtb x21, w7
0x0000ffff9cc801f0  394b0396    ldrb w22, [x28, #704]
0x0000ffff9cc801f4  12001294    and w20, w20, #0x1f
0x0000ffff9cc801f8  b37802d5    bfi x21, x22, #8, #1
0x0000ffff9cc801fc  b37722b5    bfi x21, x21, #9, #9
0x0000ffff9cc80200  b36e46b5    bfi x21, x21, #18, #18
0x0000ffff9cc80204  b3778eb5    bfi x21, x21, #9, #36
0x0000ffff9cc80208  9ad426b7    lsr x23, x21, x20
0x0000ffff9cc8020c  b3401ee7    bfxil x7, x23, #0, #8
0x0000ffff9cc80210  51000698    sub w24, w20, #0x1 (1)
0x0000ffff9cc80214  9ad826b5    lsr x21, x21, x24
0x0000ffff9cc80218  d34002b5    ubfx x21, x21, #0, #1
0x0000ffff9cc8021c  7100069f    cmp w20, #0x1 (1)
0x0000ffff9cc80220  9a9622b4    csel x20, x21, x22, hs
0x0000ffff9cc80224  390b0394    strb w20, [x28, #704]
0x0000ffff9cc80228  d3471ef4    ubfx x20, x23, #7, #1
0x0000ffff9cc8022c  d3461af5    ubfx x21, x23, #6, #1
0x0000ffff9cc80230  ca150294    eor x20, x20, x21
0x0000ffff9cc80234  390b2f94    strb w20, [x28, #715]
0x0000ffff9cc80238  58000040    ldr x0, pc+8 (addr 0xffff9cc80240)
0x0000ffff9cc8023c  d63f0000    blr x0
0x0000ffff9cc80240  b2a75128    unallocated (Unallocated)
0x0000ffff9cc80244  0000ffff    udf #0xffff
0x0000ffff9cc80248  00010022    unallocated (Unallocated)
0x0000ffff9cc8024c  00000000    udf #0x0
[DEBUG] RIP: 0x10020
[DEBUG] Guest Code instructions: 1
[DEBUG] Host Code instructions: 29
[DEBUG] Blow-up Amt: 29x
```
2023-07-17 19:13:23 -07:00
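A note on ed75c19324: the "After 8-bit" sequence builds a 9-bit unit (CF above the source byte), replicates it with three self-inserting BFIs, and then lets a plain right shift act as the rotate. A minimal C++ model of that trick follows, assuming the usual x86 behaviour of masking the count to 5 bits. `rcr8_replicated` is a hypothetical name and OF is not modelled.

```cpp
#include <cassert>
#include <cstdint>

struct Rcr8Result {
  uint8_t value;  // rotated 8-bit result
  bool carry;     // new CF
};

// Hypothetical model of the replicate-then-shift approach for 8-bit RCR.
Rcr8Result rcr8_replicated(uint8_t src, bool cf, unsigned count) {
  count &= 0x1f;  // the count is masked to 5 bits, as in the `and w20, w20, #0x1f`

  // 9-bit unit: CF in bit 8, source byte in bits 0..7.
  uint64_t rep = static_cast<uint64_t>(src) | (static_cast<uint64_t>(cf) << 8);
  rep |= rep << 9;   // 2 copies, 18 bits  (bfi x21, x21, #9, #9)
  rep |= rep << 18;  // 4 copies, 36 bits  (bfi x21, x21, #18, #18)
  rep |= rep << 9;   // 5 copies, 45 bits  (bfi x21, x21, #9, #36)

  // Because the pattern has period 9, shifting right by `count` matches
  // rotating the 9-bit unit right by `count % 9`.
  const uint64_t shifted = rep >> count;

  // The new CF is the last bit shifted out; a zero count leaves CF alone.
  const bool carry = (count != 0) ? (((rep >> (count - 1)) & 1) != 0) : cf;
  return {static_cast<uint8_t>(shifted), carry};
}
```

Five copies give 45 bits, which is enough for the largest masked count of 31 plus the 8 result bits.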
Lioncache
98f51c47fa Frontend: Handle VSIB byte
Extends handling of the SIB byte to also handle the AVX2 VSIB byte.

While we're in the area, we can set up the gather instruction flags as well.
2023-07-17 16:16:21 -04:00
Ryan Houdek
724a8e13bf
Merge pull request #2789 from lioncash/scatter
Arm64/Emitter: Handle ST1{*} scatter store variants
2023-07-17 09:20:06 -07:00
Mai
d9b52fd67d
Merge pull request #2785 from Sonicadvance1/32bit_mov_bitmask
ArmEmitter: Support 32-bit bitmask moves
2023-07-17 12:13:27 -04:00
Mai
d3a2795106
Merge pull request #2781 from Sonicadvance1/optimize_phminposuw
OpcodeDispatcher: Minor optimization to phminposuw
2023-07-17 12:12:25 -04:00
Mai
3cd6c2d91a
Merge pull request #2779 from Sonicadvance1/optimize_shiftd
OpcodeDispatcher: Optimize 32/64-bit SH{L,R}D with extr
2023-07-17 12:11:51 -04:00
Mai
b86abfbccf
Merge pull request #2778 from Sonicadvance1/move_tls_signal_frontend
SignalDelegator: Moves last TLS variable to the frontend
2023-07-17 12:10:58 -04:00
Mai
ee66985ae0
Merge pull request #2777 from Sonicadvance1/deadstore_elimination
IR/Passes: Fixes DeadStoreElimination pass
2023-07-17 12:09:54 -04:00
Mai
daeba0625f
Merge pull request #2776 from Sonicadvance1/fix_constprop_mask
ConstProp: Fix shift mask in const-prop
2023-07-17 12:08:33 -04:00
Lioncache
5e6af25194 Arm64/Emitter: Handle ST1{*} Vector + Imm scatter stores
2023-07-17 12:05:27 -04:00
Lioncache
7f4528a6b0 Arm64/Emitter: Handle ST1{*} Scalar + Vector scatter stores
2023-07-17 11:11:30 -04:00
Ryan Houdek
776b7674e4 Arm64: Only clear icache for code
Currently we're clearing icache including the data that lives on the
tail of the block. Instead only clear the code that was emitted and
not tail data.

Additionally only disasm the code rather than all the tail data as well,
as it gets unwieldy to view.
2023-07-16 22:24:01 -07:00
Ryan Houdek
24cb2610a2 OpcodeDispatcher: Another FCMov minor optimization
If we are loading exactly the flags we need from the RFLAGS (ensuring we
don't load the reserved flag in bit 1) then we don't need to do a mask
on the result.

Additionally there is some bad code-motion around selects that was
causing SBFE operations to occur on constants. Ensure that we const-prop
any SBFE operations to clean this up.

This PR along with #2783 causes FCMOV blow-up to go from 41 instructions
to 31 instructions.
2023-07-16 21:37:54 -07:00
Ryan Houdek
54b7a43b95 ArmEmitter: Support 32-bit bitmask moves
Noticed this when inspecting some code that was moving constant
`0x80808080` in to a register. Was using two move instructions when it
could have used a single bitmask move.

This now checks to see if a constant can be 32-bit encoded in a logical
bitmask move and uses that.
2023-07-16 18:52:34 -07:00
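A note on 54b7a43b95: whether a 32-bit constant such as `0x80808080` fits a logical bitmask immediate can be checked as sketched below. The rule used is the general AArch64 one: the value must be a repetition of a 2, 4, 8, 16 or 32-bit element that is a rotated run of contiguous ones. `is_logical_imm32` is a hypothetical name, not the ArmEmitter's function.

```cpp
#include <cassert>
#include <cstdint>

// Population count, written out so no intrinsics are needed.
static unsigned popcount64(uint64_t v) {
  unsigned n = 0;
  for (; v != 0; v &= v - 1) ++n;
  return n;
}

// Hypothetical check for 32-bit logical (bitmask) immediate encodability.
bool is_logical_imm32(uint32_t value) {
  for (unsigned e = 2; e <= 32; e *= 2) {
    // The value must repeat with period e (trivially true for e == 32).
    if (e < 32) {
      const uint32_t rotated = (value >> e) | (value << (32 - e));
      if (rotated != value) continue;
    }
    const uint64_t mask = (1ull << e) - 1;
    const uint64_t elem = value & mask;
    // Rotate the element by one bit within its own width.
    const uint64_t rot1 = ((elem >> 1) | (elem << (e - 1))) & mask;
    // A cyclic run of ones has exactly two 0/1 transitions. All-zeros and
    // all-ones have none, so they are rejected here as well.
    if (popcount64(elem ^ rot1) == 2) return true;
  }
  return false;
}
```

For `0x80808080` the 8-bit element `0x80` repeats four times and is a single set bit, so the check succeeds and one move suffices.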
Ryan Houdek
ead43c6a51 OpcodeDispatcher: Minor optimization to FILD
Removes one instruction from FILD, or two instructions if the CPU
supports the CSSC extension.

Going from 47 instructions to 46/45.
2023-07-16 01:28:57 -07:00
Ryan Houdek
e49de77225 IR: Implements a 2's complement Integer absolute
Supports CSSC extension.
2023-07-16 01:28:57 -07:00
Ryan Houdek
8f4fe39b7d Emitter: Adds support for CNEG instruction alias
Tests included.
2023-07-16 00:25:52 -07:00
Ryan Houdek
f250509718 OpcodeDispatcher: Minor optimization to phminposuw
This instruction doesn't match ARM semantics very well since it returns
the position of the minimum element.

But at the very least the insert in to the final instruction can be a
bit more optimal. Converts a 5-instruction eor+mov+mov+mov+mov sequence
in to a 2-instruction mov+mov sequence.

This works because `VUMinV` already zero extends the vector so the
position only needs to be inserted at the end.
2023-07-15 23:23:32 -07:00
Ryan Houdek
6179c5a13e OpcodeDispatcher: Optimize 32/64-bit SH{L,R}D with extr
32-bit and 64-bit SH{L,R}D matches behaviour of EXTR. Optimize to using
this op in that case.
This converts the lsl+lsr+orr sequence in to a single extr instruction.

16-bit still goes down the old path.

Weirdly this code manages to have a bad insert for no reason? But
unrelated since this happens in the old code as well.

```
  %4(GPRFixed3) i64 = LoadRegister #0x0, #0x20, GPR, GPRFixed, u8:Tmp:Size
  %5(GPR0) i64 = LoadRegister #0x0, #0x8, GPR, GPRFixed, u8:Tmp:Size
  %6(GPRFixed0) i64 = Extr %5(GPR0) i64, %4(GPRFixed3) i64, #0x3e
```

Not sure why the SRA fails on that second LoadRegister.
2023-07-15 22:05:39 -07:00
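A note on 6179c5a13e: the match between 64-bit SH{L,R}D and EXTR can be modelled as below. This is a sketch of the arithmetic only, with hypothetical function names, and it ignores flags. `extr64` follows the AArch64 definition: take 64 bits out of the 128-bit concatenation `hi:lo`, starting at bit `lsb`.

```cpp
#include <cassert>
#include <cstdint>

// Model of AArch64 EXTR on 64-bit registers.
uint64_t extr64(uint64_t hi, uint64_t lo, unsigned lsb) {
  lsb &= 63;
  if (lsb == 0) return lo;  // avoid an undefined 64-bit shift in C++
  return (lo >> lsb) | (hi << (64 - lsb));
}

// SHRD dst, src, count: shift dst right, filling from the low bits of src.
uint64_t shrd64(uint64_t dst, uint64_t src, unsigned count) {
  count &= 63;
  return extr64(src, dst, count);
}

// SHLD dst, src, count: shift dst left, filling from the high bits of src.
uint64_t shld64(uint64_t dst, uint64_t src, unsigned count) {
  count &= 63;
  if (count == 0) return dst;
  return extr64(dst, src, 64 - count);
}
```

Under this model a left double shift by 2 uses an extract position of 62, which is consistent with the `#0x3e` operand in the IR dump above.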
Ryan Houdek
9e14a83442 SignalDelegator: Moves last TLS variable to the frontend
There was one holdout variable that was in a TLS object in FEXCore. Move
it to the frontend with the rest of the TLS variables.

This lets the "Frontend" TLS management be the only TLS
management.
2023-07-15 20:28:58 -07:00
Ryan Houdek
95bfd003d2 IR/Passes: Fixes DeadStoreElimination pass
This pass is currently doing nothing in main.
Ever since we have enforced that LoadContext/StoreContext doesn't touch
GPRs and FPRs, this has only been eliminating flags.

Remove that usage of LoadContext/StoreContext and replace it with their
replacements, LoadRegister/StoreRegister, for tracking GPR and FPR
accesses.

Stripped from #2700 since this is safe to merge.
2023-07-14 22:22:26 -07:00
Ryan Houdek
08ca43c3c4 ConstProp: Fix shift mask in const-prop
Noticed while looking at #2700.

Testing doesn't currently see this as a bug but will once #2700 starts
optimizing StoreRegister+LoadRegister pairs.

Doesn't fix the issues in that PR, but this is one.
2023-07-14 18:14:57 -07:00
Lioncache
1a18bbb966 Arm64: Emitter: Handle LD1{*}/LDFF1{*} Vector + Immediate encodings
While we're in the area implementing the Scalar + Vector variants,
we may as well cross off the Vector + Immediate variants and
complete all of the load variants for the regular LD1{*} loads
2023-07-14 00:50:48 -04:00
Ryan Houdek
699c3f5762
Merge pull request #2773 from lioncash/memop
Arm64/Emitter: Simplify SVEMemOperand data union
2023-07-13 18:24:44 -07:00
Lioncache
c1205eb809 Arm64/Emitter: Mark SVEMemOperand Type enum as enum class
Now that we have helpers to make querying a little less verbose, we can mark
the enum as an enum class to get rid of implicit conversions.
2023-07-13 15:28:17 -04:00
Lioncache
31b7cd77e9 Arm64/Emitter: Simplify SVEMemOperand data union
We can just move the header out of the union, since it's present in all cases.
2023-07-13 15:28:14 -04:00
Ryan Houdek
68a2441e65
Merge pull request #2772 from lioncash/insrem
OpcodeDispatcher: Narrow use of LoadXMMRegister in StoreResult_WithOpSize
2023-07-13 12:07:05 -07:00
Ryan Houdek
1d7b4bb522
Merge pull request #2768 from alyssarosenzweig/fix/pf
OpcodeDispatcher: Fix and optimize PF calculation
2023-07-13 11:54:30 -07:00
Ryan Houdek
22f95e627d
Merge pull request #2769 from random415/main
fix spelling errors
2023-07-13 11:48:39 -07:00
Ryan Houdek
599b64e975
Merge pull request #2771 from alyssarosenzweig/print/de-ssa
IR: Print SSA values as %123 instead of %ssa123
2023-07-13 11:48:28 -07:00
Lioncache
58c93568f6 OpcodeDispatcher: Narrow use of LoadXMMRegister in StoreResult_WithOpSize
This only needs to be loaded when a partial insert needs to be performed,
so we can narrow its scope instead of always loading it in the AVX case.
2023-07-13 14:23:34 -04:00
Alyssa Rosenzweig
491e4e2c23 IR: Print SSA values as %123 instead of %ssa123
This is less noisy with no loss of clarity, and follows the notation
used by both LLVM IR and NIR. (So, it should be familiar.)

Change done with:

    sed -i -e 's/%ssa/%/g' $(git grep -l '%ssa')

Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
2023-07-13 11:52:36 -04:00
Lioncache
1ed2e24fba Arm64/Emitter: Handle LDFF1{*} scalar plus vector variants
These can use the same handling code that we introduced for the normal
LD1{*} gathers, so we may as well expose support for them.
2023-07-13 11:25:44 -04:00
Lioncache
1fbb8bd78f Arm64/Emitter: Handle LD1{*} Scalar + Vector variants
2023-07-13 11:25:41 -04:00
Elias James Howell
b953433404 fix spelling errors
Fixing some minor spelling errors which should not affect functionality but improve the overall quality of documentation.
2023-07-13 11:23:59 -04:00
Alyssa Rosenzweig
eae950be16 OpcodeDispatcher: Use vector ops for PF calculation
On current targets, popcount is a vector op. By using VPopcount
ourselves when calculating, we can reduce some pointless masking.
Before:

    and x22, x4, #0xff
    fmov d0, x22
    cnt v0.8b, v0.8b
    addv b0, v0.8b
    umov w22, v0.b[0]
    eor x22, x22, #0x1
    strb w22, [x28, #706]

After:

    eor x22, x4, #0x1
    fmov s4, w22
    cnt v4.16b, v4.16b
    umov w22, v4.b[0]
    strb w22, [x28, #706]

llvm-mca before:

    Iterations:        100
    Instructions:      700
    Total Cycles:      2002
    Total uOps:        700

    Dispatch Width:    2
    uOps Per Cycle:    0.35
    IPC:               0.35
    Block RThroughput: 3.5

llvm-mca after:

    Iterations:        100
    Instructions:      500
    Total Cycles:      1402
    Total uOps:        500

    Dispatch Width:    2
    uOps Per Cycle:    0.36
    IPC:               0.36
    Block RThroughput: 2.5

Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
2023-07-13 11:08:47 -04:00
Alyssa Rosenzweig
7e6bb04db1 OpcodeDispatcher: Extract CalculatePF
This does duplicate the _Constant(1) but it doesn't matter because it
gets inlined into the eor anyway. There is no functional change here.

Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
2023-07-13 10:05:18 -04:00
Alyssa Rosenzweig
716cac35a8 OpcodeDispatcher: Fix PF calculation
We store garbage in the upper bits. That's ok, but it means we need to
mask on read for correct behaviour.

Closes #2767

Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
2023-07-13 08:38:46 -04:00
Ryan Houdek
9722c4c5a4
Merge pull request #2766 from alyssarosenzweig/flags/add-of
OpcodeDispatcher: Optimize ADD/ADC OF flag packing
2023-07-12 15:47:21 -07:00
Alyssa Rosenzweig
e8c0e19afc OpcodeDispatcher: "Calculcate" -> "Calculate"
Typofix.

Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
2023-07-12 18:07:04 -04:00
Alyssa Rosenzweig
c559fec959 OpcodeDispatcher: Optimize ADD/ADC OF flag packing
We can fold the Not into the And. This requires flipping the arguments
to Andn, but we do not flip the order of the assignments since that
requires an extra register in a test I'm looking at.

Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
2023-07-12 18:06:36 -04:00
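A note on c559fec959: folding the Not into the And gives an and-not form of the overflow check. The sketch below uses the standard signed-overflow formula for addition, which may differ in detail from what the OpcodeDispatcher emits. `add_overflow32` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>

// OF for a 32-bit ADD/ADC: set when both operands share a sign and the
// result's sign differs from it.
bool add_overflow32(uint32_t a, uint32_t b, uint32_t carry_in = 0) {
  const uint32_t res = a + b + carry_in;
  // (a ^ res): the result sign differs from a's sign.
  // ~(a ^ b):  a and b have the same sign.
  // Writing the check as "x & ~y" lets it map onto a single and-not
  // operation instead of a separate Not followed by an And.
  const uint32_t of = (a ^ res) & ~(a ^ b);
  return (of >> 31) != 0;
}
```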
Alyssa Rosenzweig
8d2fabe705 OpcodeDispatcher: Deduplicate ADD/ADC OF generation
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
2023-07-12 18:06:36 -04:00
Ryan Houdek
5dbd1b8dc2 FEXCore: Removes unused TLS variable
Not sure why this still existed.
2023-07-12 13:05:47 -07:00
Ryan Houdek
5fef0c29aa FEXCore: Rename Telemetry helper function GetObject
WIN32 has a define already called `GetObject` and will cause our
symbol to have an A appended to it and break linking.

Just rename it to `GetTelemetryValue`
2023-07-12 11:53:13 -07:00
Ryan Houdek
d387c46aab FEXCore: Fixes WIN32 compiling again
Mostly a quick bandage while I'm getting ready to set up the
runners to test this for us.
2023-07-12 11:53:13 -07:00
Mai
ddd6dbfdcc
Merge pull request #2759 from Sonicadvance1/redundant_bfe_flags
OpcodeDispatcher: Remove spurious bfe with flag storing
2023-07-10 22:19:21 -04:00
Mai
7f2557e322
Merge pull request #2757 from Sonicadvance1/optimize_movss_reg
OpcodeDispatcher: Optimize MOVSS to register
2023-07-10 21:22:47 -04:00
Mai
810c7d926c
Merge pull request #2758 from Sonicadvance1/optimize_tso_vector_loadstores
IR: Optimize vector TSO loadstore address calculation
2023-07-10 21:21:46 -04:00
Ryan Houdek
04c325661c OpcodeDispatcher: Remove spurious bfe with flag storing
Noticed during introspection that we were generating zero constants
redundantly. Bunch of single cycle hits or zero-register renames.

Every time a `SetRFLAG` helper was called, it was /always/ doing a BFE
on everything passed in to extract the lowest bit. In nearly all cases
the data getting passed in is already only the lowest bit.

Instead, stop the helper from doing this BFE, and ensure the
OpcodeDispatcher does BFE in the couple of cases it still needs to do.

As I was skimming through all these to ensure BFE isn't necessary, I did
notice that some of the BCD instructions are wrong or questionable. So I
left a comment on those so we can come back to it.
2023-07-10 18:03:23 -07:00
Ryan Houdek
2d800b2627 IR: Optimize vector TSO loadstore address calculation
These address calculations were failing to understand that they can be
optimized. When TSO emulation is disabled these were fine, but with TSO
we were eating one more instruction.

Before:
```
add x20, x12, #0x4 (4)
dmb ish
ldr s16, [x20]
dmb ish
```

After:
```
dmb ish
ldr s16, [x12, #4]
dmb ish
```

Also left a note that once LRCPC3 is supported in hardware that we can do a similar optimization there.
2023-07-10 15:21:46 -07:00
Ryan Houdek
55ed3e0549 OpcodeDispatcher: Optimize MOVSS to register
Easily fixed. Found through inspection.

Before:
```
eor v0.16b, v0.16b, v0.16b
mov v0.s[0], v17.s[0]
mov v4.16b, v0.16b
mov v16.s[0], v4.s[0]
```

After:
```
mov v16.s[0], v17.s[0]
```
2023-07-10 14:36:27 -07:00
Ryan Houdek
55d084ebb0 OpcodeDispatcher: Optimize MOVSS to memory destination
Easily fixed. Found through inspection.

Before:
```
eor v0.16b, v0.16b, v0.16b
mov v0.s[0], v16.s[0]
mov v4.16b, v0.16b
str s4, [x11]
```

After:
```
str s16, [x11]
```
2023-07-10 14:25:01 -07:00
Mai
98eda5e163
Merge pull request #2749 from Sonicadvance1/optimize_away_redundant_masks
OpcodeDispatcher: Optimize some shifts size masking
2023-07-10 08:08:57 -04:00
Ryan Houdek
92d0344d6a OpcodeDispatcher: Fixes bug with pcmpestri
When this instruction returns the index into the ecx register, this is
defined as a 32-bit result. This means it actually gets zero-extended to
the full 64-bit GPR size on 64-bit processes.
Previously FEX was doing a 32-bit insert which leaves garbage data in
the upper 32-bits of the RCX register.

Adds a unit test to ensure the result is zero extended.
Fixes running Java games under FEX now that SSE4.2 is exposed.
2023-07-08 18:08:47 -07:00
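The difference between the old and new behaviour in 92d0344d6a can be modelled in a few lines. This is a sketch with hypothetical function names, treating the 64-bit GPR as a plain integer; it is not the dispatcher code itself.

```cpp
#include <cstdint>

// Models a 32-bit insert into a 64-bit register:
// the upper 32 bits keep whatever was there before.
constexpr uint64_t Insert32(uint64_t OldReg, uint32_t Value) {
  return (OldReg & 0xFFFFFFFF00000000ULL) | Value;
}

// Models an x86-64 32-bit register write:
// the result is zero-extended, so the upper 32 bits are cleared.
constexpr uint64_t Write32ZeroExtend(uint64_t /*OldReg*/, uint32_t Value) {
  return static_cast<uint64_t>(Value);
}
```

With stale data in the upper half of RCX, `Insert32` leaves it in place while `Write32ZeroExtend` produces just the index, which is what a 64-bit process expects.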
Ryan Houdek
9327435f97 OpcodeDispatcher: Optimize some shifts size masking
Inspired by #2561, these shifts don't need to be masked if we know
their operating size up front.

Causes a handful of these to become more optimal.
2023-07-08 16:41:15 -07:00
Mai
8a4bfba47c
Merge pull request #2745 from Sonicadvance1/optimize_fcmov
OpcodeDispatcher: Optimize GetPackedRFLAG
2023-07-07 22:29:52 -04:00
Mai
69ea03f0eb
Merge pull request #2746 from Sonicadvance1/optimize_maskmov
OpcodeDispatcher: Optimize MASKMOVDQU and MASKMOVQ
2023-07-07 22:29:37 -04:00
Ryan Houdek
15f5fe658b OpcodeDispatcher: Optimize MASKMOVDQU and MASKMOVQ
This previous implementation was particularly gnarly. Because these
instructions are both weakly ordered and have implementation dependent
exception and trap behaviour, these can actually be fairly conveniently
converted over to a load + cmlt + bsl + str instruction sequence.

For the XMM variant this reduces code blowup from 80x to 15x!
For the MMX variant this reduces code blowup from 46x to 17x!

Both of these improvements are significant wins! There's still some
minor improvement that could be done with bsl that requires some
redundant moves, but since we don't have constraint support for this we
still eat two additional instructions.

Before:
```asm
0x0000ffff7b800718  10ffffe0    adr x0, #-0x4 (addr 0xffff7b800714)
0x0000ffff7b80071c  f9005f80    str x0, [x28, #184]
0x0000ffff7b800720  4eb11e24    mov v4.16b, v17.16b
0x0000ffff7b800724  4eb01e05    mov v5.16b, v16.16b
0x0000ffff7b800728  aa0b03f4    mov x20, x11
0x0000ffff7b80072c  4e083c95    mov x21, v4.d[0]
0x0000ffff7b800730  4e083cb6    mov x22, v5.d[0]
0x0000ffff7b800734  d3471eb7    ubfx x23, x21, #7, #1
0x0000ffff7b800738  b4000077    cbz x23, #+0xc (addr 0xffff7b800744)
0x0000ffff7b80073c  d3401ed7    uxtb x23, w22
0x0000ffff7b800740  39000297    strb w23, [x20]
0x0000ffff7b800744  d34f3eb7    ubfx x23, x21, #15, #1
0x0000ffff7b800748  b4000077    cbz x23, #+0xc (addr 0xffff7b800754)
0x0000ffff7b80074c  d3483ed7    ubfx x23, x22, #8, #8
0x0000ffff7b800750  39000697    strb w23, [x20, #1]
0x0000ffff7b800754  d3575eb7    ubfx x23, x21, #23, #1
0x0000ffff7b800758  b4000077    cbz x23, #+0xc (addr 0xffff7b800764)
0x0000ffff7b80075c  d3505ed7    ubfx x23, x22, #16, #8
0x0000ffff7b800760  39000a97    strb w23, [x20, #2]
0x0000ffff7b800764  d35f7eb7    ubfx x23, x21, #31, #1
0x0000ffff7b800768  b4000077    cbz x23, #+0xc (addr 0xffff7b800774)
0x0000ffff7b80076c  d3587ed7    ubfx x23, x22, #24, #8
0x0000ffff7b800770  39000e97    strb w23, [x20, #3]
0x0000ffff7b800774  d3679eb7    ubfx x23, x21, #39, #1
0x0000ffff7b800778  b4000077    cbz x23, #+0xc (addr 0xffff7b800784)
0x0000ffff7b80077c  d3609ed7    ubfx x23, x22, #32, #8
0x0000ffff7b800780  39001297    strb w23, [x20, #4]
0x0000ffff7b800784  d36fbeb7    ubfx x23, x21, #47, #1
0x0000ffff7b800788  b4000077    cbz x23, #+0xc (addr 0xffff7b800794)
0x0000ffff7b80078c  d368bed7    ubfx x23, x22, #40, #8
0x0000ffff7b800790  39001697    strb w23, [x20, #5]
0x0000ffff7b800794  d377deb7    ubfx x23, x21, #55, #1
0x0000ffff7b800798  b4000077    cbz x23, #+0xc (addr 0xffff7b8007a4)
0x0000ffff7b80079c  d370ded7    ubfx x23, x22, #48, #8
0x0000ffff7b8007a0  39001a97    strb w23, [x20, #6]
0x0000ffff7b8007a4  d37ffeb5    lsr x21, x21, #63
0x0000ffff7b8007a8  b4000075    cbz x21, #+0xc (addr 0xffff7b8007b4)
0x0000ffff7b8007ac  d378fed5    lsr x21, x22, #56
0x0000ffff7b8007b0  39001e95    strb w21, [x20, #7]
0x0000ffff7b8007b4  4e183c95    mov x21, v4.d[1]
0x0000ffff7b8007b8  4e183cb6    mov x22, v5.d[1]
0x0000ffff7b8007bc  d3471eb7    ubfx x23, x21, #7, #1
0x0000ffff7b8007c0  b4000077    cbz x23, #+0xc (addr 0xffff7b8007cc)
0x0000ffff7b8007c4  d3401ed7    uxtb x23, w22
0x0000ffff7b8007c8  39002297    strb w23, [x20, #8]
0x0000ffff7b8007cc  d34f3eb7    ubfx x23, x21, #15, #1
0x0000ffff7b8007d0  b4000077    cbz x23, #+0xc (addr 0xffff7b8007dc)
0x0000ffff7b8007d4  d3483ed7    ubfx x23, x22, #8, #8
0x0000ffff7b8007d8  39002697    strb w23, [x20, #9]
0x0000ffff7b8007dc  d3575eb7    ubfx x23, x21, #23, #1
0x0000ffff7b8007e0  b4000077    cbz x23, #+0xc (addr 0xffff7b8007ec)
0x0000ffff7b8007e4  d3505ed7    ubfx x23, x22, #16, #8
0x0000ffff7b8007e8  39002a97    strb w23, [x20, #10]
0x0000ffff7b8007ec  d35f7eb7    ubfx x23, x21, #31, #1
0x0000ffff7b8007f0  b4000077    cbz x23, #+0xc (addr 0xffff7b8007fc)
0x0000ffff7b8007f4  d3587ed7    ubfx x23, x22, #24, #8
0x0000ffff7b8007f8  39002e97    strb w23, [x20, #11]
0x0000ffff7b8007fc  d3679eb7    ubfx x23, x21, #39, #1
0x0000ffff7b800800  b4000077    cbz x23, #+0xc (addr 0xffff7b80080c)
0x0000ffff7b800804  d3609ed7    ubfx x23, x22, #32, #8
0x0000ffff7b800808  39003297    strb w23, [x20, #12]
0x0000ffff7b80080c  d36fbeb7    ubfx x23, x21, #47, #1
0x0000ffff7b800810  b4000077    cbz x23, #+0xc (addr 0xffff7b80081c)
0x0000ffff7b800814  d368bed7    ubfx x23, x22, #40, #8
0x0000ffff7b800818  39003697    strb w23, [x20, #13]
0x0000ffff7b80081c  d377deb7    ubfx x23, x21, #55, #1
0x0000ffff7b800820  b4000077    cbz x23, #+0xc (addr 0xffff7b80082c)
0x0000ffff7b800824  d370ded7    ubfx x23, x22, #48, #8
0x0000ffff7b800828  39003a97    strb w23, [x20, #14]
0x0000ffff7b80082c  d37ffeb5    lsr x21, x21, #63
0x0000ffff7b800830  b4000075    cbz x21, #+0xc (addr 0xffff7b80083c)
0x0000ffff7b800834  d378fed5    lsr x21, x22, #56
0x0000ffff7b800838  39003e95    strb w21, [x20, #15]
0x0000ffff7b80083c  58000040    ldr x0, pc+8 (addr 0xffff7b800844)
0x0000ffff7b800840  d63f0000    blr x0
```

After:
```asm
0x0000ffff7ac00718  10ffffe0            adr x0, #-0x4 (addr 0xffff7ac00714)
0x0000ffff7ac0071c  f9005f80            str x0, [x28, #184]
0x0000ffff7ac00720  4e20aa24            cmlt v4.16b, v17.16b, #0
0x0000ffff7ac00724  3dc00165            ldr q5, [x11]
0x0000ffff7ac00728  4ea41c80            mov v0.16b, v4.16b
0x0000ffff7ac0072c  6e651e00            bsl v0.16b, v16.16b, v5.16b
0x0000ffff7ac00730  4ea01c04            mov v4.16b, v0.16b
0x0000ffff7ac00734  3d800164            str q4, [x11]
0x0000ffff7ac00738  58000040            ldr x0, pc+8 (addr 0xffff7ac00740)
0x0000ffff7ac0073c  d63f0000            blr x0
```
2023-07-07 18:37:17 -07:00
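The load + cmlt + bsl + str sequence from 15f5fe658b can be modelled byte by byte in C++. This is a scalar sketch with hypothetical names (`Vec16`, `MaskedStore`), intended only to show the selection logic, not the emitted code.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Vec16 = std::array<uint8_t, 16>;

// Byte-wise model of the XMM variant: a byte of Data is stored only when the
// top bit of the matching Mask byte is set.
inline void MaskedStore(Vec16 &Memory, const Vec16 &Data, const Vec16 &Mask) {
  Vec16 Loaded = Memory; // load: read the full 16 bytes at the destination
  for (std::size_t i = 0; i < Loaded.size(); ++i) {
    // cmlt #0: a byte that is negative as a signed value becomes all-ones.
    const uint8_t Select = (Mask[i] & 0x80) ? 0xFF : 0x00;
    // bsl: take Data where Select is set, keep the loaded byte elsewhere.
    Loaded[i] = static_cast<uint8_t>((Data[i] & Select) | (Loaded[i] & ~Select));
  }
  Memory = Loaded; // str: write all 16 bytes back
}
```

Unlike the per-byte branch-and-store version, this reads and rewrites bytes whose mask bit is clear; the commit message's point about weak ordering and implementation dependent fault behaviour is what makes that acceptable.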
Ryan Houdek
052aa4317b OpcodeDispatcher: Optimize GetPackedRFLAG
Only return the particular flags that are being requested in the moment
since compacting them all when requested is fairly slow.

x87 fcmov in particular was requesting all the flags when it only needs
a couple.
This reduces a `fcmovb` instruction count blowup from 103x to 38x. Still
more room to go but this one stood out as being particularly bad.

Old:
```asm
0x0000000265a002bc  10ffffe0    adr x0, #-0x4 (addr 0x265a002b8)
0x0000000265a002c0  f9005f80    str x0, [x28, #184]
0x0000000265a002c4  d2800014    mov x20, #0x0
0x0000000265a002c8  d2800035    mov x21, #0x1
0x0000000265a002cc  d2800056    mov x22, #0x2
0x0000000265a002d0  394b0397    ldrb w23, [x28, #704]
0x0000000265a002d4  d3407ef7    ubfx x23, x23, #0, #32
0x0000000265a002d8  aa1702d6    orr x22, x22, x23
0x0000000265a002dc  394b0b97    ldrb w23, [x28, #706]
0x0000000265a002e0  d3407ef7    ubfx x23, x23, #0, #32
0x0000000265a002e4  531e76f7    lsl w23, w23, #2
0x0000000265a002e8  aa1702d6    orr x22, x22, x23
0x0000000265a002ec  394b1397    ldrb w23, [x28, #708]
0x0000000265a002f0  d3407ef7    ubfx x23, x23, #0, #32
0x0000000265a002f4  531c6ef7    lsl w23, w23, #4
0x0000000265a002f8  aa1702d6    orr x22, x22, x23
0x0000000265a002fc  394b1b97    ldrb w23, [x28, #710]
0x0000000265a00300  d3407ef7    ubfx x23, x23, #0, #32
0x0000000265a00304  531a66f7    lsl w23, w23, #6
0x0000000265a00308  aa1702d6    orr x22, x22, x23
0x0000000265a0030c  394b1f97    ldrb w23, [x28, #711]
0x0000000265a00310  d3407ef7    ubfx x23, x23, #0, #32
0x0000000265a00314  531962f7    lsl w23, w23, #7
0x0000000265a00318  aa1702d6    orr x22, x22, x23
0x0000000265a0031c  394b2397    ldrb w23, [x28, #712]
0x0000000265a00320  d3407ef7    ubfx x23, x23, #0, #32
0x0000000265a00324  53185ef7    lsl w23, w23, #8
0x0000000265a00328  aa1702d6    orr x22, x22, x23
0x0000000265a0032c  394b2797    ldrb w23, [x28, #713]
0x0000000265a00330  d3407ef7    ubfx x23, x23, #0, #32
0x0000000265a00334  53175af7    lsl w23, w23, #9
0x0000000265a00338  aa1702d6    orr x22, x22, x23
0x0000000265a0033c  394b2b97    ldrb w23, [x28, #714]
0x0000000265a00340  d3407ef7    ubfx x23, x23, #0, #32
0x0000000265a00344  531656f7    lsl w23, w23, #10
0x0000000265a00348  aa1702d6    orr x22, x22, x23
0x0000000265a0034c  394b2f97    ldrb w23, [x28, #715]
0x0000000265a00350  d3407ef7    ubfx x23, x23, #0, #32
0x0000000265a00354  531552f7    lsl w23, w23, #11
0x0000000265a00358  aa1702d6    orr x22, x22, x23
0x0000000265a0035c  394b3397    ldrb w23, [x28, #716]
0x0000000265a00360  d3407ef7    ubfx x23, x23, #0, #32
0x0000000265a00364  53144ef7    lsl w23, w23, #12
0x0000000265a00368  aa1702d6    orr x22, x22, x23
0x0000000265a0036c  394b3b97    ldrb w23, [x28, #718]
0x0000000265a00370  d3407ef7    ubfx x23, x23, #0, #32
0x0000000265a00374  531246f7    lsl w23, w23, #14
0x0000000265a00378  aa1702d6    orr x22, x22, x23
0x0000000265a0037c  394b4397    ldrb w23, [x28, #720]
0x0000000265a00380  d3407ef7    ubfx x23, x23, #0, #32
0x0000000265a00384  53103ef7    lsl w23, w23, #16
0x0000000265a00388  aa1702d6    orr x22, x22, x23
0x0000000265a0038c  394b4797    ldrb w23, [x28, #721]
0x0000000265a00390  d3407ef7    ubfx x23, x23, #0, #32
0x0000000265a00394  530f3af7    lsl w23, w23, #17
0x0000000265a00398  aa1702d6    orr x22, x22, x23
0x0000000265a0039c  394b4b97    ldrb w23, [x28, #722]
0x0000000265a003a0  d3407ef7    ubfx x23, x23, #0, #32
0x0000000265a003a4  530e36f7    lsl w23, w23, #18
0x0000000265a003a8  aa1702d6    orr x22, x22, x23
0x0000000265a003ac  394b4f97    ldrb w23, [x28, #723]
0x0000000265a003b0  d3407ef7    ubfx x23, x23, #0, #32
0x0000000265a003b4  530d32f7    lsl w23, w23, #19
0x0000000265a003b8  aa1702d6    orr x22, x22, x23
0x0000000265a003bc  394b5397    ldrb w23, [x28, #724]
0x0000000265a003c0  d3407ef7    ubfx x23, x23, #0, #32
0x0000000265a003c4  530c2ef7    lsl w23, w23, #20
0x0000000265a003c8  aa1702d6    orr x22, x22, x23
0x0000000265a003cc  394b5797    ldrb w23, [x28, #725]
0x0000000265a003d0  d3407ef7    ubfx x23, x23, #0, #32
0x0000000265a003d4  530b2af7    lsl w23, w23, #21
0x0000000265a003d8  aa1702d6    orr x22, x22, x23
0x0000000265a003dc  924002d6    and x22, x22, #0x1
0x0000000265a003e0  93400294    sbfx x20, x20, #0, #1
0x0000000265a003e4  934002b5    sbfx x21, x21, #0, #1
0x0000000265a003e8  f10002df    cmp x22, #0x0 (0)
0x0000000265a003ec  9a950294    csel x20, x20, x21, eq
0x0000000265a003f0  4e080e84    dup v4.2d, x20
0x0000000265a003f4  394baf94    ldrb w20, [x28, #747]
0x0000000265a003f8  91000695    add x21, x20, #0x1 (1)
0x0000000265a003fc  92400ab5    and x21, x21, #0x7
0x0000000265a00400  d2800200    mov x0, #0x10
0x0000000265a00404  9b007e80    mul x0, x20, x0
0x0000000265a00408  8b000380    add x0, x28, x0
0x0000000265a0040c  3dc0bc05    ldr q5, [x0, #752]
0x0000000265a00410  d2800200    mov x0, #0x10
0x0000000265a00414  9b007ea0    mul x0, x21, x0
0x0000000265a00418  8b000380    add x0, x28, x0
0x0000000265a0041c  3dc0bc06    ldr q6, [x0, #752]
0x0000000265a00420  4ea41c80    mov v0.16b, v4.16b
0x0000000265a00424  6e651cc0    bsl v0.16b, v6.16b, v5.16b
0x0000000265a00428  4ea01c04    mov v4.16b, v0.16b
0x0000000265a0042c  d2800200    mov x0, #0x10
0x0000000265a00430  9b007e80    mul x0, x20, x0
0x0000000265a00434  8b000380    add x0, x28, x0
0x0000000265a00438  3d80bc04    str q4, [x0, #752]
0x0000000265a0043c  58000040    ldr x0, pc+8 (addr 0x265a00444)
0x0000000265a00440  d63f0000    blr x0
```

New:
```asm
0x0000000265a002bc  10ffffe0    adr x0, #-0x4 (addr 0x265a002b8)
0x0000000265a002c0  f9005f80    str x0, [x28, #184]
0x0000000265a002c4  d2800014    mov x20, #0x0
0x0000000265a002c8  d2800035    mov x21, #0x1
0x0000000265a002cc  d2800056    mov x22, #0x2
0x0000000265a002d0  394b0397    ldrb w23, [x28, #704]
0x0000000265a002d4  330002f6    bfxil w22, w23, #0, #1
0x0000000265a002d8  924002d6    and x22, x22, #0x1
0x0000000265a002dc  93400294    sbfx x20, x20, #0, #1
0x0000000265a002e0  934002b5    sbfx x21, x21, #0, #1
0x0000000265a002e4  f10002df    cmp x22, #0x0 (0)
0x0000000265a002e8  9a950294    csel x20, x20, x21, eq
0x0000000265a002ec  4e080e84    dup v4.2d, x20
0x0000000265a002f0  394baf94    ldrb w20, [x28, #747]
0x0000000265a002f4  91000695    add x21, x20, #0x1 (1)
0x0000000265a002f8  92400ab5    and x21, x21, #0x7
0x0000000265a002fc  d2800200    mov x0, #0x10
0x0000000265a00300  9b007e80    mul x0, x20, x0
0x0000000265a00304  8b000380    add x0, x28, x0
0x0000000265a00308  3dc0bc05    ldr q5, [x0, #752]
0x0000000265a0030c  d2800200    mov x0, #0x10
0x0000000265a00310  9b007ea0    mul x0, x21, x0
0x0000000265a00314  8b000380    add x0, x28, x0
0x0000000265a00318  3dc0bc06    ldr q6, [x0, #752]
0x0000000265a0031c  4ea41c80    mov v0.16b, v4.16b
0x0000000265a00320  6e651cc0    bsl v0.16b, v6.16b, v5.16b
0x0000000265a00324  4ea01c04    mov v4.16b, v0.16b
0x0000000265a00328  d2800200    mov x0, #0x10
0x0000000265a0032c  9b007e80    mul x0, x20, x0
0x0000000265a00330  8b000380    add x0, x28, x0
0x0000000265a00334  3d80bc04    str q4, [x0, #752]
0x0000000265a00338  58000040    ldr x0, pc+8 (addr 0x265a00340)
0x0000000265a0033c  d63f0000    blr x0
```
2023-07-07 17:01:59 -07:00
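The idea behind 052aa4317b, packing only the flags a caller asks for, can be sketched as follows. The storage layout (one byte per flag bit position) is suggested by the `ldrb` sequence in the old output, but the type and function names here are hypothetical and this is not FEX's helper.

```cpp
#include <array>
#include <cstdint>

// Hypothetical flag storage: one byte per EFLAGS bit position.
using FlagBytes = std::array<uint8_t, 32>;

constexpr uint32_t FLAG_CF = 1u << 0;
constexpr uint32_t FLAG_PF = 1u << 2;
constexpr uint32_t FLAG_ZF = 1u << 6;

// Packs only the flags named in Requested; every other bit stays zero and
// costs no load, shift, or or.
inline uint32_t GetPackedFlags(const FlagBytes &Flags, uint32_t Requested) {
  uint32_t Packed = 0;
  for (unsigned Bit = 0; Bit < 32; ++Bit) {
    if (Requested & (1u << Bit)) {
      Packed |= static_cast<uint32_t>(Flags[Bit] & 1) << Bit;
    }
  }
  return Packed;
}
```

A caller such as `fcmovb`, which only needs CF, would pass `FLAG_CF` and pay for a single flag load rather than the full set.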
Ryan Houdek
debcb0e047 Arm64: Optimize BFI in the case that Dst == srcDst
ARM64 BFI doesn't allow you to encode two source registers here to match
our SSA semantics. Also since we don't support RA constraints to ensure
that these match, just do the optimal case in the backend.

Leave a comment for future RA constraint excavators to make this more
optimal
2023-07-07 16:43:41 -07:00
Ryan Houdek
baf04b6a41 FEXCore: Minor cleanup
This isn't required anymore since we are exposing the virtual class
directly.
2023-07-07 15:06:14 -07:00
Ryan Houdek
f9b352a093 Linux: Fixes hangs due to mutexes locked while fork happens.
When a fork occurs FEX needs to be incredibly careful as any thread
(that isn't forking) that holds a lock will vanish when the fork occurs.

At this point if the newly forked process tries to use these mutexes
then the process hangs indefinitely.

The three major mutexes that need to be held during a fork:
- Code Invalidation mutex
  - This is the highest priority and causes us to hang frequently.
  - This is highly likely to occur when one thread is loading shared
    libraries and another thread is forking.
     - Happens frequently with Wine and Steam.
- VMA tracking mutex
  - This one happens when one thread is allocating memory while a fork
    occurs.
  - This closely relates to the code invalidation mutex, just happens at
    the syscall layer instead of the FEXCore layer.
  - Happens as frequently as the code invalidation mutex.
- Allocation mutex
  - This mutex is used for FEX's 64-bit Allocator, this happens when FEX
    is allocating memory on one thread and a fork occurs.
  - Fairly infrequent because jemalloc doesn't allocate VMA regions that
    often.

While this likely doesn't hit all of the FEX mutexes, this hits the ones
that are burning fires and are happening frequently.

- FEXCore: Adds forkable mutex/locks

Necessary since we have a few locations in FEX that need to be locked
before and after a fork.

When a fork occurs the locks must be locked prior to the fork. Then
afterwards they either need to unlock or be set to default
initialization state.
- Parent
   - Does an unlock
- Child
   - Sets the lock to default initialization state
   - This is because pthreads does TID based ownership checking on
     unique locks and refcount based waiting for shared locks.
   - No way to "unlock" after fork in this case other than default
     initializing.
2023-07-04 02:13:06 -07:00
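The lock-before-fork, unlock-in-parent, reinitialize-in-child pattern from f9b352a093 can be sketched around a raw `pthread_mutex_t`. The class and method names below are hypothetical and this is not FEX's forkable mutex; it only illustrates the three steps described in the commit message.

```cpp
#include <pthread.h>

// Hypothetical sketch of a mutex that can be carried across fork().
class ForkableMutex {
public:
  ForkableMutex() { pthread_mutex_init(&Mutex, nullptr); }
  ~ForkableMutex() { pthread_mutex_destroy(&Mutex); }
  ForkableMutex(const ForkableMutex &) = delete;
  ForkableMutex &operator=(const ForkableMutex &) = delete;

  void lock() { pthread_mutex_lock(&Mutex); }
  bool try_lock() { return pthread_mutex_trylock(&Mutex) == 0; }
  void unlock() { pthread_mutex_unlock(&Mutex); }

  // Before fork(): take the lock so no other thread holds it mid-operation.
  void PrepareFork() { lock(); }

  // Parent after fork(): the forking thread still owns the lock, so a
  // regular unlock is enough.
  void ParentAfterFork() { unlock(); }

  // Child after fork(): ownership is tracked by TID, which changed, so the
  // mutex is put back into its default-initialized state instead of unlocked.
  void ChildAfterFork() { pthread_mutex_init(&Mutex, nullptr); }

private:
  pthread_mutex_t Mutex;
};
```

One way to wire these up would be to register the three methods as the prepare, parent, and child handlers of `pthread_atfork`; the commit message does not say how FEX invokes them, so that is an assumption. Note that re-running `pthread_mutex_init` on a live mutex is only reasonable in the freshly forked child, where no other thread can be using it.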
Ryan Houdek
d2032da452
Merge pull request #2737 from bylaws/main
Some small fixes for android building
2023-07-01 14:59:18 -07:00
Billy Laws
35c52f20f9 AllocatorHooks: Avoid referencing valloc on Android
This is not implemented in bionic, so follow the MINGW approach and implement it with _aligned_alloc.
2023-07-01 22:21:16 +01:00
Billy Laws
17c82c22a6 JitSymbols: Store symbol mappings in /data/local/tmp on Android 2023-07-01 22:13:44 +01:00
Ryan Houdek
df03a7b101 unittests/Emitter: Adds CSSC tests 2023-06-30 19:34:35 -07:00
Ryan Houdek
c859540d7e Emitter: Adds support for CSSC
Not used currently but will be used in the future.
2023-06-30 19:34:35 -07:00
Ryan Houdek
20794593e7 unittests/Emitter: Update tests for updated vixl
Output in vixl changed for some of these. Most for the better but not
all of them.
2023-06-30 19:34:35 -07:00
Ryan Houdek
a80a2bf569 External/vixl: Update 2023-06-30 19:11:22 -07:00
Mai
1a4d5a1abb
Merge pull request #2733 from Sonicadvance1/fix_jemalloc_checks
External/jemalloc: Updates external jemallocs
2023-06-30 17:57:29 -04:00
Ryan Houdek
677b72c9a5 External/jemalloc: Updates external jemallocs
Fixes their `malloc_usable_size` checks.
2023-06-28 09:26:45 -07:00
Ryan Houdek
71a8c66c95 Context: Removes dead AddVirtualMemoryMapping function
This has been around since the initial commit. Bad idea that wasn't ever
thought through. Something about remapping guest virtual and host
virtual memory which will never be a thing.
2023-06-28 09:18:36 -07:00
Lioncache
bf773452ac IR: Add missing formatters
Currently RegisterClassType and FenceType are passed into logs, which
fmt 10.0.0 is more strict about. Adds the formatters that were missing
so that compilation can succeed without needing to change all log sites.
2023-06-17 09:42:31 -04:00
Lioncache
95dbccc0ab Externals: Update fmt to 10.0.0
Keeps ourselves up to date with the latest major release.
2023-06-17 09:25:20 -04:00
Ryan Houdek
e5189d63a2
Merge pull request #2708 from Sonicadvance1/fix_paranoidtso
Arm64: Fixes paranoidtso option for CPUs that support LRCPC/2
2023-06-16 13:32:43 -07:00
Ryan Houdek
9dcc1deec0
Merge pull request #2722 from Sonicadvance1/rip_reconstruction
JIT: Implement support for per-instruction RIP reconstruction
2023-06-16 13:02:14 -07:00
Ryan Houdek
66d4206cd7
Merge pull request #2719 from lioncash/flags
OpcodeDispatcher: Ensure MXCSR is saved/restored with FXSAVE/FXRSTOR
2023-06-16 13:01:56 -07:00
Lioncache
01837b3ad6 IR: Remove HasSideEffects for VPCMPXSTRX ops
This is a leftover from early on and not necessary, since we
don't operate on any state other than what is provided to the
IR op itself.
2023-06-16 11:53:31 -04:00
Lioncache
bdb68840e3 IR: Move VPCMPESTRX REX handling to OpcodeDispatcher
We can handle this in the dispatcher itself, so that we don't need to pass along
the register size as a member of the opcode. This gets rid of some unnecessary duplication
of functionality in the backends and makes it so potential backends don't need to deal
with this.
2023-06-16 11:49:36 -04:00
Lioncache
4e2dcf3298 OpcodeDispatcher: Ensure MXCSR is saved/restored with FXSAVE/FXRSTOR
Previously, the bits that we support in the MXCSR weren't being saved,
which means that some opcode patterns may fail to restore the rounding mode
properly.

e.g. FXSAVE, followed by FNINIT, followed by FXRSTOR wouldn't restore the
     rounding mode properly

This fixes that.
2023-06-16 09:25:53 -04:00
Ryan Houdek
628f825416 JIT: Implement support for per-instruction RIP reconstruction
FEX's current implementation of RIP reconstruction is limited to the
entrypoint that a single block has. This will cause the RIP to be
incorrect past the first instruction in that block.

While this is fine for a decent number of games, especially since fault
handling isn't super common, this doesn't work for all situations.

When testing Ultimate Chicken Horse, we found out that changing the
block size to 1 worked around an early crash in the game's startup.
This game is likely relying on Mono/Unity's AOT compilation step, which
does some more robust faulting than the runtime JIT and needs the RIP to
be correct since they do some sort of checking for where the code came
from.

This fixes Ultimate Chicken Horse specifically, but will likely fix
other games that are built the same way.
2023-06-14 17:28:56 -07:00
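Per-instruction RIP reconstruction as described in 628f825416 needs some mapping from a host PC inside a JIT block back to the guest instruction that produced it. The commit message does not describe the data layout, so the table below (sorted host offsets paired with guest offsets) and all names are assumptions made purely for illustration.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical per-instruction record.
struct RIPEntry {
  uint32_t HostOffset;  // offset of the first host instruction emitted for a guest instruction
  uint32_t GuestOffset; // offset of that guest instruction from the block's entry RIP
};

// Hypothetical per-block metadata.
struct BlockInfo {
  uint64_t GuestEntryRIP;
  uint64_t HostCodeBegin;
  std::vector<RIPEntry> Entries; // sorted by HostOffset, ascending
};

// Returns the guest RIP of the instruction that covers HostPC.
inline uint64_t ReconstructRIP(const BlockInfo &Block, uint64_t HostPC) {
  const uint64_t Offset = HostPC - Block.HostCodeBegin;
  // Find the first entry that starts after Offset, then step back one.
  auto It = std::upper_bound(
      Block.Entries.begin(), Block.Entries.end(), Offset,
      [](uint64_t Value, const RIPEntry &Entry) { return Value < Entry.HostOffset; });
  if (It == Block.Entries.begin()) {
    return Block.GuestEntryRIP; // before any recorded instruction: fall back to the entrypoint
  }
  --It;
  return Block.GuestEntryRIP + It->GuestOffset;
}
```

With only the entrypoint recorded, every fault in the block would report `GuestEntryRIP`; the table lets a fault in the second or third guest instruction report that instruction's own address.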
Ryan Houdek
a80327f6df X86Tables: Adds some missing MEM_ACCESS flags to REP instructions 2023-06-14 17:04:50 -07:00
Ryan Houdek
c9712e45cb Arm64: Fixes GPR pair allocation to get one pair back
When executing a 32-bit application we were failing to allocate a single
GPR pair. This meant we only have 7 pairs when we could have had 8.

This was because r30 was ending up in the middle of the allocation
arrays so we couldn't safely create a sequential pair of registers.

Organize the register allocation arrays to be unique for each bitness
being executed and then access them through spans instead.

Also works around bug where the RA validation doesn't understand when pair
indexes don't correlate directly to GPR indexes. So while the previous
PR fixed the RA pass, it didn't fix the RA validation pass.

Noticed this when pr57018 32-bit gcc test was run with the #2700 PR
which improved the RA allocation a bit.
2023-06-13 20:04:51 -07:00
Lioncache
9017325c95 CPUID: Signify support for XSAVE if AVX is enabled
Now that XSAVE and XRSTOR are implemented, we can enable the
CPUID bits for them when AVX support is enabled.
2023-06-13 19:21:14 -04:00
Lioncache
ae536e44d7 OpcodeDispatcher: Handle XRSTOR 2023-06-13 17:47:45 -04:00
Lioncache
7679485cc3 OpcodeDispatcher: Handle XSAVE 2023-06-13 15:01:33 -04:00
Ryan Houdek
537562fab7 Arm64: Fixes register pair conflict.
When FEX was updated to reclaim 64-bit registers in #2494, I had
mistakenly messed up pair register class conflicts.

The problem is that FEX has r30 stuck in the middle of the RA which
causes the paired registers to need to offset their index half way.

This meant that the conflict index being incorrect was always broken on
32-bit applications ever since that PR.

Keep the intersection indexes in their own array so they can be correctly
indexed at runtime.

Thanks to @asahilina finding out that Osmos started crashing a few
months ago and I finally just got around to bisecting what the problem
was.
This now fixes Osmos from crashing, although the motes are still
invisible on the 32-bit application. Not sure what other havoc this has
been causing.
2023-06-12 23:31:16 -07:00
Mai
f8721992c2
Merge pull request #2712 from Sonicadvance1/fix_jemalloc_generate
External: Update jemalloc trees
2023-06-12 17:12:24 -04:00
Lioncache
755600c371 CPUID: Signify support for SSE4.2
With all the kinks worked out of these instructions, we can finally enable SSE4.2
2023-06-12 13:19:38 -04:00
Lioncache
bec8b70e5d VectorFallbacks: Fix PCMPSTR fallback ZF/SF flag setting
So, uh, this was a little silly to track down. So, having the upper limit
as unsigned was a mistake, since this would cause negative valid lengths to
convert into an unsigned value within the first two flag comparison cases.

A -1 valid length can occur if one of the strings starts with a null character
in a vector's first element. (It will be zero and we then subtract it to
make the length zero-based).

Fixes this edge-case up and expands a test to check for this in the future.
2023-06-12 13:13:24 -04:00
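The signedness problem fixed in bec8b70e5d is easy to reproduce in isolation. The sketch below uses hypothetical function names and a simplified "valid length is below the upper limit" comparison; it is meant to show the conversion hazard, not the exact flag logic of the fallback.

```cpp
#include <cstdint>

// Models the old comparison: with an unsigned limit, the signed length is
// converted to unsigned first, so -1 turns into 0xFFFFFFFF.
inline bool BelowLimitUnsigned(int32_t ValidLength, uint32_t UpperLimit) {
  return static_cast<uint32_t>(ValidLength) < UpperLimit;
}

// Models the fixed comparison: both sides are signed, so -1 stays below the limit.
inline bool BelowLimitSigned(int32_t ValidLength, int32_t UpperLimit) {
  return ValidLength < UpperLimit;
}
```

For a string whose first element is null, the zero-based valid length is -1; the unsigned version then reports the string as not ending early, while the signed version reports it correctly.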
Ryan Houdek
bef8ddde48 External: Update jemalloc trees
Allows us to generate a header at compile time for OS specific features.
Should fix compiling on Android since they have a different function
declaration for `malloc_usable_size` compared to Linux.
2023-06-12 09:34:30 -07:00
Mai
fe06f1b151
Merge pull request #2711 from Sonicadvance1/pad_ir_header_32bit
IR: Pad IROp_Header to be 32-bit in width
2023-06-11 05:49:00 -04:00
Ryan Houdek
92a15e00c7 IR: Pad IROp_Header to be 32-bit in width
We spent a bit of effort removing 8-bits from this header to get it down
to three bytes. This ended up in PRs #2319 and #2320

There was no explicit need to go down to three bytes, the other two
arguments we were removing were just better served to be lookups instead
of adding IR overhead for each operation.

This now introduced alignment issues that were brought up in #2472.
Apparently the Android NDK's clang will pad nested structs like this,
maybe to match alignment? Regardless we should just make it be 32-bit.

This fixes Android execution of FEXCore.
This fixes #2472

Pros:
- Initialization now turns into a single str because it's 32-bit
- We have 8-bits more space that we can abuse in the IR op now
   - If we need more, then 64-bit and 128-bit are easy bumps in the
     future

Cons:
- Each IR operation takes at minimum 25% more space in the intrusive
  allocators
   - Not really that big of a deal since we are talking 3 bytes versus
     4.
2023-06-10 12:38:03 -07:00
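The size trade-off in 92a15e00c7 can be checked with two small structs. The field names are hypothetical stand-ins, not the real `IROp_Header` members; the point is only the 3-byte versus 4-byte layout and the single 32-bit access the padded form allows.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical three-byte header: odd-sized, so a struct nested after it may
// be padded differently depending on the compiler and target ABI.
struct HeaderPacked {
  uint8_t Op;
  uint8_t Size;
  uint8_t ElementSize;
};

// Hypothetical padded header: exactly one 32-bit word.
struct HeaderPadded {
  uint8_t Op;
  uint8_t Size;
  uint8_t ElementSize;
  uint8_t Pad; // spare byte available for future use
};

static_assert(sizeof(HeaderPacked) == 3, "three single-byte fields");
static_assert(sizeof(HeaderPadded) == 4, "padded to a full 32-bit word");

// The padded header can be read or written as one 32-bit value.
inline uint32_t AsWord(const HeaderPadded &Header) {
  uint32_t Word;
  std::memcpy(&Word, &Header, sizeof(Word));
  return Word;
}
```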
Ryan Houdek
7ceadc6b5b Move config layers to the frontend
FEXCore has no need to understand how to load these layers, which
requires JSON parsing.

Move these to the frontend which is already doing the configuration
layer setup and initialization tasks anyway.

Means FEXCore itself no longer needs to link to tiny-json which can be
left to the frontend.
2023-06-09 18:15:40 -07:00
Ryan Houdek
8c41e8f7d8 Arm64: Fixes paranoidtso option for CPUs that support LRCPC/2
Regular LoadStoreTSO operations have gained support for LRCPC and LRCPC2
which changes the semantics of the operation by letting it support
immediate offsets.

The paranoid version of these operations didn't support the immediate
offsets yet which was causing incorrect memory loadstores.

Bring over the new semantics from the regular LoadStoreTSO but without
any nop padding.
2023-06-09 16:32:28 -07:00
Ryan Houdek
784b3064fc ArchHelpers: Convert a couple of magic numbers to constants
Makes this easier to read.
2023-06-09 16:31:44 -07:00
Ryan Houdek
5b5808218b
Merge pull request #2703 from Sonicadvance1/minor_of_opt
OpcodeDispatcher: Optimize ADC/ADD OF flag calculation
2023-06-07 12:54:55 -07:00
Ryan Houdek
41ec987f3e OpcodeDispatcher: Optimize ADC/ADD OF flag calculation
`eor <reg>, <reg>, #-1` can't be encoded as an instruction. Instead use
mvn which does the same thing.

Removes a single instruction from each OF calculation for ADC and ADD.

Also no reason to use a switch statement for the source size, just use
_Bfe and calculate the offset based on operation size.

SBB caught in the crossfire to ensure it also isn't using a switch
statement.
2023-06-07 12:40:51 -07:00
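Both points in 41ec987f3e can be shown in scalar C++: an exclusive-or with all-ones is the same as a bitwise not, and the sign bit for any operation size can be extracted with one bitfield extract whose offset is computed from the size. `Bfe` here is a hypothetical model with an assumed (value, width, lsb) argument order, not FEX's `_Bfe` signature.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of a bitfield extract: Width bits starting at bit Lsb.
inline uint64_t Bfe(uint64_t Value, unsigned Width, unsigned Lsb) {
  assert(Width >= 1 && Width <= 64 && Lsb < 64);
  const uint64_t Mask = (Width == 64) ? ~0ULL : ((1ULL << Width) - 1);
  return (Value >> Lsb) & Mask;
}

// Sign bit for an operation of SizeBytes bytes, with no switch on the size.
inline uint64_t TopBit(uint64_t Value, unsigned SizeBytes) {
  assert(SizeBytes == 1 || SizeBytes == 2 || SizeBytes == 4 || SizeBytes == 8);
  return Bfe(Value, 1, SizeBytes * 8 - 1);
}

// The two ways of inverting a value; they always agree.
inline uint64_t NotViaEor(uint64_t Value) { return Value ^ ~0ULL; }
inline uint64_t NotViaMvn(uint64_t Value) { return ~Value; }
```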
Ryan Houdek
03f73531d3 IRDumper: Fixes ssa number in arguments.
This can spuriously end up as a hex number which makes it hard to reason
why DCE wasn't deleting IR operations. Ensure it is always a decimal.
2023-06-07 09:52:04 -07:00
Ryan Houdek
a2cbfccb3b OpcodeDispatcher: Optimize EFLAG unpacking
Noticed this was slightly suboptimal. Results in an 18% code reduction
in the case of a simple four instruction test ASM case.
2023-06-06 17:56:25 -07:00
Mai
4e01452a65
Merge pull request #2699 from Sonicadvance1/minor_fcmov_opt
X87: Super minor FCMOV optimization
2023-06-06 20:22:40 -04:00
Mai
cc7a56b1a6
Merge pull request #2689 from Sonicadvance1/fix_bmi
CPUID: Only enable BMI1 and BMI2 if AVX is supported
2023-06-06 20:21:57 -04:00
Ryan Houdek
0b0dd3891e X87: Super minor FCMOV optimization
This caught my eye as I was skimming, remove one IR op per FCMOV
instruction.

This was just duplicating the generated GPR mask across the FPR.
2023-06-04 06:39:35 -07:00
Ryan Houdek
96a0364a86 Review comments 2023-06-02 21:53:52 -07:00
Ryan Houdek
c0a783997d Convert remaining memory tracking to deferred signals 2023-06-01 11:35:22 -07:00
Ryan Houdek
f78537109d Core: Convert mtrack code invalidation over to deferred signals 2023-06-01 11:35:22 -07:00
Ryan Houdek
0c156ed6f9 Context: Switch over to deferred signals 2023-06-01 11:28:04 -07:00
Ryan Houdek
8840b2154c Allocator: Allow more optimal deferred signals path 2023-06-01 11:28:04 -07:00
Ryan Houdek
e02be8073e FEXCore: Support deferred signal mutex
This is part of FEXCore since it pulls in InternalThreadData, but is
related to the FHU signal mutex class.

Necessary to allow deferring signals in C++ code rather than right in
the JIT.
2023-06-01 11:28:04 -07:00
Ryan Houdek
f75d3550b4 Jit64: Use deferred signals in dispatcher 2023-06-01 11:28:04 -07:00
Ryan Houdek
802c588695 Arm64: Use deferred signals in dispatcher 2023-06-01 11:28:04 -07:00
Ryan Houdek
fd962f40d7 SignalDelegator: Support deferring signals 2023-06-01 11:28:04 -07:00
Ryan Houdek
a9b660af69 CoreState: Add new members to track deferred signal capability 2023-06-01 11:28:04 -07:00
Ryan Houdek
5be798e9e6
Merge pull request #2693 from Sonicadvance1/remove_debug
Context: Remove debug namespace
2023-06-01 11:26:05 -07:00
Ryan Houdek
c9d1f0d75a
Merge pull request #2687 from Sonicadvance1/telemetry_save_crash
Telemetry: Save on signal terminate
2023-05-30 10:26:03 -07:00
Ryan Houdek
1dc4f8c429 Context: Remove debug namespace
Unused and broken
2023-05-30 09:00:57 -07:00
Ryan Houdek
45d3b83143 Telemetry: Save on signal terminate
When a signal handler is not installed and is a terminal failure, make
sure to save telemetry before faulting.

We know when an application is going down in this case so we can make
sure to have the telemetry data saved.

Adds a telemetry signal mask data point as well to know which signal
took it down.
2023-05-30 08:49:33 -07:00
Ryan Houdek
c9101d3f68 CPUID: Only enable BMI1 and BMI2 if AVX is supported
These two extensions rely on AVX being supported to be used. Primarily
because they are VEX encoded.

GTA5 is using these flags to determine if it should enable its AVX
support.
2023-05-26 20:48:36 -07:00
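The gating in c9101d3f68 amounts to masking two bits out of CPUID leaf 7 (sub-leaf 0) EBX when AVX is not available. BMI1 is bit 3 and BMI2 is bit 8 of that register; the function name and the way AVX support is passed in are hypothetical.

```cpp
#include <cstdint>

constexpr uint32_t CPUID7_EBX_BMI1 = 1u << 3;
constexpr uint32_t CPUID7_EBX_BMI2 = 1u << 8;

// Hides BMI1 and BMI2 unless AVX is supported, since both are VEX encoded.
constexpr uint32_t FilterLeaf7EBX(uint32_t EBX, bool AVXSupported) {
  return AVXSupported ? EBX : (EBX & ~(CPUID7_EBX_BMI1 | CPUID7_EBX_BMI2));
}
```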
Ryan Houdek
a6c6248bcb ArmEmitter: Fixes bug in SpillStaticRegs
Some code in FEX's Arm64 emitter assumed that once SpillStaticRegs was
called, it was still safe to use the SRA register state.
This wasn't actually true, since FEX was using one SRA register to
optimize FPR stores, on the assumption that the SRA registers were free
to use since they had just been saved and were no longer needed.

Correct this assumption hell by forcing users of the function to provide
the temporary register directly. In all cases the users have a temporary
available that they can use.

Probably fixes some very weird edge case bugs.
2023-05-22 16:48:07 -07:00
Ryan Houdek
5646428640 FEXCore: Implements support for xgetbv
This returns the `XFEATURE_ENABLED_MASK` register which reports what
features are enabled on the CPU.
This behaves similarly to CPUID where it uses an index register in ecx.

This is a prerequisite to enabling XSAVE/XRSTOR and AVX since
applications will expect this to exist.

xsetbv is a privileged instruction and doesn't need to be implemented.
2023-05-22 16:48:07 -07:00
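A rough sketch of the xgetbv behavior described in this commit, assuming only XCR0 (index 0 in ecx) is modeled. The low three XCR0 bits are the architectural x87, SSE, and AVX state bits; the struct and function names are hypothetical and are not FEX's implementation.

```cpp
#include <cstdint>

// XCR0 (XFEATURE_ENABLED_MASK) state-component bits.
constexpr uint64_t kXCR0_X87 = 1ull << 0;
constexpr uint64_t kXCR0_SSE = 1ull << 1;
constexpr uint64_t kXCR0_AVX = 1ull << 2;

struct XGetBVResult {
  bool Valid;    // false when the requested index is not modeled
  uint32_t EAX;  // low 32 bits of the selected XCR
  uint32_t EDX;  // high 32 bits of the selected XCR
};

// Hypothetical: build the XCR0 value reported to the guest.
constexpr uint64_t BuildXCR0(bool SupportsAVX) {
  return kXCR0_X87 | kXCR0_SSE | (SupportsAVX ? kXCR0_AVX : 0);
}

// Hypothetical: ecx selects the register, the result comes back in edx:eax.
// Hardware raises #GP for an invalid index; this sketch only flags it.
constexpr XGetBVResult EmulateXGetBV(uint32_t ECX, bool SupportsAVX) {
  return ECX != 0
    ? XGetBVResult{false, 0, 0}
    : XGetBVResult{true,
                   static_cast<uint32_t>(BuildXCR0(SupportsAVX)),
                   static_cast<uint32_t>(BuildXCR0(SupportsAVX) >> 32)};
}
```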
Ryan Houdek
6ef6d9c391 Thunks: Mostly reverts #2672
I forgot that x11 was part of the custom ABI of thunks. #2672 had broken
thunks on ARM64. I thought I had tested a game with them enabled but
apparently I tested the wrong game.

Not a full revert since we can still ldr with a literal, but we also
still need to adr x11 and nop pad. At least removes the data dependency
on x11 from the ldr.
2023-05-18 15:50:55 -07:00
Ryan Houdek
3a4a965347 TestHarnessRunner: Support exiting on HLT
Currently WINE's longjump doesn't work, so instead set a flag that if
HLT is attempted, just exit the JIT.

This will get our unittests executing at least.
2023-05-17 21:09:31 -07:00
Ryan Houdek
45cdab2ac3 HostFeatures: Use ID registers under Wine
InferFromOS doesn't work under WINE.
InferFromIDRegisters doesn't work under Windows but it will under Wine.

Since we don't support Windows, just use InferFromIDRegisters.
2023-05-17 21:07:40 -07:00
Ryan Houdek
d675b4af6f External: Update vixl 2023-05-17 21:07:40 -07:00
Ryan Houdek
363411f0c7 ArchHelpers: Adds missing stub function 2023-05-17 21:05:55 -07:00
Ryan Houdek
5bc418407c FEXCore: Disable emitter unit tests on win32 2023-05-17 21:05:55 -07:00
Ryan Houdek
61ca651fe1 FEXCore: Don't initialize ThunkHandler on Win32
Adds a couple pointer checks to ensure it won't crash.

Doesn't work and will cause assertions.
2023-05-17 21:05:55 -07:00
Mai
77e8be1215
Merge pull request #2671 from Sonicadvance1/wine_syscalls
FEXCore: Support Wine syscalls
2023-05-18 00:04:25 -04:00
Lioncache
f7c663240e OpcodeDispatcher: Handle PCMPESTRM/VPCMPESTRM
...and with that all of the SSE4.2 string instructions are implemented now
2023-05-17 00:21:55 -04:00
Lioncache
82b4aef30d OpcodeDispatcher: Handle PCMPISTRM/VPCMPISTRM 2023-05-16 22:59:54 -04:00
Lioncache
22919a5b65 OpcodeDispatcher: Add mask variant handling to PCMPXSTXOpImpl()
Will be used to handle PCMPESTRM/PCMPISTRM instruction variants.
2023-05-16 22:59:52 -04:00
Ryan Houdek
f47caf48c6
Merge pull request #2669 from Sonicadvance1/aotir_mutex
AOTIR: Stop passing a mutex around. It's already guarded
2023-05-12 18:56:55 -07:00
Ryan Houdek
5674d3a871
Merge pull request #2667 from Sonicadvance1/fextl_file
FEXCore: Convert Core and Telemetry over to fextl::file::File
2023-05-12 18:56:45 -07:00
Mai
e03b859c20
Merge pull request #2673 from Sonicadvance1/remove_warnings_13
OpcodeDispatcher: Removes a warning that cropped up.
2023-05-12 21:49:43 -04:00
Ryan Houdek
7d822ba1c8 OpcodeDispatcher: Removes a warning that cropped up. 2023-05-12 17:34:20 -07:00
Ryan Houdek
f90dcd2eb1 FEXCore: Convert Core and Telemetry over to FEXCore::File::File
This way telemetry and IR dumping can work under Wine.
2023-05-12 17:32:48 -07:00
Ryan Houdek
adbdd33ece fextl/fmt: Adds write handler for FEXCore::File::File 2023-05-12 17:32:48 -07:00
Ryan Houdek
06250d806d FEXCore/Utils: Adds File type
OS agnostic file class since we can't use std::FILE
2023-05-12 17:32:48 -07:00
Ryan Houdek
613ed559e7 Thunks: Optimize ARM64 trampoline
No need to use adr for getting the PC relative literal, we can use LDR
(literal) to load the PC relative address directly.

Reduces trampoline instructions from 3 to 2, also reduces trampoline size
from 24-bytes to 16-bytes.
2023-05-12 17:28:36 -07:00
Ryan Houdek
8ac3841946 FEXCore: Support Wine syscalls
Wine syscalls need to end the code block at the point of the syscall.
This is because syscalls may update RIP which means the JIT loop needs
to immediately restart.

Additionally since they can update CPU state, make wine syscalls not
return a result and instead refill the register state from the CPU
state. This will mean the syscall handler will need to update their
result register (RAX?) before returning.
2023-05-12 16:42:26 -07:00
Ryan Houdek
458259bf47 FEXCore: Move EnumOperators to FEXCore
fextl needs this and can't depend on FHU
2023-05-12 15:23:00 -07:00
Ryan Houdek
2fc529d5b7 AOTIR: Stop passing a mutex around. It's already guarded 2023-05-11 03:56:33 -07:00
Ryan Houdek
ea489567da ARM64: Fixes SRA disabled codepath
Disabling SRA has been broken for quite a while. Disabling this was
instrumental in figuring out the VC redistributable crash.

Ensure it works by reintroducing non-SRA load/store register handlers,
and by supporting runtime selectable dispatch pointers for the JIT.

Side-bonus, moves the {LOAD,STORE}MEMTSO ops over to this dispatch as
well to make it consistent and probably slightly quicker.
2023-05-11 03:25:19 -07:00
Ryan Houdek
6eae064511 FEXCore: Adds support for hardware x86-TSO prctl
From https://github.com/AsahiLinux/linux/commits/bits/220-tso

This fails gracefully in the case the upstream kernel doesn't support
this feature, so can go in early.

This feature allows FEX to use hardware's TSO emulation capability to
reduce emulation overhead from our atomic/lrcpc implementation.
In the case that the TSO emulation feature is enabled in FEX, we will
check if the hardware supports this feature and then enable it.

If the hardware feature is supported it will then use regular memory
accesses with the expectation that these are x86-TSO in strength.

The only hardware that anyone cares about that supports this is Apple's
M class SoCs. Theoretically NVIDIA Denver/Carmel supports sequential
consistency, which isn't quite the same thing. I haven't cared to check
if multithreaded SC has as strong of guarantees. But also since
Carmel/Denver hardware is fairly rare, it's hard to care about for our
use case.
2023-05-08 20:12:03 -07:00
Ryan Houdek
2d4bf97cac FEXCore: Moves SIGBUS handler to FEXCore/Utils
This can be done in an OS agnostic fashion. FEXCore knows the details of
its JIT, so this should be done in FEXCore itself.

The frontend is only necessary to inform FEXCore where the fault occurred
and provide the array of GPRs for accessing and modifying the signal
state.

This is necessary for supporting both Linux and Wine signal contexts
with their unaligned access handlers.
2023-05-05 17:04:26 -07:00
Mai
f7d827a26a
Merge pull request #2662 from Sonicadvance1/disable_rdtscp
CPUID: Disable RDTSCP under wine
2023-05-05 17:33:20 -04:00
Ryan Houdek
37b5bc49c6
Merge pull request #2656 from Sonicadvance1/fexcore_no_exceptions
FEXCore: Compile without exceptions
2023-05-05 14:32:37 -07:00
Ryan Houdek
dcb3f182d6 CPUID: Disable RDTSCP under wine
We don't have a sane way to query cpu index under wine. We could
technically still use the syscall since we know that we are still
executing under Linux, but that seems a bit terrible.

Disable for now until something can be worked out. Not like it is used
heavily anyway.
2023-05-05 13:52:39 -07:00
Mai
ba45bf4ae7
Merge pull request #2661 from Sonicadvance1/virtual_alloc_base
Allocator: Adds VirtualAlloc with memory Base hint function
2023-05-05 14:35:24 -04:00
Mai
121d9fda2d
Merge pull request #2660 from Sonicadvance1/arm64_win32_ra
Arm64Emitter: Replace x18 usage with x30
2023-05-05 14:35:02 -04:00
Mai
73ede9d000
Merge pull request #2659 from Sonicadvance1/save_platform_register
ARM64Emitter: Ensure platform register is saved on win32
2023-05-05 14:34:25 -04:00
Mai
6dfea8a80f
Merge pull request #2657 from Sonicadvance1/remove_unnecessary_guard
LookupCache: Removes unnecessary recursive lock_guard
2023-05-05 14:33:50 -04:00
Ryan Houdek
ef6c220a75 Allocator: Adds VirtualAlloc with memory Base hint function
This will be used with the TestHarnessRunner in the future to map
specific memory regions.

This is only used as a hint rather than exact placement with failure on
inability to map. This also hits the fun quirk of 64k allocation
granularity which developers need to be careful about.
2023-05-04 15:39:32 -07:00
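To illustrate the 64k allocation granularity quirk mentioned in this commit, here is a small sketch of aligning a base hint before passing it on. The constant reflects the 64 KiB figure from the commit message; the helper names are hypothetical and not part of FEX's allocator API.

```cpp
#include <cstdint>

// Allocation granularity assumed from the commit message (64 KiB).
constexpr uint64_t kAllocationGranularity = 64 * 1024;

// Hypothetical helpers: a hint that is not a multiple of the granularity
// cannot be honored exactly, so callers may want to align it themselves.
constexpr uint64_t AlignHintDown(uint64_t Hint) {
  return Hint & ~(kAllocationGranularity - 1);
}

constexpr uint64_t AlignHintUp(uint64_t Hint) {
  return (Hint + kAllocationGranularity - 1) & ~(kAllocationGranularity - 1);
}
```

A caller that needs a region to cover a specific address would typically align the hint down and grow the size by the difference, since the hint is not guaranteed to be honored either way.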
Ryan Houdek
1e4a6d432c
Merge pull request #2658 from Sonicadvance1/remove_unused_log
LogManager: Remove unused handler
2023-05-04 15:32:04 -07:00
Ryan Houdek
4ebd180147 Arm64Emitter: Replace x18 usage with x30
Related to #2659 but not directly necessary.

Currently x30(LR) is unused in our RA. In all locations that call out to
code, we are already preserving LR and bringing it back after the fact.
This was just a missed opportunity since we aren't doing any call-ret
stack manipulations that would facilitate LR needing to stick around.

Since x18 is a reserved platform register on win32, we can replace its
usage with r19, and then replace r19 usage with x30 and everything just
works happily. Now x18 is the unused register instead of x30 and we can
come back in the future to gain one more register for RA on Linux
platforms.
2023-05-04 15:25:47 -07:00
Ryan Houdek
ac4ef63ae6 ARM64Emitter: Ensure platform register is saved on win32
Platform register stores the TEB region on win32 and needs to be
preserved if we're going to overwrite it.

Ensure we do so.
2023-05-04 15:12:52 -07:00
Ryan Houdek
b2392ef1c6 LogManager: Remove unused handler
This non-fmt handler is now entirely unused and can be removed.
2023-05-04 14:52:45 -07:00
Ryan Houdek
8e4d52396b LookupCache: Removes unnecessary recursive lock_guard
All code paths to this are already guaranteed to own the lock.

The rest of the codepaths haven't been vetted to actually need
recursive_mutex yet, but seems likely that it will be able to get
converted to a regular mutex with some more work.
2023-05-04 14:45:19 -07:00
Ryan Houdek
6eeb45b2dc FEXCore: Compile without exceptions
This disables some unwinding overhead when FEXCore is already guaranteed
to not throw.
2023-05-04 14:42:02 -07:00
Ryan Houdek
22cf2696da fextl/memory: Don't allow arrays in fextl::make_unique
This ensures we don't hit a programming error since we don't support the
array version of this.
2023-05-04 14:38:12 -07:00
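One way to reject array types at compile time is sketched below. This is an illustration under assumptions, not the fextl implementation: it lives in a hypothetical `sketch` namespace, allocates with plain `new`, and uses a `static_assert` rather than whatever mechanism fextl actually uses.

```cpp
#include <memory>
#include <type_traits>
#include <utility>

namespace sketch {
// Hypothetical make_unique that refuses array types outright, so that
// make_unique<T[]>(n) fails to compile instead of silently misbehaving.
template <typename T, typename... Args>
std::unique_ptr<T> make_unique(Args&&... args) {
  static_assert(!std::is_array<T>::value,
                "array types are not supported by this make_unique");
  return std::unique_ptr<T>(new T(std::forward<Args>(args)...));
}
} // namespace sketch
```

With this in place, `sketch::make_unique<int>(42)` compiles, while `sketch::make_unique<int[]>(4)` trips the assertion.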
Alexandre Julliard
8081ac61e5 AllocatorHooks: Fix parameter order for Win32 _aligned_malloc.
The prototype is the opposite of memalign().
2023-05-03 16:15:07 +02:00
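The parameter-order difference can be sketched with a small wrapper. `_aligned_malloc(size, alignment)` takes the size first, while `memalign(alignment, size)` takes the alignment first. The wrapper names below are hypothetical, and the non-Windows branch uses `posix_memalign` purely for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>
#if defined(_WIN32)
#include <malloc.h>
#endif

// Hypothetical wrapper with memalign-style argument order (alignment, size).
inline void* AlignedAlloc(size_t Alignment, size_t Size) {
  // Alignment must be a non-zero power of two on both platforms.
  assert(Alignment != 0 && (Alignment & (Alignment - 1)) == 0);
#if defined(_WIN32)
  // Win32 order is (size, alignment), the opposite of memalign().
  return _aligned_malloc(Size, Alignment);
#else
  // posix_memalign also needs Alignment to be a multiple of sizeof(void*).
  void* Ptr = nullptr;
  return posix_memalign(&Ptr, Alignment, Size) == 0 ? Ptr : nullptr;
#endif
}

// Memory from _aligned_malloc must be released with _aligned_free.
inline void AlignedFree(void* Ptr) {
#if defined(_WIN32)
  _aligned_free(Ptr);
#else
  free(Ptr);
#endif
}
```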
Alexandre Julliard
435b4daae1 AllocatorHooks: Pass valid parameters to the Win32 VirtualAlloc. 2023-05-03 16:13:37 +02:00
Lioncache
5ee913bc75 OpcodeDispatcher: Simplify PCMPXSTRIOpImpl
All variants of the PCMPXSTRX instructions will take their arguments in
the same manner, so we don't need to specify them for each handler.

We can also rename the function to PCMPXSTRXOpImpl, since this will
be extended to handle the masking variants of the string instructions.
2023-05-02 18:48:35 -04:00
Lioncache
f502154f96 OpcodeDispatcher: Handle VPCMPISTRI 2023-05-02 14:00:05 -04:00
Lioncache
7a59fb3e25 IR: Add IR fallback for VPCMPISTRX
Will be the fallback that handles the implicit length string instruction emulation.
2023-05-02 13:52:30 -04:00
Mai
590422b295
Merge pull request #2641 from Sonicadvance1/remove_unittest_gen
FEXCore: Stop exposing the x86 table data symbols
2023-04-26 05:09:23 -04:00
Ryan Houdek
9d268df91f Softfloat: Disable some duplicate BIGFLOAT handlers
Since mingw has its reduced precision as double, these handlers were
duplicated and causing a compile failure.
2023-04-26 01:48:37 -07:00
Ryan Houdek
699541485d Arm64: Disable ProcessorID and Break on mingw
Currently unsupported on mingw
2023-04-26 01:48:37 -07:00
Ryan Houdek
46a63186a2 FEXCore: Name libFEXCore correctly and use sync library 2023-04-26 01:48:37 -07:00
Ryan Houdek
90f347839d InterruptableConditionVariable: Implement for mingw 2023-04-26 01:48:37 -07:00
Ryan Houdek
c9e7d9f331 FEXCore: Disable IRDumper on mingw 2023-04-26 01:48:37 -07:00
Ryan Houdek
8c3a3bfb7c FEXCore: Resolve some header includes
Some aren't necessary anymore. Some need to not exist on mingw.
2023-04-26 01:48:37 -07:00
Ryan Houdek
9034946b43 Move UContext from FEXCore to frontend.
FEXCore no longer needs this since all the signal handling is done in
the frontend.
2023-04-26 01:48:37 -07:00
Ryan Houdek
056f44be0b SignalDelegator: Moves all signal handling to the frontend
This is a very OS specific operation and it living in FEXCore doesn't
make much sense. This still requires some strong collaboration between
FEXCore and the frontend but it is now split between the locations.

There's still a bit more cleanup work that can be done after this is
merged, but we need to get this burning fire out of the way.

This is necessary for llvm-mingw and requires all previous PRs to be
merged first.

After this is merged, most of the llvm-mingw work is complete, just some
minor cleanups.

To be merged first:
- #2602
- #2604
- #2605
- #2607
- #2610
- #2615
- #2619
- #2621
- #2622
- #2624
- #2625
- #2626
- #2627
- #2628
- #2629
2023-04-26 01:24:11 -07:00
Mai
b5420f5db3
Merge pull request #2629 from Sonicadvance1/fexcore_cmake_mingw
FEXCore: Fixup cmake file for mingw
2023-04-25 10:12:17 -04:00
Mai
c94268789b
Merge pull request #2619 from Sonicadvance1/fileloading_mingw
FileLoading: Add WIN32 specific loading path
2023-04-25 10:11:14 -04:00
Mai
af15277fc4
Merge pull request #2615 from Sonicadvance1/fhu_mingw
FHU/FS: Create WIN32 helpers for some functions.
2023-04-25 10:09:35 -04:00
Mai
86e09a00f0
Merge pull request #2610 from Sonicadvance1/mingw_virtual_alloc
AllocatorHooks: Adds some mingw allocator helpers
2023-04-25 10:08:44 -04:00
Lioncache
c94721a04b OpcodeDispatcher: Handle VPMASKMOVD/VPMASKMOVQ
We can reuse the same helper we have for handling VMASKMOVPD and VMASKMOVPS,
though we need to move some handling around to account for the fact that
VPMASKMOVD and VPMASKMOVQ 'hijack' the REX.W bit to signify the element
size of the operation.
2023-04-24 10:50:11 -04:00
Ryan Houdek
c87f361bb5 FEXCore: Stop exposing the x86 table data symbols
This was only used for the unit test fuzzing framework, which has been
removed and was unused for pretty much its entire lifespan.

These can now be internal only.
2023-04-23 09:38:03 -07:00
Mai
0fa4390e47
Merge pull request #2622 from Sonicadvance1/dispatcher_signals
Dispatcher: Disable signal handling under mingw
2023-04-21 21:43:30 -04:00
Mai
7a774a8d80
Merge pull request #2624 from Sonicadvance1/fexcore_cpuid
FEXCore: Switch to xbyak for CPUID fetch helpers.
2023-04-21 21:42:54 -04:00
Mai
361e684c64
Merge pull request #2628 from Sonicadvance1/objectcache_mingw
Disable AOT and object cache under mingw
2023-04-21 21:42:25 -04:00
Mai
4c74913edf
Merge pull request #2627 from Sonicadvance1/disable_break_mingw
Disable Break/INT operations on mingw
2023-04-21 21:42:08 -04:00
Mai
059472fcef
Merge pull request #2621 from Sonicadvance1/object_cache_packed
ObjectCache: Ensure correctly packed config option
2023-04-21 21:41:34 -04:00
Mai
4a11111abd
Merge pull request #2626 from Sonicadvance1/thunks_mingw
Thunks: Disable under mingw
2023-04-21 21:40:40 -04:00
Mai
f673afc38f
Merge pull request #2625 from Sonicadvance1/gdbserver_mingw
GdbServer: Disable under mingw
2023-04-21 21:40:24 -04:00
Mai
1fad26d72f
Merge pull request #2613 from Sonicadvance1/cpuinfo_mingw
CPUInfo: Add mingw helper for CalculateNumberOfCPUs
2023-04-21 21:39:52 -04:00
Mai
2b5ddb6b93
Merge pull request #2607 from Sonicadvance1/mingw_softflow
llvm-mingw: Fix SoftFloat compiling
2023-04-21 21:38:27 -04:00
Mai
c140dd7da8
Merge pull request #2605 from Sonicadvance1/aligned_alloc
Allocator: Ensure uses of aligned allocations use aligned_free
2023-04-21 21:38:08 -04:00
Mai
da126141d3
Merge pull request #2604 from Sonicadvance1/move_config_paths
Config: Move path generation to the frontend
2023-04-21 21:37:23 -04:00
Mai
874ae5b0fc
Merge pull request #2602 from Sonicadvance1/move_thread_creation
Threads: Moves pthread logic to FEXLoader
2023-04-21 21:36:28 -04:00
Lioncache
651c6f8ddf OpcodeDispatcher: Handle VCVTPS2PD/VCVTPD2PS 2023-04-18 10:29:57 -04:00
Lioncache
73ca4e5687 OpcodeDispatcher: Move vector float conversion to helper
Will be used for implementing the equivalent AVX instructions.
2023-04-18 10:07:30 -04:00
Lioncache
cb9cc74fcc OpcodeDispatcher: Handle AVX variants of float-to-float conversions
Adds in the handling of destination type size differences with AVX.

Also fixes cases where the SSE operations would load 128-bit vectors
from memory, rather than only loading 64-bit vectors with VCVTPS2PD.
2023-04-18 09:52:28 -04:00
Lioncache
d1116456fc OpcodeDispatcher: Handle VCVTSD2SS/VCVTSS2SD 2023-04-18 08:13:23 -04:00
Lioncache
84985952c9 OpcodeDispatcher: Factor out scalar floating-point conversion to helper
Will be used to implement the AVX variants of VCVTSD2SS and VCVTSS2SD
2023-04-18 07:16:37 -04:00
Mai
a351620c60
Merge pull request #2634 from Sonicadvance1/missing_avx
VEXTables: Adds a missing class of AVX instructions
2023-04-18 06:55:35 -04:00
Ryan Houdek
9117f7e724 VEXTables: Adds a missing class of AVX instructions
These are all AVX1, not sure how I missed this.
Sorry @lioncash, four more instructions.
2023-04-17 20:39:59 -07:00
Lioncache
8e391e7a61 Interpreter: Move PCMPESTRX fallback to VectorFallbacks
Now that OpHandlers isn't coupled to the F80 ops anymore, we can
move this over to its own file dedicated to vector fallbacks.
2023-04-17 22:57:09 -04:00
Lioncache
98fbc4a46d Interpreter: Move OpHandler struct into its own header
We can also provide a general rundown for hooking up interpreter fallbacks here for the uninitiated.
2023-04-17 22:55:02 -04:00
Lioncache
8481aeccb5 Interpreter: Move F80Ops.h into Fallback directory
We can also rename it to F80Fallbacks.h to make the file purpose a little more explicit.
2023-04-17 22:54:58 -04:00
Lioncache
b1df63f425 Interpreter: Move fallbacks into new directory
Will be used to store fallbacks and separate the definition struct from the F80 fallbacks
2023-04-17 22:05:00 -04:00
Lioncache
39c73d975b OpcodeDispatcher: Handle PCMPESTRI/VPCMPESTRI 2023-04-17 21:42:58 -04:00
Lioncache
30cb1aaaed IR: Add VPCMPESTRX fallback
In order to implement the SSE4.2 string instructions in a reasonable
manner, we can make use of a fallback implementation for the time
being.

This implementation just returns the intermediate result and leaves it
up to the function making use of it to derive the final result from said
intermediate result. This is fine, considering we have the immediate
control byte that tells us exactly what is desired as far as output
formats go.

Given that the result of this IR op will never take up more than
16-bits, we store the flags we need to set in the upper 16 bits of the
result to avoid needing to implement multiple return values in the JIT.

Also, since the IR op just returns the intermediate result, this can be
used to implement all of the explicit string instructions with a single IR op.

The implementation is pretty heavily documented to help make heads or
tails of these monster instructions.
2023-04-17 21:39:32 -04:00
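The packing scheme described in this commit (intermediate result in the low 16 bits, flags in the upper 16 bits) can be sketched as follows. The function names are hypothetical, and the layout of individual flag bits within the upper half is left unspecified because the commit message does not state it.

```cpp
#include <cstdint>

// Hypothetical: pack the 16-bit intermediate result and a 16-bit flags
// field into one 32-bit value so the op needs only a single return value.
constexpr uint32_t PackResult(uint16_t Intermediate, uint16_t Flags) {
  return (static_cast<uint32_t>(Flags) << 16) | Intermediate;
}

// Recover the intermediate result from the low 16 bits.
constexpr uint16_t IntermediateOf(uint32_t Packed) {
  return static_cast<uint16_t>(Packed & 0xFFFFu);
}

// Recover the flags field from the upper 16 bits.
constexpr uint16_t FlagsOf(uint32_t Packed) {
  return static_cast<uint16_t>(Packed >> 16);
}
```

The consumer would then use the immediate control byte to turn the intermediate result into an index or a mask and apply the flags separately.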
Ryan Houdek
cbf41448fc Thunks: Disable under mingw 2023-04-17 03:10:04 -07:00
Ryan Houdek
47bdc9af12 Config: Move realpath usage to FHU 2023-04-17 03:05:25 -07:00
Ryan Houdek
0fad5b88c1 FEXCore: Fixup cmake file for mingw
- 64-bit allocator doesn't work under mingw atm.
- Can't link against libdl
- Can't have a SONAME because it is a PE, not a shared library.
2023-04-17 02:57:27 -07:00
Ryan Houdek
8c9fe0dd31 AOTIR: Disable loading and saving on mingw 2023-04-17 02:55:15 -07:00