Commit Graph

9772 Commits

Alyssa Rosenzweig
edf1a7970d X86Tables: add Literal() helper
Any time we get the value of a Literal, we want to assert that it's actually a
literal. We've been open-coding this pattern sporadically throughout the
OpcodeDispatcher. Let's add an ergonomic helper that fetches the value of a
literal, asserting that the value is indeed a literal.
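
A minimal sketch of the idea with made-up types (FEX's real IR operand types
and accessor names differ):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical operand kinds for illustration only.
enum class OpKind { Literal, Register, Memory };

struct Operand {
  OpKind Kind;
  uint64_t Value;
};

// Fetch the value of an operand, asserting that it really is a literal,
// instead of open-coding the check at every call site.
inline uint64_t Literal(const Operand& Op) {
  assert(Op.Kind == OpKind::Literal && "expected a literal operand");
  return Op.Value;
}
```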

Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
2024-06-21 14:46:46 -04:00
Ryan Houdek
fac9972bad
Merge pull request #3741 from alyssarosenzweig/cleanup/comiss
OpcodeDispatcher: refactor Comiss helper
2024-06-21 11:43:05 -07:00
Alyssa Rosenzweig
9ecb960f3a OpcodeDispatcher: refactor Comiss helper
AVX128 will use this; it's not SSE-specific.

Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
2024-06-21 14:23:20 -04:00
Ryan Houdek
3d26e23891
Merge pull request #3737 from Sonicadvance1/avx_10
Arm64: Implement support for emulated masked vector loadstores
2024-06-21 11:04:01 -07:00
Ryan Houdek
7bbbd95775
Merge pull request #3736 from Sonicadvance1/avx_9
AVX128: Some pun pickles, moves and conversions
2024-06-21 10:55:19 -07:00
Ryan Houdek
bb308899b9
Frontend: Expose AVX W flag
Previously we could always tell the size of the operation from how this flag
affects the operating size of the instruction, for example converting a
64-bit operation down to 32-bit.

AVX gather instructions are the first instruction class where this can't be
inferred. The element load size is determined by the W flag, but the 128-bit
or 256-bit operating size is determined by other means.

Expose this flag so we can determine this difference. The FMA
instructions are going to need this flag as well.
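
An illustrative sketch of the distinction, using made-up names rather than
FEX's decoder API:

```cpp
// For gathers, VEX.W selects the element load size while VEX.L selects the
// 128-/256-bit operating size, so the frontend has to surface W explicitly.
struct VexInfo {
  bool W; // element size: 0 = 32-bit, 1 = 64-bit
  bool L; // operating size: 0 = 128-bit, 1 = 256-bit
};

constexpr unsigned GatherElementSizeBytes(VexInfo Vex) {
  return Vex.W ? 8 : 4;   // element load size comes from W
}

constexpr unsigned OperatingSizeBytes(VexInfo Vex) {
  return Vex.L ? 32 : 16; // operating width comes from L instead
}
```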
2024-06-21 10:54:42 -07:00
Ryan Houdek
e95c8d703c
Arm64: Implement support for emulated masked vector loadstores
In order to support `vmaskmov{ps,pd}` without SVE128, this is required.
It's pretty gnarly, but these instructions aren't used often, so that's fine
from a compatibility perspective.

Example SVE128 implementation:
```json
    "vmaskmovps ymm0, ymm1, [rax]": {
      "ExpectedInstructionCount": 9,
      "Comment": [
        "Map 2 0b01 0x2c 256-bit"
      ],
      "ExpectedArm64ASM": [
        "ldr q2, [x28, #32]",
        "mrs x20, nzcv",
        "cmplt p0.s, p6/z, z17.s, #0",
        "ld1w {z16.s}, p0/z, [x4]",
        "add x21, x4, #0x10 (16)",
        "cmplt p0.s, p6/z, z2.s, #0",
        "ld1w {z2.s}, p0/z, [x21]",
        "str q2, [x28, #16]",
        "msr nzcv, x20"
      ]
    },
```

Example ASIMD implementation:
```json
    "vmaskmovps ymm0, ymm1, [rax]": {
      "ExpectedInstructionCount": 37,
      "Comment": [
        "Map 2 0b01 0x2c 256-bit"
      ],
      "ExpectedArm64ASM": [
        "ldr q2, [x28, #32]",
        "mrs x20, nzcv",
        "movi v0.2d, #0x0",
        "mov x1, x4",
        "mov x0, v17.d[0]",
        "tbz x0, #63, #+0x8",
        "ld1 {v0.s}[0], [x1]",
        "add x1, x1, #0x4 (4)",
        "tbz w0, #31, #+0x8",
        "ld1 {v0.s}[1], [x1]",
        "add x1, x1, #0x4 (4)",
        "mov x0, v17.d[1]",
        "tbz x0, #63, #+0x8",
        "ld1 {v0.s}[2], [x1]",
        "add x1, x1, #0x4 (4)",
        "tbz w0, #31, #+0x8",
        "ld1 {v0.s}[3], [x1]",
        "mov v16.16b, v0.16b",
        "add x21, x4, #0x10 (16)",
        "movi v0.2d, #0x0",
        "mov x1, x21",
        "mov x0, v2.d[0]",
        "tbz x0, #63, #+0x8",
        "ld1 {v0.s}[0], [x1]",
        "add x1, x1, #0x4 (4)",
        "tbz w0, #31, #+0x8",
        "ld1 {v0.s}[1], [x1]",
        "add x1, x1, #0x4 (4)",
        "mov x0, v2.d[1]",
        "tbz x0, #63, #+0x8",
        "ld1 {v0.s}[2], [x1]",
        "add x1, x1, #0x4 (4)",
        "tbz w0, #31, #+0x8",
        "ld1 {v0.s}[3], [x1]",
        "mov v2.16b, v0.16b",
        "str q2, [x28, #16]",
        "msr nzcv, x20"
      ]
    },
```

There's a small improvement available here in that nzcv doesn't actually need
to be touched in the ASIMD implementation, but I'll leave that for a future
improvement.
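
For reference, a rough C++ model of what the ASIMD fallback above computes
per 128-bit half (names are illustrative, not FEX code):

```cpp
#include <cstdint>

// Each 32-bit element is loaded only when the sign bit of the corresponding
// mask element is set; otherwise the destination element is zeroed.
static void MaskedLoad128(uint32_t Dst[4], const int32_t Mask[4], const uint32_t* Mem) {
  for (int i = 0; i < 4; ++i) {
    Dst[i] = (Mask[i] < 0) ? Mem[i] : 0; // sign bit set => load, else zero
  }
}
```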
2024-06-21 08:21:32 -07:00
Ryan Houdek
903d6a742e
CPUBackend: Removes SupportsSaturatingRoundingShifts option
This has always been true ever since we removed the x86 JIT and
Interpreter. It was left over and only added more code for no reason.
2024-06-21 08:11:22 -07:00
Ryan Houdek
424218e327
AVX128: Implement support for vpsign{b,w,d} 2024-06-21 08:11:22 -07:00
Ryan Houdek
17dc03d414
AVX128: Implement support for vpack{s,u}{wb,dw} 2024-06-21 08:11:21 -07:00
Ryan Houdek
baf699c6e1
AVX128: Implements support for vandnps and vpandn
This can't use the previous binary operator handler since the register
sources need to be swapped.
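
For reference, a small sketch of the semantics (not FEX code): the first
source is the one that gets complemented, which is why a handler that simply
forwards the sources in their encoded order can't be reused here.

```cpp
#include <cstdint>

// vandnps/vpandn compute (~src1) & src2 per element.
static inline uint64_t PAndN(uint64_t Src1, uint64_t Src2) {
  return ~Src1 & Src2;
}
```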
2024-06-21 08:11:21 -07:00
Ryan Houdek
1431af1ff5
AVX128: Implements support for vcvt{t,}s{s,d}2si 2024-06-21 08:11:21 -07:00
Ryan Houdek
775a41b903
AVX128: Implement support for vcvtsi2s{s,d} 2024-06-21 08:11:21 -07:00
Ryan Houdek
2da1e90dd5
Merge pull request #3738 from Sonicadvance1/cpuid_label
CPUID: Update labeling on some reserved bits
2024-06-21 07:41:57 -07:00
Ryan Houdek
e614340c0c
CPUID: Update labeling on some reserved bits
These aren't reserved and I was confused that they were missing.
2024-06-21 05:34:44 -07:00
Ryan Houdek
3c293b9aed
Arm64: Loosen restrictions on V{Load,Store}VectorMasked to allow 128-bit operation 2024-06-21 04:26:09 -07:00
Ryan Houdek
283c2861c9
AVX128: Implement support for vlddqu 2024-06-21 00:56:36 -07:00
Ryan Houdek
757dc95116
AVX128: Implement support for the punpckh instructions 2024-06-21 00:56:32 -07:00
Ryan Houdek
6192250b8a
AVX128: Implement support for the punpckl instructions 2024-06-21 00:56:28 -07:00
Ryan Houdek
f489135b1d
Merge pull request #3734 from Sonicadvance1/avx_8
AVX128: Move moves!
2024-06-21 00:53:41 -07:00
Ryan Houdek
4d00a52761
Merge pull request #3732 from Sonicadvance1/avx_6
unittests: Split up vtestps unittest to accumulate flags in independent registers.
2024-06-21 00:52:02 -07:00
Ryan Houdek
6941a59223
unittests: Split up vtestps unittest to accumulate flags in independent registers.
Makes it easier to see what is failing on the 128-bit side versus the
256-bit side.
2024-06-21 00:45:30 -07:00
Ryan Houdek
3f232e631e
Merge pull request #3730 from Sonicadvance1/avx_4
Vector: Helper refactorings
2024-06-21 00:31:14 -07:00
Ryan Houdek
6e3643c3ef
Merge pull request #3714 from pmatos/FSTstiTagSet
Set tag properly in X87 FST(reg)
2024-06-21 00:27:24 -07:00
Ryan Houdek
d7348c8aff
Merge pull request #3683 from Sonicadvance1/fix_broken_mprotect
SMCTracking: Fix incorrect mprotect tracking
2024-06-20 22:49:51 -07:00
Ryan Houdek
e7bdb8679d
Merge pull request #3735 from alyssarosenzweig/instcountci/seg-reg-cases
InstCountCI: add segment register cases
2024-06-20 09:43:42 -07:00
Ryan Houdek
c28824f94d
AVX128: Implements support for vbroadcast* 2024-06-20 09:43:10 -07:00
Ryan Houdek
664d766b45
AVX128: Implement support for vmovshdup 2024-06-20 09:43:10 -07:00
Ryan Houdek
fce694ed92
AVX128: Implement support for vmovsldup 2024-06-20 09:43:10 -07:00
Ryan Houdek
96aafb4f07
AVX128: Implement support for vmovddup
This instruction is a little weird. When accessing memory, the 128-bit
operating size of the instruction only loads 64 bits, while the 256-bit
operating size fetches a full 256 bits.

Theoretically the hardware could get away with two 64-bit loads or a wacky
24-byte load, but it looks like, to simplify the hardware, they spec'd the
256-bit version to always load the full range.
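
A rough C++ model of the two memory forms described above (illustration only,
not FEX code):

```cpp
#include <cstdint>
#include <cstring>

static void VMovDDup128(uint64_t Dst[2], const void* Mem) {
  uint64_t Q;
  std::memcpy(&Q, Mem, 8);   // the 128-bit form fetches only 64 bits
  Dst[0] = Dst[1] = Q;
}

static void VMovDDup256(uint64_t Dst[4], const void* Mem) {
  uint64_t Q[4];
  std::memcpy(Q, Mem, 32);   // the 256-bit form fetches the full 256 bits
  Dst[0] = Dst[1] = Q[0];    // low lane duplicates qword 0
  Dst[2] = Dst[3] = Q[2];    // high lane duplicates qword 2
}
```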
2024-06-20 09:43:10 -07:00
Alyssa Rosenzweig
a474f86ea8 InstCountCI: add segment register cases
add a bit of coverage for this funny addressing corner. We do handle this
optimally but I had to write this to check ;)

Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
2024-06-20 11:37:35 -04:00
Ryan Houdek
dbaf95a8f3
AVX128: Implement support for vmovhps/d 2024-06-20 06:53:21 -07:00
Ryan Houdek
e67df96ad9
AVX128: Implement support for movlps/d 2024-06-20 06:53:17 -07:00
Ryan Houdek
56de94578d
AVX128: Implement support for vmovq 2024-06-20 06:53:13 -07:00
Ryan Houdek
06fc2f5ef0
AVX128: Implement support for non-temporal moves. 2024-06-20 06:53:09 -07:00
Ryan Houdek
b3ba315cbd
AVX128: Implements unary/binary lambda helper 2024-06-20 06:53:05 -07:00
Ryan Houdek
e5a531e683
Vector: Refactor MPSADBWOpImpl so AVX128 can use it. 2024-06-20 06:43:57 -07:00
Ryan Houdek
e2de57bd04
Vector: Refactor PSADBWOpImpl so AVX128 can use it. 2024-06-20 06:43:57 -07:00
Ryan Houdek
4eebca93e3
Vector: Refactor PSHUFBOpImpl. This will be reused for AVX128 2024-06-20 06:33:27 -07:00
Ryan Houdek
3919ec9692
Vector: Expose VBLENDOpImpl in the OpcodeDispatcher. It will be reused by AVX128 2024-06-20 06:33:21 -07:00
Ryan Houdek
02aeb0ac1a
Vector: Restructure PMADDWDOpImpl. It's going to get reused for AVX128 2024-06-20 06:33:15 -07:00
Ryan Houdek
206544ad09
Vector: Reconfigure PMADDUBSWOpImpl, it's going to get reused for AVX128 2024-06-20 06:33:08 -07:00
Ryan Houdek
3854cd2b2f
Vector: Restructure SHUFOpImpl. AVX128 is going to reuse it. 2024-06-20 06:32:58 -07:00
Alyssa Rosenzweig
b2eb8aaf66
Merge pull request #3718 from Sonicadvance1/avx128_3
OpcodeDispatcher: Adds initial groundwork for decomposed AVX operations
2024-06-20 08:57:35 -04:00
Ryan Houdek
acbd920c9a OpcodeDispatcher: Adds initial groundwork for decomposed AVX operations
Only installs the tables if SVE256 isn't supported but AVX is explicitly
enabled with HostFeatures, to protect against accidental early enablement.

- Only implements 85 instructions starting out
- Basic vector moves
- Basic vector unary operations
- Basic vector binary operations
- VZeroUpper/VZeroAll

The bulk of the implementation is currently the handling for loading and
storing the halves of the registers from the context or from memory.

This means the load/store helpers must always return a pair unless only the
bottom half of the register is requested, which occurs with 128-bit AVX
operations. The store side then needs to consume the named zero register
when it shows up, since those cases zero the upper bits.
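
A minimal sketch of that pair-of-halves bookkeeping, under the assumption of
a simple handle type; the Ref type, the named zero, and the helper name here
are all hypothetical:

```cpp
struct Ref { int Id; };           // opaque IR value handle (made up)
inline const Ref NamedZero{-1};   // stands in for "upper half is zero"

// A 256-bit guest register handled as two 128-bit halves.
struct RefPair {
  Ref Low;                        // low 128 bits, always meaningful
  Ref High;                       // high 128 bits, or NamedZero
};

// 128-bit AVX ops only produce a low half; carrying NamedZero in the pair
// lets the store side clear the guest register's upper bits, per VEX.128 rules.
inline RefPair Make128BitResult(Ref Low) {
  return {Low, NamedZero};
}
```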

This implementation approach has a few benefits.
- I can pound this out extremely quickly
- SSE implementations are unaffected and don't need to deal with the
  insert behaviour of SVE256.
- We still keep the SVE256 implementation for the inevitable future when
  hardware vendors actually do implement it (Give it 8 years or
  something).
- We can actually unit test this path in CI once it is complete.
- We can partially optimize some paths with SVE128 (Gathers) and support
  a full ASIMD path if necessary.

One downside is that I can't enable this in CI yet because it can't pass
all unittests, but that's a non-issue since it is going to be in heavy
flux as I'm hammering out the implementation. It'll get switched on at
the end when it's passing all 1265 AVX unittests. Currently at 1001 of
those.
2024-06-20 08:44:14 -04:00
Alyssa Rosenzweig
db0bdd48e5
Merge pull request #3729 from alyssarosenzweig/refactor/address-modes
OpcodeDispatcher: Refactor address modes
2024-06-20 08:18:33 -04:00
Ryan Houdek
da21ee3cda
Merge pull request #3692 from pmatos/AFP_RPRES_fix
Fixes AFP.NEP handling on scalar insertions
2024-06-19 19:23:49 -07:00
Ryan Houdek
d2baef2b36
Merge pull request #3727 from Sonicadvance1/vaes
VAES support
2024-06-19 19:22:56 -07:00
Ryan Houdek
df96bc83cc
Merge pull request #3726 from Sonicadvance1/oryon_errata
HostFeatures: Work around Qualcomm Oryon RNG errata
2024-06-19 19:21:14 -07:00
Alyssa Rosenzweig
ec03831a21 OpcodeDispatcher: plumb A.NonTSO deeper
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
2024-06-19 08:52:07 -04:00