Commit Graph

1592 Commits

Author SHA1 Message Date
Alyssa Rosenzweig
1b552a6f62 JIT: fix ShiftFlags masking
we don't update flags for a nonzero shift that masks to zero.

Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
2024-07-05 09:57:42 -04:00
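
For context, a minimal C++ sketch of the x86 shift-count masking rule this fix relies on (illustrative only, not FEX's JIT code):

```cpp
#include <cstdint>

// Hypothetical sketch of the x86 shift-count masking rule (not FEX code).
// The count is masked to 5 bits (6 for 64-bit operands) first; if the masked
// count is zero, flags stay untouched even though the raw count was nonzero.
uint32_t ShiftLeft32(uint32_t value, uint8_t count, bool& flagsUpdated) {
  const uint8_t masked = count & 0x1F; // 0x3F for 64-bit operands
  flagsUpdated = (masked != 0);        // e.g. `shl eax, 32` leaves flags alone
  return masked ? (value << masked) : value;
}
```
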
Mai
f2d1f2de56
Merge pull request #3817 from Sonicadvance1/fix_x87_integer_indefinite
Softfloat: Fixes Integer indefinite return for 16-bit signed values
2024-07-04 23:11:44 -04:00
Ryan Houdek
692c2fae96
Merge pull request #3813 from alyssarosenzweig/bug/fix-sbb
Fix 16-bit SBB
2024-07-04 19:52:37 -07:00
Ryan Houdek
8955f83ef6
Softfloat: Fixes Integer indefinite return for 16-bit signed values
Regardless of sign, if the converted value doesn't fit into the
destination int16_t, the result is INT16_MIN (the integer indefinite value).
2024-07-04 17:43:28 -07:00
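
To make the "integer indefinite" behaviour concrete, a minimal C++ sketch assuming a plain out-of-range check (hypothetical helper, not FEX's softfloat code):

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical sketch of x87-style conversion to int16_t: anything that can't
// be represented (too large, too small, or NaN) yields the "integer
// indefinite" value INT16_MIN (0x8000), regardless of sign.
int16_t ConvertToInt16Indefinite(double value) {
  constexpr double Min = std::numeric_limits<int16_t>::min();
  constexpr double Max = std::numeric_limits<int16_t>::max();
  if (std::isnan(value) || value < Min || value > Max) {
    return std::numeric_limits<int16_t>::min(); // integer indefinite, 0x8000
  }
  return static_cast<int16_t>(value);
}
```
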
Ryan Houdek
38a823cc54
Arm64: Fixes long signed divide
The two halves are provided as two uint64_t values and must not be
sign-extended individually. Treat them as uint64_t until they are combined
into a single int128_t. Fixes long signed divide.
2024-07-04 16:42:23 -07:00
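
A minimal sketch of the combining step described above, assuming compiler __int128 support (illustrative helper, not the FEX code):

```cpp
#include <cstdint>

// Hypothetical sketch: build the 128-bit dividend for a long signed divide.
// The halves stay unsigned while being combined; only the final 128-bit
// value is reinterpreted as signed.
__int128 CombineDividend(uint64_t high, uint64_t low) {
  const unsigned __int128 combined =
      (static_cast<unsigned __int128>(high) << 64) | low;
  return static_cast<__int128>(combined);
}
```
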
Ryan Houdek
90a6647fa4
Merge pull request #3811 from alyssarosenzweig/ra/fix-lsp
RA: fix interaction between SRA & shuffles
2024-07-04 14:20:46 -07:00
Alyssa Rosenzweig
a38205069b OpcodeDispatcher: fix SBB carry flag
do it the naive way, just applying the x86 definitions of SBB.

Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
2024-07-04 16:58:45 -04:00
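
For reference, a minimal C++ sketch of the x86 SBB definition being applied, DEST = DEST - (SRC + CF) with CF set on borrow (illustrative only, not the dispatcher code):

```cpp
#include <cstdint>

// Hypothetical sketch of 16-bit SBB semantics: dest = dest - (src + CF_in),
// computed in a wider type so the borrow (carry-out) can be observed.
uint16_t Sbb16(uint16_t dest, uint16_t src, bool cf_in, bool& cf_out) {
  const uint32_t subtrahend = static_cast<uint32_t>(src) + (cf_in ? 1u : 0u);
  cf_out = subtrahend > dest; // borrow out of bit 15 sets the carry flag
  return static_cast<uint16_t>(static_cast<uint32_t>(dest) - subtrahend);
}
```

For example, dest = 0, src = 0xFFFF, CF = 1 gives a result of 0 with CF still set.
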
Alyssa Rosenzweig
504511fe7e RA: fix interaction between SRA & shuffles
missed a Map. tricky case hit by the unit test added in the next commit.

Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
2024-07-04 13:37:13 -04:00
Alyssa Rosenzweig
1a0d135201
Merge pull request #3809 from alyssarosenzweig/rm/old-md
FEXCore: remove very out-of-date optimizer docs
2024-07-03 15:46:27 -04:00
Ryan Houdek
f453e1523e
Merge pull request #3803 from pmatos/NinjaCore
Use number of jobs as defined by TEST_JOB_COUNT
2024-07-03 12:42:14 -07:00
Alyssa Rosenzweig
622b0bfbc9 FEXCore: remove very out-of-date optimizer docs
most of this doesn't exist and won't exist. nothing lost here but hopes &
dreams.

Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
2024-07-03 11:36:48 -04:00
Paulo Matos
ad52514b97 Use number of jobs as defined by TEST_JOB_COUNT
At the moment we always run ctest with the maximum number of CPUs. If
TEST_JOB_COUNT is undefined, the current behaviour is kept; otherwise
ctest honours TEST_JOB_COUNT.

Therefore, to run ctest one test at a time, use
`cmake ... -DTEST_JOB_COUNT=1`
2024-07-03 14:09:39 +02:00
Ryan Houdek
0d06e3e47d
Revert "OpcodeDispatcher: add cache"
This reverts commit 46676ca376.
2024-07-02 20:24:57 -07:00
Ryan Houdek
78aee4d96e
Revert "IR: drop RCLSE"
This reverts commit a5b24bfe4c.
2024-07-02 20:21:59 -07:00
Ryan Houdek
ba04da87e5
Merge pull request #3780 from Sonicadvance1/optimize_gathers
Optimize gathers slightly
2024-07-02 10:58:38 -07:00
Ryan Houdek
2e6b08cbcb
Merge pull request #3798 from Sonicadvance1/minor_128bit_vbsl_opt
Arm64: Minor VBSL optimization with SVE128
2024-07-01 18:57:46 -07:00
Ryan Houdek
472a373861
Merge pull request #3786 from Sonicadvance1/non_temporal_stores
OpcodeDispatcher: Implement support for non-temporal vector stores
2024-07-01 18:57:38 -07:00
Ryan Houdek
a451420911
Merge pull request #3783 from Sonicadvance1/optimize_vector_zeroregister
OpcodeDispatcher: Optimize x86 canonical vector zero register
2024-07-01 18:57:31 -07:00
Ryan Houdek
babde31bf0
AVX128: Fixes vmovlhps
We didn't have a unit test for this and we weren't implementing it at
all; we accidentally treated it as vmovhps/vmovhpd. Once again caught by
the libaom Intrinsics unit tests.
2024-07-01 13:54:11 -07:00
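
To make the difference concrete, a small C++ model of the two instructions' lane semantics (a sketch of the architectural behaviour, not FEX's implementation):

```cpp
#include <array>
#include <cstdint>

// A 128-bit register modelled as two 64-bit lanes: [0] = low, [1] = high.
using Xmm = std::array<uint64_t, 2>;

// vmovlhps dst, src1, src2: keep src1's LOW half, insert src2's LOW half
// into the HIGH half of the destination.
Xmm Vmovlhps(Xmm src1, Xmm src2) {
  return {src1[0], src2[0]};
}

// vmovhps dst, src1, m64: keep src1's LOW half, insert a 64-bit memory
// value into the HIGH half. Handling vmovlhps as if it were this form is
// the kind of mix-up the commit above fixes.
Xmm Vmovhps(Xmm src1, uint64_t mem64) {
  return {src1[0], mem64};
}
```
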
Ryan Houdek
8d28a441ab
Arm64: Minor VBSL optimization with SVE128
This is a very minor performance change. Cortex CPUs that support SVE
fuse movprfx+<instruction>, removing two cycles and a dependency in the
backend.

Because of this, converting from ASIMD mov+bsl to SVE movprfx+bsl is a
minor win, saving two cycles and a dependency on Cortex-A710 and A715.
It is slightly less of a win on Cortex-A720/A725 because those cores
support zero-cycle vector register renames, but it is still a win on
Cortex-X925 because that is an older core design that doesn't support
zero-cycle vector register renames.

Very silly little thing.
2024-06-30 16:22:29 -07:00
Ryan Houdek
5821054d91
Merge pull request #3789 from Sonicadvance1/avx128_minor_pshufb_opt
AVX128: Minor optimization to 256-bit vpshufb
2024-06-30 15:45:11 -07:00
Ryan Houdek
4626145374
Merge pull request #3792 from Sonicadvance1/avx128_fix_scalar_fma
AVX128: Fixes scalar FMA accidentally using vector wide
2024-06-30 15:36:09 -07:00
Ryan Houdek
1393dc2a5b
AVX128: Fixes scalar FMA accidentally using vector wide 2024-06-30 14:36:33 -07:00
Ryan Houdek
cffae9cb0f
AVX128: Minor optimization to 256-bit vpshufb 2024-06-30 13:41:03 -07:00
Ryan Houdek
7d05610da7
OpcodeDispatcher: Optimize x86 canonical vector zero register
The canonical way to generate a zeroed vector register in x86 is to xor
a register with itself. Capture this pattern and convert it to the
canonical zero register instead.

This can get zero-cycle renamed on the latest CPUs.
2024-06-29 22:21:53 -07:00
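
A hedged sketch of the pattern match, with hypothetical names that are not FEX's real OpcodeDispatcher API:

```cpp
// Hypothetical illustration only; names and types are not FEX's real API.
struct VectorXorOp {
  int Src1Reg;
  int Src2Reg;
};

// pxor/vpxor/xorps of a register with itself always produces zero, so the
// dispatcher can emit the canonical zero-vector constant (which recent CPUs
// rename in zero cycles) instead of a real xor.
bool IsVectorZeroIdiom(const VectorXorOp& op) {
  return op.Src1Reg == op.Src2Reg;
}
```
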
Ryan Houdek
f4ff1b0688
OpcodeDispatcher: Implement support for non-temporal vector stores
x86 doesn't have a lot of non-temporal vector stores but we do have a
few of them.

- MMX: MOVNTQ
- SSE2: MOVNTDQ, MOVNTPS, MOVNTPD
- AVX: VMOVNTDQ (128-bit & 256-bit), VMOVNTPD

Additionally, SSE4a adds 32-bit and 64-bit scalar vector non-temporal
stores, which we keep as regular stores since ARM doesn't have matching
semantics for those.

Additionally, SSE4.1 adds non-temporal vector loads, which this doesn't
touch:
- SSE4.1: MOVNTDQA
- AVX: VMOVNTDQA (128-bit)
- AVX2: VMOVNTDQA (256-bit)

Fixes #3364
2024-06-29 22:05:56 -07:00
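
A hedged sketch of the lowering decision described above, using a hypothetical classification helper rather than FEX's actual tables or IR:

```cpp
#include <string_view>

// Hypothetical classification; not FEX's actual tables or IR.
// The non-temporal *vector* stores listed above lower to ARM64 non-temporal
// stores, while the SSE4a scalar forms (MOVNTSS/MOVNTSD) have no matching
// ARM semantics and stay regular stores.
bool LowersToNonTemporalStore(std::string_view mnemonic) {
  for (std::string_view nt :
       {"movntq", "movntdq", "movntps", "movntpd", "vmovntdq", "vmovntpd"}) {
    if (mnemonic == nt) {
      return true;
    }
  }
  return false; // movntss / movntsd / everything else stays a regular store
}
```
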
Ryan Houdek
2b4cec8385
Arm64: Implement support for non-temporal vector stores 2024-06-29 22:03:17 -07:00
Ryan Houdek
8ab4ab29f8
CodeEmitter: Add SVE contiguous non-temporal instructions 2024-06-29 21:51:58 -07:00
Ryan Houdek
ebfa65fedc
AVX128: Minor optimization to vmov{l,h}{ps,pd} 2024-06-29 19:27:16 -07:00
Ryan Houdek
58ea76eb24
Arm64: Minor optimization to gather loads with no base addr register and SVE path
Arm64's SVE gather load can be slightly optimized when a base GPR isn't
provided, as there is a form of the instruction that doesn't require one.

The limitation of this form is that it doesn't support scaling at all,
so it only works if the offset scale is 1.
2024-06-29 13:14:35 -07:00
Ryan Houdek
47fd73f6cf
Arm64: Optimize non-SVE gather load
When FEX hits the optimal case where the destination isn't one of the
incoming sources (other than the incomingDest source), we can optimize
out two moves per 128-bit lane.

This cuts 256-bit non-SVE gather loads from 50 instructions down to 46.
2024-06-29 13:02:10 -07:00
Ryan Houdek
f2f90eeb82
FEXCore: Make more distinctions between host register size and guest vector register size
We can support a few combinations of guest and host vector sizes:
- Host: 128-bit or 256-bit
- Guest: 128-bit or 256-bit

The typical case is Host = 128-bit and Guest = 256-bit now that AVX is
implemented. On 32-bit this changes to Host = 128-bit and Guest = 128-bit
because we disable AVX.

In the vixl simulator, 32-bit turns into Host = 256-bit and Guest =
128-bit, and 64-bit turns into Host = 256-bit and Guest = 256-bit.

We cover all four combinations of guest and host vector register sizes!

This basically fixes a few assumptions that SVE256 = AVX256.
2024-06-28 13:05:52 -07:00
Ryan Houdek
f267fd2250
HostFeatures: Always disable AVX in 32-bit mode to protect from stack overflows 2024-06-28 13:05:52 -07:00
Ryan Houdek
4060f4018e
Frontend: Fixes invalid VSIB Index problem
In regular SIB land, the index register encoding of 0b100 encodes to "no
register"; this feature lets you get SIB encodings without an index
register, for flexibility.

In VSIB encoding this isn't expected behaviour; instead there are no
encodings where an index register is missing, allowing you to encode all
sixteen registers as an index register.

This was causing an abort in `AVX128_LoadVSIB` because the index turned
into an invalid register.

Working instruction:
`vgatherdps ymm2, dword [eax+ymm5*4], ymm7`

Broken instruction:
`vgatherdps ymm0, dword [eax+ymm4*4], ymm7`

This fixes a crash in libfmod, which uses gathers in the wild, and with
it a crash in Ender Lilies.
2024-06-27 20:55:30 -07:00
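
A minimal sketch of the decoding difference, with a hypothetical helper rather than the FEX frontend: index 0b100 means "no index" only for regular SIB, which is exactly why ymm5 worked above while ymm4 (encoding 0b100) aborted.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical sketch of SIB vs VSIB index decoding (not FEX's decoder).
// `indexBits` is the 3-bit index field; `xBit` is the REX.X / inverted VEX.X
// extension bit already normalized to 0 or 1.
std::optional<uint8_t> DecodeIndex(uint8_t indexBits, uint8_t xBit, bool isVSIB) {
  const uint8_t index = static_cast<uint8_t>((xBit << 3) | (indexBits & 0b111));
  // Regular SIB: encoding 0b100 with the extension bit clear means "no index".
  if (!isVSIB && index == 0b100) {
    return std::nullopt;
  }
  // VSIB: every encoding, including 0b100 (xmm4/ymm4), names an index register.
  return index;
}
```
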
Ryan Houdek
aba7a3a830
AVX128: Fixes vblendps lower and upper selector 2024-06-27 17:20:39 -07:00
Ryan Houdek
9027d1eee7
AVX128: Fixes bug in vector immediate shift 2024-06-27 16:22:14 -07:00
Ryan Houdek
4e5da4946d
Merge pull request #3773 from bylaws/win-fixes
Windows: Small fixes for compat with newer toolchains/wine versions
2024-06-27 15:14:20 -07:00
Billy Laws
a70e3e42b2 FEXCore: Drop unneeded MinGW library naming workaround
It's generally expected for libraries to use the .a suffix with MinGW,
and DLLs are still correctly named without the prior special handling.
2024-06-27 23:01:21 +01:00
Billy Laws
09f476924f FEXCore: Fix missing return in win32 SetSignalMask path 2024-06-27 23:01:21 +01:00
Billy Laws
230e3245fd FileLoading: Fix compilation with newer libc++ 2024-06-27 23:01:21 +01:00
Ryan Houdek
b0eb63ab9a
FEXCore: Fixes address size override on GPR sources and destinations
When the source or destination is a register, the address size override
doesn't apply. We were accidentally applying it to all sources
regardless of type, which was causing us to zero-extend on operations
that aren't affected by the address size override.

This fixes the OpenSSL cert error in every application, but most
importantly Steam.
2024-06-27 14:12:01 -07:00
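
A hedged sketch of the rule being fixed, with a hypothetical operand model rather than FEX's dispatcher types:

```cpp
#include <cstdint>

// Hypothetical operand model; not FEX's actual dispatcher types.
enum class OperandType { Register, Memory };

// The 0x67 address-size override only narrows the effective-address width of
// memory operands; register sources and destinations keep their full width
// and must not be zero-extended because of the prefix.
uint32_t EffectiveWidth(OperandType type, bool hasAddressSizeOverride) {
  if (type == OperandType::Memory && hasAddressSizeOverride) {
    return 32; // 0x67 in 64-bit mode drops effective addresses to 32-bit
  }
  return 64;
}
```
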
Ryan Houdek
2e3242682d
Merge pull request #3771 from alyssarosenzweig/opt/asimd-masked
OpcodeDispatcher: optimize nzcv with asimd masked load/store
2024-06-27 10:27:10 -07:00
Alyssa Rosenzweig
196a0531e0 OpcodeDispatcher: optimize nzcv with asimd masked load/store
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
2024-06-27 10:37:06 -04:00
Alyssa Rosenzweig
f9b53c6b51 AVX_128: save a move in vzeroall
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
2024-06-27 10:30:25 -04:00
Ryan Houdek
dad47b7bda
CPUID: Oops, forgot to enable AVX2 2024-06-26 17:43:56 -07:00
Ryan Houdek
4d56fec5f1
AVX128: Work around glibc fault testing 2024-06-26 16:49:00 -07:00
Ryan Houdek
8181552b16
AVX128: Actually install AVX helpers per thread.
How this didn't break the world in my testing I don't know.
2024-06-26 16:49:00 -07:00
Ryan Houdek
975069825e
AVX128: Fix a real bug with VCVTPS2PH 2024-06-26 16:49:00 -07:00
Ryan Houdek
031d56de35
HostFeatures: Enables AVX unconditionally 2024-06-26 15:03:21 -07:00
Ryan Houdek
b5e696b3cb
CPUID: Implement support for XCR0 when AVX is enabled
This enables AVX, AVX2, FMA3 for the entire CPUID!

```bash
$ FEX_HOSTFEATURES=enableavx,enableavx2 ./Bin/FEXInterpreter /usr/bin/cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Cortex-A78AE
stepping        : 0
microcode       : 0x0
cpu MHz         : 3000
cache size      : 512 KB
physical id     : 0
siblings        : 12
core id         : 0
cpu cores       : 12
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht tm syscall nx mmxext fxsr_opt rdtscp lm 3dnow 3dnowext constant_tsc art rep_good nopl xtoplogy nonstop_tsc cpuid tsc_known_freq pni pclmulqdq dtes64 monitor tm2 ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx hypervisor lahf_lm cmp_legacy extapic abm 3dnowprefetch tce fsgsbase bmi1 avx2 smep bmi2 erms invpcid adx clflushopt clwb sha_ni clzero arat vpclmulqdq rdpid fsrm
bugs            :
bogomips        : 8000.0
TLB size        : 2560 4K pages
clflush size    : 64
cache_alignment  : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:
```

Notice avx, avx2, and fma
2024-06-26 14:56:01 -07:00
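
For reference, a small sketch of the XCR0 bits a guest expects via xgetbv once AVX is exposed (illustrative constants, not FEX's CPUID/XSAVE emulation code):

```cpp
#include <cstdint>

// Illustrative XCR0 state bits; not FEX's actual CPUID/XSAVE emulation.
constexpr uint64_t XCR0_X87 = 1ull << 0; // x87 state, always enabled
constexpr uint64_t XCR0_SSE = 1ull << 1; // XMM state
constexpr uint64_t XCR0_AVX = 1ull << 2; // YMM upper-half state

// What a guest reading XGETBV(0) expects to see once AVX is exposed: 0b111.
constexpr uint64_t GuestXCR0WithAVX = XCR0_X87 | XCR0_SSE | XCR0_AVX;

static_assert(GuestXCR0WithAVX == 0x7, "x87 + SSE + AVX state enabled");
```
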