Commit Graph

9862 Commits

Alyssa Rosenzweig
504511fe7e RA: fix interaction between SRA & shuffles
missed a Map. tricky case hit by the unit test added in the next commit.

Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
2024-07-04 13:37:13 -04:00
Ryan Houdek
d3399a261b
Docs: Update for release FEX-2407 2024-07-03 17:59:42 -07:00
Ryan Houdek
d2437e6a21
Merge pull request #3810 from Sonicadvance1/x87_mmx_unittest
unittests: Adds MMX and x87 conflating unit test
2024-07-03 14:39:05 -07:00
Ryan Houdek
95dd6ceba8
unittests: Adds MMX and x87 conflating unit test
This failed with prior RCLSE deletion caching.
2024-07-03 13:54:07 -07:00
Alyssa Rosenzweig
1a0d135201
Merge pull request #3809 from alyssarosenzweig/rm/old-md
FEXCore: remove very out-of-date optimizer docs
2024-07-03 15:46:27 -04:00
Ryan Houdek
f453e1523e
Merge pull request #3803 from pmatos/NinjaCore
Use number of jobs as defined by TEST_JOB_COUNT
2024-07-03 12:42:14 -07:00
Alyssa Rosenzweig
622b0bfbc9 FEXCore: remove very out-of-date optimizer docs
most of this doesn't exist and won't exist. nothing lost here but hopes &
dreams.

Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
2024-07-03 11:36:48 -04:00
Paulo Matos
ad52514b97 Use number of jobs as defined by TEST_JOB_COUNT
At the moment we always run ctest with the maximum number of CPUs. If
TEST_JOB_COUNT is undefined, the current behaviour is kept; otherwise
ctest honours TEST_JOB_COUNT.

Therefore, to run ctest one test at a time, use
`cmake ... -DTEST_JOB_COUNT=1`
2024-07-03 14:09:39 +02:00
Alyssa Rosenzweig
02a218c6e3
Merge pull request #3804 from Sonicadvance1/revert_rclse_drop
Revert removing RCLSE
2024-07-03 07:37:02 -04:00
Ryan Houdek
2d617ad173
InstcountCI: Update 2024-07-02 20:24:58 -07:00
Ryan Houdek
0d06e3e47d
Revert "OpcodeDispatcher: add cache"
This reverts commit 46676ca376.
2024-07-02 20:24:57 -07:00
Ryan Houdek
78aee4d96e
Revert "IR: drop RCLSE"
This reverts commit a5b24bfe4c.
2024-07-02 20:21:59 -07:00
Ryan Houdek
ba04da87e5
Merge pull request #3780 from Sonicadvance1/optimize_gathers
Optimize gathers slightly
2024-07-02 10:58:38 -07:00
Ryan Houdek
2e6b08cbcb
Merge pull request #3798 from Sonicadvance1/minor_128bit_vbsl_opt
Arm64: Minor VBSL optimization with SVE128
2024-07-01 18:57:46 -07:00
Ryan Houdek
472a373861
Merge pull request #3786 from Sonicadvance1/non_temporal_stores
OpcodeDispatcher: Implement support for non-temporal vector stores
2024-07-01 18:57:38 -07:00
Ryan Houdek
a451420911
Merge pull request #3783 from Sonicadvance1/optimize_vector_zeroregister
OpcodeDispatcher: Optimize x86 canonical vector zero register
2024-07-01 18:57:31 -07:00
Mai
2e84f21c18
Merge pull request #3802 from Sonicadvance1/fix_sse41_helper
CodeEmitter: Fixes vector {ldr,str}{b,h} with reg-reg source
2024-07-01 20:42:49 -04:00
Ryan Houdek
fb7167c2d2
CodeEmitter: Fixes vector {ldr,str}{b,h} with reg-reg source
We had failed to enable these implementations for the
`ExtendedMemOperand` helpers. We had already implemented the non-helper
forms, which are already tested in CI; these helpers just weren't
updated.

Noticed this when running libaom's SSE4.1 tests, where it managed to
execute a pmovzxbq instruction with reg+reg memory source and was
breaking the test results.

There are /very/ few vector operations that access only 8 or 16 bits of
a vector, so this flew under the radar for quite a while.

This fixes libaom's unit tests.

Also adds a unittest using sse4.1 pmovzxbq to ensure we support the
reg+reg case, and also a few other instructions to test 8-bit and 16-bit
vector loads and stores.
2024-07-01 17:03:47 -07:00
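As an illustration of the addressing form this commit fixes, here is a minimal standalone sketch (an illustration only; the actual FEX unit tests are x86 assembly files) that exercises pmovzxbq with a reg+reg memory source:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  alignas(16) uint8_t src[16] = {0x11, 0x22};
  alignas(16) uint64_t dst[2] = {};
  uint64_t base = reinterpret_cast<uint64_t>(src);
  uint64_t index = 0;

  // pmovzxbq with a reg+reg memory source: loads two bytes from
  // [base + index] and zero-extends each into a 64-bit lane.
  __asm__ volatile("pmovzxbq (%0,%1), %%xmm0\n\t"
                   "movdqu %%xmm0, (%2)"
                   :
                   : "r"(base), "r"(index), "r"(dst)
                   : "xmm0", "memory");

  printf("%#lx %#lx\n", dst[0], dst[1]); // expect 0x11 0x22
}
```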
Mai
d884eb9287
Merge pull request #3801 from Sonicadvance1/fix_vpcmpgtw_typo
unittests: Fixes typo in vpcmpgtw test
2024-07-01 18:16:54 -04:00
Ryan Houdek
8b9b1a90e4
unittests: Fixes typo in vpcmpgtw test 2024-07-01 14:42:23 -07:00
Ryan Houdek
e2d4010b59
Merge pull request #3800 from Sonicadvance1/fix_vmovlhps
AVX128: Fixes vmovlhps
2024-07-01 14:41:43 -07:00
Ryan Houdek
babde31bf0
AVX128: Fixes vmovlhps
We didn't have a unit test for this and weren't implementing it at all;
we accidentally treated it as vmovhps/vmovhpd. Once again caught by the
libaom Intrinsics unit tests.
2024-07-01 13:54:11 -07:00
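For reference, a scalar sketch of the intended semantics: vmovlhps combines the low 64 bits of both sources, whereas vmovhps/vmovhpd only replace the high half.

```cpp
#include <cstdint>

struct Vec128 {
  uint64_t Lo, Hi;
};

// VMOVLHPS xmm1, xmm2, xmm3:
//   xmm1[63:0]   = xmm2[63:0]
//   xmm1[127:64] = xmm3[63:0]
// The bug treated it like vmovhps/vmovhpd, which fill only the high half
// from memory instead of combining the two sources' low halves.
Vec128 VMOVLHPS(Vec128 src1, Vec128 src2) {
  return {src1.Lo, src2.Lo};
}
```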
Ryan Houdek
c282239077
InstcountCI: Add SVE128 VEX_map3 2024-06-30 16:27:58 -07:00
Ryan Houdek
8d28a441ab
Arm64: Minor VBSL optimization with SVE128
This is a very minor performance change. Cortex CPUs that support SVE
fuse movprfx+<instruction> to remove two cycles and a dependency from
the backend.

Converting from ASIMD mov+bsl to SVE movprfx+bsl is therefore a minor
win, saving two cycles and a dependency on Cortex-A710 and A715. It is
slightly less of a win on Cortex-A720/A725, which support zero-cycle
vector register renames, but it is still a win on Cortex-X925 because
that is an older core design that doesn't support zero-cycle vector
register renames.

Very silly little thing.
2024-06-30 16:22:29 -07:00
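For reference, the operation in question is a bitwise select; a scalar sketch of its semantics:

```cpp
#include <cstdint>

// NEON BSL is destructive in the mask operand (the destination initially
// holds the mask), so preserving the mask costs an extra mov. The SVE
// sequence replaces that mov with a movprfx, which these Cortex cores can
// fuse with the following bsl.
uint64_t BitwiseSelect(uint64_t mask, uint64_t a, uint64_t b) {
  // Pick each bit from `a` where the mask bit is set, else from `b`.
  return (a & mask) | (b & ~mask);
}
```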
Ryan Houdek
5821054d91
Merge pull request #3789 from Sonicadvance1/avx128_minor_pshufb_opt
AVX128: Minor optimization to 256-bit vpshufb
2024-06-30 15:45:11 -07:00
Ryan Houdek
4626145374
Merge pull request #3792 from Sonicadvance1/avx128_fix_scalar_fma
AVX128: Fixes scalar FMA accidentally using vector wide
2024-06-30 15:36:09 -07:00
Ryan Houdek
a786d3621d
InstcountCI: Update for Scalar FMA 2024-06-30 14:36:56 -07:00
Ryan Houdek
1393dc2a5b
AVX128: Fixes scalar FMA accidentally using vector wide 2024-06-30 14:36:33 -07:00
Ryan Houdek
c4604465ba
InstcountCI: Update 2024-06-30 13:41:14 -07:00
Ryan Houdek
cffae9cb0f
AVX128: Minor optimization to 256-bit vpshufb 2024-06-30 13:41:03 -07:00
Ryan Houdek
cf24d3c33f
Merge pull request #3781 from Sonicadvance1/optimize_vmovlh
AVX128: Minor optimization to vmov{l,h}{ps,pd}
2024-06-29 23:15:53 -07:00
Ryan Houdek
672e885e40
InstcountCI: Adds canonical zero register tests 2024-06-29 22:21:53 -07:00
Ryan Houdek
7d05610da7
OpcodeDispatcher: Optimize x86 canonical vector zero register
The canonical way to generate a zeroed vector register in x86 is to xor
the register with itself. Capturing this lets us convert it to the
canonical zero register instead, which can get zero-cycle renamed on the
latest CPUs.
2024-06-29 22:21:53 -07:00
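A sketch of the idea, with hypothetical names rather than FEX's actual dispatcher code:

```cpp
#include <cstdio>
#include <string>

// Hypothetical, simplified IR node for illustration only.
struct Node {
  std::string Op;
  int Src1 = -1, Src2 = -1;
};

// pxor/vpxor of a register with itself yields zero regardless of the
// register's contents, so fold it to a canonical zero-vector node that
// the backend can rename for free.
Node LowerVectorXor(int Src1, int Src2) {
  if (Src1 == Src2) return {"LoadZeroVector"};
  return {"VXor", Src1, Src2};
}

int main() {
  printf("%s\n", LowerVectorXor(3, 3).Op.c_str()); // LoadZeroVector
}
```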
Ryan Houdek
a843ecf4c8
InstcountCI: Update for non-temporal stores 2024-06-29 22:05:56 -07:00
Ryan Houdek
f4ff1b0688
OpcodeDispatcher: Implement support for non-temporal vector stores
x86 doesn't have a lot of non-temporal vector stores, but we do have a
few of them:

- MMX: MOVNTQ
- SSE2: MOVNTDQ, MOVNTPS, MOVNTPD
- AVX: VMOVNTDQ (128-bit & 256-bit), VMOVNTPD

Additionally, SSE4a adds 32-bit and 64-bit scalar vector non-temporal
stores, which we keep as regular stores, since ARM doesn't have matching
semantics for those.

There are also non-temporal vector LOADS, which this change doesn't
touch:
- SSE4.1: MOVNTDQA
- AVX: VMOVNTDQA (128-bit)
- AVX2: VMOVNTDQA (256-bit)

Fixes #3364
2024-06-29 22:05:56 -07:00
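For reference, guest code typically reaches these instructions through compiler intrinsics; a minimal example that emits MOVNTPS (assumes 16-byte-aligned pointers and a multiple-of-4 count):

```cpp
#include <cstddef>
#include <immintrin.h>

// Streaming copy: MOVNTPS (_mm_stream_ps) writes around the cache
// hierarchy, which is the behaviour now lowered to ARM64 non-temporal
// stores.
void StreamCopy(float* dst, const float* src, size_t n) {
  for (size_t i = 0; i + 4 <= n; i += 4) {
    __m128 v = _mm_load_ps(src + i); // regular aligned load
    _mm_stream_ps(dst + i, v);       // MOVNTPS: non-temporal store
  }
  _mm_sfence(); // order the streaming stores before later reads
}
```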
Ryan Houdek
2b4cec8385
Arm64: Implement support for non-temporal vector stores 2024-06-29 22:03:17 -07:00
Ryan Houdek
8ab4ab29f8
CodeEmitter: Add SVE contiguous non-temporal instructions 2024-06-29 21:51:58 -07:00
Ryan Houdek
cc0509c0f3
InstcountCI: Update 2024-06-29 19:27:39 -07:00
Ryan Houdek
ebfa65fedc
AVX128: Minor optimization to vmov{l,h}{ps,pd} 2024-06-29 19:27:16 -07:00
Ryan Houdek
a34ae24b3f
InstcountCI: Update for SVE non-base address reg 2024-06-29 13:16:02 -07:00
Ryan Houdek
58ea76eb24
Arm64: Minor optimization to gather loads with no base addr register and SVE path
Arm64's SVE gather load can be slightly optimized when a base GPR isn't
provided, as there is a form of the instruction that doesn't require
one.

The limitation of that form is that it doesn't support scaling at all,
so it only applies when the offset scale is 1.
2024-06-29 13:14:35 -07:00
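A sketch of the selection logic described above (hypothetical helper, not FEX's actual code):

```cpp
#include <cstdint>
#include <string>

// The SVE vector-plus-immediate gather form, e.g.
//   ld1d {z0.d}, p0/z, [z1.d]
// needs no base GPR but treats the offsets as byte addresses (no
// scaling), so it only applies when the offset scale is 1.
std::string SelectGatherForm(bool HasBaseGPR, uint8_t OffsetScale) {
  if (!HasBaseGPR && OffsetScale == 1) {
    return "ld1d {zd.d}, p0/z, [zoffsets.d]";      // vector + immediate
  }
  // Scalar base plus vector offsets (the base may be a zeroed temporary).
  return "ld1d {zd.d}, p0/z, [xbase, zoffsets.d]";
}
```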
Ryan Houdek
e9a17b19c5
InstcountCI: Add SVE gathers without base addr 2024-06-29 13:07:32 -07:00
Ryan Houdek
ce8d111453
InstcountCI: Update 2024-06-29 13:04:21 -07:00
Ryan Houdek
47fd73f6cf
Arm64: Optimize non-SVE gather load
When FEX hits the optimal case where the destination isn't one of the
incoming sources (other than the incomingDest source), we can optimize
out two moves per 128-bit lane.

Cuts 256-bit non-SVE gather loads from 50 instructions down to 46.
2024-06-29 13:02:10 -07:00
Ryan Houdek
76f3391ebc
Merge pull request #3779 from Sonicadvance1/cpuinfo_cyclecounter
Linux: Calculate cycle counter frequency for cpuinfo
2024-06-29 11:58:32 -07:00
Ryan Houdek
be6ff52709
Linux: Calculate cycle counter frequency for cpuinfo
Some applications don't measure rdtsc correctly and instead use cpuinfo
to get the CPU core's base clock speed, which for most x86 CPUs also
matches their cycle counter speed.

Did this as a quick test to see if this would help `Unbound: Worlds
Apart` stuttering while BinaryNinja was disassembling the binary.

Turns out the game doesn't use cpuinfo for its cycle counter speed
determination, but it is good to implement this regardless.
2024-06-28 16:38:49 -07:00
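A minimal sketch of deriving a cycle counter frequency on ARM64 (an illustration, not FEX's actual code; the generic timer frequency is readable from EL0):

```cpp
#include <cstdint>
#include <cstdio>

// ARM64 exposes the generic timer's frequency architecturally, so an
// emulator can report a matching clock speed through its emulated
// cpuinfo.
static uint64_t CounterFrequencyHz() {
  uint64_t Freq;
  __asm__("mrs %0, cntfrq_el0" : "=r"(Freq));
  return Freq;
}

int main() {
  printf("cycle counter: %.2f MHz\n", CounterFrequencyHz() / 1.0e6);
}
```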
Ryan Houdek
e99e252188
Merge pull request #3731 from Sonicadvance1/avx_5
HostFeatures: Always disable AVX in 32-bit mode to protect from stack overflows
2024-06-28 13:37:55 -07:00
Ryan Houdek
98b980f7e3
TestHarnessRunner: Ensure we are still reconstructing XMM registers if we don't support AVX
Also fixes a bug where we were destroying the thread context before
reading the data from it, spooky.
2024-06-28 13:05:52 -07:00
Ryan Houdek
f2f90eeb82
FEXCore: Make more distinctions between host register size and guest vector register size
We can support a few combinations of guest and host vector sizes:
- Host: 128-bit or 256-bit
- Guest: 128-bit or 256-bit

The typical case is Host = 128-bit and Guest = 256-bit now that AVX is
implemented. On 32-bit this changes to Host = 128-bit and Guest =
128-bit because we disable AVX.

In the vixl simulator, 32-bit turns into Host = 256-bit and Guest =
128-bit, and 64-bit turns into Host = 256-bit and Guest = 256-bit.

We cover all four combinations of guest and host vector register sizes!

Fixes a few assumptions that SVE256 = AVX256, basically.
2024-06-28 13:05:52 -07:00
Ryan Houdek
f267fd2250
HostFeatures: Always disable AVX in 32-bit mode to protect from stack overflows 2024-06-28 13:05:52 -07:00