121 Commits

Author SHA1 Message Date
Ryan Houdek
2501ebc1cd
Merge pull request #2980 from lioncash/avg
Arm64/VectorOps: Remove redundant moves from SVE VURAvg if possible
2023-08-23 14:05:51 -07:00
Lioncache
8431ab43a0 Arm64/VectorOps: Remove redundant moves from SVE VURAvg if possible
If Dst and Vector1 alias one another, then we can perform the operation
without needing to move any data around.
2023-08-23 16:51:17 -04:00
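
The pattern behind this commit: SVE URHADD is a destructive encoding, meaning
the destination register must also be the first source operand, so the JIT
only needs a copy when Dst and Vector1 differ. A minimal reference of the
per-lane semantics (a sketch, assuming VURAvg implements the usual unsigned
rounding average, i.e. x86 PAVGB-style):

    #include <cstddef>
    #include <cstdint>

    // Unsigned rounding halving add: result = (a + b + 1) >> 1 per lane.
    // SVE's destructive URHADD computes this in place in the destination.
    void URAvg8(uint8_t* dst, const uint8_t* a, const uint8_t* b, size_t lanes) {
      for (size_t i = 0; i < lanes; ++i) {
        dst[i] = static_cast<uint8_t>((uint16_t(a[i]) + uint16_t(b[i]) + 1) >> 1);
      }
    }
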
Mai
4b06069c0d
Merge pull request #2972 from Sonicadvance1/optimize_scalar_mov
OpcodeDispatcher: Optimizes scalar movd/movq
2023-08-23 16:30:45 -04:00
Ryan Houdek
b646f4b781
Merge pull request #2978 from lioncash/misc
OpcodeDispatcher: Remove redundant moves from remaining AVX ops
2023-08-23 13:18:25 -07:00
Ryan Houdek
8836ab8988 OpcodeDispatcher: Optimizes scalar movd/movq
MMX and SSE versions are now optimal.
2023-08-23 12:56:11 -07:00
Lioncache
ea9747289a OpcodeDispatcher: Remove redundant moves from remaining AVX ops
Zero-extension will occur automatically if necessary upon storing.
2023-08-23 15:31:59 -04:00
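
Several commits below repeat this zero-extension rationale, so one sketch for
all of them: when a VEX-encoded 128-bit result is stored back to a guest
register, bits [255:128] are zeroed as part of the store itself, making any
explicit zeroing move beforehand dead. A toy model of that store path
(illustrative names only, not FEX's actual API):

    #include <array>
    #include <cstdint>
    #include <cstring>

    struct YmmReg {
      std::array<uint64_t, 4> q{};
    };

    // Storing a 128-bit result zero-extends into the upper YMM lanes, so a
    // separate "zero the upper half" move before this call is redundant.
    void StoreResult(YmmReg& guest, const uint64_t* result, unsigned opBytes) {
      std::memcpy(guest.q.data(), result, opBytes);
      if (opBytes == 16) {
        guest.q[2] = 0;
        guest.q[3] = 0;
      }
    }
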
Lioncache
735e2060a3 OpcodeDispatcher: Remove redundant moves from VPACKUSOP/VPACKSSOp
Zero-extension will occur automatically if necessary.
2023-08-23 15:09:57 -04:00
Ryan Houdek
86ef6fe48d
Merge pull request #2976 from lioncash/mov
OpcodeDispatcher: Remove unnecessary moves from AVX move ops where applicable
2023-08-23 12:02:05 -07:00
Lioncache
8e7e91d61f OpcodeDispatcher: Remove redundant moves in VMOVLPOp
Zero-extension will automatically occur upon storing if necessary.
2023-08-23 14:23:57 -04:00
Lioncache
bcba3700c8 OpcodeDispatcher: Remove redundant moves from VMOVVectorNTOp
Zero-extension will automatically occur if necessary upon storing.

We can also join the SSE and AVX implementations.
2023-08-23 14:17:27 -04:00
Lioncache
7d05797e82 OpcodeDispatcher: Remove redundant moves from VMOVHPOp
Zero-extension will automatically occur if necessary upon storing.
2023-08-23 14:14:35 -04:00
Lioncache
c409ea78bc OpcodeDispatcher: Remove unnecessary moves from VMOV{A,U}PS/VMOV{A,U}PD
Zero-extension will occur automatically upon storing if necessary.

We can also join the SSE and AVX implementations together.
2023-08-23 14:10:49 -04:00
Ryan Houdek
a62ba75ede
Merge pull request #2975 from lioncash/scalar
Arm64/ConversionOps: Add scalar support to Vector_FToI
2023-08-23 10:57:38 -07:00
Lioncache
4a7ef3da13 OpcodeDispatcher: Remove unnecessary moves in AVXVectorRound
Zero-extension will occur automatically if necessary upon storing.
2023-08-23 13:41:56 -04:00
Lioncache
990b70dcd6 OpcodeDispatcher: Use scalar rounding for scalar round instructions
2023-08-23 13:34:01 -04:00
Lioncache
393cea2e8b Arm64/ConversionOps: Correct AdvSIMD round-to-nearest Vector_FToI case
This was previously using frinti, which uses the host rounding mode, rather
than round to nearest.
2023-08-23 13:27:44 -04:00
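
Why frinti was wrong here: frinti rounds using the current (host) rounding
mode, while the instruction being emulated requires round-to-nearest, which
frintn provides regardless of mode. A small standalone demonstration of the
difference:

    #include <cfenv>
    #include <cmath>
    #include <cstdio>

    int main() {
      // std::nearbyint mirrors frinti: it honors the current rounding mode.
      std::fesetround(FE_UPWARD);
      std::printf("%.1f\n", std::nearbyint(2.25));  // 3.0 under FE_UPWARD
      std::fesetround(FE_TONEAREST);
      std::printf("%.1f\n", std::nearbyint(2.25));  // 2.0, what frintn always gives
      return 0;
    }
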
Ryan Houdek
6624f50abf
Merge pull request #2974 from lioncash/extend
OpcodeDispatcher: Remove unnecessary moves from AVXExtendVectorElements
2023-08-23 10:25:56 -07:00
Ryan Houdek
cd1f401363
Merge pull request #2973 from lioncash/vfcmp
OpcodeDispatcher: Remove unnecessary moves in AVXVFCMPOp
2023-08-23 10:25:16 -07:00
Lioncache
e89321dc60 Arm64/ConversionOps: Add scalar support to Vector_FToI
This can be used for the scalar conversions instead of always using the
vector variants.
2023-08-23 13:23:07 -04:00
Lioncache
d99bcbf01b OpcodeDispatcher: Remove unnecessary moves from AVXExtendVectorElements
Zero-extension will already occur if necessary upon storing.

We can also join the AVX and SSE implementations together and get
rid of some template instantiations, now that the only differing
behavior has been removed.
2023-08-23 12:47:31 -04:00
Mai
0819338dbf
Merge pull request #2970 from Sonicadvance1/optimize_pminmax
Arm64: Optimize VFMin/VFMax
2023-08-23 12:39:37 -04:00
Lioncache
f516aed4b7 OpcodeDispatcher: Remove unnecessary moves in AVXVFCMPOp
Zero-extension will already occur if necessary upon storing.
2023-08-23 12:37:45 -04:00
Lioncache
3858e4124b OpcodeDispatcher: Remove redundant moves in AVX blend special cases
Zero-extension will happen if necessary upon storing.
2023-08-23 00:08:33 -04:00
Mai
819fe110da
Merge pull request #2967 from Sonicadvance1/optimize_storeelement
OpcodeDispatcher: Optimize MOVHP{S,D}
2023-08-23 00:01:25 -04:00
Ryan Houdek
f0b1030e54 Arm64: Optimize VFMin/VFMax
We can emit more optimal selects. This makes 3DNow! and SSE packed
min/max operations optimal.
2023-08-22 20:59:11 -07:00
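
For context on the selects: x86 MINPS/MAXPS are defined as a per-lane select,
not an IEEE min/max. MINPS returns the second operand whenever the comparison
is false, including NaN inputs and +/-0 pairs, so ARM64's fmin/fmax (which
propagate NaNs per IEEE rules) can't be substituted directly. A reference of
the semantics the lowering must preserve:

    #include <cstddef>

    // Per-lane x86 MINPS: NaN in either operand, or equal values (e.g. +0
    // vs -0), yield b[i] -- exactly "a < b ? a : b".
    void MinPs(float* dst, const float* a, const float* b, size_t lanes) {
      for (size_t i = 0; i < lanes; ++i) {
        dst[i] = (a[i] < b[i]) ? a[i] : b[i];
      }
    }
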
Ryan Houdek
adfd6787c0
Merge pull request #2969 from lioncash/insert
OpcodeDispatcher: Remove unnecessary moves from AVX inserts
2023-08-22 20:51:55 -07:00
Ryan Houdek
0ee2579a5e OpcodeDispatcher: Optimize MOVHP{S,D}
Loads can turn into element loads.
Stores can turn into element stores.

These four instruction variants are now optimal.
2023-08-22 20:42:24 -07:00
Ryan Houdek
6aa2cab41c IR: Implement support for vector store element
Matches ARM64 ST1 semantics
2023-08-22 20:42:24 -07:00
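
The two commits above pair up: MOVHPS/MOVHPD with a memory operand touch only
the upper 64 bits of the XMM register, which is exactly what the ARM64 lane
forms LD1 {Vn.D}[1], [Xm] and ST1 {Vn.D}[1], [Xm] do, and what the new
load/store-element IR ops expose. Reference semantics:

    #include <cstdint>
    #include <cstring>

    struct Xmm {
      uint64_t lo, hi;
    };

    // MOVHPS load: insert 64 bits into element 1, leaving element 0 intact.
    void MovHpsLoad(Xmm& reg, const void* mem) {
      std::memcpy(&reg.hi, mem, sizeof(reg.hi));
    }

    // MOVHPS store: write element 1 to memory; the register is unchanged.
    void MovHpsStore(void* mem, const Xmm& reg) {
      std::memcpy(mem, &reg.hi, sizeof(reg.hi));
    }
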
Mai
bb2f7107cd
Merge pull request #2963 from Sonicadvance1/optimize_loadelement
OpcodeDispatcher: Optimize MOVLP{S,D} loads
2023-08-22 23:41:59 -04:00
Lioncache
c33f3ff8df OpcodeDispatcher: Remove unnecessary moves from AVX inserts
We already zero-extend on stores when necessary.
2023-08-22 23:29:40 -04:00
Ryan Houdek
de239cde67 OpcodeDispatcher: Optimize MOVLP{S,D} loads
This now uses the new load element IR operation and makes these
instructions optimal.

LRCPC3 will introduce instructions in the future that help TSO emulation
for these operations, but hardware support doesn't exist today.
2023-08-22 20:15:16 -07:00
Ryan Houdek
5e57ec94cf IR: Implement support for vector load element
Matches Arm64 LD1 semantics.
2023-08-22 20:15:16 -07:00
Lioncache
2f5fae7677 OpcodeDispatcher: Remove unnecessary moves from AVX register shifts
Zero-extension will occur automatically when necessary upon storing.
2023-08-22 23:06:53 -04:00
Lioncache
8f8062eb4e OpcodeDispatcher: Remove redundant moves from AVX immediate shifts
These zero-extensions will occur automatically when applicable.
2023-08-22 22:50:10 -04:00
Lioncache
e5f5629ffc OpcodeDispatcher: Remove unnecessary moves from AVX conversion operations
These zero-extensions will already happen automatically if necessary.
2023-08-22 22:20:13 -04:00
Ryan Houdek
ead141fd90
Merge pull request #2962 from lioncash/variable
OpcodeDispatcher: Remove unnecessary moves from AVXVariableShiftImpl
2023-08-22 18:52:47 -07:00
Lioncache
2b071e282e OpcodeDispatcher: Remove unnecessary moves from AVXVariableShiftImpl
We already zero-extend on a store if necessary.
2023-08-22 21:20:34 -04:00
Ryan Houdek
14144523f7
Merge pull request #2961 from lioncash/minpos
OpcodeDispatcher: Remove unnecessary move from VPHMINPOSUW
2023-08-22 18:19:46 -07:00
Ryan Houdek
1f2c5fc6c6
Merge pull request #2960 from lioncash/index
Arm64: Optimize SVE VInsElement
2023-08-22 18:19:13 -07:00
Lioncache
fa17d9fae9 OpcodeDispatcher: Remove unnecessary move from VPHMINPOSUW
We already do a zero-extend if necessary in StoreResult.

This also lets us unify both the SSE and AVX handling code.
2023-08-22 20:59:56 -04:00
Lioncache
398a70312e Arm64: Optimize SVE VInsElement
This can be done without storing any data to memory, while also
reducing the number of instructions used.

Thanks to @dougallj for the optimization suggestions.
2023-08-22 20:36:29 -04:00
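
Reference semantics of VInsElement, for orientation (the commit's SVE lowering
itself isn't shown here): copy a single element from the source vector into
one slot of the destination, leaving every other destination element intact.
The win is doing this entirely in registers instead of bouncing through
memory.

    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    // dst[dstIdx] = src[srcIdx]; all other dst elements are preserved.
    void VInsElement(uint8_t* dst, const uint8_t* src, size_t elemBytes,
                     size_t dstIdx, size_t srcIdx) {
      std::memcpy(dst + dstIdx * elemBytes, src + srcIdx * elemBytes, elemBytes);
    }
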
Ryan Houdek
d3ed9766e8 X86Tables: Optimize MOVLPD stores
Just use the full register size and store the lower bits.
2023-08-22 17:33:34 -07:00
Ryan Houdek
c795d42d21 OpcodeDispatcher: Optimize phminposuw
I would now consider the XMM version of this to be optimal.

Thanks to @rygorous for giving the idea for how to optimize this!
2023-08-22 16:29:06 -07:00
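
What PHMINPOSUW has to compute (the ARM64 lowering isn't spelled out in the
commit, so this is just the x86 semantics for reference): find the minimum
unsigned 16-bit lane, write it to lane 0, write its index to lane 1, and zero
the remaining lanes.

    #include <cstdint>

    void Phminposuw(uint16_t dst[8], const uint16_t src[8]) {
      uint16_t minVal = src[0];
      uint16_t minIdx = 0;
      for (uint16_t i = 1; i < 8; ++i) {
        if (src[i] < minVal) {  // Strict '<' keeps the lowest index on ties.
          minVal = src[i];
          minIdx = i;
        }
      }
      dst[0] = minVal;
      dst[1] = minIdx;
      for (int i = 2; i < 8; ++i) {
        dst[i] = 0;
      }
    }
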
Ryan Houdek
bbf9cb9d52 IR: Implements new VRev32 and LoadNamedVectorConstant ops
VRev32 matches Arm64 semantics directly.
LoadNamedVectorConstant allows FEX to quickly load "named constants".
This lets us keep specific hardcoded vector constant values that we can
load with an ldr(State)+ldr(Value) pair, and it will see heavier use in
the future.
This also enables a very simple future optimization: redundant loads of
these constants can be eliminated when they are used multiple times in the
same block (not implemented here).
2023-08-22 16:29:06 -07:00
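
A sketch of the LoadNamedVectorConstant idea (layout and names here are
illustrative, not FEX's actual structures): the named constants sit in a table
reachable from the JIT state pointer, so materializing one is two dependent
loads -- ldr of the table pointer from state, then ldr of the 128-bit value --
rather than a sequence of moves building the constant inline.

    #include <cstdint>

    struct alignas(16) Vec128 {
      uint64_t lo, hi;
    };

    // Hypothetical constant names; real entries would be whatever values the
    // dispatcher needs hardcoded.
    enum NamedVectorConstant : uint32_t {
      NAMED_VECTOR_ZERO = 0,
      NAMED_VECTOR_INCREMENTAL_U16,
      NAMED_VECTOR_MAX,
    };

    struct ThreadState {
      const Vec128* NamedConstants;  // ldr (State) fetches this pointer...
    };

    Vec128 LoadNamedConstant(const ThreadState& state, NamedVectorConstant c) {
      return state.NamedConstants[c];  // ...ldr (Value) fetches the constant.
    }
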
Mai
6c7933e7b1
Merge pull request #2957 from Sonicadvance1/optimize_pfnacc
OpcodeDispatcher: Optimize PFNACC
2023-08-22 10:10:11 -04:00
Ryan Houdek
364f084604
Merge pull request #2956 from Sonicadvance1/optimize_hsubp
OpcodeDispatcher: Optimize hsubp
2023-08-21 20:47:50 -07:00
Ryan Houdek
ffa8f1e3dc
Merge pull request #2955 from lioncash/sign
OpcodeDispatcher: Remove redundant move from VPSIGN
2023-08-21 20:47:41 -07:00
Ryan Houdek
ad6738939b OpcodeDispatcher: Optimize PFNACC
Turns out this can be even more optimal.
2023-08-21 20:38:18 -07:00
Ryan Houdek
dcb3e4ee86 OpcodeDispatcher: Optimize hsubp
This makes the SSE version optimal.
This dramatically improves the AVX version as well.
2023-08-21 20:22:14 -07:00
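
Reference semantics for the two horizontal subtracts above. 3DNow! PFNACC
packs a[0]-a[1] and b[0]-b[1]; SSE HSUBPS subtracts the odd lane from the
even lane of each pair across both sources. Pair-wise de-interleaving
(uzp1/uzp2) followed by a single subtract is the natural ARM64 shape for
this, which is presumably where the win comes from.

    // SSE HSUBPS semantics.
    void HSubPs(float dst[4], const float a[4], const float b[4]) {
      dst[0] = a[0] - a[1];
      dst[1] = a[2] - a[3];
      dst[2] = b[0] - b[1];
      dst[3] = b[2] - b[3];
    }

    // 3DNow! PFNACC semantics.
    void PfNAcc(float dst[2], const float a[2], const float b[2]) {
      dst[0] = a[0] - a[1];
      dst[1] = b[0] - b[1];
    }
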
Lioncache
dbbe6288de OpcodeDispatcher: Remove redundant move from VPSIGN
StoreResult will already zero-extend if the vector is 128-bit.
2023-08-21 23:11:07 -04:00