1357 Commits

Lioncache
a9a7cbce21 Interpreter: Use alias for temporary vector data
Lets us extract the size into one location for easy size
changes in the future if necessary.
2023-08-21 14:12:55 -04:00
Ryan Houdek
a3bf952f2b IR: Implements support for wide scalar shifts
This matches x86 vector shift behaviour closely for ps{rl,ra,ll}{w,d,q}
where the vector is shifted by a scalar value that is 64 bits wide.
Any shift amount larger than the element size will set that element to zero.

With SVE we have some new wide-element shifts that match this behaviour
exactly (except that they take wide shift sources rather than a scalar).

This is a significant improvement even on platforms that only support
128-bit SVE.
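
As a rough illustration, here is a minimal C++ sketch (not FEX code) of the
x86 semantics being matched, using psrlw as the example; a count at or
beyond the element width zeroes the whole element:

    #include <cstddef>
    #include <cstdint>

    // Reference semantics for psrlw with a 64-bit shift count:
    // each 16-bit element is shifted by the same scalar count, and
    // any count >= 16 zeroes the element entirely.
    void psrlw_ref(uint16_t dst[8], const uint16_t src[8], uint64_t count) {
        for (size_t i = 0; i < 8; ++i) {
            dst[i] = count >= 16 ? 0 : static_cast<uint16_t>(src[i] >> count);
        }
    }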
2023-08-20 19:16:40 -07:00
Ryan Houdek
ac77986c44 IR: Implements support for saturating/rounding vector shifts 2023-08-20 13:38:23 -07:00
Ryan Houdek
c5f5a03c68 OpcodeDispatcher: Optimize PSIGN
This dramatically improves the performance of the PSIGN instructions.
2023-08-20 13:38:23 -07:00
Ryan Houdek
1ed9ec63be Arm64: Optimize BSL when possible.
With ASIMD, FEX would never optimize BSL out of fear that overlapping
registers would break things, so it had previously always moved the
sources to a temporary first and then moved the result back out when done.

Now we instead check upfront whether any of the source registers overlap
the destination. If the destination register overlaps one of the three
sources we can use bsl, bit, or bif depending on which register is
overlapped.

In the worst case the destination doesn't overlap any of the source
registers and the moves are still needed.
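
A minimal sketch of the selection logic, using hypothetical Emitter/Reg
names rather than FEX's actual backend API; all three ASIMD ops compute
dst = (mask & true_v) | (~mask & false_v) and differ only in which operand
the destination must alias:

    // Emitter and Reg are assumed stand-ins for illustration only.
    void EmitSelect(Emitter& e, Reg dst, Reg mask, Reg true_v, Reg false_v) {
        if (dst == mask) {
            e.bsl(dst, true_v, false_v);  // dst already holds the mask
        } else if (dst == true_v) {
            e.bif(dst, false_v, mask);    // insert false_v bits where mask is 0
        } else if (dst == false_v) {
            e.bit(dst, true_v, mask);     // insert true_v bits where mask is 1
        } else {
            e.mov(dst, mask);             // worst case: no overlap, move needed
            e.bsl(dst, true_v, false_v);
        }
    }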
2023-08-20 13:38:23 -07:00
Ryan Houdek
a523858f66
Merge pull request #2923 from Sonicadvance1/nonnull_legacy_segment_telemetry
FEXCore: Adds telemetry around legacy segment register setting
2023-08-20 10:27:56 -07:00
Ryan Houdek
9e4888c6a1
Merge pull request #2930 from Sonicadvance1/support_push
OpcodeDispatcher: Implement support for push IR operation
2023-08-20 10:25:26 -07:00
Lioncache
8167626a07 x86_64/VectorOps: Properly handle VExtr element sizes other than bytes
We need to convert the index into a byte index.
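
A trivial sketch of the conversion (names hypothetical): arm64 EXT indexes
bytes, so an element index must be scaled by the element size first.

    #include <cstddef>

    // Element index -> byte index, since EXT operates on bytes.
    constexpr size_t ToByteIndex(size_t element_index, size_t element_size) {
        return element_index * element_size;
    }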
2023-08-20 13:03:27 -04:00
Lioncache
c3778a9729 OpcodeDispatcher: Improve SHA1MSG1 output
We can simplify these inserts down to a single EXT.
2023-08-20 12:50:07 -04:00
Ryan Houdek
34722348e8
Merge pull request #2912 from Sonicadvance1/optimize_flag_clearing
OpcodeDispatcher: Minor optimization around clearing flags
2023-08-19 23:03:40 -07:00
Lioncache
1fbf193739 Arm64/EncryptionOps: Use MOVI reg, #0 to zero vectors
This is a little more optimal than XORing the vector with itself.
2023-08-19 23:20:36 -04:00
Ryan Houdek
92c3014aaa OpcodeDispatcher: Minor optimization around clearing flags
When clearing multiple flags it is more optimal to load the mask
constant into a register and then clear with a single and/bic.

Back-to-back bfi instructions are actually less optimal due to the
dependency chain between them.

With #2911, this is a total win since this hits an edge case with
constant loading that #2911 fixes.
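
A minimal C++ sketch of the idea, using real x86 flag bit positions but
hypothetical constant names: one combined mask clears every flag in a
single and/bic-style operation instead of a chain of dependent inserts.

    #include <cstdint>

    constexpr uint32_t kCF = 1u << 0;
    constexpr uint32_t kPF = 1u << 2;
    constexpr uint32_t kZF = 1u << 6;

    uint32_t ClearFlags(uint32_t flags) {
        // One mask constant and one and/bic; no dependency chain
        // between per-flag bitfield inserts.
        return flags & ~(kCF | kPF | kZF);
    }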
2023-08-19 20:14:37 -07:00
Mai
6960fca256
Merge pull request #2929 from Sonicadvance1/signaldelegator_getconfig
SignalDelegator: Allow getting the internal configuration
2023-08-19 23:12:06 -04:00
Mai
affbcd2241
Merge pull request #2928 from Sonicadvance1/remove_x18_saving
Arm64Emitter: Stop saving and restoring platform register
2023-08-19 23:11:44 -04:00
Mai
a2e5c231ae
Merge pull request #2908 from Sonicadvance1/optimize_stc_clc
IR/ConstProp: Ensure that BFI with constant bitfields can optimize to Andn or Or
2023-08-19 23:10:48 -04:00
Ryan Houdek
b973c193be
Merge pull request #2939 from lioncash/round
OpcodeDispatcher: Eliminate redundant moves in {AVX}VectorRound
2023-08-19 19:29:01 -07:00
Ryan Houdek
4c409ea47d
Merge pull request #2938 from lioncash/fcmp
OpcodeDispatcher: Eliminate unnecessary moves in {AVX}VFCMPOp
2023-08-19 19:27:24 -07:00
Ryan Houdek
ac53913c37
Merge pull request #2937 from lioncash/scalarunary
OpcodeDispatcher: Remove unnecessary moves in {AVX}VectorUnaryOp
2023-08-19 19:22:10 -07:00
Ryan Houdek
2224c23c79
Merge pull request #2934 from lioncash/scalarfp
OpcodeDispatcher: Remove extraneous moves in {V}CVTSD2SS/{V}CVTSS2SD
2023-08-19 19:20:20 -07:00
Ryan Houdek
6ce380d7a9
Merge pull request #2936 from lioncash/scalaralu
OpcodeDispatcher: Remove unnecessary moves in {AVX}VectorScalarALUOp
2023-08-19 19:09:06 -07:00
Lioncache
5152854b98 OpcodeDispatcher: Remove extraneous moves in {V}CVTSD2SS/{V}CVTSS2SD
Since all we're going to be doing is an insert as the final operation,
in cases where our source is a vector we can specify the size of the
vector rather than the size of the element to avoid unnecessary
zero-extension.
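
To illustrate why the extra zero-extension was pure overhead, a minimal
sketch in plain C++ (modeling 128-bit registers as byte arrays): an
element insert never reads the source's upper bits, so clearing them
first changed nothing about the result.

    #include <cstdint>
    #include <cstring>

    // Insert the low 64-bit element of src into dst; the upper 64 bits
    // of src are never read, so zero-extending src beforehand (the
    // removed move) was wasted work.
    void InsertLowElement(uint8_t dst[16], const uint8_t src[16]) {
        std::memcpy(dst, src, 8);
    }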
2023-08-19 22:07:25 -04:00
Ryan Houdek
8dade7eea1
Merge pull request #2935 from lioncash/scalarfp2
OpcodeDispatcher: Remove redundant moves from {V}CVTSD2SI/{V}CVTSS2SI
2023-08-19 19:05:52 -07:00
Lioncache
6ba42e5cf1 OpcodeDispatcher: Eliminate redundant moves in {AVX}VectorRound
When dealing with scalar source registers, we can opt to not zero-extend
the vector and just perform the scalar operation and then insert the result.
2023-08-19 21:27:59 -04:00
Lioncache
343b00818d OpcodeDispatcher: Eliminate unnecessary moves in {AVX}VFCMPOp
When dealing with scalar vector sources, we don't need to zero-extend
the vector and can just use it as-is.
2023-08-19 21:20:03 -04:00
Lioncache
09addb217a OpcodeDispatcher: Remove unnecessary moves in {AVX}VectorUnaryOp
When dealing with source vectors, we can use the vector length
rather than using a smaller size and zero extending the register,
especially since the resulting value is just inserted into another
vector.
2023-08-19 20:09:19 -04:00
Lioncache
4d1f002dea OpcodeDispatcher: Remove unnecessary moves in AVXVectorScalarALUOp
Same thing as the SSE variant, but for AVX.
2023-08-19 19:22:42 -04:00
Lioncache
1158ad7b2a OpcodeDispatcher: Remove unnecessary moves in VectorScalarALUOp
We can explicitly specify the vector width when working with a
vector source, so that we don't do any unnecessary zero-extending
on the element.
2023-08-19 19:15:13 -04:00
Lioncache
6907fdca6b OpcodeDispatcher: Remove redundant moves from {V}CVTSD2SI/{V}CVTSS2SI
We can specify the full vector length when dealing with a source vector
to avoid zero-extending the vector unnecessarily. When dealing with a
memory operand, however, we only want to load the exact source size.
2023-08-19 18:46:08 -04:00
Lioncache
c31329609f OpcodeDispatcher: Unify handling code for MOVSD and MOVSS
These have the same behavior and only differ based on element size,
so we can join the implementations together instead of duplicating
them across both functions.
2023-08-19 17:44:45 -04:00
Lioncache
1fe8470933 OpcodeDispatcher: Remove extraneous moves from VMOVSS/VMOVSD xmm to mem case
Like the changes made to the xmm-to-xmm case, since we're going to be storing
a 64-bit value, we don't need to zero-extend the vector on a load.
2023-08-19 17:35:52 -04:00
Lioncache
99b5aaa426 OpcodeDispatcher: Remove extraneous moves in VMOVSS/VMOVSD register case
In the event that we have a full length vector, we can just load and move
from it, which gets rid of a little bit of mov noise. Since all we intend
to do is perform an insert from one vector into another, we don't need the
zero-extending behavior that a 64-bit vector load would do.
2023-08-19 17:35:12 -04:00
Lioncache
84f228a75a x86_64/VectorOps: Simplify index handling in VInsElement
While we're in the area, we can simplify these cases down a little.
2023-08-19 01:28:45 -04:00
Lioncache
83a330b039 x86_64/VectorOps: Fix insertion bugs in VDupElement for 256-bit
Previously we weren't hitting this because we were never broadcasting
from the upper lane with VDupElement.
2023-08-19 01:21:12 -04:00
Lioncache
bbed4d73ed OpcodeDispatcher: Improve VPERMQ/VPERMPD broadcast cases
For a bunch of cases that act as broadcasts (where all
indices in the imm8 specify the same element), we
can use VDupElement here rather than iterating through.
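
A minimal sketch of the broadcast detection: VPERMQ/VPERMPD's imm8 packs
four 2-bit element selectors, and when all four agree the permute
degenerates into a single-element broadcast.

    #include <cstdint>

    // Returns true when all four 2-bit selectors in imm8 pick the same
    // element, i.e. the permute is really a broadcast of that element.
    bool IsBroadcast(uint8_t imm8, int& element) {
        const int e0 = imm8 & 0b11;
        const bool all_same = ((imm8 >> 2) & 0b11) == e0 &&
                              ((imm8 >> 4) & 0b11) == e0 &&
                              ((imm8 >> 6) & 0b11) == e0;
        element = e0;
        return all_same;
    }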
2023-08-19 01:08:30 -04:00
Ryan Houdek
8b051b5e63 OpcodeDispatcher: Implement support for push IR operation
This paves the way to optimizing pushes into both push operations and
push-pair operations to better match Arm64 push support.

While this does the first step for supporting the base push, we'll leave
optimizing push pairs to future work.
2023-08-18 14:19:16 -07:00
Ryan Houdek
1fdc4d2c62 IR: Implement support for a push operation
This is a bit of a tricky operation: due to our usage of SSA, the
incoming source isn't guaranteed to end its live range at this
instruction.

This means that, to be optimal, we need to take different paths
depending on whether the incoming address register is the same as the
destination node.
Once we have some form of RA constraints or a non-SSA IR form that can
guarantee this restriction, this will go away.
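
A minimal sketch of the two lowering paths (hypothetical emitter helpers,
not FEX's actual backend API): when the result register aliases the
incoming stack pointer, a pre-index store with writeback does everything
at once; otherwise the old value must stay live and the new pointer is
computed separately.

    // Emitter and Reg are assumed stand-ins. Push `value` onto the
    // stack held in src_sp, producing the decremented pointer in dst_sp.
    void LowerPush(Emitter& e, Reg dst_sp, Reg src_sp, Reg value, int size) {
        if (dst_sp == src_sp) {
            // Live range ends here: str value, [sp, #-size]!
            e.str_preindex(value, dst_sp, -size);
        } else {
            // Old SP stays live: compute the new pointer, then store.
            e.sub(dst_sp, src_sp, size);
            e.str(value, dst_sp, 0);
        }
    }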
2023-08-18 14:14:38 -07:00
Ryan Houdek
ea5c67da80 SignalDelegator: Allow getting the internal configuration
Not used by FEX today but will be used by the WINE integration.
2023-08-18 11:56:52 -07:00
Ryan Houdek
0373826f46 Arm64Emitter: Stop saving and restoring platform register
FEX doesn't use the platform register on WINE platforms, so there is no
reason to save and restore it.

On Linux we may still use it at some point, but for now it isn't part of
our RA.
2023-08-18 11:49:44 -07:00
Ryan Houdek
7db2e487c3 IR/ConstProp: Remove some UBSAN behaviour
Changes the idiom used for constant mask generation to a ternary.
This pattern is definitely used elsewhere in the code, but we can get
rid of all instances here.
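
For reference, a minimal sketch of the idiom change: shifting a 64-bit
value by 64 is undefined behaviour in C++, so the mask generation guards
the full-width case with a ternary.

    #include <cstdint>

    uint64_t MakeMask(unsigned width) {
        // (1ull << 64) is UB; the ternary sidesteps it.
        return width >= 64 ? ~0ull : (1ull << width) - 1;
    }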
2023-08-18 11:41:11 -07:00
Ryan Houdek
6e5111b876 IR/ConstProp: Ensure that BFI with constant bitfields can optimize to Andn or Or
This optimizes the clc and stc instructions for flag setting and
clearing.
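
A minimal C++ sketch of the folding rule (assuming a field width below
64 bits): a BFI whose inserted value is constant collapses into a pure
bit clear (andn) when the field is all zeros, or a pure bit set (or)
when it is all ones, which is exactly the shape clc and stc produce
for CF.

    #include <cstdint>

    uint64_t FoldConstBFI(uint64_t dst, uint64_t ins, int lsb, int width) {
        const uint64_t mask = ((1ull << width) - 1) << lsb;  // width < 64 assumed
        const uint64_t field = (ins << lsb) & mask;
        if (field == 0)    return dst & ~mask;  // -> andn
        if (field == mask) return dst | mask;   // -> or
        return (dst & ~mask) | field;           // general constant insert
    }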
2023-08-18 11:33:11 -07:00
Ryan Houdek
d502ad63f4 IR/ConstProp: Ensure ANDN is optimized 2023-08-18 11:27:06 -07:00
Ryan Houdek
fc84f6b345
Merge pull request #2927 from bylaws/interrupt
FEXCore: Allow for interrupting the JIT on block entry
2023-08-18 06:14:24 -07:00
Billy Laws
de63fd05d0 FEXCore: Allow for interrupting the JIT on block entry
This takes a similar approach to deferred signal handling and allows any given
thread to be interrupted while running JIT code by protecting the appropriate
page as RO. When the thread then enters a new block, it will try to access
that page and segfault. This is safer than just sending a signal to the thread,
as that could stop it in a place where the JIT context couldn't be recovered
correctly.
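
A minimal POSIX sketch of the mechanism (simplified; the page and its
role in block entry are assumptions for illustration): marking the page
the dispatcher touches on block entry read-only makes the thread fault
at the next block boundary, where JIT state is known to be recoverable.

    #include <cstddef>
    #include <sys/mman.h>

    // Request that a JIT thread stop at its next block entry by
    // revoking write access to the page its block-entry code touches.
    void RequestInterrupt(void* block_entry_page, size_t page_size) {
        mprotect(block_entry_page, page_size, PROT_READ);
    }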
2023-08-18 05:58:51 -07:00
Ryan Houdek
d3f0c7e969
Merge pull request #2925 from bylaws/winfile
Support for Config.json loading on WIN32
2023-08-18 05:00:22 -07:00
Billy Laws
00556023c2 Remove unnecessary WIN32 file handling TODOs
With WOW, all allocations from 64-bit code use the full address space
and limiting is handled on the syscall thunk side, so there's no need to
worry about STL allocations stealing AS.
2023-08-18 04:37:40 -07:00
Billy Laws
5de0714766 FileLoading: Fix handling of non-existent files on WIN32 2023-08-18 04:37:40 -07:00
Billy Laws
bbfd15f801 LogMan: Commonise log level to string conversion 2023-08-18 04:36:31 -07:00
Billy Laws
0954c7eb9f AllocatorHooks: Add C++17 aligned new/delete functions 2023-08-18 04:32:16 -07:00
Ryan Houdek
f09d9af3db
Merge pull request #2922 from lioncash/psrld
OpcodeDispatcher: Improve {V}PSRLDQ shift by 0
2023-08-17 17:05:35 -07:00
Ryan Houdek
d19e2507e5 FEXCore: Adds telemetry around legacy segment register setting
Due to Intel dropping support for legacy segment registers[1] there is a
concern that this will break legacy 32-bit software that is doing some
magic segment register handling.

Adds some simple telemetry for 32-bit applications: when they encounter
an instruction that sets a segment register or uses a segment register,
the JIT will do a /relatively/ quick four-instruction check to see
whether the segment is non-null.

It's not enough to just check whether the segment index is 0; 32-bit
Linux software starts with non-zero segment register indexes, but the
LDT entry for each segment index is a null descriptor.

Once the segment address is loaded, the IR operation will do a quick
check against zero, and if it /isn't/ zero, set the telemetry value.

As a very minor optimization, segment registers only get checked once
per block to ensure overhead stays low.
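
Conceptually, the emitted check boils down to the following sketch
(the helper and telemetry names here are hypothetical, for illustration
only):

    #include <cstdint>

    // Hypothetical stand-ins for the JIT's segment load and telemetry.
    extern uint64_t LoadSegmentBase(uint32_t segment_index);
    extern void SetTelemetryValue(int telemetry_bit);

    void CheckSegmentForTelemetry(uint32_t segment_index) {
        const uint64_t base = LoadSegmentBase(segment_index);
        if (base != 0) {
            // A non-null legacy segment base was actually used.
            SetTelemetryValue(/*UsesLegacySegments=*/1);
        }
    }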

[1] https://www.intel.com/content/www/us/en/developer/articles/technical/envisioning-future-simplified-architecture.html
   - 3.6 - Restricted Subset of Segmentation
      - `Bases are supported for FS, GS, GDT, IDT, LDT, and TSS
        registers; the base for CS, DS, ES, and SS is ignored for 32-bit
        mode, same as 64-bit mode (treated as zero).`
   - 4.2.17 - MOV to Segment Register
      - Will fault if SS is written (breaking anything that writes to
        SS).
      - Will not fault if CS, DS, ES are written (thus it sets the
        segment, but it gets ignored due to 3.6).
2023-08-17 17:00:41 -07:00