This closely matches x86 vector shift behaviour for ps{rl,ra,ll}{w,d,q},
where the vector is shifted by a scalar value that is 64 bits wide.
Any shift amount larger than the element size sets that element to zero.
With SVE we gain new wide-element shifts that match this behaviour
exactly (except that they take wide shift sources rather than a scalar).
This is a significant improvement even on platforms that only support
128-bit SVE.
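For reference, a scalar sketch of the logical-shift behaviour being matched
(psrlw-style, 16-bit elements; purely illustrative):

```cpp
#include <cstddef>
#include <cstdint>

// Every element is shifted by the same 64-bit count, and a count at or above
// the element width zeroes the element instead of being masked the way scalar
// x86 shifts are.
void ShiftRightLogical16(uint16_t* Elements, size_t Count, uint64_t ShiftAmount) {
  for (size_t i = 0; i < Count; ++i) {
    Elements[i] = ShiftAmount >= 16 ? 0 : static_cast<uint16_t>(Elements[i] >> ShiftAmount);
  }
}
```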
With ASIMD, FEX would never optimize BSL out of fear that overlapping
registers would break things, so it previously always moved to a
temporary first and then moved the result back out when done.
Now we instead check up front whether any of the source registers
overlap the destination. If the destination register overlaps one of
the three sources we can use bsl, bit, or bif depending on which source
is overlapped.
In the worst case the destination doesn't overlap any of the source
registers and still needs these moves.
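A minimal sketch of the new selection for a select computing
(Mask & True) | (~Mask & False); the types and emitter hooks below are
stand-ins, not FEX's actual API:

```cpp
// Hypothetical register and emitter types for illustration only.
struct VReg { int Idx; bool operator==(const VReg&) const = default; };
struct Emitter {
  void bsl(VReg Vd, VReg Vn, VReg Vm); // Vd = (Vd & Vn) | (~Vd & Vm): Vd holds the mask
  void bif(VReg Vd, VReg Vn, VReg Vm); // Vd = (Vd & Vm) | (Vn & ~Vm): Vd holds the "true" value
  void bit(VReg Vd, VReg Vn, VReg Vm); // Vd = (Vd & ~Vm) | (Vn & Vm): Vd holds the "false" value
  void mov(VReg Vd, VReg Vn);
};

void EmitSelect(Emitter& E, VReg Dst, VReg Mask, VReg True, VReg False) {
  if (Dst == Mask) {
    E.bsl(Dst, True, False);
  } else if (Dst == True) {
    E.bif(Dst, False, Mask);
  } else if (Dst == False) {
    E.bit(Dst, True, Mask);
  } else {
    // No overlap at all: fall back to copying the mask in first.
    E.mov(Dst, Mask);
    E.bsl(Dst, True, False);
  }
}
```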
When clearing multiple flags it is more optimal to load the mask
constant into a register and then clear them all with a single and/bic.
Back-to-back bfi is actually less optimal due to the dependency chain
it creates.
With #2911 this is a total win, since it hits an edge case with
constant loading that #2911 fixes.
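For illustration, the combined-mask form (the flag bit positions here are the
architectural EFLAGS ones, not necessarily FEX's internal packing):

```cpp
#include <cstdint>

constexpr uint64_t CF_MASK = 1ULL << 0;
constexpr uint64_t PF_MASK = 1ULL << 2;
constexpr uint64_t AF_MASK = 1ULL << 4;

uint64_t ClearFlags(uint64_t PackedFlags) {
  // One combined constant lets the JIT materialize the mask once and clear all
  // three flags with a single and/bic, instead of a chain of bfi ops where
  // each one waits on the previous result.
  constexpr uint64_t FlagsToClear = CF_MASK | PF_MASK | AF_MASK;
  return PackedFlags & ~FlagsToClear;
}
```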
Since the final operation is just an insert, in the cases where our
source is a vector we can specify the size of the vector rather than
the size of the element, avoiding unnecessary zero-extension.
When dealing with source vectors, we can use the vector length rather
than a smaller size that requires zero-extending the register,
especially since the resulting value is just inserted into another
vector.
We can specify the full vector length when dealing with a source vector
to avoid zero-extending the vector unnecessarily. When dealing with a
memory operand, however, we only want to load the exact source size.
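Roughly the distinction being drawn, with illustrative helper names standing
in for the dispatcher's real ones:

```cpp
// All names here (Ref, Operand, LoadSource, VInsElement) are stand-ins used
// only to show the size selection.
struct Ref {};
struct Operand { bool IsRegister() const; };

Ref LoadSource(const Operand& Src, unsigned AccessSize);
Ref VInsElement(unsigned VectorSize, unsigned ElementSize, unsigned DstIdx,
                unsigned SrcIdx, Ref Dst, Ref Src);

Ref EmitInsert(const Operand& Src, Ref Dst, unsigned VectorSize,
               unsigned ElementSize, unsigned DstIdx) {
  Ref SrcValue;
  if (Src.IsRegister()) {
    // Vector register source: read it at full vector width. The insert only
    // consumes the element it needs, so zero-extending a narrower read first
    // is wasted work.
    SrcValue = LoadSource(Src, VectorSize);
  } else {
    // Memory source: load exactly the element being inserted and nothing more.
    SrcValue = LoadSource(Src, ElementSize);
  }
  return VInsElement(VectorSize, ElementSize, DstIdx, 0, Dst, SrcValue);
}
```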
These have the same behavior and only differ based on element size,
so we can join the implementations together instead of duplicating
them across both functions.
Like the changes made to the xmm-to-xmm case, since we're going to be
storing a 64-bit value, we don't need to zero-extend the vector on
load. In the event that we have a full-length vector, we can just load
and move from it, which gets rid of a little bit of mov noise. Since
all we intend to do is perform an insert from one vector into another,
we don't need the zero-extending behavior that a 64-bit vector load
would perform.
For a number of cases that act as broadcasts (where all indices in the
imm8 specify the same element), we can use VDupElement rather than
iterating through each element.
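As an example, the broadcast case for a pshufd-style imm8 can be detected with
a check like this (a sketch, not the code as merged):

```cpp
#include <cstdint>

// Each 2-bit field of the imm8 selects the source element for one destination
// lane; if every field picks the same element, the shuffle is a broadcast.
bool IsBroadcast(uint8_t Imm8) {
  const uint8_t Index = Imm8 & 0b11;
  return ((Imm8 >> 2) & 0b11) == Index &&
         ((Imm8 >> 4) & 0b11) == Index &&
         ((Imm8 >> 6) & 0b11) == Index;
}
// When this holds, a single element duplicate (VDupElement) of that index
// replaces the per-element loop.
```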
This paves the way to optimizing pushes into both push operations and
push-pair operations, to more optimally match Arm64 push support.
While this takes the first step of supporting the base push, we'll
leave optimizing push pairs to future work.
This is a bit of a tricky operation: due to our usage of SSA, the
incoming source isn't guaranteed to end its live range at this
instruction.
This means that to be optimal we need to take different paths depending
on whether the incoming address register is the same as the destination
node.
Once we have some form of RA constraints, or a non-SSA IR form that can
guarantee this restriction, this will go away.
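Roughly the two paths, with hypothetical emitter hooks (and ignoring the extra
care needed when the value register also aliases the destination):

```cpp
struct Reg { int Idx; bool operator==(const Reg&) const = default; };
struct Emitter {
  void str_preindex(Reg Value, Reg Addr, int Offset); // str Value, [Addr, #Offset]!
  void sub(Reg Dst, Reg Src, int Imm);
  void str(Reg Value, Reg Addr);
};

// Push = decrement the address by Size, then store Value at the new address.
void EmitPush(Emitter& E, Reg Dst, Reg Addr, Reg Value, int Size) {
  if (Dst == Addr) {
    // RA gave the result the same register as the incoming address, so a
    // single pre-index store with writeback does the whole push.
    E.str_preindex(Value, Dst, -Size);
  } else {
    // The SSA source stays live past this op and can't be modified in place:
    // compute the new address into Dst and store through it.
    E.sub(Dst, Addr, Size);
    E.str(Value, Dst);
  }
}
```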
FEX doesn't use the platform register on Wine platforms, so there is no
reason to save and restore it.
On Linux we may still use it at some point, but for now it isn't part
of our RA.
Changes the idiom used for constant mask generation to a ternary. This
pattern is definitely used elsewhere in the code, but we can at least
get rid of all instances of it here.
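For illustration, the kind of ternary being referred to here (the actual masks
at these call sites may differ):

```cpp
#include <cstdint>

// Build an all-ones mask for an operand size in bytes without shifting by the
// full register width, which would be undefined behaviour for an 8-byte size.
uint64_t SizeMask(unsigned SizeInBytes) {
  return SizeInBytes == 8 ? ~0ULL : (1ULL << (SizeInBytes * 8)) - 1;
}
```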
This takes a similar approach to deferred signal handling and allows any given
thread to be interrupted while running JIT code by protecting the appropriate
page as RO. When the thread then enters a new block, it will try to access
that page and segfault. This is safer than just sending a signal to the thread,
as that could stop it in a place where the JIT context couldn't be recovered
correctly.
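A minimal sketch of the mechanism, assuming a per-thread page that every block
entry writes to; the structure and function names are assumptions, not FEX's
actual interfaces:

```cpp
#include <sys/mman.h>
#include <cstddef>

// Page the JIT writes to at every block entry for this thread.
struct ThreadInterruptPage {
  void* Page;      // page-aligned, normally mapped read-write
  size_t PageSize;
};

// Requester side: flip the page to read-only. The target thread keeps running
// until it enters its next block, faults on the write, and the SIGSEGV handler
// recognizes the address as "interrupt requested" at a point where the JIT
// context is fully recoverable.
void RequestInterrupt(ThreadInterruptPage& P) {
  mprotect(P.Page, P.PageSize, PROT_READ);
}

// Target side (from the fault handler): restore write access and service the
// interrupt before resuming guest execution.
void AcknowledgeInterrupt(ThreadInterruptPage& P) {
  mprotect(P.Page, P.PageSize, PROT_READ | PROT_WRITE);
}
```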
With WOW, all allocations from 64-bit code use the full address space
and limiting is handled on the syscall thunk side, so there's no need
to worry about STL allocations stealing address space.
Due to Intel dropping support for legacy segment registers[1], there is
a concern that this will break legacy 32-bit software that does some
magic segment register handling.
Adds some simple telemetry for 32-bit applications: when they encounter
an instruction that sets or uses a segment register, the JIT does a
/relatively/ quick four-instruction check to see whether the segment is
non-null.
It's not enough to just check whether the segment index is 0: 32-bit
Linux software starts with non-zero segment register indexes, but the
LDT entry for each of those indexes is a null descriptor.
Once the segment address is loaded, the IR operation does a quick check
against zero and, if it /isn't/ zero, sets the telemetry value.
As a very minor optimization, segment registers only get checked once
per block to keep the overhead low.
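Conceptually, the emitted check amounts to the following (illustrative names;
the real version is the four-instruction sequence described above):

```cpp
#include <cstdint>

// Stand-ins for the thread's cached segment state and the telemetry slot.
struct ThreadState {
  uint64_t SegmentBases[6]; // bases already resolved through the GDT/LDT
  uint64_t TelemetryNonZeroSegment;
};

void CheckSegment(ThreadState& State, unsigned SegmentReg) {
  // Checking the selector alone isn't enough: 32-bit Linux programs run with
  // non-zero selectors whose descriptors are null, so test the resolved base.
  if (State.SegmentBases[SegmentReg] != 0) {
    State.TelemetryNonZeroSegment = 1;
  }
}
```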
[1] https://www.intel.com/content/www/us/en/developer/articles/technical/envisioning-future-simplified-architecture.html
- 3.6 - Restricted Subset of Segmentation
- `Bases are supported for FS, GS, GDT, IDT, LDT, and TSS
registers; the base for CS, DS, ES, and SS is ignored for 32-bit
mode, same as 64-bit mode (treated as zero).`
- 4.2.17 - MOV to Segment Register
- Will fault if SS is written (breaking anything that writes to SS).
- Will not fault if CS, DS, or ES are written (the segment gets set,
  but its base is ignored due to 3.6).