Using a brute force solver to add in more optimized code paths
- Adds 12 single-VInsElement implementations
- Adds 4 two-IR-operation implementations
Not adding any of the two- or three-IR-operation implementations that
use VInsElement, because they interact badly with SRA and end up worse
than the VTBX implementation.
Optimizes the AVX128 blends by reusing the prior SSE4.1 implementation.
The only difference is that the destination register isn't reused as a
source register.
One confusing thing is that Felix Cloutier's documentation has a typo in
the 256-bit VPBLENDW instruction, where the top 128-bit lane reuses the
destination instead of the sources. So I wrote a unit test to ensure
correctness.
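For reference, the 256-bit VPBLENDW behaviour can be modelled in a few lines. This is only a sketch of the architectural semantics, not the emulator's implementation or its unit test; the names `Words256` and `ReferenceVPBlendW256` are made up for illustration.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Sixteen 16-bit words, i.e. one 256-bit register.
using Words256 = std::array<uint16_t, 16>;

// Reference model of 256-bit VPBLENDW (hypothetical helper name).
// The same 8-bit immediate is applied to both 128-bit lanes, and both
// lanes pick between the two sources, never the old destination.
Words256 ReferenceVPBlendW256(const Words256& Src1, const Words256& Src2, uint8_t Imm8) {
  Words256 Result{};
  for (size_t i = 0; i < Result.size(); ++i) {
    // Bit (i % 8) of the immediate controls word i in either lane.
    const bool TakeSrc2 = ((Imm8 >> (i % 8)) & 1) != 0;
    Result[i] = TakeSrc2 ? Src2[i] : Src1[i];
  }
  return Result;
}
```

With an immediate of 0x05, words 0 and 2 of each lane come from the second source and everything else comes from the first.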
Fixes #3796
stop prefixing the arguments when we generate allocate ops (in particular), this
is more convenient and simpler. in exchange we need to prefix Op to avoid a
collision on fcmpscalarinsert which has an argument named Op, but that's a local
change at least.
came up when experimenting with new IR, but I think this is probably a win by
itself.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
now that we do everything via NZCV, this is mostly vestigial. DF/x87 flags are
sufficiently rare to be "don't care"s here, and we don't even have multiblock
enabled yet!
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
This is no longer necessary, and it no longer provides us any useful
information. Since we expose the AVX CPUID flag, nearly everything uses
VEX encoding now, so it is almost always set.
Some locations could end up with SRA registers that only spilled one
register.
Allow passing in temporaries from the call site.
Fixes rpid and syscalls asserting.
When AFP is supported, we can actually support DAZ. This might also
fix the audio corruption in Animal Well, but I can't test it until Steam
is running on Oryon. Requires a bit of plumbing for MXCSR, which we were
hacking around before, but now we actually want to store the value.
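As a rough illustration of why the stored MXCSR value matters: DAZ is bit 6 of MXCSR and FTZ is bit 15, so the bit can only be honoured if the guest's written value is kept around. The sketch below assumes a hypothetical `GuestFPState` struct and a `HostSupportsAFP` flag; neither is taken from the actual code.

```cpp
#include <cstdint>

// MXCSR bit positions as defined by the x86 architecture.
constexpr uint32_t MXCSR_DAZ = 1u << 6;   // denormal inputs are treated as zero
constexpr uint32_t MXCSR_FTZ = 1u << 15;  // denormal results are flushed to zero

// Hypothetical guest state that stores the full MXCSR value.
// 0x1F80 is the architectural reset value (all exceptions masked).
struct GuestFPState {
  uint32_t MXCSR = 0x1F80;
};

// Assumption for this sketch: DAZ is only honoured when the host has AFP.
bool DenormalInputsAreZero(const GuestFPState& State, bool HostSupportsAFP) {
  return HostSupportsAFP && (State.MXCSR & MXCSR_DAZ) != 0;
}
```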
Fixes #3856
With a brute force search of methods between 1-3 instructions, we cover
a lot more cases more optimally.
There are definitely still more cases (and probably some that can be
reduced from 3 instructions to 2), but covering 44 cases is a pretty
good margin already.
VPSHUFD and VPERMILPS are aliases of each other.
Reuses the implementation path from the PSHUFD implementation which has
a few swizzles and then a table lookup.
VPERMILPD is a very simple swizzle per 128-bit lane.
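The per-lane swizzle of the immediate form of VPERMILPD can be written out as a small reference model. This is a sketch of the instruction semantics only, with made-up names (`Doubles256`, `ReferenceVPermilPD256`); it does not cover the variable-control form and is not the emulator's code.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Four doubles, i.e. one 256-bit register.
using Doubles256 = std::array<double, 4>;

// Reference model of 256-bit VPERMILPD with an immediate (hypothetical name).
// Each result element picks the low or high double of its own 128-bit lane,
// controlled by one immediate bit per element.
Doubles256 ReferenceVPermilPD256(const Doubles256& Src, uint8_t Imm8) {
  Doubles256 Result{};
  for (size_t i = 0; i < Result.size(); ++i) {
    const size_t LaneBase = i & ~size_t{1};  // 0 for elements 0-1, 2 for elements 2-3
    const size_t Select = (Imm8 >> i) & 1;   // 0 = low double, 1 = high double
    Result[i] = Src[LaneBase + Select];
  }
  return Result;
}
```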
Fixes #3797
Fixes #3784
nothing is optimizing around this, it's just adding pointless complexity. if we
want to actually optimize F80Cmp, the right way would be to lift the
implementation into the OpcodeDispatcher or JIT. it wouldn't be terribly
difficult. This kludge doesn't get us closer there.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
This causes a global initializer that registers an atexit handler.
Be smarter, use an std::array and pass its data around using a span
instead.
Removes the global initializer and removes the atexit installation.
We never use more than one logging method at a time so this was
overengineered for what it is doing.
Instead, only allow one handler each for messages and throw messages,
each of which is just a pointer.
Removes a global initializer and an atexit handler being installed.
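A single handler stored as a plain function pointer might look like the sketch below. The `ExampleLog` namespace and its function names are invented for illustration and do not match the real logging API. A raw pointer is constant-initialized and has no destructor, which is what avoids the global initializer and the atexit registration.

```cpp
namespace ExampleLog {
// Hypothetical handler signature for this sketch.
using MsgHandler = void (*)(const char* Message);

// One handler only; a plain pointer needs no constructor or destructor.
MsgHandler Handler = nullptr;

void InstallHandler(MsgHandler NewHandler) {
  Handler = NewHandler;
}

// Forward a message to the installed handler, if there is one.
void Msg(const char* Message) {
  if (Handler != nullptr) {
    Handler(Message);
  }
}
} // namespace ExampleLog
```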
This is the initial split to decouple AVX256 composed operations from
their MMX/SSE counterparts. This is to work around the subtle
differences with AVX/SSE zext/insert behaviour.
This was doing a 128-bit load from memory and then a 64-bit zero extend,
which looked like a spurious move, but it was trying to match the
behaviour of vmovq, where it needed the zero extend.
Also adds a unit test to ensure that we aren't loading too much data by
loading right up against a page boundary.
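The intended vmovq load behaviour, reading exactly 8 bytes and zeroing the upper half, can be modelled as below. This is a sketch with invented names (`Vec128`, `LoadLow64ZeroExtend`), not the JIT code or the actual unit test. A page-boundary test would place the 8 source bytes at the very end of a mapped page so that any wider load faults.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// One 128-bit register as raw bytes.
using Vec128 = std::array<uint8_t, 16>;

// Model of a vmovq-style load from memory (hypothetical helper name):
// only 8 bytes are read, and bytes 8..15 of the result are zero.
Vec128 LoadLow64ZeroExtend(const uint8_t* Memory) {
  Vec128 Result{};                        // starts fully zeroed
  std::memcpy(Result.data(), Memory, 8);  // never touches Memory[8] and beyond
  return Result;
}
```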
Fixes #3787