Similar to the previous tests, vpgatherqq and vgatherqpd are equivalent
instructions, so the tests are identical apart from the mnemonic.
This adds tests for an additional two sets of instructions, getting us
full coverage of all eight instructions when combined with the tests from
PRs #3167 and #3166.
Tests the same things as described in #3165.
In addition, since these tests use 64-bit indices for address
calculation, we can easily generate an index vector that tests
overflow. So every test at every displacement ALSO gains an additional
overflow test to ensure correct behaviour around pointer overflow.
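For illustration, this is the kind of wraparound those overflow tests exercise; the values here are made up, not the actual test constants:

```cpp
#include <cinttypes>
#include <cstdint>
#include <cstdio>

int main() {
  // Per-element gather address: base + index * scale, computed modulo 2^64.
  uint64_t base  = 0x10000;                        // hypothetical buffer address
  uint64_t index = UINT64_C(0xFFFFFFFFFFFFFFF8);   // i.e. -8 as a 64-bit index
  uint64_t scale = 1;
  uint64_t addr  = base + index * scale;           // wraps around to 0xFFF8
  printf("0x%" PRIx64 "\n", addr);                 // lands BELOW base
  return 0;
}
```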
vpgatherdd and vgatherps are effectively the same instructions, so the
tests are the same except for the instruction mnemonic.
This adds unit tests for two of the eight gather instructions.
Specifically, it adds tests for the instructions that use 32-bit indices
to load 32-bit elements.
What it tests:
- Tests all displacement scales
- Tests multiple mask arrangements
- Ensures the mask register is zeroed after the instruction
What it doesn't test:
- Doesn't test address size calculation overflow
- Would only happen on 32-bit with 32-bit indices, or /really/ high
base addresses
- The instruction should behave as a mask to the address size
- Effectively behaves like `(uint64_t)(base + (index << ilog2(scale)))`; see the sketch after this list
- Better idea is to just not expose AVX to 32-bit applications
- Doesn't test VSIB immediate displacement
- This just ends up being base_addr + imm so it isn't too interesting
- We can add more tests in the future if we think we messed that up
- Doesn't test partial fault behaviour
- Because that's a nightmare.
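For the untested 32-bit address size case, a minimal sketch of the masking behaviour described above (the function name is made up, and ilog2 is spelled as count-trailing-zeros since scale is always a power of two):

```cpp
#include <cstdint>

// With a 32-bit address size, the per-element address calculation would
// effectively truncate to 32 bits rather than 64:
uint64_t GatherElementAddress32(uint32_t base, uint32_t index, uint32_t scale) {
  // scale is 1/2/4/8, so the multiply is a shift by ilog2(scale)
  return static_cast<uint32_t>(base + (index << __builtin_ctz(scale)));
}
```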
Specifically keeps each instruction test small and isolated, so if a
single register fails it is very easy to nail down which operation did
it.
I know some of our ASM tests do a chunk of work and spit out a result at
the end, which can be difficult to debug in some cases. Didn't want to do
that, which is why the tests are spread out across 16 files for this
single class of instructions.
If we ever get around to fusing ops with shifts in the ConstProp optimizer (may
or may not be worthwhile), this will delete an instruction from things like "or
al, bh".
Even though lsr is the same speed as bfe on Firestorm, I feel if you ask for
garbage you should get garbage C:
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Pointless, upper bits ignored anyway. Deletes piles of uxt and even some 32-bit
instruction moves.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
To load 8-bit sources without bfe'ing for al/bl/cl if the caller knows it
doesn't need masking behaviour, but without lying about the size so the extract
for ah/bh/ch will still work properly.
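A minimal sketch of the idea; the function and parameter names are illustrative, not FEX's actual API:

```cpp
#include <cstdint>

uint64_t LoadReg8(uint64_t gpr, bool highByte, bool allowUpperGarbage) {
  if (highByte)
    return (gpr >> 8) & 0xff;  // ah/bh/ch/dh: the extract is still required
  if (allowUpperGarbage)
    return gpr;                // al/bl/cl/...: skip the bfe, size stays 8-bit
  return gpr & 0xff;           // caller actually needs masking behaviour
}
```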
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
For the GPR result, the masking already happens as part of the bfi. So the only
point of masking is for the flag calculation. But actually, every flag except
carry will ignore the upper bits anyway. And the carry calculation actually
WANTS the upper bit as a faster impl.
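A small worked example of why carry wants the upper bit, assuming the 8-bit operands are zero-extended into a wider register:

```cpp
#include <cstdint>

bool CarryFromAdd8(uint8_t a, uint8_t b) {
  // The unmasked sum keeps the carry-out in bit 8, so masking the result
  // for the flag calculation would throw away exactly the bit we want.
  uint32_t sum = static_cast<uint32_t>(a) + b;
  return (sum >> 8) & 1;
}
```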
Deletes a pile of code both in FEX and the output :-)
ADC/SBC could probably get similar treatment later.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Now unused, its former users all prefer LoadPFRaw since they can fold
some of this math into the use.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Use the raw popcount rather than the final PF and use some sneaky bit math to
come out 1 instruction ahead.
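One possible shape of that derivation (the exact bit math isn't spelled out here, so treat this as an assumption about the idea rather than the actual code):

```cpp
#include <cstdint>

// x86 PF is set when the low byte of the result has an even number of
// set bits. Keeping the raw popcount around lets each user fold the
// final step into its own computation.
uint32_t RawPF(uint8_t result) {
  return __builtin_popcount(result);  // stored raw, no final flip
}

bool PFFromRaw(uint32_t rawPopcount) {
  return (~rawPopcount) & 1;          // even popcount => PF set
}
```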
Closes #3117
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Mostly copypaste of Orlshl... we really should deduplicate this mess somehow.
Maybe a shift enum on the core Or op?
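Something like the following, purely as a hypothetical sketch of that suggestion (none of these names exist in FEX's IR today):

```cpp
#include <cstdint>

// A shift kind on the core Or op would mirror ARM64's
// orr-with-shifted-register forms and deduplicate Orlshl/Orlshr.
enum class ShiftKind : uint8_t { None, LSL, LSR, ASR, ROR };

struct IROp_Or {
  uint32_t  Dest;
  uint32_t  Src1;
  uint32_t  Src2;
  ShiftKind Shift;   // applied to Src2 before the OR
  uint8_t   Amount;  // shift amount in bits
};
```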
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
This logic is unused since 8adfaa9aa ("OpcodeDispatcher: Use SelectCC for x87"),
which addressed the underlying issue.
This reverts commit df3833edbe3d34da4df28269f31340076238e420.
If we const-prop the required functions and leaves then we can directly
encode the CPUID information rather than jumping out of the JIT.
In testing, almost all CPUID executions const-prop which function is
getting called. The worst case that I found was only an 85% const-prop
rate.
This isn't quite 100% optimal, since we would need to run the RCLSE and
ConstProp passes again after optimizing these, which would remove some
redundant moves.
Sadly there seems to be a bug in the ConstProp pass that starts crashing
applications if that is done.
Easy enough to reproduce by running Half-Life 2, which immediately hits
SIGILL.
Even without this optimization, this is still a significant saving since
we aren't jumping out of the JIT anymore for these optimized CPUIDs.
Most CPUID routines return constant data; there are four that don't.
Some CPUID functions also need the leaf descriptor, so we need to
describe that as well.
Functions that don't return constant data:
- function 1Ah - Returns different data depending on current CPU core
- function 8000_000{2,3,4} - Different data based on CPU core
Functions that need leaf constprop:
- 4h, 7h, Dh, 4000_0001h, 8000_001Dh
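A rough sketch of the optimization as described above; the helper names are hypothetical stand-ins for the real pass plumbing:

```cpp
#include <cstdint>
#include <optional>

struct CPUIDResult { uint32_t eax, ebx, ecx, edx; };

// Stand-ins for the tables described above:
bool NeedsLeaf(uint64_t Function);            // 4h, 7h, Dh, 4000_0001h, 8000_001Dh
bool ReturnsConstantData(uint64_t Function);  // false for 1Ah and 8000_000{2,3,4}h
CPUIDResult RunCPUIDFunction(uint64_t Function, uint64_t Leaf);

// If the function (and, when required, the leaf) is a known constant and
// the function returns constant data, the JIT can encode the result
// directly instead of jumping out to the CPUID handler.
std::optional<CPUIDResult> TryConstPropCPUID(std::optional<uint64_t> Function,
                                             std::optional<uint64_t> Leaf) {
  if (!Function || !ReturnsConstantData(*Function)) {
    return std::nullopt;  // must stay a runtime call
  }
  if (NeedsLeaf(*Function) && !Leaf) {
    return std::nullopt;  // leaf didn't const-prop
  }
  return RunCPUIDFunction(*Function, Leaf.value_or(0));
}
```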