We didn't have a unit test for this and we weren't implementing it at
all.
We treated it as vmovhps/vmovhpd accidentally. Once again caught by the
libaom Intrinsics unit tests.
A bit confusing because the instruction encoding is the same between
VMOVHLPS and VMOVLPS so this unittest was missed.
Implement the test to ensure it stays working
dougallj mentioned that adding these tests might expose
a bug in bextr. Since bextr implementation was changed
apparently it now works correctly, that's good.
If no registers alias, then we can move the first source directly into the
destination and then perform the FCADD operation as opposed to using a
temporary.
We can perform less moves by checking for scenarios where aliasing
occurs. Since addition is commutative (usually, general-case anyway),
order of inputs doesn't strictly matter here.
In the event no source vectors alias the destination,
we can just move the first source vector into it and
then perform the divide without needing to move afterword.
We can avoid needing to use movprfx here by moving
directly into the destination when possible and just
doing the UMAX directly.
Also expands the unsigned max tests to test values with
the sign bit set to ensure all behavior is caught.
Since SMAX performs a comparison and returns the max value regardless
of how the operands are provided, we can check for when the second
input aliases the destination.
Similar to previous tests, vpgatherqq and vgatherqpd are equivalent
instructions. So the tests are the same with the mnemonic changed.
This adds tests for an additional two sets of instructions. Getting us
full coverage of all eight instructions if we include the tests from
PR #3167 and #3166
Tests the same things as described in #3165
In addition, since these tests use 64-bit indices for address
calculation, we can easily generate and indice vector that tests
overflow. So every test at every displacement ALSO gains an additional
overflow test to ensure correct behaviour around pointer overflow
calculation.
Similar to previous tests, vgatherqd and vgatherqps are equivalent
instructions. So the tests are the same with the mnemonic changed.
This adds tests for an additional two sets of instructions, Getting us
up to six total over the eight if we include the tests from #3166.
Tests the same things as described in #3165
In addition, since these tests use 64-bit indices for address
calculation, we can easily generate and indice vector that tests
overflow. So every test at every displacement ALSO gains and additional
overflow test to ensure correct behaviour around pointer overflow
calculation.
Just like the previous tests, vpgatherdq and vgatherpq are equivalent
instructions. So the tests are the same except for the instruction
mnemonic again.
This adds unittests for two more of the eight gather instructions.
Getting us up to testing four in total.
Specifically this adds tests for 32-bit indices while loading 64-bit
element instructions.
Same thing as PR #3165 for what it tests versus doesn't.
vpgatherdd and vgatherps are effectively the same instructions, so the
tests are the same except for the instruction mnemonic.
This adds unit tests for two of the eight gather instructions.
Specifically this adds tests for the 32-bit indices loading 32-bit
elements instructions.
What it tests:
- Tests all displacement scales
- Tests multiple mask arrangements
- Ensures the mask register is zero'd after the instruction
What it doesn't test:
- Doesn't test address size calculation overflow
- Only would happen on 32-bit with 32-bit indices, or /really/ high
base addresses
- The instruction should behave as a mask to the address size
- Effectively behaves like `(uint64_t)(base + index << ilog2(scale))`
- Better idea is to just not expose AVX to 32-bit applications
- Doesn't test VSIB immediate displacement
- This just ends up being base_addr + imm so it isn't too interesting
- We can add more tests in the future if we think we messed that up
- Doesn't test partial fault behaviour
- Because that's a nightmare.
Specifically keeps each instruction test small and isolated so if a
single register fails it is very easily to nail down which operation did
it.
I know some of our ASM tests do a chunk of work and spit out a result at
the end which can be difficult to debug in some cases. Didn't want to do
that which is why the tests are spread out across 16 files for these
single class of instructions.
For a bunch of cases that act as broadcasts (where all
indices in the imm8 specify the same element), we
can use VDupElement here rather than iterating through.
While it would be bizarre if this actually occurred frequently
in practice, we can still tune it so there's no subpar assembly
output in the cases it actually does happen.
We can reuse the same helper we have for handling VMASKMOVPD and VMASKMOVPS,
though we need to move some handling around to account for the fact that
VPMASKMOVD and VPMASKMOVQ 'hijack' the REX.W bit to signify the element
size of the operation.