Implements CI for tracking instruction counts for generated blocks of
code when translating x86 to ARM64 assembly.
This will eventually encompass every instruction in our instruction
tables, similar to how our assembly tests try to cover everything in
those tables.
Incidentally, the data for this CI is generated using our assembly
tests. Executing a suite of instructions with disassembly and
instruction stats enabled produces the stats that can then be added to a
JSON file.
The current implementation only covers the SecondGroup table of
instructions because it is a relatively small table and has known
inefficiencies in its instruction implementations. Once this is merged I
will be adding more tables of instructions to additional JSON files for
testing.
These JSON files will support adjusting CPU features regardless of the
host's features, so the CI can test implementations that depend on
different CPU features. This will let us test things like one
instruction having different "optimal" implementations depending on
whether the host supports SVE128, SVE256, SVEI8MM, etc.
This initial instruction auditing is what found the bug in our vector
shift instructions when the shift size is zero. Inspecting the result of
the CI run shows that these instructions still aren't "optimal" because
they are doing loads and stores that can be eliminated.
The "Optimal" in the JSON is purely for human readable and grepping
ability to see what is optimal versus not. Same with the "Comment"
section.
According to my auditing spreadsheet, the total number of instructions
that will end up in these JSON files is about 1000, but we will likely
end up with more since there will be edge cases that can be more optimal
depending on their arguments.
This was confusingly split between Arm64Emitter, Arm64Dispatcher, and
Arm64JIT.
- Arm64JIT objects were unnecessary and free to be deleted.
- Arm64Dispatcher simulator and decoder moved to Arm64Emitter
- Arm64Emitter disassembler and decoder renamed
- Dropped usage of the PrintDisassembler since it is hardcoded to go
through a FILE* type
- We instead want its output to go through LogMan, which means using a
split Decoder+Disassembler object pair (see the sketch after this list).
- Can't reuse the decoder owned by the vixl simulator, since the
simulator registers itself as a visitor on that decoder; reusing it
would execute instructions while disassembling them.
- Disassembly output for blocks and the dispatcher now goes through LogMan.
- Blocks are wrapped in Begin/End text for tracking purposes in CI.
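As a rough sketch of what that Decoder+Disassembler pair looks like
(this uses vixl's aarch64 Decoder and Disassembler; the LogMan helper
names and the header path are from memory and may differ from what FEX
actually uses):

```cpp
#include <cstddef>
#include <cstdint>

#include "aarch64/decoder-aarch64.h"
#include "aarch64/disasm-aarch64.h"
#include <FEXCore/Utils/LogManager.h>  // FEX's LogMan (path may differ)

// Walk a generated block and push each disassembled instruction through
// the logger instead of a FILE*-bound PrintDisassembler.
void DisassembleBlock(const uint8_t* Code, size_t Size) {
  vixl::aarch64::Decoder Decoder;
  vixl::aarch64::Disassembler Disasm;
  // The Disassembler only formats text, so nothing executes while
  // decoding, unlike reusing the simulator's decoder (the simulator
  // registers itself as a visitor on its own decoder).
  Decoder.AppendVisitor(&Disasm);

  LogMan::Msg::IFmt("Begin Block");  // Begin/End markers for the CI to grep.
  for (size_t i = 0; i < Size; i += vixl::aarch64::kInstructionSize) {
    auto* Instr = reinterpret_cast<const vixl::aarch64::Instruction*>(Code + i);
    Decoder.Decode(Instr);
    LogMan::Msg::IFmt("{}", Disasm.GetOutput());
  }
  LogMan::Msg::IFmt("End Block");
}
```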
We don't currently have a device in CI that can run SVE with 128-bit
wide registers. Until we have such a device, make sure the vixl
simulator also runs the ASM tests at this width.
This was causing test failures locally where some values were set to
uninitialized data. Ensure that gregs, YMM, and MMX registers are all
zero-initialized.
Requires the IR headerop to house the number of host instructions this
code translates to, which is needed for the stats.
Fixes compiling with disassembly enabled; this will be used with the
instruction count CI.
This is incredibly useful and I find myself hacking this feature in
every time I am optimizing IR. Adds a new configuration option which
allows dumping IR at various times:
- Before any optimization passes have run
- After all optimization passes have run
- Before and after each IRPass, to see what is breaking something
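Roughly, the dump points sit around the pass pipeline like this (the
names and option values here are purely illustrative, not the actual
configuration plumbing):

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical sketch of where the dump points sit in the pass pipeline.
enum class DumpIRPhase { Off, BeforeOpt, AfterOpt, AroundEachPass };

struct IRPass {
  std::string Name;
  std::function<void()> Run;
};

void RunPasses(std::vector<IRPass>& Passes, DumpIRPhase Phase,
               const std::function<void(const std::string&)>& DumpIR) {
  if (Phase == DumpIRPhase::BeforeOpt) DumpIR("before any optimization");
  for (auto& Pass : Passes) {
    // AroundEachPass brackets every pass to bisect which one breaks the IR.
    if (Phase == DumpIRPhase::AroundEachPass) DumpIR("before " + Pass.Name);
    Pass.Run();
    if (Phase == DumpIRPhase::AroundEachPass) DumpIR("after " + Pass.Name);
  }
  if (Phase == DumpIRPhase::AfterOpt) DumpIR("after all optimization passes");
}
```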
Needs #2864 merged first
This is a /very/ simple optimization, purely because of a choice that ARM
made with SVE in the latest Cortex cores.
Cortex-A715:
- sxtl/sxtl2/uxtl/uxtl2 can execute 1 instruction per cycle.
- sunpklo/sunpkhi/uunpklo/uunpkhi can execute 2 instructions per cycle.
Cortex-X3:
- sxtl/sxtl2/uxtl/uxtl2 can execute 2 instructions per cycle.
- sunpklo/sunpkhi/uunpklo/uunpkhi can execute 4 instructions per cycle.
This is fairly quirky since the optimization only works on SVE systems
with a 128-bit vector length, but since that covers all of the current
consumer platforms, it will work.
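As a rough sketch of the substitution using vixl MacroAssembler
mnemonics (illustrative only, not FEX's actual emitter code; it assumes
the assembler has SVE enabled in its CPU features). At a 128-bit vector
length the unpack instructions see exactly the same register contents as
the NEON widening moves:

```cpp
#include "aarch64/macro-assembler-aarch64.h"

using namespace vixl::aarch64;

// Widen 16 signed bytes in v0/z0 to 16 signed halfwords in v2/v3 (z2/z3).
void WidenSignedInt8ToInt16(MacroAssembler& masm, bool HasSVE128) {
  if (HasSVE128) {
    // Higher-throughput path on Cortex-A715/Cortex-X3.
    masm.Sunpklo(z2.VnH(), z0.VnB());  // stands in for: sxtl  v2.8h, v0.8b
    masm.Sunpkhi(z3.VnH(), z0.VnB());  // stands in for: sxtl2 v3.8h, v0.16b
  } else {
    masm.Sxtl(v2.V8H(), v0.V8B());
    masm.Sxtl2(v3.V8H(), v0.V16B());
  }
}
```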
We need to know the difference between a host supporting SVE with
128-bit registers versus 256-bit registers. Ensure we can detect the
difference.
No functional change here.
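For illustration, one way to tell the two apart on Linux is to query the
thread's SVE vector length through prctl; this is just a sketch and not
necessarily how FEX performs the detection:

```cpp
#include <sys/prctl.h>

// Returns the current SVE vector length in bits, or 0 if SVE is
// unavailable. PR_SVE_GET_VL reports the length in bytes in the low bits.
static int SVEVectorLengthBits() {
  int Ret = prctl(PR_SVE_GET_VL);
  if (Ret < 0) {
    return 0;  // SVE not supported (or too old a kernel).
  }
  return (Ret & PR_SVE_VL_LEN_MASK) * 8;
}

// Usage sketch:
//   bool SupportsSVE128 = SVEVectorLengthBits() >= 128;
//   bool SupportsSVE256 = SVEVectorLengthBits() >= 256;
```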
This allows us to both enable and disable features regardless of what
the host supports. This replaces the old `EnableAVX` option.
Unlike the old EnableAVX option, which was a binary option that could
only disable the feature, each of these options is technically a trinary
state.
Not setting an option gives you the default host detection, while
explicitly enabling or disabling will toggle the option regardless of
what the host supports.
This will be used by the instruction count CI in the future.
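Conceptually the trinary behaves like this (hypothetical names, not the
actual configuration plumbing):

```cpp
#include <optional>

// std::nullopt = use host detection, true/false = force on/off.
struct FeatureOverrides {
  std::optional<bool> EnableSVE;
  std::optional<bool> EnableAVX;
};

// Explicit enable/disable wins regardless of what the host supports;
// otherwise fall back to host detection.
bool ResolveFeature(std::optional<bool> Override, bool HostSupports) {
  return Override.value_or(HostSupports);
}

// Usage sketch: Features.AVX = ResolveFeature(Overrides.EnableAVX, HostHasAVX);
```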
Moves the dummy handlers over to this library. This will end up getting
used for more than the mingw test harness runner once the instruction
count CI is operational.
This was a debug LoadConstant that would load the entry into a temporary
register to make it easier to see what RIP a block was in.
This was implemented when FEX stopped storing the RIP in the CPU state
for every block. It is now no longer necessary since FEX stores the RIP
in the tail data of the block.
This was affecting the instruction count CI when in a debug build.
I use this locally when looking for optimization opportunities in the
JIT.
The instruction count CI in the future will use this as well.
Just get it upstreamed right away.
`eor <reg>, <reg>, <reg>` is not the optimal way to zero a vector
register on ARM CPUs. Instead we should move from a constant or the zero
register to take advantage of zero-latency moves.
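For example (vixl MacroAssembler mnemonics, illustrative only, not FEX's
actual emitter code):

```cpp
#include "aarch64/macro-assembler-aarch64.h"

using namespace vixl::aarch64;

// Preferred zeroing idioms instead of: eor v0.16b, v0.16b, v0.16b
void ZeroRegisters(MacroAssembler& masm) {
  masm.Movi(v0.V2D(), 0);  // Zero a vector register with a constant move.
  masm.Mov(x0, xzr);       // Zero a GPR by moving from the zero register.
}
```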