Requires the IR headerop to house the number of host instructions this
code is translating for the stats.
Fixes compiling with disassembly enabled; this will be used by the
instruction count CI.
This is incredibly useful and I find myself hacking this feature in
every time I am optimizing IR. Adds a new configuration option which
allows dumping IR at various times:
- Before any optimization passes have happened
- After all optimization passes have happened
- Before and after each IRPass, to see what is breaking something
Needs #2864 merged first
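A minimal sketch of how such a dump hook could sit around the pass loop; the option values, `Pass` type, and function names here are hypothetical, not FEX's actual API.

```cpp
#include <string>
#include <vector>

// Hypothetical dump points; names are illustrative only.
enum class DumpIR { Off, BeforeOpt, AfterOpt, BeforeAndAfterPass };

struct Pass { std::string Name; };

// Returns the labels of every IR dump the configured mode would emit.
std::vector<std::string> PlanDumps(DumpIR When, const std::vector<Pass>& Passes) {
  std::vector<std::string> Dumps;
  if (When == DumpIR::BeforeOpt) Dumps.push_back("before-opt");
  for (const auto& P : Passes) {
    if (When == DumpIR::BeforeAndAfterPass) Dumps.push_back("before-" + P.Name);
    // P.Run(IR) would execute here.
    if (When == DumpIR::BeforeAndAfterPass) Dumps.push_back("after-" + P.Name);
  }
  if (When == DumpIR::AfterOpt) Dumps.push_back("after-opt");
  return Dumps;
}
```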
This is a /very/ simple optimization purely because of a choice that ARM
made with SVE in latest Cortex.
Cortex-A715:
- sxtl/sxtl2/uxtl/uxtl2 can execute 1 instruction per cycle.
- sunpklo/sunpkhi/uunpklo/uunpkhi can execute 2 instructions per cycle.
Cortex-X3:
- sxtl/sxtl2/uxtl/uxtl2 can execute 2 instructions per cycle.
- sunpklo/sunpkhi/uunpklo/uunpkhi can execute 4 instructions per cycle.
This is fairly quirky since this optimization only works on SVE systems
with a 128-bit vector length. Since that covers all of the current
consumer platforms, it will work there.
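The substitution logic can be sketched like this (function and argument names are made up, not FEX's API). With a 128-bit SVE vector length, the unpack instructions read exactly the register halves that the xtl forms read, so the higher-throughput form is a safe stand-in; with a larger vector length they read different elements and the swap is invalid.

```cpp
#include <string>

// Hypothetical instruction-selector sketch.
// VL == 128: [s|u]unpklo/hi widen the same low/high halves as
// [s|u]xtl/[s|u]xtl2, at higher throughput on recent Cortex cores.
std::string SelectWiden(bool HostHasSVE, unsigned VectorLengthBits,
                        bool Signed, bool HighHalf) {
  if (HostHasSVE && VectorLengthBits == 128) {
    if (Signed) return HighHalf ? "sunpkhi" : "sunpklo";
    return HighHalf ? "uunpkhi" : "uunpklo";
  }
  if (Signed) return HighHalf ? "sxtl2" : "sxtl";
  return HighHalf ? "uxtl2" : "uxtl";
}
```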
We need to know the difference between the host supporting SVE with
128-bit registers versus 256-bit registers, so ensure we track that
distinction.
No functional change here.
This allows us to both enable and disable features regardless of what
the host supports. This replaces the old `EnableAVX` option.
Unlike the old EnableAVX option, which was a binary option that could
only disable, each of these options is technically a tri-state.
Not setting an option gives you the default detection, while explicitly
enabling or disabling will toggle the option regardless of what the host
supports.
This will be used by the instruction count CI in the future.
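The tri-state resolution described above can be sketched in a few lines (a sketch with hypothetical names, not FEX's actual config code):

```cpp
#include <optional>

// Unset -> fall back to host detection; explicitly set -> force the
// feature on or off regardless of what the host supports.
bool ResolveFeature(std::optional<bool> UserSetting, bool HostSupports) {
  return UserSetting.value_or(HostSupports);
}
```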
Moves the dummy handlers over to this library. This will end up getting
used for more than the mingw test harness runner once the instruction
count CI is operational.
This was a debug LoadConstant that would load the entry into a temporary
register to make it easier to see which RIP a block was in.
This was implemented when FEX stopped storing the RIP in the CPU state
for every block. It is now no longer necessary since FEX stores the RIP
in the tail data of the block.
This was affecting the instruction count CI when in a debug build.
I use this locally when looking for optimization opportunities in the
JIT.
The instruction count CI in the future will use this as well.
Just get it upstreamed right away.
`eor <reg>, <reg>, <reg>` is not the optimal way to zero a vector
register on ARM CPUs. Instead we should move from a constant or the zero
register to take advantage of zero-latency moves.
While the ENABLE_LLD and ENABLE_MOLD options are nice, they don't handle
the case where the `lld` or `mold` linker doesn't match the compiler.
This particularly crops up when overriding the C compiler to a new
version of clang but the globally installed `ld.lld` is still the old
clang version.
This then causes clang to fail with unusual errors when upstream breaks
compatibility with itself.
Easy enough to use by passing the linker to cmake:
`-DUSE_LINKER=/usr/bin/ld.lld-15`
This also removes the ENABLE_LLD and ENABLE_MOLD options to use
USE_LINKER directly.
- lld: `-DUSE_LINKER=lld`
- mold: `-DUSE_LINKER=mold`
Example of compiler failure when built with clang-15 but attempting to
link with ld.lld 14:
```bash
ld.lld-14: error: unittests/APITests/CMakeFiles/Filesystem.dir/Filesystem.cpp.o: Opaque pointers are only supported in -opaque-pointers mode (Producer: 'LLVM15.0.7' Reader: 'LLVM 14.0.6')
```
This needs to default to 64-bit addresses; it was previously defaulting
to 32-bit, which meant the destination address was getting truncated.
In a 32-bit process the address is still 32-bit.
I'm actually surprised this hasn't caused spurious SIGSEGV before this
point.
Adds a 32-bit test to ensure that side is tested as well.
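The truncation can be demonstrated in isolation (a sketch; the helper name is made up):

```cpp
#include <cstdint>

// A 64-bit destination address survives only when the calculation
// defaults to 64-bit; a 32-bit default masks off the top half, which is
// the bug described above. In a 32-bit process the mask is harmless
// because addresses are 32-bit anyway.
uint64_t CalculateDestination(uint64_t Address, bool Is64BitProcess) {
  return Is64BitProcess ? Address : (Address & 0xFFFF'FFFFull);
}
```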
This is more obvious. llvm-mca says TST is half the cycle count of CMN
for whatever CPU it defaults to. dougallj's reference shows both as the
same performance.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
In the non-immediate cases, we can amortize some work between the two
flags to come out 1 instruction ahead.
In the immediate case, costs us an extra 2 instructions compared to
before we packed NZCV flags, but this mitigates a bigger instr count
regression that this PR would otherwise have. Coming out ahead will
require FlagM and smarter RA, but is doable.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Same technique as the left shifts. Gets rid of all our COND_FLAG_SET
use, which is good because it's a performance footgun.
Overall saves 17 instructions (!!!!) from the flag calculation code for
`sar eax, cl`.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>