Adapted from LLVM version of pr-code-format.yml.
Copies a few scripts from LLVM to External/.
Runs self-hosted on X64.
Assumes clang-format 16.0.6 for formatting.
Notably this bugfix version also introduces support for formatting
std::atomic types and std::atomic_flag.
Also, of course, keeps our tracked external up to date.
It is not an external component, and it makes paths needlessly long.
Ryan seemed amenable to this when we discussed on IRC earlier.
Signed-off-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Now all vbroadcast implementations go down the more optimal path.
For non-SVE 128-bit cases where we only have 128-bit wide registers,
we behave like ld1rqb and just act as a normal 128-bit load for
interface convenience.
In the case of running on a 128-bit SVE system this predicate wasn't
set up. Since we never had any predicate usage before, this wasn't an
issue. Now that #2914 is using the 128-bit predicate we need to make
sure that we are generating it.
Allows the implementations of the vbroadcast instructions to perform the
load and broadcast in one operation as opposed to doing the load and then
broadcast separately.
Notably, the broadcasting loads can also be used on systems that have
128-bit SVE support, not only 256-bit.
On non-SVE systems, we use the equivalent AdvSIMD instructions.
For config values that were string objects we were unnecessarily creating
copies each time the string was accessed.
Convert the () operator over to returning a reference.
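A minimal sketch of the idea, not FEX's actual config class: ConfigValue
and AccessIsCopyFree are hypothetical names used only for illustration.
It shows that returning a const reference from operator() hands back the
stored string itself instead of a fresh copy.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical stand-in for a config value wrapper (assumption, not FEX code).
template <typename T>
class ConfigValue {
public:
  explicit ConfigValue(T Initial) : Value(std::move(Initial)) {}

  // Returning by const reference avoids copying T on every access.
  const T& operator()() const { return Value; }

private:
  T Value;
};

// Returns true when two successive accesses refer to the same stored object,
// i.e. operator() did not produce a temporary copy.
inline bool AccessIsCopyFree() {
  ConfigValue<std::string> RootFS{"/opt/rootfs"};
  const std::string& First = RootFS();
  const std::string& Second = RootFS();
  assert(First == "/opt/rootfs");
  return &First == &Second;
}
```

Callers that need an owned string can still copy explicitly, so the
change only removes the copies nobody asked for.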
The current implementation uses orr excessively. This makes FEX miss
hardware optimization opportunities where some CPU cores will zero-cycle
move constants that fit into the 16 bits of movz/movk.
First, evaluate up front whether the number of 16-bit segments is > 1. In
those cases we should check if it is a bitfield that can be moved in one
instruction with orr.
After that point we will use movz for 16-bit constant moves.
Additionally, this optimizes the case where a constant of zero is loaded
to be a `mov <reg>, zr`, which gets renamed on most hardware.
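The selection order described above can be sketched roughly as follows.
This is an illustration under assumptions, not the emitter code from this
change: LoadStrategy, ChooseStrategy and the helpers are hypothetical
names, only 64-bit constants are considered, and IsLogicalImmediate is a
straightforward check for the AArch64 bitmask-immediate pattern (a rotated
run of ones replicated across the register).

```cpp
#include <cassert>
#include <cstdint>

using std::uint64_t;

// Hypothetical names for the three paths described in the text.
enum class LoadStrategy { MovZeroReg, OrrBitmask, MovzMovk };

// Number of non-zero 16-bit segments in a 64-bit constant.
inline unsigned CountNonZeroSegments(uint64_t Value) {
  unsigned Count = 0;
  for (unsigned Shift = 0; Shift < 64; Shift += 16) {
    if ((Value >> Shift) & 0xFFFFULL) ++Count;
  }
  return Count;
}

inline unsigned CountSetBits(uint64_t Value) {
  unsigned Count = 0;
  for (; Value != 0; Value &= Value - 1) ++Count;
  return Count;
}

// True if Value is a repeating element (2..64 bits wide) whose bits form a
// single, possibly rotated, run of ones. All-zeros and all-ones are excluded.
inline bool IsLogicalImmediate(uint64_t Value) {
  if (Value == 0 || Value == ~0ULL) return false;
  for (unsigned Size = 2; Size <= 64; Size *= 2) {
    const uint64_t Mask = Size == 64 ? ~0ULL : ((1ULL << Size) - 1);
    const uint64_t Element = Value & Mask;
    bool Repeats = true;
    for (unsigned Shift = Size; Shift < 64; Shift += Size) {
      if (((Value >> Shift) & Mask) != Element) { Repeats = false; break; }
    }
    if (!Repeats) continue;
    // Rotate right by one within the element; a single circular run of ones
    // differs from its rotation in exactly two bit positions.
    const uint64_t Rotated = ((Element >> 1) | ((Element & 1) << (Size - 1))) & Mask;
    if (CountSetBits(Element ^ Rotated) == 2) return true;
  }
  return false;
}

inline LoadStrategy ChooseStrategy(uint64_t Constant) {
  if (Constant == 0) return LoadStrategy::MovZeroReg;          // mov <reg>, zr
  if (CountNonZeroSegments(Constant) > 1 && IsLogicalImmediate(Constant)) {
    return LoadStrategy::OrrBitmask;                           // one orr
  }
  assert(CountNonZeroSegments(Constant) >= 1);
  return LoadStrategy::MovzMovk;                               // movz (+ movk)
}
```

With this ordering a constant that fits in a single 16-bit segment never
reaches the orr check, so it always takes the movz path.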
Commonly we are doing a BFI into a 32-bit register, which is hitting the
ubfx (lsr alias) path.
In the case of a 32-bit destination we can also do a regular move, which
will take advantage of the CPU's rename functionality and give a minor
speed boost.
Didn't notice this in the previous PR. When DUMPIR=stderr is set without
any selection of where to place it in PASSMANAGERDUMPIR, it was supposed
to put the dumper at the end of the passes.
We need to make sure that it is placed at the end of the passes rather
than at the current `it`.
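As a rough illustration of the difference, assuming a pass list modelled
as a std::list of names: PassList, InsertDumper and "IRDumper" are
hypothetical and do not reflect the real pass manager types.

```cpp
#include <cassert>
#include <iterator>
#include <list>
#include <string>

// Hypothetical model of an ordered pass list (assumption, not FEX code).
using PassList = std::list<std::string>;

// Inserts the dumper before It when an explicit placement was selected,
// otherwise at the very end of the pass list.
inline void InsertDumper(PassList& Passes, PassList::iterator It, bool HasExplicitPlacement) {
  if (HasExplicitPlacement) {
    Passes.insert(It, "IRDumper");
    return;
  }
  // No placement selected: ignore the current iterator and append.
  Passes.insert(Passes.end(), "IRDumper");
  assert(Passes.back() == "IRDumper");
}
```

Inserting at `It` in the no-selection case would leave the dumper in the
middle of the pipeline, so it would print IR that later passes still
modify.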
We can perform the SQRT first and then broadcast 1.0 into the destination
since all the intermediary work is done, meaning we don't have to worry
about Dst and Vector aliasing one another.
If DumpIR is enabled but the PassManagerDumpIR option isn't enabled then
this currently does nothing.
As a convenience, enable dumping the final optimized IR if an option
hasn't been specified.