This commit adds a new IR level pass to the AMDGPU backend to perform
atomic optimizations. It works by:
- Run through a function and find atomicrmw add/sub instructions or uses of
the atomic buffer intrinsics for add/sub.
- If all arguments except the value to be added/subtracted are uniform,
record the operation as a candidate for optimization.
- Run through the atomic operations we can optimize and, depending on
whether the value is uniform or divergent, use wavefront-wide operations
(DPP in the divergent case) to calculate the total amount to be
atomically added/subtracted.
- Then let only a single lane of each wavefront perform the atomic
operation, reducing the total number of atomic operations in flight.
- Lastly, we recombine the result from the single lane back to each lane of
the wavefront, and calculate each individual lane's offset into the
final result.
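To make the first two bullets concrete, here is a rough sketch of the scan
(illustrative only: the divergence-analysis interface and helper names are
assumptions rather than the pass's actual code, and the buffer-intrinsic case
is omitted):

  // Sketch only: collect atomicrmw add/sub candidates whose non-value
  // operands are uniform, so a single lane can later perform the atomic
  // on behalf of the whole wavefront.
  #include "llvm/ADT/SmallVector.h"
  #include "llvm/Analysis/DivergenceAnalysis.h"
  #include "llvm/IR/InstIterator.h"
  #include "llvm/IR/Instructions.h"

  struct AtomicCandidate {
    llvm::AtomicRMWInst *RMW;
    bool ValueIsUniform; // selects uniform vs. DPP-based reduction later
  };

  static void collectCandidates(llvm::Function &F,
                                llvm::DivergenceAnalysis &DA,
                                llvm::SmallVectorImpl<AtomicCandidate> &Out) {
    for (llvm::Instruction &I : llvm::instructions(F)) {
      auto *RMW = llvm::dyn_cast<llvm::AtomicRMWInst>(&I);
      if (!RMW)
        continue;
      auto Op = RMW->getOperation();
      if (Op != llvm::AtomicRMWInst::Add && Op != llvm::AtomicRMWInst::Sub)
        continue;
      // All operands other than the value being added/subtracted must be
      // uniform for the single-lane rewrite to be valid.
      if (DA.isDivergent(RMW->getPointerOperand()))
        continue;
      Out.push_back({RMW, !DA.isDivergent(RMW->getValOperand())});
    }
  }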
Differential Revision: https://reviews.llvm.org/D51969
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@343973 91177308-0d34-0410-b5e6-96231b3b80d8
When branch target identification is enabled, we can only do indirect
tail-calls through x16 or x17. This means that the outliner can't
transform a BLR instruction at the end of an outlined region into a BR.
Differential revision: https://reviews.llvm.org/D52869
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@343969 91177308-0d34-0410-b5e6-96231b3b80d8
When branch target identification is enabled, all indirectly-callable
functions start with a BTI C instruction. This instruction can only be
the target of certain indirect branches (direct branches and
fall-through are not affected):
- A BLR instruction, in either a protected or unprotected page.
- A BR instruction in a protected page, using x16 or x17.
- A BR instruction in an unprotected page, using any register.
Without BTI, we can use any non-call-preserved register to hold the
address for an indirect tail call. However, when BTI is enabled, the
code being compiled might be loaded into a BTI-protected page, where
only x16 and x17 can be used for indirect tail calls.
Legacy code without this restriction can still indirectly tail-call
BTI-protected functions, because they will be loaded into an unprotected
page, so any register is allowed.
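A minimal sketch of how the restriction might be gated, assuming BTI is
surfaced through the "branch-target-enforcement" function attribute (the real
query may live elsewhere):

  // Sketch only: whether an indirect tail call may use any register.
  #include "llvm/CodeGen/MachineFunction.h"
  #include "llvm/IR/Function.h"

  static bool tailCallMayUseAnyGPR(const llvm::MachineFunction &MF) {
    // Under BTI, a BR from a protected page only reaches the callee's BTI C
    // if it goes via x16 or x17, so restrict the register choice then.
    return !MF.getFunction().hasFnAttribute("branch-target-enforcement");
  }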
Differential revision: https://reviews.llvm.org/D52868
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@343968 91177308-0d34-0410-b5e6-96231b3b80d8
The Branch Target Identification extension, introduced to AArch64 in
Armv8.5-A, adds the BTI instruction, which is used to mark valid targets
of indirect branches. When enabled, the processor will trap if an
instruction in a protected page tries to perform an indirect branch to
any instruction other than a BTI. The BTI instruction uses encodings
which were NOPs in earlier versions of the architecture, so BTI-enabled
code will still run on earlier hardware, just without the extra
protection.
There are 3 variants of the BTI instruction, which are valid targets for
different kinds of branches:
- BTI C can be targeted by call instructions, and is intended to be
used at function entry points. These are the BLR instruction, as well
as BR with x16 or x17. These BR instructions are allowed for use in
PLT entries, and we can also use them to allow indirect tail-calls.
- BTI J can be targeted by BR only, and is intended to be used by jump
tables.
- BTI JC acts as both a BTI C and a BTI J instruction, and can be
targeted by any BLR or BR instruction.
Note that RET instructions are not restricted by branch target
identification; the reason for this is that return addresses can be
protected more effectively using return address signing. Direct branches
and calls are also unaffected, as it is assumed that an attacker cannot
modify executable pages (if they could, they wouldn't need to do a
ROP/JOP attack).
This patch adds a MachineFunctionPass which:
- Adds a BTI C at the start of every function which could be indirectly
called (either because it is address-taken, or is externally visible and so
could be address-taken in another translation unit).
- Adds a BTI J at the start of every basic block which could be
indirectly branched to. This could either be done by a jump table, or
by taking the address of the block (e.g. using the GCC label values
extension).
We only need to use BTI JC when a function is indirectly-callable, and
takes the address of the entry block. I've not been able to trigger this
from C or IR, but I've included a MIR test just in case.
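Roughly, the per-block decision described above could be sketched as follows
(hypothetical helper, not the actual pass code; the jump-table query is left
as a parameter that would come from MachineJumpTableInfo):

  // Sketch only: which BTI variant (if any) a block needs.
  #include "llvm/CodeGen/MachineBasicBlock.h"
  #include "llvm/CodeGen/MachineFunction.h"
  #include "llvm/IR/Function.h"

  enum class BTIKind { None, C, J, JC };

  static BTIKind neededBTI(const llvm::MachineBasicBlock &MBB,
                           bool IsJumpTableTarget) {
    const llvm::MachineFunction &MF = *MBB.getParent();
    const llvm::Function &F = MF.getFunction();

    // Entry block: BTI C if the function could be called indirectly, i.e. it
    // is address-taken or visible outside this translation unit (local
    // linkage is used here as an approximation of "not externally visible").
    bool NeedsC = (&MBB == &MF.front()) &&
                  (F.hasAddressTaken() || !F.hasLocalLinkage());

    // Any block: BTI J if an indirect branch can reach it, e.g. via a jump
    // table entry or a taken block address.
    bool NeedsJ = IsJumpTableTarget || MBB.hasAddressTaken();

    if (NeedsC && NeedsJ)
      return BTIKind::JC;
    if (NeedsC)
      return BTIKind::C;
    if (NeedsJ)
      return BTIKind::J;
    return BTIKind::None;
  }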
Using BTI C at function entries relies on the fact that no other code in
BTI-protected pages uses indirect tail-calls, unless they use x16 or x17
to hold the address. I'll add that code-generation restriction as a
separate patch.
Differential revision: https://reviews.llvm.org/D52867
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@343967 91177308-0d34-0410-b5e6-96231b3b80d8
When deciding if it is safe to optimize a conditional branch to a CBZ or
CBNZ, the offsets of the BasicBlocks from the start of the function are
estimated. For inline assembly, the generic getInlineAsmLength() function is
used to get a worst-case size estimate of the inline assembly by multiplying
the number of instructions by the max instruction size of 4 bytes. This
unfortunately doesn't take into account the generation of Thumb implicit IT
instructions. In edge cases, such as when all the instructions in the block
are 4 bytes in size and there is an implicit IT, the size is underestimated.
This can cause an out-of-range CBZ or CBNZ to be generated.
The patch takes a conservative approach and assumes that every instruction
in the inline assembly block may have an implicit IT.
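A sketch of the resulting conservative estimate (illustrative only; the
helper name is made up, and the 2 bytes correspond to the 16-bit Thumb IT
encoding):

  // Sketch only: worst-case size of an inline assembly block in Thumb2.
  // getInlineAsmLength() already assumes the 4-byte maximum per instruction;
  // on top of that, pessimistically allow a 2-byte implicit IT in front of
  // every instruction.
  static unsigned worstCaseInlineAsmSize(unsigned NumInsts, bool IsThumb2) {
    unsigned Size = NumInsts * 4; // max instruction size
    if (IsThumb2)
      Size += NumInsts * 2;       // one possible implicit IT per instruction
    return Size;
  }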
Fixes PR31805
Differential Revision: https://reviews.llvm.org/D52834
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@343960 91177308-0d34-0410-b5e6-96231b3b80d8
The MachineOutliner for AArch64 transforms indirect calls into indirect
tail calls, replacing the call with the TCRETURNri pseudo-instruction.
This pseudo lowers to a BR, but has the isCall and isReturn flags set.
The problem is that TCRETURNri takes a tcGPR64 as the register argument,
to prevent indirect tail-calls from using caller-saved registers. The
indirect calls transformed by the outliner could use caller-saved
registers. This is fine, because the outliner ensures that the register
is available at all call sites. However, this causes a verifier failure
when the register is not in tcGPR64. The fix is to add a new
pseudo-instruction like TCRETURNri, but which accepts any GPR.
Differential revision: https://reviews.llvm.org/D52829
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@343959 91177308-0d34-0410-b5e6-96231b3b80d8
The srli test in alu8.ll was a no-op, as it shifted by 8 bits. Fix this, and
also change the immediate in alu16.ll, as shifting by something other than a
power of 8 is more interesting.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@343958 91177308-0d34-0410-b5e6-96231b3b80d8
This change is proposed as a part of D44548, but we
need this independently to avoid regressions from improved
undef propagation in SimplifyDemandedVectorElts().
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@343940 91177308-0d34-0410-b5e6-96231b3b80d8
rL343913 was using SimplifyDemandedBits's original demanded mask instead of the adjusted 'NewMask' that accounts for multiple uses of the op (those variable names really need improving...).
Annoyingly, many of the test changes (back to the pre-rL343913 state) are actually safe, but only because their multiple uses are all by PMULDQ/PMULUDQ.
Thanks to Jan Vesely (@jvesely) for bisecting the bug.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@343935 91177308-0d34-0410-b5e6-96231b3b80d8
These track the quality of generated code for simple arithmetic operations
that were legalised from non-native types.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@343930 91177308-0d34-0410-b5e6-96231b3b80d8
Prevents missing other simplifications that may occur deep in the operand chain, where CommitTargetLoweringOpt won't add the PMULDQ back to the worklist itself.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@343922 91177308-0d34-0410-b5e6-96231b3b80d8
Attempt to simplify PSHUFB masks (even non-constant ones) - we should probably be able to simplify other variable shuffles as well, as the need arises.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@343919 91177308-0d34-0410-b5e6-96231b3b80d8
This patch enables SimplifyDemandedBits to call SimplifyDemandedVectorElts in cases where the demanded bits mask covers entire elements of a bitcasted source vector.
There are a couple of cases here where simplification at a deeper level (such as through bitcasts) prevents further simplification; CommitTargetLoweringOpt only adds immediate uses/users back to the worklist when we might want to combine the original caller again to see what else it can simplify.
As well as that, I had to disable handling of bool vectors until SimplifyDemandedVectorElts better supports some of their opcodes (SETCC, shifts etc.).
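Conceptually, the bits-to-elements mapping looks roughly like the sketch
below (not the actual DAGCombiner code; names are illustrative):

  // Sketch only: map a demanded-bits mask on a bitcast to a demanded-elements
  // mask on the bitcasted source vector, and only succeed when the demanded
  // bits cover whole source elements.
  #include "llvm/ADT/APInt.h"
  #include "llvm/ADT/Optional.h"

  static llvm::Optional<llvm::APInt>
  demandedBitsToDemandedElts(const llvm::APInt &DemandedBits,
                             unsigned NumSrcElts) {
    unsigned TotalBits = DemandedBits.getBitWidth();
    if (NumSrcElts == 0 || TotalBits % NumSrcElts != 0)
      return llvm::None;
    unsigned EltBits = TotalBits / NumSrcElts;

    llvm::APInt DemandedElts = llvm::APInt::getNullValue(NumSrcElts);
    for (unsigned i = 0; i != NumSrcElts; ++i) {
      llvm::APInt EltMask = DemandedBits.extractBits(EltBits, i * EltBits);
      if (EltMask.isAllOnesValue())
        DemandedElts.setBit(i); // whole element demanded
      else if (!EltMask.isNullValue())
        return llvm::None;      // partially demanded element: give up
    }
    return DemandedElts;
  }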
Fixes PR39178
Differential Revision: https://reviews.llvm.org/D52935
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@343913 91177308-0d34-0410-b5e6-96231b3b80d8
Port the implementation from SelectionDAGBuilder.cpp into the IRTranslator
and update the arm64-irtranslator test.
These were causing fallbacks in CTMark/Bullet (-Rpass-missed=gisel-select),
and this patch fixes that.
https://reviews.llvm.org/D52945
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@343885 91177308-0d34-0410-b5e6-96231b3b80d8
The comments in this code say we were trying to avoid 16-bit immediates, but if the immediate fits in 8 bits this isn't an issue. This avoids creating a zero extend that probably won't go away.
The movmskb-related changes are interesting. The movmskb instruction writes a 32-bit result, but fills the upper bits with 0. So the zero_extend we were previously emitting was free, but we turned a -1 immediate that would fit in 8 bits into a 32-bit immediate, so it was still bad.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@343871 91177308-0d34-0410-b5e6-96231b3b80d8
And use that to transform fsub with zero constant operands.
The integer part isn't used yet, but it is proposed for use in
D44548, so adding both enhancements here makes that
patch simpler.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@343865 91177308-0d34-0410-b5e6-96231b3b80d8
Fix for D52912, which was simplifying MOVLPS/MOVHPS(+PD) instructions because the tests were only touching one of the vector halves.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@343858 91177308-0d34-0410-b5e6-96231b3b80d8
Decode subvector shuffles from INSERT_SUBVECTOR(SRC0, SHUFFLE(EXTRACT_SUBVECTOR(SRC1)))
This was found necessary while investigating PR39161
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@343853 91177308-0d34-0410-b5e6-96231b3b80d8
GlobalISel uses MIR with implicit fallthrough on each basic block. As a result,
getFirstNonPhi() can return end().
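Callers therefore need to guard against the end iterator, roughly like this
sketch (illustrative helper name only):

  // Sketch only: with implicit fallthrough, a block may contain nothing after
  // its PHIs, so the result of getFirstNonPHI() must be checked against end()
  // before it is dereferenced or used as an insertion point.
  #include "llvm/CodeGen/MachineBasicBlock.h"

  static bool hasUsableInsertPoint(llvm::MachineBasicBlock &MBB) {
    llvm::MachineBasicBlock::iterator InsertPt = MBB.getFirstNonPHI();
    return InsertPt != MBB.end();
  }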
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@343829 91177308-0d34-0410-b5e6-96231b3b80d8
The simplest instance of this is an intrinsic with no results which will have the
intrinsic ID as operand 0.
Also fix some benign incorrectness when op0 is a reg but isn't a def, which was
previously guarded against by checking for the extension opcodes.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@343821 91177308-0d34-0410-b5e6-96231b3b80d8
isAmdCodeObjectV2 is a misleading name: the function actually checks whether the OS
is amdhsa or mesa.
Also add a test to make sure we do not generate old kernel header for code
object v3.
Differential Revision: https://reviews.llvm.org/D52897
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@343813 91177308-0d34-0410-b5e6-96231b3b80d8
This brings the extending loads patch back to the original intent, but minus the
PHI bug and with another small improvement to de-dupe truncates that are
inserted into the same block.
The truncates are sunk to their uses unless this would require inserting before a
phi, in which case the truncate sinks to the _beginning_ of the predecessor block
for that path (but no earlier than the def).
The reason for choosing the beginning of the predecessor is that it makes de-duping
multiple truncates in the same block simple, and optimized code is going to run a
scheduler at some point, which will likely change the position anyway.
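A sketch of that placement rule (hypothetical helper, not the actual combiner
code; PredForPHIPath stands for the incoming block on the PHI path):

  // Sketch only: choose where to insert a truncate of DefMI's result that
  // feeds UseMI.
  #include <iterator>
  #include "llvm/CodeGen/MachineBasicBlock.h"
  #include "llvm/CodeGen/MachineInstr.h"

  static llvm::MachineBasicBlock::iterator
  chooseTruncInsertPt(llvm::MachineInstr &DefMI, llvm::MachineInstr &UseMI,
                      llvm::MachineBasicBlock &PredForPHIPath) {
    // Normal case: sink the truncate right before its use.
    if (!UseMI.isPHI())
      return llvm::MachineBasicBlock::iterator(&UseMI);

    // PHI case: we can't insert in front of the PHI, so place the truncate at
    // the beginning of the predecessor block for that path, which also makes
    // de-duping truncates in that block simple...
    llvm::MachineBasicBlock::iterator IP = PredForPHIPath.getFirstNonPHI();

    // ...but never above the def itself if it lives in that same block.
    if (DefMI.getParent() == &PredForPHIPath)
      IP = std::next(llvm::MachineBasicBlock::iterator(&DefMI));
    return IP;
  }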
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@343804 91177308-0d34-0410-b5e6-96231b3b80d8
f32 values passed on the stack would previously cause an assertion in
unpackFromMemLoc. This would only trigger in the presence of the F extension,
making f32 a legal type. Otherwise the f32 would be legalized.
This patch fixes that by keeping LocVT=f32 when a float is passed on the
stack. It also adds test coverage for this case, as well as tests that
demonstrate lw/sw/flw/fsw will be selected when most profitable, i.e. there is
no unnecessary i32<->f32 conversion in registers.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@343756 91177308-0d34-0410-b5e6-96231b3b80d8