This enables proper lowering of non-byte-sized loads. We still aren't
faithfully preserving memory types everywhere, so the legality checks
still only consider the size.
This is to allow 64-bit constant rematerialization. If a constant
is split into two separate moves initializing sub0 and sub1, as it is
done now, RA cannot rematerialize a 64-bit register.
This gives a 10-20% uplift in a set of huge apps that heavily use
double precision math.
Fixes: SWDEV-292645
Differential Revision: https://reviews.llvm.org/D104874
Details: https://reviews.llvm.org/D96805 changed GCNTTIImpl::getCFInstrCost to return 1 for PHI nodes
for TTI::TCK_CodeSize and TTI::TCK_SizeAndLatency. This is incorrect because the value moves that
result from the PHI lowering are inserted into the basic block's predecessors, not into the block itself.
As a result of this change, LoopRotate and LoopUnroll were broken because of incorrect loop header and
loop body size/cost estimation.
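A minimal sketch of the corrected behaviour, assuming the TTI hook
signature of the time (illustrative, not the verbatim patch):

  InstructionCost GCNTTIImpl::getCFInstrCost(unsigned Opcode,
                                             TTI::TargetCostKind CostKind,
                                             const Instruction *I) {
    // The copies produced by PHI elimination land in the predecessor
    // blocks, so a PHI contributes no size/latency to its own block.
    if (Opcode == Instruction::PHI &&
        (CostKind == TTI::TCK_CodeSize ||
         CostKind == TTI::TCK_SizeAndLatency))
      return 0;
    return BaseT::getCFInstrCost(Opcode, CostKind, I);
  }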
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D105104
This is a follow-up patch to D103636, where it seemed
that checking amdgpu-calls and
amdgpu-stack-objects is unnecessary. Removing these
checks didn't functionally regress any tests.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D104513
Update the AMDGPU gfx90a memory model to make coarse-grained memory allocations
consistent when fine-grained system-scope atomic acquire and release operations
are performed.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D105137
Adds legalizer, register bank select, and instruction
select support for G_SBFX and G_UBFX. These opcodes generate
scalar or vector ALU bitfield extract instructions for
AMDGPU. The instructions allow either constant or register
values for the offset and width operands.
The 32-bit scalar version is expanded to a sequence that
combines the offset and width into a single register.
There are no 64-bit VGPR bitfield extract instructions, so the
operation is expanded to a sequence of instructions that
implements it. If the width is a constant,
then the 32-bit bitfield extract instructions are used.
Moved the AArch64 specific code for creating G_SBFX to
CombinerHelper.cpp so that it can be used by other targets.
Only bitfield extracts with constant offset and width values
are handled currently.
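For reference, a small self-contained C++ model of the bitfield-extract
semantics these opcodes provide (the ubfx/sbfx helper names are
illustrative and assume an in-range offset/width):

  #include <cassert>
  #include <cstdint>

  // G_UBFX dst, src, offset, width -> zero-extended field.
  static uint64_t ubfx(uint64_t Src, unsigned Offset, unsigned Width) {
    uint64_t Mask = (Width == 64) ? ~0ULL : ((1ULL << Width) - 1);
    return (Src >> Offset) & Mask;
  }

  // G_SBFX dst, src, offset, width -> sign-extended field.
  static int64_t sbfx(uint64_t Src, unsigned Offset, unsigned Width) {
    // Shift the field up to the top bits, then arithmetic-shift back down.
    return (int64_t)(Src << (64 - Offset - Width)) >> (64 - Width);
  }

  int main() {
    assert(ubfx(0xF0F0, 4, 8) == 0x0F);
    assert(sbfx(0x0F00, 8, 4) == -1); // field 0b1111 sign-extends to -1
  }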
Differential Revision: https://reviews.llvm.org/D100149
Most tests passed with an extra argument to explicitly enable the pass.
One did not, so it was deleted as part of this change; it can be retrieved
from the project history. I can't see why the codegen would differ between
the pass being on by default and being off by default but explicitly enabled.
This would be a revert, but git revert was not clean. Disabling the pass
and leaving it in tree is less likely to cause breakage elsewhere than
patching up the git revert conflicts on unfamiliar code. It'll be landed
without review, as @hsmhsm is believed unavailable at present.
Differential Revision: https://reviews.llvm.org/D104962
This is a mechanical change. It also renames the similarly named
methods in the SmallString class; however, these methods don't seem
to be used outside of the llvm subproject, so this doesn't break the
build of the rest of the monorepo.
This also adds new interfaces for the fixed and scalable cases:
* LLT::fixed_vector
* LLT::scalable_vector
The strategy for migrating to the new interfaces was as follows (see the
sketch after this list):
* If the new LLT is a (modified) clone of another LLT, taking the
  same number of elements, then use LLT::vector(OtherTy.getElementCount()),
  or if the number of elements is halved/doubled, use .divideCoefficientBy(2)
  or operator*. That is because there is no reason to specifically restrict
  the types to 'fixed_vector'.
* If the algorithm works on the number of elements (as unsigned), then
  just use fixed_vector. This will need to be fixed up in the future when
  modifying the algorithm to also work for scalable vectors, and will then
  need additional tests to confirm the behaviour works the same for
  scalable vectors.
* If the test used the '/*Scalable=*/true' flag of LLT::vector, then
  this is replaced by LLT::scalable_vector.
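A minimal sketch of the new helpers in use (the values and variable
names are purely illustrative):

  #include "llvm/Support/LowLevelTypeImpl.h" // header location at the time
  using namespace llvm;

  LLT S32 = LLT::scalar(32);
  LLT V4S32 = LLT::fixed_vector(4, S32);       // <4 x s32>
  LLT NxV4S32 = LLT::scalable_vector(4, S32);  // <vscale x 4 x s32>
  // Cloning another type's element count without constraining it to fixed:
  LLT HalfTy = LLT::vector(V4S32.getElementCount().divideCoefficientBy(2),
                           V4S32.getElementType());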
Reviewed By: aemerson
Differential Revision: https://reviews.llvm.org/D104451
Add SReg_224, VReg_224, AReg_224, etc.
Link 224-bit types with v7i32/v7f32.
Link existing 192-bit types to newly added v3i64/v3f64/v6i32/v6f32.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D104622
Don't use SCC iterators when we're only interested in reachability.
Use df_begin/df_end inline to find reachable nodes.
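A minimal sketch of the idiom, assuming machine basic blocks (the
function and parameter names are illustrative):

  #include "llvm/ADT/DepthFirstIterator.h"
  #include "llvm/CodeGen/MachineBasicBlock.h"

  static bool isReachableFrom(const llvm::MachineBasicBlock *Entry,
                              const llvm::MachineBasicBlock *Target) {
    // depth_first wraps df_begin/df_end; stop as soon as Target is seen.
    for (const llvm::MachineBasicBlock *MBB : llvm::depth_first(Entry))
      if (MBB == Target)
        return true;
    return false;
  }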
Differential Revision: https://reviews.llvm.org/D104704
These used to consistently be zeroed pre-gfx9, but gfx9 made the
situation complicated since now some still do and some don't. This
also manages to pick up a few cases that the pattern fails to optimize
away.
We handle some cases with instruction patterns, but some get
through. In particular this improves the integer cases.
We can do this optimization in the majority of cases, but we currently
don't have a way to do it. We do not track/model which instructions
have which behavior, the control bit that changes the high-bit behavior,
or the use of preserved bits at all. This is a bit fuzzy since we
don't know precisely how the source instruction will be lowered, but
that only really matters in one case (for fma_mixlo).
We do need to fixup some of these cases after selection, but the
pattern helps eliminate many of these zexts.
Since this method can apply to cmpxchg operations, make sure it's clear
what value we're actually retrieving. This will help ensure we don't
accidentally ignore the failure ordering of cmpxchg in the future.
We could potentially introduce a getOrdering() method on AtomicSDNode
that asserts the operation isn't cmpxchg, but I'm not sure that's
worthwhile.
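For intuition, the analogous situation in standard C++ atomics (purely
illustrative, not LLVM code): a compare-exchange carries both a success
and a failure ordering, so "the" ordering is ambiguous.

  #include <atomic>

  int main() {
    std::atomic<int> V{0};
    int Expected = 0;
    V.compare_exchange_strong(Expected, 1,
                              std::memory_order_acq_rel,  // success ordering
                              std::memory_order_acquire); // failure ordering
  }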
Differential Revision: https://reviews.llvm.org/D103338
This pass aims to optimize VGPR live-range in a typical divergent if-else
control flow. For example:
def(a)
if(cond)
  use(a)
  ... // A
else
  use(a)
As AMDGPU accesses VGPRs with respect to the active mask, we can mark `a` as
dead in region A. For details, please refer to the comments in the
implementation file.
The pass is enabled by default; the frontend can disable it through
"-amdgpu-opt-vgpr-liverange=false".
Differential Revision: https://reviews.llvm.org/D102212
The main motivation behind pointer replacement of LDS use within non-kernel
functions is to *avoid* having the subsequent LDS lowering pass directly pack
LDS (assume large LDS) into a struct type, which would otherwise cause a huge
amount of memory to be allocated for the struct instance within every kernel.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D103225
- Follow the same principle as the conversion from f64 to i64, with the
  necessary extra pre- and post-processing. It helps to reduce the
  conversion sequence by half compared to the legacy one.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D104427
We were not reporting isFNegFree for v2f32, although it is effectively
free after legalization. The generic combine was pulling fneg out of
the fma source operands, and the AMDGPU combine was doing the
opposite.
Implemented the transformation of xor (llvm.amdgcn.class x, mask), -1 into
llvm.amdgcn.class(x, ~mask). Added LIT tests as well.
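For intuition, a self-contained C++ toy model of why the fold is valid
(the helper names and simplified classification are assumptions, not the
intrinsic's implementation): the mask bits select disjoint classes that
cover every value, so negating the result is the same as querying the
complementary mask.

  #include <cassert>
  #include <cmath>
  #include <cstdint>

  // Bit i of Mask selects one of the disjoint FP classes; the result is
  // true iff X falls in any selected class. (Toy model: sNaN and qNaN are
  // not separated here.)
  enum : uint32_t {
    Nan = 1 << 0, NegInf = 1 << 1, NegNorm = 1 << 2, NegSub = 1 << 3,
    NegZero = 1 << 4, PosZero = 1 << 5, PosSub = 1 << 6, PosNorm = 1 << 7,
    PosInf = 1 << 8, AllClasses = (1 << 9) - 1
  };

  static uint32_t classify(double X) {
    switch (std::fpclassify(X)) {
    case FP_NAN:       return Nan;
    case FP_INFINITE:  return std::signbit(X) ? NegInf : PosInf;
    case FP_ZERO:      return std::signbit(X) ? NegZero : PosZero;
    case FP_SUBNORMAL: return std::signbit(X) ? NegSub : PosSub;
    default:           return std::signbit(X) ? NegNorm : PosNorm;
    }
  }

  static bool classModel(double X, uint32_t Mask) {
    return (classify(X) & Mask) != 0;
  }

  int main() {
    const double Vals[] = {0.0, -0.0, 1.5, -2.25, 1e-310, -1e-310,
                           INFINITY, -INFINITY, NAN};
    for (double X : Vals)
      for (uint32_t Mask = 0; Mask <= AllClasses; ++Mask)
        assert(!classModel(X, Mask) == classModel(X, ~Mask & AllClasses));
  }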
Differential Revision: https://reviews.llvm.org/D104049
This does not affect codegen, which only tests these flags on Pseudo
instructions, but might help llvm-mca which has to work with Real
instructions. In particular setting LGKM_CNT on DS instructions helps
with the problem identified in D104149.
Differential Revision: https://reviews.llvm.org/D104293
Factor out repeated !cast<SOP*_Pseudo>(NAME) into a new "defvar ps",
just to improve readability and maintainability.
Differential Revision: https://reviews.llvm.org/D104306
The code in fixLdsBranchVmemWARHazard looks for patterns of a vmem/lds
access followed by a branch, followed by an lds/vmem access.
Handling the hazard requires processing an arbitrary number of
instructions. In the worst case, where a function has a vmem access but no
lds accesses, all instructions are examined only to conclude that the
hazard cannot occur.
Add a pre-processing stage which detects whether both lds and vmem accesses
are present in the function, and only then do the more costly search.
This patch significantly improves compilation time in the cases where the
hazard cannot happen. In one pathological case I looked at, IsHazardInst is
needlessly called 88.6 million times.
The numbers could also be improved by introducing a map around the
inner calls to ::getWaitStatesSince in fixLdsBranchVmemWARHazard, but
nothing will beat not running fixLdsBranchVmemWARHazard at all in the cases
detected by shouldRunLdsBranchVmemWARHazardFixup().
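A minimal sketch of the pre-scan, assuming the usual SIInstrInfo helpers
(illustrative, not the verbatim patch):

  static bool shouldRunLdsBranchVmemWARHazardFixup(const MachineFunction &MF) {
    bool HasLds = false;
    bool HasVmem = false;
    for (const MachineBasicBlock &MBB : MF)
      for (const MachineInstr &MI : MBB) {
        HasLds |= SIInstrInfo::isDS(MI);
        HasVmem |= SIInstrInfo::isVMEM(MI) || SIInstrInfo::isFLAT(MI);
        // The hazard needs both kinds of access; stop scanning once found.
        if (HasLds && HasVmem)
          return true;
      }
    return false;
  }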
Differential Revision: https://reviews.llvm.org/D104219
This patch computes the maximum number of SGPRs and VGPRs used by the
module in the presence of indirect calls and makes that
the register requirement for functions/kernels
that make indirect calls.
This patch also refactors code in AMDGPUSubtarget.cpp,
adding "base" variants of getMaxNumSGPRs which
are used by the MachineFunction version and a new Function version.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D103636