Summary:
This results in higher register usage, but should make it easier for
the compiler to hide latency.
This pass is a prerequisite for some more scheduler improvements, and I
think the increased register usage with this patch is acceptable, because
when combined with the scheduler improvements, the total register usage
will decrease.
shader-db stats:
2382 shaders in 478 tests
Totals:
SGPRS: 48672 -> 49088 (0.85 %)
VGPRS: 34148 -> 34847 (2.05 %)
Code Size: 1285816 -> 1289128 (0.26 %) bytes
LDS: 28 -> 28 (0.00 %) blocks
Scratch: 492544 -> 573440 (16.42 %) bytes per wave
Max Waves: 6856 -> 6846 (-0.15 %)
Wait states: 0 -> 0 (0.00 %)
Depends on D18451
Reviewers: nhaehnle, arsenm
Subscribers: arsenm, llvm-commits
Differential Revision: http://reviews.llvm.org/D18452
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@264876 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
This helps prevent load clustering from drastically increasing register
pressure by trying to cluster 4 SMRDx8 loads together. The limit of 16
bytes was chosen, because it seems like that was the original intent
of setting the limit to 4 instructions, but more analysis could show
that a different limit is better.
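As a rough illustration (a minimal sketch, not the patch itself; names are
made up), the heuristic amounts to capping the combined width of a cluster:

    // Sketch: allow clustering while the total clustered width stays within
    // 16 bytes, so four 4-byte SMRD loads still cluster but four 32-byte
    // SMRDx8 loads do not.
    static bool shouldClusterLoads(unsigned NumLoads, unsigned BytesPerLoad) {
      const unsigned MaxClusterBytes = 16;
      return NumLoads * BytesPerLoad <= MaxClusterBytes;
    }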
This yields small decreases in register usage with shader-db, but
also helps avoid a large increase in register usage when lane mask
tracking is enabled in the machine scheduler, because lane mask tracking
enables more opportunities for load clustering.
shader-db stats:
2379 shaders in 477 tests
Totals:
SGPRS: 49744 -> 48600 (-2.30 %)
VGPRS: 34120 -> 34076 (-0.13 %)
Code Size: 1282888 -> 1283184 (0.02 %) bytes
LDS: 28 -> 28 (0.00 %) blocks
Scratch: 495616 -> 492544 (-0.62 %) bytes per wave
Max Waves: 6843 -> 6853 (0.15 %)
Wait states: 0 -> 0 (0.00 %)
Reviewers: nhaehnle, arsenm
Subscribers: arsenm, llvm-commits
Differential Revision: http://reviews.llvm.org/D18451
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@264589 91177308-0d34-0410-b5e6-96231b3b80d8
There is no benefit to these since materializing the constant 1
requires the same number of instructions as materializing uint_max.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@264215 91177308-0d34-0410-b5e6-96231b3b80d8
Strengthen tests of storing frame indices.
Right now this just creates irrelevant scheduling changes.
We don't want to have multiple frame index operands
on an instruction. There seem to be various assumptions
that at least the same frame index will not appear twice
in the LocalStackSlotAllocation pass.
There's no reason to have this happen, and it just
makes it easy to introduce bugs where the immediate
offset is applied to the storing instruction when it should
really be applied to the value being stored as a separate
add.
This might not be sufficient. It might still be problematic
to have an add fi, fi situation, but that's even less likely
to happen in real code.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@264200 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
Whole quad mode is already enabled for pixel shaders that compute
derivatives, but it must be suspended for instructions that cause a
shader to have side effects (i.e. stores and atomics).
This pass addresses the issue by storing the real (initial) live mask
in a register, masking EXEC before instructions that require exact
execution and (re-)enabling WQM where required.
This pass is run before register coalescing so that we can use
machine SSA for analysis.
The changes in this patch expose a problem with the second machine
scheduling pass: target independent instructions like COPY implicitly
use EXEC when they operate on VGPRs, but this fact is not encoded in
the MIR. This can lead to miscompilation because instructions are
moved past changes to EXEC.
This patch fixes the problem by adding use-implicit operands to
target independent instructions. Some general codegen passes are
relaxed to work with such implicit use operands.
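As a hedged sketch (helper name hypothetical; LLVM C++ APIs of this era,
assuming the AMDGPU backend headers), the fix amounts to attaching an
implicit use of EXEC to COPYs that write VGPRs:

    // Sketch: mark VGPR-writing COPYs as implicitly reading EXEC so that
    // later passes cannot reorder them across EXEC updates.
    static void addImplicitExecUse(llvm::MachineInstr &MI,
                                   const SIRegisterInfo &TRI,
                                   const llvm::MachineRegisterInfo &MRI) {
      if (!MI.isCopy())
        return;
      unsigned Reg = MI.getOperand(0).getReg();
      if (TRI.hasVGPRs(MRI.getRegClass(Reg)))
        MI.addOperand(llvm::MachineOperand::CreateReg(
            AMDGPU::EXEC, /*isDef=*/false, /*isImp=*/true));
    }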
Reviewers: arsenm, tstellarAMD, mareko
Subscribers: MatzeB, arsenm, llvm-commits
Differential Revision: http://reviews.llvm.org/D18162
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@263982 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
When control flow is implemented using the exec mask, the compiler will
insert branch instructions to skip over the masked section when exec is
zero if the section contains more than a certain number of instructions.
The previous code would only count instructions in successor blocks,
and this patch modifies the code to start counting instructions in all
blocks between the start and end of the branch.
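A hedged sketch of the new counting (function and threshold names assumed;
the real code lives in the control-flow lowering pass):

    // Sketch: count instructions in every block between the branch and its
    // target, not just in successor blocks, and skip once over threshold.
    static bool shouldSkip(const llvm::MachineBasicBlock &From,
                           const llvm::MachineBasicBlock &To,
                           unsigned SkipThreshold) {
      unsigned NumInstr = 0;
      for (auto MBBI = From.getIterator(), E = To.getIterator(); MBBI != E;
           ++MBBI)
        for (const llvm::MachineInstr &MI : *MBBI)
          if (!MI.isDebugValue() && ++NumInstr >= SkipThreshold)
            return true;
      return false;
    }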
Reviewers: nhaehnle, arsenm
Subscribers: arsenm, llvm-commits
Differential Revision: http://reviews.llvm.org/D18282
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@263969 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
Allow the selection of BUFFER_LOAD_FORMAT_x and _XY. Do this now before
the frontend patches land in Mesa. Eventually, we may want to automatically
reduce the size of loads at the LLVM IR level, which requires such overloads,
and in some cases Mesa can generate them directly.
Reviewers: tstellarAMD, arsenm
Subscribers: arsenm, llvm-commits
Differential Revision: http://reviews.llvm.org/D18255
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@263792 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
These intrinsics expose the BUFFER_ATOMIC_* instructions and will be used
by Mesa to implement atomics with buffer semantics. The intrinsic interface
matches that of buffer.load.format and buffer.store.format, except that the
GLC bit is not exposed (it is automatically deduced based on whether the
return value is used).
The change of hasSideEffects is required for TableGen to accept the pattern
that matches the intrinsic.
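For illustration, a frontend could emit one of these with IRBuilder; a hedged
sketch (the operand list follows this description and is an assumption):

    // Sketch: emit a call to llvm.amdgcn.buffer.atomic.add. If the returned
    // value is unused, the backend deduces GLC=0; if it is used, GLC=1 so
    // the pre-op value comes back.
    static llvm::Value *emitBufferAtomicAdd(llvm::IRBuilder<> &B,
                                            llvm::Module &M, llvm::Value *Val,
                                            llvm::Value *Rsrc,
                                            llvm::Value *VIndex,
                                            llvm::Value *Offset,
                                            llvm::Value *SLC) {
      llvm::Function *Fn = llvm::Intrinsic::getDeclaration(
          &M, llvm::Intrinsic::amdgcn_buffer_atomic_add);
      return B.CreateCall(Fn, {Val, Rsrc, VIndex, Offset, SLC});
    }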
Reviewers: tstellarAMD, arsenm
Subscribers: arsenm, rivanvx, llvm-commits
Differential Revision: http://reviews.llvm.org/D18151
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@263791 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
We cannot easily deduce that an offset is in an SGPR, but the Mesa frontend
cannot easily make use of an explicit soffset parameter either. Furthermore,
it is likely that in the future, LLVM will be in a better position than the
frontend to choose an SGPR offset if possible.
Since there aren't any frontend uses of these intrinsics in upstream
repositories yet, I would like to take this opportunity to change the
intrinsic signatures to a single offset parameter, which is then selected
to immediate offsets or voffsets using a ComplexPattern.
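A hedged sketch of the C++ half of such a ComplexPattern (function name and
details assumed): a constant that fits the 12-bit MUBUF offset field is
selected as an immediate, anything else falls through to a voffset.

    // Sketch: select a single i32 offset operand as an immediate offset
    // when it fits the encoding.
    bool AMDGPUDAGToDAGISel::SelectBufferOffset(llvm::SDValue Offset,
                                                llvm::SDValue &ImmOffset) {
      if (auto *C = llvm::dyn_cast<llvm::ConstantSDNode>(Offset)) {
        if (llvm::isUInt<12>(C->getZExtValue())) {
          ImmOffset = CurDAG->getTargetConstant(
              C->getZExtValue(), llvm::SDLoc(Offset), llvm::MVT::i16);
          return true;
        }
      }
      return false; // selected as a voffset instead
    }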
Reviewers: arsenm, tstellarAMD, mareko
Subscribers: arsenm, llvm-commits
Differential Revision: http://reviews.llvm.org/D18218
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@263790 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
Uniform loops where the branch leaving the loop is predicated on VCCNZ
must be skipped if EXEC = 0, otherwise they will be infinite.
Reviewers: tstellarAMD, arsenm
Subscribers: arsenm, llvm-commits
Differential Revision: http://reviews.llvm.org/D18137
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@263658 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
Static LDS size is saved in MachineFunctionInfo::LDSSize.
We define a pseudo instruction with the usesCustomInserter bit set. Then, in
EmitInstrWithCustomInserter, we replace this pseudo instruction with a mov of
MachineFunctionInfo::LDSSize.
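A hedged sketch of the inserter (the pseudo opcode name is hypothetical;
LDSSize is the field named above):

    // Sketch: inside EmitInstrWithCustomInserter, replace the pseudo with
    // an S_MOV_B32 of the statically known LDS size.
    case AMDGPU::SI_LDS_SIZE: { // hypothetical pseudo opcode
      const SIMachineFunctionInfo *MFI =
          BB->getParent()->getInfo<SIMachineFunctionInfo>();
      BuildMI(*BB, MI, MI->getDebugLoc(), TII->get(AMDGPU::S_MOV_B32),
              MI->getOperand(0).getReg())
          .addImm(MFI->LDSSize);
      MI->eraseFromParent();
      return BB;
    }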
Reviewers: arsenm, tstellarAMD
Subscribers: llvm-commits, arsenm
Differential Revision: http://reviews.llvm.org/D18064
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@263563 91177308-0d34-0410-b5e6-96231b3b80d8
If a constant is the same as the reverse of an inline immediate,
this is 4 bytes smaller than having to embed a 32-bit literal.
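A hedged sketch of the check (the integer inline-immediate range -16..64 is
from the ISA; the helper name is made up):

    // Sketch: a 32-bit literal can instead be materialized as v_bfrev_b32
    // of an inline immediate when its bit-reversal lands in the inline
    // range.
    static bool isReverseInlineImm(uint32_t Imm) {
      int32_t Rev = static_cast<int32_t>(llvm::reverseBits(Imm));
      return Rev >= -16 && Rev <= 64;
    }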
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@263201 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
They correspond to BUFFER_LOAD/STORE_FORMAT_XYZW and will be used by Mesa
to implement the GL_ARB_shader_image_load_store extension.
The intention is that for llvm.amdgcn.buffer.load.format, LLVM will decide
whether one of the _X/_XY/_XYZ opcodes can be used (similar to image sampling
and loads). However, this is not currently implemented.
For llvm.amdgcn.buffer.store, LLVM cannot decide to use one of the "smaller"
opcodes, and therefore the intrinsic is overloaded. Currently, only the v4f32
variant is actually implemented, since GLSL also only has a vec4 variant of the store
instructions, although it's conceivable that Mesa will want to be smarter
about this in the future.
BUFFER_LOAD_FORMAT_XYZW is already exposed via llvm.SI.vs.load.input, which
has a legacy name, pretends not to access memory, and does not capture the
full flexibility of the instruction.
Reviewers: arsenm, tstellarAMD, mareko
Subscribers: arsenm, llvm-commits
Differential Revision: http://reviews.llvm.org/D17277
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@263140 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
The code in SelectionDAG did not handle the case where the
register type and output types were different, but had the same size.
Reviewers: arsenm, echristo
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D17940
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@263022 91177308-0d34-0410-b5e6-96231b3b80d8
Support DPP syntax as used in SP3 (except the syntax of several operands).
Added DPP-specific operands in td files.
Added a DPP flag to TSFlags, used by the InstPrinter to determine whether an instruction is DPP.
Support for VOP2 DPP instructions in td files.
Some tests for DPP instructions.
ToDo:
- VOP2bInst:
  - vcc is considered as an operand
  - AsmMatcher doesn't apply mnemonic aliases when parsing operands
- v_mac_f32
- v_nop
- disable instructions with 64-bit operands
- change dpp_ctrl assembler representation to conform to sp3
Review: http://reviews.llvm.org/D17804
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@263008 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
This is necessary for when we run out of VGPRs and can no
longer use v_{read,write}_lane for spilling SGPRs.
Reviewers: arsenm
Subscribers: arsenm, llvm-commits
Differential Revision: http://reviews.llvm.org/D17592
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@262732 91177308-0d34-0410-b5e6-96231b3b80d8
On AMDGPU, where i64 operations are often bitcast to v2i32
and back, this pattern shows up regularly and breaks some
expected combines on i64, such as load width reduction.
This fixes some test failures in a future commit when i64 loads
are changed to promote.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@262397 91177308-0d34-0410-b5e6-96231b3b80d8
This reduces the number of bitcast nodes and generally cleans up the
DAG when bitcasting between integers and vectors everywhere.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@262358 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
This patch implements DS_PERMUTE/DS_BPERMUTE instruction definitions and intrinsics,
which are new since VI.
Reviewers: tstellarAMD, arsenm
Subscribers: llvm-commits, arsenm
Differential Revision: http://reviews.llvm.org/D17614
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@262356 91177308-0d34-0410-b5e6-96231b3b80d8
This currently does not have control over the bitwidth,
and there are missing optimizations to reduce the integer to
32-bit if it can be.
But in most situations we do want the sinking to occur.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@262296 91177308-0d34-0410-b5e6-96231b3b80d8
The maximum private allocation for the whole GPU is 4G,
so the maximum possible index for a single workitem is the
maximum size divided by the smallest granularity for a dispatch.
This increases the number of known zero high bits, which
enables more offset folding. The maximum private size per
workitem with this is 128M but may be smaller still.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@262153 91177308-0d34-0410-b5e6-96231b3b80d8
In the case where op = add, y = base_ptr, and x = offset, this
transform:
(op y, (op x, c1)) -> (op (op x, y), c1)
breaks the canonical form of add by putting the base pointer in the
second operand and the offset in the first.
This fix is important for the R600 target, because for some address
spaces the base pointer and the offset are stored in separate register
classes. The old pattern caused the ISel code for matching addressing
modes to put the base pointer and offset in the wrong register classes,
which required non-trivial code transformations to fix.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@262148 91177308-0d34-0410-b5e6-96231b3b80d8
This matches the behavior of the HSAIL clock instruction.
s_memrealtime is used if the subtarget supports it, and falls
back to s_memtime if not.
Also introduces new intrinsics for each of s_memtime / s_memrealtime.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@262119 91177308-0d34-0410-b5e6-96231b3b80d8
Add parsing and printing of image operands. Matches legacy sp3 assembler.
Change image instruction order to have data/image/sampler operands in the beginning. This is needed because optional operands in MC are always last.
Update SITargetLowering for new order.
Add basic MC test.
Update CodeGen tests.
Review: http://reviews.llvm.org/D17574
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@261995 91177308-0d34-0410-b5e6-96231b3b80d8
I don't think this test was intending to test unaligned load/store.
Change it to use the natural alignment to avoid regressing.
Also adds missing SI checks.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@261571 91177308-0d34-0410-b5e6-96231b3b80d8
This avoids some test regressions in a future commit
when unaligned operations are expanded when they
have custom lowering.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@261570 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
Instead of trying to replace SMRD instructions with a VGPR base pointer
with an equivalent MUBUF instruction, we now copy the base pointer to
SGPRs using v_readfirstlane.
This is safe to do, because any load selected as an SMRD instruction
has been proven to have a uniform base pointer, so each thread in the
wave will have the same pointer value in VGPRs.
This will fix some errors on VI from trying to replace SMRD instructions
with addr64-enabled MUBUF instructions that don't exist.
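A hedged sketch of the copy (surrounding variables assumed; sub-register
indices as in the AMDGPU backend):

    // Sketch: move a 64-bit base pointer from VGPRs back into SGPRs, one
    // 32-bit lane read per sub-register; legal because the address is
    // provably uniform across the wave.
    for (unsigned Idx : {AMDGPU::sub0, AMDGPU::sub1})
      BuildMI(MBB, MI, DL, TII->get(AMDGPU::V_READFIRSTLANE_B32),
              TRI->getSubReg(DstSGPR, Idx))
          .addReg(TRI->getSubReg(SrcVGPR, Idx));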
Reviewers: arsenm, cfang, nhaehnle
Subscribers: arsenm, llvm-commits
Differential Revision: http://reviews.llvm.org/D17305
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@261385 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
This was broken in r260694 which swapped the address and data operands
for flat store instructions. The code in SIInsertWaits assumes
that the data operand always comes before the address operand, so
we need to add a special case for flat.
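A hedged sketch of the special case (getNamedOperand is existing AMDGPU API;
the surrounding logic and operand name are illustrative only):

    // Sketch: for FLAT, find the data operand by name instead of assuming
    // it precedes the address.
    const llvm::MachineOperand *Data =
        TII->getNamedOperand(MI, AMDGPU::OpName::data);
    if (!Data) // other memory instructions keep the positional assumption
      Data = &MI.getOperand(0);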
Reviewers: arsenm
Subscribers: arsenm, llvm-commits
Differential Revision: http://reviews.llvm.org/D17366
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@261330 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
These correspond to IMAGE_LOAD/STORE[_MIP] and are going to be used by Mesa
for the GL_ARB_shader_image_load_store extension.
IMAGE_LOAD is already matched by llvm.SI.image.load. That intrinsic has
a legacy name and pretends not to read memory.
Differential Revision: http://reviews.llvm.org/D17276
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@261224 91177308-0d34-0410-b5e6-96231b3b80d8
Tests for the new scalarize all private access options will be
included with a future commit.
The only functional change is to make the split/scalarize behavior
for private access of > 4 element vectors consistent
with the flat/global handling. This makes the spilling worse
in the two changed tests.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@260804 91177308-0d34-0410-b5e6-96231b3b80d8
This intrinsic will be used to expose dpp functionality to higher-level
languages. It will map to the dpp version of v_mov_b32.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@260792 91177308-0d34-0410-b5e6-96231b3b80d8
These provide direct access to the hardware instructions without
the unit conversion that llvm.sin/llvm.cos lowering requires.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@260782 91177308-0d34-0410-b5e6-96231b3b80d8
Historically, AMD internal sp3 assembler has flat_store* addr, data
format. To match existing code and to enable reuse, change LLVM
definitions to match. Also update MC and CodeGen tests.
Differential Revision: http://reviews.llvm.org/D16927
Patch by: Nikolay Haustov
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@260694 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
It is possible for the loop condition to be a boolean constant (an infinite
loop, for example), so we should handle constant conditions when annotating a
loop. This patch adds that functionality.
Reviewers: tstellarAMD, arsenm
Subscribers: llvm-commits, arsenm
Differential Revision: http://reviews.llvm.org/D15093
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@260692 91177308-0d34-0410-b5e6-96231b3b80d8
This was hardcoded to the static private size, but this
would miss the offset and additional size needed someday
when we have dynamic sizing.
Also stops always initializing flat_scratch even when unused.
In the future we should stop emitting this unless flat instructions
are used to access private memory. For example this will initialize
it almost always on VI because flat is used for global access.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@260658 91177308-0d34-0410-b5e6-96231b3b80d8
Introduce a subtarget feature for this, and leave the default with
the current behavior which assumes up to 16-byte loads/stores can
be used. The field also seems to have the ability to be set to 2 bytes,
but I'm not sure what that would be used for.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@260651 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
It's possible to have resource descriptors and samplers stored in
VGPRs, either by a VMEM instruction or in the case of samplers,
floating-point calculations. When this happens, we need to use
v_readfirstlane to copy these values back to sgprs.
Reviewers: mareko, arsenm
Subscribers: arsenm, llvm-commits
Differential Revision: http://reviews.llvm.org/D17102
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@260599 91177308-0d34-0410-b5e6-96231b3b80d8
If the two operands to an instruction were both
subregisters of the same super register, it would incorrectly
think this counted as the same constant bus use.
This fixes the verifier error in fmin_legacy.ll which
was missing -verify-machineinstrs.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@260495 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
This fixes a crash where subsequent spills would be unable to scavenge
a register. In particular, it fixes a crash in piglit's
spec@glsl-1.50@execution@geometry@max-input-components (the test still
has a shader that fails to compile because of too many SGPR spills, but
at least it doesn't crash any more).
This is a candidate for the release branch.
Reviewers: arsenm, tstellarAMD
Subscribers: qcolombet, arsenm
Differential Revision: http://reviews.llvm.org/D16558
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@260427 91177308-0d34-0410-b5e6-96231b3b80d8
If a range has a lower bound of 0, add an AssertZext from the
nearest floor power of two.
This allows operations with some workitem intrinsics with known
maximum ranges to use fast 24-bit multiplies.
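A hedged sketch in SelectionDAG terms (one plausible width computation,
assuming Hi > 1; names assumed):

    // Sketch: for a known range [0, Hi), assert that the value fits in the
    // bits needed for Hi - 1, which lets 24-bit multiplies be selected.
    unsigned Bits = 32 - llvm::countLeadingZeros(Hi - 1);
    llvm::EVT SmallVT = llvm::EVT::getIntegerVT(*DAG.getContext(), Bits);
    SDValue Asserted = DAG.getNode(llvm::ISD::AssertZext, DL, llvm::MVT::i32,
                                   Val, DAG.getValueType(SmallVT));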
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@260109 91177308-0d34-0410-b5e6-96231b3b80d8
The current situation isn't great, because the amount of padding
required is determined by the inverse order of the first encountered
use. We should eventually somehow sort these to minimize wasted space.
Another problem is the alignment of kernel arguments isn't
respected. The group_segment_alignment is always emitted as
the default 16, and typed arguments with higher alignments
or an explicitly set alignment are also ignored.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@259912 91177308-0d34-0410-b5e6-96231b3b80d8
Recommitted after fixing some test cases.
Updated test cases:
test/CodeGen/AArch64/arm64-misched-memdep-bug.ll
test/CodeGen/AArch64/tailcall_misched_graph.ll
Temporarily disabled test cases:
test/CodeGen/AMDGPU/split-vector-memoperand-offsets.ll
test/CodeGen/PowerPC/ppc64-fastcc.ll (partially updated)
test/CodeGen/PowerPC/vsx-fma-m.ll
test/CodeGen/PowerPC/vsx-fma-sp.ll
http://reviews.llvm.org/D8705
Reviewers: Hal Finkel, Andy Trick.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@259673 91177308-0d34-0410-b5e6-96231b3b80d8
If we can't assume the pointer value isn't within the bounds
of the object, it seems risky to try to replace the pointer
calculations.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@259573 91177308-0d34-0410-b5e6-96231b3b80d8
When promoting allocas to LDS, we know we are indexing
into a specific area just created, and the calculation
will also never overflow.
Also emit some of the muls as nsw nuw, because instcombine
infers this already from the range metadata. I think
putting this on the other adds and muls might be OK too,
but I'm not 100% sure.
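A hedged sketch with IRBuilder (context assumed; flags placed as the text
describes, i.e. on the mul but not necessarily on the add):

    // Sketch: index = y * size_x + x, where the workitem ids are
    // range-limited by metadata, so the mul cannot wrap.
    llvm::Value *Tmp =
        B.CreateMul(Y, SizeX, "", /*HasNUW=*/true, /*HasNSW=*/true);
    llvm::Value *Index = B.CreateAdd(Tmp, X); // wrap flags here less certain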
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@259545 91177308-0d34-0410-b5e6-96231b3b80d8
Re-commit of r258951 after fixing layering violation.
The BPF and WebAssembly backends had identical code for emitting errors
for unsupported features, and AMDGPU had very similar code. This merges
them all into one DiagnosticInfo subclass, that can be used by any
backend.
There should be minimal functional changes here, but some AMDGPU tests
have been updated for the new format of errors (it used a slightly
different format to BPF and WebAssembly). The AMDGPU error messages will
now benefit from having precise source locations when debug info is
available.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@259498 91177308-0d34-0410-b5e6-96231b3b80d8
The AMDGPUPromoteAlloca pass was emitting the read.local.size
calls, which with HSA were incorrectly selected to read from
the offset Mesa uses off of the kernarg pointer.
Error on intrinsics which aren't supported by HSA, and start
emitting the correct IR to read the workgroup size
out of the dispatch pointer.
Also initialize the pass so it can be tested with opt, and
start moving towards not depending on the subtarget as an
argument.
Start emitting errors for the intrinsics not handled with HSA.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@259297 91177308-0d34-0410-b5e6-96231b3b80d8
Only the dispatch.ptr intrinsic is supposed to be used now to get
the workgroup size, and the read.local.size intrinsics do not
work correctly.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@259296 91177308-0d34-0410-b5e6-96231b3b80d8
These use the correct prefix and follow the HSA naming convention
rather than the config register option names.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@259293 91177308-0d34-0410-b5e6-96231b3b80d8
Re-commit of r258951 after fixing layering violation.
The related LLVM patch adds a backend diagnostic type for reporting
unsupported features, this adds a printer for them to clang.
In the case where debug location information is not available, I've
changed the printer to report the location as the first line of the
function, rather than the closing brace, as the latter does not give the
user any information. This also affects optimisation remarks.
Differential Revision: http://reviews.llvm.org/D16590
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@259035 91177308-0d34-0410-b5e6-96231b3b80d8
The BPF and WebAssembly backends had identical code for emitting errors
for unsupported features, and AMDGPU had very similar code. This merges
them all into one DiagnosticInfo subclass, that can be used by any
backend.
There should be minimal functional changes here, but some AMDGPU tests
have been updated for the new format of errors (it used a slightly
different format to BPF and WebAssembly). The AMDGPU error messages will
now benefit from having precise source locations when debug info is
available.
The implementation of DiagnosticInfoUnsupported::print must be in
lib/Codegen rather than in the existing file in lib/IR/ to avoid
introducing a dependency from IR to CodeGen.
Differential Revision: http://reviews.llvm.org/D16590
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@258951 91177308-0d34-0410-b5e6-96231b3b80d8
When no device name is specified, default to kaveri
for HSA, since SI is not supported and it would fail.
Default to "tahiti" instead of "SI" since these are
effectively the same, and tahiti is an actual device.
Move default device handling to the TargetMachine
rather than the AMDGPUSubtarget. The module ISA version
is computed from the device name provided with the target
machine, so the attributes printed by the AsmPrinter were
inconsistent with those computed in the subtarget.
Also remove DevName field from subtarget since it's redundant
with getCPU() in the superclass.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@258901 91177308-0d34-0410-b5e6-96231b3b80d8
Old intrinsics were forcing these, but they have now all
been removed. This fixes large i8 vector operations generally
being broken.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@258788 91177308-0d34-0410-b5e6-96231b3b80d8
I did my best to try to update all the uses in tests that
just happened to use the old ones to the newer intrinsics.
I'm not sure I got all of the immediate operand conversions
correct, since the value seems to have been ignored by the
old pattern but I don't think it really matters.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@258787 91177308-0d34-0410-b5e6-96231b3b80d8
More cleanup to try to get all the intrinsics that correspond closely
to instructions to use the correct amdgcn prefix.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@258786 91177308-0d34-0410-b5e6-96231b3b80d8
Some of the special intrinsics that now correspond to an instruction
also set special registers, e.g. llvm.SI.sendmsg sets
m0 as well as using s_sendmsg. Using these explicit register intrinsics
may be a better option.
Reading the exec mask and others may be useful for debugging. I'm
not sure this is entirely correct, because we would want this to
be convergent, although it's possible this is already treated
sufficiently conservatively.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@258785 91177308-0d34-0410-b5e6-96231b3b80d8
For historic reasons, the behavior of .align differs between targets.
Fortunately, there are alternatives, .p2align and .balign, which make the
interpretation of the parameter explicit, and which behave consistently across
targets.
This patch teaches MC to use .p2align instead of .align, so that people reading
code for multiple architectures don't have to remember which way each platform
does its .align directive.
Differential Revision: http://reviews.llvm.org/D16549
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@258750 91177308-0d34-0410-b5e6-96231b3b80d8
The intrinsic target prefix should match the target name
as it appears in the triple.
This is not yet complete, but gets most of the important ones.
llvm.AMDGPU.* intrinsics used by mesa and libclc are still handled
for compatibility for now.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@258557 91177308-0d34-0410-b5e6-96231b3b80d8
The promote alloca pass didn't handle these intrinsics and crashed.
These intrinsics should accept any address space, but for now just
erase them to avoid breaking.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@258537 91177308-0d34-0410-b5e6-96231b3b80d8
This breaks the tests that were meant for testing
64-bit inline immediates, so move those to shl where
they won't be broken up.
This should be repeated for the other related bit ops.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@258095 91177308-0d34-0410-b5e6-96231b3b80d8
Reduce 64-bit shl with constant > 32. We already special cased
this for the == 32 case, but this also works for any >= 32 constant.
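A hedged sketch of the combine in SelectionDAG (variable names assumed):

    // Sketch: (shl i64 x, c) with c >= 32 has an all-zero low half, so
    // build it as (build_pair 0, (shl i32 (trunc x), c - 32)).
    SDValue Lo32 = DAG.getNode(llvm::ISD::TRUNCATE, SL, llvm::MVT::i32, X);
    SDValue Hi = DAG.getNode(llvm::ISD::SHL, SL, llvm::MVT::i32, Lo32,
                             DAG.getConstant(C - 32, SL, llvm::MVT::i32));
    SDValue Zero = DAG.getConstant(0, SL, llvm::MVT::i32);
    SDValue Res = DAG.getNode(llvm::ISD::BUILD_PAIR, SL, llvm::MVT::i64,
                              Zero, Hi); // operands are (lo, hi)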
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@258092 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
It is off by default, but can be used
with --misched=si
Patch by: Axel Davy
Reviewers: arsenm, tstellarAMD, nhaehnle
Subscribers: nhaehnle, solenskiner, arsenm, llvm-commits
Differential Revision: http://reviews.llvm.org/D11885
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@257609 91177308-0d34-0410-b5e6-96231b3b80d8
The old lowering for uint_to_fp failed opencl conformance.
It might be OK for fast math mode, but I'm not sure.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@257393 91177308-0d34-0410-b5e6-96231b3b80d8
The hardware instruction's output on 0 is -1 rather than 32.
Eliminate a test and select to -1. This removes an extra instruction
from the compatibility function with HSAIL's firstbit instruction.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@257352 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
Multi-dword constant loads generated unnecessary moves from SGPRs into VGPRs,
increasing the code size and VGPR pressure. These moves are now folded away.
Note that this lack of operand folding was not a problem for VMEM loads,
because COPY nodes from VReg_Nnn to VGPR32 are eliminated by the register
coalescer.
Some tests are updated; note that the fsub.ll test explicitly checks that
the move is elided.
With the IR generated by current Mesa, the changes are obviously relatively
minor:
7063 shaders in 3531 tests
Totals:
SGPRS: 351872 -> 352560 (0.20 %)
VGPRS: 199984 -> 200732 (0.37 %)
Code Size: 9876968 -> 9881112 (0.04 %) bytes
LDS: 91 -> 91 (0.00 %) blocks
Scratch: 1779712 -> 1767424 (-0.69 %) bytes per wave
Wait states: 295164 -> 295337 (0.06 %)
Totals from affected shaders:
SGPRS: 65784 -> 66472 (1.05 %)
VGPRS: 38064 -> 38812 (1.97 %)
Code Size: 1993828 -> 1997972 (0.21 %) bytes
LDS: 42 -> 42 (0.00 %) blocks
Scratch: 795648 -> 783360 (-1.54 %) bytes per wave
Wait states: 54026 -> 54199 (0.32 %)
Reviewers: tstellarAMD, arsenm, mareko
Subscribers: arsenm, llvm-commits
Differential Revision: http://reviews.llvm.org/D15875
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@257074 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
Somehow, I first interpreted the docs as saying space for xnack_mask is only
reserved when XNACK is enabled via SH_MEM_CONFIG. I felt uneasy about this and
went back to actually test what is happening, and it turns out that xnack_mask
is always reserved at least on Tonga and Carrizo, in the sense that flat_scr
is always fixed below the SGPRs that are used to implement xnack_mask, whether
or not they are actually used.
I confirmed this by writing a shader using inline assembly to tease out the
aliasing between flat_scratch and regular SGPRs. For example, on Tonga, where
we fix the number of SGPRs to 80, s[74:75] aliases flat_scratch (so
xnack_mask is s[76:77] and vcc is s[78:79]).
This patch changes both the calculation of the total number of SGPRs and the
various register reservations to account for this.
It ought to be possible to use the gap left by xnack_mask when the feature
isn't used, but this patch doesn't try to do that. (Note that the same applies
to vcc.)
Note that previously, even before my earlier change in r256794, the SGPRs that
alias to xnack_mask could end up being used as well when flat_scr was unused
and the total number of SGPRs happened to fall on the right alignment
(e.g. highest regular SGPR being used s29 and VCC used would lead to number
of SGPRs being 32, where s28 and s29 alias with xnack_mask). So if there
were some conflict due to such aliasing, we should have noticed that already.
Reviewers: arsenm, tstellarAMD
Subscribers: arsenm, llvm-commits
Differential Revision: http://reviews.llvm.org/D15898
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@257073 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
This is admittedly something that you could only run into by manually
playing around with shader assembly, because the SITypeRewriter pass is
skipped for compute.
Reviewers: arsenm, tstellarAMD
Subscribers: arsenm, llvm-commits
Differential Revision: http://reviews.llvm.org/D15902
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@256980 91177308-0d34-0410-b5e6-96231b3b80d8
Due to the SGPR init bug, every program claims to use the same number
of SGPRs anyway, so there's no point in trying to shift those registers
down from their initial spot of reservation.
Add a test that uses VGPR spilling and blocks most SGPRs from being used for
the scratch resource register. Previously, this would run into an assertion.
Differential Revision: http://reviews.llvm.org/D15724
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@256870 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
Enabling this feature will account for the two SGPRs used by the hardware
to store the XNACK_MASK physically.
The hardware only requires this reservation when the XNACK feature is
explicitly enabled. At some point, HSA will probably want to do that, but
it does increase SGPR register pressure, so leave it disabled by default
for now (but do add a small test).
Reviewers: arsenm, tstellarAMD
Subscribers: arsenm, llvm-commits
Differential Revision: http://reviews.llvm.org/D15869
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@256794 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
For some reason, executing an MUBUF instruction with the addr64
bit set and a zero base pointer in the resource descriptor causes
the memory operation to be dropped when the shader is executed using
the HSA runtime.
This kind of MUBUF instruction is commonly used when the pointer is
stored in VGPRs. The base pointer field in the resource descriptor
is set to zero and the pointer is stored in the vaddr field.
This patch resolves the issue by only using flat instructions for
global memory operations when targeting HSA. This is an overly
conservative fix as all other configurations of MUBUF instructions
appear to work.
NOTE: re-committed after fixing a failure in CodeGen/AMDGPU/llvm.dbg.value.ll
Reviewers: tstellarAMD
Subscribers: arsenm, llvm-commits
Differential Revision: http://reviews.llvm.org/D15543
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@256282 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
For some reason, executing an MUBUF instruction with the addr64
bit set and a zero base pointer in the resource descriptor causes
the memory operation to be dropped when the shader is executed using
the HSA runtime.
This kind of MUBUF instruction is commonly used when the pointer is
stored in VGPRs. The base pointer field in the resource descriptor
is set to zero and the pointer is stored in the vaddr field.
This patch resolves the issue by only using flat instructions for
global memory operations when targeting HSA. This is an overly
conservative fix as all other configurations of MUBUF instructions
appear to work.
Reviewers: tstellarAMD
Subscribers: arsenm, llvm-commits
Differential Revision: http://reviews.llvm.org/D15543
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@256273 91177308-0d34-0410-b5e6-96231b3b80d8
noduplicate prevents unrolling of small loops that happen to have
barriers in them. If a loop has a barrier in it, it is OK to duplicate
it for the unroll.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@256075 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
When copying aggregate registers within the same register class, there may
be an overlap between source and destination that forces us to do the copy
backwards.
Do the simplest possible thing that guarantees the correct order of moves
when there are overlaps, and does whatever when there is no overlap. (The
last part forces some trivial adjustments to test cases.)
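A hedged sketch of the ordering decision (the direction test is an
assumption; the real code would also set kill/def flags):

    // Sketch: copy sub-registers in reverse when the tuples overlap and a
    // forward walk would clobber source lanes before they are read.
    bool Forward = !TRI->regsOverlap(DestReg, SrcReg) ||
                   TRI->getEncodingValue(DestReg) <=
                       TRI->getEncodingValue(SrcReg);
    for (unsigned I = 0, E = SubIndices.size(); I != E; ++I) {
      unsigned Idx = Forward ? I : E - 1 - I;
      llvm::BuildMI(MBB, MI, DL, TII->get(AMDGPU::V_MOV_B32_e32),
                    TRI->getSubReg(DestReg, SubIndices[Idx]))
          .addReg(TRI->getSubReg(SrcReg, SubIndices[Idx]));
    }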
Together with r255906, this fixes a VM fault in Unreal Elemental Demo.
While at it, change the generation of kill and def flags to something that
looks more reasonable. This method is used very late during compilation, so
it probably doesn't matter in practice, and to be honest, I don't know if
this change is actually correct because the semantics in connection with
aggregate registers vs. sub-registers are not clear to me.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=93264
Reviewers: arsenm, tstellarAMD
Subscribers: arsenm, llvm-commits
Differential Revision: http://reviews.llvm.org/D15622
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@256072 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
We were previously selecting all constant loads to SMRD instructions and legalizing
the SMRDs with non-uniform addresses during the SIFixSGPRCopies pass.
This new solution is simpler and also generates much better code, because
the instruction selector is able to take advantage of all the MUBUF addressing
modes that the legalization pass wasn't able to.
We also no longer need to generate v_add_* instructions when we
have a uniform pointer and a non-uniform offset, as this is now folded into the
MUBUF instruction during instruction selection.
Reviewers: arsenm
Subscribers: arsenm, llvm-commits
Differential Revision: http://reviews.llvm.org/D15425
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@255672 91177308-0d34-0410-b5e6-96231b3b80d8