Because the stack growth direction and addressing is done
in the same direction, modifying SP at the beginning of the
call sequence was incorrect. If we had a stack passed argument,
we would end up skipping that number of bytes before pushing
arguments, leaving unused/inconsistent space.
The callee creates fixed stack objects in its frame, so
the space necessary for these is already logically allocated
in the callee, so we just let the callee increment SP if
it really requires it.
llvm-svn: 313279
Using SplitCSR for the frame register was very broken. Often
the copies in the prolog and epilog were optimized out, in addition
to them being inserted after the true prolog where the FP
was clobbered.
I have a hacky solution which works that continues to use
split CSR, but for now this is simpler and will get to working
programs.
llvm-svn: 313274
MachineScheduler when clustering loads or stores checks if base
pointers point to the same memory. This check is done through
comparison of base registers of two memory instructions. This
works fine when instructions have separate offset operand. If
they require a full calculated pointer such instructions can
never be clustered according to such logic.
Changed shouldClusterMemOps to accept base registers as well and
let it decide what to do about it.
Differential Revision: https://reviews.llvm.org/D37698
llvm-svn: 313208
These two instructions are normally selected, but when the
two address pass converts mac into mad we end up with the
mad where we could have one of these.
Differential Revision: https://reviews.llvm.org/D37389
llvm-svn: 312928
A mrt exp with vm=1 must be in exact (non-WQM) mode, as it also exports
the exec mask as the valid mask to determine which pixels to render.
This commit marks any exp as needing to be in exact mode.
Actually, if there are multiple mrt exps, only one needs to have vm=1,
and only that one needs to be in exact mode. But that is an optimization
for another day.
Differential Revision: https://reviews.llvm.org/D36305
llvm-svn: 312915
We have a lot of operand definition work essentially producing
every valid permutation of operands to workaround builiding
operand lists based on the instruction features. Apparently tablegen
already has a mostly undocumented operator to concat dags which
simplies this.
Convert one simple place to use this. The BUF instruction definitions
have much more complicated logic that can be totally rewritten now.
llvm-svn: 312822
The various scalar bit operations set SCC,
so one is erased or moved it needs to be recomputed.
Not sure why the existing tests don't fail on this.
llvm-svn: 312819
Summary:
Mesa still uses a hack where empty inline assembly is used as a kind of
optimization barrier. This exposed a problem where not enough wait states
were inserted, because the hazard recognizer implicitly assumed that each
inline assembly "instruction" has at least one wait state.
Reviewers: arsenm
Subscribers: kzhuravl, wdng, yaxunl, dstuttard, tpr, llvm-commits, t-tye
Differential Revision: https://reviews.llvm.org/D37205
llvm-svn: 312635
When packet size equals packet align and is power of 2, transform
__read_pipe* and __write_pipe* to specialized library function.
Differential Revision: https://reviews.llvm.org/D36831
llvm-svn: 312598
- Make SIMemOpInfo a class
- Add accessor methods to SIMemOpInfo
- Move get*Info methods to SIMemOpInfo
Differential Revision: https://reviews.llvm.org/D37395
llvm-svn: 312541
- Rename MemOpInfo -> SIMemOpInfo
- Move SIMemOpInfo class out of SIMemoryLegalizer class
Differential Revision: https://reviews.llvm.org/D37394
llvm-svn: 312540
Summary:
This fixes a bug that was exposed on gfx9 in various
GL45-CTS.shaders.loops.*_iterations.select_iteration_count_fragment tests,
e.g. GL45-CTS.shaders.loops.do_while_uniform_iterations.select_iteration_count_fragment
Reviewers: arsenm
Subscribers: kzhuravl, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D36193
llvm-svn: 312337
build_vector is a more useful canonical form when
pattern matching packed operations, so turn shift
into high element into a build_vector.
Should show no change for now.
llvm-svn: 312282
The majority of the time spent in the pass checking
for the register reads. Rather than searching all of
the defined registers for uses in each instruction,
use a set of defined registers and check the operands
of the instruction.
This process still is algorithmically not great,
but with the additional trick of skipping the analysis
for addresses with one use, this brings one slow
testcase into a reasonable range.
llvm-svn: 312206
These aren't really packed instructions, so the default
op_sel_hi should be 0 since this indicates a conversion.
The operand types are scalar values that behave similar
to an f16 scalar that may be converted to f32.
Doesn't change the default printing for op_sel_hi, just
the parsing.
llvm-svn: 312179
The merge is only possible if the base address register is the
same for the two instructions. If there is only the one use,
there's no point in doing an expensive forward scan checking
for memory interference looking for a merge candidate.
This gives a signficant improvement in one extreme testcase.
The code to do the scan is still algorithmically terrible,
so this is still the slowest pass in that example.
llvm-svn: 312096
If denorms are not flushed we can use max instead of multiplication
by 1. For double that is simply faster, while for float and half
it is shorter, because mul uses constant bus and VOP3.
Differential Revision: https://reviews.llvm.org/D36856
llvm-svn: 312095
Under -cl-fast-relaxed-math we could use native_sqrt, but f64 was
allowed to produce HSAIL's nsqrt instruction. HSAIL is not here
and we stick with non-existing native_sqrt(double) as a result.
Add check for f64 to not return native functions and also remove
handling of f64 case for fold_sqrt.
Differential Revision: https://reviews.llvm.org/D37223
llvm-svn: 311900
Summary:
This is step towards separating the GCN and R600 tablegen'd code.
This is a little awkward for now, because the R600 functions won't have the
MCSubtargetInfo parameter, so we need to have AMDMGPUInstPrinter
delegate to R600InstPrinter, but once the tablegen'd code is split,
we will be able to drop the delegation and use R600InstPrinter directly.
Reviewers: arsenm
Subscribers: kzhuravl, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D36444
llvm-svn: 311128