The variable mask forms of VPERMILPD/VPERMILPS were only partially implemented, with much of the handling still performed as an intrinsic.
This patch properly defines the instructions in terms of X86ISD::VPERMILPV, permitting the opcode to be easily combined as a target shuffle.
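For reference, a minimal sketch (not taken from the patch's tests) of the variable-mask permute as exposed by the AVX C intrinsics:

  #include <immintrin.h>

  // Variable permute of a 2 x double vector: each element of 'v' is selected
  // by the corresponding control field in 'idx'. With VPERMILPV modelled as a
  // target shuffle node, this kind of permute can now participate in shuffle
  // combining.
  __m128d permute_var(__m128d v, __m128i idx) {
    return _mm_permutevar_pd(v, idx);
  }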
Differential Revision: http://reviews.llvm.org/D17681
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@262635 91177308-0d34-0410-b5e6-96231b3b80d8
Shuffle non-mask operands do not always start at operand #0: that's not the case for VPERMV/VPERMV3, which cover all possible combinations (the C intrinsics use a different order; the AVX vs AVX512 intrinsics are different still).
Since r246981 (AVX-512: Lowering for 512-bit vector shuffles), VPERMV is recognized in getTargetShuffleMask.
This breaks assumptions in most callers, as they expect
the non-mask operands to start at index 0.
VPERMV has the mask as operand #0; VPERMV3 has it in the middle.
Instead of the faulty assumption, have getTargetShuffleMask return
its operands as well.
One alternative we considered was to change the operand order of
VPERMV, but we agreed to stick to the instruction order, as there
is more AVX512 weirdness to cover (vpermt2/vpermi2 in particular).
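As an illustration of the intrinsic-vs-instruction mismatch (a sketch, not from the patch itself): the AVX2 C intrinsic takes the data vector first and the indices second, whereas the underlying VPERMD/VPERMPS instruction encodes the index vector as its first source operand.

  #include <immintrin.h>

  // C intrinsic order: (data, indices). The VPERMD instruction takes the
  // indices as its first source operand, which is why callers of
  // getTargetShuffleMask cannot assume the non-mask operands start at #0.
  __m256i permute_dwords(__m256i data, __m256i idx) {
    return _mm256_permutevar8x32_epi32(data, idx);
  }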
Differential Revision: http://reviews.llvm.org/D17041
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@262627 91177308-0d34-0410-b5e6-96231b3b80d8
The vectorization of first-order recurrences (r261346) caused PR26734. When
detecting these recurrences, we need to ensure that the previous value is
actually defined inside the loop. This patch includes the fix and test case.
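For context, a typical first-order recurrence looks like the sketch below; the fix only treats a value as such a recurrence when its previous value really is defined inside the loop.

  // A first-order recurrence: each iteration uses a value produced by the
  // previous iteration.
  int sum_previous(const int *a, int n) {
    int prev = 0, sum = 0;
    for (int i = 0; i < n; ++i) {
      sum += prev;   // use of last iteration's value
      prev = a[i];   // value carried over to the next iteration
    }
    return sum;
  }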
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@262624 91177308-0d34-0410-b5e6-96231b3b80d8
This test failed on some ARM bots after a divmod change because it was
running on a native llc instead of a targeted one. This makes sure the test
is target-specific (as intended), and also copies it to the ARM and AArch64
directories. If it is also supposed to work on other architectures, I'll
leave that as an exercise for the respective maintainers.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@262620 91177308-0d34-0410-b5e6-96231b3b80d8
Generalise the existing SIGN_EXTEND to SIGN_EXTEND_VECTOR_INREG combine to support zero extension as well and get rid of a lot of unnecessary ANY_EXTEND + mask patterns.
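As a rough illustration (not taken from the patch's tests), the kind of extension the combine targets is a zero-extension of a vector's low elements:

  #include <immintrin.h>

  // Zero-extend the low 8 bytes of a vector to eight 16-bit lanes. With the
  // generalized combine, this kind of extension can lower via
  // ZERO_EXTEND_VECTOR_INREG (pmovzxbw on SSE4.1) instead of an any-extend
  // plus mask.
  __m128i zext_lo_bytes(__m128i v) {
    return _mm_cvtepu8_epi16(v);
  }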
Differential Revision: http://reviews.llvm.org/D17691
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@262599 91177308-0d34-0410-b5e6-96231b3b80d8
If we have a loop with a rarely taken path, we will prune that from the blocks which get added as part of the loop chain. The problem is that we weren't then recognizing the loop chain as schedulable when considering the preheader when forming the function chain. We'd then fall to various non-predecessors before finally scheduling the loop chain (as if the CFG were unnatural). The net result was that there could be lots of garbage between a loop preheader and the loop, even though we could have directly fallen into the loop. It also meant we separated hot code with regions of colder code.
The particular reason for the rejection of the loop chain was that we were scanning the predecessors of the header, seeing the backedge, believing it was a globally more important predecessor (true), but forgetting to account for the fact that the backedge predecessor was already part of the existing loop chain (oops!).
Differential Revision: http://reviews.llvm.org/D17830
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@262547 91177308-0d34-0410-b5e6-96231b3b80d8
Catch objects with a displacement of zero do not initialize a catch
object. The displacement is relative to %rsp at the end of the
function's prologue for x86_64 targets.
If we place an object at the top-of-stack, we will end up with a
displacement of zero resulting in our catch object remaining
uninitialized.
Address this by creating our catch objects as fixed objects. We will
ensure that the UnwindHelp object is created after the catch objects so
that no catch object will have a displacement of zero.
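For illustration, a handler that catches by value needs such a catch object (a sketch only; the helper names are made up):

  // 'e' is stored into the catch object so the handler can read it. The catch
  // object's displacement from the post-prologue RSP is recorded in the EH
  // tables, and a displacement of zero means the catch object is not
  // initialized -- hence the need to keep real catch objects away from the
  // top of the stack.
  void may_throw(bool b) { if (b) throw 42; }   // hypothetical helper
  int sink;                                     // hypothetical consumer

  void demo(bool b) {
    try {
      may_throw(b);
    } catch (int e) {      // catch-by-value: 'e' lives in the catch object
      sink = e;
    }
  }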
Differential Revision: http://reviews.llvm.org/D17823
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@262546 91177308-0d34-0410-b5e6-96231b3b80d8
Summary: This is the last step toward supporting aggregate memory access in instcombine. This explodes stores of arrays into a series of stores, one for each element, allowing them to be optimized.
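Illustrative sketch only (not from the patch's tests); whether a given frontend emits a true array-typed store here varies, but the IR-level pattern is a single store of an [N x T] value:

  // Conceptually, a single store of a small array-typed value. If emitted as
  // one aggregate store in IR, it is now decomposed into per-element stores
  // so each can be optimized on its own.
  struct Pair { int v[2]; };

  void store_pair(Pair *dst, int a, int b) {
    Pair p = {{a, b}};
    *dst = p;
  }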
Reviewers: joker.eph, reames, hfinkel, majnemer, mgrang
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D17828
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@262530 91177308-0d34-0410-b5e6-96231b3b80d8
Summary: This is another step toward improving first-class aggregate (fca) support. This unpacks a load of an array into a series of loads of the array's elements.
Reviewers: chandlerc, joker.eph, majnemer, reames, hfinkel
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D15890
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@262521 91177308-0d34-0410-b5e6-96231b3b80d8
This is a test that Akira Hatanaka wrote to test GlobalOpt's handling of
aliases with GEP operands. David Majnemer independently made the same
change to GlobalOpt in r212079. Akira's test is a useful addition, so I'm
pulling it over from the llvm repo for Swift on GitHub.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@262510 91177308-0d34-0410-b5e6-96231b3b80d8
When div+rem calls on the same arguments are found, the ARM back-end merges the
two calls into one __aeabi_divmod call for values of up to 32 bits. However,
for 64-bit values, which also have a library call (__aeabi_ldivmod), it wasn't
merging the calls, and thus called ldivmod twice and spilled the temporary
results, which generated pretty bad code.
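A minimal example of the pattern (a sketch: a div and a rem of the same 64-bit operands):

  // Both the quotient and the remainder of the same 64-bit operands: on ARM
  // this can now be lowered to a single __aeabi_ldivmod call rather than two.
  void divmod64(long long a, long long b, long long *q, long long *r) {
    *q = a / b;
    *r = a % b;
  }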
This patch legalises 64-bit lib calls for divmod, so that now all the spilling
and the second call are gone. It also relaxes the DivRem combiner a bit on the
legal type check, since it was already checking for isLegalOrCustom on every
value, so the extra check for isTypeLegal was redundant.
This patch fixes PR17193 (and a long-standing FIXME in the tests).
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@262507 91177308-0d34-0410-b5e6-96231b3b80d8
This reverts commit r262370.
It turns out there is code out there that does sequences of allocas
greater than 4K: http://crbug.com/591404
The goal of this change was to improve the code size of inalloca call
sequences, but we got tangled up in the mess of dynamic allocas.
Instead, we should come back later with a separate MI pass that uses
dominance to optimize the full sequence. This should also be able to
remove the often unneeded stacksave/stackrestore pairs around the call.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@262505 91177308-0d34-0410-b5e6-96231b3b80d8
Most of the time ARM has the CCR.UNALIGN_TRP bit set to false which
means that unaligned loads/stores do not trap and even extensive testing
will not catch these bugs. However the multi/double variants are not
affected by this bit and will still trap. In effect a more aggressive
load/store optimization will break existing (bad) code.
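A sketch of the kind of alignment bug involved (undefined behaviour in the source, but common in existing binaries):

  // A misaligned 64-bit access. Two separate 32-bit LDRs happen to work when
  // CCR.UNALIGN_TRP is clear, but a merged LDRD still traps.
  long long load64(const char *p) {
    return *(const long long *)(p + 1);   // p + 1 is generally not 8-byte aligned
  }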
These bugs do not necessarily manifest in the broken code where the
misaligned pointer is formed but often later in perfectly legal code
where it is accessed. This means recompiling system libraries (which
have no alignment bugs) with a newer compiler will break existing
applications (with alignment bugs) that worked before.
So (under protest) I implemented this safe mode which limits the
formation of multi/double operations to cases that are not affected by
user code (stack operations like spills/reloads) or cases where the
normal operations trap anyway (floating point load/stores). It is
disabled by default.
Differential Revision: http://reviews.llvm.org/D17015
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@262504 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
This change enables frame pointer elimination in non-leaf functions.
The -fomit-frame-pointer option still needs to be used when compiling
via clang (or an equivalent method of not setting the
'no-frame-pointer-elim*' function attributes if generating llvm IR via
some other method) to take advantage of this optimization.
This change should be NFC when compiling via clang without
-fomit-frame-pointer.
Reviewers: t.p.northover
Subscribers: aemerson, rengolin, tberghammer, qcolombet, llvm-commits, danalbert, mcrosier, srhines
Differential Revision: http://reviews.llvm.org/D17730
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@262495 91177308-0d34-0410-b5e6-96231b3b80d8
We have a number of useful lowering strategies for VBROADCAST instructions (both from memory and register element 0) which the 128-bit form of the MOVDDUP instruction can make use of.
This patch tweaks lowerVectorShuffleAsBroadcast to enable it to broadcast 2f64 args using MOVDDUP as well.
It does require a slight tweak to the lowerVectorShuffleAsBroadcast mechanism, as the existing MOVDDUP lowering uses isShuffleEquivalent, which can match binary shuffles that can lower to (unary) broadcasts.
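For example (illustrative only, not from the patch's tests), a splat of element 0 of a v2f64 is exactly the pattern that can now lower to the 128-bit MOVDDUP:

  #include <immintrin.h>

  // Broadcast lane 0 of a 2 x double vector; this <0,0> shuffle is the v2f64
  // broadcast pattern covered by the 128-bit MOVDDUP.
  __m128d splat_lane0(__m128d v) {
    return _mm_shuffle_pd(v, v, 0);
  }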
Differential Revision: http://reviews.llvm.org/D17680
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@262478 91177308-0d34-0410-b5e6-96231b3b80d8
We modeled the RDFLAGS{32,64} operations as "using" {E,R}FLAGS.
While technically correct, this is not desirable for folks who want
to examine aspects of the FLAGS register which are not related to
computation, such as whether or not CPUID is a valid instruction.
Differential Revision: http://reviews.llvm.org/D17782
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@262465 91177308-0d34-0410-b5e6-96231b3b80d8
Have ScalarEvolution::getRange re-consider cases like "{C?A:B,+,C?P:Q}"
by factoring out "C" and computing RangeOf{A,+,P} union RangeOf({B,+,Q})
instead.
The latter can be easier to compute precisely in cases like
"{C?0:N,+,C?1:-1}", where N is the backedge-taken count of the loop, since in
such cases the latter form simplifies to [0,N+1) union [0,N+1).
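A sketch of source code that gives rise to such a recurrence (names are made up):

  // Induction variable {C ? 0 : N, +, C ? 1 : -1}: both the start and the
  // step are selects on the same condition C, so factoring C out lets
  // getRange compute range({0,+,1}) union range({N,+,-1}) instead.
  long walk(long N, bool C) {
    long i = C ? 0 : N;
    long step = C ? 1 : -1;
    long sum = 0;
    for (long k = 0; k < N; ++k) {
      sum += i;
      i += step;
    }
    return sum;
  }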
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@262438 91177308-0d34-0410-b5e6-96231b3b80d8
As noted in the code comment, I don't think we can do the same transform that we do for
*scalar* integer comparisons to *vector* integer comparisons because it might pessimize
the general case.
Exhibit A for an incomplete integer comparison ISA remains x86 SSE/AVX: it only has EQ and GT
for integer vectors.
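To make "Exhibit A" concrete, these are the only integer vector comparisons SSE/AVX provide directly for 32-bit lanes; every other predicate has to be synthesized from them:

  #include <immintrin.h>

  // SSE2 integer comparison primitives for 32-bit lanes: equality and signed
  // greater-than. Predicates such as unsigned compares or <= must be built
  // from these.
  __m128i cmp_eq(__m128i a, __m128i b) { return _mm_cmpeq_epi32(a, b); }
  __m128i cmp_gt(__m128i a, __m128i b) { return _mm_cmpgt_epi32(a, b); }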
But we should now recognize all the variants of this construct and produce the optimal code
for the cases shown in:
https://llvm.org/bugs/show_bug.cgi?id=26701
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@262424 91177308-0d34-0410-b5e6-96231b3b80d8
Summary: SampleProfile pass needs to be performed after InstructionCombiningPass, which helps eliminate un-inlinable function calls.
Reviewers: davidxl, dnovillo
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D17742
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@262419 91177308-0d34-0410-b5e6-96231b3b80d8
On AMDGPU, where i64 operations are often bitcast to v2i32
and back, this pattern shows up regularly and breaks some
expected combines on i64, such as load width reduction.
This fixes some test failures in a future commit when i64 loads
are changed to promote.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@262397 91177308-0d34-0410-b5e6-96231b3b80d8
Revert r262248 in an attempt to fix the clang-native-aarch64-full
bot and to investigate a performance regression in
SingleSource/Benchmarks/CoyoteBench/huffbench
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@262388 91177308-0d34-0410-b5e6-96231b3b80d8
This reverts commit r262316.
It seems that my change breaks an out-of-tree chromium buildbot, so
I'm reverting this in order to investigate the situation further.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@262387 91177308-0d34-0410-b5e6-96231b3b80d8
Most portions of InstCombine properly propagate fast math flags, but
apparently the vector scalarization section was overlooked.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@262376 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
Calls sometimes need to be convergent. This is already handled at the
LLVM IR level, but it also needs to be handled at the MI level.
Ideally we'd propagate convergence from instructions, down through the
selection DAG, and into MIs. But this is Hard, and would affect
optimizations in the SDNs -- right now only SDNs with two operands have
any flags at all.
Instead, here's a much simpler hack: Add new opcodes for NVPTX for
convergent calls, and generate these when lowering convergent LLVM
calls.
Reviewers: jholewinski
Subscribers: jholewinski, chandlerc, joker.eph, jhen, tra, llvm-commits
Differential Revision: http://reviews.llvm.org/D17423
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@262373 91177308-0d34-0410-b5e6-96231b3b80d8
The _chkstk function is called by the compiler to probe the stack in an
order consistent with Windows' expectations. However, it is possible to
elide the call to _chkstk and manually adjust the stack pointer if we
can prove that the allocation is fixed size and smaller than the probe
size.
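A rough sketch of the kind of allocation involved (illustrative only; whether the probing path is taken depends on how the allocation sequence is emitted):

  #include <malloc.h>
  #include <string.h>

  // A stack allocation whose size is a compile-time constant well below the
  // 4 KiB probe interval: the stack pointer can be adjusted directly rather
  // than via a _chkstk probe.
  void demo() {
    void *p = _alloca(128);
    memset(p, 0, 128);
  }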
This shrinks chrome.dll, chrome_child.dll and chrome.exe by a
cumulative ~133 KB.
Differential Revision: http://reviews.llvm.org/D17679
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@262370 91177308-0d34-0410-b5e6-96231b3b80d8
Code in visitEHPadPredecessors assumes a little too much about the
validity of a cleanupret with an invalid cleanuppad operand.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@262364 91177308-0d34-0410-b5e6-96231b3b80d8