This patch improves the knownbits logic for unsigned integer min/max opcodes.
For UMIN we know that the result will have at least as many leading zero bits as the maximum of the inputs' known leading zeros; similarly, for UMAX, at least as many leading one bits as the maximum of the inputs' known leading ones.
This is particularly useful for simplifying clamping patterns. For example, as SSE doesn't have a uitofp instruction, we want to use sitofp instead where possible, and for that we need to confirm that the top bit is not set.
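A minimal standalone sketch of the underlying reasoning (my own illustration, not the in-tree KnownBits code; the helper names are hypothetical):

  #include <algorithm>
  #include <bit>      // std::countl_zero / std::countl_one (C++20)
  #include <cstdint>

  // umin(a, b) is never larger than either input, so it has at least as many
  // leading zero bits as the input with more leading zeros; dually, umax(a, b)
  // is never smaller than either input, so it has at least as many leading
  // one bits as the input with more leading ones.
  int leadingZerosOfUMin(uint32_t A, uint32_t B) {
    return std::max(std::countl_zero(A), std::countl_zero(B));
  }
  int leadingOnesOfUMax(uint32_t A, uint32_t B) {
    return std::max(std::countl_one(A), std::countl_one(B));
  }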
Differential Revision: https://reviews.llvm.org/D28853
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@292528 91177308-0d34-0410-b5e6-96231b3b80d8
As discussed on D28219 - it is profitable to combine trunc(binop(s/zext(x), s/zext(y))) to binop(trunc(s/zext(x)), trunc(s/zext(y))), assuming the trunc(ext()) will simplify further
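A small sanity-check sketch (my own, taking add with zext to i8/i32 as one concrete instance of the binop/ext combination) of why the narrowed form computes the same bits:

  #include <cassert>
  #include <cstdint>

  int main() {
    for (uint32_t X = 0; X < 256; ++X)
      for (uint32_t Y = 0; Y < 256; ++Y) {
        // trunc.i8(add.i32(zext(x), zext(y)))
        uint8_t Wide = (uint8_t)((uint32_t)(uint8_t)X + (uint32_t)(uint8_t)Y);
        // add.i8(trunc(zext(x)), trunc(zext(y))), i.e. just x + y at i8
        uint8_t Narrow = (uint8_t)((uint8_t)X + (uint8_t)Y);
        assert(Wide == Narrow);
      }
    return 0;
  }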
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@292493 91177308-0d34-0410-b5e6-96231b3b80d8
As discussed on D28219 - it is profitable to combine trunc(binop(s/zext(x), s/zext(y))) to binop(trunc(s/zext(x)), trunc(s/zext(y))), assuming the trunc(ext()) will simplify further
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@292487 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
The SDNodeOrder is saved in the IROrder field in the SDNode, and this
field may affect scheduling. Thus, letting dbg.value/declare increase
the order numbers may in turn affect scheduling.
Because of this change we also need to update the code deciding when
dbg values should be output, in ScheduleDAGSDNodes.cpp/ProcessSDDbgValues.
Dbg values now have the same order as the SDNode they are connected to,
rather than the following order.
Test cases provided by Florian Hahn.
Reviewers: bogner, aprantl, sunfish, atrick
Reviewed By: atrick
Subscribers: fhahn, probinson, andreadb, llvm-commits, MatzeB
Differential Revision: https://reviews.llvm.org/D25318
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@292485 91177308-0d34-0410-b5e6-96231b3b80d8
If the subvector comes from a load, we convert to SUBV_BROADCAST and use a broadcast instruction. But if there is no load we keep the inserts. I think we should create the SUBV_BROADCAST even without the load and let isel use the fallback patterns that are used if the load can't be folded. This will use the SHUFF32X4 or similar instruction for the 128-bit into 512-bit case and a single insert for 128 into 256 or 256 into 512.
This should be fixed so subvector broadcast intrinsics can be replaced with native IR since some of those currently lower directly to SHUFF32X4.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@292475 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
Currently we expand and scalarize these operations, but I think we should be able to implement ADD/SUB with KXOR and MUL with KAND.
We already do this for scalar i1 operations so I just extended it to vectors of i1.
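A quick standalone check (mine, not part of the patch) of the single-bit identities this relies on:

  #include <cassert>
  #include <cstdint>

  int main() {
    for (uint32_t A = 0; A < 2; ++A)
      for (uint32_t B = 0; B < 2; ++B) {
        assert(((A + B) & 1) == (A ^ B)); // i1 ADD behaves like XOR (KXOR)
        assert(((A - B) & 1) == (A ^ B)); // i1 SUB behaves like XOR (KXOR)
        assert(((A * B) & 1) == (A & B)); // i1 MUL behaves like AND (KAND)
      }
    return 0;
  }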
Reviewers: zvi, delena
Reviewed By: delena
Subscribers: guyblank, llvm-commits
Differential Revision: https://reviews.llvm.org/D28888
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@292474 91177308-0d34-0410-b5e6-96231b3b80d8
r291670 doesn't crash on the original testcase from PR31589,
but it crashes on a slightly more complex one.
PR31589 has the new reproducer.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@292444 91177308-0d34-0410-b5e6-96231b3b80d8
Non-prevailing weak/linkonce odr symbols will be dropped by ThinLTO to
available_externally when possible. If they had an initializer in the
global_ctors list, a comdat group was being created. This code
already had logic to skip available_externally defs, but now the
EliminateAvailableExternally pass will drop these symbols to
declarations earlier. Change the check to skip anything that is a
declaration for the linker (which includes available_externally
definitions along with plain declarations).
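Roughly, the condition becomes something like the following sketch (the helper name is hypothetical; GlobalValue::isDeclarationForLinker() is the existing predicate covering both cases):

  #include "llvm/IR/GlobalValue.h"
  using namespace llvm;

  // Skip any global that is a declaration as far as the linker is concerned,
  // which includes both plain declarations and available_externally defs.
  static bool shouldSkipCtorEntry(const GlobalValue &GV) {
    return GV.isDeclarationForLinker();
  }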
Reviewers: mehdi_amini
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D28737
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@292408 91177308-0d34-0410-b5e6-96231b3b80d8
combineSRA doesn't detect sign bit splats that it performs itself, so just use -1 as the demanded input so that it's already splatted
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@292361 91177308-0d34-0410-b5e6-96231b3b80d8
This patch improves the mul instruction combine function (combineMul)
by adding a new layer of logic.
In this patch, we add the ability to fold (mul x, -((1 << c) - 1))
or (mul x, -((1 << c) + 1)) into (neg((x << c) - x)) or (neg((x << c) + x)), respectively.
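A brute-force sanity sketch (my own, in wrapping 32-bit arithmetic) of the two identities:

  #include <cassert>
  #include <cstdint>

  int main() {
    for (uint32_t X = 0; X < 1024; ++X)
      for (uint32_t C = 1; C < 16; ++C) {
        assert(X * (0u - ((1u << C) - 1)) == 0u - ((X << C) - X)); // -(2^c - 1)
        assert(X * (0u - ((1u << C) + 1)) == 0u - ((X << C) + X)); // -(2^c + 1)
      }
    return 0;
  }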
Differential Revision: https://reviews.llvm.org/D28232
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@292358 91177308-0d34-0410-b5e6-96231b3b80d8
The patch is to solve the performance problem described in PR27827.
Register coalescing sometimes cannot remove a copy because of interference.
But if we can find a reverse copy in one of the predecessor blocks of the copy,
the copy is partially redundant and we may remove the copy partially by moving
it to the predecessor block without the reverse copy.
Differential Revision: https://reviews.llvm.org/D28585
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@292292 91177308-0d34-0410-b5e6-96231b3b80d8
Even with the fix from r291630, this still causes problems. I get
widespread assertion failures in the Swift runtime's WeakRefCount::increment()
function. I sent a reduced testcase in reply to the commit.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@292242 91177308-0d34-0410-b5e6-96231b3b80d8
SelectionDAGBuilder recognizes libfuncs using some homegrown
parameter type-checking.
Use TLI instead, removing another heap of redundant code.
This isn't strictly NFC, as the SDAG code was too lax.
Concretely, this means changes are required to two tests:
- calling a non-variadic function via a variadic prototype isn't OK;
it just happens to work on x86_64 (but not on, e.g., aarch64).
- mempcpy has a size_t parameter; the SDAG code accepts any integer
type, which meant using i32 on x86_64 worked.
I don't think it's worth supporting either of these (IMO) broken
testcases. Instead, fix them to be more correct.
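For reference, the prototype the stricter TLI-based check expects for mempcpy (the length argument is size_t, i.e. i64 on x86_64, not an arbitrary integer type):

  #include <cstddef>

  // GNU mempcpy: like memcpy, but returns dest + n instead of dest.
  extern "C" void *mempcpy(void *dest, const void *src, size_t n);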
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@292189 91177308-0d34-0410-b5e6-96231b3b80d8
We were relying on constant folding of the legalized instructions to provide the constant folding we had previously.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@292114 91177308-0d34-0410-b5e6-96231b3b80d8
VPMACSDQH/VPMACSDQL act as VPADDQ( VPMULDQ( x, y ), z ) - sign-extending and multiplying either the odd or even v4i32 input elements and adding the result to a v2i64 accumulator
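A scalar reference sketch (mine, following the description above; the 'L' form shown here uses the even elements, the 'H' form the odd ones):

  #include <array>
  #include <cstdint>

  // VPADDQ(VPMULDQ(x, y), z): sign-extend the selected i32 lanes, multiply
  // to 64 bits, then add the v2i64 accumulator.
  std::array<int64_t, 2> vpmacsdql_ref(const std::array<int32_t, 4> &X,
                                       const std::array<int32_t, 4> &Y,
                                       const std::array<int64_t, 2> &Z) {
    std::array<int64_t, 2> R;
    for (int I = 0; I < 2; ++I)
      R[I] = (int64_t)X[2 * I] * (int64_t)Y[2 * I] + Z[I];
    return R;
  }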
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@292020 91177308-0d34-0410-b5e6-96231b3b80d8
VPMACSWW/VPMACSDD act as add( mul( x, y ), z ) - ignoring any upper bits from both the multiply and add stages
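A per-lane scalar sketch (mine, following the description above) for the i16 case:

  #include <cstdint>

  // add(mul(x, y), z) with all upper bits discarded: everything is computed
  // modulo 2^16 per element, so plain wrapping 16-bit arithmetic matches.
  int16_t vpmacsww_lane_ref(int16_t X, int16_t Y, int16_t Z) {
    uint32_t Product = (uint32_t)(uint16_t)X * (uint16_t)Y;
    return (int16_t)(uint16_t)(Product + (uint16_t)Z); // keep the low 16 bits
  }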
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@292019 91177308-0d34-0410-b5e6-96231b3b80d8
Tests showing missed opportunities to use XOP's integer fma instructions.
Some of these are pretty awkward to match as they often have implicit sext/trunc stages, but many just ignore overflow bits, which makes things pretty straightforward.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@292017 91177308-0d34-0410-b5e6-96231b3b80d8
Isel now selects masked move instructions for vselect instead of blendm. But sometimes it is beneficial for register allocation to remove the tied register constraint by using blendm instructions.
This also picks up cases where the masked move was created due to a masked load intrinsic.
Differential Revision: https://reviews.llvm.org/D28454
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@292005 91177308-0d34-0410-b5e6-96231b3b80d8
We'll now expand AVX512_128_SET0 to an EVEX VXORD if VLX is available. If it's not, but register allocation has selected a non-extended register, we will use VEX VXORPS. And if it's an extended register without VLX, we'll use a 512-bit XOR. Do the same for AVX512_FsFLD0SS/SD.
This makes it possible for the register allocator to have all 32 registers available to work with.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@292004 91177308-0d34-0410-b5e6-96231b3b80d8
This lowers to SHUFPD if the input is zeroinitializer, but not with a demanded-elts-optimized build vector.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@291924 91177308-0d34-0410-b5e6-96231b3b80d8
Use v8i64 variable ASHR instructions if we don't have VLX.
This is a reduced version of D28537 that just adds support for variable shifts - I'll continue with that patch (for just constant/uniform shifts) once I've fixed the type legalization issue in avx512-cvt.ll.
Differential Revision: https://reviews.llvm.org/D28604
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@291901 91177308-0d34-0410-b5e6-96231b3b80d8
Some shuffles can be lowered to a blend mask instruction (VPBLENDMB/VPBLENDMW/VPBLENDMD/VPBLENDMQ).
In this patch, I added a new pattern match for this case.
Reviewers: craig.topper, guyblank, RKSimon, igorb
Differential Revision: https://reviews.llvm.org/D28483
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@291888 91177308-0d34-0410-b5e6-96231b3b80d8
Emit SHRQ/SHLQ instead of ANDQ with a 64-bit constant mask if the result
is unused and the mask has only higher/lower bits set. For example, with
this patch LLVM emits
shrq $41, %rdi
je
instead of
movabsq $0xFFFFFE0000000000, %rcx
testq %rcx, %rdi
je
This reduces the number of instructions, code size and register pressure.
The transformation is applied only for cases where the mask cannot be
encoded as an immediate value within a TESTQ instruction.
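For reference, the kind of source-level check involved (a hedged example of mine; the mask above covers bits 63..41):

  #include <cstdint>

  bool anyHighBitSet(uint64_t X) {
    // (X & 0xFFFFFE0000000000) != 0 is equivalent to (X >> 41) != 0,
    // which is what the shrq $41 + je sequence tests.
    return (X >> 41) != 0;
  }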
Differential Revision: https://reviews.llvm.org/D28198
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@291806 91177308-0d34-0410-b5e6-96231b3b80d8
For tests on bypassing slow division there's no need to be
Atom-specific. The patch renames all tests on division bypassing
and makes their names more consistent:
atom-bypass-slow-division.ll -> bypass-slow-division-32.ll
(tests verifying correctness of divl-to-divb bypassing)
atom-bypass-slow-division-64.ll -> bypass-slow-division-64.ll
(tests verifying correctness of divq-to-divl bypassing)
slow-div.ll -> bypass-slow-division-tune.ll
(tests verifying that bypassing is enabled only when appropriate)
Differential Revision: https://reviews.llvm.org/D28197
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@291802 91177308-0d34-0410-b5e6-96231b3b80d8
64-bit integer division in Intel CPUs is extremely slow, much slower
than 32-bit division. On the other hand, 8-bit and 16-bit divisions
aren't any faster. The only important exception is Atom where DIV8
is fastest. Because of that, the patch
1) Enables bypassing of 64-bit division for Atom, Silvermont and
all big cores.
2) Modifies 64-bit bypassing to use 32-bit division instead of a
16-bit one. This doesn't make the shorter division slower but
increases the chances of taking it. Moreover, it's much more likely
to prove at compile time that a value fits in 32 bits and doesn't
require a run-time check (e.g. zext i32 to i64).
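A hedged sketch of the run-time bypass this enables (my own illustration, not the generated code):

  #include <cstdint>

  uint64_t udiv64_with_bypass(uint64_t A, uint64_t B) {
    if (((A | B) >> 32) == 0)            // both operands fit in 32 bits
      return (uint32_t)A / (uint32_t)B;  // fast 32-bit divide
    return A / B;                        // slow full 64-bit divide
  }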
Differential Revision: https://reviews.llvm.org/D28196
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@291800 91177308-0d34-0410-b5e6-96231b3b80d8
r289653 added a case where `vselect <cond> <vector1> <all-zeros>`
is transformed to:
`vselect xor(cond, DAG.getConstant(1, DL, CondVT)) <all-zeros> <vector1>`
This was not aimed at catching cases where Cond is not a vXi1
mask, but it does. Moreover, when Cond's type is vXiN (N > 1),
then xor(cond, DAG.getConstant(1, DL, CondVT)) != NOT(cond).
This patch changes the above to xor with allones, and avoids
entering the case for non-mask Conds.
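A tiny check (mine) of why xor-with-1 is not NOT once the condition element is wider than one bit:

  #include <cassert>
  #include <cstdint>

  int main() {
    uint8_t Cond = 2;                                 // a non-boolean i8 "mask"
    assert((uint8_t)(Cond ^ 1) != (uint8_t)~Cond);    // xor 1 gives 3, NOT gives 0xFD
    assert((uint8_t)(Cond ^ 0xFF) == (uint8_t)~Cond); // xor with all-ones == NOT
    return 0;
  }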
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@291745 91177308-0d34-0410-b5e6-96231b3b80d8