This patch adds costs for the vectorized implementations of CTPOP, the default values were seriously underestimating the cost of these and was encouraging vectorization on targets where serialized use of POPCNT would be much better.
Differential Revision: https://reviews.llvm.org/D22456
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@276104 91177308-0d34-0410-b5e6-96231b3b80d8
This is "cvtdq2ps" which does not appear to be particularly slow on any CPU
according to Agner's tables. Choosing "5" as a cost here as suggested in:
https://llvm.org/bugs/show_bug.cgi?id=21356
...but it seems very conservative given that the instruction is fully pipelined,
and I think these costs are supposed to model throughput.
Note that related costs are also most likely too high, but this fixes PR21356
and partly fixes PR28434.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@274658 91177308-0d34-0410-b5e6-96231b3b80d8
The cost model should not assume vector casts get completely scalarized, since
on targets that have vector support, the common case is a partial split up to
the legal vector size. So, when a vector cast gets split, the resulting casts
end up legal and cheap.
Instead of pessimistically assuming scalarization, base TTI can use the costs
the concrete TTI provides for the split vector, plus a fudge factor to account
for the cost of the split itself. This fudge factor is currently 1 by default,
except on AMDGPU where inserts and extracts are considered free.
Differential Revision: http://reviews.llvm.org/D21251
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@274642 91177308-0d34-0410-b5e6-96231b3b80d8
This patch corresponds to review:
http://reviews.llvm.org/D20443
It changes the legalization strategy for illegal vector types from integer
promotion to widening. This only applies for vectors with elements of width
that is a multiple of a byte since we have hardware support for vectors with
1, 2, 3, 8 and 16 byte elements.
Integer promotion for vectors is quite expensive on PPC due to the sequence
of breaking apart the vector, extending the elements and reconstituting the
vector. Two of these operations are expensive.
This patch causes between minor and major improvements in performance on most
benchmarks. There are very few benchmarks whose performance regresses. These
regressions can be handled in a subsequent patch with a DAG combine (similar
to how this patch handles int -> fp conversions of illegal vector types).
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@274535 91177308-0d34-0410-b5e6-96231b3b80d8
This is a resubmittion of 263158 change after fixing the existing problem with intrinsics mangling (see LTO and intrinsics mangling llvm-dev thread for details).
This patch fixes the problem which occurs when loop-vectorize tries to use @llvm.masked.load/store intrinsic for a non-default addrspace pointer. It fails with "Calling a function with a bad signature!" assertion in CallInst constructor because it tries to pass a non-default addrspace pointer to the pointer argument which has default addrspace.
The fix is to add pointer type as another overloaded type to @llvm.masked.load/store intrinsics.
Reviewed By: reames
Differential Revision: http://reviews.llvm.org/D17270
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@274043 91177308-0d34-0410-b5e6-96231b3b80d8
This is a resubmittion of 263158 change after fixing the existing problem with intrinsics mangling (see LTO and intrinsics mangling llvm-dev thread for details).
This patch fixes the problem which occurs when loop-vectorize tries to use @llvm.masked.load/store intrinsic for a non-default addrspace pointer. It fails with "Calling a function with a bad signature!" assertion in CallInst constructor because it tries to pass a non-default addrspace pointer to the pointer argument which has default addrspace.
The fix is to add pointer type as another overloaded type to @llvm.masked.load/store intrinsics.
Reviewed By: reames
Differential Revision: http://reviews.llvm.org/D17270
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@273892 91177308-0d34-0410-b5e6-96231b3b80d8
The BSWAP of vector types is quite efficiently implemented using vector shuffles on SSE/AVX targets, we should reflect the typical cost of this to encourage vectorization.
Differential Revision: http://reviews.llvm.org/D21521
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@273217 91177308-0d34-0410-b5e6-96231b3b80d8
As discussed on PR24888, until SSE42 we don't have access to PCMPGTQ for v2i64 comparisons, but the cost models don't reflect this, resulting in over-optimistic vectorizaton.
This patch adds SSE2 'base level' costs that match what a typical target is capable of and only reduces the v2i64 costs at SSE42.
Technically SSE41 provides a PCMPEQQ v2i64 equality test, but as getCmpSelInstrCost doesn't give us a way to discriminate between comparison test types we can't easily make use of this, otherwise we could split the cost of integer equality and greater-than tests to give better costings of each.
Differential Revision: http://reviews.llvm.org/D20057
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@268972 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
rL256194 transforms truncations between vectors of integers into PACKUS/PACKSS
operations during DAG combine. This generates better code for truncate, so cost
of truncate needs to be changed but looks like it got changed only in SSE2 table
Whereas this change is also applicable for SSE4.1, so the cost of truncate needs
to be changed for that as well. Cost of “TRUNCATE v16i32 to v16i8” & “TRUNCATE
v16i16 to v16i8” should be same in SSE4.1 & SSE2 table. Removing their cost from
SSE4.1, so it will fall back to SSE2.
Reviewers: Simon Pilgrim
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@267123 91177308-0d34-0410-b5e6-96231b3b80d8
This is a resubmittion of 263158 change.
This patch fixes the problem which occurs when loop-vectorize tries to use @llvm.masked.load/store intrinsic for a non-default addrspace pointer. It fails with "Calling a function with a bad signature!" assertion in CallInst constructor because it tries to pass a non-default addrspace pointer to the pointer argument which has default addrspace.
The fix is to add pointer type as another overloaded type to @llvm.masked.load/store intrinsics.
Reviewed By: reames
Differential Revision: http://reviews.llvm.org/D17270
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@266086 91177308-0d34-0410-b5e6-96231b3b80d8
PPC has a vector popcount, this lets the vectorizer use the correct cost
for it. Tweak X86 test to use an intrinsic that's actually scalarized (we
have a somewhat efficient lowering for vector popcount using SSE, the
cost model finds that now).
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@265005 91177308-0d34-0410-b5e6-96231b3b80d8
This patch fixes the problem which occurs when loop-vectorize tries to use @llvm.masked.load/store intrinsic for a non-default addrspace pointer. It fails with "Calling a function with a bad signature!" assertion in CallInst constructor because it tries to pass a non-default addrspace pointer to the pointer argument which has default addrspace.
The fix is to add pointer type as another overloaded type to @llvm.masked.load/store intrinsics.
Reviewed By: reames
Differential Revision: http://reviews.llvm.org/D17270
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@263158 91177308-0d34-0410-b5e6-96231b3b80d8
This patch transforms truncation between vectors of integers into
X86ISD::PACKUS/PACKSS operations during DAG combine. We don't do it in
lowering phase because after type legalization, the original truncation
will be turned into a BUILD_VECTOR with each element that is extracted
from a vector and then truncated, and from them it is difficult to do
this optimization. This greatly improves the performance of truncations
on some specific types.
Cost table is updated accordingly.
Differential revision: http://reviews.llvm.org/D14588
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@256194 91177308-0d34-0410-b5e6-96231b3b80d8
Previously in the conversion cost table there are no entries for integer-integer
conversions on SSE2. This will result in imprecise costs for certain vectorized
operations. This patch adds those entries for SSE2 and SSE4.1. The cost numbers
are counted from the result of running llc on the new test case in this patch.
Differential revision: http://reviews.llvm.org/D15132
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@255315 91177308-0d34-0410-b5e6-96231b3b80d8
Currently in LLVM's cost model, a vectorized arithmetic instruction will have
high cost if its type is split into multiple registers. However, this
punishment is too heavy and unnecessary. The overhead of the split should not
be on arithmetic instructions but instructions that implement the split. Note
that during vectorization we have calculated the register pressure, and we
only choose proper interleaving factor (and also vectorization factor) so
that we don't use more registers than the maximum number.
Here is a very simple example: if a vadd has the cost 1, and if we double VF
so that we need two registers to perform it, then its cost will become 4 with
the current implementation, which will prevent us to use larger VF.
Differential revision: http://reviews.llvm.org/D15159
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@254671 91177308-0d34-0410-b5e6-96231b3b80d8
I checked and updated the cost of AVX-512 conversion operations. Added cost of conversion operations in DQ mode.
Conversion of illegal types that requires vector split is not calculated right now (like for other X86 targets).
Differential Revision: http://reviews.llvm.org/D15074
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@254494 91177308-0d34-0410-b5e6-96231b3b80d8
The cost for scalarized operations is computed as N * (scalar operation
cost + 1 extractelement + 1 insertelement). This partially fixes
inflating the cost of scalarized operations since every operation is
scalarized and free. I don't think we want any cost asociated with
scalarization, but for now insertelement is still counted. I'm not sure
if we should pretend that insertelement is also free, or add a way
to compute a custom scalarization cost.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@254438 91177308-0d34-0410-b5e6-96231b3b80d8
The underlying issues surrounding codegen for 32-bit vselects have been resolved. The pessimistic costs for 64-bit vselects remain due to the bad
scalarization that is still happening there.
I tested this on A57 in T32, A32 and A64 modes. I saw no regressions, and some improvements.
From my benchmarks, I saw these improvements in A57 (T32)
spec.cpu2000.ref.177_mesa 5.95%
lnt.SingleSource/Benchmarks/Shootout/strcat 12.93%
lnt.MultiSource/Benchmarks/MiBench/telecomm-CRC32/telecomm-CRC32 11.89%
I also measured A57 A32, A53 T32 and A9 T32 and found no performance regressions. I see much bigger wins in third-party benchmarks with this change
Differential Revision: http://reviews.llvm.org/D14743
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@253349 91177308-0d34-0410-b5e6-96231b3b80d8
There are several dodgy costings due to AVX1 legalizing 256-bit integer vectors that need fixing.
As discussed in D8690.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@249983 91177308-0d34-0410-b5e6-96231b3b80d8
There are several dodgy costings due to AVX1 legalizing 256-bit integer vectors that need fixing.
As discussed in D8690.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@249981 91177308-0d34-0410-b5e6-96231b3b80d8
The XOP shifts just have logical/arithmetic versions and the left/right shifts are controlled by whether the value is positive/negative. Because of this I've added new X86ISD nodes instead of trying to force them to use the existing shift nodes.
Additionally Excavator cores (bdver4) support XOP and AVX2 - meaning that it should use the AVX2 shifts when it can and fall back to XOP in other cases.
Differential Revision: http://reviews.llvm.org/D8690
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@248878 91177308-0d34-0410-b5e6-96231b3b80d8