The underlying issues surrounding codegen for 32-bit vselects have been resolved. The pessimistic costs for 64-bit vselects remain due to the bad
scalarization that is still happening there.
I tested this on A57 in T32, A32 and A64 modes. I saw no regressions, and some improvements.
From my benchmarks, I saw these improvements in A57 (T32)
spec.cpu2000.ref.177_mesa 5.95%
lnt.SingleSource/Benchmarks/Shootout/strcat 12.93%
lnt.MultiSource/Benchmarks/MiBench/telecomm-CRC32/telecomm-CRC32 11.89%
I also measured A57 A32, A53 T32 and A9 T32 and found no performance regressions. I see much bigger wins in third-party benchmarks with this change
Differential Revision: http://reviews.llvm.org/D14743
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@253349 91177308-0d34-0410-b5e6-96231b3b80d8
There are several dodgy costings due to AVX1 legalizing 256-bit integer vectors that need fixing.
As discussed in D8690.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@249983 91177308-0d34-0410-b5e6-96231b3b80d8
There are several dodgy costings due to AVX1 legalizing 256-bit integer vectors that need fixing.
As discussed in D8690.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@249981 91177308-0d34-0410-b5e6-96231b3b80d8
The XOP shifts just have logical/arithmetic versions and the left/right shifts are controlled by whether the value is positive/negative. Because of this I've added new X86ISD nodes instead of trying to force them to use the existing shift nodes.
Additionally Excavator cores (bdver4) support XOP and AVX2 - meaning that it should use the AVX2 shifts when it can and fall back to XOP in other cases.
Differential Revision: http://reviews.llvm.org/D8690
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@248878 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
We are not scalarizing the wide selects in codegen for i16 and i32 and
therefore we can remove the amortization factor. We still have issues
with i64 vectors in codegen though.
Reviewers: mcrosier
Subscribers: mcrosier, aemerson, llvm-commits, rengolin
Differential Revision: http://reviews.llvm.org/D12724
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@247156 91177308-0d34-0410-b5e6-96231b3b80d8
Pre-P8, when we generate code for unaligned vector loads (for Altivec and QPX
types), even when accounting for the combining that takes place for multiple
consecutive such loads, there is at least one load instructions and one
permutation for each load. Make sure the cost reported reflects the cost of the
permutes as well.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@246807 91177308-0d34-0410-b5e6-96231b3b80d8
I'm adding a regression test to better cover code generation for unaligned
vector loads and stores, but there's no functional change to the code
generation here. There is an improvement to the cost model for unaligned vector
loads and stores, mostly for QPX (for which we were not previously accounting
for the permutation-based loads), and the cost model implementation is cleaner.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@246712 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
This change limits the minimum cost of an insert/extract
element operation to 2 in cases where this would result
in mixing of NEON and VFP code.
Reviewers: rengolin
Subscribers: mssimpso, aemerson, llvm-commits, rengolin
Differential Revision: http://reviews.llvm.org/D12030
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@245225 91177308-0d34-0410-b5e6-96231b3b80d8
This patch vectorizes the v2i64/v4i64 ASHR shift operations - the last remaining integer vector shifts that are still being transferred to/from the scalar unit to be completed.
Differential Revision: http://reviews.llvm.org/D11439
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@243569 91177308-0d34-0410-b5e6-96231b3b80d8
r243250 appeared to break clang/test/Analysis/dead-store.c on one of the build
slaves, but I couldn't reproduce this failure locally. Probably a false
positive as I saw this test was broken by r243246 or r243247 too but passed
later without people fixing anything.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@243253 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
This patch updates TargetTransformInfoImplCRTPBase::getGEPCost to consider
addressing modes. It now returns TCC_Free when the GEP can be completely folded
to an addresing mode.
I started this patch as I refactored SLSR. Function isGEPFoldable looks common
and is indeed used by some WIP of mine. So I extracted that logic to getGEPCost.
Furthermore, I noticed getGEPCost wasn't directly tested anywhere. The best
testing bed seems CostModel, but its getInstructionCost method invokes
getAddressComputationCost for GEPs which provides very coarse estimation. So
this patch also makes getInstructionCost call the updated getGEPCost for GEPs.
This change inevitably breaks some tests because the cost model changes, but
nothing looks seriously wrong -- if we believe the new cost model is the right
way to go, these tests should be updated.
This patch is not perfect yet -- the comments in some tests need to be updated.
I want to know whether this is a right approach before fixing those details.
Reviewers: chandlerc, hfinkel
Subscribers: aschwaighofer, llvm-commits, aemerson
Differential Revision: http://reviews.llvm.org/D9819
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@243250 91177308-0d34-0410-b5e6-96231b3b80d8
While the v4i32 shl operation is already vectorized using a cvttps2dq/pmulld pattern, the lshr/ashr opeations are still scalarized.
This patch adds vectorization support for non-uniform v4i32 shift operations - it splats constant shift amounts to allow them to use the immediate sse shift instructions, or extracts/zero-extends non-constant shift amounts. The individual results are then blended together.
Differential Revision: http://reviews.llvm.org/D11063
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@241989 91177308-0d34-0410-b5e6-96231b3b80d8
Merged separate (but equivalent) SSE2/AVX512F tests.
Removed codegen tests since these are already done better in test/CodeGen/X86.
The actual cost values still need to be updated to match recent codegen improvements.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@240219 91177308-0d34-0410-b5e6-96231b3b80d8
This patch ensures that SHL/SRL/SRA shifts for i8 and i16 vectors avoid scalarization. It builds on the existing i8 SHL vectorized implementation of moving the shift bits up to the sign bit position and separating the 4, 2 & 1 bit shifts with several improvements:
1 - SSE41 targets can use (v)pblendvb directly with the sign bit instead of performing a comparison to feed into a VSELECT node.
2 - pre-SSE41 targets were masking + comparing with an 0x80 constant - we avoid this by using the fact that a set sign bit means a negative integer which can be compared against zero to then feed into VSELECT, avoiding the need for a constant mask (zero generation is much cheaper).
3 - SRA i8 needs to be unpacked to the upper byte of a i16 so that the i16 psraw instruction can be correctly used for sign extension - we have to do more work than for SHL/SRL but perf tests indicate that this is still beneficial.
The i16 implementation is similar but simpler than for i8 - we have to do 8, 4, 2 & 1 bit shifts but less shift masking is involved. SSE41 use of (v)pblendvb requires that the i16 shift amount is splatted to both bytes however.
Tested on SSE2, SSE41 and AVX machines.
Differential Revision: http://reviews.llvm.org/D9474
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@239509 91177308-0d34-0410-b5e6-96231b3b80d8
Essentially the same as the GEP change in r230786.
A similar migration script can be used to update test cases, though a few more
test case improvements/changes were required this time around: (r229269-r229278)
import fileinput
import sys
import re
pat = re.compile(r"((?:=|:|^)\s*load (?:atomic )?(?:volatile )?(.*?))(| addrspace\(\d+\) *)\*($| *(?:%|@|null|undef|blockaddress|getelementptr|addrspacecast|bitcast|inttoptr|\[\[[a-zA-Z]|\{\{).*$)")
for line in sys.stdin:
sys.stdout.write(re.sub(pat, r"\1, \2\3*\4", line))
Reviewers: rafael, dexonsmith, grosser
Differential Revision: http://reviews.llvm.org/D7649
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@230794 91177308-0d34-0410-b5e6-96231b3b80d8
One of several parallel first steps to remove the target type of pointers,
replacing them with a single opaque pointer type.
This adds an explicit type parameter to the gep instruction so that when the
first parameter becomes an opaque pointer type, the type to gep through is
still available to the instructions.
* This doesn't modify gep operators, only instructions (operators will be
handled separately)
* Textual IR changes only. Bitcode (including upgrade) and changing the
in-memory representation will be in separate changes.
* geps of vectors are transformed as:
getelementptr <4 x float*> %x, ...
->getelementptr float, <4 x float*> %x, ...
Then, once the opaque pointer type is introduced, this will ultimately look
like:
getelementptr float, <4 x ptr> %x
with the unambiguous interpretation that it is a vector of pointers to float.
* address spaces remain on the pointer, not the type:
getelementptr float addrspace(1)* %x
->getelementptr float, float addrspace(1)* %x
Then, eventually:
getelementptr float, ptr addrspace(1) %x
Importantly, the massive amount of test case churn has been automated by
same crappy python code. I had to manually update a few test cases that
wouldn't fit the script's model (r228970,r229196,r229197,r229198). The
python script just massages stdin and writes the result to stdout, I
then wrapped that in a shell script to handle replacing files, then
using the usual find+xargs to migrate all the files.
update.py:
import fileinput
import sys
import re
ibrep = re.compile(r"(^.*?[^%\w]getelementptr inbounds )(((?:<\d* x )?)(.*?)(| addrspace\(\d\)) *\*(|>)(?:$| *(?:%|@|null|undef|blockaddress|getelementptr|addrspacecast|bitcast|inttoptr|\[\[[a-zA-Z]|\{\{).*$))")
normrep = re.compile( r"(^.*?[^%\w]getelementptr )(((?:<\d* x )?)(.*?)(| addrspace\(\d\)) *\*(|>)(?:$| *(?:%|@|null|undef|blockaddress|getelementptr|addrspacecast|bitcast|inttoptr|\[\[[a-zA-Z]|\{\{).*$))")
def conv(match, line):
if not match:
return line
line = match.groups()[0]
if len(match.groups()[5]) == 0:
line += match.groups()[2]
line += match.groups()[3]
line += ", "
line += match.groups()[1]
line += "\n"
return line
for line in sys.stdin:
if line.find("getelementptr ") == line.find("getelementptr inbounds"):
if line.find("getelementptr inbounds") != line.find("getelementptr inbounds ("):
line = conv(re.match(ibrep, line), line)
elif line.find("getelementptr ") != line.find("getelementptr ("):
line = conv(re.match(normrep, line), line)
sys.stdout.write(line)
apply.sh:
for name in "$@"
do
python3 `dirname "$0"`/update.py < "$name" > "$name.tmp" && mv "$name.tmp" "$name"
rm -f "$name.tmp"
done
The actual commands:
From llvm/src:
find test/ -name *.ll | xargs ./apply.sh
From llvm/src/tools/clang:
find test/ -name *.mm -o -name *.m -o -name *.cpp -o -name *.c | xargs -I '{}' ../../apply.sh "{}"
From llvm/src/tools/polly:
find test/ -name *.ll | xargs ./apply.sh
After that, check-all (with llvm, clang, clang-tools-extra, lld,
compiler-rt, and polly all checked out).
The extra 'rm' in the apply.sh script is due to a few files in clang's test
suite using interesting unicode stuff that my python script was throwing
exceptions on. None of those files needed to be migrated, so it seemed
sufficient to ignore those cases.
Reviewers: rafael, dexonsmith, grosser
Differential Revision: http://reviews.llvm.org/D7636
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@230786 91177308-0d34-0410-b5e6-96231b3b80d8
First, don't combine bit masking into vector shuffles (even ones the
target can handle) once operation legalization has taken place. Custom
legalization of vector shuffles may exist for these patterns (making the
predicate return true) but that custom legalization may in some cases
produce the exact bit math this matches. We only really want to handle
this prior to operation legalization.
However, the x86 backend, in a fit of awesome, relied on this. What it
would do is mark VSELECTs as expand, which would turn them into
arithmetic, which this would then match back into vector shuffles, which
we would then lower properly. Amazing.
Instead, the second change is to teach the x86 backend to directly form
vector shuffles from VSELECT nodes with constant conditions, and to mark
all of the vector types we support lowering blends as shuffles as custom
VSELECT lowering. We still mark the forms which actually support
variable blends as *legal* so that the custom lowering is bypassed, and
the legal lowering can even be used by the vector shuffle legalization
(yes, i know, this is confusing. but that's how the patterns are
written).
This makes the VSELECT lowering much more sensible, and in fact should
fix a bunch of bugs with it. However, as you'll see in the test cases,
right now what it does is point out the *hilarious* deficiency of the
new vector shuffle lowering when it comes to blends. Fortunately, my
very next patch fixes that. I can't submit it yet, because that patch,
somewhat obviously, forms the exact and/or pattern that the DAG combine
is matching here! Without this patch, teaching the vector shuffle
lowering to produce the right code infloops in the DAG combiner. With
this patch alone, we produce terrible code but at least lower through
the right paths. With both patches, all the regressions here should be
fixed, and a bunch of the improvements (like using 2 shufps with no
memory loads instead of 2 andps with memory loads and an orps) will
stay. Win!
There is one other change worth noting here. We had hilariously wrong
vectorization cost estimates for vselect because we fell through to the
code path that assumed all "expand" vector operations are scalarized.
However, the "expand" lowering of VSELECT is vector bit math, most
definitely not scalarized. So now we go back to the correct if horribly
naive cost of "1" for "not scalarized". If anyone wants to add actual
modeling of shuffle costs, that would be cool, but this seems an
improvement on its own. Note the removal of 16 and 32 "costs" for doing
a blend. Even in SSE2 we can blend in fewer than 16 instructions. ;] Of
course, we don't right now because of OMG bad code, but I'm going to fix
that. Next patch. I promise.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@229835 91177308-0d34-0410-b5e6-96231b3b80d8
type erased interface and a single analysis pass rather than an
extremely complex analysis group.
The end result is that the TTI analysis can contain a type erased
implementation that supports the polymorphic TTI interface. We can build
one from a target-specific implementation or from a dummy one in the IR.
I've also factored all of the code into "mix-in"-able base classes,
including CRTP base classes to facilitate calling back up to the most
specialized form when delegating horizontally across the surface. These
aren't as clean as I would like and I'm planning to work on cleaning
some of this up, but I wanted to start by putting into the right form.
There are a number of reasons for this change, and this particular
design. The first and foremost reason is that an analysis group is
complete overkill, and the chaining delegation strategy was so opaque,
confusing, and high overhead that TTI was suffering greatly for it.
Several of the TTI functions had failed to be implemented in all places
because of the chaining-based delegation making there be no checking of
this. A few other functions were implemented with incorrect delegation.
The message to me was very clear working on this -- the delegation and
analysis group structure was too confusing to be useful here.
The other reason of course is that this is *much* more natural fit for
the new pass manager. This will lay the ground work for a type-erased
per-function info object that can look up the correct subtarget and even
cache it.
Yet another benefit is that this will significantly simplify the
interaction of the pass managers and the TargetMachine. See the future
work below.
The downside of this change is that it is very, very verbose. I'm going
to work to improve that, but it is somewhat an implementation necessity
in C++ to do type erasure. =/ I discussed this design really extensively
with Eric and Hal prior to going down this path, and afterward showed
them the result. No one was really thrilled with it, but there doesn't
seem to be a substantially better alternative. Using a base class and
virtual method dispatch would make the code much shorter, but as
discussed in the update to the programmer's manual and elsewhere,
a polymorphic interface feels like the more principled approach even if
this is perhaps the least compelling example of it. ;]
Ultimately, there is still a lot more to be done here, but this was the
huge chunk that I couldn't really split things out of because this was
the interface change to TTI. I've tried to minimize all the other parts
of this. The follow up work should include at least:
1) Improving the TargetMachine interface by having it directly return
a TTI object. Because we have a non-pass object with value semantics
and an internal type erasure mechanism, we can narrow the interface
of the TargetMachine to *just* do what we need: build and return
a TTI object that we can then insert into the pass pipeline.
2) Make the TTI object be fully specialized for a particular function.
This will include splitting off a minimal form of it which is
sufficient for the inliner and the old pass manager.
3) Add a new pass manager analysis which produces TTI objects from the
target machine for each function. This may actually be done as part
of #2 in order to use the new analysis to implement #2.
4) Work on narrowing the API between TTI and the targets so that it is
easier to understand and less verbose to type erase.
5) Work on narrowing the API between TTI and its clients so that it is
easier to understand and less verbose to forward.
6) Try to improve the CRTP-based delegation. I feel like this code is
just a bit messy and exacerbating the complexity of implementing
the TTI in each target.
Many thanks to Eric and Hal for their help here. I ended up blocked on
this somewhat more abruptly than I expected, and so I appreciate getting
it sorted out very quickly.
Differential Revision: http://reviews.llvm.org/D7293
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@227669 91177308-0d34-0410-b5e6-96231b3b80d8
AVX2 is available.
According to IACA, the new lowering has a throughput of 8 cycles instead of 13
with the previous one.
Althought this lowering kicks in some SPECs benchmarks, the performance
improvement was within the noise.
Correctness testing has been done for the whole range of uint32_t with the
following program:
uint4 v = (uint4) {0,1,2,3};
uint32_t i;
//Check correctness over entire range for uint4 -> float4 conversion
for( i = 0; i < 1U << (32-2); i++ )
{
float4 t = test(v);
float4 c = correct(v);
if( 0xf != _mm_movemask_ps( t == c ))
{
printf( "Error @ %vx: %vf vs. %vf\n", v, c, t);
return -1;
}
v += 4;
}
Where "correct" is the old lowering and "test" the new one.
The patch adds a test case for the two custom lowering instruction.
It also modifies the vector cost model, which is why cast.ll and uitofp.ll are
modified.
2009-02-26-MachineLICMBug.ll is also modified because we now hoist 7
instructions instead of 4 (3 more constant loads).
rdar://problem/18153096>
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221657 91177308-0d34-0410-b5e6-96231b3b80d8
The default implementation of getCmpSelInstrCost, which provides the cost of
icmp/fcmp/select instructions, did not deal sensibly with illegal vector types
that were scalarized. We'd ask for the legalization cost of the vector type,
which would return something like (4, f64) given an input of <4 x double>, and
we'd then check the TLI status of the ISD opcode on that scalar type. This would
result in querying (ISD::VSELECT, f64), for example. Amusingly enough,
ISD::VSELECT on scalar types is marked as Legal by default (as with most other
operations), and most backends never change this because VSELECT is never
generated on scalars. However, seeing the resulting operation as Legal, we'd
neglect to add the scalarization cost before returning. The result is that we'd
grossly under-estimate the cost of cmps/selects on illegal vector types.
Now, if type legalization clearly results in scalarization, we skip the early
return and add the scalarization cost.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@217859 91177308-0d34-0410-b5e6-96231b3b80d8
Cross-class copies being expensive is actually a trait of the microarchitecture, but as I haven't yet seen an example of a microarchitecture where they're cheap it seems best to just enable this by default, covering the non-mcpu build case.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@217674 91177308-0d34-0410-b5e6-96231b3b80d8
This patch:
1) Improves the cost model for x86 alternate shuffles (originally
added at revision 211339);
2) Teaches the Cost Model Analysis pass how to analyze alternate shuffles.
Alternate shuffles are a special kind of blend; on x86, we can often
easily lowered alternate shuffled into single blend
instruction (depending on the subtarget features).
The existing cost model didn't take into account subtarget features.
Also, it had a couple of "dead" entries for vector types that are never
legal (example: on x86 types v2i32 and v2f32 are not legal; those are
always either promoted or widened to 128-bit vector types).
The new x86 cost model takes into account what target features we have
before returning the shuffle cost (i.e. the number of instructions
after the blend is lowered/expanded).
This patch also teaches the Cost Model Analysis how to identify and analyze
alternate shuffles (i.e. 'SK_Alternate' shufflevector instructions):
- added function 'isAlternateVectorMask';
- added some logic to check if an instruction is a alternate shuffle and, in
case, call the target specific TTI to get the corresponding shuffle cost;
- added a test to verify the cost model analysis on alternate shuffles.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@212296 91177308-0d34-0410-b5e6-96231b3b80d8
This commit starts with a "git mv ARM64 AArch64" and continues out
from there, renaming the C++ classes, intrinsics, and other
target-local objects for consistency.
"ARM64" test directories are also moved, and tests that began their
life in ARM64 use an arm64 triple, those from AArch64 use an aarch64
triple. Both should be equivalent though.
This finishes the AArch64 merge, and everyone should feel free to
continue committing as normal now.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@209577 91177308-0d34-0410-b5e6-96231b3b80d8
BasicTTI::getMemoryOpCost must explicitly check for non-simple types; setting
AllowUnknown=true with TLI->getSimpleValueType is not sufficient because, for
example, non-power-of-two vector types return non-simple EVTs (not MVT::Other).
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@206150 91177308-0d34-0410-b5e6-96231b3b80d8
This provides more realistic costs for the insert/extractelement instructions
(which are load/store pairs), accounts for the cheap unaligned Altivec load
sequence, and for unaligned VSX load/stores.
Bad news:
MultiSource/Applications/sgefa/sgefa - 35% slowdown (this will require more investigation)
SingleSource/Benchmarks/McGill/queens - 20% slowdown (we no longer vectorize this, but it was a constant store that was scalarized)
MultiSource/Benchmarks/FreeBench/pcompress2/pcompress2 - 2% slowdown
Good news:
SingleSource/Benchmarks/Shootout/ary3 - 54% speedup
SingleSource/Benchmarks/Shootout-C++/ary - 40% speedup
MultiSource/Benchmarks/Ptrdist/ks/ks - 35% speedup
MultiSource/Benchmarks/FreeBench/neural/neural - 30% speedup
MultiSource/Benchmarks/TSVC/Symbolics-flt/Symbolics-flt - 20% speedup
Unfortunately, estimating the costs of the stack-based scalarization sequences
is hard, and adjusting these costs is like a game of whac-a-mole :( I'll
revisit this again after we have better codegen for vector extloads and
truncstores and unaligned load/stores.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@205658 91177308-0d34-0410-b5e6-96231b3b80d8
When a vector type legalizes to a larger vector type, and the target does not
support the associated extending load (or truncating store), then legalization
will scalarize the load (or store) resulting in an associated scalarization
cost. BasicTTI::getMemoryOpCost needs to account for this.
Between this, and r205487, PowerPC on the P7 with VSX enabled shows:
MultiSource/Benchmarks/PAQ8p/paq8p: 43% speedup
SingleSource/Benchmarks/BenchmarkGame/puzzle: 51% speedup
SingleSource/UnitTests/Vectorizer/gcc-loops 28% speedup
(some of these are new; some of these, such as PAQ8p, just reverse regressions
that VSX support would trigger)
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@205495 91177308-0d34-0410-b5e6-96231b3b80d8
For an cast (extension, etc.), the currently logic predicts a low cost if the
associated operation (keyed on the destination type) is legal (or promoted).
This is not true when the number of values required to legalize the type is
changing. For example, <8 x i16> being sign extended by <8 x i32> is not
generically cheap on PPC with VSX, even though sign extension to v4i32 is
legal, because two output v4i32 values are required compared to the single
v8i16 input value, and without custom logic in the target, this conversion will
scalarize.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@205487 91177308-0d34-0410-b5e6-96231b3b80d8
This adds a second implementation of the AArch64 architecture to LLVM,
accessible in parallel via the "arm64" triple. The plan over the
coming weeks & months is to merge the two into a single backend,
during which time thorough code review should naturally occur.
Everything will be easier with the target in-tree though, hence this
commit.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@205090 91177308-0d34-0410-b5e6-96231b3b80d8
the legalization cost must be included to get an accurate
estimation of the total cost of the scalarized vector.
The inaccurate cost triggered unprofitable SLP vectorization on
32-bit X86.
Summary:
Include legalization overhead when computing scalarization cost
Reviewers: hfinkel, nadav
CC: chandlerc, rnk, llvm-commits
Differential Revision: http://llvm-reviews.chandlerc.com/D2992
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@203509 91177308-0d34-0410-b5e6-96231b3b80d8
'OK_NonUniformConstValue' to identify operands which are constants but
not constant splats.
The cost model now allows returning 'OK_NonUniformConstValue'
for non splat operands that are instances of ConstantVector or
ConstantDataVector.
With this change, targets are now able to compute different costs
for instructions with non-uniform constant operands.
For example, On X86 the cost of a vector shift may vary depending on whether
the second operand is a uniform or non-uniform constant.
This patch applies the following changes:
- The cost model computation now takes into account non-uniform constants;
- The cost of vector shift instructions has been improved in
X86TargetTransformInfo analysis pass;
- BBVectorize, SLPVectorizer and LoopVectorize now know how to distinguish
between non-uniform and uniform constant operands.
Added a new test to verify that the output of opt
'-cost-model -analyze' is valid in the following configurations: SSE2,
SSE4.1, AVX, AVX2.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@201272 91177308-0d34-0410-b5e6-96231b3b80d8
The most important part of this is probably adding any cost at all for
operations like zext <8 x i8> to <8 x i32>. Before they were being
recorded as extremely costly (24, I believe) which made LLVM fall back
on a 4-wide vectorisation of a loop.
It also rebalances the values for sext, zext and trunc. Lacking any
other sane metric that might work across CPU microarchitectures I went
for instructions. This seems to be in reasonable accord with the rest
of the table (sitofp, ...) though no doubt at least one value is
sub-optimal for some bizarre reason.
Finally, separate AVX and AVX2 values are provided where appropriate.
The CodeGen is quite different in many cases.
rdar://problem/15981990
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@200928 91177308-0d34-0410-b5e6-96231b3b80d8