'OK_NonUniformConstValue' to identify operands which are constants but
not constant splats.
The cost model now allows returning 'OK_NonUniformConstValue'
for non splat operands that are instances of ConstantVector or
ConstantDataVector.
With this change, targets are now able to compute different costs
for instructions with non-uniform constant operands.
For example, On X86 the cost of a vector shift may vary depending on whether
the second operand is a uniform or non-uniform constant.
This patch applies the following changes:
- The cost model computation now takes into account non-uniform constants;
- The cost of vector shift instructions has been improved in
X86TargetTransformInfo analysis pass;
- BBVectorize, SLPVectorizer and LoopVectorize now know how to distinguish
between non-uniform and uniform constant operands.
Added a new test to verify that the output of opt
'-cost-model -analyze' is valid in the following configurations: SSE2,
SSE4.1, AVX, AVX2.
llvm-svn: 201272
Ideally only those transform passes that run at -O0 remain enabled,
in reality we get as close as we reasonably can.
Passes are responsible for disabling themselves, it's not the job of
the pass manager to do it for them.
llvm-svn: 200892
This reverts commit r200576. It broke 32-bit self-host builds by
vectorizing two calls to @llvm.bswap.i64, which we then fail to expand.
llvm-svn: 200602
transform accordingly. Based on similar code from Loop vectorization.
Subsequent commits will include vectorization of function calls to
vector intrinsics and form function calls to vector library calls.
Patch by Raul Silvera! (Much delayed due to my not running dcommit)
llvm-svn: 200576
can be used by both the new pass manager and the old.
This removes it from any of the virtual mess of the pass interfaces and
lets it derive cleanly from the DominatorTreeBase<> template. In turn,
tons of boilerplate interface can be nuked and it turns into a very
straightforward extension of the base DominatorTree interface.
The old analysis pass is now a simple wrapper. The names and style of
this split should match the split between CallGraph and
CallGraphWrapperPass. All of the users of DominatorTree have been
updated to match using many of the same tricks as with CallGraph. The
goal is that the common type remains the resulting DominatorTree rather
than the pass. This will make subsequent work toward the new pass
manager significantly easier.
Also in numerous places things became cleaner because I switched from
re-running the pass (!!! mid way through some other passes run!!!) to
directly recomputing the domtree.
llvm-svn: 199104
directory. These passes are already defined in the IR library, and it
doesn't make any sense to have the headers in Analysis.
Long term, I think there is going to be a much better way to divide
these matters. The dominators code should be fully separated into the
abstract graph algorithm and have that put in Support where it becomes
obvious that evn Clang's CFGBlock's can use it. Then the verifier can
manually construct dominance information from the Support-driven
interface while the Analysis library can provide a pass which both
caches, reconstructs, and supports a nice update API.
But those are very long term, and so I don't want to leave the really
confusing structure until that day arrives.
llvm-svn: 199082
subsequent changes are easier to review. About to fix some layering
issues, and wanted to separate out the necessary churn.
Also comment and sink the include of "Windows.h" in three .inc files to
match the usage in Memory.inc.
llvm-svn: 198685
We use CSEBlocks to initialize a worklist:
SmallVector<BasicBlock *, 8> CSEWorkList(CSEBlocks.begin(), CSEBlocks.end());
so it must have a deterministic order.
llvm-svn: 196520
We were creating external uses for scalar values in MustGather entries that also
had a ScalarToTreeEntry (they also are present in a vectorized tuple). This
meant we would keep a value 'alive' as a scalar and vectorized causing havoc.
This is not necessary because when we create a MustGather vector we explicitly
create external uses entries for the insertelement instructions of the
MustGather vector elements.
Fixes PR18129.
radar://15582184
llvm-svn: 196508
we generate PHI nodes with multiple entries from the same basic block but
with different values. Enabling CSE on ExtractElement instructions make sure
that all of the RAUWed instructions are the same.
llvm-svn: 195773
SLP vectorization. Based on the code in BBVectorizer.
Fixes PR17741.
Patch by Raul Silvera, reviewed by Hal and Nadav. Reformatted by my
driving of clang-format. =]
llvm-svn: 195528
We are slicing an array of Value pointers and process those slices in a loop.
The problem is that we might invalidate a later slice by vectorizing a former
slice.
Use a WeakVH to track the pointer. If the pointer is deleted or RAUW'ed we can
tell.
The test case will only fail when running with libgmalloc.
radar://15498655
llvm-svn: 195162
Instead of doing a RPO traversal of the whole function remember the blocks
containing gathers (typically <= 2) and scan them in dominator-first order.
The actual CSE is still quadratic, but I'm not confident that adding a
scoped hash table here is worth it as we're only looking at the generated
instructions and not arbitrary code.
llvm-svn: 193956
By vectorizing a series of srl, or, ... instructions we have obfuscated the
intention so much that the backend does not know how to fold this code away.
radar://15336950
llvm-svn: 193573
No test case, because with the current cost model we don't see a difference.
An upcoming ARM memory cost model change will expose and test this bug.
radar://15332579
llvm-svn: 193572
Before this patch we relied on the order of phi nodes when we looked for phi
nodes of the same type. This could prevent vectorization of cases where there
was a phi node of a second type in between phi nodes of some type.
This is important for vectorization of an internal graphics kernel. On the test
suite + external on x86_64 (and on a run on armv7s) it showed no impact on
either performance or compile time.
radar://15024459
llvm-svn: 192537
Sort the operands of the other entries in the current vectorization root
according to the first entry's operands opcodes.
%conv0 = uitofp ...
%load0 = load float ...
= fmul %conv0, %load0
= fmul %load0, %conv1
= fmul %load0, %conv2
Make sure that we recursively vectorize <%conv0, %conv1, %conv2> and <%load0,
%load0, %load0>.
This makes it more likely to obtain vectorizable trees. We have to be careful
when we sort that we don't destroy 'good' existing ordering implied by source
order.
radar://15080067
llvm-svn: 191977
This recursively strips all GEPs like the existing code. It also handles bitcasts and
other operations that do not change the pointer value.
llvm-svn: 191847
Inspired by the object from the SLPVectorizer. This found a minor bug in the
debug loc restoration in the vectorizer where the location of a following
instruction was attached instead of the location from the original instruction.
llvm-svn: 191673
We were previously using getFirstInsertionPt to insert PHI
instructions when vectorizing, but getFirstInsertionPt also skips past
landingpads, causing this to generate invalid IR.
We can avoid this issue by using getFirstNonPHI instead.
llvm-svn: 191526
Put them under a separate flag for experimentation. They are more likely to
interfere with loop vectorization which happens later in the pass pipeline.
llvm-svn: 191371
Reapply r191108 with a fix for a memory corruption error I introduced. Of
course, we can't reference the scalars that we replace by vectorizing and then
call their eraseFromParent method. I only 'needed' the scalars to get the
DebugLoc. Just store the DebugLoc before actually vectorizing instead. As a nice
side effect, this also simplifies the interface between BoUpSLP and the
HorizontalReduction class to returning a value pointer (the vectorized tree
root).
radar://14607682
llvm-svn: 191123
Match reductions starting at binary operation feeding into a phi. The code
handles trees like
r += v1 + v2 + v3 ...
and
r += v1
r += v2
...
and
r *= v1 + v2 + ...
We currently only handle associative operations (add, fadd fast).
The code can now also handle reductions feeding into stores.
a[i] = v1 + v2 + v3 + ...
The code is currently disabled behind the flag "-slp-vectorize-hor". The cost
model for most architectures is not there yet.
I found one opportunity of a horizontal reduction feeding a phi in TSVC
(LoopRerolling-flt) and there are several opportunities where reductions feed
into stores.
radar://14607682
llvm-svn: 191108
XCore target: Add XCoreTargetTransformInfo
This is where getNumberOfRegisters() resides, which in turn returns the
number of vector registers (=0).
llvm-svn: 190936
We can't insert an insertelement after an invoke. We would have to split a
critical edge. So when we see a phi node that uses an invoke we just give up.
radar://14990770
llvm-svn: 190871
1) If the width of vectorization list candidate is bigger than vector reg width, we will break it down to fit the vector reg.
2) We do not vectorize the width which is not power of two.
The performance result shows it will help some spec benchmarks. mesa improved 6.97% and ammp improved 1.54%.
llvm-svn: 189830
The builder inserts from before the insert point,
not after, so this would insert before the last
instruction in the bundle instead of after it.
I'm not sure if this can actually be a problem
with any of the current insertions.
llvm-svn: 189285