r306381 caused PR33613, by reversing the order in which insertelements were
generated per unroll part. This patch fixes PR33613 by retraining this order,
placing each set of insertelements per part immediately after the last scalar
being packed for this part. Includes a test case derived from PR33613.
Reference: https://bugs.llvm.org/show_bug.cgi?id=33613
Differential Revision: https://reviews.llvm.org/D34760
llvm-svn: 306575
Undoing revert 306338 after fixed bug: add metadata to the load instead of the
reverse shuffle added to it, retaining the original ValueMap implementation.
llvm-svn: 306381
Summary:
vectorizer-maximize-bandwidth is generally useful in terms of performance. I've tested the impact of changing this to default on speccpu benchmarks on sandybridge machines. The result shows non-negative impact:
spec/2006/fp/C++/444.namd 26.84 -0.31%
spec/2006/fp/C++/447.dealII 46.19 +0.89%
spec/2006/fp/C++/450.soplex 42.92 -0.44%
spec/2006/fp/C++/453.povray 38.57 -2.25%
spec/2006/fp/C/433.milc 24.54 -0.76%
spec/2006/fp/C/470.lbm 41.08 +0.26%
spec/2006/fp/C/482.sphinx3 47.58 -0.99%
spec/2006/int/C++/471.omnetpp 22.06 +1.87%
spec/2006/int/C++/473.astar 22.65 -0.12%
spec/2006/int/C++/483.xalancbmk 33.69 +4.97%
spec/2006/int/C/400.perlbench 33.43 +1.70%
spec/2006/int/C/401.bzip2 23.02 -0.19%
spec/2006/int/C/403.gcc 32.57 -0.43%
spec/2006/int/C/429.mcf 40.35 +0.27%
spec/2006/int/C/445.gobmk 26.96 +0.06%
spec/2006/int/C/456.hmmer 24.4 +0.19%
spec/2006/int/C/458.sjeng 27.91 -0.08%
spec/2006/int/C/462.libquantum 57.47 -0.20%
spec/2006/int/C/464.h264ref 46.52 +1.35%
geometric mean +0.29%
The regression on 453.povray seems real, but is due to secondary effects as all hot functions are bit-identical with and without the flag.
I started this patch to consult upstream opinions on this. It will be greatly appreciated if the community can help test the performance impact of this change on other architectures so that we can decided if this should be target-dependent.
Reviewers: hfinkel, mkuper, davidxl, chandlerc
Reviewed By: chandlerc
Subscribers: rengolin, sanjoy, javed.absar, bjope, dorit, magabari, RKSimon, llvm-commits, mzolotukhin
Differential Revision: https://reviews.llvm.org/D33341
llvm-svn: 306336
Instead of providing access to the internal MapStorage holding all Values
associated with a given Key, used for setting or resetting them all together,
ValueMap keeps its MapStorage internal; its new interface allows getting,
setting or resetting a single Value, per part or per part-and-lane.
Follows the discussion in https://reviews.llvm.org/D32871.
Differential Revision: https://reviews.llvm.org/D34473
llvm-svn: 306331
Summary:
vectorizer-maximize-bandwidth is generally useful in terms of performance. I've tested the impact of changing this to default on speccpu benchmarks on sandybridge machines. The result shows non-negative impact:
spec/2006/fp/C++/444.namd 26.84 -0.31%
spec/2006/fp/C++/447.dealII 46.19 +0.89%
spec/2006/fp/C++/450.soplex 42.92 -0.44%
spec/2006/fp/C++/453.povray 38.57 -2.25%
spec/2006/fp/C/433.milc 24.54 -0.76%
spec/2006/fp/C/470.lbm 41.08 +0.26%
spec/2006/fp/C/482.sphinx3 47.58 -0.99%
spec/2006/int/C++/471.omnetpp 22.06 +1.87%
spec/2006/int/C++/473.astar 22.65 -0.12%
spec/2006/int/C++/483.xalancbmk 33.69 +4.97%
spec/2006/int/C/400.perlbench 33.43 +1.70%
spec/2006/int/C/401.bzip2 23.02 -0.19%
spec/2006/int/C/403.gcc 32.57 -0.43%
spec/2006/int/C/429.mcf 40.35 +0.27%
spec/2006/int/C/445.gobmk 26.96 +0.06%
spec/2006/int/C/456.hmmer 24.4 +0.19%
spec/2006/int/C/458.sjeng 27.91 -0.08%
spec/2006/int/C/462.libquantum 57.47 -0.20%
spec/2006/int/C/464.h264ref 46.52 +1.35%
geometric mean +0.29%
The regression on 453.povray seems real, but is due to secondary effects as all hot functions are bit-identical with and without the flag.
I started this patch to consult upstream opinions on this. It will be greatly appreciated if the community can help test the performance impact of this change on other architectures so that we can decided if this should be target-dependent.
Reviewers: hfinkel, mkuper, davidxl, chandlerc
Reviewed By: chandlerc
Subscribers: rengolin, sanjoy, javed.absar, bjope, dorit, magabari, RKSimon, llvm-commits, mzolotukhin
Differential Revision: https://reviews.llvm.org/D33341
llvm-svn: 305960
Summary:
Existing heuristic uses the ratio between the function entry
frequency and the loop invocation frequency to find cold loops. However,
even if the loop executes frequently, if it has a small trip count per
each invocation, vectorization is not beneficial. On the other hand,
even if the loop invocation frequency is much smaller than the function
invocation frequency, if the trip count is high it is still beneficial
to vectorize the loop.
This patch uses estimated trip count computed from the profile metadata
as a primary metric to determine coldness of the loop. If the estimated
trip count cannot be computed, it falls back to the original heuristics.
Reviewers: Ayal, mssimpso, mkuper, danielcdh, wmi, tejohnson
Reviewed By: tejohnson
Subscribers: tejohnson, mzolotukhin, llvm-commits
Differential Revision: https://reviews.llvm.org/D32451
llvm-svn: 305729
If we're shrinking a binary operation, it may be the case that the new
operations wraps where the old didn't. If this happens, the behavior
should be well-defined. So, we can't always carry wrapping flags with us
when we shrink operations.
If we do, we get incorrect optimizations in cases like:
void foo(const unsigned char *from, unsigned char *to, int n) {
for (int i = 0; i < n; i++)
to[i] = from[i] - 128;
}
which gets optimized to:
void foo(const unsigned char *from, unsigned char *to, int n) {
for (int i = 0; i < n; i++)
to[i] = from[i] | 128;
}
Because:
- InstCombine turned `sub i32 %from.i, 128` into
`add nuw nsw i32 %from.i, 128`.
- LoopVectorize vectorized the add to be `add nuw nsw <16 x i8>` with a
vector full of `i8 128`s
- InstCombine took advantage of the fact that the newly-shrunken add
"couldn't wrap", and changed the `add` to an `or`.
InstCombine seems happy to figure out whether we can add nuw/nsw on its
own, so I just decided to drop the flags. There are already a number of
places in LoopVectorize where we rely on InstCombine to clean up.
llvm-svn: 305053
Following the request made in https://reviews.llvm.org/D32871,
scalarizeInstruction() which is no longer overridden by InnerLoopUnroller is
hereby made non-virtual in InnerLoopVectorizer.
Should have been part of r297580 originally.
llvm-svn: 304685
r303763 caused build failures in some out-of-tree tests due to an assertion in
TTI. The original patch updated cost estimates for induction variable update
instructions marked for scalarization. However, it didn't consider that the
incoming value of an induction variable phi node could be a cast instruction.
This caused queries for cast instruction costs with a mix of vector and scalar
types. This patch includes a fix for cast instructions and the test case from
PR33193.
The fix was suggested by Jonas Paulsson <paulsson@linux.vnet.ibm.com>.
Reference: https://bugs.llvm.org/show_bug.cgi?id=33193
Original Differential Revision: https://reviews.llvm.org/D33457
llvm-svn: 304235
For non-uniform instructions marked for scalarization, we should update
`VectorTy` when computing instruction costs to reflect the scalar type. In
addition to determining instruction costs, this type is also used to signal
that all instructions in the loop will be scalarized. This currently affects
memory instructions and non-pointer induction variables and their updates. (We
also mark GEPs scalar after vectorization, but their cost is computed together
with memory instructions.) For scalarized induction updates, this patch also
scales the scalar cost by the vectorization factor, corresponding to each
induction step.
llvm-svn: 303763
The loop vectorizer usually vectorizes any instruction it can and then
extracts the elements for a scalarized use. On SystemZ, all elements
containing addresses must be extracted into address registers (GRs). Since
this extraction is not free, it is better to have the address in a suitable
register to begin with. By forcing address arithmetic instructions and loads
of addresses to be scalar after vectorization, two benefits result:
* No need to extract the register
* LSR optimizations trigger (LSR isn't handling vector addresses currently)
Benchmarking show improvements on SystemZ with this new behaviour.
Any other target could try this by returning false in the new hook
prefersVectorizedAddressing().
Review: Renato Golin, Elena Demikhovsky, Ulrich Weigand
https://reviews.llvm.org/D32422
llvm-svn: 303744
The default behavior of -Rpass-analysis=loop-vectorizer is to report only the
first reason encountered for not vectorizing, if one is found, at which time the
vectorizer aborts its handling of the loop. This patch allows multiple reasons
for not vectorizing to be identified and reported, at the potential expense of
additional compile-time, under allowExtraAnalysis which can currently be turned
on by Clang's -fsave-optimization-record and opt's -pass-remarks-missed.
Removed from LoopVectorizationLegality::canVectorize() the redundant checking
and reporting if we CantComputeNumberOfIterations, as LAI::canAnalyzeLoop() also
does that. This redundancy is caught by a lit test once multiple reasons are
reported.
Patch initially developed by Dror Barak.
Differential Revision: https://reviews.llvm.org/D33396
llvm-svn: 303613
Introduce LoopVectorizationPlanner.executePlan(), replacing ILV.vectorize() and
refactoring ILV.vectorizeLoop(). Method collectDeadInstructions() is moved from
ILV to LVP. These changes facilitate building VPlans and using them to generate
code, following https://reviews.llvm.org/D28975 and its tentative breakdown.
Method ILV.createEmptyLoop() is renamed ILV.createVectorizedLoopSkeleton() to
improve clarity; it's contents remain intact.
Differential Revision: https://reviews.llvm.org/D32200
llvm-svn: 302790
Summary:
In first order recurrence vectorization, when the previous value is a phi node, we need to
set the insertion point to the first non-phi node.
We can have the previous value being a phi node, due to the generation of new
IVs as part of trunc optimization [1].
[1] https://reviews.llvm.org/rL294967
Reviewers: mssimpso, mkuper
Subscribers: mzolotukhin, llvm-commits
Differential Revision: https://reviews.llvm.org/D32969
llvm-svn: 302532
- This change allows targets to opt-in to using them instead of the log2
shufflevector algorithm.
- The SLP and Loop vectorizers have the common code to do shuffle reductions
factored out into LoopUtils, and now have a unified interface for generating
reductions regardless of the preference of the target. LoopUtils now uses TTI
to determine what kind of reductions the target wants to handle.
- For CodeGen, basic legalization support is added.
Differential Revision: https://reviews.llvm.org/D30086
llvm-svn: 302514
This patch is part of D28975's breakdown.
Genreating the control-flow to guard predicated instructions modified to
only use SplitBlockAndInsertIfThen() for producing the if-then construct.
Differential Revision: https://reviews.llvm.org/D32224
llvm-svn: 301293
Phi nodes in non-header blocks are converted to select instructions after
if-conversion. This patch updates the cost model to account for the selects.
Differential Revision: https://reviews.llvm.org/D31906
llvm-svn: 300980
This patch is part of D28975's breakdown.
Add caching for block masks similar to the cache already used for edge masks,
replacing generation per user with reusing the first generated value which
dominates all uses.
Differential Revision: https://reviews.llvm.org/D32054
llvm-svn: 300557
This patch is part of D28975's breakdown - no change in output intended.
LV's code currently assumes the vectorized loop is a single basic block up
until predicateInstructions() is called. This patch removes two manifestations
of this assumption (loop phi incoming values, dominator tree update) by
replacing the use of vectorLoopBody with the vectorized loop's latch/header.
Differential Revision: https://reviews.llvm.org/D32040
llvm-svn: 300310
Summary:
In first order recurrences where phi's are used outside the loop,
we should generate an additional vector.extract of the second last element from
the vectorized phi update.
This is because we require the phi itself (which is the value at the second last
iteration of the vector loop) and not the phi's update within the loop.
Also fix the code gen when we just unroll, but don't vectorize.
Fixes PR32396.
Reviewers: mssimpso, mkuper, anemet
Subscribers: llvm-commits, mzolotukhin
Differential Revision: https://reviews.llvm.org/D31979
llvm-svn: 300238
Refactoring InnerLoopVectorizer's vectorizeBlockInLoop() to provide
vectorizeInstruction(). Aligning DeadInstructions with its only user.
Facilitates driving the transformation by VPlan - follows
https://reviews.llvm.org/D28975 and its tentative breakdown.
Differential Revision: https://reviews.llvm.org/D31997
llvm-svn: 300183
The cost for a branch after vectorization is very different depending on if
the vectorizer will if-convert the block (branch is eliminated), or if
scalarized and predicated blocks will be produced (branch duplicated before
each block). There is also the case of remaining scalar branches, such as the
back-edge branch.
This patch handles these cases differently with TTI based cost estimates.
Review: Matthew Simpson
https://reviews.llvm.org/D31175
llvm-svn: 300058
Since SystemZ supports vector element load/store instructions, there is no
need for extracts/inserts if a vector load/store gets scalarized.
This patch lets Target specify that it supports such instructions by means of
a new TTI hook that defaults to false.
The use for this is in the LoopVectorizer getScalarizationOverhead() method,
which will with this patch produce a smaller sum for a vector load/store on
SystemZ.
New test: test/Transforms/LoopVectorize/SystemZ/load-store-scalarization-cost.ll
Review: Adam Nemet
https://reviews.llvm.org/D30680
llvm-svn: 300056
getArithmeticInstrCost(), getShuffleCost(), getCastInstrCost(),
getCmpSelInstrCost(), getVectorInstrCost(), getMemoryOpCost(),
getInterleavedMemoryOpCost() implemented.
Interleaved access vectorization enabled.
BasicTTIImpl::getCastInstrCost() improved to check for legal extending loads,
in which case the cost of the z/sext instruction becomes 0.
Review: Ulrich Weigand, Renato Golin.
https://reviews.llvm.org/D29631
llvm-svn: 300052
In the vectorization of first order recurrence, we vectorize such
that the last element in the vector will be the one extracted to pass into the
scalar remainder loop. However, this is not true when there is a phi (other
than the primary induction variable) is used outside the loop.
In such a case, we need the value from the second last iteration (i.e.
the phi value), not the last iteration (which would be the phi update).
I've added a test case for this. Also see PR32396.
A follow up patch would generate the correct code gen for such cases,
and turn this vectorization on.
Differential Revision: https://reviews.llvm.org/D31910
Reviewers: mssimpso
llvm-svn: 299985
This patch reapplies r298620. The original patch was reverted because of two
issues. First, the patch exposed a bug in InstCombine that caused the Chromium
builds to fail (PR32414). This issue was fixed in r299017. Second, the patch
introduced a bug in the vectorizer's scalars analysis that caused test suite
builds to fail on SystemZ. The scalars analysis was too aggressive and marked a
memory instruction scalar, even though it was going to be vectorized. This
issue has been fixed in the current patch and several new test cases for the
scalars analysis have been added.
llvm-svn: 299770
The vectorizer tries to replace truncations of induction variables with new
induction variables having the smaller type. After r295063, this optimization
was applied to all integer induction variables, including non-primary ones.
When optimizing the truncation of a non-primary induction variable, we still
need to transform the new induction so that it has the correct start value.
This should fix PR32419.
Reference: https://bugs.llvm.org/show_bug.cgi?id=32419
llvm-svn: 298882
Reason: breaks linking Chromium with LLD + ThinLTO (a pass crashes)
LLVM bug: https://bugs.llvm.org//show_bug.cgi?id=32413
Original change description:
[LV] Vectorize GEPs
This patch adds support for vectorizing GEPs. Previously, we only generated
vector GEPs on-demand when creating gather or scatter operations. All GEPs from
the original loop were scalarized by default, and if a pointer was to be stored
to memory, we would have to build up the pointer vector with insertelement
instructions.
With this patch, we will vectorize all GEPs that haven't already been marked
for scalarization.
The patch refines collectLoopScalars to more exactly identify the scalar GEPs.
The function now more closely resembles collectLoopUniforms. And the patch
moves vector GEP creation out of vectorizeMemoryInstruction and into the main
vectorization loop. The vector GEPs needed for gather and scatter operations
will have already been generated before vectoring the memory accesses.
Original Differential Revision: https://reviews.llvm.org/D30710
llvm-svn: 298735
This patch adds support for vectorizing GEPs. Previously, we only generated
vector GEPs on-demand when creating gather or scatter operations. All GEPs from
the original loop were scalarized by default, and if a pointer was to be stored
to memory, we would have to build up the pointer vector with insertelement
instructions.
With this patch, we will vectorize all GEPs that haven't already been marked
for scalarization.
The patch refines collectLoopScalars to more exactly identify the scalar GEPs.
The function now more closely resembles collectLoopUniforms. And the patch
moves vector GEP creation out of vectorizeMemoryInstruction and into the main
vectorization loop. The vector GEPs needed for gather and scatter operations
will have already been generated before vectoring the memory accesses.
Differential Revision: https://reviews.llvm.org/D30710
llvm-svn: 298620
The code for generating scalar base pointers in vectorizeMemoryInstruction is
not needed. We currently scalarize all GEPs and maintain the scalarized values
in VectorLoopValueMap. The GEP cloning in this unneeded code is the same as
that in scalarizeInstruction. The test cases that changed as a result of this
patch changed because we were able to reuse the scalarized GEP that we
previously generated instead of cloning a new one.
Differential Revision: https://reviews.llvm.org/D30587
llvm-svn: 298615
This patch refactors the PHisToFix loop as follows:
- The loop itself now resides in its own method.
- The new method iterates on scalar-loop's header; the PHIsToFix map formerly
propagated as an output parameter and filled during phi widening is removed.
- The code handling reductions is moved into its own method, similar to the
existing fixFirstOrderRecurrence().
Differential Revision: https://reviews.llvm.org/D30755
llvm-svn: 297740
Refactoring Cost Model's selectVectorizationFactor() so that it handles only the
selection of the best VF from a pre-computed range of candidate VF's, extracting
early-exit criteria and the computation of a MaxVF upper-bound to other methods,
all driven by a newly introduced LoopVectorizationPlanner.
Differential Revision: https://reviews.llvm.org/D30653
llvm-svn: 297737
getIntrinsicInstrCost() used to only compute scalarization cost based on types.
This patch improves this so that the actual arguments are checked when they are
available, in order to handle only unique non-constant operands.
Tests updates:
Analysis/CostModel/X86/arith-fp.ll
Transforms/LoopVectorize/AArch64/interleaved_cost.ll
Transforms/LoopVectorize/ARM/interleaved_cost.ll
The improvement in getOperandsScalarizationOverhead() to differentiate on
constants made it necessary to update the interleaved_cost.ll tests even
though they do not relate to intrinsics.
Review: Hal Finkel
https://reviews.llvm.org/D29540
llvm-svn: 297705
This commit is a follow-up on r297580. It fixes the FIXME added temporarily
by that commit to keep the removal of Unroller's specialized version of
scalarizeInstruction() an NFC. See https://reviews.llvm.org/D30715 for details.
llvm-svn: 297610
Unroller's specialized scalarizeInstruction() is mostly duplicating Vectorizer's
variant. OTOH Vectorizer's scalarizeInstruction() already supports the special
case of VF==1 except for avoiding mask-bit extraction in that case. This patch
removes Unroller's specialized version in favor of a unified method.
The only functional difference between the two variants seems to be setting
memcheck metadata for loads and stores only in Vectorizer's variant, which is a
bug in Unroller. To keep this patch an NFC the unified method doesn't set
memcheck metadata for VF==1.
Differential Revision: https://reviews.llvm.org/D30715
llvm-svn: 297580
Summary:
Similar to SmallPtrSet, this makes find and count work with both const
referneces and const pointers.
Reviewers: dblaikie
Subscribers: llvm-commits, mzolotukhin
Differential Revision: https://reviews.llvm.org/D30713
llvm-svn: 297424
Because IRBuilder performs constant-folding, it's not guaranteed that an
instruction in the original loop map to an instruction in the vector loop. It
could map to a constant vector instead. The handling of first-order recurrences
was incorrectly making this assumption when setting the IRBuilder's insert
point.
llvm-svn: 297302