We have several bug reports that could be characterized as "reducing scalarization",
and this topic was also raised on llvm-dev recently:
http://lists.llvm.org/pipermail/llvm-dev/2020-January/138157.html
...so I'm proposing that we deal with these patterns in a new, lightweight IR vector
pass that runs before/after other vectorization passes.
There are 4 alternate options that I can think of to deal with this kind of problem
(and we've seen various attempts at all of these), but they all have flaws:
InstCombine - can't happen without TTI, but we don't want target-specific
folds there.
SDAG - too late to assist other vectorization passes; TLI is not equipped
for these kind of cost queries; limited to a single basic block.
CGP - too late to assist other vectorization passes; would need to re-implement
basic cleanups like CSE/instcombine.
SLP - doesn't fit with existing transforms; limited to a single basic block.
This initial patch/transform is based on existing code in AggressiveInstCombine:
we walk backwards through the function looking for a pattern match. But we diverge
from that cost-independent IR canonicalization pass by using TTI to decide if the
vector alternative is profitable.
We probably have at least 10 similar bug reports/patterns (binops, constants,
inserts, cheap shuffles, etc) that would fit in this pass as follow-up enhancements.
It's possible that we could iterate on a worklist to fix-point like InstCombine does,
but it's safer to start with a most basic case and evolve from there, so I didn't
try to do anything fancy with this initial implementation.
Differential Revision: https://reviews.llvm.org/D73480
from DenseMap to MapVector
The iteration order of LoopVectorizationLegality::Reductions matters for the
final code generation, so we better use MapVector instead of DenseMap for it
to remove the nondeterminacy. reduction-order.ll in the patch is an example
reduced from the case we saw. In the output of opt command, the order of the
select instructions in the vector.body block keeps changing from run to run
currently.
Differential Revision: https://reviews.llvm.org/D73490
The assume intrinsic is intentionally marked as may reading/writing
memory, to avoid passes moving them around. When flattening the CFG
for predicated blocks, we have to drop the assume calls, as they
are control-flow dependent.
There are some cases where we can do better (when control flow is
preserved), but that is follow-up work.
Fixes PR43620.
Reviewers: hsaito, rengolin, dcaballe, Ayal
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D68814
After speaking with Sanjay - seeing a number of miscompiles and working
on tracking down a testcase. None of the follow on patches seem to
have helped so far.
This reverts commit 8a0aa5310bccbb42d16d11db090419fcefdd1376.
After speaking with Sanjay - seeing a number of miscompiles and working
on tracking down a testcase. None of the follow on patches seem to
have helped so far.
This reverts commit 7ff57705ba196ce649d6034614b3b9df57e1f84f.
This reverts commit e511c4b0dff1692c267addf17dce3cebe8f97faa:
Temporarily Revert:
"[SLP] Generalization of stores vectorization."
"[SLP] Fix -Wunused-variable. NFC"
"[SLP] Vectorize jumbled stores."
after fixing the problem with compile time.
We have a vector compare reduction problem seen in PR39665 comment 2:
https://bugs.llvm.org/show_bug.cgi?id=39665#c2
Or slightly reduced here:
define i1 @cmp2(<2 x double> %a0) {
%a = fcmp ogt <2 x double> %a0, <double 1.0, double 1.0>
%b = extractelement <2 x i1> %a, i32 0
%c = extractelement <2 x i1> %a, i32 1
%d = and i1 %b, %c
ret i1 %d
}
SLP would not attempt to turn this into a vector reduction because there is an
artificial lower limit on that transform. We can not completely remove that limit
without inducing regressions though, so this patch just hacks an extra attempt at
creating a 2-way reduction to the end of the analysis.
As shown in the test file, we are still not getting some of the motivating cases,
so follow-on patches will be needed to solve those cases.
Differential Revision: https://reviews.llvm.org/D59710
"[SLP] Generalization of stores vectorization."
"[SLP] Fix -Wunused-variable. NFC"
"[SLP] Vectorize jumbled stores."
As they're causing significant (10-30x) compile time regressions on
vectorizable code.
The primary cause of the compile-time regression is f228b5371647f471853c5fb3e6719823a42fe451.
This reverts commits:
f228b5371647f471853c5fb3e6719823a42fe451
5503455ccb3f5fcedced158332c016c8d3a7fa81
21d498c9c0f32dcab5bc89ac593aa813b533b43a
Stores are vectorized with maximum vectorization factor of 16. Patch
tries to improve the situation and use maximal vectorization factor.
Reviewers: spatel, RKSimon, mkuper, hfinkel
Differential Revision: https://reviews.llvm.org/D43582
Initially SLP vectorizer replaced all going-to-be-vectorized
instructions with Undef values. It may break ScalarEvaluation and may
cause a crash.
Reworked SLP vectorizer so that it does not replace vectorized
instructions by UndefValue anymore. Instead vectorized instructions are
marked for deletion inside if BoUpSLP class and deleted upon class
destruction.
Reviewers: mzolotukhin, mkuper, hfinkel, RKSimon, davide, spatel
Subscribers: RKSimon, Gerolf, anemet, hans, majnemer, llvm-commits, sanjoy
Differential Revision: https://reviews.llvm.org/D29641
llvm-svn: 373166
Summary:
Initially SLP vectorizer replaced all going-to-be-vectorized
instructions with Undef values. It may break ScalarEvaluation and may
cause a crash.
Reworked SLP vectorizer so that it does not replace vectorized
instructions by UndefValue anymore. Instead vectorized instructions are
marked for deletion inside if BoUpSLP class and deleted upon class
destruction.
Reviewers: mzolotukhin, mkuper, hfinkel, RKSimon, davide, spatel
Subscribers: RKSimon, Gerolf, anemet, hans, majnemer, llvm-commits, sanjoy
Differential Revision: https://reviews.llvm.org/D29641
llvm-svn: 372626
Summary:
Fold-tail currently supports reduction last-vector-value live-out's,
but has yet to support last-scalar-value live-outs, including
non-header phi's. As it relies on AllowedExit in order to detect
them and bail out we need to add the non-header PHI nodes to
AllowedExit, otherwise we end up with miscompiles.
Solves https://bugs.llvm.org/show_bug.cgi?id=43166
Reviewers: fhahn, Ayal
Reviewed By: fhahn, Ayal
Subscribers: anna, hiraditya, rkruppe, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D67074
llvm-svn: 370721
assume_safety implies that loads under "if's" can be safely executed
speculatively (unguarded, unmasked). However this assumption holds only for the
original user "if's", not those introduced by the compiler, such as the
fold-tail "if" that guards us from loading beyond the original loop trip-count.
Currently the combination of fold-tail and assume-safety pragmas results in
ignoring the fold-tail predicate that guards the loads, generating unmasked
loads. This patch fixes this behavior.
Differential Revision: https://reviews.llvm.org/D66106
Reviewers: Ayal, hsaito, fhahn
llvm-svn: 368973
This allows folding of the scalar epilogue loop (the tail) into the main
vectorised loop body when the loop is annotated with a "vector predicate"
metadata hint. To fold the tail, instructions need to be predicated (masked),
enabling/disabling lanes for the remainder iterations.
Differential Revision: https://reviews.llvm.org/D65197
llvm-svn: 367592
When considering a loop containing nontemporal stores or loads for
vectorization, suppress the vectorization if the corresponding
vectorized store or load with the aligment of the original scaler
memory op is not supported with the nontemporal hint on the target.
This adds two new functions:
bool isLegalNTStore(Type *DataType, unsigned Alignment) const;
bool isLegalNTLoad(Type *DataType, unsigned Alignment) const;
to TTI, leaving the target independent default implementation as
returning true, but with overriding implementations for X86 that
check the legality based on available Subtarget features.
This fixes https://llvm.org/PR40759
Differential Revision: https://reviews.llvm.org/D61764
llvm-svn: 363581
A function for loop vectorization illegality reporting has been
introduced:
void LoopVectorizationLegality::reportVectorizationFailure(
const StringRef DebugMsg, const StringRef OREMsg,
const StringRef ORETag, Instruction * const I) const;
The function prints a debug message when the debug for the compilation
unit is enabled as well as invokes the optimization report emitter to
generate a message with a specified tag. The function doesn't cover any
complicated logic when a custom lambda should be passed to the emitter,
only generating a message with a tag is supported.
The function always prints the instruction `I` after the debug message
whenever the instruction is specified, otherwise the debug message
ends with a dot: 'LV: Not vectorizing: Disabled/already vectorized.'
Patch by Pavel Samolysov <samolisov@gmail.com>
llvm-svn: 362736
Summary:
Trying to add the plumbing necessary to add tuning options to the new pass manager.
Testing with the flags for loop vectorize.
Reviewers: chandlerc
Subscribers: sanjoy, mehdi_amini, jlebar, steven_wu, dexonsmith, dang, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D59723
llvm-svn: 358763
Summary:
Enable some of the existing size optimizations for cold code under PGO.
A ~5% code size saving in big internal app under PGO.
The way it gets BFI/PSI is discussed in the RFC thread
http://lists.llvm.org/pipermail/llvm-dev/2019-March/130894.html
Note it doesn't currently touch loop passes.
Reviewers: davidxl, eraman
Reviewed By: eraman
Subscribers: mgorny, javed.absar, smeenai, mehdi_amini, eraman, zzheng, steven_wu, dexonsmith, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D59514
llvm-svn: 358422
Loop::setAlreadyUnrolled() and
LoopVectorizeHints::setLoopAlreadyUnrolled() both add loop metadata that
stops the same loop from being transformed multiple times. This patch
merges both implementations.
In doing so we fix 3 potential issues:
* setLoopAlreadyUnrolled() kept the llvm.loop.vectorize/interleave.*
metadata even though it will not be used anymore. This already caused
problems such as http://llvm.org/PR40546. Change the behavior to the
one of setAlreadyUnrolled which deletes this loop metadata.
* setAlreadyUnrolled() used to create a new LoopID by calling
MDNode::get with nullptr as the first operand, then replacing it by
the returned references using replaceOperandWith. It is possible
that MDNode::get would instead return an existing node (due to
de-duplication) that then gets modified. To avoid, use a fresh
TempMDNode that does not get uniqued with anything else before
replacing it with replaceOperandWith.
* LoopVectorizeHints::matchesHintMetadataName() only compares the
suffix of the attribute to set the new value for. That is, when
called with "enable", would erase attributes such as
"llvm.loop.unroll.enable", "llvm.loop.vectorize.enable" and
"llvm.loop.distribute.enable" instead of the one to replace.
Fortunately, function was only called with "isvectorized".
Differential Revision: https://reviews.llvm.org/D57566
llvm-svn: 353738
to reflect the new license.
We understand that people may be surprised that we're moving the header
entirely to discuss the new license. We checked this carefully with the
Foundation's lawyer and we believe this is the correct approach.
Essentially, all code in the project is now made available by the LLVM
project under our new license, so you will see that the license headers
include that license only. Some of our contributors have contributed
code under our old license, and accordingly, we have retained a copy of
our old license notice in the top-level files in each project and
repository.
llvm-svn: 351636
Rename:
NoUnrolling to InterleaveOnlyWhenForced
and
AlwaysVectorize to !VectorizeOnlyWhenForced
Contrary to what the name 'AlwaysVectorize' suggests, it does not
unconditionally vectorize all loops, but applies a cost model to
determine whether vectorization is profitable to all loops. Hence,
passing false will disable the cost model, except when a loop is marked
with llvm.loop.vectorize.enable. The 'OnlyWhenForced' suffix (suggested
by @hfinkel in D55716) better matches this behavior.
Similarly, 'NoUnrolling' disables the profitability cost model for
interleaving (a term to distinguish it from unrolling by the
LoopUnrollPass); rename it for consistency.
Differential Revision: https://reviews.llvm.org/D55785
llvm-svn: 349513
When multiple loop transformation are defined in a loop's metadata, their order of execution is defined by the order of their respective passes in the pass pipeline. For instance, e.g.
#pragma clang loop unroll_and_jam(enable)
#pragma clang loop distribute(enable)
is the same as
#pragma clang loop distribute(enable)
#pragma clang loop unroll_and_jam(enable)
and will try to loop-distribute before Unroll-And-Jam because the LoopDistribute pass is scheduled after UnrollAndJam pass. UnrollAndJamPass only supports one inner loop, i.e. it will necessarily fail after loop distribution. It is not possible to specify another execution order. Also,t the order of passes in the pipeline is subject to change between versions of LLVM, optimization options and which pass manager is used.
This patch adds 'followup' attributes to various loop transformation passes. These attributes define which attributes the resulting loop of a transformation should have. For instance,
!0 = !{!0, !1, !2}
!1 = !{!"llvm.loop.unroll_and_jam.enable"}
!2 = !{!"llvm.loop.unroll_and_jam.followup_inner", !3}
!3 = !{!"llvm.loop.distribute.enable"}
defines a loop ID (!0) to be unrolled-and-jammed (!1) and then the attribute !3 to be added to the jammed inner loop, which contains the instruction to distribute the inner loop.
Currently, in both pass managers, pass execution is in a fixed order and UnrollAndJamPass will not execute again after LoopDistribute. We hope to fix this in the future by allowing pass managers to run passes until a fixpoint is reached, use Polly to perform these transformations, or add a loop transformation pass which takes the order issue into account.
For mandatory/forced transformations (e.g. by having been declared by #pragma omp simd), the user must be notified when a transformation could not be performed. It is not possible that the responsible pass emits such a warning because the transformation might be 'hidden' in a followup attribute when it is executed, or it is not present in the pipeline at all. For this reason, this patche introduces a WarnMissedTransformations pass, to warn about orphaned transformations.
Since this changes the user-visible diagnostic message when a transformation is applied, two test cases in the clang repository need to be updated.
To ensure that no other transformation is executed before the intended one, the attribute `llvm.loop.disable_nonforced` can be added which should disable transformation heuristics before the intended transformation is applied. E.g. it would be surprising if a loop is distributed before a #pragma unroll_and_jam is applied.
With more supported code transformations (loop fusion, interchange, stripmining, offloading, etc.), transformations can be used as building blocks for more complex transformations (e.g. stripmining+stripmining+interchange -> tiling).
Reviewed By: hfinkel, dmgreen
Differential Revision: https://reviews.llvm.org/D49281
Differential Revision: https://reviews.llvm.org/D55288
llvm-svn: 348944
When optimizing for size, a loop is vectorized only if the resulting vector loop
completely replaces the original scalar loop. This holds if no runtime guards
are needed, if the original trip-count TC does not overflow, and if TC is a
known constant that is a multiple of the VF. The last two TC-related conditions
can be overcome by
1. rounding the trip-count of the vector loop up from TC to a multiple of VF;
2. masking the vector body under a newly introduced "if (i <= TC-1)" condition.
The patch allows loops with arbitrary trip counts to be vectorized under -Os,
subject to the existing cost model considerations. It also applies to loops with
small trip counts (under -O2) which are currently handled as if under -Os.
The patch does not handle loops with reductions, live-outs, or w/o a primary
induction variable, and disallows interleave groups.
(Third, final and main part of -)
Differential Revision: https://reviews.llvm.org/D50480
llvm-svn: 344743
Summary:
[VPlan] Implement vector code generation support for simple outer loops.
Context: Patch Series #1 for outer loop vectorization support in LV using VPlan. (RFC: http://lists.llvm.org/pipermail/llvm-dev/2017-December/119523.html).
This patch introduces vector code generation support for simple outer loops that are currently supported in the VPlanNativePath. Changes here essentially do the following:
- force vector code generation using explicit vectorize_width
- add conservative early returns in cost model and other places for VPlanNativePath
- add code for setting up outer loop inductions
- support for widening non-induction PHIs that can result from inner loops and uniform conditional branches
- support for generating uniform inner branches
We plan to add a handful C outer loop executable tests once the initial code generation support is committed. This patch is expected to be NFC for the inner loop vectorizer path. Since we are moving in the direction of supporting outer loop vectorization in LV, it may also be time to rename classes such as InnerLoopVectorizer.
Reviewers: fhahn, rengolin, hsaito, dcaballe, mkuper, hfinkel, Ayal
Reviewed By: fhahn, hsaito
Subscribers: dmgreen, bollu, tschuett, rkruppe, rogfer01, llvm-commits
Differential Revision: https://reviews.llvm.org/D50820
llvm-svn: 342197
We've been running doxygen with the autobrief option for a couple of
years now. This makes the \brief markers into our comments
redundant. Since they are a visual distraction and we don't want to
encourage more \brief markers in new code either, this patch removes
them all.
Patch produced by
for i in $(git grep -l '\\brief'); do perl -pi -e 's/\\brief //g' $i & done
Differential Revision: https://reviews.llvm.org/D46290
llvm-svn: 331272
Summary:
This is a follow up to D45420 (included here since it is still under review and this change is dependent on that) and D45072 (committed).
Actual change for this patch is LoopVectorize* and cmakefile. All others are all from D45420.
LoopVectorizationLegality is an analysis and thus really belongs to Analysis tree. It is modular enough and it is reusable enough ---- we can further improve those aspects once uses outside of LV picks up.
Hopefully, this will make it easier for people familiar with vectorization theory, but not necessarily LV itself to contribute, by lowering the volume of code they should deal with. We probably should start adding some code in LV to check its own capability (i.e., vectorization is legal but LV is not ready to handle it) and then bail out.
Reviewers: rengolin, fhahn, hfinkel, mkuper, aemerson, mssimpso, dcaballe, sguggill
Reviewed By: rengolin, dcaballe
Subscribers: egarcia, rogfer01, mgorny, llvm-commits
Differential Revision: https://reviews.llvm.org/D45552
llvm-svn: 331139
Patch #2 from VPlan Outer Loop Vectorization Patch Series #1
(RFC: http://lists.llvm.org/pipermail/llvm-dev/2017-December/119523.html).
This patch introduces the basic infrastructure to detect, legality check
and process outer loops annotated with hints for explicit vectorization.
All these changes are protected under the feature flag
-enable-vplan-native-path. This should make this patch NFC for the existing
inner loop vectorizer.
Reviewers: hfinkel, mkuper, rengolin, fhahn, aemerson, mssimpso.
Differential Revision: https://reviews.llvm.org/D42447
llvm-svn: 330739
Summary:
For better vectorization result we should take into consideration the
cost of the user insertelement instructions when we try to
vectorize sequences that build the whole vector. I.e. if we have the
following scalar code:
```
<Scalar code>
insertelement <ScalarCode>, ...
```
we should consider the cost of the last `insertelement ` instructions as
the cost of the scalar code.
Reviewers: RKSimon, spatel, hfinkel, mkuper
Subscribers: javed.absar, llvm-commits
Differential Revision: https://reviews.llvm.org/D42657
llvm-svn: 324893
Summary:
Fixes the bug with incorrect handling of InsertValue|InsertElement
instrucions in SLP vectorizer. Currently, we may use incorrect
ExtractElement instructions as the operands of the original
InsertValue|InsertElement instructions.
Reviewers: mkuper, hfinkel, RKSimon, spatel
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D41767
llvm-svn: 321994
In SLPVectorizer, the vector build instructions (insertvalue for aggregate type) is passed to BoUpSLP.buildTree, it is treated as UserIgnoreList, so later in cost estimation, the cost of these instructions are not counted.
For aggregate value, later usage are more likely to be done in scalar registers, either used as individual scalars or used as a whole for function call or return value. Ignore scalar extraction instructions may cause too aggressive vectorization for aggregate values, and slow down performance. So for vectorization of aggregate value, the scalar extraction instructions are required in cost estimation.
Differential Revision: https://reviews.llvm.org/D41139
llvm-svn: 320736
Patch tries to improve two-pass vectorization analysis, existing in SLP vectorizer. What it does:
1. Defines key nodes, that are the vectorization roots. Previously vectorization started if StoreInst or ReturnInst is found. For now, the vectorization started for all Instructions with no users and void types (Terminators, StoreInst) + CallInsts.
2. CmpInsts, InsertElementInsts and InsertValueInsts are stored in the
array. This array is processed only after the vectorization of the
first-after-these instructions key node is finished. Vectorization goes
in reverse order to try to vectorize as much code as possible.
Reviewers: mzolotukhin, Ayal, mkuper, gilr, hfinkel, RKSimon
Subscribers: ashahid, anemet, RKSimon, mssimpso, llvm-commits
Differential Revision: https://reviews.llvm.org/D29826
llvm-svn: 310260
Summary:
Patch tries to improve two-pass vectorization analysis, existing in SLP vectorizer. What it does:
1. Defines key nodes, that are the vectorization roots. Previously vectorization started if StoreInst or ReturnInst is found. For now, the vectorization started for all Instructions with no users and void types (Terminators, StoreInst) + CallInsts.
2. CmpInsts, InsertElementInsts and InsertValueInsts are stored in the array. This array is processed only after the vectorization of the first-after-these instructions key node is finished. Vectorization goes in reverse order to try to vectorize as much code as possible.
Reviewers: mzolotukhin, Ayal, mkuper, gilr, hfinkel, RKSimon
Subscribers: ashahid, anemet, RKSimon, mssimpso, llvm-commits
Differential Revision: https://reviews.llvm.org/D29826
llvm-svn: 310255
Summary:
Existing heuristic uses the ratio between the function entry
frequency and the loop invocation frequency to find cold loops. However,
even if the loop executes frequently, if it has a small trip count per
each invocation, vectorization is not beneficial. On the other hand,
even if the loop invocation frequency is much smaller than the function
invocation frequency, if the trip count is high it is still beneficial
to vectorize the loop.
This patch uses estimated trip count computed from the profile metadata
as a primary metric to determine coldness of the loop. If the estimated
trip count cannot be computed, it falls back to the original heuristics.
Reviewers: Ayal, mssimpso, mkuper, danielcdh, wmi, tejohnson
Reviewed By: tejohnson
Subscribers: tejohnson, mzolotukhin, llvm-commits
Differential Revision: https://reviews.llvm.org/D32451
llvm-svn: 305729
The approach I followed was to emit the remark after getTreeCost concludes
that SLP is profitable. I initially tried emitting them after the
vectorizeRootInstruction calls in vectorizeChainsInBlock but I vaguely
remember missing a few cases for example in HorizontalReduction::tryToReduce.
ORE is placed in BoUpSLP so that it's available from everywhere (notably
HorizontalReduction::tryToReduce).
We use the first instruction in the root bundle as the locator for the remark.
In order to get a sense how far the tree is spanning I've include the size of
the tree in the remark. This is not perfect of course but it gives you at
least a rough idea about the tree. Then you can follow up with -view-slp-tree
to really see the actual tree.
llvm-svn: 302811