Summary:
Just because INC/DEC is a little slow on some processors doesn't mean we shouldn't prefer it when optimizing for size.
This appears to match gcc behavior.
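For illustration, a minimal IR sketch (the function name is hypothetical) where, with this change, isel should now pick the shorter `incl` over `addl $1` when the function is marked for size optimization:

  define i32 @plus_one(i32 %x) optsize {
    %r = add i32 %x, 1
    ret i32 %r
  }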
Reviewers: chandlerc, zvi, RKSimon, spatel
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D37177
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@312866 91177308-0d34-0410-b5e6-96231b3b80d8
Reduced version of 'addr-calc-crash.ll' that was included in D27044 and had already been fixed by D31286/rL298633.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@312786 91177308-0d34-0410-b5e6-96231b3b80d8
cover the bitwise operators.
Nothing really exciting here; this just stamps out the rest of the core
operations that can RMW memory and set flags.
Still not implemented here: ADC, SBB. Those will require more
interesting logic to channel the flags *in*, and I'm not currently
planning to try to tackle that. It might be interesting for someone who
wants to improve our code generation for bignum implementations.
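As a sketch of what now folds (names are illustrative): an RMW `or` whose flags result feeds a branch can become a single `orl %reg, (mem)` with the branch reusing the flags.

  define void @rmw_or(i32* %p, i32 %v) {
  entry:
    %old = load i32, i32* %p
    %new = or i32 %old, %v
    store i32 %new, i32* %p
    %cmp = icmp eq i32 %new, 0
    br i1 %cmp, label %if, label %end
  if:
    call void @g()
    br label %end
  end:
    ret void
  }

  declare void @g()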
Differential Revision: https://reviews.llvm.org/D37141
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@312768 91177308-0d34-0410-b5e6-96231b3b80d8
operands and used flags to support matching immediate operands.
This is a bit trickier than register operands, and we still want to fall
back on a register operand even for things that appear to be
"immediates" when they won't actually select into the operation's
immediate operand. This also requires us to handle things like selecting
`sub` vs. `add` to minimize the number of bits needed to represent the
immediate, and picking the shortest immediate encoding. In order to do
that, we in turn need to scan to make sure that CF isn't used, as it
will get inverted.
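For example (a minimal sketch; names are illustrative), consider adding 128 to a memory location: `addl $128` needs a 32-bit immediate since 128 is not a signed 8-bit value, but the equivalent `subl $-128` fits in an imm8, so isel can pick the shorter encoding, provided CF is not consumed, since the add->sub flip inverts it:

  define void @add128(i32* %p) {
    %old = load i32, i32* %p
    %new = add i32 %old, 128
    store i32 %new, i32* %p
    ret void
  }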
The end result seems very nice though, and we're now generating
optimal instruction sequences for these patterns IMO.
A follow-up patch will further expand this to other operations with RMW
memory operands. But handling `add` and `sub` gives useful starting points
to flesh out the machinery and make sure interesting and complex cases
can be handled.
Thanks to Craig Topper who provided a few fixes and improvements to this
patch in addition to the review!
Differential Revision: https://reviews.llvm.org/D37139
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@312764 91177308-0d34-0410-b5e6-96231b3b80d8
This patch expands lowerInterleavedLoad support to i8 vectors with stride 3 and VF={8|16|32}.
LLVM creates suboptimal shuffle code-gen for AVX2. Overall, this patch is a specific fix for the pattern (Stride=3, VF={8|16|32}), and we plan to cover the interleaved-store side as well.
The patch goal is to optimize the following sequence:
a0 b0 c0 a1 b1 c1 a2 b2
c2 a3 b3 c3 a4 b4 c4 a5
b5 c5 a6 b6 c6 a7 b7 c7
into
a0 a1 a2 a3 a4 a5 a6 a7
b0 b1 b2 b3 b4 b5 b6 b7
c0 c1 c2 c3 c4 c5 c6 c7
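A minimal IR sketch of the VF=8 case (function names are illustrative): one wide load followed by three stride-3 shuffles, which now lower to an efficient AVX2 sequence:

  define void @deinterleave(<24 x i8>* %p) {
    %wide = load <24 x i8>, <24 x i8>* %p
    %a = shufflevector <24 x i8> %wide, <24 x i8> undef, <8 x i32> <i32 0, i32 3, i32 6, i32 9, i32 12, i32 15, i32 18, i32 21>
    %b = shufflevector <24 x i8> %wide, <24 x i8> undef, <8 x i32> <i32 1, i32 4, i32 7, i32 10, i32 13, i32 16, i32 19, i32 22>
    %c = shufflevector <24 x i8> %wide, <24 x i8> undef, <8 x i32> <i32 2, i32 5, i32 8, i32 11, i32 14, i32 17, i32 20, i32 23>
    call void @use(<8 x i8> %a, <8 x i8> %b, <8 x i8> %c)
    ret void
  }

  declare void @use(<8 x i8>, <8 x i8>, <8 x i8>)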
Reviewers: zvi, igor, guyblank, dorit, Ayal
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@312722 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
For large basic blocks with lots of combinable instructions, the
MachineTraceMetrics computations in MachineCombiner can dominate the compile
time, as computing the trace information is quadratic in the number of
instructions in a BB and its relevant successors/predecessors.
In most cases, knowing the instruction depth should be enough to make
combination decisions. As we already iterate over all instructions in a basic
block, the instruction depth can be computed incrementally. This reduces the
cost of machine-combine drastically in cases where lots of instructions
are combined. The major drawback is that, AFAIK, the critical path
length cannot be computed incrementally. Therefore we only compute
instruction depths incrementally, for basic blocks with more
instructions than inc_threshold. The -machine-combiner-inc-threshold
option can be used to set the threshold; it makes it easier to
experiment and to check whether using incremental updates for all basic
blocks has any impact on performance.
Reviewers: sanjoy, Gerolf, MatzeB, efriedma, fhahn
Reviewed By: fhahn
Subscribers: kiranchandramohan, javed.absar, efriedma, llvm-commits
Differential Revision: https://reviews.llvm.org/D36619
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@312719 91177308-0d34-0410-b5e6-96231b3b80d8
Adding i8 -> [i16, i32, i64] and i32 -> i64 cases.
This way we can see what the current codegen looks like.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@312707 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
Add patterns for
fptoui <16 x float> to <16 x i8>
fptoui <16 x float> to <16 x i16>
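For reference, the IR the i8 case covers (compiled with AVX-512 enabled):

  define <16 x i8> @fptoui_v16f32_v16i8(<16 x float> %x) {
    %r = fptoui <16 x float> %x to <16 x i8>
    ret <16 x i8> %r
  }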
Reviewers: igorb, delena, craig.topper
Reviewed By: craig.topper
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D37505
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@312704 91177308-0d34-0410-b5e6-96231b3b80d8
function returns the intrinsic's first argument.
llvm.memcpy/memset/memmove return void, but the libcalls they expand to
return their first argument. Currently, if the parent function has any
return value, llvm.memcpy cannot be turned into a tail call after
expansion.
The patch handles that case in SelectionDAGBuilder, so that when the
caller function returns the same value as the first argument of
llvm.memcpy, a tail call is allowed.
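A minimal example of the case this enables (names are illustrative): the caller returns its first argument, which is also the memcpy destination, so after expansion to the memcpy libcall (which returns dest) the call can be a tail call.

  declare void @llvm.memcpy.p0i8.p0i8.i64(i8* nocapture, i8* nocapture readonly, i64, i32, i1)

  define i8* @copy(i8* %dst, i8* %src, i64 %n) {
    call void @llvm.memcpy.p0i8.p0i8.i64(i8* %dst, i8* %src, i64 %n, i32 1, i1 false)
    ret i8* %dst
  }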
Differential Revision: https://reviews.llvm.org/D37406
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@312641 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
Most instructions in AVX work "in-lane", that is, each source element is applied only to other
elements of the same lane; thus a cross-lane permutation is costly and needs more than one instruction.
AVX2 includes instructions to perform any-to-any permutation of words over a 256-bit register
and vectorized table lookup.
This should also fix PR34369.
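A representative cross-lane unary shuffle (a sketch, not taken from PR34369) that can now be lowered with a single any-to-any permute such as vpermd/vpermps instead of a multi-instruction lane-juggling sequence:

  define <8 x i32> @cross_lane(<8 x i32> %x) {
    %r = shufflevector <8 x i32> %x, <8 x i32> undef, <8 x i32> <i32 4, i32 0, i32 5, i32 1, i32 6, i32 2, i32 7, i32 3>
    ret <8 x i32> %r
  }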
Differential Revision: https://reviews.llvm.org/D37388
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@312608 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
This intrinsic represents a label with a list of associated metadata
strings. It is modelled as reading and writing inaccessible memory so
that it won't be removed as dead code. I think the intention is that the
annotation strings should appear at most once in the debug info, so I
marked it noduplicate. We are allowed to inline code with annotations as
long as we strip the annotation, but that can be done later.
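A sketch of the usage, assuming the intrinsic is llvm.codeview.annotation (the exact name and metadata shape here are an assumption based on this description):

  declare void @llvm.codeview.annotation(metadata)

  define void @f() {
    ; label carrying one annotation string; not removable as dead code
    call void @llvm.codeview.annotation(metadata !0)
    ret void
  }

  !0 = !{!"my annotation string"}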
Reviewers: majnemer
Subscribers: eraman, llvm-commits, hiraditya
Differential Revision: https://reviews.llvm.org/D36904
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@312569 91177308-0d34-0410-b5e6-96231b3b80d8
We had already disabled the pattern for SSE4.1 and SSE4.2, but it got re-enabled for AVX and AVX512.
With SSE41 we rely on a separate (v4f32 (X86vzmovl VR128)) pattern to select blendps with an xorps to create the zeroes, and a separate (v4f32 (scalar_to_vector FR32X)) pattern to select a COPY_TO_REGCLASS to move FR32 to VR128.
The same thing can happen for AVX with vblendps, and those separate patterns already exist.
For AVX512, (v4f32 (X86vzmovl VR128)) will select a VMOVSS instruction instead of VBLENDPS due to there not being an EVEX VBLENDPS. This is what we were getting out of the larger pattern anyway, so the larger pattern is unneeded for AVX512 too.
For SSE1-SSSE3 we can rely on (v4f32 (X86vzmovl VR128)) selecting a MOVSS, similar to AVX512. Again, this is what the larger pattern did too.
So the only real change here is that AVX1/2 now properly outputs a VBLENDPS during isel instead of a VMOVSS, matching SSE41. Most tests didn't notice because the two-address instruction pass knows how to turn VMOVSS into VBLENDPS to get an independent destination register.
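One way the X86vzmovl pattern arises at the IR level (a sketch): keep lane 0 of %x and zero the upper lanes; with AVX1/2 this should now select vblendps against a zero vector instead of vmovss.

  define <4 x float> @vzmovl(<4 x float> %x) {
    %r = shufflevector <4 x float> %x, <4 x float> zeroinitializer, <4 x i32> <i32 0, i32 5, i32 6, i32 7>
    ret <4 x float> %r
  }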
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@312564 91177308-0d34-0410-b5e6-96231b3b80d8
We don't have this same pattern for AVX2 so I don't believe we should have it for AVX512. We also didn't have it for v16f32.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@312543 91177308-0d34-0410-b5e6-96231b3b80d8
Avoid use of VPERMPS where we don't need it by instead using the variable mask version of VPERMILPS for unary shuffles.
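Plausibly the kind of case affected (a sketch, not a test from this commit): a unary shuffle whose per-lane pattern differs between the two 128-bit lanes needs a variable mask but never crosses lanes, so variable-mask VPERMILPS suffices.

  define <8 x float> @in_lane(<8 x float> %x) {
    %r = shufflevector <8 x float> %x, <8 x float> undef, <8 x i32> <i32 1, i32 0, i32 3, i32 2, i32 4, i32 5, i32 6, i32 7>
    ret <8 x float> %r
  }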
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@312486 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
This is a re-roll of D36615, which uses PLT relocations in the back-end
for the call to __xray_CustomEvent() when building in -fPIC and
-fxray-instrument mode.
Reviewers: pcc, djasper, bkramer
Subscribers: sdardis, javed.absar, llvm-commits
Differential Revision: https://reviews.llvm.org/D37373
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@312466 91177308-0d34-0410-b5e6-96231b3b80d8
Ideally we'd be able to emit the SUBREG_TO_REG without the explicit register->register move, but we'd need to be sure the producing operation would select something that guaranteed the upper bits were already zeroed.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@312450 91177308-0d34-0410-b5e6-96231b3b80d8
In a future patch, I plan to teach isel to use a small vector move with implicit zeroing of the upper elements when it sees the (insert_subvector zero, X, 0) pattern.
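For context, IR that produces the (insert_subvector zero, X, 0) node (a sketch; names are illustrative): a v4f32 value widened into the low half of a zero v8f32.

  define <8 x float> @insert_into_zero(<4 x float> %x) {
    %wide = shufflevector <4 x float> %x, <4 x float> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
    %r = shufflevector <8 x float> %wide, <8 x float> zeroinitializer, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 12, i32 13, i32 14, i32 15>
    ret <8 x float> %r
  }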
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@312448 91177308-0d34-0410-b5e6-96231b3b80d8
As part of an effort to rigorously check CodeGen's behavior on the IR shufflevector instruction, we generated many tests while predicting the best X86 sequence that may be generated.
This is a subset of the generated tests that we think may add value to our set of X86 tests.
Some of the checks are not optimal and will be changed after fixing:
1. PR34394
2. PR34382
3. PR34380
4. PR34359
Differential Revision: https://reviews.llvm.org/D37329
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@312442 91177308-0d34-0410-b5e6-96231b3b80d8