This can significantly reduce the size of the switch, allowing for more
efficient lowering.
I also worked with the idea of exploiting unreachable defaults by
omitting the range check for jump tables, but always ended up with a
non-neglible binary size increase. It might be worth looking into some more.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@223049 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
PowerPC DWARF unwind info defined CFA as SP + offset even in a function
where the stack had been dynamically realigned. This clearly doesn't
work because the offset from SP to CFA is not a constant. Fix it by
defining CFA as BP instead.
This was causing the AddressSanitizer null_deref test to fail 50% of
the time, depending on whether SP happened to be 32-byte aligned on
entry to a particular function or not.
Reviewers: willschm, uweigand, hfinkel
Reviewed By: hfinkel
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D6410
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@222996 91177308-0d34-0410-b5e6-96231b3b80d8
This reverts commit r222632 (and follow-up r222636), which caused a host
of LNT failures on an internal bot. I'll respond to the commit on the
list with a reproduction of one of the failures.
Conflicts:
lib/Target/X86/X86TargetTransformInfo.cpp
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@222936 91177308-0d34-0410-b5e6-96231b3b80d8
The AAPCS treats small structs and homogeneous floating (or vector) aggregates
specially, and guarantees they either get passed as a contiguous block of
registers, or prevent any future use of those registers and get passed on the
stack.
This concept can fit quite neatly into LLVM's own type system, mapping an HFA
to [N x float] and so on, and small structs to [N x i64]. Doing so allows
front-ends to emit AAPCS compliant code without having to duplicate the
register counting logic.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@222903 91177308-0d34-0410-b5e6-96231b3b80d8
The string data for string-valued build attributes were being unconditionally
uppercased. There is no mention in the ARM ABI addenda about case conventions,
so it's technically implementation defined as to whether the data are
capitialised in some way or not. However, there are good reasons not to
captialise the data.
* It's less work.
* Some vendors may legitimately have case-sensitive checks for these
attributes which would fail on LLVM generated object files.
* There could be locale issues with uppercasing.
The original reasons for uppercasing appear to have stemmed from an
old codesourcery toolchain behaviour, see
http://comments.gmane.org/gmane.comp.compilers.llvm.cvs/87133
This patch makes the object file emitted no longer captialise string
data, it encodes as seen in the assembly source.
Change-Id: Ibe20dd6e60d2773d57ff72a78470839033aa5538
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@222882 91177308-0d34-0410-b5e6-96231b3b80d8
This mostly entails adding relocations, however there are a couple of
changes to existing relocations:
1. R_AARCH64_NONE is defined to be zero rather than 256
R_AARCH64_NONE has been defined to be zero for a long time elsewhere
e.g. binutils and glibc since the submission of the AArch64 port in
2012 so this is required for compatibility.
2. R_AARCH64_TLSDESC_ADR_PAGE renamed to R_AARCH64_TLSDESC_ADR_PAGE21
I don't think there is any way for relocation names to leak out of LLVM
so this should not break anything.
Tested with check-all with no regressions.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@222821 91177308-0d34-0410-b5e6-96231b3b80d8
including SAE mode and memory operand.
Added AVX512_maskable_scalar template, that should cover all scalar instructions in the future.
The main difference between AVX512_maskable_scalar<> and AVX512_maskable<> is using X86select instead of vselect.
I need it, because I can't create vselect node for MVT::i1 mask for scalar instruction.
http://reviews.llvm.org/D6378
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@222820 91177308-0d34-0410-b5e6-96231b3b80d8
Since (v)pslldq / (v)psrldq instructions resolve to a single input argument it is useful to match it much earlier than we currently do - this prevents more complicated shuffles (notably insertion into a zero vector) matching before it.
Differential Revision: http://reviews.llvm.org/D6409
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@222796 91177308-0d34-0410-b5e6-96231b3b80d8
The pattern matching failed to recognize all instances of "-1", because when
comparing against "-1" we didn't use an APInt of the same bitwidth.
This commit fixes this and also adds inverse versions of the conditon to catch
more cases.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@222722 91177308-0d34-0410-b5e6-96231b3b80d8
This does not matter on newer cores (where we can use reciprocal estimates in
fast-math mode anyway), but for older cores this allows us to generate better
fast-math code where we have multiple FDIVs with a common divisor.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@222710 91177308-0d34-0410-b5e6-96231b3b80d8
This patch teaches function 'transformVSELECTtoBlendVECTOR_SHUFFLE' how to
convert VSELECT dag nodes to shuffles on targets that do not have SSE4.1.
On pre-SSE4.1 targets, we can still perform blend operations using movss/movsd.
Also, removed a target specific combine that performed a premature lowering of
VSELECT nodes to target specific MOVSS/MOVSD nodes.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@222647 91177308-0d34-0410-b5e6-96231b3b80d8
r222375 made some improvements to build_vector lowering of v4x32 and v4xf32 into an insertps, but it missed a case where:
1. A single extracted element is used twice.
2. The lower of the two non-zero indexes should be preserved, and the higher should be used for the dest mask.
This caused a crash, since the source value for the insertps ends-up uninitialized.
Differential Revision: http://reviews.llvm.org/D6377
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@222635 91177308-0d34-0410-b5e6-96231b3b80d8
Introduced new target-independent intrinsics in order to support masked vector loads and stores. The loop vectorizer optimizes loops containing conditional memory accesses by generating these intrinsics for existing targets AVX2 and AVX-512. The vectorizer asks the target about availability of masked vector loads and stores.
Added SDNodes for masked operations and lowering patterns for X86 code generator.
Examples:
<16 x i32> @llvm.masked.load.v16i32(i8* %addr, <16 x i32> %passthru, i32 4 /* align */, <16 x i1> %mask)
declare void @llvm.masked.store.v8f64(i8* %addr, <8 x double> %value, i32 4, <8 x i1> %mask)
Scalarizer for other targets (not AVX2/AVX-512) will be done in a separate patch.
http://reviews.llvm.org/D6191
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@222632 91177308-0d34-0410-b5e6-96231b3b80d8
has a remarkably unique and efficient lowering.
While we get this some of the time already, we miss a few cases and
there wasn't a principled reason we got it. We should at least test
this. v8 already has tests for this pattern.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@222607 91177308-0d34-0410-b5e6-96231b3b80d8
This s_mov_b32 will write to a virtual register from the M0Reg
class and all the ds instructions now take an extra M0Reg explicit
argument.
This change is necessary to prevent issues with the scheduler
mixing together instructions that expect different values in the m0
registers.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@222583 91177308-0d34-0410-b5e6-96231b3b80d8
filler such as if delay slot filler have to put NOP instruction into the
delay slot of microMIPS BEQ or BNE instruction which uses the register $0,
then instead of emitting NOP this instruction is replaced by the corresponding
microMIPS compact branch instruction, i.e. BEQZC or BNEZC.
Differential Revision: http://reviews.llvm.org/D3566
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@222580 91177308-0d34-0410-b5e6-96231b3b80d8
This patch adds a feature flag to avoid unaligned 32-byte load/store AVX codegen
for Sandy Bridge and Ivy Bridge. There is no functionality change intended for
those chips. Previously, the absence of AVX2 was being used as a proxy to detect
this feature. But that hindered codegen for AVX-enabled AMD chips such as btver2
that do not have the 32-byte unaligned access slowdown.
Performance measurements are included in PR21541 ( http://llvm.org/bugs/show_bug.cgi?id=21541 ).
Differential Revision: http://reviews.llvm.org/D6355
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@222544 91177308-0d34-0410-b5e6-96231b3b80d8
shuffle lowering to allow much better blend matching.
Specifically, with the new structure the code seems clearer to me and we
correctly can hit the cases where merging two 128-bit lanes is a clear
win and can be shuffled cheaply afterward.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@222539 91177308-0d34-0410-b5e6-96231b3b80d8
a bunch more improvements.
Non-lane-crossing is fine, the key is that lane merging only makes sense
for single-input shuffles. Not sure why I got so turned around here. The
code all works, I was just using the wrong model for it.
This only updates v4 and v8 lowering. The v16 and v32 lowering requires
restructuring the entire check sequence.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@222537 91177308-0d34-0410-b5e6-96231b3b80d8
Before this patch, the DAGCombiner only tried to convert build_vector dag nodes
into shuffles if all operands were either extract_vector_elt or undef.
This patch improves that logic and teaches the DAGCombiner how to deal with
build_vector dag nodes where one or more operands are zero. A build_vector
dag node with some zero operands is turned into a shuffle only if the resulting
shuffle mask is legal for the target.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@222536 91177308-0d34-0410-b5e6-96231b3b80d8
lanes.
By special casing these we can often either reduce the total number of
shuffles significantly or reduce the number of (high latency on Haswell)
AVX2 shuffles that potentially cross 128-bit lanes. Even when these
don't actually cross lanes, they have much higher latency to support
that. Doing two of them and a blend is worse than doing a single insert
across the 128-bit lanes to blend and then doing a single interleaved
shuffle.
While this seems like a narrow case, it kept cropping up on me and the
difference is *huge* as you can see in many of the test cases. I first
hit this trying to perfectly fix the interleaving shuffle patterns used
by Halide for AVX2.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@222533 91177308-0d34-0410-b5e6-96231b3b80d8
merging 128-bit subvectors and also shuffling all the elements of those
subvectors. Currently we generate pretty bad code for many of these, but
I'm testing a patch that should dramatically improve this in addition to
making the shuffle lowering robust to other changes.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@222525 91177308-0d34-0410-b5e6-96231b3b80d8
This mirrors r222331, which enabled SeparateConstOffsetFromGEP on AArch64, in
the PowerPC backend. Yields, on a POWER7 machine, a 30% speedup on
SingleSource/Benchmarks/Shootout/nestedloop (this might just be from LICM,
there is a store moved out of the inner loop) and a potential speedup on
MultiSource/Benchmarks/mediabench/mpeg2/mpeg2dec/mpeg2decode. Regardless, it
makes some code look cleaner, and synchronizing the backends in this regard
seems like a generally good thing.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@222504 91177308-0d34-0410-b5e6-96231b3b80d8
Windows itanium targets the MSVCRT, and the stack probe symbol is provided by
MSVCRT. This corrects the emission of stack probes on i686-windows-itanium.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@222439 91177308-0d34-0410-b5e6-96231b3b80d8
This patch improves the lowering of v4f32 and v4i32 build_vector dag nodes
that are known to have at least two non-zero elements.
With this patch, a build_vector that performs a blend with zero is
converted into a shuffle. This is done to let the shuffle legalizer expand
the dag node in a optimal way. For example, if we know that a build_vector
performs a blend with zero, we can try to lower it as a movq/blend instead of
always selecting an insertps.
This patch also improves the logic that lowers a build_vector into a insertps
with zero masking. See for example the extra test cases added to test sse41.ll.
Differential Revision: http://reviews.llvm.org/D6311
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@222375 91177308-0d34-0410-b5e6-96231b3b80d8
A register operand that has a common sub-class with its instruction's
defined register class is not always legal. For example,
SReg_32 and M0Reg both have a common sub-class, but we can't
use an SReg_32 in instructions that expect a M0Reg.
This prevents the llvm.SI.sendmsg.ll test from failing when the fold
operand pass is added.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@222368 91177308-0d34-0410-b5e6-96231b3b80d8
This patch builds on http://reviews.llvm.org/D5598 to perform byte rotation shuffles (lowerVectorShuffleAsByteRotate) on pre-SSSE3 (palignr) targets - pre-SSSE3 is only enabled on i8 and i16 vector targets where it is a more definite performance gain.
I've also added a separate byte shift shuffle (lowerVectorShuffleAsByteShift) that makes use of the ability of the SLLDQ/SRLDQ instructions to implicitly shift in zero bytes to avoid the need to create a zero register if we had used palignr.
Differential Revision: http://reviews.llvm.org/D5699
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@222340 91177308-0d34-0410-b5e6-96231b3b80d8
SeparateConstOffsetFromGEP can gives more optimizaiton opportunities related to GEPs, which benefits EarlyCSE
and LICM. By enabling these passes we can have better address calculations and generate a better addressing
mode. Some SPEC 2006 benchmarks (astar, gobmk, namd) have obvious improvements on Cortex-A57.
Reviewed in http://reviews.llvm.org/D5864.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@222331 91177308-0d34-0410-b5e6-96231b3b80d8