This matches the format produced by the AMD proprietary driver.
//==================================================================//
// Shell script for converting .ll test cases: (Pass the .ll files
you want to convert to this script as arguments).
//==================================================================//
; This was necessary on my system so that A-Z in sed would match only
; upper case. I'm not sure why.
export LC_ALL='C'
TEST_FILES="$*"
MATCHES=`grep -v Patterns SIInstructions.td | grep -o '"[A-Z0-9_]\+["e]' | grep -o '[A-Z0-9_]\+' | sort -r`
for f in $TEST_FILES; do
# Check that there are SI tests:
grep -q -e 'verde' -e 'bonaire' -e 'SI' -e 'tahiti' $f
if [ $? -eq 0 ]; then
for match in $MATCHES; do
sed -i -e "s/\([ :]$match\)/\L\1/" $f
done
# Try to get check lines with partial instruction names
sed -i 's/\(;[ ]*SI[A-Z\\-]*: \)\([A-Z_0-9]\+\)/\1\L\2/' $f
fi
done
sed -i -e 's/bb0_1/BB0_1/g' ../../../test/CodeGen/R600/infinite-loop.ll
sed -i -e 's/SI-NOT: bfe/SI-NOT: {{[^@]}}bfe/g'../../../test/CodeGen/R600/llvm.AMDGPU.bfe.*32.ll ../../../test/CodeGen/R600/sext-in-reg.ll
sed -i -e 's/exp_IEEE/EXP_IEEE/g' ../../../test/CodeGen/R600/llvm.exp2.ll
sed -i -e 's/numVgprs/NumVgprs/g' ../../../test/CodeGen/R600/register-count-comments.ll
sed -i 's/\(; CHECK[-NOT]*: \)\([A-Z_0-9]\+\)/\1\L\2/' ../../../test/CodeGen/R600/select64.ll ../../../test/CodeGen/R600/sgpr-copy.ll
//==================================================================//
// Shell script for converting .td files (run this last)
//==================================================================//
export LC_ALL='C'
sed -i -e '/Patterns/!s/\("[A-Z0-9_]\+[ "e]\)/\L\1/g' SIInstructions.td
sed -i -e 's/"EXP/"exp/g' SIInstrInfo.td
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221350 91177308-0d34-0410-b5e6-96231b3b80d8
This patch improves the folding of vector AND nodes into blend operations for
targets that feature SSE4.1. A vector AND node where one of the operands is
a constant build_vector with elements that are either zero or all-ones can be
converted into a blend.
This allows for example to simplify the following code:
define <4 x i32> @test(<4 x i32> %A, <4 x i32> %B) {
%1 = and <4 x i32> %A, <i32 0, i32 0, i32 0, i32 -1>
%2 = and <4 x i32> %B, <i32 -1, i32 -1, i32 -1, i32 0>
%3 = or <4 x i32> %1, %2
ret <4 x i32> %3
}
Before this patch llc (-mcpu=corei7) generated:
andps LCPI1_0(%rip), %xmm0, %xmm0
andps LCPI1_1(%rip), %xmm1, %xmm1
orps %xmm1, %xmm0, %xmm0
retq
With this patch we generate a single 'vpblendw'.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221343 91177308-0d34-0410-b5e6-96231b3b80d8
We currently try to push an even number of registers to preserve 8-byte
alignment during a function's prologue, but only when the stack alignment is
prcisely 8. Many of the reasons for doing it apply also when that alignment > 8
(the extra store is often free, and can save another stack adjustment, though
less frequently for 16-byte stack alignment).
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221321 91177308-0d34-0410-b5e6-96231b3b80d8
We were making an attempt to do this by adding an extra callee-saved GPR (so
that there was an even number in the list), but when that failed we went ahead
and pushed anyway.
This had a couple of potential issues:
+ The .cfi directives we emit misplaced dN because they were based on
PrologEpilogInserter's calculation.
+ Unaligned stores can be less efficient.
+ Unaligned stores can actually fault (likely only an issue in niche cases,
but possible).
This adds a final explicit stack adjustment if all other options fail, so that
the actual locations of the registers match up with where they should be.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221320 91177308-0d34-0410-b5e6-96231b3b80d8
Patch to allow (v)blendps, (v)blendpd, (v)pblendw and vpblendd instructions to be commuted - swaps the src registers and inverts the blend mask.
This is primarily to improve memory folding (see new tests), but it also improves the quality of shuffles (see modified tests).
Differential Revision: http://reviews.llvm.org/D6015
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221313 91177308-0d34-0410-b5e6-96231b3b80d8
While fixing up the register classes in the machine combiner in a previous
commit I missed one.
This fixes the last one and adds a test case.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221308 91177308-0d34-0410-b5e6-96231b3b80d8
Registers are not all equal. Some are not allocatable (infinite cost),
some have to be preserved but can be used, and some others are just free
to use.
Ensure there is a cost hierarchy reflecting this fact, so that the
allocator will favor scratch registers over callee-saved registers.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221293 91177308-0d34-0410-b5e6-96231b3b80d8
the tombstone or empty keys of a DenseMap<int64_t, T>. This patch
fixes the issue (and adds a tests case).
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221214 91177308-0d34-0410-b5e6-96231b3b80d8
register class tGPRRegClass if the target is thumb1.
This commit fixes a crash that occurs during register allocation which was
triggered when a virtual register defined by an inline-asm instruction had to
be spilled.
rdar://problem/18740489
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221178 91177308-0d34-0410-b5e6-96231b3b80d8
For 8-bit divrems where the remainder is used, we used to generate:
divb %sil
shrw $8, %ax
movzbl %al, %eax
That was to avoid an H-reg access, which is problematic mainly because
it isn't possible in REX-prefixed instructions.
This patch optimizes that to:
divb %sil
movzbl %ah, %eax
To do that, we explicitly extend AH, and extract the L-subreg in the
resulting register. The extension is done using the NOREX variants of
MOVZX. To support signed operations, MOVSX_NOREX is also added.
Further, this introduces a new SDNode type, [us]divrem_ext_hreg, which is
then lowered to a sequence containing a single zext (rather than 2).
Differential Revision: http://reviews.llvm.org/D6064
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221176 91177308-0d34-0410-b5e6-96231b3b80d8
Function calls aren't supported yet.
This was reverted due to build breakages, which should be fixed now.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221173 91177308-0d34-0410-b5e6-96231b3b80d8
call DAGCombiner. But we ran into a case (on Windows) where the
calling convention causes argument lowering to bail out of fast-isel,
and we end up in CodeGenAndEmitDAG() which does run DAGCombiner.
So, we need to make DAGCombiner check for 'optnone' after all.
Commit includes the test that found this, plus another one that got
missed in the original optnone work.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221168 91177308-0d34-0410-b5e6-96231b3b80d8
This CPU definition is redundant. The Cortex-A9 is defined as
supporting multiprocessing extensions. Remove its definition and
update appropriate tests.
LLVM defines both a cortex-a9 CPU and a cortex-a9-mp CPU. The only
difference between the two CPU definitions in ARM.td is that
cortex-a9-mp contains the feature FeatureMP for multiprocessing
extensions.
This is redundant since the Cortex-A9 is defined as having
multiprocessing extensions in the TRMs. armcc also defines the
Cortex-A9 as having multiprocessing extensions by default.
Change-Id: Ifcadaa6c322be0a33d9d2a39cfdd7da1d75981a7
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221166 91177308-0d34-0410-b5e6-96231b3b80d8
Some literals in the AArch64 backend had 15 'f's rather than 16, causing
comparisons with a constant 0xffffffffffffffff to be miscompiled.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221157 91177308-0d34-0410-b5e6-96231b3b80d8
This reverts commit r221028. Later commits depend on this and
reverting just this one causes even more bots to fail.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221041 91177308-0d34-0410-b5e6-96231b3b80d8
"[x86] Simplify vector selection if condition value type matches vselect value type and true value is all ones or false value is all zeros."
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221028 91177308-0d34-0410-b5e6-96231b3b80d8
This reverts commit r220996.
It introduced layering violations causing link errors in many
configurations.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221020 91177308-0d34-0410-b5e6-96231b3b80d8
We need to figure out how to track ptrtoint values all the
way until result is converted back to a pointer in order
to correctly rewrite the pointer type.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@220997 91177308-0d34-0410-b5e6-96231b3b80d8
Now that we have initial support for VSX, we can begin adding
intrinsics for programmer access to VSX instructions. This patch adds
basic support for VSX intrinsics in general, and tests it by
implementing intrinsics for minimum and maximum for the vector double
data type.
The LLVM portion of this is quite straightforward. There is a
companion patch for Clang.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@220988 91177308-0d34-0410-b5e6-96231b3b80d8
This patch adds an optimization in CodeGenPrepare to move an extractelement
right before a store when the target can combine them.
The optimization may promote any scalar operations to vector operations in the
way to make that possible.
** Context **
Some targets use different register files for both vector and scalar operations.
This means that transitioning from one domain to another may incur copy from one
register file to another. These copies are not coalescable and may be expensive.
For example, according to the scheduling model, on cortex-A8 a vector to GPR
move is 20 cycles.
** Motivating Example **
Let us consider an example:
define void @foo(<2 x i32>* %addr1, i32* %dest) {
%in1 = load <2 x i32>* %addr1, align 8
%extract = extractelement <2 x i32> %in1, i32 1
%out = or i32 %extract, 1
store i32 %out, i32* %dest, align 4
ret void
}
As it is, this IR generates the following assembly on armv7:
vldr d16, [r0] @vector load
vmov.32 r0, d16[1] @ cross-register-file copy: 20 cycles
orr r0, r0, #1 @ scalar bitwise or
str r0, [r1] @ scalar store
bx lr
Whereas we could generate much faster code:
vldr d16, [r0] @ vector load
vorr.i32 d16, #0x1 @ vector bitwise or
vst1.32 {d16[1]}, [r1:32] @ vector extract + store
bx lr
Half of the computation made in the vector is useless, but this allows to get
rid of the expensive cross-register-file copy.
** Proposed Solution **
To avoid this cross-register-copy penalty, we promote the scalar operations to
vector operations. The penalty will be removed if we manage to promote the whole
chain of computation in the vector domain.
Currently, we do that only when the chain of computation ends by a store and the
target is able to combine an extract with a store.
Stores are the most likely candidates, because other instructions produce values
that would need to be promoted and so, extracted as some point[1]. Moreover,
this is customary that targets feature stores that perform a vector extract (see
AArch64 and X86 for instance).
The proposed implementation relies on the TargetTransformInfo to decide whether
or not it is beneficial to promote a chain of computation in the vector domain.
Unfortunately, this interface is rather inaccurate for this level of details and
although this optimization may be beneficial for X86 and AArch64, the inaccuracy
will lead to the optimization being too aggressive.
Basically in TargetTransformInfo, everything that is legal has a cost of 1,
whereas, even if a vector type is legal, usually a vector operation is slightly
more expensive than its scalar counterpart. That will lead to too many
promotions that may not be counter balanced by the saving of the
cross-register-file copy. For instance, on AArch64 this penalty is just 4
cycles.
For now, the optimization is just enabled for ARM prior than v8, since those
processors have a larger penalty on cross-register-file copies, and the scope is
limited to basic blocks. Because of these two factors, we limit the effects of
the inaccuracy. Indeed, I did not want to build up a fancy cost model with block
frequency and everything on top of that.
[1] We can imagine targets that can combine an extractelement with other
instructions than just stores. If we want to go into that direction, the current
interfaces must be augmented and, moreover, I think this becomes a global isel
problem.
Differential Revision: http://reviews.llvm.org/D5921
<rdar://problem/14170854>
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@220978 91177308-0d34-0410-b5e6-96231b3b80d8
Since block address values can be larger than 2GB in 64-bit code, they
cannot be loaded simply using an @l / @ha pair, but instead must be
loaded from the TOC, just like GlobalAddress, ConstantPool, and
JumpTable values are.
The commit also fixes a bug in PPCLinuxAsmPrinter::doFinalization where
temporary labels could not be used as TOC values, since code would
attempt (and fail) to use GetOrCreateSymbol to create a symbol of the
same name as the temporary label.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@220959 91177308-0d34-0410-b5e6-96231b3b80d8
r212242 introduced a legalizer hook, originally to let AArch64 widen
v1i{32,16,8} rather than scalarize, because the legalizer expected, when
scalarizing the result of a conversion operation, to already have
scalarized the operands. On AArch64, v1i64 is legal, so that commit
ensured operations such as v1i32 = trunc v1i64 wouldn't assert.
It did that by choosing to widen v1 types whenever possible. However,
v1i1 types, for which there's no legal widened type, would still trigger
the assert.
This commit fixes that, by only scalarizing a trunc's result when the
operand has already been scalarized, and introducing an extract_elt
otherwise.
This is similar to r205625.
Fixes PR20777.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@220937 91177308-0d34-0410-b5e6-96231b3b80d8
Earlier this summer I fixed an issue where we were incorrectly combining
multiple loads that had different constraints such alignment, invariance,
temporality, etc. Apparently in one case I made copt paste error and swapped
alignment and invariance.
Tests included.
rdar://18816719
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@220933 91177308-0d34-0410-b5e6-96231b3b80d8
It should be on for every target that supports unaligned accesses (e.g. not
v6m).
Patch by Charlie Turner.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@220912 91177308-0d34-0410-b5e6-96231b3b80d8
This transformation worked if selector is produced by SETCC, however SETCC is needed only if we consider to swap operands. So I replaced SETCC check for this case.
Added tests for vselect of <X x i1> values.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@220777 91177308-0d34-0410-b5e6-96231b3b80d8
Ffter commit at rev219046 512-bit broadcasts lowering become non-optimal. Most of tests on broadcasting and embedded broadcasting were changed and they doesn’t produce efficient code.
Example below is from commit changes (it’s the first test from test/CodeGen/X86/avx512-vbroadcast.ll):
define <16 x i32> @_inreg16xi32(i32 %a) {
; CHECK-LABEL: _inreg16xi32:
; CHECK: ## BB#0:
-; CHECK-NEXT: vpbroadcastd %edi, %zmm0
+; CHECK-NEXT: vmovd %edi, %xmm0
+; CHECK-NEXT: vpbroadcastd %xmm0, %ymm0
+; CHECK-NEXT: vinserti64x4 $1, %ymm0, %zmm0, %zmm0
; CHECK-NEXT: retq
%b = insertelement <16 x i32> undef, i32 %a, i32 0
%c = shufflevector <16 x i32> %b, <16 x i32> undef, <16 x i32> zeroinitializer
ret <16 x i32> %c
}
Here, 256-bit broadcast was generated instead of 512-bit one.
In this patch
1) I added vector-shuffle lowering through broadcasts
2) Removed asserts and branches likes because this is incorrect
- assert(Subtarget->hasDQI() && "We can only lower v8i64 with AVX-512-DQI");
3) Fixed lowering tests
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@220774 91177308-0d34-0410-b5e6-96231b3b80d8
This is a Microsoft calling convention that supports both x86 and x86_64
subtargets. It passes vector and floating point arguments in XMM0-XMM5,
and passes them indirectly once they are consumed.
Homogenous vector aggregates of up to four elements can be passed in
sequential vector registers, but this part is not implemented in LLVM
and will be handled in Clang.
On 32-bit x86, it is similar to fastcall in that it uses ecx:edx as
integer register parameters and is callee cleanup. On x86_64, it
delegates to the normal win64 calling convention.
Reviewers: majnemer
Differential Revision: http://reviews.llvm.org/D5943
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@220745 91177308-0d34-0410-b5e6-96231b3b80d8
Benchmarks have shown that it's harmless to the performance there, and having a
unified set of passes between the two cores where possible helps big.LITTLE
deployment.
Patch by Z. Zheng.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@220744 91177308-0d34-0410-b5e6-96231b3b80d8
For a call to not return in to the stackmap shadow, the shadow must end with the call.
To do this, we must insert any required nops *before* the call, and not after it.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@220728 91177308-0d34-0410-b5e6-96231b3b80d8
This is a minor change to use the immediate version when the operand is a null
value. This should get rid of an unnecessary 'mov' instruction in debug
builds and align the code more with the one generated by SelectionDAG.
This fixes rdar://problem/18785125.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@220713 91177308-0d34-0410-b5e6-96231b3b80d8
Minor enhancement to use 'tbz' for i1 compare-and-branch to get rid of an 'and'
instruction.
This fixes rdar://problem/18784953.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@220712 91177308-0d34-0410-b5e6-96231b3b80d8
To avoid emitting too many nops, a stackmap shadow can include emitted instructions in the shadow, but these must not include branch targets.
A return from a call should count as a branch target as patching over the instructions after the call would lead to incorrect behaviour for threads currently making that call, when they return.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@220710 91177308-0d34-0410-b5e6-96231b3b80d8
The pattern matching for a 'ConstantInt' value was too restrictive. Checking for
a 'Constant' with a bull value is sufficient for using an 'cbz/cbnz' instruction.
This fixes rdar://problem/18784732.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@220709 91177308-0d34-0410-b5e6-96231b3b80d8
This fixes a bug where the input register was not defined for the 'tbz/tbnz'
instruction. This happened, because we folded the 'and' instruction from a
different basic block.
This fixes rdar://problem/18784013.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@220704 91177308-0d34-0410-b5e6-96231b3b80d8
At higher optimization levels the LLVM IR may contain more complex patterns for
loads/stores from/to frame indices. The 'computeAddress' function wasn't able to
handle this and triggered an assertion.
This fix extends the possible addressing modes for frame indices.
This fixes rdar://problem/18783298.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@220700 91177308-0d34-0410-b5e6-96231b3b80d8
Currently, the ARM backend will select the VMAXNM and VMINNM for these C
expressions:
(a < b) ? a : b
(a > b) ? a : b
but not these expressions:
(a > b) ? b : a
(a < b) ? b : a
This patch allows all of these expressions to be matched.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@220671 91177308-0d34-0410-b5e6-96231b3b80d8