This adds support for the QPX vector instruction set, which is used by the
enhanced A2 cores on the IBM BG/Q supercomputers. QPX vectors are 256 bytes
wide, holding 4 double-precision floating-point values. Boolean values, modeled
here as <4 x i1> are actually also represented as floating-point values
(essentially { -1, 1 } for { false, true }). QPX shares many features with
Altivec and VSX, but is distinct from both of them. One major difference is
that, instead of adding completely-separate vector registers, QPX vector
registers are extensions of the scalar floating-point registers (lane 0 is the
corresponding scalar floating-point value). The operations supported on QPX
vectors mirrors that supported on the scalar floating-point values (with some
additional ones for permutations and logical/comparison operations).
I've been maintaining this support out-of-tree, as part of the bgclang project,
for several years. This is not the entire bgclang patch set, but is most of the
subset that can be cleanly integrated into LLVM proper at this time. Adding
this to the LLVM backend is part of my efforts to rebase bgclang to the current
LLVM trunk, but is independently useful (especially for codes that use LLVM as
a JIT in library form).
The assembler/disassembler test coverage is complete. The CodeGen test coverage
is not, but I've included some tests, and more will be added as follow-up work.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@230413 91177308-0d34-0410-b5e6-96231b3b80d8
Everyone except R600 was manually passing the length of a static array
at each callsite, calculated in a variety of interesting ways. Far
easier to let ArrayRef handle that.
There should be no functional change, but out of tree targets may have
to tweak their calls as with these examples.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@230118 91177308-0d34-0410-b5e6-96231b3b80d8
changes to remove non-Function based subtargets out of the asm
printer. For module level emission we'll need to construct up
an MCSubtargetInfo so that we can encode instructions for
emission.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@230050 91177308-0d34-0410-b5e6-96231b3b80d8
EmitFunctionStubs is called from doFinalization and so can't
depend on the Subtarget existing. It's also irrelevant as
we know we're darwin since we're in the darwin asm printer.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@230039 91177308-0d34-0410-b5e6-96231b3b80d8
The IBM BG/Q supercomputer's A2 cores have a hardware prefetching unit, the
L1P, but it does not prefetch directly into the A2's L1 cache. Instead, it
prefetches into its own L1P buffer, and the latency to access that buffer is
significantly higher than that to the L1 cache (although smaller than the
latency to the L2 cache). As a result, especially when multiple hardware
threads are not actively busy, explicitly prefetching data into the L1 cache is
advantageous.
I've been using this pass out-of-tree for data prefetching on the BG/Q for well
over a year, and it has worked quite well. It is enabled by default only for
the BG/Q, but can be enabled for other cores as well via a command-line option.
Eventually, we might want to add some TTI interfaces and move this into
Transforms/Scalar (there is nothing particularly target dependent about it,
although only machines like the BG/Q will benefit from its simplistic
strategy).
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@229966 91177308-0d34-0410-b5e6-96231b3b80d8
initialization. Initialize the subtarget once per function and
migrate EmitStartOfAsmFile to either use attributes on the
TargetMachine or get information from all of the various
subtargets.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@229475 91177308-0d34-0410-b5e6-96231b3b80d8
This required changing how the computation of the ABI is handled
and how some of the checks for ABI/target are done.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@229471 91177308-0d34-0410-b5e6-96231b3b80d8
Our register allocation has become better recently, it seems, and is now
starting to generate cross-block copies into inflated register classes. These
copies are not transformed into subregister insertions/extractions by the
PPCVSXCopy class, and so need to be handled directly by
PPCInstrInfo::copyPhysReg. The code to do this was *almost* there, but not
quite (it was unnecessarily restricting itself to only the direct
sub/super-register-class case (not copying between, for example, something in
VRRC and the lower-half of VSRC which are super-registers of F8RC).
Triggering this behavior manually is difficult; I'm including two
bugpoint-reduced test cases from the test suite.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@229457 91177308-0d34-0410-b5e6-96231b3b80d8
This adds a safe interface to the machine independent InputArg struct
for accessing the index of the original (IR-level) argument. When a
non-native return type is lowered, we generate the hidden
machine-level sret argument on-the-fly. Before this fix, we were
representing this argument as OrigArgIndex == 0, which is an outright
lie. In particular this crashed in the AArch64 backend where we
actually try to access the type of the original argument.
Now we use a sentinel value for machine arguments that have no
original argument index. AArch64, ARM, Mips, and PPC now check for this
case before accessing the original argument.
Fixes <rdar://19792160> Null pointer assertion in AArch64TargetLowering
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@229413 91177308-0d34-0410-b5e6-96231b3b80d8
Canonicalize access to function attributes to use the simpler API.
getAttributes().getAttribute(AttributeSet::FunctionIndex, Kind)
=> getFnAttribute(Kind)
getAttributes().hasAttribute(AttributeSet::FunctionIndex, Kind)
=> hasFnAttribute(Kind)
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@229224 91177308-0d34-0410-b5e6-96231b3b80d8
LLVM's include tree and the use of using declarations to hide the
'legacy' namespace for the old pass manager.
This undoes the primary modules-hostile change I made to keep
out-of-tree targets building. I sent an email inquiring about whether
this would be reasonable to do at this phase and people seemed fine with
it, so making it a reality. This should allow us to start bootstrapping
with modules to a certain extent along with making it easier to mix and
match headers in general.
The updates to any code for users of LLVM are very mechanical. Switch
from including "llvm/PassManager.h" to "llvm/IR/LegacyPassManager.h".
Qualify the types which now produce compile errors with "legacy::". The
most common ones are "PassManager", "PassManagerBase", and
"FunctionPassManager".
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@229094 91177308-0d34-0410-b5e6-96231b3b80d8
On PowerPC, which has a full set of logical operations on (its multiple sets
of) condition-register bits, it is not profitable to break of complex
conditions feeding a jump into multiple jumps. We can turn off this feature of
CGP/SDAGBuilder by marking jumps as "expensive".
P7 test-suite speedups (no regressions):
MultiSource/Benchmarks/FreeBench/pcompress2/pcompress2
-0.626647% +/- 0.323583%
MultiSource/Benchmarks/Olden/power/power
-18.2821% +/- 8.06481%
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@228895 91177308-0d34-0410-b5e6-96231b3b80d8
See full discussion in http://reviews.llvm.org/D7491.
We now hide the add-immediate and call instructions together in a
separate pseudo-op, which is tagged to define GPR3 and clobber the
call-killed registers. The PPCTLSDynamicCall pass prior to RA now
expands this op into the two separate addi and call ops, with explicit
definitions of GPR3 on both instructions, and explicit clobbers on the
call instruction. The pass is now marked as requiring and preserving
the LiveIntervals and SlotIndexes analyses, and fixes these up after
the replacement sequences are introduced.
Self-hosting has been verified on LE P8 and BE P7 with various
optimization levels, etc. It has also been verified with the
--no-tls-optimize flag workaround removed.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@228725 91177308-0d34-0410-b5e6-96231b3b80d8
Some old assembly code uses the cntlz alias for cntlzw, binutils supports this,
and we should too. Fixes PR22519.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@228719 91177308-0d34-0410-b5e6-96231b3b80d8
veqv (vector equivalence)
vnand
vorc
I increased the AddedComplexity for these instructions to 500 to ensure they are generated instead of issuing other VSX instructions.
Phabricator review: http://reviews.llvm.org/D7469
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@228580 91177308-0d34-0410-b5e6-96231b3b80d8
If a loop predecessor has an invoke as its terminator, and the return value
from that invoke is used to determine the loop iteration space, then we can't
insert a computation based on that value in the loop predecessor prior to the
terminator (oops). If there's such an invoke, or just no predecessor for that
matter, insert a new loop preheader.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@228488 91177308-0d34-0410-b5e6-96231b3b80d8
Unfortunately, even with the workaround of disabling the linker TLS
optimizations in Clang restored (which has already been done), this still
breaks self-hosting on my P7 machine (-O3 -DNDEBUG -mcpu=native).
Bill is currently working on an alternate implementation to address the TLS
issue in a way that also fully elides the linker bug (which, unfortunately,
this approach did not fully), so I'm reverting this now.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@228460 91177308-0d34-0410-b5e6-96231b3b80d8
PowerPC supports pre-increment load/store instructions (except for Altivec/VSX
vector load/stores). Using these on embedded cores can be very important, but
most loops are not naturally set up to use them. We can often change that,
however, by placing loops into a non-canonical form. Generically, this means
transforming loops like this:
for (int i = 0; i < n; ++i)
array[i] = c;
to look like this:
T *p = array[-1];
for (int i = 0; i < n; ++i)
*++p = c;
the key point is that addresses accessed are pulled into dedicated PHIs and
"pre-decremented" in the loop preheader. This allows the use of pre-increment
load/store instructions without loop peeling.
A target-specific late IR-level pass (running post-LSR), PPCLoopPreIncPrep, is
introduced to perform this transformation. I've used this code out-of-tree for
generating code for the PPC A2 for over a year. Somewhat to my surprise,
running the test suite + externals on a P7 with this transformation enabled
showed no performance regressions, and one speedup:
External/SPEC/CINT2006/483.xalancbmk/483.xalancbmk
-2.32514% +/- 1.03736%
So I'm going to enable it on everything for now. I was surprised by this
because, on the POWER cores, these pre-increment load/store instructions are
cracked (and, thus, harder to schedule effectively). But seeing no regressions,
and feeling that it is generally easier to split instructions apart late than
it is to combine them late, this might be the better approach regardless.
In the future, we might want to integrate this functionality into LSR (but
currently LSR does not create new PHI nodes, so (for that and other reasons)
significant work would need to be done).
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@228328 91177308-0d34-0410-b5e6-96231b3b80d8
PowerPC supports pre-increment floating-point load/store instructions, both r+r
and r+i, and we had patterns for them, but they were not marked as legal. Mark
them as legal (and add a test case).
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@228327 91177308-0d34-0410-b5e6-96231b3b80d8
Patch by Kit Barton.
Add the vector count leading zeros instruction for byte, halfword,
word, and doubleword sizes. This is a fairly straightforward addition
after the changes made for vpopcnt:
1. Add the correct definitions for the various instructions in
PPCInstrAltivec.td
2. Make the CTLZ operation legal on vector types when using P8Altivec
in PPCISelLowering.cpp
Test Plan
Created new test case in test/CodeGen/PowerPC/vec_clz.ll to check the
instructions are being generated when the CTLZ operation is used in
LLVM.
Check the encoding and decoding in test/MC/PowerPC/ppc_encoding_vmx.s
and test/Disassembler/PowerPC/ppc_encoding_vmx.txt respectively.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@228301 91177308-0d34-0410-b5e6-96231b3b80d8
Patch by Kit Barton.
Add the vector population count instructions for byte, halfword, word,
and doubleword sizes. There are two major changes here:
PPCISelLowering.cpp: Make CTPOP legal for vector types.
PPCRegisterInfo.td: Added v2i64 to the VRRC register
definition. This is needed for the doubleword variations of the
integer ops that were added in P8.
Test Plan
Test the instruction vpcnt* encoding/decoding in ppc64-encoding-vmx.s
Test the generation of the vpopcnt instructions for various vector
data types. When adding the v2i64 type to the Vector Register set, I
also needed to add the appropriate bit conversion patterns between
v2i64 and the existing vector types. Testing for these conversions
were also added in the test case by passing a different vector type as
a parameter into the test functions. There is also a run step that
will ensure the vpopcnt instructions are generated when the vsx
feature is disabled.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@228046 91177308-0d34-0410-b5e6-96231b3b80d8
This patch is a third attempt to properly handle the local-dynamic and
global-dynamic TLS models.
In my original implementation, calls to __tls_get_addr were hidden
from view until the asm-printer phase, at which point the underlying
branch-and-link instruction was created with proper relocations. This
mostly worked well, but I used some repellent techniques to ensure
that the TLS_GET_ADDR nodes at the SD and MI levels correctly received
input from GPR3 and produced output into GPR3. This proved to work
badly in the presence of multiple TLS variable accesses, with the
copies to and from GPR3 being scheduled incorrectly and generally
creating havoc.
In r221703, I addressed that problem by representing the calls to
__tls_get_addr as true calls during instruction lowering. This had
the advantage of removing all of the bad hacks and relying on the
existing call machinery to properly glue the copies in place. It
looked like this was going to be the right way to go.
However, as a side effect of the recent discovery of problems with
linker optimizations for TLS, we discovered cases of suboptimal code
generation with this strategy. The problem comes when tls_get_addr is
called for the same address, and there is a resulting CSE
opportunity. It turns out that in such cases MachineCSE will common
the addis/addi instructions that set up the input value to
tls_get_addr, but will not common the calls themselves. MachineCSE
does not have any machinery to common idempotent calls. This is
perfectly sensible, since presumably this would be done at the IR
level, and introducing calls in the back end isn't commonplace. In
any case, we end up with two calls to __tls_get_addr when one would
suffice, and that isn't good.
I presumed that the original design would have allowed commoning of
the machine-specific nodes that hid the __tls_get_addr calls, so as
suggested by Ulrich Weigand, I went back to that design and cleaned it
up so that the copies were properly held together by glue
nodes. However, it turned out that this didn't work either...the
presence of copies to physical registers kept the machine-specific
nodes from being commoned also.
All of which leads to the design presented here. This is a return to
the original design, except that no attempt is made to introduce
copies to and from GPR3 during instruction lowering. Virtual registers
are used until prior to register allocation. At that point, a special
pass is run that identifies the machine-specific nodes that hide the
tls_get_addr calls and introduces the copies to and from GPR3 around
them. The register allocator then coalesces these copies away. With
this design, MachineCSE succeeds in commoning tls_get_addr calls where
possible, and we get nice optimal code generation (better than GCC at
the moment, which does not common these calls).
One additional problem must be dealt with: After introducing the
mentions of the physical register GPR3, the aggressive anti-dependence
breaker sees opportunities to improve scheduling by selecting a
different register instead. Flags must be used on the instruction
descriptions to tell the anti-dependence breaker to keep its hands in
its pockets.
One thing missing from the original design was recording a definition
of the link register on the GET_TLS_ADDR nodes. Doing this was found
to be insufficient to force a stack frame to be created, which led to
looping behavior because two different LR values were stored at the
same address. This appears to have been an oversight in
PPCFrameLowering::determineFrameLayout(), which is repaired here.
Because MustSaveLR() returns true for calls to builtin_return_address,
this changed the expected behavior of
test/CodeGen/PowerPC/retaddr2.ll, which now stacks a frame but
formerly did not. I've fixed the test case to reflect this.
There are existing TLS tests to catch regressions; the checks in
test/CodeGen/PowerPC/tls-store2.ll proved to be too restrictive in the
face of instruction scheduling with these changes, so I fixed that
up.
I've added a new test case based on the PrettyStackTrace module that
demonstrated the original problem. This checks that we get correct
code generation and that CSE of the calls to __get_tls_addr has taken
place.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@227976 91177308-0d34-0410-b5e6-96231b3b80d8
PPCInstrInfo.cpp has ended up containing several small MI-level passes, and
this is making the file harder to read than necessary. Split out
PPCEarlyReturn into its own source file. NFC.
Now that PPCInstrInfo.cpp does not also contain pass implementations, I hope
that it will be slightly less unwieldy.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@227775 91177308-0d34-0410-b5e6-96231b3b80d8
PPCInstrInfo.cpp has ended up containing several small MI-level passes, and
this is making the file harder to read than necessary. Split out
PPCVSXCopy into its own source file. NFC.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@227771 91177308-0d34-0410-b5e6-96231b3b80d8
PPCInstrInfo.cpp has ended up containing several small MI-level passes, and
this is making the file harder to read than necessary. Split out
PPCVSXFMAMutate into its own source file. NFC.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@227770 91177308-0d34-0410-b5e6-96231b3b80d8
This MI-level pass was necessary when VSX support was first being developed,
specifically, before the ABI code had been updated to use VSX registers for
arguments (the register assignments did not change, in a physical sense, but
the VSX super-registers are now used). Unfortunately, I never went back and
removed this pass after that was done. I believe this code is now effectively
dead.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@227767 91177308-0d34-0410-b5e6-96231b3b80d8
When PPCEarlyReturn, it should really copy implicit ops from the old return
instruction to the new one. This currently does not matter much, because we run
PPCEarlyReturn very late in the pipeline (there is nothing to do DCE on
definitions of those registers). However, for completeness, we should do it
anyway.
Noticed by inspection (and there should be no functional change); thus, no
test case.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@227763 91177308-0d34-0410-b5e6-96231b3b80d8
The VSX store instructions were also picking up an implicit "may read" from the
default pattern, which was an intrinsic (and we don't currently have a way of
specifying write-only intrinsics).
This was causing MI verification to fail for VSX spill restores.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@227759 91177308-0d34-0410-b5e6-96231b3b80d8
isel is actually a cracked instruction on the P7/P8, and must start a dispatch
group. The scheduling model should reflect this so that we don't bunch too many
of them together when possible.
Thanks to Bill Schmidt and Pat Haugen for helping to sort this out.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@227758 91177308-0d34-0410-b5e6-96231b3b80d8
The TOC base pointer is passed in r2, and we normally reserve this register so
that we can depend on it being there. However, for leaf functions, and
specifically those leaf functions that don't do any TOC access of their own
(which is generally due to accessing the constant pool, using TLS, etc.),
we can treat r2 as an ordinary callee-saved register (it must be callee-saved
because, for local direct calls, the linker will not insert any save/restore
code).
The allocation order has been changed slightly for PPC64/ELF systems to put r2
at the end of the list (while leaving it near the beginning for Darwin systems
to prevent unnecessary output changes). While r2 is allocatable, using it still
requires spill/restore traffic, and thus comes at the end of the list.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@227745 91177308-0d34-0410-b5e6-96231b3b80d8
now that we have a correct and cached subtarget specific to the
function.
Also, finish providing a cached per-function subtarget in the core
LLVMTargetMachine -- that layer hadn't switched over yet.
The only use of the TargetMachine was to re-lookup a subtarget for
a particular function to work around the fact that TTI was immutable.
Now that it is per-function and we haved a cached subtarget, use it.
This still leaves a few interfaces with real warts on them where we were
passing Function objects through the TTI interface. I'll remove these
and clean their usage up in subsequent commits now that this isn't
necessary.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@227738 91177308-0d34-0410-b5e6-96231b3b80d8
intermediate TTI implementation template and instead query up to the
derived class for both the TargetMachine and the TargetLowering.
Most of the derived types had a TLI cached already and there is no need
to store a less precisely typed target machine pointer.
This will in turn make it much cleaner to look up the TLI via
a per-function subtarget instead of the generic subtarget, and it will
pave the way toward pulling the subtarget used for unroll preferences
into the same form once we are *always* using the function to look up
the correct subtarget.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@227737 91177308-0d34-0410-b5e6-96231b3b80d8
TargetIRAnalysis access path directly rather than implementing getTTI.
This even removes getTTI from the interface. It's more efficient for
each target to just register a precise callback that creates their
specific TTI.
As part of this, all of the targets which are building their subtargets
individually per-function now build their TTI instance with the function
and thus look up the correct subtarget and cache it. NVPTX, R600, and
XCore currently don't leverage this functionality, but its trivial for
them to add it now.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@227735 91177308-0d34-0410-b5e6-96231b3b80d8
null.
For some reason some of the original TTI code supported a null target
machine. This seems to have been legacy, and I made matters worse when
refactoring this code by spreading that pattern further through the
various targets.
The TargetMachine can't actually be null, and it doesn't make sense to
support that use case. I've now consistently removed it and removed all
of the code trying to cope with that situation. This is probably good,
as several targets *didn't* cope with it being null despite the null
default argument in their constructors. =]
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@227734 91177308-0d34-0410-b5e6-96231b3b80d8
base which it adds a single analysis pass to, to instead return the type
erased TargetTransformInfo object constructed for that TargetMachine.
This removes all of the pass variants for TTI. There is now a single TTI
*pass* in the Analysis layer. All of the Analysis <-> Target
communication is through the TTI's type erased interface itself. While
the diff is large here, it is nothing more that code motion to make
types available in a header file for use in a different source file
within each target.
I've tried to keep all the doxygen comments and file boilerplate in line
with this move, but let me know if I missed anything.
With this in place, the next step to making TTI work with the new pass
manager is to introduce a really simple new-style analysis that produces
a TTI object via a callback into this routine on the target machine.
Once we have that, we'll have the building blocks necessary to accept
a function argument as well.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@227685 91177308-0d34-0410-b5e6-96231b3b80d8
type erased interface and a single analysis pass rather than an
extremely complex analysis group.
The end result is that the TTI analysis can contain a type erased
implementation that supports the polymorphic TTI interface. We can build
one from a target-specific implementation or from a dummy one in the IR.
I've also factored all of the code into "mix-in"-able base classes,
including CRTP base classes to facilitate calling back up to the most
specialized form when delegating horizontally across the surface. These
aren't as clean as I would like and I'm planning to work on cleaning
some of this up, but I wanted to start by putting into the right form.
There are a number of reasons for this change, and this particular
design. The first and foremost reason is that an analysis group is
complete overkill, and the chaining delegation strategy was so opaque,
confusing, and high overhead that TTI was suffering greatly for it.
Several of the TTI functions had failed to be implemented in all places
because of the chaining-based delegation making there be no checking of
this. A few other functions were implemented with incorrect delegation.
The message to me was very clear working on this -- the delegation and
analysis group structure was too confusing to be useful here.
The other reason of course is that this is *much* more natural fit for
the new pass manager. This will lay the ground work for a type-erased
per-function info object that can look up the correct subtarget and even
cache it.
Yet another benefit is that this will significantly simplify the
interaction of the pass managers and the TargetMachine. See the future
work below.
The downside of this change is that it is very, very verbose. I'm going
to work to improve that, but it is somewhat an implementation necessity
in C++ to do type erasure. =/ I discussed this design really extensively
with Eric and Hal prior to going down this path, and afterward showed
them the result. No one was really thrilled with it, but there doesn't
seem to be a substantially better alternative. Using a base class and
virtual method dispatch would make the code much shorter, but as
discussed in the update to the programmer's manual and elsewhere,
a polymorphic interface feels like the more principled approach even if
this is perhaps the least compelling example of it. ;]
Ultimately, there is still a lot more to be done here, but this was the
huge chunk that I couldn't really split things out of because this was
the interface change to TTI. I've tried to minimize all the other parts
of this. The follow up work should include at least:
1) Improving the TargetMachine interface by having it directly return
a TTI object. Because we have a non-pass object with value semantics
and an internal type erasure mechanism, we can narrow the interface
of the TargetMachine to *just* do what we need: build and return
a TTI object that we can then insert into the pass pipeline.
2) Make the TTI object be fully specialized for a particular function.
This will include splitting off a minimal form of it which is
sufficient for the inliner and the old pass manager.
3) Add a new pass manager analysis which produces TTI objects from the
target machine for each function. This may actually be done as part
of #2 in order to use the new analysis to implement #2.
4) Work on narrowing the API between TTI and the targets so that it is
easier to understand and less verbose to type erase.
5) Work on narrowing the API between TTI and its clients so that it is
easier to understand and less verbose to forward.
6) Try to improve the CRTP-based delegation. I feel like this code is
just a bit messy and exacerbating the complexity of implementing
the TTI in each target.
Many thanks to Eric and Hal for their help here. I ended up blocked on
this somewhat more abruptly than I expected, and so I appreciate getting
it sorted out very quickly.
Differential Revision: http://reviews.llvm.org/D7293
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@227669 91177308-0d34-0410-b5e6-96231b3b80d8
Any code creating an MCSectionELF knows ELF and already provides the flags.
SectionKind is an abstraction used by common code that uses a plain
MCSection.
Use the flags to compute the SectionKind. This removes a lot of
guessing and boilerplate from the MCSectionELF construction.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@227476 91177308-0d34-0410-b5e6-96231b3b80d8
Patch by Nemanja Ivanovic.
As was uncovered by the failing test case (when run on non-PPC
platforms), the feature set when compiling with -march=ppc64le was not
being picked up. This change ensures that if the -mcpu option is not
specified, the correct feature set is picked up regardless of whether
we are on PPC or not.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@227455 91177308-0d34-0410-b5e6-96231b3b80d8
derived classes.
Since global data alignment, layout, and mangling is often based on the
DataLayout, move it to the TargetMachine. This ensures that global
data is going to be layed out and mangled consistently if the subtarget
changes on a per function basis. Prior to this all targets(*) have
had subtarget dependent code moved out and onto the TargetMachine.
*One target hasn't been migrated as part of this change: R600. The
R600 port has, as a subtarget feature, the size of pointers and
this affects global data layout. I've currently hacked in a FIXME
to enable progress, but the port needs to be updated to either pass
the 64-bitness to the TargetMachine, or fix the DataLayout to
avoid subtarget dependent features.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@227113 91177308-0d34-0410-b5e6-96231b3b80d8
Test by Nemanja Ivanovic.
Since ppc64le implies POWER8 as a minimum, it makes sense that the
same features are included. Since the pwr8 processor model will likely
be getting new features until the implementation is complete, I
created a new list to add these updates to. This will include them in
both pwr8 and ppc64le.
Furthermore, it seems that it would make sense to compose the feature
lists for other processor models (pwr3 and up). Per discussion in the
review, I will make this change in a subsequent patch.
In order to test the changes, I've added an additional run step to
test cases that specify -march=ppc64le -mcpu=pwr8 to omit the -mcpu
option. Since the feature lists are the same, the behaviour should be
unchanged.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@227053 91177308-0d34-0410-b5e6-96231b3b80d8
The fixes are to note that AArch64 has additional restrictions on when local
relocations can be used. In particular, ld64 requires that relocations to
cstring/cfstrings use linker visible symbols.
Original message:
In an assembly expression like
bar:
.long L0 + 1
the intended semantics is that bar will contain a pointer one byte past L0.
In sections that are merged by content (strings, 4 byte constants, etc), a
single position in the section doesn't give the linker enough information.
For example, it would not be able to tell a relocation must point to the
end of a string, since that would look just like the start of the next.
The solution used in ELF to use relocation with symbols if there is a non-zero
addend.
In MachO before this patch we would just keep all symbols in some sections.
This would miss some cases (only cstrings on x86_64 were implemented) and was
inefficient since most relocations have an addend of 0 and can be represented
without the symbol.
This patch implements the non-zero addend logic for MachO too.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@226503 91177308-0d34-0410-b5e6-96231b3b80d8
We don't need to exclude patchpoints from the implicit r2 dependence in
FastISel because it is added as an implicit operand and, thus, should not
confuse that StackMap code.
By inspection / no test case.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@226434 91177308-0d34-0410-b5e6-96231b3b80d8
Our PPC64 ELF V2 call lowering logic added r2 as an operand to all direct call
instructions in order to represent the dependency on the TOC base pointer
value. Restricting this to ELF V2, however, does not seem to make sense: calls
under ELF V1 have the same dependence, and indirect calls have an r2 dependence
just as direct ones. Make sure the dependence is noted for all calls under both
ELF V1 and ELF V2.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@226432 91177308-0d34-0410-b5e6-96231b3b80d8
Instructions that have high-order TOC relocations always carry R2 as their base
register, so it does not matter whether we take the register from the
instruction or just hard-code it in PPCAsmPrinter. In the future, however, we
might want to apply these relocations to instructions using a different
register, so taking the register from the instruction is a better thing to do.
No change in functionality here, however.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@226403 91177308-0d34-0410-b5e6-96231b3b80d8
So we don't forget, once we support FPR <-> GPR moves on the P8, we'll likely
want to re-visit this part of the calling convention.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@226401 91177308-0d34-0410-b5e6-96231b3b80d8
The default calling convention specified by the PPC64 ELF (V1 and V2) ABI is
designed to work with both prototyped and non-prototyped/varargs functions. As
a result, GPRs and stack space are allocated for every argument, even those
that are passed in floating-point or vector registers.
GlobalOpt::OptimizeFunctions will transform local non-varargs functions (that
do not have their address taken) to use the 'fast' calling convention.
When functions are using the 'fast' calling convention, don't allocate GPRs for
arguments passed in other types of registers, and don't allocate stack space for
arguments passed in registers. Other changes for the fast calling convention
may be added in the future.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@226399 91177308-0d34-0410-b5e6-96231b3b80d8
a LoopInfoWrapperPass to wire the object up to the legacy pass manager.
This switches all the clients of LoopInfo over and paves the way to port
LoopInfo to the new pass manager. No functionality change is intended
with this iteration.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@226373 91177308-0d34-0410-b5e6-96231b3b80d8
R11's status is the same under both the PPC64 ELF V1 and V2 ABIs: it is
reserved for use as an "environment pointer" for compilation models that
require such a thing. We don't, we also don't need a second scratch register,
and because we support only "local" patchpoint call targets, we might as well
let R11 be used for anyregcc patchpoints.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@226369 91177308-0d34-0410-b5e6-96231b3b80d8
Bill Schmidt pointed out that some adjustments would be needed to properly
support powerpc64le (using the ELF V2 ABI). For one thing, R11 is not available
as a scratch register, so we need to use R12. R12 is also available under ELF
V1, so to maintain consistency, I flipped the order to make R12 the first
scratch register in the array under both ABIs.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@226247 91177308-0d34-0410-b5e6-96231b3b80d8
Function pointers under PPC64 ELFv1 (which is used on PPC64/Linux on the
POWER7, A2 and earlier cores) are really pointers to a function descriptor, a
structure with three pointers: the actual pointer to the code to which to jump,
the pointer to the TOC needed by the callee, and an environment pointer. We
used to chain these loads, and make them opaque to the rest of the optimizer,
so that they'd always occur directly before the call. This is not necessary,
and in fact, highly suboptimal on embedded cores. Once the function pointer is
known, the loads can be performed ahead of time; in fact, they can be hoisted
out of loops.
Now these function descriptors are almost always generated by the linker, and
thus the contents of the descriptors are invariant. As a result, by default,
we'll mark the associated loads as invariant (allowing them to be hoisted out
of loops). I've added a target feature to turn this off, however, just in case
someone needs that option (constructing an on-stack descriptor, casting it to a
function pointer, and then calling it cannot be well-defined C/C++ code, but I
can imagine some JIT-compilation system doing so).
Consider this simple test:
$ cat call.c
typedef void (*fp)();
void bar(fp x) {
for (int i = 0; i < 1600000000; ++i)
x();
}
$ cat main.c
typedef void (*fp)();
void bar(fp x);
void foo() {}
int main() {
bar(foo);
}
On the PPC A2 (the BG/Q supercomputer), marking the function-descriptor loads
as invariant brings the execution time down to ~8 seconds from ~32 seconds with
the loads in the loop.
The difference on the POWER7 is smaller. Compiling with:
gcc -std=c99 -O3 -mcpu=native call.c main.c : ~6 seconds [this is 4.8.2]
clang -O3 -mcpu=native call.c main.c : ~5.3 seconds
clang -O3 -mcpu=native call.c main.c -mno-invariant-function-descriptors : ~4 seconds
(looks like we'd benefit from additional loop unrolling here, as a first
guess, because this is faster with the extra loads)
The -mno-invariant-function-descriptors will be added to Clang shortly.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@226207 91177308-0d34-0410-b5e6-96231b3b80d8
The pass is really just a means of accessing a cached instance of the
TargetLibraryInfo object, and this way we can re-use that object for the
new pass manager as its result.
Lots of delta, but nothing interesting happening here. This is the
common pattern that is developing to allow analyses to live in both the
old and new pass manager -- a wrapper pass in the old pass manager
emulates the separation intrinsic to the new pass manager between the
result and pass for analyses.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@226157 91177308-0d34-0410-b5e6-96231b3b80d8
While the term "Target" is in the name, it doesn't really have to do
with the LLVM Target library -- this isn't an abstraction which LLVM
targets generally need to implement or extend. It has much more to do
with modeling the various runtime libraries on different OSes and with
different runtime environments. The "target" in this sense is the more
general sense of a target of cross compilation.
This is in preparation for porting this analysis to the new pass
manager.
No functionality changed, and updates inbound for Clang and Polly.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@226078 91177308-0d34-0410-b5e6-96231b3b80d8
Fill out our support for the floating-point status and control register
instructions (mcrfs and friends). As it turns out, these are necessary for
compiling src/test/harness_fp.h in TBB for PowerPC.
Thanks to Raf Schietekat for reporting the issue!
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@226070 91177308-0d34-0410-b5e6-96231b3b80d8
Patch by Kit Barton.
Support for the ICBT instruction is currently present, but limited to
embedded processors. This change adds a new FeatureICBT that can be used
to identify whether the ICBT instruction is available on a specific processor.
Two new tests are added:
* Positive test to ensure the icbt instruction is present when using
-mcpu=pwr8
* Negative test to ensure the icbt instruction is not generated when
using -mcpu=pwr7
Both test cases use the Prefetch opcode in LLVM. They are based on the
ppc64-prefetch.ll test case.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@226033 91177308-0d34-0410-b5e6-96231b3b80d8
utils/sort_includes.py.
I clearly haven't done this in a while, so more changed than usual. This
even uncovered a missing include from the InstrProf library that I've
added. No functionality changed here, just mechanical cleanup of the
include order.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@225974 91177308-0d34-0410-b5e6-96231b3b80d8
This re-applies r225808, fixed to avoid problems with SDAG dependencies along
with the preceding fix to ScheduleDAGSDNodes::RegDefIter::InitNodeNumDefs.
These problems caused the original regression tests to assert/segfault on many
(but not all) systems.
Original commit message:
This commit does two things:
1. Refactors PPCFastISel to use more of the common infrastructure for call
lowering (this lets us take advantage of this common code for lowering some
common intrinsics, stackmap/patchpoint among them).
2. Adds support for stackmap/patchpoint lowering. For the most part, this is
very similar to the support in the AArch64 target, with the obvious differences
(different registers, NOP instructions, etc.). The test cases are adapted
from the AArch64 test cases.
One difference of note is that the patchpoint call sequence takes 24 bytes, so
you can't use less than that (on AArch64 you can go down to 16). Also, as noted
in the docs, we take the patchpoint address to be the actual code address
(assuming the call is local in the TOC-sharing sense), which should yield
higher performance than generating the full cross-DSO indirect-call sequence
and is likely just as useful for JITed code (if not, we'll change it).
StackMaps and Patchpoints are still marked as experimental, and so this support
is doubly experimental. So go ahead and experiment!
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@225909 91177308-0d34-0410-b5e6-96231b3b80d8
This was already done in clang, this commit now uses the integrated
assembler as default when using LLVM tools directly.
A number of test cases using inline asm had to be adapted, either by
updating the expected output, or by using -no-integrated-as (for such
tests that deliberately use an invalid instruction in inline asm).
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@225819 91177308-0d34-0410-b5e6-96231b3b80d8
Reverting this while I investiage buildbot failures (segfaulting in
GetCostForDef at ScheduleDAGRRList.cpp:314).
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@225811 91177308-0d34-0410-b5e6-96231b3b80d8
This commit does two things:
1. Refactors PPCFastISel to use more of the common infrastructure for call
lowering (this lets us take advantage of this common code for lowering some
common intrinsics, stackmap/patchpoint among them).
2. Adds support for stackmap/patchpoint lowering. For the most part, this is
very similar to the support in the AArch64 target, with the obvious differences
(different registers, NOP instructions, etc.). The test cases are adapted
from the AArch64 test cases.
One difference of note is that the patchpoint call sequence takes 24 bytes, so
you can't use less than that (on AArch64 you can go down to 16). Also, as noted
in the docs, we take the patchpoint address to be the actual code address
(assuming the call is local in the TOC-sharing sense), which should yield
higher performance than generating the full cross-DSO indirect-call sequence
and is likely just as useful for JITed code (if not, we'll change it).
StackMaps and Patchpoints are still marked as experimental, and so this support
is doubly experimental. So go ahead and experiment!
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@225808 91177308-0d34-0410-b5e6-96231b3b80d8
We really need a separate 64-bit version of this instruction so that it can be
marked as clobbering LR8 (instead of just LR). No change in functionality
(although the verifier might be slightly happier), however, it is required for
stackmap/patchpoint support. Thus, this will be covered by stackmap test cases
once those are added.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@225804 91177308-0d34-0410-b5e6-96231b3b80d8
For registers that have DWARF numbers (like CA, which is really part of XER),
add them. Also, RM is not an SPR, and the declaration hack (where it is
declared as an SPR with an arbitrary number) is not needed, so just declare it
as a register.
NFC; although CA's register number will be needed when stackmap/patchpoint
support is added.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@225800 91177308-0d34-0410-b5e6-96231b3b80d8
One is that AArch64 has additional restrictions on when local relocations can
be used. We have to take those into consideration when deciding to put a L
symbol in the symbol table or not.
The other is that ld64 requires the relocations to cstring to use linker
visible symbols on AArch64.
Thanks to Michael Zolotukhin for testing this!
Remove doesSectionRequireSymbols.
In an assembly expression like
bar:
.long L0 + 1
the intended semantics is that bar will contain a pointer one byte past L0.
In sections that are merged by content (strings, 4 byte constants, etc), a
single position in the section doesn't give the linker enough information.
For example, it would not be able to tell a relocation must point to the
end of a string, since that would look just like the start of the next.
The solution used in ELF to use relocation with symbols if there is a non-zero
addend.
In MachO before this patch we would just keep all symbols in some sections.
This would miss some cases (only cstrings on x86_64 were implemented) and was
inefficient since most relocations have an addend of 0 and can be represented
without the symbol.
This patch implements the non-zero addend logic for MachO too.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@225644 91177308-0d34-0410-b5e6-96231b3b80d8
Looking at r225438 inspired me to see how the PowerPC backend handled the
situation (calling a bitcasted TLS global), and it turns out we also produced
an error (cannot select ...). What it means to "call" something that is not a
function is implementation and platform specific, but in the name of doing
something (besides crashing), this makes sure we do what GCC does (treat all
such calls as calls through a function pointer -- meaning that the pointer is
assumed, as is the convention on PPC, to point to a function descriptor
structure holding the actual code address along with the function's TOC pointer
and environment pointer). As GCC does, we now do the same for calling regular
(non-TLS) non-function globals too.
I'm not sure whether this is the most useful way to define the behavior, but at
least we won't be alone.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@225617 91177308-0d34-0410-b5e6-96231b3b80d8
This initial implementation of PPCTargetLowering::isZExtFree marks as free
zexts of small scalar loads (that are not sign-extending). This callback is
used by SelectionDAGBuilder's RegsForValue::getCopyToRegs, and thus to
determine whether a zext or an anyext is used to lower illegally-typed PHIs.
Because later truncates of zero-extended values are nops, this allows for the
elimination of later unnecessary truncations.
Fixes the initial complaint associated with PR22120.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@225584 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
In the previous commit, the register was saved, but space was not allocated.
This resulted in the parameter save area potentially clobbering r30, leading to
nasty results.
Test Plan: Tests updated
Reviewers: hfinkel
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D6906
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@225573 91177308-0d34-0410-b5e6-96231b3b80d8
Now that the way that the partial unrolling threshold for small loops is used
to compute the unrolling factor as been corrected, a slightly smaller threshold
is preferable. This is expected; other targets may need to re-tune as well.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@225566 91177308-0d34-0410-b5e6-96231b3b80d8
The P7 benefits from not have really-small loops so that we either have
multiple dispatch groups in the loop and/or the ability to form more-full
dispatch groups during scheduling. Setting the partial unrolling threshold to
44 seems good, empirically, for the P7. Compared to using no late partial
unrolling, this yields the following test-suite speedups:
SingleSource/Benchmarks/Adobe-C++/simple_types_constant_folding
-66.3253% +/- 24.1975%
SingleSource/Benchmarks/Misc-C++/oopack_v1p8
-44.0169% +/- 29.4881%
SingleSource/Benchmarks/Misc/pi
-27.8351% +/- 12.2712%
SingleSource/Benchmarks/Stanford/Bubblesort
-30.9898% +/- 22.4647%
I've speculatively added a similar setting for the P8. Also, I've noticed that
the unroller does not quite calculate the unrolling factor correctly for really
tiny loops because it neglects to account for the fact that not every loop body
replicant contains an ending branch and counter increment. I'll fix that later.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@225522 91177308-0d34-0410-b5e6-96231b3b80d8
On modern cores with lfiw[az]x, we can fold a sign or zero extension from i32
to i64 into the load necessary for an i64 -> fp conversion.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@225493 91177308-0d34-0410-b5e6-96231b3b80d8
MachineLICM uses a callback named hasLowDefLatency to determine if an
instruction def operand has a 'low' latency. If all relevant operands have a
'low' latency, the instruction is considered too cheap to hoist out of loops
even in low-register-pressure situations. On PowerPC cores, both the embedded
cores and the others, there is no reason to believe that this is a good choice:
all instructions have a cost inside a loop, and hoisting them when not limited
by register pressure is a reasonable default.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@225471 91177308-0d34-0410-b5e6-96231b3b80d8
Summary: The PIC additions didn't update the prologue and epilogue code to save and restore r30 (PIC base register). This does that.
Test Plan: Tests updated.
Reviewers: hfinkel
Reviewed By: hfinkel
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D6876
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@225450 91177308-0d34-0410-b5e6-96231b3b80d8
type (in addition to the memory type).
The *LoadExt* legalization handling used to only have one type, the
memory type. This forced users to assume that as long as the extload
for the memory type was declared legal, and the result type was legal,
the whole extload was legal.
However, this isn't always the case. For instance, on X86, with AVX,
this is legal:
v4i32 load, zext from v4i8
but this isn't:
v4i64 load, zext from v4i8
Whereas v4i64 is (arguably) legal, even without AVX2.
Note that the same thing was done a while ago for truncstores (r46140),
but I assume no one needed it yet for extloads, so here we go.
Calls to getLoadExtAction were changed to add the value type, found
manually in the surrounding code.
Calls to setLoadExtAction were mechanically changed, by wrapping the
call in a loop, to match previous behavior. The loop iterates over
the MVT subrange corresponding to the memory type (FP vectors, etc...).
I also pulled neighboring setTruncStoreActions into some of the loops;
those shouldn't make a difference, as the additional types are illegal.
(e.g., i128->i1 truncstores on PPC.)
No functional change intended.
Differential Revision: http://reviews.llvm.org/D6532
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@225421 91177308-0d34-0410-b5e6-96231b3b80d8
A few loops do trickier things than just iterating on an MVT subset,
so I'll leave them be for now.
Follow-up of r225387.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@225392 91177308-0d34-0410-b5e6-96231b3b80d8
Remove the README.txt entry regarding register allocation of CR logical ops,
and replace it with a FIXME in PPCInstrInfo.td. The text in the README.txt was
not really accurate, and thanks goes to Pat Haugen (and Bill Schmidt) from IBM
for clarifying what was intended and highlighting the relevant text in the ISA
specification.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@225325 91177308-0d34-0410-b5e6-96231b3b80d8
This is affecting the behavior of some ObjC++ / AArch64 test cases on Darwin.
Reverting to get the bots green while I track down the source of the changed
behavior.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@225311 91177308-0d34-0410-b5e6-96231b3b80d8
int->fp conversions on PPC must be done through memory loads and stores. On a
modern core, this process begins by storing the int value to memory, then
loading it using a (sometimes special) FP load instruction. Unfortunately, we
would do this even when the value to be converted was itself a load, and we can
just use that same memory location instead of copying it to another first.
There is a slight complication when handling int_to_fp(fp_to_int(x)) pairs,
because the fp_to_int operand has not been lowered when the int_to_fp is being
lowered. We handle this specially by invoking fp_to_int's lowering logic
(partially) and getting the necessary memory location (some trivial refactoring
was done to make this possible).
This is all somewhat ugly, and it would be nice if some later CodeGen stage
could just clean this stuff up, but because doing so would involve modifying
target-specific nodes (or instructions), it is not immediately clear how that
would work.
Also, remove a related entry from the README.txt for which we now generate
reasonable code.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@225301 91177308-0d34-0410-b5e6-96231b3b80d8
Because of how Clang represents structs as arrays (at least on non-Darwin
platforms), and what SROA does, etc. this is no longer a problem.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@225251 91177308-0d34-0410-b5e6-96231b3b80d8
The old target DAG combine that allowed for performing int_to_fp(fp_to_int(x))
without a load/store pair is updated here with support for unsigned integers,
and to support single-precision values without a third rounding step, on newer
cores with the appropriate instructions.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@225248 91177308-0d34-0410-b5e6-96231b3b80d8
We no longer generate horrible code for the stated function:
void f(signed char *a, _Bool b, _Bool c) {
signed char t = 0;
if (b) t = *a;
if (c) *a = t;
}
for which we now generate:
.L.f:
andi. 5, 5, 1
cmpldi 1, 4, 0
li 5, 0
beq 1, .LBB0_2
lbz 5, 0(3)
.LBB0_2: # %if.end
bclr 4, 1, 0
stb 5, 0(3)
blr
so we don't need the README.txt entry.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@225217 91177308-0d34-0410-b5e6-96231b3b80d8
We now produce the desired code as noted in the README.txt file (no spurious
or). Remove the README entry and improve the regression test.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@225214 91177308-0d34-0410-b5e6-96231b3b80d8
We now produce the desired code as noted in the README.txt file. Remove the
README entry and add a regression test.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@225209 91177308-0d34-0410-b5e6-96231b3b80d8
We now produce the desired code as noted in the README.txt file. Remove the
README entry and add a regression test.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@225205 91177308-0d34-0410-b5e6-96231b3b80d8
Consider this function from our README.txt file:
int foo(int a, int b) { return (a < b) << 4; }
We now explicitly track CR bits by default, so the comment in the README.txt
about not really having a SETCC is no longer accurate, but we did generate this
somewhat silly code:
cmpw 0, 3, 4
li 3, 0
li 12, 1
isel 3, 12, 3, 0
sldi 3, 3, 4
blr
which generates the zext as a select between 0 and 1, and then shifts the
result by a constant amount. Here we preprocess the DAG in order to fold the
results of operations on an extension of an i1 value into the SELECT_I[48]
pseudo instruction when the resulting constant can be materialized using one
instruction (just like the 0 and 1). This was not implemented as a DAGCombine
because the resulting code would have been anti-canonical and depends on
replacing chained user nodes, which does not fit well into the lowering
paradigm. Now we generate:
cmpw 0, 3, 4
li 3, 0
li 12, 16
isel 3, 12, 3, 0
blr
which is less silly.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@225203 91177308-0d34-0410-b5e6-96231b3b80d8
The 64-bit semantics of cntlzw are not special, the 32-bit population count is
stored as a 64-bit value in the range [0,32]. As a result, it is always zero
extended, and it can be added to the PPCISelDAGToDAG peephole optimization as a
frontier instruction for the removal of unnecessary zero extensions.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@225192 91177308-0d34-0410-b5e6-96231b3b80d8
lhbrx and lwbrx not only load their data with byte swapping, but also clear the
upper 32 bits (at least). As a result, they can be added to the PPCISelDAGToDAG
peephole optimization as frontier instructions for the removal of unnecessary
zero extensions.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@225189 91177308-0d34-0410-b5e6-96231b3b80d8
PPC has an instruction for ctlz with defined zero behavior, and our lowering of
cttz (provided by DAGCombine) is also efficient and branchless, so speculating
these makes sense.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@225150 91177308-0d34-0410-b5e6-96231b3b80d8
r225135 added the ability to materialize i64 constants using rotations in order
to reduce the instruction count. Sometimes we can use a rotation only with some
extra masking, so that we take advantage of the fact that generating a bunch of
extra higher-order 1 bits is easy using li/lis.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@225147 91177308-0d34-0410-b5e6-96231b3b80d8
Materializing full 64-bit constants on PPC64 can be expensive, requiring up to
5 instructions depending on the locations of the non-zero bits. Sometimes
materializing a rotated constant, and then applying the inverse rotation, requires
fewer instructions than the direct method. If so, do that instead.
In r225132, I added support for forming constants using bit inversion. In
effect, this reverts that commit and replaces it with rotation support. The bit
inversion is useful for turning constants that are mostly ones into ones that
are mostly zeros (thus enabling a more-efficient shift-based materialization),
but the same effect can be obtained by using negative constants and a rotate,
and that is at least as efficient, if not more.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@225135 91177308-0d34-0410-b5e6-96231b3b80d8
Materializing full 64-bit constants on PPC64 can be expensive, requiring up to
5 instructions depending on the locations of the non-zero bits. Sometimes
materializing the bit-reversed constant, and then flipping the bits, requires
fewer instructions than the direct method. If so, do that instead.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@225132 91177308-0d34-0410-b5e6-96231b3b80d8
The existing code provided for specifying a global loop alignment preference.
However, the preferred loop alignment might depend on the loop itself. For
recent POWER cores, loops between 5 and 8 instructions should have 32-byte
alignment (while the others are better with 16-byte alignment) so that the
entire loop will fit in one i-cache line.
To support this, getPrefLoopAlignment has been made virtual, and can be
provided with an optional MachineLoop* so the target can inspect the loop
before answering the query. The default behavior, as before, is to return the
value set with setPrefLoopAlignment. MachineBlockPlacement now queries the
target for each loop instead of only once per function. There should be no
functional change for other targets.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@225117 91177308-0d34-0410-b5e6-96231b3b80d8
Most modern PowerPC cores prefer that functions and loops start on
16-byte-aligned boundaries (*), so instruct block placement, etc. to make this
happen. The branch selector has also been adjusted so account for the extra
nops that might now be inserted before loop headers.
(*) Some cores actually prefer other alignments for small loops, but that will
be addressed in a follow-up commit.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@225115 91177308-0d34-0410-b5e6-96231b3b80d8
Make sure they all have llvm_unreachable on the default path out of the switch. Remove unnecessary "default: break". Remove a 'return' after unreachable. Fix some indentation.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@225114 91177308-0d34-0410-b5e6-96231b3b80d8
Newer POWER cores, and the A2, support the cmpb instruction. This instruction
compares its operands, treating each of the 8 bytes in the GPRs separately,
returning a 'mask' result of 0 (for false) or -1 (for true) in each byte.
Code generation support is added, in the form of a PPCISelDAGToDAG
DAG-preprocessing routine, that recognizes patterns close to what the
instruction computes (either exactly, or related by a constant masking
operation), and generates the cmpb instruction (along with any necessary
constant masking operation). This can be expanded if use cases arise.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@225106 91177308-0d34-0410-b5e6-96231b3b80d8
Attempting to fix PR22078 (building on 32-bit systems) by replacing my careless
use of 1ul to be a uint64_t constant with UINT64_C(1).
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@225066 91177308-0d34-0410-b5e6-96231b3b80d8
This is the second installment of improvements to instruction selection for "bit
permutation" instruction sequences. r224318 added logic for instruction
selection for 32-bit bit permutation sequences, and this adds lowering for
64-bit sequences. The 64-bit sequences are more complicated than the 32-bit
ones because:
a) the 64-bit versions of the 32-bit rotate-and-mask instructions
work by replicating the lower 32-bits of the value-to-be-rotated into the
upper 32 bits -- and integrating this into the cost modeling for the various
bit group operations is non-trivial
b) unlike the 32-bit instructions in 32-bit mode, the rotate-and-mask instructions
cannot, in one instruction, specify the
mask starting index, the mask ending index, and the rotation factor. Also,
forming arbitrary 64-bit constants is more complicated than in 32-bit mode
because the number of instructions necessary is value dependent.
Plus, support for 'late masking' was added: it is sometimes more efficient to
treat the overall value as if it had no mandatory zero bits when planning the
bit-group insertions, and then mask them in at the very end. Unfortunately, as
the structure of the bit groups is different in the two cases, the more
feasible implementation technique was to generate both instruction sequences,
and then pick the shorter one.
And finally, we now generate reasonable code for i64 bswap:
rldicl 5, 3, 16, 0
rldicl 4, 3, 8, 0
rldicl 6, 3, 24, 0
rldimi 4, 5, 8, 48
rldicl 5, 3, 32, 0
rldimi 4, 6, 16, 40
rldicl 6, 3, 48, 0
rldimi 4, 5, 24, 32
rldicl 5, 3, 56, 0
rldimi 4, 6, 40, 16
rldimi 4, 5, 48, 8
rldimi 4, 3, 56, 0
vs. what we used to produce:
li 4, 255
rldicl 5, 3, 24, 40
rldicl 6, 3, 40, 24
rldicl 7, 3, 56, 8
sldi 8, 3, 8
sldi 10, 3, 24
sldi 12, 3, 40
rldicl 0, 3, 8, 56
sldi 9, 4, 32
sldi 11, 4, 40
sldi 4, 4, 48
andi. 5, 5, 65280
andis. 6, 6, 255
andis. 7, 7, 65280
sldi 3, 3, 56
and 8, 8, 9
and 4, 12, 4
and 9, 10, 11
or 6, 7, 6
or 5, 5, 0
or 3, 3, 4
or 7, 9, 8
or 4, 6, 5
or 3, 3, 7
or 3, 3, 4
which is 12 instructions, instead of 25, and seems optimal (at least in terms
of code size).
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@225056 91177308-0d34-0410-b5e6-96231b3b80d8
The issues was that AArch64 has additional restrictions on when local
relocations can be used. We have to take those into consideration when
deciding to put a L symbol in the symbol table or not.
Original message:
Remove doesSectionRequireSymbols.
In an assembly expression like
bar:
.long L0 + 1
the intended semantics is that bar will contain a pointer one byte past L0.
In sections that are merged by content (strings, 4 byte constants, etc), a
single position in the section doesn't give the linker enough information.
For example, it would not be able to tell a relocation must point to the
end of a string, since that would look just like the start of the next.
The solution used in ELF to use relocation with symbols if there is a non-zero
addend.
In MachO before this patch we would just keep all symbols in some sections.
This would miss some cases (only cstrings on x86_64 were implemented) and was
inefficient since most relocations have an addend of 0 and can be represented
without the symbol.
This patch implements the non-zero addend logic for MachO too.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@225048 91177308-0d34-0410-b5e6-96231b3b80d8
In an assembly expression like
bar:
.long L0 + 1
the intended semantics is that bar will contain a pointer one byte past L0.
In sections that are merged by content (strings, 4 byte constants, etc), a
single position in the section doesn't give the linker enough information.
For example, it would not be able to tell a relocation must point to the
end of a string, since that would look just like the start of the next.
The solution used in ELF to use relocation with symbols if there is a non-zero
addend.
In MachO before this patch we would just keep all symbols in some sections.
This would miss some cases (only cstrings on x86_64 were implemented) and was
inefficient since most relocations have an addend of 0 and can be represented
without the symbol.
This patch implements the non-zero addend logic for MachO too.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@224985 91177308-0d34-0410-b5e6-96231b3b80d8
Determining the address of a TLS variable results in a function call in
certain TLS models. This means that a simple ICmpInst might actually
result in invalidating the CTR register.
In such cases, do not attempt to rely on the CTR register for loop
optimization purposes.
This fixes PR22034.
Differential Revision: http://reviews.llvm.org/D6786
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@224890 91177308-0d34-0410-b5e6-96231b3b80d8
When materializing constant i1 values, they must be zero extended. We represent
i1 values as [0, 1], not [0, -1], in i32 registers. As it turns out, this code
path was dead for i1 values prior to r216006 (which is why this did not manifest in
miscompiles until recently).
Fixes -O0 self-hosting on PPC64/Linux.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@224842 91177308-0d34-0410-b5e6-96231b3b80d8
On non-Darwin PPC64, the TOC reload needs to come directly after the bctrl
instruction (for indirect calls) because the 'bctrl/ld 2, 40(1)' instruction
sequence is interpreted by the unwinding code in libgcc. To make sure these
occur as a pair, as with other pairings interpreted by the linker, fuse the two
instructions into one instruction (for code generation only).
In the future, we might wish to do this by emitting CFI directives instead,
but this solution is simpler, and mirrors what GCC does. Additional discussion
on this point is contained in the PR.
Fixes PR22015.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@224788 91177308-0d34-0410-b5e6-96231b3b80d8
It is tempting to mark the fixed stack slot used to store the return address as
immutable when lowering @llvm.returnaddress(i32 0). Unfortunately, within the
function, it is not completely immutable: it is written during the function
prologue. When using post-RA instruction scheduling, the prologue instructions
are available for scheduling, and we're not free to interchange the order of a
particular store in the prologue with loads from that stack location.
Fixes PR21976.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@224761 91177308-0d34-0410-b5e6-96231b3b80d8
In r224033, in moving the signed power-of-2 division expansion into
BuildSDIVPow2, I accidentally made it possible to attempt the lowering for a
64-bit division on PPC32. This later asserts.
Fixes PR21928.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@224758 91177308-0d34-0410-b5e6-96231b3b80d8
This was missed last time around, for the P8 Instruction Scheduling
changes (223257). This will hook the P8Model entry in so those
changes will actually be used.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@224452 91177308-0d34-0410-b5e6-96231b3b80d8
The PowerPC backend, somewhat embarrassingly, did not generate an
optimal-length sequence of instructions for a 32-bit bswap. While adding a
pattern for the bswap intrinsic to fix this would not have been terribly
difficult, doing so would not have addressed the real problem: we had been
generating poor code for many bit-permuting operations (by which I mean things
like byte-swap that permute the bits of one or more inputs around in various
ways). Here are some initial steps toward solving this deficiency.
Bit-permuting operations are represented, at the SDAG level, using ISD::ROTL,
SHL, SRL, AND and OR (mostly with constant second operands). Looking back
through these operations, we can build up a description of the bits in the
resulting value in terms of bits of one or more input values (and constant
zeros). For each bit, we compute the rotation amount from the original value,
and then group consecutive (value, rotation factor) bits into groups. Groups
sharing these attributes are then collected and sorted, and we can then
instruction select the entire permutation using a combination of masked
rotations (rlwinm), imm ands (andi/andis), and masked rotation inserts
(rlwimi).
The result is that instead of lowering an i32 bswap as:
rlwinm 5, 3, 24, 16, 23
rlwinm 4, 3, 24, 0, 7
rlwimi 4, 3, 8, 8, 15
rlwimi 5, 3, 8, 24, 31
rlwimi 4, 5, 0, 16, 31
we now produce:
rlwinm 4, 3, 8, 0, 31
rlwimi 4, 3, 24, 16, 23
rlwimi 4, 3, 24, 0, 7
and for the 'test6' example in the PowerPC/README.txt file:
unsigned test6(unsigned x) {
return ((x & 0x00FF0000) >> 16) | ((x & 0x000000FF) << 16);
}
we used to produce:
lis 4, 255
rlwinm 3, 3, 16, 0, 31
ori 4, 4, 255
and 3, 3, 4
and now we produce:
rlwinm 4, 3, 16, 24, 31
rlwimi 4, 3, 16, 8, 15
and, as a nice bonus, this fixes the FIXME in
test/CodeGen/PowerPC/rlwimi-and.ll.
This commit does not include instruction-selection for i64 operations, those
will come later.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@224318 91177308-0d34-0410-b5e6-96231b3b80d8
PPCTargetLowering::DAGCombineExtBoolTrunc contains logic to remove unwanted
truncations and extensions when dealing with nodes of the form:
zext(binary-ops(binary-ops(trunc(x), trunc(y)), ...)
There was a FIXME in the implementation (now removed) regarding the fact that
the function would abort the transformations if any of the non-output operands
of a SELECT or SELECT_CC node would need to be promoted (because they were
also output operands, for example). As a result, we continued to generate
unnecessary zero-extends for code such as this:
unsigned foo(unsigned a, unsigned b) {
return (a <= b) ? a : b;
}
which would produce:
cmplw 0, 3, 4
isel 3, 4, 3, 1
rldicl 3, 3, 0, 32
blr
and now we produce:
cmplw 0, 3, 4
isel 3, 4, 3, 1
blr
which is better in the obvious way.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@224213 91177308-0d34-0410-b5e6-96231b3b80d8
On PPC64, we end up with lots of i32 -> i64 zero extensions, not only from all
of the usual places, but also from the ABI, which specifies that values passed
are zero extended. Almost all 32-bit PPC instructions in PPC64 mode are defined
to do *something* to the higher-order bits, and for some instructions, that
action clears those bits (thus providing a zero-extended result). This is
especially common after rotate-and-mask instructions. Adding an additional
instruction to zero-extend the results of these instructions is unnecessary.
This PPCISelDAGToDAG peephole optimization examines these zero-extensions, and
looks back through their operands to see if all instructions will implicitly
zero extend their results. If so, we convert these instructions to their 64-bit
variants (which is an internal change only, the actual encoding of these
instructions is the same as the original 32-bit ones) and remove the
unnecessary zero-extension (changing where the INSERT_SUBREG instructions are
to make everything internally consistent).
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@224169 91177308-0d34-0410-b5e6-96231b3b80d8
If we have an add (or an or that is really an add), where one operand is a
FrameIndex and the other operand is a small constant, we can combine the
lowering of the FrameIndex (which is lowered as an add of the FI and a zero
offset) with the constant operand.
Amusingly, this is an old potential improvement entry from
lib/Target/PowerPC/README.txt which had never been resolved. In short, we used
to lower:
%X = alloca { i32, i32 }
%Y = getelementptr {i32,i32}* %X, i32 0, i32 1
ret i32* %Y
as:
addi 3, 1, -8
ori 3, 3, 4
blr
and now we produce:
addi 3, 1, -4
blr
which is much more sensible.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@224071 91177308-0d34-0410-b5e6-96231b3b80d8
Previously print+verify passes were added in a very unsystematic way, which is
annoying when debugging as you miss intermediate steps and allows bugs to stay
unnotice when no verification is performed.
To make this change practical I added the possibility to explicitely disable
verification. I used this option on all places where no verification was
performed previously (because alot of places actually don't pass the
MachineVerifier).
In the long term these problems should be fixed properly and verification
enabled after each pass. I'll enable some more verification in subsequent
commits.
This is the 2nd attempt at this after realizing that PassManager::add() may
actually delete the pass.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@224059 91177308-0d34-0410-b5e6-96231b3b80d8
Previously print+verify passes were added in a very unsystematic way, which is
annoying when debugging as you miss intermediate steps and allows bugs to stay
unnotice when no verification is performed.
To make this change practical I added the possibility to explicitely disable
verification. I used this option on all places where no verification was
performed previously (because alot of places actually don't pass the
MachineVerifier).
In the long term these problems should be fixed properly and verification
enabled after each pass. I'll enable some more verification in subsequent
commits.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@224042 91177308-0d34-0410-b5e6-96231b3b80d8
PPCISelDAGToDAG contained existing code to lower i32 sdiv by a power-of-2 using
srawi/addze, but did not implement the i64 case. DAGCombine now contains a
callback specifically designed for this purpose (BuildSDIVPow2), and part of
the logic has been moved to an implementation of that callback. Doing this
lowering using BuildSDIVPow2 likely does not matter, compared to handling
everything in PPCISelDAGToDAG, for the positive divisor case, but the negative
divisor case, which generates an additional negation, can potentially benefit
from additional folding from DAGCombine. Now, both the i32 and the i64 cases
have been implemented.
Fixes PR20732.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@224033 91177308-0d34-0410-b5e6-96231b3b80d8
With the foregoing three patches, VSX instructions can be used for
little endian. This patch removes the restriction that prevented
this, and re-enables the test cases from the first three patches.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@223792 91177308-0d34-0410-b5e6-96231b3b80d8
When performing instruction selection for ISD::VECTOR_SHUFFLE, there
is special code for handling v2f64 and v2i64 using VSX instructions.
This code must be adjusted for little-endian. Because the two inputs
are treated as a double-wide register, we must swap their order for
little endian. To get the appropriate mask elements to use with the
big-endian biased XXPERMDI instruction, we must reverse their order
and invert the bits.
A new test is added to test the 16 possible values of the shuffle
mask. It is initially disabled for reasons specified in the test. It
is re-enabled by patch 4/4.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@223791 91177308-0d34-0410-b5e6-96231b3b80d8
For little endian, we need to make some straightforward adjustments in
the code expansions for scalar_to_vector and vector_extract of v2f64.
First, scalar_to_vector must place the scalar into vector element
zero. However, our implementation of SUBREG_TO_REG will place it into
big-element vector element zero (high-order bits), and for little
endian we need it in the low-order bits. The LE implementation splats
the high-order doubleword into the low-order doubleword.
Second, the meaning of (vector_extract x, 0) and (vector_extract x, 1)
must be reversed for similar reasons.
A new test is added that tests code generation for insertelement and
extractelement for both element 0 and element 1. It is disabled in
this patch but enabled in patch 4/4, for reasons stated in the test.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@223788 91177308-0d34-0410-b5e6-96231b3b80d8
This patch addresses the inherent big-endian bias in the lxvd2x,
lxvw4x, stxvd2x, and stxvw4x instructions. These instructions load
vector elements into registers left-to-right (with the first element
loaded into the high-order bits of the register), regardless of the
endian setting of the processor. However, these are the only
vector memory instructions that permit unaligned storage accesses, so
we want to use them for little-endian.
To make this work, a lxvd2x or lxvw4x is replaced with an lxvd2x
followed by an xxswapd, which swaps the doublewords. This works for
lxvw4x as well as lxvd2x, because for lxvw4x on an LE system the
vector elements are in LE order (right-to-left) within each
doubleword. (Thus after lxvw2x of a <4 x float> the elements will
appear as 1, 0, 3, 2. Following the swap, they will appear as 3, 2,
0, 1, as desired.) For stores, an stxvd2x or stxvw4x is replaced
with an stxvd2x preceded by an xxswapd.
Introduction of extra swap instructions provides correctness, but
obviously is not ideal from a performance perspective. Future patches
will address this with optimizations to remove most of the introduced
swaps, which have proven effective in other implementations.
The introduction of the swaps is performed during lowering of LOAD,
STORE, INTRINSIC_W_CHAIN, and INTRINSIC_VOID operations. The latter
are used to translate intrinsics that specify the VSX loads and stores
directly into equivalent sequences for little endian. Thus code that
uses vec_vsx_ld and vec_vsx_st does not have to be modified to be
ported from BE to LE.
We introduce new PPCISD opcodes for LXVD2X, STXVD2X, and XXSWAPD for
use during this lowering step. In PPCInstrVSX.td, we add new SDType
and SDNode definitions for these (PPClxvd2x, PPCstxvd2x, PPCxxswapd).
These are recognized during instruction selection and mapped to the
correct instructions.
Several tests that were written to use -mcpu=pwr7 or pwr8 are modified
to disable VSX on LE variants because code generation changes with
this and subsequent patches in this set. I chose to include all of
these in the first patch than try to rigorously sort out which tests
were broken by one or another of the patches. Sorry about that.
The new test vsx-ldst-builtin-le.ll, and the changes to vsx-ldst.ll,
are disabled until LE support is enabled because of breakages that
occur as noted in those tests. They are re-enabled in patch 4/4.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@223783 91177308-0d34-0410-b5e6-96231b3b80d8
CodeGen/PowerPC/vsx-p8.ll was failing.
'+power8-vector' is not a recognized feature for this target (ignoring feature)
llvm/test/CodeGen/PowerPC/vsx-p8.ll:33:14: error: expected string not found in input
; CHECK-REG: lxvw4x 34, 0, 3
^
<stdin>:50:2: note: scanning from here
.align 3
^
<stdin>:61:2: note: possible intended match here
lvx 3, 0, 3
^
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@223729 91177308-0d34-0410-b5e6-96231b3b80d8
GCC accepts 'cc' as an alias for 'cr0', and we need to do the same when
processing inline asm constraints. This had previously been implemented using a
non-allocatable register, named 'cc', that was listed as an alias of 'cr0', but
the infrastructure does not seem to support this properly (neither the register
allocator nor the scheduler properly accounts for the alias). Instead, we can
just process this as a naming alias inside of the inline asm
constraint-processing code, so we'll do that instead.
There are two regression tests, one where the post-RA scheduler did the wrong
thing with the non-allocatable alias, and one where the register allocator did
the wrong thing. Fixes PR21742.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@223708 91177308-0d34-0410-b5e6-96231b3b80d8
We had mistakenly believed that GCC's 'cc' referred to the entire
condition-code register (cr0 through cr7) -- and implemented this in r205630 to
fix PR19326, but 'cc' is actually an alias only to 'cr0'. This is causing LLVM
to clobber too much with legacy code with inline asm using the 'cc' clobber.
Fixes PR21451.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@223328 91177308-0d34-0410-b5e6-96231b3b80d8
On PowerPC, inline asm memory operands might be expanded as 0($r), where $r is
a register containing the address. As a result, this register cannot be r0, and
we need to enforce this register subclass constraint to prevent miscompiling
the code (we'd get this constraint for free with the usual instruction
definitions, but that scheme has no knowledge of how we end up printing inline
asm memory operands, and so here we need to do it 'by hand'). We can accomplish
this within the current address-mode selection framework by introducing an
explicit COPY_TO_REGCLASS node.
Fixes PR21443.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@223318 91177308-0d34-0410-b5e6-96231b3b80d8
Almost all immediates in PowerPC assembly (both 32-bit and 64-bit) are signed
numbers, and it is important that we print them as such. To make sure that
happens, we change PPCTargetLowering::LowerAsmOperandForConstraint so that it
does all intermediate checks on a signed-extended int64_t value, and then
creates the resulting target constant using MVT::i64. This will ensure that all
negative values are printed as negative values (mirroring what is done in other
backends to achieve the same sign-extension effect).
This came up in the context of inline assembly like this:
"add%I2 %0,%0,%2", ..., "Ir"(-1ll)
where we used to print:
addi 3,3,4294967295
and gcc would print:
addi 3,3,-1
and gas accepts both forms, but our builtin assembler (correctly) does not. Now
we print -1 like gcc does.
While here, I replaced a bunch of custom integer checks with isInt<16> and
friends from MathExtras.h.
Thanks to Paul Hargrove for the bug report.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@223220 91177308-0d34-0410-b5e6-96231b3b80d8
We need to use the custom expansion of readcyclecounter on all 32-bit targets
(even those with 64-bit registers). This should fix the ppc64 buildbot.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@223182 91177308-0d34-0410-b5e6-96231b3b80d8
We've long supported readcyclecounter on PPC64, but it is easier there (the
read of the 64-bit time-base register can be accomplished via a single
instruction). This now provides an implementation for PPC32 as well. On PPC32,
the time-base register is still 64 bits, but can only be read 32 bits at a time
via two separate SPRs. The ISA manual explains how to do this properly (it
involves re-reading the upper bits and looping if the counter has wrapped while
being read).
This requires PPC to implement a custom integer splitting legalization for the
READCYCLECOUNTER node, turning it into a target-specific SDAG node, which then
gets turned into a pseudo-instruction, which is then expanded to the necessary
sequence (which has three SPR reads, the comparison and the branch).
Thanks to Paul Hargrove for pointing out to me that this was still unimplemented.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@223161 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
PowerPC DWARF unwind info defined CFA as SP + offset even in a function
where the stack had been dynamically realigned. This clearly doesn't
work because the offset from SP to CFA is not a constant. Fix it by
defining CFA as BP instead.
This was causing the AddressSanitizer null_deref test to fail 50% of
the time, depending on whether SP happened to be 32-byte aligned on
entry to a particular function or not.
Reviewers: willschm, uweigand, hfinkel
Reviewed By: hfinkel
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D6410
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@222996 91177308-0d34-0410-b5e6-96231b3b80d8
Add assembler support for the fixed-point cache-inhibited load/store
instructions. These are hypervisor-level only, so don't get too excited ;)
Fixes PR21650.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@222976 91177308-0d34-0410-b5e6-96231b3b80d8
The attn instruction is not part of the Power ISA, but is documented in the A2
user manual, and is accepted by the GNU assembler for the A2 and the POWER4+.
Reported as part of PR21650.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@222712 91177308-0d34-0410-b5e6-96231b3b80d8
This does not matter on newer cores (where we can use reciprocal estimates in
fast-math mode anyway), but for older cores this allows us to generate better
fast-math code where we have multiple FDIVs with a common divisor.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@222710 91177308-0d34-0410-b5e6-96231b3b80d8
When processing an assignment in the integrated assembler that sets
a symbol to the value of another symbol, we need to copy the st_other
bits that encode the local entry point offset.
Modeled after MipsTargetELFStreamer::emitAssignment handling of the
ELF::STO_MIPS_MICROMIPS flag.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@222672 91177308-0d34-0410-b5e6-96231b3b80d8
This mirrors r222331, which enabled SeparateConstOffsetFromGEP on AArch64, in
the PowerPC backend. Yields, on a POWER7 machine, a 30% speedup on
SingleSource/Benchmarks/Shootout/nestedloop (this might just be from LICM,
there is a store moved out of the inner loop) and a potential speedup on
MultiSource/Benchmarks/mediabench/mpeg2/mpeg2dec/mpeg2decode. Regardless, it
makes some code look cleaner, and synchronizing the backends in this regard
seems like a generally good thing.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@222504 91177308-0d34-0410-b5e6-96231b3b80d8
These recently all grew a unique_ptr<TargetLoweringObjectFile> member in
r221878. When anyone calls a virtual method of a class, clang-cl
requires all virtual methods to be semantically valid. This includes the
implicit virtual destructor, which triggers instantiation of the
unique_ptr destructor, which fails because the type being deleted is
incomplete.
This is just part of the ongoing saga of PR20337, which is affecting
Blink as well. Because the MSVC ABI doesn't have key functions, we end
up referencing the vtable and implicit destructor on any virtual call
through a class. We don't actually end up emitting the dtor, so it'd be
good if we could avoid this unneeded type completion work.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@222480 91177308-0d34-0410-b5e6-96231b3b80d8
This is to be consistent with StringSet and ultimately with the standard
library's associative container insert function.
This lead to updating SmallSet::insert to return pair<iterator, bool>,
and then to update SmallPtrSet::insert to return pair<iterator, bool>,
and then to update all the existing users of those functions...
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@222334 91177308-0d34-0410-b5e6-96231b3b80d8
This patch adds builtin support for xvdivdp and xvdivsp, along with a
test case. Straightforward stuff.
There's a companion patch for Clang.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221983 91177308-0d34-0410-b5e6-96231b3b80d8
Summary:
Large-model was added first. With the addition of support for multiple PIC
models in LLVM, now add small-model PIC for 32-bit PowerPC, SysV4 ABI. This
generates more optimal code, for shared libraries with less than about 16380
data objects.
Test Plan: Test cases added or updated
Reviewers: joerg, hfinkel
Reviewed By: hfinkel
Subscribers: jholewinski, mcrosier, emaste, llvm-commits
Differential Revision: http://reviews.llvm.org/D5399
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221791 91177308-0d34-0410-b5e6-96231b3b80d8
This patch enables the vec_vsx_ld and vec_vsx_st intrinsics for
PowerPC, which provide programmer access to the lxvd2x, lxvw4x,
stxvd2x, and stxvw4x instructions.
New LLVM intrinsics are provided to represent these four instructions
in IntrinsicsPowerPC.td. These are patterned after the similar
intrinsics for lvx and stvx (Altivec). In PPCInstrVSX.td, these
intrinsics are tied to the code gen patterns, with additional patterns
to allow plain vanilla loads and stores to still generate these
instructions.
At -O1 and higher the intrinsics are immediately converted to loads
and stores in InstCombineCalls.cpp. This will open up more
optimization opportunities while still allowing the correct
instructions to be generated. (Similar code exists for aligned
Altivec loads and stores.)
The new intrinsics are added to the code that checks for consecutive
loads and stores in PPCISelLowering.cpp, as well as to
PPCTargetLowering::getTgtMemIntrinsic().
There's a new test to verify the correct instructions are generated.
The loads and stores tend to be reordered, so the test just counts
their number. It runs at -O2, as it's not very effective to test this
at -O0, when many unnecessary loads and stores are generated.
I ended up having to modify vsx-fma-m.ll. It turns out this test case
is slightly unreliable, but I don't know a good way to prevent
problems with it. The xvmaddmdp instructions read and write the same
register, which is one of the multiplicands. Commutativity allows
either to be chosen. If the FMAs are reordered differently than
expected by the test, the register assignment can be different as a
result. Hopefully this doesn't change often.
There is a companion patch for Clang.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221767 91177308-0d34-0410-b5e6-96231b3b80d8
With this patch MCDisassembler::getInstruction takes an ArrayRef<uint8_t>
instead of a MemoryObject.
Even on X86 there is a maximum size an instruction can have. Given
that, it seems way simpler and more efficient to just pass an ArrayRef
to the disassembler instead of a MemoryObject and have it do a virtual
call every time it wants some extra bytes.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221751 91177308-0d34-0410-b5e6-96231b3b80d8
My original support for the general dynamic and local dynamic TLS
models contained some fairly obtuse hacks to generate calls to
__tls_get_addr when lowering a TargetGlobalAddress. Rather than
generating real calls, special GET_TLS_ADDR nodes were used to wrap
the calls and only reveal them at assembly time. I attempted to
provide correct parameter and return values by chaining CopyToReg and
CopyFromReg nodes onto the GET_TLS_ADDR nodes, but this was also not
fully correct. Problems were seen with two back-to-back stores to TLS
variables, where the call sequences ended up overlapping with unhappy
results. Additionally, since these weren't real calls, the proper
register side effects of a call were not recorded, so clobbered values
were kept live across the calls.
The proper thing to do is to lower these into calls in the first
place. This is relatively straightforward; see the changes to
PPCTargetLowering::LowerGlobalTLSAddress() in PPCISelLowering.cpp.
The changes here are standard call lowering, except that we need to
track the fact that these calls will require a relocation. This is
done by adding a machine operand flag of MO_TLSLD or MO_TLSGD to the
TargetGlobalAddress operand that appears earlier in the sequence.
The calls to LowerCallTo() eventually find their way to
LowerCall_64SVR4() or LowerCall_32SVR4(), which call FinishCall(),
which calls PrepareCall(). In PrepareCall(), we detect the calls to
__tls_get_addr and immediately snag the TargetGlobalTLSAddress with
the annotated relocation information. This becomes an extra operand
on the call following the callee, which is expected for nodes of type
tlscall. We change the call opcode to CALL_TLS for this case. Back
in FinishCall(), we change it again to CALL_NOP_TLS for 64-bit only,
since we require a TOC-restore nop following the call for the 64-bit
ABIs.
During selection, patterns in PPCInstrInfo.td and PPCInstr64Bit.td
convert the CALL_TLS nodes into BL_TLS nodes, and convert the
CALL_NOP_TLS nodes into BL8_NOP_TLS nodes. This replaces the code
removed from PPCAsmPrinter.cpp, as the BL_TLS or BL8_NOP_TLS
nodes can now be emitted normally using their patterns and the
associated printTLSCall print method.
Finally, as a result of these changes, all references to get-tls-addr
in its various guises are no longer used, so they have been removed.
There are existing TLS tests to verify the changes haven't messed
anything up). I've added one new test that verifies that the problem
with the original code has been fixed.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221703 91177308-0d34-0410-b5e6-96231b3b80d8
This fixes a few cases of:
* Wrong variable name style.
* Lines longer than 80 columns.
* Repeated names in comments.
* clang-format of the above.
This make the next patch a lot easier to read.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221615 91177308-0d34-0410-b5e6-96231b3b80d8
This removes calls to isMaterializable in the following cases:
* It was redundant with a call to isDeclaration now that isDeclaration returns
the correct answer for materializable functions.
* It was followed by a call to Materialize. Just call Materialize and check EC.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@221050 91177308-0d34-0410-b5e6-96231b3b80d8
Now that we have initial support for VSX, we can begin adding
intrinsics for programmer access to VSX instructions. This patch adds
basic support for VSX intrinsics in general, and tests it by
implementing intrinsics for minimum and maximum for the vector double
data type.
The LLVM portion of this is quite straightforward. There is a
companion patch for Clang.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@220988 91177308-0d34-0410-b5e6-96231b3b80d8
Since block address values can be larger than 2GB in 64-bit code, they
cannot be loaded simply using an @l / @ha pair, but instead must be
loaded from the TOC, just like GlobalAddress, ConstantPool, and
JumpTable values are.
The commit also fixes a bug in PPCLinuxAsmPrinter::doFinalization where
temporary labels could not be used as TOC values, since code would
attempt (and fail) to use GetOrCreateSymbol to create a symbol of the
same name as the temporary label.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@220959 91177308-0d34-0410-b5e6-96231b3b80d8
This is a first step for generating SSE rsqrt instructions for
reciprocal square root calcs when fast-math is allowed.
For now, be conservative and only enable this for AMD btver2
where performance improves significantly - for example, 29%
on llvm/projects/test-suite/SingleSource/Benchmarks/BenchmarkGame/n-body.c
(if we convert the data type to single-precision float).
This patch adds a two constant version of the Newton-Raphson
refinement algorithm to DAGCombiner that can be selected by any target
via a parameter returned by getRsqrtEstimate()..
See PR20900 for more details:
http://llvm.org/bugs/show_bug.cgi?id=20900
Differential Revision: http://reviews.llvm.org/D5658
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@220570 91177308-0d34-0410-b5e6-96231b3b80d8