On the x86-64 and thumb2 targets, some registers are more expensive to encode
than others in the same register class.
Add a CostPerUse field to the TableGen register description, and make it
available from TRI->getCostPerUse. This represents the cost of a REX prefix or a
32-bit instruction encoding required by choosing a high register.
Teach the greedy register allocator to prefer cheap registers for busy live
ranges (as indicated by spill weight).
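As a rough sketch (not the committed allocator change), a heuristic consulting
the new hook could look like the following; pickPhysReg and IsHot are
hypothetical, and header/signature details vary by release:

  #include "llvm/ADT/ArrayRef.h"
  #include "llvm/CodeGen/TargetRegisterInfo.h"

  // Prefer the cheapest-to-encode register for a busy live range; for a
  // cold range, the first register in allocation order is fine.
  static unsigned pickPhysReg(llvm::ArrayRef<llvm::MCPhysReg> Order,
                              const llvm::TargetRegisterInfo *TRI,
                              bool IsHot) {
    unsigned Best = Order.front();
    if (!IsHot)
      return Best;
    for (llvm::MCPhysReg R : Order)
      if (TRI->getCostPerUse(R) < TRI->getCostPerUse(Best))
        Best = R;  // e.g. avoid REX-prefixed registers on x86-64
    return Best;
  }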
llvm-svn: 129864
used by Clang. To help Clang integration, the PTX target has been split
into two targets: ptx32 and ptx64, depending on the desired pointer size.
- Add GCCBuiltin class to all intrinsics
- Split PTX target into ptx32 and ptx64
llvm-svn: 129851
Add a avoidWriteAfterWrite() target hook to identify register classes that
suffer from write-after-write hazards. For those register classes, try to avoid
writing the same register in two consecutive instructions.
This is currently disabled by default. We should not spill to avoid hazards!
The command line flag -avoid-waw-hazard can be used to enable waw avoidance.
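The hook's shape, as described, is roughly the sketch below; MyTargetRegisterInfo
and DPRRegClass are placeholders, and the exact signature is an assumption:

  // Hypothetical target override: ask the register allocator to rotate
  // destination registers for classes that suffer WAW stalls.
  bool MyTargetRegisterInfo::avoidWriteAfterWrite(
      const llvm::TargetRegisterClass *RC) const {
    return RC == &MyTarget::DPRRegClass;  // placeholder register class
  }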
llvm-svn: 129772
the generated FastISel. X86 doesn't need to generate code to match ADD16ri8
since ADD16ri will do just fine. This is a small codesize win in the generated
instruction selector.
llvm-svn: 129692
kind of predicate: one that is specific to imm nodes. The predicate function
specified here just checks an int64_t directly instead of messing around with
SDNodes. The virtue of this is that fastisel and other things
can reason about these predicates.
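For illustration, such a predicate is just a plain function of the immediate
value; a hypothetical one might be:

  // Hypothetical immediate predicate: no SDNode plumbing, just the raw
  // value, so FastISel can evaluate it as easily as the DAG matcher.
  static bool isUInt8Imm(int64_t Imm) {
    return Imm >= 0 && Imm <= 255;
  }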
llvm-svn: 129675
structure and fix some fixmes. We now have a TreePredicateFn class
that handles all of the decoding of these things. This is an internal
cleanup that has no impact on the code generated by tblgen.
llvm-svn: 129670
The basic issue here is that bottom-up isel is matching the branch
and compare, and was failing to fold the load into the branch/compare
combo. Fixing this (by allowing folding into any instruction of a
sequence that is selected) allows us to produce things like:
cmpb $0, 52(%rax)
je LBB4_2
instead of:
movb 52(%rax), %cl
cmpb $0, %cl
je LBB4_2
This makes the generated -O0 code run a bit faster, but also speeds up
compile time by putting less pressure on the register allocator and
generating less code.
This was one of the biggest classes of missing load folding. Implementing
this shrinks 176.gcc's c-decl.s (as a random example) by about 4% in (verbose-asm)
line count.
llvm-svn: 129656
Change ELF systems to use CFI for producing the EH tables. This reduces the
size of the clang binary in Debug builds from 690MB to 679MB.
llvm-svn: 129571
This is done by pushing physical register definitions close to their
use, which happens to handle flag definitions if they're not glued to
the branch. This seems to be generally a good thing though, so I
didn't need to add a target hook yet.
The primary motivation is to generate code closer to what people
expect and rule out missed opportunities from enabling macro-op
fusion. As a side benefit, we get several 2-5% gains on x86
benchmarks. There is one regression:
SingleSource/Benchmarks/Shootout/lists slows down by about 10%. But this is
an independent scheduler bug that will be tracked separately.
See rdar://problem/9283108.
Incidentally, pre-RA scheduling is only half the solution. Fixing the
later passes is tracked by:
<rdar://problem/8932804> [pre-RA-sched] on x86, attempt to schedule CMP/TEST adjacent with condition jump
Fixes:
<rdar://problem/9262453> Scheduler unnecessary break of cmp/jump fusion
llvm-svn: 129508
will allow multiple contexts with different loop unroll parameters to run. This is a minor change and has no effect
on existing applications.
llvm-svn: 129449
Additional fixes:
Do something reasonable for subtargets with generic
itineraries by handling node latency the same way as for an empty
itinerary. Now nodes default to unit latency unless an itinerary
explicitly specifies a zero-cycle stage or it is a TokenFactor chain.
Original fixes:
UnitsSharePred was a source of randomness in the scheduler: node
priority depended on the queue data structure. I rewrote the recent
VRegCycle heuristics to completely replace the old heuristic without
any randomness. To make the node latency adjustments work, I also
needed to do something a little more reasonable with TokenFactor. I
gave it zero latency to its consumers and always schedule it as low as
possible.
llvm-svn: 129421
Now that we have a first-class way to represent unaligned loads, the unaligned
load intrinsics are superfluous.
First part of <rdar://problem/8460511>.
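For reference, the first-class form is just an ordinary load with alignment 1;
a minimal sketch using today's IRBuilder API (which postdates this commit)
might be:

  #include "llvm/IR/IRBuilder.h"

  // Emit `load <ty>, align 1` instead of a target-specific intrinsic.
  static llvm::Value *emitUnalignedLoad(llvm::IRBuilder<> &B, llvm::Type *Ty,
                                        llvm::Value *Ptr) {
    return B.CreateAlignedLoad(Ty, Ptr, llvm::Align(1), "uload");
  }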
llvm-svn: 129401
Add handling for tracking the relocations on symbols and resolving them.
Keep track of the relocations even after they are resolved so that if
the RuntimeDyld client moves the object, it can update the address and any
relocations to that object will be updated.
For our trivial object file load/run test harness (llvm-rtdyld), this enables
relocations between functions located in the same object module. It should
be trivially extendable to load multiple objects with mutual references.
As a simple example, the following now works (running on x86_64 Darwin 10.6):
$ cat t.c
int bar() {
  return 65;
}
int main() {
  return bar();
}
$ clang t.c -fno-asynchronous-unwind-tables -o t.o -c
$ otool -vt t.o
t.o:
(__TEXT,__text) section
_bar:
0000000000000000 pushq %rbp
0000000000000001 movq %rsp,%rbp
0000000000000004 movl $0x00000041,%eax
0000000000000009 popq %rbp
000000000000000a ret
000000000000000b nopl 0x00(%rax,%rax)
_main:
0000000000000010 pushq %rbp
0000000000000011 movq %rsp,%rbp
0000000000000014 subq $0x10,%rsp
0000000000000018 movl $0x00000000,0xfc(%rbp)
000000000000001f callq 0x00000024
0000000000000024 addq $0x10,%rsp
0000000000000028 popq %rbp
0000000000000029 ret
$ llvm-rtdyld t.o -debug-only=dyld ; echo $?
Function sym: '_bar' @ 0
Function sym: '_main' @ 16
Extracting function: _bar from [0, 15]
allocated to 0x100153000
Extracting function: _main from [16, 41]
allocated to 0x100154000
Relocation at '_main' + 16 from '_bar(Word1: 0x2d000000)
Resolving relocation at '_main' + 16 (0x100154010) from '_bar (0x100153000)(pcrel, type: 2, Size: 4).
loaded '_main' at: 0x100154000
65
$
llvm-svn: 129388
Use debug info in the IR to find the directory/file:line:col. Each time that location changes, bump a counter.
Unlike the existing profiling system, we don't try to look at argv[], and thus don't require main() to be present in the IR. This matches GCC's technique where you specify the profiling flag when producing each .o file.
The runtime library is minimal, currently just calling printf at program shutdown time. The API is designed to make it possible to emit GCOV data later on.
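A minimal sketch of the counting idea (not the committed instrumentation, and
written against today's DebugLoc API):

  #include "llvm/IR/DebugInfoMetadata.h"
  #include "llvm/IR/Function.h"
  #include "llvm/IR/Instruction.h"

  // Count how often the debug location (file:line:col) changes while
  // walking a function in order; the real pass bumps a runtime counter.
  static unsigned countLocationChanges(const llvm::Function &F) {
    unsigned Changes = 0;
    const llvm::DILocation *Last = nullptr;
    for (const llvm::BasicBlock &BB : F)
      for (const llvm::Instruction &I : BB)
        if (const llvm::DILocation *Loc = I.getDebugLoc())
          if (Loc != Last) {
            ++Changes;
            Last = Loc;
          }
    return Changes;
  }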
llvm-svn: 129340
disassembler API. Hooked this up to the ARM target so such tools as Darwin's
otool(1) can now print things like branch targets, for example this:
blx _puts
instead of this:
blx #-36
And even print the expression encoded in the Mach-O relocation entries for
things like this:
movt r0, :upper16:((_foo-_bar)+1234)
llvm-svn: 129284
--- Reverse-merging r129235 into '.':
D test/Feature/bb_attrs.ll
U include/llvm/BasicBlock.h
U include/llvm/Bitcode/LLVMBitCodes.h
U lib/VMCore/AsmWriter.cpp
U lib/VMCore/BasicBlock.cpp
U lib/AsmParser/LLParser.cpp
U lib/AsmParser/LLLexer.cpp
U lib/AsmParser/LLToken.h
U lib/Bitcode/Reader/BitcodeReader.cpp
U lib/Bitcode/Writer/BitcodeWriter.cpp
llvm-svn: 129259
* Add a "landing pad" attribute to the BasicBlock.
* Modify the bitcode reader and writer to handle said attribute.
Later: The verifier will ensure that the landing pad attribute is used in the
appropriate manner. I.e., not applied to the entry block, and applied only to
basic blocks that are branched to via a `dispatch' instruction.
(This is a work-in-progress.)
llvm-svn: 129235
It is common for large live ranges to have few basic blocks with register uses
and many live-through blocks without any uses. This approach grows the Hopfield
network incrementally around the use blocks, completely avoiding checking
interference for some through blocks.
llvm-svn: 129188
Teach 32-bit section loading to use the Memory Manager interface, just like
the 64-bit loading does. Tidy up a few other things here and there.
llvm-svn: 129138
with the newer, cleaner model. It uses the IAPrinter class to hold the
information that is needed to match an instruction with its alias. This also
takes into account the available features of the platform.
There is one bit of ugliness. The way the logic determines if a pattern is
unique is O(N**2), which is gross. But in reality, the number of items it's
checking against isn't large. So while it's N**2, it shouldn't be a massive time
sink.
llvm-svn: 129110
induction variable. The pre-RA scheduler is unaware of induction vars,
so we look for potential "virtual register cycles" instead.
Fixes <rdar://problem/8946719> Bad scheduling prevents coalescing
llvm-svn: 129100
Start teaching the runtime Dyld interface to use the memory manager API
for allocating space. Rather than mapping directly into the MachO object,
we extract the payload for each object and copy it into a dedicated buffer
allocated via the memory manager. For now, just do Segment64, so this works
on x86_64, but not yet on ARM.
llvm-svn: 128973
developers can see if their driver changed any cl::Option values. The
current implementation isn't perfect but handles most kinds of
options. This is nice to have when decomposing the stages of
compilation and moving between different drivers. It's also a good
sanity check when comparing results produced by different command line
invocations that are expected to produce comparable results.
Note: This is not an attempt to prolong the life of cl::Option. On the
contrary, it's a placeholder for a feature that must exist when
cl::Option is replaced by a more appropriate framework. A new
framework needs: a central option registry, dynamic name lookup,
non-global containers of option values (e.g. per-module,
per-function), *and* the ability to print option values and their defaults at
any point during compilation.
llvm-svn: 128910
This allows us to always keep the smaller slot for an instruction, which is what
we want when a register has early clobber defines.
Drop the UsingInstrs set and the UsingBlocks map. They are no longer needed.
llvm-svn: 128886
inlined path for the common case.
Most basic blocks don't contain a call that may throw, so the last split point
is simply the first terminator.
llvm-svn: 128874
It needed to be moved closer to the setjmp statement, because the code directly
after the setjmp needs to know about values that are on the stack. Also, the
'bitcast' of the function context was causing a dead load. This wouldn't be too
horrible, except that at -O0 it wasn't optimized out, and because it wasn't
using the correct base pointer (if there is a VLA), it would try to access a
value from a garbage address.
<rdar://problem/9130540>
llvm-svn: 128873
The JITMemory manager references LLVM IR constructs directly, while the
runtime Dyld works at a lower level and can handle objects which may not
originate from LLVM IR. Introduce a new layer for the memory manager to
handle the interface between them. For the MCJIT, this layer will be almost
entirely a simple call-through w/ translation between the IR objects and
symbol names.
llvm-svn: 128851
after the given instruction; make sure to handle that case correctly.
(It's difficult to trigger; the included testcase involves a dead
block, but I don't think that's a requirement.)
While I'm here, get rid of the unnecessary warning about
SimplifyInstructionsInBlock, since it should work correctly as far as I know.
llvm-svn: 128782
When the greedy register allocator is splitting multiple global live ranges, it
tends to look at the same interference data many times. The InterferenceCache
class caches queries for unaltered LiveIntervalUnions.
llvm-svn: 128764
transformations in target-specific DAG combines without causing DAGCombiner to
delete the same node twice. If you know of a better way to avoid this (see my
next patch for an example), please let me know.
llvm-svn: 128758
StringMap was not properly updating NumTombstones after a clear or rehash.
This was not fatal until now because the table was growing faster than
NumTombstones could accumulate, but with the previous change preventing
infinite growth of the table, the invariant (NumItems + NumTombstones <=
NumBuckets) stopped being observed, causing infinite loops in certain
situations.
Patch by José Fonseca!
llvm-svn: 128567
When the hash function uses object pointers, all free entries eventually
become tombstones as they are used at least once, regardless of the size.
DenseMap cannot function with zero empty keys, so it doubles its size to
get rid of the tombstones.
However, DenseMap never shrinks automatically unless it is cleared, so
the net result is that certain tables grow infinitely.
The solution is to make a fresh copy of the table without tombstones
instead of doubling size, by simply calling grow with the current size.
Patch by José Fonseca!
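A toy model of the fix (not DenseMap's actual code; the thresholds are
illustrative): when the table fills up but mostly with tombstones, rehash at
the current size instead of doubling:

  #include <cstddef>

  struct ToyTable {
    std::size_t NumItems = 0, NumTombstones = 0, NumBuckets = 16;
    // Rehash into NewBuckets buckets; live entries move, tombstones vanish.
    void grow(std::size_t NewBuckets) {
      NumBuckets = NewBuckets;
      NumTombstones = 0;
    }
    void maybeGrow() {
      if ((NumItems + NumTombstones) * 4 < NumBuckets * 3)
        return;                  // table still has empty keys; nothing to do
      if (NumItems * 4 <= NumBuckets)
        grow(NumBuckets);        // mostly tombstones: same size, just purge
      else
        grow(NumBuckets * 2);    // genuinely full: double as before
    }
  };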
llvm-svn: 128564
The idea is that if an IEEE 754 float is divided by a power of two, we can
turn the division into a cheaper multiplication. This function sees if we can
get an exact multiplicative inverse for a divisor and returns it if possible.
This is the hard part of PR9587.
I tested many inputs against llvm-gcc's frontend implementation of this
optimization and didn't find any difference. However, floating point is the
land of weird edge cases, so any review would be appreciated.
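A self-contained sketch of the check on plain doubles (the patch itself works
on APFloat): the reciprocal is exact iff the divisor is a power of two whose
inverse is still a normal number:

  #include <cmath>

  // If 1.0/D is exact, return true and set Inv so X/D can become X*Inv.
  static bool getExactInverse(double D, double &Inv) {
    if (D == 0.0 || !std::isfinite(D))
      return false;
    int Exp;
    // frexp yields a significand in [0.5, 1); powers of two give exactly 0.5.
    if (std::frexp(std::fabs(D), &Exp) != 0.5)
      return false;
    Inv = 1.0 / D;
    return std::isnormal(Inv);  // reject inverses that overflow or denormalize
  }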
llvm-svn: 128545
was lowering them to sext / zext + mul instructions. Unfortunately the
optimization passes may hoist the extensions out of the loop and separate them.
When that happens, the long multiplication instructions can be broken into
several scalar instructions, causing a significant performance issue.
Note the vmla and vmls intrinsics are not added back. The frontend will codegen
them as vmull* intrinsics + add / sub. Also note the isel optimizations for catching
mul + sext / zext are not changed either.
First part of rdar://8832507, rdar://9203134
llvm-svn: 128502
otool(1), this time with the needed fix for case-sensitive file systems :).
This is a work in progress as the interface for producing symbolic operands is
not done. But a hacked prototype using information from the object file's
relocation entries and replacing immediate operands with MCExpr's has been
shown to work with no changes to the instruction printer. These APIs will be
moved into a dynamic library at some point.
llvm-svn: 128415
Correctly terminate the range of register DBG_VALUEs when the register is
clobbered or when the basic block ends.
The code is now ready to deal with variables that are sometimes in a register
and sometimes on the stack. We just need to teach emitDebugLoc to say 'stack
slot'.
llvm-svn: 128327
This is a work in progress as the interface for producing symbolic operands is
not done. But a hacked prototype using information from the object file's
relocation entries and replacing immediate operands with MCExpr's has been
shown to work with no changes to the instruction printer. These APIs will be
moved into a dynamic library at some point.
llvm-svn: 128308
The MC asm lexer wasn't honoring a non-default (anything but ';') statement
separator. Fix that, and generalize a bit to support multi-character
statement separators.
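Illustration only (not the MC lexer itself) of splitting on a multi-character
separator:

  #include <cstddef>
  #include <string>
  #include <string_view>
  #include <vector>

  // Split an asm string into statements on an arbitrary separator token.
  static std::vector<std::string> splitStatements(std::string_view Src,
                                                  std::string_view Sep) {
    std::vector<std::string> Stmts;
    std::size_t Pos = 0;
    while (true) {
      std::size_t End = Src.find(Sep, Pos);
      if (End == std::string_view::npos) {
        Stmts.emplace_back(Src.substr(Pos));
        break;
      }
      Stmts.emplace_back(Src.substr(Pos, End - Pos));
      Pos = End + Sep.size();
    }
    return Stmts;
  }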
llvm-svn: 128227
Move the dynamic linking functionality of the llvm-rtdyld program into an
ExecutionEngine support library. Update llvm-rtdyld to just load an object
file into memory, use the library to process it, then run the _main()
function, if one is found.
llvm-svn: 128031
the alias of an InstAlias instead of the thing being aliased, because we need to
know the features that are valid for an InstAlias.
This is part of a work-in-progress.
llvm-svn: 127986
to have a single return block (at least getting there) for optimizations. This
is general goodness but it would prevent some tailcall optimizations.
One specific case is code like this:
int f1(void);
int f2(void);
int f3(void);
int f4(void);
int f5(void);
int f6(void);
int foo(int x) {
  switch (x) {
  case 1: return f1();
  case 2: return f2();
  case 3: return f3();
  case 4: return f4();
  case 5: return f5();
  case 6: return f6();
  }
}
=>
LBB0_2: ## %sw.bb
callq _f1
popq %rbp
ret
LBB0_3: ## %sw.bb1
callq _f2
popq %rbp
ret
LBB0_4: ## %sw.bb3
callq _f3
popq %rbp
ret
This patch teaches codegenprep to duplicate returns when the return value
is a phi whose operands are produced by tail calls followed by
an unconditional branch:
sw.bb7: ; preds = %entry
%call8 = tail call i32 @f5() nounwind
br label %return
sw.bb9: ; preds = %entry
%call10 = tail call i32 @f6() nounwind
br label %return
return:
%retval.0 = phi i32 [ %call10, %sw.bb9 ], [ %call8, %sw.bb7 ], ... [ 0, %entry ]
ret i32 %retval.0
This allows codegen to generate better code like this:
LBB0_2: ## %sw.bb
jmp _f1 ## TAILCALL
LBB0_3: ## %sw.bb1
jmp _f2 ## TAILCALL
LBB0_4: ## %sw.bb3
jmp _f3 ## TAILCALL
rdar://9147433
llvm-svn: 127953
Proof-of-concept code that code-gens a module to an in-memory MachO object.
This will be hooked up to a run-time dynamic linker library (see: llvm-rtdyld
for similarly conceptual work for that part) which will take the compiled
object and link it together with the rest of the system, providing back to the
JIT a table of available symbols which will be used to respond to the
getPointerTo*() queries.
llvm-svn: 127916
For example, on 32-bit architecture, don't promote all uses of the IV
to 64-bits just because one use is a 64-bit cast.
Alternate implementation of the patch by Arnaud de Grandmaison.
llvm-svn: 127884
SCEV may generate expressions composed of multiple pointers, which can
lead to invalid GEP expansion. Until we can teach SCEV to follow strict
pointer rules, make sure no bad GEPs creep into IR.
Fixes rdar://problem/9038671.
llvm-svn: 127839
I have convinced myself that it can only happen when a phi value dies. When it
happens, allocate new virtual registers for the components.
llvm-svn: 127827
rather than an int. Thankfully, this only causes LLVM to miss optimizations, not
generate incorrect code.
This just fixes the zext at the return. We still insert an i32 ZextAssert when
reading a function's arguments, but it is followed by a truncate and another i8
ZextAssert so it is not optimized.
llvm-svn: 127766
doesn't return, so just go back to using the old runtime function instead
of trying to use abort() when __builtin_unreachable (or an equivalent) isn't
supported.
llvm-svn: 127629
properties.
Added the self-wrap flag for SCEV::AddRecExpr.
A slew of temporary FIXMEs indicate the intention of the no-self-wrap flag
without changing behavior in this revision.
llvm-svn: 127590
llvm-gcc-i386-linux-selfhost and llvm-x86_64-linux-checks buildbots.
The original log entry:
Remove optimization emitting a reference instead of a label difference, since
it can create more relocations. Removed isBaseAddressKnownZero method,
because it is no longer used.
llvm-svn: 127540
It generates output that looks like
8 times line number info lost by Scalar Replacement of Aggregates (SSAUp)
1 times line number info lost by Simplify well-known library calls
12 times variable info lost by Jump Threading
llvm-svn: 127381
flexible.
If it returns a register class that's different from the input, then that's the
register class used for cross-register class copies.
If it returns a register class that's the same as the input, then no cross-
register class copies are needed (normal copies would do).
If it returns null, then it's not at all possible to copy registers of the
specified register class.
llvm-svn: 127368
the value splatted into every element. Extend this to getTrue and getFalse
by providing new overloads that take Types that are either i1 or <N x i1>. Use
it in InstCombine to add vector support to some code, fixing PR8469!
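A short sketch against the current API (FixedVectorType postdates this commit;
VectorType::get was the spelling at the time):

  #include "llvm/IR/Constants.h"
  #include "llvm/IR/DerivedTypes.h"
  #include "llvm/IR/LLVMContext.h"

  // Build an all-true <N x i1> mask via the type-aware getTrue overload.
  static llvm::Constant *allTrueMask(llvm::LLVMContext &Ctx, unsigned N) {
    llvm::Type *VecTy =
        llvm::FixedVectorType::get(llvm::Type::getInt1Ty(Ctx), N);
    return llvm::ConstantInt::getTrue(VecTy);  // splat of i1 true
  }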
llvm-svn: 127116
This makes lookup slightly more expensive, but it's worth it: unused
DenseMaps are common in LLVM code, apparently.
1% speedup on clang -O3 bzip2.c
4% speedup on clang -O3 oggenc.c (Release build of clang on i386/linux)
llvm-svn: 127088
regs. This is the only change in this checkin that may affect the
default scheduler. With better register tracking and heuristics, it
doesn't make sense to artificially lower the register limit so much.
Added -sched-high-latency-cycles and X86InstrInfo::isHighLatencyDef to
give the scheduler a way to account for div and sqrt on targets that
don't have an itinerary. It currently defaults to 10 (the actual
number doesn't matter much), but only takes effect on non-default
schedulers: list-hybrid and list-ilp.
Added several heuristics that can be individually disabled for the
non-default sched=list-ilp mode. This helps us determine how much
better we can do on a given benchmark than the default
scheduler. Certain compute intensive loops run much faster in this
mode with the right set of heuristics, and it doesn't seem to have
much negative impact elsewhere. Not all of the heuristics are needed,
but we still need to experiment to decide which should be disabled by
default for sched=list-ilp.
llvm-svn: 127067
Initially, slot indexes are quad-spaced. There is room for inserting up to 3
new instructions between the original instructions.
When we run out of indexes between two instructions, renumber locally using
double-spaced indexes. The original quad-spacing means that we catch up quickly,
and we only have to renumber a handful of instructions to get a monotonic
sequence. This is much faster than renumbering the whole function as we did
before.
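A toy model of the scheme (bare indexes, not the LiveIntervals code; the
stride and catch-up rule follow the description above):

  #include <cstddef>
  #include <cstdio>
  #include <vector>

  // Insert a new index after position i. If the gap is exhausted, renumber
  // locally with stride 2 until the new numbers drop back below the old ones.
  static void insertAfter(std::vector<unsigned> &Idx, std::size_t i) {
    if (Idx[i + 1] - Idx[i] > 1) {                  // spare index available
      Idx.insert(Idx.begin() + i + 1, Idx[i] + 1);
      return;
    }
    Idx.insert(Idx.begin() + i + 1, 0);             // placeholder
    unsigned Next = Idx[i] + 2;
    for (std::size_t j = i + 1; j < Idx.size(); ++j) {
      if (j > i + 1 && Next < Idx[j])
        break;                                      // caught up: stop early
      Idx[j] = Next;
      Next += 2;
    }
  }

  int main() {
    std::vector<unsigned> Idx = {0, 1, 2, 3, 4, 8, 12}; // gap 3..4 exhausted
    insertAfter(Idx, 3);                                // insert between 3 and 4
    for (unsigned V : Idx)
      std::printf("%u ", V);                            // 0 1 2 3 5 7 9 12
    std::printf("\n");
  }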
llvm-svn: 127023
it. It's been assumed until now that it would be in its immediate
successor. However, this isn't necessarily the case. It could be in one of its
successor's successors.
Modify the code to more thoroughly check for an 'eh.selector' call in
successors. It only looks at a successor if we get there as a result of an
unconditional branch.
Testcase ObjC/exceptions-4.m in r126968.
llvm-svn: 126969
This is much faster than using a pointer to a ManagedStatic object accessed with
a function call. The greedy register allocator is 5% faster overall just from
the SlotIndex default constructor savings.
llvm-svn: 126925
The SlotIndex created by the default construction does not represent a position
in the function, and it doesn't make sense to compare it to other indexes.
llvm-svn: 126924
uses.
The result produced by the streamer is used to give the linker more accurate
information and to add to llvm.compiler.used. The second improvement removes
the need for the user to add __attribute__((used)) to functions only used in
inline asm. The first one lets us build firefox with LTO on Darwin :-)
llvm-svn: 126830