mirror of
https://github.com/RPCS3/llvm-mirror.git
synced 2025-01-20 02:34:48 +00:00
Here are the notes from our Reoptimizer meetings.
llvm-svn: 6923
This commit is contained in:
parent
9d2a41f43e
commit
40d8d5e7c5
137
docs/HistoricalNotes/2003-06-25-Reoptimizer1.txt
Normal file
137
docs/HistoricalNotes/2003-06-25-Reoptimizer1.txt
Normal file
@ -0,0 +1,137 @@
|
||||
Wed Jun 25 15:13:51 CDT 2003
|
||||
|
||||
First-level instrumentation
|
||||
---------------------------
|
||||
|
||||
We use opt to do Bytecode-to-bytecode instrumentation. Look at
|
||||
back-edges and insert llvm_first_trigger() function call which takes
|
||||
no arguments and no return value. This instrumentation is designed to
|
||||
be easy to remove, for instance by writing a NOP over the function
|
||||
call instruction.
|
||||
|
||||
Keep count of every call to llvm_first_trigger(), and maintain
|
||||
counters in a map indexed by return address. If the trigger count
|
||||
exceeds a threshold, we identify a hot loop and perform second-level
|
||||
instrumentation on the hot loop region (the instructions between the
|
||||
target of the back-edge and the branch that causes the back-edge). We
|
||||
do not move code across basic-block boundaries.
|
||||
|
||||
|
||||
Second-level instrumentation
|
||||
---------------------------
|
||||
|
||||
We remove the first-level instrumentation by overwriting the CALL to
|
||||
llvm_first_trigger() with a NOP.
|
||||
|
||||
The reoptimizer maintains a map between machine-code basic blocks and
|
||||
LLVM BasicBlock*s. We only keep track of paths that start at the
|
||||
first machine-code basic block of the hot loop region.
|
||||
|
||||
How do we keep track of which edges to instrument, and which edges are
|
||||
exits from the hot region? 3 step process.
|
||||
|
||||
1) Do a DFS from the first machine-code basic block of the hot loop
|
||||
region and mark reachable edges.
|
||||
|
||||
2) Do a DFS from the last machine-code basic block of the hot loop
|
||||
region IGNORING back edges, and mark the edges which are reachable in
|
||||
1) and also in 2) (i.e., must be reachable from both the start BB and
|
||||
the end BB of the hot region).
|
||||
|
||||
3) Mark BBs which end in edges that exit the hot region; we need to
|
||||
instrument these differently.
|
||||
|
||||
Assume that there is 1 free register. On SPARC we use %g1, which LLC
|
||||
has agreed not to use. Shift a 1 into it at the beginning. At every
|
||||
edge which corresponds to a conditional branch, we shift 0 for not
|
||||
taken and 1 for taken into a register. This uniquely numbers the paths
|
||||
through the hot region. Silently fail if we need more than 64 bits.
|
||||
|
||||
At the end BB we call countPath and increment the counter based on %g1
|
||||
and the return address of the countPath call. We keep track of the
|
||||
number of iterations and the number of paths. We only run this
|
||||
version 30 or 40 times.
|
||||
|
||||
Find the BBs that total 90% or more of execution, and aggregate them
|
||||
together to form our trace. But we do not allow more than 5 paths; if
|
||||
we have more than 5 we take the ones that are executed the most. We
|
||||
verify our assumption that we picked a hot back-edge in first-level
|
||||
instrumentation, by making sure that the number of times we took an
|
||||
exit edge from the hot trace is less than 10% of the number of
|
||||
iterations.
|
||||
|
||||
LLC has been taught to recognize llvm_first_trigger() calls and NOT
|
||||
generate saves and restores of caller-saved registers around these
|
||||
calls.
|
||||
|
||||
|
||||
Phase behavior
|
||||
--------------
|
||||
|
||||
We turn off llvm_first_trigger() calls with NOPs, but this would hide
|
||||
phase behavior from us (when some funcs/traces stop being hot and
|
||||
others become hot.)
|
||||
|
||||
We have a SIGALRM timer that counts time for us. Every time we get a
|
||||
SIGALRM we look at our priority queue of locations where we have
|
||||
removed llvm_first_trigger() calls. Each location is inserted along
|
||||
with a time when we will next turn instrumentation back on for that
|
||||
call site. If the time has arrived for a particular call site, we pop
|
||||
that off the prio. queue and turn instrumentation back on for that
|
||||
call site.
|
||||
|
||||
|
||||
Generating traces
|
||||
-----------------
|
||||
|
||||
When we finally generate an optimized trace we first copy the code
|
||||
into the trace cache. This leaves us with 3 copies of the code: the
|
||||
original code, the instrumented code, and the optimized trace. The
|
||||
optimized trace does not have instrumentation. The original code and
|
||||
the instrumented code are modified to have a branch to the trace
|
||||
cache, where the optimized traces are kept.
|
||||
|
||||
We copy the code from the original to the instrumentation version
|
||||
by tracing the LLVM-to-Machine code basic block map and then copying
|
||||
each machine code basic block we think is in the hot region into the
|
||||
trace cache. Then we instrument that code. The process is similar for
|
||||
generating the final optimized trace; we copy the same basic blocks
|
||||
because we might need to put in fixup code for exit BBs.
|
||||
|
||||
LLVM basic blocks are not typically used in the Reoptimizer except
|
||||
for the mapping information.
|
||||
|
||||
We are restricted to using single instructions to branch between the
|
||||
original code, trace, and instrumented code. So we have to keep the
|
||||
code copies in memory near the original code (they can't be far enough
|
||||
away that a single pc-relative branch would not work.) Malloc() or
|
||||
data region space is too far away. this impacts the design of the
|
||||
trace cache.
|
||||
|
||||
We use a dummy function that is full of a bunch of for loops which we
|
||||
overwrite with trace-cache code. The trace manager keeps track of
|
||||
whether or not we have enough space in the trace cache, etc.
|
||||
|
||||
The trace insertion routine takes an original start address, a vector
|
||||
of machine instructions representing the trace, index of branches and
|
||||
their corresponding absolute targets, and index of calls and their
|
||||
corresponding absolute targets.
|
||||
|
||||
The trace insertion routine is responsible for inserting branches from
|
||||
the beginning of the original code to the beginning of the optimized
|
||||
trace. This is because at some point the trace cache may run out of
|
||||
space and it may have to evict a trace, at which point the branch to
|
||||
the trace would also have to be removed. It uses a round-robin
|
||||
replacement policy; we have found that this is almost as good as LRU
|
||||
and better than random (especially because of problems fitting the new
|
||||
trace in.)
|
||||
|
||||
We cannot deal with discontiguous trace cache areas. The trace cache
|
||||
is supposed to be cache-line-aligned, but it is not page-aligned.
|
||||
|
||||
We generate instrumentation traces and optimized traces into separate
|
||||
trace caches. We keep the instrumented code around because you don't
|
||||
want to delete a trace when you still might have to return to it
|
||||
(i.e., return from a llvm_first_trigger() or countPath() call.)
|
||||
|
||||
|
110
docs/HistoricalNotes/2003-06-26-Reoptimizer2.txt
Normal file
110
docs/HistoricalNotes/2003-06-26-Reoptimizer2.txt
Normal file
@ -0,0 +1,110 @@
|
||||
Thu Jun 26 14:43:04 CDT 2003
|
||||
|
||||
Information about BinInterface
|
||||
------------------------------
|
||||
|
||||
Take in a set of instructions with some particular register
|
||||
allocation. It allows you to add, modify, or delete some instructions,
|
||||
in SSA form (kind of like LLVM's MachineInstrs.) Then re-allocate
|
||||
registers. It assumes that the transformations you are doing are safe.
|
||||
It does not update the mapping information or the LLVM representation
|
||||
for the modified trace (so it would not, for instance, support
|
||||
multiple optimization passes; passes have to be aware of and update
|
||||
manually the mapping information.)
|
||||
|
||||
The way you use it is you take the original code and provide it to
|
||||
BinInterface; then you do optimizations to it, then you put it in the
|
||||
trace cache.
|
||||
|
||||
The BinInterface tries to find live-outs for traces so that it can do
|
||||
register allocation on just the trace, and stitch the trace back into
|
||||
the original code. It has to preserve the live-ins and live-outs when
|
||||
it does its register allocation. (On exits from the trace we have
|
||||
epilogues that copy live-outs back into the right registers, but
|
||||
live-ins have to be in the right registers.)
|
||||
|
||||
|
||||
Limitations of BinInterface
|
||||
---------------------------
|
||||
|
||||
It does copy insertions for PHIs, which it infers from the machine
|
||||
code. The mapping info inserted by LLC is not sufficient to determine
|
||||
the PHIs.
|
||||
|
||||
It does not handle integer or floating-point condition codes and it
|
||||
does not handle floating-point register allocation.
|
||||
|
||||
It is not aggressively able to use lots of registers.
|
||||
|
||||
There is a problem with alloca: we cannot find our spill space for
|
||||
spilling registers, normally allocated on the stack, if the trace
|
||||
follows an alloca(). What might be an acceptable solution would be to
|
||||
disable trace generation on functions that have variable-sized
|
||||
alloca()s. Variable-sized allocas in the trace would also probably
|
||||
screw things up.
|
||||
|
||||
Because of the FP and alloca limitations, the BinInterface is
|
||||
completely disabled right now.
|
||||
|
||||
|
||||
Demo
|
||||
----
|
||||
|
||||
This is a demo of the Ball & Larus version that does NOT use 2-level
|
||||
profiling.
|
||||
|
||||
1. Compile program with llvm-gcc.
|
||||
2. Run opt -lowerswitch -paths -emitfuncs on the bytecode.
|
||||
-lowerswitch change switch statements to branches
|
||||
-paths Ball & Larus path-profiling algorithm
|
||||
-emitfuncs emit the table of functions
|
||||
3. Run llc to generate SPARC assembly code for the result of step 2.
|
||||
4. Use g++ to link the (instrumented) assembly code.
|
||||
|
||||
We use a script to do all this:
|
||||
------------------------------------------------------------------------------
|
||||
#!/bin/sh
|
||||
llvm-gcc $1.c -o $1
|
||||
opt -lowerswitch -paths -emitfuncs $1.bc > $1.run.bc
|
||||
llc -f $1.run.bc
|
||||
LIBS=$HOME/llvm_sparc/lib/Debug
|
||||
GXX=/usr/dcs/software/evaluation/bin/g++
|
||||
$GXX -g -L $LIBS $1.run.s -o $1.run.llc \
|
||||
$LIBS/tracecache.o \
|
||||
$LIBS/mapinfo.o \
|
||||
$LIBS/trigger.o \
|
||||
$LIBS/profpaths.o \
|
||||
$LIBS/bininterface.o \
|
||||
$LIBS/support.o \
|
||||
$LIBS/vmcore.o \
|
||||
$LIBS/transformutils.o \
|
||||
$LIBS/bcreader.o \
|
||||
-lscalaropts -lscalaropts -lanalysis \
|
||||
-lmalloc -lcpc -lm -ldl
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
5. Run the resulting binary. You will see output from BinInterface
|
||||
(described below) intermixed with the output from the program.
|
||||
|
||||
|
||||
Output from BinInterface
|
||||
------------------------
|
||||
|
||||
BinInterface's debugging code prints out the following stuff in order:
|
||||
|
||||
1. Initial code provided to BinInterface with original register
|
||||
allocation.
|
||||
|
||||
2. Section 0 is the trace prolog, consisting mainly of live-ins and
|
||||
register saves which will be restored in epilogs.
|
||||
|
||||
3. Section 1 is the trace itself, in SSA form used by BinInterface,
|
||||
along with the PHIs that are inserted.
|
||||
PHIs are followed by the copies that implement them.
|
||||
Each branch (i.e., out of the trace) is annotated with the
|
||||
section number that represents the epilog it branches to.
|
||||
|
||||
4. All the other sections starting with Section 2 are trace epilogs.
|
||||
Every branch from the trace has to go to some epilog.
|
||||
|
||||
5. After the last section is the register allocation output.
|
Loading…
x
Reference in New Issue
Block a user