This patch adds unit testing support for BOLT. In order to do this we will need at least do this changes on the code level:
* Make createMCPlusBuilder accessible externally
* Remove positional InputFilename argument to bolt utlity sources
And prepare the cmake and lit for the new tests.
Vladislav Khmelevsky,
Advanced Software Technology Lab, Huawei
Reviewed By: maksfb, Amir
Differential Revision: https://reviews.llvm.org/D118271
Summary:
Address @smeenai feedback https://reviews.llvm.org/D117061#inline-1122106:
>CMake has if(IN_LIST) now, which you can use instead of the string(FIND)
IN_LIST is available since CMake 3.3 released in 2015.
Reviewed By: smeenai
FBD33590959
Summary:
Fix missing string header file inclusion and link_fdata find
problem in lit tests. Change root-level tests to require
linux. Re-enable Windows in our root CMakeLists.txt.
(cherry picked from FBD33296290)
Summary:
Create a new high-level target named bolt that builds all
BOLT artifacts, as well as a install-bolt target that installs them.
(cherry picked from FBD33133002)
Summary:
Build is already broken because VS fails to locate
llvm-boltdiff when running tests, and VS also complains that
include/bolt/Passes/InstrumentationSummary.h is lacking an include
string header. Disable this until we have a Windows buildbot to make
sure this build is sane.
(cherry picked from FBD33039972)
Summary:
Make BOLT build in VisualStudio compiler and run without
crashing on a simple test. Other tests are not running.
(cherry picked from FBD32378736)
Summary:
Change cmake config in BOLT to only support Linux. In other
platforms, we print a warning that we won't build BOLT. Change
configs to determine whether we will build BOLT runtime libs. This
only happens in x86 hosts. If true, we will build the runtime and
enable bolt-runtime tests. New tests that depend on the bolt_rt lib
needs to be marked REQUIRES:bolt-runtime. I updated the relevant
tests. Fix cmake to do not crash when building llvm with a target
that BOLT does not support.
(cherry picked from FBD31935760)
Summary:
Moves source files into separate components, and make explicit
component dependency on each other, so LLVM build system knows how to
build BOLT in BUILD_SHARED_LIBS=ON.
Please use the -c merge.renamelimit=230 git option when rebasing your
work on top of this change.
To achieve this, we create a new library to hold core IR files (most
classes beginning with Binary in their names), a new library to hold
Utils, some command line options shared across both RewriteInstance
and core IR files, a new library called Rewrite to hold most classes
concerned with running top-level functions coordinating the binary
rewriting process, and a new library called Profile to hold classes
dealing with profile reading and writing.
To remove the dependency from BinaryContext into X86-specific classes,
we do some refactoring on the BinaryContext constructor to receive a
reference to the specific backend directly from RewriteInstance. Then,
the dependency on X86 or AArch64-specific classes is transfered to the
Rewrite library. We can't have the Core library depend on targets
because targets depend on Core (which would create a cycle).
Files implementing the entry point of a tool are transferred to the
tools/ folder. All header files are transferred to the include/
folder. The src/ folder was renamed to lib/.
(cherry picked from FBD32746834)
Summary:
To allow the development of future instrumentation work, this
patch adds support in BOLT for linking arbitrary libraries into the
binary processed by BOLT. We use orc relocation handling mechanism for
that. With this support, this patch also moves code programatically
generated in X86 assembly language by X86MCPlusBuilder to C code written
in a new library called bolt_rt. Change CMake to support this library as
an external project in the same way as clang does with compiler_rt. This
library is installed in the lib/ folder relative to BOLT root
installation and by default instrumentation will look for the library
at that location to finish processing the binary with instrumentation.
(cherry picked from FBD16572013)
Summary:
Compile Bolt using std 14.
We want that to be able to use some threading the locking tools that do not exists in std 11.
(cherry picked from FBD15671736)
Summary:
Create folders and setup to make LIT run BOLT-only tests. Add
a test example. This will add a new make/ninja rule "check-bolt" that
the user can invoke to run LIT on this folder.
(cherry picked from FBD8595786)
Summary:
BOLT sources are being moved under tools/llvm-bolt/src
and tools/llvm-bolt will contain more files such as LICENSE.txt,
README.txt, etc.
Remove trailing white spaces from our sources.
Create llvm.patch by running
> git diff f137ed238db11440f03083b1c88b7ffc0f4af65e include lib > \
tools/llvm-bolt/llvm.patch
README.txt has instructions on checking out sources and applying the
patch.
(cherry picked from FBD7878380)
Summary:
Refactor architecture-specific code out of llvm into llvm-bolt.
Introduce MCPlusBuilder, a class that is taking over MCInstrAnalysis
responsibilities, i.e. creating, analyzing, and modifying instructions.
To access the builder use BC->MIB, i.e. substitute MIA with MIB.
MIB is an acronym for MCInstBuilder, that's what MCPlusBuilder used
to be. The name stuck, and I find it better than MPB.
Instructions are still MCInst, and a bunch of BOLT-specific code still
lives in LLVM, but the staff under Target/* is significantly reduced.
(cherry picked from FBD7300101)
Summary:
This is a simple bolt-based tool that instantiates two
RewriteInstances objects and compares them. Add a method to
RewriteInstance to enable us to compare two objects. Include a mechanism
to match functions from binary 1 to binary 2 and finally print the
largest differences in profiling data from one binary to another.
(cherry picked from FBD6517076)
Summary:
This is preparation work for static data reordering.
I've created a new class called BinaryData which represents a symbol
contained in a section. It records almost all the information relevant
for dealing with data, e.g. names, address, size, alignment, profiling
data, etc. BinaryContext still stores and manages BinaryData objects
similar to how it managed symbols and global addresses before. The
interfaces are not changed too drastically from before either. There is
a bit of overlap between BinaryData and BinaryFunction. I would have
liked to do some more refactoring to make a BinaryFunctionFragment that
subclassed from BinaryData and then have BinaryFunction be composed or
associated with BinaryFunctionFragments.
I've also attempted to use (symbol + offset) for when addresses are
pointing into the middle of symbols with known sizes. This changes the
simplify rodata loads optimization slightly since the expression on an
instruction can now also be a (symbol + offset) rather than just a symbol.
One of the overall goals for this refactoring is to make sure every
relocation is associated with a BinaryData object. This requires adding
"hole" BinaryData's wherever there are gaps in a section's address space.
Most of the holes seem to be data that has no associated symbol info. In
this case we can't do any better than lumping all the adjacent hole
symbols into one big symbol (there may be more than one actual data
object that contributes to a hole). At least the combined holes should
be moveable.
Jump tables have similar issues. They appear to mostly be sub-objects
for top level local symbols. The main problem is that we can't recognize
jump tables at the time we scan the symbol table, we have to wait til
disassembly. When a jump table is discovered we add it as a sub-object
to the existing local symbol. If there are one or more existing
BinaryData's that appear in the address range of a newly created jump
table, those are added as sub-objects as well.
(cherry picked from FBD6362544)
Summary:
This commit includes all code necessary to make BOLT working again
after the rebase. This includes a redesign of the EHFrame work,
cherry-pick of the 3dnow disassembly work, compilation error fixes,
and port of the debug_info work. The macroop fusion feature is not
ported yet.
The rebased version has minor changes to the "executed instructions"
dynostats counter because REP prefixes are considered a part of the
instruction it applies to. Also, some X86 instructions had the "mayLoad"
tablegen property removed, which BOLT uses to identify and account
for loads, thus reducing the total number of loads reported by
dynostats. This was observed in X86::MOVDQUmr. TRAP instructions are
not terminators anymore, changing our CFG. This commit adds compensation
to preserve this old behavior and minimize tests changes. debug_info
sections are now slightly larger. The discriminator field in the line
table is slightly different due to a change upstream. New profiles
generated with the other bolt are incompatible with this version
because of different hash values calculated for functions, so they will
be considered 100% stale. This commit changes the corresponding test
to XFAIL so it can be updated. The hash function changes because it
relies on raw opcode values, which change according to the opcodes
described in the X86 tablegen files. When processing HHVM, bolt was
observed to be using about 800MB more memory in the rebased version
and being about 5% slower.
(cherry picked from FBD7078072)
Summary: Add BinarySection class that is a wrapper around SectionRef. This is refactoring work for static data reordering.
(cherry picked from FBD6792785)
Summary:
A new profile that is more resilient to minor binary modifications.
BranchData is eliminated. For calls, the data is converted into instruction
annotations if the profile matches a function. If a profile cannot be matched,
AllCallSites data should have call sites profiles.
The new profile format is YAML, which is quite verbose. It still takes
less space than the older format because we avoid function name repetition.
The plan is to get rid of the old profile format eventually.
merge-fdata does not work with the new format yet.
(cherry picked from FBD6753747)
Summary:
Profile reading was tightly coupled with building CFG. Since I plan
to move to a new profile format that will be associated with CFG
it is critical to decouple the two phases.
We now have read profile right after the cfg was constructed, but
before it is "canonicalized", i.e. CTCs will till be there.
After reading the profile, we do a post-processing pass that fixes
CFG and does some post-processing for debug info, such as
inference of fall-throughs, which is still required with the current
format.
Another good reason for decoupling is that we can use profile with
CFG to more accurately record fall-through branches during
aggregation.
At the moment we use "Offset" annotations to facilitate location
of instructions corresponding to the profile. This might not be
super efficient. However, once we switch to the new profile format
the offsets would be no longer needed. We might keep them for
the aggregator, but if we have to trust LBR data that might
not be strictly necessary.
I've tried to make changes while keeping backwards compatibly. This makes
it easier to verify correctness of the changes, but that also means
that we lose accuracy of the profile.
Some refactoring is included.
Flag "-prof-compat-mode" (on by default) is used for bug-level
backwards compatibility. Disable it for more accurate tracing.
(cherry picked from FBD6506156)
Summary:
This is a replacement of a previous diff. The implemented metric
('graph distance') is not very useful at the moment but I plan to add
more relevant metrics in the subsequent diff. This diff fixes some
obvious problems and moves the call of CalcMetrics::printAll to the
right place.
(cherry picked from FBD6072312)
Summary:
Move the data aggregator logic from our python script to
our C++ LLVM/BOLT libs. This has a dramatic reduction in processing
time for profiling data (from 45 minutes for HHVM to 5 minutes) because
we directly use BOLT as a disassembler in order to validate traces found
in the LBR and to add the fallthrough counts. Previously, the python
approach relied on parsing the output objdump to check traces.
(cherry picked from FBD5761313)
Summary:
Designed a new metric, which shows 93.46% correltation with Cache Miss
and 86% correlation with CPU Time.
Definition:
One can get all the traversal path for each function. And for each traversal,
we will define a distance. The distance represents how far two connected
basic blocks are. Therefore, for each traversal, I will go through the
basic blocks one by one, until the end of the traversal and sum up the
distance for the neighboring basic blocks.
Distance between two connected basic blocks is the distance of the
centers of two blocks in the binary file.
(cherry picked from FBD5242526)
Summary:
Optinally add a .bolt_info notes section containing BOLT revision and command line args.
The new section is controlled by the -add-bolt-info flag which is on by default.
(cherry picked from FBD5125890)
Summary:
Calls to __builtin_unreachable() can result in a inconsistent CFG.
It was possible for basic block to end with a conditional branche
and have a single successor. Or there could exist non-terminated
basic block without successors.
We also often treated conditional jumps with destination past the end
of a function as conditional tail calls. This can be prevented
reliably at least when the byte past the end of the function does
not belong to the next function.
This diff includes several changes:
* At disassembly stage jumps past the end of a function are converted
into 'nops'. This is done only for cases when we can guarantee that
the jump is not a tail call. Conversion to nop is required since the
instruction could be referenced either by exception handling
tables and/or debug info. Nops are later removed.
* In CFG insert 'ret' into non-terminated basic blocks without
successors (this almost never happens).
* Conditional jumps at the end of the function are removed from
CFG. The block will still have a single successor.
* Cases where a destination of a jump instruction is the start
of the next function, are still conservatively handled as
(conditional) tail calls.
(cherry picked from FBD4655046)
Summary:
This is a first attempt to perform data flow analyses on bolt
and try to rebuild the stack frame for functions. The goal of the frame
optimization pass is to detect instructions that are accessing stack and,
if loading values, evaluate whether this load is redundant and we can
substitute the memory operation for a register load or immediate load.
To find opportunities, this pass also builds a map of clobbered registers
by function, so we use this in our analysis at call sites. If a call site
is found out to not clobber a caller-saved register but the caller is
spilling it anyway to the stack (to comply with the ABI), we should
detect these cases and remove this unnecessary move.
(cherry picked from FBD4337238)
Summary:
The various reorder and clustering algorithms have been refactored
into separate classes, so that it is easier to add new algorithms and/or
change the logic of algorithm selection.
(cherry picked from FBD3473656)
Summary:
Many functions (around 600) in the HHVM binary are simply
a single unconditional jump instruction to another function. These can
be trivially optimized by modifying the call sites to directly call the
branch target instead (because it also happens with more than one jump
in sequence, we do it iteratively).
This diff also adds a very simple analysis/optimization pass system in
which this pass is the first one to be implemented. A follow-up to this
could be to move the current optimizations to other passes.
(cherry picked from FBD3211138)
Summary:
merge-fdata tool takes multiple .fdata files and outputs to stdout
combined fdata. Takes about 2 seconds per each additional .fdata
file with hhvm production data.
(cherry picked from FBD3216430)
Summary:
Summary: Update DWARF location lists in .debug_loc and pointers to
them in .debug_info so that gdb can print variables which change
location during their lifetime.
The following changes were made:
- Refactored BasicBlockOffsetRanges to allow ranges to be tied to binary information (so that we can reuse it for location lists)
- Implemented range compression optimization in BasicBlockOffsetRanges (needed otherwise too much data was being generated).
- Added representation for location lists (LocationList.h, BinaryContext.h)
- Implemented .debug_loc serializer that keeps the updated offsets (DebugLocWriter.{h,cpp})
- After disassembly, traverse entries in .debug_loc and save them in context (BinaryContext.cpp)
- After optimizations, serialize .debug_loc and update pointers in .debug_info (RewriteInstance.cpp)
(cherry picked from FBD3130682)
Summary:
Updates DWARF lexical blocks address ranges in the output binary after optimizations.
This is similar to updating function address ranges except that the ranges representation needs
to be more general, since address ranges can begin or end in the middle of a basic block.
The following changes were made:
- Added a data structure for iterating over the basic blocks that intersect an address range: BasicBlockTable.h
- Added some more bookkeeping in BinaryBasicBlock. Basically, I needed to keep track of the block's size in the input binary as well as its address in the output binary. This information is mostly set by BinaryFunction after disassembly.
- Added a representation for address ranges relative to basic blocks (BasicBlockOffsetRanges.h). Will also serve for location lists.
- Added a representation for Lexical Blocks (LexicalBlock.h)
- Small refactorings in DebugArangesWriter:
-- Renamed to DebugRangesSectionsWriter since it also writes .debug_ranges
-- Refactored it not to depend on BinaryFunction but instead on anything that can be assined an aoffset in .debug_ranges (added an interface for that)
- Iterate over the DIE tree during initialization to find lexical blocks in .debug_info (BinaryContext.cpp)
- Added patches to .debug_abbrev and .debug_info in RewriteInstance to update lexical blocks attributes (in fact, this part is very similar to what was done to function address ranges and I just refactored/reused that code)
- Added small test case (lexical_blocks_address_ranges_debug.test)
(cherry picked from FBD3113181)
Summary:
[WIP] Update DWARF info for function address ranges.
This diff currently does not work for unknown reasons,
but I'm describing here what's the current state.
According to both llvm-dwarf and readelf our output seems correct,
but GDB does not interpret it as expected. All details go below in
hope I missed something.
I couldn't actually track the whole change that introduced support for
what we need in gdb yet, but I think I can get to it
(2007-12-04: Support
lexical bocks and function bodies that occupy non-contiguous address ranges). I have reasons to believe gdb at least at some
nges).
The set of introduced changes was basically this:
- After disassembly, iterate over the DIEs in .debug_info and find the
ones that correspond to each BinaryFunction.
- Refactor DebugArangesWriter to also write addresses of functions to
.debug_ranges and track the offsets of function address ranges there
- Add some infrastructure to facilitate patching the binary in
simple ways (BinaryPatcher.h)
- In RewriteInstance, after writing .debug_ranges already with
function address ranges, for each function do:
-- Find the abbreviation corresponding to the function
-- Patch .debug_abbrev to replace DW_AT_low_pc with DW_AT_ranges and
DW_AT_high_pc with DW_AT_producer (I'll explain this hack below).
Also patch the corresponding forms to DW_FORM_sec_offset and
DW_FORM_string (null-terminated in-place string).
-- Patch debug_info with the .debug_ranges offset in place of
the first 4 bytes of DW_AT_low_pc (DW_AT_ranges only occupies 4
bytes whereas low_pc occupies 8), and write an arbitrary string
in-place in the other 12 bytes that were the 4 MSB of low_pc
and the 8 bytes of high_pc before the patch. This depends on
low_pc and high_pc being put consecutively by the compiler, but
it serves to validate the idea. I tried another way of doing it
that does not rely on this but it didn't work either and I believe
the reason for either not working is the same (and still unknown,
but unrelated to them. I might be wrong though, and if I find yet
another way of doing it I may try it). The other way was to
use a form of DW_FORM_data8 for the section offset. This is
disallowed by the specification, but I doubt gdb validates this,
as it's just easier to store it as 64-bit anyway as this is even
necessary to support 64-bit DWARF (which is not what gcc generates
by default apparently).
I still need to make changes to the diff to make it production-ready,
but first I want to figure out why it doesn't work as expected.
By looking at the output of llvm-dwarfdump or readelf, all of
.debug_ranges, .debug_abbrev and .debug_info seem to have been
correctly updated. However, gdb seems to have serious problems with
what we write.
(In fact, readelf --debug-dump=Ranges shows some funny warning messages
of the form ("Warning: There is a hole [0x100 - 0x120] in .debug_ranges"),
but I played around with this and it seems it's just because no
compile unit was using these ranges. Changing .debug_info apparently
changes these warnings, so they seem to be unrelated to the section
itself. Also looking at the hex dump of the section doesn't help,
as everything seems fine. llvm-dwarfdump doesn't say anything.
So I think .debug_ranges is fine.)
The result is that gdb not only doesn't show the function name as we
wanted, but it also stops showing line number information.
Apparently it's not reading/interpreting the address ranges at all,
and so the functions now have no associated address ranges, only the
symbol value which allows one to put a breakpoint in the function,
but not to show source code.
As this left me without more ideas of what to try to feed gdb with,
I believe the most promising next trial is to try to debug gdb itself,
unless someone spots anything I missed.
I found where the interesting part of the code lies for this
case (gdb/dwarf2read.c and some other related files, but mainly that one).
It seems in some parts gdb uses DW_AT_ranges for only getting
its lowest and highest addresses and setting that as low_pc and
high_pc (see dwarf2_get_pc_bounds in gdb's code and where it's called).
I really hope this is not actually the case for
function address ranges. I'll investigate this further. Otherwise
I don't think any changes we make will make it work as initially
intended, as we'll simply need gdb to support it and in that case it
doesn't.
(cherry picked from FBD3073641)
Summary:
Write the .debug_aranges section after optimizations to the output binary.
Each function generates at least one range and at most two (one extra for its cold part).
The writing is done manually because LLVM's implementation is tied to the output of
.debug_info (see EmitGenDwarfInfo and EmitGenDwarfARanges in lib/MC/MCDwarf.cpp),
which we don't want to trigger right now.
(cherry picked from FBD3043108)
Summary:
Reads information in the DWARF .debug_line section using LLVM and
tie every MCInst to one line of a line table from the input binary. Subsequent
diffs will update this information to match the final binary layout and
output updated line tables.
(cherry picked from FBD2989813)
Summary:
Previously, llvm-flo.cpp contained a long function doing lots of
different tasks. This patch refactors this logic into a separate class with
different member functions, exposing the relationship between each step of
the rewritting process and making it easier to coordinate/change it.
(cherry picked from FBD2691674)
Summary: In order to reorder binaries with C++ exceptions, we first need to
read DWARF CFI (call frame info) from binaries in a table in the .eh_frame
ELF section. This table contains unwinding information we need to be aware of
when reordering basic blocks, so as to avoid corrupting it. This patch also
cleans up some code from Exceptions.cpp due to a refactoring where we moved
some functions to the LLVM's libSupport.
(cherry picked from FBD2614464)
Summary:
This patch introduces DataReader, a module responsible for
parsing llvm flo data files into in-memory data structures.
(cherry picked from FBD2515754)