BOLT
====

BOLT is a post-link optimizer developed to speed up large applications.
It achieves the improvements by optimizing an application’s code layout
based on an execution profile gathered by a sampling profiler, such as the
Linux ``perf`` tool. An overview of the ideas implemented in BOLT along with a
discussion of its potential and current results is available in the `CGO’19
paper <https://research.fb.com/publications/bolt-a-practical-binary-optimizer-for-data-centers-and-beyond/>`__.

Input Binary Requirements
-------------------------

BOLT operates on X86-64 and AArch64 ELF binaries. At the minimum, the
binaries should have an unstripped symbol table, and, to get maximum
performance gains, they should be linked with relocations
(``--emit-relocs`` or ``-q`` linker flag).

BOLT disassembles functions and reconstructs the control flow graph
(CFG) before it runs optimizations. Since this is a nontrivial task,
especially when indirect branches are present, we rely on certain
heuristics to accomplish it. These heuristics have been tested on code
generated by the Clang and GCC compilers. The main requirement for C/C++
code is not to rely on code layout properties, such as function pointer
deltas. Assembly code can be processed too. Requirements for it include
a clear separation of code and data, with data objects being placed into
data sections/segments. If indirect jumps are used for intra-function
control transfer (e.g., jump tables), the code patterns should match
those generated by Clang/GCC.

NOTE: BOLT is currently incompatible with the
``-freorder-blocks-and-partition`` compiler option. Since GCC8 enables
this option by default, you have to explicitly disable it by adding the
``-fno-reorder-blocks-and-partition`` flag if you are compiling with
GCC8 or above.

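As an illustration of the flags discussed above, the following sketch
assembles a GCC command line that disables block partitioning and asks the
linker to keep relocations. The source file ``app.c``, the output name
``app``, and the ``-O2`` level are hypothetical; ``-Wl,--emit-relocs`` is the
usual way to pass a linker flag through the compiler driver.

::

   # Hypothetical single-file project: app.c -> app.
   CC=gcc
   CFLAGS="-O2 -fno-reorder-blocks-and-partition"
   LDFLAGS="-Wl,--emit-relocs"

   # Print the command so it can be inspected before it is run.
   build_cmd="$CC $CFLAGS $LDFLAGS -o app app.c"
   echo "$build_cmd"

In a real project these flags would more likely be added to the build
system's compiler and linker flag variables.
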

NOTE2: DWARF v5 is the new debugging format generated by the latest LLVM
and GCC compilers. It offers several benefits over the previous DWARF
v4. Currently, the support for v5 is a work in progress for BOLT. While
you will be able to optimize binaries produced by the latest compilers,
until the support is complete, you will not be able to update the debug
info with ``-update-debug-sections``. To temporarily work around the
issue, we recommend compiling binaries with the ``-gdwarf-4`` option, which
forces DWARF v4 output.

PIE and .so support has been added recently. Please report bugs if you
encounter any issues.

Installation
------------

Docker Image
~~~~~~~~~~~~

You can build and use the docker image containing BOLT using our `docker
file <utils/docker/Dockerfile>`__. Alternatively, you can build BOLT
manually using the steps below.

Manual Build
~~~~~~~~~~~~

BOLT heavily uses LLVM libraries, and by design, it is built as one of
LLVM tools. The build process is not much different from a regular LLVM
build. The following instructions are assuming that you are running
under Linux.

Start by cloning the LLVM repo, then configure and build:

::

   > git clone https://github.com/llvm/llvm-project.git
   > mkdir build
   > cd build
   > cmake -G Ninja ../llvm-project/llvm -DLLVM_TARGETS_TO_BUILD="X86;AArch64" -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=ON -DLLVM_ENABLE_PROJECTS="bolt"
   > ninja bolt

``llvm-bolt`` will be available under ``bin/``. Add this directory to
your path to ensure the rest of the commands in this tutorial work.

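For example, assuming the build commands were run as shown and the current
directory is still ``build``, the following adds ``bin/`` to the search path
for the current shell session only:

::

   # Run from the build directory created above.
   export PATH="$PWD/bin:$PATH"

   # Prints the resolved location of llvm-bolt, or a hint if it is missing.
   command -v llvm-bolt || echo "llvm-bolt not found in PATH"

To make the change permanent you would add the ``export`` line, with an
absolute path, to your shell's startup file.
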

Optimizing BOLT’s Performance
-----------------------------

BOLT runs many internal passes in parallel. If you foresee heavy usage
of BOLT, you can improve the processing time by linking against a
memory allocation library with good support for concurrency. For example,
to use jemalloc:

::

   > sudo yum install jemalloc-devel
   > LD_PRELOAD=/usr/lib64/libjemalloc.so llvm-bolt ....

Or, if you would rather use tcmalloc:

::

   > sudo yum install gperftools-devel
   > LD_PRELOAD=/usr/lib64/libtcmalloc_minimal.so llvm-bolt ....

Usage
-----

For a complete practical guide to using BOLT, see `Optimizing Clang with
BOLT <docs/OptimizingClang.md>`__.

Step 0
~~~~~~

In order to allow BOLT to re-arrange functions (in addition to
re-arranging code within functions) in your program, it needs a little
help from the linker. Add ``--emit-relocs`` to the final link step of
your application. You can verify the presence of relocations by checking
for a ``.rela.text`` section in the binary. BOLT will also report if it
detects relocations while processing the binary.

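One way to perform that check is with ``readelf`` from GNU binutils (an
assumption; any tool that lists ELF section headers would do). The helper
name ``has_text_relocs`` and the ``./app`` path are hypothetical.

::

   # Succeeds (exit status 0) if the ELF file in $1 has a .rela.text section.
   has_text_relocs() {
       readelf -S -W "$1" | grep -q '\.rela\.text'
   }

   # Usage, with ./app standing in for your binary:
   #   has_text_relocs ./app && echo "relocations present"

If the helper fails, the binary was most likely linked without
``--emit-relocs``.
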

Step 1: Collect Profile
~~~~~~~~~~~~~~~~~~~~~~~

This step is different for different kinds of executables. If you can
invoke your program to run on a representative input from a command
line, then check the **For Applications** section below. If your program
typically runs as a server/service, then skip to the **For Services**
section.

The version of the ``perf`` command used for the following steps has to
support the ``-F brstack`` option. We recommend using ``perf`` version 4.5
or later.

For Applications
^^^^^^^^^^^^^^^^

This assumes you can run your program from a command line with a typical
input. In this case, simply prepend the command line invocation with
``perf``:

::

   $ perf record -e cycles:u -j any,u -o perf.data -- <executable> <args> ...

For Services
^^^^^^^^^^^^

Once you get the service deployed and warmed-up, it is time to collect
perf data with LBR (branch information). The exact perf command to use
will depend on the service. E.g., to collect the data for all processes
running on the server for the next 3 minutes use:

::

   $ perf record -e cycles:u -j any,u -a -o perf.data -- sleep 180

Depending on the application, you may need more samples to be included
with your profile. It’s hard to tell upfront what would be a sweet spot
for your application. We recommend that the profile cover 1B instructions,
as reported by the BOLT ``-dyno-stats`` option. If you need to increase the
number of samples in the profile, you can either run the ``sleep``
command for longer or use the ``-F<N>`` option with ``perf`` to increase
the sampling frequency.

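As a sketch of both adjustments, the helper below prints a system-wide
``perf record`` command for a given duration and sampling frequency. The
function name and the values 600 and 5000 are illustrative, not
recommendations, and the kernel may cap the frequency it accepts.

::

   # Prints a perf record command that samples all processes for
   # $1 seconds at roughly $2 samples per second.
   perf_record_cmd() {
       duration=$1
       freq=$2
       echo "perf record -e cycles:u -j any,u -a -F $freq -o perf.data -- sleep $duration"
   }

   # Example: 10 minutes at 5000 Hz (illustrative values).
   perf_record_cmd 600 5000

Printing the command rather than running it makes it easy to review the
options before starting a long collection.
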

Note that for profile collection we recommend using cycle events and not
``BR_INST_RETIRED.*``. Empirically we found it to produce better
results.

If the collection of a profile with branches is not available, e.g.,
when you run on a VM or on hardware that does not support it, then you
can use only sample events, such as cycles. In this case, the quality of
the profile information would not be as good, and performance gains with
BOLT are expected to be lower.

With instrumentation
^^^^^^^^^^^^^^^^^^^^

If ``perf record`` is not available to you, you may collect a profile by
first instrumenting the binary with BOLT and then running it.

::

   llvm-bolt <executable> -instrument -o <instrumented-executable>

After you run ``<instrumented-executable>`` with the desired workload, its
BOLT profile should be ready for you in ``/tmp/prof.fdata`` and you can
skip **Step 2**.

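Put together, the instrumentation flow might look like the sketch below.
The function name and the ``.instr`` and ``.bolt`` suffixes are hypothetical;
the optimization options are the ones used in **Step 3**, and the profile
path is the default location mentioned above.

::

   # $1 is the path to the binary (include a slash, e.g. ./app); the
   # remaining arguments are passed to the instrumented run as the workload.
   bolt_with_instrumentation() {
       exe=$1
       shift
       # Instrument, then run the workload to produce /tmp/prof.fdata.
       llvm-bolt "$exe" -instrument -o "$exe.instr" || return 1
       "$exe.instr" "$@" || return 1
       # Optimize the original binary with the collected profile.
       llvm-bolt "$exe" -o "$exe.bolt" -data=/tmp/prof.fdata \
           -reorder-blocks=ext-tsp -reorder-functions=hfsort \
           -split-functions -split-all-cold -split-eh -dyno-stats
   }

   # Usage: bolt_with_instrumentation ./app <args>

Note that the final step optimizes the original binary, not the
instrumented one.
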

Run BOLT with the ``-help`` option and check the category “BOLT
instrumentation options” for a quick reference on instrumentation knobs.

Step 2: Convert Profile to BOLT Format
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

NOTE: you can skip this step and feed ``perf.data`` directly to BOLT
using the experimental ``-p perf.data`` option.

For this step, you will need the ``perf.data`` file collected in the
previous step and a copy of the binary that was running. The binary has
to be either unstripped, or should have a symbol table intact (i.e.,
running ``strip -g`` is okay).

Make sure ``perf`` is in your ``PATH``, and execute ``perf2bolt``:

::

   $ perf2bolt -p perf.data -o perf.fdata <executable>

This command will aggregate branch data from ``perf.data`` and store it
in a format that is both more compact and more resilient to binary
modifications.

If the profile was collected without LBRs, you will need to add the ``-nl``
flag to the command line above.

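A small wrapper can make that choice explicit. This is only a sketch: the
function name ``convert_profile`` and its ``no-lbr`` argument are
hypothetical, while the ``perf2bolt`` options are the ones shown above.

::

   # Converts perf.data to perf.fdata for the binary in $1. Pass "no-lbr"
   # as $2 if the profile was recorded without branch information.
   convert_profile() {
       exe=$1
       mode=${2:-lbr}
       if [ "$mode" = "no-lbr" ]; then
           perf2bolt -p perf.data -o perf.fdata -nl "$exe"
       else
           perf2bolt -p perf.data -o perf.fdata "$exe"
       fi
   }

   # Usage: convert_profile ./app          (profile with LBRs)
   #        convert_profile ./app no-lbr   (sample-only profile)
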

Step 3: Optimize with BOLT
~~~~~~~~~~~~~~~~~~~~~~~~~~

Once you have ``perf.fdata`` ready, you can use it for optimizations
with BOLT. Assuming your environment is set up to include the right path,
execute ``llvm-bolt``:

::

   $ llvm-bolt <executable> -o <executable>.bolt -data=perf.fdata -reorder-blocks=ext-tsp -reorder-functions=hfsort -split-functions -split-all-cold -split-eh -dyno-stats

If you need updated debug info, add the
``-update-debug-sections`` option to the command above. The processing
time will be slightly longer.

For a full list of options see ``-help``/``-help-hidden`` output.

The input binary for this step does not have to match exactly the binary
used for profile collection in **Step 1**. This could happen when you
are doing active development, and the source code constantly changes,
yet you want to benefit from profile-guided optimizations. However,
since the binary is not precisely the same, the profile information
could become invalid or stale, and BOLT will report the number of
functions with a stale profile. The higher the number, the less
performance improvement should be expected. Thus, it is crucial to
update ``.fdata`` for release branches.

Multiple Profiles
-----------------

Suppose your application can run in different modes, and you can
generate a separate profile for each one of them. To generate a single
binary that can benefit all modes (assuming the profiles don’t
contradict each other) you can use the ``merge-fdata`` tool:

::

   $ merge-fdata *.fdata > combined.fdata

Use ``combined.fdata`` for **Step 3** above to generate a universally
optimized binary.

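A sketch of the combined flow follows. The function name and the ``.bolt``
suffix are hypothetical, and the ``llvm-bolt`` options repeat those from
**Step 3**. Listing the profiles explicitly, rather than using ``*.fdata``,
avoids passing an earlier ``combined.fdata`` back into the merge on a second
run.

::

   # $1 is the binary; the remaining arguments are the .fdata profiles.
   optimize_with_merged_profiles() {
       exe=$1
       shift
       merge-fdata "$@" > combined.fdata || return 1
       llvm-bolt "$exe" -o "$exe.bolt" -data=combined.fdata \
           -reorder-blocks=ext-tsp -reorder-functions=hfsort \
           -split-functions -split-all-cold -split-eh -dyno-stats
   }

   # Usage: optimize_with_merged_profiles ./app mode1.fdata mode2.fdata
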

License
-------

BOLT is licensed under the `Apache License v2.0 with LLVM
Exceptions <./LICENSE.TXT>`__.