mirror of
https://github.com/capstone-engine/llvm-capstone.git
synced 2025-01-05 23:52:45 +00:00
22b9404f09
When the environment LLVM_ENABLE_SYMBOLIZER_MARKUP is set, if llvm-symbolizer fails or is disabled, this change will print a backtrace in llvm-symbolizer markup instead of falling back to in-process symbolization mechanisms. This allows llvm-symbolizer to be run on the output later to produce a high quality backtrace, even for fully-stripped LLVM utilities. Reviewed By: mcgrathr Differential Revision: https://reviews.llvm.org/D139750
442 lines
22 KiB
ReStructuredText
442 lines
22 KiB
ReStructuredText
==========================
|
||
Symbolizer Markup Format
|
||
==========================
|
||
|
||
.. contents::
|
||
:local:
|
||
|
||
Overview
|
||
========
|
||
|
||
This document defines a text format for log messages that can be processed by a
|
||
symbolizing filter. The basic idea is that logging code emits text that contains
|
||
raw address values and so forth, without the logging code doing any real work to
|
||
convert those values to human-readable form. Instead, logging text uses the
|
||
markup format defined here to identify pieces of information that should be
|
||
converted to human-readable form after the fact. As with other markup formats,
|
||
the expectation is that most of the text will be displayed as is, while the
|
||
markup elements will be replaced with expanded text, or converted into active UI
|
||
elements, that present more details in symbolic form.
|
||
|
||
This means there is no need for symbol tables, DWARF debugging sections, or
|
||
similar information to be directly accessible at runtime. There is also no need
|
||
at runtime for any logic intended to compute human-readable presentation of
|
||
information, such as C++ symbol demangling. Instead, logging must include markup
|
||
elements that give the contextual information necessary to make sense of the raw
|
||
data, such as memory layout details.
|
||
|
||
This format identifies markup elements with a syntax that is both simple and
|
||
distinctive. It's simple enough to be matched and parsed with straightforward
|
||
code. It's distinctive enough that character sequences that look like the start
|
||
or end of a markup element should rarely if ever appear incidentally in logging
|
||
text. It's specifically intended not to require sanitizing plain text, such as
|
||
the HTML/XML requirement to replace ``<`` with ``<`` and the like.
|
||
|
||
:doc:`llvm-symbolizer <CommandGuide/llvm-symbolizer>` includes a symbolizing
|
||
filter via its ``--filter-markup`` option. Also, LLVM utilites emit stack
|
||
traces as markup when the ``LLVM_ENABLE_SYMBOLIZER_MARKUP`` environment
|
||
variable is set.
|
||
|
||
Scope and assumptions
|
||
=====================
|
||
|
||
A symbolizing filter implementation will be independent both of the target
|
||
operating system and machine architecture where the logs are generated and of
|
||
the host operating system and machine architecture where the filter runs.
|
||
|
||
This format assumes that the symbolizing filter processes intact whole lines. If
|
||
long lines might be split during some stage of a logging pipeline, they must be
|
||
reassembled to restore the original line breaks before feeding lines into the
|
||
symbolizing filter. Most markup elements must appear entirely on a single line
|
||
(often with other text before and/or after the markup element). There are some
|
||
markup elements that are specified to span lines, with line breaks in the middle
|
||
of the element. Even in those cases, the filter is not expected to handle line
|
||
breaks in arbitrary places inside a markup element, but only inside certain
|
||
fields.
|
||
|
||
This format assumes that the symbolizing filter processes a coherent stream of
|
||
log lines from a single process address space context. If a logging stream
|
||
interleaves log lines from more than one process, these must be collated into
|
||
separate per-process log streams and each stream processed by a separate
|
||
instance of the symbolizing filter. Because the kernel and user processes use
|
||
disjoint address regions in most operating systems, a single user process
|
||
address space plus the kernel address space can be treated as a single address
|
||
space for symbolization purposes if desired.
|
||
|
||
Dependence on Build IDs
|
||
=======================
|
||
|
||
The symbolizer markup scheme relies on contextual information about runtime
|
||
memory address layout to make it possible to convert markup elements into useful
|
||
symbolic form. This relies on having an unmistakable identification of which
|
||
binary was loaded at each address.
|
||
|
||
An ELF Build ID is the payload of an ELF note with name ``"GNU"`` and type
|
||
``NT_GNU_BUILD_ID``, a unique byte sequence that identifies a particular binary
|
||
(executable, shared library, loadable module, or driver module). The linker
|
||
generates this automatically based on a hash that includes the complete symbol
|
||
table and debugging information, even if this is later stripped from the binary.
|
||
|
||
This specification uses the ELF Build ID as the sole means of identifying
|
||
binaries. Each binary relevant to the log must have been linked with a unique
|
||
Build ID. The symbolizing filter must have some means of mapping a Build ID back
|
||
to the original ELF binary (either the whole unstripped binary, or a stripped
|
||
binary paired with a separate debug file).
|
||
|
||
Colorization
|
||
============
|
||
|
||
The markup format supports a restricted subset of ANSI X3.64 SGR (Select Graphic
|
||
Rendition) control sequences. These are unlike other markup elements:
|
||
|
||
* They specify presentation details (bold or colors) rather than semantic
|
||
information. The association of semantic meaning with color (e.g. red for
|
||
errors) is chosen by the code doing the logging, rather than by the UI
|
||
presentation of the symbolizing filter. This is a concession to existing code
|
||
(e.g. LLVM sanitizer runtimes) that use specific colors and would require
|
||
substantial changes to generate semantic markup instead.
|
||
|
||
* A single control sequence changes "the state", rather than being an
|
||
hierarchical structure that surrounds affected text.
|
||
|
||
The filter processes ANSI SGR control sequences only within a single line. If a
|
||
control sequence to enter a bold or color state is encountered, it's expected
|
||
that the control sequence to reset to default state will be encountered before
|
||
the end of that line. If a "dangling" state is left at the end of a line, the
|
||
filter may reset to default state for the next line.
|
||
|
||
An SGR control sequence is not interpreted inside any other markup element.
|
||
However, other markup elements may appear between SGR control sequences and the
|
||
color/bold state is expected to apply to the symbolic output that replaces the
|
||
markup element in the filter's output.
|
||
|
||
The accepted SGR control sequences all have the form ``"\033[%um"`` (expressed here
|
||
using C string syntax), where ``%u`` is one of these:
|
||
|
||
==== ============================ ===============================================
|
||
Code Effect Notes
|
||
==== ============================ ===============================================
|
||
0 Reset to default formatting.
|
||
1 Bold text Combines with color states, doesn't reset them.
|
||
30 Black foreground
|
||
31 Red foreground
|
||
32 Green foreground
|
||
33 Yellow foreground
|
||
34 Blue foreground
|
||
35 Magenta foreground
|
||
36 Cyan foreground
|
||
37 White foreground
|
||
==== ============================ ===============================================
|
||
|
||
Common markup element syntax
|
||
============================
|
||
|
||
All the markup elements share a common syntactic structure to facilitate simple
|
||
matching and parsing code. Each element has the form::
|
||
|
||
{{{tag:fields}}}
|
||
|
||
``tag`` identifies one of the element types described below, and is always a
|
||
short alphabetic string that must be in lower case. The rest of the element
|
||
consists of one or more fields. Fields are separated by ``:`` and cannot contain
|
||
any ``:`` or ``}`` characters. How many fields must be or may be present and
|
||
what they contain is specified for each element type.
|
||
|
||
No markup elements or ANSI SGR control sequences are interpreted inside the
|
||
contents of a field.
|
||
|
||
Implementations must ignore markup fields after those expected; this allows
|
||
adding new fields to backwards-compatibly extend elements. Implementations need
|
||
not ignore them silently, but the element should behave otherwise as if the
|
||
fields were removed.
|
||
|
||
In the descriptions of each element type, ``printf``-style placeholders indicate
|
||
field contents:
|
||
|
||
``%s``
|
||
A string of printable characters, not including ``:`` or ``}``.
|
||
|
||
``%p``
|
||
An address value represented by ``0x`` followed by an even number of
|
||
hexadecimal digits (using either lower-case or upper-case for ``A``–``F``).
|
||
If the digits are all ``0`` then the ``0x`` prefix may be omitted. No more
|
||
than 16 hexadecimal digits are expected to appear in a single value (64 bits).
|
||
|
||
``%u``
|
||
A nonnegative decimal integer.
|
||
|
||
``%i``
|
||
A nonnegative integer. The digits are hexadecimal if prefixed by ``0x``, octal
|
||
if prefixed by ``0``, or decimal otherwise.
|
||
|
||
``%x``
|
||
A sequence of an even number of hexadecimal digits (using either lower-case or
|
||
upper-case for ``A``–``F``), with no ``0x`` prefix. This represents an
|
||
arbitrary sequence of bytes, such as an ELF Build ID.
|
||
|
||
Presentation elements
|
||
=====================
|
||
|
||
These are elements that convey a specific program entity to be displayed in
|
||
human-readable symbolic form.
|
||
|
||
``{{{symbol:%s}}}``
|
||
Here ``%s`` is the linkage name for a symbol or type. It may require
|
||
demangling according to language ABI rules. Even for unmangled names, it's
|
||
recommended that this markup element be used to identify a symbol name so that
|
||
it can be presented distinctively.
|
||
|
||
Examples::
|
||
|
||
{{{symbol:_ZN7Mangled4NameEv}}}
|
||
{{{symbol:foobar}}}
|
||
|
||
``{{{pc:%p}}}``, ``{{{pc:%p:ra}}}``, ``{{{pc:%p:pc}}}``
|
||
|
||
Here ``%p`` is the memory address of a code location. It might be presented as a
|
||
function name and source location. The second two forms distinguish the kind of
|
||
code location, as described in detail for bt elements below.
|
||
|
||
Examples::
|
||
|
||
{{{pc:0x12345678}}}
|
||
{{{pc:0xffffffff9abcdef0}}}
|
||
|
||
``{{{data:%p}}}``
|
||
|
||
Here ``%p`` is the memory address of a data location. It might be presented as
|
||
the name of a global variable at that location.
|
||
|
||
Examples::
|
||
|
||
{{{data:0x12345678}}}
|
||
{{{data:0xffffffff9abcdef0}}}
|
||
|
||
``{{{bt:%u:%p}}}``, ``{{{bt:%u:%p:ra}}}``, ``{{{bt:%u:%p:pc}}}``
|
||
|
||
This represents one frame in a backtrace. It usually appears on a line by
|
||
itself (surrounded only by whitespace), in a sequence of such lines with
|
||
ascending frame numbers. So the human-readable output might be formatted
|
||
assuming that, such that it looks good for a sequence of bt elements each
|
||
alone on its line with uniform indentation of each line. But it can appear
|
||
anywhere, so the filter should not remove any non-whitespace text surrounding
|
||
the element.
|
||
|
||
Here ``%u`` is the frame number, which starts at zero for the location of the
|
||
fault being identified, increments to one for the caller of frame zero's call
|
||
frame, to two for the caller of frame one, etc. ``%p`` is the memory address
|
||
of a code location.
|
||
|
||
Code locations in a backtrace come from two distinct sources. Most backtrace
|
||
frames describe a return address code location, i.e. the instruction
|
||
immediately after a call instruction. This is the location of code that has
|
||
yet to run, since the function called there has not yet returned. Hence the
|
||
code location of actual interest is usually the call site itself rather than
|
||
the return address, i.e. one instruction earlier. When presenting the source
|
||
location for a return address frame, the symbolizing filter will subtract one
|
||
byte or one instruction length from the actual return address for the call
|
||
site, with the intent that the address logged can be translated directly to a
|
||
source location for the call site and not for the apparent return site
|
||
thereafter (which can be confusing). When inlined functions are involved, the
|
||
call site and the return site can appear to be in different functions at
|
||
entirely unrelated source locations rather than just a line away, making the
|
||
confusion of showing the return site rather the call site quite severe.
|
||
|
||
Often the first frame in a backtrace ("frame zero") identifies the precise
|
||
code location of a fault, trap, or asynchronous interrupt rather than a return
|
||
address. At other times, even the first frame is actually a return address
|
||
(for example, backtraces collected at the time of an object allocation and
|
||
reported later when the allocated object is used or misused). When a system
|
||
supports in-thread trap handling, there may also be frames after the first
|
||
that represent a precise interrupted code location rather than a return
|
||
address, presented as the "caller" of a trap handler function (for example,
|
||
signal handlers in POSIX systems).
|
||
|
||
Return address frames are identified by the ``:ra`` suffix. Precise code
|
||
location frames are identified by the ``:pc`` suffix.
|
||
|
||
Traditional practice has often been to collect backtraces as simple address
|
||
lists, losing the distinction between return address code locations and
|
||
precise code locations. Some such code applies the "subtract one" adjustment
|
||
described above to the address values before reporting them, and it's not
|
||
always clear or consistent whether this adjustment has been applied or not.
|
||
These ambiguous cases are supported by the ``bt`` and ``pc`` forms with no
|
||
``:ra`` or ``:pc`` suffix, which indicate it's unclear which sort of code
|
||
location this is. However, it's highly recommended that all emitters use the
|
||
suffixed forms and deliver address values with no adjustments applied. When
|
||
traditional practice has been ambiguous, the majority of cases seem to have
|
||
been of printing addresses that are return address code locations and printing
|
||
them without adjustment. So the symbolizing filter will usually apply the
|
||
"subtract one byte" adjustment to an address printed without a disambiguating
|
||
suffix. Assuming that a call instruction is longer than one byte on all
|
||
supported machines, applying the "subtract one byte" adjustment a second time
|
||
still results in an address somewhere in the call instruction, so a little
|
||
sloppiness here often does little or no harm.
|
||
|
||
Examples::
|
||
|
||
{{{bt:0:0x12345678:pc}}}
|
||
{{{bt:1:0xffffffff9abcdef0:ra}}}
|
||
|
||
``{{{hexdict:...}}}`` [#not_yet_implemented]_
|
||
|
||
This element can span multiple lines. Here ``...`` is a sequence of key-value
|
||
pairs where a single ``:`` separates each key from its value, and arbitrary
|
||
whitespace separates the pairs. The value (right-hand side) of each pair
|
||
either is one or more ``0`` digits, or is ``0x`` followed by hexadecimal
|
||
digits. Each value might be a memory address or might be some other integer
|
||
(including an integer that looks like a likely memory address but actually has
|
||
an unrelated purpose). When the contextual information about the memory layout
|
||
suggests that a given value could be a code location or a global variable data
|
||
address, it might be presented as a source location or variable name or with
|
||
active UI that makes such interpretation optionally visible.
|
||
|
||
The intended use is for things like register dumps, where the emitter doesn't
|
||
know which values might have a symbolic interpretation but a presentation that
|
||
makes plausible symbolic interpretations available might be very useful to
|
||
someone reading the log. At the same time, a flat text presentation should
|
||
usually avoid interfering too much with the original contents and formatting
|
||
of the dump. For example, it might use footnotes with source locations for
|
||
values that appear to be code locations. An active UI presentation might show
|
||
the dump text as is, but highlight values with symbolic information available
|
||
and pop up a presentation of symbolic details when a value is selected.
|
||
|
||
Example::
|
||
|
||
{{{hexdict:
|
||
CS: 0 RIP: 0x6ee17076fb80 EFL: 0x10246 CR2: 0
|
||
RAX: 0xc53d0acbcf0 RBX: 0x1e659ea7e0d0 RCX: 0 RDX: 0x6ee1708300cc
|
||
RSI: 0 RDI: 0x6ee170830040 RBP: 0x3b13734898e0 RSP: 0x3b13734898d8
|
||
R8: 0x3b1373489860 R9: 0x2776ff4f R10: 0x2749d3e9a940 R11: 0x246
|
||
R12: 0x1e659ea7e0f0 R13: 0xd7231230fd6ff2e7 R14: 0x1e659ea7e108 R15: 0xc53d0acbcf0
|
||
}}}
|
||
|
||
Trigger elements
|
||
================
|
||
|
||
These elements cause an external action and will be presented to the user in a
|
||
human readable form. Generally they trigger an external action to occur that
|
||
results in a linkable page. The link or some other informative information about
|
||
the external action can then be presented to the user.
|
||
|
||
``{{{dumpfile:%s:%s}}}`` [#not_yet_implemented]_
|
||
|
||
Here the first ``%s`` is an identifier for a type of dump and the second
|
||
``%s`` is an identifier for a particular dump that's just been published. The
|
||
types of dumps, the exact meaning of "published", and the nature of the
|
||
identifier are outside the scope of the markup format per se. In general it
|
||
might correspond to writing a file by that name or something similar.
|
||
|
||
This element may trigger additional post-processing work beyond symbolizing
|
||
the markup. It indicates that a dump file of some sort has been published.
|
||
Some logic attached to the symbolizing filter may understand certain types of
|
||
dump file and trigger additional post-processing of the dump file upon
|
||
encountering this element (e.g. generating visualizations, symbolization). The
|
||
expectation is that the information collected from contextual elements
|
||
(described below) in the logging stream may be necessary to decode the content
|
||
of the dump. So if the symbolizing filter triggers other processing, it may
|
||
need to feed some distilled form of the contextual information to those
|
||
processes.
|
||
|
||
An example of a type identifier is ``sancov``, for dumps from LLVM
|
||
`SanitizerCoverage <https://clang.llvm.org/docs/SanitizerCoverage.html>`_.
|
||
|
||
Example::
|
||
|
||
{{{dumpfile:sancov:sancov.8675}}}
|
||
|
||
Contextual elements
|
||
===================
|
||
|
||
These are elements that supply information necessary to convert presentation
|
||
elements to symbolic form. Unlike presentation elements, they are not directly
|
||
related to the surrounding text. Contextual elements should appear alone on
|
||
lines with no other non-whitespace text, so that the symbolizing filter might
|
||
elide the whole line from its output without hiding any other log text.
|
||
|
||
The contextual elements themselves do not necessarily need to be presented in
|
||
human-readable output. However, the information they impart may be essential to
|
||
understanding the logging text even after symbolization. So it's recommended
|
||
that this information be preserved in some form when the original raw log with
|
||
markup may no longer be readily accessible for whatever reason.
|
||
|
||
Contextual elements should appear in the logging stream before they are needed.
|
||
That is, if some piece of context may affect how the symbolizing filter would
|
||
interpret or present a later presentation element, the necessary contextual
|
||
elements should have appeared somewhere earlier in the logging stream. It should
|
||
always be possible for the symbolizing filter to be implemented as a single pass
|
||
over the raw logging stream, accumulating context and massaging text as it goes.
|
||
|
||
``{{{reset}}}``
|
||
|
||
This should be output before any other contextual element. The need for this
|
||
contextual element is to support implementations that handle logs coming from
|
||
multiple processes. Such implementations might not know when a new process
|
||
starts or ends. Because some identifying information (like process IDs) might
|
||
be the same between old and new processes, a way is needed to distinguish two
|
||
processes with such identical identifying information. This element informs
|
||
such implementations to reset the state of a filter so that information from a
|
||
previous process's contextual elements is not assumed for new process that
|
||
just happens have the same identifying information.
|
||
|
||
``{{{module:%i:%s:%s:...}}}``
|
||
|
||
This element represents a so-called "module". A "module" is a single linked
|
||
binary, such as a loaded ELF file. Usually each module occupies a contiguous
|
||
range of memory.
|
||
|
||
Here ``%i`` is the module ID which is used by other contextual elements to
|
||
refer to this module. The first ``%s`` is a human-readable identifier for the
|
||
module, such as an ELF ``DT_SONAME`` string or a file name; but it might be
|
||
empty. It's only for casual information. Only the module ID is used to refer
|
||
to this module in other contextual elements, never the ``%s`` string. The
|
||
``module`` element defining a module ID must always be emitted before any
|
||
other elements that refer to that module ID, so that a filter never needs to
|
||
keep track of dangling references. The second ``%s`` is the module type and it
|
||
determines what the remaining fields are. The following module types are
|
||
supported:
|
||
|
||
* ``elf:%x``
|
||
|
||
Here ``%x`` encodes an ELF Build ID. The Build ID should refer to a single
|
||
linked binary. The Build ID string is the sole way to identify the binary from
|
||
which this module was loaded.
|
||
|
||
Example::
|
||
|
||
{{{module:1:libc.so:elf:83238ab56ba10497}}}
|
||
|
||
``{{{mmap:%p:%i:...}}}``
|
||
|
||
This contextual element is used to give information about a particular region
|
||
in memory. ``%p`` is the starting address and ``%i`` gives the size in hex of the
|
||
region of memory. The ``...`` part can take different forms to give different
|
||
information about the specified region of memory. The allowed forms are the
|
||
following:
|
||
|
||
* ``load:%i:%s:%p``
|
||
|
||
This subelement informs the filter that a segment was loaded from a module.
|
||
The module is identified by its module ID ``%i``. The ``%s`` is one or more of
|
||
the letters 'r', 'w', and 'x' (in that order and in either upper or lower
|
||
case) to indicate this segment of memory is readable, writable, and/or
|
||
executable. The symbolizing filter can use this information to guess whether
|
||
an address is a likely code address or a likely data address in the given
|
||
module. The remaining ``%p`` gives the module relative address. For ELF files
|
||
the module relative address will be the ``p_vaddr`` of the associated program
|
||
header. For example if your module's executable segment has
|
||
``p_vaddr=0x1000``, ``p_memsz=0x1234``, and was loaded at ``0x7acba69d5000``
|
||
then you need to subtract ``0x7acba69d4000`` from any address between
|
||
``0x7acba69d5000`` and ``0x7acba69d6234`` to get the module relative address.
|
||
The starting address will usually have been rounded down to the active page
|
||
size, and the size rounded up.
|
||
|
||
Example::
|
||
|
||
{{{mmap:0x7acba69d5000:0x5a000:load:1:rx:0x1000}}}
|
||
|
||
.. rubric:: Footnotes
|
||
|
||
.. [#not_yet_implemented] This markup element is not yet implemented in
|
||
:doc:`llvm-symbolizer <CommandGuide/llvm-symbolizer>`.
|