diff --git a/docs/XRay.rst b/docs/XRay.rst index d61e4e6d995..3009c8794a2 100644 --- a/docs/XRay.rst +++ b/docs/XRay.rst @@ -196,6 +196,9 @@ on your application, you may set the ``xray_fdr_log`` option to ``true`` in the ``XRAY_OPTIONS`` environment variable (while also optionally setting the ``xray_naive_log`` to ``false``). +When the buffers are flushed to disk, the result is a binary trace format +described by `XRay FDR format `_ + When FDR mode is on, it will keep writing and recycling memory buffers until the logging implementation is finalized -- at which point it can be flushed and re-initialised later. To do this programmatically, we follow the workflow diff --git a/docs/XRayFDRFormat.rst b/docs/XRayFDRFormat.rst new file mode 100644 index 00000000000..c249ee0efed --- /dev/null +++ b/docs/XRayFDRFormat.rst @@ -0,0 +1,394 @@ +====================================== +XRay Flight Data Recorder Trace Format +====================================== + +:Version: 1 as of 2017-07-20 + +.. contents:: + :local: + + +Introduction +============ + +When gathering XRay traces in Flight Data Recorder mode, each thread of an +application will claim buffers to fill with trace data, which at some point +is finalized and flushed. + +A goal of the profiler is to minimize overhead, so the flushed data directly +corresponds to the buffer. + +This document describes the format of a trace file. + + +General +======= + +Each trace file corresponds to a sequence of events in a particular thread. + +The file has a header followed by a sequence of discriminated record types. + +The endianess of byte fields matches the endianess of the platform which +produced the trace file. + + +Header Section +============== + +A trace file begins with a 32 byte header. + ++-------------------+-----------------+----------------------------------------+ +| Field | Size (bytes) | Description | ++===================+=================+========================================+ +| version | ``2`` | Anticipates versioned readers. This | +| | | document describes the format when | +| | | version == 1 | ++-------------------+-----------------+----------------------------------------+ +| type | ``2`` | An enumeration encoding the type of | +| | | trace. Flight Data Recorder mode | +| | | traces have type == 1 | ++-------------------+-----------------+----------------------------------------+ +| bitfield | ``4`` | Holds parameters that are not aligned | +| | | to bytes. Further described below. | ++-------------------+-----------------+----------------------------------------+ +| cycle_frequency | ``8`` | The frequency in hertz of the CPU | +| | | oscillator used to measure duration of | +| | | events in ticks. | ++-------------------+-----------------+----------------------------------------+ +| buffer_size | ``8`` | The size in bytes of the data portion | +| | | of the trace following the header. | ++-------------------+-----------------+----------------------------------------+ +| reserved | ``8`` | Reserved for future use. | ++-------------------+-----------------+----------------------------------------+ + +The bitfield parameter of the file header is composed of the following fields. + ++-------------------+----------------+-----------------------------------------+ +| Field | Size (bits) | Description | ++===================+================+=========================================+ +| constant_tsc | ``1`` | Whether the platform's timestamp | +| | | counter used to record ticks between | +| | | events ticks at a constant frequency | +| | | despite CPU frequency changes. | +| | | 0 == non-constant. 1 == constant. | ++-------------------+----------------+-----------------------------------------+ +| nonstop_tsc | ``1`` | Whether the tsc continues to count | +| | | despite whether the CPU is in a low | +| | | power state. 0 == stop. 1 == non-stop. | ++-------------------+----------------+-----------------------------------------+ +| reserved | ``30`` | Not meaningful. | ++-------------------+----------------+-----------------------------------------+ + + +Data Section +============ + +Following the header in a trace is a data section with size matching the +buffer_size field in the header. + +The data section is a stream of elements of different types. + +There are a few categories of data in the sequence. + +- ``Function Records``: Function Records contain the timing of entry into and + exit from function execution. Function Records have 8 bytes each. + +- ``Metadata Records``: Metadata records serve many purposes. Mostly, they + capture information that may be too costly to record for each function, but + that is required to contextualize the fine-grained timings. They also are used + as markers for user-defined Event Data payloads. Metadata records have 16 + bytes each. + +- ``Event Data``: Free form data may be associated with events that are traced + by the binary and encode data defined by a handler function. Event data is + always preceded with a marker record which indicates how large it is. + +- ``Function Arguments``: The arguments to some functions are included in the + trace. These are either pointer addresses or primitives that are read and + logged independently of their types in a high level language. To the tracer, + they are all simply numbers. Function Records that have attached arguments + will indicate their presence on the function entry record. We only support + logging contiguous function argument sequences starting with argument zero, + which will be the "this" pointer for member function invocations. For example, + we don't support logging the first and third argument. + +A reader of the memory format must maintain a state machine. The format makes no +attempt to pad for alignment, and it is not seekable. + + +Function Records +---------------- + +Function Records have an 8 byte layout. This layout encodes information to +reconstruct a call stack of instrumented function and their durations. + ++---------------+--------------+-----------------------------------------------+ +| Field | Size (bits) | Description | ++===============+==============+===============================================+ +| discriminant | ``1`` | Indicates whether a reader should read a | +| | | Function or Metadata record. Set to ``0`` for | +| | | Function records. | ++---------------+--------------+-----------------------------------------------+ +| action | ``3`` | Specifies whether the function is being | +| | | entered, exited, or is a non-standard entry | +| | | or exit produced by optimizations. | ++---------------+--------------+-----------------------------------------------+ +| function_id | ``28`` | A numeric ID for the function. Resolved to a | +| | | name via the xray instrumentation map. The | +| | | instrumentation map is built by xray at | +| | | compile time into an object file and pairs | +| | | the function ids to addresses. It is used for | +| | | patching and as a lookup into the binary's | +| | | symbols to obtain names. | ++---------------+--------------+-----------------------------------------------+ +| tsc_delta | ``32`` | The number of ticks of the timestamp counter | +| | | since a previous record recorded a delta or | +| | | other TSC resetting event. | ++---------------+--------------+-----------------------------------------------+ + +On little-endian machines, the bitfields are ordered from least significant bit +bit to most significant bit. A reader can read an 8 bit value and apply the mask +``0x01`` for the discriminant. Similarly, they can read 32 bits and unsigned +shift right by ``0x04`` to obtain the function_id field. + +On big-endian machine, the bitfields are written in order from most significant +bit to least significant bit. A reader would read an 8 bit value and unsigned +shift right by 7 bits for the discriminant. The function_id field could be +obtained by reading a 32 bit value and applying the mask ``0x0FFFFFFF``. + +Function action types are as follows. + ++---------------+--------------+-----------------------------------------------+ +| Type | Number | Description | ++===============+==============+===============================================+ +| Entry | ``0`` | Typical function entry. | ++---------------+--------------+-----------------------------------------------+ +| Exit | ``1`` | Typical function exit. | ++---------------+--------------+-----------------------------------------------+ +| Tail_Exit | ``2`` | An exit from a function due to Tail call | +| | | optimization. | ++---------------+--------------+-----------------------------------------------+ +| Entry_Args | ``3`` | A function entry that records arguments. | ++---------------+--------------+-----------------------------------------------+ + +Entry_Args records do not contain the arguments themselves. Instead, metadata +records for each of the logged args follow the function record in the stream. + + +Metadata Records +---------------- + +Interspersed throughout the buffer are 16 byte Metadata records. For typically +instrumented binaries, they will be sparser than Function records, and they +provide a fuller picture of the binary execution state. + +Metadata record layout is partially record dependent, but they share a common +structure. + +The same bit field rules described for function records apply to the first byte +of MetadataRecords. Within this byte, little endian machines use lsb to msb +ordering and big endian machines use msb to lsb ordering. + ++---------------+--------------+-----------------------------------------------+ +| Field | Size | Description | ++===============+==============+===============================================+ +| discriminant | ``1 bit`` | Indicates whether a reader should read a | +| | | Function or Metadata record. Set to ``1`` for | +| | | Metadata records. | ++---------------+--------------+-----------------------------------------------+ +| record_kind | ``7 bits`` | The type of Metadata record. | ++---------------+--------------+-----------------------------------------------+ +| data | ``15 bytes`` | A data field used differently for each record | +| | | type. | ++---------------+--------------+-----------------------------------------------+ + +Here is a table of the enumerated record kinds. + +====== =========================== +Number Type +------ --------------------------- +0 NewBuffer +1 EndOfBuffer +2 NewCPUId +3 TSCWrap +4 WallTimeMarker +5 CustomEventMarker +6 CallArgument + + +NewBuffer Records +----------------- + +Each buffer begins with a NewBuffer record immediately after the header. +It records the thread ID of the thread that the trace belongs to. + +Its data segment is as follows. + ++---------------+--------------+-----------------------------------------------+ +| Field | Size (bytes) | Description | ++===============+==============+===============================================+ +| thread_Id | ``2`` | Thread ID for buffer. | ++---------------+--------------+-----------------------------------------------+ +| reserved | ``13`` | Unused. | ++---------------+--------------+-----------------------------------------------+ + + +WallClockTime Records +--------------------- + +Following the NewBuffer record, each buffer records an absolute time as a frame +of reference for the durations recorded by timestamp counter deltas. + +Its data segment is as follows. + ++---------------+--------------+-----------------------------------------------+ +| Field | Size (bytes) | Description | ++===============+==============+===============================================+ +| seconds | ``8`` | Seconds on absolute timescale. The starting | +| | | point is unspecified and depends on the | +| | | implementation and platform configured by the | +| | | tracer. | ++---------------+--------------+-----------------------------------------------+ +| microseconds | ``4`` | The microsecond component of the time. | ++---------------+--------------+-----------------------------------------------+ +| reserved | ``3`` | Unused. | ++---------------+--------------+-----------------------------------------------+ + + +NewCpuId Records +---------------- + +Each function entry invokes a routine to determine what CPU is executing. +Typically, this is done with readtscp, which reads the timestamp counter at the +same time. + +If the tracing detects that the execution has switched CPUs or if this is the +first instrumented entry point, the tracer will output a NewCpuId record. + +Its data segment is as follows. + ++---------------+--------------+-----------------------------------------------+ +| Field | Size (bytes) | Description | ++===============+==============+===============================================+ +| cpu_id | ``2`` | CPU Id. | ++---------------+--------------+-----------------------------------------------+ +| absolute_tsc | ``8`` | The absolute value of the timestamp counter. | ++---------------+--------------+-----------------------------------------------+ +| reserved | ``5`` | Unused. | ++---------------+--------------+-----------------------------------------------+ + + +TSCWrap Records +--------------- + +Since each function record uses a 32 bit value to represent the number of ticks +of the timestamp counter since the last reference, it is possible for this value +to overflow, particularly for sparsely instrumented binaries. + +When this delta would not fit into a 32 bit representation, a reference absolute +timestamp counter record is written in the form of a TSCWrap record. + +Its data segment is as follows. + ++---------------+--------------+-----------------------------------------------+ +| Field | Size (bytes) | Description | ++===============+==============+===============================================+ +| absolute_tsc | ``8`` | Timestamp counter value. | ++---------------+--------------+-----------------------------------------------+ +| reserved | ``7`` | Unused. | ++---------------+--------------+-----------------------------------------------+ + + +CallArgument Records +-------------------- + +Immediately following an Entry_Args type function record, there may be one or +more CallArgument records that contain the traced function's parameter values. + +The order of the CallArgument Record sequency corresponds one to one with the +order of the function parameters. + +CallArgument data segment: + ++---------------+--------------+-----------------------------------------------+ +| Field | Size (bytes) | Description | ++===============+==============+===============================================+ +| argument | ``8`` | Numeric argument (may be pointer address). | ++---------------+--------------+-----------------------------------------------+ +| reserved | ``7`` | Unused. | ++---------------+--------------+-----------------------------------------------+ + + +CustomEventMarker Records +------------------------- + +XRay provides the feature of logging custom events. This may be leveraged to +record tracing info for RPCs or similarly trace data that is application +specific. + +Custom Events themselves are an unstructured (application defined) segment of +memory with arbitrary size within the buffer. They are preceded by +CustomEventMarkers to indicate their presence and size. + +CustomEventMarker data segment: + ++---------------+--------------+-----------------------------------------------+ +| Field | Size (bytes) | Description | ++===============+==============+===============================================+ +| event_size | ``4`` | Size of preceded event. | ++---------------+--------------+-----------------------------------------------+ +| absolute_tsc | ``8`` | A timestamp counter of the event. | ++---------------+--------------+-----------------------------------------------+ +| reserved | ``3`` | Unused. | ++---------------+--------------+-----------------------------------------------+ + + +EndOfBuffer Records +------------------- + +An EndOfBuffer record type indicates that there is no more trace data in this +buffer. The reader is expected to seek past the remaining buffer_size expressed +before the start of buffer and look for either another header or EOF. + + +Format Grammar and Invariants +============================= + +Not all sequences of Metadata records and Function records are valid data. A +sequence should be parsed as a state machine. The expectations for a valid +format can be expressed as a context free grammar. + +This is an attempt to explain the format with statements in EBNF format. + +- Format := Header ThreadBuffer* EOF + +- ThreadBuffer := NewBuffer WallClockTime NewCPUId BodySequence* End + +- BodySequence := NewCPUId | TSCWrap | Function | CustomEvent + +- Function := (Function_Entry_Args CallArgument*) | Function_Other_Type + +- CustomEvent := CustomEventMarker CustomEventUnstructuredMemory + +- End := EndOfBuffer RemainingBufferSizeToSkip + + +Function Record Order +--------------------- + +There are a few clarifications that may help understand what is expected of +Function records. + +- Functions with an Exit are expected to have a corresponding Entry or + Entry_Args function record precede them in the trace. + +- Tail_Exit Function records record the Function ID of the function whose return + address the program counter will take. In other words, the final function that + would be popped off of the call stack if tail call optimization was not used. + +- Not all functions marked for instrumentation are necessarily in the trace. The + tracer uses heuristics to preserve the trace for non-trivial functions. + +- Not every entry must have a traced Exit or Tail Exit. The buffer may run out + of space or the program may request for the tracer to finalize toreturn the + buffer before an instrumented function exits. diff --git a/lib/XRay/Trace.cpp b/lib/XRay/Trace.cpp index 6677063f944..89fd0b0748b 100644 --- a/lib/XRay/Trace.cpp +++ b/lib/XRay/Trace.cpp @@ -222,8 +222,8 @@ Error processCustomEventMarker(FDRState &State, uint8_t RecordFirstByte, DataExtractor &RecordExtractor, size_t &RecordSize) { // We can encounter a CustomEventMarker anywhere in the log, so we can handle - // it regardless of the expectation. However, we do se the expectation to read - // a set number of fixed bytes, as described in the metadata. + // it regardless of the expectation. However, we do set the expectation to + // read a set number of fixed bytes, as described in the metadata. uint32_t OffsetPtr = 1; // Read after the first byte. uint32_t DataSize = RecordExtractor.getU32(&OffsetPtr); uint64_t TSC = RecordExtractor.getU64(&OffsetPtr); @@ -333,7 +333,7 @@ Error processFDRFunctionRecord(FDRState &State, uint8_t RecordFirstByte, } Record.CPU = State.CPUId; Record.TId = State.ThreadId; - // Back up to read first 32 bits, including the 8 we pulled RecordType + // Back up to read first 32 bits, including the 4 we pulled RecordType // and RecordKind out of. The remaining 28 are FunctionId. uint32_t OffsetPtr = 0; // Despite function Id being a signed int on XRayRecord, @@ -369,8 +369,9 @@ Error processFDRFunctionRecord(FDRState &State, uint8_t RecordFirstByte, /// We expect a format complying with the grammar in the following pseudo-EBNF. /// /// FDRLog: XRayFileHeader ThreadBuffer* -/// XRayFileHeader: 32 bits to identify the log as FDR with machine metadata. -/// ThreadBuffer: BufSize NewBuffer WallClockTime NewCPUId FunctionSequence EOB +/// XRayFileHeader: 32 bytes to identify the log as FDR with machine metadata. +/// Includes BufferSize +/// ThreadBuffer: NewBuffer WallClockTime NewCPUId FunctionSequence EOB /// BufSize: 8 byte unsigned integer indicating how large the buffer is. /// NewBuffer: 16 byte metadata record with Thread Id. /// WallClockTime: 16 byte metadata record with human readable time.