[MCA][doc] Add a section for the 'Bottleneck Analysis'.

Also clarify the meaning of 'Block RThroughput' and 'RThroughput'.

llvm-svn: 367853
This commit is contained in:
Andrea Di Biagio 2019-08-05 13:18:37 +00:00
parent 0f61f1648d
commit f5f269d720

View File

@ -373,17 +373,28 @@ overview of the performance throughput. Important performance indicators are
**IPC**, **uOps Per Cycle**, and **Block RThroughput** (Block Reciprocal
Throughput).
Field *DispatchWidth* is the maximum number of micro opcodes that are dispatched
to the out-of-order backend every simulated cycle.
IPC is computed dividing the total number of simulated instructions by the total
number of cycles. In the absence of loop-carried data dependencies, the
observed IPC tends to a theoretical maximum which can be computed by dividing
the number of instructions of a single iteration by the *Block RThroughput*.
number of cycles.
Field *Block RThroughput* is the reciprocal of the block throughput. Block
throuhgput is a theoretical quantity computed as the maximum number of blocks
(i.e. iterations) that can be executed per simulated clock cycle in the absence
of loop carried dependencies. Block throughput is is superiorly
limited by the dispatch rate, and the availability of hardware resources.
In the absence of loop-carried data dependencies, the observed IPC tends to a
theoretical maximum which can be computed by dividing the number of instructions
of a single iteration by the `Block RThroughput`.
Field 'uOps Per Cycle' is computed dividing the total number of simulated micro
opcodes by the total number of cycles. A delta between Dispatch Width and this
field is an indicator of a performance issue. In the absence of loop-carried
data dependencies, the observed 'uOps Per Cycle' should tend to a theoretical
maximum throughput which can be computed by dividing the number of uOps of a
single iteration by the *Block RThroughput*.
single iteration by the `Block RThroughput`.
Field *uOps Per Cycle* is bounded from above by the dispatch width. That is
because the dispatch width limits the maximum size of a dispatch group. Both IPC
@ -392,12 +403,12 @@ availability of hardware resources affects the resource pressure distribution,
and it limits the number of instructions that can be executed in parallel every
cycle. A delta between Dispatch Width and the theoretical maximum uOps per
Cycle (computed by dividing the number of uOps of a single iteration by the
*Block RTrhoughput*) is an indicator of a performance bottleneck caused by the
`Block RThroughput`) is an indicator of a performance bottleneck caused by the
lack of hardware resources.
In general, the lower the Block RThroughput, the better.
In this example, ``uOps per iteration/Block RThroughput`` is 1.50. Since there
are no loop-carried dependencies, the observed *uOps Per Cycle* is expected to
are no loop-carried dependencies, the observed `uOps Per Cycle` is expected to
approach 1.50 when the number of iterations tends to infinity. The delta between
the Dispatch Width (2.00), and the theoretical maximum uOp throughput (1.50) is
an indicator of a performance bottleneck caused by the lack of hardware
@ -409,6 +420,13 @@ throughput of every instruction in the sequence. That section also reports
extra information related to the number of micro opcodes, and opcode properties
(i.e., 'MayLoad', 'MayStore', and 'HasSideEffects').
Field *RThroughput* is the reciprocal of the instruction throughput. Throughput
is computed as the maximum number of instructions of a same type that can be
executed per clock cycle in the absence of operand dependencies. In this
example, the reciprocal throughput of a vector float multiply is 1
cycles/instruction. That is because the FP multiplier JFPM is only available
from pipeline JFPU1.
The third section is the *Resource pressure view*. This view reports
the average number of resource cycles consumed every iteration by instructions
for every processor resource unit available on the target. Information is
@ -540,6 +558,61 @@ resources, the delta between the two counters is small. However, the number of
cycles spent in the queue tends to be larger (i.e., more than 1-3cy),
especially when compared to other low latency instructions.
Bottleneck Analysis
^^^^^^^^^^^^^^^^^^^
The ``-bottleneck-analysis`` command line option enables the analysis of
performance bottlenecks.
This analysis is potentially expensive. It attempts to correlate increases in
backend pressure (caused by pipeline resource pressure and data dependencies) to
dynamic dispatch stalls.
Below is an example of ``-bottleneck-analysis`` output generated by
:program:`llvm-mca` for 500 iterations of the dot-product example on btver2.
.. code-block:: none
Cycles with backend pressure increase [ 48.07% ]
Throughput Bottlenecks:
Resource Pressure [ 47.77% ]
- JFPA [ 47.77% ]
- JFPU0 [ 47.77% ]
Data Dependencies: [ 0.30% ]
- Register Dependencies [ 0.30% ]
- Memory Dependencies [ 0.00% ]
Critical sequence based on the simulation:
Instruction Dependency Information
+----< 2. vhaddps %xmm3, %xmm3, %xmm4
|
| < loop carried >
|
| 0. vmulps %xmm0, %xmm1, %xmm2
+----> 1. vhaddps %xmm2, %xmm2, %xmm3 ## RESOURCE interference: JFPA [ probability: 74% ]
+----> 2. vhaddps %xmm3, %xmm3, %xmm4 ## REGISTER dependency: %xmm3
|
| < loop carried >
|
+----> 1. vhaddps %xmm2, %xmm2, %xmm3 ## RESOURCE interference: JFPA [ probability: 74% ]
According to the analysis, throughput is limited by resource pressure and not by
data dependencies. The analysis observed increases in backend pressure during
48.07% of the simulated run. Almost all those pressure increase events were
caused by contention on processor resources JFPA/JFPU0.
The `critical sequence` is the most expensive sequence of instructions according
to the simulation. It is annotated to provide extra information about critical
register dependencies and resource interferences between instructions.
Instructions from the critical sequence are expected to significantly impact
performance. By construction, the accuracy of this analysis is strongly
dependent on the simulation and (as always) by the quality of the processor
model in llvm.
Extra Statistics to Further Diagnose Performance Issues
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The ``-all-stats`` command line option enables extra statistics and performance