[MCA][doc] Add a section for the 'Bottleneck Analysis'.
Also clarify the meaning of 'Block RThroughput' and 'RThroughput'. llvm-svn: 367853
parent 0f61f1648d
commit f5f269d720
@@ -373,17 +373,28 @@ overview of the performance throughput. Important performance indicators are
**IPC**, **uOps Per Cycle**, and **Block RThroughput** (Block Reciprocal
Throughput).

Field *DispatchWidth* is the maximum number of micro opcodes that are dispatched
to the out-of-order backend every simulated cycle.

IPC is computed by dividing the total number of simulated instructions by the
total number of cycles.

Field *Block RThroughput* is the reciprocal of the block throughput. Block
throughput is a theoretical quantity computed as the maximum number of blocks
(i.e., iterations) that can be executed per simulated clock cycle in the
absence of loop-carried dependencies. Block throughput is bounded from above
by the dispatch rate and by the availability of hardware resources.

In the absence of loop-carried data dependencies, the observed IPC tends to a
theoretical maximum which can be computed by dividing the number of
instructions of a single iteration by the *Block RThroughput*.

Field *uOps Per Cycle* is computed by dividing the total number of simulated
micro opcodes by the total number of cycles. A delta between Dispatch Width
and this field is an indicator of a performance issue. In the absence of
loop-carried data dependencies, the observed *uOps Per Cycle* should tend to a
theoretical maximum throughput which can be computed by dividing the number of
uOps of a single iteration by the *Block RThroughput*.
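
As a quick illustration (the numbers below are chosen only for the sake of the
example; they are not taken from the report above), a block of 3 instructions
and 6 uOps with a *Block RThroughput* of 4.0 cycles would have the following
theoretical maxima:

.. code-block:: none

  Max IPC            = (instructions per iteration) / (Block RThroughput)
                     = 3 / 4.0 = 0.75
  Max uOps Per Cycle = (uOps per iteration) / (Block RThroughput)
                     = 6 / 4.0 = 1.50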

Field *uOps Per Cycle* is bounded from above by the dispatch width. That is
because the dispatch width limits the maximum size of a dispatch group. Both IPC
@@ -392,12 +403,12 @@ availability of hardware resources affects the resource pressure distribution,
and it limits the number of instructions that can be executed in parallel every
cycle. A delta between Dispatch Width and the theoretical maximum uOps per
cycle (computed by dividing the number of uOps of a single iteration by the
*Block RThroughput*) is an indicator of a performance bottleneck caused by the
lack of hardware resources.
In general, the lower the Block RThroughput, the better.

In this example, ``uOps per iteration/Block RThroughput`` is 1.50. Since there
are no loop-carried dependencies, the observed *uOps Per Cycle* is expected to
approach 1.50 when the number of iterations tends to infinity. The delta between
the Dispatch Width (2.00) and the theoretical maximum uOp throughput (1.50) is
an indicator of a performance bottleneck caused by the lack of hardware
@@ -409,6 +420,13 @@ throughput of every instruction in the sequence. That section also reports
extra information related to the number of micro opcodes, and opcode properties
(i.e., 'MayLoad', 'MayStore', and 'HasSideEffects').

Field *RThroughput* is the reciprocal of the instruction throughput. Throughput
is computed as the maximum number of instructions of the same type that can be
executed per clock cycle in the absence of operand dependencies. In this
example, the reciprocal throughput of a vector float multiply is 1 cycle per
instruction. That is because the FP multiplier JFPM is only available from
pipeline JFPU1.
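
A quick way to read this number (a simplified illustration, not a statement
about any particular scheduling model): when an opcode is limited only by
execution resources, its reciprocal throughput is roughly one over the number
of fully pipelined units that can execute it.

.. code-block:: none

  1 unit  (e.g. JFPM, reachable only through JFPU1)  ->  RThroughput = 1/1 = 1.00
  2 units (hypothetical)                             ->  RThroughput = 1/2 = 0.50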

The third section is the *Resource pressure view*. This view reports
the average number of resource cycles consumed every iteration by instructions
for every processor resource unit available on the target. Information is
@@ -540,6 +558,61 @@ resources, the delta between the two counters is small. However, the number of
cycles spent in the queue tends to be larger (i.e., more than 1-3cy),
especially when compared to other low latency instructions.

Bottleneck Analysis
^^^^^^^^^^^^^^^^^^^
The ``-bottleneck-analysis`` command line option enables the analysis of
performance bottlenecks.

This analysis is potentially expensive. It attempts to correlate increases in
backend pressure (caused by pipeline resource pressure and data dependencies)
to dynamic dispatch stalls.
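
A typical invocation looks like the following (an illustrative command line; it
assumes the dot-product kernel shown earlier in this document has been saved as
``dot-product.s``):

.. code-block:: bash

  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=500 -bottleneck-analysis dot-product.s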

Below is an example of ``-bottleneck-analysis`` output generated by
:program:`llvm-mca` for 500 iterations of the dot-product example on btver2.

.. code-block:: none

  Cycles with backend pressure increase [ 48.07% ]
  Throughput Bottlenecks:
    Resource Pressure       [ 47.77% ]
    - JFPA                  [ 47.77% ]
    - JFPU0                 [ 47.77% ]
    Data Dependencies:      [ 0.30% ]
    - Register Dependencies [ 0.30% ]
    - Memory Dependencies   [ 0.00% ]

  Critical sequence based on the simulation:

                Instruction                          Dependency Information
   +----< 2.    vhaddps %xmm3, %xmm3, %xmm4
   |
   |    < loop carried >
   |
   |      0.    vmulps  %xmm0, %xmm1, %xmm2
   +----> 1.    vhaddps %xmm2, %xmm2, %xmm3          ## RESOURCE interference:  JFPA [ probability: 74% ]
   +----> 2.    vhaddps %xmm3, %xmm3, %xmm4          ## REGISTER dependency:  %xmm3
   |
   |    < loop carried >
   |
   +----> 1.    vhaddps %xmm2, %xmm2, %xmm3          ## RESOURCE interference:  JFPA [ probability: 74% ]

According to the analysis, throughput is limited by resource pressure and not by
data dependencies. The analysis observed increases in backend pressure during
48.07% of the simulated run. Almost all those pressure increase events were
caused by contention on processor resources JFPA/JFPU0.

The *critical sequence* is the most expensive sequence of instructions according
to the simulation. It is annotated to provide extra information about critical
register dependencies and resource interferences between instructions.

Instructions from the critical sequence are expected to significantly impact
performance. By construction, the accuracy of this analysis is strongly
dependent on the simulation and (as always) on the quality of the processor
model in LLVM.


Extra Statistics to Further Diagnose Performance Issues
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The ``-all-stats`` command line option enables extra statistics and performance