[llvm-mca][docs] Add instruction flow documentation. NFC.

Summary:
This patch mostly copies the existing Instruction Flow and stage descriptions
from the mca README.  I made a few text tweaks, but no semantic changes,
and made reference to the "default pipeline."  I also removed the internals
references (e.g., reference to class names and header files).  I did leave the
LSUnit name around, but only as an abbreviated word for the load-store unit.


Reviewers: andreadb, courbet, RKSimon, gbedwell, filcab

Reviewed By: andreadb

Subscribers: tschuett, jfb, llvm-commits

Differential Revision: https://reviews.llvm.org/D49692

llvm-svn: 338319
Matt Davis 2018-07-30 22:30:14 +00:00
parent a0ef9a3456
commit 2fb231bb42

@@ -549,3 +549,180 @@ statistics are displayed by using the command option ``-all-stats`` or
In this example, we can conclude that the IPC is mostly limited by data
dependencies, and not by resource pressure.

Instruction Flow
^^^^^^^^^^^^^^^^

This section describes the instruction flow through MCA's default out-of-order
pipeline, as well as the functional units involved in the process.

The default pipeline implements the following sequence of stages used to
process instructions.

* Dispatch (Instruction is dispatched to the schedulers).
* Issue (Instruction is issued to the processor pipelines).
* Write Back (Instruction is executed, and results are written back).
* Retire (Instruction is retired; writes are architecturally committed).

The default pipeline only models the out-of-order portion of a processor.
Therefore, the instruction fetch and decode stages are not modeled. Performance
bottlenecks in the frontend are not diagnosed. MCA assumes that instructions
have all been decoded and placed into a queue. Also, MCA does not model branch
prediction.
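
The following is a minimal C++ sketch of that four-stage sequence. It is
purely illustrative: the enum and function are hypothetical and do not
correspond to anything in MCA's implementation.

.. code-block:: c++

  #include <cstdio>

  // Hypothetical stages mirroring the list above. In MCA (and in real
  // hardware) an instruction may spend several cycles in one stage.
  enum class Stage { Dispatch, Issue, WriteBack, Retire, Done };

  Stage advance(Stage S) {
    switch (S) {
    case Stage::Dispatch:  return Stage::Issue;     // sent to a scheduler
    case Stage::Issue:     return Stage::WriteBack; // executed on a pipeline
    case Stage::WriteBack: return Stage::Retire;    // results written back
    default:               return Stage::Done;      // committed
    }
  }

  int main() {
    Stage S = Stage::Dispatch;
    while (S != Stage::Done) {
      std::printf("stage %d\n", static_cast<int>(S));
      S = advance(S);
    }
  }
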
Instruction Dispatch
""""""""""""""""""""

During the dispatch stage, instructions are picked in program order from a
queue of already decoded instructions, and dispatched in groups to the
simulated hardware schedulers.

The size of a dispatch group depends on the availability of the simulated
hardware resources. The processor dispatch width defaults to the value
of the ``IssueWidth`` in LLVM's scheduling model.

An instruction can be dispatched if:

* The size of the dispatch group is smaller than the processor's dispatch
  width.
* There are enough entries in the reorder buffer.
* There are enough physical registers to do register renaming.
* The schedulers are not full.
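
The following C++ sketch expresses those four checks as a single predicate.
The struct, its fields, and the function are hypothetical names invented for
this example; they are not MCA's internals.

.. code-block:: c++

  // Hypothetical bookkeeping for the dispatch stage.
  struct DispatchState {
    unsigned GroupSize;      // micro-ops already dispatched this cycle
    unsigned DispatchWidth;  // defaults to IssueWidth in the sched model
    unsigned FreeROBEntries; // available reorder buffer entries
    unsigned FreePhysRegs;   // registers available for renaming
    bool SchedulersFull;     // no scheduler buffer entry is free
  };

  bool canDispatch(const DispatchState &S, unsigned NumMicroOps,
                   unsigned NumRegWrites) {
    return S.GroupSize + NumMicroOps <= S.DispatchWidth && // dispatch width
           S.FreeROBEntries >= NumMicroOps &&              // reorder buffer
           S.FreePhysRegs >= NumRegWrites &&               // register renaming
           !S.SchedulersFull;                              // scheduler space
  }
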
Scheduling models can optionally specify which register files are available on
the processor. MCA uses that information to initialize register file
descriptors. Users can limit the number of physical registers that are
globally available for register renaming by using the command option
``-register-file-size``. A value of zero for this option means *unbounded*.
By knowing how many registers are available for renaming, MCA can predict
dispatch stalls caused by the lack of registers.
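
A sketch of that bookkeeping, assuming (as a simplification) a single register
file whose descriptor only tracks a count of used registers; the type and
member names are hypothetical:

.. code-block:: c++

  // A value of zero for NumPhysRegs mirrors -register-file-size=0,
  // meaning the number of physical registers is unbounded.
  struct RegisterFileDescriptor {
    unsigned NumPhysRegs; // 0 => unbounded
    unsigned UsedRegs = 0;

    // Would renaming NumWrites register definitions stall dispatch?
    bool wouldStall(unsigned NumWrites) const {
      if (NumPhysRegs == 0)
        return false; // unbounded: renaming never stalls
      return UsedRegs + NumWrites > NumPhysRegs;
    }
  };
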
The number of reorder buffer entries consumed by an instruction depends on the
number of micro-opcodes specified by the target scheduling model. The purpose
of MCA's reorder buffer is to track the progress of instructions that are
"in-flight," and to retire instructions in program order. The number of
entries in the reorder buffer defaults to the ``MicroOpBufferSize`` provided
by the target scheduling model.
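
The entry accounting might be sketched as follows; the class and its members
are hypothetical and only illustrate the "one entry per micro-op, freed in
program order" behavior described above:

.. code-block:: c++

  #include <deque>

  struct ReorderBuffer {
    unsigned Capacity;             // defaults to MicroOpBufferSize
    unsigned Used = 0;
    std::deque<unsigned> InFlight; // per-instruction micro-op counts, in order

    bool canReserve(unsigned NumMicroOps) const {
      return Used + NumMicroOps <= Capacity;
    }
    void reserve(unsigned NumMicroOps) {
      InFlight.push_back(NumMicroOps);
      Used += NumMicroOps;
    }
    // Program order: only the oldest instruction's entries can be freed.
    void retireOldest() {
      Used -= InFlight.front();
      InFlight.pop_front();
    }
  };
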
Instructions that are dispatched to the schedulers consume scheduler buffer
entries. MCA queries the scheduling model to determine the set of
buffered resources consumed by an instruction. Buffered resources are treated
like scheduler resources.

Instruction Issue
"""""""""""""""""

Each processor scheduler implements a buffer of instructions. An instruction
has to wait in the scheduler's buffer until input register operands become
available. Only at that point does the instruction become eligible for
execution and may be issued (potentially out-of-order).

Instruction latencies are computed by MCA with the help of the scheduling
model.

MCA's scheduler is designed to simulate multiple processor schedulers. The
scheduler is responsible for tracking data dependencies, and dynamically
selecting which processor resources are consumed by instructions.

The scheduler delegates the management of processor resource units and
resource groups to a resource manager. The resource manager is responsible
for selecting resource units that are consumed by instructions. For example,
if an instruction consumes 1cy of a resource group, the resource manager
selects one of the available units from the group; by default, the resource
manager uses a round-robin selector to guarantee that resource usage is
uniformly distributed between all units of a group.
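
For illustration, a round-robin selector over the units of a group might look
like the sketch below. The type is hypothetical, and the real resource
manager also has to account for unit availability:

.. code-block:: c++

  #include <cstddef>
  #include <vector>

  struct ResourceGroup {
    std::vector<unsigned> Units; // unit identifiers; assumed non-empty
    std::size_t Next = 0;

    // Pick units in rotation so usage is uniformly distributed.
    unsigned selectUnit() {
      unsigned U = Units[Next];
      Next = (Next + 1) % Units.size();
      return U;
    }
  };
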
MCA's scheduler implements three instruction queues:

* WaitQueue: a queue of instructions whose operands are not ready.
* ReadyQueue: a queue of instructions ready to execute.
* IssuedQueue: a queue of instructions executing.

Depending on the operand availability, instructions that are dispatched to the
scheduler are either placed into the WaitQueue or into the ReadyQueue.
Every cycle, the scheduler checks if instructions can be moved from the
WaitQueue to the ReadyQueue, and if instructions from the ReadyQueue can be
issued. The algorithm prioritizes older instructions over younger
instructions.
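
The per-cycle movement between the queues could be sketched as below.
Instructions are represented by their program-order index (smaller means
older); the readiness callback and the fixed per-cycle issue limit are
simplifying assumptions, and none of the names come from MCA itself:

.. code-block:: c++

  #include <algorithm>
  #include <vector>

  struct Scheduler {
    std::vector<unsigned> WaitQueue;   // operands not ready
    std::vector<unsigned> ReadyQueue;  // ready to execute
    std::vector<unsigned> IssuedQueue; // executing

    void cycle(bool (*operandsReady)(unsigned), unsigned MaxIssue) {
      // Move instructions whose operands became available this cycle.
      for (auto It = WaitQueue.begin(); It != WaitQueue.end();) {
        if (operandsReady(*It)) {
          ReadyQueue.push_back(*It);
          It = WaitQueue.erase(It);
        } else {
          ++It;
        }
      }
      // Prioritize older instructions (lower program-order index).
      std::sort(ReadyQueue.begin(), ReadyQueue.end());
      // Issue up to MaxIssue instructions, oldest first.
      unsigned Issued = 0;
      while (!ReadyQueue.empty() && Issued < MaxIssue) {
        IssuedQueue.push_back(ReadyQueue.front());
        ReadyQueue.erase(ReadyQueue.begin());
        ++Issued;
      }
    }
  };
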
Write-Back and Retire Stage
"""""""""""""""""""""""""""

Issued instructions are moved from the ReadyQueue to the IssuedQueue. There,
instructions wait until they reach the write-back stage. At that point, they
get removed from the queue and the retire control unit is notified.

When an instruction is executed, the retire control unit flags it as "ready
to retire."

Instructions are retired in program order. The register file is notified of
the retirement so that it can free the temporary registers that were allocated
for the instruction during the register renaming stage.
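
In-order retirement might be sketched as follows; the entry layout and the
way freed registers are returned to the register file are hypothetical:

.. code-block:: c++

  #include <deque>

  struct RetireEntry {
    bool ReadyToRetire = false; // set when execution completes
    unsigned RenamedRegs = 0;   // temporary registers to release
  };

  // Only the oldest instruction may retire; stop at the first entry that
  // is still executing, so retirement happens in program order.
  unsigned retireInOrder(std::deque<RetireEntry> &RCU,
                         unsigned &FreePhysRegs) {
    unsigned Retired = 0;
    while (!RCU.empty() && RCU.front().ReadyToRetire) {
      FreePhysRegs += RCU.front().RenamedRegs; // notify the register file
      RCU.pop_front();
      ++Retired;
    }
    return Retired;
  }
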
Load/Store Unit and Memory Consistency Model
""""""""""""""""""""""""""""""""""""""""""""

To simulate an out-of-order execution of memory operations, MCA uses a
load/store unit (LSUnit) to model the speculative execution of loads and
stores.

Each load (or store) consumes an entry in the load (or store) queue. The
number of slots in the load/store queues is unknown to MCA, since there is no
mention of it in the scheduling model. In practice, users can specify the
flags ``-lqueue`` and ``-squeue`` to limit the number of entries in the load
and store queues respectively. The queues are unbounded by default.
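
That occupancy check could be sketched like this; a capacity of zero stands
for an unbounded queue, and all names are invented for the example:

.. code-block:: c++

  struct MemQueue {
    unsigned Capacity; // 0 => unbounded (the default for -lqueue/-squeue)
    unsigned Used = 0;

    bool hasRoom() const { return Capacity == 0 || Used < Capacity; }
  };

  // A load needs a load-queue entry, a store needs a store-queue entry,
  // and an instruction that does both needs both.
  bool canReserveMemOp(const MemQueue &LQ, const MemQueue &SQ,
                       bool MayLoad, bool MayStore) {
    return (!MayLoad || LQ.hasRoom()) && (!MayStore || SQ.hasRoom());
  }
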
The LSUnit implements a relaxed consistency model for memory loads and stores.
The rules are:

1. A younger load is allowed to pass an older load only if there are no
   intervening stores or barriers between the two loads.
2. A younger load is allowed to pass an older store provided that the load
   does not alias with the store.
3. A younger store is not allowed to pass an older store.
4. A younger store is not allowed to pass an older load.

By default, the LSUnit optimistically assumes that loads do not alias
(``-noalias=true``) store operations. Under this assumption, younger loads are
always allowed to pass older stores. Essentially, the LSUnit does not attempt
to run any alias analysis to predict when loads and stores do not alias with
each other.

Note that, in the case of write-combining memory, rule 3 could be relaxed to
allow reordering of non-aliasing store operations. That being said, at the
moment, there is no way to further relax the memory model (``-noalias`` is the
only option). Essentially, there is no option to specify a different memory
type (e.g., write-back, write-combining, write-through, etc.) and consequently
to weaken or strengthen the memory model.

Other limitations are:

* The LSUnit does not know when store-to-load forwarding may occur.
* The LSUnit does not know anything about cache hierarchy and memory types.
* The LSUnit does not know how to identify serializing operations and memory
  fences.

The LSUnit does not attempt to predict if a load or store hits or misses the L1
cache. It only knows if an instruction "MayLoad" and/or "MayStore." For
loads, the scheduling model provides an "optimistic" load-to-use latency (which
usually matches the load-to-use latency for when there is a hit in the L1D).

MCA does not know about serializing operations or memory-barrier-like
instructions. The LSUnit conservatively assumes that an instruction which has
both "MayLoad" and unmodeled side effects behaves like a "soft" load-barrier.
That means it serializes loads without forcing a flush of the load queue.
Similarly, instructions that "MayStore" and have unmodeled side effects are
treated like store barriers. A full memory barrier is a "MayLoad" and
"MayStore" instruction with unmodeled side effects. This is inaccurate, but it
is the best that we can do at the moment with the current information
available in LLVM.
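
That conservative classification can be written down directly; the enum and
function below are hypothetical, but the decision logic follows the rules
just described:

.. code-block:: c++

  enum class BarrierKind { None, LoadBarrier, StoreBarrier, FullBarrier };

  // Classify an instruction from the only facts available to the LSUnit:
  // whether it may load, may store, and has unmodeled side effects.
  BarrierKind classify(bool MayLoad, bool MayStore, bool HasSideEffects) {
    if (!HasSideEffects)
      return BarrierKind::None;
    if (MayLoad && MayStore)
      return BarrierKind::FullBarrier;  // full memory barrier
    if (MayLoad)
      return BarrierKind::LoadBarrier;  // "soft" load barrier
    if (MayStore)
      return BarrierKind::StoreBarrier; // store barrier
    return BarrierKind::None;
  }
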
A load/store barrier consumes one entry of the load/store queue. A load/store
barrier enforces ordering of loads/stores. A younger load cannot pass a load
barrier. Also, a younger store cannot pass a store barrier. A younger load
has to wait for the memory/load barrier to execute. A load/store barrier is
"executed" when it becomes the oldest entry in the load/store queue(s). That
also means, by construction, all of the older loads/stores have been executed.

In conclusion, the full set of load/store consistency rules is:

#. A store may not pass a previous store.
#. A store may not pass a previous load (regardless of ``-noalias``).
#. A store has to wait until an older store barrier is fully executed.
#. A load may pass a previous load.
#. A load may not pass a previous store unless ``-noalias`` is set.
#. A load has to wait until an older load barrier is fully executed.
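
Ignoring the two barrier rules (which are resolved when a barrier reaches the
head of its queue), the remaining rules reduce to a small predicate. This is
only a sketch with invented names:

.. code-block:: c++

  struct MemOp {
    bool IsStore; // otherwise, the operation is a load
  };

  // May the younger operation execute before the older one?
  bool mayPass(const MemOp &Younger, const MemOp &Older, bool NoAlias) {
    if (Younger.IsStore)
      return false;   // rules 1 and 2: stores pass neither stores nor loads
    if (Older.IsStore)
      return NoAlias; // rule 5: a load passes a store only with -noalias
    return true;      // rule 4: a load may pass a previous load
  }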