mirror of
https://github.com/capstone-engine/llvm-capstone.git
synced 2025-01-05 23:52:45 +00:00
1610 lines
62 KiB
ReStructuredText
1610 lines
62 KiB
ReStructuredText
==============================
|
|
Convergent Operation Semantics
|
|
==============================
|
|
|
|
.. contents::
|
|
:local:
|
|
:depth: 4
|
|
|
|
Overview
|
|
========
|
|
|
|
Some parallel execution environments execute threads in groups that allow
|
|
efficient communication within the group using special primitives called
|
|
*convergent* operations. The outcome of a convergent operation is sensitive to
|
|
the set of threads that executes it "together", i.e., convergently. When control
|
|
flow :ref:`diverges <convergence-and-uniformity>`, i.e. threads of the same
|
|
group follow different
|
|
paths through the CFG, not all threads of the group may be available to
|
|
participate in this communication. This is the defining characteristic that
|
|
distinguishes convergent operations from other inter-thread communication:
|
|
|
|
A convergent operation involves inter-thread communication or synchronization
|
|
that occurs outside of the memory model, where the set of threads which
|
|
participate in communication is implicitly affected by control flow.
|
|
|
|
For example, in the following GPU compute kernel, communication during the
|
|
convergent operation is expected to occur precisely among those threads of an
|
|
implementation-defined execution scope (such as workgroup or subgroup) for
|
|
which ``condition`` is true:
|
|
|
|
.. code-block:: c++
|
|
|
|
void example_kernel() {
|
|
...
|
|
if (condition)
|
|
convergent_operation();
|
|
...
|
|
}
|
|
|
|
In structured programming languages, there is often an intuitive and
|
|
unambiguous way of determining the threads that are expected to communicate.
|
|
However, this is not always the case even in structured programming languages,
|
|
and the intuition breaks down entirely in unstructured control flow. This
|
|
document describes the formal semantics in LLVM, i.e. how to determine the set
|
|
of communicating threads for convergent operations.
|
|
|
|
The definitions in this document leave many details open, such as how groups of
|
|
threads are formed in the first place. It focuses on the questions that are
|
|
relevant for deciding the correctness of generic program transforms and
|
|
convergence-related analyses such as :ref:`uniformity analysis
|
|
<convergence-and-uniformity>`.
|
|
|
|
.. _convergent_operations:
|
|
|
|
Convergent Operations
|
|
=====================
|
|
|
|
In LLVM IR, the only way to communicate between threads as described
|
|
above is by calling target-defined convergent intrinsics. Hence, only
|
|
a call-site in LLVM IR (a :ref:`call <i_call>`, :ref:`invoke
|
|
<i_invoke>`, or :ref:`callbr <i_callbr>` instruction) can result in a
|
|
convergent operation.
|
|
|
|
A function in LLVM IR is said to be *convergent* if it has the
|
|
:ref:`convergent <attr_convergent>` attribute.
|
|
|
|
A call-site in LLVM IR is said to be *convergent* if it is a direct
|
|
call to a convergent function or it has the :ref:`convergent
|
|
<attr_convergent>` attribute or a :ref:`convergencectrl operand bundle
|
|
<convergencectrl>`.
|
|
|
|
Informational notes:
|
|
|
|
A function may have to be treated as convergent if that function, or
|
|
transitively, any function called from it, contains a convergent call-site. A
|
|
frontend generating the ``convergent`` attribute should take this into account
|
|
when emitting functions and function calls. But this is not always the case:
|
|
|
|
A non-convergent function may contain convergent operations; such operations
|
|
do not directly depend on the set of threads that enter the function as a
|
|
single communicating group. Instead, these operations depend on an
|
|
implementation-defined subset of threads within the body of the function, as
|
|
shown in :ref:`opportunistic_convergence`.
|
|
|
|
Examples of Convergent Operations
|
|
========================================
|
|
|
|
(This section is informative.)
|
|
|
|
Texture sampling in a pixel shader
|
|
----------------------------------
|
|
|
|
The following stylized pixel shader samples a texture at a given set of
|
|
coordinates, using the builtin function `textureSample`. Texture sampling
|
|
requires screen-space derivatives of the coordinates to determine the level of
|
|
detail (mipmap level) of the sample. They are commonly approximated by taking
|
|
the difference between neighboring pixels, which are computed by different
|
|
threads in the same group:
|
|
|
|
.. code-block:: c++
|
|
|
|
void example_shader() {
|
|
...
|
|
color = textureSample(texture, coordinates);
|
|
if (condition) {
|
|
use(color);
|
|
}
|
|
...
|
|
}
|
|
|
|
From a purely single-threaded perspective, sinking the `textureSample` into
|
|
the if-statement appears legal. However, if the condition is false for some
|
|
neighboring pixels, then their corresponding threads will not execute together
|
|
in the group, making it impossible to take the difference of coordinates as an
|
|
approximation of the screen-space derivative. In practice, the outcome will be
|
|
an undefined value.
|
|
|
|
That is, the `textureSample` operation fits our definition of a convergent
|
|
operation:
|
|
|
|
1. It communicates with a set of threads that implicitly depends on control
|
|
flow.
|
|
2. Correctness depends on this set of threads.
|
|
|
|
The compiler frontend can emit IR that expresses the convergence constraints as
|
|
follows:
|
|
|
|
.. code-block:: llvm
|
|
|
|
define void @example_shader() convergent {
|
|
%entry = call token @llvm.experimental.convergence.entry()
|
|
...
|
|
%color = call T @textureSample(U %texture, V %coordinates) [ "convergencectrl"(token %entry) ]
|
|
br i1 %condition, label %then, label %end
|
|
|
|
then:
|
|
call void @use(T %color)
|
|
br label %end
|
|
|
|
end:
|
|
ret void
|
|
}
|
|
|
|
The :ref:`llvm.experimental.convergence.entry <llvm.experimental.convergence.entry>`
|
|
intrinsic is itself ``convergent``, and we expect it to communicate at least
|
|
among all threads of the same "quad" -- a group of 2x2 pixels that are
|
|
evaluated together for the purpose of approximating screen-space derivatives.
|
|
This fact is not part of the generic LLVM IR semantics; it would have to be
|
|
defined somewhere else, for example as part of target-specific ABI definitions
|
|
and/or in reference to some relevant API specs.
|
|
|
|
Since the ``@textureSample`` call then uses the token produced by the entry
|
|
intrinsic in its ``convergencectrl`` bundle, and has no additional control
|
|
dependencies, it must communicate among the same set of threads. This indicates
|
|
to generic program transforms that sinking the ``@textureSample`` call is
|
|
forbidden. (A program transform can still sink the call if it can prove somehow,
|
|
e.g. by leaning on target-specific callbacks that can analyze the program with
|
|
additional knowledge, that ``%condition`` is always uniform across the threads
|
|
referenced by the *convergence token* ``%entry``.)
|
|
|
|
.. _convergence_example_reductions:
|
|
|
|
Reductions inside divergent control flow
|
|
----------------------------------------
|
|
|
|
The following example shows that merging common code of branches can be
|
|
incorrect in the face of convergent operations:
|
|
|
|
.. code-block:: c++
|
|
|
|
void example_kernel() {
|
|
delta = ...
|
|
if (delta > 0) {
|
|
total_gains = subgroupAdd(delta);
|
|
...
|
|
} else {
|
|
total_losses = subgroupAdd(delta);
|
|
...
|
|
}
|
|
}
|
|
|
|
The ``subgroupAdd`` computing the ``total_gains`` will be executed by the
|
|
subset of threads with positive ``delta`` in a subgroup (wave), and so will sum
|
|
up all the ``delta`` values of those threads; and similarly for the
|
|
``subgroupAdd`` that computes the ``total_losses``.
|
|
|
|
If we were to hoist and merge the ``subgroupAdd`` above the if-statement, it
|
|
would sum up the ``delta`` across *all* threads instead.
|
|
|
|
The compiler frontend can emit IR that expresses the convergence constraints
|
|
as follows:
|
|
|
|
.. code-block:: llvm
|
|
|
|
define void @example_kernel() convergent {
|
|
%entry = call token @llvm.experimental.convergence.entry()
|
|
%delta = ...
|
|
%cc = icmp sgt i32 %delta, 0
|
|
br i1 %cc, label %then, label %else
|
|
|
|
then:
|
|
%total_gains = call i32 @subgroupAdd(i32 %delta) [ "convergencectrl"(token %entry) ]
|
|
...
|
|
br label %end
|
|
|
|
else:
|
|
%total_losses = call i32 @subgroupAdd(i32 %delta) [ "convergencectrl"(token %entry) ]
|
|
...
|
|
br label %end
|
|
|
|
end:
|
|
...
|
|
}
|
|
|
|
The entry intrinsic behaves like in the previous example: assuming that
|
|
``@example_kernel`` is an OpenCL kernel (as hinted at by the "subgroup"
|
|
terminology), we expect it to communicate among all threads within the
|
|
"subgroup". This typically maps to a SIMD vector on GPU hardware.
|
|
|
|
The calls to ``@subgroupAdd`` use the token produced by the entry intrinsic,
|
|
but they also have an additional control dependency. According to the rules
|
|
defined in this document, they only communicate among the subset of threads
|
|
that actually end up executing the respective (static) call site.
|
|
|
|
Hoisting them would remove the control dependency and cause them to communicate
|
|
among the full set of threads that the entry intrinsic communicated with.
|
|
Again, hoisting is allowed if it can be proven that ``%cc`` is always uniform
|
|
among the relevant set of threads: in that case, the ``@subgroupAdd`` already
|
|
communicates among the full set of threads in the original program.
|
|
|
|
Motivating Examples of Convergence Control
|
|
==========================================
|
|
|
|
(This section is informative.)
|
|
|
|
Unstructured control flow
|
|
-------------------------
|
|
|
|
Consider an example of how jump threading removes structure in a way that can
|
|
make semantics non-obvious without the convergence intrinsics described in this
|
|
document:
|
|
|
|
.. code-block:: llvm
|
|
|
|
void example_original() {
|
|
entry:
|
|
...
|
|
br i1 %cond1, label %then1, label %mid
|
|
|
|
then1:
|
|
...
|
|
%cond2 = ...
|
|
br label %mid
|
|
|
|
mid:
|
|
%flag = phi i1 [ true, %entry ], [ %cond2, %then1 ]
|
|
br i1 %flag, label %then2, label %end
|
|
|
|
then2:
|
|
...
|
|
call void @subgroupControlBarrier()
|
|
...
|
|
br label %end
|
|
|
|
end:
|
|
}
|
|
|
|
void example_jumpthreaded() {
|
|
entry:
|
|
...
|
|
br i1 %cond1, label %then1, label %then2
|
|
|
|
then1:
|
|
...
|
|
%cond2 = ...
|
|
br i1 %cond2, label %then2, label %end
|
|
|
|
then2:
|
|
...
|
|
call void @subgroupControlBarrier()
|
|
...
|
|
br label %end
|
|
|
|
end:
|
|
}
|
|
|
|
Is the control barrier guaranteed to synchronize among the same set of threads
|
|
in both cases? Different implementations in the literature may give different
|
|
answers to this question:
|
|
|
|
* In an implementation that reconverges at post-dominators, threads reconverge
|
|
at ``mid`` in the first version, so that all threads (within a subgroup/wave)
|
|
that execute the control barrier do so together. In the second version,
|
|
threads that reach the control barrier via different paths synchronize
|
|
separately: the first (and only) post-dominator is ``end``, so threads do not
|
|
reconverge before then.
|
|
|
|
* An implementation that sorts basic blocks topologically and ensures maximal
|
|
reconvergence for each basic block would behave the same way in both
|
|
versions.
|
|
|
|
We generally take the stance that reconvergence in acyclic control flow must
|
|
be maximal. The compiler frontend could augment the original code as follows:
|
|
|
|
.. code-block:: llvm
|
|
|
|
define void @example_original() convergent {
|
|
entry:
|
|
%entry = call token @llvm.experimental.convergence.entry()
|
|
...
|
|
br i1 %cond1, label %then1, label %mid
|
|
|
|
then1:
|
|
...
|
|
%cond2 = ...
|
|
br label %mid
|
|
|
|
mid:
|
|
%flag = phi i1 [ true, %entry ], [ %cond2, %then1 ]
|
|
br i1 %flag, label %then2, label %end
|
|
|
|
then2:
|
|
...
|
|
call void @subgroupControlBarrier() [ "convergencectrl"(token %entry) ]
|
|
...
|
|
br label %end
|
|
|
|
end:
|
|
}
|
|
|
|
If S is the set of threads that the entry intrinsic communicated with, then
|
|
the ``@subgroupControlBarrier`` call communicates with the subset of S that
|
|
actually reaches the call site. This set of threads doesn't change after
|
|
jump-threading, so the answer to the question posed above remains the same.
|
|
|
|
.. _opportunistic_convergence:
|
|
|
|
Opportunistic convergent operations
|
|
-----------------------------------
|
|
|
|
Some programs have local regions of code that contain a sequence of convergent
|
|
operations where the code does not care about the exact set of threads with
|
|
which it is executed, but only that the set of threads is the same for all the
|
|
operations within the sequence. (If a subset of the convergent operations in the
|
|
sequence have additional, non-uniform control dependencies, then this is not
|
|
possible. However, the code may still require that the sets of threads are
|
|
logically consistent with the conditions of those control dependencies.) In this
|
|
case, :ref:`llvm.experimental.convergence.anchor
|
|
<llvm.experimental.convergence.anchor>` can be used to express the desired
|
|
semantics.
|
|
|
|
The following example function could be part of a hypothetical "append buffer"
|
|
implementation, where threads conditionally write fixed-sized records
|
|
contiguously into a global buffer. The function ``@reserveSpaceInBuffer``
|
|
returns the index into the buffer at which the calling thread should store its
|
|
data.
|
|
|
|
This could be achieved by using a simple atomic operation in every thread to
|
|
bump an allocation counter.
|
|
|
|
However, the following implementation can be more performant on some hardware,
|
|
because it uses only a single atomic operation for an entire group of threads.
|
|
To do this, it first determines the total size of the group, which will be the
|
|
operand to the atomic operation, and then later broadcasts the result of the
|
|
atomic operation to all threads of the group, so that each thread can compute
|
|
its individual position in the buffer:
|
|
|
|
.. code-block:: llvm
|
|
|
|
define i32 @reserveSpaceInBuffer() { ; NOTE: _not_ a convergent function!
|
|
entry:
|
|
%anchor = call token @llvm.experimental.convergence.anchor()
|
|
|
|
%ballot = call i64 @subgroupBallot(i1 true) [ "convergencectrl"(token %anchor) ]
|
|
%numThreads.p = call i64 @llvm.ctpop.i64(i64 %ballot)
|
|
%numThreads = trunc i64 %numThreads.p to i32
|
|
|
|
%absoluteThreadIdx = call i32 @getSubgroupLocalInvocationId()
|
|
%absoluteThreadIdx.ext = zext i32 %absoluteThreadIdx to i64
|
|
%mask.p = shl i64 1, %absoluteThreadIdx.ext
|
|
%mask = sub i64 %mask.p, 1
|
|
|
|
%maskedBallot = and i64 %ballot, %mask
|
|
%relativeThreadIdx.p = call i64 @llvm.ctpop.i64(i64 %maskedBallot)
|
|
%relativeThreadIdx = trunc i64 %relativeThreadIdx.p to i32
|
|
|
|
%isFirstThread = icmp eq i32 %relativeThreadIdx, 0
|
|
br i1 %isFirstThread, label %then, label %end
|
|
|
|
then:
|
|
%baseOffset.1 = atomicrmw add ptr @bufferAllocationCount, i32 %numThreads monotonic
|
|
br label %end
|
|
|
|
end:
|
|
%baseOffset.2 = phi i32 [ undef, %entry ], [ %baseOffset.1, %then ]
|
|
%baseOffset = call i32 @subgroupBroadcastFirst(i32 %baseOffset.2) [ "convergencectrl"(token %anchor) ]
|
|
%offset = add i32 %baseOffset, %relativeThreadIdx
|
|
ret i32 %offset
|
|
}
|
|
|
|
The key here is that the function really doesn't care which set of threads it
|
|
is being called with. It takes whatever set of threads it can get. What the
|
|
implementation of the function cares about is that the initial
|
|
``@subgroupBallot`` -- which is used to retrieve the bitmask of threads that
|
|
executed the anchor together -- executes with the same set of threads as the
|
|
final ``@subgroupBroadcastFirst``. Nothing else is required for correctness as
|
|
far as convergence is concerned.
|
|
|
|
The function ``@reserveSpaceInBuffer`` itself is _not_ ``convergent``: callers
|
|
are free to move call sites of the function as they see fit. This can change
|
|
the behavior in practice, by changing the sets of threads that are grouped
|
|
together for the atomic operation. This can be visible in the output of the
|
|
program, since the order in which outputs appear in the buffer is changed.
|
|
However, this does not break the overall contract that ``@reserveSpaceInBuffer``
|
|
has with its caller -- which makes sense: the order of outputs is
|
|
non-deterministic anyway because of the atomic operation that is involved.
|
|
|
|
If the function is inlined, the use of the anchor intrinsic similarly indicates
|
|
that certain transforms which are usually forbidden by the presence of
|
|
convergent operations are in fact allowed, as long as they don't break up the
|
|
region of code that is controlled by the anchor.
|
|
|
|
.. _convergence_high-level_break:
|
|
|
|
Extended Cycles: Divergent Exit from a Loop
|
|
-------------------------------------------
|
|
|
|
High-level languages typically provide a ``break`` statement that transfers
|
|
control out of a loop statement. In most cases, the loop is structured and hence
|
|
there is no ambiguity about convergence inside the loop. But an ambiguity arises
|
|
when a ``break`` is control dependent on a divergent condition inside the loop.
|
|
Consider the following example:
|
|
|
|
.. code-block:: c++
|
|
|
|
void example() {
|
|
// A
|
|
...
|
|
for (...) {
|
|
// B
|
|
if (condition) { // divergent condition
|
|
// C
|
|
convergent_op();
|
|
break;
|
|
}
|
|
// D
|
|
...
|
|
}
|
|
// E
|
|
}
|
|
|
|
In this program, the call to convergent_op() is lexically "inside" the ``for``
|
|
loop. But when translated to LLVM IR, the basic block B is an exiting block
|
|
ending in a divergent branch, and the basic block C is an exit of the loop.
|
|
Thus, the call to convergent_op() is outside the loop. This causes a mismatch
|
|
between the programmer's expectation and the compiled program. The call should
|
|
be executed convergently on every iteration of the loop, by threads that
|
|
together take the branch to exit the loop. But when compiled, all threads that
|
|
take the divergent exit on different iterations first converge at the beginning
|
|
of basic block C and then together execute the call to convergent_op().
|
|
|
|
In this case, :ref:`llvm.experimental.convergence.loop
|
|
<llvm.experimental.convergence.loop>` can be used to express the desired
|
|
semantics. A call to this intrinsic is placed in the loop header, which tracks
|
|
each iteration of the loop. The token produced by this is used as a
|
|
``convergencectrl`` operand to the convergent call. The semantics of the
|
|
``loop`` intrinsic ensures that the convergent call is performed convergently
|
|
only by those threads that convergently exited the loop in a given iteration.
|
|
|
|
.. code-block:: llvm
|
|
|
|
define void @example() convergent {
|
|
%entry = call token @llvm.experimental.convergence.entry()
|
|
br label %for
|
|
|
|
for:
|
|
%inner = call token @llvm.experimental.convergence.loop() ["convergencectrl"(token %entry)]
|
|
%for.cond = i1 ...
|
|
br i1 %for.cond, label %B, label %E
|
|
|
|
B:
|
|
...
|
|
%condition = i1 ...
|
|
br i1 %condition, label %C, label %D
|
|
|
|
C:
|
|
call void @convergent_op() ["convergencectrl"(token %inner)]
|
|
br label %E
|
|
|
|
D:
|
|
...
|
|
br label %for
|
|
|
|
E:
|
|
...
|
|
ret void
|
|
}
|
|
|
|
The LLVM IR version of the same program shows a cycle consisting of the basic
|
|
blocks ``%for``, ``%B`` and ``%D``, while ``%C`` is an exit reached by the
|
|
divergent branch at the end of the exiting block ``%B``. But the use of
|
|
convergence control tokens makes it clear that block ``%C`` must be executed
|
|
convergently only by those threads that convergently take the exit edge from %B
|
|
to ``%C``. In other words, the convergent execution of ``%C`` is governed by the
|
|
call to the :ref:`llvm.experimental.convergence.loop
|
|
<llvm.experimental.convergence.loop>` intrinsic inside the cycle. The cycle is
|
|
effectively extended to include all uses of this token that lie outside the
|
|
cycle.
|
|
|
|
.. _dynamic_instances_and_convergence_tokens:
|
|
|
|
Dynamic Instances and Convergence Tokens
|
|
========================================
|
|
|
|
Every execution of an LLVM IR instruction occurs in a :ref:`dynamic instance
|
|
<convergence-dynamic-instances>` of the instruction. Dynamic instances are the
|
|
formal objects by which we talk about communicating threads in convergent
|
|
operations. Dynamic instances are defined for *all* operations in an LLVM
|
|
program, whether convergent or not. Convergence control is primarily about the
|
|
dynamic instances of convergent operations since they affect execution of the
|
|
program through inter-thread communication. The dynamic instances for
|
|
non-convergent operations are relevant for determining :ref:`uniformity
|
|
<convergence-and-uniformity>` of values.
|
|
|
|
Dynamic instances produced by the execution of the same *convergent operation*
|
|
by different threads may be :ref:`converged <convergence-definition>`. When
|
|
executing a convergent operation, the set of threads that execute converged
|
|
dynamic instances is the set of threads that communicate with each other.
|
|
*Convergence tokens* capture this convergence as described below.
|
|
|
|
*Convergence tokens* are values of ``token`` type, i.e. they cannot be used in
|
|
``phi`` or ``select`` instructions. A convergence token value represents the
|
|
dynamic instance of the instruction that produced it.
|
|
|
|
Convergent operations may have an optional ``convergencectrl`` operand bundle with
|
|
a convergence token operand to define the set of communicating threads relative
|
|
to the operation that defined the token.
|
|
|
|
Let ``U`` be a convergent operation other than a call to a convergence
|
|
control intrinsic, and ``D`` be the convergent operation that defines
|
|
the token value used as the ``convergencectrl`` operand to ``U``. Two
|
|
threads execute converged dynamic instances of ``U`` if and only if the
|
|
token value in both threads was returned by converged dynamic
|
|
instances of ``D``.
|
|
|
|
.. note::
|
|
|
|
The text defines convergence token values as representing dynamic instances.
|
|
But if we were to assume that converged dynamic instances produce the same
|
|
token value, then we could almost think of the token value as representing a
|
|
set of threads instead -- specifically, the set ``S`` of threads that
|
|
executed converged dynamic instances of the defining instruction ``D``.
|
|
|
|
In this intuitive picture, when a convergence token value ``T`` is used by a
|
|
``convergencectrl`` bundle on an instruction ``I``, then the set of threads that
|
|
communicates in ``I`` is a subset of the set ``S`` represented by the token value.
|
|
Specifically, it is the subset of threads that ends up executing ``I`` while
|
|
using the token value.
|
|
|
|
This by itself wouldn't quite work as a definition: what if ``I`` is executed
|
|
multiple times by the same threads? Which execution of ``I`` in thread 1
|
|
communicates with which execution of ``I`` in thread 2? Leaning on the notion
|
|
of dynamic instances gives a robust answer to this question as long as ``D``
|
|
and ``I`` are at the same loop (or cycle) nesting level.
|
|
|
|
The case where ``D`` and ``I`` are at different loop nesting levels is
|
|
forbidden by the :ref:`static rules <convergence_static_rules>` -- handling
|
|
that case is the purpose of :ref:`llvm.experimental.convergence.loop
|
|
<llvm.experimental.convergence.loop>`.
|
|
|
|
.. _convergence_control_intrinsics:
|
|
|
|
Convergence Control Intrinsics
|
|
==============================
|
|
|
|
This section describes target-independent intrinsics that can be used to
|
|
produce convergence tokens.
|
|
|
|
Behaviour is undefined if a convergence control intrinsic is called
|
|
indirectly.
|
|
|
|
.. _llvm.experimental.convergence.entry:
|
|
|
|
``llvm.experimental.convergence.entry``
|
|
----------------------------------------
|
|
|
|
.. code-block:: llvm
|
|
|
|
token @llvm.experimental.convergence.entry() convergent readnone
|
|
|
|
This intrinsic is used to tie the dynamic instances inside of a function to
|
|
those in the caller.
|
|
|
|
1. If the function is called from outside the scope of LLVM, the convergence of
|
|
dynamic instances of this intrinsic are environment-defined. For example:
|
|
|
|
a. In an OpenCL *kernel launch*, the maximal set of threads that
|
|
can communicate outside the memory model is a *workgroup*.
|
|
Hence, a suitable choice is to specify that all the threads from
|
|
a single workgroup in OpenCL execute converged dynamic instances
|
|
of this intrinsic.
|
|
b. In a C/C++ program, threads are launched independently and they can
|
|
communicate only through the memory model. Hence the dynamic instances of
|
|
this intrinsic in a C/C++ program are never converged.
|
|
2. If the function is called from a call-site in LLVM IR, then two
|
|
threads execute converged dynamic instances of this intrinsic if and
|
|
only if both threads entered the function by executing converged
|
|
dynamic instances of the call-site.
|
|
|
|
This intrinsic can occur at most once in a function, and only in the entry
|
|
block of the function. If this intrinsic occurs in a basic block, then it must
|
|
precede any other convergent operation in the same basic block.
|
|
|
|
It is an error if this intrinsic appears in a non-convergent function.
|
|
|
|
It is an error to specify a ``convergencectrl`` operand bundle at a
|
|
call to this intrinsic.
|
|
|
|
Function inlining substitutes this intrinsic with the token from the operand
|
|
bundle. For example:
|
|
|
|
.. code-block:: c++
|
|
|
|
// Before inlining:
|
|
|
|
void callee() convergent {
|
|
%tok = call token @llvm.experimental.convergence.entry()
|
|
convergent_operation(...) [ "convergencectrl"(token %tok) ]
|
|
}
|
|
|
|
void main() {
|
|
%outer = call token @llvm.experimental.convergence.anchor()
|
|
for (...) {
|
|
%inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
|
|
callee() [ "convergencectrl"(token %inner) ]
|
|
}
|
|
}
|
|
|
|
// After inlining:
|
|
|
|
void main() {
|
|
%outer = call token @llvm.experimental.convergence.anchor()
|
|
for (...) {
|
|
%inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
|
|
convergent_operation(...) [ "convergencectrl"(token %inner) ]
|
|
}
|
|
}
|
|
|
|
.. _llvm.experimental.convergence.loop:
|
|
|
|
``llvm.experimental.convergence.loop``
|
|
--------------------------------------
|
|
|
|
.. code-block:: llvm
|
|
|
|
token @llvm.experimental.convergence.loop() [ "convergencectrl"(token) ] convergent readnone
|
|
|
|
This intrinsic represents the place where an imaginary counter is incremented
|
|
for determining convergence inside a control flow cycle.
|
|
|
|
Let ``U`` be a call to this intrinsic and ``D`` be the convergent operation that
|
|
defines the token value used as the ``convergencectrl`` operand to ``U``. Two
|
|
threads execute converged dynamic instances of ``U`` if and only if:
|
|
|
|
1. The token value in both threads was returned by converged dynamic
|
|
instances of ``D``, and,
|
|
2. There is an integer *n* such that both threads execute ``U`` for the *n*'th time
|
|
with that token value.
|
|
|
|
It is an error to omit the ``convergencectrl`` operand bundle on a
|
|
call to this intrinsic.
|
|
|
|
If this intrinsic occurs in a basic block, then it must precede any other
|
|
convergent operation in the same basic block.
|
|
|
|
.. _convergence_cycle_heart:
|
|
|
|
**Heart of a Cycle:**
|
|
|
|
If a :ref:`cycle <cycle-terminology>` ``C`` contains an occurrence ``H`` of
|
|
this intrinsic whose token operand is defined outside ``C``, then ``H`` is
|
|
called the heart of ``C``.
|
|
|
|
.. note::
|
|
|
|
The static rules for cycles imply that a heart can occur only in the header
|
|
of a natural loop. This ensures that the heart closely represents the
|
|
intuitive notion of a loop iteration. If this restriction is relaxed, the
|
|
resulting semantics provides a new notion of "cycle iteration" even for
|
|
irreducible cycles. But this allows a natural loop to have a heart in a
|
|
node other than its header, which has interesting consequences on the
|
|
meaning of a loop iteration in terms of convergence. For now, we disallow
|
|
this situation since its practical application is very rare.
|
|
|
|
.. _llvm.experimental.convergence.anchor:
|
|
|
|
``llvm.experimental.convergence.anchor``
|
|
----------------------------------------
|
|
|
|
.. code-block:: llvm
|
|
|
|
token @llvm.experimental.convergence.anchor() convergent readnone
|
|
|
|
This intrinsic produces an initial convergence token that is independent from
|
|
any "outer scope". The set of threads executing converged dynamic instances of
|
|
this intrinsic is implementation-defined.
|
|
|
|
It is an error to pass a ``convergencectrl`` operand bundle at a
|
|
call to this intrinsic.
|
|
|
|
.. note::
|
|
|
|
The expectation is that all threads within a group that "happen to be active
|
|
at the same time" will execute converged dynamic instances, so that programs
|
|
can detect the maximal set of threads that can communicate efficiently within
|
|
some local region of the program.
|
|
|
|
.. _convergence_uncontrolled:
|
|
|
|
Uncontrolled Convergent Operations
|
|
==================================
|
|
|
|
Convergent operations with an explicit ``convergencectrl`` operand bundle are
|
|
called *controlled convergent operations*. All other convergent operations are
|
|
said to be *uncontrolled*.
|
|
|
|
An uncontrolled convergent operation is said to have *implicit convergence
|
|
control* determined by the ``convergent`` attribute alone. The semantics of the
|
|
``convergent`` attribute as implemented in LLVM differs from the documented
|
|
semantics. The implementation tries to follow common intuition about convergent
|
|
operations, which remains under-specified. As such, it is not possible to fully
|
|
translate implicit convergence control into explicit convergence control tokens,
|
|
and these two modes cannot be mixed in the same function.
|
|
|
|
If a function contains a controlled convergent operation, then all convergent
|
|
operations in that function must either be controlled operations or calls to
|
|
the convergence control intrinsics.
|
|
|
|
Inferring Tokens
|
|
----------------
|
|
|
|
(This section is informational)
|
|
|
|
Sometimes, it may be necessary to reinterpret the implicit convergence control
|
|
in terms of explicit convergence control tokens. For example, this may happen
|
|
when a function call is inlined, and either the caller or the callee contains
|
|
uncontrolled convergent operations.
|
|
|
|
Some uses of uncontrolled convergent operations may need to satisfy the
|
|
following property:
|
|
|
|
For an environment-defined group of threads (such as an OpenCL workgroup or
|
|
subgroup), if one thread in the group executes a convergent operation, then
|
|
all threads in the group do so convergently with that thread.
|
|
|
|
In terms of explicit convergence control, this means that the
|
|
``convergencectrl`` operand on each convergent operation ``X`` must ultimately
|
|
originate from a call to the :ref:`llvm.experimental.convergence.entry
|
|
<llvm.experimental.convergence.entry>` intrinsic. This preserves the possibility
|
|
that the group of threads that converge on reaching ``X`` is the same group that
|
|
originally started executing the program in convergence. In comparison, the
|
|
:ref:`llvm.experimental.convergence.anchor
|
|
<llvm.experimental.convergence.anchor>` intrinsic captures an
|
|
implementation-defined group of threads, which is insufficient to support the
|
|
above property.
|
|
|
|
One way to approximate implicit convergence control in terms of explicit
|
|
convergence control tokens is the following procedure, which preserves the above
|
|
mentioned property:
|
|
|
|
1. Convert every irreducible cycle into a reducible cycle.
|
|
2. Insert a call to :ref:`llvm.experimental.convergence.entry
|
|
<llvm.experimental.convergence.entry>` at the start of the entry block of the
|
|
function.
|
|
3. Insert a call to :ref:`llvm.experimental.convergence.loop
|
|
<llvm.experimental.convergence.loop>` at the start of every loop header. If
|
|
this loop is an outermost loop, the ``convergencectrl`` operand is the call
|
|
to :ref:`llvm.experimental.convergence.entry
|
|
<llvm.experimental.convergence.entry>` in the entry block of the function.
|
|
Otherwise, the ``convergencectrl`` operand is the call to
|
|
:ref:`llvm.experimental.convergence.loop
|
|
<llvm.experimental.convergence.loop>` in the parent loop's header.
|
|
4. For each uncontrolled convergent operation ``X``, add a ``convergencectrl``
|
|
operand bundle using the token defined by a definition ``D`` that is a
|
|
:ref:`sibling <cycle-sibling>` to this operation. ``D`` always dominates
|
|
``X`` --- if ``X`` is not in any cycle, then ``D`` is a call to
|
|
:ref:`llvm.experimental.convergence.entry
|
|
<llvm.experimental.convergence.entry>`; otherwise ``D`` is the heart of the
|
|
parent cycle of ``X``.
|
|
|
|
.. _convergence_static_rules:
|
|
|
|
Static Rules
|
|
============
|
|
|
|
A *well-formed* program in LLVM IR must satisfy the following static
|
|
rules about cycles and convergence regions.
|
|
|
|
Closed Paths
|
|
------------
|
|
|
|
A :ref:`closed path <cycle-closed-path>` in a CFG is a connected sequence of
|
|
nodes and edges in the CFG whose start and end points are the same.
|
|
|
|
1. Every closed path in the CFG that contains a use of a convergence token T other
|
|
than a use by
|
|
:ref:`llvm.experimental.convergence.loop <llvm.experimental.convergence.loop>`
|
|
must also contain the definition of T.
|
|
|
|
2. Every closed path in the CFG that contains two different uses of a convergence
|
|
token T must also contain the definition of T.
|
|
|
|
3. Every closed path in the CFG that contains uses of two different convergence tokens
|
|
T1 and T2 must also contain the definition of at least one of them.
|
|
|
|
Taken together, these rules imply that for every closed path C, there can be at most
|
|
one convergence token T which is used in C but defined outside of it, and that
|
|
T can be used only once in C, and only by ``llvm.experimental.convergence.loop``.
|
|
|
|
4. In every closed path that contains a use U of a token T but not the
|
|
definition of T, U must dominate all nodes in the closed path.
|
|
|
|
This implies that ``llvm.experimental.convergence.loop`` can appear as a heart
|
|
only in the header of a natural loop.
|
|
|
|
**Sufficient Conditions:** From the :ref:`properties of cycles
|
|
<cycle-closed-path>`, it is sufficient to prove the above properties
|
|
for cycles instead of closed paths. Briefly, any closed path that violates
|
|
one or more of the above static rules is contained in a cycle that also
|
|
violates the same rule(s).
|
|
|
|
.. _convergence_region:
|
|
|
|
Convergence Regions
|
|
-------------------
|
|
|
|
The *convergence region* of a convergence token T is the minimal region in
|
|
which T is live and used, i.e., the set of program points dominated by the
|
|
definition D of T from which a use of T can be reached.
|
|
|
|
The following static rule about convergence regions must be satisfied by
|
|
valid programs:
|
|
|
|
If a convergence region R for a token T1 contains a use of a convergence
|
|
token T2, then R must also contain the definition of T2. (In other words,
|
|
convergence regions must be reasonably nested.)
|
|
|
|
.. note::
|
|
|
|
For brevity, this document uses the term "convergence region of a token
|
|
definition ``D``" to actually refer to the convergence region of the token
|
|
``T`` defined by ``D``.
|
|
|
|
.. _inferring_noconvergent:
|
|
|
|
Inferring non-convergence
|
|
=========================
|
|
|
|
When the target or the environment guarantees that threads do not
|
|
communicate using convergent operations or that threads never diverge,
|
|
the dynamic instances in the program are irrelevant and an optimizer
|
|
may remove any occurrence of the ``convergent`` attribute on a
|
|
call-site or a function and any explicit ``convergencectrl`` operand
|
|
bundle at a call-site.
|
|
|
|
An optimizer may remove the ``convergent`` attribute and any explicit
|
|
``convergencectrl`` operand bundle from a call-site if it can prove
|
|
that the execution of this call-site always results in a call to a
|
|
non-convergent function.
|
|
|
|
An optimizer may remove the ``convergent`` attribute on a function if it can
|
|
prove that the function does not contain a call to
|
|
:ref:`llvm.experimental.convergence.entry
|
|
<llvm.experimental.convergence.entry>`, or any uncontrolled convergent
|
|
operations.
|
|
|
|
Memory Model Non-Interaction
|
|
============================
|
|
|
|
The fact that an operation is convergent has no effect on how it is treated for
|
|
memory model purposes. In particular, an operation that is ``convergent`` and
|
|
``readnone`` does not introduce additional ordering constraints as far as the
|
|
memory model is concerned. There is no implied barrier, neither in the memory
|
|
barrier sense nor in the control barrier sense of synchronizing the execution
|
|
of threads.
|
|
|
|
Informational note: Threads that execute converged dynamic instances do not
|
|
necessarily do so at the same time.
|
|
|
|
|
|
Other Interactions
|
|
==================
|
|
|
|
A function can be both ``convergent`` and
|
|
``speculatable``, indicating that the function does not have undefined
|
|
behavior and has no effects besides calculating its result, but is still
|
|
affected by the set of threads executing this function. This typically
|
|
prevents speculation of calls to the function unless the constraint imposed
|
|
by ``convergent`` is further relaxed by some other means.
|
|
|
|
Controlled Maximal Convergence
|
|
==============================
|
|
|
|
The :ref:`converged-with relation <convergence-definition>` over dynamic
|
|
instances of each controlled convergent operation is completely defined by the
|
|
semantics of convergence tokens. But the implementation-defined convergence at a
|
|
call to :ref:`llvm.experimental.convergence.anchor
|
|
<llvm.experimental.convergence.anchor>` also depends on the cycle hierarchy
|
|
chosen if it occurs inside an irreducible cycle.
|
|
|
|
When the token defined by a convergent operation ``D`` is used at another
|
|
convergent operation ``U``, the implementation must ensure that the threads that
|
|
converge at ``U`` are all the threads that reached ``U`` after converging at
|
|
``D``. On most implementations, it is reasonable to assume that only these
|
|
threads are converged at every node they reach on any path from ``D`` to ``U``.
|
|
In other words, the converged-with relation at ``D`` produces groups of threads
|
|
that can converge only within each group, while inside the convergence region of
|
|
``D``.
|
|
|
|
All this affects the :ref:`maximal converged-with relation
|
|
<convergence-maximal>` over dynamic instances and in turn the :ref:`m-converged
|
|
property <uniformity-analysis>` of static instances in the convergence region of
|
|
``D``.
|
|
|
|
.. _controlled_maximal_converged_with:
|
|
|
|
**Controlled Maximal converged-with Relation**
|
|
|
|
1. Dynamic instances of a *convergent operation* are related in the controlled
|
|
maximal converged-with relation according to the semantics of the convergence
|
|
control tokens.
|
|
2. Dynamic instances ``X1`` and ``X2`` produced by different threads for the
|
|
same *non-convergent operation* ``X`` are related in the controlled maximal
|
|
converged-with relation if and only if:
|
|
|
|
1. Both threads executed converged dynamic instances of every token
|
|
definition ``D`` such that ``X`` is in the convergence region of ``D``,
|
|
and,
|
|
2. For every cycle ``C`` with header ``H`` that contains ``X``:
|
|
|
|
- every dynamic instance ``H1`` of ``H`` that precedes ``X1`` in the
|
|
respective thread is convergence-before ``X2``, and,
|
|
- every dynamic instance ``H2`` of ``H`` that precedes ``X2`` in the
|
|
respective thread is convergence-before ``X1``,
|
|
- without assuming that ``X1`` is converged with ``X2``.
|
|
|
|
.. _controlled_m_converged:
|
|
|
|
**Controlled m-converged Static Instances**
|
|
|
|
A node ``X`` in a given CFG is reported to be m-converged if and only if:
|
|
|
|
1. For any token definition ``D`` such that ``X`` is inside the convergence region
|
|
of ``D``, ``D`` itself is m-converged, and,
|
|
2. Every cycle that contains ``X`` satisfies the following necessary
|
|
conditions:
|
|
|
|
a. Every divergent branch inside the cycle satisfies the :ref:`diverged
|
|
entry criterion<convergence-diverged-entry>`, and,
|
|
b. There are no :ref:`diverged paths reaching the
|
|
cycle<convergence-diverged-outside>` from a divergent branch outside it.
|
|
|
|
Temporal Divergence at Cycle Exit
|
|
---------------------------------
|
|
|
|
When a cycle has a divergent exit, maximal convergence assumes that all threads
|
|
converge at the exit block. But if a controlled convergent operation outside the
|
|
cycle uses a token defined by an operation ``D`` inside the cycle, the
|
|
convergence region of ``D`` now extends outside the cycle. If two threads
|
|
executed converged dynamic instances of ``D`` before exiting the cycle, then
|
|
they continue to execute converged dynamic instances of nodes in the convergence
|
|
region of ``D`` outside the cycle. Thus, for a value ``V`` defined inside the
|
|
cycle, any use ``U`` of ``V`` within the convergence region of ``T`` uses the
|
|
output of converged dynamic instances of ``V``. If ``V`` is uniform, then its
|
|
use at such a ``U`` is also uniform. In other words, temporal divergence applies
|
|
only to a use of ``V`` that is outside the convergence region of ``D``.
|
|
|
|
Rationales for Static rules about cycles
|
|
========================================
|
|
|
|
(This section is informative.)
|
|
|
|
.. note::
|
|
|
|
For convenience, we use the operator ``==`` to represent the relation
|
|
``converged-with`` and the operator ``!=`` to represent its negation.
|
|
|
|
Consider a loop with (incorrect!) convergence control as in the following
|
|
pseudocode:
|
|
|
|
.. code-block:: llvm
|
|
|
|
; WARNING: Example of incorrect convergence control!
|
|
|
|
%anchor = call token @llvm.experimental.convergence.anchor()
|
|
for (;;) {
|
|
...
|
|
call void @convergent.op() [ "convergencectrl"(token %anchor) ]
|
|
...
|
|
}
|
|
|
|
This code is forbidden by the first static rule about cycles.
|
|
|
|
A first formal argument why we have to do this is that the dynamic rule for
|
|
deciding whether two threads execute converged dynamic instances of
|
|
``@convergent.op`` leads to a logical contradiction in this code.
|
|
Assume two threads execute converged dynamic instances of the anchor
|
|
followed by two iterations of the loop. Thread 1 executes dynamic instances
|
|
I1 and I2 of ``@convergent.op``, thread 2 executes dynamic instances J1 and J2.
|
|
Using all the rules, we can deduce:
|
|
|
|
1. ``I1 != I2`` and ``J1 != J2`` by the basic rules of dynamic instances.
|
|
|
|
2. ``I1 == J1`` by the first dynamic rule about controlled convergent
|
|
operations: both threads execute the same static instruction while using
|
|
a convergence token value produced by converged dynamic instances of an
|
|
instruction (the anchor).
|
|
|
|
3. ``I1 == J2`` by the same argument. Also, ``I2 == J1`` and ``I2 == J2``.
|
|
|
|
The fact that one may be *intuitively* tempted to think of ``I1`` and ``J2``
|
|
as being executed in different loop iterations is completely irrelevant for
|
|
the *formal* argument. There is no mechanism in LLVM IR semantics for
|
|
forming associations between loop iterations in different threads, *except*
|
|
for the rules defined in this document -- and the rules in this document
|
|
require a loop heart intrinsic for talking about loop iterations.
|
|
|
|
4. By transitivity, we have ``I1 == I2`` and ``J1 == J2``. That is a
|
|
contradiction.
|
|
|
|
This problem goes away by inserting a loop heart intrinsic as follows, which
|
|
establishes a relationship between loop iterations across threads.
|
|
|
|
.. code-block:: llvm
|
|
|
|
%anchor = call token @llvm.experimental.convergence.anchor()
|
|
for (;;) {
|
|
%loop = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
|
|
...
|
|
call void @convergent.op() [ "convergencectrl"(token %loop) ]
|
|
...
|
|
}
|
|
|
|
In the same scenario of two threads executing converged dynamic instances of the
|
|
anchor and then two iterations of the loop, the dynamic rule about loop heart
|
|
intrinsics implies that both threads execute the converged dynamic instances of
|
|
the loop heart intrinsic in their respective first iterations and then again in
|
|
their respective second iterations of the loop.
|
|
|
|
This then implies that they execute converged dynamic instances ``I1 == J1`` of
|
|
the ``@convergent.op`` in their first iterations and then
|
|
``I2 == J2`` in their second iterations. The rule is an "if and only if" rule,
|
|
so it also implies that ``I1 != J2`` and ``I2 != J1``, because those executions
|
|
see token values of ``%loop`` originating from non-converged dynamic
|
|
instances of the loop intrinsic.
|
|
|
|
One may ask whether we could change the dynamic rule instead of adding the
|
|
static rule about cycles. That is impractical due to deeper difficulties.
|
|
Consider the following loop, again with incorrect convergence control:
|
|
|
|
.. code-block:: llvm
|
|
|
|
; WARNING: Example of incorrect convergence control!
|
|
|
|
; (A)
|
|
%anchor = call token @llvm.experimental.convergence.anchor()
|
|
for (;;) {
|
|
; (B)
|
|
if (condition1) {
|
|
; (C)
|
|
call void @convergent.op.1() [ "convergencectrl"(token %anchor) ]
|
|
}
|
|
; (D)
|
|
if (condition2) {
|
|
; (E)
|
|
call void @convergent.op.2() [ "convergencectrl"(token %anchor) ]
|
|
}
|
|
; (F)
|
|
}
|
|
; (G)
|
|
|
|
Assume two threads execute converged dynamic instances of the anchor followed
|
|
by this sequence of basic blocks:
|
|
|
|
.. code-block:: text
|
|
|
|
Thread 1: A B C D F B D E F G
|
|
Thread 2: A B D E F B C D F G
|
|
|
|
That is, both threads execute two iterations of the loop, but they execute
|
|
the different convergent operations in different iterations. Without forming a
|
|
relation between loop iterations across the threads, there is no reasonable way
|
|
of defining which dynamic instances of the convergent operations should be the
|
|
same across the threads, if any.
|
|
|
|
Again, this can be addressed by adding a loop heart intrinsic, most naturally
|
|
as:
|
|
|
|
.. code-block:: llvm
|
|
|
|
; (A)
|
|
%anchor = call token @llvm.experimental.convergence.anchor()
|
|
for (;;) {
|
|
; (B)
|
|
%loop = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
|
|
if (condition1) {
|
|
; (C)
|
|
call void @convergent.op.1() [ "convergencectrl"(token %loop) ]
|
|
}
|
|
; (D)
|
|
if (condition2) {
|
|
; (E)
|
|
call void @convergent.op.2() [ "convergencectrl"(token %loop) ]
|
|
}
|
|
; (F)
|
|
}
|
|
; (G)
|
|
|
|
Let ``%loop(i;j)`` be the dynamic instance of ``j``-th execution of the loop
|
|
heart intrinsic by thread ``i``, and analogously ``@op.k(i)`` and ``@op.k(i)``
|
|
the dynamic instances of the execution of ``@convergent.op.k`` by thread ``i``.
|
|
Then we have:
|
|
|
|
1. ``%loop(1;j) == %loop(2;j)`` for ``j = 1, 2`` because of the dynamic rule
|
|
about loop heart intrinsics.
|
|
|
|
2. ``%loop(i;1) != %loop(i;2)`` for ``i = 1, 2`` because of the basic rule that
|
|
different executions by the same thread happen in different dynamic
|
|
instances.
|
|
|
|
3. ``@op.1(1) != @op.1(2)``, since ``@op.1(1)`` uses the token value of ``%loop``
|
|
referring to ``%loop(1;1)`` and ``@op.1(2)`` uses that
|
|
referring to ``%loop(2;2) == %loop(1;2)``, which is different from
|
|
``%loop(1;1)``.
|
|
|
|
4. Similarly, ``@op.2(1) != @op.2(2)``.
|
|
|
|
However, loop heart intrinsics could be inserted differently, at the cost
|
|
of also inserting a free-standing anchor:
|
|
|
|
.. code-block:: llvm
|
|
|
|
; (A)
|
|
%anchor = call token @llvm.experimental.convergence.anchor()
|
|
for (;;) {
|
|
; (B)
|
|
if (condition1) {
|
|
; (C)
|
|
%loop = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
|
|
call void @convergent.op.1() [ "convergencectrl"(token %loop) ]
|
|
}
|
|
; (D)
|
|
if (condition2) {
|
|
; (E)
|
|
%free = call token @llvm.experimental.convergence.anchor()
|
|
call void @convergent.op.2() [ "convergencectrl"(token %free) ]
|
|
}
|
|
; (F)
|
|
}
|
|
; (G)
|
|
|
|
This leads to the "unnatural counting of loop iterations" that is also mentioned
|
|
elsewhere. Let ``%loop(i)`` be the dynamic instance of the execution of the
|
|
loop heart intrinsic by thread ``i`` (each thread executes it only once), and
|
|
let ``@op.k(i)`` be as before. Then:
|
|
|
|
1. ``%loop(1) == %loop(2)`` because of the dynamic rule about loop heart
|
|
intrinsics.
|
|
|
|
2. ``@op.1(1) == @op.1(2)`` because ``@op.1(i)`` uses the value of ``%loop``
|
|
referring to ``%loop(i)``, and ``%loop(1) == %loop(2)``.
|
|
|
|
3. Whether ``@op.2(1) == @op.2(2)`` is implementation-defined because of the
|
|
use of the ``%free`` anchor intrinsic.
|
|
|
|
In practice, they almost certainly have to be non-converged dynamic
|
|
instances. Consider that if an implementation strictly follows the order of
|
|
instructions given in the program, the executions of the threads can be
|
|
"aligned" as follows:
|
|
|
|
.. code-block:: text
|
|
|
|
Thread 1: A B C D F B D E F G
|
|
Thread 2: A B D E F B C D F G
|
|
|
|
So then ``@op.2(1)`` physically executes later than ``@op.2(2)`` and there
|
|
can be no communication between the threads, which means they execute
|
|
non-converged dynamic instances.
|
|
|
|
That said, it is conceivable that there aren't actually any data or other
|
|
dependencies that would enforce this execution order. In that case, a highly
|
|
out-of-order implementation could potentially allow communication. That's
|
|
why the rules defined in this document are silent about whether
|
|
``@op.2(1) == @op.2(2)`` or not.
|
|
|
|
This type of convergence control seems relatively unlikely to appear in real
|
|
programs. Its possibility is simply a logical consequence of the model.
|
|
|
|
An equivalent issue arises if the convergent operations are replaced by nested
|
|
loops with loop heart intrinsics that directly refer to ``%anchor``, hence
|
|
the variants of the static rules about cycles that apply to them:
|
|
|
|
.. code-block:: llvm
|
|
|
|
; WARNING: Example of incorrect convergence control!
|
|
|
|
%anchor = call token @llvm.experimental.convergence.anchor()
|
|
for (;;) {
|
|
if (condition1) {
|
|
for (;;) {
|
|
%loop1 = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
|
|
}
|
|
}
|
|
if (condition2) {
|
|
for (;;) {
|
|
%loop2 = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
|
|
}
|
|
}
|
|
}
|
|
|
|
There is a cycle (closed walk in the CFG) that goes through both loop heart
|
|
intrinsics using ``%anchor`` but not through the definition of ``%anchor``,
|
|
so this code is invalid.
|
|
|
|
|
|
Examples for the Correctness of Program Transforms
|
|
==================================================
|
|
|
|
(This section is informative.)
|
|
|
|
As implied by the rules in the previous sections, program transforms are correct
|
|
with respect to convergent operations if they preserve or refine their
|
|
semantics. This means that the set of communicating threads in the transformed
|
|
program must have been possible in the original program.
|
|
|
|
Program transforms with a single-threaded focus are generally conservatively
|
|
correct if they do not sink or hoist convergent operations across a branch.
|
|
This applies even to program transforms that change the control flow graph.
|
|
|
|
For example, unrolling a loop that does not contain convergent operations
|
|
cannot break any of the guarantees required for convergent operations outside
|
|
of the loop.
|
|
|
|
|
|
Loop unrolling examples
|
|
-----------------------
|
|
|
|
We consider three kinds of loop unrolling here:
|
|
|
|
* Partial unrolling with no known trip multiple, so a "tail" is required to
|
|
collect the remaining elements.
|
|
* Partial unrolling by a trip multiple, so no "tail" is required.
|
|
* Full unrolling, which eliminates the loop.
|
|
|
|
The first kind is forbidden when ``@llvm.experimental.convergence.loop`` is
|
|
used. We illustrate the reasoning with some examples.
|
|
|
|
First, an arbitrary loop that contains convergent operations *can* be unrolled
|
|
in all of these ways, even with "tail", if all convergent operations refer back
|
|
to an anchor inside the loop. For example (in pseudo-code):
|
|
|
|
.. code-block:: llvm
|
|
|
|
while (counter > 0) {
|
|
%tok = call token @llvm.experimental.convergence.anchor()
|
|
call void @convergent.operation() [ "convergencectrl"(token %tok) ]
|
|
counter--;
|
|
}
|
|
|
|
This can be unrolled to:
|
|
|
|
.. code-block:: llvm
|
|
|
|
while (counter >= 2) {
|
|
%tok = call token @llvm.experimental.convergence.anchor()
|
|
call void @convergent.operation() [ "convergencectrl"(token %tok) ]
|
|
%tok = call token @llvm.experimental.convergence.anchor()
|
|
call void @convergent.operation() [ "convergencectrl"(token %tok) ]
|
|
counter -= 2;
|
|
}
|
|
while (counter > 0) {
|
|
%tok = call token @llvm.experimental.convergence.anchor()
|
|
call void @convergent.operation() [ "convergencectrl"(token %tok) ]
|
|
counter--;
|
|
}
|
|
|
|
This is likely to change the behavior of the convergent operation if there
|
|
are threads whose initial counter value is not a multiple of 2. In particular,
|
|
all threads with an odd trip count are now likely to execute the convergent
|
|
operation in their respective final iterations together because the
|
|
underlying implementation is likely to try to group as many threads together
|
|
as possible for the execution of the "tail".
|
|
|
|
This change is allowed because the anchor intrinsic has implementation-defined
|
|
convergence behavior and the loop unrolling transform is considered to be part
|
|
of the implementation. Another way of reasoning is that while the *likely*
|
|
behavior of the code has changed, the *guarantees* about its behavior have
|
|
remained the same.
|
|
|
|
If the loop contains uncontrolled convergent operations, this kind of unrolling
|
|
is forbidden.
|
|
|
|
Unrolling a loop with convergent operations that refer to tokens produced
|
|
outside the loop is forbidden when a "tail" or "remainder" would have to
|
|
be introduced. Consider:
|
|
|
|
.. code-block:: llvm
|
|
|
|
; (A)
|
|
%outer = call token @llvm.experimental.convergence.anchor()
|
|
while (counter > 0) {
|
|
%inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
|
|
; (B)
|
|
call void @convergent.operation() [ "convergencectrl"(token %inner) ]
|
|
counter--;
|
|
}
|
|
; (C)
|
|
|
|
To understand why unrolling is forbidden, consider two threads that execute
|
|
converged dynamic instances of the anchor and then proceed with 3 and 4 loop
|
|
iterations, respectively:
|
|
|
|
.. code-block:: text
|
|
|
|
Thread 1: A B B B C
|
|
Thread 2: A B B B B C
|
|
|
|
By the dynamic rule on loop heart intrinsics, these threads execute converged
|
|
dynamic instances of the loop intrinsic for the first 3 iterations, and then
|
|
thread 2 executes another dynamic instance by itself.
|
|
|
|
By the dynamic rule on general convergent operations, the threads execute
|
|
converged dynamic instances of the ``@convergent.operation`` in the first 3
|
|
iterations (that is, the dynamic instance executed by thread 1 in iteration
|
|
*n* is the same as that executed by thread 2 in iteration *n*, for *n = 1,2,3*;
|
|
the dynamic instance executed in iteration 1 is different from that in
|
|
iteration 2, etc.).
|
|
|
|
Now assume that the loop is unrolled by a factor of 2, which requires a
|
|
remainder as follows:
|
|
|
|
.. code-block:: llvm
|
|
|
|
; (A)
|
|
%outer = call token @llvm.experimental.convergence.anchor()
|
|
while (counter >= 2) {
|
|
%inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
|
|
; (B)
|
|
call void @convergent.operation() [ "convergencectrl"(token %inner) ]
|
|
call void @convergent.operation() [ "convergencectrl"(token %inner) ]
|
|
counter -= 2;
|
|
}
|
|
; (C)
|
|
if (counter > 0) {
|
|
%remainder = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
|
|
; (D)
|
|
call void @convergent.operation() [ "convergencectrl"(token %remainder) ]
|
|
}
|
|
; (E)
|
|
|
|
First of all, note some interesting problems surrounding the loop intrinsic:
|
|
|
|
1. It is *not* duplicated inside the unrolled loop. This is to comply with
|
|
the :ref:`convergence_static_rules`.
|
|
|
|
2. It is unclear whether the loop intrinsic ought to be duplicated in the
|
|
remainder, or whether the final ``@convergent.operation`` in D should just
|
|
refer to either ``%inner`` (which is possible in SSA form) or directly to
|
|
``%outer``. The decision made here is arbitrary and doesn't change the
|
|
argument that follows. Ultimately, it simply doesn't matter because the
|
|
transform is incorrect either way.
|
|
|
|
The threads now execute the following sequences of blocks:
|
|
|
|
.. code-block:: text
|
|
|
|
Thread 1: A B C D E
|
|
Thread 2: A B B C D E
|
|
|
|
Analogous to the argument above, they execute converged dynamic instances of the
|
|
``%inner`` intrinsic and the ``@convergent.operation`` in the first iteration
|
|
of the unrolled loop, which corresponds to the first 2 iterations of the
|
|
original loop.
|
|
|
|
However, they execute different static calls to ``@convergent.operation`` for
|
|
the 3rd iteration of the original loop. In thread 1, that iteration corresponds
|
|
to the call in the remainder, while in thread 2 it corresponds to the first
|
|
call to ``@convergent.operation`` in the unrolled loop. Therefore, they execute
|
|
non-converged dynamic instances, which means that the set of communicating threads
|
|
for the 3rd iteration of the original loop is different. This is why the
|
|
unrolling is incorrect.
|
|
|
|
On the other hand, unrolling without "tail" is allowed. For example, assuming
|
|
that the trip counter is known to be a multiple of 2, we can unroll the loop
|
|
as follows:
|
|
|
|
.. code-block:: llvm
|
|
|
|
%outer = call token @llvm.experimental.convergence.anchor()
|
|
while (counter > 0) {
|
|
%inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
|
|
call void @convergent.operation() [ "convergencectrl"(token %inner) ]
|
|
call void @convergent.operation() [ "convergencectrl"(token %inner) ]
|
|
counter -= 2;
|
|
}
|
|
|
|
Note again that the loop intrinsic is not duplicated.
|
|
|
|
The
|
|
:ref:`llvm.experimental.convergence.loop <llvm.experimental.convergence.loop>`
|
|
intrinsic is typically expected to appear in the header of a natural loop.
|
|
However, it can also appear in non-header blocks of a loop. In that case, the
|
|
loop can generally not be unrolled.
|
|
|
|
|
|
Hoisting and sinking
|
|
--------------------
|
|
|
|
In general, hoisting and sinking of convergent operations is forbidden. This is
|
|
because moving the operation to a different point in control flow generally
|
|
changes the set of threads that reach the operation and therefore, the set of
|
|
threads that execute converged dynamic instances of the operation. By
|
|
definition, this changes the set of threads that participate in the
|
|
communication of the convergent operation, which will typically change its
|
|
result.
|
|
|
|
There are a number of exceptions, though most of them require additional
|
|
knowledge.
|
|
|
|
For example, hoisting and sinking across *uniform* conditional branches -- i.e.,
|
|
conditional branches where within every possible relevant set of threads, all
|
|
threads will always take the same direction -- is generally allowed. See the end
|
|
of the :ref:`example of reductions inside control flow
|
|
<convergence_example_reductions>` for a brief discussion.
|
|
|
|
Some convergent operations can be hoisted but not sunk, or vice versa. A simple
|
|
example is the ``subgroupShuffle(data, id)`` operation. It returns the ``data``
|
|
operand of the thread identified by ``id``, where thread IDs are fixed and
|
|
assigned to each thread at launch. The result is undefined (or perhaps there is
|
|
UB, depending on the language and environment) if thread ``id`` is not in the
|
|
communicating set of threads. So hoisting is allowed in the following
|
|
pseudo-code example:
|
|
|
|
.. code-block:: llvm
|
|
|
|
define void @example(...) convergent {
|
|
%entry = call token @llvm.experimental.convergence.entry()
|
|
%data = ...
|
|
%id = ...
|
|
if (condition) {
|
|
%shuffled = call i32 @subgroupShuffle(i32 %data, i32 %id) [ "convergencectrl"(token %entry) ]
|
|
...
|
|
} else {
|
|
%shuffled = call i32 @subgroupShuffle(i32 %data, i32 %id) [ "convergencectrl"(token %entry) ]
|
|
...
|
|
}
|
|
}
|
|
|
|
After hoisting the calls to ``@subgroupShuffle``, the communicating set of
|
|
threads is the union of the two sets of threads in the original program, so
|
|
``%id`` can only go "out of range" after hoisting if it did so in the original
|
|
program.
|
|
|
|
However, speculative execution of ``@subgroupShuffle`` in the following program
|
|
may be forbidden:
|
|
|
|
.. code-block:: llvm
|
|
|
|
define void @example(...) convergent {
|
|
%entry = call token @llvm.experimental.convergence.entry()
|
|
%data = ...
|
|
%id = ...
|
|
if (condition) {
|
|
%shuffled = call i32 @subgroupShuffle(i32 %data, i32 %id) [ "convergencectrl"(token %entry) ]
|
|
...
|
|
}
|
|
}
|
|
|
|
There is no guarantee about the value of ``%id`` in the threads where
|
|
``condition`` is false. If ``@subgroupShuffle`` is defined to have UB when
|
|
``%id`` is outside of the set of communicating threads, then speculating and
|
|
hoisting ``@subgroupShuffle`` might introduce UB.
|
|
|
|
On the other hand, if ``@subgroupShuffle`` is defined such that it merely
|
|
produces an undefined value or poison as result when ``%id`` is "out of range",
|
|
then speculating is okay.
|
|
|
|
Even though
|
|
:ref:`llvm.experimental.convergence.anchor <llvm.experimental.convergence.anchor>`
|
|
is marked as ``convergent``, it can be sunk in some cases. For example, in
|
|
pseudo-code:
|
|
|
|
.. code-block:: llvm
|
|
|
|
%tok = call token @llvm.experimental.convergence.anchor()
|
|
if (condition) {
|
|
call void @convergent.operation() [ "convergencectrl"(token %tok) ]
|
|
}
|
|
|
|
Assuming that ``%tok`` is only used inside the conditional block, the anchor can
|
|
be sunk. The rationale is two-fold. First, the anchor has implementation-defined
|
|
behavior, and the sinking is part of the implementation. Second, already in the
|
|
original program, the set of threads that communicates in the
|
|
``@convergent.operation`` is automatically subset to the threads for which
|
|
``condition`` is true.
|
|
|
|
Anchors can be hoisted in acyclic control flow. For example:
|
|
|
|
.. code-block:: llvm
|
|
|
|
if (condition) {
|
|
%tok1 = call token @llvm.experimental.convergence.anchor()
|
|
call void @convergent.operation() [ "convergencectrl"(token %tok1) ]
|
|
} else {
|
|
%tok2 = call token @llvm.experimental.convergence.anchor()
|
|
call void @convergent.operation() [ "convergencectrl"(token %tok2) ]
|
|
}
|
|
|
|
The anchors can be hoisted, resulting in:
|
|
|
|
.. code-block:: llvm
|
|
|
|
%tok = call token @llvm.experimental.convergence.anchor()
|
|
if (condition) {
|
|
call void @convergent.operation() [ "convergencectrl"(token %tok) ]
|
|
} else {
|
|
call void @convergent.operation() [ "convergencectrl"(token %tok) ]
|
|
}
|
|
|
|
The behavior is unchanged, since each of the static convergent operations only
|
|
ever communicates with threads that have the same ``condition`` value.
|
|
By contrast, hoisting the convergent operations themselves is forbidden.
|
|
|
|
Hoisting and sinking anchors out of and into loops is forbidden. For example:
|
|
|
|
.. code-block:: llvm
|
|
|
|
for (;;) {
|
|
%tok = call token @llvm.experimental.convergence.anchor()
|
|
call void @convergent.operation() [ "convergencectrl"(token %tok) ]
|
|
}
|
|
|
|
Hoisting the anchor would make the program invalid according to the static
|
|
validity rules. Conversely:
|
|
|
|
.. code-block:: llvm
|
|
|
|
%outer = call token @llvm.experimental.convergence.anchor()
|
|
while (counter > 0) {
|
|
%inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
|
|
call void @convergent.operation() [ "convergencectrl"(token %inner) ]
|
|
counter--;
|
|
}
|
|
|
|
The program would stay valid if the anchor was sunk into the loop, but its
|
|
behavior could end up being different. If the anchor is inside the loop, then
|
|
each loop iteration has a new dynamic instance of the anchor, and the set of
|
|
threads participating in those dynamic instances of the anchor could be
|
|
different in arbitrary implementation-defined ways. Via the dynamic rules about
|
|
dynamic instances of convergent operations, this then implies that the set of
|
|
threads executing ``@convergent.operation`` could be different in each loop
|
|
iteration in arbitrary implementation-defined ways.
|
|
|
|
Convergent operations can be sunk together with their anchor. Again in
|
|
pseudo-code:
|
|
|
|
.. code-block:: llvm
|
|
|
|
%tok = call token @llvm.experimental.convergence.anchor()
|
|
%a = call T @pure.convergent.operation(...) [ "convergencectrl"(token %tok) ]
|
|
%b = call T @pure.convergent.operation(...) [ "convergencectrl"(token %tok) ]
|
|
if (condition) {
|
|
use(%a, %b)
|
|
}
|
|
|
|
Assuming that ``%tok``, ``%a``, and ``%b`` are only used inside the conditional
|
|
block, all can be sunk together:
|
|
|
|
.. code-block:: llvm
|
|
|
|
if (condition) {
|
|
%tok = call token @llvm.experimental.convergence.anchor()
|
|
%a = call T @pure.convergent.operation(...) [ "convergencectrl"(token %tok) ]
|
|
%b = call T @pure.convergent.operation(...) [ "convergencectrl"(token %tok) ]
|
|
use(%a, %b)
|
|
}
|
|
|
|
The rationale is that the anchor intrinsic has implementation-defined behavior,
|
|
and the sinking transform is considered to be part of the implementation:
|
|
the sinking will restrict the set of communicating threads to those for which
|
|
``condition`` is true, but that could have happened in the original program
|
|
anyway for some arbitrary other reason.
|
|
|
|
However, sinking *only* the convergent operation producing ``%b`` would be
|
|
incorrect. That would allow threads for which ``condition`` is false to
|
|
communicate at ``%a``, but not at ``%b``, which the original program doesn't
|
|
allow.
|
|
|
|
Note that the entry intrinsic behaves differently. Sinking the convergent
|
|
operations is forbidden in the following snippet:
|
|
|
|
.. code-block:: llvm
|
|
|
|
%tok = call token @llvm.experimental.convergence.entry()
|
|
%a = call T @pure.convergent.operation(...) [ "convergencectrl"(token %tok) ]
|
|
%b = call T @pure.convergent.operation(...) [ "convergencectrl"(token %tok) ]
|
|
if (condition) {
|
|
use(%a, %b)
|
|
}
|