[AMDGPU] Corrections to memory model description.

- Add description on nontemporal support.
 - Correct OpenCL sequentially consistent and fence code sequences.
 - Minor test cleanup.

Differential Revision: https://reviews.llvm.org/D39073


git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@316131 91177308-0d34-0410-b5e6-96231b3b80d8
Tony Tye 2017-10-18 22:16:55 +00:00
parent b5cb868aaa
commit 0a09220c32


@ -1240,7 +1240,7 @@ non-AMD key names should be prefixed by "*vendor-name*.".
=================================== ============== ========= ==============
.. TODO
Plan to remove the debug properties metadata.
Kernel Dispatch
~~~~~~~~~~~~~~~
@ -1431,9 +1431,9 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
.. table:: Kernel Descriptor for GFX6-GFX9
:name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
======= ======= =============================== ===========================
======= ======= =============================== ============================
Bits Size Field Name Description
======= ======= =============================== ===========================
======= ======= =============================== ============================
31:0 4 bytes GroupSegmentFixedSize The amount of fixed local
address space memory
required for a work-group
@ -1461,7 +1461,7 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
97 1 bit IsXNACKEnabled Indicates if the generated
machine code is capable of
supporting XNACK.
127:98 30 bits Reserved. Must be 0.
127:98 30 bits Reserved, must be 0.
191:128 8 bytes KernelCodeEntryByteOffset Byte offset (possibly
negative) from base
address of kernel
@ -1469,7 +1469,7 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
entry point instruction
which must be 256 byte
aligned.
383:192 24 Reserved. Must be 0.
383:192 24 Reserved, must be 0.
bytes
415:384 4 bytes ComputePgmRsrc1 Compute Shader (CS)
program settings used by
@ -1477,7 +1477,7 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
``COMPUTE_PGM_RSRC1``
configuration
register. See
:ref:`amdgpu-amdhsa-compute_pgm_rsrc1_t-gfx6-gfx9-table`.
:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
447:416 4 bytes ComputePgmRsrc2 Compute Shader (CS)
program settings used by
CP to set up
@ -1509,16 +1509,16 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
should always be 0.
457 1 bit EnableSGPRGridWorkgroupCountZ Not implemented in CP and
should always be 0.
463:458 6 bits Reserved. Must be 0.
511:464 6 Reserved. Must be 0.
463:458 6 bits Reserved, must be 0.
511:464 6 Reserved, must be 0.
bytes
512 **Total size 64 bytes.**
======= ===================================================================
======= ====================================================================
..
.. table:: compute_pgm_rsrc1 for GFX6-GFX9
:name: amdgpu-amdhsa-compute_pgm_rsrc1_t-gfx6-gfx9-table
:name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table
======= ======= =============================== ===========================================================================
Bits Size Field Name Description
@ -1529,8 +1529,9 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
specific:
GFX6-9
roundup((max-vgpg + 1)
/ 4) - 1
- max_vgpr 1..256
- roundup((max_vgpr + 1)
/ 4) - 1
Used by CP to set up
``COMPUTE_PGM_RSRC1.VGPRS``.
@ -1540,11 +1541,13 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
specific:
GFX6-8
roundup((max-sgpg + 1)
/ 8) - 1
- max_sgpr 1..112
- roundup((max_sgpr + 1)
/ 8) - 1
GFX9
roundup((max-sgpg + 1)
/ 16) - 1
- max_sgpr 1..112
- roundup((max_sgpr + 1)
/ 16) - 1
Includes the special SGPRs
for VCC, Flat Scratch (for
@ -1628,7 +1631,7 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
21 1 bit ENABLE_DX10_CLAMP Wavefront starts execution
with DX10 clamp mode
enabled. Used by the vector
ALU to force DX-10 style
ALU to force DX10 style
treatment of NaN's (when
set, clamp NaN to zero,
otherwise pass NaN
@ -1676,29 +1679,25 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
CP is responsible for
filling in
``COMPUTE_PGM_RSRC1.CDBG_USER``.
26 1 bit FP16_OVFL GFX6-8:
Reserved. Must be 0.
GFX9:
Wavefront starts
execution with specified
fp16 overflow mode.
26 1 bit FP16_OVFL GFX6-8
Reserved, must be 0.
GFX9
Wavefront starts execution
with specified fp16 overflow
mode.
- If 0, then fp16
overflow generates
- If 0, fp16 overflow generates
+/-INF values.
- If 1, then fp16
overflow that is the
result of an +/-INF
input value or divide
by 0 generates a
+/-INF, otherwise
clamps computed
overflow to +/-MAX_FP16
as appropriate.
- If 1, fp16 overflow that is the
result of an +/-INF input value
or divide by 0 produces a +/-INF,
otherwise clamps computed
overflow to +/-MAX_FP16 as
appropriate.
Used by CP to set up
``COMPUTE_PGM_RSRC1.FP16_OVFL``.
31:27 5 bits Reserved. Must be 0.
31:27 5 bits Reserved, must be 0.
32 **Total size 4 bytes**
======= ===================================================================================================================
@ -1855,7 +1854,7 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero
_ZERO (rcp_iflag_f32 instruction
only)
31 1 bit Reserved. Must be 0.
31 1 bit Reserved, must be 0.
32 **Total size 4 bytes.**
======= ===================================================================================================================
@ -2245,9 +2244,6 @@ This section describes the mapping of LLVM memory model onto AMDGPU machine code
.. TODO
Update when implementation complete.
Support more relaxed OpenCL memory model to be controlled by environment
component of target triple.
The AMDGPU backend supports the memory synchronization scopes specified in
:ref:`amdgpu-memory-scopes`.
@ -2264,19 +2260,23 @@ additional ``s_waitcnt`` instructions are required to ensure registers are
defined before being used. These may be able to be combined with the memory
model ``s_waitcnt`` instructions as described above.
The AMDGPU memory model supports both the HSA [HSA]_ memory model, and the
OpenCL [OpenCL]_ memory model. The HSA memory model uses a single happens-before
relation for all address spaces (see :ref:`amdgpu-address-spaces`). The OpenCL
memory model which has separate happens-before relations for the global and
local address spaces, and only a fence specifying both global and local address
space joins the relationships. Since the LLVM ``memfence`` instruction does not
allow an address space to be specified the OpenCL fence has to convervatively
assume both local and global address space was specified. However, optimizations
can often be done to eliminate the additional ``s_waitcnt``instructions when
there are no intervening corresponding ``ds/flat_load/store/atomic`` memory
instructions. The code sequences in the table indicate what can be omitted for
the OpenCL memory. The target triple environment is used to determine if the
source language is OpenCL (see :ref:`amdgpu-opencl`).
The AMDGPU backend supports the following memory models:
HSA Memory Model [HSA]_
The HSA memory model uses a single happens-before relation for all address
spaces (see :ref:`amdgpu-address-spaces`).
OpenCL Memory Model [OpenCL]_
The OpenCL memory model has separate happens-before relations for the
global and local address spaces. Only a fence specifying both the global
and local address spaces, and seq_cst instructions, join the
relationships. Since the LLVM ``fence`` instruction does not allow an
address space to be specified, the OpenCL fence has to conservatively
assume both the local and global address spaces were specified. However,
optimizations can often be done to eliminate the additional ``s_waitcnt``
instructions when there are no intervening memory instructions which
access the corresponding address space. The code sequences in the table
indicate what can be omitted for the OpenCL memory model. The target
triple environment is used to determine if the source language is OpenCL
(see :ref:`amdgpu-opencl`).
``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
operations.
@ -2308,11 +2308,11 @@ For GFX6-GFX9:
that for GFX7-9 ``flat_load/store/atomic`` instructions can report out of
vector memory order if they access LDS memory, and out of LDS operation order
if they access global memory.
* The vector memory operations access a vector L1 cache shared by all wavefronts
on a CU. Therefore, no special action is required for coherence between
wavefronts in the same work-group. A ``buffer_wbinvl1_vol`` is required for
coherence between waves executing in different work-groups as they may be
executing on different CUs.
* The vector memory operations access a single vector L1 cache shared by all
SIMDs on a CU. Therefore, no special action is required for coherence between the
lanes of a single wavefront, or for coherence between wavefronts in the same
work-group. A ``buffer_wbinvl1_vol`` is required for coherence between waves
executing in different work-groups as they may be executing on different CUs.
* The scalar memory operations access a scalar L1 cache shared by all wavefronts
on a group of CUs. The scalar and vector L1 caches are not coherent. However,
scalar operations are used in a restricted way so do not impact the memory
@ -2376,45 +2376,62 @@ future wave that uses the same scratch area, or a function call that creates a
frame at the same address, respectively. There is no need for a ``s_dcache_inv``
as all scalar writes are write-before-read in the same thread.
Scratch backing memory (which is used for the private address space) is accessed
with MTYPE NC_NV (non-coherenent non-volatile). Since the private address space
is only accessed by a single thread, and is always write-before-read,
there is never a need to invalidate these entries from the L1 cache. Hence all
cache invalidates are done as ``*_vol`` to only invalidate the volatile cache
lines.
Scratch backing memory (which is used for the private address space)
is accessed with MTYPE NC_NV (non-coherent non-volatile). Since the private
address space is only accessed by a single thread, and is always
write-before-read, there is never a need to invalidate these entries from the L1
cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the
volatile cache lines.
On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
to invalidate the L2 cache. This also causes it to be treated as non-volatile
and so is not invalidated by ``*_vol``. On APU it is accessed as CC (cache
coherent) and so the L2 cache will coherent with the CPU and other agents.
to invalidate the L2 cache. This also causes it to be treated as
non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC
(cache coherent) and so the L2 cache will be coherent with the CPU and other
agents.
.. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
:name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
============ ============ ============== ========== =======================
============ ============ ============== ========== ===============================
LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
Ordering Sync Scope Address
Space
============ ============ ============== ========== =======================
============ ============ ============== ========== ===============================
**Non-Atomic**
---------------------------------------------------------------------------
load *none* *none* - global non-volatile
- generic 1. buffer/global/flat_load
volatile
-----------------------------------------------------------------------------------
load *none* *none* - global - !volatile & !nontemporal
- generic
- private 1. buffer/global/flat_load
- constant
- volatile & !nontemporal
1. buffer/global/flat_load
glc=1
- nontemporal
1. buffer/global/flat_load
glc=1 slc=1
load *none* *none* - local 1. ds_load
store *none* *none* - global 1. buffer/global/flat_store
store *none* *none* - global - !nontemporal
- generic
- private 1. buffer/global/flat_store
- constant
- nontemporal
1. buffer/global/flat_store
glc=1 slc=1
store *none* *none* - local 1. ds_store
**Unordered Atomic**
---------------------------------------------------------------------------
-----------------------------------------------------------------------------------
load atomic unordered *any* *any* *Same as non-atomic*.
store atomic unordered *any* *any* *Same as non-atomic*.
atomicrmw unordered *any* *any* *Same as monotonic
atomic*.
**Monotonic Atomic**
---------------------------------------------------------------------------
-----------------------------------------------------------------------------------
load atomic monotonic - singlethread - global 1. buffer/global/flat_load
- wavefront - generic
- workgroup
@ -2440,16 +2457,15 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
- wavefront
- workgroup
**Acquire Atomic**
---------------------------------------------------------------------------
-----------------------------------------------------------------------------------
load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
- wavefront - local
- generic
load atomic acquire - workgroup - global 1. buffer/global_load
load atomic acquire - workgroup - local 1. ds/flat_load
- generic 2. s_waitcnt lgkmcnt(0)
load atomic acquire - workgroup - global 1. buffer/global/flat_load
load atomic acquire - workgroup - local 1. ds_load
2. s_waitcnt lgkmcnt(0)
- If OpenCL, omit
waitcnt.
- If OpenCL, omit.
- Must happen before
any following
global/generic
@ -2462,8 +2478,23 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
older than the load
atomic value being
acquired.
load atomic acquire - workgroup - generic 1. flat_load
2. s_waitcnt lgkmcnt(0)
load atomic acquire - agent - global 1. buffer/global_load
- If OpenCL, omit.
- Must happen before
any following
global/generic
load/load
atomic/store/store
atomic/atomicrmw.
- Ensures any
following global
data read is no
older than the load
atomic value being
acquired.
load atomic acquire - agent - global 1. buffer/global/flat_load
- system glc=1
2. s_waitcnt vmcnt(0)
@ -2516,12 +2547,11 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
- wavefront - local
- generic
atomicrmw acquire - workgroup - global 1. buffer/global_atomic
atomicrmw acquire - workgroup - local 1. ds/flat_atomic
- generic 2. waitcnt lgkmcnt(0)
atomicrmw acquire - workgroup - global 1. buffer/global/flat_atomic
atomicrmw acquire - workgroup - local 1. ds_atomic
2. waitcnt lgkmcnt(0)
- If OpenCL, omit
waitcnt.
- If OpenCL, omit.
- Must happen before
any following
global/generic
@ -2535,7 +2565,24 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
atomicrmw value
being acquired.
atomicrmw acquire - agent - global 1. buffer/global_atomic
atomicrmw acquire - workgroup - generic 1. flat_atomic
2. waitcnt lgkmcnt(0)
- If OpenCL, omit.
- Must happen before
any following
global/generic
load/load
atomic/store/store
atomic/atomicrmw.
- Ensures any
following global
data read is no
older than the
atomicrmw value
being acquired.
atomicrmw acquire - agent - global 1. buffer/global/flat_atomic
- system 2. s_waitcnt vmcnt(0)
- Must happen before
@ -2592,9 +2639,8 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
- If OpenCL and
address space is
not generic, omit
waitcnt. However,
since LLVM
not generic, omit.
- However, since LLVM
currently has no
address space on
the fence need to
@ -2633,14 +2679,14 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
value read by the
fence-paired-atomic.
fence acquire - agent *none* 1. s_waitcnt vmcnt(0) &
- system lgkmcnt(0)
fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
- system vmcnt(0)
- If OpenCL and
address space is
not generic, omit
lgkmcnt(0).
However, since LLVM
- However, since LLVM
currently has no
address space on
the fence need to
@ -2672,7 +2718,7 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
- s_waitcnt lgkmcnt(0)
must happen after
any preceding
group/generic load
local/generic load
atomic/atomicrmw
with an equal or
wider sync scope
@ -2699,8 +2745,8 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
2. buffer_wbinvl1_vol
- Must happen before
any following global/generic
- Must happen before any
following global/generic
load/load
atomic/store/store
atomic/atomicrmw.
@ -2710,14 +2756,13 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
global data.
**Release Atomic**
---------------------------------------------------------------------------
-----------------------------------------------------------------------------------
store atomic release - singlethread - global 1. buffer/global/ds/flat_store
- wavefront - local
- generic
store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
- generic
- If OpenCL, omit
waitcnt.
- If OpenCL, omit.
- Must happen after
any preceding
local/generic
@ -2737,8 +2782,29 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
2. buffer/global/flat_store
store atomic release - workgroup - local 1. ds_store
store atomic release - agent - global 1. s_waitcnt vmcnt(0) &
- system - generic lgkmcnt(0)
store atomic release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
- If OpenCL, omit.
- Must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
- Must happen before
the following
store.
- Ensures that all
memory operations
to local have
completed before
performing the
store that is being
released.
2. flat_store
store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
- system - generic vmcnt(0)
- If OpenCL, omit
lgkmcnt(0).
@ -2770,7 +2836,7 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
store.
- Ensures that all
memory operations
to global have
to memory have
completed before
performing the
store that is being
@ -2781,9 +2847,8 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
- wavefront - local
- generic
atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
- generic
- If OpenCL, omit
waitcnt.
- If OpenCL, omit.
- Must happen after
any preceding
local/generic
@ -2803,8 +2868,29 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
2. buffer/global/flat_atomic
atomicrmw release - workgroup - local 1. ds_atomic
atomicrmw release - agent - global 1. s_waitcnt vmcnt(0) &
- system - generic lgkmcnt(0)
atomicrmw release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
- If OpenCL, omit.
- Must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
- Must happen before
the following
atomicrmw.
- Ensures that all
memory operations
to local have
completed before
performing the
atomicrmw that is
being released.
2. flat_atomic
atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
- system - generic vmcnt(0)
- If OpenCL, omit
lgkmcnt(0).
@ -2842,23 +2928,29 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
the atomicrmw that
is being released.
2. buffer/global/ds/flat_atomic*
2. buffer/global/ds/flat_atomic
fence release - singlethread *none* *none*
- wavefront
fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0)
- If OpenCL and
address space is
not generic, omit
waitcnt. However,
since LLVM
not generic, omit.
- However, since LLVM
currently has no
address space on
the fence need to
conservatively
always generate
(see comment for
previous fence).
always generate. If
fence had an
address space then
set to address
space of OpenCL
fence flag, or to
generic if both
local and global
flags are
specified.
- Must happen after
any preceding
local/generic
@ -2883,21 +2975,32 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
following
fence-paired-atomic.
fence release - agent *none* 1. s_waitcnt vmcnt(0) &
- system lgkmcnt(0)
fence release - agent *none* 1. s_waitcnt lgkmcnt(0) &
- system vmcnt(0)
- If OpenCL and
address space is
not generic, omit
lgkmcnt(0).
However, since LLVM
- If OpenCL and
address space is
local, omit
vmcnt(0).
- However, since LLVM
currently has no
address space on
the fence need to
conservatively
always generate
(see comment for
previous fence).
always generate. If
fence had an
address space then
set to address
space of OpenCL
fence flag, or to
generic if both
local and global
flags are
specified.
- Could be split into
separate s_waitcnt
vmcnt(0) and
@ -2933,21 +3036,20 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
fence-paired-atomic).
- Ensures that all
memory operations
to global have
have
completed before
performing the
following
fence-paired-atomic.
**Acquire-Release Atomic**
---------------------------------------------------------------------------
-----------------------------------------------------------------------------------
atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
- wavefront - local
- generic
atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
- If OpenCL, omit
waitcnt.
- If OpenCL, omit.
- Must happen after
any preceding
local/generic
@ -2965,12 +3067,11 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
atomicrmw that is
being released.
2. buffer/global_atomic
2. buffer/global/flat_atomic
atomicrmw acq_rel - workgroup - local 1. ds_atomic
2. s_waitcnt lgkmcnt(0)
- If OpenCL, omit
waitcnt.
- If OpenCL, omit.
- Must happen before
any following
global/generic
@ -2986,8 +3087,7 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
- If OpenCL, omit
waitcnt.
- If OpenCL, omit.
- Must happen after
any preceding
local/generic
@ -3008,8 +3108,7 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
2. flat_atomic
3. s_waitcnt lgkmcnt(0)
- If OpenCL, omit
waitcnt.
- If OpenCL, omit.
- Must happen before
any following
global/generic
@ -3022,8 +3121,9 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
older than the load
atomic value being
acquired.
atomicrmw acq_rel - agent - global 1. s_waitcnt vmcnt(0) &
- system lgkmcnt(0)
atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
- system vmcnt(0)
- If OpenCL, omit
lgkmcnt(0).
@ -3061,7 +3161,7 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
atomicrmw that is
being released.
2. buffer/global_atomic
2. buffer/global/flat_atomic
3. s_waitcnt vmcnt(0)
- Must happen before
@ -3085,8 +3185,8 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
will not see stale
global data.
atomicrmw acq_rel - agent - generic 1. s_waitcnt vmcnt(0) &
- system lgkmcnt(0)
atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
- system vmcnt(0)
- If OpenCL, omit
lgkmcnt(0).
@ -3157,8 +3257,8 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
- If OpenCL and
address space is
not generic, omit
waitcnt. However,
not generic, omit.
- However,
since LLVM
currently has no
address space on
@ -3196,8 +3296,8 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
stronger than
unordered (this is
termed the
fence-paired-atomic)
has completed
acquire-fence-paired-atomic
) has completed
before following
global memory
operations. This
@ -3217,19 +3317,19 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
stronger than
unordered (this is
termed the
fence-paired-atomic).
This satisfies the
release-fence-paired-atomic
). This satisfies the
requirements of
release.
fence acq_rel - agent *none* 1. s_waitcnt vmcnt(0) &
- system lgkmcnt(0)
fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) &
- system vmcnt(0)
- If OpenCL and
address space is
not generic, omit
lgkmcnt(0).
However, since LLVM
- However, since LLVM
currently has no
address space on
the fence need to
@ -3274,8 +3374,8 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
stronger than
unordered (this is
termed the
fence-paired-atomic)
has completed
acquire-fence-paired-atomic
) has completed
before invalidating
the cache. This
satisfies the
@ -3295,8 +3395,8 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
stronger than
unordered (this is
termed the
fence-paired-atomic).
This satisfies the
release-fence-paired-atomic
). This satisfies the
requirements of
release.
@ -3317,13 +3417,103 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
acquire.
**Sequential Consistent Atomic**
---------------------------------------------------------------------------
-----------------------------------------------------------------------------------
load atomic seq_cst - singlethread - global *Same as corresponding
- wavefront - local load atomic acquire*.
- workgroup - generic
load atomic seq_cst - agent - global 1. s_waitcnt vmcnt(0)
- system - local
- generic - Must happen after
- wavefront - local load atomic acquire,
- generic except must generate
all instructions even
for OpenCL.*
load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0)
- generic
- Must
happen after
preceding
global/generic load
atomic/store
atomic/atomicrmw
with memory
ordering of seq_cst
and with equal or
wider sync scope.
(Note that seq_cst
fences have their
own s_waitcnt
lgkmcnt(0) and so do
not need to be
considered.)
- Ensures any
preceding
sequential
consistent local
memory instructions
have completed
before executing
this sequentially
consistent
instruction. This
prevents reordering
a seq_cst store
followed by a
seq_cst load. (Note
that seq_cst is
stronger than
acquire/release as
the reordering of
load acquire
followed by a store
release is
prevented by the
waitcnt of
the release, but
there is nothing
preventing a store
release followed by
load acquire from
completing out of
order.)
2. *Following
instructions same as
corresponding load
atomic acquire,
except must generate
all instructions even
for OpenCL.*
load atomic seq_cst - workgroup - local *Same as corresponding
load atomic acquire,
except must generate
all instructions even
for OpenCL.*
load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
- system - generic vmcnt(0)
- Could be split into
separate s_waitcnt
vmcnt(0)
and s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
- waitcnt lgkmcnt(0)
must happen after
preceding
global/generic load
atomic/store
atomic/atomicrmw
with memory
ordering of seq_cst
and with equal or
wider sync scope.
(Note that seq_cst
fences have their
own s_waitcnt
lgkmcnt(0) and so do
not need to be
considered.)
- waitcnt vmcnt(0)
must happen after
preceding
global/generic load
atomic/store
@ -3351,7 +3541,7 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
prevents reordering
a seq_cst store
followed by a
seq_cst load (Note
seq_cst load. (Note
that seq_cst is
stronger than
acquire/release as
@ -3360,7 +3550,7 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
followed by a store
release is
prevented by the
waitcnt vmcnt(0) of
waitcnt of
the release, but
there is nothing
preventing a store
@ -3372,24 +3562,36 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
2. *Following
instructions same as
corresponding load
atomic acquire*.
atomic acquire,
except must generate
all instructions even
for OpenCL.*
store atomic seq_cst - singlethread - global *Same as corresponding
- wavefront - local store atomic release*.
- workgroup - generic
- wavefront - local store atomic release,
- workgroup - generic except must generate
all instructions even
for OpenCL.*
store atomic seq_cst - agent - global *Same as corresponding
- system - generic store atomic release*.
- system - generic store atomic release,
except must generate
all instructions even
for OpenCL.*
atomicrmw seq_cst - singlethread - global *Same as corresponding
- wavefront - local atomicrmw acq_rel*.
- workgroup - generic
- wavefront - local atomicrmw acq_rel,
- workgroup - generic except must generate
all instructions even
for OpenCL.*
atomicrmw seq_cst - agent - global *Same as corresponding
- system - generic atomicrmw acq_rel*.
- system - generic atomicrmw acq_rel,
except must generate
all instructions even
for OpenCL.*
fence seq_cst - singlethread *none* *Same as corresponding
- wavefront fence acq_rel*.
- workgroup
- agent
- system
============ ============ ============== ========== =======================
- wavefront fence acq_rel,
- workgroup except must generate
- agent all instructions even
- system for OpenCL.*
============ ============ ============== ========== ===============================
The memory order also adds the single thread optimization constraints defined in
table
@ -3799,7 +4001,7 @@ used. The default value for all keys is 0, with the following exceptions:
- *kernel_code_entry_byte_offset* defaults to 256.
- *wavefront_size* defaults to 6.
- *kernarg_segment_alignment*, *group_segment_alignment*, and
*private_segment_alignment* default to 4. Note that alignments are specified
as a power of two, so a value of **n** means an alignment of 2^ **n**.
The *.amd_kernel_code_t* directive must be placed immediately after the