[AMDGPU] Corrections to memory model description.

- Add description on nontemporal support.
 - Correct OpenCL sequentially consistent and fence code sequences.
 - Minor test cleanup.

Differential Revision: https://reviews.llvm.org/D39073


git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@316131 91177308-0d34-0410-b5e6-96231b3b80d8
Tony Tye 2017-10-18 22:16:55 +00:00
parent b5cb868aaa
commit 0a09220c32


@ -1240,7 +1240,7 @@ non-AMD key names should be prefixed by "*vendor-name*.".
=================================== ============== ========= ==============
.. TODO
Plan to remove the debug properties metadata.
Kernel Dispatch
~~~~~~~~~~~~~~~
@ -1431,9 +1431,9 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
.. table:: Kernel Descriptor for GFX6-GFX9
:name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
======= ======= =============================== ===========================
======= ======= =============================== ============================
Bits Size Field Name Description
======= ======= =============================== ===========================
======= ======= =============================== ============================
31:0 4 bytes GroupSegmentFixedSize The amount of fixed local
address space memory
required for a work-group
@ -1461,7 +1461,7 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
97 1 bit IsXNACKEnabled Indicates if the generated
machine code is capable of
supporting XNACK.
127:98 30 bits Reserved. Must be 0.
127:98 30 bits Reserved, must be 0.
191:128 8 bytes KernelCodeEntryByteOffset Byte offset (possibly
negative) from base
address of kernel
@ -1469,7 +1469,7 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
entry point instruction
which must be 256 byte
aligned.
383:192 24 Reserved. Must be 0.
383:192 24 Reserved, must be 0.
bytes
415:384 4 bytes ComputePgmRsrc1 Compute Shader (CS)
program settings used by
@ -1477,7 +1477,7 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
``COMPUTE_PGM_RSRC1``
configuration
register. See
:ref:`amdgpu-amdhsa-compute_pgm_rsrc1_t-gfx6-gfx9-table`.
:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
447:416 4 bytes ComputePgmRsrc2 Compute Shader (CS)
program settings used by
CP to set up
@ -1509,16 +1509,16 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
should always be 0.
457 1 bit EnableSGPRGridWorkgroupCountZ Not implemented in CP and
should always be 0.
463:458 6 bits Reserved. Must be 0.
511:464 6 Reserved. Must be 0.
463:458 6 bits Reserved, must be 0.
511:464 6 Reserved, must be 0.
bytes
512 **Total size 64 bytes.**
======= ===================================================================
======= ====================================================================
..
.. table:: compute_pgm_rsrc1 for GFX6-GFX9
:name: amdgpu-amdhsa-compute_pgm_rsrc1_t-gfx6-gfx9-table
:name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table
======= ======= =============================== ===========================================================================
Bits Size Field Name Description
@ -1529,8 +1529,9 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
specific:
GFX6-9
roundup((max-vgpg + 1)
/ 4) - 1
- max_vgpr 1..256
- roundup((max_vgpr + 1)
/ 4) - 1
Used by CP to set up
``COMPUTE_PGM_RSRC1.VGPRS``.
@ -1540,11 +1541,13 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
specific:
GFX6-8
roundup((max-sgpg + 1)
/ 8) - 1
- max_sgpr 1..112
- roundup((max_sgpr + 1)
/ 8) - 1
GFX9
roundup((max-sgpg + 1)
/ 16) - 1
- max_sgpr 1..112
- roundup((max_sgpr + 1)
/ 16) - 1
Includes the special SGPRs
for VCC, Flat Scratch (for
@ -1628,7 +1631,7 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
21 1 bit ENABLE_DX10_CLAMP Wavefront starts execution
with DX10 clamp mode
enabled. Used by the vector
ALU to force DX-10 style
ALU to force DX10 style
treatment of NaN's (when
set, clamp NaN to zero,
otherwise pass NaN
@ -1676,29 +1679,25 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
CP is responsible for
filling in
``COMPUTE_PGM_RSRC1.CDBG_USER``.
26 1 bit FP16_OVFL GFX6-8:
Reserved. Must be 0.
GFX9:
Wavefront starts
execution with specified
fp16 overflow mode.
26 1 bit FP16_OVFL GFX6-8
Reserved, must be 0.
GFX9
Wavefront starts execution
with specified fp16 overflow
mode.
- If 0, then fp16
overflow generates
- If 0, fp16 overflow generates
+/-INF values.
- If 1, then fp16
overflow that is the
result of an +/-INF
input value or divide
by 0 generates a
+/-INF, otherwise
clamps computed
overflow to +/-MAX_FP16
as appropriate.
- If 1, fp16 overflow that is the
result of an +/-INF input value
or divide by 0 produces a +/-INF,
otherwise clamps computed
overflow to +/-MAX_FP16 as
appropriate.
Used by CP to set up
``COMPUTE_PGM_RSRC1.FP16_OVFL``.
31:27 5 bits Reserved. Must be 0.
31:27 5 bits Reserved, must be 0.
32 **Total size 4 bytes**
======= ===================================================================================================================
@ -1855,7 +1854,7 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero
_ZERO (rcp_iflag_f32 instruction
only)
31 1 bit Reserved. Must be 0.
31 1 bit Reserved, must be 0.
32 **Total size 4 bytes.**
======= ===================================================================================================================
@ -2245,9 +2244,6 @@ This section describes the mapping of LLVM memory model onto AMDGPU machine code
.. TODO
Update when implementation complete.
Support more relaxed OpenCL memory model to be controlled by environment
component of target triple.
The AMDGPU backend supports the memory synchronization scopes specified in
:ref:`amdgpu-memory-scopes`.
@ -2264,19 +2260,23 @@ additional ``s_waitcnt`` instructions are required to ensure registers are
defined before being used. These may be able to be combined with the memory
model ``s_waitcnt`` instructions as described above.
The AMDGPU memory model supports both the HSA [HSA]_ memory model, and the
OpenCL [OpenCL]_ memory model. The HSA memory model uses a single happens-before
relation for all address spaces (see :ref:`amdgpu-address-spaces`). The OpenCL
memory model which has separate happens-before relations for the global and
local address spaces, and only a fence specifying both global and local address
space joins the relationships. Since the LLVM ``memfence`` instruction does not
allow an address space to be specified the OpenCL fence has to convervatively
assume both local and global address space was specified. However, optimizations
can often be done to eliminate the additional ``s_waitcnt``instructions when
there are no intervening corresponding ``ds/flat_load/store/atomic`` memory
instructions. The code sequences in the table indicate what can be omitted for
the OpenCL memory. The target triple environment is used to determine if the
source language is OpenCL (see :ref:`amdgpu-opencl`).
The AMDGPU backend supports the following memory models:
HSA Memory Model [HSA]_
The HSA memory model uses a single happens-before relation for all address
spaces (see :ref:`amdgpu-address-spaces`).
OpenCL Memory Model [OpenCL]_
The OpenCL memory model has separate happens-before relations for the
global and local address spaces. Only a fence specifying both the global
and local address spaces, and seq_cst instructions, join the
relationships. Since the LLVM ``fence`` instruction does not allow an
address space to be specified, the OpenCL fence has to conservatively
assume both the local and global address spaces were specified. However,
optimizations can often be done to eliminate the additional ``s_waitcnt``
instructions when there are no intervening memory instructions which
access the corresponding address space. The code sequences in the table
indicate what can be omitted for the OpenCL memory model. The target
triple environment is used to determine if the source language is OpenCL
(see :ref:`amdgpu-opencl`).
``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
operations.
@ -2308,11 +2308,11 @@ For GFX6-GFX9:
that for GFX7-9 ``flat_load/store/atomic`` instructions can report out of
vector memory order if they access LDS memory, and out of LDS operation order
if they access global memory.
* The vector memory operations access a vector L1 cache shared by all wavefronts
on a CU. Therefore, no special action is required for coherence between
wavefronts in the same work-group. A ``buffer_wbinvl1_vol`` is required for
coherence between waves executing in different work-groups as they may be
executing on different CUs.
* The vector memory operations access a single vector L1 cache shared by all
SIMDs on a CU. Therefore, no special action is required for coherence between the
lanes of a single wavefront, or for coherence between wavefronts in the same
work-group. A ``buffer_wbinvl1_vol`` is required for coherence between waves
executing in different work-groups as they may be executing on different CUs.
* The scalar memory operations access a scalar L1 cache shared by all wavefronts
on a group of CUs. The scalar and vector L1 caches are not coherent. However,
scalar operations are used in a restricted way so do not impact the memory
@ -2376,45 +2376,62 @@ future wave that uses the same scratch area, or a function call that creates a
frame at the same address, respectively. There is no need for a ``s_dcache_inv``
as all scalar writes are write-before-read in the same thread.
Scratch backing memory (which is used for the private address space) is accessed
with MTYPE NC_NV (non-coherenent non-volatile). Since the private address space
is only accessed by a single thread, and is always write-before-read,
there is never a need to invalidate these entries from the L1 cache. Hence all
cache invalidates are done as ``*_vol`` to only invalidate the volatile cache
lines.
Scratch backing memory (which is used for the private address space)
is accessed with MTYPE NC_NV (non-coherent non-volatile). Since the private
address space is only accessed by a single thread, and is always
write-before-read, there is never a need to invalidate these entries from the L1
cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the
volatile cache lines.
On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
to invalidate the L2 cache. This also causes it to be treated as non-volatile
and so is not invalidated by ``*_vol``. On APU it is accessed as CC (cache
coherent) and so the L2 cache will coherent with the CPU and other agents.
to invalidate the L2 cache. This also causes it to be treated as
non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC
(cache coherent) and so the L2 cache will be coherent with the CPU and other
agents.
.. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
:name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
============ ============ ============== ========== =======================
============ ============ ============== ========== ===============================
LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
Ordering Sync Scope Address
Space
============ ============ ============== ========== =======================
============ ============ ============== ========== ===============================
**Non-Atomic**
---------------------------------------------------------------------------
load *none* *none* - global non-volatile
- generic 1. buffer/global/flat_load
volatile
-----------------------------------------------------------------------------------
load *none* *none* - global - !volatile & !nontemporal
- generic
- private 1. buffer/global/flat_load
- constant
- volatile & !nontemporal
1. buffer/global/flat_load
glc=1
- nontemporal
1. buffer/global/flat_load
glc=1 slc=1
load *none* *none* - local 1. ds_load
store *none* *none* - global 1. buffer/global/flat_store
store *none* *none* - global - !nontemporal
- generic
- private 1. buffer/global/flat_store
- constant
- nontemporal
1. buffer/global/flat_store
glc=1 slc=1
store *none* *none* - local 1. ds_store
**Unordered Atomic**
---------------------------------------------------------------------------
-----------------------------------------------------------------------------------
load atomic unordered *any* *any* *Same as non-atomic*.
store atomic unordered *any* *any* *Same as non-atomic*.
atomicrmw unordered *any* *any* *Same as monotonic
atomic*.
**Monotonic Atomic**
---------------------------------------------------------------------------
-----------------------------------------------------------------------------------
load atomic monotonic - singlethread - global 1. buffer/global/flat_load
- wavefront - generic
- workgroup
@ -2440,16 +2457,15 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
- wavefront
- workgroup
**Acquire Atomic**
---------------------------------------------------------------------------
-----------------------------------------------------------------------------------
load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
- wavefront - local
- generic
load atomic acquire - workgroup - global 1. buffer/global_load
load atomic acquire - workgroup - local 1. ds/flat_load
- generic 2. s_waitcnt lgkmcnt(0)
load atomic acquire - workgroup - global 1. buffer/global/flat_load
load atomic acquire - workgroup - local 1. ds_load
2. s_waitcnt lgkmcnt(0)
- If OpenCL, omit
waitcnt.
- If OpenCL, omit.
- Must happen before
any following
global/generic
@ -2462,8 +2478,23 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
older than the load
atomic value being
acquired.
load atomic acquire - workgroup - generic 1. flat_load
2. s_waitcnt lgkmcnt(0)
load atomic acquire - agent - global 1. buffer/global_load
- If OpenCL, omit.
- Must happen before
any following
global/generic
load/load
atomic/store/store
atomic/atomicrmw.
- Ensures any
following global
data read is no
older than the load
atomic value being
acquired.
load atomic acquire - agent - global 1. buffer/global/flat_load
- system glc=1
2. s_waitcnt vmcnt(0)
@ -2516,12 +2547,11 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
- wavefront - local
- generic
atomicrmw acquire - workgroup - global 1. buffer/global_atomic
atomicrmw acquire - workgroup - local 1. ds/flat_atomic
- generic 2. waitcnt lgkmcnt(0)
atomicrmw acquire - workgroup - global 1. buffer/global/flat_atomic
atomicrmw acquire - workgroup - local 1. ds_atomic
2. waitcnt lgkmcnt(0)
- If OpenCL, omit
waitcnt.
- If OpenCL, omit.
- Must happen before
any following
global/generic
@ -2535,7 +2565,24 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
atomicrmw value
being acquired.
atomicrmw acquire - agent - global 1. buffer/global_atomic
atomicrmw acquire - workgroup - generic 1. flat_atomic
2. waitcnt lgkmcnt(0)
- If OpenCL, omit.
- Must happen before
any following
global/generic
load/load
atomic/store/store
atomic/atomicrmw.
- Ensures any
following global
data read is no
older than the
atomicrmw value
being acquired.
atomicrmw acquire - agent - global 1. buffer/global/flat_atomic
- system 2. s_waitcnt vmcnt(0)
- Must happen before
@ -2592,9 +2639,8 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
- If OpenCL and
address space is
not generic, omit
waitcnt. However,
since LLVM
not generic, omit.
- However, since LLVM
currently has no
address space on
the fence need to
@ -2633,14 +2679,14 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
value read by the
fence-paired-atomic.
fence acquire - agent *none* 1. s_waitcnt vmcnt(0) &
- system lgkmcnt(0)
fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
- system vmcnt(0)
- If OpenCL and
address space is
not generic, omit
lgkmcnt(0).
However, since LLVM
- However, since LLVM
currently has no
address space on
the fence need to
@ -2672,7 +2718,7 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
- s_waitcnt lgkmcnt(0)
must happen after
any preceding
group/generic load
local/generic load
atomic/atomicrmw
with an equal or
wider sync scope
@ -2699,8 +2745,8 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
2. buffer_wbinvl1_vol
- Must happen before
any following global/generic
- Must happen before any
following global/generic
load/load
atomic/store/store
atomic/atomicrmw.
@ -2710,14 +2756,13 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
global data.
**Release Atomic**
---------------------------------------------------------------------------
-----------------------------------------------------------------------------------
store atomic release - singlethread - global 1. buffer/global/ds/flat_store
- wavefront - local
- generic
store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
- generic
- If OpenCL, omit
waitcnt.
- If OpenCL, omit.
- Must happen after
any preceding
local/generic
@ -2737,8 +2782,29 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
2. buffer/global/flat_store
store atomic release - workgroup - local 1. ds_store
store atomic release - agent - global 1. s_waitcnt vmcnt(0) &
- system - generic lgkmcnt(0)
store atomic release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
- If OpenCL, omit.
- Must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
- Must happen before
the following
store.
- Ensures that all
memory operations
to local have
completed before
performing the
store that is being
released.
2. flat_store
store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
- system - generic vmcnt(0)
- If OpenCL, omit
lgkmcnt(0).
@ -2770,7 +2836,7 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
store.
- Ensures that all
memory operations
to global have
to memory have
completed before
performing the
store that is being
@ -2781,9 +2847,8 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
- wavefront - local
- generic
atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
- generic
- If OpenCL, omit
waitcnt.
- If OpenCL, omit.
- Must happen after
any preceding
local/generic
@ -2803,8 +2868,29 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
2. buffer/global/flat_atomic
atomicrmw release - workgroup - local 1. ds_atomic
atomicrmw release - agent - global 1. s_waitcnt vmcnt(0) &
- system - generic lgkmcnt(0)
atomicrmw release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
- If OpenCL, omit.
- Must happen after
any preceding
local/generic
load/store/load
atomic/store
atomic/atomicrmw.
- Must happen before
the following
atomicrmw.
- Ensures that all
memory operations
to local have
completed before
performing the
atomicrmw that is
being released.
2. flat_atomic
atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
- system - generic vmcnt(0)
- If OpenCL, omit
lgkmcnt(0).
@ -2842,23 +2928,29 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
the atomicrmw that
is being released.
2. buffer/global/ds/flat_atomic*
2. buffer/global/ds/flat_atomic
fence release - singlethread *none* *none*
- wavefront
fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0)
- If OpenCL and
address space is
not generic, omit
waitcnt. However,
since LLVM
not generic, omit.
- However, since LLVM
currently has no
address space on
the fence need to
conservatively
always generate
(see comment for
previous fence).
always generate. If
fence had an
address space then
set to address
space of OpenCL
fence flag, or to
generic if both
local and global
flags are
specified.
- Must happen after
any preceding
local/generic
@ -2883,21 +2975,32 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
following
fence-paired-atomic.
fence release - agent *none* 1. s_waitcnt vmcnt(0) &
- system lgkmcnt(0)
fence release - agent *none* 1. s_waitcnt lgkmcnt(0) &
- system vmcnt(0)
- If OpenCL and
address space is
not generic, omit
lgkmcnt(0).
However, since LLVM
- If OpenCL and
address space is
local, omit
vmcnt(0).
- However, since LLVM
currently has no
address space on
the fence need to
conservatively
always generate
(see comment for
previous fence).
always generate. If
fence had an
address space then
set to address
space of OpenCL
fence flag, or to
generic if both
local and global
flags are
specified.
- Could be split into
separate s_waitcnt
vmcnt(0) and
@ -2933,21 +3036,20 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
fence-paired-atomic).
- Ensures that all
memory operations
to global have
have
completed before
performing the
following
fence-paired-atomic.
**Acquire-Release Atomic**
---------------------------------------------------------------------------
-----------------------------------------------------------------------------------
atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
- wavefront - local
- generic
atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
- If OpenCL, omit
waitcnt.
- If OpenCL, omit.
- Must happen after
any preceding
local/generic
@ -2965,12 +3067,11 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
atomicrmw that is
being released.
2. buffer/global_atomic
2. buffer/global/flat_atomic
atomicrmw acq_rel - workgroup - local 1. ds_atomic
2. s_waitcnt lgkmcnt(0)
- If OpenCL, omit
waitcnt.
- If OpenCL, omit.
- Must happen before
any following
global/generic
@ -2986,8 +3087,7 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
- If OpenCL, omit
waitcnt.
- If OpenCL, omit.
- Must happen after
any preceding
local/generic
@ -3008,8 +3108,7 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
2. flat_atomic
3. s_waitcnt lgkmcnt(0)
- If OpenCL, omit
waitcnt.
- If OpenCL, omit.
- Must happen before
any following
global/generic
@ -3022,8 +3121,9 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
older than the load
atomic value being
acquired.
atomicrmw acq_rel - agent - global 1. s_waitcnt vmcnt(0) &
- system lgkmcnt(0)
atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
- system vmcnt(0)
- If OpenCL, omit
lgkmcnt(0).
@ -3061,7 +3161,7 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
atomicrmw that is
being released.
2. buffer/global_atomic
2. buffer/global/flat_atomic
3. s_waitcnt vmcnt(0)
- Must happen before
@ -3085,8 +3185,8 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
will not see stale
global data.
atomicrmw acq_rel - agent - generic 1. s_waitcnt vmcnt(0) &
- system lgkmcnt(0)
atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
- system vmcnt(0)
- If OpenCL, omit
lgkmcnt(0).
@ -3157,8 +3257,8 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
- If OpenCL and
address space is
not generic, omit
waitcnt. However,
not generic, omit.
- However,
since LLVM
currently has no
address space on
@ -3196,8 +3296,8 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
stronger than
unordered (this is
termed the
fence-paired-atomic)
has completed
acquire-fence-paired-atomic
) has completed
before following
global memory
operations. This
@ -3217,19 +3317,19 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
stronger than
unordered (this is
termed the
fence-paired-atomic).
This satisfies the
release-fence-paired-atomic
). This satisfies the
requirements of
release.
fence acq_rel - agent *none* 1. s_waitcnt vmcnt(0) &
- system lgkmcnt(0)
fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) &
- system vmcnt(0)
- If OpenCL and
address space is
not generic, omit
lgkmcnt(0).
However, since LLVM
- However, since LLVM
currently has no
address space on
the fence need to
@ -3274,8 +3374,8 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
stronger than
unordered (this is
termed the
fence-paired-atomic)
has completed
acquire-fence-paired-atomic
) has completed
before invalidating
the cache. This
satisfies the
@ -3295,8 +3395,8 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
stronger than
unordered (this is
termed the
fence-paired-atomic).
This satisfies the
release-fence-paired-atomic
). This satisfies the
requirements of
release.
@ -3317,13 +3417,103 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
acquire.
**Sequential Consistent Atomic**
---------------------------------------------------------------------------
-----------------------------------------------------------------------------------
load atomic seq_cst - singlethread - global *Same as corresponding
- wavefront - local load atomic acquire*.
- workgroup - generic
load atomic seq_cst - agent - global 1. s_waitcnt vmcnt(0)
- system - local
- generic - Must happen after
- wavefront - local load atomic acquire,
- generic except must generate
all instructions even
for OpenCL.*
load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0)
- generic
- Must
happen after
preceding
global/generic load
atomic/store
atomic/atomicrmw
with memory
ordering of seq_cst
and with equal or
wider sync scope.
(Note that seq_cst
fences have their
own s_waitcnt
lgkmcnt(0) and so do
not need to be
considered.)
- Ensures any
preceding
sequential
consistent local
memory instructions
have completed
before executing
this sequentially
consistent
instruction. This
prevents reordering
a seq_cst store
followed by a
seq_cst load. (Note
that seq_cst is
stronger than
acquire/release as
the reordering of
load acquire
followed by a store
release is
prevented by the
waitcnt of
the release, but
there is nothing
preventing a store
release followed by
load acquire from
completing out of
order.)
2. *Following
instructions same as
corresponding load
atomic acquire,
except must generate
all instructions even
for OpenCL.*
load atomic seq_cst - workgroup - local *Same as corresponding
load atomic acquire,
except must generate
all instructions even
for OpenCL.*
load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
- system - generic vmcnt(0)
- Could be split into
separate s_waitcnt
vmcnt(0)
and s_waitcnt
lgkmcnt(0) to allow
them to be
independently moved
according to the
following rules.
- waitcnt lgkmcnt(0)
must happen after
preceding
global/generic load
atomic/store
atomic/atomicrmw
with memory
ordering of seq_cst
and with equal or
wider sync scope.
(Note that seq_cst
fences have their
own s_waitcnt
lgkmcnt(0) and so do
not need to be
considered.)
- waitcnt vmcnt(0)
must happen after
preceding
global/generic load
atomic/store
@ -3351,7 +3541,7 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
prevents reordering
a seq_cst store
followed by a
seq_cst load (Note
seq_cst load. (Note
that seq_cst is
stronger than
acquire/release as
@ -3360,7 +3550,7 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
followed by a store
release is
prevented by the
waitcnt vmcnt(0) of
waitcnt of
the release, but
there is nothing
preventing a store
@ -3372,24 +3562,36 @@ coherent) and so the L2 cache will coherent with the CPU and other agents.
2. *Following
instructions same as
corresponding load
atomic acquire*.
atomic acquire,
except must generate
all instructions even
for OpenCL.*
store atomic seq_cst - singlethread - global *Same as corresponding
- wavefront - local store atomic release*.
- workgroup - generic
- wavefront - local store atomic release,
- workgroup - generic except must generate
all instructions even
for OpenCL.*
store atomic seq_cst - agent - global *Same as corresponding
- system - generic store atomic release*.
- system - generic store atomic release,
except must generate
all instructions even
for OpenCL.*
atomicrmw seq_cst - singlethread - global *Same as corresponding
- wavefront - local atomicrmw acq_rel*.
- workgroup - generic
- wavefront - local atomicrmw acq_rel,
- workgroup - generic except must generate
all instructions even
for OpenCL.*
atomicrmw seq_cst - agent - global *Same as corresponding
- system - generic atomicrmw acq_rel*.
- system - generic atomicrmw acq_rel,
except must generate
all instructions even
for OpenCL.*
fence seq_cst - singlethread *none* *Same as corresponding
- wavefront fence acq_rel*.
- workgroup
- agent
- system
============ ============ ============== ========== =======================
- wavefront fence acq_rel,
- workgroup except must generate
- agent all instructions even
- system for OpenCL.*
============ ============ ============== ========== ===============================
The memory order also adds the single thread optimization constraints defined in
table
@ -3799,7 +4001,7 @@ used. The default value for all keys is 0, with the following exceptions:
- *kernel_code_entry_byte_offset* defaults to 256.
- *wavefront_size* defaults to 6.
- *kernarg_segment_alignment*, *group_segment_alignment*, and
*private_segment_alignment* default to 4. Note that alignments are specified
as a power of two, so a value of **n** means an alignment of 2^ **n**.
The *.amd_kernel_code_t* directive must be placed immediately after the