From 491ebe17c83863affc99909757bb63bcaab7e7bc Mon Sep 17 00:00:00 2001
From: Tony <Tony.Tye@amd.com>
Date: Thu, 13 Feb 2020 01:19:25 -0500
Subject: [PATCH] [AMDGPU] AMDGPUUsage define call convention ABI

Reviewers: scott.linder, arsenm, b-sumner

Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D74861
---
 docs/AMDGPUUsage.rst | 452 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 449 insertions(+), 3 deletions(-)

diff --git a/docs/AMDGPUUsage.rst b/docs/AMDGPUUsage.rst
index 863a907f673..197765a3071 100644
--- a/docs/AMDGPUUsage.rst
+++ b/docs/AMDGPUUsage.rst
@@ -480,7 +480,7 @@ is conservatively correct for OpenCL.
                              - ``agent`` and executed by a thread on the same
                                agent.
                              - ``workgroup`` and executed by a thread in the
-                               same workgroup.
+                               same work-group.
                              - ``wavefront`` and executed by a thread in the
                                same wavefront.
 
@@ -493,7 +493,7 @@ is conservatively correct for OpenCL.
                              - ``system`` or ``agent`` and executed by a thread
                                on the same agent.
                              - ``workgroup`` and executed by a thread in the
-                               same workgroup.
+                               same work-group.
                              - ``wavefront`` and executed by a thread in the
                                same wavefront.
 
@@ -504,7 +504,7 @@ is conservatively correct for OpenCL.
                              provided the other operation's sync scope is:
 
                              - ``system``, ``agent`` or ``workgroup`` and
-                               executed by a thread in the same workgroup.
+                               executed by a thread in the same work-group.
                              - ``wavefront`` and executed by a thread in the
                                same wavefront.
 
@@ -8501,6 +8501,452 @@ the ``s_trap`` instruction with the following usage:
      reserved            ``s_trap 0xff``                 Reserved.
      =================== =============== =============== =======================
 
+.. _amdgpu-amdhsa-function-call-convention:
+
+Call Convention
+~~~~~~~~~~~~~~~
+
+.. note::
+
+  This section is currently incomplete and has inakkuracies. It is WIP that will
+  be updated as information is determined.
+
+See :ref:`amdgpu-dwarf-address-space-mapping` for information on swizzled
+addresses. Unswizzled addresses are normal linear addresses.
+
+Kernel Functions
+++++++++++++++++
+
+This section describes the call convention ABI for the outer kernel function.
+
+See :ref:`amdgpu-amdhsa-initial-kernel-execution-state` for the kernel call
+convention.
+
+The following is not part of the AMDGPU kernel calling convention but describes
+how the AMDGPU implements function calls:
+
+1.  Clang decides the kernarg layout to match the *HSA Programmer's Language
+    Reference* [HSA]_.
+
+    - All structs are passed directly.
+    - Lambda values are passed *TBA*.
+
+    .. TODO::
+
+      - Does this really follow HSA rules? Or are structs >16 bytes passed
+        by-value struct?
+      - What is ABI for lambda values?
+
+2.  The CFI return address is undefined.
+3.  If the kernel contains no calls then:
+
+    - If using the ``amdhsa`` OS ABI (see :ref:`amdgpu-os-table`), and know
+      during ISel that there is stack usage SGPR0-3 is reserved for use as the
+      scratch SRD and SGPR33 reserved for the wave scratch offset. Stack usage
+      is assumed if ``-O0``, if already aware of stack objects for locals, etc.,
+      or if there are any function calls.
+    - Otherwise, five high numbered SGPRs are reserved for the tentative scratch
+      SRD and wave scratch offset. These will be used if determine need to do
+      spilling.
+
+      - If no use is made of the tentative scratch SRD or wave scratch offset,
+        then they are unreserved and the register count is determined ignoring
+        them.
+      - If use is made of the tenatative scratch SRD or wave scratch offset,
+        then the register numbers used are shifted to be after the highest one
+        allocated by the register allocator, and all uses updated. The register
+        count will include them in the shifted location. Since register
+        allocation may introduce spills, this shifting allows them to be
+        eliminated without having to perform register allocation again.
+      - In either case, if the processor has the SGPR allocation bug, the
+        tentative allocation is not shifted or unreserved inorder to ensure the
+        register count is higher to workaround the bug.
+
+4.  If the kernel contains function calls:
+
+    - SP is set to the wave scratch offset.
+
+      - Since SP is an unswizzled address relative to the queue scratch base, an
+        wave scratch offset is an unswizzle offset, this means that if SP is
+        used to access swizzled scratch memory, it will access the private
+        segment address 0.
+
+      .. note::
+
+        This is planned to be changed to be the unswizzled base address of the
+        wavefront scratch backing memory.
+
+Non-Kernel Functions
+++++++++++++++++++++
+
+This section describes the call convention ABI for functions other than the
+outer kernel function.
+
+If a kernel has function calls then scratch is always allocated and used for the
+call stack which grows from low address to high address using the swizzled
+scratch address space.
+
+On entry to a function:
+
+1.  SGPR0-3 contain a V# with the following properties:
+
+    * Base address of the queue scratch backing memory.
+
+      .. note::
+
+        This is planned to be changed to be the unswizzled base address of the
+        wavefront scratch backing memory.
+
+    * Swizzled with dword element size and stride of wavefront size elements.
+
+2.  The FLAT_SCRATCH register pair is setup. See
+    :ref:`amdgpu-amdhsa-flat-scratch`.
+3.  GFX6-8: M0 register set to the size of LDS in bytes.
+4.  The EXEC register is set to the lanes active on entry to the function.
+5.  MODE register: *TBD*
+6.  VGPR0-31 and SGPR4-29 are used to pass function input arguments as described
+    below.
+7.  SGPR30-31 return address (RA). The code address that the function must
+    return to when it completes. The value is undefined if the function is *no
+    return*.
+8.  SGPR32 is used for the stack pointer (SP). It is an unswizzled
+    scratch offset relative to the beginning of the queue scratch backing
+    memory.
+
+    The unswizzled SP can be used with buffer instructions as an unswizzled SGPR
+    offset with the scratch V# in SGPR0-3 to access the stack in a swizzled
+    manner.
+
+    The swizzled SP value is always 4 bytes aligned for the ``r600``
+    architecture and 16 byte aligned for the ``amdgcn`` architecture.
+
+    .. note::
+
+      The ``amdgcn`` value is selected to avoid dynamic stack alignment for the
+      OpenCL language which has the largest base type defined as 16 bytes.
+
+    On entry, the swizzled SP value is the address of the first function
+    argument passed on the stack. Other stack passed arguments are positive
+    offsets from the entry swizzled SP value.
+
+    The function may use positive offsets beyond the last stack passed argument
+    for stack allocated local variables and register spill slots. If necessary
+    the function may align these to greater alignment than 16 bytes. After these
+    the function may dynamically allocate space for such things as runtime sized
+    ``alloca`` local allocations.
+
+    If the function calls another function, it will place any stack allocated
+    arguments after the last local allocation and adjust SGPR32 to the address
+    after the last local allocation.
+
+    .. note::
+
+      The SP value is planned to be changed to be the unswizzled offset relative
+      to the wavefront scratch backing memory.
+
+9.  SGPR33 wavefront scratch base offset. The unswizzled offset from the queue
+    scratch backing memory base to the base of the wavefront scratch backing
+    memory.
+
+    It is used to convert the unswizzled SP value to swizzled address in the
+    private address space by:
+
+      | private address = (unswizzled SP - wavefront scratch base offset) /
+        wavefront size
+
+    This may be used to obtain the private address of stack objects and to
+    convert these address to a flat address by adding the flat scratch aperture
+    base address.
+
+    .. note::
+
+      This is planned to be eliminated when SP is changed to be the unswizzled
+      offset relative to the wavefront scratch backing memory. The the
+      conversion simplifies to:
+
+        | private address = unswizzled SP / wavefront size
+
+10. All other registers are unspecified.
+11. Any necessary ``waitcnt`` has been performed to ensure memory is available
+    to the function.
+
+On exit from a function:
+
+1.  VGPR0-31 and SGPR4-29 are used to pass function result arguments as
+    described below. Any registers used are considered clobbered registers,
+2.  The following registers are preserved and have the same value as on entry:
+
+    * FLAT_SCRATCH
+    * EXEC
+    * GFX6-8: M0
+    * All SGPR and VGPR registers except the clobbered registers of SGPR4-31 and
+      VGPR0-31.
+
+      For the AMDGPU backend, an inter-procedural register allocation (IPRA)
+      optimization may mark some of clobbered SGPR4-31 and VGPR0-31 registers as
+      preserved if it can be determined that the called function does not change
+      their value.
+
+2.  The PC is set to the RA provided on entry.
+3.  MODE register: *TBD*.
+4.  All other registers are clobbered.
+5.  Any necessary ``waitcnt`` has been performed to ensure memory accessed by
+    function is available to the caller.
+
+.. TODO::
+
+  - On gfx908 are all ACC registers clobbered?
+
+  - How are function results returned? The address of structured types is passed
+    by reference, but what about other types?
+
+The function input arguments are made up of the formal arguments explicitly
+declared by the source language function plus the implicit input arguments used
+by the implementation.
+
+The source language input arguments are:
+
+1. Any source language implicit ``this`` or ``self`` argument comes first as a
+   pointer type.
+2. Followed by the function formal arguments in left to right source order.
+
+The source language result arguments are:
+
+1. The function result argument.
+
+The source language input or result struct type arguments that are less than or
+equal to 16 bytes, are decomposed recursively into their base type fields, and
+each field is passed as if a separate argument. For input arguments, if the
+called function requires the struct to be in memory, for example because its
+address is taken, then the function body is responsible for allocating a stack
+location and copying the field arguments into it. Clang terms this *direct
+struct*.
+
+The source language input struct type arguments that are greater than 16 bytes,
+are passed by reference. The caller is responsible for allocating a stack
+location to make a copy of the struct value and pass the address as the input
+argument. The called function is responsible to perform the dereference when
+accessing the input argument. Clang terms this *by-value struct*.
+
+A source language result struct type argument that is greater than 16 bytes, is
+returned by reference. The caller is responsible for allocating a stack location
+to hold the result value and passes the address as the last input argument
+(before the implicit input arguments). In this case there are no result
+arguments. The called function is responsible to perform the dereference when
+storing the result value. Clang terms this *structured return (sret)*.
+
+*TODO: correct the sret definition.*
+
+.. TODO::
+
+  Is this definition correct? Or is sret only used if passing in registers, and
+  pass as non-decomposed struct as stack argument? Or something else? Is the
+  memory location in the caller stack frame, or a stack memory argument and so
+  no address is passed as the caller can directly write to the argument stack
+  location. But then the stack location is still live after return. If an
+  argument stack location is it the first stack argument or the last one?
+
+Lambda argument types are treated as struct types with an implementation defined
+set of fields.
+
+.. TODO::
+
+  Need to specify the ABI for lambda types for AMDGPU.
+
+For AMDGPU backend all source language arguments (including the decomposed
+struct type arguments) are passed in VGPRs unless marked ``inreg`` in which case
+they are passed in SGPRs.
+
+The AMDGPU backend walks the function call graph from the leaves to determine
+which implicit input arguments are used, propagating to each caller of the
+function. The used implicit arguments are appended to the function arguments
+after the source language arguments in the following order:
+
+.. TODO::
+
+  Is recursion or external functions supported?
+
+1.  Work-Item ID (1 VGPR)
+
+    The X, Y and Z work-item ID are packed into a single VGRP with the following
+    layout. Only fields actually used by the function are set. The other bits
+    are undefined.
+
+    The values come from the initial kernel execution state. See
+    :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
+
+    .. table:: Work-item implict argument layout
+      :name: amdgpu-amdhsa-workitem-implict-argument-layout-table
+
+      ======= ======= ==============
+      Bits    Size    Field Name
+      ======= ======= ==============
+      9:0     10 bits X Work-Item ID
+      19:10   10 bits Y Work-Item ID
+      29:20   10 bits Z Work-Item ID
+      31:30   2 bits  Unused
+      ======= ======= ==============
+
+2.  Dispatch Ptr (2 SGPRs)
+
+    The value comes from the initial kernel execution state. See
+    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
+
+3.  Queue Ptr (2 SGPRs)
+
+    The value comes from the initial kernel execution state. See
+    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
+
+4.  Kernarg Segment Ptr (2 SGPRs)
+
+    The value comes from the initial kernel execution state. See
+    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
+
+5.  Dispatch id (2 SGPRs)
+
+    The value comes from the initial kernel execution state. See
+    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
+
+6.  Work-Group ID X (1 SGPR)
+
+    The value comes from the initial kernel execution state. See
+    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
+
+7.  Work-Group ID Y (1 SGPR)
+
+    The value comes from the initial kernel execution state. See
+    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
+
+8.  Work-Group ID Z (1 SGPR)
+
+    The value comes from the initial kernel execution state. See
+    :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
+
+9.  Implicit Argument Ptr (2 SGPRs)
+
+    The value is computed by adding an offset to Kernarg Segment Ptr to get the
+    global address space pointer to the first kernarg implicit argument.
+
+The input and result arguments are assigned in order in the following manner:
+
+..note::
+
+  There are likely some errors and ommissions in the following description that
+  need correction.
+
+  ..TODO::
+
+    Check the clang source code to decipher how funtion arguments and return
+    results are handled. Also see the AMDGPU specific values used.
+
+* VGPR arguments are assigned to consecutive VGPRs starting at VGPR0 up to
+  VGPR31.
+
+  If there are more arguments than will fit in these registers, the remaining
+  arguments are allocated on the stack in order on naturally aligned
+  addresses.
+
+  .. TODO::
+
+    How are overly aligned structures allocated on the stack?
+
+* SGPR arguments are assigned to consecutive SGPRs starting at SGPR0 up to
+  SGPR29.
+
+  If there are more arguments than will fit in these registers, the remaining
+  arguments are allocated on the stack in order on naturally aligned
+  addresses.
+
+Note that decomposed struct type arguments may have some fields passed in
+registers and some in memory.
+
+..TODO::
+
+  So a struct which can pass some fields as decomposed register arguments, will
+  pass the rest as decomposed stack elements? But an arguent that will not start
+  in registers will not be decomposed and will be passed as a non-decomposed
+  stack value?
+
+The following is not part of the AMDGPU function calling convention but
+describes how the AMDGPU implements function calls:
+
+1.  SGPR34 is used as a frame pointer (FP) if necessary. Like the SP it is an
+    unswizzled scratch address. It is only needed if runtime sized ``alloca``
+    are used, or for the reasons defined in ``SiFrameLowering``.
+2.  Runtime stack alignment is not currently supported.
+
+    .. TODO::
+
+      - If runtime stack alignment is supported then will an extra argument
+        pointer register be used?
+
+2.  Allocating SGPR arguments on the stack are not supported.
+
+3.  No CFI is currently generated. See :ref:`amdgpu-call-frame-information`.
+
+    ..note::
+
+      Before CFI is generated, the call convention will be changed so that SP is
+      an unswizzled address relative to the wave scratch base.
+
+      CFI will be generated that defines the CFA as the unswizzled address
+      relative to the wave scratch base in the unswizzled private address space
+      of the lowest address stack allocated local variable.
+
+      ``DW_AT_frame_base`` will be defined as the swizelled address in the
+      swizzled private address space by dividing the CFA by the wavefront size
+      (since CFA is always at least dword aligned which matches the scratch
+      swizzle element size).
+
+      If no dynamic stack alignment was performed, the stack allocated arguments
+      are accessed as negative offsets relative to ``DW_AT_frame_base``, and the
+      local variables and register spill slots are accessed as positive offsets
+      relative to ``DW_AT_frame_base``.
+
+4.  Function argument passing is implemented by copying the input physical
+    registers to virtual registers on entry. The register allocator can spill if
+    necessary. These are copied back to physical registers at call sites. The
+    net effect is that each function call can have these values in entirely
+    distinct locations. The IPRA can help avoid shuffling argument registers.
+5.  Call sites are implemented by setting up the arguments at positive offsets
+    from SP. Then SP is incremented to account for the known frame size before
+    the call and decremented after the call.
+
+    ..note::
+
+      The CFI will reflect the changed calculation needed to compute the CFA
+      from SP.
+
+6.  4 byte spill slots are used in the stack frame. One slot is allocated for an
+    emergency spill slot. Buffer instructions are used for stack accesses and
+    not the ``flat_scratch`` instruction.
+
+    ..TODO::
+
+      Explain when the emergency spill slot is used.
+
+.. TODO::
+
+  Possible broken issues:
+
+  - Stack arguments must be aligned to required alignment.
+  - Stack is aligned to max(16, max formal argument alignment)
+  - Direct argument < 64 bits should check register budget.
+  - Register budget calculation should respect ``inreg`` for SGPR.
+  - SGPR overflow is not handled.
+  - struct with 1 member unpeeling is not checking size of member.
+  - ``sret`` is after ``this`` pointer.
+  - Caller is not implementing stack realignment: need an extra pointer.
+  - Should say AMDGPU passes FP rather than SP.
+  - Should CFI define CFA as address of locals or arguments. Difference is
+    apparent when have implemented dynamic alignment.
+  - If ``SCRATCH`` instruction could allow negative offsets then can make FP be
+    highest address of stack frame and use negative offset for locals. Would
+    allow SP to be the same as FP and could support signal-handler-like as now
+    have a real SP for the top of the stack.
+  - How is ``sret`` passed on the stack? In argument stack area? Can it overlay
+    arguments?
+
 AMDPAL
 ------