From 491ebe17c83863affc99909757bb63bcaab7e7bc Mon Sep 17 00:00:00 2001 From: Tony Date: Thu, 13 Feb 2020 01:19:25 -0500 Subject: [PATCH] [AMDGPU] AMDGPUUsage define call convention ABI Reviewers: scott.linder, arsenm, b-sumner Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, kerbowa, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D74861 --- docs/AMDGPUUsage.rst | 452 ++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 449 insertions(+), 3 deletions(-) diff --git a/docs/AMDGPUUsage.rst b/docs/AMDGPUUsage.rst index 863a907f673..197765a3071 100644 --- a/docs/AMDGPUUsage.rst +++ b/docs/AMDGPUUsage.rst @@ -480,7 +480,7 @@ is conservatively correct for OpenCL. - ``agent`` and executed by a thread on the same agent. - ``workgroup`` and executed by a thread in the - same workgroup. + same work-group. - ``wavefront`` and executed by a thread in the same wavefront. @@ -493,7 +493,7 @@ is conservatively correct for OpenCL. - ``system`` or ``agent`` and executed by a thread on the same agent. - ``workgroup`` and executed by a thread in the - same workgroup. + same work-group. - ``wavefront`` and executed by a thread in the same wavefront. @@ -504,7 +504,7 @@ is conservatively correct for OpenCL. provided the other operation's sync scope is: - ``system``, ``agent`` or ``workgroup`` and - executed by a thread in the same workgroup. + executed by a thread in the same work-group. - ``wavefront`` and executed by a thread in the same wavefront. @@ -8501,6 +8501,452 @@ the ``s_trap`` instruction with the following usage: reserved ``s_trap 0xff`` Reserved. =================== =============== =============== ======================= +.. _amdgpu-amdhsa-function-call-convention: + +Call Convention +~~~~~~~~~~~~~~~ + +.. note:: + + This section is currently incomplete and has inaccuracies. It is WIP that will + be updated as information is determined. 
+ +See :ref:`amdgpu-dwarf-address-space-mapping` for information on swizzled +addresses. Unswizzled addresses are normal linear addresses. + +Kernel Functions +++++++++++++++++ + +This section describes the call convention ABI for the outer kernel function. + +See :ref:`amdgpu-amdhsa-initial-kernel-execution-state` for the kernel call +convention. + +The following is not part of the AMDGPU kernel calling convention but describes +how the AMDGPU implements function calls: + +1. Clang decides the kernarg layout to match the *HSA Programmer's Language + Reference* [HSA]_. + + - All structs are passed directly. + - Lambda values are passed *TBA*. + + .. TODO:: + + - Does this really follow HSA rules? Or are structs >16 bytes passed + by-value struct? + - What is the ABI for lambda values? + +2. The CFI return address is undefined. +3. If the kernel contains no calls then: + + - If using the ``amdhsa`` OS ABI (see :ref:`amdgpu-os-table`), and it is + known during ISel that there is stack usage, SGPR0-3 is reserved for use as + the scratch SRD and SGPR33 is reserved for the wave scratch offset. Stack + usage is assumed at ``-O0``, if stack objects for locals, etc. are already + known, or if there are any function calls. + - Otherwise, five high numbered SGPRs are reserved for the tentative scratch + SRD and wave scratch offset. These will be used if it is determined that + spilling is needed. + + - If no use is made of the tentative scratch SRD or wave scratch offset, + then they are unreserved and the register count is determined ignoring + them. + - If use is made of the tentative scratch SRD or wave scratch offset, + then the register numbers used are shifted to be after the highest one + allocated by the register allocator, and all uses are updated. The + register count will include them in the shifted location. Since register + allocation may introduce spills, this shifting allows them to be + eliminated without having to perform register allocation again. 
+ - In either case, if the processor has the SGPR allocation bug, the + tentative allocation is not shifted or unreserved in order to ensure the + register count is higher to work around the bug. + +4. If the kernel contains function calls: + + - SP is set to the wave scratch offset. + + - Since SP is an unswizzled address relative to the queue scratch base, and + the wave scratch offset is an unswizzled offset, this means that if SP is + used to access swizzled scratch memory, it will access the private + segment address 0. + + .. note:: + + This is planned to be changed to be the unswizzled base address of the + wavefront scratch backing memory. + +Non-Kernel Functions +++++++++++++++++++++ + +This section describes the call convention ABI for functions other than the +outer kernel function. + +If a kernel has function calls then scratch is always allocated and used for the +call stack which grows from low address to high address using the swizzled +scratch address space. + +On entry to a function: + +1. SGPR0-3 contain a V# with the following properties: + + * Base address of the queue scratch backing memory. + + .. note:: + + This is planned to be changed to be the unswizzled base address of the + wavefront scratch backing memory. + + * Swizzled with dword element size and stride of wavefront size elements. + +2. The FLAT_SCRATCH register pair is set up. See + :ref:`amdgpu-amdhsa-flat-scratch`. +3. GFX6-8: M0 register is set to the size of LDS in bytes. +4. The EXEC register is set to the lanes active on entry to the function. +5. MODE register: *TBD* +6. VGPR0-31 and SGPR4-29 are used to pass function input arguments as described + below. +7. SGPR30-31 contain the return address (RA). This is the code address that the + function must return to when it completes. The value is undefined if the + function is *no return*. +8. SGPR32 is used for the stack pointer (SP). It is an unswizzled + scratch offset relative to the beginning of the queue scratch backing + memory. 
+ + The unswizzled SP can be used with buffer instructions as an unswizzled SGPR + offset with the scratch V# in SGPR0-3 to access the stack in a swizzled + manner. + + The swizzled SP value is always 4 byte aligned for the ``r600`` + architecture and 16 byte aligned for the ``amdgcn`` architecture. + + .. note:: + + The ``amdgcn`` value is selected to avoid dynamic stack alignment for the + OpenCL language which has the largest base type defined as 16 bytes. + + On entry, the swizzled SP value is the address of the first function + argument passed on the stack. Other stack passed arguments are positive + offsets from the entry swizzled SP value. + + The function may use positive offsets beyond the last stack passed argument + for stack allocated local variables and register spill slots. If necessary + the function may align these to greater alignment than 16 bytes. After these + the function may dynamically allocate space for such things as runtime sized + ``alloca`` local allocations. + + If the function calls another function, it will place any stack allocated + arguments after the last local allocation and adjust SGPR32 to the address + after the last local allocation. + + .. note:: + + The SP value is planned to be changed to be the unswizzled offset relative + to the wavefront scratch backing memory. + +9. SGPR33 wavefront scratch base offset. The unswizzled offset from the queue + scratch backing memory base to the base of the wavefront scratch backing + memory. + + It is used to convert the unswizzled SP value to a swizzled address in the + private address space by: + + | private address = (unswizzled SP - wavefront scratch base offset) / + wavefront size + + This may be used to obtain the private address of stack objects and to + convert these addresses to a flat address by adding the flat scratch aperture + base address. + + .. 
note:: + + This is planned to be eliminated when SP is changed to be the unswizzled + offset relative to the wavefront scratch backing memory. Then the + conversion simplifies to: + + | private address = unswizzled SP / wavefront size + +10. All other registers are unspecified. +11. Any necessary ``waitcnt`` has been performed to ensure memory is available + to the function. + +On exit from a function: + +1. VGPR0-31 and SGPR4-29 are used to pass function result arguments as + described below. Any registers used are considered clobbered registers. +2. The following registers are preserved and have the same value as on entry: + + * FLAT_SCRATCH + * EXEC + * GFX6-8: M0 + * All SGPR and VGPR registers except the clobbered registers of SGPR4-31 and + VGPR0-31. + + For the AMDGPU backend, an inter-procedural register allocation (IPRA) + optimization may mark some of the clobbered SGPR4-31 and VGPR0-31 registers + as preserved if it can be determined that the called function does not + change their value. + +3. The PC is set to the RA provided on entry. +4. MODE register: *TBD*. +5. All other registers are clobbered. +6. Any necessary ``waitcnt`` has been performed to ensure memory accessed by + the function is available to the caller. + +.. TODO:: + + - On gfx908 are all ACC registers clobbered? + + - How are function results returned? The address of structured types is passed + by reference, but what about other types? + +The function input arguments are made up of the formal arguments explicitly +declared by the source language function plus the implicit input arguments used +by the implementation. + +The source language input arguments are: + +1. Any source language implicit ``this`` or ``self`` argument comes first as a + pointer type. +2. Followed by the function formal arguments in left to right source order. + +The source language result arguments are: + +1. The function result argument. 
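As an illustration of the SGPR33 conversion described above, the following C++ sketch applies the stated formula to plain integers. The function names, the parameter names and the choice of 32-bit byte offsets are assumptions made for this example only; they are not taken from LLVM or from the hardware documentation.

```cpp
#include <cstdint>

// Hypothetical helper (not an LLVM API). Applies the formula given above:
//   private address =
//       (unswizzled SP - wavefront scratch base offset) / wavefront size
// All three inputs are assumed to be byte quantities that fit in 32 bits.
constexpr std::uint32_t privateAddressFromSP(std::uint32_t unswizzledSP,
                                             std::uint32_t waveScratchBaseOffset,
                                             std::uint32_t wavefrontSize) {
  return (unswizzledSP - waveScratchBaseOffset) / wavefrontSize;
}

// Hypothetical helper. Forms a flat address by adding the flat scratch
// aperture base address, which is assumed to be supplied by the caller.
constexpr std::uint64_t flatAddressFromPrivate(
    std::uint64_t flatScratchApertureBase, std::uint32_t privateAddress) {
  return flatScratchApertureBase + privateAddress;
}
```

For instance, with a wavefront size of 64, a wavefront scratch base offset of 0x4000 and an unswizzled SP of 0x4400, ``privateAddressFromSP`` yields private address 16. If the planned change to SP is made, the same sketch would apply with a base offset of 0.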
+ +The source language input or result struct type arguments that are less than or +equal to 16 bytes are decomposed recursively into their base type fields, and +each field is passed as if it were a separate argument. For input arguments, if +the called function requires the struct to be in memory, for example because its +address is taken, then the function body is responsible for allocating a stack +location and copying the field arguments into it. Clang terms this *direct +struct*. + +The source language input struct type arguments that are greater than 16 bytes +are passed by reference. The caller is responsible for allocating a stack +location to make a copy of the struct value and for passing the address as the +input argument. The called function is responsible for performing the +dereference when accessing the input argument. Clang terms this *by-value +struct*. + +A source language result struct type argument that is greater than 16 bytes is +returned by reference. The caller is responsible for allocating a stack location +to hold the result value and for passing the address as the last input argument +(before the implicit input arguments). In this case there are no result +arguments. The called function is responsible for performing the dereference +when storing the result value. Clang terms this *structured return (sret)*. + +*TODO: correct the sret definition.* + +.. TODO:: + + Is this definition correct? Or is sret only used if passing in registers, and + pass as non-decomposed struct as stack argument? Or something else? Is the + memory location in the caller stack frame, or a stack memory argument and so + no address is passed as the caller can directly write to the argument stack + location? But then the stack location is still live after return. If it is an + argument stack location, is it the first stack argument or the last one? + +Lambda argument types are treated as struct types with an implementation-defined +set of fields. + +.. 
TODO:: + + Need to specify the ABI for lambda types for AMDGPU. + +For the AMDGPU backend all source language arguments (including the decomposed +struct type arguments) are passed in VGPRs unless marked ``inreg`` in which case +they are passed in SGPRs. + +The AMDGPU backend walks the function call graph from the leaves to determine +which implicit input arguments are used, propagating to each caller of the +function. The used implicit arguments are appended to the function arguments +after the source language arguments in the following order: + +.. TODO:: + + Are recursion or external functions supported? + +1. Work-Item ID (1 VGPR) + + The X, Y and Z work-item IDs are packed into a single VGPR with the following + layout. Only fields actually used by the function are set. The other bits + are undefined. + + The values come from the initial kernel execution state. See + :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`. +
+   .. table:: Work-item implicit argument layout
+      :name: amdgpu-amdhsa-workitem-implict-argument-layout-table
+
+      ======= ======= ==============
+      Bits    Size    Field Name
+      ======= ======= ==============
+      9:0     10 bits X Work-Item ID
+      19:10   10 bits Y Work-Item ID
+      29:20   10 bits Z Work-Item ID
+      31:30   2 bits  Unused
+      ======= ======= ==============
+
+2. Dispatch Ptr (2 SGPRs) + + The value comes from the initial kernel execution state. See + :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`. + +3. Queue Ptr (2 SGPRs) + + The value comes from the initial kernel execution state. See + :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`. + +4. Kernarg Segment Ptr (2 SGPRs) + + The value comes from the initial kernel execution state. See + :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`. + +5. Dispatch id (2 SGPRs) + + The value comes from the initial kernel execution state. See + :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`. + +6. Work-Group ID X (1 SGPR) + + The value comes from the initial kernel execution state. 
See + :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`. + +7. Work-Group ID Y (1 SGPR) + + The value comes from the initial kernel execution state. See + :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`. + +8. Work-Group ID Z (1 SGPR) + + The value comes from the initial kernel execution state. See + :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`. + +9. Implicit Argument Ptr (2 SGPRs) + + The value is computed by adding an offset to Kernarg Segment Ptr to get the + global address space pointer to the first kernarg implicit argument. + +The input and result arguments are assigned in order in the following manner: + +.. note:: + + There are likely some errors and omissions in the following description that + need correction. + + .. TODO:: + + Check the clang source code to decipher how function arguments and return + results are handled. Also see the AMDGPU specific values used. + +* VGPR arguments are assigned to consecutive VGPRs starting at VGPR0 up to + VGPR31. + + If there are more arguments than will fit in these registers, the remaining + arguments are allocated on the stack in order on naturally aligned + addresses. + + .. TODO:: + + How are overly aligned structures allocated on the stack? + +* SGPR arguments are assigned to consecutive SGPRs starting at SGPR0 up to + SGPR29. + + If there are more arguments than will fit in these registers, the remaining + arguments are allocated on the stack in order on naturally aligned + addresses. + +Note that decomposed struct type arguments may have some fields passed in +registers and some in memory. + +.. TODO:: + + So a struct which can pass some fields as decomposed register arguments, will + pass the rest as decomposed stack elements? But an argument that will not start + in registers will not be decomposed and will be passed as a non-decomposed + stack value? + +The following is not part of the AMDGPU function calling convention but +describes how the AMDGPU implements function calls: + +1. 
SGPR34 is used as a frame pointer (FP) if necessary. Like the SP it is an + unswizzled scratch address. It is only needed if runtime sized ``alloca`` + are used, or for the reasons defined in ``SIFrameLowering``. +2. Runtime stack alignment is not currently supported. + + .. TODO:: + + - If runtime stack alignment is supported then will an extra argument + pointer register be used? + +3. Allocating SGPR arguments on the stack is not supported. + +4. No CFI is currently generated. See :ref:`amdgpu-call-frame-information`. + + .. note:: + + Before CFI is generated, the call convention will be changed so that SP is + an unswizzled address relative to the wave scratch base. + + CFI will be generated that defines the CFA as the unswizzled address + relative to the wave scratch base in the unswizzled private address space + of the lowest address stack allocated local variable. + + ``DW_AT_frame_base`` will be defined as the swizzled address in the + swizzled private address space by dividing the CFA by the wavefront size + (since CFA is always at least dword aligned which matches the scratch + swizzle element size). + + If no dynamic stack alignment was performed, the stack allocated arguments + are accessed as negative offsets relative to ``DW_AT_frame_base``, and the + local variables and register spill slots are accessed as positive offsets + relative to ``DW_AT_frame_base``. + +5. Function argument passing is implemented by copying the input physical + registers to virtual registers on entry. The register allocator can spill if + necessary. These are copied back to physical registers at call sites. The + net effect is that each function call can have these values in entirely + distinct locations. The IPRA can help avoid shuffling argument registers. +6. Call sites are implemented by setting up the arguments at positive offsets + from SP. Then SP is incremented to account for the known frame size before + the call and decremented after the call. 
+ + .. note:: + + The CFI will reflect the changed calculation needed to compute the CFA + from SP. + +7. 4 byte spill slots are used in the stack frame. One slot is allocated for an + emergency spill slot. Buffer instructions are used for stack accesses and + not the ``flat_scratch`` instruction. + + .. TODO:: + + Explain when the emergency spill slot is used. + +.. TODO:: + + Possible broken issues: + + - Stack arguments must be aligned to required alignment. + - Stack is aligned to max(16, max formal argument alignment) + - Direct argument < 64 bits should check register budget. + - Register budget calculation should respect ``inreg`` for SGPR. + - SGPR overflow is not handled. + - struct with 1 member unpeeling is not checking size of member. + - ``sret`` is after ``this`` pointer. + - Caller is not implementing stack realignment: need an extra pointer. + - Should say AMDGPU passes FP rather than SP. + - Should CFI define CFA as address of locals or arguments? The difference is + apparent once dynamic alignment has been implemented. + - If ``SCRATCH`` instruction could allow negative offsets then can make FP be + highest address of stack frame and use negative offset for locals. Would + allow SP to be the same as FP and could support signal-handler-like as now + have a real SP for the top of the stack. + - How is ``sret`` passed on the stack? In argument stack area? Can it overlay + arguments? + AMDPAL ------