[AMDGPU] AMDGPUUsage define call convention ABI

Reviewers: scott.linder, arsenm, b-sumner

Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D74861
This commit is contained in:
Tony 2020-02-13 01:19:25 -05:00
parent 71dc77c1eb
commit 491ebe17c8

View File

@ -480,7 +480,7 @@ is conservatively correct for OpenCL.
- ``agent`` and executed by a thread on the same - ``agent`` and executed by a thread on the same
agent. agent.
- ``workgroup`` and executed by a thread in the - ``workgroup`` and executed by a thread in the
same workgroup. same work-group.
- ``wavefront`` and executed by a thread in the - ``wavefront`` and executed by a thread in the
same wavefront. same wavefront.
@ -493,7 +493,7 @@ is conservatively correct for OpenCL.
- ``system`` or ``agent`` and executed by a thread - ``system`` or ``agent`` and executed by a thread
on the same agent. on the same agent.
- ``workgroup`` and executed by a thread in the - ``workgroup`` and executed by a thread in the
same workgroup. same work-group.
- ``wavefront`` and executed by a thread in the - ``wavefront`` and executed by a thread in the
same wavefront. same wavefront.
@ -504,7 +504,7 @@ is conservatively correct for OpenCL.
provided the other operation's sync scope is: provided the other operation's sync scope is:
- ``system``, ``agent`` or ``workgroup`` and - ``system``, ``agent`` or ``workgroup`` and
executed by a thread in the same workgroup. executed by a thread in the same work-group.
- ``wavefront`` and executed by a thread in the - ``wavefront`` and executed by a thread in the
same wavefront. same wavefront.
@ -8501,6 +8501,452 @@ the ``s_trap`` instruction with the following usage:
reserved ``s_trap 0xff`` Reserved. reserved ``s_trap 0xff`` Reserved.
=================== =============== =============== ======================= =================== =============== =============== =======================
.. _amdgpu-amdhsa-function-call-convention:
Call Convention
~~~~~~~~~~~~~~~
.. note::
This section is currently incomplete and has inakkuracies. It is WIP that will
be updated as information is determined.
See :ref:`amdgpu-dwarf-address-space-mapping` for information on swizzled
addresses. Unswizzled addresses are normal linear addresses.
Kernel Functions
++++++++++++++++
This section describes the call convention ABI for the outer kernel function.
See :ref:`amdgpu-amdhsa-initial-kernel-execution-state` for the kernel call
convention.
The following is not part of the AMDGPU kernel calling convention but describes
how the AMDGPU implements function calls:
1. Clang decides the kernarg layout to match the *HSA Programmer's Language
Reference* [HSA]_.
- All structs are passed directly.
- Lambda values are passed *TBA*.
.. TODO::
- Does this really follow HSA rules? Or are structs >16 bytes passed
by-value struct?
- What is ABI for lambda values?
2. The CFI return address is undefined.
3. If the kernel contains no calls then:
- If using the ``amdhsa`` OS ABI (see :ref:`amdgpu-os-table`), and know
during ISel that there is stack usage SGPR0-3 is reserved for use as the
scratch SRD and SGPR33 reserved for the wave scratch offset. Stack usage
is assumed if ``-O0``, if already aware of stack objects for locals, etc.,
or if there are any function calls.
- Otherwise, five high numbered SGPRs are reserved for the tentative scratch
SRD and wave scratch offset. These will be used if determine need to do
spilling.
- If no use is made of the tentative scratch SRD or wave scratch offset,
then they are unreserved and the register count is determined ignoring
them.
- If use is made of the tenatative scratch SRD or wave scratch offset,
then the register numbers used are shifted to be after the highest one
allocated by the register allocator, and all uses updated. The register
count will include them in the shifted location. Since register
allocation may introduce spills, this shifting allows them to be
eliminated without having to perform register allocation again.
- In either case, if the processor has the SGPR allocation bug, the
tentative allocation is not shifted or unreserved inorder to ensure the
register count is higher to workaround the bug.
4. If the kernel contains function calls:
- SP is set to the wave scratch offset.
- Since SP is an unswizzled address relative to the queue scratch base, an
wave scratch offset is an unswizzle offset, this means that if SP is
used to access swizzled scratch memory, it will access the private
segment address 0.
.. note::
This is planned to be changed to be the unswizzled base address of the
wavefront scratch backing memory.
Non-Kernel Functions
++++++++++++++++++++
This section describes the call convention ABI for functions other than the
outer kernel function.
If a kernel has function calls then scratch is always allocated and used for the
call stack which grows from low address to high address using the swizzled
scratch address space.
On entry to a function:
1. SGPR0-3 contain a V# with the following properties:
* Base address of the queue scratch backing memory.
.. note::
This is planned to be changed to be the unswizzled base address of the
wavefront scratch backing memory.
* Swizzled with dword element size and stride of wavefront size elements.
2. The FLAT_SCRATCH register pair is setup. See
:ref:`amdgpu-amdhsa-flat-scratch`.
3. GFX6-8: M0 register set to the size of LDS in bytes.
4. The EXEC register is set to the lanes active on entry to the function.
5. MODE register: *TBD*
6. VGPR0-31 and SGPR4-29 are used to pass function input arguments as described
below.
7. SGPR30-31 return address (RA). The code address that the function must
return to when it completes. The value is undefined if the function is *no
return*.
8. SGPR32 is used for the stack pointer (SP). It is an unswizzled
scratch offset relative to the beginning of the queue scratch backing
memory.
The unswizzled SP can be used with buffer instructions as an unswizzled SGPR
offset with the scratch V# in SGPR0-3 to access the stack in a swizzled
manner.
The swizzled SP value is always 4 bytes aligned for the ``r600``
architecture and 16 byte aligned for the ``amdgcn`` architecture.
.. note::
The ``amdgcn`` value is selected to avoid dynamic stack alignment for the
OpenCL language which has the largest base type defined as 16 bytes.
On entry, the swizzled SP value is the address of the first function
argument passed on the stack. Other stack passed arguments are positive
offsets from the entry swizzled SP value.
The function may use positive offsets beyond the last stack passed argument
for stack allocated local variables and register spill slots. If necessary
the function may align these to greater alignment than 16 bytes. After these
the function may dynamically allocate space for such things as runtime sized
``alloca`` local allocations.
If the function calls another function, it will place any stack allocated
arguments after the last local allocation and adjust SGPR32 to the address
after the last local allocation.
.. note::
The SP value is planned to be changed to be the unswizzled offset relative
to the wavefront scratch backing memory.
9. SGPR33 wavefront scratch base offset. The unswizzled offset from the queue
scratch backing memory base to the base of the wavefront scratch backing
memory.
It is used to convert the unswizzled SP value to swizzled address in the
private address space by:
| private address = (unswizzled SP - wavefront scratch base offset) /
wavefront size
This may be used to obtain the private address of stack objects and to
convert these address to a flat address by adding the flat scratch aperture
base address.
.. note::
This is planned to be eliminated when SP is changed to be the unswizzled
offset relative to the wavefront scratch backing memory. The the
conversion simplifies to:
| private address = unswizzled SP / wavefront size
10. All other registers are unspecified.
11. Any necessary ``waitcnt`` has been performed to ensure memory is available
to the function.
On exit from a function:
1. VGPR0-31 and SGPR4-29 are used to pass function result arguments as
described below. Any registers used are considered clobbered registers,
2. The following registers are preserved and have the same value as on entry:
* FLAT_SCRATCH
* EXEC
* GFX6-8: M0
* All SGPR and VGPR registers except the clobbered registers of SGPR4-31 and
VGPR0-31.
For the AMDGPU backend, an inter-procedural register allocation (IPRA)
optimization may mark some of clobbered SGPR4-31 and VGPR0-31 registers as
preserved if it can be determined that the called function does not change
their value.
2. The PC is set to the RA provided on entry.
3. MODE register: *TBD*.
4. All other registers are clobbered.
5. Any necessary ``waitcnt`` has been performed to ensure memory accessed by
function is available to the caller.
.. TODO::
- On gfx908 are all ACC registers clobbered?
- How are function results returned? The address of structured types is passed
by reference, but what about other types?
The function input arguments are made up of the formal arguments explicitly
declared by the source language function plus the implicit input arguments used
by the implementation.
The source language input arguments are:
1. Any source language implicit ``this`` or ``self`` argument comes first as a
pointer type.
2. Followed by the function formal arguments in left to right source order.
The source language result arguments are:
1. The function result argument.
The source language input or result struct type arguments that are less than or
equal to 16 bytes, are decomposed recursively into their base type fields, and
each field is passed as if a separate argument. For input arguments, if the
called function requires the struct to be in memory, for example because its
address is taken, then the function body is responsible for allocating a stack
location and copying the field arguments into it. Clang terms this *direct
struct*.
The source language input struct type arguments that are greater than 16 bytes,
are passed by reference. The caller is responsible for allocating a stack
location to make a copy of the struct value and pass the address as the input
argument. The called function is responsible to perform the dereference when
accessing the input argument. Clang terms this *by-value struct*.
A source language result struct type argument that is greater than 16 bytes, is
returned by reference. The caller is responsible for allocating a stack location
to hold the result value and passes the address as the last input argument
(before the implicit input arguments). In this case there are no result
arguments. The called function is responsible to perform the dereference when
storing the result value. Clang terms this *structured return (sret)*.
*TODO: correct the sret definition.*
.. TODO::
Is this definition correct? Or is sret only used if passing in registers, and
pass as non-decomposed struct as stack argument? Or something else? Is the
memory location in the caller stack frame, or a stack memory argument and so
no address is passed as the caller can directly write to the argument stack
location. But then the stack location is still live after return. If an
argument stack location is it the first stack argument or the last one?
Lambda argument types are treated as struct types with an implementation defined
set of fields.
.. TODO::
Need to specify the ABI for lambda types for AMDGPU.
For AMDGPU backend all source language arguments (including the decomposed
struct type arguments) are passed in VGPRs unless marked ``inreg`` in which case
they are passed in SGPRs.
The AMDGPU backend walks the function call graph from the leaves to determine
which implicit input arguments are used, propagating to each caller of the
function. The used implicit arguments are appended to the function arguments
after the source language arguments in the following order:
.. TODO::
Is recursion or external functions supported?
1. Work-Item ID (1 VGPR)
The X, Y and Z work-item ID are packed into a single VGRP with the following
layout. Only fields actually used by the function are set. The other bits
are undefined.
The values come from the initial kernel execution state. See
:ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
.. table:: Work-item implict argument layout
:name: amdgpu-amdhsa-workitem-implict-argument-layout-table
======= ======= ==============
Bits Size Field Name
======= ======= ==============
9:0 10 bits X Work-Item ID
19:10 10 bits Y Work-Item ID
29:20 10 bits Z Work-Item ID
31:30 2 bits Unused
======= ======= ==============
2. Dispatch Ptr (2 SGPRs)
The value comes from the initial kernel execution state. See
:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
3. Queue Ptr (2 SGPRs)
The value comes from the initial kernel execution state. See
:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
4. Kernarg Segment Ptr (2 SGPRs)
The value comes from the initial kernel execution state. See
:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
5. Dispatch id (2 SGPRs)
The value comes from the initial kernel execution state. See
:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
6. Work-Group ID X (1 SGPR)
The value comes from the initial kernel execution state. See
:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
7. Work-Group ID Y (1 SGPR)
The value comes from the initial kernel execution state. See
:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
8. Work-Group ID Z (1 SGPR)
The value comes from the initial kernel execution state. See
:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
9. Implicit Argument Ptr (2 SGPRs)
The value is computed by adding an offset to Kernarg Segment Ptr to get the
global address space pointer to the first kernarg implicit argument.
The input and result arguments are assigned in order in the following manner:
..note::
There are likely some errors and ommissions in the following description that
need correction.
..TODO::
Check the clang source code to decipher how funtion arguments and return
results are handled. Also see the AMDGPU specific values used.
* VGPR arguments are assigned to consecutive VGPRs starting at VGPR0 up to
VGPR31.
If there are more arguments than will fit in these registers, the remaining
arguments are allocated on the stack in order on naturally aligned
addresses.
.. TODO::
How are overly aligned structures allocated on the stack?
* SGPR arguments are assigned to consecutive SGPRs starting at SGPR0 up to
SGPR29.
If there are more arguments than will fit in these registers, the remaining
arguments are allocated on the stack in order on naturally aligned
addresses.
Note that decomposed struct type arguments may have some fields passed in
registers and some in memory.
..TODO::
So a struct which can pass some fields as decomposed register arguments, will
pass the rest as decomposed stack elements? But an arguent that will not start
in registers will not be decomposed and will be passed as a non-decomposed
stack value?
The following is not part of the AMDGPU function calling convention but
describes how the AMDGPU implements function calls:
1. SGPR34 is used as a frame pointer (FP) if necessary. Like the SP it is an
unswizzled scratch address. It is only needed if runtime sized ``alloca``
are used, or for the reasons defined in ``SiFrameLowering``.
2. Runtime stack alignment is not currently supported.
.. TODO::
- If runtime stack alignment is supported then will an extra argument
pointer register be used?
2. Allocating SGPR arguments on the stack are not supported.
3. No CFI is currently generated. See :ref:`amdgpu-call-frame-information`.
..note::
Before CFI is generated, the call convention will be changed so that SP is
an unswizzled address relative to the wave scratch base.
CFI will be generated that defines the CFA as the unswizzled address
relative to the wave scratch base in the unswizzled private address space
of the lowest address stack allocated local variable.
``DW_AT_frame_base`` will be defined as the swizelled address in the
swizzled private address space by dividing the CFA by the wavefront size
(since CFA is always at least dword aligned which matches the scratch
swizzle element size).
If no dynamic stack alignment was performed, the stack allocated arguments
are accessed as negative offsets relative to ``DW_AT_frame_base``, and the
local variables and register spill slots are accessed as positive offsets
relative to ``DW_AT_frame_base``.
4. Function argument passing is implemented by copying the input physical
registers to virtual registers on entry. The register allocator can spill if
necessary. These are copied back to physical registers at call sites. The
net effect is that each function call can have these values in entirely
distinct locations. The IPRA can help avoid shuffling argument registers.
5. Call sites are implemented by setting up the arguments at positive offsets
from SP. Then SP is incremented to account for the known frame size before
the call and decremented after the call.
..note::
The CFI will reflect the changed calculation needed to compute the CFA
from SP.
6. 4 byte spill slots are used in the stack frame. One slot is allocated for an
emergency spill slot. Buffer instructions are used for stack accesses and
not the ``flat_scratch`` instruction.
..TODO::
Explain when the emergency spill slot is used.
.. TODO::
Possible broken issues:
- Stack arguments must be aligned to required alignment.
- Stack is aligned to max(16, max formal argument alignment)
- Direct argument < 64 bits should check register budget.
- Register budget calculation should respect ``inreg`` for SGPR.
- SGPR overflow is not handled.
- struct with 1 member unpeeling is not checking size of member.
- ``sret`` is after ``this`` pointer.
- Caller is not implementing stack realignment: need an extra pointer.
- Should say AMDGPU passes FP rather than SP.
- Should CFI define CFA as address of locals or arguments. Difference is
apparent when have implemented dynamic alignment.
- If ``SCRATCH`` instruction could allow negative offsets then can make FP be
highest address of stack frame and use negative offset for locals. Would
allow SP to be the same as FP and could support signal-handler-like as now
have a real SP for the top of the stack.
- How is ``sret`` passed on the stack? In argument stack area? Can it overlay
arguments?
AMDPAL AMDPAL
------ ------