mirror of
https://github.com/RPCS3/llvm-mirror.git
synced 2024-11-26 04:40:38 +00:00
[AMDGPU] AMDGPUUsage define call convention ABI
Reviewers: scott.linder, arsenm, b-sumner Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, kerbowa, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D74861
parent
71dc77c1eb
commit
491ebe17c8
@ -480,7 +480,7 @@ is conservatively correct for OpenCL.
|
||||
- ``agent`` and executed by a thread on the same
|
||||
agent.
|
||||
- ``workgroup`` and executed by a thread in the
|
||||
same workgroup.
|
||||
same work-group.
|
||||
- ``wavefront`` and executed by a thread in the
|
||||
same wavefront.
|
||||
|
||||
@ -493,7 +493,7 @@ is conservatively correct for OpenCL.
|
||||
- ``system`` or ``agent`` and executed by a thread
|
||||
on the same agent.
|
||||
- ``workgroup`` and executed by a thread in the
|
||||
same workgroup.
|
||||
same work-group.
|
||||
- ``wavefront`` and executed by a thread in the
|
||||
same wavefront.
|
||||
|
||||
@ -504,7 +504,7 @@ is conservatively correct for OpenCL.
|
||||
provided the other operation's sync scope is:
|
||||
|
||||
- ``system``, ``agent`` or ``workgroup`` and
|
||||
executed by a thread in the same workgroup.
|
||||
executed by a thread in the same work-group.
|
||||
- ``wavefront`` and executed by a thread in the
|
||||
same wavefront.
|
||||
|
||||
@ -8501,6 +8501,452 @@ the ``s_trap`` instruction with the following usage:
|
||||
reserved ``s_trap 0xff`` Reserved.
|
||||
=================== =============== =============== =======================
|
||||
|
||||
.. _amdgpu-amdhsa-function-call-convention:
|
||||
|
||||
Call Convention
|
||||
~~~~~~~~~~~~~~~
|
||||
|
||||
.. note::
|
||||
|
||||
This section is currently incomplete and has inaccuracies. It is WIP that will
|
||||
be updated as information is determined.
|
||||
|
||||
See :ref:`amdgpu-dwarf-address-space-mapping` for information on swizzled
|
||||
addresses. Unswizzled addresses are normal linear addresses.
|
||||
|
||||
Kernel Functions
|
||||
++++++++++++++++
|
||||
|
||||
This section describes the call convention ABI for the outer kernel function.
|
||||
|
||||
See :ref:`amdgpu-amdhsa-initial-kernel-execution-state` for the kernel call
|
||||
convention.
|
||||
|
||||
The following is not part of the AMDGPU kernel calling convention but describes
|
||||
how the AMDGPU backend implements function calls:
|
||||
|
||||
1. Clang decides the kernarg layout to match the *HSA Programmer's Language
|
||||
Reference* [HSA]_.
|
||||
|
||||
- All structs are passed directly.
|
||||
- Lambda values are passed *TBA*.
|
||||
|
||||
.. TODO::
|
||||
|
||||
- Does this really follow HSA rules? Or are structs >16 bytes passed
|
||||
by-value struct?
|
||||
- What is ABI for lambda values?
|
||||
|
||||
2. The CFI return address is undefined.
|
||||
3. If the kernel contains no calls then:
|
||||
|
||||
- If using the ``amdhsa`` OS ABI (see :ref:`amdgpu-os-table`), and it is known
  during ISel that there is stack usage, SGPR0-3 is reserved for use as the
  scratch SRD and SGPR33 is reserved for the wave scratch offset. Stack usage
  is assumed if ``-O0`` is used, if stack objects for locals are already
  known, etc., or if there are any function calls.
|
||||
- Otherwise, five high numbered SGPRs are reserved for the tentative scratch
  SRD and wave scratch offset. These will be used if it is determined that
  spilling is needed.
|
||||
|
||||
- If no use is made of the tentative scratch SRD or wave scratch offset,
|
||||
then they are unreserved and the register count is determined ignoring
|
||||
them.
|
||||
- If use is made of the tentative scratch SRD or wave scratch offset,
|
||||
then the register numbers used are shifted to be after the highest one
|
||||
allocated by the register allocator, and all uses updated. The register
|
||||
count will include them in the shifted location. Since register
|
||||
allocation may introduce spills, this shifting allows them to be
|
||||
eliminated without having to perform register allocation again.
|
||||
- In either case, if the processor has the SGPR allocation bug, the
  tentative allocation is not shifted or unreserved, in order to ensure the
  register count is higher to work around the bug.
|
||||
|
||||
4. If the kernel contains function calls:
|
||||
|
||||
- SP is set to the wave scratch offset.
|
||||
|
||||
- Since SP is an unswizzled address relative to the queue scratch base, and
  the wave scratch offset is an unswizzled offset, this means that if SP is
  used to access swizzled scratch memory, it will access the private
  segment address 0.
|
||||
|
||||
.. note::
|
||||
|
||||
This is planned to be changed to be the unswizzled base address of the
|
||||
wavefront scratch backing memory.
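The handling of the tentative scratch registers for a kernel with no calls can be sketched as follows. This is only an illustration of the rules listed above, not the backend's code: the function name, its parameters, and the idea of passing in the index of the first tentative register are all assumptions made for the example.

```python
# Four SGPRs for the scratch SRD plus one for the wave scratch offset.
TENTATIVE_SGPRS = 5


def final_sgpr_count(highest_allocated, tentative_base, tentative_used,
                     has_sgpr_alloc_bug):
    """Return the SGPR count for a kernel that contains no calls.

    highest_allocated: index of the highest SGPR picked by the register
        allocator (hypothetical input).
    tentative_base: index of the first of the five high numbered SGPRs
        reserved before register allocation (hypothetical input).
    """
    allocated = highest_allocated + 1
    if has_sgpr_alloc_bug:
        # Neither shifted nor unreserved: the reservation stays in place,
        # which keeps the register count high.
        return max(allocated, tentative_base + TENTATIVE_SGPRS)
    if not tentative_used:
        # Unreserved: the count ignores the tentative registers.
        return allocated
    # Shifted to sit directly after the highest allocated register.
    return allocated + TENTATIVE_SGPRS
```

For example, with SGPR9 as the highest allocated register, the count is 10 when the tentative registers are unused and 15 when they are used and shifted.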
|
||||
|
||||
Non-Kernel Functions
|
||||
++++++++++++++++++++
|
||||
|
||||
This section describes the call convention ABI for functions other than the
|
||||
outer kernel function.
|
||||
|
||||
If a kernel has function calls then scratch is always allocated and used for the
|
||||
call stack which grows from low address to high address using the swizzled
|
||||
scratch address space.
|
||||
|
||||
On entry to a function:
|
||||
|
||||
1. SGPR0-3 contain a V# with the following properties:
|
||||
|
||||
* Base address of the queue scratch backing memory.
|
||||
|
||||
.. note::
|
||||
|
||||
This is planned to be changed to be the unswizzled base address of the
|
||||
wavefront scratch backing memory.
|
||||
|
||||
* Swizzled with dword element size and stride of wavefront size elements.
|
||||
|
||||
2. The FLAT_SCRATCH register pair is set up. See
|
||||
:ref:`amdgpu-amdhsa-flat-scratch`.
|
||||
3. GFX6-8: M0 register set to the size of LDS in bytes.
|
||||
4. The EXEC register is set to the lanes active on entry to the function.
|
||||
5. MODE register: *TBD*
|
||||
6. VGPR0-31 and SGPR4-29 are used to pass function input arguments as described
|
||||
below.
|
||||
7. SGPR30-31 return address (RA). The code address that the function must
|
||||
return to when it completes. The value is undefined if the function is *no
|
||||
return*.
|
||||
8. SGPR32 is used for the stack pointer (SP). It is an unswizzled
|
||||
scratch offset relative to the beginning of the queue scratch backing
|
||||
memory.
|
||||
|
||||
The unswizzled SP can be used with buffer instructions as an unswizzled SGPR
|
||||
offset with the scratch V# in SGPR0-3 to access the stack in a swizzled
|
||||
manner.
|
||||
|
||||
The swizzled SP value is always 4 byte aligned for the ``r600``
|
||||
architecture and 16 byte aligned for the ``amdgcn`` architecture.
|
||||
|
||||
.. note::
|
||||
|
||||
The ``amdgcn`` value is selected to avoid dynamic stack alignment for the
|
||||
OpenCL language which has the largest base type defined as 16 bytes.
|
||||
|
||||
On entry, the swizzled SP value is the address of the first function
|
||||
argument passed on the stack. Other stack passed arguments are positive
|
||||
offsets from the entry swizzled SP value.
|
||||
|
||||
The function may use positive offsets beyond the last stack passed argument
|
||||
for stack allocated local variables and register spill slots. If necessary
|
||||
the function may align these to greater alignment than 16 bytes. After these
|
||||
the function may dynamically allocate space for such things as runtime sized
|
||||
``alloca`` local allocations.
|
||||
|
||||
If the function calls another function, it will place any stack allocated
|
||||
arguments after the last local allocation and adjust SGPR32 to the address
|
||||
after the last local allocation.
|
||||
|
||||
.. note::
|
||||
|
||||
The SP value is planned to be changed to be the unswizzled offset relative
|
||||
to the wavefront scratch backing memory.
|
||||
|
||||
9. SGPR33 wavefront scratch base offset. The unswizzled offset from the queue
|
||||
scratch backing memory base to the base of the wavefront scratch backing
|
||||
memory.
|
||||
|
||||
It is used to convert the unswizzled SP value to a swizzled address in the
|
||||
private address space by:
|
||||
|
||||
| private address = (unswizzled SP - wavefront scratch base offset) /
|
||||
wavefront size
|
||||
|
||||
This may be used to obtain the private address of stack objects and to
|
||||
convert these addresses to flat addresses by adding the flat scratch aperture
|
||||
base address.
|
||||
|
||||
.. note::
|
||||
|
||||
This is planned to be eliminated when SP is changed to be the unswizzled
|
||||
offset relative to the wavefront scratch backing memory. Then the
|
||||
conversion simplifies to:
|
||||
|
||||
| private address = unswizzled SP / wavefront size
|
||||
|
||||
10. All other registers are unspecified.
|
||||
11. Any necessary ``waitcnt`` has been performed to ensure memory is available
|
||||
to the function.
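The conversion from the unswizzled SP to a private address, and from there to a flat address, can be illustrated with a small Python sketch. The function and parameter names are invented for the example, and the numeric values in the usage note are arbitrary.

```python
def private_address(unswizzled_sp, wave_scratch_base_offset, wavefront_size):
    """Convert an unswizzled SP to a swizzled private address.

    Implements: (unswizzled SP - wavefront scratch base offset) /
    wavefront size.
    """
    return (unswizzled_sp - wave_scratch_base_offset) // wavefront_size


def flat_address(private_addr, flat_scratch_aperture_base):
    """Convert a private address to a flat address by adding the aperture base."""
    return flat_scratch_aperture_base + private_addr
```

With a wavefront size of 64, a wavefront scratch base offset of ``0x10000`` and an unswizzled SP of ``0x10400``, ``private_address`` returns 16, which is a 16 byte aligned private address.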
|
||||
|
||||
On exit from a function:
|
||||
|
||||
1. VGPR0-31 and SGPR4-29 are used to pass function result arguments as
|
||||
described below. Any registers used are considered clobbered registers.
|
||||
2. The following registers are preserved and have the same value as on entry:
|
||||
|
||||
* FLAT_SCRATCH
|
||||
* EXEC
|
||||
* GFX6-8: M0
|
||||
* All SGPR and VGPR registers except the clobbered registers of SGPR4-31 and
|
||||
VGPR0-31.
|
||||
|
||||
For the AMDGPU backend, an inter-procedural register allocation (IPRA)
|
||||
optimization may mark some of the clobbered SGPR4-31 and VGPR0-31 registers as
|
||||
preserved if it can be determined that the called function does not change
|
||||
their value.
|
||||
|
||||
3. The PC is set to the RA provided on entry.
4. MODE register: *TBD*.
5. All other registers are clobbered.
6. Any necessary ``waitcnt`` has been performed to ensure memory accessed by
   the function is available to the caller.
|
||||
|
||||
.. TODO::
|
||||
|
||||
- On gfx908 are all ACC registers clobbered?
|
||||
|
||||
- How are function results returned? The address of structured types is passed
|
||||
by reference, but what about other types?
|
||||
|
||||
The function input arguments are made up of the formal arguments explicitly
|
||||
declared by the source language function plus the implicit input arguments used
|
||||
by the implementation.
|
||||
|
||||
The source language input arguments are:
|
||||
|
||||
1. Any source language implicit ``this`` or ``self`` argument comes first as a
|
||||
pointer type.
|
||||
2. Followed by the function formal arguments in left to right source order.
|
||||
|
||||
The source language result arguments are:
|
||||
|
||||
1. The function result argument.
|
||||
|
||||
The source language input or result struct type arguments that are less than or
|
||||
equal to 16 bytes are decomposed recursively into their base type fields, and
|
||||
each field is passed as if a separate argument. For input arguments, if the
|
||||
called function requires the struct to be in memory, for example because its
|
||||
address is taken, then the function body is responsible for allocating a stack
|
||||
location and copying the field arguments into it. Clang terms this *direct
|
||||
struct*.
|
||||
|
||||
The source language input struct type arguments that are greater than 16 bytes
|
||||
are passed by reference. The caller is responsible for allocating a stack
|
||||
location to make a copy of the struct value and pass the address as the input
|
||||
argument. The called function is responsible for performing the dereference when
|
||||
accessing the input argument. Clang terms this *by-value struct*.
|
||||
|
||||
A source language result struct type argument that is greater than 16 bytes is
|
||||
returned by reference. The caller is responsible for allocating a stack location
|
||||
to hold the result value and passes the address as the last input argument
|
||||
(before the implicit input arguments). In this case there are no result
|
||||
arguments. The called function is responsible for performing the dereference when
|
||||
storing the result value. Clang terms this *structured return (sret)*.
|
||||
|
||||
*TODO: correct the sret definition.*
|
||||
|
||||
.. TODO::
|
||||
|
||||
Is this definition correct? Or is sret only used if passing in registers, and
|
||||
pass as non-decomposed struct as stack argument? Or something else? Is the
|
||||
memory location in the caller stack frame, or a stack memory argument and so
|
||||
no address is passed as the caller can directly write to the argument stack
|
||||
location. But then the stack location is still live after return. If an
|
||||
argument stack location, is it the first stack argument or the last one?
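The size-based struct rules described above can be summarized in a short Python sketch. It only mirrors the text, which this section itself flags as possibly inaccurate. The names ``classify_struct`` and ``decompose``, and the use of nested lists to model struct fields, are assumptions made for the example.

```python
def classify_struct(size_in_bytes, is_result):
    """Return how a struct of the given size is passed or returned."""
    if size_in_bytes <= 16:
        # Decomposed into base type fields ("direct struct").
        return "direct"
    if is_result:
        # The caller passes the address of a result location ("sret").
        return "sret"
    # The caller copies the value and passes its address ("by-value struct").
    return "by-reference"


def decompose(fields):
    """Recursively flatten nested struct fields into base type fields.

    A struct is modeled as a list whose elements are either base type
    names (strings) or nested lists for struct-typed fields.
    """
    flat = []
    for field in fields:
        if isinstance(field, list):
            flat.extend(decompose(field))
        else:
            flat.append(field)
    return flat
```

For instance, a struct modeled as ``["float", ["int", "int"], "char"]`` decomposes into four separate arguments in field order.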
|
||||
|
||||
Lambda argument types are treated as struct types with an implementation defined
|
||||
set of fields.
|
||||
|
||||
.. TODO::
|
||||
|
||||
Need to specify the ABI for lambda types for AMDGPU.
|
||||
|
||||
For the AMDGPU backend, all source language arguments (including the decomposed
|
||||
struct type arguments) are passed in VGPRs unless marked ``inreg`` in which case
|
||||
they are passed in SGPRs.
|
||||
|
||||
The AMDGPU backend walks the function call graph from the leaves to determine
|
||||
which implicit input arguments are used, propagating to each caller of the
|
||||
function. The used implicit arguments are appended to the function arguments
|
||||
after the source language arguments in the following order:
|
||||
|
||||
.. TODO::
|
||||
|
||||
Is recursion or external functions supported?
|
||||
|
||||
1. Work-Item ID (1 VGPR)
|
||||
|
||||
The X, Y and Z work-item IDs are packed into a single VGPR with the following
|
||||
layout. Only fields actually used by the function are set. The other bits
|
||||
are undefined.
|
||||
|
||||
The values come from the initial kernel execution state. See
|
||||
:ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
|
||||
|
||||
.. table:: Work-item implicit argument layout
|
||||
:name: amdgpu-amdhsa-workitem-implict-argument-layout-table
|
||||
|
||||
======= ======= ==============
|
||||
Bits Size Field Name
|
||||
======= ======= ==============
|
||||
9:0 10 bits X Work-Item ID
|
||||
19:10 10 bits Y Work-Item ID
|
||||
29:20 10 bits Z Work-Item ID
|
||||
31:30 2 bits Unused
|
||||
======= ======= ==============
|
||||
|
||||
2. Dispatch Ptr (2 SGPRs)
|
||||
|
||||
The value comes from the initial kernel execution state. See
|
||||
:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
|
||||
|
||||
3. Queue Ptr (2 SGPRs)
|
||||
|
||||
The value comes from the initial kernel execution state. See
|
||||
:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
|
||||
|
||||
4. Kernarg Segment Ptr (2 SGPRs)
|
||||
|
||||
The value comes from the initial kernel execution state. See
|
||||
:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
|
||||
|
||||
5. Dispatch ID (2 SGPRs)
|
||||
|
||||
The value comes from the initial kernel execution state. See
|
||||
:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
|
||||
|
||||
6. Work-Group ID X (1 SGPR)
|
||||
|
||||
The value comes from the initial kernel execution state. See
|
||||
:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
|
||||
|
||||
7. Work-Group ID Y (1 SGPR)
|
||||
|
||||
The value comes from the initial kernel execution state. See
|
||||
:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
|
||||
|
||||
8. Work-Group ID Z (1 SGPR)
|
||||
|
||||
The value comes from the initial kernel execution state. See
|
||||
:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
|
||||
|
||||
9. Implicit Argument Ptr (2 SGPRs)
|
||||
|
||||
The value is computed by adding an offset to Kernarg Segment Ptr to get the
|
||||
global address space pointer to the first kernarg implicit argument.
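The packed work-item ID layout from the table above can be illustrated with a short Python sketch. The function names are invented for the example. The unpacking helper ignores bits 31:30, which the table marks as unused.

```python
FIELD_MASK = 0x3FF  # each work-item ID field is 10 bits wide


def pack_work_item_id(x, y, z):
    """Pack X, Y and Z work-item IDs into one 32-bit value."""
    for value in (x, y, z):
        if not 0 <= value <= FIELD_MASK:
            raise ValueError("work-item ID does not fit in 10 bits")
    return x | (y << 10) | (z << 20)


def unpack_work_item_id(vgpr):
    """Extract the (X, Y, Z) work-item IDs from a packed 32-bit value."""
    return (vgpr & FIELD_MASK,
            (vgpr >> 10) & FIELD_MASK,
            (vgpr >> 20) & FIELD_MASK)
```

Since only fields actually used by the function are set, a consumer should read just the fields it requested and treat the remaining bits as undefined.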
|
||||
|
||||
The input and result arguments are assigned in order in the following manner:
|
||||
|
||||
.. note::
|
||||
|
||||
There are likely some errors and omissions in the following description that
|
||||
need correction.
|
||||
|
||||
.. TODO::
|
||||
|
||||
Check the clang source code to decipher how function arguments and return
|
||||
results are handled. Also see the AMDGPU specific values used.
|
||||
|
||||
* VGPR arguments are assigned to consecutive VGPRs starting at VGPR0 up to
|
||||
VGPR31.
|
||||
|
||||
If there are more arguments than will fit in these registers, the remaining
|
||||
arguments are allocated on the stack in order on naturally aligned
|
||||
addresses.
|
||||
|
||||
.. TODO::
|
||||
|
||||
How are overly aligned structures allocated on the stack?
|
||||
|
||||
* SGPR arguments are assigned to consecutive SGPRs starting at SGPR0 up to
|
||||
SGPR29.
|
||||
|
||||
If there are more arguments than will fit in these registers, the remaining
|
||||
arguments are allocated on the stack in order on naturally aligned
|
||||
addresses.
|
||||
|
||||
Note that decomposed struct type arguments may have some fields passed in
|
||||
registers and some in memory.
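A rough Python sketch of the VGPR assignment rule above follows. It makes several assumptions that the text does not state: each argument occupies whole 32-bit VGPRs, once one argument overflows to the stack all later arguments also go to the stack, and the natural alignment of a base type equals its size. The function name and the result format are invented for the example.

```python
NUM_ARG_VGPRS = 32  # VGPR0 up to VGPR31


def assign_vgpr_arguments(arg_sizes):
    """Assign each argument (size in bytes) to a VGPR or a stack offset.

    Returns a list of ("vgpr", index) or ("stack", offset) tuples.
    """
    next_vgpr = 0
    stack_offset = 0
    locations = []
    for size in arg_sizes:
        dwords = (size + 3) // 4  # assumed: whole 32-bit registers
        if next_vgpr + dwords <= NUM_ARG_VGPRS:
            locations.append(("vgpr", next_vgpr))
            next_vgpr += dwords
        else:
            # Assumed: no later argument goes back into registers.
            next_vgpr = NUM_ARG_VGPRS
            # Assumed: natural alignment equals the size of the base type.
            stack_offset = (stack_offset + size - 1) // size * size
            locations.append(("stack", stack_offset))
            stack_offset += size
    return locations
```

For example, after 31 four-byte arguments fill VGPR0-30, an eight-byte argument no longer fits in the remaining register and is placed at stack offset 0.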
|
||||
|
||||
.. TODO::
|
||||
|
||||
So a struct which can pass some fields as decomposed register arguments, will
|
||||
pass the rest as decomposed stack elements? But an argument that will not start
|
||||
in registers will not be decomposed and will be passed as a non-decomposed
|
||||
stack value?
|
||||
|
||||
The following is not part of the AMDGPU function calling convention but
|
||||
describes how the AMDGPU backend implements function calls:
|
||||
|
||||
1. SGPR34 is used as a frame pointer (FP) if necessary. Like the SP it is an
|
||||
unswizzled scratch address. It is only needed if runtime sized ``alloca``
|
||||
are used, or for the reasons defined in ``SIFrameLowering``.
|
||||
2. Runtime stack alignment is not currently supported.
|
||||
|
||||
.. TODO::
|
||||
|
||||
- If runtime stack alignment is supported then will an extra argument
|
||||
pointer register be used?
|
||||
|
||||
3. Allocating SGPR arguments on the stack is not supported.

4. No CFI is currently generated. See :ref:`amdgpu-call-frame-information`.

   .. note::

     Before CFI is generated, the call convention will be changed so that SP is
     an unswizzled address relative to the wave scratch base.

     CFI will be generated that defines the CFA as the unswizzled address
     relative to the wave scratch base in the unswizzled private address space
     of the lowest address stack allocated local variable.

     ``DW_AT_frame_base`` will be defined as the swizzled address in the
     swizzled private address space by dividing the CFA by the wavefront size
     (since CFA is always at least dword aligned which matches the scratch
     swizzle element size).

     If no dynamic stack alignment was performed, the stack allocated arguments
     are accessed as negative offsets relative to ``DW_AT_frame_base``, and the
     local variables and register spill slots are accessed as positive offsets
     relative to ``DW_AT_frame_base``.

5. Function argument passing is implemented by copying the input physical
   registers to virtual registers on entry. The register allocator can spill if
   necessary. These are copied back to physical registers at call sites. The
   net effect is that each function call can have these values in entirely
   distinct locations. The IPRA can help avoid shuffling argument registers.
6. Call sites are implemented by setting up the arguments at positive offsets
   from SP. Then SP is incremented to account for the known frame size before
   the call and decremented after the call.

   .. note::

     The CFI will reflect the changed calculation needed to compute the CFA
     from SP.

7. 4 byte spill slots are used in the stack frame. One slot is allocated for an
   emergency spill slot. Buffer instructions are used for stack accesses and
   not the ``flat_scratch`` instruction.
|
||||
|
||||
.. TODO::
|
||||
|
||||
Explain when the emergency spill slot is used.
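The SP adjustment around a call site described above can be sketched in Python. The sketch assumes that the known frame size is expressed in per-lane (swizzled) bytes and that, because SP is an unswizzled offset, the adjustment is scaled by the wavefront size. It also rounds the frame size up to the 16 byte ``amdgcn`` stack alignment. These are assumptions made for the example, and the function names are hypothetical.

```python
def align_up(value, alignment):
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def sp_for_call(sp, frame_size, wavefront_size, stack_alignment=16):
    """Return (SP during the call, SP after the call).

    sp: unswizzled SP before the call.
    frame_size: known frame size in per-lane bytes (assumed unit).
    """
    # Assumed: scale the per-lane size to an unswizzled adjustment.
    delta = align_up(frame_size, stack_alignment) * wavefront_size
    sp_during_call = sp + delta          # incremented before the call
    sp_after_call = sp_during_call - delta  # decremented after the call
    return sp_during_call, sp_after_call
```

With a 40 byte frame and a wavefront size of 64, the frame is rounded up to 48 bytes and SP moves by 3072 for the duration of the call.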
|
||||
|
||||
.. TODO::
|
||||
|
||||
Possible broken issues:
|
||||
|
||||
- Stack arguments must be aligned to required alignment.
|
||||
- Stack is aligned to max(16, max formal argument alignment)
|
||||
- Direct argument < 64 bits should check register budget.
|
||||
- Register budget calculation should respect ``inreg`` for SGPR.
|
||||
- SGPR overflow is not handled.
|
||||
- struct with 1 member unpeeling is not checking size of member.
|
||||
- ``sret`` is after ``this`` pointer.
|
||||
- Caller is not implementing stack realignment: need an extra pointer.
|
||||
- Should say AMDGPU passes FP rather than SP.
|
||||
- Should CFI define CFA as address of locals or arguments. Difference is
|
||||
apparent once dynamic alignment is implemented.
|
||||
- If ``SCRATCH`` instruction could allow negative offsets then can make FP be
|
||||
highest address of stack frame and use negative offset for locals. Would
|
||||
allow SP to be the same as FP and could support signal-handler-like as now
|
||||
have a real SP for the top of the stack.
|
||||
- How is ``sret`` passed on the stack? In argument stack area? Can it overlay
|
||||
arguments?
|
||||
|
||||
AMDPAL
|
||||
------
|
||||
|
||||
|