Add `dump_alias_sets` to `transform.bufferization.one_shot_bufferize`.
This option is useful for debugging. Also improve the verifier to ensure
that `test_analysis_only` is set when other debugging flags are enabled.
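A minimal sketch of the intended usage (exact attribute spelling and handle types are illustrative; per the improved verifier, `dump_alias_sets` is only meaningful together with `test_analysis_only`):
```
// Dump alias sets while running the analysis only; the verifier rejects
// dump_alias_sets without test_analysis_only.
%1 = transform.bufferization.one_shot_bufferize %0
    {test_analysis_only = true, dump_alias_sets = true}
    : (!transform.any_op) -> !transform.any_op
```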
* Fixes #67977, a crash in `empty-tensor-elimination`.
* Also improves `linalg.copy` canonicalization.
* Also improves indentation in `mlir-linalg-ods-yaml-gen.cpp`.
This PR introduces a new Op called `warpgroup.mma.store` to the NVGPU
dialect of MLIR. The purpose of this operation is to facilitate storing
the fragmented result(s) of type `nvgpu.warpgroup.accumulator` produced
by `warpgroup.mma` to the given memref.
An example of a fragmented matrix is given here:
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#wgmma-64n16-d
The `warpgroup.mma.store` Op does the following:
1) Takes one or more `nvgpu.warpgroup.accumulator` values (fragmented
result matrices).
2) Calculates the per-thread indexes within the warp-group and stores
the data into the given memref.
Here's an example usage:
```
// A warpgroup performs GEMM, results in fragmented matrix
%result1, %result2 = nvgpu.warpgroup.mma ...
// Stores the fragmented result to memref
nvgpu.warpgroup.mma.store [%result1, %result2], %matrixD :
!nvgpu.warpgroup.accumulator< fragmented = vector<64x128xf32>>,
!nvgpu.warpgroup.accumulator< fragmented = vector<64x128xf32>>
to memref<128x128xf32,3>
```
This PR introduces substantial improvements to the readability and
maintainability of the `nvgpu.warpgroup.mma` Op transformation from
nvgpu->nvvm. This transformation plays a crucial role in GEMM and
manages complex operations such as generating multiple wgmma ops and
iterating their descriptors. The prior code lacked clarity, but this PR
addresses that issue effectively.
**This PR does the following:**
**Introduces a helper class:** `WarpgroupGemm` class encapsulates the
necessary functionality, making the code cleaner and more
understandable.
**Detailed Documentation:** Each function within the helper class is
thoroughly documented to provide clear insights into its purpose and
functionality.
The NVGPU dialect is gaining broad support for warpgroup-level
operations, whose names always start with `warpgroup....`.
This PR renames the Op and type from `wgmma.descriptor` to
`warpgroup.descriptor` for the sake of consistency.
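For illustration, the spelling changes as follows (the descriptor parameters shown are made up for this example):
```
// Before: !nvgpu.wgmma.descriptor<tensor = memref<64x128xf16, 3>>
// After:  !nvgpu.warpgroup.descriptor<tensor = memref<64x128xf16, 3>>
```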
Also fix an issue when sinking a broadcast across an elementwise op:
`arith.cmpf` is elementwise, but its result type differs from its
operand type, so the naive rewrite created illegal IR.
There is a similar issue with `vector.fma`, which only accepts vector
operand types while broadcasts can have scalar sources; sinking the
broadcast across it would produce an illegal `vector.fma` (with scalar
operands).
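A minimal sketch of the `arith.cmpf` case (shapes and predicate are illustrative), showing why the operand type cannot simply be reused for the result:
```
// Broadcasts feeding an elementwise compare.
%lhs = vector.broadcast %a : f32 to vector<4xf32>
%rhs = vector.broadcast %b : f32 to vector<4xf32>
%cmp = arith.cmpf olt, %lhs, %rhs : vector<4xf32>   // yields vector<4xi1>

// Any sunk form must account for the i1 result element type: the scalar
// compare yields i1, and a trailing broadcast has to produce vector<4xi1>,
// not the operand type vector<4xf32>.
%s   = arith.cmpf olt, %a, %b : f32
%res = vector.broadcast %s : i1 to vector<4xi1>
```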
Move the definitions of LLVM intrinsic Ops for ArmSME into a dedicated
file. To facilitate this, the dialect definition together with various
shared definitions are moved to ArmSME.td.
This change will allow us to refactor the ArmSME dialect documentation.
In particular, we will be able to categorise the Ops into "regular" and
"intrinsic" ops. Also, it will be easier to add some custom
documentation as opposed to relying on auto-generated docs that simply
list the available Ops.
The documentation will be updated in a forthcoming patch. Only
non-functional changes.
This patch adds support for lowering vector.insert/extract of tile
slices or elements to ArmSME MOVA intrinsics.
This enables the following operations for ArmSME:
```
// Extract slice from tile:
%slice = vector.extract %tile[%row]
: vector<[4]xi32> from vector<[4]x[4]xi32>
```
```
// Extract element from tile:
%el = vector.extract %tile[%row, %col]
: i32 from vector<[4]x[4]xi32>
```
```
// Insert slice into tile:
%new_tile = vector.insert %slice, %tile[%row]
: vector<[4]xi32> into vector<[4]x[4]xi32>
```
```
// Insert element into tile:
%new_tile = vector.insert %el, %tile[%row, %col]
: i32 into vector<[4]x[4]xi32>
```
This commit renames the arguments of several static implementation
functions of the transform interpreter base class to match the names of
the corresponding member variables in order to clarify their intent.
Similarly, it renames some local variables to reflect their relationship
with corresponding member variables. Finally, this commit also asserts
in `interpreterBaseRunOnOperationImpl` that at most one of the shared
and library modules is set (which the initialization function guarantees)
and simplifies some related `if` conditions.
This PR enables the `test-lower-to-nvvm` pass pipeline for the
integration tests for the NVIDIA sm_90 architecture.
It adjusts the `test-lower-to-nvvm` pipeline in two ways:
1) Calls `createConvertNVGPUToNVVMPass` before the outlining process.
This particular pass is responsible for generating both device and host
code. On the host, it calls the CUDA driver to build the TMA descriptor
(`cuTensorMap`).
2) Integrates `createConvertNVVMToLLVMPass` to generate PTX for
NVVM Ops.
The TableGen code generator now generates C++ code that returns a single
`OpOperand &` for `get...Mutable` of operands that are not variadic and
not optional. `OpOperand::set`/`assign` can be used to set a value (same
as `MutableOperandRange::assign`). This is safer than
`MutableOperandRange` because only single values (and no longer
`ValueRange`) can be assigned.
E.g.:
```
// Assignment of multiple values to non-variadic operand.
// Before: Compiles, but produces invalid op.
// After: Compilation error.
extractSliceOp.getSourceMutable().assign({v1, v2});
```
Previously, this assumed that all conversions are possible and would
crash. Use an in-tree test case here.
Co-authored-by: Jakub Kuderski <kubakuderski@gmail.com>
The `BufferizableOpInterface` implementation of `scf.for` currently
assumes that an OpResult does not alias with any tensor apart from the
corresponding init OpOperand. Newly allocated buffers (inside of the
loop) are also allowed. The current implementation checks whether the
respective init_arg and yielded value are equivalent. This is overly
strict and causes extra buffer allocations/copies when yielding a new
buffer allocation from a loop.
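For illustration, a loop of the following shape (a sketch; `%lb`, `%ub`, `%c1`, `%init`, `%sz`, and `%cst` are assumed to be defined elsewhere) yields a fresh allocation that is not equivalent to the init_arg, which the equivalence-based check previously penalized with an extra allocation/copy:
```
%res = scf.for %iv = %lb to %ub step %c1 iter_args(%arg = %init) -> (tensor<?xf32>) {
  // A new allocation is yielded instead of a value equivalent to %arg.
  %new = bufferization.alloc_tensor(%sz) : tensor<?xf32>
  %filled = linalg.fill ins(%cst : f32) outs(%new : tensor<?xf32>) -> tensor<?xf32>
  scf.yield %filled : tensor<?xf32>
}
```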
Make `tensor.empty` bufferizable, so that the
`-empty-tensor-to-alloc-tensor` pass becomes optional. This makes the
bufferization easier to use. `tensor.empty` used to be non-bufferizable,
so that there were two separate ops: one that can be optimized away
(`tensor.empty`) and one that is guaranteed to bufferize to an
allocation (`bufferization.alloc_tensor`). With the recent improvements
of "empty tensor elimination" this is no longer needed and
`bufferization.alloc_tensor` can be phased out.
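As a small illustrative example, IR like the following now bufferizes directly, without first running `-empty-tensor-to-alloc-tensor`:
```
%cst = arith.constant 0.0 : f32
// tensor.empty can either be eliminated or bufferize to an allocation.
%empty = tensor.empty() : tensor<8xf32>
%fill = linalg.fill ins(%cst : f32) outs(%empty : tensor<8xf32>) -> tensor<8xf32>
```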
At the moment, for device, a reference pointer is generated in place of
the original declare target global value; this reference pointer is the
pointer that actually receives the data. In Clang, the original global
value isn't generated for device, just the reference pointer.
Unfortunately, for Flang/MLIR this is currently not the case: the
declare target attribute is processed after the creation of the global,
so we effectively end up with a dead global on device after rewriting
its uses to the new device reference pointer.
It appears I was a little overzealous with the deletion of the declare
target globals for device. The current method breaks in cases where the
same declare target global is used across two target regions (a runtime
reproducer is added in the patch), as it effectively deletes the global
before the second target region gets a chance to be written to LLVM IR
and have its uses rewritten.
I'd like to remove this deletion, as the dead global isn't breaking any
code and will likely be removed by later dead-code-elimination passes;
the original approach was perhaps a little too heavy-handed.
Ensure the stream is flushed to the string before acquiring the mutex.
There is no need to flush the output stream; the goal of the mutex is to
synchronize before content is added to the stream.
Note that the `Pass` suffix is added in TableGen, and as a side effect the
options are renamed from `ArithExpandOpsOptions` to `ArithExpandOpsPassOptions`.
According to the specification, the `shift` attribute of the Mul operator in
TOSA is of signless i8 type instead of i32.
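For illustration (operand and result types are chosen arbitrarily here):
```
// The shift attribute now carries a signless i8 value.
%out = tosa.mul %a, %b {shift = 0 : i8}
    : (tensor<4xi32>, tensor<4xi32>) -> tensor<4xi32>
```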
Signed-off-by: Georgios Pinitas <georgios.pinitas@arm.com>
The buffer deallocation pipeline was previously incorrect when applied to
functions. It has since been fixed. Make sure it is exercised in the
tutorial to avoid leaking allocations.
Values that are the result of buffer allocation ops are guaranteed to
*not* be the same allocation as block arguments of containing blocks.
This fact can be used to allow for more aggressive simplification of
`bufferization.dealloc` ops.
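A minimal sketch (shapes and conditions are illustrative): `%alloc` below is a fresh allocation, so it cannot be the same allocation as the block argument `%arg`, and the retain condition for `%arg` therefore does not depend on whether `%alloc` is deallocated.
```
// %arg is a block argument of the containing block; %alloc is freshly
// allocated, so the two cannot be the same allocation.
%alloc = memref.alloc() : memref<2xf32>
%retained = bufferization.dealloc (%alloc : memref<2xf32>) if (%cond)
              retain (%arg : memref<2xf32>)
```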