Many machine-learning applications (and most software written at AMD)
expect the operation that truncates floats to 8-bit floats to be
saturating. That is, they expect `truncf 256.0 : f32 to f8E4M3FNUZ` to
yield `240.0`, not `NaN`, and similarly for negative numbers. However,
the underlying hardware instruction that can be used for this truncation
implements overflow-to-NaN semantics.
To enable handling this use case, we add the saturate-fp8-truncf option
to ArithToAMDGPU (off by default), which causes the requisite clamping
code to be emitted. Said clamping code ensures that Inf and NaN are
passed through exactly (and thus truncate to NaN).
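For illustration, a minimal sketch of the intended semantics (the actual
clamping IR emitted by the pass is more involved and not shown here):
```mlir
// With saturate-fp8-truncf enabled, out-of-range finite values clamp to
// the f8E4M3FNUZ maximum instead of overflowing to NaN.
%c = arith.constant 256.0 : f32
%t = arith.truncf %c : f32 to f8E4M3FNUZ  // yields 240.0, not NaN
```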
Per review feedback, this commit refactors
createScalarOrSplatConstant() to the Arith dialect utilities and uses
it in this code. It also fixes naming of existing patterns and
switches from vector.extractelement/insertelement to
vector.extract/insert.
This patch adds support for translating dense_resource attributes to
the LLVM IR target.
The support added is similar to how DenseElementsAttr is handled, except
we don't need to handle splats.
Another possible way of doing this is adding iteration on
dense_resource, but that is non-trivial: DenseResourceAttr is not meant
to be accessed directly; it has subclasses which are supposed to be used
to iterate on it.
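For illustration, a sketch of the kind of IR this enables translating
(the global and blob below are invented for this example; in the hex
blob, the leading word encodes the alignment):
```mlir
llvm.mlir.global internal constant @bytes(dense_resource<blob1> : tensor<4xi8>) : !llvm.array<4 x i8>

{-#
  dialect_resources: {
    builtin: {
      blob1: "0x0800000001020304"
    }
  }
#-}
```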
The initial design of the `acc.loop` was to be an operation that
encapsulates a loop-like operation. This was an early design, and we now
want to change it so the `acc.loop` operation becomes a real loop-like
operation by implementing the LoopLikeInterface.
Differential Revision: https://reviews.llvm.org/D159229
This patch was simply moved from Phabricator to GitHub.
This adds a new `mlir_arm_runner_utils` library that contains utils
specific to Arm/AArch64. This is for use in MLIR integration tests.
This initial patch adds `setArmVLBits()` and `setArmSVLBits()`. This
allows changing the vector length or the streaming vector length at
runtime (or setting it to a known minimum, i.e. 128 bits).
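For illustration, a sketch of how a test might use this, assuming
`setArmVLBits()` takes the number of bits as an `i32`:
```mlir
func.func private @setArmVLBits(i32)

func.func @run_at_vl128() {
  // Pin the vector length to 128 bits for the code that follows.
  %c128 = arith.constant 128 : i32
  func.call @setArmVLBits(%c128) : (i32) -> ()
  return
}
```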
Existing 'arith::ConstantOp' conversion and tests are moved from
VectorToArmSME. Only a single op is converted at the moment, but this
will grow in the future as things like in-tile add are implemented.
Also, 'createLoopOverTileSlices' is moved to ArmSME
utils since it's relevant for both conversions.
When we generate the loop body function, we need to make sure that all
original loop counters are replaced by the new counter.
We need to save all users of the original loop counter and then perform
the replacement. Otherwise, there is a risk that some values are not
updated.
This patch updates the cp.async.bulk.{commit/wait}_group Ops to use NVVM
intrinsics.
* Doc updated for the commit_group Op.
* Tests are added to verify the lowering to the intrinsics.
While we are at it, fix the FileCheck directive on the
'nvvm.setmaxregister' test.
Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
The current implementation of `nvvm.wgmma.mma_async` Op deduces the data
type of the output matrix from the data type of the struct members, which can be
non-intuitive, especially in cases where types like `2xf16` are packed
into `i32`.
This PR addresses this issue by improving the Op to include an explicit
data type for the output matrix.
The modified Op now includes an explicit data type for Matrix-D (<f16>),
and looks as follows:
```
%result = llvm.mlir.undef : !llvm.struct<(struct<(i32, i32, ...
nvvm.wgmma.mma_async
%descA, %descB, %result,
#nvvm.shape<m = 64, n = 32, k = 16>,
D [<f16>, #nvvm.wgmma_scale_out<zero>],
A [<f16>, #nvvm.wgmma_scale_in<neg>, <col>],
B [<f16>, #nvvm.wgmma_scale_in<neg>, <col>]
```
This PR fixes the 'nvgpu.warpgroup.mma.store' index calculation. When
the destination memref and current accumulator matrix were small, the
previous code computed indices that were out of range.
The buffer deallocation pass checks the IR ("operation preconditions")
to make sure that there is no IR that is unsupported. In such a case,
the pass signals a failure.
The pass now rejects all ops with unknown memory effects. We do not know
whether such an op allocates memory or not. Therefore, the buffer
deallocation pass does not know whether a deallocation op should be
inserted or not.
Memory effects are queried from the `MemoryEffectOpInterface` interface.
Ops that do not implement this interface but have the
`RecursiveMemoryEffects` trait do not have any side effects (apart from
the ones that their nested ops may have).
Unregistered ops are now rejected by the pass because they do not
implement the `MemoryEffectOpInterface`, nor do we know whether they
have `RecursiveMemoryEffects`. All test cases that currently have
unregistered ops are updated to use registered ops.
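For example, an op like the following (name hypothetical) is now
rejected, since nothing is known about its memory effects:
```mlir
// Unregistered op: no MemoryEffectOpInterface, no RecursiveMemoryEffects.
"my_dialect.unknown_op"() : () -> ()
```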
Such cases show up in the middle of optimization passes, e.g., after
some rewrites and then CSE. The current folder can fold such cases when
the inputs are constant; this patch improves it to fold even if the
inputs are non-constant.
This implements a transformation to optimize accesses to shared memory.
Reference: https://reviews.llvm.org/D127457
_This change adds a transformation and pass to the NvGPU dialect that
attempts to optimize reads/writes from a memref representing GPU shared
memory in order to avoid bank conflicts. Given a value representing a
shared memory memref, it traverses all reads/writes within the parent op
and, subject to suitable conditions, rewrites all last dimension index
values such that element locations in the final (col) dimension are
given by `newColIdx = col % vecSize + perm[row](col / vecSize, row)`,
where `perm` is a permutation function indexed by `row` and `vecSize` is
the vector access size in elements (currently assumes 128-bit vectorized
accesses, but this can be made a parameter). This specific
transformation can help optimize typical distributed & vectorized
accesses common to loading matrix multiplication operands to/from shared
memory._
Adds `transform.func.cast_and_call` that takes a set of inputs and
outputs and replaces the uses of those outputs with a call to a function
at a specified insertion point.
The idea with this operation is to allow users to author independent IR
outside of a to-be-compiled module, and then match and replace a slice
of the program with a call to the external function.
Additionally adds a mechanism for populating a type converter with a set
of conversion materialization functions that allow insertion of
casts on the inputs/outputs to and from the types of the function
signature.
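For illustration, a hypothetical invocation (the exact assembly format,
types, and handle names below are assumptions, not taken from this
patch):
```mlir
// Replace the slice producing %outs with a call to @external_fn,
// inserted after the op bound to %insertion_point.
%call = transform.func.cast_and_call @external_fn(%ins) -> %outs
    after %insertion_point
    : (!transform.any_value, !transform.any_value, !transform.any_op) -> !transform.any_op
```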
Provides some context for failing to generate LLVM IR for `target
enter|exit|update` directives when `nowait` is provided. This is
directly helpful for flang users since they would get this error message
if they tried to use `nowait`. Before this patch, we had a very generic
message.
This is a follow-up to https://github.com/llvm/llvm-project/pull/78269,
please only review the latest commit (the one with the same commit
message as the PR title).
This commit simplifies a helper function in the ownership-based buffer
deallocation pass. Fixes a potential double-free (depending on the
scheduling of patterns).
This revision updates the LLVM IR import to support unreachable basic
blocks. An unreachable block may dominate itself and a value defined
inside the block may thus be used before its definition. The import does
not support such dependencies. We thus delete the unreachable basic
blocks before the import. This is possible since MLIR does not have
basic block labels that can be reached using an indirect call and
unreachable blocks can indeed be deleted safely.
Additionally, add a small poison constant import test.
Currently ODS generates `populateDefaultAttrs` or
`populateDefaultProperties` based on whether the dialect opted into
usePropertiesForAttributes. But since individual ops might be opted
into using properties (as long as they have at least one property), it
should actually just check whether the op itself uses properties. Otherwise
`populateDefaultAttrs` will overwrite existing attrs inside properties
when creating an op. Understandably this becomes moot once everything
switches over to using properties, but this fixes it for now.
This PR makes ODS generate `populateDefaultProperties` as long as the op
itself uses properties.
This patch adds support for device_type on the acc.routine operation.
device_type can be specified on the seq, worker, vector, gang, and bind
information.
The support follows the same design as the one for compute operations,
data operations, and the loop operation.
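For illustration, a hypothetical sketch (the assembly format below is an
assumption, not taken from this patch):
```mlir
// seq applies to nvidia targets; worker applies to the default device type.
acc.routine @acc_routine_0 func(@foo) seq([#acc.device_type<nvidia>]) worker
```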
The `irdl.base` op represents an attribute constraint that checks that
the base of a type or attribute is the expected one (e.g. `IntegerType`).
Example:
```mlir
irdl.dialect @cmath {
irdl.type @complex {
%0 = irdl.base "!builtin.integer"
irdl.parameters(%0)
}
irdl.type @complex_wrapper {
%0 = irdl.base @complex
irdl.parameters(%0)
}
}
```
The above program defines a `cmath.complex` type that expects a single
parameter, which is a type with base name `builtin.integer`, i.e., the
name of an `IntegerType`.
It also defines a `cmath.complex_wrapper` type that expects a single
parameter, which is a type with base type `cmath.complex`.
Since most of the operations in the `math` dialect don't have
low-precision implementations, add the -math-legalize-to-f32 pass that
goes through and brackets low-precision math functions (like `math.sin
%0 : f16`) with `arith.extf` and `arith.truncf`. This preserves the
original semantics of the math operation but allows lowering to proceed.
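For illustration, a sketch of the rewrite on one op (value names are
illustrative):
```mlir
// Before:
%res = math.sin %arg : f16

// After -math-legalize-to-f32:
%ext = arith.extf %arg : f16 to f32
%sin = math.sin %ext : f32
%res = arith.truncf %sin : f32 to f16
```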
Versions of this lowering are already implicitly present in some passes,
like ConvertGPUToROCDL. However, because those are implicit rewrites,
they hide the floating-point extension and truncation, preventing anyone
from writing passes that operate on those implicit extf/truncf pairs.
Exposing this legalization explicitly is needed to allow lowering 8-bit
floats on AMD GPUs, as the implementation of extf and truncf on that
platform requires the complex logic found in ArithToAMDGPU, which runs
before the GPU to ROCDL lowering.
After the removal of the OpenMP early outlining MLIR pass in #67319, the
`EarlyOutliningInterface` stopped doing any useful work. It used to be
necessary to tie the name of the function from which a target region was
outlined to that new function, so it would be used when translating to
LLVM IR in place of the outlined function's name.
This is not necessary anymore, so this patch removes all references to
this interface and uses of the `omp.outline_parent_name` discardable
attribute in tests.
Similar to `transform.get_result`, except it returns a handle to the
operand indicated by a positional specification, as defined for the
linalg match ops.
Additionally updates `get_result` to take the same positional specification.
This makes the use case of wanting to get all of the results of an
operation easier by no longer requiring the user to reconstruct the list
of results one-by-one.
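For illustration, a sketch assuming the new op is spelled
`transform.get_operand` and mirrors the `get_result` format:
```mlir
// Handle to operand 0 of each matched op.
%operand = transform.get_operand %matched[0]
    : (!transform.any_op) -> !transform.any_value
// Handle to all results of each matched op, using the positional spec.
%results = transform.get_result %matched[all]
    : (!transform.any_op) -> !transform.any_value
```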
This patch adds the target_cpu attribute to llvm.func MLIR operations
and updates the translation to/from LLVM IR to match "target-cpu"
function attributes.
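For illustration, a minimal sketch (the CPU name is arbitrary):
```mlir
llvm.func @foo() attributes {target_cpu = "gfx90a"} {
  llvm.return
}
```
This would round-trip to an LLVM IR function carrying the
"target-cpu"="gfx90a" attribute.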
This commit renames 4 pattern rewriter API functions:
* `updateRootInPlace` -> `modifyOpInPlace`
* `startRootUpdate` -> `startOpModification`
* `finalizeRootUpdate` -> `finalizeOpModification`
* `cancelRootUpdate` -> `cancelOpModification`
The term "root" is a misnomer. The root is the op that a rewrite pattern
matches against
(https://mlir.llvm.org/docs/PatternRewriter/#root-operation-name-optional).
A rewriter must be notified of all in-place op modifications, not just
in-place modifications of the root
(https://mlir.llvm.org/docs/PatternRewriter/#pattern-rewriter). The old
function names were confusing and have contributed to various broken
rewrite patterns.
Note: The new function names use the term "modify" instead of "update"
for consistency with the `RewriterBase::Listener` terminology
(`notifyOperationModified`).
Add overflow flags support to the following ops:
* `arith.addi`
* `arith.subi`
* `arith.muli`
Example of new syntax:
```
%res = arith.addi %arg1, %arg2 overflow<nsw> : i64
```
Similar to existing LLVM dialect syntax
```
%res = llvm.add %arg1, %arg2 overflow<nsw> : i64
```
Tablegen canonicalization patterns are updated to always drop flags;
proper support with tests will be added later.
The LLVM IR translation is updated as part of this commit, as it is
currently written in a way that would otherwise crash when new
attributes are added to arith ops.
Also lowers `arith` overflow flags to the corresponding SPIR-V op decorations.
Discussion: https://discourse.llvm.org/t/rfc-integer-overflow-flags-support-in-arith-dialect/76025
This effectively rolls forward #77211, #77700 and #77714 while adding a
test to ensure the Python usage is not broken. More follow up needed but
unrelated to the core change here. The changes here are minimal and just
correspond to "textual namespacing" ODS side, no C++ or Python changes
were needed.
---------
Co-authored-by: Ivan Butygin <ivan.butygin@gmail.com>
Co-authored-by: Yi Wu <yi.wu2@arm.com>
This revision moves the ArrayRef field of the LoopAnnotation attribute
to the end of the struct to enable printing and parsing of the
attribute. Previously, the parsing could fail in the presence of a start
or end loc.
If a rewrite pattern returns "failure", it must not have modified the
IR. This commit fixes
`Dialect/Vector/vector-contract-to-outerproduct-transforms-unsupported.mlir`
when running with `MLIR_ENABLE_EXPENSIVE_PATTERN_API_CHECKS`.
```
* Pattern (anonymous namespace)::ContractionOpToOuterProductOpLowering : 'vector.contract -> ()' {
Trying to match "(anonymous namespace)::ContractionOpToOuterProductOpLowering"
** Insert : 'vector.transpose'(0x5625b3a8cb30)
** Insert : 'vector.transpose'(0x5625b3a8cbc0)
"(anonymous namespace)::ContractionOpToOuterProductOpLowering" result 0
} -> failure : pattern failed to match
} -> failure : pattern failed to match
LLVM ERROR: pattern returned failure but IR did change
```
Note: `vector-contract-to-outerproduct-transforms-unsupported.mlir` is
merged into `vector-contract-to-outerproduct-matvec-transforms.mlir`.
The `greedy pattern application failed` error is no longer produced.
This error indicates that the greedy pattern rewrite did not converge;
it does not mean that a pattern could not be applied.
This patch adds an optional offloading handler attribute to the
`gpu.module` op. This attribute will be used during the
`gpu-module-to-binary` pass to override the offloading handler used in
the `gpu.binary` op.
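For illustration, a hypothetical sketch (the angle-bracket syntax below
is an assumption):
```mlir
// Override the default offloading handler for the resulting gpu.binary.
gpu.module @kernels <#gpu.select_object<1>> {
}
```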
An op verifier should verify only local properties. This commit removes
the verification of `scf.for` step sizes. (Verifiers can check
attributes but should not follow SSA values.) This verification could
reject IR that is actually valid, e.g.:
```mlir
scf.if %always_false {
// Branch is never entered.
scf.for ... step %c0 { ... }
}
```
This commit fixes `for-loop-peeling.mlir` when running with
`MLIR_ENABLE_EXPENSIVE_PATTERN_API_CHECKS`:
```
within split at llvm-project/mlir/test/Dialect/SCF/for-loop-peeling.mlir:293 offset :9:3: note: see current operation:
"scf.for"(%0, %3, %2) ({
^bb0(%arg1: index):
%4 = "arith.index_cast"(%arg1) : (index) -> i64
"memref.store"(%4, %arg0) : (i64, memref<i64>) -> ()
"scf.yield"() : () -> ()
}) {__peeled_loop__} : (index, index, index) -> ()
LLVM ERROR: IR failed to verify after folding
```
Note: `%2` is `arith.constant 0 : index`.
Remove the somewhat redundant rank attribute.
Before this change
```
mesh.cluster @mesh(rank = 3, dim_sizes = 2x3)
```
After
```
mesh.cluster @mesh(shape = 2x3x?)
```
The rank is instead determined by the provided shape. With this change,
`getDimSizes()` can no longer be wrongly assumed to have a size equal to
the cluster rank.
Now `getShape().size()` will always equal `getRank()`.
This is causing test failures on AArch64 Linux, hitting the
following assert:
# | mlir-cpu-runner: /home/culrho01/llvm-project/llvm/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldELF.cpp:519: void llvm::RuntimeDyldELF::resolveAArch64Relocation(const SectionEntry &, uint64_t, uint64_t, uint32_t, int64_t): Assertion `isInt<33>(Result) && "overflow check failed for relocation"' failed.
Seeing the same in buildbot as well, e.g.
https://lab.llvm.org/buildbot/#/builders/179/builds/9094/steps/12/logs/FAIL__MLIR__sparse_codegen_dim_mlir
Reverts llvm/llvm-project#78070
This patch adds:
* Support for the 'aligned' variants of the cluster barrier Ops, by
extending the existing Op with an 'aligned' attribute (a syntax sketch
follows below).
* Docs for these Ops.
* Test cases to verify the lowering to the corresponding intrinsics.
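For illustration, a sketch of the aligned variants (the assembly format
is assumed):
```mlir
nvvm.cluster.arrive aligned
nvvm.cluster.wait aligned
```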
Signed-off-by: Durgadoss R <durgadossr@nvidia.com>