Fold expressions on Clang are limited to 256 elements. This causes
compilation errors when the number of elements added exceeds this limit.
Side-step the issue by restoring the original trick that used
std::initializer_list. For the record, in our downstream project,
Clang 16 gives:
mlir/include/mlir/IR/Dialect.h:269:23: fatal error: instantiating fold
expression with 688 arguments exceeded expression nesting limit of 256
(addType<Args>(), ...);
Partially reverts 26d811b3ec.
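For reference, a minimal sketch of the two expansion strategies (names such
as `addType` are illustrative, not the exact Dialect.h code). The fold
expression nests one level per pack element, so Clang's 256-level expression
nesting limit is exceeded for very large packs, while the pre-C++17
std::initializer_list trick expands into a flat braced list:
```
#include <initializer_list>

template <typename T>
void addType() { /* register T */ }

// Nests one level per element: hits "exceeded expression nesting limit of
// 256" on Clang when the pack is large enough.
template <typename... Args>
void addTypesWithFold() {
  (addType<Args>(), ...);
}

// Restored workaround: one flat list element per pack element, no nesting.
template <typename... Args>
void addTypesWithInitList() {
  (void)std::initializer_list<int>{0, (addType<Args>(), 0)...};
}
```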
Co-authored-by: Nikita Kudriavtsev <nikita.kudriavtsev@intel.com>
(cherry picked from commit e3a38a75ddc6ff00301ec19a0e2488d00f2cc297)
GEPArg can only be constructed from int32_t and mlir::Value. Explicitly
cast other types (e.g. unsigned, size_t) to int32_t to avoid narrowing
conversion warnings on MSVC. Some recent examples:
```
mlir\lib\Dialect\LLVMIR\Transforms\TypeConsistency.cpp: error C2398:
Element '1': conversion from 'size_t' to 'T' requires a narrowing
conversion
with
[
T=mlir::LLVM::GEPArg
]
mlir\lib\Dialect\LLVMIR\Transforms\TypeConsistency.cpp: error C2398:
Element '1': conversion from 'unsigned int' to 'T' requires a narrowing
conversion
with
[
T=mlir::LLVM::GEPArg
]
```
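A minimal sketch of the fix (the function and variable names are illustrative,
not the exact TypeConsistency.cpp code): size_t and unsigned only reach
GEPArg through its int32_t constructor, and doing that conversion inside a
braced list is narrowing, which MSVC rejects, so the cast is made explicit:
```
#include "mlir/Dialect/LLVMIR/LLVMDialect.h"
#include "llvm/ADT/SmallVector.h"
#include <cstdint>

llvm::SmallVector<mlir::LLVM::GEPArg>
makeGEPIndices(size_t byteOffset, unsigned fieldIndex) {
  // Without the casts, the braced list would narrow size_t/unsigned to the
  // int32_t that GEPArg is constructed from.
  return {0, static_cast<int32_t>(byteOffset),
          static_cast<int32_t>(fieldIndex)};
}
```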
Co-authored-by: Nikita Kudriavtsev <nikita.kudriavtsev@intel.com>
(cherry picked from commit 89cd345667a5f8f4c37c621fd8abe8d84e85c050)
Many machine-learning applications (and most software written at AMD)
expect the operation that truncates floats to 8-bit floats to be
saturating. That is, they expect `truncf 256.0 : f32 to f8E4M3FNUZ` to
yield `240.0`, not `NaN`, and similarly for negative numbers. However,
the underlying hardware instruction that can be used for this truncation
implements overflow-to-NaN semantics.
To enable handling this use case, we add the saturate-fp8-truncf option
to ArithToAMDGPU (off by default), which causes the requisite clamping
code to be emitted. Said clamping code ensures that Inf and NaN are
passed through exactly (and thus truncate to NaN).
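A self-contained sketch of the saturating semantics the option enables (the
pass emits the equivalent arith ops; this scalar helper is only illustrative):
```
#include <algorithm>
#include <cmath>

float clampForF8E4M3FNUZTrunc(float x) {
  constexpr float kMaxF8 = 240.0f; // largest finite f8E4M3FNUZ value
  // Inf and NaN are passed through exactly; the hardware truncation then
  // turns them into NaN, since the target format has no Inf encoding.
  if (std::isnan(x) || std::isinf(x))
    return x;
  // Finite values outside the f8 range saturate instead of overflowing to NaN.
  return std::clamp(x, -kMaxF8, kMaxF8);
}
```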
Per review feedback, this commit refactors
createScalarOrSplatConstant() to the Arith dialect utilities and uses
it in this code. It also fixes naming of existing patterns and
switches from vector.extractelement/insertelement to
vector.extract/insert.
This patch adds support for translating dense_resource attributes to
LLVMIR Target.
The support added is similar to how DenseElementsAttr is handled, except
we don't need to handle splats.
Another possible way of doing this is adding iteration on dense_resource,
but that is non-trivial as DenseResourceAttr is not meant to be something
you should directly access. It has subclasses which you are supposed to
use to iterate on it.
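For orientation, a sketch of the blob access such a translation relies on.
The accessor chain reflects my reading of the MLIR resource API and the
surrounding LLVM constant construction is omitted; treat the names as
assumptions rather than the exact Target code:
```
#include "mlir/IR/BuiltinAttributes.h"
#include "llvm/ADT/ArrayRef.h"

// Resources are not meant to be iterated directly; for translation we only
// need the raw bytes, which always hold the full payload (there is no splat
// form to special-case).
llvm::ArrayRef<char> getResourceBytes(mlir::DenseResourceElementsAttr attr) {
  if (mlir::AsmResourceBlob *blob = attr.getRawHandle().getBlob())
    return blob->getData();
  return {};
}
```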
Much confusion occurred earlier today when updating the fallback `int
abi;` in addControlVariables() didn't do anything. This was because that
value is only the fallback for when the ABI version fails to parse ...
which it always should parse, because it has a default value that comes
from multiple different places.
This commit updates all the places said default variable can come from,
namely:
1. The ROCDL target attribute definition
2. The ROCDL target attribute's builders
3. The rocdl-attach-target pass's default option values.
With this, the printf test is passing.
The initial design of the `acc.loop` was to be an operation that
encapsulates a loop-like operation. This was an early design, and we now
want to change it so that the `acc.loop` operation becomes a real
loop-like operation by implementing the LoopLikeInterface.
Differential Revision: https://reviews.llvm.org/D159229
This patch was just moved from Phabricator to GitHub.
1. Updates and clarifies a few comments related to hooks for
vector.{insert|extract}_strided_slice.
2. For consistency with vector.insert_strided_slice, removes a TODO from
vector.extract_strided_slice Op def. It's self-explanatory that
adding support for non-unit strides is a "TODO".
This adds a new `mlir_arm_runner_utils` library that contains utils
specific to Arm/AArch64. This is for use in MLIR integration tests.
This initial patch adds `setArmVLBits()` and `setArmSVLBits()`. This
allows changing vector length or streaming vector length at runtime (or
setting it to a known minimum, i.e. 128-bits).
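For illustration, the shape of these hooks and how a test might pick them up
(the exact parameter type and the MLIR-side declaration are assumptions, not
copied from the library):
```
#include <cstdint>

// Hypothetical C-level signatures for the new hooks; tests link against
// mlir_arm_runner_utils and call them before running vector code.
extern "C" void setArmVLBits(uint32_t bits);   // set SVE vector length
extern "C" void setArmSVLBits(uint32_t bits);  // set SME streaming VL

// From an MLIR integration test, the same entry point would be declared as a
// private func (e.g. `func.func private @setArmVLBits(i32)`) and invoked with
// func.call, e.g. to pin the runtime vector length to the 128-bit minimum.
```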
Existing 'arith::ConstantOp' conversion and tests are moved from
VectorToArmSME. There's currently only a single op that's converted, but
this will grow in the future as things like in-tile add are implemented.
Also, 'createLoopOverTileSlices' is moved to ArmSME
utils since it's relevant for both conversions.
We implement `computeNumTerms()`, which counts the number of terms in a
generating function by substituting the unit vector in it.
This is the main function in Barvinok's algorithm – the number of points
in a polytope is given by the number of terms in the generating function
corresponding to it.
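For context, a sketch of the identity this relies on (standard notation, not
code from the patch):
```
f(P; x) = \sum_{p \in P \cap \mathbb{Z}^d} x^p
    \implies   f(P; \mathbf{1}) = \#(P \cap \mathbb{Z}^d)
```
That is, the number of terms of the generating function equals the number of
lattice points of the polytope P. Since Barvinok's algorithm stores f as a
short sum of rational functions, substituting the unit vector has to be done
carefully (as a limit) to avoid the poles at x = 1.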
We also modify the GeneratingFunction class to have `const` getters and
improve the simplification of QuasiPolynomials.
When we generate the loop body function, we need to be sure that all
original loop counters are replaced by the new counter.
We need to first collect all items which use the original loop counter and
only then perform the replacement of the original loop counter. If we don't
do it in that order, there is a risk that some values are not updated.
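A sketch of the collect-then-replace pattern described above (the helper and
value types are illustrative; the real code operates on the generated loop
body's induction variable):
```
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/User.h"
#include "llvm/IR/Value.h"

void replaceCounter(llvm::Value *oldCounter, llvm::Value *newCounter) {
  // Snapshot the users first: rewriting uses while walking the use list
  // risks skipping some of them, leaving stale references to the old counter.
  llvm::SmallVector<llvm::User *> users(oldCounter->users().begin(),
                                        oldCounter->users().end());
  for (llvm::User *user : users)
    user->replaceUsesOfWith(oldCounter, newCounter);
}
```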
This patch updates the cp.async.bulk.{commit/wait}_group Ops to use NVVM
intrinsics.
* Doc updated for the commit_group Op.
* Tests are added to verify the lowering to the intrinsics.
While we are at it, fix the FileCheck directive in the
'nvvm.setmaxregister' test.
Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
The current implementation of the `nvvm.wgmma.mma_async` Op deduces the data
type of the output matrix from the data type of the struct member, which can
be non-intuitive, especially in cases where types like `2xf16` are packed
into an `i32`.
This PR addresses this issue by improving the Op to include an explicit
data type for the output matrix.
The modified Op now includes an explicit data type for Matrix-D (<f16>),
and looks as follows:
```
%result = llvm.mlir.undef : !llvm.struct<(struct<(i32, i32, ...
nvvm.wgmma.mma_async
%descA, %descB, %result,
#nvvm.shape<m = 64, n = 32, k = 16>,
D [<f16>, #nvvm.wgmma_scale_out<zero>],
A [<f16>, #nvvm.wgmma_scale_in<neg>, <col>],
B [<f16>, #nvvm.wgmma_scale_in<neg>, <col>]
```
This PR fixes the 'nvgpu.warpgroup.mma.store' index calculation. When
the destination memref and the current accumulator matrix were small, the
previous code was indexing out of range.
The buffer deallocation pass checks the IR ("operation preconditions")
to make sure that there is no IR that is unsupported. In such a case,
the pass signals a failure.
The pass now rejects all ops with unknown memory effects. We do not know
whether such an op allocates memory or not. Therefore, the buffer
deallocation pass does not know whether a deallocation op should be
inserted or not.
Memory effects are queried from the `MemoryEffectOpInterface` interface.
Ops that do not implement this interface but have the
`RecursiveMemoryEffects` trait do not have any side effects (apart from
the ones that their nested ops may have).
Unregistered ops are now rejected by the pass because they do not
implement the `MemoryEffectOpInterface` and neither do we know if they
have `RecursiveMemoryEffects` or not. All test cases that currently have
unregistered ops are updated to use registered ops.
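The precondition boils down to a check along these lines (a sketch assembled
from the interfaces named above; the trait spelling
`OpTrait::HasRecursiveMemoryEffects` is the C++ counterpart of
`RecursiveMemoryEffects`, and the helper name is illustrative):
```
#include "mlir/IR/Operation.h"
#include "mlir/Interfaces/SideEffectInterfaces.h"

// Returns true if the buffer deallocation pass can reason about the op's
// memory effects; anything else (including unregistered ops) is rejected.
bool hasKnownMemoryEffects(mlir::Operation *op) {
  // Effects can be queried directly from the interface.
  if (mlir::isa<mlir::MemoryEffectOpInterface>(op))
    return true;
  // No interface, but the op only has the effects of its nested ops.
  return op->hasTrait<mlir::OpTrait::HasRecursiveMemoryEffects>();
}
```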
There is already a "block inserted" notification (in
`OpBuilder::Listener`), so there should also be a "block removed"
notification.
The purpose of this change is to make the listener API more mature.
There is currently a gap between what kind of IR changes can be made and
what IR changes can be listened to. At the moment, the only way to
inform listeners about "block removal" is to send a manual
`notifyOperationModified` for the parent op (e.g., by wrapping the
`eraseBlock(b)` method call in `updateRootInPlace(b->getParentOp())`).
This tells the listener that *something* has changed, but it is somewhat
of an API abuse.
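For reference, the workaround described above looks roughly like this (a
sketch; the wrapper name is made up):
```
#include "mlir/IR/PatternMatch.h"

// Signals that *something* changed on the parent op so listeners hear about
// it, even though what actually happened is a block removal. This is the API
// abuse the new "block removed" notification is meant to replace.
void eraseBlockWithNotification(mlir::RewriterBase &rewriter,
                                mlir::Block *block) {
  rewriter.updateRootInPlace(block->getParentOp(),
                             [&] { rewriter.eraseBlock(block); });
}
```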
Such cases show up in the middle of optimization passes, e.g., after
some rewrites and then CSE. The current folder can fold such cases when
the inputs are constant; this patch improves it to fold even if the
inputs are non-constant.
This implements a transformation to optimize accesses to shared memory.
Reference: https://reviews.llvm.org/D127457
_This change adds a transformation and pass to the NvGPU dialect that
attempts to optimize reads/writes from a memref representing GPU shared
memory in order to avoid bank conflicts. Given a value representing a
shared memory memref, it traverses all reads/writes within the parent op
and, subject to suitable conditions, rewrites all last dimension index
values such that element locations in the final (col) dimension are
given by newColIdx = col % vecSize + perm[row](col / vecSize, row)
where perm is a permutation function indexed by row and vecSize
is the vector access size in elements (currently assumes 128bit
vectorized accesses, but this can be made a parameter). This specific
transformation can help optimize typical distributed & vectorized
accesses
common to loading matrix multiplication operands to/from shared memory._
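To make the quoted formula concrete, here is an illustrative remapping with
an XOR-based `perm` (a common bank-conflict-avoiding choice; the actual
permutation the pass uses may differ, and the helper name is made up):
```
#include <cstdint>

// newColIdx = col % vecSize + perm[row](col / vecSize, row).
// vecSize is the vectorized access width in elements (e.g. 8 for 128-bit
// accesses of f16); XOR-ing the row's low bits into the chunk index spreads
// consecutive rows across different shared-memory banks.
int64_t swizzledColIdx(int64_t row, int64_t col, int64_t vecSize,
                       int64_t chunksPerRow) {
  // perm[row]: permute the vector-sized chunk index, then scale back to
  // element units.
  auto perm = [&](int64_t chunk, int64_t r) {
    return (chunk ^ (r % chunksPerRow)) * vecSize;
  };
  return col % vecSize + perm(col / vecSize, row);
}
```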
Adds `transform.func.cast_and_call` that takes a set of inputs and
outputs and replaces the uses of those outputs with a call to a function
at a specified insertion point.
The idea with this operation is to allow users to author independent IR
outside of a to-be-compiled module, and then match and replace a slice
of the program with a call to the external function.
Additionally adds a mechanism for populating a type converter with a set
of conversion materialization functions that allow insertion of
casts on the inputs/outputs to and from the types of the function
signature.
Provides some context for failing to generate LLVM IR for `target
enter|exit|update` directives when `nowait` is provided. This is
directly helpful for flang users since they would get this error message
if they tried to use `nowait`. Before that we had a very generic
message.
This is a follow-up to https://github.com/llvm/llvm-project/pull/78269,
please only review the latest commit (the one with the same commit
message as the PR title).
This commit simplifies a helper function in the ownership-based buffer
deallocation pass. Fixes a potential double-free (depending on the
scheduling of patterns).