`buildUnrealizedCast` used to generate invalid
`builtin.unrealized_conversion_cast` ops with zero results. This commit
fixes
`test/Conversion/OneToNTypeConversion/one-to-n-type-conversion.mlir`
when running with `MLIR_ENABLE_EXPENSIVE_PATTERN_API_CHECKS`.
```
* Pattern (anonymous namespace)::ConvertMakeTupleOp : 'test.make_tuple -> ()' {
Trying to match "(anonymous namespace)::ConvertMakeTupleOp"
[...]
"(anonymous namespace)::ConvertMakeTupleOp" result 1
} -> success : pattern applied successfully
// *** IR Dump After Pattern Application ***
mlir-asm-printer: Verifying operation: func.func
'builtin.unrealized_conversion_cast' op expected at least one result for cast operation
mlir-asm-printer: 'func.func' failed to verify and will be printed in generic form
"func.func"() <{function_type = (i1, i2) -> (i1, i2), sym_name = "pack_unpack"}> ({
^bb0(%arg0: i1, %arg1: i2):
%0 = "test.make_tuple"() : () -> tuple<>
"builtin.unrealized_conversion_cast"(%0) {"__one-to-n-type-conversion_cast-kind__" = "target"} : (tuple<>) -> ()
[...]
}) : () -> ()
within split at /usr/local/google/home/springerm/mlir_public/llvm-project/mlir/test/Conversion/OneToNTypeConversion/one-to-n-type-conversion.mlir:1 offset :20:8: error: 'builtin.unrealized_conversion_cast' op expected at least one result for cast operation
%0 = "test.make_tuple"() : () -> tuple<>
^
within split at /usr/local/google/home/springerm/mlir_public/llvm-project/mlir/test/Conversion/OneToNTypeConversion/one-to-n-type-conversion.mlir:1 offset :20:8: note: see current operation: "builtin.unrealized_conversion_cast"(%0) {"__one-to-n-type-conversion_cast-kind__" = "target"} : (tuple<>) -> ()
LLVM ERROR: IR failed to verify after pattern application
```
We implement two functions that are needed to compute the constant term
of a generating function (GF).
One finds a vector that is not orthogonal to any of the non-null vectors
in a given set.
The other computes the coefficient of any term in an arbitrary rational
function (a quotient of two polynomials).
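As a rough illustration of the second helper (my notation, not taken from the
patch): the coefficient of any term of a quotient of polynomials can be read
off by expanding the quotient as a power series and solving the resulting
recurrence.
```latex
% Coefficient extraction from P(x)/Q(x), assuming Q(0) = q_0 \neq 0.
\frac{P(x)}{Q(x)} = \sum_{i \ge 0} c_i x^i
\;\Longrightarrow\;
p_i = \sum_{j=0}^{i} q_j\, c_{i-j}
\;\Longrightarrow\;
c_i = \frac{1}{q_0}\Bigl(p_i - \sum_{j=1}^{i} q_j\, c_{i-j}\Bigr).
```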
Before applying the peeling patterns, it can happen that the `ForOp`
gets a step of zero during folding. This leads to a division-by-zero
down the line.
This patch adds an additional check for a constant-zero step and a
test.
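For illustration, the problematic IR has roughly the following shape after
folding (a hypothetical example, not the added test case):
```mlir
// The step has been folded to a constant zero; the peeling patterns would
// then divide by it (e.g. when computing the peeled upper bound).
%c0 = arith.constant 0 : index
scf.for %iv = %lb to %ub step %c0 {
  ...
}
```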
Fixes https://github.com/llvm/llvm-project/issues/75758
The simplification of `bufferization.clone` generates a `memref.cast` op, but
the checks in the simplification do not verify whether the operand and result
types of the clone op are compatible, leading to errors. This patch addresses
this issue.
This adds very basic (and inelegant) support for something like spilling
and reloading tiles, if you use more SME tiles than physically exist.
This is purely implemented to prevent the compiler from aborting if a
function uses too many tiles (e.g. due to bad unrolling), but it is
expected to perform very poorly.
Currently, this works in two stages:
During tile allocation, if we run out of tiles, rather than giving up we
switch to allocating 'in-memory' tile IDs. These are tile IDs that start
at 16 (which is higher than any real tile ID). A warning will also be
emitted for each (root) tile op assigned an in-memory tile ID:
```
warning: failed to allocate SME virtual tile to operation, all tile operations will go through memory, expect degraded performance
```
Everything after this works as normal until `-convert-arm-sme-to-llvm`.
Here the in-memory tile op:
```mlir
arm_sme.tile_op { tile_id = <IN MEMORY TILE> }
```
is lowered to:
```mlir
// At function entry:
%alloca = memref.alloca ... : memref<?x?xty>
// Around the op:
// Swap the contents of %alloca and tile 0.
scf.for %slice_idx {
%current_slice = "arm_sme.intr.read.horiz" ... <{tile_id = 0 : i32}>
"arm_sme.intr.ld1h.horiz"(%alloca, %slice_idx) <{tile_id = 0 : i32}>
vector.store %current_slice, %alloca[%slice_idx, %c0]
}
// Execute op using tile 0.
arm_sme.tile_op { tile_id = 0 }
// Swap the contents of %alloca and tile 0.
// This restores tile 0 to its original state.
scf.for %slice_idx {
%current_slice = "arm_sme.intr.read.horiz" ... <{tile_id = 0 : i32}>
"arm_sme.intr.ld1h.horiz"(%alloca, %slice_idx) <{tile_id = 0 : i32}>
vector.store %current_slice, %alloca[%slice_idx, %c0]
}
```
This is inserted during the lowering to LLVM, as spilling/reloading
registers is a very low-level concept that can't really be modeled
correctly at a high level in MLIR.
Note: This always does the worst-case full-tile swap. It could be
optimized to spill/reload only the data the tile op will use, which could
be just a slice. It also makes no use of liveness, which could allow
reusing tiles. However, these optimizations are not seen as important, as
correct code should only use the available number of tiles.
Introduce a new extension for simple print-debugging of transform
dialect scripts. The initial version of this extension consists of two
ops that print the payload objects associated with transform dialect
values. Similar ops were already available in the test extension and
several downstream projects, and were extensively used for testing.
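A sketch of how such an op might be used in a transform script (the op name
`transform.debug.emit_remark_at` and the exact syntax below are assumptions,
not quoted from this patch):
```mlir
transform.sequence failures(propagate) {
^bb0(%arg0: !transform.any_op):
  // Match all payload functions and attach a remark to each of them.
  %funcs = transform.structured.match ops{["func.func"]} in %arg0
      : (!transform.any_op) -> !transform.any_op
  // Hypothetical debug-extension op: emits a remark at each payload op.
  transform.debug.emit_remark_at %funcs, "matched func" : !transform.any_op
}
```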
Support distribution of `vector.transfer_read` ops when operands are
defined inside of the region of `warp_execute_on_lane_0` (except for the
buffer from which the op is reading).
Such IR was previously not supported. This commit changes the
implementation such that indices and the padding value are also
distributed.
This commit simplifies the implementation considerably: the original
implementation created a new `transfer_read` op and then checked if this
new op is valid. If not, the rewrite pattern failed. This was a bit
hacky. It was also a violation of the rewrite pattern API (detected by
`MLIR_ENABLE_EXPENSIVE_PATTERN_API_CHECKS`) because the IR was modified,
but the pattern returned "failure".
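For illustration, IR of roughly the following shape is now supported (a
sketch; the operands and types are made up for the example):
```mlir
%r = vector.warp_execute_on_lane_0(%laneid)[32] -> (vector<1xf32>) {
  // Both the index and the padding value are defined inside the region;
  // they are now distributed along with the transfer_read.
  %c0 = arith.constant 0 : index
  %pad = arith.constant 0.0 : f32
  %v = vector.transfer_read %src[%c0], %pad : memref<?xf32>, vector<32xf32>
  vector.yield %v : vector<32xf32>
}
```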
Rename interface functions as follows:
* `hasTensorSemantics` -> `hasPureTensorSemantics`
* `hasBufferSemantics` -> `hasPureBufferSemantics`
These two functions return "true" if the op has tensor (resp. buffer)
operands but no buffer (resp. tensor) operands.
Also drop the "ranked" part from the interface, i.e., do not distinguish
between ranked/unranked types.
The new function names describe the functions more accurately. They also
align their semantics with the notion of "tensor semantics" in the
bufferization framework. (An op is supposed to be bufferized if it has
tensor operands, and we don't care if it also has memref operands.)
This change is in preparation of #75273, which adds
`BufferizableOpInterface::hasTensorSemantics`. By renaming the functions
in the `DestinationStyleOpInterface`, we can avoid name clashes between
the two interfaces.
In the process, a couple of test transform dialect ops are added. These
operations are not intended to be used as fully fleshed-out
transformation ops, but are rather operations added just for testing.
A separate operation is added to `LinalgTransformOps.td` to convert a
`TilingInterface` operation to loops using the
`generateScalarImplementation` method implemented by the
operation. Eventually this and other operations related to tiling
using the `TilingInterface` need to move to a better place (i.e. out
of the `Linalg` dialect).
The new flag, `--mlir-print-skip-regions`, sets the op printing option
that disables region printing. This results in the usual
`--mlir-print-ir-*` debug options printing only the names of the
executed passes and the signatures of the ops.
Example:
```mlir
// -----// IR Dump Before CSE (cse) //----- //
func.func @bar(%arg0: f32, %arg1: f32) -> f32 {...}
// -----// IR Dump Before Canonicalizer (canonicalize) //----- //
func.func @bar(%arg0: f32, %arg1: f32) -> f32 {...}
```
The main use case is to triage compilation issues (crashes, slowness)
in very deep pass pipelines and with very large IR files, where printing
the IR would otherwise be prohibitively slow.
The IR representation for gang, vector and worker has grown with the
support for device_type. This patch simplifies the IR representation for
the gang, vector and worker information on the acc.loop operation.
When only the keyword is present, without any values, the information is
printed in the same place as when there are values. The device_type is
omitted if there are no values and it is equal to None. Otherwise the
full information is displayed: first the keyword-only device_type
information, then the values with their device_type.
Currently, the `memref.transpose` verifier forces the result type of the
Op to have an explicit `StridedLayoutAttr` via the method
`inferTransposeResultType`. This means that the example Op
given in the documentation is actually invalid because it uses an `AffineMap`
to specify the layout.
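That documentation example is roughly of the following form (sketched here for
reference; treat the exact types as illustrative):
```mlir
// The result layout is expressed with an affine_map rather than a strided
// layout attribute, so the pre-patch verifier rejects it.
%1 = memref.transpose %0 (i, j) -> (j, i)
    : memref<?x?xf32> to memref<?x?xf32, affine_map<(d0, d1)[s0] -> (d1 * s0 + d0)>>
```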
It also means that we can't "un-transpose" a transposed memref back to
the implicit layout form, because the verifier will always enforce the
explicit strided layout.
This patch makes the following changes:
1. The verifier checks whether the canonicalized strided layout of the
result type is identical to the canonicalized inferred result type
layout. This way, it is only important that the two types have the same
strided layout, not necessarily the same representation of it.
2. The folder is extended to support folding away the trivial case of
identity permutation and to fold one transposition into another by
composing the permutation maps.
The folder for `AffineApplyOp` will try creating a `PoisonAttr`
under certain circumstances. However, this will result in a crash if the
`UBDialect` isn't loaded.
This patch adds a dependency of `AffineDialect` on `UBDialect`.
```
llvm-project/mlir/lib/Conversion/TosaToLinalg/TosaToLinalg.cpp:2376:9:
error: object backing the pointer will be destroyed at the end of the full-expression [-Werror,-Wdangling-gsl]
tensor::getMixedSizes(rewriter, loc, input_real);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
llvm-project/mlir/lib/Conversion/TosaToLinalg/TosaToLinalg.cpp:2366:10:
error: unused variable 'imag_el_ty' [-Werror,-Wunused-variable]
auto imag_el_ty = cast<FloatType>(
^
2 errors generated.
```
Clarify what kind of IR modifications are allowed. Also improve the
documentation of the greedy rewrite driver entry points.
Addressing comments in #76219.
1. Fix a bug in verifyMemref to pass in `data` instead of `baseptr`,
which didn't verify data correctly.
2. Add `==` for f16 and bf16.
3. Add a comprehensive test of verifyMemref for all supported types.
Add a rewriter for DIExpressions and use it to run legalization patterns
before exporting to LLVM IR (because the LLVM dialect allows DIExpressions
that may not be valid in LLVM IR).
The rewriter driver works similarly to the existing MLIR rewrite
drivers, except it operates on lists of DIExpressionElemAttr (i.e.
DIExpressionAttr). Each rewrite pattern transforms a range of
DIExpressionElemAttr into a new list of DIExpressionElemAttr.
In addition, this PR sets up a place to add legalization patterns that
are broadly applicable internally to the LLVM dialect, and they will
always be applied prior to export. This PR adds one pattern for merging
fragment operators.
---------
Co-authored-by: Tobias Gysi <tobias.gysi@nextsilicon.com>
Rationale:
Since this mini-pipeline may be used in alternative pipelines (viz.
different from the default "sparsifier" pipeline) where unknown ops are
handled by alternative bufferization methods that are downstream of this
mini-pipeline, we allow unknown ops by default (failure to bufferize is
eventually apparent by failing to convert to LLVM IR).
This is part of enabling e2e testing for TORCH-MLIR tests using a
sparsifier backend.
We implement a function that computes the generating function
corresponding to a unimodular cone.
The generating function for a polytope is obtained by summing these
generating functions over all tangent cones.
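For context, the formula involved is the standard one for a unimodular cone
(stated here as a reminder, not quoted from the patch): for a simplicial
unimodular cone with integer vertex v and generators u_1, ..., u_d,
```latex
% Generating function of K = v + cone(u_1, ..., u_d), with the multi-exponent
% shorthand x^a = x_1^{a_1} \cdots x_d^{a_d}.
f(K; x) = \frac{x^{v}}{\prod_{i=1}^{d} \bigl(1 - x^{u_i}\bigr)}.
```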
Store the last token parsed in the parser state so that the parsed range
can use the end of that token rather than the start of the token parsed
after it. This results in a tighter range (especially true in the case of
comments; see
```mlir
|%c4 = arith.constant 4 : index
// Foo
|
```
vs
```mlir
|%c4 = arith.constant 4 : index|
```
).
Discovered while working on a little textual post processing tool.
This patch updates the setmaxregister NVVM Op to use the
intrinsics instead of inline PTX.
* The interface remains the same (as expected).
* Tests are added to verify the lowered intrinsics in
Target/LLVMIR/nvvmir.mlir.
Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
Change the Rescale shift attribute to be DenseI8ArrayAttr to match the spec
(instead of DenseI32ArrayAttr).
This replaces https://reviews.llvm.org/D157439
Signed-off-by: Tai Ly <tai.ly@arm.com>
They can be simplified to reshape ops if outer_dims_perm is an identity
permutation. The revision adds an `isIdentityPermutation` method to
IndexingUtils.
Following on from #75842, we can demonstrate that loop peeling combined
with masked vectorisation and existing canonicalization for vector.mask
operations leads to the following loop structure:
```
// M dimension
scf.for 1:M
  // N dimension (contains vector ops _without_ masking)
  scf.for 1:UB
    // K dimension
    scf.for 1:K
      vector.add
  // N dimension (contains vector ops _with_ masking)
  scf.for UB:N
    // K dimension
    scf.for 1:K
      vector.mask { vector.add }
```
This is particularly beneficial for scalable vectors, which normally
require masking; this example demonstrates how masking can be avoided in
the main loop.