A tensor.pack op can be rewritten to a tensor.expand_shape op if the
packing only happens on inner most dimension.
This also formats the lit checks better.
This revision adds support for importing call site calling conventions.
Additionally, the revision also adds a roundtrip test for an indirect
call with a non-standard calling convention.
The TypedValue::getType() essentially forwards the return value of
Value::getType() which is a const method. Somehow, at TypedValue level
the method's constness is lost, so restore it.
Originally done by: Nikita Kudriavtsev <nikita.kudriavtsev@intel.com>
The transform.structured.eliminate_empty_tensors can produce mis-typed
IR when traversing use-def chains past tensor reshaping operations for
sharing candidates. This results in Linalg operations whose output types
do not match their 'outs' arguments.
This patch filters out candidate tensor.empty operations when their
types do not match the candidate input operand.
Fixed a bug where IntMatrix determinant() had a bug where it would try to assign to a null
pointer.
Added a test case that triggers this bug to avoid regressions.
The revision moves pack/unpack related patterns to
PackAndUnpackPatterns.cpp. This follows the convention like other tensor
ops.
It also renames `populateSimplifyTensorPack` to
`populateSimplifyPackAndUnpackPatterns` and adds a TODO item for
tensor.unpack op.
The new patterns break down subgroup reduce ops with vector values into
a sequence of subgroup reductions that fit the native shuffle size. The
maximum/native shuffle size is parametrized.
The overall goal is to be able to perform multi-element reductions with
a sequence of `gpu.shuffle` ops.
This commit adds an assert in one of the CallOp builders to ensure it is not use to create an indirect call. Otherwise, the callee type would include the callee pointer type which is handed in as first argument.
Fixed two issues: A SmallVector size that caused size-differences issue
(8 vs. 12). Thus removed this size restriction. Also a constant
parameter was causing an issue in a function not marked constant.
Define basic types and classes for Barvinok's algorithm, including
polyhedra, generating functions and quasi-polynomials.
The class definitions include methods for arithmetic manipulation,
printing, logical relations, etc.
This test checks if MLIR code is lowered according to schema presented
below:
func1() {
call __kmpc_parallel_51(..., func2, ...)
}
func2() {
call __kmpc_for_static_loop_4u(..., func3, ...)
}
func3() {
//loop body
}
This commit adds an additional "expensive check" that verifies the IR
before starting a greedy pattern rewriter, after every pattern
application and after every folding. (Only if
`MLIR_ENABLE_EXPENSIVE_PATTERN_API_CHECKS` is set.)
It also adds an assertion that the `scope` region (part of
`GreedyRewriteConfig`) is not being erased as part of the greedy pattern
rewrite. That would break the scoping mechanism and the expensive
checks.
This commit does not fix any patterns, this is done in separate commits.
This test demonstrates how the PhysicalStorageBuffer extension can be
used end-2-end in a spir-v module.
This module has been verified to pass serialization, deserialization,
and validation with spirv-val.
This change adds (runtime) bounds checks for `memref` ops using the
existing `RuntimeVerifiableOpInterface`. For `memref.load` and
`memref.store`, we check that the indices are in-bounds of the memref's
index space. For `memref.reinterpret_cast` and `memref.subview` we check
that the resulting address space is in-bounds of the input memref's
address space.
Similar to `vector.transfer_read`/`vector.transfer_write`, allow 0-D
vectors.
This commit fixes
`mlir/test/Dialect/Vector/vector-transfer-to-vector-load-store.mlir`
when verifying the IR after each pattern (#74270). That test produces a
temporary 0-D load/store op.
Follow up to the discussion from #75258, and serves as an alternate
solution for #74670.
Set the location to Unknown for deduplicated / moved / materialized
constants by OperationFolder. This makes sure that the folded constants
don't end up with an arbitrary location of one of the original ops that
became it, and that hoisted ops don't confuse the stepping order.
Add missing constant propogation folder for SNegate, [Logical]Not.
Implement additional folding when !(!x) for all ops.
This helps for readability of lowered code into SPIR-V.
Part of work for #70704
This enum is used by dataflow analyses to indicate whether further
propagation is necessary to reach the fix point. Accidentally discarding
such a value will likely lead to propagation stopping early, leading to
incomplete or incorrect results. The most egregious example is the
duality between `join` on the analysis class, which triggers propagation
internally, and `join` on the lattice class that does not and expects
the caller to trigger it depending on the returned `ChangeResult`.
Each vector element is reduced independently, which is a form of
multi-reduction.
The plan is to allow for gradual lowering of multi-reduction that
results in fewer `gpu.shuffle` ops at the end:
1d `vector.multi_reduction` --> 1d `gpu.subgroup_reduce` --> smaller 1d
`gpu.subgroup_reduce` --> packed `gpu.shuffle` over i32
For example we can perform 2 independent f16 reductions with a series of
`gpu.shuffles` over i32, reducing the final number of `gpu.shuffles` by 2x.
Updates two tests for vector.contract -> vector.outerproduct
transformations:
1. Rename "vector-contract-to-outerproduct-transforms.mlir" as
"vector-contract-to-outerproduct-matmul-transforms.mlir". The new
name more accurate captures what's being tested. it is also
consistent with
"vector-contract-to-outerproduct-matvec-transforms.mlir", which
covers vector matvec operations and makes finding relevant tests
easier.
2. For matmul tests, move the traits definining the iteration spaces to
the top of the file. This is consistent with how matvec tests are
defined and also makes it easy to quickly identify what cases are
covered.
3. For matmul tests, use more meaningful names for function arguments.
This helps keep things consistent across the file (i.e. function
definitions wih check lines and comments).
4. For matvec test, move a few tests around so that the most basic case
(without masking) is first.
5. Update comments.
This reverts commit bbc2976868.
This change seems to be at odds with the non-owning part semantics of
MlirOperation in C API. Since downstream clients can only take and
return MlirOperation, it does not sound correct to force all returns of
MlirOperation transfer ownership. Specifically, this makes it impossible
for downstreams to implement IR-traversing functions that, e.g., look at
neighbors of an operation.
The following patch triggers the exception, and there does not seem to
be an alternative way for a downstream binding writer to express this:
```
diff --git a/mlir/lib/Bindings/Python/IRCore.cpp b/mlir/lib/Bindings/Python/IRCore.cpp
index 39757dfad5be..2ce640674245 100644
--- a/mlir/lib/Bindings/Python/IRCore.cpp
+++ b/mlir/lib/Bindings/Python/IRCore.cpp
@@ -3071,6 +3071,11 @@ void mlir::python::populateIRCore(py::module &m) {
py::arg("successors") = py::none(), py::arg("regions") = 0,
py::arg("loc") = py::none(), py::arg("ip") = py::none(),
py::arg("infer_type") = false, kOperationCreateDocstring)
+ .def("_get_first_in_block", [](PyOperation &self) -> MlirOperation {
+ MlirBlock block = mlirOperationGetBlock(self.get());
+ MlirOperation first = mlirBlockGetFirstOperation(block);
+ return first;
+ })
.def_static(
"parse",
[](const std::string &sourceStr, const std::string &sourceName,
diff --git a/mlir/test/python/ir/operation.py b/mlir/test/python/ir/operation.py
index f59b1a26ba48..6b12b8da5c24 100644
--- a/mlir/test/python/ir/operation.py
+++ b/mlir/test/python/ir/operation.py
@@ -24,6 +24,25 @@ def expect_index_error(callback):
except IndexError:
pass
+@run
+def testCustomBind():
+ ctx = Context()
+ ctx.allow_unregistered_dialects = True
+ module = Module.parse(
+ r"""
+ func.func @f1(%arg0: i32) -> i32 {
+ %1 = "custom.addi"(%arg0, %arg0) : (i32, i32) -> i32
+ return %1 : i32
+ }
+ """,
+ ctx,
+ )
+ add = module.body.operations[0].regions[0].blocks[0].operations[0]
+ op = add.operation
+ # This will get a reference to itself.
+ f1 = op._get_first_in_block()
+
+
# Verify iterator based traversal of the op/region/block hierarchy.
# CHECK-LABEL: TEST: testTraverseOpRegionBlockIterators
```
When operations are modified in-place, the rewriter must be notified.
This commit fixes `mlir/test/Conversion/ArmSMEToLLVM/unsupported.mlir`,
`mlir/test/Dialect/ArmSME/tile-zero-masks.mlir` and
`mlir/test/Dialect/ArmSME/vector-ops-to-llvm.mlir` when running with
`MLIR_ENABLE_EXPENSIVE_PATTERN_API_CHECKS` enabled.
This revision changes the alloca handling in the LLVM inliner.
It ensures that alloca operations, even those nested within a
region operation, can be relocated to the entry block of the function,
or the closest ancestor region that is marked with either the
isolated from above or automatic allocation scope trait.
While the LLVM dialect does not have any region operations,
the inlining interface may be used on IR that mixes different
dialects.
When operations are modified in-place, the rewriter must be notified.
This commit fixes `mlir/test/Dialect/EmitC/transforms.mlir` when running
with `MLIR_ENABLE_EXPENSIVE_PATTERN_API_CHECKS` enabled.
Re-land PR after being reverted because of buildbot failures.
This patch adds representation for `device_type` clause information on
compute construct (parallel, kernels, serial).
The `device_type` clause on compute construct impacts clauses that
appear after it. The values impacted by `device_type` are now tied with
an attribute array that represent the device_type associated with them.
`DeviceType::None` is used to represent the value produced by a clause
before any `device_type`. The operands and the attribute information are
parser/printed together.
This is an example with `vector_length` clause. The first value (64) is
not impacted by `device_type` so it will be represented with
DeviceType::None. None is not printed. The second value (128) is tied
with the `device_type(multicore)` clause.
```
!$acc parallel vector_length(64) device_type(multicore) vector_length(256)
```
```
acc.parallel vector_length(%c64 : i32, %c128 : i32 [#acc.device_type<multicore>]) {
}
```
When multiple values can be produced for a single clause like
`num_gangs` and `wait`, an extra attribute describe the number of values
belonging to each `device_type`. Values and attributes are
parsed/printed together.
```
acc.parallel num_gangs({%c2 : i32, %c4 : i32}, {%c4 : i32} [#acc.device_type<nvidia>])
```
While preparing this patch I noticed that the wait devnum is not part of
the operations and is not lowered. It will be added in a follow up
patch.
This patch adds representation for `device_type` clause information on
compute construct (parallel, kernels, serial).
The `device_type` clause on compute construct impacts clauses that
appear after it. The values impacted by `device_type` are now tied with
an attribute array that represent the device_type associated with them.
`DeviceType::None` is used to represent the value produced by a clause
before any `device_type`. The operands and the attribute information are
parser/printed together.
This is an example with `vector_length` clause. The first value (64) is
not impacted by `device_type` so it will be represented with
DeviceType::None. None is not printed. The second value (128) is tied
with the `device_type(multicore)` clause.
```
!$acc parallel vector_length(64) device_type(multicore) vector_length(256)
```
```
acc.parallel vector_length(%c64 : i32, %c128 : i32 [#acc.device_type<multicore>]) {
}
```
When multiple values can be produced for a single clause like
`num_gangs` and `wait`, an extra attribute describe the number of values
belonging to each `device_type`. Values and attributes are
parsed/printed together.
```
acc.parallel num_gangs({%c2 : i32, %c4 : i32}, {%c4 : i32} [#acc.device_type<nvidia>])
```
While preparing this patch I noticed that the wait devnum is not part of
the operations and is not lowered. It will be added in a follow up
patch.
This fixes a longstanding bug in the `Context._CAPICreate` method
whereby it was not taking ownership of the PyMlirContext wrapper when
casting to a Python object. The result was minimally that all such
contexts transferred in that way would leak. In addition, counter to the
documentation for the `_CAPICreate` helper (see
`mlir-c/Bindings/Python/Interop.h`) and the `forContext` /
`forOperation` methods, we were silently upgrading any unknown
context/operation pointer to steal-ownership semantics. This is
dangerous and was causing some subtle bugs downstream where this
facility is getting the most use.
This patch corrects the semantics and will only do an ownership transfer
for `_CAPICreate`, and it will further require that it is an ownership
transfer (if already transferred, it was just silently succeeding).
Removing the mis-aligned behavior made it clear where the downstream was
doing the wrong thing.
It also adds some `_testing_` functions to create unowned context and
operation capsules so that this can be fully tested upstream, reworking
the tests to verify the behavior.
In some torture testing downstream, I was not able to trigger any memory
corruption with the newly enforced semantics. When getting it wrong, a
regular exception is raised.
See #73359
Types using `assemblyFormat` to define parsing don't need an additional
handwritten parser. So we should remove the handwritten parsers where
one
provided by an `assemblyFormat` already exists to avoid confusion and
de-syncing.
When project components are built as separate shared libraries, a lot
of errors appear about undefined symbols, e.g.
```
/usr/bin/ld: CMakeFiles/obj.MLIRGPUPipelines.dir/GPUToNVVMPipeline.cpp.o
: in function `(anonymous namespace)::buildCommonPassPipeline(mlir::OpPa
ssManager&, (anonymous namespace)::GPUToNVVMPipelineOptions const&)':
GPUToNVVMPipeline.cpp:(.text._ZN12_GLOBAL__N_123buildCommonPassPipelineE
RN4mlir13OpPassManagerERKNS_24GPUToNVVMPipelineOptionsE+0xa5): undefined
reference to `mlir::createConvertLinalgToLoopsPass()'
```
Add the necessary dependencies to Dialect/GPU/Pipelines/CMakeLists.txt
The maxnum/minnum semantics can be found at
https://llvm.org/docs/LangRef.html#llvm-minnum-intrinsic.
The revision also updates function names in lit tests to match op name.
Take arith.maxnumf as example:
```
func.func @maxnumf(%lhs: f32, %rhs: f32) -> f32 {
%result = arith.maxnumf %lhs, %rhs : f32
return %result : f32
}
```
will be expanded to
```
func.func @maxnumf(%lhs: f32, %rhs: f32) -> f32 {
%0 = arith.cmpf ugt, %lhs, %rhs : f32
%1 = arith.select %0, %lhs, %rhs : f32
%2 = arith.cmpf uno, %lhs, %lhs : f32
%3 = arith.select %2, %rhs, %1 : f32
return %3 : f32
}
```
Case 1: Both LHS and RHS are not NaN; LHS > RHS
In this case, `%1` is LHS. `%3` and `%1` have the same value, so `%3` is
LHS.
Case 2: LHS is NaN and RHS is not NaN
In this case, `%2` is true, so `%3` is always RHS.
Case 3: LHS is not NaN and RHS is NaN
In this case, `%0` is true and `%1` is LHS. `%2` is false, so `%3` and
`%1` have the same value, which is LHS.
Case 4: Both LHS and RHS are NaN:
`%1` and RHS are all NaN, so the result is still NaN.
The `acc` dialect operations now implement MemoryEffects interfaces in
the following ways:
- Data entry operations which may read host memory via `varPtr` are now
marked as so. The majority of them do NOT actually read the host memory.
For example, `acc.present` works on the basis of presence of pointer and
not necessarily what the data points to - so they are not marked as
reading the host memory. They still use `varPtr` though but this
dependency is reflected through ssa.
- Data clause operations which may mutate the data pointed to by
`accPtr` are marked as doing so.
- Data clause operations which update required structured or dynamic
runtime counters are marked as reading and writing the newly defined
`RuntimeCounters` resource. Some operations, like `acc.getdeviceptr` do
not actually use the runtime counters - but are marked as reading them
since the address obtained depends on the mapping operations which do
update the runtime counters. Namely, `acc.getdeviceptr` cannot be moved
across other mapping operations.
- Constructs are marked as writing to the `ConstructResource`. This may
be too strict but is needed for the following reasons: 1) Structured
constructs may not use `accPtr` and instead use `varPtr` - when this is
the case, data actions may be removed even when used. 2) Unstructured
constructs are currently used to aggregate multiple data actions. We do
not want such constructs removed or moved for now.
- Terminators are marked as `Pure` as in other dialects.
The current approach has the following limitations which may require
further improvements:
- Subsequent `acc.copyin` operations on same data do not actually read
host memory pointed to by `varPtr` but are still marked as so.
- Two `acc.delete` operations on same data may not mutate `accPtr` until
the runtime counters are zero (but are still marked as mutating).
- The `varPtrPtr` argument, when present, points to the address of
location of `varPtr`. When mapping to target device, an `accPtrPtr`
needs computed and this memory is mutated. This effect is not captured
since the current operations do not produce `accPtrPtr`.
- Runtime counter effects are imprecise since two operations with
differing `varPtr` increment/decrement different counters. Additionally,
operations with `varPtrPtr` mutate attachment counters.
- The `ConstructResource` is too strict and likely can be relaxed with
better modeling.
Add an emitc.expression operation that models C expressions, and provide
transforms to form and fold expressions. The translator emits the body
of
emitc.expression ops as a single C expression.
This expression is emitted by default as the RHS of an EmitC SSA value,
but if
possible, expressions with a single use that is not another expression
are
instead inlined. Specific expression's inlining can be fine tuned by
lowering
passes and transforms.
Tests for vector.outerproduct for scalable vectors from
"vector-scalable-outerproduct.mlir" are moved to:
* ops.mlir and invalid.mlir.
These files are effectively used to document what Ops are supported and
That's basically what the original file was testing (but specifically
for scalable vectors).
This change makes block/region walkers consistent with operation
walkers. An operation walk enumerates the current operation. Similarly,
block/region walks should enumerate the current block/region.
Example:
```
// Current behavior:
op1->walk([](Operation *op2) { /* op1 is enumerated */ });
block1->walk([](Block *block2) { /* block1 is NOT enumerated */ });
region1->walk([](Block *block) { /* blocks of region1 are NOT enumerated */ });
region1->walk([](Region *region2) { /* region1 is NOT enumerated });
// New behavior:
op1->walk([](Operation *op2) { /* op1 is enumerated */ });
block1->walk([](Block *block2) { /* block1 IS enumerated */ });
region1->walk([](Block *block) { /* blocks of region1 ARE enumerated */ });
region1->walk([](Region *region2) { /* region1 IS enumerated });
```
This commit adds extra assertions to `OperationFolder` and `OpBuilder`
to ensure that the types of the folded SSA values match with the result
types of the op. There used to be checks that discard the folded results
if the types do not match. This commit makes these checks stricter and
turns them into assertions.
Discarding folded results with the wrong type (without failing
explicitly) can hide bugs in op folders. Two such bugs became apparent
in MLIR (and some more in downstream projects) and are fixed with this
change.
Note: The existing type checks were introduced in
https://reviews.llvm.org/D95991.
Migration guide: If you see failing assertions (`folder produced value
of incorrect type`; make sure to run with assertions enabled!), run with
`-debug` or dump the operation right before the failing assertion. This
will point you to the op that has the broken folder. A common mistake is
a mismatch between static/dynamic dimensions (e.g., input has a static
dimension but folded result has a dynamic dimension).
This is to avoid confusion when dealing with reduction/combining kinds.
For example, see a recent PR comment:
https://github.com/llvm/llvm-project/pull/75846#discussion_r1430722175.
Previously, they were picked to mostly mirror the names of the llvm
vector reduction intrinsics:
https://llvm.org/docs/LangRef.html#llvm-vector-reduce-fmin-intrinsic. In
isolation, it was not clear if `<maxf>` has `arith.maxnumf` or
`arith.maximumf` semantics. The new reduction kind names map 1:1 to
arith ops, which makes it easier to tell/look up their semantics.
Because both the vector and the gpu dialect depend on the arith dialect,
it's more natural to align names with those in arith than with the
lowering to llvm intrinsics.
Issue: https://github.com/llvm/llvm-project/issues/72354
This commit makes reductions part of the terminator. Instead of
`scf.yield`, `scf.reduce` now terminates the body of `scf.parallel` ops.
`scf.reduce` may contain an arbitrary number of reductions, with one
region per reduction.
Example:
```mlir
%init = arith.constant 0.0 : f32
%r:2 = scf.parallel (%iv) = (%lb) to (%ub) step (%step) init (%init, %init)
-> f32, f32 {
%elem_to_reduce1 = load %buffer1[%iv] : memref<100xf32>
%elem_to_reduce2 = load %buffer2[%iv] : memref<100xf32>
scf.reduce(%elem_to_reduce1, %elem_to_reduce2 : f32, f32) {
^bb0(%lhs : f32, %rhs: f32):
%res = arith.addf %lhs, %rhs : f32
scf.reduce.return %res : f32
}, {
^bb0(%lhs : f32, %rhs: f32):
%res = arith.mulf %lhs, %rhs : f32
scf.reduce.return %res : f32
}
}
```
`scf.reduce` operations can no longer be interleaved with other ops in
the body of `scf.parallel`. This simplifies the op and makes it possible
to assign the `RecursiveMemoryEffects` trait to `scf.reduce`. (This was
not possible before because the op was not a terminator, causing the op
to be DCE'd.)
Abort fusion if memref load may alias write, but not the exact alias.
Add alias check hook to `naivelyFuseParallelOps`, so user can customize
alias checking.
Use builtin alias analysis in `ParallelLoopFusion` pass.
Extend the `amendOperation` mechanism for translating dialect attributes
attached to operations from another dialect when translating MLIR to
LLVM IR. Previously, this mechanism would have no knowledge of the LLVM
IR instructions created for the given operation, making it impossible
for it to perform local modifications such as attaching operation-level
metadata. Collect instructions inserted by the LLVM IR builder and pass
them to `amendOperation`.
The `test-lower-to-nvvm` pipeline serves as the common and proper
pipeline for nvvm+host compilation, and it's used across our CUDA
integration tests.
This PR updates the `test-lower-to-nvvm` pipeline to `gpu-lower-to-nvvm`
and moves it within `InitAllPasses.h`. The aim is to call it from
Python, also having a standardize compilation process for nvvm.
There were two issues with the previous computation:
* it never looked at dimensions past the second one
* the definition was recursive, making each dimension have an extra
`elementSize` power
`FoldInsertPadIntoFill` used to generate an invalid
`tensor.insert_slice` op:
```
error: expected type to be 'tensor<?x?x?xf32>' or a rank-reduced version. (size mismatch)
```
This commit fixes tests such as
`mlir/test/Dialect/Linalg/canonicalize.mlir` when verifying the IR after
each pattern application (#74270).
The number of vector elements considered 'small' enough to extract is
parameterized.
This is to avoid going into specialized reduction lowering when a
single/couple of arith ops can do. Targets without dedicated reduction
intrinsics can use that as an emulation path too.
Depends on https://github.com/llvm/llvm-project/pull/75846.
While debugging https://github.com/llvm/llvm-project/issues/71326, the
`LoadOp::verify` code and error were very confusing. This PR improves
that.
This code was a part from the reverted PR
https://github.com/llvm/llvm-project/pull/75519. Fixing the
`-convert-vector-to-scf` issue is going to take a bit longer and this
code was out of scope anyway.
Co-authored-by: Benjamin Maxwell <macdue@dueutil.tech>
The core implementation of the dataflow anlysis framework is
interpocedural by design. While this offers better analysis precision,
it also comes with additional cost as it takes longer for the analysis
to reach the fixpoint state. Add a configuration mechanism to the
dataflow solver to control whether it operates inteprocedurally or not
to offer clients a choice.
As a positive side effect, this change also adds hooks for explicitly
processing external/opaque function calls in the dataflow analyses,
e.g., based off of attributes present in the the function declaration or
call operation such as alias scopes and modref available in the LLVM
dialect.
This change should not affect existing analyses and the default solver
configuration remains interprocedural.
Co-authored-by: Jacob Peng <jacobmpeng@gmail.com>
The specialisation will not be valid when ConstantInt gains native
support for vector types.
This is largely a mechanical change but with extra attention paid to constant
folding, InstCombineVectorOps.cpp, LoopFlatten.cpp and Verifier.cpp to
remove the need to call `getIntegerType()`.
Co-authored-by: Nikita Popov <github@npopov.com>
Check if workshare loop without loop body is lowered correctly i.e.:
1) null pointer is passed to OpenMP device RTL function as a
parameter which denotes loop function body aggregated parameters
2) Outlined loop function body has only one parameter - loop counter
The functionality to `-print-ir-after-all` was added in
caa159f044.
This PR adds a test and, with that, some documentation.
---------
Co-authored-by: Maksim Levental <maksim.levental@gmail.com>
Fixes https://github.com/llvm/llvm-project/issues/71326.
The cause of the issue was that a new `LoadOp` was created which looked
something like:
```mlir
%arg4 =
func.func main(%arg1 : index, %arg2 : index) {
%alloca_0 = memref.alloca() : memref<vector<1x32xi1>>
%1 = vector.type_cast %alloca_0 : memref<vector<1x32xi1>> to memref<1xvector<32xi1>>
%2 = memref.load %1[%arg1, %arg2] : memref<1xvector<32xi1>>
return
}
```
which crashed inside the `LoadOp::verify`. Note here that `%alloca_0` is
0 dimensional, `%1` has one dimension, but `memref.load` tries to index
`%1` with two indices.
This is now fixed by using the fact that `unpackOneDim` always unpacks
one dim
1bce61e6b0/mlir/lib/Conversion/VectorToSCF/VectorToSCF.cpp (L897-L903)
and so the `loadOp` should just index only one dimension.
---------
Co-authored-by: Benjamin Maxwell <macdue@dueutil.tech>
This patch replaces uses of StringRef::{starts,ends}with with
StringRef::{starts,ends}_with for consistency with
std::{string,string_view}::{starts,ends}_with in C++20.
I'm planning to deprecate and eventually remove
StringRef::{starts,ends}with.
Inferring the reshape reassociation indices for extract/insert slice ops
based on the read sizes of the original slicing op will generate an
invalid expand/collapse shape op for already rank-reduced cases. Instead
just infer from the shape of the slice.
Ported from Differential Revision: https://reviews.llvm.org/D147488
Note that at the current moment, the newly-introduced
`SparseTensorLevel` classes are far from complete, we plan to migrate
code generation related to accessing sparse tensor levels to these
classes in the near future to simplify `LoopEmitter`.
Updates the vectorisation of 1D depthwise convolution when flattening
the channel dimension (introduced in #71918). In particular - how the
convolution filter is "flattened". ATM, the vectoriser will use
`vector.shape_cast`:
```mlir
%b_filter = vector.broadcast %filter : vector<4xf32> to vector<3x2x4xf32>
%sc_filter = vector.shape_cast %b_filter : vector<3x2x4xf32> to vector<3x8xf32>
```
This lowering is not ideal - `vector.shape_cast` can be convenient when
it's folded away, but that's not happening in this case. Instead, this
patch updates the vectoriser to use `vector.shuffle` (the overall result
is identical):
```mlir
%sh_filter = vector.shuffle %filter, %filter
[0, 1, 2, 3, 0, 1, 2, 3] : vector<4xf32>, vector<4xf32>
%b_filter = vector.broadcast %sh_filter : vector<8xf32> to vector<3x8xf32>
```
This is an experimental address space for strided buffers. These buffers
can have structs as elements and
a stride > 1.
These pointers allow the indexed access in units of stride, i.e., they
point at `buffer[index * stride]`.
Thus, we can use the `idxen` modifier for buffer loads.
We assign address space 9 to 192-bit buffer pointers which contain a
128-bit descriptor, a 32-bit offset and a 32-bit index. Essentially,
they are fat buffer pointers with an additional 32-bit index.
Add verification and canonicalization for
broadcast, gather, recv, reduce, scatter, send and shift.
The canonicalizations only remove trivial collectives with empty
mesh_axes attrubutes.
This feature had been marked as `TODO` in the `tensor.splat`
documentation for a while. This MR includes:
- Support for dynamically shaped tensors in the return type of
`tensor.splat` with the syntax suggested in the `TODO` comment.
- Updated op documentation.
- Bufferization support.
- Updates in op folders affected by the new feature.
- Unit tests for valid/invalid syntax, valid/invalid folding, and
lowering through bufferization.
- Additional op builders resembling those available in `tensor.empty`.
When the concatenated dim is statically sized but the inputs are
dynamically sized, reifyResultShapes must return the static shape. Fixes
the implementation of the interface for tensor.concat in such cases.
This adds Python abstractions for the different handle types of the
transform dialect
The abstractions allow for straightforward chaining of transforms by
calling their member functions.
As an initial PR for this infrastructure, only a single transform is
included: `transform.structured.match`.
With a future `tile` transform abstraction an example of the usage is:
```Python
def script(module: OpHandle):
module.match_ops(MatchInterfaceEnum.TilingInterface).tile(tile_sizes=[32,32])
```
to generate the following IR:
```mlir
%0 = transform.structured.match interface{TilingInterface} in %arg0
%tiled_op, %loops = transform.structured.tile_using_for %0 [32, 32]
```
These abstractions are intended to enhance the usability and flexibility
of the transform dialect by providing an accessible interface that
allows for easy assembly of complex transformation chains.
There is a use case that we need to peel the first iteration out of the
for loop so that the peeled forOp can be canonicalized away and the
fillOp can be fused into the inner forall loop. For example, we have
nested loops as below
```
linalg.fill ins(...) outs(...)
scf.for %arg = %lb to %ub step %step
scf.forall ...
```
After the peeling transform, it is expected to be
```
scf.forall ...
linalg.fill ins(...) outs(...)
scf.for %arg = %(lb + step) to %ub step %step
scf.forall ...
```
This patch makes the most use of the existing peeling functions and adds
support for peeling the first iteration out of the loop.
Tried to keep this simple while handling obvious CSE instances. For more
complicated cases the expectation is still that the sorting pass would
run before. While simple, this case did turn up in a real deployed
instance where it had a large (>10% e2e) impact. This can of course be
refined.
This patch fixes the error in issue #75434. The crash was being caused
by not checking for a lack of target attributes in a GPU module. It's
now considered an error to invoke the pass with a GPU module with no
target attributes.
This commit implements the LLVM IR invariant intrinsics in LLVM dialect.
These intrinsics can be used to mark a program regions in which the
contents of a specific memory object will not change.
The LLVM dialect implementation also implements the
PromotableOpInterface to ensure Mem2Reg & SROA are able to promote
pointers that are marked using the invariant intrinsics.
Add an op in the OMP dialect to model the `target update` direcive. This
change reuses the `MapInfoOp` used by other device directive to model
`map` clauses but verifies that the restrictions imposed by the `target
update` directive are respected.
Add `num-threads` option to the `-convert-scf-to-openmp` pass, allowing
to set the number of threads to be used in the `omp.parallel` to a fixed
value.
This patch replaces uses of StringRef::{starts,ends}with with
StringRef::{starts,ends}_with for consistency with
std::{string,string_view}::{starts,ends}_with in C++20.
I'm planning to deprecate and eventually remove
StringRef::{starts,ends}with.
`isDimOpValidSymbol` is used during the verification of `affine.for`
ops. It is used to check if LB/UB values are valid symbols. This change
adds support for `memref.cast`, which can be skipped over if it is a
ranked -> ranked cast.
This change fixes `mlir/test/Transforms/canonicalize.mlir`, which used
to fail when verifying the IR after each pattern application (#74270).
In this test case, a pattern that folds dynamic offsets/sizes/strides to
static ones is applied. This pattern inserts a trivial `memref.cast`
that can be folded away. This folding happens after the pattern
application, so the IR fails to verify after applying the
offsets/sizes/strides canonicalization pattern.
Note: The verifier of `affine.for` violates MLIR guidelines. Only local
properties of an op should be verified. The verifier should not inspect
the defining ops of operands. (This would mean that constraints such as
"operand is a valid affine symbol" cannot be verified.)
This reverts commit 87e2e89019.
and its follow-ups 0d1490f09f (#75218)
and 6fe3cd5467 (#75312).
We observed significant OOM/timeout issues due to #74670 to quite a few
services including google-research/swirl-lm. The follow-up #75218 and
#75312 do not address the issue. Perhaps this is worth more
investigation.
For vectors with either leading or trailing unit dim, replaces:
elementwise(a, b)
with:
sc_a = shape_cast(a)
sc_b = shape_cast(b)
res = elementwise(sc_a, sc_b)
return shape_cast(res)
The newly inserted shape_cast Ops fold (before elementwise Op) and then
restore (after elementwise Op) the unit dim. Vectors `a` and `b` are
required to be rank > 1.
Example:
```mlir
%mul = arith.mulf %B_row, %A_row : vector<1x[4]xf32>
%cast = vector.shape_cast %mul : vector<1x[4]xf32> to vector<[4]xf32>
```
gets converted to:
```mlir
%B_row_sc = vector.shape_cast %B_row : vector<1x[4]xf32> to vector<[4]xf32>
%A_row_sc = vector.shape_cast %A_row : vector<1x[4]xf32> to vector<[4]xf32>
%mul = arith.mulf %B_row_sc, %A_row_sc : vector<[4]xf32>
%mul_sc = vector.shape_cast %mul : vector<[4]xf32> to vector<1x[4]xf32>
%cast = vector.shape_cast %mul_sc : vector<1x[4]xf32> to vector<[4]xf32>
```
In practice, the bottom 2 shape_cast(s) will be folded away.
This is a follow-up on [PR
75218](https://github.com/llvm/llvm-project/pull/75218) that avoids
reconstructing a fused loc in the `FlattenFusedLocationRecursively`
helper when there has been no change.
This moves the fix out of the IR and into the pass description, which
seems nicer. It also works as an integration test for the
`only-if-required-by-ops` flag :)
This commit ensures that we model DI information for global constants
correctly. These constructs can lack scopes, names, and linkage names,
so these parameters were made optional for the DIGlobalVariable
attribute.
The disallowlist was used as a migration strategy while support was
extended to more side effecting operations. We now (to the best of our
knowledge) support all side effecting operations, so never fail
`isLegalToInline` on any LLVM operation.
There is no test included, because that's exactly the reason for this
change: there are no more unsupported operations in inlining; the
existing tests for unsupported inlines have already been burninated.
Separates actual transformation files from supporting utility files in
the transforms directory. Includes a bazel overlay fix for the build (as
well as a bit of cleanup of that file to be less verbose and more
flexible).
I was not able to fully triage why this just started failing on one of
our bots as it seems that the use was added 4 months ago. I would assume
that it was accidentally coming in transitively in some way as the dep
was definitely missing.
For context, this started failing in [our
byo_llvm](https://github.com/openxla/iree/blob/main/build_tools/llvm/byo_llvm.sh)
build on a stock build of MLIR on top of an existing LLVM. We were
getting:
```
ld.lld: error: undefined symbol: mlir::registerSPIRVDialectTranslation(mlir::DialectRegistry&) >>> referenced by mlir-opt.cpp
>>> tools/mlir-opt/CMakeFiles/mlir-opt.dir/mlir-opt.cpp.o:(main)
```
[PR 74670](https://github.com/llvm/llvm-project/pull/74670) added
support for merging locations at constant folding time. We have
discovered that in some cases, the number of locations grows so big as
to cause a compilation process to OOM. In that case, many of the
locations end up appearing several times in nested fused locations.
We add here a helper that always flattens fused locations in order to
eliminate duplicates in the case of nested fused locations.
1. A single dimension can either be blocked (with floordiv and mod pair)
or non-blocked. Mixing them would be invalid.
2. Block size should be non-zero value.
Does transformations like
all_reduce(x) + all_reduce(y) -> all_reduce(x + y)
max(all_reduce(x), all_reduce(y)) -> all_reduce(max(x, y))
when the all_reduce element-wise op is max.
Added general rewrite pattern HomomorphismSimplification and
EndomorphismSimplification that encapsulate the general algorithm.
Made specialization for all-reduce with respect to
addf, addi, minsi, maxsi, minimumf and maximumf
in the Arithmetic dialect.
Add a configuration option to allow vector distribution with multiple
elements written by a single lane.
This is so that we can perform vector multi-reduction with multiple
results per workgroup.
The folder for `tensor.extract` is not operating correctly when it is
consuming the result of a `tensor.from_elements` operation.
The existing unit test named `@extract_from_tensor.from_elements_3d` in
`mlir/test/Dialect/Tensor/canonicalize.mlir` seems an attempt to stress
this code. However, this unit tests creates a `tensor.from_elements` op
exclusively from constants, which gets folded away into a single
constant tensor. Therefore, the buggy code was never executed in unit
tests.
I have added a new unit test named
`@extract_from_tensor.from_elements_variable_3d` that makes sure the
`tensor.from_elements` op is not folded away by having its input
operands come directly from function arguments. The original folder code
would have made this test fail.
This bug was notably affecting the lowering of the `tosa.pad` op in the
`tosa-to-tensor` pass, where the generated code is likely to contain a
`tensor.from_elements` + `tensor.extract` op sequence.
This test checks if proper OpenMP device RTL function is called to
handle workshare loop for GPU.
The code generation for GPU worksharing loops is implemented by the
patch: https://github.com/llvm/llvm-project/pull/73360
Adjust the silenceable failure message as we lower `tensor.unpack` as a
combination of `linalg.transpose` + `tensor.collapse_shape` and
`tensor.extract_slice`.
When merging constants by the operation folder, the location of the op
that remains should be updated to track the new meaning of this op. This
way we do not lose track of all possible source locations that the
constant op came from, and the final location of the op is less reliant
on the order of folding. This will also help debuggers understand how to
step these instructions.
This PR introduces a helper for operation folder to fuse another
location into the location of an op. When an op is deduplicated, fuse
the location of the op to be removed into the op that is retained. The
retained op now represents both original ops.
The FusedLoc will have a string metadata to help understand the reason
for the location fusion (motivated by the
[example](71be8f3c23/mlir/include/mlir/IR/BuiltinLocationAttributes.td (L130))
in the docstring of FusedLoc).
Add missing constant propogation folder for Bitwise[Or|And|Xor].
Move previous Bitwise[Or|And] fold implementations to
SPIRVCanonicalization for consistency.
Implement additional folding when lhs == rhs and rhs = 0 for Xor. As
well as, update an Xor testcase to account for this introduced folding.
This helps for readability of lowered code into SPIR-V.
Part of work for #70704
Fixes https://github.com/llvm/llvm-project/issues/74453.
The `gepToByteOffset` was implicitly casting an signed integer to an
unsigned integer even though negative dimensions are valid for
`llvm.getelementptr`.
---------
Co-authored-by: Tobias Gysi <tobias.gysi@nextsilicon.com>
Without this patch, MLIR crashes with
```
Assertion failed: (getNumDims() == map.getNumResults() && "Number of results mismatch"), function compose, file AffineMap.cpp, line 537.
```
during parsing.
If the loop bound is not initialized, the analysis crashed, as it only checked for nullity. Also checking for initialization fixes the issue.
Signed-off-by: Victor Perez <victor.perez@codeplay.com>
Co-authored-by: Tsang, Whitney <whitney.tsang@intel.com>
Support loops without static boundaries. Since the number of iteration
is not known we need to predicate prologue and epilogue in case the
number of iterations is smaller than the number of stages.
This patch includes work from @chengjunlu
Note, tensor.empty may feed into SPARSE output (meaning it truly has no
values yet), but for a DENSE output, it should always have an initial
value. We ran a verifier over all our tests and this is the only
remaining omission.
For BSR and convolutions, we encounter
(d0, d1, d2, d3) -> ((d0 + d2) floordiv 2, (d1 + d3) floordiv 2, (d0 +
d2) mod 2, (d1 + d3) mod 2)
which crashed the current test. Note that an actual test and working
code is still to follow (since we need to fix a few other things first)
Examle:
substitute
mesh.cluster @mesh0(rank = 2, dim_sizes = [0, 4])
with
mesh.cluster @mesh0(rank = 2, dim_sizes = ?x4)
Same as tensor/memref shapes. The only difference is for 0-rank shapes.
With tensors you would have something like `tensor<f32>`. Here to avoid
matching an empty string a 0-rank shape is denoted by `[]`.
This patch moves PassExecutionAction to Pass.h so that it can be used by
the action framework to introspect and intercede in pass managers that
might be set up opaquely. This provides for a very particular use case,
which essentially involves being able to intercede in a PassManager and
skip or apply individual passes. Because of this, this patch also adds a
test for this use case to verify that it could in fact work.
Declare `getPreservedProducerResults` function which helps to get the
preserved results of the producer linalg generic operation as a result
of elementwise fusion.
This MR fix the argument name of GEPOp builder from `basePtrType` to
`elementType` to avoid confusion.
Co-authored-by: Xiaolei Shi <xiaoleis@nvidia.com>
Do not add the previous users of replaced ops to the worklist during
`notifyOperationReplaced`.
The previous users are modified inplace as part of
`PatternRewriter::replaceOp`, which calls
`PatternRewriter::replaceAllUsesWith`. The latter function updates all
users with `updateRootInPlace`, which already puts all previous users of
the replaced op on the worklist. No further worklist management work is
needed in the `notifyOperationReplaced` callback.
This patch modifies the CombineTransferReadOpTranspose pattern to handle
extf ops. Also adds a test which shows the transpose getting folded into
the transfer_read.
Test was failing due to a different transform sequence declaration (transform sequence were used, while now it should be named transform sequence). Test is now fixed.
Fixes https://github.com/llvm/llvm-project/issues/62489.
Some notes for each number:
- 1 `bool-literal` should be reasonably clear from context.
- 2 Fixed.
- 3 This is now fixed. `loc(fused[])` is valid, but `loc(fused["foo",])`
is not.
- 4 This operation uses `assemblyFormat` so the syntax is correct
(assuming ODS is correct).
- 5 This operation uses `assemblyFormat` so the syntax is correct
(assuming ODS is correct).
- 6 Added an example.
- 7 The suggested fix is in line with other `assemblyFormat` examples.
- 8 Added syntax and an example.
- 9 I don't know what this is referring too.
- 10 Added example.
- 11 and 12 suggestion seems wrong as the `ShapedTypeInterface` could be
extended by clients, so is not limited to tensors or vectors.
- 13 is already reasonably clear with the example, I think.
- 14 is already reasonably clear with the example, I think.
- 15 Added an example from the `opaque_locations.mlir` tests.
- 16 The answer to this seems to change over time and depend on the use
case? Suggestions by reviewers are welcome.
Rewrite patterns must return `success` if the IR was modified. This
commit fixes sparse tensor tests such as
`SparseTensor/sparse_fusion.mlir`,
`SparseTensor/CPU/sparse_reduce_custom.mlir`,
`SparseTensor/CPU/sparse_semiring_select.mlir` when running with
`MLIR_ENABLE_EXPENSIVE_PATTERN_API_CHECKS`.
The op used to support only float element types. This was inconsistent
with `ConstantOp::isBuildableWith`, which allows integer element types.
The complex type allows any float/integer element type.
Note: The other complex dialect ops do not support non-float element
types yet. The main purpose of this change to fix
`Tensor/canonicalize.mlir`, which is currently failing when verifying
the IR after each pattern application (#74270).
```
within split at mlir/test/Dialect/Tensor/canonicalize.mlir:231 offset :8:15: error: 'complex.constant' op result #0 must be complex type with floating-point elements, but got 'complex<i32>'
%complex1 = tensor.extract %c1[] : tensor<complex<i32>>
^
within split at mlir/test/Dialect/Tensor/canonicalize.mlir:231 offset :8:15: note: see current operation: %0 = "complex.constant"() <{value = [1 : i32, 2 : i32]}> : () -> complex<i32>
"func.func"() <{function_type = () -> tensor<3xcomplex<i32>>, sym_name = "extract_from_elements_complex_i"}> ({
%0 = "complex.constant"() <{value = [1 : i32, 2 : i32]}> : () -> complex<i32>
%1 = "arith.constant"() <{value = dense<(3,2)> : tensor<complex<i32>>}> : () -> tensor<complex<i32>>
%2 = "arith.constant"() <{value = dense<(1,2)> : tensor<complex<i32>>}> : () -> tensor<complex<i32>>
%3 = "tensor.extract"(%1) : (tensor<complex<i32>>) -> complex<i32>
%4 = "tensor.from_elements"(%0, %3, %0) : (complex<i32>, complex<i32>, complex<i32>) -> tensor<3xcomplex<i32>>
"func.return"(%4) : (tensor<3xcomplex<i32>>) -> ()
}) : () -> ()
```