The generated code for dependent dialects is awkwardly formatted, making
the code harder to read. This change reformats the whitespace to align
code in its context and avoid unnecessary empty lines.
Also included are some typo fixes.
Below are examples of the codegen for a dialect before and after the
change.
Before:
```
GPUDialect::GPUDialect(::mlir::MLIRContext *context)
: ::mlir::Dialect(getDialectNamespace(), context, ::mlir::TypeID::get<GPUDialect>()) {
getContext()->loadDialect<arith::ArithDialect>();
initialize();
}
```
After:
```
GPUDialect::GPUDialect(::mlir::MLIRContext *context)
: ::mlir::Dialect(getDialectNamespace(), context, ::mlir::TypeID::get<GPUDialect>()) {
getContext()->loadDialect<arith::ArithDialect>();
initialize();
}
```
Below are examples of the codegen for a pass before and after the
change.
Before:
```
/// Return the dialect that must be loaded in the context before this pass.
void getDependentDialects(::mlir::DialectRegistry &registry) const override {
registry.insert<func::FuncDialect>();
registry.insert<tensor::TensorDialect>();
registry.insert<tosa::TosaDialect>();
}
```
After:
```
/// Register the dialects that must be loaded in the context before this pass.
void getDependentDialects(::mlir::DialectRegistry &registry) const override {
registry.insert<func::FuncDialect>();
registry.insert<tensor::TensorDialect>();
registry.insert<tosa::TosaDialect>();
}
```
In addition to the existing `OpHandle`, which provides an abstraction to
emit transform ops targeting operations, this introduces a similar
concept for _values_ and _parameters_ in the form of `ValueHandle` and
`ParamHandle`.
New core transform abstractions:
- `constant_param`
- `OpHandle.get_result`
- `OpHandle.print`
- `ValueHandle.get_defining_op`
This commit changes the MLIR to LLVM IR export to also attach subprogram
debug attachments to function declarations.
This commit additionally fixes the two passes that produce subprograms to
not attach the "Definition" flag to function declarations. This
otherwise results in invalid LLVM IR.
`buildUnrealizedCast` used to generate invalid
`builtin.unrealized_conversion_cast` ops with zero results. This commit
fixes
`test/Conversion/OneToNTypeConversion/one-to-n-type-conversion.mlir`
when running with `MLIR_ENABLE_EXPENSIVE_PATTERN_API_CHECKS`.
```
* Pattern (anonymous namespace)::ConvertMakeTupleOp : 'test.make_tuple -> ()' {
Trying to match "(anonymous namespace)::ConvertMakeTupleOp"
[...]
"(anonymous namespace)::ConvertMakeTupleOp" result 1
} -> success : pattern applied successfully
// *** IR Dump After Pattern Application ***
mlir-asm-printer: Verifying operation: func.func
'builtin.unrealized_conversion_cast' op expected at least one result for cast operation
mlir-asm-printer: 'func.func' failed to verify and will be printed in generic form
"func.func"() <{function_type = (i1, i2) -> (i1, i2), sym_name = "pack_unpack"}> ({
^bb0(%arg0: i1, %arg1: i2):
%0 = "test.make_tuple"() : () -> tuple<>
"builtin.unrealized_conversion_cast"(%0) {"__one-to-n-type-conversion_cast-kind__" = "target"} : (tuple<>) -> ()
[...]
}) : () -> ()
within split at /usr/local/google/home/springerm/mlir_public/llvm-project/mlir/test/Conversion/OneToNTypeConversion/one-to-n-type-conversion.mlir:1 offset :20:8: error: 'builtin.unrealized_conversion_cast' op expected at least one result for cast operation
%0 = "test.make_tuple"() : () -> tuple<>
^
within split at /usr/local/google/home/springerm/mlir_public/llvm-project/mlir/test/Conversion/OneToNTypeConversion/one-to-n-type-conversion.mlir:1 offset :20:8: note: see current operation: "builtin.unrealized_conversion_cast"(%0) {"__one-to-n-type-conversion_cast-kind__" = "target"} : (tuple<>) -> ()
LLVM ERROR: IR failed to verify after pattern application
```
We implement two functions that are needed to compute the constant term
of a generating function (GF).
One finds a vector not orthogonal to all the non-null vectors in a given
set.
One computes the coefficient of any term in an arbitrary rational
function (quotient of two polynomials).
Before applying the peeling patterns, it can happen that the `ForOp`
gets a step of zero during folding. This leads to a division-by-zero
down the line.
This patch adds an additional check for a constant-zero step and a
test.
Fix https://github.com/llvm/llvm-project/issues/75758
The simplification of bufferization.clone generates a memref.cast op, but the
checks in the simplification do not verify whether the operand and result
types of the clone op are compatible, leading to errors. This patch
addresses this issue.
This adds very basic (and inelegant) support for something like spilling
and reloading tiles, if you use more SME tiles than physically exist.
This is purely implemented to prevent the compiler from aborting if a
function uses too many tiles (i.e. due to bad unrolling), but is
expected to perform very poorly.
Currently, this works in two stages:
During tile allocation, if we run out of tiles instead of giving up, we
switch to allocating 'in-memory' tile IDs. These are tile IDs that start
at 16 (which is higher than any real tile ID). A warning will also be
emitted for each (root) tile op assigned an in-memory tile ID:
```
warning: failed to allocate SME virtual tile to operation, all tile operations will go through memory, expect degraded performance
```
Everything after this works as normal until `-convert-arm-sme-to-llvm`.
Here the in-memory tile op:
```mlir
arm_sme.tile_op { tile_id = <IN MEMORY TILE> }
```
is lowered to:
```mlir
// At function entry:
%alloca = memref.alloca ... : memref<?x?xty>
// Around the op:
// Swap the contents of %alloca and tile 0.
scf.for %slice_idx {
%current_slice = "arm_sme.intr.read.horiz" ... <{tile_id = 0 : i32}>
"arm_sme.intr.ld1h.horiz"(%alloca, %slice_idx) <{tile_id = 0 : i32}>
vector.store %current_slice, %alloca[%slice_idx, %c0]
}
// Execute op using tile 0.
arm_sme.tile_op { tile_id = 0 }
// Swap the contents of %alloca and tile 0.
// This restores tile 0 to its original state.
scf.for %slice_idx {
%current_slice = "arm_sme.intr.read.horiz" ... <{tile_id = 0 : i32}>
"arm_sme.intr.ld1h.horiz"(%alloca, %slice_idx) <{tile_id = 0 : i32}>
vector.store %current_slice, %alloca[%slice_idx, %c0]
}
```
This is inserted during the lowering to LLVM, as spilling/reloading
registers is a very low-level concept that can't really be modeled
correctly at a high level in MLIR.
Note: This always does the worst-case full-tile swap. This could be
optimized to only spill/load the data the tile op will use, which could be
just a slice. It also makes no use of liveness, which could allow reusing
tiles. But this is not seen as important, as correct code should only use
the available number of tiles.
Introduce a new extension for simple print-debugging of transform
dialect scripts. The initial version of this extension consists of two
ops that print the payload objects associated with transform dialect
values. Similar ops were already available in the test extension and
several downstream projects, and were extensively used for testing.
Support distribution of `vector.transfer_read` ops when operands are
defined inside of the region of `warp_execute_on_lane_0` (except for the
buffer from which the op is reading).
Such IR was previously not supported. This commit changes the
implementation such that indices and the padding value are also
distributed.
This commit simplifies the implementation considerably: the original
implementation created a new `transfer_read` op and then checked if this
new op is valid. If not, the rewrite pattern failed. This was a bit
hacky. It was also a violation of the rewrite pattern API (detected by
`MLIR_ENABLE_EXPENSIVE_PATTERN_API_CHECKS`) because the IR was modified,
but the pattern returned "failure".
Rename interface functions as follows:
* `hasTensorSemantics` -> `hasPureTensorSemantics`
* `hasBufferSemantics` -> `hasPureBufferSemantics`
These two functions return "true" if the op has tensor/buffer operands
but not buffer/tensor operands.
Also drop the "ranked" part from the interface, i.e., do not distinguish
between ranked/unranked types.
The new function names describe the functions more accurately. They also
align their semantics with the notion of "tensor semantics" in the
bufferization framework. (An op is supposed to be bufferized if it has
tensor operands, and we don't care if it also has memref operands.)
This change is in preparation of #75273, which adds
`BufferizableOpInterface::hasTensorSemantics`. By renaming the functions
in the `DestinationStyleOpInterface`, we can avoid name clashes between
the two interfaces.
In the process, a couple of test transform dialect ops are added just
for testing. These operations are not intended to be used as fully
fleshed-out transformation ops, but are rather operations added for testing.
A separate operation is added to `LinalgTransformOps.td` to convert a
`TilingInterface` operation to loops using the
`generateScalarImplementation` method implemented by the
operation. Eventually this and other operations related to tiling
using the `TilingInterface` need to move to a better place (i.e. out
of the `Linalg` dialect).
The new flag, `--mlir-print-skip-regions`, sets the op printing option
that disables region printing. This results in the usual
`--mlir-print-ir-*` debug options printing only the names of the
executed passes and the signatures of the ops.
Example:
```mlir
// -----// IR Dump Before CSE (cse) //----- //
func.func @bar(%arg0: f32, %arg1: f32) -> f32 {...}
// -----// IR Dump Before Canonicalizer (canonicalize) //----- //
func.func @bar(%arg0: f32, %arg1: f32) -> f32 {...}
```
The main use case is to triage compilation issues (crashes, slowness)
on very deep pass pipelines and with very large IR files, where printing
the IR is otherwise prohibitively slow.
The IR representation for gang, vector and worker has grown with the
support for device_type. This patch simplifies the IR representation for
the gang, vector and worker information on the acc.loop operation.
When only the keyword is present without any values, the information
is printed in the same place as when there are values. The device_type
is omitted if there are no values and it is equal to None. Otherwise the
full information is displayed: first the keyword-only device_type
information, and then the values with their device_type.
Currently, the `memref.transpose` verifier forces the result type of the
Op to have an explicit `StridedLayoutAttr` via the method
`inferTransposeResultType`. This means that the example Op
given in the documentation is actually invalid because it uses an `AffineMap`
to specify the layout.
It also means that we can't "un-transpose" a transposed memref back to
the implicit layout form, because the verifier will always enforce the
explicit strided layout.
This patch makes the following changes:
1. The verifier checks whether the canonicalized strided layout of the
result Type is identical to the canonicalized inferred result type
layout. This way, it's only important that the two Types have the same
strided layout, not necessarily the same representation of it.
2. The folder is extended to support folding away the trivial case of
identity permutation and to fold one transposition into another by
composing the permutation maps.
The folder for `AffineApplyOp` will try creating a `PoisonAttr`
under certain circumstances. However, this will result in a crash if the
`UBDialect` isn't loaded.
This patch adds a dependency of `AffineDialect` on `UBDialect`.
```
llvm-project/mlir/lib/Conversion/TosaToLinalg/TosaToLinalg.cpp:2376:9: error: object backing the pointer will be destroyed at the end of the full-expression [-Werror,-Wdangling-gsl]
tensor::getMixedSizes(rewriter, loc, input_real);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
llvm-project/mlir/lib/Conversion/TosaToLinalg/TosaToLinalg.cpp:2366:10: error: unused variable 'imag_el_ty' [-Werror,-Wunused-variable]
auto imag_el_ty = cast<FloatType>(
^
2 errors generated.
```
Clarify what kind of IR modifications are allowed. Also improve the
documentation of the greedy rewrite driver entry points.
Addressing comments in #76219.
1. Fix a bug in verifyMemref to pass in `data` instead of `baseptr`,
which didn't verify data correctly.
2. Add `==` for f16 and bf16.
3. Add a comprehensive test of verifyMemref for all supported types.
Add a rewriter for DIExpressions and use it to run legalization patterns
before exporting to LLVM (because the LLVM dialect allows DIExpressions
that may not be valid in LLVM IR).
The rewriter driver works similarly to the existing MLIR rewriter
drivers, except it operates on lists of DIExpressionElemAttr (i.e.
DIExpressionAttr). Each rewrite pattern transforms a range of
DIExpressionElemAttr into a new list of DIExpressionElemAttr.
In addition, this PR sets up a place to add legalization patterns that
are broadly applicable internally to the LLVM dialect, and they will
always be applied prior to export. This PR adds one pattern for merging
fragment operators.
---------
Co-authored-by: Tobias Gysi <tobias.gysi@nextsilicon.com>
Rationale:
Since this mini-pipeline may be used in alternative pipelines (viz.
different from the default "sparsifier" pipeline) where unknown ops are
handled by alternative bufferization methods that are downstream of this
mini-pipeline, we allow unknown ops by default (failure to bufferize is
eventually apparent by failing to convert to LLVM IR).
This is part of enabling e2e testing for TORCH-MLIR tests using a
sparsifier backend.
We implement a function that computes the generating function
corresponding to a unimodular cone.
The generating function for a polytope is obtained by summing these
generating functions over all tangent cones.
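As a sketch of the underlying identity (a standard Brion-type decomposition; the notation below is ours, not taken from the patch):
```
% Sketch: the lattice-point generating function of a polytope P decomposes as
% the sum of the generating functions of its tangent cones C_v at the vertices v,
% where t^x denotes the monomial t_1^{x_1} ... t_d^{x_d}.
\sum_{x \in P \cap \mathbb{Z}^d} t^x
  = \sum_{v \in \operatorname{vert}(P)} \; \sum_{x \in C_v \cap \mathbb{Z}^d} t^x
```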
Store the last token parsed in the parser state so that the parsed range
can use its end rather than the start of the next token parsed.
This results in a tighter range (especially true in the case of
comments, see
```mlir
|%c4 = arith.constant 4 : index
// Foo
|
```
vs
```mlir
|%c4 = arith.constant 4 : index|
```
).
Discovered while working on a little textual post processing tool.
This patch updates the setmaxregister NVVM Op to use the
intrinsics instead of inline-ptx.
* The interface remains the same (as expected).
* Tests are added to verify the lowered intrinsics in
Target/LLVMIR/nvvmir.mlir.
Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
Change Rescale shift attribute to be DenseI8ArrayAttr to match spec
(instead of DenseI32ArrayAttr)
This replaces https://reviews.llvm.org/D157439
Signed-off-by: Tai Ly <tai.ly@arm.com>
Pack and unpack ops can be simplified to reshape ops if outer_dims_perm is an
identity permutation. The revision adds an `isIdentityPermutation` method to
IndexingUtils.
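For illustration, here is a hypothetical pack that is a pure expansion of the innermost dimension (identity `outer_dims_perm`), together with the reshape it corresponds to; whether the simplification fires on a given case still depends on its other preconditions:
```mlir
// Only the innermost dimension is tiled and outer_dims_perm is the identity,
// so this pack just splits d1 = 256 into 8x32 without any transposition.
%packed = tensor.pack %src inner_dims_pos = [1] inner_tiles = [32]
    into %dest : tensor<128x256xf32> -> tensor<128x8x32xf32>

// The equivalent reshape of the same data.
%reshaped = tensor.expand_shape %src [[0], [1, 2]]
    : tensor<128x256xf32> into tensor<128x8x32xf32>
```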
Following on from #75842, we can demonstrate that loop peeling combined
with masked vectorisation and existing canonicalization for vector.mask
operations leads to the following loop structure:
```
// M dimension
scf.for 1:M
// N dimension (contains vector ops _without_ masking)
scf.for 1:UB
// K dimension
scf.for 1:K
vector.add
// N dimension (contains vector ops _with_ masking)
scf.for UB:N
// K dimension
scf.for 1:K
vector.mask { vector.add }
```
This is particularly beneficial for scalable vectors which normally
require masking. This example demonstrates how to avoid them.
* Rename mesh.process_index -> mesh.process_multi_index.
* Add mesh.process_linear_index op.
* Add lowering of mesh.process_multi_index into an expression using
mesh.process_linear_index, mesh.cluster_shape and
affine.delinearize_index.
This is useful to lower mesh ops and prepare them for further lowering
where the runtime may have only the linear index of a device/process.
For example in MPI we have a rank (linear index) in a communicator.
Add a check to validate that the schedule passed to the pipeliner
transformation is valid and won't cause the pipeliner to break SSA.
This checks that each operation in the loop is scheduled after its
operands.
This operation provides a convenient way to query the streaming vector
length regardless of the streaming mode. This is most useful for functions
that call/pass data to streaming functions, but are not streaming
themselves.
Example:
```mlir
%svl_w = arm_sme.streaming_vl <word>
```
Created based on discussion here:
https://github.com/llvm/llvm-project/pull/76086#discussion_r1434226352
If the -nogpulib option is passed by the user, then the OpenMP device
runtime is not used and we should not emit globals to configure
debugging at compile-time for the device runtime.
Link to -nogpulib flag implementation for Clang:
https://reviews.llvm.org/D125314
Add overflow flags support to the following ops:
* `arith.addi`
* `arith.subi`
* `arith.muli`
Example of new syntax:
```
%res = arith.addi %arg1, %arg2 overflow<nsw> : i64
```
Similar to existing LLVM dialect syntax
```
%res = llvm.add %arg1, %arg2 overflow<nsw> : i64
```
Tablegen canonicalization patterns are updated to always drop the flags;
proper support with tests will be added later.
The LLVMIR translation is also updated as part of this commit, as it is
currently written in a way that would otherwise crash when new attributes
are added to arith ops.
Discussion
https://discourse.llvm.org/t/rfc-integer-overflow-flags-support-in-arith-dialect/76025
---------
Co-authored-by: Yi Wu <yi.wu2@arm.com>
After PR#75548, the OpenACC documentation on the MLIR website has a few
issues. This change corrects them:
- Renames OpenACC.md to OpenACCDialect.md so that links remain
unchanged. In its current state, the links to
https://mlir.llvm.org/docs/Dialects/OpenACCDialect/ no longer work.
- Since the old OpenACCDialect.md (the one with operation definitions)
is being included in the new file, rename the old file to prevent name
ambiguity.
- A header is needed in the .md file, otherwise the index on the website is
not properly created.
- Add a new section before including the operations .md file because
otherwise the separation is not clear.
These tests were initially pushed together with
https://github.com/llvm/llvm-project/pull/75864 but they were triggering
some buildbot failures (sanitizers). They now make use of the
`OwningOpRef` so all the resources are correctly destroyed at the end of
each test.
They will be extended to include all the extra getter functions added
with device_type support.
Since vector loads and stores from scalar memrefs translate to
llvm.load/store, add the ability to tag said loads and stores as
nontemporal. This mirrors functionality available in memref.load/store.
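A minimal sketch of the intended use; the attribute spelling is assumed here to match memref.load/store's `nontemporal`:
```mlir
// Vector load/store on a scalar memref tagged as nontemporal (assumed
// spelling), so the generated llvm.load/store carries the nontemporal hint.
%v = vector.load %buf[%i] {nontemporal = true} : memref<1024xf32>, vector<4xf32>
vector.store %v, %buf[%i] {nontemporal = true} : memref<1024xf32>, vector<4xf32>
```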
This document captures the design philosophy of the acc dialect. It also
shares the rationale behind the design and implementation of various
operations - and ties that back to the dialect design goals.
Co-authored-by: Valentin Clement <clementval@gmail.com>
Co-authored-by: Slava Zakharin <szakharin@nvidia.com>
This patch is based on a previous PR https://reviews.llvm.org/D144657
that added alloca address space handling to MLIR's DataLayout and DLTI
interface. This patch aims to add identical features to import and
access the global and program memory space through MLIR's
DataLayout/DLTI system.
Introduce a new match combinator into the transform dialect. This
operation collects all operations that are yielded by a satisfactory
match into its results. This is a simpler version of `foreach_match`
that can be inserted directly into existing transform scripts.
This adds MLIR versions of the Arm streaming vector length intrinsics.
These allow reading the streaming vector length regardless of the
streaming mode.
As per the docs [1]:
```
In absence of an explicit layout, a memref is considered to have a
multi-dimensional identity affine map layout.
```
This patch makes sure that MemRefs with no strides (i.e. no explicit
layout) are treated as contiguous when checking whether a particular
vector is a contiguous slice of the given MemRef.
[1] https://mlir.llvm.org/docs/Dialects/Builtin/#layout
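For example (an illustrative case, not a test from this patch), the memref below has no explicit layout, i.e. the identity layout, so a vector covering its trailing dimensions is a contiguous slice:
```mlir
// Identity (implicit) layout: the trailing 2x4 elements are contiguous in
// memory, so this read is a contiguous slice of %mem.
%v = vector.transfer_read %mem[%c0, %c0, %c0], %pad {in_bounds = [true, true]}
    : memref<8x2x4xf32>, vector<2x4xf32>
```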
Follow-up for #76428.
This PR adds promised interface declarations for
`ConvertToLLVMPatternInterface` in all the dialects that support the
`ConvertToLLVM` dialect extension.
Promised interfaces allow a dialect to declare that it will have an
implementation of a particular interface, crashing the program if one
isn't provided when the interface is used.
This extension has been superseded by SPV_KHR_cooperative_matrix, which
is supported across major GPU vendors like Nvidia, AMD, and Intel.
Given that the KHR version has been supported for nearly half a year,
drop the NV-specific extension to reduce the maintenance burden and code
duplication.
Currently the `tileConsumerAndFuseProducerGreedilyUsingSCFFor` method
greedily fuses through all slices that are generated during the tile and
fuse flow. That is not the normal use case. Ideally the caller would
like to control which slices get fused and which don't. This patch
introduces a new field to the `SCFTileAndFuseOptions` to specify this
control.
The control function also allows the caller to specify if the replacement
for the fused producer needs to be yielded from within the tiled
computation. This allows replacing the fused producers in case they have
other uses. Without this, the original producers still survive, negating
the utility of the fusion.
The change here also means that the name of the function
`tileConsumerAndFuseProducerGreedily...` can be updated. Deferring that
to a later stage to reduce the churn of API changes.
Make GreedyPatternRewriteDriver handle failures of `materializeConstant`
gracefully. Previously it was not checking whether the returned op was
null and crashing. This PR handles it similarly to how OperationFolder
does it.
Setting thread block size with `maxntid` on the kernel has great
performance benefits. In this way, downstream PTX compiler can do better
register allocation.
MLIR's `gpu.launch` and `gpu.launch_func` already have an attribute
(`known_block_size`) that keeps the thread block size when it is known.
This PR simply uses this attribute to set `maxntid`.
This revision updates the LLVM dialect inliner to explicitly disallow
the inlining of variadic functions. Previously, inlining already
failed if the number of function arguments did not match the number of
call arguments. After the change, inlining checks that the function is not
variadic and that it does not contain a va_start intrinsic.
This commit adds an optional distinct attribute parameter to the
DISubprogramAttr. This enables modeling of distinct subprograms, as
required for LLVM IR. This change is required to avoid accidental
uniquing of subprograms on functions, which would lead to invalid LLVM IR
post export.
This commit adds a distinct attribute parameter to the DICompileUnit to
enable the modeling of distinctness. LLVM requires DICompileUnits to be
distinct, and there are cases where one gets two equivalent compilation
units but LLVM still requires them to be distinct. We observed such
cases for combinations of LTO and inline functions.
This patch also changes the DIScopeForLLVMFuncOp pass to a module pass,
to ensure that only one distinct DICompileUnit is created, instead of
one for each function.
Also improve the implementation of `findCommonDominator` (skip duplicate
blocks) and extract it from `BufferPlacementTransformationBase` (so that
`BufferPlacementTransformationBase` can be retired eventually).
`BufferPlacementTransformationBase::isLoop` checks if there is a loop in
the region branching graph of an operation. This algorithm is similar to
`isRegionReachable` in the `RegionBranchOpInterface`. To avoid duplicate
code, `isRegionReachable` is generalized, so that it can be used to
detect region loops. A helper function
`RegionBranchOpInterface::hasLoop` is added.
This change also turns a recursive implementation into an iterative one,
which is the preferred implementation strategy in LLVM.
Also move the `isLoop` to `BufferOptimizations.cpp`, so that we can
gradually retire `BufferPlacementTransformationBase`. (This is so that
proper error handling can be added to `BufferViewFlowAnalysis`.)
We add some basic type aliases and function definitions relating to
cones for Barvinok's algorithm.
These include functions to get the dual of a cone and find its index.
see #73359
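For context, the two notions, in one common sign convention (a sketch, not taken from the patch):
```
% The dual of a cone C \subseteq \mathbb{R}^d, and the index of a
% full-dimensional simplicial cone generated by u_1, ..., u_d (the number of
% lattice points in its fundamental parallelepiped).
C^{*} = \{\, y \in \mathbb{R}^d : \langle x, y \rangle \ge 0 \ \forall x \in C \,\}
\qquad
\operatorname{ind}(C) = \lvert \det(u_1, \dots, u_d) \rvert
```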
Declarative assemblyFormat ODS is more concise and requires less
boilerplate than filling out CPP interfaces.
Changes:
* Updates the Ops defined in `SPIRVAtomicOps.td` to use assemblyFormat.
* Removes print/parse from `AtomicOps.cpp`, which is now generated by
assemblyFormat.
* Adds a `Trait` to verify that a pointer operand `foo`'s pointee type
matches operand `bar`'s type.
  * Updates the error message expected in tests from the new Trait.
* Updates tests to the updated format (largely using <operand> in place of
"operand").
Changes include:
- spirv serialization and deserialization need handling for cases when a
GlobalVariableOp initializer is defined using a spirv SpecConstant or
SpecConstantComposite op. Currently, even though SpecConst is allowed as an
initializer, only the GlobalVariable map was looked up to find the
initializer symbol reference. This change fixes this and extends the
support to SpecConstantComposite as an initializer.
- Adds tests to make sure GlobalVariable can be initialized using
specialized constants.
---------
Co-authored-by: Lei Zhang <antiagainst@gmail.com>
This PR improves the documentation for the `gpu-lower-to-nvvm-pipeline`
(as it was the remaining item for #75775).
- Changes pipeline `gpu-lower-to-nvvm` -> `gpu-lower-to-nvvm-pipeline`
- Adds a section in the GPU Dialect page on the website. It clarifies the
pipeline's functionality in lowering primary dialects to NVVM targets.
According to
https://mlir.llvm.org/docs/DefiningDialects/Operations/#custom-directives,
custom directive supports attr-dict
> attr-dict Directive: NamedAttrList &
But it doesn't support prop-dict, which was introduced into MLIR recently.
It's useful to have tblgen support prop-dict like attr-dict. This PR
enables tblgen to support prop-dict.
```bash
error: only variables and types may be used as parameters to a custom directive
... custom<Print>(prop-dict)
```
Co-authored-by: Fung Xie <ftse@nvidia.com>
This PR improves the functionality of the `nvgpu.tma.async.load` Op by
adding support for multicast. While we already had this capability in
the lower-level `nvvm.cp.async.bulk.tensor.shared.cluster.global` NVVM
Op, this PR lowers mask information to the NVVM operation.
The `GreedyPatternRewriteDriver` tries to iteratively fold ops and apply
rewrite patterns to ops. It has special handling for constants: they are
CSE'd and sometimes moved to parent regions to allow for additional
CSE'ing. This happens in `OperationFolder`.
To allow for efficient CSE'ing, `OperationFolder` maintains an internal
lookup data structure to find the existing constant ops with the same
value for each `IsolatedFromAbove` region:
```c++
/// A mapping between an insertion region and the constants that have been
/// created within it.
DenseMap<Region *, ConstantMap> foldScopes;
```
Rewrite patterns are allowed to modify operations. In particular, they
may move operations (including constants) from one region to another
one. Such an IR rewrite can make the above lookup data structure
inconsistent.
We encountered such a bug in a downstream project. This bug materialized
in the form of an op that uses the result of a constant op from a
different `IsolatedFromAbove` region (that is not accessible).
This commit changes the behavior of the `GreedyPatternRewriteDriver`
such that `OperationFolder` is used to CSE constants at the beginning of
each iteration (as the worklist is populated), but no longer during an
iteration. `OperationFolder` is no longer used after populating the
worklist, so we do not have to care about inconsistent state in the
`OperationFolder` due to IR rewrites. The `GreedyPatternRewriteDriver`
now performs the op folding by itself instead of calling
`OperationFolder::tryToFold`.
This change changes the order of constant ops in test cases, but not the
region in which they appear. All broken test cases were fixed by turning
`CHECK` into `CHECK-DAG`.
Alternatives considered: The state of `OperationFolder` could be
partially invalidated with every `notifyOperationModified` notification.
That is more fragile than the solution in this commit because incorrect
rewriter API usage can lead to missing notifications and hard-to-debug
`IsolatedFromAbove` violations. (It did not fix the above-mentioned bug in
a downstream project, which could be due to incorrect rewriter API usage
or due to another conceptual problem that I missed.) Moreover, ops are
frequently getting modified during a greedy pattern rewrite, so we would
likely keep invalidating large parts of the state of `OperationFolder`
over and over.
Migration guide: Turn `CHECK` into `CHECK-DAG` in test cases. Constant
ops are no longer folded during a greedy pattern rewrite. If you rely on
folding (and rematerialization) of constant ops during a greedy pattern
rewrite, turn the folder into a pattern.
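For example, an affected test would typically change along these lines (hypothetical FileCheck lines, for illustration only):
```mlir
// Before: the constants had to appear in exactly this order.
// CHECK:     %[[C0:.*]] = arith.constant 0 : index
// CHECK:     %[[C1:.*]] = arith.constant 1 : index

// After: the constants may appear in either order.
// CHECK-DAG: %[[C0:.*]] = arith.constant 0 : index
// CHECK-DAG: %[[C1:.*]] = arith.constant 1 : index
```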
Support WalkResult for AffineExpr walk and support interrupting walks
along the lines of Operation::walk. This allows interrupted walks when a
condition is met. Also, switch from std::function to llvm::function_ref
for the walk function.
This is adding support for `device_type` clause representation in the
OpenACC MLIR dialect on the acc.loop operation and adjust flang to lower
correctly to the new representation.
Each "value" that can be impacted by a `device_type` clause is now
associated with an array attribute that carry this information. This
includes:
- `worker` clause information
- `gang` clause information
- `vector` clause information
- `collapse` clause information
- `tile` clause information
The representation of the `gang` clause information has been updated and
all values are now carried in a single operand segment. This segment is
then subdivided by `device_type`. Each value in a segment is also
associated with a `GangArgType` so it can be differentiated
(num/dim/static). This simplifies the handling of gang values and limits
the number of new attributes needed.
When the clause is associated with the operation without any value
(`gang`, `vector`, `worker`), it is represented by a dedicated attribute
with device_type information.
Extra getter functions are provided to make it easier to retrieve a
value based on a device_type.
Update several tests under mlir/test/Dialect/Transform to use the "main"
transform interpreter pass with named entry points rather than the test
interpreter pass.
This helped discover a logic error in the expensive checks mechanism
that was exiting too early.
Print the op and its types when the fold type check fails. This is to
speed up debugging, as it should be trivial to map the offending op to its
folder based on the op name.
This helps support generic manipulation of operations that don't (yet)
use properties to store inherent attributes.
Use this mechanism in type inference and operation equivalence.
Note that only minimal unit tests are introduced as all the upstream
dialects seem to have been updated to use properties and the
non-property behavior is essentially deprecated and untested.
The parser and printer of string attributes were changed to handle
escape sequences. Therefore, we no longer require a custom parser and
printer. Verification is moved from the parser to the verifier
accordingly.
Replace (in tests and docs):
%forall, %tiled = transform.structured.tile_using_forall
with (updated order of return handles):
%tiled, %forall = transform.structured.tile_using_forall
Similar change is applied to (in the TD tutorial):
transform.structured.fuse_into_containing_op
This update makes sure that the tests/documentation are consistent with
the Op specifications. Follow-up for #67320 which updated the order of
the return handles for `tile_using_forall`.
The change in c1eab57 fixed the
behavior of `getDiscardableAttrDictionary` for ops that are not using
properties to only return discardable attributes. Bytecode writer was
relying on the wrong behavior and would assume all attributes are
discardable, without appropriate testing. Fix that and add a test.
The dataflow analysis framework within MLIR allows customizing the
transfer function when a `call-like` operation is encountered.
The check to see if the analysis was executed in intraprocedural mode
was performed after the check to see if the callee had the
CallableOpInterface, and thus intraprocedural analyses would behave as
interprocedural ones when performing indirect calls.
This commit fixes the issue by performing the check for
intraprocedurality first.
Dense forward analyses were already behaving correctly.
https://github.com/llvm/llvm-project/blob/main/mlir/lib/Analysis/DataFlow/DenseAnalysis.cpp#L63
Co-authored-by: massimo <mo.fioravanti@gmail.com>
Make it so that PDL in pattern rewrites can be optionally disabled.
PDL is still enabled by default and not optional in Bazel, so this should
be a NOP for most folks, while enabling others to disable it.
This only works with tests disabled. With tests enabled this still
compiles, but tests fail as there is no lit config to disable tests that
depend on PDL rewrites yet.
Expand the copying of attributes on GPU kernel arguments during LLVM
lowering.
Support copying attributes from values that are already LLVM pointers.
Support copying attributes, like `noundef`, that aren't specific to (the
pointer parts of) arguments.
Make it so that PDL in pattern rewrites can be optionally disabled.
PDL is still enabled by default and not optional in Bazel, so this should
be a NOP for most folks, while enabling others to disable it.
This is piped through the mlir-tblgen invocation; that could be
changed/avoided by splitting up the passes file instead.
This only works with tests disabled. With tests enabled this still
compiles, but tests fail as there is no lit config to disable tests that
depend on PDL rewrites yet.
The change in c1eab57673 fixed the
behavior of `getDiscardableAttrDictionary` for ops that are not using
properties to only return discardable attributes. AsmPrinter was relying
on the wrong behavior when printing such ops in the generic form,
assuming all attributes are discardable.
This PR adds API `makeReproducer` and cl::opt flag
`--mlir-generate-reproducer=<filename>` in order to allow for mlir
reproducer dumps even when the pipeline doesn't crash.
This PR also decouples the code that handles generation of an MLIR
reproducer from the crash recovery portion. The purpose is to allow for
generating reproducers outside of the context of a compiler crash.
This will be useful for frameworks and runtimes that use MLIR where it
is needed to reproduce the pipeline behavior for reasons outside of
diagnosing crashes. An example is for diagnosing performance issues
using offline tools, where being able to dump the reproducer from a
runtime compiler would be helpful.
The change in c1eab57673 fixed the
behavior of `getDiscardableAttrDictionary` for ops that are not using
properties to only return discardable attributes. `InferTypeOpInterface`
was relying on the wrong behavior when constructing an adaptor and would
assume that all attributes were discardable, which is not the case.
When properties are not enabled in an operation, inherent attributes are
stored in the common dictionary with discardable attributes. However,
`getDiscardableAttrs` and `getDiscardableAttrDictionary` were returning
the entire dictionary, making the caller mistakenly believe that all
inherent attributes are discardable. Fix this by filtering out
attributes whose names are registered with the operation, i.e., inherent
attributes. This requires an API change so `getDiscardableAttrs` returns
a filter range.
Fixes https://github.com/llvm/llvm-project/issues/71326.
This is the second PR. The first PR at
https://github.com/llvm/llvm-project/pull/75519 was reverted because an
integration test failed. The failed integration test was simplified and
added to the core MLIR tests. Compared to the first PR, the current PR
uses a more reliable approach. In summary, the current PR determines the
mask indices by looking up the _mask_ buffer load indices from the
previous iteration, whereas `main` looks up the indices for the _data_
buffer. The mask and data indices can differ when using a
`permutation_map`.
The cause of the issue was that a new `LoadOp` was created which looked
something like:
```mlir
func.func main(%arg1 : index, %arg2 : index) {
%alloca_0 = memref.alloca() : memref<vector<1x32xi1>>
%1 = vector.type_cast %alloca_0 : memref<vector<1x32xi1>> to memref<1xvector<32xi1>>
%2 = memref.load %1[%arg1, %arg2] : memref<1xvector<32xi1>>
return
}
```
which crashed inside the `LoadOp::verify`. Note here that `%alloca_0` is
the mask as can be seen from the `i1` element type and note it is 0
dimensional. Next, `%1` has one dimension, but `memref.load` tries to
index it with two indices.
This issue occurred in the following code (a simplified version of the
bug report):
```mlir
#map1 = affine_map<(d0, d1, d2, d3) -> (d0, 0, 0, d3)>
func.func @main(%subview: memref<1x1x1x1xi32>, %mask: vector<1x1xi1>) -> vector<1x1x1x1xi32> {
%c0 = arith.constant 0 : index
%c0_i32 = arith.constant 0 : i32
%3 = vector.transfer_read %subview[%c0, %c0, %c0, %c0], %c0_i32, %mask {permutation_map = #map1}
: memref<1x1x1x1xi32>, vector<1x1x1x1xi32>
return %3 : vector<1x1x1x1xi32>
}
```
After this patch, it is lowered to the following by
`-convert-vector-to-scf`:
```mlir
func.func @main(%arg0: memref<1x1x1x1xi32>, %arg1: vector<1x1xi1>) -> vector<1x1x1x1xi32> {
%c0_i32 = arith.constant 0 : i32
%c0 = arith.constant 0 : index
%c1 = arith.constant 1 : index
%alloca = memref.alloca() : memref<vector<1x1x1x1xi32>>
%alloca_0 = memref.alloca() : memref<vector<1x1xi1>>
memref.store %arg1, %alloca_0[] : memref<vector<1x1xi1>>
%0 = vector.type_cast %alloca : memref<vector<1x1x1x1xi32>> to memref<1xvector<1x1x1xi32>>
%1 = vector.type_cast %alloca_0 : memref<vector<1x1xi1>> to memref<1xvector<1xi1>>
scf.for %arg2 = %c0 to %c1 step %c1 {
%3 = vector.type_cast %0 : memref<1xvector<1x1x1xi32>> to memref<1x1xvector<1x1xi32>>
scf.for %arg3 = %c0 to %c1 step %c1 {
%4 = vector.type_cast %3 : memref<1x1xvector<1x1xi32>> to memref<1x1x1xvector<1xi32>>
scf.for %arg4 = %c0 to %c1 step %c1 {
%5 = memref.load %1[%arg2] : memref<1xvector<1xi1>>
%6 = vector.transfer_read %arg0[%arg2, %c0, %c0, %c0], %c0_i32, %5 {in_bounds = [true]} : memref<1x1x1x1xi32>, vector<1xi32>
memref.store %6, %4[%arg2, %arg3, %arg4] : memref<1x1x1xvector<1xi32>>
}
}
}
%2 = memref.load %alloca[] : memref<vector<1x1x1x1xi32>>
return %2 : vector<1x1x1x1xi32>
}
```
What was causing the problems is that one dimension of the data buffer
`%alloca` (eltype `i32`) is unpacked (`vector.type_cast`) inside the
outermost loop (loop with index variable `%arg2`) and the nested loop
(loop with index variable `%arg3`), whereas the mask buffer `%alloca_0`
(eltype `i1`) is not unpacked in these loops.
Before this patch, the load indices would be determined by looking up
the load indices for the *data* buffer load op. However, as shown in the
specific example, when a permutation map is specified then the load
indices from the data buffer load op start to differ from the indices
for the mask op. To fix this, this patch ensures that the load indices
for the *mask* buffer are used instead.
---------
Co-authored-by: Mehdi Amini <joker.eph@gmail.com>
Add a new transform operation that creates a new parameter containing the number of payload objects (operations, values or attributes) associated with the argument. This is useful in matching and for debugging purposes. This replaces three ad-hoc operations previously provided by the test extension.