MemRef descriptors contain, among other fields, a field called `alignedPtr` (or `data`) and a field called `offset`. The actual buffer of the MemRef starts `offset` elements after `alignedPtr`. CRunnerUtils provides helper classes to iterate over a MemRef's elements, but the `offset` is not handled consistently, so accessing a MemRef with an `offset` != 0 via an iterator leads to incorrect results.
The problem is that "offset" can be understood in two ways: first, as the offset of the beginning of the MemRef with respect to `alignedPtr`, i.e., what the `offset` field means in the MemRef descriptor; and second, as the offset of some element within the MemRef relative to the MemRef's first element, which would more accurately be called something like `linearIndex`.
The `offset` fields within `StridedMemRefIterator` and `DynamicMemRefIterator` are interpreted the first way; therefore the offsets passed to the constructors of these classes need to account for the offset already present in the descriptor on top of any potential "shift" within the MemRef.
This patch takes care of that and adds some basic tests that catch problems with indexing MemRefs with an `offset`.
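For illustration, a hedged sketch (value names are made up) of a MemRef whose descriptor carries a non-zero offset, e.g. produced by memref.subview; an iterator over %view must add the descriptor offset to the aligned pointer before applying any per-element linear index:

  // %view shares %buf's aligned pointer; its descriptor records offset = 2,
  // so element i of %view lives at alignedPtr + 2 + i, not alignedPtr + i.
  %buf  = memref.alloc() : memref<8xf32>
  %view = memref.subview %buf[2] [4] [1]
      : memref<8xf32> to memref<4xf32, strided<[1], offset: 2>>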
Reviewed By: mehdi_amini
Differential Revision: https://reviews.llvm.org/D157008
The lowering pattern to LLVM for memref.transpose has a bug where
instead of transposing from (source) -> (dest) it actually transposes
(dest) -> (source). This patch fixes the bug and updates the test.
Fix https://github.com/llvm/llvm-project/issues/65145
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D159290
Add a new Buffer Deallocation pass with the intent to replace the old
one. For now it is added as a separate pass alongside the old one, to allow
downstream users to migrate over gradually. The new pass has the goal
of inserting fewer clone operations and supporting additional use cases.
Please refer to the Buffer Deallocation section in the updated
Bufferization.md file for more information on how this new pass works.
Default atomic ordering information is processed in the OpenMP dialect
to LLVM IR lowering stage at every spot where an operation can be
affected by it. The rest of the clauses are stored globally in the
OpenMPIRBuilderConfig object before starting that lowering stage, so
that the OMPIRBuilder can conditionally modify code generation
depending on these. At the end of the process, the omp.requires
attribute is itself lowered into a global constructor that passes these
clauses as flags to the OpenMP runtime.
Depends on D147217, D147218 and D158278.
Differential Revision: https://reviews.llvm.org/D147219
This patch updates the `OpenMPIRBuilderConfig` structure to hold all
available 'requires' clauses, and it replicates part of the code
generation for the 'requires' registration function from clang in the
`OMPIRBuilder`, to be used with flang.
Porting the rest of features of the clang implementation to the IRBuilder
and sharing it between clang and flang remains for a future patch, due to the
complexity of the logic selecting the attributes of the generated
registration function.
Differential Revision: https://reviews.llvm.org/D147217
This commit generalizes empty tensor elimination to operate on subset
ops. No new test cases are added because all current subset ops were
already supported previously. From this perspective, this change is NFC.
A new interface method (and a helper method) are added to
`SubsetInsertionOpInterface` to build the subset of the destination
tensor.
This patch adds support for lowering vector.outerproduct to the ArmSME
MOPA intrinsic for the following types:
vector<[8]xf16>, vector<[8]xf16> -> vector<[8]x[8]xf16>
vector<[8]xbf16>, vector<[8]xbf16> -> vector<[8]x[8]xbf16>
vector<[4]xf32>, vector<[4]xf32> -> vector<[4]x[4]xf32>
vector<[2]xf64>, vector<[2]xf64> -> vector<[2]x[2]xf64>
The FP variants are lowered to FMOPA (non-widening) [1] and the BFloat16
variant to BFMOPA (non-widening) [2].
Note that at the ISA level these variants are implemented by different
architecture features, listed below:
FMOPA (non-widening)
* half-precision - +sme2p1,+sme-f16f16
* single-precision - +sme
* double-precision - +sme-f64f64
BFMOPA (non-widening)
* half-precision - +sme2p1,+b16b16
There's currently no way to target different features when lowering to
ArmSME. Integration tests are added for F32 and F64. We use QEMU to run
the integration tests, but SME2 support isn't available yet (it's
targeted for QEMU 9.0), so integration tests for those variants are excluded.
Masking is currently unsupported.
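As an illustrative sketch (operand names are made up), one of the newly supported cases looks like the following, where %acc0 and %acc1 have type vector<[4]x[4]xf32>:

  // Scalable outer product accumulating into a 2-D scalable result,
  // the shape that maps onto the SME FMOPA (non-widening) instruction.
  %acc1 = vector.outerproduct %lhs, %rhs, %acc0
      : vector<[4]xf32>, vector<[4]xf32>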
Depends on #65450.
[1] https://developer.arm.com/documentation/ddi0602/2023-06/SME-Instructions/FMOPA--non-widening---Floating-point-outer-product-and-accumulate-
[2] https://developer.arm.com/documentation/ddi0602/2023-06/SME-Instructions/BFMOPA--non-widening---BFloat16-floating-point-outer-product-and-accumulate-
Rationale:
This test was really fun to compare the MLIR sparsifier with TACO using
the PyTACO format. However, the underlying mechanism is rapidly growing
outdated with our recent developments. Rather than maintaining the old
code, we are moving toward the newer, better approaches. So if you are
sad this is gone, stay tuned, something better is coming!
This patch is part of a larger initiative aimed at fixing floating-point `max` and `min` operations in MLIR: https://discourse.llvm.org/t/rfc-fix-floating-point-max-and-min-operations-in-mlir/72671.
Within LLVM, there are no masked reduction counterparts for vector reductions such as `fmaximum` and `fminimum`.
More information can be found here: https://github.com/llvm/llvm-project/issues/64940#issuecomment-1690694156.
To address this issue in MLIR, where we need to generate appropriate lowerings for these cases, we employ regular non-masked intrinsics.
However, we modify the input vector using the `arith.select` operation to effectively deactivate undesired elements using a "neutral mask value".
The neutral mask value is the smallest possible value for the `fmaximum` reduction and the largest possible value for the `fminimum` reduction.
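A minimal sketch of the idea (value names and the concrete neutral constant are illustrative, not taken from the patch):

  // Masked-off lanes are replaced with a neutral value that cannot win the
  // reduction, so a regular, non-masked reduction can be used.
  %neutral = arith.constant dense<-3.402823e+38> : vector<4xf32>
  %sel = arith.select %mask, %input, %neutral : vector<4xi1>, vector<4xf32>
  %max = vector.reduction <maximumf>, %sel : vector<4xf32> into f32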
Depends on D158618
Reviewed By: dcaballe
Differential Revision: https://reviews.llvm.org/D158773
Rationale:
This was actually just a pure "string based" test
with very little actual python usage. The output
sparse tensor was handled via the deprecated
convertFromMLIRSparseTensor method.
Using `extraClassOf` currently does not work with attribute or type
interfaces as the generated code calls `getInterfaceFor`, a private
method of `AttributeInterface` and `TypeInterface` respectively.
This PR fixes this by applying the same change that has been done to
`OpInterface` in the past: Make `getInterfaceFor` a protected member of
the class, allowing the generated code in subclasses to use it.
An attribute interface and a type interface with `extraClassOf` have been added
to the test dialect to ensure the generated code compiles without errors.
This is a cleanup in preparation for adding a second conversion path
using the KHR cooperative matrix extension.
Make the existing lowering explicit about emitting ops from the NV coop
matrix extension. Clean up surrounding code.
Currently, the VectorToLLVM patterns are built into a library along
with the corresponding pass, which also pulls in all the
platform-specific vector dialects (like AMXDialect) to apply all the
vector to LLVM conversions.
This causes dependency bloat when writing libraries - for example the
GPU to LLVM passes, which use the vector to LLVM patterns, don't need
the X86Vector dialect to be present at all.
This commit partitions the library into VectorToLLVM and
VectorToLLVMPass, where the latter pulls in all the other vector
transformations.
Reviewed By: nicolasvasilache, mehdi_amini
Differential Revision: https://reviews.llvm.org/D158287
This adds a new pass that attaches an `any` comdat to each linkonce and
linkonce_odr function in the LLVM dialect. These comdats are necessary on
Windows to allow the default system linker to link binaries containing these
functions.
This commit generalizes the special
tensor.extract_slice/tensor.insert_slice bufferization rules to tensor
subset ops.
Ops that insert a tensor into a tensor at a specified subset (e.g.,
tensor.insert_slice, tensor.scatter) can implement the
`SubsetInsertionOpInterface`.
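For example (a hedged sketch with illustrative value names and shapes), a typical subset insertion covered by the interface:

  // Insert a 4x4 tile into %dest at offset [0, 4]; the written subset is
  // described by the offsets/sizes/strides.
  %r = tensor.insert_slice %tile into %dest[0, 4] [4, 4] [1, 1]
      : tensor<4x4xf32> into tensor<8x16xf32>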
Apart from adding a new op interface (extending the API), this change is
NFC. The only ops that currently implement the new interface are
tensor.insert_slice and tensor.parallel_insert_slice, and those ops were
already supported by One-Shot Bufferize.
Since buffer deallocation requires a few passes to be run in a somewhat fixed
sequence, it makes sense to have a pipeline for convenience (and to reduce the
number of transform ops to represent default deallocation).
Reviewed By: springerm
Differential Revision: https://reviews.llvm.org/D159432
The scf.forall.in_parallel terminator operation has a nested graph region with
the NoTerminator trait. Such regions are not supported by the default
implementations. Therefore, this commit adds a specialized implementation for
this operation which only covers the case where the nested region is empty.
This is because, after bufferization, ops like tensor.parallel_insert_slice have
already been converted to memref operations residing in the scf.forall op itself,
and the nested region of scf.forall.in_parallel ends up empty.
Reviewed By: springerm
Differential Revision: https://reviews.llvm.org/D158979
Add a method to the BufferDeallocationOpInterface that allows operations
implementing the interface to provide custom logic for computing the ownership
indicators of the values they define. As a demonstration, this new method is
implemented for the `arith.select` operation.
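For context, an illustrative snippet (not from the patch): `arith.select` can forward either of two memrefs, so the ownership of its result has to be derived from both operands.

  // %m may alias either %a or %b depending on %cond, so its ownership
  // indicator must combine the indicators of both operands.
  %m = arith.select %cond, %a, %b : memref<16xf32>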
Reviewed By: springerm
Differential Revision: https://reviews.llvm.org/D158828
This new interface allows operations to implement custom handling of ownership
values and insertion of dealloc operations, which is useful when an op cannot
implement the interfaces supported by default by the buffer deallocation pass
(e.g., because they are not exactly compatible, because additional semantics
would render the default implementations in buffer deallocation invalid, or
because no interface exists for this kind of behavior and it's not worth
introducing one plus a default implementation in buffer deallocation). It can
also be used to handle a specific op more efficiently than the interface-based
default implementations can.
Reviewed By: springerm
Differential Revision: https://reviews.llvm.org/D158756
Add a new Buffer Deallocation pass replacing the old one with the goal of
inserting fewer clone operations and supporting additional use-cases.
Please refer to the Buffer Deallocation section in the updated
Bufferization.md file for more information on how this new pass works.
Reviewed By: springerm
Differential Revision: https://reviews.llvm.org/D158421
This commit prepares the linalg integration tests to be run with the new
BufferDeallocation pass which requires the bufferization-to-memref pass
to be run afterwards.
The bufferization-to-memref pass may create ops of the SCF, Func, Arith,
and MemRef dialects. Currently, not all integration tests execute all the
conversion passes necessary to lower these dialects to LLVM.
Reviewed By: springerm
Differential Revision: https://reviews.llvm.org/D156663
This is the first commit in a series with the goal to rework the
BufferDeallocation pass. Currently, this pass heavily relies on copies
to perform correct deallocations, which leads to very slow code and
potentially high memory usage. Additionally, there are unsupported cases,
such as returning memrefs, which this series of commits aims to support as well.
This first commit removes the deallocation capabilities of
one-shot-bufferization. One-shot-bufferization should never deallocate any
memrefs, as this will be handled entirely by the buffer-deallocation pass
going forward. This means the allow-return-allocs pass option now defaults
to true and create-deallocs defaults to false; both options, as well as the
escape attribute indicating whether a memref escapes the current region,
will be removed.
The documentation is also updated in this commit to reflect these pass
option changes.
Reviewed By: springerm
Differential Revision: https://reviews.llvm.org/D156662
- Check `MakePointer*` load/store attribute values.
- Support coop matrix types in `MatrixTimesScalar` verification.
- Add test cases for all the remaining ops that accept coop matrix types.
- Split NV and KHR tests.
This cleans up unnecessary code that changed zero-size allocations in order to
avoid the following error message:
'cuMemAlloc(&ptr, sizeBytes)' failed with 'CUDA_ERROR_INVALID_VALUE'
Splits the cleanup block lowered from AsyncToAsyncRuntime.
The motivation for this change is to clarify the CFG branching introduced by
`async.coro.suspend`.
The `async.coro.suspend` op branches into 3 blocks, depending on the
state of the coroutine:
1) suspend
2) resume
3) cleanup
The behavior before this change is that after the coroutine is resumed
and completed, it jumps to a shared cleanup block that destroys the
coroutine state. The CFG looks like the following:

    Entry block
     |        \
   resume      |
     |         |
      \       /
      Cleanup
         |
        End
This CFG can be problematic because the `Cleanup` block is shared and is not
dominated by `resume`. For instance, if some pass wants to add a
resume-specific cleanup mechanism, it may mistakenly add it to the shared
`Cleanup` block, which leads to an "operand does not dominate its use" error
because of the other "Entry -> Cleanup" path.
After this change, the overall structure of the lowered CFG looks like the
following:

    Entry (calling async.coro.suspend)
     |              \
   Resume        Destroy (a duplicate of Cleanup)
     |              |
   Cleanup          |
     |             /
     End (ends the coroutine)
In this case, the Cleanup block tied to the Resume block will be
isolated from the other path and it is strictly dominated by "Resume".
- Fix values of Matrix Operand bit enums.
- Add verification for the aligned Memory Operand attributes. Mark the
'Aligned' enumerant as not supported.
The target test passes validation with `spirv-val`.
Make sure that when analysing a `vector.transfer_read` that's a
candidate for either hoisting or store-to-load forwarding,
`memref.collapse_shape` Ops are correctly included in the alias
analysis. This is done by either
* making sure that relevant users are taken into account, or
* correctly identifying source Ops.
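As a hedged illustration (names and shapes are made up), the kind of aliasing that must be accounted for:

  %c0 = arith.constant 0 : index
  %pad = arith.constant 0.0 : f32
  // %collapsed aliases %alloc, so writes to %alloc must be considered when
  // deciding whether %read can be hoisted or forwarded.
  %collapsed = memref.collapse_shape %alloc [[0, 1]]
      : memref<4x8xf32> into memref<32xf32>
  %read = vector.transfer_read %collapsed[%c0], %pad
      {in_bounds = [true]} : memref<32xf32>, vector<32xf32>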
The current implicit conversion operator from an interface to a "base
interface" of the interface unconditionally calls `this->getImpl()`
which leads to accessing a null pointer if the interface instance is a
null instance.
This PR changes the ODS-generated conversion operator to explicitly check for
this case and return a null interface instance if the `this` instance is a
null instance.
This patch is part of a larger initiative aimed at fixing floating-point `max` and `min` operations in MLIR: https://discourse.llvm.org/t/rfc-fix-floating-point-max-and-min-operations-in-mlir/72671.
This commit addresses Task 1.2 of the mentioned RFC. By renaming these operations, we align their names with LLVM intrinsics that have corresponding semantics.
Currently, a dimLvlMap with an identity affine map is treated as an empty
affine map, but the new syntax treats it as an actual identity affine map
such as (d0) -> (d0). This mismatch could raise an error when comparing
sparse encodings.
Exposes the existing `get(ShapedType, StringRef, AsmResourceBlob)`
builder publicly (was protected) and adds a CAPI
`mlirUnmanagedDenseBlobResourceElementsAttrGet`.
While such a generic construction interface is a big help when it comes
to interop, it is also necessary for creating resources whose element types
don't have a standard C type (e.g., f16, the f8 types, etc.).
Previously reviewed/approved as part of https://reviews.llvm.org/D157064
The `cache` directive may appear at the top of (inside of) a loop. It
specifies array elements or subarrays that should be fetched into the
highest level of the cache for the body of the loop.
The `cache` directive is modeled as data entry operands attached to
the acc.loop operation.
Deprecate the `gpu-to-cubin` & `gpu-to-hsaco` passes in favor of the
`TargetAttr` workflow. This patch removes remaining upstream uses of the
aforementioned passes, including the option to use them in `mlir-opt`. A
future patch will remove these passes entirely.
The passes can be re-enabled in `mlir-opt` by adding the CMake flag: `-DMLIR_ENABLE_DEPRECATED_GPU_SERIALIZATION=1`.
This is a follow-on to D158753, and allows the lowering of a
transfer read/write of n-D vectors with a single trailing scalable dimension
to primitive vector ops.
The final conversion to LLVM depends on D158517 and D158752; without
these patches, type conversion will fail (or an assert is hit in the LLVM
backend) if the final IR contains an array of scalable vectors.
This patch adds `transform.apply_patterns.vector.lower_create_mask`
which allows the lowering of vector.create_mask/constant_mask to be
tested independently of --convert-vector-to-llvm.
Reviewed By: c-rhodes, awarzynski, dcaballe
Differential Revision: https://reviews.llvm.org/D159482
Extend SPIR-V target serialization and deserialization to handle coop
matrix types. Add a roundtrip test. In addition to `FileCheck` checks,
the resulting SPIR-V binary also passes `spirv-val` (an external tool).
Also fix a type attribute bug surfaced by the `CooperativeMatrixLength`
op.
To keep the scope of this patch small, multiple matrix operand attributes
will be handled in a future patch.
The 'TargetAttr' workflow was recently introduced to serialization for
'MLIR->LLVM->PTX'. #65857 removes previous passes (gpu::Serialization*
passes) because they are duplicates.
This PR removes the use of gpu::Serialization* passes in SM_90
integration tests, and enables the 'TargetAttr' workflow.
It also moves the transform dialect specific test to a new folder.
This patch changes vector type conversion to return failure on n-D
scalable vector types instead of asserting.
This is an alternative approach to #65261 that aims to enable lowering
of Vector ops directly to ArmSME intrinsics where possible, and seems
more consistent with other type conversions. It's trivial to hit the
assert at the moment and it could be interpreted as n-D scalable vector
types being a bug, when they're valid types in the Vector dialect.
By returning failure it will generally fail more gracefully,
particularly for release builds or other builds where assertions are
disabled.
This is just a small fix that makes sure that `vector.contract` works
with scalable vectors.
Rather than duplicating all the roundtrip tests for vector.contract, I'm
treating scalable vectors as an edge case and just adding a couple to
verify that this works.
This patch addresses a crash that occurs when negative dynamic sizes are
provided to `tensor.empty` by adding a check to ensure that dynamic
sizes are non-negative.
Fixes #64064
This is a new attempt at https://reviews.llvm.org/D159481, this time as a
GitHub PR.
`GenericOptionValue::compare()` should return `true` for a match.
- `OptionValueBase::compare()` always returns `false` and shouldn't
match anything.
- `OptionValueCopy::compare()` returns `false` if not `Valid` which
corresponds to no match.
Also adding some tests.
This patch adds the option of building an optional symbol table for the
top operation in the `gpu-module-to-binary` pass. The table is not
created by default as most targets don't need it; instead, it is lazily
built. The table is passed through a callback in `TargetOptions`.
This patch is required to integrate #65539.
The revert happened due to a build bot failure that threw 'CUDA_ERROR_UNSUPPORTED_PTX_VERSION'.
The failure's root cause was a pass using "+ptx76" for compilation and an old CUDA driver
on the bot. This commit relands the patch with "+ptx60".
Original Gh PR: #65768
Original commit message:
Migrate tests referencing `gpu-to-cubin` to the new compilation workflow
using `TargetAttrs`. The `test-lower-to-nvvm` pass pipeline was modified
to use the new compilation workflow to simplify the introduction of
future tests.
The `createLowerGpuOpsToNVVMOpsPass` function was removed, as it didn't
allow for passing all options available in the `ConvertGpuOpsToNVVMOp`
pass.
This enables canonicalization to fold away unnecessary tensor.dim ops,
which in turn enables folding away other operations, as can be seen
in conv_tensors_dynamic, where affine.min operations were folded away.
The current printer of `StringRefParameter` simply prints out the
content of the string as-is, without escaping it in any way. This can
generate invalid syntax, causing parser errors when the output is read
back in.
This PR fixes that by adding `printString` to `AsmPrinter`, allowing one
to print a string that can be parsed with `parseString`, using the same
escaping syntax as `StringAttr`.
This patch employs the updated promise mechanism to enforce Target
Attribute IR constraints. Due to this patch, TargetAttributes
implementations no longer have to be registered before executing
translation to LLVM IR in cases where they are not needed, like when
translating `gpu.binary` operations.
This commit provides a default implementation for all ops that implement
the `DestinationStyleOpInterface`. Each result value of such ops is tied to
an operand, and both have the same type.
- Fix order of operands/attributes
- Allow for stride to be any integer type
- Use ODS for parsing/printing
- Update examples and tests
- Fix a typo in SPIR-V tblgen code
According to the LLVM language reference, both volatile memory
operations and atomic operations (except unordered) do not simply read
memory but also perform write operations on arbitrary memory[0][1].
In the case of volatile memory operations, this is the case due to the
read possibly having target specific properties. A common real-world
situation where this happens is reading memory mapped registers on an
MCU for example. Atomic operations are more special. They form a kind of
memory barrier which from the perspective of the optimizer/lang-ref
makes writes from other threads visible in the current thread. Any kind
of synchronization can therefore conservatively be modeled as a
write-effect.
This PR therefore adjusts the side effects of `llvm.load` and
`llvm.store` to add unknown global read and write effects if they are
either atomic or volatile.
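A small illustration of the ops affected (a sketch; operand names are made up):

  // Both of these loads now carry unknown global read and write effects,
  // so they are not removed as dead code even if their results are unused.
  %v = llvm.load volatile %ptr : !llvm.ptr -> i32
  %a = llvm.load %ptr atomic monotonic {alignment = 4 : i64} : !llvm.ptr -> i32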
Regarding testing: I am not sure how best to test this change for
`llvm.store` and the "globalness" of the effect in a way that isn't just a
unit test checking that the output matches exactly. For the time being, I
added a test making sure that `llvm.load` does not get DCEd in
aforementioned cases.
Related logic in LLVM proper:
3398744a61/llvm/lib/IR/Instruction.cpp (L638-L676)
3398744a61/llvm/include/llvm/IR/Instructions.h (L258-L262)
[0] https://llvm.org/docs/LangRef.html#volatile-memory-accesses
[1] https://llvm.org/docs/Atomics.html#monotonic
This patch adds support for the zeroinitializer constant to LLVM
dialect. It's meant to simplify zero-initialization of aggregate types
in MLIR, although it can also be used with non-aggregate types.
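A minimal sketch of the intended usage (assuming the constant is exposed as an `llvm.mlir.zero` op; the struct type is illustrative):

  // Zero-initialize an aggregate in a single op instead of composing
  // per-field constants and insertvalue operations.
  %zero = llvm.mlir.zero : !llvm.struct<(i32, f32)>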
This allows the lowering of rank > 1 transfer_reads/writes to equivalent
lower-rank ones when the trailing dimension is scalable. The resulting
ops still cannot be completely lowered as they depend on arrays of
scalable vectors being enabled, and a few related fixes (see D158517).
This patch also explicitly disables lowering transfer_reads/writes with
a leading scalable dimension, as more changes would be needed to handle
that correctly and it is unclear if it is required.
Examples of ops that can now be further lowered:
%vec = vector.transfer_read %arg0[%c0, %c0], %cst, %mask
    {in_bounds = [true, true]} : memref<3x?xf32>, vector<3x[4]xf32>
vector.transfer_write %vec, %arg0[%c0, %c0], %mask
    {in_bounds = [true, true]} : vector<3x[4]xf32>, memref<3x?xf32>
Reviewed By: c-rhodes, awarzynski, dcaballe
Differential Revision: https://reviews.llvm.org/D158753
Both `TileOp` and `TileToScfForOp` use the tiling interface and the
`tileUsingSCFForOp` method. This duplication was introduced in
44cfea0279
as a way to retire `linalg::tileLinalgOp`. There is no more need for this
duplication, and `TileOp` has seen more recent changes, so retire
`TileToScfForOp`.