This revision adds support for empty tensor elimination to
`bufferization.materialize_in_destination` by implementing the
`SubsetInsertionOpInterface`.
Furthermore, the One-Shot Bufferize conflict detection is improved for
`bufferization.materialize_in_destination`.
…(trunci) expansion
This revision adds a rewrite for sequences of vector `bitcast(trunci)`
to use a more efficient sequence of vector operations comprising
`shuffle` and `bitwise` ops.
Such patterns appear naturally when writing quantization /
dequantization functionality with the vector dialect.
The rewrite performs a simple enumeration of each of the bits in the
result vector and determines its provenance in the pre-trunci vector.
The enumeration is used to generate the proper sequence of `shuffle`,
`andi`, `ori` followed by an optional final `trunci`/`extui`.
The rewrite currently only applies to 1-D non-scalable vectors and bails
out if the bitwidth of the final vector element type is not a multiple
of 8. This is a failsafe heuristic determined empirically: if the
resulting element type does not span a whole number of bytes, further
complexities arise that are not improved by this pattern; the heavy
lifting still needs to be done by LLVM.
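For illustration, a minimal sketch of the kind of input this rewrite
targets (the types are made up for the example, not taken from the
patch):
```
// Quantization-style narrowing to a sub-byte type, followed by a bitcast of
// the packed bits into byte-sized elements.
%t = arith.trunci %in : vector<8xi32> to vector<8xi4>
%b = vector.bitcast %t : vector<8xi4> to vector<4xi8>
```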
`operator[]` returns `OpOperand &` instead of `Value`.
* This allows users to get OpOperands by name instead of by "magic" operand number.
E.g., `extractSliceOp->getOpOperand(0)` can be written as
`extractSliceOp.getSourceMutable()[0]`.
* `OperandRange` provides a read-only API to operands: `operator[]`
returns `Value`. `MutableOperandRange` now provides a mutable API:
`operator[]` returns `OpOperand &`, which can be used to set operands.
Note: The TableGen code generator could be changed to return `OpOperand
&` (instead of `MutableOperandRange`) for non-variadic and non-optional
arguments in a subsequent change. Then the `[0]` part in the above
example would no longer be necessary.
* Always use the auto-generated `getInitArgs` function. Remove the
hand-written `getInitOperands` duplicate.
* Remove `hasIterOperands` and `getNumIterOperands`. Their names were
inconsistent with the TableGen definition, where the argument is called
`initArgs`. Use `getInitArgs().size()` instead.
* Fix verification around ops with no results.
Add capabilities for comparing opaque properties. This is useful when
dealing with arbitrary operations, which can be compared based on their
OperationName. Now their properties can also be compared without the
need to determine their actual type.
Before this change, the `ByteCode` test failed on CentOS 7 with
devtoolset-9, because strings happen to be only 8-byte aligned. In
general, though, strings have no alignment guarantee.
Increase resource alignment in test to 32 bytes.
Adjust test to sufficiently align buffer.
Add test to check error when buffer is insufficiently aligned.
This patch changes the MaxScale value for level EightK to 256. It also
updates the affected level check tests.
Change-Id: Id9cffd5eb9053bb688196cd5b3b55b3ddd2a359c
Signed-off-by: Tai Ly <tai.ly@arm.com>
The previous implementation crashed if run on a `builtin.module` using
an `op_name` filter (because the initial value of `parent` in the while
loop was a `nullptr`). This PR fixes the crash and adds a test.
One of the main users of this kind of coroutine is Swift. There, yield-once (`retcon.once`) coroutines are used to temporarily "expose" pointers to internal fields of various objects, creating borrow scopes.
However, in some cases it might also be useful to allow these coroutines to produce a normal result, but there is no convenient way to represent this (as compared to switched-resume coroutines, where C++ `co_return`
is transformed into a member / callback call on the promise object).
The extension is simple: we allow the continuation function to have a non-void result and accept optional extra arguments via a special `llvm.coro.end.result` intrinsic that essentially forwards them as normal results.
This patch makes sure that the following case is lowered correctly
("duplication"):
```
func.func @broadcast_scalable_duplication(%arg0: vector<[32]xf32>) -> vector<1x[32]xf32> {
%res = vector.broadcast %arg0 : vector<[32]xf32> to vector<1x[32]xf32>
return %res : vector<1x[32]xf32>
}
```
Change `SingleBlock::{insert,push_back}` to avoid inserting the argument
operation after the block's terminator. This allows removing
`SingleBlockImplicitTerminator`'s functions with the same name.
Define `Block::hasTerminator` checking whether the block has a
terminator operation.
Signed-off-by: Victor Perez <victor.perez@codeplay.com>
Since ownership-based buffer deallocation requires a few passes to be run in a somewhat fixed sequence, it makes sense to have a pipeline for convenience (and to reduce the number of transform ops needed to represent the default deallocation sequence).
This change refactors some of the utilities used to unroll larger vector
computations into smaller vector computations. In fact, the indexing
computations used here are rather generic and are useful in other dialects or
downstream projects. Therefore, a utility for iterating over all possible tile
offsets for a particular pair of static (shape, tiled shape) is introduced in
IndexingUtils and replaces the existing computations in the vector unrolling
transformations. This builds off of the refactoring of IndexingUtils introduced
in 203fad476b7e.
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D150000
This is plugged in as an alternative lowering path in the gpu to spirv
dialect conversion. Add custom op builders for coop matrix ops to make
the create functions nicer to work with and less error-prone. The latter
is accomplished by following the op syntax and also requiring stride to
be a constant op to avoid confusion around the order of arguments.
The remaining lowering patterns will be added in a future patch.
Only construction and type casting are implemented. The method to create
is explicitly named "unsafe" and the documentation calls out what the
caller is responsible for. There really isn't a better way to do this
and retain the power-user feature this represents.
This patch is part of a larger initiative aimed at fixing floating-point
`max` and `min` operations in MLIR:
https://discourse.llvm.org/t/rfc-fix-floating-point-max-and-min-operations-in-mlir/72671.
The commit addresses task 1.4 of the RFC by adding lowering to the
corresponding LLVM intrinsics.
Please **note**: this PR is part of a stack of patches and depends on
#66429.
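A rough sketch of what such a lowering looks like, assuming the ops in
question are `arith.minnumf`/`arith.maxnumf` and the targets are the
`llvm.intr.minnum`/`llvm.intr.maxnum` wrapper ops (an assumption based
on the RFC, not stated in this message):
```
// arith.minnumf %a, %b : f32  becomes roughly:
%r = llvm.intr.minnum(%a, %b) : (f32, f32) -> f32
```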
This patch adds an NVPTX compilation path that enables JIT compilation
on NVIDIA targets. The following modifications were performed:
1. Adding a format field to the GPU object attribute, allowing the
translation attribute to use the correct runtime function to load the
module. Likewise, a dictionary attribute was added to add any possible
extra options.
2. Adding the `createObject` method to `GPUTargetAttrInterface`; this
method returns a GPU object from a binary string.
3. Adding the function `mgpuModuleLoadJIT`, which is only available for
NVIDIA GPUs, as there is no equivalent for AMD.
4. Adding the CMake flag `MLIR_GPU_COMPILATION_TEST_FORMAT` to specify
the format to use during testing.
This patch refactors the `AtomicRMWOpConverter` class to use the
dedicated operations from the Arith dialect instead of the
`cmpf` + `select` pattern.
A test for the `minimumf` kind of `atomic_rmw` has also been added.
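For reference, a minimal example of the op being expanded (operand types
are illustrative):
```
// The converter expands this op; the combining logic inside the expansion is
// now expressed with arith.minimumf / arith.maximumf instead of an
// arith.cmpf + arith.select pair.
%old = memref.atomic_rmw minimumf %value, %buffer[%idx] : (f32, memref<16xf32>) -> f32
```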
This patch is part of a larger initiative aimed at fixing floating-point
`max` and `min` operations in MLIR:
https://discourse.llvm.org/t/rfc-fix-floating-point-max-and-min-operations-in-mlir/72671.
Here we introduce new operations for floating-point numbers: `minnum`
and `maxnum`.
These operations have different semantics than the `minimumf` and
`maximumf` ops: they follow the semantics of the eponymous LLVM
intrinsics, which differ in the handling of positive and negative zeros
and NaNs.
This patch addresses task 1.3 of the RFC.
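In the arith dialect the new ops are spelled `arith.minnumf` and
`arith.maxnumf`; a minimal usage example:
```
// Unlike arith.minimumf, a single NaN operand is "ignored": if exactly one
// operand is NaN, the other operand is returned (llvm.minnum semantics).
%m = arith.minnumf %a, %b : f32
```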
This will make it easier for callers to notice issues with, and fix up,
calls to createTargetMachine after a future change to the parameters of
TargetMachine.
This matches other nearby enums.
For downstream users, this should be a fairly straightforward
replacement,
e.g. s/CodeGenOpt::Aggressive/CodeGenOptLevel::Aggressive
or s/CGFT_/CodeGenFileType::
This change removes the requirement that the row stride be statically known when
converting `vector.transfer_read` and `vector.transfer_write` to distributed
SIMT operations in the `nvgpu` lowering path. It also adds a check to verify
that the last dimension of the source memref is statically known to have stride
1 since this is assumed in the conversion logic. No other change should be
required since the generated `vector.load` operations are never created across
dimensions other than the last. The routines for checking preconditions on
`vector.transfer_read/write` are moved under the nvgpu utilities.
The change is NFC with respect to the GPU dialect lowering path.
Reviewed By: ThomasRaoux
Differential Revision: https://reviews.llvm.org/D155753
This change adds a method to modify the ConversionTarget used during
`transform.apply_conversion_patterns` to the
`ConversionPatternDescriptorOpInterface`. This is needed when the TypeConverter
is used to dictate the dynamic legality of operations, as in "structural"
conversion patterns present in, for example, the SCF and func dialects.
As a first use case/test, this change also adds a
`transform.apply_patterns.scf.structural_conversions` operation to the SCF
dialect.
Reviewed By: springerm
Differential Revision: https://reviews.llvm.org/D158672
Rationale:
These libraries provided COO input and output at external boundaries;
this functionality has since been generalized to the much more powerful
pack and unpack operations of the sparse tensor dialect.
Add extra error checking (in debug mode) to detect cases where an
iterator on "direct" payload IR mappings is invalidated (due to elements
being removed). Such errors are hard to debug: they are often
non-deterministic; sometimes the program crashes, sometimes it produces
wrong results. Even when it crashes, the stack trace often points to
completely unrelated code locations.
Store a timestamp with each "direct" mapping. The timestamp is increased
whenever an operation is performed that invalidates an iterator on that
mapping. A debug iterator is added that checks the timestamp as payload
IR is enumerated.
The scf.forall.in_parallel terminator operation has a nested graph region with the NoTerminator trait. Such regions are not supported by the default implementations. Therefore, this commit adds a specialized implementation for
this operation which only covers the case where the nested region is empty.
This is sufficient because, after bufferization, ops like tensor.parallel_insert_slice have already been converted to memref operations residing directly in the scf.forall body, and the nested region of scf.forall.in_parallel ends up empty.
Fixes a crash when an op that is mapped to a handle over which a
`transform.foreach` iterates was erased (through the
`TrackingRewriter`). Erasing an op removes it from all mappings and
invalidates iterators. This is already taken care of when a transform op
iterates over payload ops in its own `apply` method, but not when
another transform op erases a tracked payload op.
Add a method to the BufferDeallocationOpInterface that allows operations to implement the interface and provide custom logic to compute the ownership indicators of the values they define. As a demonstrating example, this new method is implemented by the `arith.select` operation.
This new interface allows operations to implement custom handling of ownership values and insertion of dealloc operations, which is useful when an op cannot implement the interfaces supported by default by the buffer deallocation pass (e.g., because they are not quite compatible, because additional semantics would render the default implementations invalid, or because no suitable interface exists and it is not worth introducing one plus a default
implementation in buffer deallocation). Additionally, it can be used to handle a specific op more efficiently than the interface-based default
implementations.
memref.copy sometimes gets lowered to a function call, which is passed
the element size of the memref in bytes as an argument.
The element size passed to the copyMemRef() function call can be
miscalculated if the LLVM IR uses aligned access to the memory.
This is fixed by using llvm.getelementptr to calculate the element size
natively, as is already done in the other lowering path that lowers to
an intrinsic.
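The computation relies on the standard GEP-based sizeof idiom; a sketch
in the LLVM dialect (not the exact code from the patch):
```
// Element size in bytes = address of element 1 relative to a zero base pointer.
%zero = llvm.mlir.zero : !llvm.ptr
%gep  = llvm.getelementptr %zero[1] : (!llvm.ptr) -> !llvm.ptr, f32
%size = llvm.ptrtoint %gep : !llvm.ptr to i64
```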
Fix https://github.com/llvm/llvm-project/issues/64072
Reviewed By: ftynse
Differential Revision: https://reviews.llvm.org/D156126
The lowering pattern to LLVM for memref.transpose has a bug: instead of
transposing from (source) -> (dest), it actually transposes
(dest) -> (source). This patch fixes the bug and updates the test.
Fix https://github.com/llvm/llvm-project/issues/65145
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D159290
Add a new Buffer Deallocation pass with the intent to replace the old
one. For now it is added as a separate pass alongside the old one in
order to allow downstream users to migrate over gradually. This new pass
has the goal
of inserting fewer clone operations and supporting additional use-cases.
Please refer to the Buffer Deallocation section in the updated
Bufferization.md file for more information on how this new pass works.
Default atomic ordering information is processed in the OpenMP dialect
to LLVM IR lowering stage at every spot where an operation can be
affected by it. The remaining clauses are stored globally in the
OpenMPIRBuilderConfig object before starting that lowering stage, so
that the OMPIRBuilder can conditionally modify code generation
depending on them. At the end of the process, the omp.requires
attribute is itself lowered into a global constructor that passes these
clauses as flags to the OpenMP runtime.
Depends on D147217, D147218 and D158278.
Differential Revision: https://reviews.llvm.org/D147219
This patch updates the `OpenMPIRBuilderConfig` structure to hold all
available 'requires' clauses, and it replicates part of the code
generation for the 'requires' registration function from clang in the
`OMPIRBuilder`, to be used with flang.
Porting the rest of the features of the clang implementation to the
IRBuilder and sharing it between clang and flang remains for a future
patch, due to the complexity of the logic that selects the attributes of
the generated registration function.
Differential Revision: https://reviews.llvm.org/D147217
This commit generalizes empty tensor elimination to operate on subset
ops. No new test cases are added because all current subset ops were
already supported previously. From this perspective, this change is NFC.
A new interface method (and a helper method) are added to
`SubsetInsertionOpInterface` to build the subset of the destination
tensor.
This patch adds support for lowering vector.outerproduct to the ArmSME
MOPA intrinsic for the following types:
vector<[8]xf16>, vector<[8]xf16> -> vector<[8]x[8]xf16>
vector<[8]xbf16>, vector<[8]xbf16> -> vector<[8]x[8]xbf16>
vector<[4]xf32>, vector<[4]xf32> -> vector<[4]x[4]xf32>
vector<[2]xf64>, vector<[2]xf64> -> vector<[2]x[2]xf64>
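For example, the f32 case (a sketch; the accumulator operand and value
names are illustrative):
```
// Lowered to the SME MOPA intrinsic, accumulating into a [4]x[4] tile.
%tile = vector.outerproduct %lhs, %rhs, %acc : vector<[4]xf32>, vector<[4]xf32>
```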
The FP variants are lowered to FMOPA (non-widening) [1] and the BFloat
variant to BFMOPA (non-widening) [2].
Note that at the ISA level these variants are implemented by different
architecture features, listed below:
FMOPA (non-widening)
* half-precision - +sme2p1,+sme-f16f16
* single-precision - +sme
* double-precision - +sme-f64f64
BFMOPA (non-widening)
* half-precision - +sme2p1,+b16b16
There's currently no way to target different features when lowering to
ArmSME. Integration tests are added for F32 and F64. We use QEMU to run
the integration tests, but SME2 support isn't available yet (it's
targeted for 9.0), so integration tests for these variants are excluded.
Masking is currently unsupported.
Depends on #65450.
[1] https://developer.arm.com/documentation/ddi0602/2023-06/SME-Instructions/FMOPA--non-widening---Floating-point-outer-product-and-accumulate-
[2] https://developer.arm.com/documentation/ddi0602/2023-06/SME-Instructions/BFMOPA--non-widening---BFloat16-floating-point-outer-product-and-accumulate-
This patch is part of a larger initiative aimed at fixing floating-point `max` and `min` operations in MLIR: https://discourse.llvm.org/t/rfc-fix-floating-point-max-and-min-operations-in-mlir/72671.
Within LLVM, there are no masked reduction counterparts for vector reductions such as `fmaximum` and `fminimum`.
More information can be found here: https://github.com/llvm/llvm-project/issues/64940#issuecomment-1690694156.
To address this issue in MLIR, where we need to generate appropriate lowerings for these cases, we employ regular non-masked intrinsics.
However, we modify the input vector using the `arith.select` operation to effectively deactivate undesired elements using a "neutral mask value".
The neutral mask value is the smallest possible value for the `fmaximum` reduction and the largest possible value for the `fminimum` reduction.
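A sketch of the pattern described above, written at the vector-dialect
level (value names and types are illustrative):
```
// Masked-off lanes are replaced with the neutral element so they cannot
// affect the result of the regular, non-masked reduction.
%masked = arith.select %mask, %input, %neutral : vector<8xi1>, vector<8xf32>
%result = vector.reduction <maximumf>, %masked : vector<8xf32> into f32
```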
Depends on D158618
Reviewed By: dcaballe
Differential Revision: https://reviews.llvm.org/D158773
This is a cleanup in preparation for adding a second conversion path
using the KHR cooperative matrix extension.
Make the existing lowering explicit about emitting ops from the NV coop
matrix extension. Clean up surrounding code.
Currently, the VectorToLLVM patterns are built into a library along
with the corresponding pass, which also pulls in all the
platform-specific vector dialects (like AMXDialect) to apply all the
vector to LLVM conversions.
This causes dependency bloat when writing libraries: for example, the
GPU to LLVM passes, which use the vector to LLVM patterns, don't need
the X86Vector dialect to be present at all.
This commit partitions the library into VectorToLLVM and
VectorToLLVMPass, where the latter pulls in all the other vector
transformations.
Reviewed By: nicolasvasilache, mehdi_amini
Differential Revision: https://reviews.llvm.org/D158287