This cleans up unnecessary code that changed zero-size allocations to
avoid the following error message:
'cuMemAlloc(&ptr, sizeBytes)' failed with 'CUDA_ERROR_INVALID_VALUE'
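For context, a minimal sketch of the kind of zero-size workaround being removed (the wrapper name and shape here are illustrative, not the actual runtime code):

  // Hypothetical sketch; the real CUDA runtime wrapper differs.
  #include "cuda.h"
  #include <cstdint>

  extern "C" void *gpuMemAllocSketch(uint64_t sizeBytes) {
    CUdeviceptr ptr = 0;
    // cuMemAlloc rejects a size of 0 with CUDA_ERROR_INVALID_VALUE, so the
    // old code bumped zero-size requests to one byte. This special case is
    // the kind of code being cleaned up here.
    if (sizeBytes == 0)
      sizeBytes = 1;
    (void)cuMemAlloc(&ptr, sizeBytes);
    return reinterpret_cast<void *>(ptr);
  }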
Reviewed By: yinying-lisa-li
Differential Revision: https://reviews.llvm.org/D159381
Incorporated two header files directly into another, since only parts
of them were used (and the separation made it hard to find the
definitions). Removed TODOs that are unlikely to be done.
Reviewed By: Peiming
Differential Revision: https://reviews.llvm.org/D159330
Consistent order of ops and related methods.
Also renamed SpGEMMGetSizeOp to SpMatGetSizeOp,
since this is a general utility for sparse matrices,
not specific to GEMM ops.
Reviewed By: Peiming
Differential Revision: https://reviews.llvm.org/D157922
Rationale:
Since we only support the default algorithm for SpGEMM, we can remove the
estimate op (for now at least). This revision also introduces the set csr
pointers op and fixes a few bugs in the existing lowering of the SpGEMM
breakdown. It paves the way for actual recognition of SpGEMM in the sparsifier.
Reviewed By: K-Wu
Differential Revision: https://reviews.llvm.org/D157645
Rationale:
This is the approach taken for all the others too (SpMV, SpMM, SDDMM),
so it is more consistent to follow the same path (until we have a need
for more algorithms). Also, in a follow-up revision, this will allow
us to remove some unused GEMM ops.
Reviewed By: K-Wu
Differential Revision: https://reviews.llvm.org/D157542
On Fedora, rocminfo is a Fedora package and rocm_agent_enumerator is
installed to /usr/bin. This causes the following error when building:
CMake Error at external/llvm-project/mlir/lib/ExecutionEngine/CMakeLists.txt:232 (message):
Could not run rocm_agent_enumerator and ROCM_TEST_CHIPSET is not defined
So use find_program() to look for rocm_agent_enumerator instead of assuming a
single location.
Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed By: krzysz00
Differential Revision: https://reviews.llvm.org/D156826
This work addresses an issue with larger shared memory usage in the MLIR CUDA runtime. Currently, when dynamic shared memory usage exceeds 48KB, the CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES attribute of the CUDA kernel must be set appropriately. This work takes care of that by setting the attribute as required. It also adds some debug prints for better visibility and troubleshooting.
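A minimal sketch of the attribute-setting step described above (names are illustrative; the actual wrapper code in the runtime differs):

  #include "cuda.h"
  #include <cstdio>

  // Kernels requesting more than 48KB of dynamic shared memory must opt in
  // explicitly, otherwise the launch fails with CUDA_ERROR_INVALID_VALUE.
  static void setMaxDynamicSharedMemory(CUfunction function, int smemBytes) {
    if (smemBytes > 48 * 1024) {
      CUresult result = cuFuncSetAttribute(
          function, CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES,
          smemBytes);
      if (result != CUDA_SUCCESS)
        fprintf(stderr, "cuFuncSetAttribute failed: %d\n",
                static_cast<int>(result));
    }
  }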
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D156874
This work introduces the `MLIR_CUDA_DEBUG` environment variable and a `debug_print` function to make it possible to debug the runtimes.
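A minimal sketch of such an environment-gated debug print (the actual `debug_print` implementation may differ):

  #include <cstdarg>
  #include <cstdio>
  #include <cstdlib>

  // Emit output only when MLIR_CUDA_DEBUG is set and non-zero, so regular
  // runs stay silent.
  static void debugPrintSketch(const char *fmt, ...) {
    static const bool enabled = [] {
      const char *env = getenv("MLIR_CUDA_DEBUG");
      return env && *env && *env != '0';
    }();
    if (!enabled)
      return;
    va_list args;
    va_start(args, fmt);
    vfprintf(stderr, fmt, args);
    va_end(args);
  }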
Reviewed By: aartbik
Differential Revision: https://reviews.llvm.org/D156232
The op creates a tensor map descriptor object representing a tiled memory region. The descriptor is used by the Tensor Memory Accelerator (TMA). The `tensor` is the source tensor to be tiled. The `boxDimensions` are the sizes of the tiled memory region in each dimension.
The pattern here lowers `tma.create.descriptor` to a runtime function call that eventually calls the CUDA driver's `cuTensorMapEncodeTiled`. For more information, see:
https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TENSOR__MEMORY.html
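A minimal sketch of the eventual driver call, for a hypothetical 64x64 float tile of a 1024x1024 row-major matrix (the actual lowering fills these values from the op's operands):

  #include "cuda.h" // requires CUDA 12+ for the tensor map API

  static CUtensorMap encodeTileSketch(void *globalAddress) {
    CUtensorMap tensorMap;
    cuuint64_t globalDim[2] = {1024, 1024};
    // One stride per dimension except the innermost, in bytes.
    cuuint64_t globalStrides[1] = {1024 * sizeof(float)};
    cuuint32_t boxDim[2] = {64, 64}; // the op's `boxDimensions`
    cuuint32_t elementStrides[2] = {1, 1};
    (void)cuTensorMapEncodeTiled(
        &tensorMap, CU_TENSOR_MAP_DATA_TYPE_FLOAT32, /*tensorRank=*/2,
        globalAddress, globalDim, globalStrides, boxDim, elementStrides,
        CU_TENSOR_MAP_INTERLEAVE_NONE, CU_TENSOR_MAP_SWIZZLE_NONE,
        CU_TENSOR_MAP_L2_PROMOTION_NONE, CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
    return tensorMap;
  }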
Depends on D155453
Reviewed By: nicolasvasilache
Differential Revision: https://reviews.llvm.org/D155680
(1) Without the check, the results may silently be wrong, so the check is needed.
(2) Add a pruning step to guarantee the 2:4 property.
Note that, in the longer run, we may want to split out the pruning step
somehow, or make it optional.
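For reference, a conceptual sketch of the 2:4 constraint (this only illustrates the idea; the actual pruning step lives in the sparse support runtime): in every group of four consecutive values, at most two may be nonzero.

  #include <algorithm>
  #include <cmath>
  #include <cstddef>

  // Zero the two smallest-magnitude values in each group of four so the
  // 2:4 sparsity property holds.
  static void prune24Sketch(float *values, size_t n) {
    for (size_t i = 0; i + 4 <= n; i += 4) {
      size_t idx[4] = {i, i + 1, i + 2, i + 3};
      // Sort the group's indices by descending magnitude.
      std::sort(idx, idx + 4, [&](size_t a, size_t b) {
        return std::fabs(values[a]) > std::fabs(values[b]);
      });
      values[idx[2]] = 0.0f;
      values[idx[3]] = 0.0f;
    }
  }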
Reviewed By: K-Wu
Differential Revision: https://reviews.llvm.org/D155320
Also makes some minor consistency edits in the cuSparseLt wrapper lib.
Reviewed By: Peiming, K-Wu
Differential Revision: https://reviews.llvm.org/D155139
Currently crashes if the function isn't void when specifying
'-entry-point-result=void'.
Reviewed By: jpienaar
Differential Revision: https://reviews.llvm.org/D154352
SubtargetFeature.h is currently part of MC while it doesn't depend on
anything in MC. Since some LLVM components might have the need to work
with target features without necessarily needing MC, it might be
worthwhile to move SubtargetFeature.h to a different location. This will
reduce the dependencies of said components.
Note that I chose TargetParser as the destination because that's where
Triple lives, and SubtargetFeatures feels related to that.
This issue came up during a JITLink review (D149522). JITLink would
like to avoid a dependency on MC while still needing to store target
features.
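For users of the header, the visible change is just the include path:

  // Before the move, the header lived in MC:
  // #include "llvm/MC/SubtargetFeature.h"
  // After the move, it lives next to Triple in TargetParser:
  #include "llvm/TargetParser/SubtargetFeature.h"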
Reviewed By: MaskRay, arsenm
Differential Revision: https://reviews.llvm.org/D150549
This reverts commit bba2b656110209a3d9863b92c060082479b06ab1.
Reason: Broke the HWASan buildbot. See https://reviews.llvm.org/D153250
for more information.
This reverts commit 9119325a5666e557a19f38a05525578b556c215b.
A buildbot is broken, probably because of this change breaking the
SHARED_LIBS=ON build more.
There are two ways to make symbols from a shared library visible in the
execution engine: exporting the symbols with public visibility, or
implementing a loading/unloading mechanism that registers the exported
symbols explicitly. The latter had only been available in the JIT runner
until recently, but https://reviews.llvm.org/D153029 makes it available
in any usage of the execution engine (including the Python bindings).
This patch makes the runner utils library use the latter mechanism
instead of the former, i.e., it makes all of its symbols private and
implements the init/destroy functions of the loading mechanism to
control explicitly which symbols it registers.
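A minimal sketch of that mechanism, assuming the `__mlir_runner_init`/`__mlir_runner_destroy` protocol described further below (the actual runner utils code registers many more symbols):

  #include "llvm/ADT/StringMap.h"
  #include <cstdio>

  // Hidden by default: not resolvable by a plain visibility-based lookup.
  extern "C" __attribute__((visibility("hidden"))) void printNewline() {
    fputc('\n', stdout);
  }

  // Init hook of the loading protocol: register exactly the symbols the
  // library wants to expose to jitted code.
  extern "C" __attribute__((visibility("default"))) void
  __mlir_runner_init(llvm::StringMap<void *> &exportSymbols) {
    exportSymbols["printNewline"] = reinterpret_cast<void *>(&printNewline);
  }

  // Destroy hook: release anything the init hook acquired (nothing here).
  extern "C" __attribute__((visibility("default"))) void
  __mlir_runner_destroy() {}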
Reviewed By: mehdi_amini
Differential Revision: https://reviews.llvm.org/D153250
The async runtime library explicitly registers the symbols it exports
with the loading mechanism of the execution engine. This works even
though these symbols were marked as hidden in the library. However, if
used outside the execution engine, such as with `lli --dlopen` or if AOT
compiled, these hidden symbols would not be found. This patch thus marks
all symbols that are part of the API as visible.
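A minimal sketch of the visibility marking (the macro name here is illustrative; the async runtime uses its own export macro):

  #include <cstdint>

  // Re-export API symbols even when the library is built with
  // -fvisibility=hidden, so dlopen/AOT users can resolve them.
  #define API_EXPORT __attribute__((visibility("default")))

  extern "C" API_EXPORT void mlirAsyncRuntimeAddRef(void *ptr, int64_t count);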
Reviewed By: mehdi_amini
Differential Revision: https://reviews.llvm.org/D153348
In https://reviews.llvm.org/D153029, I moved the loading/unloading
mechanisms of shared libraries from the JIT runner to the execution
engine in order to make that mechanism available in the latter
(including its Python bindings). However, I realized that I introduced a
small change in semantics: previously, the JIT runner checked for the
presence of init/destroy functions and only loaded the library as
JITDyLib if they were not present. After I moved the code, all libraries
were loaded as JITDyLib, even if they registered their symbols
explicitly in their init function. I am not sure if this is really a
problem, but (1) the previous behavior was different, and (2) I guess it
could cause a problem if some symbols are exported through the init
function *and* have public visibility. This patch reestablishes the
original behaviour in the new place of the code.
Reviewed By: mehdi_amini
Differential Revision: https://reviews.llvm.org/D153249
This updates the code comments about the library registration mechanism,
which changed in https://reviews.llvm.org/D153029, and which should have
been updated as part of that patch.
Reviewed By: ingomueller-net
Differential Revision: https://reviews.llvm.org/D153147
Both the mlir-cpu-runner and the execution engine allow providing a
list of shared libraries that should be loaded into the process such
that the jitted code can use the symbols from those libraries. The
runner had implemented a protocol that allowed libraries to control
which symbols they want to provide in that context (with a function
called __mlir_runner_init). In the absence of that, the runner would rely
on the loading mechanism of the execution engine, which didn't do
anything particular with the symbols, i.e., only symbols with public
visibility were visible to the jitted code.
Libraries used a mix of the two mechanisms: while the runner utils and C
runner utils libs (and potentially others) used public visibility, the
async runtime lib (as the only one in the monorepo) used the loading
protocol. As a consequence, the async runtime library could not be used
through the Python bindings of the execution engine.
This patch moves the loading protocol from the runner to the execution
engine. For the runner, this should not change anything: it lets the
execution engine handle the loading, which now implements the same
protocol that the runner had implemented before. However, the Python
bindings now get to benefit from the loading protocol as well, so the
async runtime library (and potentially other out-of-tree libraries) can
now be used in that context.
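For example, a hedged sketch of loading a library through the execution engine's C++ API (assuming the `ExecutionEngineOptions::sharedLibPaths` option; if the library defines the init/destroy hooks, its symbols are registered explicitly rather than resolved by public visibility):

  #include "mlir/ExecutionEngine/ExecutionEngine.h"
  #include "llvm/ADT/SmallVector.h"

  llvm::Expected<std::unique_ptr<mlir::ExecutionEngine>>
  createEngineSketch(mlir::ModuleOp module) {
    llvm::SmallVector<llvm::StringRef> libs = {"libmlir_async_runtime.so"};
    mlir::ExecutionEngineOptions options;
    options.sharedLibPaths = libs;
    return mlir::ExecutionEngine::create(module, options);
  }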
Reviewed By: mehdi_amini
Differential Revision: https://reviews.llvm.org/D153029
Add a 16-bit version of cudaMemset in cudaRuntimeWrappers and update the GPU-to-LLVM lowering.
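A minimal sketch of such a wrapper (the name is illustrative; the real wrapper mirrors the existing 32-bit one and is driven by the updated lowering):

  #include "cuda.h"
  #include <cstddef>
  #include <cstdint>

  // Fill `count` 16-bit elements at `dst` with `value` on `stream`.
  extern "C" void gpuMemset16Sketch(void *dst, uint16_t value, size_t count,
                                    CUstream stream) {
    (void)cuMemsetD16Async(reinterpret_cast<CUdeviceptr>(dst), value, count,
                           stream);
  }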
Reviewed By: bondhugula
Differential Revision: https://reviews.llvm.org/D151642
Even though this feature was deprecated in release 11.2,
libraries before this version still support the feature,
which is why we are making it available under a macro.
Reviewed By: K-Wu
Differential Revision: https://reviews.llvm.org/D152290
The CMake logic that finds the CUDA paths exposes some paths to search for
the CUDA library; we need to propagate these through the call to
find_library(). This was already done for cuSparse but not for CUDA.
Differential Revision: https://reviews.llvm.org/D151645