Fixes:
```
[4038/11058] Building CXX object projects/openmp/libomptarget/src/CMakeFiles/omptarget.dir/OpenMP/InteropAPI.cpp.o
/home/aganea/llvm-project/openmp/libomptarget/src/OpenMP/InteropAPI.cpp:202:2: warning: extra ';' outside of a function is incompatible with C++98 [-Wc++98-compat-extra-semi]
};
^
1 warning generated.
```
Summary:
This patch cleans up some of the JIT handling for AMDGPU as well as
removing its temporary files. Previously these would be left in the
temporary directory after the program was run. This costs some extra
time, but the correct solution to avoid that is to create a sufficient
entrypoint into `ld.lld` that we can simply pass a memory buffer into.
Some are missing setting of HSA_XNACK=1 environment variable, used to
enable unified memory support on amdgpu's when it's not been set at
kernel boot time. Some others needed to be marked as supporting
unified_shared_memory in the lit test harness.
Summary:
This patch is an attempt to get a clean run of `check-openmp` running on
an NVPTX machine. I simply took the lists of tests that failed on my
`sm_89` machine and disabled them or fixed them. A lot of these tests
are disabled on AMDGPU already, so it makes sense that NVPTX fails. The
others are simply problems with NVPTX optimized debugging which will
need to be fixed. I opened an issue on one of them.
Summary:
The constructors and destructors look up a symbol in the ELF quickly to
determine if they need to be run on the GPU. This allows us to avoid the
very slow actions required to do the slower lookup using the vendor API.
One problem occurs with how we handle the lifetime of these images.
Right now there is no invariant to specify the lifetime of the
underlying binary image that is loaded. In the typical case, this comes
from the binary itself in the `.llvm.offloading` section, meaning that
the lifetime of the binary should match the executable itself. This
would work fine, if it weren't for the fact that the plugin is loaded
via `dlopen` and can have a teardown order out of sync with the main
executable.
This was likely what was occuring when this failed on some systems but
not others. A potential solution would be to simply copy images into
memory so the runtime does not rely on external references. Another
would be to manually zero these out after initialization as to prevent
this mistake from happening accidentally. The former has the benefit of
making some checks easier, and allowing for constant initialization be
done on the ELF itself (normally we can't do this because writing to a
constant section, e.g. .llvm.offloading is a segfault.). The downside
would be the extra time required to copy the image in bulk (Although we
are likely doing this in the vendor runtimes as well).
This patch went with a quick solution to simply set a boolean value at
initialization time if we need to call destructors.
Fixes: https://github.com/llvm/llvm-project/issues/77798
Summary:
Recently a patch added an assertion in the GlobalHandler to indicate
when an ELF was not used. This began to fire whenever NVPTX JIT was
used, because the JIT pass output a PTX file instead of an ELF. The
CUModuleLoad method consumes `.s` internally and compiles it to a cubin,
however, this is too late as we perform several checks on the ELF
directly for the presence of certain symbols and to read some necessary
constants. This results in inconsistent behaviour.
To address this, this patch simply calls `ptxas` manually, similar to
how `lld` is called for the AMDGPU JIT pass. This is inevitably going to
be slower than simply passing it to the CUDA routine due to the overhead
involved in file IO and a fork call, but it's necessary for correctness.
CUDA provides an API for compiling PTX manually. However, this only
started showing up in CUDA 11.1 and is only provided "officially" in a
static library. The `libnvidia-ptxjitcompiler.so` next to the CUDA
driver has the same symbols and can likely be used as a replacement.
This would be the faster solution. However, given that it's not
documented it may have some issues.
Variables `__omp_rtl_assume_teams_oversubscription` and
`__omp_rtl_assume_threads_oversubscription `are used by functions:
`__kmpc_distribute_static_loop`, `__kmpc_distribute_for_static_loop `and
`__kmpc_for_static_loop`.
These tests were slightly broken, in one case a failing test that now works. In the other case
some accidentally left over code during a name change that broke compilation due to missing
symbols.
Summary:
The current code logic is supposed to skip plugins that aren't found or
could not be loaded. However, the plugic ontained a call to `abort` if
it failed, which prevented us from continuing if initilalization the
plugin failed (such as if `dlopen` failed for the dyanmic plugins).
Summary:
The LLVM style uses /*Foo=*/ when indicating the name of a constant. See
https://llvm.org/docs/CodingStandards.html#comment-formatting. This is
useful for consistency, as well as because `clang-format` understands
this syntax and formats it more cleanly. Do a bulk update of this
syntax.
Summary:
My previous reworking of the image hangling removed the image info which
was originally used for this extra error message requested by Ye Luo. I
have since added in the necessary ELF facilities to extract it from the
object file and can add it back in. It's a little verbose mostly from
needing to shuffle around types and potential errors.
Summary:
The previous behaviour before I made it dynamically open libFFI was that
these tests would be ignored if FFI was not found. This now allows tests
to be run without the dependency and thus the tests fails on some
buildbots. This simply makesit not build the tests if it's not present.
The comment indicate that it should be possible, but as long as it
wasn't a cache variable, the cmake script overwrote whatever variable
the user had set.
Summary:
The CPU targets currently rely on `libffi` to invoke the "kernel"
functions. Previously we would not build these if this dependency was
not found. This patch copies th eapproach used for things like CUDA and
HSA to dynamically load this if it is not found.
The one sketchy thing this does is hard-code the default ABI for the
target. These are normally defined on a per-file basis in the FFI
source, so I had to fish out the expected values. We only use two types,
so ideally we will always be able to use the default ABI.
It's possible we could remove this dependency entirely in the future as
well.
Summary:
The offloading entries right now are assumed to be baked into the binary
itself, and thus always valid whenever the library is executing. This
means that we don't need to copy them to additional storage and can
instead simply pass around references to it.
This is not likely to change in the expected operation of the OpenMP
library. Additionally, the indirection for the offload entry struct is
simply two pointers, so moving it by value is trivial.
…on (zero-copy) on MI300A.
This patch enables applications that did not request OpenMP
unified_shared_memory to run with the same zero-copy behavior, where
mapped memory does not result in extra memory allocations and memory
copies, but CPU-allocated memory is accessed from the device. The name
for this behavior is "automatic zero-copy" and it relies on detecting:
that the runtime is running on a MI300A, that the user did not select
unified_shared_memory in their program, and that XNACK (unified memory
support) is enabled in the current GPU configuration. If all these
conditions are met, then automatic zero-copy is triggered.
This patch is still missing support for global variables, which will be
provided in a subsequent patch.
Co-authored-by: Thorsten Blass <thorsten.blass@amd.com>
Summary:
This is needed for some definition in `hsa.h` that requires this to be
set for some architectures when it fails at autodetection. We only
really build `libomptarget` with `gcc` and `clang` which already provide
their own way of detecting this. Remove the unnecessary define and move
it into the source.
This PR contains initial changes for building and testing libomp on AIX.
More changes will follow.
- `KMP_OS_AIX` is defined for the AIX platform
- `KMP_ARCH_PPC` is defined for 32-bit PPC
- `KMP_ARCH_PPC_XCOFF` and `KMP_ARCH_PPC64_XCOFF` are for 32- and 64-bit
XCOFF object formats respectively
- Assembly file `z_AIX_asm.S` is used for AIX specific assembly code and
will be added in a separate PR
- The target library is disabled because AIX does not have the device
support
- OMPT is temporarily disabled
#65273 "hidden_dynamic_lds_size" argument will be added in the reserved
section at offset 120 of the implicit argument layout
Add DynamicLdsSize to AMDGPUImplicitArgsTy struct at offset 120 and fill
the dynamic LDS size before kernel launch.
Summary:
The device allocator on NVPTX architectures is enqueued to a stream that
the kernel is potentially executing on. This can lead to deadlocks as
the kernel will not proceed until the allocation is complete and the
allocation will not proceed until the kernel is complete. CUDA 11.2
introduced async allocations that we can manually place on separate
streams to combat this. This patch makes a new allocation type that's
guaranteed to be non-blocking so it will actually make progress, only
Nvidia needs to care about this as the others are not blocking in this
way by default.
I had originally tried to make the `alloc` and `free` methods take a
`__tgt_async_info`. However, I observed that with the large volume of
streams being created by a parallel test it quickly locked up the system
as presumably too many streams were being created. This implementation
not just creates a new stream and immediately destroys it. This
obviously isn't very fast, but it at least gets the cases to stop
deadlocking for now.
Summary:
In the future, we may have more checks for different kinds of inputs,
e.g. SPIR-V. This patch simply reworks the handling to be more generic
and do the magic detection up-front. The checks inside the routines are
now asserts so we don't spend time checking this stuff over and over
again.
This patch also tweaked the bitcode check. I used a different function
to get the Lazy-IR module now, as it returns the raw expected value
rather than the SM diganostic.
No functionality change intended.
Summary:
Adding information to the LIBOMPTARGET profiler runtime kernel and API
calls.
Key changes:
* Adding information to runtime calls for better understanding of how
the application
is executing. For example teams requested by the user, size of memory
transfers.
* Profile timer was changed from 'us' to 'ns', since 'us' was too
coarse-grain
to register some important details like key kernel duration
* Removed non API or Runtime calls, to reduce complexity of profile for
application
developers.
---------
Co-authored-by: Felipe Cabarcas <cabarcas@leia.crpl.cis.udel.edu>
Co-authored-by: fel-cab <fel-cab@github.com>
This patch fixes the erroneous multiple-target requirement in Fortran
offloading tests. Additionally, it adds two new variables
(test_flags_clang, test_flags_flang) to lit.cfg so that
compiler-specific flags for Clang and Flang can be specified.
This patch re-lands: #74543. The error was caused by having:
```
config.substitutions.append(("%flags", config.test_flags))
config.substitutions.append(("%flags_clang", config.test_flags_clang))
config.substitutions.append(("%flags_flang", config.test_flags_flang))
```
when instead it has to be:
```
config.substitutions.append(("%flags_clang", config.test_flags_clang))
config.substitutions.append(("%flags_flang", config.test_flags_flang))
config.substitutions.append(("%flags", config.test_flags))
```
because LIT replaces with the first longest sub-string match.
This patch addresses an issue introduced in pull request #74398. CMake
will attempt to re-build gtest if openmp is enabled as a project (as
opposed to being enabled as a runtime). This patch adds a check that
prevents this from happening.
Summary:
We currently keep a cache of created ELF files from the relevant images.
This shouldn't be necessary as the entire ELF interface is generally
trivially constructable and extremely cheap. The cost of constructing
one of these objects is simply a size check and writing a pointer to the
underlying data. Given that, keeping a cache of these images should not
be necessary overall.
This patch reduces the size of heap memory required by the test
`malloc_parallel.c` and `malloc.c`. The original size is too large such
that `malloc` returns `nullptr` on many threads, causing illegal
memory access.
This patch add three GTest unit tests that test plugin read and write
operations. Tests can be compiled with `ninja -C runtimes/runtimes-bins
LibomptUnitTests`.
Summary:
This patch reorganizes a lot of the code used to check for compatibility
with the current environment. The main bulk of this patch involves
moving from using a separate `__tgt_image_info` struct (which just
contains a string for the architecture) to instead simply checking this
information from the ELF directly. Checking information in the ELF is
very inexpensive as creating an ELF file is simply writing a base
pointer.
The main desire to do this was to reorganize everything into the ELF
image. We can then do the majority of these checks without first
initializing the plugin. A future patch will move the first ELF checks
to happen without initializing the plugin so we no longer need to
initialize and plugins that don't have needed images.
This patch also adds a lot more sanity checks for whether or not the ELF
is actually compatible. Such as if the images have a valid ABI, 64-bit
width, executable, etc.
Summary:
A recent patch allowed us to easily replace GNU atomics with scoped
variants that make use of the backend's handling for more permissive
scopes. The default is full "system" scope, that means the atomic
operation must be consistent with operations that may happen on the
host's memory. This is generally only required for processes that are
communicating with something via global fine-grained memory. This patch
uses these atomics to make everything device scoped, as nothing in the
OpenMP runtime should depend on the host.
This is only provided as a very new clang extension but the DeviceRTL is
only compiled with clang so it is always available.
ompt/synchronization/[masked.c | master.c] tests fail due to a wrong
offset being calculated for the possible return addreses. PR #65936
fixes this for Darwin and the same has to be done for Linux.
Updates #69627
This patch fixes the erroneous multiple-target requirement in Fortran
offloading tests. Additionally, it adds two new variables
(`test_flags_clang`, `test_flags_flang`) to `lit.cfg` so that
compiler-specific flags for Clang and Flang can be specified.
In kernel language mode, use user's grid and blocks size directly. No
validity
check, which means if user's values are too large, the launch will fail,
similar
to what CUDA and HIP are doing right now.
Summary:
This patch fixes the remaining global constructor in the plguins after
addressing the ones in the JIT interface. This struct was mistakenly
using global constructors as not all the members were being initialized
properly. This was almost certainly being optimized out because it's
trivial, but would still be present in debug builds and prevented us
from compiling with `-Werror=global-constructors`. We will want to do
that once offloading is moved to a runtimes only build.
Summary:
Libomptarget supports JIT by treating an LLVM-IR file as a regular input
image. The handling here used a global map to keep track of triples once
it was parsed. This was done to same time, however this created a global
constructor as well as an extra mutex to handle it. This patch removes
the use of this map.
Instead, we simply use the file magic to perform a quick check if the
input image is valid bitcode. If not, we then create a lazy module. This
should roughly equivalent to the old handling that create an IR symbol
table. Here we can prevent the module from materializing everything but
the single triple metadata we read in later.
Fix mapping of structs to device.
The following example fails:
```
#include <stdio.h>
#include <stdlib.h>
struct Descriptor {
int *datum;
long int x;
int xi;
long int arr[1][30];
};
int main() {
Descriptor dat = Descriptor();
dat.datum = (int *)malloc(sizeof(int)*10);
dat.xi = 3;
dat.arr[0][0] = 1;
#pragma omp target enter data map(to: dat.datum[:10]) map(to: dat)
#pragma omp target
{
dat.xi = 4;
dat.datum[dat.arr[0][0]] = dat.xi;
}
#pragma omp target exit data map(from: dat)
return 0;
}
```
This is a rework of the previous attempt:
https://github.com/llvm/llvm-project/pull/72410