Commit Graph

3023 Commits

Author SHA1 Message Date
Shilei Tian
0d5b7dd25c
[OpenMP] Add a test for D158802 (#70678)
In D158802 we honored user's `thread_limit` value even with the
optimization
introduced in D152014. This patch adds a simple test.
2023-10-30 15:59:05 -04:00
Jon Chesterfield
896749aa0d
[amdgpu][openmp] Avoiding writing to packet header twice (#70695)
I think it follows from the HSA spec that a write to the first byte is
deemed significant to the GPU in which case writing to the second short
and reading back from it later would be safe. However, the examples for
this all involve an atomic write to the first 32 bits and it seems a
credible risk that the occasional CI errors abound invalid packets have
as their root cause that the firmware notices the early write to
packet->setup and treats that as a sign that the packet is ready to go.

That was overly-paranoid, however in passing noticed the code in libc is
genuinely invalid. The memset writes a zero to the header byte, changing
it from type_invalid (1) to type_vendor (0), at which point the GPU is
free to read the 64 byte packet and interpret it as a vendor packet,
which is probably why libc CI periodically errors about invalid packets.

Also a drive by change to do the atomic store on a uint32_t
consistently. I'm not sure offhand what __atomic_store_n on a uint16_t*
and an int resolves to, seems better to be unambiguous there.
2023-10-30 18:35:52 +00:00
agozillon
6a62707c04
[Flang][OpenMP][MLIR] Initial array section mapping MLIR -> LLVM-IR lowering utilising omp.bounds (#68689)
This patch seeks to add initial lowering of OpenMP array sections within
target region map clauses from MLIR to LLVM IR.

This patch seeks to support fixed sized contiguous (don't think OpenMP
supports anything other than contiguous sections from my reading but i
could be wrong) arrays initially, before looking toward assumed size and
shaped arrays. The patch also currently does not include stride, it's
left for future work.

Although, assumed size works in some fashion (dummy arguments) with some
minor alterations to the OMPEarlyOutliner, so it is possible changes
made in the IsolatedFromAbove series may allow this to work with no
further required patches.

It utilises the generated omp.bounds to calculate the size of the mapped
OpenMP array (both for sectioned and un-sectioned arrays) as well as the
offset to be passed to the kernel argument structure.

Alongside these changes some refactoring of how map data is handled is
attempted, using a new MapData structure to keep track of information
utilised in the lowering of mapped values.

The initial addition of a more complex createDeviceArgumentAccessor that
utilises capture kinds similarly to (and loosely based on) Clang to
generate different kernel argument accesses is also added.

A similar function for altering how the kernel argument is passed to the
kernel argument structure on the host is also utilised
(createAlteredByCaptureMap), which allows modification of the
pointer/basePointer based on their capture (and bounds information).
It's of note ByRef, is the default for explicit mappings and ByCopy will
be the default for implicit captures, so the former is currently tested
in this patch and the latter is not for the moment.
2023-10-30 16:00:23 +01:00
Brad Smith
0a29879e41
[OpenMP] Add missing bit with the Hurd support (#70609)
Looking at 855d09855d it looks like a bit was
missing. The padding variable is used further down by the KMP_ALLOCA()
function.
2023-10-29 22:35:03 -04:00
Brad Smith
0d1da7c37f
[OpenMP] Make use of getloadavg() on *BSD OS's (#70586)
OpenBSD does not have /proc filesystem, neither does FreeBSD (by default).
2023-10-29 18:30:11 -04:00
Konstantinos Parasyris
d6a3d6b96d
[openmp] Fixed Support for VA for record-replay. (#70396)
The commit was discussed in phabricator
(https://reviews.llvm.org/D157186).

Record replay currently fails on AMD as it conflicts with the heap
memory allocator introduced in #69806. The workaround is setting
`LIBOMPTARGET_HEAP_SIZE=0` during both record and replay run.
2023-10-29 12:27:19 -07:00
Johannes Doerfert
d346c82435
[OpenMP] Associate the KernelEnvironment with the GenericKernelTy (#70383)
By associating the kernel environment with the generic kernel we can
access middle-end information easily, including the launch bounds ranges
that are acceptable. By constraining the number of threads accordingly,
we now obey the user-provided bounds that were passed via attributes.
2023-10-29 11:35:34 -07:00
Brad Smith
223852aecf
[OpenMP] Fix building for 32-bit DragonFly, NetBSD, OpenBSD (#70527)
Fixing ```#error "Unknown or unsupported OS"```
2023-10-27 22:53:24 -04:00
Konstantinos Parasyris
01828c4323
[OpenMP] record-replay use static-cast (#70516)
[OpenMP] Fixes #69905
2023-10-27 16:46:34 -07:00
Shraiysh
03485a0406
[openmp][flang] Add tests for map clause (#70394)
This patch adds basic tests for map clause on target construct for
commonblocks. There will be more tests to add, which will be added in
future patches. Currently failing tests are added in a separate folder
with XFAIL. They should be moved as they are fixed.
2023-10-27 09:35:06 -07:00
Mehdi Amini
f390a76b7e Revert "Revert "[OpenMP][NFC] Add min/max threads/teams count into the KernelEnvironment (#70257)""
This reverts commit ddbaa11e9f.

Reapply the original commit, the broken test was repaired in 5e51363f38 in the meantime.
2023-10-26 17:30:01 -07:00
Mehdi Amini
ddbaa11e9f Revert "[OpenMP][NFC] Add min/max threads/teams count into the KernelEnvironment (#70257)"
This reverts commit c2a1249a82.

The MLIR bots are broken with an omp test failure.
2023-10-26 17:25:20 -07:00
Johannes Doerfert
c2a1249a82
[OpenMP][NFC] Add min/max threads/teams count into the KernelEnvironment (#70257)
The runtime needs to know about the acceptable launch bounds, especially
if the compiler (middle- or backend) assumed those bounds. While this
patch does not yet inform the runtime, it stores the bounds in a place
that can/will be accessed and is associated with the kernel.
2023-10-26 14:46:55 -07:00
Johannes Doerfert
0012b956f9 [OpenMP][FIX] Move workaround code to avoid races
The workaround code ensure we always call __kmpc_kernel_parallel, but it
did so in a racy manner as the initialization might not have been
completed yet. To avoid introducing a sync, we move the workaround into
the deinit function for now.
2023-10-26 14:38:23 -07:00
Joseph Huber
cee08ff342
[Libomptarget] Do not pass 'nogpulib' to the non-LTO Nvidia tests (#70327)
Summary:
For the other tests we pass `-nogpulib` to ensure that we set up the
needed libraries correctly. However, this caused problems for the
non-LTO build and test of Nvidia systems. In general this is because we
would do a separate compile of the libomptarget device runtime and then
link in that cubin. This exercised the runtime in a lot of ways it's not
used to, since doing things this way was hardly expected or tested. This
patch disables it only for the Nvidia non-LTO build so that we still get
the effect of `--liboimptarget-nvptx-bc-path` rather than ignoring it.
2023-10-26 10:36:34 -05:00
Yuanfang Chen
f09f58d0f2 [OpenMP] [OMPD] Fix CMake install command
https://cmake.org/cmake/help/latest/command/install.html
"If a relative path is given it is interpreted relative to the value of the CMAKE_INSTALL_PREFIX variable."
2023-10-26 03:02:53 +00:00
Joseph Huber
17b5445996
[Libomptarget] Add a wavefront sync builtin for the AMDGPU implementation (#70228)
Summary:
While this is technically a no-op for AMDGPU hardware, in cases where
the user would see fit to add an explicit wavefront sync on Nvidia
hardware, we should also inform the LLVM optimizer that this control
flow is convergent so we do not reorder blocks.
2023-10-25 14:27:14 -05:00
Joseph Huber
006cd37960 [OpenMP][Obvious] Fix incorrect variant selector in test
Summary:
This should be `kind` and not `arch`.
2023-10-25 13:48:30 -05:00
Joseph Huber
ca3545f0ef
[Libomptarget] Bump up PTX version from +ptx61 to +ptx63 (#70227)
Summary:
This version is required to support the 'activemask' feature which is
used for certain features, such as reductions. This ties the
implementation of the DeviceRTL roughly to the features provided by the
CUDA 9.0 release, which should be sufficienly old as to not cause
problems since this is a minor version jump that corresponds to the
release of `sm_53`.
2023-10-25 13:28:02 -04:00
Joseph Huber
8a181f43da [OpenMP][Obvious] Fix incompatbile function prototype causing failures
Summary:
This function needs `void` as the arguments to be ABI compatbile with
what is actually defined. This is enforced when doing CUDA separable
linking of the runtime.
2023-10-25 10:44:07 -05:00
Joseph Huber
84d8ace51a [OpenMP][Obvious] Fix function prototype when used in C mode
Summary:
The `llvm_omp_target_dynamic_shared_alloc` prototype in `omp.h`
accidentally left the void argument unspecified. This created unintended
code when called from the C language, causing some `nvlink` failures in
certain scenarios.
2023-10-25 09:35:23 -05:00
Ilya Leoshkevich
f7fc98a1cf
[OpenMP][Archer] Do not check for column numbers in backtraces (#70075)
TSan can show only line numbers on some platforms, e.g., SystemZ. Skip
checking the column numbers; line numbers should be enough to verify
that race detection is working.
2023-10-25 13:22:24 +02:00
Ilya Leoshkevich
77c2b623ca
[OpenMP][Tests] Sync struct DEP with the runtime (#69982)
struct DEP defined in multiple testcases must correspond to runtime's
struct kmp_depend_info. The former defines flags as int, and the latter
as kmp_uint8_t. This discrepancy goes unnoticed on little-endian
systems, but breaks big-endian ones.

Make flags in struct DEP unsigned char.
2023-10-24 19:40:08 +02:00
Ilya Leoshkevich
34459b72da
[OpenMP] Provide big-endian bitfield definitions (#69995)
structs kmp_depend_info.flags and kmp_tasking_flags contain bitfields,
which overlay integer flag values. The current bitfield definitions
target little-endian machines. On big-endian machines bitfields are laid
out in the opposite order, so the current definitions do not work there.

There are two ways to fix this: either provide big-endian bitfield
definitions, or bit-swap integer flag values. Go with the former, since
it's localized to one place and therefore is more maintainable.
2023-10-24 19:39:50 +02:00
Jon Chesterfield
840d0b7e03
[amdgpu] D2D memcpy via streams and HSA (#69977)
hsa_amd_memory_async_copy can handle device to device copies if passed
the corresponding parameters.

No functional change - currently D2D copy goes through a fallback in
libomptarget that stages through a host malloc, after this it goes
directly through HSA.

Works under exactly the situations that HSA works. Verified locally on a
performance benchmark. Hoping to attract further testing from internal
developers after it lands.
2023-10-24 00:05:04 +01:00
Johannes Doerfert
86bb713142 [OpenMP][FIX] Enlarge thread state array, improve test and add second 2023-10-22 17:47:00 -07:00
Johannes Doerfert
9f3b06d8be [OpenMP][FIX] Fix memset oversight to partially unblock test
The tests "unoptimized" version is still broken, disabled for now.
2023-10-22 14:29:11 -07:00
Johannes Doerfert
f3ff0a67be [OpenMP][FIX] Ensure test runs correct with (at least) 2 threads 2023-10-22 13:22:36 -07:00
Johannes Doerfert
87dac9f168 [OpenMP] Rewrite test to check the correct (CPU) result
The test initially showed we do no crash but compute the wrong GPU
result, now we show the CPU result is correct and disable GPU testing.
2023-10-21 14:55:15 -07:00
Johannes Doerfert
d3921e4670
[OpenMP] Basic BumpAllocator for (AMD)GPUs (#69806)
The patch contains a basic BumpAllocator for (AMD)GPUs to allow us to
run more tests. The allocator implements `malloc`, both internally and
externally, while we continue to default to the NVIDIA `malloc` when we
target NVIDIA GPUs. Once we have smarter or customizable allocators we
should consider this choice, for now, this allocator is better than
none. It traps if it is out of memory, making it easy to debug. Heap
size is configured via `LIBOMPTARGET_HEAP_SIZE` and defaults to 512MB.
It allows to track allocation statistics via
`LIBOMPTARGET_DEVICE_RTL_DEBUG=8` (together with
`-fopenmp-target-debug=8`). Two tests were added, and one was enabled.

This is the next step towards fixing
 https://github.com/llvm/llvm-project/issues/66708
2023-10-21 14:49:30 -07:00
Johannes Doerfert
d571af7f62 [OpenMP][FIX] Ensure thread states do not crash on the GPU
The nested parallelism causes thread states which still do not properly
work but at least don't crash anymore.
2023-10-21 14:43:09 -07:00
Johannes Doerfert
1cea309b7e [OpenMP][NFC] Move DebugKind to make it reusable from the host 2023-10-20 19:28:09 -07:00
Joseph Huber
34a3fb9f62
[Libomptarget][NFC] Remove use of VLA in the AMDGPU plugin (#69761)
Summary:
We should not rely on a VLA in C++ for the handling of this string. The
size is a true runtime value so we cannot rely on constexpr handling. We
simply use a small vector, whose default size is most likely large
enough to handle whatever size gets output within the stack, but is safe
in cases where it is not.
2023-10-20 16:02:51 -04:00
Michael Klemm
f93a697e47
[libomptarget][OpenMP] Initial implementation of omp_target_memset() and omp_target_memset_async() (#68706)
Implement a slow-path version of omp_target_memset*() 

There is a TODO to implement a fast path that uses an on-device
kernel instead of the host-based memory fill operation.  This may
require some additional plumbing to have kernels in libomptarget.so
2023-10-19 15:29:36 +02:00
Joseph Huber
970e7456e0
[Libomptarget] Add a test for the libc implementation of assert (#69518)
Summary:
The `libcgpu.a` file provides its own implementation of `__assert_fail`.
This adds a test to make sure it's usable in OpenMP offloading as
expected. Currently this requires linking `libcgpu.a` before the OpenMP
device RTL however. We also disable the test on the CPU as the format of
the string will be different.
2023-10-19 08:55:45 -04:00
Joseph Huber
b69081e324
Attributes (#69358)
- [Libomptarget] Make the references to 'malloc' and 'free' weak.
- [Libomptarget][NFC] Use C++ style attributes instead
2023-10-18 12:52:43 -04:00
Joseph Huber
1e5fe67e70
[Libomptarget] Make the references to 'malloc' and 'free' weak. (#69356)
Summary:
We use `malloc` internally in the DeviceRTL to handle data
globalization. If this is undefined it will map to the Nvidia
implementation of `malloc` for NVPTX and return `nullptr` for AMDGPU.
This is somewhat problematic, because when using this as a shared
library it causes us to always extract the GPU libc implementation,
which uses RPC and thus requires an RPC server. Making this `weak`
allows us to implement this internally without worrying about binding to
the GPU `libc` implementation.
2023-10-18 12:50:23 -04:00
Jon Chesterfield
7ac516a119 [amdgpu] Disable openmp test that is blocking CI after changing hardware, need to diagnose memory fault 2023-10-16 13:59:49 +01:00
Kazu Hirata
18d199116f Stop including llvm/ADT/STLFunctionalExtras.h (NFC)
These source files do not use function_ref.
2023-10-13 20:50:58 -07:00
JP Lehr
b2a67255be [OpenMP] Disable flaky libomptarget AMDGPU test
We observe intermittent failures of that test and need some time to
investigate. Hence, for now, we disable it.
2023-10-10 13:09:29 -05:00
Joseph Huber
4e9054d391 [Libomptarget] Fix lookup of the libcgpu.a library
Summary:
The `libcgpu.a` library was added to support certain libc functions. A
recent patch made us pass its location directly on the commandline,
however it used `find_library`. This doesn't work because the ordering
of CMake might run `fine_library` before it builds the library we're
trying to find. This patch changes this to just use the destimation we
know it will end up in and checks it manually.
2023-10-05 10:48:56 -05:00
Joseph Huber
75e648031c [Libomptarget] Disable AMDGPU complex math test after recent patch
Summary:
The recent patch added `-nogpulib` to make these tests only pick up what
was intentionally put into them. This had the effect of removing the
dependency on the ROCm device libs which are needed for math. This test
disables the complex math test, which is the only one that needed it,
for the time being. In the future we will implement these and provide it
via the GPU `libm` and pass it in the same way as the GPU `libc`.
2023-10-04 15:24:43 -05:00
Joseph Huber
7282975057
[Libomptarget] Explicitly pass the OpenMP device libraries to tests (#68225)
Summary:
We have tests that depend on two static libraries
`libomptarget.devicertl.a` and `libcgpu.a`. These are currently
implicitly picked up and searched through the standard path. This patch
changes that to pass `-nogpulib` to disable implicit runtime path
searches. We then explicitly passed the built libraries to the
compilations so that we know exactly which libraries are being used.

Depends on: https://github.com/llvm/llvm-project/pull/68220

Fixes https://github.com/llvm/llvm-project/issues/68141
2023-10-04 14:14:30 -05:00
Joseph Huber
2d4d8c8f97
[Libomptarget] Make the DeviceRTL configuration globals weak (#68220)
This patch applies weak linkage to the config globals by the name
`__omp_rtl...`. This is because when passing `-nogpulib` we will not
link in or create these globals. This allows the OpenMP device RTL to be
self contained without requiring the additional definitions from the
`clang` compiler. In the standard case, this should not affect the
current behavior, this is because the strong definition coming from the
compiler should always override the weak definition we default to here.
In the case that these are not defined by the compiler, these will
remain weak. This will impact optimizations somewhat, but the previous
behavior was that it would not link so that is an improvement.
    
Depends on: https://github.com/llvm/llvm-project/pull/68215
2023-10-04 14:14:13 -05:00
Joseph Huber
49d8a559d3
[LinkerWrapper] Fix resolution of weak symbols during LTO (#68215)
Summary:
Weak symbols are supposed to have the semantics that they can be
overriden by a strong (i.e. global) definition. This wasn't being
respected by the LTO pass because we simply used the first definition
that was available. This patch fixes that logic by doing a first pass
over the symbols to check for strong resolutions that could override a
weak one.

A lot of fake linker logic is ending up in the linker wrapper. If there
were an option to handle this in `lld` it would be a lot cleaner, but
unfortunately supporting NVPTX is a big restriction as their binaries
require the `nvlink` tool.
2023-10-04 14:13:52 -05:00
agozillon
1482106c99
[Flang][OpenMP][MLIR] Remove deletion of unused declare target global after use replacement (#67762)
At the moment, for device a reference pointer is generated in place of
the original declare target global value, this reference pointer is the
pointer that actually receives the data. In Clang the original global
value isn't generated for device, just the reference pointer.

Unfortunately for Flang/MLIR this is currently not the case, as the
declare target attribute is processed after the creation of the global
so we end up with a dead global on device effectively after rewriting
its uses to the new device reference pointer.

It appears I was a little overzealous with the deletion of the declare
target globals for device. The current method breaks in-cases where the
same declare target global is used across two target regions (added a
runtime reproduced in the patch). As it'll effectively delete it before
the second target gets a chance to be written to LLVM IR and have it's
uses rewritten .

I'd like to remove this deletion as the dead global isn't breaking any
code and will likely be removed in later dead code elimination passes,
perhaps a little too heavy handed with the original approach.
2023-10-03 15:21:27 +02:00
Leandro Lupori
5833a9e99a
[OpenMP] Fix -Wc++98-compat-extra-semi warning (NFC) (#68022)
Compiling OpenMP with LLVM 16 triggers the following warning:
warning: extra ';' outside of a function is incompatible with C++98
2023-10-02 16:43:02 -03:00
Shilei Tian
103bb69c04
[OpenMP] Fix a potential memory buffer overflow (#67252)
#67167 reports a potential memory overflow caused by the wrong size
passed to the function `memcpy_s`. This patch fixes it.

Fix #67167.
2023-09-29 12:41:32 -04:00
Joseph Huber
183a1b1e38 [OpenMP] Enable the 'libc/malloc.c' test on NVPTX
Summary:
Previously this test hanged indefinitely on NVPTX. This was due to an
issue fixed previously where we would wait indefinitely inside the CUDA
runtime waiting for the kernel to complete if it was blocked on the RPC
server. This patch enables this test again now that it can run without
deadlocking, at least on CUDA 12.2.
2023-09-28 14:41:35 -05:00
Joseph Huber
0f88be77ea
[Libomptarget] Fix Nvidia offloading hanging on dataRetrieve using RPC (#66817)
Summary:
The RPC server is responsible for providing host services from the GPU.
Generally, the client running on the GPU will spin in place until the
host checks the server. Inside the runtime, we elected to have the user
thread do this checking while it would be otherwise waiting for the
kernel to finish. However, for Nvidia this caused problems when
offloading to a target region that requires a copy back.

This is caused by the implementation of `dataRetrieve` on Nvidia. We
initialize an asynchronous copy-back on the same stream that the kernel
is running on. This creates an implicit sync on the kernel to finish
before we issue the D2H copy, which we then wait on. This implicit sync
happens inside of the CUDA runtime. This is problematic when running the
RPC server because we need someone to check the RPC server. If no one
checks the RPC server then the kernel will never finish, meaning that
the memcpy will never be issued and the program hangs. This patch adds
an explicit check for unfinished work on the stream and waits for it to
complete.
2023-09-26 16:03:34 -05:00