Commit Graph

3077 Commits

Author SHA1 Message Date
Joseph Huber
4b7beab418 [OpenMP] Add back implicit flags manually
Summary:
We used to inherit these flags from the LLVM options in a runtimes
build. This patch adds them back in manually as they are helpful for
diagnostics and optimizing the created binary.
2023-11-27 14:51:48 -06:00
Johannes Doerfert
7bfcce3e94
[OpenMP] Tear down GenericDeviceTy's with GenericPluginTy (#73557)
There is no point in keeping GenericDeviceTy objects alive longer than
the associated GenericPluginTy. Instead of the old API we now tear them
down with the plugin, avoiding ordering issues.
2023-11-27 11:42:12 -08:00
Johannes Doerfert
f9436464a9 [OpenMP][NFC] Minor name and code simplification 2023-11-27 11:08:29 -08:00
Johannes Doerfert
2b2e711afc [OpenMP][NFC] Remove no-op __tgt_rtl_deinit_plugin
The order in which we deinit things, especially when shared libraries
are involved, is complicated. To simplify our lives the nextgen plugin
deinitializes the GenericPluginTy and subclasses automatically. The old
__tgt_rtl_deinit_plugin is not needed anymore.
2023-11-27 11:07:57 -08:00
Johannes Doerfert
9c33bf62a7 [OpenMP][NFC] Remove unused (un)register_lib plugin API
These APIs have not been hooked up for a while. No need to carry them.
2023-11-27 11:07:57 -08:00
Brad Smith
e66876f2e0
[OpenMP][Tools] Have sort(1) not use long name parameters (#73477)
I noticed a few tests were failing on NetBSD. NetBSD's sort(1) does not
support long name parameters unlike GNU and FreeBSD/OpenBSD/DragonFly's
sort(1).

executed command: sort --numeric-sort --stable

 .---command stderr------------
 | sort: unknown option -- -
 | usage: sort [-bdfHilmnrSsu] [-k kstart[,kend]] [-o output] [-R char] [-T dir]
 |              [-t char] [file ...]
 |    or: sort -C|-c [-bdfilnru] [-k kstart[,kend]] [-o output] [-R char]
 |              [-t char] [file]
 `-----------------------------
2023-11-27 13:23:25 -05:00
Brad Smith
20406af31b
[runtime] Have the runtime use the compiler builtin for alloca on NetBSD (#73480)
Most of the tests were failing with the following in their logs..

| /usr/bin/ld: /home/brad/llvm-build/runtimes/runtimes-bins/openmp/runtime/src/libomp.so:
warning: Warning: reference to the libc supplied alloca(3); this most likely will not
work. Please use the compiler provided version of alloca(3), by supplying the appropriate
compiler flags (e.g. -std=gnu99).

By making use of __builtin_alloca..

before:

Total Discovered Tests: 353
  Unsupported:  59 (16.71%)
  Passed     :  51 (14.45%)
  Failed     : 243 (68.84%)

after:

Total Discovered Tests: 353
  Unsupported:  59 (16.71%)
  Passed     : 290 (82.15%)
  Failed     :   4 (1.13%)
2023-11-27 13:22:54 -05:00
Joseph Huber
ca007181ea [OpenMP] Fix missing CMake function in runtimes build
Summary:
We borrowed this function from LLVM, and my previous patch removed that. Now
we redefine it if it's not present.
2023-11-27 09:23:15 -06:00
Lixi Zhou
a3c0f705db
[NFC] fix failed ompt tests on M1 device (#65696)
Fix the 2 failed ompt tests on M1 device found on #63194.

```
libomp :: ompt/synchronization/masked.c
libomp :: ompt/synchronization/master.c
```

For the details of this fix, please check the original discussion in
https://github.com/llvm/llvm-project/issues/63194#issuecomment-1710494689

Thanks @jprotze for the fix.
2023-11-24 23:40:14 +01:00
Akash Banerjee
f1d773863d
[Flang][OpenMP] Remove use of non reference values from MapInfoOp (#72444)
This patch removes the val field from the `MapInfoOp`.

Previously when lowering `TargetOp`, the bounds information for the
`BoxValues` were also being mapped. Instead these ops are now cloned
inside the target region to prevent mapping of non reference typed
values.
2023-11-24 11:33:19 +00:00
Joachim Jenke
f5e50b21da [OpenMP] Optimized trivial multiple edges from task dependency graph
From "3.1 Reducing the number of edges" of this paper (https://hal.science/hal-04136674v1/) - Optimization (b)

Task (dependency) nodes have a `successors` list built from the dependencies passed.
Given the following code, B will be added to A's successors list, building the graph `A` -> `B`:
```
// A
#pragma omp task depend(out: x)
{}

// B
#pragma omp task depend(in: x)
{}
```

In the following code, B is currently added twice to A's successor list
```
// A
#pragma omp task depend(out: x, y)
{}

// B
#pragma omp task depend(in: x, y)
{}
```

This patch removes such duplicates by checking the last inserted task in `A`'s successor list.
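
The check described above can be sketched as follows. This is a minimal illustration, not the runtime's code: `TaskNode` and `addSuccessor` are hypothetical names, and the real dependency nodes carry far more state.

```cpp
#include <vector>

// Hypothetical stand-in for a task dependency node (assumption, not the
// runtime's real type).
struct TaskNode {
  int id;
  std::vector<TaskNode *> successors;
};

// Append `succ` to `pred`'s successor list unless it was the last task
// inserted. Returns true if an edge was actually added.
bool addSuccessor(TaskNode &pred, TaskNode &succ) {
  if (!pred.successors.empty() && pred.successors.back() == &succ)
    return false; // trivial duplicate edge, e.g. from depend(in: x, y)
  pred.successors.push_back(&succ);
  return true;
}
```

Comparing only against the last inserted entry keeps the check constant-time. It catches the `depend(in: x, y)` case because both dependences of B are processed back to back, but it would not catch duplicates separated by other insertions.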

Authored by: Romain Pereira (rpereira-dev)
Differential Revision: https://reviews.llvm.org/D158544
2023-11-21 18:36:12 +01:00
Johannes Doerfert
f48c4d8aa1 [OpenMP] Be more forgiving during record and replay
When we record and replay kernels we should not error out early if there
is a chance the program might still run fine. This patch will:
1) Fall back to the allocation heuristic if the VAMap doesn't work.
2) Adjust the memory start to match the required address if possible.
3) Adjust the (guessed) pointer arguments if the memory start adjustment
   is impossible. This will allow kernels without indirect accesses to
   work while indirect accesses will most likely fail.
2023-11-20 17:15:34 -08:00
Johannes Doerfert
41566fb852 [OpenMP][FIX] Ensure recording works properly w/ late allocations 2023-11-20 17:15:33 -08:00
Johannes Doerfert
6663df30c0 [OpenMP][NFC] Remove std::move to silence warnings 2023-11-20 17:15:33 -08:00
Joseph Huber
47a3ad5be1
[Libomptarget] Handle dynamic stack sizes for AMD COV5 (#72606)
Summary:
One of the changes in the AMD code-object version five was that kernels
that use an unknown amount of private stack memory no longer default
to 16 KB. Instead, a flag is emitted that indicates the runtime must
provide a value. This patch checks if we must provide such a stack, and
uses the existing handling of the stack environment variable to
configure it.
2023-11-20 12:48:42 -06:00
Brad Smith
3425e11a11
[OpenMP] Add missing pieces in __kmp_launch_worker for Solaris support (#72613) 2023-11-17 13:04:13 -05:00
Fabian Mora
be9fa9dee5
[flang][NVPTX] Add initial support to the NVPTX target (#71992)
This patch adds initial support to the NVPTX target, enabling `flang` to
produce OpenMP offload code for NVPTX targets.
2023-11-16 11:34:28 -05:00
agozillon
718793ce6a
[OpenMP][OMPIRBuilder] Handle replace uses of ConstantExpr's inside of Target regions (#71891)
Currently there's an edge case where constant indexing in target
regions can lead to incorrect results as we do not correctly replace
uses of mapped variables in generated target functions with the target
arguments (and accessor instructions) that replace them. This patch
seeks to fix that by extending the current logic in the OMPIRBuilder.

Things like GEP's can come in the form of Constants/ConstantExprs,
Constants and ConstantExpr's do not have access to the knowledge of what
they're contained in, so we must dig a little to find an instruction so
we can tell if they're used inside of the function we're outlining so we
can be sure they are replaceable and we are not accidentally replacing a
usage somewhere else in the module that's still necessary.

This patch handles these by replacing the original constant expression
with an equivalent new instruction. An instruction allows easy
modification in the following loop: we now know the constant
(instruction) is owned by our target function (as it holds this
knowledge), and replaceUsesOfWith can be invoked on it (which does not
seem to be possible with constants). Creating a brand new instruction
also lets us be cautious, as the old expression may have been used inside
the function while also existing and being used externally (unlikely by
the nature of a Constant, but still a positive side effect).
2023-11-15 15:45:32 +01:00
Jan Patrick Lehr
5c22b907dc
Reland [OpenMP][libomptarget] Enable parallel copies via multiple SDMA engines (#71801) (#72307)

This enables the AMDGPU plugin to use a new ROCm 5.7 interface to
dispatch asynchronous data transfers across SDMA engines.

The default functionality stays unchanged, meaning that all data
transfers are enqueued into an H2D queue or a D2H queue, depending on
transfer direction, via the HSA interface used previously.

The new interface can be enabled via the environment variable
`LIBOMPTARGET_AMDGPU_USE_MULTIPLE_SDMA_ENGINES=true` when libomptarget
is built against a recent ROCm version (5.7 and later). As of now,
requests are distributed in a round-robin fashion across available SDMA
engines.
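
Round-robin distribution of requests can be sketched as below. This is only an illustration: `EngineSelector` is a hypothetical name, and the sketch does not use the ROCm or HSA interfaces the plugin actually calls.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical round-robin selector (assumption: the engine count and the
// naming are illustrative and not taken from the AMDGPU plugin).
class EngineSelector {
public:
  explicit EngineSelector(std::size_t numEngines) : NumEngines(numEngines) {
    assert(numEngines > 0 && "need at least one engine");
  }

  // Returns the index of the engine that should serve the next request.
  std::size_t next() {
    std::size_t Engine = Counter;
    Counter = (Counter + 1) % NumEngines;
    return Engine;
  }

private:
  std::size_t NumEngines;
  std::size_t Counter = 0;
};
```

The sketch is not thread-safe; a runtime issuing transfers from several threads would need an atomic counter or a lock around `next()`.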
2023-11-14 21:30:04 +01:00
Joseph Huber
cc9e19ee59 Revert "[OpenMP][libomptarget] Enable parallel copies via multiple SDMA engines (#71801)"
This causes the tests to fail because the bots were not updated in time.
Revert until we update the bots to a valid version.

This reverts commit e876250b63.
2023-11-14 12:34:27 -06:00
Jan Patrick Lehr
e876250b63
[OpenMP][libomptarget] Enable parallel copies via multiple SDMA engines (#71801)
This enables the AMDGPU plugin to use a new ROCm 5.7 interface to
dispatch asynchronous data transfers across SDMA engines.

The default functionality stays unchanged, meaning that all data
transfers are enqueued into an H2D queue or a D2H queue, depending on
transfer direction, via the HSA interface used previously.

The new interface can be enabled via the environment variable
`LIBOMPTARGET_AMDGPU_USE_MULTIPLE_SDMA_ENGINES=true` when libomptarget
is built against a recent ROCm version (5.7 and later).
As of now, requests are distributed in a round-robin fashion across
available SDMA engines.
2023-11-14 19:16:39 +01:00
Brad Smith
5feebdcef2
[OpenMP] Link against libm on OpenBSD (#70614)
Needed for some math functions in libomp.
2023-11-11 20:37:50 -05:00
Johannes Doerfert
7318fe6334 [OpenMP][FIX] Ensure device reduction geps work for multi-var reductions
If we have more than one reduction variable we need to be consistent
wrt. indexing. In 3de645efe3 we broke this
as the buffer type was reduced to a singleton but the index computation
was not adjusted to account for that offset. This fixes it by
interleaving the reduction variables properly in an array-of-struct
style. We can revert it back to struct-of-array in a follow up if it turns
out to be a problem. I doubt it since half the accesses should benefit
from the locality this layout offers and only the other half were
consecutive before.
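
The difference between the two layouts can be sketched with plain offset arithmetic. The helper names and parameters below are hypothetical and ignore alignment padding; they are not the code generated by clang or used in the device runtime.

```cpp
#include <cstddef>

// Array-of-struct: each team owns one contiguous record holding all
// reduction variables. `recordSize` is the size of all variables together,
// `varOffset` the offset of one variable inside the record.
std::size_t aosOffset(std::size_t team, std::size_t recordSize,
                      std::size_t varOffset) {
  return team * recordSize + varOffset;
}

// Struct-of-array: each variable has its own array with one slot per team.
// `varOffset` is the summed size of the preceding variables, so the array
// for this variable starts at varOffset * bufferLength.
std::size_t soaOffset(std::size_t team, std::size_t bufferLength,
                      std::size_t varOffset, std::size_t varSize) {
  return varOffset * bufferLength + team * varSize;
}
```

With two 8-byte variables, the second variable of team 2 sits at byte 40 in the array-of-struct layout, right next to the first variable of the same team, which is the locality the message refers to.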
2023-11-10 14:34:46 -08:00
Joseph Huber
237adfca4e
[OpenMP] Rework handling of global ctor/dtors in OpenMP (#71739)
Summary:
This patch reworks how we handle global constructors in OpenMP.
Previously, we emitted individual kernels that were all registered and
called individually. In order to provide more generic support, this
patch moves all handling of this to the target backend and the runtime
plugin. This has the benefit of supporting the GNU extensions for
constructors and destructors, removing a class of failures related to
shared library destruction order, and allows targets other than OpenMP
to use the same support without needing to change the frontend.

This is primarily done by calling kernels that the backend emits to
iterate a list of ctor / dtor functions. For x64, this is automatic and
we get it for free with the standard `dlopen` handling. For AMDGPU, we
emit `amdgcn.device.init` and `amdgcn.device.fini` functions which
handle everything automatically and simply need to be called. For NVPTX,
a patch https://github.com/llvm/llvm-project/pull/71549 provides the
kernels to call, but the runtime needs to set up the array manually by
pulling out all the known constructor / destructor functions.

One concession that this patch requires is the change that for GPU
targets in OpenMP offloading we will use `llvm.global_dtors` instead of
using `atexit`. This is because `atexit` is a separate runtime function
that does not mesh well with the handling we're trying to do here. This
should be equivalent in all cases except for cases where we would need
to destruct manually such as:

```
void foo();
struct S { ~S() { foo(); } };
void foo() {
  static S s;
}
```

However this is broken in many other ways on the GPU, so it is not
regressing any support, simply increasing the scope of what we can
handle.

This changes the handling of ctors / dtors. This patch now outputs an
informational message regarding the deprecation if the old format is used.
This will be completely removed in a later release.

Depends on: https://github.com/llvm/llvm-project/pull/71549
2023-11-10 14:53:53 -06:00
Ilya Leoshkevich
72552fc5cb
[OpenMP][SystemZ] Compile __kmpc_omp_task_begin_if0() with backchain (#71834)
OpenMP runtime fails to build on SystemZ with clang with the following
error message:

    LLVM ERROR: Unsupported stack frame traversal count

__kmpc_omp_task_begin_if0() uses OMPT_GET_FRAME_ADDRESS(1), which
delegates to __builtin_frame_address(), which in turn works with nonzero
values on SystemZ only if backchain is in use. If backchain is not in
use, the above error is emitted.

Compile __kmpc_omp_task_begin_if0() with backchain. Note that this only
resolves the build error. If at runtime its caller is compiled without
backchain, __builtin_frame_address() will produce an incorrect value,
but will not cause a crash. Since the value is relevant only for OMPT,
this is acceptable.
2023-11-09 23:54:16 +01:00
Konstantinos Parasyris
b34d31d2e1
[OpenMP] Fix record-replay allocation order for kernel environment (#71863) 2023-11-09 12:51:22 -08:00
xingxue-ibm
90a9e9f638
[OpenMP] Fix a condition for KMP_OS_SOLARIS. (#71831)
Line 75 of `z_Linux_util.cpp` checks `#ifdef KMP_OS_SOLARIS` which is
always true regardless of the building platform because macro
`KMP_OS_SOLARIS` is always defined in line 23 of `kmp_platform.h`:
`#define KMP_OS_SOLARIS 0`.
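
The pitfall can be reproduced in isolation. In this sketch the macro is defined locally to mirror the situation described above, and the two helper functions are hypothetical; testing the value with `#if` is presumably the intended form.

```cpp
// Mirrors the situation described above: the macro is always defined, and
// its value (0 or 1) says whether the platform matches.
#define KMP_OS_SOLARIS 0

// #ifdef only asks "is the macro defined?", so this branch is taken even
// though the value is 0.
bool takenWithIfdef() {
#ifdef KMP_OS_SOLARIS
  return true;
#else
  return false;
#endif
}

// #if evaluates the value, so this branch is skipped when the value is 0.
bool takenWithIf() {
#if KMP_OS_SOLARIS
  return true;
#else
  return false;
#endif
}
```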
2023-11-09 13:30:36 -05:00
Saiyedul Islam
21861991e7
[OpenMP] Cleanup and fixes for ABI agnostic DeviceRTL (#71234)
Fixes the DeviceRTL compilation to ensure it is ABI agnostic. Uses the
already available global variable "oclc_ABI_version" instead of
"llvm.amdgcn.abi.version".

It also adds some minor fields to the ImplicitArg structure.
2023-11-09 10:34:35 +05:30
Jonathan Peyton
5cc603cb22
[OpenMP] Add skewed iteration distribution on hybrid systems (#69946)
This commit adds skewed distribution of iterations in the
nonmonotonic:dynamic schedule (static steal) for hybrid systems when
thread affinity is assigned. Currently, it distributes the iterations at
a 60:40 ratio. Consider this loop with the dynamic schedule type:
for (int i = 0; i < 100; ++i). In a hybrid system with 20 hardware
threads (16 CORE and 4 ATOM cores), 88 iterations will be assigned to
performance cores and 12 iterations will be assigned to efficient cores.
Each thread on a CORE core will process 5 iterations plus extras, and
each thread on an ATOM core will process 3 iterations.
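
The numbers in this example can be reproduced with the arithmetic below. This is one reading of the 60:40 ratio that happens to match the figures quoted, not the runtime's actual computation: it assumes each efficient-core thread gets 60% of an even per-thread share (rounded down) and the performance-core threads absorb the rest. `SkewedSplit` and `skewedSplit` are hypothetical names.

```cpp
#include <cassert>

struct SkewedSplit {
  int perEfficientThread;   // iterations per ATOM-core thread
  int perPerformanceThread; // base iterations per CORE-core thread
  int performanceExtras;    // leftover iterations spread over CORE threads
  int efficientTotal;       // iterations on all ATOM-core threads
  int performanceTotal;     // iterations on all CORE-core threads
};

// Assumption: efficient threads take 60% of the even share, rounded down;
// performance threads take everything that remains.
SkewedSplit skewedSplit(int iterations, int pThreads, int eThreads) {
  assert(pThreads > 0 && eThreads >= 0);
  int evenShare = iterations / (pThreads + eThreads);
  SkewedSplit s{};
  s.perEfficientThread = evenShare * 60 / 100;
  s.efficientTotal = s.perEfficientThread * eThreads;
  s.performanceTotal = iterations - s.efficientTotal;
  s.perPerformanceThread = s.performanceTotal / pThreads;
  s.performanceExtras = s.performanceTotal % pThreads;
  return s;
}
```

For 100 iterations, 16 CORE threads and 4 ATOM threads, the even share is 5, so each ATOM thread gets 3 (12 in total), leaving 88 for the CORE threads: 5 each plus 8 extras.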

Differential Revision: https://reviews.llvm.org/D152955
2023-11-08 10:19:37 -06:00
Anton Rydahl
446e11acef
[OpenMP] Adding more libomptarget reduction tests (#71616)
Based on https://github.com/llvm/llvm-project/pull/70766 I think it
would be good to have a few more offloading reduction tests, so we do
not accidentally break minimum and maximum reductions another time.
2023-11-07 20:39:08 -08:00
Shilei Tian
6d7457861b [OpenMP][FIX] Fix the compile error introduced by reverting eab828d 2023-11-07 19:46:18 -05:00
Shilei Tian
6e574f125d Revert "[OpenMP] Provide a specialized team reduction for the common case (#70766)"
This reverts commit eab828d46c.
2023-11-07 19:16:44 -05:00
Johannes Doerfert
2d739f13d4
[OpenMP][Offload] Automatically map indirect function pointers (#71462)
We already have all the information to automatically map function
pointers that have been declared as `indirect` declare target by the
user. This is just enabling and testing the functionality by looking
through the one level of indirection.
2023-11-07 08:33:39 -08:00
Johannes Doerfert
002f422410 [OpenMP] Replace CUDART_VERSION with CUDA_VERSION 2023-11-06 12:30:40 -08:00
Johannes Doerfert
726ee40f52 [OpenMP] Move the recording code to account for KernelLaunchEnvironment
We need to record late to account for the kernel launch environment as
well as the potential changes in block and thread count.
2023-11-06 12:30:40 -08:00
Johannes Doerfert
3de645efe3 [OpenMP][NFC] Split the reduction buffer size into two components
Previously, we tracked the size of the teams reduction buffer in order to
allocate it at runtime per kernel launch. This patch splits the number
into two parts: the size of the reduction data (=all reduction
variables) and the (maximal) length of the buffer. This will allow us to
allocate less if we need less, e.g., if we have fewer teams than the
maximal length. It also allows us to move code from clang's codegen into
the runtime as we now know how large the reduction data is.
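
How the two components could combine into an allocation size is sketched below. The helper is hypothetical and only illustrates the "allocate less if we need less" idea; it is not the runtime's allocation code.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper: bytes to allocate for one kernel launch.
//   reductionDataSize: size in bytes of all reduction variables together
//   maxBufferLength:   maximal number of slots in the buffer
//   numTeams:          number of teams actually launched
std::size_t teamsReductionBufferBytes(std::size_t reductionDataSize,
                                      std::size_t maxBufferLength,
                                      std::size_t numTeams) {
  // Never allocate more slots than there are teams or than the maximum.
  std::size_t slots = std::min(numTeams, maxBufferLength);
  return slots * reductionDataSize;
}
```

With 16 bytes of reduction data and a maximal length of 1024, a launch with 64 teams needs 1024 bytes instead of the full 16384.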
2023-11-06 11:50:41 -08:00
Jan Patrick Lehr
07f5cf1992
[OpenMP][libomptarget] Fixes possible no-return warning (#70808)
The UNREACHABLE macro resolves to message + trap, which may still warn, so we add a call to __builtin_unreachable.
2023-11-06 16:45:03 +01:00
Akash Banerjee
be59fe5028 [OpenMP][Flang] Fix some of the Fortran OpenMP Offloading tests
target_map_common_block2.f90
	- Fix the extra space in the print message.
	- #67164 fixes this, so move it out of the failing tests and remove the XFAIL marker.

basic-target-region-3D-array.f90
	- Corrected the check to account for the new lines printed.

Depends on #67319
2023-11-06 13:24:02 +00:00
Shilei Tian
db37d25c53 Revert "[OpenMP] Simplify parallel reductions (#70983)"
This reverts commit e9a48f9e05 because it breaks
3 sollve 5.0 tests:

test_loop_reduction_and_device.c
test_loop_reduction_bitand_device.c
test_loop_reduction_multiply_device.c
2023-11-05 22:51:59 -05:00
Konstantinos Parasyris
d301a28950
[OpenMP] Guard Virtual Memory Management API and Types (#70986) 2023-11-03 16:24:18 -07:00
Johannes Doerfert
d3e7a48cbd [OpenMP][NFC] Remove a no-op function 2023-11-03 10:28:36 -07:00
Neale Ferguson
1111ef0257
Add openmp support to System z (#66081)
* openmp/README.rst
  - Add s390x to those platforms supported

* openmp/libomptarget/plugins-nextgen/CMakeLists.txt
  - Add s390x subdirectory

* openmp/libomptarget/plugins-nextgen/s390x/CMakeLists.txt
  - Add s390x definitions

* openmp/runtime/CMakeLists.txt
  - Add s390x to those platforms supported

* openmp/runtime/cmake/LibompGetArchitecture.cmake
  - Define s390x ARCHITECTURE

* openmp/runtime/cmake/LibompMicroTests.cmake
  - Add dependencies for System z (aka s390x)

* openmp/runtime/cmake/LibompUtils.cmake
  - Add S390X to the mix

* openmp/runtime/cmake/config-ix.cmake
  - Add s390x as a supported LIBOMP_ARCH

* openmp/runtime/src/kmp_affinity.h
  - Define __NR_sched_[get|set]affinity for s390x

* openmp/runtime/src/kmp_config.h.cmake
  - Define CACHE_LINE for s390x

* openmp/runtime/src/kmp_os.h
  - Add KMP_ARCH_S390X to support checks

* openmp/runtime/src/kmp_platform.h
  - Define KMP_ARCH_S390X

* openmp/runtime/src/kmp_runtime.cpp
  - Generate code when KMP_ARCH_S390X is defined

* openmp/runtime/src/kmp_tasking.cpp
  - Generate code when KMP_ARCH_S390X is defined

* openmp/runtime/src/thirdparty/ittnotify/ittnotify_config.h
  - Define ITT_ARCH_S390X

* openmp/runtime/src/z_Linux_asm.S
  - Instantiate __kmp_invoke_microtask for s390x

* openmp/runtime/src/z_Linux_util.cpp
  - Generate code when KMP_ARCH_S390X is defined

* openmp/runtime/test/ompt/callback.h
  - Define print_possible_return_addresses for s390x

* openmp/runtime/tools/lib/Platform.pm
  - Return s390x as platform and host architecture

* openmp/runtime/tools/lib/Uname.pm
  - Set hardware platform value for s390x
2023-11-03 12:42:55 +01:00
Brad Smith
b5b251aac8
[OpenMP] Add support for Solaris/x86_64 (#70593)
Tested on `amd64-pc-solaris2.11`.
2023-11-02 23:29:02 -04:00
Johannes Doerfert
e9a48f9e05
[OpenMP] Simplify parallel reductions (#70983)
A lot of the code was from a time when we had multiple parallel levels.
The new runtime is much simpler, so the code can be simplified a lot,
which should speed up reductions too.
2023-11-02 15:50:05 -07:00
Johannes Doerfert
eab828d46c
[OpenMP] Provide a specialized team reduction for the common case (#70766)
We default to < 1024 teams if the user did not specify otherwise. As
such we can avoid the extra logic in the teams reduction that handles
more than num_of_records (default 1024) teams. This is a stopgap but
still shaves off 33% of the runtime in some simple reduction examples.
2023-11-02 15:49:22 -07:00
Johannes Doerfert
95e11a97f6 [OpenMP][FIX] Unbreak a fencing issue
A recent update caused the fences to be team only while we always need
kernel fences. Broke OpenMC on NVIDIA A100.
2023-11-02 15:04:10 -07:00
Jon Chesterfield
f0e100a05a
[amdgpu][openmp] Treat missing TIMESTAMP_FREQUENCY as non-fatal (#70987)
If you build with dynamic_hsa, the symbol is known and compilation
succeeds. If you then run with a slightly older libhsa, this argument is
not recognised and an error is returned. I'd rather the program runs with a
misleading omp wtime than refuses to run at all.
2023-11-01 22:43:34 +00:00
Johannes Doerfert
a8152086ff [Attributor][FIX] Ensure new BBs are registered 2023-11-01 12:12:14 -07:00
Johannes Doerfert
a273d17d4a [OpenMP][FIX] Do not add implicit argument to device Ctors and Dtors
Constructors and destructors on the device do not take any arguments,
not even the implicit dyn_ptr argument that other kernels automatically take.
2023-11-01 11:18:11 -07:00
Johannes Doerfert
f9a89e6b9c
[OpenMP][FIX] Allocate per launch memory for GPU team reductions (#70752)
We used to perform team reduction on global memory allocated in the
runtime and by clang. This was racy as multiple instances of a kernel,
or different kernels with team reductions, would use the same locations.
Since we now have the kernel launch environment, we can allocate dynamic
memory per-launch, allowing us to move all the state into a non-racy
place.

Fixes: https://github.com/llvm/llvm-project/issues/70249
2023-11-01 11:11:48 -07:00