Commit Graph

3236 Commits

Author SHA1 Message Date
Johannes Doerfert
4667dd62ee
[OpenMP][NFC] Merge elf_common into PluginInterface (#73677)
The overhead of a library and 4 files seems high without benefit. This
simply tries to consolidate our structure.
2023-11-28 10:03:25 -08:00
Alex
d6f00654fb
[OpenMP][Runtime][test] Fix ompt task testcase fail randomly (#72337)
Fixed #72231
2023-11-28 14:22:57 +01:00
Johannes Doerfert
e1f911e40c [OpenMP][NFC] Simplify code 2023-11-27 16:02:34 -08:00
Johannes Doerfert
d2636dc390 [OpenMP][NFC] Fix diagnostic warnings 2023-11-27 16:02:34 -08:00
Johannes Doerfert
30fbe73ba9 [OpenMP][NFC] Remove else after return 2023-11-27 16:02:33 -08:00
Johannes Doerfert
755e108825 [OpenMP][NFC] Remove unused variable 2023-11-27 16:02:33 -08:00
Johannes Doerfert
77aa79d4fb [OpenMP][Docs] Cleanup libomptarget README 2023-11-27 16:02:33 -08:00
Joseph Huber
216040eef2 [OpenMP] Temporarily disable OMPT tests for x64
Summary:
These tests are currently failing, disable them so we can keep the bots
green until we find a better solution. The x64 tests are not the core
target so this is low priority.
2023-11-27 15:49:07 -06:00
Joseph Huber
71e3082d85 [OpenMP] Enable position independent code for libomptarget
Summary:
This option used to be passed manually by the `-fPIC` option that was
always enabled by the LLVM flags. Since we now do this manually we want
to specify that these are supposed for use fPIC code.
2023-11-27 14:51:48 -06:00
Joseph Huber
4b7beab418 [OpenMP] Add back implicit flags manually
Summary:
We used to inherit these flags from the LLVM options in a runtimes
build. This patch adds them back in manually as they are helpful for
diagnostics and optimizing the created binary.
2023-11-27 14:51:48 -06:00
Johannes Doerfert
7bfcce3e94
[OpenMP] Tear down GenericDeviceTy's with GenericPluginTy (#73557)
There is no point in keeping GenericDeviceTy objects alive longer than
the associated GenericPluginTy. Instead of the old API we now tear them
down with the plugin, avoiding ordering issues.
2023-11-27 11:42:12 -08:00
Johannes Doerfert
f9436464a9 [OpenMP][NFC] Minor name and code simplification 2023-11-27 11:08:29 -08:00
Johannes Doerfert
2b2e711afc [OpenMP][NFC] Remove no-op __tgt_rtl_deinit_plugin
The order in which we deinit things, especially when shared libraries
are involved, is complicated. To simplify our lives the nextgen plugin
deinitializes the GenericPluginTy and subclasses automatically. The old
__tgt_rtl_deinit_plugin is not needed anymore.
2023-11-27 11:07:57 -08:00
Johannes Doerfert
9c33bf62a7 [OpenMP][NFC] Remove unused (un)register_lib plugin API
These APIs have not been hooked up for a while. No need to carry them.
2023-11-27 11:07:57 -08:00
Brad Smith
e66876f2e0
[OpenMP][Tools] Have sort(1) not use long name parameters (#73477)
I noticed a few tests were failing on NetBSD. NetBSD's sort(1) does not
support long name parameters unlike GNU and FreeBSD/OpenBSD/DragonFly's
sort(1).

executed command: sort --numeric-sort --stable

 .---command stderr------------
 | sort: unknown option -- -
 | usage: sort [-bdfHilmnrSsu] [-k kstart[,kend]] [-o output] [-R char] [-T dir]
 |              [-t char] [file ...]
 |    or: sort -C|-c [-bdfilnru] [-k kstart[,kend]] [-o output] [-R char]
 |              [-t char] [file]
 `-----------------------------
2023-11-27 13:23:25 -05:00
Brad Smith
20406af31b
[runtime] Have the runtime use the compiler builtin for alloca on NetBSD (#73480)
Most of the tests were failing with the following in their logs..

| /usr/bin/ld: /home/brad/llvm-build/runtimes/runtimes-bins/openmp/runtime/src/libomp.so:
warning: Warning: reference to the libc supplied alloca(3); this most likely will not
work. Please use the compiler provided version of alloca(3), by supplying the appropriate
compiler flags (e.g. -std=gnu99).

By making use of __builtin_alloca..

before:

Total Discovered Tests: 353
  Unsupported:  59 (16.71%)
  Passed     :  51 (14.45%)
  Failed     : 243 (68.84%)

after:

Total Discovered Tests: 353
  Unsupported:  59 (16.71%)
  Passed     : 290 (82.15%)
  Failed     :   4 (1.13%)
2023-11-27 13:22:54 -05:00
Joseph Huber
ca007181ea [OpenMP] Fix missing CMake function in runtimes build
Summary:
We borrowed this function from LLVM, my previous patch removed that. Now
we redefine it if it's not present.
2023-11-27 09:23:15 -06:00
Lixi Zhou
a3c0f705db
[NFC] fix failed ompt tests on M1 device (#65696)
Fix the 2 failed ompt tests on M1 device found on #63194.

```
libomp :: ompt/synchronization/masked.c
libomp :: ompt/synchronization/master.c
```

For the details of this fix, please check the origin discussion in
https://github.com/llvm/llvm-project/issues/63194#issuecomment-1710494689

Thanks @jprotze for the fix.
2023-11-24 23:40:14 +01:00
Akash Banerjee
f1d773863d
[Flang][OpenMP] Remove use of non reference values from MapInfoOp (#72444)
This patch removes the val field from the `MapInfoOp`.

Previously when lowering `TargetOp`, the bounds information for the
`BoxValues` were also being mapped. Instead these ops are now cloned
inside the target region to prevent mapping of non reference typed
values.
2023-11-24 11:33:19 +00:00
Joachim Jenke
f5e50b21da [OpenMP] Optimized trivial multiple edges from task dependency graph
From "3.1 Reducing the number of edges" of this [[ https://hal.science/hal-04136674v1/ | paper ]] - Optimization (b)

Task (dependency) nodes have a `successors` list built upon passed dependency.
Given the following code, B will be added to A's successors list building the graph `A` -> `B`
```
// A
 # pragma omp task depend(out: x)
{}

// B
 # pragma omp task depend(in: x)
{}
```

In the following code, B is currently added twice to A's successor list
```
// A
 # pragma omp task depend(out: x, y)
{}

// B
 # pragma omp task depend(in: x, y)
{}
```

This patch removes such dupplicates by checking lastly inserted task in `A` successor list.

Authored by: Romain Pereira (rpereira-dev)
Differential Revision: https://reviews.llvm.org/D158544
2023-11-21 18:36:12 +01:00
Johannes Doerfert
f48c4d8aa1 [OpenMP] Be more forgiving during record and replay
When we record and replay kernels we should not error out early if there
is a chance the program might still run fine. This patch will:
1) Fallback to the allocation heuristic if the VAMap doesn't work.
2) Adjust the memory start to match the required address if possible.
3) Adjust the (guessed) pointer arguments if the memory start adjustment
   is impossible. This will allow kernels without indirect accesses to
   work while indirect accesses will most likely fail.
2023-11-20 17:15:34 -08:00
Johannes Doerfert
41566fb852 [OpenMP][FIX] Ensure recording works properly w/ late allocations 2023-11-20 17:15:33 -08:00
Johannes Doerfert
6663df30c0 [OpenMP][NFC] Remove std::move to silence warnings 2023-11-20 17:15:33 -08:00
Joseph Huber
47a3ad5be1
[Libomptarget] Handle dynamic stack sizes for AMD COV5 (#72606)
Summary:
One of the changes in the AMD code-object version five was that kernels
that use an unknown amount of private stack memory now no longer default
to 16 KBs. Instead it emits a flag that indicates the runtime must
provide a value. This patch checks if we must provide such a stack, and
uses the existing handling of the stack environment variable to
configure it.
2023-11-20 12:48:42 -06:00
Brad Smith
3425e11a11
[OpenMP] Add missing pieces in __kmp_launch_worker for Solaris support (#72613) 2023-11-17 13:04:13 -05:00
Fabian Mora
be9fa9dee5
[flang][NVPTX] Add initial support to the NVPTX target (#71992)
This patch adds initial support to the NVPTX target, enabling `flang` to
produce OpenMP offload code for NVPTX targets.
2023-11-16 11:34:28 -05:00
agozillon
718793ce6a
[OpenMP][OMPIRBuilder] Handle replace uses of ConstantExpr's inside of Target regions (#71891)
Currently there's an edge cases where constant indexing in target
regions can lead to incorrect results as we do not correctly replace
uses of mapped variables in generated target functions with the target
arguments (and accessor instructions) that replace them. This patch
seeks to fix that by extending the current logic in the OMPIRBuilder.

Things like GEP's can come in the form of Constants/ConstantExprs,
Constants and ConstantExpr's do not have access to the knowledge of what
they're contained in, so we must dig a little to find an instruction so
we can tell if they're used inside of the function we're outlining so we
can be sure they are replaceable and we are not accidentally replacing a
usage somewhere else in the module that's still necessary.

This patch handles these by replacing the original constant expression
with a new instruction equivalent; an instruction as it allows easy
modification in the following loop, as we can now know the constant
(instruction) is owned by our target function (as it holds this
knowledge) and replaceUsesOfWith can now be invoked on it (cannot do
this with constants it seems), a brand new one also allows us to be
cautious as it is perhaps possible the old expression was used inside of
the function but exists and is used externally (unlikely by the nature
of a Constant, but still a positive side affect).
2023-11-15 15:45:32 +01:00
Jan Patrick Lehr
5c22b907dc
Reland [OpenMP][libomptarget] Enable parallel copies via multiple SDM… (#72307)
…A engines (#71801)

This enables the AMDGPU plugin to use a new ROCm 5.7 interface to
dispatch asynchronous data transfers across SDMA engines.

The default functionality stays unchanged, meaning that all data
transfers are enqueued into a H2D queue or an D2H queue, depending on
transfer direction, via the HSA interface used previously.

The new interface can be enabled via the environment variable
`LIBOMPTARGET_AMDGPU_USE_MULTIPLE_SDMA_ENGINES=true` when libomptarget
is built against a recent ROCm version (5.7 and later). As of now,
requests are distributed in a round-robin fashion across available SDMA
engines.
2023-11-14 21:30:04 +01:00
Joseph Huber
cc9e19ee59 Revert "[OpenMP][libomptarget] Enable parallel copies via multiple SDMA engines (#71801)"
This causes the tests to fail because the bots were not updated in time.
Revert until we update the bots to a valid version.

This reverts commit e876250b63.
2023-11-14 12:34:27 -06:00
Jan Patrick Lehr
e876250b63
[OpenMP][libomptarget] Enable parallel copies via multiple SDMA engines (#71801)
This enables the AMDGPU plugin to use a new ROCm 5.7 interface to
dispatch asynchronous data transfers across SDMA engines.

The default functionality stays unchanged, meaning that all data
transfers are enqueued into a H2D queue or an D2H queue, depending on
transfer direction, via the HSA interface used previously.

The new interface can be enabled via the environment variable
`LIBOMPTARGET_AMDGPU_USE_MULTIPLE_SDMA_ENGINES=true` when libomptarget
is built against a recent ROCm version (5.7 and later).
As of now, requests are distributed in a round-robin fashion across
available SDMA engines.
2023-11-14 19:16:39 +01:00
Brad Smith
5feebdcef2
[OpenMP] Link against libm on OpenBSD (#70614)
Needed for some math functions in libomp.
2023-11-11 20:37:50 -05:00
Johannes Doerfert
7318fe6334 [OpenMP][FIX] Ensure device reduction geps work for multi-var reductions
If we have more than one reduction variable we need to be consistent
wrt. indexing. In 3de645efe3 we broke this
as the buffer type was reduced to a singleton but the index computation
was not adjusted to account for that offset. This fixes it by
interleaving the reduction variables properly in a array-of-struct
style. We can revert it back to struct-of-array in a follow up if turns
out to be a problem. I doubt it since half the accesses should benefit
from the locallity this layout offers and only the other half were
consecutive before.
2023-11-10 14:34:46 -08:00
Joseph Huber
237adfca4e
[OpenMP] Rework handling of global ctor/dtors in OpenMP (#71739)
Summary:
This patch reworks how we handle global constructors in OpenMP.
Previously, we emitted individual kernels that were all registered and
called individually. In order to provide more generic support, this
patch moves all handling of this to the target backend and the runtime
plugin. This has the benefit of supporting the GNU extensions for
constructors an destructors, removing a class of failures related to
shared library destruction order, and allows targets other than OpenMP
to use the same support without needing to change the frontend.

This is primarily done by calling kernels that the backend emits to
iterate a list of ctor / dtor functions. For x64, this is automatic and
we get it for free with the standard `dlopen` handling. For AMDGPU, we
emit `amdgcn.device.init` and `amdgcn.device.fini` functions which
handle everything atuomatically and simply need to be called. For NVPTX,
a patch https://github.com/llvm/llvm-project/pull/71549 provides the
kernels to call, but the runtime needs to set up the array manually by
pulling out all the known constructor / destructor functions.

One concession that this patch requires is the change that for GPU
targets in OpenMP offloading we will use `llvm.global_dtors` instead of
using `atexit`. This is because `atexit` is a separate runtime function
that does not mesh well with the handling we're trying to do here. This
should be equivalent in all cases except for cases where we would need
to destruct manually such as:

```
struct S { ~S() { foo(); } };
void foo() {
  static S s;
}
```

However this is broken in many other ways on the GPU, so it is not
regressing any support, simply increasing the scope of what we can
handle.

This changes the handling of ctors / dtors. This patch now outputs a
information message regarding the deprecation if the old format is used.
This will be completely removed in a later release.

Depends on: https://github.com/llvm/llvm-project/pull/71549
2023-11-10 14:53:53 -06:00
Ilya Leoshkevich
72552fc5cb
[OpenMP][SystemZ] Compile __kmpc_omp_task_begin_if0() with backchain (#71834)
OpenMP runtime fails to build on SystemZ with clang with the following
error message:

    LLVM ERROR: Unsupported stack frame traversal count

__kmpc_omp_task_begin_if0() uses OMPT_GET_FRAME_ADDRESS(1), which
delegates to __builtin_frame_address(), which in turn works with nonzero
values on SystemZ only if backchain is in use. If backchain is not in
use, the above error is emitted.

Compile __kmpc_omp_task_begin_if0() with backchain. Note that this only
resolves the build error. If at runtime its caller is compiled without
backchain, __builtin_frame_address() will produce an incorrect value,
but will not cause a crash. Since the value is relevant only for OMPT,
this is acceptable.
2023-11-09 23:54:16 +01:00
Konstantinos Parasyris
b34d31d2e1
[OpenMP] Fix record-replay allocation order for kernel environment (#71863) 2023-11-09 12:51:22 -08:00
xingxue-ibm
90a9e9f638
[OpenMP] Fix a condition for KMP_OS_SOLARIS. (#71831)
Line 75 of `z_Linux_util.cpp` checks `#ifdef KMP_OS_SOLARIS` which is
always true regardless of the building platform because macro
`KMP_OS_SOLARIS` is always defined in line 23 of `kmp_platform.h`:
`define KMP_OS_SOLARIS 0`.
2023-11-09 13:30:36 -05:00
Saiyedul Islam
21861991e7
[OpenMP] Cleanup and fixes for ABI agnostic DeviceRTL (#71234)
Fixes the DeviceRTL compilation to ensure it is ABI agnostic. Uses
already available global variable "oclc_ABI_version" instead of
"llvm.amdgcn.abi.verion".

It also adds some minor fields in ImplicitArg structure.
2023-11-09 10:34:35 +05:30
Jonathan Peyton
5cc603cb22
[OpenMP] Add skewed iteration distribution on hybrid systems (#69946)
This commit adds skewed distribution of iterations in
nonmonotonic:dynamic schedule (static steal) for hybrid systems when
thread affinity is assigned. Currently, it distributes the iterations at
60:40 ratio. Consider this loop with dynamic schedule type,
for (int i = 0; i < 100; ++i). In a hybrid system with 20 hardware
threads (16 CORE and 4 ATOM core), 88 iterations will be assigned to
performance cores and 12 iterations will be assigned to efficient cores.
Each thread with CORE core will process 5 iterations + extras and with
ATOM core will process 3 iterations.

Differential Revision: https://reviews.llvm.org/D152955
2023-11-08 10:19:37 -06:00
Anton Rydahl
446e11acef
[OpenMP ]Adding more libomptarget reduction tests (#71616)
Based on https://github.com/llvm/llvm-project/pull/70766 I think it
would be good to have a few more offloading reduction tests, so we do
not accidentally break minimum and maximum reductions another time.
2023-11-07 20:39:08 -08:00
Shilei Tian
6d7457861b [OpenMP][FIX] Fix the compile error introduced by reverting eab828d 2023-11-07 19:46:18 -05:00
Shilei Tian
6e574f125d Revert "[OpenMP] Provide a specialized team reduction for the common case (#70766)"
This reverts commit eab828d46c.
2023-11-07 19:16:44 -05:00
Johannes Doerfert
2d739f13d4
[OpenMP][Offload] Automatically map indirect function pointers (#71462)
We already have all the information to automatically map function
pointers that have been declared as `indirect` declare target by the
user. This is just enabling and testing the functionality by looking
through the one level of indirection.
2023-11-07 08:33:39 -08:00
Johannes Doerfert
002f422410 [OpenMP] Replace CUDART_VERSION with CUDA_VERSION 2023-11-06 12:30:40 -08:00
Johannes Doerfert
726ee40f52 [OpenMP] Move the recording code to account for KernelLaunchEnvironment
We need to record late to account for the kernel launch environment as
well as the potential changes in block and thread count.
2023-11-06 12:30:40 -08:00
Johannes Doerfert
3de645efe3 [OpenMP][NFC] Split the reduction buffer size into two components
Before we tracked the size of the teams reduction buffer in order to
allocate it at runtime per kernel launch. This patch splits the number
into two parts, the size of the reduction data (=all reduction
variables) and the (maximal) length of the buffer. This will allow us to
allocate less if we need less, e.g., if we have less teams than the
maximal length. It also allows us to move code from clangs codegen into
the runtime as we now know how large the reduction data is.
2023-11-06 11:50:41 -08:00
Jan Patrick Lehr
07f5cf1992
[OpenMP][libomptarget] Fixes possible no-return warning (#70808)
The UNREACHABLE macro resolves to message + trap, which may still warn, so we add call to __builtin_unreachable.
2023-11-06 16:45:03 +01:00
Akash Banerjee
be59fe5028 [OpenMP][Flang]Fix some of the Fortan OpenMP Offloading tests
target_map_common_block2.f90
	- Fix the extra space in the print message.
	- #67164 fixes this. So moving it outside of failing and also removing XFAIL marker.

basic-target-region-3D-array.f90
	- Corrected the check to account for the new lines printed.

Depends on #67319
2023-11-06 13:24:02 +00:00
Shilei Tian
db37d25c53 Revert "[OpenMP] Simplify parallel reductions (#70983)"
This reverts commit e9a48f9e05 because it breaks
3 sollve 5.0 tests:

test_loop_reduction_and_device.c
test_loop_reduction_bitand_device.c
test_loop_reduction_multiply_device.c
2023-11-05 22:51:59 -05:00
Konstantinos Parasyris
d301a28950
[OpenMP] Guard Virtual Memory Management API and Types (#70986) 2023-11-03 16:24:18 -07:00
Johannes Doerfert
d3e7a48cbd [OpenMP][NFC] Remove a no-op function 2023-11-03 10:28:36 -07:00
Neale Ferguson
1111ef0257
Add openmp support to System z (#66081)
* openmp/README.rst
  - Add s390x to those platforms supported

* openmp/libomptarget/plugins-nextgen/CMakeLists.txt
  - Add s390x subdirectory

* openmp/libomptarget/plugins-nextgen/s390x/CMakeLists.txt
  - Add s390x definitions

* openmp/runtime/CMakeLists.txt
  - Add s390x to those platforms supported

* openmp/runtime/cmake/LibompGetArchitecture.cmake
  - Define s390x ARCHITECTURE

* openmp/runtime/cmake/LibompMicroTests.cmake
  - Add dependencies for System z (aka s390x)

* openmp/runtime/cmake/LibompUtils.cmake
  - Add S390X to the mix

* openmp/runtime/cmake/config-ix.cmake
  - Add s390x as a supported LIPOMP_ARCH

* openmp/runtime/src/kmp_affinity.h
  - Define __NR_sched_[get|set]addinity for s390x

* openmp/runtime/src/kmp_config.h.cmake
  - Define CACHE_LINE for s390x

* openmp/runtime/src/kmp_os.h
  - Add KMP_ARCH_S390X to support checks

* openmp/runtime/src/kmp_platform.h
  - Define KMP_ARCH_S390X

* openmp/runtime/src/kmp_runtime.cpp
  - Generate code when KMP_ARCH_S390X is defined

* openmp/runtime/src/kmp_tasking.cpp
  - Generate code when KMP_ARCH_S390X is defined

* openmp/runtime/src/thirdparty/ittnotify/ittnotify_config.h
  - Define ITT_ARCH_S390X

* openmp/runtime/src/z_Linux_asm.S
  - Instantiate __kmp_invoke_microtask for s390x

* openmp/runtime/src/z_Linux_util.cpp
  - Generate code when KMP_ARCH_S390X is defined

* openmp/runtime/test/ompt/callback.h
  - Define print_possible_return_addresses for s390x

* openmp/runtime/tools/lib/Platform.pm
  - Return s390x as platform and host architecture

* openmp/runtime/tools/lib/Uname.pm
  - Set hardware platform value for s390x
2023-11-03 12:42:55 +01:00
Brad Smith
b5b251aac8
[OpenMP] Add support for Solaris/x86_64 (#70593)
Tested on `amd64-pc-solaris2.11`.
2023-11-02 23:29:02 -04:00
Johannes Doerfert
e9a48f9e05
[OpenMP] Simplify parallel reductions (#70983)
A lot of the code was from a time when we had multiple parallel levels.
The new runtime is much simpler, the code can be simplified a lot which
should speed up reductions too.
2023-11-02 15:50:05 -07:00
Johannes Doerfert
eab828d46c
[OpenMP] Provide a specialized team reduction for the common case (#70766)
We default to < 1024 teams if the user did not specify otherwise. As
such we can avoid the extra logic in the teams reduction that handles
more than num_of_records (default 1024) teams. This is a stopgap but
still shaves of 33% of the runtime in some simple reduction examples.
2023-11-02 15:49:22 -07:00
Johannes Doerfert
95e11a97f6 [OpenMP][FIX] Unbreak a fencing issue
A recent update caused the fences to be team only while we always need
kernel fences. Broke OpenMC on NVIDIA A100.
2023-11-02 15:04:10 -07:00
Jon Chesterfield
f0e100a05a
[amdgpu][openmp] Treat missing TIMESTAMP_FREQUENCY as non-fatal (#70987)
If you build with dynamic_hsa, the symbol is known and compilation
succeeds. If you then run with a slightly older libhsa, this argument is
not recognised and an error returned. I'd rather the program runs with a
misleading omp wtime than refuses to run at all.
2023-11-01 22:43:34 +00:00
Johannes Doerfert
a8152086ff [Attributor][FIX] Ensure new BBs are registered 2023-11-01 12:12:14 -07:00
Johannes Doerfert
a273d17d4a [OpenMP][FIX] Do not add implicit argument to device Ctors and Dtors
Constructors and destructors on the device do not take any arguments,
also not the implicit dyn_ptr argument other kernels automatically take.
2023-11-01 11:18:11 -07:00
Johannes Doerfert
f9a89e6b9c
[OpenMP][FIX] Allocate per launch memory for GPU team reductions (#70752)
We used to perform team reduction on global memory allocated in the
runtime and by clang. This was racy as multiple instances of a kernel,
or different kernels with team reductions, would use the same locations.
Since we now have the kernel launch environment, we can allocate dynamic
memory per-launch, allowing us to move all the state into a non-racy
place.

Fixes: https://github.com/llvm/llvm-project/issues/70249
2023-11-01 11:11:48 -07:00
Johannes Doerfert
b8cbc5c02c
[OpenMP] Introduce the KernelLaunchEnvironment as implicit argument (#70401)
The KernelEnvironment is for compile time information about a kernel. It
allows the compiler to feed information to the runtime. The
KernelLaunchEnvironment is for dynamic information *per* kernel launch.
It allows the rutime to feed information to the kernel that is not
shared with other invocations of the kernel. The first use case is to
replace the globals that synchronize teams reductions with per-launch
versions. This allows concurrent teams reductions. More uses cases will
follow, e.g., per launch memory pools.

Fixes: https://github.com/llvm/llvm-project/issues/70249
2023-10-31 19:38:43 -07:00
Johannes Doerfert
6258da14d6 [OpenMP] Lower synchronization threshold for reductions
This should provide an easy performance boost by only avoiding
synchronization that was unnessary anyway.
2023-10-30 22:39:46 -07:00
Johannes Doerfert
e137af60cd [OpenMP][NFC] Fix test to actually check for the result 2023-10-30 17:15:41 -07:00
Andrew Gozillon
68c384676c [Flang][MLIR][OpenMP] Temporarily re-add basic handling of uses in target regions to avoid gfortran test-suite regressions
This was a regression introduced by myself in:

 6a62707c04

where I too hastily removed the basic handling of implicit captures
we have currently. This will be superseded by all implicit captures
being added to target operations map_info entries in a soon landing
series of patches, however, that is currently not the case so we must
continue to do some basic handling of these captures for the time
being. This patch re-adds that behaviour to avoid regressions.

Unfortunately this means some test changes as well as
getUsedValuesDefinedAbove grabs constants used outside
of the target region which aren't handled particularly
well currently.
2023-10-30 15:10:12 -05:00
Shilei Tian
0d5b7dd25c
[OpenMP] Add a test for D158802 (#70678)
In D158802 we honored user's `thread_limit` value even with the
optimization
introduced in D152014. This patch adds a simple test.
2023-10-30 15:59:05 -04:00
Jon Chesterfield
896749aa0d
[amdgpu][openmp] Avoiding writing to packet header twice (#70695)
I think it follows from the HSA spec that a write to the first byte is
deemed significant to the GPU in which case writing to the second short
and reading back from it later would be safe. However, the examples for
this all involve an atomic write to the first 32 bits and it seems a
credible risk that the occasional CI errors abound invalid packets have
as their root cause that the firmware notices the early write to
packet->setup and treats that as a sign that the packet is ready to go.

That was overly-paranoid, however in passing noticed the code in libc is
genuinely invalid. The memset writes a zero to the header byte, changing
it from type_invalid (1) to type_vendor (0), at which point the GPU is
free to read the 64 byte packet and interpret it as a vendor packet,
which is probably why libc CI periodically errors about invalid packets.

Also a drive by change to do the atomic store on a uint32_t
consistently. I'm not sure offhand what __atomic_store_n on a uint16_t*
and an int resolves to, seems better to be unambiguous there.
2023-10-30 18:35:52 +00:00
agozillon
6a62707c04
[Flang][OpenMP][MLIR] Initial array section mapping MLIR -> LLVM-IR lowering utilising omp.bounds (#68689)
This patch seeks to add initial lowering of OpenMP array sections within
target region map clauses from MLIR to LLVM IR.

This patch seeks to support fixed sized contiguous (don't think OpenMP
supports anything other than contiguous sections from my reading but i
could be wrong) arrays initially, before looking toward assumed size and
shaped arrays. The patch also currently does not include stride, it's
left for future work.

Although, assumed size works in some fashion (dummy arguments) with some
minor alterations to the OMPEarlyOutliner, so it is possible changes
made in the IsolatedFromAbove series may allow this to work with no
further required patches.

It utilises the generated omp.bounds to calculate the size of the mapped
OpenMP array (both for sectioned and un-sectioned arrays) as well as the
offset to be passed to the kernel argument structure.

Alongside these changes some refactoring of how map data is handled is
attempted, using a new MapData structure to keep track of information
utilised in the lowering of mapped values.

The initial addition of a more complex createDeviceArgumentAccessor that
utilises capture kinds similarly to (and loosely based on) Clang to
generate different kernel argument accesses is also added.

A similar function for altering how the kernel argument is passed to the
kernel argument structure on the host is also utilised
(createAlteredByCaptureMap), which allows modification of the
pointer/basePointer based on their capture (and bounds information).
It's of note ByRef, is the default for explicit mappings and ByCopy will
be the default for implicit captures, so the former is currently tested
in this patch and the latter is not for the moment.
2023-10-30 16:00:23 +01:00
Brad Smith
0a29879e41
[OpenMP] Add missing bit with the Hurd support (#70609)
Looking at 855d09855d it looks like a bit was
missing. The padding variable is used further down by the KMP_ALLOCA()
function.
2023-10-29 22:35:03 -04:00
Brad Smith
0d1da7c37f
[OpenMP] Make use of getloadavg() on *BSD OS's (#70586)
OpenBSD does not have /proc filesystem, neither does FreeBSD (by default).
2023-10-29 18:30:11 -04:00
Konstantinos Parasyris
d6a3d6b96d
[openmp] Fixed Support for VA for record-replay. (#70396)
The commit was discussed in phabricator
(https://reviews.llvm.org/D157186).

Record replay currently fails on AMD as it conflicts with the heap
memory allocator introduced in #69806. The workaround is setting
`LIBOMPTARGET_HEAP_SIZE=0` during both record and replay run.
2023-10-29 12:27:19 -07:00
Johannes Doerfert
d346c82435
[OpenMP] Associate the KernelEnvironment with the GenericKernelTy (#70383)
By associating the kernel environment with the generic kernel we can
access middle-end information easily, including the launch bounds ranges
that are acceptable. By constraining the number of threads accordingly,
we now obey the user-provided bounds that were passed via attributes.
2023-10-29 11:35:34 -07:00
Brad Smith
223852aecf
[OpenMP] Fix building for 32-bit DragonFly, NetBSD, OpenBSD (#70527)
Fixing ```#error "Unknown or unsupported OS"```
2023-10-27 22:53:24 -04:00
Konstantinos Parasyris
01828c4323
[OpenMP] record-replay use static-cast (#70516)
[OpenMP] Fixes #69905
2023-10-27 16:46:34 -07:00
Shraiysh
03485a0406
[openmp][flang] Add tests for map clause (#70394)
This patch adds basic tests for map clause on target construct for
commonblocks. There will be more tests to add, which will be added in
future patches. Currently failing tests are added in a separate folder
with XFAIL. They should be moved as they are fixed.
2023-10-27 09:35:06 -07:00
Mehdi Amini
f390a76b7e Revert "Revert "[OpenMP][NFC] Add min/max threads/teams count into the KernelEnvironment (#70257)""
This reverts commit ddbaa11e9f.

Reapply the original commit, the broken test was repaired in 5e51363f38 in the meantime.
2023-10-26 17:30:01 -07:00
Mehdi Amini
ddbaa11e9f Revert "[OpenMP][NFC] Add min/max threads/teams count into the KernelEnvironment (#70257)"
This reverts commit c2a1249a82.

The MLIR bots are broken with an omp test failure.
2023-10-26 17:25:20 -07:00
Johannes Doerfert
c2a1249a82
[OpenMP][NFC] Add min/max threads/teams count into the KernelEnvironment (#70257)
The runtime needs to know about the acceptable launch bounds, especially
if the compiler (middle- or backend) assumed those bounds. While this
patch does not yet inform the runtime, it stores the bounds in a place
that can/will be accessed and is associated with the kernel.
2023-10-26 14:46:55 -07:00
Johannes Doerfert
0012b956f9 [OpenMP][FIX] Move workaround code to avoid races
The workaround code ensure we always call __kmpc_kernel_parallel, but it
did so in a racy manner as the initialization might not have been
completed yet. To avoid introducing a sync, we move the workaround into
the deinit function for now.
2023-10-26 14:38:23 -07:00
Joseph Huber
cee08ff342
[Libomptarget] Do not pass 'nogpulib' to the non-LTO Nvidia tests (#70327)
Summary:
For the other tests we pass `-nogpulib` to ensure that we set up the
needed libraries correctly. However, this caused problems for the
non-LTO build and test of Nvidia systems. In general this is because we
would do a separate compile of the libomptarget device runtime and then
link in that cubin. This exercised the runtime in a lot of ways it's not
used to, since doing things this way was hardly expected or tested. This
patch disables it only for the Nvidia non-LTO build so that we still get
the effect of `--liboimptarget-nvptx-bc-path` rather than ignoring it.
2023-10-26 10:36:34 -05:00
Yuanfang Chen
f09f58d0f2 [OpenMP] [OMPD] Fix CMake install command
https://cmake.org/cmake/help/latest/command/install.html
"If a relative path is given it is interpreted relative to the value of the CMAKE_INSTALL_PREFIX variable."
2023-10-26 03:02:53 +00:00
Joseph Huber
17b5445996
[Libomptarget] Add a wavefront sync builtin for the AMDGPU implementation (#70228)
Summary:
While this is technically a no-op for AMDGPU hardware, in cases where
the user would see fit to add an explicit wavefront sync on Nvidia
hardware, we should also inform the LLVM optimizer that this control
flow is convergent so we do not reorder blocks.
2023-10-25 14:27:14 -05:00
Joseph Huber
006cd37960 [OpenMP][Obvious] Fix incorrect variant selector in test
Summary:
This should be `kind` and not `arch`.
2023-10-25 13:48:30 -05:00
Joseph Huber
ca3545f0ef
[Libomptarget] Bump up PTX version from +ptx61 to +ptx63 (#70227)
Summary:
This version is required to support the 'activemask' feature which is
used for certain features, such as reductions. This ties the
implementation of the DeviceRTL roughly to the features provided by the
CUDA 9.0 release, which should be sufficienly old as to not cause
problems since this is a minor version jump that corresponds to the
release of `sm_53`.
2023-10-25 13:28:02 -04:00
Joseph Huber
8a181f43da [OpenMP][Obvious] Fix incompatbile function prototype causing failures
Summary:
This function needs `void` as the arguments to be ABI compatbile with
what is actually defined. This is enforced when doing CUDA separable
linking of the runtime.
2023-10-25 10:44:07 -05:00
Joseph Huber
84d8ace51a [OpenMP][Obvious] Fix function prototype when used in C mode
Summary:
The `llvm_omp_target_dynamic_shared_alloc` prototype in `omp.h`
accidentally left the void argument unspecified. This created unintended
code when called from the C language, causing some `nvlink` failures in
certain scenarios.
2023-10-25 09:35:23 -05:00
Ilya Leoshkevich
f7fc98a1cf
[OpenMP][Archer] Do not check for column numbers in backtraces (#70075)
TSan can show only line numbers on some platforms, e.g., SystemZ. Skip
checking the column numbers; line numbers should be enough to verify
that race detection is working.
2023-10-25 13:22:24 +02:00
Ilya Leoshkevich
77c2b623ca
[OpenMP][Tests] Sync struct DEP with the runtime (#69982)
struct DEP defined in multiple testcases must correspond to runtime's
struct kmp_depend_info. The former defines flags as int, and the latter
as kmp_uint8_t. This discrepancy goes unnoticed on little-endian
systems, but breaks big-endian ones.

Make flags in struct DEP unsigned char.
2023-10-24 19:40:08 +02:00
Ilya Leoshkevich
34459b72da
[OpenMP] Provide big-endian bitfield definitions (#69995)
structs kmp_depend_info.flags and kmp_tasking_flags contain bitfields,
which overlay integer flag values. The current bitfield definitions
target little-endian machines. On big-endian machines bitfields are laid
out in the opposite order, so the current definitions do not work there.

There are two ways to fix this: either provide big-endian bitfield
definitions, or bit-swap integer flag values. Go with the former, since
it's localized to one place and therefore is more maintainable.
2023-10-24 19:39:50 +02:00
Jon Chesterfield
840d0b7e03
[amdgpu] D2D memcpy via streams and HSA (#69977)
hsa_amd_memory_async_copy can handle device to device copies if passed
the corresponding parameters.

No functional change - currently D2D copy goes through a fallback in
libomptarget that stages through a host malloc, after this it goes
directly through HSA.

Works under exactly the situations that HSA works. Verified locally on a
performance benchmark. Hoping to attract further testing from internal
developers after it lands.
2023-10-24 00:05:04 +01:00
Johannes Doerfert
86bb713142 [OpenMP][FIX] Enlarge thread state array, improve test and add second 2023-10-22 17:47:00 -07:00
Johannes Doerfert
9f3b06d8be [OpenMP][FIX] Fix memset oversight to partially unblock test
The tests "unoptimized" version is still broken, disabled for now.
2023-10-22 14:29:11 -07:00
Johannes Doerfert
f3ff0a67be [OpenMP][FIX] Ensure test runs correct with (at least) 2 threads 2023-10-22 13:22:36 -07:00
Johannes Doerfert
87dac9f168 [OpenMP] Rewrite test to check the correct (CPU) result
The test initially showed we do no crash but compute the wrong GPU
result, now we show the CPU result is correct and disable GPU testing.
2023-10-21 14:55:15 -07:00
Johannes Doerfert
d3921e4670
[OpenMP] Basic BumpAllocator for (AMD)GPUs (#69806)
The patch contains a basic BumpAllocator for (AMD)GPUs to allow us to
run more tests. The allocator implements `malloc`, both internally and
externally, while we continue to default to the NVIDIA `malloc` when we
target NVIDIA GPUs. Once we have smarter or customizable allocators we
should consider this choice, for now, this allocator is better than
none. It traps if it is out of memory, making it easy to debug. Heap
size is configured via `LIBOMPTARGET_HEAP_SIZE` and defaults to 512MB.
It allows to track allocation statistics via
`LIBOMPTARGET_DEVICE_RTL_DEBUG=8` (together with
`-fopenmp-target-debug=8`). Two tests were added, and one was enabled.

This is the next step towards fixing
 https://github.com/llvm/llvm-project/issues/66708
2023-10-21 14:49:30 -07:00
Johannes Doerfert
d571af7f62 [OpenMP][FIX] Ensure thread states do not crash on the GPU
The nested parallelism causes thread states which still do not properly
work but at least don't crash anymore.
2023-10-21 14:43:09 -07:00
Johannes Doerfert
1cea309b7e [OpenMP][NFC] Move DebugKind to make it reusable from the host 2023-10-20 19:28:09 -07:00
Joseph Huber
34a3fb9f62
[Libomptarget][NFC] Remove use of VLA in the AMDGPU plugin (#69761)
Summary:
We should not rely on a VLA in C++ for the handling of this string. The
size is a true runtime value so we cannot rely on constexpr handling. We
simply use a small vector, whose default size is most likely large
enough to handle whatever size gets output within the stack, but is safe
in cases where it is not.
2023-10-20 16:02:51 -04:00
Michael Klemm
f93a697e47
[libomptarget][OpenMP] Initial implementation of omp_target_memset() and omp_target_memset_async() (#68706)
Implement a slow-path version of omp_target_memset*() 

There is a TODO to implement a fast path that uses an on-device
kernel instead of the host-based memory fill operation.  This may
require some additional plumbing to have kernels in libomptarget.so
2023-10-19 15:29:36 +02:00
Joseph Huber
970e7456e0
[Libomptarget] Add a test for the libc implementation of assert (#69518)
Summary:
The `libcgpu.a` file provides its own implementation of `__assert_fail`.
This adds a test to make sure it's usable in OpenMP offloading as
expected. Currently this requires linking `libcgpu.a` before the OpenMP
device RTL however. We also disable the test on the CPU as the format of
the string will be different.
2023-10-19 08:55:45 -04:00
Joseph Huber
b69081e324
Attributes (#69358)
- [Libomptarget] Make the references to 'malloc' and 'free' weak.
- [Libomptarget][NFC] Use C++ style attributes instead
2023-10-18 12:52:43 -04:00
Joseph Huber
1e5fe67e70
[Libomptarget] Make the references to 'malloc' and 'free' weak. (#69356)
Summary:
We use `malloc` internally in the DeviceRTL to handle data
globalization. If this is undefined it will map to the Nvidia
implementation of `malloc` for NVPTX and return `nullptr` for AMDGPU.
This is somewhat problematic, because when using this as a shared
library it causes us to always extract the GPU libc implementation,
which uses RPC and thus requires an RPC server. Making this `weak`
allows us to implement this internally without worrying about binding to
the GPU `libc` implementation.
2023-10-18 12:50:23 -04:00
Jon Chesterfield
7ac516a119 [amdgpu] Disable openmp test that is blocking CI after changing hardware, need to diagnose memory fault 2023-10-16 13:59:49 +01:00
Kazu Hirata
18d199116f Stop including llvm/ADT/STLFunctionalExtras.h (NFC)
These source files do not use function_ref.
2023-10-13 20:50:58 -07:00
JP Lehr
b2a67255be [OpenMP] Disable flaky libomptarget AMDGPU test
We observe intermittent failures of that test and need some time to
investigate. Hence, for now, we disable it.
2023-10-10 13:09:29 -05:00
Joseph Huber
4e9054d391 [Libomptarget] Fix lookup of the libcgpu.a library
Summary:
The `libcgpu.a` library was added to support certain libc functions. A
recent patch made us pass its location directly on the commandline,
however it used `find_library`. This doesn't work because the ordering
of CMake might run `fine_library` before it builds the library we're
trying to find. This patch changes this to just use the destimation we
know it will end up in and checks it manually.
2023-10-05 10:48:56 -05:00
Joseph Huber
75e648031c [Libomptarget] Disable AMDGPU complex math test after recent patch
Summary:
The recent patch added `-nogpulib` to make these tests only pick up what
was intentionally put into them. This had the effect of removing the
dependency on the ROCm device libs which are needed for math. This test
disables the complex math test, which is the only one that needed it,
for the time being. In the future we will implement these and provide it
via the GPU `libm` and pass it in the same way as the GPU `libc`.
2023-10-04 15:24:43 -05:00
Joseph Huber
7282975057
[Libomptarget] Explicitly pass the OpenMP device libraries to tests (#68225)
Summary:
We have tests that depend on two static libraries
`libomptarget.devicertl.a` and `libcgpu.a`. These are currently
implicitly picked up and searched through the standard path. This patch
changes that to pass `-nogpulib` to disable implicit runtime path
searches. We then explicitly passed the built libraries to the
compilations so that we know exactly which libraries are being used.

Depends on: https://github.com/llvm/llvm-project/pull/68220

Fixes https://github.com/llvm/llvm-project/issues/68141
2023-10-04 14:14:30 -05:00
Joseph Huber
2d4d8c8f97
[Libomptarget] Make the DeviceRTL configuration globals weak (#68220)
This patch applies weak linkage to the config globals by the name
`__omp_rtl...`. This is because when passing `-nogpulib` we will not
link in or create these globals. This allows the OpenMP device RTL to be
self contained without requiring the additional definitions from the
`clang` compiler. In the standard case, this should not affect the
current behavior, this is because the strong definition coming from the
compiler should always override the weak definition we default to here.
In the case that these are not defined by the compiler, these will
remain weak. This will impact optimizations somewhat, but the previous
behavior was that it would not link so that is an improvement.
    
Depends on: https://github.com/llvm/llvm-project/pull/68215
2023-10-04 14:14:13 -05:00
Joseph Huber
49d8a559d3
[LinkerWrapper] Fix resolution of weak symbols during LTO (#68215)
Summary:
Weak symbols are supposed to have the semantics that they can be
overriden by a strong (i.e. global) definition. This wasn't being
respected by the LTO pass because we simply used the first definition
that was available. This patch fixes that logic by doing a first pass
over the symbols to check for strong resolutions that could override a
weak one.

A lot of fake linker logic is ending up in the linker wrapper. If there
were an option to handle this in `lld` it would be a lot cleaner, but
unfortunately supporting NVPTX is a big restriction as their binaries
require the `nvlink` tool.
2023-10-04 14:13:52 -05:00
agozillon
1482106c99
[Flang][OpenMP][MLIR] Remove deletion of unused declare target global after use replacement (#67762)
At the moment, for device a reference pointer is generated in place of
the original declare target global value, this reference pointer is the
pointer that actually receives the data. In Clang the original global
value isn't generated for device, just the reference pointer.

Unfortunately for Flang/MLIR this is currently not the case, as the
declare target attribute is processed after the creation of the global
so we end up with a dead global on device effectively after rewriting
its uses to the new device reference pointer.

It appears I was a little overzealous with the deletion of the declare
target globals for device. The current method breaks in-cases where the
same declare target global is used across two target regions (added a
runtime reproduced in the patch). As it'll effectively delete it before
the second target gets a chance to be written to LLVM IR and have it's
uses rewritten .

I'd like to remove this deletion as the dead global isn't breaking any
code and will likely be removed in later dead code elimination passes,
perhaps a little too heavy handed with the original approach.
2023-10-03 15:21:27 +02:00
Leandro Lupori
5833a9e99a
[OpenMP] Fix -Wc++98-compat-extra-semi warning (NFC) (#68022)
Compiling OpenMP with LLVM 16 triggers the following warning:
warning: extra ';' outside of a function is incompatible with C++98
2023-10-02 16:43:02 -03:00
Shilei Tian
103bb69c04
[OpenMP] Fix a potential memory buffer overflow (#67252)
#67167 reports a potential memory overflow caused by the wrong size
passed to the function `memcpy_s`. This patch fixes it.

Fix #67167.
2023-09-29 12:41:32 -04:00
Joseph Huber
183a1b1e38 [OpenMP] Enable the 'libc/malloc.c' test on NVPTX
Summary:
Previously this test hanged indefinitely on NVPTX. This was due to an
issue fixed previously where we would wait indefinitely inside the CUDA
runtime waiting for the kernel to complete if it was blocked on the RPC
server. This patch enables this test again now that it can run without
deadlocking, at least on CUDA 12.2.
2023-09-28 14:41:35 -05:00
Joseph Huber
0f88be77ea
[Libomptarget] Fix Nvidia offloading hanging on dataRetrieve using RPC (#66817)
Summary:
The RPC server is responsible for providing host services from the GPU.
Generally, the client running on the GPU will spin in place until the
host checks the server. Inside the runtime, we elected to have the user
thread do this checking while it would be otherwise waiting for the
kernel to finish. However, for Nvidia this caused problems when
offloading to a target region that requires a copy back.

This is caused by the implementation of `dataRetrieve` on Nvidia. We
initialize an asynchronous copy-back on the same stream that the kernel
is running on. This creates an implicit sync on the kernel to finish
before we issue the D2H copy, which we then wait on. This implicit sync
happens inside of the CUDA runtime. This is problematic when running the
RPC server because we need someone to check the RPC server. If no one
checks the RPC server then the kernel will never finish, meaning that
the memcpy will never be issued and the program hangs. This patch adds
an explicit check for unfinished work on the stream and waits for it to
complete.
2023-09-26 16:03:34 -05:00
Joseph Huber
791b279924
[libc] Change the puts implementation on the GPU (#67189)
Summary:
Normally, the implementation of `puts` simply writes a second newline
charcter after printing the first string. However, because the GPU does
everything in batches of the SIMT group size, this will end up with very
poor output where you get the strings printed and then 1-64 newline
characters all in a row. Optimizations like to turn `printf` calls into
`puts` so it's a good idea to make this produce the expected output.

The least invasive way I could do this was to add a new opcode. It's a
little bloated, but it avoids an unneccessary and slow send operation to
configure this.
2023-09-25 11:17:22 -05:00
Andrew Gozillon
76916669b9 [MLIR][OpenMP] Initial Lowering of Declare Target for Data
This patch adds initial lowering for DeclareTargetAttr on
GlobalOp's utilising registerTargetGlobalVariable
and getAddrOfDeclareTargetVar from the
OMPIRBuilder.

It also adds initial processing of declare target map
operands, populating the combinedInfo that the
OMPIRBuilder requires to generate kernels and
it's kernel argument structure.

The combination of these additions allows simple mapping
of declare target globals to Target regions, as such a simple
runtime test showcasing this and testing it has been added.

The patch currently does not factor in filtering
based on device_type clauses (e.g. no emission of
globals for device if host specified), this will come in
a future iteration. And for the moment it's only been
tested with 1-D arrays and basic fortran data types,
more complex types (such as user defined derived
types from Fortran, allocatables or Fortran pointers)
may need further work.

reviewers: kiranchandramohan, skatrak

Differential Revision: https://reviews.llvm.org/D149368
2023-09-20 13:31:15 -05:00
Kazushi Marukawa
7b8130c2c3
[OpenMP][VE] Limit the number of threads to create (#66729)
VE supports up to 64 threads per a VE process. So, we limit the number
of threads defined by KMP_MAX_NTH. We also modify the __kmp_sys_max_nth
initialization to use KMP_MAX_NTH as a limit.
2023-09-20 17:44:24 +09:00
Arthur Eubanks
0a1aa6cda2
[NFC][CodeGen] Change CodeGenOpt::Level/CodeGenFileType into enum classes (#66295)
This will make it easy for callers to see issues with and fix up calls
to createTargetMachine after a future change to the params of
TargetMachine.

This matches other nearby enums.

For downstream users, this should be a fairly straightforward
replacement,
e.g. s/CodeGenOpt::Aggressive/CodeGenOptLevel::Aggressive
or s/CGFT_/CodeGenFileType::
2023-09-14 14:10:14 -07:00
Ye Luo
8c2da6bb7f
[libomptarget] document ActionFunctions in the amdgpu plugin. (#66397) 2023-09-14 12:18:49 -05:00
Terry Wilmarth
102d864719 Fix /tmp approach, and add environment variable method as third fallback during library registration
The /tmp fallback for /dev/shm did not write to a fixed filename, so multiple instances of the runtime would not be able to detect each other. Now, we create the /tmp file in much the same way as the /dev/shm file was created, since mkstemp approach would not work to create a file that other instances of the runtime would detect. Also, add the environment variable method as a third fallback to /dev/shm and /tmp for library registration, as some systems do not have either. Also, add ability to fallback to a subsequent method should a failure occur during any part of the registration process. When unregistering, it is assumed that the method chosen during registration should work, so errors at that point are ignored. This also avoids a problem with multiple threads trying to unregister the library.
2023-09-13 13:50:49 -05:00
Rodrigo Ceccato de Freitas
f94b6f3396
[OpenMP] Remove optimization skipping reduction struct initialization (#65697)
This commit removes an optimization that skips the initialization of the
reduction struct if the number of threads in a team is 1. This
optimization
caused a bug with Hidden Helper Threads. When the task group is
initially
initialized by the master thread but a Hidden Helper Thread executes a
target
nowait region, it requires the reduction struct initialization to
properly
accumulate the data.

This commit also adds a LIT test for issue #57522 to ensure that the
issue is
properly addressed and that the optimization removal does not introduce
any
regressions.

Fixes: #57522
2023-09-12 16:09:16 -05:00
Kazushi Marukawa
e8679b93da
[OpenMP][test][VE] Limit the number of AFFINITY_MAX_CPUS for VE (#65872)
Limit the number of AFFINITY_MAX_CPUS for VE because VE's
sched_getaffinity doesn't work correctly with large sized mask buffer.
2023-09-12 23:45:56 +09:00
Saiyedul Islam
466a8149b3
Revert "[AMDGPU] Make default AMDHSA Code Object Version to be 5 (#65410)" (#66060)
This reverts commit 0a8d17e79b.
2023-09-12 15:13:59 +05:30
Saiyedul Islam
0a8d17e79b
[AMDGPU] Make default AMDHSA Code Object Version to be 5 (#65410)
Also update LIT tests and docs.
For more details, see
https://llvm.org/docs/AMDGPUUsage.html#code-object-v5-metadata

Reviewed By: arsenm, jhuber6

Github PR: #65410

Differential Revision: https://reviews.llvm.org/D129818
2023-09-12 13:53:31 +05:30
Ye Luo
6c8248e38b [libomptarget] Rename AMDGPUSignalTy member Signal to HSASignal. 2023-09-11 22:42:34 -05:00
Ye Luo
08352b99a4 [libomptarget][NFC] update comments. 2023-09-11 22:30:25 -05:00
Michael Halkenhäuser
12ac0f6ede
[OpenMP][DeviceRTL][AMDGPU] Add missing libomptarget build targets (#65964)
Extend CMake variable `all_amdgpu_architectures` by `gfx941` and
`gfx942`.
2023-09-11 15:43:51 +02:00
Michael Halkenhäuser
53602e6193
[OpenMP][OMPT] Fix device identifier collision during callbacks (#65595)
Fixes: https://github.com/llvm/llvm-project/issues/65104
When a user assigns devices to target regions it may happen that
different identifiers will map onto the same id within different
plugins. This will lead to situations where callbacks will become much
harder to read, as ambiguous identifiers are reported.

We fix this by collecting the index-offset upon general RTL
initialization. Which in turn, allows to calculate the unique,
user-observable device id.
2023-09-11 12:11:44 +02:00
Ye Luo
c9733b8a9e
[libomptarget][NFC]Rename targetDataMapper to targetDat in interface.cpp (#65915)
In omptarget.cpp, there is a targetDataMapper which does mapper related
operations.
In interface.cpp, targetDataMapper function template is simply needed
for handling TargetAsyncInfoTy. Thus it is better to name it as simple
as targetData similar to targetKernel.
2023-09-10 21:08:11 -05:00
Tianlan Zhou
057564fec5
Fix some typos in comments: evalute -> evaluate (NFC) (#65906) 2023-09-11 04:11:06 +08:00
Kazushi Marukawa
f8efa65ca5
[OpenMP][test][VE] Change to use VE_LD_LIBRARY_PATH for VE (#65869)
Change to use VE_LD_LIBRARY_PATH for VE instead of LD_LIBRARY_PATH. The
VE is connected to the host, and compiled test programs for VE is
invoked on the host and transferred to the VE. If programs are compiled
for the host, we use LD_LIBRARY_PATH. Otherwise, we use
VE_LD_LIBRARY_PATH.
2023-09-10 12:07:16 +09:00
Kazushi (Jam) Marukawa
18b6724355 [OpenMP][VE] Support OpenMP runtime on VE
Support OpenMP runtime library on VE.  This patch makes OpenMP compilable
for VE architecture.  Almost all tests run correctly on VE.

Reviewed By: tianshilei1992

Differential Revision: https://reviews.llvm.org/D159401
2023-09-10 08:29:53 +09:00
Shilei Tian
600e0cde3e [NFC][OpenMP] clang-format openmp/libomptarget/src/interface.cpp 2023-09-08 23:01:39 -04:00
Brad Smith
7e31b45d6a
[OpenMP] Use the more appropriate function to retrieve the thread id on OpenBSD (#65553)
Use the getthrid() function instead of a syscall.
2023-09-07 21:05:25 -04:00
Shilei Tian
010a5a737b [OpenMP] Fix build issue with libomp when OMPT is disabled 2023-09-06 23:40:24 -04:00
Brad Smith
fd4c80dec9
[OpenMP] Fix gettid warnings on DragonFly (#65549)
Define __kmp_gettid() as appropriate for DragonFly.
2023-09-06 20:21:11 -04:00
Shilei Tian
99d67fb9aa
[OpenMP] Align up the size when calling aligned_alloc (#65525)
Based on https://en.cppreference.com/w/c/memory/aligned_alloc, the
`size` is supposed
to be a multiple of `alignment`, and it is implementation defined
behavior if not.
We have a non-conformant use in `kmp_barrier.h` when allocating
distribute barrier.
The size of the barrier is 576 and the alignment is `4*CACHE_LINE`,
which is 256
on most systems. Apparently it works perfectly fine for Linux and
Intel-based Mac,
but not for Apple Silicon based Mac.

Fix #63194.
2023-09-06 16:28:07 -04:00
Joseph Huber
460840c09d
[OpenMP] Support 'omp_get_num_procs' on the device (#65501)
Summary:
The `omp_get_num_procs()` function should return the amount of
parallelism availible. On the GPU, this was not defined. We have elected
to define this function as the maximum amount of wavefronts / warps that
can be simultaneously resident on the device. For AMDGPU this is the
number of CUs multiplied byth CU's per wave. For NVPTX this is the
maximum threads per SM divided by the warp size and multiplied by the
number of SMs.
2023-09-06 13:45:05 -05:00
Shilei Tian
ff5c7261ef [OpenMP] Fix a wrong assertion in __kmp_get_global_thread_id
The function assumes that `__kmp_gtid_get_specific` always returns a valid gtid.
That is not always true, because when creating the key for thread-specific data,
a destructor is assigned. The dtor will be called at thread exit. However, before
the dtor is called, the thread-specific data will be reset to NULL first
(https://pubs.opengroup.org/onlinepubs/009695399/functions/pthread_key_create.html):

> At thread exit, if a key value has a non-NULL destructor pointer, and the thread
> has a non-NULL value associated with that key, the value of the key is set to NULL.

This will lead to that `__kmp_gtid_get_specific` returns `KMP_GTID_DNE`.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D159369
2023-09-06 12:21:43 -04:00
Shilei Tian
518b08c193
[OpenMP] Fix issue of indirect function call in __kmpc_fork_call_if (#65436)
The outlined function is typically invoked by using
`__kmp_invoke_microtask`,
which is written in asm. D138495 introduces a new interface function for
parallel
region for OpenMPIRBuilder, where the outlined function is called via
the function
pointer. For some reason, it works perfectly well on x86 and x86-64
system, but
doesn't work on Apple Silicon. The 3rd argument in the callee is always
`nullptr`, even
if it is not in caller. It appears `x2` always contains `0x0`. This
patch adopts
the typical method to invoke the function pointer. It works on my M2
Ultra Mac.

Fix #63194.
2023-09-06 12:17:45 -04:00
Fangrui Song
678e3ee123 [lldb] Fix duplicate word typos; NFC
Those fixes were taken from https://reviews.llvm.org/D137338
2023-09-01 21:32:24 -07:00
Ethan Luis McDonough
2b6ba8c735
[openmp] Tighten flang detection in offloading test
This patch ensures that the locally built version of flang when building in-tree.  `find_program` sometimes used the wrong executable if a different copy of flang was installed.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D159161
2023-09-01 13:59:18 -05:00
Martin Storsjö
c2019c416c [OpenMP] [test] Fix target_thread_limit.cpp to not assume 4 or more cores
Previously, the test ran a section with

    #pragma omp target thread_limit(4)

and expected it to execute exactly 4 times, even though it would
in practice execute min(cores, 4) times.

Increment a counter and check that it executed 1-4 times.

Differential Revision: https://reviews.llvm.org/D159311
2023-09-01 21:16:58 +03:00
Jan Leyonberg
a0e3418bc8 [flang][OpenMP] Add fortran test with basic target region
This patch adds a test that uses a target region to set a scalar value. It also
adds rules in lit.cfg to handle fortran testing.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D159216
2023-09-01 09:26:36 -04:00
Shilei Tian
35fdf8d703 [OpenMP] Fix a segment fault in __kmp_get_global_thread_id
In `__kmp_get_global_thread_id`, if the gtid mode is 1, after getting the gtid
from TLS, it will store the gtid value to the thread stack maintained in the thread
descriptor. However, `__kmp_get_global_thread_id` can be called when the library
is destructed, after the corresponding thread info has been release. This will
cause a segment fault. This can happen on an Intel-based Mac.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D159324
2023-08-31 21:15:28 -04:00
Joseph Huber
ccb1d183c3 [OpenMP][Docs] Remove old entry saying static libraries are unsupported
Summary:
Static libraries have been supported since LLVM 15.0, this entry is
misleading and should be removed.
2023-08-30 06:48:57 -05:00
Ethan Luis McDonough
9e3d59e4c2
[openmp] Fix flang detection for offloading test
This patch fixes the flang detection in the openmp fortran offloading test.

Reviewed By: jsjodin

Differential Revision: https://reviews.llvm.org/D158546
2023-08-29 16:31:03 -05:00
Martin Storsjö
81ecc887aa [OpenMP] Export __kmpc_set_thread_limit on Windows
This fixes the new test target/target_thread_limit.cpp on
Windows, which was added recently in
08bbff4aad /
https://reviews.llvm.org/D152054.

Differential Revision: https://reviews.llvm.org/D159070
2023-08-29 23:22:21 +03:00
Saiyedul Islam
f616c3eeb4
[OpenMP][DeviceRTL][AMDGPU] Support code object version 5
Update DeviceRTL and the AMDGPU plugin to support code
object version 5. Default is code object version 4.

CodeGen for __builtin_amdgpu_workgroup_size generates code
for cov4 as well as cov5 if -mcode-object-version=none
is specified. DeviceRTL compilation passes this argument
via Xclang option to generate abi-agnostic code.

Generated code for the above builtin uses a clang
control constant "llvm.amdgcn.abi.version" to branch on
the abi version, which is available during linking of
user's OpenMP code. Load of this constant gets eliminated
during linking.

AMDGPU plugin queries the ELF for code object version
and then prepares various implicitargs accordingly.

Differential Revision: https://reviews.llvm.org/D139730

Reviewed By: jhuber6, yaxunl
2023-08-29 06:35:44 -05:00
Anton Rydahl
c1b5674fbb [OpenMP] Change OpenMP default version in documentation and help text for -fopenmp-version
As discussed on the weekly OpenMP meeting on the second of August 2023, the default version
in the OpenMP documentation shoud be changed from OpenMP 5.0 to 5.1.

Differential Revision: https://reviews.llvm.org/D156901
2023-08-28 19:05:55 -07:00
Doru Bercea
2102ed0b91 Fix for openmp tests honoring thread_limit.
Diff: https://reviews.llvm.org/D159001
2023-08-28 13:04:17 -04:00
Doru Bercea
5fe6f56563 Disable intermittently failing OpenMP test.
Diff: https://reviews.llvm.org/D159003
2023-08-28 12:56:22 -04:00
Doru Bercea
41bb5ef11f Add passing test for issue 64797. 2023-08-28 09:55:56 -04:00
Joachim Jenke
1880d8f5c1 [OpenMP][Archer] Add support for taskwait depend
At the moment Archer segfaults due to a null-pointer access, if an application
uses taskwait with depend clause as used in the two new tests.
This patch cleans up the task_schedule function, moves semantic blocks into
functions and replaces the if blocks by a single switch statement. The switch
statement will warn, when new enum values are added in OMPT and makes clear
what code is executed for the different cases.

With free-agent tasks coming up in OpenMP 6.0, we should expect more
null-pointer task_data, so additional null-pointer checks were added.
We also cannot rely on having an implicit task on the stack, so the
BarrierIndex is stored during task creation.

Differential Revision: https://reviews.llvm.org/D158072
2023-08-28 09:43:24 +02:00
Joachim Jenke
cec855af3e [OpenMP][OMPT] Fix ompt_get_task_memory implementation
Since td_allow_completion_event is a member of the taskdata struct, not all
firstprivate/shared variables are stored at the end of the task memory
allocation. Simply report the whole allocation instead.

Furthermore, the function should always return 0 since in no case there is
another block to report.

Differential Review: https://reviews.llvm.org/D158080
2023-08-28 09:19:52 +02:00
Sandeep Kosuri
08bbff4aad [OpenMP] Codegen support for thread_limit on target directive for host
offloading

- This patch adds support for thread_limit clause on target directive according to OpenMP 51 [2.14.5]
- The idea is to create an outer task for target region, when there is a thread_limit clause, and manipulate the thread_limit of task instead. This way, thread_limit will be applied to all the relevant constructs enclosed by the target region.

Differential Revision: https://reviews.llvm.org/D152054
2023-08-26 22:18:49 -05:00
Shilei Tian
fbcce33706 [OpenMP] Honor thread_limit value when choosing grid size
D152014 introduced an optimization that favors more smaller blocks over
fewer larger blocks, even if user sets `thread_limit` explicitly. This patch changes
the behavior to honor user value.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D158802
2023-08-26 22:17:49 -04:00
Joseph Huber
aa78e94b0b [Libomptarget] Support mapping indirect host calls to device functions
The changes in D157738 allowed for us to emit stub globals on the device
in the offloading entry section. These globals contain the addresses of
device functions and allow us to map host functions to their
corresponding device equivalent. This patch provides the initial support
required to build a table on the device to lookup the associated value.
This is done by finding these entries and creating a global table on the
device that can be searched with a simple binary search.

This requires an allocation, which supposedly should be automatically
freed at plugin shutdown. This includes a basic test which looks up device
pointers via a host pointer using the added function. This will need to be built
upon to provide full support for these calls in the runtime.

To support reverse offloading it would also be useful to provide a reverse table
that allows us to get host functions from device stubs.

Depends on D157738

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D157918
2023-08-25 18:51:56 -05:00
Michael Halkenhaeuser
9300b6de3c [OpenMP][OMPT] Add OMPT support for generic-elf-64bit plugin
Fixes: https://github.com/llvm/llvm-project/issues/64487
Connect OMPT during plugin initialization and enable corresponding tests.
Avoid linking OMPT when corresponding support is disabled.

Depends on D158542

Reviewed By: tianshilei1992

Differential Revision: https://reviews.llvm.org/D158543
2023-08-25 13:53:11 -04:00
Johannes Doerfert
a01398156a [OpenMPOpt][FIX] Ensure to propagate information about parallel regions
Before, we checked the parallel region only once, and ignored updates in
the KernelInfo for the parallel region that happened later. This caused
us to think nested parallel sections are not present even if they are,
among other things.
2023-08-25 10:46:56 -07:00
Michael Halkenhaeuser
275259eb9a [OpenMP] Add getComputeUnitKind to generic-elf-64bit plugin
Make the generic-plugin report a corresponding CU kind -- instead of 'unknown'.

Reviewed By: tianshilei1992

Differential Revision: https://reviews.llvm.org/D158542
2023-08-25 07:14:43 -04:00
Johannes Doerfert
d2c37fc4f7 [Attributor][FIX] Avoid dangling stack references in map
The old code did not account for new queries during an update, which
caused us to leave stack RQIs in the map. We are now explicit about
temporary vs non-temporary RQIs.

Fixes: https://github.com/llvm/llvm-project/issues/64959
2023-08-24 16:28:10 -07:00
Johannes Doerfert
3611300a32 [OpenMP][FIX] Update tests after D157725 2023-08-24 16:28:10 -07:00
Aaron Jarmusch
1ff0bdb86d [OpenMP] Fix Slice Duplicate in Profiler
Fixed the broken commit - 6579021f02
Fix for the AMDGPU buildbot reported by @jplehr.
2023-08-24 20:52:15 +00:00
Aaron Jarmusch
6579021f02 [OpenMP] Fix Slice Duplicate in Profiler
Using LIBOMPTARGET_PROFILER, duplicates are created from timing both Kernel functions and Data update functions.
I commented out the duplicate timescope and left them in the targetkernel and the targetdataupdate functions. This
way the timescope calls will be closer to the launching of the kernel and the data moving.

Reviewed By: jdoerfert, tianshilei1992

Differential Revision: https://reviews.llvm.org/D157725
2023-08-24 19:41:37 +00:00
Johannes Doerfert
908ae84351 [OpenMP] Avoid assumptions at the end of a kernel
When we used to treat the kernel end as as aligned barrier, assertions
at the end made sense. Now, they actually cause problems as the "writes"
are not ordered with regards to reads within the kernel. We can simply
get rid of them.
2023-08-23 16:11:43 -07:00
Johannes Doerfert
80906ce48d [OpenMP] Disable early vectorization of loads/stores in the runtime
We are having a hard time optimizing some vectorized loads/stores later
on which causes this optimization to degrade performance.

Differential Revision: https://reviews.llvm.org/D158656
2023-08-23 15:14:14 -07:00
Johannes Doerfert
382b97554d [OpenMP] Force the parallel abstraction to be inlined
This is good for performance and compile time and the indirection (+
switch statements) is nothing that needs to be preserved.
2023-08-23 11:48:18 -07:00
Johannes Doerfert
81a02b0767 [Attributor][NFC] Precommit test 2023-08-23 11:48:18 -07:00
Johannes Doerfert
7481b465ae [OpenMP] Use default grid value for static grid size
If the user did not provide any static clause to override the grid size,
we assume the default grid size as upper bound and use it to improve
code generation through vendor specific attributes.

Fixes: https://github.com/llvm/llvm-project/issues/64816

Differential Revision: https://reviews.llvm.org/D158382
2023-08-23 11:12:03 -07:00
Johannes Doerfert
c5488c8dcc [OpenMP] Properly set static thread limit (w/o analysis)
We used to have two separate implementations to derive the number of
threads used in a target region. This lead us to sometimes miss out on
user provided thread bounds (num_threads, or thread_limit) when we
looked for "constant default values". If we might miss out on the
presence of those bounds, we cannot set the thread_limit statically
since the runtime will try to honor user input rather than cap it at the
"preferred default". This patch replaces the secondary implementation
with the primary in a mode that will not emit code but just look for the
presence, and potentially upper bounds, of thread limiting clauses.

The runtime test would not pass without this rewrite as we missed some
clauses, set the static limit on the device to the preferred value, but
then violated that value at runtime.

Fixes: https://github.com/llvm/llvm-project/issues/64845

Differential Revision: https://reviews.llvm.org/D158381
2023-08-23 11:12:03 -07:00
Vadim Paretsky
6789dda762 [OpenMP] make small memory allocations in loop collapse code on the stack
A few places in the loop collapse support code make small dynamic allocations
that introduce a noticeable performance overhead when made on the heap.
This change moves allocations up to 32 bytes to the stack instead of the heap.

Differential Revision: https://reviews.llvm.org/D158220
2023-08-23 10:37:45 -07:00
Jonathan Peyton
99f5969565 [OpenMP] Let primary thread gather topology info for each worker thread
This change has the primary thread create each thread's initial mask
and topology information so it is available immediately after
forking. The setting of mask/topology information is decoupled from the
actual binding. Also add this setting of topology information inside the
__kmp_partition_places mechanism for OMP_PLACES+OMP_PROC_BIND.

Without this, there could be a timing window after the primary
thread signals the workers to fork where worker threads have not yet
established their affinity mask or topology information.

Each worker thread will then bind to the location the primary thread
sets.

Differential Revision: https://reviews.llvm.org/D156727
2023-08-22 15:56:51 -05:00
Michael Halkenhaeuser
57f0bdc8fb [OpenMP][OMPT] Fix target enter data callback ordering & reported device num
This patch fixes: https://github.com/llvm/llvm-project/issues/64738
We observed multiple issues, primarily that the `DeviceId` was reported as -1
in certain scenarios. The reason for this is simply that the device is not
initialized at that point. Hence, we need to move the RAII object creation just
after the `checkDeviceAndCtors`, closer to the actual call we want to observe.

This also solves an odering issue where one `target enter data` callback would
be executed before the `Init` callback.
Additionally, this change will also fix that the callbacks corresponding to
`enter / exit data` and `update` in conjunction with `nowait` would not result
in the emission of an OMPT callback.

Added a testcase to cover initialized device number and `omp target` constructs.

Reviewed By: dhruvachak

Differential Revision: https://reviews.llvm.org/D157605
2023-08-22 13:12:09 -04:00
Kazu Hirata
11e2975810 Fx typos in documentation 2023-08-18 23:36:04 -07:00
Johannes Doerfert
9c08e76f3e [Attributor] Introduce AAIndirectCallInfo
AAIndirectCallInfo will collect information and specialize indirect call
sites. It is similar to our IndirectCallPromotion but runs as part of
the Attributor (so with assumed callee information). It also expands
more calls and let's the rest of the pipeline figure out what is UB, for
now. We use existing call promotion logic to improve the result,
otherwise we rely on the (implicit) function pointer cast.

This effectively "fixes" #60327 as it will undo the type punning early
enough for the inliner to work with the (now specialized, thus direct)
call.

Fixes: https://github.com/llvm/llvm-project/issues/60327
2023-08-18 16:44:05 -07:00
Terry Wilmarth
f0221fb1d7 [OpenMP] Add option to use different units for blocktime
This change adds the option of using different units for blocktimes specified via the KMP_BLOCKTIME environment variable. The parsing of the environment now recognizes units suffixes: ms and us. If a units suffix is not specified, the default unit is ms. Thus default behavior is still the same, and any previous usage still works the same. Internally, blocktime is now converted to microseconds everywhere, so settings that exceed INT_MAX in microseconds are considered "infinite".

kmp_set/get_blocktime are updated to use the units the user specified with KMP_BLOCKTIME, and if not specified, ms are used.

Added better range checking and inform messages for the two time units. Large values of blocktime for default (ms) case (beyond INT_MAX/1000) are no longer allowed, but will autocorrect with an INFORM message.

The delay for determining ticks per usec was lowered.  It is now 1 million ticks which was calculated as ~450us based on 2.2GHz clock which is pretty typical base clock frequency on X86:
(1e6 Ticks)  /  (2.2e9 Ticks/sec)  *  (1e6 usec/sec)  =  454 usec
Really short benchmarks can be affected by longer delay.

Update KMP_BLOCKTIME docs.

Portions of this commit were authored by Johnny Peyton.

Differential Revision: https://reviews.llvm.org/D157646
2023-08-18 14:01:13 -05:00
Johannes Doerfert
5eb7a427b0 [Attributor][NFC] Precommit tests 2023-08-17 22:42:38 -07:00
Johannes Doerfert
4fcd5f93d6 [OpenMPOpt] Mark more runtime functions as SPMD compatible
Fixes: https://github.com/llvm/llvm-project/issues/64421
2023-08-17 18:33:24 -07:00
Joseph Huber
5717329f1a [Libomptarget] Disable deadlocking bug49334.cpp test on AMDGPU
This test hangs on AMDGPU sporadically, disable it for the time being.

Fixes: https://github.com/llvm/llvm-project/issues/64733

Reviewed By: ronlieb

Differential Revision: https://reviews.llvm.org/D158082
2023-08-16 10:24:00 -05:00
Michael Halkenhaeuser
41f3626f8b [OpenMP][OMPT] Fix reported target pointer for data alloc callback
This patch fixes: https://github.com/llvm/llvm-project/issues/64671
DataOp EMI callbacks would not report the correct target pointer.
This is now alleviated by passing a `void**` into the function which
emits the actual callback, then evaluating that pointer.

Note: Since this is only done after the pointer has been properly
updated, only `endpoint=2` callbacks will show a non-null value.

Reviewed By: dhruvachak, jdoerfert

Differential Revision: https://reviews.llvm.org/D157996
2023-08-16 06:39:10 -04:00
Ye Luo
1c822e1e82 [libomptarget] Avoid unintialized GenericPluginTy::NumDevices 2023-08-13 00:01:50 -05:00
Joseph Huber
06adac8c4e [Libomptarget] Configure the RPC port count from the plugin
This patch allows us to configure the port count to what the specific
card would desire for parallelism. For AMDGPU we need to use the maximum
number of hardware parallelism to avoid deadlocks. For NVPTX we don't
have this problem due to the friendlier scheduler, so we use the number
of warps active on an SM times the number of SMs as a good guess.

Note that the max ports currently is going to be smaller than these
numbers. That will be improved in the future.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D155903
2023-08-11 12:54:47 -05:00
Carlos Eduardo Seo
4dce6d3061 [OpenMP] Disable some offloading/api tests for AArch64
Like for x86_64-linux-gnu, these need to be disabled for aarch64-linux-gnu.

Differential Revision: https://reviews.llvm.org/D156815
2023-08-07 20:26:39 +00:00
Michael Halkenhaeuser
7eba3e58d5 [OpenMP][AMDGPU] Add Envar for controlling HSA busy queue tracking
If the Envar is set to true (default), busy HSA queues will be
actively avoided when assigning a queue to a Stream.

Otherwise, we will initialize a new HSA queue for each requested
Stream, then default to round robin once the set maximum has been
reached.

Reviewed By: jdoerfert, kevinsala

Differential Revision: https://reviews.llvm.org/D156996
2023-08-07 10:48:02 -04:00
Kevin Sala
b8e297d1af [OpenMP][libomptarget] Improve kernel initialization in plugins
This patch modifies the plugins so that the initialization of KernelTy objects
is done in the init method. Part of the initialization was done in the
constructKernelEntry method. Now this method is called constructKernel
and only allocates and constructs a KernelTy object.

This patch prepares the kernel class for the new implementation of device
reductions.

Differential Revision: https://reviews.llvm.org/D156917
2023-08-06 11:53:58 +02:00
Shilei Tian
14d57545b2 [NFC][OpenMP] Fix compile warnings introduced in recent patches 2023-08-05 19:38:45 -04:00
koparasy
73cb01dc8a [OpenMP] Support for OpenMP-Offload Record Replay
Enable record-replay for OpenMP offload kernels.  On recording the initialization
is performed on device initialization by reading env variables. (This is similar to
the way rr used to operate). The primary change takes place in the replay phase
with the replay tool explicitly initializing the record-replay functionality.

Differential Revision: https://reviews.llvm.org/D156174

Fix
2023-08-05 00:46:06 -07:00
Johannes Doerfert
ed16143593 [OpenMP][FIX] Ensure __assert_fail is compatible with the host
Fixes: https://github.com/llvm/llvm-project/issues/64360
2023-08-04 11:36:58 -07:00
Shilei Tian
fcf1a1022a [OMPX] Change thread_dim to block_dim and the original block_dim to grid_dim
There is no `threadDim` in CUDA. Instead, it is `blockDim`. Then the current
`blockDim` is `gridDim` in CUDA.

Reviewed By: jhuber6

Differential Revision: https://reviews.llvm.org/D157051
2023-08-03 21:51:06 -04:00
Joseph Huber
c96cba3aea [Libomptarget] Fix compilation of libomptarget with old GCC
Summary:
Older gcc can't figure out the copy elision and needs an explicit move.
2023-08-03 10:49:35 -05:00
Joseph Huber
46642cc83d [Libomptarget] Remove debug RAII from libomptarget
This feature was supposed to allow you to trace execution inside of
Libomptarget. However, this never really worked properly. The printing
was always reoganized, only worked for single  threads, and pretty much
only told you a handful of things about a runtime library that's an
implementation detail to all users. Despite this, it contributed about
40% of the total filesize of the deviceRTL. This patch simply removes
this functionalit which I think was past due.

Reviewed By: tianshilei1992

Differential Revision: https://reviews.llvm.org/D157001
2023-08-03 09:37:47 -05:00
Kevin Sala
4f46a48aaf [OpenMP][libomptarget] Remove unused virtual functions in GenericKernelTy
The virtual functions getDefaultNumBlocks and getDefaultNumThreads from the kernels are
only forwarding the call to the generic device's ones. This patch removes those two
functions from the kernels (and their derived ones). Now calls are made to the device's
functions directly.

Differential Revision: https://reviews.llvm.org/D156905
2023-08-02 17:18:50 +02:00
Kevin Sala
f7e0208a09 [OpenMP][libomptarget] Add missing field in GenericELF64bit grid values 2023-08-02 17:18:50 +02:00
Shilei Tian
ab15b11c41 [NFC][OMPT] Move pop_macro to the right place to fix compile warnings 2023-08-02 10:27:57 -04:00
Michael Halkenhaeuser
5b19f42b63 [OpenMP][AMDGPU] Single eager resource init + HSA queue utilization tracking
This patch lazily initializes queues/streams/events since their initialization
might come at a cost even if we do not use them.

To further benefit from this, AMDGPU/HSA queue management is moved into the
AMDGPUStreamManager of an AMDGPUDevice. Streams may now use different HSA queues
during their lifetime and identify busy queues.

When a Stream is requested from the resource manager, it will search for and
try to assign an idle queue. During the search for an idle queue the manager
may initialize more queues, up to the set maximum (default: 4).
When no idle queue could be found: resort to round robin selection.

With contributions from Johannes Doerfert <johannes@jdoerfert.de>

Depends on D156245

Reviewed By: kevinsala

Differential Revision: https://reviews.llvm.org/D154523
2023-08-02 08:22:26 -04:00
Johannes Doerfert
e5a3d5ba88 [OpenMP][NFC] Enable more runtime tests and also run them with O3
The test run fine on my AMD GPU machine, we should verify them on others
too and put them into our regular testing. Not testing O1/2/3 is really
bad and not testing all architecturs is similarly problematic.

Differential Revision: https://reviews.llvm.org/D148576
2023-07-31 15:45:53 -07:00
Johannes Doerfert
63684550c4 [OpenMP][NFC] Add offloading tests for the new ompx APIs 2023-07-31 15:45:53 -07:00
Johannes Doerfert
deb0ea3e47 [OpenMP] Add ompx wrappers for __syncthreads
Differential Revision: https://reviews.llvm.org/D156729
2023-07-31 13:44:51 -07:00
Johannes Doerfert
daef6d327a [OpenMP] Introduce ompx.h and 3D wrappers (threadId, threadDim, ...)
The new ompx.h header will give us a place to put extensions. The first
are 3D getters for the common cuda values:
  `{threadId,threadDim,blockId,blockDim}.{x,y,z}`

Differential Revision: https://reviews.llvm.org/D156501
2023-07-31 13:44:51 -07:00
Johannes Doerfert
1f3a28d4e5 [OpenMP][NFC] Reorganize the ompx::mapping layer in the GPU runtime
This change makes the naming more consistent, I hope.
2023-07-31 13:44:51 -07:00