Commit Graph

383 Commits

Author SHA1 Message Date
lntue
bc7a3bd864
[libc][math] Implement powf function correctly rounded to all rounding modes. (#71188)
We compute `pow(x, y)` using the formula
```
  pow(x, y) = x^y = 2^(y * log2(x))
```
We follow similar steps as in `log2f(x)` and `exp2f(x)`, by breaking
down into `hi + mid + lo` parts, in which `hi` parts are computed using
the exponent field directly, `mid` parts will use look-up tables, and
`lo` parts are approximated by polynomials.

We add some speedup for common use-cases:
```
  pow(2, y) = exp2(y)
  pow(10, y) = exp10(y)
  pow(x, 2) = x * x
  pow(x, 1/2) = sqrt(x)
  pow(x, -1/2) = rsqrt(x) - to be added
```
2023-11-06 16:54:25 -05:00
Jon Chesterfield
f0e100a05a
[amdgpu][openmp] Treat missing TIMESTAMP_FREQUENCY as non-fatal (#70987)
If you build with dynamic_hsa, the symbol is known and compilation
succeeds. If you then run with a slightly older libhsa, this argument is
not recognised and an error returned. I'd rather the program runs with a
misleading omp wtime than refuses to run at all.
2023-11-01 22:43:34 +00:00
Joseph Huber
9e390a1408 [libc][Obvious] Fix missing semicolon in AMDGPU loader implementation
Summary:
Title
2023-10-30 14:58:46 -05:00
Jon Chesterfield
896749aa0d
[amdgpu][openmp] Avoiding writing to packet header twice (#70695)
I think it follows from the HSA spec that a write to the first byte is
deemed significant to the GPU in which case writing to the second short
and reading back from it later would be safe. However, the examples for
this all involve an atomic write to the first 32 bits and it seems a
credible risk that the occasional CI errors abound invalid packets have
as their root cause that the firmware notices the early write to
packet->setup and treats that as a sign that the packet is ready to go.

That was overly-paranoid, however in passing noticed the code in libc is
genuinely invalid. The memset writes a zero to the header byte, changing
it from type_invalid (1) to type_vendor (0), at which point the GPU is
free to read the 64 byte packet and interpret it as a vendor packet,
which is probably why libc CI periodically errors about invalid packets.

Also a drive by change to do the atomic store on a uint32_t
consistently. I'm not sure offhand what __atomic_store_n on a uint16_t*
and an int resolves to, seems better to be unambiguous there.
2023-10-30 18:35:52 +00:00
Hans Wennborg
e2fc68c3db Typos: 'maxium', 'minium' 2023-10-23 10:42:28 +02:00
Joseph Huber
a39215768b
[libc] Rework the 'fgets' implementation on the GPU (#69635)
Summary:
The `fgets` function as implemented is not functional currently when
called with multiple threads. This is because we rely on reapeatedly
polling the character to detect EOF. This doesn't work when there are
multiple threads that may with to poll the characters. this patch pulls
out the logic into a standalone RPC call to handle this in a single
operation such that calling it from multiple threads functions as
expected. It also makes it less slow because we no longer make N RPC
calls for N characters.
2023-10-19 17:00:01 -04:00
alfredfo
f350532099
[libc] Fix accidental LIBC_NAMESPACE_clock_freq (#69620)
See-also: https://github.com/llvm/llvm-project/pull/69548
2023-10-19 19:39:02 +02:00
Joseph Huber
ddc30ff802
[libc] Implement the 'ungetc' function on the GPU (#69248)
Summary:
This function follows closely with the pattern of all the other
functions. That is, making a new opcode and forwarding the call to the
host. However, this also required modifying the test somewhat. It seems
that not all `libc` implementations follow the same error rules as are
tested here, and it is not explicit in the standard, so we simply
disable these EOF checks when targeting the GPU.
2023-10-17 13:02:31 -05:00
Mikhail R. Gadelha
714b4c82bb
[libc][NFC] Fix -Wdangling-else when compiling libc with gcc >= 7 (#67833)
Explicit braces were added to fix the "suggest explicit braces to avoid
ambiguous ‘else’" warning since the current solution (switch (0) case 0:
default:) doesn't work since gcc 7 (see
https://github.com/google/googletest/issues/1119)

gcc 13 generates about 5000 of these warnings when building libc without
this patch.
2023-10-04 11:44:42 -04:00
Mikhail R. Gadelha
8fc87f54a8
[libc][NFC] Couple of small warning fixes (#67847)
This patch fixes a couple of warnings when compiling with gcc 13:

* CPP/type_traits_test.cpp: 'apply' overrides a member function but is
not marked 'override'
* UnitTest/LibcTest.cpp:98: control reaches end of non-void function
* MPFRWrapper/MPFRUtils.cpp:75: control reaches end of non-void function
* smoke/FrexpTest.h:92: backslash-newline at end of file
* __support/float_to_string.h:118: comparison of unsigned expression in ‘>= 0’ is always true
* test/src/__support/CPP/bitset_test.cpp:197: comparison of unsigned expression in ‘>= 0’ is always true

---------

Signed-off-by: Mikhail R. Gadelha <mikhail@igalia.com>
2023-10-02 19:29:26 -04:00
Joseph Huber
1a5d3b6cda
[libc] Scan the ports more fairly in the RPC server (#66680)
Summary:
Currently, we use the RPC server to respond to different ports which
each contain a request from some client thread wishing to do work on the
server. This scan starts at zero and continues until its checked all
ports at which point it resets. If we find an active port, we service it
and then restart the search.

This is bad for two reasons. First, it means that we will always bias
the lower ports. If a thread grabs a high port it will be stuck for a
very long time until all the other work is done. Second, it means that
the `handle_server` function can technically run indefinitely as long as
the client is always pushing new work. Because the OpenMP implementation
uses the user thread to service the kernel, this means that it could be
stalled with another asyncrhonous device's kernels.

This patch addresses this by making the server restart at the next port
over. This means we will always do a full scan of the ports before
quitting.
2023-09-26 16:09:48 -05:00
Joseph Huber
2b7227db1e [libc] Fix RPC server global after mass replace of __llvm_libc
Summary:
This variable needs a reserved name starting with `__`. It was
mistakenly changed with a mass replace. It happened to work because the
tests still picked up the associated symbol, but it just became a bad
name because it's not reserved anymore.
2023-09-26 14:28:48 -05:00
Joseph Huber
7ac8e26fc7
[libc] Implement fseek, fflush, and ftell on the GPU (#67160)
Summary:
This patch adds the necessary entrypoints to handle the `fseek`,
`fflush`, and `ftell` functions. These are all very straightfoward, we
simply make RPC calls to the associated function on the other end.
Implementing it this way allows us to more or less borrow the state of
the stream from the server as we intentionally maintain no internal
state on the GPU device. However, this does not implement the `errno`
functinality so that must be ignored.
2023-09-26 09:46:46 -05:00
Guillaume Chatelet
b6bc9d72f6
[libc] Mass replace enclosing namespace (#67032)
This is step 4 of
https://discourse.llvm.org/t/rfc-customizable-namespace-to-allow-testing-the-libc-when-the-system-libc-is-also-llvms-libc/73079
2023-09-26 11:45:04 +02:00
Joseph Huber
791b279924
[libc] Change the puts implementation on the GPU (#67189)
Summary:
Normally, the implementation of `puts` simply writes a second newline
charcter after printing the first string. However, because the GPU does
everything in batches of the SIMT group size, this will end up with very
poor output where you get the strings printed and then 1-64 newline
characters all in a row. Optimizations like to turn `printf` calls into
`puts` so it's a good idea to make this produce the expected output.

The least invasive way I could do this was to add a new opcode. It's a
little bloated, but it avoids an unneccessary and slow send operation to
configure this.
2023-09-25 11:17:22 -05:00
Joseph Huber
f548d19fc8
[libc] Fix and simplify the implementation of 'fread' on the GPU (#66948)
Summary:
Previously, the `fread` operation was wrong in cases when we read less
data than was requested. That is, if we tried to read N bytes while the
file was in EOF, it would still copy N bytes of garbage. This is fixed
by only copying over the sizes we got from locally opening it rather
than just using the provided size.

Additionally, this patch simplifies the interface. The output functions
have special variants for writing to stdout / stderr. This is primarily
an optimization for these common cases so we can avoid sending the
stream as an argument which has a high delay. Because for input, we
already need to start with a `send` to tell the server how much data to
read, it costs us nothing to send the file along with it so this is
redundant. Re-use the file encoding scheme from the other
implementations, the one that stores the stream type in the LSBs of the
FILE pointer.
2023-09-21 14:28:06 -05:00
michaelrj-google
5bd34e0a55
[libc] Fix Off By One Errors In Printf Long Double (#66957)
Two major off-by-one errors are fixed in this patch. The first is in
float_to_string.h with length_for_num, which wasn't accounting for the
implicit leading bit when calculating the length of a number, causing
a missing digit on 80 bit float max. The other off-by-one is the
ryu_long_double_constants.h (a.k.a the Mega Table) not having any
entries for the last POW10_OFFSET in POW10_SPLIT. This was also found on
80 bit float max. Finally, the integer calculation mode was using a
slightly too short integer, again on 80 bit float max, not accounting
for the mantissa width. All of these are fixed in this patch.
2023-09-21 11:43:29 -07:00
Joseph Huber
e2bc0f9266 [libc][NFC] Remove unused function from the RPC server
Summary:
I missed removing this now-unused function in the previous patch. Remove
it to clean up the interface.
2023-09-21 11:56:48 -05:00
Joseph Huber
59896c168a
[libc] Remove the 'rpc_reset' routine from the RPC implementation (#66700)
Summary:
This patch removes the `rpc_reset` function. This was previously used to
initialize the RPC client on the device by setting up the pointers to
communicate with the server. The purpose of this was to make it easier
to initialize the device for testing. However, this prevented us from
enforcing an invariant that the buffers are all read-only from the
client side.

The expected way to initialize the server is now to copy it from the
host runtime. This will allow us to maintain that the RPC client is in
the constant address space on the GPU, potentially through inference,
and improving caching behaviour.
2023-09-21 11:07:09 -05:00
Guillaume Chatelet
467077796a
[reland][libc][cmake] Tidy compiler includes (#66783) (#66878)
This is a reland of #66783 a35a3b75b2
fixing the benchmark breakage.
2023-09-20 11:21:46 +02:00
Guillaume Chatelet
9feb0c9b6e
Revert "[libc][cmake] Tidy compiler includes (#66783)" (#66822)
This reverts commit a35a3b75b2. This broke
libc benchmarks.
2023-09-19 23:18:08 +02:00
Guillaume Chatelet
a35a3b75b2
[libc][cmake] Tidy compiler includes (#66783)
We want to activate `llvm-header-guard` (#66477) but the current CMake
configuration includes paths that should be `isystem`. This PR restricts
the number of `-I` passed to the clang command line and correctly marks
the llvm libc include path as `isystem`.
2023-09-19 23:08:29 +02:00
Joseph Huber
a1be5d69df
[libc] Implement more input functions on the GPU (#66288)
Summary:
This patch implements the `fgets`, `getc`, `fgetc`, and `getchar`
functions on the GPU. Their implementations are straightforward enough.
One thing worth noting is that the implementation of `fgets` will be
extremely slow due to the high latency to read a single char. A faster
solution would be to make a new RPC call to call `fgets` (due to the
special rule that newline or null breaks the stream). But this is left
out because performance isn't the primary concern here.
2023-09-14 15:39:29 -05:00
Joseph Huber
4792ae5cd5
[libc] Fix building the RPC server with LIBC_NAMESPACE (#65642)
A recent patch required the implementation to define `LIBC_NAMESPACE`.
For GPU offloading we provide a static library whose internal
implementation relies on the `libc` headers. This is a separate library
that is constructed during the "bootstrap" phase. This patch moves the
definition of the `LIBC_NAMESPACE` CMake variable up so its available
during bootstrapping and adds it to the definition of the RPC server.
2023-09-07 12:47:36 -05:00
Joseph Huber
701e6f7630 [libc][fix] Fix buffer overrun in initialization of GPU return value
Summary:
The HSA API explicitly states that the size is a count of uint32_t's not
a byte count. This was erroneously being used as a simple memcpy,
causing some weird behaviour. Fix this by correctly passing `1` to
initialize a single integer to zero.
2023-09-02 17:59:01 -05:00
Joseph Huber
07102a1194 [libc] Implement the 'abort' function on the GPU
This function implements the `abort` function on the GPU. The
implementation here closely mirros the `exit` call where we first
synchornize with the RPC server to make sure it's listening and then we
exit on the GPU.

I was unsure if this should be a simple `__builtin_assert` on the GPU. I
elected to go with an RPC approach to make this a more "true" `abort`
call. That is, it should invoke some signal handlers and exit with the
proper code according to the implemented C library on the server.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D159210
2023-08-31 08:40:15 -05:00
Joseph Huber
7fd9f0f4e0 [libc] Remove MAX_LANE_SIZE definition from the RPC server
This `MAX_LANE_SIZE` was a hack from the days when we used a single
instance of the server and had some GPU state handle it. Now that we
have everything templated this really shouldn't be used. This patch
removes its use and replaces it with template arguments.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D158633
2023-08-23 12:09:30 -05:00
Joseph Huber
334bbc0d67 [libc] Add support for the 'fread' function on the GPU
This patch adds support for `fread` on the GPU via the RPC mechanism.
Here we simply pass the size of the read to the server and then copy it
back to the client via the RPC channel. This should allow us to do the
basic operations on files now. This will obviously be slow for large
sizes due ot the number of RPC calls involved, this could be optimized
further by having a special RPC call that can initiate a memcpy between
the two pointers.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D155121
2023-07-26 13:51:35 -05:00
Joseph Huber
a42c1f8d97 [libc][Obvious] Fix use of fwrite in the RPC server
Summary:
The RPC server used the size field which meant we didn't get the correct
return value for partial reads. We fix that here.
2023-07-26 11:13:38 -05:00
Joseph Huber
c381a94753 [libc] Remove test RPC opcodes from the exported header
This patch does the noisy work of removing the test opcodes from the
exported interface to an interface that is only visible in `libc`. The
benefit of this is that we both test the exported RPC registration more
directly, and we do not need to give this interface to users.

I have decided to export any opcode that is not a "core" libc feature as
having its MSB set in the opcode. We can think of these as non-libc
"extensions".

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D154848
2023-07-21 15:36:36 -05:00
Joseph Huber
d3aabeb7b5 [libc] Treat the locks array as a bitfield
Currently we keep an internal buffer of device memory that is used to
indicate ownership of a port. Since we only use this as a single bit we
can simply turn this into a bitfield. I did this manually rather than
having a separate type as we need very special handling of the masks
used to interact with the locks.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D155511
2023-07-21 10:49:11 -05:00
Joseph Huber
cf269417b2 [libc] Add an override option for specifying the loader implementation
There are some cases when testing we want to override the logic for not
building tests if the loader is not present. This allows users to
specify an external binary that fulfils the same duties which will force
the tests to be built even without meeting the dependencies.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D155837
2023-07-20 08:44:58 -05:00
Jon Chesterfield
095e69404a [libc][amdgpu] Accept deadstripped clock_freq global
If the clock_freq symbol isn't used, and is removed,
we don't need to abort the loader. Can instead just not set it.

Reviewed By: jhuber6

Differential Revision: https://reviews.llvm.org/D155832
2023-07-20 14:23:08 +01:00
Jon Chesterfield
d483824fc8 [libc][amdgpu] Tolerate different install directories for hsa.h
HSA headers might be under a hsa/ directory or might not.
This scheme matches the one used by the openmp amdgpu plugin.

Reviewed By: jhuber6, jplehr

Differential Revision: https://reviews.llvm.org/D155812
2023-07-20 13:43:17 +01:00
Joseph Huber
e537c83975 [libc] Add basic support for calling host functions from the GPU
This patch adds the `rpc_host_call` function as a GPU extension. This is
exported from the `libc` project to use the RPC interface to call a
function pointer via RPC any copying the arguments by-value. The
interface can only support a single void pointer argument much like
pthreads. The function call here is the bare-bones version of what's
required for OpenMP reverse offloading. Full support will require
interfacing with the mapping table, nowait support, etc.

I decided to test this interface in `libomptarget` as that will be the
primary consumer and it would be more difficult to make a test in `libc`
due to the testing infrastructure not really having a concept of the
"host" as it runs directly on the GPU as if it were a CPU target.

Reviewed By: jplehr

Differential Revision: https://reviews.llvm.org/D155003
2023-07-19 10:11:46 -05:00
Joseph Huber
979fb95021 Revert "[libc] Treat the locks array as a bitfield"
Summary:
This caused test failures on the gfx90a buildbot. This works on my
gfx1030 and the Nvidia buildbots, so we'll need to investigate what is
going wrong here. For now revert it to get the bots green.

This reverts commit 05abcc5792.
2023-07-19 09:27:08 -05:00
Joseph Huber
05abcc5792 [libc] Treat the locks array as a bitfield
Currently we keep an internal buffer of device memory that is used to
indicate ownership of a port. Since we only use this as a single bit we
can simply turn this into a bitfield. I did this manually rather than
having a separate type as we need very special handling of the masks
used to interact with the locks.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D155511
2023-07-18 11:34:21 -05:00
Joseph Huber
a608076726 [libc][Obvious] Check if the state hasn't already been destroyed on shutdown
This ensures that if someone calls the `rpc_shutdown` method multiple
times it will not segfault and gracefully continue. This was causing
problems in the OpenMP usage. This could point to other issues, but for
now this is a safe fix.

Differential Revision: https://reviews.llvm.org/D155005
2023-07-11 14:35:38 -05:00
Joseph Huber
2a65d0388c [libc] Add support for creating wrapper headers for offloading in clang
This is an alternate approach to the patches proposed in D153897 and
D153794. Rather than exporting a single header that can be included on
the GPU in all circumstances, this patch chooses to instead generate a
separate set of headers that only provides the declarations. This can
then be used by external tooling to set up what's on the GPU. This
leaves room for header hacks for offloading languages without needing to
worry about the `libc` implementation.

Currently this generates a set of headers that only contain the
declarations. These will then be installed to a new clang resource
directory called `llvm_libc_wrappers/` which will house the shim code.
We can then automaticlaly include this from `clang` when offloading to
wrap around the headers while specifying what's on the GPU.

Reviewed By: jdoerfert, JonChesterfield

Differential Revision: https://reviews.llvm.org/D154036
2023-07-07 16:02:33 -05:00
Joseph Huber
6ca6cdb23e Revert "[libc] Add support for creating wrapper headers for offloading in clang"
This reverts commit a4a26374aa.

This was causing some problems with the CPU build and CUDA buildbot.
Revert until I can figure out what those issues are and fix them. I
believe it is just some CMake.
2023-07-06 18:26:41 -05:00
Joseph Huber
a4a26374aa [libc] Add support for creating wrapper headers for offloading in clang
This is an alternate approach to the patches proposed in D153897 and
D153794. Rather than exporting a single header that can be included on
the GPU in all circumstances, this patch chooses to instead generate a
separate set of headers that only provides the declarations. This can
then be used by external tooling to set up what's on the GPU. This
leaves room for header hacks for offloading languages without needing to
worry about the `libc` implementation.

Currently this generates a set of headers that only contain the
declarations. These will then be installed to a new clang resource
directory called `llvm_libc_wrappers/` which will house the shim code.
We can then automaticlaly include this from `clang` when offloading to
wrap around the headers while specifying what's on the GPU.

Reviewed By: jdoerfert, JonChesterfield

Differential Revision: https://reviews.llvm.org/D154036
2023-07-06 18:10:49 -05:00
Joseph Huber
c850ea1498 [libc] Support fopen / fclose on the GPU
This patch adds the necessary support for the fopen and fclose functions
to work on the GPU via RPC. I added a new test that enables testing this
with the minimal features we have on the GPU. I will update it once we
have `fread` and `fwrite` to actually check the outputted strings. For
now I just relied on checking manually via the outpuot temp file.

Reviewed By: JonChesterfield, sivachandra

Differential Revision: https://reviews.llvm.org/D154519
2023-07-05 18:31:58 -05:00
Joseph Huber
5db39796bf [libc] Support timing information in libc tests
This patch adds the necessary support to provide timing information in
`libc` tests. This is useful for determining which tests look what
amount of time. We also can use this as a test basis for providing more
fine-grained timing when implementing things on the GPU.

The main difficulty with this is the fact that the AMDGPU fixed
frequency clock operates at an unknown frequency. We need to read this
on a per-card basis from the driver and then copy it in. NVPTX on the
other hand has a fixed clock at a resolution of 1ns. I have also
increased the resolution of the print-outs as the majority of these are
below a millisecond for me.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D154446
2023-07-05 14:27:08 -05:00
Michael Jones
cfbcbc8f88 [libc] fix MPFR rounding problems in fuzz test
The accuracy for the MPFR numbers in the strtofloat fuzz test was set
too high, causing rounding issues when rounding to a smaller final
result.

Reviewed By: lntue

Differential Revision: https://reviews.llvm.org/D154150
2023-07-05 10:53:40 -07:00
Joseph Huber
df52a22b1b [libc] Make the RPC server target always available
This patch makes sure that we always build the RPC server. The proposed
used for this is to begin integrating this server implementation into
`libomptarget`. That requires that we build this server ahead of time
when using a `LLVM_ENABLE_PROJECTS` build. Make a few tweaks to ensure
that the GCC compiler which may be used for this build doesn't complain.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D154105
2023-06-30 11:30:57 -05:00
Joseph Huber
62f57bc9b0 [libc] Add other RPC callback methods to the RPC server
This patch adds the other two methods to the server so the external
users can use the interface through the obfuscated interface.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D154224
2023-06-30 11:29:37 -05:00
Joseph Huber
667c10353e [libc] Fix the implementation of exit on the GPU
The RPC calls all have delays associated with them. Currently the `exit`
function does an async send and immediately exits the GPU. This can have
the effect that the RPC server never sees the exit call and we continue.
This patch changes that to first sync with the server before continuing
to perform its exit. There is still a hazard here, where the kernel can
complete before the RPC call reads back its response, but this is simply
multi-threaded hazards. This change ensures that the server *will*
always exit some time after the GPU exits.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D154112
2023-06-29 13:22:23 -05:00
Tue Ly
f320fefc4a [libc][math] Implement erff function correctly rounded to all rounding modes.
Implement correctly rounded `erff` functions.

For `x >= 4`, `erff(x) = 1` for `FE_TONEAREST` or `FE_UPWARD`, `0x1.ffffep-1` for `FE_DOWNWARD` or `FE_TOWARDZERO`.

For `0 <= x < 4`, we divide into 32 sub-intervals of length `1/8`, and use a degree-15 odd polynomial to approximate `erff(x)` in each sub-interval:
```
  erff(x) ~ x * (c0 + c1 * x^2 + c2 * x^4 + ... + c7 * x^14).
```

For `x < 0`, we can use the same formula as above, since the odd part is factored out.

Performance tested with `perf.sh` tool from the CORE-MATH project on AMD Ryzen 9 5900X:

Reciprocal throughput (clock cycles / op)
```
$ ./perf.sh erff --path2
GNU libc version: 2.35
GNU libc release: stable
-- CORE-MATH reciprocal throughput --  with -march=native      (with FMA instructions)
[####################] 100 %
Ntrial = 20 ; Min = 11.790 + 0.182 clc/call; Median-Min = 0.154 clc/call; Max = 12.255 clc/call;
-- CORE-MATH reciprocal throughput --  with -march=x86-64-v2      (without FMA instructions)
[####################] 100 %
Ntrial = 20 ; Min = 14.205 + 0.151 clc/call; Median-Min = 0.159 clc/call; Max = 15.893 clc/call;

-- System LIBC reciprocal throughput --
[####################] 100 %
Ntrial = 20 ; Min = 45.519 + 0.445 clc/call; Median-Min = 0.552 clc/call; Max = 46.345 clc/call;

-- LIBC reciprocal throughput --  with -mavx2 -mfma     (with FMA instructions)
[####################] 100 %
Ntrial = 20 ; Min = 9.595 + 0.214 clc/call; Median-Min = 0.220 clc/call; Max = 9.887 clc/call;
-- LIBC reciprocal throughput --  with -msse4.2     (without FMA instructions)
[####################] 100 %
Ntrial = 20 ; Min = 10.223 + 0.190 clc/call; Median-Min = 0.222 clc/call; Max = 10.474 clc/call;
```

and latency (clock cycles / op):
```
$ ./perf.sh erff --path2
GNU libc version: 2.35
GNU libc release: stable
-- CORE-MATH latency --  with -march=native      (with FMA instructions)
[####################] 100 %
Ntrial = 20 ; Min = 38.566 + 0.391 clc/call; Median-Min = 0.503 clc/call; Max = 39.170 clc/call;
-- CORE-MATH latency --  with -march=x86-64-v2      (without FMA instructions)
[####################] 100 %
Ntrial = 20 ; Min = 43.223 + 0.667 clc/call; Median-Min = 0.680 clc/call; Max = 43.913 clc/call;

-- System LIBC latency --
[####################] 100 %
Ntrial = 20 ; Min = 111.613 + 1.267 clc/call; Median-Min = 1.696 clc/call; Max = 113.444 clc/call;

-- LIBC latency --  with -mavx2 -mfma     (with FMA instructions)
[####################] 100 %
Ntrial = 20 ; Min = 40.138 + 0.410 clc/call; Median-Min = 0.536 clc/call; Max = 40.729 clc/call;
-- LIBC latency --  with -msse4.2     (without FMA instructions)
[####################] 100 %
Ntrial = 20 ; Min = 44.858 + 0.872 clc/call; Median-Min = 0.814 clc/call; Max = 46.019 clc/call;
```

Reviewed By: michaelrj

Differential Revision: https://reviews.llvm.org/D153683
2023-06-28 13:58:37 -04:00
Joseph Huber
31c154881c [libc] Allow the RPC client to be initialized via a H2D memcpy
The RPC client must be initialized to set a pointer to the underlying
buffer. This is currently done with the `reset` method which may not be
ideal for the use-case. We want runtimes to be able to initialize this
without needing to call a kernel. Recent changes allowed the `Client`
type to be trivially copyable. That means we can create a client on the
server side and then copy it over. To that end we take the existing
externally visible symbol and initialize it to the client's pointer.
Therefore we can look up the symbol and copy it over once loaded.

No test currently, I tested with a demo OpenMP application but couldn't think of
how to put that in-tree.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D153633
2023-06-26 10:41:32 -05:00
Joseph Huber
e0b487bfc0 [libc] Rename and install the RPC server interface
This patch prepares the RPC interface to be installed. We place this in
the existing `llvm-gpu-none` directory as it will also give us access to
the generated `libc` headers for the opcodes.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D153040
2023-06-21 11:26:24 -05:00