The pipeline state data captured in the PSV0 section of the DXContainer
file encodes signature elements which are read by the runtime to map
inputs and outputs from the GPU program.
This change adds support for generating and parsing signature elements
with testing driven through the ObjectYAML tooling.
Reviewed By: bogner
Differential Revision: https://reviews.llvm.org/D157671
Initially landed as 8c567e64f808f7a818965c6bc123fedf7db7336f, and
reverted in 4d800633b2683304a5431d002d8ffc40a1815520.
../llvm/include/llvm/BinaryFormat/DXContainerConstants.def
../llvm/test/ObjectYAML/DXContainer/PSVv1-amplification.yaml
../llvm/test/ObjectYAML/DXContainer/PSVv1-compute.yaml
../llvm/test/ObjectYAML/DXContainer/PSVv1-domain.yaml
../llvm/test/ObjectYAML/DXContainer/PSVv1-geometry.yaml
../llvm/test/ObjectYAML/DXContainer/PSVv1-vertex.yaml
../llvm/test/ObjectYAML/DXContainer/PSVv2-amplification.yaml
../llvm/test/ObjectYAML/DXContainer/PSVv2-compute.yaml
../llvm/test/ObjectYAML/DXContainer/PSVv2-domain.yaml
../llvm/test/ObjectYAML/DXContainer/PSVv2-geometry.yaml
../llvm/test/ObjectYAML/DXContainer/PSVv2-vertex.yaml
The TestEvents.py test I added for ShadowListeners fails on Windows.
Since there's no reason to believe the ShadowListeners feature has
different behavior from the other event-based tests here, I copied
the skips & expected_flakey's from the other tests in that file to
this one.
This commit starts enabling vector distribution over multiple
dimensions. It requires delinearizing the lane ID to match the
expected rank. shape_cast and transfer_read can now properly
handle multiple dimensions.
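To illustrate what delinearizing a lane ID means, here is a minimal sketch that assumes a row-major layout. The helper name `delinearize` and the example shapes are hypothetical and are not taken from the patch.

```python
from typing import Sequence, Tuple


def delinearize(linear_id: int, shape: Sequence[int]) -> Tuple[int, ...]:
    """Split a flat lane ID into one index per dimension.

    Assumption: row-major order, so the last dimension varies fastest.
    """
    indices = []
    for extent in reversed(shape):
        indices.append(linear_id % extent)  # position inside this dimension
        linear_id //= extent                # carry the rest to the outer dimension
    return tuple(reversed(indices))


# Lane 5 of a 2x4 arrangement of lanes sits in row 1, column 1.
lane_index = delinearize(5, (2, 4))
```

The number of indices returned equals the rank of the shape, which is the property the distribution needs.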
Reviewed By: hanchung
Differential Revision: https://reviews.llvm.org/D157931
The pipeline state data captured in the PSV0 section of the DXContainer
file encodes signature elements which are read by the runtime to map
inputs and outputs from the GPU program.
This change adds support for generating and parsing signature elements
with testing driven through the ObjectYAML tooling.
Reviewed By: bogner
Differential Revision: https://reviews.llvm.org/D157671
Before this code was refactored, all arm64 targets were set to use the 32-bit allocator. This patch restores that behavior for DriverKit.
Because we target DriverKit as the target OS rather than a specific platform, restoring the previous behavior is the preferred way to fix a failure we are seeing on embedded platforms.
In the future, though, it may be more correct to match the allocator to the platform being used.
rdar://113649286
Differential Revision: https://reviews.llvm.org/D158028
Machine function splitting will become available for AArch64; since MFS
is no longer X86-only, the tests for generic behavior should live
somewhere other than tests/CodeGen/X86.
The MFS implementation doesn't vary much across platforms, and most tests
should be identical between X86 and AArch64 besides instruction
selection, so the tests can live together in tests/CodeGen/Generic.
Differential Revision: https://reviews.llvm.org/D157563
We are able to fuse the pack op only if inner tiles are not tiled or
they are fully used. Otherwise, it could generate a sequence of
non-trivial ops.
Differential Revision: https://reviews.llvm.org/D157932
Before the addition of the process "Shadow Listener" you could only have one
Listener observing the Process Broadcaster. That was necessary because fetching the
Process event is what switches the public process state, and for the execution
control logic to be manageable you needed to keep other listeners from causing
this to happen before the main process control engine was ready.
Ismail added the notion of a "ShadowListener" - which allowed you ONE
extra process listener. This patch inverts that setup by designating the
first listener as primary - and giving it priority in fetching events.
Differential Revision: https://reviews.llvm.org/D157556
For non-intrinsic CallInsts, computeKnownBits only handles range
metadata and checking getReturnedArgOperand(). Both of these are now
handled in isKnownNonZero, so there is no need to fall through to
a call to computeKnownBits anymore.
Differential Revision: https://reviews.llvm.org/D158095
Accept "module procedure" (as well as module function/subroutine)
in a separate module procedure definition, such as "bb1" in:
  module mm
    interface
      module subroutine mm1
      end subroutine
    end interface
  end module
  submodule(mm) bb
    interface
      module subroutine bb1
      end subroutine
    end interface
  contains
    module procedure mm1
      call bb1
    end procedure
    module procedure bb1
      print*, 'bb1'
    end procedure
  end submodule
  use mm
  call mm1
  end
`MaxSafeDepDistBytes` did not match its name and semantics
when the loop had a non-unit stride. For example,
```
for (int k = 0; k < len; k+=3) {
a[k] = a[k+4];
a[k+2] = a[k+6];
}
```
Here, the smallest dependence distance is 24 bytes, but only vectorizing 8 bytes
is safe. `MaxSafeVectorWidthInBits` reported the correct number of bits
that could be vectorized as 64 bits.
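The gap between the two numbers can be checked with a small simulation. The sketch below is a toy model, not LoopAccessAnalysis: it assumes 4-byte `int` elements, models a vectorized statement as "load all lanes, then store all lanes", and uses hypothetical helper names.

```python
def run_scalar(data, length):
    """Execute the loop one iteration at a time."""
    a = list(data)
    for k in range(0, length, 3):
        a[k] = a[k + 4]
        a[k + 2] = a[k + 6]
    return a


def run_vectorized(data, length, vf):
    """Execute the loop with `vf` iterations per vector step."""
    a = list(data)
    ks = list(range(0, length, 3))
    for start in range(0, len(ks), vf):
        group = ks[start:start + vf]
        # First statement: all lanes load before any lane stores.
        loaded = [a[k + 4] for k in group]
        for k, value in zip(group, loaded):
            a[k] = value
        # Second statement, same load-then-store model.
        loaded = [a[k + 6] for k in group]
        for k, value in zip(group, loaded):
            a[k + 2] = value
    return a


def is_safe(vf, length=30):
    """True if vectorizing by `vf` matches the scalar result."""
    data = list(range(length + 7))  # distinct values expose any reordering
    return run_scalar(data, length) == run_vectorized(data, length, vf)
```

In this model two lanes (8 bytes, 64 bits) reproduce the scalar result, while three lanes do not: the read of `a[k+6]` is overtaken by the write to `a[k]` from two iterations later, which is the 24-byte dependence.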
The semantics of `MaxSafeDepDistBytes` should be:
The smallest dependence distance in bytes in the loop. This may not be
the same as the maximum number of bytes that are safe to operate on
simultaneously.
The variable is renamed to `MinDepDistBytes` to reflect those semantics,
and its docstring is updated accordingly.
A debug message that used `MaxSafeDepDistBytes` to signify to the user
how many bytes could be accessed in parallel is updated to use
`MaxSafeVectorWidthInBits` instead. That way, the same message is
communicated to the user, just in different units.
This patch makes sure that when `MinDepDistBytes` is modified in a way
that should impact `MaxSafeVectorWidthInBits`, the latter is updated
accordingly. This patch also clarifies why `MaxSafeVectorWidthInBits`
does not need to be updated when `MinDepDistBytes` is (i.e. in the case of a
forward dependency).
Differential Revision: https://reviews.llvm.org/D156158
Aligning functions yields small performance gains on
embedded cores, more so with numerous small function calls.
Similar to aligning loops, if the function can fit within
a single cache line then the performance overhead of
fetching more instructions can be limited.
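The cache-line argument can be made concrete with a little arithmetic. This is a sketch only: the 64-byte line size, the addresses and the helper names are assumptions for illustration, not values from the patch.

```python
def cache_lines_touched(start: int, size: int, line_size: int = 64) -> int:
    """Number of cache lines covered by `size` bytes beginning at `start`."""
    first = start // line_size
    last = (start + size - 1) // line_size
    return last - first + 1


def align_up(addr: int, alignment: int) -> int:
    """Round `addr` up to the next multiple of `alignment`."""
    return (addr + alignment - 1) // alignment * alignment


# A 48-byte function placed at 0x30 straddles two 64-byte lines ...
unaligned_lines = cache_lines_touched(0x30, 48)
# ... but fits in a single line once its start is aligned to 64 bytes.
aligned_lines = cache_lines_touched(align_up(0x30, 64), 48)
```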
Differential Revision: https://reviews.llvm.org/D157514
Fix forward for a bug in dac19b457e2cfd139e0e5cc29872ba3c65b7510f, which uses
the vertical bar operator for type hints. That operator is only supported by
Python 3.10 and later, so it breaks the builds on Python 3.8.
This allows us to select G_SHUFFLE_VECTOR with identity masks (possibly
including undef elements), but avoid the actual EXT instruction if the shift
amount is 0.
This check is redundant as it is covered by the call to
isPotentiallyReachable.
Depends on D155726.
Differential Revision: https://reviews.llvm.org/D155718
Migrate createForStaticInitFunction, createDispatchInitFunction, createDispatchNextFunction and createDispatchFiniFunction from Clang CodeGen to OMPIRBuilder.
Differential Revision: https://reviews.llvm.org/D157994
This patch prevents `mlir-linalg-ods-yaml-gen` from adding extra
whitespace around the summary and description fields. This broke the
_italics_ of the summary as _ this _ is not recognised by markdown.
It also meant the first line of the description was in a code block
as it was indented two spaces.
The separator between summary and description has also been updated to
two newlines. This convention was already followed in practice, and it prevents
a line-wrapped summary from spilling partly into the description.
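As a rough illustration of the intended output shape, the sketch below strips the surrounding whitespace and joins the two fields with a blank line. The function `render_doc` is hypothetical and is not the generator's actual code; the assumption that the summary is wrapped in underscores comes only from the italics remark above.

```python
def render_doc(summary: str, description: str) -> str:
    """Join summary and description with exactly two newlines.

    Stripping matters: "_ text _" is not italic in markdown, and a
    description indented by spaces can be rendered as a code block.
    """
    return f"_{summary.strip()}_\n\n{description.strip()}"


doc = render_doc("  Performs a matrix multiplication.  ",
                 "  Numeric casting is performed on the operands.\n")
```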
These issues can be currently seen at: https://mlir.llvm.org/docs/Dialects/Linalg/
Reviewed By: awarzynski
Differential Revision: https://reviews.llvm.org/D157853
Re-apply https://reviews.llvm.org/D157704.
The original patch broke the tests on Python 3.8 and got reverted by
0c4aad050c23254c3c612e860e1278961d161aef. This patch replaces the usage
of the vertical bar operator for type hints with `Union`.
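For reference, a minimal sketch of the portable spelling. The function `normalize` is a made-up example, not code from the patch. On Python 3.8, an annotation such as `int | str` that is evaluated at definition time raises a `TypeError`, whereas `Union[int, str]` works on all supported versions.

```python
from typing import Optional, Union


def normalize(value: Union[int, str], default: Optional[int] = None) -> int:
    """Accept an int or a numeric string and return an int."""
    if isinstance(value, int):
        return value
    try:
        return int(value)
    except ValueError:
        if default is None:
            raise
        return default
```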
Reviewed By: ftynse
Differential Revision: https://reviews.llvm.org/D158075
Generate store immediate instruction when CPUv4 is enabled.
For example:
$ cat test.c
struct foo {
unsigned char b;
unsigned short h;
unsigned int w;
unsigned long d;
};
void bar(volatile struct foo *p) {
p->b = 1;
p->h = 2;
p->w = 3;
p->d = 4;
}
$ clang -O2 --target=bpf -mcpu=v4 test.c -c -o - | llvm-objdump -d -
...
0000000000000000 <bar>:
0: 72 01 00 00 01 00 00 00 *(u8 *)(r1 + 0x0) = 0x1
1: 6a 01 02 00 02 00 00 00 *(u16 *)(r1 + 0x2) = 0x2
2: 62 01 04 00 03 00 00 00 *(u32 *)(r1 + 0x4) = 0x3
3: 7a 01 08 00 04 00 00 00 *(u64 *)(r1 + 0x8) = 0x4
4: 95 00 00 00 00 00 00 00 exit
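The byte sequences in the listing can be decoded by hand. The sketch below assumes the standard little-endian eBPF instruction layout (8-bit opcode, two 4-bit register fields, 16-bit offset, 32-bit immediate). The helper `decode_st` is hypothetical and only understands `BPF_ST | BPF_MEM` stores.

```python
import struct

BPF_ST = 0x02    # instruction class bits
BPF_MEM = 0x60   # addressing mode bits
SIZE_NAMES = {0x00: "u32", 0x08: "u16", 0x10: "u8", 0x18: "u64"}


def decode_st(raw: bytes):
    """Return (size, dst_reg, offset, imm) for a store-immediate instruction."""
    opcode, regs, off, imm = struct.unpack("<BBhi", raw)
    if opcode & 0x07 != BPF_ST or opcode & 0xE0 != BPF_MEM:
        raise ValueError("not a BPF_ST | BPF_MEM instruction")
    size = SIZE_NAMES[opcode & 0x18]
    dst = regs & 0x0F  # low nibble holds the destination register
    return size, dst, off, imm


first_store = decode_st(bytes.fromhex("72 01 00 00 01 00 00 00"))
last_store = decode_st(bytes.fromhex("7a 01 08 00 04 00 00 00"))
```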
Take special care to:
- apply `BPFMISimplifyPatchable::checkADDrr` rewrite for BPF_ST
- validate immediate value when BPF_ST write is 64-bit:
BPF interprets `(BPF_ST | BPF_MEM | BPF_DW)` writes as writes with
sign extension. Thus it is fine to generate such write when
immediate is -1, but it is incorrect to generate such write when
immediate is +0xffff_ffff.
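The 64-bit rule above amounts to checking that the value survives a round trip through a sign-extended 32-bit immediate. A minimal sketch follows; the helper names are hypothetical and do not come from the backend.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def sign_extend_32_to_64(imm32: int) -> int:
    """Sign-extend the low 32 bits to an unsigned 64-bit pattern."""
    imm32 &= MASK32
    if imm32 & 0x80000000:
        imm32 |= MASK64 & ~MASK32  # replicate the sign bit upward
    return imm32


def can_use_st_dw(value: int) -> bool:
    """True if a 64-bit store of `value` can use a 32-bit immediate."""
    value64 = value & MASK64
    return sign_extend_32_to_64(value64) == value64
```

With these definitions `can_use_st_dw(-1)` holds, while `can_use_st_dw(0xFFFFFFFF)` does not, matching the two cases called out above.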
This commit was previously reverted in e66affa17e32.
The reason for revert was an unrelated bug in BPF backend,
triggered by test case added in this commit if LLVM is built
with LLVM_ENABLE_EXPENSIVE_CHECKS.
The bug was fixed in D157806.
Differential Revision: https://reviews.llvm.org/D140804
vmv.x.s and vmv.f.s are unconditional. They read the low element of a vector
register (not vector group), and function even when VL=0 or VSTART>0. As such,
they are don't-cares with respect to both VL and LMUL.
We'd previously handled this in the forward pass only, via the NoRegister
mechanism. (The only instructions with SEW but without VL are these extracts.)
This patch moves that handling into getDemanded so that the backwards pass
benefits as well.
Differential Revision: https://reviews.llvm.org/D157991
We were defaulting to VL=0 when we didn't otherwise have a vsetv
nearby. Instead, let's use VL=1. VL=0 is very much a corner case
in hardware, so let's avoid it if we can.
Differential Revision: https://reviews.llvm.org/D158015
This patch implements the `fopen`, `fclose`, and `fread` functions on
the GPU. These are pretty much re-implemented from what existed but
using the new interface. Having this subset allows us to test the
interface a bit more strenuously since we can write to and read from a file.
Reviewed By: sivachandra
Differential Revision: https://reviews.llvm.org/D157622
Add test coverage for sinking/hoisting loads/stores with swifterror
pointers. Currently this isn't handled correctly by SimplifyCFG and
causes a verifier error.
Apparently the spec has overloads for fmin/fmax and ldexp with one of
the operands as scalar. We need to broadcast the scalars to the vector
type.
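A toy model of the broadcast, using Python lists as vectors. The name `vector_fmin` is hypothetical, and the sketch ignores the NaN handling a real fmin would need.

```python
from typing import List, Union

Vec = List[float]


def splat(scalar: float, lanes: int) -> Vec:
    """Broadcast a scalar to a vector with `lanes` elements."""
    return [scalar] * lanes


def vector_fmin(a: Vec, b: Union[float, Vec]) -> Vec:
    """Element-wise minimum; a scalar `b` is broadcast to a's length first."""
    if not isinstance(b, list):
        b = splat(float(b), len(a))
    if len(a) != len(b):
        raise ValueError("operands must have the same number of lanes")
    return [min(x, y) for x, y in zip(a, b)]


clamped = vector_fmin([1.0, 5.0, 3.0], 2.0)
```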
https://reviews.llvm.org/D158077
This patch removes some includes from LinkAllPasses.h that appear
to be unused. Those should have been removed earlier when the
corresponding legacy PM passes were removed.
InstSimplifyPass is a bit special since the legacy PM version of the
pass still exists. But since createInstSimplifyLegacyPass is defined
in Scalar.h and not in InstSimplifyPass.h that particular include
isn't needed anyway.
`getBufferType` computes the bufferized type of an SSA value without bufferizing any IR. This is useful for predicting the bufferized type of iter_args of a loop.
To avoid endless recursion (e.g., in the case of "scf.for", the type of the iter_arg depends on the type of init_arg and the type of the yielded value; the type of the yielded value depends on the type of the iter_arg again), `fixedTypes` was used to fall back to a "fixed" type. A simpler way is to maintain an "invocation stack". `getBufferType` implementations can then inspect the invocation stack to detect repetitive computations (typically when computing the bufferized type of a block argument).
Also improve error messages in case of inconsistent memory spaces inside of a loop.
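The invocation-stack idea can be sketched on a toy dependency graph. Everything here is an assumption made for illustration: the dictionary encoding, the kinds `fixed`, `alias` and `iter_arg`, and the choice to fall back to the init type when a cycle is detected. It is not the MLIR implementation.

```python
from typing import Dict, List, Optional


def buffer_type(value: str, defs: Dict[str, dict],
                stack: Optional[List[str]] = None) -> Optional[str]:
    """Compute a value's type, returning None if it is already being computed."""
    stack = [] if stack is None else stack
    if value in stack:
        return None  # repetitive computation detected via the stack
    stack.append(value)
    try:
        d = defs[value]
        if d["kind"] == "fixed":
            return d["type"]
        if d["kind"] == "alias":
            return buffer_type(d["of"], defs, stack)
        # iter_arg: depends on the init value and on the yielded value.
        init_type = buffer_type(d["init"], defs, stack)
        yielded_type = buffer_type(d["yielded"], defs, stack)
        if yielded_type is not None and yielded_type != init_type:
            raise ValueError(f"inconsistent types for {value}: "
                             f"{init_type} vs {yielded_type}")
        return init_type
    finally:
        stack.pop()


LOOP = {
    "init": {"kind": "fixed", "type": "memref<8xf32>"},
    "iter": {"kind": "iter_arg", "init": "init", "yielded": "next"},
    "next": {"kind": "alias", "of": "iter"},  # yielded value depends on iter
}
loop_type = buffer_type("iter", LOOP)
```

The query for `iter` reaches `next`, which asks for `iter` again. The stack catches the repeat, so the recursion ends and the init type is used.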
Differential Revision: https://reviews.llvm.org/D158060