Enable by default for optimization levels higher than 0 (same behavior
as clang).
For simplicity, only forward the flag to the frontend driver when it
contradicts what is implied by the optimization level.
Since https://github.com/llvm/llvm-project/pull/72903 there are now no
known performance regressions.
The 'routine' construct applies either to a function directly or, when
given a name, to the named function visible in the current scope. This
patch implements the parsing for this. The identifier (or id-expression)
provided is required to be a valid, declared identifier, though the
semantic analysis portion of the Routine directive will still need to
enforce that it is a function or overload set.
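For reference, a minimal sketch of the two forms (standard OpenACC
syntax; the function names are hypothetical):
```
// Form 1: the directive applies to the function that follows it.
#pragma acc routine seq
void scale_elem(float *x, int i, float a);

// Form 2: the directive names an already-declared function in scope.
void helper(int n);
#pragma acc routine(helper) seq
```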
Summary:
The puts call appends a newline. With multiple threads, this can happen
out of order, such that another thread prints something before we finish
appending the newline. Add flockfile and funlockfile calls to ensure
that the whole string, newline included, is printed before another
thread's output can appear.
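A minimal sketch of the locking pattern (generic C stdio usage, not the
actual libc code):
```
#include <stdio.h>

// Hold the stream lock so the message and its trailing newline are
// emitted as one unit, even when several threads print concurrently.
void print_line(FILE *stream, const char *msg) {
  flockfile(stream);
  fputs(msg, stream);
  fputc('\n', stream);
  funlockfile(stream);
}
```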
When hoisting an invariant load, we should not combine it with an
existing load through common subexpression elimination (CSE). This is
because there might be memory-changing instructions between the existing
load and the end of the block entering the loop.
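A hand-written illustration of the hazard (hypothetical code, not taken
from the patch or its tests):
```
// If the hoisted invariant load of *p were CSE'd with the earlier load
// into 'first', it would miss the update made through the
// possibly-aliasing store to *q that sits between the two loads.
int sum(int *p, int *q, int n) {
  int first = *p;  // existing load in the block entering the loop
  *q = first + 1;  // memory-changing instruction; q may alias p
  int total = 0;
  for (int i = 0; i < n; ++i)
    total += *p;   // invariant load being hoisted; must not reuse 'first'
  return total + first;
}
```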
Fixes https://github.com/llvm/llvm-project/issues/72855
Adding test cases to the `cases` array causes `git clang-format` to
split the string literals of many of the existing test cases, making
them harder to read and work with.
This patch disables `clang-format` for the `cases` array so it doesn't
catch anyone off guard in the future.
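The usual way to do this is the `// clang-format off` /
`// clang-format on` marker pair (the array contents below are
hypothetical):
```
// clang-format off
static const char *cases[] = {
    "a deliberately long test string that should stay on a single line",
    "another case that clang-format would otherwise re-wrap",
};
// clang-format on
```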
Summary:
There are a few default options that LLVM adds that can be problematic
for runtimes builds. These options are generally intended to handle
building LLVM itself, but are also added when building in a runtimes
mode. One such issue I've run into is that in `libc` we deliberately use
`--target` to select a different device toolchain, which doesn't support
some linker arguments passed via `-Wl`. This is observed in
https://github.com/llvm/llvm-project/pull/73030 when attempting to use
these options.
This patch completely removes these default arguments.
The consensus is that any issues created by this patch should ultimately
be solved on a per-runtime basis.
We're performing the same lookup twice here, just to access
different fields. Only perform it once.
Also prefer using find() over operator[], as we do not want to
perform an insert if the node does not exist.
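A generic illustration of the pattern (standard-library map, not the
actual data structure touched here):
```
#include <map>
#include <string>

// find() performs the lookup exactly once and, unlike operator[], never
// inserts a default-constructed entry when the key is absent.
int lookupSize(const std::map<std::string, int> &Sizes,
               const std::string &Key) {
  auto It = Sizes.find(Key);
  return It == Sizes.end() ? 0 : It->second;
}
```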
This patch allows adding any quantity to a zero-initialized TypeSize,
such that e.g.:
TypeSize::Scalable(0) + TypeSize::Fixed(4) == TypeSize::Fixed(4)
TypeSize::Fixed(0) + TypeSize::Scalable(4) == TypeSize::Scalable(4)
This makes it easier to implement add-reductions using TypeSize where
the 'scalable' flag is not yet known before starting the reduction.
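A minimal sketch of such a reduction, using only the constructors and
the operator shown above:
```
#include "llvm/Support/TypeSize.h"
using namespace llvm;

// The accumulator starts as a zero TypeSize whose 'scalable' flag is
// not yet known; the first non-zero addend determines it.
TypeSize sumOfParts() {
  TypeSize Sum = TypeSize::Fixed(0);
  Sum = Sum + TypeSize::Scalable(4); // Sum == TypeSize::Scalable(4)
  Sum = Sum + TypeSize::Scalable(8); // Sum == TypeSize::Scalable(12)
  return Sum;
}
```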
(this PR follows on from #72979)
Before we widen to top, we now check whether both values can be proved
either true or false in their respective environments; if so, widening
returns a true or false literal. The idea is that we avoid losing
information where possible.
This patch includes a test that fails without this change to widening.
This change does mean that we call the SAT solver in more places, but
this seems acceptable given the additional precision we gain.
In tests on an internal codebase, the number of SAT solver timeouts we
observe with Crubit's nullability checker does increase by about 25%.
They can be brought back to the previous level by doubling the SAT
solver work limit.
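A toy model of the new rule (hypothetical types; the real check consults
the flow conditions via the SAT solver rather than explicit sets):
```
#include <set>

// A boolean "value" is modeled as the set of truth values it may take
// in its environment: {true}, {false}, or {true, false} (Top).
using BoolVal = std::set<bool>;

BoolVal widen(const BoolVal &Prev, const BoolVal &Cur) {
  if (Prev == BoolVal{true} && Cur == BoolVal{true})
    return {true};        // provably true in both environments
  if (Prev == BoolVal{false} && Cur == BoolVal{false})
    return {false};       // provably false in both environments
  return {true, false};   // otherwise widen to Top as before
}
```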
#73092 added support for encoding/decoding PUSHP/POPP.
#73233 added support for encoding/decoding PUSH2[P]/POP2[P].
In this patch, we teach frame lowering to spill/reload registers with
these instructions:
1. Use PPX for balanced spill/reload.
2. Use PUSH2/POP2 for consecutive spills/reloads.
3. PUSH2/POP2 must be 16B-aligned on the stack, so pad when necessary.
This caused some runtimes builds to fail with:
error: unknown target 'runtimes-test-depends'
See comments on the PR.
> The test-depends target contained all the dependencies needed to run the
> runtimes tests, but it was never added as a dependency of check-all.
> This caused some of the tsan tests to fail, since the custom libcxx
> build the tests were looking for was never built. Besides the tsan
> failures, this fixes all the other test failures I was seeing with:
> cmake -G Ninja -B release-build -S llvm \
> -DCMAKE_POSITION_INDEPENDENT_CODE=ON \
> -DCMAKE_BUILD_TYPE=Release \
> -DLLVM_ENABLE_ASSERTIONS=OFF \
> -DLLVM_ENABLE_PROJECTS="clang;lld" \
> -DLLVM_ENABLE_RUNTIMES="libcxx;libcxxabi;libunwind;compiler-rt"
>
> This is the same configuration the test-release.sh script uses, so I'm
> hoping this will also fix all the test failures we've been seeing when
> building the releases.
>
> Fixes #58680
This reverts commit 7f215b1380da49dccbf57da3040a40d25ed898f4.
This broke
tools/llvm-exegesis/X86/latency/dummy-counters.test
on Mac, see comment on the PR.
> There was an issue with certain configurations failing to build due to a
> deleted Error constructor that should be fixed with this relanding.
This reverts commit 92b821f2dcddbfa934689e10d8f11df150ab1043.
If we are loading the same pointer at different vector widths, then reuse the largest load and just extract the low subvector.
Unlike the equivalent VBROADCAST_LOAD/SUBV_BROADCAST_LOAD folds, which can occur in DAG, we have to wait until DAGISel; otherwise we can hit infinite loops if constant folding recreates the original constant value.
This is mainly useful for better constant sharing.
runSemiNCA() currently checks that the ReverseChildren are below
MinLevel in the DT, which is used when performing incremental updates.
However, ReverseChildren is populated during runDFS with only the
predecessors that are part of that DFS walk, which will itself be level
limited in the relevant cases. As such, I don't believe that this should
be checked during runSemiNCA().
This code probably dates back to a time when predecessors were not
cached during runDFS and as such not limited to the visited subtree
only.
After 9645267, TypeByteSize is 0 if the two accesses do not have the
same size (i.e. HasSameSize is false). This can cause an infinite loop
in couldPreventStoreLoadForward if HasSameSize is not checked first.
So check HasSameSize first instead of after
couldPreventStoreLoadForward. Checking HasSameSize first is also
cheaper.
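A rough sketch of the new ordering (condition and function names are
taken from the description above; the return values are illustrative,
not the exact code):
```
// The cheap size check now runs first, so couldPreventStoreLoadForward
// is never entered with TypeByteSize == 0 (which could loop forever).
if (!HasSameSize)
  return Dependence::Unknown;
if (couldPreventStoreLoadForward(Distance, TypeByteSize))
  return Dependence::ForwardButPreventsForwarding;
```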
Auto-generate checks for -loop-carried.ll to make it easier to update in
a follow-on patch. As this test only checks the dependence, mark the
pointers as noalias to avoid also checking various runtime pointer check
groups.
The NVIDIA Hopper architecture introduced the Cooperative Group Array
(CGA). It is a new level of parallelism that allows clustering
Cooperative Thread Arrays (CTAs) so they can synchronize and communicate
through shared memory while running concurrently.
This PR enables support for CGA within `gpu.launch_func` in the GPU
dialect, extending the op to accommodate this functionality.
The GPU dialect remains architecture-agnostic, so the CGA functionality
is added as optional parameters. This lets us leverage existing GPU
dialect mechanisms such as outlining and kernel launching, making it a
practical and convenient choice.
An example of this implementation can be seen below:
```
gpu.launch_func @kernel_module::@kernel
clusters in (%1, %0, %0) // <-- Optional
blocks in (%0, %0, %0)
threads in (%0, %0, %0)
```
The PR also introduces index and dimension Ops specific to clusters,
binding them to NVVM Ops:
```
%cidX = gpu.cluster_id x
%cidY = gpu.cluster_id y
%cidZ = gpu.cluster_id z
%cdimX = gpu.cluster_dim x
%cdimY = gpu.cluster_dim y
%cdimZ = gpu.cluster_dim z
```
We will introduce cluster support in `gpu.launch` Op in an upcoming PR.
See [the
documentation](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#cluster-of-cooperative-thread-arrays)
provided by NVIDIA for details.
It was:
```
error: there is no embedded script interpreter in this mode.
```
1. What does "mode" mean?
2. It implies there might be an embedded script interpreter for some
other "mode", whatever that would be.
So I'm simplifying it and noting the most common reason for this, which
is that lldb wasn't built with a scripting language enabled in the first
place.
There are other tips for dealing with this, but I'm not sure this
message is the best place for them.
Move GEP offset collection to a separate helper function and collect
variable and constant offsets in OffsetResult. For now, this only
supports one VariableOffset, but the new code structure can be more
easily extended to handle more offsets in the future.
The refactoring drops the check that the VariableOffset is >= -1 *
constant offset. This is not needed to determine whether the constraint
is monotonically increasing: the constant factors can be ignored, and
the constraint will be monotonically increasing if all variables are
positive.
See https://alive2.llvm.org/ce/z/ah2uSQ,
https://alive2.llvm.org/ce/z/NCADNZ
Don't only test the case where all GEPs are missing inbounds; also test
inbounds being present on only some of them. The fold should not be
performed in either case.
Inside runSemiNCA(), create a direct mapping from node number to node
info, so we can save the node number -> node pointer -> node info lookup
in many cases.
To allow this in more cases, change Label to a node number instead of a
node pointer.
I've limited this to runSemiNCA() for now, because we have the
convenient property there that no new node infos will be added, so we
don't have to worry about pointer invalidation.
This gives a pretty nice compile-time improvement of about 0.4%.
This changes the fadd legalization to handle fp16 types, and treats more
types as legal so that the backend can produce the correct patterns. The
identity fold `fadd x, -0.0 -> x` is currently missing.
CVP currently tries to fold load/store pointer operands to constants
using LVI. If there is a dominating condition of the form `icmp eq ptr
%p, @g`, then `%p` will be replaced with `@g`.
LVI is geared towards range-based optimizations, and is *very*
inefficient at handling simple pointer equality conditions. We have
other passes that can handle this optimization in a more efficient way,
such as IPSCCP and GVN.
Removing this optimization gives a geomean 0.4-1.2% compile-time
improvement depending on configuration. At the same time, there
is no impact on codegen.
When converting affine.for to a GPU launch op, we have to calculate the
block and thread dimensions for the launch op.
The formula for the dimension size is
(upper_bound - lower_bound) / step_size
When the difference is not evenly divisible by step_size, the division
rounds toward zero. However, the block and thread dimensions are
right-open ranges, i.e., [0, block_dim) and [0, thread_dim), so we get
the wrong result if we use DivSIOp. In this patch, we replace it with
CeilDivSIOp to get the correct block and thread dimension values.
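A worked example with hypothetical bounds, lower = 0, upper = 10,
step = 3: the loop body runs for 0, 3, 6, 9, i.e. four iterations.
```
// DivSIOp would compute (10 - 0) / 3 == 3 and drop the iteration at 9;
// CeilDivSIOp computes ceil((10 - 0) / 3) == 4, matching the half-open
// range [0, dim).
int gridDim(int lower, int upper, int step) {
  return (upper - lower + step - 1) / step; // ceil division for step > 0
}
```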
This patch fixes:
mlir/lib/Pass/PassRegistry.cpp:376:37: error: ISO C++ requires the
name after '::~' to be found in the same scope as the name before
'::~' [-Werror,-Wdtor-name]
This covers the following ops:
`spirv.GroupNonUniform` x {`Bitwise`, `Logical`} x {`And`, `Or`, `Xor`}
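Spelled out, that cross product is `spirv.GroupNonUniformBitwiseAnd`,
`spirv.GroupNonUniformBitwiseOr`, `spirv.GroupNonUniformBitwiseXor`,
`spirv.GroupNonUniformLogicalAnd`, `spirv.GroupNonUniformLogicalOr`, and
`spirv.GroupNonUniformLogicalXor`.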
We need these to efficiently lower from the `gpu.subgroup_reduce` op.
These two opcodes used to be followed by an MVT operand, which is always
one of i8/i16/i32/i64.
We add variants of `OPC_EmitInteger` and `OPC_EmitStringInteger`
instantiated for i8/i16/i32/i64 so that we can save one byte.
We keep `OPC_EmitInteger` and `OPC_EmitStringInteger` reserved in case
we need them someday, though I haven't found any usage after this
change.
Overall this reduces the llc binary size with all in-tree targets by
about 200K.
This short option taking an argument is unfortunate.
* If a cc1-only option starts with `-e`, using it with the driver will
not be reported as an error (e.g. commit
6cd9886c88d16d288c74846495d89f2fe84ff827).
* If another `-e` driver option is intended but a typo is made, the
option will be recognized as `-e`.
`gcc -export-dynamic` passes `-export-dynamic` to ld. It's not clear
whether some options behave this way.
It seems `-Wl,-eentry` and `-Wl,--entry=entry` are primarily used. There
may also be a few `gcc -e entry`, but `gcc -eentry` is extremely rare or
not used at all. Therefore, we probably should reject the Joined form of
`-e`.
After #72835, ExplicitVEXPrefix has changed: it is no longer a single
bit but a value in the ExplicitOpPrefix scope, so bitwise operations on
ExplicitVEXPrefix may need to be updated.