This will make it easier for callers to spot and fix up calls to
createTargetMachine after a future change to the parameters of
TargetMachine.
This matches other nearby enums.
For downstream users, this should be a fairly straightforward
replacement,
e.g. s/CodeGenOpt::Aggressive/CodeGenOptLevel::Aggressive
or s/CGFT_/CodeGenFileType::
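For illustration, a minimal before/after sketch assuming the post-change
headers (`CodeGenFileType::ObjectFile` is shown only as one example of the
`CGFT_` mapping):
```
#include "llvm/Support/CodeGen.h"

// Before: unscoped enumerators.
//   CodeGenOpt::Level OL = CodeGenOpt::Aggressive;
//   CodeGenFileType FT = CGFT_ObjectFile;
// After: scoped enum classes.
llvm::CodeGenOptLevel OL = llvm::CodeGenOptLevel::Aggressive;
llvm::CodeGenFileType FT = llvm::CodeGenFileType::ObjectFile;
```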
Summary:
This patch implements the `fgets`, `getc`, `fgetc`, and `getchar`
functions on the GPU. Their implementations are straightforward enough.
One thing worth noting is that the implementation of `fgets` will be
extremely slow due to the high latency of reading a single char at a
time. A faster solution would be to add a dedicated RPC call for `fgets`
(needed because of the special rule that a newline or null ends the
read), but this is left out because performance isn't the primary
concern here.
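As a rough illustration of why this is slow, here is a hedged sketch (not the
actual llvm-libc code) of `fgets` layered on a per-character read, where each
character costs a full round trip:
```
#include <cstdio>

// Sketch only: build fgets on top of fgetc, stopping at newline or EOF.
char *fgets_via_fgetc(char *buf, int size, FILE *stream) {
  if (size < 1)
    return nullptr;
  int i = 0;
  while (i < size - 1) {
    int c = ::fgetc(stream);           // one (potentially high-latency) read
    if (c == EOF)
      break;
    buf[i++] = static_cast<char>(c);
    if (c == '\n')                     // newline terminates the line
      break;
  }
  if (i == 0)                          // EOF before any character was read
    return nullptr;
  buf[i] = '\0';
  return buf;
}
```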
Currently, clang emits LLVM IR that fails the verifier for the following
code:
```
template<typename T>
__global__ void foo(T x);
void bar() {
foo<<<1, 1>>>(0);
}
```
This is due to clang putting the kernel handle for foo into comdat,
which is not allowed, since the kernel handle is a declaration.
The situation is similar to calling a declaration-only template
function. The callee will be a declaration in LLVM IR and won't be put
into comdat. This is in contrast to calling a template function with
body, which will be put into comdat.
Fixes: SWDEV-419769
This change removes the requirement that the row stride be statically known when
converting `vector.transfer_read` and `vector.transfer_write` to distributed
SIMT operations in the `nvgpu` lowering path. It also adds a check to verify
that the last dimension of the source memref is statically known to have stride
1 since this is assumed in the conversion logic. No other change should be
required since the generated `vector.load` operations are never created across
dimensions other than the last. The routines for checking preconditions on
`vector.transfer_read/write` are moved under the nvgpu utilities.
The change is NFC with respect to the GPU dialect lowering path.
Reviewed By: ThomasRaoux
Differential Revision: https://reviews.llvm.org/D155753
When we implemented C++20's P0674R1, we didn't enable the part of
P0674R1 that was resolving LWG2070 as a DR. This patch fixes that and
makes sure that we consistently go through the allocator when
constructing and destroying the underlying object in
std::allocate_shared.
Fixes #54365.
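A hedged illustration of the observable difference (a toy tracing allocator;
names are made up): with the LWG2070 resolution applied,
`std::allocate_shared` must construct and destroy the element through the
allocator, so both hooks fire exactly once.
```
#include <cassert>
#include <cstddef>
#include <memory>
#include <new>
#include <utility>

static int constructs = 0, destroys = 0;

template <class T> struct TracingAlloc {
  using value_type = T;
  TracingAlloc() = default;
  template <class U> TracingAlloc(const TracingAlloc<U> &) {}
  T *allocate(std::size_t n) { return std::allocator<T>{}.allocate(n); }
  void deallocate(T *p, std::size_t n) { std::allocator<T>{}.deallocate(p, n); }
  template <class U, class... Args> void construct(U *p, Args &&...args) {
    ++constructs;  // element construction goes through the allocator
    ::new (static_cast<void *>(p)) U(std::forward<Args>(args)...);
  }
  template <class U> void destroy(U *p) {
    ++destroys;    // element destruction goes through the allocator
    p->~U();
  }
  template <class U> bool operator==(const TracingAlloc<U> &) const { return true; }
  template <class U> bool operator!=(const TracingAlloc<U> &) const { return false; }
};

int main() {
  { auto sp = std::allocate_shared<int>(TracingAlloc<int>{}, 42); }
  assert(constructs == 1 && destroys == 1);
}
```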
In C++ it seems it is legitimate to use a base class's assignment
operator (e.g. `using Base::operator=`) to perform a copy if the base
class is the common ancestor of the source and destination objects. In
such a case we shouldn't try to access fields beyond those of the base
class. However, such a case seems to be very rare (typical code would
implement a copy constructor instead) and handling it would add
complexity, so in this patch we simply bail if the method operator's
parent class is different from the type of the destination object that
this framework recognizes.
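A minimal sketch of the pattern in question (illustrative names):
```
struct Base {
  int b = 0;
  Base &operator=(const Base &) = default;
};

struct Derived : Base {
  int d = 0;
  using Base::operator=;  // expose Base's copy assignment in Derived
};

void copyBasePart(Derived &dst, const Derived &src) {
  // Calls Base::operator=(const Base &): only the Base subobject is copied,
  // dst.d is left untouched.
  dst = static_cast<const Base &>(src);
}
```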
This reverts commit 070493ddbd9473499d6f00ca62bc6aa92808ed79 (and
relands the original change). This removes a test run that makes an
assumption of RTTI being on by default for a given target.
Multiplying raw block frequency with an integer carries a high risk
of overflow.
- Add `BlockFrequency::mul`, returning an std::optional with the product
or `nullopt` to indicate an overflow (see the sketch below).
- Fix two instances where overflow was likely.
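A free-standing sketch of the overflow-checked multiply idea (not the exact
`BlockFrequency::mul` signature):
```
#include <cstdint>
#include <limits>
#include <optional>

// Return the product, or nullopt if it would not fit in 64 bits.
std::optional<uint64_t> checkedMul(uint64_t Freq, uint64_t N) {
  if (N != 0 && Freq > std::numeric_limits<uint64_t>::max() / N)
    return std::nullopt;  // overflow: caller must handle explicitly
  return Freq * N;
}
```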
This pass will upgrade DXIL-style LLVM constructs (which are mostly
metadata) into the representations we use in LLVM for the same concepts.
For now we just strip the valver metadata, which we don't need. Later
changes will make this pass more useful, and then we should be able to
wire it into clang and possibly the DirectX backend's AsmParser.
The remaining use of ConstString in StructuredData is the Dictionary
class. Internally it's backed by a `std::map<ConstString, ObjectSP>`.
I propose that we replace it with a `llvm::StringMap<ObjectSP>`.
Many StructuredData::Dictionary objects are ephemeral and only exist for
a short amount of time. Many of these Dictionaries are only produced
once and are never used again. That leaves us with a lot of string data
in the ConstString StringPool that is sitting there never to be used
again. Even if the same string is used many times for keys of different
Dictionary objects, that is something we can measure and adjust for
instead of assuming that every key may be reused at some point in the
future.
Quick comparisons of key data are likely not a concern with Dictionary,
but the use of `llvm::StringMap` means that lookups should be fast with
its hashing strategy.
Switching to a llvm::StringMap meant that the iteration order may be
different. To account for this when serializing/dumping the dictionary,
I added some code to sort the output by key before emitting anything.
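A hedged sketch of that serialization approach (illustrative, not the actual
StructuredData code):
```
#include "llvm/ADT/StringMap.h"
#include "llvm/ADT/StringRef.h"
#include <algorithm>
#include <vector>

// Collect the keys, sort them, then emit entries in that deterministic order;
// StringMap iteration order is otherwise unspecified.
template <typename ValueT, typename EmitFn>
void dumpSorted(const llvm::StringMap<ValueT> &Map, EmitFn Emit) {
  std::vector<llvm::StringRef> Keys;
  Keys.reserve(Map.size());
  for (const auto &Entry : Map)
    Keys.push_back(Entry.getKey());
  std::sort(Keys.begin(), Keys.end());
  for (llvm::StringRef Key : Keys)
    Emit(Key, Map.find(Key)->getValue());
}
```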
Differential Revision: https://reviews.llvm.org/D159313
This change adds a method to modify the ConversionTarget used during
`transform.apply_conversion_patterns` to the
`ConversionPatternDescriptorOpInterface`. This is needed when the TypeConverter
is used to dictate the dynamic legality of operations, as in "structural"
conversion patterns present in, for example, the SCF and func dialects.
As a first use case/test, this change also adds a
`transform.apply_patterns.scf.structural_conversions` operation to the SCF
dialect.
Reviewed By: springerm
Differential Revision: https://reviews.llvm.org/D158672
Close #57618: currently we align the end of PT_GNU_RELRO to a
common-page-size boundary, but do not align the end of the associated
PT_LOAD. This is benign when runtime_page_size >= common-page-size.
However, when runtime_page_size < common-page-size, it is possible that
`alignUp(end(PT_LOAD), page_size) < alignDown(end(PT_GNU_RELRO), page_size)`.
In this case, rtld's mprotect call for PT_GNU_RELRO will apply to
unmapped regions and lead to an error, e.g.
```
error while loading shared libraries: cannot apply additional memory protection after relocation: Cannot allocate memory
```
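A hedged numeric illustration of that inequality, with made-up addresses and
page sizes (common-page-size = 16 KiB, runtime page size = 4 KiB):
```
#include <cstdint>
#include <cstdio>

static uint64_t alignUp(uint64_t V, uint64_t A) { return (V + A - 1) & ~(A - 1); }
static uint64_t alignDown(uint64_t V, uint64_t A) { return V & ~(A - 1); }

int main() {
  const uint64_t CommonPage = 0x4000, RuntimePage = 0x1000; // runtime < common
  const uint64_t EndLoad = 0x20500;                       // unaligned end of PT_LOAD
  const uint64_t EndRelro = alignUp(EndLoad, CommonPage); // 0x24000, aligned RELRO end
  // rtld maps up to alignUp(end(PT_LOAD)) but mprotects up to
  // alignDown(end(PT_GNU_RELRO)), per the inequality quoted above.
  std::printf("mapped up to   0x%llx\n",
              (unsigned long long)alignUp(EndLoad, RuntimePage));    // 0x21000
  std::printf("mprotect up to 0x%llx\n",
              (unsigned long long)alignDown(EndRelro, RuntimePage)); // 0x24000
  // 0x21000 < 0x24000: mprotect touches pages that were never mapped.
}
```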
To fix the issue, add a padding section .relro_padding like mold, which
is contained in the PT_GNU_RELRO segment and the associated PT_LOAD
segment. The section also prevents strip from corrupting PT_LOAD program
headers.
.relro_padding has the largest `sortRank` among RELRO sections.
Therefore, it is naturally placed at the end of `PT_GNU_RELRO` segment
in the absence of `PHDRS`/`SECTIONS` commands.
In the presence of `SECTIONS` commands, we place .relro_padding
immediately before a symbol assignment using DATA_SEGMENT_RELRO_END (see
also https://reviews.llvm.org/D124656), if present.
DATA_SEGMENT_RELRO_END is changed to align to max-page-size instead of
common-page-size.
Some edge cases worth mentioning:
* ppc64-toc-addis-nop.s: when PHDRS is present, do not append .relro_padding.
* avoid-empty-program-headers.s: when the only RELRO section is .tbss,
it is not part of PT_LOAD segment, therefore we do not append
.relro_padding.
---
Close #65002: GNU ld from 2.39 onwards aligns the end of PT_GNU_RELRO
to a max-page-size boundary (https://sourceware.org/PR28824) so that the
last page is protected even if runtime_page_size > common-page-size.
In my opinion, losing protection for the last page when the runtime page
size is larger than common-page-size is not really an issue. Double
mapping a page of up to max-common-page for the protection could cause
undesired VM waste. Internally we had users complaining about 2MiB
max-page-size applying to shared objects.
Therefore, the end of .relro_padding is padded to a common-page-size
boundary. Users who are really anxious can set common-page-size to match
their runtime page size.
---
17 tests need updating as there are lots of change detectors.
This records facts that are not sensitive to the current flow condition,
and should apply to all environments.
The motivating case is recording information about where a Value
originated, such as nullability:
- we may see the same Value for multiple expressions (e.g. reads of the
same field) in multiple environments (multiple blocks or iterations)
- we want to record information only when we first see the Value
(e.g. Nullability annotations on fields only add information if we
don't know where the value came from)
- this information should be expressible as a SAT condition
- we must add this SAT condition to every environment where the
Value may appear
We solve this by recording the information in the global condition.
This doesn't seem particularly elegant, but solves the problem and is
a fairly small and natural extension of the Environment.
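A purely illustrative toy of this idea (hypothetical names, not the
framework's API): the fact is phrased as a SAT constraint over an atomic
boolean and conjoined into one global condition, recorded only the first time
the Value is seen in any environment.
```
#include <set>
#include <string>
#include <vector>

struct GlobalCondition {
  std::vector<std::string> Conjuncts;           // conjunction of SAT formulas
  void assume(std::string F) { Conjuncts.push_back(std::move(F)); }
};

struct OriginRecorder {
  GlobalCondition &GC;
  std::set<const void *> Seen;                  // Values already recorded

  // Called from any environment that observes Val; the fact is global, so it
  // is added exactly once and then holds everywhere Val appears.
  void recordNonnullFieldOrigin(const void *Val, int IsNullAtom) {
    if (Seen.insert(Val).second)
      GC.assume("!atom" + std::to_string(IsNullAtom));
  }
};
```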
Alternatives considered:
- store the constraint directly as a property on the Value.
But it's more composable for such properties to always be variables
(AtomicBoolValue), and constrain them with SAT conditions.
- add a hook whenever values are created, giving the analysis the
chance to populate them.
However, the framework relies on/provides the ability to construct
values in arbitrary places without providing the context such a hook
would need, so this would be a very invasive change.
The majority of UnixSignals strings are static in the sense that they do
not change. The overwhelming majority of these strings are string
literals. Using ConstString to manage their lifetime does not make
sense. The only exception to this is one of the subclasses of
UnixSignals, for which I have created a StringSet local to that file
which will guarantee the lifetimes of these StringRefs.
As for the other benefits of ConstString, string uniqueness is not a
concern (as many of them are already string literals) and comparing
signal names and aliases should not be a hot path.
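A hedged sketch of that pattern (illustrative, not the actual UnixSignals
code):
```
#include "llvm/ADT/StringRef.h"
#include "llvm/ADT/StringSet.h"

// A file-local set owns copies of any non-literal signal name, so the
// StringRefs handed out stay valid without going through ConstString.
static llvm::StringRef internSignalName(llvm::StringRef Name) {
  static llvm::StringSet<> Storage;
  return Storage.insert(Name).first->getKey();
}
```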
Differential Revision: https://reviews.llvm.org/D159011
Rationale:
These libraries provided COO input and output at external boundaries
which, since then, has been generalized to the much more powerful pack
and unpack operations of the sparse tensor dialect.
The issue with these test failures is that the dSYM was not being found
by lldb, which is why setting breakpoints was failing and lldb quit
without performing any steps. This change copies the dSYM to the same
temp directory that the executable is copied to.
The ampere1 scheduling model uses IsCheapLSL predicates for ADDXri and
ADDWrr instructions, which only have 3 operands. In attempting to check
that the third is a shift, the predicate can attempt to access an out of
bounds operand, hitting an assert. This splits the rr/ri instructions
(which can never have shifts) from the rs/rx instructions to ensure they
both work correctly. Ampere1Write_1cyc_1AB was chosen for the rr/ri
instructions to match the cheap case.
This also sets CompleteModel = 0 for the ampere1 scheduling model, as at
runtime under a debug build it will otherwise check that, in addition to
all instructions having scheduling info, there is information for each
output operand, failing with:
```
DefIdx 1 exceeds machine model writes for
renamable $w9, renamable $w8 = LDPWi renamable $x8, 0
(Try with MCSchedModel.CompleteModel set to false)incomplete machine model
```
`_get_op_result_or_value` was used in mix-ins to unify the handling of
op results and values. However, that function is now called in the
generated constructors, such that doing so in the mix-ins is not
necessary anymore.
Somewhat overdue... it has been a few years since I stopped watching block frequency / branch weight patches actively, so I effectively stopped acting as code owner a while ago. Reflect the reality.
Still happy to help out; feel free to pull me in if you think I might have useful context!
On some AArch64 cores, including Ampere's ampere1 and ampere1a
architectures, load and store pair instructions are faster compared to
simple loads/stores only when the alignment of the pair is at least
twice that of the individual element being loaded.
Based on that, this patch introduces four new subtarget features, two
for controlling ldp and two for controlling stp, to cover the ampere1
and ampere1a alignment needs and to enable optional fine-grained control
over ldp and stp generation in general. The latter can be utilized by
other CPUs if a policy different from the compiler's default is
beneficial.
More specifically, for each of ldp and stp respectively we have:
- disable-ldp/disable-stp: Do not emit ldp/stp.
- ldp-aligned-only/stp-aligned-only: Emit ldp/stp only if the source
pointer is aligned to at least double the alignment of the type.
Therefore, for -mcpu=ampere1 and -mcpu=ampere1a
ldp-aligned-only/stp-aligned-only become the defaults, because of the
benefit from the alignment, whereas for the rest of the cpus the default
behaviour of the compiler is maintained.
This patch syncs the logic inside `getInputFunc` that selects
the library API and the logic in `createIoRuntimeCallForItem`
that creates the input arguments for the library call.
There were cases where we selected the `InputDerivedType` API but
passed only two arguments, and cases where we selected `InputDescriptor`
but passed three arguments.
It turns out we also were incorrectly selecting `OutputDescriptor`
in `getOutputFunc` (`test4` case in the new LIT test),
which caused runtime issues for output of a derived type
with descriptor components (due to the missing non-type-bound table).
Since we no longer support typed pointers in LLVM IR, the PtrASXTy
in isLoadInvariantInLoop was set to be equal to Addr->getType() (an
opaque ptr in the same address space). That made the loop looking
through bitcasts redundant.
Add extra error checking (in debug mode) to detect cases where an
iterator on "direct" payload IR mappings is invalidated (due to elements
being removed). Such errors are hard to debug: they are often
non-deterministic; sometimes the program crashes, sometimes it produces
wrong results. Even when it crashes, the stack trace often points to
completely unrelated code locations.
Store a timestamp with each "direct" mapping. The timestamp is increased
whenever an operation is performed that invalidates an iterator on that
mapping. A debug iterator is added that checks the timestamp as payload
IR is enumerated.
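A hedged, self-contained sketch of the timestamp-checked iterator technique
(illustrative names, not the transform dialect's actual classes):
```
#include <cassert>
#include <cstdint>
#include <vector>

template <typename T> class TimestampedVector {
  std::vector<T> Data;
  uint64_t Timestamp = 0;

public:
  class DebugIterator {
    const TimestampedVector *Parent;
    size_t Index;
    uint64_t Snapshot; // timestamp at iterator creation

  public:
    DebugIterator(const TimestampedVector *P, size_t I)
        : Parent(P), Index(I), Snapshot(P->Timestamp) {}
    const T &operator*() const {
      assert(Snapshot == Parent->Timestamp &&
             "iterator was invalidated by a mutation");
      return Parent->Data[Index];
    }
    DebugIterator &operator++() { ++Index; return *this; }
    bool operator!=(const DebugIterator &O) const { return Index != O.Index; }
  };

  void push_back(T V) { Data.push_back(std::move(V)); }
  void erase(size_t I) {
    Data.erase(Data.begin() + I);
    ++Timestamp; // invalidates all outstanding iterators
  }
  DebugIterator begin() const { return DebugIterator(this, 0); }
  DebugIterator end() const { return DebugIterator(this, Data.size()); }
};
```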
The scf.forall.in_parallel terminator operation has a nested graph
region with the NoTerminator trait. Such regions are not supported by
the default implementations. Therefore, this commit adds a specialized
implementation for this operation which only covers the case where the
nested region is empty. This is because after bufferization, ops like
tensor.parallel_insert_slice have already been converted to memref
operations residing in the scf.forall only, and the nested region of
scf.forall.in_parallel ends up empty.
D154280 (landed in 64d19542e78a in July, 2023) implements
`--per-test-coverage` (which can also be specified via
`lit_config.per_test_coverage`). However, it has a few issues, which
the current patch addresses:
1. D154280 implements `--per-test-coverage` only for the case that lit
is configured to use an external shell. The current patch extends
the implementation to lit's internal shell.
2. In the case that lit is configured to use an external shell,
regardless of whether `--per-test-coverage` is actually specified,
D154280 causes `%dbg(RUN: at line N)` to be expanded in RUN lines
early and in a manner that is specific to sh-like shells. As a
result, later code in lit that expands it in a shell-specific
manner is useless as there's nothing left to expand. The current
patch cleans up the implementation to avoid useless code.
3. Because of issue 2, D154280 corrupts support for Windows `cmd` as
an external shell (effectively comments out all RUN lines with
`:`). The current patch happens to fix that particular corruption
by addressing issue 2. However, D122569 (landed in 1041a9642ba0 in
April, 2022) had already broken support for Windows `cmd` as an
external shell (discards RUN lines when expanding `%dbg(RUN: at
line N)`). The current patch does not attempt to fix that bug.
For further details, see the PR discussion of the current patch.
The current patch addresses the above issues by implementing
`--per-test-coverage` before selecting the shell (internal or
external) and by leaving `%dbg(RUN: at line N)` unexpanded there.
Thus, it is expanded later in a shell-specific manner, as before
D154280.
This patch introduces `buildPdbgCommand` into lit's implementation to
encapsulate the process of building (or rebuilding in the case of the
`--per-test-coverage` implementation) a full `%dbg(RUN: at line N)
cmd` line and asserting that the result matches `kPdbgRegex`. It also
cleans up that and all other uses of `kPdbgRegex` to operate on the
full line with `re.fullmatch` not `re.match`. This change better
reflects the intention in every case, but it is expected to be NFC
because `kPdbgRegex` ends in `.*` and thus avoids the difference
between `re.fullmatch` and `re.match`. The only caveat is that `.*`
does not match newlines, but RUN lines cannot contain newlines
currently, so this caveat currently shouldn't matter in practice.
The original `--per-test-coverage` implementation avoided accumulating
`export LLVM_PROFILE_FILE={profile}` insertions across retries (due to
`ALLOW_RETRIES`) by skipping the insertion if `%dbg(RUN: at line N)`
was not present and thus had already been expanded. However, the
current patch makes sure the insertions also happen for commands
without `%dbg(RUN: at line N)`, such as preamble commands or some
commands from other lit test formats. Thus, the current patch
implements a different mechanism to avoid accumulating those
insertions (see code comments).