This patch removes the non compliant constructor of std::future_error
and adds the standards compliant constructor in C++17 instead.
Note that we can't support the constructor as an extension in all
standard modes because it uses delegating constructors, which require
C++11. We could in theory support the constructor as an extension in
C++11 and C++14 only, however I believe it is acceptable not to do that
since I expect the breakage from this patch will be minimal.
If it turns out that more code than we expect is broken by this, we can
reconsider that decision.
This was found during D99515.
Differential Revision: https://reviews.llvm.org/D99567
Co-authored-by: Louis Dionne <ldionne.2@gmail.com>
Metadata (via ValueAsMetadata) can reference constant expressions
that may no longer be supported. These references can both be in
function-local metadata and module metadata, if the same expression
is used in multiple functions. At least in theory, such references
could also be in metadata proper, rather than just inside
ValueAsMetadata references in calls.
Instead of trying to expand these expressions (which we can't
reliably do), pretend that the constant has been deleted, which
means that ValueAsMetadata references will get replaced with
undef metadata.
Fixes https://github.com/llvm/llvm-project/issues/68281.
Fixes MUBUF path for most vectors and pointers, which unblocks fixing
the gfx6/7 run lines in assorted tests. Also fixes inconsistent behavior
for -flat-for-global.
* `tensor.collapse_shape` may bufferize to a memory read because the op
may have to reallocate the source buffer.
* `tensor.reshape` should not use `bufferization.clone` for
reallocation. This op has requirements wrt. the order of buffer
writes/reads. Use `memref.alloc` and `memref.copy` instead. Also fix a
bug where the memory space of the source buffer was not propagated to
the reallocated buffer.
…st)`
`expand-strided-metadata` was missing a pattern to get rid of
`memref.cast`.
The pattern is straight foward:
Produce a new `extract_strided_metadata` with the source of the cast and
fold the static information (sizes, strides, offset) along the way.
This patches implements the auto keyword from the N3007 standard
specification.
This allows deducing the type of the variable like in C++:
```
auto nb = 1;
auto chr = 'A';
auto str = "String";
```
The list of statements which allows the usage of auto:
* Basic variables declarations (int, float, double, char, char*...)
* Macros declaring a variable with the auto type
The list of statements which will not work with the auto keyword:
* auto arrays
* sizeof(), alignas()
* auto parameters, auto return type
* auto as a struct/typedef member
* uninitialized auto variables
* auto in an union
* auto as a enum type specifier
* auto casts
* auto in an compound literals
Differential Revision: https://reviews.llvm.org/D133289
`BufferizableOpInterface::bufferizesToAllocation` is queried when
forming equivalence sets during bufferization. It is not really needed
for ops like `tensor.empty` which do not have tensor operands, but it
should be added for consistency.
This change should have been part of #68080. No test is added because
the return value of this function is irrelevant for ops without tensor
operands. (However, this function acts as a form documentation,
describing the bufferization semantics of the op.)
Add `dump_alias_sets` to `transform.bufferization.one_shot_bufferize`.
This option is useful for debugging. Also improve the verifier to ensure
that `test_analysis_only` is set when other debugging flags are enabled.
Add initial half/bfloat broadcast shuffles test coverage (more to follow)
Fixes#68117 - which was stuck in a loop between getting scalarized insert/extract costs for the shuffle and then trying to convert a bfloat insert into a shuffle again......
This fixes many unit tests when trying to enable IncrementalExtensions
by default for testing purposes.
Differential Revision: https://reviews.llvm.org/D158415
Add vscale range attirbute for the Scalable Vector Extension (SVE) if
provided on the command-line (options in a previous commit)
If no command-line option is provided, if the target-feature of SVE is
specified and the architecture is AArch64, it defualts to 128-2048. in
other words a vscale-min of 1, vscale-max of 16.
A pass is used to add the atribute to all functions. The vectorizer will
use this attribute to generate the SVE instruction to match the range
specified. The attribute is harmless if there is no vectorizable
operations in the function.
* Fixes#67977, a crash in `empty-tensor-elimination`.
* Also improves `linalg.copy` canonicalization.
* Also improves indentation indentation in `mlir-linalg-ods-yaml-gen.cpp`.
Currently this non-trivial calculation is repeated multiple times,
making it hard to reason about when the
`byte_offset`/`member_byte_offset` is being set or not.
This patch simply moves all those instances of the same calculation into
a helper function.
We return an optional to remain an NFC patch. Default initializing the
offset would make sense but requires further analysis and can be done in
a follow-up patch.
The bounds of a c++ array is a _constant-expression_. And in C++ it is
also a constant expression.
But we also support VLAs, ie arrays with non-constant bounds.
We need to take care to handle the case of a consteval function (which
are specified to be only immediately called in non-constant contexts)
that appear in arrays bounds.
This introduces `Sema::isAlwayConstantEvaluatedContext`, and a flag in
ExpressionEvaluationContextRecord, such that immediate functions in
array bounds are always immediately invoked.
Sema had both `isConstantEvaluatedContext` and
`isConstantEvaluated`, so I took the opportunity to cleanup that.
The change in `TimeProfilerTest.cpp` is an unfortunate manifestation of
the problem that #66203 seeks to address.
Fixes#65520
Since the getMaximisedVFForTarget function is called twice, once for fixed-width and once for scalable, it adds no value to always return a fixed-width VF. Instead, when we are tail-folding, we can use either fixed-width or scalable vectors.
When performing store to load forwarding, replacing users of the
load may turn an indirect call into one with a known callee, in
which case it might become willreturn, invalidating cached ICF
information. Avoid this by removing users.
This is a bit more aggressive than strictly necessary (e.g. this
shouldn't be necessary when doing load-load CSE), but better safe
than sorry.
Fixes https://github.com/llvm/llvm-project/issues/48805.
This new method repeatedly calls Lex() until end of file is reached
and optionally fills a std::vector of Tokens. Use it in Clang's unit
tests to avoid quite some code duplication.
Differential Revision: https://reviews.llvm.org/D158413
Long tail calls use the following instruction sequence on RISC-V:
```
1: auipc xi, %pcrel_hi(sym)
jalr zero, %pcrel_lo(1b)(xi)
```
Since the second instruction in isolation looks like an indirect branch,
this confused BOLT and most functions containing a long tail call got
marked with "unknown control flow" and didn't get optimized as a
consequence.
This patch fixes this by detecting long tail call sequence in
`analyzeIndirectBranch`. `FixRISCVCallsPass` also had to be updated to
expand long tail calls to `PseudoTAIL` instead of `PseudoCALL`.
Besides this, this patch also fixes a minor issue with compressed tail
calls (`c.jr`) not being detected.
Note that I had to change `BinaryFunction::postProcessIndirectBranches`
slightly: the documentation of `MCPlusBuilder::analyzeIndirectBranch`
mentions that the [`Begin`, `End`) range contains the instructions
immediately preceding `Instruction`. However, in
`postProcessIndirectBranches`, *all* the instructions in the BB where
passed in the range. This made it difficult to find the preceding
instruction so I made sure *only* the preceding instructions are passed.
This PR introduces a new Op called `warpgroup.mma.store` to the NVGPU
dialect of MLIR. The purpose of this operation is to facilitate storing
fragmanted result(s) `nvgpu.warpgroup.accumulator` produced by
`warpgroup.mma` to the given memref.
An example of fragmentated matrix is given here :
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#wgmma-64n16-d
The `warpgroup.mma.store` does followings:
1) Takes one or more `nvgpu.warpgroup.accumulator` type (fragmented
results matrix)
2) Calculates indexes per thread in warp-group and stores the data into
give memref.
Here's an example usage:
```
// A warpgroup performs GEMM, results in fragmented matrix
%result1, %result2 = nvgpu.warpgroup.mma ...
// Stores the fragmented result to memref
nvgpu.warpgroup.mma.store [%result1, %result2], %matrixD :
!nvgpu.warpgroup.accumulator< fragmented = vector<64x128xf32>>,
!nvgpu.warpgroup.accumulator< fragmented = vector<64x128xf32>>
to memref<128x128xf32,3>
```
Handle the following relocations related to TLS local-exec and
initial-exec:
- R_RISCV_TLS_GOT_HI20
- R_RISCV_TPREL_HI20
- R_RISCV_TPREL_ADD
- R_RISCV_TPREL_LO12_I
- R_RISCV_TPREL_LO12_S
In addition, GNU ld has a quirk where after TLS le relaxation, two
unofficial relocation types may be emitted:
- R_RISCV_TPREL_I
- R_RISCV_TPREL_S
Since they are unofficial (defined in the reserved range of relocation
types), LLVM does not define them. Hence, I've defined them locally in
BOLT in a private namespace.
If libc++ is available and should be used as the ubsan C++ ABI library,
the check for libc++ might fail if libc++ is a static library, as the
-nodefaultlibs flag inhibits a potential compiler default -lunwind.
Just like the -nodefaultlibs configuration tests for and manually adds a
bunch of compiler default libraries, look for -lunwind too.
This is a reland of #65912.
with an explicit parameter.
We tried to read a pointer to a non-existent `This` APValue when
constant-evaluating an explicit object lambda call operator (the `this`
pointer is never set in explicit object member functions)
Fixes#68070
This PR introduces substantial improvements to the readability and
maintainability of the `nvgpu.warpgroup.mma` Op transformation from
nvgpu->nvvm. This transformation plays a crucial role in GEMM and
manages complex operations such as generating multiple wgmma ops and
iterating their descriptors. The prior code lacked clarity, but this PR
addresses that issue effectively.
**PR does followings:**
**Introduces a helper class:** `WarpgroupGemm` class encapsulates the
necessary functionality, making the code cleaner and more
understandable.
**Detailed Documentation:** Each function within the helper class is
thoroughly documented to provide clear insights into its purpose and
functionality.
While we pretty much always want to pass DT, AC and CxtI, most
places don't care about TLI. Add an overload where this is not
one of the first parameters.
This PR moves some calculation out of `LowerCall` and into
`SystemZXPLINKFrameLowering::processFunctionBeforeFrameFinalized`.
We need to make this change because LowerCall isn't invoked for
functions that don't have function calls, and it is required for some
tooling to work correctly. A function that does not make any calls is
required to allocate 32 bytes for the parameter area required by the
ABI. However, we allocate 64 bytes because this additional space is
utilized by certain tools, like the debugger.
Co-authored-by: Yusra Syeda <yusra.syeda@ibm.com>