Also, amend constraints for non-sync variants that are no longer
available on sm_70+ with PTX6.4+.
Differential Revision: https://reviews.llvm.org/D68892
llvm-svn: 374790
The static analyzer is warning about a potential null dereference, but we should be able to use cast<> directly and if not assert will fire for us.
llvm-svn: 374788
Add specific scalar costs for CTLZ instructions, we can't discriminate between CTLZ and CTLZ_ZERO_UNDEF so we have to assume the worst. Given how BSR is often a microcoded nightmare on some older targets we might still be underestimating it.
For targets supporting LZCNT (Intel Haswell+ or AMD Fam10+), we provide overrides that assume 1cy costs.
llvm-svn: 374786
Add a pass to lower is.constant and objectsize intrinsics
This pass lowers is.constant and objectsize intrinsics not simplified by
earlier constant folding, i.e. if the object given is not constant or if
not using the optimized pass chain. The result is recursively simplified
and constant conditionals are pruned, so that dead blocks are removed
even for -O0. This allows inline asm blocks with operand constraints to
work all the time.
The new pass replaces the existing lowering in the codegen-prepare pass
and fallbacks in SDAG/GlobalISEL and FastISel. The latter now assert
on the intrinsics.
Differential Revision: https://reviews.llvm.org/D65280
llvm-svn: 374784
The adds both VMOVNt and VMOVNb instruction selection from the appropriate
shuffles. We detect shuffle masks of the form:
0, N, 2, N+2, 4, N+4, ...
or
0, N+1, 2, N+3, 4, N+5, ...
ISel will also try the opposite patterns, with inputs reversed. These are
selected to VMOVNt and VMOVNb respectively.
Differential Revision: https://reviews.llvm.org/D68283
llvm-svn: 374781
Add specific scalar costs for ctpop instructions, these are based on the llvm-mca's SLM throughput numbers (the oldest model we have).
For targets supporting POPCNT, we provide overrides that assume 1cy costs.
llvm-svn: 374775
Materialize accesses to SVE frame objects from SP or FP, whichever is
available and beneficial.
This patch still assumes the objects are pre-allocated. The automatic
layout of SVE objects within the stackframe will be added in a separate
patch.
Reviewers: greened, cameron.mcinally, efriedma, rengolin, thegameg, rovka
Reviewed By: cameron.mcinally
Differential Revision: https://reviews.llvm.org/D67749
llvm-svn: 374772
Summary:
This addresses a bug in collectCallSiteParameters() where call site
immediates would be truncated from int64_t to unsigned.
This fixes PR43525.
Reviewers: djtodoro, NikolaPrica, aprantl, vsk
Reviewed By: aprantl
Subscribers: hiraditya, llvm-commits
Tags: #debug-info, #llvm
Differential Revision: https://reviews.llvm.org/D68869
llvm-svn: 374770
This patch introduces the following changes to the btver2 scheduling model:
- The number of micro opcodes for YMM loads and stores is now 2 (it was
incorrectly set to 1 for both aligned and misaligned loads/stores).
- Increased the number of AGU resource cycles for YMM loads and stores
to 2cy (instead of 1cy).
- Removed JFPU01 and JFPX from the list of resources consumed by pure
float/vector loads (no MMX).
I verified with llvm-exegesis that pure XMM/YMM loads are no-pipe. Those
are dispatched to the FPU but not really issues on JFPU01.
Differential Revision: https://reviews.llvm.org/D68871
llvm-svn: 374765
Add an extra parameter so the backend can take the alignment into
consideration.
Differential Revision: https://reviews.llvm.org/D68400
llvm-svn: 374763
This prevents isel from emitting a TEST instruction that
optimizeCompareInstr will need to remove later.
In some of the modified tests, the SUB gets duplicated due to
the flags being needed in two places and being clobbered in
between. optimizeCompareInstr was able to optimize away the TEST
that was using the result of one of them, but optimizeCompareInstr
doesn't know to turn SUB into CMP after removing the TEST. It
only knows how to turn SUB into CMP if the result was already
dead.
With this change the TEST never exists, so optimizeCompareInstr
doesn't have to remove it. Then it can just turn the SUB into
CMP immediately.
Fixes PR43649.
llvm-svn: 374755
This pass lowers is.constant and objectsize intrinsics not simplified by
earlier constant folding, i.e. if the object given is not constant or if
not using the optimized pass chain. The result is recursively simplified
and constant conditionals are pruned, so that dead blocks are removed
even for -O0. This allows inline asm blocks with operand constraints to
work all the time.
The new pass replaces the existing lowering in the codegen-prepare pass
and fallbacks in SDAG/GlobalISEL and FastISel. The latter now assert
on the intrinsics.
Differential Revision: https://reviews.llvm.org/D65280
llvm-svn: 374743
No-return and will-return are exclusive, assuming the latter is more
prominent we can avoid updates of the former unless will-return is not
known for sure.
llvm-svn: 374739
Even if an argument is captured, we cannot have an effect the function
does not have. This is fine except for the special case of `inalloca` as
it does not behave by the rules.
TODO: Maybe the special rule for `inalloca` is wrong after all.
llvm-svn: 374736
Summary:
This changes "CHECK" check lines to "ATTRIBUTOR" check lines where
necessary and also fixes the now exposed, mostly minor, problems.
Reviewers: sstefan1, uenoku
Subscribers: hiraditya, bollu, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D68929
llvm-svn: 374735
We were already controlling whether the KnownZero elements were being written to the target mask, this extends it to the KnownUndef elements as well so we can prevent the target shuffle mask being manipulated at all.
llvm-svn: 374732
This enables use of the saturating truncate instructions when the
result type is less than 128 bits. It also enables the use of
saturating truncate instructions on KNL when the input is less
than 512 bits. We can do this by widening the input and then
extracting the result.
llvm-svn: 374731
Follow-up to D68244 to account for a corner case discussed in:
https://bugs.llvm.org/show_bug.cgi?id=43501
Add one more restriction: if the pointer is deref-or-null and in a non-default
(non-zero) address space, we can't assume inbounds.
Differential Revision: https://reviews.llvm.org/D68706
llvm-svn: 374728
The CmpInst::getType() calls can be replaced by just using User::getType() that it was dyn_cast from, and we then need to assert that any default predicate cases came from the CmpInst.
llvm-svn: 374716
This seems to improve std::midpoint code where we have a min and
a max with the same condition. If we split the setcc we can end
up with two compares if the one of the operands is a constant.
Since we aggressively canonicalize compares with constants.
For non-constants it can interfere with our ability to share
control flow if we need to expand cmovs into control flow.
I'm also not sure I understand this min/max canonicalization code.
The motivating case talks about comparing with 0. But we don't
check for 0 explicitly.
Removes one instruction from the codegen for PR43658.
llvm-svn: 374706
Before, we eagerly split blocks even if it was not necessary, e.g., they
had a single unreachable instruction and only a single predecessor.
llvm-svn: 374703
We do not yet perform h2s because we know something is free'ed but we do
it because we know the pointer does not escape. Storing the pointer
allows it to escape so we have to prevent that.
llvm-svn: 374699
H2S did apply to mallocs of non-constant sizes if the uses were OK. This
is now forbidden through reording of the "good" and "bad" cases in the
conditional.
llvm-svn: 374698
The check for naked/optnone was insufficient for different reasons. We
now check before we initialize an abstract attribute and we do it for
all abstract attributes.
llvm-svn: 374694
Summary:
If the underlying alloca did not change, we do not necessarily need new
lifetime markers. This patch adds a check and reuses the old ones if
possible.
Reviewers: reames, ssarda, t.p.northover, hfinkel
Subscribers: hiraditya, bollu, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D68900
llvm-svn: 374692
Summary:
This is a recommit, this originally landed in rL370454 but was
subsequently reverted in rL370788 due to
https://bugs.llvm.org/show_bug.cgi?id=43206
The reduced testcase was added to bcmp-negative-tests.ll
as @pr43206_different_loops - we must ensure that the SCEV's
we got are both for the same loop we are currently investigating.
Original commit message:
@mclow.lists brought up this issue up in IRC.
It is a reasonably common problem to compare some two values for equality.
Those may be just some integers, strings or arrays of integers.
In C, there is `memcmp()`, `bcmp()` functions.
In C++, there exists `std::equal()` algorithm.
One can also write that function manually.
libstdc++'s `std::equal()` is specialized to directly call `memcmp()` for
various types, but not `std::byte` from C++2a. https://godbolt.org/z/mx2ejJ
libc++ does not do anything like that, it simply relies on simple C++'s
`operator==()`. https://godbolt.org/z/er0Zwf (GOOD!)
So likely, there exists a certain performance opportunities.
Let's compare performance of naive `std::equal()` (no `memcmp()`) with one that
is using `memcmp()` (in this case, compiled with modified compiler). {F8768213}
```
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <iterator>
#include <limits>
#include <random>
#include <type_traits>
#include <utility>
#include <vector>
#include "benchmark/benchmark.h"
template <class T>
bool equal(T* a, T* a_end, T* b) noexcept {
for (; a != a_end; ++a, ++b) {
if (*a != *b) return false;
}
return true;
}
template <typename T>
std::vector<T> getVectorOfRandomNumbers(size_t count) {
std::random_device rd;
std::mt19937 gen(rd());
std::uniform_int_distribution<T> dis(std::numeric_limits<T>::min(),
std::numeric_limits<T>::max());
std::vector<T> v;
v.reserve(count);
std::generate_n(std::back_inserter(v), count,
[&dis, &gen]() { return dis(gen); });
assert(v.size() == count);
return v;
}
struct Identical {
template <typename T>
static std::pair<std::vector<T>, std::vector<T>> Gen(size_t count) {
auto Tmp = getVectorOfRandomNumbers<T>(count);
return std::make_pair(Tmp, std::move(Tmp));
}
};
struct InequalHalfway {
template <typename T>
static std::pair<std::vector<T>, std::vector<T>> Gen(size_t count) {
auto V0 = getVectorOfRandomNumbers<T>(count);
auto V1 = V0;
V1[V1.size() / size_t(2)]++; // just change the value.
return std::make_pair(std::move(V0), std::move(V1));
}
};
template <class T, class Gen>
void BM_bcmp(benchmark::State& state) {
const size_t Length = state.range(0);
const std::pair<std::vector<T>, std::vector<T>> Data =
Gen::template Gen<T>(Length);
const std::vector<T>& a = Data.first;
const std::vector<T>& b = Data.second;
assert(a.size() == Length && b.size() == a.size());
benchmark::ClobberMemory();
benchmark::DoNotOptimize(a);
benchmark::DoNotOptimize(a.data());
benchmark::DoNotOptimize(b);
benchmark::DoNotOptimize(b.data());
for (auto _ : state) {
const bool is_equal = equal(a.data(), a.data() + a.size(), b.data());
benchmark::DoNotOptimize(is_equal);
}
state.SetComplexityN(Length);
state.counters["eltcnt"] =
benchmark::Counter(Length, benchmark::Counter::kIsIterationInvariant);
state.counters["eltcnt/sec"] =
benchmark::Counter(Length, benchmark::Counter::kIsIterationInvariantRate);
const size_t BytesRead = 2 * sizeof(T) * Length;
state.counters["bytes_read/iteration"] =
benchmark::Counter(BytesRead, benchmark::Counter::kDefaults,
benchmark::Counter::OneK::kIs1024);
state.counters["bytes_read/sec"] = benchmark::Counter(
BytesRead, benchmark::Counter::kIsIterationInvariantRate,
benchmark::Counter::OneK::kIs1024);
}
template <typename T>
static void CustomArguments(benchmark::internal::Benchmark* b) {
const size_t L2SizeBytes = []() {
for (const benchmark::CPUInfo::CacheInfo& I :
benchmark::CPUInfo::Get().caches) {
if (I.level == 2) return I.size;
}
return 0;
}();
// What is the largest range we can check to always fit within given L2 cache?
const size_t MaxLen = L2SizeBytes / /*total bufs*/ 2 /
/*maximal elt size*/ sizeof(T) / /*safety margin*/ 2;
b->RangeMultiplier(2)->Range(1, MaxLen)->Complexity(benchmark::oN);
}
BENCHMARK_TEMPLATE(BM_bcmp, uint8_t, Identical)
->Apply(CustomArguments<uint8_t>);
BENCHMARK_TEMPLATE(BM_bcmp, uint16_t, Identical)
->Apply(CustomArguments<uint16_t>);
BENCHMARK_TEMPLATE(BM_bcmp, uint32_t, Identical)
->Apply(CustomArguments<uint32_t>);
BENCHMARK_TEMPLATE(BM_bcmp, uint64_t, Identical)
->Apply(CustomArguments<uint64_t>);
BENCHMARK_TEMPLATE(BM_bcmp, uint8_t, InequalHalfway)
->Apply(CustomArguments<uint8_t>);
BENCHMARK_TEMPLATE(BM_bcmp, uint16_t, InequalHalfway)
->Apply(CustomArguments<uint16_t>);
BENCHMARK_TEMPLATE(BM_bcmp, uint32_t, InequalHalfway)
->Apply(CustomArguments<uint32_t>);
BENCHMARK_TEMPLATE(BM_bcmp, uint64_t, InequalHalfway)
->Apply(CustomArguments<uint64_t>);
```
{F8768210}
```
$ ~/src/googlebenchmark/tools/compare.py --no-utest benchmarks build-{old,new}/test/llvm-bcmp-bench
RUNNING: build-old/test/llvm-bcmp-bench --benchmark_out=/tmp/tmpb6PEUx
2019-04-25 21:17:11
Running build-old/test/llvm-bcmp-bench
Run on (8 X 4000 MHz CPU s)
CPU Caches:
L1 Data 16K (x8)
L1 Instruction 64K (x4)
L2 Unified 2048K (x4)
L3 Unified 8192K (x1)
Load Average: 0.65, 3.90, 4.14
---------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
---------------------------------------------------------------------------------------------------
<...>
BM_bcmp<uint8_t, Identical>/512000 432131 ns 432101 ns 1613 bytes_read/iteration=1000k bytes_read/sec=2.20706G/s eltcnt=825.856M eltcnt/sec=1.18491G/s
BM_bcmp<uint8_t, Identical>_BigO 0.86 N 0.86 N
BM_bcmp<uint8_t, Identical>_RMS 8 % 8 %
<...>
BM_bcmp<uint16_t, Identical>/256000 161408 ns 161409 ns 4027 bytes_read/iteration=1000k bytes_read/sec=5.90843G/s eltcnt=1030.91M eltcnt/sec=1.58603G/s
BM_bcmp<uint16_t, Identical>_BigO 0.67 N 0.67 N
BM_bcmp<uint16_t, Identical>_RMS 25 % 25 %
<...>
BM_bcmp<uint32_t, Identical>/128000 81497 ns 81488 ns 8415 bytes_read/iteration=1000k bytes_read/sec=11.7032G/s eltcnt=1077.12M eltcnt/sec=1.57078G/s
BM_bcmp<uint32_t, Identical>_BigO 0.71 N 0.71 N
BM_bcmp<uint32_t, Identical>_RMS 42 % 42 %
<...>
BM_bcmp<uint64_t, Identical>/64000 50138 ns 50138 ns 10909 bytes_read/iteration=1000k bytes_read/sec=19.0209G/s eltcnt=698.176M eltcnt/sec=1.27647G/s
BM_bcmp<uint64_t, Identical>_BigO 0.84 N 0.84 N
BM_bcmp<uint64_t, Identical>_RMS 27 % 27 %
<...>
BM_bcmp<uint8_t, InequalHalfway>/512000 192405 ns 192392 ns 3638 bytes_read/iteration=1000k bytes_read/sec=4.95694G/s eltcnt=1.86266G eltcnt/sec=2.66124G/s
BM_bcmp<uint8_t, InequalHalfway>_BigO 0.38 N 0.38 N
BM_bcmp<uint8_t, InequalHalfway>_RMS 3 % 3 %
<...>
BM_bcmp<uint16_t, InequalHalfway>/256000 127858 ns 127860 ns 5477 bytes_read/iteration=1000k bytes_read/sec=7.45873G/s eltcnt=1.40211G eltcnt/sec=2.00219G/s
BM_bcmp<uint16_t, InequalHalfway>_BigO 0.50 N 0.50 N
BM_bcmp<uint16_t, InequalHalfway>_RMS 0 % 0 %
<...>
BM_bcmp<uint32_t, InequalHalfway>/128000 49140 ns 49140 ns 14281 bytes_read/iteration=1000k bytes_read/sec=19.4072G/s eltcnt=1.82797G eltcnt/sec=2.60478G/s
BM_bcmp<uint32_t, InequalHalfway>_BigO 0.40 N 0.40 N
BM_bcmp<uint32_t, InequalHalfway>_RMS 18 % 18 %
<...>
BM_bcmp<uint64_t, InequalHalfway>/64000 32101 ns 32099 ns 21786 bytes_read/iteration=1000k bytes_read/sec=29.7101G/s eltcnt=1.3943G eltcnt/sec=1.99381G/s
BM_bcmp<uint64_t, InequalHalfway>_BigO 0.50 N 0.50 N
BM_bcmp<uint64_t, InequalHalfway>_RMS 1 % 1 %
RUNNING: build-new/test/llvm-bcmp-bench --benchmark_out=/tmp/tmpQ46PP0
2019-04-25 21:19:29
Running build-new/test/llvm-bcmp-bench
Run on (8 X 4000 MHz CPU s)
CPU Caches:
L1 Data 16K (x8)
L1 Instruction 64K (x4)
L2 Unified 2048K (x4)
L3 Unified 8192K (x1)
Load Average: 1.01, 2.85, 3.71
---------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
---------------------------------------------------------------------------------------------------
<...>
BM_bcmp<uint8_t, Identical>/512000 18593 ns 18590 ns 37565 bytes_read/iteration=1000k bytes_read/sec=51.2991G/s eltcnt=19.2333G eltcnt/sec=27.541G/s
BM_bcmp<uint8_t, Identical>_BigO 0.04 N 0.04 N
BM_bcmp<uint8_t, Identical>_RMS 37 % 37 %
<...>
BM_bcmp<uint16_t, Identical>/256000 18950 ns 18948 ns 37223 bytes_read/iteration=1000k bytes_read/sec=50.3324G/s eltcnt=9.52909G eltcnt/sec=13.511G/s
BM_bcmp<uint16_t, Identical>_BigO 0.08 N 0.08 N
BM_bcmp<uint16_t, Identical>_RMS 34 % 34 %
<...>
BM_bcmp<uint32_t, Identical>/128000 18627 ns 18627 ns 37895 bytes_read/iteration=1000k bytes_read/sec=51.198G/s eltcnt=4.85056G eltcnt/sec=6.87168G/s
BM_bcmp<uint32_t, Identical>_BigO 0.16 N 0.16 N
BM_bcmp<uint32_t, Identical>_RMS 35 % 35 %
<...>
BM_bcmp<uint64_t, Identical>/64000 18855 ns 18855 ns 37458 bytes_read/iteration=1000k bytes_read/sec=50.5791G/s eltcnt=2.39731G eltcnt/sec=3.3943G/s
BM_bcmp<uint64_t, Identical>_BigO 0.32 N 0.32 N
BM_bcmp<uint64_t, Identical>_RMS 33 % 33 %
<...>
BM_bcmp<uint8_t, InequalHalfway>/512000 9570 ns 9569 ns 73500 bytes_read/iteration=1000k bytes_read/sec=99.6601G/s eltcnt=37.632G eltcnt/sec=53.5046G/s
BM_bcmp<uint8_t, InequalHalfway>_BigO 0.02 N 0.02 N
BM_bcmp<uint8_t, InequalHalfway>_RMS 29 % 29 %
<...>
BM_bcmp<uint16_t, InequalHalfway>/256000 9547 ns 9547 ns 74343 bytes_read/iteration=1000k bytes_read/sec=99.8971G/s eltcnt=19.0318G eltcnt/sec=26.8159G/s
BM_bcmp<uint16_t, InequalHalfway>_BigO 0.04 N 0.04 N
BM_bcmp<uint16_t, InequalHalfway>_RMS 29 % 29 %
<...>
BM_bcmp<uint32_t, InequalHalfway>/128000 9396 ns 9394 ns 73521 bytes_read/iteration=1000k bytes_read/sec=101.518G/s eltcnt=9.41069G eltcnt/sec=13.6255G/s
BM_bcmp<uint32_t, InequalHalfway>_BigO 0.08 N 0.08 N
BM_bcmp<uint32_t, InequalHalfway>_RMS 30 % 30 %
<...>
BM_bcmp<uint64_t, InequalHalfway>/64000 9499 ns 9498 ns 73802 bytes_read/iteration=1000k bytes_read/sec=100.405G/s eltcnt=4.72333G eltcnt/sec=6.73808G/s
BM_bcmp<uint64_t, InequalHalfway>_BigO 0.16 N 0.16 N
BM_bcmp<uint64_t, InequalHalfway>_RMS 28 % 28 %
Comparing build-old/test/llvm-bcmp-bench to build-new/test/llvm-bcmp-bench
Benchmark Time CPU Time Old Time New CPU Old CPU New
---------------------------------------------------------------------------------------------------------------------------------------
<...>
BM_bcmp<uint8_t, Identical>/512000 -0.9570 -0.9570 432131 18593 432101 18590
<...>
BM_bcmp<uint16_t, Identical>/256000 -0.8826 -0.8826 161408 18950 161409 18948
<...>
BM_bcmp<uint32_t, Identical>/128000 -0.7714 -0.7714 81497 18627 81488 18627
<...>
BM_bcmp<uint64_t, Identical>/64000 -0.6239 -0.6239 50138 18855 50138 18855
<...>
BM_bcmp<uint8_t, InequalHalfway>/512000 -0.9503 -0.9503 192405 9570 192392 9569
<...>
BM_bcmp<uint16_t, InequalHalfway>/256000 -0.9253 -0.9253 127858 9547 127860 9547
<...>
BM_bcmp<uint32_t, InequalHalfway>/128000 -0.8088 -0.8088 49140 9396 49140 9394
<...>
BM_bcmp<uint64_t, InequalHalfway>/64000 -0.7041 -0.7041 32101 9499 32099 9498
```
What can we tell from the benchmark?
* Performance of naive equality check somewhat improves with element size,
maxing out at eltcnt/sec=1.58603G/s for uint16_t, or bytes_read/sec=19.0209G/s
for uint64_t. I think, that instability implies performance problems.
* Performance of `memcmp()`-aware benchmark always maxes out at around
bytes_read/sec=51.2991G/s for every type. That is 2.6x the throughput of the
naive variant!
* eltcnt/sec metric for the `memcmp()`-aware benchmark maxes out at
eltcnt/sec=27.541G/s for uint8_t (was: eltcnt/sec=1.18491G/s, so 24x) and
linearly decreases with element size.
For uint64_t, it's ~4x+ the elements/second.
* The call obvious is more pricey than the loop, with small element count.
As it can be seen from the full output {F8768210}, the `memcmp()` is almost
universally worse, independent of the element size (and thus buffer size) when
element count is less than 8.
So all in all, bcmp idiom does indeed pose untapped performance headroom.
This diff does implement said idiom recognition. I think a reasonable test
coverage is present, but do tell if there is anything obvious missing.
Now, quality. This does succeed to build and pass the test-suite, at least
without any non-bundled elements. {F8768216} {F8768217}
This transform fires 91 times:
```
$ /build/test-suite/utils/compare.py -m loop-idiom.NumBCmp result-new.json
Tests: 1149
Metric: loop-idiom.NumBCmp
Program result-new
MultiSourc...Benchmarks/7zip/7zip-benchmark 79.00
MultiSource/Applications/d/make_dparser 3.00
SingleSource/UnitTests/vla 2.00
MultiSource/Applications/Burg/burg 1.00
MultiSourc.../Applications/JM/lencod/lencod 1.00
MultiSource/Applications/lemon/lemon 1.00
MultiSource/Benchmarks/Bullet/bullet 1.00
MultiSourc...e/Benchmarks/MallocBench/gs/gs 1.00
MultiSourc...gs-C/TimberWolfMC/timberwolfmc 1.00
MultiSourc...Prolangs-C/simulator/simulator 1.00
```
The size changes are:
I'm not sure what's going on with SingleSource/UnitTests/vla.test yet, did not look.
```
$ /build/test-suite/utils/compare.py -m size..text result-{old,new}.json --filter-hash
Tests: 1149
Same hash: 907 (filtered out)
Remaining: 242
Metric: size..text
Program result-old result-new diff
test-suite...ingleSource/UnitTests/vla.test 753.00 833.00 10.6%
test-suite...marks/7zip/7zip-benchmark.test 1001697.00 966657.00 -3.5%
test-suite...ngs-C/simulator/simulator.test 32369.00 32321.00 -0.1%
test-suite...plications/d/make_dparser.test 89585.00 89505.00 -0.1%
test-suite...ce/Applications/Burg/burg.test 40817.00 40785.00 -0.1%
test-suite.../Applications/lemon/lemon.test 47281.00 47249.00 -0.1%
test-suite...TimberWolfMC/timberwolfmc.test 250065.00 250113.00 0.0%
test-suite...chmarks/MallocBench/gs/gs.test 149889.00 149873.00 -0.0%
test-suite...ications/JM/lencod/lencod.test 769585.00 769569.00 -0.0%
test-suite.../Benchmarks/Bullet/bullet.test 770049.00 770049.00 0.0%
test-suite...HMARK_ANISTROPIC_DIFFUSION/128 NaN NaN nan%
test-suite...HMARK_ANISTROPIC_DIFFUSION/256 NaN NaN nan%
test-suite...CHMARK_ANISTROPIC_DIFFUSION/64 NaN NaN nan%
test-suite...CHMARK_ANISTROPIC_DIFFUSION/32 NaN NaN nan%
test-suite...ENCHMARK_BILATERAL_FILTER/64/4 NaN NaN nan%
Geomean difference nan%
result-old result-new diff
count 1.000000e+01 10.00000 10.000000
mean 3.152090e+05 311695.40000 0.006749
std 3.790398e+05 372091.42232 0.036605
min 7.530000e+02 833.00000 -0.034981
25% 4.243300e+04 42401.00000 -0.000866
50% 1.197370e+05 119689.00000 -0.000392
75% 6.397050e+05 639705.00000 -0.000005
max 1.001697e+06 966657.00000 0.106242
```
I don't have timings though.
And now to the code. The basic idea is to completely replace the whole loop.
If we can't fully kill it, don't transform.
I have left one or two comments in the code, so hopefully it can be understood.
Also, there is a few TODO's that i have left for follow-ups:
* widening of `memcmp()`/`bcmp()`
* step smaller than the comparison size
* Metadata propagation
* more than two blocks as long as there is still a single backedge?
* ???
Reviewers: reames, fhahn, mkazantsev, chandlerc, craig.topper, courbet
Reviewed By: courbet
Subscribers: miyuki, hiraditya, xbolva00, nikic, jfb, gchatelet, courbet, llvm-commits, mclow.lists
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D61144
llvm-svn: 374662
I can't see any notable differences in costs between SSE2 and SSE42 arches for FADD/ADD reduction, so I've lowered the target to just SSE2.
I've also added vXi8 sum reduction costs in line with the PSADBW codegen and discussions on PR42674.
llvm-svn: 374655
Since the input type is larger than 256-bits we'll need to some
concatenating to reassemble the results. The pack instructions
ability to concatenate while packing make this a shorter/faster
sequence.
llvm-svn: 374643