186328 Commits

Author SHA1 Message Date
Roman Lebedev
5dbe65e8a0 [NFC][InstCombine] More test for "sign bit test via shifts" pattern (PR43595)
While that pattern is indirectly handled via
reassociateShiftAmtsOfTwoSameDirectionShifts(),
that incursme one-use restriction on truncation,
which is pointless since we know that we'll produce a single instruction.

Additionally, *if* we are only looking for sign bit,
we don't need shifts to be identical,
which isn't the case in general,
and is the blocker for me in bug in question:

https://bugs.llvm.org/show_bug.cgi?id=43595

llvm-svn: 374726
2019-10-13 17:11:16 +00:00
Simon Pilgrim
8773c2f418 [X86] SimplifyMultipleUseDemandedBitsForTargetNode - use getTargetShuffleInputs with KnownUndef/Zero results.
llvm-svn: 374725
2019-10-13 17:03:11 +00:00
Simon Pilgrim
96a583c339 [X86] getTargetShuffleInputs - add KnownUndef/Zero output support
Adjust SimplifyDemandedVectorEltsForTargetNode to use the known elts masks instead of recomputing it locally.

llvm-svn: 374724
2019-10-13 17:03:02 +00:00
Nico Weber
e4a04f087b gn build: (manually) merge r374720
llvm-svn: 374721
2019-10-13 15:25:13 +00:00
Simon Pilgrim
08bd8463d0 [X86][AVX] Add i686 avx splat tests
llvm-svn: 374719
2019-10-13 13:18:07 +00:00
Simon Pilgrim
b9e25e757c IRTranslator - silence static analyzer null dereference warnings. NFCI.
The CmpInst::getType() calls can be replaced by just using User::getType() that it was dyn_cast from, and we then need to assert that any default predicate cases came from the CmpInst.

llvm-svn: 374716
2019-10-13 11:29:35 +00:00
GN Sync Bot
4e58c0f5c5 gn build: Merge r374707
llvm-svn: 374708
2019-10-13 08:33:14 +00:00
Craig Topper
25aca9b88c [X86] Add a one use check on the setcc to the min/max canonicalization code in combineSelect.
This seems to improve std::midpoint code where we have a min and
a max with the same condition. If we split the setcc we can end
up with two compares if the one of the operands is a constant.
Since we aggressively canonicalize compares with constants.
For non-constants it can interfere with our ability to share
control flow if we need to expand cmovs into control flow.

I'm also not sure I understand this min/max canonicalization code.
The motivating case talks about comparing with 0. But we don't
check for 0 explicitly.

Removes one instruction from the codegen for PR43658.

llvm-svn: 374706
2019-10-13 06:48:05 +00:00
Craig Topper
890032afee [X86] Enable v4i32->v4i16 and v8i16->v8i8 saturating truncates to use pack instructions with avx512.
llvm-svn: 374705
2019-10-13 05:47:47 +00:00
Craig Topper
4c3e63288a [X86] Add v2i64->v2i32/v2i16/v2i8 test cases to the trunc packus/ssat/usat tests. NFC
llvm-svn: 374704
2019-10-13 05:47:42 +00:00
Johannes Doerfert
1c4877f759 [Attributor][FIX] Avoid splitting blocks if possible
Before, we eagerly split blocks even if it was not necessary, e.g., they
had a single unreachable instruction and only a single predecessor.

llvm-svn: 374703
2019-10-13 05:27:09 +00:00
Johannes Doerfert
6b3ee37d3e [Attributor][FIX] Remove leftover, now unused, variable
llvm-svn: 374702
2019-10-13 05:19:17 +00:00
Johannes Doerfert
408a2116f8 [Attributor] Remove unused verification flag
We use the verify max iteration now which is more reliable.

llvm-svn: 374701
2019-10-13 05:07:00 +00:00
Johannes Doerfert
2388ee280b [Attributor][NFC] Expose call site traversal without QueryingAA
llvm-svn: 374700
2019-10-13 04:16:02 +00:00
Johannes Doerfert
807d9bd946 [Attributor][FIX] Ensure h2s doesn't trigger on escaped pointers
We do not yet perform h2s because we know something is free'ed but we do
it because we know the pointer does not escape. Storing the pointer
allows it to escape so we have to prevent that.

llvm-svn: 374699
2019-10-13 04:14:15 +00:00
Johannes Doerfert
7b069a879a [Attributor][FIX] Do not apply h2s for arbitrary mallocs
H2S did apply to mallocs of non-constant sizes if the uses were OK. This
is now forbidden through reording of the "good" and "bad" cases in the
conditional.

llvm-svn: 374698
2019-10-13 03:54:08 +00:00
Johannes Doerfert
7cbf46ed9d [Attributor][FIX] Add missing function declaration in test case
llvm-svn: 374696
2019-10-13 02:42:09 +00:00
Johannes Doerfert
a3ef980c25 [Attributor][FIX] Avoid modifying naked/optnone functions
The check for naked/optnone was insufficient for different reasons. We
now check before we initialize an abstract attribute and we do it for
all abstract attributes.

llvm-svn: 374694
2019-10-13 02:24:02 +00:00
Johannes Doerfert
67305cbcc3 [SROA] Reuse existing lifetime markers if possible
Summary:
If the underlying alloca did not change, we do not necessarily need new
lifetime markers. This patch adds a check and reuses the old ones if
possible.

Reviewers: reames, ssarda, t.p.northover, hfinkel

Subscribers: hiraditya, bollu, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D68900

llvm-svn: 374692
2019-10-13 02:21:23 +00:00
Nico Weber
b9c29111a6 Revert r374663 "[clang-format] Proposal for clang-format to give compiler style warnings"
The test fails on macOS and looks a bit wrong, see comments on the review.

Also revert follow-up r374686.

llvm-svn: 374688
2019-10-12 22:58:34 +00:00
Nico Weber
7c9db209a6 gn build: (manually) merge r374663
llvm-svn: 374686
2019-10-12 22:24:56 +00:00
Joel E. Denny
c87e87c588 Revert r374648: "Reland r374388: [lit] Make internal diff work in pipelines"
This series of patches still breaks a Windows bot.

llvm-svn: 374683
2019-10-12 18:52:46 +00:00
Joel E. Denny
07d111f9d2 Revert r374649: "Reland r374389: [lit] Clean up internal diff's encoding handling"
This series of patches still breaks a Windows bot.

llvm-svn: 374682
2019-10-12 18:52:31 +00:00
Joel E. Denny
66af91a761 Revert r374650: "Reland r374390: [lit] Extend internal diff to support - argument"
This series of patches still breaks a Windows bot.

llvm-svn: 374681
2019-10-12 18:52:18 +00:00
Joel E. Denny
7f02190bce Revert 374651: "Reland r374392: [lit] Extend internal diff to support -U"
This series of patches still breaks a Windows bot.

llvm-svn: 374680
2019-10-12 18:52:05 +00:00
Joel E. Denny
b9620c3aff Revert r374652: "[lit] Fix internal diff's --strip-trailing-cr and use it"
This series of patches still breaks a Windows bot.

llvm-svn: 374679
2019-10-12 18:51:51 +00:00
Joel E. Denny
f811e9ea03 Revert r374653: "[lit] Fix a few oversights in r374651 that broke some bots"
This series of patches still breaks a Windows bot.

llvm-svn: 374678
2019-10-12 18:51:34 +00:00
Joel E. Denny
4c98f87a0a Revert r374665: "[lit] Try yet again to fix new tests that fail on Windows bots"
This series of patches still breaks a Windows bot.

llvm-svn: 374677
2019-10-12 18:51:18 +00:00
Joel E. Denny
56234906c5 Revert r374666: "[lit] Adjust error handling for decode introduced by r374665"
This series of patches still breaks a Windows bot.

llvm-svn: 374676
2019-10-12 18:51:08 +00:00
Joel E. Denny
b9ce1e9b61 Revert r374671: "[lit] Try errors="ignore" for decode introduced by r374665"
This series of patches still breaks a Windows bot.

llvm-svn: 374675
2019-10-12 18:50:57 +00:00
Simon Pilgrim
cc8f5e36da [X86] scaleShuffleMask - use size_t Scale to avoid overflow warnings
llvm-svn: 374674
2019-10-12 18:33:47 +00:00
Simon Pilgrim
e66bfcd64d SymbolRecord - consistently use explicit for single operand constructors
llvm-svn: 374673
2019-10-12 17:55:09 +00:00
Simon Pilgrim
4e1d47e24d SymbolRecord - fix uninitialized variable warnings. NFCI.
llvm-svn: 374672
2019-10-12 17:55:01 +00:00
Joel E. Denny
46a7c0b0c5 [lit] Try errors="ignore" for decode introduced by r374665
Still trying to fix the same error as in r374666.

llvm-svn: 374671
2019-10-12 17:23:25 +00:00
Roman Lebedev
ab2284f8a9 [NFC][LoopIdiom] Adjust FIXME to be self-explanatory
llvm-svn: 374670
2019-10-12 16:48:16 +00:00
Simon Pilgrim
4ce8f35c4d Replace for-loop of SmallVector::push_back with SmallVector::append. NFCI.
llvm-svn: 374669
2019-10-12 16:37:02 +00:00
Simon Pilgrim
846f09a144 Fix cppcheck shadow variable name warnings. NFCI.
llvm-svn: 374668
2019-10-12 16:36:52 +00:00
Simon Pilgrim
354303cc5c [X86] Use any_of/all_of patterns in shuffle mask pattern recognisers. NFCI.
llvm-svn: 374667
2019-10-12 16:36:44 +00:00
Joel E. Denny
ef602e4f5b [lit] Adjust error handling for decode introduced by r374665
On that decode, Windows bots fail with:

```
UnicodeEncodeError: 'ascii' codec can't encode characters in position 7-8: ordinal not in range(128)
```

That's the same error as before r374665 except it's now at the decode
before the write to stdout.

llvm-svn: 374666
2019-10-12 16:25:46 +00:00
Joel E. Denny
cb72157724 [lit] Try yet again to fix new tests that fail on Windows bots
I seem to have misread the bot logs on my last attempt.  When lit's
internal diff runs on Windows under Python 2.7, it's text diffs not
binary diffs that need decoding to avoid this error when writing the
diff to stdout:

```
UnicodeEncodeError: 'ascii' codec can't encode characters in position 7-8: ordinal not in range(128)
```

There is no `decode` attribute in this case under Python 3.6.8 under
Ubuntu, so this patch checks for the `decode` attribute before using
it here.  Hopefully nothing else is needed when `decode` isn't
available.

It might take a couple more attempts to figure out what error
handling, if any, is needed for this decoding.

llvm-svn: 374665
2019-10-12 16:00:35 +00:00
Joel E. Denny
005ff49b43 Revert r374657: "[lit] Try again to fix new tests that fail on Windows bots"
llvm-svn: 374664
2019-10-12 16:00:25 +00:00
Roman Lebedev
0dff68630e [LoopIdiomRecognize] Recommit: BCmp loop idiom recognition
Summary:
This is a recommit, this originally landed in rL370454 but was
subsequently reverted in  rL370788 due to
https://bugs.llvm.org/show_bug.cgi?id=43206
The reduced testcase was added to bcmp-negative-tests.ll
as @pr43206_different_loops - we must ensure that the SCEV's
we got are both for the same loop we are currently investigating.

Original commit message:

@mclow.lists brought up this issue up in IRC.
It is a reasonably common problem to compare some two values for equality.
Those may be just some integers, strings or arrays of integers.

In C, there is `memcmp()`, `bcmp()` functions.
In C++, there exists `std::equal()` algorithm.
One can also write that function manually.

libstdc++'s `std::equal()` is specialized to directly call `memcmp()` for
various types, but not `std::byte` from C++2a. https://godbolt.org/z/mx2ejJ

libc++ does not do anything like that, it simply relies on simple C++'s
`operator==()`. https://godbolt.org/z/er0Zwf (GOOD!)

So likely, there exists a certain performance opportunities.
Let's compare performance of naive `std::equal()` (no `memcmp()`) with one that
is using `memcmp()` (in this case, compiled with modified compiler). {F8768213}

```
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <iterator>
#include <limits>
#include <random>
#include <type_traits>
#include <utility>
#include <vector>

#include "benchmark/benchmark.h"

template <class T>
bool equal(T* a, T* a_end, T* b) noexcept {
  for (; a != a_end; ++a, ++b) {
    if (*a != *b) return false;
  }
  return true;
}

template <typename T>
std::vector<T> getVectorOfRandomNumbers(size_t count) {
  std::random_device rd;
  std::mt19937 gen(rd());
  std::uniform_int_distribution<T> dis(std::numeric_limits<T>::min(),
                                       std::numeric_limits<T>::max());
  std::vector<T> v;
  v.reserve(count);
  std::generate_n(std::back_inserter(v), count,
                  [&dis, &gen]() { return dis(gen); });
  assert(v.size() == count);
  return v;
}

struct Identical {
  template <typename T>
  static std::pair<std::vector<T>, std::vector<T>> Gen(size_t count) {
    auto Tmp = getVectorOfRandomNumbers<T>(count);
    return std::make_pair(Tmp, std::move(Tmp));
  }
};

struct InequalHalfway {
  template <typename T>
  static std::pair<std::vector<T>, std::vector<T>> Gen(size_t count) {
    auto V0 = getVectorOfRandomNumbers<T>(count);
    auto V1 = V0;
    V1[V1.size() / size_t(2)]++;  // just change the value.
    return std::make_pair(std::move(V0), std::move(V1));
  }
};

template <class T, class Gen>
void BM_bcmp(benchmark::State& state) {
  const size_t Length = state.range(0);

  const std::pair<std::vector<T>, std::vector<T>> Data =
      Gen::template Gen<T>(Length);
  const std::vector<T>& a = Data.first;
  const std::vector<T>& b = Data.second;
  assert(a.size() == Length && b.size() == a.size());

  benchmark::ClobberMemory();
  benchmark::DoNotOptimize(a);
  benchmark::DoNotOptimize(a.data());
  benchmark::DoNotOptimize(b);
  benchmark::DoNotOptimize(b.data());

  for (auto _ : state) {
    const bool is_equal = equal(a.data(), a.data() + a.size(), b.data());
    benchmark::DoNotOptimize(is_equal);
  }
  state.SetComplexityN(Length);
  state.counters["eltcnt"] =
      benchmark::Counter(Length, benchmark::Counter::kIsIterationInvariant);
  state.counters["eltcnt/sec"] =
      benchmark::Counter(Length, benchmark::Counter::kIsIterationInvariantRate);
  const size_t BytesRead = 2 * sizeof(T) * Length;
  state.counters["bytes_read/iteration"] =
      benchmark::Counter(BytesRead, benchmark::Counter::kDefaults,
                         benchmark::Counter::OneK::kIs1024);
  state.counters["bytes_read/sec"] = benchmark::Counter(
      BytesRead, benchmark::Counter::kIsIterationInvariantRate,
      benchmark::Counter::OneK::kIs1024);
}

template <typename T>
static void CustomArguments(benchmark::internal::Benchmark* b) {
  const size_t L2SizeBytes = []() {
    for (const benchmark::CPUInfo::CacheInfo& I :
         benchmark::CPUInfo::Get().caches) {
      if (I.level == 2) return I.size;
    }
    return 0;
  }();
  // What is the largest range we can check to always fit within given L2 cache?
  const size_t MaxLen = L2SizeBytes / /*total bufs*/ 2 /
                        /*maximal elt size*/ sizeof(T) / /*safety margin*/ 2;
  b->RangeMultiplier(2)->Range(1, MaxLen)->Complexity(benchmark::oN);
}

BENCHMARK_TEMPLATE(BM_bcmp, uint8_t, Identical)
    ->Apply(CustomArguments<uint8_t>);
BENCHMARK_TEMPLATE(BM_bcmp, uint16_t, Identical)
    ->Apply(CustomArguments<uint16_t>);
BENCHMARK_TEMPLATE(BM_bcmp, uint32_t, Identical)
    ->Apply(CustomArguments<uint32_t>);
BENCHMARK_TEMPLATE(BM_bcmp, uint64_t, Identical)
    ->Apply(CustomArguments<uint64_t>);

BENCHMARK_TEMPLATE(BM_bcmp, uint8_t, InequalHalfway)
    ->Apply(CustomArguments<uint8_t>);
BENCHMARK_TEMPLATE(BM_bcmp, uint16_t, InequalHalfway)
    ->Apply(CustomArguments<uint16_t>);
BENCHMARK_TEMPLATE(BM_bcmp, uint32_t, InequalHalfway)
    ->Apply(CustomArguments<uint32_t>);
BENCHMARK_TEMPLATE(BM_bcmp, uint64_t, InequalHalfway)
    ->Apply(CustomArguments<uint64_t>);
```
{F8768210}
```
$ ~/src/googlebenchmark/tools/compare.py --no-utest benchmarks build-{old,new}/test/llvm-bcmp-bench
RUNNING: build-old/test/llvm-bcmp-bench --benchmark_out=/tmp/tmpb6PEUx
2019-04-25 21:17:11
Running build-old/test/llvm-bcmp-bench
Run on (8 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 16K (x8)
  L1 Instruction 64K (x4)
  L2 Unified 2048K (x4)
  L3 Unified 8192K (x1)
Load Average: 0.65, 3.90, 4.14
---------------------------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------
<...>
BM_bcmp<uint8_t, Identical>/512000           432131 ns       432101 ns         1613 bytes_read/iteration=1000k bytes_read/sec=2.20706G/s eltcnt=825.856M eltcnt/sec=1.18491G/s
BM_bcmp<uint8_t, Identical>_BigO               0.86 N          0.86 N
BM_bcmp<uint8_t, Identical>_RMS                   8 %             8 %
<...>
BM_bcmp<uint16_t, Identical>/256000          161408 ns       161409 ns         4027 bytes_read/iteration=1000k bytes_read/sec=5.90843G/s eltcnt=1030.91M eltcnt/sec=1.58603G/s
BM_bcmp<uint16_t, Identical>_BigO              0.67 N          0.67 N
BM_bcmp<uint16_t, Identical>_RMS                 25 %            25 %
<...>
BM_bcmp<uint32_t, Identical>/128000           81497 ns        81488 ns         8415 bytes_read/iteration=1000k bytes_read/sec=11.7032G/s eltcnt=1077.12M eltcnt/sec=1.57078G/s
BM_bcmp<uint32_t, Identical>_BigO              0.71 N          0.71 N
BM_bcmp<uint32_t, Identical>_RMS                 42 %            42 %
<...>
BM_bcmp<uint64_t, Identical>/64000            50138 ns        50138 ns        10909 bytes_read/iteration=1000k bytes_read/sec=19.0209G/s eltcnt=698.176M eltcnt/sec=1.27647G/s
BM_bcmp<uint64_t, Identical>_BigO              0.84 N          0.84 N
BM_bcmp<uint64_t, Identical>_RMS                 27 %            27 %
<...>
BM_bcmp<uint8_t, InequalHalfway>/512000      192405 ns       192392 ns         3638 bytes_read/iteration=1000k bytes_read/sec=4.95694G/s eltcnt=1.86266G eltcnt/sec=2.66124G/s
BM_bcmp<uint8_t, InequalHalfway>_BigO          0.38 N          0.38 N
BM_bcmp<uint8_t, InequalHalfway>_RMS              3 %             3 %
<...>
BM_bcmp<uint16_t, InequalHalfway>/256000     127858 ns       127860 ns         5477 bytes_read/iteration=1000k bytes_read/sec=7.45873G/s eltcnt=1.40211G eltcnt/sec=2.00219G/s
BM_bcmp<uint16_t, InequalHalfway>_BigO         0.50 N          0.50 N
BM_bcmp<uint16_t, InequalHalfway>_RMS             0 %             0 %
<...>
BM_bcmp<uint32_t, InequalHalfway>/128000      49140 ns        49140 ns        14281 bytes_read/iteration=1000k bytes_read/sec=19.4072G/s eltcnt=1.82797G eltcnt/sec=2.60478G/s
BM_bcmp<uint32_t, InequalHalfway>_BigO         0.40 N          0.40 N
BM_bcmp<uint32_t, InequalHalfway>_RMS            18 %            18 %
<...>
BM_bcmp<uint64_t, InequalHalfway>/64000       32101 ns        32099 ns        21786 bytes_read/iteration=1000k bytes_read/sec=29.7101G/s eltcnt=1.3943G eltcnt/sec=1.99381G/s
BM_bcmp<uint64_t, InequalHalfway>_BigO         0.50 N          0.50 N
BM_bcmp<uint64_t, InequalHalfway>_RMS             1 %             1 %
RUNNING: build-new/test/llvm-bcmp-bench --benchmark_out=/tmp/tmpQ46PP0
2019-04-25 21:19:29
Running build-new/test/llvm-bcmp-bench
Run on (8 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 16K (x8)
  L1 Instruction 64K (x4)
  L2 Unified 2048K (x4)
  L3 Unified 8192K (x1)
Load Average: 1.01, 2.85, 3.71
---------------------------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------
<...>
BM_bcmp<uint8_t, Identical>/512000            18593 ns        18590 ns        37565 bytes_read/iteration=1000k bytes_read/sec=51.2991G/s eltcnt=19.2333G eltcnt/sec=27.541G/s
BM_bcmp<uint8_t, Identical>_BigO               0.04 N          0.04 N
BM_bcmp<uint8_t, Identical>_RMS                  37 %            37 %
<...>
BM_bcmp<uint16_t, Identical>/256000           18950 ns        18948 ns        37223 bytes_read/iteration=1000k bytes_read/sec=50.3324G/s eltcnt=9.52909G eltcnt/sec=13.511G/s
BM_bcmp<uint16_t, Identical>_BigO              0.08 N          0.08 N
BM_bcmp<uint16_t, Identical>_RMS                 34 %            34 %
<...>
BM_bcmp<uint32_t, Identical>/128000           18627 ns        18627 ns        37895 bytes_read/iteration=1000k bytes_read/sec=51.198G/s eltcnt=4.85056G eltcnt/sec=6.87168G/s
BM_bcmp<uint32_t, Identical>_BigO              0.16 N          0.16 N
BM_bcmp<uint32_t, Identical>_RMS                 35 %            35 %
<...>
BM_bcmp<uint64_t, Identical>/64000            18855 ns        18855 ns        37458 bytes_read/iteration=1000k bytes_read/sec=50.5791G/s eltcnt=2.39731G eltcnt/sec=3.3943G/s
BM_bcmp<uint64_t, Identical>_BigO              0.32 N          0.32 N
BM_bcmp<uint64_t, Identical>_RMS                 33 %            33 %
<...>
BM_bcmp<uint8_t, InequalHalfway>/512000        9570 ns         9569 ns        73500 bytes_read/iteration=1000k bytes_read/sec=99.6601G/s eltcnt=37.632G eltcnt/sec=53.5046G/s
BM_bcmp<uint8_t, InequalHalfway>_BigO          0.02 N          0.02 N
BM_bcmp<uint8_t, InequalHalfway>_RMS             29 %            29 %
<...>
BM_bcmp<uint16_t, InequalHalfway>/256000       9547 ns         9547 ns        74343 bytes_read/iteration=1000k bytes_read/sec=99.8971G/s eltcnt=19.0318G eltcnt/sec=26.8159G/s
BM_bcmp<uint16_t, InequalHalfway>_BigO         0.04 N          0.04 N
BM_bcmp<uint16_t, InequalHalfway>_RMS            29 %            29 %
<...>
BM_bcmp<uint32_t, InequalHalfway>/128000       9396 ns         9394 ns        73521 bytes_read/iteration=1000k bytes_read/sec=101.518G/s eltcnt=9.41069G eltcnt/sec=13.6255G/s
BM_bcmp<uint32_t, InequalHalfway>_BigO         0.08 N          0.08 N
BM_bcmp<uint32_t, InequalHalfway>_RMS            30 %            30 %
<...>
BM_bcmp<uint64_t, InequalHalfway>/64000        9499 ns         9498 ns        73802 bytes_read/iteration=1000k bytes_read/sec=100.405G/s eltcnt=4.72333G eltcnt/sec=6.73808G/s
BM_bcmp<uint64_t, InequalHalfway>_BigO         0.16 N          0.16 N
BM_bcmp<uint64_t, InequalHalfway>_RMS            28 %            28 %
Comparing build-old/test/llvm-bcmp-bench to build-new/test/llvm-bcmp-bench
Benchmark                                                  Time             CPU      Time Old      Time New       CPU Old       CPU New
---------------------------------------------------------------------------------------------------------------------------------------
<...>
BM_bcmp<uint8_t, Identical>/512000                      -0.9570         -0.9570        432131         18593        432101         18590
<...>
BM_bcmp<uint16_t, Identical>/256000                     -0.8826         -0.8826        161408         18950        161409         18948
<...>
BM_bcmp<uint32_t, Identical>/128000                     -0.7714         -0.7714         81497         18627         81488         18627
<...>
BM_bcmp<uint64_t, Identical>/64000                      -0.6239         -0.6239         50138         18855         50138         18855
<...>
BM_bcmp<uint8_t, InequalHalfway>/512000                 -0.9503         -0.9503        192405          9570        192392          9569
<...>
BM_bcmp<uint16_t, InequalHalfway>/256000                -0.9253         -0.9253        127858          9547        127860          9547
<...>
BM_bcmp<uint32_t, InequalHalfway>/128000                -0.8088         -0.8088         49140          9396         49140          9394
<...>
BM_bcmp<uint64_t, InequalHalfway>/64000                 -0.7041         -0.7041         32101          9499         32099          9498
```

What can we tell from the benchmark?
* Performance of naive equality check somewhat improves with element size,
  maxing out at eltcnt/sec=1.58603G/s for uint16_t, or bytes_read/sec=19.0209G/s
  for uint64_t. I think, that instability implies performance problems.
* Performance of `memcmp()`-aware benchmark always maxes out at around
  bytes_read/sec=51.2991G/s for every type. That is 2.6x the throughput of the
  naive variant!
* eltcnt/sec metric for the `memcmp()`-aware benchmark maxes out at
  eltcnt/sec=27.541G/s for uint8_t (was: eltcnt/sec=1.18491G/s, so 24x) and
  linearly decreases with element size.
  For uint64_t, it's ~4x+ the elements/second.
* The call obvious is more pricey than the loop, with small element count.
  As it can be seen from the full output {F8768210}, the `memcmp()` is almost
  universally worse, independent of the element size (and thus buffer size) when
  element count is less than 8.

So all in all, bcmp idiom does indeed pose untapped performance headroom.
This diff does implement said idiom recognition. I think a reasonable test
coverage is present, but do tell if there is anything obvious missing.

Now, quality. This does succeed to build and pass the test-suite, at least
without any non-bundled elements. {F8768216} {F8768217}
This transform fires 91 times:
```
$ /build/test-suite/utils/compare.py -m loop-idiom.NumBCmp result-new.json
Tests: 1149
Metric: loop-idiom.NumBCmp

Program                                         result-new

MultiSourc...Benchmarks/7zip/7zip-benchmark    79.00
MultiSource/Applications/d/make_dparser         3.00
SingleSource/UnitTests/vla                      2.00
MultiSource/Applications/Burg/burg              1.00
MultiSourc.../Applications/JM/lencod/lencod     1.00
MultiSource/Applications/lemon/lemon            1.00
MultiSource/Benchmarks/Bullet/bullet            1.00
MultiSourc...e/Benchmarks/MallocBench/gs/gs     1.00
MultiSourc...gs-C/TimberWolfMC/timberwolfmc     1.00
MultiSourc...Prolangs-C/simulator/simulator     1.00
```
The size changes are:
I'm not sure what's going on with SingleSource/UnitTests/vla.test yet, did not look.
```
$ /build/test-suite/utils/compare.py -m size..text result-{old,new}.json --filter-hash
Tests: 1149
Same hash: 907 (filtered out)
Remaining: 242
Metric: size..text

Program                                        result-old result-new diff
test-suite...ingleSource/UnitTests/vla.test   753.00     833.00     10.6%
test-suite...marks/7zip/7zip-benchmark.test   1001697.00 966657.00  -3.5%
test-suite...ngs-C/simulator/simulator.test   32369.00   32321.00   -0.1%
test-suite...plications/d/make_dparser.test   89585.00   89505.00   -0.1%
test-suite...ce/Applications/Burg/burg.test   40817.00   40785.00   -0.1%
test-suite.../Applications/lemon/lemon.test   47281.00   47249.00   -0.1%
test-suite...TimberWolfMC/timberwolfmc.test   250065.00  250113.00   0.0%
test-suite...chmarks/MallocBench/gs/gs.test   149889.00  149873.00  -0.0%
test-suite...ications/JM/lencod/lencod.test   769585.00  769569.00  -0.0%
test-suite.../Benchmarks/Bullet/bullet.test   770049.00  770049.00   0.0%
test-suite...HMARK_ANISTROPIC_DIFFUSION/128    NaN        NaN        nan%
test-suite...HMARK_ANISTROPIC_DIFFUSION/256    NaN        NaN        nan%
test-suite...CHMARK_ANISTROPIC_DIFFUSION/64    NaN        NaN        nan%
test-suite...CHMARK_ANISTROPIC_DIFFUSION/32    NaN        NaN        nan%
test-suite...ENCHMARK_BILATERAL_FILTER/64/4    NaN        NaN        nan%
Geomean difference                                                   nan%
         result-old    result-new       diff
count  1.000000e+01  10.00000      10.000000
mean   3.152090e+05  311695.40000  0.006749
std    3.790398e+05  372091.42232  0.036605
min    7.530000e+02  833.00000    -0.034981
25%    4.243300e+04  42401.00000  -0.000866
50%    1.197370e+05  119689.00000 -0.000392
75%    6.397050e+05  639705.00000 -0.000005
max    1.001697e+06  966657.00000  0.106242
```

I don't have timings though.

And now to the code. The basic idea is to completely replace the whole loop.
If we can't fully kill it, don't transform.
I have left one or two comments in the code, so hopefully it can be understood.

Also, there is a few TODO's that i have left for follow-ups:
* widening of `memcmp()`/`bcmp()`
* step smaller than the comparison size
* Metadata propagation
* more than two blocks as long as there is still a single backedge?
* ???

Reviewers: reames, fhahn, mkazantsev, chandlerc, craig.topper, courbet

Reviewed By: courbet

Subscribers: miyuki, hiraditya, xbolva00, nikic, jfb, gchatelet, courbet, llvm-commits, mclow.lists

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D61144

llvm-svn: 374662
2019-10-12 15:35:32 +00:00
Roman Lebedev
7d56790f6f [NFC][LoopIdiom] Add bcmp loop idiom miscompile test from PR43206.
The transform forgot to check SCEV loop scopes.

https://bugs.llvm.org/show_bug.cgi?id=43206

llvm-svn: 374661
2019-10-12 15:35:16 +00:00
Roman Lebedev
c2305ce0d8 [NFC][LoopIdiom] Move one bcmp test into the proper place
llvm-svn: 374660
2019-10-12 15:35:09 +00:00
Simon Pilgrim
3fd222dafb [X86][SSE] Avoid unnecessary PMOVZX in v4i8 sum reduction
This should go away once D66004 has landed and we can simplify shuffle chains using demanded elts.

llvm-svn: 374658
2019-10-12 15:19:13 +00:00
Joel E. Denny
2c2c40daf4 [lit] Try again to fix new tests that fail on Windows bots
Based on the bot logs, when lit's internal diff runs on Windows, it
looks like binary diffs must be decoded also for Python 2.7.
Otherwise, writing the diff to stdout fails with:

```
UnicodeEncodeError: 'ascii' codec can't encode characters in position 7-8: ordinal not in range(128)
```

I did not need to decode using Python 2.7.15 under Ubuntu.  When I do
it anyway in that case, `errors="backslashreplace"` fails for me:

```
TypeError: don't know how to handle UnicodeDecodeError in error callback
```

However, `errors="ignore"` works, so this patch uses that, hoping
it'll work on Windows as well.

This patch leaves `errors="backslashreplace"` for Python >= 3.5 as
there's no evidence yet that doesn't work and it produces more
informative binary diffs.  This patch also adjusts some lit tests to
succeed for either error handler.

This patch adjusts changes introduced by D68664.

llvm-svn: 374657
2019-10-12 14:58:43 +00:00
Joel E. Denny
c7a6363e4e Revert r374654: "[lit] Try to fix new tests that fail on Windows bots"
llvm-svn: 374656
2019-10-12 14:58:30 +00:00
Simon Pilgrim
a95636568c [CostModel][X86] Improve sum reduction costs.
I can't see any notable differences in costs between SSE2 and SSE42 arches for FADD/ADD reduction, so I've lowered the target to just SSE2.

I've also added vXi8 sum reduction costs in line with the PSADBW codegen and discussions on PR42674.

llvm-svn: 374655
2019-10-12 13:21:50 +00:00
Joel E. Denny
54241b4586 [lit] Try to fix new tests that fail on Windows bots
llvm-svn: 374654
2019-10-12 13:08:21 +00:00
Joel E. Denny
4e060cb8b6 [lit] Fix a few oversights in r374651 that broke some bots
llvm-svn: 374653
2019-10-12 12:32:00 +00:00