[MemoryDepAnalysis] Fix compile time slowdown

- Problem
One program takes ~3min to compile under -O2. This happens after a certain
function A is inlined ~700 times into a function B, inserting thousands of new
BBs. As a result, 80% of the compilation time is spent in
GVN::processNonLocalLoad and
MemoryDependenceAnalysis::getNonLocalPointerDependency, searching for
nonlocal information for basic blocks.

Usually, to avoid spending a long time processing nonlocal loads, GVN bails out
if it gets more than 100 deps as a result from
MD->getNonLocalPointerDependency. However, this only happens *after* all
nonlocal information for BBs has been computed, which is the bottleneck in
this scenario. For instance, there are 8280 calls where
getNonLocalPointerDependency returns deps with more than 100 BBs, and in
600 of those it returns more than 1000 BBs.

- Solution
Bail out early during the nonlocal info computation whenever we reach a
specified threshold. This patch proposes a threshold of 100 BBs, which
reduces the compile time from ~3min to 23s.
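
A minimal, self-contained sketch of the idea, not the LLVM code itself: a
worklist walk over predecessor blocks that stops once more than a given number
of results has been collected. The names Graph, ScanStats, collectNonLocal and
callerGivesUp are hypothetical, and the "CFG" is a plain adjacency list used
only for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a CFG: Preds[B] lists the predecessors of block B.
using Graph = std::vector<std::vector<int>>;

struct ScanStats {
  std::size_t NumResults = 0; // blocks recorded as results so far
  bool BailedOut = false;     // true if the limit stopped the walk early
};

// Walk backwards from Start, recording each newly visited block as a result.
// A Limit of 0 means "no limit" (the behaviour before the patch, in spirit).
ScanStats collectNonLocal(const Graph &Preds, int Start, std::size_t Limit) {
  ScanStats Stats;
  std::vector<bool> Visited(Preds.size(), false);
  std::vector<int> Worklist{Start};
  while (!Worklist.empty()) {
    int BB = Worklist.back();
    Worklist.pop_back();
    // Early bail-out: checked at the top of every iteration, so the walk
    // stops shortly after the limit is exceeded instead of visiting
    // every reachable block.
    if (Limit != 0 && Stats.NumResults > Limit) {
      Stats.BailedOut = true;
      break;
    }
    if (Visited[BB])
      continue;
    Visited[BB] = true;
    ++Stats.NumResults;
    for (int P : Preds[BB])
      Worklist.push_back(P);
  }
  return Stats;
}

// Caller-side check in the style described above: give up when the number
// of results exceeds the caller's own limit.
bool callerGivesUp(const ScanStats &S, std::size_t CallerLimit) {
  return S.NumResults > CallerLimit;
}
```

On a straight-line chain of 5000 blocks, the unlimited walk records all 5000
results, while a limit of 100 stops after 101. A caller that gives up above
100 results reaches the same decision in both cases, which is why the early
exit saves the traversal cost without changing that caller's outcome.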

- Testing
The test-suite showed no compile-time or execution-time regressions.

Some numbers from my machine (x86_64 darwin):
 - 17s under -Oz (which avoids inlining).
 - 1.3s under -O1.
 - 2m51s under -O2 ToT
 *** 23s under -O2 w/ Result.size() > 100
 - 1m54s under -O2 w/ Result.size() > 500

With NumResultsLimit = 100, GVN yields the same outcome as in the
unlimited 3min version.

http://reviews.llvm.org/D5532
rdar://problem/18188041

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@218792 91177308-0d34-0410-b5e6-96231b3b80d8
Bruno Cardoso Lopes 2014-10-01 20:07:13 +00:00
parent 72447214a6
commit 1610bcf8d9

@@ -51,6 +51,9 @@ STATISTIC(NumCacheCompleteNonLocalPtr,
 // Limit for the number of instructions to scan in a block.
 static const int BlockScanLimit = 100;
 
+// Limit on the number of memdep results to process.
+static const int NumResultsLimit = 100;
+
 char MemoryDependenceAnalysis::ID = 0;
 
 // Register this pass...
@@ -1133,6 +1136,25 @@ getNonLocalPointerDepFromBB(const PHITransAddr &Pointer,
   while (!Worklist.empty()) {
     BasicBlock *BB = Worklist.pop_back_val();
 
+    // If we do process a large number of blocks it becomes very expensive and
+    // likely it isn't worth worrying about
+    if (Result.size() > NumResultsLimit) {
+      Worklist.clear();
+      // Sort it now (if needed) so that recursive invocations of
+      // getNonLocalPointerDepFromBB and other routines that could reuse the
+      // cache value will only see properly sorted cache arrays.
+      if (Cache && NumSortedEntries != Cache->size()) {
+        SortNonLocalDepInfoCache(*Cache, NumSortedEntries);
+        NumSortedEntries = Cache->size();
+      }
+      // Since we bail out, the "Cache" set won't contain all of the
+      // results for the query. This is ok (we can still use it to accelerate
+      // specific block queries) but we can't do the fastpath "return all
+      // results from the set". Clear out the indicator for this.
+      CacheInfo->Pair = BBSkipFirstBlockPair();
+      return true;
+    }
+
     // Skip the first block if we have it.
     if (!SkipFirstBlock) {
       // Analyze the dependency of *Pointer in FromBB. See if we already have