llvm/CodeGen at 98925ead780628072b391935ac350852fb3f10c6 - llvm

mirror of https://github.com/RPCSX/llvm.git synced 2025-02-25 23:45:36 +00:00

Keno Fischer 2a3b42cf37 [ExecutionDepsFix] Improve clearance calculation for loops

Summary:
In revision rL278321, ExecutionDepsFix learned how to pick a better
register for undef register reads, e.g. for instructions such as
`vcvtsi2sdq`. While this revision improved performance on a good number
of our benchmarks, it unfortunately also caused significant regressions
(up to 3x) on others. This regression turned out to be caused by loops
such as:

PH -> A -> B (xmm<Undef> -> xmm<Def>) -> C -> D -> EXIT
      ^                                  |
      +----------------------------------+

In the previous version of the clearance calculation, we would visit
the blocks in order, remembering for each whether there were any
incoming backedges from blocks that we hadn't processed yet and if
so queuing up the block to be re-processed. However, for loop structures
such as the above, this is clearly insufficient, since the block B
does not have any unknown backedges, so we do not see the false
dependency from the previous iteration's Def of xmm registers in B.

To fix this, we need to consider all blocks that are part of the loop
and reprocess them once the correct clearance values are known. As
an optimization, we also want to avoid reprocessing any later blocks
that are not part of the loop.

In summary, the iteration order is as follows:
Before:                      PH A B C D A'
Corrected (Naive):           PH A B C D A' B' C' D'
Corrected (w/ optimization): PH A B C A' B' C' D

To facilitate this optimization we introduce two new counters for each
basic block. The first counts how many of its predecessors have
completed primary processing. The second counts how many of its
predecessors have completed all processing (we will call such a block
*done*). Now, the criterion to reprocess a block is as follows:
- All predecessors have completed primary processing.
- For x the number of predecessors that had completed primary
  processing *at the time of primary processing of this block*,
  the number of predecessors that are done has reached x.

The intuition behind this criterion is as follows:
We need to perform primary processing on all predecessors in order to
find out any direct defs in those predecessors. When predecessors are
done, we also know that we have information about indirect defs (e.g.
defs in block B that were inherited through B->C->A->B). We cannot
wait for all predecessors to be done, though, since that would cause
cyclic dependencies. It is guaranteed, however, that all those
predecessors that are prior to us in reverse postorder will be done
before us. Since we iterate over the basic blocks in reverse postorder,
the number x above is precisely the number of predecessors prior to us
in reverse postorder.
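
The criterion can be tried out on the example loop with a small toy
model. The sketch below is not the ExecutionDepsFix code itself: the
Block struct, its field names, and the helpers isDone, visitOrder and
exampleLoop are hypothetical, and blocks are assumed to be stored in
reverse postorder already. It only records the order in which blocks
would be visited, marking reprocessing with a trailing '.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy basic block (hypothetical type). Indices into the block vector are
// assumed to follow reverse postorder.
struct Block {
  std::string Name;
  std::vector<int> Succs;
  unsigned NumPreds = 0;
  bool PrimaryCompleted = false;  // primary processing of this block has run
  unsigned IncomingProcessed = 0; // preds that completed primary processing
  unsigned PrimaryIncoming = 0;   // the "x" of the criterion
  unsigned IncomingCompleted = 0; // preds that are done
  Block(std::string N, std::vector<int> S)
      : Name(std::move(N)), Succs(std::move(S)) {}
};

// A block is done once all preds had primary processing and x of them are done.
bool isDone(const Block &B) {
  return B.PrimaryCompleted && B.IncomingCompleted == B.PrimaryIncoming &&
         B.IncomingProcessed == B.NumPreds;
}

// Returns block names in visit order; a reprocessed block gets a trailing '.
std::vector<std::string> visitOrder(std::vector<Block> Blocks) {
  for (Block &B : Blocks)
    for (int S : B.Succs)
      ++Blocks[S].NumPreds;
  std::vector<std::string> Order;
  for (std::size_t I = 0; I < Blocks.size(); ++I) {
    Blocks[I].PrimaryCompleted = true;
    Blocks[I].PrimaryIncoming = Blocks[I].IncomingProcessed;
    bool Primary = true;
    std::vector<int> Work{static_cast<int>(I)};
    while (!Work.empty()) {
      int Cur = Work.back();
      Work.pop_back();
      Order.push_back(Blocks[Cur].Name + (Primary ? "" : "'"));
      bool Done = isDone(Blocks[Cur]);
      for (int S : Blocks[Cur].Succs) {
        Block &Succ = Blocks[S];
        if (isDone(Succ))
          continue;
        if (Primary)
          ++Succ.IncomingProcessed;
        if (Done)
          ++Succ.IncomingCompleted;
        if (isDone(Succ))
          Work.push_back(S); // criterion met: reprocess the successor now
      }
      Primary = false;
    }
  }
  return Order;
}

// The loop from the diagram (EXIT omitted): PH -> A -> B -> C -> D, C -> A.
std::vector<Block> exampleLoop() {
  return {Block("PH", {1}), Block("A", {2}), Block("B", {3}),
          Block("C", {1, 4}), Block("D", {})};
}
```

For exampleLoop() the recorded order is PH A B C A' B' C' D, matching
the "w/ optimization" line above. Blocks that never satisfy the
criterion are simply left after their primary visit in this sketch.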

Reviewers: myatsina
Differential Revision: https://reviews.llvm.org/D28759

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@293571 91177308-0d34-0410-b5e6-96231b3b80d8

2017-01-30 23:37:03 +00:00

AsmPrinter

Use print() instead of dump() in code

2017-01-28 06:53:55 +00:00

GlobalISel

GlobalISel: correctly translate invoke when callee is a register.

2017-01-30 21:45:21 +00:00

MIRParser

[MIRParser] Allow generic register specification on operand.

2017-01-20 00:29:59 +00:00

SelectionDAG

Use SelectionDAG::getBuildVector helper function where possible. NFCI.

2017-01-30 18:53:45 +00:00

AggressiveAntiDepBreaker.cpp

…

AggressiveAntiDepBreaker.h

…

AllocationOrder.cpp

…

AllocationOrder.h

…

Analysis.cpp

[CodeGen] Further simplify returned call operand logic. NFC.

2017-01-03 21:42:43 +00:00

AntiDepBreaker.h

…

AtomicExpandPass.cpp

…

BasicTargetTransformInfo.cpp

…

BranchFolding.cpp

Add support to dump dot graph block layout after MBP

2017-01-29 01:57:02 +00:00

BranchFolding.h

Add support to dump dot graph block layout after MBP

2017-01-29 01:57:02 +00:00

BranchRelaxation.cpp

Cleanup dump() functions.

2017-01-28 02:02:38 +00:00

BuiltinGCs.cpp

…

CalcSpillWeights.cpp

…

CallingConvLower.cpp

[X86] Vectorcall Calling Convention - Adding CodeGen Complete Support

2016-12-21 08:31:45 +00:00

CMakeLists.txt

New OptimizationRemarkEmitter pass for MIR

2017-01-25 23:20:33 +00:00

CodeGen.cpp

…

CodeGenPrepare.cpp

[CodeGenPrep]No negative cost in the ExtLd promotion

2017-01-27 17:16:37 +00:00

CountingFunctionInserter.cpp

…

CriticalAntiDepBreaker.cpp

…

CriticalAntiDepBreaker.h

…

DeadMachineInstructionElim.cpp

…

DetectDeadLanes.cpp

Implement LaneBitmask::any(), use it to replace !none(), NFCI

2016-12-16 19:11:56 +00:00

DFAPacketizer.cpp

…

DwarfEHPrepare.cpp

…

EarlyIfConversion.cpp

…

EdgeBundles.cpp

…

ExecutionDepsFix.cpp

[ExecutionDepsFix] Improve clearance calculation for loops

2017-01-30 23:37:03 +00:00

ExpandISelPseudos.cpp

…

ExpandPostRAPseudos.cpp

…

FaultMaps.cpp

…

FuncletLayout.cpp

…

GCMetadata.cpp

…

GCMetadataPrinter.cpp

…

GCRootLowering.cpp

…

GCStrategy.cpp

…

GlobalMerge.cpp

…

IfConversion.cpp

[IfConversion] Use reverse_iterator to simplify. NFC

2017-01-26 20:02:47 +00:00

ImplicitNullChecks.cpp

[CodeGen] Rename MachineInstrBuilder::addOperand. NFC

2017-01-13 09:58:52 +00:00

InlineSpiller.cpp

Fix for InlineSpiller accessing not updated dom tree base information.

2017-01-04 09:41:56 +00:00

InterferenceCache.cpp

…

InterferenceCache.h

…

InterleavedAccessPass.cpp

…

IntrinsicLowering.cpp

…

LatencyPriorityQueue.cpp

…

LexicalScopes.cpp

Cleanup dump() functions.

2017-01-28 02:02:38 +00:00

LiveDebugValues.cpp

Use print() instead of dump() in code

2017-01-28 06:53:55 +00:00

LiveDebugVariables.cpp

Cleanup dump() functions.

2017-01-28 02:02:38 +00:00

LiveDebugVariables.h

…

LiveInterval.cpp

Cleanup dump() functions.

2017-01-28 02:02:38 +00:00

LiveIntervalAnalysis.cpp

Cleanup dump() functions.

2017-01-28 02:02:38 +00:00

LiveIntervalUnion.cpp

…

LivePhysRegs.cpp

Cleanup dump() functions.

2017-01-28 02:02:38 +00:00

LiveRangeCalc.cpp

Treat segment [B, E) as not overlapping block with boundaries [A, B)

2017-01-18 23:12:19 +00:00

LiveRangeCalc.h

…

LiveRangeEdit.cpp

Implement LaneBitmask::any(), use it to replace !none(), NFCI

2016-12-16 19:11:56 +00:00

LiveRangeUtils.h

…

LiveRegMatrix.cpp

Implement LaneBitmask::any(), use it to replace !none(), NFCI

2016-12-16 19:11:56 +00:00

LiveRegUnits.cpp

LiveRegUnits: Add accumulateBackward() function

2017-01-21 02:21:04 +00:00

LiveStackAnalysis.cpp

…

LiveVariables.cpp

Cleanup dump() functions.

2017-01-28 02:02:38 +00:00

LLVMBuild.txt

…

LLVMTargetMachine.cpp

GlobalISel: Abort in ResetMachineFunctionPass if fallback isn't enabled

2017-01-13 23:46:11 +00:00

LocalStackSlotAllocation.cpp

…

LowerEmuTLS.cpp

…

LowLevelType.cpp

…

MachineBasicBlock.cpp

[AMDGPU] Prevent spills before exec mask is restored

2017-01-20 00:44:31 +00:00

MachineBlockFrequencyInfo.cpp

Add support to dump dot graph block layout after MBP

2017-01-29 01:57:02 +00:00

MachineBlockPlacement.cpp

Add support to dump dot graph block layout after MBP

2017-01-29 01:57:02 +00:00

MachineBranchProbabilityInfo.cpp

…

MachineCombiner.cpp

MachineInstr: Remove parameter from dump()

2017-01-29 18:20:42 +00:00

MachineCopyPropagation.cpp

…

MachineCSE.cpp

[codegen] Add generic functions to skip debug values.

2016-12-16 11:10:26 +00:00

MachineDominanceFrontier.cpp

…

MachineDominators.cpp

Revert "Do not verify dominator tree if it has no roots"

2017-01-25 17:15:48 +00:00

MachineFunction.cpp

Cleanup dump() functions.

2017-01-28 02:02:38 +00:00

MachineFunctionPass.cpp

Reverted: Track validity of pass results

2017-01-15 10:23:18 +00:00

MachineFunctionPrinterPass.cpp

…

MachineInstr.cpp

MachineInstr: Remove parameter from dump()

2017-01-29 18:20:42 +00:00

MachineInstrBundle.cpp

…

MachineLICM.cpp

…

MachineLoopInfo.cpp

New OptimizationRemarkEmitter pass for MIR

2017-01-25 23:20:33 +00:00

MachineModuleInfo.cpp

…

MachineModuleInfoImpls.cpp

…

MachineOptimizationRemarkEmitter.cpp

New OptimizationRemarkEmitter pass for MIR

2017-01-25 23:20:33 +00:00

MachinePassRegistry.cpp

…

MachinePipeliner.cpp

Cleanup dump() functions.

2017-01-28 02:02:38 +00:00

MachinePostDominators.cpp

…

MachineRegionInfo.cpp

…

MachineRegisterInfo.cpp

Cleanup dump() functions.

2017-01-28 02:02:38 +00:00

MachineScheduler.cpp

Cleanup dump() functions.

2017-01-28 02:02:38 +00:00

MachineSink.cpp

…

MachineSSAUpdater.cpp

…

MachineTraceMetrics.cpp

…

MachineVerifier.cpp

CodeGen: Assert that liveness is up to date when reading block live-ins.

2017-01-05 20:01:19 +00:00

MIRPrinter.cpp

CodeGen: Assert that liveness is up to date when reading block live-ins.

2017-01-05 20:01:19 +00:00

MIRPrinter.h

…

MIRPrintingPass.cpp

…

OptimizePHIs.cpp

…

ParallelCG.cpp

…

PatchableFunction.cpp

[CodeGen] Rename MachineInstrBuilder::addOperand. NFC

2017-01-13 09:58:52 +00:00

PeepholeOptimizer.cpp

PeepholeOptimizer: Do not replace SubregToReg(bitcast like)

2017-01-09 21:38:17 +00:00

PHIElimination.cpp

…

PHIEliminationUtils.cpp

…

PHIEliminationUtils.h

…

PostRAHazardRecognizer.cpp

…

PostRASchedulerList.cpp

Cleanup dump() functions.

2017-01-28 02:02:38 +00:00

PreISelIntrinsicLowering.cpp

…

ProcessImplicitDefs.cpp

…

PrologEpilogInserter.cpp

[thumb,framelowering] Reset NoVRegs in Thumb1FrameLowering::emitPrologue.

2017-01-18 15:01:22 +00:00

PseudoSourceValue.cpp

…

README.txt

…

RegAllocBase.cpp

…

RegAllocBase.h

…

RegAllocBasic.cpp

…

RegAllocFast.cpp

…

RegAllocGreedy.cpp

New OptimizationRemarkEmitter pass for MIR

2017-01-25 23:20:33 +00:00

RegAllocPBQP.cpp

Cleanup dump() functions.

2017-01-28 02:02:38 +00:00

RegisterClassInfo.cpp

Add iterator_range<regclass_iterator> to {Target,MC}RegisterInfo, NFC

2017-01-25 19:29:04 +00:00

RegisterCoalescer.cpp

[RegisterCoalescing] Recommit the patch "Remove partial redundent copy".

2017-01-28 01:05:27 +00:00

RegisterCoalescer.h

…

RegisterPressure.cpp

Cleanup dump() functions.

2017-01-28 02:02:38 +00:00

RegisterScavenging.cpp

CodeGen: Add/Factor out LiveRegUnits class; NFCI

2017-01-20 00:16:14 +00:00

RegisterUsageInfo.cpp

…

RegUsageInfoCollector.cpp

…

RegUsageInfoPropagate.cpp

…

RenameIndependentSubregs.cpp

…

ResetMachineFunctionPass.cpp

GlobalISel: Abort in ResetMachineFunctionPass if fallback isn't enabled

2017-01-13 23:46:11 +00:00

SafeStack.cpp

…

SafeStackColoring.cpp

Cleanup dump() functions.

2017-01-28 02:02:38 +00:00

SafeStackColoring.h

…

SafeStackLayout.cpp

…

SafeStackLayout.h

…

ScheduleDAG.cpp

Cleanup dump() functions.

2017-01-28 02:02:38 +00:00

ScheduleDAGInstrs.cpp

Cleanup dump() functions.

2017-01-28 02:02:38 +00:00

ScheduleDAGPrinter.cpp

…

ScoreboardHazardRecognizer.cpp

…

ShadowStackGCLowering.cpp

…

ShrinkWrap.cpp

…

SjLjEHPrepare.cpp

…

SlotIndexes.cpp

…

Spiller.h

…

SpillPlacement.cpp

…

SpillPlacement.h

…

SplitKit.cpp

Implement LaneBitmask::any(), use it to replace !none(), NFCI

2016-12-16 19:11:56 +00:00

SplitKit.h

…

StackColoring.cpp

Cleanup dump() functions.

2017-01-28 02:02:38 +00:00

StackMapLivenessAnalysis.cpp

…

StackMaps.cpp

…

StackProtector.cpp

…

StackSlotColoring.cpp

In the below scenario, we must be able to skip the a DBG_VALUE instruction and

2017-01-09 17:45:02 +00:00

TailDuplication.cpp

…

TailDuplicator.cpp

…

TargetFrameLoweringImpl.cpp

…

TargetInstrInfo.cpp

[CodeGen] Rename MachineInstrBuilder::addOperand. NFC

2017-01-13 09:58:52 +00:00

TargetLoweringBase.cpp

Revert "In visitSTORE, always use FindBetterChain, rather than only when UseAA is enabled."

2017-01-26 16:46:13 +00:00

TargetLoweringObjectFileImpl.cpp

Revert "[COFF] Use 32-bit jump table entries in .rdata for Win64"

2016-12-29 17:07:10 +00:00

TargetOptionsImpl.cpp

…

TargetPassConfig.cpp

…

TargetRegisterInfo.cpp

Cleanup dump() functions.

2017-01-28 02:02:38 +00:00

TargetSchedule.cpp

…

TargetSubtargetInfo.cpp

…

TwoAddressInstructionPass.cpp

[CodeGen] Rename MachineInstrBuilder::addOperand. NFC

2017-01-13 09:58:52 +00:00

UnreachableBlockElim.cpp

…

VirtRegMap.cpp

Implement LaneBitmask::any(), use it to replace !none(), NFCI

2016-12-16 19:11:56 +00:00

WinEHPrepare.cpp

…

XRayInstrumentation.cpp

[CodeGen] Rename MachineInstrBuilder::addOperand. NFC

2017-01-13 09:58:52 +00:00

README.txt

//===---------------------------------------------------------------------===//

Common register allocation / spilling problem:

        mul lr, r4, lr
        str lr, [sp, #+52]
        ldr lr, [r1, #+32]
        sxth r3, r3
        ldr r4, [sp, #+52]
        mla r4, r3, lr, r4

can be:

        mul lr, r4, lr
        mov r4, lr
        str lr, [sp, #+52]
        ldr lr, [r1, #+32]
        sxth r3, r3
        mla r4, r3, lr, r4

and then "merge" mul and mov:

        mul r4, r4, lr
        str r4, [sp, #+52]
        ldr lr, [r1, #+32]
        sxth r3, r3
        mla r4, r3, lr, r4

It also increases the likelihood that the store becomes dead.

//===---------------------------------------------------------------------===//

bb27 ...
        ...
        %reg1037 = ADDri %reg1039, 1
        %reg1038 = ADDrs %reg1032, %reg1039, %NOREG, 10
    Successors according to CFG: 0x8b03bf0 (#5)

bb76 (0x8b03bf0, LLVM BB @0x8b032d0, ID#5):
    Predecessors according to CFG: 0x8b0c5f0 (#3) 0x8b0a7c0 (#4)
        %reg1039 = PHI %reg1070, mbb<bb76.outer,0x8b0c5f0>, %reg1037, mbb<bb27,0x8b0a7c0>

Note ADDri is not a two-address instruction. However, its result %reg1037 is an
operand of the PHI node in bb76 and its operand %reg1039 is the result of the
PHI node. We should treat it as a two-address code and make sure the ADDri is
scheduled after any node that reads %reg1039.

//===---------------------------------------------------------------------===//

Use local info (i.e. the register scavenger) to assign a reloaded value a free
register so that it can be reused:
        ldr r3, [sp, #+4]
        add r3, r3, #3
        ldr r2, [sp, #+8]
        add r2, r2, #2
        ldr r1, [sp, #+4]  <==
        add r1, r1, #1
        ldr r0, [sp, #+4]
        add r0, r0, #2

//===---------------------------------------------------------------------===//

LLVM aggressively lifts common subexpressions out of loops. Sometimes this has
negative side effects:

R1 = X + 4
R2 = X + 7
R3 = X + 15

loop:
load [i + R1]
...
load [i + R2]
...
load [i + R3]

Suppose register pressure is high, so that R1, R2, and R3 may be spilled. We
need to implement proper re-materialization to handle this:

R1 = X + 4
R2 = X + 7
R3 = X + 15

loop:
R1 = X + 4  @ re-materialized
load [i + R1]
...
R2 = X + 7 @ re-materialized
load [i + R2]
...
R3 = X + 15 @ re-materialized
load [i + R3]

Furthermore, with re-association, we can enable sharing:

R1 = X + 4
R2 = X + 7
R3 = X + 15

loop:
T = i + X
load [T + 4]
...
load [T + 7]
...
load [T + 15]
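
The re-association step relies on i + (X + c) being equal to (i + X) + c,
so a single temporary T can replace the three loop-invariant registers.
A minimal sketch of the two loop shapes follows. The function names and
the demoMemory helper are hypothetical, and plain array indexing stands
in for the loads.

```cpp
#include <vector>

// Hypothetical test memory: Mem[k] == k for k in [0, 64).
std::vector<int> demoMemory() {
  std::vector<int> Mem(64);
  for (int K = 0; K < 64; ++K)
    Mem[K] = K;
  return Mem;
}

// Hoisted form: R1, R2 and R3 are all live across the loop.
long sumHoisted(const std::vector<int> &Mem, long X, long N) {
  long R1 = X + 4, R2 = X + 7, R3 = X + 15;
  long Sum = 0;
  for (long I = 0; I < N; ++I)
    Sum += Mem[I + R1] + Mem[I + R2] + Mem[I + R3];
  return Sum;
}

// Re-associated form: only X is live across the loop; T is shared by the
// three loads and the constants become address offsets.
long sumReassociated(const std::vector<int> &Mem, long X, long N) {
  long Sum = 0;
  for (long I = 0; I < N; ++I) {
    long T = I + X;
    Sum += Mem[T + 4] + Mem[T + 7] + Mem[T + 15];
  }
  return Sum;
}
```

Both functions read the same addresses, so they return the same sum for
any X and N that stay inside the buffer.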

//===---------------------------------------------------------------------===//

It's not always a good idea to choose rematerialization over spilling. If all
the load / store instructions would be folded then spilling is cheaper because
it won't require new live intervals / registers. See 2003-05-31-LongShifts for
an example.

//===---------------------------------------------------------------------===//

With a copying garbage collector, derived pointers must not be retained across
collector safe points; the collector could move the objects and invalidate the
derived pointer. This is bad enough in the first place, but safe points can
crop up unpredictably. Consider:

        %array = load { i32, [0 x %obj] }** %array_addr
        %nth_el = getelementptr { i32, [0 x %obj] }* %array, i32 0, i32 %n
        %old = load %obj** %nth_el
        %z = div i64 %x, %y
        store %obj* %new, %obj** %nth_el

If the i64 division is lowered to a libcall, then a safe point will (must)
appear for the call site. If a collection occurs, %array and %nth_el no longer
point into the correct object.

The fix for this is to copy address calculations so that dependent pointers
are never live across safe point boundaries. But the loads cannot be copied
like this if there was an intervening store, so this may be hard to get right.

Only a concurrent mutator can trigger a collection at the libcall safe point.
So single-threaded programs do not have this requirement, even with a copying
collector. Still, LLVM optimizations would probably undo a front-end's careful
work.

//===---------------------------------------------------------------------===//

The OCaml frametable structure supports liveness information. It would be good
to support it.

//===---------------------------------------------------------------------===//

The FIXME in ComputeCommonTailLength in BranchFolding.cpp needs to be
revisited. The check is there to work around a misuse of directives in inline
assembly.

//===---------------------------------------------------------------------===//

It would be good to detect collector/target compatibility instead of silently
doing the wrong thing.

//===---------------------------------------------------------------------===//

It would be really nice to be able to write patterns in .td files for copies,
which would eliminate a bunch of explicit predicates on them (e.g. no side 
effects).  Once this is in place, it would be even better to have tblgen 
synthesize the various copy insertion/inspection methods in TargetInstrInfo.

//===---------------------------------------------------------------------===//

Stack coloring improvements:

1. Do proper LiveStackAnalysis on all stack objects including those which are
   not spill slots.
2. Reorder objects to fill in gaps between objects.
   e.g. 4, 1, <gap>, 4, 1, 1, 1, <gap>, 4 => 4, 1, 1, 1, 1, 4, 4
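
The effect of reordering can be checked with a small layout
calculation. The sketch below assumes every object is aligned to its
own (nonzero) size, which is what makes the gaps in the example appear.
frameSize and packByAlignment are hypothetical helpers, and sorting by
descending alignment is just one simple way to remove the padding, not
necessarily the order shown above.

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// Total bytes used when objects are laid out in the given order.
// Assumption: alignment of each object equals its size (sizes are nonzero).
unsigned frameSize(const std::vector<unsigned> &Sizes) {
  unsigned Offset = 0;
  for (unsigned Size : Sizes) {
    Offset = (Offset + Size - 1) / Size * Size; // pad up to the alignment
    Offset += Size;
  }
  return Offset;
}

// One simple reordering: largest alignment first, so no padding is needed
// under the alignment == size assumption.
std::vector<unsigned> packByAlignment(std::vector<unsigned> Sizes) {
  std::stable_sort(Sizes.begin(), Sizes.end(), std::greater<unsigned>());
  return Sizes;
}
```

With these assumptions the original order 4, 1, 4, 1, 1, 1, 4 occupies
20 bytes, while both 4, 1, 1, 1, 1, 4, 4 and the sorted order occupy 16.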

//===---------------------------------------------------------------------===//

The scheduler should be able to sort nearby instructions by their address. For
example, in an expanded memset sequence it's not uncommon to see code like this:

  movl $0, 4(%rdi)
  movl $0, 8(%rdi)
  movl $0, 12(%rdi)
  movl $0, 0(%rdi)

Each of the stores is independent, and the scheduler is currently making an
arbitrary decision about the order.
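
As a rough illustration of the ordering being asked for, the sketch
below sorts a group of independent stores by their offset. The Store
struct and the helper names are hypothetical, and it assumes all stores
in the group use the same base register and do not depend on each other.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical model of "movl $Imm, Offset(%rdi)".
struct Store {
  std::int64_t Offset;
  std::uint32_t Imm;
};

// Orders independent stores to the same base by ascending address.
std::vector<Store> sortByAddress(std::vector<Store> Stores) {
  std::stable_sort(Stores.begin(), Stores.end(),
                   [](const Store &A, const Store &B) {
                     return A.Offset < B.Offset;
                   });
  return Stores;
}

// The four stores from the memset example, in their emitted order.
std::vector<Store> exampleStores() {
  return {{4, 0}, {8, 0}, {12, 0}, {0, 0}};
}

// Extracts the offsets, for inspection.
std::vector<std::int64_t> offsetsOf(const std::vector<Store> &Stores) {
  std::vector<std::int64_t> Result;
  for (const Store &S : Stores)
    Result.push_back(S.Offset);
  return Result;
}
```

Applied to exampleStores(), the offsets come out as 0, 4, 8, 12.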

//===---------------------------------------------------------------------===//

Another opportunity in this code is that the $0 could be moved to a register:

  movl $0, 4(%rdi)
  movl $0, 8(%rdi)
  movl $0, 12(%rdi)
  movl $0, 0(%rdi)

This would save substantial code size, especially for longer sequences like
this. It would be easy to have a rule telling isel to avoid matching MOV32mi
if the immediate has more than some fixed number of uses. It's more involved
to teach the register allocator how to do late folding to recover from
excessive register pressure.
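
The use-count rule could look roughly like the sketch below. It is only
a model of the decision, not isel code: immediatesToHoist and its
MaxUses parameter are hypothetical, and the input is simply the list of
immediates stored by a sequence of instructions.

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <vector>

// Returns the immediates that are stored more than MaxUses times and would
// therefore be placed in a register instead of being folded into each store.
std::set<std::uint32_t>
immediatesToHoist(const std::vector<std::uint32_t> &StoredImms,
                  unsigned MaxUses) {
  std::map<std::uint32_t, unsigned> Uses;
  for (std::uint32_t Imm : StoredImms)
    ++Uses[Imm];
  std::set<std::uint32_t> Hoist;
  for (const auto &Entry : Uses)
    if (Entry.second > MaxUses)
      Hoist.insert(Entry.first);
  return Hoist;
}
```

For the four stores of $0 above and a threshold of 2, the result
contains 0, so the constant would be materialized once in a register.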