Enable folding of long seteq/setne comparisons into branches and select instructions
Implement unfolded long relational comparisons against a constants a bit more efficiently
Folding comparisons changes code that looks like this:
mov %EAX, DWORD PTR [%ESP + 4]
mov %EDX, DWORD PTR [%ESP + 8]
mov %ECX, %EAX
or %ECX, %EDX
sete %CL
test %CL, %CL
je .LBB2 # PC rel: F
into code that looks like this:
mov %EAX, DWORD PTR [%ESP + 4]
mov %EDX, DWORD PTR [%ESP + 8]
mov %ECX, %EAX
or %ECX, %EDX
jne .LBB2 # PC rel: F
This speeds up 186.crafty by 6% with llc-ls.
llvm-svn: 12702
of the words of the constant is zeros. For example:
Y = and long X, 1234
now generates:
Yl = and Xl, 1234
Yh = 0
instead of:
Yl = and Xl, 1234
Yh = and Xh, 0
llvm-svn: 12685
* In promote32, if we can just promote a constant value, do so instead of
promoting a constant dynamically.
* In visitReturn inst, actually USE the promote32 argument that takes a
Value*
The end result of this is that we now generate this:
test:
mov %EAX, 0
ret
instead of...
test:
mov %AX, 0
movzx %EAX, %AX
ret
for:
ushort %test() {
ret ushort 0
}
llvm-svn: 12679
an incoming value from a block, the selector would evaluate the constant
at the TOP of the block instead of at the end of the block. This made the
live range for the constant span the entire block, increasing register
pressure needlessly.
llvm-svn: 12542
folding load instructions into other instructions across free instruction
boundaries. Perhaps this will also fix the other strange failures?
llvm-svn: 12494
testcase like this:
int %test(int* %P, int %A) {
%Pv = load int* %P
%B = add int %A, %Pv
ret int %B
}
We now generate:
test:
mov %ECX, DWORD PTR [%ESP + 4]
mov %EAX, DWORD PTR [%ESP + 8]
add %EAX, DWORD PTR [%ECX]
ret
Instead of:
test:
mov %EAX, DWORD PTR [%ESP + 4]
mov %ECX, DWORD PTR [%ESP + 8]
mov %EAX, DWORD PTR [%EAX]
add %EAX, %ECX
ret
... saving one instruction, and often a register. Note that there are a lot
of other instructions that could use this, but they aren't handled. I'm not
really interested in adding them, but mul/div and all of the FP instructions
could be supported as well if someone wanted to add them.
llvm-svn: 12204
(16) into certain areas of the SPARC V9 back-end. I'm fairly sure the US IIIi's
dcache has 32-byte lines, so I'm not sure where the 16 came from. However, in
the interest of not breaking things any more than they already are, I'm going
to leave the constant alone.
llvm-svn: 12043
of generating this code:
mov %EAX, 4
mov DWORD PTR [%ESP], %EAX
mov %AX, 123
movsx %EAX, %AX
mov DWORD PTR [%ESP + 4], %EAX
call Y
we now generate:
mov DWORD PTR [%ESP], 4
mov DWORD PTR [%ESP + 4], 123
call Y
Which hurts the eyes less. :)
Considering that register pressure around call sites is already high (with all
of the callee clobber registers n stuff), this may help a lot.
llvm-svn: 12028
1) For 8-bit registers try to use first the ones that are parts of the
same register (AL then AH). This way we only alias 2 16/32-bit
registers after allocating 4 8-bit variables.
2) Move EBX as the last register to allocate. This will cause less
spills to happen since we will have 8-bit registers available up to
register excaustion (assuming we use the allocation order). It
would be nice if we could push all of the 8-bit aliased registers
towards the end but we much prefer to keep callee saved register to
the end to avoid saving them on entry and exit of the function.
For example this gives a slight reduction of spills with linear scan
on 164.gzip.
Before:
11221 asm-printer - Number of machine instrs printed
975 spiller - Number of loads added
675 spiller - Number of stores added
398 spiller - Number of register spills
After:
11182 asm-printer - Number of machine instrs printed
952 spiller - Number of loads added
652 spiller - Number of stores added
386 spiller - Number of register spills
llvm-svn: 11996
their names more decriptive. A name consists of the base name, a
default operand size followed by a character per operand with an
optional special size. For example:
ADD8rr -> add, 8-bit register, 8-bit register
IMUL16rmi -> imul, 16-bit register, 16-bit memory, 16-bit immediate
IMUL16rmi8 -> imul, 16-bit register, 16-bit memory, 8-bit immediate
MOVSX32rm16 -> movsx, 32-bit register, 16-bit memory
llvm-svn: 11995
parse. The name is now I (operand size)*. For example:
Im32 -> instruction with 32-bit memory operands.
Im16i8 -> instruction with 16-bit memory operands and 8 bit immediate
operands.
llvm-svn: 11970
the size of the immediate and the memory operand on instructions that
use them. This resolves problems with instructions that take both a
memory and an immediate operand but their sizes differ (i.e. ADDmi32b).
llvm-svn: 11967
an 8-bit immediate. So mark the shifts that take immediates as taking
an 8-bit argument. The rest with the implicit use of CL are marked
appropriately.
A bug still exists:
def SHLDmri32 : I2A8 <"shld", 0xA4, MRMDestMem>, TB; // [mem32] <<= [mem32],R32 imm8
The immediate in the above instruction is 8-bit but the memory
reference is 32-bit. The printer prints this as an 8-bit reference
which confuses the assembler. Same with SHRDmri32.
llvm-svn: 11931
scaled indexes. This allows us to compile GEP's like this:
int* %test([10 x { int, { int } }]* %X, int %Idx) {
%Idx = cast int %Idx to long
%X = getelementptr [10 x { int, { int } }]* %X, long 0, long %Idx, ubyte 1, ubyte 0
ret int* %X
}
Into a single address computation:
test:
mov %EAX, DWORD PTR [%ESP + 4]
mov %ECX, DWORD PTR [%ESP + 8]
lea %EAX, DWORD PTR [%EAX + 8*%ECX + 4]
ret
Before it generated:
test:
mov %EAX, DWORD PTR [%ESP + 4]
mov %ECX, DWORD PTR [%ESP + 8]
shl %ECX, 3
add %EAX, %ECX
lea %EAX, DWORD PTR [%EAX + 4]
ret
This is useful for things like int/float/double arrays, as the indexing can be folded into
the loads&stores, reducing register pressure and decreasing the pressure on the decode unit.
With these changes, I expect our performance on 256.bzip2 and gzip to improve a lot. On
bzip2 for example, we go from this:
10665 asm-printer - Number of machine instrs printed
40 ra-local - Number of loads/stores folded into instructions
1708 ra-local - Number of loads added
1532 ra-local - Number of stores added
1354 twoaddressinstruction - Number of instructions added
1354 twoaddressinstruction - Number of two-address instructions
2794 x86-peephole - Number of peephole optimization performed
to this:
9873 asm-printer - Number of machine instrs printed
41 ra-local - Number of loads/stores folded into instructions
1710 ra-local - Number of loads added
1521 ra-local - Number of stores added
789 twoaddressinstruction - Number of instructions added
789 twoaddressinstruction - Number of two-address instructions
2142 x86-peephole - Number of peephole optimization performed
... and these types of instructions are often in tight loops.
Linear scan is also helped, but not as much. It goes from:
8787 asm-printer - Number of machine instrs printed
2389 liveintervals - Number of identity moves eliminated after coalescing
2288 liveintervals - Number of interval joins performed
3522 liveintervals - Number of intervals after coalescing
5810 liveintervals - Number of original intervals
700 spiller - Number of loads added
487 spiller - Number of stores added
303 spiller - Number of register spills
1354 twoaddressinstruction - Number of instructions added
1354 twoaddressinstruction - Number of two-address instructions
363 x86-peephole - Number of peephole optimization performed
to:
7982 asm-printer - Number of machine instrs printed
1759 liveintervals - Number of identity moves eliminated after coalescing
1658 liveintervals - Number of interval joins performed
3282 liveintervals - Number of intervals after coalescing
4940 liveintervals - Number of original intervals
635 spiller - Number of loads added
452 spiller - Number of stores added
288 spiller - Number of register spills
789 twoaddressinstruction - Number of instructions added
789 twoaddressinstruction - Number of two-address instructions
258 x86-peephole - Number of peephole optimization performed
Though I'm not complaining about the drop in the number of intervals. :)
llvm-svn: 11820
to do analysis.
*** FOLD getelementptr instructions into loads and stores when possible,
making use of some of the crazy X86 addressing modes.
For example, the following C++ program fragment:
struct complex {
double re, im;
complex(double r, double i) : re(r), im(i) {}
};
inline complex operator+(const complex& a, const complex& b) {
return complex(a.re+b.re, a.im+b.im);
}
complex addone(const complex& arg) {
return arg + complex(1,0);
}
Used to be compiled to:
_Z6addoneRK7complex:
mov %EAX, DWORD PTR [%ESP + 4]
mov %ECX, DWORD PTR [%ESP + 8]
*** mov %EDX, %ECX
fld QWORD PTR [%EDX]
fld1
faddp %ST(1)
*** add %ECX, 8
fld QWORD PTR [%ECX]
fldz
faddp %ST(1)
*** mov %ECX, %EAX
fxch %ST(1)
fstp QWORD PTR [%ECX]
*** add %EAX, 8
fstp QWORD PTR [%EAX]
ret
Now it is compiled to:
_Z6addoneRK7complex:
mov %EAX, DWORD PTR [%ESP + 4]
mov %ECX, DWORD PTR [%ESP + 8]
fld QWORD PTR [%ECX]
fld1
faddp %ST(1)
fld QWORD PTR [%ECX + 8]
fldz
faddp %ST(1)
fxch %ST(1)
fstp QWORD PTR [%EAX]
fstp QWORD PTR [%EAX + 8]
ret
Other programs should see similar improvements, across the board. Note that
in addition to reducing instruction count, this also reduces register pressure
a lot, always a good thing on X86. :)
llvm-svn: 11819
into a single LEA instruction. This should improve the code generated for
things like X->A.B.C[12].D.
The bigger benefit is still coming though. Note that this uses an LEA instruction
instead of an add, giving the register allocator more freedom. We should probably
never generate ADDri32's.
llvm-svn: 11817
block into MachineBasicBlock::getFirstTerminator().
This also fixes a bug in the implementation of the above in both
RegAllocLocal and InstrSched, where instructions where added after the
terminator if the basic block's only instruction was a terminator (it
shouldn't matter for RegAllocLocal since this case never occurs in
practice).
llvm-svn: 11748