llvm/test/CodeGen/X86/trunc-ext-ld-st.ll

80 lines
1.6 KiB
LLVM
Raw Normal View History

; RUN: llc < %s -march=x86-64 -mattr=+sse4.1 | FileCheck %s
;CHECK-LABEL: load_2_i8:
; A single 16-bit load
;CHECK: pmovzxbq
;CHECK: paddq
;CHECK: pshufb
; A single 16-bit store
;CHECK: movw
;CHECK: ret
define void @load_2_i8(<2 x i8>* %A) {
%T = load <2 x i8>, <2 x i8>* %A
%G = add <2 x i8> %T, <i8 9, i8 7>
store <2 x i8> %G, <2 x i8>* %A
ret void
}
;CHECK-LABEL: load_2_i16:
; Read 32-bits
;CHECK: pmovzxwq
;CHECK: paddq
[x86] Enable the new vector shuffle lowering by default. Update the entire regression test suite for the new shuffles. Remove most of the old testing which was devoted to the old shuffle lowering path and is no longer relevant really. Also remove a few other random tests that only really exercised shuffles and only incidently or without any interesting aspects to them. Benchmarking that I have done shows a few small regressions with this on LNT, zero measurable regressions on real, large applications, and for several benchmarks where the loop vectorizer fires in the hot path it shows 5% to 40% improvements for SSE2 and SSE3 code running on Sandy Bridge machines. Running on AMD machines shows even more dramatic improvements. When using newer ISA vector extensions the gains are much more modest, but the code is still better on the whole. There are a few regressions being tracked (PR21137, PR21138, PR21139) but by and large this is expected to be a win for x86 generated code performance. It is also more correct than the code it replaces. I have fuzz tested this extensively with ISA extensions up through AVX2 and found no crashes or miscompiles (yet...). The old lowering had a few miscompiles and crashers after a somewhat smaller amount of fuzz testing. There is one significant area where the new code path lags behind and that is in AVX-512 support. However, there was *extremely little* support for that already and so this isn't a significant step backwards and the new framework will probably make it easier to implement lowering that uses the full power of AVX-512's table-based shuffle+blend (IMO). Many thanks to Quentin, Andrea, Robert, and others for benchmarking assistance. Thanks to Adam and others for help with AVX-512. Thanks to Hal, Eric, and *many* others for answering my incessant questions about how the backend actually works. =] I will leave the old code path in the tree until the 3 PRs above are at least resolved to folks' satisfaction. Then I will rip it (and 1000s of lines of code) out. =] I don't expect this flag to stay around for very long. It may not survive next week. git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@219046 91177308-0d34-0410-b5e6-96231b3b80d8
2014-10-04 03:52:55 +00:00
;CHECK: pshufd
;CHECK: movd
;CHECK: ret
define void @load_2_i16(<2 x i16>* %A) {
%T = load <2 x i16>, <2 x i16>* %A
%G = add <2 x i16> %T, <i16 9, i16 7>
store <2 x i16> %G, <2 x i16>* %A
ret void
}
;CHECK-LABEL: load_2_i32:
;CHECK: pmovzxdq
[SDAG] Introduce a combined set to the DAG combiner which tracks nodes which have successfully round-tripped through the combine phase, and use this to ensure all operands to DAG nodes are visited by the combiner, even if they are only added during the combine phase. This is critical to have the combiner reach nodes that are *introduced* during combining. Previously these would sometimes be visited and sometimes not be visited based on whether they happened to end up on the worklist or not. Now we always run them through the combiner. This fixes quite a few bad codegen test cases lurking in the suite while also being more principled. Among these, the TLS codegeneration is particularly exciting for programs that have this in the critical path like TSan-instrumented binaries (although I think they engineer to use a different TLS that is faster anyways). I've tried to check for compile-time regressions here by running llc over a merged (but not LTO-ed) clang bitcode file and observed at most a 3% slowdown in llc. Given that this is essentially a worst case (none of opt or clang are running at this phase) I think this is tolerable. The actual LTO case should be even less costly, and the cost in normal compilation should be negligible. With this combining logic, it is possible to re-legalize as we combine which is necessary to implement PSHUFB formation on x86 as a post-legalize DAG combine (my ultimate goal). Differential Revision: http://reviews.llvm.org/D4638 git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@213898 91177308-0d34-0410-b5e6-96231b3b80d8
2014-07-24 22:15:28 +00:00
;CHECK: paddd
;CHECK: pshufd
;CHECK: ret
define void @load_2_i32(<2 x i32>* %A) {
%T = load <2 x i32>, <2 x i32>* %A
%G = add <2 x i32> %T, <i32 9, i32 7>
store <2 x i32> %G, <2 x i32>* %A
ret void
}
;CHECK-LABEL: load_4_i8:
;CHECK: pmovzxbd
;CHECK: paddd
;CHECK: pshufb
;CHECK: ret
define void @load_4_i8(<4 x i8>* %A) {
%T = load <4 x i8>, <4 x i8>* %A
%G = add <4 x i8> %T, <i8 1, i8 4, i8 9, i8 7>
store <4 x i8> %G, <4 x i8>* %A
ret void
}
;CHECK-LABEL: load_4_i16:
;CHECK: pmovzxwd
[SDAG] Introduce a combined set to the DAG combiner which tracks nodes which have successfully round-tripped through the combine phase, and use this to ensure all operands to DAG nodes are visited by the combiner, even if they are only added during the combine phase. This is critical to have the combiner reach nodes that are *introduced* during combining. Previously these would sometimes be visited and sometimes not be visited based on whether they happened to end up on the worklist or not. Now we always run them through the combiner. This fixes quite a few bad codegen test cases lurking in the suite while also being more principled. Among these, the TLS codegeneration is particularly exciting for programs that have this in the critical path like TSan-instrumented binaries (although I think they engineer to use a different TLS that is faster anyways). I've tried to check for compile-time regressions here by running llc over a merged (but not LTO-ed) clang bitcode file and observed at most a 3% slowdown in llc. Given that this is essentially a worst case (none of opt or clang are running at this phase) I think this is tolerable. The actual LTO case should be even less costly, and the cost in normal compilation should be negligible. With this combining logic, it is possible to re-legalize as we combine which is necessary to implement PSHUFB formation on x86 as a post-legalize DAG combine (my ultimate goal). Differential Revision: http://reviews.llvm.org/D4638 git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@213898 91177308-0d34-0410-b5e6-96231b3b80d8
2014-07-24 22:15:28 +00:00
;CHECK: paddw
;CHECK: pshufb
;CHECK: ret
define void @load_4_i16(<4 x i16>* %A) {
%T = load <4 x i16>, <4 x i16>* %A
%G = add <4 x i16> %T, <i16 1, i16 4, i16 9, i16 7>
store <4 x i16> %G, <4 x i16>* %A
ret void
}
;CHECK-LABEL: load_8_i8:
;CHECK: pmovzxbw
[SDAG] Introduce a combined set to the DAG combiner which tracks nodes which have successfully round-tripped through the combine phase, and use this to ensure all operands to DAG nodes are visited by the combiner, even if they are only added during the combine phase. This is critical to have the combiner reach nodes that are *introduced* during combining. Previously these would sometimes be visited and sometimes not be visited based on whether they happened to end up on the worklist or not. Now we always run them through the combiner. This fixes quite a few bad codegen test cases lurking in the suite while also being more principled. Among these, the TLS codegeneration is particularly exciting for programs that have this in the critical path like TSan-instrumented binaries (although I think they engineer to use a different TLS that is faster anyways). I've tried to check for compile-time regressions here by running llc over a merged (but not LTO-ed) clang bitcode file and observed at most a 3% slowdown in llc. Given that this is essentially a worst case (none of opt or clang are running at this phase) I think this is tolerable. The actual LTO case should be even less costly, and the cost in normal compilation should be negligible. With this combining logic, it is possible to re-legalize as we combine which is necessary to implement PSHUFB formation on x86 as a post-legalize DAG combine (my ultimate goal). Differential Revision: http://reviews.llvm.org/D4638 git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@213898 91177308-0d34-0410-b5e6-96231b3b80d8
2014-07-24 22:15:28 +00:00
;CHECK: paddb
;CHECK: pshufb
;CHECK: ret
define void @load_8_i8(<8 x i8>* %A) {
%T = load <8 x i8>, <8 x i8>* %A
%G = add <8 x i8> %T, %T
store <8 x i8> %G, <8 x i8>* %A
ret void
}