This patch expands the support of lowerInterleavedload to {8|16|32}x8i stride 3.
LLVM creates suboptimal shuffle code-gen for AVX2. In overall, this patch is a specific fix for the pattern (Strid=3 VF={8|16|32}) and we plan to include the store (deinterleved side).
The patch goal is to optimize the following sequence:
a0 b0 c0 a1 b1 c1 a2 b2
c2 a3 b3 c3 a4 b4 c4 a5
b5 c5 a6 b6 c6 a7 b7 c7
into
a0 a1 a2 a3 a4 a5 a6 a7
b0 b1 b2 b3 b4 b5 b6 b7
c0 c1 c2 c3 c4 c5 c6 c7
Reviewers
1. zvi
2. igor
3. guyblank
4. dorit
5. Ayal
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@312722 91177308-0d34-0410-b5e6-96231b3b80d8
This patch expands the support of lowerInterleavedStore to 16x8i stride 4.
LLVM creates suboptimal shuffle code-gen for AVX2. In overall, this patch is a specific fix for the pattern (Strid=4 VF=16) and we plan to include more patterns in the future.
The patch goal is to optimize the following sequence:
At the end of the computation, we have ymm2, ymm0, ymm12 and ymm3 holding
each 16 chars:
c0, c1, , c16
m0, m1, , m16
y0, y1, , y16
k0, k1, ., k16
And these need to be transposed/interleaved and stored like so:
c0 m0 y0 k0 c1 m1 y1 k1 c2 m2 y2 k2 c3 m3 y3 k3 ....
Differential Revision: https://reviews.llvm.org/D35829
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@310252 91177308-0d34-0410-b5e6-96231b3b80d8
This patch expands the support of lowerInterleavedStore to 32x8i stride 4.
LLVM creates suboptimal shuffle code-gen for AVX2. In overall, this patch is a specific fix for the pattern (Strid=4 VF=32) and we plan to include more patterns in the future. To reach our goal of "more patterns". We include two mask creators. The first function creates shuffle's mask equivalent to unpacklo/unpackhi instructions. The other creator creates mask equivalent to a concat of two half vectors(high/low).
The patch goal is to optimize the following sequence:
At the end of the computation, we have ymm2, ymm0, ymm12 and ymm3 holding
each 32 chars:
c0, c1, , c31
m0, m1, , m31
y0, y1, , y31
k0, k1, ., k31
And these need to be transposed/interleaved and stored like so:
c0 m0 y0 k0 c1 m1 y1 k1 c2 m2 y2 k2 c3 m3 y3 k3 ....
Reviewers:
dorit
Farhana
RKSimon
guyblank
DavidKreitzer
Differential Revision: https://reviews.llvm.org/D34601
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@309086 91177308-0d34-0410-b5e6-96231b3b80d8
X86. The pass optimizes as a unit the entire wide load + shuffles pattern
produced by interleaved vectorization. This initial patch optimizes one pattern
(64-bit elements interleaved by a factor of 4). Future patches will generalize
to additional patterns.
Patch by Farhana Aleen
Differential revision: http://reviews.llvm.org/D24681
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@284260 91177308-0d34-0410-b5e6-96231b3b80d8