Execute the individual Polly passes manually

This example presents the individual passes that are involved when optimizing code with Polly. We show how to execute them individually and explain, for each pass, which analysis is performed or which transformation is applied. In this example the polyhedral transformation is user-provided, to show how much performance improvement can be expected from an optimal automatic optimizer.

The files used and created in this example are available in the Polly checkout in the folder www/experiments/matmul. They can be created automatically by running the www/experiments/matmul/runall.sh script.

Create LLVM-IR from the C code
Polly works on LLVM-IR, so the source files must first be translated into LLVM-IR. If more than one file should be optimized, the files can be combined into a single file with llvm-link.
```
clang -S -emit-llvm matmul.c -o matmul.s
```
Load Polly automatically when calling the 'opt' tool
Polly is not built into opt or bugpoint; it is a shared library that needs to be loaded into these tools explicitly. The Polly library is called LLVMPolly.so. For a cmake build it is available in the build/lib/ directory; autoconf creates the same file in build/tools/polly/{Release+Asserts|Asserts|Debug}/lib. For convenience we create an alias that automatically loads Polly whenever 'opt' is called.
```
export PATH_TO_POLLY_LIB="$HOME/polly/build/lib"
alias opt="opt -load ${PATH_TO_POLLY_LIB}/LLVMPolly.so"
```
Prepare the LLVM-IR for Polly
Polly is only able to work with code that matches a canonical form. To translate the LLVM-IR into this form we use a set of canonicalization passes. They are scheduled by using '-polly-canonicalize'.
```
opt -S -polly-canonicalize matmul.s > matmul.preopt.ll
```

Show the SCoPs detected by Polly (optional)

To check whether Polly was able to detect SCoPs, we print the structure of the detected SCoPs. In our example two SCoPs are detected: one in 'init_array', the other in 'main'.

opt -basicaa -polly-ast -analyze -q matmul.preopt.ll

init_array():
for (c2=0;c2<=1023;c2++) {
  for (c4=0;c4<=1023;c4++) {
    Stmt_5(c2,c4);
  }
}

main():
for (c2=0;c2<=1023;c2++) {
  for (c4=0;c4<=1023;c4++) {
    Stmt_4(c2,c4);
    for (c6=0;c6<=1023;c6++) {
      Stmt_6(c2,c4,c6);
    }
  }
}
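
The source file matmul.c is not reproduced in this walkthrough. As a rough sketch, loop nests of the following shape would yield the two SCoP structures printed above. The array names follow the MemRef_A, MemRef_B and MemRef_C accesses that appear in the polyhedral representation of this example. The size N, the initialization values, and the function names matmul_kernel and demo are assumptions made for illustration only.

```c
#include <assert.h>

#define N 8 /* assumption: small size for a quick check; the SCoPs above iterate 0..1023 */

static float A[N][N], B[N][N], C[N][N];

/* Shape of the SCoP in init_array(): a single statement in a 2-deep loop nest. */
static void init_array(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            /* plays the role of Stmt_5(c2,c4); the values are hypothetical */
            A[i][j] = (float)(i + 1);
            B[i][j] = (float)(j + 1);
        }
}

/* Shape of the SCoP in main(): Stmt_4 in the 2-deep nest, Stmt_6 one level deeper. */
static void matmul_kernel(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            C[i][j] = 0.0f;                   /* Stmt_4(c2,c4) */
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j]; /* Stmt_6(c2,c4,c6) */
        }
}

/* With the hypothetical init values, C[i][j] must equal N*(i+1)*(j+1). */
static void demo(void) {
    init_array();
    matmul_kernel();
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            assert(C[i][j] == (float)(N * (i + 1) * (j + 1)));
}
```

Both nests have constant loop bounds and array subscripts that are affine in the loop counters, which is the kind of code the SCoP detection accepts.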

Highlight the detected SCoPs in the CFGs of the program (requires graphviz/dotty)
Polly can use graphviz to graphically show a CFG in which the detected SCoPs are highlighted. It can also create '.dot' files that can be translated by the 'dot' utility into various graphic formats.
```
opt -basicaa -view-scops -disable-output matmul.preopt.ll
opt -basicaa -view-scops-only -disable-output matmul.preopt.ll
```
The output for the different functions
view-scops: main, init_array, print_array
view-scops-only: main, init_array, print_array

View the polyhedral representation of the SCoPs

opt -basicaa -polly-scops -analyze matmul.preopt.ll

[...]
Printing analysis 'Polly - Create polyhedral description of Scops' for region:
'for.cond => for.end19' in function 'init_array':
   Context:
   { [] }
   Statements {
   	Stmt_5
           Domain :=
               { Stmt_5[i0, i1] : i0 >= 0 and i0 <= 1023 and i1 >= 0 and i1 <= 1023 };
           Schedule :=
               { Stmt_5[i0, i1] -> schedule[0, i0, 0, i1, 0] };
           WriteAccess :=
               { Stmt_5[i0, i1] -> MemRef_A[1037i0 + i1] };
           WriteAccess :=
               { Stmt_5[i0, i1] -> MemRef_B[1047i0 + i1] };
   	FinalRead
           Domain :=
               { FinalRead[0] };
           Schedule :=
               { FinalRead[i0] -> schedule[200000000, o1, o2, o3, o4] };
           ReadAccess :=
               { FinalRead[i0] -> MemRef_A[o0] };
           ReadAccess :=
               { FinalRead[i0] -> MemRef_B[o0] };
   }
[...]
Printing analysis 'Polly - Create polyhedral description of Scops' for region:
'for.cond => for.end30' in function 'main':
   Context:
   { [] }
   Statements {
   	Stmt_4
           Domain :=
               { Stmt_4[i0, i1] : i0 >= 0 and i0 <= 1023 and i1 >= 0 and i1 <= 1023 };
           Schedule :=
               { Stmt_4[i0, i1] -> schedule[0, i0, 0, i1, 0, 0, 0] };
           WriteAccess :=
               { Stmt_4[i0, i1] -> MemRef_C[1067i0 + i1] };
   	Stmt_6
           Domain :=
               { Stmt_6[i0, i1, i2] : i0 >= 0 and i0 <= 1023 and i1 >= 0 and i1 <= 1023 and i2 >= 0 and i2 <= 1023 };
           Schedule :=
               { Stmt_6[i0, i1, i2] -> schedule[0, i0, 0, i1, 1, i2, 0] };
           ReadAccess :=
               { Stmt_6[i0, i1, i2] -> MemRef_C[1067i0 + i1] };
           ReadAccess :=
               { Stmt_6[i0, i1, i2] -> MemRef_A[1037i0 + i2] };
           ReadAccess :=
               { Stmt_6[i0, i1, i2] -> MemRef_B[i1 + 1047i2] };
           WriteAccess :=
               { Stmt_6[i0, i1, i2] -> MemRef_C[1067i0 + i1] };
   	FinalRead
           Domain :=
               { FinalRead[0] };
           Schedule :=
               { FinalRead[i0] -> schedule[200000000, o1, o2, o3, o4, o5, o6] };
           ReadAccess :=
               { FinalRead[i0] -> MemRef_C[o0] };
           ReadAccess :=
               { FinalRead[i0] -> MemRef_A[o0] };
           ReadAccess :=
               { FinalRead[i0] -> MemRef_B[o0] };
   }
[...]
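
The Schedule entries map every statement instance to a point in a common time space, and instances execute in lexicographic order of these points. The following sketch is a hand-written illustration, not Polly code: it hard-codes the two schedules printed for 'main' and compares a few instances. All helper names are hypothetical.

```c
#include <assert.h>

#define DIMS 7 /* length of the schedule vectors printed for 'main' */

/* Stmt_4[i0, i1] -> schedule[0, i0, 0, i1, 0, 0, 0] */
static void sched_stmt4(int i0, int i1, int out[DIMS]) {
    int s[DIMS] = {0, i0, 0, i1, 0, 0, 0};
    for (int d = 0; d < DIMS; d++) out[d] = s[d];
}

/* Stmt_6[i0, i1, i2] -> schedule[0, i0, 0, i1, 1, i2, 0] */
static void sched_stmt6(int i0, int i1, int i2, int out[DIMS]) {
    int s[DIMS] = {0, i0, 0, i1, 1, i2, 0};
    for (int d = 0; d < DIMS; d++) out[d] = s[d];
}

/* Returns 1 if time point a comes strictly before b in lexicographic order. */
static int executes_before(const int a[DIMS], const int b[DIMS]) {
    for (int d = 0; d < DIMS; d++)
        if (a[d] != b[d]) return a[d] < b[d];
    return 0; /* identical time points */
}

static void demo(void) {
    int init[DIMS], first[DIMS], last[DIMS], next[DIMS];
    sched_stmt4(3, 5, init);        /* C[3][5] is zeroed ...            */
    sched_stmt6(3, 5, 0, first);    /* ... before the first update ...  */
    sched_stmt6(3, 5, 1023, last);  /* ... and the last update ...      */
    sched_stmt4(3, 6, next);        /* ... precedes the next element.   */
    assert(executes_before(init, first));
    assert(executes_before(first, last));
    assert(executes_before(last, next));
}
```

The constant 0 and 1 in the fifth schedule dimension are what place Stmt_4 ahead of the innermost loop that contains Stmt_6 for the same (i0, i1).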

Show the dependences for the SCoPs

opt -basicaa -polly-dependences -analyze matmul.preopt.ll

Printing analysis 'Polly - Calculate dependences for SCoP' for region:
'for.cond => for.end19' in function 'init_array':
   Must dependences:
       {  }
   May dependences:
       {  }
   Must no source:
       {  }
   May no source:
       {  }
Printing analysis 'Polly - Calculate dependences for SCoP' for region:
'for.cond => for.end30' in function 'main':
   Must dependences:
       {  Stmt_4[i0, i1] -> Stmt_6[i0, i1, 0] :
              i0 >= 0 and i0 <= 1023 and i1 >= 0 and i1 <= 1023;
          Stmt_6[i0, i1, i2] -> Stmt_6[i0, i1, 1 + i2] :
              i0 >= 0 and i0 <= 1023 and i1 >= 0 and i1 <= 1023 and i2 >= 0 and i2 <= 1022;
          Stmt_6[i0, i1, 1023] -> FinalRead[0] :
              i1 <= 1091540 - 1067i0 and i1 >= -1067i0 and i1 >= 0 and i1 <= 1023;
          Stmt_6[1023, i1, 1023] -> FinalRead[0] :
              i1 >= 0 and i1 <= 1023
       }
   May dependences:
       {  }
   Must no source:
       {  Stmt_6[i0, i1, i2] -> MemRef_A[1037i0 + i2] :
              i0 >= 0 and i0 <= 1023 and i1 >= 0 and i1 <= 1023 and i2 >= 0 and i2 <= 1023;
          Stmt_6[i0, i1, i2] -> MemRef_B[i1 + 1047i2] :
              i0 >= 0 and i0 <= 1023 and i1 >= 0 and i1 <= 1023 and i2 >= 0 and i2 <= 1023;
          FinalRead[0] -> MemRef_A[o0];
          FinalRead[0] -> MemRef_B[o0]
          FinalRead[0] -> MemRef_C[o0] :
              o0 >= 1092565 or (exists (e0 = [(o0)/1067]: o0 <= 1091540 and o0 >= 0
              and 1067e0 <= -1024 + o0 and 1067e0 >= -1066 + o0)) or o0 <= -1;
       }
   May no source:
       {  }
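
The first two must dependences listed for 'main' can be reproduced by brute force on a tiny instance of the problem. The sketch below replays the original execution order, remembers the last statement instance that wrote each element of C, and checks that every read of C in Stmt_6 has the source predicted by the output above. This is only an illustration of what the relations mean; Polly computes them symbolically. The bound N, the padded STRIDE and all names are assumptions.

```c
#include <assert.h>

#define N 4            /* assumption: tiny bounds; the output above uses 0..1023 */
#define STRIDE (N + 3) /* assumption: padded row stride, in the spirit of the strides above */

typedef struct { int stmt, i0, i1, i2; } Instance; /* stmt is 4 or 6 */

/* Counts reads of C in Stmt_6 whose last writer is NOT the one named by
 *   Stmt_4[i0, i1]     -> Stmt_6[i0, i1, 0]
 *   Stmt_6[i0, i1, i2] -> Stmt_6[i0, i1, 1 + i2]                        */
static int count_unexpected_sources(void) {
    static Instance last_writer[STRIDE * N];
    int unexpected = 0;
    for (int i0 = 0; i0 < N; i0++)
        for (int i1 = 0; i1 < N; i1++) {
            int off = STRIDE * i0 + i1;                    /* MemRef_C offset */
            assert(off < STRIDE * N);
            last_writer[off] = (Instance){4, i0, i1, 0};   /* Stmt_4 writes C */
            for (int i2 = 0; i2 < N; i2++) {
                Instance src = last_writer[off];           /* Stmt_6 reads C */
                int expected_stmt = (i2 == 0) ? 4 : 6;
                int expected_i2 = (i2 == 0) ? 0 : i2 - 1;
                if (src.stmt != expected_stmt || src.i0 != i0 ||
                    src.i1 != i1 || src.i2 != expected_i2)
                    unexpected++;
                last_writer[off] = (Instance){6, i0, i1, i2}; /* Stmt_6 writes C */
            }
        }
    return unexpected;
}
```

A return value of 0 means every read of C is fed by exactly the instance the dependence relation names. The dependences only tie together instances with the same (i0, i1), which is why the later steps are free to reorder the i1 and i2 dimensions of Stmt_6.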

Export jscop files

Polly can export the polyhedral representation in so-called jscop files, which store the polyhedral representation as JSON.

opt -basicaa -polly-export-jscop matmul.preopt.ll

Writing SCoP 'for.cond => for.end19' in function 'init_array' to './init_array___%for.cond---%for.end19.jscop'.
Writing SCoP 'for.cond => for.end30' in function 'main' to './main___%for.cond---%for.end30.jscop'.

Import the changed jscop files and print the updated SCoP structure (optional)

Polly can reimport jscop files in which the schedules of the statements have been changed. These changed schedules are used to describe transformations. Different jscop files can be imported by providing the postfix of the jscop file to import.

We apply three different transformations on the SCoP in the main function. The jscop files describing these transformations are hand written (and available in www/experiments/matmul).

No Polly

As a baseline we do not call any Polly code generation, but only apply the normal -O3 optimizations.

opt matmul.preopt.ll -basicaa \
    -polly-import-jscop \
    -polly-ast -analyze

[...]
main():
for (c2=0;c2<=1535;c2++) {
  for (c4=0;c4<=1535;c4++) {
    Stmt_4(c2,c4);
    for (c6=0;c6<=1535;c6++) {
      Stmt_6(c2,c4,c6);
    }
  }
}
[...]

Interchange (and Fission to allow the interchange)

We split the loops and can now apply an interchange of the loop dimensions that enumerate Stmt_6.

opt matmul.preopt.ll -basicaa \
    -polly-import-jscop -polly-import-jscop-postfix=interchanged \
    -polly-ast -analyze

[...]
Reading JScop 'for.cond => for.end30' in function 'main' from './main___%for.cond---%for.end30.jscop.interchanged'.
[...]
main():
for (c2=0;c2<=1535;c2++) {
  for (c4=0;c4<=1535;c4++) {
    Stmt_4(c2,c4);
  }
}
for (c2=0;c2<=1535;c2++) {
  for (c4=0;c4<=1535;c4++) {
    for (c6=0;c6<=1535;c6++) {
      Stmt_6(c2,c6,c4);
    }
  }
}
[...]
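
To make the printed AST concrete, the following sketch writes both loop structures as plain C and checks that they compute the same matrix. It assumes that Stmt_4 zeroes an element of C and that Stmt_6 accumulates one product into it, as the access relations in this example suggest. The size N, the input values and all function names are hypothetical.

```c
#include <assert.h>

#define N 8 /* assumption: small size for a quick check; the output above uses 1536 */

static float A[N][N], B[N][N], C_ref[N][N], C_new[N][N];

static void fill_inputs(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = (float)(i + j); /* hypothetical values */
            B[i][j] = (float)(i - j);
        }
}

/* Original structure: Stmt_4 and Stmt_6 share the two outer loops. */
static void original(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            C_ref[i][j] = 0.0f;
            for (int k = 0; k < N; k++)
                C_ref[i][j] += A[i][k] * B[k][j];
        }
}

/* Structure after fission and interchange, following the printed AST. */
static void interchanged(void) {
    for (int c2 = 0; c2 < N; c2++)
        for (int c4 = 0; c4 < N; c4++)
            C_new[c2][c4] = 0.0f;                           /* Stmt_4(c2,c4) */
    for (int c2 = 0; c2 < N; c2++)
        for (int c4 = 0; c4 < N; c4++)
            for (int c6 = 0; c6 < N; c6++)
                C_new[c2][c6] += A[c2][c4] * B[c4][c6];     /* Stmt_6(c2,c6,c4) */
}

static void check(void) {
    fill_inputs();
    original();
    interchanged();
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            assert(C_ref[i][j] == C_new[i][j]);
}
```

After the interchange the innermost loop walks B and C along a row, so consecutive iterations touch adjacent memory, whereas the original innermost loop stepped through B with a stride of one full row.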

Interchange + Tiling

In addition to the interchange, we now tile the second loop nest.

opt matmul.preopt.ll -basicaa \
    -polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled \
    -polly-ast -analyze

[...]
Reading JScop 'for.cond => for.end30' in function 'main' from './main___%for.cond---%for.end30.jscop.interchanged+tiled'.
[...]
main():
for (c2=0;c2<=1535;c2++) {
  for (c4=0;c4<=1535;c4++) {
    Stmt_4(c2,c4);
  }
}
for (c2=0;c2<=1535;c2+=64) {
  for (c3=0;c3<=1535;c3+=64) {
    for (c4=0;c4<=1535;c4+=64) {
      for (c5=c2;c5<=c2+63;c5++) {
        for (c6=c4;c6<=c4+63;c6++) {
          for (c7=c3;c7<=c3+63;c7++) {
            Stmt_6(c5,c7,c6);
          }
        }
      }
    }
  }
}
[...]
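
The tiled loop nest can be written out in C in the same way. The sketch below assumes that Stmt_6 accumulates one product into C and that the problem size is a multiple of the tile size, as is the case for 1536 and 64 in the output above. N, TILE, the input values and the function names are hypothetical.

```c
#include <assert.h>

#define N 8    /* assumption: small size; the output above uses 1536 */
#define TILE 4 /* assumption: small tile; the output above uses 64 */

static float A[N][N], B[N][N], C_ref[N][N], C_tiled[N][N];

static void fill_inputs(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = (float)(i + 2 * j); /* hypothetical values */
            B[i][j] = (float)(j - i);
        }
}

/* Untiled reference computation. */
static void reference(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            C_ref[i][j] = 0.0f;
            for (int k = 0; k < N; k++)
                C_ref[i][j] += A[i][k] * B[k][j];
        }
}

/* Tiled nest: c2, c3, c4 enumerate tiles; c5, c6, c7 enumerate points in a tile. */
static void tiled(void) {
    assert(N % TILE == 0); /* the printed bounds c2+63 etc. rely on full tiles */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            C_tiled[i][j] = 0.0f;                                     /* Stmt_4 */
    for (int c2 = 0; c2 < N; c2 += TILE)
        for (int c3 = 0; c3 < N; c3 += TILE)
            for (int c4 = 0; c4 < N; c4 += TILE)
                for (int c5 = c2; c5 < c2 + TILE; c5++)
                    for (int c6 = c4; c6 < c4 + TILE; c6++)
                        for (int c7 = c3; c7 < c3 + TILE; c7++)
                            C_tiled[c5][c7] += A[c5][c6] * B[c6][c7]; /* Stmt_6(c5,c7,c6) */
}

static void check(void) {
    fill_inputs();
    reference();
    tiled();
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            assert(C_ref[i][j] == C_tiled[i][j]);
}
```

For any fixed element of C the products are still added in increasing order of the reduction index, so tiling changes which elements are worked on together but not the order of the updates to a single element.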

Interchange + Tiling + Strip-mining to prepare vectorization

To later allow vectorization we create a so-called trivially parallelizable loop. It is innermost, parallel, and has only four iterations, so it can be replaced by 4-element SIMD instructions.

opt matmul.preopt.ll -basicaa \
    -polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled+vector \
    -polly-ast -analyze

[...]
Reading JScop 'for.cond => for.end30' in function 'main' from './main___%for.cond---%for.end30.jscop.interchanged+tiled+vector'.
[...]
main():
for (c2=0;c2<=1535;c2++) {
  for (c4=0;c4<=1535;c4++) {
    Stmt_4(c2,c4);
  }
}
for (c2=0;c2<=1535;c2+=64) {
  for (c3=0;c3<=1535;c3+=64) {
    for (c4=0;c4<=1535;c4+=64) {
      for (c5=c2;c5<=c2+63;c5++) {
        for (c6=c4;c6<=c4+63;c6++) {
          for (c7=c3;c7<=c3+63;c7+=4) {
            for (c8=c7;c8<=c7+3;c8++) {
              Stmt_6(c5,c8,c6);
            }
          }
        }
      }
    }
  }
}
[...]
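
Strip-mining on its own is easiest to see on a single loop. The sketch below isolates the innermost update of one row of C and splits its loop into strips of four iterations, mirroring the c7/c8 pair in the output above. The row length LEN, the scalar a and the function names are assumptions for illustration.

```c
#include <assert.h>

#define LEN 16 /* assumption: row length, a multiple of VEC */
#define VEC 4  /* four iterations per strip, matching c8 = c7 .. c7+3 above */

/* Plain loop: one row of C is updated with a scaled row of B. */
static void row_update_plain(float a, const float *b_row, float *c_row) {
    for (int j = 0; j < LEN; j++)
        c_row[j] += a * b_row[j];
}

/* Strip-mined loop: same iterations, grouped into strips of VEC. */
static void row_update_strip_mined(float a, const float *b_row, float *c_row) {
    assert(LEN % VEC == 0);
    for (int c7 = 0; c7 < LEN; c7 += VEC)
        /* fixed trip count of 4 and no dependence between iterations:
         * a candidate for one 4-element SIMD multiply-add */
        for (int c8 = c7; c8 < c7 + VEC; c8++)
            c_row[c8] += a * b_row[c8];
}

static void check(void) {
    float b[LEN], c1[LEN], c2[LEN];
    for (int j = 0; j < LEN; j++) {
        b[j] = (float)j;
        c1[j] = c2[j] = 1.0f;
    }
    row_update_plain(2.0f, b, c1);
    row_update_strip_mined(2.0f, b, c2);
    for (int j = 0; j < LEN; j++)
        assert(c1[j] == c2[j]);
}
```

Strip-mining does not change the order in which iterations execute. It only exposes an inner loop with a known, small trip count that a vectorizer can map to SIMD instructions.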

Codegenerate the SCoPs

This generates new code for the SCoPs detected by Polly. If -polly-import-jscop is present, the transformations specified in the imported jscop files are applied.

opt matmul.preopt.ll | opt -O3 > matmul.normalopt.ll

opt -basicaa \
    -polly-import-jscop -polly-import-jscop-postfix=interchanged \
    -polly-codegen matmul.preopt.ll \
   | opt -O3 > matmul.polly.interchanged.ll

Reading JScop 'for.cond => for.end19' in function 'init_array' from
    './init_array___%for.cond---%for.end19.jscop.interchanged'.
File could not be read: No such file or directory
Reading JScop 'for.cond => for.end30' in function 'main' from
    './main___%for.cond---%for.end30.jscop.interchanged'.

opt -basicaa \
    -polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled \
    -polly-codegen matmul.preopt.ll \
   | opt -O3 > matmul.polly.interchanged+tiled.ll

Reading JScop 'for.cond => for.end19' in function 'init_array' from
    './init_array___%for.cond---%for.end19.jscop.interchanged+tiled'.
File could not be read: No such file or directory
Reading JScop 'for.cond => for.end30' in function 'main' from
    './main___%for.cond---%for.end30.jscop.interchanged+tiled'.

opt -basicaa \
    -polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled+vector \
    -polly-codegen -polly-vectorizer=polly matmul.preopt.ll \
   | opt -O3 > matmul.polly.interchanged+tiled+vector.ll

Reading JScop 'for.cond => for.end19' in function 'init_array' from
    './init_array___%for.cond---%for.end19.jscop.interchanged+tiled+vector'.
File could not be read: No such file or directory
Reading JScop 'for.cond => for.end30' in function 'main' from
    './main___%for.cond---%for.end30.jscop.interchanged+tiled+vector'.

opt -basicaa \
    -polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled+vector \
    -polly-codegen -polly-vectorizer=polly -polly-parallel matmul.preopt.ll \
  | opt -O3 > matmul.polly.interchanged+tiled+vector+openmp.ll

Reading JScop 'for.cond => for.end19' in function 'init_array' from
    './init_array___%for.cond---%for.end19.jscop.interchanged+tiled+vector'.
File could not be read: No such file or directory
Reading JScop 'for.cond => for.end30' in function 'main' from
    './main___%for.cond---%for.end30.jscop.interchanged+tiled+vector'.

Create the executables

Create one executable optimized with plain -O3 as well as a set of executables optimized in different ways with Polly. The first changes only the loop structure, the second adds tiling, the third adds vectorization, and the last additionally uses OpenMP parallelism.

llc matmul.normalopt.ll -o matmul.normalopt.s && \
    gcc matmul.normalopt.s -o matmul.normalopt.exe
llc matmul.polly.interchanged.ll -o matmul.polly.interchanged.s && \
    gcc matmul.polly.interchanged.s -o matmul.polly.interchanged.exe
llc matmul.polly.interchanged+tiled.ll -o matmul.polly.interchanged+tiled.s && \
    gcc matmul.polly.interchanged+tiled.s -o matmul.polly.interchanged+tiled.exe
llc matmul.polly.interchanged+tiled+vector.ll -o matmul.polly.interchanged+tiled+vector.s && \
    gcc matmul.polly.interchanged+tiled+vector.s -o matmul.polly.interchanged+tiled+vector.exe
llc matmul.polly.interchanged+tiled+vector+openmp.ll -o matmul.polly.interchanged+tiled+vector+openmp.s && \
    gcc matmul.polly.interchanged+tiled+vector+openmp.s -lgomp -o matmul.polly.interchanged+tiled+vector+openmp.exe

Compare the runtime of the executables

Comparing the runtimes of the different executables shows that the simple loop interchange gives the largest performance boost here. Adding vectorization and OpenMP parallelism improves the performance significantly further.

time ./matmul.normalopt.exe

42.68 real, 42.55 user, 0.00 sys

time ./matmul.polly.interchanged.exe

04.33 real, 4.30 user, 0.01 sys

time ./matmul.polly.interchanged+tiled.exe

04.11 real, 4.10 user, 0.00 sys

time ./matmul.polly.interchanged+tiled+vector.exe

01.39 real, 1.36 user, 0.01 sys

time ./matmul.polly.interchanged+tiled+vector+openmp.exe

00.66 real, 2.58 user, 0.02 sys
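
Expressed as speedups over the plain -O3 build, the wall-clock times above correspond to roughly 9.9x for the interchange, 10.4x with tiling, 30.7x with vectorization and 64.7x with OpenMP. The small helper below recomputes these ratios from the measured times. The type and function names are hypothetical; the numbers are copied from the measurements above.

```c
#include <assert.h>
#include <stdio.h>

typedef struct {
    const char *name;
    double real_s; /* wall-clock seconds */
    double user_s; /* CPU seconds summed over all threads */
} Run;

/* Times copied from the 'time' output above. */
static const Run runs[] = {
    {"normalopt",                        42.68, 42.55},
    {"interchanged",                      4.33,  4.30},
    {"interchanged+tiled",                4.11,  4.10},
    {"interchanged+tiled+vector",         1.39,  1.36},
    {"interchanged+tiled+vector+openmp",  0.66,  2.58},
};

static double speedup(double baseline_seconds, double seconds) {
    assert(seconds > 0.0);
    return baseline_seconds / seconds;
}

static void print_speedups(void) {
    int n = (int)(sizeof runs / sizeof runs[0]);
    for (int i = 1; i < n; i++)
        printf("%-34s %5.1fx faster, user/real = %.1f\n", runs[i].name,
               speedup(runs[0].real_s, runs[i].real_s),
               runs[i].user_s / runs[i].real_s);
}
```

For the OpenMP build the user time is about 3.9 times the real time, which indicates that roughly four cores were busy during the run. The gain from OpenMP therefore comes from running the same work in parallel, not from reducing the amount of work.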