Requirements:

- automake, autoconf, libtool
  (not needed when compiling a release)
- pkg-config (http://www.freedesktop.org/wiki/Software/pkg-config)
  (not needed when compiling a release using the included isl and pet)
- gmp (http://gmplib.org/)
- libyaml (http://pyyaml.org/wiki/LibYAML)
  (only needed if you want to compile the pet executable)
- LLVM/clang libraries, 2.9 or higher (http://clang.llvm.org/get_started.html)
  Unless you have some other reason for wanting to use the svn version,
  it is best to install the latest release (3.6).
  For more details, see pet/README.

If you are installing on Ubuntu, then you can install the following packages:

    automake autoconf libtool pkg-config libgmp3-dev libyaml-dev libclang-dev llvm

Note that you need at least version 3.2 of libclang-dev (Ubuntu raring).
Older versions of this package did not include the required libraries.
If you are using an older version of Ubuntu, then you need to compile and
install LLVM/clang from source.

Preparing:

Grab the latest release and extract it or get the source from
the git repository as follows. This process requires autoconf,
automake, libtool and pkg-config.

    git clone git://repo.or.cz/ppcg.git
    cd ppcg
    git submodule init
    git submodule update
    ./autogen.sh

Compilation:

    ./configure
    make
    make check

If you have installed any of the required libraries in a non-standard
location, then you may need to use the --with-gmp-prefix,
--with-libyaml-prefix and/or --with-clang-prefix options
when calling "./configure".

Using PPCG to generate CUDA or OpenCL code

To convert a fragment of a C program to CUDA, insert a line containing

    #pragma scop

before the fragment and add a line containing

    #pragma endscop

after the fragment. To generate CUDA code run

    ppcg --target=cuda file.c

where file.c is the file containing the fragment. The generated
code is stored in file_host.cu and file_kernel.cu.

To generate OpenCL code run

    ppcg --target=opencl file.c

where file.c is the file containing the fragment. The generated code
is stored in file_host.c and file_kernel.cl.

Specifying tile, grid and block sizes

The iteration space tile size, grid size and block size can
be specified using the --sizes option. The argument is a union map
in isl notation mapping kernels identified by their sequence number
in a "kernel" space to singleton sets in the "tile", "grid" and "block"
spaces. The sizes are specified outermost to innermost.

The dimension of the "tile" space indicates the (maximal) number of loop
dimensions to tile. The elements of the single integer tuple
specify the tile sizes in each dimension.

The dimension of the "grid" space indicates the (maximal) number of block
dimensions in the grid. The elements of the single integer tuple
specify the number of blocks in each dimension.

The dimension of the "block" space indicates the (maximal) number of thread
dimensions in the block. The elements of the single integer tuple
specify the number of threads in each dimension.

For example,

    { kernel[0] -> tile[64,64]; kernel[i] -> block[16] : i != 4 }

specifies that in kernel 0, two loops should be tiled with a tile
size of 64 in both dimensions and that all kernels except kernel 4
should be run using a block of 16 threads.

Since PPCG performs some scheduling, it can be difficult to predict
what exactly will end up in a kernel. If you want to specify
tile, grid or block sizes, you may want to run PPCG first with the defaults,
examine the kernels and then run PPCG again with the desired sizes.
Instead of examining the kernels, you can also specify the option
--dump-sizes on the first run to obtain the effectively used default sizes.

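As a further sketch (the sizes below are invented for illustration, not recommended values), a complete invocation that fixes the sizes for every kernel might look as follows; note the shell quoting around the union map.

```shell
# Hypothetical sizes: 32x32 tiles, a one-dimensional grid of 256 blocks,
# and 16x16 thread blocks, applied to every kernel via kernel[i].
ppcg --target=cuda \
     --sizes='{ kernel[i] -> tile[32,32]; kernel[i] -> grid[256]; kernel[i] -> block[16,16] }' \
     file.c
```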
Compiling the generated CUDA code with nvcc

To get optimal performance from nvcc, it is important to choose --arch
according to your target GPU. Specifically, use the flag "--arch sm_20"
for Fermi, "--arch sm_30" for GK10x Kepler and "--arch sm_35" for
GK110 Kepler. We discourage the use of older cards as we have seen
correctness issues with compilation for older architectures.
Note that in the absence of any --arch flag, nvcc defaults to
"--arch sm_13". This will not only be slower, but can also cause
correctness issues.
If you want to obtain results that are identical to those obtained
by the original code, then you may need to disable some optimizations
by passing the "--fmad=false" option.

Compiling the generated OpenCL code with gcc

To compile the host code you need to link against the file
ocl_utilities.c which contains utility functions used by the generated
OpenCL host code. To compile the host code with gcc, run

    gcc -std=c99 file_host.c ocl_utilities.c -lOpenCL

Note that we have experienced the generated OpenCL code freezing
on some inputs (e.g., the PolyBench symm benchmark) when using
at least some version of the Nvidia OpenCL library, while the
corresponding CUDA code runs fine.
We have experienced no such freezes when using AMD, ARM or Intel
OpenCL libraries.

By default, the compiled executable will need the _kernel.cl file at
run time. Alternatively, the option --opencl-embed-kernel-code may be
given to place the kernel code in a string literal. The kernel code is
then compiled into the host binary, such that the _kernel.cl file is no
longer needed at run time. Any kernel include files, in particular
those supplied using --opencl-include-file, will still be required at
run time.

Function calls

Function calls inside the analyzed fragment are reproduced
in the CUDA or OpenCL code, but for now it is left to the user
to make sure that the functions that are being called are
available from the generated kernels.

In the case of OpenCL code, the --opencl-include-file option
may be used to specify one or more files to be #include'd
from the generated code. These files may then contain
the definitions of the functions being called from the
program fragment. If the pathnames of the included files
are relative to the current directory, then you may need
to additionally specify the --opencl-compiler-options=-I.
option to make sure that the files can be found by the OpenCL compiler.
The included files may contain definitions of types used by the
generated kernels. By default, PPCG generates definitions for
types as needed, but these definitions may collide with those in
the included files, as PPCG does not consider the contents of the
included files. The --no-opencl-print-kernel-types option will prevent
PPCG from generating type definitions.

Processing PolyBench

When processing a PolyBench/C 3.2 benchmark, you should always specify
-DPOLYBENCH_USE_C99_PROTO on the ppcg command line. Otherwise, the source
files are inconsistent, having fixed size arrays but parametrically
bounded loops iterating over them.
However, you should not specify this define when compiling
the PPCG generated code using nvcc since CUDA does not support VLAs.

CUDA and function overloading

While CUDA supports function overloading based on the argument types,
no such function overloading exists in the input language C. Since PPCG
simply prints out the same function name as in the original code, this
may result in a different function being called based on the types
of the arguments. For example, if the original code contains a call
to the function sqrt() with a float argument, then the argument will
be promoted to a double and the sqrt() function will be called.
In the transformed (CUDA) code, however, overloading will cause the
function sqrtf() to be called. Until this issue has been resolved in PPCG,
we recommend that users either explicitly call the function sqrtf() or
explicitly cast the argument to double in the input code.

Contact

For bug reports, feature requests and questions,
contact http://groups.google.com/group/isl-development

Citing PPCG

If you use PPCG for your research, you are invited to cite
the following paper.

@article{Verdoolaege2013PPCG,
    author = {Verdoolaege, Sven and Juega, Juan Carlos and Cohen, Albert and
              G\'{o}mez, Jos{\'e} Ignacio and Tenllado, Christian and
              Catthoor, Francky},
    title = {Polyhedral parallel code generation for CUDA},
    journal = {ACM Trans. Archit. Code Optim.},
    issue_date = {January 2013},
    volume = {9},
    number = {4},
    month = jan,
    year = {2013},
    issn = {1544-3566},
    pages = {54:1--54:23},
    doi = {10.1145/2400682.2400713},
    acmid = {2400713},
    publisher = {ACM},
    address = {New York, NY, USA},
}