This document contains the user guides and the internals of compiling CUDA C/C++ with LLVM. It is aimed at both users who want to compile CUDA with LLVM and developers who want to improve LLVM for GPUs. This document assumes a basic familiarity with CUDA. Information about CUDA programming can be found in the CUDA programming guide.
Below is a quick summary of downloading and building LLVM. Consult the Getting Started page for more details on setting up LLVM.
Checkout LLVM
$ cd where-you-want-llvm-to-live
$ svn co http://llvm.org/svn/llvm-project/llvm/trunk llvm
Checkout Clang
$ cd where-you-want-llvm-to-live
$ cd llvm/tools
$ svn co http://llvm.org/svn/llvm-project/cfe/trunk clang
Configure and build LLVM and Clang
$ cd where-you-want-llvm-to-live
$ mkdir build
$ cd build
$ cmake [options] ..
$ make
We assume you have installed the CUDA driver and runtime. Consult the NVIDIA CUDA installation Guide if you have not.
Suppose you want to compile and run the following CUDA program (axpy.cu) which multiplies a float array by a float scalar (AXPY).
#include <helper_cuda.h> // for checkCudaErrors
#include <iostream>
__global__ void axpy(float a, float* x, float* y) {
y[threadIdx.x] = a * x[threadIdx.x];
}
int main(int argc, char* argv[]) {
const int kDataLen = 4;
float a = 2.0f;
float host_x[kDataLen] = {1.0f, 2.0f, 3.0f, 4.0f};
float host_y[kDataLen];
// Copy input data to device.
float* device_x;
float* device_y;
checkCudaErrors(cudaMalloc(&device_x, kDataLen * sizeof(float)));
checkCudaErrors(cudaMalloc(&device_y, kDataLen * sizeof(float)));
checkCudaErrors(cudaMemcpy(device_x, host_x, kDataLen * sizeof(float),
cudaMemcpyHostToDevice));
// Launch the kernel.
axpy<<<1, kDataLen>>>(a, device_x, device_y);
// Copy output data to host.
checkCudaErrors(cudaDeviceSynchronize());
checkCudaErrors(cudaMemcpy(host_y, device_y, kDataLen * sizeof(float),
cudaMemcpyDeviceToHost));
// Print the results.
for (int i = 0; i < kDataLen; ++i) {
std::cout << "y[" << i << "] = " << host_y[i] << "\n";
}
checkCudaErrors(cudaDeviceReset());
return 0;
}
The command line for compilation is similar to what you would use for C++.
$ clang++ -o axpy -I<CUDA install path>/samples/common/inc -L<CUDA install path>/<lib64 or lib> axpy.cu -lcudart_static -lcuda -ldl -lrt -pthread
$ ./axpy
y[0] = 2
y[1] = 4
y[2] = 6
y[3] = 8
Note that helper_cuda.h comes from the CUDA samples, so you need the samples installed for this example. <CUDA install path> is the root directory where you installed CUDA SDK, typically /usr/local/cuda.
CPU and GPU have different design philosophies and architectures. For example, a typical CPU has branch prediction, out-of-order execution, and is superscalar, whereas a typical GPU has none of these. Due to such differences, an optimization pipeline well-tuned for CPUs may be not suitable for GPUs.
LLVM performs several general and CUDA-specific optimizations for GPUs. The list below shows some of the more important optimizations for GPUs. Most of them have been upstreamed to lib/Transforms/Scalar and lib/Target/NVPTX. A few of them have not been upstreamed due to lack of a customizable target-independent optimization pipeline.