COMP4300/8300 2017 - Tutorial/Laboratory 5

GPU Programming with CUDA

IMPORTANT: This lab will **NOT** use Raijin. It will use a system run by the Computer Systems group. You will need to attend the lab in order to get access to this platform.

The aim of this lab is to introduce you to programming GPUs using CUDA. The material is taken from the book CUDA by Example: An Introduction to General-Purpose GPU Programming by Jason Sanders and Edward Kandrot. A PDF copy of the book can be found HERE.


Accessing the Jetson Systems


There are two Jetson systems, jetson1 and jetson2. They are located in an office in the CSIT building and are accessed by first logging onto a head node called sabretooth. You can use either of the two systems; note, however, that they have separate home directories, so you may want to choose one and stick to it (unless the load on it is high). Login instructions are as follows:

# Login to the head node using your ANU uni id:
1) ssh (uni_id)@sabretooth.anu.edu.au
    password: (need to obtain during lab5)
# Change your password
2) passwd

# SSH to either jetson1 or jetson2. To help balance load, log into jetson1 if
# the last digit of your uni-id is odd, jetson2 otherwise.
3) ssh jetson(1|2)
    password: (same as initially on sabretooth)
# Change your password
4) passwd
You may also want to set up your ssh keys on both sabretooth and the Jetson nodes to make transferring files easier.

A tar file containing all the programs for this lab is available HERE. Save this tar file on your local desktop and then transfer it via sabretooth to the Jetson system. Alternatively, on a Jetson node run: wget https://cs.anu.edu.au/courses/comp4300/practicals/prac5.tar The tar file contains a subset of the example code that comes with the CUDA by Example book. The full set of examples is available here. We will draw on examples from Chapters 3-5.

To compile we use the NVIDIA nvcc compiler. First check that your search path is set correctly to find the compiler by executing the command which nvcc. If this returns nothing, you need to modify your path. Do this by editing your .profile to include the line

PATH=$PATH:/usr/local/cuda/bin
then either log out and in again, or type source ~/.profile. Check that nvcc is now found. You can then compile and execute the various CUDA codes using the commands
nvcc program.cu
./a.out

Making Kernel Calls


Our first code is taken from page 25 of CUDA by Example and is found in ch3/simple_kernel_params.cu:

#include "../common/book.h"

__global__ void add( int a, int b, int *c ) {
    *c = a + b;
}

int main( void ) {
    int c;
    int *dev_c;
    HANDLE_ERROR( cudaMalloc( (void**)&dev_c, sizeof(int) ) );

    add<<<1,1>>>( 2, 7, dev_c );

    HANDLE_ERROR( cudaMemcpy( &c, dev_c, sizeof(int),
                              cudaMemcpyDeviceToHost ) );
    printf( "2 + 7 = %d\n", c );
    HANDLE_ERROR( cudaFree( dev_c ) );

    return 0;
}
Compile this code using nvcc. Run it and ensure it produces the result you would expect, i.e. that it prints out 2 + 7 = 9!
  1. Modify the code so that the routine add() computes not just a+b but also a-b, returning the two results in adjacent memory locations, i.e. as c[0] and c[1]. A minimal sketch of one way to do this is given below.
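
A minimal sketch of the modified kernel, assuming the device allocation is enlarged to hold two ints (variable names mirror the original code):

__global__ void add( int a, int b, int *c ) {
    c[0] = a + b;    // sum in the first element
    c[1] = a - b;    // difference in the adjacent element
}

// ... in main(), allocate and copy back two ints instead of one:
//   HANDLE_ERROR( cudaMalloc( (void**)&dev_c, 2 * sizeof(int) ) );
//   HANDLE_ERROR( cudaMemcpy( c, dev_c, 2 * sizeof(int),
//                             cudaMemcpyDeviceToHost ) );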

The file enum_gpu.cu contains a program that will interrogate the GPU hardware present on a host and print out various information associated with that GPU.

  1. Compile and run enum_gpu.cu. What is the clock frequency of the GPU on the Jetson systems? How many multiprocessors are available? How many bytes of shared memory are available per multiprocessor? (Search for cudaGetDeviceProperties to help you interpret the program output; a sketch of how to query it directly follows.)
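
If you want to probe these fields yourself, a minimal sketch along the lines of enum_gpu.cu is shown below. The exact field choices here are an assumption to be checked against the CUDA documentation: clockRate is reported in kHz, and sharedMemPerBlock is the shared memory available to a single block, which is a guide to (but not necessarily the same as) what each multiprocessor offers.

#include <stdio.h>
#include "../common/book.h"

int main( void ) {
    cudaDeviceProp prop;
    int count;
    HANDLE_ERROR( cudaGetDeviceCount( &count ) );
    for (int i = 0; i < count; i++) {
        HANDLE_ERROR( cudaGetDeviceProperties( &prop, i ) );
        printf( "Device %d: %s\n", i, prop.name );
        printf( "  Clock rate:            %d kHz\n", prop.clockRate );
        printf( "  Multiprocessor count:  %d\n", prop.multiProcessorCount );
        printf( "  Shared mem per block:  %lu bytes\n",
                (unsigned long)prop.sharedMemPerBlock );
    }
    return 0;
}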

Parallel Operations on the GPU


Our second code, ch4/add_loop_gpu.cu is taken from pages 41-43 of CUDA by Example and is given below:

#include "../common/book.h"

#define N   10

__global__ void add( int *a, int *b, int *c ) {
    int tid = blockIdx.x;    // this thread handles the data at its thread id
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}

int main( void ) {
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;

    // allocate the memory on the GPU
    HANDLE_ERROR( cudaMalloc( (void**)&dev_a, N * sizeof(int) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_b, N * sizeof(int) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_c, N * sizeof(int) ) );

    // fill the arrays 'a' and 'b' on the CPU
    for (int i=0; i<N; i++) {
        a[i] = -i;
        b[i] = i * i;
    }

    // copy the arrays 'a' and 'b' to the GPU
    HANDLE_ERROR( cudaMemcpy( dev_a, a, N * sizeof(int),
                              cudaMemcpyHostToDevice ) );
    HANDLE_ERROR( cudaMemcpy( dev_b, b, N * sizeof(int),
                              cudaMemcpyHostToDevice ) );

    add<<<N,1>>>( dev_a, dev_b, dev_c );

    // copy the array 'c' back from the GPU to the CPU
    HANDLE_ERROR( cudaMemcpy( c, dev_c, N * sizeof(int),
                              cudaMemcpyDeviceToHost ) );

    // display the results
    for (int i=0; i<N; i++) {
        printf( "%d + %d = %d\n", a[i], b[i], c[i] );
    }

    // free the memory allocated on the GPU
    HANDLE_ERROR( cudaFree( dev_a ) );
    HANDLE_ERROR( cudaFree( dev_b ) );
    HANDLE_ERROR( cudaFree( dev_c ) );

    return 0;
}
Compile and run the above code and confirm that it produces the correct answers.
  1. As before, modify the code so that it computes both the sum and the difference of the two vectors a[] and b[]. Store the difference in a new vector d[] (you will need to add another argument to add()).
  2. Modify the code such that N, the length of the vectors, is 512, but you only create 32 thread blocks, i.e. launch the add kernel as add<<<32,1>>>. This requires each thread block to compute 512/32 = 16 elements of the final vector c[]; a sketch of one way to do this is given below.
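
One way to handle more elements than blocks is to have each block stride through the vector by the grid size, similar to the loop used in the next section. A minimal sketch, assuming N has been changed to 512 and the kernel is launched as add<<<32,1>>>:

__global__ void add( int *a, int *b, int *c ) {
    // each block starts at its own index and strides by the number of blocks,
    // so with 32 blocks each block handles 512/32 = 16 elements
    int tid = blockIdx.x;
    while (tid < N) {
        c[tid] = a[tid] + b[tid];
        tid += gridDim.x;
    }
}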

Blocks of Many Threads


Our third code, ch5/add_loop_long_blocks.cu is described in pages 63-69 of CUDA by Example.

#include "../common/book.h"

#define N   (33 * 1024)

__global__ void add( int *a, int *b, int *c ) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < N) {
        c[tid] = a[tid] + b[tid];
        tid += blockDim.x * gridDim.x;
    }
}

int main( void ) {
    int *a, *b, *c;
    int *dev_a, *dev_b, *dev_c;

    // allocate the memory on the CPU
    a = (int*)malloc( N * sizeof(int) );
    b = (int*)malloc( N * sizeof(int) );
    c = (int*)malloc( N * sizeof(int) );

    // allocate the memory on the GPU
    HANDLE_ERROR( cudaMalloc( (void**)&dev_a, N * sizeof(int) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_b, N * sizeof(int) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_c, N * sizeof(int) ) );

    // fill the arrays 'a' and 'b' on the CPU
    for (int i=0; i<N; i++) {
        a[i] = i;
        b[i] = 2 * i;
    }

    // copy the arrays 'a' and 'b' to the GPU
    HANDLE_ERROR( cudaMemcpy( dev_a, a, N * sizeof(int),
                              cudaMemcpyHostToDevice ) );
    HANDLE_ERROR( cudaMemcpy( dev_b, b, N * sizeof(int),
                              cudaMemcpyHostToDevice ) );

    add<<<128,128>>>( dev_a, dev_b, dev_c );

    // copy the array 'c' back from the GPU to the CPU
    HANDLE_ERROR( cudaMemcpy( c, dev_c, N * sizeof(int),
                              cudaMemcpyDeviceToHost ) );

    // verify that the GPU did the work we requested
    bool success = true;
    for (int i=0; i<N; i++) {
        if ((a[i] + b[i]) != c[i]) {
            printf( "Error:  %d + %d != %d\n", a[i], b[i], c[i] );
            success = false;
        }
    }
    if (success)    printf( "We did it!\n" );

    // free the memory we allocated on the GPU
    HANDLE_ERROR( cudaFree( dev_a ) );
    HANDLE_ERROR( cudaFree( dev_b ) );
    HANDLE_ERROR( cudaFree( dev_c ) );

    // free the memory we allocated on the CPU
    free( a );
    free( b );
    free( c );

    return 0;
}
Compile and run the above code and verify that it produces the correct answers.
  1. Modify add() so that even-indexed elements of c contain the sum of a and b, while odd-indexed elements contain the difference.
  2. On the host, after the call to add(), sum the squares of all elements of the vector c and take the square root of this value (call it Vnorm). Now write a second kernel that scales c by dividing each element by Vnorm; a sketch of such a kernel is given below.
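
A minimal sketch of the scaling step, assuming the host has already copied c back and computed Vnorm (the kernel name scale() is made up here; note that dev_c still holds the unscaled result on the device):

__global__ void scale( int *c, float vnorm ) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < N) {
        c[tid] = c[tid] / vnorm;   // integer c truncates; consider using a float vector
        tid += blockDim.x * gridDim.x;
    }
}

// on the host, after computing vnorm = sqrt(sum of c[i]*c[i]):
//   scale<<<128,128>>>( dev_c, vnorm );
//   HANDLE_ERROR( cudaMemcpy( c, dev_c, N * sizeof(int),
//                             cudaMemcpyDeviceToHost ) );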

Using GPU Shared Memory


Our last example is taken from pages 75-89 of CUDA by Example. The goal here is to calculate the dot product of two vectors, using shared memory to store intermediate results. The code is in ch5/dot.cu

#include "../common/book.h"

#define imin(a,b) (a<b?a:b)

const int N = 33 * 1024;
const int threadsPerBlock = 256;
const int blocksPerGrid =
            imin( 32, (N+threadsPerBlock-1) / threadsPerBlock );


__global__ void dot( float *a, float *b, float *c ) {
    __shared__ float cache[threadsPerBlock];
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int cacheIndex = threadIdx.x;

    float   temp = 0;
    while (tid < N) {
        temp += a[tid] * b[tid];
        tid += blockDim.x * gridDim.x;
    }
    
    // set the cache values
    cache[cacheIndex] = temp;
    
    // synchronize threads in this block
    __syncthreads();

    // for reductions, threadsPerBlock must be a power of 2
    // because of the following code
    int i = blockDim.x/2;
    while (i != 0) {
        if (cacheIndex < i)
            cache[cacheIndex] += cache[cacheIndex + i];
        __syncthreads();
        i /= 2;
    }

    if (cacheIndex == 0)
        c[blockIdx.x] = cache[0];
}


int main( void ) {
    float   *a, *b, c, *partial_c;
    float   *dev_a, *dev_b, *dev_partial_c;

    // allocate memory on the cpu side
    a = (float*)malloc( N*sizeof(float) );
    b = (float*)malloc( N*sizeof(float) );
    partial_c = (float*)malloc( blocksPerGrid*sizeof(float) );

    // allocate the memory on the GPU
    HANDLE_ERROR( cudaMalloc( (void**)&dev_a,
                              N*sizeof(float) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_b,
                              N*sizeof(float) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_partial_c,
                              blocksPerGrid*sizeof(float) ) );

    // fill in the host memory with data
    for (int i=0; i<N; i++) {
        a[i] = i;
        b[i] = i*2;
    }

    // copy the arrays 'a' and 'b' to the GPU
    HANDLE_ERROR( cudaMemcpy( dev_a, a, N*sizeof(float),
                              cudaMemcpyHostToDevice ) );
    HANDLE_ERROR( cudaMemcpy( dev_b, b, N*sizeof(float),
                              cudaMemcpyHostToDevice ) ); 

    dot<<<blocksPerGrid,threadsPerBlock>>>( dev_a, dev_b,
                                            dev_partial_c );

    // copy the array 'c' back from the GPU to the CPU
    HANDLE_ERROR( cudaMemcpy( partial_c, dev_partial_c,
                              blocksPerGrid*sizeof(float),
                              cudaMemcpyDeviceToHost ) );

    // finish up on the CPU side
    c = 0;
    for (int i=0; i<blocksPerGrid; i++) {
        c += partial_c[i];
    }

    #define sum_squares(x)  (x*(x+1)*(2*x+1)/6)
    printf( "Does GPU value %.6g = %.6g?\n", c,
             2 * sum_squares( (float)(N - 1) ) );

    // free memory on the gpu side
    HANDLE_ERROR( cudaFree( dev_a ) );
    HANDLE_ERROR( cudaFree( dev_b ) );
    HANDLE_ERROR( cudaFree( dev_partial_c ) );

    // free memory on the cpu side
    free( a );
    free( b );
    free( partial_c );
}
Compile and run the code, making sure that the results are correct.
  1. Insert timing calls to measure i) the time to complete just the dot kernel call on the GPU, and ii) the time for everything from moving the data from the host to the GPU, through the dot kernel call, to summing the partial_c[] elements together on the host. You should add a call to cudaDeviceSynchronize() between the kernel call and the next timing call (why?). A sketch of one way to time the kernel appears after this list.
  2. Increase the size of the dot product until the total time is around 0.1 seconds (you will need a vector length of around 10 million). For a fixed vector size, run experiments with a variety of block and thread count combinations to see how performance varies as a function of these parameters.
  3. In all the examples so far, there have been explicit copies of the data from the host to the GPU. As you will have found in the previous question, this is costly, accounting for most of the time required to perform the dot product. The requirement to explicitly copy data exists because GPUs were originally separate devices connected to the CPU via the PCIe bus. On the Jetson, however, this is not the case: both the CPU and GPU have equal access to the same memory. More recent versions of CUDA also allow for a more unified view of memory. The following blog posting discusses these issues (note that this relates to memory consistency). Using this, modify the above code to remove the data copies (a sketch using mapped, zero-copy host memory appears after this list). Examine the effect of this on performance.
  4. Add code such that the dot product is also calculated on the host as well as on the GPU. Time this code and the GPU code for a variety of different vector sizes (on the GPU, keep the block and thread counts fixed at what you determined to be optimal above). When is using the CPU fastest, and when is using the GPU fastest?
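
For the timing in point 1, a minimal sketch using CUDA events is shown below (you could equally use host timers such as gettimeofday(); the variable names are placeholders). cudaEventSynchronize() plays the same role here as the cudaDeviceSynchronize() mentioned above: it forces the host to wait until the kernel has actually finished.

cudaEvent_t start, stop;
float elapsed_ms;
HANDLE_ERROR( cudaEventCreate( &start ) );
HANDLE_ERROR( cudaEventCreate( &stop ) );

HANDLE_ERROR( cudaEventRecord( start, 0 ) );
dot<<<blocksPerGrid,threadsPerBlock>>>( dev_a, dev_b, dev_partial_c );
HANDLE_ERROR( cudaEventRecord( stop, 0 ) );
HANDLE_ERROR( cudaEventSynchronize( stop ) );   // wait for the kernel to finish
HANDLE_ERROR( cudaEventElapsedTime( &elapsed_ms, start, stop ) );
printf( "dot kernel: %.3f ms\n", elapsed_ms );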
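
For point 3, one possible approach on the Jetson is mapped (zero-copy) host memory, so the GPU works directly on the host allocation and no cudaMemcpy calls are needed. This is a sketch under the assumption that the device supports mapped memory (check prop.canMapHostMemory); on newer CUDA versions cudaMallocManaged() is an alternative.

float *a, *dev_a;

// must be called before any other CUDA call that creates a context
HANDLE_ERROR( cudaSetDeviceFlags( cudaDeviceMapHost ) );

// allocate host memory that the GPU can access directly
HANDLE_ERROR( cudaHostAlloc( (void**)&a, N*sizeof(float),
                             cudaHostAllocMapped ) );

// obtain the device-side pointer to the same memory
HANDLE_ERROR( cudaHostGetDevicePointer( (void**)&dev_a, a, 0 ) );

// ... fill a[] on the host, repeat for b and partial_c, then launch:
//   dot<<<blocksPerGrid,threadsPerBlock>>>( dev_a, dev_b, dev_partial_c );
//   HANDLE_ERROR( cudaDeviceSynchronize() );   // ensure results are visible on the host

HANDLE_ERROR( cudaFreeHost( a ) );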