COMP4300/8300 2017 - Tutorial/Laboratory 4

Shared Memory Parallel Programming with pthreads and OpenMP

In this session, we will consider shared memory programming using both pthreads and OpenMP.

A tar file containing all the programs for this session is available HERE. Save this tar file on your local desktop and then transfer it to the Raijin system as per previous sessions.


Pthreads


For details of the various pthreads library calls, see the appendix of the POSIX Threads Tutorial from LLNL; the calls are also documented in the man pages on Raijin.
  1. The program p_hello.c illustrates how to create threads using the pthread library (a generic sketch of thread creation and joining, separate from p_hello.c, is given after this list). It takes as input the number of threads to create (N), then creates these threads numbering them from 0 to N-1, and has each thread print out Hello from thread followed by its thread number. If you run the program you will find it behaves erratically. Fix this problem.
  2. Modify the hello code so that each thread prints out BOTH its number and the total number of threads read as input, i.e. the value of n (you will need to make this value visible to each thread in some way. Hint: extend the parm structure).
  3. The program p_cpi.c computes the value of Pi by numerically integrating the function f(x)=4/(1+x*x) from 0 to 1. The program takes as input the number of threads to create and the number of integration points. The code currently works correctly when run using one thread but gives the wrong result when using two or more threads. Correct the code so that it gives the right answer for any number of threads.
  4. Run the cpi program with increasing numbers of integration points. Can you reduce the error to zero? If not, why not?
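
The following is a minimal, generic sketch of creating and joining POSIX threads. It is not the provided p_hello.c, and names such as thread_arg_t and worker are purely illustrative. Compile with gcc -pthread.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct {        /* hypothetical per-thread argument block */
  int id;               /* this thread's number */
  int nthreads;         /* total number of threads */
} thread_arg_t;

void *worker(void *arg) {
  thread_arg_t *a = (thread_arg_t *) arg;
  printf("Hello from thread %d of %d\n", a->id, a->nthreads);
  return NULL;
}

int main(int argc, char *argv[]) {
  int i, n = (argc > 1) ? atoi(argv[1]) : 4;
  pthread_t *threads = malloc(n * sizeof(pthread_t));
  thread_arg_t *args = malloc(n * sizeof(thread_arg_t));

  for (i = 0; i < n; i++) {
    /* each thread gets its own argument block, not a pointer to the
       shared loop counter */
    args[i].id = i;
    args[i].nthreads = n;
    pthread_create(&threads[i], NULL, worker, &args[i]);
  }
  for (i = 0; i < n; i++)
    pthread_join(threads[i], NULL);

  free(threads);
  free(args);
  return 0;
}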

OpenMP Basics


The OpenMP Application Program Interface (API) is a portable, scalable model that gives shared-memory parallel programmers a simple and flexible interface for developing parallel applications. The OpenMP standard supports multi-platform shared-memory parallel programming in C/C++ and Fortran. It has been jointly defined by a group of major computer hardware and software vendors. For more information see the OpenMP Tutorial by Lawrence Livermore National Laboratory, or the OpenMP web pages.

OpenMP defines a set of program directives and a small number of function calls. The function calls are associated with the execution runtime environment, locks, and timing. The directives are primarily responsible for the parallelisation of the code. For C/C++ code, the directives take the form of pragmas:

#pragma omp

A program written using OpenMP directives begins execution as a single thread of execution, the master thread. The master thread executes sequentially until it encounters the first parallel construct. At that point a team of threads is created, with the original thread acting as the master of the team. Upon completion of the parallel construct the threads synchronize (unless specified otherwise) and only the master continues execution. Any number of parallel constructs may be specified in the program, and as a result the program may fork and join many times.

The number of threads to spawn may be set with the OMP_NUM_THREADS environment variable, set at runtime by calling omp_set_num_threads(), specified with a num_threads clause on the parallel directive, or left to the implementation default.

Note that the number of threads may exceed the number of physical CPUs on the machine (known as oversubscription); in this case, the operating system must schedule the threads as best it can. Whether this is permitted on a given system is determined by the environment variable OMP_DYNAMIC: if it is true, the OpenMP environment will not permit you to have more threads than physical CPUs on the machine. To override this and obtain more threads than CPUs, you need to issue the command
export OMP_DYNAMIC=false

OpenMP Directives

Parallel Regions

A parallel region is a structured block of code that is to be executed in parallel by a number of threads. Each thread executes the structured block independently.

Note: it is illegal for code to branch out of a parallel region.

The basic structure is as follows

#pragma omp parallel [clause]
{
     /* structured block */
}

The optional [clause] entries determine data scoping. The most common are default(shared|none), shared(list), private(list), firstprivate(list), lastprivate(list) and reduction(operator:list). A shared variable is accessed by all threads, whereas private gives each thread its own uninitialised copy; firstprivate initialises that private copy from the value the variable had before the region, and lastprivate copies the value from the (logically) last iteration or section back to the original variable.

As an example, consider the program ompexample1.c:

#include <stdio.h>
#include <omp.h>

int main(void) {
  int i = 1, j = 2, k;

  printf(" Initial value of i %i and j %i \n", i, j);

  #pragma omp parallel for default(shared) private(i)
  for (k = 0; k < 4; k++) {
    printf(" Initial value in parallel of i %i and j %i \n", i, j);
    i = i+99;
    j = j+99;
    printf(" Final value in parallel of i %i and j %i \n", i, j);
  }

  printf(" Final value of i %i and j %i \n", i, j);
  return 0;
}

Compile it using the command make ompexample1

  1. Run the code with one thread using the command ./ompexample1
    What values are printed out for i and j and why?

  2. Now run the code with 4 threads by first setting the OMP_NUM_THREADS environment variable: OMP_NUM_THREADS=4 ./ompexample1
    What is printed out now? Explain why these values are printed.

  3. Explain the effect of changing private(i) to firstprivate(i). What happens if you change it to lastprivate(i)?

The Reduction Clause

A reduction clause can be added to the parallel directive. Each thread then works on its own copy of the listed variables, and these copies are combined using the specified operation (or intrinsic function) at the end of the parallel region. For example, consider the program ompexample2.c:

#include <stdio.h>
#include <omp.h>

int main(void) {
  int t_id;
  int i = 10, j = 10, k = 10;

  printf("Before parallel region: i=%2i, j=%2i, k=%2i\n", i, j, k);

  #pragma omp parallel default(none) private(t_id) reduction(+:i) \
    reduction(*:j) reduction(^:k)
  {
    t_id = omp_get_thread_num() + 1;
    i = t_id;
    j = t_id;
    k = t_id;
    printf("Thread %i: i=%i, j=%i, k=%i\n", t_id, i, j, k);
  }

  printf("After parallel region: i=%2i, j=%2i, k=%2i\n", i, j, k);
  return 0;
}

The above program demonstrates a number of reduction operations and also shows the use of the omp_get_thread_num function to uniquely identify each thread.

  1. Compile and run ompexample2. Then run the code after setting the number of threads to 14: export OMP_NUM_THREADS=14
    Why is the final value for variable j negative?

Some other useful OpenMP routines are omp_set_num_threads(n), which sets the number of threads to use in subsequent parallel regions; omp_get_num_threads(), which returns the number of threads in the current team; and omp_get_max_threads(), which returns the maximum number of threads that a subsequent parallel region would use.

The above three functions are used in the example program ompexample3.c:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char* argv[]) {
  int np, t_id, num_threads, max_threads;

  if (argc != 2) {
    printf(" %s Number_of_threads \n", argv[0]);
    return -1;
  } else {
    np = atoi(argv[1]);
    if (np < 1) {
      printf("Error: Number_of_threads (%i) < 1 \n", np);
      return -1;
    }
  }

  omp_set_num_threads(np);

  num_threads = omp_get_num_threads();
  max_threads = omp_get_max_threads();
  printf("Before Parallel: num_threads=%i max_threads %i\n",
        num_threads, max_threads);
  #pragma omp parallel default(none) private(num_threads, t_id)
    {
      num_threads = omp_get_num_threads();
      t_id = omp_get_thread_num();
      printf("In Parallel: num_threads=%i t_id=%i \n", num_threads, t_id);
    }
  num_threads = omp_get_num_threads();
  printf("After Parallel: num_threads=%i \n", num_threads);

  return 0;
}
  1. Compile and run ompexample3.c. Run the program using OMP_DYNAMIC=true ./ompexample3 20. How many threads are created? Now re-run using OMP_DYNAMIC=false ./ompexample3 20. How many are created now?

The program ompexample4.c computes the sum of all integers from 1 to num_elem, where num_elem is given as command line input. This task is performed using the following loop:

int sum = 0;
int i = 0;
do {
  i++;
  sum += i;
} while (i < num_elem);
  1. Parallelise this summation by using OpenMP to manually divide up the loop iterations amongst the available OpenMP threads (this means you are not to convert this to a for loop and use #pragma omp for). Your parallel code must continue to use a do-while statement.

The For Directive

In the above you parallelised a loop by manually assigning different loop indices to different threads. With for loops, OpenMP provides the for directive to do this for you. This directive is placed immediately before a for loop and automatically partitions the loop iterations across the available threads.

#pragma omp for [clause[,clause ...]]
for (...) {
  /* structured block */
}

An important optional clause is the schedule(type[,chunk]) clause. This can be used to define specifically how the loop iterations are divided amongst the different threads. Two distribution schemes are schedule(static[,chunk]), in which the iterations are divided into chunks of size chunk (by default, roughly equal contiguous blocks) and assigned to the threads before the loop starts, and schedule(dynamic[,chunk]), in which each thread takes a chunk of iterations (of size chunk, default 1) and requests another chunk when it finishes - useful when the cost of the iterations varies.
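
As a rough illustration (this is not one of the provided example programs, and the loop bodies are made up), the two schedules might be requested as follows:

#include <stdio.h>
#include <omp.h>

int main(void) {
  int i, n = 16;
  double a[16];

  /* static: iterations split into equal contiguous blocks, one per thread */
  #pragma omp parallel for schedule(static)
  for (i = 0; i < n; i++)
    a[i] = i * i;

  /* dynamic: threads grab chunks of 4 iterations as they become free */
  #pragma omp parallel for schedule(dynamic, 4)
  for (i = 0; i < n; i++)
    a[i] = a[i] + i;

  printf("a[%d] = %f\n", n - 1, a[n - 1]);
  return 0;
}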

  1. The program ompexample5.c is analogous to the pthread program p_cpi.c in that it computes the value of Pi by numerically integrating the function 1/(1+x^2) between the values of 0 and 1 using the trapezoid method (the value of this integral is Pi/4). The code divides the domain [0,1] into N trapezoids, where the area of each trapezoid is given by the average length of the two parallel sides times the width of the trapezoid. The lengths of the two parallel sides are given by the values of the function 1/(1+x^2) at the lower and upper values of x for that trapezoid.
    Parallelise this code using the omp for directive.

The Barrier Directive

In any parallel program there will be certain points where you wish to synchronize all your threads. This is achieved by using the barrier directive:

#pragma omp barrier

All threads must wait at the barrier before any of them can continue.
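
A minimal sketch (not one of the provided example programs) of a two-phase computation separated by a barrier:

#include <stdio.h>
#include <omp.h>

int main(void) {
  #pragma omp parallel
  {
    int t_id = omp_get_thread_num();

    printf("Thread %i: phase 1\n", t_id);

    /* no thread starts phase 2 until every thread has finished phase 1 */
    #pragma omp barrier

    printf("Thread %i: phase 2\n", t_id);
  }
  return 0;
}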

The Single Directive

There may be pieces of code that you want only one thread to run, even though multiple threads are executing - for example, output that should be printed once only. This can be achieved using the single directive:

#pragma omp single [clause]
{
  /*structured block*/
}

By default all other threads will wait at the end of the structured block until the thread that is executing that block has completed. You can avoid this by augmenting the single directive with a nowait clause.
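
For example, the following sketch (again, not one of the provided programs) prints its message once, from whichever thread reaches the single block first:

#include <stdio.h>
#include <omp.h>

int main(void) {
  #pragma omp parallel
  {
    #pragma omp single
    {
      /* executed by exactly one thread; the others wait at the implicit
         barrier at the end of the block (unless nowait is used) */
      printf("Printed once, by thread %i\n", omp_get_thread_num());
    }
    printf("Thread %i continues\n", omp_get_thread_num());
  }
  return 0;
}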

The Critical Directive

In some instances, interactions between threads may lead to incorrect results, or results that vary from run to run. This can arise because two threads manipulate the same data objects at the same time and the result depends on which thread happened to modify the objects first. The critical directive ensures that a block of code is executed by only one thread at a time. Effectively this serializes portions of a parallel region.

#pragma omp critical [(name)]
{
  /*structured block*/
}

A thread will wait at the beginning of the critical section until no other thread in the team is executing that (named) section.
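
As a sketch (not one of the provided programs), a named critical section protecting a shared counter might look like this:

#include <stdio.h>
#include <omp.h>

int main(void) {
  int count = 0;

  #pragma omp parallel default(none) shared(count)
  {
    #pragma omp critical (count_update)
    {
      /* only one thread at a time executes this block, so no
         increments of count are lost */
      count = count + 1;
    }
  }

  printf("count = %i (should equal the number of threads)\n", count);
  return 0;
}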

The Atomic Directive

In a somewhat similar vein to critical, the atomic directive ensures that a given memory location is never updated by two threads at precisely the same time. (Note that reading shared variables is not a problem - only storing to shared variables.) The atomic directive sets locks to ensure unique access to a given shared variable:

#pragma omp atomic

The directive applies to the statement immediately following it. Be aware that there is an overhead associated with the setting and unsetting of locks - so use this directive and/or critical sections only when necessary. For example, we could use the atomic directive to parallelise an inner product:

#pragma omp parallel for shared(a,b,sum) private(i,tmp)
for (i = 0; i < n; i++) {
  tmp = a[i]*b[i];
  #pragma omp atomic
  sum = sum + tmp;
}

but the performance would be very poor!
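
For this particular pattern the reduction clause described earlier is the better choice: each thread accumulates a private partial sum and the partial sums are combined once at the end of the loop, rather than taking a lock on every iteration. A sketch of the same inner product (with the same assumed variables as above) would be:

#pragma omp parallel for shared(a,b) private(i) reduction(+:sum)
for (i = 0; i < n; i++) {
  sum = sum + a[i]*b[i];
}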

  1. Often it is useful to have a shared counter that can be used to implement load balancing. The program ompexample6.c is an attempt to implement such a thing. However, it does not work correctly (try it for various values of maxtasks). Explain why the current version does not work correctly. You may have to run the program several times; e.g. using csh you can do the following to run the code 100 times and search for errors.
    repeat 100 ./ompexample6 4 109 | grep Something
    
    Using bash you can do the following
    for n in {1..100}; do ./ompexample6 4 109; done | grep Something 
    
  2. Fix ompexample6.c so it works correctly!

Application Parallelisation


In Prac 2 you were provided with a code to evaluate the Mandelbrot set. The same code is provided in this session as mpi_mandel.c, with a sequential version given in omp_mandel.c.

  1. Parallelise omp_mandel.c using OpenMP. Recall that this code suffers from load imbalance. Include in your OpenMP parallelisation a dynamic load balancing scheme.
  2. Compare the performance of your OpenMP code to that of the MPI code for a range of processor counts (remember you can only run OpenMP within one node on the Raijin system).

In Prac 3 you were provided with a code to solve the heat distribution problem on a rectangular grid. Similar code is provided in this session as mpi_heat.c, and a sequential version is given in omp_heat.c.

  1. Parallelise omp_heat.c using OpenMP. Note that we have a maximum-value reduction operation (variable mxdiff), but unfortunately the default gcc (currently 4.4.7) on Raijin does not support this - it may be necessary to do the reduction 'by hand' (ask your tutor for hints; a generic sketch of the pattern is given after this list).
  2. Compare the performance of your OpenMP code to that of the MPI code for a range of processor counts.
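
The following is a generic sketch of a maximum reduction done 'by hand', not taken from omp_heat.c: each thread keeps a private maximum over its share of the iterations, and the private maxima are combined inside a critical section after the loop. The data and variable names here are made up for illustration.

#include <stdio.h>
#include <omp.h>

int main(void) {
  int i, n = 1000;
  double x[1000], mxdiff = 0.0;

  for (i = 0; i < n; i++)          /* some made-up data */
    x[i] = (i * 37) % 101;

  #pragma omp parallel default(shared) private(i)
  {
    double local_max = 0.0;        /* private to each thread */

    #pragma omp for
    for (i = 0; i < n; i++)
      if (x[i] > local_max) local_max = x[i];

    #pragma omp critical (max_update)
    if (local_max > mxdiff) mxdiff = local_max;   /* combine once per thread */
  }

  printf("maximum = %f\n", mxdiff);
  return 0;
}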