COMP4300/8300 2017 - Practical 6

Parallel Input/Output

IMPORTANT: This session will again use Raijin. The aim of this session is to give you an introduction to parallel input/output systems. Specific objectives include writing data with MPI-IO, measuring I/O latency and IOPS with ioping and fio, and experimenting with striping on a Lustre filesystem.

MPI-IO


Download the file mpi-io.tar, transfer it to Raijin and untar it (or from Raijin, wget https://cs.anu.edu.au/courses/comp4300/practicals/mpi-io.tar).
  1. Make and run the MPI-IO example helloworld
  2. Fix sine.c so that it outputs sine.dat, which can then be plotted using gnuplot. There is a subtle bug that needs to be fixed, AND you need to add the MPI-IO code that writes out, in binary, the data from computing sin(x) for x from 0 to 2*Pi. This data goes into a file sine.dat. The file dosineplot reads this data (a series of 200 or so binary floating point numbers) and uses the gnuplot utility to plot it. A sketch of the general MPI-IO write pattern is given after this list.

    To compile and test your code, use the command make sineplot. Note that it deletes the old sine.dat first - this appears to be necessary because the MPI file open mode used does not overwrite an existing file. Remember to use ssh -XY login@raijin.nci.org.au to get X11 forwarding working. A sample plot is given below.

    [Figure: Sine Plot - sample gnuplot output of sin(x)]

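The sketch below shows the general MPI-IO pattern for this kind of write: each rank computes its own block of samples and writes it at a rank-dependent offset. The sample count, the use of double precision and the even decomposition across ranks are illustrative assumptions - match them to what sine.c and dosineplot actually expect - and compile with mpicc (linking -lm for sin()).

    /* Sketch only: generic MPI-IO pattern for writing one contiguous block of
     * binary data per rank.  N, the datatype and the even split across ranks
     * are assumptions, not the actual sine.c layout. */
    #include <math.h>
    #include <mpi.h>

    #define N 200                          /* total number of samples (assumed) */

    int main(int argc, char *argv[]) {
        const double PI = 3.141592653589793;
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        int nlocal = N / nprocs;           /* assumes N divides evenly among ranks */
        double buf[nlocal];
        for (int i = 0; i < nlocal; i++) {
            double x = 2.0 * PI * (rank * nlocal + i) / N;   /* x in [0, 2*Pi) */
            buf[i] = sin(x);
        }

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "sine.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        /* each rank writes its block at a byte offset determined by its rank */
        MPI_Offset offset = (MPI_Offset) rank * nlocal * sizeof(double);
        MPI_File_write_at(fh, offset, buf, nlocal, MPI_DOUBLE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        MPI_Finalize();
        return 0;
    }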

FIO and ioping


fio is a tool that spawns a number of threads or processes performing a particular type of I/O action, as specified by the user.

ioping is a tool to monitor I/O latency in real time. It shows disk latency in the same way as ping shows network latency. It creates temporary files in the indicated directory and then tests the latency of read operations on them.

In your /short/c37 area on Raijin create a suitably named sub-directory for this session, and in there check out the following Git repos:


Build both the fio and ioping repositories there (a rough build sketch is given below). From the sub-directory where you installed them, you can read the man pages for these tools for more information on their usage:
% man ioping/ioping.1
% man fio/fio.1
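
As a rough guide (the exact steps may differ between versions - check each repository's README), both tools build in place:

% cd fio && ./configure && make && cd ..
% cd ioping && make && cd ..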

Note: Raijin has three filesystems accessible to end-users: /home, /jobfs and /short.

Measuring latency for single-threaded sequential workloads

  1. Using ioping, measure the I/O latency for /short. Construct PBS jobs (or use the interactive queue) with 1 CPU to run the ioping executable, and record latency (in milliseconds), IOPS and B/W for block sizes of 4KB, 128KB and 1MB, and working set sizes of 10MB, 100MB and 1024MB. Note: the `working set size' is the size of the temporary file that ioping creates.
    Hint: use shell commands in your batch file to run multiple experiments and to extract results from the job output files; a sketch of such a batch script is given after this list. An example run, with 20 requests, an 8KB block size and a working set of 1GB on a home directory, using cached I/O:
    % /short/z00/jxa900/ioping -c 20 -s 8KB -S 1024MB -C /home/900/jxa900
    8 KiB from /home/900/jxa900 (lustre 10.9.103.3@o2ib3:10.9.103.4@o2ib3:/homsys): request=20 time=29 us

    --- /home/900/jxa900 (lustre 10.9.103.3@o2ib3:10.9.103.4@o2ib3:/homsys) ioping statistics ---
    20 requests completed in 2.79 ms, 160 KiB read, 7.16 k iops, 55.9 MiB/s
    min/avg/max/mdev = 28 us / 139 us / 289 us / 103 us
    
    From the above experiment, the corresponding data values are:

    Working Set (MB) | Block Size (KB) | Time Taken (ms) | Data Read (KB) | IOPS | B/W (MB/sec)
    1024             | 8               | 2.79            | 160            | 7160 | 55.9

    (Later, if you have time/interest): repeat the above without the -C and/or with -D, to see more effects of caching.

  2. Using your results from the above questions, complete the following table.

    Working Set (MB) | Block Size (KB) | Time Taken (ms) | Data Read (KB) | IOPS | B/W (MB/sec)
    10               | 4               |                 |                |      |
    10               | 128             |                 |                |      |
    10               | 1024            |                 |                |      |
    100              | 4               |                 |                |      |
    100              | 128             |                 |                |      |
    100              | 1024            |                 |                |      |
    1024             | 4               |                 |                |      |
    1024             | 128             |                 |                |      |
    1024             | 1024            |                 |                |      |
  3. Explain the observed trends in IOPS and B/W for the three working set sizes as the request block size is changed.
  4. (Optional) Run this exercise on a Raijin compute node's /jobfs, produce a similar table, and compare/contrast the results. From your batch script, you can access your job's directory in this filesystem via /jobfs/local/$PBS_JOBID, and you can ensure sufficient space by adding the PBS directive #PBS -l jobfs=5GB.
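
As hinted above, a single batch script can loop over all the block and working set size combinations. The sketch below is one possible shape for it: the resource requests, the ioping path, the target directory and the grep pattern (which assumes the statistics line format shown in the example output above) are all placeholders to adapt to your own setup.

    #!/bin/bash
    # Sketch of a PBS batch script for the ioping latency experiments
    # (paths and resource requests are illustrative only).
    #PBS -l ncpus=1
    #PBS -l walltime=00:30:00

    cd $PBS_O_WORKDIR                  # run from the directory the job was submitted from

    IOPING=./ioping/ioping             # path to the ioping binary you built
    DIR=/short/c37/$USER/prac6         # directory whose latency is being measured (placeholder)

    for ws in 10MB 100MB 1024MB; do
        for bs in 4KB 128KB 1MB; do
            echo "=== working set $ws, block size $bs ==="
            $IOPING -c 20 -s $bs -S $ws -C $DIR | grep "requests completed"
        done
    done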

Measuring IOPS for multi-threaded sequential workloads

  1. In your /short/c37 directory for this session, create an empty sub-directory called testIO for fio to create temporary files in. Using fio, measure read and write IOPS for two, four and eight threads of I/O running on /short, for a block size of 1MB and a file size of 1GB.
    These can be set when running fio using the --bs=1M and --size=1G flags. For a sequential write workload, add --readwrite=write. The flags --ioengine=libaio and --gtod_reduce=1 are recommended for good and repeatable performance, respectively. For example, the following creates 2 threads (--thread --numjobs=2):
    fio/fio --ioengine=libaio --gtod_reduce=1 --bs=1024K  --size=1G \
      --readwrite=write --thread --numjobs=2 --name=test --directory=./testIO
    
    and will produce output like:
    test: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=1
    ...
    Starting 2 threads
    Jobs: 2 (f=2)
    test: (groupid=0, jobs=1): err= 0: pid=22754: Mon May 15 18:25:09 2017
      write: IOPS=627, BW=628MiB/s (658MB/s)(1024MiB/1631msec)
      ...
    test: (groupid=0, jobs=1): err= 0: pid=22755: Mon May 15 18:25:09 2017
      write: IOPS=652, BW=653MiB/s (684MB/s)(1024MiB/1569msec)
      ...
    Run status group 0 (all jobs):
      WRITE: bw=1256MiB/s (1317MB/s), 628MiB/s-653MiB/s (658MB/s-684MB/s), io=2048MiB (2147MB), run=1569-1631msec
    
    The command ls ./testIO will display the temporary files fio generated.

    Using a suitable batch script, run experiments on fio and complete the following table for sequential write I/O performance (in the output, run=... is `Time Taken' and io=... is `Data Written'); a sketch of such a batch loop is given after this list:

    No of Threads | File Size (MB) | Block Size (KB) | Time Taken (ms) | Data Written (MB) | IOPS | B/W (MB/sec)
    2             | 1024           | 1024            |                 |                   |      |
    4             | 1024           | 1024            |                 |                   |      |
    8             | 1024           | 1024            |                 |                   |      |
  2. Repeat the above, but this time using a 50% read/write mix of random I/Os. The flags to specify this are --readwrite=randrw --rwmixread=50. In order to aggregate read and write IOPS data together, use --unified_rw_reporting=1.
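
One possible shape for the batch loop (placed inside a PBS script like the one sketched earlier) is shown below. The fio flags are those given above; the testIO path and the cleanup of temporary files between runs are assumptions to adapt as needed.

    # loop over thread counts for the sequential write workload
    for nt in 2 4 8; do
        echo "=== $nt threads, sequential write ==="
        fio/fio --ioengine=libaio --gtod_reduce=1 --bs=1024K --size=1G \
            --readwrite=write --thread --numjobs=$nt --name=test --directory=./testIO
        rm -f ./testIO/test.*          # remove temporary files before the next run (assumed naming)
    done

    # same loop for the 50% random read/write mix
    for nt in 2 4 8; do
        echo "=== $nt threads, random 50/50 read/write ==="
        fio/fio --ioengine=libaio --gtod_reduce=1 --bs=1024K --size=1G \
            --readwrite=randrw --rwmixread=50 --unified_rw_reporting=1 \
            --thread --numjobs=$nt --name=test --directory=./testIO
        rm -f ./testIO/test.*
    done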

Setting striping on a Lustre filesystem

Lustre allows you to modify three striping parameters for a file: the stripe count, the stripe size, and the starting OST index. The default parameters on raijin:/short are count=2, size=1MB, index=-1, but these can be changed and viewed on a per-file or per-directory basis using the commands:

% lfs setstripe [file,dir] [-c count] [-S size] [-i index]

% lfs getstripe [file,dir]
Note: for a size parameter of 1MB, use -S 1M.

A file automatically inherits the striping parameters of the directory it is created in, so changing the parameters of a directory is a convenient way to set the parameters for a collection of files you are about to create. For instance, if your application creates output files in a sub-directory called output/, you can set the striping parameters on that directory once before your application runs, and all of your output files will inherit those parameters.
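
For example (the stripe count and size here are arbitrary illustrative values), you might set the parameters on testIO before a run and then confirm that new files inherit them:

% lfs setstripe -c 4 -S 4M ./testIO
% touch ./testIO/newfile
% lfs getstripe ./testIO/newfile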

  1. In your /short/c37 directory for this session, run lfs getstripe ./testIO. Explain the output of the command, using the man pages if required.
  2. Re-run the FIO exercise using the following stripe sizes and counts for a 1GB file, for a sequential write workload. Complete the following table and comment on the trends (a brief command sketch is given after this list):
    Stripe Size (MB) | Stripe Count | No of Threads | Working Set (MB) | Block Size (KB) | Time Taken (ms) | Data Written (MB) | IOPS | B/W (MB/sec)
    1                | -1           | 4             | 1024             | 1024            |                 |                   |      |
    1                | 2            | 4             | 1024             | 1024            |                 |                   |      |
    1                | 4            | 4             | 1024             | 1024            |                 |                   |      |
    4                | -1           | 4             | 1024             | 1024            |                 |                   |      |
    4                | 2            | 4             | 1024             | 1024            |                 |                   |      |
    4                | 4            | 4             | 1024             | 1024            |                 |                   |      |
  3. (Optional) Using your ANU UniID and password, log onto the NeCTAR cloud and bring up a single-core VM using a CentOS or Ubuntu image on any NeCTAR node. Perform the same tests as in this session. (The URL for creating VMs is https://dashboard.rc.nectar.org.au/ and for setting up SSH keys: http://darlinglab.org/tutorials/instance_startup/)
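
Each row of the striping table corresponds to one stripe configuration. A minimal sketch for a single row, assuming a fresh directory per configuration (the directory name is illustrative; a stripe count of -1 stripes over all available OSTs):

    # example for the stripe size 4MB, stripe count 4 row
    mkdir stripe4M_c4
    lfs setstripe -c 4 -S 4M ./stripe4M_c4
    fio/fio --ioengine=libaio --gtod_reduce=1 --bs=1024K --size=1G \
        --readwrite=write --thread --numjobs=4 --name=test --directory=./stripe4M_c4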