EDAN26 Multicore Programming Lab 6

The course home page is here.

The purpose of this lab is to let you explore:

Rust and its ownership rules and atomic reference counters,
Software transactional memory in Clojure,
Hardware transactional memory in C on POWER,
OpenMP, and
Parallelizing compilers: IBM xlc.

Download the file lab6.zip or use:

wget fileadmin.cs.lth.se/cs/Education/edan26/labs/lab6/lab6.zip
Go to the downloaded directory. First an optional activity for users of the nvi and vim editors:
- Type: ls -a (a for all)
- You can see a file .exrc
- If you copy it to $HOME/.exrc you can invoke make by hitting the v key in command mode
- The file defines a macro for the v key, which:
  1. Saves the file being edited
  2. Runs make
- This way you only need one terminal and there is no need to leave vi ;-)
- It also sets the tab width to eight, enables showing of matching () and {}, and maps the + key to go to the next file when you edit multiple files
Edit the file src/main.rs and run make (in the current directory, not in src)

You will get a compilation error (and some warnings about unused variables). What does the error mean?
Fix this! See beginning of Lecture 9 for hints (and possibly end of Lecture 8).

Make again.
Complete the program so that it prints PASS.
Experiment with some larger inputs.
Now go to the clojure directory and type

time clojure swish.clj

Only the first line of output from time, the one labelled real, is interesting (and relatively correct :)

Initially only four accounts and ten transactions, using one thread, are used.

At the end all accounts are printed and you can see they all have the start-balance.

Fix this! See Lecture 11 for hints.
Then note the time and increase the number of accounts in steps of e.g. 1000 and note whether it takes more time to create the accounts.
Then increase the number of transactions until it takes noticeably longer.
Then increase the number of threads until it is no longer faster.
Then decrease the number of accounts until it takes noticeably longer. Will it take "forever" if you have really few accounts?
Next go to the C directory, which contains the original C file from Lab 3. You might want to edit your Pthreads solution from Lab 3 instead. If you use the solution from Lab 3, make sure your account structs contain only the balance and no mutex (to save memory).
Now modify the swish function to work on a transaction, using the syntax

__transaction_atomic { /* code... */ }
Compile with gcc -fgnu-tm swish.c, i.e. without optimization!
Unfortunately, you will see a compilation error. Only "transaction safe" functions may be called from a transaction.

Change the code to:

void __attribute__((transaction_safe)) extra_processing()
{
        volatile int i;

        for (i = 0; i < PROCESSING; i += 1)
                ;
}

This tells GCC it is safe to call the function. Recompile!
Unfortunately, you will see another compilation error. In C, compilers regard a volatile variable as "dangerous to optimize": it may not be a normal variable, and accessing it may have side effects such as I/O.

Clearly this cannot be tolerated in a transaction, which may be retried.

Remove the volatile qualifier and recompile!
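The pieces introduced so far (a transaction block, a transaction-safe helper, and no volatile) can be put together roughly as in the sketch below. It is not the lab solution. The names USE_TM, ATOMIC, TM_SAFE, account_t and transfer, as well as the value of PROCESSING, are hypothetical and chosen only for illustration. The macros exist so that the file also compiles without -fgnu-tm, in which case it is not thread safe.

```c
#include <assert.h>

/* Hypothetical switch: build with  gcc -fgnu-tm -DUSE_TM  to get real
 * transactions. Without USE_TM the macros expand to nothing, so the code
 * still compiles but is NOT safe to run from several threads. */
#ifdef USE_TM
#define ATOMIC  __transaction_atomic
#define TM_SAFE __attribute__((transaction_safe))
#else
#define ATOMIC
#define TM_SAFE
#endif

#define PROCESSING 1000 /* assumed value; the constant in the lab may differ */

/* Hypothetical account type: only a balance, no mutex. */
typedef struct {
        int balance;
} account_t;

/* Busy work without volatile and without I/O, so it may be called
 * from inside a transaction. */
TM_SAFE void extra_processing(void)
{
        int i;

        for (i = 0; i < PROCESSING; i += 1)
                ;
}

/* Move amount between two distinct accounts as one atomic step. */
void transfer(account_t *from, account_t *to, int amount)
{
        assert(from != to);

        ATOMIC {
                extra_processing();
                from->balance -= amount;
                to->balance += amount;
        }
}
```

Either both balance updates become visible to other threads or neither does, which is the property the mutexes provided in the Pthreads version.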
When it compiles successfully, use the following command to check that the POWER transactional memory instructions really are used:

objdump -d a.out | grep tbegin
Experiment with varying the number of accounts, threads, and transactions. How many transactions per second can you achieve?
Compare the performance with your solution from Lab 3.
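To get a transactions-per-second figure that can be compared between the two versions, one option is to time the whole run with a monotonic clock and divide. The sketch below assumes POSIX clock_gettime is available. The function names are hypothetical and not part of the lab files.

```c
#include <assert.h>
#include <time.h>

/* Seconds between two readings taken with
 * clock_gettime(CLOCK_MONOTONIC, &t). */
double elapsed_seconds(struct timespec begin, struct timespec end)
{
        return (double)(end.tv_sec - begin.tv_sec)
             + (double)(end.tv_nsec - begin.tv_nsec) / 1e9;
}

/* Throughput; count is the total number of completed transactions
 * summed over all threads. */
double transactions_per_second(unsigned long count, double seconds)
{
        assert(seconds > 0.0);
        return (double)count / seconds;
}
```

Take the first reading just before the threads are created and the second after they have all been joined, so that thread start-up cost is included in the same way for both versions.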
Now compile using optimization, for example -O3, which is the highest numbered level for GCC.
Increase the amount of PROCESSING until it becomes absurd. How is the performance affected? Why, do you think?
Next compile the program mm.c, which performs matrix multiplication, and try both GCC and Clang at different optimization levels.

gcc -fexpensive-optimizations -mcpu=power8 -O3 mm.c && time ./a.out

and then

gcc -ftree-parallelize-loops=80 -fexpensive-optimizations -mcpu=power8 -O3 mm.c && time ./a.out

The second command tries to use 80 threads in parallelizable loops.
Parallelize it using OpenMP! See Lecture 8 for hints.
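A common starting point is to put a parallel for directive on the outermost loop, since each row of the result can be computed independently. The sketch below assumes square row-major matrices passed as flat arrays. The function matmul and its signature are hypothetical, and mm.c may store its matrices differently.

```c
#include <assert.h>

/* c = a * b for n x n row-major matrices. Build with  gcc -fopenmp ;
 * without that flag the pragma is ignored and the loop runs serially. */
void matmul(int n, const double *a, const double *b, double *c)
{
        int i;

        assert(n > 0 && c != a && c != b);

        /* Rows are independent: i is private to each thread, and j, k
         * and sum are declared inside the loop, so nothing is shared
         * for writing except disjoint parts of c. */
        #pragma omp parallel for
        for (i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                        double sum = 0.0;

                        for (int k = 0; k < n; k++)
                                sum += a[i * n + k] * b[k * n + j];
                        c[i * n + j] = sum;
                }
        }
}
```

The inner loop walks down a column of b, which is unfriendly to the cache. Changing the loop order or the data layout is one of several things worth trying once the parallel version works.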
Finally, use the IBM parallelizing compiler without OpenMP directives. HOT means high-order transformations.

xlc -qarch=pwr8 -O5 -qsmp -qhot=level=2 mm.c && time ./a.out

As warned by xlc, the accuracy can be reduced with -O3 and higher. Sometimes that is dangerous, and at other times waiting ten days instead of one for an ocean weather prediction may be more dangerous.
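One way an optimizer can change accuracy is by regrouping floating-point operations, because floating-point addition is not associative. Whether this is the specific transformation behind the xlc warning is an assumption here, and the compiler documentation has the details. The sketch below only shows that the grouping matters for IEEE 754 doubles. The function names are hypothetical.

```c
#include <assert.h>

/* Adds left to right: (x + y) + z. */
double sum_left(double x, double y, double z)
{
        return (x + y) + z;
}

/* Same values, other grouping: x + (y + z). */
double sum_right(double x, double y, double z)
{
        return x + (y + z);
}

/* With IEEE 754 doubles, 1.0 is too small to change -1e17, so the
 * second grouping loses it. Aborts if the platform behaves differently. */
void check_grouping(void)
{
        assert(sum_left(1e17, -1e17, 1.0) == 1.0);
        assert(sum_right(1e17, -1e17, 1.0) == 0.0);
}
```

A compiler that is allowed to pick whichever grouping vectorizes or parallelizes best can therefore produce a different sum than the source order would.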
Optional: can you modify the OpenMP code to make it faster than the IBM code? Use any compiler and options.

Sat Oct 12 15:10:30 CEST 2019