EDAN26 Multicore Programming Lab 6
The course home page is here.
The purpose of this lab is to let you explore:
- Rust and its ownership rules and atomic reference counters,
- Software transactional memory in Clojure,
- Hardware transactional memory in C on POWER,
- OpenMP, and
- Parallelizing compilers: IBM xlc.
- Download the file lab6.zip or use:
wget fileadmin.cs.lth.se/cs/Education/edan26/labs/lab6/lab6.zip
- Go to the downloaded directory.
First an optional activity for users of the nvi and vim editors:
- Type:
ls -a
(a for all)
- You can see a file .exrc
- If you copy it to $HOME/.exrc you can invoke make by hitting the v key in command mode
- The file defines a macro for the v key, which:
- Saves the file being edited
- Runs make
- This way you only need one terminal and there is no need to leave vi ;-)
- It also sets the tab width to eight, enables showing matching () and {} and lets the key + mean go to the next file in case you edit multiple files
- Edit the file src/main.rs and run make (in the current directory, not in src)
You will get a compilation error (and some warnings about unused variables). What does the error mean?
- Fix this! See the beginning of Lecture 9 for hints (and possibly the end of Lecture 8).
Run make again.
- Make the program complete, i.e. you get output PASS
- Experiment with some larger inputs.
- Now go to the clojure directory and type
time clojure swish.clj
It is only the first line of output from time with real that is interesting (and relatively correct :)
Initially, only four accounts and ten transactions are used, with one thread.
At the end all accounts are printed, and you can see that they all still have the start balance, i.e. the transactions had no effect.
Fix this! See Lecture 11 for hints.
- Then note the time and increase the number of accounts in steps of e.g. 1000 and note whether it takes more time to create the accounts.
- Then increase the number of transactions until it takes noticeably longer.
- Then increase the number of threads until it no longer gets faster.
- Then decrease the number of accounts until it takes noticeably longer. Will it take "forever" if you have really few accounts?
- Next go to the C directory, which contains the original C file from Lab 3. You might want to edit your Pthreads solution from Lab 3 instead. If you use the solution from Lab 3, make sure your account structs only contain the balance and no mutex (to save memory).
- Now modify the swish function to work on a transaction, using the syntax
__transaction_atomic {
        /* code... */
}
- Compile with gcc -fgnu-tm swish.c, i.e. without optimization!
- Unfortunately, you will see a compilation error. Only "transaction safe" functions may be called from a transaction.
Change the code to:
void __attribute__((transaction_safe)) extra_processing()
{
        volatile int i;
        for (i = 0; i < PROCESSING; i += 1)
                ;
}
This tells GCC it is safe to call the function. Recompile!
- Unfortunately, you will see another compilation error. In C, a volatile variable is regarded as "dangerous to optimize" by compilers, since accessing it, unlike a normal variable, may lead to side effects such as I/O.
It is clear this cannot be tolerated in a transaction, which may be retried.
Remove the volatile qualifier and recompile!
- When it compiles successfully, use the following command to check that the POWER transactional memory instructions really are used:
objdump -d a.out | grep tbegin
- Experiment with varying the number of accounts, threads, and transactions. How many transactions per second can you achieve?
- Compare the performance with your solution from Lab 3.
- Now compile using optimization, for example -O3 which is the highest level for GCC.
- Increase the amount of PROCESSING until it becomes absurd. How is the performance affected? Why, do you think?
- Next compile the program mm.c, which performs matrix multiplication, and try GCC and Clang with different optimization levels, for example:
gcc -fexpensive-optimizations -mcpu=power8 -O3 mm.c && time ./a.out
and then
gcc -ftree-parallelize-loops=80 -fexpensive-optimizations -mcpu=power8 -O3 mm.c && time ./a.out
The second command asks GCC to use 80 threads in loops that it can parallelize.
- Parallelize it using OpenMP! See Lecture 8 for hints.
- Finally, use the IBM parallelizing compiler without OpenMP directives. HOT means high-order transformations.
xlc -qarch=pwr8 -O5 -qsmp -qhot=level=2 mm.c && time ./a.out
As warned by xlc, the accuracy can be reduced with -O3 and higher. Sometimes that is dangerous, and at other times waiting ten days instead of one for an ocean weather prediction may be more dangerous.
- Optional: can you modify the OpenMP code to make it faster than the IBM code? Use any compiler and options.
Sat Oct 12 15:10:30 CEST 2019