EDAN25 Multicore Programming Lab 5

The course home page is here.

The purpose of this lab is to understand C++ Atomics and compare its use with locking.

Log in to power.cs.lth.se.
Download the file lab5.zip, unpack, cd to it and type ./t
The program a.cc should compile and then abort due to wrong output.
The program has one producer and four consumers. The computation is to compute factorial for small integers and accumulate the sum of these into a variable sum. Of course, we need some synchronization...
Add a mutex variable and a condition-variable to the worklist_t class to protect it. Also protect the sum variable. Read the hint in the get method (and change the #if 0 to #if 1, or remove the #if/#endif pair).
Your program should work now, and it will time itself 10 times.
The Power architecture has a register timebase which can be used for fine-grained timing measurements. This register is read in tbr.s. How often this register is incremented is different for different Power implementations, but can be read on Linux in the "file" /proc/cpuinfo (files in that directory are not real files but are created by the kernel when somebody needs them). Due to cache and scheduling effects the execution time varies between runs, which is normal with many multithreaded programs.
Although most of the time is spent in the put and get functions, a first step in improving the performance in this program is to make the sum variable atomic.

Copy the a.cc file to b.cc. In general, make optimizations in new files and call them c.cc, d.cc, e.cc, etc. This will make performance comparison easier.
- Either edit the t file to specify a new default file,
- use ./tt b to benchmark the file b.cc, or
- use ./ttt a b c d to benchmark four files.
You can declare sum as follows:static std::atomic sum;
Did you notice any difference when using an atomic variable?
Copy b.cc to c.cc, and implement a spinlock with an atomic flag. A spinlock is a "busy wait" lock which is suitable in multicores when the expected waiting time is low.
To implement a spinlock class, you probably find the following useful:
- std::atomic_flag flag;
- flag.test_and_set(std::memory_order_acquire)
- flag.clear(std::memory_order_release)
- Use your spinlock instead of the mutex and condition variable.
Copy c.cc to d.cc.
- The purpose of this step is to find out which instructions are used with the sum += f. Introduce a volatile global int variable called for example VAR. Before the += write VAR ^= 1234 and after write VAR ^= 5678. Recompile. Type objdump -d a.out > x. Look in that file for instructions with xor 1234 and xor 5678. The increment is between them.
Copy c.cc to d.cc again and use instead sum.fetch_add(f, std::memory_order_relaxed).
Copy d.cc to e.cc and try to use other atomic operations to get the execution time below one second. A hint is to use flag.compare_exchange_weak with explicit memory ordering in the spinlock. To do that, you cannot use the type std::atomic_flag. Use instead for instance std::atomic flag.

Wed Oct 7 10:35:02 CEST 2015