# Multithreaded programming in ISO C

#### Contents of Lecture 6

- ISO C18
- Atomic objects (recall, object in ISO C terminology = "data")
- Memory consistency model for non-atomic objects
- Synchronize with operation
- Dependency ordered memory accesses
- Happens before relation and data races
- Motivation from the Linux kernel for dependency order
- Implementation on POWER
- Threads API in C11 (very similar to Pthreads)

- The current ISO C Standard is C18 and is essentially the same as C11 from December 2011.
- See http://www.open-std.org/jtc1/sc22/wg14
- C11 contains various news but we will focus on three in this course:
  - Section 5.1.2.4 Multi-threaded executions and data races
  - <threads.h> similar to Pthreads
  - <stdatomic.h> types, operators, and functions for atomic objects

- Atomic objects were introduced in C11.
- Atomic objects can safely be accessed without locking:

```
_Atomic int counter = 0;
Thread 1 Thread 2
counter++; counter++;
```

- Of course, you may need locks for other reasons to protect your data.
- Atomic objects have a global modification order and all threads see the same modification order.
- There is no total modification order of all atomic objects different threads can see the stores to different atomic objects in different orders.

- A new type qualifier: \_Atomic.
- The other type qualifiers are: const, volatile, restrict.
- Examples:
  - \_Atomic int a; // atomic int int\* \_Atomic \_\_Atomic int\* \_\_Atomic \_\_Atomic int\* \_\_Atomic \_\_Atomic \_\_Atomic \_\_Atomic \_\_Atomic struct { int d; } atomic pointer to atomic int \_\_\_\_\_\_e; // atomic struct
- In C++ it's written as atomic<int> a.
- For easier porting C allows: \_Atomic (*type-name*)

- Members of an atomic struct/union may not be accessed individually.
- The whole struct must first be copied to a non-atomic variable of compatible type.
- The ++, --, and compound assignment operators (e.g. +=) are atomic read-modify-write operations.
- The memory ordering when using these operators is sequential consistency, which means costly instructions are needed (the CPU needs to wait).

• <stdatomic.h> defines basic atomic types including

| atomic_char    | atomic_schar    | atomic_uchar     |
|----------------|-----------------|------------------|
| atomic_short   | atomic_sshort   | atomic_ushort    |
| atomic_int     | atomic_sint     | atomic_uint      |
| atomic_long    | atomic_slong    | atomic_ulong     |
| atomic_llong   | atomic_sllong   | atomic_ullong    |
| atomic_wchar_t | atomic_intptr_t | atomic_uintptr_t |
| atomic_address | atomic_bool     | atomic_flag      |

- atomic\_flag is a lock-free struct.
- On POWER its size is one byte.
- The other types may be implemented with locks (and be slower).
- An atomic flag can be initialized with ATOMIC\_FLAG\_INIT.

## Lock free property

• To know whether the other basic atomic types are lock free, the following macros can be evaluated:

ATOMIC\_CHAR\_LOCK\_FREE ATOMIC\_CHAR16\_T\_LOCK\_FREE ATOMIC\_CHAR32\_T\_LOCK\_FREE ATOMIC\_WCHAR\_T\_LOCK\_FREE ATOMIC\_SHORT\_LOCK\_FREE ATOMIC\_INT\_LOCK\_FREE ATOMIC\_LONG\_LOCK\_FREE ATOMIC\_LLONG\_LOCK\_FREE ATOMIC\_ADDRESS\_LOCK\_FREE

• If for example ATOMIC\_INT\_LOCK\_FREE is true then both \_Atomic signed int and \_Atomic unsigned int are lock-free.

## Initializing an atomic object

- A scalar type is any data not an array, struct (or union), which are aggregate types.
- Initializing some scalar variables:

\_Atomic int a = 1; static \_Atomic float b; // will be zero static \_Atomic int\* c; // will be NULL

• To initialize a struct or union, use atomic\_init

node\_t d;
\_Atomic node\_t e;

```
atomic_init(&e, d);
```

• C18 made the C11 macro ATOMIC\_VAR\_INIT obsolete and it may be removed from a future ISO C standard.

## Memory order for atomic operations

• The enumerated type memory\_order contains the enumerators

memory\_order\_relaxed
memory\_order\_consume
memory\_order\_acquire
memory\_order\_release
memory\_order\_acq\_rel
memory\_order\_seq\_cst

- They are used with the functions operating on atomic objects described next.
- For example:

x = atomic\_load\_explicit(&a, memory\_order\_relaxed); y = atomic\_load(&b);

- Without \_explicit, memory\_order\_seq\_cst is used.
- As we will see in detail the code may become faster and more confusing with memory\_order\_relaxed.

#### • Consider the code where all variables initially are zero.

```
// Thread 1
x = atomic_load_explicit(&b, memory_order_relaxed);
atomic_store_explicit(&a, x, memory_order_relaxed);
```

```
// Thread 2
y = atomic_load_explicit(&a, memory_order_relaxed);
atomic_store_explicit(&b, 42, memory_order_relaxed);
```

• Can both x and y become 42?

#### • The code may execute in the following order:

atomic\_store\_explicit(&b, 42, memory\_order\_relaxed); x = atomic\_load\_explicit(&b, memory\_order\_relaxed); atomic\_store\_explicit(&a, x, memory\_order\_relaxed); y = atomic\_load\_explicit(&a, memory\_order\_relaxed);

- // Thread 2
  // Thread 1
  // Thread 1
  // Thread 2
- With relaxed memory ordering and no dependency the store can be reordered and execute first.
- There is a dependency through the variable x between the accesses by Thread 1 so they may not be reordered, luckily.
- We will see more about dependences below.

## Atomic exchange functions

- A refers to an atomic type.
- C refers to the corresponding non-atomic type.
- The atomic exchange function writes a new value and returns the old value pointed to by ptr.
- O c atomic\_exchange\_explicit(volatile A\* ptr, C value, memory\_order order);

O atomic\_exchange(volatile A\* ptr, C value);

```
typedef struct {
    int e;
    int h;
} node_t;
_Atomic node_t a;
node_t b;
node_t c;
c = atomic_exchange(&a, b, memory_order_relaxed);
```

- gcc file.c -latomic (which looks for symbols in the atomic library)
- For a small type the function may be optimized away and -latomic would not be needed (but safer to use it)

- There are two versions: strong and weak.
- The weak may fail and must therefore be used in a loop.
- The functions compare the value at the location pointed to by ptr with an expected value and if they are equal writes a new value.
- The result of the comparison is returned.
- If the values are not equal, the current value is copied to expected, or actually to where the pointer expected points.
- The strong functions behave as:

```
if (*ptr == *expected)
    *ptr = value;
else
    *expected = *ptr;
```

### Atomic compare and exchange function prototypes

```
bool atomic_compare_exchange_strong_explicit(
0
           volatile A*
                            ptr,
           C*
                            expected,
           С
                            value,
           memory_order
                            success,
           memory_order
                            failure);
   bool atomic_compare_exchange_weak_explicit(
           volatile A*
                            ptr,
           C*
                            expected,
           С
                            value,
           memory_order
                            success,
                            failure);
           memory_order
0
   bool atomic_compare_exchange_strong(
           volatile A*
                            ptr,
                            expected,
           C*
           С
                            value):
0
   bool atomic_compare_exchange_weak(
           volatile A*
                            ptr,
                            expected,
           C*
                            value);
           С
```

• For the explicit functions, memory is affected according to parameters success and failure, depending on the result of the comparison.

## Weak compare and exchange functions

- The weak compare and exchange functions may fail spuriously, which means they fail to perform the compare and exchange and "give up".
- If so, the return value is guaranteed to be false, and the current value is **not** copied to what ptr points to.
- The weak forms allow faster implementation on machines with load-locked/load-linked/load-and-reserve/load-exclusive and store conditional instructions — instead of atomic compare and exchange.
- The load-locked type of instructions are described below but the idea is to split the atomic operation into two instructions.
- A processor *P* first performs a load-locked and then a store conditional.
- If a different processor *Q* performs a store between the load-locked and store conditional made by *P*, then the store conditional made by *P* fails.

- A spin lock has no waiting queue with sleeping threads
- Assume 0 means free and 1 means locked

| _Atomic | int | a;        |
|---------|-----|-----------|
| int     |     | expected; |
| int     |     | value;    |

value = 1; // want to lock it
do

expected = 0; // hope for unlocked
while (!atomic\_compare\_exchange\_weak(&a, &expected, value));

## Atomic fetch and modify functions

• These functions atomically read-compute-modify an atomic object.

| computation | C operator | function                  |
|-------------|------------|---------------------------|
| addition    | +          | atomic_fetch_add_explicit |
| subtraction | -          | atomic_fetch_sub_explicit |
| or          | l          | atomic_fetch_or_explicit  |
| xor         | ^          | atomic_fetch_xor_explicit |
| and         | &          | atomic_fetch_and_explicit |

• Plus the usual non-explicit functions, for example:

C atomic\_fetch\_add(volatile A\* ptr, M value); adds value to what A points to.

- If A is arithmetic then M is C and if it's atomic\_address then M is ptr\_diff\_t.
- The value of \*ptr before the operation is returned.

• The type atomic\_flag is the atomic type for the classic test-and-set operation.

```
bool atomic_flag_test_and_set(
        volatile atomic_flag* ptr);
```

```
bool atomic_flag_test_and_set_explicit(
    volatile atomic_flag* ptr,
    memory_order order);
```

- These functions set the flag to one if it was zero.
- If it already was one, it may be written again but that would be a bad implementation due to cache effects if multiple processors are waiting for the flag to be cleared.
- The old value is returned.

```
void atomic_flag_clear(
    volatile atomic_flag* ptr);
```

```
void atomic_flag_clear_explicit(
    volatile atomic_flag* ptr,
    memory_order order);
```

- These functions clear the flag.
- The order must be one of
  - memory\_order\_relaxed,
  - memory\_order\_release, or
  - memory\_order\_seq\_cst.

- Initially some in the standardization committee wanted to standardize on sequential consistency.
- Fortunately, C11 has standardized on a relaxed memory model for non-atomic objects.
- This is partly based on input from Linux kernel developers at IBM.
- Thus, to make an assignment visible to another thread requires synchronization between the writing and reading threads.
- There are three main kinds of synchronization and we will start with mutex unlock/lock.

# Mutex unlock/lock

• A mutex in C11 is called mtx\_t or use Pthreads mutex.

int x;

```
Thread 1 Thread 2
mtx_lock(&m);
x = 1;
mtx_unlock(&m);
```

```
mtx_lock(&m);
printf("x = %d\n", x);
mtx_unlock(&m);
```

- The unlock by Thread 1 and lock by Thread 2 make the write of x visible to Thread 2.
- A part of the mutex unlock is to perform a **release operation** and of a mutex lock to perform an **acquire operation**.
- The write by Thread 1 is said to **happen before** the read by Thread 2.

- Recall, a release makes previous writes visible to other threads.
- A release orders memory accesses so that no preceding write may be moved to after the release by the compiler or hardware.
- Recall an acquire makes writes by other threads that have made a release visible.
- An acquire orders memory accesses so that no subsequent read or write may be moved to before the acquire.
- A consume is similar to an acquire but it lets unrelated reads be moved by the compiler to before the consume. In addition it can be implemented faster on some machines including POWER.
- Release/acquire is different from unlock/lock but unlock/lock perform release/aqcuire in addition to keeping track of the lock variable.

- Two expression evaluations **conflict** if they access the same memory location and at least one of them modifies that location.
- For example:

- An assignment is an expression so the code above conflict.
- By evaluation is meant that the expressions are computed at runtime.
- Conflicting evaluations are necessary in parallel programs unless each thread can work exclusively on its own data.
- Conflicting evaluations become a big problem if they are not ordered through the happens before relation.

- It is the unlock/lock pair which guarantees that the write happens before the read and that the data becomes visible to the other thread.
- Since there is no synchronization in the previous slide there is a data race, obviously.
- An unlock of mutex *m* synchronizes with a lock of *m*.
- Recall that mutex unlock/lock is one of three main kinds of synchronizations which orders two variable accesses.
- The other two are:
  - Release on an atomic object M followed by acquire or consume on M
  - Memory fences.
- There are additional important details for atomic objects and fences which we will see later.

- An essential property of parallel C programs is that we have ordered all memory accesses through the happens before relation.
- If you use mutexes to order the accesses, you will be safe.
- Atomic objects and fences are intended for use when:
  - Mutexes don't give sufficient performance.
  - You implement other high level synchronization primitives.

## Sequence point and sequenced before

- The most common sequence point in C/C++ is semicolon.
- Others include:
  - Function call
  - Comma operator
  - After evaluating the left operand of ?:, && and ||.
- Below at L, the assignment to and use of v are *not* sequenced.

int u = 1, v = 2, w;

L: 
$$w = (v = u + 3) + v * 4;$$

- It is legal C but the value of w can become either 12 or 20.
- It is due to it is unspecified which operand of + is evaluated first.
- In the following w becomes 16 since comma is a sequence point.

w = (v = u + 3), v \* 4;

• The assignment to v is sequenced before the use of v.

- These are also called **memory barriers** and are used frequently in the Linux kernel.
- A fence has no memory location operand.
- A memory fence is one of:
  - release fence wait until previous writes by the CPU have completed
  - acquire fence wait until incoming invalidations have removed data from the CPU's cache
  - both release and acquire fence
- Recall an invalidation request is replied to directly and queued for being performed in typically multiple cache levels by a CPU/cache (L1 is smallest and e.g. L3 is largest)
- Fences alone are insufficient and an atomic object must be used as well

### An example

```
m = 0; // both T1 and T2 know m is initially zero.
_Atomic int
                        // unknown value for T2 initially.
int
                u;
                                                                         // T1
u = 42;
                                                                         // T1 A
atomic_thread_fence(memory_order_release);
                                                                         // T1 X
atomic_store_explicit(&m, 1, memory_order_relaxed);
        while (atomic_load_explicit(&m, memory_order_relaxed) == 0)
                                                                         // T2 Y
                                                                         // T2
                                                                         // T2 B
        atomic_thread_fence(memory_order_acquire);
                                                                         // T2
        printf("u = (d n), u);
```

- Accesses to atomic objects do not produce data races.
- Simply running the two threads does not order them in any way.
- Only if Thread 2 reads m after the write then it knows that the value of u is correct.
- The modification of u happens before the read of u.

# Synchronize-with using fences

- A release fence A synchronizes with an acquire fence B if there exist atomic operations X and Y which operate on an atomic object M and
- X is sequenced after A,
- Y is sequenced before B,
- Y reads a value written by X or the hypothetical release sequence headed by X if it were a release.

```
u = 1
A: release fence
X: atomic store M relaxed
```

- Y: atomic load M relaxed
  B: acquire fence
  print u
- A and X can be replaced with a release operation on M.
- Y and B can be replaced with an acquire operation on M.

```
Thread 1

u = 1

X: atomic store M relaxed

A: release fence

Y: atomic load M relaxed

B: acquire fence

print u
```

- Assignment to u and X can be reordered
- Thread 2 can read M and perform B
- Thread 2 can then read u
- Then the invalidation of u can reach Thread 2



- In Thread 2 reading u and Y can be reordered
- B can have been executed before A

## Dependences for consume

#### • So far: A synchronizes with B using

- mutex unlock/lock, and
- atomic variable possibly with fences
- Dependences are intra-thread (within one thread only).
- There are two kinds of dependences:
  - Data dependence due to writing and reading a memory location.
  - Control dependence due to branches such as if-statements.
- **Only** data dependences are considered in C11.
- As we will see, this can lead to unexpected results.
- When we talk about "dependences" from now on, we only mean data dependences.

- Consider two expression evaluations A and B made by one thread.
- When we say that A carries a dependency to B it means A must be evaluated before B:

| V | = | u | + | 1; | // A |
|---|---|---|---|----|------|
| W | = | V | * | 2; | // B |

- The order of A and B should not be changed!
- Both compilers and hardware must preserve this order and they do.
- A is **sequenced** before B since there is a sequence point between A and B.

#### • A carries a dependency to B if

- the value of A is used as an operand of B, or
- A writes a scalar object (pointer or arithmetic variable) or a bit-field in memory location M and B reads from M the value written by A, and A is sequenced before B
- For some evaluation X, A carries a dependence to X and X carries a dependency to B for transitivity.
- Carries a dependency is intra-thread.

- A release sequence is a maximal sequence of modifications of *M* headed by a release operation on *M* and performed by that thread or by other threads performing atomic read-modify-write accesses on *M*.
- Recall, a consume operation is like an acquire except it allows more optimizations on some machines, e.g. on the POWER it becomes an ordinary load instruction.
- An evaluation A in one thread is **dependency ordered before** an evaluation B in another thread if:
  - A performs a release operation on an atomic object M and B performs a consume operation on M and reads a value written by any side effect of A in the release sequence of A, or
  - for some evaluation X, A is dependency ordered before X and X carries a dependency to B (again, for transitivity).

### Inter-thread happens before

- An evaluation A inter-thread happens before an evaluation B if:
  - A synchronizes with B (mutex unlock/lock, or release/acquire on atomic variable), or
  - A is dependency ordered before B (release/consume), or
  - for some evaluation X:
    - A synchronizes with X and X is sequenced before B, or
    - A is sequenced before X and X synchronizes with B, or
    - A inter-thread happens before X and X inter-thread happens before B.
- An evaluation A happens before an evaluation B if:
  - A is sequenced before B (intra-thread), or
  - A is inter-thread happens before B.
- C11, Section 5.1.2.4 paragraph 25:

The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior.

#### • Improved performance!

- In programs (such as the Linux kernel) with important data structures which are rarely modified and very frequently read there exist faster solutions than using full release/acquire synchronization on modern architectures.
- In this sense modern architectures include POWER, MIPS and ARM.
- For other architectures including x86, optimizing compilers can make better optimizations if dependency ordering is used rather than release/acquire, as we will see below.

- RCU is an alternative to readers-writers locks with the following essential functionality:
  - Pointers to data structures are protected with RCU.
  - Readers use read-side critical sections marked by enter/exit calls.
  - When a reader is in such a section a writer may not modify the data structure but instead modifies a copy of it.
  - When the last reader has left, the original data structure is updated.
- RCU is heavily used in the Linux kernel.
- RCU was one part of the SCO vs IBM lawsuit.

# A simplified example

• Consider the following example code (originally from Linux)

```
list_t*
                 head;
int
                 b;
void insert(int a)
Ł
        list_t* p = kmalloc(sizeof *p, GFP_KERNEL);
        spin_lock(&m);
        p - a = 1;
        p->next = head;
        rcu_assign_pointer(head, p);
        spin_unlock(&m);
}
int f(void)
ſ
        list_t* q = rcu_dereference(head);
        return q \rightarrow a + b;
}
```

 The use of rcu\_dereference() must be done within an rcu\_read\_lock() / rcu\_read\_unlock() — not seen in this example.

```
int b;
int f(void)
{
    list_t* q = rcu_dereference(head);
    return q->a + b;
}
```

- Before the proposal for dependency ordering through release/consume, the then current draft would require the use of an acquire operation in the rcu\_dereference.
- Doing so prevents the compiler from loading b before the execution of the rcu\_dereference.
- A standard for a high performance language must permit extensive compiler optimization and efficient execution on modern machines.
- C11 succeeds with this also for multicores.

- Recall the intra-thread A carries a dependency to B.
- The compiler becomes responsible for not optimizing away such dependences.
- This might sound trivial but is not. See below.
- A dependency is started with a consume operation:

p = atomic\_load(q, memory\_order\_consume); a = \*p + b;

- A root of a dependency tree is created with the consume, and the memory read in \*p becomes ordered, as we expect.
- The compiler or hardware is not allowed to move the \*p to before the first line which would probably not make much sense, anyway.
- Behaviour is as we want and expect.

- Consider the following code and assume the compiler has deduced size is one:
  - i = atomic\_load(q, memory\_order\_consume); a = b[i % size];
- What happens if the compiler transforms the code to the following?

• There is now **no dependence** between the two statements!

a = \*b;

- Without a dependence the reads are not ordered.
- This is not what the programmer expected!!!
- Therefore, optimizing C compilers must analyse all dependences **before** any code transformation and preserve them.
- What does "preserve" mean, then?

# Constant propagation example 3(3)

- Preserving the data dependency means letting the hardware know about them.
- That can be done in different ways for different processors.
- One way is to insert instructions so that there will be a chain of dependences for the processor pipeline to see.
- Compilers thus must preserve such dependences in the complete program!
- Writing optimizing C compilers all of a sudden became twice as interesting with C11
- There is, however, a very simple first version implementation: treat all memory\_order\_consume as memory\_order\_acquire which automatically will order the memory accesses.

- a = atomic\_load(p, memory\_order\_consume); atomic\_store(q, a, memory\_order\_relaxed); atomic\_store(r, b, memory\_order\_relaxed);
- Since there is no data dependency between the consume and the second store, they are not ordered.
- As programmers we need to be very careful about what we expect to be ordered!

### Execution times

- Implementation on a specific architecture: POWER.
- Illustrations of what actually is ordered for several situations.
- We will begin with the POWER synchronization instructions.
- Approximate clock counts below are not fixed but depends on what is happening in the machine at the moment.
- Numbers from Christoph von Praun: "Deconstructing Redundant Memory Synchronization" (while at IBM Research).

| mnemonic       | name                                 | POWER4 cycles | POWER5 cycles |
|----------------|--------------------------------------|---------------|---------------|
| ldarw/stwxc.   | load and reserve / store conditional | 80            | 75            |
| hwsync or sync | heavy weight sync                    | 140           | 50            |
| lwsync         | light weight sync                    | 110           | 25            |
| isync          | instruction sync                     | 30            | 10            |
| eieio          | enforce inorder execution of I/O     | not measured  | not measured  |

- We want to use functions which use the less costly instructions!
- Such as the consume operation which is only an ordinary load instruction on POWER!

- Some processors use atomic test-and-set instructions while others use pairs of special load and store instructions.
- Load and reserve is also called load-locked and load-linked.
- In addition to POWER, it's used by ARM and MIPS.
- Test-and-set, or compare-and-swap, is used by e.g. x86.
- The purpose with load-and-reserve/store conditional is to simplify the design of the pipeline.

# Load and reserve/store conditional behaviour

- The load instruction fetches data from a memory location, and makes a reservation *R* in memory.
- If a different processor modifies the same memory location, the reservation *R* is lost.
- If/when the processor with a reservation makes a conditional store, the memory location is modified only if the reservation was not lost in between.
- Therefore it's an atomic read-modify-write.
- The stwcx. conditional store in POWER sets a condition code to indicate whether the store succeeded or not (the dot in the mnemonic indicates that the condition code is set by an operation on POWER).

- The POWER memory barrier instructions create a memory barrier with two sets of instructions:
  - the A-set with instructions  $a_i$  preceding the barrier, and
  - the *B*-set with instructions  $b_j$  following the barrier.
- Depending on which barrier instruction is used, some  $b_j$  instructions may be reordered with some  $a_i$  instructions.

### hwsync memory barrier for sequential consistency

- The A-set consists of all instructions preceding the hwsync.
- The *B*-set consists of memory access instructions following the hwsync.
- Except for the instruction icbi which invalidates an instruction cache block, no *B*-set instruction may be reordered with any *A*-set instruction.
- This is the most costly POWER synchronzation instruction.
- Not only for cached data but for any storage.
- It's used e.g. to implement sequentially consistent write on POWER:

```
stwx r1,r2,r3
hwsync
```

### lwsync memory barrier for release operation

- The A and B sets consist of all memory access instructions preceding and following the lwsync, respectively.
- Only the following pairs of  $a_i$ ,  $b_j$  instructions are ordered:
  - A-side load  $\rightarrow$  B-side load
  - $\bullet \ \ \mathsf{A}\text{-side load} \to \mathsf{B}\text{-side store}$
  - A-side store  $\rightarrow$  B-side store
- $\bullet\,$  Thus, A-side store  $\to$  B-side load are not ordered.
- The dcbz data cache block zero is counted as a store.
- It's used e.g. to implement a release:

| stwx<br>stwx       | r1,r2,r3<br>r4,r5,r6 | # modify shared data             |
|--------------------|----------------------|----------------------------------|
| <br>stwx<br>lwsync | r29,r30,r31          | # last write in critical section |
| stwx               | r8,r9,r10            | # set lock to free               |

### isync memory barrier for acquire operation

- The *A* and *B* sets consist of all instructions preceding and following the isync, respectively.
- An instruction which cannot raise an exception in the pipeline can be allowed to complete and instructions following the isync therefore actually execute without order.
- Consider the sequence:

| ldw   | r1,r2,r3 |
|-------|----------|
| isync |          |
| stw   | r4,r5,r6 |

- The store may execute before the load if the load had e.g. a cache miss.
- This is not what we want and to overcome that problem we can exploit that POWER does not permit speculative execution of store instructions.

### bc; isync memory barrier for acquire operation

- By inserting a conditional branch, bc, before the isync both the isync and stw become speculative until the branch outcome is known.
- We can use beq as follows:

| ldw   | r1,r2,r3 | # | r1 = MEMORY[r2+r3]                   |
|-------|----------|---|--------------------------------------|
| cmp.  | r1,r1    | # | certainly true but the beq must      |
| beq   |          | # | wait according to the specification  |
| isync |          | # | since no speculative stw is allowed. |
| stw   | r4,r5,r6 |   |                                      |

- This memory barrier is the fastest.
- Since the store may not execute speculatively it must wait for the branch outcome.
- It can guarantee that a set of load instructions (above only one load) have completed before the fence is executed.

- (unique five vowel instruction)
- Enforce in order execution of I/O.
- It only orders stores.

- If the memory accesses ordered by a memory barrier executed by one processor  $P_i$  also take into account memory accesses executed by other processors as described below the barrier is **cumulative**.
- By **applicable** storage accesses for a barrier is meant the storage access which are ordered by that barrier.
- Two rules:
  - The *A*-set also includes all applicable storage accesses made by other processors which have completed with respect to *P<sub>i</sub>* before the barrier is started.
  - The *B*-set also includes all applicable accesses made by any processor  $P_j$  after  $P_j$  has executed a load that returned a value stored by an instruction in the *B*-set.
- The next example will clarify this

# Example of cumulative ordering

- lwsync is cumulative
- bc; isync is not
- If Thread 2 reads 1 in access 3 and Thread 3 reads 2 in access 7 then Thread 3 will read 1 in access 9.
- Using instead bc; isync Thread 3 may read zero in access 9.
- We should not expect this to work with only bc;isync/acquire.

- The following is based on the ISO C/C++ standardization document ISO/EIC JTC1 SC22 WG21 N2745 written by Paul E McKenney and Raul Silvera from IBM.
- There are other implementations possible.

| Load relaxed | ld                        |
|--------------|---------------------------|
| Load consume | ld                        |
| Load acquire | ld; cmp; bc; isync        |
| Load seq cst | hwsync;ld; cmp; bc; isync |

- Using sequential consistency on machines with relaxed memory models makes it easier to write correct parallel code which will be slow.
- Even if your code is safety critical (when people can die due to bugs) and performance is not an issue, you should not use SC, use a single threaded C program instead.

| Store relaxed | st        |
|---------------|-----------|
| Store release | lwsync;st |
| Store seq cst | hwsync;st |

- It's easier to program under sequential consistency!
- Yes, but, why would you want to use synchronization instructions before every store in a critical section since nobody should look at the data before you have left the critical section anyway?
- Write buffering permitted by stores in relaxed memory models (C/Java) allows the machine to pipeline all the stores in the critical section.
- At the end of the critical section you use store release and wait for the remaining preceding stores (now possibly waiting in a write buffer) to complete (i.e. invalidate the other copies if there are any left).

| Acquire fence | lwsync |
|---------------|--------|
| Release fence | lwsync |
| Acq rel fence | lwsync |
| Seq cst fence | hwsync |

|       | # lock  |           |   |                                    |
|-------|---------|-----------|---|------------------------------------|
| loop: | lwarx   | r6,0,r3   | # | load lock and reserve              |
| -     | cmpw    | r4,r6     | # | r4 values means unlocked           |
|       | bne-    | loop      | # | restart if locked.                 |
|       | stwcx.  | r5,0,r3   | # | try to store                       |
|       | bne     | loop      | # | restart if store failed.           |
|       | isync   |           | # | store succeeded.                   |
|       | lwzx    | r7,r8,r9  | # | first load of shared data          |
|       | # unloc | Z         |   |                                    |
|       | stwx    | r7,r8,r10 | # | last store of shared data          |
|       | lwsync  |           | # | export shared data                 |
|       | stw     | r4,0,r3   | # | unlock the lock. same r4 as above. |

- C11 threads are similar to but simpler than Pthreads.
- thrd\_t thread type
- once\_flag a type for performing initializations exactly one time
- mtx\_t mutex type
- cnd\_t condition variable type
- tss\_t thread specific storage (not the same as \_Thread\_local)
  - \_Thread\_local int x; // keyword thread\_local int y; // available when including <threads.h>

- void call\_once(once\_flag\* flag, void (\*func)(void));
- The flag should be initialized with:

once\_flag flag = ONCE\_FLAG\_INIT;

• If multiple threads invoke call\_once with the same flag, the function will only be called one time, and the others will wait until the call to func returns.

- thrd\_success indicates an operation succeeded.
- thrd\_error indicates an operation failed but not why.
- thrd\_busy an operation failed due to a resource was already in use.
- thrd\_nomem an operation failed due to memory allocation failed.
- thrd\_timeout a timed wait operation timed out.

- mtx\_plain the mutex should support none of below options.
- mtx\_recursive set mutex to support recursive locking.
- mtx\_timed set mutex to support timed wait.
- mtx\_try set mutex to support test and return.

- The struct xtime contains at least the following members.
- time\_t sec;
- long nsec;
- They may be declared in any order in the struct.

```
int cnd_init(cnd_t* cond);
void cnd_destroy(cnd_t* cond);
int cnd_signal(cnd_t* cond);
int cnd_broadcast(cnd_t* cond);
int cnd_wait(cnd_t* cond, mtx_t* mtx);
int cnd_timedwait(cnd_t* cond, mtx_t* mtx, const xtime* xt);
```

 These functions are all similar to the corresponding Pthreads functions, except that Pthread condition variables can have certain attributes and be statically initialized.

```
int pthread_cond_init(
    pthread_cond_t* restrict cond,
    const pthread_condattr_t* restrict attr);
```

```
pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
```

# Mutex functions

```
int mtx_init(mtx_t* mtx, int type);
void mtx_destroy(mtx_t* mtx);
int mtx_lock(mtx_t* mtx);
int mtx_timedlock(mtx_t* mtx, const xtime* xt);
int mtx_trylock(mtx_t* mtx);
int mtx_unlock(mtx_t* mtx);
```

• The type should be one of:

```
mtx_plain
mtx_timed
mtx_try
mtx_plain | mtx_recursive
mtx_timed | mtx_recursive
mtx_try | mtx_recursive
```

- Only mtx\_trylock can return thrd\_busy.
- A prior call to mtx\_unlock synchronizes with a successful call to a mtx\_lock function.

```
int thrd_create(thrd_t* thr, int (*func)(void*), void* arg);
void thrd_exit(int res);
int thrd_join(thrd_t thr, int* res);
int thrd_detach(thrd_t thr);
thrd_t thrd_current(void);
int thrd_equal(thrd_t u, thrd_t v);
void thrd_sleep(const xtime* xt);
void thrd_yield(void);
```