Contents of Lecture 12

- Instruction Scheduling Basics
- List Scheduling
- Modulo Scheduling
The purpose of **instruction scheduling** is to improve performance by reducing the number of pipeline stalls suffered during execution.

The following example illustrates the concept, where the right column is the scheduled code.

Due to instructions only are scheduled within one basic block, only a limited improvement is achieved — the `fsub` and `stf` are not helped at all.

```
ldf  t2,a,t1
ldf  t3,b,t1
fadd t4,t2,t3
ldf  t5,c,t1
ldf  t6,d,t1
fadd t4,t2,t3
fsub t8,t3,t7
stf  t8,e,t1
```

```
ldf  t2,a,t1
ldf  t3,b,t1
ldf  t5,c,t1
ldf  t6,d,t1
fadd t4,t2,t3
fsub t8,t3,t7
stf  t8,e,t1
```
The goal of instruction scheduling is to reduce pipeline stall and this is achieved by separating the producer and consumer.

This separation makes it more difficult to perform register allocation.

**Question:** Which of instruction scheduling and register allocation should be performed first?

**Answer:** Instruction scheduling because register allocation would create unnecessary constraints for the scheduler, and advanced instruction scheduling would be seriously limited with already assigned registers.

If register allocation results in spill code, the instruction scheduler is usually run a second time in order to separate the load instructions from the uses of the loaded register.
Register Pressure of Different Schedules

The left schedule needs three floating point registers and the right four.

<table>
<thead>
<tr>
<th>Code</th>
<th>Code</th>
</tr>
</thead>
<tbody>
<tr>
<td>ldf f2,ra,ri</td>
<td>ldf f2,ra,ri</td>
</tr>
<tr>
<td>ldf f3,rb,ri</td>
<td>ldf f3,rb,ri</td>
</tr>
<tr>
<td>fadd f2,f2,f3</td>
<td>ldf f4,rc,ri</td>
</tr>
<tr>
<td>ldf f3,rc,ri</td>
<td>ldf f5,rd,ri</td>
</tr>
<tr>
<td>ldf f4,rd,ri</td>
<td>fadd f2,f2,f3</td>
</tr>
<tr>
<td>fmul f3,f3,f4</td>
<td>fmul f4,f4,f5</td>
</tr>
<tr>
<td>fsub f2,f2,f3</td>
<td>fsub f2,f2,f4</td>
</tr>
<tr>
<td>stf f2,re,ri</td>
<td>stf f2,re,ri</td>
</tr>
</tbody>
</table>
Consider the expression $a + b + c + d$.

How it must be evaluated depends on the data type and source language.

In C (and other languages) addition is left-associative which means the expression should be evaluated as $((a + b) + c) + d$.

Due to the sequential execution it takes at least three clock cycles (and more for floating point).

If the compiler knows that it can ignore the effects of overflow (either due to the type is unsigned or two’s complement representation is used), it can rewrite it as $(a + b) + (c + d)$.

On a superscalar processor two additions can be performed concurrently.

Programmers are **NOT** allowed to think they can ignore overflow other than for unsigned integers (if the computation still makes sense with the overflow, that is).
Data dependencies constrain how instructions can be scheduled.

Instruction scheduling is performed after translation from SSA Form and on low level code which is close to the final machine code.

In addition to the data dependence graph, dependencies due to scalar variables and accesses to memory through unknown addresses are used.
The most fundamental instruction scheduling technique is called list scheduling and schedules one basic block at a time.

First an instruction level data dependence graph is built.

Vertices are instructions and directed arcs constrain the scheduling.

The source vertex must execute before the target vertex.

List scheduling uses topological sorting.

Once the graph has been constructed the list scheduler maintains a set of candidate instructions.

Candidate instructions have no predecessor in the graph.
List Scheduler Goal

- The goal of the list scheduler is to minimize the number of clock cycles required to execute the basic block.
- As this problem is NP-complete, an approximation is found as follows: each vertex is assigned a priority in some way, and the highest priority vertex \( i \) of the candidates is scheduled next.
- Then any successor vertex \( s \) of \( i \) with no predecessor that has not yet been scheduled is moved to the set of candidates.
- This procedure is repeated until the set of candidates is empty.
- The interesting problem is to select the priority function using clever heuristics.
- Changing heuristics can change the execution time by several percent.
- In one version of the IBM C/C++/FORTRAN compiler each block was scheduled three times with different heuristics and the best schedule was used.
The instruction scheduler builds a graph based on the **definitions** and **uses** of storage resources, e.g. variables, registers, or all of memory.

<table>
<thead>
<tr>
<th>Attribute</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>def(r)</td>
<td>The instruction which most recently modified r while scanning backwards, or null (denoted ⊥).</td>
</tr>
<tr>
<td>uses(r)</td>
<td>The set of instructions which use the current value of r.</td>
</tr>
</tbody>
</table>
Instruction s defines resource r

procedure define_resource(r, s)

    if (def(r) ≠ ⊥) {
        add edge (s, def(r), OUTPUT)
        delete def(r) from candidates
    }

    def(r) ← s

    for each u ∈ uses(r) do {
        add edge (s, u, TRUE)
        delete u from candidates
    }

    uses(r) ← ∅

end
Instruction s uses resource r

**procedure** use_resource(r, s)

  add s to uses(r)

  **if** (def(r) ≠ ⊥ and def(r) ≠ s)

    delete def(r) from candidates

    add edge (s, def(r), ANTI)

**end**
procedure collect_candidates(v)

/* v is a basic block. */

for each resource r do {
  uses(r) ← ∅
  def(r) ← ⊥
}

candidates ← ∅
for each instruction s in v in reverse order do {
  for each resource r defined by s do
    define_resource(r, s)
  for each resource r used by s do
    use_resource(r, s)
  add s to candidates
}
end
procedure list_sched
    for each vertex \( v \) in \( G \) do
        collect_candidates(\( v \))
        cycle \( \leftarrow 0 \)
    while (candidates \( \neq \emptyset \)) do
        for each \( s \) \( \in \) candidates do
            update_earliest(\( s \))
            compute_delay(\( s \))
        max_delay_cand \( \leftarrow \emptyset \)
        earliest_cand \( \leftarrow \emptyset \)
        for each \( s \) \( \in \) candidates do
            if (delay(\( s \)) = max_delay)
                add \( S \) to max_delay_cand
            if (earliest(\( s \)) \( \leq \) cycle)
                add \( S \) to earliest_cand
            if (earliest_cand \( \neq \emptyset \))
                take \( s \) from earliest_cand using heuristics
            else
                take \( s \) from max_delay_cand using heuristics
        delete \( s \) from candidates
        schedule \( s \) as next statement in \( v \)
    cycle \( \leftarrow \) cycle + 1
end

- Incrementing the cycle after each scheduled instruction assumes a single-issue pipeline.
Modulo Scheduling

Consider the following loop and assume there are true dependencies from $A$ to $B$ and from $B$ to $C$.

```c
void h()
{
    int i;

    for (i = 0; i < 100; ++i) {
        A;
        B;
        C;
    }
}
```

Due to list scheduling only works with one basic block, it cannot improve this loop.

Such loops are of course quite common.
Modulo Scheduling the Loop

- Let us take instructions from three iterations and interleave them.
- First we need to execute instructions from the first two iterations in a prologue.

<table>
<thead>
<tr>
<th>cycle</th>
<th>i</th>
<th>ii</th>
<th>iii</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>A_0</td>
<td></td>
<td></td>
</tr>
<tr>
<td>1</td>
<td>B_0</td>
<td>A_1</td>
<td></td>
</tr>
<tr>
<td>2</td>
<td>C_0</td>
<td>B_1</td>
<td>A_2</td>
</tr>
<tr>
<td>3</td>
<td>A_3</td>
<td>C_1</td>
<td>B_2</td>
</tr>
<tr>
<td>4</td>
<td>B_3</td>
<td>A_4</td>
<td>C_2</td>
</tr>
<tr>
<td>5</td>
<td>C_3</td>
<td>B_4</td>
<td>A_5</td>
</tr>
<tr>
<td>6</td>
<td>A_6</td>
<td>C_4</td>
<td>B_5</td>
</tr>
<tr>
<td>7</td>
<td>B_6</td>
<td>A_7</td>
<td>C_5</td>
</tr>
<tr>
<td>8</td>
<td>C_6</td>
<td>B_7</td>
<td></td>
</tr>
<tr>
<td>9</td>
<td>C_7</td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

- Assume for illustration only 8 iterations are executed.
- For example $A_3$ denotes instruction $A$ in iteration 3.
- After a steady-state with $2 \times 3$ iterations there is an epilogue.
- Consider instruction $B_3$. While it waits for $A_3$, the CPU can also execute $C_1$ and $B_2$, assuming a pipelined superscalar CPU.
List Scheduled Execution

Each iteration is completed before the next starts.

The height of an iteration is the number of clock cycles it takes.
A new iteration is started before the current has completed.

We wish to start the next iteration as early as possible.

If we start the next iteration the same clock cycle, we need a multicore with one core per loop iteration.
The Initiation Interval

- The **initiation interval**, abbreviated II, is the number of clock cycles between the start of two iterations.
- The II is limited by:
  1. Data dependencies
  2. Available hardware resources
- A maximum II is determined by doing a normal list schedule of the loop body.
- A minimum II is computed from the available resources and required resources in the loop, and the data dependencies.
Performing Modulo Scheduling

- The modulo scheduler then tries to find the smallest $\Pi$ which results in a valid schedule, by trying each value of $\Pi$ starting from the minimum $\Pi$, and incrementing it by one.
- All variables defined before being used in each loop iterations are expanded to different variables for each iteration.
- Then the loop body is duplicated and adapted for the proper iteration.
- A prologue and an epilogue is also generated.
There are two data dependence analyzes done for modulo scheduling:

1. Instruction level — as we saw for list scheduling.
2. Loop level — as we saw in lecture F10.

There is one modification: in addition to the type (true, anti, or output), dependencies for modulo scheduling are of the form \((p, d)\), where \(p\) is the dependence distance (i.e. iteration difference) and \(d\) is the delay in clock cycles.

By delay is meant the time the instructions should be separated to avoid pipeline stalls.
Let $\sigma(v)$ denote the clock cycle a certain instruction $v$ is scheduled.

With a dependence $(p, d)$ from instruction $u$ to instruction $v$, and an initiation interval $II$, to avoid pipeline stalls we need to satisfy:

$$\sigma(v) - \sigma(u) + p \cdot II \geq d \quad (1)$$

Additionally there must be sufficient hardware resources available in each clock cycle.

Assume for simplicity an instruction only needs one resource each clock cycle (e.g. a certain stage in a pipelined functional unit).

Then for each clock cycle $i$ the instruction executes (counting from zero in e.g. in the instruction decode stage) there must be such a resource available in the clock cycle given by:

$$\sigma(v) + i \mod II \quad (2)$$

If no value for $\sigma(v)$ can be found which satisfies all constraints, the initiation interval must be increased, and the scheduling be repeated.
We have now seen the essential parts of the modulo scheduling.

There was in the 1980’s a debate regarding which hardware features were needed for efficient software pipelining (e.g. rotating register files).

The problem was solved completely in software by Monica Lam in her PhD thesis from Carnegie Mellon University.

Her algorithm is described in her book ”A Systolic Array Optimizing Compiler”, and is implemented in several compilers, including in SGI’s compiler (now called Open64).

An interesting study was performed that compared an optimal scheduler with the modulo scheduler in the SGI compiler, and concluded that their modulo scheduler almost always produced optimal code.