The purpose of this lab is to learn more about the optimization options of
GCC and optimizing compilers in general.
We will use a benchmark suite called the Stanford benchmarks, which was used for the
MIPSX project in the early 1980s. The input sizes have been adapted for our POWER8 machine, however.
Online documentation about the optimization options of GCC is available
Log in to power.cs.lth.se and start bash in your terminal (to make sure the scripts below work).
Copy lab6 from
Go to lab6
The first task in the lab is to determine which optimization levels have any additional
effect with gcc. Compile the program mipsx.c with different optimization levels and
save each relocatable file, using the command:
for x in s 0 1 2 3 4 5
do
    gcc -c -O$x -o $x.o mipsx.c
done
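The for loop is equivalent to writing the following commands:
gcc -c -Os -o s.o mipsx.c
gcc -c -O0 -o 0.o mipsx.c
gcc -c -O1 -o 1.o mipsx.c
gcc -c -O2 -o 2.o mipsx.c
gcc -c -O3 -o 3.o mipsx.c
gcc -c -O4 -o 4.o mipsx.c
gcc -c -O5 -o 5.o mipsx.c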
Level zero is the same as no optimization.
To determine the size of the instructions in a file,
the command size is used:
size --common *.o
The size of the read-only section is printed under the heading text and
includes instructions and constants. Global variables that should be initialized to zero are listed under bss, which stands for block started by symbol (for historical reasons).
To see which variables bss refers to, type the command:
Try to figure out what T, U, G (or D), and C mean.
The explanation is given in the manual page for nm. Type:
man nm
What does -Os mean? All files compiled at levels 3 to 5 should have identical sizes.
To check whether two files are identical, use the command diff file1 file2.
If diff prints no output, the files are identical.
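The pairwise comparison can also be scripted. The following is a minimal sketch, assuming the object files 3.o, 4.o and 5.o produced by the loop above are in the current directory; report_identical is a hypothetical helper name and not part of the lab files.

```shell
# Hypothetical helper: report whether two files have identical contents.
# diff exits with status 0 only when it finds no differences, so a
# missing file is also reported as "differ".
report_identical() {
    if diff "$1" "$2" > /dev/null 2>&1
    then
        echo "$1 and $2 are identical"
    else
        echo "$1 and $2 differ"
    fi
}

# Compare the -O3 object file with the ones built at levels 4 and 5.
for x in 4 5
do
    report_identical 3.o $x.o
done
```

The same helper can be used to check which of the lower levels, if any, produce the same code.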
Now run the following shell command:
for x in s 0 1 2 3
do
    gcc -O$x -o $x mipsx.c tbr.s timebase.c
done
POWER processors have a special register called the timebase register, which can be used for very accurate timing. Reading it avoids making a system call to ask the operating system kernel what time it is, which normally is what happens when you use the ISO C clock function.
The files tbr.s and timebase.c are used for this. Note that the POWER8 processor sometimes changes its
clock frequency, but that is usually not a big problem.
You can check the current CPU frequencies using:
Run your programs a few times to see whether the measured time appears to remain the same.
Then run the programs optimized at the different levels and
note whether they improve in speed.
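Running all five executables by hand gets tedious. The following is a minimal sketch, assuming the executables s, 0, 1, 2 and 3 from the loop above are in the current directory and that each one prints its own timing; run_levels is a hypothetical helper name.

```shell
# Hypothetical helper: run each listed executable from the current
# directory and label its output with the optimization level.
run_levels() {
    for x in "$@"
    do
        echo "== -O$x =="
        if [ -x "./$x" ]
        then
            "./$x"
        else
            echo "./$x not found; build it first"
        fi
    done
}

run_levels s 0 1 2 3
```

Calling run_levels several times in a row gives a rough idea of how much the timings vary between runs.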
The remaining compilations should use -O3 and some other flags.
It is sometimes useful to tell the compiler which pipeline it should optimize the code for, for example with -mcpu=power8.
This has no effect with gcc on our machine, but the IBM xlc compiler can sometimes produce different code.
Try now to figure out if gcc could vectorize the mipsx.c program.
To understand what happened, type the following:
objdump -d mipsx > x
This disassembles the program and writes to the file x. Edit that file and
search for stvx which is the machine instruction for storing a vector
register to memory. Can you find any? Many? Can you see many other vector instructions?
Examples of other vector instructions are
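Instead of searching in an editor, the disassembly can also be searched with grep. The following is a minimal sketch, assuming the disassembly is in the file x as above; count_insn is a hypothetical helper name, and lvx (load vector indexed) is used here only as one further mnemonic to look for.

```shell
# Hypothetical helper: count_insn MNEMONIC FILE
# Counts the lines of FILE that contain MNEMONIC as a whole word,
# so that searching for stvx does not also count stvxl.
count_insn() {
    grep -c -w -- "$1" "$2"
}

if [ -f x ]
then
    for insn in stvx lvx
    do
        echo "$insn: $(count_insn "$insn" x)"
    done
fi
```

A count of zero for every vector mnemonic suggests that the loops were not vectorized.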
Feedback-directed optimization uses statistics (usually about
branches) from previous executions and exploits that
information when optimizing a file.
Read about the options -fprofile-generate and -fprofile-use
online or on page 208 of the course book.
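The usual workflow with these two options has three steps: build an instrumented executable, run it to record a profile, and rebuild using that profile. The following is a minimal sketch, assuming the same source files as in the earlier compilations; pgo_build is a hypothetical helper name and the output name mipsx is an arbitrary choice.

```shell
# Hypothetical helper: pgo_build OUTPUT SOURCE...
pgo_build() {
    out=$1
    shift
    # Step 1: build an instrumented executable.
    gcc -O3 -fprofile-generate -o "$out" "$@" || return 1
    # Step 2: run it so that it records a profile (.gcda files).
    "./$out"
    # Step 3: rebuild, letting GCC use the recorded profile.
    gcc -O3 -fprofile-use -o "$out" "$@"
}

if [ -f mipsx.c ]
then
    pgo_build mipsx mipsx.c tbr.s timebase.c
fi
```

The profile is only as representative as the run in step 2, so the instrumented program should be run with typical input.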
Compile and run. These options do not give feedback to GCC but rather to you.
Running gcov on the source file will produce a file mipsx.c.gcov with statistics about how many times
each line was executed. Adding -b will in addition print statistics
about branches, as explained on page 224 in the book.
A recent addition to GCC is link-time optimization.
It means that optimization is performed during linking, i.e. when all
relocatable files of a program are available. Issue the following commands:
cat a.c b.c
gcc -O3 a.c b.c
objdump -d a.out > x    # disassemble a.out and write the result to the file x
Then open the file x to see what the function main looks like.
Redo the same thing but add -flto to the gcc command. Check what main
is compiled to now!
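Since the disassembly of a whole program is long, it can help to print only main. The following is a minimal sketch, assuming objdump separates functions with blank lines and labels the function <main> (or <.main> on some ABIs); show_main and the file name x.lto are hypothetical names chosen for this example.

```shell
# Hypothetical helper: print the disassembly of main from a file
# produced by objdump -d, up to the blank line that ends the function.
show_main() {
    awk '/<\.?main>:/ { inside = 1 }
         inside && /^$/ { exit }
         inside { print }' "$1"
}

# Build without and with link-time optimization and show main for both.
if [ -f a.c ] && [ -f b.c ]
then
    gcc -O3 a.c b.c && objdump -d a.out > x
    gcc -O3 -flto a.c b.c && objdump -d a.out > x.lto
    show_main x
    show_main x.lto
fi
```

Keeping both disassemblies in separate files makes it possible to compare them with diff afterwards.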
Now take the matrix multiplication program clang-matmul.c and try to
get clang to vectorize the inner loop. Is there any array reference
which would make it difficult to put the elements in a vector register?
If so, what can you do about it? What else would be affected by that change?
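Clang can report which loops it vectorized and why it gave up on others. The following is a minimal sketch using clang's optimization remarks; the -O3 level is an assumption, and vec_remarks is a hypothetical helper name.

```shell
# Hypothetical helper: compile one file and print vectorization remarks.
#   -Rpass=loop-vectorize           loops that were vectorized
#   -Rpass-missed=loop-vectorize    loops that were not vectorized
#   -Rpass-analysis=loop-vectorize  why a loop was not vectorized
vec_remarks() {
    clang -O3 -c \
        -Rpass=loop-vectorize \
        -Rpass-missed=loop-vectorize \
        -Rpass-analysis=loop-vectorize \
        -o /dev/null "$1"
}

if [ -f clang-matmul.c ]
then
    vec_remarks clang-matmul.c
fi
```

The remarks refer to source line numbers, which makes it easier to see which loop and which array reference clang is commenting on.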