The purpose of this lab is to learn more about the optimization options of
GCC and optimizing compilers in general.
We will use the Stanford benchmarks, which were used for the
MIPSX project in the beginning of the 1980s. The input sizes have been adapted for our POWER8 machine, however.
Online documentation about the optimization options of GCC is available
in the GCC manual, in the section "Options That Control Optimization".
Log in to power.cs.lth.se and start bash in your terminal (to make sure the scripts below work).
Copy the lab6 directory from
/usr/local/cs/edaf15/labs/lab6
and then go to your copy of lab6.
The first task in the lab is to determine which optimization levels have any additional
effect with gcc. Compile the program mipsx.c with different optimization levels and
save each relocatable file, using the command:
for x in s 0 1 2 3 4 5
do
gcc -c -O$x -o $x.o mipsx.c
done
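The for loop is equivalent to writing the following commands:
gcc -c -Os -o s.o mipsx.c
gcc -c -O0 -o 0.o mipsx.c
gcc -c -O1 -o 1.o mipsx.c
gcc -c -O2 -o 2.o mipsx.c
gcc -c -O3 -o 3.o mipsx.c
gcc -c -O4 -o 4.o mipsx.c
gcc -c -O5 -o 5.o mipsx.c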
Level zero is the same as no optimization.
To determine the size of the instructions in a file,
the command size is used:
size --common *.o
The size of the read-only section is printed under the heading text and
includes instructions and constants. Global variables that should be initialized to zero are listed under bss, which stands for "block started by symbol" (for historical reasons).
To see which variables bss refers to, type the command:
nm 0.o
Try to figure out what T, U, G (or D), and C mean.
The explanation is given in the manual page for nm. Type:
man nm
What does -Os mean? All files compiled at levels 3 to 5 should have identical sizes.
To check whether two files are identical, use the command:
diff file1 file2
If diff prints no output, the files are identical.
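Comparing the files pair by pair quickly gets tedious. The following is a minimal sketch of how the comparison could be scripted, assuming the relocatable files from the loop above are in the current directory. The helper name same_as is made up for this illustration. It uses cmp -s, which compares two files byte by byte and reports the result only through its exit status.

```shell
# Hypothetical helper: report whether each listed file is byte-identical
# to a reference file. cmp -s prints nothing and returns 0 on a match.
same_as() {
    ref=$1
    shift
    for f in "$@"
    do
        if cmp -s "$ref" "$f"
        then
            echo "$f: identical to $ref"
        else
            echo "$f: differs from $ref"
        fi
    done
}

# Usage in the lab directory (assumes 3.o, 4.o and 5.o exist):
# same_as 3.o 4.o 5.o
```

A file reported as differing at one level and identical at the next tells you where the optimization levels stop having any additional effect.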
Now run the following shell command:
for x in s 0 1 2 3
do
gcc -O$x -o $x mipsx.c tbr.s timebase.c
./$x
done
POWER processors have a special register called the timebase register, which can be used for very accurate timing. Reading it avoids making a system call to ask the operating system kernel what time it is, which is normally what happens when you use the ISO C clock function.
The files tbr.s and timebase.c are used for this.
Note that the POWER8 processor sometimes changes its
clock frequency, but that is usually not a big problem. You can check the current CPU frequencies using:
cat /proc/cpuinfo
Run your programs a few times to see whether the measured times appear to remain the same.
Then run the programs optimized at the different levels and
note whether they improve in speed.
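Since single measurements can vary, it helps to run every program several times in a row. A minimal sketch, assuming the executables s, 0, 1, 2 and 3 from the loop above exist and print their own timing; the helper name repeat_runs is made up for this illustration.

```shell
# Hypothetical helper: run each given program a fixed number of times so
# that variations between runs become visible.
repeat_runs() {
    runs=$1
    shift
    for prog in "$@"
    do
        echo "== $prog =="
        i=1
        while [ "$i" -le "$runs" ]
        do
            "$prog"
            i=$((i + 1))
        done
    done
}

# Usage in the lab directory (assumes the executables exist):
# repeat_runs 3 ./s ./0 ./1 ./2 ./3
```

If the times for one program differ a lot between runs, a changing clock frequency or other load on the machine is a likely cause, and more runs are needed before comparing levels.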
The remaining compilations should use -O3 and some other flags.
Let us start with enabling vectorization, by adding two flags: one that enables the SIMD (AltiVec) instructions of Power processors in general, such as -maltivec, and -mcpu=power8.
The former tells the compiler that it should produce code for a Power processor with SIMD instructions,
and the latter tells it specifically which processor to target, which in this case implies that it has SIMD instructions. Using
only the former can produce good code for different processors, while the latter will give better instruction
scheduling for power.cs.lth.se.
What is the effect?
To understand what happened, type the following:
objdump -d mipsx > x
This disassembles the program and writes the result to the file x. Edit that file and
search for stvx, which is the machine instruction for storing a vector
register to memory. Can you find any? Many? Can you see many other vector instructions?
Examples of other vector instructions are vmaddfp, vperm, and lvx.
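Instead of searching by hand, the occurrences can be counted. A minimal sketch, assuming the disassembly was written to the file x as above; the helper name count_insns is made up for this illustration. It relies on grep -c to count matching lines and grep -w to match whole words only, so that lvx does not also count lvxl.

```shell
# Hypothetical helper: for each instruction mnemonic, print how many
# lines of a disassembly file mention it as a whole word.
count_insns() {
    file=$1
    shift
    for insn in "$@"
    do
        printf '%s %s\n' "$insn" "$(grep -c -w -- "$insn" "$file")"
    done
}

# Usage in the lab directory (assumes x was produced by objdump -d):
# count_insns x stvx lvx vperm vmaddfp
```

Running it on the disassembly of a program compiled with and without the vectorization flags gives a quick picture of how much the compiler vectorized.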
Feedback-directed optimization uses statistics (usually about
branches) from previous executions and exploits that
information when optimizing a file.
Read about the options -fprofile-generate and -fprofile-use
online or on page 202 of the course book.
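A feedback-directed build typically has three steps: build an instrumented program, run it to collect statistics, and rebuild using those statistics. The following is a sketch of that sequence, not necessarily the exact commands intended for this lab. The source files are taken from the earlier compile command, while the output name mipsx-fdo and the helpers run and fdo_build are made up for this illustration.

```shell
# Print each command before running it. Set DRY_RUN=1 to only print.
run() {
    echo "+ $*"
    [ -n "${DRY_RUN:-}" ] || "$@"
}

# Hypothetical helper: the three steps of a feedback-directed build.
# The output name (an assumption) is the same in both builds so that the
# profile files written in step 2 have the names GCC looks for in step 3.
fdo_build() {
    # Step 1: build an instrumented program that records statistics.
    run gcc -O3 -fprofile-generate -o mipsx-fdo mipsx.c tbr.s timebase.c &&
    # Step 2: run it once; at exit it writes the profile data files.
    run ./mipsx-fdo &&
    # Step 3: rebuild, letting GCC read the recorded profile.
    run gcc -O3 -fprofile-use -o mipsx-fdo mipsx.c tbr.s timebase.c
}

# Usage in the lab directory:
# DRY_RUN=1 fdo_build    # only show the commands
# fdo_build              # run them
```

The profile only reflects the inputs used in step 2, so the optimized program is tuned for runs that behave like that one.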
Compile with:
Compile and run. These options do not give feedback to GCC but rather to you.
The command:
gcov mipsx.c
will produce a file mipsx.c.gcov with statistics about how many times
each line was executed. Adding -b will in addition print statistics
about branches, as explained on pages 222-223 in the book.
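For gcov to find its data, the program must first be compiled with coverage instrumentation and then run. The following is a sketch of one way to do this, not necessarily the exact commands intended for this lab. It uses the GCC option --coverage, which is shorthand for -fprofile-arcs -ftest-coverage when compiling and also links the gcov support library. The output name mipsx-cov and the helpers run and coverage_report are made up for this illustration.

```shell
# Print each command before running it. Set DRY_RUN=1 to only print.
run() {
    echo "+ $*"
    [ -n "${DRY_RUN:-}" ] || "$@"
}

# Hypothetical helper: build with coverage instrumentation, run, report.
coverage_report() {
    # Compile mipsx.c on its own so the notes file is named mipsx.gcno.
    run gcc --coverage -c mipsx.c &&
    # Link the instrumented object with the timing support files.
    run gcc --coverage -o mipsx-cov mipsx.o tbr.s timebase.c &&
    # Running the program writes the execution counts to mipsx.gcda.
    run ./mipsx-cov &&
    # Produce mipsx.c.gcov, including branch statistics (-b).
    run gcov -b mipsx.c
}

# Usage in the lab directory:
# DRY_RUN=1 coverage_report    # only show the commands
# coverage_report              # run them
```

Compiling mipsx.c separately is a deliberate choice here: it keeps the names of the data files independent of the name given with -o.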
A recent addition to GCC is link-time optimization.
It means that optimization is performed during linking, i.e. when all
relocatable files of a program are available. Issue the following commands:
cat a.c b.c
gcc -O3 a.c b.c
objdump -d a.out > x
The last command disassembles a.out and writes the result to the file x.
Then open the file x to see what the function main looks like.
Search for
main>:
Redo the same thing but add -flto to the gcc command. Check what main
was compiled to this time!
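To compare the two versions of main without scrolling through the whole disassembly, the lines belonging to one function can be cut out of the file. A minimal sketch, assuming function labels appear in the objdump output in the form "address <name>:"; the helper name show_func is made up for this illustration.

```shell
# Hypothetical helper: print the part of an objdump -d listing that
# belongs to one function, from its label up to the next label.
show_func() {
    awk -v name="$1" '
        # A label line looks like: 0000000010000540 <main>:
        /^[0-9a-f]+ <.*>:$/ { inside = ($2 == "<" name ">:") }
        inside
    ' "$2"
}

# Usage in the lab directory (assumes x was produced by objdump -d):
# show_func main x
```

Saving the output for the build without -flto and for the build with -flto in two files makes it possible to compare them with diff.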
Now take the matrix multiplication program clang-matmul.c and try to
get clang to vectorize the inner loop. Is there any array reference
which would make it difficult to put the elements in a vector register?
If so, what can you do about it? What else would be affected by that change?
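Clang can report which loops it vectorized and why it gave up on others, through its optimization remarks. A sketch of how that could be used here, assuming the flags -O3 and -mcpu=power8 are also appropriate for clang on the lab machine; the helpers run and vectorize_report are made up for this illustration.

```shell
# Print each command before running it. Set DRY_RUN=1 to only print.
run() {
    echo "+ $*"
    [ -n "${DRY_RUN:-}" ] || "$@"
}

# Hypothetical helper: compile the matrix multiplication and ask clang
# to explain its loop vectorization decisions.
#   -Rpass=loop-vectorize           loops that were vectorized
#   -Rpass-missed=loop-vectorize    loops that were not vectorized
#   -Rpass-analysis=loop-vectorize  the reason vectorization failed
vectorize_report() {
    run clang -O3 -mcpu=power8 -c \
        -Rpass=loop-vectorize \
        -Rpass-missed=loop-vectorize \
        -Rpass-analysis=loop-vectorize \
        clang-matmul.c
}

# Usage in the lab directory:
# DRY_RUN=1 vectorize_report    # only show the command
# vectorize_report              # run it
```

Rerunning the report after each change to the source shows whether the inner loop is now vectorized.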