In addition to finding memory errors, Valgrind can also be useful
when doing performance measurements. Valgrind is slower than the other
tools because it simulates the computer, so use the small input, at least
initially. Recompile with -O3 and give the command
valgrind --tool=cachegrind --I1=65536,1,128 --D1=32768,2,128 \
> --LL=1048576,8,128 ./a.out < i
These options specify the cache parameters to match our POWER8 machine.
How can you see how many instructions in total, and how many load and
store instructions, are executed?
What are the cache miss rates, and are the caches likely to be a performance
problem for this input?