In addition to finding memory errors, Valgrind can also be useful
when doing performance measurements. Valgrind is slower than the other
tools because it simulates the computer, so use the small input, at least
initially. Recompile with -O3 and give the command:
valgrind --tool=cachegrind --I1=65536,1,128 --D1=32768,2,128 \
> --LL=1048576,8,128 ./a.out < i
These options specify the cache parameters to match our POWER8 machine.
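Each cache option in the command has the form size,associativity,line size, with the sizes in bytes. As a minimal sketch of what those numbers imply, the following Python snippet computes the number of cache lines and sets for each cache. The function name cache_geometry and the dictionary are only illustrative and are not part of Valgrind.

```python
def cache_geometry(size_bytes, associativity, line_bytes):
    """Return (number of lines, number of sets) for a cache.

    Assumes the usual set-associative organisation:
    lines = size / line size, sets = lines / associativity.
    """
    lines = size_bytes // line_bytes
    sets = lines // associativity
    return lines, sets


# The values passed to --I1, --D1 and --LL in the command above,
# written as (size, associativity, line size).
caches = {
    "I1": (65536, 1, 128),
    "D1": (32768, 2, 128),
    "LL": (1048576, 8, 128),
}

for name, (size, assoc, line) in caches.items():
    lines, sets = cache_geometry(size, assoc, line)
    print(f"{name}: {size // 1024} KiB, {assoc}-way, "
          f"{lines} lines of {line} bytes in {sets} sets")
```

With associativity 1 the instruction cache is simulated as direct mapped, so every set holds exactly one line.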
How can you see how many instructions in total, and how many load and
store instructions, are executed?
What are the cache miss rates, and are the caches likely to be a performance
problem for this input?