Have you tried passing -funroll-loops -fprefetch-loop-arrays
to GCC?
I get the following results with these additional optimizations:
[1829] /tmp/so_25078285 $ cat /proc/cpuinfo |grep CPU|head -n1
model name : Intel(R) Core(TM) i3-3225 CPU @ 3.30GHz
[1829] /tmp/so_25078285 $ g++ --version|head -n1
g++ (Ubuntu/Linaro 4.7.3-1ubuntu1) 4.7.3
[1829] /tmp/so_25078285 $ g++ -O3 -march=native -std=c++11 test.cpp -o test_o3
[1829] /tmp/so_25078285 $ g++ -O3 -march=native -funroll-loops -fprefetch-loop-arrays -std=c++11 test.cpp -o test_o3_unroll_loops__and__prefetch_loop_arrays
[1829] /tmp/so_25078285 $ ./test_o3 1
unsigned 41959360000 0.595 sec 17.6231 GB/s
uint64_t 41959360000 0.898626 sec 11.6687 GB/s
[1829] /tmp/so_25078285 $ ./test_o3_unroll_loops__and__prefetch_loop_arrays 1
unsigned 41959360000 0.618222 sec 16.9612 GB/s
uint64_t 41959360000 0.407304 sec 25.7443 GB/s