No doubt some of us would be interested in ways of identifying code that is problematic for the CPU's branch-predictor. The Valgrind tool cachegrind
has a branch-predictor simulator, enabled by using the --branch-sim=yes
flag. Running it over the examples in this question, with the number of outer loops reduced to 10000 and compiled with g++
, gives these results:
Sorted:
==32551== Branches: 656,645,130 ( 656,609,208 cond + 35,922 ind)
==32551== Mispredicts: 169,556 ( 169,095 cond + 461 ind)
==32551== Mispred rate: 0.0% ( 0.0% + 1.2% )
Unsorted:
==32555== Branches: 655,996,082 ( 655,960,160 cond + 35,922 ind)
==32555== Mispredicts: 164,073,152 ( 164,072,692 cond + 460 ind)
==32555== Mispred rate: 25.0% ( 25.0% + 1.2% )
Drilling down into the line-by-line output produced by cg_annotate
we see for the loop in question:
Sorted:
Bc Bcm Bi Bim
10,001 4 0 0 for (unsigned i = 0; i < 10000; ++i)
. . . . {
. . . . // primary loop
327,690,000 10,016 0 0 for (unsigned c = 0; c < arraySize; ++c)
. . . . {
327,680,000 10,006 0 0 if (data[c] >= 128)
0 0 0 0 sum += data[c];
. . . . }
. . . . }
Unsorted:
Bc Bcm Bi Bim
10,001 4 0 0 for (unsigned i = 0; i < 10000; ++i)
. . . . {
. . . . // primary loop
327,690,000 10,038 0 0 for (unsigned c = 0; c < arraySize; ++c)
. . . . {
327,680,000 164,050,007 0 0 if (data[c] >= 128)
0 0 0 0 sum += data[c];
. . . . }
. . . . }
This lets you easily identify the problematic line - in the unsorted version the if (data[c] >= 128)
line is causing 164,050,007 mispredicted conditional branches (Bcm
) under cachegrind's branch-predictor model, whereas it's only causing 10,006 in the sorted version.
Alternatively, on Linux you can use the performance counters subsystem to accomplish the same task, but with native performance using CPU counters.
perf stat ./sumtest_sorted
Sorted:
Performance counter stats for './sumtest_sorted':
11808.095776 task-clock # 0.998 CPUs utilized
1,062 context-switches # 0.090 K/sec
14 CPU-migrations # 0.001 K/sec
337 page-faults # 0.029 K/sec
26,487,882,764 cycles # 2.243 GHz
41,025,654,322 instructions # 1.55 insns per cycle
6,558,871,379 branches # 555.455 M/sec
567,204 branch-misses # 0.01% of all branches
11.827228330 seconds time elapsed
Unsorted:
Performance counter stats for './sumtest_unsorted':
28877.954344 task-clock # 0.998 CPUs utilized
2,584 context-switches # 0.089 K/sec
18 CPU-migrations # 0.001 K/sec
335 page-faults # 0.012 K/sec
65,076,127,595 cycles # 2.253 GHz
41,032,528,741 instructions # 0.63 insns per cycle
6,560,579,013 branches # 227.183 M/sec
1,646,394,749 branch-misses # 25.10% of all branches
28.935500947 seconds time elapsed
It can also do source code annotation with dissassembly.
perf record -e branch-misses ./sumtest_unsorted
perf annotate -d sumtest_unsorted
Percent | Source code & Disassembly of sumtest_unsorted
------------------------------------------------
...
: sum += data[c];
0.00 : 400a1a: mov -0x14(%rbp),%eax
39.97 : 400a1d: mov %eax,%eax
5.31 : 400a1f: mov -0x20040(%rbp,%rax,4),%eax
4.60 : 400a26: cltq
0.00 : 400a28: add %rax,-0x30(%rbp)
...
See the performance tutorial for more details.