Replacing a 32-bit loop counter with 64-bit introduces crazy performance deviations with _mm_popcnt_u64 on Intel CPUs

Question

I was looking for the fastest way to popcount large arrays of data. I encountered a very weird effect: changing the loop variable from unsigned to uint64_t made the performance drop by 50% on my PC.

The Benchmark

    #include <iostream>
    #include <chrono>
    #include <cstdint>
    #include <cstdlib>
    #include <x86intrin.h>

    int main(int argc, char* argv[]) {

        using namespace std;
        if (argc != 2) {
           cerr << "usage: array_size in MB" << endl;
           return -1;
        }

        uint64_t size = atol(argv[1])<<20;
        uint64_t* buffer = new uint64_t[size/8];
        char* charbuffer = reinterpret_cast<char*>(buffer);
        for (unsigned i=0; i<size; ++i)
            charbuffer[i] = rand()%256;

        uint64_t count,duration;
        chrono::time_point<chrono::system_clock> startP,endP;
        {
            startP = chrono::system_clock::now();
            count = 0;
            for( unsigned k = 0; k < 10000; k++){
                // Tight unrolled loop with unsigned
                for (unsigned i=0; i<size/8; i+=4) {
                    count += _mm_popcnt_u64(buffer[i]);
                    count += _mm_popcnt_u64(buffer[i+1]);
                    count += _mm_popcnt_u64(buffer[i+2]);
                    count += _mm_popcnt_u64(buffer[i+3]);
                }
            }
            endP = chrono::system_clock::now();
            duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
            cout << "unsigned\t" << count << '\t' << (duration/1.0E9) << " sec \t"
                 << (10000.0*size)/(duration) << " GB/s" << endl;
        }
        {
            startP = chrono::system_clock::now();
            count=0;
            for( unsigned k = 0; k < 10000; k++){
                // Tight unrolled loop with uint64_t
                for (uint64_t i=0;i<size/8;i+=4) {
                    count += _mm_popcnt_u64(buffer[i]);
                    count += _mm_popcnt_u64(buffer[i+1]);
                    count += _mm_popcnt_u64(buffer[i+2]);
                    count += _mm_popcnt_u64(buffer[i+3]);
                }
            }
            endP = chrono::system_clock::now();
            duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
            cout << "uint64_t\t"  << count << '\t' << (duration/1.0E9) << " sec \t"
                 << (10000.0*size)/(duration) << " GB/s" << endl;
        }

        // The buffer was allocated with new[], so it is released with delete[]
        // (the code as first posted called free(charbuffer) here).
        delete[] buffer;
    }

As you see, we create a buffer of random data, with the size being x megabytes where x is read from the command line. Afterwards, we iterate over the buffer and use an unrolled version of the x86 popcount intrinsic to perform the popcount. To get a more precise result, we do the popcount 10,000 times. We measure the times for the popcount. In the upper case, the inner loop variable is unsigned; in the lower case, the inner loop variable is uint64_t. I thought that this should make no difference, but the opposite is the case.

The (absolutely crazy) results

I compile it like this (g++ version: Ubuntu 4.8.2-19ubuntu1):

    g++ -O3 -march=native -std=c++11 test.cpp -o test

Here are the results on my Haswell Core i7-4770K CPU @ 3.50 GHz, running test 1 (so 1 MB random data):

    unsigned  41959360000  0.401554 sec  26.113 GB/s
    uint64_t  41959360000  0.759822 sec  13.8003 GB/s

As you see, the throughput of the uint64_t version is only half the one of the unsigned version! The problem seems to be that different assembly gets generated, but why? First, I thought of a compiler bug, so I tried clang++ (Ubuntu Clang version 3.4-1ubuntu3):

    clang++ -O3 -march=native -std=c++11 test.cpp -o test

Result: test 1

    unsigned  41959360000  0.398293 sec  26.3267 GB/s
    uint64_t  41959360000  0.680954 sec  15.3986 GB/s

So, it is almost the same result and is still strange. But now it gets super strange. I replace the buffer size that was read from input with a constant 1, so I change:

    uint64_t size = atol(argv[1]) << 20;

to

    uint64_t size = 1 << 20;

Thus, the compiler now knows the buffer size at compile time. Maybe it can add some optimizations! Here are the numbers for g++:

    unsigned  41959360000  0.509156 sec  20.5944 GB/s
    uint64_t  41959360000  0.508673 sec  20.6139 GB/s

Now, both versions are equally fast. However, the unsigned version got even slower! It dropped from 26 to 20 GB/s, thus replacing a non-constant by a constant value led to a deoptimization. Seriously, I have no clue what is going on here! But now to clang++ with the new version:

    unsigned  41959360000  0.677009 sec  15.4884 GB/s
    uint64_t  41959360000  0.676909 sec  15.4906 GB/s

Wait, what? Now, both versions dropped to the slow number of 15 GB/s. Thus, replacing a non-constant by a constant value even led to slow code in both cases for Clang!

I asked a colleague with an Ivy Bridge CPU to compile my benchmark. He got similar results, so it does not seem to be specific to Haswell. Because two compilers produce strange results here, it also does not seem to be a compiler bug. We do not have an AMD CPU here, so we could only test with Intel.

More madness, please!

Take the first example (the one with atol(argv[1])) and put a static before the variable, i.e.:

    static uint64_t size=atol(argv[1])<<20;

Here are my results in g++:

    unsigned  41959360000  0.396728 sec  26.4306 GB/s
    uint64_t  41959360000  0.509484 sec  20.5811 GB/s

Yay, yet another alternative. We still have the fast 26 GB/s with u32, but we managed to get u64 at least from the 13 GB/s to the 20 GB/s version! On my colleague's PC, the u64 version became even faster than the u32 version, yielding the fastest result of all. Sadly, this only works for g++; clang++ does not seem to care about static.

My question

Can you explain these results? Especially:

- How can there be such a difference between u32 and u64?
- How can replacing a non-constant by a constant buffer size trigger less optimal code?
- How can the insertion of the static keyword make the u64 loop faster? Even faster than the original code on my colleague's computer!

I know that optimization is a tricky territory; however, I never thought that such small changes can lead to a 100% difference in execution time and that small factors like a constant buffer size can again mix results totally. Of course, I always want to have the version that is able to popcount 26 GB/s. The only reliable way I can think of is to copy and paste the assembly for this case and use inline assembly. This is the only way I can get rid of compilers that seem to go mad on small changes. What do you think? Is there another way to reliably get the code with most performance?

The Disassembly

Here is the disassembly for the various results:

26 GB/s version from g++ / u32 / non-const bufsize:

    0x400af8:
    lea 0x1(%rdx),%eax
    popcnt (%rbx,%rax,8),%r9
    lea 0x2(%rdx),%edi
    popcnt (%rbx,%rcx,8),%rax
    lea 0x3(%rdx),%esi
    add %r9,%rax
    popcnt (%rbx,%rdi,8),%rcx
    add $0x4,%edx
    add %rcx,%rax
    popcnt (%rbx,%rsi,8),%rcx
    add %rcx,%rax
    mov %edx,%ecx
    add %rax,%r14
    cmp %rbp,%rcx
    jb 0x400af8

13 GB/s version from g++ / u64 / non-const bufsize:

    0x400c00:
    popcnt 0x8(%rbx,%rdx,8),%rcx
    popcnt (%rbx,%rdx,8),%rax
    add %rcx,%rax
    popcnt 0x10(%rbx,%rdx,8),%rcx
    add %rcx,%rax
    popcnt 0x18(%rbx,%rdx,8),%rcx
    add $0x4,%rdx
    add %rcx,%rax
    add %rax,%r12
    cmp %rbp,%rdx
    jb 0x400c00

15 GB/s version from clang++ / u64 / non-const bufsize:

    0x400e50:
    popcnt (%r15,%rcx,8),%rdx
    add %rbx,%rdx
    popcnt 0x8(%r15,%rcx,8),%rsi
    add %rdx,%rsi
    popcnt 0x10(%r15,%rcx,8),%rdx
    add %rsi,%rdx
    popcnt 0x18(%r15,%rcx,8),%rbx
    add %rdx,%rbx
    add $0x4,%rcx
    cmp %rbp,%rcx
    jb 0x400e50

20 GB/s version from g++ / u32 & u64 / const bufsize:

    0x400a68:
    popcnt (%rbx,%rdx,1),%rax
    popcnt 0x8(%rbx,%rdx,1),%rcx
    add %rax,%rcx
    popcnt 0x10(%rbx,%rdx,1),%rax
    add %rax,%rcx
    popcnt 0x18(%rbx,%rdx,1),%rsi
    add $0x20,%rdx
    add %rsi,%rcx
    add %rcx,%rbp
    cmp $0x100000,%rdx
    jne 0x400a68

15 GB/s version from clang++ / u32 & u64 / const bufsize:

    0x400dd0:
    popcnt (%r14,%rcx,8),%rdx
    add %rbx,%rdx
    popcnt 0x8(%r14,%rcx,8),%rsi
    add %rdx,%rsi
    popcnt 0x10(%r14,%rcx,8),%rdx
    add %rsi,%rdx
    popcnt 0x18(%r14,%rcx,8),%rbx
    add %rdx,%rbx
    add $0x4,%rcx
    cmp $0x20000,%rcx
    jb 0x400dd0

Interestingly, the fastest (26 GB/s) version is also the longest! It seems to be the only solution that uses lea. Some versions use jb to jump, others use jne. But apart from that, all versions seem to be comparable. I don't see where a 100% performance gap could originate from, but I am not too adept at deciphering assembly. The slowest (13 GB/s) version even looks very short and good. Can anyone explain this?

Lessons learned

No matter what the answer to this question will be, I have learned that in really hot loops every detail can matter, even details that do not seem to have any association with the hot code. I have never thought about what type to use for a loop variable, but as you see, such a minor change can make a 100% difference! Even the storage type of a buffer can make a huge difference, as we saw with the insertion of the static keyword in front of the size variable! In the future, I will always test various alternatives on various compilers when writing really tight and hot loops that are crucial for system performance.

The interesting thing is also that the performance difference is still so high although I have already unrolled the loop four times. So even if you unroll, you can still get hit by major performance deviations. Quite interesting.
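The two timed loops in the benchmark differ only in the type of the inner index. A minimal sketch that factors this out into one function template may make it easier to compare the assembly generated for the two instantiations. The names popcount_unrolled and IndexT are made up for illustration, and the GCC/Clang builtin __builtin_popcountll stands in for _mm_popcnt_u64 so that the snippet also compiles without -mpopcnt. Whether it reproduces the timings reported above is an assumption that would have to be measured.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of the original benchmark): the 4x unrolled
// inner loop from the question, with the index type as a template parameter.
// __builtin_popcountll is a GCC/Clang builtin used in place of
// _mm_popcnt_u64; with -mpopcnt or -march=native it is typically compiled
// to the same popcnt instruction.
template <typename IndexT>
std::uint64_t popcount_unrolled(const std::uint64_t* buffer, std::size_t words) {
    // Like the original loop, this sketch has no tail handling.
    assert(words % 4 == 0);
    std::uint64_t count = 0;
    for (IndexT i = 0; i < words; i += 4) {
        count += __builtin_popcountll(buffer[i]);
        count += __builtin_popcountll(buffer[i + 1]);
        count += __builtin_popcountll(buffer[i + 2]);
        count += __builtin_popcountll(buffer[i + 3]);
    }
    return count;
}
```

With this, popcount_unrolled<unsigned>(buffer, size / 8) and popcount_unrolled<std::uint64_t>(buffer, size / 8) would correspond to the two timed loops, and the index type becomes the only difference between them.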

User · Answer

Have you tried passing -funroll-loops -fprefetch-loop-arrays to GCC?

I get the following results with these additional optimizations:

    [1829] /tmp/so_25078285 $ cat /proc/cpuinfo |grep CPU|head -n1
    model name      : Intel(R) Core(TM) i3-3225 CPU @ 3.30GHz
    [1829] /tmp/so_25078285 $ g++ --version|head -n1
    g++ (Ubuntu/Linaro 4.7.3-1ubuntu1) 4.7.3

    [1829] /tmp/so_25078285 $ g++ -O3 -march=native -std=c++11 test.cpp -o test_o3
    [1829] /tmp/so_25078285 $ g++ -O3 -march=native -funroll-loops -fprefetch-loop-arrays -std=c++11     test.cpp -o test_o3_unroll_loops__and__prefetch_loop_arrays

    [1829] /tmp/so_25078285 $ ./test_o3 1
    unsigned        41959360000     0.595 sec       17.6231 GB/s
    uint64_t        41959360000     0.898626 sec    11.6687 GB/s

    [1829] /tmp/so_25078285 $ ./test_o3_unroll_loops__and__prefetch_loop_arrays 1
    unsigned        41959360000     0.618222 sec    16.9612 GB/s
    uint64_t        41959360000     0.407304 sec    25.7443 GB/s

User · Answer

This is not an answer, but feedback from a few compilers of 2021, on an Intel Coffee Lake 9900K.

With the Microsoft compiler (VS2019), toolset v142:

    unsigned        209695540000    1.8322 sec      28.6152 GB/s
    uint64_t        209695540000    3.08764 sec     16.9802 GB/s

With the Intel compiler 2021:

    unsigned        209695540000    1.70845 sec     30.688 GB/s
    uint64_t        209695540000    1.57956 sec     33.1921 GB/s

According to Mysticial's answer, the Intel compiler is aware of the false data dependency, but the Microsoft compiler is not.

For the Intel compiler, I used /QxHost (optimize for the CPU architecture of the host), /Oi (enable intrinsic functions) and #include <nmmintrin.h> instead of #include <immintrin.h>.

Full compile command:

    /GS /W3 /QxHost /Gy /Zi /O2 /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /Qipo /Zc:forScope /Oi /MD /Fa"x64\Release\" /EHsc /nologo /Fo"x64\Release\" //fprofile-instr-use "x64\Release\" /Fp"x64\Release\Benchmark.pch"

The decompiled (by IDA 7.5) assembly from ICC:

    int __cdecl main(int argc, const char **argv, const char **envp)
    {
      int v6; // er13
      _BYTE *v8; // rsi
      unsigned int v9; // edi
      unsigned __int64 i; // rbx
      unsigned __int64 v11; // rdi
      int v12; // ebp
      __int64 v13; // r14
      __int64 v14; // rbx
      unsigned int v15; // eax
      unsigned __int64 v16; // rcx
      unsigned int v17; // eax
      unsigned __int64 v18; // rcx
      __int64 v19; // rdx
      unsigned int v20; // eax
      int result; // eax
      std::ostream *v23; // rbx
      char v24; // dl
      std::ostream *v33; // rbx
      std::ostream *v41; // rbx
      __int64 v42; // rdx
      unsigned int v43; // eax
      int v44; // ebp
      __int64 v45; // r14
      __int64 v46; // rbx
      unsigned __int64 v47; // rax
      unsigned __int64 v48; // rax
      std::ostream *v50; // rdi
      char v51; // dl
      std::ostream *v58; // rdi
      std::ostream *v60; // rdi
      __int64 v61; // rdx
      unsigned int v62; // eax

      __asm
      {
        vmovdqa [rsp+98h+var_58], xmm8
        vmovapd [rsp+98h+var_68], xmm7
        vmovapd [rsp+98h+var_78], xmm6
      }
      if ( argc == 2 )
      {
        v6 = atol(argv[1]) << 20;
        _R15 = v6;
        v8 = operator new[](v6);
        if ( v6 )
        {
          v9 = 1;
          for ( i = 0i64; i < v6; i = v9++ )
            v8[i] = rand();
        }
        v11 = (unsigned __int64)v6 >> 3;
        v12 = 0;
        v13 = Xtime_get_ticks_0();
        v14 = 0i64;
        do
        {
          if ( v6 )
          {
            v15 = 4;
            v16 = 0i64;
            do
            {
              v14 += __popcnt(*(_QWORD *)&v8[8 * v16])
                   + __popcnt(*(_QWORD *)&v8[8 * v15 - 24])
                   + __popcnt(*(_QWORD *)&v8[8 * v15 - 16])
                   + __popcnt(*(_QWORD *)&v8[8 * v15 - 8]);
              v16 = v15;
              v15 += 4;
            }
            while ( v11 > v16 );
            v17 = 4;
            v18 = 0i64;
            do
            {
              v14 += __popcnt(*(_QWORD *)&v8[8 * v18])
                   + __popcnt(*(_QWORD *)&v8[8 * v17 - 24])
                   + __popcnt(*(_QWORD *)&v8[8 * v17 - 16])
                   + __popcnt(*(_QWORD *)&v8[8 * v17 - 8]);
              v18 = v17;
              v17 += 4;
            }
            while ( v11 > v18 );
          }
          v12 += 2;
        }
        while ( v12 != 10000 );
        _RBP = 100 * (Xtime_get_ticks_0() - v13);
        std::operator<<<std::char_traits<char>>(std::cout, "unsigned\t");
        v23 = (std::ostream *)std::ostream::operator<<(std::cout, v14);
        std::operator<<<std::char_traits<char>>_0(v23, v24);
        __asm
        {
          vmovq   xmm0, rbp
          vmovdqa xmm8, cs:__xmm@00000000000000004530000043300000
          vpunpckldq xmm0, xmm0, xmm8
          vmovapd xmm7, cs:__xmm@45300000000000004330000000000000
          vsubpd  xmm0, xmm0, xmm7
          vpermilpd xmm1, xmm0, 1
          vaddsd  xmm6, xmm1, xmm0
          vdivsd  xmm1, xmm6, cs:__real@41cdcd6500000000
        }
        v33 = (std::ostream *)std::ostream::operator<<(v23);
        std::operator<<<std::char_traits<char>>(v33, " sec \t");
        __asm
        {
          vmovq   xmm0, r15
          vpunpckldq xmm0, xmm0, xmm8
          vsubpd  xmm0, xmm0, xmm7
          vpermilpd xmm1, xmm0, 1
          vaddsd  xmm0, xmm1, xmm0
          vmulsd  xmm7, xmm0, cs:__real@40c3880000000000
          vdivsd  xmm1, xmm7, xmm6
        }
        v41 = (std::ostream *)std::ostream::operator<<(v33);
        std::operator<<<std::char_traits<char>>(v41, " GB/s");
        LOBYTE(v42) = 10;
        v43 = std::ios::widen((char *)v41 + *(int *)(*(_QWORD *)v41 + 4i64), v42);
        std::ostream::put(v41, v43);
        std::ostream::flush(v41);
        v44 = 0;
        v45 = Xtime_get_ticks_0();
        v46 = 0i64;
        do
        {
          if ( v6 )
          {
            v47 = 0i64;
            do
            {
              v46 += __popcnt(*(_QWORD *)&v8[8 * v47])
                   + __popcnt(*(_QWORD *)&v8[8 * v47 + 8])
                   + __popcnt(*(_QWORD *)&v8[8 * v47 + 16])
                   + __popcnt(*(_QWORD *)&v8[8 * v47 + 24]);
              v47 += 4i64;
            }
            while ( v47 < v11 );
            v48 = 0i64;
            do
            {
              v46 += __popcnt(*(_QWORD *)&v8[8 * v48])
                   + __popcnt(*(_QWORD *)&v8[8 * v48 + 8])
                   + __popcnt(*(_QWORD *)&v8[8 * v48 + 16])
                   + __popcnt(*(_QWORD *)&v8[8 * v48 + 24]);
              v48 += 4i64;
            }
            while ( v48 < v11 );
          }
          v44 += 2;
        }
        while ( v44 != 10000 );
        _RBP = 100 * (Xtime_get_ticks_0() - v45);
        std::operator<<<std::char_traits<char>>(std::cout, "uint64_t\t");
        v50 = (std::ostream *)std::ostream::operator<<(std::cout, v46);
        std::operator<<<std::char_traits<char>>_0(v50, v51);
        __asm
        {
          vmovq   xmm0, rbp
          vpunpckldq xmm0, xmm0, cs:__xmm@00000000000000004530000043300000
          vsubpd  xmm0, xmm0, cs:__xmm@45300000000000004330000000000000
          vpermilpd xmm1, xmm0, 1
          vaddsd  xmm6, xmm1, xmm0
          vdivsd  xmm1, xmm6, cs:__real@41cdcd6500000000
        }
        v58 = (std::ostream *)std::ostream::operator<<(v50);
        std::operator<<<std::char_traits<char>>(v58, " sec \t");
        __asm { vdivsd  xmm1, xmm7, xmm6 }
        v60 = (std::ostream *)std::ostream::operator<<(v58);
        std::operator<<<std::char_traits<char>>(v60, " GB/s");
        LOBYTE(v61) = 10;
        v62 = std::ios::widen((char *)v60 + *(int *)(*(_QWORD *)v60 + 4i64), v61);
        std::ostream::put(v60, v62);
        std::ostream::flush(v60);
        free(v8);
        result = 0;
      }
      else
      {
        std::operator<<<std::char_traits<char>>(std::cerr, "usage: array_size in MB");
        LOBYTE(v19) = 10;
        v20 = std::ios::widen((char *)&std::cerr + *((int *)std::cerr + 1), v19);
        std::ostream::put(std::cerr, v20);
        std::ostream::flush(std::cerr);
        result = -1;
      }
      __asm
      {
        vmovaps xmm6, [rsp+98h+var_78]
        vmovaps xmm7, [rsp+98h+var_68]
        vmovaps xmm8, [rsp+98h+var_58]
      }
      return result;
    }

and disassembly of main:

    .text:0140001000    .686p
    .text:0140001000    .mmx
    .text:0140001000    .model flat
    .text:0140001000
    .text:0140001000 ; ===========================================================================
    .text:0140001000
    .text:0140001000 ; Segment type: Pure code
    .text:0140001000 ; Segment permissions: Read/Execute
    .text:0140001000 _text           segment para public 'CODE' use64
    .text:0140001000    assume cs:_text
    .text:0140001000    ;org 140001000h
    .text:0140001000    assume es:nothing, ss:nothing, ds:_data, fs:nothing, gs:nothing
    .text:0140001000
    .text:0140001000 ; =============== S U B R O U T I N E =======================================
    .text:0140001000
    .text:0140001000
    .text:0140001000 ; int __cdecl main(int argc, const char **argv, const char **envp)
    .text:0140001000 main            proc near      ; CODE XREF: __scrt_common_main_seh+107↓p
    .text:0140001000                                ; DATA XREF: .pdata:ExceptionDir↓o
    .text:0140001000
    .text:0140001000 var_78          = xmmword ptr -78h
    .text:0140001000 var_68          = xmmword ptr -68h
    .text:0140001000 var_58          = xmmword ptr -58h
    .text:0140001000
    .text:0140001000    push    r15
    .text:0140001002    push    r14
    .text:0140001004    push    r13
    .text:0140001006    push    r12
    .text:0140001008    push    rsi
    .text:0140001009    push    rdi
    .text:014000100A    push    rbp
    .text:014000100B    push    rbx
    .text:014000100C    sub     rsp, 58h
    .text:0140001010    vmovdqa [rsp+98h+var_58], xmm8
    .text:0140001016    vmovapd [rsp+98h+var_68], xmm7
    .text:014000101C    vmovapd [rsp+98h+var_78], xmm6
    .text:0140001022    cmp     ecx, 2
    .text:0140001025    jnz     loc_14000113E
    .text:014000102B    mov     rcx, [rdx+8]    ; String
    .text:014000102F    call    cs:__imp_atol
    .text:0140001035    mov     r13d, eax
    .text:0140001038    shl     r13d, 14h
    .text:014000103C    movsxd  r15, r13d
    .text:014000103F    mov     rcx, r15        ; size
    .text:0140001042    call    ??_U@YAPEAX_K@Z ; operator new[](unsigned __int64)
    .text:0140001047    mov     rsi, rax
    .text:014000104A    test    r15d, r15d
    .text:014000104D    jz      short loc_14000106E
    .text:014000104F    mov     edi, 1
    .text:0140001054    xor     ebx, ebx
    .text:0140001056    mov     rbp, cs:__imp_rand
    .text:014000105D    nop     dword ptr [rax]
    .text:0140001060
    .text:0140001060 loc_140001060:    ; CODE XREF: main+6C↓j
    .text:0140001060    call    rbp ; __imp_rand
    .text:0140001062    mov     [rsi+rbx], al
    .text:0140001065    mov     ebx, edi
    .text:0140001067    inc     edi
    .text:0140001069    cmp     rbx, r15
    .text:014000106C    jb      short loc_140001060
    .text:014000106E
    .text:014000106E loc_14000106E:    ; CODE XREF: main+4D↑j
    .text:014000106E    mov     rdi, r15
    .text:0140001071    shr     rdi, 3
    .text:0140001075    xor     ebp, ebp
    .text:0140001077    call    _Xtime_get_ticks_0
    .text:014000107C    mov     r14, rax
    .text:014000107F    xor     ebx, ebx
    .text:0140001081    jmp     short loc_14000109F
    .text:0140001081 ; ---------------------------------------------------------------------------
    .text:0140001083    align 10h
    .text:0140001090
    .text:0140001090 loc_140001090:    ; CODE XREF: main+A2↓j
    .text:0140001090                   ; main+EC↓j ...
    .text:0140001090    add     ebp, 2
    .text:0140001093    cmp     ebp, 2710h
    .text:0140001099    jz      loc_140001184
    .text:014000109F
    .text:014000109F loc_14000109F:    ; CODE XREF: main+81↑j
    .text:014000109F    test    r13d, r13d
    .text:01400010A2    jz      short loc_140001090
    .text:01400010A4    mov     eax, 4
    .text:01400010A9    xor     ecx, ecx
    .text:01400010AB    nop     dword ptr [rax+rax+00h]
    .text:01400010B0
    .text:01400010B0 loc_1400010B0:    ; CODE XREF: main+E7↓j
    .text:01400010B0    popcnt  rcx, qword ptr [rsi+rcx*8]
    .text:01400010B6    add     rcx, rbx
    .text:01400010B9    lea     edx, [rax-3]
    .text:01400010BC    popcnt  rdx, qword ptr [rsi+rdx*8]
    .text:01400010C2    add     rdx, rcx
    .text:01400010C5    lea     ecx, [rax-2]
    .text:01400010C8    popcnt  rcx, qword ptr [rsi+rcx*8]
    .text:01400010CE    add     rcx, rdx
    .text:01400010D1    lea     edx, [rax-1]
    .text:01400010D4    xor     ebx, ebx
    .text:01400010D6    popcnt  rbx, qword ptr [rsi+rdx*8]
    .text:01400010DC    add     rbx, rcx
    .text:01400010DF    mov     ecx, eax
    .text:01400010E1    add     eax, 4
    .text:01400010E4    cmp     rdi, rcx
    .text:01400010E7    ja      short loc_1400010B0
    .text:01400010E9    test    r13d, r13d
    .text:01400010EC    jz      short loc_140001090
    .text:01400010EE    mov     eax, 4
    .text:01400010F3    xor     ecx, ecx
    .text:01400010F5    db      2Eh
    .text:01400010F5    nop     word ptr [rax+rax+00000000h]
    .text:01400010FF    nop
    .text:0140001100
    .text:0140001100 loc_140001100:    ; CODE XREF: main+137↓j
    .text:0140001100    popcnt  rcx, qword ptr [rsi+rcx*8]
    .text:0140001106    add     rcx, rbx
    .text:0140001109    lea     edx, [rax-3]
    .text:014000110C    popcnt  rdx, qword ptr [rsi+rdx*8]
    .text:0140001112    add     rdx, rcx
    .text:0140001115    lea     ecx, [rax-2]
    .text:0140001118    popcnt  rcx, qword ptr [rsi+rcx*8]
    .text:014000111E    add     rcx, rdx
    .text:0140001121    lea     edx, [rax-1]
    .text:0140001124    xor     ebx, ebx
    .text:0140001126    popcnt  rbx, qword ptr [rsi+rdx*8]
    .text:014000112C    add     rbx, rcx
    .text:014000112F    mov     ecx, eax
    .text:0140001131    add     eax, 4
    .text:0140001134    cmp     rdi, rcx
    .text:0140001137    ja      short loc_140001100
    .text:0140001139    jmp     loc_140001090
    .text:014000113E ; ---------------------------------------------------------------------------
    .text:014000113E
    .text:014000113E loc_14000113E:    ; CODE XREF: main+25↑j
    .text:014000113E    mov     rsi, cs:__imp_?cerr@std@@3V?$basic_ostream@DU?$char_traits@D@std@@@1@A ; std::ostream std::cerr
    .text:0140001145    lea     rdx, aUsageArraySize ; "usage: array_size in MB"
    .text:014000114C    mov     rcx, rsi        ; std::ostream *
    .text:014000114F    call    std::operator<<<std::char_traits<char>>
    .text:0140001154    mov     rax, [rsi]
    .text:0140001157    movsxd  rcx, dword ptr [rax+4]
    .text:014000115B    add     rcx, rsi
    .text:014000115E    mov     dl, 0Ah
    .text:0140001160    call    cs:__imp_?widen@?$basic_ios@DU?$char_traits@D@std@@@std@@QEBADD@Z ; std::ios::widen(char)
    .text:0140001166    mov     rcx, rsi
    .text:0140001169    mov     edx, eax
    .text:014000116B    call    cs:__imp_?put@?$basic_ostream@DU?$char_traits@D@std@@@std@@QEAAAEAV12@D@Z ; std::ostream::put(char)
    .text:0140001171    mov     rcx, rsi
    .text:0140001174    call    cs:__imp_?flush@?$basic_ostream@DU?$char_traits@D@std@@@std@@QEAAAEAV12@XZ ; std::ostream::flush(void)
    .text:014000117A    mov     eax, 0FFFFFFFFh
    .text:014000117F    jmp     loc_1400013E2
    .text:0140001184 ; ---------------------------------------------------------------------------
    .text:0140001184
    .text:0140001184 loc_140001184:    ; CODE XREF: main+99↑j
    .text:0140001184    call    _Xtime_get_ticks_0
    .text:0140001189    sub     rax, r14
    .text:014000118C    imul    rbp, rax, 64h ; 'd'
    .text:0140001190    mov     r14, cs:__imp_?cout@std@@3V?$basic_ostream@DU?$char_traits@D@std@@@1@A ; std::ostream std::cout
    .text:0140001197    lea     rdx, aUnsigned  ; "unsigned\t"
    .text:014000119E    mov     rcx, r14        ; std::ostream *
    .text:01400011A1    call    std::operator<<<std::char_traits<char>>
    .text:01400011A6    mov     rcx, r14
    .text:01400011A9    mov     rdx, rbx
    .text:01400011AC    call    cs:__imp_??6?$basic_ostream@DU?$char_traits@D@std@@@std@@QEAAAEAV01@_K@Z ; std::ostream::operator<<(unsigned __int64)
    .text:01400011B2    mov     rbx, rax
    .text:01400011B5    mov     rcx, rax        ; std::ostream *
    .text:01400011B8    call    std::operator<<<std::char_traits<char>>_0
    .text:01400011BD    vmovq   xmm0, rbp
    .text:01400011C2    vmovdqa xmm8, cs:__xmm@00000000000000004530000043300000
    .text:01400011CA    vpunpckldq xmm0, xmm0, xmm8
    .text:01400011CF    vmovapd xmm7, cs:__xmm@45300000000000004330000000000000
    .text:01400011D7    vsubpd  xmm0, xmm0, xmm7
    .text:01400011DB    vpermilpd xmm1, xmm0, 1
    .text:01400011E1    vaddsd  xmm6, xmm1, xmm0
    .text:01400011E5    vdivsd  xmm1, xmm6, cs:__real@41cdcd6500000000
    .text:01400011ED    mov     r12, cs:__imp_??6?$basic_ostream@DU?$char_traits@D@std@@@std@@QEAAAEAV01@N@Z ; std::ostream::operator<<(double)
    .text:01400011F4    mov     rcx, rbx
    .text:01400011F7    call    r12 ; std::ostream::operator<<(double) ; std::ostream::operator<<(double)
    .text:01400011FA    mov     rbx, rax
    .text:01400011FD    lea     rdx, aSec       ; " sec \t"
    .text:0140001204    mov     rcx, rax        ; std::ostream *
    .text:0140001207    call    std::operator<<<std::char_traits<char>>
    .text:014000120C    vmovq   xmm0, r15
    .text:0140001211    vpunpckldq xmm0, xmm0, xmm8
    .text:0140001216    vsubpd  xmm0, xmm0, xmm7
    .text:014000121A    vpermilpd xmm1, xmm0, 1
    .text:0140001220    vaddsd  xmm0, xmm1, xmm0
    .text:0140001224    vmulsd  xmm7, xmm0, cs:__real@40c3880000000000
    .text:014000122C    vdivsd  xmm1, xmm7, xmm6
    .text:0140001230    mov     rcx, rbx
    .text:0140001233    call    r12 ; std::ostream::operator<<(double) ; std::ostream::operator<<(double)
    .text:0140001236    mov     rbx, rax
    .text:0140001239    lea     rdx, aGbS       ; " GB/s"
    .text:0140001240    mov     rcx, rax        ; std::ostream *
    .text:0140001243    call    std::operator<<<std::char_traits<char>>
    .text:0140001248    mov     rax, [rbx]
    .text:014000124B    movsxd  rcx, dword ptr [rax+4]
    .text:014000124F    add     rcx, rbx
    .text:0140001252    mov     dl, 0Ah
    .text:0140001254    call    cs:__imp_?widen@?$basic_ios@DU?$char_traits@D@std@@@std@@QEBADD@Z ; std::ios::widen(char)
    .text:014000125A    mov     rcx, rbx
    .text:014000125D    mov     edx, eax
    .text:014000125F    call    cs:__imp_?put@?$basic_ostream@DU?$char_traits@D@std@@@std@@QEAAAEAV12@D@Z ; std::ostream::put(char)
    .text:0140001265    mov     rcx, rbx
    .text:0140001268    call    cs:__imp_?flush@?$basic_ostream@DU?$char_traits@D@std@@@std@@QEAAAEAV12@XZ ; std::ostream::flush(void)
    .text:014000126E    xor     ebp, ebp
    .text:0140001270    call    _Xtime_get_ticks_0
    .text:0140001275    mov     r14, rax
    .text:0140001278    xor     ebx, ebx
    .text:014000127A    jmp     short loc_14000128F
    .text:014000127A ; ---------------------------------------------------------------------------
    .text:014000127C    align 20h
    .text:0140001280
    .text:0140001280 loc_140001280:    ; CODE XREF: main+292↓j
    .text:0140001280                   ; main+2DB↓j ...
    .text:0140001280    add     ebp, 2
    .text:0140001283    cmp     ebp, 2710h
    .text:0140001289    jz      loc_14000131D
    .text:014000128F
    .text:014000128F loc_14000128F:    ; CODE XREF: main+27A↑j
    .text:014000128F    test    r13d, r13d
    .text:0140001292    jz      short loc_140001280
    .text:0140001294    xor     eax, eax
    .text:0140001296    db      2Eh
    .text:0140001296    nop     word ptr [rax+rax+00000000h]
    .text:01400012A0
    .text:01400012A0 loc_1400012A0:    ; CODE XREF: main+2D6↓j
    .text:01400012A0    xor     ecx, ecx
    .text:01400012A2    popcnt  rcx, qword ptr [rsi+rax*8]
    .text:01400012A8    add     rcx, rbx
    .text:01400012AB    xor     edx, edx
    .text:01400012AD    popcnt  rdx, qword ptr [rsi+rax*8+8]
    .text:01400012B4    add     rdx, rcx
    .text:01400012B7    xor     ecx, ecx
    .text:01400012B9    popcnt  rcx, qword ptr [rsi+rax*8+10h]
    .text:01400012C0    add     rcx, rdx
    .text:01400012C3    xor     ebx, ebx
    .text:01400012C5    popcnt  rbx, qword ptr [rsi+rax*8+18h]
    .text:01400012CC    add     rbx, rcx
    .text:01400012CF    add     rax, 4
    .text:01400012D3    cmp     rax, rdi
    .text:01400012D6    jb      short loc_1400012A0
    .text:01400012D8    test    r13d, r13d
    .text:01400012DB    jz      short loc_140001280
    .text:01400012DD    xor     eax, eax
    .text:01400012DF    nop
    .text:01400012E0
    .text:01400012E0 loc_1400012E0:    ; CODE XREF: main+316↓j
    .text:01400012E0    xor     ecx, ecx
    .text:01400012E2    popcnt  rcx, qword ptr [rsi+rax*8]
    .text:01400012E8    add     rcx, rbx
    .text:01400012EB    xor     edx, edx
    .text:01400012ED    popcnt  rdx, qword ptr [rsi+rax*8+8]
    .text:01400012F4    add     rdx, rcx
    .text:01400012F7    xor     ecx, ecx
    .text:01400012F9    popcnt  rcx, qword ptr [rsi+rax*8+10h]
    .text:0140001300    add     rcx, rdx
    .text:0140001303    xor     ebx, ebx
    .text:0140001305    popcnt  rbx, qword ptr [rsi+rax*8+18h]
    .text:014000130C    add     rbx, rcx
    .text:014000130F    add     rax, 4
    .text:0140001313    cmp     rax, rdi
    .text:0140001316    jb      short loc_1400012E0
    .text:0140001318    jmp     loc_140001280
    .text:014000131D ; ---------------------------------------------------------------------------
    .text:014000131D
    .text:014000131D loc_14000131D:    ; CODE XREF: main+289↑j
    .text:014000131D    call    _Xtime_get_ticks_0
    .text:0140001322    sub     rax, r14
    .text:0140001325    imul    rbp, rax, 64h ; 'd'
    .text:0140001329    mov     rdi, cs:__imp_?cout@std@@3V?$basic_ostream@DU?$char_traits@D@std@@@1@A ; std::ostream std::cout
    .text:0140001330    lea     rdx, aUint64T   ; "uint64_t\t"
    .text:0140001337    mov     rcx, rdi        ; std::ostream *
    .text:014000133A    call    std::operator<<<std::char_traits<char>>
    .text:014000133F    mov     rcx, rdi
    .text:0140001342    mov     rdx, rbx
    .text:0140001345    call    cs:__imp_??6?$basic_ostream@DU?$char_traits@D@std@@@std@@QEAAAEAV01@_K@Z ; std::ostream::operator<<(unsigned __int64)
    .text:014000134B    mov     rdi, rax
    .text:014000134E    mov     rcx, rax        ; std::ostream *
    .text:0140001351    call    std::operator<<<std::char_traits<char>>_0
    .text:0140001356    vmovq   xmm0, rbp
    .text:014000135B    vpunpckldq xmm0, xmm0, cs:__xmm@00000000000000004530000043300000
    .text:0140001363    vsubpd  xmm0, xmm0, cs:__xmm@45300000000000004330000000000000
    .text:014000136B    vpermilpd xmm1, xmm0, 1
    .text:0140001371    vaddsd  xmm6, xmm1, xmm0
    .text:0140001375    vdivsd  xmm1, xmm6, cs:__real@41cdcd6500000000
    .text:014000137D    mov     rcx, rdi
    .text:0140001380    call    r12 ; std::ostream::operator<<(double) ; std::ostream::operator<<(double)
    .text:0140001383    mov     rdi, rax
    .text:0140001386    lea     rdx, aSec       ; " sec \t"
    .text:014000138D    mov     rcx, rax        ; std::ostream *
    .text:0140001390    call    std::operator<<<std::char_traits<char>>
    .text:0140001395    vdivsd  xmm1, xmm7, xmm6
    .text:0140001399    mov     rcx, rdi
    .text:014000139C    call    r12 ; std::ostream::operator<<(double) ; std::ostream::operator<<(double)
    .text:014000139F    mov     rdi, rax
    .text:01400013A2    lea     rdx, aGbS       ; " GB/s"
    .text:01400013A9    mov     rcx, rax        ; std::ostream *
    .text:01400013AC    call    std::operator<<<std::char_traits<char>>
    .text:01400013B1    mov     rax, [rdi]
    .text:01400013B4    movsxd  rcx, dword ptr [rax+4]
    .text:01400013B8    add     rcx, rdi
    .text:01400013BB    mov     dl, 0Ah
    .text:01400013BD    call    cs:__imp_?widen@?$basic_ios@DU?$char_traits@D@std@@@std@@QEBADD@Z ; std::ios::widen(char)
    .text:01400013C3    mov     rcx, rdi
    .text:01400013C6    mov     edx, eax
    .text:01400013C8    call    cs:__imp_?put@?$basic_ostream@DU?$char_traits@D@std@@@std@@QEAAAEAV12@D@Z ; std::ostream::put(char)
    .text:01400013CE    mov     rcx, rdi
    .text:01400013D1    call    cs:__imp_?flush@?$basic_ostream@DU?$char_traits@D@std@@@std@@QEAAAEAV12@XZ ; std::ostream::flush(void)
    .text:01400013D7    mov     rcx, rsi        ; Block
    .text:01400013DA    call    cs:__imp_free
    .text:01400013E0    xor     eax, eax
    .text:01400013E2
    .text:01400013E2 loc_1400013E2:    ; CODE XREF: main+17F↑j
    .text:01400013E2    vmovaps xmm6, [rsp+98h+var_78]
    .text:01400013E8    vmovaps xmm7, [rsp+98h+var_68]
    .text:01400013EE    vmovaps xmm8, [rsp+98h+var_58]
    .text:01400013F4    add     rsp, 58h
    .text:01400013F8    pop     rbx
    .text:01400013F9    pop     rbp
    .text:01400013FA    pop     rdi
    .text:01400013FB    pop     rsi
    .text:01400013FC    pop     r12
    .text:01400013FE    pop     r13
    .text:0140001400    pop     r14
    .text:0140001402    pop     r15
    .text:0140001404    retn
    .text:0140001404 main            endp

Coffee Lake specification update: "POPCNT instruction may take longer to execute than expected".

User · Answer

TL;DR: Use __builtin intrinsics instead; they might happen to help.

I was able to make gcc 4.8.4 (and even 4.7.3 on gcc.godbolt.org) generate optimal code for this by using __builtin_popcountll, which uses the same assembly instruction but gets lucky and happens to make code that doesn't have an unexpectedly long loop-carried dependency because of the false dependency bug.

I am not 100% sure of my benchmarking code, but objdump output seems to share my views. I use some other tricks (++i vs i++) to make the compiler unroll the loop for me without any movl instruction (strange behaviour, I must say).

Results:

    Count: 20318230000  Elapsed: 0.411156 seconds  Speed: 25.503118 GB/s

Benchmarking code:

    #include <stdint.h>
    #include <stddef.h>
    #include <time.h>
    #include <stdio.h>
    #include <stdlib.h>

    uint64_t builtin_popcnt(const uint64_t* buf, size_t len){
      uint64_t cnt = 0;
      for(size_t i = 0; i < len; ++i){
        cnt += __builtin_popcountll(buf[i]);
      }
      return cnt;
    }

    int main(int argc, char** argv){
      if(argc != 2){
        printf("Usage: %s <buffer size in MB>\n", argv[0]);
        return -1;
      }
      uint64_t size = atol(argv[1]) << 20;
      uint64_t* buffer = (uint64_t*)malloc((size/8)*sizeof(*buffer));

      // Spoil copy-on-write memory allocation on *nix
      for (size_t i = 0; i < (size / 8); i++) {
        buffer[i] = random();
      }
      uint64_t count = 0;
      clock_t tic = clock();
      for(size_t i = 0; i < 10000; ++i){
        count += builtin_popcnt(buffer, size/8);
      }
      clock_t toc = clock();
      printf("Count: %lu\tElapsed: %f seconds\tSpeed: %f GB/s\n",
             count, (double)(toc - tic) / CLOCKS_PER_SEC,
             ((10000.0*size)/(((double)(toc - tic)*1e+9) / CLOCKS_PER_SEC)));
      return 0;
    }

Compile options:

    gcc --std=gnu99 -mpopcnt -O3 -funroll-loops -march=native bench.c -o bench

GCC version:

    gcc (Ubuntu 4.8.4-2ubuntu1~14.04.1) 4.8.4

Linux kernel version:

    3.19.0-58-generic

CPU information:

    processor       : 0
    vendor_id       : GenuineIntel
    cpu family      : 6
    model           : 70
    model name      : Intel(R) Core(TM) i7-4870HQ CPU @ 2.50 GHz
    stepping        : 1
    microcode       : 0xf
    cpu MHz         : 2494.226
    cache size      : 6144 KB
    physical id     : 0
    siblings        : 1
    core id         : 0
    cpu cores       : 1
    apicid          : 0
    initial apicid  : 0
    fpu             : yes
    fpu_exception   : yes
    cpuid level     : 13
    wp              : yes
    flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx rdtscp lm constant_tsc nopl xtopology nonstop_tsc eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm arat pln pts dtherm fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 invpcid xsaveopt
    bugs            :
    bogomips        : 4988.45
    clflush size    : 64
    cache_alignment : 64
    address sizes   : 36 bits physical, 48 bits virtual
    power management:

User · Answer

I can't give an authoritative answer, but I can provide an overview of a likely cause. This reference shows pretty clearly that for the instructions in the body of your loop there is a 3:1 ratio between latency and throughput. It also shows the effects of multiple dispatch. Since there are (give or take) three integer units in modern x86 processors, it's generally possible to dispatch three instructions per cycle.

So between peak pipeline and multiple dispatch performance and failure of these mechanisms, we have a factor of six in performance. It's pretty well known that the complexity of the x86 instruction set makes it quite easy for quirky breakage to occur. The document above has a great example:

"The Pentium 4 performance for 64-bit right shifts is really poor. 64-bit left shift as well as all 32-bit shifts have acceptable performance. It appears that the data path from the upper 32 bits to the lower 32 bit of the ALU is not well designed."

I personally ran into a strange case where a hot loop ran considerably slower on a specific core of a four-core chip (AMD, if I recall). We actually got better performance on a map-reduce calculation by turning that core off.

Here my guess is contention for integer units: the popcnt, loop counter, and address calculations can all just barely run at full speed with the 32-bit wide counter, but the 64-bit counter causes contention and pipeline stalls. Since there are only about 12 cycles total (potentially 4 cycles with multiple dispatch) per loop body execution, a single stall could reasonably affect run time by a factor of 2.

The change induced by using a static variable, which I'm guessing just causes a minor reordering of instructions, is another clue that the 32-bit code is at some tipping point for contention.

I know this is not a rigorous analysis, but it is a plausible explanation.
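The "12 cycles versus 4 cycles" figure above can be reproduced with a crude back-of-the-envelope model. The sketch below is only an illustration, not a pipeline simulator: the name cycles_per_iteration and its parameters are invented here, and it assumes that a loop body is limited either by its longest dependency chain (operations in the chain times latency) or by issue rate (total operations times reciprocal throughput), whichever is larger.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper: rough lower bound on cycles per loop iteration.
//   ops_in_chain     - operations that must execute one after another
//   total_ops        - all operations of this kind in the loop body
//   latency          - cycles until one result is available
//   recip_throughput - cycles between issuing two independent operations
double cycles_per_iteration(int ops_in_chain, int total_ops,
                            double latency, double recip_throughput) {
    assert(ops_in_chain >= 0 && ops_in_chain <= total_ops);
    const double latency_bound = ops_in_chain * latency;
    const double throughput_bound = total_ops * recip_throughput;
    // The slower of the two limits dominates.
    return std::max(latency_bound, throughput_bound);
}
```

With the latency of 3 and throughput of 1 quoted above, four popcnt instructions serialized into one chain give cycles_per_iteration(4, 4, 3.0, 1.0), which is 12, while four independent ones give cycles_per_iteration(1, 4, 3.0, 1.0), which is 4. Real hardware has more constraints (ports, loads, the loop branch), so treat the result as an estimate only.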

User · Answer

OK, I want to provide a small answer to one of the sub-questions the OP asked that doesn't seem to be addressed in the existing answers. Caveat: I have not done any testing, code generation, or disassembly; I just wanted to share a thought for others to possibly expound upon.

Why does the static change the performance?

The line in question: uint64_t size = atol(argv[1])<<20;

Short answer: I would look at the assembly generated for accessing size and see if there are extra steps of pointer indirection involved for the non-static version.

Long answer: Since there is only one copy of the variable whether it was declared static or not, and the size doesn't change, I theorize that the difference is the location of the memory used to back the variable, along with where it is used in the code further down.

To start with the obvious, remember that all local variables (along with parameters) of a function are given space on the stack for use as storage. Now, obviously, the stack frame for main() never cleans up and is only generated once. What about making it static? In that case the compiler knows to reserve space in the global data space of the process, so the location cannot be cleared by the removal of a stack frame. But still, we only have one location, so what is the difference? I suspect it has to do with how memory locations on the stack are referenced.

When the compiler is generating the symbol table, it just makes an entry for a label along with relevant attributes, like size. It knows that it must reserve the appropriate space in memory, but it doesn't actually pick that location until somewhat later in the process, after doing liveness analysis and possibly register allocation. How then does the linker know what address to provide to the machine code for the final assembly code? It either knows the final location or knows how to arrive at the location. With a stack, it is pretty simple to refer to a location based on two elements: the pointer to the stack frame and then an offset into the frame. This is basically because the linker can't know the location of the stack frame before runtime.

User · Answer

Have you tried moving the reduction step outside the loop? Right now you have a data dependency that really isn't needed.

Try:

    uint64_t subset_counts[4] = {};
    for (unsigned k = 0; k < 10000; k++) {
        // Tight unrolled loop with unsigned
        unsigned i = 0;
        while (i < size/8) {
            subset_counts[0] += _mm_popcnt_u64(buffer[i]);
            subset_counts[1] += _mm_popcnt_u64(buffer[i+1]);
            subset_counts[2] += _mm_popcnt_u64(buffer[i+2]);
            subset_counts[3] += _mm_popcnt_u64(buffer[i+3]);
            i += 4;
        }
    }
    count = subset_counts[0] + subset_counts[1] + subset_counts[2] + subset_counts[3];

You also have some weird aliasing going on, which I'm not sure conforms to the strict aliasing rules.

User · Answer

I coded up an equivalent C program to experiment, and I can confirm this strange behaviour. What's more, gcc believes the 64-bit integer (which should probably be a size_t anyway) to be better, as using uint_fast32_t causes gcc to use a 64-bit uint.

I did a bit of mucking around with the assembly: simply take the 32-bit version and replace all 32-bit instructions/registers with the 64-bit version in the inner popcount loop of the program. Observation: the code is just as fast as the 32-bit version!

This is obviously a hack, as the size of the variable isn't really 64 bit and other parts of the program still use the 32-bit version, but as long as the inner popcount loop dominates performance, this is a good start.

I then copied the inner loop code from the 32-bit version of the program, hacked it up to be 64 bit, and fiddled with the registers to make it a replacement for the inner loop of the 64-bit version. This code also runs as fast as the 32-bit version.

My conclusion is that this is bad instruction scheduling by the compiler, not an actual speed/latency advantage of 32-bit instructions.

(Caveat: I hacked up assembly and could have broken something without noticing. I don't think so.)

User · Answer

I tried this with Visual Studio 2013 Express, using a pointer instead of an index, which sped up the process a bit. I suspect this is because the addressing is offset + register, instead of offset + register + (register<<3).

C++ code:

    uint64_t* bfrend = buffer+(size/8);
    uint64_t* bfrptr;

    // ...

    {
        startP = chrono::system_clock::now();
        count = 0;
        for (unsigned k = 0; k < 10000; k++){
            // Tight unrolled loop with uint64_t
            for (bfrptr = buffer; bfrptr < bfrend;){
                count += __popcnt64(*bfrptr++);
                count += __popcnt64(*bfrptr++);
                count += __popcnt64(*bfrptr++);
                count += __popcnt64(*bfrptr++);
            }
        }
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
        cout << "uint64_t\t"  << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }

Assembly code (r10 = bfrptr, r15 = bfrend, rsi = count, rdi = buffer, r13 = k):

    $LL5@main:
            mov     r10, rdi
            cmp     rdi, r15
            jae     SHORT $LN4@main
            npad    4
    $LL2@main:
            mov     rax, QWORD PTR [r10+24]
            mov     rcx, QWORD PTR [r10+16]
            mov     r8, QWORD PTR [r10+8]
            mov     r9, QWORD PTR [r10]
            popcnt  rdx, rax
            popcnt  rax, rcx
            add     rdx, rax
            popcnt  rax, r8
            add     r10, 32
            add     rdx, rax
            popcnt  rax, r9
            add     rsi, rax
            add     rsi, rdx
            cmp     r10, r15
            jb      SHORT $LL2@main
    $LN4@main:
            dec     r13
            jne     SHORT $LL5@main

User · Answer

This is not an answer, but it's hard to read if I put the results in a comment.

I get these results with a Mac Pro (Westmere 6-core Xeon, 3.33 GHz). I compiled it with clang -O3 -msse4 -lstdc++ a.cpp -o a (-O2 gets the same result).

clang with uint64_t size=atol(argv[1])<<20;

    unsigned    41950110000 0.811198 sec    12.9263 GB/s
    uint64_t    41950110000 0.622884 sec    16.8342 GB/s

clang with uint64_t size=1<<20;

    unsigned    41950110000 0.623406 sec    16.8201 GB/s
    uint64_t    41950110000 0.623685 sec    16.8126 GB/s

I also tried to:

1. Reverse the test order. The result is the same, so it rules out the cache factor.
2. Have the for statement in reverse: for (uint64_t i=size/8;i>0;i-=4). This gives the same result and proves the compiler is smart enough not to divide size by 8 every iteration (as expected).

Here is my wild guess. The speed factor comes in three parts:

- Code cache: the uint64_t version has larger code size, but this does not have an effect on my Xeon CPU. This makes the 64-bit version slower.
- Instructions used: note that not only the loop count, but also the buffer is accessed with a 32-bit and a 64-bit index in the two versions. Accessing a pointer with a 64-bit offset requires a dedicated 64-bit register and addressing, while you can use an immediate for a 32-bit offset. This may make the 32-bit version faster.
- Instructions that are only emitted on the 64-bit compile (that is, prefetch). This makes 64-bit faster.

The three factors together match the observed, seemingly conflicting results.

User · Answer

First of all, try to estimate peak performance: examine https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf, in particular Appendix C.

In your case, it's table C-10 that shows the POPCNT instruction has latency = 3 clocks and throughput = 1 clock. Throughput shows your maximal rate in clocks (multiply by core frequency and 8 bytes in the case of popcnt64 to get your best possible bandwidth number).

Now examine what the compiler did and sum up the throughputs of all other instructions in the loop. This will give the best possible estimate for the generated code.

At last, look at data dependencies between instructions in the loop, as they will force a latency-sized delay instead of a throughput-sized one. So split the instructions of a single iteration into data flow chains, calculate the latency across each, then naively pick the maximum. It will give a rough estimate that takes data flow dependencies into account.

However, in your case, just writing code the right way would eliminate all these complexities. Instead of accumulating to the same count variable, just accumulate to different ones (like count0, count1, ... count8) and sum them up at the end. Or even create an array of counts[8] and accumulate to its elements; perhaps it will even be vectorized and you will get much better throughput.

P.S. And never run a benchmark for a second. First warm up the core, then run the loop for at least 10 seconds, or better 100 seconds; otherwise, you will test the power management firmware and the DVFS implementation in hardware :)

P.P.S. I have heard endless debates on how long a benchmark should really run. The smartest folks are even asking why 10 seconds and not 11 or 12. I should admit this is funny in theory. In practice, you just go and run the benchmark a hundred times in a row and record deviations. That IS funny. Most people change the source and run the bench after that exactly ONCE to capture a new performance record. Do the right things right.

Not convinced still? Just use the above C version of the benchmark by assp1r1n3 (https://stackoverflow.com/a/37026212/9706746) and try 100 instead of 10000 in the retry loop.

My 7960X shows, with RETRY=100:

    Count: 203182300    Elapsed: 0.008385 seconds   Speed: 12.505379 GB/s
    Count: 203182300    Elapsed: 0.011063 seconds   Speed: 9.478225 GB/s
    Count: 203182300    Elapsed: 0.011188 seconds   Speed: 9.372327 GB/s
    Count: 203182300    Elapsed: 0.010393 seconds   Speed: 10.089252 GB/s
    Count: 203182300    Elapsed: 0.009076 seconds   Speed: 11.553283 GB/s

with RETRY=10000:

    Count: 20318230000  Elapsed: 0.661791 seconds   Speed: 15.844519 GB/s
    Count: 20318230000  Elapsed: 0.665422 seconds   Speed: 15.758060 GB/s
    Count: 20318230000  Elapsed: 0.660983 seconds   Speed: 15.863888 GB/s
    Count: 20318230000  Elapsed: 0.665337 seconds   Speed: 15.760073 GB/s
    Count: 20318230000  Elapsed: 0.662138 seconds   Speed: 15.836215 GB/s

P.P.P.S. Finally, on the "accepted answer" and other mystery ;-)

Let's use assp1r1n3's answer: he has a 2.5 GHz core. POPCNT has 1 clock throughput, and his code is using 64-bit popcnt. So the math is 2.5 GHz * 1 clock * 8 bytes = 20 GB/s for his setup. He is seeing 25 GB/s, perhaps due to turbo boost to around 3 GHz.

Thus go to ark.intel.com and look for the i7-4870HQ: https://ark.intel.com/products/83504/Intel-Core-i7-4870HQ-Processor-6M-Cache-up-to-3-70-GHz-?q=i7-4870HQ

That core could run up to 3.7 GHz, and the real maximal rate is 29.6 GB/s for his hardware. So where is the other 4 GB/s? Perhaps it's spent on loop logic and other surrounding code within each iteration.

Now where is this false dependency? The hardware runs at almost peak rate. Maybe my math is bad; it happens sometimes :)

P.P.P.P.P.S. People are still suggesting that a HW erratum is the culprit, so I followed the suggestion and created an inline asm example, see below.

On my 7960X, the first version (with a single output to cnt0) runs at 11MB/s; the second version (with output to cnt0, cnt1, cnt2 and cnt3) runs at 33MB/s. And one could say: voila, it's an output dependency.

OK, maybe. The point I made is that it does not make sense to write code like this, and it's not an output dependency problem but dumb code generation. We are not testing hardware; we are writing code to unleash maximal performance. You could expect that HW OOO should rename and hide those "output dependencies", but, gosh, just do the right things right and you will never face any mystery.

    uint64_t builtin_popcnt1a(const uint64_t* buf, size_t len)
    {
        uint64_t cnt0, cnt1, cnt2, cnt3;
        cnt0 = cnt1 = cnt2 = cnt3 = 0;
        uint64_t val = buf[0];
        #if 0
            __asm__ __volatile__ (
                "1:\n\t"
                "popcnt %2, %1\n\t"
                "popcnt %2, %1\n\t"
                "popcnt %2, %1\n\t"
                "popcnt %2, %1\n\t"
                "subq $4, %0\n\t"
                "jnz 1b\n\t"
            : "+q" (len), "=q" (cnt0)
            : "q" (val)
            :
            );
        #else
            __asm__ __volatile__ (
                "1:\n\t"
                "popcnt %5, %1\n\t"
                "popcnt %5, %2\n\t"
                "popcnt %5, %3\n\t"
                "popcnt %5, %4\n\t"
                "subq $4, %0\n\t"
                "jnz 1b\n\t"
            : "+q" (len), "=q" (cnt0), "=q" (cnt1), "=q" (cnt2), "=q" (cnt3)
            : "q" (val)
            :
            );
        #endif
        return cnt0;
    }
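The peak-rate arithmetic used in this answer (core frequency times bytes per popcnt, divided by clocks per popcnt) can be written down as a tiny helper. This is a sketch for checking the numbers only; the name peak_popcnt_gbps and its parameters are invented here, and it assumes one instruction retires every clocks_per_op clocks with nothing else limiting the loop.

```cpp
#include <cassert>

// Hypothetical helper: theoretical upper bound on popcnt bandwidth in GB/s.
//   core_ghz      - core clock in GHz
//   bytes_per_op  - bytes consumed per instruction (8 for a 64-bit popcnt)
//   clocks_per_op - reciprocal throughput in clocks (1 in the quoted table)
double peak_popcnt_gbps(double core_ghz, double bytes_per_op,
                        double clocks_per_op) {
    assert(core_ghz > 0.0 && bytes_per_op > 0.0 && clocks_per_op > 0.0);
    // GHz * bytes / clocks = 1e9 bytes per second, i.e. GB/s.
    return core_ghz * bytes_per_op / clocks_per_op;
}
```

For example, peak_popcnt_gbps(2.5, 8.0, 1.0) gives the 20 GB/s figure at the base clock, and peak_popcnt_gbps(3.7, 8.0, 1.0) gives about 29.6 GB/s at the maximum turbo clock. Comparing a measured rate against such a bound only makes sense if the clock the core actually ran at is known.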

User · Answer

Culprit: False Data Dependency (and the compiler isn't even aware of it)

On Sandy/Ivy Bridge and Haswell processors, the instruction:

    popcnt  src, dest

appears to have a false dependency on the destination register dest. Even though the instruction only writes to it, the instruction will wait until dest is ready before executing. This false dependency is (now) documented by Intel as erratum HSD146 (Haswell) and SKL029 (Skylake).

Skylake fixed this for lzcnt and tzcnt. Cannon Lake (and Ice Lake) fixed this for popcnt. bsf/bsr have a true output dependency: output unmodified for input=0. (But there is no way to take advantage of that with intrinsics; only AMD documents it and compilers don't expose it.)

(Yes, these instructions all run on the same execution unit.)

This dependency doesn't just hold up the 4 popcnts from a single loop iteration. It can carry across loop iterations, making it impossible for the processor to parallelize different loop iterations.

The unsigned vs. uint64_t and other tweaks don't directly affect the problem. But they influence the register allocator, which assigns the registers to the variables.

In your case, the speeds are a direct result of what is stuck to the (false) dependency chain, depending on what the register allocator decided to do.

- 13 GB/s has a chain: popcnt-add-popcnt-popcnt → next iteration
- 15 GB/s has a chain: popcnt-add-popcnt-add → next iteration
- 20 GB/s has a chain: popcnt-popcnt → next iteration
- 26 GB/s has a chain: popcnt-popcnt → next iteration

The difference between 20 GB/s and 26 GB/s seems to be a minor artifact of the indirect addressing. Either way, the processor starts to hit other bottlenecks once you reach this speed.

To test this, I used inline assembly to bypass the compiler and get exactly the assembly I want. I also split up the count variable to break all other dependencies that might mess with the benchmarks.

Here are the results:

Sandy Bridge Xeon @ 3.5 GHz (full test code can be found at the bottom)

- GCC 4.6.3: g++ popcnt.cpp -std=c++0x -O3 -save-temps -march=native
- Ubuntu 12

Different Registers: 18.6195 GB/s

    .L4:
        movq    (%rbx,%rax,8), %r8
        movq    8(%rbx,%rax,8), %r9
        movq    16(%rbx,%rax,8), %r10
        movq    24(%rbx,%rax,8), %r11
        addq    $4, %rax

        popcnt %r8, %r8
        add    %r8, %rdx
        popcnt %r9, %r9
        add    %r9, %rcx
        popcnt %r10, %r10
        add    %r10, %rdi
        popcnt %r11, %r11
        add    %r11, %rsi

        cmpq    $131072, %rax
        jne .L4

Same Register: 8.49272 GB/s

    .L9:
        movq    (%rbx,%rdx,8), %r9
        movq    8(%rbx,%rdx,8), %r10
        movq    16(%rbx,%rdx,8), %r11
        movq    24(%rbx,%rdx,8), %rbp
        addq    $4, %rdx

        # This time reuse "rax" for all the popcnts.
        popcnt %r9, %rax
        add    %rax, %rcx
        popcnt %r10, %rax
        add    %rax, %rsi
        popcnt %r11, %rax
        add    %rax, %r8
        popcnt %rbp, %rax
        add    %rax, %rdi

        cmpq    $131072, %rdx
        jne .L9

Same Register with broken chain: 17.8869 GB/s

    .L14:
        movq    (%rbx,%rdx,8), %r9
        movq    8(%rbx,%rdx,8), %r10
        movq    16(%rbx,%rdx,8), %r11
        movq    24(%rbx,%rdx,8), %rbp
        addq    $4, %rdx

        # Reuse "rax" for all the popcnts.
        xor    %rax, %rax    # Break the cross-iteration dependency by zeroing "rax".
        popcnt %r9, %rax
        add    %rax, %rcx
        popcnt %r10, %rax
        add    %rax, %rsi
        popcnt %r11, %rax
        add    %rax, %r8
        popcnt %rbp, %rax
        add    %rax, %rdi

        cmpq    $131072, %rdx
        jne .L14

So what went wrong with the compiler?

It seems that neither GCC nor Visual Studio is aware that popcnt has such a false dependency. Nevertheless, these false dependencies aren't uncommon. It's just a matter of whether the compiler is aware of it.

popcnt isn't exactly the most used instruction. So it's not really a surprise that a major compiler could miss something like this. There also appears to be no documentation anywhere that mentions this problem. If Intel doesn't disclose it, then nobody outside will know until someone runs into it by chance.

(Update: As of version 4.9.2, GCC is aware of this false dependency and generates code to compensate for it when optimizations are enabled. Major compilers from other vendors, including Clang, MSVC, and even Intel's own ICC, are not yet aware of this microarchitectural erratum and will not emit code that compensates for it.)

Why does the CPU have such a false dependency?

We can speculate: it runs on the same execution unit as bsf / bsr, which do have an output dependency. (How is POPCNT implemented in hardware?) For those instructions, Intel documents the integer result for input=0 as "undefined" (with ZF=1), but Intel hardware actually gives a stronger guarantee to avoid breaking old software: output unmodified. AMD documents this behaviour.

Presumably it was somehow inconvenient to make some uops for this execution unit dependent on the output but others not.

AMD processors do not appear to have this false dependency.

The full test code is below for reference:

    #include <iostream>
    #include <chrono>
    #include <x86intrin.h>

    int main(int argc, char* argv[]) {

       using namespace std;
       uint64_t size=1<<20;

       uint64_t* buffer = new uint64_t[size/8];
       char* charbuffer=reinterpret_cast<char*>(buffer);
       for (unsigned i=0;i<size;++i) charbuffer[i]=rand()%256;

       uint64_t count,duration;
       chrono::time_point<chrono::system_clock> startP,endP;
       {
          uint64_t c0 = 0;
          uint64_t c1 = 0;
          uint64_t c2 = 0;
          uint64_t c3 = 0;
          startP = chrono::system_clock::now();
          for( unsigned k = 0; k < 10000; k++){
             for (uint64_t i=0;i<size/8;i+=4) {
                uint64_t r0 = buffer[i + 0];
                uint64_t r1 = buffer[i + 1];
                uint64_t r2 = buffer[i + 2];
                uint64_t r3 = buffer[i + 3];
                __asm__(
                    "popcnt %4, %4  \n\t"
                    "add %4, %0     \n\t"
                    "popcnt %5, %5  \n\t"
                    "add %5, %1     \n\t"
                    "popcnt %6, %6  \n\t"
                    "add %6, %2     \n\t"
                    "popcnt %7, %7  \n\t"
                    "add %7, %3     \n\t"
                    : "+r" (c0), "+r" (c1), "+r" (c2), "+r" (c3)
                    : "r"  (r0), "r"  (r1), "r"  (r2), "r"  (r3)
                );
             }
          }
          count = c0 + c1 + c2 + c3;
          endP = chrono::system_clock::now();
          duration=chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
          cout << "No Chain\t" << count << '\t' << (duration/1.0E9) << " sec \t"
                << (10000.0*size)/(duration) << " GB/s" << endl;
       }
       {
          uint64_t c0 = 0;
          uint64_t c1 = 0;
          uint64_t c2 = 0;
          uint64_t c3 = 0;
          startP = chrono::system_clock::now();
          for( unsigned k = 0; k < 10000; k++){
             for (uint64_t i=0;i<size/8;i+=4) {
                uint64_t r0 = buffer[i + 0];
                uint64_t r1 = buffer[i + 1];
                uint64_t r2 = buffer[i + 2];
                uint64_t r3 = buffer[i + 3];
                __asm__(
                    "popcnt %4, %%rax   \n\t"
                    "add %%rax, %0      \n\t"
                    "popcnt %5, %%rax   \n\t"
                    "add %%rax, %1      \n\t"
                    "popcnt %6, %%rax   \n\t"
                    "add %%rax, %2      \n\t"
                    "popcnt %7, %%rax   \n\t"
                    "add %%rax, %3      \n\t"
                    : "+r" (c0), "+r" (c1), "+r" (c2), "+r" (c3)
                    : "r"  (r0), "r"  (r1), "r"  (r2), "r"  (r3)
                    : "rax"
                );
             }
          }
          count = c0 + c1 + c2 + c3;
          endP = chrono::system_clock::now();
          duration=chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
          cout << "Chain 4   \t"  << count << '\t' << (duration/1.0E9) << " sec \t"
                << (10000.0*size)/(duration) << " GB/s" << endl;
       }
       {
          uint64_t c0 = 0;
          uint64_t c1 = 0;
          uint64_t c2 = 0;
          uint64_t c3 = 0;
          startP = chrono::system_clock::now();
          for( unsigned k = 0; k < 10000; k++){
             for (uint64_t i=0;i<size/8;i+=4) {
                uint64_t r0 = buffer[i + 0];
                uint64_t r1 = buffer[i + 1];
                uint64_t r2 = buffer[i + 2];
                uint64_t r3 = buffer[i + 3];
                __asm__(
                    "xor %%rax, %%rax   \n\t"   // <--- Break the chain.
                    "popcnt %4, %%rax   \n\t"
                    "add %%rax, %0      \n\t"
                    "popcnt %5, %%rax   \n\t"
                    "add %%rax, %1      \n\t"
                    "popcnt %6, %%rax   \n\t"
                    "add %%rax, %2      \n\t"
                    "popcnt %7, %%rax   \n\t"
                    "add %%rax, %3      \n\t"
                    : "+r" (c0), "+r" (c1), "+r" (c2), "+r" (c3)
                    : "r"  (r0), "r"  (r1), "r"  (r2), "r"  (r3)
                    : "rax"
                );
             }
          }
          count = c0 + c1 + c2 + c3;
          endP = chrono::system_clock::now();
          duration=chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
          cout << "Broken Chain\t"  << count << '\t' << (duration/1.0E9) << " sec \t"
                << (10000.0*size)/(duration) << " GB/s" << endl;
       }

       delete[] buffer;   // allocated with new[], so release with delete[]
    }

An equally interesting benchmark can be found here: http://pastebin.com/kbzgL8si

This benchmark varies the number of popcnts that are in the (false) dependency chain.

    False Chain 0:  41959360000 0.57748 sec     18.1578 GB/s
    False Chain 1:  41959360000 0.585398 sec    17.9122 GB/s
    False Chain 2:  41959360000 0.645483 sec    16.2448 GB/s
    False Chain 3:  41959360000 0.929718 sec    11.2784 GB/s
    False Chain 4:  41959360000 1.23572 sec     8.48557 GB/s
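For readers who want the "split the count variable" idea without inline assembly, here is a minimal portable sketch. It is not the benchmark from this answer: the function name popcount_four_accumulators is invented, and it uses std::bitset<64>::count() rather than the _mm_popcnt_u64 intrinsic. Whether the compiler turns that into a popcnt instruction, and which destination registers it picks, is entirely up to the compiler and flags, so this only removes the dependency through a single running sum. It does not by itself guarantee that the false dependency on the destination register is avoided; inspect the generated assembly to be sure.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative only: counts the set bits in buf[0..len) using four
// independent accumulators, so the additions do not form one long chain.
std::uint64_t popcount_four_accumulators(const std::uint64_t* buf,
                                         std::size_t len) {
    assert(buf != nullptr || len == 0);
    std::uint64_t c0 = 0, c1 = 0, c2 = 0, c3 = 0;
    std::size_t i = 0;
    // Main loop: four words per iteration, one accumulator per word.
    for (; i + 4 <= len; i += 4) {
        c0 += std::bitset<64>(buf[i + 0]).count();
        c1 += std::bitset<64>(buf[i + 1]).count();
        c2 += std::bitset<64>(buf[i + 2]).count();
        c3 += std::bitset<64>(buf[i + 3]).count();
    }
    // Tail: handle the remaining 0 to 3 words.
    for (; i < len; ++i) {
        c0 += std::bitset<64>(buf[i]).count();
    }
    return c0 + c1 + c2 + c3;
}
```

Calling popcount_four_accumulators(buffer, size/8) would slot into the benchmark loops above in place of the asm blocks. Unlike those loops, it also handles lengths that are not a multiple of four.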
