Why does C++ code for testing the Collatz conjecture run faster than hand-written assembly?

Question

I wrote these two solutions for Project Euler Q14, in assembly and in C++. They implement an identical brute force approach for testing the Collatz conjecture. The assembly solution was assembled with:

    nasm -felf64 p14.asm && gcc p14.o -o p14

The C++ was compiled with:

    g++ p14.cpp -o p14

Assembly, p14.asm:

    section .data
        fmt db "%d", 10, 0

    global main
    extern printf

    section .text

    main:
        mov rcx, 1000000
        xor rdi, rdi        ; max i
        xor rsi, rsi        ; i

    l1:
        dec rcx
        xor r10, r10        ; count
        mov rax, rcx

    l2:
        test rax, 1
        jpe even

        mov rbx, 3
        mul rbx
        inc rax
        jmp c1

    even:
        mov rbx, 2
        xor rdx, rdx
        div rbx

    c1:
        inc r10
        cmp rax, 1
        jne l2

        cmp rdi, r10
        cmovl rdi, r10
        cmovl rsi, rcx

        cmp rcx, 2
        jne l1

        mov rdi, fmt
        xor rax, rax
        call printf
        ret

C++, p14.cpp:

    #include <iostream>

    int sequence(long n) {
        int count = 1;
        while (n != 1) {
            if (n % 2 == 0)
                n /= 2;
            else
                n = 3*n + 1;
            ++count;
        }
        return count;
    }

    int main() {
        int max = 0, maxi;
        for (int i = 999999; i > 0; --i) {
            int s = sequence(i);
            if (s > max) {
                max = s;
                maxi = i;
            }
        }
        std::cout << maxi << std::endl;
    }

I know about the compiler optimizations to improve speed and everything, but I don't see many ways to further optimize my assembly solution (speaking programmatically, not mathematically).

The C++ code uses modulus every term and division every other term, while the assembly code only uses a single division every other term.

But the assembly is taking on average 1 second longer than the C++ solution. Why is this? I am asking mainly out of curiosity.

Execution times

My system: 64-bit Linux on 1.4 GHz Intel Celeron 2955U (Haswell microarchitecture).

- g++ (unoptimized): avg 1272 ms.
- g++ -O3: avg 578 ms.
- asm (div) (original): avg 2650 ms.
- asm (shr): avg 679 ms.
- @johnfound asm (assembled with NASM): avg 501 ms.
- @hidefromkgb asm: avg 200 ms.
- @hidefromkgb asm, optimized by @Peter Cordes: avg 145 ms.
- @Veedrac C++: avg 81 ms with -O3, 305 ms with -O0.

User · Answer

Even without looking at assembly, the most obvious reason is that /= 2 is probably optimized as >>= 1, and many processors have a very quick shift operation. But even if a processor doesn't have a shift operation, the integer division is faster than floating point division.

Edit: your mileage may vary on the "integer division is faster than floating point division" statement above. The comments below reveal that modern processors have prioritized optimizing fp division over integer division. So if someone were looking for the most likely reason for the speedup which this thread's question asks about, then the compiler optimizing /= 2 as >>= 1 would be the best first place to look.

On an unrelated note, if n is odd, the expression n*3+1 will always be even. So there is no need to check. You can change that branch to

    {
        n = (n*3 + 1) >> 1;
        count += 2;
    }

So the whole statement would then be

    if (n & 1)
    {
        n = (n*3 + 1) >> 1;
        count += 2;
    }
    else
    {
        n >>= 1;
        ++count;
    }
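The combined odd step suggested in this answer can be sketched in C++ as follows. This is a minimal illustration, not the answerer's exact program: the function names `collatz_terms_plain` and `collatz_terms_combined` are made up for this example, `count` follows the question's convention of counting terms (starting at 1), and `std::uint64_t` is assumed to be wide enough for the seeds being tried.

```cpp
#include <cassert>
#include <cstdint>

// Reference version: one Collatz step per iteration, counting terms
// the same way as the question's sequence() function.
std::uint64_t collatz_terms_plain(std::uint64_t n) {
    assert(n >= 1);
    std::uint64_t count = 1;
    while (n != 1) {
        n = (n & 1) ? 3 * n + 1 : n >> 1;
        ++count;
    }
    return count;
}

// Variant from the answer: for odd n, 3n+1 is always even, so the
// following halving is folded into the same iteration (two steps at once).
std::uint64_t collatz_terms_combined(std::uint64_t n) {
    assert(n >= 1);
    std::uint64_t count = 1;
    while (n != 1) {
        if (n & 1) {
            n = (3 * n + 1) >> 1;  // 3n+1 followed by /2
            count += 2;
        } else {
            n >>= 1;
            ++count;
        }
    }
    return count;
}
```

Skipping the check after the odd step is safe because for odd n > 1 the intermediate value 3n+1 is at least 10, so the sequence cannot pass through 1 inside the combined step.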

User · Answer

From comments:

"But, this code never stops (because of integer overflow) !?!" Yves Daoust

For many numbers it will not overflow.

If it does overflow for one of those unlucky initial seeds, the overflowed number will very likely converge toward 1 without another overflow.

Still, this poses an interesting question: is there some overflow-cyclic seed number?

Any simple final converging series starts with a power of two value (obvious enough?).

2^64 will overflow to zero, which is an undefined infinite loop according to the algorithm (it ends only with 1), but the most optimal solution in the answers will finish due to shr rax producing ZF=1.

Can we produce 2^64? If the starting number is 0x5555555555555555, it's an odd number, and the next number is then 3n+1, which is 0xFFFFFFFFFFFFFFFF + 1 = 0. Theoretically this is an undefined state of the algorithm, but the optimized answer of johnfound will recover by exiting on ZF=1. The cmp rax,1 of Peter Cordes will end in an infinite loop (QED variant 1, "cheapo" through the undefined 0 number).

How about some more complex number, which will create a cycle without 0? Frankly, I'm not sure; my math theory is too hazy to get any serious idea of how to deal with it in a serious way. But intuitively I would say the series will converge to 1 for every number with 0 < number, as the 3n+1 formula will slowly turn every non-2 prime factor of the original number (or an intermediate one) into some power of 2, sooner or later. So we don't need to worry about an infinite loop for the original series; only overflow can hamper us.

So I just put a few numbers into a sheet and took a look at 8-bit truncated numbers.

There are three values overflowing to 0: 227, 170 and 85 (85 going directly to 0, the other two progressing toward 85).

But there's no value creating a cyclic overflow seed.

Funnily enough, I did a check for which is the first number to suffer from 8-bit truncation, and already 27 is affected! It reaches the value 9232 in the proper non-truncated series (the first truncated value is 322 in the 12th step), and the maximum value reached for any of the 2-255 input numbers in the non-truncated way is 13120 (for 255 itself). The maximum number of steps to converge to 1 is about 128 (+-2, not sure if "1" is to count, etc...).

Interestingly enough (for me), the number 9232 is the maximum for many other source numbers. What's so special about it? :-O 9232 = 0x2410 ... hmmm.. no idea.

Unfortunately I can't get any deep grasp of this series, why it converges and what the implications of truncating it to k bits are, but with the cmp number,1 terminating condition it's certainly possible to put the algorithm into an infinite loop with a particular input value ending as 0 after truncation.

But the value 27 overflowing for the 8-bit case is sort of alerting: it looks like if you count the number of steps to reach the value 1, you will get a wrong result for the majority of numbers from the total k-bit set of integers. For the 8-bit integers, 146 numbers out of 256 have their series affected by truncation (some of them may still hit the correct number of steps by accident maybe, I'm too lazy to check).
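The 8-bit experiment described above can be reproduced with a small C++ sketch. The names `step_u8`, `step_u64`, `Outcome` and `run_truncated` are hypothetical helpers introduced for this illustration; the only assumption taken from the answer is that every result is truncated to the register width (unsigned wrap-around).

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// One Collatz step with the result truncated to 8 bits (wraps modulo 256).
std::uint8_t step_u8(std::uint8_t n) {
    return (n & 1) ? static_cast<std::uint8_t>(3u * n + 1u)
                   : static_cast<std::uint8_t>(n >> 1);
}

// The same step in a 64-bit register (wraps modulo 2^64).
std::uint64_t step_u64(std::uint64_t n) {
    return (n & 1) ? 3 * n + 1 : n >> 1;
}

enum class Outcome { ReachedOne, ReachedZero, Cycled };

// Follow the truncated sequence until it reaches 1, collapses to 0,
// or revisits a value (a cycle that never reaches 1).
Outcome run_truncated(std::uint8_t seed) {
    assert(seed != 0);
    std::array<bool, 256> seen{};
    std::uint8_t n = seed;
    while (true) {
        if (n == 1) return Outcome::ReachedOne;
        if (n == 0) return Outcome::ReachedZero;
        if (seen[n]) return Outcome::Cycled;
        seen[n] = true;
        n = step_u8(n);
    }
}
```

Looping `run_truncated` over the seeds 1 to 255 and tallying the outcomes is one way to re-check the counts quoted in the answer. A loop that only tests for n == 1, like the cmp rax,1 variant, would spin forever on a `ReachedZero` seed, since 0 maps to 0.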

User · Answer

For more performance: A simple change is observing that after n = 3n+1, n will be even, so you can divide by 2 immediately. And n won't be 1, so you don't need to test for it. So you could save a few if statements and write:

    while (n % 2 == 0) n /= 2;
    if (n > 1) for (;;) {
        n = (3*n + 1) / 2;
        if (n % 2 == 0) {
            do n /= 2; while (n % 2 == 0);
            if (n == 1) break;
        }
    }

Here's a big win: If you look at the lowest 8 bits of n, all the steps until you divided by 2 eight times are completely determined by those eight bits. For example, if the last eight bits are 0x01, that is in binary your number is ???? 0000 0001, then the next steps are:

    3n+1 -> ???? 0000 0100
    / 2  -> ???? ?000 0010
    / 2  -> ???? ??00 0001
    3n+1 -> ???? ??00 0100
    / 2  -> ???? ???0 0010
    / 2  -> ???? ???? 0001
    3n+1 -> ???? ???? 0100
    / 2  -> ???? ???? ?010
    / 2  -> ???? ???? ??01
    3n+1 -> ???? ???? ??00
    / 2  -> ???? ???? ???0
    / 2  -> ???? ???? ????

So all these steps can be predicted, and 256k + 1 is replaced with 81k + 1. Something similar will happen for all combinations. So you can make a loop with a big switch statement:

    k = n / 256;
    m = n % 256;

    switch (m) {
        case 0: n = 1 * k + 0; break;
        case 1: n = 81 * k + 1; break;
        case 2: n = 81 * k + 2; break;
        ...
        case 155: n = 729 * k + 445; break;
        ...
    }

Run the loop until n ≤ 128, because at that point n could become 1 with fewer than eight divisions by 2, and doing eight or more steps at a time would make you miss the point where you reach 1 for the first time. Then continue the "normal" loop - or have a table prepared that tells you how many more steps are needed to reach 1.

PS. I strongly suspect Peter Cordes' suggestion would make it even faster. There will be no conditional branches at all except one, and that one will be predicted correctly except when the loop actually ends. So the code would be something like

    static const unsigned int multipliers [256] = { ... }
    static const unsigned int adders [256] = { ... }

    while (n > 128) {
        size_t lastBits = n % 256;
        n = (n >> 8) * multipliers [lastBits] + adders [lastBits];
    }

In practice, you would measure whether processing the last 9, 10, 11, 12 bits of n at a time would be faster. For each bit, the number of entries in the table would double, and I expect a slowdown when the tables don't fit into L1 cache anymore.

PPS. If you need the number of operations: In each iteration we do exactly eight divisions by two, and a variable number of (3n + 1) operations, so an obvious method to count the operations would be another array. But we can actually calculate the number of steps (based on the number of iterations of the loop).

We could redefine the problem slightly: Replace n with (3n + 1) / 2 if odd, and replace n with n / 2 if even. Then every iteration will do exactly 8 steps, but you could consider that cheating :-) So assume there were r operations n <- 3n+1 and s operations n <- n/2. The result will be quite exactly n' = n * 3^r / 2^s, because n <- 3n+1 means n <- 3n * (1 + 1/3n). Taking the logarithm we find r = (s + log2 (n' / n)) / log2 (3).

If we do the loop until n ≤ 1,000,000 and have a precomputed table of how many iterations are needed from any start point n ≤ 1,000,000, then calculating r as above, rounded to the nearest integer, will give the right result unless s is truly large.
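One possible way to generate the tables described in this answer, instead of typing 256 cases by hand, is sketched below. It is only an illustration under stated assumptions: `ByteTables`, `build_tables`, `naive_steps`, `table_steps` and `tables_agree_with_naive` are names invented here, a third `steps` array is used for counting (the "another array" option from the PPS), and 64-bit arithmetic is assumed not to overflow for the seeds checked.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// For each low byte m, n = 256*k + m becomes multipliers[m]*k + adders[m]
// after exactly eight halvings; steps[m] is the number of steps consumed.
struct ByteTables {
    std::array<std::uint64_t, 256> multipliers{};
    std::array<std::uint64_t, 256> adders{};
    std::array<std::uint64_t, 256> steps{};
};

ByteTables build_tables() {
    ByteTables t;
    for (std::uint64_t m = 0; m < 256; ++m) {
        // Track n symbolically as a*k + b. While fewer than eight halvings
        // have happened, a is even, so the parity of n is the parity of b.
        std::uint64_t a = 256, b = m, halvings = 0, count = 0;
        while (halvings < 8) {
            if (b & 1) { a *= 3; b = 3 * b + 1; }
            else       { a /= 2; b /= 2; ++halvings; }
            ++count;
        }
        t.multipliers[m] = a;
        t.adders[m] = b;
        t.steps[m] = count;
    }
    return t;
}

// Plain step counter used as the reference and for the tail (n <= 128).
std::uint64_t naive_steps(std::uint64_t n) {
    assert(n >= 1);
    std::uint64_t count = 0;
    while (n != 1) { n = (n & 1) ? 3 * n + 1 : n / 2; ++count; }
    return count;
}

// Table-driven counter: eight halvings per iteration while n > 128.
std::uint64_t table_steps(std::uint64_t n, const ByteTables& t) {
    assert(n >= 1);
    std::uint64_t count = 0;
    while (n > 128) {
        const std::uint64_t low = n & 255;
        count += t.steps[low];
        n = (n >> 8) * t.multipliers[low] + t.adders[low];
    }
    return count + naive_steps(n);
}

// Cross-check the table-driven counter against the plain one.
bool tables_agree_with_naive(std::uint64_t limit) {
    const ByteTables t = build_tables();
    for (std::uint64_t n = 1; n <= limit; ++n)
        if (table_steps(n, t) != naive_steps(n)) return false;
    return true;
}
```

Generating the constants also gives a cheap way to double-check individual switch cases, for example that low byte 1 maps to 81k + 1.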

User · Answer

The simple answer:

- doing a MOV RBX, 3 and MUL RBX is expensive; just ADD RBX, RBX twice
- ADD 1 is probably faster than INC here
- MOV 2 and DIV is very expensive; just shift right
- 64-bit code is usually noticeably slower than 32-bit code and the alignment issues are more complicated; with small programs like this you have to pack them so you are doing parallel computation to have any chance of being faster than 32-bit code

If you generate the assembly listing for your C++ program, you can see how it differs from your assembly.

User · Answer

Claiming that the C++ compiler can produce more optimal code than a competent assembly language programmer is a very bad mistake. And especially in this case. The human always can make the code better than the compiler can, and this particular situation is a good illustration of this claim.

The timing difference you're seeing is because the assembly code in the question is very far from optimal in the inner loops.

(The below code is 32-bit, but can be easily converted to 64-bit)

For example, the sequence function can be optimized to only 5 instructions:

    .seq:
        inc     esi                 ; counter
        lea     edx, [3*eax+1]      ; edx = 3*n+1
        shr     eax, 1              ; eax = n/2
        cmovc   eax, edx            ; if CF eax = edx
        jnz     .seq                ; jmp if n<>1

The whole code looks like:

    include "%lib%/freshlib.inc"
    @BinaryType console, compact
    options.DebugMode = 1
    include "%lib%/freshlib.asm"

    start:
            InitializeAll
            mov ecx, 999999
            xor edi, edi        ; max
            xor ebx, ebx        ; max i

        .main_loop:

            xor     esi, esi
            mov     eax, ecx

        .seq:
            inc     esi                 ; counter
            lea     edx, [3*eax+1]      ; edx = 3*n+1
            shr     eax, 1              ; eax = n/2
            cmovc   eax, edx            ; if CF eax = edx
            jnz     .seq                ; jmp if n<>1

            cmp     edi, esi
            cmovb   edi, esi
            cmovb   ebx, ecx

            dec     ecx
            jnz     .main_loop

            OutputValue "Max sequence: ", edi, 10, -1
            OutputValue "Max index: ", ebx, 10, -1

            FinalizeAll
            stdcall TerminateAll, 0

In order to compile this code, FreshLib is needed.

In my tests (1 GHz AMD A4-1200 processor), the above code is approximately four times faster than the C++ code from the question when compiled with -O0 (430 ms vs. 1900 ms), and more than two times faster (430 ms vs. 830 ms) when the C++ code is compiled with -O3.

The output of both programs is the same: max sequence = 525 on i = 837799.
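For readers more comfortable with C++, the logic of the five-instruction loop can be written roughly as below. This is a hedged translation, not the answerer's code: the function name `seq_len` is made up, and 64-bit integers are used here (the answer's asm is 32-bit) so that the unconditionally computed 3n+1 cannot wrap for seeds below one million. The exit test mirrors `jnz`, which branches on the zero flag left by `shr`.

```cpp
#include <cassert>
#include <cstdint>

// Counts the terms of the sequence starting at n, mirroring the asm loop:
//   inc / lea (3n+1) / shr (n/2, sets CF and ZF) / cmovc / jnz
std::uint64_t seq_len(std::uint64_t n) {
    assert(n >= 1);
    std::uint64_t count = 0;
    std::uint64_t half;
    do {
        ++count;                                   // inc  esi
        const std::uint64_t tripled = 3 * n + 1;   // lea  edx, [3*eax+1]
        const bool was_odd = (n & 1) != 0;         // CF after shr
        half = n >> 1;                             // shr  eax, 1
        n = was_odd ? tripled : half;              // cmovc eax, edx
    } while (half != 0);                           // jnz  .seq (ZF from shr)
    return count;
}
```

The loop ends when the shifted value is zero, which happens only when n was 1 (or 0) before the shift, so no separate comparison against 1 is needed. Whether a compiler turns the ternary into a `cmov` is not guaranteed; checking the generated assembly is the only way to know.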

User · Answer

If you think a 64-bit DIV instruction is a good way to divide by two, then no wonder the compiler's asm output beat your hand-written code, even with -O0 (compile fast, no extra optimization, and store/reload to memory after/before every C statement so a debugger can modify variables).

See Agner Fog's Optimizing Assembly guide to learn how to write efficient asm. He also has instruction tables and a microarch guide for specific details for specific CPUs. See also the x86 tag wiki for more perf links.

See also this more general question about beating the compiler with hand-written asm: Is inline assembly language slower than native C++ code? TL:DR: yes if you do it wrong (like this question).

Usually you're fine letting the compiler do its thing, especially if you try to write C++ that can compile efficiently. Also see: is assembly faster than compiled languages? One of the answers links to these neat slides showing how various C compilers optimize some really simple functions with cool tricks. Matt Godbolt's CppCon2017 talk "What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid" is in a similar vein.

    even:
        mov rbx, 2
        xor rdx, rdx
        div rbx

On Intel Haswell, div r64 is 36 uops, with a latency of 32-96 cycles, and a throughput of one per 21-74 cycles. (Plus the 2 uops to set up RBX and zero RDX, but out-of-order execution can run those early.) High-uop-count instructions like DIV are microcoded, which can also cause front-end bottlenecks. In this case, latency is the most relevant factor because it's part of a loop-carried dependency chain.

shr rax, 1 does the same unsigned division: It's 1 uop, with 1c latency, and can run 2 per clock cycle.

For comparison, 32-bit division is faster, but still horrible vs. shifts. idiv r32 is 9 uops, 22-29c latency, and one per 8-11c throughput on Haswell.

As you can see from looking at gcc's -O0 asm output (Godbolt compiler explorer), it only uses shift instructions. clang -O0 does compile naively like you thought, even using 64-bit IDIV twice. (When optimizing, compilers do use both outputs of IDIV when the source does a division and modulus with the same operands, if they use IDIV at all.)

GCC doesn't have a totally-naive mode; it always transforms through GIMPLE, which means some "optimizations" can't be disabled. This includes recognizing division-by-constant and using shifts (power of 2) or a fixed-point multiplicative inverse (non power of 2) to avoid IDIV (see div_by_13 in the above godbolt link).

gcc -Os (optimize for size) does use IDIV for non-power-of-2 division, unfortunately even in cases where the multiplicative inverse code is only slightly larger but much faster.

Helping the compiler

(summary for this case: use uint64_t n)

First of all, it's only interesting to look at optimized compiler output (-O3). -O0 speed is basically meaningless.

Look at your asm output (on Godbolt, or see How to remove "noise" from GCC/clang assembly output?). When the compiler doesn't make optimal code in the first place: Writing your C/C++ source in a way that guides the compiler into making better code is usually the best approach. You have to know asm, and know what's efficient, but you apply this knowledge indirectly. Compilers are also a good source of ideas: sometimes clang will do something cool, and you can hand-hold gcc into doing the same thing: see this answer and what I did with the non-unrolled loop in @Veedrac's code below.

This approach is portable, and in 20 years some future compiler can compile it to whatever is efficient on future hardware (x86 or not), maybe using a new ISA extension or auto-vectorizing. Hand-written x86-64 asm from 15 years ago would usually not be optimally tuned for Skylake, e.g. compare&branch macro-fusion didn't exist back then. What's optimal now for hand-crafted asm for one microarchitecture might not be optimal for other current and future CPUs. Comments on @johnfound's answer
discuss major differences between AMD Bulldozer and Intel Haswell, which have a big effect on this code. But in theory, g++ -O3 -march=bdver3 and g++ -O3 -march=skylake will do the right thing. (Or -march=native.) Or -mtune=... to just tune, without using instructions that other CPUs might not support.

My feeling is that guiding the compiler to asm that's good for a current CPU you care about shouldn't be a problem for future compilers. They're hopefully better than current compilers at finding ways to transform code, and can find a way that works for future CPUs. Regardless, future x86 probably won't be terrible at anything that's good on current x86, and the future compiler will avoid any asm-specific pitfalls while implementing something like the data movement from your C source, if it doesn't see something better.

Hand-written asm is a black-box for the optimizer, so constant-propagation doesn't work when inlining makes an input a compile-time constant. Other optimizations are also affected. Read https://gcc.gnu.org/wiki/DontUseInlineAsm before using asm. (And avoid MSVC-style inline asm: inputs/outputs have to go through memory, which adds overhead.)

In this case: your n has a signed type, and gcc uses the SAR/SHR/ADD sequence that gives the correct rounding. (IDIV and arithmetic-shift "round" differently for negative inputs, see the SAR insn set ref manual entry.) (IDK if gcc tried and failed to prove that n can't be negative, or what. Signed-overflow is undefined behaviour, so it should have been able to.)

You should have used uint64_t n, so it can just SHR. And so it's portable to systems where long is only 32-bit (e.g. x86-64 Windows).

BTW, gcc's optimized asm output looks pretty good (using unsigned long n): the inner loop it inlines into main() does this:

     # from gcc5.4 -O3  plus my comments

     # edx= count=1
     # rax= uint64_t n

    .L9:                   # do{
        lea    rcx, [rax+1+rax*2]   # rcx = 3*n + 1
        mov    rdi, rax
        shr    rdi         # rdi = n>>1;
        test   al, 1       # set flags based on n%2 (aka n&1)
        mov    rax, rcx
        cmove  rax, rdi    # n= (n%2) ? 3*n+1 : n/2;
        add    edx, 1      # ++count;
        cmp    rax, 1
        jne   .L9          #}while(n!=1)

      cmp/branch to update max and maxi, and then do the next n

The inner loop is branchless, and the critical path of the loop-carried dependency chain is:

- 3-component LEA (3 cycles)
- cmov (2 cycles on Haswell, 1c on Broadwell or later).

Total: 5 cycles per iteration, latency bottleneck. Out-of-order execution takes care of everything else in parallel with this (in theory: I haven't tested with perf counters to see if it really runs at 5c/iter).

The FLAGS input of cmov (produced by TEST) is faster to produce than the RAX input (from LEA->MOV), so it's not on the critical path.

Similarly, the MOV->SHR that produces CMOV's RDI input is off the critical path, because it's also faster than the LEA. MOV on IvyBridge and later has zero latency (handled at register-rename time). (It still takes a uop, and a slot in the pipeline, so it's not free, just zero latency.) The extra MOV in the LEA dep chain is part of the bottleneck on other CPUs.

The cmp/jne is also not part of the critical path: it's not loop-carried, because control dependencies are handled with branch prediction + speculative execution, unlike data dependencies on the critical path.

Beating the compiler

GCC did a pretty good job here. It could save one code byte by using inc edx instead of add edx, 1, because nobody cares about P4 and its false-dependencies for partial-flag-modifying instructions.

It could also save all the MOV instructions, and the TEST: SHR sets CF = the bit shifted out, so we can use cmovc instead of test / cmovz.

     ### Hand-optimized version of what gcc does
    .L9:                       #do{
        lea     rcx, [rax+1+rax*2] # rcx = 3*n + 1
        shr     rax, 1         # n>>=1;    CF = n&1 = n%2
        cmovc   rax, rcx       # n= (n&1) ? 3*n+1 : n/2;
        inc     edx            # ++count;
        cmp     rax, 1
        jne     .L9            #}while(n!=1)

See @johnfound's answer for another clever trick: remove the CMP by branching on SHR's flag result as well as using it for CMOV: zero only if n was 1 (or 0) to start with. (Fun fact: SHR with count != 1 on Nehalem or earlier causes a stall if you read the flag results. That's how they made it single-uop. The shift-by-1 special encoding is fine, though.)

Avoiding MOV doesn't help with the latency at all on Haswell (Can x86's MOV really be "free"? Why can't I reproduce this at all?). It does help significantly on CPUs like Intel pre-IvB, and AMD Bulldozer-family, where MOV is not zero-latency. The compiler's wasted MOV instructions do affect the critical path. BD's complex-LEA and CMOV are both lower latency (2c and 1c respectively), so it's a bigger fraction of the latency. Also, throughput bottlenecks become an issue, because it only has two integer ALU pipes. See @johnfound's answer, where he has timing results from an AMD CPU.

Even on Haswell, this version may help a bit by avoiding some occasional delays where a non-critical uop steals an execution port from one on the critical path, delaying execution by 1 cycle. (This is called a resource conflict.) It also saves a register, which may help when doing multiple n values in parallel in an interleaved loop (see below).

LEA's latency depends on the addressing mode, on Intel SnB-family CPUs. 3c for 3 components ([base+idx+const], which takes two separate adds), but only 1c with 2 or fewer components (one add). Some CPUs (like Core2) do even a 3-component LEA in a single cycle, but SnB-family doesn't. Worse, Intel SnB-family standardizes latencies so there are no 2c uops, otherwise 3-component LEA would be only 2c like Bulldozer. (3-component LEA is slower on AMD as well, just not by as much.)

So lea rcx, [rax + rax*2] / inc rcx is only 2c latency, faster than
lea rcx, [rax + rax*2 + 1], on Intel SnB-family CPUs like Haswell. Break-even on BD, and worse on Core2. It does cost an extra uop, which normally isn't worth it to save 1c latency, but latency is the major bottleneck here and Haswell has a wide enough pipeline to handle the extra uop throughput.

Neither gcc, icc, nor clang (on godbolt) used SHR's CF output, always using an AND or TEST. Silly compilers. :P They're great pieces of complex machinery, but a clever human can often beat them on small-scale problems. (Given thousands to millions of times longer to think about it, of course! Compilers don't use exhaustive algorithms to search for every possible way to do things, because that would take too long when optimizing a lot of inlined code, which is what they do best. They also don't model the pipeline in the target microarchitecture, at least not in the same detail as IACA or other static-analysis tools; they just use some heuristics.)

Simple loop unrolling won't help; this loop bottlenecks on the latency of a loop-carried dependency chain, not on loop overhead / throughput. This means it would do well with hyperthreading (or any other kind of SMT), since the CPU has lots of time to interleave instructions from two threads. This would mean parallelizing the loop in main, but that's fine because each thread can just check a range of n values and produce a pair of integers as a result.

Interleaving by hand within a single thread might be viable, too. Maybe compute the sequence for a pair of numbers in parallel, since each one only takes a couple registers, and they can all update the same max / maxi. This creates more instruction-level parallelism.

The trick is deciding whether to wait until all the n values have reached 1 before getting another pair of starting n values, or whether to break out and get a new start point for just one that reached the end condition, without touching the registers for the other sequence. Probably it's best to keep each
chain working on useful data, otherwise you'd have to conditionally increment its counter.

You could maybe even do this with SSE packed-compare stuff to conditionally increment the counter for vector elements where n hadn't reached 1 yet. And then to hide the even longer latency of a SIMD conditional-increment implementation, you'd need to keep more vectors of n values up in the air. Maybe only worth it with 256b vectors (4x uint64_t).

I think the best strategy to make detection of a 1 "sticky" is to mask the vector of all-ones that you add to increment the counter. So after you've seen a 1 in an element, the increment-vector will have a zero, and +=0 is a no-op.

Untested idea for manual vectorization

    # starting with YMM0 = [ n_d, n_c, n_b, n_a ]  (64-bit elements)
    # ymm4 = _mm256_set1_epi64x(1):  increment vector
    # ymm5 = all-zeros:  count vector

    .inner_loop:
        vpaddq    ymm1, ymm0, ymm0
        vpaddq    ymm1, ymm1, ymm0
        vpaddq    ymm1, ymm1, set1_epi64(1)     # ymm1= 3*n + 1.  Maybe could do this more efficiently?

        vpsllq    ymm3, ymm0, 63                # shift the low bit to the sign bit

        vpsrlq    ymm0, ymm0, 1                 # n /= 2

        # FP blend between integer insns may cost extra bypass latency, but integer blends don't have 1 bit controlling a whole qword.
        vblendvpd ymm0, ymm0, ymm1, ymm3        # variable blend controlled by the sign bit of each 64-bit element.  I might have the source operands backwards, I always have to look this up.

        # ymm0 = updated n  in each element.

        vpcmpeqq ymm1, ymm0, set1_epi64(1)
        vpandn   ymm4, ymm1, ymm4         # zero out elements of ymm4 where the compare was true

        vpaddq   ymm5, ymm5, ymm4         # count++ in elements where n has never been == 1

        vptest   ymm4, ymm4
        jnz  .inner_loop
        # Fall through when all the n values have reached 1 at some point, and our increment vector is all-zero

        vextracti128 xmm0, ymm5, 1
        vpmaxq .... crap this doesn't exist
        # Actually just delay doing a horizontal max until the very very end.  But you need some way to record max and maxi.

You can and should implement this with intrinsics instead of hand-written asm.

Algorithmic / implementation improvement:

Besides just implementing the same logic with more efficient asm, look for ways to simplify the logic, or avoid redundant work, e.g. memoize to detect common endings to sequences. Or even better, look at 8 trailing bits at once (gnasher's answer).

@EOF points out that tzcnt (or bsf) could be used to do multiple n/=2 iterations in one step. That's probably better than SIMD vectorizing; no SSE or AVX instruction can do that. It's still compatible with doing multiple scalar ns in parallel in different integer registers, though.

So the loop might look like this:

    goto loop_entry;  // C++ structured like the asm, for illustration only
    do {
       n = n*3 + 1;
      loop_entry:
       shift = _tzcnt_u64(n);
       n >>= shift;
       count += shift;
    } while(n != 1);

This may do significantly fewer iterations, but variable-count shifts are slow on Intel SnB-family CPUs without BMI2: 3 uops, 2c latency. (They have an input dependency on the FLAGS because count=0 means the flags are unmodified. They handle this as a data dependency, and take multiple uops because a uop can only have 2 inputs (pre-HSW/BDW anyway).) This is the kind of thing that people complaining about x86's crazy-CISC design are referring to. It makes x86 CPUs slower than they would be if the ISA was designed from scratch today, even in a mostly-similar way (i.e. this is part of the "x86 tax" that costs speed / power). SHRX/SHLX/SARX (BMI2) are a big win (1 uop / 1c latency).

It also puts tzcnt (3c on Haswell and later) on the critical path, so it significantly lengthens the total latency of the loop-carried dependency chain. It does remove any need for a CMOV, or for preparing a register holding n>>1, though. @Veedrac's answer overcomes all this by deferring the tzcnt/shift for
multiple iterations, which is highly effective (see below).

We can safely use BSF or TZCNT interchangeably, because n can never be zero at that point. TZCNT's machine-code decodes as BSF on CPUs that don't support BMI1. (Meaningless prefixes are ignored, so REP BSF runs as BSF.)

TZCNT performs much better than BSF on AMD CPUs that support it, so it can be a good idea to use REP BSF, even if you don't care about setting ZF if the input is zero rather than the output. Some compilers do this when you use __builtin_ctzll even with -mno-bmi.

They perform the same on Intel CPUs, so just save the byte if that's all that matters. TZCNT on Intel (pre-Skylake) still has a false-dependency on the supposedly write-only output operand, just like BSF, to support the undocumented behaviour that BSF with input = 0 leaves its destination unmodified. So you need to work around that unless optimizing only for Skylake, so there's nothing to gain from the extra REP byte. (Intel often goes above and beyond what the x86 ISA manual requires, to avoid breaking widely-used code that depends on something it shouldn't, or that is retroactively disallowed, e.g. Windows 9x's assumes no speculative prefetching of TLB entries, which was safe when the code was written, before Intel updated the TLB management rules.)

Anyway, LZCNT/TZCNT on Haswell have the same false dep as POPCNT: see this Q&A. This is why in gcc's asm output for @Veedrac's code, you see it breaking the dep chain with xor-zeroing on the register it's about to use as TZCNT's destination when it doesn't use dst=src. Since TZCNT/LZCNT/POPCNT never leave their destination undefined or unmodified, this false dependency on the output on Intel CPUs is a performance bug / limitation. Presumably it's worth some transistors / power to have them behave like other uops that go to the same execution unit. The only perf upside is interaction with another uarch limitation: they can micro-fuse a memory operand with an indexed addressing mode on Haswell, but on Skylake, where Intel removed the false dep for LZCNT/TZCNT, they "un-laminate" indexed addressing modes while POPCNT can still micro-fuse any addr mode.

Improvements to ideas / code from other answers:

@hidefromkgb's answer has a nice observation that you're guaranteed to be able to do one right shift after a 3n+1. You can compute this even more efficiently than just leaving out the checks between steps. The asm implementation in that answer is broken, though (it depends on OF, which is undefined after SHRD with a count > 1), and slow: ROR rdi,2 is faster than SHRD rdi,rdi,2, and using two CMOV instructions on the critical path is slower than an extra TEST that can run in parallel.

I put tidied / improved C (which guides the compiler to produce better asm), and tested+working faster asm (in comments below the C) up on Godbolt: see the link in @hidefromkgb's answer. (This answer hit the 30k char limit from the large Godbolt URLs, but shortlinks can rot and were too long for goo.gl anyway.)

Also improved the output-printing to convert to a string and make one write() instead of writing one char at a time. This minimizes impact on timing the whole program with perf stat ./collatz (to record performance counters), and I de-obfuscated some of the non-critical asm.

@Veedrac's code

I got a minor speedup from right-shifting as much as we know needs doing, and checking to continue the loop. From 7.5s for limit=1e8 down to 7.275s, on Core2Duo (Merom), with an unroll factor of 16.

code + comments on Godbolt. Don't use this version with clang; it does something silly with the defer-loop. Using a tmp counter k and then adding it to count later changes what clang does, but that slightly hurts gcc.

See discussion in comments: Veedrac's code is excellent on CPUs with BMI1 (i.e. not Celeron/Pentium)
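The tzcnt idea discussed in this answer can be tried from portable-ish C++ with the GCC/Clang builtin `__builtin_ctzll`, which the answer mentions. The sketch below is only an illustration: the function name `collatz_steps_ctz` is made up, it counts every 3n+1 step as well as every halving (the illustrative loop in the answer only accumulates the shift counts), and it assumes a compiler that provides this builtin and seeds whose trajectories fit in 64 bits.

```cpp
#include <cassert>
#include <cstdint>

// Counts Collatz steps, removing all trailing zero bits in one shift
// instead of halving once per iteration.
std::uint64_t collatz_steps_ctz(std::uint64_t n) {
    assert(n >= 1);  // __builtin_ctzll is undefined for 0
    std::uint64_t count = 0;

    // Strip the factors of two of the starting value first.
    unsigned shift = static_cast<unsigned>(__builtin_ctzll(n));
    n >>= shift;
    count += shift;

    while (n != 1) {
        n = 3 * n + 1;   // n is odd here, so the result is even and nonzero
        ++count;
        shift = static_cast<unsigned>(__builtin_ctzll(n));
        n >>= shift;     // all pending halvings at once
        count += shift;
    }
    return count;
}
```

Whether this beats the cmov-based loop depends on the CPU, as the answer explains: the bit-scan and the variable-count shift both sit on the loop-carried dependency chain, so measuring on the target machine is the only reliable guide.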

User · Answer

You did not post the code generated by the compiler, so there's some guesswork here, but even without having seen it, one can say that this:

    test rax, 1
    jpe even

... has a 50% chance of mispredicting the branch, and that will be expensive.

The compiler almost certainly does both computations (which costs negligibly more, since the div/mod has quite a long latency, so the multiply-add is "free") and follows up with a CMOV, which of course has a zero percent chance of being mispredicted.

User · Answer

On a rather unrelated note: more performance hacks! (The first "conjecture" has been finally debunked by ShreevatsaR; removed.)

When traversing the sequence, we can only get 3 possible cases in the 2-neighborhood of the current element N (shown first):

    1. [even] [odd]
    2. [odd] [even]
    3. [even] [even]

To leap past these 2 elements means to compute (N >> 1) + N + 1, ((N << 1) + N + 1) >> 1 and N >> 2, respectively.

Let's prove that for both cases (1) and (2) it is possible to use the first formula, (N >> 1) + N + 1.

Case (1) is obvious. Case (2) implies (N & 1) == 1, so if we assume (without loss of generality) that N is 2 bits long and its bits are ba from most- to least-significant, then a = 1, and the following holds:

    (N << 1) + N + 1:     (N >> 1) + N + 1:

            b10                    b1
             b1                     b
           +  1                   + 1
           ----                   ---
           bBb0                   bBb

where B = !b. Right-shifting the first result gives us exactly what we want.

Q.E.D.: (N & 1) == 1 implies (N >> 1) + N + 1 == ((N << 1) + N + 1) >> 1.

As proven, we can traverse the sequence 2 elements at a time, using a single ternary operation. Another 2x time reduction.

The resulting algorithm looks like this:

    #include <stdint.h>
    #include <stdio.h>

    uint64_t sequence(uint64_t size, uint64_t *path)
    {
        uint64_t n, i, c, maxi = 0, maxc = 0;

        /* only odd starting values are tried, from the top down */
        for (n = i = (size - 1) | 1; i > 2; n = i -= 2)
        {
            c = 2;
            /* advance two elements per iteration */
            while ((n = ((n & 3) ? (n >> 1) + n + 1 : (n >> 2))) > 2)
                c += 2;
            if (n == 2)
                c++;
            if (c > maxc)
            {
                maxi = i;
                maxc = c;
            }
        }
        *path = maxc;
        return maxi;
    }

    int main()
    {
        uint64_t maxi, maxc;

        maxi = sequence(1000000, &maxc);
        printf("%llu, %llu\n", (unsigned long long)maxi, (unsigned long long)maxc);
        return 0;
    }

Here we compare n > 2 because the process may stop at 2 instead of 1 if the total length of
the sequence is odd.

[EDIT:]

Let's translate this into assembly!

    MOV RCX, 1000000

    DEC RCX
    AND RCX, -2
    XOR RAX, RAX
    MOV RBX, RAX

    @main:
      XOR RSI, RSI
      LEA RDI, [RCX + 1]

      @loop:
        ADD RSI, 2
        LEA RDX, [RDI + RDI*2 + 2]
        SHR RDX, 1
        SHRD RDI, RDI, 2    ; ror rdi,2 would do the same thing
        CMOVL RDI, RDX      ; Note that SHRD leaves OF = undefined with count>1, and this doesn't work on all CPUs.
        CMOVS RDI, RDX
        CMP RDI, 2
      JA @loop

      LEA RDX, [RSI + 1]
      CMOVE RSI, RDX

      CMP RAX, RSI
      CMOVB RAX, RSI
      CMOVB RBX, RCX

      SUB RCX, 2
    JA @main

    MOV RDI, RCX
    ADD RCX, 10
    PUSH RDI
    PUSH RCX

    @itoa:
      XOR RDX, RDX
      DIV RCX
      ADD RDX, '0'
      PUSH RDX
      TEST RAX, RAX
    JNE @itoa

      PUSH RCX
      LEA RAX, [RBX + 1]
      TEST RBX, RBX
      MOV RBX, RDI
    JNE @itoa

    POP RCX
    INC RDI
    MOV RDX, RDI

    @outp:
      MOV RSI, RSP
      MOV RAX, RDI
      SYSCALL
      POP RAX
      TEST RAX, RAX
    JNE @outp

    LEA RAX, [RDI + 59]
    DEC RDI
    SYSCALL

Use these commands to compile:

    nasm -f elf64 file.asm
    ld -o file file.o

See the C and an improved/bugfixed version of the asm by Peter Cordes on Godbolt. (Editor's note: Sorry for putting my stuff in your answer, but my answer hit the 30k char limit from Godbolt links + text!)

User · Answer

C++ programs are translated to assembly programs during the generation of machine code from the source code. It would be essentially wrong to say that assembly is slower than C++. Moreover, the binary code generated differs from compiler to compiler, so a smart C++ compiler may produce binary code that is more optimal and efficient than a naive assembly programmer's code.

However, I believe your profiling methodology has certain flaws. The following are general guidelines for profiling:

1. Make sure your system is in its normal/idle state. Stop all running processes (applications) that you started or that use the CPU intensively (or poll over the network).
2. Your data set must be larger.
3. Your test must run for more than 5-10 seconds.
4. Do not rely on just one sample. Perform your test N times, collect the results, and calculate the mean or median.

User · Answer

As a generic answer, not specifically directed at this task: In many cases, you can significantly speed up any program by making improvements at a high level. Like calculating data once instead of multiple times, avoiding unnecessary work completely, using caches in the best way, and so on. These things are much easier to do in a high-level language.

Writing assembler code, it is possible to improve on what an optimising compiler does, but it is hard work. And once it's done, your code is much harder to modify, so it is much more difficult to add algorithmic improvements. Sometimes the processor has functionality that you cannot use from a high-level language; inline assembly is often useful in these cases and still lets you use a high-level language.

In the Euler problems, most of the time you succeed by building something, finding why it is slow, building something better, finding why it is slow, and so on and so on. That is very, very hard using assembler. A better algorithm at half the possible speed will usually beat a worse algorithm at full speed, and getting the full speed in assembler isn't trivial.

User · Answer

For the Collatz problem, you can get a significant boost in performance by caching the "tails". This is a time/memory trade-off. See: memoization (https://en.wikipedia.org/wiki/Memoization). You could also look into dynamic programming solutions for other time/memory trade-offs.

Example Python implementation:

    inner_loop = 0

    def collatz_sequence(N, cache):
        global inner_loop

        l = []
        stop = False
        n = N

        tails = []

        while not stop:
            inner_loop += 1
            tmp = n
            l.append(n)
            if n <= 1:
                stop = True
            elif n in cache:
                stop = True
            elif n % 2:
                n = 3*n + 1
            else:
                n = n // 2
            tails.append((tmp, len(l)))

        # remember the tail that follows each newly seen value
        for key, offset in tails:
            if key not in cache:
                cache[key] = l[offset:]

        return l

    def gen_sequence(l, cache):
        for elem in l:
            yield elem
            if elem in cache:
                yield from gen_sequence(cache[elem], cache)
                # plain return: raising StopIteration inside a generator
                # is a RuntimeError since Python 3.7
                return

    if __name__ == "__main__":
        le_cache = {}

        for n in range(1, 4711, 5):
            l = collatz_sequence(n, le_cache)
            print("{}: {}".format(n, len(list(gen_sequence(l, le_cache)))))

        print("inner_loop = {}".format(inner_loop))
