Is < faster than <=?

Question

Is if (a < 901) faster than if (a <= 900)?

Not exactly as in this simple example, but there are slight performance changes in complex loop code. I suppose this has something to do with the generated machine code, if it is even true.

User · Answer

TL;DR answer

For most combinations of architecture, compiler and language, < will not be faster than <=.

Full answer

Other answers have concentrated on the x86 architecture, and I don't know the ARM architecture (which your example assembler seems to be) well enough to comment specifically on the code generated, but this is an example of a micro-optimisation which is very architecture specific, and is as likely to be an anti-optimisation as it is to be an optimisation.

As such, I would suggest that this sort of micro-optimisation is an example of cargo cult programming rather than best software engineering practice.

Counterexample

There are probably some architectures where this is an optimisation, but I know of at least one architecture where the opposite may be true. The venerable Transputer architecture only had machine code instructions for equal to and greater than or equal to, so all comparisons had to be built from these primitives.

Even then, in almost all cases, the compiler could order the evaluation instructions in such a way that, in practice, no comparison had any advantage over any other. Worst case though, it might need to add a reverse instruction (REV) to swap the top two items on the operand stack. This was a single-byte instruction which took a single cycle to run, so it had the smallest overhead possible.

Summary

Whether a micro-optimisation like this is an optimisation or an anti-optimisation depends on the specific architecture you are using, so it is usually a bad idea to get into the habit of using architecture-specific micro-optimisations; otherwise you might instinctively use one when it is inappropriate to do so, and it looks like this is exactly what the book you are reading is advocating.
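As a rough illustration of the counterexample, the sketch below builds the remaining comparisons from just two primitives, "equal" and "greater than or equal", as the answer describes. This is plain C++, not Transputer code; the function names are hypothetical, and swapping the arguments merely stands in for the REV instruction.

```cpp
#include <cassert>

// The two primitives the answer says were available (names are illustrative).
constexpr bool prim_eq(int a, int b) { return a == b; }
constexpr bool prim_ge(int a, int b) { return a >= b; }

// Everything else is derived by swapping operands (the role REV plays)
// or by inverting the sense of the result.
constexpr bool derived_lt(int a, int b) { return !prim_ge(a, b); }
constexpr bool derived_le(int a, int b) { return prim_ge(b, a); }   // operands swapped
constexpr bool derived_gt(int a, int b) { return !prim_ge(b, a); }  // swapped and inverted
constexpr bool derived_ne(int a, int b) { return !prim_eq(a, b); }

// Compare the derived forms with the built-in operators over a small range.
void check_derived_comparisons() {
    for (int a = -3; a <= 3; ++a) {
        for (int b = -3; b <= 3; ++b) {
            assert(derived_lt(a, b) == (a < b));
            assert(derived_le(a, b) == (a <= b));
            assert(derived_gt(a, b) == (a > b));
            assert(derived_ne(a, b) == (a != b));
        }
    }
}
```

Each derived comparison costs at most one operand swap or one inversion, which matches the answer's point that no operator has a built-in advantage.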

User · Answer

This would be highly dependent on the underlying architecture that the C is compiled to. Some processors and architectures might have explicit instructions for equal to, or less than and equal to, which execute in different numbers of cycles.

That would be pretty unusual though, as the compiler could work around it, making it irrelevant.

User · Answer

No, it will not be faster on most architectures. You didn't specify, but on x86, all of the integral comparisons will typically be implemented in two machine instructions:

- A test or cmp instruction, which sets EFLAGS
- And a Jcc (jump) instruction, depending on the comparison type (and code layout):
  - jne - Jump if not equal --> ZF = 0
  - jz - Jump if zero (equal) --> ZF = 1
  - jg - Jump if greater --> ZF = 0 and SF = OF
  - (etc...)

Example (edited for brevity), compiled with $ gcc -m32 -S -masm=intel test.c

    if (a < b) {
        // Do something 1
    }

Compiles to:

    mov     eax, DWORD PTR [esp+24]      ; a
    cmp     eax, DWORD PTR [esp+28]      ; b
    jge     .L2                          ; jump if a is >= b
    ; Do something 1
.L2:

And

    if (a <= b) {
        // Do something 2
    }

Compiles to:

    mov     eax, DWORD PTR [esp+24]      ; a
    cmp     eax, DWORD PTR [esp+28]      ; b
    jg      .L5                          ; jump if a is > b
    ; Do something 2
.L5:

So the only difference between the two is a jg versus a jge instruction. The two will take the same amount of time.

I'd like to address the comment that nothing indicates that the different jump instructions take the same amount of time. This one is a little tricky to answer, but here's what I can give: In the Intel Instruction Set Reference, they are all grouped together under one common instruction, Jcc (Jump if condition is met). The same grouping is made in the Optimization Reference Manual, in Appendix C, Latency and Throughput.

"Latency: The number of clock cycles that are required for the execution core to complete the execution of all of the μops that form an instruction."

"Throughput: The number of clock cycles required to wait before the issue ports are free to accept the same instruction again. For many instructions, the throughput of an instruction can be significantly less than its latency."

The values for Jcc are:

          Latency   Throughput
    Jcc     N/A        0.5

with the following footnote on Jcc:

"7) Selection of conditional jump instructions should be based on the recommendation of section Section 3.4.1, "Branch Prediction Optimization," to improve the predictability of branches. When branches are predicted successfully, the latency of jcc is effectively zero."

So, nothing in the Intel docs ever treats one Jcc instruction any differently from the others.

If one thinks about the actual circuitry used to implement the instructions, one can assume that there would be simple AND/OR gates on the different bits in EFLAGS, to determine whether the conditions are met. There is, then, no reason that an instruction testing two bits should take any more or less time than one testing only one (ignoring gate propagation delay, which is much less than the clock period).

Edit: Floating Point

This holds true for x87 floating point as well. (Pretty much the same code as above, but with double instead of int.)

        fld     QWORD PTR [esp+32]
        fld     QWORD PTR [esp+40]
        fucomip st, st(1)              ; Compare ST(0) and ST(1), and set CF, PF, ZF in EFLAGS
        fstp    st(0)
        seta    al                     ; Set al if above (CF=0 and ZF=0).
        test    al, al
        je      .L2
        ; Do something 1
.L2:

        fld     QWORD PTR [esp+32]
        fld     QWORD PTR [esp+40]
        fucomip st, st(1)              ; (same thing as above)
        fstp    st(0)
        setae   al                     ; Set al if above or equal (CF=0).
        test    al, al
        je      .L5
        ; Do something 2
.L5:
        leave
        ret
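To repeat the experiment above on your own toolchain, a minimal pair of functions like the following can be compiled to assembly and compared. The function names are hypothetical; the flags mentioned in the note below are the ones the answer used, and the exact instructions emitted will depend on compiler, version and optimisation level.

```cpp
#include <cassert>

// Two functions that differ only in the relational operator.
// Compile to assembly and compare the instruction after the cmp.
bool less_than(int a, int b) { return a < b; }
bool less_or_equal(int a, int b) { return a <= b; }

// Quick behavioural check, callable from any main().
void check_relational_pair() {
    assert(less_than(1, 2) && less_or_equal(1, 2));
    assert(!less_than(2, 2) && less_or_equal(2, 2));
    assert(!less_than(3, 2) && !less_or_equal(3, 2));
}
```

With GCC on x86, something like g++ -S -masm=intel on this file should show the same compare for both functions, followed by instructions that differ only in the condition code they test.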

User · Answer

Maybe the author of that unnamed book has read that a > 0 runs faster than a >= 1 and thinks that is true universally.

But that is because a 0 is involved (CMP can, depending on the architecture, be replaced e.g. with OR), and not because of the <.

User · Answer

You should not be able to notice the difference even if there is any. Besides, in practice, you'll have to do an additional a + 1 or a - 1 to make the condition stand unless you're going to use some magic constants, which is a very bad practice by all means.

User · Answer

Historically (we're talking the 1980s and early 1990s), there were some architectures in which this was true. The root issue is that integer comparison is inherently implemented via integer subtractions. This gives rise to the following cases.

    Comparison     Subtraction
    ----------     -----------
    A < B      --> A - B < 0
    A = B      --> A - B = 0
    A > B      --> A - B > 0

Now, when A < B the subtraction has to borrow a high bit for the subtraction to be correct, just like you carry and borrow when adding and subtracting by hand. This "borrowed" bit was usually referred to as the carry bit and would be testable by a branch instruction. A second bit called the zero bit would be set if the subtraction were identically zero, which implied equality.

There were usually at least two conditional branch instructions, one to branch on the carry bit and one on the zero bit.

Now, to get at the heart of the matter, let's expand the previous table to include the carry and zero bit results.

    Comparison     Subtraction  Carry Bit  Zero Bit
    ----------     -----------  ---------  --------
    A < B      --> A - B < 0    0          0
    A = B      --> A - B = 0    1          1
    A > B      --> A - B > 0    1          0

So, implementing a branch for A < B can be done in one instruction, because the carry bit is clear only in this case, that is,

    ;; Implementation of "if (A < B) goto address;"
    cmp  A, B          ;; compare A to B
    bcz  address       ;; Branch if Carry is Zero to the new address

But, if we want to do a less-than-or-equal comparison, we need to do an additional check of the zero flag to catch the case of equality.

    ;; Implementation of "if (A <= B) goto address;"
    cmp  A, B          ;; compare A to B
    bcz  address       ;; branch if A < B
    bzs  address       ;; also, Branch if the Zero bit is Set

So, on some machines, using a "less than" comparison might save one machine instruction. This was relevant in the era of sub-megahertz processor speed and 1:1 CPU-to-memory speed ratios, but it is almost totally irrelevant today.
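The flag table above can be modelled in a few lines of C++. This is only a sketch of the hypothetical machine in the answer (8-bit unsigned operands, a carry bit that is clear only when a borrow occurred); the type and function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Flags as in the table: carry is 0 only when A < B, zero is 1 only when A == B.
struct Flags {
    bool carry;
    bool zero;
};

// Model of the hypothetical `cmp A, B` instruction.
Flags cmp(std::uint8_t a, std::uint8_t b) {
    return Flags{a >= b, a == b};
}

// `bcz address`: one conditional branch is enough for A < B.
bool branches_on_less(std::uint8_t a, std::uint8_t b) {
    return !cmp(a, b).carry;
}

// `bcz address` then `bzs address`: A <= B needs two branch tests here.
bool branches_on_less_equal(std::uint8_t a, std::uint8_t b) {
    const Flags f = cmp(a, b);
    if (!f.carry) return true;  // bcz taken: A < B
    return f.zero;              // bzs taken: A == B
}

// Check the model against the built-in operators for every operand pair.
void check_flag_model() {
    for (int a = 0; a < 256; ++a) {
        for (int b = 0; b < 256; ++b) {
            const auto x = static_cast<std::uint8_t>(a);
            const auto y = static_cast<std::uint8_t>(b);
            assert(branches_on_less(x, y) == (a < b));
            assert(branches_on_less_equal(x, y) == (a <= b));
        }
    }
}
```

The second function inspects two flags in sequence, which is the extra instruction the answer refers to.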

User · Answer

Only if the people who created the computers are bad with boolean logic. Which they shouldn't be.

Every comparison (>=, <=, >, <) can be done at the same speed.

What every comparison is, is just a subtraction (the difference) and seeing if it's positive or negative. (If the msb is set, the number is negative.)

How to check a >= b? Sub a-b >= 0. Check if a-b is positive.
How to check a <= b? Sub 0 <= b-a. Check if b-a is positive.
How to check a < b? Sub a-b < 0. Check if a-b is negative.
How to check a > b? Sub 0 > b-a. Check if b-a is negative.

Simply put, the computer can just do this underneath the hood for the given op:

    a >= b == msb(a-b)==0
    a <= b == msb(b-a)==0
    a > b  == msb(b-a)==1
    a < b  == msb(a-b)==1

And of course the computer wouldn't actually need to do the ==0 or ==1 either. For the ==0 it could just invert the msb from the circuit.

Anyway, they most certainly wouldn't have made a >= b be calculated as a>b || a==b lol
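The four identities above can be checked with a short sketch. One assumption is added here that the answer does not state: the subtraction is done in 64 bits on 32-bit inputs so that the difference cannot overflow (hardware typically handles that case with an overflow flag instead). All names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Most significant (sign) bit of a 64-bit value.
bool msb(std::int64_t x) {
    return (static_cast<std::uint64_t>(x) >> 63) != 0;
}

// Widen before subtracting so the difference of two 32-bit values fits.
std::int64_t diff(std::int32_t a, std::int32_t b) {
    return static_cast<std::int64_t>(a) - static_cast<std::int64_t>(b);
}

bool msb_ge(std::int32_t a, std::int32_t b) { return !msb(diff(a, b)); }
bool msb_le(std::int32_t a, std::int32_t b) { return !msb(diff(b, a)); }
bool msb_gt(std::int32_t a, std::int32_t b) { return msb(diff(b, a)); }
bool msb_lt(std::int32_t a, std::int32_t b) { return msb(diff(a, b)); }

// Compare against the built-in operators, including the extreme values.
void check_msb_comparisons() {
    const std::int32_t samples[] = {INT32_MIN, -1, 0, 1, INT32_MAX};
    for (std::int32_t a : samples) {
        for (std::int32_t b : samples) {
            assert(msb_ge(a, b) == (a >= b));
            assert(msb_le(a, b) == (a <= b));
            assert(msb_gt(a, b) == (a > b));
            assert(msb_lt(a, b) == (a < b));
        }
    }
}
```

Each comparison is one subtraction plus one bit test, optionally inverted, so none of the four is inherently more work than the others.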

User · Answer

When I wrote the first version of this answer, I was only looking at the title question about < vs. <= in general, not the specific example of a constant a < 901 vs. a <= 900. Many compilers always shrink the magnitude of constants by converting between < and <=, e.g. because x86 immediate operands have a shorter 1-byte encoding for -128..127.

For ARM, being able to encode as an immediate depends on being able to rotate a narrow field into any position in a word. So cmp r0, #0x00f000 would be encodeable, while cmp r0, #0x00efff would not be. So the make-it-smaller rule for comparison vs. a compile-time constant doesn't always apply for ARM. AArch64 is either shift-by-12 or not, instead of an arbitrary rotation, for instructions like cmp and cmn, unlike 32-bit ARM and Thumb modes.

< vs. <= in general, including for runtime-variable conditions

In assembly language on most machines, a comparison for <= has the same cost as a comparison for <. This applies whether you're branching on it, booleanizing it to create a 0/1 integer, or using it as a predicate for a branchless select operation (like x86 CMOV). The other answers have only addressed this part of the question.

But this question is about the C++ operators, the input to the optimizer. Normally they're both equally efficient; the advice from the book sounds totally bogus because compilers can always transform the comparison that they implement in asm. But there is at least one exception where using <= can accidentally create something the compiler can't optimize.

As a loop condition, there are cases where <= is qualitatively different from <, when it stops the compiler from proving that a loop is not infinite. This can make a big difference, disabling auto-vectorization.

Unsigned overflow is well-defined as base-2 wrap around, unlike signed overflow (UB). Signed loop counters are generally safe from this with compilers that optimize based on signed-overflow UB not happening: ++i <= size will always eventually become false. (What Every C Programmer Should Know About Undefined Behavior)

    void foo(unsigned size) {
        unsigned upper_bound = size - 1;  // or any calculation that could produce UINT_MAX
        for(unsigned i=0 ; i <= upper_bound ; i++)
            ...

Compilers can only optimize in ways that preserve the (defined and legally observable) behaviour of the C++ source for all possible input values, except ones that lead to undefined behaviour.

(A simple i <= size would create the problem too, but I thought calculating an upper bound was a more realistic example of accidentally introducing the possibility of an infinite loop for an input you don't care about but which the compiler must consider.)

In this case, size=0 leads to upper_bound=UINT_MAX, and i <= UINT_MAX is always true. So this loop is infinite for size=0, and the compiler has to respect that even though you as the programmer probably never intend to pass size=0. If the compiler can inline this function into a caller where it can prove that size=0 is impossible, then great, it can optimize like it could for i < size.

Asm like if(!size) skip the loop; do{...}while(--size); is one normally-efficient way to optimize a for( i<size ) loop, if the actual value of i isn't needed inside the loop (Why are loops always compiled into "do...while" style (tail jump)?).

But that do{}while can't be infinite: if entered with size==0, we get 2^n iterations. (Iterating over all unsigned integers in a for loop: C makes it possible to express a loop over all unsigned integers including zero, but it's not easy without a carry flag the way it is in asm.)

With wraparound of the loop counter being a possibility, modern compilers often just "give up", and don't optimize nearly as aggressively.

Example: sum of integers from 1 to n

Using unsigned i <= n defeats clang's idiom-recognition that optimizes sum(1 .. n) loops with a closed form based on Gauss's n * (n+1) / 2 formula.

    unsigned sum_1_to_n_finite(unsigned n) {
        unsigned total = 0;
        for (unsigned i = 0 ; i < n+1 ; ++i)
            total += i;
        return total;
    }

x86-64 asm from clang7.0 and gcc8.2 on the Godbolt compiler explorer:

     # clang7.0 -O3 closed-form
        cmp     edi, -1       # n passed in EDI: x86-64 System V calling convention
        je      .LBB1_1       # if (n == UINT_MAX) return 0;  // C++ loop runs 0 times
              # else fall through into the closed-form calc
        mov     ecx, edi         # zero-extend n into RCX
        lea     eax, [rdi - 1]   # n-1
        imul    rax, rcx         # n * (n-1)             # 64-bit
        shr     rax              # n * (n-1) / 2
        add     eax, edi         # n + (stuff / 2) = n * (n+1) / 2   # truncated to 32-bit
        ret          # computed without possible overflow of the product before right shifting
    .LBB1_1:
        xor     eax, eax
        ret

But for the naive version, we just get a dumb loop from clang.

    unsigned sum_1_to_n_naive(unsigned n) {
        unsigned total = 0;
        for (unsigned i = 0 ; i<=n ; ++i)
            total += i;
        return total;
    }

    # clang7.0 -O3
    sum_1_to_n(unsigned int):
        xor     ecx, ecx           # i = 0
        xor     eax, eax           # retval = 0
    .LBB0_1:                       # do {
        add     eax, ecx             # retval += i
        add     ecx, 1               # ++i
        cmp     ecx, edi
        jbe     .LBB0_1            # } while( i<=n );
        ret

GCC doesn't use a closed form either way, so the choice of loop condition doesn't really hurt it; it auto-vectorizes with SIMD integer addition, running 4 i values in parallel in the elements of an XMM register.

    # "naive" inner loop
    .L3:
        add     eax, 1       # do {
        paddd   xmm0, xmm1    # vect_total_4.6, vect_vec_iv_.5
        paddd   xmm1, xmm2    # vect_vec_iv_.5, tmp114
        cmp     edx, eax      # bnd.1, ivtmp.14     # bound and induction-variable tmp, I think.
        ja      .L3 #,       # }while( n > i )

    # "finite" inner loop
      # before the loop:
      # xmm0 = 0 = totals
      # xmm1 = {0,1,2,3} = i
      # xmm2 = set1_epi32(4)
    .L13:                # do {
        add     eax, 1       # i++
        paddd   xmm0, xmm1    # total[0..3] += i[0..3]
        paddd   xmm1, xmm2    # i[0..3] += 4
        cmp     eax, edx
        jne     .L13      # }while( i != upper_limit );

        # then horizontal sum xmm0
        # and peeled cleanup for the last n%3 iterations, or something.

It also has a plain scalar loop which I think it uses for very small n, and/or for the infinite loop case.

BTW, both of these loops waste an instruction (and a uop on Sandybridge-family CPUs) on loop overhead. sub eax,1 / jnz instead of add eax,1 / cmp / jcc would be more efficient: 1 uop instead of 2 (after macro-fusion of sub/jcc or cmp/jcc). The code after both loops writes EAX unconditionally, so it's not using the final value of the loop counter.
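The wraparound hazard described in this answer can be seen directly with a small sketch. The function names are hypothetical. The half-open loop below terminates for every input, and the closed form is computed in 64 bits, on the assumption that unsigned is 32 bits wide, so the product cannot overflow before the division.

```cpp
#include <cassert>
#include <climits>

// size - 1 wraps to UINT_MAX when size == 0, so a loop written as
// `i <= upper_bound` could never exit for that input.
unsigned upper_bound_for(unsigned size) { return size - 1; }

// Half-open form: sums 0 + 1 + ... + (n - 1) and terminates for every n.
unsigned sum_below(unsigned n) {
    unsigned total = 0;
    for (unsigned i = 0; i < n; ++i) {
        total += i;
    }
    return total;
}

// Closed form of the same sum, n * (n - 1) / 2, truncated back to unsigned.
unsigned sum_below_closed_form(unsigned n) {
    const unsigned long long wide = n;
    return static_cast<unsigned>(wide * (wide - 1) / 2);
}

// The loop and the closed form agree for small n, including n == 0.
void check_sums_agree() {
    for (unsigned n = 0; n <= 1000; ++n) {
        assert(sum_below(n) == sum_below_closed_form(n));
    }
}
```

Because the i < n loop provably terminates, a compiler is free to replace it with something like the closed form; the i <= upper_bound variant has no such guarantee when upper_bound can be UINT_MAX.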

User · Answer

They have the same speed. Maybe on some special architecture what he/she said is right, but in the x86 family at least I know they are the same. To do this, the CPU performs a subtraction (a - b) and then checks the flags of the flag register. Two bits of that register are called ZF (zero flag) and SF (sign flag), and it is done in one cycle, because it is done with one mask operation.

User · Answer

I see that neither is faster. The compiler generates the same machine code in each condition with a different value.

    if(a < 901)
    cmpl  $900, -4(%rbp)
    jg .L2

    if(a <=901)
    cmpl  $901, -4(%rbp)
    jg .L3

My example if is from GCC on the x86_64 platform on Linux.

Compiler writers are pretty smart people, and they think of these things and many others most of us take for granted.

I noticed that if it is not a constant, then the same machine code is generated in either case.

    int b;
    if(a < b)
    cmpl  -4(%rbp), %eax
    jge   .L2

    if(a <=b)
    cmpl  -4(%rbp), %eax
    jg .L3
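The reason a compiler is free to pick whichever constant it prefers is that, for integers, the two conditions from the question are the same predicate. A minimal sketch with hypothetical function names:

```cpp
#include <cassert>

// The two spellings from the question.
constexpr bool below_901(int a) { return a < 901; }
constexpr bool at_most_900(int a) { return a <= 900; }

// Boundary values checked at compile time.
static_assert(below_901(900) && at_most_900(900), "900 satisfies both");
static_assert(!below_901(901) && !at_most_900(901), "901 satisfies neither");

// The predicates agree on a range of values around the boundary.
void check_same_predicate() {
    for (int a = -2000; a <= 2000; ++a) {
        assert(below_901(a) == at_most_900(a));
    }
}
```

Since both functions return the same result for every int, an optimizer may emit identical code for them.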

User · Answer

At the very least, if this were true a compiler could trivially optimise a <= b to !(a > b), and so even if the comparison itself were actually slower, with all but the most naive compiler you would not notice a difference.

User · Answer

You could say that line is correct in most scripting languages, since the extra character results in slightly slower code processing. However, as the top answer pointed out, it should have no effect in C++, and anything being done with a scripting language probably isn't that concerned about optimization.

User · Answer

For floating point code, the <= comparison may indeed be slower (by one instruction) even on modern architectures. Here's the first function:

    int compare_strict(double a, double b) { return a < b; }

On PowerPC, first this performs a floating point comparison (which updates cr, the condition register), then moves the condition register to a GPR, shifts the "compared less than" bit into place, and then returns. It takes four instructions.

Now consider this function instead:

    int compare_loose(double a, double b) { return a <= b; }

This requires the same work as compare_strict above, but now there are two bits of interest: "was less than" and "was equal to." This requires an extra instruction (cror - condition register bitwise OR) to combine these two bits into one. So compare_loose requires five instructions, while compare_strict requires four.

You might think that the compiler could optimize the second function like so:

    int compare_loose(double a, double b) { return ! (a > b); }

However this will incorrectly handle NaNs. NaN1 <= NaN2 and NaN1 > NaN2 need to both evaluate to false.
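The NaN caveat is easy to confirm in standard C++. The sketch below assumes IEEE 754 semantics with default compiler settings (options such as -ffast-math change the rules); compare_loose_rewritten and nan_value are hypothetical names for the tempting but incorrect rewrite and a small helper.

```cpp
#include <cassert>
#include <limits>

// The comparison as written in the answer.
bool compare_loose(double a, double b) { return a <= b; }

// The rewrite a compiler must NOT apply to floating point.
bool compare_loose_rewritten(double a, double b) { return !(a > b); }

// A quiet NaN for testing.
double nan_value() { return std::numeric_limits<double>::quiet_NaN(); }

// The two forms agree on ordinary numbers but disagree when a NaN is involved.
void check_nan_behaviour() {
    assert(compare_loose(1.0, 2.0) == compare_loose_rewritten(1.0, 2.0));
    assert(compare_loose(2.0, 1.0) == compare_loose_rewritten(2.0, 1.0));
    assert(!compare_loose(nan_value(), 1.0));           // NaN <= 1.0 is false
    assert(compare_loose_rewritten(nan_value(), 1.0));  // !(NaN > 1.0) is true
}
```

Every ordered comparison with a NaN is false, so negating > yields true where <= yields false.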

User · Answer

Assuming we're talking about internal integer types, there's no possible way one could be faster than the other. They're obviously semantically identical. They both ask the compiler to do precisely the same thing. Only a horribly broken compiler would generate inferior code for one of these.

If there was some platform where < was faster than <= for simple integer types, the compiler should always convert <= to < for constants. Any compiler that didn't would just be a bad compiler (for that platform).
