[gcc] Why doesn't GCC optimize a*a*a*a*a*a to (a*a*a)*(a*a*a)?

I am doing some numerical optimization on a scientific application. One thing I noticed is that GCC will optimize the call pow(a,2) by compiling it into a*a, but the call pow(a,6) is not optimized and will actually call the library function pow, which greatly slows down the performance. (In contrast, Intel C++ Compiler, executable icc, will eliminate the library call for pow(a,6).)

What I am curious about is that when I replaced pow(a,6) with a*a*a*a*a*a using GCC 4.5.1 and options "-O3 -lm -funroll-loops -msse4", it uses 5 mulsd instructions:

movapd  %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm14, %xmm13

while if I write (a*a*a)*(a*a*a), it will produce

movapd  %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm13, %xmm13

which reduces the number of multiply instructions to 3. icc has similar behavior.

Why do compilers not recognize this optimization trick?

The answer is


Because a 32-bit floating-point number - such as 1.024 - is not 1.024. In a computer, 1.024 is an interval: from (1.024-e) to (1.024+e), where "e" represents an error. Some people fail to realize this and believe that * in a*a stands for multiplication of arbitrary-precision numbers with no error attached to them. The reason some people fail to realize this is perhaps the math they practiced in elementary school: working only with ideal numbers without errors attached, and believing that it is OK to simply ignore "e" while performing multiplication. They do not see the "e" implicit in "float a=1.2", "a*a*a" and similar C code.

If the majority of programmers recognized (and could act on) the idea that the C expression a*a*a*a*a*a is not actually working with ideal numbers, then GCC would be FREE to optimize "a*a*a*a*a*a" into, say, "t=(a*a); t*t*t", which requires a smaller number of multiplications. But unfortunately, the GCC compiler does not know whether the programmer writing the code thinks of "a" as a number with or without an error. And so GCC will only do what the source code says - because that is all GCC can see with its "naked eye".

... once you know what kind of programmer you are, you can use the "-ffast-math" switch to tell GCC that "Hey, GCC, I know what I am doing!". This will allow GCC to convert a*a*a*a*a*a into a different piece of text - it looks different from a*a*a*a*a*a - but still computes a number within the error interval of a*a*a*a*a*a. This is OK, since you already know you are working with intervals, not ideal numbers.
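
As a quick illustration (the file and function names here are just for the example, and the exact instruction counts depend on the GCC version and target), you can see the effect of the switch by inspecting the generated assembly:

/* pow6.c - illustrative sketch */
double pow6(double a) {
    return a*a*a*a*a*a;
}

/* gcc -O2 -S pow6.c             : five mulsd instructions (strict left-to-right order) */
/* gcc -O2 -ffast-math -S pow6.c : three mulsd instructions (product regrouped)         */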


No posters have mentioned the contraction of floating expressions yet (ISO C standard, 6.5p8 and 7.12.2). If the FP_CONTRACT pragma is set to ON, the compiler is allowed to regard an expression such as a*a*a*a*a*a as a single operation, as if evaluated exactly with a single rounding. For instance, a compiler may replace it by an internal power function that is both faster and more accurate. This is particularly interesting as the behavior is partly controlled by the programmer directly in the source code, while compiler options provided by the end user may sometimes be used incorrectly.

The default state of the FP_CONTRACT pragma is implementation-defined, so that a compiler is allowed to do such optimizations by default. Thus portable code that needs to strictly follow the IEEE 754 rules should explicitly set it to OFF.

If a compiler doesn't support this pragma, it must be conservative by avoiding any such optimization, in case the developer has chosen to set it to OFF.

GCC doesn't support this pragma, but with the default options it assumes it to be ON; thus, for targets with a hardware FMA, if one wants to prevent the transformation of a*b+c to fma(a,b,c), one needs to provide an option such as -ffp-contract=off (to explicitly set the pragma to OFF) or -std=c99 (to tell GCC to conform to some C standard version, here C99, and thus follow the paragraph above). In the past, the latter option did not prevent the transformation, meaning that GCC was not conforming on this point: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=37845
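
For portable code that wants to rule out contraction, a minimal sketch (the function name is just for illustration) would set the standard pragma directly in the source; on GCC, which does not implement the pragma, the command-line option -ffp-contract=off plays the same role:

#pragma STDC FP_CONTRACT OFF

double madd(double a, double b, double c) {
    return a*b + c;  /* with contraction disabled, this must be a separate
                        multiply and add rather than a single fused fma(a,b,c) */
}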


Lambdageek correctly points out that because associativity does not hold for floating-point numbers, the "optimization" of a*a*a*a*a*a to (a*a*a)*(a*a*a) may change the value. This is why it is disallowed by C99 (unless specifically allowed by the user, via compiler flag or pragma). Generally, the assumption is that the programmer wrote what she did for a reason, and the compiler should respect that. If you want (a*a*a)*(a*a*a), write that.

That can be a pain to write, though; why can't the compiler just do [what you consider to be] the right thing when you use pow(a,6)? Because it would be the wrong thing to do. On a platform with a good math library, pow(a,6) is significantly more accurate than either a*a*a*a*a*a or (a*a*a)*(a*a*a). Just to provide some data, I ran a small experiment on my Mac Pro, measuring the worst error in evaluating a^6 for all single-precision floating-point numbers in [1,2):

worst relative error using    powf(a, 6.f): 5.96e-08
worst relative error using (a*a*a)*(a*a*a): 2.94e-07
worst relative error using     a*a*a*a*a*a: 2.58e-07

Using pow instead of a multiplication tree reduces the error bound by a factor of 4. Compilers should not (and generally do not) make "optimizations" that increase error unless licensed to do so by the user (e.g. via -ffast-math).
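
For the curious, here is a rough sketch of how such a measurement could be coded (this is not the original test harness; it scans every single-precision value in [1,2) with nextafterf and uses double-precision pow as the reference):

#include <math.h>
#include <stdio.h>

int main(void) {
    double worst_pow = 0, worst_split = 0, worst_chain = 0;
    for (float a = 1.0f; a < 2.0f; a = nextafterf(a, 2.0f)) {
        double ref = pow((double)a, 6.0);   /* high-precision reference    */
        float p = powf(a, 6.f);             /* library powf                */
        float s = (a*a*a)*(a*a*a);          /* regrouped product           */
        float c = a*a*a*a*a*a;              /* left-to-right product       */
        double ep = fabs(p - ref) / ref;
        double es = fabs(s - ref) / ref;
        double ec = fabs(c - ref) / ref;
        if (ep > worst_pow)   worst_pow = ep;
        if (es > worst_split) worst_split = es;
        if (ec > worst_chain) worst_chain = ec;
    }
    printf("powf: %.3g  (a*a*a)*(a*a*a): %.3g  a*a*a*a*a*a: %.3g\n",
           worst_pow, worst_split, worst_chain);
    return 0;
}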

Note that GCC provides __builtin_powi(x,n) as an alternative to pow( ), which should generate an inline multiplication tree. Use that if you want to trade off accuracy for performance, but do not want to enable fast-math.
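
A minimal sketch of that alternative (the function name is just for illustration; the exact multiplication sequence GCC emits is up to the compiler):

double sixth_power(double a) {
    /* GCC builtin: raise a double to an integer power with an inline
       multiplication tree, without requiring -ffast-math */
    return __builtin_powi(a, 6);
}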


There are already a few good answers to this question, but for the sake of completeness I wanted to point out that the applicable section of the C standard is 5.1.2.2.3/15 (which is the same as section 1.9/9 in the C++11 standard). This section states that operators can only be regrouped if they are really associative or commutative.


Fortran (designed for scientific computing) has a built-in power operator, and as far as I know Fortran compilers will commonly optimize raising to integer powers in a similar fashion to what you describe. C/C++ unfortunately don't have a power operator, only the library function pow(). This doesn't prevent smart compilers from treating pow specially and computing it in a faster way for special cases, but it seems they do it less commonly ...

Some years ago I was trying to make it more convenient to calculate integer powers in an optimal way, and came up with the following. It's C++, not C though, and still depends on the compiler being somewhat smart about how to optimize/inline things. Anyway, hope you might find it useful in practice:

// Recursive template: power_impl<N>::calc(x) computes x^N with a short
// chain of multiplications, splitting on factors of 2 and 3 where possible.
template<unsigned N> struct power_impl;

template<unsigned N> struct power_impl {
    template<typename T>
    static T calc(const T &x) {
        // N is a compile-time constant, so the compiler can fold the
        // branch away in the generated code.
        if (N%2 == 0)
            return power_impl<N/2>::calc(x*x);    // x^N = (x*x)^(N/2)
        else if (N%3 == 0)
            return power_impl<N/3>::calc(x*x*x);  // x^N = (x*x*x)^(N/3)
        return power_impl<N-1>::calc(x)*x;        // x^N = x^(N-1) * x
    }
};

template<> struct power_impl<0> {
    template<typename T>
    static T calc(const T &) { return 1; }        // base case: x^0 == 1
};

// Convenience wrapper: power<6>(a) computes a^6.
template<unsigned N, typename T>
inline T power(const T &x) {
    return power_impl<N>::calc(x);
}

Clarification for the curious: this does not find the optimal way to compute powers, but since finding the optimal solution is an NP-complete problem and this is only worth doing for small powers anyway (as opposed to using pow), there's no reason to fuss over the details.

Then just use it as power<6>(a).

This makes it easy to type powers (no need to spell out six a's with parentheses), and lets you have this kind of optimization without -ffast-math, in case you have something precision-dependent such as compensated summation (an example where the order of operations is essential).

You can probably also forget that this is C++ and just use it in a C program (as long as it is compiled with a C++ compiler).

Hope this can be useful.

EDIT:

This is what I get from my compiler:

For a*a*a*a*a*a,

    movapd  %xmm1, %xmm0
    mulsd   %xmm1, %xmm0
    mulsd   %xmm1, %xmm0
    mulsd   %xmm1, %xmm0
    mulsd   %xmm1, %xmm0
    mulsd   %xmm1, %xmm0

For (a*a*a)*(a*a*a),

    movapd  %xmm1, %xmm0
    mulsd   %xmm1, %xmm0
    mulsd   %xmm1, %xmm0
    mulsd   %xmm0, %xmm0

For power<6>(a),

    mulsd   %xmm0, %xmm0
    movapd  %xmm0, %xmm1
    mulsd   %xmm0, %xmm1
    mulsd   %xmm0, %xmm1

As Lambdageek pointed out, float multiplication is not associative and you can get less accuracy; but even when the regrouping gives better accuracy, you can argue against the optimisation, because you may want a deterministic application. For example, in a client/server game simulation where every client has to simulate the same world, you want floating-point calculations to be deterministic.


GCC does actually optimize a*a*a*a*a*a to (a*a*a)*(a*a*a) when a is an integer. I tried with this command:

$ echo 'int f(int x) { return x*x*x*x*x*x; }' | gcc -o - -O2 -S -masm=intel -x c -

There are a lot of gcc flags but nothing fancy. They mean: Read from stdin; use O2 optimization level; output assembly language listing instead of a binary; the listing should use Intel assembly language syntax; the input is in C language (usually language is inferred from input file extension, but there is no file extension when reading from stdin); and write to stdout.

Here's the important part of the output. I've annotated it with some comments indicating what's going on in the assembly language:

; x is in edi to begin with.  eax will be used as a temporary register.
mov  eax, edi  ; temp = x
imul eax, edi  ; temp = x * temp
imul eax, edi  ; temp = x * temp
imul eax, eax  ; temp = temp * temp

I'm using system GCC on Linux Mint 16 Petra, an Ubuntu derivative. Here's the gcc version:

$ gcc --version
gcc (Ubuntu/Linaro 4.8.1-10ubuntu9) 4.8.1

As other posters have noted, this optimization is not possible in floating point by default, because floating-point arithmetic is not associative.


Library functions like "pow" are usually carefully crafted to yield the minimum possible error (in the generic case). This is usually achieved by approximating the function with splines (according to Pascal's comment, the most common implementation seems to use the Remez algorithm).

Fundamentally, the following operation:

pow(x,y);

has an inherent error of approximately the same magnitude as the error in any single multiplication or division.

While the following operation:

float a=someValue;
float b=a*a*a*a*a*a;

has an inherent error that is more than 5 times the error of a single multiplication or division (because you are combining 5 multiplications).

The compiler has to be really careful about the kind of optimization it is doing:

  1. Optimizing pow(a,6) to a*a*a*a*a*a may improve performance, but drastically reduce the accuracy for floating-point numbers.
  2. Optimizing a*a*a*a*a*a to pow(a,6) may actually reduce the accuracy, because "a" may be some special value that allows multiplication without error (a power of 2 or some small integer).
  3. Optimizing pow(a,6) to (a*a*a)*(a*a*a) or (a*a)*(a*a)*(a*a) can still lose accuracy compared to the pow function.

In general, you know that for arbitrary floating-point values "pow" has better accuracy than any function you could eventually write, but in some special cases multiple multiplications may have better accuracy and performance. It is up to the developer to choose what is more appropriate, and eventually to comment the code so that no one else would "optimize" it.

The only optimization that makes sense (personal opinion, and apparently a choice made in GCC even without any particular optimization level or compiler flag) is replacing "pow(a,2)" with "a*a". That would be the only sane thing a compiler vendor should do.


I would not have expected this case to be optimized at all. It can't be very often that an expression contains subexpressions that can be regrouped to remove entire operations. I would expect compiler writers to invest their time in areas which would be more likely to result in noticeable improvements, rather than covering a rarely encountered edge case.

I was surprised to learn from the other answers that this expression could indeed be optimized with the proper compiler switches. Either the optimization is trivial, or it is an edge case of a much more common optimization, or the compiler writers were extremely thorough.

There's nothing wrong with providing hints to the compiler as you've done here. It's a normal and expected part of the micro-optimization process to rearrange statements and expressions to see what differences they will bring.

While the compiler may be justified in considering the two expressions to deliver inconsistent results (without the proper switches), there's no need for you to be bound by that restriction. The difference will be incredibly tiny - so much so that if the difference matters to you, you should not be using standard floating point arithmetic in the first place.


gcc actually can do this optimization, even for floating-point numbers. For example,

double foo(double a) {
  return a*a*a*a*a*a;
}

becomes

foo(double):
    mulsd   %xmm0, %xmm0
    movapd  %xmm0, %xmm1
    mulsd   %xmm0, %xmm1
    mulsd   %xmm1, %xmm0
    ret

with -O -funsafe-math-optimizations. This reordering violates IEEE-754, though, so it requires the flag.

For signed integers, as Peter Cordes pointed out in a comment, this optimization can be done without -funsafe-math-optimizations, since it holds exactly when there is no overflow, and if there is overflow you get undefined behavior. So you get

foo(long):
    movq    %rdi, %rax
    imulq   %rdi, %rax
    imulq   %rdi, %rax
    imulq   %rax, %rax
    ret

with just -O. For unsigned integers, it's even easier since they work mod powers of 2 and so can be reordered freely even in the face of overflow.
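
For reference, the analogous unsigned version would be (a hypothetical sketch; the exact code GCC generates may differ by version):

unsigned long ufoo(unsigned long a) {
    /* unsigned arithmetic wraps modulo 2^64, so the multiplications can be
       regrouped freely even if intermediate products overflow */
    return a*a*a*a*a*a;
}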


Another similar case: most compilers won't optimize a + b + c + d to (a + b) + (c + d) (this is an optimization since the second expression can be pipelined better) and will instead evaluate it as given (i.e. as (((a + b) + c) + d)). This too is because of corner cases:

float a = 1e35, b = 1e-5, c = -1e35, d = 1e-5;
printf("%e %e\n", a + b + c + d, (a + b) + (c + d));

This outputs 1.000000e-05 0.000000e+00: evaluated left to right, the 1e-5 added to 1e35 is absorbed, 1e35 and -1e35 then cancel, and the final 1e-5 survives; in the regrouped version, each half rounds to ±1e35 and the sum cancels to exactly 0.

