[gcc] Why doesn't GCC optimize a*a*a*a*a*a to (a*a*a)*(a*a*a)?

I am doing some numerical optimization on a scientific application. One thing I noticed is that GCC will optimize the call pow(a,2) by compiling it into a*a, but the call pow(a,6) is not optimized and will actually call the library function pow, which greatly slows down the performance. (In contrast, Intel C++ Compiler, executable icc, will eliminate the library call for pow(a,6).)

What I am curious about is that when I replaced pow(a,6) with a*a*a*a*a*a using GCC 4.5.1 and options "-O3 -lm -funroll-loops -msse4", it uses 5 mulsd instructions:

movapd  %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm14, %xmm13

while if I write (a*a*a)*(a*a*a), it will produce

movapd  %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm13, %xmm13

which reduces the number of multiply instructions to 3. icc has similar behavior.

Why do compilers not recognize this optimization trick?

The answer is


Because a 32-bit floating-point number - such as 1.024 - is not 1.024. In a computer, 1.024 is an interval: from (1.024-e) to (1.024+e), where "e" represents an error. Some people fail to realize this and believe that * in a*a stands for multiplication of arbitrary-precision numbers with no error attached to them. The reason some people fail to realize this is perhaps the math they practiced in elementary school: working only with ideal numbers without errors attached, and believing that it is OK to simply ignore "e" while performing multiplication. They do not see the "e" implicit in "float a=1.2", "a*a*a" and similar C code.

If the majority of programmers recognized (and could act on) the idea that the C expression a*a*a*a*a*a is not actually working with ideal numbers, then GCC would be FREE to optimize "a*a*a*a*a*a" into, say, "t=(a*a); t*t*t", which requires a smaller number of multiplications. But unfortunately, the GCC compiler does not know whether the programmer writing the code thinks of "a" as a number with or without an error. And so GCC will only do what the source code says - because that is all GCC can see with its "naked eye".

... once you know what kind of programmer you are, you can use the "-ffast-math" switch to tell GCC that "Hey, GCC, I know what I am doing!". This will allow GCC to convert a*a*a*a*a*a into a different piece of text - it looks different from a*a*a*a*a*a - but still computes a number within the error interval of a*a*a*a*a*a. This is OK, since you already know you are working with intervals, not ideal numbers.
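
As a quick illustration (the file and function names here are just for the example, and the exact instruction counts depend on the GCC version and target), you can see the effect of the switch by inspecting the generated assembly:

/* pow6.c - illustrative sketch */
double pow6(double a) {
    return a*a*a*a*a*a;
}

/* gcc -O2 -S pow6.c             : five mulsd instructions (strict left-to-right order) */
/* gcc -O2 -ffast-math -S pow6.c : three mulsd instructions (product regrouped)         */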


No posters have mentioned the contraction of floating expressions yet (ISO C standard, 6.5p8 and 7.12.2). If the FP_CONTRACT pragma is set to ON, the compiler is allowed to regard an expression such as a*a*a*a*a*a as a single operation, as if evaluated exactly with a single rounding. For instance, a compiler may replace it by an internal power function that is both faster and more accurate. This is particularly interesting as the behavior is partly controlled by the programmer directly in the source code, while compiler options provided by the end user may sometimes be used incorrectly.

The default state of the FP_CONTRACT pragma is implementation-defined, so that a compiler is allowed to do such optimizations by default. Thus portable code that needs to strictly follow the IEEE 754 rules should explicitly set it to OFF.

If a compiler doesn't support this pragma, it must be conservative by avoiding any such optimization, in case the developer has chosen to set it to OFF.

GCC doesn't support this pragma, but with the default options it assumes it to be ON; thus, for targets with a hardware FMA, if one wants to prevent the transformation of a*b+c to fma(a,b,c), one needs to provide an option such as -ffp-contract=off (to explicitly set the pragma to OFF) or -std=c99 (to tell GCC to conform to some C standard version, here C99, and thus follow the paragraph above). In the past, the latter option did not prevent the transformation, meaning that GCC was not conforming on this point: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=37845
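
For portable code that wants to rule out contraction, a minimal sketch (the function name is just for illustration) would set the standard pragma directly in the source; on GCC, which does not implement the pragma, the command-line option -ffp-contract=off plays the same role:

#pragma STDC FP_CONTRACT OFF

double madd(double a, double b, double c) {
    return a*b + c;  /* with contraction disabled, this must be a separate
                        multiply and add rather than a single fused fma(a,b,c) */
}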


Lambdageek correctly points out that because associativity does not hold for floating-point numbers, the "optimization" of a*a*a*a*a*a to (a*a*a)*(a*a*a) may change the value. This is why it is disallowed by C99 (unless specifically allowed by the user, via compiler flag or pragma). Generally, the assumption is that the programmer wrote what she did for a reason, and the compiler should respect that. If you want (a*a*a)*(a*a*a), write that.

That can be a pain to write, though; why can't the compiler just do [what you consider to be] the right thing when you use pow(a,6)? Because it would be the wrong thing to do. On a platform with a good math library, pow(a,6) is significantly more accurate than either a*a*a*a*a*a or (a*a*a)*(a*a*a). Just to provide some data, I ran a small experiment on my Mac Pro, measuring the worst error in evaluating a^6 for all single-precision floating-point numbers in [1,2):

worst relative error using    powf(a, 6.f): 5.96e-08
worst relative error using (a*a*a)*(a*a*a): 2.94e-07
worst relative error using     a*a*a*a*a*a: 2.58e-07

Using pow instead of a multiplication tree reduces the error bound by a factor of 4. Compilers should not (and generally do not) make "optimizations" that increase error unless licensed to do so by the user (e.g. via -ffast-math).
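
For the curious, here is a rough sketch of how such a measurement could be coded (this is not the original test harness; it scans every single-precision value in [1,2) with nextafterf and uses double-precision pow as the reference):

#include <math.h>
#include <stdio.h>

int main(void) {
    double worst_pow = 0, worst_split = 0, worst_chain = 0;
    for (float a = 1.0f; a < 2.0f; a = nextafterf(a, 2.0f)) {
        double ref = pow((double)a, 6.0);   /* high-precision reference    */
        float p = powf(a, 6.f);             /* library powf                */
        float s = (a*a*a)*(a*a*a);          /* regrouped product           */
        float c = a*a*a*a*a*a;              /* left-to-right product       */
        double ep = fabs(p - ref) / ref;
        double es = fabs(s - ref) / ref;
        double ec = fabs(c - ref) / ref;
        if (ep > worst_pow)   worst_pow = ep;
        if (es > worst_split) worst_split = es;
        if (ec > worst_chain) worst_chain = ec;
    }
    printf("powf: %.3g  (a*a*a)*(a*a*a): %.3g  a*a*a*a*a*a: %.3g\n",
           worst_pow, worst_split, worst_chain);
    return 0;
}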

Note that GCC provides __builtin_powi(x,n) as an alternative to pow( ), which should generate an inline multiplication tree. Use that if you want to trade off accuracy for performance, but do not want to enable fast-math.
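
A minimal sketch of that alternative (the function name is just for illustration; the exact multiplication sequence GCC emits is up to the compiler):

double sixth_power(double a) {
    /* GCC builtin: raise a double to an integer power with an inline
       multiplication tree, without requiring -ffast-math */
    return __builtin_powi(a, 6);
}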


There are already a few good answers to this question, but for the sake of completeness I wanted to point out that the applicable section of the C standard is 5.1.2.2.3/15 (which is the same as section 1.9/9 in the C++11 standard). This section states that operators can only be regrouped if they are really associative or commutative.


Fortran (designed for scientific computing) has a built-in power operator, and as far as I know Fortran compilers will commonly optimize raising to integer powers in a similar fashion to what you describe. C/C++ unfortunately don't have a power operator, only the library function pow(). This doesn't prevent smart compilers from treating pow specially and computing it in a faster way for special cases, but it seems they do it less commonly ...

Some years ago I was trying to make it more convenient to calculate integer powers in an optimal way, and came up with the following. It's C++, not C though, and still depends on the compiler being somewhat smart about how to optimize/inline things. Anyway, hope you might find it useful in practice:

// Recursive template: power_impl<N>::calc(x) computes x^N with a short
// chain of multiplications, splitting on factors of 2 and 3 where possible.
template<unsigned N> struct power_impl;

template<unsigned N> struct power_impl {
    template<typename T>
    static T calc(const T &x) {
        // N is a compile-time constant, so the compiler can fold the
        // branch away in the generated code.
        if (N%2 == 0)
            return power_impl<N/2>::calc(x*x);    // x^N = (x*x)^(N/2)
        else if (N%3 == 0)
            return power_impl<N/3>::calc(x*x*x);  // x^N = (x*x*x)^(N/3)
        return power_impl<N-1>::calc(x)*x;        // x^N = x^(N-1) * x
    }
};

template<> struct power_impl<0> {
    template<typename T>
    static T calc(const T &) { return 1; }        // base case: x^0 == 1
};

// Convenience wrapper: power<6>(a) computes a^6.
template<unsigned N, typename T>
inline T power(const T &x) {
    return power_impl<N>::calc(x);
}

Clarification for the curious: this does not find the optimal way to compute powers, but since finding the optimal solution is an NP-complete problem and this is only worth doing for small powers anyway (as opposed to using pow), there's no reason to fuss over the details.

Then just use it as power<6>(a).

This makes it easy to type powers (no need to spell out six a's with parentheses), and lets you have this kind of optimization without -ffast-math, in case you have something precision-dependent such as compensated summation (an example where the order of operations is essential).

You can probably also forget that this is C++ and just use it in a C program (as long as it is compiled with a C++ compiler).

Hope this can be useful.

EDIT:

This is what I get from my compiler:

For a*a*a*a*a*a,

    movapd  %xmm1, %xmm0
    mulsd   %xmm1, %xmm0
    mulsd   %xmm1, %xmm0
    mulsd   %xmm1, %xmm0
    mulsd   %xmm1, %xmm0
    mulsd   %xmm1, %xmm0

For (a*a*a)*(a*a*a),

    movapd  %xmm1, %xmm0
    mulsd   %xmm1, %xmm0
    mulsd   %xmm1, %xmm0
    mulsd   %xmm0, %xmm0

For power<6>(a),

    mulsd   %xmm0, %xmm0
    movapd  %xmm0, %xmm1
    mulsd   %xmm0, %xmm1
    mulsd   %xmm0, %xmm1

As Lambdageek pointed out, float multiplication is not associative and you can get less accuracy; but even when the regrouping gives better accuracy, you can argue against the optimisation, because you may want a deterministic application. For example, in a client/server game simulation where every client has to simulate the same world, you want floating-point calculations to be deterministic.


GCC does actually optimize a*a*a*a*a*a to (a*a*a)*(a*a*a) when a is an integer. I tried with this command:

$ echo 'int f(int x) { return x*x*x*x*x*x; }' | gcc -o - -O2 -S -masm=intel -x c -

There are a lot of gcc flags but nothing fancy. They mean: Read from stdin; use O2 optimization level; output assembly language listing instead of a binary; the listing should use Intel assembly language syntax; the input is in C language (usually language is inferred from input file extension, but there is no file extension when reading from stdin); and write to stdout.

Here's the important part of the output. I've annotated it with some comments indicating what's going on in the assembly language:

; x is in edi to begin with.  eax will be used as a temporary register.
mov  eax, edi  ; temp = x
imul eax, edi  ; temp = x * temp
imul eax, edi  ; temp = x * temp
imul eax, eax  ; temp = temp * temp

I'm using system GCC on Linux Mint 16 Petra, an Ubuntu derivative. Here's the gcc version:

$ gcc --version
gcc (Ubuntu/Linaro 4.8.1-10ubuntu9) 4.8.1

As other posters have noted, this optimization is not possible in floating point by default, because floating-point arithmetic is not associative.


Library functions like "pow" are usually carefully crafted to yield the minimum possible error (in the generic case). This is usually achieved by approximating the function with splines (according to Pascal's comment, the most common implementation seems to use the Remez algorithm).

Fundamentally, the following operation:

pow(x,y);

has an inherent error of approximately the same magnitude as the error in any single multiplication or division.

While the following operation:

float a=someValue;
float b=a*a*a*a*a*a;

has an inherent error that is more than 5 times the error of a single multiplication or division (because you are combining 5 multiplications).

The compiler has to be really careful about the kind of optimization it is doing:

  1. Optimizing pow(a,6) to a*a*a*a*a*a may improve performance, but drastically reduce the accuracy for floating-point numbers.
  2. Optimizing a*a*a*a*a*a to pow(a,6) may actually reduce the accuracy, because "a" may be some special value that allows multiplication without error (a power of 2 or some small integer).
  3. Optimizing pow(a,6) to (a*a*a)*(a*a*a) or (a*a)*(a*a)*(a*a) can still lose accuracy compared to the pow function.

In general, you know that for arbitrary floating-point values "pow" has better accuracy than any function you could eventually write, but in some special cases multiple multiplications may have better accuracy and performance. It is up to the developer to choose what is more appropriate, and eventually to comment the code so that no one else would "optimize" it.

The only optimization that makes sense (personal opinion, and apparently a choice made in GCC even without any particular optimization level or compiler flag) is replacing "pow(a,2)" with "a*a". That would be the only sane thing a compiler vendor should do.


I would not have expected this case to be optimized at all. It can't be very often that an expression contains subexpressions that can be regrouped to remove entire operations. I would expect compiler writers to invest their time in areas which would be more likely to result in noticeable improvements, rather than covering a rarely encountered edge case.

I was surprised to learn from the other answers that this expression could indeed be optimized with the proper compiler switches. Either the optimization is trivial, or it is an edge case of a much more common optimization, or the compiler writers were extremely thorough.

There's nothing wrong with providing hints to the compiler as you've done here. It's a normal and expected part of the micro-optimization process to rearrange statements and expressions to see what differences they will bring.

While the compiler may be justified in considering the two expressions to deliver inconsistent results (without the proper switches), there's no need for you to be bound by that restriction. The difference will be incredibly tiny - so much so that if the difference matters to you, you should not be using standard floating point arithmetic in the first place.


gcc actually can do this optimization, even for floating-point numbers. For example,

double foo(double a) {
  return a*a*a*a*a*a;
}

becomes

foo(double):
    mulsd   %xmm0, %xmm0
    movapd  %xmm0, %xmm1
    mulsd   %xmm0, %xmm1
    mulsd   %xmm1, %xmm0
    ret

with -O -funsafe-math-optimizations. This reordering violates IEEE-754, though, so it requires the flag.

For signed integers, as Peter Cordes pointed out in a comment, this optimization can be done without -funsafe-math-optimizations, since it holds exactly when there is no overflow, and if there is overflow you get undefined behavior. So you get

foo(long):
    movq    %rdi, %rax
    imulq   %rdi, %rax
    imulq   %rdi, %rax
    imulq   %rax, %rax
    ret

with just -O. For unsigned integers, it's even easier since they work mod powers of 2 and so can be reordered freely even in the face of overflow.
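
For reference, the analogous unsigned version would be (a hypothetical sketch; the exact code GCC generates may differ by version):

unsigned long ufoo(unsigned long a) {
    /* unsigned arithmetic wraps modulo 2^64, so the multiplications can be
       regrouped freely even if intermediate products overflow */
    return a*a*a*a*a*a;
}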


Another similar case: most compilers won't optimize a + b + c + d to (a + b) + (c + d) (this is an optimization since the second expression can be pipelined better) and will instead evaluate it as given (i.e. as (((a + b) + c) + d)). This too is because of corner cases:

float a = 1e35, b = 1e-5, c = -1e35, d = 1e-5;
printf("%e %e\n", a + b + c + d, (a + b) + (c + d));

This outputs 1.000000e-05 0.000000e+00: evaluated left to right, the 1e-5 added to 1e35 is absorbed, 1e35 and -1e35 then cancel, and the final 1e-5 survives; in the regrouped version, each half rounds to ±1e35 and the sum cancels to exactly 0.

