In my somewhat checkered experience, applying -O3 to an entire program almost always makes it slower (relative to -O2), because it turns on aggressive loop unrolling and inlining that make the program no longer fit in the instruction cache. For larger programs, this can also be true for -O2 relative to -Os!

The intended use pattern for -O3 is that, after profiling your program, you manually apply it to a small handful of files containing critical inner loops that actually benefit from these aggressive space-for-speed tradeoffs. Newer versions of GCC have a profile-guided optimization mode that can (IIUC) selectively apply the -O3 optimizations to hot functions, effectively automating this process.