When is assembly faster than C?

Question

One of the stated reasons for knowing assembler is that, on occasion, it can be employed to write code that will be more performant than writing that code in a higher-level language, C in particular. However, I've also heard it stated many times that although that's not entirely false, the cases where assembler can actually be used to generate more performant code are both extremely rare and require expert knowledge of and experience with assembly.

This question doesn't even get into the fact that assembler instructions will be machine-specific and non-portable, or any of the other aspects of assembler. There are plenty of good reasons for knowing assembly besides this one, of course, but this is meant to be a specific question soliciting examples and data, not an extended discourse on assembler versus higher-level languages.

Can anyone provide some specific examples of cases where assembly will be faster than well-written C code using a modern compiler, and can you support that claim with profiling evidence? I am pretty confident these cases exist, but I really want to know exactly how esoteric these cases are, since it seems to be a point of some contention.

User · Answer

Tight loops, like when playing with images, since an image may consist of millions of pixels. Sitting down and figuring out how to make best use of the limited number of processor registers can make a difference. Here's a real-life sample:

http://danbystrom.se/2008/12/22/optimizing-away-ii/

Then often processors have some esoteric instructions which are too specialized for a compiler to bother with, but on occasion an assembler programmer can make good use of them. Take the XLAT instruction for example. Really great if you need to do table look-ups in a loop and the table is limited to 256 bytes!

Updated: Oh, just come to think of what's most crucial when we speak of loops in general: the compiler often has no clue how many iterations will be the common case! Only the programmer knows that a loop will be iterated MANY times and that it therefore will be beneficial to prepare for the loop with some extra work, or that it will be iterated so few times that the set-up actually will take longer than the iterations expected.
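As a point of comparison for the XLAT remark, here is a minimal C sketch of the same kind of 256-byte table look-up in a loop. The function names `translate_bytes` and `make_upper_table` are hypothetical, chosen only for this illustration; whether a given compiler turns the loop into anything resembling XLAT is not something the sketch claims.

```c
#include <stddef.h>
#include <stdint.h>

/* Translate each byte of buf in place through a 256-entry table.
   This is the job XLAT performs one byte at a time on x86. */
void translate_bytes(uint8_t *buf, size_t len, const uint8_t table[256])
{
    for (size_t i = 0; i < len; ++i)
        buf[i] = table[buf[i]];
}

/* Hypothetical helper for the example: a table that upper-cases
   ASCII letters and leaves every other byte value unchanged. */
void make_upper_table(uint8_t table[256])
{
    for (int i = 0; i < 256; ++i)
        table[i] = (uint8_t)((i >= 'a' && i <= 'z') ? i - 'a' + 'A' : i);
}
```

Because the index is a `uint8_t`, it can never fall outside the 256-entry table, which is the same property that makes XLAT usable without a bounds check.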

User · Answer

I'm surprised no one said this. The strlen() function is much faster if written in assembly! In C, the best thing you can do is

    int c;
    for (c = 0; str[c] != '\0'; c++) {}

while in assembly you can speed it up considerably:

    mov esi, offset string
    mov edi, esi
    xor ecx, ecx                ; cl = 0, the terminator we compare against

    lp:
    mov ax, word ptr [esi]      ; load characters 0 and 1
    cmp al, cl
    je  end_1
    cmp ah, cl
    je  end_2
    mov bx, word ptr [esi + 2]  ; load characters 2 and 3
    cmp bl, cl
    je  end_3
    cmp bh, cl
    je  end_4
    add esi, 4
    jmp lp

    end_4:
    inc esi

    end_3:
    inc esi

    end_2:
    inc esi

    end_1:                      ; esi now points at the terminator
    mov ecx, esi
    sub ecx, edi

The length is in ecx. This handles 4 characters per loop iteration, so it can be up to 4 times faster. And if you also use the high-order words of eax and ebx, it could become up to 8 times faster than the previous C routine!
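For comparison, the word-at-a-time idea can also be written in C. This is not the answer's method, just a hedged sketch: `strlen_by_words` and `has_zero_byte` are hypothetical names, and the code assumes the buffer holding the string is padded so that reading up to 3 bytes past the terminator stays inside the allocation.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Nonzero if any of the four bytes in v is zero (a well-known bit trick). */
int has_zero_byte(uint32_t v)
{
    return ((v - 0x01010101u) & ~v & 0x80808080u) != 0;
}

/* Assumption: the buffer behind s is padded, so a 4-byte read that
   overlaps the terminator never leaves the allocation. */
size_t strlen_by_words(const char *s)
{
    const char *p = s;

    for (;;) {
        uint32_t w;
        memcpy(&w, p, sizeof w);   /* unaligned-safe 4-byte load */
        if (has_zero_byte(w))
            break;                 /* the terminator is in this word */
        p += sizeof w;
    }
    while (*p != '\0')             /* locate it within the last word */
        ++p;
    return (size_t)(p - s);
}
```

Production strlen implementations use the same trick with wider words or SIMD and handle alignment so that no padding is needed; the sketch leaves that out for brevity.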

User · Answer

The simple answer... One who knows assembly well (aka has the reference beside him, and is taking advantage of every little processor cache and pipeline feature etc.) is guaranteed to be capable of producing much faster code than any compiler.

However, the difference these days just doesn't matter in the typical application.

User · Answer

Without giving any specific example or profiler evidence, you can write better assembler than the compiler when you know more than the compiler.

In the general case, a modern C compiler knows much more about how to optimize the code in question: it knows how the processor pipeline works, it can try to reorder instructions quicker than a human can, and so on. It's basically the same as a computer being as good as or better than the best human player for board games, etc., simply because it can make searches within the problem space faster than most humans. Although you theoretically can perform as well as the computer in a specific case, you certainly can't do it at the same speed, making it infeasible for more than a few cases (i.e. the compiler will most certainly outperform you if you try to write more than a few routines in assembler).

On the other hand, there are cases where the compiler does not have as much information. I'd say primarily when working with different forms of external hardware, of which the compiler has no knowledge. The primary example probably being device drivers, where assembler combined with a human's intimate knowledge of the hardware in question can yield better results than a C compiler could.

Others have mentioned special-purpose instructions, which is what I'm talking about in the paragraph above: instructions of which the compiler might have limited or no knowledge at all, making it possible for a human to write faster code.

User · Answer

In the days when processor speed was measured in MHz and screen size was below 1 megapixel, a well-known trick to get a faster display was to unroll loops: write one operation for each scan line of the screen. It avoided the overhead of maintaining a loop index! Coupled with detection of screen refresh, it was quite effective. That's something a C compiler wouldn't do... (although often you can choose between optimization for speed or for size; I suppose the former uses some similar tricks).

I know some people enjoy writing Windows applications in assembly language. They claim they are faster (hard to prove) and smaller (indeed!). Obviously, while it is fun to do, it is probably wasted time (except for learning purposes, of course!), particularly for GUI operations. Now, perhaps some operations, like searching for a string in a file, can be optimized by carefully written assembly code.

User · Answer

Given the right programmer, Assembler programs can always be made faster than their C counterparts (at least marginally). It would be difficult to create a C program where you couldn't take out at least one instruction of the Assembler.

User · Answer

One of the possibilities in the CP/M-86 version of PolyPascal (sibling to Turbo Pascal) was to replace the "use-bios-to-output-characters-to-the-screen" facility with a machine language routine which in essence was given the x, the y, and the string to put there.

This made it possible to update the screen much, much faster than before!

There was room in the binary to embed machine code (a few hundred bytes) and there was other stuff there too, so it was essential to squeeze in as much as possible.

It turns out that since the screen was 80x25, both coordinates could fit in a byte each, so both could fit in a two-byte word. This allowed the calculations needed to be done in fewer bytes, since a single add could manipulate both values simultaneously.

To my knowledge there are no C compilers which can merge multiple values in a register, do SIMD instructions on them and split them out again later (and I don't think the machine instructions will be shorter anyway).
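The packed-coordinate trick can be spelled out by hand in C, even if a compiler would not discover it on its own. A minimal sketch follows; the names `pack_xy`, `move_xy`, `get_x` and `get_y` are hypothetical, and the layout (row in the high byte, column in the low byte) is an assumption for illustration, not a description of the PolyPascal routine.

```c
#include <stdint.h>

/* Pack a column (0..79) and a row (0..24) into one 16-bit word:
   row in the high byte, column in the low byte. */
uint16_t pack_xy(uint8_t x, uint8_t y)
{
    return (uint16_t)(((unsigned)y << 8) | x);
}

/* A single 16-bit add moves both coordinates at once. This is only
   valid while the column sum stays below 256, so that no carry leaks
   from the low byte into the high byte. */
uint16_t move_xy(uint16_t pos, uint16_t delta)
{
    return (uint16_t)(pos + delta);
}

uint8_t get_x(uint16_t pos) { return (uint8_t)(pos & 0xFFu); }
uint8_t get_y(uint16_t pos) { return (uint8_t)(pos >> 8); }
```

For example, `move_xy(pack_xy(10, 5), pack_xy(3, 2))` yields a word whose column is 13 and whose row is 7. On an 80x25 screen the no-carry condition holds for any in-range position plus any in-range offset.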

User · Answer

Pretty much anytime the compiler sees floating point code, a hand-written version will be quicker if you're using an old bad compiler. (2019 update: This is not true in general for modern compilers. Especially when compiling for anything other than x87; compilers have an easier time with SSE2 or AVX for scalar math, or any non-x86 with a flat FP register set, unlike x87's register stack.)

The primary reason is that the compiler can't perform any robust optimisations. See this article from MSDN for a discussion on the subject. Here's an example where the assembly version is twice the speed of the C version (compiled with VS2K5):

    #include "stdafx.h"
    #include <windows.h>
    #include <cstdlib>    // rand, RAND_MAX
    #include <iostream>   // cout, endl

    using namespace std;

    float KahanSum(const float *data, int n)
    {
       float sum = 0.0f, C = 0.0f, Y, T;

       for (int i = 0 ; i < n ; ++i) {
          Y = *data++ - C;
          T = sum + Y;
          C = T - sum - Y;
          sum = T;
       }

       return sum;
    }

    float AsmSum(const float *data, int n)
    {
      float result = 0.0f;

      _asm
      {
        mov esi,data
        mov ecx,n
        fldz
        fldz
    l1:
        fsubr [esi]
        add esi,4
        fld st(0)
        fadd st(0),st(2)
        fld st(0)
        fsub st(0),st(3)
        fsub st(0),st(2)
        fstp st(2)
        fstp st(2)
        loop l1
        fstp result
        fstp result
      }

      return result;
    }

    int main (int, char **)
    {
      int count = 1000000;

      float *source = new float [count];

      for (int i = 0 ; i < count ; ++i) {
        source [i] = static_cast <float> (rand ()) / static_cast <float> (RAND_MAX);
      }

      LARGE_INTEGER start, mid, end;

      float sum1 = 0.0f, sum2 = 0.0f;

      QueryPerformanceCounter (&start);

      sum1 = KahanSum (source, count);

      QueryPerformanceCounter (&mid);

      sum2 = AsmSum (source, count);

      QueryPerformanceCounter (&end);

      cout << "  C code: " << sum1 << " in " << (mid.QuadPart - start.QuadPart) << endl;
      cout << "asm code: " << sum2 << " in " << (end.QuadPart - mid.QuadPart) << endl;

      return 0;
    }

And some numbers from my PC running a default release build:

      C code: 500137 in 103884668
    asm code: 500137 in 52129147

Out of interest, I swapped the loop with a dec/jnz and it made no difference to the timings: sometimes quicker, sometimes slower. I guess the memory-limited aspect dwarfs other optimisations. (Editor's note: more likely the FP latency bottleneck is enough to hide the extra cost of loop. Doing two Kahan summations in parallel for the odd/even elements, and adding those at the end, could maybe speed this up by a factor of 2.)

Whoops, I was running a slightly different version of the code and it outputted the numbers the wrong way round (i.e. C was faster!). Fixed and updated the results.
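The editor's note about running two Kahan summations in parallel can be sketched in plain C. This is an illustration of that suggestion only, not code from the answer: `kahan_sum_2lanes` is a hypothetical name, and no timing is claimed for it. It also assumes the compiler is not allowed to reassociate floating-point math (for example, no `-ffast-math`), since that can optimize the compensation terms away.

```c
#include <stddef.h>

/* Kahan summation over two independent lanes (even and odd indices).
   The two dependency chains do not feed each other, so an
   out-of-order CPU can overlap their floating-point latencies. */
float kahan_sum_2lanes(const float *data, size_t n)
{
    float sum0 = 0.0f, c0 = 0.0f;   /* lane 0: running sum and compensation */
    float sum1 = 0.0f, c1 = 0.0f;   /* lane 1: running sum and compensation */
    size_t i = 0;

    for (; i + 1 < n; i += 2) {
        float y0 = data[i] - c0;
        float t0 = sum0 + y0;
        c0 = (t0 - sum0) - y0;
        sum0 = t0;

        float y1 = data[i + 1] - c1;
        float t1 = sum1 + y1;
        c1 = (t1 - sum1) - y1;
        sum1 = t1;
    }
    if (i < n) {                    /* odd count: last element goes to lane 0 */
        float y0 = data[i] - c0;
        float t0 = sum0 + y0;
        c0 = (t0 - sum0) - y0;
        sum0 = t0;
    }
    return sum0 + sum1;             /* combine the two lanes at the end */
}
```

The result can differ in the last bits from a single-lane Kahan sum because the additions happen in a different order, so anyone benchmarking the two should compare against a tolerance rather than for exact equality.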

User · Answer

I can't give the specific examples because it was too many years ago, but there were plenty of cases where hand-written assembler could out-perform any compiler. Reasons why:

- You could deviate from calling conventions, passing arguments in registers.
- You could carefully consider how to use registers, and avoid storing variables in memory.
- For things like jump tables, you could avoid having to bounds-check the index.

Basically, compilers do a pretty good job of optimizing, and that is nearly always "good enough", but in some situations (like graphics rendering) where you're paying dearly for every single cycle, you can take shortcuts because you know the code, where a compiler could not because it has to be on the safe side.

In fact, I have heard of some graphics rendering code where a routine, like a line-draw or polygon-fill routine, actually generated a small block of machine code on the stack and executed it there, so as to avoid continual decision-making about line style, width, pattern, etc.

That said, what I want a compiler to do is generate good assembly code for me but not be too clever, and they mostly do that. In fact, one of the things I hate about Fortran is its scrambling the code in an attempt to "optimize" it, usually to no significant purpose.

Usually, when apps have performance problems, it is due to wasteful design. These days, I would never recommend assembler for performance unless the overall app had already been tuned within an inch of its life, still was not fast enough, and was spending all its time in tight inner loops.

Added: I've seen plenty of apps written in assembly language, and the main speed advantage over a language like C, Pascal, Fortran, etc. was because the programmer was far more careful when coding in assembler. He or she is going to write roughly 100 lines of code a day, regardless of language, and in a compiler language that's going to equal 300 or 400 instructions.

User · Answer

I have read all the answers (more than 30) and didn't find a simple reason: assembler is faster than C if you have read and practiced the Intel 64 and IA-32 Architectures Optimization Reference Manual, so the reason why assembly may be slower is that people who write such slower assembly didn't read the Optimization Manual.

In the good old days of the Intel 80286, each instruction was executed in a fixed count of CPU cycles, but since the Pentium Pro, released in 1995, Intel processors became superscalar, utilizing complex pipelining: out-of-order execution and register renaming. Before that, on the Pentium, produced in 1993, there were U and V pipelines: dual pipelines that could execute two simple instructions in one clock cycle if they didn't depend on one another. But this was nothing compared to the out-of-order execution and register renaming that appeared in the Pentium Pro and remain almost unchanged nowadays.

To explain in a few words, the fastest code is where instructions do not depend on previous results. For example, you should always clear whole registers (by movzx) to remove the dependency on previous values of the registers you are working with, so they may be renamed internally by the CPU to allow instructions to execute in parallel or in a different order. Or, on some processors, a false dependency may exist that may also slow things down, like the false dependency on the Pentium 4 for inc/dec, so you may wish to use add eax, 1 instead of inc eax to remove the dependency on the previous state of the flags.

You can read more on out-of-order execution and register renaming if time permits; there is plenty of information available on the Internet.

There are also other important issues like branch prediction, the number of load and store units, the number of gates that execute micro-ops, memory cache coherence protocols, etc., but the most important thing to consider is out-of-order execution.

Most people are simply not aware of out-of-order execution, so they write their assembly programs as if for the 80286, expecting their instructions to take a fixed time to execute regardless of context, while C compilers are aware of out-of-order execution and generate the code correctly. That's why the code of such unaware people is slower, but if you become aware, your code will be faster.

User · Answer

Actually, you can build large-scale programs in a large model mode: segments may be restricted to 64 KB of code, but you can write many segments. People give the argument against ASM that it is an old language and we don't need to preserve memory anymore. If that were the case, why would we be packing our PCs with memory? The only flaw I can find with ASM is that it is more or less processor-based, so most programs written for the Intel architecture most likely would not run on an AMD architecture.

As for C being faster than ASM: there is no language faster than ASM, and ASM can do many things C and other HLLs cannot do at processor level. ASM is a difficult language to learn, but once you learn it no HLL can translate it better than you. If you could only see some of the things HLLs do to your code, and understand what it is doing, you would wonder why more people don't use ASM and why assemblers are no longer being updated (for general public use anyway). So no, C is not faster than ASM. Even experienced C++ programmers still write code chunks in ASM and add them to their C++ code for speed.

That other languages some people think are obsolete or possibly no good really are so is a myth at times. For instance, Photoshop is written in Pascal/ASM (the 1st release of the source has been submitted to the technical history museum), and Paint Shop Pro is still written in Python, TCL and ASM. A common denominator of these two fast and great image processors is ASM. Although Photoshop may have upgraded to Delphi now, it is still Pascal, and any speed problems are coming from Pascal, but this is because nowadays we like the way programs look and not what they do. I would like to make a Photoshop clone in pure ASM, which I have been working on, and it's coming along rather well. No code, interpret, arrange, rewrite, etc. Just code and go, process complete.

User · Answer

Only when using some special-purpose instruction sets the compiler doesn't support.

To maximize the computing power of a modern CPU with multiple pipelines and predictive branching, you need to structure the assembly program in a way that makes it a) almost impossible for a human to write, and b) even more impossible to maintain.

Also, better algorithms, data structures and memory management will give you at least an order of magnitude more performance than the micro-optimizations you can do in assembly.

User · Answer

Longpoke, there is just one limitation: time. When you don't have the resources to optimize every single change to code and spend your time allocating registers, optimizing a few spills away and whatnot, the compiler will win every single time. You do your modification to the code, recompile and measure. Repeat if necessary.

Also, you can do a lot on the high-level side. Also, inspecting the resulting assembly may give the IMPRESSION that the code is crap, but in practice it will run faster than what you think would be quicker. Example:

    int y = data[i];
    // do some stuff here..
    call_function(y, ...);

The compiler will read the data, push it to the stack (spill) and later read it from the stack and pass it as an argument. Sounds shite? It might actually be very effective latency compensation and result in faster runtime.

    // optimized version
    call_function(data[i], ...); // not so optimized after all..

The idea with the optimized version was that we have reduced register pressure and avoided spilling. But in truth, the "shitty" version was faster!

Looking at the assembly code, just looking at the instructions and concluding "more instructions, slower" would be a misjudgment.

The thing to pay attention to here is: many assembly experts think they know a lot, but know very little. The rules change from one architecture to the next, too. There is no silver-bullet x86 code, for example, which is always the fastest. These days it is better to go by rules of thumb:

- memory is slow
- cache is fast
- try to use the cache better
- how often are you going to miss? do you have a latency compensation strategy?
- you can execute 10-100 ALU/FPU/SSE instructions for one single cache miss
- application architecture is important..
- .. but it doesn't help when the problem isn't in the architecture

Also, trusting too much in the compiler magically transforming poorly-thought-out C/C++ code into "theoretically optimum" code is wishful thinking. You have to know the compiler and tool chain you use if you care about "performance" at this low level.

Compilers for C/C++ are generally not very good at re-ordering sub-expressions because the functions have side effects, for starters. Functional languages don't suffer from this caveat but don't fit the current ecosystem that well. There are compiler options to allow relaxed precision rules, which allow the order of operations to be changed by the compiler/linker/code generator.

This topic is a bit of a dead end; for most it's not relevant, and the rest, they know what they are doing already anyway.

It all boils down to this: "to understand what you are doing". It's a bit different from knowing what you are doing.

User · Answer

Nowadays, considering compilers such as Intel C++, which optimize C code very aggressively, it is very hard to compete with compiler output.

User · Answer

How about creating machine code at run-time?

My brother once (around 2000) realised an extremely fast real-time ray-tracer by generating code at run-time. I can't remember the details, but there was some kind of main module which was looping through objects, then it was preparing and executing some machine code which was specific to each object.

However, over time, this method was overtaken by new graphics hardware, and it became useless.

Today, I think that possibly some operations on big data (millions of records) like pivot tables, drilling, calculations on-the-fly, etc. could be optimized with this method. The question is: is the effort worth it?

User · Answer

Although C is "close" to the low-level manipulation of 8-bit, 16-bit, 32-bit, 64-bit data, there are a few mathematical operations not supported by C which can often be performed elegantly in certain assembly instruction sets:

1. Fixed-point multiplication: The product of two 16-bit numbers is a 32-bit number. But the rules in C say that the product of two 16-bit numbers is a 16-bit number, and the product of two 32-bit numbers is a 32-bit number -- the bottom half in both cases. If you want the top half of a 16x16 multiply or a 32x32 multiply, you have to play games with the compiler. The general method is to cast to a larger-than-necessary bit width, multiply, shift down, and cast back:

        int16_t x, y;
        // int16_t is a typedef for "short"
        // set x and y to something
        int16_t prod = (int16_t)(((int32_t)x*y)>>16);

   In this case the compiler may be smart enough to know that you're really just trying to get the top half of a 16x16 multiply and do the right thing with the machine's native 16x16 multiply. Or it may be stupid and require a library call to do the 32x32 multiply that's way overkill because you only need 16 bits of the product -- but the C standard doesn't give you any way to express yourself.

2. Certain bitshifting operations (rotation/carries):

        // 256-bit array shifted right in its entirety:
        uint8_t x[32];
        for (int i = 32; --i > 0; )
        {
           x[i] = (x[i] >> 1) | (x[i-1] << 7);
        }
        x[0] >>= 1;

   This is not too inelegant in C, but again, unless the compiler is smart enough to realize what you are doing, it's going to do a lot of "unnecessary" work. Many assembly instruction sets allow you to rotate or shift left/right with the shifted-out bit landing in the carry flag, so you could accomplish the above in 34 instructions: load a pointer to the beginning of the array, clear the carry, and perform 32 8-bit right-shifts, using auto-increment on the pointer.

   For another example, there are linear feedback shift registers (LFSR) that are elegantly performed in assembly: Take a chunk of N bits (8, 16, 32, 64, 128, etc.), shift the whole thing right by 1 (see above algorithm), then if the resulting carry is 1, you XOR in a bit pattern that represents the polynomial.

Having said that, I wouldn't resort to these techniques unless I had serious performance constraints. As others have said, assembly is much harder to document/debug/test/maintain than C code: the performance gain comes with some serious costs.

edit: 3. Overflow detection is possible in assembly (can't really do it in C); this makes some algorithms much easier.
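For reference, the LFSR step described above looks like this in C, where the "carry" has to be extracted by hand before the shift. This is a minimal sketch under stated assumptions: `lfsr16_step` and `add_overflows_i32` are hypothetical names, and the tap mask `0xB400` used in the usage note is just a commonly quoted 16-bit example, not something taken from the answer. The second function shows one way overflow can be tested in portable C before the addition happens, which is more roundabout than reading a flag after it.

```c
#include <stdint.h>

/* One step of a 16-bit Galois LFSR: shift right by one and, if the bit
   shifted out (the "carry") was 1, XOR in the tap mask for the polynomial. */
uint16_t lfsr16_step(uint16_t state, uint16_t taps)
{
    uint16_t carry = (uint16_t)(state & 1u);  /* bit about to be shifted out */
    state = (uint16_t)(state >> 1);
    if (carry)
        state = (uint16_t)(state ^ taps);
    return state;
}

/* Returns 1 if a + b would overflow int32_t, 0 otherwise. The check is
   done before adding, because signed overflow itself is undefined in C. */
int add_overflows_i32(int32_t a, int32_t b)
{
    return (b > 0 && a > INT32_MAX - b) ||
           (b < 0 && a < INT32_MIN - b);
}
```

Starting from state `0x0001` with taps `0xB400`, one step gives `0xB400`, because the low bit is shifted out and the mask is XORed in. GNU C and Clang also offer `__builtin_add_overflow`, which lets the compiler use the hardware flag directly.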

User · Answer

I think the general case when assembler is faster is when a smart assembly programmer looks at the compiler's output and says "this is a critical path for performance and I can write this to be more efficient", and then that person tweaks that assembler or rewrites it from scratch.

User · Answer

Here is a real-world example: Fixed-point multiplies on old compilers.

These don't only come in handy on devices without floating point, they shine when it comes to precision as they give you 32 bits of precision with a predictable error (float only has 23 bits and it's harder to predict precision loss), i.e. uniform absolute precision over the entire range, instead of close-to-uniform relative precision (float).

Modern compilers optimize this fixed-point example nicely, so for more modern examples that still need compiler-specific code, see:

- Getting the high part of 64 bit integer multiplication: A portable version using uint64_t for 32x32 => 64-bit multiplies fails to optimize on a 64-bit CPU, so you need intrinsics or __int128 for efficient code on 64-bit systems.
- _umul128 on Windows 32 bits: MSVC doesn't always do a good job when multiplying 32-bit integers cast to 64, so intrinsics helped a lot.

C doesn't have a full-multiplication operator (2N-bit result from N-bit inputs). The usual way to express it in C is to cast the inputs to the wider type and hope the compiler recognizes that the upper bits of the inputs aren't interesting:

    // on a 32-bit machine, int can hold 32-bit fixed-point integers.
    int inline FixedPointMul (int a, int b)
    {
      long long a_long = a; // cast to 64 bit.

      long long product = a_long * b; // perform multiplication

      return (int) (product >> 16);  // shift by the fixed point bias
    }

The problem with this code is that we do something that can't be directly expressed in the C language. We want to multiply two 32-bit numbers and get a 64-bit result of which we return the middle 32 bits. However, in C this multiply does not exist. All you can do is promote the integers to 64 bits and do a 64*64 = 64 multiply.

x86 (and ARM, MIPS and others) can however do the multiply in a single instruction. Some compilers used to ignore this fact and generate code that calls a runtime library function to do the multiply. The shift by 16 is also often done by a library routine (the x86 can do such shifts too).

So we're left with one or two library calls just for a multiply. This has serious consequences. Not only is the shift slower, registers must be preserved across the function calls, and it does not help inlining and code-unrolling either.

If you rewrite the same code in (inline) assembler you can gain a significant speed boost.

In addition to this: using ASM is not the best way to solve the problem. Most compilers allow you to use some assembler instructions in intrinsic form if you can't express them in C. The VS.NET2008 compiler for example exposes the 32*32=64 bit mul as __emul and the 64-bit shift as __ll_rshift.

Using intrinsics you can rewrite the function in a way that the C compiler has a chance to understand what's going on. This allows the code to be inlined and register allocated, and common subexpression elimination and constant propagation can be done as well. You'll get a huge performance improvement over the hand-written assembler code that way.

For reference: The end result for the fixed-point mul for the VS.NET compiler is:

    int inline FixedPointMul (int a, int b)
    {
        return (int) __ll_rshift(__emul(a,b),16);
    }

The performance difference of fixed-point divides is even bigger. I had improvements up to a factor of 10 for division-heavy fixed-point code by writing a couple of asm lines.

Using Visual C++ 2013 gives the same assembly code for both ways.

gcc4.1 from 2007 also optimizes the pure C version nicely. (The Godbolt compiler explorer doesn't have any earlier versions of gcc installed, but presumably even older GCC versions could do this without intrinsics.)

See source + asm for x86 (32-bit) and ARM on the Godbolt compiler explorer. (Unfortunately it doesn't have any compilers old enough to produce bad code from the simple pure C version.)

Modern CPUs can do things C doesn't have operators for at all, like popcnt or bit-scan to find the first or last set bit. (POSIX has a ffs() function, but its semantics don't match x86 bsf / bsr. See https://en.wikipedia.org/wiki/Find_first_set.)

Some compilers can sometimes recognize a loop that counts the number of set bits in an integer and compile it to a popcnt instruction (if enabled at compile time), but it's much more reliable to use __builtin_popcount in GNU C, or on x86 if you're only targeting hardware with SSE4.2: _mm_popcnt_u32 from <immintrin.h>.

Or in C++, assign to a std::bitset<32> and use .count(). (This is a case where the language has found a way to portably expose an optimized implementation of popcount through the standard library, in a way that will always compile to something correct, and can take advantage of whatever the target supports.) See also https://en.wikipedia.org/wiki/Hamming_weight#Language_support.

Similarly, ntohl can compile to bswap (x86 32-bit byte swap for endian conversion) on some C implementations that have it.

Another major area for intrinsics or hand-written asm is manual vectorization with SIMD instructions. Compilers are not bad with simple loops like dst[i] += src[i] * 10.0, but often do badly or don't auto-vectorize at all when things get more complicated. For example, you're unlikely to get anything like "How to implement atoi using SIMD?" generated automatically by the compiler from scalar code.
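To make the popcount and fixed-point points concrete, here is a small C sketch. The function names `popcount32_portable`, `popcount32` and `fixed_mul_16_16` are hypothetical; `__builtin_popcount` is the real GNU C builtin, and the code falls back to the plain loop on other compilers. Whether the loop version gets compiled to a popcnt instruction depends on the compiler and target flags, which the sketch does not try to demonstrate.

```c
#include <stdint.h>

/* Portable bit count: each iteration clears the lowest set bit. */
unsigned popcount32_portable(uint32_t x)
{
    unsigned n = 0;
    while (x != 0) {
        x &= x - 1;
        ++n;
    }
    return n;
}

/* Prefer the GNU C builtin where it exists; otherwise use the loop. */
unsigned popcount32(uint32_t x)
{
#if defined(__GNUC__) || defined(__clang__)
    return (unsigned)__builtin_popcount(x);
#else
    return popcount32_portable(x);
#endif
}

/* 16.16 fixed-point multiply written with fixed-width types: widen to
   64 bits, multiply, then shift out the 16 fractional bits. For negative
   products the right shift is implementation-defined in C (arithmetic
   on the usual compilers). */
int32_t fixed_mul_16_16(int32_t a, int32_t b)
{
    int64_t product = (int64_t)a * b;
    return (int32_t)(product >> 16);
}
```

In 16.16 format, 1.5 is `0x00018000` and 2.0 is `0x00020000`, so `fixed_mul_16_16(0x00018000, 0x00020000)` returns `0x00030000`, which is 3.0.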

User · Answer

It might be worth looking at "Optimizing Immutable and Purity" by Walter Bright. It's not a profiled test, but it shows you one good example of a difference between handwritten and compiler-generated ASM. Walter Bright writes optimising compilers, so it might be worth looking at his other blog posts.

User · Answer

The question is a bit misleading. The answer is there in your post itself. It is always possible to write an assembly solution for a particular problem which executes faster than any generated by a compiler. The thing is, you need to be an expert in assembly to overcome the limitations of a compiler. An experienced assembly programmer can write programs in any HLL which perform faster than ones written by an inexperienced programmer. The truth is you can always write assembly programs that execute faster than ones generated by a compiler.

User · Answer

A use case which might not apply anymore, but for your nerd pleasure: On the Amiga, the CPU and the graphics/audio chips would fight for access to a certain area of RAM (the first 2MB of RAM, to be specific). So when you had only 2MB RAM (or less), displaying complex graphics plus playing sound would kill the performance of the CPU.

In assembler, you could interleave your code in such a clever way that the CPU would only try to access the RAM when the graphics/audio chips were busy internally (i.e. when the bus was free). So by reordering your instructions, clever use of the CPU cache, and the bus timing, you could achieve some effects which were simply not possible using any higher-level language, because you had to time every command, even insert NOPs here and there to keep the various chips off each other's radar.

Which is another reason why the NOP (No Operation - do nothing) instruction of the CPU can actually make your whole application run faster.

[EDIT] Of course, the technique depends on a specific hardware setup. Which was the main reason why many Amiga games couldn't cope with faster CPUs: the timing of the instructions was off.

User · Answer

I'd say: when you are better than the compiler for a given set of instructions. So no generic answer, I think.

User · Answer

I have an operation of transposition of bits that needs to be done on 192 or 256 bits every interrupt, which happens every 50 microseconds.

It happens by a fixed map (hardware constraints). Using C, it took around 10 microseconds to perform. When I translated this to assembler, taking into account the specific features of this map, specific register caching, and using bit-oriented operations, it took less than 3.5 microseconds to perform.

User · Answer

You don't actually know whether your well-written C code is really fast if you haven't looked at the disassembly of what the compiler produces. Many times you look at it and see that "well-written" was subjective.

So it's not necessary to write in assembler to get the fastest code ever, but it's certainly worth knowing assembler for the very same reason.

User · Answer

Matrix operations using SIMD instructions are probably faster than compiler-generated code.

User · Answer

Chiming in historically.

When I was a much younger man (1970s), assembler was important, in my experience, more for the size of the code than the speed of the code.

If a module in a higher-level language was, say, 1300 bytes of code, but an assembler version of the module was 300 bytes, that 1K bytes was very important when you were trying to fit the application into 16K or 32K of memory.

Compilers were not great at the time.

In old-timey Fortran:

    X = (Y - Z)
    IF (X .LT. 0) THEN
     ... do something
    ENDIF

The compiler at the time did a SUBTRACT instruction, then a TEST instruction on X. In assembler, you would just check the condition code (LT zero, zero, GT zero) after the subtract.

For modern systems and compilers none of that is a concern.

I do think that understanding what the compiler is doing is still important. When you code in a higher-level language, you should understand what allows or prevents the compiler from doing a loop unroll.

And with pipelining and look-ahead computation involving conditionals, you should understand when the compiler does a "branch-likely".

Assembler is still needed when doing things not allowed by a higher-level language, like reading or writing to processor-specific registers.

But largely, it is no longer needed for the general programmer, except to have a basic understanding of how the code might be compiled and executed.

User · Answer

This question is a bit pointless, because C is compiled to assembler anyway. But the assembler produced by optimizing compilers is almost fully optimized, so unless you did twenty doctorates on optimizing specific assembly, you can't beat the compiler.

User · Answer

One of the more famous snippets of assembly is from Michael Abrash's texture mapping loop (explained in detail here):

    add edx,[DeltaVFrac]            ; add in dVFrac
    sbb ebp,ebp                     ; store carry
    mov [edi],al                    ; write pixel n
    mov al,[esi]                    ; fetch pixel n+1
    add ecx,ebx                     ; add in dUFrac
    adc esi,[4*ebp+UVStepVCarry]    ; add in steps

Nowadays most compilers express advanced CPU-specific instructions as intrinsics, i.e., functions that get compiled down to the actual instruction. MS Visual C++ supports intrinsics for MMX, SSE, SSE2, SSE3, and SSE4, so you have to worry less about dropping down to assembly to take advantage of platform-specific instructions. Visual C++ can also take advantage of the actual architecture you are targeting with the appropriate /ARCH setting.
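
To make the intrinsics point concrete, here is a small sketch (not from Abrash or from the answer) that adds two float arrays four elements at a time using the SSE intrinsics `_mm_loadu_ps`, `_mm_add_ps` and `_mm_storeu_ps` from `<xmmintrin.h>`. The function name `add_floats` and the `HAVE_SSE` macro are made up for this example, and the preprocessor guard is an assumption about how you might keep the code building on non-x86 targets.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#define HAVE_SSE 1
#else
#define HAVE_SSE 0
#endif

/* Hypothetical helper: dst[i] = a[i] + b[i] for n floats. */
void add_floats(float *dst, const float *a, const float *b, size_t n)
{
    assert(dst != NULL && a != NULL && b != NULL);
    size_t i = 0;
#if HAVE_SSE
    /* Each intrinsic maps to one SSE instruction; the compiler still
     * handles register allocation and scheduling. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
    }
#endif
    /* Scalar tail, and the whole loop when SSE is unavailable. */
    for (; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

A loop this simple is usually auto-vectorized anyway, so treat it as a demonstration of the syntax rather than a case where intrinsics win.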

User · Answer

gcc has become a widely used compiler. Its optimizations in general are not that good. Far better than the average programmer writing assembler, but for real performance, not that good. There are compilers that are simply incredible in the code they produce. So as a general answer, there are going to be many places where you can go into the output of the compiler and tweak the assembler for performance, and/or simply re-write the routine from scratch.

User · Answer

http://cr.yp.to/qhasm.html has many examples.

User · Answer

Short answer? Sometimes.

Technically every abstraction has a cost, and a programming language is an abstraction for how the CPU works. C however is very close. Years ago I remember laughing out loud when I logged onto my UNIX account and got the following fortune message (when such things were popular):

    The C Programming Language -- A language which combines the
    flexibility of assembly language with the power of assembly language.

It's funny because it's true: C is like portable assembly language.

It's worth noting that assembly language just runs however you write it. There is however a compiler in between C and the assembly language it generates, and that is extremely important because how fast your C code is has an awful lot to do with how good your compiler is.

When gcc came on the scene, one of the things that made it so popular was that it was often so much better than the C compilers that shipped with many commercial UNIX flavours. Not only was it ANSI C (none of this K&R C rubbish), it was more robust and typically produced better (faster) code. Not always, but often.

I tell you all this because there is no blanket rule about the speed of C and assembler, because there is no objective standard for C. Likewise, assembler varies a lot depending on what processor you're running, your system spec, what instruction set you're using and so on.

Historically there have been two CPU architecture families: CISC and RISC. The biggest player in CISC was and still is the Intel x86 architecture (and instruction set). RISC dominated the UNIX world (MIPS6000, Alpha, Sparc and so on). CISC won the battle for the hearts and minds.

Anyway, the popular wisdom when I was a younger developer was that hand-written x86 could often be much faster than C because of the way the architecture worked: it had a complexity that benefitted from a human doing it. RISC on the other hand seemed designed for compilers, so no one (I knew) wrote, say, Sparc assembler. I'm sure such people existed, but no doubt they've both gone insane and been institutionalized by now.

Instruction sets are an important point even in the same family of processors. Certain Intel processors have extensions like SSE through SSE4. AMD had their own SIMD instructions. The benefit of a programming language like C was that someone could write their library so it was optimized for whichever processor you were running on. That was hard work in assembler.

There are still optimizations you can make in assembler that no compiler could make, and a well-written assembler algorithm will be as fast as or faster than its C equivalent. The bigger question is: is it worth it?

Ultimately though, assembler was a product of its time and was more popular at a time when CPU cycles were expensive. Nowadays a CPU that costs $5-10 to manufacture (Intel Atom) can do pretty much anything anyone could want. The only real reason to write assembler these days is for low-level things like some parts of an operating system (even so, the vast majority of the Linux kernel is written in C), device drivers, possibly embedded devices (although C tends to dominate there too) and so on. Or just for kicks (which is somewhat masochistic).

User · Answer

I used to work with somebody who said, "if the compiler is too dumb to figure out what you are trying to do and can't optimize it, your compiler is broken and it is time to get a new one." I'm sure there are edge cases when assembly will beat your C code, but if you are often finding yourself using assembler to "win" over your compiler, your compiler is busted. The same can be said for writing "optimized" SQL that tries to coerce the query planner into doing things. If you find yourself re-arranging queries to get the planner to do what you want, your query planner is busted--get a new one.

User · Answer

Point one, which is not the answer. Even if you never program in it, I find it useful to know at least one assembler instruction set. This is part of the programmer's never-ending quest to know more and therefore be better. It is also useful when stepping into frameworks you don't have the source code to, and having at least a rough idea of what is going on. It also helps you to understand Java bytecode and .NET IL, as they are both similar to assembler.

To answer the question: when you have a small amount of code or a large amount of time. It is most useful in embedded chips, where low chip complexity and poor competition in compilers targeting these chips can tip the balance in favour of humans. Also, for restricted devices you are often trading off code size/memory size/performance in a way that would be hard to instruct a compiler to do. E.g. I know this user action is not called often, so I will have small code size and poor performance, but this other function that looks similar is used every second, so I will have a larger code size and faster performance. That is the sort of trade-off a skilled assembly programmer can use.

I would also like to add that there is a lot of middle ground where you can code in C, compile, and examine the assembly produced, then either change your C code or tweak and maintain it as assembly.

My friend works on microcontrollers, currently chips for controlling small electric motors. He works in a combination of low-level C and assembly. He once told me of a good day at work where he reduced the main loop from 48 instructions to 43. He is also faced with choices like: the code has grown to fill the 256k chip and the business is wanting a new feature, do you

- Remove an existing feature
- Reduce the size of some or all of the existing features, maybe at the cost of performance
- Advocate moving to a larger chip with a higher cost, higher power consumption and larger form factor

I would like to add that, as a commercial developer with quite a portfolio of languages, platforms and types of applications, I have never once felt the need to dive into writing assembly. I have however always appreciated the knowledge I gained about it. And sometimes debugged into it.

I know I have far more answered the question "why should I learn assembler", but I feel it is a more important question than "when is it faster".

So let's try once more. You should be thinking about assembly when

- working on low-level operating system functions
- working on a compiler
- working on an extremely limited chip, embedded system etc.

Remember to compare your assembly to the compiler-generated code to see which is faster/smaller/better.

David

User · Answer

Many years ago I was teaching someone to program in C. The exercise was to rotate a graphic through 90 degrees. He came back with a solution that took several minutes to complete, mainly because he was using multiplies and divides etc. I showed him how to recast the problem using bit shifts, and the time to process came down to about 30 seconds on the non-optimizing compiler he had. I had just got an optimizing compiler, and the same code rotated the graphic in < 5 seconds. I looked at the assembly code that the compiler was generating, and from what I saw decided there and then that my days of writing assembler were over.

User · Answer

More often than you think, C needs to do things that seem unnecessary from an assembly coder's point of view, just because the C standards say so. Integer promotion, for example. If you want to shift a char variable in C, you would usually expect the code to do just that: a single bit shift. The standards, however, require the compiler to behave as if it sign-extended the value to int before the shift (zero-extended for an unsigned char) and truncated the result to char afterwards, which might complicate the code depending on the target processor's architecture.
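
A short sketch of the promotion rule described above; the function names are invented for the example. The shift operand has type `unsigned char`, but the expression `c << 1` has type `int`, so the high bit survives until the result is converted back to the narrow type.

```c
#include <assert.h>

/* The shift happens at int width, so bit 8 of the result is kept. */
int shift_left_promoted(unsigned char c)
{
    /* The operand is promoted: the expression has the size of an int. */
    assert(sizeof(c << 1) == sizeof(int));
    return c << 1;
}

/* Converting back to unsigned char truncates the promoted result. */
unsigned char shift_left_narrow(unsigned char c)
{
    return (unsigned char)(c << 1);
}
```

With an input of 0x80 the first function yields 0x100 and the second yields 0. Whether the widen-and-truncate steps cost anything in the generated code depends on the target; on many processors the compiler can drop them when the upper bits are never observed.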

User · Answer

The Linux Assembly HOWTO asks this question and gives the pros and cons of using assembly.

User · Answer

A few examples from my experience:

- Access to instructions that are not accessible from C. For instance, many architectures (like x86-64, IA-64, DEC Alpha, and 64-bit MIPS or PowerPC) support a 64 bit by 64 bit multiplication producing a 128 bit result. GCC recently added an extension providing access to such instructions, but before that assembly was required. And access to this instruction can make a huge difference on 64-bit CPUs when implementing something like RSA - sometimes as much as a factor of 4 improvement in performance.

- Access to CPU-specific flags. The one that has bitten me a lot is the carry flag; when doing a multiple-precision addition, if you don't have access to the CPU carry bit you must instead compare the result to see if it overflowed, which takes 3-5 more instructions per limb, and worse, these are quite serial in terms of data accesses, which kills performance on modern superscalar processors. When processing thousands of such integers in a row, being able to use addc is a huge win (there are superscalar issues with contention on the carry bit as well, but modern CPUs deal pretty well with it).

- SIMD. Even autovectorizing compilers can only do relatively simple cases, so if you want good SIMD performance it's unfortunately often necessary to write the code directly. Of course you can use intrinsics instead of assembly, but once you're at the intrinsics level you're basically writing assembly anyway, just using the compiler as a register allocator and (nominally) instruction scheduler. (I tend to use intrinsics for SIMD simply because the compiler can generate the function prologues and whatnot for me, so I can use the same code on Linux, OS X, and Windows without having to deal with ABI issues like function calling conventions, but other than that the SSE intrinsics really aren't very nice - the Altivec ones seem better, though I don't have much experience with them.) As examples of things a (current day) vectorizing compiler can't figure out, read about bitslicing AES or SIMD error correction - one could imagine a compiler that could analyze algorithms and generate such code, but it feels to me like such a smart compiler is at least 30 years away from existing (at best).

On the other hand, multicore machines and distributed systems have shifted many of the biggest performance wins in the other direction - get an extra 20% speedup writing your inner loops in assembly, or 300% by running them across multiple cores, or 10000% by running them across a cluster of machines. And of course high-level optimizations (things like futures, memoization, etc.) are often much easier to do in a higher-level language like ML or Scala than in C or asm, and often can provide a much bigger performance win. So, as always, there are tradeoffs to be made.
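
The first two points can be sketched in C as follows. This is an illustration, not the answerer's code: `mp_add` and `mul_64x64` are made-up names. `mp_add` shows the compare-based carry detection that portable C needs in place of an add-with-carry instruction, and `mul_64x64` uses the GCC/Clang `unsigned __int128` extension when the compiler advertises it (via `__SIZEOF_INT128__`), falling back to four 32-bit partial products otherwise.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* r = a + b over n 64-bit limbs, least significant limb first.
 * Returns the carry out of the top limb (0 or 1). */
uint64_t mp_add(uint64_t *r, const uint64_t *a, const uint64_t *b, size_t n)
{
    assert(r != NULL && a != NULL && b != NULL);
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t s = a[i] + carry;
        uint64_t c1 = s < carry;   /* wrapped while adding the carry in */
        uint64_t t = s + b[i];
        uint64_t c2 = t < s;       /* wrapped while adding b[i] */
        r[i] = t;
        carry = c1 | c2;           /* at most one of the two can be set */
    }
    return carry;
}

/* Full 64 x 64 -> 128 bit unsigned multiply. */
void mul_64x64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    assert(hi != NULL && lo != NULL);
#if defined(__SIZEOF_INT128__)
    /* Compiler extension; typically a single multiply on x86-64. */
    unsigned __int128 p = (unsigned __int128)a * b;
    *lo = (uint64_t)p;
    *hi = (uint64_t)(p >> 64);
#else
    /* Portable fallback: schoolbook multiply on 32-bit halves. */
    uint64_t a_lo = a & 0xffffffffu, a_hi = a >> 32;
    uint64_t b_lo = b & 0xffffffffu, b_hi = b >> 32;
    uint64_t p0 = a_lo * b_lo, p1 = a_lo * b_hi;
    uint64_t p2 = a_hi * b_lo, p3 = a_hi * b_hi;
    uint64_t mid = (p0 >> 32) + (p1 & 0xffffffffu) + (p2 & 0xffffffffu);
    *lo = (mid << 32) | (p0 & 0xffffffffu);
    *hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
#endif
}
```

The two comparisons per limb in `mp_add` are the extra, serially dependent work the answer complains about. Whether a given compiler recognizes the pattern and emits an add-with-carry chain is something to check in the disassembly rather than assume.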

User · Answer

It all depends on your workload. For day-to-day operations, C and C++ are just fine, but there are certain workloads (any transforms involving video (compression, decompression, image effects, etc.)) that pretty much require assembly to be performant. They also usually involve using CPU-specific chipset extensions (MME/MMX/SSE/whatever) that are tuned for those kinds of operation.

User · Answer

This is very hard to answer specifically, because the question is very unspecific: what exactly is a "modern compiler"? Pretty much any manual assembler optimization could in theory be done by a compiler as well. Whether it actually is done cannot be said in general, only about a specific version of a specific compiler. Many probably require so much effort to determine whether they can be applied without side effects in a particular context that compiler writers don't bother with them.

User · Answer

In my job, there are three reasons for me to know and use assembly. In order of importance:

1. Debugging - I often get library code that has bugs or incomplete documentation. I figure out what it's doing by stepping in at the assembly level. I have to do this about once a week. I also use it as a tool to debug problems in which my eyes don't spot the idiomatic error in C/C++/C#. Looking at the assembly gets past that.

2. Optimizing - the compiler does fairly well in optimizing, but I play in a different ballpark than most. I write image processing code that usually starts with code that looks like this:

    for (int y = 0; y < imageHeight; y++) {
        for (int x = 0; x < imageWidth; x++) {
            // do something
        }
    }

   The "do something" part typically happens on the order of several million times (i.e., between 3 and 30). By scraping cycles in that "do something" phase, the performance gains are hugely magnified. I don't usually start there - I usually start by writing the code to work first, then do my best to refactor the C to be naturally better (better algorithm, less load in the loop etc.). I usually need to read assembly to see what's going on and rarely need to write it. I do this maybe every two or three months.

3. Doing something the language won't let me. These include getting the processor architecture and specific processor features, accessing flags not in the CPU (man, I really wish C gave you access to the carry flag), etc. I do this maybe once a year or two years.
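
As an illustration of "refactor the C first" for a pixel loop of that shape, here is a hypothetical brightness adjustment on an 8-bit grayscale image, written twice. Nothing here comes from the answer: the function names, the `stride` parameter and the choice of operation are assumptions. The second version moves the per-pixel arithmetic and clamping into a 256-entry table built once, and walks a row pointer instead of recomputing the index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Straightforward version: index arithmetic and clamping per pixel. */
void brighten_naive(uint8_t *img, int width, int height,
                    size_t stride, int delta)
{
    assert(width >= 0 && stride >= (size_t)width);
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            int v = img[(size_t)y * stride + (size_t)x] + delta;
            if (v < 0) v = 0;
            if (v > 255) v = 255;
            img[(size_t)y * stride + (size_t)x] = (uint8_t)v;
        }
    }
}

/* Refactored version: the clamped result for every possible input value
 * is computed once, so the inner loop is a single table lookup. */
void brighten_lut(uint8_t *img, int width, int height,
                  size_t stride, int delta)
{
    assert(width >= 0 && stride >= (size_t)width);
    uint8_t lut[256];
    for (int i = 0; i < 256; i++) {
        int v = i + delta;
        lut[i] = (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
    }
    uint8_t *row = img;
    for (int y = 0; y < height; y++, row += stride) {
        for (int x = 0; x < width; x++)
            row[x] = lut[row[x]];
    }
}
```

Both functions produce the same image. Which one is faster depends on the compiler and the CPU (an optimizer may well vectorize the naive clamp), so this is the point at which reading the generated assembly and measuring becomes useful.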
