Try-catch speeding up my code

Question

I wrote some code for testing the impact of try-catch  but seeing some surprising results   static void Main string   args        Thread CurrentThread Priority   ThreadPriority Highest      Process GetCurrentProcess   PriorityClass   ProcessPriorityClass RealTime       long start   0  stop   0  elapsed   0      double avg   0 0       long temp   Fibo 1        for  int i   1  i  lt  100000000  i                  start   Stopwatch GetTimestamp            temp   Fibo 100           stop   Stopwatch GetTimestamp             elapsed   stop - start          avg   avg     double elapsed - avg    i             Console WriteLine  Elapsed      avg       Console ReadKey       static long Fibo int n        long n1   0  n2   1  fibo   0      n         for  int i   1  i  lt  n  i                  n1   n2          n2   fibo          fibo   n1   n2             return fibo      On my computer  this consistently prints out a value around 0 96    When I wrap the for loop inside Fibo   with a try-catch block like this   static long Fibo int n        long n1   0  n2   1  fibo   0      n         try               for  int i   1  i  lt  n  i                          n1   n2              n2   fibo              fibo   n1   n2                      catch         return fibo      Now it consistently prints out 0 69    -- it actually runs faster  But why   Note  I compiled this using the Release configuration and directly ran the EXE file  outside Visual Studio    EDIT  Jon Skeet s excellent analysis shows that try-catch is somehow causing the x86 CLR to use the CPU registers in a more favorable way in this specific case  and I think we re yet to understand why   I confirmed Jon s finding that x64 CLR doesn t have this difference  and that it was faster than the x86 CLR  I also tested using int types inside the Fibo method instead of long types  and then the x86 CLR was as equally fast as the x64 CLR     UPDATE  It looks like this issue has been fixed by Roslyn  Same machine  same CLR version -- the issue remains as above when compiled with VS 2013  but the problem goes away when compiled with VS 2015

User · Accepted Answer

One of the Roslyn engineers who specializes in understanding optimization of stack usage took a look at this and reports to me that there seems to be a problem in the interaction between the way the C# compiler generates local variable stores and the way the JIT compiler does register scheduling in the corresponding x86 code. The result is suboptimal code generation on the loads and stores of the locals.

For some reason unclear to all of us, the problematic code generation path is avoided when the JITter knows that the block is in a try-protected region.

This is pretty weird. We'll follow up with the JITter team and see whether we can get a bug entered so that they can fix this.

Also, we are working on improvements for Roslyn to the C# and VB compilers' algorithms for determining when locals can be made "ephemeral" -- that is, just pushed and popped on the stack, rather than allocated a specific location on the stack for the duration of the activation. We believe that the JITter will be able to do a better job of register allocation and whatnot if we give it better hints about when locals can be made "dead" earlier.

Thanks for bringing this to our attention, and apologies for the odd behaviour.

User · Answer

Well  the way you re timing things looks pretty nasty to me  It would be much more sensible to just time the whole loop   var stopwatch   Stopwatch StartNew    for  int i   1  i  lt  100000000  i          Fibo 100     stopwatch Stop    Console WriteLine  Elapsed time   0    stopwatch Elapsed     That way you re not at the mercy of tiny timings  floating point arithmetic and accumulated error   Having made that change  see whether the  non-catch  version is still slower than the  catch  version   EDIT  Okay  I ve tried it myself - and I m seeing the same result  Very odd  I wondered whether the try catch was disabling some bad inlining  but using  MethodImpl MethodImplOptions NoInlining   instead didn t help     Basically you ll need to look at the optimized JITted code under cordbg  I suspect     EDIT  A few more bits of information    Putting the try catch around just the n    line still improves performance  but not by as much as putting it around the whole block If you catch a specific exception  ArgumentException in my tests  it s still fast If you print the exception in the catch block it s still fast If you rethrow the exception in the catch block it s slow again If you use a finally block instead of a catch block it s slow again If you use a finally block as well as a catch block  it s fast   Weird     EDIT  Okay  we have disassembly     This is using the C  2 compiler and  NET 2  32-bit  CLR  disassembling with mdbg  as I don t have cordbg on my machine   I still see the same performance effects  even under the debugger  The fast version uses a try block around everything between the variable declarations and the return statement  with just a catch   handler  Obviously the slow version is the same except without the try catch  The calling code  i e  Main  is the same in both cases  and has the same assembly representation  so it s not an inlining issue    Disassembled code for fast version     0000  push        ebp   0001  mov         ebp esp   0003  push        edi   0004  push        esi   0005  push        ebx   0006  sub         esp 1Ch   0009  xor         eax eax   000b  mov         dword ptr  ebp-20h  eax   000e  mov         dword ptr  ebp-1Ch  eax   0011  mov         dword ptr  ebp-18h  eax   0014  mov         dword ptr  ebp-14h  eax   0017  xor         eax eax   0019  mov         dword ptr  ebp-18h  eax   001c  mov         esi 1   0021  xor         edi edi   0023  mov         dword ptr  ebp-28h  1   002a  mov         dword ptr  ebp-24h  0   0031  inc         ecx   0032  mov         ebx 2   0037  cmp         ecx 2   003a  jle         00000024   003c  mov         eax esi   003e  mov         edx edi   0040  mov         esi dword ptr  ebp-28h    0043  mov         edi dword ptr  ebp-24h    0046  add         eax dword ptr  ebp-28h    0049  adc         edx dword ptr  ebp-24h    004c  mov         dword ptr  ebp-28h  eax   004f  mov         dword ptr  ebp-24h  edx   0052  inc         ebx   0053  cmp         ebx ecx   0055  jl          FFFFFFE7   0057  jmp         00000007   0059  call        64571ACB   005e  mov         eax dword ptr  ebp-28h    0061  mov         edx dword ptr  ebp-24h    0064  lea         esp  ebp-0Ch    0067  pop         ebx   0068  pop         esi   0069  pop         edi   006a  pop         ebp   006b  ret   Disassembled code for slow version     0000  push        ebp   0001  mov         ebp esp   0003  push        esi   0004  sub         esp 18h   0007  mov         dword ptr  ebp-14h  1   000e  mov         dword ptr  ebp-10h  0   0015  mov         dword ptr  ebp-1Ch  1   001c  mov         dword ptr  ebp-18h  0   0023  inc         ecx   0024  mov         esi 2   0029  cmp         ecx 2   002c  jle         00000031   002e  mov         eax dword ptr  ebp-14h    0031  mov         edx dword ptr  ebp-10h    0034  mov         dword ptr  ebp-0Ch  eax   0037  mov         dword ptr  ebp-8  edx   003a  mov         eax dword ptr  ebp-1Ch    003d  mov         edx dword ptr  ebp-18h    0040  mov         dword ptr  ebp-14h  eax   0043  mov         dword ptr  ebp-10h  edx   0046  mov         eax dword ptr  ebp-0Ch    0049  mov         edx dword ptr  ebp-8    004c  add         eax dword ptr  ebp-1Ch    004f  adc         edx dword ptr  ebp-18h    0052  mov         dword ptr  ebp-1Ch  eax   0055  mov         dword ptr  ebp-18h  edx   0058  inc         esi   0059  cmp         esi ecx   005b  jl          FFFFFFD3   005d  mov         eax dword ptr  ebp-1Ch    0060  mov         edx dword ptr  ebp-18h    0063  lea         esp  ebp-4    0066  pop         esi   0067  pop         ebp   0068  ret   In each case the   shows where the debugger entered in a simple  step-into    EDIT  Okay  I ve now looked through the code and I think I can see how each version works    and I believe the slower version is slower because it uses fewer registers and more stack space  For small values of n that s possibly faster - but when the loop takes up the bulk of the time  it s slower   Possibly the try catch block forces more registers to be saved and restored  so the JIT uses those for the loop as well    which happens to improve the performance overall  It s not clear whether it s a reasonable decision for the JIT to not use as many registers in the  normal  code   EDIT  Just tried this on my x64 machine  The x64 CLR is much faster  about 3-4 times faster  than the x86 CLR on this code  and under x64 the try catch block doesn t make a noticeable difference

User · Answer

9 years later and the bug is still there   You can see it easily with     static void Main  string   args               int hundredMillion   1000000        DateTime start   DateTime Now        double sqrt        for  int i 0  i  lt  hundredMillion  i                    sqrt   Math Sqrt  DateTime Now ToOADate                    DateTime end   DateTime Now         double sqrtMs    end - start  TotalMilliseconds         Console WriteLine   quot Elapsed milliseconds   quot    sqrtMs           DateTime start2   DateTime Now         double sqrt2        for  int i   0  i  lt  hundredMillion  i                    try                     sqrt2   Math Sqrt  DateTime Now ToOADate                        catch  Exception e                      int br   0                          DateTime end2   DateTime Now         double sqrtMsTryCatch    end2 - start2  TotalMilliseconds         Console WriteLine   quot Elapsed milliseconds   quot    sqrtMsTryCatch           Console WriteLine   quot ratio is  quot    sqrtMsTryCatch   sqrtMs           Console ReadLine           The ratio is less than one on my machine  running the latest version of MSVS 2019   NET 4 6 1

User · Answer

Jon s disassemblies show  that the difference between the two versions is that the fast version uses a pair of registers  esi edi  to store one of the local variables where the slow version doesn t   The JIT compiler makes different assumptions regarding register use for code that contains a try-catch block vs  code which doesn t  This causes it to make different register allocation choices  In this case  this favors the code with the try-catch block  Different code may lead to the opposite effect  so I would not count this as a general-purpose speed-up technique   In the end  it s very hard to tell which code will end up running the fastest  Something like register allocation and the factors that influence it are such low-level implementation details that I don t see how any specific technique could reliably produce faster code   For example  consider the following two methods  They were adapted from a real-life example   interface IIndexed   int this int index    get  set      struct StructArray   IIndexed        public int   Array      public int this int index            get   return Array index             set   Array index    value             static int Generic lt T gt  int length  T a  T b  where T   IIndexed       int sum   0      for  int i   0  i  lt  length  i            sum    a i    b i       return sum    static int Specialized int length  StructArray a  StructArray b        int sum   0      for  int i   0  i  lt  length  i            sum    a i    b i       return sum      One is a generic version of the other  Replacing the generic type with StructArray would make the methods identical  Because StructArray is a value type  it gets its own compiled version of the generic method  Yet the actual running time is significantly longer than the specialized method s  but only for x86  For x64  the timings are pretty much identical  In other cases  I ve observed differences for x64 as well

User · Answer

I d have put this in as a comment as I m really not certain that this is likely to be the case  but as I recall it doesn t a try except statement involve a modification to the way the garbage disposal mechanism of the compiler works  in that it clears up object memory allocations in a recursive way off the stack   There may not be an object to be cleared up in this case or the for loop may constitute a closure that the garbage collection mechanism recognises sufficient to enforce a different collection method  Probably not  but I thought it worth a mention as I hadn t seen it discussed anywhere else

User · Answer

This looks like a case of inlining gone bad   On an x86 core  the jitter has the ebx  edx  esi and edi register available for general purpose storage of local variables   The ecx register becomes available in a static method  it doesn t have to store this   The eax register often is needed for calculations   But these are 32-bit registers  for variables of type long it must use a pair of registers   Which are edx eax for calculations and edi ebx for storage   Which is what stands out in the disassembly for the slow version  neither edi nor ebx are used   When the jitter can t find enough registers to store local variables then it must generate code to load and store them from the stack frame   That slows down code  it prevents a processor optimization named  register renaming   an internal processor core optimization trick that uses multiple copies of a register and allows super-scalar execution   Which permits several instructions to run concurrently  even when they use the same register   Not having enough registers is a common problem on x86 cores  addressed in x64 which has 8 extra registers  r9 through r15    The jitter will do its best to apply another code generation optimization  it will try to inline your Fibo   method   In other words  not make a call to the method but generate the code for the method inline in the Main   method   Pretty important optimization that  for one  makes properties of a C  class for free  giving them the perf of a field   It avoids the overhead of making the method call and setting up its stack frame  saves a couple of nanoseconds   There are several rules that determine exactly when a method can be inlined   They are not exactly documented but have been mentioned in blog posts   One rule is that it won t happen when the method body is too large   That defeats the gain from inlining  it generates too much code that doesn t fit as well in the L1 instruction cache   Another hard rule that applies here is that a method won t be inlined when it contains a try catch statement   The background behind that one is an implementation detail of exceptions  they piggy-back onto Windows  built-in support for SEH  Structure Exception Handling  which is stack-frame based   One behavior of the register allocation algorithm in the jitter can be inferred from playing with this code   It appears to be aware of when the jitter is trying to inline a method   One rule it appears to use that only the edx eax register pair can be used for inlined code that has local variables of type long   But not edi ebx   No doubt because that would be too detrimental to the code generation for the calling method  both edi and ebx are important storage registers   So you get the fast version because the jitter knows up front that the method body contains try catch statements   It knows it can never be inlined so readily uses edi ebx for storage for the long variable   You got the slow version because the jitter didn t know up front that inlining wouldn t work   It only found out after generating the code for the method body   The flaw then is that it didn t go back and re-generate the code for the method   Which is understandable  given the time constraints it has to operate in   This slow-down doesn t occur on x64 because for one it has 8 more registers   For another because it can store a long in just one register  like rax    And the slow-down doesn t occur when you use int instead of long because the jitter has a lot more flexibility in picking registers

[c#] Try-catch speeding up my code?

Examples related to c#

Examples related to .net

Examples related to clr

Examples related to try-catch

Examples related to performance-testing