[c#] Try-catch speeding up my code?

I wrote some code for testing the impact of try-catch, but seeing some surprising results.

static void Main(string[] args)
{
    Thread.CurrentThread.Priority = ThreadPriority.Highest;
    Process.GetCurrentProcess().PriorityClass = ProcessPriorityClass.RealTime;

    long start = 0, stop = 0, elapsed = 0;
    double avg = 0.0;

    long temp = Fibo(1);

    for (int i = 1; i < 100000000; i++)
    {
        start = Stopwatch.GetTimestamp();
        temp = Fibo(100);
        stop = Stopwatch.GetTimestamp();

        elapsed = stop - start;
        avg = avg + ((double)elapsed - avg) / i;
    }

    Console.WriteLine("Elapsed: " + avg);
    Console.ReadKey();
}

static long Fibo(int n)
{
    long n1 = 0, n2 = 1, fibo = 0;
    n++;

    for (int i = 1; i < n; i++)
    {
        n1 = n2;
        n2 = fibo;
        fibo = n1 + n2;
    }

    return fibo;
}

On my computer, this consistently prints out a value around 0.96..

When I wrap the for loop inside Fibo() with a try-catch block like this:

static long Fibo(int n)
{
    long n1 = 0, n2 = 1, fibo = 0;
    n++;

    try
    {
        for (int i = 1; i < n; i++)
        {
            n1 = n2;
            n2 = fibo;
            fibo = n1 + n2;
        }
    }
    catch {}

    return fibo;
}

Now it consistently prints out 0.69... -- it actually runs faster! But why?

Note: I compiled this using the Release configuration and directly ran the EXE file (outside Visual Studio).

EDIT: Jon Skeet's excellent analysis shows that try-catch is somehow causing the x86 CLR to use the CPU registers in a more favorable way in this specific case (and I think we're yet to understand why). I confirmed Jon's finding that x64 CLR doesn't have this difference, and that it was faster than the x86 CLR. I also tested using int types inside the Fibo method instead of long types, and then the x86 CLR was as equally fast as the x64 CLR.


UPDATE: It looks like this issue has been fixed by Roslyn. Same machine, same CLR version -- the issue remains as above when compiled with VS 2013, but the problem goes away when compiled with VS 2015.

This question is related to c# .net clr try-catch performance-testing

The answer is


Jon's disassemblies show, that the difference between the two versions is that the fast version uses a pair of registers (esi,edi) to store one of the local variables where the slow version doesn't.

The JIT compiler makes different assumptions regarding register use for code that contains a try-catch block vs. code which doesn't. This causes it to make different register allocation choices. In this case, this favors the code with the try-catch block. Different code may lead to the opposite effect, so I would not count this as a general-purpose speed-up technique.

In the end, it's very hard to tell which code will end up running the fastest. Something like register allocation and the factors that influence it are such low-level implementation details that I don't see how any specific technique could reliably produce faster code.

For example, consider the following two methods. They were adapted from a real-life example:

interface IIndexed { int this[int index] { get; set; } }
struct StructArray : IIndexed { 
    public int[] Array;
    public int this[int index] {
        get { return Array[index]; }
        set { Array[index] = value; }
    }
}

static int Generic<T>(int length, T a, T b) where T : IIndexed {
    int sum = 0;
    for (int i = 0; i < length; i++)
        sum += a[i] * b[i];
    return sum;
}
static int Specialized(int length, StructArray a, StructArray b) {
    int sum = 0;
    for (int i = 0; i < length; i++)
        sum += a[i] * b[i];
    return sum;
}

One is a generic version of the other. Replacing the generic type with StructArray would make the methods identical. Because StructArray is a value type, it gets its own compiled version of the generic method. Yet the actual running time is significantly longer than the specialized method's, but only for x86. For x64, the timings are pretty much identical. In other cases, I've observed differences for x64 as well.


I'd have put this in as a comment as I'm really not certain that this is likely to be the case, but as I recall it doesn't a try/except statement involve a modification to the way the garbage disposal mechanism of the compiler works, in that it clears up object memory allocations in a recursive way off the stack. There may not be an object to be cleared up in this case or the for loop may constitute a closure that the garbage collection mechanism recognises sufficient to enforce a different collection method. Probably not, but I thought it worth a mention as I hadn't seen it discussed anywhere else.


Well, the way you're timing things looks pretty nasty to me. It would be much more sensible to just time the whole loop:

var stopwatch = Stopwatch.StartNew();
for (int i = 1; i < 100000000; i++)
{
    Fibo(100);
}
stopwatch.Stop();
Console.WriteLine("Elapsed time: {0}", stopwatch.Elapsed);

That way you're not at the mercy of tiny timings, floating point arithmetic and accumulated error.

Having made that change, see whether the "non-catch" version is still slower than the "catch" version.

EDIT: Okay, I've tried it myself - and I'm seeing the same result. Very odd. I wondered whether the try/catch was disabling some bad inlining, but using [MethodImpl(MethodImplOptions.NoInlining)] instead didn't help...

Basically you'll need to look at the optimized JITted code under cordbg, I suspect...

EDIT: A few more bits of information:

  • Putting the try/catch around just the n++; line still improves performance, but not by as much as putting it around the whole block
  • If you catch a specific exception (ArgumentException in my tests) it's still fast
  • If you print the exception in the catch block it's still fast
  • If you rethrow the exception in the catch block it's slow again
  • If you use a finally block instead of a catch block it's slow again
  • If you use a finally block as well as a catch block, it's fast

Weird...

EDIT: Okay, we have disassembly...

This is using the C# 2 compiler and .NET 2 (32-bit) CLR, disassembling with mdbg (as I don't have cordbg on my machine). I still see the same performance effects, even under the debugger. The fast version uses a try block around everything between the variable declarations and the return statement, with just a catch{} handler. Obviously the slow version is the same except without the try/catch. The calling code (i.e. Main) is the same in both cases, and has the same assembly representation (so it's not an inlining issue).

Disassembled code for fast version:

 [0000] push        ebp
 [0001] mov         ebp,esp
 [0003] push        edi
 [0004] push        esi
 [0005] push        ebx
 [0006] sub         esp,1Ch
 [0009] xor         eax,eax
 [000b] mov         dword ptr [ebp-20h],eax
 [000e] mov         dword ptr [ebp-1Ch],eax
 [0011] mov         dword ptr [ebp-18h],eax
 [0014] mov         dword ptr [ebp-14h],eax
 [0017] xor         eax,eax
 [0019] mov         dword ptr [ebp-18h],eax
*[001c] mov         esi,1
 [0021] xor         edi,edi
 [0023] mov         dword ptr [ebp-28h],1
 [002a] mov         dword ptr [ebp-24h],0
 [0031] inc         ecx
 [0032] mov         ebx,2
 [0037] cmp         ecx,2
 [003a] jle         00000024
 [003c] mov         eax,esi
 [003e] mov         edx,edi
 [0040] mov         esi,dword ptr [ebp-28h]
 [0043] mov         edi,dword ptr [ebp-24h]
 [0046] add         eax,dword ptr [ebp-28h]
 [0049] adc         edx,dword ptr [ebp-24h]
 [004c] mov         dword ptr [ebp-28h],eax
 [004f] mov         dword ptr [ebp-24h],edx
 [0052] inc         ebx
 [0053] cmp         ebx,ecx
 [0055] jl          FFFFFFE7
 [0057] jmp         00000007
 [0059] call        64571ACB
 [005e] mov         eax,dword ptr [ebp-28h]
 [0061] mov         edx,dword ptr [ebp-24h]
 [0064] lea         esp,[ebp-0Ch]
 [0067] pop         ebx
 [0068] pop         esi
 [0069] pop         edi
 [006a] pop         ebp
 [006b] ret

Disassembled code for slow version:

 [0000] push        ebp
 [0001] mov         ebp,esp
 [0003] push        esi
 [0004] sub         esp,18h
*[0007] mov         dword ptr [ebp-14h],1
 [000e] mov         dword ptr [ebp-10h],0
 [0015] mov         dword ptr [ebp-1Ch],1
 [001c] mov         dword ptr [ebp-18h],0
 [0023] inc         ecx
 [0024] mov         esi,2
 [0029] cmp         ecx,2
 [002c] jle         00000031
 [002e] mov         eax,dword ptr [ebp-14h]
 [0031] mov         edx,dword ptr [ebp-10h]
 [0034] mov         dword ptr [ebp-0Ch],eax
 [0037] mov         dword ptr [ebp-8],edx
 [003a] mov         eax,dword ptr [ebp-1Ch]
 [003d] mov         edx,dword ptr [ebp-18h]
 [0040] mov         dword ptr [ebp-14h],eax
 [0043] mov         dword ptr [ebp-10h],edx
 [0046] mov         eax,dword ptr [ebp-0Ch]
 [0049] mov         edx,dword ptr [ebp-8]
 [004c] add         eax,dword ptr [ebp-1Ch]
 [004f] adc         edx,dword ptr [ebp-18h]
 [0052] mov         dword ptr [ebp-1Ch],eax
 [0055] mov         dword ptr [ebp-18h],edx
 [0058] inc         esi
 [0059] cmp         esi,ecx
 [005b] jl          FFFFFFD3
 [005d] mov         eax,dword ptr [ebp-1Ch]
 [0060] mov         edx,dword ptr [ebp-18h]
 [0063] lea         esp,[ebp-4]
 [0066] pop         esi
 [0067] pop         ebp
 [0068] ret

In each case the * shows where the debugger entered in a simple "step-into".

EDIT: Okay, I've now looked through the code and I think I can see how each version works... and I believe the slower version is slower because it uses fewer registers and more stack space. For small values of n that's possibly faster - but when the loop takes up the bulk of the time, it's slower.

Possibly the try/catch block forces more registers to be saved and restored, so the JIT uses those for the loop as well... which happens to improve the performance overall. It's not clear whether it's a reasonable decision for the JIT to not use as many registers in the "normal" code.

EDIT: Just tried this on my x64 machine. The x64 CLR is much faster (about 3-4 times faster) than the x86 CLR on this code, and under x64 the try/catch block doesn't make a noticeable difference.


This looks like a case of inlining gone bad. On an x86 core, the jitter has the ebx, edx, esi and edi register available for general purpose storage of local variables. The ecx register becomes available in a static method, it doesn't have to store this. The eax register often is needed for calculations. But these are 32-bit registers, for variables of type long it must use a pair of registers. Which are edx:eax for calculations and edi:ebx for storage.

Which is what stands out in the disassembly for the slow version, neither edi nor ebx are used.

When the jitter can't find enough registers to store local variables then it must generate code to load and store them from the stack frame. That slows down code, it prevents a processor optimization named "register renaming", an internal processor core optimization trick that uses multiple copies of a register and allows super-scalar execution. Which permits several instructions to run concurrently, even when they use the same register. Not having enough registers is a common problem on x86 cores, addressed in x64 which has 8 extra registers (r9 through r15).

The jitter will do its best to apply another code generation optimization, it will try to inline your Fibo() method. In other words, not make a call to the method but generate the code for the method inline in the Main() method. Pretty important optimization that, for one, makes properties of a C# class for free, giving them the perf of a field. It avoids the overhead of making the method call and setting up its stack frame, saves a couple of nanoseconds.

There are several rules that determine exactly when a method can be inlined. They are not exactly documented but have been mentioned in blog posts. One rule is that it won't happen when the method body is too large. That defeats the gain from inlining, it generates too much code that doesn't fit as well in the L1 instruction cache. Another hard rule that applies here is that a method won't be inlined when it contains a try/catch statement. The background behind that one is an implementation detail of exceptions, they piggy-back onto Windows' built-in support for SEH (Structure Exception Handling) which is stack-frame based.

One behavior of the register allocation algorithm in the jitter can be inferred from playing with this code. It appears to be aware of when the jitter is trying to inline a method. One rule it appears to use that only the edx:eax register pair can be used for inlined code that has local variables of type long. But not edi:ebx. No doubt because that would be too detrimental to the code generation for the calling method, both edi and ebx are important storage registers.

So you get the fast version because the jitter knows up front that the method body contains try/catch statements. It knows it can never be inlined so readily uses edi:ebx for storage for the long variable. You got the slow version because the jitter didn't know up front that inlining wouldn't work. It only found out after generating the code for the method body.

The flaw then is that it didn't go back and re-generate the code for the method. Which is understandable, given the time constraints it has to operate in.

This slow-down doesn't occur on x64 because for one it has 8 more registers. For another because it can store a long in just one register (like rax). And the slow-down doesn't occur when you use int instead of long because the jitter has a lot more flexibility in picking registers.


9 years later and the bug is still there! You can see it easily with:

   static void Main( string[] args )
    {
      int hundredMillion = 1000000;
      DateTime start = DateTime.Now;
      double sqrt;
      for (int i=0; i < hundredMillion; i++)
      {
        sqrt = Math.Sqrt( DateTime.Now.ToOADate() );
      }
      DateTime end = DateTime.Now;

      double sqrtMs = (end - start).TotalMilliseconds;

      Console.WriteLine( "Elapsed milliseconds: " + sqrtMs );

      DateTime start2 = DateTime.Now;

      double sqrt2;
      for (int i = 0; i < hundredMillion; i++)
      {
        try
        {
          sqrt2 = Math.Sqrt( DateTime.Now.ToOADate() );
        }
        catch (Exception e)
        {
          int br = 0;
        }
      }
      DateTime end2 = DateTime.Now;

      double sqrtMsTryCatch = (end2 - start2).TotalMilliseconds;

      Console.WriteLine( "Elapsed milliseconds: " + sqrtMsTryCatch );

      Console.WriteLine( "ratio is " + sqrtMsTryCatch / sqrtMs );

      Console.ReadLine();
    }

The ratio is less than one on my machine, running the latest version of MSVS 2019, .NET 4.6.1


Examples related to c#

How can I convert this one line of ActionScript to C#? Microsoft Advertising SDK doesn't deliverer ads How to use a global array in C#? How to correctly write async method? C# - insert values from file into two arrays Uploading into folder in FTP? Are these methods thread safe? dotnet ef not found in .NET Core 3 HTTP Error 500.30 - ANCM In-Process Start Failure Best way to "push" into C# array

Examples related to .net

You must add a reference to assembly 'netstandard, Version=2.0.0.0 How to use Bootstrap 4 in ASP.NET Core No authenticationScheme was specified, and there was no DefaultChallengeScheme found with default authentification and custom authorization .net Core 2.0 - Package was restored using .NetFramework 4.6.1 instead of target framework .netCore 2.0. The package may not be fully compatible Update .NET web service to use TLS 1.2 EF Core add-migration Build Failed What is the difference between .NET Core and .NET Standard Class Library project types? Visual Studio 2017 - Could not load file or assembly 'System.Runtime, Version=4.1.0.0' or one of its dependencies Nuget connection attempt failed "Unable to load the service index for source" Token based authentication in Web API without any user interface

Examples related to clr

Try-catch speeding up my code? SQL Server: How to check if CLR is enabled? Resolving MSB3247 - Found conflicts between different versions of the same dependent assembly In C#, why is String a reference type that behaves like a value type? What are major differences between C# and Java?

Examples related to try-catch

Try-catch block in Jenkins pipeline script Error handling with try and catch in Laravel sql try/catch rollback/commit - preventing erroneous commit after rollback IsNumeric function in c# Why is "except: pass" a bad programming practice? EOFException - how to handle? How to efficiently use try...catch blocks in PHP How to return a value from try, catch, and finally? Multiple try codes in one block Catch KeyError in Python

Examples related to performance-testing

How do I install jmeter on a Mac? Try-catch speeding up my code? Timing a command's execution in PowerShell