Speed comparison with Project Euler C vs Python vs Erlang vs Haskell

Question

I have taken Problem  12 from Project Euler as a programming exercise and to compare my  surely not optimal  implementations in C  Python  Erlang and Haskell  In order to get some higher execution times  I search for the first triangle number with more than 1000 divisors instead of 500 as stated in the original problem   The result is the following   C   lorenzo enzo   erlang  gcc -lm -o euler12 bin euler12 c lorenzo enzo   erlang  time   euler12 bin 842161320  real    0m11 074s user    0m11 070s sys 0m0 000s   Python   lorenzo enzo   erlang  time   euler12 py  842161320  real    1m16 632s user    1m16 370s sys 0m0 250s   Python with PyPy   lorenzo enzo   Downloads pypy-c-jit-43780-b590cf6de419-linux64 bin  time   pypy  home lorenzo erlang euler12 py  842161320  real    0m13 082s user    0m13 050s sys 0m0 020s   Erlang   lorenzo enzo   erlang  erlc euler12 erl  lorenzo enzo   erlang  time erl -s euler12 solve Erlang R13B03  erts-5 7 4   source   64-bit   smp 4 4   rq 4   async-threads 0   hipe   kernel-poll false   Eshell V5 7 4   abort with  G  1 gt  842161320  real    0m48 259s user    0m48 070s sys 0m0 020s   Haskell   lorenzo enzo   erlang  ghc euler12 hs -o euler12 hsx  1 of 1  Compiling Main               euler12 hs  euler12 o   Linking euler12 hsx     lorenzo enzo   erlang  time   euler12 hsx  842161320  real    2m37 326s user    2m37 240s sys 0m0 080s   Summary    C  100  Python  692   118  with PyPy  Erlang  436   135  thanks to RichardC  Haskell  1421    I suppose that C has a big advantage as it uses long for the calculations and not arbitrary length integers as the other three  Also it doesn t need to load a runtime first  Do the others     Question 1  Do Erlang  Python and Haskell lose speed due to using arbitrary length integers or don t they as long as the values are less than MAXINT   Question 2  Why is Haskell so slow  Is there a compiler flag that turns off the brakes or is it my implementation   The latter is quite probable as Haskell is a book with seven seals to me    Question 3  Can you offer me some hints how to optimize these implementations without changing the way I determine the factors  Optimization in any way  nicer  faster  more  native  to the language   EDIT   Question 4  Do my functional implementations permit LCO  last call optimization  a k a tail recursion elimination  and hence avoid adding unnecessary frames onto the call stack   I really tried to implement the same algorithm as similar as possible in the four languages  although I have to admit that my Haskell and Erlang knowledge is very limited     Source codes used    include  lt stdio h gt   include  lt math h gt   int factorCount  long n        double square   sqrt  n       int isquare    int  square      int count   isquare    square   -1   0      long candidate      for  candidate   1  candidate  lt   isquare  candidate             if  0    n   candidate  count    2      return count     int main          long triangle   1      int index   1      while  factorCount  triangle   lt  1001                index             triangle    index            printf    ld n   triangle             usr bin env python3 2  import math  def factorCount  n       square   math sqrt  n      isquare   int  square      count   -1 if isquare    square else 0     for candidate in range  1  isquare   1           if not n   candidate  count    2     return count  triangle   1 index   1 while factorCount  triangle   lt  1001      index    1     triangle    index  print  triangle      -module  euler12   -compile  export all    factorCount  Number  - gt  factorCount  Number  math sqrt  Number   1  0    factorCount     Sqrt  Candidate  Count  when Candidate  gt  Sqrt - gt  Count   factorCount     Sqrt  Candidate  Count  when Candidate    Sqrt - gt  Count   1   factorCount  Number  Sqrt  Candidate  Count  - gt      case Number rem Candidate of         0 - gt  factorCount  Number  Sqrt  Candidate   1  Count   2             - gt  factorCount  Number  Sqrt  Candidate   1  Count      end   nextTriangle  Index  Triangle  - gt      Count   factorCount  Triangle       if         Count  gt  1000 - gt  Triangle          true - gt  nextTriangle  Index   1  Triangle   Index   1        end   solve    - gt      io format    p n    nextTriangle  1  1           halt  0       factorCount number   factorCount  number isquare 1 0 -  fromEnum   square    fromIntegral isquare      where square   sqrt   fromIntegral number           isquare   floor square  factorCount  number sqrt candidate count       fromIntegral candidate  gt  sqrt   count       number  mod  candidate    0   factorCount  number sqrt  candidate   1   count   2        otherwise   factorCount  number sqrt  candidate   1  count  nextTriangle index triangle       factorCount triangle  gt  1000   triangle       otherwise   nextTriangle  index   1   triangle   index   1   main   print   nextTriangle 1 1

User · Answer

Some more numbers and explanations for the C version. Apparently noone did it during all those years. Remember to upvote this answer so it can get on top for everyone to see and learn.

Step One: Benchmark of the author's programs

Laptop Specifications:

CPU i3 M380 (931 MHz - maximum battery saving mode)
4GB memory
Win7 64 bits
Microsoft Visual Studio 2012 Ultimate
Cygwin with gcc 4.9.3
Python 2.7.10

Commands:

compiling on VS x64 command prompt > `for /f %f in ('dir /b *.c') do cl /O2 /Ot /Ox %f -o %f_x64_vs2012.exe`
compiling on cygwin with gcc x64   > `for f in ./*.c; do gcc -m64 -O3 $f -o ${f}_x64_gcc.exe ; done`
time (unix tools) using cygwin > `for f in ./*.exe; do  echo "----------"; echo $f ; time $f ; done`

.

----------
$ time python ./original.py

real    2m17.748s
user    2m15.783s
sys     0m0.093s
----------
$ time ./original_x86_vs2012.exe

real    0m8.377s
user    0m0.015s
sys     0m0.000s
----------
$ time ./original_x64_vs2012.exe

real    0m8.408s
user    0m0.000s
sys     0m0.015s
----------
$ time ./original_x64_gcc.exe

real    0m20.951s
user    0m20.732s
sys     0m0.030s

Filenames are: integertype_architecture_compiler.exe

integertype is the same as the original program for now (more on that later)
architecture is x86 or x64 depending on the compiler settings
compiler is gcc or vs2012

Step Two: Investigate, Improve and Benchmark Again

VS is 250% faster than gcc. The two compilers should give a similar speed. Obviously, something is wrong with the code or the compiler options. Let's investigate!

The first point of interest is the integer types. Conversions can be expensive and consistency is important for better code generation & optimizations. All integers should be the same type.

It's a mixed mess of int and long right now. We're going to improve that. What type to use? The fastest. Gotta benchmark them'all!

----------
$ time ./int_x86_vs2012.exe

real    0m8.440s
user    0m0.016s
sys     0m0.015s
----------
$ time ./int_x64_vs2012.exe

real    0m8.408s
user    0m0.016s
sys     0m0.015s
----------
$ time ./int32_x86_vs2012.exe

real    0m8.408s
user    0m0.000s
sys     0m0.015s
----------
$ time ./int32_x64_vs2012.exe

real    0m8.362s
user    0m0.000s
sys     0m0.015s
----------
$ time ./int64_x86_vs2012.exe

real    0m18.112s
user    0m0.000s
sys     0m0.015s
----------
$ time ./int64_x64_vs2012.exe

real    0m18.611s
user    0m0.000s
sys     0m0.015s
----------
$ time ./long_x86_vs2012.exe

real    0m8.393s
user    0m0.015s
sys     0m0.000s
----------
$ time ./long_x64_vs2012.exe

real    0m8.440s
user    0m0.000s
sys     0m0.015s
----------
$ time ./uint32_x86_vs2012.exe

real    0m8.362s
user    0m0.000s
sys     0m0.015s
----------
$ time ./uint32_x64_vs2012.exe

real    0m8.393s
user    0m0.015s
sys     0m0.015s
----------
$ time ./uint64_x86_vs2012.exe

real    0m15.428s
user    0m0.000s
sys     0m0.015s
----------
$ time ./uint64_x64_vs2012.exe

real    0m15.725s
user    0m0.015s
sys     0m0.015s
----------
$ time ./int_x64_gcc.exe

real    0m8.531s
user    0m8.329s
sys     0m0.015s
----------
$ time ./int32_x64_gcc.exe

real    0m8.471s
user    0m8.345s
sys     0m0.000s
----------
$ time ./int64_x64_gcc.exe

real    0m20.264s
user    0m20.186s
sys     0m0.015s
----------
$ time ./long_x64_gcc.exe

real    0m20.935s
user    0m20.809s
sys     0m0.015s
----------
$ time ./uint32_x64_gcc.exe

real    0m8.393s
user    0m8.346s
sys     0m0.015s
----------
$ time ./uint64_x64_gcc.exe

real    0m16.973s
user    0m16.879s
sys     0m0.030s

Integer types are int long int32_t uint32_t int64_t and uint64_t from #include <stdint.h>

There are LOTS of integer types in C, plus some signed/unsigned to play with, plus the choice to compile as x86 or x64 (not to be confused with the actual integer size). That is a lot of versions to compile and run ^^

Step Three: Understanding the Numbers

Definitive conclusions:

32 bits integers are ~200% faster than 64 bits equivalents
unsigned 64 bits integers are 25 % faster than signed 64 bits (Unfortunately, I have no explanation for that)

Trick question: "What are the sizes of int and long in C?"
The right answer is: The size of int and long in C are not well-defined!

From the C spec:

int is at least 32 bits
long is at least an int

From the gcc man page (-m32 and -m64 flags):

The 32-bit environment sets int, long and pointer to 32 bits and generates code that runs on any i386 system.
The 64-bit environment sets int to 32 bits and long and pointer to 64 bits and generates code for AMD’s x86-64 architecture.

From MSDN documentation (Data Type Ranges) https://msdn.microsoft.com/en-us/library/s3f49ktz%28v=vs.110%29.aspx :

int, 4 bytes, also knows as signed
long, 4 bytes, also knows as long int and signed long int

To Conclude This: Lessons Learned

32 bits integers are faster than 64 bits integers.
Standard integers types are not well defined in C nor C++, they vary depending on compilers and architectures. When you need consistency and predictability, use the uint32_t integer family from #include <stdint.h>.
Speed issues solved. All other languages are back hundreds percent behind, C & C++ win again ! They always do. The next improvement will be multithreading using OpenMP :D

User · Answer

C  11   lt  20ms for me - Run it here  I understand that you want tips to help improve your language specific knowledge  but since that has been well covered here  I thought I would add some context for people who may have looked at the mathematica comment on your question  etc  and wondered why this code was so much slower   This answer is mainly to provide context to hopefully help people evaluate the code in your question   other answers more easily   This code uses only a couple of  uglyish  optimisations  unrelated to the language used  based on    every traingle number is of the form n n 1  2 n and n 1 are coprime the number of divisors is a multiplicative function      include  lt iostream gt   include  lt cmath gt   include  lt tuple gt   include  lt chrono gt   using namespace std      Calculates the divisors of an integer by determining its prime factorisation   int get divisors long long n        int divisors count   1       for long long i   2          i  lt   sqrt n              empty                   int divisions   0          while n   i    0                        n    i              divisions                       divisors count     divisions   1              here  we try to iterate more efficiently by skipping           obvious non-primes like 4  6  etc         if i    2              i            else             i    2             if n    1    n is a prime         return divisors count   2      else         return divisors count     long long euler12           n and n   1     long long n  n p 1       n   1  n p 1   2          divisors x will store either the divisors of x or x 2         the later iff x is divisible by two      long long divisors n   1      long long divisors n p 1   2       for                      This loop has been unwound  so two iterations are completed at a time            n and n   1 have no prime factors in common and therefore we can            calculate their divisors separately                      long long total divisors                    the divisors of the triangle number                                                      n n 1  2            the first  unwound  iteration          divisors n p 1   get divisors n p 1   2     here n 1 is even and we          total divisors                     divisors n                   divisors n p 1           if total divisors  gt  1000              break             move n and n 1 forward         n   n p 1          n p 1   n   1             fix the divisors         divisors n   divisors n p 1          divisors n p 1   get divisors n p 1       n p 1 is now odd             now the second  unwound  iteration          total divisors                     divisors n                   divisors n p 1           if total divisors  gt  1000              break             move n and n 1 forward         n   n p 1          n p 1   n   1             fix the divisors         divisors n   divisors n p 1          divisors n p 1   get divisors n p 1   2       n p 1 is now even             return  n   n p 1    2     int main         for int i   0  i  lt  1000  i                  using namespace std  chrono          auto start   high resolution clock  now            auto result   euler12            auto end   high resolution clock  now             double time elapsed   duration cast lt milliseconds gt  end - start  count             cout  lt  lt  result  lt  lt       lt  lt  time elapsed  lt  lt    n             return 0      That takes around 19ms on average for my desktop and 80ms for my laptop  a far cry from most of the other code I ve seen here  And there are  no doubt  many optimisations still available

User · Answer

Your Haskell implementation could be greatly sped up by using some functions from Haskell packages  In this case I used primes  which is just installed with  cabal install primes      import Data Numbers Primes import Data List  triangleNumbers   scanl1      1    nDivisors n   product   map    1    length   group  primeFactors n   answer   head   filter    gt  500    nDivisors  triangleNumbers  main    IO    main   putStrLn    First triangle number to have over 500 divisors        show answer    Timings   Your original program   PS gt  measure-command   bin 012 slow exe    TotalSeconds        16 3807409 TotalMilliseconds   16380 7409   Improved implementation  PS gt  measure-command   bin 012 exe    TotalSeconds        0 0383436 TotalMilliseconds   38 3436   As you can see  this one runs in 38 milliseconds on the same machine where yours ran in 16 seconds     Compilation commands   ghc -O2 012 hs -o bin 012 exe ghc -O2 012 slow hs -o bin 012 slow exe

User · Answer

Looking at your Erlang implementation  The timing has included the start up of the entire virtual machine  running your program and halting the virtual machine  Am pretty sure that setting up and halting the erlang vm takes some time   If the timing was done within the erlang virtual machine itself  results would be different as in that case we would have the actual time for only the program in question  Otherwise  i believe that the total time taken by the process of starting and loading of the Erlang Vm plus that of halting it  as you put it in your program  are all included in the total time which the method you are using to time the program is outputting  Consider using the erlang timing itself which we use when we want to time our programs within the virtual machine itself timer tc 1 or timer tc 2 or timer tc 3  In this way  the results from erlang will exclude the time taken to start and stop kill halt the virtual machine  That is my reasoning there  think about it  and then try your bench mark again   I actually suggest that we try to time the program  for languages that have a runtime   within the runtime of those languages in order to get a precise value  C for example has no overhead of starting and shutting down a runtime system as does Erlang  Python and Haskell  98  sure of this - i stand correction   So  based on this reasoning  i conclude by saying that this benchmark wasnot precise  fair enough for languages running on top of a runtime system  Lets do it again with these changes   EDIT  besides even if all the languages had runtime systems  the overhead of starting each and halting it would differ  so i suggest we time from within the runtime systems  for the languages for which this applies   The Erlang VM is known to have considerable overhead at start up

User · Answer

Using GHC 7 0 3  gcc 4 4 6  Linux 2 6 29 on an x86 64 Core2 Duo  2 5GHz  machine  compiling using ghc -O2 -fllvm -fforce-recomp for Haskell and gcc -O3 -lm for C    Your C routine runs in 8 4 seconds  faster than your run probably because of -O3  The Haskell solution runs in 36 seconds  due to the -O2 flag  Your factorCount  code isn t explicitly typed and defaulting to Integer  thanks to Daniel for correcting my misdiagnosis here     Giving an explicit type signature  which is standard practice anyway  using Int and the time changes to 11 1 seconds in factorCount  you have needlessly called fromIntegral  A fix results in no change though  the compiler is smart  lucky for you   You used mod where rem is faster and sufficient  This changes the time to 8 5 seconds  factorCount  is constantly applying two extra arguments that never change  number  sqrt    A worker wrapper transformation gives us       time   so  842161320     real    0m7 954s    user    0m7 944s    sys     0m0 004s     That s right  7 95 seconds   Consistently half a second faster than the C solution   Without the -fllvm flag I m still getting 8 182 seconds  so the NCG backend is doing well in this case too   Conclusion  Haskell is awesome   Resulting Code  factorCount number   factorCount  number isquare 1 0 -  fromEnum   square    fromIntegral isquare      where square   sqrt   fromIntegral number           isquare   floor square  factorCount     Int - gt  Int - gt  Int - gt  Int - gt  Int factorCount  number sqrt candidate0 count0   go candidate0 count0   where   go candidate count       candidate  gt  sqrt   count       number  rem  candidate    0   go  candidate   1   count   2        otherwise   go  candidate   1  count  nextTriangle index triangle       factorCount triangle  gt  1000   triangle       otherwise   nextTriangle  index   1   triangle   index   1   main   print   nextTriangle 1 1   EDIT  So now that we ve explored that  lets address the questions     Question 1  Do erlang  python and haskell lose speed due to using   arbitrary length integers or don t they as long as the values are less   than MAXINT    In Haskell  using Integer is slower than Int but how much slower depends on the computations performed   Luckily  for 64 bit machines  Int is sufficient   For portability sake you should probably rewrite my code to use Int64 or Word64  C isn t the only language with a long       Question 2  Why is haskell so slow  Is there a compiler flag that   turns off the brakes or is it my implementation   The latter is quite   probable as haskell is a book with seven seals to me        Question 3  Can you offer me some hints how to optimize these   implementations without changing the way I determine the factors    Optimization in any way  nicer  faster  more  native  to the language    That was what I answered above    The answer was    0  Use optimization via -O2  1  Use fast   notably  unbox-able  types when possible  2  rem not mod  a frequently forgotten optimization  and  3  worker wrapper transformation  perhaps the most common optimization        Question 4  Do my functional implementations permit LCO and hence   avoid adding unnecessary frames onto the call stack    Yes  that wasn t the issue   Good work and glad you considered this

User · Answer

Take a look at this blog  Over the past year or so he s done a few of the Project Euler problems in Haskell and Python  and he s generally found Haskell to be much faster  I think that between those languages it has more to do with your fluency and coding style   When it comes to Python speed  you re using the wrong implementation  Try PyPy  and for things like this you ll find it to be much  much faster

User · Answer

Just for fun  The following is a more  native  Haskell implementation   import Control Applicative import Control Monad import Data Either import Math NumberTheory Powers Squares  isInt    RealFrac c   gt  c - gt  Bool isInt         lt   gt  id  lt   gt  fromInteger   round  intSqrt     Integral a    gt  a - gt  Int --intSqrt   fromIntegral   floor   sqrt   fromIntegral intSqrt   fromIntegral   integerSquareRoot   factorize    Int - gt   Int  factorize 1      factorize n   first   factorize  quot n first    where first       0     a   a  lt -  2  intSqrt n   rem n a    0      n   factorize2    Int - gt    Int Int   factorize2   foldl   ls   val freq  xs  y - gt  if val    y then  val freq 1  xs else  y 1  ls    0 0     factorize  numDivisors    Int - gt  Int numDivisors   foldl   acc    y  - gt  acc    y 1   1  lt   gt  factorize2  nextTriangleNumber     Int Int  - gt   Int Int  nextTriangleNumber  n acc     n 1 acc n 1   forward    Int - gt   Int  Int  - gt  Either  Int  Int   Int  Int  forward k val  n acc    if numDivisors acc  gt  k then Left val else Right  nextTriangleNumber val   problem12    Int - gt   Int  Int  problem12 n      0    lefts   scanl   gt  gt     forward n  1 1     repeat   forward   n  main   do   let  n val    problem12 1000   print val   Using ghc -O3  this consistently runs in 0 55-0 58 seconds on my machine  1 73GHz Core i7    A more efficient factorCount function for the C version   int factorCount  int n      int count   1    int candidate tmpCount    while  n   2    0        count        n    2          for  candidate   3  candidate  lt  n  amp  amp  candidate   candidate  lt  n  candidate    2      if  n   candidate    0          tmpCount   1        do           tmpCount            n    candidate          while  n   candidate    0          count  tmpCount            if  n  gt  1      count    2    return count      Changing longs to ints in main  using gcc -O3 -lm  this consistently runs in 0 31-0 35 seconds   Both can be made to run even faster if you take advantage of the fact that the nth triangle number   n  n 1  2  and n and  n 1  have completely disparate prime factorizations  so the number of factors of each half can be multiplied to find the number of factors of the whole  The following   int main        int triangle   0 count1 count2   1    do       count1   count2      count2     triangle   2    0   factorCount triangle 1    factorCount  triangle 1  2       while  count1 count2  lt  1001     printf    lld n     long long triangle   triangle 1  2       will reduce the c code run time to 0 17-0 19 seconds  and it can handle much larger searches -- greater than 10000 factors takes about 43 seconds on my machine  I leave a similar haskell speedup to the interested reader

User · Answer

Question 1  Do erlang  python and haskell loose speed due to using arbitrary length integers or don t they as long as the values are less than MAXINT   This is unlikely  I cannot say much about Erlang and Haskell  well  maybe a bit about Haskell below  but I can point a lot of other bottlenecks in Python  Every time the program tries to execute an operation with some values in Python  it should verify whether the values are from the proper type  and it costs a bit of time  Your factorCount function just allocates a list with range  1  isquare   1  various times  and runtime  malloc-styled memory allocation is way slower than iterating on a range with a counter as you do in C  Notably  the factorCount   is called multiple times and so allocates a lot of lists  Also  let us not forget that Python is interpreted and the CPython interpreter has no great focus on being optimized   EDIT  oh  well  I note that you are using Python 3 so range   does not return a list  but a generator  In this case  my point about allocating lists is half-wrong  the function just allocates range objects  which are inefficient nonetheless but not as inefficient as allocating a list with a lot of items   Question 2  Why is haskell so slow  Is there a compiler flag that turns off the brakes or is it my implementation   The latter is quite probable as haskell is a book with seven seals to me    Are you using Hugs  Hugs is a considerably slow interpreter  If you are using it  maybe you can get a better time with GHC - but I am only cogitating hypotesis  the kind of stuff a good Haskell compiler does under the hood is pretty fascinating and way beyond my comprehension     Question 3  Can you offer me some hints how to optimize these implementations without changing the way I determine the factors  Optimization in any way  nicer  faster  more  native  to the language   I d say you are playing an unfunny game  The best part of knowing various languages is to use them the most different way possible    But I digress  I just do not have any recommendation for this point  Sorry  I hope someone can help you in this case      Question 4  Do my functional implementations permit LCO and hence avoid adding unnecessary frames onto the call stack   As far as I remember  you just need to make sure that your recursive call is the last command before returning a value  In other words  a function like the one below could use such optimization   def factorial n  acc 1       if n  gt  1          acc   acc   n         n   n - 1         return factorial n  acc      else          return acc   However  you would not have such optimization if your function were such as the one below  because there is an operation  multiplication  after the recursive call   def factorial2 n       if n  gt  1          f   factorial2 n-1          return f n     else          return 1   I separated the operations in some local variables for make it clear which operations are executed  However  the most usual is to see these functions as below  but they are equivalent for the point I am making   def factorial n  acc 1       if n  gt  1          return factorial n-1  acc n      else          return acc  def factorial2 n       if n  gt  1          return n factorial n-1      else          return 1   Note that it is up to the compiler interpreter to decide if it will make tail recursion  For example  the Python interpreter does not do it if I remember well  I used Python in my example only because of its fluent syntax   Anyway  if you find strange stuff such as factorial functions with two parameters  and one of the parameters has names such as acc  accumulator etc   now you know why people do it

User · Answer

Question 3  Can you offer me some hints how to optimize these implementations   without changing the way I determine the factors  Optimization in any   way  nicer  faster  more  native  to the language    The C implementation is suboptimal  as hinted at by Thomas M  DuBuisson   the version uses 64-bit integers  i e  long datatype    I ll investigate the assembly listing later  but with an educated guess  there are some memory accesses going on in the compiled code  which make using 64-bit integers significantly slower  It s that or generated code  be it the fact that you can fit less 64-bit ints in a SSE register or round a double to a 64-bit integer is slower    Here is the modified code  simply replace long with int and I explicitly inlined factorCount  although I do not think that this is necessary with gcc -O3     include  lt stdio h gt   include  lt math h gt   static inline int factorCount int n        double square   sqrt  n       int isquare    int square      int count   isquare    square   -1   0      int candidate      for  candidate   1  candidate  lt   isquare  candidate             if  0    n   candidate  count    2      return count     int main          int triangle   1      int index   1      while  factorCount  triangle   lt  1001                index            triangle    index            printf    d n   triangle       Running   timing it gives     gcc -O3 -lm -o euler12 euler12 c  time   euler12 842161320   euler12  2 95s user 0 00s system 99  cpu 2 956 total   For reference  the haskell implementation by Thomas in the earlier answer gives     ghc -O2 -fllvm -fforce-recomp euler12 hs  time   euler12                                                                                       9 40   1 of 1  Compiling Main               euler12 hs  euler12 o   Linking euler12     842161320   euler12  9 43s user 0 13s system 99  cpu 9 602 total   Conclusion  Taking nothing away from ghc  its a great compiler  but gcc normally generates faster code

User · Answer

In regards to Python optimization  in addition to using PyPy  for pretty impressive speed-ups with zero change to your code   you could use PyPy s translation toolchain to compile an RPython-compliant version  or Cython to build an extension module  both of which are faster than the C version in my testing  with the Cython module nearly twice as fast  For reference I include C and PyPy benchmark results as well   C  compiled with gcc -O3 -lm     time   euler12-c  842161320    euler12-c  11 95s   user 0 00s   system 99    cpu 11 959 total   PyPy 1 5    time pypy euler12 py 842161320 pypy euler12 py   16 44s user  0 01s system  99  cpu 16 449 total   RPython  using latest PyPy revision  c2f583445aee     time   euler12-rpython-c 842161320   euler12-rpy-c   10 54s user 0 00s  system 99   cpu 10 540 total   Cython 0 15    time python euler12-cython py 842161320 python euler12-cython py   6 27s user 0 00s  system 99   cpu 6 274 total   The RPython version has a couple of key changes  To translate into a standalone program you need to define your target  which in this case is the main function  It s expected to accept sys argv as it s only argument  and is required to return an int  You can translate it by using translate py    translate py euler12-rpython py which translates to C and compiles it for you     euler12-rpython py  import math  sys  def factorCount n       square   math sqrt n      isquare   int square      count   -1 if isquare    square else 0     for candidate in xrange 1  isquare   1           if not n   candidate  count    2     return count  def main argv       triangle   1     index   1     while factorCount triangle   lt  1001          index    1         triangle    index     print triangle     return 0  if   name         main         main sys argv   def target  args       return main  None   The Cython version was rewritten as an extension module  euler12 pyx  which I import and call from a normal python file  The  euler12 pyx is essentially the same as your version  with some additional static type declarations  The setup py has the normal boilerplate to build the extension  using python setup py build ext --inplace      euler12 pyx from libc math cimport sqrt  cdef int factorCount int n       cdef int candidate  isquare  count     cdef double square     square   sqrt n      isquare   int square      count   -1 if isquare    square else 0     for candidate in range 1  isquare   1           if not n   candidate  count    2     return count  cpdef main        cdef int triangle   1  index   1     while factorCount triangle   lt  1001          index    1         triangle    index     print triangle    euler12-cython py import  euler12  euler12 main      setup py from distutils core import setup from distutils extension import Extension from Cython Distutils import build ext  ext modules    Extension   euler12      euler12 pyx      setup    name    Euler12-Cython     cmdclass     build ext   build ext     ext modules   ext modules     I honestly have very little experience with either RPython or Cython  and was pleasantly surprised at the results  If you are using CPython  writing your CPU-intensive bits of code in a Cython extension module seems like a really easy way to optimize your program

User · Answer

include  lt stdio h gt   include  lt math h gt   int factorCount  long n        double square   sqrt  n       int isquare    int  square 1      long candidate   2      int count   1      while candidate  lt   isquare  amp  amp  candidate lt  n           int c   1          while  n   candidate    0               c               n    candidate                    count    c          candidate              return count     int main          long triangle   1      int index   1      while  factorCount  triangle   lt  1001                index             triangle    index            printf    ld n   triangle       gcc -lm -Ofast euler c  time   a out    2 79s user 0 00s system 99  cpu 2 794 total

User · Answer

There are some problems with the Erlang implementation  As baseline for the following  my measured execution time for your unmodified Erlang program was 47 6 seconds  compared to 12 7 seconds for the C code   The first thing you should do if you want to run computationally intensive Erlang code is to use native code  Compiling with erlc  native euler12 got the time down to 41 3 seconds  This is however a much lower speedup  just 15   than expected from native compilation on this kind of code  and the problem is your use of -compile export all   This is useful for experimentation  but the fact that all functions are potentially reachable from the outside causes the native compiler to be very conservative   The normal BEAM emulator is not that much affected   Replacing this declaration with -export  solve 0    gives a much better speedup  31 5 seconds  almost 35  from the baseline    But the code itself has a problem  for each iteration in the factorCount loop  you perform this test   factorCount     Sqrt  Candidate  Count  when Candidate    Sqrt - gt  Count   1    The C code doesn t do this  In general  it can be tricky to make a fair comparison between different implementations of the same code  and in particular if the algorithm is numerical  because you need to be sure that they are actually doing the same thing  A slight rounding error in one implementation due to some typecast somewhere may cause it to do many more iterations than the other even though both eventually reach the same result   To eliminate this possible error source  and get rid of the extra test in each iteration   I rewrote the factorCount function as follows  closely modelled on the C code   factorCount  N  - gt      Sqrt   math sqrt  N       ISqrt   trunc Sqrt       if ISqrt    Sqrt - gt  factorCount  N  ISqrt  1  -1          true          - gt  factorCount  N  ISqrt  1  0      end   factorCount   N  ISqrt  Candidate  Count  when Candidate  gt  ISqrt - gt  Count  factorCount   N  ISqrt  Candidate  Count  - gt      case N rem Candidate of         0 - gt  factorCount  N  ISqrt  Candidate   1  Count   2             - gt  factorCount  N  ISqrt  Candidate   1  Count      end    This rewrite  no export all  and native compilation  gave me the following run time     erlc  native euler12 erl   time erl -noshell -s euler12 solve 842161320  real    0m19 468s user    0m19 450s sys 0m0 010s   which is not too bad compared to the C code     time   a out  842161320  real    0m12 755s user    0m12 730s sys 0m0 020s   considering that Erlang is not at all geared towards writing numerical code  being only 50  slower than C on a program like this is pretty good   Finally  regarding your questions   Question 1  Do erlang  python and haskell loose speed due to using arbitrary length integers or  don t they as long as the values are less than MAXINT   Yes  somewhat  In Erlang  there is no way of saying  use 32 64-bit arithmetic with wrap-around   so unless the compiler can prove some bounds on your integers  and it usually can t   it must check all computations to see if they can fit in a single tagged word or if it has to turn them into heap-allocated bignums  Even if no bignums are ever used in practice at runtime  these checks will have to be performed  On the other hand  that means you know that the algorithm will never fail because of an unexpected integer wraparound if you suddenly give it larger inputs than before   Question 4  Do my functional implementations permit LCO and hence avoid adding unnecessary frames onto the call stack   Yes  your Erlang code is correct with respect to last call optimization

User · Answer

Change  case  divisor T round math sqrt T     gt  500  of  To      case  divisor T round math sqrt T     gt  1000  of  This will produce the correct answer for the Erlang multi-process example

User · Answer

With Haskell  you really don t need to think in recursions explicitly   factorCount number   foldr factorCount  0  1  isquare  -                       fromEnum   square    fromIntegral isquare      where       square   sqrt   fromIntegral number       isquare   floor square       factorCount  candidate           number  rem  candidate    0    2              otherwise   id  triangles     Int  triangles   scanl1      1 2     main   print   head   dropWhile    lt  1001    factorCount  triangles   In the above code  I have replaced explicit recursions in  Thomas  answer with common list operations   The code still does exactly the same thing without us worrying about tail recursion   It runs    7 49s  about 6  slower than the version in  Thomas  answer    7 04s  on my machine with GHC 7 6 2  while the C version from  Raedwulf runs   3 15s   It seems GHC has improved over the year   PS  I know it is an old question  and I stumble upon it from google searches  I forgot what I was searching  now       Just wanted to comment on the question about LCO and express my feelings about Haskell in general   I wanted to comment on the top answer  but comments do not allow code blocks

User · Answer

I modified  Jannich Brendle  version to 1000 instead 500  And list the result of euler12 bin  euler12 erl  p12dist erl  Both erl codes use   native  to compile   zhengs-MacBook-Pro workspace zhengzhibin  time erl -noshell -s p12dist start The result is  842161320   real    0m3 879s user    0m14 553s sys     0m0 314s zhengs-MacBook-Pro workspace zhengzhibin  time erl -noshell -s euler12 solve 842161320  real    0m10 125s user    0m10 078s sys     0m0 046s zhengs-MacBook-Pro workspace zhengzhibin  time   euler12 bin  842161320  real    0m5 370s user    0m5 328s sys     0m0 004s zhengs-MacBook-Pro workspace zhengzhibin

User · Answer

Trying GO   package main  import  fmt  import  math   func main         var n  m  c int     for i    1    i             n  m  c   i    i   1    2  int math Sqrt float64 n     0         for f    1  f  lt  m  f                 if n   f    0   c               c    2     if m   m    n   c          if c  gt  1001           fmt Println n          break                     I get   original c version  9 1690  100  go  8 2520  111       But using   package main  import        math       fmt         Sieve of Eratosthenes func PrimesBelow limit int    int       switch           case limit  lt  2              return   int           case limit    2              return   int 2            sievebound     limit - 1    2     sieve    make   bool  sievebound 1      crosslimit    int math Sqrt float64 limit  -1    2     for i    1  i  lt   crosslimit  i             if  sieve i                for j    2   i    i   1   j  lt   sievebound  j    2 i   1                   sieve j    true                                   plimit    int 1 3 float64 limit     int math Log float64 limit        primes    make   int  plimit      p    1     primes 0    2     for i    1  i  lt   sievebound  i             if  sieve i                primes p    2 i   1             p               if p  gt   plimit                   break                                   last    len primes  - 1     for i    last  i  gt  0  i--           if primes i     0               break                   last   i           return primes 0 last       func main         fmt Println p12         Requires PrimesBelow from utils go func p12   int       n  dn  cnt    3  2  0     primearray    PrimesBelow 1000000      for cnt  lt   1001           n           n1    n         if n1 2    0               n1    2                   dn1    1         for i    0  i  lt  len primearray   i                 if primearray i  primearray i   gt  n1                   dn1    2                 break                           exponent    1             for n1 primearray i     0                   exponent                   n1    primearray i                            if exponent  gt  1                   dn1    exponent                           if n1    1                   break                                 cnt   dn   dn1         dn   dn1           return n    n - 1    2     I get    original c version  9 1690  100  thaumkid s c version    0 1060  8650  first  go version   8 2520  111  second go   version  0 0230 39865     I also tried Python3 6 and pypy3 3-5 5-alpha   original c version  8 629   100  thaumkid s c version    0 109   7916  Python3 6          54 795   16  pypy3 3-5 5-alpha      13 291   65    and then with following code I got   original c version  8 629   100  thaumkid s c version    0 109   8650  Python3 6           1 489   580  pypy3 3-5 5-alpha       0 582   1483    def D N       if N    1  return 1     sqrtN   int N    0 5      nf   1     for d in range 2  sqrtN   1           if N   d    0              nf   nf   1     return 2   nf -  1 if sqrtN  2    N else 0   L   1000 Dt  n   0  0  while Dt  lt   L      t   n    n   1     2     Dt   D n 2  D n 1  if n 2    0 else D n  D  n 1  2      n   n   1  print  t

User · Answer

I made the assumption that the number of factors is only large if the numbers involved have many small factors  So I used thaumkid s excellent algorithm  but first used an approximation to the factor count that is never too small  It s quite simple  Check for prime factors up to 29  then check the remaining number and calculate an upper bound for the nmber of factors  Use this to calculate an upper bound for the number of factors  and if that number is high enough  calculate the exact number of factors    The code below doesn t need this assumption for correctness  but to be fast  It seems to work  only about one in 100 000 numbers gives an estimate that is high enough to require a full check   Here s the code       Return at least the number of factors of n  static uint64 t approxfactorcount  uint64 t n        uint64 t count   1  add    define CHECK d                                   do                                                if  n   d    0                                    add   count                                   do   n    d  count    add                     while  n   d    0                                                                     while  0       CHECK   2   CHECK   3   CHECK   5   CHECK   7   CHECK  11   CHECK  13       CHECK  17   CHECK  19   CHECK  23   CHECK  29       if  n    1  return count      if  n  lt  1ull   31   31  return count   2      if  n  lt  1ull   31   31   37  return count   4      if  n  lt  1ull   31   31   37   37  return count   8      if  n  lt  1ull   31   31   37   37   41  return count   16      if  n  lt  1ull   31   31   37   37   41   43  return count   32      if  n  lt  1ull   31   31   37   37   41   43   47  return count   64      if  n  lt  1ull   31   31   37   37   41   43   47   53  return count   128      if  n  lt  1ull   31   31   37   37   41   43   47   53   59  return count   256      if  n  lt  1ull   31   31   37   37   41   43   47   53   59   61  return count   512      if  n  lt  1ull   31   31   37   37   41   43   47   53   59   61   67  return count   1024      if  n  lt  1ull   31   31   37   37   41   43   47   53   59   61   67   71  return count   2048      if  n  lt  1ull   31   31   37   37   41   43   47   53   59   61   67   71   73  return count   4096      return count   1000000        Return the number of factors of n  static uint64 t factorcount  uint64 t n        uint64 t count   1  add       CHECK  2   CHECK  3        uint64 t d   5  inc   2      for    d d  lt   n  d    inc  inc    6 - inc           CHECK  d        if  n  gt  1  count    2     n must be a prime number     return count        Prints triangular numbers with record numbers of factors  static void printrecordnumbers  uint64 t limit        uint64 t record   30000       uint64 t count1  factor1      uint64 t count2   1  factor2   1       for  uint64 t n   1  n  lt   limit    n                factor1   factor2          count1   count2           factor2   n   1  if  factor2   2    0  factor2    2          count2   approxfactorcount  factor2            if  count1   count2  gt  record                        uint64 t factors   factorcount  factor1    factorcount  factor2               if  factors  gt  record                                printf    lluth triangular number    llu has  llu factors n   n  factor1   factor2  factors                   record   factors                                    This finds the 14 753 024th triangular with 13824 factors in about 0 7 seconds  the 879 207 615th triangular number with 61 440 factors in 34 seconds  the 12 524 486 975th triangular number with 138 240 factors in 10 minutes 5 seconds  and the 26 467 792 064th triangular number with 172 032 factors in 21 minutes 25 seconds  2 4GHz Core2 Duo   so this code takes only 116 processor cycles per number on average  The last triangular number itself is larger than 2 68  so

User · Answer

Question 1  Do Erlang  Python and Haskell lose speed due to using   arbitrary length integers or don t they as long as the values are less   than MAXINT    Question one can be answered in the negative for Erlang  The last question is answered by using Erlang appropriately  as in   http   bredsaal dk learning-erlang-using-projecteuler-net  Since it s faster than your initial C example  I would guess there are numerous problems as others have already covered in detail   This Erlang module executes on a cheap netbook in about 5 seconds     It uses the network threads model in erlang and  as such demonstrates how to take advantage of the event model  It could be distributed over many nodes  And it s fast  Not my code   -module p12dist     -author  Jannich Brendle  jannich bredsaal dk  http   blog bredsaal dk      -compile export all    server   - gt      server 1    server Number  - gt      receive  getwork  Worker PID  - gt  Worker PID    work Number Number 100       server Number 101        result T  - gt  io format  The result is    w   n    T          - gt  server Number      end   worker Server PID  - gt      Server PID    getwork  self         receive  work Start End  - gt  solve Start End Server PID      end      worker Server PID    start   - gt      Server PID   spawn p12dist  server           spawn p12dist  worker   Server PID        spawn p12dist  worker   Server PID        spawn p12dist  worker   Server PID        spawn p12dist  worker   Server PID     solve N End    when N     End - gt  no solution   solve N End Server PID  - gt      T round N  N 1  2     case  divisor T round math sqrt T     gt  500  of       true - gt          Server PID    result T         false - gt          solve N 1 End Server PID      end   divisors N  - gt      divisor N round math sqrt N      divisor   0  - gt  1    divisor N I  - gt      case  N rem I      0 of     true - gt        2 divisor N I-1       false - gt        divisor N I-1      end    The test below took place on an  Intel R  Atom TM  CPU N270     1 60GHz     time erl -noshell -s p12dist start  The result is  76576500    C  BREAK   a bort  c ontinue  p roc info  i nfo  l oaded         v ersion  k ill  D b-tables  d istribution a  real    0m5 510s user    0m5 836s sys 0m0 152s

[python] Speed comparison with Project Euler: C vs Python vs Erlang vs Haskell

Step One: Benchmark of the author's programs

Step Two: Investigate, Improve and Benchmark Again

Step Three: Understanding the Numbers

To Conclude This: Lessons Learned

Examples related to python

Examples related to c

Examples related to performance

Examples related to haskell

Examples related to erlang