How do I calculate r-squared using Python and Numpy?

Question

I'm using Python and Numpy to calculate a best-fit polynomial of arbitrary degree. I pass a list of x values, y values, and the degree of the polynomial I want to fit (linear, quadratic, etc.).

This much works, but I also want to calculate r (coefficient of correlation) and r-squared (coefficient of determination). I am comparing my results with Excel's best-fit trendline capability and the r-squared value it calculates. Using this, I know I am calculating r-squared correctly for a linear best fit (degree equals 1). However, my function does not work for polynomials with degree greater than 1.

Excel is able to do this. How do I calculate r-squared for higher-order polynomials using Numpy?

Here's my function:

    import numpy

    # Polynomial Regression
    def polyfit(x, y, degree):
        results = {}

        coeffs = numpy.polyfit(x, y, degree)

        # Polynomial Coefficients
        results['polynomial'] = coeffs.tolist()

        correlation = numpy.corrcoef(x, y)[0, 1]

        # r
        results['correlation'] = correlation
        # r-squared
        results['determination'] = correlation**2

        return results

User · Answer

I have been using this successfully, where x and y are array-like.

    import scipy.stats

    def rsquared(x, y):
        """Return R^2 where x and y are array-like."""
        slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(x, y)
        return r_value**2

User · Answer

R-squared is a statistic that only applies to linear regression. Essentially, it measures how much variation in your data can be explained by the linear regression. So, you calculate the "Total Sum of Squares", which is the total squared deviation of each of your outcome variables from their mean:

$$SST = \sum_i (y_i - \bar{y})^2$$

where $\bar{y}$ is the mean of the y's. Then, you calculate the "regression sum of squares", which is how much your FITTED values differ from the mean, and find the ratio of those two.

Now, all you would have to do for a polynomial fit is plug in the $\hat{y}$'s from that model, but it's not accurate to call that r-squared. Here is a link I found that speaks to it a little.
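The two sums of squares described above can be sketched for a polynomial fit as follows. This is a minimal illustration, assuming a least-squares fit from `numpy.polyfit`; the helper name `sums_of_squares` and the sample data are hypothetical. For a least-squares fit that includes an intercept, the regression and residual sums of squares add up to the total, so the ratio SSReg/SST agrees with 1 - SSE/SST.

```python
import numpy as np

def sums_of_squares(x, y, degree):
    """Return (SST, SSReg, SSE) for a least-squares polynomial fit."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    coeffs = np.polyfit(x, y, degree)
    y_hat = np.polyval(coeffs, x)           # fitted values
    y_bar = y.mean()
    sst = np.sum((y - y_bar) ** 2)          # total sum of squares
    ssreg = np.sum((y_hat - y_bar) ** 2)    # regression sum of squares
    sse = np.sum((y - y_hat) ** 2)          # residual sum of squares
    return sst, ssreg, sse

# Hypothetical sample data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 3.0, 5.0, 6.0, 7.0, 8.0])

sst, ssreg, sse = sums_of_squares(x, y, degree=2)
print(ssreg / sst, 1 - sse / sst)  # the two ratios should agree
```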

User · Answer

I originally posted the benchmarks below with the purpose of recommending numpy.corrcoef, foolishly not realizing that the original question already uses corrcoef and was in fact asking about higher-order polynomial fits. I've added an actual solution to the polynomial r-squared question using statsmodels, and I've left the original benchmarks, which while off-topic, are potentially useful to someone.

statsmodels has the capability to calculate the r^2 of a polynomial fit directly. Here are 2 methods:

    import numpy as np
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    # Construct the columns for the different powers of x
    def get_r2_statsmodels(x, y, k=1):
        xpoly = np.column_stack([x**i for i in range(k+1)])
        return sm.OLS(y, xpoly).fit().rsquared

    # Use the formula API and construct a formula describing the polynomial
    def get_r2_statsmodels_formula(x, y, k=1):
        formula = 'y ~ 1 + ' + ' + '.join('I(x**{})'.format(i) for i in range(1, k+1))
        data = {'x': x, 'y': y}
        return smf.ols(formula, data).fit().rsquared  # or rsquared_adj

To further take advantage of statsmodels, one should also look at the fitted model summary, which can be printed or displayed as a rich HTML table in a Jupyter/IPython notebook. The results object provides access to many useful statistical metrics in addition to rsquared.

    model = sm.OLS(y, xpoly)
    results = model.fit()
    results.summary()

Below is my original Answer where I benchmarked various linear regression r^2 methods.

The corrcoef function used in the Question calculates the correlation coefficient, r, only for a single linear regression, so it doesn't address the question of r^2 for higher-order polynomial fits. However, for what it's worth, I've come to find that for linear regression, it is indeed the fastest and most direct method of calculating r.

    def get_r2_numpy_corrcoef(x, y):
        return np.corrcoef(x, y)[0, 1]**2

These were my timeit results from comparing a bunch of methods for 1000 random (x, y) points:

    Pure Python (direct r calculation)
        1000 loops, best of 3: 1.59 ms per loop
    Numpy polyfit (applicable to n-th degree polynomial fits)
        1000 loops, best of 3: 326 µs per loop
    Numpy Manual (direct r calculation)
        10000 loops, best of 3: 62.1 µs per loop
    Numpy corrcoef (direct r calculation)
        10000 loops, best of 3: 56.6 µs per loop
    Scipy (linear regression with r as an output)
        1000 loops, best of 3: 676 µs per loop
    Statsmodels (can do n-th degree polynomial and many other fits)
        1000 loops, best of 3: 422 µs per loop

The corrcoef method narrowly beats calculating the r^2 "manually" using numpy methods. It is >5X faster than the polyfit method and ~12X faster than scipy.stats.linregress. Just to reinforce what numpy is doing for you, it's 28X faster than pure Python. I'm not well-versed in things like numba and pypy, so someone else would have to fill those gaps, but I think this is plenty convincing to me that corrcoef is the best tool for calculating r for a simple linear regression.

Here's my benchmarking code. I copy-pasted from a Jupyter Notebook (hard not to call it an IPython Notebook...), so I apologize if anything broke on the way. The %timeit magic command requires IPython.

    import numpy as np
    from scipy import stats
    import statsmodels.api as sm
    import math

    n = 1000
    x = np.random.rand(1000)*10
    x.sort()
    y = 10 * x + (5+np.random.randn(1000)*10-5)

    x_list = list(x)
    y_list = list(y)

    def get_r2_numpy(x, y):
        slope, intercept = np.polyfit(x, y, 1)
        r_squared = 1 - (sum((y - (slope * x + intercept))**2) / ((len(y) - 1) * np.var(y, ddof=1)))
        return r_squared

    def get_r2_scipy(x, y):
        _, _, r_value, _, _ = stats.linregress(x, y)
        return r_value**2

    def get_r2_statsmodels(x, y):
        return sm.OLS(y, sm.add_constant(x)).fit().rsquared

    def get_r2_python(x_list, y_list):
        n = len(x_list)
        x_bar = sum(x_list)/n
        y_bar = sum(y_list)/n
        x_std = math.sqrt(sum([(xi-x_bar)**2 for xi in x_list])/(n-1))
        y_std = math.sqrt(sum([(yi-y_bar)**2 for yi in y_list])/(n-1))
        zx = [(xi-x_bar)/x_std for xi in x_list]
        zy = [(yi-y_bar)/y_std for yi in y_list]
        r = sum(zxi*zyi for zxi, zyi in zip(zx, zy))/(n-1)
        return r**2

    def get_r2_numpy_manual(x, y):
        zx = (x-np.mean(x))/np.std(x, ddof=1)
        zy = (y-np.mean(y))/np.std(y, ddof=1)
        r = np.sum(zx*zy)/(len(x)-1)
        return r**2

    def get_r2_numpy_corrcoef(x, y):
        return np.corrcoef(x, y)[0, 1]**2

    print('Python')
    %timeit get_r2_python(x_list, y_list)
    print('Numpy polyfit')
    %timeit get_r2_numpy(x, y)
    print('Numpy Manual')
    %timeit get_r2_numpy_manual(x, y)
    print('Numpy corrcoef')
    %timeit get_r2_numpy_corrcoef(x, y)
    print('Scipy')
    %timeit get_r2_scipy(x, y)
    print('Statsmodels')
    %timeit get_r2_statsmodels(x, y)
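For readers without IPython, a comparable measurement can be sketched with the standard-library `timeit` module instead of the `%timeit` magic. This is only an illustration under assumed settings: the function names `r2_corrcoef` and `r2_polyfit`, the seeded random data, and the repeat counts are choices made here, and the timings it prints will differ from the figures quoted above.

```python
import timeit
import numpy as np

# Seeded random data, roughly mirroring the 1000-point setup above
rng = np.random.default_rng(0)
x = np.sort(rng.random(1000) * 10)
y = 10 * x + rng.standard_normal(1000) * 10

def r2_corrcoef(x, y):
    """r^2 for a simple linear fit via the correlation coefficient."""
    return np.corrcoef(x, y)[0, 1] ** 2

def r2_polyfit(x, y):
    """r^2 for a degree-1 polyfit via residual and total sums of squares."""
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    return 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

for name, func in [("corrcoef", r2_corrcoef), ("polyfit", r2_polyfit)]:
    # Best of 3 repeats, 200 calls per repeat, reported per call
    best = min(timeit.repeat(lambda: func(x, y), number=200, repeat=3)) / 200
    print("{}: {:.1f} us per call".format(name, best * 1e6))
```

Both functions should return the same value for a linear fit, so the comparison is between two routes to the same number.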

User · Answer

Here is a function to compute the weighted r-squared with Python and Numpy (most of the code comes from sklearn):

    from __future__ import division
    import numpy as np

    def compute_r2_weighted(y_true, y_pred, weight):
        sse = (weight * (y_true - y_pred) ** 2).sum(axis=0, dtype=np.float64)
        tse = (weight * (y_true - np.average(
            y_true, axis=0, weights=weight)) ** 2).sum(axis=0, dtype=np.float64)
        r2_score = 1 - (sse / tse)
        return r2_score, sse, tse

Example:

    from __future__ import print_function, division
    import numpy as np
    import sklearn.metrics

    # compute_r2_weighted() is the function defined above

    def compute_r2(y_true, y_predicted):
        sse = sum((y_true - y_predicted)**2)
        tse = (len(y_true) - 1) * np.var(y_true, ddof=1)
        r2_score = 1 - (sse / tse)
        return r2_score, sse, tse

    def main():
        '''
        Demonstrate the use of compute_r2_weighted() and check the results against sklearn
        '''
        y_true = [3, -0.5, 2, 7]
        y_pred = [2.5, 0.0, 2, 8]
        weight = [1, 5, 1, 2]
        r2_score = sklearn.metrics.r2_score(y_true, y_pred)
        print('r2_score: {0}'.format(r2_score))
        r2_score, _, _ = compute_r2(np.array(y_true), np.array(y_pred))
        print('r2_score: {0}'.format(r2_score))
        # sample_weight is passed by keyword, as recent sklearn versions require
        r2_score = sklearn.metrics.r2_score(y_true, y_pred, sample_weight=weight)
        print('r2_score weighted: {0}'.format(r2_score))
        r2_score, _, _ = compute_r2_weighted(np.array(y_true), np.array(y_pred), np.array(weight))
        print('r2_score weighted: {0}'.format(r2_score))

    if __name__ == "__main__":
        main()
        # cProfile.run('main()')  # if you want to do some profiling

outputs:

    r2_score: 0.9486081370449679
    r2_score: 0.9486081370449679
    r2_score weighted: 0.9573170731707317
    r2_score weighted: 0.9573170731707317

This corresponds to the formula:

$$R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{\sum_i w_i (y_i - f_i)^2}{\sum_i w_i (y_i - y_{av})^2}$$

where $f_i$ is the predicted value from the fit, $y_{av}$ is the mean of the observed data, $y_i$ is the observed data value, and $w_i$ is the weighting applied to each data point (usually $w_i = 1$). SSE is the sum of squares due to error and SST is the total sum of squares.

If interested, the code in R: https://gist.github.com/dhimmel/588d64a73fa4fef02c8f
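As a quick check of the weighted formula without sklearn, here is a minimal NumPy-only sketch. The function name `weighted_r2` is hypothetical; the data and weights are the ones from the example above. With all weights equal to 1 it reduces to the ordinary unweighted value.

```python
import numpy as np

def weighted_r2(y_true, y_pred, weight):
    """Weighted R^2 = 1 - SSE/SST, using the weighted mean of y_true."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    weight = np.asarray(weight, dtype=float)
    y_av = np.average(y_true, weights=weight)        # weighted mean
    sse = np.sum(weight * (y_true - y_pred) ** 2)    # weighted error sum of squares
    sst = np.sum(weight * (y_true - y_av) ** 2)      # weighted total sum of squares
    return 1 - sse / sst

y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]

unweighted = weighted_r2(y_true, y_pred, np.ones(4))   # all weights equal to 1
weighted = weighted_r2(y_true, y_pred, [1, 5, 1, 2])
print(unweighted, weighted)
```

These should reproduce the two values printed in the output above, up to floating-point rounding.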

User · Answer

The Wikipedia article on r-squared suggests that it may be used for general model fitting rather than just linear regression.
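In that more general sense, r-squared can be computed from any set of predictions as 1 - SS_res/SS_tot, whatever model produced them. The sketch below is illustrative only: the function name `r2_from_predictions` and all three prediction arrays are made up. One consequence of this definition is that a model which does worse than always predicting the mean gets a negative value.

```python
import numpy as np

def r2_from_predictions(y, y_pred):
    """General R^2 = 1 - SS_res / SS_tot for arbitrary predictions."""
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y - y_pred) ** 2)      # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)    # total sum of squares
    return 1 - ss_res / ss_tot

y = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
good = np.array([1.1, 1.9, 4.2, 7.8, 16.1])   # hypothetical output of a good model
baseline = np.full_like(y, y.mean())          # always predict the mean
bad = y[::-1]                                 # deliberately poor predictions

print(r2_from_predictions(y, good))       # close to 1
print(r2_from_predictions(y, baseline))   # 0 by construction
print(r2_from_predictions(y, bad))        # negative
```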

User · Answer

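From yanl (yet-another-library): sklearn.metrics has an r2_score function.

    from sklearn.metrics import r2_score

    # p(x) are the predicted values, e.g. with p = numpy.poly1d(numpy.polyfit(x, y, degree))
    coefficient_of_determination = r2_score(y, p(x))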

User · Answer

A very late reply, but just in case someone needs a ready function for this: scipy.stats.linregress, i.e.

    slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(x, y)

as in @Adam Marples's answer.

User · Answer

From the scipy.stats.linregress source. They use the average sum of squares method.

    import numpy as np

    x = np.array(x)
    y = np.array(y)

    # average sum of squares:
    ssxm, ssxym, ssyxm, ssym = np.cov(x, y, bias=1).flat

    r_num = ssxym
    r_den = np.sqrt(ssxm * ssym)

    if r_den == 0.0:
        r = 0.0
    else:
        r = r_num / r_den

        # clamp to [-1, 1] to guard against floating-point overshoot
        if r > 1.0:
            r = 1.0
        elif r < -1.0:
            r = -1.0
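The covariance-based calculation above can be wrapped in a function and compared against `numpy.corrcoef`. This is a sketch rather than the scipy implementation itself: the name `pearson_r_from_cov` and the sample data are hypothetical, and `np.clip` stands in for the explicit clamping branches.

```python
import numpy as np

def pearson_r_from_cov(x, y):
    """Pearson r from the biased covariance matrix, clamped to [-1, 1]."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Flattened 2x2 covariance matrix: var(x), cov(x, y), cov(y, x), var(y)
    ssxm, ssxym, _, ssym = np.cov(x, y, bias=True).flat
    r_den = np.sqrt(ssxm * ssym)
    if r_den == 0.0:
        return 0.0  # one of the inputs is constant
    return float(np.clip(ssxym / r_den, -1.0, 1.0))

# Hypothetical sample data
x = [1, 2, 3, 4, 5, 6]
y = [2, 3, 5, 6, 7, 8]

r = pearson_r_from_cov(x, y)
print(r ** 2, np.corrcoef(x, y)[0, 1] ** 2)  # the two values should agree
```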

User · Answer

You can execute this code directly; it will find the polynomial and the R-value. You can put a comment down below if you need more explanation.

    from scipy.stats import linregress
    import numpy as np

    x = np.array([1, 2, 3, 4, 5, 6])
    y = np.array([2, 3, 5, 6, 7, 8])

    p3 = np.polyfit(x, y, 3)      # 3rd degree polynomial, you can change it to any degree you want
    xp = np.linspace(1, 6, 6)     # 6 is the number of points, so xp matches x here
    poly_arr = np.polyval(p3, xp)

    poly_list = [round(num, 3) for num in list(poly_arr)]
    # Correlate the observed y values with the fitted values (not x with the fitted values);
    # for a least-squares fit with an intercept, the square of that correlation is R^2.
    slope, intercept, r_value, p_value, std_err = linregress(y, poly_list)
    print(r_value**2)
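The same idea can be checked with NumPy alone. For a least-squares polynomial fit that includes a constant term, the squared correlation between the observed values and the fitted values should equal 1 - SS_res/SS_tot. The sketch below uses the data from this answer; the variable names are illustrative.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2, 3, 5, 6, 7, 8], dtype=float)

coeffs = np.polyfit(x, y, 3)       # 3rd degree least-squares fit
y_fit = np.polyval(coeffs, x)      # fitted values at the observed x

# Route 1: squared correlation between observed and fitted values
r2_from_corr = np.corrcoef(y, y_fit)[0, 1] ** 2

# Route 2: one minus the ratio of residual to total sum of squares
r2_from_residuals = 1 - np.sum((y - y_fit) ** 2) / np.sum((y - y.mean()) ** 2)

print(r2_from_corr, r2_from_residuals)  # the two routes should agree
```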

User · Answer

Here's a very simple Python function to compute R^2 from the actual and predicted values, assuming y and y_hat are pandas Series:

    def r_squared(y, y_hat):
        y_bar = y.mean()
        ss_tot = ((y - y_bar)**2).sum()
        ss_res = ((y - y_hat)**2).sum()
        return 1 - (ss_res / ss_tot)

User · Answer

From the numpy.polyfit documentation, it is fitting linear regression. Specifically, numpy.polyfit with degree 'd' fits a linear regression with the mean function

$$E(y \mid x) = p_d x^d + p_{d-1} x^{d-1} + \dots + p_1 x + p_0$$

So you just need to calculate the R-squared for that fit. The Wikipedia page on linear regression gives full details. You are interested in R^2, which you can calculate in a couple of ways, the easiest probably being

$$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

$$SSReg = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$$

$$R^2 = \frac{SSReg}{SST}$$

where $\bar{y}$ is the mean of the y's and $\hat{y}_i$ is the fit value for each point.

I'm not terribly familiar with numpy (I usually work in R), so there is probably a tidier way to calculate your R-squared, but the following should be correct:

    import numpy

    # Polynomial Regression
    def polyfit(x, y, degree):
        results = {}

        coeffs = numpy.polyfit(x, y, degree)

        # Polynomial Coefficients
        results['polynomial'] = coeffs.tolist()

        # r-squared
        p = numpy.poly1d(coeffs)
        # fit values, and mean
        yhat = p(x)                         # or [p(z) for z in x]
        ybar = numpy.sum(y)/len(y)          # or sum(y)/len(y)
        ssreg = numpy.sum((yhat-ybar)**2)   # or sum([(yihat - ybar)**2 for yihat in yhat])
        sstot = numpy.sum((y - ybar)**2)    # or sum([(yi - ybar)**2 for yi in y])
        results['determination'] = ssreg / sstot

        return results
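A short usage sketch of the SSReg/SST approach follows, under assumed inputs: the helper name `poly_r2` and the noisy quadratic data are made up for illustration. For degree 1 the result should match the squared `numpy.corrcoef` value from the question, and because each higher-degree polynomial contains the lower-degree ones, R^2 should not decrease as the degree grows.

```python
import numpy as np

def poly_r2(x, y, degree):
    """R^2 = SSReg / SST for a least-squares polynomial fit of the given degree."""
    coeffs = np.polyfit(x, y, degree)
    y_hat = np.polyval(coeffs, x)     # fitted values
    y_bar = np.mean(y)
    return np.sum((y_hat - y_bar) ** 2) / np.sum((y - y_bar) ** 2)

# Hypothetical data: a quadratic trend plus seeded Gaussian noise
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = 0.5 * x ** 2 - 3 * x + 2 + rng.normal(scale=2.0, size=x.size)

r2_by_degree = {d: poly_r2(x, y, d) for d in range(1, 5)}
for d, r2 in r2_by_degree.items():
    print("degree {}: R^2 = {:.4f}".format(d, r2))
```

Since R^2 can only go up with added terms, a higher value alone does not show that a higher-degree fit is better; the rsquared_adj attribute mentioned in the statsmodels answer is one way to account for that.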
