Function to calculate R2 (R-squared) in R

Question

I have a dataframe with observed and modelled data, and I would like to calculate the R2 value. I expected there to be a function I could call for this, but can't locate one. I know I can write my own and apply it, but am I missing something obvious? I want something like:

    obs <- 1:5
    mod <- c(0.8, 2.4, 2, 3, 4.8)
    df <- data.frame(obs, mod)

    R2 <- rsq(df)
    # 0.85

User · Answer

Not sure why this isn't implemented directly in R, but this answer is essentially the same as Andrii's and Wordsforthewise's. I just turned it into a function for the sake of convenience, in case somebody uses it a lot like me.

    r2_general <- function(preds, actual) {
      return(1 - sum((preds - actual)^2) / sum((actual - mean(actual))^2))
    }
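As a quick check, here is a minimal sketch that applies this formula to the vectors from the question. It redefines the function so that it runs on its own, and it assumes `mod` plays the role of the predictions and `obs` the role of the actual values.

```r
# 1 - SSE/SST, with predictions first and observed values second
r2_general <- function(preds, actual) {
  1 - sum((preds - actual)^2) / sum((actual - mean(actual))^2)
}

obs <- 1:5                       # observed values from the question
mod <- c(0.8, 2.4, 2, 3, 4.8)    # modelled values from the question

r2 <- r2_general(preds = mod, actual = obs)
print(r2)
# 1 - 2.24 / 10 = 0.776
```

This is lower than the 0.8560185 returned by the lm-based answers, because here the predictions are compared with the observations directly rather than after fitting a regression line between the two vectors.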

User · Answer

Here is the simplest solution, based on https://en.wikipedia.org/wiki/Coefficient_of_determination:

    # 1. 'Actual' and 'Predicted' data
    df <- data.frame(
      y_actual = c(1:5),
      y_predicted = c(0.8, 2.4, 2, 3, 4.8))

    # 2. R2 Score components

    # 2.1. Average of actual data
    avr_y_actual <- mean(df$y_actual)

    # 2.2. Total sum of squares
    ss_total <- sum((df$y_actual - avr_y_actual)^2)

    # 2.3. Regression sum of squares
    ss_regression <- sum((df$y_predicted - avr_y_actual)^2)

    # 2.4. Residual sum of squares
    ss_residuals <- sum((df$y_actual - df$y_predicted)^2)

    # 3. R2 Score
    r2 <- 1 - ss_residuals / ss_total
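One thing worth checking with these components: the identity total = regression + residuals is only guaranteed when the predictions come from a least-squares fit with an intercept. The sketch below uses the same data to compare both cases; the helper `ss` and the object `fit` are illustrative names, not part of the answer.

```r
y_actual    <- 1:5
y_predicted <- c(0.8, 2.4, 2, 3, 4.8)

# Helper (hypothetical name): the three sums of squares for given predictions
ss <- function(y, yhat) {
  c(total      = sum((y - mean(y))^2),
    regression = sum((yhat - mean(y))^2),
    residuals  = sum((y - yhat)^2))
}

# Arbitrary predictions: regression + residuals need not equal total
s_raw <- ss(y_actual, y_predicted)

# Fitted values from a least-squares regression: the partition holds
fit   <- lm(y_actual ~ y_predicted)
s_fit <- ss(y_actual, fitted(fit))

print(s_raw)
print(s_fit)
```

For the raw predictions the regression and residual sums add up to 11.68 rather than the total of 10, which is why 1 - SSres/SStot and SSreg/SStot can disagree outside of a fitted regression.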

User · Answer

Why not this:

    rsq <- function(x, y) summary(lm(y~x))$r.squared
    rsq(obs, mod)
    #[1] 0.8560185

User · Answer

You need a little statistical knowledge to see this. R squared between two vectors is just the square of their correlation. So you can define your function as:

    rsq <- function (x, y) cor(x, y) ^ 2

Sandipan's answer will return exactly the same result (see the following proof), but as it stands it appears more readable (due to the evident $r.squared).

Let's do the statistics

Basically we fit a linear regression of y over x, and compute the ratio of regression sum of squares to total sum of squares.

lemma 1: a regression y ~ x is equivalent to y - mean(y) ~ x - mean(x)

lemma 2: beta = cov(x, y) / var(x)

lemma 3: R.square = cor(x, y) ^ 2

Warning

R squared between two arbitrary vectors x and y (of the same length) is just a goodness measure of their linear relationship. Think twice!! R squared between x + a and y + b is identical for any constant shifts a and b. So it is a weak or even useless measure of "goodness of prediction". Use MSE or RMSE instead:

How to obtain RMSE out of lm result?
R - Calculate Test MSE given a trained model from a training set and a test set

I agree with 42-'s comment:

The R squared is reported by summary functions associated with regression functions. But only when such an estimate is statistically justified.

R squared can be a (but not the best) measure of "goodness of fit". But there is no justification that it can measure the goodness of out-of-sample prediction. If you split your data into training and testing parts and fit a regression model on the training one, you can get a valid R squared value on the training part, but you can't legitimately compute an R squared on the test part. Some people did this, but I don't agree with it.

Here is a very extreme example:

    preds <- 1:4/4
    actual <- 1:4

The R squared between those two vectors is 1. Yes of course, one is just a linear rescaling of the other, so they have a perfect linear relationship. But do you really think that preds is a good prediction of actual?

In reply to wordsforthewise

Thanks for your comments 1, 2 and your detailed answer.

You probably misunderstood the procedure. Given two vectors x and y, we first fit a regression line y ~ x, then compute the regression sum of squares and the total sum of squares. It looks like you skip this regression step and go straight to the sum of squares computation. That is false, since the partition of the sum of squares does not hold and you can't compute R squared in a consistent way.

As you demonstrated, this is just one way of computing R squared:

    preds <- c(1, 2, 3)
    actual <- c(2, 2, 4)
    rss <- sum((preds - actual) ^ 2)  ## residual sum of squares
    tss <- sum((actual - mean(actual)) ^ 2)  ## total sum of squares
    rsq <- 1 - rss/tss
    #[1] 0.25

But there is another:

    regss <- sum((preds - mean(preds)) ^ 2) ## regression sum of squares
    regss / tss
    #[1] 0.75

Also, your formula can give a negative value (the proper value should be 1, as mentioned above in the Warning section).

    preds <- 1:4 / 4
    actual <- 1:4
    rss <- sum((preds - actual) ^ 2)  ## residual sum of squares
    tss <- sum((actual - mean(actual)) ^ 2)  ## total sum of squares
    rsq <- 1 - rss/tss
    #[1] -2.375

Final remark

I had never expected that this answer could eventually be so long when I posted my initial answer 2 years ago. However, given the high views of this thread, I feel obliged to add more statistical details and discussion. I don't want to mislead people into thinking that just because they can compute an R squared so easily, they can use R squared everywhere.
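The claims in this answer can be checked numerically. The following is only a sketch under stated assumptions: the helper names `rsq_cor`, `rsq_lm` and `rmse` are made up for illustration, and the first data set is borrowed from the question.

```r
rsq_cor <- function(x, y) cor(x, y)^2                     # squared correlation
rsq_lm  <- function(x, y) summary(lm(y ~ x))$r.squared    # R squared from a fitted line
rmse    <- function(preds, actual) sqrt(mean((preds - actual)^2))

x <- 1:5
y <- c(0.8, 2.4, 2, 3, 4.8)

# Lemma 3: both definitions agree for a simple linear regression
same_as_lm <- isTRUE(all.equal(rsq_cor(x, y), rsq_lm(x, y)))

# Warning: shifting either vector by a constant leaves R squared unchanged
shift_invariant <- isTRUE(all.equal(rsq_cor(x + 10, y - 3), rsq_cor(x, y)))

# Extreme example: perfect R squared, yet the predictions are far off
preds  <- 1:4 / 4
actual <- 1:4
r2_extreme   <- rsq_cor(preds, actual)
rmse_extreme <- rmse(preds, actual)

print(c(same_as_lm, shift_invariant))
print(c(r2 = r2_extreme, rmse = rmse_extreme))
```

In the extreme example the squared correlation is 1 (up to floating-point error) while the RMSE is about 2.05 on values that only range from 1 to 4, which is the point of the warning.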

User · Answer

You can also use the summary for linear models:

    summary(lm(obs ~ mod, data=df))$r.squared

User · Answer

It is not something obvious, but the caret package has a function postResample() that will calculate "A vector of performance estimates" according to the documentation. The "performance estimates" are

RMSE
Rsquared
mean absolute error (MAE)

and have to be accessed from the vector like this:

    library(caret)
    vect1 <- c(1, 2, 3)
    vect2 <- c(3, 2, 2)
    res <- caret::postResample(vect1, vect2)
    rsq <- res[2]

However, this is using the correlation squared approximation for r-squared, as mentioned in another answer. I'm not sure why Max Kuhn didn't just use the conventional 1-SSE/SST.

caret also has an R2() method, although it's hard to find in the documentation.

The way to implement the normal coefficient of determination equation is:

    preds <- c(1, 2, 3)
    actual <- c(2, 2, 4)
    rss <- sum((preds - actual) ^ 2)
    tss <- sum((actual - mean(actual)) ^ 2)
    rsq <- 1 - rss/tss

Not too bad to code by hand of course, but why isn't there a function for it in a language primarily made for statistics? I'm thinking I must be missing the implementation of R^2 somewhere, or no one cares enough about it to implement it. Most of the implementations, like this one, seem to be for generalized linear models.
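To see how far the two definitions can drift apart, here is a base-R sketch on the same two vectors. It does not call caret; it assumes vect1 is treated as the predictions and vect2 as the observations, and the names `rsq_corr` and `rsq_trad` are only illustrative.

```r
vect1 <- c(1, 2, 3)   # assumed to be the predictions
vect2 <- c(3, 2, 2)   # assumed to be the observations

# Squared correlation (the approach described above for postResample)
rsq_corr <- cor(vect1, vect2)^2

# Conventional 1 - SSE/SST
sse <- sum((vect2 - vect1)^2)
sst <- sum((vect2 - mean(vect2))^2)
rsq_trad <- 1 - sse / sst

print(c(corr = rsq_corr, traditional = rsq_trad))
# corr = 0.75, traditional = -6.5
```

The squared correlation looks respectable even though the predictions move in the opposite direction to the observations, while 1-SSE/SST is strongly negative.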
