[r] Function to calculate R2 (R-squared) in R

You need a little statistical knowledge to see this. R squared between two vectors is just the square of their correlation. So you can define you function as:

rsq <- function (x, y) cor(x, y) ^ 2

Sandipan's answer will return you exactly the same result (see the following proof), but as it stands it appears more readable (due to the evident $r.squared).


Let's do the statistics

Basically we fit a linear regression of y over x, and compute the ratio of regression sum of squares to total sum of squares.

lemma 1: a regression y ~ x is equivalent to y - mean(y) ~ x - mean(x)

lemma 1

lemma 2: beta = cov(x, y) / var(x)

lemma 2

lemma 3: R.square = cor(x, y) ^ 2

lemma 3


Warning

R squared between two arbitrary vectors x and y (of the same length) is just a goodness measure of their linear relationship. Think twice!! R squared between x + a and y + b are identical for any constant shift a and b. So it is a weak or even useless measure on "goodness of prediction". Use MSE or RMSE instead:

I agree with 42-'s comment:

The R squared is reported by summary functions associated with regression functions. But only when such an estimate is statistically justified.

R squared can be a (but not the best) measure of "goodness of fit". But there is no justification that it can measure the goodness of out-of-sample prediction. If you split your data into training and testing parts and fit a regression model on the training one, you can get a valid R squared value on training part, but you can't legitimately compute an R squared on the test part. Some people did this, but I don't agree with it.

Here is very extreme example:

preds <- 1:4/4
actual <- 1:4

The R squared between those two vectors is 1. Yes of course, one is just a linear rescaling of the other so they have a perfect linear relationship. But, do you really think that the preds is a good prediction on actual??


In reply to wordsforthewise

Thanks for your comments 1, 2 and your answer of details.

You probably misunderstood the procedure. Given two vectors x and y, we first fit a regression line y ~ x then compute regression sum of squares and total sum of squares. It looks like you skip this regression step and go straight to the sum of square computation. That is false, since the partition of sum of squares does not hold and you can't compute R squared in a consistent way.

As you demonstrated, this is just one way for computing R squared:

preds <- c(1, 2, 3)
actual <- c(2, 2, 4)
rss <- sum((preds - actual) ^ 2)  ## residual sum of squares
tss <- sum((actual - mean(actual)) ^ 2)  ## total sum of squares
rsq <- 1 - rss/tss
#[1] 0.25

But there is another:

regss <- sum((preds - mean(preds)) ^ 2) ## regression sum of squares
regss / tss
#[1] 0.75

Also, your formula can give a negative value (the proper value should be 1 as mentioned above in the Warning section).

preds <- 1:4 / 4
actual <- 1:4
rss <- sum((preds - actual) ^ 2)  ## residual sum of squares
tss <- sum((actual - mean(actual)) ^ 2)  ## total sum of squares
rsq <- 1 - rss/tss
#[1] -2.375

Final remark

I had never expected that this answer could eventually be so long when I posted my initial answer 2 years ago. However, given the high views of this thread, I feel obliged to add more statistical details and discussions. I don't want to mislead people that just because they can compute an R squared so easily, they can use R squared everywhere.

Examples related to r

How to get AIC from Conway–Maxwell-Poisson regression via COM-poisson package in R? R : how to simply repeat a command? session not created: This version of ChromeDriver only supports Chrome version 74 error with ChromeDriver Chrome using Selenium How to show code but hide output in RMarkdown? remove kernel on jupyter notebook Function to calculate R2 (R-squared) in R Center Plot title in ggplot2 R ggplot2: stat_count() must not be used with a y aesthetic error in Bar graph R multiple conditions in if statement What does "The following object is masked from 'package:xxx'" mean?

Examples related to function

$http.get(...).success is not a function Function to calculate R2 (R-squared) in R How to Call a Function inside a Render in React/Jsx How does Python return multiple values from a function? Default optional parameter in Swift function How to have multiple conditions for one if statement in python Uncaught TypeError: .indexOf is not a function Proper use of const for defining functions in JavaScript Run php function on button click includes() not working in all browsers

Examples related to statistics

Function to calculate R2 (R-squared) in R pandas: find percentile stats of a given column What exactly does numpy.exp() do? Find p-value (significance) in scikit-learn LinearRegression How to plot ROC curve in Python Pandas - Compute z-score for all columns Calculating percentile of dataset column How to normalize an array in NumPy to a unit vector? How to find row number of a value in R code np.mean() vs np.average() in Python NumPy?