How do you pull out the p-value (for the significance of the coefficient of the single explanatory variable being non-zero) and R-squared value from a simple linear regression model? For example...
x = cumsum(c(0, runif(100, -1, +1)))
y = cumsum(c(0, runif(100, -1, +1)))
fit = lm(y ~ x)
summary(fit)
I know that summary(fit)
displays the p-value and R-squared value, but I want to be able to stick these into other variables.
This question is related to
r
I used this lmp function quite a lot of times.
And at one point I decided to add new features to enhance data analysis. I am not in expert in R or statistics but people are usually looking at different information of a linear regression :
Let's have an example. You have here
Here a reproducible example with different variables:
Ex<-structure(list(X1 = c(-36.8598, -37.1726, -36.4343, -36.8644,
-37.0599, -34.8818, -31.9907, -37.8304, -34.3367, -31.2984, -33.5731
), X2 = c(64.26, 63.085, 66.36, 61.08, 61.57, 65.04, 72.69, 63.83,
67.555, 76.06, 68.61), Y1 = c(493.81544, 493.81544, 494.54173,
494.61364, 494.61381, 494.38717, 494.64122, 493.73265, 494.04246,
494.92989, 494.98384), Y2 = c(489.704166, 489.704166, 490.710962,
490.653212, 490.710612, 489.822928, 488.160904, 489.747776, 490.600579,
488.946738, 490.398958), Y3 = c(-19L, -19L, -19L, -23L, -30L,
-43L, -43L, -2L, -58L, -47L, -61L)), .Names = c("X1", "X2", "Y1",
"Y2", "Y3"), row.names = c(NA, 11L), class = "data.frame")
library(reshape2)
library(ggplot2)
Ex2<-melt(Ex,id=c("X1","X2"))
colnames(Ex2)[3:4]<-c("Y","Yvalue")
Ex3<-melt(Ex2,id=c("Y","Yvalue"))
colnames(Ex3)[3:4]<-c("X","Xvalue")
ggplot(Ex3,aes(Xvalue,Yvalue))+
geom_smooth(method="lm",alpha=0.2,size=1,color="grey")+
geom_point(size=2)+
facet_grid(Y~X,scales='free')
#Use the lmp function
lmp <- function (modelobject) {
if (class(modelobject) != "lm") stop("Not an object of class 'lm' ")
f <- summary(modelobject)$fstatistic
p <- pf(f[1],f[2],f[3],lower.tail=F)
attributes(p) <- NULL
return(p)
}
# create function to extract different informations from lm
lmtable<-function (var1,var2,data,signi=NULL){
#var1= y data : colnames of data as.character, so "Y1" or c("Y1","Y2") for example
#var2= x data : colnames of data as.character, so "X1" or c("X1","X2") for example
#data= data in dataframe, variables in columns
# if signi TRUE, round p-value with 2 digits and add *** if <0.001, ** if < 0.01, * if < 0.05.
if (class(data) != "data.frame") stop("Not an object of class 'data.frame' ")
Tabtemp<-data.frame(matrix(NA,ncol=6,nrow=length(var1)*length(var2)))
for (i in 1:length(var2))
{
Tabtemp[((length(var1)*i)-(length(var1)-1)):(length(var1)*i),1]<-var1
Tabtemp[((length(var1)*i)-(length(var1)-1)):(length(var1)*i),2]<-var2[i]
colnames(Tabtemp)<-c("Var.y","Var.x","p-value","a","b","r^2")
for (n in 1:length(var1))
{
Tabtemp[(((length(var1)*i)-(length(var1)-1))+n-1),3]<-lmp(lm(data[,var1[n]]~data[,var2[i]],data))
Tabtemp[(((length(var1)*i)-(length(var1)-1))+n-1),4]<-coef(lm(data[,var1[n]]~data[,var2[i]],data))[1]
Tabtemp[(((length(var1)*i)-(length(var1)-1))+n-1),5]<-coef(lm(data[,var1[n]]~data[,var2[i]],data))[2]
Tabtemp[(((length(var1)*i)-(length(var1)-1))+n-1),6]<-summary(lm(data[,var1[n]]~data[,var2[i]],data))$r.squared
}
}
signi2<-data.frame(matrix(NA,ncol=3,nrow=nrow(Tabtemp)))
signi2[,1]<-ifelse(Tabtemp[,3]<0.001,paste0("***"),ifelse(Tabtemp[,3]<0.01,paste0("**"),ifelse(Tabtemp[,3]<0.05,paste0("*"),paste0(""))))
signi2[,2]<-round(Tabtemp[,3],2)
signi2[,3]<-paste0(format(signi2[,2],digits=2),signi2[,1])
for (l in 1:nrow(Tabtemp))
{
Tabtemp$"p-value"[l]<-ifelse(is.null(signi),
Tabtemp$"p-value"[l],
ifelse(isTRUE(signi),
paste0(signi2[,3][l]),
Tabtemp$"p-value"[l]))
}
Tabtemp
}
# ------- EXAMPLES ------
lmtable("Y1","X1",Ex)
lmtable(c("Y1","Y2","Y3"),c("X1","X2"),Ex)
lmtable(c("Y1","Y2","Y3"),c("X1","X2"),Ex,signi=TRUE)
There is certainly a faster solution than this function but it works.
Extension of @Vincent 's answer:
For lm()
generated models:
summary(fit)$coefficients[,4] ##P-values
summary(fit)$r.squared ##R squared values
For gls()
generated models:
summary(fit)$tTable[,4] ##P-values
##R-squared values are not generated b/c gls uses max-likelihood not Sums of Squares
To isolate an individual p-value itself, you'd add a row number to the code:
For example to access the p-value of the intercept in both model summaries:
summary(fit)$coefficients[1,4]
summary(fit)$tTable[1,4]
Note, you can replace the column number with the column name in each of the above instances:
summary(fit)$coefficients[1,"Pr(>|t|)"] ##lm
summary(fit)$tTable[1,"p-value"] ##gls
If you're still unsure of how to access a value form the summary table use str()
to figure out the structure of the summary table:
str(summary(fit))
Another option is to use the cor.test function, instead of lm:
> x <- c(44.4, 45.9, 41.9, 53.3, 44.7, 44.1, 50.7, 45.2, 60.1)
> y <- c( 2.6, 3.1, 2.5, 5.0, 3.6, 4.0, 5.2, 2.8, 3.8)
> mycor = cor.test(x,y)
> mylm = lm(x~y)
# r and rsquared:
> cor.test(x,y)$estimate ** 2
cor
0.3262484
> summary(lm(x~y))$r.squared
[1] 0.3262484
# P.value
> lmp(lm(x~y)) # Using the lmp function defined in Chase's answer
[1] 0.1081731
> cor.test(x,y)$p.value
[1] 0.1081731
This is the easiest way to pull the p-values:
coef(summary(modelname))[, "Pr(>|t|)"]
I came across this question while exploring suggested solutions for a similar problem; I presume that for future reference it may be worthwhile to update the available list of answer with a solution utilising the broom
package.
x = cumsum(c(0, runif(100, -1, +1)))
y = cumsum(c(0, runif(100, -1, +1)))
fit = lm(y ~ x)
require(broom)
glance(fit)
>> glance(fit)
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual
1 0.5442762 0.5396729 1.502943 118.2368 1.3719e-18 2 -183.4527 372.9055 380.7508 223.6251 99
I find the glance
function is useful as it neatly summarises the key values. The results are stored as a data.frame
which makes further manipulation easy:
>> class(glance(fit))
[1] "data.frame"
You can see the structure of the object returned by summary()
by calling str(summary(fit))
. Each piece can be accessed using $
. The p-value for the F statistic is more easily had from the object returned by anova
.
Concisely, you can do this:
rSquared <- summary(fit)$r.squared
pVal <- anova(fit)$'Pr(>F)'[1]
Notice that summary(fit)
generates an object with all the information you need. The beta, se, t and p vectors are stored in it. Get the p-values by selecting the 4th column of the coefficients matrix (stored in the summary object):
summary(fit)$coefficients[,4]
summary(fit)$r.squared
Try str(summary(fit))
to see all the info that this object contains.
Edit: I had misread Chase's answer which basically tells you how to get to what I give here.
Use:
(summary(fit))$coefficients[***num***,4]
where num
is a number which denotes the row of the coefficients matrix. It will depend on how many features you have in your model and which one you want to pull out the p-value for. For example, if you have only one variable there will be one p-value for the intercept which will be [1,4] and the next one for your actual variable which will be [2,4]. So your num
will be 2.
x = cumsum(c(0, runif(100, -1, +1)))
y = cumsum(c(0, runif(100, -1, +1)))
fit = lm(y ~ x)
> names(summary(fit))
[1] "call" "terms"
[3] "residuals" "coefficients"
[5] "aliased" "sigma"
[7] "df" "r.squared"
[9] "adj.r.squared" "fstatistic"
[11] "cov.unscaled"
summary(fit)$r.squared
While both of the answers above are good, the procedure for extracting parts of objects is more general.
In many cases, functions return lists, and the individual components can be accessed using str()
which will print the components along with their names. You can then access them using the $ operator, i.e. myobject$componentname
.
In the case of lm objects, there are a number of predefined methods one can use such as coef()
, resid()
, summary()
etc, but you won't always be so lucky.
Source: Stackoverflow.com