[r] How to force R to use a specified factor level as reference in a regression?

How can I tell R to use a certain level as reference if I use binary explanatory variables in a regression?

It's just using some level by default.

lm(x ~ y + as.factor(b)) 

with b {0, 1, 2, 3, 4}. Let's say I want to use 3 instead of the zero that is used by R.

The answer is


See the relevel() function. Here is an example:

set.seed(123)
x <- rnorm(100)
DF <- data.frame(x = x,
                 y = 4 + (1.5*x) + rnorm(100, sd = 2),
                 b = gl(5, 20))
head(DF)
str(DF)

m1 <- lm(y ~ x + b, data = DF)
summary(m1)

Now alter the factor b in DF by use of the relevel() function:

DF <- within(DF, b <- relevel(b, ref = 3))
m2 <- lm(y ~ x + b, data = DF)
summary(m2)

The models have estimated different reference levels.

> coef(m1)
(Intercept)           x          b2          b3          b4          b5 
  3.2903239   1.4358520   0.6296896   0.3698343   1.0357633   0.4666219 
> coef(m2)
(Intercept)           x          b1          b2          b4          b5 
 3.66015826  1.43585196 -0.36983433  0.25985529  0.66592898  0.09678759

The relevel() command is a shorthand method to your question. What it does is reorder the factor so that whatever is the ref level is first. Therefore, reordering your factor levels will also have the same effect but gives you more control. Perhaps you wanted to have levels 3,4,0,1,2. In that case...

bFactor <- factor(b, levels = c(3,4,0,1,2))

I prefer this method because it's easier for me to see in my code not only what the reference was but the position of the other values as well (rather than having to look at the results for that).

NOTE: DO NOT make it an ordered factor. A factor with a specified order and an ordered factor are not the same thing. lm() may start to think you want polynomial contrasts if you do that.


You can also manually tag the column with a contrasts attribute, which seems to be respected by the regression functions:

contrasts(df$factorcol) <- contr.treatment(levels(df$factorcol),
   base=which(levels(df$factorcol) == 'RefLevel'))

For those looking for a dplyr/tidyverse version. Building on Gavin Simpson solution:

# Create DF
set.seed(123)
x <- rnorm(100)
DF <- data.frame(x = x,
                 y = 4 + (1.5*x) + rnorm(100, sd = 2),
                 b = gl(5, 20))

# Change reference level
DF = DF %>% mutate(b = relevel(b, 3))

m2 <- lm(y ~ x + b, data = DF)
summary(m2)

Others have mentioned the relevel command which is the best solution if you want to change the base level for all analyses on your data (or are willing to live with changing the data).

If you don't want to change the data (this is a one time change, but in the future you want the default behavior again), then you can use a combination of the C (note uppercase) function to set contrasts and the contr.treatments function with the base argument for choosing which level you want to be the baseline.

For example:

lm( Sepal.Width ~ C(Species,contr.treatment(3, base=2)), data=iris )

I know this is an old question, but I had a similar issue and found that:

lm(x ~ y + relevel(b, ref = "3")) 

does exactly what you asked.


Examples related to r

How to get AIC from Conway–Maxwell-Poisson regression via COM-poisson package in R? R : how to simply repeat a command? session not created: This version of ChromeDriver only supports Chrome version 74 error with ChromeDriver Chrome using Selenium How to show code but hide output in RMarkdown? remove kernel on jupyter notebook Function to calculate R2 (R-squared) in R Center Plot title in ggplot2 R ggplot2: stat_count() must not be used with a y aesthetic error in Bar graph R multiple conditions in if statement What does "The following object is masked from 'package:xxx'" mean?

Examples related to regression

how to use the Box-Cox power transformation in R Find p-value (significance) in scikit-learn LinearRegression Run an OLS regression with Pandas Data Frame fitting data with numpy Adding a regression line on a ggplot Quadratic and cubic regression in Excel Extract regression coefficient values How to force R to use a specified factor level as reference in a regression? What is the difference between Multiple R-squared and Adjusted R-squared in a single-variate least squares regression?

Examples related to linear-regression

Accuracy Score ValueError: Can't Handle mix of binary and continuous target TensorFlow: "Attempting to use uninitialized value" in variable initialization gradient descent using python and numpy Adding a regression line on a ggplot How to calculate the 95% confidence interval for the slope in a linear regression model in R What is the difference between linear regression and logistic regression? Multiple linear regression in Python Add regression line equation and R^2 on graph Linear regression with matplotlib / numpy How to force R to use a specified factor level as reference in a regression?

Examples related to categorical-data

pandas dataframe convert column type to string or categorical Plotting with ggplot2: "Error: Discrete value supplied to continuous scale" on categorical y-axis Make Frequency Histogram for Factor Variables R error "sum not meaningful for factors" How to force R to use a specified factor level as reference in a regression?

Examples related to dummy-variable

How to force R to use a specified factor level as reference in a regression?