[r] Use of ~ (tilde) in R programming Language

I saw in a tutorial about regression modeling the following command:

myFormula <- Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width

What exactly does this command do, and what is the role of ~ (tilde) in the command?

This question is related to r r-faq r-formula

The answer is


R defines a ~ (tilde) operator for use in formulas. Formulas have all sorts of uses, but perhaps the most common is for regression:

library(datasets)
lm( myFormula, data=iris)

help("~") or help("formula") will teach you more.

@Spacedman has covered the basics. Let's discuss how it works.

First, being an operator, note that it is essentially a shortcut to a function (with two arguments):

> `~`(lhs,rhs)
lhs ~ rhs
> lhs ~ rhs
lhs ~ rhs

That can be helpful to know for use in e.g. apply family commands.

Second, you can manipulate the formula as text:

oldform <- as.character(myFormula) # Get components
myFormula <- as.formula( paste( oldform[2], "Sepal.Length", sep="~" ) )

Third, you can manipulate it as a list:

myFormula[[2]]
myFormula[[3]]

Finally, there are some helpful tricks with formulae (see help("formula") for more):

myFormula <- Species ~ . 

For example, the version above is the same as the original version, since the dot means "all variables not yet used." This looks at the data.frame you use in your eventual model call, sees which variables exist in the data.frame but aren't explicitly mentioned in your formula, and replaces the dot with those missing variables.