How to succinctly write a formula with many variables from a data frame

Question

Suppose I have a response variable and a data frame containing three covariates (as a toy example):

    y = c(1, 4, 6)
    d = data.frame(x1 = c(4, -1, 3), x2 = c(3, 9, 8), x3 = c(4, -4, -2))

I want to fit a linear regression to the data:

    fit = lm(y ~ d$x1 + d$x2 + d$x3)

Is there a way to write the formula so that I don't have to write out each individual covariate? For example, something like

    fit = lm(y ~ d)

(I want each variable in the data frame to be a covariate.) I'm asking because I actually have 50 variables in my data frame, so I want to avoid writing out x1 + x2 + x3 + etc.

User · Answer

You can check the package leaps, and in particular its function regsubsets(), for model selection. As stated in the documentation: "Model selection by exhaustive search, forward or backward stepwise, or sequential replacement."

User · Answer

I built this solution because reformulate does not handle variable names that contain white space.

    add_backticks = function(x) {
        paste0("`", x, "`")
    }

    x_lm_formula = function(x) {
        paste(add_backticks(x), collapse = " + ")
    }

    build_lm_formula = function(x, y) {
        if (length(y) > 1) {
            stop("y needs to be just one variable")
        }
        as.formula(
            paste0("`", y, "`", " ~ ", x_lm_formula(x))
        )
    }

    # Example
    df <- data.frame(
        y = c(1, 4, 6),
        x1 = c(4, -1, 3),
        x2 = c(3, 9, 8),
        x3 = c(4, -4, -2)
    )

    # Model Specification
    columns = colnames(df)
    y_cols = columns[1]
    x_cols = columns[2:length(columns)]
    formula = build_lm_formula(x_cols, y_cols)
    formula
    # output
    # "`y` ~ `x1` + `x2` + `x3`"

    # Run Model
    lm(formula = formula, data = df)
    # output
    Call:
    lm(formula = formula, data = df)

    Coefficients:
    (Intercept)           x1           x2           x3  
        -5.6316       0.7895       1.1579           NA  
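To see the backtick quoting applied to names that really do contain spaces, here is a minimal, self-contained sketch. The column names and the helper `quote_name` are hypothetical and chosen only for illustration; they are not part of the answer above.

```r
# Hypothetical data frame whose column names contain spaces.
df <- data.frame(c(1, 4, 6, 3), c(4, -1, 3, 2), c(3, 9, 8, 1))
names(df) <- c("my response", "first covariate", "second covariate")

# Wrap each name in backticks so R parses it as a single symbol.
quote_name <- function(x) paste0("`", x, "`")

# Build "`my response` ~ `first covariate` + `second covariate`" as a string,
# then convert it to a formula.
rhs <- paste(quote_name(names(df)[-1]), collapse = " + ")
fmla <- as.formula(paste(quote_name(names(df)[1]), "~", rhs))

fit <- lm(fmla, data = df)
all.vars(fmla)  # the variable names, without backticks
```

Without the backticks, the string would fail to parse as a formula because of the spaces inside the names.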

User · Answer

A slightly different approach is to create your formula from a string. In the formula help page you will find the following example:

    ## Create a formula for a model with a large number of variables:
    xnam <- paste("x", 1:25, sep = "")
    fmla <- as.formula(paste("y ~ ", paste(xnam, collapse = "+")))

Then if you look at the generated formula, you will get:

    R> fmla
    y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 +
        x12 + x13 + x14 + x15 + x16 + x17 + x18 + x19 + x20 + x21 +
        x22 + x23 + x24 + x25
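Applied to the toy data from the question, the same string-building idea might look like the sketch below. It takes the covariate names from `names(d)` instead of generating them with `paste("x", 1:25)`; that substitution is an assumption about how one would adapt the help-page example, not something stated in the answer.

```r
# Toy data from the question: y lives outside the data frame.
y <- c(1, 4, 6)
d <- data.frame(x1 = c(4, -1, 3), x2 = c(3, 9, 8), x3 = c(4, -4, -2))

# Join every column name of d with " + " and prepend the response.
fmla <- as.formula(paste("y ~", paste(names(d), collapse = " + ")))
fmla

# Covariates are looked up in d; y is found in the formula's environment.
fit <- lm(fmla, data = d)
```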

User · Answer

Yes of course, just add the response y as the first column in the data frame and call lm() on it:

    d2 <- data.frame(y, d)
    > d2
      y x1 x2 x3
    1 1  4  3  4
    2 4 -1  9 -4
    3 6  3  8 -2
    > lm(d2)

    Call:
    lm(formula = d2)

    Coefficients:
    (Intercept)           x1           x2           x3  
        -5.6316       0.7895       1.1579           NA  

Also, my information about R points out that assignment with <- is recommended over =.
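One way to see why passing a data frame works is to ask R which formula it derives from it. The sketch below relies on `formula()` having a data frame method that treats the first column as the response and the remaining columns as additive terms; the comparison against `y ~ .` is added here only as a sanity check.

```r
y <- c(1, 4, 6)
d <- data.frame(x1 = c(4, -1, 3), x2 = c(3, 9, 8), x3 = c(4, -4, -2))
d2 <- data.frame(y, d)

# The formula R derives from the data frame: first column ~ all the others.
f <- formula(d2)
f

# Fitting from the data frame directly and fitting with an explicit
# "all other columns" formula should give the same coefficients.
fit_df  <- lm(d2)
fit_dot <- lm(y ~ ., data = d2)
all.equal(coef(fit_df), coef(fit_dot))
```

Because the response is picked by position, this shortcut only does what you want when the response really is the first column.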

User · Answer

There is a special identifier that one can use in a formula to mean all the variables: it is the . identifier.

    y <- c(1, 4, 6)
    d <- data.frame(y = y, x1 = c(4, -1, 3), x2 = c(3, 9, 8), x3 = c(4, -4, -2))
    mod <- lm(y ~ ., data = d)

You can also do things like this, to use all variables but one (in this case x3 is excluded):

    mod <- lm(y ~ . - x3, data = d)

Technically, . means all variables not already mentioned in the formula. For example

    lm(y ~ x1 * x2 + ., data = d)

where . would only reference x3, as x1 and x2 are already in the formula.
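To check what . expands to without fitting anything, one option is to inspect the term labels that `terms()` produces when it is given the data frame. This is a small sketch on the toy data; the variable names `all_terms`, `minus_x3`, and `with_inter` are made up for the example.

```r
d <- data.frame(y = c(1, 4, 6), x1 = c(4, -1, 3), x2 = c(3, 9, 8), x3 = c(4, -4, -2))

# terms() needs the data argument to know what "." stands for.
all_terms  <- attr(terms(y ~ ., data = d), "term.labels")
minus_x3   <- attr(terms(y ~ . - x3, data = d), "term.labels")
with_inter <- attr(terms(y ~ x1 * x2 + ., data = d), "term.labels")

all_terms   # every column except the response
minus_x3    # x3 dropped
with_inter  # main effects plus the x1:x2 interaction
```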

User · Answer

An extension of juba's method is to use reformulate, a function which is explicitly designed for such a task.

    ## Create a formula for a model with a large number of variables:
    xnam <- paste("x", 1:25, sep = "")

    reformulate(xnam, "y")
    y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 +
        x12 + x13 + x14 + x15 + x16 + x17 + x18 + x19 + x20 + x21 +
        x22 + x23 + x24 + x25

For the example in the OP, the easiest solution here would be

    # add y variable to data.frame d
    d <- cbind(y, d)
    reformulate(names(d)[-1], names(d[1]))
    y ~ x1 + x2 + x3

or

    mod <- lm(reformulate(names(d)[-1], names(d[1])), data = d)

Note that adding the dependent variable to the data.frame in d <- cbind(y, d) is preferred not only because it allows for the use of reformulate, but also because it allows for future use of the lm object in functions like predict.
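The point about predict can be illustrated with a slightly larger data set. The sketch below swaps in R's built-in `mtcars` data, with `mpg` as the response, purely as an assumption for the example; the toy data above has too few rows to give a full-rank fit.

```r
# Use every column except mpg as a covariate.
covariates <- setdiff(names(mtcars), "mpg")
fmla <- reformulate(covariates, response = "mpg")

# Because the model is fitted with data =, the variable names in the
# formula are plain column names that predict() can find in newdata.
mod <- lm(fmla, data = mtcars)
pred <- predict(mod, newdata = mtcars[1:3, ])
pred
```

Had the model been fitted with terms such as `d$x1`, predict would have no way to match them to the columns of a new data frame.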
