[r] How do you remove columns from a data.frame?

Not so much 'How do you...?' but more 'How do YOU...?'

If you have a file someone gives you with 200 columns, and you want to reduce it to the few ones you need for analysis, how do you go about it? Does one solution offer benefits over another?

Assume we have a data frame with columns col1 through col200. If you only want columns 1-100, 125-135, and 150-200, you could do:

dat$col101 <- NULL
dat$col102 <- NULL # etc

or

dat <- dat[,c("col1","col2",...)]

or

dat <- dat[,c(1:100,125:135,...)] # shortest probably but I don't like this

or

dat <- dat[,!names(dat) %in% c("col101","col102",...)]

Anything else I'm missing? I know this is slightly subjective, but it's one of those nitty-gritty things where you might dive in, start doing it one way, and fall into a habit when there are far more efficient ways out there. Much like this question about which.

EDIT:

Or, is there an easy way to create a workable vector of column names? names(dat) doesn't print them with commas in between, which is what you need in the code examples above, so if you print the names that way you have spaces everywhere and have to insert the commas manually... Is there a command that will give you "col1","col2","col3",... as your output so you can easily grab what you want?


The answer is


To delete single columns, I'll just use dat$x <- NULL.

To delete multiple columns, but less than about 3-4, I'll use dat$x <- dat$y <- dat$z <- NULL.
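
A quick sketch of those two on a throwaway data frame (the column names here are just placeholders):

dat <- data.frame(x = 1:3, y = 4:6, z = 7:9, w = 10:12)
dat$x <- NULL              # drop a single column
dat$y <- dat$z <- NULL     # drop a few at once
names(dat)
# [1] "w"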

For more than that, I'll use subset with negated names (note the minus sign):

subset(mtcars, select = -c(mpg, cyl, disp, hp))

Use read.table with colClasses entries of "NULL" for the unwanted columns, to avoid creating them in the first place:

## example data and temp file
x <- data.frame(x = 1:10, y = rnorm(10), z = runif(10), a = letters[1:10], stringsAsFactors = FALSE)
tmp <- tempfile()
write.table(x, tmp, row.names = FALSE)


(y <- read.table(tmp, colClasses = c("numeric", rep("NULL", 2), "character"), header = TRUE))

    x a
1   1 a
2   2 b
3   3 c
4   4 d
5   5 e
6   6 f
7   7 g
8   8 h
9   9 i
10 10 j

unlink(tmp)

Using rm() inside within() can be quite useful:

within(mtcars, rm(mpg, cyl, disp, hp))
#                     drat    wt  qsec vs am gear carb
# Mazda RX4           3.90 2.620 16.46  0  1    4    4
# Mazda RX4 Wag       3.90 2.875 17.02  0  1    4    4
# Datsun 710          3.85 2.320 18.61  1  1    4    1
# Hornet 4 Drive      3.08 3.215 19.44  1  0    3    1
# Hornet Sportabout   3.15 3.440 17.02  0  0    3    2
# Valiant             2.76 3.460 20.22  1  0    3    1
# ...

May be combined with other operations.

within(mtcars, {
  mpg2 <- mpg^2
  cyl2 <- cyl^2
  rm(mpg, cyl, disp, hp)
})
#                     drat    wt  qsec vs am gear carb cyl2    mpg2
# Mazda RX4           3.90 2.620 16.46  0  1    4    4   36  441.00
# Mazda RX4 Wag       3.90 2.875 17.02  0  1    4    4   36  441.00
# Datsun 710          3.85 2.320 18.61  1  1    4    1   16  519.84
# Hornet 4 Drive      3.08 3.215 19.44  1  0    3    1   36  457.96
# Hornet Sportabout   3.15 3.440 17.02  0  0    3    2   64  349.69
# Valiant             2.76 3.460 20.22  1  0    3    1   36  327.61
# ...

From http://www.statmethods.net/management/subset.html

# exclude variables v1, v2, v3
myvars <- names(mydata) %in% c("v1", "v2", "v3") 
newdata <- mydata[!myvars]

# exclude 3rd and 5th variable 
newdata <- mydata[c(-3,-5)]

# delete variables v3 and v5
mydata$v3 <- mydata$v5 <- NULL
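
A runnable version of the first idiom, using mtcars and a couple of arbitrarily chosen columns to drop:

# logical vector marking the columns to exclude
drop <- names(mtcars) %in% c("mpg", "cyl")
head(mtcars[!drop])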

Thought it was really clever to make a list of columns "not to include".


The select() function from dplyr is powerful for subsetting columns. See ?select_helpers for a list of approaches.

In this case, where you have a common prefix and sequential numbers for column names, you could use num_range:

library(dplyr)

df1 <- data.frame(first = 0, col1 = 1, col2 = 2, col3 = 3, col4 = 4)
df1 %>%
  select(num_range("col", c(1, 4)))
#>   col1 col4
#> 1    1    4

More generally you can use the minus sign in select() to drop columns, like:

mtcars %>%
   select(-mpg, -wt)

Finally, to your question "is there an easy way to create a workable vector of column names?" - yes, if you need to edit a list of names manually, use dput to get a comma-separated, quoted list you can easily manipulate:

dput(names(mtcars))
#> c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", 
#> "gear", "carb")

If you already have a vector of names (and there are several ways to create one), you can easily use the subset function to keep or drop those columns.

dat2 <- subset(dat, select = names(dat) %in% c(KEEP))

In this case KEEP is a vector of column names which is pre-created. For example:

#sample data via Brandon Bertelsen
df <- data.frame(a=rnorm(100),
                 b=rnorm(100),
                 c=rnorm(100),
                 d=rnorm(100),
                 e=rnorm(100),
                 f=rnorm(100),
                 g=rnorm(100))

#creating the initial vector of names
df1 <- as.matrix(as.character(names(df)))

#retaining only the name values you want to keep
KEEP <- as.vector(df1[c(1:3,5,6),])

#subsetting the initial dataset with the object KEEP
df3 <- subset(df, select = names(df) %in% c(KEEP))

Which results in:

> head(df)
            a          b           c          d
1  1.05526388  0.6316023 -0.04230455 -0.1486299
2 -0.52584236  0.5596705  2.26831758  0.3871873
3  1.88565261  0.9727644  0.99708383  1.8495017
4 -0.58942525 -0.3874654  0.48173439  1.4137227
5 -0.03898588 -1.5297600  0.85594964  0.7353428
6  1.58860643 -1.6878690  0.79997390  1.1935813
            e           f           g
1 -1.42751190  0.09842343 -0.01543444
2 -0.62431091 -0.33265572 -0.15539472
3  1.15130591  0.37556903 -1.46640276
4 -1.28886526 -0.50547059 -2.20156926
5 -0.03915009 -1.38281923  0.60811360
6 -1.68024349 -1.18317733  0.42014397

> head(df3)
        a          b           c           e
1  1.05526388  0.6316023 -0.04230455 -1.42751190
2 -0.52584236  0.5596705  2.26831758 -0.62431091
3  1.88565261  0.9727644  0.99708383  1.15130591
4 -0.58942525 -0.3874654  0.48173439 -1.28886526
5 -0.03898588 -1.5297600  0.85594964 -0.03915009
6  1.58860643 -1.6878690  0.79997390 -1.68024349
            f
1  0.09842343
2 -0.33265572
3  0.37556903
4 -0.50547059
5 -1.38281923
6 -1.18317733

Sometimes I like to do this using column ids instead.

df <- data.frame(a=rnorm(100),
                 b=rnorm(100),
                 c=rnorm(100),
                 d=rnorm(100),
                 e=rnorm(100),
                 f=rnorm(100),
                 g=rnorm(100))

as.data.frame(names(df))

  names(df)
1         a
2         b
3         c
4         d
5         e
6         f
7         g 

Removing columns "c" and "g"

df[,-c(3,7)]

This is especially useful if you have data.frames that are large or have long column names that you don't want to type. It's also handy when the column positions follow a pattern, because then you can use seq() to build the index vector to remove, as sketched below.
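
For instance, a rough sketch on the 7-column df above, assuming you want to drop every second column:

# drop columns 2, 4 and 6 (b, d, f) using a seq()-built index
df[, -seq(2, ncol(df), by = 2)]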

RE: Your edit

You don't necessarily have to put quotes around each name, or commas between them, to create a character vector. I find this little trick handy:

x <- unlist(strsplit(
'A
B
C
D
E',"\n"))

For clarity purposes, I often use the select argument in subset. With newer folks, I've learned that keeping the number of commands they need to pick up to a minimum helps adoption. As their skills increase, so too will their coding ability. And subset is one of the first commands I show people when they need to select data that meets given criteria.

Something like:

> subset(mtcars, select = c("mpg", "cyl", "vs", "am"))
                     mpg cyl vs am
Mazda RX4           21.0   6  0  1
Mazda RX4 Wag       21.0   6  0  1
Datsun 710          22.8   4  1  1
....

I'm sure this will test slower than most other solutions, but I'm rarely at the point where microseconds make a difference.


You can use the setdiff function:

If there are more columns to keep than to delete: suppose you want to delete two columns, say col1 and col2, from a data.frame DT; you can do the following:

DT<-DT[,setdiff(names(DT),c("col1","col2"))]
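
For example, a small runnable sketch on mtcars (the dropped columns are just arbitrary picks):

# keep everything except mpg and cyl
mtcars[, setdiff(names(mtcars), c("mpg", "cyl"))]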

If there are more columns to delete than to keep: Suppose you want to keep only col1 and col2:

DT<-DT[,c("col1","col2")]

For the kinds of large files I tend to get, I generally wouldn't even do this in R. I would use the cut command in Linux to process data before it gets to R. This isn't a critique of R, just a preference for using some very basic Linux tools like grep, tr, cut, sort, uniq, and occasionally sed & awk (or Perl) when there's something to be done about regular expressions.
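
As a rough sketch of that workflow (the file name and field numbers below are just placeholders), you can even drive cut from inside R via pipe(), so only the wanted columns ever reach read.table:

## hypothetical comma-separated file; keep only fields 1-3 and 7
dat <- read.table(pipe("cut -d',' -f1-3,7 bigfile.csv"),
                  sep = ",", header = TRUE)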

Another reason to use standard GNU commands is that I can pass them back to the source of the data and ask that they prefilter the data so that I don't get extraneous data. Most of my colleagues are competent with Linux, fewer know R.

(Updated) A method that I would like to use before long is to pair mmap with a text file and examine the data in situ, rather than reading it into RAM at all. I have done this with C, and it can be blisteringly fast.


Just addressing the edit.

@nzcoops, you do not need the column names in a comma-delimited character vector. You are thinking about this the wrong way round. When you do

vec <- c("col1", "col2", "col3")

you are creating a character vector. The commas just separate the arguments taken by the c() function when you define that vector. names() and similar functions return a character vector of names.

> dat <- data.frame(col1 = 1:3, col2 = 1:3, col3 = 1:3)
> dat
  col1 col2 col3
1    1    1    1
2    2    2    2
3    3    3    3
> names(dat)
[1] "col1" "col2" "col3"

It is far easier and less error-prone to select from the elements of names(dat) than to process its output into a comma-separated string you can cut and paste from.

Say we want columns col1 and col3; subset names(dat), retaining only the ones we want:

> names(dat)[c(1,3)]
[1] "col1" "col3"
> dat[, names(dat)[c(1,3)]]
  col1 col3
1    1    1
2    2    2
3    3    3

You can kind of do what you want, but R will always wrap the printed string in double quotes and escape any embedded ones:

> paste('"', names(dat), '"', sep = "", collapse = ", ")
[1] "\"col1\", \"col2\", \"col3\""
> paste("'", names(dat), "'", sep = "", collapse = ", ")
[1] "'col1', 'col2', 'col3'"

so the latter may be more useful. However, now you have to cut and paste from that string. Far better to work with objects that return what you want and use standard subsetting routines to keep what you need.