[r] How do you remove columns from a data.frame?

Not so much 'How do you...?' but more 'How do YOU...?'

If you have a file someone gives you with 200 columns, and you want to reduce it to the few ones you need for analysis, how do you go about it? Does one solution offer benefits over another?

Assume we have a data frame with columns col1 through col200. If you only want columns 1-100, 125-135, and 150-200, you could do:

dat$col101 <- NULL
dat$col102 <- NULL # etc

or

dat <- dat[,c("col1","col2",...)]

or

dat <- dat[,c(1:100,125:135,...)] # shortest probably but I don't like this

or

dat <- dat[,!names(dat) %in% c("col101","col102",...)]

Anything else I'm missing? I know this is slightly subjective, but it's one of those nitty-gritty things where you might dive in, start doing it one way, and fall into a habit when there are far more efficient ways out there. Much like this question about which.

EDIT:

Or, is there an easy way to create a workable vector of column names? names(dat) doesn't print them with commas in between, which is what you need in the code examples above, so if you print the names that way you have spaces everywhere and have to insert the commas manually... Is there a command that will give you "col1","col2","col3",... as your output so you can easily grab what you want?


The answer is


To delete single columns, I'll just use dat$x <- NULL.

To delete multiple columns, but less than about 3-4, I'll use dat$x <- dat$y <- dat$z <- NULL.
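
A quick sketch of those two on a throwaway data frame (the column names here are just placeholders):

dat <- data.frame(x = 1:3, y = 4:6, z = 7:9, w = 10:12)
dat$x <- NULL              # drop a single column
dat$y <- dat$z <- NULL     # drop a few at once
names(dat)
# [1] "w"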

For more than that, I'll use subset with negated names (note the minus sign):

subset(mtcars, select = -c(mpg, cyl, disp, hp))

Use read.table with colClasses entries of "NULL" for the unwanted columns, to avoid creating them in the first place:

## example data and temp file
x <- data.frame(x = 1:10, y = rnorm(10), z = runif(10), a = letters[1:10], stringsAsFactors = FALSE)
tmp <- tempfile()
write.table(x, tmp, row.names = FALSE)


(y <- read.table(tmp, colClasses = c("numeric", rep("NULL", 2), "character"), header = TRUE))

    x a
1   1 a
2   2 b
3   3 c
4   4 d
5   5 e
6   6 f
7   7 g
8   8 h
9   9 i
10 10 j

unlink(tmp)

Using rm() inside within() can be quite useful:

within(mtcars, rm(mpg, cyl, disp, hp))
#                     drat    wt  qsec vs am gear carb
# Mazda RX4           3.90 2.620 16.46  0  1    4    4
# Mazda RX4 Wag       3.90 2.875 17.02  0  1    4    4
# Datsun 710          3.85 2.320 18.61  1  1    4    1
# Hornet 4 Drive      3.08 3.215 19.44  1  0    3    1
# Hornet Sportabout   3.15 3.440 17.02  0  0    3    2
# Valiant             2.76 3.460 20.22  1  0    3    1
# ...

May be combined with other operations.

within(mtcars, {
  mpg2 <- mpg^2
  cyl2 <- cyl^2
  rm(mpg, cyl, disp, hp)
})
#                     drat    wt  qsec vs am gear carb cyl2    mpg2
# Mazda RX4           3.90 2.620 16.46  0  1    4    4   36  441.00
# Mazda RX4 Wag       3.90 2.875 17.02  0  1    4    4   36  441.00
# Datsun 710          3.85 2.320 18.61  1  1    4    1   16  519.84
# Hornet 4 Drive      3.08 3.215 19.44  1  0    3    1   36  457.96
# Hornet Sportabout   3.15 3.440 17.02  0  0    3    2   64  349.69
# Valiant             2.76 3.460 20.22  1  0    3    1   36  327.61
# ...

From http://www.statmethods.net/management/subset.html

# exclude variables v1, v2, v3
myvars <- names(mydata) %in% c("v1", "v2", "v3") 
newdata <- mydata[!myvars]

# exclude 3rd and 5th variable 
newdata <- mydata[c(-3,-5)]

# delete variables v3 and v5
mydata$v3 <- mydata$v5 <- NULL
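
A runnable version of the first idiom, using mtcars and a couple of arbitrarily chosen columns to drop:

# logical vector marking the columns to exclude
drop <- names(mtcars) %in% c("mpg", "cyl")
head(mtcars[!drop])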

Thought it was really clever to make a list of columns "not to include".


The select() function from dplyr is powerful for subsetting columns. See ?select_helpers for a list of approaches.

In this case, where you have a common prefix and sequential numbers for column names, you could use num_range:

library(dplyr)

df1 <- data.frame(first = 0, col1 = 1, col2 = 2, col3 = 3, col4 = 4)
df1 %>%
  select(num_range("col", c(1, 4)))
#>   col1 col4
#> 1    1    4

More generally you can use the minus sign in select() to drop columns, like:

mtcars %>%
   select(-mpg, -wt)

Finally, to your question "is there an easy way to create a workable vector of column names?" - yes, if you need to edit a list of names manually, use dput to get a comma-separated, quoted list you can easily manipulate:

dput(names(mtcars))
#> c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", 
#> "gear", "carb")

If you already have a vector of names (and there are several ways to create one), you can easily use the subset function to keep or drop those columns.

dat2 <- subset(dat, select = names(dat) %in% c(KEEP))

In this case KEEP is a vector of column names which is pre-created. For example:

#sample data via Brandon Bertelsen
df <- data.frame(a=rnorm(100),
                 b=rnorm(100),
                 c=rnorm(100),
                 d=rnorm(100),
                 e=rnorm(100),
                 f=rnorm(100),
                 g=rnorm(100))

#creating the initial vector of names
df1 <- as.matrix(as.character(names(df)))

#retaining only the name values you want to keep
KEEP <- as.vector(df1[c(1:3,5,6),])

#subsetting the initial dataset with the object KEEP
df3 <- subset(df, select = names(df) %in% c(KEEP))

Which results in:

> head(df)
            a          b           c          d
1  1.05526388  0.6316023 -0.04230455 -0.1486299
2 -0.52584236  0.5596705  2.26831758  0.3871873
3  1.88565261  0.9727644  0.99708383  1.8495017
4 -0.58942525 -0.3874654  0.48173439  1.4137227
5 -0.03898588 -1.5297600  0.85594964  0.7353428
6  1.58860643 -1.6878690  0.79997390  1.1935813
            e           f           g
1 -1.42751190  0.09842343 -0.01543444
2 -0.62431091 -0.33265572 -0.15539472
3  1.15130591  0.37556903 -1.46640276
4 -1.28886526 -0.50547059 -2.20156926
5 -0.03915009 -1.38281923  0.60811360
6 -1.68024349 -1.18317733  0.42014397

> head(df3)
        a          b           c           e
1  1.05526388  0.6316023 -0.04230455 -1.42751190
2 -0.52584236  0.5596705  2.26831758 -0.62431091
3  1.88565261  0.9727644  0.99708383  1.15130591
4 -0.58942525 -0.3874654  0.48173439 -1.28886526
5 -0.03898588 -1.5297600  0.85594964 -0.03915009
6  1.58860643 -1.6878690  0.79997390 -1.68024349
            f
1  0.09842343
2 -0.33265572
3  0.37556903
4 -0.50547059
5 -1.38281923
6 -1.18317733

Sometimes I like to do this using column ids instead.

df <- data.frame(a=rnorm(100),
                 b=rnorm(100),
                 c=rnorm(100),
                 d=rnorm(100),
                 e=rnorm(100),
                 f=rnorm(100),
                 g=rnorm(100))

as.data.frame(names(df))

  names(df)
1         a
2         b
3         c
4         d
5         e
6         f
7         g 

Removing columns "c" and "g"

df[,-c(3,7)]

This is especially useful if you have data.frames that are large or have long column names that you don't want to type. It's also handy when the column positions follow a pattern, because then you can use seq() to build the index vector to remove, as sketched below.
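
For instance, a rough sketch on the 7-column df above, assuming you want to drop every second column:

# drop columns 2, 4 and 6 (b, d, f) using a seq()-built index
df[, -seq(2, ncol(df), by = 2)]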

RE: Your edit

You don't necessarily have to put quotes around each name, or commas between them, to create a character vector. I find this little trick handy:

x <- unlist(strsplit(
'A
B
C
D
E',"\n"))

For clarity purposes, I often use the select argument in subset. With newer folks, I've learned that keeping the number of commands they need to pick up to a minimum helps adoption. As their skills increase, so too will their coding ability. And subset is one of the first commands I show people when they need to select data that meets given criteria.

Something like:

> subset(mtcars, select = c("mpg", "cyl", "vs", "am"))
                     mpg cyl vs am
Mazda RX4           21.0   6  0  1
Mazda RX4 Wag       21.0   6  0  1
Datsun 710          22.8   4  1  1
....

I'm sure this will test slower than most other solutions, but I'm rarely at the point where microseconds make a difference.


You can use the setdiff function:

If there are more columns to keep than to delete: suppose you want to delete two columns, say col1 and col2, from a data.frame DT; you can do the following:

DT<-DT[,setdiff(names(DT),c("col1","col2"))]
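
For example, a small runnable sketch on mtcars (the dropped columns are just arbitrary picks):

# keep everything except mpg and cyl
mtcars[, setdiff(names(mtcars), c("mpg", "cyl"))]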

If there are more columns to delete than to keep: Suppose you want to keep only col1 and col2:

DT<-DT[,c("col1","col2")]

For the kinds of large files I tend to get, I generally wouldn't even do this in R. I would use the cut command in Linux to process data before it gets to R. This isn't a critique of R, just a preference for using some very basic Linux tools like grep, tr, cut, sort, uniq, and occasionally sed & awk (or Perl) when there's something to be done about regular expressions.
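
As a rough sketch of that workflow (the file name and field numbers below are just placeholders), you can even drive cut from inside R via pipe(), so only the wanted columns ever reach read.table:

## hypothetical comma-separated file; keep only fields 1-3 and 7
dat <- read.table(pipe("cut -d',' -f1-3,7 bigfile.csv"),
                  sep = ",", header = TRUE)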

Another reason to use standard GNU commands is that I can pass them back to the source of the data and ask that they prefilter the data so that I don't get extraneous data. Most of my colleagues are competent with Linux, fewer know R.

(Updated) A method that I would like to use before long is to pair mmap with a text file and examine the data in situ, rather than reading it into RAM at all. I have done this with C, and it can be blisteringly fast.


Just addressing the edit.

@nzcoops, you do not need the column names in a comma-delimited character vector. You are thinking about this the wrong way round. When you do

vec <- c("col1", "col2", "col3")

you are creating a character vector. The commas just separate the arguments taken by the c() function when you define that vector. names() and similar functions return a character vector of names.

> dat <- data.frame(col1 = 1:3, col2 = 1:3, col3 = 1:3)
> dat
  col1 col2 col3
1    1    1    1
2    2    2    2
3    3    3    3
> names(dat)
[1] "col1" "col2" "col3"

It is far easier and less error-prone to select from the elements of names(dat) than to process its output into a comma-separated string you can cut and paste from.

Say we want columns col1 and col3; subset names(dat), retaining only the ones we want:

> names(dat)[c(1,3)]
[1] "col1" "col3"
> dat[, names(dat)[c(1,3)]]
  col1 col3
1    1    1
2    2    2
3    3    3

You can kind of do what you want, but R will always wrap the printed string in double quotes and escape any embedded ones:

> paste('"', names(dat), '"', sep = "", collapse = ", ")
[1] "\"col1\", \"col2\", \"col3\""
> paste("'", names(dat), "'", sep = "", collapse = ", ")
[1] "'col1', 'col2', 'col3'"

so the latter may be more useful. However, now you have to cut and paste from that string. Far better to work with objects that return what you want and use standard subsetting routines to keep what you need.