How to drop columns by name in a data frame

Question

I have a large data set and I would like to read specific columns or drop all the others.

    data <- read.dta("file.dta")

I select the columns that I'm not interested in:

    var.out <- names(data)[!names(data) %in% c("iden", "name", "x_serv", "m_serv")]

and then I'd like to do something like:

    for(i in 1:length(var.out)) {
       paste("data$", var.out[i], sep="") <- NULL
    }

to drop all the unwanted columns. Is this the optimal solution?

User · Answer

    df = mtcars

    # Remove vs and am because they are categorical.
    # In the dataset, vs is in column number 8 and am is in column number 9.
    dfnum = df[, -c(8, 9)]
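Because the question asks about dropping columns by name, a name-based version of the positional drop above may be easier to maintain if the column order ever changes. The following is a minimal sketch using the built-in mtcars data; `drop_names` and `dfnum_by_name` are illustrative names that do not appear in the answer.

```r
df <- mtcars

# Columns to drop, given by name rather than by position (illustrative name)
drop_names <- c("vs", "am")

# Keep every column whose name is not listed in drop_names;
# drop = FALSE keeps the result a data frame even if one column remains
dfnum_by_name <- df[, !(names(df) %in% drop_names), drop = FALSE]

# Compare with the positional version from the answer
identical(dfnum_by_name, df[, -c(8, 9)])
```

The positional form depends on vs and am staying in columns 8 and 9, while the name-based form does not.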

User · Answer

Here is another solution that may be helpful to others. The code below selects a small number of rows and columns from a large data set. The columns are selected as in one of juba's answers, except that I use a paste function to select a set of columns with names that are numbered sequentially:

    df = read.table(text = "

    state county city  region  mmatrix  X1 X2 X3    A1     A2     A3      B1     B2     B3      C1      C2      C3

      1      1     1      1     111010   1  0  0     2     20    200       4      8     12      NA      NA      NA
      1      2     1      1     111010   1  0  0     4     NA    400       5      9     NA      NA      NA      NA
      1      1     2      1     111010   1  0  0     6     60     NA      NA     10     14      NA      NA      NA
      1      2     2      1     111010   1  0  0    NA     80    800       7     11     15      NA      NA      NA

      1      1     3      2     111010   0  1  0     1      2      1       2      2      2      10      20      30
      1      2     3      2     111010   0  1  0     2     NA      1       2      2     NA      40      50      NA
      1      1     4      2     111010   0  1  0     1      1     NA      NA      2      2      70      80      90
      1      2     4      2     111010   0  1  0    NA      2      1       2      2     10     100     110     120

      1      1     1      3     010010   0  0  1    10     20     10     200    200    200       1       2       3
      1      2     1      3     001000   0  0  1    20     NA     10     200    200    200       4       5       9
      1      1     2      3     101000   0  0  1    10     10     NA     200    200    200       7       8      NA
      1      2     2      3     011010   0  0  1    NA     20     10     200    200    200      10      11      12

    ", sep = "", header = TRUE, stringsAsFactors = FALSE)
    df

    df2 <- df[df$region == 2, names(df) %in% c(paste("C", seq_along(1:3), sep = ""))]
    df2

    #    C1  C2  C3
    # 5  10  20  30
    # 6  40  50  NA
    # 7  70  80  90
    # 8 100 110 120

User · Answer

I can't answer your question in the comments due to low reputation score.

The next code will give you an error because the paste function returns a character string, and you cannot assign NULL to a string:

    for(i in 1:length(var.out)) {
       paste("data$", var.out[i], sep="") <- NULL
    }

Here is a possible solution:

    for(i in 1:length(var.out)) {
      # Write a line of your code as a character string
      text_to_source <- paste0("data$", var.out[i], " <- NULL")
      # Evaluate the text that contains the code
      eval(parse(text = text_to_source))
    }

or just do:

    for(i in 1:length(var.out)) {
      data[var.out[i]] <- NULL
    }

User · Answer

Here's a quick solution for this. Say you have a data frame X with three columns A, B and C:

    > X <- data.frame(A=c(1,2), B=c(3,4), C=c(5,6))
    > X
      A B C
    1 1 3 5
    2 2 4 6

If I want to remove a column, say B, just use grep on colnames to get the column index, which you can then use to omit the column.

    > X <- X[, -grep("B", colnames(X))]

Your new X data frame would look like the following (this time without the B column):

    > X
      A C
    1 1 5
    2 2 6

The beauty of grep is that you can specify multiple columns that match the regular expression. If I had X with five columns (A, B, C, D, E):

    > X <- data.frame(A=c(1,2), B=c(3,4), C=c(5,6), D=c(7,8), E=c(9,10))
    > X
      A B C D  E
    1 1 3 5 7  9
    2 2 4 6 8 10

Take out columns B and D:

    > X <- X[, -grep("B|D", colnames(X))]
    > X
      A C  E
    1 1 5  9
    2 2 6 10

EDIT: Considering the grepl suggestion of Matthew Lundberg in the comments below:

    > X <- data.frame(A=c(1,2), B=c(3,4), C=c(5,6), D=c(7,8), E=c(9,10))
    > X
      A B C D  E
    1 1 3 5 7  9
    2 2 4 6 8 10
    > X <- X[, !grepl("B|D", colnames(X))]
    > X
      A C  E
    1 1 5  9
    2 2 6 10

If I try to drop a column that's non-existent, nothing should happen:

    > X <- X[, !grepl("G", colnames(X))]
    > X
      A C  E
    1 1 5  9
    2 2 6 10
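Two details of the pattern-matching approach above may be worth checking on your own data: an unanchored pattern matches any column name that merely contains it, and negative grep indices behave differently from grepl when nothing matches. The sketch below uses a made-up data frame with an extra column AB (not part of the answer) to illustrate both points.

```r
# Made-up data: "AB" contains "B" but is a different column
X <- data.frame(A = 1:2, AB = 3:4, B = 5:6)

# Unanchored pattern: "B" matches both "AB" and "B", so only A is left
loose <- X[, !grepl("B", colnames(X)), drop = FALSE]

# Anchored pattern: drops only the column named exactly "B"
exact <- X[, !grepl("^B$", colnames(X)), drop = FALSE]

# No match with negative grep() indices: -integer(0) is still integer(0),
# which selects zero columns instead of leaving X unchanged
none <- X[, -grep("G", colnames(X)), drop = FALSE]

names(loose)
names(exact)
ncol(none)
```

When the names are known exactly, `%in%` avoids regular-expression surprises altogether; grepl is more useful when columns share a prefix or suffix.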

User · Answer

I changed the code to:

    # read data
    dat <- read.dta("file.dta")

    # vars to delete
    var.in <- c("iden", "name", "x_serv", "m_serv")

    # what I'm keeping
    var.out <- setdiff(names(dat), var.in)

    # keep only the ones I want
    dat <- dat[var.out]

Anyway, juba's answer is the best solution to my problem!

User · Answer

You should use either indexing or the subset function. For example:

    R> df <- data.frame(x=1:5, y=2:6, z=3:7, u=4:8)
    R> df
      x y z u
    1 1 2 3 4
    2 2 3 4 5
    3 3 4 5 6
    4 4 5 6 7
    5 5 6 7 8

Then you can use the which function and the - operator in column indexation:

    R> df[ , -which(names(df) %in% c("z","u"))]
      x y
    1 1 2
    2 2 3
    3 3 4
    4 4 5
    5 5 6

Or, much simpler, use the select argument of the subset function: you can then use the - operator directly on a vector of column names, and you can even omit the quotes around the names!

    R> subset(df, select=-c(z,u))
      x y
    1 1 2
    2 2 3
    3 3 4
    4 4 5
    5 5 6

Note that you can also select the columns you want instead of dropping the others:

    R> df[ , c("x","y")]
      x y
    1 1 2
    2 2 3
    3 3 4
    4 4 5
    5 5 6

    R> subset(df, select=c(x,y))
      x y
    1 1 2
    2 2 3
    3 3 4
    4 4 5
    5 5 6

User · Answer

First, you can use direct indexing (with boolean vectors) instead of re-accessing column names if you are working with the same data frame; it will be safer, as pointed out by Ista, and quicker to write and to execute. So all you need is:

    var.out.bool <- !names(data) %in% c("iden", "name", "x_serv", "m_serv")

and then simply reassign data:

    data <- data[, var.out.bool]   # or:
    data <- data[, var.out.bool, drop = FALSE]   # You will need this option to avoid the conversion to an atomic vector if there is only one column left

Second, and quicker to write, you can directly assign NULL to the columns you want to remove:

    data[c("iden", "name", "x_serv", "m_serv")] <- list(NULL)   # You need list() to respect the target structure.

Finally, you can use subset(), but it cannot really be used in the code (even the help file warns about it). Specifically, a problem to me is that if you want to directly use the drop feature of subset(), you need to write the expression corresponding to the column names without quotes:

    subset( data, select = -c("iden", "name", "x_serv", "m_serv") )   # WILL NOT WORK
    subset( data, select = -c(iden, name, x_serv, m_serv) )           # WILL

As a bonus, here is a small benchmark of the different options. It clearly shows that subset is the slowest and that the first, reassigning method is the fastest:

                                              re_assign(dtest, drop_vec)  46.719  52.5655  54.6460  59.0400  1347.331
                                            null_assign(dtest, drop_vec)  74.593  83.0585  86.2025  94.0035  1476.150
                   subset(dtest, select = !names(dtest) %in% drop_vec) 106.280 115.4810 120.3435 131.4665 65133.780
     subset(dtest, select = names(dtest)[!names(dtest) %in% drop_vec]) 108.611 119.4830 124.0865 135.4270  1599.577
                                      subset(dtest, select = -c(x, y)) 102.026 111.2680 115.7035 126.2320  1484.174

Code is below:

    library(microbenchmark)

    dtest <- data.frame(x=1:5, y=2:6, z = 3:7)
    drop_vec <- c("x", "y")

    # Drop columns by assigning NULL to them
    null_assign <- function(df, names) {
      df[names] <- list(NULL)
      df
    }

    # Drop columns by reassigning the data frame with a logical index
    re_assign <- function(df, drop) {
      df <- df[, !names(df) %in% drop, drop = FALSE]
      df
    }

    res <- microbenchmark(
      re_assign(dtest, drop_vec),
      null_assign(dtest, drop_vec),
      subset(dtest, select = !names(dtest) %in% drop_vec),
      subset(dtest, select = names(dtest)[!names(dtest) %in% drop_vec]),
      subset(dtest, select = -c(x, y)),
      times = 5000)

    plt <- ggplot2::qplot(y = time, data = res[res$time < 1000000, ], colour = expr)
    plt <- plt + ggplot2::scale_y_log10() +
      ggplot2::labs(colour = "expression") +
      ggplot2::scale_color_discrete(labels = c("re_assign", "null_assign", "subset_bool", "subset_names", "subset_drop")) +
      ggplot2::theme_bw(base_size = 16)
    print(plt)
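The remark about `drop = FALSE` in the answer above is easy to miss, so here is a small sketch of what happens when only one column survives. It reuses the answer's `dtest` data; the variable names `keep`, `v`, `d` and `d2` are illustrative.

```r
dtest <- data.frame(x = 1:5, y = 2:6, z = 3:7)

# Logical index that keeps only z
keep <- !names(dtest) %in% c("x", "y")

v  <- dtest[, keep]                 # collapses to a plain integer vector
d  <- dtest[, keep, drop = FALSE]   # stays a one-column data frame
d2 <- dtest[keep]                   # list-style indexing also keeps a data frame

class(v)
class(d)
class(d2)
```

Code that later calls `names()` or `ncol()` on the result will behave differently for `v` than for `d`, which is why the answer recommends `drop = FALSE`.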

User · Answer

I tried to delete a column while using the package data.table and got an unexpected result. I kind of think the following might be worth posting. Just a little cautionary note.

[ Edited by Matthew ... ]

    DF = read.table(text = "
         fruit state grade y1980 y1990 y2000
         apples Ohio   aa    500   100   55
         apples Ohio   bb      0     0   44
         apples Ohio   cc    700     0   33
         apples Ohio   dd    300    50   66
    ", sep = "", header = TRUE, stringsAsFactors = FALSE)

    DF[ , !names(DF) %in% c("grade")]   # all columns other than 'grade'
       fruit state y1980 y1990 y2000
    1 apples  Ohio   500   100    55
    2 apples  Ohio     0     0    44
    3 apples  Ohio   700     0    33
    4 apples  Ohio   300    50    66

    library('data.table')
    DT = as.data.table(DF)

    DT[ , !names(DT) %in% c("grade")]    # not expected !! not the same as DF !!
    [1]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE

    DT[ , !names(DT) %in% c("grade"), with=FALSE]    # that's better
        fruit state y1980 y1990 y2000
    1: apples  Ohio   500   100    55
    2: apples  Ohio     0     0    44
    3: apples  Ohio   700     0    33
    4: apples  Ohio   300    50    66

Basically, the syntax for data.table is NOT exactly the same as data.frame. There are in fact lots of differences, see FAQ 1.1 and FAQ 2.17. You have been warned!
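For completeness, data.table also has its own ways of dropping columns by name. The sketch below assumes the data.table package is installed and uses a shortened, made-up version of the fruit data; `DT2`, `DT3` and `cols` are illustrative names. It shows a copy-returning drop with `with = FALSE` and an in-place drop with `:= NULL`.

```r
library(data.table)

# Shortened, made-up version of the example data
DF <- data.frame(fruit = c("apples", "apples"),
                 state = c("Ohio", "Ohio"),
                 grade = c("aa", "bb"),
                 y1980 = c(500, 0),
                 stringsAsFactors = FALSE)
DT <- as.data.table(DF)

# Returns a new data.table without 'grade'; DT itself is unchanged
DT2 <- DT[, !c("grade"), with = FALSE]

# Removes the columns by reference; copy() protects the original DT
cols <- c("grade")
DT3 <- copy(DT)
DT3[, (cols) := NULL]

names(DT2)
names(DT3)
```

The `:=` form avoids copying the table, which can matter for large data, but it modifies the object it is applied to.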

User · Answer

Do not use -which() for this, it is extremely dangerous. Consider:

    dat <- data.frame(x=1:5, y=2:6, z=3:7, u=4:8)
    dat[ , -which(names(dat) %in% c("z","u"))]     ## works as expected
    dat[ , -which(names(dat) %in% c("foo","bar"))] ## deletes all columns! Probably not what you wanted...

Instead use subset or the ! function:

    dat[ , !names(dat) %in% c("z","u")]     ## works as expected
    dat[ , !names(dat) %in% c("foo","bar")] ## returns the un-altered data.frame. Probably what you want

I have learned this from painful experience. Do not overuse which()!
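The safe pattern from the answer above can be wrapped in a small function so the logic lives in one place. This is only a sketch: `drop_cols` and its `strict` argument are hypothetical names invented here, not part of base R or of the answer.

```r
# Hypothetical helper: drop columns by name with logical indexing.
# With strict = TRUE it stops when a requested column does not exist;
# with strict = FALSE (the default) unknown names are silently ignored.
drop_cols <- function(df, drop, strict = FALSE) {
  missing_cols <- setdiff(drop, names(df))
  if (strict && length(missing_cols) > 0) {
    stop("columns not found: ", paste(missing_cols, collapse = ", "))
  }
  df[, !(names(df) %in% drop), drop = FALSE]
}

dat <- data.frame(x = 1:5, y = 2:6, z = 3:7, u = 4:8)

names(drop_cols(dat, c("z", "u")))       # z and u removed
names(drop_cols(dat, c("foo", "bar")))   # nothing matched, all columns kept
```

Whether a misspelled column name should be ignored or treated as an error depends on the use case, which is why the sketch leaves that choice to the caller.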

User · Answer

You can also try the dplyr package:

    R> df <- data.frame(x=1:5, y=2:6, z=3:7, u=4:8)
    R> df
      x y z u
    1 1 2 3 4
    2 2 3 4 5
    3 3 4 5 6
    4 4 5 6 7
    5 5 6 7 8
    R> library(dplyr)
    R> dplyr::select(df, -c(x, y))  # remove columns x and y
      z u
    1 3 4
    2 4 5
    3 5 6
    4 6 7
    5 7 8
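When the columns to drop are held in a character vector, as in the original question, dplyr's selection helpers can be used instead of bare names. The sketch below assumes dplyr 1.0.0 or later, where `any_of()` and `all_of()` are available; `drop_vec` and `res` are illustrative names.

```r
library(dplyr)

df <- data.frame(x = 1:5, y = 2:6, z = 3:7, u = 4:8)

# Names to drop; "foo" does not exist in df
drop_vec <- c("x", "y", "foo")

# any_of() ignores names that are not present
res <- select(df, -any_of(drop_vec))
names(res)

# all_of() is the strict variant: it raises an error for the missing "foo"
# select(df, -all_of(drop_vec))
```

Choosing between the two helpers is a choice between tolerating and rejecting misspelled or absent column names.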

User · Answer

    df2 <- df[!names(df) %in% c("c1", "c2")]
