Select the first row by group

Question

From a data frame like this:

    test <- data.frame(id = rep(1:5, 2), string = LETTERS[1:10])
    test <- test[order(test$id), ]
    rownames(test) <- 1:10

    > test
       id string
    1   1      A
    2   1      F
    3   2      B
    4   2      G
    5   3      C
    6   3      H
    7   4      D
    8   4      I
    9   5      E
    10  5      J

I want to create a new one with the first row of each id / string pair. If sqldf accepted R code within it, the query could look like this:

    res <- sqldf("select id, min(rownames(test)), string
                  from test
                  group by id, string")

    > res
      id string
    1  1      A
    3  2      B
    5  3      C
    7  4      D
    9  5      E

Is there a solution short of creating a new column like

    test$row <- rownames(test)

and running the same sqldf query with min(row)?

User · Answer

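I favor the dplyr approach: group_by(id) followed by one of

- filter(row_number() == 1)
- slice(1)
- slice_head(n = 1) (dplyr >= 1.0; n must be named)
- top_n(n = -1)

top_n() internally uses the rank function, and a negative n selects from the bottom of the rank. Because it ranks by a column's values (by default the last column) rather than by row position, it agrees with the other methods only when the first row of each group also holds the smallest value, as it does here.

In some instances it can be necessary to arrange the rows within each id after the group_by.

    library(dplyr)

    # using filter(), slice() or top_n()

    m1 <- test %>%
      group_by(id) %>%
      filter(row_number() == 1)

    m2 <- test %>%
      group_by(id) %>%
      slice(1)

    m3 <- test %>%
      group_by(id) %>%
      top_n(n = -1)

All three methods return the same result:

    # A tibble: 5 x 2
    # Groups:   id [5]
         id string
      <int> <fct>
    1     1 A
    2     2 B
    3     3 C
    4     4 D
    5     5 E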
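The remark about arranging within groups can be sketched as follows. This is a minimal illustration, not part of the original answer: the choice of "alphabetically last string" as the ordering and the name last_by_string are assumptions made for the example.

```r
library(dplyr)

# Same columns as the question's `test`, deliberately left unsorted.
test <- data.frame(id = rep(1:5, 2), string = LETTERS[1:10])

# Assumed goal for illustration: per id, keep the row whose string sorts last.
# arrange(..., .by_group = TRUE) orders rows inside each group before slice(1).
last_by_string <- test %>%
  group_by(id) %>%
  arrange(desc(string), .by_group = TRUE) %>%
  slice(1) %>%
  ungroup()

print(last_by_string)
```

Without the arrange() step, slice(1) simply returns whichever row happens to come first in each group, so the ordering has to be made explicit whenever "first" means something other than physical row order.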

User · Answer

A simple ddply option:

    ddply(test, .(id), function(x) head(x, 1))

If speed is an issue, a similar approach could be taken with data.table:

    testd <- data.table(test)
    setkey(testd, id)
    testd[, .SD[1], by = key(testd)]

or this, which might be considerably faster:

    testd[testd[, .I[1], by = key(testd)]$V1]
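A self-contained sketch of the two data.table forms above, run on the question's data. The names first_sd, first_i and second_i are made up for the example; the second-row variant is an assumed extension showing that the .I form is not limited to the first row.

```r
library(data.table)

test <- data.frame(id = rep(1:5, 2), string = LETTERS[1:10])
testd <- as.data.table(test)
setkey(testd, id)  # sorts by id, keeping the original order within each id

# Form 1: subset .SD (the per-group sub-table) to its first row.
first_sd <- testd[, .SD[1], by = id]

# Form 2: collect the global row index of each group's first row, then subset once.
idx <- testd[, .I[1], by = id]$V1
first_i <- testd[idx]

# Assumed extension: the same pattern picks the second row of each group.
second_i <- testd[testd[, .I[2], by = id]$V1]

print(first_i)
```

The .I form avoids materialising .SD for every group, which is the likely reason it is described as faster.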

User · Answer

What about

    DT <- data.table(test)
    setkey(DT, id)

    DT[J(unique(id)), mult = "first"]

Edit

There is also a unique method for data.tables which will return the first row by key:

    jdtu <- function() unique(DT, by = "id")

(In data.table 1.9.8 and later, unique() considers all columns by default, so by is passed explicitly here; older versions defaulted to the key.)

I think that if you are ordering test outside the benchmark, then you can remove the setkey and data.table conversion from the benchmark as well, since setkey basically sorts by id, the same as order.

    set.seed(21)
    test <- data.frame(id = sample(1e3, 1e5, TRUE), string = sample(LETTERS, 1e5, TRUE))
    test <- test[order(test$id), ]
    DT <- data.table(test, key = "id")

    ju <- function() test[!duplicated(test$id), ]
    jdt <- function() DT[J(unique(id)), mult = "first"]

    library(rbenchmark)
    benchmark(ju(), jdt(), replications = 5)

      test replications elapsed relative user.self sys.self
    2  jdt            5    0.01        1      0.02        0
    1   ju            5    0.05        5      0.05        0

and with more data (edited to include the unique method):

    set.seed(21)
    test <- data.frame(id = sample(1e4, 1e6, TRUE), string = sample(LETTERS, 1e6, TRUE))
    test <- test[order(test$id), ]
    DT <- data.table(test, key = "id")

      test replications elapsed relative user.self sys.self
    2  jdt            5    0.09     2.25      0.09     0.00
    3 jdtu            5    0.04     1.00      0.05     0.00
    1   ju            5    0.22     5.50      0.19     0.03

The unique method is fastest here.

User · Answer

You can use duplicated to do this very quickly.

    test[!duplicated(test$id), ]

Benchmarks, for the speed freaks:

    ju <- function() test[!duplicated(test$id), ]
    gs1 <- function() do.call(rbind, lapply(split(test, test$id), head, 1))
    gs2 <- function() do.call(rbind, lapply(split(test, test$id), `[`, 1, ))
    jply <- function() ddply(test, .(id), function(x) head(x, 1))
    jdt <- function() {
      testd <- as.data.table(test)
      setkey(testd, id)
      # Initial solution (slow)
      # testd[, lapply(.SD, function(x) head(x, 1)), by = key(testd)]
      # Faster options:
      testd[!duplicated(id)]                     # (1)
      # testd[, .SD[1L], by = key(testd)]        # (2)
      # testd[J(unique(id)), mult = "first"]     # (3)
      # testd[testd[, .I[1L], by = id]$V1]       # (4) needs v1.8.3. Allows 2nd, 3rd etc.
    }

    library(plyr)
    library(data.table)
    library(rbenchmark)

    # sample data
    set.seed(21)
    test <- data.frame(id = sample(1e3, 1e5, TRUE), string = sample(LETTERS, 1e5, TRUE))
    test <- test[order(test$id), ]

    benchmark(ju(), gs1(), gs2(), jply(), jdt(),
              replications = 5, order = "relative")[, 1:6]

      test replications elapsed relative user.self sys.self
    1   ju            5    0.03    1.000      0.03     0.00
    5  jdt            5    0.03    1.000      0.03     0.00
    3  gs2            5    3.49  116.333      2.87     0.58
    2  gs1            5    3.58  119.333      3.00     0.58
    4 jply            5    3.69  123.000      3.11     0.51

Let's try that again, but with just the contenders from the first heat and with more data and more replications.

    set.seed(21)
    test <- data.frame(id = sample(1e4, 1e6, TRUE), string = sample(LETTERS, 1e6, TRUE))
    test <- test[order(test$id), ]
    benchmark(ju(), jdt(), order = "relative")[, 1:6]

      test replications elapsed relative user.self sys.self
    1   ju          100    5.48    1.000      4.44     1.00
    2  jdt          100    6.92    1.263      5.70     1.15
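A small base R sketch of the duplicated() approach on the question's data. The pair-based and last-row variants are assumed extensions added for illustration, and the result names are made up here.

```r
test <- data.frame(id = rep(1:5, 2), string = LETTERS[1:10])
test <- test[order(test$id), ]
rownames(test) <- 1:10

# First row per id: keep rows whose id has not been seen before.
first_by_id <- test[!duplicated(test$id), ]

# Assumed extension: first row per (id, string) pair; duplicated() also
# accepts a data frame and compares whole rows.
first_by_pair <- test[!duplicated(test[c("id", "string")]), ]

# Assumed extension: last row per id, scanning from the end.
last_by_id <- test[!duplicated(test$id, fromLast = TRUE), ]

print(first_by_id)
```

In this data every id / string pair is unique, so first_by_pair keeps all ten rows, whereas grouping by id alone keeps five and preserves the original row names.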

User · Answer

A base R option is the split()-lapply()-do.call() idiom:

    > do.call(rbind, lapply(split(test, test$id), head, 1))
      id string
    1  1      A
    2  2      B
    3  3      C
    4  4      D
    5  5      E

A more direct option is to lapply() the [ function:

    > do.call(rbind, lapply(split(test, test$id), `[`, 1, ))
      id string
    1  1      A
    2  2      B
    3  3      C
    4  4      D
    5  5      E

The trailing "1, )" at the end of the lapply() call is essential: the empty argument after the comma makes the call equivalent to [1, ], which selects the first row and all columns.
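The equivalence between the functional form of [ and ordinary bracket indexing can be checked directly. This is a minimal sketch; the variable names are made up, and the anonymous-function spelling is shown only as a more explicit alternative.

```r
test <- data.frame(id = rep(1:5, 2), string = LETTERS[1:10])
test <- test[order(test$id), ]

pieces <- split(test, test$id)   # one data frame per id
x <- pieces[["1"]]

# Calling `[` as a function with an empty third argument is the same as x[1, ].
via_function <- `[`(x, 1, )
via_brackets <- x[1, ]
same <- identical(via_function, via_brackets)

# A more explicit spelling of the same idea with an anonymous function.
first_rows <- do.call(rbind, lapply(pieces, function(p) p[1, , drop = FALSE]))

print(same)
print(first_rows)
```

The drop = FALSE in the explicit version is a precaution for data frames with a single column, where p[1, ] would otherwise collapse to a vector.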

User · Answer

(1) SQLite has a built-in rowid pseudo-column, so this works:

    sqldf("select min(rowid) rowid, id, string
           from test
           group by id")

giving:

      rowid id string
    1     1  1      A
    2     3  2      B
    3     5  3      C
    4     7  4      D
    5     9  5      E

(2) Also, sqldf itself has a row.names= argument:

    sqldf("select min(cast(row_names as real)) row_names, id, string
           from test
           group by id", row.names = TRUE)

giving:

      id string
    1  1      A
    3  2      B
    5  3      C
    7  4      D
    9  5      E

(3) A third alternative, which mixes the elements of the above two, might be even better:

    sqldf("select min(rowid) row_names, id, string
           from test
           group by id", row.names = TRUE)

giving:

      id string
    1  1      A
    3  2      B
    5  3      C
    7  4      D
    9  5      E

Note that all three of these rely on a SQLite extension to SQL where the use of min or max is guaranteed to result in the other columns being chosen from the same row. (In other SQL-based databases that may not be guaranteed.)
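For comparison, here is a sketch of a query that does not depend on the min/max bare-column behaviour noted above: it selects rows by membership in a subquery instead. This variant is not from the original answer, and it still relies on SQLite's rowid, so it is only partly more portable.

```r
library(sqldf)

test <- data.frame(id = rep(1:5, 2), string = LETTERS[1:10])
test <- test[order(test$id), ]
rownames(test) <- 1:10

# The subquery finds the smallest rowid per id; the outer query returns
# exactly those rows, so every column comes from the same row by construction.
res <- sqldf("select id, string
              from test
              where rowid in (select min(rowid) from test group by id)
              order by id")

print(res)
```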

User · Answer

Now, for dplyr, adding a distinct counter:

    df %>%
      group_by(aa, bb) %>%
      summarise(first = head(value, 1), count = n_distinct(value))

You create groups, then summarise within groups.

You can also use first(value) (there is also last(value)) in place of head(value, 1).

See: http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html

Full:

    > df
    Source: local data frame [16 x 3]

       aa bb value
    1   1  1   GUT
    2   1  1   PER
    3   1  2   SUT
    4   1  2   GUT
    5   1  3   SUT
    6   1  3   GUT
    7   1  3   PER
    8   2  1   221
    9   2  1   224
    10  2  1   239
    11  2  2   217
    12  2  2   221
    13  2  2   224
    14  3  1   GUT
    15  3  1   HUL
    16  3  1   GUT

    > library(dplyr)
    > df %>%
    +   group_by(aa, bb) %>%
    +   summarise(first = head(value, 1), count = n_distinct(value))

    Source: local data frame [6 x 4]
    Groups: aa

      aa bb first count
    1  1  1   GUT     2
    2  1  2   SUT     2
    3  1  3   SUT     3
    4  2  1   221     3
    5  2  2   217     3
    6  3  1   GUT     2
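A runnable version of the summary above, with the data frame rebuilt from the printed values. The use of first() and of .groups = "drop" (available in dplyr 1.0 and later) are choices made for this sketch rather than part of the original answer.

```r
library(dplyr)

# Data reconstructed from the printed df in the answer.
df <- data.frame(
  aa = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3),
  bb = c(1, 1, 2, 2, 3, 3, 3, 1, 1, 1, 2, 2, 2, 1, 1, 1),
  value = c("GUT", "PER", "SUT", "GUT", "SUT", "GUT", "PER",
            "221", "224", "239", "217", "221", "224",
            "GUT", "HUL", "GUT"),
  stringsAsFactors = FALSE
)

# One row per (aa, bb): the first value seen and the number of distinct values.
res <- df %>%
  group_by(aa, bb) %>%
  summarise(first = first(value), count = n_distinct(value), .groups = "drop")

print(res)
```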
