[r] Select the first row by group

From a dataframe like this

test <- data.frame('id'= rep(1:5,2), 'string'= LETTERS[1:10])
test <- test[order(test$id), ]
rownames(test) <- 1:10

> test
    id string
 1   1      A
 2   1      F
 3   2      B
 4   2      G
 5   3      C
 6   3      H
 7   4      D
 8   4      I
 9   5      E
 10  5      J

I want to create a new one with the first row of each id / string pair. If sqldf accepted R code within it, the query could look like this:

res <- sqldf("select id, min(rownames(test)), string 
              from test 
              group by id, string")

> res
    id string
 1   1      A
 3   2      B
 5   3      C
 7   4      D
 9   5      E

Is there a solution short of creating a new column like

test$row <- rownames(test)

and running the same sqldf query with min(row)?

This question is related to r dataframe sqldf

The answer is


now, for dplyr, adding a distinct counter.

df %>%
    group_by(aa, bb) %>%
    summarise(first=head(value,1), count=n_distinct(value))

You create groups, them summarise within groups.

If data is numeric, you can use:
first(value) [there is also last(value)] in place of head(value, 1)

see: http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html

Full:

> df
Source: local data frame [16 x 3]

   aa bb value
1   1  1   GUT
2   1  1   PER
3   1  2   SUT
4   1  2   GUT
5   1  3   SUT
6   1  3   GUT
7   1  3   PER
8   2  1   221
9   2  1   224
10  2  1   239
11  2  2   217
12  2  2   221
13  2  2   224
14  3  1   GUT
15  3  1   HUL
16  3  1   GUT

> library(dplyr)
> df %>%
>   group_by(aa, bb) %>%
>   summarise(first=head(value,1), count=n_distinct(value))

Source: local data frame [6 x 4]
Groups: aa

  aa bb first count
1  1  1   GUT     2
2  1  2   SUT     2
3  1  3   SUT     3
4  2  1   221     3
5  2  2   217     3
6  3  1   GUT     2

A base R option is the split()-lapply()-do.call() idiom:

> do.call(rbind, lapply(split(test, test$id), head, 1))
  id string
1  1      A
2  2      B
3  3      C
4  4      D
5  5      E

A more direct option is to lapply() the [ function:

> do.call(rbind, lapply(split(test, test$id), `[`, 1, ))
  id string
1  1      A
2  2      B
3  3      C
4  4      D
5  5      E

The comma-space 1, ) at the end of the lapply() call is essential as this is equivalent of calling [1, ] to select first row and all columns.


A simple ddply option:

ddply(test,.(id),function(x) head(x,1))

If speed is an issue, a similar approach could be taken with data.table:

testd <- data.table(test)
setkey(testd,id)
testd[,.SD[1],by = key(testd)]

or this might be considerably faster:

testd[testd[, .I[1], by = key(testd]$V1]

I favor the dplyr approach.

group_by(id) followed by either

  • filter(row_number()==1) or
  • slice(1) or
  • slice_head(1) #(dplyr => 1.0)
  • top_n(n = -1)
    • top_n() internally uses the rank function. Negative selects from the bottom of rank.

In some instances arranging the ids after the group_by can be necessary.

library(dplyr)

# using filter(), top_n() or slice()

m1 <-
test %>% 
  group_by(id) %>% 
  filter(row_number()==1)

m2 <-
test %>% 
  group_by(id) %>% 
  slice(1)

m3 <-
test %>% 
  group_by(id) %>% 
  top_n(n = -1)

All three methods return the same result

# A tibble: 5 x 2
# Groups:   id [5]
     id string
  <int> <fct> 
1     1 A     
2     2 B     
3     3 C     
4     4 D     
5     5 E

What about

DT <- data.table(test)
setkey(DT, id)

DT[J(unique(id)), mult = "first"]

Edit

There is also a unique method for data.tables which will return the the first row by key

jdtu <- function() unique(DT)

I think, if you are ordering test outside the benchmark, then you can removing the setkey and data.table conversion from the benchmark as well (as the setkey basically sorts by id, the same as order).

set.seed(21)
test <- data.frame(id=sample(1e3, 1e5, TRUE), string=sample(LETTERS, 1e5, TRUE))
test <- test[order(test$id), ]
DT <- data.table(DT, key = 'id')
ju <- function() test[!duplicated(test$id),]

jdt <- function() DT[J(unique(id)),mult = 'first']


 library(rbenchmark)
benchmark(ju(), jdt(), replications = 5)
##    test replications elapsed relative user.self sys.self 
## 2 jdt()            5    0.01        1      0.02        0        
## 1  ju()            5    0.05        5      0.05        0         

and with more data

** Edit with unique method**

set.seed(21)
test <- data.frame(id=sample(1e4, 1e6, TRUE), string=sample(LETTERS, 1e6, TRUE))
test <- test[order(test$id), ]
DT <- data.table(test, key = 'id')
       test replications elapsed relative user.self sys.self 
2  jdt()            5    0.09     2.25      0.09     0.00    
3 jdtu()            5    0.04     1.00      0.05     0.00      
1   ju()            5    0.22     5.50      0.19     0.03        

The unique method is fastest here.


(1) SQLite has a built in rowid pseudo-column so this works:

sqldf("select min(rowid) rowid, id, string 
               from test 
               group by id")

giving:

  rowid id string
1     1  1      A
2     3  2      B
3     5  3      C
4     7  4      D
5     9  5      E

(2) Also sqldf itself has a row.names= argument:

sqldf("select min(cast(row_names as real)) row_names, id, string 
              from test 
              group by id", row.names = TRUE)

giving:

  id string
1  1      A
3  2      B
5  3      C
7  4      D
9  5      E

(3) A third alternative which mixes the elements of the above two might be even better:

sqldf("select min(rowid) row_names, id, string 
               from test 
               group by id", row.names = TRUE)

giving:

  id string
1  1      A
3  2      B
5  3      C
7  4      D
9  5      E

Note that all three of these rely on a SQLite extension to SQL where the use of min or max is guaranteed to result in the other columns being chosen from the same row. (In other SQL-based databases that may not be guaranteed.)