From a dataframe like this
test <- data.frame('id'= rep(1:5,2), 'string'= LETTERS[1:10])
test <- test[order(test$id), ]
rownames(test) <- 1:10
> test
id string
1 1 A
2 1 F
3 2 B
4 2 G
5 3 C
6 3 H
7 4 D
8 4 I
9 5 E
10 5 J
I want to create a new one with the first row of each id / string pair. If sqldf accepted R code within it, the query could look like this:
res <- sqldf("select id, min(rownames(test)), string
from test
group by id, string")
> res
id string
1 1 A
3 2 B
5 3 C
7 4 D
9 5 E
Is there a solution short of creating a new column like
test$row <- rownames(test)
and running the same sqldf query with min(row)?
now, for dplyr
, adding a distinct counter.
df %>%
group_by(aa, bb) %>%
summarise(first=head(value,1), count=n_distinct(value))
You create groups, them summarise within groups.
If data is numeric, you can use:
first(value)
[there is also last(value)
] in place of head(value, 1)
see: http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html
Full:
> df
Source: local data frame [16 x 3]
aa bb value
1 1 1 GUT
2 1 1 PER
3 1 2 SUT
4 1 2 GUT
5 1 3 SUT
6 1 3 GUT
7 1 3 PER
8 2 1 221
9 2 1 224
10 2 1 239
11 2 2 217
12 2 2 221
13 2 2 224
14 3 1 GUT
15 3 1 HUL
16 3 1 GUT
> library(dplyr)
> df %>%
> group_by(aa, bb) %>%
> summarise(first=head(value,1), count=n_distinct(value))
Source: local data frame [6 x 4]
Groups: aa
aa bb first count
1 1 1 GUT 2
2 1 2 SUT 2
3 1 3 SUT 3
4 2 1 221 3
5 2 2 217 3
6 3 1 GUT 2
A base R option is the split()
-lapply()
-do.call()
idiom:
> do.call(rbind, lapply(split(test, test$id), head, 1))
id string
1 1 A
2 2 B
3 3 C
4 4 D
5 5 E
A more direct option is to lapply()
the [
function:
> do.call(rbind, lapply(split(test, test$id), `[`, 1, ))
id string
1 1 A
2 2 B
3 3 C
4 4 D
5 5 E
The comma-space 1, )
at the end of the lapply()
call is essential as this is equivalent of calling [1, ]
to select first row and all columns.
A simple ddply
option:
ddply(test,.(id),function(x) head(x,1))
If speed is an issue, a similar approach could be taken with data.table
:
testd <- data.table(test)
setkey(testd,id)
testd[,.SD[1],by = key(testd)]
or this might be considerably faster:
testd[testd[, .I[1], by = key(testd]$V1]
I favor the dplyr approach.
group_by(id)
followed by either
filter(row_number()==1)
orslice(1)
orslice_head(1)
#(dplyr => 1.0)top_n(n = -1)
top_n()
internally uses the rank function.
Negative selects from the bottom of rank.In some instances arranging the ids after the group_by can be necessary.
library(dplyr)
# using filter(), top_n() or slice()
m1 <-
test %>%
group_by(id) %>%
filter(row_number()==1)
m2 <-
test %>%
group_by(id) %>%
slice(1)
m3 <-
test %>%
group_by(id) %>%
top_n(n = -1)
All three methods return the same result
# A tibble: 5 x 2
# Groups: id [5]
id string
<int> <fct>
1 1 A
2 2 B
3 3 C
4 4 D
5 5 E
What about
DT <- data.table(test)
setkey(DT, id)
DT[J(unique(id)), mult = "first"]
There is also a unique method for data.tables
which will return the the first row by key
jdtu <- function() unique(DT)
I think, if you are ordering test
outside the benchmark, then you can removing the setkey
and data.table
conversion from the benchmark as well (as the setkey basically sorts by id, the same as order
).
set.seed(21)
test <- data.frame(id=sample(1e3, 1e5, TRUE), string=sample(LETTERS, 1e5, TRUE))
test <- test[order(test$id), ]
DT <- data.table(DT, key = 'id')
ju <- function() test[!duplicated(test$id),]
jdt <- function() DT[J(unique(id)),mult = 'first']
library(rbenchmark)
benchmark(ju(), jdt(), replications = 5)
## test replications elapsed relative user.self sys.self
## 2 jdt() 5 0.01 1 0.02 0
## 1 ju() 5 0.05 5 0.05 0
and with more data
** Edit with unique method**
set.seed(21)
test <- data.frame(id=sample(1e4, 1e6, TRUE), string=sample(LETTERS, 1e6, TRUE))
test <- test[order(test$id), ]
DT <- data.table(test, key = 'id')
test replications elapsed relative user.self sys.self
2 jdt() 5 0.09 2.25 0.09 0.00
3 jdtu() 5 0.04 1.00 0.05 0.00
1 ju() 5 0.22 5.50 0.19 0.03
The unique method is fastest here.
(1) SQLite has a built in rowid
pseudo-column so this works:
sqldf("select min(rowid) rowid, id, string
from test
group by id")
giving:
rowid id string
1 1 1 A
2 3 2 B
3 5 3 C
4 7 4 D
5 9 5 E
(2) Also sqldf
itself has a row.names=
argument:
sqldf("select min(cast(row_names as real)) row_names, id, string
from test
group by id", row.names = TRUE)
giving:
id string
1 1 A
3 2 B
5 3 C
7 4 D
9 5 E
(3) A third alternative which mixes the elements of the above two might be even better:
sqldf("select min(rowid) row_names, id, string
from test
group by id", row.names = TRUE)
giving:
id string
1 1 A
3 2 B
5 3 C
7 4 D
9 5 E
Note that all three of these rely on a SQLite extension to SQL where the use of min
or max
is guaranteed to result in the other columns being chosen from the same row. (In other SQL-based databases that may not be guaranteed.)
Source: Stackoverflow.com