I have a roster of employees, and I need to know at what department they are in most often. It is trivial to tabulate employee ID against department name, but it is trickier to return the department name, rather than the number of roster counts, from the frequency table. A simple example below (column names = departments, row names = employee ids).
DF <- matrix(sample(1:9,9),ncol=3,nrow=3)
DF <- as.data.frame.matrix(DF)
> DF
V1 V2 V3
1 2 7 9
2 8 3 6
3 1 5 4
Now how do I get
> DF2
RE
1 V3
2 V1
3 V2
This question is related to
r
One option using your data (for future reference, use set.seed()
to make examples using sample
reproducible):
DF <- data.frame(V1=c(2,8,1),V2=c(7,3,5),V3=c(9,6,4))
colnames(DF)[apply(DF,1,which.max)]
[1] "V3" "V1" "V2"
A faster solution than using apply
might be max.col
:
colnames(DF)[max.col(DF,ties.method="first")]
#[1] "V3" "V1" "V2"
...where ties.method
can be any of "random"
"first"
or "last"
This of course causes issues if you happen to have two columns which are equal to the maximum. I'm not sure what you want to do in that instance as you will have more than one result for some rows. E.g.:
DF <- data.frame(V1=c(2,8,1),V2=c(7,3,5),V3=c(7,6,4))
apply(DF,1,function(x) which(x==max(x)))
[[1]]
V2 V3
2 3
[[2]]
V1
1
[[3]]
V2
2
Here is an answer that works with data.table and is simpler. This assumes your data.table is named yourDF
:
j1 <- max.col(yourDF[, .(V1, V2, V3, V4)], "first")
yourDF$newCol <- c("V1", "V2", "V3", "V4")[j1]
Replace ("V1", "V2", "V3", "V4")
and (V1, V2, V3, V4)
with your column names
If you're interested in a data.table
solution, here's one. It's a bit tricky since you prefer to get the id for the first maximum. It's much easier if you'd rather want the last maximum. Nevertheless, it's not that complicated and it's fast!
Here I've generated data of your dimensions (26746 * 18).
set.seed(45)
DF <- data.frame(matrix(sample(10, 26746*18, TRUE), ncol=18))
data.table
answer:require(data.table)
DT <- data.table(value=unlist(DF, use.names=FALSE),
colid = 1:nrow(DF), rowid = rep(names(DF), each=nrow(DF)))
setkey(DT, colid, value)
t1 <- DT[J(unique(colid), DT[J(unique(colid)), value, mult="last"]), rowid, mult="first"]
# data.table solution
system.time({
DT <- data.table(value=unlist(DF, use.names=FALSE),
colid = 1:nrow(DF), rowid = rep(names(DF), each=nrow(DF)))
setkey(DT, colid, value)
t1 <- DT[J(unique(colid), DT[J(unique(colid)), value, mult="last"]), rowid, mult="first"]
})
# user system elapsed
# 0.174 0.029 0.227
# apply solution from @thelatemail
system.time(t2 <- colnames(DF)[apply(DF,1,which.max)])
# user system elapsed
# 2.322 0.036 2.602
identical(t1, t2)
# [1] TRUE
It's about 11 times faster on data of these dimensions, and data.table
scales pretty well too.
DT <- data.table(value=unlist(DF, use.names=FALSE),
colid = 1:nrow(DF), rowid = rep(names(DF), each=nrow(DF)))
setkey(DT, colid, value)
t1 <- DT[J(unique(colid)), rowid, mult="last"]
dplyr
solution:Idea:
Code:
DF = data.frame(V1=c(2,8,1),V2=c(7,3,5),V3=c(9,6,4))
DF %>%
rownames_to_column() %>%
gather(column, value, -rowname) %>%
group_by(rowname) %>%
filter(rank(-value) == 1)
Result:
# A tibble: 3 x 3
# Groups: rowname [3]
rowname column value
<chr> <chr> <dbl>
1 2 V1 8
2 3 V2 5
3 1 V3 9
This approach can be easily extended to get the top n
columns.
Example for n=2
:
DF %>%
rownames_to_column() %>%
gather(column, value, -rowname) %>%
group_by(rowname) %>%
mutate(rk = rank(-value)) %>%
filter(rk <= 2) %>%
arrange(rowname, rk)
Result:
# A tibble: 6 x 4
# Groups: rowname [3]
rowname column value rk
<chr> <chr> <dbl> <dbl>
1 1 V3 9 1
2 1 V2 7 2
3 2 V1 8 1
4 2 V3 6 2
5 3 V2 5 1
6 3 V3 4 2
One solution could be to reshape the date from wide to long putting all the departments in one column and counts in another, group by the employer id (in this case, the row number), and then filter to the department(s) with the max value. There are a couple of options for handling ties with this approach too.
library(tidyverse)
# sample data frame with a tie
df <- data_frame(V1=c(2,8,1),V2=c(7,3,5),V3=c(9,6,5))
# If you aren't worried about ties:
df %>%
rownames_to_column('id') %>% # creates an ID number
gather(dept, cnt, V1:V3) %>%
group_by(id) %>%
slice(which.max(cnt))
# A tibble: 3 x 3
# Groups: id [3]
id dept cnt
<chr> <chr> <dbl>
1 1 V3 9.
2 2 V1 8.
3 3 V2 5.
# If you're worried about keeping ties:
df %>%
rownames_to_column('id') %>%
gather(dept, cnt, V1:V3) %>%
group_by(id) %>%
filter(cnt == max(cnt)) %>% # top_n(cnt, n = 1) also works
arrange(id)
# A tibble: 4 x 3
# Groups: id [3]
id dept cnt
<chr> <chr> <dbl>
1 1 V3 9.
2 2 V1 8.
3 3 V2 5.
4 3 V3 5.
# If you're worried about ties, but only want a certain department, you could use rank() and choose 'first' or 'last'
df %>%
rownames_to_column('id') %>%
gather(dept, cnt, V1:V3) %>%
group_by(id) %>%
mutate(dept_rank = rank(-cnt, ties.method = "first")) %>% # or 'last'
filter(dept_rank == 1) %>%
select(-dept_rank)
# A tibble: 3 x 3
# Groups: id [3]
id dept cnt
<chr> <chr> <dbl>
1 2 V1 8.
2 3 V2 5.
3 1 V3 9.
# if you wanted to keep the original wide data frame
df %>%
rownames_to_column('id') %>%
left_join(
df %>%
rownames_to_column('id') %>%
gather(max_dept, max_cnt, V1:V3) %>%
group_by(id) %>%
slice(which.max(max_cnt)),
by = 'id'
)
# A tibble: 3 x 6
id V1 V2 V3 max_dept max_cnt
<chr> <dbl> <dbl> <dbl> <chr> <dbl>
1 1 2. 7. 9. V3 9.
2 2 8. 3. 6. V1 8.
3 3 1. 5. 5. V2 5.
Based on the above suggestions, the following data.table
solution worked very fast for me:
library(data.table)
set.seed(45)
DT <- data.table(matrix(sample(10, 10^7, TRUE), ncol=10))
system.time(
DT[, col_max := colnames(.SD)[max.col(.SD, ties.method = "first")]]
)
#> user system elapsed
#> 0.15 0.06 0.21
DT[]
#> V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 col_max
#> 1: 7 4 1 2 3 7 6 6 6 1 V1
#> 2: 4 6 9 10 6 2 7 7 1 3 V4
#> 3: 3 4 9 8 9 9 8 8 6 7 V3
#> 4: 4 8 8 9 7 5 9 2 7 1 V4
#> 5: 4 3 9 10 2 7 9 6 6 9 V4
#> ---
#> 999996: 4 6 10 5 4 7 3 8 2 8 V3
#> 999997: 8 7 6 6 3 10 2 3 10 1 V6
#> 999998: 2 3 2 7 4 7 5 2 7 3 V4
#> 999999: 8 10 3 2 3 4 5 1 1 4 V2
#> 1000000: 10 4 2 6 6 2 8 4 7 4 V1
And also comes with the advantage that can always specify what columns .SD
should consider by mentioning them in .SDcols
:
DT[, MAX2 := colnames(.SD)[max.col(.SD, ties.method="first")], .SDcols = c("V9", "V10")]
In case we need the column name of the smallest value, as suggested by @lwshang, one just needs to use -.SD
:
DT[, col_min := colnames(.SD)[max.col(-.SD, ties.method = "first")]]
A simple for
loop can also be handy:
> df<-data.frame(V1=c(2,8,1),V2=c(7,3,5),V3=c(9,6,4))
> df
V1 V2 V3
1 2 7 9
2 8 3 6
3 1 5 4
> df2<-data.frame()
> for (i in 1:nrow(df)){
+ df2[i,1]<-colnames(df[which.max(df[i,])])
+ }
> df2
V1
1 V3
2 V1
3 V2
One option from dplyr 1.0.0
could be:
DF %>%
rowwise() %>%
mutate(row_max = names(.)[which.max(c_across(everything()))])
V1 V2 V3 row_max
<dbl> <dbl> <dbl> <chr>
1 2 7 9 V3
2 8 3 6 V1
3 1 5 4 V2
Sample data:
DF <- structure(list(V1 = c(2, 8, 1), V2 = c(7, 3, 5), V3 = c(9, 6,
4)), class = "data.frame", row.names = c(NA, -3L))
Source: Stackoverflow.com