Say we have a table 'data' containing Strings in several columns. We want to find the indices of all rows that contain a certain value, or better yet, one of several values. The column, however, is unknown.
What I do, at the moment, is:
apply(df, 2, function(x) which(x == "M017"))
where df =
1 04.10.2009 01:24:51 M017 <NA> <NA> NA
2 04.10.2009 01:24:53 M018 <NA> <NA> NA
3 04.10.2009 01:24:54 M051 <NA> <NA> NA
4 04.10.2009 01:25:06 <NA> M016 <NA> NA
5 04.10.2009 01:25:07 <NA> M015 <NA> NA
6 04.10.2009 01:26:07 <NA> M017 <NA> NA
7 04.10.2009 01:26:27 <NA> M017 <NA> NA
8 04.10.2009 01:27:23 <NA> M017 <NA> NA
9 04.10.2009 01:27:30 <NA> M017 <NA> NA
10 04.10.2009 01:27:32 M017 <NA> <NA> NA
11 04.10.2009 01:27:34 M051 <NA> <NA> NA
This also works if we try to find more than one value:
apply(df, 2, function(x) which(x %in% c("M017", "M018")))
The result being:
$`1`
integer(0)
$`2`
[1] 1 2 20
$`3`
[1] 16 17 18 19
$`4`
integer(0)
$`5`
integer(0)
However, processing the resulting list of lists is rather tedious.
Is there a more efficient way to find rows that contain a value (or more) in ANY column?
This question is related to
r
If you want to find the rows
that have any of the values in a vector, one option is to loop the vector (lapply(v1,..)
), create a logical index of (TRUE/FALSE) with (==
). Use Reduce
and OR (|
) to reduce the list to a single logical matrix by checking the corresponding elements. Sum the rows (rowSums
), double negate (!!
) to get the rows with any matches.
indx1 <- !!rowSums(Reduce(`|`, lapply(v1, `==`, df)), na.rm=TRUE)
Or vectorise and get the row indices with which
with arr.ind=TRUE
indx2 <- unique(which(Vectorize(function(x) x %in% v1)(df),
arr.ind=TRUE)[,1])
I didn't use @kristang's solution as it is giving me errors. Based on a 1000x500
matrix, @konvas's solution is the most efficient (so far). But, this may vary if the number of rows are increased
val <- paste0('M0', 1:1000)
set.seed(24)
df1 <- as.data.frame(matrix(sample(c(val, NA), 1000*500,
replace=TRUE), ncol=500), stringsAsFactors=FALSE)
set.seed(356)
v1 <- sample(val, 200, replace=FALSE)
konvas <- function() {apply(df1, 1, function(r) any(r %in% v1))}
akrun1 <- function() {!!rowSums(Reduce(`|`, lapply(v1, `==`, df1)),
na.rm=TRUE)}
akrun2 <- function() {unique(which(Vectorize(function(x) x %in%
v1)(df1),arr.ind=TRUE)[,1])}
library(microbenchmark)
microbenchmark(konvas(), akrun1(), akrun2(), unit='relative', times=20L)
#Unit: relative
# expr min lq mean median uq max neval
# konvas() 1.00000 1.000000 1.000000 1.000000 1.000000 1.00000 20
# akrun1() 160.08749 147.642721 125.085200 134.491722 151.454441 52.22737 20
# akrun2() 5.85611 5.641451 4.676836 5.330067 5.269937 2.22255 20
# cld
# a
# b
# a
For ncol = 10
, the results are slighjtly different:
expr min lq mean median uq max neval
konvas() 3.116722 3.081584 2.90660 2.983618 2.998343 2.394908 20
akrun1() 27.587827 26.554422 22.91664 23.628950 21.892466 18.305376 20
akrun2() 1.000000 1.000000 1.00000 1.000000 1.000000 1.000000 20
v1 <- c('M017', 'M018')
df <- structure(list(datetime = c("04.10.2009 01:24:51",
"04.10.2009 01:24:53",
"04.10.2009 01:24:54", "04.10.2009 01:25:06", "04.10.2009 01:25:07",
"04.10.2009 01:26:07", "04.10.2009 01:26:27", "04.10.2009 01:27:23",
"04.10.2009 01:27:30", "04.10.2009 01:27:32", "04.10.2009 01:27:34"
), col1 = c("M017", "M018", "M051", "<NA>", "<NA>", "<NA>", "<NA>",
"<NA>", "<NA>", "M017", "M051"), col2 = c("<NA>", "<NA>", "<NA>",
"M016", "M015", "M017", "M017", "M017", "M017", "<NA>", "<NA>"
), col3 = c("<NA>", "<NA>", "<NA>", "<NA>", "<NA>", "<NA>", "<NA>",
"<NA>", "<NA>", "<NA>", "<NA>"), col4 = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA)), .Names = c("datetime", "col1", "col2",
"col3", "col4"), class = "data.frame", row.names = c("1", "2",
"3", "4", "5", "6", "7", "8", "9", "10", "11"))
Here's a dplyr
option:
library(dplyr)
# across all columns:
df %>% filter_all(any_vars(. %in% c('M017', 'M018')))
# or in only select columns:
df %>% filter_at(vars(col1, col2), any_vars(. %in% c('M017', 'M018')))
Source: Stackoverflow.com