[r] Filter rows which contain a certain string

I have to filter a data frame using as criterion those row in which is contained the string RTB.

I'm using dplyr.

d.del <- df %>%
  group_by(TrackingPixel) %>%
  summarise(MonthDelivery = as.integer(sum(Revenue))) %>%
  arrange(desc(MonthDelivery))

I know I can use the function filter in dplyr but I don't exactly how to tell it to check for the content of a string.

In particular I want to check the content in the column TrackingPixel. If the string contains the label RTB I want to remove the row from the result.

This question is related to r filter dplyr

The answer is


Solution

It is possible to use str_detect of the stringr package included in the tidyverse package. str_detect returns True or False as to whether the specified vector contains some specific string. It is possible to filter using this boolean value. See Introduction to stringr for details about stringr package.

library(tidyverse)
# - Attaching packages -------------------- tidyverse 1.2.1 -
# ? ggplot2 2.2.1     ? purrr   0.2.4
# ? tibble  1.4.2     ? dplyr   0.7.4
# ? tidyr   0.7.2     ? stringr 1.2.0
# ? readr   1.1.1     ? forcats 0.3.0
# - Conflicts --------------------- tidyverse_conflicts() -
# ? dplyr::filter() masks stats::filter()
# ? dplyr::lag()    masks stats::lag()

mtcars$type <- rownames(mtcars)
mtcars %>%
  filter(str_detect(type, 'Toyota|Mazda'))
# mpg cyl  disp  hp drat    wt  qsec vs am gear carb           type
# 1 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4      Mazda RX4
# 2 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4  Mazda RX4 Wag
# 3 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1 Toyota Corolla
# 4 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1  Toyota Corona

The good things about Stringr

We should use rather stringr::str_detect() than base::grepl(). This is because there are the following reasons.

  • The functions provided by the stringr package start with the prefix str_, which makes the code easier to read.
  • The first argument of the functions of stringr package is always the data.frame (or value), then comes the parameters.(Thank you Paolo)
object <- "stringr"
# The functions with the same prefix `str_`.
# The first argument is an object.
stringr::str_count(object) # -> 7
stringr::str_sub(object, 1, 3) # -> "str"
stringr::str_detect(object, "str") # -> TRUE
stringr::str_replace(object, "str", "") # -> "ingr"
# The function names without common points.
# The position of the argument of the object also does not match.
base::nchar(object) # -> 7
base::substr(object, 1, 3) # -> "str"
base::grepl("str", object) # -> TRUE
base::sub("str", "", object) # -> "ingr"

Benchmark

The results of the benchmark test are as follows. For large dataframe, str_detect is faster.

library(rbenchmark)
library(tidyverse)

# The data. Data expo 09. ASA Statistics Computing and Graphics 
# http://stat-computing.org/dataexpo/2009/the-data.html
df <- read_csv("Downloads/2008.csv")
print(dim(df))
# [1] 7009728      29

benchmark(
  "str_detect" = {df %>% filter(str_detect(Dest, 'MCO|BWI'))},
  "grepl" = {df %>% filter(grepl('MCO|BWI', Dest))},
  replications = 10,
  columns = c("test", "replications", "elapsed", "relative", "user.self", "sys.self"))
# test replications elapsed relative user.self sys.self
# 2      grepl           10  16.480    1.513    16.195    0.248
# 1 str_detect           10  10.891    1.000     9.594    1.281

This answer similar to others, but using preferred stringr::str_detect and dplyr rownames_to_column.

library(tidyverse)

mtcars %>% 
  rownames_to_column("type") %>% 
  filter(stringr::str_detect(type, 'Toyota|Mazda') )

#>             type  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> 1      Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#> 2  Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
#> 3 Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
#> 4  Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1

Created on 2018-06-26 by the reprex package (v0.2.0).


The answer to the question was already posted by the @latemail in the comments above. You can use regular expressions for the second and subsequent arguments of filter like this:

dplyr::filter(df, !grepl("RTB",TrackingPixel))

Since you have not provided the original data, I will add a toy example using the mtcars data set. Imagine you are only interested in cars produced by Mazda or Toyota.

mtcars$type <- rownames(mtcars)
dplyr::filter(mtcars, grepl('Toyota|Mazda', type))

   mpg cyl  disp  hp drat    wt  qsec vs am gear carb           type
1 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4      Mazda RX4
2 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4  Mazda RX4 Wag
3 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1 Toyota Corolla
4 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1  Toyota Corona

If you would like to do it the other way round, namely excluding Toyota and Mazda cars, the filter command looks like this:

dplyr::filter(mtcars, !grepl('Toyota|Mazda', type))

edit included the newer across() syntax

Here's another tidyverse solution, using filter(across()) or previously filter_at. The advantage is that you can easily extend to more than one column.

Below also a solution with filter_all in order to find the string in any column, using diamonds as example, looking for the string "V"

library(tidyverse)

String in only one column

# for only one column... extendable to more than one creating a column list in `across` or `vars`!
mtcars %>% 
  rownames_to_column("type") %>% 
  filter(across(type, ~ !grepl('Toyota|Mazda', .))) %>%
  head()
#>                type  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> 1        Datsun 710 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
#> 2    Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
#> 3 Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
#> 4           Valiant 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
#> 5        Duster 360 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
#> 6         Merc 240D 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2

The now superseded syntax for the same would be:

mtcars %>% 
  rownames_to_column("type") %>% 
  filter_at(.vars= vars(type), all_vars(!grepl('Toyota|Mazda',.))) 

String in all columns:

# remove all rows where any column contains 'V'
diamonds %>%
  filter(across(everything(), ~ !grepl('V', .))) %>%
  head
#> # A tibble: 6 x 10
#>   carat cut     color clarity depth table price     x     y     z
#>   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#> 1  0.23 Ideal   E     SI2      61.5    55   326  3.95  3.98  2.43
#> 2  0.21 Premium E     SI1      59.8    61   326  3.89  3.84  2.31
#> 3  0.31 Good    J     SI2      63.3    58   335  4.34  4.35  2.75
#> 4  0.3  Good    J     SI1      64      55   339  4.25  4.28  2.73
#> 5  0.22 Premium F     SI1      60.4    61   342  3.88  3.84  2.33
#> 6  0.31 Ideal   J     SI2      62.2    54   344  4.35  4.37  2.71

The now superseded syntax for the same would be:

diamonds %>% 
  filter_all(all_vars(!grepl('V', .))) %>%
  head

I tried to find an across alternative for the following, but I didn't immediately come up with a good solution:

    #get all rows where any column contains 'V'
    diamonds %>%
    filter_all(any_vars(grepl('V',.))) %>%
      head
    #> # A tibble: 6 x 10
    #>   carat cut       color clarity depth table price     x     y     z
    #>   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
    #> 1 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
    #> 2 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
    #> 3 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
    #> 4 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
    #> 5 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
    #> 6 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49

Update: Thanks to user Petr Kajzar in this answer, here also an approach for the above:

diamonds %>%
   filter(rowSums(across(everything(), ~grepl("V", .x))) > 0)

Examples related to r

How to get AIC from Conway–Maxwell-Poisson regression via COM-poisson package in R? R : how to simply repeat a command? session not created: This version of ChromeDriver only supports Chrome version 74 error with ChromeDriver Chrome using Selenium How to show code but hide output in RMarkdown? remove kernel on jupyter notebook Function to calculate R2 (R-squared) in R Center Plot title in ggplot2 R ggplot2: stat_count() must not be used with a y aesthetic error in Bar graph R multiple conditions in if statement What does "The following object is masked from 'package:xxx'" mean?

Examples related to filter

Monitoring the Full Disclosure mailinglist Pyspark: Filter dataframe based on multiple conditions How Spring Security Filter Chain works Copy filtered data to another sheet using VBA Filter object properties by key in ES6 How do I filter date range in DataTables? How do I filter an array with TypeScript in Angular 2? Filtering array of objects with lodash based on property value How to filter an array from all elements of another array How to specify "does not contain" in dplyr filter

Examples related to dplyr

R dplyr: Drop multiple columns How to specify "does not contain" in dplyr filter Select first and last row from grouped data Error: could not find function "%>%" Sum across multiple columns with dplyr Removing NA observations with dplyr::filter() Changing factor levels with dplyr mutate Change value of variable with dplyr dplyr change many data types What does %>% function mean in R?