[r] Selecting data frame rows based on partial string match in a column

I want to select rows from a data frame based on partial match of a string in a column, e.g. column 'x' contains the string "hsa". Using sqldf - if it had a like syntax - I would do something like:

select * from <> where x like 'hsa'.

Unfortunately, sqldf does not support that syntax.

Or similarly:

selectedRows <- df[ , df$x %like% "hsa-"]

Which of course doesn't work.

Can somebody please help me with this?

This question is related to r regex string match subset

The answer is


I notice that you mention a function %like% in your current approach. I don't know if that's a reference to the %like% from "data.table", but if it is, you can definitely use it as follows.

Note that the object does not have to be a data.table (but also remember that subsetting approaches for data.frames and data.tables are not identical):

library(data.table)
mtcars[rownames(mtcars) %like% "Merc", ]
iris[iris$Species %like% "osa", ]

If that is what you had, then perhaps you had just mixed up row and column positions for subsetting data.


If you don't want to load a package, you can try using grep() to search for the string you're matching. Here's an example with the mtcars dataset, where we are matching all rows where the row names includes "Merc":

mtcars[grep("Merc", rownames(mtcars)), ]
             mpg cyl  disp  hp drat   wt qsec vs am gear carb
# Merc 240D   24.4   4 146.7  62 3.69 3.19 20.0  1  0    4    2
# Merc 230    22.8   4 140.8  95 3.92 3.15 22.9  1  0    4    2
# Merc 280    19.2   6 167.6 123 3.92 3.44 18.3  1  0    4    4
# Merc 280C   17.8   6 167.6 123 3.92 3.44 18.9  1  0    4    4
# Merc 450SE  16.4   8 275.8 180 3.07 4.07 17.4  0  0    3    3
# Merc 450SL  17.3   8 275.8 180 3.07 3.73 17.6  0  0    3    3
# Merc 450SLC 15.2   8 275.8 180 3.07 3.78 18.0  0  0    3    3

And, another example, using the iris dataset searching for the string osa:

irisSubset <- iris[grep("osa", iris$Species), ]
head(irisSubset)
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1          5.1         3.5          1.4         0.2  setosa
# 2          4.9         3.0          1.4         0.2  setosa
# 3          4.7         3.2          1.3         0.2  setosa
# 4          4.6         3.1          1.5         0.2  setosa
# 5          5.0         3.6          1.4         0.2  setosa
# 6          5.4         3.9          1.7         0.4  setosa

For your problem try:

selectedRows <- conservedData[grep("hsa-", conservedData$miRNA), ]

Another option would be to simply use grepl function:

df[grepl('er', df$name), ]
CO2[grepl('non', CO2$Treatment), ]

df <- data.frame(name = c('bob','robert','peter'),
                 id = c(1,2,3)
                 )

# name id
# 2 robert  2
# 3  peter  3

Try str_detect() from the stringr package, which detects the presence or absence of a pattern in a string.

Here is an approach that also incorporates the %>% pipe and filter() from the dplyr package:

library(stringr)
library(dplyr)

CO2 %>%
  filter(str_detect(Treatment, "non"))

   Plant        Type  Treatment conc uptake
1    Qn1      Quebec nonchilled   95   16.0
2    Qn1      Quebec nonchilled  175   30.4
3    Qn1      Quebec nonchilled  250   34.8
4    Qn1      Quebec nonchilled  350   37.2
5    Qn1      Quebec nonchilled  500   35.3
...

This filters the sample CO2 data set (that comes with R) for rows where the Treatment variable contains the substring "non". You can adjust whether str_detect finds fixed matches or uses a regex - see the documentation for the stringr package.


LIKE should work in sqlite:

require(sqldf)
df <- data.frame(name = c('bob','robert','peter'),id=c(1,2,3))
sqldf("select * from df where name LIKE '%er%'")
    name id
1 robert  2
2  peter  3

Examples related to r

How to get AIC from Conway–Maxwell-Poisson regression via COM-poisson package in R? R : how to simply repeat a command? session not created: This version of ChromeDriver only supports Chrome version 74 error with ChromeDriver Chrome using Selenium How to show code but hide output in RMarkdown? remove kernel on jupyter notebook Function to calculate R2 (R-squared) in R Center Plot title in ggplot2 R ggplot2: stat_count() must not be used with a y aesthetic error in Bar graph R multiple conditions in if statement What does "The following object is masked from 'package:xxx'" mean?

Examples related to regex

Why my regexp for hyphenated words doesn't work? grep's at sign caught as whitespace Preg_match backtrack error regex match any single character (one character only) re.sub erroring with "Expected string or bytes-like object" Only numbers. Input number in React Visual Studio Code Search and Replace with Regular Expressions Strip / trim all strings of a dataframe return string with first match Regex How to capture multiple repeated groups?

Examples related to string

How to split a string in two and store it in a field String method cannot be found in a main class method Kotlin - How to correctly concatenate a String Replacing a character from a certain index Remove quotes from String in Python Detect whether a Python string is a number or a letter How does String substring work in Swift How does String.Index work in Swift swift 3.0 Data to String? How to parse JSON string in Typescript

Examples related to match

How to test if a string contains one of the substrings in a list, in pandas? Delete rows containing specific strings in R Jquery Value match Regex How to compare two columns in Excel and if match, then copy the cell next to it Search File And Find Exact Match And Print Line? How to test multiple variables against a value? Selecting data frame rows based on partial string match in a column how to check if string contains '+' character PHP compare two arrays and get the matched values not the difference Check whether a string matches a regex in JS

Examples related to subset

how to remove multiple columns in r dataframe? Using grep to help subset a data frame in R Subset a dataframe by multiple factor levels creating a new list with subset of list using index in python subsetting a Python DataFrame Undefined columns selected when subsetting data frame How to select some rows with specific rownames from a dataframe? Subset data to contain only columns whose names match a condition find all subsets that sum to a particular value Subset and ggplot2