[r] Using grep to help subset a data frame in R

I am having trouble subsetting my data. I want the data subsetted on column x, where the first 3 characters begin G45.

My data frame:

 x <- c("G448", "G459", "G479", "G406")  
 y <- c(1:4)
 My.Data <- data.frame (x,y)

I have tried:

 subset (My.Data, x=="G45*")

But I am unsure how to use wildcards. I have also tried grep() to find the indicies:

 grep  ("G45*", My.Data$x)

but it returns all 4 rows, rather than just those beginning G45, probably also as I am unsure how to use wildcards.

This question is related to r dataframe subset

The answer is


It's pretty straightforward using [ to extract:

grep will give you the position in which it matched your search pattern (unless you use value = TRUE).

grep("^G45", My.Data$x)
# [1] 2

Since you're searching within the values of a single column, that actually corresponds to the row index. So, use that with [ (where you would use My.Data[rows, cols] to get specific rows and columns).

My.Data[grep("^G45", My.Data$x), ]
#      x y
# 2 G459 2

The help-page for subset shows how you can use grep and grepl with subset if you prefer using this function over [. Here's an example.

subset(My.Data, grepl("^G45", My.Data$x))
#      x y
# 2 G459 2

As of R 3.3, there's now also the startsWith function, which you can again use with subset (or with any of the other approaches above). According to the help page for the function, it's considerably faster than using substring or grepl.

subset(My.Data, startsWith(as.character(x), "G45"))
#      x y
# 2 G459 2

You may also use the stringr package

library(dplyr)
library(stringr)
My.Data %>% filter(str_detect(x, '^G45'))

You may not use '^' (starts with) in this case, to obtain the results you need


Examples related to r

How to get AIC from Conway–Maxwell-Poisson regression via COM-poisson package in R? R : how to simply repeat a command? session not created: This version of ChromeDriver only supports Chrome version 74 error with ChromeDriver Chrome using Selenium How to show code but hide output in RMarkdown? remove kernel on jupyter notebook Function to calculate R2 (R-squared) in R Center Plot title in ggplot2 R ggplot2: stat_count() must not be used with a y aesthetic error in Bar graph R multiple conditions in if statement What does "The following object is masked from 'package:xxx'" mean?

Examples related to dataframe

Trying to merge 2 dataframes but get ValueError How to show all of columns name on pandas dataframe? Python Pandas - Find difference between two data frames Pandas get the most frequent values of a column Display all dataframe columns in a Jupyter Python Notebook How to convert column with string type to int form in pyspark data frame? Display/Print one column from a DataFrame of Series in Pandas Binning column with python pandas Selection with .loc in python Set value to an entire column of a pandas dataframe

Examples related to subset

how to remove multiple columns in r dataframe? Using grep to help subset a data frame in R Subset a dataframe by multiple factor levels creating a new list with subset of list using index in python subsetting a Python DataFrame Undefined columns selected when subsetting data frame How to select some rows with specific rownames from a dataframe? Subset data to contain only columns whose names match a condition find all subsets that sum to a particular value Subset and ggplot2