[regex] Extracting numbers from vectors of strings

I have string like this:

years<-c("20 years old", "1 years old")

I would like to grep only the numeric number from this vector. Expected output is a vector:

c(20, 1)

How do I go about doing this?

This question is related to regex r

The answer is


Using the package unglue we can do :

# install.packages("unglue")
library(unglue)

years<-c("20 years old", "1 years old")
unglue_vec(years, "{x} years old", convert = TRUE)
#> [1] 20  1

Created on 2019-11-06 by the reprex package (v0.3.0)

More info: https://github.com/moodymudskipper/unglue/blob/master/README.md


Update Since extract_numeric is deprecated, we can use parse_number from readr package.

library(readr)
parse_number(years)

Here is another option with extract_numeric

library(tidyr)
extract_numeric(years)
#[1] 20  1

We can also use str_extract from stringr

years<-c("20 years old", "1 years old")
as.integer(stringr::str_extract(years, "\\d+"))
#[1] 20  1

If there are multiple numbers in the string and we want to extract all of them, we may use str_extract_all which unlike str_extract returns all the macthes.

years<-c("20 years old and 21", "1 years old")
stringr::str_extract(years, "\\d+")
#[1] "20"  "1"

stringr::str_extract_all(years, "\\d+")

#[[1]]
#[1] "20" "21"

#[[2]]
#[1] "1"

Here's an alternative to Arun's first solution, with a simpler Perl-like regular expression:

as.numeric(gsub("[^\\d]+", "", years, perl=TRUE))

A stringr pipelined solution:

library(stringr)
years %>% str_match_all("[0-9]+") %>% unlist %>% as.numeric

After the post from Gabor Grothendieck post at the r-help mailing list

years<-c("20 years old", "1 years old")

library(gsubfn)
pat <- "[-+.e0-9]*\\d"
sapply(years, function(x) strapply(x, pat, as.numeric)[[1]])

Or simply:

as.numeric(gsub("\\D", "", years))
# [1] 20  1

I think that substitution is an indirect way of getting to the solution. If you want to retrieve all the numbers, I recommend gregexpr:

matches <- regmatches(years, gregexpr("[[:digit:]]+", years))
as.numeric(unlist(matches))

If you have multiple matches in a string, this will get all of them. If you're only interested in the first match, use regexpr instead of gregexpr and you can skip the unlist.


Extract numbers from any string at beginning position.

x <- gregexpr("^[0-9]+", years)  # Numbers with any number of digits
x2 <- as.numeric(unlist(regmatches(years, x)))

Extract numbers from any string INDEPENDENT of position.

x <- gregexpr("[0-9]+", years)  # Numbers with any number of digits
x2 <- as.numeric(unlist(regmatches(years, x)))

You could get rid of all the letters too:

as.numeric(gsub("[[:alpha:]]", "", years))

Likely this is less generalizable though.