[python] How to extract numbers from a string in Python?

I would extract all the numbers contained in a string. Which is the better suited for the purpose, regular expressions or the isdigit() method?

Example:

line = "hello 12 hi 89"

Result:

[12, 89]

This question is related to python regex string numbers

The answer is


To catch different patterns it is helpful to query with different patterns.

Setup all the patterns that catch different number patterns of interest:

(finds commas) 12,300 or 12,300.00

'[\d]+[.,\d]+'

(finds floats) 0.123 or .123

'[\d]*[.][\d]+'

(finds integers) 123

'[\d]+'

Combine with pipe ( | ) into one pattern with multiple or conditionals.

(Note: Put complex patterns first else simple patterns will return chunks of the complex catch instead of the complex catch returning the full catch).

p = '[\d]+[.,\d]+|[\d]*[.][\d]+|[\d]+'

Below, we'll confirm a pattern is present with re.search(), then return an iterable list of catches. Finally, we'll print each catch using bracket notation to subselect the match object return value from the match object.

s = 'he33llo 42 I\'m a 32 string 30 444.4 12,001'

if re.search(p, s) is not None:
    for catch in re.finditer(p, s):
        print(catch[0]) # catch is a match object

Returns:

33
42
32
30
444.4
12,001

# extract numbers from garbage string:
s = '12//n,_@#$%3.14kjlw0xdadfackvj1.6e-19&*ghn334'
newstr = ''.join((ch if ch in '0123456789.-e' else ' ') for ch in s)
listOfNumbers = [float(i) for i in newstr.split()]
print(listOfNumbers)
[12.0, 3.14, 0.0, 1.6e-19, 334.0]

This is more than a bit late, but you can extend the regex expression to account for scientific notation too.

import re

# Format is [(<string>, <expected output>), ...]
ss = [("apple-12.34 ba33na fanc-14.23e-2yapple+45e5+67.56E+3",
       ['-12.34', '33', '-14.23e-2', '+45e5', '+67.56E+3']),
      ('hello X42 I\'m a Y-32.35 string Z30',
       ['42', '-32.35', '30']),
      ('he33llo 42 I\'m a 32 string -30', 
       ['33', '42', '32', '-30']),
      ('h3110 23 cat 444.4 rabbit 11 2 dog', 
       ['3110', '23', '444.4', '11', '2']),
      ('hello 12 hi 89', 
       ['12', '89']),
      ('4', 
       ['4']),
      ('I like 74,600 commas not,500', 
       ['74,600', '500']),
      ('I like bad math 1+2=.001', 
       ['1', '+2', '.001'])]

for s, r in ss:
    rr = re.findall("[-+]?[.]?[\d]+(?:,\d\d\d)*[\.]?\d*(?:[eE][-+]?\d+)?", s)
    if rr == r:
        print('GOOD')
    else:
        print('WRONG', rr, 'should be', r)

Gives all good!

Additionally, you can look at the AWS Glue built-in regex


Since none of these dealt with real world financial numbers in excel and word docs that I needed to find, here is my variation. It handles ints, floats, negative numbers, currency numbers (because it doesn't reply on split), and has the option to drop the decimal part and just return ints, or return everything.

It also handles Indian Laks number system where commas appear irregularly, not every 3 numbers apart.

It does not handle scientific notation or negative numbers put inside parentheses in budgets -- will appear positive.

It also does not extract dates. There are better ways for finding dates in strings.

import re
def find_numbers(string, ints=True):            
    numexp = re.compile(r'[-]?\d[\d,]*[\.]?[\d{2}]*') #optional - in front
    numbers = numexp.findall(string)    
    numbers = [x.replace(',','') for x in numbers]
    if ints is True:
        return [int(x.replace(',','').split('.')[0]) for x in numbers]            
    else:
        return numbers

Using Regex below is the way

lines = "hello 12 hi 89"
import re
output = []
#repl_str = re.compile('\d+.?\d*')
repl_str = re.compile('^\d+$')
#t = r'\d+.?\d*'
line = lines.split()
for word in line:
        match = re.search(repl_str, word)
        if match:
            output.append(float(match.group()))
print (output)

with findall re.findall(r'\d+', "hello 12 hi 89")

['12', '89']

re.findall(r'\b\d+\b', "hello 12 hi 89 33F AC 777")

['12', '89', '777']

@jmnas, I liked your answer, but it didn't find floats. I'm working on a script to parse code going to a CNC mill and needed to find both X and Y dimensions that can be integers or floats, so I adapted your code to the following. This finds int, float with positive and negative vals. Still doesn't find hex formatted values but you could add "x" and "A" through "F" to the num_char tuple and I think it would parse things like '0x23AC'.

s = 'hello X42 I\'m a Y-32.35 string Z30'
xy = ("X", "Y")
num_char = (".", "+", "-")

l = []

tokens = s.split()
for token in tokens:

    if token.startswith(xy):
        num = ""
        for char in token:
            # print(char)
            if char.isdigit() or (char in num_char):
                num = num + char

        try:
            l.append(float(num))
        except ValueError:
            pass

print(l)

The best option I found is below. It will extract a number and can eliminate any type of char.

def extract_nbr(input_str):
    if input_str is None or input_str == '':
        return 0

    out_number = ''
    for ele in input_str:
        if ele.isdigit():
            out_number += ele
    return float(out_number)    

I was looking for a solution to remove strings' masks, specifically from Brazilian phones numbers, this post not answered but inspired me. This is my solution:

>>> phone_number = '+55(11)8715-9877'
>>> ''.join([n for n in phone_number if n.isdigit()])
'551187159877'

I'd use a regexp :

>>> import re
>>> re.findall(r'\d+', 'hello 42 I\'m a 32 string 30')
['42', '32', '30']

This would also match 42 from bla42bla. If you only want numbers delimited by word boundaries (space, period, comma), you can use \b :

>>> re.findall(r'\b\d+\b', 'he33llo 42 I\'m a 32 string 30')
['42', '32', '30']

To end up with a list of numbers instead of a list of strings:

>>> [int(s) for s in re.findall(r'\b\d+\b', 'he33llo 42 I\'m a 32 string 30')]
[42, 32, 30]

I am just adding this answer because no one added one using Exception handling and because this also works for floats

a = []
line = "abcd 1234 efgh 56.78 ij"
for word in line.split():
    try:
        a.append(float(word))
    except ValueError:
        pass
print(a)

Output :

[1234.0, 56.78]

This answer also contains the case when the number is float in the string

def get_first_nbr_from_str(input_str):
    '''
    :param input_str: strings that contains digit and words
    :return: the number extracted from the input_str
    demo:
    'ab324.23.123xyz': 324.23
    '.5abc44': 0.5
    '''
    if not input_str and not isinstance(input_str, str):
        return 0
    out_number = ''
    for ele in input_str:
        if (ele == '.' and '.' not in out_number) or ele.isdigit():
            out_number += ele
        elif out_number:
            break
    return float(out_number)

I'm assuming you want floats not just integers so I'd do something like this:

l = []
for t in s.split():
    try:
        l.append(float(t))
    except ValueError:
        pass

Note that some of the other solutions posted here don't work with negative numbers:

>>> re.findall(r'\b\d+\b', 'he33llo 42 I\'m a 32 string -30')
['42', '32', '30']

>>> '-3'.isdigit()
False

If you know it will be only one number in the string, i.e 'hello 12 hi', you can try filter.

For example:

In [1]: int(''.join(filter(str.isdigit, '200 grams')))
Out[1]: 200
In [2]: int(''.join(filter(str.isdigit, 'Counters: 55')))
Out[2]: 55
In [3]: int(''.join(filter(str.isdigit, 'more than 23 times')))
Out[3]: 23

But be carefull !!! :

In [4]: int(''.join(filter(str.isdigit, '200 grams 5')))
Out[4]: 2005

For phone numbers you can simply exclude all non-digit characters with \D in regex:

import re

phone_number = '(619) 459-3635'
phone_number = re.sub(r"\D", "", phone_number)
print(phone_number)

I am amazed to see that no one has yet mentioned the usage of itertools.groupby as an alternative to achieve this.

You may use itertools.groupby() along with str.isdigit() in order to extract numbers from string as:

from itertools import groupby
my_str = "hello 12 hi 89"

l = [int(''.join(i)) for is_digit, i in groupby(my_str, str.isdigit) if is_digit]

The value hold by l will be:

[12, 89]

PS: This is just for illustration purpose to show that as an alternative we could also use groupby to achieve this. But this is not a recommended solution. If you want to achieve this, you should be using accepted answer of fmark based on using list comprehension with str.isdigit as filter.


line2 = "hello 12 hi 89"
temp1 = re.findall(r'\d+', line2) # through regular expression
res2 = list(map(int, temp1))
print(res2)

Hi ,

you can search all the integers in the string through digit by using findall expression .

In the second step create a list res2 and add the digits found in string to this list

hope this helps

Regards, Diwakar Sharma


Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to regex

Why my regexp for hyphenated words doesn't work? grep's at sign caught as whitespace Preg_match backtrack error regex match any single character (one character only) re.sub erroring with "Expected string or bytes-like object" Only numbers. Input number in React Visual Studio Code Search and Replace with Regular Expressions Strip / trim all strings of a dataframe return string with first match Regex How to capture multiple repeated groups?

Examples related to string

How to split a string in two and store it in a field String method cannot be found in a main class method Kotlin - How to correctly concatenate a String Replacing a character from a certain index Remove quotes from String in Python Detect whether a Python string is a number or a letter How does String substring work in Swift How does String.Index work in Swift swift 3.0 Data to String? How to parse JSON string in Typescript

Examples related to numbers

how to display a javascript var in html body How to label scatterplot points by name? Allow 2 decimal places in <input type="number"> Why does the html input with type "number" allow the letter 'e' to be entered in the field? Explanation on Integer.MAX_VALUE and Integer.MIN_VALUE to find min and max value in an array Input type "number" won't resize C++ - how to find the length of an integer How to Generate a random number of fixed length using JavaScript? How do you check in python whether a string contains only numbers? Turn a single number into single digits Python