[python] In Python, how to check if a string only contains certain characters?

In Python, how to check if a string only contains certain characters?

I need to check a string containing only a..z, 0..9, and . (period) and no other character.

I could iterate over each character and check the character is a..z or 0..9, or . but that would be slow.

I am not clear now how to do it with a regular expression.

Is this correct? Can you suggest a simpler regular expression or a more efficient approach.

#Valid chars . a-z 0-9 
def check(test_str):
    import re
    #http://docs.python.org/library/re.html
    #re.search returns None if no position in the string matches the pattern
    #pattern to search for any character other then . a-z 0-9
    pattern = r'[^\.a-z0-9]'
    if re.search(pattern, test_str):
        #Character other then . a-z 0-9 was found
        print 'Invalid : %r' % (test_str,)
    else:
        #No character other then . a-z 0-9 was found
        print 'Valid   : %r' % (test_str,)

check(test_str='abcde.1')
check(test_str='abcde.1#')
check(test_str='ABCDE.12')
check(test_str='_-/>"!@#12345abcde<')

'''
Output:
>>> 
Valid   : "abcde.1"
Invalid : "abcde.1#"
Invalid : "ABCDE.12"
Invalid : "_-/>"!@#12345abcde<"
'''

This question is related to python regex search character

The answer is


EDIT: Changed the regular expression to exclude A-Z

Regular expression solution is the fastest pure python solution so far

reg=re.compile('^[a-z0-9\.]+$')
>>>reg.match('jsdlfjdsf12324..3432jsdflsdf')
True
>>> timeit.Timer("reg.match('jsdlfjdsf12324..3432jsdflsdf')", "import re; reg=re.compile('^[a-z0-9\.]+$')").timeit()
0.70509696006774902

Compared to other solutions:

>>> timeit.Timer("set('jsdlfjdsf12324..3432jsdflsdf') <= allowed", "import string; allowed = set(string.ascii_lowercase + string.digits + '.')").timeit()
3.2119350433349609
>>> timeit.Timer("all(c in allowed for c in 'jsdlfjdsf12324..3432jsdflsdf')", "import string; allowed = set(string.ascii_lowercase + string.digits + '.')").timeit()
6.7066690921783447

If you want to allow empty strings then change it to:

reg=re.compile('^[a-z0-9\.]*$')
>>>reg.match('')
False

Under request I'm going to return the other part of the answer. But please note that the following accept A-Z range.

You can use isalnum

test_str.replace('.', '').isalnum()

>>> 'test123.3'.replace('.', '').isalnum()
True
>>> 'test123-3'.replace('.', '').isalnum()
False

EDIT Using isalnum is much more efficient than the set solution

>>> timeit.Timer("'jsdlfjdsf12324..3432jsdflsdf'.replace('.', '').isalnum()").timeit()
0.63245487213134766

EDIT2 John gave an example where the above doesn't work. I changed the solution to overcome this special case by using encode

test_str.replace('.', '').encode('ascii', 'replace').isalnum()

And it is still almost 3 times faster than the set solution

timeit.Timer("u'ABC\u0131\u0661'.encode('ascii', 'replace').replace('.','').isalnum()", "import string; allowed = set(string.ascii_lowercase + string.digits + '.')").timeit()
1.5719811916351318

In my opinion using regular expressions is the best to solve this problem


Simpler approach? A little more Pythonic?

>>> ok = "0123456789abcdef"
>>> all(c in ok for c in "123456abc")
True
>>> all(c in ok for c in "hello world")
False

It certainly isn't the most efficient, but it's sure readable.


Here's a simple, pure-Python implementation. It should be used when performance is not critical (included for future Googlers).

import string
allowed = set(string.ascii_lowercase + string.digits + '.')

def check(test_str):
    set(test_str) <= allowed

Regarding performance, iteration will probably be the fastest method. Regexes have to iterate through a state machine, and the set equality solution has to build a temporary set. However, the difference is unlikely to matter much. If performance of this function is very important, write it as a C extension module with a switch statement (which will be compiled to a jump table).

Here's a C implementation, which uses if statements due to space constraints. If you absolutely need the tiny bit of extra speed, write out the switch-case. In my tests, it performs very well (2 seconds vs 9 seconds in benchmarks against the regex).

#define PY_SSIZE_T_CLEAN
#include <Python.h>

static PyObject *check(PyObject *self, PyObject *args)
{
        const char *s;
        Py_ssize_t count, ii;
        char c;
        if (0 == PyArg_ParseTuple (args, "s#", &s, &count)) {
                return NULL;
        }
        for (ii = 0; ii < count; ii++) {
                c = s[ii];
                if ((c < '0' && c != '.') || c > 'z') {
                        Py_RETURN_FALSE;
                }
                if (c > '9' && c < 'a') {
                        Py_RETURN_FALSE;
                }
        }

        Py_RETURN_TRUE;
}

PyDoc_STRVAR (DOC, "Fast stringcheck");
static PyMethodDef PROCEDURES[] = {
        {"check", (PyCFunction) (check), METH_VARARGS, NULL},
        {NULL, NULL}
};
PyMODINIT_FUNC
initstringcheck (void) {
        Py_InitModule3 ("stringcheck", PROCEDURES, DOC);
}

Include it in your setup.py:

from distutils.core import setup, Extension
ext_modules = [
    Extension ('stringcheck', ['stringcheck.c']),
],

Use as:

>>> from stringcheck import check
>>> check("abc")
True
>>> check("ABC")
False

This has already been answered satisfactorily, but for people coming across this after the fact, I have done some profiling of several different methods of accomplishing this. In my case I wanted uppercase hex digits, so modify as necessary to suit your needs.

Here are my test implementations:

import re

hex_digits = set("ABCDEF1234567890")
hex_match = re.compile(r'^[A-F0-9]+\Z')
hex_search = re.compile(r'[^A-F0-9]')

def test_set(input):
    return set(input) <= hex_digits

def test_not_any(input):
    return not any(c not in hex_digits for c in input)

def test_re_match1(input):
    return bool(re.compile(r'^[A-F0-9]+\Z').match(input))

def test_re_match2(input):
    return bool(hex_match.match(input))

def test_re_match3(input):
    return bool(re.match(r'^[A-F0-9]+\Z', input))

def test_re_search1(input):
    return not bool(re.compile(r'[^A-F0-9]').search(input))

def test_re_search2(input):
    return not bool(hex_search.search(input))

def test_re_search3(input):
    return not bool(re.match(r'[^A-F0-9]', input))

And the tests, in Python 3.4.0 on Mac OS X:

import cProfile
import pstats
import random

# generate a list of 10000 random hex strings between 10 and 10009 characters long
# this takes a little time; be patient
tests = [ ''.join(random.choice("ABCDEF1234567890") for _ in range(l)) for l in range(10, 10010) ]

# set up profiling, then start collecting stats
test_pr = cProfile.Profile(timeunit=0.000001)
test_pr.enable()

# run the test functions against each item in tests. 
# this takes a little time; be patient
for t in tests:
    for tf in [test_set, test_not_any, 
               test_re_match1, test_re_match2, test_re_match3,
               test_re_search1, test_re_search2, test_re_search3]:
        _ = tf(t)

# stop collecting stats
test_pr.disable()

# we create our own pstats.Stats object to filter 
# out some stuff we don't care about seeing
test_stats = pstats.Stats(test_pr)

# normally, stats are printed with the format %8.3f, 
# but I want more significant digits
# so this monkey patch handles that
def _f8(x):
    return "%11.6f" % x

def _print_title(self):
    print('   ncalls     tottime     percall     cumtime     percall', end=' ', file=self.stream)
    print('filename:lineno(function)', file=self.stream)

pstats.f8 = _f8
pstats.Stats.print_title = _print_title

# sort by cumulative time (then secondary sort by name), ascending
# then print only our test implementation function calls:
test_stats.sort_stats('cumtime', 'name').reverse_order().print_stats("test_*")

which gave the following results:

         50335004 function calls in 13.428 seconds

   Ordered by: cumulative time, function name
   List reduced from 20 to 8 due to restriction 

   ncalls     tottime     percall     cumtime     percall filename:lineno(function)
    10000    0.005233    0.000001    0.367360    0.000037 :1(test_re_match2)
    10000    0.006248    0.000001    0.378853    0.000038 :1(test_re_match3)
    10000    0.010710    0.000001    0.395770    0.000040 :1(test_re_match1)
    10000    0.004578    0.000000    0.467386    0.000047 :1(test_re_search2)
    10000    0.005994    0.000001    0.475329    0.000048 :1(test_re_search3)
    10000    0.008100    0.000001    0.482209    0.000048 :1(test_re_search1)
    10000    0.863139    0.000086    0.863139    0.000086 :1(test_set)
    10000    0.007414    0.000001    9.962580    0.000996 :1(test_not_any)

where:

ncalls
The number of times that function was called
tottime
the total time spent in the given function, excluding time made to sub-functions
percall
the quotient of tottime divided by ncalls
cumtime
the cumulative time spent in this and all subfunctions
percall
the quotient of cumtime divided by primitive calls

The columns we actually care about are cumtime and percall, as that shows us the actual time taken from function entry to exit. As we can see, regex match and search are not massively different.

It is faster not to bother compiling the regex if you would have compiled it every time. It is about 7.5% faster to compile once than every time, but only 2.5% faster to compile than to not compile.

test_set was twice as slow as re_search and thrice as slow as re_match

test_not_any was a full order of magnitude slower than test_set

TL;DR: Use re.match or re.search


A different approach, because in my case I needed to also check whether it contained certain words (like 'test' in this example), not characters alone:

input_string = 'abc test'
input_string_test = input_string
allowed_list = ['a', 'b', 'c', 'test', ' ']

for allowed_list_item in allowed_list:
    input_string_test = input_string_test.replace(allowed_list_item, '')

if not input_string_test:
    # test passed

So, the allowed strings (char or word) are cut from the input string. If the input string only contained strings that were allowed, it should leave an empty string and therefore should pass if not input_string.


Use python Sets when you need to compare hm... sets of data. Strings can be represented as sets of characters quite fast. Here I test if string is allowed phone number. First string is allowed, second not. Works fast and simple.

In [17]: timeit.Timer("allowed = set('0123456789+-() ');p = set('+7(898) 64-901-63 ');p.issubset(allowed)").timeit()

Out[17]: 0.8106249139964348

In [18]: timeit.Timer("allowed = set('0123456789+-() ');p = set('+7(950) 64-901-63 ???');p.issubset(allowed)").timeit()

Out[18]: 0.9240323599951807

Never use regexps if you can avoid them.


Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to regex

Why my regexp for hyphenated words doesn't work? grep's at sign caught as whitespace Preg_match backtrack error regex match any single character (one character only) re.sub erroring with "Expected string or bytes-like object" Only numbers. Input number in React Visual Studio Code Search and Replace with Regular Expressions Strip / trim all strings of a dataframe return string with first match Regex How to capture multiple repeated groups? Find a file by name in Visual Studio Code Search all the occurrences of a string in the entire project in Android Studio Java List.contains(Object with field value equal to x) Trigger an action after selection select2 How can I search for a commit message on GitHub? SQL search multiple values in same field Find a string by searching all tables in SQL Server Management Studio 2008 Search File And Find Exact Match And Print Line? Java - Search for files in a directory How to put a delay on AngularJS instant search?

Examples related to character

Set the maximum character length of a UITextField in Swift Max length UITextField Remove last character from string. Swift language Get nth character of a string in Swift programming language How many characters can you store with 1 byte? How to convert integers to characters in C? Converting characters to integers in Java How to check the first character in a string in Bash or UNIX shell? Invisible characters - ASCII How to delete Certain Characters in a excel 2010 cell