In Python, how to check if a string only contains certain characters?

Question

I need to check that a string contains only a..z, 0..9, and . (period), and no other character.

I could iterate over each character and check whether it is a..z, 0..9, or ., but that would be slow.

I am not sure how to do it with a regular expression.

Is this correct? Can you suggest a simpler regular expression or a more efficient approach?

    # Valid chars: . a-z 0-9
    def check(test_str):
        import re
        # http://docs.python.org/library/re.html
        # re.search returns None if no position in the string matches the pattern
        # pattern to search for any character other than . a-z 0-9
        pattern = r'[^\.a-z0-9]'
        if re.search(pattern, test_str):
            # Character other than . a-z 0-9 was found
            print('Invalid : %r' % (test_str,))
        else:
            # No character other than . a-z 0-9 was found
            print('Valid   : %r' % (test_str,))

    check(test_str='abcde.1')
    check(test_str='abcde.1#')
    check(test_str='ABCDE.12')
    check(test_str='_-/>"!@#12345abcde<')

Output:

    Valid   : 'abcde.1'
    Invalid : 'abcde.1#'
    Invalid : 'ABCDE.12'
    Invalid : '_-/>"!@#12345abcde<'

User · Answer

Use Python sets when you need to compare, well, sets of data. A string can be converted to a set of its characters quite quickly. Here I test whether a string is an allowed phone number: the first string is allowed, the second is not. It is fast and simple.

In [17]: timeit.Timer("allowed = set('0123456789+-() ');p = set('+7(898) 64-901-63 ');p.issubset(allowed)").timeit()

Out[17]: 0.8106249139964348

In [18]: timeit.Timer("allowed = set('0123456789+-() ');p = set('+7(950) 64-901-63 ???');p.issubset(allowed)").timeit()

Out[18]: 0.9240323599951807

Never use regexps if you can avoid them.
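Applied to the question's alphabet, the same subset test might look like the sketch below. The names ALLOWED and only_allowed are made up for illustration, and the alphabet (a-z, 0-9 and the period) is taken from the question rather than from this answer.

```python
import string

# Assumption: the allowed alphabet is the one from the question (a-z, 0-9, '.').
ALLOWED = frozenset(string.ascii_lowercase + string.digits + ".")

def only_allowed(text, allowed=ALLOWED):
    """Return True if every character of text is in allowed."""
    return set(text) <= allowed

print(only_allowed("abcde.1"))   # True
print(only_allowed("ABCDE.12"))  # False
print(only_allowed(""))          # True: the empty set is a subset of any set
```

Note that an empty string passes this test; add an explicit length check if that is not what you want.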

User · Answer

Simpler approach? A little more Pythonic?

    >>> ok = "0123456789abcdef"
    >>> all(c in ok for c in "123456abc")
    True
    >>> all(c in ok for c in "hello world")
    False

It certainly isn't the most efficient, but it's sure readable.
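Wrapped in a function, the same idea could be sketched as follows. The name only_chars is hypothetical; converting the allowed characters to a set is an optional tweak so that each membership test does not rescan a string.

```python
def only_chars(text, ok):
    """Return True if every character of text occurs in ok."""
    ok = set(ok)  # set membership is a hash lookup; a str would be scanned each time
    # all() stops at the first character that is not allowed.
    return all(c in ok for c in text)

print(only_chars("123456abc", "0123456789abcdef"))    # True
print(only_chars("hello world", "0123456789abcdef"))  # False
```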

User · Answer

Final(?) edit

Answer, wrapped up in a function, with annotated interactive session:

    >>> import re
    >>> def special_match(strg, search=re.compile(r'[^a-z0-9.]').search):
    ...     return not bool(search(strg))
    ...
    >>> special_match("")
    True
    >>> special_match("az09.")
    True
    >>> special_match("az09.\n")
    False
    # The above test case is to catch out any attempt to use re.match()
    # with a `$` instead of `\Z` -- see point (6) below.
    >>> special_match("az09.#")
    False
    >>> special_match("az09.X")
    False
    >>>

Note: There is a comparison with using re.match() further down in this answer. Further timings show that match() would win with much longer strings; match() seems to have a much larger overhead than search() when the final answer is True. This is puzzling (perhaps it's the cost of returning a MatchObject instead of None) and may warrant further rummaging.

==== Earlier text ====

The [previously] accepted answer could use a few improvements:

(1) Presentation gives the appearance of being the result of an interactive Python session:

    reg=re.compile('^[a-z0-9\.]+$')
    >>>reg.match('jsdlfjdsf12324..3432jsdflsdf')
    True

but match() doesn't return True.

(2) For use with match(), the ^ at the start of the pattern is redundant, and appears to be slightly slower than the same pattern without the ^.

(3) Should foster the use of raw strings automatically and unthinkingly for any re pattern.

(4) The backslash in front of the dot/period is redundant.

(5) Slower than the OP's code!

    prompt>rem OP's version -- NOTE: OP used raw string!
    prompt>\python26\python -mtimeit -s"t='jsdlfjdsf12324..3432jsdflsdf';import re;reg=re.compile(r'[^a-z0-9\.]')" "not bool(reg.search(t))"
    1000000 loops, best of 3: 1.43 usec per loop

    prompt>rem OP's version w/o backslash
    prompt>\python26\python -mtimeit -s"t='jsdlfjdsf12324..3432jsdflsdf';import re;reg=re.compile(r'[^a-z0-9.]')" "not bool(reg.search(t))"
    1000000 loops, best of 3: 1.44 usec per loop

    prompt>rem cleaned-up version of accepted answer
    prompt>\python26\python -mtimeit -s"t='jsdlfjdsf12324..3432jsdflsdf';import re;reg=re.compile(r'[a-z0-9.]+\Z')" "bool(reg.match(t))"
    100000 loops, best of 3: 2.07 usec per loop

    prompt>rem accepted answer
    prompt>\python26\python -mtimeit -s"t='jsdlfjdsf12324..3432jsdflsdf';import re;reg=re.compile('^[a-z0-9\.]+$')" "bool(reg.match(t))"
    100000 loops, best of 3: 2.08 usec per loop

(6) Can produce the wrong answer!!

    >>> import re
    >>> bool(re.compile('^[a-z0-9\.]+$').match('1234\n'))
    True # uh-oh
    >>> bool(re.compile('^[a-z0-9\.]+\Z').match('1234\n'))
    False
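Point (6) can be reproduced in a few lines. The sketch below assumes Python 3.4 or later, where re.fullmatch is available as a third option; it is not part of the timings above.

```python
import re

pattern = r"[a-z0-9.]+"

# '$' also matches just before a trailing newline, so this accepts "1234\n".
print(bool(re.match(pattern + r"$", "1234\n")))    # True
# '\Z' matches only at the very end of the string.
print(bool(re.match(pattern + r"\Z", "1234\n")))   # False
# re.fullmatch (Python 3.4+) requires the whole string to match.
print(bool(re.fullmatch(pattern, "1234\n")))       # False
print(bool(re.fullmatch(pattern, "az09.")))        # True
```

Unlike special_match(), a pattern ending in + rejects the empty string; use * instead if empty input should be valid.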

User · Answer

This has already been answered satisfactorily, but for people coming across this after the fact, I have done some profiling of several different methods of accomplishing this. In my case I wanted uppercase hex digits, so modify as necessary to suit your needs.

Here are my test implementations:

    import re

    hex_digits = set("ABCDEF1234567890")
    hex_match = re.compile(r'^[A-F0-9]+\Z')
    hex_search = re.compile(r'[^A-F0-9]')

    def test_set(input):
        return set(input) <= hex_digits

    def test_not_any(input):
        return not any(c not in hex_digits for c in input)

    def test_re_match1(input):
        return bool(re.compile(r'^[A-F0-9]+\Z').match(input))

    def test_re_match2(input):
        return bool(hex_match.match(input))

    def test_re_match3(input):
        return bool(re.match(r'^[A-F0-9]+\Z', input))

    def test_re_search1(input):
        return not bool(re.compile(r'[^A-F0-9]').search(input))

    def test_re_search2(input):
        return not bool(hex_search.search(input))

    def test_re_search3(input):
        # NOTE: this calls re.match, not re.search, so only the first
        # character of input is tested against the pattern
        return not bool(re.match(r'[^A-F0-9]', input))

And the tests, in Python 3.4.0 on Mac OS X:

    import cProfile
    import pstats
    import random

    # generate a list of 10000 random hex strings between 10 and 10009 characters long
    # this takes a little time; be patient
    tests = [''.join(random.choice("ABCDEF1234567890") for _ in range(l)) for l in range(10, 10010)]

    # set up profiling, then start collecting stats
    test_pr = cProfile.Profile(timeunit=0.000001)
    test_pr.enable()

    # run the test functions against each item in tests
    # this takes a little time; be patient
    for t in tests:
        for tf in [test_set, test_not_any,
                   test_re_match1, test_re_match2, test_re_match3,
                   test_re_search1, test_re_search2, test_re_search3]:
            _ = tf(t)

    # stop collecting stats
    test_pr.disable()

    # we create our own pstats.Stats object to filter
    # out some stuff we don't care about seeing
    test_stats = pstats.Stats(test_pr)

    # normally, stats are printed with the format %8.3f,
    # but I want more significant digits,
    # so this monkey patch handles that
    def _f8(x):
        return "%11.6f" % x

    def _print_title(self):
        print('   ncalls     tottime     percall     cumtime     percall', end=' ', file=self.stream)
        print('filename:lineno(function)', file=self.stream)

    pstats.f8 = _f8
    pstats.Stats.print_title = _print_title

    # sort by cumulative time (then secondary sort by name), ascending,
    # then print only our test implementation function calls:
    test_stats.sort_stats('cumtime', 'name').reverse_order().print_stats("test_*")

which gave the following results:

             50335004 function calls in 13.428 seconds

       Ordered by: cumulative time, function name
       List reduced from 20 to 8 due to restriction <'test_*'>

       ncalls     tottime     percall     cumtime     percall filename:lineno(function)
        10000    0.005233    0.000001    0.367360    0.000037 :1(test_re_match2)
        10000    0.006248    0.000001    0.378853    0.000038 :1(test_re_match3)
        10000    0.010710    0.000001    0.395770    0.000040 :1(test_re_match1)
        10000    0.004578    0.000000    0.467386    0.000047 :1(test_re_search2)
        10000    0.005994    0.000001    0.475329    0.000048 :1(test_re_search3)
        10000    0.008100    0.000001    0.482209    0.000048 :1(test_re_search1)
        10000    0.863139    0.000086    0.863139    0.000086 :1(test_set)
        10000    0.007414    0.000001    9.962580    0.000996 :1(test_not_any)

where:

    ncalls:  the number of times that function was called
    tottime: the total time spent in the given function, excluding time spent in calls to sub-functions
    percall: the quotient of tottime divided by ncalls
    cumtime: the cumulative time spent in this and all subfunctions
    percall: the quotient of cumtime divided by primitive calls

The columns we actually care about are cumtime and the second percall, as they show the actual time taken from function entry to exit.

As we can see, regex match and search are not massively different.

If you would otherwise compile the regex on every call, it is faster not to bother compiling it at all. Compiling once is about 7.5% faster than compiling every time, but only about 2.5% faster than not compiling.

test_set was twice as slow as re.search and thrice as slow as re.match.

test_not_any was a full order of magnitude slower than test_set.

TL;DR: Use re.match or re.search.
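For a quicker sanity check than a full cProfile run, a timeit comparison along these lines may be enough. This is only a sketch: the function names and the sample input are invented here, and the absolute numbers will differ by machine and Python version.

```python
import re
import timeit

HEX_DIGITS = set("ABCDEF1234567890")
HEX_MATCH = re.compile(r"[A-F0-9]+\Z")

def by_set(s):
    return set(s) <= HEX_DIGITS

def by_regex(s):
    return bool(HEX_MATCH.match(s))

sample = "DEADBEEF" * 100  # hypothetical test input, 800 valid characters

for func in (by_set, by_regex):
    # Make sure both variants agree before timing them.
    assert func(sample) and not func(sample + "g")
    seconds = timeit.timeit(lambda: func(sample), number=10000)
    print("%-10s %.4f s" % (func.__name__, seconds))
```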

User · Answer

A different approach, because in my case I also needed to check whether the string contained certain words (like 'test' in this example), not characters alone:

    input_string = 'abc test'
    input_string_test = input_string
    allowed_list = ['a', 'b', 'c', 'test', ' ']

    for allowed_list_item in allowed_list:
        input_string_test = input_string_test.replace(allowed_list_item, '')

    if not input_string_test:
        pass  # test passed

So, the allowed strings (char or word) are cut from the input string. If the input string contained only allowed strings, an empty string is left over, and the `if not input_string_test` check passes.
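One thing to watch with successive replace() calls is that removing one item can join its neighbours into another allowed item: with the allowed list ['c', 'ab'], the input 'acb' ends up empty even though it is not built from 'c' and 'ab'. A stricter variant, sketched here with a regular expression instead of replace() (the name only_tokens is hypothetical, and re.fullmatch needs Python 3.4+), could be:

```python
import re

def only_tokens(text, allowed):
    """Return True if text is a concatenation of items from allowed."""
    # Longest items first, so that 'test' is tried before any shorter alternative.
    ordered = sorted(allowed, key=len, reverse=True)
    alternatives = "|".join(re.escape(item) for item in ordered)
    return re.fullmatch("(?:%s)*" % alternatives, text) is not None

print(only_tokens("abc test", ["a", "b", "c", "test", " "]))  # True
print(only_tokens("abc tes", ["a", "b", "c", "test", " "]))   # False
print(only_tokens("acb", ["c", "ab"]))                        # False
```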

User · Answer

EDIT: Changed the regular expression to exclude A-Z.

The regular expression solution is the fastest pure Python solution so far:

    reg=re.compile('^[a-z0-9\.]+$')
    >>>reg.match('jsdlfjdsf12324..3432jsdflsdf')
    True
    >>> timeit.Timer("reg.match('jsdlfjdsf12324..3432jsdflsdf')", "import re; reg=re.compile('^[a-z0-9\.]+$')").timeit()
    0.70509696006774902

Compared to other solutions:

    >>> timeit.Timer("set('jsdlfjdsf12324..3432jsdflsdf') <= allowed", "import string; allowed = set(string.ascii_lowercase + string.digits + '.')").timeit()
    3.2119350433349609
    >>> timeit.Timer("all(c in allowed for c in 'jsdlfjdsf12324..3432jsdflsdf')", "import string; allowed = set(string.ascii_lowercase + string.digits + '.')").timeit()
    6.7066690921783447

If you want to allow empty strings, then change it to:

    reg=re.compile('^[a-z0-9\.]*$')
    >>> bool(reg.match(''))
    True

By request, I am restoring the other part of the answer. But please note that the following accepts the A-Z range.

You can use isalnum:

    test_str.replace('.', '').isalnum()

    >>> 'test123.3'.replace('.', '').isalnum()
    True
    >>> 'test123-3'.replace('.', '').isalnum()
    False

EDIT: Using isalnum is much more efficient than the set solution:

    >>> timeit.Timer("'jsdlfjdsf12324..3432jsdflsdf'.replace('.', '').isalnum()").timeit()
    0.63245487213134766

EDIT2: John gave an example where the above doesn't work. I changed the solution to overcome this special case by using encode:

    test_str.replace('.', '').encode('ascii', 'replace').isalnum()

And it is still almost 3 times faster than the set solution:

    timeit.Timer("u'ABC\u0131\u0661'.encode('ascii', 'replace').replace('.','').isalnum()", "import string; allowed = set(string.ascii_lowercase + string.digits + '.')").timeit()
    1.5719811916351318

In my opinion, regular expressions are the best way to solve this problem.
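The caveats of the isalnum() shortcut are easy to see in a short session. The sketch below assumes Python 3, where every str is Unicode; check_isalnum is a hypothetical wrapper name.

```python
def check_isalnum(test_str):
    # Mirrors the replace('.', '').isalnum() idea from the answer above.
    return test_str.replace(".", "").isalnum()

print(check_isalnum("test123.3"))        # True
print(check_isalnum("ABC"))              # True, although uppercase is not wanted
print(check_isalnum("abc\u0131\u0661"))  # True: both characters are alphanumeric in Unicode
print(check_isalnum("..."))              # False: nothing is left after replace()
print(check_isalnum(""))                 # False
```

So the shortcut is only a fit when uppercase and non-ASCII letters or digits are acceptable and a string made up solely of periods should be rejected.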

User · Answer

Here's a simple, pure-Python implementation. It should be used when performance is not critical (included for future Googlers).

    import string
    allowed = set(string.ascii_lowercase + string.digits + '.')

    def check(test_str):
        return set(test_str) <= allowed

Regarding performance, iteration will probably be the fastest method. Regexes have to iterate through a state machine, and the set comparison solution has to build a temporary set. However, the difference is unlikely to matter much. If performance of this function is very important, write it as a C extension module with a switch statement (which will be compiled to a jump table).

Here's a C implementation, which uses if statements due to space constraints. If you absolutely need the tiny bit of extra speed, write out the switch-case. In my tests, it performs very well (2 seconds vs 9 seconds in benchmarks against the regex).

    #define PY_SSIZE_T_CLEAN
    #include <Python.h>

    static PyObject *check(PyObject *self, PyObject *args)
    {
        const char *s;
        Py_ssize_t count, ii;
        char c;
        if (0 == PyArg_ParseTuple(args, "s#", &s, &count)) {
            return NULL;
        }
        for (ii = 0; ii < count; ii++) {
            c = s[ii];
            /* reject anything below '0' except '.', and anything above 'z' */
            if ((c < '0' && c != '.') || c > 'z') {
                Py_RETURN_FALSE;
            }
            /* reject the characters between '9' and 'a' (includes A-Z) */
            if (c > '9' && c < 'a') {
                Py_RETURN_FALSE;
            }
        }

        Py_RETURN_TRUE;
    }

    PyDoc_STRVAR(DOC, "Fast stringcheck");
    static PyMethodDef PROCEDURES[] = {
        {"check", (PyCFunction) (check), METH_VARARGS, NULL},
        {NULL, NULL}
    };

    /* Python 2 module initialisation */
    PyMODINIT_FUNC
    initstringcheck(void) {
        Py_InitModule3("stringcheck", PROCEDURES, DOC);
    }

Include it in your setup.py:

    from distutils.core import setup, Extension

    setup(
        name='stringcheck',
        ext_modules=[Extension('stringcheck', ['stringcheck.c'])],
    )

Use as:

    >>> from stringcheck import check
    >>> check("abc")
    True
    >>> check("ABC")
    False
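If a C extension is more than you need, one pure-Python option that keeps the per-character loop inside the interpreter's C code is str.translate. This is a different technique from the ones in this answer, sketched under the assumption of Python 3; the names DELETE_ALLOWED and check_translate are made up, and no timing claim is implied.

```python
import string

ALLOWED = string.ascii_lowercase + string.digits + "."
# Translation table that deletes every allowed character.
DELETE_ALLOWED = str.maketrans("", "", ALLOWED)

def check_translate(test_str):
    """Return True if nothing is left once the allowed characters are removed."""
    return not test_str.translate(DELETE_ALLOWED)

print(check_translate("abc.123"))  # True
print(check_translate("ABC"))      # False
```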
