Stripping everything but alphanumeric chars from a string in Python

Question

What is the best way to strip all non-alphanumeric characters from a string using Python?

The solutions presented in the PHP variant of this question will probably work with some minor adjustments, but don't seem very "pythonic" to me.

For the record, I don't just want to strip periods and commas (and other punctuation), but also quotes, brackets, etc.

User · Answer

    for char in my_string:
        if not char.isalnum():
            my_string = my_string.replace(char, "")

User · Answer

If I understood correctly, the easiest way is to use a regular expression, as it gives you lots of flexibility. The other simple method is a for loop; the code below shows an example. I also counted the occurrences of each word and stored them in a dictionary.

    s = """An... essay is, generally, a piece of writing that gives the author's own
    argument - but the definition is vague, overlapping with those of a paper, an
    article, a pamphlet, and a short story. Essays have traditionally been
    sub-classified as formal and informal. Formal essays are characterized by "serious
    purpose, dignity, logical organization, length," whereas the informal essay is
    characterized by "the personal element (self-revelation, individual tastes and
    experiences, confidential manner), humor, graceful style, rambling structure,
    unconventionality or novelty of theme," etc.[1]"""

    d = {}             # create an empty dictionary
    words = s.split()  # split the string and store the words in a list
    for word in words:
        new_word = ''
        for c in word:
            if c.isalnum():  # check whether each character is alphanumeric
                new_word = new_word + c
        print(new_word, end=' ')
        if new_word not in d:
            d[new_word] = 1
        else:
            d[new_word] = d[new_word] + 1
    print(d)

Please rate this answer if it is useful.
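The counting loop above could also be sketched with collections.Counter from the standard library. This is only an illustrative variant, not the answerer's code; the helper name count_clean_words and the sample sentence are made up for the example.

```python
from collections import Counter


def count_clean_words(text):
    """Count words after dropping non-alphanumeric characters from each one.

    count_clean_words is a hypothetical helper name used for illustration.
    """
    cleaned = (
        "".join(c for c in word if c.isalnum())
        for word in text.split()
    )
    # Skip words that consisted only of punctuation and became empty.
    return Counter(word for word in cleaned if word)


counts = count_clean_words('An essay is, generally, "an essay" (or so).')
print(counts["essay"])  # 2
```

Counter is a dict subclass, so the result can be used wherever the plain dictionary d was used.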

User · Answer

Use the str.translate() method.

Presuming you will be doing this often:

1. Once, create a string containing all the characters you wish to delete:

    delchars = ''.join(c for c in map(chr, range(256)) if not c.isalnum())

2. Whenever you want to scrunch a string:

    scrunched = s.translate(None, delchars)

The setup cost probably compares favourably with re.compile; the marginal cost is way lower:

    C:\junk>\python26\python -mtimeit -s"import string;d=''.join(c for c in map(chr,range(256)) if not c.isalnum());s=string.printable" "s.translate(None,d)"
    100000 loops, best of 3: 2.04 usec per loop

    C:\junk>\python26\python -mtimeit -s"import re,string;s=string.printable;r=re.compile(r'[\W_]+')" "r.sub('',s)"
    100000 loops, best of 3: 7.34 usec per loop

Note: Using string.printable as benchmark data gives the pattern '[\W_]+' an unfair advantage; all the non-alphanumeric characters are in one bunch. In typical data there would be more than one substitution to do:

    C:\junk>\python26\python -c "import string; s = string.printable; print len(s),repr(s)"
    100 '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'

Here's what happens if you give re.sub a bit more work to do:

    C:\junk>\python26\python -mtimeit -s"d=''.join(c for c in map(chr,range(256)) if not c.isalnum());s='foo-'*25" "s.translate(None,d)"
    1000000 loops, best of 3: 1.97 usec per loop

    C:\junk>\python26\python -mtimeit -s"import re;s='foo-'*25;r=re.compile(r'[\W_]+')" "r.sub('',s)"
    10000 loops, best of 3: 26.4 usec per loop
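The two-argument call s.translate(None, delchars) shown above is the Python 2 form. In Python 3, str.translate takes a single table, which the three-argument form of str.maketrans can build. A minimal sketch of the same idea, assuming the first 256 code points are the ones to filter (the names delete_table and scrunch are hypothetical):

```python
# Build the deletion table once: every non-alphanumeric character among the
# first 256 code points is mapped to None (assumption: that range is enough).
delchars = "".join(c for c in map(chr, range(256)) if not c.isalnum())
delete_table = str.maketrans("", "", delchars)


def scrunch(s):
    """Remove the characters listed in delchars from s."""
    return s.translate(delete_table)


print(scrunch("foo-bar_baz (42)!"))  # foobarbaz42
```

Characters that are not in the table, such as any code point above 255, pass through unchanged, so widen the range if the input is not limited to Latin-1.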

User · Answer

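Regular expressions to the rescue:

    import re
    re.sub(r'\W+', '', your_string)

By Python's definition, '\W' is equivalent to '[^a-zA-Z0-9_]' when matching ASCII, so it matches everything except digits, letters and the underscore. Underscores are therefore kept; use '[\W_]+' to strip them as well. For str patterns in Python 3, '\w' also covers non-ASCII letters and digits unless the re.ASCII flag is given.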
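A short sketch of how the character class changes the result; the sample strings and variable names are made up for illustration:

```python
import re

text = "snake_case: 100% (done)"

# \W leaves the underscore in place because "_" counts as a word character.
kept_underscore = re.sub(r"\W+", "", text)
print(kept_underscore)   # snake_case100done

# Adding "_" to the class removes it as well.
no_underscore = re.sub(r"[\W_]+", "", text)
print(no_underscore)     # snakecase100done

# With re.ASCII, \W also matches non-ASCII letters, so "é" is stripped too.
ascii_only = re.sub(r"[\W_]+", "", "caf\u00e9_1", flags=re.ASCII)
print(ascii_only)        # caf1
```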

User · Answer

I just timed some functions out of curiosity. In these tests I'm removing non-alphanumeric characters from the string string.printable (part of the built-in string module). The use of compiled '[\W_]+' and pattern.sub('', str) was found to be fastest.

    $ python -m timeit -s \
         "import string" \
         "''.join(ch for ch in string.printable if ch.isalnum())"
    10000 loops, best of 3: 57.6 usec per loop

    $ python -m timeit -s \
        "import string" \
        "filter(str.isalnum, string.printable)"
    10000 loops, best of 3: 37.9 usec per loop

    $ python -m timeit -s \
        "import re, string" \
        "re.sub('[\W_]', '', string.printable)"
    10000 loops, best of 3: 27.5 usec per loop

    $ python -m timeit -s \
        "import re, string" \
        "re.sub('[\W_]+', '', string.printable)"
    100000 loops, best of 3: 15 usec per loop

    $ python -m timeit -s \
        "import re, string; pattern = re.compile('[\W_]+')" \
        "pattern.sub('', string.printable)"
    100000 loops, best of 3: 11.2 usec per loop

User · Answer

    >>> import re
    >>> string = "Kl13@£$%[};'\""
    >>> pattern = re.compile('\W')
    >>> string = re.sub(pattern, '', string)
    >>> print string
    Kl13

User · Answer

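    sent = "".join(e for e in sent if e.isalpha())

Note that str.isalpha() keeps letters only; use str.isalnum() instead to keep digits as well.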

User · Answer

How about:

    def ExtractAlphanumeric(InputString):
        from string import ascii_letters, digits
        return "".join([ch for ch in InputString if ch in (ascii_letters + digits)])

This works by using a list comprehension to produce a list of the characters in InputString that are present in the combined ascii_letters and digits strings. It then joins the list together into a string.

User · Answer

You could try:

    print(''.join(ch for ch in some_string if ch.isalnum()))
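One thing to keep in mind with str.isalnum() in Python 3 is that it is Unicode-aware, so accented letters count as alphanumeric. A minimal sketch of both behaviours follows; the helper keep_alnum and its ascii_only parameter are hypothetical names, and str.isascii() needs Python 3.7 or later.

```python
def keep_alnum(s, ascii_only=False):
    """Keep alphanumeric characters; optionally restrict them to ASCII.

    keep_alnum and ascii_only are hypothetical names for illustration.
    str.isascii() requires Python 3.7 or later.
    """
    if ascii_only:
        return "".join(ch for ch in s if ch.isascii() and ch.isalnum())
    return "".join(ch for ch in s if ch.isalnum())


sample_text = "Cr\u00e8me br\u00fbl\u00e9e, 2 cups!"  # "Crème brûlée, 2 cups!"
print(keep_alnum(sample_text))                   # accented letters are kept
print(keep_alnum(sample_text, ascii_only=True))  # Crmebrle2cups
```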

User · Answer

Timing with random strings of ASCII printables:

    from inspect import getsource
    from random import sample
    import re
    from string import printable
    from timeit import timeit

    pattern_single = re.compile(r'[\W]')
    pattern_repeat = re.compile(r'[\W]+')
    translation_tb = str.maketrans('', '', ''.join(c for c in map(chr, range(256)) if not c.isalnum()))


    def generate_test_string(length):
        return ''.join(sample(printable, length))


    def main():
        for i in range(0, 60, 10):
            for test in [
                lambda: ''.join(c for c in generate_test_string(i) if c.isalnum()),
                lambda: ''.join(filter(str.isalnum, generate_test_string(i))),
                lambda: re.sub(r'[\W]', '', generate_test_string(i)),
                lambda: re.sub(r'[\W]+', '', generate_test_string(i)),
                lambda: pattern_single.sub('', generate_test_string(i)),
                lambda: pattern_repeat.sub('', generate_test_string(i)),
                lambda: generate_test_string(i).translate(translation_tb),
            ]:
                print(timeit(test), i, getsource(test).lstrip('            lambda: ').rstrip(',\n'), sep='\t')


    if __name__ == '__main__':
        main()

Result (Python 3.7):

    Time                Length  Code
    6.3716264850008880  00      ''.join(c for c in generate_test_string(i) if c.isalnum())
    5.7285426190064750  00      ''.join(filter(str.isalnum, generate_test_string(i)))
    8.1875841680011940  00      re.sub(r'[\W]', '', generate_test_string(i))
    8.0002205439959650  00      re.sub(r'[\W]+', '', generate_test_string(i))
    5.5290945199958510  00      pattern_single.sub('', generate_test_string(i))
    5.4417179649972240  00      pattern_repeat.sub('', generate_test_string(i))
    4.6772285089973590  00      generate_test_string(i).translate(translation_tb)
    23.574712151996210  10      ''.join(c for c in generate_test_string(i) if c.isalnum())
    22.829975890002970  10      ''.join(filter(str.isalnum, generate_test_string(i)))
    27.210196289997840  10      re.sub(r'[\W]', '', generate_test_string(i))
    27.203713296003116  10      re.sub(r'[\W]+', '', generate_test_string(i))
    24.008979928999906  10      pattern_single.sub('', generate_test_string(i))
    23.945240008994006  10      pattern_repeat.sub('', generate_test_string(i))
    21.830899796994345  10      generate_test_string(i).translate(translation_tb)
    38.731336012999236  20      ''.join(c for c in generate_test_string(i) if c.isalnum())
    37.942474347000825  20      ''.join(filter(str.isalnum, generate_test_string(i)))
    42.169366310001350  20      re.sub(r'[\W]', '', generate_test_string(i))
    41.933375883003464  20      re.sub(r'[\W]+', '', generate_test_string(i))
    38.899814646996674  20      pattern_single.sub('', generate_test_string(i))
    38.636144253003295  20      pattern_repeat.sub('', generate_test_string(i))
    36.201238164998360  20      generate_test_string(i).translate(translation_tb)
    49.377356811004574  30      ''.join(c for c in generate_test_string(i) if c.isalnum())
    48.408927293996385  30      ''.join(filter(str.isalnum, generate_test_string(i)))
    53.901889764994850  30      re.sub(r'[\W]', '', generate_test_string(i))
    52.130339455994545  30      re.sub(r'[\W]+', '', generate_test_string(i))
    50.061149017004940  30      pattern_single.sub('', generate_test_string(i))
    49.366573111998150  30      pattern_repeat.sub('', generate_test_string(i))
    46.649754120997386  30      generate_test_string(i).translate(translation_tb)
    63.107938601999194  40      ''.join(c for c in generate_test_string(i) if c.isalnum())
    65.116287978999030  40      ''.join(filter(str.isalnum, generate_test_string(i)))
    71.477421126997800  40      re.sub(r'[\W]', '', generate_test_string(i))
    66.027950693998720  40      re.sub(r'[\W]+', '', generate_test_string(i))
    63.315361931003280  40      pattern_single.sub('', generate_test_string(i))
    62.342320287003530  40      pattern_repeat.sub('', generate_test_string(i))
    58.249303059004890  40      generate_test_string(i).translate(translation_tb)
    73.810345625002810  50      ''.join(c for c in generate_test_string(i) if c.isalnum())
    72.593953348005020  50      ''.join(filter(str.isalnum, generate_test_string(i)))
    76.048324580995540  50      re.sub(r'[\W]', '', generate_test_string(i))
    75.106637657001560  50      re.sub(r'[\W]+', '', generate_test_string(i))
    74.681338128997600  50      pattern_single.sub('', generate_test_string(i))
    72.430461594005460  50      pattern_repeat.sub('', generate_test_string(i))
    69.394243567003290  50      generate_test_string(i).translate(translation_tb)

str.maketrans & str.translate is fastest, but it keeps every character that is not in the table, which here means any code point above 255. re.compile & pattern.sub is slower than translate, but in most rows still slightly faster than ''.join with a generator expression or filter.

User · Answer

As a spin-off from some other answers here, I offer a really simple and flexible way to define a set of characters that you want to limit a string's content to. In this case, I'm allowing alphanumerics PLUS dash and underscore. Just add or remove characters from my PERMITTED_CHARS as suits your use case.

    PERMITTED_CHARS = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_-"
    someString = "".join(c for c in someString if c in PERMITTED_CHARS)
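The same allow-list idea could be sketched with the constants from the string module and a frozenset, which avoids typing the alphabet by hand and gives constant-time membership tests. The helper name keep_permitted is hypothetical.

```python
import string

# Build the allow-list from the string module rather than typing it out;
# a frozenset gives constant-time membership tests.
PERMITTED_CHARS = frozenset(string.ascii_letters + string.digits + "_-")


def keep_permitted(text, permitted=PERMITTED_CHARS):
    """Return text with every character outside permitted removed."""
    return "".join(c for c in text if c in permitted)


print(keep_permitted("my file-name (v2).txt"))  # myfile-namev2txt
```

Passing a different permitted set per call keeps the helper reusable for other allow-lists.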
