Python, remove all non-alphabet chars from string

Question

I am writing a Python MapReduce word count program. The problem is that there are many non-alphabet chars strewn about in the data. I have found the post "Stripping everything but alphanumeric chars from a string in Python", which shows a nice solution using regex, but I am not sure how to implement it:

    def mapfn(k, v):
        print v
        import re, string
        pattern = re.compile('[\W_]+')
        v = pattern.match(v)
        print v
        for w in v.split():
            yield w, 1

I'm afraid I am not sure how to use the re library, or even regex for that matter. I am not sure how to apply the regex pattern to the incoming string v (a line of a book) properly to retrieve the new line without any non-alphanumeric chars.

Suggestions?

User · Answer

If you prefer not to use regex, you might try:

    ''.join([i for i in s if i.isalpha()])
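For a word count, dropping every non-letter would also drop the spaces between words. The following is a minimal sketch of the same `isalpha` idea that also keeps whitespace, so the line can still be split afterwards. The function name `keep_letters_and_spaces` and the sample line are illustrative assumptions, not part of the original answer.

```python
def keep_letters_and_spaces(s):
    """Keep alphabetic characters and whitespace; drop everything else."""
    return "".join(ch for ch in s if ch.isalpha() or ch.isspace())


line = "Hello, world! It's 2024."   # hypothetical sample input
words = keep_letters_and_spaces(line).split()
print(words)  # ['Hello', 'world', 'Its']
```

Note that `str.isalpha` is Unicode-aware in Python 3, so accented and non-Latin letters are kept as well.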

User · Answer

It is advisable to use PyPi regex module if you plan to match specific Unicode property classes  This library has also proven to be more stable  especially handling large texts  and yields consistent results across various Python versions  All you need to do is to keep it up-to-date   If you install it  using pip intall regex or pip3 install regex   you may use  import regex print   regex sub r  P L          ABCLac1-2    3  4   5def             gt  ABCLac   def   to remove all chunks of 1 or more characters other than Unicode letters from text  See an online Python demo  You may also use    join regex findall r  p L      ABCLac1-2    3  4   5def       to get the same result   In Python re  in order to match any Unicode letter  one may use the    W d   construct  Match any unicode letter     So  to remove all non-letter characters  you may either match all letters and join the results   result      join re findall r    W d     text     Or  remove all chars other than those matched with    W d     result   re sub r     W d        r  1   text  re DOTALL    See the regex demo online  However  you may get inconsistent results across various Python versions because the Unicode standard is evolving  and the set of chars matched with  w will depend on the Python version  Using PyPi regex library is highly recommended to get consistent results
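The two standard-library variants described in this answer can be tried without installing anything. Below is a minimal sketch using only `re`; the helper names `only_letters` and `strip_non_letters` and the sample strings are assumptions made for illustration. The class `[^\W\d_]` is a commonly used approximation of "any letter" (word characters minus digits and underscore).

```python
import re

# Word characters minus digits and underscore: approximately "any letter".
LETTER = re.compile(r"[^\W\d_]")


def only_letters(text):
    """Collect every letter match and join them back together."""
    return "".join(LETTER.findall(text))


def strip_non_letters(text):
    """Keep captured letters, replace any other character with nothing.

    flags is passed by keyword: the fourth positional argument
    of re.sub is count, not flags.
    """
    return re.sub(r"([^\W\d_])|.", r"\1", text, flags=re.DOTALL)


print(only_letters("ABCŁąć1-2!Абв3§4_5def"))
print(strip_non_letters("a1\nb2"))
```

The substitution variant relies on an unmatched group being replaced by an empty string, which is the behavior in Python 3.5 and later.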

User · Answer

Use re.sub:

    import re

    regex = re.compile('[^a-zA-Z]')
    # First parameter is the replacement, second parameter is your input string
    regex.sub('', 'ab3d*E')
    # Out: 'abdE'

Alternatively, if you only want to remove a certain set of characters (as an apostrophe might be okay in your input):

    regex = re.compile('[,.!?]')  # etc.

User · Answer

You can use the re.sub() function to remove these characters:

    >>> import re
    >>> re.sub("[^a-zA-Z]+", "", "ABC12abc345def")
    'ABCabcdef'

re.sub(MATCH PATTERN, REPLACE STRING, STRING TO SEARCH)

"[^a-zA-Z]+" - look for any group of characters that are NOT a-zA-Z.
"" - replace the matched characters with "".
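Applied to the `mapfn` from the question, the same call might look like the sketch below. The original attempt used `pattern.match(v)`, which returns a match object (or `None`) rather than a cleaned string, so `sub` is used instead. Replacing with a space rather than an empty string is an assumption made here so that adjacent words are not glued together; the sample line is also hypothetical.

```python
import re

# Compile once at module level so the pattern is reused for every line.
NON_LETTERS = re.compile(r"[^a-zA-Z]+")


def mapfn(k, v):
    """Yield (word, 1) for each run of ASCII letters in the line v."""
    # Replace each run of non-letters with a single space so that
    # "ago--never" becomes two words instead of "agonever".
    cleaned = NON_LETTERS.sub(" ", v)
    for w in cleaned.split():
        yield w, 1


pairs = list(mapfn(0, "Call me Ishmael. Some years ago--never mind"))
print(pairs)
```

With this pattern an apostrophe also acts as a separator, so "isn't" would be counted as "isn" and "t"; adjust the character class if that is not wanted.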

User · Answer

Try:

    s = ''.join(filter(str.isalnum, s))

This will take every char from the string, keep only the alphanumeric ones, and build a string back from them.
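Since the question asks about non-alphabet characters, it may help to see how `str.isalnum` and `str.isalpha` differ as filter predicates. A small sketch with a made-up input string:

```python
s = "ABC12abc345def!"  # hypothetical sample input

# isalnum keeps letters and digits
alnum = "".join(filter(str.isalnum, s))
# isalpha keeps letters only
alpha = "".join(filter(str.isalpha, s))

print(alnum)  # ABC12abc345def
print(alpha)  # ABCabcdef
```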

User · Answer

In the timings below, the compiled regex was the fastest method.

    import timeit

    # Try with regex first
    t0 = timeit.timeit("""
    s = r2.sub('', st)
    """, setup="""
    import re
    r2 = re.compile(r'[^a-zA-Z0-9]', re.MULTILINE)
    st = 'abcdefghijklmnopqrstuvwxyz123456789!@#$%^&*()-=_+'
    """, number=1000000)
    print(t0)

    # Try with join method on filter
    t0 = timeit.timeit("""
    s = ''.join(filter(str.isalnum, st))
    """, setup="""
    st = 'abcdefghijklmnopqrstuvwxyz123456789!@#$%^&*()-=_+'
    """, number=1000000)
    print(t0)

    # Try with only join
    t0 = timeit.timeit("""
    s = ''.join(c for c in st if c.isalnum())
    """, setup="""
    st = 'abcdefghijklmnopqrstuvwxyz123456789!@#$%^&*()-=_+'
    """, number=1000000)
    print(t0)

Results (seconds for 1,000,000 runs):

    2.6002226710006653  Method 1: Regex
    5.739747313000407   Method 2: Filter + Join
    6.540099570000166   Method 3: Join
