How to remove \xa0 from string in Python?

Question

I am currently using Beautiful Soup to parse an HTML file and calling get_text(), but it seems like I'm being left with a lot of \xa0 Unicode representing spaces. Is there an efficient way to remove all of them in Python 2.7, and change them into spaces? I guess the more generalized question would be: is there a way to remove Unicode formatting?

I tried using

line = line.replace(u'\xa0',' ')

as suggested by another thread, but that changed the \xa0's to u's, so now I have "u"s everywhere instead.

EDIT: The problem seems to be resolved by str.replace(u'\xa0', ' ').encode('utf-8'), but just doing .encode('utf-8') without replace() seems to cause it to spit out even weirder characters, \xc2 for instance. Can anyone explain this?

User · Answer

I ended up here while googling a problem with non-printable characters. I use MySQL UTF-8 general_ci and deal with the Polish language. For problematic strings I have to proceed as follows:

text=text.replace('\xc2\xa0', ' ')

It is just a quick workaround, and you should probably try to fix the encoding setup instead.
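
To illustrate what that workaround matches: `\xc2\xa0` is the UTF-8 byte sequence for a non-breaking space, so the replacement only makes sense on bytes that have not been decoded yet. The following is a minimal sketch, assuming Python 3; the sample value `raw` is hypothetical and stands in for whatever the database driver returns.

```python
# Hypothetical sample: UTF-8 bytes for "Dzień dobry" with a non-breaking space.
raw = "Dzie\u0144\u00a0dobry".encode("utf-8")

# Option 1: replace the two-byte UTF-8 sequence directly on the bytes.
cleaned_bytes = raw.replace(b"\xc2\xa0", b" ")

# Option 2: decode first, then replace the single character.
cleaned_text = raw.decode("utf-8").replace("\u00a0", " ")

print(cleaned_bytes.decode("utf-8"))
print(cleaned_text)
```

Both options give the same text here; decoding first is usually easier to reason about because the rest of the program then works with characters rather than bytes.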

User · Answer

In Beautiful Soup, you can pass get_text() the strip parameter, which strips white space from the beginning and end of the text. This will remove \xa0 or any other white space if it occurs at the start or end of the string. Beautiful Soup replaced an empty string with \xa0 and this solved the problem for me.

mytext = soup.get_text(strip=True)

User · Answer

Try this code:

import re
re.sub(r'[^\x00-\x7F]+','','paste your string here').decode('utf-8','ignore').strip()
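
The one-liner above is Python 2 style (it calls .decode() on the result). A rough Python 3 sketch of the same idea follows; the helper name `strip_non_ascii` and its `replacement` parameter are made up for illustration. Note that this approach removes every non-ASCII character, including accented letters, not only \xa0.

```python
import re

# Matches one or more characters outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7F]+")

def strip_non_ascii(text, replacement=""):
    """Replace every run of non-ASCII characters, then trim both ends."""
    return NON_ASCII.sub(replacement, text).strip()

sample = "Dear Parent,\u00a0This is a test\u00a0"
print(strip_non_ascii(sample))       # non-breaking spaces deleted
print(strip_non_ascii(sample, " "))  # non-breaking spaces turned into spaces
```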

User · Answer

Try using .strip() at the end of your line. line.strip() worked well for me.

User · Answer

\xa0 is actually non-breaking space in Latin1 (ISO 8859-1), also chr(160). You should replace it with a space.

string = string.replace(u'\xa0', u' ')

When you call .encode('utf-8'), it will encode the unicode to utf-8, which means every Unicode character is represented by 1 to 4 bytes. For this case, \xa0 is represented by 2 bytes, \xc2\xa0.

Read up on http://docs.python.org/howto/unicode.html.

Please note: this answer is from 2012. Python has moved on; you should be able to use unicodedata.normalize now.
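
As a small sketch of the replacement step, assuming Python 3 (where every str is Unicode and the u prefix is optional); the sample string is made up:

```python
NBSP = "\u00a0"  # the same character as "\xa0" and chr(160)

# Three spellings of one code point.
assert NBSP == "\xa0" == chr(160)

line = "price:\u00a042"          # hypothetical input containing one NBSP
cleaned = line.replace(NBSP, " ")

print(repr(line))
print(repr(cleaned))
```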

User · Answer

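Generic version with a regular expression. Note that the pattern matches escape sequences written out literally in the text (a backslash, an x, and the two characters after it), not an actual non-breaking space character:

import re
def remove_control_chart(s):
    return re.sub(r'\\x..', '', s)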
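
The difference between a literal escape sequence and the real character can be seen in a short sketch. It assumes Python 3, and it uses a slightly stricter pattern than the answer (two hex digits instead of any two characters); the function name `remove_hex_escapes` is made up for illustration.

```python
import re

def remove_hex_escapes(s):
    """Delete literal backslash-x escapes, i.e. 4-character runs like \\xa0."""
    return re.sub(r"\\x[0-9A-Fa-f]{2}", "", s)

escaped = r"Dear Parent,\xa0Thanks"   # backslash, x, a, 0 as plain text
real = "Dear Parent,\u00a0Thanks"     # one actual non-breaking space

print(remove_hex_escapes(escaped))
print(remove_hex_escapes(real) == real)  # the real character is not matched
```

So a pattern like this helps only when the text already contains escaped output (for example, something produced by repr); for real non-breaking spaces, a plain replace on "\xa0" is the relevant tool.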

User · Answer

It's the equivalent of a space character, so strip it (this only removes it from the start and end of the string):

print(string.strip())  # no more xa0
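
A quick check of what strip() does and does not remove, sketched for Python 3 with a made-up sample string:

```python
s = "\u00a0 Dear Parent,\u00a0Thanks\u00a0"

stripped = s.strip()
print(repr(stripped))

# Leading and trailing non-breaking spaces are gone...
assert not stripped.startswith("\u00a0")
assert not stripped.endswith("\u00a0")
# ...but the one in the middle of the string is untouched.
assert "\u00a0" in stripped
```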

User · Answer

Try this:

string.replace('\\xa0', ' ')

User · Answer

Python recognizes it as a space character, so you can split the string without arguments and join it with a normal space:

line = ' '.join(line.split())
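
A short sketch of the split-and-join idiom, assuming Python 3 and a made-up input. Keep in mind that it also collapses tabs, newlines, and runs of ordinary spaces into single spaces, which may or may not be wanted.

```python
line = "Dear Parent,\u00a0This  is\ta test\u00a0\u00a0message\n"

# split() with no arguments splits on any run of Unicode whitespace,
# which includes the non-breaking space.
normalized = " ".join(line.split())

print(normalized)
```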

User · Answer

You can try string.strip(). It worked for me.

User · Answer

0xA0 (Unicode) is 0xC2A0 in UTF-8. .encode('utf8') will just take your Unicode 0xA0 and replace it with UTF-8's 0xC2A0. Hence the apparition of 0xC2s. Encoding is not replacing, as you've probably realized now.
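
The byte-level view can be checked directly. This is a sketch assuming Python 3, with a made-up three-character string:

```python
text = "a\u00a0b"

encoded = text.encode("utf-8")
print(encoded)  # the NBSP is now the two bytes C2 A0

# Encoding changes the representation, not the content: decoding round-trips.
assert encoded == b"a\xc2\xa0b"
assert encoded.decode("utf-8") == text

# In Latin-1 the same character is the single byte A0.
assert text.encode("latin-1") == b"a\xa0b"

# Replacing first leaves only ASCII, so no C2 byte shows up.
print(text.replace("\u00a0", " ").encode("utf-8"))
```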

User · Answer

There are many useful things in Python's unicodedata library. One of them is the .normalize() function.

Try:

new_str = unicodedata.normalize("NFKD", unicode_str)

Replace NFKD with one of the other normalization forms (NFC, NFKC, NFD) if you don't get the results you're after.
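
To see which normalization forms actually affect the non-breaking space, here is a small sketch assuming Python 3 and a made-up sample string. The compatibility forms (NFKC and NFKD) map it to an ordinary space, while NFC and NFD leave it alone. The compatibility forms also rewrite other characters (ligatures, full-width letters, and so on), so they change more than just \xa0.

```python
import unicodedata

text = "Dear Parent,\u00a0Thanks"

# Report, for each form, whether the NBSP survives normalization.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    result = unicodedata.normalize(form, text)
    print(form, "\u00a0" in result)

clean = unicodedata.normalize("NFKD", text)
```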

User · Answer

After trying several methods, here is a summary of how I did it. Following are two ways of avoiding or removing \xa0 characters from a parsed HTML string.

Assume we have our raw HTML as follows:

raw_html = '<p>Dear Parent,&nbsp;</p><p><span style="font-size: 1rem;">This is a test message,&nbsp;</span><span style="font-size: 1rem;">kindly ignore it.&nbsp;</span></p><p><span style="font-size: 1rem;">Thanks</span></p>'

So let's try to clean this HTML string:

from bs4 import BeautifulSoup
text_string = BeautifulSoup(raw_html, "lxml").text
print repr(text_string)
# u'Dear Parent,\xa0This is a test message,\xa0kindly ignore it.\xa0Thanks'

The above code produces these \xa0 characters in the string. To remove them properly, we can use two ways.

Method # 1 (Recommended): The first one is BeautifulSoup's get_text method with the strip argument set to True. So our code becomes:

clean_text = BeautifulSoup(raw_html, "lxml").get_text(strip=True)
print clean_text
# Dear Parent,This is a test message,kindly ignore it.Thanks

Method # 2: The other option is to use Python's unicodedata library. NFKD normalization turns each \xa0 into an ordinary space, so the spaces are kept:

import unicodedata
text_string = BeautifulSoup(raw_html, "lxml").text
clean_text = unicodedata.normalize("NFKD", text_string)
print repr(clean_text)
# u'Dear Parent, This is a test message, kindly ignore it. Thanks'

I have also detailed these methods on a blog post, which you may want to refer to.

User · Answer

I ran into this same problem pulling some data from a sqlite3 database with Python. The above answers didn't work for me (not sure why), but this did:

line = line.decode('ascii', 'ignore')

However, my goal was deleting the \xa0s, rather than replacing them with spaces. I got this from this super-helpful Unicode tutorial by Ned Batchelder.
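
In Python 3 only bytes have .decode(), so a rough equivalent looks like the sketch below (the sample text is made up). It shows the side effect of this approach: every non-ASCII character is dropped, not only the non-breaking space.

```python
text = "Dzie\u0144\u00a0dobry"  # "Dzień dobry" with a non-breaking space

# Starting from bytes, as in the answer: undecodable bytes are skipped.
raw = text.encode("utf-8")
from_bytes = raw.decode("ascii", "ignore")

# Starting from str: encode to ASCII, skipping what does not fit.
from_str = text.encode("ascii", "ignore").decode("ascii")

print(from_bytes)
print(from_str)
```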
