u'\ufeff' in Python string

Question

I got an error with the following exception message:

    UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 155: ordinal not in range(128)

Not sure what u'\ufeff' is; it shows up when I'm web scraping. How can I remedy the situation? The .replace() string method doesn't work on it.

User · Answer

That character is the BOM, or "Byte Order Mark". It is usually received as the first few bytes of a file, telling you how to interpret the encoding of the rest of the data. You can simply remove the character to continue. Although, since the error says you were trying to convert to 'ascii', you should probably pick another encoding for whatever you were trying to do.
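A minimal Python 3 sketch of both suggestions, removing the character or encoding with a codec that can represent it. The `scraped` string is a made-up stand-in for real scraped text.

```python
# Hypothetical scraped text that starts with a BOM (U+FEFF).
scraped = "\ufeffHello, world"

# Option 1: drop the BOM character before further processing.
cleaned = scraped.replace("\ufeff", "")

# Option 2: keep the text as is and encode it with a codec that can
# represent U+FEFF, instead of 'ascii'.
encoded = scraped.encode("utf-8")

print(repr(cleaned))  # 'Hello, world'
print(encoded)        # b'\xef\xbb\xbfHello, world'
```

On Python 2, one possible reason .replace() appears to do nothing is that the pattern was written as a plain byte string: '\ufeff' there is a literal backslash followed by "ufeff", so the pattern would need to be the unicode literal u'\ufeff'.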

User · Answer

I ran into this on Python 3 and found this question (and solution). When opening a file, Python 3 supports the encoding keyword to automatically handle the encoding.

Without it, the BOM is included in the read result:

    >>> f = open('file', mode='r')
    >>> f.read()
    '\ufefftest'

Giving the correct encoding, the BOM is omitted in the result:

    >>> f = open('file', mode='r', encoding='utf-8-sig')
    >>> f.read()
    'test'

Just my 2 cents.
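To try this without an existing file, here is a self-contained sketch that writes a temporary file starting with a UTF-8 BOM and reads it back with both codecs. The temporary file and the variable names are illustrative only.

```python
import os
import tempfile

# Write a small file that begins with a UTF-8 BOM (bytes EF BB BF).
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"\xef\xbb\xbftest")
    path = tmp.name

try:
    # Plain utf-8 keeps the BOM as the first character of the text.
    with open(path, mode="r", encoding="utf-8") as f:
        with_bom = f.read()
    # utf-8-sig strips a leading BOM if one is present.
    with open(path, mode="r", encoding="utf-8-sig") as f:
        without_bom = f.read()
finally:
    os.remove(path)

print(repr(with_bom))     # '\ufefftest'
print(repr(without_bom))  # 'test'
```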

User · Answer

The content you're scraping is encoded in Unicode rather than ASCII text, and you're getting a character that doesn't convert to ASCII. The right "translation" depends on what the original web page thought it was. Python's unicode page gives the background on how it works.

Are you trying to print the result or stick it in a file? The error suggests it's writing the data that's causing the problem, not reading it. This question is a good place to look for the fixes.
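A rough Python 3 sketch of the writing side: encoding to ASCII fails on U+FEFF, while writing through a stream opened with an explicit UTF-8 encoding succeeds. The `text` value and the in-memory buffer are assumptions made for illustration; a real program would typically pass `encoding="utf-8"` to `open()` instead.

```python
import io

text = "\ufeffscraped text"  # hypothetical scraped content

# Encoding to ASCII reproduces the kind of error from the question.
error_message = None
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    error_message = str(exc)

# Writing through a text stream with an explicit encoding works.
buffer = io.BytesIO()
writer = io.TextIOWrapper(buffer, encoding="utf-8")
writer.write(text)
writer.flush()
written = buffer.getvalue()

print(error_message)
print(written)
```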

User · Answer

This problem basically arises when you save your Python code in a UTF-8 or UTF-16 encoding, because a special character is automatically added at the beginning of the file to identify the encoding format (text editors do not show it). But when you try to execute the code, it gives you a syntax error in line 1, i.e. the start of the code, because the Python compiler expects ASCII encoding. When you view the contents of the file using the read() function, you can see '\ufeff' at the beginning of the returned text.

The simplest solution to this problem is to change the encoding back to ASCII: copy your code into Notepad and save it (remember to choose the ASCII encoding).

Hope this will help.
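As an alternative to re-saving the file by hand, the leading UTF-8 BOM can be removed programmatically. This is only a sketch: `strip_utf8_bom` is a hypothetical helper, not a standard function, and it handles only the UTF-8 BOM, not UTF-16.

```python
import codecs
import os
import tempfile


def strip_utf8_bom(path):
    """Rewrite the file at `path` without a leading UTF-8 BOM.

    Returns True if a BOM was found and removed, False otherwise.
    """
    with open(path, "rb") as f:
        data = f.read()
    if data.startswith(codecs.BOM_UTF8):
        with open(path, "wb") as f:
            f.write(data[len(codecs.BOM_UTF8):])
        return True
    return False


# Demo on a temporary source file that starts with a BOM.
with tempfile.NamedTemporaryFile(delete=False, suffix=".py") as tmp:
    tmp.write(codecs.BOM_UTF8 + b"print('hello')\n")
    path = tmp.name

try:
    changed = strip_utf8_bom(path)
    with open(path, "rb") as f:
        remaining = f.read()
finally:
    os.remove(path)

print(changed, remaining)
```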

User · Answer

The Unicode character U+FEFF is the byte order mark, or BOM, and is used to tell the difference between big- and little-endian UTF-16 encoding. If you decode the web page using the right codec, Python will remove it for you. Examples:

    #!python2
    #coding: utf8
    u = u'ABC'
    e8 = u.encode('utf-8')        # encode without BOM
    e8s = u.encode('utf-8-sig')   # encode with BOM
    e16 = u.encode('utf-16')      # encode with BOM
    e16le = u.encode('utf-16le')  # encode without BOM
    e16be = u.encode('utf-16be')  # encode without BOM
    print 'utf-8     %r' % e8
    print 'utf-8-sig %r' % e8s
    print 'utf-16    %r' % e16
    print 'utf-16le  %r' % e16le
    print 'utf-16be  %r' % e16be
    print
    print "utf-8  w/ BOM decoded with utf-8     %r" % e8s.decode('utf-8')
    print "utf-8  w/ BOM decoded with utf-8-sig %r" % e8s.decode('utf-8-sig')
    print "utf-16 w/ BOM decoded with utf-16    %r" % e16.decode('utf-16')
    print "utf-16 w/ BOM decoded with utf-16le  %r" % e16.decode('utf-16le')

Note that EF BB BF is a UTF-8-encoded BOM. It is not required for UTF-8, but serves only as a signature (usually on Windows).

Output:

    utf-8     'ABC'
    utf-8-sig '\xef\xbb\xbfABC'
    utf-16    '\xff\xfeA\x00B\x00C\x00'    # Adds BOM and encodes using native processor endian-ness.
    utf-16le  'A\x00B\x00C\x00'
    utf-16be  '\x00A\x00B\x00C'

    utf-8  w/ BOM decoded with utf-8     u'\ufeffABC'    # doesn't remove BOM if present.
    utf-8  w/ BOM decoded with utf-8-sig u'ABC'          # removes BOM if present.
    utf-16 w/ BOM decoded with utf-16    u'ABC'          # *requires* BOM to be present.
    utf-16 w/ BOM decoded with utf-16le  u'\ufeffABC'    # doesn't remove BOM if present.

Note that the utf-16 codec requires BOM to be present, or Python won't know if the data is big- or little-endian.
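For scraped bytes whose encoding is not known in advance, "using the right codec" could be approximated by looking at the BOM itself. The following Python 3 sketch uses a hypothetical helper, `decode_with_bom`, that only distinguishes UTF-8 and UTF-16 and falls back to a default codec otherwise. It is an illustration, not a full encoding detector.

```python
import codecs


def decode_with_bom(data, default="utf-8"):
    """Decode bytes, letting a leading BOM (if any) choose the codec."""
    if data.startswith(codecs.BOM_UTF8):
        # utf-8-sig strips the EF BB BF signature.
        return data.decode("utf-8-sig")
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The generic utf-16 codec reads the BOM to pick the byte order
        # and removes it from the result.
        return data.decode("utf-16")
    # No recognised BOM: fall back to the assumed default encoding.
    return data.decode(default)


print(repr(decode_with_bom("ABC".encode("utf-8-sig"))))  # 'ABC'
print(repr(decode_with_bom("ABC".encode("utf-16"))))     # 'ABC'
print(repr(decode_with_bom(b"ABC")))                     # 'ABC'
```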

User · Answer

This is based on the answer from Mark Tolonen. The string contains the word "test" in different languages, separated by "|", so you can see the difference.

    u = u'ABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'
    e8 = u.encode('utf-8')        # encode without BOM
    e8s = u.encode('utf-8-sig')   # encode with BOM
    e16 = u.encode('utf-16')      # encode with BOM
    e16le = u.encode('utf-16le')  # encode without BOM
    e16be = u.encode('utf-16be')  # encode without BOM
    print('utf-8     %r' % e8)
    print('utf-8-sig %r' % e8s)
    print('utf-16    %r' % e16)
    print('utf-16le  %r' % e16le)
    print('utf-16be  %r' % e16be)
    print()
    print("utf-8  w/ BOM decoded with utf-8     %r" % e8s.decode('utf-8'))
    print("utf-8  w/ BOM decoded with utf-8-sig %r" % e8s.decode('utf-8-sig'))
    print("utf-16 w/ BOM decoded with utf-16    %r" % e16.decode('utf-16'))
    print("utf-16 w/ BOM decoded with utf-16le  %r" % e16.decode('utf-16le'))

Here is a test run:

    >>> u = u'ABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'
    >>> e8 = u.encode('utf-8')        # encode without BOM
    >>> e8s = u.encode('utf-8-sig')   # encode with BOM
    >>> e16 = u.encode('utf-16')      # encode with BOM
    >>> e16le = u.encode('utf-16le')  # encode without BOM
    >>> e16be = u.encode('utf-16be')  # encode without BOM
    >>> print('utf-8     %r' % e8)
    utf-8     b'ABCtest\xce\xb2\xe8\xb2\x9d\xe5\xa1\x94\xec\x9c\x84m\xc3\xa1sb\xc3\xaata|test|\xd8\xa7\xd8\xae\xd8\xaa\xd8\xa8\xd8\xa7\xd8\xb1|\xe6\xb5\x8b\xe8\xaf\x95|\xe6\xb8\xac\xe8\xa9\xa6|\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88|\xe0\xa4\xaa\xe0\xa4\xb0\xe0\xa5\x80\xe0\xa4\x95\xe0\xa5\x8d\xe0\xa4\xb7\xe0\xa4\xbe|\xe0\xb4\xaa\xe0\xb4\xb0\xe0\xb4\xbf\xe0\xb4\xb6\xe0\xb5\x8b\xe0\xb4\xa7\xe0\xb4\xa8|\xd7\xa4\xd6\xbc\xd7\xa8\xd7\x95\xd7\x91\xd7\x99\xd7\xa8\xd7\x9f|ki\xe1\xbb\x83m tra|\xc3\x96l\xc3\xa7ek|'
    >>> print('utf-8-sig %r' % e8s)
    utf-8-sig b'\xef\xbb\xbfABCtest\xce\xb2\xe8\xb2\x9d\xe5\xa1\x94\xec\x9c\x84m\xc3\xa1sb\xc3\xaata|test|\xd8\xa7\xd8\xae\xd8\xaa\xd8\xa8\xd8\xa7\xd8\xb1|\xe6\xb5\x8b\xe8\xaf\x95|\xe6\xb8\xac\xe8\xa9\xa6|\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88|\xe0\xa4\xaa\xe0\xa4\xb0\xe0\xa5\x80\xe0\xa4\x95\xe0\xa5\x8d\xe0\xa4\xb7\xe0\xa4\xbe|\xe0\xb4\xaa\xe0\xb4\xb0\xe0\xb4\xbf\xe0\xb4\xb6\xe0\xb5\x8b\xe0\xb4\xa7\xe0\xb4\xa8|\xd7\xa4\xd6\xbc\xd7\xa8\xd7\x95\xd7\x91\xd7\x99\xd7\xa8\xd7\x9f|ki\xe1\xbb\x83m tra|\xc3\x96l\xc3\xa7ek|'
    >>> print('utf-16    %r' % e16)
    utf-16    b"\xff\xfeA\x00B\x00C\x00t\x00e\x00s\x00t\x00\xb2\x03\x9d\x8cTX\x04\xc7m\x00\xe1\x00s\x00b\x00\xea\x00t\x00a\x00|\x00t\x00e\x00s\x00t\x00|\x00'\x06.\x06*\x06(\x06'\x061\x06|\x00Km\xd5\x8b|\x00,nf\x8a|\x00\xc60\xb90\xc80|\x00*\t0\t@\t\x15\tM\t7\t>\t|\x00*\r0\r?\r6\rK\r'\r(\r|\x00\xe4\x05\xbc\x05\xe8\x05\xd5\x05\xd1\x05\xd9\x05\xe8\x05\xdf\x05|\x00k\x00i\x00\xc3\x1em\x00 \x00t\x00r\x00a\x00|\x00\xd6\x00l\x00\xe7\x00e\x00k\x00|\x00"
    >>> print('utf-16le  %r' % e16le)
    utf-16le  b"A\x00B\x00C\x00t\x00e\x00s\x00t\x00\xb2\x03\x9d\x8cTX\x04\xc7m\x00\xe1\x00s\x00b\x00\xea\x00t\x00a\x00|\x00t\x00e\x00s\x00t\x00|\x00'\x06.\x06*\x06(\x06'\x061\x06|\x00Km\xd5\x8b|\x00,nf\x8a|\x00\xc60\xb90\xc80|\x00*\t0\t@\t\x15\tM\t7\t>\t|\x00*\r0\r?\r6\rK\r'\r(\r|\x00\xe4\x05\xbc\x05\xe8\x05\xd5\x05\xd1\x05\xd9\x05\xe8\x05\xdf\x05|\x00k\x00i\x00\xc3\x1em\x00 \x00t\x00r\x00a\x00|\x00\xd6\x00l\x00\xe7\x00e\x00k\x00|\x00"
    >>> print('utf-16be  %r' % e16be)
    utf-16be  b"\x00A\x00B\x00C\x00t\x00e\x00s\x00t\x03\xb2\x8c\x9dXT\xc7\x04\x00m\x00\xe1\x00s\x00b\x00\xea\x00t\x00a\x00|\x00t\x00e\x00s\x00t\x00|\x06'\x06.\x06*\x06(\x06'\x061\x00|mK\x8b\xd5\x00|n,\x8af\x00|0\xc60\xb90\xc8\x00|\t*\t0\t@\t\x15\tM\t7\t>\x00|\r*\r0\r?\r6\rK\r'\r(\x00|\x05\xe4\x05\xbc\x05\xe8\x05\xd5\x05\xd1\x05\xd9\x05\xe8\x05\xdf\x00|\x00k\x00i\x1e\xc3\x00m\x00 \x00t\x00r\x00a\x00|\x00\xd6\x00l\x00\xe7\x00e\x00k\x00|"
    >>> print()

    >>> print("utf-8  w/ BOM decoded with utf-8     %r" % e8s.decode('utf-8'))
    utf-8  w/ BOM decoded with utf-8     '\ufeffABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'
    >>> print("utf-8  w/ BOM decoded with utf-8-sig %r" % e8s.decode('utf-8-sig'))
    utf-8  w/ BOM decoded with utf-8-sig 'ABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'
    >>> print("utf-16 w/ BOM decoded with utf-16    %r" % e16.decode('utf-16'))
    utf-16 w/ BOM decoded with utf-16    'ABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'
    >>> print("utf-16 w/ BOM decoded with utf-16le  %r" % e16.decode('utf-16le'))
    utf-16 w/ BOM decoded with utf-16le  '\ufeffABCtestβ貝塔위másbêta|test|اختبار|测试|測試|テスト|परीक्षा|പരിശോധന|פּרובירן|kiểm tra|Ölçek|'

It's worth knowing that, for data encoded with a BOM, only decoding with utf-8-sig or utf-16 gives back the original string.
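The round-trip behaviour can also be checked programmatically rather than by reading the printed output. This is a small Python 3 sketch; the short sample string and the `pairs` list are illustrative choices that mirror the four decode cases in the answers.

```python
text = "ABCtest"

# (codec used to encode, codec used to decode)
pairs = [
    ("utf-8-sig", "utf-8"),
    ("utf-8-sig", "utf-8-sig"),
    ("utf-16", "utf-16"),
    ("utf-16", "utf-16le"),
]

results = {}
for enc, dec in pairs:
    decoded = text.encode(enc).decode(dec)
    # True only when decoding returns exactly the original string.
    results[(enc, dec)] = decoded == text

for (enc, dec), ok in results.items():
    print("%-9s -> %-9s round-trips: %s" % (enc, dec, ok))
```

Only the pairs whose decoder consumes the BOM (utf-8-sig and utf-16) report a clean round trip; the other two leave extra characters at the start of the result.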
