[python] u'\ufeff' in Python string

I got an error with the following exception message:

UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in
position 155: ordinal not in range(128)

Not sure what u'\ufeff' is, it shows up when I'm web scraping. How can I remedy the situation? The .replace() string method doesn't work on it.

This question is related to python unicode utf-8

The answer is


I ran into this on Python 3 and found this question (and solution). When opening a file, Python 3 supports the encoding keyword to automatically handle the encoding.

Without it, the BOM is included in the read result:

>>> f = open('file', mode='r')
>>> f.read()
'\ufefftest'

Giving the correct encoding, the BOM is omitted in the result:

>>> f = open('file', mode='r', encoding='utf-8-sig')
>>> f.read()
'test'

Just my 2 cents.


The content you're scraping is encoded in unicode rather than ascii text, and you're getting a character that doesn't convert to ascii. The right 'translation' depends on what the original web page thought it was. Python's unicode page gives the background on how it works.

Are you trying to print the result or stick it in a file? The error suggests it's writing the data that's causing the problem, not reading it. This question is a good place to look for the fixes.


This problem arise basically when you save your python code in a UTF-8 or UTF-16 encoding because python add some special character at the beginning of the code automatically (which is not shown by the text editors) to identify the encoding format. But, when you try to execute the code it gives you the syntax error in line 1 i.e, start of code because python compiler understands ASCII encoding. when you view the code of file using read() function you can see at the begin of the returned code '\ufeff' is shown. The one simplest solution to this problem is just by changing the encoding back to ASCII encoding(for this you can copy your code to a notepad and save it Remember! choose the ASCII encoding... Hope this will help.


Here is based on the answer from Mark Tolonen. The string included different languages of the word 'test' that's separated by '|', so you can see the difference.

u = u'ABCtestß???másbêta|test|??????|??|??|???|???????|???????|????????|ki?m tra|Ölçek|'
e8 = u.encode('utf-8')        # encode without BOM
e8s = u.encode('utf-8-sig')   # encode with BOM
e16 = u.encode('utf-16')      # encode with BOM
e16le = u.encode('utf-16le')  # encode without BOM
e16be = u.encode('utf-16be')  # encode without BOM
print('utf-8     %r' % e8)
print('utf-8-sig %r' % e8s)
print('utf-16    %r' % e16)
print('utf-16le  %r' % e16le)
print('utf-16be  %r' % e16be)
print()
print('utf-8  w/ BOM decoded with utf-8     %r' % e8s.decode('utf-8'))
print('utf-8  w/ BOM decoded with utf-8-sig %r' % e8s.decode('utf-8-sig'))
print('utf-16 w/ BOM decoded with utf-16    %r' % e16.decode('utf-16'))
print('utf-16 w/ BOM decoded with utf-16le  %r' % e16.decode('utf-16le'))

Here is a test run:

>>> u = u'ABCtestß???másbêta|test|??????|??|??|???|???????|???????|????????|ki?m tra|Ölçek|'
>>> e8 = u.encode('utf-8')        # encode without BOM
>>> e8s = u.encode('utf-8-sig')   # encode with BOM
>>> e16 = u.encode('utf-16')      # encode with BOM
>>> e16le = u.encode('utf-16le')  # encode without BOM
>>> e16be = u.encode('utf-16be')  # encode without BOM
>>> print('utf-8     %r' % e8)
utf-8     b'ABCtest\xce\xb2\xe8\xb2\x9d\xe5\xa1\x94\xec\x9c\x84m\xc3\xa1sb\xc3\xaata|test|\xd8\xa7\xd8\xae\xd8\xaa\xd8\xa8\xd8\xa7\xd8\xb1|\xe6\xb5\x8b\xe8\xaf\x95|\xe6\xb8\xac\xe8\xa9\xa6|\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88|\xe0\xa4\xaa\xe0\xa4\xb0\xe0\xa5\x80\xe0\xa4\x95\xe0\xa5\x8d\xe0\xa4\xb7\xe0\xa4\xbe|\xe0\xb4\xaa\xe0\xb4\xb0\xe0\xb4\xbf\xe0\xb4\xb6\xe0\xb5\x8b\xe0\xb4\xa7\xe0\xb4\xa8|\xd7\xa4\xd6\xbc\xd7\xa8\xd7\x95\xd7\x91\xd7\x99\xd7\xa8\xd7\x9f|ki\xe1\xbb\x83m tra|\xc3\x96l\xc3\xa7ek|'
>>> print('utf-8-sig %r' % e8s)
utf-8-sig b'\xef\xbb\xbfABCtest\xce\xb2\xe8\xb2\x9d\xe5\xa1\x94\xec\x9c\x84m\xc3\xa1sb\xc3\xaata|test|\xd8\xa7\xd8\xae\xd8\xaa\xd8\xa8\xd8\xa7\xd8\xb1|\xe6\xb5\x8b\xe8\xaf\x95|\xe6\xb8\xac\xe8\xa9\xa6|\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88|\xe0\xa4\xaa\xe0\xa4\xb0\xe0\xa5\x80\xe0\xa4\x95\xe0\xa5\x8d\xe0\xa4\xb7\xe0\xa4\xbe|\xe0\xb4\xaa\xe0\xb4\xb0\xe0\xb4\xbf\xe0\xb4\xb6\xe0\xb5\x8b\xe0\xb4\xa7\xe0\xb4\xa8|\xd7\xa4\xd6\xbc\xd7\xa8\xd7\x95\xd7\x91\xd7\x99\xd7\xa8\xd7\x9f|ki\xe1\xbb\x83m tra|\xc3\x96l\xc3\xa7ek|'
>>> print('utf-16    %r' % e16)
utf-16    b"\xff\xfeA\x00B\x00C\x00t\x00e\x00s\x00t\x00\xb2\x03\x9d\x8cTX\x04\xc7m\x00\xe1\x00s\x00b\x00\xea\x00t\x00a\x00|\x00t\x00e\x00s\x00t\x00|\x00'\x06.\x06*\x06(\x06'\x061\x06|\x00Km\xd5\x8b|\x00,nf\x8a|\x00\xc60\xb90\xc80|\x00*\t0\t@\t\x15\tM\t7\t>\t|\x00*\r0\r?\r6\rK\r'\r(\r|\x00\xe4\x05\xbc\x05\xe8\x05\xd5\x05\xd1\x05\xd9\x05\xe8\x05\xdf\x05|\x00k\x00i\x00\xc3\x1em\x00 \x00t\x00r\x00a\x00|\x00\xd6\x00l\x00\xe7\x00e\x00k\x00|\x00"
>>> print('utf-16le  %r' % e16le)
utf-16le  b"A\x00B\x00C\x00t\x00e\x00s\x00t\x00\xb2\x03\x9d\x8cTX\x04\xc7m\x00\xe1\x00s\x00b\x00\xea\x00t\x00a\x00|\x00t\x00e\x00s\x00t\x00|\x00'\x06.\x06*\x06(\x06'\x061\x06|\x00Km\xd5\x8b|\x00,nf\x8a|\x00\xc60\xb90\xc80|\x00*\t0\t@\t\x15\tM\t7\t>\t|\x00*\r0\r?\r6\rK\r'\r(\r|\x00\xe4\x05\xbc\x05\xe8\x05\xd5\x05\xd1\x05\xd9\x05\xe8\x05\xdf\x05|\x00k\x00i\x00\xc3\x1em\x00 \x00t\x00r\x00a\x00|\x00\xd6\x00l\x00\xe7\x00e\x00k\x00|\x00"
>>> print('utf-16be  %r' % e16be)
utf-16be  b"\x00A\x00B\x00C\x00t\x00e\x00s\x00t\x03\xb2\x8c\x9dXT\xc7\x04\x00m\x00\xe1\x00s\x00b\x00\xea\x00t\x00a\x00|\x00t\x00e\x00s\x00t\x00|\x06'\x06.\x06*\x06(\x06'\x061\x00|mK\x8b\xd5\x00|n,\x8af\x00|0\xc60\xb90\xc8\x00|\t*\t0\t@\t\x15\tM\t7\t>\x00|\r*\r0\r?\r6\rK\r'\r(\x00|\x05\xe4\x05\xbc\x05\xe8\x05\xd5\x05\xd1\x05\xd9\x05\xe8\x05\xdf\x00|\x00k\x00i\x1e\xc3\x00m\x00 \x00t\x00r\x00a\x00|\x00\xd6\x00l\x00\xe7\x00e\x00k\x00|"
>>> print()

>>> print('utf-8  w/ BOM decoded with utf-8     %r' % e8s.decode('utf-8'))
utf-8  w/ BOM decoded with utf-8     '\ufeffABCtestß???másbêta|test|??????|??|??|???|???????|???????|????????|ki?m tra|Ölçek|'
>>> print('utf-8  w/ BOM decoded with utf-8-sig %r' % e8s.decode('utf-8-sig'))
utf-8  w/ BOM decoded with utf-8-sig 'ABCtestß???másbêta|test|??????|??|??|???|???????|???????|????????|ki?m tra|Ölçek|'
>>> print('utf-16 w/ BOM decoded with utf-16    %r' % e16.decode('utf-16'))
utf-16 w/ BOM decoded with utf-16    'ABCtestß???másbêta|test|??????|??|??|???|???????|???????|????????|ki?m tra|Ölçek|'
>>> print('utf-16 w/ BOM decoded with utf-16le  %r' % e16.decode('utf-16le'))
utf-16 w/ BOM decoded with utf-16le  '\ufeffABCtestß???másbêta|test|??????|??|??|???|???????|???????|????????|ki?m tra|Ölçek|'

It's worth to know that only both utf-8-sig and utf-16 get back the original string after both encode and decode.


That character is the BOM or "Byte Order Mark". It is usually received as the first few bytes of a file, telling you how to interpret the encoding of the rest of the data. You can simply remove the character to continue. Although, since the error says you were trying to convert to 'ascii', you should probably pick another encoding for whatever you were trying to do.


Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to unicode

How to resolve TypeError: can only concatenate str (not "int") to str (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape UnicodeEncodeError: 'ascii' codec can't encode character at special name Python NLTK: SyntaxError: Non-ASCII character '\xc3' in file (Sentiment Analysis -NLP) HTML for the Pause symbol in audio and video control Javascript: Unicode string to hex Concrete Javascript Regex for Accented Characters (Diacritics) Replace non-ASCII characters with a single space UTF-8 in Windows 7 CMD NameError: global name 'unicode' is not defined - in Python 3

Examples related to utf-8

error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte Changing PowerShell's default output encoding to UTF-8 'Malformed UTF-8 characters, possibly incorrectly encoded' in Laravel Encoding Error in Panda read_csv Using Javascript's atob to decode base64 doesn't properly decode utf-8 strings What is the difference between utf8mb4 and utf8 charsets in MySQL? what is <meta charset="utf-8">? Pandas df.to_csv("file.csv" encode="utf-8") still gives trash characters for minus sign UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 23: ordinal not in range(128) Android Studio : unmappable character for encoding UTF-8