[python] UnicodeEncodeError: 'charmap' codec can't encode - character maps to <undefined>, print function

I am writing a Python (Python 3.3) program to send some data to a webpage using the POST method. Mostly for debugging, I fetch the page result and display it on the screen using the print() function.

The code is like this:

conn.request("POST", resource, params, headers)
response = conn.getresponse()
print(response.status, response.reason)
data = response.read()
print(data.decode('utf-8'))

The HTTPResponse .read() method returns a bytes object encoding the page (which is a well-formed UTF-8 document). It seemed okay until I stopped using the IDLE GUI for Windows and used the Windows console instead. The returned page has a U+2014 character (em dash), which the print function handles well in the Windows GUI (I presume code page 1252) but not in the Windows console (code page 850). Given the strict default behavior, I get the following error:

UnicodeEncodeError: 'charmap' codec can't encode character '\u2014' in position 10248: character maps to <undefined>

I could fix it using this quite ugly code:

print(data.decode('utf-8').encode('cp850','replace').decode('cp850'))

Now it replaces the offending character "—" with a ?. Not the ideal case (a hyphen would be a better replacement) but good enough for my purpose.

There are several things I do not like about my solution.

  1. The code is ugly with all that decoding, encoding, and decoding.
  2. It solves the problem for just this case. If I port the program to a system using some other encoding (latin-1, cp437, back to cp1252, etc.), it should recognize the target encoding. It does not. (For instance, when using the IDLE GUI again, the em dash is also lost, which didn't happen before.)
  3. It would be nicer if the em dash translated to a hyphen instead of a question mark.

The problem is not the em dash (I can think of several ways to solve that particular problem); I need to write robust code. I am feeding the page with data from a database, and that data can come back. I can anticipate many other conflicting cases: an 'Á' (U+00C1, which is possible in my database) could translate into CP-850 (the DOS/Windows console encoding for Western European languages) but not into CP-437 (the encoding for US English, which is the default in many Windows installations).

So, the question:

Is there a nicer solution that makes my code agnostic to the output interface encoding?


The answer is


If you are using the Windows command line to print the data, switch the console code page to UTF-8 before running the script:

chcp 65001

This worked for me!


I see three solutions to this:

  1. Change the output encoding so that it always outputs UTF-8. See e.g. "Setting the correct encoding when piping stdout in Python", but I could not get these examples to work.

  2. The following example code (Python 2 syntax) makes the output aware of your target charset.

    # -*- coding: utf-8 -*-
    import sys
    
    print sys.stdout.encoding
    print u"Stöcker".encode(sys.stdout.encoding, errors='replace')
    print u"Стоескер".encode(sys.stdout.encoding, errors='replace')
    

    This example properly replaces any non-printable character in my name with a question mark.

    If you create a custom print function, e.g. called myprint, that uses this mechanism to encode output properly, you can simply replace print with myprint wherever necessary without making the whole code look ugly.

  3. Reset the output encoding globally at the beginning of the program:

    The page http://www.macfreek.nl/memory/Encoding_of_Python_stdout has a good summary of what to do to change the output encoding. The section "StreamWriter Wrapper around Stdout" is especially interesting. Essentially it says to change the I/O encoding function like this:

    In Python 2:

    import codecs
    import sys

    if sys.stdout.encoding != 'cp850':
        sys.stdout = codecs.getwriter('cp850')(sys.stdout, 'strict')
    if sys.stderr.encoding != 'cp850':
        sys.stderr = codecs.getwriter('cp850')(sys.stderr, 'strict')
    

    In Python 3:

    import codecs
    import sys

    if sys.stdout.encoding != 'cp850':
        sys.stdout = codecs.getwriter('cp850')(sys.stdout.buffer, 'strict')
    if sys.stderr.encoding != 'cp850':
        sys.stderr = codecs.getwriter('cp850')(sys.stderr.buffer, 'strict')
    

    If used in CGI to output HTML, you can replace 'strict' with 'xmlcharrefreplace' to get numeric character references (such as &#8212;) for characters that cannot be encoded.

    Feel free to modify these approaches, for example by setting different encodings. Note that it still won't work for data whose encoding is not specified: any data, input, or text must be correctly convertible to Unicode:

    # -*- coding: utf-8 -*-
    import sys
    import codecs
    sys.stdout = codecs.getwriter("iso-8859-1")(sys.stdout, 'xmlcharrefreplace')
    print u"Stöcker"                # works
    print "Stöcker".decode("utf-8") # works
    print "Stöcker"                 # fails
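
The error handlers mentioned in option 3 ('strict', 'replace', 'xmlcharrefreplace') can be compared side by side. The following is a minimal Python 3 sketch and not part of the original answer: the helper name try_handlers and the sample string are made up for illustration, and cp850 is used only because it is the code page from the question.

```python
def try_handlers(text, encoding):
    """Encode text with several standard error handlers and collect the results."""
    results = {}
    for handler in ("strict", "replace", "backslashreplace", "xmlcharrefreplace"):
        try:
            results[handler] = text.encode(encoding, errors=handler)
        except UnicodeEncodeError as exc:
            # 'strict' raises instead of substituting anything
            results[handler] = exc
    return results


# U+2014 (em dash) has no mapping in cp850
for name, outcome in try_handlers("a\u2014b", "cp850").items():
    print(name, "->", repr(outcome))
```

Only 'strict' fails; the other handlers substitute a question mark, a backslash escape, or a numeric character reference respectively, so the choice depends on where the output ends up (console, log file, HTML).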
    

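If you use Python 3.6 (possibly 3.5 or later), you may no longer get this error. I had a similar issue while using v3.4, but it went away after I uninstalled it and installed the newer version. This is consistent with PEP 528, which changed the Windows console encoding to UTF-8 in Python 3.6.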


For debugging purposes, you could use print(repr(data)).

To display text, always print Unicode. Don't hardcode the character encoding of your environment, such as cp850, inside your script. To decode the HTTP response, see "A good way to get the charset/encoding of an HTTP response in Python".

To print Unicode to the Windows console, you could use the win-unicode-console package.
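
One way to follow the "don't hardcode the encoding" advice is to leave the stream's own encoding alone and change only its error handler. The sketch below is an illustration under assumptions, not part of the original answer: the helper name make_lenient is hypothetical, and it relies on TextIOWrapper.reconfigure(), which exists only in Python 3.7 and later (so not in the Python 3.3 from the question).

```python
import io
import sys


def make_lenient(stream, errors="backslashreplace"):
    """Keep the stream's own encoding but stop it raising on unencodable text.

    Returns True if the stream was changed, False if it has no reconfigure().
    """
    # TextIOWrapper.reconfigure() was added in Python 3.7; other stream
    # objects (for example IDLE's stdout replacement) may not provide it.
    if hasattr(stream, "reconfigure"):
        stream.reconfigure(errors=errors)
        return True
    return False


# In a real script, one call at startup is enough.
make_lenient(sys.stdout)

# An in-memory stream stands in for a cp850 console here.
fake_console = io.TextIOWrapper(io.BytesIO(), encoding="cp850")
make_lenient(fake_console)
print("before\u2014after", file=fake_console)
fake_console.flush()
print(fake_console.buffer.getvalue())
```

Because the encoding is never named in the script, the same code behaves sensibly on cp850, cp437, cp1252, or UTF-8 consoles; only characters the active code page cannot represent are escaped.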


Based on Dirk Stöcker's answer, here's a neat wrapper function for Python 3's print function. Use it just like you would use print.

As an added bonus, compared to the other answers, this won't print your text as a bytes literal (b'content') but as a normal string ('content'), because of the last decode step.

import sys

def uprint(*objects, sep=' ', end='\n', file=sys.stdout):
    enc = file.encoding
    if enc == 'UTF-8':
        print(*objects, sep=sep, end=end, file=file)
    else:
        f = lambda obj: str(obj).encode(enc, errors='backslashreplace').decode(enc)
        print(*map(f, objects), sep=sep, end=end, file=file)

uprint('foo')
uprint(u'Antonín Dvořák')
uprint('foo', 'bar', u'Antonín Dvořák')
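
The question also asked for the em dash to become a hyphen rather than a question mark. A possible way to get that is a custom error handler registered with codecs.register_error. This is only a sketch: the handler name "asciifallback", the FALLBACKS table, and the helper to_console_text are all hypothetical names chosen for illustration, and the table would need to be extended for real data.

```python
import codecs

# Hypothetical substitution table: unencodable character -> ASCII stand-in
FALLBACKS = {
    "\u2014": "-",   # em dash
    "\u2013": "-",   # en dash
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
}


def ascii_fallback(error):
    """Error handler: substitute known characters, fall back to '?' otherwise."""
    if not isinstance(error, UnicodeEncodeError):
        raise error
    bad = error.object[error.start:error.end]
    replacement = "".join(FALLBACKS.get(ch, "?") for ch in bad)
    # Return the replacement text and the position at which encoding resumes
    return replacement, error.end


codecs.register_error("asciifallback", ascii_fallback)


def to_console_text(text, encoding):
    """Round-trip text through the target encoding using the custom handler."""
    return text.encode(encoding, errors="asciifallback").decode(encoding)


print(to_console_text("well\u2014formed", "cp850"))
```

Characters that the target code page can represent pass through unchanged, so the same call works whether the console uses cp850, cp437, or UTF-8.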

I dug deeper into this and found that the best solutions are described here:

http://blog.notdot.net/2010/07/Getting-unicode-right-in-Python

In my case I solved "UnicodeEncodeError: 'charmap' codec can't encode character" as follows.

Original code:

print("Process lines, file_name command_line %s\n" % command_line)

New code:

print("Process lines, file_name command_line %s\n" % command_line.encode('utf-8'))

Note that in Python 3, .encode() returns bytes, so the value is shown in its b'...' form.
