How to fix: "UnicodeDecodeError: 'ascii' codec can't decode byte"

Question

    as3:~/ngokevin-site# nano content/blog/20140114_test-chinese.mkd
    as3:~/ngokevin-site# wok
    Traceback (most recent call last):
      File "/usr/local/bin/wok", line 4, in <module>
        Engine()
      File "/usr/local/lib/python2.7/site-packages/wok/engine.py", line 104, in __init__
        self.load_pages()
      File "/usr/local/lib/python2.7/site-packages/wok/engine.py", line 238, in load_pages
        p = Page.from_file(os.path.join(root, f), self.options, self, renderer)
      File "/usr/local/lib/python2.7/site-packages/wok/page.py", line 111, in from_file
        page.meta['content'] = page.renderer.render(page.original)
      File "/usr/local/lib/python2.7/site-packages/wok/renderers.py", line 46, in render
        return markdown(plain, Markdown.plugins)
      File "/usr/local/lib/python2.7/site-packages/markdown/__init__.py", line 419, in markdown
        return md.convert(text)
      File "/usr/local/lib/python2.7/site-packages/markdown/__init__.py", line 281, in convert
        source = unicode(source)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 1: ordinal not in range(128). -- Note: Markdown only accepts unicode input!

How can I fix it?

In some other Python-based static blog apps, Chinese posts can be published successfully, for example this app: http://github.com/vrypan/bucket3. On my site, http://bc3.brite.biz/, Chinese posts are published successfully.

User · Answer

This is the classic "unicode issue". I believe that a complete explanation of what is happening is beyond the scope of a StackOverflow answer. It is well explained here.

In very brief summary: you have passed something that is being interpreted as a string of bytes to something that needs to decode it into Unicode characters, but the default codec (ascii) is failing.

The presentation I pointed you to provides advice for avoiding this. Make your code a "unicode sandwich". In Python 2, the use of from __future__ import unicode_literals helps.

Update: how can the code be fixed?

OK, in your variable "source" you have some bytes. It is not clear from your question how they got in there; maybe you read them from a web form? In any case, they are not encoded with ascii, but Python is trying to convert them to unicode assuming that they are. You need to explicitly tell it what the encoding is. This means that you need to know what the encoding is. That is not always easy, and it depends entirely on where this string came from. You could experiment with some common encodings, for example UTF-8. You tell unicode() the encoding as a second parameter:

    source = unicode(source, 'utf-8')
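The "experiment with some common encodings" advice can be sketched as follows. This is a minimal illustration written for Python 3 (where the byte string type is bytes); the helper name decode_with_fallback and the candidate list are assumptions made for the example, not part of wok or Markdown.

```python
def decode_with_fallback(raw, encodings=("utf-8", "gbk", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes raw.

    Hypothetical helper: the candidate list is an assumption. Put the
    encoding you actually expect first, because latin-1 accepts any
    byte sequence and therefore acts as a catch-all.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Two Chinese characters (U+6D4B U+8BD5) as UTF-8 bytes, as they might
# arrive from a file.
raw = "\u6d4b\u8bd5".encode("utf-8")

try:
    raw.decode("ascii")  # what an implicit conversion effectively does
except UnicodeDecodeError as exc:
    print("ascii failed:", exc.reason)

text, used = decode_with_fallback(raw)
print(ascii(text), used)
```

A successful decode only proves the bytes are valid in that encoding, not that it is the right one, so knowing the real source encoding is still preferable to guessing.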

User · Answer

In a Django (1.9.10) / Python 2.7.5 project I have frequent UnicodeDecodeError exceptions, mainly when I try to feed unicode strings to logging. I made a helper function that formats arbitrary objects as 8-bit ascii strings, replacing any characters not in the table with '?'. I think it's not the best solution, but since the default encoding is ascii (and I don't want to change it) it will do:

    from collections import Iterable

    def encode_for_logging(c, encoding='ascii'):
        # Strings: encode, replacing unencodable characters with '?'
        if isinstance(c, basestring):
            return c.encode(encoding, 'replace')
        # Other iterables: encode each element recursively
        elif isinstance(c, Iterable):
            c_ = []
            for v in c:
                c_.append(encode_for_logging(v, encoding))
            return c_
        # Anything else: convert to unicode first, then encode
        else:
            return encode_for_logging(unicode(c))
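For comparison, the same idea in Python 3 can be sketched in a few lines. The name ascii_safe is a hypothetical helper introduced here, and the example assumes that losing non-ASCII characters in log output is acceptable.

```python
import logging


def ascii_safe(value):
    """Render any object as text containing only ASCII characters.

    Characters outside ASCII are replaced with '?', mirroring the
    'replace' error handler used in the Python 2 helper above.
    """
    return str(value).encode("ascii", "replace").decode("ascii")


logging.basicConfig(level=logging.INFO)
logging.info("city: %s", ascii_safe("Z\u00fcrich"))
```

The replacement is lossy, so this is best applied only at the logging boundary and not to data that is processed further.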

User · Answer

I was searching for a way to solve the following error message:

    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 5454: ordinal not in range(128)

I finally got it fixed by specifying 'encoding':

    f = open('../glove/glove.6B.100d.txt', encoding="utf-8")

I hope it helps you too.
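The effect of the encoding argument can be reproduced with a small self-contained sketch for Python 3. The file name sample.txt and its contents are made up for the example.

```python
import os
import tempfile

# Write a small UTF-8 file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("caf\u00e9 0.1 0.2\n")

# Reading it back as ASCII reproduces the error ...
try:
    with open(path, encoding="ascii") as f:
        f.read()
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# ... while naming the real encoding works.
with open(path, encoding="utf-8") as f:
    first_line = f.readline().strip()

print(ascii_failed, ascii(first_line))
```

Without an explicit encoding, open() in Python 3 falls back to a locale-dependent default, which is why the same script can work on one machine and fail on another.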

User · Answer

I experienced this error with Python 2.7. It happened to me while trying to run many Python programs, but I managed to reproduce it with this simple script:

    #!/usr/bin/env python

    import subprocess
    import sys

    result = subprocess.Popen([u'svn', u'info'])
    if not callable(getattr(result, "__enter__", None)) and not callable(getattr(result, "__exit__", None)):
        print("foo")
    print("bar")

On success, it should print out 'foo' and 'bar', and probably an error message if you're not in an svn folder.

On failure, it should print "UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 39: ordinal not in range(128)".

After trying to regenerate my locales and many other solutions posted in this question, I learned the error was happening because I had a special character ('l') encoded in my PATH environment variable. After fixing the PATH in '~/.bashrc', and exiting my session and entering again (apparently sourcing '~/.bashrc' didn't work), the issue was gone.
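A quick way to look for this kind of problem is to scan PATH for entries containing non-ASCII characters. The following is a minimal Python 3 sketch; the function name non_ascii_entries is hypothetical.

```python
import os


def non_ascii_entries(path_value):
    """Return the PATH entries that contain any non-ASCII character."""
    return [
        entry
        for entry in path_value.split(os.pathsep)
        if any(ord(ch) > 127 for ch in entry)
    ]


# ascii() keeps the output printable even on an ASCII-only console.
print(ascii(non_ascii_entries(os.environ.get("PATH", ""))))
```

An empty list means PATH itself is clean, in which case other environment variables or the locale settings are the next things to check.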

User · Answer

I had the same problem, but it didn't work for Python 3. I followed this and it solved my problem:

    import sys

    enc = sys.getdefaultencoding()
    file = open(menu, "r", encoding=enc)

You have to set the encoding when you are reading/writing the file.

User · Answer

"UnicodeDecodeError: 'ascii' codec can't decode byte"

Cause of this error: input_string must be unicode but str was given.

"TypeError: Decoding Unicode is not supported"

Cause of this error: trying to convert unicode input_string into unicode.

So first check that your input_string is str and convert to unicode if necessary:

    if isinstance(input_string, str):
        input_string = unicode(input_string, 'utf-8')

Secondly, the above just changes the type but does not remove non-ascii characters. If you want to remove non-ascii characters:

    if isinstance(input_string, str):
        # note: this removes the character and encodes back to string
        input_string = input_string.decode('ascii', 'ignore').encode('ascii')
    elif isinstance(input_string, unicode):
        input_string = input_string.encode('ascii', 'ignore')
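In Python 3 the two types are bytes and str, and the same two steps (normalise the type, then optionally strip non-ASCII characters) can be sketched like this. The helper names to_text and strip_non_ascii are assumptions for illustration, as is the UTF-8 default.

```python
def to_text(value, encoding="utf-8"):
    """Return str: decode bytes with the given encoding, pass str through."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def strip_non_ascii(value, encoding="utf-8"):
    """Drop every character outside ASCII (lossy)."""
    return to_text(value, encoding).encode("ascii", "ignore").decode("ascii")


print(strip_non_ascii("Z\u00fcrich"))
print(strip_non_ascii(b"Z\xc3\xbcrich"))
```

Stripping silently discards information, so it is usually a last resort for systems that truly cannot handle anything beyond ASCII.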

User · Answer

Finally I got it:

    as3:/usr/local/lib/python2.7/site-packages# cat sitecustomize.py
    # encoding=utf8
    import sys

    reload(sys)
    sys.setdefaultencoding('utf8')

Let me check:

    as3:~/ngokevin-site# python
    Python 2.7.6 (default, Dec  6 2013, 14:49:02)
    [GCC 4.4.5] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import sys
    >>> reload(sys)
    <module 'sys' (built-in)>
    >>> sys.getdefaultencoding()
    'utf8'
    >>>

The above shows that the default encoding of Python is now utf8, and the error no longer occurs.

User · Answer

This worked for me:

    file = open('docs/my_messy_doc.pdf', 'rb')

User · Answer

Specify "# encoding=utf-8" at the top of your Python file. It should fix the issue.

User · Answer

I got the same error, and this solved it. Python 2 and Python 3 differ in their unicode handling, which makes pickled files quite incompatible to load. So use pickle's encoding argument. The link below helped me solve a similar problem when I was trying to open pickled data from Python 3.7 while my file was originally saved with Python 2.x: https://blog.modest-destiny.com/posts/python-2-and-3-compatible-pickle-save-and-load

I copied the load_pickle function into my script and called load_pickle(pickle_file) while loading my input data, like this:

    input_data = load_pickle("my_dataset.pkl")

The load_pickle function is here:

    import pickle

    def load_pickle(pickle_file):
        try:
            with open(pickle_file, 'rb') as f:
                pickle_data = pickle.load(f)
        except UnicodeDecodeError as e:
            # Python 2 byte strings: retry, decoding them as latin1
            with open(pickle_file, 'rb') as f:
                pickle_data = pickle.load(f, encoding='latin1')
        except Exception as e:
            print('Unable to load data ', pickle_file, ':', e)
            raise
        return pickle_data
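The behaviour of the encoding argument can be seen without a Python 2 installation. The sketch below (Python 3) uses a hand-written protocol-2 pickle of a Python 2 byte string; the byte literal is an assumption constructed for this example, not data taken from the linked post.

```python
import pickle

# Protocol-2 pickle of the Python 2 byte string 'caf\xe9' (Latin-1 bytes
# for "cafe" with an acute accent), written out by hand:
# PROTO 2, SHORT_BINSTRING of length 4, BINPUT 0, STOP.
py2_pickle = b"\x80\x02U\x04caf\xe9q\x00."

try:
    pickle.loads(py2_pickle)  # default is encoding="ASCII"
    default_failed = False
except UnicodeDecodeError:
    default_failed = True

as_text = pickle.loads(py2_pickle, encoding="latin1")  # decoded to str
as_bytes = pickle.loads(py2_pickle, encoding="bytes")  # left as bytes

print(default_failed, ascii(as_text), as_bytes)
```

encoding="latin1" never fails because every byte maps to a character, but the resulting text is only correct if the original data really was Latin-1; encoding="bytes" postpones that decision by keeping the raw bytes.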

User · Answer

I find the best approach is to always convert to unicode, but this is difficult to achieve because in practice you'd have to check and convert every argument to every function and method you ever write that includes some form of string processing.

So I came up with the following approach to guarantee either unicodes or byte strings, from either input. In short, include and use the following lambdas:

    # guarantee unicode string
    _u = lambda t: t.decode('UTF-8', 'replace') if isinstance(t, str) else t
    _uu = lambda *tt: tuple(_u(t) for t in tt)
    # guarantee byte string in UTF8 encoding
    _u8 = lambda t: t.encode('UTF-8', 'replace') if isinstance(t, unicode) else t
    _uu8 = lambda *tt: tuple(_u8(t) for t in tt)

Examples:

    # -*- coding: utf-8 -*-  (needed because of the non-ASCII literals below)
    text='Some string with codes > 127, like Zürich'
    utext=u'Some string with codes > 127, like Zürich'
    print "==> with _u, _uu"
    print _u(text), type(_u(text))
    print _u(utext), type(_u(utext))
    print _uu(text, utext), type(_uu(text, utext))
    print "==> with u8, uu8"
    print _u8(text), type(_u8(text))
    print _u8(utext), type(_u8(utext))
    print _uu8(text, utext), type(_uu8(text, utext))
    # with % formatting, always use _u() and _uu()
    print "Some unknown input %s" % _u(text)
    print "Multiple inputs %s, %s" % _uu(text, text)
    # but with string.format be sure to always work with unicode strings
    print u"Also works with formats: {}".format(_u(text))
    print u"Also works with formats: {},{}".format(*_uu(text, text))
    # ... or use _u8 and _uu8, because string.format expects byte strings
    print "Also works with formats: {}".format(_u8(text))
    print "Also works with formats: {},{}".format(*_uu8(text, text))

Here's some more reasoning about this.

User · Answer

Encode converts a unicode object into a string object. I think you are trying to encode a string object. First convert your result into a unicode object and then encode that unicode object into 'utf-8'. For example:

    result = yourFunction()
    result.decode().encode('utf-8')

User · Answer

tl;dr / quick fix

- Don't decode/encode willy nilly
- Don't assume your strings are UTF-8 encoded
- Try to convert strings to Unicode strings as soon as possible in your code
- Fix your locale: How to solve UnicodeDecodeError in Python 3.6?
- Don't be tempted to use quick reload hacks

Unicode Zen in Python 2.x - The Long Version

Without seeing the source it's difficult to know the root cause, so I'll have to speak generally.

"UnicodeDecodeError: 'ascii' codec can't decode byte" generally happens when you try to convert a Python 2.x str that contains non-ASCII to a Unicode string without specifying the encoding of the original string.

In brief, Unicode strings are an entirely separate type of Python string that does not contain any encoding. They only hold Unicode code points and therefore can hold any Unicode point from across the entire spectrum. Strings contain encoded text, be it UTF-8, UTF-16, ISO-8859-1, GBK, Big5 etc. Strings are decoded to Unicode and Unicodes are encoded to strings. Files and text data are always transferred in encoded strings.

The Markdown module authors probably use unicode() (where the exception is thrown) as a quality gate to the rest of the code: it will convert ASCII or re-wrap existing Unicode strings to a new Unicode string. The Markdown authors can't know the encoding of the incoming string, so they rely on you to decode strings to Unicode strings before passing them to Markdown.

Unicode strings can be declared in your code using the u prefix to strings. E.g.

    >>> my_u = u'my ünicôdé string'
    >>> type(my_u)
    <type 'unicode'>

Unicode strings may also come from file, database and network modules. When this happens, you don't need to worry about the encoding.

Gotchas

Conversion from str to Unicode can happen even when you don't explicitly call unicode().

The following scenarios cause UnicodeDecodeError exceptions:

    # Explicit conversion without encoding
    unicode('€')

    # New style format string into Unicode string
    # Python will try to convert value string to Unicode first
    u"The currency is: {}".format('€')

    # Old style format string into Unicode string
    # Python will try to convert value string to Unicode first
    u'The currency is: %s' % '€'

    # Append string to Unicode
    # Python will try to convert string to Unicode first
    u'The currency is: ' + '€'

Examples

The word café is encoded as either "UTF-8" or "Cp1252" depending on the terminal type. In both cases, caf is just regular ascii. In UTF-8, é is encoded using two bytes. In "Cp1252", é is 0xE9 (which also happens to be the Unicode point value; it's no coincidence). When the correct decode() is invoked, conversion to a Python Unicode is successful.

When decode() is called with ascii (which is the same as calling unicode() without an encoding given), it fails: as ASCII can't contain bytes greater than 0x7F, this throws a UnicodeDecodeError exception.

The Unicode Sandwich

It's good practice to form a Unicode sandwich in your code, where you decode all incoming data to Unicode strings, work with Unicodes, then encode to strs on the way out. This saves you from worrying about the encoding of strings in the middle of your code.

Input / Decode

Source code

If you need to bake non-ASCII into your source code, just create Unicode strings by prefixing the string with a u. E.g. u'Zürich'

To allow Python to decode your source code, you will need to add an encoding header to match the actual encoding of your file. For example, if your file was encoded as 'UTF-8', you would use:

    # encoding: utf-8

This is only necessary when you have non-ASCII in your source code.

Files

Usually non-ASCII data is received from a file. The io module provides a TextIOWrapper that decodes your file on the fly, using a given encoding. You must use the correct encoding for the file; it can't be easily guessed. For example, for a UTF-8 file:

    import io
    with io.open("my_utf8_file.txt", "r", encoding="utf-8") as my_file:
        my_unicode_string = my_file.read()

my_unicode_string would then be suitable for passing to Markdown. If you get a UnicodeDecodeError from the read() line, then you've probably used the wrong encoding value.

CSV Files

The Python 2.7 CSV module does not support non-ASCII characters. Help is at hand, however, with https://pypi.python.org/pypi/backports.csv.

Use it like above but pass the opened file to it:

    from backports import csv
    import io

    def read_rows():
        # yield is only valid inside a function, so wrap the loop in a generator
        with io.open("my_utf8_file.txt", "r", encoding="utf-8") as my_file:
            for row in csv.reader(my_file):
                yield row

Databases

Most Python database drivers can return data in Unicode, but usually require a little configuration. Always use Unicode strings for SQL queries.

MySQL

In the connection string add: charset='utf8', use_unicode=True

E.g.

    >>> db = MySQLdb.connect(host="localhost", user='root', passwd='passwd', db='sandbox', use_unicode=True, charset="utf8")

PostgreSQL

Add:

    psycopg2.extensions.register_type(psycopg2.extensions.UNICODE)
    psycopg2.extensions.register_type(psycopg2.extensions.UNICODEARRAY)

HTTP

Web pages can be encoded in just about any encoding. The Content-type header should contain a charset field to hint at the encoding. The content can then be decoded manually against this value. Alternatively, Python-Requests returns Unicodes in response.text.

Manually

If you must decode strings manually, you can simply do my_string.decode(encoding), where encoding is the appropriate encoding. Python 2.x supported codecs are given here: Standard Encodings. Again, if you get UnicodeDecodeError then you've probably got the wrong encoding.

The meat of the sandwich

Work with Unicodes as you would normal strs.

Output

stdout / printing

print writes through the stdout stream. Python tries to configure an encoder on stdout so that Unicodes are encoded to the console's encoding. For example, if a Linux shell's locale is en_GB.UTF-8, the output will be encoded to UTF-8. On Windows, you will be limited to an 8-bit code page.

An incorrectly configured console, such as a corrupt locale, can lead to unexpected print errors. The PYTHONIOENCODING environment variable can force the encoding for stdout.

Files

Just like input, io.open can be used to transparently convert Unicodes to encoded byte strings.

Database

The same configuration for reading will allow Unicodes to be written directly.

Python 3

Python 3 is no more Unicode capable than Python 2.x is, however it is slightly less confused on the topic. E.g. the regular str is now a Unicode string and the old str is now bytes.

The default encoding is UTF-8, so if you .decode() a byte string without giving an encoding, Python 3 uses UTF-8 encoding. This probably fixes 50% of people's Unicode problems.

Further, open() operates in text mode by default, so it returns decoded str (Unicode ones). The encoding is derived from your locale, which tends to be UTF-8 on Un*x systems or an 8-bit code page, such as windows-1251, on Windows boxes.

Why you shouldn't use sys.setdefaultencoding('utf8')

It's a nasty hack (there's a reason you have to use reload) that will only mask problems and hinder your migration to Python 3.x. Understand the problem, fix the root cause and enjoy Unicode zen. See "Why should we NOT use sys.setdefaultencoding("utf-8") in a py script?" for further details.
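The café byte sequences and the Unicode sandwich described in this answer can be checked with a short sketch. It is written for Python 3, where bytes and str play the roles of Python 2's str and unicode; the function name shout is a hypothetical example of "processing in the middle".

```python
word = "caf\u00e9"  # "cafe" with an acute accent on the e

utf8_bytes = word.encode("utf-8")     # the accented e becomes 0xC3 0xA9
cp1252_bytes = word.encode("cp1252")  # the accented e becomes 0xE9

# Decoding with the wrong codec (ascii) reproduces the error message.
ascii_error = None
try:
    cp1252_bytes.decode("ascii")
except UnicodeDecodeError as exc:
    ascii_error = str(exc)


def shout(raw, in_encoding, out_encoding="utf-8"):
    """Unicode sandwich: decode on input, work on text, encode on output."""
    text = raw.decode(in_encoding)    # outer layer: bytes -> text
    text = text.upper()               # middle: pure text processing
    return text.encode(out_encoding)  # outer layer: text -> bytes


print(utf8_bytes, cp1252_bytes)
print(ascii_error)
print(shout(cp1252_bytes, "cp1252"))
```

The middle of the function never needs to know which encoding the data arrived in, which is the point of the sandwich.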

User · Answer

I got the same problem with the string "Pastelería Mallorca" and I solved it with:

    unicode("Pastelería Mallorca", 'latin-1')
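The Python 3 counterpart of this fix is bytes.decode with the same codec. The sketch below assumes the bytes really are Latin-1, as in the answer above; the variable names are illustrative.

```python
# Latin-1 bytes: the accented i is the single byte 0xED.
raw = b"Pasteler\xeda Mallorca"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False  # 0xED is not a valid UTF-8 sequence here

text = raw.decode("latin-1")       # correct codec: decoding succeeds
reencoded = text.encode("utf-8")   # transcode for UTF-8 consumers

print(utf8_ok, ascii(text), reencoded)
```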

User · Answer

In some cases, when you check your default encoding (print sys.getdefaultencoding()), it returns that you are using ASCII. If you change to UTF-8, it doesn't work, depending on the content of your variable. I found another way:

    import sys
    reload(sys)
    sys.setdefaultencoding('Cp1252')

User · Answer

In order to resolve this at the operating system level on an Ubuntu installation, check the following:

    $ locale charmap

If you get

    locale: Cannot set LC_CTYPE to default locale: No such file or directory

instead of

    UTF-8

then set LC_CTYPE and LC_ALL like this:

    $ export LC_ALL="en_US.UTF-8"
    $ export LC_CTYPE="en_US.UTF-8"
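The encodings that Python derives from these locale settings can be inspected from inside the interpreter. This is a small Python 3 sketch; the function name encoding_report is an assumption for the example.

```python
import locale
import sys


def encoding_report():
    """Collect the encodings Python derives from the environment."""
    return {
        "preferred": locale.getpreferredencoding(False),
        "filesystem": sys.getfilesystemencoding(),
        "stdout": getattr(sys.stdout, "encoding", None),
    }


for name, value in encoding_report().items():
    print(name, value)
```

If "preferred" reports an ASCII variant such as ANSI_X3.4-1968, the locale is most likely unset or broken, which matches the situation described above.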

User · Answer

Here is my solution: just add the encoding.

    with open(file, encoding='utf8') as f

And because reading the glove file takes a long time, I recommend saving the glove file to a numpy file. The next time you read the embedding weights, it will save you time.

    import numpy as np
    from tqdm import tqdm


    def load_glove(file):
        """Loads GloVe vectors in numpy array.
        Args:
            file (str): a path to a glove file.
        Return:
            dict: a dict of numpy arrays.
        """
        embeddings_index = {}
        with open(file, encoding='utf8') as f:
            for i, line in tqdm(enumerate(f)):
                values = line.split()
                # the last 300 fields are the vector, everything before is the word
                word = ''.join(values[:-300])
                coefs = np.asarray(values[-300:], dtype='float32')
                embeddings_index[word] = coefs

        return embeddings_index

    # EMBEDDING_PATH = '../embedding_weights/glove.840B.300d.txt'
    EMBEDDING_PATH = 'glove.840B.300d.txt'
    embeddings = load_glove(EMBEDDING_PATH)

    np.save('glove_embeddings.npy', embeddings)

Gist link: https://gist.github.com/BrambleXu/634a844cdd3cd04bb2e3ba3c83aef227

User · Answer

This error occurs when there are some non-ASCII characters in our string and we perform operations on that string without proper decoding. This helped me solve my problem. I am reading a CSV file with the columns ID and Text, and decoding the characters in it as below:

    import pandas as pd

    train_df = pd.read_csv("Example.csv")
    train_data = train_df.values
    for i in train_data:
        print("ID :" + i[0])
        text = i[1].decode("utf-8", errors="ignore").strip().lower()
        print("Text: " + text)

User · Answer

I had the same error, with URLs containing non-ascii chars (bytes with values > 127). My solution:

    url = url.decode('utf8').encode('utf-8')

Note: utf-8 and utf8 are simply aliases. Using only 'utf8' or 'utf-8' should work in the same way.

In my case, this worked for me in Python 2.7. I suppose this assignment changed 'something' in the str internal representation, i.e., it forces the right decoding of the backing byte sequence in url and finally puts the string into a utf-8 str with all the magic in the right place. Unicode in Python is black magic for me. Hope it's useful.

User · Answer

In short, to ensure proper unicode handling in Python 2:

- use io.open for reading/writing files
- use from __future__ import unicode_literals
- configure other data inputs/outputs (e.g., databases, network) to use unicode
- if you cannot configure outputs to utf-8, convert your output for them: print(text.encode('ascii', 'replace').decode())

For explanations, see @Alastair McCormack's detailed answer.
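The last point, converting output for a sink that only accepts ASCII, can use several standard error handlers. The sketch below (Python 3) compares three of them on one sample string; which one fits depends on who reads the output.

```python
text = "Z\u00fcrich"

# 'replace' substitutes '?', losing the original character.
replaced = text.encode("ascii", "replace").decode("ascii")

# 'backslashreplace' keeps the code point as a Python-style escape.
escaped = text.encode("ascii", "backslashreplace").decode("ascii")

# 'xmlcharrefreplace' keeps it as a numeric character reference.
xml_refs = text.encode("ascii", "xmlcharrefreplace").decode("ascii")

print(replaced, escaped, xml_refs)
```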
