UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 13: ordinal not in range(128)

Question

I'm using NLTK to perform k-means clustering on my text file, in which each line is considered a document. So, for example, my text file is something like this:

    belong finger death punch <br>
    hasty <br>
    mike hasty walls jericho <br>
    jägermeister rules <br>
    rules bands follow performing jägermeister stage <br>
    approach

Now the demo code I'm trying to run is this:

    import sys

    import numpy
    from nltk.cluster import KMeansClusterer, GAAClusterer, euclidean_distance
    import nltk.corpus
    from nltk import decorators
    import nltk.stem

    stemmer_func = nltk.stem.EnglishStemmer().stem
    stopwords = set(nltk.corpus.stopwords.words('english'))

    @decorators.memoize
    def normalize_word(word):
        return stemmer_func(word.lower())

    def get_words(titles):
        words = set()
        for title in job_titles:
            for word in title.split():
                words.add(normalize_word(word))
        return list(words)

    @decorators.memoize
    def vectorspaced(title):
        title_components = [normalize_word(word) for word in title.split()]
        return numpy.array([
            word in title_components and not word in stopwords
            for word in words], numpy.short)

    if __name__ == '__main__':

        filename = 'example.txt'
        if len(sys.argv) == 2:
            filename = sys.argv[1]

        with open(filename) as title_file:

            job_titles = [line.strip() for line in title_file.readlines()]

            words = get_words(job_titles)

            # cluster = KMeansClusterer(5, euclidean_distance)
            cluster = GAAClusterer(5)
            cluster.cluster([vectorspaced(title) for title in job_titles if title])

            # NOTE: This is inefficient, cluster.classify should really just be
            # called when you are classifying previously unseen examples!
            classified_examples = [
                cluster.classify(vectorspaced(title)) for title in job_titles
            ]

            for cluster_id, title in sorted(zip(classified_examples, job_titles)):
                print cluster_id, title

The error I receive is this:

    Traceback (most recent call last):
      File "cluster_example.py", line 40, in <module>
        words = get_words(job_titles)
      File "cluster_example.py", line 20, in get_words
        words.add(normalize_word(word))
      File "", line 1, in
      File "/usr/local/lib/python2.7/dist-packages/nltk/decorators.py", line 183, in memoize
        result = func(*args)
      File "cluster_example.py", line 14, in normalize_word
        return stemmer_func(word.lower())
      File "/usr/local/lib/python2.7/dist-packages/nltk/stem/snowball.py", line 694, in stem
        word = (word.replace(u"\u2019", u"\x27")
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 13: ordinal not in range(128)

What is happening here?

User · Answer

Use open(fn, 'rb').read().decode('utf-8') instead of just open(fn).read().
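
As an illustration of that one-liner, here is a minimal sketch, assuming Python 3 and a file that really is UTF-8. The helper name read_text and the temporary demo file are hypothetical and exist only to make the example self-contained.

```python
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Read the whole file as raw bytes, then decode it explicitly."""
    with open(path, "rb") as handle:
        return handle.read().decode(encoding)


# Demo data (an assumption, not the asker's file): U+2019 is the right
# single quotation mark, whose UTF-8 encoding starts with byte 0xe2.
sample = u"rock\u2019n\u2019roll bands\n"

fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as handle:
    handle.write(sample.encode("utf-8"))

text = read_text(demo_path)
os.remove(demo_path)
print(repr(text))
```

Reading in binary mode keeps Python from guessing an encoding, so the decode step is the only place where the encoding is chosen.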

User · Answer

This works fine for me:

    f = open(file_path, 'r+', encoding="utf-8")

You can add a third parameter, encoding, to ensure the encoding type is 'utf-8'.

Note: this method works fine in Python 3; I did not try it in Python 2.7.
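
Applied to the question's file-loading step, the idea might look like the sketch below. It assumes Python 3; load_titles and the temporary file are hypothetical names used for the demonstration, and the mode is plain 'r' because the file is only read.

```python
import os
import tempfile


def load_titles(path):
    """Return the non-empty lines of a UTF-8 text file as str objects."""
    with open(path, "r", encoding="utf-8") as title_file:
        return [line.strip() for line in title_file if line.strip()]


# Demo file (an assumption): two titles separated by an empty line.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w", encoding="utf-8") as handle:
    handle.write("j\u00e4germeister rules\n\nhasty\n")

titles = load_titles(demo_path)
os.remove(demo_path)
print(titles)
```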

User · Answer

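Python 3.x or higher:

Load the file as a byte stream and decode each line (the return statement means this fragment belongs inside a function):

    body = ''
    for lines in open('website/index.html', 'rb'):
        decodedLine = lines.decode('utf-8')
        body = body + decodedLine.strip()
    return body

Or use a global setting for standard output:

    import io
    import sys
    sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')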

User · Answer

For Python 3, the default encoding would be 'utf-8'. The following steps are suggested in the base documentation (https://docs.python.org/2/library/csv.html#csv-examples, written for Python 2) in case of any problem.

Create a function:

    def utf_8_encoder(unicode_csv_data):
        for line in unicode_csv_data:
            yield line.encode('utf-8')

Then use the function inside the reader, e.g.:

    csv_reader = csv.reader(utf_8_encoder(unicode_csv_data))
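
The recipe above targets Python 2, where csv.reader works on byte strings. In Python 3 the reader takes already-decoded text, so a rough equivalent is to decode while reading. The sketch below assumes Python 3; read_rows is a hypothetical helper, and an in-memory byte buffer stands in for a real file opened with open(path, encoding="utf-8", newline="").

```python
import csv
import io


def read_rows(text_stream):
    """Parse CSV rows from a text stream that is already decoded."""
    return list(csv.reader(text_stream))


# Demo input (an assumption): UTF-8 bytes held in memory instead of on disk.
raw = io.BytesIO(u"band,note\nj\u00e4germeister,rules\n".encode("utf-8"))

# TextIOWrapper performs the UTF-8 decoding; newline="" leaves line
# endings untouched so the csv module can handle them itself.
rows = read_rows(io.TextIOWrapper(raw, encoding="utf-8", newline=""))
print(rows)
```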

User · Answer

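You can try this before using the job_titles string (Python 2):

    source = unicode(job_titles, 'utf-8')

Note that unicode() expects a byte string, so with the list from the question it has to be applied to each line rather than to the list itself.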

User · Answer

I got this error when trying to install a Python package in a Docker container. For me, the issue was that the Docker image did not have a locale configured. Adding the following code to the Dockerfile solved the problem for me:

    # Avoid ascii errors when reading files in Python
    RUN apt-get install -y locales && locale-gen en_US.UTF-8
    ENV LANG='en_US.UTF-8' LANGUAGE='en_US:en' LC_ALL='en_US.UTF-8'

User · Answer

To find ANY and ALL Unicode-related errors, use the following command:

    grep -r -P '[^\x00-\x7f]' /etc/apache2 /etc/letsencrypt /etc/nginx

I found mine in /etc/letsencrypt/options-ssl-nginx.conf:

    # The following CSP directives don't use default-src as

Using shed, I found the offending sequence. It turned out to be an editor mistake: the bytes C2 A0 (the UTF-8 encoding of a non-breaking space, U+00A0) appear before and after "default-src".

    offset    char  hex  dec  oct  bin
    00008099        C2   194  302  11000010
    00008100        A0   160  240  10100000
    00008101  d     64   100  144  01100100
    00008102  e     65   101  145  01100101
    00008103  f     66   102  146  01100110
    00008104  a     61   097  141  01100001
    00008105  u     75   117  165  01110101
    00008106  l     6C   108  154  01101100
    00008107  t     74   116  164  01110100
    00008108  -     2D   045  055  00101101
    00008109  s     73   115  163  01110011
    00008110  r     72   114  162  01110010
    00008111  c     63   099  143  01100011
    00008112        C2   194  302  11000010
    00008113        A0   160  240  10100000
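
The same kind of search can be sketched in Python for a single byte string. This is only an illustration: find_non_ascii is a hypothetical helper, and the sample bytes are rebuilt from the dump above rather than read from the real configuration file.

```python
def find_non_ascii(data):
    """Return (offset, byte value) pairs for every byte above 0x7f."""
    return [
        (offset, byte)
        for offset, byte in enumerate(bytearray(data))
        if byte > 0x7F
    ]


# Sample taken from the dump: C2 A0 before and after "default-src".
line = b"\xc2\xa0default-src\xc2\xa0"

hits = find_non_ascii(line)
for offset, byte in hits:
    print("offset %d: 0x%02X" % (offset, byte))

# Decoding as UTF-8 shows the two stray non-breaking spaces (U+00A0).
decoded = line.decode("utf-8")
```

To scan a file, the bytes from open(path, 'rb').read() could be passed to the same helper.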

User · Answer

When on Ubuntu 18.04 using Python 3.6, I solved the problem by doing both of the following:

    with open(filename, encoding="utf-8") as lines:

and, if you are running the tool from the command line:

    export LC_ALL=C.UTF-8

Note that in Python 2.7 you have to handle this differently. First you have to set the default encoding:

    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')

and then, to load the file, you must use io.open to set the encoding:

    import io
    with io.open(filename, 'r', encoding='utf-8') as lines:

You still need to export the environment variable:

    export LC_ALL=C.UTF-8

User · Answer

The file is being read as a bunch of strs, but it should be unicodes. Python tries to implicitly convert, but fails. Change:

    job_titles = [line.strip() for line in title_file.readlines()]

to explicitly decode the strs to unicode (here assuming UTF-8):

    job_titles = [line.decode('utf-8').strip() for line in title_file.readlines()]

It could also be solved by importing the codecs module and using codecs.open rather than the built-in open.
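
The failure can be reproduced in isolation. The sketch below is written for Python 3, where the ASCII decode has to be requested explicitly; in Python 2 the same decode happens implicitly when a str meets a unicode object. The sample line is made up so that the right single quotation mark (U+2019, UTF-8 bytes e2 80 99) lands at position 13, as in the traceback.

```python
# Hypothetical sample line: 13 ASCII characters, then U+2019.
raw_line = u"jericho walls\u2019 stage".encode("utf-8")

message = None
position = None
try:
    # Equivalent to the implicit conversion Python 2 attempts.
    raw_line.decode("ascii")
except UnicodeDecodeError as error:
    message = str(error)
    position = error.start

print(message)

# Decoding with the encoding the file actually uses succeeds.
decoded_line = raw_line.decode("utf-8")
print(repr(decoded_line))
```

The ASCII codec rejects any byte above 0x7f, which is why the fix is to name the real encoding instead of relying on the default.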

User · Answer

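You can also try this (Python 2 only; reload is not a built-in and sys.setdefaultencoding does not exist in Python 3):

    import sys
    reload(sys)
    sys.setdefaultencoding('utf8')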

User · Answer

For me there was a problem with the terminal encoding. Adding UTF-8 to .bashrc solved the problem:

    export LC_CTYPE=en_US.UTF-8

Don't forget to reload .bashrc afterwards:

    source ~/.bashrc
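
To check what Python actually picked up from the environment after such a change, something like the following sketch may help. It assumes Python 3; describe_encodings is a hypothetical helper, and the values it reports depend entirely on the machine and shell it runs in.

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings Python derived from the current environment."""
    return {
        "preferred": locale.getpreferredencoding(False),
        "filesystem": sys.getfilesystemencoding(),
        "stdout": getattr(sys.stdout, "encoding", None),
    }


info = describe_encodings()
for name in sorted(info):
    print("%s: %s" % (name, info[name]))
```

If the reported values are ASCII-based names such as ANSI_X3.4-1968, the locale variables probably did not take effect in the current shell.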
