Character reading from file in Python

Question

In a text file, there is a string "I don't like this".

However, when I read it into a string, it becomes "I don\xe2\x80\x98t like this". I understand that \u2018 is the unicode representation of "'". I use

    f1 = open(file1, "r")
    text = f1.read()

to do the reading.

Now, is it possible to read the string in such a way that when it is read into the string, it is "I don't like this" instead of "I don\xe2\x80\x98t like this"?

Second edit: I have seen some people use mapping to solve this problem, but really, is there no built-in conversion that does this kind of ANSI to unicode (and vice versa) conversion?

User · Accepted Answer

Ref: http://docs.python.org/howto/unicode

Reading Unicode from a file is therefore simple:

    import codecs
    with codecs.open('unicode.rst', encoding='utf-8') as f:
        for line in f:
            print repr(line)

It's also possible to open files in update mode, allowing both reading and writing:

    with codecs.open('test', encoding='utf-8', mode='w+') as f:
        f.write(u'\u4500 blah blah blah\n')
        f.seek(0)
        print repr(f.readline()[:1])

EDIT: I'm assuming that your intended goal is just to be able to read the file properly into a string in Python. If you're trying to convert to an ASCII string from Unicode, then there's really no direct way to do so, since the Unicode characters won't necessarily exist in ASCII.

If you're trying to convert to an ASCII string, try one of the following:

1. Replace the specific unicode chars with ASCII equivalents, if you are only looking to handle a few special cases such as this particular example.
2. Use the unicodedata module's normalize() and the string.encode() method to convert as best you can to the next closest ASCII equivalent (Ref: https://web.archive.org/web/20090228203858/http://techxplorer.com/2006/07/18/converting-unicode-to-ascii-using-python):

    >>> import unicodedata
    >>> teststr
    u'I don\xe2\x80\x98t like this'
    >>> unicodedata.normalize('NFKD', teststr).encode('ascii', 'ignore')
    'I donat like this'
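For anyone on Python 3, here is a rough sketch of the normalize-then-encode idea from option 2. The helper name `to_ascii_lossy` and the sample strings are made up for illustration. It also shows the limitation: U+2018 has no ASCII decomposition, so it is simply dropped rather than turned into an apostrophe.

```python
import unicodedata


def to_ascii_lossy(text: str) -> str:
    """Decompose characters (NFKD), then drop whatever is still not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


# "é" decomposes into "e" plus a combining accent, so the "e" survives.
accented = to_ascii_lossy("caf\u00e9")

# U+2018 has no decomposition, so the quote character is lost entirely.
quoted = to_ascii_lossy("I don\u2018t like this")

print(accented)
print(quoted)
```

So this approach is probably fine for accented letters, but for curly quotes an explicit replacement (option 1) seems like the safer bet.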

User · Answer

But it really is "I don\u2018t like this" and not "I don't like this". The character u'\u2018' is a completely different character than "'" (and, visually, should correspond more to '`').

If you're trying to convert encoded unicode into plain ASCII, you could perhaps keep a mapping of unicode punctuation that you would like to translate into ASCII.

    punctuation = {
        u'\u2018': "'",
        u'\u2019': "'",
    }
    for src, dest in punctuation.iteritems():
        text = text.replace(src, dest)

There are an awful lot of punctuation characters in unicode, however, but I suppose you can count on only a few of them actually being used by whatever application is creating the documents you're reading.
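The same mapping idea can be sketched in Python 3 with `str.maketrans` and `str.translate`, which do all replacements in one pass. The table below is only an assumed starting set (single and double curly quotes); extend it with whatever punctuation your documents actually contain. The names `PUNCTUATION` and `asciify_quotes` are illustrative.

```python
# Assumed starting set: curly single and double quotes mapped to ASCII.
PUNCTUATION = {
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
}

# Build the translation table once and reuse it.
TABLE = str.maketrans(PUNCTUATION)


def asciify_quotes(text: str) -> str:
    """Replace the mapped punctuation characters; leave everything else alone."""
    return text.translate(TABLE)


print(asciify_quotes("I don\u2018t like \u201cthis\u201d"))
```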

User · Answer

There is a possibility that somehow you have a non-unicode string with unicode escape characters, e.g.:

    >>> print repr(text)
    'I don\\u2018t like this'

This actually happened to me once before. You can use the unicode_escape codec to decode the string to unicode and then encode it to any format you want:

    >>> uni = text.decode('unicode_escape')
    >>> print type(uni)
    <type 'unicode'>
    >>> print uni.encode('utf-8')
    I don‘t like this
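In Python 3 the equivalent would start from a `bytes` object, since `str` has no `decode` method. A small sketch, assuming the data really does contain a literal backslash followed by `u2018` (the variable names are illustrative):

```python
# The file literally contains the six characters \u2018, not the quote itself.
raw = b"I don\\u2018t like this"

# unicode_escape interprets backslash escapes and returns a str.
text = raw.decode("unicode_escape")

print(text)
print(text.encode("utf-8"))
```

One caveat: `unicode_escape` treats any non-ASCII bytes in the input as Latin-1, so this is only a good fit when the input is pure ASCII with escapes.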

User · Answer

Leaving aside the fact that your text file is broken (U+2018 is a left quotation mark, not an apostrophe): iconv can be used to transliterate unicode characters to ascii.

You'll have to google for "iconvcodec", since the module seems not to be supported anymore and I can't find a canonical home page for it.

    >>> import iconvcodec
    >>> from locale import setlocale, LC_ALL
    >>> setlocale(LC_ALL, '')
    >>> u'\u2018'.encode('ascii//translit')
    "'"

Alternatively you can use the iconv command line utility to clean up your file:

    $ xxd foo
    0000000: e280 980a                                ....
    $ iconv -t 'ascii//translit' foo | xxd
    0000000: 270a                                     '.

User · Answer

Not sure about the (errors="ignore") option, but it seems to work for files with strange Unicode characters.

    with open(fName, "rb") as fData:
        lines = fData.read().splitlines()
        lines = [line.decode("utf-8", errors="ignore") for line in lines]
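To make it clearer what `errors="ignore"` does, here is a small Python 3 sketch comparing it with `errors="replace"` on a made-up byte string that ends in one invalid byte (`\xff`). Valid UTF-8 sequences are decoded normally either way; only the undecodable byte is affected.

```python
# Valid UTF-8 for U+2018, followed by a byte that can never appear in UTF-8.
data = b"I don\xe2\x80\x98t like this \xff"

# "ignore" silently drops the bad byte.
ignored = data.decode("utf-8", errors="ignore")

# "replace" substitutes U+FFFD so the damage stays visible.
replaced = data.decode("utf-8", errors="replace")

# The default ("strict") raises instead of guessing.
try:
    data.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(repr(ignored))
print(repr(replaced))
```

Since "ignore" throws data away without telling you, "replace" or the strict default may be the better choice while you are still figuring out what encoding the file is in.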

User · Answer

It is also possible to read an encoded text file using the Python 3 read method:

    f = open('file.txt', 'r', encoding='utf-8')
    text = f.read()
    f.close()

With this variation, there is no need to import any additional libraries.
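A slightly more defensive version of the same Python 3 approach uses a `with` block so the file is closed even if reading fails. The sketch below writes a throwaway file first so it can run on its own; the helper name `read_text` and the temporary path are illustrative, not part of the question.

```python
import os
import tempfile


def read_text(path: str, encoding: str = "utf-8") -> str:
    """Read a whole text file, decoding it with the given encoding."""
    with open(path, "r", encoding=encoding) as f:
        return f.read()


# Demo: create a small UTF-8 file containing U+2018, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "file.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("I don\u2018t like this\n")
    text = read_text(path)

print(text)
```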

User · Answer

There are a few points to consider.

A \u2018 character may appear only as a fragment of the representation of a unicode string in Python, e.g. if you write:

    >>> text = u'‘'
    >>> print repr(text)
    u'\u2018'

Now if you simply want to print the unicode string prettily, just use unicode's encode method:

    >>> text = u'I don\u2018t like this'
    >>> print text.encode('utf-8')
    I don‘t like this

To make sure that every line from any file would be read as unicode, you'd better use the codecs.open function instead of just open, which allows you to specify the file's encoding:

    >>> import codecs
    >>> f1 = codecs.open(file1, "r", "utf-8")
    >>> text = f1.read()
    >>> print type(text)
    <type 'unicode'>
    >>> print text.encode('utf-8')
    I don‘t like this

User · Answer

This is Python's way to show you unicode encoded strings. But I think you should be able to print the string on the screen or write it into a new file without any problems.

    >>> test = u"I don\u2018t like this"
    >>> test
    u'I don\u2018t like this'
    >>> print test
    I don‘t like this
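The same distinction between the escaped representation and the actual character can be seen in Python 3, where `repr()` shows printable non-ASCII characters as-is and the built-in `ascii()` gives the escaped form. A minimal sketch (variable names are illustrative):

```python
text = "I don\u2018t like this"

# ascii() escapes every non-ASCII character, much like Python 2's repr().
escaped = ascii(text)

print(escaped)  # the \u2018 escape is visible here
print(text)     # the real quotation mark is printed here
```

Either way the string holds a single U+2018 character; the backslash form is only how it is displayed.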

User · Answer

Actually, U+2018 is the Unicode representation of the special character ‘. If you want, you can convert instances of that character to U+0027 with this code:

    text = text.replace(u"\u2018", "'")

In addition, what are you using to write the file? f1.read() should return a string that looks like this:

    'I don\xe2\x80\x98t like this'

If it's returning this string, the file is being written incorrectly:

    'I don\u2018t like this'
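To see why `\xe2\x80\x98` is the expected byte sequence, here is a Python 3 sketch showing that those three bytes are just the UTF-8 encoding of U+2018. It also shows one way the garbled text form can arise (decoding the bytes as Latin-1, which is an assumption about what went wrong, not something the question confirms) and how that particular mistake can be reversed.

```python
original = "I don\u2018t like this"

# U+2018 becomes three bytes in UTF-8: e2 80 98.
correct_bytes = original.encode("utf-8")

# Decoding those bytes with the wrong codec yields three junk characters.
mojibake = correct_bytes.decode("latin-1")

# Reversing the wrong decode recovers the intended text.
repaired = mojibake.encode("latin-1").decode("utf-8")

print(correct_bytes)
print(repr(mojibake))
print(repaired)
```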
