Unicode UTF-8 reading and writing to files in Python

Question

I m having some brain failure in understanding reading and writing text to a file  Python 2 4      The string  which has an a-acute in it  ss   u Capit xe1n  ss8   ss encode  utf8   repr ss   repr ss8         u Capit xe1n      Capit xc3 xa1n      print ss  ss8 print  gt  gt  open  f1   w    ss8   gt  gt  gt  file  f1   read    Capit xc3 xa1n n    So I type in Capit xc3 xa1n into my favorite editor  in file f2   Then    gt  gt  gt  open  f1   read    Capit xc3 xa1n n   gt  gt  gt  open  f2   read    Capit  xc3  xa1n n   gt  gt  gt  open  f1   read   decode  utf8   u Capit xe1n n   gt  gt  gt  open  f2   read   decode  utf8   u Capit  xc3  xa1n n    What am I not understanding here  Clearly there is some vital bit of magic  or good sense  that I m missing  What does one type into text files to get proper conversions   What I m truly failing to grok here  is what the point of the UTF-8 representation is  if you can t actually get Python to recognize it  when it comes from outside  Maybe I should just JSON dump the string  and use that instead  since that has an asciiable representation  More to the point  is there an ASCII representation of this Unicode object that Python will recognize and decode  when coming in from a file   If so  how do I get it    gt  gt  gt  print simplejson dumps ss    Capit u00e1n    gt  gt  gt  print  gt  gt  file  f3   w    simplejson dumps ss   gt  gt  gt  simplejson load open  f3    u Capit xe1n

User · Answer

The  x   sequence is something that s specific to Python  It s not a universal byte escape sequence    How you actually enter in UTF-8-encoded non-ASCII depends on your OS and or your editor  Here s how you do it in Windows  For OS X to enter a with an acute accent you can just hit option   E  then A  and almost all text editors in OS X support UTF-8

User · Answer

So  I ve found a solution for what I m looking for  which is   print open  f2   read   decode  string-escape   decode  utf-8     There are some unusual codecs that are useful here  This particular reading allows one to take UTF-8 representations from within Python  copy them into an ASCII file  and have them be read in to Unicode  Under the  string-escape  decode  the slashes won t be doubled   This allows for the sort of round trip that I was imagining

User · Answer

Actually  this worked for me for reading a file with UTF-8 encoding in Python 3 2   import codecs f   codecs open  file name txt    r    UTF-8   for line in f      print line

User · Answer

In the notation  u Capit xe1n n    the   xe1  represents just one byte    x  tells you that  e1  is in hexadecimal  When you write  Capit xc3 xa1n   into your file you have   xc3  in it  Those are 4 bytes and in your code you read them all  You can see this when you display them    gt  gt  gt  open  f2   read    Capit  xc3  xa1n n    You can see that the backslash is escaped by a backslash  So you have four bytes in your string        x    c  and  3    Edit   As others pointed out in their answers you should just enter the characters in the editor and your editor should then handle the conversion to UTF-8 and save it   If you actually have a string in this format you can use the string escape codec to decode it into a normal string   In  15   print  Capit  xc3  xa1n n  decode  string escape   Capit  n   The result is a string that is encoded in UTF-8 where the accented character is represented by the two bytes that were written   xc3  xa1 in the original string  If you want to have a unicode string you have to decode again with UTF-8   To your edit  you don t have UTF-8 in your file  To actually see how it would look like   s   u Capit xe1n n  sutf8   s encode  UTF-8   open  utf-8 out    w   write sutf8    Compare the content of the file utf-8 out to the content of the file you saved with your editor

User · Answer

You have stumbled over the general problem with encodings  How can I tell in which encoding a file is   Answer  You can t unless the file format provides for this  XML  for example  begins with    lt  xml encoding  utf-8   gt    This header was carefully chosen so that it can be read no matter the encoding  In your case  there is no such hint  hence neither your editor nor Python has any idea what is going on  Therefore  you must use the codecs module and use codecs open path mode encoding  which provides the missing bit in Python   As for your editor  you must check if it offers some way to set the encoding of a file   The point of UTF-8 is to be able to encode 21-bit characters  Unicode  as an 8-bit data stream  because that s the only thing all computers in the world can handle   But since most OSs predate the Unicode era  they don t have suitable tools to attach the encoding information to files on the hard disk   The next issue is the representation in Python  This is explained perfectly in the comment by heikogerlach  You must understand that your console can only display ASCII  In order to display Unicode or anything    charcode 128  it must use some means of escaping  In your editor  you must not type the escaped display string but what the string means  in this case  you must enter the umlaut and save the file    That said  you can use the Python function eval   to turn an escaped string into a string    gt  gt  gt  x   eval   Capit  xc3  xa1n  n     gt  gt  gt  x  Capit xc3 xa1n n   gt  gt  gt  x 5    xc3   gt  gt  gt  len x 5   1   As you can see  the string   xc3  has been turned into a single character  This is now an 8-bit string  UTF-8 encoded  To get Unicode    gt  gt  gt  x decode  utf-8   u Capit xe1n n    Gregg Lind asked  I think there are some pieces missing here  the file f2 contains  hex   0000000  4361 7069 745c 7863 335c 7861 316e  Capit xc3 xa1n   codecs open  f2   rb    utf-8    for example  reads them all in a separate chars  expected  Is there any way to write to a file in ASCII that would work   Answer  That depends on what you mean  ASCII can t represent characters   127  So you need some way to say  the next few characters mean something special  which is what the sequence   x  does  It says  The next two characters are the code of a single character    u  does the same using four characters to encode Unicode up to 0xFFFF  65535    So you can t directly write Unicode to ASCII  because ASCII simply doesn t contain the same characters   You can write it as string escapes  as in f2   in this case  the file can be represented as ASCII  Or you can write it as UTF-8  in which case  you need an 8-bit safe stream   Your solution using decode  string-escape   does work  but you must be aware how much memory you use  Three times the amount of using codecs open     Remember that a file is just a sequence of bytes with 8 bits  Neither the bits nor the bytes have a meaning  It s you who says  65 means  A    Since  xc3 xa1 should become      but the computer has no means to know  you must tell it by specifying the encoding which was used when writing the file

User · Answer

- - encoding  utf-8 - -    converting a unknown formatting file in utf-8  import codecs import commands  file location    jumper sub  file encoding   commands getoutput  file -b --mime-encoding  s    file location   file stream   codecs open file location   r   file encoding  file output   codecs open file location  b    w    utf-8    for l in file stream      file output write l   file stream close   file output close

User · Answer

In the notation  u Capit xe1n n    the   xe1  represents just one byte    x  tells you that  e1  is in hexadecimal  When you write  Capit xc3 xa1n   into your file you have   xc3  in it  Those are 4 bytes and in your code you read them all  You can see this when you display them    gt  gt  gt  open  f2   read    Capit  xc3  xa1n n    You can see that the backslash is escaped by a backslash  So you have four bytes in your string        x    c  and  3    Edit   As others pointed out in their answers you should just enter the characters in the editor and your editor should then handle the conversion to UTF-8 and save it   If you actually have a string in this format you can use the string escape codec to decode it into a normal string   In  15   print  Capit  xc3  xa1n n  decode  string escape   Capit  n   The result is a string that is encoded in UTF-8 where the accented character is represented by the two bytes that were written   xc3  xa1 in the original string  If you want to have a unicode string you have to decode again with UTF-8   To your edit  you don t have UTF-8 in your file  To actually see how it would look like   s   u Capit xe1n n  sutf8   s encode  UTF-8   open  utf-8 out    w   write sutf8    Compare the content of the file utf-8 out to the content of the file you saved with your editor

User · Answer

You have stumbled over the general problem with encodings  How can I tell in which encoding a file is   Answer  You can t unless the file format provides for this  XML  for example  begins with    lt  xml encoding  utf-8   gt    This header was carefully chosen so that it can be read no matter the encoding  In your case  there is no such hint  hence neither your editor nor Python has any idea what is going on  Therefore  you must use the codecs module and use codecs open path mode encoding  which provides the missing bit in Python   As for your editor  you must check if it offers some way to set the encoding of a file   The point of UTF-8 is to be able to encode 21-bit characters  Unicode  as an 8-bit data stream  because that s the only thing all computers in the world can handle   But since most OSs predate the Unicode era  they don t have suitable tools to attach the encoding information to files on the hard disk   The next issue is the representation in Python  This is explained perfectly in the comment by heikogerlach  You must understand that your console can only display ASCII  In order to display Unicode or anything    charcode 128  it must use some means of escaping  In your editor  you must not type the escaped display string but what the string means  in this case  you must enter the umlaut and save the file    That said  you can use the Python function eval   to turn an escaped string into a string    gt  gt  gt  x   eval   Capit  xc3  xa1n  n     gt  gt  gt  x  Capit xc3 xa1n n   gt  gt  gt  x 5    xc3   gt  gt  gt  len x 5   1   As you can see  the string   xc3  has been turned into a single character  This is now an 8-bit string  UTF-8 encoded  To get Unicode    gt  gt  gt  x decode  utf-8   u Capit xe1n n    Gregg Lind asked  I think there are some pieces missing here  the file f2 contains  hex   0000000  4361 7069 745c 7863 335c 7861 316e  Capit xc3 xa1n   codecs open  f2   rb    utf-8    for example  reads them all in a separate chars  expected  Is there any way to write to a file in ASCII that would work   Answer  That depends on what you mean  ASCII can t represent characters   127  So you need some way to say  the next few characters mean something special  which is what the sequence   x  does  It says  The next two characters are the code of a single character    u  does the same using four characters to encode Unicode up to 0xFFFF  65535    So you can t directly write Unicode to ASCII  because ASCII simply doesn t contain the same characters   You can write it as string escapes  as in f2   in this case  the file can be represented as ASCII  Or you can write it as UTF-8  in which case  you need an 8-bit safe stream   Your solution using decode  string-escape   does work  but you must be aware how much memory you use  Three times the amount of using codecs open     Remember that a file is just a sequence of bytes with 8 bits  Neither the bits nor the bytes have a meaning  It s you who says  65 means  A    Since  xc3 xa1 should become      but the computer has no means to know  you must tell it by specifying the encoding which was used when writing the file

User · Answer

So  I ve found a solution for what I m looking for  which is   print open  f2   read   decode  string-escape   decode  utf-8     There are some unusual codecs that are useful here  This particular reading allows one to take UTF-8 representations from within Python  copy them into an ASCII file  and have them be read in to Unicode  Under the  string-escape  decode  the slashes won t be doubled   This allows for the sort of round trip that I was imagining

User · Answer

You have stumbled over the general problem with encodings  How can I tell in which encoding a file is   Answer  You can t unless the file format provides for this  XML  for example  begins with    lt  xml encoding  utf-8   gt    This header was carefully chosen so that it can be read no matter the encoding  In your case  there is no such hint  hence neither your editor nor Python has any idea what is going on  Therefore  you must use the codecs module and use codecs open path mode encoding  which provides the missing bit in Python   As for your editor  you must check if it offers some way to set the encoding of a file   The point of UTF-8 is to be able to encode 21-bit characters  Unicode  as an 8-bit data stream  because that s the only thing all computers in the world can handle   But since most OSs predate the Unicode era  they don t have suitable tools to attach the encoding information to files on the hard disk   The next issue is the representation in Python  This is explained perfectly in the comment by heikogerlach  You must understand that your console can only display ASCII  In order to display Unicode or anything    charcode 128  it must use some means of escaping  In your editor  you must not type the escaped display string but what the string means  in this case  you must enter the umlaut and save the file    That said  you can use the Python function eval   to turn an escaped string into a string    gt  gt  gt  x   eval   Capit  xc3  xa1n  n     gt  gt  gt  x  Capit xc3 xa1n n   gt  gt  gt  x 5    xc3   gt  gt  gt  len x 5   1   As you can see  the string   xc3  has been turned into a single character  This is now an 8-bit string  UTF-8 encoded  To get Unicode    gt  gt  gt  x decode  utf-8   u Capit xe1n n    Gregg Lind asked  I think there are some pieces missing here  the file f2 contains  hex   0000000  4361 7069 745c 7863 335c 7861 316e  Capit xc3 xa1n   codecs open  f2   rb    utf-8    for example  reads them all in a separate chars  expected  Is there any way to write to a file in ASCII that would work   Answer  That depends on what you mean  ASCII can t represent characters   127  So you need some way to say  the next few characters mean something special  which is what the sequence   x  does  It says  The next two characters are the code of a single character    u  does the same using four characters to encode Unicode up to 0xFFFF  65535    So you can t directly write Unicode to ASCII  because ASCII simply doesn t contain the same characters   You can write it as string escapes  as in f2   in this case  the file can be represented as ASCII  Or you can write it as UTF-8  in which case  you need an 8-bit safe stream   Your solution using decode  string-escape   does work  but you must be aware how much memory you use  Three times the amount of using codecs open     Remember that a file is just a sequence of bytes with 8 bits  Neither the bits nor the bytes have a meaning  It s you who says  65 means  A    Since  xc3 xa1 should become      but the computer has no means to know  you must tell it by specifying the encoding which was used when writing the file

User · Answer

I found the most simple approach by changing the default encoding of the whole script to be  UTF-8    import sys reload sys  sys setdefaultencoding  utf8     any open  print or other statement will just use utf8   Works at least for Python 2 7 9   Thx goes to https   markhneedham com blog 2015 05 21 python-unicodeencodeerror-ascii-codec-cant-encode-character-uxfc-in-position-11-ordinal-not-in-range128   look at the end

User · Answer

I was trying to parse iCal using Python 2 7 9      from icalendar import Calendar   But I was getting    Traceback  most recent call last    File  ical py   line 92  in parse     print      format e attr   UnicodeEncodeError   ascii  codec can t encode character u  xe1  in position 7  ordinal not in range 128    and it was fixed with just   print      format e attr  encode  utf-8       Now it can print lik      b  ss

User · Answer

To read in an Unicode string and then send to HTML  I did this   fileline decode  utf-8   encode  ascii    xmlcharrefreplace     Useful for python powered http servers

User · Answer

You can also improve the original open   function to work with Unicode files by replacing it in place  using the partial function  The beauty of this solution is you don t need to change any old code  It s transparent   import codecs import functools open   functools partial codecs open  encoding  utf-8

User · Answer

Well  your favorite text editor does not realize that  xc3 xa1 are supposed to be character literals  but it interprets them as text  That s why you get the double backslashes in the last line -- it s now a real backslash   xc3  etc  in your file   If you want to read and write encoded files in Python  best use the codecs module   Pasting text between the terminal and applications is difficult  because you don t know which program will interpret your text using which encoding  You could try the following    gt  gt  gt  s   file  f1   read    gt  gt  gt  print unicode s   Latin-1   Capit    n   Then paste this string into your editor and make sure that it stores it using Latin-1  Under the assumption that the clipboard does not garble the string  the round trip should work

User · Answer

In the notation  u Capit xe1n n    the   xe1  represents just one byte    x  tells you that  e1  is in hexadecimal  When you write  Capit xc3 xa1n   into your file you have   xc3  in it  Those are 4 bytes and in your code you read them all  You can see this when you display them    gt  gt  gt  open  f2   read    Capit  xc3  xa1n n    You can see that the backslash is escaped by a backslash  So you have four bytes in your string        x    c  and  3    Edit   As others pointed out in their answers you should just enter the characters in the editor and your editor should then handle the conversion to UTF-8 and save it   If you actually have a string in this format you can use the string escape codec to decode it into a normal string   In  15   print  Capit  xc3  xa1n n  decode  string escape   Capit  n   The result is a string that is encoded in UTF-8 where the accented character is represented by the two bytes that were written   xc3  xa1 in the original string  If you want to have a unicode string you have to decode again with UTF-8   To your edit  you don t have UTF-8 in your file  To actually see how it would look like   s   u Capit xe1n n  sutf8   s encode  UTF-8   open  utf-8 out    w   write sutf8    Compare the content of the file utf-8 out to the content of the file you saved with your editor

User · Answer

except for codecs open    one can uses io open   to work with Python2 or Python3 to read   write unicode file  example  import io  text   u     encoding    utf8   with io open  data txt    w   encoding encoding  newline   n   as fout      fout write text   with io open  data txt    r   encoding encoding  newline   n   as fin      text2   fin read    assert text    text2

User · Answer

To read in an Unicode string and then send to HTML  I did this   fileline decode  utf-8   encode  ascii    xmlcharrefreplace     Useful for python powered http servers

User · Answer

The  x   sequence is something that s specific to Python  It s not a universal byte escape sequence    How you actually enter in UTF-8-encoded non-ASCII depends on your OS and or your editor  Here s how you do it in Windows  For OS X to enter a with an acute accent you can just hit option   E  then A  and almost all text editors in OS X support UTF-8

User · Answer

So  I ve found a solution for what I m looking for  which is   print open  f2   read   decode  string-escape   decode  utf-8     There are some unusual codecs that are useful here  This particular reading allows one to take UTF-8 representations from within Python  copy them into an ASCII file  and have them be read in to Unicode  Under the  string-escape  decode  the slashes won t be doubled   This allows for the sort of round trip that I was imagining

User · Answer

You have stumbled over the general problem with encodings  How can I tell in which encoding a file is   Answer  You can t unless the file format provides for this  XML  for example  begins with    lt  xml encoding  utf-8   gt    This header was carefully chosen so that it can be read no matter the encoding  In your case  there is no such hint  hence neither your editor nor Python has any idea what is going on  Therefore  you must use the codecs module and use codecs open path mode encoding  which provides the missing bit in Python   As for your editor  you must check if it offers some way to set the encoding of a file   The point of UTF-8 is to be able to encode 21-bit characters  Unicode  as an 8-bit data stream  because that s the only thing all computers in the world can handle   But since most OSs predate the Unicode era  they don t have suitable tools to attach the encoding information to files on the hard disk   The next issue is the representation in Python  This is explained perfectly in the comment by heikogerlach  You must understand that your console can only display ASCII  In order to display Unicode or anything    charcode 128  it must use some means of escaping  In your editor  you must not type the escaped display string but what the string means  in this case  you must enter the umlaut and save the file    That said  you can use the Python function eval   to turn an escaped string into a string    gt  gt  gt  x   eval   Capit  xc3  xa1n  n     gt  gt  gt  x  Capit xc3 xa1n n   gt  gt  gt  x 5    xc3   gt  gt  gt  len x 5   1   As you can see  the string   xc3  has been turned into a single character  This is now an 8-bit string  UTF-8 encoded  To get Unicode    gt  gt  gt  x decode  utf-8   u Capit xe1n n    Gregg Lind asked  I think there are some pieces missing here  the file f2 contains  hex   0000000  4361 7069 745c 7863 335c 7861 316e  Capit xc3 xa1n   codecs open  f2   rb    utf-8    for example  reads them all in a separate chars  expected  Is there any way to write to a file in ASCII that would work   Answer  That depends on what you mean  ASCII can t represent characters   127  So you need some way to say  the next few characters mean something special  which is what the sequence   x  does  It says  The next two characters are the code of a single character    u  does the same using four characters to encode Unicode up to 0xFFFF  65535    So you can t directly write Unicode to ASCII  because ASCII simply doesn t contain the same characters   You can write it as string escapes  as in f2   in this case  the file can be represented as ASCII  Or you can write it as UTF-8  in which case  you need an 8-bit safe stream   Your solution using decode  string-escape   does work  but you must be aware how much memory you use  Three times the amount of using codecs open     Remember that a file is just a sequence of bytes with 8 bits  Neither the bits nor the bytes have a meaning  It s you who says  65 means  A    Since  xc3 xa1 should become      but the computer has no means to know  you must tell it by specifying the encoding which was used when writing the file

User · Answer

The  x   sequence is something that s specific to Python  It s not a universal byte escape sequence    How you actually enter in UTF-8-encoded non-ASCII depends on your OS and or your editor  Here s how you do it in Windows  For OS X to enter a with an acute accent you can just hit option   E  then A  and almost all text editors in OS X support UTF-8

User · Answer

- - encoding  utf-8 - -    converting a unknown formatting file in utf-8  import codecs import commands  file location    jumper sub  file encoding   commands getoutput  file -b --mime-encoding  s    file location   file stream   codecs open file location   r   file encoding  file output   codecs open file location  b    w    utf-8    for l in file stream      file output write l   file stream close   file output close

User · Answer

Rather than mess with the encode and decode methods I find it easier to specify the encoding when opening the file  The io module  added in Python 2 6  provides an io open function  which has an encoding parameter   Use the open method from the io module    gt  gt  gt import io  gt  gt  gt f   io open  test   mode  r   encoding  utf-8     Then after calling f s read   function  an encoded Unicode object is returned    gt  gt  gt f read   u Capit xe1l n n    Note that in Python 3  the io open function is an alias for the built-in open function  The built-in open function only supports the encoding argument in Python 3  not Python 2   Edit  Previously this answer recommended the codecs module  The codecs module can cause problems when mixing read   and readline    so this answer now recommends the io module instead    Use the open method from the codecs module    gt  gt  gt import codecs  gt  gt  gt f   codecs open  test    r    utf-8     Then after calling f s read   function  an encoded Unicode object is returned    gt  gt  gt f read   u Capit xe1l n n    If you know the encoding of a file  using the codecs package is going to be much less confusing   See http   docs python org library codecs html codecs open

User · Answer

I found the most simple approach by changing the default encoding of the whole script to be  UTF-8    import sys reload sys  sys setdefaultencoding  utf8     any open  print or other statement will just use utf8   Works at least for Python 2 7 9   Thx goes to https   markhneedham com blog 2015 05 21 python-unicodeencodeerror-ascii-codec-cant-encode-character-uxfc-in-position-11-ordinal-not-in-range128   look at the end

User · Answer

Rather than mess with the encode and decode methods I find it easier to specify the encoding when opening the file  The io module  added in Python 2 6  provides an io open function  which has an encoding parameter   Use the open method from the io module    gt  gt  gt import io  gt  gt  gt f   io open  test   mode  r   encoding  utf-8     Then after calling f s read   function  an encoded Unicode object is returned    gt  gt  gt f read   u Capit xe1l n n    Note that in Python 3  the io open function is an alias for the built-in open function  The built-in open function only supports the encoding argument in Python 3  not Python 2   Edit  Previously this answer recommended the codecs module  The codecs module can cause problems when mixing read   and readline    so this answer now recommends the io module instead    Use the open method from the codecs module    gt  gt  gt import codecs  gt  gt  gt f   codecs open  test    r    utf-8     Then after calling f s read   function  an encoded Unicode object is returned    gt  gt  gt f read   u Capit xe1l n n    If you know the encoding of a file  using the codecs package is going to be much less confusing   See http   docs python org library codecs html codecs open

User · Answer

In the notation  u Capit xe1n n    the   xe1  represents just one byte    x  tells you that  e1  is in hexadecimal  When you write  Capit xc3 xa1n   into your file you have   xc3  in it  Those are 4 bytes and in your code you read them all  You can see this when you display them    gt  gt  gt  open  f2   read    Capit  xc3  xa1n n    You can see that the backslash is escaped by a backslash  So you have four bytes in your string        x    c  and  3    Edit   As others pointed out in their answers you should just enter the characters in the editor and your editor should then handle the conversion to UTF-8 and save it   If you actually have a string in this format you can use the string escape codec to decode it into a normal string   In  15   print  Capit  xc3  xa1n n  decode  string escape   Capit  n   The result is a string that is encoded in UTF-8 where the accented character is represented by the two bytes that were written   xc3  xa1 in the original string  If you want to have a unicode string you have to decode again with UTF-8   To your edit  you don t have UTF-8 in your file  To actually see how it would look like   s   u Capit xe1n n  sutf8   s encode  UTF-8   open  utf-8 out    w   write sutf8    Compare the content of the file utf-8 out to the content of the file you saved with your editor

User · Answer

Actually  this worked for me for reading a file with UTF-8 encoding in Python 3 2   import codecs f   codecs open  file name txt    r    UTF-8   for line in f      print line

User · Answer

I was trying to parse iCal using Python 2 7 9      from icalendar import Calendar   But I was getting    Traceback  most recent call last    File  ical py   line 92  in parse     print      format e attr   UnicodeEncodeError   ascii  codec can t encode character u  xe1  in position 7  ordinal not in range 128    and it was fixed with just   print      format e attr  encode  utf-8       Now it can print lik      b  ss

User · Answer

Now all you need in Python3 is open Filename   r   encoding  utf-8     Edit on 2016-02-10 for requested clarification   Python3 added the encoding parameter to its open function  The following information about the open function is gathered from here  https   docs python org 3 library functions html open  open file  mode  r   buffering -1         encoding None  errors None  newline None         closefd True  opener None       Encoding is the name of the encoding used to decode or encode the   file  This should only be used in text mode  The default encoding is   platform dependent  whatever locale getpreferredencoding     returns   but any text encoding supported by Python can be used    See the codecs module for the list of supported encodings    So by adding encoding  utf-8  as a parameter to the open function  the file reading and writing is all done as utf8  which is also now the default encoding of everything done in Python

User · Answer

So  I ve found a solution for what I m looking for  which is   print open  f2   read   decode  string-escape   decode  utf-8     There are some unusual codecs that are useful here  This particular reading allows one to take UTF-8 representations from within Python  copy them into an ASCII file  and have them be read in to Unicode  Under the  string-escape  decode  the slashes won t be doubled   This allows for the sort of round trip that I was imagining

User · Answer

Well  your favorite text editor does not realize that  xc3 xa1 are supposed to be character literals  but it interprets them as text  That s why you get the double backslashes in the last line -- it s now a real backslash   xc3  etc  in your file   If you want to read and write encoded files in Python  best use the codecs module   Pasting text between the terminal and applications is difficult  because you don t know which program will interpret your text using which encoding  You could try the following    gt  gt  gt  s   file  f1   read    gt  gt  gt  print unicode s   Latin-1   Capit    n   Then paste this string into your editor and make sure that it stores it using Latin-1  Under the assumption that the clipboard does not garble the string  the round trip should work

User · Answer

You can also improve the original open   function to work with Unicode files by replacing it in place  using the partial function  The beauty of this solution is you don t need to change any old code  It s transparent   import codecs import functools open   functools partial codecs open  encoding  utf-8

User · Answer

except for codecs open    one can uses io open   to work with Python2 or Python3 to read   write unicode file  example  import io  text   u     encoding    utf8   with io open  data txt    w   encoding encoding  newline   n   as fout      fout write text   with io open  data txt    r   encoding encoding  newline   n   as fin      text2   fin read    assert text    text2

User · Answer

Well  your favorite text editor does not realize that  xc3 xa1 are supposed to be character literals  but it interprets them as text  That s why you get the double backslashes in the last line -- it s now a real backslash   xc3  etc  in your file   If you want to read and write encoded files in Python  best use the codecs module   Pasting text between the terminal and applications is difficult  because you don t know which program will interpret your text using which encoding  You could try the following    gt  gt  gt  s   file  f1   read    gt  gt  gt  print unicode s   Latin-1   Capit    n   Then paste this string into your editor and make sure that it stores it using Latin-1  Under the assumption that the clipboard does not garble the string  the round trip should work

User · Answer

The  x   sequence is something that s specific to Python  It s not a universal byte escape sequence    How you actually enter in UTF-8-encoded non-ASCII depends on your OS and or your editor  Here s how you do it in Windows  For OS X to enter a with an acute accent you can just hit option   E  then A  and almost all text editors in OS X support UTF-8

User · Answer

Well  your favorite text editor does not realize that  xc3 xa1 are supposed to be character literals  but it interprets them as text  That s why you get the double backslashes in the last line -- it s now a real backslash   xc3  etc  in your file   If you want to read and write encoded files in Python  best use the codecs module   Pasting text between the terminal and applications is difficult  because you don t know which program will interpret your text using which encoding  You could try the following    gt  gt  gt  s   file  f1   read    gt  gt  gt  print unicode s   Latin-1   Capit    n   Then paste this string into your editor and make sure that it stores it using Latin-1  Under the assumption that the clipboard does not garble the string  the round trip should work

User · Answer

Now all you need in Python3 is open Filename   r   encoding  utf-8     Edit on 2016-02-10 for requested clarification   Python3 added the encoding parameter to its open function  The following information about the open function is gathered from here  https   docs python org 3 library functions html open  open file  mode  r   buffering -1         encoding None  errors None  newline None         closefd True  opener None       Encoding is the name of the encoding used to decode or encode the   file  This should only be used in text mode  The default encoding is   platform dependent  whatever locale getpreferredencoding     returns   but any text encoding supported by Python can be used    See the codecs module for the list of supported encodings    So by adding encoding  utf-8  as a parameter to the open function  the file reading and writing is all done as utf8  which is also now the default encoding of everything done in Python

[python] Unicode (UTF-8) reading and writing to files in Python

Examples related to python

Examples related to unicode

Examples related to utf-8

Examples related to io