Writing Unicode text to a text file

Question

I'm pulling data out of a Google doc, processing it, and writing it to a file (that eventually I will paste into a Wordpress page).

It has some non-ASCII symbols. How can I convert these safely to symbols that can be used in HTML source?

Currently I'm converting everything to Unicode on the way in, joining it all together in a Python string, then doing:

    import codecs
    f = codecs.open('out.txt', mode="w", encoding="iso-8859-1")
    f.write(all_html.encode("iso-8859-1", "replace"))

There is an encoding error on the last line:

    UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position
    12286: ordinal not in range(128)

Partial solution:

This Python runs without an error:

    row = [unicode(x.strip()) if x is not None else u'' for x in row]
    all_html = row[0] + "<br/>" + row[1]
    f = open('out.txt', 'w')
    f.write(all_html.encode("utf-8"))

But then if I open the actual text file, I see lots of symbols like:

    Qurâ€™an

Maybe I need to write to something other than a text file?

User · Answer

In Python 2.6+, you could use io.open(), which is the default (builtin open()) on Python 3:

    import io

    with io.open(filename, 'w', encoding=character_encoding) as file:
        file.write(unicode_text)

It might be more convenient if you need to write the text incrementally (you don't need to call unicode_text.encode(character_encoding) multiple times). Unlike the codecs module, the io module has proper universal newlines support.
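As a rough illustration of the incremental-writing point above, here is a minimal sketch. The helper names `write_lines` and `read_text`, the sample strings, and the temporary output path are all assumptions made for the example, not part of the answer.

```python
import io
import os
import tempfile


def write_lines(path, lines, encoding="utf-8"):
    """Write an iterable of text lines; the file object does the encoding."""
    # newline="\n" keeps the line endings identical on every platform.
    with io.open(path, "w", encoding=encoding, newline="\n") as f:
        for line in lines:
            f.write(line + u"\n")  # no per-call .encode() is needed


def read_text(path, encoding="utf-8"):
    """Read the whole file back as text."""
    with io.open(path, "r", encoding=encoding) as f:
        return f.read()


# Hypothetical sample data and a throwaway output location.
sample_lines = [u"Qur\u2019an", u"caf\u00e9", u"plain ascii"]
out_path = os.path.join(tempfile.mkdtemp(), "out.txt")

write_lines(out_path, sample_lines)
round_tripped = read_text(out_path)
```

Because the encoding is fixed once when the file is opened, each `write` call only ever sees text, which is what makes writing in small pieces convenient.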

User · Answer

Deal exclusively with unicode objects as much as possible by decoding things to unicode objects when you first get them and encoding them as necessary on the way out.

If your string is actually a unicode object, you'll need to convert it to an encoded byte string (UTF-8 here) before writing it to a file:

    foo = u'Δ, Й, ק, م, ๗, あ, 叶, 葉, and 말.'
    f = open('test', 'w')
    f.write(foo.encode('utf8'))
    f.close()

When you read that file again, you'll get a UTF-8-encoded byte string that you can decode to a unicode object:

    f = file('test', 'r')  # Python 2 only; file() was removed in Python 3
    print f.read().decode('utf8')
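The "decode early, encode late" advice can be sketched in Python 3 as follows. The helper names `to_text` and `to_bytes` and the sample bytes are assumptions for illustration. The last line also shows what happens when the same bytes are decoded with the wrong codec, which is one plausible source of the odd symbols described in the question.

```python
def to_text(raw_bytes, encoding="utf-8"):
    """Decode at the boundary where the data comes in."""
    return raw_bytes.decode(encoding)


def to_bytes(text, encoding="utf-8"):
    """Encode at the boundary where the data goes out."""
    return text.encode(encoding)


# Hypothetical input: the UTF-8 bytes for "Qur’an" (U+2019 is the apostrophe).
raw = b"Qur\xe2\x80\x99an"

text = to_text(raw)          # work with text from here on
processed = text.upper()     # any processing happens on text, not bytes
encoded = to_bytes(processed)

# Decoding the same bytes as cp1252 yields mojibake instead of an error.
misread = raw.decode("cp1252")
```

The bytes on disk are fine in the `misread` case; only the codec chosen by the reader is wrong.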

User · Answer

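Unicode string handling is already standardized in Python 3. A str is a sequence of Unicode code points in memory (CPython stores them using 1, 2, or 4 bytes per code point, depending on the content). You only need to open the file with the utf-8 encoding; the conversion from the in-memory representation to variable-byte-length utf-8 is performed automatically when writing to the file.

    out1 = "..."  # any text containing non-ASCII characters
    fobj = open("t1.txt", "w", encoding="utf-8")
    fobj.write(out1)
    fobj.close()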
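To make the "variable-byte-length" remark concrete, the following small sketch counts how many bytes UTF-8 uses for a few characters. The choice of sample characters is an assumption for the example.

```python
# Each sample is a single code point, so len(ch) == 1 for all of them.
samples = [
    "A",           # U+0041, ASCII
    "\u00e9",      # U+00E9, LATIN SMALL LETTER E WITH ACUTE
    "\u2019",      # U+2019, RIGHT SINGLE QUOTATION MARK
    "\U0001F600",  # U+1F600, outside the Basic Multilingual Plane
]

# Map each character to the number of bytes in its UTF-8 encoding.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
```

The character count of a string and the byte count of the resulting file therefore differ as soon as non-ASCII text is involved.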

User · Answer

That error arises when you try to encode a non-unicode string: it tries to decode it, assuming it's in plain ASCII. There are two possibilities:

1. You're encoding it to a bytestring, but because you've used codecs.open, the write method expects a unicode object. So you encode it, and it tries to decode it again. Try f.write(all_html) instead.
2. all_html is not, in fact, a unicode object. When you do .encode(...), it first tries to decode it.
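The hidden decode step described above can be made explicit. The sketch below is Python 3 code that imitates, by hand, what Python 2 does implicitly when a byte string is encoded or written to a codecs.open file; the function name `implicit_python2_encode` and the sample text are assumptions for illustration.

```python
# A byte string that already went through .encode(); 0xa0 is a
# non-breaking space in iso-8859-1.
already_encoded = u"a\u00a0b".encode("iso-8859-1")


def implicit_python2_encode(byte_string, target="iso-8859-1"):
    """Imitate Python 2: decode as ASCII first, then encode to the target."""
    return byte_string.decode("ascii").encode(target)


error_name = None
error_message = None
try:
    implicit_python2_encode(already_encoded)
except UnicodeDecodeError as exc:
    # The failure is in the ASCII *decode*, even though we asked to encode.
    error_name = type(exc).__name__
    error_message = str(exc)
```

This is why the traceback in the question mentions decoding and the 'ascii' codec even though the code only ever calls encode.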

User · Answer

In case of writing in python3 (the file is encoded with the locale's default encoding unless you pass encoding="utf-8" to open):

    >>> a = u'bats\u00E0'
    >>> print(a)
    batsà
    >>> f = open("/tmp/test", "w")
    >>> f.write(a)
    5
    >>> f.close()
    >>> data = open("/tmp/test").read()
    >>> data
    'batsà'

In case of writing in python2:

    >>> a = u'bats\u00E0'
    >>> f = open("/tmp/test", "w")
    >>> f.write(a)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)

To avoid this error you would have to encode it to bytes using the "utf-8" codec like this:

    >>> f.write(a.encode("utf-8"))
    >>> f.close()

and decode the data while reading using the "utf-8" codec:

    >>> data = open("/tmp/test").read()
    >>> data.decode("utf-8")
    u'bats\xe0'

Also, if you execute print on the unicode string, it is encoded automatically using the terminal's encoding:

    >>> print a
    batsà

User · Answer

How to print unicode characters into a file (Python 2):

Save this to file foo.py:

    #!/usr/bin/python -tt
    # -*- coding: utf-8 -*-
    import codecs
    import sys

    # Wrap stdout so that unicode objects are encoded as UTF-8 on output.
    UTF8Writer = codecs.getwriter('utf8')
    sys.stdout = UTF8Writer(sys.stdout)
    print(u'e with obfuscation: é')

Run it and pipe output to file:

    python foo.py > tmp.txt

Open tmp.txt and look inside, you see this:

    el@apollo:~$ cat tmp.txt
    e with obfuscation: é

Thus you have saved unicode e with an obfuscation mark on it to a file.
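On Python 3, sys.stdout is already a text stream, so wrapping it with codecs.getwriter as above does not carry over directly. A rough equivalent, sketched below, is to wrap a binary stream in io.TextIOWrapper with an explicit encoding. The helper name `make_utf8_writer` is an assumption, and an in-memory io.BytesIO stands in for sys.stdout.buffer so the result can be inspected.

```python
import io


def make_utf8_writer(binary_stream):
    """Wrap a binary stream so that printed text is encoded as UTF-8."""
    return io.TextIOWrapper(binary_stream, encoding="utf-8", newline="\n")


# Stand-in for sys.stdout.buffer, so the written bytes can be examined.
buffer = io.BytesIO()
writer = make_utf8_writer(buffer)

print(u"e with accent: \u00e9", file=writer)
writer.flush()  # push the encoded bytes through to the underlying stream

captured = buffer.getvalue()
```

On Python 3.7 and later, sys.stdout.reconfigure(encoding="utf-8") is another way to get the same effect for the real standard output.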

User · Answer

The file opened by codecs.open is a file that takes unicode data, encodes it in iso-8859-1 and writes it to the file. However, what you try to write isn't unicode; you take unicode and encode it in iso-8859-1 yourself. That's what the unicode.encode method does, and the result of encoding a unicode string is a bytestring (a str type).

You should either use normal open() and encode the unicode yourself, or (usually a better idea) use codecs.open() and not encode the data yourself.
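The two options can be compared side by side. This is a sketch in Python 3, where the builtin open accepts encoding and errors arguments and plays the role codecs.open plays in the answer; the sample text, file names, and temporary directory are assumptions. The final line uses the standard "xmlcharrefreplace" error handler, which may be of interest given that the question is about producing HTML source.

```python
import os
import tempfile

text = u"Qur\u2019an \u00e9"  # hypothetical sample with two non-ASCII characters
tmp_dir = tempfile.mkdtemp()

# Option 1: open in binary mode and encode the text yourself.
path_manual = os.path.join(tmp_dir, "manual.txt")
with open(path_manual, "wb") as f:
    f.write(text.encode("iso-8859-1", "replace"))

# Option 2: let the file object encode; write text, never bytes.
path_auto = os.path.join(tmp_dir, "auto.txt")
with open(path_auto, "w", encoding="iso-8859-1", errors="replace") as f:
    f.write(text)

with open(path_manual, "rb") as f:
    manual_bytes = f.read()
with open(path_auto, "rb") as f:
    auto_bytes = f.read()

# Characters outside ASCII become numeric character references.
html_safe = text.encode("ascii", "xmlcharrefreplace").decode("ascii")
```

Both files end up with the same bytes; the mistake in the question is mixing the two options, so the data gets encoded twice.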

User · Answer

Preface: will your viewer work?

Make sure your viewer/editor/terminal (however you are interacting with your utf-8 encoded file) can read the file. This is frequently an issue on Windows, for example, with Notepad.

Writing Unicode text to a text file?

In Python 2, use open from the io module (this is the same as the builtin open in Python 3):

    import io

Best practice, in general, is to use UTF-8 for writing to files (we don't even have to worry about byte-order with utf-8).

    encoding = 'utf-8'

utf-8 is the most modern and universally usable encoding - it works in all web browsers, most text-editors (see your settings if you have issues) and most terminals/shells.

On Windows, you might try utf-16le if you're limited to viewing output in Notepad (or another limited viewer).

    encoding = 'utf-16le'  # sorry, Windows users... :(

And just open it with the context manager and write your unicode characters out:

    with io.open(filename, 'w', encoding=encoding) as f:
        f.write(unicode_object)

Example using many Unicode characters

Here's an example that attempts to map every possible character up to three bytes wide (4 is the max, but that would be going a bit far) from the digital representation (in integers) to an encoded printable output, along with its name, if possible (put this into a file called uni.py):

    from __future__ import print_function
    import io
    from unicodedata import name, category
    from curses.ascii import controlnames
    from collections import Counter

    try:  # use these if Python 2
        unicode_chr, range = unichr, xrange
    except NameError:  # Python 3
        unicode_chr = chr

    exclude_categories = set(('Co', 'Cn'))
    counts = Counter()
    control_names = dict(enumerate(controlnames))
    with io.open('unidata', 'w', encoding='utf-8') as f:
        for x in range((2**8)**3):
            try:
                char = unicode_chr(x)
            except ValueError:
                continue  # can't map to unicode, try next x
            cat = category(char)
            counts.update((cat,))
            if cat in exclude_categories:
                continue  # get rid of noise & greatly shorten result file
            try:
                uname = name(char)
            except ValueError:  # probably control character, don't use actual
                uname = control_names.get(x, '')
                f.write(u'{0:>6x} {1}    {2}\n'.format(x, cat, uname))
            else:
                f.write(u'{0:>6x} {1}  {2}  {3}\n'.format(x, cat, char, uname))
    # may as well describe the types we logged.
    for cat, count in counts.items():
        print('{0} chars of category, {1}'.format(count, cat))

This should run in the order of about a minute, and you can view the data file, and if your file viewer can display unicode, you'll see it. Information about the categories can be found in the Unicode Character Database documentation (General_Category values). Based on the counts, we can probably improve our results by excluding the Cn and Co categories, which have no symbols associated with them.

    $ python uni.py

It will display the hexadecimal mapping, category, symbol (unless it can't get the name, so probably a control character), and the name of the symbol.

I recommend less on Unix or Cygwin (don't print/cat the entire file to your output):

    $ less unidata

This displays lines similar to the following, which I sampled from it using Python 2 (unicode 5.2):

         0 Cc NUL
        20 Zs     SPACE
        21 Po  !  EXCLAMATION MARK
        b6 So  ¶  PILCROW SIGN
        d0 Lu  Ð  LATIN CAPITAL LETTER ETH
       e59 Nd  ๙  THAI DIGIT NINE
      2887 So  ⢇  BRAILLE PATTERN DOTS-1238
      bc13 Lo  밓  HANGUL SYLLABLE MIH
      ffeb Sm  ￫  HALFWIDTH RIGHTWARDS ARROW

My Python 3.5 from Anaconda has unicode 8.0; I would presume most 3's would.
