Write to UTF-8 file in Python

Question

I'm really confused with the codecs.open function. When I do:

    file = codecs.open("temp", "w", "utf-8")
    file.write(codecs.BOM_UTF8)
    file.close()

It gives me the error:

    UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)

If I do:

    file = open("temp", "w")
    file.write(codecs.BOM_UTF8)
    file.close()

It works fine.

The question is: why does the first method fail? And how do I insert the BOM?

If the second method is the correct way of doing it, what is the point of using codecs.open(filename, "w", "utf-8")?

User · Answer

S.Lott gives the right procedure, but expanding on the Unicode issues, the Python interpreter can provide more insights.

Jon Skeet is right (unusual) about the codecs module - it contains byte strings:

    >>> import codecs
    >>> codecs.BOM
    '\xff\xfe'
    >>> codecs.BOM_UTF8
    '\xef\xbb\xbf'
    >>>

Picking another nit, the BOM has a standard Unicode name, and it can be entered as:

    >>> bom = u"\N{ZERO WIDTH NO-BREAK SPACE}"
    >>> bom
    u'\ufeff'

It is also accessible via unicodedata:

    >>> import unicodedata
    >>> unicodedata.lookup('ZERO WIDTH NO-BREAK SPACE')
    u'\ufeff'
    >>>
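
The interpreter session above is Python 2. As a rough sketch of the same check under Python 3 (an assumption, since the thread itself uses Python 2), the constants in codecs are bytes objects and the BOM character is an ordinary one-character str. The variable names below are only illustrative.

```python
import codecs
import unicodedata

# In Python 3 the codecs constants are bytes objects, not text.
bom_bytes = codecs.BOM_UTF8

# The BOM as a one-character text string, entered by its Unicode name.
bom_char = "\N{ZERO WIDTH NO-BREAK SPACE}"

# Encoding that character as UTF-8 gives the same three bytes
# that are stored in codecs.BOM_UTF8.
encoded = bom_char.encode("utf-8")

print(type(bom_bytes).__name__)
print(encoded == bom_bytes)
print(unicodedata.name(bom_char))
```

The point of the comparison is that the byte string and the Unicode character are two views of the same mark: one is the encoded form, the other is the thing being encoded.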

User · Answer

I believe the problem is that codecs.BOM_UTF8 is a byte string, not a Unicode string. I suspect the file handler is trying to guess what you really mean based on "I'm meant to be writing Unicode as UTF-8-encoded text, but you've given me a byte string!"

Try writing the Unicode string for the byte order mark (i.e. Unicode U+FEFF) directly, so that the file just encodes that as UTF-8:

    import codecs

    file = codecs.open("lol", "w", "utf-8")
    file.write(u'\ufeff')
    file.close()

(That seems to give the right answer - a file with bytes EF BB BF.)

EDIT: S. Lott's suggestion of using "utf-8-sig" as the encoding is a better one than explicitly writing the BOM yourself, but I'll leave this answer here as it explains what was going wrong before.
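
For readers on Python 3 (an assumption; the code in this thread is Python 2), a minimal sketch of the same idea follows. There the mismatch is no longer papered over by an implicit ASCII decode: a text-mode handle rejects bytes outright with a TypeError, while writing U+FEFF as text still produces EF BB BF on disk. The helper names and the temporary file paths are hypothetical.

```python
import codecs
import os
import tempfile


def text_handle_rejects_bytes(path):
    """Return True if a UTF-8 text handle refuses a byte string."""
    with open(path, "w", encoding="utf-8") as handle:
        try:
            handle.write(codecs.BOM_UTF8)  # bytes, not text
        except TypeError:
            return True
    return False


def write_with_manual_bom(path, text):
    """Write text as UTF-8, preceded by U+FEFF written as a character."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\ufeff")  # the encoder turns this into EF BB BF
        handle.write(text)


# Hypothetical demo files in a temporary directory.
workdir = tempfile.mkdtemp()
rejected = text_handle_rejects_bytes(os.path.join(workdir, "temp"))

demo_path = os.path.join(workdir, "lol")
write_with_manual_bom(demo_path, "hello\n")

# Read the raw bytes back to see what actually landed on disk.
with open(demo_path, "rb") as handle:
    raw = handle.read()

print(rejected)
print(raw.startswith(codecs.BOM_UTF8))
```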

User · Answer

Read the following: http://docs.python.org/library/codecs.html#module-encodings.utf_8_sig

Do this:

    import codecs

    with codecs.open("test_output", "w", "utf-8-sig") as temp:
        temp.write("hi mom\n")
        temp.write(u"This has ♭")

The resulting file is UTF-8 with the expected BOM.
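
As a sketch of how the "utf-8-sig" codec behaves in both directions, here is a Python 3 version (an assumption; the snippet in this answer targets Python 2) that uses the built-in open. It writes two chunks, then reads the file back three ways. The temporary path and the sample character U+266D are chosen only for illustration.

```python
import codecs
import os
import tempfile

# Hypothetical output file in a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "test_output")

# "utf-8-sig" emits the BOM once, before the first chunk of text.
with open(demo_path, "w", encoding="utf-8-sig") as temp:
    temp.write("hi mom\n")
    temp.write("This has \u266d")

# Raw bytes as stored on disk.
with open(demo_path, "rb") as handle:
    raw = handle.read()

# Decoding with "utf-8-sig" strips a leading BOM if present.
with open(demo_path, encoding="utf-8-sig") as handle:
    without_bom = handle.read()

# Decoding with plain "utf-8" keeps the BOM as the character U+FEFF.
with open(demo_path, encoding="utf-8") as handle:
    with_bom = handle.read()

print(raw.startswith(codecs.BOM_UTF8))
print(with_bom == "\ufeff" + without_bom)
```

Reading with "utf-8-sig" is usually the safer choice when a file may or may not start with a BOM, since the codec accepts both cases.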

User · Answer

I use the *nix file command to convert a file with an unknown charset into a UTF-8 file:

    # -*- encoding: utf-8 -*-

    # Convert a file of unknown encoding to UTF-8 (Python 2: uses the commands module)

    import codecs
    import commands

    file_location = "jumper.sub"
    # Ask the file command to guess the source encoding
    file_encoding = commands.getoutput('file -b --mime-encoding %s' % file_location)

    file_stream = codecs.open(file_location, 'r', file_encoding)
    file_output = codecs.open(file_location + "b", 'w', 'utf-8')

    # Copy line by line; codecs re-encodes each line as UTF-8
    for l in file_stream:
        file_output.write(l)

    file_stream.close()
    file_output.close()
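
The commands module was removed in Python 3, so the following is a hedged sketch of the same approach using subprocess instead. The function names detect_encoding and convert_to_utf8 are hypothetical, and the demo uses a known source encoding rather than calling file, which may not be installed. Note also that file can report values such as "binary" or "unknown-8bit" that are not valid Python codec names, so a real script would need to handle those cases.

```python
import os
import subprocess
import tempfile


def detect_encoding(path):
    """Ask the *nix `file` command for its best guess at the encoding."""
    result = subprocess.run(
        ["file", "-b", "--mime-encoding", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def convert_to_utf8(source, target, source_encoding):
    """Stream source into target line by line, re-encoding as UTF-8."""
    # newline="" leaves line endings untouched in both directions.
    with open(source, "r", encoding=source_encoding, newline="") as reader, \
            open(target, "w", encoding="utf-8", newline="") as writer:
        for line in reader:
            writer.write(line)


# Demo with a known encoding, in a temporary directory.
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "jumper.sub")
with open(source, "wb") as handle:
    handle.write("caf\u00e9\n".encode("iso-8859-1"))

convert_to_utf8(source, source + "b", "iso-8859-1")

with open(source + "b", "rb") as handle:
    converted = handle.read()

print(converted == "caf\u00e9\n".encode("utf-8"))
```

In a real run, the third argument would come from detect_encoding(source) instead of a hard-coded name.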
