Convert UTF-8 with BOM to UTF-8 with no BOM in Python

Question

Two questions here. I have a set of files which are usually UTF-8 with BOM. I'd like to convert them (ideally in place) to UTF-8 with no BOM. It seems like codecs.StreamRecoder(stream, encode, decode, Reader, Writer, errors) would handle this, but I don't really see any good examples of its usage. Would this be the best way to handle this?

    source files:
    Tue Jan 17$ file brh-m-157.json
    brh-m-157.json: UTF-8 Unicode (with BOM) text

Also, it would be ideal if we could handle different input encodings without explicitly knowing them (I have seen ASCII and UTF-16). It seems like this should all be feasible. Is there a solution that can take any known Python encoding and output UTF-8 without BOM?

Edit 1: proposed solution from below (thanks!)

    fp = open('brh-m-157.json', 'rw')
    s = fp.read()
    u = s.decode('utf-8-sig')
    s = u.encode('utf-8')
    print fp.encoding
    fp.write(s)

This gives me the following error:

    IOError: [Errno 9] Bad file descriptor

Newsflash: I'm being told in comments that the mistake is that I open the file with mode 'rw' instead of 'r+' / 'r+b', so I should eventually re-edit my question and remove the solved part.

User · Accepted Answer

Simply use the "utf-8-sig" codec:

    fp = open("file.txt", "rb")  # binary mode, so read() returns raw bytes
    s = fp.read()
    u = s.decode("utf-8-sig")

That gives you a unicode string without the BOM. You can then use

    s = u.encode("utf-8")

to get a normal UTF-8 encoded string back in s.

If your files are big, then you should avoid reading them all into memory. The BOM is simply three bytes at the beginning of the file, so you can use this code to strip them out of the file:

    import os, sys, codecs

    BUFSIZE = 4096
    BOMLEN = len(codecs.BOM_UTF8)

    path = sys.argv[1]
    with open(path, "r+b") as fp:
        chunk = fp.read(BUFSIZE)
        if chunk.startswith(codecs.BOM_UTF8):
            i = 0  # write position, always BOMLEN bytes behind the read position
            chunk = chunk[BOMLEN:]
            while chunk:
                fp.seek(i)
                fp.write(chunk)
                i += len(chunk)
                # Jump back to where the previous read stopped.
                fp.seek(BOMLEN, os.SEEK_CUR)
                chunk = fp.read(BUFSIZE)
            # Drop the last BOMLEN bytes, which are now duplicates.
            fp.seek(-BOMLEN, os.SEEK_CUR)
            fp.truncate()

It opens the file, reads a chunk, and writes it out to the file 3 bytes earlier than where it read it. The file is rewritten in place. An easier solution is to write the shorter file to a new file, as in newtover's answer. That would be simpler, but it uses twice the disk space for a short period.

As for guessing the encoding, you can just loop through the encodings from most to least specific:

    def decode(s):
        for encoding in "utf-8-sig", "utf-16":
            try:
                return s.decode(encoding)
            except UnicodeDecodeError:
                continue
        return s.decode("latin-1")  # will always work

A UTF-16 encoded file generally won't decode as UTF-8, so we try UTF-8 first. If that fails, we try UTF-16. Finally, we use Latin-1; this will always work since all 256 byte values are legal in Latin-1. You may want to return None instead in this case, since it's really a fallback and your code might want to handle it more carefully, if it can.
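
A variation on the loop above, offered only as a sketch: instead of trying codecs in turn, it looks at the leading bytes and picks the codec from the BOM, treating BOM-less input as UTF-8 and returning None when decoding fails (the option mentioned at the end of the answer). The function name to_utf8_no_bom and the table _BOMS are made up for this example, and it assumes Python 3 with bytes input. UTF-32 is deliberately not handled; its little-endian BOM starts with the same two bytes as the UTF-16 one.

```python
import codecs

# (BOM bytes, codec name) pairs. Hypothetical helper table for this sketch.
# The "utf-16" codec reads the BOM itself to pick the byte order.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def to_utf8_no_bom(data):
    """Return data re-encoded as UTF-8 without a BOM, or None if undecodable."""
    encoding = "utf-8"  # assumption: input without a BOM is plain UTF-8/ASCII
    for bom, name in _BOMS:
        if data.startswith(bom):
            encoding = name
            break
    try:
        # Both "utf-8-sig" and "utf-16" drop the BOM while decoding.
        return data.decode(encoding).encode("utf-8")
    except UnicodeDecodeError:
        return None
```

Returning None rather than falling back to Latin-1 leaves the decision about unknown input to the caller.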

User · Answer

I found this question because I was having trouble with configparser.ConfigParser().read(fp) when opening files with a UTF-8 BOM header.

For those who are looking for a way to remove the header so that ConfigParser can open the config file instead of reporting the error "File contains no section headers", please open the file like the following:

    configparser.ConfigParser().read(config_file_path, encoding="utf-8-sig")

This could save you tons of effort by making the removal of the BOM header unnecessary.

I know this sounds unrelated, but hopefully this could help people struggling like me.
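
A small self-contained sketch of the call above, assuming Python 3. The helper name read_bom_config, the section name and the host value are invented for the demonstration; the demo writes a temporary .ini file that starts with a BOM and reads it back.

```python
import codecs
import configparser
import os
import tempfile


def read_bom_config(path):
    """Parse an INI file that may or may not start with a UTF-8 BOM."""
    parser = configparser.ConfigParser()
    # "utf-8-sig" skips a leading BOM if present and is harmless otherwise.
    parser.read(path, encoding="utf-8-sig")
    return parser


# Demo: create a throwaway config file that begins with a BOM.
fd, demo_path = tempfile.mkstemp(suffix=".ini")
with os.fdopen(fd, "wb") as fh:
    fh.write(codecs.BOM_UTF8 + b"[server]\nhost = example.org\n")
try:
    config = read_bom_config(demo_path)
    host = config["server"]["host"]
finally:
    os.remove(demo_path)
```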

User · Answer

This is my implementation to convert any kind of encoding to UTF-8 without BOM, replacing Windows line endings with the universal format:

    import codecs
    import os

    import chardet  # third-party package


    def utf8_converter(file_path, universal_endline=True):
        '''
        Convert any type of file to UTF-8 without BOM
        and using universal endline by default.

        Parameters
        ----------
        file_path : string, file path.
        universal_endline : boolean (True),
                            by default convert endlines to universal format.
        '''

        # Fix file path
        file_path = os.path.realpath(os.path.expanduser(file_path))

        # Read from file (binary mode, so the bytes are not altered)
        with open(file_path, 'rb') as file_open:
            raw = file_open.read()

        # Decode
        raw = raw.decode(chardet.detect(raw)['encoding'])
        # Remove windows end line
        if universal_endline:
            raw = raw.replace('\r\n', '\n')
        # Encode to UTF-8
        raw = raw.encode('utf8')
        # Remove BOM
        if raw.startswith(codecs.BOM_UTF8):
            raw = raw.replace(codecs.BOM_UTF8, b'', 1)

        # Write to file
        with open(file_path, 'wb') as file_open:
            file_open.write(raw)
        return 0
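
One caveat, stated as an assumption about chardet rather than something taken from this answer: when detection fails (for example on empty input), the dictionary returned by chardet.detect may hold None under 'encoding', and passing None to decode raises a TypeError. A possible guard is sketched below. The helper name decode_detected and the fallback parameter are hypothetical, and the function takes the detection result as an argument so that the sketch itself does not need chardet installed.

```python
def decode_detected(raw, detection, fallback="utf-8-sig"):
    """Decode raw bytes using a chardet-style result dict.

    detection is assumed to look like the dict returned by
    chardet.detect(raw); fallback is used when no encoding was detected.
    """
    encoding = detection.get("encoding") or fallback
    return raw.decode(encoding)
```

In the function above, this would be called as decode_detected(raw, chardet.detect(raw)) in place of the direct raw.decode(...) line.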

User · Answer

In Python 3 it's quite easy: read the file and rewrite it with utf-8 encoding:

    s = open(bom_file, mode='r', encoding='utf-8-sig').read()
    open(bom_file, mode='w', encoding='utf-8').write(s)
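
A slightly more defensive sketch of the same idea, assuming Python 3. It closes the files explicitly, passes newline="" so that existing line endings are left untouched, and writes to a temporary file in the same directory before swapping it in with os.replace, so the original is not truncated if the write fails partway. The function name strip_bom_in_place is made up for this example, and the whole file is still read into memory.

```python
import os
import tempfile


def strip_bom_in_place(path):
    """Rewrite path as UTF-8 without a BOM, keeping line endings as they are."""
    directory = os.path.dirname(os.path.abspath(path))

    # newline="" disables newline translation on both read and write.
    with open(path, "r", encoding="utf-8-sig", newline="") as src:
        text = src.read()

    # Write next to the original so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w", encoding="utf-8", newline="") as dst:
            dst.write(text)
        os.replace(tmp_path, path)
    except BaseException:
        os.remove(tmp_path)
        raise
```

Note that the replacement file gets the default permissions of a fresh temporary file, which may differ from those of the original.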

User · Answer

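    import codecs
    import shutil
    import sys

    # Use the underlying byte streams on Python 3; on Python 2 the
    # standard streams already deal in bytes.
    stdin = getattr(sys.stdin, 'buffer', sys.stdin)
    stdout = getattr(sys.stdout, 'buffer', sys.stdout)

    s = stdin.read(3)
    if s != codecs.BOM_UTF8:
        stdout.write(s)

    shutil.copyfileobj(stdin, stdout)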
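
The same filter can be written against file paths instead of the standard streams. This is only a sketch, assuming Python 3; the function name copy_without_bom and its two parameters are invented here. It copies in chunks, so the file never has to fit in memory, at the cost of needing space for a second copy.

```python
import codecs
import shutil


def copy_without_bom(src_path, dst_path):
    """Copy src_path to dst_path, skipping a leading UTF-8 BOM if present."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        head = src.read(len(codecs.BOM_UTF8))
        if head != codecs.BOM_UTF8:
            # Not a BOM: these bytes are real content, so keep them.
            dst.write(head)
        # Stream the remainder of the file in chunks.
        shutil.copyfileobj(src, dst)
```

src_path and dst_path must be different files, since opening the destination for writing truncates it.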

User · Answer

You can use codecs:

    import codecs

    # Binary mode, so the BOM can be compared as raw bytes
    with open("test.txt", 'rb') as filehandle:
        content = filehandle.read()
    if content[:3] == codecs.BOM_UTF8:
        content = content[3:]

    print(content.decode("utf-8"))
