How to convert a file to utf-8 in Python

Question

I need to convert a bunch of files to utf-8 in Python, and I have trouble with the "converting the file" part.

I'd like to do the equivalent of:

    iconv -t utf-8 $file > converted/$file # this is shell code

Thanks!

User · Answer

Answer for unknown source encoding type (based on @Sébastien RoccaSerra), Python 3.6:

    import os
    from chardet import detect

    # get file encoding type
    def get_encoding_type(file):
        with open(file, 'rb') as f:
            rawdata = f.read()
        return detect(rawdata)['encoding']

    # srcfile and trgfile are the source and target paths
    from_codec = get_encoding_type(srcfile)

    # add try: except block for reliability
    try:
        with open(srcfile, 'r', encoding=from_codec) as f, open(trgfile, 'w', encoding='utf-8') as e:
            text = f.read()  # for small files, for big use chunks
            e.write(text)

        os.remove(srcfile)  # remove old encoding file
        os.rename(trgfile, srcfile)  # rename new encoding
    except UnicodeDecodeError:
        print('Decode Error')
    except UnicodeEncodeError:
        print('Encode Error')
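The remove-then-rename step in this answer leaves a short window in which neither file exists if the script is interrupted. Here is a minimal sketch of one alternative, assuming the source encoding is already known (for example from `detect`): write to a temporary file in the same directory and swap it in with `os.replace`. The helper name `rewrite_as_utf8` is made up for illustration and is not part of the answer.

```python
import os
import tempfile


def rewrite_as_utf8(path, source_encoding):
    """Re-encode the file at `path` to UTF-8 in place (hypothetical helper)."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the original so the final
    # os.replace call stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        # newline="" on both sides copies line endings unchanged.
        with open(fd, "w", encoding="utf-8", newline="") as dst, \
                open(path, "r", encoding=source_encoding, newline="") as src:
            for line in src:
                dst.write(line)
        # The original is only overwritten once the conversion succeeded.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the partial temporary file and re-raise the error.
        os.remove(tmp_path)
        raise
```

Note that `tempfile.mkstemp` creates the file with restrictive permissions, so the converted file may not keep the permissions of the original unless they are copied over separately.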

User · Answer

You can use the codecs module, like this:

    import codecs
    BLOCKSIZE = 1048576  # or some other, desired size in bytes
    with codecs.open(sourceFileName, "r", "your-source-encoding") as sourceFile:
        with codecs.open(targetFileName, "w", "utf-8") as targetFile:
            while True:
                contents = sourceFile.read(BLOCKSIZE)
                if not contents:
                    break
                targetFile.write(contents)

EDIT: added BLOCKSIZE parameter to control file chunk size.
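In Python 3 the built-in `open` accepts an `encoding` argument, so the same chunked loop can be written without `codecs`. The sketch below is one way to do that, not the answer's own code. The function name `convert_to_utf8` and its parameters are assumptions chosen for illustration.

```python
BLOCKSIZE = 1048576  # characters per read in text mode; any reasonable size works


def convert_to_utf8(source_path, target_path, source_encoding):
    """Stream-decode source_path and write it out as UTF-8 (illustrative helper)."""
    # newline="" disables newline translation, so line endings are copied unchanged.
    with open(source_path, "r", encoding=source_encoding, newline="") as source_file, \
            open(target_path, "w", encoding="utf-8", newline="") as target_file:
        while True:
            contents = source_file.read(BLOCKSIZE)
            if not contents:
                break
            target_file.write(contents)
```

Reading in blocks keeps memory use bounded, which matters only for large files. For small ones a single `read()` is just as good.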

User · Answer

This worked for me in a small test (Python 2 only, since it relies on the unicode built-in):

    sourceEncoding = "iso-8859-1"
    targetEncoding = "utf-8"
    source = open("source")
    target = open("target", "w")

    target.write(unicode(source.read(), sourceEncoding).encode(targetEncoding))

User · Answer

Thanks for the replies, it works!

And since the source files are in mixed formats, I added a list of source formats to be tried in sequence (sourceFormats), and on UnicodeDecodeError I try the next format:

    # Python 2 code: file() does not exist in Python 3
    from __future__ import with_statement

    import os
    import sys
    import codecs
    from chardet.universaldetector import UniversalDetector

    targetFormat = 'utf-8'
    outputDir = 'converted'
    detector = UniversalDetector()

    def get_encoding_type(current_file):
        detector.reset()
        for line in file(current_file):
            detector.feed(line)
            if detector.done: break
        detector.close()
        return detector.result['encoding']

    def convertFileBestGuess(fileName):
        sourceFormats = ['ascii', 'iso-8859-1']
        for format in sourceFormats:
            try:
                with codecs.open(fileName, 'rU', format) as sourceFile:
                    writeConversion(sourceFile, fileName)
                    print('Done.')
                    return
            except UnicodeDecodeError:
                pass

    def convertFileWithDetection(fileName):
        print("Converting '" + fileName + "'...")
        format = get_encoding_type(fileName)
        try:
            with codecs.open(fileName, 'rU', format) as sourceFile:
                writeConversion(sourceFile, fileName)
                print('Done.')
                return
        except UnicodeDecodeError:
            pass

        print("Error: failed to convert '" + fileName + "'.")

    def writeConversion(file, fileName):
        # fileName is passed in explicitly so the output path is always defined
        with codecs.open(outputDir + '/' + fileName, 'w', targetFormat) as targetFile:
            for line in file:
                targetFile.write(line)

    # Off topic: get the file list and call convertFile on each file
    # ...

(EDIT by Rudro Badhon: this incorporates the original "try multiple formats until you don't get an exception" approach as well as an alternate approach that uses chardet.universaldetector.)
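The "try each format in turn" idea can be sketched on raw bytes in Python 3 as follows. This is an illustration under assumptions, not the poster's code: the helper name `decode_with_fallbacks` and the default candidate list are invented here. The order of the list matters, because iso-8859-1 maps every possible byte and therefore never raises, so it has to come last.

```python
def decode_with_fallbacks(raw, encodings=("utf-8", "cp1252", "iso-8859-1")):
    """Return (text, encoding) for the first candidate that decodes raw without error."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            # This candidate cannot represent the bytes; try the next one.
            continue
    raise ValueError("none of the candidate encodings could decode the data")
```

A successful decode only shows that the bytes are valid in that encoding, not that the guess is right, so the result is still worth a spot check on real files.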

User · Answer

To guess what the source encoding is, you can use the file *nix command.

Example:

    $ file --mime jumper.xml

    jumper.xml: application/xml; charset=utf-8

User · Answer

This is my brute force method. It also takes care of mingled \n and \r\n in the input.

    try:
        # open the CSV file
        inputfile = open(filelocation, 'rb')
        outputfile = open(outputfilelocation, 'w', encoding='utf-8')
        for line in inputfile:
            if line[-2:] == b'\r\n' or line[-2:] == b'\n\r':
                output = line[:-2].decode('utf-8', 'replace') + '\n'
            elif line[-1:] == b'\r' or line[-1:] == b'\n':
                output = line[:-1].decode('utf-8', 'replace') + '\n'
            else:
                output = line.decode('utf-8', 'replace') + '\n'
            outputfile.write(output)
        outputfile.close()
    except BaseException as error:
        cfg.log(self.outf, "Error(18): opening CSV-file " + filelocation + " failed: " + str(error))
        self.loadedwitherrors = 1
        return

    try:
        # open the CSV-file of this source table
        csvreader = csv.reader(open(outputfilelocation, "rU"), delimiter=delimitervalue, quoting=quotevalue, dialect=csv.excel_tab)
    except BaseException as error:
        cfg.log(self.outf, "Error(19): reading CSV-file " + filelocation + " failed: " + str(error))

User · Answer

This is a Python3 function for converting any text file into one with UTF-8 encoding (without using unnecessary packages):

    def correctSubtitleEncoding(filename, newFilename, encoding_from, encoding_to='UTF-8'):
        with open(filename, 'r', encoding=encoding_from) as fr:
            with open(newFilename, 'w', encoding=encoding_to) as fw:
                for line in fr:
                    fw.write(line[:-1] + '\r\n')

You can use it easily in a loop to convert a list of files.
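One caveat with `line[:-1]`: it assumes every line ends with a newline character, so a final line without one loses its last character. A sketch of a variant that avoids slicing by relying on the `newline` argument of the built-in `open` follows. The name `convert_with_crlf` is hypothetical, and unlike the function above it does not append a line ending to an unterminated last line.

```python
def convert_with_crlf(filename, new_filename, encoding_from, encoding_to="utf-8"):
    """Re-encode a text file and write CRLF line endings (illustrative variant)."""
    # With the default newline=None, reading turns \r\n, \r and \n into "\n".
    with open(filename, "r", encoding=encoding_from) as fr:
        # newline="\r\n" on the output turns every written "\n" into "\r\n".
        with open(new_filename, "w", encoding=encoding_to, newline="\r\n") as fw:
            for line in fr:
                fw.write(line)
```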
