[python] How to determine the encoding of text?

I received some text that is encoded, but I don't know what charset was used. Is there a way to determine the encoding of a text file using Python? (The related question How can I detect the encoding/codepage of a text file deals with C#.)

This question is related to: python, encoding, text-files

The answers are:


It is, in principle, impossible to determine the encoding of a text file, in the general case. So no, there is no standard Python library to do that for you.

If you have more specific knowledge about the text file (e.g. that it is XML), there might be library functions.
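For instance, if you know the file is XML, you can sniff the encoding pseudo-attribute out of the XML declaration yourself. A minimal sketch; the regex and the UTF-8 fallback are my own assumptions (XML without a declaration defaults to UTF-8):

import re

def xml_declared_encoding(path):
    # The XML declaration, if present, must be the very first thing in the file
    with open(path, 'rb') as f:
        head = f.read(200)
    # e.g. <?xml version="1.0" encoding="ISO-8859-1"?>
    m = re.match(br'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']', head)
    return m.group(1).decode('ascii') if m else 'utf-8'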


Another option for working out the encoding is to use libmagic (the library behind the file command). A profusion of Python bindings is available.

The Python bindings that live in the file source tree are available as the python-magic (or python3-magic) Debian package. They can determine the encoding of a file like this:

import magic

blob = open('unknown-file', 'rb').read()
m = magic.open(magic.MAGIC_MIME_ENCODING)
m.load()
encoding = m.buffer(blob)  # "utf-8" "us-ascii" etc

There is an identically named, but incompatible, python-magic pip package on PyPI that also uses libmagic. It can also get the encoding:

import magic

blob = open('unknown-file', 'rb').read()
m = magic.Magic(mime_encoding=True)
encoding = m.from_buffer(blob)
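If I remember the pip package's API correctly, the same Magic object can also read straight from disk with m.from_file('unknown-file'), which saves you the explicit read() call.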

If you know some of the file's content, you can try to decode it with several encodings and see which one fails. In general there is no way to be sure, since a text file is just a text file, and those are dumb ;)
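A minimal sketch of that trial-and-error idea (the candidate list is my own assumption; put the encodings you actually expect first):

def try_decode(blob, candidates=('utf-8', 'cp1252', 'latin-1')):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for enc in candidates:
        try:
            return blob.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # note: latin-1 decodes any byte sequence, so it only makes sense
    # as a last resort; with it in the list this line is unreachable
    raise ValueError('none of the candidate encodings worked')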


This might be helpful

from bs4 import UnicodeDammit

with open('automate_data/billboard.csv', 'rb') as file:
    content = file.read()

suggestion = UnicodeDammit(content)
print(suggestion.original_encoding)
# 'iso-8859-1'
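If the guess looks right, UnicodeDammit has (as far as I recall the bs4 API) already decoded the text for you: it should be available as suggestion.unicode_markup, so you don't need a second decoding pass.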


This site has Python code for recognizing ASCII, encodings with BOMs, and UTF-8 without a BOM: https://unicodebook.readthedocs.io/guess_encoding.html. It reads the file into a byte array (data): http://www.codecodex.com/wiki/Read_a_file_into_a_byte_array. Here's an example; I'm on macOS.

#!/usr/bin/python                                                                                                  

import sys

def isUTF8(data):
    try:
        decoded = data.decode('UTF-8')
    except UnicodeDecodeError:
        return False
    else:
        for ch in decoded:
            # surrogate code points are not valid in well-formed UTF-8 text
            if 0xD800 <= ord(ch) <= 0xDFFF:
                return False
        return True

def get_bytes_from_file(filename):
    return open(filename, "rb").read()

filename = sys.argv[1]
data = get_bytes_from_file(filename)
result = isUTF8(data)
print(result)


PS /Users/js> ./isutf8.py hi.txt                                                                                     
True

Here is an example of reading, and taking at face value, a chardet encoding prediction, reading only n_lines from the file in case it is large.

chardet also gives you a probability (i.e. confidence) for its encoding prediction (I haven't looked at how they come up with that), which is returned along with the prediction by chardet.detect(), so you could work that in somehow if you like.

import chardet

def predict_encoding(file_path, n_lines=20):
    '''Predict a file's encoding using chardet'''

    # Open the file as binary data
    with open(file_path, 'rb') as f:
        # Join binary lines for specified number of lines
        rawdata = b''.join([f.readline() for _ in range(n_lines)])

    return chardet.detect(rawdata)['encoding']
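If you want to take that confidence into account instead of trusting the guess blindly, look at the full dictionary chardet.detect() returns. A small sketch; the 0.5 threshold and the utf-8 fallback are my own assumptions:

import chardet

def predict_encoding_with_floor(file_path, min_confidence=0.5):
    # chardet works on raw bytes, so read a binary sample
    with open(file_path, 'rb') as f:
        rawdata = f.read(65536)
    result = chardet.detect(rawdata)
    # result looks like {'encoding': 'Windows-1252', 'confidence': 0.73, ...}
    if result['encoding'] and result['confidence'] >= min_confidence:
        return result['encoding']
    return 'utf-8'  # fall back to a sane default when chardet is unsure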


Depending on your platform, I just opt to use the Linux shell file command. This works for me since I use it in a script that runs exclusively on one of our Linux machines.

Obviously this isn't an ideal solution or answer, but it could be modified to fit your needs. In my case I just need to determine whether a file is UTF-8 or not.

import subprocess

def file_says_utf8(path):
    file_cmd = ['file', path]
    p = subprocess.Popen(file_cmd, stdout=subprocess.PIPE)
    cmd_output = p.stdout.readlines()
    # The first line looks like b'test.txt: UTF-8 Unicode text\n';
    # decode it before splitting, since the pipe yields bytes
    x = cmd_output[0].decode().split(": ")[1]
    return x.startswith('UTF-8')
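If the file binary on your machine supports it, you can also ask it for the encoding directly rather than parsing the human-readable description. A sketch, assuming a GNU/BSD file that understands --mime-encoding (-b just suppresses the filename prefix):

import subprocess

def detect_encoding(path):
    out = subprocess.run(['file', '-b', '--mime-encoding', path],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()  # e.g. 'utf-8', 'us-ascii', 'iso-8859-1'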

# Function: OpenRead(file)

# A text file can be encoded using:
#   (1) The default operating system code page, Or
#   (2) utf8 with a BOM header
#
#  If a text file is encoded with utf8, and does not have a BOM header,
#  the user can manually add a BOM header to the text file
#  using a text editor such as notepad++, and rerun the python script,
#  otherwise the file is read as a codepage file with the 
#  invalid codepage characters removed

import sys
if sys.version_info[0] != 3:
    print('Aborted: Python 3.x required')
    sys.exit(1)

def bomType(file):
    """
    returns file encoding string for open() function

    EXAMPLE:
        bom = bomType(file)
        open(file, encoding=bom, errors='ignore')
    """

    f = open(file, 'rb')
    b = f.read(4)
    f.close()

    # Check utf-32 before utf-16: the utf-32 little-endian BOM
    # starts with the utf-16 little-endian BOM
    if (b[0:4] == b'\x00\x00\xfe\xff') or (b[0:4] == b'\xff\xfe\x00\x00'):
        return "utf32"

    # Python automatically detects endianness if a utf-16 BOM is present
    # write endianness is generally determined by the endianness of the CPU
    if (b[0:2] == b'\xfe\xff') or (b[0:2] == b'\xff\xfe'):
        return "utf16"

    if b[0:3] == b'\xef\xbb\xbf':
        return "utf-8-sig"  # strips the BOM while decoding

    # If no BOM is present, assume the codepage
    #     used by your operating system
    return "cp1252"
    # For the United States it's: cp1252


def OpenRead(file):
    bom = bomType(file)
    return open(file, 'r', encoding=bom, errors='ignore')


#######################
# Testing it
#######################
fout = open("myfile1.txt", "w", encoding="cp1252")
fout.write("* hi there (cp1252)")
fout.close()

fout = open("myfile2.txt", "w", encoding="utf-8-sig")  # writes a BOM so bomType can see it
fout.write("\u2022 hi there (utf8)")
fout.close()

# this case is still treated like codepage cp1252
#   (User responsible for making sure that all utf8 files
#   have a BOM header)
fout = open("badboy.txt", "wb")
fout.write(b"hi there.  barf(\x81\x8D\x90\x9D)")
fout.close()

# Read Example file with Bom Detection
fin = OpenRead("myfile1.txt")
L = fin.readline()
print(L)
fin.close()

# Read Example file with Bom Detection
fin = OpenRead("myfile2.txt")
L = fin.readline()
print(L) #requires QtConsole to view, Cmd.exe is cp1252
fin.close()

# Read CP1252 with a few undefined chars without barfing
fin = OpenRead("badboy.txt")
L = fin.readline()
print(L)
fin.close()

# Check that bad characters are still in badboy codepage file
fin = open("badboy.txt", "rb")
fin.read(20)
fin.close()
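For comparison, here is a minimal sketch of the same BOM check written with the codecs module's BOM constants instead of hand-typed byte literals; the cp1252 fallback is the same assumption the answer above makes:

import codecs

def bom_encoding(path, default='cp1252'):
    with open(path, 'rb') as f:
        head = f.read(4)
    # utf-32 must be tested before utf-16, because the utf-32 LE BOM
    # starts with the utf-16 LE BOM
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return 'utf32'
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return 'utf16'
    if head.startswith(codecs.BOM_UTF8):
        return 'utf-8-sig'  # decodes and strips the BOM
    return default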


Some encoding strategies; uncomment to taste:

#!/bin/bash
#
tmpfile=$1
echo '-- info about the file ........'
file -i "$tmpfile"
enca -g "$tmpfile"
echo 'recoding ........'
#iconv -f iso-8859-2 -t utf-8 back_test.xml > "$tmpfile"
#enca -x utf-8 "$tmpfile"
#enca -g "$tmpfile"
recode CP1250..UTF-8 "$tmpfile"

You might like to check the encoding by opening and reading the file in a loop... but you might need to check the file size first:

# PYTHON
import codecs

encodings = ['utf-8', 'windows-1250', 'windows-1252']  # add more
for e in encodings:
    try:
        fh = codecs.open('file.txt', 'r', encoding=e)
        fh.readlines()
        fh.seek(0)
    except UnicodeDecodeError:
        print('got unicode error with %s , trying different encoding' % e)
    else:
        print('opening the file with encoding:  %s ' % e)
        break
