Java: How to determine the correct charset encoding of a stream

Question

With reference to the following thread: Java App: Unable to read iso-8859-1 encoded file correctly

What is the best way to programmatically determine the correct charset encoding of an input stream/file?

I have tried using the following:

    File in = new File(args[0]);
    InputStreamReader r = new InputStreamReader(new FileInputStream(in));
    System.out.println(r.getEncoding());

But on a file which I know to be encoded with ISO8859_1 the above code yields ASCII, which is not correct, and does not allow me to correctly render the content of the file back to the console.

User · Answer

If you use ICU4J (http://icu-project.org/apiref/icu4j/), here is my code:

    // Requires com.ibm.icu.text.CharsetDetector and com.ibm.icu.text.CharsetMatch
    String charset = "ISO-8859-1"; // Default charset, put whatever you want

    byte[] fileContent = null;
    FileInputStream fin = null;

    // Create FileInputStream object
    fin = new FileInputStream(file.getPath());

    // Create a byte array large enough to hold the content of the file.
    // Use File.length to determine the size of the file in bytes.
    fileContent = new byte[(int) file.length()];

    // To read the content of the file into the byte array, use the
    // int read(byte[] byteArray) method of the java FileInputStream class.
    fin.read(fileContent);

    byte[] data = fileContent;

    CharsetDetector detector = new CharsetDetector();
    detector.setText(data);

    CharsetMatch cm = detector.detect();

    if (cm != null) {
        int confidence = cm.getConfidence();
        System.out.println("Encoding: " + cm.getName() + " - Confidence: " + confidence + "%");
        // Here you have the encoding name and the confidence.
        // In my case, if the confidence is > 50 I return the encoding, else I return the default value.
        if (confidence > 50) {
            charset = cm.getName();
        }
    }

Remember to add all the try-catch blocks that are needed.

I hope this works for you.

User · Answer

You cannot determine the encoding of an arbitrary byte stream. This is the nature of encodings. An encoding means a mapping between a byte value and its representation. So every encoding "could" be the right one.

The getEncoding() method will return the encoding which was set up (read the JavaDoc) for the stream. It will not guess the encoding for you.

Some streams tell you which encoding was used to create them: XML, HTML. But not an arbitrary byte stream.

Anyway, you could try to guess an encoding on your own if you have to. Every language has a common frequency for every char. In English the char e appears very often, but a character such as ê will appear very seldom. In an ISO-8859-1 stream there are usually no 0x00 chars. But a UTF-16 stream has a lot of them.

Or: you could ask the user. I've already seen applications which present you a snippet of the file in different encodings and ask you to select the "correct" one.
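The 0x00 observation can be turned into a rough heuristic. The following is only a sketch: the class name `NulByteHeuristic`, the method name and the 10% threshold are assumptions made up for this illustration, not part of any library, and the guess can easily be wrong for short or binary input.

```java
/** Hypothetical helper: guesses "looks like UTF-16" from the share of 0x00 bytes. */
final class NulByteHeuristic {

    // Assumed threshold, chosen for illustration only. Mostly-Latin text in
    // UTF-16 has roughly every second byte equal to 0x00, while ISO-8859-1
    // text normally contains no 0x00 bytes at all.
    private static final double NUL_RATIO_THRESHOLD = 0.10;

    static boolean looksLikeUtf16(byte[] data) {
        if (data.length == 0) {
            return false; // nothing to judge
        }
        int nulCount = 0;
        for (byte b : data) {
            if (b == 0) {
                nulCount++;
            }
        }
        return (double) nulCount / data.length > NUL_RATIO_THRESHOLD;
    }
}
```

A positive result here only suggests UTF-16; it says nothing about byte order, and a negative result does not distinguish ISO-8859-1 from UTF-8 or any other single-byte encoding.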

User · Answer

An alternative to TikaEncodingDetector is to use Tika AutoDetectReader.

    Charset charset = new AutoDetectReader(new FileInputStream(file)).getCharset();

User · Answer

The libs above are simple BOM detectors, which of course only work if there is a BOM at the beginning of the file. Take a look at http://jchardet.sourceforge.net/, which does scan the text.

User · Answer

If you don't know the encoding of your data, it is not so easy to determine, but you could try to use a library to guess it. Also, there is a similar question.

User · Answer

I found a nice third-party library which can detect the actual encoding: http://glaforge.free.fr/wiki/index.php?wiki=GuessEncoding

I didn't test it extensively, but it seems to work.

User · Answer

For ISO8859_1 files, there is not an easy way to distinguish them from ASCII. For Unicode files, however, one can generally detect this based on the first few bytes of the file.

UTF-8 and UTF-16 files may include a Byte Order Mark (BOM) at the very beginning of the file. The BOM is a zero-width non-breaking space.

Unfortunately, for historical reasons, Java does not detect this automatically. Programs like Notepad will check the BOM and use the appropriate encoding. Using unix or Cygwin, you can check the BOM with the file command. For example:

    $ file sample2.sql
    sample2.sql: Unicode text, UTF-16, big-endian

For Java, I suggest you check out this code, which will detect the common file formats and select the correct encoding: How to read a file and automatically specify the correct encoding
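As a minimal sketch of the BOM check described here, the first bytes of the file can be compared against the well-known BOM byte sequences (EF BB BF for UTF-8, FE FF for UTF-16 big-endian, FF FE for UTF-16 little-endian). The class `BomSniffer` and its method names are hypothetical, chosen for this example. UTF-32 BOMs are deliberately not handled, so a UTF-32LE file would be misreported as UTF-16LE.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: maps a leading byte order mark to a charset, if one is present. */
final class BomSniffer {

    /** Returns the charset indicated by a BOM, or null when no recognised BOM is found. */
    static Charset detect(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return StandardCharsets.UTF_8;
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return StandardCharsets.UTF_16BE;
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return StandardCharsets.UTF_16LE;
        }
        return null; // no BOM: the encoding has to be guessed some other way
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            // Mask to compare the byte as an unsigned value.
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A null result is the common case for ISO-8859-1 and for UTF-8 files written without a BOM, so this check is best treated as a cheap first step before any statistical detection.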

User · Answer

Can you pick the appropriate charset in the constructor?

    new InputStreamReader(new FileInputStream(in), "ISO8859_1");
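When the encoding is known in advance, passing it explicitly avoids relying on the platform default. A small sketch of that idea follows; the class `ExplicitCharsetReader` and the method `readAll` are names invented for this example, and it uses the `Charset` overload of the `InputStreamReader` constructor rather than the charset-name string.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

/** Hypothetical helper: decodes a whole stream with a charset chosen by the caller. */
final class ExplicitCharsetReader {

    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        // The reader (and the underlying stream) is closed when the block exits.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }
}
```

For the file in the question this might be called as `ExplicitCharsetReader.readAll(new FileInputStream(in), StandardCharsets.ISO_8859_1)`.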

User · Answer

Which library to use?

As of this writing, three libraries stand out:

GuessEncoding
ICU4j
juniversalchardet

I don't include Apache Any23 because it uses ICU4j 3.4 under the hood.

How to tell which one has detected the right charset (or as close as possible)?

It's impossible to certify the charset detected by each of the above libraries. However, it's possible to ask them in turn and score the returned response.

How to score the returned response?

Each response can be assigned one point. The more points a response has, the more confidence the detected charset has. This is a simple scoring method. You can elaborate others.

Is there any sample code?

Here is a full snippet implementing the strategy described in the previous lines.

    public static String guessEncoding(InputStream input) throws IOException {
        // Load input data
        long count = 0;
        int n = 0, EOF = -1;
        byte[] buffer = new byte[4096];
        ByteArrayOutputStream output = new ByteArrayOutputStream();

        while ((EOF != (n = input.read(buffer))) && (count <= Integer.MAX_VALUE)) {
            output.write(buffer, 0, n);
            count += n;
        }

        if (count > Integer.MAX_VALUE) {
            throw new RuntimeException("Inputstream too large.");
        }

        byte[] data = output.toByteArray();

        // Detect encoding
        Map<String, int[]> encodingsScores = new HashMap<>();

        // * GuessEncoding
        updateEncodingsScores(encodingsScores, new CharsetToolkit(data).guessEncoding().displayName());

        // * ICU4j
        CharsetDetector charsetDetector = new CharsetDetector();
        charsetDetector.setText(data);
        charsetDetector.enableInputFilter(true);
        CharsetMatch cm = charsetDetector.detect();
        if (cm != null) {
            updateEncodingsScores(encodingsScores, cm.getName());
        }

        // * juniversalchardet
        UniversalDetector universalDetector = new UniversalDetector(null);
        universalDetector.handleData(data, 0, data.length);
        universalDetector.dataEnd();
        String encodingName = universalDetector.getDetectedCharset();
        if (encodingName != null) {
            updateEncodingsScores(encodingsScores, encodingName);
        }

        // Find winning encoding
        Map.Entry<String, int[]> maxEntry = null;
        for (Map.Entry<String, int[]> e : encodingsScores.entrySet()) {
            if (maxEntry == null || (e.getValue()[0] > maxEntry.getValue()[0])) {
                maxEntry = e;
            }
        }

        String winningEncoding = maxEntry.getKey();
        // dumpEncodingsScores(encodingsScores);
        return winningEncoding;
    }

    private static void updateEncodingsScores(Map<String, int[]> encodingsScores, String encoding) {
        String encodingName = encoding.toLowerCase();
        int[] encodingScore = encodingsScores.get(encodingName);

        if (encodingScore == null) {
            encodingsScores.put(encodingName, new int[] { 1 });
        } else {
            encodingScore[0]++;
        }
    }

    private static void dumpEncodingsScores(Map<String, int[]> encodingsScores) {
        System.out.println(toString(encodingsScores));
    }

    private static String toString(Map<String, int[]> encodingsScores) {
        String GLUE = ", ";
        StringBuilder sb = new StringBuilder();

        for (Map.Entry<String, int[]> e : encodingsScores.entrySet()) {
            sb.append(e.getKey() + ":" + e.getValue()[0] + GLUE);
        }
        int len = sb.length();
        sb.delete(len - GLUE.length(), len);

        return "{ " + sb.toString() + " }";
    }

Improvements: The guessEncoding method reads the input stream entirely. For large input streams this can be a concern. All these libraries would read the whole input stream, which means charset detection could take a long time. It's possible to limit the initial data loading to a few bytes and perform the charset detection on those few bytes only.
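The improvement mentioned at the end, loading only the first few bytes, could look roughly like the sketch below. `PrefixReader` and `readPrefix` are hypothetical names for this illustration; the returned array would then be handed to the detectors in place of the full content. Note the assumption that a truncated sample is still representative: cutting the data at an arbitrary offset may split a multi-byte character, and a small sample gives statistical detectors less to work with.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/** Hypothetical helper: reads a bounded prefix of a stream for charset detection. */
final class PrefixReader {

    /** Reads at most maxBytes from the stream; returns fewer bytes if the stream ends first. */
    static byte[] readPrefix(InputStream input, int maxBytes) throws IOException {
        byte[] buffer = new byte[maxBytes];
        int total = 0;
        // A single read() may return fewer bytes than requested, so loop
        // until the buffer is full or the end of the stream is reached.
        while (total < maxBytes) {
            int n = input.read(buffer, total, maxBytes - total);
            if (n == -1) {
                break;
            }
            total += n;
        }
        return Arrays.copyOf(buffer, total);
    }
}
```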

User · Answer

Check this out: http://site.icu-project.org/ (icu4j). They have libraries for detecting the charset from a stream; it could be as simple as this:

    BufferedInputStream bis = new BufferedInputStream(input);
    CharsetDetector cd = new CharsetDetector();
    cd.setText(bis);
    CharsetMatch cm = cd.detect();

    if (cm != null) {
        reader = cm.getReader();
        charset = cm.getName();
    } else {
        // UnsupportedCharsetException requires a charset name argument
        throw new UnsupportedCharsetException("unknown");
    }

User · Answer

You can certainly validate the file for a particular charset by decoding it with a CharsetDecoder and watching out for "malformed-input" or "unmappable-character" errors. Of course, this only tells you if a charset is wrong; it doesn't tell you if it is correct. For that, you need a basis of comparison to evaluate the decoded results, e.g. do you know beforehand if the characters are restricted to some subset, or whether the text adheres to some strict format? The bottom line is that charset detection is guesswork without any guarantees.
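A minimal sketch of this validation approach, using only the JDK: the decoder is configured to report malformed input and unmappable characters instead of silently replacing them. The class `CharsetValidator` and the method `decodesCleanly` are names made up for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

/** Hypothetical helper: checks whether bytes can be decoded in a given charset without errors. */
final class CharsetValidator {

    /** Returns true if decoding reports no malformed-input or unmappable-character error. */
    static boolean decodesCleanly(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            // The data is definitely not valid in this charset.
            return false;
        }
    }
}
```

A `false` result rules the charset out, while `true` only means the bytes are not invalid. For example, any byte sequence decodes cleanly as ISO-8859-1, including UTF-8 text, which would simply come out as the wrong characters.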

User · Answer

I have used this library, similar to jchardet, for detecting encoding in Java: http://code.google.com/p/juniversalchardet/

User · Answer

As far as I know, there is no general library in this context that is suitable for all types of problems. So, for each problem you should test the existing libraries and select the one that best satisfies your problem's constraints, but often none of them is appropriate. In these cases you can write your own encoding detector, as I have.

I've written a meta Java tool for detecting the charset encoding of HTML web pages, using IBM ICU4j and Mozilla JCharDet as the built-in components. Here you can find my tool; please read the README section before anything else. Also, you can find some basic concepts of this problem in my paper and in its references.

Below are some helpful comments based on what I've experienced in my work:

Charset detection is not a foolproof process, because it is essentially based on statistical data, and what actually happens is guessing, not detecting.

icu4j is the main tool in this context by IBM, imho.

Both TikaEncodingDetector and Lucene-ICU4j use icu4j, and their accuracy did not differ meaningfully from that of icu4j in my tests (at most 1%, as I remember).

icu4j is much more general than jchardet. icu4j is just a bit biased towards IBM family encodings, while jchardet is strongly biased towards UTF-8.

Due to the widespread use of UTF-8 in the HTML world, jchardet is a better choice than icu4j overall, but it is not the best choice.

icu4j is great for East Asian specific encodings like EUC-KR, EUC-JP, SHIFT_JIS, BIG5 and the GB family encodings.

Both icu4j and jchardet perform very poorly on HTML pages with Windows-1251 and Windows-1256 encodings. Windows-1251 aka cp1251 is widely used for Cyrillic-based languages like Russian, and Windows-1256 aka cp1256 is widely used for Arabic.

Almost all encoding detection tools use statistical methods, so the accuracy of the output strongly depends on the size and the contents of the input.

Some encodings are essentially the same, with only partial differences, so in some cases the guessed or detected encoding may be false but at the same time be true. Windows-1252 and ISO-8859-1 are such a case (refer to the last paragraph of section 5.2 of my paper).

User · Answer

In plain Java:

    final String[] encodings = { "US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16BE", "UTF-16LE", "UTF-16" };

    List<String> lines;

    for (String encoding : encodings) {
        try {
            lines = Files.readAllLines(path, Charset.forName(encoding));
            for (String line : lines) {
                // do something...
            }
            break;
        } catch (IOException ioe) {
            System.out.println(encoding + " failed, trying next.");
        }
    }

This approach will try the encodings one by one until one works or we run out of them. (BTW my encodings list has only those items because they are the charset implementations required on every Java platform, https://docs.oracle.com/javase/9/docs/api/java/nio/charset/Charset.html)
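One thing to watch with a try-in-order loop is the order itself: ISO-8859-1 maps every byte value to a character, so it never fails and any candidate listed after it is never reached. The sketch below is a variation on the same idea, working on bytes already in memory and putting the permissive charset last as a fallback. The class `CandidateCharsets`, its default order and the choice to leave out the UTF-16 variants are assumptions of this example, not a recommendation from the answer.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: returns the first candidate charset that decodes the data without errors. */
final class CandidateCharsets {

    // Strict charsets first; ISO-8859-1 last because it accepts any byte sequence.
    static final List<Charset> DEFAULT_ORDER = Arrays.asList(
            StandardCharsets.US_ASCII,
            StandardCharsets.UTF_8,
            StandardCharsets.ISO_8859_1);

    /** Returns the first charset that decodes cleanly, or null if none does. */
    static Charset firstDecodable(byte[] data, List<Charset> candidates) {
        for (Charset candidate : candidates) {
            try {
                candidate.newDecoder()
                         .onMalformedInput(CodingErrorAction.REPORT)
                         .onUnmappableCharacter(CodingErrorAction.REPORT)
                         .decode(ByteBuffer.wrap(data));
                return candidate;
            } catch (CharacterCodingException e) {
                // Not valid in this charset; try the next candidate.
            }
        }
        return null;
    }
}
```

The result is still a guess: it names the first charset in which the bytes are not invalid, which is not necessarily the one the file was written in.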

User · Answer

Here are my favorites:

TikaEncodingDetector

Dependency:

    <dependency>
      <groupId>org.apache.any23</groupId>
      <artifactId>apache-any23-encoding</artifactId>
      <version>1.1</version>
    </dependency>

Sample:

    public static Charset guessCharset(InputStream is) throws IOException {
      return Charset.forName(new TikaEncodingDetector().guessEncoding(is));
    }

GuessEncoding

Dependency:

    <dependency>
      <groupId>org.codehaus.guessencoding</groupId>
      <artifactId>guessencoding</artifactId>
      <version>1.4</version>
      <type>jar</type>
    </dependency>

Sample:

    public static Charset guessCharset2(File file) throws IOException {
      return CharsetToolkit.guessEncoding(file, 4096, StandardCharsets.UTF_8);
    }
