How to Find the Default Charset/Encoding in Java?

Question

The obvious answer is to use Charset.defaultCharset(), but we recently found out that this might not be the right answer. I was told that on several occasions the result differs from the real default charset used by the java.io classes. It looks like Java keeps two sets of default charsets. Does anyone have any insight into this issue?

We were able to reproduce one failure case. It's kind of a user error, but it may still expose the root cause of all the other problems. Here is the code:

    import java.io.ByteArrayOutputStream;
    import java.io.OutputStreamWriter;
    import java.nio.charset.Charset;

    public class CharSetTest {

        public static void main(String[] args) {
            System.out.println("Default Charset=" + Charset.defaultCharset());
            System.setProperty("file.encoding", "Latin-1");
            System.out.println("file.encoding=" + System.getProperty("file.encoding"));
            System.out.println("Default Charset=" + Charset.defaultCharset());
            System.out.println("Default Charset in Use=" + getDefaultCharSet());
        }

        // Reports the encoding an OutputStreamWriter picks when none is given.
        private static String getDefaultCharSet() {
            OutputStreamWriter writer = new OutputStreamWriter(new ByteArrayOutputStream());
            String enc = writer.getEncoding();
            return enc;
        }
    }

Our server requires the default charset to be Latin-1 in order to deal with some mixed encoding (ANSI/Latin-1/UTF-8) in a legacy protocol. So all our servers run with this JVM parameter:

    -Dfile.encoding=ISO-8859-1

Here is the result on Java 5:

    Default Charset=ISO-8859-1
    file.encoding=Latin-1
    Default Charset=UTF-8
    Default Charset in Use=ISO8859_1

Someone tried to change the encoding at runtime by setting file.encoding in the code. We all know that doesn't work. However, this apparently throws off defaultCharset(), but it doesn't affect the real default charset used by OutputStreamWriter.

Is this a bug or a feature?

EDIT: The accepted answer shows the root cause of the issue. Basically, you can't trust defaultCharset() in Java 5, because it is not the default encoding used by the I/O classes. It looks like Java 6 corrects this issue.

User · Answer

"Is this a bug or feature?"

Looks like undefined behaviour. I know that, in practice, you can change the default encoding using a command-line property, but I don't think what happens when you do this is defined.

Bug ID 4153515, on problems setting this property:

    This is not a bug. The "file.encoding" property is not required by the J2SE
    platform specification; it's an internal detail of Sun's implementations and
    should not be examined or modified by user code. It's also intended to be
    read-only; it's technically impossible to support the setting of this property
    to arbitrary values on the command line or at any other time during program
    execution.

    The preferred way to change the default encoding used by the VM and the runtime
    system is to change the locale of the underlying platform before starting your
    Java program.

I cringe when I see people setting the encoding on the command line - you don't know what code that is going to affect.

If you do not want to use the default encoding, set the encoding you do want explicitly via the appropriate method/constructor.
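To illustrate the last point, here is a minimal sketch of passing the charset explicitly instead of relying on the platform default. The class and method names (ExplicitCharsetExample, encode, decode) are made up for this example; the constructors used are the standard OutputStreamWriter(OutputStream, Charset) and String(byte[], Charset).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;

public class ExplicitCharsetExample {

    // Encodes text with the named charset; the platform default is never consulted.
    static byte[] encode(String text, String charsetName) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Writer writer = new OutputStreamWriter(out, Charset.forName(charsetName));
        try {
            writer.write(text);
            writer.close(); // flushes the encoder into the byte array
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown unchecked to keep the sketch short.
            throw new IllegalStateException(e);
        }
        return out.toByteArray();
    }

    // Decodes bytes with the named charset, again without using the default.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // the last character is U+00E9
        System.out.println(encode(text, "ISO-8859-1").length); // 4: one byte per character
        System.out.println(encode(text, "UTF-8").length);      // 5: U+00E9 takes two bytes
        System.out.println(text.equals(decode(encode(text, "UTF-8"), "UTF-8"))); // true
    }
}
```

Code written this way behaves the same whatever -Dfile.encoding or the OS locale happens to be.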

User · Answer

I have set the VM argument in the WAS server as -Dfile.encoding=UTF-8 to change the server's default character set.

User · Answer

This is really strange...

Once set, the default Charset is cached, and it isn't changed while the class is in memory. Setting the "file.encoding" property with System.setProperty("file.encoding", "Latin-1") does nothing. Every time Charset.defaultCharset() is called, it returns the cached charset.

Here are my results:

    Default Charset=ISO-8859-1
    file.encoding=Latin-1
    Default Charset=ISO-8859-1
    Default Charset in Use=ISO8859_1

I'm using JVM 1.6, though.

(update)

OK, I did reproduce your bug with JVM 1.5.

Looking at the source code of 1.5, the cached default charset isn't being set. I don't know if this is a bug or not, but 1.6 changes this implementation and uses the cached charset.

JVM 1.5:

    public static Charset defaultCharset() {
        synchronized (Charset.class) {
            if (defaultCharset == null) {
                java.security.PrivilegedAction pa =
                        new GetPropertyAction("file.encoding");
                String csn = (String) AccessController.doPrivileged(pa);
                Charset cs = lookup(csn);
                if (cs != null)
                    return cs;
                return forName("UTF-8");
            }
            return defaultCharset;
        }
    }

JVM 1.6:

    public static Charset defaultCharset() {
        if (defaultCharset == null) {
            synchronized (Charset.class) {
                java.security.PrivilegedAction pa =
                        new GetPropertyAction("file.encoding");
                String csn = (String) AccessController.doPrivileged(pa);
                Charset cs = lookup(csn);
                if (cs != null)
                    defaultCharset = cs;
                else
                    defaultCharset = forName("UTF-8");
            }
        }
        return defaultCharset;
    }

When you set the property to file.encoding=Latin-1 on JVM 1.5, the next call to Charset.defaultCharset() finds that the cached default charset isn't set, so it tries to look up a charset for the name Latin-1. This name isn't found, because it's incorrect, so the method returns the fallback, UTF-8.

As for why the IO classes such as OutputStreamWriter return an unexpected result: the implementation of sun.nio.cs.StreamEncoder (which is used by these IO classes) is also different in JVM 1.5 and JVM 1.6. The JVM 1.6 implementation relies on the Charset.defaultCharset() method to get the default encoding if one is not provided to the IO classes. The JVM 1.5 implementation uses a different method, Converters.getDefaultEncodingName(), to get the default charset. This method uses its own cache of the default charset, which is set upon JVM initialization.

JVM 1.6:

    public static StreamEncoder forOutputStreamWriter(OutputStream out,
            Object lock,
            String charsetName)
            throws UnsupportedEncodingException
    {
        String csn = charsetName;
        if (csn == null)
            csn = Charset.defaultCharset().name();
        try {
            if (Charset.isSupported(csn))
                return new StreamEncoder(out, lock, Charset.forName(csn));
        } catch (IllegalCharsetNameException x) { }
        throw new UnsupportedEncodingException(csn);
    }

JVM 1.5:

    public static StreamEncoder forOutputStreamWriter(OutputStream out,
            Object lock,
            String charsetName)
            throws UnsupportedEncodingException
    {
        String csn = charsetName;
        if (csn == null)
            csn = Converters.getDefaultEncodingName();
        if (!Converters.isCached(Converters.CHAR_TO_BYTE, csn)) {
            try {
                if (Charset.isSupported(csn))
                    return new CharsetSE(out, lock, Charset.forName(csn));
            } catch (IllegalCharsetNameException x) { }
        }
        return new ConverterSE(out, lock, csn);
    }

But I agree with the comments. You shouldn't rely on this property. It's an implementation detail.
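The difference between the two lookup strategies can be reproduced outside the JDK. The following is only a simplified simulation, not the JDK source: the class DefaultCharsetLookupSketch and its fields are hypothetical stand-ins, with a plain static field playing the role of the "file.encoding" property so that no real system property is modified.

```java
import java.nio.charset.Charset;

public class DefaultCharsetLookupSketch {

    // Hypothetical stand-in for the "file.encoding" system property.
    static String fileEncoding = "ISO-8859-1";

    // Hypothetical stand-in for the private cache field inside Charset.
    static Charset cached;

    // Resolves a charset name, falling back to UTF-8 when it is unknown or malformed.
    static Charset resolve(String name) {
        try {
            if (name != null && Charset.isSupported(name)) {
                return Charset.forName(name);
            }
        } catch (IllegalArgumentException e) {
            // Malformed name (IllegalCharsetNameException): fall through to UTF-8.
        }
        return Charset.forName("UTF-8");
    }

    // 1.5-style behaviour: the cache is never filled, so the property is re-read on every call.
    static Charset uncachedDefault() {
        return resolve(fileEncoding);
    }

    // 1.6-style behaviour: the first result is stored and reused afterwards.
    static synchronized Charset cachedDefault() {
        if (cached == null) {
            cached = resolve(fileEncoding);
        }
        return cached;
    }

    public static void main(String[] args) {
        System.out.println(uncachedDefault()); // ISO-8859-1
        System.out.println(cachedDefault());   // ISO-8859-1
        fileEncoding = "Latin-1";              // simulates System.setProperty at runtime
        System.out.println(uncachedDefault()); // UTF-8, because "Latin-1" is not a known name
        System.out.println(cachedDefault());   // ISO-8859-1, still the cached value
    }
}
```

The third line of output corresponds to the surprising "Default Charset=UTF-8" result reported in the question.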

User · Answer

The behaviour is not really that strange. Looking into the implementation of the classes, it is caused by the following:

- Charset.defaultCharset() does not cache the determined character set in Java 5.
- Setting the system property "file.encoding" and invoking Charset.defaultCharset() again causes a second evaluation of the system property. No character set with the name "Latin-1" is found, so Charset.defaultCharset() defaults to "UTF-8".
- The OutputStreamWriter, however, caches the default character set and is probably already used during VM initialization, so its default character set diverges from Charset.defaultCharset() if the system property "file.encoding" has been changed at runtime.

As already pointed out, it is not documented how the VM must behave in such a situation. The Charset.defaultCharset() API documentation is not very precise about how the default character set is determined; it only mentions that this is usually done at VM startup, based on factors like the OS default character set or default locale.
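A quick way to see whether the two defaults agree on a given JVM is to compare them directly. This is a rough diagnostic sketch; the class name DefaultCharsetCheck is made up, and it assumes that the historical name returned by OutputStreamWriter.getEncoding() (for example ISO8859_1) is accepted by Charset.forName as an alias, which holds for the common charsets.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;

public class DefaultCharsetCheck {

    // The charset an OutputStreamWriter picks when none is given, normalized
    // through Charset.forName so that historical names compare correctly.
    static Charset writerDefault() {
        OutputStreamWriter writer = new OutputStreamWriter(new ByteArrayOutputStream());
        return Charset.forName(writer.getEncoding());
    }

    // True when the I/O classes and Charset.defaultCharset() report the same charset.
    static boolean defaultsAgree() {
        return writerDefault().equals(Charset.defaultCharset());
    }

    public static void main(String[] args) {
        System.out.println("Charset.defaultCharset()   = " + Charset.defaultCharset());
        System.out.println("OutputStreamWriter default = " + writerDefault());
        System.out.println("agree                      = " + defaultsAgree());
    }
}
```

Based on the analysis above, the two should agree on Java 6 and later; a mismatch would be expected only in the Java 5 scenario from the question, where the property was changed at runtime.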

User · Answer

Check System.getProperty("sun.jnu.encoding"); it seems to be the same encoding as the one used in your system's command line.
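For reference, a small sketch that prints this property next to file.encoding. Both are implementation details of Sun-derived JVMs, so either may be missing on other implementations; the class name EncodingProperties and the describe helper are made up for this example.

```java
public class EncodingProperties {

    // Formats a system property as key=value, tolerating properties that are absent.
    static String describe(String key) {
        String value = System.getProperty(key);
        return key + "=" + (value == null ? "(not set)" : value);
    }

    public static void main(String[] args) {
        System.out.println(describe("file.encoding"));
        System.out.println(describe("sun.jnu.encoding"));
    }
}
```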

User · Answer

First, Latin-1 is the same as ISO-8859-1, so the default was already OK for you. Right?

You successfully set the encoding to ISO-8859-1 with your command-line parameter. You also set it programmatically to "Latin-1", but that's not a recognized value of a file encoding for Java. See http://java.sun.com/javase/6/docs/technotes/guides/intl/encoding.doc.html

When you do that, it looks like Charset resets to UTF-8, from looking at the source. That at least explains most of the behavior.

I don't know why OutputStreamWriter shows ISO8859_1. It delegates to closed-source sun.misc.* classes. I'm guessing it isn't quite dealing with encoding via the same mechanism, which is weird.

But of course you should always be specifying what encoding you mean in this code. I'd never rely on the platform default.
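Whether a given name is recognized can be checked with Charset.isSupported before using it. A minimal sketch, with a made-up class name (CharsetNameCheck) and helper (canonicalNameOrNull):

```java
import java.nio.charset.Charset;

public class CharsetNameCheck {

    // Returns the canonical charset name for a legal name or alias, or null if it is not supported.
    static String canonicalNameOrNull(String name) {
        return Charset.isSupported(name) ? Charset.forName(name).name() : null;
    }

    public static void main(String[] args) {
        String[] names = { "ISO-8859-1", "ISO8859_1", "latin1", "Latin-1" };
        for (String name : names) {
            System.out.println(name + " -> " + canonicalNameOrNull(name));
        }
    }
}
```

On the JDKs I am aware of, the first three names resolve to ISO-8859-1, while "Latin-1" (with the hyphen) is not a registered alias and yields null, which is consistent with the UTF-8 fallback seen in the question.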
