The obvious answer is to use Charset.defaultCharset()
but we recently found out that this might not be the right answer. I was told that the result is different from real default charset used by java.io classes in several occasions. Looks like Java keeps 2 sets of default charset. Does anyone have any insights on this issue?
We were able to reproduce one fail case. It's kind of user error but it may still expose the root cause of all other problems. Here is the code,
public class CharSetTest {
public static void main(String[] args) {
System.out.println("Default Charset=" + Charset.defaultCharset());
System.setProperty("file.encoding", "Latin-1");
System.out.println("file.encoding=" + System.getProperty("file.encoding"));
System.out.println("Default Charset=" + Charset.defaultCharset());
System.out.println("Default Charset in Use=" + getDefaultCharSet());
}
private static String getDefaultCharSet() {
OutputStreamWriter writer = new OutputStreamWriter(new ByteArrayOutputStream());
String enc = writer.getEncoding();
return enc;
}
}
Our server requires default charset in Latin-1 to deal with some mixed encoding (ANSI/Latin-1/UTF-8) in a legacy protocol. So all our servers run with this JVM parameter,
-Dfile.encoding=ISO-8859-1
Here is the result on Java 5,
Default Charset=ISO-8859-1
file.encoding=Latin-1
Default Charset=UTF-8
Default Charset in Use=ISO8859_1
Someone tries to change the encoding runtime by setting the file.encoding in the code. We all know that doesn't work. However, this apparently throws off defaultCharset() but it doesn't affect the real default charset used by OutputStreamWriter.
Is this a bug or feature?
EDIT: The accepted answer shows the root cause of the issue. Basically, you can't trust defaultCharset() in Java 5, which is not the default encoding used by I/O classes. Looks like Java 6 corrects this issue.
This question is related to
java
encoding
character-encoding
First, Latin-1 is the same as ISO-8859-1, so, the default was already OK for you. Right?
You successfully set the encoding to ISO-8859-1 with your command line parameter. You also set it programmatically to "Latin-1", but, that's not a recognized value of a file encoding for Java. See http://java.sun.com/javase/6/docs/technotes/guides/intl/encoding.doc.html
When you do that, looks like Charset resets to UTF-8, from looking at the source. That at least explains most of the behavior.
I don't know why OutputStreamWriter shows ISO8859_1. It delegates to closed-source sun.misc.* classes. I'm guessing it isn't quite dealing with encoding via the same mechanism, which is weird.
But of course you should always be specifying what encoding you mean in this code. I'd never rely on the platform default.
I have set the vm argument in WAS server as -Dfile.encoding=UTF-8 to change the servers' default character set.
The behaviour is not really that strange. Looking into the implementation of the classes, it is caused by:
Charset.defaultCharset()
is not caching the determined character set in Java 5.Charset.defaultCharset()
again causes a second evaluation of the system property, no character set with the name "Latin-1" is found, so Charset.defaultCharset()
defaults to "UTF-8".OutputStreamWriter
is however caching the default character set and is probably used already during VM initialization, so that its default character set diverts from Charset.defaultCharset()
if the system property "file.encoding" has been changed at runtime.As already pointed out, it is not documented how the VM must behave in such a situation. The Charset.defaultCharset()
API documentation is not very precise on how the default character set is determined, only mentioning that it is usually done on VM startup, based on factors like the OS default character set or default locale.
Is this a bug or feature?
Looks like undefined behaviour. I know that, in practice, you can change the default encoding using a command-line property, but I don't think what happens when you do this is defined.
Bug ID: 4153515 on problems setting this property:
This is not a bug. The "file.encoding" property is not required by the J2SE platform specification; it's an internal detail of Sun's implementations and should not be examined or modified by user code. It's also intended to be read-only; it's technically impossible to support the setting of this property to arbitrary values on the command line or at any other time during program execution.
The preferred way to change the default encoding used by the VM and the runtime system is to change the locale of the underlying platform before starting your Java program.
I cringe when I see people setting the encoding on the command line - you don't know what code that is going to affect.
If you do not want to use the default encoding, set the encoding you do want explicitly via the appropriate method/constructor.
check
System.getProperty("sun.jnu.encoding")
it seems to be the same encoding as the one used in your system's command line.
Source: Stackoverflow.com