[java] Encoding conversion in java

Is there any free Java library I can use to convert a string from one encoding to another, something like iconv? I'm using Java version 1.3.

This question is related to: java, character-encoding, converters

The answer is


UTF-8 and UCS-2/UTF-16 can be distinguished reasonably easily via a byte order mark at the start of the file. If this exists then it's a pretty good bet that the file is in that encoding - but it's not a dead certainty. You may well also find that the file is in one of those encodings, but doesn't have a byte order mark.
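
To make that concrete, here is a minimal sketch of my own (not part of the quoted answer; the class and method names are made up) of what a BOM check looks like in Java:

import java.io.FileInputStream;
import java.io.IOException;

public class BomSniffer {
    // Best-guess charset name based on a byte order mark at the start of
    // the file, or null if no BOM is present (absence proves nothing).
    static String sniffBom(String path) throws IOException {
        FileInputStream in = new FileInputStream(path);
        try {
            byte[] b = new byte[3];
            int n = in.read(b);
            if (n >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB
                       && (b[2] & 0xFF) == 0xBF) {
                return "UTF-8";
            }
            if (n >= 2) {
                if ((b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) return "UTF-16BE";
                if ((b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) return "UTF-16LE";
            }
            return null;
        } finally {
            in.close();
        }
    }
}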

I don't know much about ISO-8859-2, but I wouldn't be surprised if almost every file is a valid text file in that encoding. The best you'll be able to do is check it heuristically. Indeed, the Wikipedia page talking about it would suggest that only byte 0x7f is invalid.
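
In practice that means about the only mechanical check available is to reject the one invalid byte; everything beyond that is a statistical guess. A sketch of my own, following the reading above:

public class Iso88592Check {
    // Heuristic only: per the reading above, every byte except 0x7F maps
    // to some ISO-8859-2 character, so this check can rule the encoding
    // out but can never confirm it.
    static boolean couldBeIso88592(byte[] data) {
        for (int i = 0; i < data.length; i++) {
            if ((data[i] & 0xFF) == 0x7F) {
                return false;
            }
        }
        return true;
    }
}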

There's no way of reading a file "as it is" and yet getting text out - a file is a sequence of bytes, so you have to apply a character encoding in order to decode those bytes into characters.
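
In code terms (a minimal sketch of my own, assuming an ISO-8859-2 file), "applying a character encoding" just means naming a charset when you wrap the byte stream:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

public class ReadWithCharset {
    public static void main(String[] args) throws IOException {
        // There is no "as it is": a charset must be named (here ISO-8859-2)
        // to turn the file's bytes into characters.
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(args[0]), "ISO-8859-2"));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        } finally {
            reader.close();
        }
    }
}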

Source: Stack Overflow


I would just like to add that if the String was originally decoded using the wrong encoding, it may be impossible to convert it to another encoding without errors. The question doesn't state that the conversion here is from a wrong encoding to the correct one, but I personally stumbled onto this question in exactly that situation, so here's a heads-up for others as well.

This answer to another question explains why the conversion does not always yield correct results: https://stackoverflow.com/a/2623793/4702806
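
A small sketch of my own (not from either linked answer) showing the difference between a recoverable mis-decode and an unrecoverable one:

import java.io.UnsupportedEncodingException;

public class MojibakeDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String original = "caf\u00E9";            // "café"
        byte[] utf8 = original.getBytes("UTF-8"); // the é becomes C3 A9

        // Decoding UTF-8 bytes as ISO-8859-1 produces mojibake ("cafÃ©"),
        // but every byte still maps to a character, so it can be undone.
        String mojibake = new String(utf8, "ISO-8859-1");
        String undone = new String(mojibake.getBytes("ISO-8859-1"), "UTF-8");
        System.out.println(mojibake + " -> " + undone);

        // Decoding the same bytes as US-ASCII replaces the non-ASCII bytes
        // with U+FFFD; the original bytes are gone and cannot be recovered.
        System.out.println(new String(utf8, "US-ASCII"));
    }
}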



CharsetDecoder should be what you are looking for, no?

Many network protocols and files store their characters with a byte-oriented character set such as ISO-8859-1 (ISO-Latin-1).
However, Java's native character encoding is Unicode UTF16BE (Sixteen-bit UCS Transformation Format, big-endian byte order).

See Charset. That doesn't mean UTF-16 is the default charset (i.e., the default "mapping between sequences of sixteen-bit Unicode code units and sequences of bytes"):

Every instance of the Java virtual machine has a default charset, which may or may not be one of the standard charsets.
[US-ASCII, ISO-8859-1 a.k.a. ISO-LATIN-1, UTF-8, UTF-16BE, UTF-16LE, UTF-16]
The default charset is determined during virtual-machine startup and typically depends upon the locale and charset being used by the underlying operating system.
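
You can see what your VM picked like this (a sketch of my own; Charset.defaultCharset() exists since Java 5, while the file.encoding system property is the older, unofficial hint):

import java.nio.charset.Charset;

public class DefaultCharsetDemo {
    public static void main(String[] args) {
        // Java 5+: the default charset chosen at VM startup.
        System.out.println(Charset.defaultCharset());
        // Older VMs: the system property the default is usually derived from.
        System.out.println(System.getProperty("file.encoding"));
    }
}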

This example demonstrates how to convert ISO-8859-1 encoded bytes in a ByteBuffer to a string in a CharBuffer and vice versa.

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;

// Create the encoder and decoder for ISO-8859-1
Charset charset = Charset.forName("ISO-8859-1");
CharsetDecoder decoder = charset.newDecoder();
CharsetEncoder encoder = charset.newEncoder();

try {
    // Encode a string into ISO-LATIN-1 bytes in a ByteBuffer.
    // The new ByteBuffer is ready to be read.
    ByteBuffer bbuf = encoder.encode(CharBuffer.wrap("a string"));

    // Decode the ISO-LATIN-1 bytes back into a CharBuffer,
    // then into a string. The new CharBuffer is ready to be read.
    CharBuffer cbuf = decoder.decode(bbuf);
    String s = cbuf.toString();
} catch (CharacterCodingException e) {
    // Thrown on malformed input or an unmappable character; don't swallow it.
    e.printStackTrace();
}
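
Since the question asks about Java 1.3, note that java.nio and CharsetDecoder only arrived in Java 1.4. Here is a hedged sketch of the same round trip using only String, which goes back to Java 1.1 (the class and method names are mine):

import java.io.UnsupportedEncodingException;

public class Iconvish {
    // iconv-style conversion: interpret the input bytes as 'from', then
    // re-encode the resulting characters as 'to'. Lossy if the input was
    // not really in 'from', or if 'to' cannot represent some characters.
    static byte[] convert(byte[] input, String from, String to)
            throws UnsupportedEncodingException {
        return new String(input, from).getBytes(to);
    }
}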

It is a whole lot easier if you think of Unicode as a character set (which it actually is - at its most basic, it is the numbered set of all known characters). You can encode it as UTF-8 (1-4 bytes per character, depending on the code point) or as UTF-16 (2 bytes per character, or 4 bytes using surrogate pairs).
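
To see those widths concretely (a sketch of my own):

import java.io.UnsupportedEncodingException;

public class EncodedLengths {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // 'A' (1 byte in UTF-8), 'é' (2), '€' (3), U+1F600 (4, a surrogate pair).
        String s = "A\u00E9\u20AC\uD83D\uDE00";
        System.out.println(s.getBytes("UTF-8").length);    // 1 + 2 + 3 + 4 = 10
        System.out.println(s.getBytes("UTF-16BE").length); // 2 + 2 + 2 + 4 = 10
    }
}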

Back in the mists of time Java used UCS-2 to encode the Unicode character set. This could only handle 2 bytes per character and is now obsolete. It was a fairly obvious hack to add surrogate pairs and move up to UTF-16.

A lot of people think they should have used UTF-8 in the first place - Unicode has long since grown far beyond 65,535 characters anyway...