[java] Encoding conversion in java

Is there any free Java library I can use to convert a string from one encoding to another, something like iconv? I'm using Java version 1.3.

This question is related to: java, character-encoding, converters

The answer is


UTF-8 and UCS-2/UTF-16 can be distinguished reasonably easily via a byte order mark at the start of the file. If this exists then it's a pretty good bet that the file is in that encoding - but it's not a dead certainty. You may well also find that the file is in one of those encodings, but doesn't have a byte order mark.
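
To make that concrete, here is a minimal sketch of my own (not part of the quoted answer; the class and method names are made up) of what a BOM check looks like in Java:

import java.io.FileInputStream;
import java.io.IOException;

public class BomSniffer {
    // Best-guess charset name based on a byte order mark at the start of
    // the file, or null if no BOM is present (absence proves nothing).
    static String sniffBom(String path) throws IOException {
        FileInputStream in = new FileInputStream(path);
        try {
            byte[] b = new byte[3];
            int n = in.read(b);
            if (n >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB
                       && (b[2] & 0xFF) == 0xBF) {
                return "UTF-8";
            }
            if (n >= 2) {
                if ((b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) return "UTF-16BE";
                if ((b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) return "UTF-16LE";
            }
            return null;
        } finally {
            in.close();
        }
    }
}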

I don't know much about ISO-8859-2, but I wouldn't be surprised if almost every file is a valid text file in that encoding. The best you'll be able to do is check it heuristically. Indeed, the Wikipedia page talking about it would suggest that only byte 0x7f is invalid.
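
In practice that means about the only mechanical check available is to reject the one invalid byte; everything beyond that is a statistical guess. A sketch of my own, following the reading above:

public class Iso88592Check {
    // Heuristic only: per the reading above, every byte except 0x7F maps
    // to some ISO-8859-2 character, so this check can rule the encoding
    // out but can never confirm it.
    static boolean couldBeIso88592(byte[] data) {
        for (int i = 0; i < data.length; i++) {
            if ((data[i] & 0xFF) == 0x7F) {
                return false;
            }
        }
        return true;
    }
}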

There's no way of reading a file "as it is" and yet getting text out - a file is a sequence of bytes, so you have to apply a character encoding in order to decode those bytes into characters.
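
In code terms (a minimal sketch of my own, assuming an ISO-8859-2 file), "applying a character encoding" just means naming a charset when you wrap the byte stream:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

public class ReadWithCharset {
    public static void main(String[] args) throws IOException {
        // There is no "as it is": a charset must be named (here ISO-8859-2)
        // to turn the file's bytes into characters.
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(args[0]), "ISO-8859-2"));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        } finally {
            reader.close();
        }
    }
}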

Source: Stack Overflow


I would just like to add that if the String was originally decoded using the wrong encoding, it may be impossible to convert it to another encoding without errors. The question doesn't state that the conversion here is from a wrong encoding to the correct one, but I personally stumbled onto this question in exactly that situation, so here's a heads-up for others as well.

This answer to another question explains why the conversion does not always yield correct results: https://stackoverflow.com/a/2623793/4702806
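
A small sketch of my own (not from either linked answer) showing the difference between a recoverable mis-decode and an unrecoverable one:

import java.io.UnsupportedEncodingException;

public class MojibakeDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String original = "caf\u00E9";            // "café"
        byte[] utf8 = original.getBytes("UTF-8"); // the é becomes C3 A9

        // Decoding UTF-8 bytes as ISO-8859-1 produces mojibake ("cafÃ©"),
        // but every byte still maps to a character, so it can be undone.
        String mojibake = new String(utf8, "ISO-8859-1");
        String undone = new String(mojibake.getBytes("ISO-8859-1"), "UTF-8");
        System.out.println(mojibake + " -> " + undone);

        // Decoding the same bytes as US-ASCII replaces the non-ASCII bytes
        // with U+FFFD; the original bytes are gone and cannot be recovered.
        System.out.println(new String(utf8, "US-ASCII"));
    }
}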



CharsetDecoder should be what you are looking for, no?

Many network protocols and files store their characters with a byte-oriented character set such as ISO-8859-1 (ISO-Latin-1).
However, Java's native character encoding is Unicode UTF16BE (Sixteen-bit UCS Transformation Format, big-endian byte order).

See Charset. That doesn't mean UTF-16 is the default charset (i.e., the default "mapping between sequences of sixteen-bit Unicode code units and sequences of bytes"):

Every instance of the Java virtual machine has a default charset, which may or may not be one of the standard charsets.
[US-ASCII, ISO-8859-1 a.k.a. ISO-LATIN-1, UTF-8, UTF-16BE, UTF-16LE, UTF-16]
The default charset is determined during virtual-machine startup and typically depends upon the locale and charset being used by the underlying operating system.
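
You can see what your VM picked like this (a sketch of my own; Charset.defaultCharset() exists since Java 5, while the file.encoding system property is the older, unofficial hint):

import java.nio.charset.Charset;

public class DefaultCharsetDemo {
    public static void main(String[] args) {
        // Java 5+: the default charset chosen at VM startup.
        System.out.println(Charset.defaultCharset());
        // Older VMs: the system property the default is usually derived from.
        System.out.println(System.getProperty("file.encoding"));
    }
}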

This example demonstrates how to convert ISO-8859-1 encoded bytes in a ByteBuffer to a string in a CharBuffer and vice versa.

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;

// Create the encoder and decoder for ISO-8859-1
Charset charset = Charset.forName("ISO-8859-1");
CharsetDecoder decoder = charset.newDecoder();
CharsetEncoder encoder = charset.newEncoder();

try {
    // Encode a string into ISO-LATIN-1 bytes in a ByteBuffer.
    // The new ByteBuffer is ready to be read.
    ByteBuffer bbuf = encoder.encode(CharBuffer.wrap("a string"));

    // Decode the ISO-LATIN-1 bytes back into a CharBuffer,
    // then into a string. The new CharBuffer is ready to be read.
    CharBuffer cbuf = decoder.decode(bbuf);
    String s = cbuf.toString();
} catch (CharacterCodingException e) {
    // Thrown on malformed input or an unmappable character; don't swallow it.
    e.printStackTrace();
}
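
Since the question asks about Java 1.3, note that java.nio and CharsetDecoder only arrived in Java 1.4. Here is a hedged sketch of the same round trip using only String, which goes back to Java 1.1 (the class and method names are mine):

import java.io.UnsupportedEncodingException;

public class Iconvish {
    // iconv-style conversion: interpret the input bytes as 'from', then
    // re-encode the resulting characters as 'to'. Lossy if the input was
    // not really in 'from', or if 'to' cannot represent some characters.
    static byte[] convert(byte[] input, String from, String to)
            throws UnsupportedEncodingException {
        return new String(input, from).getBytes(to);
    }
}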

It is a whole lot easier if you think of Unicode as a character set (which it actually is - at its most basic, it is the numbered set of all known characters). You can encode it as UTF-8 (1-4 bytes per character, depending on the code point) or as UTF-16 (2 bytes per character, or 4 bytes using surrogate pairs).
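
To see those widths concretely (a sketch of my own):

import java.io.UnsupportedEncodingException;

public class EncodedLengths {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // 'A' (1 byte in UTF-8), 'é' (2), '€' (3), U+1F600 (4, a surrogate pair).
        String s = "A\u00E9\u20AC\uD83D\uDE00";
        System.out.println(s.getBytes("UTF-8").length);    // 1 + 2 + 3 + 4 = 10
        System.out.println(s.getBytes("UTF-16BE").length); // 2 + 2 + 2 + 4 = 10
    }
}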

Back in the mists of time Java used UCS-2 to encode the Unicode character set. This could only handle 2 bytes per character and is now obsolete. It was a fairly obvious hack to add surrogate pairs and move up to UTF-16.

A lot of people think they should have used UTF-8 in the first place - Unicode has long since grown far beyond 65,535 characters anyway...