How to convert Strings to and from UTF8 byte arrays in Java

Question

In Java, I have a String and I want to encode it as a byte array (in UTF8, or some other encoding). Alternately, I have a byte array (in some known encoding) and I want to convert it into a Java String. How do I do these conversions?

User · Answer

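If you are using 7-bit ASCII or ISO-8859-1 (an amazingly common format), you don't have to create a new java.lang.String at all. It is much cheaper to widen each byte to a char directly. Because Java bytes are signed, mask with 0xFF so that values above 127 map to the intended ISO-8859-1 code point instead of being sign-extended:

Full working example:

for (byte b : new byte[] { 43, 45, (byte) 215, (byte) 247 }) {
    // b & 0xFF yields the unsigned value 0..255, which is the ISO-8859-1 code point
    char c = (char) (b & 0xFF);
    System.out.print(c);
}

If you are not using extended characters like Ä, Æ, Å, Ç, Ï, Ê and can be sure that the only transmitted values are among the first 128 Unicode characters, then this code will also work for UTF-8 and extended ASCII (like cp-1252), because those encodings agree with ASCII on that range.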
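As a sanity check on the cast approach, the sketch below compares widening each byte (masked with 0xFF) against decoding with StandardCharsets.ISO_8859_1 for all 256 byte values. The class name Latin1CastDemo and the helper castDecode are made up for this illustration.

```java
import java.nio.charset.StandardCharsets;

final class Latin1CastDemo {
    // Hypothetical helper: widen each byte to the char with the same unsigned value.
    static String castDecode(byte[] bytes) {
        char[] chars = new char[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            // The mask prevents sign extension for byte values above 127.
            chars[i] = (char) (bytes[i] & 0xFF);
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        String viaCast = castDecode(all);
        String viaCharset = new String(all, StandardCharsets.ISO_8859_1);
        // Both decodings should agree on every possible byte value.
        System.out.println(viaCast.equals(viaCharset));
    }
}
```

Without the mask, a plain (char) cast of a negative byte produces a char in the 0xFF80..0xFFFF range, so the shortcut is only safe unmasked for pure 7-bit ASCII.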

User · Answer

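import java.nio.charset.Charset;

Charset UTF8_CHARSET = Charset.forName("UTF-8");
String strISO = "{\"name\":\"…\"}";
System.out.println(strISO);

// Encode with an explicit charset; the no-argument getBytes() uses the platform default.
byte[] b = strISO.getBytes(UTF8_CHARSET);
for (byte c : b) {
    System.out.print("[" + c + "]");
}

String str = new String(b, UTF8_CHARSET);
System.out.println(str);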

User · Answer

As an alternative, StringUtils from Apache Commons Codec (org.apache.commons.codec.binary.StringUtils) can be used.

byte[] bytes = {(byte) 1};
String convertedString = StringUtils.newStringUtf8(bytes);

or

String myString = "example";
byte[] convertedBytes = StringUtils.getBytesUtf8(myString);

If you have a non-standard charset, you can use getBytesUnchecked() or newString() accordingly.

User · Answer

Convert from String to byte[]:

String s = "some text here";
byte[] b = s.getBytes(StandardCharsets.UTF_8);

Convert from byte[] to String:

byte[] b = {(byte) 99, (byte) 97, (byte) 116};
String s = new String(b, StandardCharsets.US_ASCII);

You should, of course, use the correct encoding name. My examples used US-ASCII and UTF-8, the two most common encodings.
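To make the two conversions concrete, here is a minimal round-trip sketch with a non-ASCII character, showing that the number of bytes can differ from the number of chars. The class name Utf8RoundTrip and its helpers are illustrative only.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8RoundTrip {
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "café": the last character needs two bytes in UTF-8.
        String text = "caf\u00e9";
        byte[] bytes = encode(text);
        System.out.println(text.length() + " chars, " + bytes.length + " bytes");
        System.out.println(Arrays.toString(bytes));
        // Decoding with the same charset restores the original text.
        System.out.println(decode(bytes).equals(text));
    }
}
```

Using the StandardCharsets constants (available since Java 7) avoids both the charset lookup by name and the checked exception of the String-named overloads.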

User · Answer

I can't comment but don't want to start a new thread. But this isn't working. A simple round trip:

byte[] b = new byte[]{ 0, 0, 0, -127 };            // 0x00000081
String s = new String(b, StandardCharsets.UTF_8);  // UTF8 = 0x0000, 0x0000, 0x0000, 0xfffd
b = s.getBytes(StandardCharsets.UTF_8);            // [0, 0, 0, -17, -65, -67] 0x000000efbfbd != 0x00000081

I'd need b[] to be the same array before and after encoding, which it isn't (this refers to the first answer).
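The round trip fails because these bytes are not valid UTF-8: 0x81 is a continuation byte with no lead byte, so the decoder substitutes U+FFFD, which re-encodes as EF BF BD. If the goal is to carry arbitrary bytes through a String unchanged, one option (a sketch, not the only approach) is ISO-8859-1, which maps every byte value to exactly one char. The class and method names below are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class BinaryRoundTrip {
    // Decode the bytes with the given charset, then encode the result again.
    static byte[] roundTrip(byte[] input, Charset charset) {
        return new String(input, charset).getBytes(charset);
    }

    public static void main(String[] args) {
        byte[] b = { 0, 0, 0, -127 };
        // The invalid byte is replaced by U+FFFD: prints [0, 0, 0, -17, -65, -67]
        System.out.println(Arrays.toString(roundTrip(b, StandardCharsets.UTF_8)));
        // Every byte maps to one char and back: prints [0, 0, 0, -127]
        System.out.println(Arrays.toString(roundTrip(b, StandardCharsets.ISO_8859_1)));
    }
}
```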

User · Answer

// query is your json

DefaultHttpClient httpClient = new DefaultHttpClient();
HttpPost postRequest = new HttpPost("http://my.site/test/v1/product/search?qy=");

StringEntity input = new StringEntity(query, "UTF-8");
input.setContentType("application/json");

postRequest.setEntity(input);
HttpResponse response = httpClient.execute(postRequest);

User · Answer

My tomcat7 implementation is accepting strings as ISO-8859-1, despite the content-type of the HTTP request. The following solution worked for me when trying to correctly interpret non-ASCII characters:

byte[] b1 = szP1.getBytes("ISO-8859-1");
// Arrays.toString prints the byte values; b1.toString() would only print the array's identity
System.out.println(java.util.Arrays.toString(b1));

String szUT8 = new String(b1, "UTF-8");
System.out.println(szUT8);

When trying to interpret the string as US-ASCII, the byte info wasn't correctly interpreted:

b1 = szP1.getBytes("US-ASCII");
System.out.println(java.util.Arrays.toString(b1));
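The repair above works because ISO-8859-1 maps each char below U+0100 back to the single byte it came from, so the original UTF-8 bytes can be recovered and decoded properly. A small self-contained sketch of that idea follows; the class MojibakeRepair and both method names are invented for the example.

```java
import java.nio.charset.StandardCharsets;

final class MojibakeRepair {
    // Simulates a server that decodes UTF-8 request bytes as ISO-8859-1.
    static String misdecodeAsLatin1(String original) {
        return new String(original.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Recovers the raw bytes via ISO-8859-1 and decodes them as UTF-8.
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00e9";                   // "é"
        String garbled = misdecodeAsLatin1(original); // two chars: U+00C3 U+00A9
        System.out.println(garbled.length());
        System.out.println(repair(garbled).equals(original));
    }
}
```

US-ASCII cannot be used for the recovery step because it has no mapping for chars above 127, so those bytes are lost when encoding.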

User · Answer

Terribly late, but I just encountered this issue and this is my fix:

private static String removeNonUtf8CompliantCharacters(final String inString) {
    if (null == inString) return null;
    byte[] byteArr = inString.getBytes();
    for (int i = 0; i < byteArr.length; i++) {
        byte ch = byteArr[i];
        // remove any characters outside the valid UTF-8 range as well as all control characters
        // except tabs and new lines
        // (Java bytes are signed, so every byte of a multi-byte sequence is negative and is replaced too)
        if (!((ch > 31 && ch < 253) || ch == '\t' || ch == '\n' || ch == '\r')) {
            byteArr[i] = ' ';
        }
    }
    return new String(byteArr);
}
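A different technique from the byte filter above, shown only as a sketch: if the aim is to find out whether a byte array is well-formed UTF-8 before decoding it, the JDK's CharsetDecoder can be configured to report malformed input instead of silently replacing it. The class Utf8Validation and the method isValidUtf8 are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Validation {
    // Returns true only if the whole array decodes as well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // Thrown when the input contains a malformed byte sequence.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidUtf8("h\u00e9llo".getBytes(StandardCharsets.UTF_8)));
        System.out.println(isValidUtf8(new byte[] { 0, 0, 0, -127 }));
    }
}
```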

User · Answer

String original = "hello world";
byte[] utf8Bytes = original.getBytes("UTF-8");
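One detail worth spelling out: getBytes(String charsetName) declares the checked UnsupportedEncodingException, so the snippet only compiles inside code that catches or declares it. The sketch below contrasts it with the getBytes(Charset) overload, which has no checked exception. The class and method names are illustrative.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class GetBytesOverloads {
    // Charset given by name: the checked exception must be handled.
    static byte[] byName(String text) {
        try {
            return text.getBytes("UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    // Charset given as a constant: no checked exception involved.
    static byte[] byConstant(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "hello world";
        System.out.println(Arrays.equals(byName(original), byConstant(original)));
    }
}
```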

User · Answer

For decoding a series of bytes to a normal string message, I finally got it working with UTF-8 encoding with this code:

/* Convert a list of UTF-8 numbers to a normal String.
 * Useful for decoding a JMS message that is delivered as a sequence of bytes instead of plain text.
 */
public String convertUtf8NumbersToString(String[] numbers) {
    int length = numbers.length;
    byte[] data = new byte[length];

    for (int i = 0; i < length; i++) {
        data[i] = Byte.parseByte(numbers[i]);
    }
    return new String(data, Charset.forName("UTF-8"));
}
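A short usage sketch of the method above, restated as a static method so the example is self-contained (the class name Utf8NumbersDemo is made up). Note that Byte.parseByte only accepts values from -128 to 127, so the input numbers must be signed byte values; an unsigned value such as "195" throws NumberFormatException.

```java
import java.nio.charset.StandardCharsets;

final class Utf8NumbersDemo {
    // Same logic as the answer's method, using the StandardCharsets constant.
    static String convertUtf8NumbersToString(String[] numbers) {
        byte[] data = new byte[numbers.length];
        for (int i = 0; i < numbers.length; i++) {
            data[i] = Byte.parseByte(numbers[i]); // accepts -128..127 only
        }
        return new String(data, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 104 is 'h'; -61 and -87 are the signed forms of 0xC3 0xA9, the UTF-8 bytes of 'é'.
        String[] numbers = { "104", "-61", "-87" };
        System.out.println(convertUtf8NumbersToString(numbers).equals("h\u00e9"));
    }
}
```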

User · Answer

You can convert directly via the String(byte[], String) constructor and getBytes(String) method. Java exposes available character sets via the Charset class. The JDK documentation lists supported encodings.

90% of the time, such conversions are performed on streams, so you'd use the Reader/Writer classes. You would not incrementally decode using the String methods on arbitrary byte streams - you would leave yourself open to bugs involving multibyte characters.
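The multibyte pitfall can be illustrated with a small sketch: decoding fixed-size chunks independently can split a character's bytes across two chunks, while an InputStreamReader keeps decoder state between reads. The class StreamDecodingDemo, its method names, and the chunk size are all assumptions chosen for the demonstration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class StreamDecodingDemo {
    // Naive approach: each chunk is decoded on its own, so a split character is corrupted.
    static String decodeChunkwise(byte[] bytes, int chunkSize) {
        StringBuilder out = new StringBuilder();
        for (int start = 0; start < bytes.length; start += chunkSize) {
            int len = Math.min(chunkSize, bytes.length - start);
            out.append(new String(bytes, start, len, StandardCharsets.UTF_8));
        }
        return out.toString();
    }

    // Reader approach: the decoder carries partial sequences over between reads.
    static String decodeWithReader(byte[] bytes) {
        StringBuilder out = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)) {
            char[] buffer = new char[2];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                out.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "h\u00e9llo"; // 'é' occupies bytes 1 and 2 of the UTF-8 form
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeChunkwise(bytes, 2).equals(text));
        System.out.println(decodeWithReader(bytes).equals(text));
    }
}
```

With a chunk size of 2, the two bytes of 'é' land in different chunks, so the chunkwise result no longer matches the original text, whereas the Reader result does.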

User · Answer

Here's a solution that avoids performing the Charset lookup for every conversion:

import java.nio.charset.Charset;

private final Charset UTF8_CHARSET = Charset.forName("UTF-8");

String decodeUTF8(byte[] bytes) {
    return new String(bytes, UTF8_CHARSET);
}

byte[] encodeUTF8(String string) {
    return string.getBytes(UTF8_CHARSET);
}

User · Answer

Reader reader = new BufferedReader(
    new InputStreamReader(
        new ByteArrayInputStream(
            string.getBytes(StandardCharsets.UTF_8)), StandardCharsets.UTF_8));
