[java] Java String encoding (UTF-8)

I have come across this line of legacy code, which I am trying to figure out:

String newString = new String(oldString.getBytes("UTF-8"), "UTF-8"));

As far as I can understand, it is encoding & decoding using the same charSet.

How is this different from the following?

String newString = oldString;

Is there any scenario in which the two lines will have different outputs?

p.s.: Just to clarify, yes I am aware of the excellent article on encoding by Joel Spolsky !

This question is related to java string encoding

The answer is


This could be complicated way of doing

String newString = new String(oldString);

This shortens the String is the underlying char[] used is much longer.

However more specifically it will be checking that every character can be UTF-8 encoded.

There are some "characters" you can have in a String which cannot be encoded and these would be turned into ?

Any character between \uD800 and \uDFFF cannot be encoded and will be turned into '?'

String oldString = "\uD800";
String newString = new String(oldString.getBytes("UTF-8"), "UTF-8");
System.out.println(newString.equals(oldString));

prints

false

How is this different from the following?

This line of code here:

String newString = new String(oldString.getBytes("UTF-8"), "UTF-8"));

constructs a new String object (i.e. a copy of oldString), while this line of code:

String newString = oldString;

declares a new variable of type java.lang.String and initializes it to refer to the same String object as the variable oldString.

Is there any scenario in which the two lines will have different outputs?

Absolutely:

String newString = oldString;
boolean isSameInstance = newString == oldString; // isSameInstance == true

vs.

String newString = new String(oldString.getBytes("UTF-8"), "UTF-8"));
 // isSameInstance == false (in most cases)    
boolean isSameInstance = newString == oldString;

a_horse_with_no_name (see comment) is right of course. The equivalent of

String newString = new String(oldString.getBytes("UTF-8"), "UTF-8"));

is

String newString = new String(oldString);

minus the subtle difference wrt the encoding that Peter Lawrey explains in his answer.


Examples related to java

Under what circumstances can I call findViewById with an Options Menu / Action Bar item? How much should a function trust another function How to implement a simple scenario the OO way Two constructors How do I get some variable from another class in Java? this in equals method How to split a string in two and store it in a field How to do perspective fixing? String index out of range: 4 My eclipse won't open, i download the bundle pack it keeps saying error log

Examples related to string

How to split a string in two and store it in a field String method cannot be found in a main class method Kotlin - How to correctly concatenate a String Replacing a character from a certain index Remove quotes from String in Python Detect whether a Python string is a number or a letter How does String substring work in Swift How does String.Index work in Swift swift 3.0 Data to String? How to parse JSON string in Typescript

Examples related to encoding

How to check encoding of a CSV file UnicodeEncodeError: 'ascii' codec can't encode character at special name Using Javascript's atob to decode base64 doesn't properly decode utf-8 strings What is the difference between utf8mb4 and utf8 charsets in MySQL? The character encoding of the plain text document was not declared - mootool script UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 23: ordinal not in range(128) How to encode text to base64 in python UTF-8 output from PowerShell Set Encoding of File to UTF8 With BOM in Sublime Text 3 Replace non-ASCII characters with a single space