[unicode] What's the difference between Unicode and UTF-8?

Consider:

Alt text

Is it true that unicode=utf16?

Many are saying Unicode is a standard, not an encoding, but most editors support save as Unicode encoding actually.

This question is related to unicode utf-8

The answer is


There's a lot of misunderstanding being displayed here. Unicode isn't an encoding, but the Unicode standard is devoted primarily to encoding anyway.

ISO 10646 is the international character set you (probably) care about. It defines a mapping between a set of named characters (e.g., "Latin Capital Letter A" or "Greek small letter alpha") and a set of code points (a number assigned to each -- for example, 61 hexadecimal and 3B1 hexadecimal for those two respectively; for Unicode code points, the standard notation would be U+0061 and U+03B1).

At one time, Unicode defined its own character set, more or less as a competitor to ISO 10646. That was a 16-bit character set, but it was not UTF-16; it was known as UCS-2. It included a rather controversial technique to try to keep the number of necessary characters to a minimum (Han Unification -- basically treating Chinese, Japanese and Korean characters that were quite a bit alike as being the same character).

Since then, the Unicode consortium has tacitly admitted that that wasn't going to work, and now concentrate primarily on ways to encode the ISO 10646 character set. The primary methods are UTF-8, UTF-16 and UCS-4 (aka UTF-32). Those (except for UTF-8) also have LE (little endian) and BE (big-endian) variants.

By itself, "Unicode" could refer to almost any of the above (though we can probably eliminate the others that it shows explicitly, such as UTF-8). Unqualified use of "Unicode" probably happens the most often on Windows, where it will almost certainly refer to UTF-16. Early versions of Windows NT adopted Unicode when UCS-2 was current. After UCS-2 was declared obsolete (around Win2k, if memory serves), they switched to UTF-16, which is the most similar to UCS-2 (in fact, it's identical for characters in the "basic multilingual plane", which covers a lot, including all the characters for most Western European languages).


The development of Unicode was aimed at creating a new standard for mapping the characters in a great majority of languages that are being used today, along with other characters that are not that essential but might be necessary for creating the text. UTF-8 is only one of the many ways that you can encode the files because there are many ways you can encode the characters inside a file into Unicode.

Source:

http://www.differencebetween.net/technology/difference-between-unicode-and-utf-8/


It's not that simple.

UTF-16 is a 16-bit, variable-width encoding. Simply calling something "Unicode" is ambiguous, since "Unicode" refers to an entire set of standards for character encoding. Unicode is not an encoding!

http://en.wikipedia.org/wiki/Unicode#Unicode_Transformation_Format_and_Universal_Character_Set

and of course, the obligatory Joel On Software - The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) link.


It's weird. Unicode is a standard, not an encoding. As it is possible to specify the endianness I guess it's effectively UTF-16 or maybe 32.

Where does this menu provide from?


As Rasmus states in his article "The difference between UTF-8 and Unicode?":

If asked the question, "What is the difference between UTF-8 and Unicode?", would you confidently reply with a short and precise answer? In these days of internationalization all developers should be able to do that. I suspect many of us do not understand these concepts as well as we should. If you feel you belong to this group, you should read this ultra short introduction to character sets and encodings.

Actually, comparing UTF-8 and Unicode is like comparing apples and oranges:

UTF-8 is an encoding - Unicode is a character set

A character set is a list of characters with unique numbers (these numbers are sometimes referred to as "code points"). For example, in the Unicode character set, the number for A is 41.

An encoding on the other hand, is an algorithm that translates a list of numbers to binary so it can be stored on disk. For example UTF-8 would translate the number sequence 1, 2, 3, 4 like this:

00000001 00000010 00000011 00000100 

Our data is now translated into binary and can now be saved to disk.

All together now

Say an application reads the following from the disk:

1101000 1100101 1101100 1101100 1101111 

The app knows this data represent a Unicode string encoded with UTF-8 and must show this as text to the user. First step, is to convert the binary data to numbers. The app uses the UTF-8 algorithm to decode the data. In this case, the decoder returns this:

104 101 108 108 111 

Since the app knows this is a Unicode string, it can assume each number represents a character. We use the Unicode character set to translate each number to a corresponding character. The resulting string is "hello".

Conclusion

So when somebody asks you "What is the difference between UTF-8 and Unicode?", you can now confidently answer short and precise:

UTF-8 (Unicode Transformation Format) and Unicode cannot be compared. UTF-8 is an encoding used to translate numbers into binary data. Unicode is a character set used to translate characters into numbers.


Let's start from keeping in mind that data is stored as bytes; Unicode is a character set where characters are mapped to code points (unique integers), and we need something to translate these code points data into bytes. That's where UTF-8 comes in so called encoding – simple!


UTF-16 and UTF-8 are both encodings of Unicode. They are both Unicode; one is not more Unicode than the other.

Don't let an unfortunate historical artifact from Microsoft confuse you.


In addition to Trufa's comment, Unicode explicitly isn't UTF-16. When they were first looking into Unicode, it was speculated that a 16-bit integer might be enough to store any code, but in practice that turned out not to be the case. However, UTF-16 is another valid encoding of Unicode - alongside the 8-bit and 32-bit variants - and I believe is the encoding that Microsoft use in memory at runtime on the NT-derived operating systems.