[unicode] What is the difference between UTF-8 and Unicode?

I have heard conflicting opinions from people - according to the Wikipedia UTF-8 page.

They are the same thing, aren't they? Can someone clarify?


The answer is


Unicode is just a standard that defines a character set (UCS) and encodings (UTF) to encode this character set. But in general, "Unicode" refers to the character set and not the standard.

Read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) and Unicode In 5 Minutes.


I have checked the links in Gumbo's answer, and I wanted to paste some part of those things here to exist on Stack Overflow as well.

"...Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct. It is the single most common myth about Unicode, so if you thought that, don't feel bad.

In fact, Unicode has a different way of thinking about characters, and you have to understand the Unicode way of thinking of things or nothing will make sense.

Until now, we've assumed that a letter maps to some bits which you can store on disk or in memory:

A -> 0100 0001

In Unicode, a letter maps to something called a code point which is still just a theoretical concept. How that code point is represented in memory or on disk is a whole other story..."

"...Every platonic letter in every alphabet is assigned a magic number by the Unicode consortium which is written like this: U+0639. This magic number is called a code point. The U+ means "Unicode" and the numbers are hexadecimal. U+0639 is the Arabic letter Ain. The English letter A would be U+0041...."

"...OK, so say we have a string:

Hello

which, in Unicode, corresponds to these five code points:

U+0048 U+0065 U+006C U+006C U+006F.

Just a bunch of code points. Numbers, really. We haven't yet said anything about how to store this in memory or represent it in an email message..."

"...That's where encodings come in.

The earliest idea for Unicode encoding, which led to the myth about the two bytes, was, hey, let's just store those numbers in two bytes each. So Hello becomes

00 48 00 65 00 6C 00 6C 00 6F

Right? Not so fast! Couldn't it also be:

48 00 65 00 6C 00 6C 00 6F 00 ? ..."
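The two byte orders in the quote above correspond to what Python calls the `utf-16-be` and `utf-16-le` codecs. A minimal sketch (the variable names are only for illustration):

```python
text = "Hello"

# Code points: plain numbers, independent of any storage format.
code_points = [f"U+{ord(ch):04X}" for ch in text]
print(code_points)  # ['U+0048', 'U+0065', 'U+006C', 'U+006C', 'U+006F']

# Two bytes per code point, most significant byte first (big-endian).
big_endian = text.encode("utf-16-be")
print(big_endian.hex(" "))     # 00 48 00 65 00 6c 00 6c 00 6f

# Same code points, least significant byte first (little-endian).
little_endian = text.encode("utf-16-le")
print(little_endian.hex(" "))  # 48 00 65 00 6c 00 6c 00 6f 00
```

The code points are identical in both cases; only the byte layout differs.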


UTF-8 is a method for encoding Unicode characters using 8-bit sequences.

Unicode is a standard for representing a great variety of characters from many languages.


UTF-8 is one possible encoding scheme for Unicode text.

Unicode is a broad-scoped standard which defines over 140,000 characters and allocates each a numerical code (a code point). It also defines rules for how to sort this text, normalise it, change its case, and more. A character in Unicode is represented by a code point from zero up to 0x10FFFF inclusive, though some code points are reserved and cannot be used for characters.

There is more than one way that a string of Unicode code points can be encoded into a binary stream. These are called "encodings". The most straightforward encoding is UTF-32, which simply stores each code point as a 32-bit integer, with each being 4 bytes wide.

UTF-8 is another encoding, and is becoming the de-facto standard, due to a number of advantages over UTF-32 and others. UTF-8 encodes each code point as a sequence of either 1, 2, 3 or 4 byte values. Code points in the ASCII range are encoded as a single byte value, to be compatible with ASCII. Code points outside this range use either 2, 3, or 4 bytes each, depending on what range they are in.
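As a rough illustration of the variable width described above, this Python sketch compares encoded sizes for four sample characters. The samples are my own choice, one from each UTF-8 length class:

```python
# U+0041 (A), U+00E9 (e with acute), U+6C49 (a Han character), U+24B62 (a rarer Han character)
samples = ["\u0041", "\u00e9", "\u6c49", "\U00024b62"]

for ch in samples:
    utf8 = ch.encode("utf-8")
    utf32 = ch.encode("utf-32-be")  # the "-be" variant omits the byte order mark
    print(f"U+{ord(ch):04X}: UTF-8 = {len(utf8)} byte(s), UTF-32 = {len(utf32)} bytes")

utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
utf32_lengths = [len(ch.encode("utf-32-be")) for ch in samples]
print(utf8_lengths)   # [1, 2, 3, 4]
print(utf32_lengths)  # [4, 4, 4, 4]
```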

UTF-8 has been designed with these properties in mind:

  • ASCII characters are encoded exactly as they are in ASCII, such that an ASCII string is also a valid UTF-8 string representing the same characters.

  • Binary sorting: Sorting UTF-8 strings using a binary sort will still result in all code points being sorted in numerical order.

  • When a code point uses multiple bytes, none of those bytes contain values in the ASCII range, ensuring that no part of them could be mistaken for an ASCII character. This is also a security feature.

  • UTF-8 can be easily validated, and distinguished from other character encodings by a validator. Text in other 8-bit or multi-byte encodings will very rarely also validate as UTF-8 due to the very specific structure of UTF-8.

  • Random access: At any point in a UTF-8 string it is possible to tell if the byte at that position is the first byte of a character or not, and to find the start of the next or current character, without needing to scan forwards or backwards more than 3 bytes or to know how far into the string we started reading from.
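Several of the properties listed above can be checked directly. The following Python sketch is only a spot check on a handful of sample strings, not a proof, and `char_start` is a hypothetical helper written for this example:

```python
# ASCII text is byte-for-byte identical in ASCII and UTF-8.
assert "Hello".encode("ascii") == "Hello".encode("utf-8")

# Sorting by encoded bytes matches sorting by code point.
words = ["z", "\u00e9", "\u6c49", "\U00024b62", "a"]
by_code_point = sorted(words)  # Python compares str by code point
by_bytes = sorted(words, key=lambda w: w.encode("utf-8"))
same_order = by_code_point == by_bytes

# Every byte of a multi-byte sequence is outside the ASCII range.
all_non_ascii = all(b >= 0x80 for b in "\u6c49".encode("utf-8"))

# Text in another 8-bit encoding usually fails UTF-8 validation.
try:
    b"\xe9t\xe9".decode("utf-8")  # Latin-1 bytes, not valid UTF-8
    rejected = False
except UnicodeDecodeError:
    rejected = True

def char_start(data: bytes, i: int) -> int:
    """Return the index of the first byte of the character containing data[i]."""
    # Continuation bytes always look like 10xxxxxx, so step back over them.
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i

data = "a\u6c49b".encode("utf-8")  # bytes: 61 E6 B1 89 62
starts = [char_start(data, i) for i in range(len(data))]
print(starts)  # [0, 1, 1, 1, 4]
```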


The existing answers already explain a lot of details, but here's a very short answer with the most direct explanation and example.

Unicode is the standard that maps characters to codepoints.
Each character has a unique codepoint (identification number), which is a number like 9731.

UTF-8 is an encoding of the codepoints.
In order to store all characters on disk (in a file), UTF-8 splits characters into up to 4 octets (8-bit sequences) - bytes. UTF-8 is one of several encodings (methods of representing data). For example, in Unicode, the (decimal) codepoint 9731 represents a snowman (☃), which consists of 3 bytes in UTF-8: E2 98 83
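The snowman example can be reproduced in Python with the built-in `chr`, `ord` and `str.encode`; a small sketch:

```python
snowman = chr(9731)               # decimal 9731 is code point U+2603
print(hex(ord(snowman)))          # 0x2603

encoded = snowman.encode("utf-8")
print(encoded.hex(" ").upper())   # E2 98 83

decoded = encoded.decode("utf-8")  # decoding gives back the same code point
print(ord(decoded))               # 9731
```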

Here's a sorted list with some random examples.


"Unicode" is unfortunately used in various different ways, depending on the context. Its most correct use (IMO) is as a coded character set - i.e. a set of characters and a mapping between the characters and integer code points representing them.

UTF-8 is a character encoding - a way of converting from sequences of bytes to sequences of characters and vice versa. It covers the whole of the Unicode character set. ASCII is encoded as a single byte per character, and other characters take more bytes depending on their exact code point (up to 4 bytes for all currently defined code points, i.e. up to U-0010FFFF, and indeed 4 bytes could cope with up to U-001FFFFF).

When "Unicode" is used as the name of a character encoding (e.g. as the .NET Encoding.Unicode property) it usually means UTF-16, which encodes most common characters as two bytes. Some platforms (notably .NET and Java) use UTF-16 as their "native" character encoding. This leads to hairy problems if you need to worry about characters which can't be encoded in a single UTF-16 value (they're encoded as "surrogate pairs") - but most developers never worry about this, IME.
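To make the surrogate pair remark concrete, here is a Python sketch that encodes one character outside the Basic Multilingual Plane as UTF-16 and recomputes the two 16-bit values by hand. The sample character U+24B62 is my own choice. Note that Python's `str` counts code points, so the .NET/Java behaviour is only visible here in the encoded bytes:

```python
ch = "\U00024b62"  # a Han character outside the Basic Multilingual Plane

utf16 = ch.encode("utf-16-be")
print(utf16.hex(" "))          # d8 52 df 62
code_units = len(utf16) // 2   # two 16-bit values: a surrogate pair
print(code_units)              # 2

# The same pair computed by hand from the code point.
offset = ord(ch) - 0x10000
high = 0xD800 + (offset >> 10)    # high (lead) surrogate
low = 0xDC00 + (offset & 0x3FF)   # low (trail) surrogate
print(f"{high:04X} {low:04X}")    # D852 DF62

# A common character such as "A" needs only one 16-bit value.
bmp_units = len("A".encode("utf-16-be")) // 2
print(bmp_units)                  # 1
```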



Let me use an example to illustrate this topic:

A Chinese character:      汉
its Unicode value:        U+6C49
convert 6C49 to binary:   01101100 01001001

Nothing magical so far, it's very simple. Now, let's say we decide to store this character on our hard drive. To do that, we need to store the character in binary format. We can simply store it as is '01101100 01001001'. Done!

But wait a minute, is '01101100 01001001' one character or two characters? You knew this is one character because I told you, but when a computer reads it, it has no idea. So we need some sort of "encoding" to tell the computer to treat it as one.

This is where the rules of 'UTF-8' come in: http://www.fileformat.info/info/unicode/utf8.htm

Binary format of bytes in sequence

1st Byte    2nd Byte    3rd Byte    4th Byte    Number of Free Bits   Maximum Expressible Unicode Value
0xxxxxxx                                                7             007F hex (127)
110xxxxx    10xxxxxx                                (5+6)=11          07FF hex (2047)
1110xxxx    10xxxxxx    10xxxxxx                  (4+6+6)=16          FFFF hex (65535)
11110xxx    10xxxxxx    10xxxxxx    10xxxxxx    (3+6+6+6)=21          10FFFF hex (1,114,111)

According to the table above, if we want to store this character using the 'UTF-8' format, we need to prefix our character with some 'headers'. Our Chinese character is 16 bits long (count the binary value yourself), so we will use the format on row 3 as it provides enough space:

Header  Place holder    Fill in our Binary   Result         
1110    xxxx            0110                 11100110
10      xxxxxx          110001               10110001
10      xxxxxx          001001               10001001

Writing out the result in one line:

11100110 10110001 10001001

This is the UTF-8 (binary) value of the Chinese character! (confirm it yourself: http://www.fileformat.info/info/unicode/char/6c49/index.htm)

Summary

A Chinese character:      汉
its Unicode value:        U+6C49
convert 6C49 to binary:   01101100 01001001
embed 6C49 as UTF-8:      11100110 10110001 10001001
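The header-and-fill procedure from the table can be sketched as a small Python function. `utf8_encode` is a hypothetical name for this illustration, not a library function; in practice you would simply call `str.encode("utf-8")`:

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by filling the bit templates from the table."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if code_point <= 0x7F:        # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:      # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

encoded = utf8_encode(0x6C49)
print(" ".join(f"{b:08b}" for b in encoded))  # 11100110 10110001 10001001
```

Comparing the result with Python's built-in encoder, for example `utf8_encode(0x6C49) == chr(0x6C49).encode("utf-8")`, is a quick way to check the sketch.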

P.S. If you want to learn this topic in python, click here


Unicode is a standard that defines, along with ISO/IEC 10646, the Universal Character Set (UCS), which is a superset of all existing characters required to represent practically all known languages.

Unicode assigns a Name and a Number (Character Code, or Code-Point) to each character in its repertoire.

UTF-8 encoding is a way to represent these characters digitally in computer memory. UTF-8 maps each code-point into a sequence of octets (8-bit bytes).

For example,

UCS Character = Unicode Han Character

UCS code-point = U+24B62

UTF-8 encoding = F0 A4 AD A2 (hex) = 11110000 10100100 10101101 10100010 (bin)
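Python's standard `unicodedata` module exposes the name assigned to each character, so the name, number and UTF-8 octets can be printed side by side. A sketch, where the sample characters other than U+24B62 are my own additions:

```python
import unicodedata

samples = ["$", "\u00e0", "\U00024b62"]

for ch in samples:
    name = unicodedata.name(ch)                   # the character's Unicode name
    octets = ch.encode("utf-8").hex(" ").upper()  # its UTF-8 octets in hex
    print(f"U+{ord(ch):04X}  {name}  ->  {octets}")

dollar_name = unicodedata.name("$")
han_octets = "\U00024b62".encode("utf-8").hex(" ").upper()
print(dollar_name)  # DOLLAR SIGN
print(han_octets)   # F0 A4 AD A2
```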


If I may summarise what I gathered from this thread:

Unicode 'translates' characters to ordinal numbers (in decimal form).

à -> 224

UTF-8 is an encoding that 'translates' these ordinal numbers (in decimal form) to binary representations.

224 -> 11000011 10100000

Note that this is the UTF-8 encoding of the number 224, not its plain binary value, which is 0b11100000.
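The difference between the plain binary value and the UTF-8 bytes can be seen by filling the two-byte template `110xxxxx 10xxxxxx` by hand. A minimal Python sketch with illustrative variable names:

```python
code_point = ord("\u00e0")            # the letter a with grave accent
print(code_point)                     # 224

plain_binary = f"{code_point:08b}"
print(plain_binary)                   # 11100000

# Spread the bits of 224 over the template 110xxxxx 10xxxxxx.
first = 0b11000000 | (code_point >> 6)         # top bits go in the first byte
second = 0b10000000 | (code_point & 0b111111)  # low 6 bits go in the second
utf8_binary = f"{first:08b} {second:08b}"
print(utf8_binary)                    # 11000011 10100000

matches_builtin = bytes([first, second]) == "\u00e0".encode("utf-8")
```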


They're not the same thing - UTF-8 is a particular way of encoding Unicode.

There are lots of different encodings you can choose from depending on your application and the data you intend to use. The most common are UTF-8, UTF-16 and UTF-32, as far as I know.


1. Unicode

There are lots of characters around the world, like "$,&,h,a,t,?,?,1,=,+...".

Then there is an organization dedicated to these characters.

They made a standard called "Unicode".

The standard works as follows:

  • Create a table in which each position is called a "code point" or "code position".
  • The positions range from U+0000 to U+10FFFF.
  • So far, some positions are filled with characters, and other positions are reserved or empty.
  • For example, the position "U+0024" is filled with the character "$".

PS: Of course there's another organization called ISO maintaining another standard -- "ISO 10646", which is nearly the same.

2. UTF-8

As above, U+0024 is just a position, so we can't save "U+0024" in a computer for the character "$".

There must be an encoding method.

So there are encoding methods, such as UTF-8, UTF-16, UTF-32, UCS-2....

Under UTF-8, the code point "U+0024" is encoded as 00100100.

00100100 is the value we save in the computer for "$".


This article explains all the details http://kunststube.net/encoding/

WRITING TO BUFFER

If you write the symbol あ to a 4 byte buffer with UTF8 encoding, your binary will look like this:

00000000 11100011 10000001 10000010

If you write the symbol あ to a 4 byte buffer with UTF16 encoding, your binary will look like this:

00000000 00000000 00110000 01000010

As you can see, depending on what language you use in your content, this will affect your memory accordingly.

e.g. For this particular symbol: あ UTF16 encoding is more efficient since we have 2 spare bytes to use for the next symbol. But it doesn't mean that you must use UTF16 for the Japanese alphabet.

READING FROM BUFFER

Now if you want to read the above bytes, you have to know in what encoding it was written to and decode it back correctly.

e.g. If you decode this: 00000000 11100011 10000001 10000010 as UTF16, you will end up with the wrong text, not あ

Note: Encoding and Unicode are two different things. Unicode is the big table with each symbol mapped to a unique code point, e.g. the あ symbol (letter) has the code point 30 42 (hex). An encoding, on the other hand, is an algorithm that converts code points into a byte sequence suitable for storage.

30 42 (hex) -> UTF8 encoding -> E3 81 82 (hex), which is the above result in binary.

30 42 (hex) -> UTF16 encoding -> 30 42 (hex), which is the above result in binary.
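The two conversions, and the effect of reading bytes back with the wrong encoding, can be tried in Python. This sketch assumes big-endian UTF-16 (`utf-16-be`) and treats the leading zero byte of the 4 byte buffer as part of the data; both are assumptions made for the illustration:

```python
ch = "\u3042"  # Hiragana letter A, code point 30 42 (hex)

utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-be")
print(utf8.hex(" ").upper())    # E3 81 82
print(utf16.hex(" ").upper())   # 30 42

# Reading the UTF-8 buffer as if it were UTF-16 yields unrelated code points.
buffer = b"\x00" + utf8         # 00 E3 81 82, as in the 4 byte buffer above
wrong = buffer.decode("utf-16-be")
wrong_points = [f"U+{ord(c):04X}" for c in wrong]
print(wrong_points)             # ['U+00E3', 'U+8182']

# Reading it with the encoding it was written in restores the character.
right = utf8.decode("utf-8")
print(right == ch)              # True
```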



Unicode only defines code points, that is, numbers which each represent a character. How you store these code points in memory depends on the encoding that you are using. UTF-8 is one way of encoding Unicode characters, among many others.


They are the same thing, aren't they?

No, they aren't.


I think the first sentence of the Wikipedia page you referenced gives a nice, brief summary:

UTF-8 is a variable width character encoding capable of encoding all 1,112,064 valid code points in Unicode using one to four 8-bit bytes.

To elaborate:

  • Unicode is a standard, which defines a map from characters to numbers, the so-called code points, (like in the example below). For the full mapping, you can have a look here.

    ! -> U+0021 (21),  
    " -> U+0022 (22),  
    # -> U+0023 (23)
    
  • UTF-8 is one of the ways to encode these code points in a form a computer can understand, aka bits. In other words, it's a way/algorithm to convert each of those code points to a sequence of bits or convert a sequence of bits to the equivalent code points. Note that there are a lot of alternative encodings for Unicode.


Joel gives a really nice explanation and an overview of the history here.

