What is the difference between UTF-8 and Unicode?

Question

I have heard conflicting opinions from people, and according to the Wikipedia UTF-8 page they differ. They are the same thing, aren't they? Can someone clarify?

User · Answer

This article explains all the details: http://kunststube.net/encoding/

WRITING TO BUFFER

If you write the symbol あ to a 4-byte buffer with UTF-8 encoding, your binary will look like this:

00000000 11100011 10000001 10000010

If you write the symbol あ to a 4-byte buffer with UTF-16 encoding, your binary will look like this:

00000000 00000000 00110000 01000010

As you can see, depending on the language of your content, the choice of encoding will affect your memory use accordingly.

For example, for this particular symbol (あ), UTF-16 encoding is more efficient, since we have 2 spare bytes to use for the next symbol. But that doesn't mean you must use UTF-16 for the Japanese alphabet.

READING FROM BUFFER

Now if you want to read the above bytes, you have to know which encoding they were written in and decode them with that same encoding.

For example, if you decode 00000000 11100011 10000001 10000010 as UTF-16, you will end up with different characters, not あ.

Note: an encoding and Unicode are two different things. Unicode is the big table in which each symbol is mapped to a unique code point, e.g. the symbol (letter) あ has the code point 30 42 (hex). An encoding, on the other hand, is an algorithm that converts code points into a form more appropriate for storing on hardware.

30 42 (hex) -> UTF-8 encoding -> E3 81 82 (hex), which is the result shown above in binary.

30 42 (hex) -> UTF-16 encoding -> 30 42 (hex), which is the result shown above in binary.
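The buffer example above can be checked with a few lines of Python. This is only a sketch: the variable names are made up for illustration, and big-endian UTF-16 without a byte order mark is assumed so that the bytes match the ones shown in the answer.

```python
# Illustrative sketch; variable names are assumptions, not from the answer.
ch = "\u3042"  # the symbol あ, code point 30 42 (hex)

utf8_bytes = ch.encode("utf-8")
utf16_bytes = ch.encode("utf-16-be")  # big-endian, no byte order mark

print(utf8_bytes.hex(" "))   # e3 81 82
print(utf16_bytes.hex(" "))  # 30 42

# The 4-byte buffer from the UTF-8 example, read back as UTF-16 by mistake:
buffer = bytes([0x00, 0xE3, 0x81, 0x82])
wrong = buffer.decode("utf-16-be")
print(len(wrong), wrong == ch)  # 2 False
```

Reading the buffer with the wrong encoding does not raise an error here; it silently produces two unrelated characters, which is why the encoding has to be known in advance.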

User · Answer

The existing answers already explain a lot of details, but here's a very short answer with the most direct explanation and example.

Unicode is the standard that maps characters to codepoints. Each character has a unique codepoint (identification number), which is a number like 9731.

UTF-8 is an encoding of the codepoints. In order to store all characters on disk (in a file), UTF-8 splits characters into up to 4 octets (8-bit sequences), i.e. bytes. UTF-8 is one of several encodings (methods of representing data).

For example, in Unicode, the (decimal) codepoint 9731 represents a snowman (☃), which consists of 3 bytes in UTF-8: E2 98 83.
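A minimal Python sketch of the snowman example follows. The extra code points in the loop (U+0041, U+00E9, U+1F600) are my own picks to show 1-, 2- and 4-byte cases; they are not from the answer.

```python
# Code point 9731 (decimal) is U+2603, the snowman.
snowman = chr(9731)
code_point = ord(snowman)
encoded = snowman.encode("utf-8")

print(f"U+{code_point:04X}")      # U+2603
print(encoded.hex(" ").upper())   # E2 98 83

# A few more code points (chosen for illustration) and their UTF-8 lengths.
sizes = {cp: len(chr(cp).encode("utf-8")) for cp in (0x41, 0xE9, 0x2603, 0x1F600)}
for cp, size in sorted(sizes.items()):
    print(f"U+{cp:04X} -> {size} byte(s)")
```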

User · Answer

"Unicode" is unfortunately used in various different ways, depending on the context. Its most correct use (IMO) is as a coded character set, i.e. a set of characters and a mapping between the characters and integer code points representing them.

UTF-8 is a character encoding: a way of converting from sequences of bytes to sequences of characters and vice versa. It covers the whole of the Unicode character set. ASCII is encoded as a single byte per character, and other characters take more bytes depending on their exact code point (up to 4 bytes for all currently defined code points, i.e. up to U-0010FFFF, and indeed 4 bytes could cope with up to U-001FFFFF).

When "Unicode" is used as the name of a character encoding (e.g. as the .NET Encoding.Unicode property) it usually means UTF-16, which encodes most common characters as two bytes. Some platforms (notably .NET and Java) use UTF-16 as their "native" character encoding. This leads to hairy problems if you need to worry about characters which can't be encoded in a single UTF-16 value (they're encoded as "surrogate pairs") - but most developers never worry about this, IME.

Some references on Unicode:

The Unicode consortium web site, and in particular the tutorials section
Joel's article
My own article (.NET-oriented)
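To make the surrogate pair remark concrete, here is a small Python sketch. The helper `utf16_units` and the sample character U+1F600 are assumptions chosen for illustration; the sketch just splits the big-endian UTF-16 bytes into 16-bit values.

```python
def utf16_units(text: str) -> list[int]:
    """Return the 16-bit code units of text in UTF-16 (hypothetical helper)."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

bmp_char = "A"               # U+0041 fits in a single UTF-16 value
astral_char = "\U0001F600"   # U+1F600 does not fit in 16 bits

print([hex(u) for u in utf16_units(bmp_char)])     # ['0x41']
print([hex(u) for u in utf16_units(astral_char)])  # ['0xd83d', '0xde00']
```

Code that treats every 16-bit value as one character would count the second string as two characters, which is the kind of problem the answer alludes to.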

User · Answer

UTF-8 is one possible encoding scheme for Unicode text.

Unicode is a broad-scoped standard which defines over 140,000 characters and allocates each a numerical code (a code point). It also defines rules for how to sort this text, normalise it, change its case, and more. A character in Unicode is represented by a code point from zero up to 0x10FFFF inclusive, though some code points are reserved and cannot be used for characters.

There is more than one way that a string of Unicode code points can be encoded into a binary stream. These are called "encodings". The most straightforward encoding is UTF-32, which simply stores each code point as a 32-bit integer, with each being 4 bytes wide.

UTF-8 is another encoding, and is becoming the de-facto standard, due to a number of advantages over UTF-32 and others. UTF-8 encodes each code point as a sequence of either 1, 2, 3 or 4 byte values. Code points in the ASCII range are encoded as a single byte value, to be compatible with ASCII. Code points outside this range use either 2, 3, or 4 bytes each, depending on what range they are in.

UTF-8 has been designed with these properties in mind:

ASCII compatibility: ASCII characters are encoded exactly as they are in ASCII, such that an ASCII string is also a valid UTF-8 string representing the same characters.

Binary sorting: Sorting UTF-8 strings using a binary sort will still result in all code points being sorted in numerical order.

No ASCII bytes inside multi-byte sequences: When a code point uses multiple bytes, none of those bytes contain values in the ASCII range, ensuring that no part of them could be mistaken for an ASCII character. This is also a security feature.

Validation: UTF-8 can be easily validated, and distinguished from other character encodings by a validator. Text in other 8-bit or multi-byte encodings will very rarely also validate as UTF-8 due to the very specific structure of UTF-8.

Random access: At any point in a UTF-8 string it is possible to tell if the byte at that position is the first byte of a character or not, and to find the start of the next or current character, without needing to scan forwards or backwards more than 3 bytes or to know how far into the string we started reading from.
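The random-access property can be sketched in Python as follows. The function name `char_start` and the sample string are hypothetical, and the sketch assumes the input is already valid UTF-8; it relies only on the fact that continuation bytes have the bit pattern 10xxxxxx.

```python
def char_start(data: bytes, pos: int) -> int:
    """Index of the first byte of the character that contains data[pos].

    Assumes data is valid UTF-8 (hypothetical helper, no validation).
    """
    # Continuation bytes look like 10xxxxxx; step back until we leave them.
    while pos > 0 and (data[pos] & 0b1100_0000) == 0b1000_0000:
        pos -= 1
    return pos

data = "a\u20acb".encode("utf-8")  # bytes: 61 e2 82 ac 62
print(char_start(data, 3))  # 1: byte 3 is in the middle of the 3-byte character
print(char_start(data, 4))  # 4: byte 4 is the single-byte character 'b'

# ASCII compatibility: an ASCII string has the same bytes in UTF-8.
print("hello".encode("ascii") == "hello".encode("utf-8"))  # True
```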

User · Answer

I have checked the links in Gumbo's answer, and I wanted to paste some part of those things here to exist on Stack Overflow as well.

"...Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct. It is the single most common myth about Unicode, so if you thought that, don't feel bad.

In fact, Unicode has a different way of thinking about characters, and you have to understand the Unicode way of thinking of things or nothing will make sense.

Until now, we've assumed that a letter maps to some bits which you can store on disk or in memory:

A -> 0100 0001

In Unicode, a letter maps to something called a code point which is still just a theoretical concept. How that code point is represented in memory or on disk is a whole other story..."

"...Every platonic letter in every alphabet is assigned a magic number by the Unicode consortium which is written like this: U+0639. This magic number is called a code point. The U+ means "Unicode" and the numbers are hexadecimal. U+0639 is the Arabic letter Ain. The English letter A would be U+0041..."

"...OK, so say we have a string:

Hello

which, in Unicode, corresponds to these five code points:

U+0048 U+0065 U+006C U+006C U+006F

Just a bunch of code points. Numbers, really. We haven't yet said anything about how to store this in memory or represent it in an email message..."

"...That's where encodings come in.

The earliest idea for Unicode encoding, which led to the myth about the two bytes, was, hey, let's just store those numbers in two bytes each. So Hello becomes

00 48 00 65 00 6C 00 6C 00 6F

Right? Not so fast! Couldn't it also be:

48 00 65 00 6C 00 6C 00 6F 00 ?..."
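The two byte orders in the quoted passage can be reproduced in Python. This is just an illustrative sketch using the standard `utf-16-be` and `utf-16-le` codecs; the variable names are assumptions.

```python
text = "Hello"

# The five code points from the quoted passage.
code_points = [f"U+{ord(c):04X}" for c in text]
print(" ".join(code_points))  # U+0048 U+0065 U+006C U+006C U+006F

# Two bytes per code point, high byte first (big-endian)...
big = text.encode("utf-16-be")
print(big.hex(" "))     # 00 48 00 65 00 6c 00 6c 00 6f

# ...or low byte first (little-endian).
little = text.encode("utf-16-le")
print(little.hex(" "))  # 48 00 65 00 6c 00 6c 00 6f 00
```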

User · Answer

1. Unicode

There are lots of characters around the world, like "&", "h", "a", "t", "1" and so on.

Then there comes an organization that is dedicated to these characters. They made a standard called "Unicode".

The standard works as follows: create a table in which each position is called a "code point" or "code position". The positions run from U+0000 to U+10FFFF. Up until now, some positions are filled with characters and other positions are reserved or empty. For example, the position "U+0024" is filled with the character "$".

PS: Of course there's another organization called ISO maintaining another standard, ISO 10646, which is nearly the same.

2. UTF-8

As above, U+0024 is just a position, so we can't save "U+0024" in a computer for the character "$". There must be an encoding method.

Then there come encoding methods, such as UTF-8, UTF-16, UTF-32, UCS-2 and so on.

Under UTF-8, the code point "U+0024" is encoded into 00100100. 00100100 is the value we save in the computer for "$".

User · Answer

Let me use an example to illustrate this topic:

A Chinese character:      汉
its Unicode value:        U+6C49
convert 6C49 to binary:   01101100 01001001

Nothing magical so far, it's very simple. Now, let's say we decide to store this character on our hard drive. To do that, we need to store the character in binary format. We can simply store it as is: "01101100 01001001". Done!

But wait a minute, is "01101100 01001001" one character or two characters? You knew this is one character because I told you, but when a computer reads it, it has no idea. So we need some sort of "encoding" to tell the computer to treat it as one.

This is where the rules of UTF-8 come in: http://www.fileformat.info/info/unicode/utf8.htm

Binary format of bytes in sequence

1st Byte    2nd Byte    3rd Byte    4th Byte    Number of Free Bits   Maximum Expressible Unicode Value
0xxxxxxx                                                7             007F hex (127)
110xxxxx    10xxxxxx                                (5+6)=11          07FF hex (2047)
1110xxxx    10xxxxxx    10xxxxxx                  (4+6+6)=16          FFFF hex (65535)
11110xxx    10xxxxxx    10xxxxxx    10xxxxxx    (3+6+6+6)=21          10FFFF hex (1,114,111)

According to the table above, if we want to store this character using the UTF-8 format, we need to prefix our character with some "headers". Our Chinese character is 16 bits long (count the binary value yourself), so we will use the format on row 3, as it provides enough space:

Header  Place holder    Fill in our Binary   Result
1110    xxxx            0110                 11100110
10      xxxxxx          110001               10110001
10      xxxxxx          001001               10001001

Writing out the result in one line:

11100110 10110001 10001001

This is the UTF-8 (binary) value of the Chinese character. (Confirm it yourself: http://www.fileformat.info/info/unicode/char/6c49/index.htm)

Summary

A Chinese character:      汉
its Unicode value:        U+6C49
convert 6C49 to binary:   01101100 01001001
embed 6C49 as UTF-8:      11100110 10110001 10001001
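The row-3 procedure above can be written out as a short Python sketch. The function name `encode_3byte` is hypothetical; it only handles code points that need the three-byte form, and the result is compared against Python's built-in UTF-8 encoder as a sanity check.

```python
def encode_3byte(code_point: int) -> bytes:
    """Encode a code point that needs three UTF-8 bytes (hypothetical helper)."""
    if not 0x0800 <= code_point <= 0xFFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("code point does not use the 3-byte form")
    b1 = 0b1110_0000 | (code_point >> 12)            # header 1110 + top 4 bits
    b2 = 0b1000_0000 | ((code_point >> 6) & 0x3F)    # header 10 + middle 6 bits
    b3 = 0b1000_0000 | (code_point & 0x3F)           # header 10 + low 6 bits
    return bytes([b1, b2, b3])

result = encode_3byte(0x6C49)
print(" ".join(f"{b:08b}" for b in result))  # 11100110 10110001 10001001
print(result == "\u6c49".encode("utf-8"))    # True
```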

User · Answer

To expand on the answers others have given:

We've got lots of languages with lots of characters that computers should ideally display. Unicode assigns each character a unique number, or code point.

Computers deal with such numbers as bytes. Skipping a bit of history here and ignoring memory addressing issues, 8-bit computers would treat an 8-bit byte as the largest numerical unit easily represented on the hardware, 16-bit computers would expand that to two bytes, and so forth.

Old character encodings such as ASCII are from the (pre-) 8-bit era, and try to cram the dominant language in computing at the time, i.e. English, into numbers ranging from 0 to 127 (7 bits). With 26 letters in the alphabet, both in capital and non-capital form, numbers and punctuation signs, that worked pretty well. ASCII got extended by an 8th bit for other, non-English languages, but the additional 128 numbers/code points made available by this expansion would be mapped to different characters depending on the language being displayed. The ISO-8859 standards are the most common forms of this mapping: ISO-8859-1 and ISO-8859-15 (also known as ISO-Latin-1, latin1, and yes, there are two different versions of the 8859 ISO standard as well).

But that's not enough when you want to represent characters from more than one language, so cramming all available characters into a single byte just won't work.

There are essentially two different types of encodings. One expands the value range by adding more bits. Examples of these encodings would be UCS2 (2 bytes = 16 bits) and UCS4 (4 bytes = 32 bits). They suffer from inherently the same problem as the ASCII and ISO-8859 standards, as their value range is still limited, even if the limit is vastly higher.

The other type of encoding uses a variable number of bytes per character, and the most commonly known encodings for this are the UTF encodings. All UTF encodings work in roughly the same manner: you choose a unit size, which for UTF-8 is 8 bits, for UTF-16 is 16 bits, and for UTF-32 is 32 bits. The standard then defines a few of these bits as flags: if they're set, then the next unit in a sequence of units is to be considered part of the same character. If they're not set, this unit represents one character fully. Thus the most common (English) characters only occupy one byte in UTF-8 (two in UTF-16, 4 in UTF-32), but characters from other languages can occupy more: up to four bytes in UTF-8 as currently defined (the original design allowed sequences of up to six bytes).

Multi-byte encodings (I should say multi-unit after the above explanation) have the advantage that they are relatively space-efficient, but the downside that operations such as finding substrings, comparisons, etc. all have to decode the characters to Unicode code points before such operations can be performed (there are some shortcuts, though).

Both the UCS standards and the UTF standards encode the code points as defined in Unicode. In theory, those encodings could be used to encode any number (within the range the encoding supports), but of course these encodings were made to encode Unicode code points. And that's your relationship between them.

Windows handles so-called "Unicode" strings as UTF-16 strings, while most UNIXes default to UTF-8 these days. Communications protocols such as HTTP tend to work best with UTF-8, as the unit size in UTF-8 is the same as in ASCII, and most such protocols were designed in the ASCII era. On the other hand, UTF-16 arguably gives the best average space/processing trade-off when representing all living languages.

The Unicode standard defines fewer code points than can be represented in 32 bits. Thus, for all practical purposes, UTF-32 and UCS4 became the same encoding, as you're unlikely to have to deal with multi-unit characters in UTF-32.

Hope that fills in some details.
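To see the unit sizes in practice, the following Python sketch counts the bytes each UTF encoding needs for a few characters. The helper name `byte_lengths` and the sample characters are assumptions picked for illustration; little-endian codecs are used so that no byte order mark is included in the counts.

```python
def byte_lengths(ch: str) -> dict[str, int]:
    """Bytes needed for one character in each encoding (hypothetical helper)."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

# Sample characters chosen for illustration: U+0041, U+00E9, U+6C49, U+1F600.
for ch in ("A", "\u00e9", "\u6c49", "\U0001F600"):
    print(f"U+{ord(ch):04X}", byte_lengths(ch))
```

For these samples, UTF-8 ranges from 1 to 4 bytes, UTF-16 uses 2 or 4, and UTF-32 always uses 4.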

User · Answer

Unicode only defines code points, that is, numbers which each represent a character. How you store these code points in memory depends on the encoding that you are using. UTF-8 is one way of encoding Unicode characters, among many others.

User · Answer

"They are the same thing, aren't they?"

No, they aren't.

I think the first sentence of the Wikipedia page you referenced gives a nice, brief summary:

"UTF-8 is a variable width character encoding capable of encoding all 1,112,064 valid code points in Unicode using one to four 8-bit bytes."

To elaborate:

Unicode is a standard which defines a map from characters to numbers, the so-called code points, like in the example below. For the full mapping, see the Unicode character tables.

! -> U+0021 (21)
" -> U+0022 (22)
# -> U+0023 (23)

UTF-8 is one of the ways to encode these code points in a form a computer can understand, aka bits. In other words, it's an algorithm to convert each of those code points to a sequence of bits, or to convert a sequence of bits back to the equivalent code points. Note that there are a lot of alternative encodings for Unicode.

Joel gives a really nice explanation and an overview of the history in his article.

User · Answer

They're not the same thing: UTF-8 is a particular way of encoding Unicode.

There are lots of different encodings you can choose from, depending on your application and the data you intend to use. The most common are UTF-8, UTF-16 and UTF-32, as far as I know.

User · Answer

Unicode is a standard that defines, along with ISO/IEC 10646, the Universal Character Set (UCS), which is a superset of all existing characters required to represent practically all known languages.

Unicode assigns a Name and a Number (Character Code, or Code-Point) to each character in its repertoire.

UTF-8 encoding is a way to represent these characters digitally in computer memory. UTF-8 maps each code-point into a sequence of octets (8-bit bytes).

For example:

UCS Character = Unicode Han Character
UCS code-point = U+24B62
UTF-8 encoding = F0 A4 AD A2 (hex) = 11110000 10100100 10101101 10100010 (bin)
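Going the other way, the four octets above can be unpacked back into the code-point by hand. This Python sketch uses a hypothetical helper, `decode_4byte`, which assumes a well-formed four-byte sequence and does no validation; the built-in decoder is used as a cross-check.

```python
def decode_4byte(seq: bytes) -> int:
    """Recover the code point from a 4-byte UTF-8 sequence (hypothetical helper).

    Assumes the sequence is well formed: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
    """
    b1, b2, b3, b4 = seq
    return (
        ((b1 & 0b0000_0111) << 18)  # 3 payload bits from the lead byte
        | ((b2 & 0x3F) << 12)       # 6 payload bits from each continuation byte
        | ((b3 & 0x3F) << 6)
        | (b4 & 0x3F)
    )

octets = bytes.fromhex("F0 A4 AD A2")
cp = decode_4byte(octets)
print(f"U+{cp:X}")                         # U+24B62
print(octets.decode("utf-8") == chr(cp))   # True
```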

User · Answer

If I may summarise what I gathered from this thread:

Unicode "translates" characters to ordinal numbers (in decimal form):

à -> 224

UTF-8 is an encoding that "translates" these ordinal numbers (in decimal form) to binary representations:

224 -> 11000011 10100000

Note that we're talking about the UTF-8 binary representation of code point 224, not the plain binary form of the number 224, which is 0b11100000.

User · Answer

Unicode is just a standard that defines a character set (UCS) and encodings (UTF) to encode this character set. But in general, "Unicode" is used to refer to the character set and not the standard.

Read "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" and "Unicode In 5 Minutes".

User · Answer

UTF-8 is a method for encoding Unicode characters using 8-bit sequences. Unicode is a standard for representing a great variety of characters from many languages.
