What is Unicode, UTF-8, UTF-16?

Question

What's the basis for Unicode, and why the need for UTF-8 or UTF-16? I have researched this on Google and searched here as well, but it's not clear to me.

In VSS, when doing a file comparison, sometimes there is a message saying the two files have differing UTFs. Why would this be the case?

Please explain in simple terms.

User · Answer

Originally, Unicode was intended to have a fixed-width 16-bit encoding (UCS-2). Early adopters of Unicode, like Java and Windows NT, built their libraries around 16-bit strings.

Later, the scope of Unicode was expanded to include historical characters, which would require more than the 65,536 code points a 16-bit encoding would support. To allow the additional characters to be represented on platforms that had used UCS-2, the UTF-16 encoding was introduced. It uses "surrogate pairs" to represent characters in the supplementary planes.

Meanwhile, a lot of older software and network protocols were using 8-bit strings. UTF-8 was made so these systems could support Unicode without having to use wide characters. It's backwards-compatible with 7-bit ASCII.

User · Answer

This article explains all the details: http://kunststube.net/encoding/

WRITING TO BUFFER

If you write the symbol あ to a 4-byte buffer with UTF-8 encoding, your binary will look like this:

00000000 11100011 10000001 10000010

If you write the symbol あ to a 4-byte buffer with UTF-16 encoding, your binary will look like this:

00000000 00000000 00110000 01000010

As you can see, depending on what language you use in your content, this will affect your memory usage accordingly. For example, for this particular symbol (あ), UTF-16 encoding is more efficient, since we have 2 spare bytes to use for the next symbol. But that doesn't mean you must use UTF-16 for the Japanese alphabet.

READING FROM BUFFER

Now, if you want to read the above bytes, you have to know which encoding they were written in and decode them back correctly. For example, if you decode 00000000 11100011 10000001 10000010 as UTF-16, you will end up with the wrong characters (the code units 00E3 and 8182), not あ.

Note: Encoding and Unicode are two different things. Unicode is the big table with each symbol mapped to a unique code point; for example, the symbol (letter) あ has the code point 30 42 (hex). Encoding, on the other hand, is an algorithm that converts symbols to a more appropriate form for storing on hardware.

30 42 (hex) -> UTF-8 encoding -> E3 81 82 (hex), which is the above result in binary.
30 42 (hex) -> UTF-16 encoding -> 30 42 (hex), which is the above result in binary.
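The buffer example above can be reproduced with a short Python sketch. It assumes big-endian UTF-16 without a byte order mark, to match the bit patterns shown; the helper name `to_bits` is made up for illustration.

```python
# Encode the Hiragana letter A (U+3042) and inspect the raw bytes.
ch = "\u3042"

utf8_bytes = ch.encode("utf-8")        # three bytes: E3 81 82
utf16_bytes = ch.encode("utf-16-be")   # two bytes: 30 42 (big-endian, no BOM)


def to_bits(data: bytes) -> str:
    """Render bytes as space-separated 8-bit binary groups."""
    return " ".join(f"{b:08b}" for b in data)


print(to_bits(utf8_bytes))    # 11100011 10000001 10000010
print(to_bits(utf16_bytes))   # 00110000 01000010

# Reading the zero-padded UTF-8 buffer as UTF-16 raises no error here;
# it silently yields two unrelated characters, U+00E3 and U+8182.
padded = b"\x00" + utf8_bytes
wrong = padded.decode("utf-16-be")
right = utf8_bytes.decode("utf-8")
print([f"U+{ord(c):04X}" for c in wrong])
print(right == ch)
```

The wrong decode succeeds without any exception, which is why mismatched encodings tend to show up as garbled text rather than as errors.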

User · Answer

UTF stands for Unicode Transformation Format. Basically, in today's world there are scripts written in hundreds of other languages, in formats not covered by the basic ASCII used earlier. Hence, UTF came into existence.

UTF-8 has character encoding capabilities and its code unit is 8 bits, while for UTF-16 it is 16 bits.

User · Answer

Why do we need Unicode?

In the (not too) early days, all that existed was ASCII. This was okay, as all that would ever be needed were a few control characters, punctuation, numbers and letters like the ones in this sentence. Unfortunately, today's strange world of global intercommunication and social media was not foreseen, and it is not too unusual to see English alongside text in several other scripts in the same document (I hope I didn't break any old browsers).

But for argument's sake, let's say Joe Average is a software developer. He insists that he will only ever need English, and as such only wants to use ASCII. This might be fine for Joe the user, but this is not fine for Joe the software developer. Approximately half the world uses non-Latin characters and using ASCII is arguably inconsiderate to these people, and on top of that, he is closing off his software to a large and growing economy.

Therefore, an encompassing character set including all languages is needed. Thus came Unicode. It assigns every character a unique number called a code point. One advantage of Unicode over other possible sets is that the first 256 code points are identical to ISO-8859-1, and hence also ASCII. In addition, the vast majority of commonly used characters are representable by only two bytes, in a region called the Basic Multilingual Plane (BMP). Now a character encoding is needed to access this character set, and as the question asks, I will concentrate on UTF-8 and UTF-16.

Memory considerations

So how many bytes give access to what characters in these encodings?

UTF-8:
  1 byte: Standard ASCII
  2 bytes: Arabic, Hebrew, most European scripts (most notably excluding Georgian)
  3 bytes: BMP
  4 bytes: All Unicode characters

UTF-16:
  2 bytes: BMP
  4 bytes: All Unicode characters

It's worth mentioning now that characters not in the BMP include ancient scripts, mathematical symbols, musical symbols, and rarer Chinese/Japanese/Korean (CJK) characters.

If you'll be working mostly with ASCII characters, then UTF-8 is certainly more memory efficient. However, if you're working mostly with non-European scripts, using UTF-8 could be up to 1.5 times less memory efficient than UTF-16. When dealing with large amounts of text, such as large web pages or lengthy word documents, this could impact performance.

Encoding basics

Note: If you know how UTF-8 and UTF-16 are encoded, skip to the next section for practical applications.

UTF-8: For the standard ASCII (0-127) characters, the UTF-8 codes are identical. This makes UTF-8 ideal if backwards compatibility is required with existing ASCII text. Other characters require anywhere from 2 to 4 bytes. This is done by reserving some bits in each of these bytes to indicate that it is part of a multi-byte character. In particular, the first bit of each byte is 1 to avoid clashing with the ASCII characters.

UTF-16: For valid BMP characters, the UTF-16 representation is simply the code point. However, for non-BMP characters UTF-16 introduces surrogate pairs. In this case a combination of two two-byte portions maps to a non-BMP character. These two-byte portions come from the BMP numeric range, but are guaranteed by the Unicode standard to be invalid as BMP characters. In addition, since UTF-16 has two bytes as its basic unit, it is affected by endianness. To compensate, a reserved byte order mark can be placed at the beginning of a data stream to indicate endianness. Thus, if you are reading UTF-16 input and no endianness is specified, you must check for this.

As can be seen, UTF-8 and UTF-16 are nowhere near compatible with each other. So if you're doing I/O, make sure you know which encoding you are using! For further details on these encodings, please see the UTF FAQ.

Practical programming considerations

Character and String data types: How are they encoded in the programming language? If they are raw bytes, the minute you try to output non-ASCII characters, you may run into a few problems. Also, even if the character type is based on a UTF, that doesn't mean the strings are proper UTF. They may allow byte sequences that are illegal. Generally, you'll have to use a library that supports UTF, such as ICU for C, C++ and Java. In any case, if you want to input/output something other than the default encoding, you will have to convert it first.

Recommended/default/dominant encodings: When given a choice of which UTF to use, it is usually best to follow recommended standards for the environment you are working in. For example, UTF-8 is dominant on the web, and since HTML5, it has been the recommended encoding. Conversely, both .NET and Java environments are founded on a UTF-16 character type. Confusingly (and incorrectly), references are often made to the "Unicode encoding", which usually refers to the dominant UTF encoding in a given environment.

Library support: The libraries you are using support some kind of encoding. Which one? Do they support the corner cases? Since necessity is the mother of invention, UTF-8 libraries will generally support 4-byte characters properly, since 1, 2, and even 3 byte characters can occur frequently. However, not all purported UTF-16 libraries support surrogate pairs properly, since they occur very rarely.

Counting characters: There exist combining characters in Unicode. For example, the code points U+006E (n) and U+0303 (a combining tilde) together form ñ, but the single code point U+00F1 forms ñ. They should look identical, but a simple counting algorithm will return 2 for the first example and 1 for the latter. This isn't necessarily wrong, but may not be the desired outcome either.

Comparing for equality: A (U+0041), А (U+0410), and Α (U+0391) look the same, but they're Latin, Cyrillic, and Greek respectively. You also have cases like C (U+0043) and Ⅽ (U+216D): one is a letter, the other a Roman numeral. In addition, we have the combining characters to consider as well. For more info, see "Duplicate characters in Unicode".

Surrogate pairs: These come up often enough on SO, so I'll just provide some example links:
  Getting string length
  Removing surrogate pairs
  Palindrome checking

Others?
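As a rough illustration of the counting and equality points above, here is a minimal Python sketch using the standard `unicodedata` module. Normalizing to NFC before comparing is one common approach, not the only one, and it does not make the Latin, Cyrillic and Greek lookalikes compare equal.

```python
import unicodedata

decomposed = "n\u0303"   # U+006E followed by U+0303 (combining tilde)
precomposed = "\u00f1"   # U+00F1, the single-code-point form

# A naive count sees two code points in one string and one in the other.
print(len(decomposed), len(precomposed))

# The strings render alike but compare unequal until they are normalized.
raw_equal = decomposed == precomposed
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == decomposed
print(raw_equal, nfc_equal, nfd_equal)

# Lookalike letters stay distinct code points even after normalization.
lookalikes = ["\u0041", "\u0410", "\u0391"]
names = [unicodedata.name(c) for c in lookalikes]
for c, name in zip(lookalikes, names):
    print(f"U+{ord(c):04X} {name}")
```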

User · Answer

Why Unicode? Because ASCII has just 128 characters (0 to 127). Those from 128 to 255 differ between countries; that's why there are code pages. So they said: let's have characters numbered up to 1,114,111.

So how do you store the highest code point? You'll need 21 bits, so you'll use a DWORD having 32 bits, with 11 bits wasted. Using a DWORD to store a Unicode character is the easiest way, because the value in your DWORD matches the code point exactly. But DWORD arrays are of course larger than WORD arrays, and of course even larger than BYTE arrays. That's why there is not only UTF-32, but also UTF-16.

But UTF-16 means a WORD stream, and a WORD has 16 bits, so how can the highest code point, 1,114,111, fit into a WORD? It cannot! So they put everything higher than 65,535 into a DWORD, which they call a surrogate pair. Such a surrogate pair is two WORDs and can be detected by looking at the first 6 bits.

So what about UTF-8? It is a byte array or byte stream, but how can the highest code point, 1,114,111, fit into a byte? It cannot! Okay, so they put in a DWORD as well, right? Or possibly a WORD, right? Almost right! They invented UTF-8 sequences, which means that every code point higher than 127 must be encoded into a 2-byte, 3-byte or 4-byte sequence. Wow!

But how can we detect such sequences? Well, everything up to 127 is ASCII and is a single byte. What starts with 110 is a two-byte sequence, what starts with 1110 is a three-byte sequence, and what starts with 11110 is a four-byte sequence. The remaining bits of these so-called "start bytes" belong to the code point. Now, depending on the sequence, following bytes must follow. A following byte starts with 10; the remaining 6 bits are payload bits and belong to the code point. Concatenate the payload bits of the start byte and the following byte(s) and you'll have the code point. That's all the magic of UTF-8.
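The prefix rules described above can be sketched as a hand-written encoder. This is an illustration only, checked against Python's built-in codec; the function name `utf8_encode` is made up here, and real code would simply call `str.encode("utf-8")`.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a single code point using the UTF-8 start/following byte rules."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if cp <= 0x7F:                      # 0xxxxxxx: plain ASCII, one byte
        return bytes([cp])
    if cp <= 0x7FF:                     # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0x3F)])
    if cp <= 0xFFFF:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0x3F),
                      0b10000000 | (cp & 0x3F)])
    return bytes([0b11110000 | (cp >> 18),          # 11110xxx plus three
                  0b10000000 | ((cp >> 12) & 0x3F),  # following bytes
                  0b10000000 | ((cp >> 6) & 0x3F),
                  0b10000000 | (cp & 0x3F)])


# Compare against the built-in codec for one code point of each length.
for cp in (0x41, 0xF1, 0x20AC, 0x10FFFF):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
    print(f"U+{cp:04X} -> {utf8_encode(cp).hex(' ')}")
```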

User · Answer

Unicode is a standard which maps the characters in all languages to particular numeric values called code points. The reason it does this is that it allows different encodings to be possible using the same set of code points.

UTF-8 and UTF-16 are two such encodings. They take code points as input and encode them using a well-defined formula to produce the encoded string.

Choosing a particular encoding depends upon your requirements. Different encodings have different memory requirements, and depending upon the characters that you will be dealing with, you should choose the encoding which uses the fewest bytes to encode those characters.

For more in-depth details about Unicode, UTF-8 and UTF-16, you can check out this article: What every programmer should know about Unicode

User · Answer

"Unicode is a fairly complex standard. Don't be too afraid, but be prepared for some work!" [2]

Because a credible resource is always needed, but the official report is massive, I suggest reading the following:

  The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!): an introduction by Joel Spolsky, Stack Exchange CEO.
  To the BMP and beyond!: a tutorial by Eric Muller, Technical Director then, Vice President later, at The Unicode Consortium (the first 20 slides and you are done).

A brief explanation:

Computers read bytes and people read characters, so we use encoding standards to map characters to bytes. ASCII was the first widely used standard, but covers only Latin (7 bits/character can represent 128 different characters). Unicode is a standard with the goal of covering all possible characters in the world (it can hold up to 1,114,112 characters, meaning 21 bits/character max; Unicode 8.0 specifies 120,737 characters in total, and that's all).

The main difference is that an ASCII character can fit in a byte (8 bits), but most Unicode characters cannot. So encoding forms/schemes (like UTF-8 and UTF-16) are used, and the character model goes like this:

Every character holds an enumerated position from 0 to 1,114,111 (hex: 0-10FFFF) called a code point.
An encoding form maps a code point to a code unit sequence. A code unit is the way you want characters to be organized in memory: 8-bit units, 16-bit units and so on. UTF-8 uses 1 to 4 units of 8 bits, and UTF-16 uses 1 or 2 units of 16 bits, to cover the entire Unicode range of 21 bits max. Units use prefixes so that character boundaries can be spotted, and more units mean more prefixes that occupy bits. So, although UTF-8 uses 1 byte for the Latin script, it needs 3 bytes for later scripts inside the Basic Multilingual Plane, while UTF-16 uses 2 bytes for all these. And that's their main difference.
Lastly, an encoding scheme (like UTF-16BE or UTF-16LE) maps (serializes) a code unit sequence to a byte sequence.

character: π
code point: U+03C0
encoding forms (code units):
    UTF-8: CF 80
    UTF-16: 03C0
encoding schemes (bytes):
    UTF-8: CF 80
    UTF-16BE: 03 C0
    UTF-16LE: C0 03

Tip: a hex digit represents 4 bits, so a two-digit hex number represents a byte.

Also take a look at the Plane maps in Wikipedia to get a feeling for the character set layout.
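The π example can be checked with Python's built-in codecs. This is a small sketch rather than a reference; the variable names are arbitrary, and the last step assumes that the generic "utf-16" codec picks the byte order from a leading byte order mark when one is present.

```python
import codecs

pi = "\u03c0"  # code point U+03C0

# Encoding schemes serialize the same code unit to different byte orders.
as_utf8 = pi.encode("utf-8").hex(" ").upper()         # CF 80
as_utf16_be = pi.encode("utf-16-be").hex(" ").upper() # 03 C0
as_utf16_le = pi.encode("utf-16-le").hex(" ").upper() # C0 03
print(as_utf8, "|", as_utf16_be, "|", as_utf16_le)

# With a byte order mark in front, the generic "utf-16" decoder can work
# out the byte order by itself.
from_be = (codecs.BOM_UTF16_BE + b"\x03\xc0").decode("utf-16")
from_le = (codecs.BOM_UTF16_LE + b"\xc0\x03").decode("utf-16")
print(from_be == pi, from_le == pi)
```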

User · Answer

ASCII: Software allocates only one 8-bit byte in memory for a given character. It works well for plain English text, whose characters have decimal values below 128. Characters in adopted loanwords, such as the ç in façade, fall outside that range.

UTF-8: Software allocates a variable 1 to 4 8-bit bytes for a given character. What does "variable" mean here? Let us say you are sending the character "A" through your HTML pages in the browser (HTML is typically UTF-8). The decimal value of A is 65; when you convert it into binary it becomes 01000001. This requires only 1 byte. Accented letters such as the ç in façade, and European characters in general, require 2 bytes. When you go for Asian characters, you require a minimum of 2 bytes and a maximum of 4 bytes. Similarly, emoji require 3 to 4 bytes. UTF-8 will cover all of these needs.

UTF-16: It allocates a minimum of 2 bytes and a maximum of 4 bytes per character; it will not allocate 1 or 3 bytes. Each character is represented in either 16 bits or 32 bits.

Then why does UTF-16 exist? Originally, Unicode was 16-bit, not 8-bit. Java adopted the original 16-bit version.

In a nutshell, you don't need UTF-16 anywhere unless it has already been adopted by the language or platform you are working on.

A Java program invoked by a web browser uses UTF-16, but the web browser sends characters using UTF-8.
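To see the byte counts concretely, here is a small Python sketch that measures a few sample characters. The sample set and the helper name `encoded_sizes` are chosen here for illustration; UTF-16 is encoded little-endian without a byte order mark so that only the character itself is counted.

```python
# Sample characters: ASCII letter, accented Latin letter, Hiragana, emoji.
samples = ["A", "\u00e7", "\u3042", "\U0001F600"]


def encoded_sizes(ch: str) -> tuple:
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


sizes = [encoded_sizes(ch) for ch in samples]
for ch, (u8, u16) in zip(samples, sizes):
    print(f"U+{ord(ch):04X}: UTF-8 = {u8} bytes, UTF-16 = {u16} bytes")
```

UTF-8 ranges from 1 to 4 bytes across these samples, while UTF-16 uses 2 bytes for the first three and 4 for the emoji.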

User · Answer

Unicode
  is a set of characters used around the world

UTF-8
  a character encoding capable of encoding all possible characters (called code points) in Unicode
  code unit is 8 bits
  uses one to four code units to encode Unicode
  00100100 for "$" (one 8-bit unit); 11000010 10100010 for "¢" (two 8-bit units); 11100010 10000010 10101100 for "€" (three 8-bit units)

UTF-16
  another character encoding
  code unit is 16 bits
  uses one to two code units to encode Unicode
  00000000 00100100 for "$" (one 16-bit unit); 11011000 01010010 11011111 01100010 for "𤭢" (two 16-bit units)
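The two 16-bit units in the last UTF-16 example are a surrogate pair. The arithmetic behind it can be sketched in Python as follows; the function names are made up for illustration, and in practice `str.encode("utf-16-be")` does this for you.

```python
def to_surrogates(cp: int) -> tuple:
    """Split a supplementary-plane code point into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = cp - 0x10000               # leaves a 20-bit value
    high = 0xD800 | (offset >> 10)      # top 10 bits, prefix 110110
    low = 0xDC00 | (offset & 0x3FF)     # bottom 10 bits, prefix 110111
    return high, low


def from_surrogates(high: int, low: int) -> int:
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x24B62)
print(f"{high:016b} {low:016b}")   # 1101100001010010 1101111101100010
print(hex(from_surrogates(high, low)))
```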
