What's the difference between Unicode and UTF-8?

Question

Consider: is it true that Unicode = UTF-16? Many are saying Unicode is a standard, not an encoding, but most editors actually support saving as "Unicode" encoding.

User · Answer

It's weird. Unicode is a standard, not an encoding. As it is possible to specify the endianness, I guess it's effectively UTF-16, or maybe UTF-32. Where does this menu come from?

User · Answer

UTF-16 and UTF-8 are both encodings of Unicode. They are both Unicode; one is not more Unicode than the other. Don't let an unfortunate historical artifact from Microsoft confuse you.
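To make this concrete, here is a minimal Python sketch. The sample string is my own choice, not from the question. It encodes the same text as UTF-8 and as UTF-16LE, and checks that the bytes differ while the decoded text is identical.

```python
# The same text (written with escapes so the source file's own encoding is irrelevant).
text = "na\u00efve caf\u00e9"  # "naïve café", 10 code points

utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")

# Different encodings produce different byte sequences of different lengths...
print(len(utf8_bytes), utf8_bytes)
print(len(utf16_bytes), utf16_bytes)

# ...but both decode back to exactly the same Unicode text.
assert utf8_bytes.decode("utf-8") == text
assert utf16_bytes.decode("utf-16-le") == text
```

Neither byte sequence is "the" Unicode form of the text; they are two serializations of the same code points.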

User · Answer

"most editors support save as 'Unicode' encoding actually."

This is an unfortunate misnaming perpetrated by Windows.

Because Windows uses UTF-16LE encoding internally as the memory storage format for Unicode strings, it considers this to be the natural encoding of Unicode text. In the Windows world, there are ANSI strings (the system codepage on the current machine, subject to total unportability) and there are Unicode strings (stored internally as UTF-16LE).

This was all devised in the early days of Unicode, before we realised that UCS-2 wasn't enough, and before UTF-8 was invented. This is why Windows's support for UTF-8 is all-round poor.

This misguided naming scheme became part of the user interface. A text editor that uses Windows's encoding support to provide a range of encodings will automatically and inappropriately describe UTF-16LE as "Unicode", and UTF-16BE, if provided, as "Unicode big-endian".

(Other editors that do encodings themselves, like Notepad++, don't have this problem.)

If it makes you feel any better about it, "ANSI" strings aren't based on any ANSI standard, either.
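As a rough illustration of what the "Unicode" and "Unicode big-endian" menu entries correspond to, the following Python sketch encodes one character both ways. The codec names are Python's; whether a particular editor also writes a byte order mark is an assumption I am not making here.

```python
import codecs

text = "A"  # code point U+0041

little = text.encode("utf-16-le")  # what Windows UIs tend to label "Unicode"
big = text.encode("utf-16-be")     # what they label "Unicode big-endian"

print(little)  # low byte first
print(big)     # high byte first

# Python's plain "utf-16" codec prepends a byte order mark (BOM) and then
# uses the machine's native byte order, so the BOM tells a reader which is which.
with_bom = text.encode("utf-16")
assert with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
```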

User · Answer

There's a lot of misunderstanding being displayed here. Unicode isn't an encoding, but the Unicode standard is devoted primarily to encoding anyway.

ISO 10646 is the international character set you (probably) care about. It defines a mapping between a set of named characters (e.g., "Latin Capital Letter A" or "Greek small letter alpha") and a set of code points (a number assigned to each: for example, 41 hexadecimal and 3B1 hexadecimal for those two respectively; for Unicode code points, the standard notation would be U+0041 and U+03B1).

At one time, Unicode defined its own character set, more or less as a competitor to ISO 10646. That was a 16-bit character set, but it was not UTF-16; it was known as UCS-2. It included a rather controversial technique to try to keep the number of necessary characters to a minimum (Han Unification: basically treating Chinese, Japanese and Korean characters that were quite a bit alike as being the same character).

Since then, the Unicode consortium has tacitly admitted that that wasn't going to work, and now concentrates primarily on ways to encode the ISO 10646 character set. The primary methods are UTF-8, UTF-16 and UCS-4 (aka UTF-32). Those (except for UTF-8) also have LE (little-endian) and BE (big-endian) variants.

By itself, "Unicode" could refer to almost any of the above (though we can probably eliminate the others that it shows explicitly, such as UTF-8). Unqualified use of "Unicode" probably happens the most often on Windows, where it will almost certainly refer to UTF-16. Early versions of Windows NT adopted Unicode when UCS-2 was current. After UCS-2 was declared obsolete (around Win2k, if memory serves), they switched to UTF-16, which is the most similar to UCS-2 (in fact, it's identical for characters in the "basic multilingual plane", which covers a lot, including all the characters for most Western European languages).
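The name-to-code-point mapping described above can be inspected from Python's standard `unicodedata` module. This is only a small sketch; `describe` is a hypothetical helper name of mine. It also checks that, for a character in the basic multilingual plane, the single UTF-16 code unit has the same value as the code point.

```python
import unicodedata


def describe(ch):
    """Return 'U+XXXX NAME' for a single character (hypothetical helper)."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


print(describe("A"))       # Latin capital letter A
print(describe("\u03b1"))  # Greek small letter alpha

# For a BMP character, the one 16-bit UTF-16 code unit equals the code point,
# which is why UTF-16 and the older UCS-2 agree on this range.
alpha = "\u03b1"
unit = int.from_bytes(alpha.encode("utf-16-be"), "big")
assert unit == ord(alpha) == 0x03B1
```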

User · Answer

In addition to Trufa's comment, Unicode explicitly isn't UTF-16. When they were first looking into Unicode, it was speculated that a 16-bit integer might be enough to store any code, but in practice that turned out not to be the case. However, UTF-16 is another valid encoding of Unicode, alongside the 8-bit and 32-bit variants, and I believe it is the encoding that Microsoft use in memory at runtime on the NT-derived operating systems.

User · Answer

It's not that simple.

UTF-16 is a 16-bit, variable-width encoding. Simply calling something "Unicode" is ambiguous, since "Unicode" refers to an entire set of standards for character encoding. Unicode is not an encoding!

http://en.wikipedia.org/wiki/Unicode#Unicode_Transformation_Format_and_Universal_Character_Set

And of course, the obligatory Joel On Software link: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
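"Variable-width" can be seen directly by counting 16-bit code units. In this hedged Python sketch (the helper name and the sample characters are my own), a BMP character takes one unit and a character outside the BMP takes two, a surrogate pair.

```python
import struct


def utf16_code_units(text):
    """Return the 16-bit code units of text encoded as UTF-16 (hypothetical helper)."""
    raw = text.encode("utf-16-be")
    return list(struct.unpack(f">{len(raw) // 2}H", raw))


bmp_char = "\u03b1"         # GREEK SMALL LETTER ALPHA, inside the BMP
astral_char = "\U0001F600"  # GRINNING FACE, outside the BMP

print([hex(u) for u in utf16_code_units(bmp_char)])     # one code unit
print([hex(u) for u in utf16_code_units(astral_char)])  # two code units (surrogate pair)
```

This is also why a fixed 16-bit scheme like UCS-2 could not represent every code point.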

User · Answer

As Rasmus states in his article "The difference between UTF-8 and Unicode?":

If asked the question, "What is the difference between UTF-8 and Unicode?", would you confidently reply with a short and precise answer? In these days of internationalization all developers should be able to do that. I suspect many of us do not understand these concepts as well as we should. If you feel you belong to this group, you should read this ultra short introduction to character sets and encodings.

Actually, comparing UTF-8 and Unicode is like comparing apples and oranges: UTF-8 is an encoding, Unicode is a character set.

A character set is a list of characters with unique numbers (these numbers are sometimes referred to as "code points"). For example, in the Unicode character set, the number for A is 41.

An encoding, on the other hand, is an algorithm that translates a list of numbers to binary so it can be stored on disk. For example, UTF-8 would translate the number sequence 1, 2, 3, 4 like this:

00000001 00000010 00000011 00000100

Our data is now translated into binary and can now be saved to disk.

All together now

Say an application reads the following from the disk:

1101000 1100101 1101100 1101100 1101111

The app knows this data represents a Unicode string encoded with UTF-8 and must show this as text to the user. The first step is to convert the binary data to numbers. The app uses the UTF-8 algorithm to decode the data. In this case, the decoder returns this:

104 101 108 108 111

Since the app knows this is a Unicode string, it can assume each number represents a character. We use the Unicode character set to translate each number to a corresponding character. The resulting string is "hello".

Conclusion

So when somebody asks you "What is the difference between UTF-8 and Unicode?", you can now confidently answer short and precise:

UTF-8 (Unicode Transformation Format) and Unicode cannot be compared. UTF-8 is an encoding used to translate numbers into binary data. Unicode is a character set used to translate characters into numbers.
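The "hello" walk-through can be reproduced in a few lines of Python. This is only a sketch of the two steps described in the quote. Note that the one-byte-per-number correspondence holds only for code points below 128; the last lines add a non-ASCII character of my own choosing to show a code point that becomes two bytes.

```python
# Step 0: the bits "read from disk" in the example above.
bits = "1101000 1100101 1101100 1101100 1101111"
raw = bytes(int(b, 2) for b in bits.split())

# Step 1: UTF-8 decoding turns bytes into code points (numbers).
text = raw.decode("utf-8")
numbers = [ord(ch) for ch in text]
print(numbers)

# Step 2: the character set maps each number to a character.
print(text)

# Outside ASCII, bytes and code points are no longer one-to-one:
alpha = "\u03b1"                  # code point 0x3B1 (945)
alpha_bytes = alpha.encode("utf-8")
print(ord(alpha), list(alpha_bytes))
```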

User · Answer

"The development of Unicode was aimed at creating a new standard for mapping the characters in a great majority of languages that are being used today, along with other characters that are not that essential but might be necessary for creating the text. UTF-8 is only one of the many ways that you can encode the files because there are many ways you can encode the characters inside a file into Unicode."

Source: http://www.differencebetween.net/technology/difference-between-unicode-and-utf-8

User · Answer

Let's start by keeping in mind that data is stored as bytes. Unicode is a character set in which characters are mapped to code points (unique integers), and we need something to translate those code points into bytes. That's where UTF-8 comes in: it is a so-called encoding. Simple!
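For the curious, the translation from a code point to UTF-8 bytes is small enough to write by hand. The function below is an illustrative sketch with a name I made up, not how any particular library implements it; it follows the standard UTF-8 bit layout and is checked against Python's built-in encoder.

```python
def utf8_encode_code_point(cp):
    """Encode one Unicode code point as UTF-8 bytes (illustrative helper)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:      # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:     # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:   # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Compare with the built-in encoder for characters of each length.
for ch in ("A", "\u03b1", "\u20ac", "\U0001F600"):
    assert utf8_encode_code_point(ord(ch)) == ch.encode("utf-8")
```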
