[unicode] UTF-8, UTF-16, and UTF-32

What are the differences between UTF-8, UTF-16, and UTF-32?

I understand that they will all store Unicode, and that each uses a different number of bytes to represent a character. Is there an advantage to choosing one over the other?

This question is related to: unicode, utf-8, utf-16, utf, utf-32

Answers:


In short:

  • UTF-8: Variable-width encoding, backwards compatible with ASCII. ASCII characters (U+0000 to U+007F) take 1 byte, code points U+0080 to U+07FF take 2 bytes, code points U+0800 to U+FFFF take 3 bytes, code points U+10000 to U+10FFFF take 4 bytes. Good for English text, not so good for Asian text.
  • UTF-16: Variable-width encoding. Code points U+0000 to U+FFFF take 2 bytes, code points U+10000 to U+10FFFF take 4 bytes. Bad for English text, good for Asian text.
  • UTF-32: Fixed-width encoding. All code points take four bytes. An enormous memory hog, but fast to operate on. Rarely used.

In long: see Wikipedia: UTF-8, UTF-16, and UTF-32.
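
A quick way to check those byte counts yourself is a few lines of Python (a minimal sketch; the sample strings are arbitrary):

# compare encoded sizes of the same text in UTF-8, UTF-16, and UTF-32
for text in ["hello", "¡olé!", "日本語"]:
    for enc in ["utf-8", "utf-16-le", "utf-32-le"]:   # the -le variants avoid counting the BOM
        print(f"{text!r:12} {enc:10} {len(text.encode(enc)):3d} bytes")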


In UTF-32 all characters are encoded with 32 bits. The advantage is that you can easily calculate the length of the string. The disadvantage is that for each ASCII character you waste an extra three bytes.

In UTF-8 characters have variable length: ASCII characters are encoded in one byte (eight bits), most Western special characters are encoded in either two or three bytes (for example, € takes three bytes), and more exotic characters can take up to four bytes. The clear disadvantage is that you cannot calculate the string's length up front, but it takes far fewer bytes to encode Latin (English) alphabet text than UTF-32 does.

UTF-16 is also variable length. Characters are encoded in either two or four bytes. I really don't see the point: it has the disadvantage of being variable length, but hasn't got the advantage of saving as much space as UTF-8.

Of the three, UTF-8 is clearly the most widely used.
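
A minimal Python sketch of the length point above (the sample string is just an illustration):

s = "héllo €"                     # 7 code points: ASCII letters, an accented letter, and €
utf8 = s.encode("utf-8")          # variable width: é -> 2 bytes, € -> 3 bytes
utf32 = s.encode("utf-32-le")     # fixed width, little-endian so no BOM bytes
print(len(utf8))                  # 10 -- the byte count alone doesn't tell you there are 7 characters
print(len(utf32) // 4)            # 7 -- the character count falls straight out of the byte count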


I made some tests to compare database performance between UTF-8 and UTF-16 in MySQL.

[Benchmark charts not reproduced here: update, insert, and delete speeds for UTF-8 and UTF-16.]


  • UTF-8 is variable 1 to 4 bytes.

  • UTF-16 is variable 2 or 4 bytes.

  • UTF-32 is fixed 4 bytes.

Note: the original UTF-8 design allowed sequences of up to 6 bytes, but since RFC 3629 (2003) UTF-8 is restricted to at most 4 bytes; the older convention is described at https://lists.gnu.org/archive/html/help-flex/2005-01/msg00030.html
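
As a quick check of that note, modern decoders reject the obsolete 5- and 6-byte forms; a minimal Python sketch (the byte string is the old 6-byte encoding of U+7FFFFFFF, shown purely as an illustration):

old_six_byte = b"\xfd\xbf\xbf\xbf\xbf\xbf"   # 6-byte sequence from the pre-RFC 3629 design
try:
    old_six_byte.decode("utf-8")
except UnicodeDecodeError as err:
    print("rejected:", err)                   # 0xfd is an invalid start byte in today's UTF-8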


As mentioned, the difference is primarily the size of the underlying variables, which in each case get larger to allow more characters to be represented.

However, fonts, encodings, and related things are wickedly complicated (unnecessarily?), so here is a big link to fill in more detail:

http://www.cs.tut.fi/~jkorpela/chars.html#ascii

Don't expect to understand it all, but if you don't want to have problems later it's worth learning as much as you can, as early as you can (or just getting someone else to sort it out for you).

Paul.


Unicode defines a single huge character set, assigning one unique integer value to every graphical symbol (that is a major simplification, and isn't actually true, but it's close enough for the purposes of this question). UTF-8/16/32 are simply different ways to encode this.

In brief, UTF-32 uses 32-bit values for each character. That allows it to use a fixed-width code for every character.

UTF-16 uses 16-bit values by default, but that only gives you 65,536 possible values, which is nowhere near enough for the full Unicode set. So some characters use pairs of 16-bit values.

And UTF-8 uses 8-bit values by default, which means that the first 128 values are fixed-width single-byte characters (the most significant bit is used to signify that the byte is part of a multi-byte sequence, leaving 7 bits for the actual character value). All other characters are encoded as sequences of up to 4 bytes (if memory serves).

And that leads us to the advantages. Any ASCII character is directly compatible with UTF-8, so for upgrading legacy apps, UTF-8 is a common and obvious choice. In almost all cases, it will also use the least memory. On the other hand, you can't make any guarantees about the width of a character. It may be 1, 2, 3 or 4 bytes wide, which makes string manipulation difficult.

UTF-32 is the opposite: it uses the most memory (each character is a fixed 4 bytes wide), but on the other hand, you know that every character has this precise length, so string manipulation becomes far simpler. You can compute the number of characters in a string simply from the length in bytes of the string. You can't do that with UTF-8.

UTF-16 is a compromise. It lets most characters fit into a fixed-width 16-bit value. So as long as you don't have Chinese symbols, musical notes or some others, you can assume that each character is 16 bits wide. It uses less memory than UTF-32. But it is in some ways "the worst of both worlds". It almost always uses more memory than UTF-8, and it still doesn't avoid the problem that plagues UTF-8 (variable-length characters).

Finally, it's often helpful to just go with what the platform supports. Windows uses UTF-16 internally, so on Windows, that is the obvious choice.

Linux varies a bit, but they generally use UTF-8 for everything that is Unicode-compliant.

So short answer: All three encodings can encode the same character set, but they represent each character as different byte sequences.
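
A small Python sketch of that trade-off (the musical-symbol character is just an example of a code point above U+FFFF):

s = "G clef: \U0001D11E"                 # U+1D11E MUSICAL SYMBOL G CLEF, 9 code points in total
print(len(s.encode("utf-8")))            # 12 bytes -- no simple bytes-to-characters ratio
print(len(s.encode("utf-16-le")) // 2)   # 10, not 9: the clef needs a surrogate pair
print(len(s.encode("utf-32-le")) // 4)   # 9 -- byte length / 4 gives the code point count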


I tried to give a simple explanation in my blogpost.

UTF-32

requires 32 bits (4 bytes) to encode any character. For example, to represent the "A" character's code point using this scheme, you'll need to write 65 as a 32-bit binary number:

00000000 00000000 00000000 01000001 (Big Endian)

If you take a closer look, you'll note that the rightmost seven bits are actually the same bits as in the ASCII scheme. But since UTF-32 is a fixed-width scheme, we must attach three additional bytes. Meaning that if we have two files that contain only the "A" character, one ASCII-encoded and the other UTF-32-encoded, their sizes will be 1 byte and 4 bytes respectively.

UTF-16

Many people think that because UTF-32 uses a fixed width of 32 bits to represent a code point, UTF-16 is fixed-width 16 bits. WRONG!

In UTF-16 a code point may be represented in either 16 bits or 32 bits, so this scheme is a variable-length encoding system. What is the advantage over UTF-32? At least for ASCII, the size of files won't be 4 times the original (though it will still be twice), so we're still not ASCII backward compatible.

Since 7 bits are enough to represent the "A" character, we can now use 2 bytes instead of 4 as in UTF-32. It'll look like:

00000000 01000001

UTF-8

You guessed right: in UTF-8 a code point may be represented using 8, 16, 24, or 32 bits, and like UTF-16, this is also a variable-length encoding system.

Finally we can represent "A" in the same way we represent it using the ASCII encoding system:

01000001

A small example where UTF-16 is actually better than UTF-8:

Consider the Chinese character "語" (U+8A9E) - its UTF-8 encoding is:

11101000 10101010 10011110

While its UTF-16 encoding is shorter:

10001010 10011110

In order to understand the representation and how it's interpreted, visit the original post.
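
A minimal Python check of the two examples above (big-endian codecs without a BOM, to match the bit patterns shown):

for ch in ["A", "語"]:
    for enc in ["utf-8", "utf-16-be"]:
        data = ch.encode(enc)
        bits = " ".join(f"{b:08b}" for b in data)
        print(ch, enc, f"{len(data)} byte(s):", bits)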


Unicode is a standard; you can think of the UTF-x encodings as technical implementations for different practical purposes (a rough size comparison is sketched after the list):

  • UTF-8 - "size optimized": best suited for Latin-character-based data (or ASCII); it takes only 1 byte per character there, but the size grows with symbol variety (up to 4 bytes per character under the current standard)
  • UTF-16 - "balance": it takes a minimum of 2 bytes per character, which is enough for the existing set of mainstream languages to have a fixed size that eases character handling (but the size is still variable and can grow to 4 bytes per character)
  • UTF-32 - "performance": allows the use of simple algorithms as a result of fixed-size characters (4 bytes), but at a memory disadvantage


In short, the only reason to use UTF-16 or UTF-32 is to support non-English and ancient scripts respectively.

I was wondering why anyone would choose a non-UTF-8 encoding when UTF-8 is obviously more efficient for web/programming purposes.

A common misconception: the suffixed number is NOT an indication of capability. They all support the complete Unicode range; it's just that UTF-8 can handle ASCII with a single byte, so it is MORE efficient and less easily corrupted for the CPU and over the Internet.

Some good reading: http://www.personal.psu.edu/ejp10/blogs/gotunicode/2007/10/which_utf_do_i_use.html and http://utf8everywhere.org


UTF-8

  • has no concept of byte-order
  • uses between 1 and 4 bytes per character
  • ASCII is a compatible subset of encoding
  • completely self-synchronizing: a dropped byte anywhere in a stream corrupts at most a single character
  • pretty much all European languages are encoded in two bytes or less per character

UTF-16

  • must be parsed with known byte-order or reading a byte-order-mark (BOM)
  • uses either 2 or 4 bytes per character

UTF-32

  • every character is 4 bytes
  • must be parsed with known byte-order or reading a byte-order-mark (BOM)

UTF-8 is going to be the most space efficient unless a majority of the characters are from the CJK (Chinese, Japanese, and Korean) character space.

UTF-32 is best for random access by character offset into a byte-array.
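
A small Python illustration of the byte-order/BOM point above (this just shows the behaviour of Python's standard codecs):

text = "hi"
print(text.encode("utf-8"))       # b'hi' -- no byte order to worry about
print(text.encode("utf-16"))      # BOM first (b'\xff\xfe' on a little-endian machine), then the data
print(text.encode("utf-16-be"))   # b'\x00h\x00i' -- byte order fixed by the codec name, no BOM
print(text.encode("utf-32-le"))   # b'h\x00\x00\x00i\x00\x00\x00'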


I'm surprised this question is 11 years old and not one of the answers mentions the #1 advantage of UTF-8.

UTF-8 generally works even with programs that are not UTF-8 aware. That's partly what it was designed for. Other answers mention that the first 128 code points are the same as ASCII. All other code points are encoded using 8-bit values with the high bit set (values from 128 to 255), so from the point of view of a non-Unicode-aware program the data just looks like ASCII strings with some extra characters.

As an example, let's say you wrote a program to add line numbers that effectively does this (to keep it simple, let's assume the end of line is just ASCII 13):

// pseudo code

function readLine
  if end of file
     return null
  read bytes (8-bit values) into string until you hit 13 or end of file
  return string

function main
  lineNo = 1
  do {
    s = readLine
    if (s == null) break;
    print lineNo++, s
  }  

Passing a UTF-8 file to this program will continue to work. Similarly, splitting on tabs, commas, parsing for ASCII quotes, or any other parsing for which only ASCII values are significant all just works with UTF-8, because no ASCII byte value ever appears inside a UTF-8 multi-byte sequence; when an ASCII byte appears, it really is that ASCII character.
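
A rough runnable equivalent of that pseudocode in Python, for the curious (it reads in binary mode, so it is deliberately not UTF-8 aware; the CR-only line ending and the file name are assumptions carried over from the example):

import sys

def read_line(f):
    """Read bytes up to an ASCII 13 (the simplified end-of-line) or end of file."""
    out = bytearray()
    while True:
        b = f.read(1)
        if not b:                          # end of file
            return bytes(out) if out else None
        if b[0] == 13:
            return bytes(out)
        out += b

def main(path):
    line_no = 1
    with open(path, "rb") as f:            # byte-oriented: no decoding, no UTF-8 awareness
        while True:
            s = read_line(f)
            if s is None:
                break
            sys.stdout.buffer.write(b"%d %s\n" % (line_no, s))
            line_no += 1

main("some-utf8-file.txt")                 # hypothetical file name, just for illustration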

Some other answers and comments mention that UTF-32 has the advantage that you can treat each code point separately. That would suggest, for example, that you could take a string like "ABCDEFGHI" and split it at every 3rd code point to make

ABC
DEF
GHI

This is false. Many code points affect other code points. For example, the color-selector (skin tone modifier) code points that let you choose between color variants of the same emoji: if you split at an arbitrary code point you'll break those.

Another example is the bidirectional code points. The following line was not entered backward; in the original post it is preceded by the U+202E (RIGHT-TO-LEFT OVERRIDE) code point, which makes it display backward:

  • This line is not typed backward, it is only displayed backward

So no, UTF-32 will not let you manipulate Unicode strings arbitrarily without a thought to their meaning. It will, however, let you look at each code point with no extra code.
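
A quick Python illustration of why slicing at arbitrary code points is unsafe (it uses a combining accent rather than the emoji modifiers mentioned above, but the principle is the same):

s = "cafe\u0301!"             # 'e' followed by U+0301 COMBINING ACUTE ACCENT, then '!'
print(s)                      # renders as café!
left, right = s[:4], s[4:]    # split per code point, as the UTF-32 argument suggests is safe
print(left, right)            # the combining accent is stranded away from its base letter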

FYI though, UTF-8 was designed so that, looking at any individual byte, you can find the start of the current code point or of the next one.

Take an arbitrary byte in UTF-8 data. If it is < 128, it is a complete code point by itself. If it is >= 128 and < 192 (the top 2 bits are 10), then to find the start of the code point you look back through the preceding bytes until you find one with a value >= 192 (the top 2 bits are 11). That byte is the start of the code point, and it also encodes how many subsequent bytes make up the code point.

If you want to find the next code point, just scan forward until you hit a byte that is < 128 or >= 192; that's the start of the next code point.

Num bytes   First code point   Last code point   Byte 1     Byte 2     Byte 3     Byte 4
1           U+0000             U+007F            0xxxxxxx
2           U+0080             U+07FF            110xxxxx   10xxxxxx
3           U+0800             U+FFFF            1110xxxx   10xxxxxx   10xxxxxx
4           U+10000            U+10FFFF          11110xxx   10xxxxxx   10xxxxxx   10xxxxxx

Where the x's are the bits of the code point: concatenate the x bits from the bytes to get the code point value.
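
A small Python sketch of the scanning rule just described (it assumes well-formed UTF-8 input):

def start_of_codepoint(data: bytes, i: int) -> int:
    """Walk backwards from index i to the first byte of the code point containing it."""
    while 0x80 <= data[i] < 0xC0:      # 10xxxxxx continuation bytes
        i -= 1
    return i

def next_codepoint(data: bytes, i: int) -> int:
    """Scan forwards from index i to the start of the following code point."""
    i += 1
    while i < len(data) and 0x80 <= data[i] < 0xC0:
        i += 1
    return i

data = "a€b".encode("utf-8")            # bytes 61 E2 82 AC 62
print(start_of_codepoint(data, 2))      # 1 -- byte 2 is in the middle of the 3-byte € sequence
print(next_codepoint(data, 1))          # 4 -- the 'b' that follows the €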


Depending on your development environment, you may not even have a choice of which encoding your string data type will use internally.

But for storing and exchanging data I would always use UTF-8, if you have the choice. If you have mostly ASCII data this will give you the smallest amount of data to transfer, while still being able to encode everything. Optimizing for the least I/O is the way to go on modern machines.
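
For example, in Python that choice can be made explicit when reading and writing files instead of relying on the platform default (the file name is just for illustration):

with open("data.txt", "w", encoding="utf-8") as f:
    f.write("naïve café – 10 €\n")
with open("data.txt", "r", encoding="utf-8") as f:
    print(f.read())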

