How many bytes does one Unicode character take?

Question

I am a bit confused about encodings. As far as I know, old ASCII characters took one byte per character. How many bytes does a Unicode character require?

I assume that one Unicode character can contain every possible character from any language - am I correct? So how many bytes does it need per character?

And what do UTF-7, UTF-6, UTF-16 etc. mean? Are they different versions of Unicode?

I read the Wikipedia article about Unicode but it is quite difficult for me. I am looking forward to seeing a simple answer.

User · Answer

In UTF-8:

    1 byte:        0 -     7F   (ASCII)
    2 bytes:      80 -    7FF   (all European plus some Middle Eastern)
    3 bytes:     800 -   FFFF   (multilingual plane incl. the top 1792 and private-use)
    4 bytes:   10000 - 10FFFF

In UTF-16:

    2 bytes:       0 -   D7FF   and E000 - FFFF (the multilingual plane; D800 - DFFF is reserved for surrogates)
    4 bytes:   10000 - 10FFFF   (encoded as a surrogate pair)

In UTF-32:

    4 bytes:       0 - 10FFFF

10FFFF is the last Unicode codepoint by definition, and it's defined that way because it's UTF-16's technical limit.

UTF-8 is restricted to the same range. Four bytes could in principle carry values up to 1FFFFF, and the idea behind UTF-8's encoding also works for 5- and 6-byte sequences to cover codepoints up to 7FFFFFFF, i.e. half of what UTF-32 can.
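To check the ranges above on a concrete character, here is a minimal Python sketch. The helper name `encoded_sizes` is made up for this illustration; it only relies on the standard `str.encode` method.

```python
def encoded_sizes(ch: str) -> dict[str, int]:
    """Return the number of bytes one code point takes in each encoding."""
    # The "-le" codec variants are used so that Python does not prepend
    # a byte order mark, which would inflate the count.
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }


# One sample from each UTF-8 length class: U+0041, U+00E9, U+20AC, U+1F680.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F680"]:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

The samples were picked so that each UTF-8 length from 1 to 4 bytes shows up once; only the last one lies outside the multilingual plane and therefore needs 4 bytes in UTF-16 as well.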

User · Answer

You won't see a simple answer because there isn't one.

First, Unicode doesn't contain "every character from every language", although it sure does try.

Unicode itself is a mapping: it defines codepoints, and a codepoint is a number associated with, usually, a character. I say usually because there are concepts like combining characters. You may be familiar with things like accents, or umlauts. Those can be used with another character, such as an a or a u, to create a new logical character. A character therefore can consist of 1 or more codepoints.

To be useful in computing systems we need to choose a representation for this information. Those are the various Unicode encodings, such as utf-8, utf-16le, utf-32 etc. They are distinguished largely by the size of their codeunits. UTF-32 is the simplest encoding: it has a codeunit that is 32 bits, which means an individual codepoint fits comfortably into a codeunit. The other encodings will have situations where a codepoint will need multiple codeunits, or where that particular codepoint can't be represented in the encoding at all (this is a problem for instance with UCS-2).

Because of the flexibility of combining characters, even within a given encoding the number of bytes per character can vary depending on the character and the normalization form. This is a protocol for dealing with characters which have more than one representation: you can say "an 'a' with an accent", which is 2 codepoints, one of which is a combining char, or "accented 'a'", which is one codepoint.
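The point about combining characters and normalization forms can be seen with Python's standard `unicodedata` module. This is only a sketch; the helper name `describe` is invented for the example, and é stands in for the accented letter.

```python
import unicodedata

composed = "\u00e9"      # é as one precomposed code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT


def describe(s: str) -> tuple[int, int]:
    """Return (number of code points, number of UTF-8 bytes)."""
    return len(s), len(s.encode("utf-8"))


print(describe(composed))    # one code point
print(describe(decomposed))  # two code points, one more byte in UTF-8

# NFC composes where possible, NFD decomposes; both spellings are the same text.
print(unicodedata.normalize("NFC", decomposed) == composed)
print(unicodedata.normalize("NFD", composed) == decomposed)
```

Both strings render as the same character, yet they differ in code point count and in byte length, which is why "bytes per character" has no single answer even inside one encoding.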

User · Answer

From Wiki:

UTF-8, an 8-bit variable-width encoding which maximizes compatibility with ASCII;
UTF-16, a 16-bit, variable-width encoding;
UTF-32, a 32-bit, fixed-width encoding.

These are the three most popular encodings.

In UTF-8 each character is encoded into 1 to 4 bytes (the dominant encoding). In UTF-16 each character is encoded into one or two 16-bit words, and in UTF-32 every character is encoded as a single 32-bit word.

User · Answer

There is a great tool for calculating the bytes of any string in UTF-8: http://mothereff.in/byte-counter

Update: @mathias has made the code public: https://github.com/mathiasbynens/mothereff.in/blob/master/byte-counter/eff.js

User · Answer

I know this question is old and already has an accepted answer, but I want to offer a few examples (hoping it'll be useful to someone).

"As far as I know old ASCII characters took one byte per character."

Right. Actually, since ASCII is a 7-bit encoding, it supports 128 codes (95 of which are printable), so it only uses half of the values a byte can hold (if that makes any sense).

"How many bytes does a Unicode character require?"

Unicode just maps characters to codepoints. It doesn't define how to encode them. A text file does not contain Unicode characters, but bytes/octets that may represent Unicode characters.

"I assume that one Unicode character can contain every possible character from any language - am I correct?"

No. But almost. So basically yes. But still no.

"So how many bytes does it need per character?"

Same as your 2nd question.

"And what do UTF-7, UTF-6, UTF-16 etc mean? Are they some kind Unicode versions?"

No, those are encodings. They define how bytes/octets should represent Unicode characters.

A couple of examples. If some of those cannot be displayed in your browser (probably because the font doesn't support them), go to http://codepoints.net/U+1F6AA (replace 1F6AA with the codepoint in hex) to see an image.

U+0061 LATIN SMALL LETTER A: a  Nº: 97  UTF-8: 61  UTF-16: 00 61

U+00A9 COPYRIGHT SIGN: ©  Nº: 169  UTF-8: C2 A9  UTF-16: 00 A9
U+00AE REGISTERED SIGN: ®  Nº: 174  UTF-8: C2 AE  UTF-16: 00 AE

U+1337 ETHIOPIC SYLLABLE PHWA: ፷  Nº: 4919  UTF-8: E1 8C B7  UTF-16: 13 37
U+2014 EM DASH: —  Nº: 8212  UTF-8: E2 80 94  UTF-16: 20 14
U+2030 PER MILLE SIGN: ‰  Nº: 8240  UTF-8: E2 80 B0  UTF-16: 20 30
U+20AC EURO SIGN: €  Nº: 8364  UTF-8: E2 82 AC  UTF-16: 20 AC
U+2122 TRADE MARK SIGN: ™  Nº: 8482  UTF-8: E2 84 A2  UTF-16: 21 22
U+2603 SNOWMAN: ☃  Nº: 9731  UTF-8: E2 98 83  UTF-16: 26 03
U+260E BLACK TELEPHONE: ☎  Nº: 9742  UTF-8: E2 98 8E  UTF-16: 26 0E
U+2614 UMBRELLA WITH RAIN DROPS: ☔  Nº: 9748  UTF-8: E2 98 94  UTF-16: 26 14
U+263A WHITE SMILING FACE: ☺  Nº: 9786  UTF-8: E2 98 BA  UTF-16: 26 3A
U+2691 BLACK FLAG: ⚑  Nº: 9873  UTF-8: E2 9A 91  UTF-16: 26 91
U+269B ATOM SYMBOL: ⚛  Nº: 9883  UTF-8: E2 9A 9B  UTF-16: 26 9B
U+2708 AIRPLANE: ✈  Nº: 9992  UTF-8: E2 9C 88  UTF-16: 27 08
U+271E SHADOWED WHITE LATIN CROSS: ✞  Nº: 10014  UTF-8: E2 9C 9E  UTF-16: 27 1E
U+3020 POSTAL MARK FACE: 〠  Nº: 12320  UTF-8: E3 80 A0  UTF-16: 30 20
U+8089 CJK UNIFIED IDEOGRAPH-8089: 肉  Nº: 32905  UTF-8: E8 82 89  UTF-16: 80 89

U+1F4A9 PILE OF POO: 💩  Nº: 128169  UTF-8: F0 9F 92 A9  UTF-16: D8 3D DC A9
U+1F680 ROCKET: 🚀  Nº: 128640  UTF-8: F0 9F 9A 80  UTF-16: D8 3D DE 80

Okay, I'm getting carried away...

Fun facts:

- If you're looking for a specific character, you can copy & paste it on http://codepoints.net/.
- I wasted a lot of time on this useless list (but it's sorted!).
- MySQL has a charset called "utf8" which actually does not support characters longer than 3 bytes. So you can't insert a pile of poo; the field will be silently truncated. Use "utf8mb4" instead.
- There's a snowman test page (unicodesnowmanforyou.com).
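Rows like the ones in the list above can be reproduced with a few lines of Python. This is a rough sketch, not how the list was originally produced: the function name `describe_char` is invented here, and `utf-16-be` is assumed because the list shows the UTF-16 bytes in big-endian order.

```python
import unicodedata


def describe_char(ch: str) -> tuple[str, str, int, str, str]:
    """Return (U+XXXX label, Unicode name, decimal value, UTF-8 hex, UTF-16 hex)."""
    cp = ord(ch)
    utf8 = ch.encode("utf-8").hex(" ").upper()
    # Big-endian without a byte order mark, to match the byte order in the list.
    utf16 = ch.encode("utf-16-be").hex(" ").upper()
    return f"U+{cp:04X}", unicodedata.name(ch), cp, utf8, utf16


for ch in ["a", "\u00a9", "\u2603", "\U0001F4A9"]:
    label, name, number, utf8, utf16 = describe_char(ch)
    print(f"{label} {name}: {ch}  No. {number}  UTF-8: {utf8}  UTF-16: {utf16}")
```

Note that `bytes.hex` only accepts a separator argument from Python 3.8 onward.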

User · Answer

Simply speaking, Unicode is a standard which assigns one number (called a code point) to each character of the world (it's still a work in progress).

Now you need to represent these code points using bytes; that's called character encoding. UTF-8, UTF-16, UTF-6 are ways of representing those characters.

UTF-8 is a multibyte character encoding. Characters can take 1 to 4 bytes (the original design allowed up to 6, but the longer forms are not needed for the current Unicode range).

UTF-32: each character takes 4 bytes.

UTF-16 uses one 16-bit unit for each character in the BMP, the part of Unicode that is enough for most practical purposes, and two 16-bit units (a surrogate pair) for everything else. Java uses this encoding in its strings.

User · Answer

In Unicode the answer is not easily given. The problem, as you already pointed out, is the encodings.

Given any English sentence without diacritic characters, the answer for UTF-8 would be as many bytes as characters, and for UTF-16 it would be the number of characters times two.

The only encoding where (as of now) we can make a statement about the size is UTF-32. There it's always 32 bits per code point, even though I imagine that code points are prepared for a future UTF-64 :)

What makes it so difficult are at least two things:

- composed characters, where instead of using the character entity that is already accented/diacritic (À), a user decided to combine the accent and the base character (A followed by a combining grave accent).
- multi-byte sequences. These are the method by which the UTF encodings can encode more than the number of bits that gives them their name would usually allow. E.g. UTF-8 designates certain bytes which on their own are invalid, but when followed by a valid continuation byte will describe a character beyond the 8-bit range of 0..255. See the Examples and Overlong Encodings sections in the Wikipedia article on UTF-8.

The example given there is that the € character (code point U+20AC) is encoded as the three-byte sequence E2 82 AC. The same bit pattern could also be padded into the four-byte sequence F0 82 82 AC, but that is an overlong encoding: the standard requires the shortest form, so only the three-byte sequence is valid and decoders must reject the other. Details like this show how complicated the answer is when talking about "Unicode" and not about a specific encoding of Unicode, such as UTF-8 or UTF-16.
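The two byte sequences for the euro sign mentioned above can be fed to a strict decoder to see how it treats them. A small sketch using Python's built-in UTF-8 codec; `is_valid_utf8` is a name made up for this example. Python's decoder follows the current standard, which forbids overlong forms, so the four-byte padded sequence is rejected.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


shortest = bytes.fromhex("E2 82 AC")     # U+20AC in its shortest form
overlong = bytes.fromhex("F0 82 82 AC")  # same bits padded to four bytes

print(shortest.decode("utf-8"), is_valid_utf8(shortest))
print(is_valid_utf8(overlong))
```

In practice this means each code point has exactly one valid UTF-8 byte sequence, which is also what makes byte-wise comparison of UTF-8 strings safe.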

User · Answer

Unicode is a standard which provides a unique number for every character. These unique numbers are called code points (which is just a unique code) and are assigned to all characters existing in the world (some are still to be added).

For different purposes, you might need to represent these code points in bytes (most programming languages do so), and here's where character encoding kicks in.

UTF-8, UTF-16, UTF-32 and so on are all character encodings, and Unicode's code points are represented in these encodings in different ways.

UTF-8 has a variable width, and characters encoded in it can occupy 1 to 4 bytes inclusive.

UTF-16 has a variable width too, and characters encoded in it take either 2 or 4 bytes (one or two 16-bit code units). A single code unit covers the BMP (Basic Multilingual Plane), which is enough for almost all cases; characters outside it take two code units. Java uses UTF-16 for its strings and characters.

UTF-32 has a fixed width and each character takes exactly 4 bytes (32 bits).

User · Answer

Strangely enough, nobody pointed out how to calculate how many bytes one Unicode char takes. Here is the rule for UTF-8 encoded strings:

    Binary    Hex          Comments
    0xxxxxxx  0x00..0x7F   Only byte of a 1-byte character encoding
    10xxxxxx  0x80..0xBF   Continuation byte: one of 1-3 bytes following the first
    110xxxxx  0xC0..0xDF   First byte of a 2-byte character encoding
    1110xxxx  0xE0..0xEF   First byte of a 3-byte character encoding
    11110xxx  0xF0..0xF7   First byte of a 4-byte character encoding

So the quick answer is: it takes 1 to 4 bytes, depending on the first one, which indicates how many bytes it'll take up.
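The rule in the table translates almost directly into code. A minimal sketch, with hypothetical function names, that looks only at the lead byte the way the table does; it does not validate the continuation bytes or reject lead bytes that never occur in well-formed UTF-8.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return how many bytes a UTF-8 sequence has, judging by its first byte."""
    if first_byte <= 0x7F:           # 0xxxxxxx
        return 1
    if 0xC0 <= first_byte <= 0xDF:   # 110xxxxx
        return 2
    if 0xE0 <= first_byte <= 0xEF:   # 1110xxxx
        return 3
    if 0xF0 <= first_byte <= 0xF7:   # 11110xxx
        return 4
    # 10xxxxxx is a continuation byte; 0xF8 and above are not used.
    raise ValueError(f"0x{first_byte:02X} cannot start a UTF-8 sequence")


def sequence_lengths(data: bytes) -> list[int]:
    """Walk over UTF-8 bytes and list the length of each encoded code point."""
    lengths = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        lengths.append(n)
        i += n  # skip the continuation bytes of this sequence
    return lengths


print(sequence_lengths("a\u00a9\u20ac\U0001F680".encode("utf-8")))
```

Because the length is known from the first byte alone, a decoder can skip from one character to the next without inspecting the bytes in between.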

User · Answer

Check out this Unicode code converter. For example, enter 0x2009, where 2009 is the Unicode number for thin space, in the "0x... notation" field, and click Convert. The hexadecimal number E2 80 89 (3 bytes) appears in the "UTF-8 code units" field.

User · Answer

Well, I just pulled up the Wikipedia page on it too, and in the intro portion I saw:

"Unicode can be implemented by different character encodings. The most commonly used encodings are UTF-8 (which uses one byte for any ASCII characters, which have the same code values in both UTF-8 and ASCII encoding, and up to four bytes for other characters), the now-obsolete UCS-2 (which uses two bytes for each character but cannot encode every character in the current Unicode standard)"

As this quote demonstrates, your problem is that you are assuming Unicode is a single way of encoding characters. There are actually multiple encodings of Unicode, and, again in that quote, one of them even uses 1 byte per character for the ASCII characters you are used to.

So the simple answer that you want is that it varies.

User · Answer

For UTF-16, a character needs four bytes (two code units) if its first code unit is in the range 0xD800 - 0xDBFF; such a character is encoded as a "surrogate pair". More specifically, a surrogate pair has the form:

    [0xD800 - 0xDBFF]  [0xDC00 - 0xDFFF]

where [...] indicates a two-byte code unit with the given range.

Anything <= 0xD7FF is one code unit (two bytes), and so is anything from 0xE000 to 0xFFFF. A code unit in the range 0xD800 - 0xDFFF that is not part of a well-formed pair is invalid.

See http://unicodebook.readthedocs.io/unicode_encodings.html, section 7.5.
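The surrogate pair arithmetic behind those ranges is short enough to write out. A sketch in Python, with `to_surrogate_pair` as a made-up name: subtract 0x10000 from the code point, then split the remaining 20 bits into a high and a low half of 10 bits each.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF use surrogate pairs")
    offset = code_point - 0x10000        # 20 bits remain
    high = 0xD800 + (offset >> 10)       # top 10 bits    -> 0xD800..0xDBFF
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> 0xDC00..0xDFFF
    return high, low


high, low = to_surrogate_pair(0x1F4A9)
print(f"{high:04X} {low:04X}")

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
print("\U0001F4A9".encode("utf-16-be").hex(" ").upper())
```

Ten bits per surrogate gives 1024 x 1024 combinations, which is exactly the 0x100000 code points between U+10000 and U+10FFFF.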
