[unicode] What is Unicode, UTF-8, UTF-16?

ASCII - Software allocates only 8 bit byte in memory for a given character. It works well for English & adopted (loanwords like façade) characters as their corresponding decimal values falls below 128 in the decimal value. Example C program.

UTF-8 - Software allocates 1 to 4 variable 8 bit bytes for a given character. What does mean by variable here? Let us say you are sending the character 'A' through your HTML pages in the browser (HTML is UTF-8), the corresponding decimal value of A is 65, when you convert it into decimal it becomes 01000010. This requires only 1 bytes, 1 byte memory is allocated even for special adopted English characters like 'ç' in a word façade. However, when you want to store European characters, it requires 2 bytes, so you need UTF-8. However, when you go for Asian characters, you require minimum of 2 bytes and maximum of 4 bytes. Similarly, Emoji's require 3 to 4 bytes. UTF-8 will solve all your needs.

UTF-16 will allocate minimum 2 bytes and maximum of 4 bytes per character, it will not allocate 1 or 3 bytes. Each character is either represented in 16 bit or 32 bit.

Then why exists UTF-16? Originally, Unicode was 16 bit not 8 bit. Java adopted the original version of UTF-16.

In a nutshell, you don't need UTF-16 anywhere unless it has been already been adopted by the language or platform you are working on.

Java program invoked by web browsers uses UTF-16 but the web browser sends characters using UTF-8.

Examples related to unicode

How to resolve TypeError: can only concatenate str (not "int") to str (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape UnicodeEncodeError: 'ascii' codec can't encode character at special name Python NLTK: SyntaxError: Non-ASCII character '\xc3' in file (Sentiment Analysis -NLP) HTML for the Pause symbol in audio and video control Javascript: Unicode string to hex Concrete Javascript Regex for Accented Characters (Diacritics) Replace non-ASCII characters with a single space UTF-8 in Windows 7 CMD NameError: global name 'unicode' is not defined - in Python 3

Examples related to encoding

How to check encoding of a CSV file UnicodeEncodeError: 'ascii' codec can't encode character at special name Using Javascript's atob to decode base64 doesn't properly decode utf-8 strings What is the difference between utf8mb4 and utf8 charsets in MySQL? The character encoding of the plain text document was not declared - mootool script UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 23: ordinal not in range(128) How to encode text to base64 in python UTF-8 output from PowerShell Set Encoding of File to UTF8 With BOM in Sublime Text 3 Replace non-ASCII characters with a single space

Examples related to utf-8

error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte Changing PowerShell's default output encoding to UTF-8 'Malformed UTF-8 characters, possibly incorrectly encoded' in Laravel Encoding Error in Panda read_csv Using Javascript's atob to decode base64 doesn't properly decode utf-8 strings What is the difference between utf8mb4 and utf8 charsets in MySQL? what is <meta charset="utf-8">? Pandas df.to_csv("file.csv" encode="utf-8") still gives trash characters for minus sign UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 23: ordinal not in range(128) Android Studio : unmappable character for encoding UTF-8

Examples related to utf-16

Byte and char conversion in Java Convert UTF-8 with BOM to UTF-8 with no BOM in Python Difference between UTF-8 and UTF-16? What is Unicode, UTF-8, UTF-16? UTF-8, UTF-16, and UTF-32