Difference between UTF-8 and UTF-16

Question

What is the difference between UTF-8 and UTF-16? Why do we need them?

    MessageDigest md = MessageDigest.getInstance("SHA-256");
    String text = "This is some text";

    md.update(text.getBytes("UTF-8")); // Change this to "UTF-16" if needed
    byte[] digest = md.digest();

User · Answer

Security: Use only UTF-8

"Difference between UTF-8 and UTF-16? Why do we need these?"

There have been at least a couple of security vulnerabilities in implementations of UTF-16. See Wikipedia for details (CVE-2008-2938, CVE-2012-2135).

WHATWG and W3C have now declared that only UTF-8 is to be used on the Web:

"The [security] problems outlined here go away when exclusively using UTF-8, which is one of the many reasons that is now the mandatory encoding for all things."

Other groups are saying the same.

So while UTF-16 may continue being used internally by some systems such as Java and Windows, what little use of UTF-16 you may have seen in the past for data files, data exchange, and such will likely fade away entirely.

User · Answer

They're simply different schemes for representing Unicode characters.

Both are variable-length. UTF-16 uses 2 bytes for every character in the Basic Multilingual Plane (BMP), which contains most characters in common use, and 4 bytes (a surrogate pair) for every character above it. UTF-8 uses between 1 and 3 bytes for characters in the BMP and 4 bytes for characters above it, which covers the whole current Unicode range of U+0000 to U+10FFFF. (The original UTF-8 design allowed longer sequences reaching up to U+7FFFFFFF, but RFC 3629 restricts it to 4 bytes and U+10FFFF.) Notably, all ASCII characters are represented in a single byte each.

For the purposes of a message digest it won't matter which of these you pick, so long as everyone who tries to recreate the digest uses the same option.

See this page for more about UTF-8 and Unicode.

Note that a Java char is a single UTF-16 code unit, so on its own it can only represent a character within the BMP. To represent characters above U+FFFF you need to use surrogate pairs in Java.
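
To make the digest point concrete, here is a minimal sketch based on the code in the question. The class name DigestEncodingDemo and the helper sha256 are made up for illustration. It hashes the same text three times: the two UTF-8 digests match each other, while the UTF-16 digest differs because the input bytes differ.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

public class DigestEncodingDemo {
    // Hypothetical helper: SHA-256 over the bytes of `text` in the given charset.
    static byte[] sha256(String text, Charset charset) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            return md.digest(text.getBytes(charset));
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String text = "This is some text";
        byte[] first = sha256(text, StandardCharsets.UTF_8);
        byte[] second = sha256(text, StandardCharsets.UTF_8);
        byte[] other = sha256(text, StandardCharsets.UTF_16);

        // Same text, same encoding: the digests match.
        System.out.println("UTF-8 vs UTF-8 equal:  " + Arrays.equals(first, second));
        // Same text, different encoding: different bytes, so a different digest.
        System.out.println("UTF-8 vs UTF-16 equal: " + Arrays.equals(first, other));
    }
}
```

Passing a Charset constant instead of a charset name also avoids the checked UnsupportedEncodingException that getBytes(String) declares.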

User · Answer

I believe there are a lot of good articles about this around the Web, but here is a short summary.

Both UTF-8 and UTF-16 are variable-length encodings. However, in UTF-8 a character may occupy a minimum of 8 bits, while in UTF-16 character length starts with 16 bits.

Main UTF-8 pros:

- Basic ASCII characters like digits, Latin characters with no accents, etc. occupy one byte which is identical to the US-ASCII representation. This way all US-ASCII strings become valid UTF-8, which provides decent backwards compatibility in many cases.
- No null bytes (other than for the NUL character U+0000 itself), which allows the use of null-terminated strings. This introduces a great deal of backwards compatibility too.
- UTF-8 is independent of byte order, so you don't have to worry about the Big Endian / Little Endian issue.

Main UTF-8 cons:

- Many common characters have different lengths, which slows indexing by codepoint and calculating a codepoint count terribly.
- Even though byte order doesn't matter, sometimes UTF-8 still has a BOM (byte order mark), which serves to notify that the text is encoded in UTF-8, and which also breaks compatibility with ASCII software even if the text only contains ASCII characters. Microsoft software (like Notepad) especially likes to add a BOM to UTF-8.

Main UTF-16 pros:

- BMP (Basic Multilingual Plane) characters, including Latin, Cyrillic, most Chinese (the PRC made support for some codepoints outside the BMP mandatory), and most Japanese, can be represented with 2 bytes. This speeds up indexing and calculating a codepoint count in case the text does not contain supplementary characters.
- Even if the text has supplementary characters, they are still represented by pairs of 16-bit values, which means that the total length is still divisible by two and allows the use of a 16-bit char as the primitive component of the string.

Main UTF-16 cons:

- Lots of null bytes in US-ASCII strings, which means no null-terminated strings and a lot of wasted memory.
- Using it as a fixed-length encoding "mostly works" in many common scenarios (especially in the US / EU / countries with Cyrillic alphabets / Israel / Arab countries / Iran and many others), often leading to broken support where it doesn't. This means programmers have to be aware of surrogate pairs and handle them properly in cases where it matters.
- It's variable-length, so counting or indexing codepoints is costly, though less so than in UTF-8.

In general, UTF-16 is usually better for in-memory representation because BE/LE is irrelevant there (just use native order) and indexing is faster (just don't forget to handle surrogate pairs properly). UTF-8, on the other hand, is extremely good for text files and network protocols because there is no BE/LE issue and null-termination often comes in handy, as does ASCII compatibility.
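
Two of the points above (null bytes in UTF-16 and the surrogate-pair trap) are easy to see from Java. This is only a rough sketch; the class name Utf16Pitfalls and the helper countZeroBytes are invented for the example, and U+1F600 is just one convenient character outside the BMP.

```java
import java.nio.charset.StandardCharsets;

public class Utf16Pitfalls {
    // Hypothetical helper: counts zero (NUL) bytes in an encoded byte array.
    static int countZeroBytes(byte[] bytes) {
        int zeros = 0;
        for (byte b : bytes) {
            if (b == 0) {
                zeros++;
            }
        }
        return zeros;
    }

    public static void main(String[] args) {
        String ascii = "abc";
        // Plain ASCII in UTF-8 contains no zero bytes; in UTF-16 every
        // ASCII character carries one zero byte.
        System.out.println(countZeroBytes(ascii.getBytes(StandardCharsets.UTF_8)));    // prints 0
        System.out.println(countZeroBytes(ascii.getBytes(StandardCharsets.UTF_16BE))); // prints 3

        // U+1F600 lies outside the BMP, so Java stores it as a surrogate pair.
        String mixed = "a" + new String(Character.toChars(0x1F600));
        System.out.println(mixed.length());                          // prints 3 (UTF-16 code units)
        System.out.println(mixed.codePointCount(0, mixed.length())); // prints 2 (code points)

        // Iterate by code point instead of by char to keep the pair together.
        mixed.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```

Code that treats length() as "number of characters" is exactly the "mostly works" case: it is right until the first supplementary character shows up.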

User · Answer

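This is unrelated to UTF-8/16 in general (although it does convert to UTF-16, and the BE/LE part can be set with a single line), yet below is the fastest way to convert a String to byte[]. For instance, it is good exactly for the case provided (hash code). String.getBytes(enc) is relatively slow.

    import java.nio.ByteBuffer;

    static byte[] toBytes(String s) {
        // Two bytes per UTF-16 code unit.
        byte[] b = new byte[s.length() * 2];
        // ByteBuffer defaults to big-endian, so this writes UTF-16BE without a BOM.
        ByteBuffer.wrap(b).asCharBuffer().put(s);
        return b;
    }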
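
For anyone who wants to check what that method produces, here is a small sketch with the byte order made an explicit parameter (the "single line" mentioned above). The class name RawUtf16Bytes and the two-argument toBytes are my own variation, not part of the answer. No timing is measured here; it only compares the output against String.getBytes.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RawUtf16Bytes {
    // Copies the string's UTF-16 code units into a byte array in the given byte order.
    static byte[] toBytes(String s, ByteOrder order) {
        byte[] b = new byte[s.length() * 2];
        // order(...) must be set before asCharBuffer(), which inherits it.
        ByteBuffer.wrap(b).order(order).asCharBuffer().put(s);
        return b;
    }

    public static void main(String[] args) {
        String s = "h\u00e9llo";
        // For well-formed strings the result matches the BOM-less UTF-16 charsets.
        System.out.println(Arrays.equals(
                toBytes(s, ByteOrder.BIG_ENDIAN), s.getBytes(StandardCharsets.UTF_16BE)));    // prints true
        System.out.println(Arrays.equals(
                toBytes(s, ByteOrder.LITTLE_ENDIAN), s.getBytes(StandardCharsets.UTF_16LE))); // prints true
    }
}
```

One difference to keep in mind: getBytes substitutes a replacement for malformed input such as an unpaired surrogate, while the buffer copy writes the code units through unchanged, so the two can disagree on such strings.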

User · Answer

A simple way to tell UTF-8 and UTF-16 apart is to start from what they have in common. They share the same Unicode number (code point) for a given character; beyond that, each is its own format.

UTF-8 represents a code point with one byte if it is ASCII, and otherwise with 2, 3, or 4 bytes.

UTF-16 represents a code point with two bytes to start with. If two bytes are not sufficient, it uses 4 bytes (a surrogate pair). It never needs more than that.

In theory UTF-16 can be more space-efficient, since BMP characters that take 3 bytes in UTF-8 take 2 in UTF-16. In practice UTF-8 is usually more space-efficient, because most of the data being processed is ASCII (98% as a rough estimate), which UTF-8 stores in a single byte per character and UTF-16 in 2 bytes.

Also, UTF-8 is a superset of the ASCII encoding, so any ASCII data is also valid input for a UTF-8 processor. This is not true for UTF-16: it is not byte-compatible with ASCII, and this is a big hurdle for UTF-16 adoption.

Another point to note is that all of Unicode as of now fits in at most 4 bytes of UTF-8 (considering all languages of the world). This is the same maximum as UTF-16, so UTF-16 offers no real saving in space compared to UTF-8 (https://stackoverflow.com/a/8505038/3343801).

So people use UTF-8 wherever possible.
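
The byte counts are easy to confirm in Java. The following is just an illustrative sketch: the class EncodedSizes and its two helpers are names I made up, and the four sample code points were picked to hit each UTF-8 length once.

```java
import java.nio.charset.StandardCharsets;

public class EncodedSizes {
    // Number of bytes the code point occupies in UTF-8.
    static int utf8Size(int codePoint) {
        return new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the code point occupies in UTF-16 (BOM not included).
    static int utf16Size(int codePoint) {
        return new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        // U+0041 (A), U+00E9 (e with acute), U+20AC (euro sign), U+1F600 (an emoji)
        int[] samples = {0x41, 0xE9, 0x20AC, 0x1F600};
        for (int cp : samples) {
            System.out.printf("U+%04X  UTF-8: %d bytes  UTF-16: %d bytes%n",
                    cp, utf8Size(cp), utf16Size(cp));
        }
    }
}
```

The UTF-8 sizes come out as 1, 2, 3 and 4 bytes, and the UTF-16 sizes as 2, 2, 2 and 4 bytes, so only the euro sign row favors UTF-16.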
