UTF-8: General? Bin? Unicode?

Question

I'm trying to figure out what collation I should be using for various types of data. 100% of the content I will be storing is user-submitted.

My understanding is that I should be using UTF-8 General CI (Case-Insensitive) instead of UTF-8 Binary. However, I can't find a clear distinction between UTF-8 General CI and UTF-8 Unicode CI.

Should I be storing user-submitted content in UTF-8 General or UTF-8 Unicode CI columns? What type of data would UTF-8 Binary be applicable to?

User · Answer

utf8_bin compares the bits blindly. No case folding, no accent stripping.

utf8_general_ci compares one character with one character. It does case folding and accent stripping, but no 2-character comparisons: ij is not equal to ĳ in this collation.

utf8_*_ci is a set of language-specific rules, but otherwise like unicode_ci. Some special cases: Ç, Č, ch, ll.

utf8_unicode_ci follows an old Unicode standard for comparisons: ij = ĳ, but ae != æ.

utf8_unicode_520_ci follows a newer Unicode standard: ae = æ.

See collation chart for details on what is equal to what in various utf8 collations.

utf8, as defined by MySQL, is limited to the 1- to 3-byte UTF-8 codes. This leaves out Emoji and some of Chinese. So you should really switch to utf8mb4 if you want to go much beyond Europe.

The above points apply to utf8mb4, after a suitable spelling change. Going forward, utf8mb4 and utf8mb4_unicode_520_ci are preferred.

utf16 and utf32 are variants on utf8; there is virtually no use for them. ucs2 is closer to "Unicode" than "utf8"; there is virtually no use for it.
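The difference between a blind byte comparison and a case- and accent-folding comparison can be sketched outside the database. The Python below is only an analogy, not MySQL's collation code: the helper names `bin_equal` and `folded_equal` are made up for illustration, and accent stripping via Unicode decomposition is an assumption that merely approximates what a one-to-one `_ci` collation does.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Drop combining marks after canonical decomposition (e.g. 'é' -> 'e')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def bin_equal(a: str, b: str) -> bool:
    """Rough analogue of a _bin collation: compare the encoded bytes as-is."""
    return a.encode("utf-8") == b.encode("utf-8")


def folded_equal(a: str, b: str) -> bool:
    """Rough analogue of a one-to-one _ci collation: ignore case and accents."""
    return strip_accents(a).lower() == strip_accents(b).lower()


resume_plain = "Resume"
resume_accented = "r\u00e9sum\u00e9"  # "résumé"
ij_ligature = "\u0133"                # the single character "ĳ"

print(bin_equal(resume_plain, resume_accented))     # False: bytes differ
print(folded_equal(resume_plain, resume_accented))  # True: case and accents ignored
print(folded_equal("ij", ij_ligature))              # False: two characters vs one
```

The last line mirrors the "no 2-character comparisons" point: a character-by-character folding has no way to treat two characters as equal to one.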

User · Answer

Accepted answer is outdated.

If you use MySQL 5.5.3+, use utf8mb4_unicode_ci instead of utf8_unicode_ci to ensure the characters typed by your users won't give you errors.

utf8mb4 supports emojis, for example, whereas utf8 might give you hundreds of encoding-related bugs like:

    Incorrect string value: '\xF0\x9F\x98\x81...' for column 'data' at row 1
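The bytes in that error message are the UTF-8 encoding of a single emoji, which takes four bytes and therefore does not fit MySQL's three-byte utf8. A small Python check makes this visible; `fits_mysql_utf8` is a hypothetical helper written for this illustration, not part of any MySQL client library.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes a single character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def fits_mysql_utf8(text: str) -> bool:
    """True if every character needs at most 3 UTF-8 bytes (the legacy utf8 limit)."""
    return all(utf8_len(ch) <= 3 for ch in text)


grin = "\U0001F601"  # the emoji whose bytes appear in the error message

print(grin.encode("utf-8"))               # b'\xf0\x9f\x98\x81'
print(utf8_len(grin))                     # 4
print(fits_mysql_utf8("h\u00e9llo"))      # True: "é" needs only 2 bytes
print(fits_mysql_utf8("hi " + grin))      # False: would need utf8mb4
```

A check like this could be used to find rows that would be rejected before migrating, though switching the column to utf8mb4 is the actual fix.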

User · Answer

In general, utf8_general_ci is faster than utf8_unicode_ci, but less correct.

Here is the difference:

"For any Unicode character set, operations performed using the _general_ci collation are faster than those for the _unicode_ci collation. For example, comparisons for the utf8_general_ci collation are faster, but slightly less correct, than comparisons for utf8_unicode_ci. The reason for this is that utf8_unicode_ci supports mappings such as expansions; that is, when one character compares as equal to combinations of other characters. For example, in German and some other languages 'ß' is equal to 'ss'. utf8_unicode_ci also supports contractions and ignorable characters. utf8_general_ci is a legacy collation that does not support expansions, contractions, or ignorable characters. It can make only one-to-one comparisons between characters."

Quoted from: http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-sets.html

For a more detailed explanation, please read the following post from the MySQL forums: http://forums.mysql.com/read.php?103,187048,188748

As for utf8_bin: both utf8_general_ci and utf8_unicode_ci perform case-insensitive comparison. In contrast, utf8_bin is case-sensitive (among other differences), because it compares the binary values of the characters.
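The idea of an expansion (one character comparing equal to a sequence of characters) can be shown with Python's built-in string methods. This is an analogy only: `str.lower()` and `str.casefold()` are Python features, not MySQL's collation algorithms, and the two key functions below are hypothetical names chosen to contrast a one-to-one mapping with one that allows expansions.

```python
def one_to_one_key(text: str) -> str:
    """Per-character lowercasing: every character maps to exactly one character."""
    return text.lower()


def expanding_key(text: str) -> str:
    """Full case folding: a character may expand, e.g. 'ß' becomes 'ss'."""
    return text.casefold()


street_eszett = "Stra\u00dfe"  # "Straße"
street_double_s = "STRASSE"

print(expanding_key("\u00df"))  # ss
print(one_to_one_key(street_eszett) == one_to_one_key(street_double_s))  # False
print(expanding_key(street_eszett) == expanding_key(street_double_s))    # True
```

The one-to-one key is cheaper to compute but misses the match, which is the same speed-versus-correctness trade-off the quoted documentation describes.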

User · Answer

Really, I tested saving values like 'é' and 'e' in a column with a unique index, and they cause a duplicate error with both utf8_unicode_ci and utf8_general_ci. You can save them only in a utf8_bin collated column.

And the MySQL docs (at http://dev.mysql.com/doc/refman/5.7/en/charset-applications.html) suggest, in their examples, setting the utf8_general_ci collation:

    [mysqld]
    character-set-server=utf8
    collation-server=utf8_general_ci
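Why a unique index rejects 'é' after 'e' under a case-insensitive collation, but accepts both under a binary one, can be modelled with a toy index keyed by a comparison key. Everything here is a simplified stand-in: the `UniqueIndex` class and both key functions are hypothetical, and the accent stripping only approximates what the `_ci` collations do.

```python
import unicodedata


def ci_key(value: str) -> str:
    """Approximate a _ci collation key: strip accents, then lowercase."""
    decomposed = unicodedata.normalize("NFD", value)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


def bin_key(value: str) -> bytes:
    """Approximate a _bin collation key: the raw UTF-8 bytes."""
    return value.encode("utf-8")


class UniqueIndex:
    """Toy unique index: two values collide when their keys are equal."""

    def __init__(self, key_func):
        self._key_func = key_func
        self._seen = {}

    def insert(self, value: str) -> None:
        key = self._key_func(value)
        if key in self._seen:
            raise ValueError(f"Duplicate entry {value!r} collides with {self._seen[key]!r}")
        self._seen[key] = value


binary_index = UniqueIndex(bin_key)
binary_index.insert("e")
binary_index.insert("\u00e9")  # accepted: the bytes differ

folding_index = UniqueIndex(ci_key)
folding_index.insert("e")
try:
    folding_index.insert("\u00e9")  # "é" folds to the same key as "e"
except ValueError as err:
    print(err)
```

If both spellings must be stored as distinct unique values, the model suggests the same conclusion as the test above: the column needs a binary collation.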

User · Answer

You should also be aware that with utf8_general_ci, when using a varchar field as a unique or primary index, inserting two values like 'a' and 'á' would give a duplicate key error.
