What s the difference between utf8 general ci and utf8 unicode ci

Question

Between utf8 general ci and utf8 unicode ci  are there any differences in terms of performance

User · Answer

This post describes it very nicely  In short  utf8 unicode ci uses the Unicode Collation Algorithm as defined in the Unicode standards  whereas utf8 general ci is a more simple sort order which results in  quot less accurate quot  sorting results

User · Answer

According to this post  there is a considerably large performance benefit on MySQL 5 7 when using utf8mb4 general ci in stead of utf8mb4 unicode ci  https   www percona com blog 2019 02 27 charset-and-collation-settings-impact-on-mysql-performance

User · Answer

See the mysql manual  Unicode Character Sets section      For any Unicode character set    operations performed using the    general ci collation are faster than those for the  unicode ci collation    For example  comparisons for the   utf8 general ci  collation are faster    but slightly less correct  than   comparisons for utf8 unicode ci  The   reason for this is that   utf8 unicode ci supports mappings such   as expansions  that is  when one   character compares as equal to   combinations of other characters  For   example  in German and some other   languages          is equal to    ss       utf8 unicode ci also supports   contractions and ignorable characters    utf8 general ci  is a legacy collation   that does not support expansions    contractions  or ignorable characters    It can make only one-to-one   comparisons between characters    So to summarize  utf general ci uses a smaller and less correct  according to the standard  set of comparisons than utf unicode ci which should implement the entire standard  The general ci set will be faster because there is less computation to do

User · Answer

There are two big difference the sorting and the character matching   Sorting     utf8mb4 general ci removes all accents and sorts one by one which may create incorrect sort results  utf8mb4 unicode ci sorts accurate    Character Matching   They match characters differently   For example  in utf8mb4 unicode ci you have i    i  but in utf8mb4 general ci it holds i i   For example  imagine you have a row with name  Yilmaz   Then  select id from users where name  Yilmaz     would return the row if collocation is utf8mb4 general ci  but if it is collocated with utf8mb4 unicode ci it would not return the row   On the other hand we have that a    and    ss in utf8mb4 unicode ci which is not the case in utf8mb4 general ci  So imagine you have a row with name      i   then   select id from users where name  assi     would return the row if collocation is utf8mb4 unicode ci  but would not return a row if collocation is set to utf8mb4 general ci   A full list of matches for each collocation may be found here

User · Answer

I wanted to know what is the performance difference between using utf8 general ci and utf8 unicode ci  but I did not find any benchmarks listed on the internet  so I decided to create benchmarks myself   I created a very simple table with 500 000 rows   CREATE TABLE test    ID INT 11  DEFAULT NULL    Description VARCHAR 20  DEFAULT NULL   ENGINE   INNODB CHARACTER SET utf8 COLLATE utf8 general ci    Then I filled it with random data by running this stored procedure   CREATE PROCEDURE randomizer   BEGIN   DECLARE i INT DEFAULT 0    DECLARE random CHAR 20      theloop  loop     SET random   CONV FLOOR RAND     99999999999999   20  36       INSERT INTO test VALUES  i 1  random       SET i i 1      IF i   500000 THEN       LEAVE theloop      END IF    END LOOP theloop  END   Then I created the following stored procedures to benchmark simple SELECT  SELECT with LIKE  and sorting  SELECT with ORDER BY    CREATE PROCEDURE benchmark simple select   BEGIN   DECLARE i INT DEFAULT 0    theloop  loop     SELECT       FROM test     WHERE Description    test  COLLATE utf8 general ci      SET i   i   1      IF i   30 THEN       LEAVE theloop      END IF    END LOOP theloop  END   CREATE PROCEDURE benchmark select like   BEGIN   DECLARE i INT DEFAULT 0    theloop  loop     SELECT       FROM test     WHERE Description LIKE   test  COLLATE utf8 general ci      SET i   i   1      IF i   30 THEN       LEAVE theloop      END IF    END LOOP theloop  END   CREATE PROCEDURE benchmark order by   BEGIN   DECLARE i INT DEFAULT 0    theloop  loop     SELECT       FROM test     WHERE ID  gt  FLOOR 1   RAND      400000 - 1       ORDER BY Description COLLATE utf8 general ci LIMIT 1000      SET i   i   1      IF i   10 THEN       LEAVE theloop      END IF    END LOOP theloop  END    In the stored procedures above utf8 general ci collation is used  but of course during the tests I used both utf8 general ci and utf8 unicode ci   I called each stored procedure 5 times for each collation  5 times for utf8 general ci and 5 times for utf8 unicode ci  and then calculated the average values   My results are   benchmark simple select     with utf8 general ci  9 957 ms   with utf8 unicode ci  10 271 ms     In this benchmark using utf8 unicode ci is slower than utf8 general ci by 3 2    benchmark select like     with utf8 general ci  11 441 ms   with utf8 unicode ci  12 811 ms     In this benchmark using utf8 unicode ci is slower than utf8 general ci by 12    benchmark order by     with utf8 general ci  11 944 ms   with utf8 unicode ci  12 887 ms     In this benchmark using utf8 unicode ci is slower than utf8 general ci by 7 9

User · Answer

For those people still arriving at this question in 2020 or later  there are newer options that may be better than both of these   For example  utf8mb4 0900 ai ci  All these collations are for the UTF-8 character encoding   The differences are in how text is sorted and compared   unicode ci and  general ci are two different sets of rules for sorting and comparing text according to the way we expect   Newer versions of MySQL introduce new sets of rules  too  such as  0900 ai ci for equivalent rules based on Unicode 9 0 - and with no equivalent  general ci variant  People reading this now should probably use one of these newer collations instead of either  unicode ci or  general ci  The description of those older collations below is provided for interest only  MySQL is currently transitioning away from an older  flawed UTF-8 implementation   For now  you need to use utf8mb4 instead of utf8 for the character encoding part  to ensure you are getting the fixed version   The flawed version remains for backward compatibility  though it is being deprecated  Key differences  utf8mb4 unicode ci is based on the official Unicode rules for universal sorting and comparison  which sorts accurately in a wide range of languages   utf8mb4 general ci is a simplified set of sorting rules which aims to do as well as it can while taking many short-cuts designed to improve speed   It does not follow the Unicode rules and will result in undesirable sorting or comparison in some situations  such as when using particular languages or characters  On modern servers  this performance boost will be all but negligible   It was devised in a time when servers had a tiny fraction of the CPU performance of today s computers    Benefits of utf8mb4 unicode ci over utf8mb4 general ci utf8mb4 unicode ci  which uses the Unicode rules for sorting and comparison  employs a fairly complex algorithm for correct sorting in a wide range of languages and when using a wide range of special characters  These rules need to take into account language-specific conventions  not everybody sorts their characters in what we would call  alphabetical order   As far as Latin  ie  quot European quot   languages go  there is not much difference between the Unicode sorting and the simplified utf8mb4 general ci sorting in MySQL  but there are still a few differences   For examples  the Unicode collation sorts  quot    quot  like  quot ss quot   and  quot    quot  like  quot OE quot  as people using those characters would normally want  whereas utf8mb4 general ci sorts them as single characters  presumably like  quot s quot  and  quot e quot  respectively    Some Unicode characters are defined as ignorable  which means they shouldn t count toward the sort order and the comparison should move on to the next character instead   utf8mb4 unicode ci handles these properly    In non-latin languages  such as Asian languages or languages with different alphabets  there may be a lot more differences between Unicode sorting and the simplified utf8mb4 general ci sorting   The suitability of utf8mb4 general ci will depend heavily on the language used   For some languages  it ll be quite inadequate  What should you use  There is almost certainly no reason to use utf8mb4 general ci anymore  as we have left behind the point where CPU speed is low enough that the performance difference would be important   Your database will almost certainly be limited by other bottlenecks than this  In the past  some people recommended to use utf8mb4 general ci except when accurate sorting was going to be important enough to justify the performance cost   Today  that performance cost has all but disappeared  and developers are treating internationalization more seriously  There s an argument to be made that if speed is more important to you than accuracy  you may as well not do any sorting at all   It s trivial to make an algorithm faster if you do not need it to be accurate   So  utf8mb4 general ci is a compromise that s probably not needed for speed reasons and probably also not suitable for accuracy reasons  One other thing I ll add is that even if you know your application only supports the English language  it may still need to deal with people s names  which can often contain characters used in other languages in which it is just as important to sort correctly   Using the Unicode rules for everything helps add peace of mind that the very smart Unicode people have worked very hard to make sorting work properly  What the parts mean Firstly  ci is for case-insensitive sorting and comparison   This means it s suitable for textual data  and case is not important   The other types of collation are cs  case-sensitive  for textual data where case is important  and bin  for where the encoding needs to match  bit for bit  which is suitable for fields which are really encoded binary data  including  for example  Base64    Case-sensitive sorting leads to some weird results and case-sensitive comparison can result in duplicate values differing only in letter case  so case-sensitive collations are falling out of favor for textual data - if case is significant to you  then otherwise ignorable punctuation and so on is probably also significant  and a binary collation might be more appropriate  Next  unicode or general refers to the specific sorting and comparison rules - in particular  the way text is normalized or compared   There are many different sets of rules for the utf8mb4 character encoding  with unicode and general being two that attempt to work well in all possible languages rather than one specific one   The differences between these two sets of rules are the subject of this answer   Note that unicode uses rules from Unicode 4 0   Recent versions of MySQL add the rulesets unicode 520 using rules from Unicode 5 2  and 0900  dropping the  quot unicode  quot  part  using rules from Unicode 9 0  And lastly  utf8mb4 is of course the character encoding used internally   In this answer I m talking only about Unicode based encodings

User · Answer

In brief words   If you need better sorting order - use utf8 unicode ci  this is the preferred method     but if you utterly interested in performance - use utf8 general ci  but know that it is a little outdated   The differences in terms of performance are very slight

User · Answer

Some details  PL   As we can read here  Peter Gulutzan   there is difference on sorting comparing polish letter  L   L with stroke - html esc   amp  321    lower case   l  - html esc   amp  322   - we have following assumption    utf8 polish ci      L greater than L and less than M utf8 unicode ci     L greater than L and less than M utf8 unicode 520 ci L equal to L utf8 general ci     L greater than Z   In polish language letter L is after letter L and before M  No one of this coding is better or worse - it depends of your needs

[mysql] What's the difference between utf8_general_ci and utf8_unicode_ci?

Examples related to mysql

Examples related to unicode

Examples related to utf-8

Examples related to collation

Examples related to character-set