[mysql] What's the difference between utf8_general_ci and utf8_unicode_ci?

Between utf8_general_ci and utf8_unicode_ci, are there any differences in terms of performance?

The answer is


This post describes it very nicely.

In short: utf8_unicode_ci uses the Unicode Collation Algorithm as defined in the Unicode standards, whereas utf8_general_ci is a simpler sort order which results in "less accurate" sorting results.


In brief:

If you need a better sorting order, use utf8_unicode_ci (this is the preferred method),

but if you are chiefly interested in performance, use utf8_general_ci, keeping in mind that it is a little outdated.

The differences in terms of performance are very slight.
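
As a rough sketch of how this choice is applied in practice: a collation can be set as a table default, set per column, or overridden for a single expression with COLLATE. The table and column names below (people, name) are hypothetical and only serve as an illustration. On MySQL 8.0 the utf8 names are expected to still work but may raise deprecation warnings.

```sql
-- Hypothetical table, used only for illustration.
CREATE TABLE people (
  id   INT NOT NULL PRIMARY KEY,
  name VARCHAR(50)
)
CHARACTER SET utf8
COLLATE utf8_unicode_ci;          -- table default: UCA-based ordering

-- Switch one column to the simpler, faster collation.
ALTER TABLE people
  MODIFY name VARCHAR(50) CHARACTER SET utf8 COLLATE utf8_general_ci;

-- Override the column collation for a single query.
SELECT id, name
FROM people
ORDER BY name COLLATE utf8_unicode_ci;
```

The per-query override is convenient for experiments, but it generally cannot use an index built with the column's own collation.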


There are two big differences: sorting and character matching.

Sorting:

  • utf8mb4_general_ci removes all accents and sorts one by one, which may create incorrect sort results.
  • utf8mb4_unicode_ci sorts accurately.

Character Matching

They match characters differently.

For example, in utf8mb4_unicode_ci you have i != ı (dotless i), but in utf8mb4_general_ci it holds that ı = i.

For example, imagine you have a row with name="Yılmaz". Then

select id from users where name='Yilmaz';

would return the row if the collation is utf8mb4_general_ci, but if the collation is utf8mb4_unicode_ci it would not return the row!

On the other hand we have that a=ª and ß=ss in utf8mb4_unicode_ci which is not the case in utf8mb4_general_ci. So imagine you have a row with name="ªßi", then

select id from users where name='assi';

would return the row if the collation is utf8mb4_unicode_ci, but would not return a row if the collation is set to utf8mb4_general_ci.

A full list of matches for each collation may be found here.
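
The matching rules described in this answer can be checked without creating a table, by comparing string literals under an explicit collation. This is only a sketch: it assumes the client sends UTF-8 and that SET NAMES utf8mb4 is acceptable for the session. The column aliases are made up for readability.

```sql
SET NAMES utf8mb4;  -- assumption: make the literals below utf8mb4 strings

SELECT
  -- dotless i (ı) compared with a plain i
  'Yılmaz' = 'Yilmaz' COLLATE utf8mb4_general_ci AS dotless_i_general,
  'Yılmaz' = 'Yilmaz' COLLATE utf8mb4_unicode_ci AS dotless_i_unicode,
  -- ª compared with a, and ß compared with ss
  'ªßi'    = 'assi'   COLLATE utf8mb4_general_ci AS expansion_general,
  'ªßi'    = 'assi'   COLLATE utf8mb4_unicode_ci AS expansion_unicode;
```

If the behaviour is as described above, the four columns should come back as 1, 0, 0, 1.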


Some details (PL)

As we can read here (Peter Gulutzan), there is a difference in sorting/comparing the Polish letter "Ł" (L with stroke - html esc: &#321;) (lower case: "ł" - html esc: &#322;) - we have the following assumptions:

utf8_polish_ci      Ł greater than L and less than M
utf8_unicode_ci     Ł greater than L and less than M
utf8_unicode_520_ci Ł equal to L
utf8_general_ci     Ł greater than Z

In the Polish language the letter Ł comes after the letter L and before M. None of these collations is better or worse - it depends on your needs.
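
A small way to see this ordering for yourself is to sort a handful of letters under each collation. The query below is a sketch, not taken from the linked post: it uses the utf8mb4_ variants of the collations (an assumption; they are expected to order these letters the same way as the utf8_ ones) and a made-up derived table named letters.

```sql
SET NAMES utf8mb4;  -- assumption: literals are sent as utf8mb4

SELECT letter
FROM (
  SELECT 'L' AS letter UNION ALL
  SELECT 'Ł' UNION ALL
  SELECT 'M' UNION ALL
  SELECT 'Z'
) AS letters
ORDER BY letter COLLATE utf8mb4_polish_ci;
```

Replacing utf8mb4_polish_ci with utf8mb4_unicode_ci, utf8mb4_unicode_520_ci or utf8mb4_general_ci should move Ł to the positions listed in the table above.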


I wanted to know what is the performance difference between using utf8_general_ci and utf8_unicode_ci, but I did not find any benchmarks listed on the internet, so I decided to create benchmarks myself.

I created a very simple table with 500,000 rows:

CREATE TABLE test(
  ID INT(11) DEFAULT NULL,
  Description VARCHAR(20) DEFAULT NULL
)
ENGINE = INNODB
CHARACTER SET utf8
COLLATE utf8_general_ci;

Then I filled it with random data by running this stored procedure:

CREATE PROCEDURE randomizer()
BEGIN
  DECLARE i INT DEFAULT 0;
  DECLARE random CHAR(20) ;
  theloop: loop
    SET random = CONV(FLOOR(RAND() * 99999999999999), 20, 36);
    INSERT INTO test VALUES (i+1, random);
    SET i=i+1;
    IF i = 500000 THEN
      LEAVE theloop;
    END IF;
  END LOOP theloop;
END
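
To fill the table, the procedure has to be called once. The snippet below is a minimal sketch that assumes the test table is empty beforehand. In the mysql command-line client, each CREATE PROCEDURE statement also needs to be wrapped in a DELIMITER change so that the semicolons inside the body do not end the statement early.

```sql
-- In the mysql command-line client, define the procedure between
-- "DELIMITER $$" and "DELIMITER ;" and end its body with "END$$".

CALL randomizer();

-- Sanity check: the loop inserts IDs 1 through 500000.
SELECT COUNT(*) AS row_count FROM test;
```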

Then I created the following stored procedures to benchmark simple SELECT, SELECT with LIKE, and sorting (SELECT with ORDER BY):

CREATE PROCEDURE benchmark_simple_select()
BEGIN
  DECLARE i INT DEFAULT 0;
  theloop: loop
    SELECT *
    FROM test
    WHERE Description = 'test' COLLATE utf8_general_ci;
    SET i = i + 1;
    IF i = 30 THEN
      LEAVE theloop;
    END IF;
  END LOOP theloop;
END;

CREATE PROCEDURE benchmark_select_like()
BEGIN
  DECLARE i INT DEFAULT 0;
  theloop: loop
    SELECT *
    FROM test
    WHERE Description LIKE '%test' COLLATE utf8_general_ci;
    SET i = i + 1;
    IF i = 30 THEN
      LEAVE theloop;
    END IF;
  END LOOP theloop;
END;

CREATE PROCEDURE benchmark_order_by()
BEGIN
  DECLARE i INT DEFAULT 0;
  theloop: loop
    SELECT *
    FROM test
    WHERE ID > FLOOR(1 + RAND() * (400000 - 1))
    ORDER BY Description COLLATE utf8_general_ci LIMIT 1000;
    SET i = i + 1;
    IF i = 10 THEN
      LEAVE theloop;
    END IF;
  END LOOP theloop;
END;

In the stored procedures above utf8_general_ci collation is used, but of course during the tests I used both utf8_general_ci and utf8_unicode_ci.
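
For reference, the utf8_unicode_ci variant of the simple SELECT presumably differs only in the COLLATE clause. This is a sketch rather than the exact code used for the benchmark. Note that COLLATE utf8_unicode_ci is only valid when the literal's character set is utf8, so the sketch assumes a utf8 connection character set.

```sql
SET NAMES utf8;  -- assumption: the literal below must be a utf8 string

-- Same query as in benchmark_simple_select(), with only the collation changed.
SELECT *
FROM test
WHERE Description = 'test' COLLATE utf8_unicode_ci;
```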

I called each stored procedure 5 times for each collation (5 times for utf8_general_ci and 5 times for utf8_unicode_ci) and then calculated the average values.
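
The answer does not say how the timings were taken. One possible way to measure a single call from a client session, shown here only as an assumption, is to record a microsecond timestamp before the call and take the difference afterwards (NOW(6) requires MySQL 5.6.4 or later; @t0 is a made-up session variable).

```sql
SET @t0 = NOW(6);                 -- start time with microsecond precision

CALL benchmark_simple_select();

-- Elapsed wall-clock time in milliseconds for this one call.
SELECT TIMESTAMPDIFF(MICROSECOND, @t0, NOW(6)) / 1000 AS elapsed_ms;
```

Repeating this five times per collation and averaging would mirror the procedure described above.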

My results are:

benchmark_simple_select()

  • with utf8_general_ci: 9,957 ms
  • with utf8_unicode_ci: 10,271 ms

In this benchmark using utf8_unicode_ci is slower than utf8_general_ci by 3.2%.

benchmark_select_like()

  • with utf8_general_ci: 11,441 ms
  • with utf8_unicode_ci: 12,811 ms

In this benchmark using utf8_unicode_ci is slower than utf8_general_ci by 12%.

benchmark_order_by()

  • with utf8_general_ci: 11,944 ms
  • with utf8_unicode_ci: 12,887 ms

In this benchmark using utf8_unicode_ci is slower than utf8_general_ci by 7.9%.
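
The percentages quoted for the three benchmarks follow from the averaged timings as (unicode - general) / general. A quick check of that arithmetic, with made-up column aliases:

```sql
SELECT
  ROUND(100 * (10271 - 9957)  / 9957,  1) AS simple_select_pct,  -- 3.2
  ROUND(100 * (12811 - 11441) / 11441, 1) AS select_like_pct,    -- 12.0
  ROUND(100 * (12887 - 11944) / 11944, 1) AS order_by_pct;       -- 7.9
```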


According to this post, there is a considerable performance benefit on MySQL 5.7 when using utf8mb4_general_ci instead of utf8mb4_unicode_ci: https://www.percona.com/blog/2019/02/27/charset-and-collation-settings-impact-on-mysql-performance/


See the MySQL manual, Unicode Character Sets section:

For any Unicode character set, operations performed using the _general_ci collation are faster than those for the _unicode_ci collation. For example, comparisons for the utf8_general_ci collation are faster, but slightly less correct, than comparisons for utf8_unicode_ci. The reason for this is that utf8_unicode_ci supports mappings such as expansions; that is, when one character compares as equal to combinations of other characters. For example, in German and some other languages “ß” is equal to “ss”. utf8_unicode_ci also supports contractions and ignorable characters. utf8_general_ci is a legacy collation that does not support expansions, contractions, or ignorable characters. It can make only one-to-one comparisons between characters.

So to summarize, utf8_general_ci uses a smaller and less correct (according to the standard) set of comparisons than utf8_unicode_ci, which should implement the entire standard. The general_ci collation will be faster because there is less computation to do.

