Between utf8_general_ci
and utf8_unicode_ci
, are there any differences in terms of performance?
This question is related to
mysql
unicode
utf-8
collation
character-set
This post describes it very nicely.
In short: utf8_unicode_ci
uses the Unicode Collation Algorithm as defined in the Unicode standards, whereas utf8_general_ci
is a more simple sort order which results in "less accurate" sorting results.
In brief words:
If you need better sorting order - use utf8_unicode_ci
(this is the preferred method),
but if you utterly interested in performance - use utf8_general_ci
, but know that it is a little outdated.
The differences in terms of performance are very slight.
There are two big difference the sorting and the character matching:
Sorting:
utf8mb4_general_ci
removes all accents and sorts one by one which may create incorrect sort results.utf8mb4_unicode_ci
sorts accurate.Character Matching
They match characters differently.
For example, in utf8mb4_unicode_ci
you have i != i
, but in utf8mb4_general_ci
it holds i=i
.
For example, imagine you have a row with name="Yilmaz"
. Then
select id from users where name='Yilmaz';
would return the row if collocation is utf8mb4_general_ci
, but if it is collocated with utf8mb4_unicode_ci
it would not return the row!
On the other hand we have that a=ª
and ß=ss
in utf8mb4_unicode_ci
which is not the case in utf8mb4_general_ci
. So imagine you have a row with name="ªßi"
, then
select id from users where name='assi';
would return the row if collocation is utf8mb4_unicode_ci
, but would not return a row if collocation is set to utf8mb4_general_ci
.
A full list of matches for each collocation may be found here.
As we can read here (Peter Gulutzan) there is difference on sorting/comparing polish letter "L" (L with stroke - html esc: Ł
) (lower case: "l" - html esc: ł
) - we have following assumption:
utf8_polish_ci L greater than L and less than M
utf8_unicode_ci L greater than L and less than M
utf8_unicode_520_ci L equal to L
utf8_general_ci L greater than Z
In polish language letter L
is after letter L
and before M
. No one of this coding is better or worse - it depends of your needs.
I wanted to know what is the performance difference between using utf8_general_ci
and utf8_unicode_ci
, but I did not find any benchmarks listed on the internet, so I decided to create benchmarks myself.
I created a very simple table with 500,000 rows:
CREATE TABLE test(
ID INT(11) DEFAULT NULL,
Description VARCHAR(20) DEFAULT NULL
)
ENGINE = INNODB
CHARACTER SET utf8
COLLATE utf8_general_ci;
Then I filled it with random data by running this stored procedure:
CREATE PROCEDURE randomizer()
BEGIN
DECLARE i INT DEFAULT 0;
DECLARE random CHAR(20) ;
theloop: loop
SET random = CONV(FLOOR(RAND() * 99999999999999), 20, 36);
INSERT INTO test VALUES (i+1, random);
SET i=i+1;
IF i = 500000 THEN
LEAVE theloop;
END IF;
END LOOP theloop;
END
Then I created the following stored procedures to benchmark simple SELECT
, SELECT
with LIKE
, and sorting (SELECT
with ORDER BY
):
CREATE PROCEDURE benchmark_simple_select()
BEGIN
DECLARE i INT DEFAULT 0;
theloop: loop
SELECT *
FROM test
WHERE Description = 'test' COLLATE utf8_general_ci;
SET i = i + 1;
IF i = 30 THEN
LEAVE theloop;
END IF;
END LOOP theloop;
END;
CREATE PROCEDURE benchmark_select_like()
BEGIN
DECLARE i INT DEFAULT 0;
theloop: loop
SELECT *
FROM test
WHERE Description LIKE '%test' COLLATE utf8_general_ci;
SET i = i + 1;
IF i = 30 THEN
LEAVE theloop;
END IF;
END LOOP theloop;
END;
CREATE PROCEDURE benchmark_order_by()
BEGIN
DECLARE i INT DEFAULT 0;
theloop: loop
SELECT *
FROM test
WHERE ID > FLOOR(1 + RAND() * (400000 - 1))
ORDER BY Description COLLATE utf8_general_ci LIMIT 1000;
SET i = i + 1;
IF i = 10 THEN
LEAVE theloop;
END IF;
END LOOP theloop;
END;
In the stored procedures above utf8_general_ci
collation is used, but of course during the tests I used both utf8_general_ci
and utf8_unicode_ci
.
I called each stored procedure 5 times for each collation (5 times for utf8_general_ci
and 5 times for utf8_unicode_ci
) and then calculated the average values.
My results are:
benchmark_simple_select()
utf8_general_ci
: 9,957 ms utf8_unicode_ci
: 10,271 ms In this benchmark using utf8_unicode_ci
is slower than utf8_general_ci
by 3.2%.
benchmark_select_like()
utf8_general_ci
: 11,441 ms utf8_unicode_ci
: 12,811 ms In this benchmark using utf8_unicode_ci
is slower than utf8_general_ci
by 12%.
benchmark_order_by()
utf8_general_ci
: 11,944 ms utf8_unicode_ci
: 12,887 ms In this benchmark using utf8_unicode_ci
is slower than utf8_general_ci
by 7.9%.
According to this post, there is a considerably large performance benefit on MySQL 5.7 when using utf8mb4_general_ci in stead of utf8mb4_unicode_ci: https://www.percona.com/blog/2019/02/27/charset-and-collation-settings-impact-on-mysql-performance/
See the mysql manual, Unicode Character Sets section:
For any Unicode character set, operations performed using the _general_ci collation are faster than those for the _unicode_ci collation. For example, comparisons for the utf8_general_ci collation are faster, but slightly less correct, than comparisons for utf8_unicode_ci. The reason for this is that utf8_unicode_ci supports mappings such as expansions; that is, when one character compares as equal to combinations of other characters. For example, in German and some other languages “ß” is equal to “ss”. utf8_unicode_ci also supports contractions and ignorable characters. utf8_general_ci is a legacy collation that does not support expansions, contractions, or ignorable characters. It can make only one-to-one comparisons between characters.
So to summarize, utf_general_ci uses a smaller and less correct (according to the standard) set of comparisons than utf_unicode_ci which should implement the entire standard. The general_ci set will be faster because there is less computation to do.
Source: Stackoverflow.com