[php] What is the best collation to use for MySQL with PHP?

I'm wondering if there is a "best" choice for collation in MySQL for a general website where you aren't 100% sure of what will be entered? I understand that all the encodings should be the same, such as MySQL, Apache, the HTML and anything inside PHP.

In the past I have set PHP to output in "UTF-8", but which collation does this match in MySQL? I'm thinking it's one of the UTF-8 ones, but I have used utf8_unicode_ci, utf8_general_ci, and utf8_bin before.

This question is related to php mysql encoding collation

The answer is


The accepted answer fairly definitively suggests using utf8_unicode_ci, and whilst for new projects that's great, I wanted to relate my recent contrary experience just in case it saves anyone some time.

Because utf8_general_ci is the default collation for Unicode in MySQL, if you want to use utf8_unicode_ci then you end up having to specify it in a lot of places.

For example, all client connections not only have a default charset (makes sense to me) but also a default collation (i.e. the collation will always default to utf8_general_ci for unicode).

Likely, if you use utf8_unicode_ci for your fields, your scripts that connect to the database will need to be updated to mention the desired collation explicitly -- otherwise queries using text strings can fail when your connection is using the default collation.

The upshot is that when converting an existing system of any size to Unicode/utf8, you may end up being forced to use utf8_general_ci because of the way MySQL handles defaults.


I found these collation charts helpful. http://collation-charts.org/mysql60/. I'm no sure which is the used utf8_general_ci though.

For example here is the chart for utf8_swedish_ci. It shows which characters it interprets as the same. http://collation-charts.org/mysql60/mysql604.utf8_swedish_ci.html


Actually, you probably want to use utf8_unicode_ci or utf8_general_ci.

  • utf8_general_ci sorts by stripping away all accents and sorting as if it were ASCII
  • utf8_unicode_ci uses the Unicode sort order, so it sorts correctly in more languages

However, if you are only using this to store English text, these shouldn't differ.


For UTF-8 textual information, you should use utf8_general_ci because...

  • utf8_bin: compare strings by the binary value of each character in the string

  • utf8_general_ci: compare strings using general language rules and using case-insensitive comparisons

a.k.a. it will should making searching and indexing the data faster/more efficient/more useful.


In your database upload file, add the followin line before any line:

SET NAMES utf8;

And your problem should be solved.


Essentially, it depends on how you think of a string.

I always use utf8_bin because of the problem highlighted by Guus. In my opinion, as far as the database should be concerned, a string is still just a string. A string is a number of UTF-8 characters. A character has a binary representation so why does it need to know the language you're using? Usually, people will be constructing databases for systems with the scope for multilingual sites. This is the whole point of using UTF-8 as a character set. I'm a bit of a pureist but I think the bug risks heavily outweigh the slight advantage you may get on indexing. Any language related rules should be done at a much higher level than the DBMS.

In my books "value" should never in a million years be equal to "valúe".

If I want to store a text field and do a case insensitive search, I will use MYSQL string functions with PHP functions such as LOWER() and the php function strtolower().


For the case highlighted by Guus, I would strongly suggest using either utf8_unicode_cs (case sensitive, strict matching, ordering correctly for the most part) instead of utf8_bin (strict matching, incorrect ordering).

If the field is intended to be searched, as opposed to matched for a user, then use utf8_general_ci or utf8_unicode_ci. Both are case-insensitive, one will losely match (‘ß’ is equal to ‘s’, and not to ‘ss’). There are also language specific versions, like utf8_german_ci where the lose matching is more suitable for the language specified.

[Edit - nearly 6 years later]

I no longer recommend the "utf8" character set on MySQL, and instead recommend the "utf8mb4" character set. They match almost entirely, but allow for a little (lot) more unicode characters.

Realistically, MySQL should have updated the "utf8" character set and respective collations to match the "utf8" specification, but instead, a separate character set and respective collations as to not impact storage designation for those already using their incomplete "utf8" character set.


Actually, you probably want to use utf8_unicode_ci or utf8_general_ci.

  • utf8_general_ci sorts by stripping away all accents and sorting as if it were ASCII
  • utf8_unicode_ci uses the Unicode sort order, so it sorts correctly in more languages

However, if you are only using this to store English text, these shouldn't differ.


For UTF-8 textual information, you should use utf8_general_ci because...

  • utf8_bin: compare strings by the binary value of each character in the string

  • utf8_general_ci: compare strings using general language rules and using case-insensitive comparisons

a.k.a. it will should making searching and indexing the data faster/more efficient/more useful.


Collations affect how data is sorted and how strings are compared to each other. That means you should use the collation that most of your users expect.

Example from the documentation for charset unicode:

utf8_general_ci also is satisfactory for both German and French, except that ‘ß’ is equal to ‘s’, and not to ‘ss’. If this is acceptable for your application, then you should use utf8_general_ci because it is faster. Otherwise, use utf8_unicode_ci because it is more accurate.

So - it depends on your expected user base and on how much you need correct sorting. For an English user base, utf8_general_ci should suffice, for other languages, like Swedish, special collations have been created.


Be very, very aware of this problem that can occur when using utf8_general_ci.

MySQL will not distinguish between some characters in select statements, if the utf8_general_ci collation is used. This can lead to very nasty bugs - especially for example, where usernames are involved. Depending on the implementation that uses the database tables, this problem could allow malicious users to create a username matching an administrator account.

This problem exposes itself at the very least in early 5.x versions - I'm not sure if this behaviour as changed later.

I'm no DBA, but to avoid this problem, I always go with utf8-bin instead of a case-insensitive one.

The script below describes the problem by example.

-- first, create a sandbox to play in
CREATE DATABASE `sandbox`;
use `sandbox`;

-- next, make sure that your client connection is of the same 
-- character/collate type as the one we're going to test next:
charset utf8 collate utf8_general_ci

-- now, create the table and fill it with values
CREATE TABLE `test` (`key` VARCHAR(16), `value` VARCHAR(16) )
    CHARACTER SET utf8 COLLATE utf8_general_ci;

INSERT INTO `test` VALUES ('Key ONE', 'value'), ('Key TWO', 'valúe');

-- (verify)
SELECT * FROM `test`;

-- now, expose the problem/bug:
SELECT * FROM test WHERE `value` = 'value';

--
-- Note that we get BOTH keys here! MySQLs UTF8 collates that are 
-- case insensitive (ending with _ci) do not distinguish between 
-- both values!
--
-- collate 'utf8_bin' doesn't have this problem, as I'll show next:
--

-- first, reset the client connection charset/collate type
charset utf8 collate utf8_bin

-- next, convert the values that we've previously inserted in the table
ALTER TABLE `test` CONVERT TO CHARACTER SET utf8 COLLATE utf8_bin;

-- now, re-check for the bug
SELECT * FROM test WHERE `value` = 'value';

--
-- Note that we get just one key now, as you'd expect.
--
-- This problem appears to be specific to utf8. Next, I'll try to 
-- do the same with the 'latin1' charset:
--

-- first, reset the client connection charset/collate type
charset latin1 collate latin1_general_ci

-- next, convert the values that we've previously inserted
-- in the table
ALTER TABLE `test` CONVERT TO CHARACTER SET latin1 COLLATE latin1_general_ci;

-- now, re-check for the bug
SELECT * FROM test WHERE `value` = 'value';

--
-- Again, only one key is returned (expected). This shows 
-- that the problem with utf8/utf8_generic_ci isn't present 
-- in latin1/latin1_general_ci
--
-- To complete the example, I'll check with the binary collate
-- of latin1 as well:

-- first, reset the client connection charset/collate type
charset latin1 collate latin1_bin

-- next, convert the values that we've previously inserted in the table
ALTER TABLE `test` CONVERT TO CHARACTER SET latin1 COLLATE latin1_bin;

-- now, re-check for the bug
SELECT * FROM test WHERE `value` = 'value';

--
-- Again, only one key is returned (expected).
--
-- Finally, I'll re-introduce the problem in the exact same 
-- way (for any sceptics out there):

-- first, reset the client connection charset/collate type
charset utf8 collate utf8_generic_ci

-- next, convert the values that we've previously inserted in the table
ALTER TABLE `test` CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;

-- now, re-check for the problem/bug
SELECT * FROM test WHERE `value` = 'value';

--
-- Two keys.
--

DROP DATABASE sandbox;

Actually, you probably want to use utf8_unicode_ci or utf8_general_ci.

  • utf8_general_ci sorts by stripping away all accents and sorting as if it were ASCII
  • utf8_unicode_ci uses the Unicode sort order, so it sorts correctly in more languages

However, if you are only using this to store English text, these shouldn't differ.


It is best to use character set utf8mb4 with the collation utf8mb4_unicode_ci.

The character set, utf8, only supports a small amount of UTF-8 code points, about 6% of possible characters. utf8 only supports the Basic Multilingual Plane (BMP). There 16 other planes. Each plane contains 65,536 characters. utf8mb4 supports all 17 planes.

MySQL will truncate 4 byte UTF-8 characters resulting in corrupted data.

The utf8mb4 character set was introduced in MySQL 5.5.3 on 2010-03-24.

Some of the required changes to use the new character set are not trivial:

  • Changes may need to be made in your application database adapter.
  • Changes will need to be made to my.cnf, including setting the character set, the collation and switching innodb_file_format to Barracuda
  • SQL CREATE statements may need to include: ROW_FORMAT=DYNAMIC
    • DYNAMIC is required for indexes on VARCHAR(192) and larger.

NOTE: Switching to Barracuda from Antelope, may require restarting the MySQL service more than once. innodb_file_format_max does not change until after the MySQL service has been restarted to: innodb_file_format = barracuda.

MySQL uses the old Antelope InnoDB file format. Barracuda supports dynamic row formats, which you will need if you do not want to hit the SQL errors for creating indexes and keys after you switch to the charset: utf8mb4

  • #1709 - Index column size too large. The maximum column size is 767 bytes.
  • #1071 - Specified key was too long; max key length is 767 bytes

The following scenario has been tested on MySQL 5.6.17: By default, MySQL is configured like this:

SHOW VARIABLES;

innodb_large_prefix = OFF
innodb_file_format = Antelope

Stop your MySQL service and add the options to your existing my.cnf:

[client]
default-character-set= utf8mb4

[mysqld]
explicit_defaults_for_timestamp = true
innodb_large_prefix = true
innodb_file_format = barracuda
innodb_file_format_max = barracuda
innodb_file_per_table = true

# Character collation
character_set_server=utf8mb4
collation_server=utf8mb4_unicode_ci

Example SQL CREATE statement:

CREATE TABLE Contacts (
 id INT AUTO_INCREMENT NOT NULL,
 ownerId INT DEFAULT NULL,
 created timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
 modified timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
 contact VARCHAR(640) NOT NULL,
 prefix VARCHAR(128) NOT NULL,
 first VARCHAR(128) NOT NULL,
 middle VARCHAR(128) NOT NULL,
 last VARCHAR(128) NOT NULL,
 suffix VARCHAR(128) NOT NULL,
 notes MEDIUMTEXT NOT NULL,
 INDEX IDX_CA367725E05EFD25 (ownerId),
 INDEX created (created),
 INDEX modified_idx (modified),
 INDEX contact_idx (contact),
 PRIMARY KEY(id)
) DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci ENGINE = InnoDB ROW_FORMAT=DYNAMIC;
  • You can see error #1709 generated for INDEX contact_idx (contact) if ROW_FORMAT=DYNAMIC is removed from the CREATE statement.

NOTE: Changing the index to limit to the first 128 characters on contacteliminates the requirement for using Barracuda with ROW_FORMAT=DYNAMIC

INDEX contact_idx (contact(128)),

Also note: when it says the size of the field is VARCHAR(128), that is not 128 bytes. You can use have 128, 4 byte characters or 128, 1 byte characters.

This INSERT statement should contain the 4 byte 'poo' character in the 2 row:

INSERT INTO `Contacts` (`id`, `ownerId`, `created`, `modified`, `contact`, `prefix`, `first`, `middle`, `last`, `suffix`, `notes`) VALUES
(1, NULL, '0000-00-00 00:00:00', '2014-08-25 03:00:36', '1234567890', '12345678901234567890', '1234567890123456789012345678901234567890', '1234567890123456789012345678901234567890', '12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678', '', ''),
(2, NULL, '0000-00-00 00:00:00', '2014-08-25 03:05:57', 'poo', '12345678901234567890', '', '', '', '', ''),
(3, NULL, '0000-00-00 00:00:00', '2014-08-25 03:05:57', 'poo', '12345678901234567890', '', '', '123', '', '');

You can see the amount of space used by the last column:

mysql> SELECT BIT_LENGTH(`last`), CHAR_LENGTH(`last`) FROM `Contacts`;
+--------------------+---------------------+
| BIT_LENGTH(`last`) | CHAR_LENGTH(`last`) |
+--------------------+---------------------+
|               1024 |                 128 | -- All characters are ASCII
|               4096 |                 128 | -- All characters are 4 bytes
|               4024 |                 128 | -- 3 characters are ASCII, 125 are 4 bytes
+--------------------+---------------------+

In your database adapter, you may want to set the charset and collation for your connection:

SET NAMES 'utf8mb4' COLLATE 'utf8mb4_unicode_ci'

In PHP, this would be set for: \PDO::MYSQL_ATTR_INIT_COMMAND

References:


Collations affect how data is sorted and how strings are compared to each other. That means you should use the collation that most of your users expect.

Example from the documentation for charset unicode:

utf8_general_ci also is satisfactory for both German and French, except that ‘ß’ is equal to ‘s’, and not to ‘ss’. If this is acceptable for your application, then you should use utf8_general_ci because it is faster. Otherwise, use utf8_unicode_ci because it is more accurate.

So - it depends on your expected user base and on how much you need correct sorting. For an English user base, utf8_general_ci should suffice, for other languages, like Swedish, special collations have been created.


Essentially, it depends on how you think of a string.

I always use utf8_bin because of the problem highlighted by Guus. In my opinion, as far as the database should be concerned, a string is still just a string. A string is a number of UTF-8 characters. A character has a binary representation so why does it need to know the language you're using? Usually, people will be constructing databases for systems with the scope for multilingual sites. This is the whole point of using UTF-8 as a character set. I'm a bit of a pureist but I think the bug risks heavily outweigh the slight advantage you may get on indexing. Any language related rules should be done at a much higher level than the DBMS.

In my books "value" should never in a million years be equal to "valúe".

If I want to store a text field and do a case insensitive search, I will use MYSQL string functions with PHP functions such as LOWER() and the php function strtolower().


Actually, you probably want to use utf8_unicode_ci or utf8_general_ci.

  • utf8_general_ci sorts by stripping away all accents and sorting as if it were ASCII
  • utf8_unicode_ci uses the Unicode sort order, so it sorts correctly in more languages

However, if you are only using this to store English text, these shouldn't differ.


In your database upload file, add the followin line before any line:

SET NAMES utf8;

And your problem should be solved.


I found these collation charts helpful. http://collation-charts.org/mysql60/. I'm no sure which is the used utf8_general_ci though.

For example here is the chart for utf8_swedish_ci. It shows which characters it interprets as the same. http://collation-charts.org/mysql60/mysql604.utf8_swedish_ci.html


For UTF-8 textual information, you should use utf8_general_ci because...

  • utf8_bin: compare strings by the binary value of each character in the string

  • utf8_general_ci: compare strings using general language rules and using case-insensitive comparisons

a.k.a. it will should making searching and indexing the data faster/more efficient/more useful.


Be very, very aware of this problem that can occur when using utf8_general_ci.

MySQL will not distinguish between some characters in select statements, if the utf8_general_ci collation is used. This can lead to very nasty bugs - especially for example, where usernames are involved. Depending on the implementation that uses the database tables, this problem could allow malicious users to create a username matching an administrator account.

This problem exposes itself at the very least in early 5.x versions - I'm not sure if this behaviour as changed later.

I'm no DBA, but to avoid this problem, I always go with utf8-bin instead of a case-insensitive one.

The script below describes the problem by example.

-- first, create a sandbox to play in
CREATE DATABASE `sandbox`;
use `sandbox`;

-- next, make sure that your client connection is of the same 
-- character/collate type as the one we're going to test next:
charset utf8 collate utf8_general_ci

-- now, create the table and fill it with values
CREATE TABLE `test` (`key` VARCHAR(16), `value` VARCHAR(16) )
    CHARACTER SET utf8 COLLATE utf8_general_ci;

INSERT INTO `test` VALUES ('Key ONE', 'value'), ('Key TWO', 'valúe');

-- (verify)
SELECT * FROM `test`;

-- now, expose the problem/bug:
SELECT * FROM test WHERE `value` = 'value';

--
-- Note that we get BOTH keys here! MySQLs UTF8 collates that are 
-- case insensitive (ending with _ci) do not distinguish between 
-- both values!
--
-- collate 'utf8_bin' doesn't have this problem, as I'll show next:
--

-- first, reset the client connection charset/collate type
charset utf8 collate utf8_bin

-- next, convert the values that we've previously inserted in the table
ALTER TABLE `test` CONVERT TO CHARACTER SET utf8 COLLATE utf8_bin;

-- now, re-check for the bug
SELECT * FROM test WHERE `value` = 'value';

--
-- Note that we get just one key now, as you'd expect.
--
-- This problem appears to be specific to utf8. Next, I'll try to 
-- do the same with the 'latin1' charset:
--

-- first, reset the client connection charset/collate type
charset latin1 collate latin1_general_ci

-- next, convert the values that we've previously inserted
-- in the table
ALTER TABLE `test` CONVERT TO CHARACTER SET latin1 COLLATE latin1_general_ci;

-- now, re-check for the bug
SELECT * FROM test WHERE `value` = 'value';

--
-- Again, only one key is returned (expected). This shows 
-- that the problem with utf8/utf8_generic_ci isn't present 
-- in latin1/latin1_general_ci
--
-- To complete the example, I'll check with the binary collate
-- of latin1 as well:

-- first, reset the client connection charset/collate type
charset latin1 collate latin1_bin

-- next, convert the values that we've previously inserted in the table
ALTER TABLE `test` CONVERT TO CHARACTER SET latin1 COLLATE latin1_bin;

-- now, re-check for the bug
SELECT * FROM test WHERE `value` = 'value';

--
-- Again, only one key is returned (expected).
--
-- Finally, I'll re-introduce the problem in the exact same 
-- way (for any sceptics out there):

-- first, reset the client connection charset/collate type
charset utf8 collate utf8_generic_ci

-- next, convert the values that we've previously inserted in the table
ALTER TABLE `test` CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;

-- now, re-check for the problem/bug
SELECT * FROM test WHERE `value` = 'value';

--
-- Two keys.
--

DROP DATABASE sandbox;

The accepted answer fairly definitively suggests using utf8_unicode_ci, and whilst for new projects that's great, I wanted to relate my recent contrary experience just in case it saves anyone some time.

Because utf8_general_ci is the default collation for Unicode in MySQL, if you want to use utf8_unicode_ci then you end up having to specify it in a lot of places.

For example, all client connections not only have a default charset (makes sense to me) but also a default collation (i.e. the collation will always default to utf8_general_ci for unicode).

Likely, if you use utf8_unicode_ci for your fields, your scripts that connect to the database will need to be updated to mention the desired collation explicitly -- otherwise queries using text strings can fail when your connection is using the default collation.

The upshot is that when converting an existing system of any size to Unicode/utf8, you may end up being forced to use utf8_general_ci because of the way MySQL handles defaults.


It is best to use character set utf8mb4 with the collation utf8mb4_unicode_ci.

The character set, utf8, only supports a small amount of UTF-8 code points, about 6% of possible characters. utf8 only supports the Basic Multilingual Plane (BMP). There 16 other planes. Each plane contains 65,536 characters. utf8mb4 supports all 17 planes.

MySQL will truncate 4 byte UTF-8 characters resulting in corrupted data.

The utf8mb4 character set was introduced in MySQL 5.5.3 on 2010-03-24.

Some of the required changes to use the new character set are not trivial:

  • Changes may need to be made in your application database adapter.
  • Changes will need to be made to my.cnf, including setting the character set, the collation and switching innodb_file_format to Barracuda
  • SQL CREATE statements may need to include: ROW_FORMAT=DYNAMIC
    • DYNAMIC is required for indexes on VARCHAR(192) and larger.

NOTE: Switching to Barracuda from Antelope, may require restarting the MySQL service more than once. innodb_file_format_max does not change until after the MySQL service has been restarted to: innodb_file_format = barracuda.

MySQL uses the old Antelope InnoDB file format. Barracuda supports dynamic row formats, which you will need if you do not want to hit the SQL errors for creating indexes and keys after you switch to the charset: utf8mb4

  • #1709 - Index column size too large. The maximum column size is 767 bytes.
  • #1071 - Specified key was too long; max key length is 767 bytes

The following scenario has been tested on MySQL 5.6.17: By default, MySQL is configured like this:

SHOW VARIABLES;

innodb_large_prefix = OFF
innodb_file_format = Antelope

Stop your MySQL service and add the options to your existing my.cnf:

[client]
default-character-set= utf8mb4

[mysqld]
explicit_defaults_for_timestamp = true
innodb_large_prefix = true
innodb_file_format = barracuda
innodb_file_format_max = barracuda
innodb_file_per_table = true

# Character collation
character_set_server=utf8mb4
collation_server=utf8mb4_unicode_ci

Example SQL CREATE statement:

CREATE TABLE Contacts (
 id INT AUTO_INCREMENT NOT NULL,
 ownerId INT DEFAULT NULL,
 created timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
 modified timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
 contact VARCHAR(640) NOT NULL,
 prefix VARCHAR(128) NOT NULL,
 first VARCHAR(128) NOT NULL,
 middle VARCHAR(128) NOT NULL,
 last VARCHAR(128) NOT NULL,
 suffix VARCHAR(128) NOT NULL,
 notes MEDIUMTEXT NOT NULL,
 INDEX IDX_CA367725E05EFD25 (ownerId),
 INDEX created (created),
 INDEX modified_idx (modified),
 INDEX contact_idx (contact),
 PRIMARY KEY(id)
) DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci ENGINE = InnoDB ROW_FORMAT=DYNAMIC;
  • You can see error #1709 generated for INDEX contact_idx (contact) if ROW_FORMAT=DYNAMIC is removed from the CREATE statement.

NOTE: Changing the index to limit to the first 128 characters on contacteliminates the requirement for using Barracuda with ROW_FORMAT=DYNAMIC

INDEX contact_idx (contact(128)),

Also note: when it says the size of the field is VARCHAR(128), that is not 128 bytes. You can use have 128, 4 byte characters or 128, 1 byte characters.

This INSERT statement should contain the 4 byte 'poo' character in the 2 row:

INSERT INTO `Contacts` (`id`, `ownerId`, `created`, `modified`, `contact`, `prefix`, `first`, `middle`, `last`, `suffix`, `notes`) VALUES
(1, NULL, '0000-00-00 00:00:00', '2014-08-25 03:00:36', '1234567890', '12345678901234567890', '1234567890123456789012345678901234567890', '1234567890123456789012345678901234567890', '12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678', '', ''),
(2, NULL, '0000-00-00 00:00:00', '2014-08-25 03:05:57', 'poo', '12345678901234567890', '', '', '', '', ''),
(3, NULL, '0000-00-00 00:00:00', '2014-08-25 03:05:57', 'poo', '12345678901234567890', '', '', '123', '', '');

You can see the amount of space used by the last column:

mysql> SELECT BIT_LENGTH(`last`), CHAR_LENGTH(`last`) FROM `Contacts`;
+--------------------+---------------------+
| BIT_LENGTH(`last`) | CHAR_LENGTH(`last`) |
+--------------------+---------------------+
|               1024 |                 128 | -- All characters are ASCII
|               4096 |                 128 | -- All characters are 4 bytes
|               4024 |                 128 | -- 3 characters are ASCII, 125 are 4 bytes
+--------------------+---------------------+

In your database adapter, you may want to set the charset and collation for your connection:

SET NAMES 'utf8mb4' COLLATE 'utf8mb4_unicode_ci'

In PHP, this would be set for: \PDO::MYSQL_ATTR_INIT_COMMAND

References:


Collations affect how data is sorted and how strings are compared to each other. That means you should use the collation that most of your users expect.

Example from the documentation for charset unicode:

utf8_general_ci also is satisfactory for both German and French, except that ‘ß’ is equal to ‘s’, and not to ‘ss’. If this is acceptable for your application, then you should use utf8_general_ci because it is faster. Otherwise, use utf8_unicode_ci because it is more accurate.

So - it depends on your expected user base and on how much you need correct sorting. For an English user base, utf8_general_ci should suffice, for other languages, like Swedish, special collations have been created.


For the case highlighted by Guus, I would strongly suggest using either utf8_unicode_cs (case sensitive, strict matching, ordering correctly for the most part) instead of utf8_bin (strict matching, incorrect ordering).

If the field is intended to be searched, as opposed to matched for a user, then use utf8_general_ci or utf8_unicode_ci. Both are case-insensitive, one will losely match (‘ß’ is equal to ‘s’, and not to ‘ss’). There are also language specific versions, like utf8_german_ci where the lose matching is more suitable for the language specified.

[Edit - nearly 6 years later]

I no longer recommend the "utf8" character set on MySQL, and instead recommend the "utf8mb4" character set. They match almost entirely, but allow for a little (lot) more unicode characters.

Realistically, MySQL should have updated the "utf8" character set and respective collations to match the "utf8" specification, but instead, a separate character set and respective collations as to not impact storage designation for those already using their incomplete "utf8" character set.


Examples related to php

I am receiving warning in Facebook Application using PHP SDK Pass PDO prepared statement to variables Parse error: syntax error, unexpected [ Preg_match backtrack error Removing "http://" from a string How do I hide the PHP explode delimiter from submitted form results? Problems with installation of Google App Engine SDK for php in OS X Laravel 4 with Sentry 2 add user to a group on Registration php & mysql query not echoing in html with tags? How do I show a message in the foreach loop?

Examples related to mysql

Implement specialization in ER diagram How to post query parameters with Axios? PHP with MySQL 8.0+ error: The server requested authentication method unknown to the client Loading class `com.mysql.jdbc.Driver'. This is deprecated. The new driver class is `com.mysql.cj.jdbc.Driver' phpMyAdmin - Error > Incorrect format parameter? Authentication plugin 'caching_sha2_password' is not supported How to resolve Unable to load authentication plugin 'caching_sha2_password' issue Connection Java-MySql : Public Key Retrieval is not allowed How to grant all privileges to root user in MySQL 8.0 MySQL 8.0 - Client does not support authentication protocol requested by server; consider upgrading MySQL client

Examples related to encoding

How to check encoding of a CSV file UnicodeEncodeError: 'ascii' codec can't encode character at special name Using Javascript's atob to decode base64 doesn't properly decode utf-8 strings What is the difference between utf8mb4 and utf8 charsets in MySQL? The character encoding of the plain text document was not declared - mootool script UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 23: ordinal not in range(128) How to encode text to base64 in python UTF-8 output from PowerShell Set Encoding of File to UTF8 With BOM in Sublime Text 3 Replace non-ASCII characters with a single space

Examples related to collation

#1273 – Unknown collation: ‘utf8mb4_unicode_520_ci’ phpmysql error - #1273 - #1273 - Unknown collation: 'utf8mb4_general_ci' How to fix a collation conflict in a SQL Server query? Cannot Resolve Collation Conflict SQL Server - Convert varchar to another collation (code page) to fix character encoding How to change the CHARACTER SET (and COLLATION) throughout a database? SQL Server default character encoding What does 'COLLATE SQL_Latin1_General_CP1_CI_AS' do? Changing SQL Server collation to case insensitive from case sensitive? Troubleshooting "Illegal mix of collations" error in mysql