[sql] What does collation mean?

What does collation mean in SQL, and what does it do?

This question is related to sql sql-server-2005 tsql mysql

The answer is


Collation can be simply thought of as sort order.

In English (and it's strange cousin, American), collation may be a pretty simple matter consisting of ordering by the ASCII code.

Once you get into those strange European languages with all their accents and other features, collation changes. For example, though the different accented forms of a may exist at disparate code points, they may all need to be sorted as if they were the same letter.


Collation determines how your data is sorted and compared. It's very often important with regards to internazionalization, e.g. how do you sort japanese kanji?

If you google collation and sql server you'll find plenty of articles discussing it!


Collation means assigning some order to the characters in an Alphabet, say, ASCII or Unicode etc.

Suppose you have 3 characters in your alphabet - {A,B,C}. You can define some example collations for it by assigning integral values to the characters

  1. Example 1 = {A=1,B=2,C=3}
  2. Example 2 = {C=1,B=2,A=3}
  3. Example 3 = {B=1,C=2,A=3}

As a matter of fact, you can define n! collations on an Alphabet of size n. Given such an order, different sorting routines likes LSD/MSD string sorts make use of it for sorting strings.


Collation defines how you sort and compare string values

For example, it defines how to deal with

  • accents (äàa etc)
  • case (Aa)
  • the language context:
    • In a French collation, cote < côte < coté < côté.
    • In the SQL Server Latin1 default , cote < coté < côte < côté
  • ASCII sorts (a binary collation)

Rules that tell how to compare and sort strings: letters order; whether case matters, whether diacritics matter etc.

For instance, if you want all letters to be different (say, if you store filenames in UNIX), you use UTF8_BIN collation:

SELECT  'A' COLLATE UTF8_BIN = 'a' COLLATE UTF8_BIN

---
0

If you want to ignore case and diacritics differences (say, for a search engine), you use UTF8_GENERAL_CI collation:

SELECT  'A' COLLATE UTF8_GENERAL_CI = 'ä' COLLATE UTF8_GENERAL_CI

---
1

As you can see, this collation (comparison rule) considers capital A and lowecase ä the same letter, ignoring case and diacritic differences.


http://en.wikipedia.org/wiki/Collation

Collation is the assembly of written information into a standard order. (...) A collation algorithm such as the Unicode collation algorithm defines an order through the process of comparing two given character strings and deciding which should come before the other.


Besides the "accented letters are sorted differently than unaccented ones" in some Western European languages, you must take into account the groups of letters, which sometimes are sorted differently, also.

Traditionally, in Spanish, "ch" was considered a letter in its own right, same with "ll" (both of which represent a single phoneme), so a list would get sorted like this:

  • caballo
  • cinco
  • coche
  • charco
  • chocolate
  • chueco
  • dado
  • (...)
  • lámpara
  • luego
  • llanta
  • lluvia
  • madera

Notice all the words starting with single c go together, except words starting with ch which go after them, same with ll-starting words which go after all the words starting with a single l. This is the ordering you'll see in old dictionaries and encyclopedias, sometimes even today by very conservative organizations.

The Royal Academy of the Language changed this to make it easier for Spanish to be accomodated in the computing world. Nevertheless, ñ is still considered a different letter than n and goes after it, and before o. So this is a correctly ordered list:

  • Namibia
  • número
  • ñandú
  • ñú
  • obra
  • ojo

By selecting the correct collation, you get all this done for you, automatically :-)


Reference is taken from this Article: A collation is a set of rules for comparing characters in a character set. It has also ruled for sorting of characters and proper order of two characters varies from language to language. A Collation compared two strings like, if a word is greater than another one, and sort accordingly.

If you are using “latin1” Character set, you can use “latin1_swedish_ci” Collation.

You have to choose right collation because wrong collation may affect your database performance.


The collation is how SQL server decides on how to sort and compare text.

See MSDN.


Examples related to sql

Passing multiple values for same variable in stored procedure SQL permissions for roles Generic XSLT Search and Replace template Access And/Or exclusions Pyspark: Filter dataframe based on multiple conditions Subtracting 1 day from a timestamp date PYODBC--Data source name not found and no default driver specified select rows in sql with latest date for each ID repeated multiple times ALTER TABLE DROP COLUMN failed because one or more objects access this column Create Local SQL Server database

Examples related to sql-server-2005

Add a row number to result set of a SQL query SQL Server : Transpose rows to columns Select info from table where row has max date How to query for Xml values and attributes from table in SQL Server? How to restore SQL Server 2014 backup in SQL Server 2008 SQL Server 2005 Using CHARINDEX() To split a string Is it necessary to use # for creating temp tables in SQL server? SQL Query to find the last day of the month JDBC connection to MSSQL server in windows authentication mode How to convert the system date format to dd/mm/yy in SQL Server 2008 R2?

Examples related to tsql

Passing multiple values for same variable in stored procedure Count the Number of Tables in a SQL Server Database Change Date Format(DD/MM/YYYY) in SQL SELECT Statement Stored procedure with default parameters Format number as percent in MS SQL Server EXEC sp_executesql with multiple parameters SQL Server after update trigger How to compare datetime with only date in SQL Server Text was truncated or one or more characters had no match in the target code page including the primary key in an unpivot Printing integer variable and string on same line in SQL

Examples related to mysql

Implement specialization in ER diagram How to post query parameters with Axios? PHP with MySQL 8.0+ error: The server requested authentication method unknown to the client Loading class `com.mysql.jdbc.Driver'. This is deprecated. The new driver class is `com.mysql.cj.jdbc.Driver' phpMyAdmin - Error > Incorrect format parameter? Authentication plugin 'caching_sha2_password' is not supported How to resolve Unable to load authentication plugin 'caching_sha2_password' issue Connection Java-MySql : Public Key Retrieval is not allowed How to grant all privileges to root user in MySQL 8.0 MySQL 8.0 - Client does not support authentication protocol requested by server; consider upgrading MySQL client