[sql-server] What does 'COLLATE SQL_Latin1_General_CP1_CI_AS' do?

I have an SQL query to create the database in SQLServer as given below:

create database yourdb
on
( name = 'yourdb_dat',
  filename = 'c:\program files\microsoft sql server\mssql.1\mssql\data\yourdbdat.mdf',
  size = 25mb,
  maxsize = 1500mb,
  filegrowth = 10mb )
log on
( name = 'yourdb_log',
  filename = 'c:\program files\microsoft sql server\mssql.1\mssql\data\yourdblog.ldf',
  size = 7mb,
  maxsize = 375mb,
  filegrowth = 10mb )
COLLATE SQL_Latin1_General_CP1_CI_AS;
go

It runs fine.

While rest of the SQL is clear to be I am quite confused about the functionality of COLLATE SQL_Latin1_General_CP1_CI_AS.

Can anyone explain this to me? Also, I would like to know if creating the database in this way is a best practice?

This question is related to sql-server database tsql collation

The answer is


The COLLATE keyword specify what kind of character set and rules (order, confrontation rules) you are using for string values.

For example in your case you are using Latin rules with case insensitive (CI) and accent sensitive (AS)

You can refer to this Documentation


The CP1 means 'Code Page 1' - technically this translates to code page 1252


This specifies the default collation for the database. Every text field that you create in tables in the database will use that collation, unless you specify a different one.

A database always has a default collation. If you don't specify any, the default collation of the SQL Server instance is used.

The name of the collation that you use shows that it uses the Latin1 code page 1, is case insensitive (CI) and accent sensitive (AS). This collation is used in the USA, so it will contain sorting rules that are used in the USA.

The collation decides how text values are compared for equality and likeness, and how they are compared when sorting. The code page is used when storing non-unicode data, e.g. varchar fields.


Please be aware that the accepted answer is a bit incomplete. Yes, at the most basic level Collation handles sorting. BUT, the comparison rules defined by the chosen Collation are used in many places outside of user queries against user data.

If "What does COLLATE SQL_Latin1_General_CP1_CI_AS do?" means "What does the COLLATE clause of CREATE DATABASE do?", then:

The COLLATE {collation_name} clause of the CREATE DATABASE statement specifies the default Collation of the Database, and not the Server; Database-level and Server-level default Collations control different things.

Server (i.e. Instance)-level controls:

  • Database-level Collation for system Databases: master, model, msdb, and tempdb.
  • Due to controlling the DB-level Collation of tempdb, it is then the default Collation for string columns in temporary tables (global and local), but not table variables.
  • Due to controlling the DB-level Collation of master, it is then the Collation used for Server-level data, such as Database names (i.e. name column in sys.databases), Login names, etc.
  • Handling of parameter / variable names
  • Handling of cursor names
  • Handling of GOTO labels
  • Default Collation used for newly created Databases when the COLLATE clause is missing

Database-level controls:

  • Default Collation used for newly created string columns (CHAR, VARCHAR, NCHAR, NVARCHAR, TEXT, and NTEXT -- but don't use TEXT or NTEXT) when the COLLATE clause is missing from the column definition. This goes for both CREATE TABLE and ALTER TABLE ... ADD statements.
  • Default Collation used for string literals (i.e. 'some text') and string variables (i.e. @StringVariable). This Collation is only ever used when comparing strings and variables to other strings and variables. When comparing strings / variables to columns, then the Collation of the column will be used.
  • The Collation used for Database-level meta-data, such as object names (i.e. sys.objects), column names (i.e. sys.columns), index names (i.e. sys.indexes), etc.
  • The Collation used for Database-level objects: tables, columns, indexes, etc.

Also:

  • ASCII is an encoding which is 8-bit (for common usage; technically "ASCII" is 7-bit with character values 0 - 127, and "ASCII Extended" is 8-bit with character values 0 - 255). This group is the same across cultures.
  • The Code Page is the "extended" part of Extended ASCII, and controls which characters are used for values 128 - 255. This group varies between each culture.
  • Latin1 does not mean "ASCII" since standard ASCII only covers values 0 - 127, and all code pages (that can be represented in SQL Server, and even NVARCHAR) map those same 128 values to the same characters.

If "What does COLLATE SQL_Latin1_General_CP1_CI_AS do?" means "What does this particular collation do?", then:

  • Because the name start with SQL_, this is a SQL Server collation, not a Windows collation. These are definitely obsolete, even if not officially deprecated, and are mainly for pre-SQL Server 2000 compatibility. Although, quite unfortunately SQL_Latin1_General_CP1_CI_AS is very common due to it being the default when installing on an OS using US English as its language. These collations should be avoided if at all possible.

    Windows collations (those with names not starting with SQL_) are newer, more functional, have consistent sorting between VARCHAR and NVARCHAR for the same values, and are being updated with additional / corrected sort weights and uppercase/lowercase mappings. These collations also don't have the potential performance problem that the SQL Server collations have: Impact on Indexes When Mixing VARCHAR and NVARCHAR Types.

  • Latin1_General is the culture / locale.
    • For NCHAR, NVARCHAR, and NTEXT data this determines the linguistic rules used for sorting and comparison.
    • For CHAR, VARCHAR, and TEXT data (columns, literals, and variables) this determines the:
      • linguistic rules used for sorting and comparison.
      • code page used to encode the characters. For example, Latin1_General collations use code page 1252, Hebrew collations use code page 1255, and so on.
  • CP{code_page} or {version}

    • For SQL Server collations: CP{code_page}, is the 8-bit code page that determines what characters map to values 128 - 255. While there are four code pages for Double-Byte Character Sets (DBCS) that can use 2-byte combinations to create more than 256 characters, these are not available for the SQL Server collations.
    • For Windows collations: {version}, while not present in all collation names, refers to the SQL Server version in which the collation was introduced (for the most part). Windows collations with no version number in the name are version 80 (meaning SQL Server 2000 as that is version 8.0). Not all versions of SQL Server come with new collations, so there are gaps in the version numbers. There are some that are 90 (for SQL Server 2005, which is version 9.0), most are 100 (for SQL Server 2008, version 10.0), and a small set has 140 (for SQL Server 2017, version 14.0).

      I said "for the most part" because the collations ending in _SC were introduced in SQL Server 2012 (version 11.0), but the underlying data wasn't new, they merely added support for supplementary characters for the built-in functions. So, those endings exist for version 90 and 100 collations, but only starting in SQL Server 2012.

  • Next you have the sensitivities, that can be in any combination of the following, but always specified in this order:
    • CS = case-sensitive or CI = case-insensitive
    • AS = accent-sensitive or AI = accent-insensitive
    • KS = Kana type-sensitive or missing = Kana type-insensitive
    • WS = width-sensitive or missing = width insensitive
    • VSS = variation selector sensitive (only available in the version 140 collations) or missing = variation selector insensitive
  • Optional last piece:

    • _SC at the end means "Supplementary Character support". The "support" only affects how the built-in functions interpret surrogate pairs (which are how supplementary characters are encoded in UTF-16). Without _SC at the end (or _140_ in the middle), built-in functions don't see a single supplementary character, but instead see two meaningless code points that make up the surrogate pair. This ending can be added to any non-binary, version 90 or 100 collation.
    • _BIN or _BIN2 at the end means "binary" sorting and comparison. Data is still stored the same, but there are no linguistic rules. This ending is never combined with any of the 5 sensitivities or _SC. _BIN is the older style, and _BIN2 is the newer, more accurate style. If using SQL Server 2005 or newer, use _BIN2. For details on the differences between _BIN and _BIN2, please see: Differences Between the Various Binary Collations (Cultures, Versions, and BIN vs BIN2).
    • _UTF8 is a new option as of SQL Server 2019. It's an 8-bit encoding that allows for Unicode data to be stored in VARCHAR and CHAR datatypes (but not the deprecated TEXT datatype). This option can only be used on collations that support supplementary characters (i.e. version 90 or 100 collations with _SC in their name, and version 140 collations). There is also a single binary _UTF8 collation (_BIN2, not _BIN).

      PLEASE NOTE: UTF-8 was designed / created for compatibility with environments / code that are set up for 8-bit encodings yet want to support Unicode. Even though there are a few scenarios where UTF-8 can provide up to 50% space savings as compared to NVARCHAR, that is a side-effect and has a cost of a slight hit to performance in many / most operations. If you need this for compatibility, then the cost is acceptable. If you want this for space-savings, you had better test, and TEST AGAIN. Testing includes all functionality, and more than just a few rows of data. Be warned that UTF-8 collations work best when ALL columns, and the database itself, are using VARCHAR data (columns, variables, string literals) with a _UTF8 collation. This is the natural state for anyone using this for compatibility, but not for those hoping to use it for space-savings. Be careful when mixing VARCHAR data using a _UTF8 collation with either VARCHAR data using non-_UTF8 collations or NVARCHAR data, as you might experience odd behavior / data loss. For more details on the new UTF-8 collations, please see: Native UTF-8 Support in SQL Server 2019: Savior or False Prophet?


Examples related to sql-server

Passing multiple values for same variable in stored procedure SQL permissions for roles Count the Number of Tables in a SQL Server Database Visual Studio 2017 does not have Business Intelligence Integration Services/Projects ALTER TABLE DROP COLUMN failed because one or more objects access this column Create Local SQL Server database How to create temp table using Create statement in SQL Server? SQL Query Where Date = Today Minus 7 Days How do I pass a list as a parameter in a stored procedure? SQL Server date format yyyymmdd

Examples related to database

Implement specialization in ER diagram phpMyAdmin - Error > Incorrect format parameter? Authentication plugin 'caching_sha2_password' cannot be loaded Room - Schema export directory is not provided to the annotation processor so we cannot export the schema SQL Query Where Date = Today Minus 7 Days MySQL Error: : 'Access denied for user 'root'@'localhost' SQL Server date format yyyymmdd How to create a foreign key in phpmyadmin WooCommerce: Finding the products in database TypeError: tuple indices must be integers, not str

Examples related to tsql

Passing multiple values for same variable in stored procedure Count the Number of Tables in a SQL Server Database Change Date Format(DD/MM/YYYY) in SQL SELECT Statement Stored procedure with default parameters Format number as percent in MS SQL Server EXEC sp_executesql with multiple parameters SQL Server after update trigger How to compare datetime with only date in SQL Server Text was truncated or one or more characters had no match in the target code page including the primary key in an unpivot Printing integer variable and string on same line in SQL

Examples related to collation

#1273 – Unknown collation: ‘utf8mb4_unicode_520_ci’ phpmysql error - #1273 - #1273 - Unknown collation: 'utf8mb4_general_ci' How to fix a collation conflict in a SQL Server query? Cannot Resolve Collation Conflict SQL Server - Convert varchar to another collation (code page) to fix character encoding How to change the CHARACTER SET (and COLLATION) throughout a database? SQL Server default character encoding What does 'COLLATE SQL_Latin1_General_CP1_CI_AS' do? Changing SQL Server collation to case insensitive from case sensitive? Troubleshooting "Illegal mix of collations" error in mysql