[sql-server] When must we use NVARCHAR/NCHAR instead of VARCHAR/CHAR in SQL Server?

Is there a rule when we must use the Unicode types?

I have seen that most of the European languages (German, Italian, English, ...) are fine in the same database in VARCHAR columns.

I am looking for something like:

  1. If you have Chinese --> use NVARCHAR
  2. If you have German and Arabic --> use NVARCHAR

What about the collation of the server/database?

I don't want to always use NVARCHAR, as suggested in What are the main performance differences between varchar and nvarchar SQL Server data types?


Answers:


Both of the two most upvoted answers are wrong. It has nothing to do with "storing different/multiple languages". You can support Spanish characters like ñ alongside English with just a common varchar field and the Latin1_General_CI_AS COLLATION, e.g.:
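A quick sketch of that (the table variable and column name are just for illustration):

DECLARE @t TABLE (Comment varchar(30) COLLATE Latin1_General_CI_AS);
INSERT INTO @t VALUES (N'El niño añora la mañana');
SELECT Comment FROM @t;
--returns: El niño añora la mañana (the ñ is stored intact in a plain varchar)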

Short Version
You should use NVARCHAR/NCHAR whenever the ENCODING, which is determined by the COLLATION of the field, doesn't support the characters you need.
Also, depending on the SQL Server version, you can use specific COLLATIONs, like Latin1_General_100_CI_AS_SC_UTF8, which is available since SQL Server 2019. Setting this collation on a VARCHAR field (or an entire table/database) will use the UTF-8 ENCODING to store and handle the data in that field, allowing full support for UNICODE characters, and hence for any language covered by it.


To FULLY UNDERSTAND:
To fully understand what I'm about to explain, it's mandatory to have the concepts of UNICODE, ENCODING and COLLATION all extremely clear in your head. If you don't, first take a look at my humble and simplified explanation in the "What is UNICODE, ENCODING, COLLATION and UTF-8, and how they are related" section below, and at the supplied documentation links. Also, everything I say here is specific to Microsoft SQL Server and how it stores and handles data in char/nchar and varchar/nvarchar fields.

Let's say we want to store a peculiar text in our MSSQL Server database. It could be an Instagram comment such as "I love stackoverflow! " followed by an emoji.
The plain English part would be perfectly supported even by ASCII, but since there is also an emoji, which is a character specified in the UNICODE standard, we need an ENCODING that supports this Unicode character.

MSSQL Server uses the COLLATION to determine which ENCODING is used on char/nchar/varchar/nvarchar fields. So, contrary to what many think, COLLATION is not only about sorting and comparing data, but also about ENCODING, and consequently about how our data is stored!

So, HOW DO WE KNOW WHICH ENCODING OUR COLLATION USES? With this:

SELECT COLLATIONPROPERTY( 'Latin1_General_CI_AI' , 'CodePage' ) AS [CodePage]
--returns 1252

This simple SQL returns the Windows Code Page for a COLLATION. A Windows Code Page is nothing more than another mapping to ENCODINGs. For the Latin1_General_CI_AI COLLATION it returns Windows Code Page 1252, which maps to the Windows-1252 ENCODING.
So a varchar column with the Latin1_General_CI_AI COLLATION will handle its data using the Windows-1252 ENCODING, and will only correctly store characters supported by that encoding.

If we check the Windows-1252 ENCODING specification (Character List for Windows-1252), we will find out that this encoding won't support our emoji character. And if we still try it out:

[Image: a text containing UNICODE characters being stored wrongly as question marks, due to the collation and encoding of the varchar field]
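We can reproduce this with a quick sketch (the table variable is hypothetical; the emoji is written as its UTF-16 surrogate pair, NCHAR(0xD83D) + NCHAR(0xDE0D), so the script runs even on databases without an SC collation):

DECLARE @t TABLE (Comment varchar(50) COLLATE Latin1_General_CI_AI);
INSERT INTO @t VALUES (N'I love stackoverflow! ' + NCHAR(0xD83D) + NCHAR(0xDE0D));
SELECT Comment FROM @t;
--returns: I love stackoverflow! ?? (the emoji can't survive the conversion to Windows-1252)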

OK, SO HOW CAN WE SOLVE THIS?? Actually, it depends, and that is GOOD!

NCHAR/NVARCHAR

Before SQL Server 2019, all we had for Unicode were NCHAR and NVARCHAR fields. Some say they are UNICODE fields. THAT IS WRONG! Again, it depends on the field's COLLATION and also on the SQL Server version. Microsoft's "nchar and nvarchar (Transact-SQL)" documentation specifies it perfectly:

Starting with SQL Server 2012 (11.x), when a Supplementary Character (SC) enabled collation is used, these data types store the full range of Unicode character data and use the UTF-16 character encoding. If a non-SC collation is specified, then these data types store only the subset of character data supported by the UCS-2 character encoding.

In other words, if we use a SQL Server older than 2012, like SQL Server 2008 R2 for example, those fields will use the UCS-2 ENCODING, which supports a subset of UNICODE. But if we use SQL Server 2012 or newer and define a COLLATION that has Supplementary Characters enabled, then our field will use the UTF-16 ENCODING, which fully supports UNICODE.
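A minimal sketch of the SC behavior, assuming SQL Server 2012 or newer (the table variable is hypothetical):

DECLARE @t TABLE (Comment nvarchar(50) COLLATE Latin1_General_100_CI_AS_SC);
INSERT INTO @t VALUES (N'I love stackoverflow! ' + NCHAR(0xD83D) + NCHAR(0xDE0D));
SELECT Comment, LEN(Comment) AS chars FROM @t;
--the emoji is stored as UTF-16; with the _SC collation LEN counts it as 1 character,
--while a non-SC collation would count its two surrogate code units separately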


BUT WAIT, THERE IS MORE! WE CAN USE UTF-8 NOW!!

CHAR/VARCHAR

Starting with SQL Server 2019, WE CAN USE CHAR/VARCHAR fields and still fully support UNICODE using UTF-8 ENCODING!!!

From Microsoft's "char and varchar (Transact-SQL)" documentation:

Starting with SQL Server 2019 (15.x), when a UTF-8 enabled collation is used, these data types store the full range of Unicode character data and use the UTF-8 character encoding. If a non-UTF-8 collation is specified, then these data types store only a subset of characters supported by the corresponding code page of that collation.

Again, in other words, if we use a SQL Server older than 2019, like SQL Server 2008 R2 for example, we need to check the ENCODING using the method explained before. But if we use SQL Server 2019 or newer and define a COLLATION like Latin1_General_100_CI_AS_SC_UTF8, then our field will use the UTF-8 ENCODING, which is by far the most used, and an efficient, encoding that supports all the UNICODE characters.
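A minimal sketch, assuming SQL Server 2019 or newer (the table variable is hypothetical):

DECLARE @t TABLE (Comment varchar(50) COLLATE Latin1_General_100_CI_AS_SC_UTF8);
INSERT INTO @t VALUES (N'I love stackoverflow! ' + NCHAR(0xD83D) + NCHAR(0xDE0D));
SELECT Comment, DATALENGTH(Comment) AS bytes FROM @t;
--the emoji survives in a plain varchar field; in UTF-8 it takes 4 bytes, each ASCII character 1 byte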


Bonus Information:

Regarding the OP's observation that "I have seen that most of the European languages (German, Italian, English, ...) are fine in the same database in VARCHAR columns", I think it's nice to know why that is:

For the most common COLLATIONs, like the default ones Latin1_General_CI_AI or SQL_Latin1_General_CP1_CI_AS, the ENCODING will be Windows-1252 for varchar fields. If we take a look at its documentation, we can see that it supports:

English, Irish, Italian, Norwegian, Portuguese, Spanish, Swedish. Plus also German, Finnish and French. And Dutch except the ĳ character

But as I said before, it's not about language, it's about which characters you expect to support/store, as shown in the emoji example, or in a sentence like "The electric resistance of a lithium battery is 0.5Ω", where we again have plain English plus the Greek letter/character omega (the symbol for resistance in ohms), which won't be correctly handled by the Windows-1252 ENCODING.
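This one is easy to verify too; a sketch (Ω is NCHAR(0x03A9), a regular non-supplementary character, yet still outside the Windows-1252 code page):

DECLARE @t TABLE (Sentence varchar(60) COLLATE Latin1_General_CI_AI);
INSERT INTO @t VALUES (N'The electric resistance of a lithium battery is 0.5' + NCHAR(0x03A9));
SELECT Sentence FROM @t;
--returns: The electric resistance of a lithium battery is 0.5?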

Conclusion:

So, there it is! When to use char/nchar and varchar/nvarchar depends on the characters you want to support, and also on the version of your SQL Server, which determines which COLLATIONs, and hence which ENCODINGs, you have available.




What is UNICODE, ENCODING, COLLATION and UTF-8, and how they are related
Note: all the explanations below are simplifications. Please, refer to the supplied documentation links to know all the details about those concepts.

  • UNICODE - Is a standard, a convention, that aims to regulate all the characters in a unified and organized table. In this table, every character has a unique number. This number is commonly called the character's code point.
    UNICODE IS NOT AN ENCODING!

  • ENCODING - Is a mapping between a character and a byte/bytes sequence. So an encoding is used to "transform" a character to bytes and also the other way around, from bytes to a character. Among the most popular ones are UTF-8, ISO-8859-1, Windows-1252 and ASCII. You can think of it as a "conversion table" (I really simplified here).

  • COLLATION - That one is important. Even Microsoft's documentation doesn't make this as clear as it should be. A Collation specifies how your data will be sorted, compared, AND STORED! Yeah, I bet you were not expecting that last one, right!? The collations on SQL Server also determine what ENCODING is used on a particular char/nchar/varchar/nvarchar field.

  • ASCII ENCODING - Was one of the first encodings. It is both a character table (like its own tiny version of UNICODE) and its byte mappings. So it doesn't map a byte to UNICODE, but maps a byte to its own character table. Also, it always uses only 7 bits and supports 128 different characters. That was enough to support all English letters, upper- and lowercase, numbers, punctuation and a limited number of other characters. The problem with ASCII is that since it only used 7 bits and almost every computer was 8-bit at the time, there were another 128 possible characters to be "explored", and everybody started to map these "available" bytes to their own table of characters, creating a lot of different ENCODINGs.

  • UTF-8 ENCODING - This is another ENCODING, one of the most (if not the most) used ENCODINGs around. It uses a variable byte width (one character takes from 1 to 4 bytes; the original design allowed up to 6, but the current specification, RFC 3629, limits it to 4) and fully supports all UNICODE characters.

  • Windows-1252 ENCODING - Also one of the most used ENCODINGs, and widely used on SQL Server. It's fixed-size, so every character is always 1 byte. It supports a lot of accented characters from various languages, but not all of them, nor all of UNICODE. That's why your varchar field with a common collation like Latin1_General_CI_AS supports the á, é, ñ characters even though it isn't using an encoding with full UNICODE support.

Resources:
https://blog.greglow.com/2019/07/25/sql-think-that-varchar-characters-if-so-think-again/
https://medium.com/@apiltamang/unicode-utf-8-and-ascii-encodings-made-easy-5bfbe3a1c45a
https://www.johndcook.com/blog/2019/09/09/how-utf-8-works/
https://www.w3.org/International/questions/qa-what-is-encoding

https://en.wikipedia.org/wiki/List_of_Unicode_characters
https://www.fileformat.info/info/charset/windows-1252/list.htm

https://docs.microsoft.com/en-us/sql/t-sql/data-types/char-and-varchar-transact-sql?view=sql-server-ver15
https://docs.microsoft.com/en-us/sql/t-sql/data-types/nchar-and-nvarchar-transact-sql?view=sql-server-ver15
https://docs.microsoft.com/en-us/sql/t-sql/statements/windows-collation-name-transact-sql?view=sql-server-ver15
https://docs.microsoft.com/en-us/sql/t-sql/statements/sql-server-collation-name-transact-sql?view=sql-server-ver15
https://docs.microsoft.com/en-us/sql/relational-databases/collations/collation-and-unicode-support?view=sql-server-ver15#SQL-collations

SQL Server default character encoding
https://en.wikipedia.org/wiki/Windows_code_page


TL;DR:
Unicode - nchar, nvarchar, and ntext
Non-Unicode - char, varchar, and text

From MSDN

Collations in SQL Server provide sorting rules, case, and accent sensitivity properties for your data. Collations that are used with character data types such as char and varchar dictate the code page and corresponding characters that can be represented for that data type.

Assuming you are using the default SQL collation SQL_Latin1_General_CP1_CI_AS, the following script should print out all the symbols that you can fit in VARCHAR, since it uses one byte to store one character (256 total). If you don't see a character on the printed list, you need NVARCHAR.

DECLARE @i int = 0;
WHILE (@i < 256)
BEGIN
    PRINT CAST(@i AS varchar(3)) + '  ' + CHAR(@i) COLLATE SQL_Latin1_General_CP1_CI_AS;
    PRINT CAST(@i AS varchar(3)) + '  ' + CHAR(@i) COLLATE Japanese_90_CI_AS;
    SET @i = @i + 1;
END

If you change the collation to, let's say, Japanese, you will notice that all the "weird" European letters turn into normal ones and some symbols into ? marks.

Unicode is a standard for mapping code points to characters. Because it is designed to cover all the characters of all the languages of the world, there is no need for different code pages to handle different sets of characters. If you store character data that reflects multiple languages, always use Unicode data types (nchar, nvarchar, and ntext) instead of the non-Unicode data types (char, varchar, and text).

Otherwise your sorting will go weird.


If anyone is facing this issue in MySQL, there is no need to change varchar to nvarchar; you can just change the collation of the column to a utf8 one.
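A minimal MySQL sketch (the table and column names are hypothetical; utf8mb4 is used rather than the legacy utf8, since only utf8mb4 covers 4-byte characters such as emoji):

ALTER TABLE comments MODIFY body VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;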


Greek would need UTF-8 on N column types: αβγ ;)


Josh says: ".... Something to keep in mind when you are using Unicode: although you can store different languages in a single column, you can only sort using a single collation. There are some languages that use Latin characters but do not sort like other Latin languages. Accents are a good example of this; I can't remember the example, but there was an Eastern European language whose Y didn't sort like the English Y. Then there is the Spanish ch, which Spanish users expect to be sorted after h."

I'm a native Spanish speaker, and "ch" is not a letter but the two letters "c" and "h". The Spanish alphabet is: abcdefghijklmn ñ opqrstuvwxyz. We don't expect "ch" after "h" but "i". The alphabet is the same as in English except for the ñ, or in HTML "&ntilde;".

Alex


The real reasons you want to use NVARCHAR are when you have different languages in the same column, you need to address the columns in T-SQL without decoding, you want to be able to see the data "natively" in SSMS, or you want to standardize on Unicode.

If you treat the database as dumb storage, it is perfectly possible to store wide strings and different (even variable-length) encodings in VARCHAR (for instance UTF-8). The problem comes when you are attempting to encode and decode, especially if the code page is different for different rows. It also means that SQL Server will not be able to easily deal with the data for purposes of querying within T-SQL on (potentially variably) encoded columns.

Using NVARCHAR avoids all this.

I would recommend NVARCHAR for any column which will have user-entered data in it which is relatively unconstrained.

I would recommend VARCHAR for any column which is a natural key (like a vehicle license plate, SSN, serial number, service tag, order number, airport callsign, etc.) which is typically defined and constrained by a standard, legislation, or convention. Also VARCHAR for user-entered but very constrained data (like a phone number) or codes (ACTIVE/CLOSED, Y/N, M/F, M/S/D/W, etc.). There is absolutely no reason to use NVARCHAR for those.

So for a simple rule:

VARCHAR when guaranteed to be constrained
NVARCHAR otherwise
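A sketch of that rule as a table definition (hypothetical table; the column choices follow the recommendations above):

CREATE TABLE Vehicle (
    LicensePlate varchar(10)   NOT NULL, -- natural key, constrained by convention
    Status       varchar(6)    NOT NULL, -- code: ACTIVE/CLOSED
    OwnerName    nvarchar(100) NOT NULL, -- relatively unconstrained user input
    OwnerComment nvarchar(400) NULL      -- free text, any language
);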


You should use NVARCHAR any time you have to store multiple languages. I believe you have to use it for the Asian languages, but don't quote me on it.

Here's the problem: if you take Russian, for example, and store it in a varchar, you will be fine so long as you define the correct code page. But let's say you're using a default English SQL install; then the Russian characters will not be handled correctly. If you were using NVARCHAR(), they would be handled properly.

Edit

OK, let me quote MSDN. Maybe I was too specific, but the point is that you don't want to store more than one code page in a varchar column; while you can, you shouldn't:

When you deal with text data that is stored in the char, varchar, varchar(max), or text data type, the most important limitation to consider is that only information from a single code page can be validated by the system. (You can store data from multiple code pages, but this is not recommended.) The exact code page used to validate and store the data depends on the collation of the column. If a column-level collation has not been defined, the collation of the database is used. To determine the code page that is used for a given column, you can use the COLLATIONPROPERTY function, as shown in the following code examples:
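The code examples referred to in that quote were not carried over to this page, but the check is the same COLLATIONPROPERTY call shown in the first answer; a sketch:

SELECT COLLATIONPROPERTY('SQL_Latin1_General_CP1_CI_AS', 'CodePage') AS [CodePage];
--returns 1252
SELECT COLLATIONPROPERTY('Arabic_CI_AS', 'CodePage') AS [CodePage];
--returns 1256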

Here's some more:

This example illustrates the fact that many locales, such as Georgian and Hindi, do not have code pages, as they are Unicode-only collations. Those collations are not appropriate for columns that use the char, varchar, or text data type

So Georgian or Hindi really need to be stored as nvarchar. Arabic is also a problem:

Another problem you might encounter is the inability to store data when not all of the characters you wish to support are contained in the code page. In many cases, Windows considers a particular code page to be a "best fit" code page, which means there is no guarantee that you can rely on the code page to handle all text; it is merely the best one available. An example of this is the Arabic script: it supports a wide array of languages, including Baluchi, Berber, Farsi, Kashmiri, Kazakh, Kirghiz, Pashto, Sindhi, Uighur, Urdu, and more. All of these languages have additional characters beyond those in the Arabic language as defined in Windows code page 1256. If you attempt to store these extra characters in a non-Unicode column that has the Arabic collation, the characters are converted into question marks.

Something to keep in mind when you are using Unicode: although you can store different languages in a single column, you can only sort using a single collation. There are some languages that use Latin characters but do not sort like other Latin languages. Accents are a good example of this; I can't remember the example, but there was an Eastern European language whose Y didn't sort like the English Y. Then there is the Spanish ch, which Spanish users expect to be sorted after h.

All in all, given all the issues you have to deal with around internationalization, it is my opinion that it's easier to just use Unicode characters from the start, avoid the extra conversions, and take the space hit. Hence my statement earlier.

