What does character set and collation mean exactly

Question

I can read the MySQL documentation and it s pretty clear  But  how does one decide which character set to use  On what data does collation have an effect   I m asking for an explanation of the two and how to choose them

User · Answer

I suggest to use utf8mb4 unicode ci  which is based on the Unicode standard for sorting and comparison  which sorts accurately in a very wide range of languages

User · Answer

A character encoding is a way to encode characters so that they fit in memory  That is  if the charset is ISO-8859-15  the euro symbol       will be encoded as 0xa4  and in UTF-8  it will be 0xe282ac   The collation is how to compare characters  in latin9  there are letters as e          f  if sorted by their binary representation  it will go e f          but if the collation is set to  for example  French  you ll have them in the order you thought they would be  which is all of e          are equal  and then f

User · Answer

A character encoding is a way to encode characters so that they fit in memory  That is  if the charset is ISO-8859-15  the euro symbol       will be encoded as 0xa4  and in UTF-8  it will be 0xe282ac   The collation is how to compare characters  in latin9  there are letters as e          f  if sorted by their binary representation  it will go e f          but if the collation is set to  for example  French  you ll have them in the order you thought they would be  which is all of e          are equal  and then f

User · Answer

From MySQL docs       A character set is a set of symbols   and encodings  A collation is a set of   rules for comparing characters in a   character set  Let s make the   distinction clear with an example of   an imaginary character set       Suppose that we have an alphabet with   four letters   A    B    a    b   We   give each letter a number   A    0     B    1   a    2   b    3  The letter    A  is a symbol  the number 0 is the   encoding for  A   and the combination   of all four letters and their   encodings is a character set       Now  suppose that we want to compare   two string values   A  and  B   The   simplest way to do this is to look at   the encodings  0 for  A  and 1 for    B   Because 0 is less than 1  we say    A  is less than  B   Now  what we ve   just done is apply a collation to our   character set  The collation is a set   of rules  only one rule in this case      compare the encodings   We call this   simplest of all possible collations a   binary collation       But what if we want to say that the   lowercase and uppercase letters are   equivalent  Then we would have at   least two rules   1  treat the   lowercase letters  a  and  b  as   equivalent to  A  and  B    2  then   compare the encodings  We call this a   case-insensitive collation  It s a   little more complex than a binary   collation       In real life  most character sets have   many characters  not just  A  and  B    but whole alphabets  sometimes   multiple alphabets or eastern writing   systems with thousands of characters    along with many special symbols and   punctuation marks  Also in real life    most collations have many rules  not   just case insensitivity but also   accent insensitivity  an  accent  is a   mark attached to a character as in   German       and multiple-character   mappings  such as the rule that           OE  in one of the two German   collations

User · Answer

I suggest to use utf8mb4 unicode ci  which is based on the Unicode standard for sorting and comparison  which sorts accurately in a very wide range of languages

User · Answer

From MySQL docs       A character set is a set of symbols   and encodings  A collation is a set of   rules for comparing characters in a   character set  Let s make the   distinction clear with an example of   an imaginary character set       Suppose that we have an alphabet with   four letters   A    B    a    b   We   give each letter a number   A    0     B    1   a    2   b    3  The letter    A  is a symbol  the number 0 is the   encoding for  A   and the combination   of all four letters and their   encodings is a character set       Now  suppose that we want to compare   two string values   A  and  B   The   simplest way to do this is to look at   the encodings  0 for  A  and 1 for    B   Because 0 is less than 1  we say    A  is less than  B   Now  what we ve   just done is apply a collation to our   character set  The collation is a set   of rules  only one rule in this case      compare the encodings   We call this   simplest of all possible collations a   binary collation       But what if we want to say that the   lowercase and uppercase letters are   equivalent  Then we would have at   least two rules   1  treat the   lowercase letters  a  and  b  as   equivalent to  A  and  B    2  then   compare the encodings  We call this a   case-insensitive collation  It s a   little more complex than a binary   collation       In real life  most character sets have   many characters  not just  A  and  B    but whole alphabets  sometimes   multiple alphabets or eastern writing   systems with thousands of characters    along with many special symbols and   punctuation marks  Also in real life    most collations have many rules  not   just case insensitivity but also   accent insensitivity  an  accent  is a   mark attached to a character as in   German       and multiple-character   mappings  such as the rule that           OE  in one of the two German   collations

User · Answer

A character set is a subset of all written glyphs  A character encoding specifies how those characters are mapped to numeric values  Some character encodings  like UTF-8 and UTF-16  can encode any character in the Universal Character Set  Others  like US-ASCII or ISO-8859-1 can only encode a small subset  since they use 7 and 8 bits per character  respectively  Because many standards specify both a character set and a character encoding  the term  character set  is often substituted freely for  character encoding    A collation comprises rules that specify how characters can be compared for sorting  Collations rules can be locale-specific  the proper order of two characters varies from language to language   Choosing a character set and collation comes down to whether your application is internationalized or not  If not  what locale are you targeting   In order to choose what character set you want to support  you have to consider your application  If you are storing user-supplied input  it might be hard to foresee all the locales in which your software will eventually be used  To support them all  it might be best to support the UCS  Unicode  from the start  However  there is a cost to this  many western European characters will now require two bytes of storage per character instead of one   Choosing the right collation can help performance if your database uses the collation to create an index  and later uses that index to provide sorted results  However  since collation rules are often locale-specific  that index will be worthless if you need to sort results according to the rules of another locale

User · Answer

A character set is a subset of all written glyphs  A character encoding specifies how those characters are mapped to numeric values  Some character encodings  like UTF-8 and UTF-16  can encode any character in the Universal Character Set  Others  like US-ASCII or ISO-8859-1 can only encode a small subset  since they use 7 and 8 bits per character  respectively  Because many standards specify both a character set and a character encoding  the term  character set  is often substituted freely for  character encoding    A collation comprises rules that specify how characters can be compared for sorting  Collations rules can be locale-specific  the proper order of two characters varies from language to language   Choosing a character set and collation comes down to whether your application is internationalized or not  If not  what locale are you targeting   In order to choose what character set you want to support  you have to consider your application  If you are storing user-supplied input  it might be hard to foresee all the locales in which your software will eventually be used  To support them all  it might be best to support the UCS  Unicode  from the start  However  there is a cost to this  many western European characters will now require two bytes of storage per character instead of one   Choosing the right collation can help performance if your database uses the collation to create an index  and later uses that index to provide sorted results  However  since collation rules are often locale-specific  that index will be worthless if you need to sort results according to the rules of another locale

User · Answer

A character encoding is a way to encode characters so that they fit in memory  That is  if the charset is ISO-8859-15  the euro symbol       will be encoded as 0xa4  and in UTF-8  it will be 0xe282ac   The collation is how to compare characters  in latin9  there are letters as e          f  if sorted by their binary representation  it will go e f          but if the collation is set to  for example  French  you ll have them in the order you thought they would be  which is all of e          are equal  and then f

User · Answer

A character set is a subset of all written glyphs  A character encoding specifies how those characters are mapped to numeric values  Some character encodings  like UTF-8 and UTF-16  can encode any character in the Universal Character Set  Others  like US-ASCII or ISO-8859-1 can only encode a small subset  since they use 7 and 8 bits per character  respectively  Because many standards specify both a character set and a character encoding  the term  character set  is often substituted freely for  character encoding    A collation comprises rules that specify how characters can be compared for sorting  Collations rules can be locale-specific  the proper order of two characters varies from language to language   Choosing a character set and collation comes down to whether your application is internationalized or not  If not  what locale are you targeting   In order to choose what character set you want to support  you have to consider your application  If you are storing user-supplied input  it might be hard to foresee all the locales in which your software will eventually be used  To support them all  it might be best to support the UCS  Unicode  from the start  However  there is a cost to this  many western European characters will now require two bytes of storage per character instead of one   Choosing the right collation can help performance if your database uses the collation to create an index  and later uses that index to provide sorted results  However  since collation rules are often locale-specific  that index will be worthless if you need to sort results according to the rules of another locale

User · Answer

A character set is a subset of all written glyphs  A character encoding specifies how those characters are mapped to numeric values  Some character encodings  like UTF-8 and UTF-16  can encode any character in the Universal Character Set  Others  like US-ASCII or ISO-8859-1 can only encode a small subset  since they use 7 and 8 bits per character  respectively  Because many standards specify both a character set and a character encoding  the term  character set  is often substituted freely for  character encoding    A collation comprises rules that specify how characters can be compared for sorting  Collations rules can be locale-specific  the proper order of two characters varies from language to language   Choosing a character set and collation comes down to whether your application is internationalized or not  If not  what locale are you targeting   In order to choose what character set you want to support  you have to consider your application  If you are storing user-supplied input  it might be hard to foresee all the locales in which your software will eventually be used  To support them all  it might be best to support the UCS  Unicode  from the start  However  there is a cost to this  many western European characters will now require two bytes of storage per character instead of one   Choosing the right collation can help performance if your database uses the collation to create an index  and later uses that index to provide sorted results  However  since collation rules are often locale-specific  that index will be worthless if you need to sort results according to the rules of another locale

User · Answer

From MySQL docs       A character set is a set of symbols   and encodings  A collation is a set of   rules for comparing characters in a   character set  Let s make the   distinction clear with an example of   an imaginary character set       Suppose that we have an alphabet with   four letters   A    B    a    b   We   give each letter a number   A    0     B    1   a    2   b    3  The letter    A  is a symbol  the number 0 is the   encoding for  A   and the combination   of all four letters and their   encodings is a character set       Now  suppose that we want to compare   two string values   A  and  B   The   simplest way to do this is to look at   the encodings  0 for  A  and 1 for    B   Because 0 is less than 1  we say    A  is less than  B   Now  what we ve   just done is apply a collation to our   character set  The collation is a set   of rules  only one rule in this case      compare the encodings   We call this   simplest of all possible collations a   binary collation       But what if we want to say that the   lowercase and uppercase letters are   equivalent  Then we would have at   least two rules   1  treat the   lowercase letters  a  and  b  as   equivalent to  A  and  B    2  then   compare the encodings  We call this a   case-insensitive collation  It s a   little more complex than a binary   collation       In real life  most character sets have   many characters  not just  A  and  B    but whole alphabets  sometimes   multiple alphabets or eastern writing   systems with thousands of characters    along with many special symbols and   punctuation marks  Also in real life    most collations have many rules  not   just case insensitivity but also   accent insensitivity  an  accent  is a   mark attached to a character as in   German       and multiple-character   mappings  such as the rule that           OE  in one of the two German   collations

User · Answer

From MySQL docs       A character set is a set of symbols   and encodings  A collation is a set of   rules for comparing characters in a   character set  Let s make the   distinction clear with an example of   an imaginary character set       Suppose that we have an alphabet with   four letters   A    B    a    b   We   give each letter a number   A    0     B    1   a    2   b    3  The letter    A  is a symbol  the number 0 is the   encoding for  A   and the combination   of all four letters and their   encodings is a character set       Now  suppose that we want to compare   two string values   A  and  B   The   simplest way to do this is to look at   the encodings  0 for  A  and 1 for    B   Because 0 is less than 1  we say    A  is less than  B   Now  what we ve   just done is apply a collation to our   character set  The collation is a set   of rules  only one rule in this case      compare the encodings   We call this   simplest of all possible collations a   binary collation       But what if we want to say that the   lowercase and uppercase letters are   equivalent  Then we would have at   least two rules   1  treat the   lowercase letters  a  and  b  as   equivalent to  A  and  B    2  then   compare the encodings  We call this a   case-insensitive collation  It s a   little more complex than a binary   collation       In real life  most character sets have   many characters  not just  A  and  B    but whole alphabets  sometimes   multiple alphabets or eastern writing   systems with thousands of characters    along with many special symbols and   punctuation marks  Also in real life    most collations have many rules  not   just case insensitivity but also   accent insensitivity  an  accent  is a   mark attached to a character as in   German       and multiple-character   mappings  such as the rule that           OE  in one of the two German   collations

User · Answer

A character encoding is a way to encode characters so that they fit in memory  That is  if the charset is ISO-8859-15  the euro symbol       will be encoded as 0xa4  and in UTF-8  it will be 0xe282ac   The collation is how to compare characters  in latin9  there are letters as e          f  if sorted by their binary representation  it will go e f          but if the collation is set to  for example  French  you ll have them in the order you thought they would be  which is all of e          are equal  and then f

[mysql] What does character set and collation mean exactly?

Examples related to mysql

Examples related to database

Examples related to database-design

Examples related to character-set