[mysql] Is there a REAL performance difference between INT and VARCHAR primary keys?

Is there a measurable performance difference between using INT vs. VARCHAR as a primary key in MySQL? I'd like to use VARCHAR as the primary key for reference lists (think US States, Country Codes) and a coworker won't budge on the INT AUTO_INCREMENT as a primary key for all tables.

My argument, as detailed here, is that the performance difference between INT and VARCHAR is negligible: every INT foreign key reference requires a JOIN to make sense of the reference, whereas a VARCHAR key presents the information directly.
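
To make the two layouts concrete, here is a minimal sketch of the two designs being compared (table and column names are mine, purely illustrative):

-- Design A: INT surrogate key; reading an address requires a JOIN to see the state
CREATE TABLE states_a (
    id   INT AUTO_INCREMENT PRIMARY KEY,
    code CHAR(2) NOT NULL UNIQUE,           -- e.g. 'CA'
    name VARCHAR(50) NOT NULL
) ENGINE=InnoDB;

CREATE TABLE addresses_a (
    id       INT AUTO_INCREMENT PRIMARY KEY,
    street   VARCHAR(255),
    state_id INT NOT NULL,
    FOREIGN KEY (state_id) REFERENCES states_a(id)
) ENGINE=InnoDB;

-- Design B: natural key; the foreign key column is already meaningful on its own
CREATE TABLE states_b (
    code CHAR(2) PRIMARY KEY,                -- 'CA', 'NY', ...
    name VARCHAR(50) NOT NULL
) ENGINE=InnoDB;

CREATE TABLE addresses_b (
    id         INT AUTO_INCREMENT PRIMARY KEY,
    street     VARCHAR(255),
    state_code CHAR(2) NOT NULL,
    FOREIGN KEY (state_code) REFERENCES states_b(code)
) ENGINE=InnoDB;

With design B, a query that only needs the state code never touches the states table; with design A it always does.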

So, does anyone have experience with this particular use-case and the performance concerns associated with it?

Tags: mysql, performance, primary-key, innodb, myisam


It's not about performance. It's about what makes a good primary key: unique and unchanging over time. You may think an entity such as a country code never changes over time and would be a good candidate for a primary key, but bitter experience shows that is seldom the case.

INT AUTO_INCREMENT meets the "unique and unchanging over time" condition. Hence the preference.


I faced the same dilemma. I built a DW (constellation schema) with 3 fact tables, Road Accidents, Vehicles in Accidents, and Casualties in Accidents, plus 60 dimension tables. The data covers all accidents recorded in the UK from 1979 to 2012. Altogether, about 20 million records.

Fact tables relationships:

+----------+          +---------+
| Accident |>--------<| Vehicle |
+-----v----+ 1      * +----v----+
     1|                    |1
      |    +----------+    |
      +---<| Casualty |>---+
         * +----------+ *

RDBMS: MySQL 5.6

Natively, the Accident index is a 15-character varchar (numbers and letters). I tried to avoid surrogate keys, since the accident indexes would never change. On an i7 (8-core) machine, the DW became too slow to query after loading 12 million records, depending on the dimensions. After a lot of rework and adding BIGINT surrogate keys, I got an average 20% speed boost. Still a low performance gain, but a valid try. I'm now working on MySQL tuning and clustering.


As usual, there are no blanket answers. 'It depends!' and I am not being facetious. My understanding of the original question was about keys on small tables - like Country (integer id or char/varchar code) - used as a foreign key into a potentially huge table like an address/contact table.

There are two scenarios here when you want data back from the DB. The first is a list/search kind of query where you want to list all the contacts with state and country codes or names (ids will not help and hence will need a lookup). The other is a get scenario on the primary key which shows a single contact record where the state and country names need to be shown.

For the latter (get), it probably does not matter what the FK is based on, since we are bringing tables together for a single record, or a few records, via key reads. The former (search or list) scenario may be impacted by our choice. Since it is required to show country (at least a recognizable code, and perhaps even the search itself includes a country code), not having to join another table through a surrogate key can potentially (I am just being cautious here because I have not actually tested this, but it seems highly probable) improve performance; and it certainly helps with the search.
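
As a rough illustration of that list/search scenario (hypothetical contacts/countries tables, not something I have benchmarked), the surrogate version needs a JOIN just to expose the code, while the natural-key version does not:

-- Surrogate FK: the country code has to be joined in to be shown or filtered on
SELECT c.name, co.iso_code
FROM contacts c
JOIN countries co ON co.id = c.country_id
WHERE co.iso_code = 'US';

-- Natural-key FK: the code is already in the contact row, so no JOIN is needed
SELECT c.name, c.country_code
FROM contacts c
WHERE c.country_code = 'US';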

Since codes are small in size - usually no more than 3 characters for country and state - it may be okay to use the natural keys as foreign keys in this scenario.

In the other scenario, where keys depend on longer varchar values and perhaps on larger tables, the surrogate key probably has the advantage.


Allow me to say yes, there is definitely a difference, taking into consideration the scope of performance (out-of-the-box definition):

1- Using a surrogate int is faster in the application because you do not need to use ToUpper(), ToLower(), ToUpperInvariant(), or ToLowerInvariant() in your code or in your query, and these 4 functions have different performance benchmarks. See Microsoft's performance rules on this. (performance of application)

2- Using a surrogate int guarantees not changing the key over time. Even country codes may change; see Wikipedia for how ISO codes have changed over time. Changing the primary key throughout the subtrees would take a lot of time. (performance of data maintenance)

3- It seems there are issues with ORM solutions, such as NHibernate, when the PK/FK is not an int. (developer performance)


At HauteLook, we changed many of our tables to use natural keys. We did experience a real-world increase in performance. As you mention, many of our queries now use fewer joins, which makes them more performant. We will even use a composite primary key if it makes sense. That being said, some tables are just easier to work with if they have a surrogate key.

Also, if you are letting people write interfaces to your database, a surrogate key can be helpful. The 3rd party can rely on the fact that the surrogate key will change only in very rare circumstances.


Depends on the length. If the varchar is 20 characters and the int is 4 bytes, then using the int gives your index FIVE times as many nodes per page of index space on disk, which means traversing the index will require one-fifth as many physical and/or logical reads.

So, if performance is an issue, given the opportunity, always use an integer, non-meaningful key (called a surrogate) for your tables and for the foreign keys that reference the rows in those tables.

At the same time, to guarantee data consistency, every table where it matters should also have a meaningful, non-numeric alternate key (or unique index) to ensure that duplicate rows cannot be inserted (duplicates based on meaningful table attributes).
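
For example, a minimal sketch of that pattern (surrogate primary key plus a unique natural alternate key; names are illustrative):

CREATE TABLE country (
    id   INT AUTO_INCREMENT PRIMARY KEY,   -- surrogate key referenced by foreign keys
    code CHAR(2) NOT NULL,                  -- meaningful natural key
    name VARCHAR(100) NOT NULL,
    UNIQUE KEY uk_country_code (code)       -- alternate key prevents duplicate codes
) ENGINE=InnoDB;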

For the specific use you are talking about (like state lookups), it really doesn't matter because the table is so small. In general there is no impact on performance from indexes on tables with fewer than a few thousand rows.


Not sure about the performance implications, but it seems a possible compromise, at least during development, would be to include both the auto-incremented, integer "surrogate" key, as well as your intended, unique, "natural" key. This would give you the opportunity to evaluate performance, as well as other possible issues, including the changeability of natural keys.


As for the primary key, whatever physically makes a row unique should be chosen as the primary key.

For a reference as a foreign key, using an auto-incrementing integer as a surrogate is a nice idea for two main reasons.
- First, there's usually less overhead incurred in the join.
- Second, if you need to update the table that contains the unique varchar, then the update has to cascade down to all the child tables and update all of them as well as their indexes, whereas with the int surrogate it only has to update the master table and its indexes (see the sketch below).
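
A sketch of that cascading case (hypothetical tables, just to illustrate the point): with a natural varchar key the change must propagate to every child row, while with a surrogate it stays in one place.

-- Natural key as PK/FK: renaming a category rewrites every referencing row
CREATE TABLE category_nat (
    name VARCHAR(50) PRIMARY KEY
) ENGINE=InnoDB;

CREATE TABLE product_nat (
    id            INT AUTO_INCREMENT PRIMARY KEY,
    category_name VARCHAR(50) NOT NULL,
    FOREIGN KEY (category_name) REFERENCES category_nat(name)
        ON UPDATE CASCADE   -- the rename is pushed into every product row and its indexes
) ENGINE=InnoDB;

UPDATE category_nat SET name = 'Gadgets' WHERE name = 'Widgets';  -- cascades into product_nat

-- Surrogate key: the same rename is a single-row update
CREATE TABLE category_sur (
    id   INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(50) NOT NULL UNIQUE
) ENGINE=InnoDB;

UPDATE category_sur SET name = 'Gadgets' WHERE name = 'Widgets';  -- any child rows referencing id are untouched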

The drawback to using the surrogate is that you could possibly allow the meaning behind the surrogate to change:

ex.
id value
1 A
2 B
3 C

Update 3 to D
id value
1 A
2 B
3 D

Update 2 to C
id value
1 A
2 C
3 D

Update 3 to B
id value
1 A
2 C
3 B

It all depends on what you really need to worry about in your structure and what means most.


I was a bit annoyed by the lack of benchmarks for this online, so I ran a test myself.

Note though that I don't do this on a regular basis, so please check my setup and steps for any factors that could have influenced the results unintentionally, and post your concerns in comments.

The setup was as follows:

  • Intel® Core™ i7-7500U CPU @ 2.70GHz × 4
  • 15.6 GiB RAM, of which I ensured around 8 GB was free during the test.
  • 148.6 GB SSD drive, with plenty of free space.
  • Ubuntu 16.04 64-bit
  • MySQL Ver 14.14 Distrib 5.7.20, for Linux (x86_64)

The tables:

create table jan_int (data1 varchar(255), data2 int(10), myindex tinyint(4)) ENGINE=InnoDB;
create table jan_int_index (data1 varchar(255), data2 int(10), myindex tinyint(4), INDEX (myindex)) ENGINE=InnoDB;
create table jan_char (data1 varchar(255), data2 int(10), myindex char(6)) ENGINE=InnoDB;
create table jan_char_index (data1 varchar(255), data2 int(10), myindex char(6), INDEX (myindex)) ENGINE=InnoDB;
create table jan_varchar (data1 varchar(255), data2 int(10), myindex varchar(63)) ENGINE=InnoDB;
create table jan_varchar_index (data1 varchar(255), data2 int(10), myindex varchar(63), INDEX (myindex)) ENGINE=InnoDB;

Then, I filled 10 million rows in each table with a PHP script whose essence is like this:

$pdo = get_pdo();

$keys = [ 'alabam', 'massac', 'newyor', 'newham', 'delawa', 'califo', 'nevada', 'texas_', 'florid', 'ohio__' ];

for ($k = 0; $k < 10; $k++) {
    for ($j = 0; $j < 1000; $j++) {
        $val = '';
        for ($i = 0; $i < 1000; $i++) {
            $val .= '("' . generate_random_string() . '", ' . rand (0, 10000) . ', "' . ($keys[rand(0, 9)]) . '"),';
        }
        $val = rtrim($val, ',');
        $pdo->query('INSERT INTO jan_char VALUES ' . $val);
    }
    echo "\n" . ($k + 1) . ' million(s) rows inserted.';
}

For int tables, the bit ($keys[rand(0, 9)]) was replaced with just rand(0, 9), and for varchar tables, I used full US state names, without cutting or extending them to 6 characters. generate_random_string() generates a 10-character random string.

Then I ran in MySQL:

  • SET SESSION query_cache_type=0;
  • For jan_int table:
    • SELECT count(*) FROM jan_int WHERE myindex = 5;
    • SELECT BENCHMARK(1000000000, (SELECT count(*) FROM jan_int WHERE myindex = 5));
  • For other tables, same as above, with myindex = 'califo' for char tables and myindex = 'california' for varchar tables.
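
Spelled out (this just restates the queries described in the list above; the _index tables follow the same pattern), the char and varchar variants would be:

SELECT count(*) FROM jan_char WHERE myindex = 'califo';
SELECT BENCHMARK(1000000000, (SELECT count(*) FROM jan_char WHERE myindex = 'califo'));

SELECT count(*) FROM jan_varchar WHERE myindex = 'california';
SELECT BENCHMARK(1000000000, (SELECT count(*) FROM jan_varchar WHERE myindex = 'california'));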

Times of the BENCHMARK query on each table:

  • jan_int: 21.30 sec
  • jan_int_index: 18.79 sec
  • jan_char: 21.70 sec
  • jan_char_index: 18.85 sec
  • jan_varchar: 21.76 sec
  • jan_varchar_index: 18.86 sec

Regarding table & index sizes, here's the output of show table status from janperformancetest; (w/ a few columns not shown):

|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Name              | Engine | Version | Row_format | Rows    | Avg_row_length | Data_length | Max_data_length | Index_length | Data_free | Auto_increment | Collation              |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| jan_int           | InnoDB |      10 | Dynamic    | 9739094 |             43 |   422510592 |               0 |            0 |   4194304 |           NULL | utf8mb4_unicode_520_ci |  
| jan_int_index     | InnoDB |      10 | Dynamic    | 9740329 |             43 |   420413440 |               0 |    132857856 |   7340032 |           NULL | utf8mb4_unicode_520_ci |   
| jan_char          | InnoDB |      10 | Dynamic    | 9726613 |             51 |   500170752 |               0 |            0 |   5242880 |           NULL | utf8mb4_unicode_520_ci |  
| jan_char_index    | InnoDB |      10 | Dynamic    | 9719059 |             52 |   513802240 |               0 |    202342400 |   5242880 |           NULL | utf8mb4_unicode_520_ci |  
| jan_varchar       | InnoDB |      10 | Dynamic    | 9722049 |             53 |   521142272 |               0 |            0 |   7340032 |           NULL | utf8mb4_unicode_520_ci |   
| jan_varchar_index | InnoDB |      10 | Dynamic    | 9738381 |             49 |   486539264 |               0 |    202375168 |   7340032 |           NULL | utf8mb4_unicode_520_ci | 
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|

My conclusion is that there's no performance difference for this particular use case.


Absolutely not.

I have done several... several... performance checks between INT, VARCHAR, and CHAR.

A 10-million-record table with a PRIMARY KEY (unique and clustered) had exactly the same speed and performance (and subtree cost) no matter which of the three I used.

That being said... use whatever is best for your application. Don't worry about the performance.


For short codes, there's probably no difference. This is especially true because the tables holding these codes are likely to be very small (a couple thousand rows at most) and don't change often (when was the last time we added a new US state?).

For larger tables with a wider variation among the keys, this can be dangerous. Think about using the e-mail address/user name from a User table, for example. What happens when you have a few million users and some of those users have long names or e-mail addresses? Now any time you need to join to this table using that key, it becomes much more expensive.


The question is about MySQL, so I say there is a significant difference. If it were about Oracle (which stores numbers as strings - yes, I couldn't believe it at first), then not much difference.

Storage in the table is not the issue but updating and referring to the index is. Queries involving looking up a record based on its primary key are frequent - you want them to occur as fast as possible because they happen so often.

The thing is, a CPU deals with 4-byte and 8-byte integers natively, in silicon. It's REALLY fast for it to compare two integers - it happens in one or two clock cycles.

Now look at a string - it's made up of lots of characters (more than one byte per character these days). Comparing two strings for precedence can't be done in one or two cycles. Instead the strings' characters must be iterated until a difference is found. I'm sure there are tricks to make it faster in some databases but that's irrelevant here because an int comparison is done naturally and lightning fast in silicon by the CPU.

My general rule: every primary key should be an auto-incrementing INT, especially in OO apps using an ORM (Hibernate, DataNucleus, whatever) where there are lots of relationships between objects - they'll usually be implemented as a simple FK, and the ability of the DB to resolve those quickly is important to your app's responsiveness.


Common cases where a surrogate AUTO_INCREMENT hurts:

A common schema pattern is a many-to-many mapping:

CREATE TABLE map (
    id ... AUTO_INCREMENT,
    foo_id ...,
    bar_id ...,
    PRIMARY KEY(id),
    UNIQUE(foo_id, bar_id),
    INDEX(bar_id) );

Performance of this pattern is much better, especially when using InnoDB:

CREATE TABLE map (
    # No surrogate
    foo_id ...,
    bar_id ...,
    PRIMARY KEY(foo_id, bar_id),
    INDEX      (bar_id, foo_id) );

Why?

  • InnoDB secondary keys need an extra lookup; by moving the pair into the PK, that is avoided for one direction.
  • The secondary index is "covering", so it does not need the extra lookup.
  • This table is smaller because of getting rid of id and one index.

Another case (country):

country_id INT ...
-- versus
country_code CHAR(2) CHARACTER SET ascii

All too often the novice normalizes country_code into a 4-byte INT instead of using a 'natural', nearly-unchanging 2-byte string. Faster, smaller, fewer JOINs, more readable.

