[mysql] JOIN queries vs multiple queries

Are JOIN queries faster than several queries? (You run your main query, and then you run many other SELECTs based on the results from your main query)

I'm asking because JOINing them would complicate A LOT the design of my application

If they are faster, can anyone approximate very roughly by how much? If it's 1.5x I don't care, but if it's 10x I guess I do.

This question is related to mysql database join query-optimization

The answer is


Construct both separate queries and joins, then time each of them -- nothing helps more than real-world numbers.

Then even better -- add "EXPLAIN" to the beginning of each query. This will tell you how many subqueries MySQL is using to answer your request for data, and how many rows scanned for each query.


Depending on the complexity for the database compared to developer complexity, it may be simpler to do many SELECT calls.

Try running some database statistics against both the JOIN and the multiple SELECTS. See if in your environment the JOIN is faster/slower than the SELECT.

Then again, if changing it to a JOIN would mean an extra day/week/month of dev work, I'd stick with multiple SELECTs

Cheers,

BLT


Yes, one query using JOINS would be quicker. Although without knowing the relationships of the tables you are querying, the size of your dataset, or where the primary keys are, it's almost impossible to say how much faster.

Why not test both scenarios out, then you'll know for sure...


This question is old, but is missing some benchmarks. I benchmarked JOIN against its 2 competitors:

  • N+1 queries
  • 2 queries, the second one using a WHERE IN(...) or equivalent

The result is clear: on MySQL, JOIN is much faster. N+1 queries can drop the performance of an application drastically:

JOIN vs WHERE IN vs N+1

That is, unless you select a lot of records that point to a very small number of distinct, foreign records. Here is a benchmark for the extreme case:

JOIN vs N+1 - all records pointing to the same foreign record

This is very unlikely to happen in a typical application, unless you're joining a -to-many relationship, in which case the foreign key is on the other table, and you're duplicating the main table data many times.

Takeaway:

  • For *-to-one relationships, always use JOIN
  • For *-to-many relationships, a second query might be faster

See my article on Medium for more information.


Whether you should use a join is first and foremost about whether a join makes sense. Only at that point is performance even something to be considered, as nearly all other cases will result in significantly worse performance.

Performance differences will largely be tied to how related the info you're querying for is. Joins work, and they're fast when the data is related and you index stuff correctly, but they do often result in some redundancy and sometimes more results than needed. And if your data sets are not directly related, sticking them in a single query will result in what's called a Cartesian product (basically, all possible combinations of rows), which is almost never what you want.

This is often caused by many-to-one-to-many relationships. For example, HoldOffHunger's answer mentioned a single query for posts, tags, and comments. Comments are related to a post, as are tags...but tags are unrelated to comments.

+------------+     +---------+     +---------+
|  comment   |     |   post  |     |  tag    |
|------------|*   1|---------|1   *|---------|
| post_id    |-----| post_id |-----| post_id |
| comment_id |     | ...     |     | tag_id  |
| user_id    |     |         |     | ...     |
| ...        |     |         |     | ...     |
+------------+     +---------+     +---------+

In this case, it is unambiguously better for this to be at least two separate queries. If you try to join tags and comments, because there's no direct relation between the two, you end up with every possible combination of tag and comment. many * many == manymany. Aside from that, since posts and tags are unrelated, you can do those two queries in parallel, leading to potential gain.

Let's consider a different scenario, though: You want the comments attached to a post, and the commenters' contact info.

 +----------+     +------------+     +---------+
 |   user   |     |  comment   |     |   post  |
 |----------|1   *|------------|*   1|---------|
 | user_id  |-----| post_id    |-----| post_id |
 | username |     | user_id    |     | ...     |
 | ...      |     | ...        |     +---------+
 +----------+     +------------+

This is where you should consider a join. Aside from being a much more natural query, most database systems (including MySQL) have lots of smart people put lots of hard work into optimizing queries just like it. For separate queries, since each query depends on the results of the previous one, the queries can't be done in parallel, and the total time becomes not just the actual execute time of the queries, but also the time spent fetching results, sifting through them for IDs for the next query, linking rows together, etc.


The real question is: Do these records have a one-to-one relationship or a one-to-many relationship?

TLDR Answer:

If one-to-one, use a JOIN statement.

If one-to-many, use one (or many) SELECT statements with server-side code optimization.

Why and How To Use SELECT for Optimization

SELECT'ing (with multiple queries instead of joins) on large group of records based on a one-to-many relationship produces an optimal efficiency, as JOIN'ing has an exponential memory leak issue. Grab all of the data, then use a server-side scripting language to sort it out:

SELECT * FROM Address WHERE Personid IN(1,2,3);

Results:

Address.id : 1            // First person and their address
Address.Personid : 1
Address.City : "Boston"

Address.id : 2            // First person's second address
Address.Personid : 1
Address.City : "New York"

Address.id : 3            // Second person's address
Address.Personid : 2
Address.City : "Barcelona"

Here, I am getting all of the records, in one select statement. This is better than JOIN, which would be getting a small group of these records, one at a time, as a sub-component of another query. Then I parse it with server-side code that looks something like...

<?php
    foreach($addresses as $address) {
         $persons[$address['Personid']]->Address[] = $address;
    }
?>

When Not To Use JOIN for Optimization

JOIN'ing a large group of records based on a one-to-one relationship with one single record produces an optimal efficiency compared to multiple SELECT statements, one after the other, which simply get the next record type.

But JOIN is inefficient when getting records with a one-to-many relationship.

Example: The database Blogs has 3 tables of interest, Blogpost, Tag, and Comment.

SELECT * from BlogPost
LEFT JOIN Tag ON Tag.BlogPostid = BlogPost.id
LEFT JOIN Comment ON Comment.BlogPostid = BlogPost.id;

If there is 1 blogpost, 2 tags, and 2 comments, you will get results like:

Row1: tag1, comment1,
Row2: tag1, comment2,
Row3: tag2, comment1,
Row4: tag2, comment2,

Notice how each record is duplicated. Okay, so, 2 comments and 2 tags is 4 rows. What if we have 4 comments and 4 tags? You don't get 8 rows -- you get 16 rows:

Row1: tag1, comment1,
Row2: tag1, comment2,
Row3: tag1, comment3,
Row4: tag1, comment4,
Row5: tag2, comment1,
Row6: tag2, comment2,
Row7: tag2, comment3,
Row8: tag2, comment4,
Row9: tag3, comment1,
Row10: tag3, comment2,
Row11: tag3, comment3,
Row12: tag3, comment4,
Row13: tag4, comment1,
Row14: tag4, comment2,
Row15: tag4, comment3,
Row16: tag4, comment4,

Add more tables, more records, etc., and the problem will quickly inflate to hundreds of rows that are all full of mostly redundant data.

What do these duplicates cost you? Memory (in the SQL server and the code that tries to remove the duplicates) and networking resources (between SQL server and your code server).

Source: https://dev.mysql.com/doc/refman/8.0/en/nested-join-optimization.html ; https://dev.mysql.com/doc/workbench/en/wb-relationship-tools.html


In my experience I have found it's usually faster to run several queries, especially when retrieving large data sets.

When interacting with the database from another application, such as PHP, there is the argument of one trip to the server over many.

There are other ways to limit the number of trips made to the server and still run multiple queries that are often not only faster but also make the application easier to read - for example mysqli_multi_query.

I'm no novice when it comes to SQL, I think there is a tendency for developers, especially juniors to spend a lot of time trying to write very clever joins because they look smart, whereas there are actually smart ways to extract data that look simple.

The last paragraph was a personal opinion, but I hope this helps. I do agree with the others though who say you should benchmark. Neither approach is a silver bullet.


I actually came to this question looking for an answer myself, and after reading the given answers I can only agree that the best way to compare DB queries performance is to get real-world numbers because there are just to many variables to be taken into account BUT, I also think that comparing the numbers between them leads to no good in almost all cases. What I mean is that the numbers should always be compared with an acceptable number and definitely not compared with each other.

I can understand if one way of querying takes say 0.02 seconds and the other one takes 20 seconds, that's an enormous difference. But what if one way of querying takes 0.0000000002 seconds, and the other one takes 0.0000002 seconds ? In both cases one way is a whopping 1000 times faster than the other one, but is it really still "whopping" in the second case ?

Bottom line as I personally see it: if it performs well, go for the easy solution.


Here is a link with 100 useful queries, these are tested in Oracle database but remember SQL is a standard, what differ between Oracle, MS SQL Server, MySQL and other databases are the SQL dialect:

http://javaforlearn.com/100-sql-queries-learn/


Will it be faster in terms of throughput? Probably. But it also potentially locks more database objects at a time (depending on your database and your schema) and thereby decreases concurrency. In my experience people are often mislead by the "fewer database round-trips" argument when in reality on most OLTP systems where the database is on the same LAN, the real bottleneck is rarely the network.


For inner joins, a single query makes sense, since you only get matching rows. For left joins, multiple queries is much better... look at the following benchmark I did:

  1. Single query with 5 Joins

    query: 8.074508 seconds

    result size: 2268000

  2. 5 queries in a row

    combined query time: 0.00262 seconds

    result size: 165 (6 + 50 + 7 + 12 + 90)

.

Note that we get the same results in both cases (6 x 50 x 7 x 12 x 90 = 2268000)

left joins use exponentially more memory with redundant data.

The memory limit might not be as bad if you only do a join of two tables, but generally three or more and it becomes worth different queries.

As a side note, my MySQL server is right beside my application server... so connection time is negligible. If your connection time is in the seconds, then maybe there is a benefit

Frank


Did a quick test selecting one row from a 50,000 row table and joining with one row from a 100,000 row table. Basically looked like:

$id = mt_rand(1, 50000);
$row = $db->fetchOne("SELECT * FROM table1 WHERE id = " . $id);
$row = $db->fetchOne("SELECT * FROM table2 WHERE other_id = " . $row['other_id']);

vs

$id = mt_rand(1, 50000);
$db->fetchOne("SELECT table1.*, table2.*
    FROM table1
    LEFT JOIN table1.other_id = table2.other_id
    WHERE table1.id = " . $id);

The two select method took 3.7 seconds for 50,000 reads whereas the JOIN took 2.0 seconds on my at-home slow computer. INNER JOIN and LEFT JOIN did not make a difference. Fetching multiple rows (e.g., using IN SET) yielded similar results.


There are several factors which means there is no binary answer. The question of what is best for performance depends on your environment. By the way, if your single select with an identifier is not sub-second, something may be wrong with your configuration.

The real question to ask is how do you want to access the data. Single selects support late-binding. For example if you only want employee information, you can select from the Employees table. The foreign key relationships can be used to retrieve related resources at a later time and as needed. The selects will already have a key to point to so they should be extremely fast, and you only have to retrieve what you need. Network latency must always be taken into account.

Joins will retrieve all of the data at once. If you are generating a report or populating a grid, this may be exactly what you want. Compiled and optomized joins are simply going to be faster than single selects in this scenario. Remember, Ad-hoc joins may not be as fast--you should compile them (into a stored proc). The speed answer depends on the execution plan, which details exactly what steps the DBMS takes to retrieve the data.


Examples related to mysql

Implement specialization in ER diagram How to post query parameters with Axios? PHP with MySQL 8.0+ error: The server requested authentication method unknown to the client Loading class `com.mysql.jdbc.Driver'. This is deprecated. The new driver class is `com.mysql.cj.jdbc.Driver' phpMyAdmin - Error > Incorrect format parameter? Authentication plugin 'caching_sha2_password' is not supported How to resolve Unable to load authentication plugin 'caching_sha2_password' issue Connection Java-MySql : Public Key Retrieval is not allowed How to grant all privileges to root user in MySQL 8.0 MySQL 8.0 - Client does not support authentication protocol requested by server; consider upgrading MySQL client

Examples related to database

Implement specialization in ER diagram phpMyAdmin - Error > Incorrect format parameter? Authentication plugin 'caching_sha2_password' cannot be loaded Room - Schema export directory is not provided to the annotation processor so we cannot export the schema SQL Query Where Date = Today Minus 7 Days MySQL Error: : 'Access denied for user 'root'@'localhost' SQL Server date format yyyymmdd How to create a foreign key in phpmyadmin WooCommerce: Finding the products in database TypeError: tuple indices must be integers, not str

Examples related to join

Pandas Merging 101 pandas: merge (join) two data frames on multiple columns How to use the COLLATE in a JOIN in SQL Server? How to join multiple collections with $lookup in mongodb How to join on multiple columns in Pyspark? Pandas join issue: columns overlap but no suffix specified MySQL select rows where left join is null How to return rows from left table not found in right table? Why do multiple-table joins produce duplicate rows? pandas three-way joining multiple dataframes on columns

Examples related to query-optimization

How to do the Recursive SELECT query in MySQL? What is the optimal way to compare dates in Microsoft SQL server? How to find out what is locking my tables? Finding blocking/locking queries in MS SQL (mssql) SQL: How to properly check if a record exists Counting DISTINCT over multiple columns JOIN queries vs multiple queries PostgreSQL - fetch the row which has the Max value for a column mysql select from n last rows