How do I force Postgres to use an index when it would otherwise insist on doing a sequential scan?
This question is related to
sql
postgresql
indexing
There is a trick to push postgres to prefer a seqscan adding a OFFSET 0
in the subquery
This is handy for optimizing requests linking big/huge tables when all you need is only the n first/last elements.
Lets say you are looking for first/last 20 elements involving multiple tables having 100k (or more) entries, no point building/linking up all the query over all the data when what you'll be looking for is in the first 100 or 1000 entries. In this scenario for example, it turns out to be over 10x faster to do a sequential scan.
The question on itself is very much invalid. Forcing (by doing enable_seqscan=off for example) is very bad idea. It might be useful to check if it will be faster, but production code should never use such tricks.
Instead - do explain analyze of your query, read it, and find out why PostgreSQL chooses bad (in your opinion) plan.
There are tools on the web that help with reading explain analyze output - one of them is explain.depesz.com - written by me.
Another option is to join #postgresql channel on freenode irc network, and talking to guys there to help you out - as optimizing query is not a matter of "ask a question, get answer be happy". it's more like a conversation, with many things to check, many things to be learned.
Probably the only valid reason for using
set enable_seqscan=false
is when you're writing queries and want to quickly see what the query plan would actually be were there large amounts of data in the table(s). Or of course if you need to quickly confirm that your query is not using an index simply because the dataset is too small.
random_page_cost
This problem typically happens when the estimated cost of an index scan is too high and doesn't correctly reflect reality. You may need to lower the random_page_cost
configuration parameter to fix this. From the Postgres documentation:
Reducing this value [...] will cause the system to prefer index scans; raising it will make index scans look relatively more expensive.
You can do a quick test whether this will actually make Postgres use the index:
EXPLAIN <query>; # Uses sequential scan
SET random_page_cost = 1;
EXPLAIN <query>; # May use index scan now
You can restore the default value with SET random_page_cost = DEFAULT;
again.
Index scans require non-sequential disk page fetches. Postgres uses random_page_cost
to estimate the cost of such non-sequential fetches in relation to sequential fetches. The default value is 4.0
, thus assuming an average cost factor of 4 compared to sequential fetches (taking caching effects into account).
The problem however is that this default value is unsuitable in the following important real-life scenarios:
1) Solid-state drives
As per the documentation:
Storage that has a low random read cost relative to sequential, e.g. solid-state drives, might be better modeled with a lower value for
random_page_cost
, e.g.,1.1
.
This slide from a speak at PostgresConf 2018 also says that random_page_cost
should be set to something between 1.0
and 2.0
for solid-state drives.
2) Cached data
If the required index data is already cached in RAM, an index scan will always be significantly faster than a sequential scan. The documentation says:
If your data is likely to be completely in cache, [...] decreasing
random_page_cost
can be appropriate.
The problem is that you of course can't easily know whether the relevant data is already cached. However, if a specific index is frequently used, and if the system has sufficient RAM, then data is likely to be cached eventually, and random_page_cost
should be set to a lower value. You'll have to experiment with different values and see what works for you.
You might also want to use the pg_prewarm extension for explicit data caching.
Assuming you're asking about the common "index hinting" feature found in many databases, PostgreSQL doesn't provide such a feature. This was a conscious decision made by the PostgreSQL team. A good overview of why and what you can do instead can be found here. The reasons are basically that it's a performance hack that tends to cause more problems later down the line as your data changes, whereas PostgreSQL's optimizer can re-evaluate the plan based on the statistics. In other words, what might be a good query plan today probably won't be a good query plan for all time, and index hints force a particular query plan for all time.
As a very blunt hammer, useful for testing, you can use the enable_seqscan
and enable_indexscan
parameters. See:
These are not suitable for ongoing production use. If you have issues with query plan choice, you should see the documentation for tracking down query performance issues. Don't just set enable_
params and walk away.
Unless you have a very good reason for using the index, Postgres may be making the correct choice. Why?
See also this old newsgroup post.
Sometimes PostgreSQL fails to make the best choice of indexes for a particular condition. As an example, suppose there is a transactions table with several million rows, of which there are several hundred for any given day, and the table has four indexes: transaction_id, client_id, date, and description. You want to run the following query:
SELECT client_id, SUM(amount)
FROM transactions
WHERE date >= 'yesterday'::timestamp AND date < 'today'::timestamp AND
description = 'Refund'
GROUP BY client_id
PostgreSQL may choose to use the index transactions_description_idx instead of transactions_date_idx, which may lead to the query taking several minutes instead of less than one second. If this is the case, you can force using the index on date by fudging the condition like this:
SELECT client_id, SUM(amount)
FROM transactions
WHERE date >= 'yesterday'::timestamp AND date < 'today'::timestamp AND
description||'' = 'Refund'
GROUP BY client_id
Assuming you're asking about the common "index hinting" feature found in many databases, PostgreSQL doesn't provide such a feature. This was a conscious decision made by the PostgreSQL team. A good overview of why and what you can do instead can be found here. The reasons are basically that it's a performance hack that tends to cause more problems later down the line as your data changes, whereas PostgreSQL's optimizer can re-evaluate the plan based on the statistics. In other words, what might be a good query plan today probably won't be a good query plan for all time, and index hints force a particular query plan for all time.
As a very blunt hammer, useful for testing, you can use the enable_seqscan
and enable_indexscan
parameters. See:
These are not suitable for ongoing production use. If you have issues with query plan choice, you should see the documentation for tracking down query performance issues. Don't just set enable_
params and walk away.
Unless you have a very good reason for using the index, Postgres may be making the correct choice. Why?
See also this old newsgroup post.
The question on itself is very much invalid. Forcing (by doing enable_seqscan=off for example) is very bad idea. It might be useful to check if it will be faster, but production code should never use such tricks.
Instead - do explain analyze of your query, read it, and find out why PostgreSQL chooses bad (in your opinion) plan.
There are tools on the web that help with reading explain analyze output - one of them is explain.depesz.com - written by me.
Another option is to join #postgresql channel on freenode irc network, and talking to guys there to help you out - as optimizing query is not a matter of "ask a question, get answer be happy". it's more like a conversation, with many things to check, many things to be learned.
Assuming you're asking about the common "index hinting" feature found in many databases, PostgreSQL doesn't provide such a feature. This was a conscious decision made by the PostgreSQL team. A good overview of why and what you can do instead can be found here. The reasons are basically that it's a performance hack that tends to cause more problems later down the line as your data changes, whereas PostgreSQL's optimizer can re-evaluate the plan based on the statistics. In other words, what might be a good query plan today probably won't be a good query plan for all time, and index hints force a particular query plan for all time.
As a very blunt hammer, useful for testing, you can use the enable_seqscan
and enable_indexscan
parameters. See:
These are not suitable for ongoing production use. If you have issues with query plan choice, you should see the documentation for tracking down query performance issues. Don't just set enable_
params and walk away.
Unless you have a very good reason for using the index, Postgres may be making the correct choice. Why?
See also this old newsgroup post.
There is a trick to push postgres to prefer a seqscan adding a OFFSET 0
in the subquery
This is handy for optimizing requests linking big/huge tables when all you need is only the n first/last elements.
Lets say you are looking for first/last 20 elements involving multiple tables having 100k (or more) entries, no point building/linking up all the query over all the data when what you'll be looking for is in the first 100 or 1000 entries. In this scenario for example, it turns out to be over 10x faster to do a sequential scan.
Sometimes PostgreSQL fails to make the best choice of indexes for a particular condition. As an example, suppose there is a transactions table with several million rows, of which there are several hundred for any given day, and the table has four indexes: transaction_id, client_id, date, and description. You want to run the following query:
SELECT client_id, SUM(amount)
FROM transactions
WHERE date >= 'yesterday'::timestamp AND date < 'today'::timestamp AND
description = 'Refund'
GROUP BY client_id
PostgreSQL may choose to use the index transactions_description_idx instead of transactions_date_idx, which may lead to the query taking several minutes instead of less than one second. If this is the case, you can force using the index on date by fudging the condition like this:
SELECT client_id, SUM(amount)
FROM transactions
WHERE date >= 'yesterday'::timestamp AND date < 'today'::timestamp AND
description||'' = 'Refund'
GROUP BY client_id
Source: Stackoverflow.com