I want a random selection of rows in PostgreSQL. I tried this:
select * from table where random() < 0.01;
But some others recommend this:
select * from table order by random() limit 1000;
I have a very large table with 500 million rows, and I want it to be fast.
Which approach is better? What are the differences? What is the best way to select random rows?
Tags: sql, performance, postgresql, random
Given your specifications (plus additional info in the comments), the query below does not need a sequential scan of the big table, only an index scan.
First, get estimates for the main query:
SELECT count(*) AS ct -- optional
, min(id) AS min_id
, max(id) AS max_id
, max(id) - min(id) AS id_span
FROM big;
The only possibly expensive part is the count(*) (for huge tables). Given the above specifications, you don't need it. An estimate will do just fine, available at almost no cost:
SELECT reltuples AS ct FROM pg_class WHERE oid = 'schema_name.big'::regclass;
As long as ct isn't much smaller than id_span, the query will outperform other approaches.
WITH params AS (
SELECT 1 AS min_id -- minimum id <= current min id
, 5100000 AS id_span -- rounded up. (max_id - min_id + buffer)
)
SELECT *
FROM (
SELECT p.min_id + trunc(random() * p.id_span)::integer AS id
FROM params p
,generate_series(1, 1100) g -- 1000 + buffer
GROUP BY 1 -- trim duplicates
) r
JOIN big USING (id)
LIMIT 1000; -- trim surplus
Generate random numbers in the id space. You have "few gaps", so add 10 % (enough to easily cover the blanks) to the number of rows to retrieve.
Each id can be picked multiple times by chance (though very unlikely with a big id space), so group the generated numbers (or use DISTINCT).
Join the ids to the big table. This should be very fast with the index in place (see the note on the index after these steps).
Finally, trim surplus ids that have not been eaten by dupes and gaps. Every row has a completely equal chance to be picked.
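All of this assumes an index on big.id. If id is not already the primary key, something like the following provides one (a sketch; the index name is hypothetical):
-- assumption: the fast join needs a (unique) index on big.id
CREATE UNIQUE INDEX big_id_idx ON big (id);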
You can simplify this query. The CTE in the query above is just for educational purposes:
SELECT *
FROM (
SELECT DISTINCT 1 + trunc(random() * 5100000)::integer AS id
FROM generate_series(1, 1100) g
) r
JOIN big USING (id)
LIMIT 1000;
You can refine the query with a recursive CTE (rCTE), especially if you are not so sure about gaps and estimates:
WITH RECURSIVE random_pick AS (
SELECT *
FROM (
SELECT 1 + trunc(random() * 5100000)::int AS id
FROM generate_series(1, 1030) -- 1000 + few percent - adapt to your needs
LIMIT 1030 -- hint for query planner
) r
JOIN big b USING (id) -- eliminate miss
UNION -- eliminate dupe
SELECT b.*
FROM (
SELECT 1 + trunc(random() * 5100000)::int AS id
FROM random_pick r -- plus 3 percent - adapt to your needs
LIMIT 999 -- less than 1000, hint for query planner
) r
JOIN big b USING (id) -- eliminate miss
)
SELECT *
FROM random_pick
LIMIT 1000; -- actual limit
We can work with a smaller surplus in the base query. If there are too many gaps, so that we don't find enough rows in the first iteration, the rCTE continues to iterate with the recursive term. We still need relatively few gaps in the ID space, or the recursion may run dry before the limit is reached - or we have to start with a large enough buffer, which defeats the purpose of optimizing performance.
Duplicates are eliminated by the UNION in the rCTE.
The outer LIMIT makes the CTE stop as soon as we have enough rows.
This query is carefully drafted to use the available index, generate truly random rows, and not stop until we fulfill the limit (unless the recursion runs dry). There are a number of pitfalls here if you are going to rewrite it.
For repeated use with varying parameters:
CREATE OR REPLACE FUNCTION f_random_sample(_limit int = 1000, _gaps real = 1.03)
RETURNS SETOF big AS
$func$
DECLARE
_surplus int := _limit * _gaps;
_estimate int := ( -- get current estimate from system
SELECT c.reltuples * _gaps
FROM pg_class c
WHERE c.oid = 'big'::regclass);
BEGIN
RETURN QUERY
WITH RECURSIVE random_pick AS (
SELECT *
FROM (
SELECT 1 + trunc(random() * _estimate)::int
FROM generate_series(1, _surplus) g
LIMIT _surplus -- hint for query planner
) r (id)
JOIN big USING (id) -- eliminate misses
UNION -- eliminate dupes
SELECT *
FROM (
SELECT 1 + trunc(random() * _estimate)::int
FROM random_pick -- just to make it recursive
LIMIT _limit -- hint for query planner
) r (id)
JOIN big USING (id) -- eliminate misses
)
SELECT *
FROM random_pick
LIMIT _limit;
END
$func$ LANGUAGE plpgsql VOLATILE ROWS 1000;
Call:
SELECT * FROM f_random_sample();
SELECT * FROM f_random_sample(500, 1.05);
You could even make this generic to work for any table: take the name of the PK column and the table as polymorphic type and use EXECUTE. But that's beyond the scope of this question.
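Still, a rough sketch of that generic idea, assuming an integer PK passed by name and a crude 2x oversampling to cover gaps and duplicates (untested; all names are illustrative):
CREATE OR REPLACE FUNCTION f_random_sample_any(_tbl_type anyelement
, _pk text = 'id'
, _limit int = 1000)
RETURNS SETOF anyelement AS
$f$
BEGIN
RETURN QUERY EXECUTE format(
'SELECT t.*
FROM (SELECT DISTINCT 1 + trunc(random() * (SELECT max(%1$I) FROM %2$s))::int AS pick
FROM generate_series(1, $1 * 2)) r  -- crude 2x oversampling
JOIN %2$s t ON t.%1$I = r.pick
LIMIT $1'
, _pk, pg_typeof(_tbl_type))
USING _limit;
END
$f$ LANGUAGE plpgsql VOLATILE;
Call with a NULL cast to the table's row type:
SELECT * FROM f_random_sample_any(NULL::big, 'id', 1000);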
If your requirements allow identical sets for repeated calls (and we are talking about repeated calls), consider a materialized view. Execute the above query once and write the result to a table. Users get a quasi-random selection at lightning speed. Refresh your random pick at intervals or events of your choosing.
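A minimal sketch of that idea, reusing f_random_sample() from above (the view name is illustrative):
CREATE MATERIALIZED VIEW mv_random_pick AS
SELECT * FROM f_random_sample(1000);
-- users read the precomputed selection at almost no cost
SELECT * FROM mv_random_pick;
-- refresh at intervals or events of your choosing
REFRESH MATERIALIZED VIEW mv_random_pick;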
Postgres 9.5+ also offers TABLESAMPLE SYSTEM (n), where n is a percentage. The manual:
The BERNOULLI and SYSTEM sampling methods each accept a single argument which is the fraction of the table to sample, expressed as a percentage between 0 and 100. This argument can be any real-valued expression.
Bold emphasis mine. It's very fast, but the result is not exactly random. The manual again:
The SYSTEM method is significantly faster than the BERNOULLI method when small sampling percentages are specified, but it may return a less-random sample of the table as a result of clustering effects.
The number of rows returned can vary wildly. For our example, to get roughly 1000 rows (1000 * 100 / 5100000 ≈ 0.0196 %):
SELECT * FROM big TABLESAMPLE SYSTEM ((1000 * 100) / 5100000.0);
Or install the additional module tsm_system_rows to get the number of requested rows exactly (if there are enough) and allow for the more convenient syntax:
SELECT * FROM big TABLESAMPLE SYSTEM_ROWS(1000);
See Evan's answer for details.
But that's still not exactly random.
I know I'm a little late to the party, but I just found this awesome tool called pg_sample:
pg_sample - extract a small, sample dataset from a larger PostgreSQL database while maintaining referential integrity.
I tried this with a 350M-row database and it was really fast; I don't know about the randomness, though.
./pg_sample --limit="small_table = *" --limit="large_table = 100000" -U postgres source_db | psql -U postgres target_db
A variation on the materialized view alternative outlined by Erwin Brandstetter is possible. Say, for example, that you don't want duplicates in the randomized values that are returned. Then you will need to set a boolean value on the primary table containing your (non-randomized) set of values.
Assuming this is the input table:
id_values:
 id | used
----+-------
  1 | FALSE
  2 | FALSE
  3 | FALSE
  4 | FALSE
  5 | FALSE
...
Populate the id_values table as needed. Then, as described by Erwin, create a materialized view that randomizes the id_values table once:
CREATE MATERIALIZED VIEW id_values_randomized AS
SELECT id
FROM id_values
ORDER BY random();
Note that the materialized view does not contain the used column, because this will quickly become out-of-date. Nor does the view need to contain other columns that may be in the id_values table.
In order to obtain (and "consume") random values, use an UPDATE-RETURNING on id_values, selecting ids from id_values_randomized with a join, and applying the desired criteria to obtain only relevant possibilities. For example:
UPDATE id_values
SET used = TRUE
WHERE id_values.id IN
(SELECT i.id
FROM id_values_randomized r INNER JOIN id_values i ON i.id = r.id
WHERE (NOT i.used)
LIMIT 5)
RETURNING id;
Change LIMIT as necessary -- if you only need one random value at a time, change LIMIT to 1.
With the proper indexes on id_values (one possibility is sketched below), I believe the UPDATE-RETURNING should execute very quickly with little load. It returns randomized values with one database round-trip. The criteria for "eligible" rows can be as complex as required. New rows can be added to the id_values table at any time, and they will become accessible to the application as soon as the materialized view is refreshed (which can likely be run at an off-peak time). Creation and refresh of the materialized view will be slow, but it only needs to be executed when new ids are added to the id_values table.
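For instance, a partial index covering only the unconsumed rows keeps the lookup cheap (a sketch; the index name is illustrative):
CREATE INDEX id_values_unused_idx ON id_values (id) WHERE NOT used;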
The one with the ORDER BY is going to be the slower one.
select * from table where random() < 0.01;
goes record by record, and decides to randomly filter it or not. This is going to be O(N) because it only needs to check each record once.
select * from table order by random() limit 1000;
is going to sort the entire table, then pick the first 1000. Aside from any voodoo magic behind the scenes, the order by is O(N * log N).
The downside to the random() < 0.01 one is that you'll get a variable number of output records: the expected count is 0.01 * N, so about 5 million rows for the 500-million-row table, but the exact number varies from run to run.
Note, there is a better way to shuffle a set of data than sorting by random: the Fisher-Yates shuffle, which runs in O(N). Implementing the shuffle in SQL sounds like quite the challenge, though.
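For illustration, a rough sketch of a partial Fisher-Yates shuffle in PL/pgSQL (hypothetical code, assuming a table big with an integer id; note that it materializes every id into an array first, which is itself expensive on huge tables - hence the challenge):
CREATE OR REPLACE FUNCTION fisher_yates_sample(_n int)
RETURNS SETOF int AS
$f$
DECLARE
ids int[] := ARRAY(SELECT id FROM big); -- loads all ids; costly for 500M rows
len int := array_length(ids, 1);
j int;
tmp int;
BEGIN
FOR i IN 1 .. LEAST(_n, len) LOOP
j := i + trunc(random() * (len - i + 1))::int; -- random pick from the unshuffled tail
tmp := ids[i]; ids[i] := ids[j]; ids[j] := tmp; -- swap
RETURN NEXT ids[i];
END LOOP;
END
$f$ LANGUAGE plpgsql VOLATILE;
SELECT b.* FROM fisher_yates_sample(1000) AS s(id) JOIN big b USING (id);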
If you want just one row, you can use a calculated offset derived from count.
select * from table_name limit 1
offset floor(random() * (select count(*) from table_name));
One lesson from my experience: offset floor(random() * N) limit 1 is not faster than order by random() limit 1.
I thought the offset approach would be faster because it should save the time of sorting in Postgres. Turns out it wasn't.
Here is an approach that works for me. I guess it's very simple to understand and execute.
SELECT
field_1,
field_2,
random() as ordering
FROM
big_table
WHERE
some_conditions
ORDER BY
ordering
LIMIT 1000;
To shuffle all rows:
select your_columns from your_table ORDER BY random()
To shuffle distinct rows:
select * from
(select distinct your_columns from your_table) table_alias
ORDER BY random()
To pick a single random row:
select your_columns from your_table ORDER BY random() limit 1
select * from table order by random() limit 1000;
If you know how many rows you want, check out tsm_system_rows.
The tsm_system_rows module provides the table sampling method SYSTEM_ROWS, which can be used in the TABLESAMPLE clause of a SELECT command.
This table sampling method accepts a single integer argument that is the maximum number of rows to read. The resulting sample will always contain exactly that many rows, unless the table does not contain enough rows, in which case the whole table is selected. Like the built-in SYSTEM sampling method, SYSTEM_ROWS performs block-level sampling, so that the sample is not completely random but may be subject to clustering effects, especially if only a small number of rows are requested.
First, install the extension:
CREATE EXTENSION tsm_system_rows;
Then your query:
SELECT *
FROM table
TABLESAMPLE SYSTEM_ROWS(1000);
You can examine and compare the execution plan of both by using
EXPLAIN select * from table where random() < 0.01;
EXPLAIN select * from table order by random() limit 1000;
A quick test on a large table¹ shows that the ORDER BY first sorts the complete table and then picks the first 1000 items. Sorting a large table not only reads that table but also involves reading and writing temporary files. The where random() < 0.1 only scans the complete table once.
For large tables this might not be what you want, as even one complete table scan might take too long.
A third proposal would be:
select * from table where random() < 0.01 limit 1000;
This one stops the table scan as soon as 1000 rows have been found and therefore returns sooner. Of course, this skews the randomness a bit, but perhaps this is good enough in your case.
Edit: Besides these considerations, you might check out the already-asked questions on this topic. Using the search query [postgresql] random returns quite a few hits. And a linked article by depesz outlines several more approaches.
1 "large" as in "the complete table will not fit into the memory".
Starting with PostgreSQL 9.5, there's a new syntax dedicated to getting random elements from a table:
SELECT * FROM mytable TABLESAMPLE SYSTEM (5);
This example will give you 5% of the elements from mytable.
See further explanation in the documentation: http://www.postgresql.org/docs/current/static/sql-select.html
Add a column called r with type serial, and index r (see the setup sketch below).
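A setup sketch (YOUR_TABLE is a placeholder; backfilling the new column on a 500M-row table will take a while):
ALTER TABLE YOUR_TABLE ADD COLUMN r serial;
CREATE INDEX ON YOUR_TABLE (r);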
Assume we have 200,000 rows; we are going to generate a random number n, where 0 < n <= 200,000.
Select rows with r > n, sort them ASC and select the smallest one.
Code:
select * from YOUR_TABLE
where r > (
select (
select reltuples::bigint AS estimate
from pg_class
where oid = 'public.YOUR_TABLE'::regclass) * random()
)
order by r asc limit 1;
The code is self-explanatory. The subquery in the middle is used to quickly estimate the table row count, as described in https://stackoverflow.com/a/7945274/1271094.
At the application level, you need to execute the statement again if n > the number of rows, or if you need to select multiple rows.
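A sketch of such a retry wrapper as a function (hypothetical names; assumes the r column and index from above):
CREATE OR REPLACE FUNCTION pick_one_random()
RETURNS YOUR_TABLE AS
$f$
DECLARE
result YOUR_TABLE;
BEGIN
LOOP
SELECT * INTO result
FROM YOUR_TABLE
WHERE r > (SELECT reltuples::bigint
FROM pg_class
WHERE oid = 'public.YOUR_TABLE'::regclass) * random()
ORDER BY r
LIMIT 1;
EXIT WHEN FOUND; -- retry when the random point fell past the last row
END LOOP;
RETURN result;
END
$f$ LANGUAGE plpgsql VOLATILE;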