I want a random selection of rows in PostgreSQL. I tried this:
select * from table where random() < 0.01;
But some others recommend this:
select * from table order by random() limit 1000;
I have a very large table with 500 million rows, and I want it to be fast.
Which approach is better? What are the differences? What is the best way to select random rows?
Tags: sql, performance, postgresql, random
Given your specifications (plus additional info in the comments), the query below does not need a sequential scan of the big table, only an index scan.
First, get estimates for the main query:
SELECT count(*) AS ct -- optional
, min(id) AS min_id
, max(id) AS max_id
, max(id) - min(id) AS id_span
FROM big;
The only possibly expensive part is the count(*) (for huge tables). Given the above specifications, you don't need it. An estimate will do just fine, available at almost no cost:
SELECT reltuples AS ct FROM pg_class WHERE oid = 'schema_name.big'::regclass;
As long as ct isn't much smaller than id_span, the query will outperform other approaches.
WITH params AS (
SELECT 1 AS min_id -- minimum id <= current min id
, 5100000 AS id_span -- rounded up. (max_id - min_id + buffer)
)
SELECT *
FROM (
SELECT p.min_id + trunc(random() * p.id_span)::integer AS id
FROM params p
,generate_series(1, 1100) g -- 1000 + buffer
GROUP BY 1 -- trim duplicates
) r
JOIN big USING (id)
LIMIT 1000; -- trim surplus
Generate random numbers in the id space. You have "few gaps", so add 10 % (enough to easily cover the blanks) to the number of rows to retrieve.
Each id can be picked multiple times by chance (though very unlikely with a big id space), so group the generated numbers (or use DISTINCT).
Join the ids to the big table. This should be very fast with the index in place (see the note on the index after these steps).
Finally, trim surplus ids that have not been eaten by dupes and gaps. Every row has a completely equal chance to be picked.
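All of this assumes an index on big.id. If id is not already the primary key, something like the following provides one (a sketch; the index name is hypothetical):
-- assumption: the fast join needs a (unique) index on big.id
CREATE UNIQUE INDEX big_id_idx ON big (id);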
You can simplify this query. The CTE in the query above is just for educational purposes:
SELECT *
FROM (
SELECT DISTINCT 1 + trunc(random() * 5100000)::integer AS id
FROM generate_series(1, 1100) g
) r
JOIN big USING (id)
LIMIT 1000;
You can refine the query with a recursive CTE (rCTE), especially if you are not so sure about gaps and estimates:
WITH RECURSIVE random_pick AS (
SELECT *
FROM (
SELECT 1 + trunc(random() * 5100000)::int AS id
FROM generate_series(1, 1030) -- 1000 + few percent - adapt to your needs
LIMIT 1030 -- hint for query planner
) r
JOIN big b USING (id) -- eliminate miss
UNION -- eliminate dupe
SELECT b.*
FROM (
SELECT 1 + trunc(random() * 5100000)::int AS id
FROM random_pick r -- plus 3 percent - adapt to your needs
LIMIT 999 -- less than 1000, hint for query planner
) r
JOIN big b USING (id) -- eliminate miss
)
SELECT *
FROM random_pick
LIMIT 1000; -- actual limit
We can work with a smaller surplus in the base query. If there are too many gaps, so that we don't find enough rows in the first iteration, the rCTE continues to iterate with the recursive term. We still need relatively few gaps in the ID space, or the recursion may run dry before the limit is reached - or we have to start with a large enough buffer, which defeats the purpose of optimizing performance.
Duplicates are eliminated by the UNION in the rCTE.
The outer LIMIT makes the CTE stop as soon as we have enough rows.
This query is carefully drafted to use the available index, generate truly random rows, and not stop until we fulfill the limit (unless the recursion runs dry). There are a number of pitfalls here if you are going to rewrite it.
For repeated use with varying parameters:
CREATE OR REPLACE FUNCTION f_random_sample(_limit int = 1000, _gaps real = 1.03)
RETURNS SETOF big AS
$func$
DECLARE
_surplus int := _limit * _gaps;
_estimate int := ( -- get current estimate from system
SELECT c.reltuples * _gaps
FROM pg_class c
WHERE c.oid = 'big'::regclass);
BEGIN
RETURN QUERY
WITH RECURSIVE random_pick AS (
SELECT *
FROM (
SELECT 1 + trunc(random() * _estimate)::int
FROM generate_series(1, _surplus) g
LIMIT _surplus -- hint for query planner
) r (id)
JOIN big USING (id) -- eliminate misses
UNION -- eliminate dupes
SELECT *
FROM (
SELECT 1 + trunc(random() * _estimate)::int
FROM random_pick -- just to make it recursive
LIMIT _limit -- hint for query planner
) r (id)
JOIN big USING (id) -- eliminate misses
)
SELECT *
FROM random_pick
LIMIT _limit;
END
$func$ LANGUAGE plpgsql VOLATILE ROWS 1000;
Call:
SELECT * FROM f_random_sample();
SELECT * FROM f_random_sample(500, 1.05);
You could even make this generic to work for any table: take the name of the PK column and the table as polymorphic type and use EXECUTE. But that's beyond the scope of this question.
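Still, a rough sketch of that generic idea, assuming an integer PK passed by name and a crude 2x oversampling to cover gaps and duplicates (untested; all names are illustrative):
CREATE OR REPLACE FUNCTION f_random_sample_any(_tbl_type anyelement
, _pk text = 'id'
, _limit int = 1000)
RETURNS SETOF anyelement AS
$f$
BEGIN
RETURN QUERY EXECUTE format(
'SELECT t.*
FROM (SELECT DISTINCT 1 + trunc(random() * (SELECT max(%1$I) FROM %2$s))::int AS pick
FROM generate_series(1, $1 * 2)) r  -- crude 2x oversampling
JOIN %2$s t ON t.%1$I = r.pick
LIMIT $1'
, _pk, pg_typeof(_tbl_type))
USING _limit;
END
$f$ LANGUAGE plpgsql VOLATILE;
Call with a NULL cast to the table's row type:
SELECT * FROM f_random_sample_any(NULL::big, 'id', 1000);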
If your requirements allow identical sets for repeated calls (and we are talking about repeated calls), consider a materialized view. Execute the above query once and write the result to a table. Users get a quasi-random selection at lightning speed. Refresh your random pick at intervals or events of your choosing.
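A minimal sketch of that idea, reusing f_random_sample() from above (the view name is illustrative):
CREATE MATERIALIZED VIEW mv_random_pick AS
SELECT * FROM f_random_sample(1000);
-- users read the precomputed selection at almost no cost
SELECT * FROM mv_random_pick;
-- refresh at intervals or events of your choosing
REFRESH MATERIALIZED VIEW mv_random_pick;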
Postgres 9.5+ also offers TABLESAMPLE SYSTEM (n), where n is a percentage. The manual:
The BERNOULLI and SYSTEM sampling methods each accept a single argument which is the fraction of the table to sample, expressed as a percentage between 0 and 100. This argument can be any real-valued expression.
Bold emphasis mine. It's very fast, but the result is not exactly random. The manual again:
The SYSTEM method is significantly faster than the BERNOULLI method when small sampling percentages are specified, but it may return a less-random sample of the table as a result of clustering effects.
The number of rows returned can vary wildly. For our example, to get roughly 1000 rows (1000 * 100 / 5100000 ≈ 0.0196 %):
SELECT * FROM big TABLESAMPLE SYSTEM ((1000 * 100) / 5100000.0);
Or install the additional module tsm_system_rows to get the number of requested rows exactly (if there are enough) and allow for the more convenient syntax:
SELECT * FROM big TABLESAMPLE SYSTEM_ROWS(1000);
See Evan's answer for details.
But that's still not exactly random.
I know I'm a little late to the party, but I just found this awesome tool called pg_sample:
pg_sample - extract a small, sample dataset from a larger PostgreSQL database while maintaining referential integrity.
I tried this with a 350M-row database and it was really fast; I don't know about the randomness, though.
./pg_sample --limit="small_table = *" --limit="large_table = 100000" -U postgres source_db | psql -U postgres target_db
A variation on the materialized view alternative outlined by Erwin Brandstetter is possible. Say, for example, that you don't want duplicates in the randomized values that are returned. Then you will need to set a boolean value on the primary table containing your (non-randomized) set of values.
Assuming this is the input table:
id_values:
 id | used
----+-------
  1 | FALSE
  2 | FALSE
  3 | FALSE
  4 | FALSE
  5 | FALSE
...
Populate the id_values table as needed. Then, as described by Erwin, create a materialized view that randomizes the id_values table once:
CREATE MATERIALIZED VIEW id_values_randomized AS
SELECT id
FROM id_values
ORDER BY random();
Note that the materialized view does not contain the used column, because this will quickly become out-of-date. Nor does the view need to contain other columns that may be in the id_values table.
In order to obtain (and "consume") random values, use an UPDATE-RETURNING on id_values, selecting ids from id_values_randomized with a join, and applying the desired criteria to obtain only relevant possibilities. For example:
UPDATE id_values
SET used = TRUE
WHERE id_values.id IN
(SELECT i.id
FROM id_values_randomized r INNER JOIN id_values i ON i.id = r.id
WHERE (NOT i.used)
LIMIT 5)
RETURNING id;
Change LIMIT as necessary -- if you only need one random value at a time, change LIMIT to 1.
With the proper indexes on id_values (one possibility is sketched below), I believe the UPDATE-RETURNING should execute very quickly with little load. It returns randomized values with one database round-trip. The criteria for "eligible" rows can be as complex as required. New rows can be added to the id_values table at any time, and they will become accessible to the application as soon as the materialized view is refreshed (which can likely be run at an off-peak time). Creation and refresh of the materialized view will be slow, but it only needs to be executed when new ids are added to the id_values table.
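For instance, a partial index covering only the unconsumed rows keeps the lookup cheap (a sketch; the index name is illustrative):
CREATE INDEX id_values_unused_idx ON id_values (id) WHERE NOT used;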
The one with the ORDER BY is going to be the slower one.
select * from table where random() < 0.01;
goes record by record, and decides to randomly filter it or not. This is going to be O(N) because it only needs to check each record once.
select * from table order by random() limit 1000;
is going to sort the entire table, then pick the first 1000. Aside from any voodoo magic behind the scenes, the order by is O(N * log N).
The downside to the random() < 0.01 one is that you'll get a variable number of output records: the expected count is 0.01 * N, so about 5 million rows for the 500-million-row table, but the exact number varies from run to run.
Note, there is a better way to shuffle a set of data than sorting by random: the Fisher-Yates shuffle, which runs in O(N). Implementing the shuffle in SQL sounds like quite the challenge, though.
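For illustration, a rough sketch of a partial Fisher-Yates shuffle in PL/pgSQL (hypothetical code, assuming a table big with an integer id; note that it materializes every id into an array first, which is itself expensive on huge tables - hence the challenge):
CREATE OR REPLACE FUNCTION fisher_yates_sample(_n int)
RETURNS SETOF int AS
$f$
DECLARE
ids int[] := ARRAY(SELECT id FROM big); -- loads all ids; costly for 500M rows
len int := array_length(ids, 1);
j int;
tmp int;
BEGIN
FOR i IN 1 .. LEAST(_n, len) LOOP
j := i + trunc(random() * (len - i + 1))::int; -- random pick from the unshuffled tail
tmp := ids[i]; ids[i] := ids[j]; ids[j] := tmp; -- swap
RETURN NEXT ids[i];
END LOOP;
END
$f$ LANGUAGE plpgsql VOLATILE;
SELECT b.* FROM fisher_yates_sample(1000) AS s(id) JOIN big b USING (id);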
If you want just one row, you can use a calculated offset derived from count.
select * from table_name limit 1
offset floor(random() * (select count(*) from table_name));
One lesson from my experience: offset floor(random() * N) limit 1 is not faster than order by random() limit 1.
I thought the offset approach would be faster because it should save the time of sorting in Postgres. Turns out it wasn't.
Here is an approach that works for me. I guess it's very simple to understand and execute.
SELECT
field_1,
field_2,
random() as ordering
FROM
big_table
WHERE
some_conditions
ORDER BY
ordering
LIMIT 1000;
To shuffle all rows:
select your_columns from your_table ORDER BY random()
To shuffle distinct rows:
select * from
(select distinct your_columns from your_table) table_alias
ORDER BY random()
To pick a single random row:
select your_columns from your_table ORDER BY random() limit 1
select * from table order by random() limit 1000;
If you know how many rows you want, check out tsm_system_rows.
The tsm_system_rows module provides the table sampling method SYSTEM_ROWS, which can be used in the TABLESAMPLE clause of a SELECT command.
This table sampling method accepts a single integer argument that is the maximum number of rows to read. The resulting sample will always contain exactly that many rows, unless the table does not contain enough rows, in which case the whole table is selected. Like the built-in SYSTEM sampling method, SYSTEM_ROWS performs block-level sampling, so that the sample is not completely random but may be subject to clustering effects, especially if only a small number of rows are requested.
First, install the extension:
CREATE EXTENSION tsm_system_rows;
Then your query:
SELECT *
FROM table
TABLESAMPLE SYSTEM_ROWS(1000);
You can examine and compare the execution plan of both by using
EXPLAIN select * from table where random() < 0.01;
EXPLAIN select * from table order by random() limit 1000;
A quick test on a large table¹ shows that the ORDER BY first sorts the complete table and then picks the first 1000 items. Sorting a large table not only reads that table but also involves reading and writing temporary files. The where random() < 0.1 only scans the complete table once.
For large tables this might not be what you want, as even one complete table scan might take too long.
A third proposal would be:
select * from table where random() < 0.01 limit 1000;
This one stops the table scan as soon as 1000 rows have been found and therefore returns sooner. Of course, this skews the randomness a bit, but perhaps this is good enough in your case.
Edit: Besides these considerations, you might check out the already-asked questions on this topic. Using the search query [postgresql] random returns quite a few hits. And a linked article by depesz outlines several more approaches.
1 "large" as in "the complete table will not fit into the memory".
Starting with PostgreSQL 9.5, there's a new syntax dedicated to getting random elements from a table:
SELECT * FROM mytable TABLESAMPLE SYSTEM (5);
This example will give you 5% of the elements from mytable.
See further explanation in the documentation: http://www.postgresql.org/docs/current/static/sql-select.html
Add a column called r with type serial, and index r (see the setup sketch below).
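A setup sketch (YOUR_TABLE is a placeholder; backfilling the new column on a 500M-row table will take a while):
ALTER TABLE YOUR_TABLE ADD COLUMN r serial;
CREATE INDEX ON YOUR_TABLE (r);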
Assume we have 200,000 rows; we are going to generate a random number n, where 0 < n <= 200,000.
Select rows with r > n, sort them ASC and select the smallest one.
Code:
select * from YOUR_TABLE
where r > (
select (
select reltuples::bigint AS estimate
from pg_class
where oid = 'public.YOUR_TABLE'::regclass) * random()
)
order by r asc limit 1;
The code is self-explanatory. The subquery in the middle is used to quickly estimate the table row count, as described in https://stackoverflow.com/a/7945274/1271094.
At the application level, you need to execute the statement again if n > the number of rows, or if you need to select multiple rows.
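A sketch of such a retry wrapper as a function (hypothetical names; assumes the r column and index from above):
CREATE OR REPLACE FUNCTION pick_one_random()
RETURNS YOUR_TABLE AS
$f$
DECLARE
result YOUR_TABLE;
BEGIN
LOOP
SELECT * INTO result
FROM YOUR_TABLE
WHERE r > (SELECT reltuples::bigint
FROM pg_class
WHERE oid = 'public.YOUR_TABLE'::regclass) * random()
ORDER BY r
LIMIT 1;
EXIT WHEN FOUND; -- retry when the random point fell past the last row
END LOOP;
RETURN result;
END
$f$ LANGUAGE plpgsql VOLATILE;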