Simple Random Samples from a SQL database

Question

How do I take an efficient simple random sample in SQL? The database in question is running MySQL; my table is at least 200,000 rows, and I want a simple random sample of about 10,000.

The "obvious" answer is to:

    SELECT * FROM table ORDER BY RAND() LIMIT 10000

For large tables, that's too slow: it calls RAND() for every row (which already puts it at O(n)), and sorts them, making it O(n lg n) at best. Is there a way to do this faster than O(n)?

Note: As Andrew Mao points out in the comments, if you're using this approach on SQL Server, you should use the T-SQL function NEWID(), because RAND() may return the same value for all rows.

EDIT: 5 YEARS LATER

I ran into this problem again with a bigger table, and ended up using a version of ignorant's solution, with two tweaks:

- Sample the rows to 2-5x my desired sample size, to cheaply ORDER BY RAND()
- Save the result of RAND() to an indexed column on every insert/update. (If your data set isn't very update-heavy, you may need to find another way to keep this column fresh.)

To take a 1000-item sample of a table, I count the rows and sample the result down to, on average, 10,000 rows with the frozen_rand column:

    SELECT COUNT(*) FROM table; -- Use this to determine rand_low and rand_high

    SELECT *
      FROM table
     WHERE frozen_rand BETWEEN %(rand_low)s AND %(rand_high)s
     ORDER BY RAND() LIMIT 1000

(My actual implementation involves more work to make sure I don't undersample, and to manually wrap rand_high around, but the basic idea is "randomly cut your N down to a few thousand.")

While this makes some sacrifices, it allows me to sample the database down using an index scan, until it's small enough to ORDER BY RAND() again.
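A minimal sketch of the frozen_rand idea from the edit above, written in Python against an in-memory SQLite database as a stand-in for MySQL (so RANDOM() takes the place of RAND()). The table name items, the oversample factor, and the helper sample_via_frozen_rand are assumptions made for illustration, not the asker's actual implementation. The window is wrapped around past 1.0 with an OR; there is no guard against undersampling.

```python
import random
import sqlite3


def sample_via_frozen_rand(conn, k, oversample=5.0, rng=random):
    """Return k random rows, pre-filtering on the indexed frozen_rand column."""
    (n,) = conn.execute("SELECT COUNT(*) FROM items").fetchone()
    if k >= n:
        return conn.execute("SELECT id, payload FROM items").fetchall()

    # Width of the frozen_rand window that holds about oversample * k rows.
    width = min(1.0, oversample * k / n)
    low = rng.random()
    high = low + width

    if high <= 1.0:
        where, params = "frozen_rand >= ? AND frozen_rand < ?", (low, high)
    else:
        # The window runs past 1.0, so wrap it around to the start of [0, 1).
        where, params = "frozen_rand >= ? OR frozen_rand < ?", (low, high - 1.0)

    # Only the pre-filtered rows are sorted randomly.
    sql = f"SELECT id, payload FROM items WHERE {where} ORDER BY RANDOM() LIMIT ?"
    return conn.execute(sql, params + (k,)).fetchall()


# Hypothetical demo table: 20,000 rows with a stored random value per row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, payload TEXT, frozen_rand REAL)")
conn.execute("CREATE INDEX idx_items_frozen_rand ON items (frozen_rand)")
seed_rng = random.Random(0)
conn.executemany(
    "INSERT INTO items (payload, frozen_rand) VALUES (?, ?)",
    [(f"row {i}", seed_rng.random()) for i in range(20000)],
)

rows = sample_via_frozen_rand(conn, k=100)
```

Because frozen_rand is fixed per row, rows whose stored values are close together tend to be selected together across runs; that is one of the sacrifices the edit mentions.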

User · Answer

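Maybe you could do:

    SELECT * FROM table LIMIT 10000 OFFSET FLOOR(RAND() * 190000)

Note that this returns one contiguous block of 10,000 rows starting at a random offset, so it is only a simple random sample if the row order itself is random.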

User · Answer

Faster Than ORDER BY RAND()

I tested this method to be much faster than ORDER BY RAND(): it needs no sort, so it runs in O(n) time, and does so impressively fast.

From http://technet.microsoft.com/en-us/library/ms189108%28v=sql.105%29.aspx:

Non-MSSQL version (I did not test this):

    SELECT * FROM Sales.SalesOrderDetail
    WHERE 0.01 >= RAND()

MSSQL version:

    SELECT * FROM Sales.SalesOrderDetail
    WHERE 0.01 >= CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float) / CAST (0x7fffffff AS int)

This will select ~1% of records. So if you need an exact percentage or number of records, estimate your percentage with some safety margin, then randomly pluck the excess records from the resulting set, using the more expensive ORDER BY RAND() method.

Even Faster

I was able to improve upon this method even further because I had a well-known indexed column value range.

For example, if you have an indexed column with uniformly distributed integers [0..max], you can use that to randomly select N small intervals. Do this dynamically in your program to get a different set for each query run. This subset selection will be O(N), which can be many orders of magnitude smaller than your full data set.

In my test I reduced the time needed to get 20 (out of 20 mil) sample records from 3 mins using ORDER BY RAND() down to 0.0 seconds!
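The "filter with a safety margin, then trim" step can be sketched as follows. This is an illustration under assumptions, not the answerer's code: it uses Python with an in-memory SQLite table named sales as a stand-in, maps SQLite's integer RANDOM() onto [0, 1) with a bit mask (similar in spirit to the MSSQL expression above), and picks the margin as a number of standard deviations (the parameter margin_sigmas is hypothetical).

```python
import math
import sqlite3


def bernoulli_then_trim(conn, k, margin_sigmas=6.0):
    """Keep each row with probability p in one pass, then trim the excess to k."""
    (n,) = conn.execute("SELECT COUNT(*) FROM sales").fetchone()
    if k >= n:
        return conn.execute("SELECT id FROM sales").fetchall()

    # Aim a few standard deviations above k so the filter rarely undershoots.
    target = k + margin_sigmas * math.sqrt(k)
    p = min(1.0, target / n)

    # SQLite stand-in for "WHERE p >= RAND()": mask RANDOM() to 31 bits and
    # scale it into [0, 1). ORDER BY RANDOM() then sorts only the survivors.
    return conn.execute(
        """
        SELECT id FROM sales
        WHERE (RANDOM() & 2147483647) / 2147483648.0 < ?
        ORDER BY RANDOM()
        LIMIT ?
        """,
        (p, k),
    ).fetchall()


# Hypothetical demo table with 50,000 rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO sales (amount) VALUES (?)",
    [(float(i),) for i in range(50000)],
)

rows = bernoulli_then_trim(conn, k=1000)
```

The filter is a single O(n) pass, and the sort touches only slightly more than k rows, which is where the saving over a full ORDER BY RAND() comes from.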

User · Answer

Apparently in some versions of SQL there's a TABLESAMPLE command, but it's not in all SQL implementations (notably, Redshift). http://technet.microsoft.com/en-us/library/ms189108(v=sql.105).aspx

User · Answer

There's a very interesting discussion of this type of issue here: http://www.titov.net/2005/09/21/do-not-use-order-by-rand-or-how-to-get-random-rows-from-table/

I think with absolutely no assumptions about the table that your O(n lg n) solution is the best. Though actually, with a good optimizer or a slightly different technique, the query you list may be a bit better, O(m*n) where m is the number of random rows desired, as it wouldn't necessarily have to sort the whole large array; it could just search for the smallest m times. But for the sort of numbers you posted, m is bigger than lg n anyway.

Three assumptions we might try out:

1. there is a unique, indexed, primary key in the table
2. the number of random rows you want to select (m) is much smaller than the number of rows in the table (n)
3. the unique primary key is an integer that ranges from 1 to n with no gaps

With only assumptions 1 and 2 I think this can be done in O(n), though you'll need to write a whole index to the table to match assumption 3, so it's not necessarily a fast O(n). If we can ADDITIONALLY assume something else nice about the table, we can do the task in O(m log m). Assumption 3 would be an easy, nice additional property to work with. With a nice random number generator that guaranteed no duplicates when generating m numbers in a row, an O(m) solution would be possible.

Given the three assumptions, the basic idea is to generate m unique random numbers between 1 and n, and then select the rows with those keys from the table. I don't have mysql or anything in front of me right now, so in slightly pseudocode this would look something like:

    create table RandomKeys (RandomKey int)
    create table RandomKeysAttempt (RandomKey int)

    -- generate m random integer keys between 1 and n
    for i = 1 to m
      insert RandomKeysAttempt select floor(rand()*n) + 1

    -- eliminate duplicates
    insert RandomKeys select distinct RandomKey from RandomKeysAttempt

    -- as long as we don't have enough, keep generating new keys,
    -- with luck (and m much less than n), this won't be necessary
    while count(RandomKeys) < m
      NextAttempt = floor(rand()*n) + 1
      if not exists (select * from RandomKeys where RandomKey = NextAttempt)
        insert RandomKeys select NextAttempt

    -- get our random rows
    select *
    from RandomKeys r
    join table t ON r.RandomKey = t.UniqueKey

If you were really concerned about efficiency, you might consider doing the random key generation in some sort of procedural language and inserting the results in the database, as almost anything other than SQL would probably be better at the sort of looping and random number generation required.
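The closing suggestion, generating the keys in a procedural language, might look like the sketch below. It assumes all three assumptions hold (in particular, gapless integer keys from 1 to n), uses Python with an in-memory SQLite table named t as a stand-in, and the helper name sample_by_random_keys is made up for illustration.

```python
import random
import sqlite3


def sample_by_random_keys(conn, m, rng=random):
    """Pick m rows by drawing m distinct keys from 1..n (assumes gapless ids)."""
    (n,) = conn.execute("SELECT COUNT(*) FROM t").fetchone()
    # random.sample draws without replacement, so no duplicate-retry loop is needed.
    keys = rng.sample(range(1, n + 1), min(m, n))

    # Stage the keys in a temporary table and join on the primary key.
    conn.execute(
        "CREATE TEMP TABLE IF NOT EXISTS random_keys (random_key INTEGER PRIMARY KEY)"
    )
    conn.execute("DELETE FROM random_keys")
    conn.executemany(
        "INSERT INTO random_keys (random_key) VALUES (?)",
        [(key,) for key in keys],
    )
    return conn.execute(
        "SELECT t.id, t.payload FROM random_keys r JOIN t ON r.random_key = t.id"
    ).fetchall()


# Hypothetical demo table with ids 1..5000 and no gaps.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO t (id, payload) VALUES (?, ?)",
    [(i, f"row {i}") for i in range(1, 5001)],
)

rows = sample_by_random_keys(conn, m=50)
```

If the ids have gaps, some drawn keys match no row and the join silently returns fewer than m rows, which is why assumption 3 matters here.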

User · Answer

Select 3000 random records in Netezza:

    WITH IDS AS (
         SELECT ID
         FROM MYTABLE
    )
    SELECT ID FROM IDS ORDER BY mt_random() LIMIT 3000

User · Answer

Starting with the observation that we can retrieve the ids of a table (e.g. count 5) based on a set:

    select * from table_name where id in (4, 1, 2, 5, 3)

we can come to the result that if we could generate the string "(4, 1, 2, 5, 3)", then we would have a more efficient way than RAND().

For example, in Java:

    import java.util.ArrayList;
    import java.util.Collections;

    ArrayList<Integer> indices = new ArrayList<Integer>(rowsCount);
    for (int i = 0; i < rowsCount; i++) {
        indices.add(i);
    }
    Collections.shuffle(indices);
    // "[4, 1, 2]" becomes "(4, 1, 2)", ready to use after IN
    String inClause = indices.toString().replace('[', '(').replace(']', ')');

If ids have gaps, then the initial arraylist indices is the result of an sql query on ids.

User · Answer

If you need exactly m rows, realistically you'll generate your subset of IDs outside of SQL. Most methods require at some point to select the "nth" entry, and SQL tables are really not arrays at all. The assumption that the keys are consecutive in order to just join random ints between 1 and the count is also difficult to satisfy; MySQL for example doesn't support it natively, and the lock conditions are... tricky.

Here's an O(max(n, m lg n))-time, O(n)-space solution assuming just plain BTREE keys:

1. Fetch all values of the key column of the data table in any order into an array in your favorite scripting language in O(n)
2. Perform a Fisher-Yates shuffle, stopping after m swaps, and extract the first m elements (indices 0 to m-1) in Θ(m)
3. "Join" the subarray with the original dataset (e.g. SELECT ... WHERE id IN (<subarray>)) in O(m lg n)

Any method that generates the random subset outside of SQL must have at least this complexity. The join can't be any faster than O(m lg n) with BTREE (so O(m) claims are fantasy for most engines), and the shuffle is bounded below n and m lg n and doesn't affect the asymptotic behavior.

In Pythonic pseudocode (sql.query stands for whatever query helper you use, and m must not exceed the number of ids):

    from random import random

    ids = sql.query('SELECT id FROM t')
    for i in range(m):
      # swap position i with a random position in [i, len(ids) - 1]
      r = int(random() * (len(ids) - i))
      ids[i], ids[i + r] = ids[i + r], ids[i]

    results = sql.query('SELECT * FROM t WHERE id IN (%s)' % ', '.join(str(x) for x in ids[:m]))

User · Answer

Just use

    WHERE RAND() < 0.1

to get 10% of the records or

    WHERE RAND() < 0.01

to get 1% of the records, etc.

User · Answer

Try

    SELECT TOP 10000 * FROM table ORDER BY NEWID()

Would this give the desired results, without being too over complicated?

User · Answer

I think the fastest solution is

    select * from table where rand() <= .3

Here is why I think this should do the job.

- It will create a random number for each row. The number is between 0 and 1.
- It evaluates whether to display that row if the number generated is between 0 and .3 (30%).

This assumes that rand() is generating numbers in a uniform distribution. It is the quickest way to do this.

I saw that someone had recommended that solution and they got shot down without proof... here is what I would say to that:

- This is O(n) but no sorting is required, so it is faster than the O(n lg n).
- mysql is very capable of generating random numbers for each row. Try this:

      select rand() from INFORMATION_SCHEMA.TABLES limit 10;

Since the database in question is mySQL, this is the right solution.
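One consequence worth quantifying: because each row is kept independently with probability p, the number of rows returned is binomially distributed, not fixed. A small worked calculation in Python, using the question's figures (200,000 rows, about 10,000 wanted, hence p = 0.05); the function name is made up for illustration:

```python
import math


def bernoulli_sample_size(n, p):
    """Mean and standard deviation of the row count from WHERE rand() <= p."""
    mean = n * p
    std = math.sqrt(n * p * (1.0 - p))
    return mean, std


mean, std = bernoulli_sample_size(n=200_000, p=0.05)
print(f"expected rows: {mean:.0f}, standard deviation: {std:.1f}")
```

So a single run typically returns within a few hundred rows of 10,000; if an exact count is required, the filter has to be followed by a trimming step.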

User · Answer

In certain dialects like Microsoft SQL Server, PostgreSQL, and Oracle (but not MySQL or SQLite), you can do something like the following (shown in SQL Server syntax; the other dialects spell the clause differently):

    select distinct top 10000 customer_id from nielsen.dbo.customer TABLESAMPLE (20000 rows) REPEATABLE (123);

The reason for not just doing (10000 rows) without the top is that the TABLESAMPLE logic gives you an extremely inexact number of rows (like sometimes 75% of that, sometimes 1.25 times that), so you want to oversample and select the exact number you want. The REPEATABLE (123) is for providing a random seed.

User · Answer

I want to point out that all of these solutions appear to sample without replacement. Selecting the top K rows from a random sort or joining to a table that contains unique keys in random order will yield a random sample generated without replacement.

If you want your sample to be independent, you'll need to sample with replacement. See Question 25451034 for one example of how to do this using a JOIN in a manner similar to user12861's solution. The solution is written for T-SQL, but the concept works in any SQL db.
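As a rough illustration of the difference, here is a sketch of sampling with replacement: keys are drawn independently, staged in a table that allows duplicates, and joined back so a row can appear more than once. It is not the code from the linked question; it uses Python with an in-memory SQLite table named t as a stand-in, and the names draws and sample_with_replacement are assumptions.

```python
import random
import sqlite3


def sample_with_replacement(conn, m, rng=random):
    """Draw m rows independently; the same row may appear more than once."""
    ids = [row[0] for row in conn.execute("SELECT id FROM t")]
    drawn = rng.choices(ids, k=m)  # independent draws, duplicates allowed

    # A non-unique temp table keeps one entry per draw, so the join repeats rows.
    conn.execute("CREATE TEMP TABLE IF NOT EXISTS draws (draw_no INTEGER, id INTEGER)")
    conn.execute("DELETE FROM draws")
    conn.executemany("INSERT INTO draws VALUES (?, ?)", list(enumerate(drawn)))
    return conn.execute(
        "SELECT t.id, t.payload FROM draws d JOIN t ON d.id = t.id ORDER BY d.draw_no"
    ).fetchall()


# Hypothetical demo table with only 10 rows, so 50 draws must repeat some rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO t (id, payload) VALUES (?, ?)",
    [(i, f"row {i}") for i in range(1, 11)],
)

rows = sample_with_replacement(conn, m=50)
```

The key point is that the staging table has no uniqueness constraint; with a unique key column the join would collapse back to sampling without replacement.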
