[mysql] Simple Random Samples from a Sql database

How do I take an efficient simple random sample in SQL? The database in question is running MySQL; my table is at least 200,000 rows, and I want a simple random sample of about 10,000.

The "obvious" answer is to:

SELECT * FROM table ORDER BY RAND() LIMIT 10000

For large tables, that's too slow: it calls RAND() for every row (which already puts it at O(n)), and sorts them, making it O(n lg n) at best. Is there a way to do this faster than O(n)?

Note: As Andrew Mao points out in the comments, If you're using this approach on SQL Server, you should use the T-SQL function NEWID(), because RAND() may return the same value for all rows.

EDIT: 5 YEARS LATER

I ran into this problem again with a bigger table, and ended up using a version of @ignorant's solution, with two tweaks:

  • Sample the rows to 2-5x my desired sample size, to cheaply ORDER BY RAND()
  • Save the result of RAND() to an indexed column on every insert/update. (If your data set isn't very update-heavy, you may need to find another way to keep this column fresh.)

To take a 1000-item sample of a table, I count the rows and sample the result down to, on average, 10,000 rows with the the frozen_rand column:

SELECT COUNT(*) FROM table; -- Use this to determine rand_low and rand_high

  SELECT *
    FROM table
   WHERE frozen_rand BETWEEN %(rand_low)s AND %(rand_high)s
ORDER BY RAND() LIMIT 1000

(My actual implementation involves more work to make sure I don't undersample, and to manually wrap rand_high around, but the basic idea is "randomly cut your N down to a few thousand.")

While this makes some sacrifices, it allows me to sample the database down using an index scan, until it's small enough to ORDER BY RAND() again.

This question is related to mysql sql postgresql random

The answer is


I think the fastest solution is

select * from table where rand() <= .3

Here is why I think this should do the job.

  • It will create a random number for each row. The number is between 0 and 1
  • It evaluates whether to display that row if the number generated is between 0 and .3 (30%).

This assumes that rand() is generating numbers in a uniform distribution. It is the quickest way to do this.

I saw that someone had recommended that solution and they got shot down without proof.. here is what I would say to that -

  • This is O(n) but no sorting is required so it is faster than the O(n lg n)
  • mysql is very capable of generating random numbers for each row. Try this -

    select rand() from INFORMATION_SCHEMA.TABLES limit 10;

Since the database in question is mySQL, this is the right solution.


Maybe you could do

SELECT * FROM table LIMIT 10000 OFFSET FLOOR(RAND() * 190000)

Starting with the observation that we can retrieve the ids of a table (eg. count 5) based on a set:

select *
from table_name
where _id in (4, 1, 2, 5, 3)

we can come to the result that if we could generate the string "(4, 1, 2, 5, 3)", then we would have a more efficient way than RAND().

For example, in Java:

ArrayList<Integer> indices = new ArrayList<Integer>(rowsCount);
for (int i = 0; i < rowsCount; i++) {
    indices.add(i);
}
Collections.shuffle(indices);
String inClause = indices.toString().replace('[', '(').replace(']', ')');

If ids have gaps, then the initial arraylist indices is the result of an sql query on ids.


Faster Than ORDER BY RAND()

I tested this method to be much faster than ORDER BY RAND(), hence it runs in O(n) time, and does so impressively fast.

From http://technet.microsoft.com/en-us/library/ms189108%28v=sql.105%29.aspx:

Non-MSSQL version -- I did not test this

SELECT * FROM Sales.SalesOrderDetail
WHERE 0.01 >= RAND()

MSSQL version:

SELECT * FROM Sales.SalesOrderDetail
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float) / CAST (0x7fffffff AS int)

This will select ~1% of records. So if you need exact # of percents or records to be selected, estimate your percentage with some safety margin, then randomly pluck excess records from resulting set, using the more expensive ORDER BY RAND() method.

Even Faster

I was able to improve upon this method even further because I had a well-known indexed column value range.

For example, if you have an indexed column with uniformly distributed integers [0..max], you can use that to randomly select N small intervals. Do this dynamically in your program to get a different set for each query run. This subset selection will be O(N), which can many orders of magnitude smaller than your full data set.

In my test I reduced the time needed to get 20 (out 20 mil) sample records from 3 mins using ORDER BY RAND() down to 0.0 seconds!


Just use

WHERE RAND() < 0.1 

to get 10% of the records or

WHERE RAND() < 0.01 

to get 1% of the records, etc.


In certain dialects like Microsoft SQL Server, PostgreSQL, and Oracle (but not MySQL or SQLite), you can do something like

select distinct top 10000 customer_id from nielsen.dbo.customer TABLESAMPLE (20000 rows) REPEATABLE (123);

The reason for not just doing (10000 rows) without the top is that the TABLESAMPLE logic gives you an extremely inexact number of rows (like sometimes 75% that, sometimes 1.25% times that), so you want to oversample and select the exact number you want. The REPEATABLE (123) is for providing a random seed.


Apparently in some versions of SQL there's a TABLESAMPLE command, but it's not in all SQL implementations (notably, Redshift).

http://technet.microsoft.com/en-us/library/ms189108(v=sql.105).aspx


If you need exactly m rows, realistically you'll generate your subset of IDs outside of SQL. Most methods require at some point to select the "nth" entry, and SQL tables are really not arrays at all. The assumption that the keys are consecutive in order to just join random ints between 1 and the count is also difficult to satisfy — MySQL for example doesn't support it natively, and the lock conditions are... tricky.

Here's an O(max(n, m lg n))-time, O(n)-space solution assuming just plain BTREE keys:

  1. Fetch all values of the key column of the data table in any order into an array in your favorite scripting language in O(n)
  2. Perform a Fisher-Yates shuffle, stopping after m swaps, and extract the subarray [0:m-1] in ?(m)
  3. "Join" the subarray with the original dataset (e.g. SELECT ... WHERE id IN (<subarray>)) in O(m lg n)

Any method that generates the random subset outside of SQL must have at least this complexity. The join can't be any faster than O(m lg n) with BTREE (so O(m) claims are fantasy for most engines) and the shuffle is bounded below n and m lg n and doesn't affect the asymptotic behavior.

In Pythonic pseudocode:

ids = sql.query('SELECT id FROM t')
for i in range(m):
  r = int(random() * (len(ids) - i))
  ids[i], ids[i + r] = ids[i + r], ids[i]

results = sql.query('SELECT * FROM t WHERE id IN (%s)' % ', '.join(ids[0:m-1])

Just use

WHERE RAND() < 0.1 

to get 10% of the records or

WHERE RAND() < 0.01 

to get 1% of the records, etc.


I think the fastest solution is

select * from table where rand() <= .3

Here is why I think this should do the job.

  • It will create a random number for each row. The number is between 0 and 1
  • It evaluates whether to display that row if the number generated is between 0 and .3 (30%).

This assumes that rand() is generating numbers in a uniform distribution. It is the quickest way to do this.

I saw that someone had recommended that solution and they got shot down without proof.. here is what I would say to that -

  • This is O(n) but no sorting is required so it is faster than the O(n lg n)
  • mysql is very capable of generating random numbers for each row. Try this -

    select rand() from INFORMATION_SCHEMA.TABLES limit 10;

Since the database in question is mySQL, this is the right solution.


In certain dialects like Microsoft SQL Server, PostgreSQL, and Oracle (but not MySQL or SQLite), you can do something like

select distinct top 10000 customer_id from nielsen.dbo.customer TABLESAMPLE (20000 rows) REPEATABLE (123);

The reason for not just doing (10000 rows) without the top is that the TABLESAMPLE logic gives you an extremely inexact number of rows (like sometimes 75% that, sometimes 1.25% times that), so you want to oversample and select the exact number you want. The REPEATABLE (123) is for providing a random seed.


Starting with the observation that we can retrieve the ids of a table (eg. count 5) based on a set:

select *
from table_name
where _id in (4, 1, 2, 5, 3)

we can come to the result that if we could generate the string "(4, 1, 2, 5, 3)", then we would have a more efficient way than RAND().

For example, in Java:

ArrayList<Integer> indices = new ArrayList<Integer>(rowsCount);
for (int i = 0; i < rowsCount; i++) {
    indices.add(i);
}
Collections.shuffle(indices);
String inClause = indices.toString().replace('[', '(').replace(']', ')');

If ids have gaps, then the initial arraylist indices is the result of an sql query on ids.


I want to point out that all of these solutions appear to sample without replacement. Selecting the top K rows from a random sort or joining to a table that contains unique keys in random order will yield a random sample generated without replacement.

If you want your sample to be independent, you'll need to sample with replacement. See Question 25451034 for one example of how to do this using a JOIN in a manner similar to user12861's solution. The solution is written for T-SQL, but the concept works in any SQL db.


Try

SELECT TOP 10000 * FROM table ORDER BY NEWID()

Would this give the desired results, without being too over complicated?


Try

SELECT TOP 10000 * FROM table ORDER BY NEWID()

Would this give the desired results, without being too over complicated?


Select 3000 random records in Netezza:

WITH IDS AS (
     SELECT ID
     FROM MYTABLE;
)

SELECT ID FROM IDS ORDER BY mt_random() LIMIT 3000

I want to point out that all of these solutions appear to sample without replacement. Selecting the top K rows from a random sort or joining to a table that contains unique keys in random order will yield a random sample generated without replacement.

If you want your sample to be independent, you'll need to sample with replacement. See Question 25451034 for one example of how to do this using a JOIN in a manner similar to user12861's solution. The solution is written for T-SQL, but the concept works in any SQL db.


Faster Than ORDER BY RAND()

I tested this method to be much faster than ORDER BY RAND(), hence it runs in O(n) time, and does so impressively fast.

From http://technet.microsoft.com/en-us/library/ms189108%28v=sql.105%29.aspx:

Non-MSSQL version -- I did not test this

SELECT * FROM Sales.SalesOrderDetail
WHERE 0.01 >= RAND()

MSSQL version:

SELECT * FROM Sales.SalesOrderDetail
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float) / CAST (0x7fffffff AS int)

This will select ~1% of records. So if you need exact # of percents or records to be selected, estimate your percentage with some safety margin, then randomly pluck excess records from resulting set, using the more expensive ORDER BY RAND() method.

Even Faster

I was able to improve upon this method even further because I had a well-known indexed column value range.

For example, if you have an indexed column with uniformly distributed integers [0..max], you can use that to randomly select N small intervals. Do this dynamically in your program to get a different set for each query run. This subset selection will be O(N), which can many orders of magnitude smaller than your full data set.

In my test I reduced the time needed to get 20 (out 20 mil) sample records from 3 mins using ORDER BY RAND() down to 0.0 seconds!


Maybe you could do

SELECT * FROM table LIMIT 10000 OFFSET FLOOR(RAND() * 190000)

If you need exactly m rows, realistically you'll generate your subset of IDs outside of SQL. Most methods require at some point to select the "nth" entry, and SQL tables are really not arrays at all. The assumption that the keys are consecutive in order to just join random ints between 1 and the count is also difficult to satisfy — MySQL for example doesn't support it natively, and the lock conditions are... tricky.

Here's an O(max(n, m lg n))-time, O(n)-space solution assuming just plain BTREE keys:

  1. Fetch all values of the key column of the data table in any order into an array in your favorite scripting language in O(n)
  2. Perform a Fisher-Yates shuffle, stopping after m swaps, and extract the subarray [0:m-1] in ?(m)
  3. "Join" the subarray with the original dataset (e.g. SELECT ... WHERE id IN (<subarray>)) in O(m lg n)

Any method that generates the random subset outside of SQL must have at least this complexity. The join can't be any faster than O(m lg n) with BTREE (so O(m) claims are fantasy for most engines) and the shuffle is bounded below n and m lg n and doesn't affect the asymptotic behavior.

In Pythonic pseudocode:

ids = sql.query('SELECT id FROM t')
for i in range(m):
  r = int(random() * (len(ids) - i))
  ids[i], ids[i + r] = ids[i + r], ids[i]

results = sql.query('SELECT * FROM t WHERE id IN (%s)' % ', '.join(ids[0:m-1])

Examples related to mysql

Implement specialization in ER diagram How to post query parameters with Axios? PHP with MySQL 8.0+ error: The server requested authentication method unknown to the client Loading class `com.mysql.jdbc.Driver'. This is deprecated. The new driver class is `com.mysql.cj.jdbc.Driver' phpMyAdmin - Error > Incorrect format parameter? Authentication plugin 'caching_sha2_password' is not supported How to resolve Unable to load authentication plugin 'caching_sha2_password' issue Connection Java-MySql : Public Key Retrieval is not allowed How to grant all privileges to root user in MySQL 8.0 MySQL 8.0 - Client does not support authentication protocol requested by server; consider upgrading MySQL client

Examples related to sql

Passing multiple values for same variable in stored procedure SQL permissions for roles Generic XSLT Search and Replace template Access And/Or exclusions Pyspark: Filter dataframe based on multiple conditions Subtracting 1 day from a timestamp date PYODBC--Data source name not found and no default driver specified select rows in sql with latest date for each ID repeated multiple times ALTER TABLE DROP COLUMN failed because one or more objects access this column Create Local SQL Server database

Examples related to postgresql

Subtracting 1 day from a timestamp date pgadmin4 : postgresql application server could not be contacted. Psql could not connect to server: No such file or directory, 5432 error? How to persist data in a dockerized postgres database using volumes input file appears to be a text format dump. Please use psql Postgres: check if array field contains value? Add timestamp column with default NOW() for new rows only Can't connect to Postgresql on port 5432 How to insert current datetime in postgresql insert query Connecting to Postgresql in a docker container from outside

Examples related to random

How can I get a random number in Kotlin? scikit-learn random state in splitting dataset Random number between 0 and 1 in python In python, what is the difference between random.uniform() and random.random()? Generate random colors (RGB) Random state (Pseudo-random number) in Scikit learn How does one generate a random number in Apple's Swift language? How to generate a random string of a fixed length in Go? Generate 'n' unique random numbers within a range What does random.sample() method in python do?