Select n random rows from SQL Server table

Question

I ve got a SQL Server table with about 50 000 rows in it  I want to select about 5 000 of those rows at random  I ve thought of a complicated way  creating a temp table with a  random number  column  copying my table into that  looping through the temp table and updating each row with RAND    and then selecting from that table where the random number column  lt  0 1  I m looking for a simpler way to do it  in a single statement if possible   This article suggest using the NEWID   function  That looks promising  but I can t see how I could reliably select a certain percentage of rows   Anybody ever do this before  Any ideas

User · Answer

select    from table where id in   select id from table order by random   limit   select count    from table  55 100      to select 55 percent of rows randomly

User · Answer

Here is an updated and improved form of sampling  It is based on the same concept of some other answers that use CHECKSUM   BINARY CHECKSUM and modulus  Reasons to use an implementation similar to this one  as opposed to other answers   It is relatively fast over huge data sets and can be efficiently used in with derived queries  Millions of pre-filtered rows can be sampled in seconds with no tempdb usage and  if aligned with the rest of the query  the overhead is often minimal  Does not suffer from CHECKSUM      BINARY CHECKSUM    issues with runs of data  When using the CHECKSUM    approach  the rows can be selected in  quot chunks quot  and not  quot random quot  at all  This is because CHECKSUM prefers speed over distribution  Results in a stable repeatable row selection and can be trivially changed to produce different rows on subsequent query executions  Approaches that use NEWID    such as CHECKSUM NEWID      100  can never be stable repeatable  Allows for increased sample precision and reduces introduced statistical errors  The sampling precision can also be tweaked  CHECKSUM only returns an int value  Does not use ORDER BY NEWID    as ordering can become a significant bottleneck with large input sets  Avoiding the sorting also reduces memory and tempdb usage  Does not use TABLESAMPLE and thus works with a WHERE pre-filter   Cons   limitations   Slightly slower execution times and using CHECKSUM     Using hashbytes  as shown below  adds about 3 4 of a second of overhead per million lines  This is with my data  on my database instance  YMMV  This overhead can be eliminated if using a persisted computed column of the resulting  well distributed  bigint value from HASHBYTES  Unlike the basic SELECT TOP n    ORDER BY NEWID    this is not guaranteed to return  quot exactly N quot  rows  Instead  it returns a percentage row rows where such a value is pre-determined  For very small sample sizes this could result in 0 rows selected  This limitation is shared with the CHECKSUM    approaches   Here is the gist  -- Allow a sampling precision  0  100 0000   declare  sample percent decimal 7  4    12 3456  select     t   from t where 1 1     and t Name    Mr  No Questionable Checksum Usages      and   -- sample          sample percent   100         or abs              -- Choose appropriate identity column s  for hashbytes input              -- For demonstration it is assumed to be a UNIQUEIDENTIFIER rowguid column              convert bigint  hashbytes  SHA1   convert varbinary 32   t rowguid                 1000   100   lt   1000    sample percent         Notes   While SHA1 is technically deprecated since SQL Server 2016  it is both sufficient for the task and is slightly faster than either MD5 or SHA2 256  Use a different hashing function as relevant  If the table already contains a hashed column  with a good distribution   that could potentially be used as well  Conversion of bigint is critical as it allows 2 63 bits of  random space  to which to apply the modulus operator  this is much more than the 2 31 range from the CHECKSUM result  This reduces the modulus error at the limit  especially as the precision is increased  The sampling precision can be changed as long as the modulus operand and sample percent are multiplied appropriately  In this case  that is 1000   to account for the 4 digits of precision allowed in  sample percent  Can multiply the bigint value by RAND   to return a different row sample each run  This effectively changes the permutation of the fixed hash values  If  sample percent is 100 the query planner can eliminate the slower calculation code entirely  Remember  parameter sniffing  rules  This allows the code to be left in the query regardless of enabling sampling    Computing  sample percent  with lower upper limits  and adding a TOP  quot hint quot  in the query as might be useful when the sample is used in a derived table context  -- Approximate max-sample and min-sample ranges  -- The minimum sample percent should be non-zero within the precision  declare  max sample size int   3333333 declare  min sample percent decimal 7 4    0 3333 declare  sample percent decimal 7 4  --  0  100 0000  declare  sample size int  -- Get initial count for determining sample percentages  -- Remember to match the filter conditions with the usage site  declare  rows int select  rows   count 1      from t     where 1 1         and t Name    Mr  No Questionable Checksum Usages   -- Calculate sample percent and back-calculate actual sample size  if  rows  lt    max sample size begin     set  sample percent   100 end else begin     set  sample percent   convert float  100     max sample size    rows     if  sample percent  lt   min sample percent         set  sample percent    min sample percent end set  sample size   ceiling  rows    sample percent   100   select   from    join       -- Not a precise value  if limiting exactly at  can introduce more bias      -- Using  option optimize for  avoids this while requiring dynamic SQL      select top   sample size   convert int   sample percent   5       from t     where 1 1         and t Name    Mr  No Questionable Checksum Usages          and   -- sample              sample percent   100             or abs                  convert bigint  hashbytes  SHA1   convert varbinary 32   t rowguid                     1000   100   lt   1000    sample percent              sampled on

User · Answer

Selecting Rows Randomly from a Large Table on MSDN has a simple  well-articulated solution that addresses the large-scale performance concerns     SELECT   FROM Table1   WHERE  ABS CAST     BINARY CHECKSUM        RAND    as int     100   lt  10

User · Answer

select top 10 percent   from  yourtable  order by newid     In response to the  pure trash  comment concerning large tables  you could do it like this to improve performance   select    from  yourtable  where  yourPk  in   select top 10 percent  yourPk  from  yourtable  order by newid      The cost of this will be the key scan of values plus the join cost  which on a large table with a small percentage selection should be reasonable

User · Answer

This link have a interesting comparison between Orderby NEWID    and other methods for tables with 1  7  and 13 millions of rows   Often  when questions about how to select random rows are asked in discussion groups  the NEWID query is proposed  it is simple and works very well for small tables    SELECT TOP 10 PERCENT     FROM Table1   ORDER BY NEWID     However  the NEWID query has a big drawback when you use it for large tables  The ORDER BY clause causes all of the rows in the table to be copied into the tempdb database  where they are sorted  This causes two problems    The sorting operation usually has a high cost associated with it  Sorting can use a lot of disk I O and can run for a long time  In the worst-case scenario  tempdb can run out of space  In the best-case scenario  tempdb can take up a large amount of disk space that never will be reclaimed without a manual shrink command    What you need is a way to select rows randomly that will not use tempdb and will not get much slower as the table gets larger  Here is a new idea on how to do that   SELECT   FROM Table1   WHERE  ABS CAST     BINARY CHECKSUM        RAND    as int     100   lt  10   The basic idea behind this query is that we want to generate a random number between 0 and 99 for each row in the table  and then choose all of those rows whose random number is less than the value of the specified percent  In this example  we want approximately 10 percent of the rows selected randomly  therefore  we choose all of the rows whose random number is less than 10   Please read the full article in MSDN

User · Answer

newid   order by will work  but will be very expensive for large result sets because it has to generate an id for every row  and then sort them   TABLESAMPLE   is good from a performance standpoint  but you will get clumping of results  all rows on a page will be returned    For a better performing true random sample  the best way is to filter out rows randomly  I found the following code sample in the SQL Server Books Online article Limiting Results Sets by Using TABLESAMPLE      If you really want a random sample of   individual rows  modify your query to   filter out rows randomly  instead of   using TABLESAMPLE  For example  the   following query uses the NEWID   function to return approximately one   percent of the rows of the   Sales SalesOrderDetail table   SELECT   FROM Sales SalesOrderDetail WHERE 0 01  gt   CAST CHECKSUM NEWID   SalesOrderID   amp  0x7fffffff AS float                  CAST  0x7fffffff AS int        The SalesOrderID column is included in   the CHECKSUM expression so that   NEWID   evaluates once per row to   achieve sampling on a per-row basis    The expression CAST CHECKSUM NEWID      SalesOrderID   amp  0x7fffffff AS float     CAST  0x7fffffff AS int  evaluates to   a random float value between 0 and 1    When run against a table with 1 000 000 rows  here are my results   SET STATISTICS TIME ON SET STATISTICS IO ON     newid      rows returned  10000    logical reads  3359    CPU time  3312 ms    elapsed time   3359 ms    SELECT TOP 1 PERCENT Number FROM Numbers ORDER BY newid       TABLESAMPLE    rows returned  9269  varies     logical reads  32    CPU time  0 ms    elapsed time  5 ms    SELECT Number FROM Numbers TABLESAMPLE  1 PERCENT      Filter    rows returned  9994  varies     logical reads  3359    CPU time  641 ms    elapsed time  627 ms        SELECT Number FROM Numbers WHERE 0 01  gt   CAST CHECKSUM NEWID    Number   amp  0x7fffffff AS float                   CAST  0x7fffffff AS int   SET STATISTICS IO OFF SET STATISTICS TIME OFF   If you can get away with using TABLESAMPLE  it will give you the best performance  Otherwise use the newid   filter method  newid   order by should be last resort if you have a large result set

User · Answer

The server-side processing language in use  eg PHP   net  etc  isn t specified  but if it s PHP  grab the required number  or all the records  and instead of randomising in the query use PHP s shuffle function  I don t know if  net has an equivalent function but if it does then use that if you re using  net  ORDER BY RAND   can have quite a performance penalty  depending on how many records are involved

User · Answer

Try this   SELECT TOP 10 Field1       FieldN FROM Table1 ORDER BY NEWID

User · Answer

This is a combination of the initial seed idea and a checksum  which looks to me to give properly random results without the cost of NEWID     SELECT TOP  number   FROM table name ORDER BY RAND CHECKSUM      RAND

User · Answer

I was using it in subquery and it returned me same rows in subquery   SELECT  ID                 SELECT TOP 1                         ImageURL               FROM      SubTable                ORDER BY  NEWID                 AS ImageURL              GETUTCDATE                 1     FROM    Mytable   then i solved with including parent table variable in where  SELECT  ID                 SELECT TOP 1                         ImageURL               FROM      SubTable                Where Mytable ID gt 0               ORDER BY  NEWID                 AS ImageURL              GETUTCDATE                 1     FROM    Mytable   Note the where condtition

User · Answer

Depending on your needs  TABLESAMPLE will get you nearly as random and better performance  this is available on MS SQL server 2005 and later    TABLESAMPLE will return data from random pages instead of random rows and therefore deos not even retrieve data that it will not return   On a very large table I tested   select top 1 percent   from  tablename  order by newid     took more than 20 minutes   select   from  tablename  tablesample 1 percent    took 2 minutes   Performance will also improve on smaller samples in TABLESAMPLE whereas it will not with newid     Please keep in mind that this is not as random as the newid   method but will give you a decent sampling   See the MSDN page

User · Answer

It appears newid   can t be used in where clause  so this solution requires an inner query   SELECT   FROM       SELECT    ABS CHECKSUM NEWID     AS Rnd     FROM MyTable   vw WHERE Rnd   100  lt  10        --10

User · Answer

If you  unlike the OP  need a specific number of records  which makes the CHECKSUM approach difficult  and desire a more random sample than TABLESAMPLE provides by itself  and also want better speed than CHECKSUM  you may make do with a merger of the TABLESAMPLE and NEWID   methods  like this   DECLARE  sampleCount int   50 SET STATISTICS TIME ON  SELECT TOP   sampleCount     FROM  yourtable  TABLESAMPLE 10 PERCENT  ORDER BY NEWID    SET STATISTICS TIME OFF   In my case this is the most straightforward compromise between randomness  it s not really  I know  and speed  Vary the TABLESAMPLE percentage  or rows  as appropriate - the higher the percentage  the more random the sample  but expect a linear drop off in speed   Note that TABLESAMPLE will not accept a variable

User · Answer

This works for me   SELECT   FROM table name ORDER BY RANDOM   LIMIT  number

User · Answer

Didn t quite see this variation in the answers yet  I had an additional constraint where I needed  given an initial seed  to select the same set of rows each time   For MS SQL   Minimum example   select top 10 percent   from table name order by rand checksum       Normalized execution time  1 00  NewId   example   select top 10 percent   from table name order by newid     Normalized execution time  1 02  NewId   is insignificantly slower than rand checksum      so you may not want to use it against large record sets   Selection with Initial Seed   declare  seed int set  seed   Year getdate      month getdate       any other initial seed here     select top 10 percent   from table name order by rand checksum       seed     any other math function here      If you need to select the same set given a seed  this seems to work

User · Answer

Just order the table by a random number and obtain the first 5 000 rows using TOP   SELECT TOP 5000   FROM  Table  ORDER BY newid      UPDATE  Just tried it and a newid   call is sufficent - no need for all the casts and all the math

User · Answer

In MySQL you can do this   SELECT  PRIMARY KEY   rand   FROM table ORDER BY rand   LIMIT 5000

[sql] Select n random rows from SQL Server table

Examples related to sql

Examples related to sql-server

Examples related to random