For SQL Server and needing "a single random row"..
If true sampling is not needed, generate a random value in [0, max_rows) and use the ORDER BY..OFFSET..FETCH syntax available since SQL Server 2012. This is very fast if the COUNT and ORDER BY are covered by appropriate indexes, such that the data is 'already sorted' along the lines of the query. When these operations are covered, it is a quick request and does not suffer from the horrid scalability of ORDER BY NEWID() or similar. Obviously, this approach won't scale well on a non-indexed HEAP table.
declare @rows int
select @rows = count(1) from t
-- Other issues if row counts are in the bigint range..
-- This is also not 'true random', although such is likely not required.
-- rand() returns a float in [0, 1), so @skip falls in [0, @rows).
declare @skip int = convert(int, @rows * rand())
select t.*
from t
order by t.id -- Make sure this is the clustered PK or another suitable unique index!
offset (@skip) rows
fetch first 1 row only
Make sure that appropriate transaction isolation levels are used and/or account for 0 results (for example, when rows are deleted between the COUNT and the SELECT, the offset can overshoot the remaining rows).
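A minimal sketch of the zero-result guard, assuming the same table t, the id ordering, and the code above (falling back to the first row is just one option; re-running with a fresh COUNT is another):

-- Sketch only: if the offset overshot (e.g. due to concurrent deletes after the COUNT),
-- fall back to the first row so the query still returns something.
if @@rowcount = 0
begin
    select top 1 t.*
    from t
    order by t.id
end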
For SQL Server and needing a "general row sample" approach..
Note: This is an adaptation of the answer as found on a SQL Server specific question about fetching a sample of rows. It has been tailored for context.
While a general sampling approach should be used with caution here, it is still potentially useful information in the context of the other answers (and their repeated suggestions of non-scaling and/or questionable implementations). Such a sampling approach is less efficient than the first code shown and is error-prone if the goal is to find a "single random row".
Here is an updated and improved form of sampling a percentage of rows. It is based on the same concept as some other answers that use CHECKSUM / BINARY_CHECKSUM and modulus.
- It is relatively fast over huge data sets and can be efficiently used in/with derived queries. Millions of pre-filtered rows can be sampled in seconds with no tempdb usage and, if aligned with the rest of the query, the overhead is often minimal.
- Does not suffer from CHECKSUM(*) / BINARY_CHECKSUM(*) issues with runs of data. When using the CHECKSUM(*) approach, the rows can be selected in "chunks" and not "random" at all! This is because CHECKSUM prefers speed over distribution.
- Results in a stable/repeatable row selection and can be trivially changed to produce different rows on subsequent query executions (see the sketch right after this list). Approaches that use NEWID() can never be stable/repeatable.
- Does not use ORDER BY NEWID() over the entire input set, as ordering can become a significant bottleneck with large input sets. Avoiding unnecessary sorting also reduces memory and tempdb usage.
- Does not use TABLESAMPLE and thus works with a WHERE pre-filter (a placement sketch follows the hybrid example at the end).
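As a sketch of the "trivially changed" point above: concatenating a salt into the hashed value yields a different, but still repeatable, sample per seed. The @seed value and the 10% figure are illustrative only; table t and its rowguid column are as used in the examples below.

-- Sketch only: salting the hash picks a different, yet stable, sample per seed.
declare @sample_percent decimal(7, 4) = 10  -- e.g. a 10% sample
declare @seed varbinary(8) = 0x01           -- change to select a different stable sample
select t.*
from t
where abs(
        convert(bigint, hashbytes('SHA1', convert(varbinary(32), t.rowguid) + @seed))
      ) % (1000 * 100) < (1000 * @sample_percent)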
Here is the gist. See this answer for additional details and notes.
Naïve try:
declare @sample_percent decimal(7, 4)
-- Looking at this value is a good indicator of why a
-- general sampling approach is error-prone for selecting exactly 1 row.
select @sample_percent = 100.0 / count(1) from t
-- BAD!
-- When choosing a sample percent equivalent to "approximately 1 row",
-- it is very reasonable to end up with 0 rows, which definitely fails the ask!
-- If choosing a larger sample size, the distribution of the TOP 1 row is heavily
-- skewed toward the first rows scanned, and is very much NOT 'true random'.
select top 1
t.*
from t
where 1=1
and ( -- sample
@sample_percent = 100
or abs(
convert(bigint, hashbytes('SHA1', convert(varbinary(32), t.rowguid)))
) % (1000 * 100) < (1000 * @sample_percent)
)
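For intuition on why "approximately 1 row" frequently means zero rows: if each of N rows independently passes the filter with probability 1/N, the chance of an empty result is (1 - 1/N)^N, which tends to 1/e ≈ 0.37 as N grows. A quick check for N = 1,000,000 (an illustrative value):

-- Chance of sampling zero rows when targeting ~1 row out of 1,000,000 (~0.3679).
select power(1e0 - 1e0 / 1000000, 1000000) as p_zero_rows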
This can be largely remedied by a hybrid query that mixes sampling with an ORDER BY NEWID() selection over the much smaller sample set. This limits the sorting operation to roughly the sample size, not the size of the original table.
-- Sample "approximately 1000 rows" from the table,
-- dealing with some edge-cases.
declare @rows int
select @rows = count(1) from t
declare @sample_size int = 1000
declare @sample_percent decimal(7, 4) = case
when @rows <= @sample_size then 100 -- not enough rows; take them all
when (100.0 * @sample_size / @rows) < 0.0001 then 0.0001 -- min sample percent
else 100.0 * @sample_size / @rows -- everything else
end
-- There is a statistical "guarantee" of having sampled a limited-yet-non-zero number of rows.
-- The limited rows are then sorted randomly before the first is selected.
select top 1
t.*
from t
where 1=1
and ( -- sample
@sample_percent = 100
or abs(
convert(bigint, hashbytes('SHA1', convert(varbinary(32), t.rowguid)))
) % (1000 * 100) < (1000 * @sample_percent)
)
-- ONLY the sampled rows are ordered, which improves scalability.
order by newid()
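As a usage sketch of the "works with a WHERE pre-filter" point: the pre-filter simply sits alongside the sampling predicate in the same WHERE clause, so only rows that survive the filter are sampled and sorted. The status column/value is hypothetical; @sample_percent is assumed to come from the setup above, ideally computed against a COUNT with the same filter applied so the sample size stays on target.

-- Sketch only: hybrid sample restricted by a hypothetical pre-filter.
select top 1
    t.*
from t
where t.status = 'active'  -- hypothetical pre-filter column/value
  and ( -- sample
        @sample_percent = 100
        or abs(
            convert(bigint, hashbytes('SHA1', convert(varbinary(32), t.rowguid)))
        ) % (1000 * 100) < (1000 * @sample_percent)
      )
order by newid()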