[sql-server] Function to Calculate Median in SQL Server

This is the most efficient solution for finding medians that I can think of. The names in the example are based on Justin's example. Make sure an index on the table Sales.SalesOrderHeader exists with key columns CustomerId and TotalDue, in that order.
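A minimal sketch of such an index is shown here. The index name is made up for illustration, and a nonclustered index is assumed; adjust both to your own conventions.

```sql
-- Hypothetical index name; the key column order (CustomerId, then TotalDue) is what matters.
-- With this order the rows of each customer are stored already sorted by TotalDue,
-- so the per-customer ORDER BY ... OFFSET ... FETCH can be answered by a seek.
CREATE NONCLUSTERED INDEX IX_SalesOrderHeader_CustomerId_TotalDue
    ON Sales.SalesOrderHeader (CustomerId, TotalDue);
```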

SELECT
    sohCount.CustomerId,
    AVG(sohMid.TotalDue) AS TotalDueMedian
FROM
    -- Number of rows per customer.
    (SELECT
         soh.CustomerId,
         COUNT(*) AS NumberOfRows
     FROM Sales.SalesOrderHeader soh
     GROUP BY soh.CustomerId) AS sohCount
CROSS APPLY
    -- The middle row (odd count) or the two middle rows (even count).
    (SELECT
         soh.TotalDue
     FROM Sales.SalesOrderHeader soh
     WHERE soh.CustomerId = sohCount.CustomerId
     ORDER BY soh.TotalDue
     OFFSET sohCount.NumberOfRows / 2 - ((sohCount.NumberOfRows + 1) % 2) ROWS
     FETCH NEXT 1 + ((sohCount.NumberOfRows + 1) % 2) ROWS ONLY
    ) AS sohMid
GROUP BY sohCount.CustomerId;
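
To see why the OFFSET and FETCH expressions pick the middle row or rows, the arithmetic can be evaluated on its own for a few row counts. This is only an illustration; the inline list of counts is made up and no table is needed.

```sql
-- Evaluate the OFFSET / FETCH expressions from the query above for row counts 1..6.
SELECT
    n.NumberOfRows,
    n.NumberOfRows / 2 - ((n.NumberOfRows + 1) % 2) AS OffsetRows,
    1 + ((n.NumberOfRows + 1) % 2)                  AS FetchRows
FROM (VALUES (1), (2), (3), (4), (5), (6)) AS n(NumberOfRows)
ORDER BY n.NumberOfRows;

-- NumberOfRows  OffsetRows  FetchRows
-- 1             0           1
-- 2             0           2
-- 3             1           1
-- 4             1           2
-- 5             2           1
-- 6             2           2
```

For an odd count a single row is fetched (the middle one). For an even count the two middle rows are fetched, and the outer AVG turns them into the median.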

UPDATE

I was a bit unsure which method performs best, so I compared my method with Justin Grant's and Jeff Atwood's by running a query based on each of the three methods in one batch. The relative batch cost of each query was:

Without index:

  • Mine 30%
  • Justin Grant's 13%
  • Jeff Atwood's 58%

With index:

  • Mine 3%
  • Justin Grant's 10%
  • Jeff Atwood's 87%
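
The percentages above are the relative batch costs shown with the execution plans, which as far as I know are derived from the optimizer's cost estimates. If you want to cross-check them against measured work, one option (not something I did for the numbers above) is to turn on I/O and time statistics around each query and compare the logical reads and elapsed times in the Messages output.

```sql
-- Report logical reads and CPU/elapsed time for each statement that follows.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- The query under test; repeat this section for each method being compared.
SELECT
    sohCount.CustomerId,
    AVG(sohMid.TotalDue) AS TotalDueMedian
FROM
    (SELECT soh.CustomerId, COUNT(*) AS NumberOfRows
     FROM Sales.SalesOrderHeader soh
     GROUP BY soh.CustomerId) AS sohCount
CROSS APPLY
    (SELECT soh.TotalDue
     FROM Sales.SalesOrderHeader soh
     WHERE soh.CustomerId = sohCount.CustomerId
     ORDER BY soh.TotalDue
     OFFSET sohCount.NumberOfRows / 2 - ((sohCount.NumberOfRows + 1) % 2) ROWS
     FETCH NEXT 1 + ((sohCount.NumberOfRows + 1) % 2) ROWS ONLY
    ) AS sohMid
GROUP BY sohCount.CustomerId;

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
```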

I tried to see how well the queries scale when the index exists by growing the data from around 14,000 rows by factors of 2 up to 512, which gives around 7.2 million rows in the end. Note that I made sure the CustomerId values were unique for each copy, so the ratio of rows to unique CustomerId values stayed constant. While doing this I ran the queries after rebuilding the index each time, and I noticed that, with the data I had, the results stabilized at a factor of around 128 at these values:

  • Mine 3%
  • Justin Grant's 5%
  • Jeff Atwood's 92%
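
A rough sketch of how such a doubling can be done is shown here. It is not the exact script I used: the table dbo.SohMedianTest and the index name are hypothetical, and the CustomerId values of every copy are shifted past the current maximum so that each copy introduces new customers.

```sql
-- Hypothetical test table holding only the two columns the median query needs.
SELECT soh.CustomerId, soh.TotalDue
INTO dbo.SohMedianTest
FROM Sales.SalesOrderHeader soh;

DECLARE @i int = 0;
DECLARE @shift int;

-- Nine doublings give a factor of 2^9 = 512.
WHILE @i < 9
BEGIN
    -- Shift the copied CustomerId values past the current maximum
    -- so the copies do not collide with existing customers.
    SET @shift = (SELECT MAX(CustomerId) FROM dbo.SohMedianTest);

    INSERT INTO dbo.SohMedianTest (CustomerId, TotalDue)
    SELECT CustomerId + @shift, TotalDue
    FROM dbo.SohMedianTest;

    SET @i += 1;
END;

-- Build the index after loading, so it is fresh when the queries are run.
CREATE NONCLUSTERED INDEX IX_SohMedianTest_CustomerId_TotalDue
    ON dbo.SohMedianTest (CustomerId, TotalDue);
```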

I wondered how performance would be affected by scaling the number of rows while keeping the number of unique CustomerId values constant, so I set up a new test that did just this. Now, instead of stabilizing, the batch cost ratio kept diverging. Also, instead of about 20 rows per CustomerId on average, I ended up with around 10,000 rows per unique Id. The numbers were:

  • Mine 4%
  • Justin's 60%
  • Jeff's 35%

I made sure I implemented each method correctly by comparing the results. My conclusion is that the method I used is generally faster as long as the index exists. I also noticed that this method is what is recommended for this particular problem in this article: https://www.microsoftpressstore.com/articles/article.aspx?p=2314819&seqNum=5
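
If you want to repeat that kind of correctness check without the other two methods at hand, one possible reference is the built-in PERCENTILE_CONT function (SQL Server 2012 and later). This is a different check from the one I ran. The sketch below lists customers whose medians disagree, so an empty result means the two approaches match. The small tolerance is an assumption to absorb rounding differences between AVG over money values and the float returned by PERCENTILE_CONT.

```sql
WITH OffsetMedian AS (
    -- The OFFSET/FETCH method from this answer.
    SELECT sohCount.CustomerId, AVG(sohMid.TotalDue) AS TotalDueMedian
    FROM (SELECT soh.CustomerId, COUNT(*) AS NumberOfRows
          FROM Sales.SalesOrderHeader soh
          GROUP BY soh.CustomerId) AS sohCount
    CROSS APPLY (SELECT soh.TotalDue
                 FROM Sales.SalesOrderHeader soh
                 WHERE soh.CustomerId = sohCount.CustomerId
                 ORDER BY soh.TotalDue
                 OFFSET sohCount.NumberOfRows / 2 - ((sohCount.NumberOfRows + 1) % 2) ROWS
                 FETCH NEXT 1 + ((sohCount.NumberOfRows + 1) % 2) ROWS ONLY) AS sohMid
    GROUP BY sohCount.CustomerId
),
PercentileMedian AS (
    -- Reference result; PERCENTILE_CONT is a window function, hence the DISTINCT.
    SELECT DISTINCT
        soh.CustomerId,
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY soh.TotalDue)
            OVER (PARTITION BY soh.CustomerId) AS TotalDueMedian
    FROM Sales.SalesOrderHeader soh
)
-- Rows returned here are disagreements between the two methods.
SELECT
    o.CustomerId,
    o.TotalDueMedian AS MedianByOffset,
    p.TotalDueMedian AS MedianByPercentile
FROM OffsetMedian o
JOIN PercentileMedian p ON p.CustomerId = o.CustomerId
WHERE ABS(CAST(o.TotalDueMedian AS float) - p.TotalDueMedian) > 0.0001;
```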

A way to further improve the performance of subsequent calls to this query is to persist the count information in an auxiliary table. You could even maintain it with a trigger that keeps the count of SalesOrderHeader rows per CustomerId up to date. Of course, you could then simply store the median as well.
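
A sketch of what that could look like follows. The table name dbo.CustomerOrderCount and the trigger name are hypothetical, and I have not benchmarked this variant. GO is the batch separator used by SSMS and sqlcmd.

```sql
-- Hypothetical auxiliary table: one row per customer with its current row count.
CREATE TABLE dbo.CustomerOrderCount (
    CustomerId   int NOT NULL PRIMARY KEY,
    NumberOfRows int NOT NULL
);
GO

-- Initial fill from the existing data.
INSERT INTO dbo.CustomerOrderCount (CustomerId, NumberOfRows)
SELECT soh.CustomerId, COUNT(*)
FROM Sales.SalesOrderHeader soh
GROUP BY soh.CustomerId;
GO

-- Hypothetical trigger that applies the net change of each statement to the counts.
CREATE TRIGGER Sales.trg_SalesOrderHeader_MaintainCount
ON Sales.SalesOrderHeader
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    WITH Delta AS (
        -- +1 for every inserted row, -1 for every deleted row, summed per customer.
        -- An UPDATE that moves a row to another customer shows up in both sets.
        SELECT d.CustomerId, SUM(d.Change) AS Change
        FROM (SELECT CustomerId, 1 AS Change FROM inserted
              UNION ALL
              SELECT CustomerId, -1 FROM deleted) AS d
        GROUP BY d.CustomerId
        HAVING SUM(d.Change) <> 0
    )
    MERGE dbo.CustomerOrderCount AS target
    USING Delta AS source
        ON target.CustomerId = source.CustomerId
    WHEN MATCHED AND target.NumberOfRows + source.Change <= 0 THEN
        DELETE
    WHEN MATCHED THEN
        UPDATE SET NumberOfRows = target.NumberOfRows + source.Change
    WHEN NOT MATCHED BY TARGET AND source.Change > 0 THEN
        INSERT (CustomerId, NumberOfRows)
        VALUES (source.CustomerId, source.Change);
END;
GO

-- The median query can then read the counts instead of recomputing them.
SELECT
    c.CustomerId,
    AVG(sohMid.TotalDue) AS TotalDueMedian
FROM dbo.CustomerOrderCount c
CROSS APPLY
    (SELECT soh.TotalDue
     FROM Sales.SalesOrderHeader soh
     WHERE soh.CustomerId = c.CustomerId
     ORDER BY soh.TotalDue
     OFFSET c.NumberOfRows / 2 - ((c.NumberOfRows + 1) % 2) ROWS
     FETCH NEXT 1 + ((c.NumberOfRows + 1) % 2) ROWS ONLY
    ) AS sohMid
GROUP BY c.CustomerId;
```

The trade-off is that every write to Sales.SalesOrderHeader now also touches the auxiliary table, so this is most attractive when reads of the median far outnumber writes.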
