[mysql] The SQL OVER() clause - when and why is it useful?

So in simple words: Over clause can be used to select non aggregated values along with Aggregated ones.

Partition BY, ORDER BY inside, and ROWS or RANGE are part of OVER() by clause.

partition by is used to partition data and then perform these window, aggregated functions, and if we don't have partition by the then entire result set is considered as a single partition.

OVER clause can be used with Ranking Functions(Rank, Row_Number, Dense_Rank..), Aggregate Functions like (AVG, Max, Min, SUM...etc) and Analytics Functions like (First_Value, Last_Value, and few others).

Let's See basic syntax of OVER clause

OVER (   
       [ <PARTITION BY clause> ]  
       [ <ORDER BY clause> ]   
       [ <ROW or RANGE clause> ]  
      )  

PARTITION BY: It is used to partition data and perform operations on groups with the same data.

ORDER BY: It is used to define the logical order of data in Partitions. When we don't specify Partition, entire resultset is considered as a single partition

: This can be used to specify what rows are supposed to be considered in a partition when performing the operation.

Let's take an example:

Here is my dataset:

Id          Name                                               Gender     Salary
----------- -------------------------------------------------- ---------- -----------
1           Mark                                               Male       5000
2           John                                               Male       4500
3           Pavan                                              Male       5000
4           Pam                                                Female     5500
5           Sara                                               Female     4000
6           Aradhya                                            Female     3500
7           Tom                                                Male       5500
8           Mary                                               Female     5000
9           Ben                                                Male       6500
10          Jodi                                               Female     7000
11          Tom                                                Male       5500
12          Ron                                                Male       5000

So let me execute different scenarios and see how data is impacted and I'll come from difficult syntax to simple one

Select *,SUM(salary) Over(order by salary RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as sum_sal from employees

Id          Name                                               Gender     Salary      sum_sal
----------- -------------------------------------------------- ---------- ----------- -----------
6           Aradhya                                            Female     3500        3500
5           Sara                                               Female     4000        7500
2           John                                               Male       4500        12000
3           Pavan                                              Male       5000        32000
1           Mark                                               Male       5000        32000
8           Mary                                               Female     5000        32000
12          Ron                                                Male       5000        32000
11          Tom                                                Male       5500        48500
7           Tom                                                Male       5500        48500
4           Pam                                                Female     5500        48500
9           Ben                                                Male       6500        55000
10          Jodi                                               Female     7000        62000

Just observe the sum_sal part. Here I am using order by Salary and using "RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW". In this case, we are not using partition so entire data will be treated as one partition and we are ordering on salary. And the important thing here is UNBOUNDED PRECEDING AND CURRENT ROW. This means when we are calculating the sum, from starting row to the current row for each row. But if we see rows with salary 5000 and name="Pavan", ideally it should be 17000 and for salary=5000 and name=Mark, it should be 22000. But as we are using RANGE and in this case, if it finds any similar elements then it considers them as the same logical group and performs an operation on them and assigns value to each item in that group. That is the reason why we have the same value for salary=5000. The engine went up to salary=5000 and Name=Ron and calculated sum and then assigned it to all salary=5000.

Select *,SUM(salary) Over(order by salary ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as sum_sal from employees


   Id          Name                                               Gender     Salary      sum_sal
----------- -------------------------------------------------- ---------- ----------- -----------
6           Aradhya                                            Female     3500        3500
5           Sara                                               Female     4000        7500
2           John                                               Male       4500        12000
3           Pavan                                              Male       5000        17000
1           Mark                                               Male       5000        22000
8           Mary                                               Female     5000        27000
12          Ron                                                Male       5000        32000
11          Tom                                                Male       5500        37500
7           Tom                                                Male       5500        43000
4           Pam                                                Female     5500        48500
9           Ben                                                Male       6500        55000
10          Jodi                                               Female     7000        62000

So with ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW The difference is for same value items instead of grouping them together, It calculates SUM from starting row to current row and it doesn't treat items with same value differently like RANGE

Select *,SUM(salary) Over(order by salary) as sum_sal from employees

Id          Name                                               Gender     Salary      sum_sal
----------- -------------------------------------------------- ---------- ----------- -----------
6           Aradhya                                            Female     3500        3500
5           Sara                                               Female     4000        7500
2           John                                               Male       4500        12000
3           Pavan                                              Male       5000        32000
1           Mark                                               Male       5000        32000
8           Mary                                               Female     5000        32000
12          Ron                                                Male       5000        32000
11          Tom                                                Male       5500        48500
7           Tom                                                Male       5500        48500
4           Pam                                                Female     5500        48500
9           Ben                                                Male       6500        55000
10          Jodi                                               Female     7000        62000

These results are the same as

Select *, SUM(salary) Over(order by salary RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as sum_sal from employees

That is because Over(order by salary) is just a short cut of Over(order by salary RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) So wherever we simply specify Order by without ROWS or RANGE it is taking RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW as default.

Note: This is applicable only to Functions that actually accept RANGE/ROW. For example, ROW_NUMBER and few others don't accept RANGE/ROW and in that case, this doesn't come into the picture.

Till now we saw that Over clause with an order by is taking Range/ROWS and syntax looks something like this RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW And it is actually calculating up to the current row from the first row. But what If it wants to calculate values for the entire partition of data and have it for each column (that is from 1st row to last row). Here is the query for that

Select *,sum(salary) Over(order by salary ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as sum_sal from employees

Id          Name                                               Gender     Salary      sum_sal
----------- -------------------------------------------------- ---------- ----------- -----------
1           Mark                                               Male       5000        62000
2           John                                               Male       4500        62000
3           Pavan                                              Male       5000        62000
4           Pam                                                Female     5500        62000
5           Sara                                               Female     4000        62000
6           Aradhya                                            Female     3500        62000
7           Tom                                                Male       5500        62000
8           Mary                                               Female     5000        62000
9           Ben                                                Male       6500        62000
10          Jodi                                               Female     7000        62000
11          Tom                                                Male       5500        62000
12          Ron                                                Male       5000        62000

Instead of CURRENT ROW, I am specifying UNBOUNDED FOLLOWING which instructs the engine to calculate till the last record of partition for each row.

Now coming to your point on what is OVER() with empty braces?

It is just a short cut for Over(order by salary ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)

Here we are indirectly specifying to treat all my resultset as a single partition and then perform calculations from the first record to the last record of each partition.

Select *,Sum(salary) Over() as sum_sal from employees

Id          Name                                               Gender     Salary      sum_sal
----------- -------------------------------------------------- ---------- ----------- -----------
1           Mark                                               Male       5000        62000
2           John                                               Male       4500        62000
3           Pavan                                              Male       5000        62000
4           Pam                                                Female     5500        62000
5           Sara                                               Female     4000        62000
6           Aradhya                                            Female     3500        62000
7           Tom                                                Male       5500        62000
8           Mary                                               Female     5000        62000
9           Ben                                                Male       6500        62000
10          Jodi                                               Female     7000        62000
11          Tom                                                Male       5500        62000
12          Ron                                                Male       5000        62000

I did create a video on this and if you are interested you can visit it. https://www.youtube.com/watch?v=CvVenuVUqto&t=1177s

Thanks, Pavan Kumar Aryasomayajulu HTTP://xyzcoder.github.io

Examples related to mysql

Implement specialization in ER diagram How to post query parameters with Axios? PHP with MySQL 8.0+ error: The server requested authentication method unknown to the client Loading class `com.mysql.jdbc.Driver'. This is deprecated. The new driver class is `com.mysql.cj.jdbc.Driver' phpMyAdmin - Error > Incorrect format parameter? Authentication plugin 'caching_sha2_password' is not supported How to resolve Unable to load authentication plugin 'caching_sha2_password' issue Connection Java-MySql : Public Key Retrieval is not allowed How to grant all privileges to root user in MySQL 8.0 MySQL 8.0 - Client does not support authentication protocol requested by server; consider upgrading MySQL client

Examples related to sql

Passing multiple values for same variable in stored procedure SQL permissions for roles Generic XSLT Search and Replace template Access And/Or exclusions Pyspark: Filter dataframe based on multiple conditions Subtracting 1 day from a timestamp date PYODBC--Data source name not found and no default driver specified select rows in sql with latest date for each ID repeated multiple times ALTER TABLE DROP COLUMN failed because one or more objects access this column Create Local SQL Server database

Examples related to sql-server

Passing multiple values for same variable in stored procedure SQL permissions for roles Count the Number of Tables in a SQL Server Database Visual Studio 2017 does not have Business Intelligence Integration Services/Projects ALTER TABLE DROP COLUMN failed because one or more objects access this column Create Local SQL Server database How to create temp table using Create statement in SQL Server? SQL Query Where Date = Today Minus 7 Days How do I pass a list as a parameter in a stored procedure? SQL Server date format yyyymmdd

Examples related to aggregate-functions

Spark SQL: apply aggregate functions to a list of columns GROUP BY without aggregate function GROUP BY + CASE statement must appear in the GROUP BY clause or be used in an aggregate function Naming returned columns in Pandas aggregate function? Concatenate multiple result rows of one column into one, group by another column How to include "zero" / "0" results in COUNT aggregate? Apply multiple functions to multiple groupby columns Reason for Column is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause Optimal way to concatenate/aggregate strings

Examples related to clause

The SQL OVER() clause - when and why is it useful?