So in simple words: Over clause can be used to select non aggregated values along with Aggregated ones.
Partition BY, ORDER BY inside, and ROWS or RANGE are part of OVER() by clause.
partition by is used to partition data and then perform these window, aggregated functions, and if we don't have partition by the then entire result set is considered as a single partition.
OVER clause can be used with Ranking Functions(Rank, Row_Number, Dense_Rank..), Aggregate Functions like (AVG, Max, Min, SUM...etc) and Analytics Functions like (First_Value, Last_Value, and few others).
Let's See basic syntax of OVER clause
OVER (
[ <PARTITION BY clause> ]
[ <ORDER BY clause> ]
[ <ROW or RANGE clause> ]
)
PARTITION BY: It is used to partition data and perform operations on groups with the same data.
ORDER BY: It is used to define the logical order of data in Partitions. When we don't specify Partition, entire resultset is considered as a single partition
: This can be used to specify what rows are supposed to be considered in a partition when performing the operation.
Let's take an example:
Here is my dataset:
Id Name Gender Salary
----------- -------------------------------------------------- ---------- -----------
1 Mark Male 5000
2 John Male 4500
3 Pavan Male 5000
4 Pam Female 5500
5 Sara Female 4000
6 Aradhya Female 3500
7 Tom Male 5500
8 Mary Female 5000
9 Ben Male 6500
10 Jodi Female 7000
11 Tom Male 5500
12 Ron Male 5000
So let me execute different scenarios and see how data is impacted and I'll come from difficult syntax to simple one
Select *,SUM(salary) Over(order by salary RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as sum_sal from employees
Id Name Gender Salary sum_sal
----------- -------------------------------------------------- ---------- ----------- -----------
6 Aradhya Female 3500 3500
5 Sara Female 4000 7500
2 John Male 4500 12000
3 Pavan Male 5000 32000
1 Mark Male 5000 32000
8 Mary Female 5000 32000
12 Ron Male 5000 32000
11 Tom Male 5500 48500
7 Tom Male 5500 48500
4 Pam Female 5500 48500
9 Ben Male 6500 55000
10 Jodi Female 7000 62000
Just observe the sum_sal part. Here I am using order by Salary and using "RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW". In this case, we are not using partition so entire data will be treated as one partition and we are ordering on salary. And the important thing here is UNBOUNDED PRECEDING AND CURRENT ROW. This means when we are calculating the sum, from starting row to the current row for each row. But if we see rows with salary 5000 and name="Pavan", ideally it should be 17000 and for salary=5000 and name=Mark, it should be 22000. But as we are using RANGE and in this case, if it finds any similar elements then it considers them as the same logical group and performs an operation on them and assigns value to each item in that group. That is the reason why we have the same value for salary=5000. The engine went up to salary=5000 and Name=Ron and calculated sum and then assigned it to all salary=5000.
Select *,SUM(salary) Over(order by salary ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as sum_sal from employees
Id Name Gender Salary sum_sal
----------- -------------------------------------------------- ---------- ----------- -----------
6 Aradhya Female 3500 3500
5 Sara Female 4000 7500
2 John Male 4500 12000
3 Pavan Male 5000 17000
1 Mark Male 5000 22000
8 Mary Female 5000 27000
12 Ron Male 5000 32000
11 Tom Male 5500 37500
7 Tom Male 5500 43000
4 Pam Female 5500 48500
9 Ben Male 6500 55000
10 Jodi Female 7000 62000
So with ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW The difference is for same value items instead of grouping them together, It calculates SUM from starting row to current row and it doesn't treat items with same value differently like RANGE
Select *,SUM(salary) Over(order by salary) as sum_sal from employees
Id Name Gender Salary sum_sal
----------- -------------------------------------------------- ---------- ----------- -----------
6 Aradhya Female 3500 3500
5 Sara Female 4000 7500
2 John Male 4500 12000
3 Pavan Male 5000 32000
1 Mark Male 5000 32000
8 Mary Female 5000 32000
12 Ron Male 5000 32000
11 Tom Male 5500 48500
7 Tom Male 5500 48500
4 Pam Female 5500 48500
9 Ben Male 6500 55000
10 Jodi Female 7000 62000
These results are the same as
Select *, SUM(salary) Over(order by salary RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as sum_sal from employees
That is because Over(order by salary) is just a short cut of Over(order by salary RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) So wherever we simply specify Order by without ROWS or RANGE it is taking RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW as default.
Note: This is applicable only to Functions that actually accept RANGE/ROW. For example, ROW_NUMBER and few others don't accept RANGE/ROW and in that case, this doesn't come into the picture.
Till now we saw that Over clause with an order by is taking Range/ROWS and syntax looks something like this RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW And it is actually calculating up to the current row from the first row. But what If it wants to calculate values for the entire partition of data and have it for each column (that is from 1st row to last row). Here is the query for that
Select *,sum(salary) Over(order by salary ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as sum_sal from employees
Id Name Gender Salary sum_sal
----------- -------------------------------------------------- ---------- ----------- -----------
1 Mark Male 5000 62000
2 John Male 4500 62000
3 Pavan Male 5000 62000
4 Pam Female 5500 62000
5 Sara Female 4000 62000
6 Aradhya Female 3500 62000
7 Tom Male 5500 62000
8 Mary Female 5000 62000
9 Ben Male 6500 62000
10 Jodi Female 7000 62000
11 Tom Male 5500 62000
12 Ron Male 5000 62000
Instead of CURRENT ROW, I am specifying UNBOUNDED FOLLOWING which instructs the engine to calculate till the last record of partition for each row.
Now coming to your point on what is OVER() with empty braces?
It is just a short cut for Over(order by salary ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
Here we are indirectly specifying to treat all my resultset as a single partition and then perform calculations from the first record to the last record of each partition.
Select *,Sum(salary) Over() as sum_sal from employees
Id Name Gender Salary sum_sal
----------- -------------------------------------------------- ---------- ----------- -----------
1 Mark Male 5000 62000
2 John Male 4500 62000
3 Pavan Male 5000 62000
4 Pam Female 5500 62000
5 Sara Female 4000 62000
6 Aradhya Female 3500 62000
7 Tom Male 5500 62000
8 Mary Female 5000 62000
9 Ben Male 6500 62000
10 Jodi Female 7000 62000
11 Tom Male 5500 62000
12 Ron Male 5000 62000
I did create a video on this and if you are interested you can visit it. https://www.youtube.com/watch?v=CvVenuVUqto&t=1177s
Thanks, Pavan Kumar Aryasomayajulu HTTP://xyzcoder.github.io