
A Gentle Introduction to Common SQL Window Functions

Mastering SQL analytical functions is a necessity for an aspiring BI/DWH professional. In this article, we'll explore the history of SQL in a BI environment. We'll also introduce some of the more common SQL analytical functions, including RANK, LEAD, LAG, SUM, and others.

SQL can be used for business intelligence (BI) applications that combine data retrieval, analysis, and computation. These computations are more complex than what we see in OLTP systems. If you're interested in getting into data warehousing (DWH) or business intelligence, you need to go beyond plain SQL and start digging into the analytical functions (also known as window functions, windowing functions, or OVER clauses).

As stated, this article will get you started with SQL analytical functions. First, let's have a quick look at the history of using SQL in BI. We'll also consider the basic processes involved in using SQL to partition, order, and analyze data.

BI Meets SQL: A Brief History

BI applications are used for a variety of data-related activities. Common computations include:

  • Best or worst performers in various organizational units (based on some qualitative measurement)
  • Best or worst contributions by percentage threshold, such as the bottom 5% of stores by revenue
  • Trends over time and across time; this information yields interesting insights about where the organization is heading
  • Contributions to a whole, such as a commission percentage

While using relational database management systems (RDBMS) as DWH/BI platforms, organizations found that the SQL SELECT statement wasn't adequately supporting their BI applications.

Plus, analysts were required to possess a complex skill set involving data retrieval, procedural coding, external tool usage and business domain knowledge. This mix of skills is not common, and companies found it difficult to locate individuals who could do everything needed.

During this time, the use of standard SQL in BI caused poor development productivity and poor application performance. Productivity was poor because a BI application developer had to mix relational SQL statements with procedural code (e.g., Oracle's PL/SQL). Performance was slow because the SQL compiler often created poor execution plans for complex statements, and even worse ones for mixed statements.

The need was clear: overcome these deficiencies and enable relational database systems to support BI/DWH more fully. This resulted in major SQL extensions for both processing and specification. You know these extensions as the analytical functions we find in all of the major DBMS vendors' SQL syntaxes.

SQL Analytical Functions: The Basics

Executing analytical functions organizes data into partitions, computes functions over these partitions in a specified order, and returns the result.

SQL execution steps

Analytical functions are processed after row filtering (WHERE) and grouping (GROUP BY), so that the calculations can be performed on the grouped results. The final ordering of the returned rows is applied after the functions have been evaluated.

We place analytical functions in the SELECT list, just as we would with any other result columns in a SQL SELECT statement.

Let's take a look at the basic syntax of an analytical function:

<AnalyticFunction> ([<column-list>]) over (<ordering>)

We can see that the analytical function consists of a function with an over keyword and an optional list of columns for ordering. The ordering indicates criteria for function evaluation, not the final ordering of results. This will become clear later as we look at examples.

We will now consider the most common analytical functions used in DWH/BI. To do this, we will access a simple data mart and build our queries on that.

The following data mart model is a simple star schema for tracking and analyzing retail sales:




Let's begin with the ranking functions.

RANKING FUNCTIONS

RANK

RANK is a simple analytical function that takes no column-list arguments. It returns a rank – a number based on the ordering of the rows according to a specified condition.

Let's look at an example product list. We will rank products based on their unit price, from the cheapest to the most expensive.

SELECT id,
  product_desc,
  product_brand,
  product_unitprice,
  RANK() OVER (ORDER BY product_unitprice) Rank_UnitPrice
FROM dim_product;

The ORDER BY clause is mandatory in a RANK function, as we will be listing products in either ascending or descending order.

The result set in ascending order would be:


| ID | PRODUCT_DESC | PRODUCT_BRAND | PRODUCT_UNITPRICE | RANK_UNITPRICE |
|----|--------------|---------------|-------------------|----------------|
| 1 | Plain Gibson razor | Gibson | 15 | 1 |
| 4 | Schick Xtreme3 for professionals | Schick Xtreme3 | 20 | 2 |
| 2 | Gillette Venus Swirl for women | Gillette Venus Swirl | 25 | 3 |
| 3 | Gillette Fusion for young men | Gillette Fusion | 25,5 | 4 |


The descending version, where we go from most to least expensive, would be delivered by this code snippet:

RANK() OVER (ORDER BY product_unitprice DESC)

DENSE RANK

DENSE_RANK works like RANK in that it needs no additional arguments and it ranks items in descending or ascending order. The only difference is that DENSE_RANK does not leave "gaps" between groups.

These "gaps" occur when the ordering argument is identical for two or more rows. To illustrate this, think of competition rankings. Let's say some programmers join a hackathon, and two of them finish in second place with identical scores. DENSE_RANK would return a "1" for the first-place programmer, a "2" for both programmers tied in second place, and a "3" for the next finisher. RANK would return a "4" for that same finisher, which, in a competition context, does not make sense.
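The hackathon scenario can be sketched as a small runnable example. This uses SQLite (3.25+), which supports these window functions; the names and scores are made up for illustration:

```python
import sqlite3

# Hypothetical hackathon scores: Ben and Cleo tie for second place.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hackathon (programmer TEXT, score INTEGER);
    INSERT INTO hackathon VALUES
        ('Ana', 98), ('Ben', 85), ('Cleo', 85), ('Dan', 70);
""")
rows = conn.execute("""
    SELECT programmer,
           RANK()       OVER (ORDER BY score DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rnk
    FROM hackathon
    ORDER BY score DESC, programmer
""").fetchall()
for row in rows:
    print(row)
# RANK gives Dan a 4 (a gap after the tie); DENSE_RANK gives him a 3.
```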

So, if we expand the rowset from the previous example...

SELECT id,
  product_desc,
  product_brand,
  product_unitprice,
  product_category,
  RANK() OVER (ORDER BY product_unitprice) Rank_UnitPrice,
  DENSE_RANK() OVER (ORDER BY product_unitprice) DenseRank_UnitPrice
FROM dim_product
WHERE product_category = 'Razor';

... we get:


| ID | PRODUCT_DESC | PRODUCT_BRAND | PRODUCT_UNITPRICE | PRODUCT_CATEGORY | RANK_UNITPRICE | DENSERANK_UNITPRICE |
|----|--------------|---------------|-------------------|------------------|----------------|---------------------|
| 1 | Plain Gibson razor | Gibson | 15 | Razor | 1 | 1 |
| 5 | Gillette Plain for men | Gillette Plain | 20 | Razor | 2 | 2 |
| 4 | Schick Xtreme3 for professionals | Schick Xtreme3 | 20 | Razor | 2 | 2 |
| 2 | Gillette Venus Swirl for women | Gillette Venus Swirl | 25 | Razor | 4 | 3 |
| 3 | Gillette Fusion for young men | Gillette Fusion | 25,5 | Razor | 5 | 4 |


DENSE_RANK assigns the row after the identically priced razors the next consecutive number; RANK leaves a gap, skipping as many numbers as there were tied rows. In our case, RANK shows a "4" after the two identical values, while DENSE_RANK shows a "3".

NTILE

NTILE divides rows into a specified number of groups that are as equal in size as possible and returns the number of the group that each row belongs to. (When the row count isn't evenly divisible, the leading groups get one extra row.) This function is not as widely used as either RANK or DENSE_RANK.

For example, NTILE can be used to divide rows into three divisions:

SELECT id,
  product_brand,
  product_unitprice,
  RANK() OVER (ORDER BY product_unitprice) Rank_UnitPrice,
  DENSE_RANK() OVER (ORDER BY product_unitprice) DenseRank_UnitPrice,
  NTILE(3) OVER (ORDER BY product_unitprice) Ntile_UnitPrice
FROM dim_product
WHERE product_category = 'Razor';

We see five values divided into three groups:


| ID | PRODUCT_BRAND | PRODUCT_UNITPRICE | RANK_UNITPRICE | DENSERANK_UNITPRICE | NTILE_UNITPRICE |
|----|---------------|-------------------|----------------|---------------------|-----------------|
| 1 | Gibson | 15 | 1 | 1 | 1 |
| 5 | Gillette Plain | 20 | 2 | 2 | 1 |
| 4 | Schick Xtreme3 | 20 | 2 | 2 | 2 |
| 2 | Gillette Venus Swirl | 25 | 4 | 3 | 2 |
| 3 | Gillette Fusion | 25,5 | 5 | 4 | 3 |


ROW NUMBER

ROW_NUMBER differs from DENSE_RANK and RANK in that it does not treat identical values in any special way: it simply numbers the rows sequentially in the order in which they are processed.

Let's look at the example:

SELECT id,
  product_brand,
  product_unitprice,
  RANK() OVER (ORDER BY product_unitprice) Rank_UnitPrice,
  DENSE_RANK() OVER (ORDER BY product_unitprice) DenseRank_UnitPrice,
  NTILE(3) OVER (ORDER BY product_unitprice) Ntile_UnitPrice,
  ROW_NUMBER() OVER (ORDER BY product_unitprice) ROWNUMBER_UnitPrice
FROM dim_product
WHERE product_category = 'Razor';

And the result is:


| ID | PRODUCT_BRAND | PRODUCT_UNITPRICE | RANK_UNITPRICE | DENSERANK_UNITPRICE | NTILE_UNITPRICE | ROWNUMBER_UNITPRICE |
|----|---------------|-------------------|----------------|---------------------|-----------------|---------------------|
| 1 | Gibson | 15 | 1 | 1 | 1 | 1 |
| 5 | Gillette Plain | 20 | 2 | 2 | 1 | 2 |
| 4 | Schick Xtreme3 | 20 | 2 | 2 | 2 | 3 |
| 2 | Gillette Venus Swirl | 25 | 4 | 3 | 2 | 4 |
| 3 | Gillette Fusion | 25,5 | 5 | 4 | 3 | 5 |


You may be wondering how ROW_NUMBER ranks these identical values – which one gets the higher number. The answer is: it is totally undetermined. In one execution, a product could be assigned row number "5"; in another execution, the same product could be assigned a "4".

So beware. Many mistakes are made by the misuse of ROW_NUMBER on datasets with non-unique arguments.
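One common safeguard is to add a unique column, such as id, as a tiebreaker in the window's ORDER BY, which makes ROW_NUMBER deterministic. A sketch using SQLite, with made-up rows modeled on the razor data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (id INTEGER, product_brand TEXT, product_unitprice REAL);
    INSERT INTO dim_product VALUES
        (1, 'Gibson', 15), (5, 'Gillette Plain', 20),
        (4, 'Schick Xtreme3', 20), (2, 'Gillette Venus Swirl', 25);
""")
rows = conn.execute("""
    SELECT id, product_brand,
           -- id breaks the tie between the two products priced at 20,
           -- so every execution assigns the same row numbers
           ROW_NUMBER() OVER (ORDER BY product_unitprice, id) AS rn
    FROM dim_product
    ORDER BY rn
""").fetchall()
print(rows)
```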

We commonly use ranking functions when filtering data. Since window functions cannot appear directly in a WHERE clause, we compute the rank in a CTE and filter on it afterwards. So if we would like to keep only the razors with a price rank below 3, it would look like:

WITH top_product AS 
	(SELECT id,
  	product_brand,
  	product_unitprice,
  	RANK() OVER (ORDER BY product_unitprice) Rank_UnitPrice
		FROM dim_product
		WHERE product_category = 'Razor' )

SELECT * 
FROM top_product
WHERE rank_unitprice < 3;

And the result is:


| ID | PRODUCT_BRAND | PRODUCT_UNITPRICE | RANK_UNITPRICE |
|----|---------------|-------------------|----------------|
| 1 | Gibson | 15 | 1 |
| 4 | Schick Xtreme3 | 20 | 2 |
| 5 | Gillette Plain | 20 | 2 |


POSITIONAL ANALYTICAL FUNCTIONS

LEAD

LEAD returns the value of an argument column taken from a row that follows the current row at a given offset. The offset can be defined in the code; its default is 1. The value is returned in the current row.

Here's the syntax for LEAD:

LEAD(value, offset, default_value)

The default_value is returned when the offset points to a non-existent row. The offset is counted forward from the current row, in the direction of the window's ordering.

Let's return our previous example and use the LEAD function to look at price increases. We'll define product_unitprice as the value, with an offset of 1 and a default value of null:

SELECT id,
  product_desc,
  product_brand,
  product_unitprice,
  product_category,
  lead(product_unitprice,1,null) OVER (ORDER BY product_unitprice) Lead_UnitPrice
FROM dim_product
WHERE product_category = 'Razor';

Let's look at the result set:


| ID | PRODUCT_DESC | PRODUCT_BRAND | PRODUCT_UNITPRICE | PRODUCT_CATEGORY | LEAD_UNITPRICE |
|----|--------------|---------------|-------------------|------------------|----------------|
| 1 | Plain Gibson razor | Gibson | 15 | Razor | 20 |
| 5 | Gillette Plain for men | Gillette Plain | 20 | Razor | 20 |
| 4 | Schick Xtreme3 for professionals | Schick Xtreme3 | 20 | Razor | 25 |
| 2 | Gillette Venus Swirl for women | Gillette Venus Swirl | 25 | Razor | 25,5 |
| 3 | Gillette Fusion for young men | Gillette Fusion | 25,5 | Razor | null |


To make this clearer, let's visualize the shift in data values:


Lead direction

We see that LEAD "takes" the value one offset ahead, in the direction of the analytical function's ordering.

LAG

LAG is the opposite of LEAD. We can even implement LAG using LEAD and vice versa. The difference is in the direction we look for the offset value.

Here's the syntax for LAG:

LAG(value, offset, default_value)

Where LEAD takes values that are ahead of the current row in the window, LAG takes values that trail behind it; hence the name. LEAD and LAG are interchangeable: by reversing the ordering, you can implement the same behavior with either one.
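This equivalence can be checked with a quick SQLite sketch (made-up prices): LAG over a descending order returns exactly what LEAD returns over the ascending order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (price REAL);
    INSERT INTO t VALUES (15), (20), (25);
""")
rows = conn.execute("""
    SELECT price,
           LEAD(price, 1) OVER (ORDER BY price ASC)  AS lead_asc,
           LAG(price, 1)  OVER (ORDER BY price DESC) AS lag_desc
    FROM t
    ORDER BY price
""").fetchall()
print(rows)  # lead_asc and lag_desc agree on every row
```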

Lead, Lag direction

If we add a LAG with the order direction reversed to the previous example, it produces the same values as LEAD (up to the arbitrary ordering of the tied prices).

SELECT id,
  product_desc,
  product_brand,
  product_unitprice,
  product_category,
  lead(product_unitprice,1,null) OVER (ORDER BY product_unitprice) Lead_UnitPrice,
  lag(product_unitprice,1,null) OVER (ORDER BY product_unitprice DESC) Lag_UnitPrice
FROM dim_product
WHERE product_category = 'Razor';

The result:


| ID | PRODUCT_DESC | PRODUCT_BRAND | PRODUCT_UNITPRICE | PRODUCT_CATEGORY | LEAD_UNITPRICE | LAG_UNITPRICE |
|----|--------------|---------------|-------------------|------------------|----------------|---------------|
| 3 | Gillette Fusion for young men | Gillette Fusion | 25,5 | Razor | null | null |
| 2 | Gillette Venus Swirl for women | Gillette Venus Swirl | 25 | Razor | 25,5 | 25,5 |
| 5 | Gillette Plain for men | Gillette Plain | 20 | Razor | 20 | 25 |
| 4 | Schick Xtreme3 for professionals | Schick Xtreme3 | 20 | Razor | 25 | 20 |
| 1 | Plain Gibson razor | Gibson | 15 | Razor | 20 | 20 |


LEAD and LAG are important in the development of ETL using pure SQL batch statements. We will cover the implementation of various algorithms in later posts.

SUMMARY FUNCTIONS

SUM

SUM gives us the sum of an entire set or group of values.

This function is pretty self-explanatory. If we define the sum function as an analytical function, we get the sum of the whole set or group defined in every row.

Here's the syntax for SUM:

SUM(value)

So if we look at this example...

SELECT product_category,
  SUM(sales_units) AS SUM_SALES_UNITS
FROM f_sales
  JOIN dim_product ON f_sales.dim_product_id = dim_product.ID
GROUP BY product_category;

...we get one total per product category:


| PRODUCT_CATEGORY | SUM_SALES_UNITS |
|------------------|-----------------|
| Cream | 24092683,94 |
| Razor | 30985698,47 |
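Note that this query uses a plain GROUP BY aggregate. The analytic form of SUM – using the PARTITION BY syntax covered later in this article – repeats the group total on every detail row instead of collapsing the rows. A minimal sketch with made-up numbers:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE f_sales (product_category TEXT, sales_units INTEGER);
    INSERT INTO f_sales VALUES
        ('Cream', 100), ('Cream', 150), ('Razor', 200);
""")
rows = conn.execute("""
    SELECT product_category, sales_units,
           -- the category total appears on every row; no rows are collapsed
           SUM(sales_units) OVER (PARTITION BY product_category) AS category_total
    FROM f_sales
    ORDER BY product_category, sales_units
""").fetchall()
print(rows)
```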


The same applies to AVG and, where available, MEDIAN (Oracle provides it, for instance, but not every database does). Their syntax is similar enough that we won't go into them here, but this is what they look like:

SELECT product_category,
  SUM(sales_units) AS SUM_SALES_UNITS,
  AVG(sales_units) AS AVG_SALES_UNITS,
  MEDIAN(sales_units) AS MEDIAN_SALES_UNITS
FROM f_sales
  JOIN dim_product ON f_sales.dim_product_id = dim_product.ID
GROUP BY product_category;

The result is:


| PRODUCT_CATEGORY | SUM_SALES_UNITS | AVG_SALES_UNITS | MEDIAN_SALES_UNITS |
|------------------|-----------------|-----------------|--------------------|
| Cream | 24092683,94 | 5496,85 | 5502,75 |
| Razor | 30985698,47 | 5516,41 | 5534,86 |


RUNNING SUMMARY FUNCTIONS

Summary functions can also be defined with an ORDER BY clause. When we include this clause, the function is calculated over the current row and every preceding row, with rows that share the current row's ORDER BY value treated as one group.

Look at this SUM example:

SELECT DISTINCT DIM_TIME_ID AS TIME_ID,
  SUM(sales_units) OVER (ORDER BY DIM_TIME_ID) AS SUM_SALES_UNIT
FROM f_sales
ORDER BY DIM_TIME_ID;

The sum of sales_units accumulates for every unique date – the date's uniqueness matters here – as defined in the ORDER BY part of the analytical function. Below, we see the results:


| DIM_TIME_ID | SUM_SALES_UNIT |
|-------------|----------------|
| 20160101 | 1025620,36 |
| 20160102 | 2789678,38 |
| 20160103 | 4481130,18 |
| 20160104 | 6382374,4 |
| 20160105 | 8327342,03 |
| 20160106 | 10143646,25 |
| 20160107 | 11952402,14 |
| 20160108 | 13720165,03 |
| 20160109 | 15586689,76 |
| 20160110 | 17451030,37 |
| 20160111 | 19617779,59 |
| 20160112 | 21551100,48 |
| ... | ... |


Had we omitted this uniqueness requirement on the ORDER BY column, the return value for a row would have been the sum of that row, all of its duplicates, and every preceding row.

This type of summarization is called a running sum. It comes in handy when we're dealing with common algorithms, such as the ones we will cover in upcoming articles.
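The peer-group behavior described above can be seen in a small runnable sketch (SQLite, made-up figures): the two rows sharing the date 20160101 receive the same running total, because they are summed as one group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE f_sales (dim_time_id INTEGER, sales_units REAL);
    INSERT INTO f_sales VALUES
        (20160101, 10), (20160101, 20), (20160102, 5), (20160103, 7);
""")
rows = conn.execute("""
    SELECT dim_time_id,
           -- default frame: current row, its peers, and all preceding rows
           SUM(sales_units) OVER (ORDER BY dim_time_id) AS running_sum
    FROM f_sales
    ORDER BY dim_time_id
""").fetchall()
print(rows)  # both 20160101 rows show the same running total
```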

PARTITION SYNTAX

Partitioning is optional in an analytical function. The PARTITION BY keyword is followed by a list of columns:

<AnalyticFunction> ([<column-list>]) over ([<partition by>]<ordering>)

PARTITION BY defines the partition – the group of rows over which the function is computed. Do not confuse PARTITION BY with the GROUP BY clause of an SQL statement: GROUP BY collapses rows into one per group, while PARTITION BY keeps every row.

We can redefine our previous ranking example by creating partition groups for the product_category column like this:

SELECT id,
  product_brand,
  product_unitprice,
  product_category,
  RANK() OVER (PARTITION BY product_category ORDER BY product_unitprice) Rank_UnitPrice,
  DENSE_RANK() OVER (PARTITION BY product_category ORDER BY product_unitprice) DenseRank_UnitPrice,
  NTILE(3) OVER (PARTITION BY product_category ORDER BY product_unitprice) Ntile_UnitPrice
FROM dim_product
ORDER BY product_category, 5;

The ranking is done within product categories. The results are:


| ID | PRODUCT_BRAND | PRODUCT_UNITPRICE | PRODUCT_CATEGORY | RANK_UNITPRICE | DENSERANK_UNITPRICE | NTILE_UNITPRICE |
|----|---------------|-------------------|------------------|----------------|---------------------|-----------------|
| 9 | Mario Badescu | 29 | Cream | 1 | 1 | 1 |
| 6 | Hope | 35 | Cream | 2 | 2 | 1 |
| 8 | Cetapin | 45 | Cream | 3 | 3 | 2 |
| 7 | Aveena | 45 | Cream | 3 | 3 | 3 |
| 13 | Axe for women | 29 | Deo | 1 | 1 | 1 |
| 10 | Nivea men | 35 | Deo | 2 | 2 | 1 |
| 11 | Nivea women | 45 | Deo | 3 | 3 | 2 |
| 12 | Axe for men | 45 | Deo | 3 | 3 | 3 |
| 1 | Gibson | 15 | Razor | 1 | 1 | 1 |
| 5 | Gillette Plain | 20 | Razor | 2 | 2 | 1 |
| 4 | Schick Xtreme3 | 20 | Razor | 2 | 2 | 2 |
| 2 | Gillette Venus Swirl | 25 | Razor | 4 | 3 | 2 |
| 3 | Gillette Fusion | 25,5 | Razor | 5 | 4 | 3 |


The Window Concept

The window concept is used extensively with analytical functions; some even call analytical functions "windowing functions". A window is a collection of rows in which an analytical function is calculated.

We tie window concepts to changes in numeric variables (e.g., stock prices or sales). Windows are typically defined by time intervals, such as years or months. Some examples of window comparisons are:

  • Sales growth percentages
  • Moving averages of stock prices (30-day, 90-day, 180-day)
  • The performance of an advertising campaign across time
  • Cumulative sales performance

The window specification is optional and comes after the ordering:

<AnalyticFunction> ([<column-list>]) over ([<partition by>]<ordering>[window-specification])

Windows can be specified as units of physical rows using the ROWS keyword, or they can be defined in logical rows using the RANGE keyword.

Physical windows are specified as a number of rows before and/or after the current row.

Logical windows are specified by values in the ordering column, such as a number of days. Numeric variables are calculated in moving windows so that they can be compared across windows.

Windows can be cumulative (as in the running sum example) or sliding, as we will see in the coming examples.

Examples of physical-row windows are:

  • ROWS UNBOUNDED PRECEDING – The window is the current row and every preceding row.
  • ROWS 2 PRECEDING – The window is the current row and 2 preceding physical rows.
  • ROWS 2 FOLLOWING – The window is the current row and 2 following physical rows.

ROWS UNBOUNDED PRECEDING will yield the same results as the running sum example if the ORDER BY column in the analytic definition is unique. If it isn't, the running analytical function sums each group of identical ORDER BY values as a whole, while the ROWS UNBOUNDED PRECEDING window expands one physical row at a time.

Windows of the analytical functions

Let's look at the example:

SELECT DISTINCT DIM_TIME_ID AS TIME_ID,
  SUM(sales_units) over (ORDER BY DIM_TIME_ID ) AS SUM_SALES_UNIT,
  SUM(sales_units) over (ORDER BY DIM_TIME_ID ROWS UNBOUNDED PRECEDING) AS SUM_SALES_UNIT_PRECEDING
FROM f_sales
ORDER BY DIM_TIME_ID;

And the result set:


| TIME_ID | SUM_SALES_UNIT | SUM_SALES_UNIT_PRECEDING |
|---------|----------------|--------------------------|
| 20160101 | 1025620,36 | 4043,38 |
| 20160101 | 1025620,36 | 10451,14 |
| 20160101 | 1025620,36 | 13106,5 |
| 20160101 | 1025620,36 | 15749,79 |
| 20160101 | 1025620,36 | 19807,07 |
| 20160101 | 1025620,36 | 25012,65 |
| 20160101 | 1025620,36 | 31650,32 |
| ... | ... | ... |


As we see in the result set, the ROWS UNBOUNDED PRECEDING window grows one row at a time to include the current row and all preceding rows. The running sum, on the other hand, summarizes all values sharing the 2016-01-01 date at once.

Now let's take a look at the logical window specification.

Logical windows are indicated by the RANGE keyword coupled with an offset expressed in the units of the ordering variable, such as a number of days relative to a shipment date. These values indicate the start and end of the window.

Unlike physical windows, where we specify the number of rows, in logical windows we specify the logical bounds of the window. To clarify this, let's look at some examples of logical windows and how they differ from their physical counterparts.

Examples of logical windows are:

  • ORDER BY TIME_ID RANGE 30 PRECEDING – The window is the current row and 30 days in the past.
  • ORDER BY TIME_ID RANGE BETWEEN 365 PRECEDING AND 365 FOLLOWING – The window is bound by 365 days in the past and 365 days in the future.
  • ORDER BY TIME_ID RANGE BETWEEN INTERVAL '1' YEAR PRECEDING AND INTERVAL '1' YEAR FOLLOWING – The window is bound by 1 year in the past and 1 year in the future.

A window bound by a range specified in days and one specified in years will be the same most of the time. Can you guess when they won't be? (Hint: think of leap years.)

This picture best describes the difference between logical and physical windows:

Range and row windows

The left side displays the window for ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING; the right side displays the window for RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING. The RANGE window takes in every row whose date falls within the bounds – all rows of the same day count as one logical unit – while the ROWS window takes exactly one preceding and one following physical row.
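The same contrast can be sketched with integer day keys (SQLite 3.28+ is needed for RANGE with a numeric offset; the data is made up). The two rows on day 1 are one logical unit for RANGE, but separate physical rows for ROWS:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (day INTEGER, units INTEGER);
    INSERT INTO sales VALUES (1, 10), (1, 20), (2, 5), (4, 7);
""")
rows = conn.execute("""
    SELECT day, units,
           -- exactly one physical row before and after the current row
           SUM(units) OVER (ORDER BY day, units
                            ROWS  BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS sum_rows,
           -- every row whose day value lies within +/- 1 of the current day
           SUM(units) OVER (ORDER BY day
                            RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS sum_range
    FROM sales
    ORDER BY day, units
""").fetchall()
print(rows)  # note day 4 is isolated for RANGE (no day 3 or 5 exists)
```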

Let's return to our example with the sales fact table:

SELECT DISTINCT DIM_TIME_ID AS TIME_ID,sales_units,
  SUM(sales_units) over (ORDER BY DIM_TIME_ID ) AS SUM_SALES_UNIT,
  SUM(sales_units) over (ORDER BY DIM_TIME_ID ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS SUM_SALES_UNIT_ROW,
  SUM(sales_units) over (ORDER BY TO_DATE(DIM_TIME_ID,'YYYYMMDD') RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS SUM_SALES_UNIT_RANGE
FROM f_sales
ORDER BY DIM_TIME_ID,sales_units;

Since we are defining a logical window over dates, we had to convert DIM_TIME_ID to a DATE value. We should stress that a single DIM_TIME_ID represents a single day and that every day has a corresponding time_id; we enforce this with a 1..n relationship from the dimension table to the fact table.

We get different results for each analytical function:


| TIME_ID | SALES_UNITS | SUM_SALES_UNIT | SUM_SALES_UNIT_ROW | SUM_SALES_UNIT_RANGE |
|---------|-------------|----------------|--------------------|----------------------|
| 20160101 | 2019,35 | 1025620,36 | 4062,47 | 2789678,38 |
| 20160101 | 2043,12 | 1025620,36 | 6105,71 | 2789678,38 |
| 20160101 | 2043,24 | 1025620,36 | 6136,87 | 2789678,38 |
| 20160101 | 2050,51 | 1025620,36 | 6148,13 | 2789678,38 |
| 20160101 | 2054,38 | 1025620,36 | 6189,71 | 2789678,38 |
| 20160101 | 2084,82 | 1025620,36 | 6276,31 | 2789678,38 |
| ... | ... | ... | ... | ... |


We can see that SUM_SALES_UNIT is summarizing the result for the date 01.01.2016, ROW is moving its physical window through the rows, and RANGE is summing up all the sales in a three-day window.

Using SQL analytical functions can be a double-edged sword. They provide better performance and productivity, but the results and calculations can often be counterintuitive. There are many other functions that we didn't cover in this article, but we have discussed the building blocks of most of them. I challenge you to explore analytical functions further. In the next article, we'll learn about some DWH algorithms that use analytical functions for common ETL patterns.