
A Gentle Introduction to Common SQL Window Functions

Mastering SQL analytical functions is a necessity for an aspiring BI/DWH professional. In this article, we'll explore the history of SQL in a BI environment. We'll also introduce some of the more common SQL analytical functions, including RANK, LEAD, LAG, SUM, and others.

SQL can be used for business intelligence (BI) applications that combine data retrieval, analysis, and computation. These computations are more complex than what we see in OLTP systems. If you're interested in getting into data warehousing (DWH) or business intelligence, you need to go beyond plain SQL and start digging into the analytical functions (also known as window functions, windowing functions, or OVER clauses).

As stated, this article will get you started with SQL analytical functions. First, let's have a quick look at the history of using SQL in BI. We'll also consider the basic processes involved in using SQL to partition, order, and analyze data.

BI Meets SQL: A Brief History

BI applications are used for a variety of data-related activities. Common computations include:

  • Best or worst performers in various organizational units (based on some qualitative measurement)
  • Best or worst contributions by percentage threshold, such as the bottom 5% of stores by revenue
  • Trends over time and across time; this information yields interesting insights about where the organization is heading
  • Contributions to a whole, such as a commission percentage

While using relational database management systems (RDBMS) as DWH/BI platforms, organizations found that the SQL SELECT statement wasn't adequately supporting their BI applications.

Plus, analysts were required to possess a complex skill set involving data retrieval, procedural coding, external tool usage and business domain knowledge. This mix of skills is not common, and companies found it difficult to locate individuals who could do everything needed.

During this time, the use of standard SQL in BI caused poor development productivity and poor application performance. Productivity was poor because a BI application developer had to mix relational SQL statements with procedural code (e.g., Oracle's PL/SQL). Performance was slow because the SQL compiler often created poor execution plans for complex statements, and even worse ones for mixed statements.

The need was clear: overcome these deficiencies and enable relational database systems to support BI/DWH more fully. This resulted in major SQL extensions for both processing and specification. You know these extensions as the analytical functions we find in all of the major DBMS vendors' SQL syntaxes.

SQL Analytical Functions: The Basics

Executing analytical functions organizes data into partitions, computes functions over these partitions in a specified order, and returns the result.

SQL execution steps

Analytical functions are processed after row filtering (WHERE) and grouping (GROUP BY), so that the calculations can be performed on the grouped results. The final ordering of the returned rows is applied after the functions have been evaluated.

We place analytical functions in the SELECT list, just as we would with any other result columns in a SQL SELECT statement.

Let's take a look at the basic syntax of an analytical function:

<AnalyticFunction> ([<column-list>]) over (<ordering>)

We can see that the analytical function consists of a function with an over keyword and an optional list of columns for ordering. The ordering indicates criteria for function evaluation, not the final ordering of results. This will become clear later as we look at examples.

We will now consider the most common analytical functions used in DWH/BI. To do this, we will access a simple data mart and build our queries on that.

The following data mart model is a simple star schema for tracking and analyzing retail sales:




Let's begin with the ranking functions.

RANKING FUNCTIONS

RANK

RANK is a simple analytical function that takes no column-list arguments. It returns a rank – a number based on the ordering of the rows according to a specified condition.

Let's look at an example product list. We will rank products based on their unit price, from the cheapest to the most expensive.

SELECT id,
  product_desc,
  product_brand,
  product_unitprice,
  RANK() OVER (ORDER BY product_unitprice) Rank_UnitPrice
FROM dim_product;

The ORDER BY clause is mandatory in a RANK function, as we will be listing products in either ascending or descending order.

The result set in ascending order would be:


| ID | PRODUCT_DESC | PRODUCT_BRAND | PRODUCT_UNITPRICE | RANK_UNITPRICE |
|----|--------------|---------------|-------------------|----------------|
| 1 | Plain Gibson razor | Gibson | 15 | 1 |
| 4 | Schick Xtreme3 for professionals | Schick Xtreme3 | 20 | 2 |
| 2 | Gillette Venus Swirl for women | Gillette Venus Swirl | 25 | 3 |
| 3 | Gillette Fusion for young men | Gillette Fusion | 25,5 | 4 |


The descending version, where we go from most to least expensive, would be delivered by this code snippet:

RANK() OVER (ORDER BY product_unitprice DESC)

DENSE RANK

DENSE_RANK works like RANK in that it needs no additional arguments and it ranks items in descending or ascending order. The only difference is that DENSE_RANK does not leave "gaps" between groups.

These "gaps" occur when the ordering argument is identical for two or more rows. To illustrate this, think of competition rankings. Let's say some programmers join a hackathon, and two of them finish in second place with identical scores. DENSE_RANK would return a "1" for the first-place programmer, a "2" for both programmers tied in second place, and a "3" for the next finisher. RANK would return a "4" for that same finisher, which, in a competition context, does not make sense.
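The hackathon scenario can be sketched as a small runnable example. This uses SQLite (3.25+), which supports these window functions; the names and scores are made up for illustration:

```python
import sqlite3

# Hypothetical hackathon scores: Ben and Cleo tie for second place.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hackathon (programmer TEXT, score INTEGER);
    INSERT INTO hackathon VALUES
        ('Ana', 98), ('Ben', 85), ('Cleo', 85), ('Dan', 70);
""")
rows = conn.execute("""
    SELECT programmer,
           RANK()       OVER (ORDER BY score DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rnk
    FROM hackathon
    ORDER BY score DESC, programmer
""").fetchall()
for row in rows:
    print(row)
# RANK gives Dan a 4 (a gap after the tie); DENSE_RANK gives him a 3.
```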

So, if we expand the rowset from the previous example...

SELECT id,
  product_desc,
  product_brand,
  product_unitprice,
  product_category,
  RANK() OVER (ORDER BY product_unitprice) Rank_UnitPrice,
  DENSE_RANK() OVER (ORDER BY product_unitprice) DenseRank_UnitPrice
FROM dim_product
WHERE product_category = 'Razor';

... we get:


| ID | PRODUCT_DESC | PRODUCT_BRAND | PRODUCT_UNITPRICE | PRODUCT_CATEGORY | RANK_UNITPRICE | DENSERANK_UNITPRICE |
|----|--------------|---------------|-------------------|------------------|----------------|---------------------|
| 1 | Plain Gibson razor | Gibson | 15 | Razor | 1 | 1 |
| 5 | Gillette Plain for men | Gillette Plain | 20 | Razor | 2 | 2 |
| 4 | Schick Xtreme3 for professionals | Schick Xtreme3 | 20 | Razor | 2 | 2 |
| 2 | Gillette Venus Swirl for women | Gillette Venus Swirl | 25 | Razor | 4 | 3 |
| 3 | Gillette Fusion for young men | Gillette Fusion | 25,5 | Razor | 5 | 4 |


DENSE_RANK assigns the row after the identically priced razors the next consecutive number; RANK leaves a gap, skipping as many numbers as there were tied rows. In our case, RANK shows a "4" after the two identical values, while DENSE_RANK shows a "3".

NTILE

NTILE divides rows into a specified number of groups that are as equal in size as possible and returns the number of the group that each row belongs to. (When the row count isn't evenly divisible, the leading groups get one extra row.) This function is not as widely used as either RANK or DENSE_RANK.

For example, NTILE can be used to divide rows into three divisions:

SELECT id,
  product_brand,
  product_unitprice,
  RANK() OVER (ORDER BY product_unitprice) Rank_UnitPrice,
  DENSE_RANK() OVER (ORDER BY product_unitprice) DenseRank_UnitPrice,
  NTILE(3) OVER (ORDER BY product_unitprice) Ntile_UnitPrice
FROM dim_product
WHERE product_category = 'Razor';

We see five values divided into three groups:


| ID | PRODUCT_BRAND | PRODUCT_UNITPRICE | RANK_UNITPRICE | DENSERANK_UNITPRICE | NTILE_UNITPRICE |
|----|---------------|-------------------|----------------|---------------------|-----------------|
| 1 | Gibson | 15 | 1 | 1 | 1 |
| 5 | Gillette Plain | 20 | 2 | 2 | 1 |
| 4 | Schick Xtreme3 | 20 | 2 | 2 | 2 |
| 2 | Gillette Venus Swirl | 25 | 4 | 3 | 2 |
| 3 | Gillette Fusion | 25,5 | 5 | 4 | 3 |


ROW NUMBER

ROW_NUMBER differs from DENSE_RANK and RANK in that it does not treat identical values in any special way: it simply numbers the rows sequentially in the order in which they are processed.

Let's look at the example:

SELECT id,
  product_brand,
  product_unitprice,
  RANK() OVER (ORDER BY product_unitprice) Rank_UnitPrice,
  DENSE_RANK() OVER (ORDER BY product_unitprice) DenseRank_UnitPrice,
  NTILE(3) OVER (ORDER BY product_unitprice) Ntile_UnitPrice,
  ROW_NUMBER() OVER (ORDER BY product_unitprice) ROWNUMBER_UnitPrice
FROM dim_product
WHERE product_category = 'Razor';

And the result is:


| ID | PRODUCT_BRAND | PRODUCT_UNITPRICE | RANK_UNITPRICE | DENSERANK_UNITPRICE | NTILE_UNITPRICE | ROWNUMBER_UNITPRICE |
|----|---------------|-------------------|----------------|---------------------|-----------------|---------------------|
| 1 | Gibson | 15 | 1 | 1 | 1 | 1 |
| 5 | Gillette Plain | 20 | 2 | 2 | 1 | 2 |
| 4 | Schick Xtreme3 | 20 | 2 | 2 | 2 | 3 |
| 2 | Gillette Venus Swirl | 25 | 4 | 3 | 2 | 4 |
| 3 | Gillette Fusion | 25,5 | 5 | 4 | 3 | 5 |


You may be wondering how ROW_NUMBER ranks these identical values – which one gets the higher number. The answer is: it is totally undetermined. In one execution, a product could be assigned row number "5"; in another execution, the same product could be assigned a "4".

So beware. Many mistakes are made by the misuse of ROW_NUMBER on datasets with non-unique arguments.
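One common safeguard is to add a unique column, such as id, as a tiebreaker in the window's ORDER BY, which makes ROW_NUMBER deterministic. A sketch using SQLite, with made-up rows modeled on the razor data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (id INTEGER, product_brand TEXT, product_unitprice REAL);
    INSERT INTO dim_product VALUES
        (1, 'Gibson', 15), (5, 'Gillette Plain', 20),
        (4, 'Schick Xtreme3', 20), (2, 'Gillette Venus Swirl', 25);
""")
rows = conn.execute("""
    SELECT id, product_brand,
           -- id breaks the tie between the two products priced at 20,
           -- so every execution assigns the same row numbers
           ROW_NUMBER() OVER (ORDER BY product_unitprice, id) AS rn
    FROM dim_product
    ORDER BY rn
""").fetchall()
print(rows)
```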

We commonly use ranking functions when filtering data. Since window functions cannot appear directly in a WHERE clause, we compute the rank in a CTE and filter on it afterwards. So if we would like to keep only the razors with a price rank below 3, it would look like:

WITH top_product AS 
	(SELECT id,
  	product_brand,
  	product_unitprice,
  	RANK() OVER (ORDER BY product_unitprice) Rank_UnitPrice
		FROM dim_product
		WHERE product_category = 'Razor' )

SELECT * 
FROM top_product
WHERE rank_unitprice < 3;

And the result is:


| ID | PRODUCT_BRAND | PRODUCT_UNITPRICE | RANK_UNITPRICE |
|----|---------------|-------------------|----------------|
| 1 | Gibson | 15 | 1 |
| 4 | Schick Xtreme3 | 20 | 2 |
| 5 | Gillette Plain | 20 | 2 |


POSITIONAL ANALYTICAL FUNCTIONS

LEAD

LEAD returns the value of an argument column taken from a row that follows the current row at a given offset. The offset can be defined in the code; its default is 1. The value is returned in the current row.

Here's the syntax for LEAD:

LEAD(value, offset, default_value)

The default_value is returned when the offset points to a non-existent row. The offset is counted forward from the current row, in the direction of the window's ordering.

Let's return our previous example and use the LEAD function to look at price increases. We'll define product_unitprice as the value, with an offset of 1 and a default value of null:

SELECT id,
  product_desc,
  product_brand,
  product_unitprice,
  product_category,
  lead(product_unitprice,1,null) OVER (ORDER BY product_unitprice) Lead_UnitPrice
FROM dim_product
WHERE product_category = 'Razor';

Let's look at the result set:


| ID | PRODUCT_DESC | PRODUCT_BRAND | PRODUCT_UNITPRICE | PRODUCT_CATEGORY | LEAD_UNITPRICE |
|----|--------------|---------------|-------------------|------------------|----------------|
| 1 | Plain Gibson razor | Gibson | 15 | Razor | 20 |
| 5 | Gillette Plain for men | Gillette Plain | 20 | Razor | 20 |
| 4 | Schick Xtreme3 for professionals | Schick Xtreme3 | 20 | Razor | 25 |
| 2 | Gillette Venus Swirl for women | Gillette Venus Swirl | 25 | Razor | 25,5 |
| 3 | Gillette Fusion for young men | Gillette Fusion | 25,5 | Razor | null |


To make this clearer, let's visualize the shift in data values:


Lead direction

We see that LEAD "takes" the value one offset ahead, in the direction of the analytical function's ordering.

LAG

LAG is the opposite of LEAD. We can even implement LAG using LEAD and vice versa. The difference is in the direction we look for the offset value.

Here's the syntax for LAG:

LAG(value, offset, default_value)

Where LEAD takes values that are ahead of the current row in the window, LAG takes values that trail behind it; hence the name. LEAD and LAG are interchangeable: by reversing the ordering, you can implement the same behavior with either one.
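This equivalence can be checked with a quick SQLite sketch (made-up prices): LAG over a descending order returns exactly what LEAD returns over the ascending order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (price REAL);
    INSERT INTO t VALUES (15), (20), (25);
""")
rows = conn.execute("""
    SELECT price,
           LEAD(price, 1) OVER (ORDER BY price ASC)  AS lead_asc,
           LAG(price, 1)  OVER (ORDER BY price DESC) AS lag_desc
    FROM t
    ORDER BY price
""").fetchall()
print(rows)  # lead_asc and lag_desc agree on every row
```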

Lead, Lag direction

If we add a LAG with the order direction reversed to the previous example, it produces the same values as LEAD (up to the arbitrary ordering of the tied prices).

SELECT id,
  product_desc,
  product_brand,
  product_unitprice,
  product_category,
  lead(product_unitprice,1,null) OVER (ORDER BY product_unitprice) Lead_UnitPrice,
  lag(product_unitprice,1,null) OVER (ORDER BY product_unitprice DESC) Lag_UnitPrice
FROM dim_product
WHERE product_category = 'Razor';

The result:


| ID | PRODUCT_DESC | PRODUCT_BRAND | PRODUCT_UNITPRICE | PRODUCT_CATEGORY | LEAD_UNITPRICE | LAG_UNITPRICE |
|----|--------------|---------------|-------------------|------------------|----------------|---------------|
| 3 | Gillette Fusion for young men | Gillette Fusion | 25,5 | Razor | null | null |
| 2 | Gillette Venus Swirl for women | Gillette Venus Swirl | 25 | Razor | 25,5 | 25,5 |
| 5 | Gillette Plain for men | Gillette Plain | 20 | Razor | 20 | 25 |
| 4 | Schick Xtreme3 for professionals | Schick Xtreme3 | 20 | Razor | 25 | 20 |
| 1 | Plain Gibson razor | Gibson | 15 | Razor | 20 | 20 |


LEAD and LAG are important in the development of ETL using pure SQL batch statements. We will cover the implementation of various algorithms in later posts.

SUMMARY FUNCTIONS

SUM

SUM gives us the sum of an entire set or group of values.

This function is pretty self-explanatory. If we define the sum function as an analytical function, we get the sum of the whole set or group defined in every row.

Here's the syntax for SUM:

SUM(value)

So if we look at this example...

SELECT product_category,
  SUM(sales_units) AS SUM_SALES_UNITS
FROM f_sales
  JOIN dim_product ON f_sales.dim_product_id = dim_product.ID
GROUP BY product_category;

...we get one total per product category:


| PRODUCT_CATEGORY | SUM_SALES_UNITS |
|------------------|-----------------|
| Cream | 24092683,94 |
| Razor | 30985698,47 |
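Note that this query uses a plain GROUP BY aggregate. The analytic form of SUM – using the PARTITION BY syntax covered later in this article – repeats the group total on every detail row instead of collapsing the rows. A minimal sketch with made-up numbers:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE f_sales (product_category TEXT, sales_units INTEGER);
    INSERT INTO f_sales VALUES
        ('Cream', 100), ('Cream', 150), ('Razor', 200);
""")
rows = conn.execute("""
    SELECT product_category, sales_units,
           -- the category total appears on every row; no rows are collapsed
           SUM(sales_units) OVER (PARTITION BY product_category) AS category_total
    FROM f_sales
    ORDER BY product_category, sales_units
""").fetchall()
print(rows)
```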


The same applies to AVG and, where available, MEDIAN (Oracle provides it, for instance, but not every database does). Their syntax is similar enough that we won't go into them here, but this is what they look like:

SELECT product_category,
  SUM(sales_units) AS SUM_SALES_UNITS,
  AVG(sales_units) AS AVG_SALES_UNITS,
  MEDIAN(sales_units) AS MEDIAN_SALES_UNITS
FROM f_sales
  JOIN dim_product ON f_sales.dim_product_id = dim_product.ID
GROUP BY product_category;

The result is:


| PRODUCT_CATEGORY | SUM_SALES_UNITS | AVG_SALES_UNITS | MEDIAN_SALES_UNITS |
|------------------|-----------------|-----------------|--------------------|
| Cream | 24092683,94 | 5496,85 | 5502,75 |
| Razor | 30985698,47 | 5516,41 | 5534,86 |


RUNNING SUMMARY FUNCTIONS

Summary functions can also be defined with an ORDER BY clause. When we include this clause, the function is calculated over the current row and every preceding row, with rows that share the current row's ORDER BY value treated as one group.

Look at this SUM example:

SELECT DISTINCT DIM_TIME_ID AS TIME_ID,
  SUM(sales_units) OVER (ORDER BY DIM_TIME_ID) AS SUM_SALES_UNIT
FROM f_sales
ORDER BY DIM_TIME_ID;

The sum of sales_units accumulates for every unique date – the date's uniqueness matters here – as defined in the ORDER BY part of the analytical function. Below, we see the results:


| DIM_TIME_ID | SUM_SALES_UNIT |
|-------------|----------------|
| 20160101 | 1025620,36 |
| 20160102 | 2789678,38 |
| 20160103 | 4481130,18 |
| 20160104 | 6382374,4 |
| 20160105 | 8327342,03 |
| 20160106 | 10143646,25 |
| 20160107 | 11952402,14 |
| 20160108 | 13720165,03 |
| 20160109 | 15586689,76 |
| 20160110 | 17451030,37 |
| 20160111 | 19617779,59 |
| 20160112 | 21551100,48 |
| ... | ... |


Had we omitted this uniqueness requirement on the ORDER BY column, the return value for a row would have been the sum of that row, all of its duplicates, and every preceding row.

This type of summarization is called a running sum. It comes in handy when we're dealing with common algorithms, such as the ones we will cover in upcoming articles.
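The peer-group behavior described above can be seen in a small runnable sketch (SQLite, made-up figures): the two rows sharing the date 20160101 receive the same running total, because they are summed as one group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE f_sales (dim_time_id INTEGER, sales_units REAL);
    INSERT INTO f_sales VALUES
        (20160101, 10), (20160101, 20), (20160102, 5), (20160103, 7);
""")
rows = conn.execute("""
    SELECT dim_time_id,
           -- default frame: current row, its peers, and all preceding rows
           SUM(sales_units) OVER (ORDER BY dim_time_id) AS running_sum
    FROM f_sales
    ORDER BY dim_time_id
""").fetchall()
print(rows)  # both 20160101 rows show the same running total
```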

PARTITION SYNTAX

Partitioning is optional in an analytical function. The PARTITION BY keyword is followed by a list of columns:

<AnalyticFunction> ([<column-list>]) over ([<partition by>]<ordering>)

PARTITION BY defines the partition – the group of rows over which the function is computed. Do not confuse PARTITION BY with the GROUP BY clause of an SQL statement: GROUP BY collapses rows into one per group, while PARTITION BY keeps every row.

We can redefine our previous ranking example by creating partition groups for the product_category column like this:

SELECT id,
  product_brand,
  product_unitprice,
  product_category,
  RANK() OVER (PARTITION BY product_category ORDER BY product_unitprice) Rank_UnitPrice,
  DENSE_RANK() OVER (PARTITION BY product_category ORDER BY product_unitprice) DenseRank_UnitPrice,
  NTILE(3) OVER (PARTITION BY product_category ORDER BY product_unitprice) Ntile_UnitPrice
FROM dim_product
ORDER BY product_category, 5;

The ranking is done within product categories. The results are:


| ID | PRODUCT_BRAND | PRODUCT_UNITPRICE | PRODUCT_CATEGORY | RANK_UNITPRICE | DENSERANK_UNITPRICE | NTILE_UNITPRICE |
|----|---------------|-------------------|------------------|----------------|---------------------|-----------------|
| 9 | Mario Badescu | 29 | Cream | 1 | 1 | 1 |
| 6 | Hope | 35 | Cream | 2 | 2 | 1 |
| 8 | Cetapin | 45 | Cream | 3 | 3 | 2 |
| 7 | Aveena | 45 | Cream | 3 | 3 | 3 |
| 13 | Axe for women | 29 | Deo | 1 | 1 | 1 |
| 10 | Nivea men | 35 | Deo | 2 | 2 | 1 |
| 11 | Nivea women | 45 | Deo | 3 | 3 | 2 |
| 12 | Axe for men | 45 | Deo | 3 | 3 | 3 |
| 1 | Gibson | 15 | Razor | 1 | 1 | 1 |
| 5 | Gillette Plain | 20 | Razor | 2 | 2 | 1 |
| 4 | Schick Xtreme3 | 20 | Razor | 2 | 2 | 2 |
| 2 | Gillette Venus Swirl | 25 | Razor | 4 | 3 | 2 |
| 3 | Gillette Fusion | 25,5 | Razor | 5 | 4 | 3 |


The Window Concept

The window concept is used extensively with analytical functions; some even call analytical functions "windowing functions". A window is a collection of rows in which an analytical function is calculated.

We tie window concepts to changes in numeric variables (e.g., stock prices or sales). Windows are typically defined by time intervals, such as years or months. Some examples of window comparisons are:

  • Sales growth percentages
  • Moving averages of stock prices (30-day, 90-day, 180-day)
  • The performance of an advertising campaign across time
  • Cumulative sales performance

The window specification is optional and comes after the ordering:

<AnalyticFunction> ([<column-list>]) over ([<partition by>]<ordering>[window-specification])

Windows can be specified as units of physical rows using the ROWS keyword, or they can be defined in logical rows using the RANGE keyword.

Physical windows are specified as a number of rows before and/or after the current row.

Logical windows are specified by values in the ordering column, such as a number of days. Numeric variables are calculated in moving windows so that they can be compared across windows.

Windows can be cumulative (as in the running sum example) or sliding, as we will see in the coming examples.

Examples of physical-row windows are:

  • ROWS UNBOUNDED PRECEDING – The window is the current row and every preceding row.
  • ROWS 2 PRECEDING – The window is the current row and 2 preceding physical rows.
  • ROWS 2 FOLLOWING – The window is the current row and 2 following physical rows.

ROWS UNBOUNDED PRECEDING will yield the same results as the running sum example if the ORDER BY column in the analytic definition is unique. If it isn't, the running analytical function sums each group of identical ORDER BY values as a whole, while the ROWS UNBOUNDED PRECEDING window expands one physical row at a time.

Windows of the analytical functions

Let's look at the example:

SELECT DISTINCT DIM_TIME_ID AS TIME_ID,
  SUM(sales_units) over (ORDER BY DIM_TIME_ID ) AS SUM_SALES_UNIT,
  SUM(sales_units) over (ORDER BY DIM_TIME_ID ROWS UNBOUNDED PRECEDING) AS SUM_SALES_UNIT_PRECEDING
FROM f_sales
ORDER BY DIM_TIME_ID;

And the result set:


| TIME_ID | SUM_SALES_UNIT | SUM_SALES_UNIT_PRECEDING |
|---------|----------------|--------------------------|
| 20160101 | 1025620,36 | 4043,38 |
| 20160101 | 1025620,36 | 10451,14 |
| 20160101 | 1025620,36 | 13106,5 |
| 20160101 | 1025620,36 | 15749,79 |
| 20160101 | 1025620,36 | 19807,07 |
| 20160101 | 1025620,36 | 25012,65 |
| 20160101 | 1025620,36 | 31650,32 |
| ... | ... | ... |


As we see in the result set, the ROWS UNBOUNDED PRECEDING window grows one row at a time to include the current row and all preceding rows. The running sum, on the other hand, summarizes all values sharing the 2016-01-01 date at once.

Now let's take a look at the logical window specification.

Logical windows are indicated by the RANGE keyword coupled with an offset expressed in the units of the ordering variable, such as a number of days relative to a shipment date. These values indicate the start and end of the window.

Unlike physical windows, where we specify the number of rows, in logical windows we specify the logical bounds of the window. To clarify this, let's look at some examples of logical windows and how they differ from their physical counterparts.

Examples of logical windows are:

  • ORDER BY TIME_ID RANGE 30 PRECEDING – The window is the current row and 30 days in the past.
  • ORDER BY TIME_ID RANGE BETWEEN 365 PRECEDING AND 365 FOLLOWING – The window is bound by 365 days in the past and 365 days in the future.
  • ORDER BY TIME_ID RANGE BETWEEN INTERVAL '1' YEAR PRECEDING AND INTERVAL '1' YEAR FOLLOWING – The window is bound by 1 year in the past and 1 year in the future.

A window bound by a range specified in days and one specified in years will be the same most of the time. Can you guess when they won't be? (Hint: think of leap years.)

This picture best describes the difference between logical and physical windows:

Range and row windows

The left side displays the window for ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING; the right side displays the window for RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING. The RANGE window takes in every row whose date falls within the bounds – all rows of the same day count as one logical unit – while the ROWS window takes exactly one preceding and one following physical row.
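The same contrast can be sketched with integer day keys (SQLite 3.28+ is needed for RANGE with a numeric offset; the data is made up). The two rows on day 1 are one logical unit for RANGE, but separate physical rows for ROWS:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (day INTEGER, units INTEGER);
    INSERT INTO sales VALUES (1, 10), (1, 20), (2, 5), (4, 7);
""")
rows = conn.execute("""
    SELECT day, units,
           -- exactly one physical row before and after the current row
           SUM(units) OVER (ORDER BY day, units
                            ROWS  BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS sum_rows,
           -- every row whose day value lies within +/- 1 of the current day
           SUM(units) OVER (ORDER BY day
                            RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS sum_range
    FROM sales
    ORDER BY day, units
""").fetchall()
print(rows)  # note day 4 is isolated for RANGE (no day 3 or 5 exists)
```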

Let's return to our example with the sales fact table:

SELECT DISTINCT DIM_TIME_ID AS TIME_ID,sales_units,
  SUM(sales_units) over (ORDER BY DIM_TIME_ID ) AS SUM_SALES_UNIT,
  SUM(sales_units) over (ORDER BY DIM_TIME_ID ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS SUM_SALES_UNIT_ROW,
  SUM(sales_units) over (ORDER BY TO_DATE(DIM_TIME_ID,'YYYYMMDD') RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS SUM_SALES_UNIT_RANGE
FROM f_sales
ORDER BY DIM_TIME_ID,sales_units;

Since we are defining a logical window over dates, we had to convert DIM_TIME_ID to a DATE value. We should stress that a single DIM_TIME_ID represents a single day and that every day has a corresponding time_id; we enforce this with a 1..n relationship from the dimension table to the fact table.

We get different results for each analytical function:


| TIME_ID | SALES_UNITS | SUM_SALES_UNIT | SUM_SALES_UNIT_ROW | SUM_SALES_UNIT_RANGE |
|---------|-------------|----------------|--------------------|----------------------|
| 20160101 | 2019,35 | 1025620,36 | 4062,47 | 2789678,38 |
| 20160101 | 2043,12 | 1025620,36 | 6105,71 | 2789678,38 |
| 20160101 | 2043,24 | 1025620,36 | 6136,87 | 2789678,38 |
| 20160101 | 2050,51 | 1025620,36 | 6148,13 | 2789678,38 |
| 20160101 | 2054,38 | 1025620,36 | 6189,71 | 2789678,38 |
| 20160101 | 2084,82 | 1025620,36 | 6276,31 | 2789678,38 |
| ... | ... | ... | ... | ... |


We can see that SUM_SALES_UNIT is summarizing the result for the date 01.01.2016, ROW is moving its physical window through the rows, and RANGE is summing up all the sales in a three-day window.

Using SQL analytical functions can be a double-edged sword. They provide better performance and productivity, but the results and calculations can often be counterintuitive. There are many other functions that we didn't cover in this article, but we have discussed the building blocks of most of them. I challenge you to explore analytical functions further. In the next article, we'll learn about some DWH algorithms that use analytical functions for common ETL patterns.