27th Sep 2016 15 minutes read A Gentle Introduction to Common SQL Window Functions Aldo Zelen Oracle ranking SQL window functions Mastering SQL analytical functions is a necessity for an aspiring BI/DWH professional. In this article, we'll explore the history of SQL in a BI environment. We'll also introduce some of the more common SQL analytical functions, including RANK, LEAD, LAG, SUM, and others. SQL can be used for business intelligence (BI) applications that combine data retrieval, analysis, and computation. These computations are more complex than what we see in OLTP systems. If you're interested in getting into data warehousing (DWH) or business intelligence, you need to go beyond plain SQL and start digging into the analytical functions (also known as window functions, windowing functions, or OVER clauses). As stated, this article will get you started with SQL analytical functions. First, let's have a quick look at the history of using SQL in BI. We'll also consider the basic processes involved in using SQL to partition, order, and analyze data. BI Meets SQL: A Brief History BI applications are used for a variety of data-related activities. Common computations include: Best or worst performers in various organizational units (based on some qualitative measurement) Best or worst contributions by percentage threshold, such as the bottom 5% of stores by revenue Trends over time and across time; this information yields interesting insights about where the organization is heading Contributions to a whole, such as a commission percentage While using relational database management systems (RDMS) as DWH/BI platforms, organizations found that the SQL SELECT statement wasn't adequately supporting their BI applications. Plus, analysts were required to possess a complex skill set involving data retrieval, procedural coding, external tool usage and business domain knowledge. This mix of skills is not common, and companies found it difficult to locate individuals who could do everything needed. During this time, the use of standard SQL in BI was causing poor development productivity and poor app performance. Productivity was poor because a BI application developer had to use SQL relational statements mixed in with procedural code (Oracle PL/SQL). Performance was slow because the SQL compiler often created poor execution plans for complex statements, and even worse ones for mixed statements. The need was clear: overcome these deficiencies and enable RDBM systems to support BI/DWH more fully. This resulted in major SQL extensions for both processing and specification. You know these extensions as the analytical functions we find in all of the major DBMS vendors' SQL syntaxes. SQL Analytical Functions: The Basics Executing analytical functions organizes data into partitions, computes functions over these partitions in a specified order, and returns the result. Processing analytical functions is done after GROUP BY and row filtering (WHERE). This is so that the calculations can be performed on the grouped results. The returned rows are ordered after the functions have been processed. We place analytical functions in the SELECT list, just as we would with any other result columns in a SQL SELECT statement. Let's take a look at the basic syntax of an analytical function: <AnalyticFunction> ([<column-list>]) over (<ordering>) We can see that the analytical function consists of a function with an over keyword and an optional list of columns for ordering. The ordering indicates criteria for function evaluation, not the final ordering of results. This will become clear later as we look at examples. We will now consider the most common analytical functions used in DWH/BI. To do this, we will access a simple data mart and build our queries on that. The following data mart model is a simple star schema for tracking and analyzing retail sales: Let's begin with the ranking functions. RANKING FUNCTIONS RANK RANK is a simple analytical function that does not take any column list arguments. It returns a number or a rank based on the ordering of the rows; the ordering is based on a defined condition. Let's look at an example product list. We will rank products based on their unit price, from the cheapest to the most expensive. SELECT id, product_desc, product_brand, product_unitprice, RANK() OVER (ORDER BY product_unitprice) Rank_UnitPrice FROM dim_product; The ORDER BY clause is mandatory in a RANK function, as we will be listing products in either ascending or descending order. The result set in ascending order would be: IDPRODUCT_DESCPRODUCT_BRANDPRODUCT_UNITPRICERANK_UNITPRICE 1Plain Gibson razorGibson151 4Schick Xtreme3 for professionalsSchick Xtreme3202 2Gillette Venus Swirl for womenGillette Venus Swirl253 3Gillette Fusion for young menGillette Fusion25,54 The descending version, where we go from most to least expensive, would be delivered by this code snippet: RANK() OVER (ORDER BY product_unitprice DESC) DENSE RANK DENSE RANK works like RANK in that it needs no additional arguments and it ranks items in descending or ascending order. The only difference is that DENSE RANK does not allow "gaps" between groups. These "gaps" occur when the ordering argument is the same for one or more groups. To illustrate this, think of competition rankings. Let's say some programmers join a hackathon. Two programmers finish in second place with identical scores. DENSE_RANK would return a "1" for the first-place programmer, a "2" for both programmers in second place and a "3" for the third-place finisher. RANK would return a "4" for the third place which, in this case, does not make sense. So, if we expand the rowset from the previous example... SELECT id, product_desc, product_brand, product_unitprice, product_category, RANK() OVER (ORDER BY product_unitprice) Rank_UnitPrice, DENSE_RANK() OVER (ORDER BY product_unitprice) Rank_UnitPrice FROM dim_product WHERE product_category = 'Razor' ... we get: IDPRODUCT_DESCPRODUCT_BRANDPRODUCT_UNITPRICEPRODUCT_CATEGORYRANK_UNITPRICEDENSERANK_UNITPRICE 1Plain Gibson razorGibson15Razor11 5Gillette Plain for menGillette Plain20Razor22 4Schick Xtreme3 for professionalsSchick Xtreme320Razor22 2Gillette Venus Swirl for womenGillette Venus Swirl25Razor43 3Gillette Fusion for young menGillette Fusion25,5Razor54 DENSE_RANK computes the row after the identical product_unitprice razors as the next ordinal number; RANK leaves a gap so that the next value displayed after an identical-value group is larger for the count. In our case, RANK shows a "4" after the two identical values; DENSE_RANK shows a "3". NTILE NTILE divides rows into equal groups and returns the number of the group that the row belongs to. This function is not as widely used as either RANK or DENSE RANK. For example, NTILE can be used to divide rows into three divisions: SELECT id, product_brand, product_unitprice, RANK() OVER (ORDER BY product_unitprice) Rank_UnitPrice, DENSE_RANK() OVER (ORDER BY product_unitprice) Rank_UnitPrice, NTILE(3) OVER (ORDER BY product_unitprice) Ntile_UnitPrice FROM dim_product WHERE product_category = 'Razor'; We see five values divided into three groups: IDPRODUCT_BRANDPRODUCT_UNITPRICERANK_UNITPRICEDENSERANK_UNITPRICENTILE_UNITPRICE 1Gibson15111 5Gillette Plain20221 4Schick Xtreme320222 2Gillette Venus Swirl25432 3Gillette Fusion25,5543 ROW NUMBER ROW NUMBER is different than DENSE RANK and RANK in that it does not treat identical values in any special way. It simply lists them as they occur in some order. Let's look at the example: SELECT id, product_brand, product_unitprice, RANK() OVER (ORDER BY product_unitprice) Rank_UnitPrice, DENSE_RANK() OVER (ORDER BY product_unitprice) DenseRank_UnitPrice, NTILE(3) OVER (ORDER BY product_unitprice) Ntile_UnitPrice, ROW_NUMBER() OVER (ORDER BY product_unitprice) ROWNUMBER_UnitPrice FROM dim_product WHERE product_category = 'Razor'; And the result is: IDPRODUCT_BRANDPRODUCT_UNITPRICERANK_UNITPRICEDENSERANK_UNITPRICENTILE_UNITPRICEROWNUMBER_UNITPRICE 1Gibson151111 5Gillette Plain202212 4Schick Xtreme3202223 2Gillette Venus Swirl254324 3Gillette Fusion25,55435 You may be wondering how ROW_NUMBER ranks these identical values – which one gets a higher ranking. The answer is: This is totally undetermined. In one execution, a product could be assigned an ID of "5"; in another execution, the same product could be assigned a "4". So beware. Many mistakes are made by the misuse of ROW_NUMBER on datasets with non-unique arguments. We commonly use ranking functions when filtering data. So if we would like to filter the top three razors by price using DENSE_RANK, it would look like: WITH top_product AS (SELECT id, product_brand, product_unitprice, RANK() OVER (ORDER BY product_unitprice) Rank_UnitPrice FROM dim_product WHERE product_category = 'Razor' ) SELECT * FROM top_product WHERE rank_unitprice < 3 And the result is: IDPRODUCT_BRANDPRODUCT_UNITPRICERANK_UNITPRICE 1Gibson151 4Schick Xtreme3202 5Gillette Plain202 POSITIONAL ANALYTICAL FUNCTIONS LEAD LEAD returns an offset (incrementally increased) value of an argument column. The offset amount can be defined in the code; its default amount is "1". The new value is returned in the same row. Here's the syntax for LEAD: LEAD(value, offset, default_value) The default value is returned if the offset points to non-existent row. The offset value follows the initial value and is applied by the direction of order. Let's return our previous example and use the LEAD function to look at price increases. We'll define product_unitprice as the value, with an offset of 1 and a default value of null: SELECT id, product_desc, product_brand, product_unitprice, product_category, lead(product_unitprice,1,null) OVER (ORDER BY product_unitprice) Lead_UnitPrice FROM dim_product WHERE product_category = 'Razor'; Let's look at the result set: IDPRODUCT_DESCPRODUCT_BRANDPRODUCT_UNITPRICEPRODUCT_CATEGORYLEADPRODUCT_UNITPRICE 1Plain Gibson razorGibson15Razor20 5Gillette Plain for menGillette Plain20Razor20 4Schick Xtreme3 for professionalsSchick Xtreme320Razor25 2Gillette Venus Swirl for womenGillette Venus Swirl25Razor25,5 3Gillette Fusion for young menGillette Fusion25,5Razornull To make this clearer, let's visualize the shift in data values : --> We see the LEAD function is "taking" the 1 offset value in the direction of the analytical function's order. LAG LAG is the opposite of LEAD. We can even implement LAG using LEAD and vice versa. The difference is in the direction we look for the offset value. Here's the syntax for LAG: LAG(value, offset, default_value) As the lead function takes the values in the window that are in the lead of the current value the lag function takes the values that are lagging. Hence the name LAG. The usage of the lead or lag is interchangeable, you can use them to implement the same behavior. If we rewrite the previous example with a changed order direction, we get the same results. SELECT id, product_desc, product_brand, product_unitprice, product_category, lead(product_unitprice,1,null) OVER (ORDER BY product_unitprice) Lead_UnitPrice, lag(product_unitprice,1,null) OVER (ORDER BY product_unitprice DESC) Lag_UnitPrice FROM dim_product WHERE product_category = 'Razor'; The result: IDPRODUCT_DESCPRODUCT_BRANDPRODUCT_UNITPRICEPRODUCT_CATEGORYLEADPRODUCT_UNITPRICELAGPRODUCT_UNITPRICE 3Gillette Fusion for young menGillette Fusion25,5Razornullnull 2Gillette Venus Swirl for womenGillette Venus Swirl25Razor25,525,5 5Gillette Plain for menGillette Plain20Razor2025 4Schick Xtreme3 for professionalsSchick Xtreme320Razor2520 1Plain Gibson razorGibson15Razor2020 LEAD and LAG are important in the development of ETL using pure SQL batch statements. We will cover the implementation of various algorithms in later posts. SUMMARY FUNCTIONS SUM SUM gives us the sum of an entire set or group of values. This function is pretty self-explanatory. If we define the sum function as an analytical function, we get the sum of the whole set or group defined in every row. Here's the syntax for SUM: SUM(value) So if we look at this example... SELECT product_category, SUM(sales_units) AS SUM_SALES_UNITS FROM f_sales, dim_product WHERE f_sales.dim_product_id = dim_product.ID GROUP BY product_category; ...we get the sum of every product category in every row: PRODUCT_CATEGORYSUM_SALES_UNITS Cream24092683,94 Razor30985698,47 The same applies to MEDIAN, AVG (a.k.a. MEAN), or any other summary function. Their syntaxes are similar enough that we won't go into them here, but this is what they look like: SELECT product_category, SUM(sales_units) AS SUM_SALES_UNITS, AVG(sales_units) AS AVG_SALES_UNITS, MEDIAN(sales_units) AS MEDIAN_SALES_UNITS FROM f_sales, dim_product WHERE f_sales.dim_product_id = dim_product.ID GROUP BY product_category; The result is: PRODUCT_CATEGORYSUM_SALES_UNITSMEAN_SALES_UNITSMEDIAN_SALES_UNITS Cream24092683,945496,855502,75 Razor30985698,475516,415534,86 RUNNING SUMMARY FUNCTIONS Summary functions can be defined with the ORDER BY clause. When we include this clause, the function is calculated on the current row and every preceding unique row. Look at this SUM example: SELECT DISTINCT DIM_TIME_ID AS TIME_ID, SUM(sales_units) OVER (ORDER BY DIM_TIME_ID) AS SUM_SALES_UNIT FROM f_sales ORDER BY DIM_TIME_ID We know that the summary of sales_units is accumulative for every unique date – the date's uniqueness is important – as defined in the ORDER BY part of the analytical function. Below, we see the results: DIM_TIME_IDSUM_SALES_UNIT 201601011025620,36 201601022789678,38 201601034481130,18 201601046382374,4 201601058327342,03 2016010610143646,25 2016010711952402,14 2016010813720165,03 2016010915586689,76 2016011017451030,37 2016011119617779,59 2016011221551100,48 ...... If we would have omitted this unique value requirement on the ORDER BY column, the return value would have been the sum of the non-unique row and every preceding row. This type of summarization is called a running sum. It comes in handy when we're dealing with common algorithms, such as the ones we will cover in upcoming articles. PARTITION SYNTAX Partitioning is optional in an analytic function. A partition statement is followed by a list of columns. <AnalyticFunction> ([<column-list>]) over ([<partition by>]<ordering>) PARTITION BY defines the partition, -- the group of rows on which a function is performed. Do not confuse PARTITION BY with the GROUP BY part of an SQL statement. We can redefine our previous ranking example by creating partition groups for the product_category column like this: SELECT id, product_brand, product_unitprice, product_category, RANK() OVER (PARTITION BY product_category ORDER BY product_unitprice) Rank_UnitPrice, DENSE_RANK() OVER (PARTITION BY product_category ORDER BY product_unitprice) DenseRank_UnitPrice, NTILE(3) OVER (PARTITION BY product_category ORDER BY product_unitprice) Ntile_UnitPrice FROM dim_product ORDER BY PRODUCT_CATEGORY, 5; The ranking is done within product categories. The results are: IDPRODUCT_BRANDPRODUCT_UNITPRICEPRODUCT_CATEGORYDENSERANK_UNITPRICENTILE_UNITPRICEROWNUMBER_UNITPRICE 9Mario Badescu29Cream111 6Hope35Cream221 8Cetapin45Cream332 7Aveena45Cream333 13Axe for women29Deo111 10Nivea men35Deo221 11Nivea women45Deo332 12Axe for men45Deo333 1Gibson15Razor111 5Gillette Plain20Razor221 4Schick Xtreme320Razor222 2Gillette Venus Swirl25Razor432 3Gillette Fusion25,5Razor543 The Window Concept The window concept is used extensively with analytical functions; some even call analytical functions "windowing functions". A window is a collection of rows in which an analytical function is calculated. We tie window concepts to changes in numeric variables (i.e. stock prices or sales). Windows are typically defined by time intervals, like years or months. Some examples of window comparisons are: Sales growth percentages Moving averages of stock prices (30 day,90 day, 180 day) The performance of an advertising campaign across time Cumulative sales performance The syntax of window functions is optional and is included after the ordering: <AnalyticFunction> ([<column-list>]) over ([<partition by>]<ordering>[window-specification]) Windows can be specified as units of physical rows using the ROWS keyword, or they can be defined in logical rows using the RANGE keyword. Physical windows are specified by absolute values of rows we offset before and after the current row. Logical windows are specified by values, such as a number of days. Numeric variables are calculated in moving windows so that the variables can be compared across windows. Windows can be cumulative (as with the running summability example) or sliding, as we will see in the coming examples. Examples of physical-row windows are: ROWS UNBOUNDED PRECEDING – The window is the current row and every preceding row. ROWS 2 PRECEDING – The window is the current row and 2 preceding physical rows. ROWS 2 FOLLOWING – The window is the current row and 2 following physical rows. ROWS UNBOUNDED PRECEDING will yield the same results as the running sum example if the ORDER BY column in the analytic definition is unique. If not, then the running analytical function works on the rows of the non-unique group set first. The ROWS UNBOUNDED PRECEDING window expands one row at a time. Let's look at the example: SELECT DISTINCT DIM_TIME_ID AS TIME_ID, SUM(sales_units) over (ORDER BY DIM_TIME_ID ) AS SUM_SALES_UNIT, SUM(sales_units) over (ORDER BY DIM_TIME_ID ROWS UNBOUNDED PRECEDING) AS SUM_SALES_UNIT_PRECEDING FROM f_sales ORDER BY DIM_TIME_ID And the result set: TIME_IDSUM_SALES_UNITSUM_SALES_UNIT_PRECIDING 201601011025620,364043,38 201601011025620,3610451,14 201601011025620,3613106,5 201601011025620,3615749,79 201601011025620,3619807,07 201601011025620,3625012,65 201601011025620,3631650,32 ......... As we see in the result set, the ROWS_UNBOUNDED window expands to include the current row and all the following rows. The running sum, on the other hand, summarizes values by the 2016-01-01 date. Now let's take a look at the logical window specification. Logical windows are indicated by the RANGE keyword coupled with an ordering variable, such as a number of years or a shipment date. These values indicate the start and end of a window. Unlike physical windows, where we specify the number of rows, in logical windows we specify the logical bounds of the window. To clarify this, let's look at some examples of logical windows and how they differ from their physical counterparts. Examples of logical windows are: ORDER BY TIME_ID RANGE 30 PRECEDING – The window is the current row and 30 days in the past. ORDER BY TIME_ID RANGE BETWEEN 365 PRECEDING AND 365 FOLLOWING – The window is bound by 365 days in the past and 365 days in the future. ORDER BY TIME_ID DATE RANGE BETWEEN INTERVAL '1' YEAR FOLLOWING AND INTERVAL '1' YEAR PRECEDING – The window is bound by 1 year in the past and 1 year in the future. A window bound by ranges specified in days and years will be the same most of the time. Can you guess when it won't be the same? This picture best describes the difference between logical and physical windows: The left side displays the window for ROW BETWEEN 1 FOLLOWING AND 1 PRECEDING and the right side displays the windows for RANGE BETWEEN 1 FOLLOWING AND 1 PRECEDING. The RANGE windows take into account all the days that precede the current date because they are all one logical unit. The ROW window only takes one preceding physical row. Let's return to our example with the sales fact table: SELECT DISTINCT DIM_TIME_ID AS TIME_ID,sales_units, SUM(sales_units) over (ORDER BY DIM_TIME_ID ) AS SUM_SALES_UNIT, SUM(sales_units) over (ORDER BY DIM_TIME_ID ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS SUM_SALES_UNIT_ROW, SUM(sales_units) over (ORDER BY TO_DATE(DIM_TIME_ID,'YYYYMMDD') RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS SUM_SALES_UNIT_RANGE FROM f_sales ORDER BY DIM_TIME_ID,sales_units; Since we are defining logical rows for dates, we had to cast the DIM_TIME_ID to the date value to define the date logical row. We should stress that a single DIM_TIME_ID represents a single day and that every day has a corresponding time_id. We enforce this with a 1..n relationship from the fact table to the dimensional table. We get different results for each analytical function: TIME_IDSALES_UNITSUM_SALES_UNITSUM_SALES_UNIT_ROWSUM_SALES_UNIT_RANGE 201601012019,351025620,364062,472789678,38 201601012043,121025620,366105,712789678,38 201601012043,241025620,366136,872789678,38 201601012050,511025620,366148,132789678,38 201601012054,381025620,366189,712789678,38 201601012084,821025620,366276,312789678,38 ............... We can see that SUM_SALES_UNIT is summarizing the result for the date 01.01.2016, ROW is moving its physical window through the rows, and RANGE is summing up all the sales in a three-day window. Using SQL analytical functions can be a two-edged sword. They provide better performance and productivity, but the results and calculations can often be counterintuitive. There are many other functions that we didn't cover in these article, but we have discussed the building blocks of most of them. I challenge you to explore analytical functions further. In the next article, we'll learn about some DWH algorithms that use analytical functions for common ETL patterns. Tags: Oracle ranking SQL window functions