# Top 7 Advanced SQL Queries for Data Analysis

*Explore essential advanced SQL queries for data analysis.*

Structured Query Language, or SQL, is an indispensable tool for data analysts. Most people pick up the language relatively quickly and can begin in-depth data analysis after just a few lessons. Because of this, many data analysts tend to stay in the beginner/intermediate level of using SQL. This level of understanding allows you to “get the job done”, but it might not be the most efficient way to write a query.

Mastering advanced SQL techniques can greatly enhance your data analysis capabilities, allowing you to delve deeper into your datasets and derive valuable insights. In this article, we will explore seven advanced SQL queries that can be useful for data analysis. For each example, we will present the query and its results, discuss the syntax used, and provide a brief explanation of how the database computes the result.

If you’d like to improve your SQL skills, be sure to check out our SQL Reporting track. It is a great way to learn and practice more advanced SQL functions.

## 7 Advanced SQL Queries Data Analysts Should Know

### 1. Grouping Data by Time Period

In this first example, let’s examine data in the table ** sales**. This dataset includes an ID for each product, the date of the transaction, and the total amount of the transaction.

product_id | sale_date | amount |
---|---|---|

1 | 2021-01-01 | 100 |

2 | 2021-01-15 | 200 |

1 | 2021-02-01 | 300 |

2 | 2021-02-15 | 400 |

1 | 2022-01-10 | 200 |

1 | 2022-02-05 | 100 |

2 | 2022-01-27 | 200 |

2 | 2022-02-12 | 400 |

One very common data analysis scenario is to take raw data like this and aggregate it by specific date periods, such as month or year. Let’s go ahead and do just that by running the following query:

SELECT EXTRACT(YEAR FROM sale_date) AS year, EXTRACT(MONTH FROM sale_date) AS month, SUM(amount) AS total_sales FROM sales GROUP BY year, month ORDER BY year, month;

This query will return these results:

year | month | amount |
---|---|---|

2021 | 01 | 300 |

2021 | 02 | 700 |

2022 | 01 | 400 |

2022 | 02 | 500 |

In this query, we use the `EXTRACT()`

function to pull the year and month from the `sale_date`

field. We then group the data by the new year and month fields and calculate the total sales for each group using the `SUM()`

function.

Note that we need to group by both year and month to get accurate results. If we were to only pull in the month, the results would combine the values for specific months across the years (so all Januaries from all years would be combined into one row, all Februaries would be combined together, etc). When we group the results by both year and month, the months from different years are put into separate rows.

We then order the rows by year and month to have ordered results.

Different database engines often have different and sometimes better functions to accomplish the same result, such as `DATE_TRUNC()`

in PostgreSQL or `TRUNC()`

in Oracle.

### 2. Create Multiple Grouping Levels Using ROLLUP

Sometimes you may want to group data by multiple levels in a single query. Let’s say you want to add the total sales (across all years) as well as the total sales in each year to the previous example. You can add additional rows for the grand total and totals for each year by using the `ROLLUP`

keyword.

Using the same dataset, we’ll run this query:

SELECT EXTRACT(YEAR FROM sale_date) AS year, EXTRACT(MONTH FROM sale_date) AS month, SUM(amount) AS total_sales FROM sales GROUP BY ROLLUP(year,month) ORDER BY year, month;

Result:

year | month | total_sales |
---|---|---|

2021 | 01 | 300 |

2021 | 02 | 700 |

2021 | NULL | 1000 |

2022 | 01 | 400 |

2022 | 02 | 500 |

2022 | NULL | 900 |

NULL | NULL | 1900 |

In our example, `ROLLUP`

adds three extra rows: one for the grand total of all sales and two for the total sales of each year. In our result set, the third row is the yearly total for 2021: the value in the `year`

column is 2021 and the value in the `month`

column is NULL. The sixth row is the total for 2022: the value in the year column is 2022 and the value in the `month`

column is `NULL`

. The final row is the total for both years: it has NULL in both the year and `month`

columns. These rows were added by `ROLLUP`

. The `NULL`

values in the year and `month`

columns indicate the aggregated rows.

### 3. Ranking Data Using Window Functions

Ranking data is a common requirement in advanced data analysis. A few business cases would be ranking your products by the highest sales to understand which products bring in the most revenue or ranking stores by the lowest sales to understand which stores are your lowest performers.

You can use window functions like `RANK()`

or `DENSE_RANK()`

to assign ranks to rows based on a specific column's value. First, let’s take a look at the `RANK()`

function:

SELECT product_id, SUM(amount) AS total_sales, RANK() OVER(ORDER BY SUM(amount) DESC) AS rank FROM sales GROUP BY id;

Result:

product_id | total_sales | rank |
---|---|---|

2 | 1200 | 1 |

1 | 700 | 2 |

This query first calculates the total sales for each `product_id`

and then ranks them in descending order. The `RANK()`

function is a window function used to compute the ranking. The `OVER()`

clause is the syntax used with window functions. The `RANK()`

function assigns a unique rank to each row within a result set, based on specific ordering. The `ORDER BY`

clause in the `OVER()`

clause specifies the order in which the rank will be assigned, based on one or more columns. In our example we rank the rows by the total sales amount.

The same syntax can be used for the `DENSE_RANK()`

function. The difference between the two is in the scenario where two values are equal. If two values tie for first place, `RANK()`

will skip to 3 for the third row, while `DENSE_RANK()`

would assign the third row as 2.

For more information about ranking functions, read our in-depth article Overview of Ranking Functions in SQL.

You can also filter the top X rows by using a common table expression (CTE) and a `WHERE`

clause. For example, we can run the following to only pull our lowest-ranked row:

WITH sales_cte AS ( SELECT product_id, SUM(amount) AS total_sales, RANK() OVER (ORDER BY SUM(amount) DESC) AS rank FROM sales GROUP BY id ) SELECT * FROM sales_cte WHERE rank = 1;

In this query, we’re defining a CTE called

, which gives the rank of each **sales_cte**`product_id`

based on its total sales. We then query this CTE, only pulling the rows where rank = 1. This allows us to only return the top row.

### 4. Computing the Difference (Delta) Between Rows

Sometimes you need to compare the values of consecutive rows. One example would be pulling a previous month’s sales to compare to the current month. Having both the current month’s sales and the previous month’s sales allows you to compute the difference in the two values.

You can use window functions like `LEAD()`

and `LAG()`

to achieve this. We’ll use a different table this time: ** sales_agg**. This table has the aggregate sales for each year and month.

year | month | total_sales |
---|---|---|

2021 | 1 | 300 |

2021 | 2 | 700 |

2022 | 1 | 400 |

2022 | 2 | 500 |

Let's compute the month-over-month sales difference using `LAG()`

:

SELECT year, month, total_sales, LAG(total_sales) OVER(ORDER BY year, month) AS previous_month, total_sales – LAG(total_sales) OVER(ORDER BY year, month) AS sales_difference FROM sales_agg ORDER BY year, month;

This query uses the `LAG()`

function to calculate the difference between the current month's sales and the previous month's sales:

year | month | total_sales | previous_month | sales_difference |
---|---|---|---|---|

2021 | 01 | 300 | NULL | NULL |

2021 | 02 | 700 | 300 | 400 |

2022 | 01 | 400 | 700 | -300 |

2022 | 02 | 500 | 400 | 100 |

The `LAG()`

function is a window function. It pulls the `total_sales`

value from the previous row, ordered by year and month. We’re aliasing that column as `previous_month`

.

Then the `sales_difference`

column is calculated by taking the `total_sales`

for each row and then subtracting the `previous_month`

value. The first row has `NULL`

values, since there is no previous row.

If you needed to pull the next month’s sales, you would use the window function `LEAD()`

in place of `LAG()`

. `LAG()`

compares the current row to previous values; `LEAD()`

compares the current row to subsequent values.

How to Calculate the Difference Between Two Rows in SQL gives multiple real-life examples of using this syntax to compute the difference between two rows.

### 5. Computing a Running Total

A running total calculates the sum of a sequence of numbers. It’s also known as a cumulative total or cumulative sum; it adds each new value to the previous total.

Running totals are useful in calculating the cumulative effect of data points over time. For example, you might want to calculate the cumulative number of users that have visited your website up until a certain time to understand your website’s user growth.

Let’s revisit the `SUM()`

function and see how we could use it to calculate a running sum of sales. As with the previous example, we’ll need to use `SUM()`

as a window function to achieve the desired results.

SELECT year, month, total_sales, SUM(total_sales) OVER(ORDER BY year, month) AS running_total FROM sales_agg ORDER BY year, month;

This running total is calculated using the `SUM()`

function with the `OVER()`

clause. This adds the `total_sales`

from the current row to all previous rows in the specified order.

In the second row, the running total is calculated by aggregating the `total_sales`

value from both the first and second row. In our example, the second row is for February of 2021. The `running_total`

value is the sum of the first row (for January 2021) and the second row (for February 2021).

In the third row, the running total is calculated by aggregating the values from the first through third rows. This same pattern continues through each row.

year | month | total_sales | running_total |
---|---|---|---|

2021 | 01 | 300 | NULL |

2021 | 02 | 700 | 1000 |

2022 | 01 | 400 | 1400 |

2022 | 02 | 500 | 1900 |

For more information on computing running totals in SQL, check out our article What Is a SQL Running Total and How Do You Compute It?.

### 6. Computing a Moving Average

When looking at sales trends in data analysis, it is often helpful to use a moving average rather than each individual data point. A moving average (also known as a rolling average) calculates the average of the current value and a specified number of immediately preceding values.

This technique helps smooth out data and identify trends, especially when your data has high volatility. The main idea is to examine how these averages behave over time instead of examining the behavior of the original data points.

For example, you might need to analyze daily sales of a restaurant that has high sales on the weekend but low sales Monday through Wednesday. If you were to plot each individual data point, you would see very high values and very low values next to each other, making long-term trends more difficult to see. By using a 3-day moving average, you would take the average of the past 3 days, which evens out the highs and lows.

For this example, we’ll modify our **sales_agg** table.

year | month | total_sales |
---|---|---|

2021 | 01 | 300 |

2021 | 02 | 700 |

2021 | 03 | 500 |

2021 | 04 | 1000 |

2021 | 05 | 800 |

2021 | 06 | 600 |

Now let’s calculate a 3-month moving average of sales:

SELECT year, month, total_sales, AVG(total_sales) OVER(ORDER BY year, month ROWS BETWEEN 2 PRECEDING and CURRENT ROW) AS moving_average FROM aales_agg ORDER BY year, month;

This query calculates the moving average using the `AVG()`

function with the `OVER()`

clause. The `ROWS BETWEEN`

clause specifies the range of rows to include in the moving average calculation – in this case, the current row and the two preceding rows.

The result:

year | month | total_sales | moving_average |
---|---|---|---|

2021 | 01 | 300 | 300 |

2021 | 02 | 700 | 500 |

2021 | 03 | 500 | 500 |

2021 | 04 | 1000 | 733 |

2021 | 05 | 800 | 767 |

2021 | 06 | 600 | 800 |

Since there are no previous rows, the first row’s moving average is just the total sales value. The second row’s moving average is the average of 300 and 700. For the third row, we now have the 2 preceding rows, as defined in our query; the moving average is calculated by averaging 300, 700, and 500. This pattern continues for the rest of the rows.

What a Moving Average Is and How to Compute It in SQL is a great resource for more information on this topic.

### 7. Counting Elements in Custom Categories Using SUM() and CASE WHEN

You can count elements in custom categories by combining `SUM()`

with `CASE WHEN`

. You’d use this when you need to create business logic that doesn’t exist in your data. For example, you might want to group specific locations by custom regions and then compute metrics based on those regions.

`CASE WHEN`

statements allow you to perform conditional logic in queries. The syntax is structured like this:

CASE WHEN condition_1 THEN result_1 WHEN condition_2 THEN result_2 END

The conditions are evaluated from top to bottom. The first condition that evaluates as true determines which result is returned.

In our example, let's create a sales category breakdown based on sales amounts:

SELECT SUM(CASE WHEN total_sales < 400 THEN 1 ELSE 0 END) AS low_sales, SUM(CASE WHEN total_sales >= 400 AND total_sales < 700 THEN 1 ELSE 0 END) AS medium_sales, SUM(CASE WHEN total_sales >= 700 THEN 1 ELSE 0 END) AS high_sales FROM sales_agg

In the above query, we use the `CASE WHEN`

expression to categorize each month into `low_sales`

, `medium_sales`

, or `high_sales`

categories. When the condition in the `CASE WHEN`

expression is met, it returns a value of 1; otherwise, it returns 0.

The `SUM()`

function is then used to add up the 1s and 0s for each category, effectively counting the number of months that fall into each category. The result is a single row with the counts of `low_sales`

, `medium_sales`

, and `high_sales`

months.

low_sales | medium_sales | high_sales |
---|---|---|

1 | 2 | 3 |

This approach allows you to create custom categories and count the elements in each category with a single query. The combination of `SUM()`

and `CASE WHEN`

is versatile and can be adapted to different use cases.

If you want to use this construction in your query, How to Use CASE WHEN with SUM() in SQL will give you more details.

## Power Up Your Data Analysis with Advanced SQL Queries

Mastering advanced SQL queries is essential for efficient and accurate data analysis. In this article, we've covered seven powerful SQL techniques that can help you gain deeper insights into your data.

By learning how to group data by time periods, use `ROLLUP`

for multiple grouping levels, rank data with window functions, compute differences between rows, calculate running totals and moving averages, and count elements in custom categories, you'll be well-equipped to tackle complex data analysis tasks.

As you continue to hone your SQL skills, you'll find even more ways to leverage this powerful tool to unlock the full potential of your datasets and drive better decision-making in your organization.

For more practice and to gain an even deeper level of understanding on these topics, sign up for our SQL Reporting track. It is a great way to master these advanced SQL functions.