10th Jun 2017 4 minutes read Useful SQL Patterns: Conditional Summarization with CASE Aldo Zelen aggregate functions CASE conditional summarization SQL patterns As you start coding in SQL, you will use some statements and techniques over and over again. We call these "SQL patterns". This series will look at the most common SQL patterns and consider how to use them. Previously, we looked at the SQL pattern of matching NULLs. This is important when you are comparing columns containing NULL values. Today, we're going to consider another SQL practice: conditional summarization with CASE operator. What is Conditional Summarization? When you are using aggregate functions to create report queries, you'll frequently find yourself using conditional summarization with the CASE operator. Remember that CASE returns a value based on defined criteria. (For more about the CASE expression, see this post and this one.) When you perform a summarization with CASE, you're simply adding up (summarizing) values that meet the CASE expression. Obviously, you'll be using the SUM part of the query to aggregate these values. I know this sounds complicated, but it is not. Let's use a simple example to explain it. We'll start by considering a basic transactions table that contains transactional data from a small company. The transactions table has these columns: id – A unique identifier for each transaction datetime – The timestamp for the transaction customer – The customer's ID creditcard – The ID of the type of credit card used amount – The transaction amount, in dollars account – The customer's account number type – The transaction type Here's the data we'd find in a table like this: datetime customer creditcard amount account type 2017-01-01 00:00:00.000000 1 1 100 1 type_1 2017-03-01 00:00:00.000000 2 1 350 1 type_1 2017-05-01 00:00:00.000000 3 1 10 1 type_3 2017-02-01 00:00:00.000000 2 1 10 1 type_2 2017-05-01 00:00:00.000000 2 1 10 1 type_1 2017-04-01 00:00:00.000000 3 1 600 1 type_3 2017-01-01 00:00:00.000000 3 1 350 1 type_3 2017-03-01 00:00:00.000000 1 1 150 1 type_1 2017-04-01 00:00:00.000000 1 1 200 1 type_1 2017-02-01 00:00:00.000000 1 1 50 1 type_2 2017-05-01 00:00:00.000000 1 1 210 1 type_2 2017-04-01 00:00:00.000000 2 1 600 1 type_3 2017-01-01 00:00:00.000000 2 1 100 1 type_1 We want to find the sum of transaction amounts and the number of transactions completed before April 1st. Furthermore, we wanted these listed by individual customer. We could achieve this using the following query : SELECT customer, SUM( CASE WHEN datetime >= TIMESTAMP '2017-04-01' THEN amount ELSE 0 END ) AS sum_amount_after, SUM(CASE WHEN datetime >= TIMESTAMP '2017-04-01' THEN 1 ELSE 0 END) AS transaction_count_after, SUM(CASE WHEN datetime < TIMESTAMP '2017-04-01' THEN amount ELSE 0 END) AS sum_amount_prior, SUM(CASE WHEN datetime < TIMESTAMP '2017-04-01' THEN 1 ELSE 0 END) AS transaction_count_prior FROM transactions WHERE datetime BETWEEN TIMESTAMP '2017-01-01' AND '2017-05-01' GROUP BY customer The result of this query is: customer sum_amount _after transaction _count_after sum _amount_prior transaction _count_prior 2 610 2 460 3 1 410 2 300 3 3 610 2 350 1 Wait! How Did That Work? This long query can be confusing, so let's break it down a little. We will concentrate first on the part dealing with the April 1 cutoff (2017-04-01). 👾 [NEW] What happens when you combine #CASE with #SQL's #data modifying statements? Find out!! ➽ https://t.co/7IuBoDT85z ... pic.twitter.com/wNiDiv5hSa — Vertabelo (@Vertabelo) June 6, 2017 Below, we are looking at the transaction amount for customer '1'. Any transaction amounts that posted before 01.04.2017 will be set to "0". We will name this column amount_after. SELECT customer, datetime, CASE WHEN datetime >= TIMESTAMP '2017-04-01' THEN amount ELSE 0 END AS amount_after, amount FROM transactions WHERE datetime BETWEEN TIMESTAMP '2017-01-01' AND '2017-05-01' AND customer = '1' ORDER BY datetime; customer datetime amount_after amount 1 2017-01-01 00:00:00.000000 0 100 1 2017-02-01 00:00:00.000000 0 50 1 2017-03-01 00:00:00.000000 0 150 1 2017-04-01 00:00:00.000000 200 200 1 2017-05-01 00:00:00.000000 210 210 Alternately, we can replace the zeros shown in the results with a NULL in the ELSE statement: SELECT customer, datetime, CASE WHEN datetime >= TIMESTAMP '2017-04-01' THEN amount ELSE null END AS amount_after, amount FROM transactions WHERE datetime BETWEEN TIMESTAMP '2017-01-01' AND '2017-05-01' AND customer = '1' ORDER BY datetime; customer datetime amount_after amount 1 2017-01-01 00:00:00.000000 100 1 2017-02-01 00:00:00.000000 50 1 2017-03-01 00:00:00.000000 150 1 2017-04-01 00:00:00.000000 200 200 1 2017-05-01 00:00:00.000000 210 210 Now if we summarize these columns, we will get: The total of all transactions for sum(amount). The sum of all transactions that posted after 01.04. Any transactions that posted before 01.04 are set to zero (or NULL) for sum(amount_after). If we want to count how many transactions were posted after 01.04, we can modify the query and create a COUNT statement that uses the same CASE with NULL in the ELSE. SELECT customer, count(CASE WHEN datetime >= TIMESTAMP '2017-04-01' THEN amount ELSE NULL END) AS count_after FROM transactions WHERE datetime BETWEEN TIMESTAMP '2017-01-01' AND '2017-05-01' GROUP BY customer ORDER BY customer; customer count_after 1 2 2 2 3 2 Note: This query is really fast, as the RDBMS only needs to access one table. Building aggregation queries that use just one table is a good way to get results quickly. Try using CASE with a zero in the COUNT statement. What is the result and why? Tell us in the comments section below. Tags: aggregate functions CASE conditional summarization SQL patterns