10th Jun 2017 4 minutes read

Useful SQL Patterns: Conditional Summarization with CASE

As you start coding in SQL, you will use some statements and techniques over and over again. We call these "SQL patterns". This series will look at the most common SQL patterns and consider how to use them.

Previously, we looked at the SQL pattern of matching NULLs. This is important when you are comparing columns containing NULL values. Today, we're going to consider another SQL practice: conditional summarization with CASE operator.

What is Conditional Summarization?

When you are using aggregate functions to create report queries, you'll frequently find yourself using conditional summarization with the CASE operator. Remember that CASE returns a value based on defined criteria. (For more about the CASE expression, see this post and this one.) When you perform a summarization with CASE, you're simply adding up (summarizing) values that meet the CASE expression.

Obviously, you'll be using the SUM part of the query to aggregate these values. I know this sounds complicated, but it is not. Let's use a simple example to explain it.

We'll start by considering a basic transactions table that contains transactional data from a small company.

The transactions table has these columns:

id – A unique identifier for each transaction
datetime – The timestamp for the transaction
customer – The customer's ID
creditcard – The ID of the type of credit card used
amount – The transaction amount, in dollars
account – The customer's account number
type – The transaction type

Here's the data we'd find in a table like this:

datetime	customer	creditcard	amount	account	type
2017-01-01 00:00:00.000000	1	1	100	1	type_1
2017-03-01 00:00:00.000000	2	1	350	1	type_1
2017-05-01 00:00:00.000000	3	1	10	1	type_3
2017-02-01 00:00:00.000000	2	1	10	1	type_2
2017-05-01 00:00:00.000000	2	1	10	1	type_1
2017-04-01 00:00:00.000000	3	1	600	1	type_3
2017-01-01 00:00:00.000000	3	1	350	1	type_3
2017-03-01 00:00:00.000000	1	1	150	1	type_1
2017-04-01 00:00:00.000000	1	1	200	1	type_1
2017-02-01 00:00:00.000000	1	1	50	1	type_2
2017-05-01 00:00:00.000000	1	1	210	1	type_2
2017-04-01 00:00:00.000000	2	1	600	1	type_3
2017-01-01 00:00:00.000000	2	1	100	1	type_1

We want to find the sum of transaction amounts and the number of transactions completed before April 1st. Furthermore, we wanted these listed by individual customer. We could achieve this using the following query :

	SELECT
 customer,
 SUM(
 CASE WHEN datetime >= TIMESTAMP '2017-04-01'
 THEN amount
 ELSE 0
 END
 )        AS sum_amount_after,
 SUM(CASE WHEN datetime >= TIMESTAMP '2017-04-01'
 THEN 1
 ELSE 0
 END) AS transaction_count_after,
 SUM(CASE WHEN datetime < TIMESTAMP '2017-04-01'
 THEN amount
 ELSE 0
 END) AS sum_amount_prior,
 SUM(CASE WHEN datetime < TIMESTAMP '2017-04-01'
 THEN 1
 ELSE 0
 END) AS transaction_count_prior
 FROM transactions
 WHERE datetime BETWEEN TIMESTAMP '2017-01-01'  AND '2017-05-01'
 GROUP BY customer

The result of this query is:

customer	sum_amount _after	transaction _count_after	sum _amount_prior	transaction _count_prior
2	610	2	460	3
1	410	2	300	3
3	610	2	350	1

Wait! How Did That Work?

This long query can be confusing, so let's break it down a little. We will concentrate first on the part dealing with the April 1 cutoff (2017-04-01).

👾 [NEW] What happens when you combine #CASE with #SQL's #data modifying statements? Find out!! ➽ https://t.co/7IuBoDT85z ... pic.twitter.com/wNiDiv5hSa

— Vertabelo (@Vertabelo) June 6, 2017

Below, we are looking at the transaction amount for customer '1'. Any transaction amounts that posted before 01.04.2017 will be set to "0". We will name this column amount_after.

SELECT
 customer,
 datetime,
 CASE WHEN datetime >= TIMESTAMP '2017-04-01'
  THEN amount
 ELSE 0
 END
  AS amount_after,
 amount
FROM transactions
WHERE datetime BETWEEN TIMESTAMP '2017-01-01' AND '2017-05-01'
      AND customer = '1'
ORDER BY datetime;

customer	datetime	amount_after	amount
1	2017-01-01 00:00:00.000000	0	100
1	2017-02-01 00:00:00.000000	0	50
1	2017-03-01 00:00:00.000000	0	150
1	2017-04-01 00:00:00.000000	200	200
1	2017-05-01 00:00:00.000000	210	210

Alternately, we can replace the zeros shown in the results with a NULL in the ELSE statement:

SELECT
 customer,
 datetime,
 CASE WHEN datetime >= TIMESTAMP '2017-04-01'
  THEN amount
 ELSE null
 END
  AS amount_after,
 amount
FROM transactions
WHERE datetime BETWEEN TIMESTAMP '2017-01-01' AND '2017-05-01'
      AND customer = '1'
ORDER BY datetime;

customer	datetime	amount_after	amount
1	2017-01-01 00:00:00.000000		100
1	2017-02-01 00:00:00.000000		50
1	2017-03-01 00:00:00.000000		150
1	2017-04-01 00:00:00.000000	200	200
1	2017-05-01 00:00:00.000000	210	210

Now if we summarize these columns, we will get:

The total of all transactions for sum(amount).
The sum of all transactions that posted after 01.04. Any transactions that posted before 01.04 are set to zero (or NULL) for sum(amount_after).

If we want to count how many transactions were posted after 01.04, we can modify the query and create a COUNT statement that uses the same CASE with NULL in the ELSE.

SELECT
  customer,
  count(CASE WHEN datetime >= TIMESTAMP '2017-04-01'
    THEN amount
        ELSE NULL
        END)
    AS count_after
FROM transactions
WHERE datetime BETWEEN TIMESTAMP '2017-01-01' AND '2017-05-01'
GROUP BY customer
ORDER BY customer;

customer	count_after
1	2
2	2
3	2

Note: This query is really fast, as the RDBMS only needs to access one table. Building aggregation queries that use just one table is a good way to get results quickly.

Try using CASE with a zero in the COUNT statement. What is the result and why? Tell us in the comments section below.

Tags:

What is Conditional Summarization?

Wait! How Did That Work?

You may also like