26th Jun 2017 11 minutes read

A Beginner’s Guide to SQL Aggregate Functions

Aggregate functions are powerful SQL tools that compute numerical calculations on data, allowing the query to return summarized information about a given column or result set. These functions can be used in conjunction with the GROUP BY statement. Let’s see how they work using some easy examples.

SQL Aggregate Functions

Suppose we have users residing in a city, and we store their information in two tables. These tables and their relationship are shown below:

Let’s put some data into this model:

INSERT INTO `cities` VALUES (1,'Miami'),(2,'Orlando'),
(3,'Las Vegas'),(4,'Coyote Springs');
INSERT INTO `users` VALUES (1,1,'Jhon','Doe',22),
(2,1,'Albert','Thomson',15),(3,2,'Robert','Ford',65),(4,3,'Samantha','Simpson',9),(5,2,'Carlos','Bennet',42),
(6,2,'Mirtha','Lebrand',81),(7,3,'Alex','Gomez',31);

Now we have enough material to explain the basic use of SQL aggregate functions. Let’s get started with a simple one.

MIN

This returns the smallest value in a given column. For example, let’s get the minimum age from our group of users:

SELECT MIN(age) FROM users;

This returns a “9”.

You can use this function to find alphabetical information, too. Let´s try it with the “last_name” column:

SELECT MIN(last_name) FROM users;

This returns “Bennet”, which is first in alphabetical order.

Note: The collation used to sort your data will impact the results of this query. For example, in the Danish collation, “A” is treated as “A” – the last letter in the alphabet. The Latin collation, of course, treats “A” as the first letter of the alphabet.

MAX

In a similar yet opposite fashion to MIN, MAX returns the largest value in a column. Let’s get the maximum age from our user list:

SELECT MAX(age) FROM users;

This will return an “81”.

Let´s try the same with the last name column:

SELECT MAX(last_name) FROM users;

It returns “Thomson”, which is the last one in alphabetical order. Remember, this may change depending on the collation you are using.

SUM

This function calculates the sum of all the numeric values in a column. Let’s use it to retrieve the sum of all ages in the table:

SELECT SUM(age) FROM users;

It will return “265”.

AVG

This is used to calculate the average value for a column. Let’s see it in action, retrieving the average age of our users:

SELECT AVG(age) FROM users;

It returns a “27.75”.

COUNT (column)

This returns the number of non-NULL values in a given column. If we wanted to know how many users have told us their age, we would write:

SELECT COUNT(age) FROM users;

It will return a “7”. All records in the “users” table have an age value. If a record didn’t have an age value, it would have been NULL (and not included in the COUNT result). If you want to count the actual number of table rows regardless of the column value, then the COUNT(*) function is what you need. Instead of specifying a column name as the function argument, we use an asterisk:

SELECT COUNT(*) FROM users;

In our case, this still returns a “7” because the table has seven records.

NULL values can be confusing, but don’t worry. We’ll show how you can deal with NULL values in SQL aggregate functions later in this article.

Now that you know what these aggregate functions do, let’s see how to make them even more useful.

Do you want to know more about SQL JOINs? Watch an episode of our We Learn SQL series on Youtube. Check if you already know everything about the different types of JOINs.

Using GROUP BY with Aggregate Functions

The GROUP BY statement lets us perform aggregations on a group of values based on given columns. You’ll use GROUP BY with aggregate functions frequently, so these examples will be a little more complicated and realistic than the simple ones we used before.

Conceptually, “GROUP BY (column_x)” means “put all records that share the same value in “column_x” into one group”. Let’s see how this works with each of the functions we’ve already discussed.

MIN + GROUP BY

Suppose that we want to know our youngest user’s age in each city. We can see that the “users” table has a column called “city_id” that identifies the city where each user lives. We can use this column with a GROUP BY statement to see the youngest age in each city:

SELECT 
    city_id, MIN(age)
FROM
    users
GROUP BY city_id;

To better understand what is going on, look at the raw data in the “users” table:

id	city_id	first_name	last_name	age
1	1	John	Doe	22
2	1	Albert	Thomson	15
3	2	Robert	Ford	65
4	3	Samantha	Simpson	9
5	2	Carlos	Bennet	42
6	2	Mirtha	Lebrand	81
7	3	Alex	Gomez	31

Using MIN() with GROUP BY will group records by the values in the “city_id” column before computing the aggregate for each group. If you could see the GROUPed values, it would look something like this:

id	city_id	first_name	last_name	age

1	1	John	Doe	22
2	1	Albert	Thomson	15

3	2	Robert	Ford	65
5	2	Carlos	Bennet	42
6	2	Mirtha	Lebrand	81

4	3	Samantha	Simpson	9
7	3	Alex	Gomez	31

Then the query will grab the lowest ages within each group. If we could see this step in action, it would look like this:

city_id	age

1	22
1	15

2	65
2	42
2	81

3	9
3	31

The highlighted values represent the MIN() computed values for each group.

Finally, the query will show the following results:

city_id	MIN(age)
1	15
2	42
3	9

MAX + GROUP BY

As you may have already guessed, using MAX combined with GROUP BY works in the same way as MIN. It simply returns the greatest value for each group. We could compute the maximum age for each city in a similar fashion:

SELECT 
    city_id, MAX(age)
FROM
    users
GROUP BY city_id;

This query will group users based on their “city_id” field and then grab the maximum age value for each group. It gives us the following results:

city_id	MAX(age)
1	22
2	81
3	31

SUM + GROUP BY

We could also calculate the sum of user ages in each city. For that, we can run the following query…

SELECT 
    city_id, SUM(age)
FROM
    users
GROUP BY city_id;

… which will retrieve:

city_id	SUM(age)
1	37
2	188
3	40

COUNT + GROUP BY

We may also want to calculate the number of users in each city. It’s easy to do this using the COUNT function:

SELECT 
    city_id, COUNT(age)
FROM
    users
GROUP BY city_id;

city_id	COUNT(age)
1	2
2	3
3	2

Based on the last two examples, we could calculate the average age in each city by dividing the sum of all ages in a city group by the number of users for that city. Here’s how we do it:

SELECT 
    city_id,
    SUM(age),
    COUNT(age),
    SUM(age) / COUNT(age) as average
FROM
    users
GROUP BY city_id;

city_id	SUM(age)	COUNT(age)	average
1	37	2	18.5000
2	188	3	62.6667
3	40	2	20.0000

Note: We could have also used COUNT(*) here, since there are no records with NULL values in the “age” column. In this case, COUNT(age) works the same as COUNT(*). Otherwise the value would differ, as we will explain later in the “Dealing with NULLs” section.

AVG + GROUP BY

In the previous example, we “manually” calculated the average age for each city. We could use the AVG() function to perform this operation for us, as shown below:

SELECT 
    city_id,
    AVG(age)
FROM
    users
GROUP BY city_id;

city_id	AVG(age)
1	18.5000
2	62.6667
3	20.0000

The value returned by AVG(age) is identical to the result from the math operation we performed earlier.

Filtering Grouped Results

There are times when you need to further filter results based on conditions generated by aggregated results. Adding any conditions in the WHERE would fail. Don’t believe me? See what happens when we try to retrieve the number of users only in cities where the average user age is greater than 20:


SELECT 
    city_id, COUNT(age), AVG(age)
FROM
    users
WHERE AVG(age) >= 20
GROUP BY city_id;

This will make the engine (in my case, MySQL) complain. It will say something like this:

Error Code: 1111. Invalid use of group function

To filter results in this way, we need to use the HAVING clause. HAVING will filter summarized GROUP BY results; the WHERE clause only applies to individual records. If any group doesn’t meet the criteria in the HAVING clause, they will not be returned.

So if we wanted to get the COUNT for each city with an average age of at least 20, the proper way to handle the filtering would be like this:

SELECT 
    city_id, COUNT(age), AVG(age)
FROM
    users
GROUP BY city_id
HAVING AVG(age) >= 20

city_id	COUNT(age)	AVG(age)
2	3	62.6667
3	2	20.0000

Note: The group with city_id = 1 is discarded, as its AVG(age) is 18.5.

Lastly, WHERE and HAVING can be used simultaneously without any issues (if it makes sense in your query, of course). Notice how this works in the next example. We’ll perform the calculation again, but this time we’ll exclude any user whose last name is “Simpson”:

SELECT 
    city_id, COUNT(age), AVG(age)
FROM
    users
WHERE last_name <> 'Simpson'
GROUP BY city_id
HAVING AVG(age) >= 20

city_id	COUNT(age)	AVG(age)
2	3	62.6667
3	1	31.0000

Notice that the COUNT and AVG calculation differs for city_id = 3. There is a user with the last name “Simpson” for city_id = 3, and that record was discarded due to the WHERE last_name <> 'Simpson’ condition.

Dealing with NULLs

So far all our columns have been filled with data, but that may not always be the case. Let’s insert some records with NULL ages to explain some edge cases. The following two INSERTs will add two new users, with an ID of 8 and 9 respectively, that have a NULL in the age column:

INSERT INTO `users` (`id`, `city_id`, `first_name`, `last_name`, `age`)
VALUES ('8', '2', 'Frederic', 'Scott',NULL);

INSERT INTO `users` (`id`, `city_id`, `first_name`, `last_name`, `age`) 
VALUES ('9', '4', 'Stacy', 'Roberts',NULL);

Records with null values are not considered in MIN, MAX, SUM, AVG and COUNT(column). Those records are simply ignored in the calculation. A general rule of thumb is if you don’t expect to have null values or if you expect to treat NULLs as “0”, then you should set the columns as NOT NULL and set the NULL columns to “0” or whatever value makes sense for your use case.

COUNT(column) vs COUNT(*)

The COUNT(column) function will not count any records with NULL age values. Let’s see this in action:

SELECT 
    COUNT(age)
FROM
    users;

This query returns a “7” again; the two records we added have NULL values in the “age” column, so they are ignored. If we wanted to count every record regardless of the value, we use the COUNT(*) function:

SELECT 
    COUNT(*)
FROM
    users;

This returns the expected result of “9”.

Let’s see how NULLs impact the AVG() function.

AVG + NULL

Due to the fact that AVG will ignore records with NULLs in the specified column, the resulting value may not make sense. Here’s why.

The AVG(age) function will add only the users with a non-null value in the “age” column and divide that number against the COUNT of users that also have a non-null “age” value. This is 7 in our example. If you consider that the two users we just added with NULL ages should be considered for the average age calculation, then the value you will get with AVG(age) would be wrong. The query below will show the difference in the calculations:

SELECT 
    SUM(age),
    COUNT(age),
    AVG(age),
    SUM(age) / COUNT(age),
    COUNT(*),
    SUM(age) / COUNT(*)
FROM
    users;

SUM(age)	COUNT(age)	AVG(age)	SUM(age) / COUNT(age)	COUNT(*)	SUM(age) / COUNT(*)
265	7	37.8571	37.8571	9	29.4444

Notice how the AVG(age) value matches the value calculated using SUM(age)/COUNT(age); both functions don’t consider records with NULL values. But take a look at how the average value changes when using COUNT(*), which includes all records.

We have seen that NULL values will not get computed in MIN, MAX, AVG, SUM and COUNT(column_name) functions. If you are going to use these functions and you expect some NULL fields, make sure to set NULLs as some specific value.

Also, set the column types consistently to the value you are storing. For instance, numbers stored in a VARCHAR column will not be treated as numerical and may cause undesired results in your aggregated function computations.

Finally, I’d like to emphasize the importance of putting your filtering conditions in the right places inside your queries. Use HAVING if you need conditions based on aggregated values.

What do you think about SQL aggregate functions? Do you have any experiences or examples about their use? Please share your thoughts with our community. And don’t miss the second part of this article, where I will explain how to integrate SQL aggregate functions with JOINs. We’ll be posting it on the Vertabelo blog soon!

Tags: