3rd Dec 2016 10 minutes read

GROUP BY Clause: How Well Do You Know It?

group by

The database can do different kinds of computations: it can add and subtract, multiply and divide, it can do computations on dates, it can count rows and sum row values, and so on. It can also do quite sophisticated statistics. The GROUP BY clause is the most basic way to compute statistics in SQL. It can be quite tough for beginners but it is really powerful. Let's look at the details of the GROUP BY clause, starting with the basics.

The Basics: How GROUP BY Works

Here we have table medals with Ski Jumping World Cup medalists for the last four seasons.

 country |         person        | season  | place
---------+-----------------------+---------+-------
Norway   | Anders Bardal         | 2011-12 |   1
Austria  | Gregor Schlierenzauer | 2011-12 |   2
Austria  | Andreas Kofler        | 2011-12 |   3
Austria  | Gregor Schlierenzauer | 2012-13 |   1
Norway   | Anders Bardal         | 2012-13 |   2
Poland   | Kamil Stoch           | 2012-13 |   3
Poland   | Kamil Stoch           | 2013-14 |   1
Slovenia | Peter Prevc           | 2013-14 |   2
Germany  | Severin Freund        | 2013-14 |   3
Germany  | Severin Freund        | 2014-15 |   1
Slovenia | Peter Prevc           | 2014-15 |   2
Austria  | Stefan Kraft          | 2014-15 |   3

I want to find out how many medals Poland got:

SELECT count(*)
FROM medals
WHERE country = 'Poland';

If I wanted to find out the number of medals for Germany, I would have to issue this query:

SELECT count(*)
FROM medals
WHERE country = 'Germany';

If I wanted to find out the number of medals for each country, I could ask six similar queries. Or I could use a GROUP BY.

SELECT country, count(*)
FROM medals
GROUP BY country;

The GROUP BY clause comes right after the WHERE clause in SQL query. Here, the WHERE clause is missing, so it's right after FROM.

The result:

 country | count
---------+-------
Poland   |   2
Germany  |   2
Austria  |   4
Norway   |   2
Slovenia |   2

With the GROUP BY query the database divides data into groups. Rows with the same GROUP BY column (country in the example) are put into one group. So using our example, medalists from Poland are put into one group, medalists from Germany are put into another group and so on. Here is the grouping that we get for this query:

 country |        person         | season  | place
–--------+-----------------------+---------+-------
Poland   | Kamil Stoch           | 2012-13 |   3
         | Kamil Stoch           | 2013-14 |   1
–--------+-----------------------+---------+-------
Germany  | Severin Freund        | 2013-14 |   3
         | Severin Freund        | 2014-15 |   1
–--------+-----------------------+---------+-------
Austria  | Gregor Schlierenzauer | 2012-13 |   1
         | Stefan Kraft          | 2014-15 |   3
         | Gregor Schlierenzauer | 2011-12 |   2
         | Andreas Kofler        | 2011-12 |   3
–--------+-----------------------+---------+-------
Norway   | Anders Bardal         | 2012-13 |   2
         | Anders Bardal         | 2011-12 |   1
–--------+-----------------------+---------+-------
Slovenia | Peter Prevc           | 2013-14 |   2
         | Peter Prevc           | 2014-15 |   2
–--------+-----------------------+---------+-------

With GROUP BY the aggregates (count, sum, avg, min, max, and others) are computed for each separately. In the example the database counts the number of rows in each group.

Grouping With Multiple Columns

You can group rows by more than one column.

For example, if you want to find out how many medals each country got in each season, your query would look like this:

SELECT country, season, count(*)
FROM medals
GROUP BY country, season;

Rows with the same country and the same season are put into one group. The grouping looks like this:

 country | season  |        person         | place
–--------+---------+-----------------------+-------
Poland   | 2012-13 | Kamil Stoch           |   3
–--------+---------+-----------------------+-------
Poland   | 2013-14 | Kamil Stoch           |   1
–--------+---------+-----------------------+-------
...
–--------+---------+-----------------------+-------
Austria  | 2011-12 | Gregor Schlierenzauer |   2
         | 2011-12 | Andreas Kofler        |   3
–--------+---------+-----------------------+------

The final result:

 country | season  | count
---------+---------+-------
 Poland  | 2012-13 |   1
 Austria | 2011-12 |   2
 ...
 Poland  | 2013-14 |   1

NULL Values in GROUP BY

A reminder: In the WHERE condition no two NULLs are considered equal. As strange as this might seem, the query

SELECT *
FROM medals
WHERE place = place;

will select all rows except those with NULL place. For SQL the NULL value means "Unknown" and if it's unknown, SQL cannot assume that it knows for sure what their result will be. (In particular, it can't know for sure that the result is TRUE.)

With GROUP BY, it's different. Rows with NULL values all go into one group, and the aggregates are computed for this group, as for any other. It also works for multiple-column GROUP BYs.

For this table:

    country    |   city  |    person    | earnings
---------------+---------+--------------+----------
 NULL          | Warsaw  | John Doe     |   1000
 United States | NULL    | Maria Jane   |   1000
 Germany       | Berlin  | Hans Schmitt |   2430
 United States | NULL    | Bill Noir    |   1000
 United States | Chicago | Rob Smith    |   3000
 NULL          | Warsaw  | Sophie Doe   |   2000
 Germany       | Berlin  | Jane Dahl    |   1500

the query

SELECT country, city, sum(earnings)
FROM employees
GROUP BY country, city;

renders these groups:

    country    |   city  |    person    | earnings
---------------+---------+--------------+----------
 NULL          | Warsaw  | John Doe     |   1000
               |         | Sophie Doe   |   2000
–--------------------------------------------------
 United States | NULL    | Maria Jane   |   1000
               |         | Bill Noir    |   1000
–--------------------------------------------------
 United States | Chicago | Rob Smith    |   3000
–--------------------------------------------------
 Germany       | Berlin  | Hans Schmitt |   2430
               |         | Jane Dahl    |   1500

and this result

    country    |   city  |  sum
---------------+---------+-------
 NULL          | Warsaw  |  3000
 United States | NULL    |  2000
 United States | Chicago |  3000
 Germany       | Berlin  |  3930

Myth: Select Columns Must Appear in the GROUP BY Clause or Under Aggregate Function?

Common wisdom says that selected columns in a GROUP BY query must either appear in the GROUP BY clause or under an aggregate function. So this query is incorrect:

SELECT user_account.id, email, count(*)
FROM user_account JOIN address
ON user_account.id = address.user_id
GROUP BY email;

The query groups results by email, but it selects the id column, which is not in GROUP BY clause.

This wisdom was the rule in the SQL92 standard. This is how many databases implement the GROUP BY behavior today. You will get an error similar to this one:

ERROR:  column "user_account.id" must appear in the GROUP BY clause or be used 
in an aggregate function

Where does this rule come from? Let's look at example data.

|       user_account      |   |         address         |
+----+--------------------+   +----+----------+---------+
| id |        Email       |   | id |   city   | user_id |
+----+--------------------+   +----+----------+---------+
| 1  | john@example.com   |   | 1  | London   | 1       |
| 2  | mary@example.co.uk |   | 2  | Brussels | 2       |
| 3  | john@example.com   |   | 3  | Cairo    | 3       |
|    |                    |   | 4  | Dublin   | 1       |

We group the data by email

 
user_account.email |user_account.id |address.id|address.city|address.user_id|
-------------------+----------------+----------+------------+---------------+
john@example.com   | 1              | 1        | A          | 1             |
                   +----------------+----------+------------+---------------+
                   | 1              | 4        | D          | 1             |
                   +----------------+----------+------------+---------------+
                   | 3              | 3        | C          | 3             |
-------------------+----------------+----------+------------+---------------+
mary@example.com   | 2              | 2        | B          | 2             |

The database creates group for each email. But there are multiple user_account ids in each group. The database doesn't know which id it should return. The SQL standard wants the SQL result to be deterministic so it forbids you to execute a query like this.

The SQL99 standard has modified the wording of the rule. It now says that any column which appears under SELECT has to appear under the aggregate function or be functionally dependent on columns in the GROUP BY clause. The rule no longer expects to repeat all non-aggregated columns to be repeated in the GROUP BY clause.

What does functionally dependent columns in the BY by clause mean? It means: if I fix values for columns in the GROUP BY clause, there has to be only one value for the other column. For example, the email address determines the value of the name of its owner. But there is a catch: the database has to know about this dependency. In the context of databases, the dependency means either primary keys and unique keys. If I group by a primary key, then I know that other columns in this table have fixed values.

Our initial example is still not valid under the new rule. BUT: if I enforce the unique constraint on the email column, the query becomes valid under the new rule. If the email column is unique in the user_account table, then fixing the email value determines all other columns in the user_account table. Of course, if I add the unique constraint, my example data are invalid as well. I can't have two different rows with the same email.

Do Databases Support the New GROUP BY Rule?

Some do, some don't. The new rule is in the SQL99 standard. MySQL from version 5.7.4 supports the new behaviour. So is Postgres from version 9.1. Postgres treats column as functionally dependent on the grouped columns if a subset of the grouped columns is a primary key of the table from which the column comes from.

As far as I know, Oracle and SQL Server still stick with the old version.

Should you use the new version or the old version of the rule in your queries? In practice, this modification doesn't really change anything. Both rules guarantee that whenever you select an unaggregated column in a GROUP BY query, it's value is unambigous in each group. The old rule forces you to add this column in the GROUP BY clause, but this GROUP BY does not change the semantics of the query. The result is the same, you just have to type a little more with the new rule.

In general, you're better off with sticking to the old rule. Your queries will run on most databases. But it's good to know that you don't have to.

Extensions to GROUP BY

SQL-99 added ROLLUP, and CUBE and GROUPING SETS as options for SELECT statements.

GROUP BY ROLLUP

The syntax for ROLLUP is

SELECT <columns>
FROM <tables>
WHERE <condition>
GROUP BY ROLLUP (<group-by columns>);

Using ROLLUP (a,b,c) will generate GROUP BY clauses: (a, b, c), (a, b), (a) and a row for an aggregation of all selected rows. It is equivalent to four SELECT queries with various GROUP BY clauses.

For this table

department | year | sales
–----------+--------------
 IT        | 2012 |  25000
 IT        | 2013 |  26000
 Retail    | 2012 |  35000
 Retail    | 2013 |  15000
 IT        | 2014 |  18000

and this query

SELECT department, year, sum(sales)
FROM sales
GROUP BY ROLLUP (department, year);

we get the result:

 department | year |   sum   
------------+------+--------
 IT         | 2012 |   25000
 IT         | 2014 |   18000
 IT         | 2013 |   26000
 IT         | NULL |   69000 <-  group by department
 Retail     | 2013 |   15000
 Retail     | 2012 |   35000
 Retail     | NULL |   50000 <-  group by department
 NULL       | NULL |  119000 <-  group by (), i.e. all rows selected

The additional rows are sometimes called superaggregates.

ROLLUP is supported by SQL Server, Oracle, DB2.

In MySQL you can use the WITH ROLLUP syntax:

SELECT <columns>
FROM <tables>
WHERE <condition>
GROUP BY a,b,c WITH ROLLUP;

PostgreSQL does not support ROLLUP.

GROUP BY CUBE

The syntax for CUBE is

SELECT <columns>
FROM <tables>
WHERE <condition>
GROUP BY CUBE (a, b, c);

It works similarly to ROLLUP but it generates all possible combinations of columns: (a,b,c), (a,b), (a,c), (b,c), (a), (b), (c) and a row for all rows selected.

The query

SELECT department, year, sum(sales)
FROM sales
GROUP BY CUBE (department, year);

will render this result:

 department | year |   sum   
------------+------+--------
 IT         | 2012 |   25000
 IT         | 2014 |   18000
 IT         | 2013 |   26000
 IT         | NULL |   69000  <-  group by department
 Retail     | 2013 |   15000
 Retail     | 2012 |   35000
 Retail     | NULL |   50000  <-  group by department
 NULL       | NULL |  119000  <-  group by ()
 NULL       | 2014 |   18000 |
 NULL       | 2012 |   60000 | <= three new rows added by CUBE
 NULL       | 2013 |   41000 |

CUBE is supported by SQL Server and Oracle, and DB2. MySQL and Postgres do not support it.

GROUP BY GROUPING SETS

GROUPING SETS works similarly, but it allows to specify which column combinations must be used in the result. Grouping sets have to be separated with commas. If there is more than one column in a grouping set, this grouping set should be put in parentheses. Empty parentheses mean the general record with aggregates for the whole set.

Example query:

SELECT <columns>
FROM <tables>
WHERE <condition>
GROUP BY GROUPING SETS ((a, b), c, ());

GROUPING SETS is supported by SQL Server and Oracle, and DB2. MySQL and Postgres do not support it.

Tags:

group by