3rd Dec 2016 10 minutes read GROUP BY Clause: How Well Do You Know It? Agnieszka Kozubek-Krycuń group by The database can do different kinds of computations: it can add and subtract, multiply and divide, it can do computations on dates, it can count rows and sum row values, and so on. It can also do quite sophisticated statistics. The GROUP BY clause is the most basic way to compute statistics in SQL. It can be quite tough for beginners but it is really powerful. Let's look at the details of the GROUP BY clause, starting with the basics. The Basics: How GROUP BY Works Here we have table medals with Ski Jumping World Cup medalists for the last four seasons. country | person | season | place ---------+-----------------------+---------+------- Norway | Anders Bardal | 2011-12 | 1 Austria | Gregor Schlierenzauer | 2011-12 | 2 Austria | Andreas Kofler | 2011-12 | 3 Austria | Gregor Schlierenzauer | 2012-13 | 1 Norway | Anders Bardal | 2012-13 | 2 Poland | Kamil Stoch | 2012-13 | 3 Poland | Kamil Stoch | 2013-14 | 1 Slovenia | Peter Prevc | 2013-14 | 2 Germany | Severin Freund | 2013-14 | 3 Germany | Severin Freund | 2014-15 | 1 Slovenia | Peter Prevc | 2014-15 | 2 Austria | Stefan Kraft | 2014-15 | 3 I want to find out how many medals Poland got: SELECT count(*) FROM medals WHERE country = 'Poland'; If I wanted to find out the number of medals for Germany, I would have to issue this query: SELECT count(*) FROM medals WHERE country = 'Germany'; If I wanted to find out the number of medals for each country, I could ask six similar queries. Or I could use a GROUP BY. SELECT country, count(*) FROM medals GROUP BY country; The GROUP BY clause comes right after the WHERE clause in SQL query. Here, the WHERE clause is missing, so it's right after FROM. The result: country | count ---------+------- Poland | 2 Germany | 2 Austria | 4 Norway | 2 Slovenia | 2 With the GROUP BY query the database divides data into groups. Rows with the same GROUP BY column (country in the example) are put into one group. So using our example, medalists from Poland are put into one group, medalists from Germany are put into another group and so on. Here is the grouping that we get for this query: country | person | season | place –--------+-----------------------+---------+------- Poland | Kamil Stoch | 2012-13 | 3 | Kamil Stoch | 2013-14 | 1 –--------+-----------------------+---------+------- Germany | Severin Freund | 2013-14 | 3 | Severin Freund | 2014-15 | 1 –--------+-----------------------+---------+------- Austria | Gregor Schlierenzauer | 2012-13 | 1 | Stefan Kraft | 2014-15 | 3 | Gregor Schlierenzauer | 2011-12 | 2 | Andreas Kofler | 2011-12 | 3 –--------+-----------------------+---------+------- Norway | Anders Bardal | 2012-13 | 2 | Anders Bardal | 2011-12 | 1 –--------+-----------------------+---------+------- Slovenia | Peter Prevc | 2013-14 | 2 | Peter Prevc | 2014-15 | 2 –--------+-----------------------+---------+------- With GROUP BY the aggregates (count, sum, avg, min, max, and others) are computed for each separately. In the example the database counts the number of rows in each group. Grouping With Multiple Columns You can group rows by more than one column. For example, if you want to find out how many medals each country got in each season, your query would look like this: SELECT country, season, count(*) FROM medals GROUP BY country, season; Rows with the same country and the same season are put into one group. The grouping looks like this: country | season | person | place –--------+---------+-----------------------+------- Poland | 2012-13 | Kamil Stoch | 3 –--------+---------+-----------------------+------- Poland | 2013-14 | Kamil Stoch | 1 –--------+---------+-----------------------+------- ... –--------+---------+-----------------------+------- Austria | 2011-12 | Gregor Schlierenzauer | 2 | 2011-12 | Andreas Kofler | 3 –--------+---------+-----------------------+------ The final result: country | season | count ---------+---------+------- Poland | 2012-13 | 1 Austria | 2011-12 | 2 ... Poland | 2013-14 | 1 NULL Values in GROUP BY A reminder: In the WHERE condition no two NULLs are considered equal. As strange as this might seem, the query SELECT * FROM medals WHERE place = place; will select all rows except those with NULL place. For SQL the NULL value means "Unknown" and if it's unknown, SQL cannot assume that it knows for sure what their result will be. (In particular, it can't know for sure that the result is TRUE.) With GROUP BY, it's different. Rows with NULL values all go into one group, and the aggregates are computed for this group, as for any other. It also works for multiple-column GROUP BYs. For this table: country | city | person | earnings ---------------+---------+--------------+---------- NULL | Warsaw | John Doe | 1000 United States | NULL | Maria Jane | 1000 Germany | Berlin | Hans Schmitt | 2430 United States | NULL | Bill Noir | 1000 United States | Chicago | Rob Smith | 3000 NULL | Warsaw | Sophie Doe | 2000 Germany | Berlin | Jane Dahl | 1500 the query SELECT country, city, sum(earnings) FROM employees GROUP BY country, city; renders these groups: country | city | person | earnings ---------------+---------+--------------+---------- NULL | Warsaw | John Doe | 1000 | | Sophie Doe | 2000 –-------------------------------------------------- United States | NULL | Maria Jane | 1000 | | Bill Noir | 1000 –-------------------------------------------------- United States | Chicago | Rob Smith | 3000 –-------------------------------------------------- Germany | Berlin | Hans Schmitt | 2430 | | Jane Dahl | 1500 and this result country | city | sum ---------------+---------+------- NULL | Warsaw | 3000 United States | NULL | 2000 United States | Chicago | 3000 Germany | Berlin | 3930 Myth: Select Columns Must Appear in the GROUP BY Clause or Under Aggregate Function? Common wisdom says that selected columns in a GROUP BY query must either appear in the GROUP BY clause or under an aggregate function. So this query is incorrect: SELECT user_account.id, email, count(*) FROM user_account JOIN address ON user_account.id = address.user_id GROUP BY email; The query groups results by email, but it selects the id column, which is not in GROUP BY clause. This wisdom was the rule in the SQL92 standard. This is how many databases implement the GROUP BY behavior today. You will get an error similar to this one: ERROR: column "user_account.id" must appear in the GROUP BY clause or be used in an aggregate function Where does this rule come from? Let's look at example data. | user_account | | address | +----+--------------------+ +----+----------+---------+ | id | Email | | id | city | user_id | +----+--------------------+ +----+----------+---------+ | 1 | john@example.com | | 1 | London | 1 | | 2 | mary@example.co.uk | | 2 | Brussels | 2 | | 3 | john@example.com | | 3 | Cairo | 3 | | | | | 4 | Dublin | 1 | We group the data by email user_account.email |user_account.id |address.id|address.city|address.user_id| -------------------+----------------+----------+------------+---------------+ john@example.com | 1 | 1 | A | 1 | +----------------+----------+------------+---------------+ | 1 | 4 | D | 1 | +----------------+----------+------------+---------------+ | 3 | 3 | C | 3 | -------------------+----------------+----------+------------+---------------+ mary@example.com | 2 | 2 | B | 2 | The database creates group for each email. But there are multiple user_account ids in each group. The database doesn't know which id it should return. The SQL standard wants the SQL result to be deterministic so it forbids you to execute a query like this. The SQL99 standard has modified the wording of the rule. It now says that any column which appears under SELECT has to appear under the aggregate function or be functionally dependent on columns in the GROUP BY clause. The rule no longer expects to repeat all non-aggregated columns to be repeated in the GROUP BY clause. What does functionally dependent columns in the BY by clause mean? It means: if I fix values for columns in the GROUP BY clause, there has to be only one value for the other column. For example, the email address determines the value of the name of its owner. But there is a catch: the database has to know about this dependency. In the context of databases, the dependency means either primary keys and unique keys. If I group by a primary key, then I know that other columns in this table have fixed values. Our initial example is still not valid under the new rule. BUT: if I enforce the unique constraint on the email column, the query becomes valid under the new rule. If the email column is unique in the user_account table, then fixing the email value determines all other columns in the user_account table. Of course, if I add the unique constraint, my example data are invalid as well. I can't have two different rows with the same email. Do Databases Support the New GROUP BY Rule? Some do, some don't. The new rule is in the SQL99 standard. MySQL from version 5.7.4 supports the new behaviour. So is Postgres from version 9.1. Postgres treats column as functionally dependent on the grouped columns if a subset of the grouped columns is a primary key of the table from which the column comes from. As far as I know, Oracle and SQL Server still stick with the old version. Should you use the new version or the old version of the rule in your queries? In practice, this modification doesn't really change anything. Both rules guarantee that whenever you select an unaggregated column in a GROUP BY query, it's value is unambigous in each group. The old rule forces you to add this column in the GROUP BY clause, but this GROUP BY does not change the semantics of the query. The result is the same, you just have to type a little more with the new rule. In general, you're better off with sticking to the old rule. Your queries will run on most databases. But it's good to know that you don't have to. Extensions to GROUP BY SQL-99 added ROLLUP, and CUBE and GROUPING SETS as options for SELECT statements. GROUP BY ROLLUP The syntax for ROLLUP is SELECT <columns> FROM <tables> WHERE <condition> GROUP BY ROLLUP (<group-by columns>); Using ROLLUP (a,b,c) will generate GROUP BY clauses: (a, b, c), (a, b), (a) and a row for an aggregation of all selected rows. It is equivalent to four SELECT queries with various GROUP BY clauses. For this table department | year | sales –----------+-------------- IT | 2012 | 25000 IT | 2013 | 26000 Retail | 2012 | 35000 Retail | 2013 | 15000 IT | 2014 | 18000 and this query SELECT department, year, sum(sales) FROM sales GROUP BY ROLLUP (department, year); we get the result: department | year | sum ------------+------+-------- IT | 2012 | 25000 IT | 2014 | 18000 IT | 2013 | 26000 IT | NULL | 69000 <- group by department Retail | 2013 | 15000 Retail | 2012 | 35000 Retail | NULL | 50000 <- group by department NULL | NULL | 119000 <- group by (), i.e. all rows selected The additional rows are sometimes called superaggregates. ROLLUP is supported by SQL Server, Oracle, DB2. In MySQL you can use the WITH ROLLUP syntax: SELECT <columns> FROM <tables> WHERE <condition> GROUP BY a,b,c WITH ROLLUP; PostgreSQL does not support ROLLUP. GROUP BY CUBE The syntax for CUBE is SELECT <columns> FROM <tables> WHERE <condition> GROUP BY CUBE (a, b, c); It works similarly to ROLLUP but it generates all possible combinations of columns: (a,b,c), (a,b), (a,c), (b,c), (a), (b), (c) and a row for all rows selected. The query SELECT department, year, sum(sales) FROM sales GROUP BY CUBE (department, year); will render this result: department | year | sum ------------+------+-------- IT | 2012 | 25000 IT | 2014 | 18000 IT | 2013 | 26000 IT | NULL | 69000 <- group by department Retail | 2013 | 15000 Retail | 2012 | 35000 Retail | NULL | 50000 <- group by department NULL | NULL | 119000 <- group by () NULL | 2014 | 18000 | NULL | 2012 | 60000 | <= three new rows added by CUBE NULL | 2013 | 41000 | CUBE is supported by SQL Server and Oracle, and DB2. MySQL and Postgres do not support it. GROUP BY GROUPING SETS GROUPING SETS works similarly, but it allows to specify which column combinations must be used in the result. Grouping sets have to be separated with commas. If there is more than one column in a grouping set, this grouping set should be put in parentheses. Empty parentheses mean the general record with aggregates for the whole set. Example query: SELECT <columns> FROM <tables> WHERE <condition> GROUP BY GROUPING SETS ((a, b), c, ()); GROUPING SETS is supported by SQL Server and Oracle, and DB2. MySQL and Postgres do not support it. Tags: group by