17th Feb 2022 9 minutes read

How to Select the First Row of Each Group in SQL

Grouping data in SQL is not that hard. Finding the first row of each group, maybe, but not after you read this article!

You use GROUP BY when you need to group data in SQL. The GROUP BY statement groups rows with the same value into a single row. Due to its logic, it’s often used with aggregate functions. You know, functions like MIN(), MAX(), SUM(), COUNT(), AVG(), etc.

A problem may surface when you want to select several columns but only one row from each group. Specifically, you may want the first row of each group, that is, the row with a minimum value. I’ll show you two main approaches:

Correlated subquery.
Window functions + common table expressions.

If you’re not that familiar with window functions or common table expressions, I recommend our hands-on Window Functions and Common Table Expressions courses. Together, they contain over 300 interactive exercises teach you window functions and common table expressions starting with the basics and progressing to advanced topics like recursive queries.

When is it useful to look for the first row in each group? For example, imagine you have sales data by month over five years, and you’d like to see the month with the lowest sales for every year. Or imagine you want to see the employee with the lowest salary in every department. In the example I’ll show below, you may want to see the least streamed song for every artist.

You’ll understand the problem better when I show you the data.

Data for Finding the First Row in Each Group

id	artist	song_name	number_of_streams
1	Prince	Uptown	514,744
2	Paul Simon	Graceland	2,205,477
3	Donny Hathaway	You've Got a Friend	498,444
4	Prince	I Wanna Be Your Lover	1,547,956
5	Donny Hathaway	To Be Young, Gifted And Black	274,894
6	Paul Simon	Stranger to Stranger	498,715
7	Prince	Little Red Corvette	2,147,988
8	Paul Simon	Darling Lorraine	348,748
9	Donny Hathaway	A Song For You	687,415
10	Donny Hathaway	Someday We'll All Be Free	274,894

There are three different artists: Prince, Paul Simon, and Donny Hathaway. Prince and Paul Simon have three songs each. Donny Hathaway has four songs. There’s streaming data for each of these songs.

Grouping by artist gets you every artist only once, for sure. But what if you want to show song names and the number of streams, too? With a different name and a different number of streams for each song, it isn’t much of a grouping.

You can go even further and ask: how can you group by artist and also show the song with the smallest number of streams for each artist?

Using Correlated Subqueries

The first way to find the first row of each group is by using a correlated subquery. In short, a correlated subquery is a type of subquery that is executed row by row. It uses the values from the outer query, that is, the values from the query it’s nested into.

You can use a correlated subquery to find the minimum row of each group in SQL:

SELECT  artist,
	  song_name,
	  number_of_streams
FROM streaming_data st_outer
WHERE number_of_streams = (SELECT MIN(number_of_streams)
						   FROM streaming_data
						   WHERE artist = st_outer.artist)
GROUP BY artist, song_name, number_of_streams;

In this query, I select three columns: artist, song_name, and number_of_streams from the table streaming_data. I give the alias st_outer to the table since this is my outer query and I reference this table again in the correlated subquery. This is a way to know which table from which SELECT statement you’re referencing.

I use a WHERE clause to get only the results where the number_of_streams is equal to the minimum number of streams for this artist.

The subquery gets me these minimum values with the MIN() aggregate function. But I filter the data using a WHERE clause. That’s because I want the minimum number of streams where the artist from the correlated subquery is the same artist from the outer query.

In other words, these two SELECT statements return the data for rows where the number of streams equals the minimum number of streams. If you group the result by artist, song name, and the number of streams, here’s what you get:

artist	song_name	number_of_streams
Donny Hathaway	Someday We'll All Be Free	274,894
Donny Hathaway	To Be Young, Gifted And Black	274,894
Paul Simon	Darling Lorraine	348,748
Prince	Uptown	514,744

These are indeed the least streamed song for each artist. You may be asking why there are two songs for Donny Hathaway. Nothing wrong with the output. Both songs have the same smallest number of streams, so the output shows them both.

Even though I get the correct result, I generally don’t recommend using correlated subqueries for this task. The reason? It’s not very efficient. A correlated subquery checks against the outer query once for every row. If you have a lot of data, this may be very slow. In addition, the code gets messy and difficult to read if you write a little more complex query and subquery.

The second approach for selecting the first row of each group is to combine window functions and common table expressions.

Using Window Functions and Common Table Expressions

Window functions are similar to aggregate functions in that both perform calculations across rows. However, the window functions do not group the result into a single row as aggregate functions do. With window functions, all the rows remain intact, with the result shown in an additional column for every row.

You may want to have the Window Functions Cheat Sheet open for following the next code.

A common table expression (CTE) is a temporary result of a query. You can name, reference, and use it like any other table. The only difference is a CTE has to be executed every time you want to use its result in another query.

To get the same result as you do with the correlated subquery, combine window functions and a CTE this way:

WITH first_row AS (
SELECT RANK() OVER (PARTITION BY artist ORDER BY number_of_streams ASC) AS streaming_rank,
		 artist,
		 song_name,
		 number_of_streams
	FROM streaming_data
)

SELECT  streaming_rank,
	  artist,
	  song_name,
	  number_of_streams
FROM first_row
WHERE streaming_rank = 1
GROUP BY streaming_rank, artist, song_name, number_of_streams;

It works like this. As with any CTE, first I write the WITH clause. The name of my CTE is first_row. Why? Because I use it to get the first row for every artist.

In the first SELECT statement, the window function appears immediately. It’s the RANK() function, whose purpose is to rank the rows in the table streaming_data.

This function is followed by an OVER() clause with two keywords in the parentheses. The first is PARTITION BY, which I use to rank rows only within each artist. The second keyword is the ORDER BY, which tells it to rank by the column number_of_streams in ascending order. Using these keywords this way, the window functions rank the rows for one artist from the smallest to the largest number of streams, then with the second artist the ranking restarts at 1, and so on.

After the CTE comes the second SELECT statement that references the CTE. Here, I’m simply selecting all the columns from the CTE. I’m only interested in rows ranked first, hence the WHERE clause. Finally, I group the result. Here’s what I get:

streaming_rank	artist	song_name	number_of_streams
1	Donny Hathaway	Someday We'll All Be Free	274,894
1	Donny Hathaway	To Be Young, Gifted and Black	274,894
1	Paul Simon	Darling Lorraine	348,748
1	Prince	Uptown	514,744

Other than the streaming_rank column, the result is identical to the previous one. The difference between these two approaches is not in the output itself. Using the window functions and CTE to get the first row of each group is recommended for a few reasons.

First, it’s much more efficient. Unlike the correlated subquery, this approach does not check the same condition for every row of data. This makes it much quicker.

Second, the query is also more readable because it’s divided into two obvious steps:

Ranking the rows.
Selecting only the rank 1.

The code is also easier to play with and to change according to your needs. For example, you may want to use the DENSE_RANK() functions instead of RANK() to get the same result as above:

WITH first_row AS (
	SELECT  DENSE_RANK() OVER (PARTITION BY artist ORDER BY number_of_streams ASC) AS streaming_rank,
		  artist,
		  song_name,
	        number_of_streams
	FROM streaming_data
)

SELECT  streaming_rank,
	  artist,
	  song_name,
	  number_of_streams
FROM first_row
WHERE streaming_rank = 1
GROUP BY streaming_rank, artist, song_name, number_of_streams;

Both functions returns all the rows that are tied (i.e., have the same rank), so they are ideal if you want to show all the rows with the same minimum value.

Be careful, though. RANK() and DENSE_RANK() are not the same. The differences between them, however, are not the topic of this article. For that, read the ranking functions overview and the complete guide to ranking rows in SQL. All you have to remember for now is they return the same result if used for the purpose of finding the minimum row within a group.

In case you want only one row with the minimum value, and you don’t care which one, use the ROW_NUMBER() function. The main difference from the previous two window functions is that it does not return tied values. If there are several rows with the same (minimum) value, it chooses only one randomly.

Here is the same code as before, only the window function changes to ROW_NUMBER():

WITH first_row AS (
	SELECT  ROW_NUMBER() OVER (PARTITION BY artist ORDER BY number_of_streams ASC) AS row_number,
		  artist,
		  song_name,
		  number_of_streams
	FROM streaming_data
)

SELECT  streaming_rank,
	  artist,
	  song_name,
	  number_of_streams
FROM first_row
WHERE streaming_rank = 1
GROUP BY streaming_rank, artist, song_name, number_of_streams;

As I said, the result is a bit different:

streaming_rank	artist	song_name	number_of_streams
1	Donny Hathaway	To Be Young, Gifted and Black	274,894
1	Paul Simon	Darling Lorraine	348,748
1	Prince	Uptown	514,744

Instead of two Donny Hathaway songs with minimum streams, I just get “To Be Young, Gifted And Black.”

One of the benefits of using window functions and CTEs is that you can easily change the ranking order. For example, if you want to see the maximum number of streams, the row with the maximum value now gets to be the first one in each group. I’ll show this only with ROW_NUMBER(), but you can do the same with RANK() and DENSE_RANK().

WITH first_row AS (
	SELECT  ROW_NUMBER() OVER (PARTITION BY artist ORDER BY number_of_streams DESC) AS streaming_rank,
		  artist,
		  song_name,
		  number_of_streams
	FROM streaming_data
)

SELECT  streaming_rank,
	  artist,
	  song_name,
	  number_of_streams
FROM first_row
WHERE streaming_rank = 1
GROUP BY streaming_rank, artist, song_name, number_of_streams;

You see the only difference, right? It’s in the OVER() clause. This time, I ordered it in descending rather than ascending order. This returns the most streamed songs for every artist:

streaming_rank	artist	song_name	number_of_streams
1	Donny Hathaway	A Song For You	687,415
1	Paul Simon	Graceland	2,205,477
1	Prince	Little Red Corvette	2,147,988

Find Out What Else You Can Do With Window Functions!

Finding the first row of each group is only one of the examples where you can use SQL window functions. You get a thorough guide to what they can do via the hands-on window functions course. This course is great for learning and practicing examples. And what do you get? Take a look at the article explaining the benefits of the course and of window functions themselves!

If that doesn’t quench your thirst, you can always use the Window Functions Practice Set to get your fingers more agile in writing code!

Tags:

Data for Finding the First Row in Each Group

Using Correlated Subqueries

Using Window Functions and Common Table Expressions

Find Out What Else You Can Do With Window Functions!

You may also like