28th May 2020 10 minutes read Subquery vs. JOIN Dorota Wdzięczna sql learn sql subquery joins One of the challenges in writing SQL queries is choosing whether to use a subquery or a JOIN. There are many situations in which a JOIN is the better solution, and there are others where a subquery is better. Let’s consider this topic in detail. Subqueries are used in complex SQL queries. Usually, there is a main outer query and one or more subqueries nested within the outer query. Subqueries can be simple or correlated. Simple subqueries do not rely on the columns in the outer query, whereas correlated subqueries refer to data from the outer query. You can learn about subqueries in the Subqueries section in the interactive “SQL Basics” course or practice writing subqueries in the Subqueries section of the “SQL Practice Set” course. Or simply read the article “SQL Subqueries” by Maria Alcaraz. The JOIN clause does not contain additional queries. It connects two or more tables and selects data from them into a single result set. It is most frequently used to join tables with primary and foreign keys. You can practice SQL JOINs in our interactive SQL JOINs course. It contains over 90 exercises to review and practice different types of JOINs. You can also read more about JOINs in the article “How to Practice SQL JOINs” by Emil Drkušić. Subqueries and JOINs can both be used in a complex query to select data from multiple tables, but they do so in different ways. Sometimes you have a choice of either, but there are cases in which a subquery is the only real option. We will describe the various scenarios below. Consider two simple tables, product and sale, which we will use in our examples. Here is the product table. idnamecostyearcity 1chair245.002017Chicago 2armchair500.002018Chicago 3desk900.002019Los Angeles 4lamp85.002017Cleveland 5bench2000.002018Seattle 6stool2500.002020Austin 7tv table2000.002020Austin This table contains the following columns: id: the identifier of the product. name: the name of the product. cost: the cost of the product. year: the year the product was made. city: the city in which the product was made. And the other table, sale: idproduct_idpriceyearcity 122000.002020Chicago 22590.002020New York 32790.002020Cleveland 53800.002019Cleveland 64100.002020Detroit 752300.002019Seattle 872000.002020New York which has the following columns: id: the identifier of the sale. product_id: the identifier of the product sold. price: the sale price. year: the year in which the product was sold. city: the city where the product was sold. We will use these two tables to write complex queries with subqueries and JOINs. When to Rewrite Subqueries With JOINs SQL beginners often use subqueries when the same results can be achieved with JOINs. While subqueries may be easier to understand and use for many SQL users, JOINs are often more efficient. JOINs are also easier to read as the queries become more complex. So, we will focus first on when you can replace a subquery with a JOIN for better efficiency and readability. Scalar Subquery The first such case is the scalar subquery. A scalar subquery returns a single value (one column and one row) to be used by the outer query. Here is an example. Suppose we need the names and the costs of the products that were sold for $2,000. Let’s look at the code with a subquery: SELECT name, cost FROM product WHERE id=(SELECT product_id FROM sale WHERE price=2000 AND product_id=product.id ); and the result: namecost armchair500.00 tv table2000.00 The outer query selects the names (name) and the cost (cost) of the products. Since we don’t want all of the products, we use a WHERE clause to filter the rows to the product IDs returned by the subquery. Now let’s look at the subquery. The sale table contains sales records of the products. The subquery first filters the records to only those with the sale price equal to $2,000 (price=2000). It then uses the product IDs (product_id) in the selected sales to identify the records from the product table (product_id=product.id). This is a correlated subquery, since the second condition in the subquery references a column in the outer query. Only two products were sold at $2,000: the armchair and the TV table. This query is not very efficient. How should we modify it? We can build a JOIN structure and obtain the same result. Look at the query with a JOIN: SELECT p.name, p.cost FROM product p JOIN sale s ON p.id=s.product_id WHERE s.price=2000; In this query, we connect the two tables product and sale with a JOIN operator. In the JOIN condition, the records from the product table are linked to the records from the sale table through the product IDs. At the end, the rows are filtered by a WHERE clause to select the record when the sale price of the product equals $2,000. Subquery Within the IN Clause Another subquery that is easily replaced by a JOIN is the one used in an IN operator. In this case, the subquery returns to the outer query a list of values. Let’s say we want to obtain the names and the costs of the products sold in our example. SELECT name, cost FROM product WHERE id IN (SELECT product_id FROM sale); The outer query selects the names and the costs of the products; it then filters to the records whose product IDs belong on the list returned by the subquery. The subquery selects the product IDs from the sale table (SELECT product_id FROM sale), so only the sold products are returned by this query in the final result set, like this: namecost armchair500.00 lamp85.00 bench2000.00 desk900.00 There are more products in the product table, but only four of them were sold. The query below returns the same result by using a JOIN: SELECT DISTINCT p.name, p.cost FROM product p JOIN sale s ON s.product_id=p.id; It becomes a very simple query. It connects the two tables by product ID and selects the names and the costs of these products. It is an INNER JOIN, so if a product doesn’t have its ID in the sale table, it will not be returned. Notice that we also use the DISTINCT keyword to remove duplicate records. This is often necessary if you transform subqueries with an IN or a NOT IN into JOINs. Want to learn more about SQL Subqueries with the IN operator? Watch an episode of our We Learn SQL series on Youtube. Remember to subscribe to our channel. Subquery in the Clause NOT IN This is just like the previous situation, but here the subquery is used in a NOT IN operator. We want to select the names and the costs of the products that were not sold. Below is an example with a subquery inside the NOT IN operator: SELECT name, cost FROM product WHERE id NOT IN (SELECT product_id FROM sale); The results: namecost chair245.00 stool2500.00 The subquery returns the product IDs from the sale table (the sold products) and compares them with the product IDs in the outer query. If a record in the outer query does not find its product ID in the list returned by the subquery, the record is returned. How do you rewrite this subquery with a JOIN? You can do it like this: SELECT DISTINCT p.name, p.cost FROM product p LEFT JOIN sale s ON s.product_id=p.id WHERE s.product_id IS NULL; This query connects the two tables product and sale by the product IDs. You should also use the DISTINCT keyword, like we did when we transformed the previous subquery with an IN into a JOIN. Notice that in rewriting the subquery in the NOT IN, we used a LEFT JOIN and a WHERE. That way, you start with all the products including those not sold, then select only the records that are NULL in the product_id column. The NULL denotes that the product was not sold. Correlated Subqueries in EXISTS and in NOT EXISTS The subqueries in an EXISTS or in a NOT EXISTS are also easy to rewrite with JOINs. The query below uses a subquery to obtain the details about products that were not sold in 2020. SELECT name, cost, city FROM product WHERE NOT EXISTS ( SELECT id FROM sale WHERE year=2020 AND product_id=product.id ); The result: namecostcity chair245.00Chicago desk900.00Los Angeles bench2000.00Seattle stool2500.00Austin For each product in the outer query, the subquery selects the records whose sale year is 2020 (year=2020). If there are no records for a given product in the subquery, the NOT EXISTS clause returns true. The result set contains the products with the sale year other than 2020 as well as the products without any records in the sale table. You can rewrite the same query using a JOIN: SELECT p.name, p.cost, p.city FROM product p LEFT JOIN sale s ON s.product_id=p.id WHERE s.year!=2020 OR s.year IS NULL; Here, we connect the product table with the sale table through a LEFT JOIN operator. This allows us to include the products that were never sold in the result set. The WHERE clause filters the records by selecting the products with no records in the sale table (s.year IS NULL) as well as the products with the sale year other than 2020 (s.year!=2020). When You Cannot Replace a Subquery With a JOIN JOINs can be efficient, but there are situations that require a subquery and not a JOIN. Below are some of these situations. Subquery in FROM With a GROUP BY First of these is a subquery in a FROM clause using a GROUP BY to calculate aggregate values. Let’s look at the following example: SELECT city, sum_price FROM ( SELECT city, SUM(price) AS sum_price FROM sale GROUP BY city ) AS s WHERE sum_price < 2100; and the result: citysum_price Chicago2000.00 Detroit100.00 Cleveland1590.00 Here, the subquery selects the cities and calculates the sum of the sale prices by city. The sum of all sale prices in each city from the sale table is calculated by the aggregate function SUM(). Using the results of the subquery, the outer query selects only the cities whose total sale price is less than $2,100 (WHERE sum_price < 2100). You should remember from previous lessons how to use aliases for subqueries and how to select an aggregate value in an outer query. Subquery Returning an Aggregate Value in a WHERE Clause Another situation in which you cannot rewrite a subquery structure with a JOIN is an aggregate value being compared in a WHERE clause. Look at this example: SELECT name FROM product WHERE cost<(SELECT AVG(price) from sale); The result: name chair armchair desk lamp This query retrieves the names of the products whose costs are lower than the average sale price. The average sale price is calculated with the help of the aggregate function AVG() and is returned by the subquery. The cost of each product is compared to this value in the outer query. Subquery in an ALL Clause Yet another situation is a subquery with an ALL clause. SELECT name FROM product WHERE cost > ALL(SELECT price from sale); The subquery returns all the sale prices in the sale table. The outer query returns the name of the product with the higher sale price than the cost. The result: name stool When to Use a Subquery vs. a JOIN We’ve reviewed some common uses of subqueries and the situations in which some subqueries might be rewritten with JOINs instead. A JOIN is more efficient in most cases, but there are cases in which constructs other than a subquery is not possible. While subqueries may be more readable for beginners, JOINs are more readable for experienced SQL coders as the queries become more complex. It is a good practice to avoid multiple levels of nested subqueries, since they are not easily readable and do not have good performance. In general, it is better to write a query with JOINs rather than with subqueries if possible, especially if the subqueries are correlated. If you are interested in learning more or if you want to practice your skills, check out the subqueries sections in the “SQL Basics” course or the “SQL Practice Set” course. Tags: sql learn sql subquery joins