Previous | Table of Contents | Next |
When thinking about which method to use, consider that writing subqueries often takes more work on your part than writing joins to accomplish the same goal. Consider this example, which joins the authors table to the titles table in order to find authors who have written books with computer in the title:
select au_fname, au_lname from authors a, titleauthor ta, titles t where a.au_id = ta.au_id and ta.title_id = t.title_id and t.title like %[Cc]omputer%
au_lname au_fname --------------------------- -------------------- OLeary Michael MacFeather Stearns Green Marjorie Straight Dean MacFeather Stearns Karsen Livia (6 row(s) affected)
To develop that query took five lines, with two JOIN clauses and one WHERE condition. Now, consider the subquery approach:
select au_fname from authors where au_id in( select au_id from titleauthor where title_id in( select title_id from titles where title like %[Cc]omputer%))
You can make your own decision as to which is better, but, as I said before, the join syntax is more my cup of tea.
The most notorious way that the subquery approach can hurt you is with regard to the join order. Remember from yesterday that by providing multiple join orders to the optimizer, you allow the server to make the best available choice (performance-wise) for the inner table. When you write subqueries, you are hard coding the join order by forcing the innermost query to run first, followed by the second innermost query, and so on. This could result in a dramatically different query time.
Subqueries are extremely useful for dealing with aggregates. Recall that an aggregate may not be used in a WHERE clause. Without subqueries, it is not possible to ask for books that cost more than the average in a single step:
/* This code does not work! */ select title from titles where price > avg(price)
Msg 147, Level 15, State 1 An aggregate may not appear in the WHERE clause unless it is in a subquery contained in a HAVING clause or a select list, and the column being aggregated is an outer reference.
You cant get this information by grouping, because that wouldnt allow for the proper calculation of the whole average. Grouping on title_id, for example, is meaningless because each average will be the average of a single row. However, subqueries let you solve this problem:
select price, title from titles where price > (select avg(price) from titles)
price title ------ --------------------------------------- 19.99 The Busy Executives Database Guide 19.99 Straight Talk About Computers 19.99 Silicon Valley Gastronomic Treats 22.95 But Is It User Friendly? 20.00 Secrets of Silicon Valley 21.59 Computer Phobic AND Non-Phobic Individuals: Behavior Variations 19.99 Prolonged Data Deprivation: Four Case Studies 20.95 Onions, Leeks, and Garlic: Cooking Secrets of the Mediterranean 14.99 Sushi, Anyone? (9 row(s) affected)
The subquery calculates the average price from the titles table. You dont see this result in the result set because it is used to feed the outer query. (The average is $14.77.) Once this average is obtained, the server compares this value to all the rows in titles and selects those rows whose prices are greater than the average.
In effect, after the subquery has completed, the outer query runs with these values:
select price, title from titles where price > $14.77
Remember from Day 6 that the HAVING clause allows you to evaluate rows in an intermediate result set, then makes a determination about which rows to ultimately return. The HAVING clause may include an aggregate if it is contained in the SELECT, which you saw a couple of days ago. All that the HAVING does is compare against a column in the intermediate result set.
The HAVING clause may also include a subquery, for comparison to those aggregates. Suppose you want to know which book types have at least one book priced above the average. To get this information, you could use this HAVING query:
select type, max(price) from titles group by type having max(price) > ( select avg(price) from titles) ) order by max(price)
type ------------ -------------------------- business 19.99 mod_cook 19.99 trad_cook 20.95 psychology 21.59 popular_comp 22.95 (5 row(s) affected)
In fact, every book type has at least one book priced above the average of $14.77. The UNDECIDED type, which has only null as a price, is excluded from the final list. Because it has no actual prices, the server cannot tell for sure if the UNDECIDED rows satisfy the condition in the HAVING clause.
Let me take each clause in the example and explain what is going on. First, the GROUP BY: each type of book is searched, and the maximum price is obtained, by type. The SELECT list specifies that each type should appear in the result set with its maximum price. This results in six rows: one for each type of book. The UNDECIDED type actually does appear in this intermediate result set, which you can confirm by running this query without the HAVING restriction.
Now the HAVING clause is evaluated. The HAVING contains a subquery, so that subquery runs, returning $14.77 for the average price. There is no GROUP BY in the subquery, so the average for all books is returned. Now, HAVING compares each row in the intermediate result set to see if it is greater than $14.77. All rows pass this test except for the null priced book type. These five rows are returned in the final result set.
Previous | Table of Contents | Next |