Previous | Table of Contents | Next |
When you are comparing an outer query to a list of values, here are a few points to keep in mind.
Removing Duplicate Entries from the IN List
Should you use DISTINCT on the list to remove duplicate entries? This depends on the profile of the data with which you are working. If the outer query is operating on a very large table and the inner query is working on a small table, using DISTINCT is a good idea. In this case, eliminating duplicate list items helps the server improve response time, because each row in the outer query needs to be tested against fewer values. Its also a good idea if, for example, there are two or three distinct values, but a few hundred rows. When a value is repeated many times, remove duplicates with DISTINCT.
However, in any other case (outer query is running on a small table, inner large; outer large, inner large), it will rarely pay off for you to use DISTINCT on the inner query. To resolve a DISTINCT query, the server will usually need to create an intermediate work table to hold the results. (Whether or not it needs to do this is dependent on the indexes that exist on the table(s).) This worktable is then sorted, and duplicate values are removed. The worktable is then returned. All this extra work is expensive, more so than searching through a few duplicate list items.
To summarize, there are a few cases where using DISTINCT on the returned list is prudent, but generally it does not pay to use it.
Matching Data Types
Be sure to match the data type of the list items to the data type compared to the list in the outer query. If you dont, the server will need to convert the list items over and over in order to use them for comparison. If the data types are incompatible (money and character, for instance), you will get an error.
If you write a query where the inner query returns no rows, you will not receive an error. In the case of an IN list, you are comparing values to an empty list, and (obviously) no rows in the outer query will be in the list. In the case of an equality or inequality operator, no values are returned against which to test. The return value is not null, either. A null means, There is a value here, but its value is unknown. An empty list means, There are no values here.
The syntax is the same as with the single row subquery, with the addition of the IN and NOT IN comparison operators:
OUTER QUERY [NOT] IN ( SELECT [DISTINCT] <column-expression> FROM <table-list> WHERE <restrictions> GROUP BY <grouped column-list>)
Note that in both cases, no more than one column can be in the SELECT column list. (This is why the syntax specified a column expression, and not a column list.)
Yesterday you learned how to select information from related tables by using an inner join. You can also use a subquery to resolve problems when information from another table is required. However, you must use a join to display information from more than one table.
Here is an example from the last chapter of a join that displays publishers who published business books:
select pub_name from publishers p, titles t where p.pub_id = t.pub_id and t.type = business
Now, Ill write this as a subquery:
select pub_name from publishers where pub_id in ( select pub_id from titles where t.type = business)
The subquery used here gives the same answer as the two-step query shown earlier, and used yesterday. First, the inner query returns all the pub_ids for books in the titles table that are business books. This list is checked against pub_ids in the publishers table. All publishers are displayed if they appear in that list.
This is just how a join would occur: the server culls from the titles table a list of pub_ids that match the criteria, in this case type = business. It then joins the titles table to the publishers table on pub_id and lists pub_names for joined rows.
Technical Note: Performance
There are two ways to retrieve information that must come from multiple tables: by joining those tables or by using a subquery. Which way is faster? The answer is, unfortunately, it depends. For the most part, these two methods produce identical behavior on the part of the server. In some cases, the method chosen by the server to retrieve data may be different enough to cause a join to be faster than its corresponding subquery, or a subquery may be faster than a table join.Pick whichever method you like best, and use it consistently. In the real world, join syntax is by far the most prevalent. I find the join syntax easier to understand when multiple tables are involved. If you like subqueries better, though, dont let my personal biases stop you from using them. Remember that a subquery that does the work of a join may not combine columns from more than one table in a single result set, however, and joins (inner or outer) cannot always do the work a subquery can, so there are cases where joins and subqueries are the only solution.
Previous | Table of Contents | Next |