Previous | Table of Contents | Next |
A common fallacy at this point is to look at the work performed by the server and decide that join order is irrelevant to performance. Someone might say, No matter how you write it, the server always has to construct the entire Cartesian product to find the rows that match and the ones that dont. Regardless of which table you scan first, you still have to make the same number of comparisons.
Happily, the server doesnt have to compare everything in the outer table to everything in the inner table. In fact, it does its very best to perform as few comparisons as possible. Heres how it works:
Note: Just as a reminder, the cost of a query is the amount of work required to execute it. Work is measured as a combination of physical I/Os (disk reads), logical I/Os (memory reads), and CPU ticks. Theres a little more about cost in a moment, but a formal discussion of cost is outside the scope of the book.
I once met a guy who told me he was writing his own relational database engine. He told me it was no big deal, and he couldnt understand why the DBMS companies get all that money for a system that was so simple.
Well, join optimization is a big deal. Finding an effective join orderno, the effective join orderis complex and sophisticated and completely worth the money you pay for the software, especially if your tables get large.
On Day 15, we looked at single-table optimization. The server breaks down the query, identifies candidate indexes, uses distribution statistics to estimate the cost of various index scans, and chooses the plan that delivers the data most efficiently.
Warning: The next few pages are pretty technical. If you decide that they arent for you, just skip over them to the next section on outer joins. Youve already learned a lot about join processing, and you certainly should be able to write joins that return the correct results.If you do take the time to plow through the material, I guarantee youll understand joins better when youre finished, and you should be able to look at join queries critically to decide about some performance issues.
When the server assesses the cost of a join, it goes through the same process with each table, but it also estimates the join selectivity of each join condition.
The join selectivity is a little different from the selectivity of a column regarding a search argument. To estimate join selectivity, the server considers the ratio of the rows in the tables.
For example, if tables A and B had 5,000 and 1,000 rows respectively, the server estimates it will find five rows in A for each row in B. It also estimates it will find one row in B for each row in A. This allows the server to estimate the cost of each scan, depending on which table ends up being the inner table.
The server then estimates the cost of the query based on all possible join orders. If there are two tables, A and B, it considers the cost based on A being the outer table and the cost based on B being the outer table. These approaches are referred to as A to B and B to A, and are often drawn with arrows signifying the join order:
If you want to get a better understanding of optimization, heres a crude formula for estimating the cost of a join.
COST = OUTER TABLE SCAN COST + OUTER TABLE ROWS FOUND * INNER TABLE SCAN COST
(This formula is like Newtonian physics: it fails to account for a lot of issues, but it gives you an idea of how the factors relate to each other.) You scan the outer table once; then you scan the inner table once per row from the outer table.
Lets break the factors down a little. The most important costs are logical reads (reads of data from memory, which are fast) and physical reads (reads of data from disk, which are comparatively slow). Well ignore CPU coststhey are proportional to logical reads for the most part.
To understand this better, lets look at a simple join example:
select s.stor_id, t.type, count(*) from titles t join stores s on t.title_id = s.title_id where s.ord_date >= 9/1/94 and s.ord_date < 10/1/94 and t.price between 15.00 and 20.00 and t.type in (children, drama, psychology) group by s.stor_id, t.type
This is a two-table join with search arguments on both tables. Table 21.1 gives you some salient facts about the two tables.
Item | sales | titles |
---|---|---|
Rows | 168,725 | 537 |
Pages | 4,356 | 39 |
Rows matching ord_date | 5,077 (3%) | (N/A) |
Rows matching price | (N/A) | 115 (21%) |
Rows matching type | (N/A) | 198 (37%) |
Rows matching price and type | 42 (8%) | |
Previous | Table of Contents | Next |