

Join Processing: The Physical Model

A common fallacy at this point is to look at the work performed by the server and decide that join order is irrelevant to performance. Someone might say, “No matter how you write it, the server always has to construct the entire Cartesian product to find the rows that match and the ones that don’t. Regardless of which table you scan first, you still have to make the same number of comparisons.”

Happily, the server doesn’t have to compare everything in the outer table to everything in the inner table. In fact, it does its very best to perform as few comparisons as possible. Here’s how it works:

  First, the server looks in the query for a search argument, a SARG, that could usefully restrict one of the tables. (SARGs were introduced at the beginning of this week when we were looking at indexes and optimization.) If it can restrict the number of found rows in the outer table, it can reduce the number of times the inner table must be scanned. (That’s good.)
  Second, the server looks for a useful index to find rows in one of the tables based on the key from the other table. If it finds a useful index, it can reduce the cost of each scan of the inner table by rapidly finding the matching rows without examining every row. (That’s also good; both ideas are sketched in the example after this list.)


  Third, the server checks whether one or both of the tables will fit entirely in the data cache; a table that fits makes a good candidate for the inner table, because repeated scans of it will be fast.


Note:  Just as a reminder, the cost of a query is the amount of work required to execute it. Work is measured as a combination of physical I/Os (disk reads), logical I/Os (memory reads), and CPU ticks. There’s a little more about cost in a moment, but a formal discussion of cost is outside the scope of this book.
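
To make the first two points concrete, here’s a sketch of a join where both ideas apply. The tables are from the pubs sample database, and the index on sales.title_id is an assumption for illustration:

-- price < 10.00 is a SARG: it restricts titles, making it a cheap
-- outer table. If sales has an index on title_id (assumed here), each
-- pass over the inner table becomes an index lookup of the matching
-- rows rather than a scan of the whole table.
select t.title, s.qty
from titles t join sales s on t.title_id = s.title_id
where t.price < 10.00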

Join Optimization

I once met a guy who told me he was writing his own relational database engine. He told me it was no big deal, and he couldn’t understand why the DBMS companies get all that money for a system that was so simple.

Well, join optimization is a big deal. Finding an effective join order—no, the effective join order—is complex and sophisticated and completely worth the money you pay for the software, especially if your tables get large.

On Day 15, we looked at single-table optimization. The server breaks down the query, identifies candidate indexes, uses distribution statistics to estimate the cost of various index scans, and chooses the plan that delivers the data most efficiently.
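
Those distribution statistics are only as good as their last refresh. On Transact-SQL servers you can rebuild them a table at a time; treat this as a sketch and check your server’s documentation for the exact options:

update statistics titles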


Warning:  The next few pages are pretty technical. If you decide that they aren’t for you, just skip over them to the next section on outer joins. You’ve already learned a lot about join processing, and you certainly should be able to write joins that return the correct results.

If you do take the time to plow through the material, I guarantee you’ll understand joins better when you’re finished, and you should be able to look critically at join queries and make some judgments about their performance.


When the server assesses the cost of a join, it goes through the same process with each table, but it also estimates the join selectivity of each join condition.

Join selectivity is a little different from the selectivity of a column with respect to a search argument. To estimate join selectivity, the server considers the ratio of rows between the joined tables.

For example, if tables A and B had 5,000 and 1,000 rows respectively, the server estimates it will find five rows in A for each row in B. It also estimates it will find one row in B for each row in A. This allows the server to estimate the cost of each scan, depending on which table ends up being the inner table.
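
You can approximate the ratio the optimizer uses by counting rows yourself. A minimal sketch, using the hypothetical tables A and B from the example:

select (select count(*) from A) / (select count(*) from B) as rows_in_a_per_row_in_b

With 5,000 rows in A and 1,000 in B, this returns 5: on average, five rows in A join to each row in B.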

The server then estimates the cost of the query based on all possible join orders. If there are two tables, A and B, it considers the cost based on A being the outer table and the cost based on B being the outer table. These approaches are referred to as “A to B” and “B to A,” and are often drawn with arrows signifying the join order:

A -> B
B -> A

If you want to get a better understanding of optimization, here’s a crude formula for estimating the cost of a join.

COST = OUTER TABLE SCAN COST + (OUTER TABLE ROWS FOUND * INNER TABLE SCAN COST)

(This formula is like Newtonian physics: it fails to account for a lot of issues, but it gives you an idea of how the factors relate to each other.) You scan the outer table once; then you scan the inner table once per row from the outer table.
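
Plugging the two join orders into the formula makes the trade-off explicit:

A -> B: COST = (one scan of A) + (rows found in A) * (one scan of B)
B -> A: COST = (one scan of B) + (rows found in B) * (one scan of A)

The order with the smaller second term usually wins, which is why a SARG that cuts down the outer table’s found rows is so valuable.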

Let’s break the factors down a little. The most important costs are logical reads (reads of data from memory, which are fast) and physical reads (reads of data from disk, which are comparatively slow). We’ll ignore CPU costs—they are proportional to logical reads for the most part.
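
If you want to watch these numbers yourself, Transact-SQL servers report per-table logical and physical reads for each statement once you switch on I/O statistics (the exact output format varies from server to server):

set statistics io on

Run a join with this switch on and you can compare the reads the server actually performs against estimates like the ones we work out below.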

To understand this better, let’s look at a simple join example:

select s.stor_id, t.type, count(*)
from titles t join sales s on t.title_id = s.title_id
where s.ord_date >= '9/1/94' and s.ord_date < '10/1/94'
  and t.price between 15.00 and 20.00
  and t.type in ('children', 'drama', 'psychology')
group by s.stor_id, t.type

This is a two-table join with search arguments on both tables: the ord_date condition restricts sales, and the price and type conditions restrict titles. Table 21.1 gives you some salient facts about the two tables.

Table 21.1. Details about the sales and titles tables.

Item                            sales          titles

Rows                            168,725        537
Pages                           4,356          39
Rows matching ord_date          5,077 (3%)     (N/A)
Rows matching price             (N/A)          115 (21%)
Rows matching type              (N/A)          198 (37%)
Rows matching price and type    (N/A)          42 (8%)
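
To see how the crude cost formula plays out, here’s a back-of-the-envelope comparison of the two join orders using the numbers in Table 21.1. The per-scan costs are invented for illustration: assume a unique index on titles.title_id makes each probe of titles cost about 2 reads, and assume no useful index on sales.title_id, so each inner scan of sales reads all 4,356 pages.

sales -> titles: COST = 4,356 + 5,077 * 2     = 14,510 reads
titles -> sales: COST = 39    + 42    * 4,356 = 182,991 reads

Under those assumptions, sales -> titles wins by an order of magnitude, even though sales is by far the larger table. This is exactly the kind of trade-off the optimizer weighs when it picks a join order.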

