Previous | Table of Contents | Next |
There is one other problem commonly encountered in optimization. What if a procedure creates a fairly large temporary table and then joins that table back to another table? Heres a sample query that requires a temporary table:
select t.title_id, pub_id, sum(qty) CAqty from titles t, sales s, stores st where state = CA and t.title_id *= s.title_id and s.stor_id = st.stor_id group by t.title_id, pub_id
Msg 303, Level 16, State 1 The table sales is an inner member of an outer-join clause. This is not allowed if the table also participates in a regular join clause.
Note: Lets remember what the outer join means. It instructs the server to show results for a row in a table, even if there are no corresponding rows in a joined table. In this case, SQL Server is to display every row in the titles table, even if there are no sales for that title to stores in California.
This is a common problem. If I have three tables that join in the format A *= B = C, SQL Server cannot perform the join in a single step. Instead, you need to break up the query into two steps, first resolving the join, B = C, and placing the results into a temporary table, #T. Then you can perform the outer join against the temporary table, A *= #T. Heres the revised query:
select title_id, qty into #temp from sales s, stores st where state = CA and s.stor_id = st.stor_id select t.title_id, pub_id, sum(qty) CAqty from titles t, #temp tp where t.title_id *= tp.title_id group by t.title_id, pub_id
The revised query is compiled properly and works. Lets translate this query into a stored procedure that uses a parameter for the value of state:
create proc state_sales (@state char(2)) as select title_id, qty into #temp from sales s, stores st where state = @state and s.stor_id = st.stor_id select t.title_id, pub_id, sum(qty) CAqty from titles t, #temp tp where t.title_id *= tp.title_id group by t.title_id, pub_id return
The first SELECT in the procedure is fairly standard. As long as the number of sales per state is fairly predictable, we wont have to worry about recompiling or anything. The second query is another matter. What statistics on the size or distribution of the temporary table are available to SQL Server at optimization time? There arent any, so the server is forced to guess.
Tip: When the server has to estimate the size of a temporary table, it always guesses that there will be 100 rows on 10 data pages (20KB). (Needless to say, this guess is usually wrong.)
If I were running the query in a batch mode, this problem would be easy to solve. I would separate the two statements with GO, allowing the second query to be separately optimized after its size was known.
Lets follow the process the server does to determine the proper execution method for this query. Because its an outer join, the outer table must be processed first. SQL Server scans the outer table one time. Each time it finds a row that is not excluded by a WHERE clause, it scans the inner table one time. That inner scan is performed using an index where possible, a scan where it is not. If the server determines that there is no useful index, but its worth the trouble to create one, it reformats the inner table (builds a working index on the join column in tempdb).
So the problem arises in the second query in the procedure when the temporary table becomes large. SQL Server has no opportunity to determine the size of the temporary table at optimization time, so it simply guesses. If you expect the temporary table to be fairly large, use a subprocedure to join the outer table and the temporary table. The subprocedure optimization will occur after the temporary table is created, so it can take into account its size and consider a reformat strategy.
It turns out to be a little tricky to create the inner procedure because it expects the temporary table already to exist when it is created; however, the outer procedure does not expect this. Create the objects in this order:
Save it all as a single script so that you can create the procedures any time you need them.
-- drop both procs if exists (select * from sysobjects where name = state_sales_inner) drop proc state_sales_inner go if exists (select * from sysobjects where name = state_sales) drop proc state_sales go -- create the temporary table select title_id, qty into #temp from sales s, stores st where state = MA and s.stor_id = st.stor_id and 1 = 2 -- speeds up table creation go -- create the inner procedure create proc state_sales_inner with recompile -- always recheck the size of #temp as select t.title_id, pub_id, sum(qty) CAqty from titles t, #temp tp where t.title_id *= tp.title_id group by t.title_id, pub_id if @@error <> 0 return 99 else return 0 go -- drop the temporary table drop table #temp go -- create the outer procedure create proc state_sales (@state char(2)) as declare @ret int select title_id, qty into #temp from sales s, stores st where state = @state and s.stor_id = st.stor_id if @@error <> 0 return 98 execute @ret = state_sales_inner drop table #temp if @ret <> 0 return 99 else return 0 go
Ill grant you that this is a lot of work compared to the first solution, but its also the best way to guarantee that no problems will arise later on as you add more data to your tables. (Note that temporary tables are dropped when the server ends the connection or procedure where they were created. Its not necessary, but it is good manners, to drop temporary tables when you are finished with them in a procedure.)
Previous | Table of Contents | Next |