There have been many methods that are commonly used to join two or more tables, aggregate the data and present the data in a useful manner. More and more often, SQL programmers attempt to write complex SQL statements that attempt to accomplish this task in one single step. However, these statements are usually quite difficult for others to read and understand and more importantly they tend to suffer from performance problems as the affected data load increases. This is usually due to the heavy use of aggregate clauses in the SQL such as ‘order by’ and ‘group by’.
There are very simple methods for achieving the same result without using aggregate statements or overly complex SQL statements that may as well be written in hieroglyphics. In this article, I will walk you through a very simple method for creating a stored procedure that will perform fast and more reliably and be virtually unaffected by increased data load.
The scenario for this task is fairly straightforward. Let’s assume I have two tables, one containing customer data and the other containing customer invoices. For this exercise we will assume that I need to extract the total sum for all of the customers in the state of Texas.The first step is to create the stored procedure, as follows:
CREATE PROCEDURE GetTotalsByState
Then we create the variables we will use for processing. I will create only three, an integer @CustId to hold the id number for each customer, an integer @rowcount to contain the number of rows yet to be processed and a money variable @InvTotal to hold the summation of the invoices.
Now we will build what is called a pseudo-cursor. I provides the same functionality as a standard cursor often used in SQL but will consume only a small fraction of the resources. It involves a temporary table and an endless ‘WHILE’ loop which we will use to control processing.
Building a pseudo cursor begins with creating the temporary table. This table will hold a list of all the customers in Texas. This will serve as the control for the infinite loop. However, care should be taken as you create this table. We first check to see if the table exists and drop it if that is the case. We will then explicitly create the table using a ‘CREATE TABLE’ statements. The table should have a column for the customer name, the customer’s ID and the invoice totals. All of which should be ‘NULL’. And at last, insert the customer’s name and their ‘ID’ into the temporary table.
Now, I already know what you are thinking… “Why can’t we do this all in one statement?” The answer is simple, if we use a ‘SELECT INTO’ statement to both create the table and populate it, we have included both DDL (data definition language) and DML (data manipulation language). Doing so will force the statement to recompile each time it is executed. That can cause severe performance degradation in a busy environment.
Now that we have a temporary table that contains the needed information regarding our customers in Texas, we need to get the sum of each clients invoices. To do this, we will create an endless loop using a ‘WHILE’ statement, as follows:
WHILE (1=1)
The next step is to determine how many customers are left to be processed. This is done by selecting a count of all of the rows in the ‘temp’ table where the invoice total is null. If that number is zero, we have completed processing and end the loop with a ‘BREAK’ statement. Otherwise, we continue processing.
Now we select the first row in the temp table that requires processing using a ‘SELECT TOP (1)’ statement and set out @CustId variable to the ID for that customer in the temp table. Once we have that number, we simply select the ‘SUM’ of all rows in the customer invoice table into the variable @InvTotal. It is true that ‘SUM’ is an aggregate, but it is a requirement of this exercise.
Now that we have our total, update the customer total column in the temp table for that customer. Test the @rowcount variable once again and proceed as required. When the @rowcount variable returns zero, your temp table will contain all of the data required by the rules we were provided. You are now free to utilize the data as necessary. Once the procedure ends, the temporary table will drop.
Remember, as a rule, ‘Group by’, ‘Order by’, ‘Between’, ‘In’ and ‘Like’ should be avoided unless there is a compelling reason to use them. This method takes more time to code but there are no ‘joins’ or aggregate functions (save the ‘sum’) that could cause performance problems as the data load increases.
By Tom Inabnet
Image courtesy of Héctor García – Flickr License