SELECT, Deconstructed

Today let’s expand on the logical processing order of SELECT that I mentioned in last week’s N Things Worth Knowing About SELECT blog.

We’re looking at the SELECT statement clauses in the order that the SQL engine logically processes them…we’ll even write it that way – it’ll look weird, but we’ll be reading it like the SQL engine does.

You can find more on SELECT’s logical processing order in BOL, and in Itzik Ben-Gan’s T-SQL Fundamentals and T-SQL Programming books.

Here’s our basic T-SQL query (using AdventureWorks):

SELECT P.Name ,
  P.ProductNumber ,
  P.Color ,
  P.StandardCost ,
  SC.Name [Category]
FROM Production.Product P
LEFT OUTER JOIN Production.ProductSubCategory SC ON P.ProductSubCategoryID = SC.ProductSubCategoryID
WHERE P.ProductSubCategoryID IS NOT NULL

If we get all pseudo-mathematical on this, we can say that there’s some theoretical set of rows that we’ll get back out of this query; each logical step – in order – further refines that set of rows until we get back the actual rowset we want.

FROM

The SQL engine starts with the FROM clause first, to see where the data’s coming from. In this case we’re pulling from the Product table…easy. So, the theoretical rowset right now is everything in the Product table. (You can think about each step passing on that theoretical rowset to the next step for further refinement.)

-- Logical order:

FROM Production.Product P

ON/JOIN

Next we look at the ON and JOIN clauses: JOIN tells us what data structure (remember, it doesn’t have to be a table) we’re hooking up with, and ON give us the criteria for hooking our result sets together. We’re still building our theoretical resultset here; a JOIN can either expand the resultset, or limit it.

-- Logical order:

FROM Production.Product P
ON P.ProductSubCategoryID = SC.ProductSubCategoryID
LEFT OUTER JOIN Production.ProductSubCategory SC

Notice that the ON clause is evaluated before the WHERE clause; so we have to understand that where we put a limiting clause (like Color=’black’) can make a huge difference in our resultset. In this case, including the search condition “color=’black'” in the ON clause makes no difference, because it’s a LEFT OUTER JOIN…it’s going to return all the rows from the left side – black, red, blue, etc. – regardless of the join condition. Go on, try it:

SELECT P.Name ,
  P.ProductNumber ,
  P.Color ,
  P.StandardCost ,
  SC.Name [Category]
FROM Production.Product P
LEFT OUTER JOIN Production.ProductSubCategory SC ON P.ProductSubCategoryID = SC.ProductSubCategoryID
AND P.color='black'
WHERE P.ProductSubCategoryID IS NOT NULL

Toldya. In the ON clause, we’re comparing resultsets to make a match in a specific way. In the WHERE clause, we’re limiting the resultset that we got. Now, if this were an inner join, we could include the color=’black’ in the ON clause; it would limit the resultset just as if we’d said it IN WHERE. But I’m still more likely to put that condition in the WHERE clause; it’s more “proper”, in the sense that if that query ever needs to change to an outer join, the condition belongs in WHERE.

WHERE

Speaking of where…this is clearly the place where we’d want to whittle our resultset down to just the rows that we really want.

-- Logical order:

FROM Production.Product P
ON P.ProductSubCategoryID = SC.ProductSubCategoryID
LEFT OUTER JOIN Production.ProductSubCategory SC
WHERE P.ProductSubCategoryID IS NOT NULL AND Color='black'

In this statement, it’s the last refinement we make to our resultset, so next the SQL engine would evaluate the SELECT clause to see what columns to pull, and our logically-ordered query would look like this:

-- Logical order:

FROM Production.Product P
ON P.ProductSubCategoryID = SC.ProductSubCategoryID
LEFT OUTER JOIN Production.ProductSubCategory SC
WHERE P.ProductSubCategoryID IS NOT NULL AND Color='black'
SELECT P.Name ,
  P.ProductNumber ,
  P.Color ,
  P.StandardCost ,
  SC.Name [Category]

But, there’s a lot more we can do in a SELECT statement, so let’s continue with a more complex query that includes GROUP BY, HAVING, and ORDER BY:

SELECT P.Color ,
  COUNT(*) ItemCount,
  MIN(P.StandardCost) MinCost ,
  MAX(P.StandardCost) MaxCost ,
  AVG(P.StandardCost) AvgCost ,
  SC.Name [Category]
FROM Production.Product P
LEFT OUTER JOIN Production.ProductSubCategory SC ON P.ProductSubCategoryID = SC.ProductSubCategoryID
WHERE P.ProductSubCategoryID IS NOT NULL
AND P.color='black'
AND AVG(P.StandardCost) > 100
GROUP BY SC.Name, P.color
HAVING AvgCost > 100

GROUP BY

After SQL gets the base rowset down (by evaluating FROM, JOIN/ON, and WHERE), it looks at the GROUP clause to see if we’re going to lump some rows together. Notice, by the way, that we STILL haven’t looked at the actual SELECT clause:

-- Logical order:

FROM Production.Product P
ON P.ProductSubCategoryID = SC.ProductSubCategoryID
LEFT OUTER JOIN Production.ProductSubCategory SC
WHERE P.ProductSubCategoryID IS NOT NULL AND Color='black'
GROUP BY SC.Name, P.color

Today we want some aggregate data about each category, so we’ll GROUP BY the category name and color – when grouping, anything that’s not an aggregate (SUM, MIN, MAX, COUNT, etc) in the select list, MUST to be in the GROUP BY clause.

HAVING

A quick bit of trivia: the HAVING clause doesn’t require a GROUP BY clause; in that case, it just behaves like a WHERE. Oh, and speaking of WHERE…why couldn’t we just put the “AND AVG(P.StandardCost) > 100” in the WHERE clause? Isn’t it a limiting factor, just like Color and subcategoryID?

Well, yeah it is, but remember: SQL evaluated WHERE, then GROUP BY (the aggregation clause), and THEN the HAVING clause. The aggregate isn’t allowed in the WHERE clause, because at that point, SQL hadn’t heard anything about grouping data…all the rows were still autonomous. So, aggregates go in HAVING:

-- Logical order:

FROM Production.Product P
ON P.ProductSubCategoryID = SC.ProductSubCategoryID
LEFT OUTER JOIN Production.ProductSubCategory SC
WHERE P.ProductSubCategoryID IS NOT NULL AND Color='black'
GROUP BY SC.Name, P.color
HAVING AvgCost > 100

SELECT

After HAVING, SQL finally takes a look at the SELECT clause itself, and our theoretical resultset becomes an actual resultset.

-- Logical order:

FROM Production.Product P
ON P.ProductSubCategoryID = SC.ProductSubCategoryID
LEFT OUTER JOIN Production.ProductSubCategory SC
WHERE P.ProductSubCategoryID IS NOT NULL AND Color='black'
GROUP BY SC.Name, P.color
HAVING AvgCost > 100
SELECT P.Color ,
COUNT(*) ItemCount,
MIN(P.StandardCost) MinCost ,
MAX(P.StandardCost) MaxCost ,
AVG(P.StandardCost) AvgCost ,
SC.Name [Category]

DISTINCT, ORDER BY, TOP

From here, SQL will process DISTINCT (if we included it), then ORDER BY (if we included it), then TOP (ditto). Let’s go ahead and include an ORDER BY:

-- Logical order:

FROM Production.Product P
ON P.ProductSubCategoryID = SC.ProductSubCategoryID
LEFT OUTER JOIN Production.ProductSubCategory SC
WHERE P.ProductSubCategoryID IS NOT NULL AND Color='black'
GROUP BY SC.Name, P.color
HAVING AvgCost > 100
SELECT P.Color ,
  COUNT(*) ItemCount,
  MIN(P.StandardCost) MinCost ,
  MAX(P.StandardCost) MaxCost ,
  AVG(P.StandardCost) AvgCost ,
  SC.Name [Category]
ORDER BY AvgCost

Here’s a big mystery revealed: THIS is why the ORDER BY clause is the ONLY clause that can refer to column aliases: ORDER BY is the only full clause that’s evaluated after the aliases are! Put it this way: we read the query top down, while SQL saves the top for nearly last. When you shout an alias name at it in your WHERE, SQL has NO IDEA what on earth you’re talking about.

And….there you go. Happy days,

Jen

http://www.MidnightDBA.com/Jen

4 thoughts on “SELECT, Deconstructed

  1. Pingback: Tweets that mention SQL Awesomesauce » Blog Archive » SELECT, Deconstructed -- Topsy.com

  2. Pingback: SQL Awesomesauce » Blog Archive » SELECT: If ON preceeds JOIN, then how….

  3. Pingback: SQL Awesomesauce » Blog Archive » Learn T-SQL in 5 Minutes

  4. Ben Smith

    Thanks for the post, Jen, very helpful.

    Did want to ask about one thing to make sure I’m not misunderstanding. About midway through the post you have the statement:

    *****
    SELECT P.Color ,
    COUNT(*) ItemCount,
    MIN(P.StandardCost) MinCost ,
    MAX(P.StandardCost) MaxCost ,
    AVG(P.StandardCost) AvgCost ,
    SC.Name [Category]
    FROM Production.Product P
    LEFT OUTER JOIN Production.ProductSubCategory SC ON P.ProductSubCategoryID = SC.ProductSubCategoryID
    WHERE P.ProductSubCategoryID IS NOT NULL
    AND P.color=’black’
    AND AVG(P.StandardCost) > 100
    GROUP BY SC.Name, P.color
    HAVING AvgCost > 100
    ***

    But later on you mention that aggregates do not belong in the WHERE clause, but rather the HAVING clause. In the statement above, should the AND AVG(P.StandardCost) > 100 be removed from the WHERE clause?

    Another related question… you state that ORDER BY is the only clause that can refer to column aliases. Why then does the HAVING clause in the SQL above refer to a column alias?

    Thanks for the clarification,
    Benjamin

Comments are closed.