Remix! Optimized: Query by Hour, Day, Week, or Month

I originally posted this on 11/16/09. I repost here for the joy of TSQL Tuesday, and to add a very important change to the original solution.

Here's the accompanying video – same material, different format. The video doesn't include the change, by the way.

A solution is ever so much more elegant if you know the path that led to it.  Which is another way of saying, you only know how totally smart I am if you see how much work I had to do 🙂

I had to do a lot of work today – we have a stored procedure that's being called a few times a second, and maxing out 8 processors on our beefy test server during a moderate load test.  So I dug in and had some fun. When it came to the final fine tuning, I was looking at a query full of date manipulation.  That sort of thing is performance death, so I stepped through it.  We started with something like this, a query that allows the user to get a SUM by hour, day, week, or month.  Here we'll just show week and month for brevity:

SELECT SUM(o.Sale) AS Sales
FROM Orders o
WHERE o.BizID = @BizID
AND (
    -- Interval: Week
    ( @Interval = 'wk'
      AND DATEPART(wk, o.OrderDate) = DATEPART(wk, GETDATE())
      AND DATEPART(yyyy, o.OrderDate) = DATEPART(yyyy, GETDATE())
    )
    -- Interval: Month
    OR ( @Interval = 'mm'
      AND DATEPART(mm, o.OrderDate) = DATEPART(mm, GETDATE())
      AND DATEPART(yyyy, o.OrderDate) = DATEPART(yyyy, GETDATE())
    )
)
GROUP BY o.BizID

Principle #1: Free the Functions 

Right now we're looking disdainfully at all those DATEPARTs and GETDATE()s.   Principle #1: Get functions out of the WHERE clause, if possible.  That's easy enough: we can declare @wk, @mm, and @yyyy variables and set them before our query. Let's also change this from OR to IF (for more on that, see my previous blog, IF is better than OR):

SELECT @wk = DATEPART(wk, GETDATE()), @mm = DATEPART(mm, GETDATE()), @yyyy = DATEPART(yyyy, GETDATE())

-- Interval: Week
IF @Interval = 'wk'
    SELECT SUM(o.Sale) AS Sales
    FROM Orders o
    WHERE o.BizID = @BizID
    AND DATEPART(wk, o.OrderDate) = @wk
    AND DATEPART(yyyy, o.OrderDate) = @yyyy
    GROUP BY o.BizID
-- Interval: Month
ELSE IF @Interval = 'mm'
    SELECT SUM(o.Sale) AS Sales
    FROM Orders o
    WHERE o.BizID = @BizID
    AND DATEPART(mm, o.OrderDate) = @mm
    AND DATEPART(yyyy, o.OrderDate) = @yyyy
    GROUP BY o.BizID

Principle #2: Free the Optimizer 

That'll help a lot, but we're not through yet.  It occurs to me that this still isn't an elegant solution.  The real brainstorm comes: if we compare date part to date part, we'll never get away from DATEPART.  What's better than that?  How about comparing the whole date to another whole date, allowing the optimizer to use that column's index?  Booya!  Principle #2: let the optimizer do its job (and that includes making indexes available). Use @dateStart and @dateEnd instead…see:
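For the range predicate to actually get a seek, of course, there has to be an index with OrderDate in it. Something along these lines would do it – the index name and column choices here are just an illustration, not from the original system:

```sql
-- Hypothetical covering index: the query can seek on (BizID, OrderDate),
-- and INCLUDE (Sale) means the SUM never has to touch the base table.
CREATE NONCLUSTERED INDEX IX_Orders_BizID_OrderDate
ON dbo.Orders (BizID, OrderDate)
INCLUDE (Sale);
```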

SET @date = CONVERT(VARCHAR(10), GETDATE(), 101) -- This gives us a clean date with time set at midnight, 00:00:00

-- Interval: Week
IF @Interval = 'wk'
BEGIN
    SELECT @dateStart = DATEADD(dd, -1 * ( DATEPART(dw, GETDATE()) ) + 1, @date)
    SELECT @dateEnd = DATEADD(dd, 7, @dateStart)
    SELECT SUM(o.Sale) AS Sales
    FROM Orders o
    WHERE o.BizID = @BizID
    AND o.OrderDate BETWEEN @dateStart AND @dateEnd
    GROUP BY o.BizID
END
-- Interval: Month
ELSE IF @Interval = 'mm'
BEGIN
    SELECT @dateStart = DATEADD(dd, -1 * ( DATEPART(dd, @date) ) + 1, @date)
    SELECT @dateEnd = DATEADD(mm, 1, @dateStart)
    SELECT SUM(o.Sale) AS Sales
    FROM Orders o
    WHERE o.BizID = @BizID
    AND o.OrderDate BETWEEN @dateStart AND @dateEnd
    GROUP BY o.BizID
END

Principle #3: Happify Your Code 

We'll go through all that date math here in a minute, but first things first. Notice we've gotten all the funky functions and date manipulation out of our WHERE clause, where it'd mess up our friend the SQL optimizer, and placed it where it belongs: as a one-shot above the query itself.  Oh and hey, see how the two SELECT statements are identical?  Let's pull those out of the IF/ELSE and make the code more compact, readable, and easy to manage (oh hai, principle #3!):

SET @date = CONVERT(VARCHAR(10), GETDATE(), 101) -- This gives us a clean date with time set at midnight, 00:00:00

-- Interval: Week
IF @Interval = 'wk'
BEGIN
    SELECT @dateStart = DATEADD(dd, -1 * ( DATEPART(dw, GETDATE()) ) + 1, @date)
    SELECT @dateEnd = DATEADD(dd, 7, @dateStart)
END
-- Interval: Month
ELSE IF @Interval = 'mm'
BEGIN
    SELECT @dateStart = DATEADD(dd, -1 * ( DATEPART(dd, @date) ) + 1, @date)
    SELECT @dateEnd = DATEADD(mm, 1, @dateStart)
END
SELECT SUM(o.Sale) AS Sales
FROM Orders o
WHERE o.BizID = @BizID
AND o.OrderDate BETWEEN @dateStart AND @dateEnd
GROUP BY o.BizID

Now, let's sit back and have a plate of #bacon. Wait! We still have date math, and the big change…

Date Math 

We were originally just thinking of matching month to month and year to year, as in our first example.  But that's not set-based logic.  To compare a date to a date, we need a range to compare the column value against.  In our example we're always going for the current date range (e.g., THIS month, THIS week, etc.).  So start with a clean, no-time @date:

SET @date = CONVERT(VARCHAR(10), GETDATE(), 101) -- This gives us a clean date with time set at midnight, 00:00:00

Today that gives us 11/16/2009.  Pseudocode: use that to get the first of the month: today's date, minus the day-of-month number (16), plus one = @date – DATEPART(dd, @date) + 1 = @date + (-1 * DATEPART(dd, @date)) + 1 =

SELECT  @dateStart = DATEADD(dd,-1 * ( DATEPART(dd,@date) ) + 1,@date) 

And @dateEnd is just midnight on the first day of next month, or DATEADD(mm, 1, @dateStart).  You can use this same kind of logic for hour:

                SELECT  @dateStart = DATEADD(hh, DATEPART(hh, GETDATE()), @date) -- Clean date at midnight, plus the current hour
                SELECT  @dateEnd = DATEADD(hh, 1, @dateStart) -- Start date plus 1 hour

And day is even simpler:

               SELECT  @dateStart = @date               
               SELECT  @dateEnd = DATEADD(dd, 1, @dateStart) 

For me, week was the tricky part, as we didn't wind up using the "week" date part at all. Pseudocode: start with the day of the week (Monday is 2, for today's example), subtract that many days from today's date, then add 1 to land on Sunday of this week = today – day of week(today) + 1 =

                SELECT  @dateStart = DATEADD(dd,-1 * ( DATEPART(dw,@date) ) + 1,@date)
                SELECT  @dateEnd = DATEADD(dd, 7, @dateStart)

Nice, huh?

One More Thing: The Big Change

That'd be perfect and beautiful and awesome, except for one thing. The way we have this set up, we'll get orders back for the start of the next interval.  For example, if we want orders for the 10:00 hour, the time for our end date is 11:00…so if an order was placed at precisely 11:00:00.000, it'll be returned. We don't want that…we want rows for orders placed up to BUT NOT INCLUDING the start of the next hour.  To enforce this, we subtract 2 ms from our end date.

We don't subtract 1 ms from the time because of an interesting behavior of SQL Server: if you subtract 1 millisecond (SELECT DATEADD(ms,-1,'2009-01-01 12:00:00')), you'll notice that it returns '2009-01-01 12:00:00.000' instead of '2009-01-01 11:59:59.999'.  But if you subtract 2 milliseconds (SELECT DATEADD(ms,-2,'2009-01-01 12:00:00')), it returns '2009-01-01 11:59:59.997' instead of '2009-01-01 11:59:59.998'. 
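You can watch this rounding happen for yourself on any SQL Server instance – DATETIME stores time in 1/300-second ticks, so values always land on .000, .003, or .007 milliseconds:

```sql
-- DATETIME is precise to ~3.33 ms; results round to the nearest tick
SELECT DATEADD(ms, -1, CONVERT(DATETIME, '2009-01-01 12:00:00')) AS MinusOneMs, -- 12:00:00.000 (rounds back up!)
       DATEADD(ms, -2, CONVERT(DATETIME, '2009-01-01 12:00:00')) AS MinusTwoMs  -- 11:59:59.997
```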

The explanation can be found in Kimberly Tripp's article on Partitioned Tables and Indexes in SQL Server 2005:

"…datetime data does not guarantee precision to the millisecond. Instead, datetime data is precise within 3.33 milliseconds. In the case of 23:59:59.999, this exact time tick is not available and instead the value is rounded to the nearest time tick that is 12:00:00.000 A.M. of the following day."

So, our final solution just needs a minor modification to @dateEnd:

                SELECT  @dateStart = DATEADD(dd,-1 * ( DATEPART(dw,@date) ) + 1,@date)
                SELECT  @dateEnd = DATEADD(ms,-2,DATEADD(dd, 7, @dateStart))

The end.

Review

Principles:

  • Principle #1: Get functions out of the WHERE clause, if possible.  They're performance death.  
  • Principle #2: Let the optimizer do its job.  Make sure indexes are available, and the query can use them.
  • Principle #3: If all else remains equal, use fewer lines of code….make code more compact, readable, and easy to manage.
  • And don't forget to exclude the start of the next time interval with DATEADD(ms, -2, yourdate)

The two major things we did for this query were to get rid of the ORs (see IF is better than OR), and to change our thinking from comparing date parts to comparing the whole OrderDate to a range of dates.
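Putting it all together – this is just a sketch, assuming the same @Interval, @BizID, and Orders table as above, with all four intervals filled in – the final shape looks like this:

```sql
DECLARE @date DATETIME, @dateStart DATETIME, @dateEnd DATETIME

SET @date = CONVERT(VARCHAR(10), GETDATE(), 101) -- today, time set to midnight

IF @Interval = 'hh'      -- current hour
BEGIN
    SELECT @dateStart = DATEADD(hh, DATEPART(hh, GETDATE()), @date)
    SELECT @dateEnd   = DATEADD(ms, -2, DATEADD(hh, 1, @dateStart))
END
ELSE IF @Interval = 'dd' -- current day
BEGIN
    SELECT @dateStart = @date
    SELECT @dateEnd   = DATEADD(ms, -2, DATEADD(dd, 1, @dateStart))
END
ELSE IF @Interval = 'wk' -- current week
BEGIN
    SELECT @dateStart = DATEADD(dd, -1 * DATEPART(dw, @date) + 1, @date)
    SELECT @dateEnd   = DATEADD(ms, -2, DATEADD(dd, 7, @dateStart))
END
ELSE IF @Interval = 'mm' -- current month
BEGIN
    SELECT @dateStart = DATEADD(dd, -1 * DATEPART(dd, @date) + 1, @date)
    SELECT @dateEnd   = DATEADD(ms, -2, DATEADD(mm, 1, @dateStart))
END

-- One clean, sargable range query; no functions touch the column
SELECT SUM(o.Sale) AS Sales
FROM Orders o
WHERE o.BizID = @BizID
AND o.OrderDate BETWEEN @dateStart AND @dateEnd
GROUP BY o.BizID
```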

Eat some #bacon. The end. Really.

-Jen McCown, http://www.MidnightDBA.com

Highly Amusing Index Frag Anecdote

I can't believe I haven't told this story from SQLPASS yet… and by the way, here's Paul Randal's blog on this same event/topic.

One of the few sessions I managed to attend was a really cool Grant Fritchey (blog, twitter) session on DMVs.  He was talking about the DMV that includes index fragmentation data (sys.dm_db_index_physical_stats) and mentioned briefly that below a certain page count, he doesn't bother defragmenting.  During the Q&A (and Tweeting all the while) I asked Grant, "For you, how many pages does a table need to have before you start worrying about the fragmentation level?"

He answered that his lower cutoff is in the single digits, because defragmenting a table that small will make no difference at all.  And "my [upper] cutoff is at about 100 pages, but Microsoft says 1,000 pages. But even a 30 page table can wreak havoc on a query if highly fragmented." 

Paul Randal (blog, twitter) was very active on Twitter during the summit, and saw the tweets. He replied immediately: "#sqlpass 1000 is a number I made up when back at MS, but it only applies if the table is in memory. Otherwise, sure, defrag."  I got to read that out while we were still in Grant's session, and it got a big laugh.

-Jen McCown

http://www.MidnightDBA.com

Proving Views Can Hurt Performance

Content rating: Beginner to experienced

For some of you, this will be a "no duh" blog.  But I've had this argument with people, so here we go.

Views can hurt performance. Not by their very existence, no.  Not by rule, no.  But in certain situations, yes. The situation is: If you're querying off of a view that's pulling much more data than you need (horizontally or vertically), and/or doing more processing than you need (outer joins, date functions, case statements), your query may have worse performance than just running it off the base tables.

As an example, I refer you to my previous blog, in which we restored Brent Ozar's Twittercache database. Get it and restore if you want to follow along.

Once you've done that, create this view in the TwitterCache database:

CREATE VIEW userTweetsHistory
AS
SELECT a.[text] ,
    b.[user_id] ,
    c.name ,
    c.screen_name ,
    c.cached_date ,
    b.cached_date AS HistoryCachedDate ,
    b.followers_count
FROM Statuses a
JOIN UsersHistory b ON b.[user_id] = a.[user_id]
LEFT OUTER JOIN Users c ON c.screen_name = b.screen_name
    AND CONVERT(VARCHAR(10), c.cached_date, 101) = CONVERT(VARCHAR(10), b.cached_date, 101)
WHERE c.screen_name NOT LIKE 'SQL%'
AND c.cached_date > '11/1/2009'

What we have here is a view…a view that no one would ever really need, but a view nonetheless.  I've seen many similar in my day, so don't try to tell me I'm making this unrealistically complicated.  What we want is the screen_name and cached_date for 'midnightdba' where followers_count is > 100. Turn on "Include Actual Execution Plan", and run these two statements together:

SELECT screen_name ,
    HistoryCachedDate
FROM userTweetsHistory
WHERE screen_name = 'midnightdba'
AND followers_count > 100

SELECT screen_name ,
    cached_date
FROM UsersHistory
WHERE screen_name = 'midnightdba'
AND followers_count > 100

The first reads from the view, the second from the base table.  The first query is 78% of the batch (on my box)…it takes FOUR TIMES AS MUCH EFFORT for SQL Server to run the query from the view as it does the query from the base table. Again, this is because the view is doing far more work than this particular query needs it to do.
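If you'd rather compare hard numbers than plan-cost percentages (plan cost is an estimate, after all), turn on I/O and time statistics before running the pair and compare the logical reads reported for each query:

```sql
SET STATISTICS IO ON
SET STATISTICS TIME ON
-- Now run the view query and the base-table query together, and
-- compare the 'logical reads' and CPU time for each in the Messages tab.
```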

The bottom line is NOT that views are bad, or that views hurt performance.  The bottom line is that inappropriate views hurt performance.

Thanks for playing,

Jen McCown

http://www.MidnightDBA.com