What does a bad query look like?

In my SQL Marklar blog today I discussed troubleshooting DB processes.  I'm not going to re-hash all of it here, but I did want to tell you about a use case that perfectly illustrates what I was talking about.

Not so long ago I got a call from one team telling me they had some server issues.  Everything was moving slow, they said.  Ok, so I got on and took a look, and nothing was really jumping out at me.  Then I put a Profiler trace on it to dig a little deeper.  And of course, I knew nothing about the app or the processes, so I really didn't know what I was looking for, but you've gotta start somewhere, huh?

So there I am in Profiler, just looking for long-running queries.  The problem was that there were lots of queries I would consider long-running.  For some reason I focused on a single SP that was taking like 5 mins.  I pulled up the text of the SP and started looking through it.  It all seemed fairly standard.  I mean, it was long and everything wasn't perfect, but there was nothing out of the ordinary.
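(Side note: if you don't have Profiler handy, the plan cache DMVs will give you a similar list of long runners.  This is just a generic sketch, nothing specific to the system in this story:)

    -- Rough sketch: top statements by average elapsed time from the plan cache.
    -- Not what I ran here (I was in Profiler), just another way to get the same kind of list.
    SELECT TOP (20)
        qs.total_elapsed_time / qs.execution_count / 1000.0 AS avg_elapsed_ms,
        qs.execution_count,
        SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
            ((CASE qs.statement_end_offset
                WHEN -1 THEN DATALENGTH(st.text)
                ELSE qs.statement_end_offset
              END - qs.statement_start_offset) / 2) + 1) AS statement_text
    FROM sys.dm_exec_query_stats AS qs
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
    ORDER BY avg_elapsed_ms DESC;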

I contacted the app guy again and asked about it.  Does this SP typically take this long to run?  No, he says (and those of you who have seen My Cousin Vinny know where this is going).  So I thought eureka, I'd actually found something that might fix the issue.  I got a couple of valid params from him and ran the SP with them, making sure to turn on execution plans and statistics IO.  The query plan had some dings in it that experience has told me could easily cause this kind of spike in resource usage.  The problem was that there was no fragmentation and stats were up to date.  And in talking with the app guy he told me they had just archived a bunch of the data, so it was down to like 200 mill rows now.  So why would this thing be taking so long to return?  Moving on.
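By the way, for anyone who hasn't done that part before, here's roughly what it looks like.  The proc, params, and table names are made up for illustration:

    -- Turn on runtime stats before running the proc (in SSMS, also include the actual execution plan).
    SET STATISTICS IO ON;
    SET STATISTICS TIME ON;
    EXEC dbo.usp_SomeSearchProc @ClientID = 12345;   -- hypothetical proc and param

    -- Check fragmentation on one of the big tables the proc hits.
    SELECT OBJECT_NAME(ips.object_id) AS table_name,
           i.name AS index_name,
           ips.avg_fragmentation_in_percent,
           ips.page_count
    FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.SomeBigTable'), NULL, NULL, 'LIMITED') AS ips
    JOIN sys.indexes AS i
        ON i.object_id = ips.object_id AND i.index_id = ips.index_id;

    -- And see how fresh the statistics are.
    SELECT s.name AS stats_name,
           STATS_DATE(s.object_id, s.stats_id) AS last_updated
    FROM sys.stats AS s
    WHERE s.object_id = OBJECT_ID('dbo.SomeBigTable');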

I found his QA system, which had been copied over from prod the previous week, and he assured me they had changed nothing.  I could see the extra rows in the tables (it was copied before the archival), and the indexes were the same as in prod, so that wasn't the issue.  They had the same fill factor, everything.  In fact, everything I checked was identical except for the amount of data.  So why would having less data cause such a huge performance issue?  Moving on.
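(If you're wondering how to compare that kind of thing, most of it is sitting in sys.indexes.  The table name below is a made-up stand-in; run it on both boxes and diff the output:)

    -- Compare index definitions and fill factor between two servers (hypothetical table name).
    SELECT OBJECT_NAME(i.object_id) AS table_name,
           i.name AS index_name,
           i.type_desc,
           i.is_unique,
           i.fill_factor,
           i.is_disabled
    FROM sys.indexes AS i
    WHERE i.object_id = OBJECT_ID('dbo.SomeBigTable')
    ORDER BY i.index_id;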

I decided that running this thing again and again on prod was probably a bad idea; I was just adding to the issue.  So I started doing the rest of my work on his QA box, where I was the only spid.  And the hardware was similar as well (I love it when it works out that way).  So I ran the SP on this box and 5 mins passed.  Then 10 mins.  Then 15 mins.  Then 20 mins.  And sometime soon after that, the query returned.  I had collected all my stats along the way, so I was golden.  It was getting the same execution plan as the prod version.  The results weren't what I expected at all.  Why was the prod version performing well in comparison?  So I called the app guy again and explained the situation.  Here's more or less how the conversation went:

Me:  You know, I just don't know what's going on here.  I've been looking at this for a long time now and I'm just digging myself further into a hole.  The prod version of this SP takes 5 mins, and that's even after the archival.  But when I run it on QA with all the data there, it takes even longer.  If the prod query were acting up, I would expect the QA query to take less time even with the extra data.

Guy:  Yeah that sounds about right to me.

Me:  What sounds right to you?  (I just got a bad feeling that something horrible had gone wrong.  You know how you can instantly drop all the pieces into place and remember key words that make everything all of a sudden fit together?  Well, I just got that, but I wanted to hear him say it.)

Guy:  This SP usually takes about that much time, but since the archival it went down to 5mins.  We’ve been very pleased.

Me:  So you mean to tell me that when I came to you with this you didn’t find it necessary to tell me that the 5mins was an improvement?

Guy:  Well, I don’t know anything about DBs so I figured you could see that kinda thing already.

Me:  I see.  Well that clears up that mystery.  Now I’ve gotta go back and start over with this whole process.

Guy:  Well I can tell you the one that’s probably causing the issue.

Me:  Oh yeah?  How’s that?

Guy:  Because the slowness is only in our billing section and that’s controlled by just a few queries.  I can give you the names of the SPs and you can look at those.  There are only like 5 of them and since we’re having a hard time pulling up a list of clients it’s likely going to be the one that controls the search on that.

Me:  I see.  So you were just never going to tell me that?  I’ve been messing with this for 2hrs and you could have given me all this info before and saved me tons of time.

Guy:  Well, again, I don’t know anything about DBs and I figured you could see all that.

Me:  You thought I could see the web app from the DB?

Guy:  You can’t?

Me:  Kill me.

So ok, it turned out to be one of the 5 he gave me.  It had a bad query plan.  I wasn’t able to determine that all on my own, btw.  I had to recompile each one of them until I found the bad one.  And that’s because I didn’t have a perf baseline like I discussed on Marklar.
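For the record, marking a proc for recompile is a one-liner per proc; it just flags the proc so it builds a fresh plan on its next run.  These names are made-up stand-ins for the 5 he gave me:

    -- sp_recompile marks the object so it gets a new plan the next time it runs.
    -- Proc names here are hypothetical placeholders.
    EXEC sp_recompile N'dbo.usp_ClientSearch';
    EXEC sp_recompile N'dbo.usp_ClientBillingSummary';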

So there are a couple of lessons to learn here, but I think you can glean them for yourself.  The whole point of this, though, is that making assumptions about processes is bad, and no matter what looks like a long-running query to you, it may in fact be performing better than usual.

2 thoughts on “What does a bad query look like?”

  1. Speaking as a developer, I can relate to the situation, but I think it would have been remedied if there had been more open dialogue between the DBAs and the Developers. Sure, there are at times silos that persist, but usually, being open with the DBAs, and the DBAs being open in return, has paid huge dividends when the team is working toward the same goal: improving the systems the company relies on.

    Just my two cents.
    @JEBacaniSQLDude
