Tag Archives: tuning

Change your process

This is an excellent example of how you need to be flexible with your processes, even when you’re in the middle of a project.

We started a project to move a DB to a new set of disks. Since the files are large, we probably weren't going to be able to fit it into a single downtime, so we were going to just move one file at a time over the next few weeks until they were all done. Well, due to circumstances out of our control, now they all have to be done at the same time. The problem is that the file copies are going to take in excess of 6hrs, which is way longer than any downtime they would give us. I know, right? Don't worry, I've got big problems with them forcing us to do the operation in one large chunk like this and then saying we can't have the time to do it. So we did our test last week and it did indeed take about 6hrs to copy all the files. And I don't know why it didn't hit me before, but why not change the process? The copy process was there because we were going to piecemeal the operation over several weeks, but since that's gone, maybe it's time to come up with a new strategy.

So that's what I did. My new strategy is an even simpler backup/restore op. All I have to do when I restore is map the files to their new locations and I'm golden. So far it's 6hrs to copy the files, and I know I can backup in 20-25mins, so I'm guessing the restore will be about 30mins (give or take).
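Here's a minimal sketch of what I mean. The DB name, logical file names, and paths are all made up, but the whole trick is just remapping each file with MOVE during the restore:

RESTORE DATABASE MyBigDB
FROM DISK = 'E:\Backups\MyBigDB.bak'
WITH MOVE 'MyBigDB_Data1' TO 'G:\NewArray\MyBigDB_Data1.mdf',
     MOVE 'MyBigDB_Data2' TO 'G:\NewArray\MyBigDB_Data2.ndf',
     MOVE 'MyBigDB_Log' TO 'H:\NewArray\MyBigDB_Log.ldf',
     REPLACE, STATS = 5;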

Of course, the backup/restore won't perform that well on its own. You have to tune it so it takes advantage of the resources you've got. This is where knowing how to tune your backups comes into play. I often say that tuning backups is frustrating because you can't use a lot of your resources; you've still gotta leave room for user processing on the box. But this is one of those times you can crank it all the way up. Everyone's going to be offline for this op, so the box is completely mine. I can use every ounce of RAM and CPU on the server if I want, and that's what I'm going to do. If you're interested in how to go about tuning your backups, you can look at my recent SQLSaturday session on the topic. I did it for SQLSaturday #90 in OKC. You can find the vid page for it here: http://midnightdba.itbookworm.com/Events.aspx
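Just to give you a flavor of what cranking it up looks like, here's the general shape of the backup statement. The numbers below are placeholders you'd have to test on your own box, not a recommendation:

BACKUP DATABASE MyBigDB
TO DISK = 'E:\Backups\MyBigDB.bak'
WITH BUFFERCOUNT = 200,          -- more buffers = more memory the backup can use
     MAXTRANSFERSIZE = 4194304,  -- 4MB per transfer
     STATS = 5;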

So anyway, the point is that just because you've come up with a way to do something, don't set it in stone. If the scope changes in such a way that you can now do it a better way, then don't be afraid to say "Stop, I've got a better way". You may get some pushback from your peers because the project plan is already done and we need to just move forward with what we've got, but that's when you need to push back and say no, this process was developed for a different circumstance and now it's a completely different scenario, so it's no longer a viable method.

Over-tuning Backups

This is going to be a short post, but an important one, because it shows how much difference just thinking about what you're doing can make. I recently taught a backup tuning session at SQLSaturday #90 in OKC. You can see the session recording here if you like. I've been teaching this session off and on recently, so I'm not really sure where he saw it, but a user wrote me with a tuning issue and I thought I'd share it with you.

In his email he outlined a restore scenario that he was unable to tune. No matter what he did, the restore didn't happen any faster, and he expected to see some improvement. In fact, his attempts to tune it had only made it take longer. Here's the code he sent me; I've changed things to protect his environment.

RESTORE DATABASE AdventureWorks
PAGE = '1:200'
FROM DISK = 'c:\AW.bak'
WITH NORECOVERY,
     BUFFERCOUNT = 500,
     MAXTRANSFERSIZE = 2097152;

So there’s nothing wrong with this syntax, and the restore should work just fine. But he’s right, he won’t see any performance gain at all and here’s why. How fast do you think you can restore a single page?
There's just no way to copy 8KB of data any faster than it already goes. And I'm not surprised at all that his restore takes longer. SQL is still gonna spin up all those resources because you're telling it to, but since you're only copying a single page, it's not going to use them. So spinning up the resources won't do you any good on an operation this small and will only serve to cost you extra time. The resource spin-up will only take a few seconds, but on a page restore like this, that can be in the neighborhood of double the time. So it could take the restore from say 6-10secs to like 10-20secs or something like that. So it's really not a big deal time-wise, but it's extra time for zero benefit.

If you came in late to the backup tuning debate, the flags we’re arguing over are buffercount and maxtransfersize.
I’ll let you watch the session above for a full explanation of them, but I’ll go ahead and say here that 500 buffers just doesn’t even make sense. And transferring 8KB worth of data in 2MB chunks doesn’t make sense either.
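To put rough numbers on it: the memory SQL sets aside for the operation is roughly buffercount x maxtransfersize.

500 x 2,097,152 bytes = 1,048,576,000 bytes, or right around 1GB of buffer space reserved to move a single 8KB page.

Every bit of that allocation is overhead here.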

The point here is this is a definite case where there’s really nothing to even tune. You can only move 8KB so fast. And while I applaud his efforts to apply something he learned in his shop, he needs to put a little more thought into where he applies it. However, he did the right thing. He tested, he saw an issue, and then he emailed me. We’ve discussed it fully and he now understands why he saw what he did. This is how you learn. And even more importantly, he’s out there trying to apply something he learned from a session. He’s not content with leaving things the way they are in his shop, and he’s not content with his current level of skill. You go girlfriend.

And he’s given me permission to blog about this issue so I don’t want any emails about how I screwed over a viewer who came to me for help.

Tempdb Contention

I had a nice production problem today that slowed everything down drastically.  I'll spare you the details of the user processes, but when looking in sys.sysprocesses I noticed that the waitresource was '2:%', which means the waits were on pages in tempdb (database id 2).  I also correlated this with the wait_type column in sys.dm_os_waiting_tasks and saw a lot of PAGELATCH_UP waits.  So the first thing I did was pull up the page# in DBCC PAGE, and it turned out to be page type 11.

In my case, here’s what I typed:

DBCC TRACEON(3604)  -- send DBCC output to the client instead of the error log

DBCC PAGE(2, 1, 186204, 3)  -- (database id, file id, page number, print option)

And I might add that there were a lot of them backed up.  I had something like 30 blocked processes and they were all waiting on this same page in tempdb.  Page type 11 is a PFS (Page Free Space) page, so this meant I was having allocation contention in tempdb.

And since I always like the low-hanging fruit, I chose to add more files instead of using -T1118. 

So I added 6 files to the 16 that were already there and the problem cleared up almost instantly.
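If you've never had to do it, adding a tempdb file is a one-liner.  The name, path, and sizes below are made up; in practice you want the new files sized and grown the same as the existing ones so the proportional fill stays even:

ALTER DATABASE tempdb
ADD FILE (
    NAME = 'tempdev17',
    FILENAME = 'T:\TempDB\tempdev17.ndf',
    SIZE = 10GB,
    FILEGROWTH = 1GB
);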

You don't have to use DBCC PAGE though.  As it turns out, I was surfing around afterwards to see what was out there on this issue, and I found a great blog by MCM Robert Davis that has a lovely query that'll tell you right away whether you have tempdb contention.  I was gonna paste the query in here, but go read it for yourself.

A Lovely Statistics Query

Most of you know that you can set SQL to auto update stats.  And I'm sure that most of you know that it invalidates stats at about 20%.  What that means is that when about 20% of the data in the table has changed, the stats are invalidated.  It's actually 500 rows + 20% if you wanna be specific.  And notice I say they're invalidated and not updated.  Stats aren't updated when that 20% mark is reached, they're only invalidated.  They're not rebuilt until they're needed.  Well, for really large tables that can result in stats being outdated longer than they should be, because the amount of changes needed to invalidate them is much higher.  The same issue can occur with smaller tables where the changes are varied enough to throw the distribution out of whack.  Some tables are just more sensitive to change than others and that 20% may be too much.  You may find that you need to update stats more often… say at 10% or in extreme cases even 5%.
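To put a number on that: for a 10-million-row table, the threshold works out to 500 + (20% x 10,000,000) = 2,000,500 modifications before the stats are even invalidated, let alone rebuilt.  That's a lot of churn to live with on a busy table.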

Here's a sexy little query I put together that will give you the percentage of change an index has had.  You can use it to run a job that updates stats at a different percentage, or you can just query it to see what percentage your tables are at and how fast they change.  It's an excellent tool for troubleshooting possible stats problems, and either way it's pretty handy.

--Get stats percentage of change for tables.
SELECT
    so.schema_id,
    ss.name AS [Schema],
    OBJECT_NAME(si.id) AS TableName,
    si.name AS IndexName,
    CAST((CAST(si.rowmodctr AS float) / CAST(si.rowcnt AS float)) * 100 AS int) AS Pct,
    si.rowcnt,
    STATS_DATE(si.id, si.indid) AS [StatsDate]
FROM sys.sysindexes si
INNER JOIN sys.objects so
    ON si.id = so.object_id
INNER JOIN sys.schemas ss
    ON so.schema_id = ss.schema_id
WHERE si.rowcnt >= 1000000 --500
  AND si.rowmodctr > 0
--AND OBJECT_NAME(si.id) = 'TableName'
ORDER BY Pct DESC, si.rowcnt DESC, si.rowmodctr

Here are a couple notes on the query:

1.  Notice there’s a way to query for a specific table.

2.  The StatsDate col shows you the last time the stats were updated.

3.  I’m also limiting the rowcount to indexes with more than 1mill rows.  Feel free to lower that if you like.

4.  rowmodctr > 0 is something I threw in there to keep the divide by zero error out of there.  This also filters out the system-created stats.

OK, I hope you guys like this one.  I personally love it.  It’s allowed me to build a process to update stats on some of my tables more aggressively than 20% and keep my server running at its peak.
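If you want to turn the query into an actual process, here's a rough sketch of the kind of job step I mean.  The 10% threshold and the plain UPDATE STATISTICS call are just illustrations; tune both to your own tables:

DECLARE @Schema sysname, @Table sysname, @sql nvarchar(1000);

DECLARE stats_cur CURSOR FAST_FORWARD FOR
    SELECT DISTINCT ss.name, OBJECT_NAME(si.id)
    FROM sys.sysindexes si
    INNER JOIN sys.objects so ON si.id = so.object_id
    INNER JOIN sys.schemas ss ON so.schema_id = ss.schema_id
    WHERE si.rowcnt >= 1000000
      AND si.rowmodctr > 0
      AND (CAST(si.rowmodctr AS float) / CAST(si.rowcnt AS float)) * 100 >= 10;  -- update anything past 10% change

OPEN stats_cur;
FETCH NEXT FROM stats_cur INTO @Schema, @Table;

WHILE @@FETCH_STATUS = 0
BEGIN
    SET @sql = N'UPDATE STATISTICS ' + QUOTENAME(@Schema) + N'.' + QUOTENAME(@Table) + N';';
    EXEC (@sql);
    FETCH NEXT FROM stats_cur INTO @Schema, @Table;
END

CLOSE stats_cur;
DEALLOCATE stats_cur;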

Keynote or Breakout Session?

From time to time you run across something that demonstrates exactly what the industry needs.  As I sit here at the PASS day-3 keynote I'm watching Dr. DeWitt talk about how the query optimizer finds query plans.  This is a very complicated topic under the covers, and he's getting into some of the genuinely hard parts of it.  He's already explained the different types of histograms and how they work (I'm sure only at a high level), and he got into selectivity, etc.  And the thing is, he's explaining this stuff in such simple terms and making it so easy to understand that it just hit me like a ton of bricks: why isn't this material being taught everywhere?

This is one of the things I’m always talking about when I say there’s no really good training out there.  There are plenty of people out there teaching queries and tuning techniques, but nobody is bothering to break this stuff down so that people actually understand the terms and what they really mean.  Most of the time what they do is just define some terms briefly and then move on and expect everyone to be able to go back home and apply this stuff.  But without this background of how this stuff actually works and how the terms really fit together and how you get bad plans, etc, then people aren’t going to be nearly as successful as they could be.  Dr. DeWitt is clearly concerned with teaching people how to think instead of just what to think.

This is always the type of training I move towards because I find I never commit anything to memory unless I understand how it all works and fits together.  So what I’d like to see happen is for someone to put something like this together in a video series, or in a pre-con… something.  That’s a pre-con I’d actually pay for out of my own pocket.  A whole day of someone dedicated to making sure I understand how this stuff works… Hell Yeah!  But seriously, where does someone go to learn stuff like this?  Because this info isn’t out there in any human readable form.  And it’s apparently not just me either, cause he’s been brought back by overwhelming request, so I’d say the better part of the room is also screaming for this type of info.

What does a bad query look like?

In my SQL Marklar blog today I discussed troubleshooting DB processes.  And I’m not going to re-hash all of it here but I did want to tell you about a use case that describes perfectly what I was talking about.

Not so long ago I got a call from one team and they told me they had some server issues.  Everything was moving slow, they said.  Ok, so I got on and took a look and nothing really jumped out at me.  Then I put a profiler trace on it to dig a little deeper.  And of course, I knew nothing about the app or the processes so I really didn't know what I was looking for, but you've gotta start somewhere, huh?

So there I am in profiler and I’m just looking for long-running queries.  The problem is there were lots of queries I would consider long-running.  For some reason I focused in on a single SP that was taking like 5mins.  I pulled up the text of the SP and started looking through it.  It all seemed fairly standard.  I mean, it was long and everything wasn’t perfect, but there was nothing out of the ordinary. 

I contacted the app guy again and asked about it.  Does this SP typically take this long to run?  No, he says (and those of you who have seen My Cousin Vinny know where this is going).  So I thought eureka, I actually found something that may fix the issue.  So I got a couple valid params from him and ran the SP with them.  I also made sure to turn on execution plans and statistics io.  The query plan had some dings in it that experience has told me could easily have caused this kind of spike in resource usage.  The problem is that there was no fragmentation, and stats were up to date.  And in talking with the app guy he told me that they just archived a bunch of the data so it was down to like 200mill rows now.  So why would this thing be taking so long to return?  Moving on.
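Quick aside for anyone newer to this: "turning on execution plans and statistics io" just means something like the following in the session you're testing from, along with the actual execution plan option in SSMS.  The proc name and params here are made up:

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

EXEC dbo.MySuspectProc @ClientID = 42, @StartDate = '20110101';  -- made-up proc and params

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;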

I found a copy of his QA system that had been copied over from prod the previous week and he assured me that they had changed nothing.  I could see the extra rows in the tables (copied before the archival), and the indexes were the same as in prod so that wasn’t the issue.  They had the same fill factor, everything.  In fact, everything I checked was identical except for the amount of data.  So why would having less data cause such a huge performance issue?  Moving on.

I decided that running this thing again and again on prod was probably a bad idea.  I was just adding to the issue.  So I started doing the rest of my work on his QA box where I was the only spid.  And the hardware was similar as well (I love it when it works out that way).  So I ran the SP on this box and 5mins passed.  Then 10mins.  Then 15mins.  Then 20mins.  And sometime soon after that, the query returned.  I had collected all my stats along the way so I was golden.  It was getting the same execution plan as the prod version.  The results weren't what I expected at all.  Why was the prod version performing well in comparison?  So I called the app guy again and explained the situation.  Here's more or less how the conversation went:

Me:  You know, I just don't know what's going on here.  I've been looking at this for a long time now and I'm getting further into a hole.  The prod version of this SP takes 5mins, and that's even after the archival.  But when I run it on QA with all the data there it takes even longer.  If the prod query were acting up, I would expect the QA query to come back faster even with the extra data.

Guy:  Yeah that sounds about right to me.

Me:  What sounds right to you?  (I just got a bad feeling that something horrible had gone wrong)  (You know how you can instantly drop all the pieces into place and remember key words that make everything all of a sudden fit together?  Well, I just got that, but I wanted to hear him say it.)

Guy:  This SP usually takes about that much time, but since the archival it went down to 5mins.  We’ve been very pleased.

Me:  So you mean to tell me that when I came to you with this you didn’t find it necessary to tell me that the 5mins was an improvement?

Guy:  Well, I don’t know anything about DBs so I figured you could see that kinda thing already.

Me:  I see.  Well that clears up that mystery.  Now I’ve gotta go back and start over with this whole process.

Guy:  Well I can tell you the one that’s probably causing the issue.

Me:  Oh yeah?  How’s that?

Guy:  Because the slowness is only in our billing section and that’s controlled by just a few queries.  I can give you the names of the SPs and you can look at those.  There are only like 5 of them and since we’re having a hard time pulling up a list of clients it’s likely going to be the one that controls the search on that.

Me:  I see.  So you were just never going to tell me that?  I’ve been messing with this for 2hrs and you could have given me all this info before and saved me tons of time.

Guy:  Well, again, I don’t know anything about DBs and I figured you could see all that.

Me:  You thought I could see the web app from the DB?

Guy:  You can’t?

Me:  Kill me.

So ok, it turned out to be one of the 5 he gave me.  It had a bad query plan.  I wasn’t able to determine that all on my own, btw.  I had to recompile each one of them until I found the bad one.  And that’s because I didn’t have a perf baseline like I discussed on Marklar.
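For the record, "recompiling each one" just means marking the proc so it gets a fresh plan the next time it runs, something like this (the proc name is made up):

EXEC sp_recompile 'dbo.ClientSearch';  -- forces a new plan on the next execution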

So there are a couple lessons to learn here, but I think you can glean them for yourself.  The whole point of this, though, is that making assumptions about processes is bad, and no matter what looks like a long-running query to you, it may in fact be performing better than usual.

Why you have to be on top of your game

I just heard from a DBA at my last gig who was keeping me in the loop about things I had done while I was there.
He said that they had to take away the two 1TB LUNs I had them put on our DB server because they caused performance problems. The SAN guys had warned us that LUNs that big would hurt us, and I guess they were right. Well, not really. That was the report I got back from the DBA, not my words. When I grilled him about it, he did say that instead of giving us several drives in the array, they had carved each LUN out of a single 1TB spindle.
So let me get this straight… I asked for 1TB LUNs, and instead of doing it right you gave me a single drive, and then used it as justification for showing me you know more about disks than I do? I find that incredible.

So guys, yeah… you really have to be on top of your game and know your stuff so that when they do stuff like this you can call them on it. And I must admit that I’m not sure it would have crossed my mind to even check that it was a single spindle. Why would that even cross my mind? These are enterprise SAN guys… they know their way around and should definitely know better. What I didn’t realize was that they were into playing games. It just goes to show that you try so hard to be effective in some places and they just won’t let you.

Data Explosion

Hey everybody… it’s been a while since I’ve blogged, but that’s what starting a new gig will do to you. I’ve had so many new things to rant about here and I’m sure I’ll be getting to those in the future, but right now I want to share a funny story with you.

We had a group of devs who were trying to run a fairly simple insert and it kept filling up the disk. And when I say it was filling up the disk, I mean like 50K rows were filling up like 200GB. So they came to me to see if I could fix it. This is where it gets fun.

It really didn’t take me too long in my investigation to find that they had the data file set to autogrow by 131,000%. That’s right, I said it! 131,000 percent. So now that I found the problem I was able to set it to something more reasonable, shrink the file and let them get on with their insert.

So that got me thinking about what other DBs in the place had similar issues, because I've been hearing about disk space problems here and there. So I wrote a PowerShell script to go out and check the file growth rates on every SQL box, and I found a lot of the same kind of thing. There were a lot of them set for several thousand percent growth, several set for 1MB growth, and everything in between. In general, as a good generic setting to start from, I like to grow my files 1GB at a time. It's a good round number that works in a lot of cases, and then you can snipe the ones that need something else. Then I altered the PowerShell to go out and change the growth rates of all the DB files out there to 1GB. Life is good again and we have the beginnings of a happy shiny environment.
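I did mine in PowerShell so I could hit every server, but if you just want to eyeball a single instance, the same check is a quick query against sys.master_files, and the fix per file is a one-line ALTER. This is just a sketch; the DB and file names in the ALTER are made up:

-- Find files set to percent growth or a tiny fixed growth
SELECT DB_NAME(database_id) AS DBName,
       name AS LogicalName,
       CASE WHEN is_percent_growth = 1
            THEN CAST(growth AS varchar(10)) + '%'
            ELSE CAST(growth * 8 / 1024 AS varchar(10)) + 'MB'
       END AS GrowthSetting
FROM sys.master_files
WHERE is_percent_growth = 1
   OR (growth > 0 AND growth * 8 / 1024 < 64);  -- growth is stored in 8KB pages

-- And the fix for a single file
ALTER DATABASE MyDB
MODIFY FILE (NAME = 'MyDB_Data', FILEGROWTH = 1GB);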

Oh yeah, and in the same discovery I also found 110 DBs set to autoshrink, and I took care of those too.

Deleting Production Data

While talking to a group of devs late last week the topic of deleting prod data came up.  Since I’m the big MVP of the group they wanted to know what the official word from on-high is.  The question is simple, and the answer is even simpler and frankly it’s one of those that was solved so long ago it just doesn’t hold any interest for me anymore.  I’ll hash it out here again though because I know there are plenty of you out there who haven’t solved this problem yet. 

The question is, what's the best way to delete gazillions of rows out of a prod table.  The issue is with both concurrency and rollback.  The problem is that if you do a huge delete in a single transaction, you don't leave any room for anyone else to access the table during that time.  So your delete can bring a DB to a complete halt if you don't do it right.  As well, you put your system at risk should something go wrong and the transaction have to roll back.  So why would the transaction have to roll back?  Well, there are a couple reasons really.  The first, and most likely, is that it could fill up the log, and with no more log the transaction can't continue.  The other possibility is that the server or the service gets restarted because someone didn't know you were doing a big operation.  And while I'm at it, there is one more logical possibility.  Someone could get impatient with the operation and stop it without realizing that it has to roll back.  This is a scenario I've seen played out many times and the result is always the same: the rollback takes at least as long as the operation had already been running.  And what's even better is that when the DB doesn't come back online right away, that same user will bounce the service again and again.

And of course this counts for updates as well… in SQL2K this was accomplished like this:

SET ROWCOUNT 1000

WHILE (1=1)
BEGIN
    BEGIN TRANSACTION

    --Delete stmt here--

    IF @@ROWCOUNT = 0
    BEGIN
        COMMIT TRANSACTION
        BREAK
    END

    COMMIT TRANSACTION
END

However, this doesn’t work in SQL2K5 and above where you have to do something more like this:

SET NOCOUNT ON

WHILE (1=1)
BEGIN
    DELETE TOP (1000)
    FROM Table1
    WHERE conditions...

    IF @@ROWCOUNT = 0
        BREAK
END

 

So ok, that’s how you batch delete/update in prod.  Only under very controlled and specific circumstances should you ever do these types of ops in a single batch.

More Questionable Code

A couple pieces of code came across my desk yesterday and they’re definitely the kind of thing you read about… if nowhere else, then here for sure.  This was code that a dev sent me that is to be deployed next week.

The first one could actually cause a problem so I’ll talk about it first.

I’ll write the pseudo-code because I don’t have the actual code in front of me right now.  It goes like this:

1.  Check if the Whale schema exists.

2.  If it does exist then drop it.

3.  Create the Whale schema.

Ok, so that’s exactly what the script does. Here, I’ll go ahead and attempt the code.  I didn’t look it up so it may have a slight error in it, but you get the idea.

IF EXISTS (SELECT name FROM sys.schemas WHERE name = 'Whale')
BEGIN
    DROP SCHEMA Whale;
END;
GO

CREATE SCHEMA Whale AUTHORIZATION dbo;
GO

So what’s wrong with this code you ask?  It’s simple, here’s the logic.

First, if it exists and you drop it, then why turn around and recreate it in the next step?  What's the point of that?  If your goal is to create the schema, then everything above the 'create' line is useless.  And I know what you're thinking… so what?  What's the big deal if a tiny query gets run during a deployment?  It's not like it's going to drag the system down.  Well, you're right about that, but it could kill the deployment.  If that schema did exist and it actually contained objects, then the drop statement would fail until you put those objects somewhere else.  So with no code to check whether objects exist inside it, you could be dooming the deployment to an unnecessary error.  You could also say that you know the schema doesn't exist, so there's nothing to worry about.  That's fine, then take out the check and the drop.  If you know it's not there, take it out.  It's called having concise code, and it's something we lack in this industry these days.  Let me illustrate this with something completely ridiculous that also doesn't really affect performance.

CREATE TABLE #t1 (col1 int)

TRUNCATE TABLE #t1
TRUNCATE TABLE #t1
TRUNCATE TABLE #t1
TRUNCATE TABLE #t1
TRUNCATE TABLE #t1

INSERT #t1 SELECT 1

Ok, so look at that code above.  I created a temp table and then truncated it 5x.  That's not going to affect performance at all because there's no data in it; it was just created and there are no pages.  Then I go ahead and insert a row.  I can't think of anyone who would let this kind of thing go on in an SP, or in deployment code, but we're expected to let useless checks stay in our scripts.
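And just to close the loop on the first script: if you really do want the drop in there, the check that's actually missing is whether the schema still owns anything.  Something like this, which is only a sketch:

IF EXISTS (SELECT 1 FROM sys.objects WHERE schema_id = SCHEMA_ID('Whale'))
BEGIN
    RAISERROR('Schema Whale still contains objects; move them before dropping it.', 16, 1);
    RETURN;
END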

That script was clearly generated by the scripting wizard, so it's not like the dev went out of his way to do this by hand, but it's the mark of a fairly inexperienced coder to let the wizard create scripts and not inspect what it's giving you.

The other piece of code doesn’t really do any damage as much as it’s just annoying.  In fact, it’s really 2 different scripts.  They’re create view scripts and the first one reads something like this:

CREATE VIEW MyView
AS
SELECT MyFavoriteTable.col1,
       MyFavoriteTable.col2,
       MyFavoriteTable.col3,
       MyFavoriteTable.col4,
       MyFavoriteTable.col5,
       MyFavoriteTable.col6,
       MyFavoriteTable.col7,
       MyFavoriteTable.col8,
       MyFavoriteTable.col9,
       MyFavoriteTable.col10
FROM MyFavoriteTable AS MyFavoriteTable

Ok, so even typing this in the blog pisses me off.  And again, it's just a style thing, but this just drives me crazy.  What drives me crazy the most is that this dev doesn't understand what aliases are for.  Aliasing a table with the same name as the table itself defeats the purpose of having the alias in the first place.  Here, a much better alias would have been mft or mt or m.  Hell, you could even go as far as MyFT or something like that.  Here, I'll play one of those out for you and you tell me which one you'd rather read.

SELECT mft.col1,
       mft.col2,
       mft.col3,
       mft.col4,
       mft.col5,
       mft.col6,
       mft.col7,
       mft.col8,
       mft.col9,
       mft.col10
FROM MyFavoriteTable AS mft

Forget reading, which one of those would you rather type?  It’s just more concise and easier to read.  I happen to know the dev who wrote this and his opinion is that the full table name makes it easier to read when you’ve got larger joins.  Personally, I’ve never known anyone to complain about short aliases before, and again, I honestly think this boils down to inexperience.  When I first started coding well over a decade ago, I used to need things to be presented to me in very specific ways too.  It was the only way I could read the code.  That kind of thing is much less important to me now that I have been doing it for a long time.  Why?  Because I know how to code.

So the second thing about this script that bugs me is the fact that he saw the need to alias a query with a single table in it.  Now you’re just being obstinate for no reason.  I know the argument though.  He probably would say that he did it for consistency.  The only problem with that is the other create view script he submitted didn’t have the same stupid aliasing structure, so where’s the argument now?

Ok, so this code was clearly brought to us as part of a code review so the next logical question is why don’t we just reject the code if it bothers us so badly?  Well, the answer is simple.  Like so many places, our code reviews are merely perfunctory process placeholders.  Our lead dev has made it very clear that he’s going to do whatever he likes no matter what we say.  We’ve rejected code plenty of times for really valid process or performance reasons and he always thumbs his nose up at us and pushes it into prod anyway.  I really don’t understand how he’s gotten such a superior attitude towards the DBAs, but it’s completely unwarranted.  He acts like we’re in this just to piss on his parade and make him look bad, but we’re really only trying to help.  But we don’t even bother anymore.  What’s the point?  He knows more than the rest of us put together so we don’t even bring stuff up anymore.

So that’s my rant for tonight.  Remember, use aliases wisely and be as consistent as you can.  And for everyone’s sake listen to what your DBAs are telling you for a change.  They’re really only trying to help.