Tag Archives: disaster

An Upgrade Disaster

I got an email from a user at another company today and he told me about the SQL 2012 upgrade they just finished. And apparently it was a disaster. Here’s the general gist of what happened.

They have a huge app that runs their entire business and the vendor talked them into upgrading to 2012. Originally they were slated to do tons of testing and upgrade probably sometime in November. But they decided not to listen to their DBA and instead allowed themselves to be led by the vendor, who told them that SQL upgrade was easy and nothing to worry about. So they did some perfunctory testing and pushed the upgrade to this past week. I know, smart right?

So this vendor did their upgrade for them and it completed ok from what I know about it. The problems came after the upgrade. Now, I don’t have any real specifics, but I do know that it caused a 10hr downtime. One of the directors asked about the backout plan and he was politely told to mind his own business. Everyone is calling the upgrade a disaster. They didn’t have any way to restore in case the upgrade failed in a really bad way… and that means no final backup, no scripted objects, and no mirrored system. This was an in-place all or nothing upgrade.

Just so we’re clear on this, that’s not the way you run an upgrade. Upgrades take plenty of testing from the DB side and the app side. You should never upgrade anything without extensive testing. And you should always have a backout plan. Now, what does a backout plan really mean? Because I find that oftentimes the backout plan gets overlooked, and I think it’s mainly because people have a hard time defining it.

To me a backout plan means a few different things depending on what you’re after. Let’s take this upgrade as an example. No matter how good you think SQL upgrade is, there’s always something that can go wrong. So at the very least, you need to take a final backup of ALL the system and user DBs right before the upgrade. Make sure you kick everyone out of the DB first because it’s not a final backup if there are still going to be modifications afterwards. That’s a good start for sure, and what I’d consider to be a minimum effort. Here’s a quick list of the steps I like to take for an important upgrade such as this:

1. Copy all system DBs to another location. This just makes restore much easier because with DBs as small as most system DBs, you can just drop them back in their original location and you’re good to go.

2. Script all logins with SIDs.

3. Script all jobs.

4. Make sure I have all SSIS pkg projects at the ready so I can redeploy all pkgs if I need to.

5. Do a test restore of the final backup before starting the upgrade.

6. Script any system-level settings like sp_configure.

7. Script any repl, log shipping, mirroring scenarios.

8. Make sure I have passwords to any linked servers. While I try to keep everyone off of linked servers, I have to admit they’re a part of life sometimes. And you don’t want your app to break because you don’t know the password to the linked server. It’s not the end of the world if you skip this step, but it’ll make life easier.
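Just to make a couple of these steps concrete, here’s a quick sketch of what the settings capture and the final backup might look like. The DB name and path here are made up, so swap in your own:

```sql
-- Capture instance-level settings (sp_configure values) so you can compare
-- or reapply them after the upgrade.
SELECT name, value, value_in_use
FROM sys.configurations
ORDER BY name;

-- The final backup: kick everyone out first so nothing changes afterwards,
-- then back up WITH CHECKSUM so you know the backup itself is readable.
ALTER DATABASE MyBigApp SET SINGLE_USER WITH ROLLBACK IMMEDIATE;

BACKUP DATABASE MyBigApp
TO DISK = 'X:\PreUpgrade\MyBigApp_Final.bak'
WITH CHECKSUM, INIT, STATS = 10;
```

Do the same for the system DBs, and then do that test restore before you let anyone touch the upgrade.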

So basically, the more important the DB, the more of these steps you’ll follow. You need to prepare for a total meltdown and make sure you can recover in as timely a manner as possible. As I sit here and write this I feel stupid because it seems so basic, but there are clearly those out there who still need this kind of advice, so here it is.

And if you have a good test box handy, make sure you test as many of these procedures as possible. Script out your logins, etc., apply them to the test box, and see if things work as they should. Deploy your SSIS pkgs to your test box and make sure they run, etc. Don’t just rely on what you think *should* work. Actually make sure it works. This is why some upgrade projects take months to complete. It’s not the upgrade itself, it’s all the planning around it. And while this isn’t a full list of everything you could do to protect yourself, it’s a damn good start.
Happy upgrading.

Losing your job Sucks

I’ve blogged about this before, but some things are worth repeating from time to time.

Losing your job really sucks.  And it doesn’t matter if you find out about it ahead of time by 2mos, 2wks, or not until they walk you out the door, you’re going to feel like a complete failure.  And I don’t know, maybe you should, maybe you shouldn’t, but if you don’t get a handle on it soon you’re going to find yourself in the middle of a depression that’s hard to get out of.  And once you’re there you’ll be useless for finding a job until you get out of it, because everyone can see you’re depressed and nobody wants to hire someone who’s a major downer.  You can take some steps to avoid it though, and here’s what I do.

The first thing I do is learn something new.  I pick a single topic of something I really want to learn and I do it.  It’s important that you only pick a single topic though.  The reason is because if you’re already feeling like a failure, choosing to bone-up on SQL in general is only going to make you more depressed because it’s going to remind you how small you really are compared to the product.  There’s just too much to do.  So you pick one small thing and do that.  You can tackle a single feature much easier.  Maybe it’s not even a SQL topic you’re interested in.  Maybe you’ve always wanted to get started with ASP.NET, or HTML, or JavaScript, or Powershell, etc.  Pick one of those instead.  Now, you certainly won’t learn any of those overnight either, but at least it’s a solid topic you can practice and get better at.  This is very important because it shows you that you’re not a loser and you are capable of doing something.  It also gives you new confidence because you’ve added something significant that you like to your skillset.  And if something in IT isn’t what you’re dying to do, then take this time to learn French cooking, or the harmonica, or whatever.

The 2nd thing I do is I start working out.  This too is essential.  There are a couple reasons for this.  First, it’s something tangible.  Unless you’re just completely paralyzed it’s impossible to not see improvement.  You jog to the end of the street and you’re completely winded.  Then the next day (or later that day) you jog to the end of the street and go an extra 10ft.  The next time you go even farther… and so on and so on.  Or you lift weights and see some improvement there.  Do something physical.  Do it every day and do it to exhaustion.  Why exhaustion?  Well, that’s the 2nd reason.

Physical activity works out mental frustration.  It’s hard to be stressed when you’re too tired to walk.  So by working out really hard every day you go a long way to relieve your stress.  And if you’re the type to hold things in, you’re more likely to open up and talk when you’re tired.  This is why parents who know this make their kids get on a treadmill or do some good exercise when they come home really upset and refuse to talk.  After a good workout they start talking.  This is also more or less how truth serums work.  They relax you to the point where you don’t have the energy to lie.  Lying takes energy and effort and if you’re really relaxed, you tend to not be able to exert that kind of effort.

All of this should help you achieve the ultimate goal that I’ll state now.  Your ultimate goal is to shift your self-worth from your job to something else.  If you place all your worth on your job and you just lost your job, then where does that leave you?  Completely worthless, that’s where.  But if your job is just something else you do and you’re succeeding at plenty of other things, well then you’re not worthless.  You just don’t currently have a job.  The point is that your job shouldn’t define who you are.  Instead, focus on your career.  Whether or not you have a job currently, you’re still a DBA.  Individual jobs come and go, but your career stays constant.

I’ve lost jobs before.  I think almost everyone has.  It doesn’t necessarily mean you’re an idiot or you suck at what you do.  It may simply be that you weren’t right for that gig for whatever reason.  I’ve found that there are some shops that are so dysfunctional no sane person will ever be successful there.  Sometimes it’s a single person being enabled by the entire company, and sometimes it’s actually the entire company dynamic.  For whatever reason, you’re just not suited to that gig.  Ok, try to define what it is you can’t work with there and try to avoid that the next time.

So it may not be you who sucks at all.  Of course, it very well may be, and if that’s the case then improving your skills will be your 2nd priority.  Your first priority of course is to do what I said above and keep yourself out of the funk.  Because if you can’t do that then you’re not going anywhere.

Change your process

This is an excellent example of how you need to be flexible with your processes, even when you’re in the middle of a project.

We started a project to move a DB to a new set of disks. Since the files are large, we probably weren’t going to be able to fit it into a single downtime so we were going to just move one file at a time over the next few weeks until they were all done. Well, due to circumstances out of our control, now they all have to be done at the same time. The problem is that now the file copies are going to take in excess of 6hrs, which is way longer than any downtime they would give us. I know, right? Don’t worry, I’ve got big problems with them forcing us to do operations in a large chunk like this, and then saying we can’t have time to do it. So we were doing our test last week and it did indeed take about 6hrs to copy all the files. And I don’t know why it didn’t hit me before, but why not change the process? The copy process was there because we were going to piecemeal the operation over several weeks, but since that’s gone, then maybe it’s time to come up with a new strategy.

So that’s what I did. My new strategy is an even simpler backup/restore op. All I have to do when I restore is map the files to their new locations and I’m golden. So far it’s 6hrs to copy, and I know I can back up in 20-25mins. So my restore, I’m guessing, will be about 30mins (give or take).
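The restore side of that strategy is just a RESTORE with a MOVE clause for each file. A sketch of what I mean, with made-up DB, file, and path names:

```sql
-- Restore the DB and remap every file to the new set of disks in one shot.
-- Get the logical file names to plug into MOVE from RESTORE FILELISTONLY.
RESTORE DATABASE MyBigDB
FROM DISK = 'X:\Backups\MyBigDB_Move.bak'
WITH MOVE 'MyBigDB_Data1' TO 'E:\Data\MyBigDB_Data1.mdf',
     MOVE 'MyBigDB_Data2' TO 'F:\Data\MyBigDB_Data2.ndf',
     MOVE 'MyBigDB_Log'   TO 'G:\Logs\MyBigDB_Log.ldf',
     STATS = 10;
```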

Of course, the backup/restore won’t perform that well on its own. You have to tune it so it uses the box’s resources to its advantage. This is where knowing how to tune your backups comes into play. And while I often say that tuning backups is frustrating because you can’t use a lot of your resources (you’ve still gotta leave room for user processing on the box), this is one of those times you can crank it all the way up. Everyone’s going to be offline for this op, so the box is completely mine. I can use every ounce of RAM and CPU on the server if I want. And that’s what I’m going to do. If you’re interested in how to go about tuning your backups, you can look at my recent SQLSAT session on the topic. I did it for SQL #90 in OKC. You can find the vid page for it here: http://midnightdba.itbookworm.com/Events.aspx
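If you want a feel for what cranking it up looks like, here’s the general shape of a tuned backup. The numbers here are made up for illustration; the right values for your box only come from testing:

```sql
-- Striping to multiple files spreads the I/O across disks, and the
-- BUFFERCOUNT/MAXTRANSFERSIZE knobs control how much memory the backup
-- uses to move data. On a box that's all yours, you can go big.
BACKUP DATABASE MyBigDB
TO DISK = 'E:\Backups\MyBigDB_1.bak',
   DISK = 'F:\Backups\MyBigDB_2.bak',
   DISK = 'G:\Backups\MyBigDB_3.bak'
WITH BUFFERCOUNT = 100,
     MAXTRANSFERSIZE = 4194304,  -- 4MB, the max transfer size
     COMPRESSION,
     STATS = 5;
```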

So anyway, the point is that just because you’ve come up with a way to do something, don’t set it in stone. If the scope changes in such a way that you can now do it a better way, then don’t be afraid to say “Stop, I’ve got a better way”. You may get some pushback from your peers because the project plan is already done and they want to just move forward with what you’ve got, but that’s when you need to push back and say no, this process was developed for a different circumstance and now it’s a completely different scenario. So this is no longer a viable method.

Over-tuning Backups

This is going to be a short post, but an important one to demonstrate the importance of thinking about what you’re doing. I recently taught a backup tuning session at SQLSaturday 90 in OKC. You can see the session recording here if you like. And I’ve been teaching this session off and on recently so I’m not really sure where this user saw my session, but I had a user write me with a tuning issue and I thought I’d share it with you.

In his email he outlined his restore scenario that he was unable to tune. No matter what he did, the restore didn’t happen any faster, and he would expect to see some improvement. And in fact, his attempts to tune it have only made it take longer. Here’s the code he sent me, and I’ve changed things to protect his environment.

RESTORE DATABASE AW
PAGE = '1:200'
FROM DISK = 'c:\AW.bak'
WITH NORECOVERY,
buffercount = 500,
maxtransfersize = 2097152;

So there’s nothing wrong with this syntax, and the restore should work just fine. But he’s right, he won’t see any performance gain at all and here’s why. How fast do you think you can restore a single page?
There’s just no way to copy 8KB of data any faster than it already will. And I’m not surprised at all that his restore takes longer. SQL is still gonna spin up all those resources because you’re telling it to. But since you’re only copying a single page, it’s not going to use them. So spinning up the resources won’t do you any good on this small operation and will only serve to cost you extra time. The resource spin-up will only take a few seconds, but on a page restore like this, it can be in the neighborhood of double the time. So it could take the restore from say 6-10secs to like 10-20secs or something like that. So it’s really not a big deal time-wise.

If you came in late to the backup tuning debate, the flags we’re arguing over are buffercount and maxtransfersize.
I’ll let you watch the session above for a full explanation of them, but I’ll go ahead and say here that 500 buffers just doesn’t even make sense. And transferring 8KB worth of data in 2MB chunks doesn’t make sense either.

The point here is this is a definite case where there’s really nothing to even tune. You can only move 8KB so fast. And while I applaud his efforts to apply something he learned in his shop, he needs to put a little more thought into where he applies it. However, he did the right thing. He tested, he saw an issue, and then he emailed me. We’ve discussed it fully and he now understands why he saw what he did. This is how you learn. And even more importantly, he’s out there trying to apply something he learned from a session. He’s not content with leaving things the way they are in his shop, and he’s not content with his current level of skill. You go girlfriend.

And he’s given me permission to blog about this issue so I don’t want any emails about how I screwed over a viewer who came to me for help.

Troubleshooting a replication scenario

I just had a nice replication troubleshooting experience I thought I’d share with you guys.  I got a call from our app team saying that replication was failing.  I looked and the error was a bit unexplainable.

“Column names in each table must be unique.”

OK, so why the hell is it telling me that?  This publication has been running for months now with no problems.  I checked the table in question on both sides, and as it turns out the column it was complaining about had an int data type on the publication side and a uniqueidentifier on the subscription side.  So how did these columns get different data types?  Did someone change the data type on the subscriber to an incompatible data type?  No, that’s probably not it.  And that’s not the error message you would get for something like that anyway.  But someone had to change the data type right?  Well, only kinda.  Here’s what actually happened.

One of the devs decided he needed this column in the table, and instead of going to the DBAs, he added it to the subscriber himself.  He then added it to the publisher (with a different data type).  So when replication went to replicate the DDL change, it saw that the subscription side already had the column and spit out that excellent error message.
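By the way, if you ever need to do that both-sides check yourself, a simple metadata query run on the publisher and then the subscriber makes the comparison easy. The table name here is hypothetical:

```sql
-- Run this on both the publisher and the subscriber, then compare
-- the output column by column to spot data type mismatches.
SELECT c.name AS column_name, t.name AS data_type
FROM sys.columns c
JOIN sys.types t ON c.user_type_id = t.user_type_id
WHERE c.object_id = OBJECT_ID('dbo.MyReplTable')
ORDER BY c.column_id;
```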

Let this be a lesson to those of you who have DBAs and constantly look for ways to circumvent them.  Why don’t you stop every now and then and implement things properly?  And by properly I mean slow down and take the time to think about what you’re doing.  The days are over when you can just do whatever you want because it crosses your mind… especially when you have replication running.  You have to plan schema changes or you’ll spend tons of time troubleshooting and fixing your scenario.  Not to mention what it does to whatever processes rely on this replication process.  Anyway, that’s all I’ve got for now.  Let’s all be safe in production.

Even MVPs make mistakes

We’re in the middle of our last mock go-live before this weekend’s prod change-over.  We’re using an SRDF process to move the DB files from the current prod to the new prod.  This is our big SQL R2 cluster and what usually happens is that they present the drives to the cluster and then I run my script to bring the DBs back online.

Well what happened this time is that the drives weren’t added as SQL dependencies, and therefore SQL couldn’t see them in order to bring the DBs back online.  Well, I didn’t think to check that.  Instead what I did was I just deleted the DBs, thinking that starting from scratch would be the best way to go.  What ended up happening though is that SQL deleted the newly presented data files.  So the moral of this story is that even though SQL can’t see the drives well enough to attach the files, it can apparently see them well enough to delete them behind your back.  Thanks clustering team!

And so this isn’t entirely just a complaining session, here’s the query you can use to see which drives your clustered instance of SQL can see.

SELECT * FROM sys.dm_io_cluster_shared_drives

Now if you’ve got drives in the cluster that SQL can’t see, all you have to do is add them as a dependency to the SQL service and you’ll be fine.  And in Windows 2008 you can do that while the service is online, but in 2003 and below you have to take SQL offline to add them.
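On the older clusters, that offline dance can be scripted with cluster.exe rather than clicking through Cluster Administrator. Something along these lines, keeping in mind the resource names are almost certainly different in your cluster:

```shell
REM Take the SQL Server resource offline, add the new disk as a
REM dependency, and bring it back online (2003-era cluster.exe).
cluster res "SQL Server (MSSQLSERVER)" /offline
cluster res "SQL Server (MSSQLSERVER)" /adddep:"Disk X:"
cluster res "SQL Server (MSSQLSERVER)" /online
```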

Oh, and just for completion, you can tell which nodes are in your cluster by using this:

SELECT * FROM sys.dm_os_cluster_nodes

Here’s a quick update:  The errors I caused and fixed this morning had some residual problems.  Because once you make a mistake and try to fix it, you’ve already got mistakes on your mind and you’re not thinking clearly.  This is why I’m always advocating having restore scripts ready to go in a manner that you don’t have to think about it in a crisis.  You always want things to be as easy as possible.  And this is also why I’m always saying that you should have standardized troubleshooting scripts and everyone should know how to use them.  You don’t want individual DBAs inventing new methods on the fly.  You should all know what scripts are being run to troubleshoot if at all possible.

The mini heart attack

Tell me if this scenario isn’t all too familiar.  You get a call in the middle of the night saying that something’s wrong with a SQL box.  It won’t respond at all.  Now, you’re the first one on the scene because we all know that if there’s a SQL box anywhere within 100 miles of the app, they’ll call the DBA first because it has to be the DB.  So anyway, you have plenty of time to think while you’re on the way to the scene.

That first step is the hardest.  You throw the covers off of you and expose your warm body to the cold night air. 

You swing your feet on the floor and try to find something that’s not gonna be too hot or too cold because you have no idea how long you’ll be up.  At this point you’re still grumbling about how wordy the helpdesk guy was when he called.  Why can’t these guys ever realize that we’re asleep and all we need is essential info?

Now you’re on your way down the hall and your thoughts turn to the problem at hand.  What could it be?  Did someone turn off the service?  Did CHECKDB take too long and is blocking everyone? 

You just hit your desk and start logging in.  While you’re waiting for your desktop to come up it hits you… what’s the state of the backups?  Am I sure that I can recover this box should something happen?  What if this server’s down for good?  Oh CRAP, is this the box where I got alerts about failed log backups yesterday and just ignored them?  I’ll be lucky if the log just filled up and I can do something about it, but if the box is down and I lost all that data because I didn’t feel like messing with it, I’m in trouble.

So you login and take a look at the server.  You’re able to TS in without any trouble and you breathe a small sigh of relief.  Well, at least the box itself isn’t down.  I can fix almost anything else that may be wrong.  You instantly look at your backup jobs to see if something has failed.  You see nothing out of the ordinary.

Next you look to see if anything’s blocking.  Nope, you see nothing of the kind.

You can query the DB and it’s online, so now your heart attack is completely gone and you turn your thoughts to what could actually be wrong.  So you call the person who submitted the ticket to begin with.  She reports that they couldn’t get into the DB for a few mins but that all seems fine now.  You tie up some loose ends on the phone and end the call.

You then close the lid on your box and as you walk back to bed you make a promise to yourself that you’ll never ignore another log backup alert ever again. 

Don’t let this happen to you.  Laziness is no excuse for not having a backup, and answering the alert is much easier than answering your boss’s call about why you lost half a day of data.  Don’t you have alerts set up?  Then why did this happen?  That’s a tough question to answer.

The point of all this is that things can go wrong on a moment’s notice.  Restoring your DB is like a car wreck.  It can happen at any time and you won’t have any notice.  So if you’re not sure… and I mean SURE about the state of your backups, then get that way.  Do whatever it takes to make sure you’re covered and you know it.  And that includes doing the actual restore.  Do you know exactly what it’ll take to restore your server?  The last thing you want to do is to have to figure out what has to be done when you’re under the gun.  So unless you know exactly what it’ll take to restore something then you don’t have a solid plan.  So avoid the bread lines and these mini heart attacks and do your due diligence.  The life you save may be your own.
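And if you want a fast sanity check before you ever get that call, a quick query against msdb shows when each DB was last backed up. The NULLs are the scary part:

```sql
-- Most recent full (type 'D') and log (type 'L') backup for every database.
-- A NULL means that kind of backup has never been taken for that DB.
SELECT d.name,
       MAX(CASE WHEN b.type = 'D' THEN b.backup_finish_date END) AS last_full,
       MAX(CASE WHEN b.type = 'L' THEN b.backup_finish_date END) AS last_log
FROM sys.databases d
LEFT JOIN msdb.dbo.backupset b ON b.database_name = d.name
GROUP BY d.name
ORDER BY d.name;
```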

How to save your life with powershell.

I’m always preaching the wonders of powershell, but I think quite often I forget to stress the impact it can have. And today is one of those times when I can give you a real-life example of how PS can save your life… ok, your job anyway.

One of the things PS does really well is scripting DB objects. It’s always one of the first things I show that’s really useful in my classes, and it always gets an appropriate amount of gasps. But other than creating dev systems from prod, what good is it? Well, it’s a fabulous way to script your prod environment on a regular basis just to have the scripts handy. If you’re able to use PS to easily script your tables, SPs, views, users, etc., then you’re going to have a history of your DB that you can pull from anytime. And if you’re smart enough to store those scripts in txt files with dates appended to them, then you can always get back a complete version of your DB for any given day. Cool huh?
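Here’s a rough sketch of what I mean, using SMO to script out every SP in a DB. The server, DB, and path names are all placeholders, so change them for your shop:

```powershell
# Script every non-system stored procedure in a DB to dated txt files
# using SMO. Run it daily and you have a history of your prod code.
[void][Reflection.Assembly]::LoadWithPartialName('Microsoft.SqlServer.Smo')

$srv  = New-Object Microsoft.SqlServer.Management.Smo.Server 'MyServer'
$db   = $srv.Databases['MyDB']
$date = Get-Date -Format 'yyyyMMdd'
$base = 'C:\DBScripts\MyDB\StoredProcedures'
New-Item -ItemType Directory -Path $base -Force | Out-Null

foreach ($sp in $db.StoredProcedures | Where-Object { -not $_.IsSystemObject }) {
    # schema.objectname.date.txt so each day's version is kept separately
    $file = Join-Path $base "$($sp.Schema).$($sp.Name).$date.txt"
    $sp.Script() | Out-File $file
}
```

Do the same kind of loop for tables, views, users, and whatever else you care about.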

And the implications of this are huge. Not only can you create a complete environment for any given day so that you can compare it with the current version, or any other version you like, but much more likely is the scenario where someone makes a mistake in prod and needs some small piece of code replaced. So someone drops an SP, or alters a view, or kills a job, or anything else along those lines. Of course, this won’t help you with data accidents, but imagine the possibility of being able to pull your butt out of the fire because you were able to recover from a mistake you made… or anyone else made.

This not only came in handy, but actually saved my ass at my last gig, where I had these scripts running and accidentally dropped all the SPs in our prod DB. I could have sworn I was on dev, right? Luckily I had my PS scripts to import the objects again too, so as soon as I realized my mistake I recreated all the SPs and all was well, with only a minor hiccup to our reporting cycle. But that’s exactly what I’m talking about. If you have these things ready to go you’re ready for whatever kind of disaster comes up, and if you not only have the scripted objects sitting there, but also the scripts to put them back, then you’re in the best shape you can possibly be in.

My favorite way to store these like I said is in txt files.
I put them in a folder named for the DB, then another subfolder after the object type, then I name the files by schema.objectname.txt. So a sample path to an object would look like this:

Now to parse the date back out is a simple matter and can be handled with a simple PS script. And the scripts you need to restore can be as easy or complicated as you like. You can have any or none of the following:

Restore all tables
Restore all indexes
Restore all SPs
Restore all views
Restore single SP
Restore single view

You could also combine them into a single PS script that you just pass an object type to, and it figures out the path and does the restore for you. Either way is good really.

So set up your PS scripts to script your DB objects and store them someplace. It’s only text, so you can keep history going back pretty far, and even archive older scripts in zip files if you like.

Good luck.

The Best DR Plan in Existence

Someone just wrote to tell me that they had a major outage on their main DW server due to one of the teams upgrading a driver.  There was nothing obviously wrong with what they did, so there was no reason to believe it would bring the entire system down.

All I can say is that they had the best DR plan in existence and I had to share it with you so you can mimic it in your own shops.  All you have to do is not take any system-level backups and don’t save any of your configs or any of your files, or document any of your processes.  Then you can just have everyone in the dept wish as hard as they can that nothing bad will ever happen.

It must be an excellent DR plan because it’s the same one that so many companies use.  And it costs next to nothing to implement, which is probably why it’s so popular.  And the real joy is that you can put it into practice almost as quickly as you can put it on paper.  It takes next to no approval from the brass either.

I really don’t see the big deal with DR though.  There’s so much time online and in magazines/books dedicated to this topic and I just don’t get it.  If so many companies already have this practically foolproof DR plan then what’s left to discuss?  I’ve been thinking about this for a while now and I can’t come up with a single situation where this wouldn’t apply. 

Watch my free SQL Server Tutorials at:

Read my book reviews at:

Blog Author of:
Database Underground – http://www.infoworld.com/blogs/sean-mccown

Follow my Twitter:


Snotting Rights

I recently ran across a perfectly delicious situation I wanted to share with you.  Someone actually wrote to tell me about this.  He just started a new gig and there was massive corruption in one of the dev DBs.  He went to ask the DBA where the backup was and he got a very snotty reply. 

Well, as it turns out this was something that the DBA had gone around and around with them about.  See, the devs didn’t want the DBA to do anything on their precious box so they refused to give him rights.  He tried to explain that they needed backups, etc but they wouldn’t hear of any DBAs pissing all over their dev box. 

And now when there’s massive corruption and they need to get their dev code back, they call the DBA to ask for help.  Yeah, fat chance guys. 

I’m here to tell you that the DBA has full snotting rights here.  And it only goes to teach them a lesson I hope.  DBAs aren’t here to piss on your parade.  We’re here to make sure your parade lasts as long as you want and you can even start your parade over and over again if you need to.  Seriously guys all metaphors aside, we’re here to help.  So stop acting like we’re on different sides.
