Tag Archives: disaster

An Upgrade Disaster

I got an email from a user at another company today and he told me about the SQL 2012 upgrade they just finished. And apparently it was a disaster. Here’s the general gist of what happened.

They have a huge app that runs their entire business and the vendor talked them into upgrading to 2012. Originally they were slated to do tons of testing and upgrade probably sometime in November. But they decided to not listen to their DBA and instead allowed themselves to be led by the vendor, who told them that SQL upgrade was easy and nothing to worry about. So they did some perfunctory testing and pushed the upgrade to this past week. I know, smart right?

So this vendor did their upgrade for them and it completed ok from what I know about it. The problems came after the upgrade. Now, I don’t have any real specifics, but I do know that it caused a 10hr downtime. One of the directors asked about the backout plan and he was politely told to mind his own business. Everyone is calling the upgrade a disaster. They didn’t have any way to restore in case the upgrade failed in a really bad way… and that means no final backup, no scripted objects, and no mirrored system. This was an in-place all or nothing upgrade.

Just so we’re clear on this, that’s not the way you run an upgrade. Upgrades take plenty of testing from the DB side and the app side. You should never upgrade anything without extensive testing. And you should always have a backout plan. Now, what does a backout plan really mean? Because I find that oftentimes the backout plan gets overlooked, and I think it’s mainly because people have a hard time defining it.

To me a backout plan means a few different things depending on what you’re after. Let’s take this upgrade as an example. No matter how good you think SQL upgrade is, there’s always something that can go wrong. So at the very least, you need to take a final backup of ALL the system and user DBs right before the upgrade. Make sure you kick everyone out of the DB first because it’s not a final backup if there are still going to be modifications afterwards. That’s a good start for sure, and what I’d consider to be a minimum effort. Here’s a quick list of the steps I like to take for an important upgrade such as this:

1. Copy all system DBs to another location. This just makes restore much easier because with DBs as small as most system DBs, you can just drop them back in their original location and you’re good to go.

2. Script all logins with SIDs.

3. Script all jobs.

4. Make sure I have all SSIS pkg projects at the ready so I can redeploy all pkgs if I need to.

5. Do a test restore of the final backup before starting the upgrade.

6. Script any system-level settings like sp_configure.

7. Script any repl, log shipping, mirroring scenarios.

8. Make sure I have pwords to any linked servers. While I try to keep everyone off of linked servers I have to admit they’re a part of life sometimes. And you don’t want your app to break because you don’t know the pword to the linked server. It’s not the end of the world if this doesn’t happen, but it’ll make life easier.

So basically, the more important the DB, the more of these steps you’ll follow. You need to prepare for a total meltdown and make sure you can recover in as timely a manner as possible. As I sit here and write this I feel stupid because it seems so basic, but there are clearly those out there who still need this kind of advice, so here it is.
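To make the final-backup step concrete, here is a minimal Python sketch that generates the T-SQL for a pre-upgrade backup of the system and user DBs. The database name AppDB, the X:\PreUpgrade folder, and the date stamp are made up for illustration, and the WITH INIT, CHECKSUM options are my assumption of a sensible default, not something the shop in the story used. It doesn't kick users out or do the test restore; those are still on you.

```python
from typing import List

# System databases worth backing up; tempdb is left out because
# SQL Server does not allow backups of tempdb.
SYSTEM_DBS = ["master", "model", "msdb"]


def quote_name(name: str) -> str:
    """Bracket-quote a SQL Server identifier, doubling any closing bracket."""
    return "[" + name.replace("]", "]]") + "]"


def quote_str(value: str) -> str:
    """Build an N'...' string literal, doubling any embedded single quote."""
    return "N'" + value.replace("'", "''") + "'"


def final_backup_script(user_dbs: List[str], backup_dir: str, stamp: str) -> List[str]:
    """Return one BACKUP DATABASE statement per system and user database.

    backup_dir and stamp are hypothetical inputs used only to build file names.
    """
    statements = []
    for db in SYSTEM_DBS + user_dbs:
        path = f"{backup_dir}\\{db}_preupgrade_{stamp}.bak"
        statements.append(
            f"BACKUP DATABASE {quote_name(db)} TO DISK = {quote_str(path)} "
            "WITH INIT, CHECKSUM;"
        )
    return statements


if __name__ == "__main__":
    # Print the script so it can be reviewed before anyone runs it.
    for line in final_backup_script(["AppDB"], r"X:\PreUpgrade", "20120901"):
        print(line)
```

Generating the script instead of typing it by hand means the same list of DBs can drive the backup on the test box and on the real one.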

And if you have a good test box handy, make sure you test as many of these procedures as possible. Script out your logins, etc and restore them to the test box and see if things work as they should. Deploy your SSIS pkgs to your test box and make sure they run, etc. Don’t just rely on what you think *should* work. Actually make sure it works. This is why some upgrade projects take months to complete. It’s not the upgrade itself, it’s all the planning around it. And while this isn’t a full list of everything you could do to protect yourself, it’s a damn good start.
Happy upgrading.

Losing your job Sucks

I’ve blogged about this before, but some things are worth repeating from time to time.

Losing your job really sucks. And it doesn’t matter if you find out about it ahead of time by 2mos, 2wks, or not until they walk you out the door, you’re going to feel like a complete failure.  And I don’t know, maybe you should, maybe you shouldn’t, but if you don’t get a handle on it and soon you’re going to find yourself in the middle of a depression that’s hard to get out of.  And once you’re there you’ll be useless for finding a job until you get out of it because everyone can see you’re depressed and nobody wants to hire someone who’s a major downer.  You can take some steps to avoid it though, and here’s what I do.

The first thing I do is learn something new.  I pick a single topic of something I really want to learn and I do it.  It’s important that you only pick a single topic though.  If you’re already feeling like a failure, choosing to bone up on SQL in general is only going to make you more depressed because it’s going to remind you how small you really are compared to the product.  There’s just too much to do.  So you pick one small thing and do that.  You can tackle a single feature much easier.  Maybe it’s not even a SQL topic you’re interested in.  Maybe you’ve always wanted to get started with ASP.NET, or HTML, or JavaScript, or Powershell, etc.  Pick one of those instead.  Now, you certainly won’t learn any of those overnight either, but at least it’s a solid topic you can practice and get better at.  This is very important because it shows you that you’re not a loser and you are capable of doing something.  It also gives you new confidence because you’ve added something significant that you like to your skillset.  And if something in IT isn’t what you’re dying to do, then take this time to learn French cooking, or the harmonica, or whatever.

The 2nd thing I do is I start working out.  This too is essential.  There are a couple reasons for this.  First, it’s something tangible.  Unless you’re just completely paralyzed it’s impossible to not see improvement.  You jog to the end of the street and you’re completely winded.  Then the next day (or later that day) you jog to the end of the street and go an extra 10ft.  The next time you go even farther… and so on and so on.  Or you lift weights and see some improvement there.  Do something physical.  Do it every day and do it to exhaustion.  Why exhaustion?  Well, that’s the 2nd reason.

Physical activity works out mental frustration.  It’s hard to be stressed when you’re too tired to walk.  So by working out really hard every day you go a long way to relieve your stress.  And if you’re the type to hold things in, you’re more likely to open up and talk when you’re tired.  This is why parents who know this make their kids get on a treadmill or do some good exercise when they come home really upset and refuse to talk.  After a good workout they start talking.  This is also more or less how truth serums work.  They relax you to the point where you don’t have the energy to lie.  Lying takes energy and effort and if you’re really relaxed, you tend to not be able to exert that kind of effort.

All of this should help you achieve the ultimate goal that I’ll state now.  Your ultimate goal is to shift your self-worth from your job to something else.  If you place all your worth on your job and you just lost your job, then where does that leave you?  Completely worthless, that’s where.  But if your job is just something else you do and you’re succeeding at plenty of other things, well then you’re not worthless.  You just don’t currently have a job.  The point is that your job shouldn’t define who you are.  Instead, focus on your career.  Whether or not you have a job currently, you’re still a DBA.  Individual jobs come and go, but your career stays constant.

I’ve lost jobs before.  I think almost everyone has.  It doesn’t necessarily mean you’re an idiot or you suck at what you do.  It may simply be that you weren’t right for that gig for whatever reason.  I’ve found that there are some shops that are so dysfunctional no sane person will ever be successful there.  Sometimes it’s a single person being enabled by the entire company, and sometimes it’s actually the entire company dynamic.  For whatever reason, you’re just not suited to that gig.  Ok, try to define what it is you can’t work with there and try to avoid that the next time.

So it may not be you who sucks at all.  Of course, it very well may be, and if that’s the case then improving your skills will be your 2nd priority.  Your first priority of course is to do what I said above and keep yourself out of the funk.  Because if you can’t do that then you’re not going anywhere.

Change your process

This is an excellent example of how you need to be flexible with your processes, even when you’re in the middle of a project.

We started a project to move a DB to a new set of disks. Since the files are large, we probably weren’t going to be able to fit it into a single downtime so we were going to just move one file at a time over the next few weeks until they were all done. Well, due to circumstances out of our control, now they all have to be done at the same time. The problem is that now the file copies are going to take in excess of 6hrs, which is way longer than any downtime they would give us. I know, right? Don’t worry, I’ve got big problems with them forcing us to do operations in a large chunk like this, and then saying we can’t have time to do it. So we were doing our test last week and it did indeed take about 6hrs to copy all the files. And I don’t know why it didn’t hit me before, but why not change the process? The copy process was there because we were going to piecemeal the operation over several weeks, but since that’s gone, then maybe it’s time to come up with a new strategy.

So that’s what I did. My new strategy is an even simpler backup/restore op. All I have to do when I restore is map the files to their new locations and I’m golden. So far it’s 6hrs to copy, and I know I can back up in 20-25mins. My restore, I’m guessing, will be about 30mins (give or take). That’s under an hour all in, against 6hrs for the copy.
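Mapping the files to their new locations on restore comes down to one MOVE clause per file. Here is a minimal Python sketch that builds that RESTORE statement. The database name, logical file names, and drive paths are all hypothetical, and whether you need REPLACE depends on your situation, so it is off by default here.

```python
from typing import Dict


def quote_name(name: str) -> str:
    """Bracket-quote a SQL Server identifier, doubling any closing bracket."""
    return "[" + name.replace("]", "]]") + "]"


def quote_str(value: str) -> str:
    """Build an N'...' string literal, doubling any embedded single quote."""
    return "N'" + value.replace("'", "''") + "'"


def restore_with_move(db: str, backup_file: str, file_map: Dict[str, str],
                      replace: bool = False) -> str:
    """Build a RESTORE DATABASE statement that relocates each logical file.

    file_map maps logical file name -> new physical path (hypothetical values).
    """
    if not file_map:
        raise ValueError("file_map needs at least one logical file")
    options = [
        f"MOVE {quote_str(logical)} TO {quote_str(physical)}"
        for logical, physical in file_map.items()
    ]
    if replace:
        # REPLACE overwrites an existing database of the same name; opt in only.
        options.append("REPLACE")
    options.append("STATS = 5")  # progress message every 5 percent
    joined = ",\n    ".join(options)
    return (
        f"RESTORE DATABASE {quote_name(db)}\n"
        f"FROM DISK = {quote_str(backup_file)}\n"
        f"WITH\n    {joined};"
    )


if __name__ == "__main__":
    print(restore_with_move(
        "AppDB",
        r"X:\Backup\AppDB.bak",
        {"AppDB_data": r"N:\Data\AppDB.mdf", "AppDB_log": r"O:\Log\AppDB.ldf"},
    ))
```

The logical names have to match what is actually in the backup, which you can check with RESTORE FILELISTONLY before the downtime starts.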

Of course, the backup/restore won’t perform that well on its own. You have to tune it so it’ll use the resources to its advantage. This is where knowing how to tune your backups can come into play. I often say that tuning backups is frustrating because you can’t use a lot of your resources; you’ve still gotta leave room for user processing on the box. But this is one of those times that you can crank it all the way up. Everyone’s going to be offline for this op, so the box is completely mine. I can use every ounce of RAM and CPU on the server if I want. And that’s what I’m going to do. If you’re interested in how to go about tuning your backups, you can look at my recent SQLSAT session on the topic. I did it for SQLSAT #90 in OKC. You can find the vid page for it here: http://midnightdba.itbookworm.com/Events.aspx
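As a rough illustration of what "cranking it up" can look like, here is a minimal Python sketch that builds a BACKUP statement striped across several files with BUFFERCOUNT and MAXTRANSFERSIZE set. Those are real BACKUP options, but picking them is my assumption about which knobs to turn, and the numbers, database name, and paths below are placeholders. The right values come from testing on your own box.

```python
from typing import List

MAX_STRIPES = 64             # SQL Server accepts up to 64 backup devices
TRANSFER_STEP = 65536        # MAXTRANSFERSIZE must be a multiple of 64 KB
TRANSFER_LIMIT = 4194304     # and can be at most 4 MB


def quote_name(name: str) -> str:
    """Bracket-quote a SQL Server identifier, doubling any closing bracket."""
    return "[" + name.replace("]", "]]") + "]"


def quote_str(value: str) -> str:
    """Build an N'...' string literal, doubling any embedded single quote."""
    return "N'" + value.replace("'", "''") + "'"


def tuned_backup(db: str, stripe_paths: List[str], buffer_count: int,
                 max_transfer_size: int) -> str:
    """Build a striped BACKUP DATABASE statement with buffer settings."""
    if not 1 <= len(stripe_paths) <= MAX_STRIPES:
        raise ValueError("need between 1 and 64 backup files")
    if buffer_count < 1:
        raise ValueError("buffer_count must be positive")
    if (max_transfer_size % TRANSFER_STEP != 0
            or not TRANSFER_STEP <= max_transfer_size <= TRANSFER_LIMIT):
        raise ValueError("max_transfer_size must be a 64 KB multiple up to 4 MB")
    disks = ",\n   ".join(f"DISK = {quote_str(p)}" for p in stripe_paths)
    return (
        f"BACKUP DATABASE {quote_name(db)}\n"
        f"TO {disks}\n"
        f"WITH INIT, BUFFERCOUNT = {buffer_count}, "
        f"MAXTRANSFERSIZE = {max_transfer_size};"
    )


if __name__ == "__main__":
    # Placeholder values; measure on a test run before trusting any of them.
    print(tuned_backup(
        "AppDB",
        [r"X:\Bak\AppDB_1.bak", r"Y:\Bak\AppDB_2.bak"],
        buffer_count=50,
        max_transfer_size=4194304,
    ))
```

Keep in mind that buffers cost memory (roughly the buffer count times the transfer size), so settings that are fine with everyone offline may be too greedy on a busy server.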

So anyway, the point is that just because you’ve come up with a way to do something, don’t set it in stone. If the scope changes in such a way that you can now do it a better way, then don’t be afraid to say “Stop, I’ve got a better way”. You may get some pushback from your peers because the project plan is already done and we need to just move forward with what we’ve got, but that’s when you need to push back and say no, this process was developed for a different circumstance and now it’s a different scenario completely. So this is no longer a viable method.

Over-tuning Backups
