Category Archives: Admin

A fun conversation about backups

While I was talking to one of my Jrs today about backups, a .Net guy poked his head around the corner to offer his opinion on the matter.  The subject was basically whether everything will be copied over if you do a full backup and restore it to another system.

Here’s basically how the talk went:

.NET – Well, it really depends on whether you have different filegroups as to whether everything will be restored.

DBA:  No, if it’s a full backup and it restores, everything is there.

.NET:  Well, yeah, but I'm just saying that DBs can have a lot of filegroups sometimes, and if they do, then you might not get all of them.

DBA:  No, if it’s a full backup and it restores, everything is there.

.NET:  But…

DBA:  There are no buts… if it’s a full backup and it restores, everything is there.

.NET:  I’m just saying that…

DBA:  No, if it’s a full backup and it restores, everything is there.

.NET:  You can’t deny that there are several filegroups, right?

DBA:  I would never try to deny that.

.NET:  And if you backup those different filegroups, then you can only restore some of them, therefore you can have a DB with some filegroups unrestored.

DBA:  This is correct.

.NET:  So back to my original…

DBA:  No, if it’s a full backup and it restores, everything is there.

.NET:  No, because you just said that you can backup different filegroups and restore only part of them.

DBA:  Yes I did.

.NET:  So aren’t we saying the same thing?

DBA:  No, not even close.

.NET:  Why not?

DBA:  Because we were talking about full backups, not filegroup backups.  Full backups backup everything… it’s in the name.  Filegroup backups only backup filegroups… that’s also in the name.  But a full backup can only be restored fully.  There is no partial restore of a full backup.

.NET:  So you’re telling me that you backup the full DB and you can backup filegroups, but it’s a different kind of backup?

DBA:  Yes.

.NET:  But still, if you have…

DBA:  No, if it’s a full backup and it restores, everything is there.

.NET:  You’re not letting me finish.

DBA:  Yeah, because I know what you’re trying to say and there’s no wiggle-room.

.NET:  But if you have multiple filegroups, then…

DBA:  No, if it’s a full backup and it restores, everything is there.

.NET:  So are you telling me that there’s absolutely no way to restore only certain filegroups from a full backup?

DBA:  That's exactly what I'm telling you.  Again, it's in the name 'Full Backup'.

.NET:  So is there a way to change the location of a filegroup when you restore the full backup?

DBA:  Only if it's tied to a specific file and you restore that file to another location.  In that case you're really moving the file itself, and the filegroup comes along for the ride.

.NET:  But I keep thinking there’s got to be a way to…

DBA:  No, if it’s a full backup and it restores, everything is there.

.NET:  But if there’s not enough disk space and it only restores part of the DB then that would leave you with only part of the DB.

DBA:  No, it wouldn’t complete the restore and you’d have nothing.

.NET:  So if you take a filegroup backup then you can restore different filegroups.

DBA:  Yes.

.NET:  Then I’m still right.

DBA:  No, not even close.

.NET:  Yeah, because you can still do what I said you could do.

DBA:  No, because you’ll recall that there are 2 very important aspects of what I’m saying and they both have to be there… which is part of the original topic:  “If it’s a full backup” and “if it restores”.  If both of those exist, then you’ve got everything in the DB.

.NET:  But I still think you should be able to…

DBA:  No, if it’s a full backup and it restores, everything is there.

And then it just kind of tapered off into other backup discussions from there, but that was just fun.

Thought you guys might like a funny story on a Monday.
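P.S. For anyone who wants to see the distinction the whole argument hinged on in actual T-SQL, here's a minimal sketch.  The database, file, and filegroup names (MyDB, MyDB_Data, Archive, the G:\ paths) are made-up placeholders for illustration, not anything from the conversation.

-- A full backup captures the entire database. If it restores, everything is there.
BACKUP DATABASE MyDB TO DISK = 'G:\Backups\MyDB_Full.bak';

RESTORE DATABASE MyDB_Copy
FROM DISK = 'G:\Backups\MyDB_Full.bak'
WITH MOVE 'MyDB_Data' TO 'G:\Restore\MyDB_Copy.mdf',   -- moving a file drags its filegroup along for the ride
     MOVE 'MyDB_Log'  TO 'G:\Restore\MyDB_Copy.ldf';

-- A filegroup backup is a different kind of backup entirely: it only captures the
-- filegroups you name, which is the only way you end up restoring part of a DB.
BACKUP DATABASE MyDB FILEGROUP = 'Archive' TO DISK = 'G:\Backups\MyDB_Archive.bak';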

What makes a Sr. DBA?

I get indirectly asked this question all the time… what makes a Sr. DBA?  Well, that question is gonna be answered 20 different ways by 10 different DBAs.  But here’s my answer.

Aside from whatever specific qualities you assign to what a Sr should look like, here’s a stick you can use to measure yourself.  You should be able to meet both of these criteria.

  1. Are you right most of the time?
  2. Do you handle your own problems instead of calling daddy?

Ok, let's talk about #1 first.  Nobody is right all the time, but are you right most of the time?  When you get called in on a crisis, do you diagnose the problem correctly, say, better than 90% of the time?  And do you discover and deduce the problem, or do you just fall into it?  Do your users come to you for answers?  In other words, have you earned your place as the go-to guy?

Next, how often do you call daddy?  If you’re in a shop with an existing Sr. DBA, do you call him when you have a problem or do you research and solve your own issues before getting him involved?  It’s always really easy to call the Lead DBA, but it doesn’t teach you anything.  And as long as you’re relying on his research skills you’ll never be the go-to guy yourself.

I remember it well.  Much longer ago than I care to remember, I asked whatever male figure I had how you know when you're a man.  He told me something that stuck with me all these years.  He said, you know you're a man when you quit calling your parents when you have trouble.  And I remember it hit me once when I was driving late at night and got a flat tire.  I just got out and changed it and went on my way.  A year earlier I would have called my folks to come help me.  That was my first hint that I may have crossed into manhood.  Because at some point you realize that it's up to you.

It’s the same in the IT world.  You go through these years of learning and then of getting proficient, and at some point it dawns on you that it’s all up to you and only you can solve your problems.  You have to be the one to investigate and solve the blocking, or the deadlocks, or the excessive waits, etc.

And that doesn’t mean that you never need any help with anything.  Nothing could be further from the truth, but how often do you need that external help?

Am I a Dinosaur without the Cloud?

Man, you know, I keep looking at all this cloud advertising MS is putting out there, and all the propaganda they're spinning around Azure, and there's one thing that's screaming loud and clear: those who don't jump on the cloud wagon are going to get left behind.  I get this message mainly from the DBAs and evangelists who have bought into the whole bag, because when people buy into something that wholly, they want everyone else to buy in too.

But frankly I'm tired of being made to feel like I'm going to be a dinosaur or get left completely behind if I don't buy into Azure.  I'm an enterprise DBA, and Azure is very new and as far as I'm concerned it's just a way for MS to make sure they get paid for all their license fees.  It's much harder to pirate software when you host it yourself.  But here's something… and you're going to find this shocking… just because MS had an idea and decided to market it doesn't mean that the rest of us are going to fall behind if we continue doing what we're doing.  It's not like I'm refusing to learn any of the new features or to expand my understanding of current ones.  I study SQL quite a bit throughout the week and I work really hard to stay current.  And I think I can survive for quite some time without moving my company to the cloud.  Not only will it be a long time before it can do as much as I can do with my own install, but there are still tons of privacy and regulatory issues to work out.

All of this is trying to make it sound like locally installed DBs are going the way of the mainframe, and nothing could be further from the truth.  There will be DBAs at companies for decades to come and if the time ever comes when I absolutely *have* to make the jump to Azure, then I'll jump off that bridge when I come to it.  Until then, the only word I have for you true believers is this… don't believe everything you hear from MS marketing.  And just because they came up with something you think is cool, don't think it's going to be the only game in town.

Besides, the cloud is mostly a marketing joke anyway.  The other word for the cloud is the internet.  And there have been hosted services on the internet for a long time now.  This is nothing new.  So when you see these commercials talking about taking advantage of the cloud, just remember that it’s just the internet.  I’ve been buying books in the cloud for years.  And I’ve had my website hosted in the cloud for years too.  And remember that new thing they came out with a couple months ago?  Apparently now they can host your email for you on special websites so you can get to it anywhere.  Wow, imagine that… email in the cloud.  What will they think of next?

Getting a good look at deadlocks

One of the things that’s a major concern for DBAs is monitoring deadlocks.  And that’s really easy.  The more difficult part can be to sift through all the info once you’ve got it.  The deadlock graph in Profiler is my favorite tool for gathering them, and I prefer to shred the XML results into a table so I have all the info right there in front of me.

So here I’ve got a script I’ve written that shreds your deadlock XML into a relational table.  Actually, I modified this from the one Trevor Barkhouse did for his deadlocking presentation.  I just wanted some friendlier cols, and more info in the table.  Once you get the data into this format though, you can see how easy it is to turn it into a real process. 

What I love the most about this is how easy it is to go through a high number of deadlocks and query them for anything you like.  You can easily get a picture of your entire environment with very little effort.  Now, for this, you’ll have to get the deadlock graphs into XML to begin with and while I could show it to you here, Trevor does a good job of it so I’ll just point you to his blog on the topic.  However, once you have the XML, then here’s my query to mine it.
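(As an aside: if you don't have a Profiler trace handy, another place you can scrounge deadlock XML from is the default system_health Extended Events session.  This is just a minimal sketch, assuming SQL 2008 or later; the XML it returns is shaped a little differently than a Profiler .xdl, so you'd still have to massage it into the deadlock-list format the query below expects.)

-- Pull the system_health ring buffer, which contains xml_deadlock_report events.
SELECT CAST(st.target_data AS XML) AS SessionData
FROM sys.dm_xe_session_targets AS st
INNER JOIN sys.dm_xe_sessions AS s
   ON s.[address] = st.event_session_address
WHERE s.name = 'system_health'
   AND st.target_name = 'ring_buffer';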

Here’s the code:
USE [tempdb];
GO

DECLARE @DeadlockCount INT;
DECLARE @DeadlockGraph XML;
DECLARE @DeadlockGraphFileName NVARCHAR(255);
DECLARE @DynamicSQL NVARCHAR(1024);

SET @DeadlockGraphFileName = 'G:\DeadlockFiles\DeleteCV3OrderGenerationQueue\AllEvents.xml';
--'G:\DeadlockFiles\20110405-20110418_1.xdl';

-- The OpenRowset() T-SQL function requires a string literal for the file
-- name, so use dynamic SQL to avoid hardcoding a path (other than in the
-- above assignment to the @DeadlockGraphFileName variable... but that's
-- a little clearer).
SET @DynamicSQL = N'SELECT @XMLDocument = CAST([BulkColumn] AS XML)' + NCHAR(10) +
   N'FROM OpenRowset' + NCHAR(10) +
   N'(' + NCHAR(10) +
   N'    BULK ''' + @DeadlockGraphFileName + N''',' + NCHAR(10) +
   N'    SINGLE_BLOB' + NCHAR(10) +
   N') AS [XMLDocument];';

PRINT @DynamicSQL;

-- Execute the dynamic SQL and pull the result into the @DeadlockGraph
-- local variable.
EXECUTE [master].[sys].[sp_executesql]
   @stmt = @DynamicSQL,
   @params = N'@XMLDocument XML OUTPUT',
   @XMLDocument = @DeadlockGraph OUTPUT;

-- Count the number of deadlock graphs in the XDL file.
SELECT @DeadlockCount = DeadlockList.Graphs.value('count(deadlock)', 'Int')
FROM @DeadlockGraph.nodes('/deadlock-list') AS DeadlockList(Graphs);

PRINT 'File            :   ' + @DeadlockGraphFileName + CHAR(10) +
      'Deadlock count  :   ' + CAST(@DeadlockCount AS VARCHAR);

-- Extract the parties to the deadlock.
DECLARE @PartiesToTheDeadlocks TABLE
(
   [VictimProcessID] VARCHAR(25),
   [VictimLastBatchStarted] DATETIME2,
   [VictimLockMode] VARCHAR(15),
   [VictimXActID] VARCHAR(15),
   [VictimClientApp] VARCHAR(50),

   [LiveProcessID] VARCHAR(25),
   [LiveLastBatchStarted] NVARCHAR(23),
   [LiveLockMode] VARCHAR(15),
   [LiveXActID] VARCHAR(15),
   [LiveClientApp] VARCHAR(50),

   [LiveFileID] VARCHAR(15),
   [LivePageID] VARCHAR(15),
   [LiveObjName] VARCHAR(50),
   [LiveLockModeHeld] VARCHAR(50),
   [VictimLockModeRequest] VARCHAR(50),

   [VictimFileID] VARCHAR(15),
   [VictimPageID] VARCHAR(15),
   [VictimObjName] VARCHAR(50),
   [VictimLockModeHeld] VARCHAR(50),
   [LiveLockModeRequest] VARCHAR(50),

   [VictimProcName] VARCHAR(100),
   [VictimExecStack] VARCHAR(MAX),
   [LiveProcName] VARCHAR(100),
   [LiveExecStack] VARCHAR(MAX),
   [VictimInputBuffer] VARCHAR(2048),
   [LiveInputBuffer] VARCHAR(2048)
);

INSERT INTO @PartiesToTheDeadlocks
SELECT VictimProcessID,
       VictimLastBatchStarted,
       VictimLockMode,
       VictimXActID,
       VictimClientApp,
       LiveProcessID,
       LiveLastBatchStarted,
       LiveLockMode,
       LiveXActID,
       LiveClientApp,
       LiveFileID,
       LivePageID,
       LiveObjName,
       LiveLockModeHeld,
       VictimLockModeRequest,
       VictimFileID,
       VictimPageID,
       VictimObjName,
       VictimLockModeHeld,
       LiveLockModeRequest,
       VictimProcName,
       VictimExecStack,
       LiveProcName,
       LiveExecStack,
       VictimInputBuffer,
       LiveInputBuffer
FROM
(
   SELECT
      --Victim process.
      DeadlockList.Graphs.value('(process-list/process[1]/@spid)[1]', 'NVarChar(15)') AS VictimProcessID,
      CAST(REPLACE(DeadlockList.Graphs.value('(process-list/process[1]/@lastbatchstarted)[1]', 'NChar(23)'), N'T', N' ') AS DATETIME) AS VictimLastBatchStarted,
      DeadlockList.Graphs.value('(process-list/process[1]/@lockMode)[1]', 'NVarChar(15)') AS VictimLockMode,
      DeadlockList.Graphs.value('(process-list/process[1]/@xactid)[1]', 'NVarChar(15)') AS VictimXActID,
      DeadlockList.Graphs.value('(process-list/process[1]/@clientapp)[1]', 'NVarChar(50)') AS VictimClientApp,
      --Live process.
      DeadlockList.Graphs.value('(process-list/process[2]/@spid)[1]', 'NVarChar(15)') AS LiveProcessID,
      CAST(REPLACE(DeadlockList.Graphs.value('(process-list/process[2]/@lastbatchstarted)[1]', 'NChar(23)'), N'T', N' ') AS DATETIME) AS LiveLastBatchStarted,
      DeadlockList.Graphs.value('(process-list/process[2]/@lockMode)[1]', 'NVarChar(15)') AS LiveLockMode,
      DeadlockList.Graphs.value('(process-list/process[2]/@xactid)[1]', 'NVarChar(15)') AS LiveXActID,
      DeadlockList.Graphs.value('(process-list/process[2]/@clientapp)[1]', 'NVarChar(50)') AS LiveClientApp,
      --Live resource.
      DeadlockList.Graphs.value('(resource-list/pagelock[1]/@fileid)[1]', 'NVarChar(15)') AS LiveFileID,
      DeadlockList.Graphs.value('(resource-list/pagelock[1]/@pageid)[1]', 'NVarChar(15)') AS LivePageID,
      DeadlockList.Graphs.value('(resource-list/pagelock[1]/@objectname)[1]', 'NVarChar(50)') AS LiveObjName,
      DeadlockList.Graphs.value('(resource-list/pagelock[1]/@mode)[1]', 'NVarChar(50)') AS LiveLockModeHeld,
      DeadlockList.Graphs.value('(resource-list/pagelock[1]/waiter-list/waiter/@mode)[1]', 'NVarChar(50)') AS VictimLockModeRequest,
      --Victim resource.
      DeadlockList.Graphs.value('(resource-list/pagelock[2]/@fileid)[1]', 'NVarChar(15)') AS VictimFileID,
      DeadlockList.Graphs.value('(resource-list/pagelock[2]/@pageid)[1]', 'NVarChar(15)') AS VictimPageID,
      DeadlockList.Graphs.value('(resource-list/pagelock[2]/@objectname)[1]', 'NVarChar(50)') AS VictimObjName,
      DeadlockList.Graphs.value('(resource-list/pagelock[2]/@mode)[1]', 'NVarChar(50)') AS VictimLockModeHeld,
      DeadlockList.Graphs.value('(resource-list/pagelock[2]/waiter-list/waiter/@mode)[1]', 'NVarChar(50)') AS LiveLockModeRequest,
      --Proc names, exec stacks, and input buffers.
      DeadlockList.Graphs.value('(process-list/process[1]/executionStack/frame/@procname)[1]', 'NVarChar(100)') AS VictimProcName,
      DeadlockList.Graphs.value('(process-list/process[1]/executionStack/frame)[1]', 'VarChar(max)') AS VictimExecStack,
      DeadlockList.Graphs.value('(process-list/process[2]/executionStack/frame/@procname)[1]', 'NVarChar(100)') AS LiveProcName,
      DeadlockList.Graphs.value('(process-list/process[2]/executionStack/frame)[1]', 'VarChar(max)') AS LiveExecStack,
      RTRIM(LTRIM(REPLACE(DeadlockList.Graphs.value('(process-list/process[1]/inputbuf)[1]', 'NVarChar(2048)'), NCHAR(10), N''))) AS VictimInputBuffer,
      RTRIM(LTRIM(REPLACE(DeadlockList.Graphs.value('(process-list/process[2]/inputbuf)[1]', 'NVarChar(2048)'), NCHAR(10), N''))) AS LiveInputBuffer
   FROM @DeadlockGraph.nodes('/deadlock-list/deadlock') AS DeadlockList(Graphs)
) AS [_ProcessDetails];

-- Display the parties to the deadlock.
SELECT *
FROM @PartiesToTheDeadlocks
ORDER BY VictimLastBatchStarted, VictimXActID;

Now, the code formatter messed up the coloring, but the code works. 

Now, I’ll list the cols and explain them cause I know that’s the next thing you’re gonna want.  It’s broken up into 2 categories… the Victim process and the Live process.  I didn’t know what else to call it other than live and still keep it short enough to prefix the cols with.

VictimProcessID – spid of victim.

VictimLastBatchStarted – date/time of last batch for victim.

VictimLockMode – lock held by victim.

VictimXActID – trans ID for victim.

VictimClientApp – Victim client app.

LiveProcessID – spid of live process.

LiveLastBatchStarted – date/time of last batch for live process.

LiveLockMode – lock held by live process.

LiveXActID – trans ID within the batch for the live process (I think).

LiveClientApp – client app for live process.

LiveFileID – file ID where the locked resource for live process is held.

LivePageID – page ID the processes are fighting over.  The live process currently holds the lock on this one.

LiveObjName – object the processes are fighting over.  The live process currently holds the lock on this one.

LiveLockModeHeld – lock held on the above object by the live process.

VictimLockModeRequest – lock being requested  on the above object by the victim.

VictimFileID – file ID where the object the victim process is holding is held.

VictimPageID – Page ID the processes are fighting over.  The victim currently holds the lock on this one.

VictimObjName – object the processes are fighting over.  The victim currently holds the lock on this one.

VictimLockModeHeld – lock held on the above object by the victim.

LiveLockModeRequest – lock being requested on the above object by the live process.

VictimProcName – the SP called by the victim.

VictimExecStack – the stmt in the SP that caused the deadlock.

LiveProcName – the SP called by the live process.

LiveExecStack – the stmt in the SP that caused the deadlock.

VictimInputBuffer – the exact SQL that was sent to SQL engine.

LiveInputBuffer – the exact SQL that was sent to SQL engine.

Now you can see at a glance where your problem deadlocks are, how often they’re occurring, and which resources are under the most contention.
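For example, if you push the shredded rows into a permanent table (I'm using a hypothetical dbo.DeadlockHistory here instead of the table variable), a quick aggregate shows you which objects are deadlocking the most and when:

-- Hypothetical permanent copy of the shredded deadlock data.
SELECT VictimObjName,
       LiveObjName,
       COUNT(*) AS DeadlockCount,
       MIN(VictimLastBatchStarted) AS FirstSeen,
       MAX(VictimLastBatchStarted) AS LastSeen
FROM dbo.DeadlockHistory
GROUP BY VictimObjName, LiveObjName
ORDER BY DeadlockCount DESC;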

Good stuff that.

Something’s just not right

Like every DBA, I have customers that are harder to deal with than others, and I have some that are just impossible.  Now, I’m not talking about those annoying ones that ask for weird stuff, I’m talking about the ones that you just hate to work with because they’re either condescending, or rude, etc.  I’ve got 2 such customers right now.  And whenever I see their names on a meeting request I try to pawn them off onto one of my other DBAs if I can.  If I can’t though, I have to suck it up and endure the meeting myself. 

Well, I actually had meetings with both of them today.  I was able to send one of my guys to the 1st one, but I had to attend the 2nd one myself.  Now, this is a guy who typically likes to attack whatever I say.  He doesn’t just disagree with me, he actually attacks me.  So I’m sure I don’t have to tell you guys that I wasn’t looking forward to this meeting.  When I got there though, he was polite, thoughtful, and actually listened to what I said.  And believe me, I’ve learned not to say much around him unless specifically asked something because I just don’t need the fight.  So we were about to leave, and everyone else was on their way out, and he pulled me back.  Ok, here it comes… this is what I’ve been waiting for, only now there are no witnesses.  So he says to me… I just wanted to get some clarification on this one thing.  Exactly what do we need to backup and where does it need to be stored?  OK, I’m starting to think something’s wrong now, because this guy never admits he doesn’t know something, and in fact, spends his time telling me I don’t know what I’m talking about.

Then I get a report from my other DBA telling me that the guy in his meeting was the same way.  He just kept his mouth shut and answered questions, and that’s it.  No accusations, no insistence on ridiculous information, etc.  He just listened and went on about his day.

Now I know something’s up.  This isn’t normal and I’m trying to figure out where the real attack is coming from.  If this were chess it would be the equivalent of Bobby Fischer handing you his queen early in the game.  DON’T EVER TAKE THAT QUEEN.  So anyway, just waiting for the other shoe to drop.

A Lovely Statistics Query

Most of you know that you can set SQL to auto update stats.  And I’m sure that most of you know that it invalidates stats at about 20%.  What that means is that when about 20% of the data in the table has changed, the stats are invalidated.  It’s actually 500 + 20% if you wanna be specific.  And notice I say they’re invalidated and not updated.  Stats aren’t updated when that 20% mark is reached, they’re only invalidated.  They’re not rebuilt until they’re needed.  Well, for really large tables that can result in stats being outdated longer than they should be, because the amount of change needed to invalidate them is much higher.  On a 10-million-row table, for example, that’s roughly 2,000,500 rows that have to change before the stats are even invalidated.  The same issue can occur with smaller tables where the changes are varied enough to throw the distribution out of whack.  Some tables are just more sensitive to change than others and that 20% may be too much.  You may find that you need to update stats more often… say at 10% or in extreme cases even 5%.

Here’s a sexy little query I put together that will give you the percentage of change an index has had.  You can use it not only to drive a job that updates stats at a different percentage, but also to just query it and see what percentage your tables are at and how fast they change.  It’s an excellent tool for troubleshooting possible stats problems, and either way it’s pretty handy.

--Get stats percentage of change for tables.
SELECT
so.SCHEMA_ID,
ss.NAME AS [Schema],
OBJECT_NAME(id) AS TableName,
si.NAME AS IndexName,
CAST((CAST(rowmodctr AS float)/CAST(rowcnt AS float))*100 AS int) AS Pct,
rowcnt,
STATS_DATE([id], indid) AS [StatsDate]
FROM sys.sysindexes si
INNER JOIN sys.objects so
ON si.id = so.object_id
INNER JOIN sys.schemas ss
ON so.schema_id = ss.schema_id
WHERE rowcnt >= 1000000 --500
AND rowmodctr > 0
--AND OBJECT_NAME(id) = 'TableName'
ORDER BY Pct DESC, rowcnt DESC, rowmodctr

Here are a couple notes on the query:

1.  Notice there’s a way to query for a specific table.

2.  The StatsDate col shows you the last time the stats were updated.

3.  I’m also limiting the rowcount to indexes with more than 1mill rows.  Feel free to lower that if you like.

4.  rowmodctr > 0 is something I threw in there to keep the divide by zero error out of there.  This also filters out the system-created stats.

OK, I hope you guys like this one.  I personally love it.  It’s allowed me to build a process to update stats on some of my tables more aggressively than 20% and keep my server running at its peak.
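As a rough sketch of what that kind of process can look like (the 10% threshold, the 1-million-row floor, and the RESAMPLE option are just examples I picked for illustration, not part of the query above), you can wrap the same logic and have it spit out UPDATE STATISTICS statements for anything that’s crossed the line:

--Build UPDATE STATISTICS statements for indexes whose change percentage has hit 10%.
SELECT 'UPDATE STATISTICS ' + QUOTENAME(ss.name) + '.' + QUOTENAME(OBJECT_NAME(si.id))
       + ' ' + QUOTENAME(si.name) + ' WITH RESAMPLE;' AS UpdateStmt
FROM sys.sysindexes si
INNER JOIN sys.objects so
ON si.id = so.object_id
INNER JOIN sys.schemas ss
ON so.schema_id = ss.schema_id
WHERE si.rowcnt >= 1000000
AND si.rowmodctr > 0
AND si.name IS NOT NULL --skip heaps and anything without an index name
AND CAST((CAST(si.rowmodctr AS float)/CAST(si.rowcnt AS float))*100 AS int) >= 10;

From there it’s just a matter of running the generated statements from a cursor or an agent job on whatever schedule you like.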

Troubleshooting a replication scenario

I just had a nice replication troubleshooting experience I thought I’d share with you guys.  I got a call from our app team saying that replication was failing.  I looked and the error was a bit unexplainable.

“Column names in each table must be unique.”

OK, so why the hell is it telling me that?  This publication has been running for months now with no problems.  I checked the table in question on both sides, and as it turns out the column it was complaining about had an int data type on the publication side and a uniqueidentifier on the subscription side.  So how did these columns get different data types?  Did someone change the data type on the subscriber to an incompatible data type?  No, that’s probably not it.  And that’s not the error message you would get for something like that anyway.  But someone had to change the data type, right?  Well, only kinda.  Here’s what actually happened.

One of the devs decided he needed this column in the table, and instead of going to the DBAs, he added it to the subscriber himself.  He then added it to the publisher (with a different data type).  So when replication went to replicate the DDL change, it saw that the subscription side already had the column and spit out that excellent error message.
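Just to make the sequence concrete, here’s a hedged reconstruction of the kind of change that bites you.  The table and column names (dbo.Orders, RegionKey) are made up for illustration; they’re not the actual objects involved.

--Step 1: the dev adds the column directly on the SUBSCRIBER, bypassing the DBAs and replication.
ALTER TABLE dbo.Orders ADD RegionKey UNIQUEIDENTIFIER NULL;

--Step 2: the dev then adds the 'same' column on the PUBLISHER, with a different data type.
ALTER TABLE dbo.Orders ADD RegionKey INT NULL;

--When replication tries to apply the publisher's DDL at the subscriber, the column
--already exists there, and the agent fails with:
--  Column names in each table must be unique.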

Let this be a lesson to those of you who have DBAs and constantly look for ways to circumvent them.  Why don’t you stop every now and then and implement things properly?  And by properly I mean slow down and take the time to think about what you’re doing.  The days are over when you can just do whatever you want because it crosses your mind… esp when you have replication running.  You have to plan schema changes or you’ll spend tons of time troubleshooting and fixing your scenario.  Not to mention what it does to whatever processes rely on this replication process.  Anyway, that’s all I’ve got for now.  Let’s all be safe in production.

A Round of Patching

I just finished an interesting email thread with a user that’s not at all an uncommon scenario.  They were just installing a CU to some of their SQL boxes, and there was evidently a new bug introduced into that CU that causes their SSRS reports to render incorrectly.  The sad thing is that this is the kind of thing that should have been caught in testing.  Now of course, I don’t expect them to find everything during testing, but the simple rendering of reports should have been caught early on.  Because the second they turned their reporting structure back on, report rendering started messing up.  Now, he assures me that they tested it, but what exactly did they test?  Did they do any user testing at all or did they just test the install of the patch itself?  I see that quite often.  I see shops claim to do patch testing, but all they’re concerned with is whether the install itself will fail.

Most of the time I blame managers and companies in general for this, because the IT staff knows what’s at stake more than anyone but they’re not given the proper time and resources it takes to properly test.  Managers always claim to know the business needs better, yet they never seem to fully grasp the full implications of putting untested code into production.  All they want to know is will this bring down my server.  Unfortunately that always means something different to them than it does to us.  What they mean is will this fail and put us into an emergency situation?  And the answer is probably not.  However, disabling a critical piece of your business is putting you into an emergency situation.  And in this case they’re lucky it’s just report rendering.  What if it had been something more serious that caused some kind of logical corruption in their processing?

I would say that quite often the biggest problems caused by hotfixes are performance-related.  And by not testing a CU or hotfix properly you could bring your entire business to its knees.  You can slow your website down so much that nobody can get any real work done.  And what does that do for the reliability and confidence of your business?  It would be really nice sometimes if we in IT didn’t have to rely solely on people who know nothing about it to get our jobs done properly. 

And what’s amazing to me is that companies today still, after all this time and all the mistakes, don’t think it’s necessary to hire competent people to code and test.  Oh sure, they perform lip service about it all day long, but what do they really do about it?  Do they really slow down their dev cycles enough to do benchmarking and architect solutions properly?  Do they spend the money necessary to ensure that their code and servers are solid by hiring enough competent people?  Because every shop I’ve been in works their people so hard there’s no time for anything unexpected to happen.  So I think I’ve said this before, but now I’m saying it again.  It’s time to stop being children about IT.  It’s time to stop letting people who know nothing about it make the technical decisions that affect entire projects.  And it’s definitely time to stop thinking we can get by without testing things properly.  And that includes functional testing as well as performance benchmarking.

Pinging SQL Server in Powershell

It quite often happens that there will be some kind of issue keeping your app from hitting the DB.  Sometimes the issue is with SQL itself, but most of the time there’s something else involved.  The problem is that when it’s reported through the error stack at the app level, it only says that the DB was unavailable, and it’s hard to get users to understand that it’s just a generic error that means the app couldn’t talk to the DB for whatever reason.  That reason could be the NIC, a cable, the OS, a switch, a router, etc.  There are many reasons why an app wouldn’t be able to get to a SQL box.  And it’s made even worse if it’s intermittent.

So a good way to handle this is to put yourself a ping script up that will actually query SQL.  Not check that the server is up, or the service is running, but that you can actually ping SQL itself and run a query.  I’ve done this for you in powershell.  It runs a simple select and then writes a result to a txt file.  I set up an agent job to run every 5secs and run this code from a dos cmd task.

##C:\SQLPing.ps1

add-pssnapin sqlservercmdletsnapin100

$date = get-date

$a = Invoke-Sqlcmd -ServerInstance Servername -Database master -Query "select @@servername" -ErrorAction silentlyContinue -ErrorVariable err
if ($err.count -eq 0) {$a = "OK"}
else {$a = "Failed"}

# if (!$a) {$b = "Failed"}
"$date  :  $a" | Out-File c:\VI2conn.txt -append

This kind of script can help in a number of ways, and depending on the level of troubleshooting you want to do, you can place this on different boxes.  So you can place it on the local SQL box to have it test itself, or in another data center to test a specific link, or just in the same subnet or vlan, or put it in the same subnet as the app and then in a couple different ones.  How you spread it around depends on what you’re looking for.  But this can be a tiebreaker because if the apps people insist that SQL is down because of a generic error message, you can tell them that 2 boxes from 2 other subnets were able to ping it every 5secs, but the one from the app’s subnet failed 6 times or whatever.  This way you can also show something solid to the network guys and make it easier for them to find the problem on their end. 

Now, this won’t tell you if it’s a switch, firewall, NIC, OS, or what, but it will tell you that it’s not a problem with SQL itself or the SQL box.  Of course, if a couple of them fail then it could still be the SQL box cause again, it won’t tell you that either, but it could be a tiebreaker in the right circumstance.

AND, just because I’m a nice guy here’s the job script too.  I did this in SQL2K8.

USE [msdb]
GO

/****** Object: Job [SQLPing] Script Date: 02/04/2011 14:20:44 ******/
BEGIN TRANSACTION
DECLARE @ReturnCode INT
SELECT @ReturnCode = 0
/****** Object: JobCategory [[Uncategorized (Local)]]] Script Date: 02/04/2011 14:20:44 ******/
IF NOT EXISTS (SELECT name FROM msdb.dbo.syscategories WHERE name=N'[Uncategorized (Local)]' AND category_class=1)
BEGIN
EXEC @ReturnCode = msdb.dbo.sp_add_category @class=N'JOB', @type=N'LOCAL', @name=N'[Uncategorized (Local)]'
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
END

DECLARE @jobId BINARY(16)
EXEC @ReturnCode = msdb.dbo.sp_add_job @job_name=N'SQLPing',
      @enabled=1,
      @notify_level_eventlog=0,
      @notify_level_email=0,
      @notify_level_netsend=0,
      @notify_level_page=0,
      @delete_level=0,
      @description=N'No description available.',
      @category_name=N'[Uncategorized (Local)]',
      @owner_login_name=N'sa',
      @job_id = @jobId OUTPUT
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
/****** Object: Step [Ping] Script Date: 02/04/2011 14:20:45 ******/
EXEC @ReturnCode = msdb.dbo.sp_add_jobstep @job_id=@jobId, @step_name=N'Ping',
      @step_id=1,
      @cmdexec_success_code=0,
      @on_success_action=1,
      @on_success_step_id=0,
      @on_fail_action=2,
      @on_fail_step_id=0,
      @retry_attempts=0,
      @retry_interval=0,
      @os_run_priority=0,
      @subsystem=N'CmdExec',
      @command=N'powershell "c:\SQLPing.ps1"',
      @flags=0
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
EXEC @ReturnCode = msdb.dbo.sp_update_job @job_id = @jobId, @start_step_id = 1
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
EXEC @ReturnCode = msdb.dbo.sp_add_jobschedule @job_id=@jobId, @name=N'Every 5secs',
      @enabled=1,
      @freq_type=4,
      @freq_interval=1,
      @freq_subday_type=2,
      @freq_subday_interval=10,
      @freq_relative_interval=0,
      @freq_recurrence_factor=0,
      @active_start_date=20110204,
      @active_end_date=99991231,
      @active_start_time=0,
      @active_end_time=235959,
      @schedule_uid=N'0b2e4594-92bc-438c-8311-d40076b53042'
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
EXEC @ReturnCode = msdb.dbo.sp_add_jobserver @job_id = @jobId, @server_name = N'(local)'
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
COMMIT TRANSACTION
GOTO EndSave
QuitWithRollback:
    IF (@@TRANCOUNT > 0) ROLLBACK TRANSACTION
EndSave:
GO

New Video: Hack-Attack a Damaged DB

What do you do if you have a set of data and log files you can’t attach because the log is damaged? Here I show you a way to attach it, fix it, and get going again. Paul Randal also discussed this in his blog here: Hack-attack a damaged DB

 

Join in our live DBAs at Midnight webshow
Fridays at 11pmCST on our Ustream channel!

Watch all past shows at MidnightDBA.com