Category Archives: Admin

Read my mind

Man, talk about your all time screw-ups!!

OK, here’s what happened.  I got a ticket the other day to truncate a table in a prod DB.  Simple enough, right?  So I wrote the guy and said, are you sure you want me to truncate this table?  He said, yeah, it’s taking up over half of our space.  I said ok, just give me a couple mins.  So I connected and truncated the table.

Well, then I get a call like 10mins later and the guy was frantic.  What happened to my data?  Well, you told me to truncate the table.  NOT ALL OF IT!!! So this is a valuable lesson isn’t it?  Always be sure what you’re asking for and never assume.  I of course didn’t assume anything.  We get tickets to truncate entire DBs even all the time so it’s not that big of a deal for us.  Anyway, just be aware that many users have a completely different database vocabulary than you do.

Is this industry fun or what?

Powershell to recycle error log

Here’s a really good one. I noticed that very few of my boxes had a job to recycle the sql error logs. This is important because unless specifically done, the error log will only recycle when the service is burped in some way. For those of you who aren’t up on the lingo, recycling the error log is when you close down one log and create a new one. The error logs are just txt files kept on the server. So recycling creates a new physical txt file. And the reason you care about this is because some servers don’t get bounced very often at all and the files can get quite large. I have one server with a 40GB error log. And when I try to open it, it sits there for like 2hrs and then finally dies cause there aren’t enough system resources to deal with a file that big. So you can see why you want to keep the size of the files down.

And like I said, the error log only recycles when you burp the service or do it manually. And the way you do it manually is by running this code:

master.dbo.sp_cycle_errorlog

But doing it manually isn’t good enough because it has to be done on a regular basis. I prefer to recycle my logs every day. It’s a good generic schedule and honestly I don’t ever remember needing anything more granular. OK, so a new log every day is what we’re after. That means we need to put this code in a job and have it run every day at the same time.

So ok, that set the stage for what we’re wanting to do here. Now we can get into what we actually wanna talk about… and that’s how we push our new job to all of our boxes. Now, I’m going to do this a specific way and I’m doing it like this for a reason. I’ll explain once I’m done.

Here are the steps I want you to take to prep for this venture.

1. Create a job to occur every day (Midnight is preferable) and include the above code in the job step.
2. Be sure to set the job owner to sa.
3. Script the job and save it as a file somewhere.
4. You’ll have to script the job for different versions of SQL. SQL2K doesn’t have job schedules so the script blows up. So at least have one for SQL2k and below, and one for 2K5 and above.

 Now here’s the powershell script you’ll run.

$SourceSQLScript1 = “\\Server01\F$\SQLServerDBA\Scripts\Collector\SQLQueries\ServerListAllSQL2000.txt”;

$SqlCmd1 = Invoke-Sqlcmd -ServerInstance $Server -Database $StatsDB -inputfile $SourceSQLScript1

$SqlCmd1 | % {

[System.String]$ServerName = $_.ServerName;

$ServerName;

invoke-sqlcmd -ServerInstance $ServerName -Database “MSDB” -InputFile “\\Server01\F$\SQLServerDBA\Scripts\ServerConfig\RecycleSQLErrorLogJOB-SQL2K.sql” -SuppressProviderContextWarning

}

Now I’m going to explain what’s going on here and why:

1. $SourceSQLScript1 — This is the var that holds the location of the txt file where I store the query for the list of servers to run against. Inside the file looks like this: select ServerID as InstanceID, ServerName from dbo.Servers (nolock) where IsSQL = 1 and SQLVersion = 2000. The reason I do it this way is because if I ever need to change the query, all of the scripts that call the query will automatically pick up the change. So putting the query in a file is a good thing.

2. $SqlCmd1 — here’s where I’m actually running the query in the step above and setting the results to this var. Now I’ve got an array of ServerNames I can cycle through to run my script on.

3. [System.String]$ServerName = $_.ServerName; — Here I’m just setting the current cursor iteration of the ServerName equal to a var ($ServerName). It just makes it easier to read and if something comes up where I need to make a minor adjustment to how the ServerName looks, then I can mess with the var and not have to touch the rest of the script. Again, reducing effects to the script if I need to make changes to the value. So it’s like later on down the line if I discover that I need to put [] around it or something like that. This could also be done in the query file, but it’s nice to have the option here as well should I need something special for a single script.

4. $ServerName; — Here I’m just printing out the ServerName so I can watch my script cycle through the boxes. It’s my way of knowing what’s going on. I hate that blinking cursor.

5. invoke-sqlcmd — This is where the real magic happens. I’m connecting to each box and running the job script I saved above.

Here are some finer points of the logic for why I did things the way I did:

1. I put the server list query into a var. This is to allow you (or me) to get the server list any way we like w/o disturbing the rest of the query. You could have a hardcoded list of serverNames, or have them in a file… whatever you like. Here’s a really good example.

Instead of this:
$SqlCmd1 = Invoke-Sqlcmd -ServerInstance $Server -Database $StatsDB -inputfile $SourceSQLScript1
You can have this:
$SqlCmd1 = “Server1”, “Server2”, “Server3”, “Server4”, “Server5”

 And the rest of the script stays the same because the cursor is run off of the var. It doesn’t care how that var gets populated.

 2. I chose to run this separately for each edition of SQL. So I ran it once for SQL2K, once for SQL2K5, etc. That’s because like I said there are differences in the job scripts for the different versions. I could have done it all in a single script, but I really don’t like coding that logic in there and it increases the size of the script because now I have to check for the version of SQL and make a decision before I can create the job. I have to point it to the right .sql script. That’s too much work when all I have to do is change the query file and the script file between runs. And this way, I can run them concurrently for each version simply by saving the script under different names and kicking them off at the same time.

 3. This one is my favorite because it’s what makes this so flexible. I chose to put the sql logic itself into a file and just call that file. The reason I did this is because not only is the sql logic too long for this type of script, but if I’ve got anything else I need to run against all my boxes, all I have to do is change the sql script I’m calling. I don’t have to make any changes at all to the rest of the script. So if I want to create or delete any specific users off of a group of boxes, all I have to do is put that code in a script and save it, and then point this script to it. Or anything else, right? This is really the bread and butter for powershell and multi-server mgmt.

So guys, take this boiler plate script and methodology and apply it to your own stuff. I’ve taken a lot of stuff out of the script for this blog because my scripts are part of a much larger solution I’ve developed and it’s not relevant to what you’ll be doing. But the script works as-is and you can add whatever you want to it. And now you’ve got a generic way to push T-SQL out to your boxes, and a generic way to populate the server list it goes against. What else do you need really?

The hidden update

I had a cool situation today that didn’t come to me right away so I thought I’d share it with all of you.  We’re going to be talking about the lovely deadlock today.  I can’t remember if I’ve ever talked about it before, but I am now so there…

OK, here’s the situation.  I was called over by our very sexy web team lead (George), who told me that they were having deadlock issues on the web portal DB.  Fine, I’ll put a server-side trace on it and see what comes up.  For this I used the Locks\DeadlockGraph.  Once I got my info back, I noticed that we had table1 and table 2 and table3 in the mix. Table1 was a delete against itself.  Then another session ran an update against Table2 joined to table3.  The problem is that the update was deadlocking with the delete and the delete was losing every time.  And also, why was the deadlock on table1?  The update doesn’t even touch table1. 

For starters, all the tables have an update trigger that pulls the inputbuffer and session info for the spid that ran the update.  It then puts this info in a log table.  I don’t know why.  Unfortunately that wasn’t the problem.  I checked and none of the tables turned out to be views either so that avenue was dead.  The problem was just a tiny bit buried, but I eventually found it.  There was another table in the mix… table4.  Table3 had an update cascade set on its FK to table4 and table4 had an FK back to table1.  AH-HA… there’s your connection.  Now, as well, there’s wasn’t an index on the FK col in table1, so it was doing a scan.  Nice huh? 

So my recommended fix was as follows:

1.  Get rid of the auditing update triggers.  If you really want to log the action then put that code in an SP and call it after your update is done, but not as part of the main transaction.  Yes, I’m aware of the very minute risks in this, but they’re far out-weighed by completing your transaction so much faster.

2.  Put an index on the table1 FK column.  This is probably going to give you the biggest bang for your buck.  If it’s not doing a table scan, then it’ll get in and out faster so there’ll be less chance of deadlocking with the delete.  I believe the delete is also searching on the same col so it would really be worthwhile.

3.  Use updlock on the update query. 

My whole plan here is to get these transactions as short as possible.  It’s not enough to have efficient queries because if you’ve got a bunch of other stuff in the transaction then you might as well be doing a table scan every time.  And I know that reading table1/table2/table3 and all that isn’t easy to follow, but hey, we’ve gotta scrub these things for the internet, right?  Really the whole point is that you have to dig sometimes to find the source of something.  I knew the deadlock was on that other table and I could see the index scan in the execution plan, but that table wasn’t listed in the query or in the trigger.  So it had to be coming from somewhere.  So the basic point of this is to remind you of some of the things that are possible so you can remember to check these types of things too.

A Nice SQL Virus

I’m being forced off of my powershell kick here for a few mins to talk about viruses.

Why do companies refuse to allow scanning exceptions for DB files?  Do you really think that anyone’s written a SQL virus?  I really can’t remember ever having see an actual virus that attacks SQL files.  Actually, I take that back.  There is one virus that attacks DB files and drags performance so far down you can barely use the system.  In fact in my performance realm I actually classify this virus as an official DOS attack against my DB.  It’s called virus scan.  Virus scans are the only thing I’ve seen that specifically target DB servers to bring down the performance.

So if any of you non-SQL admins out there are running AV on your servers w/o exception policies for your DB files, turn yourselves in because you’re officially launching a DOS attack against your own servers.

But I promise I’m not bitter.

Cool Powershell Scenario

here’s a cool scenario where you need to set all of your DBs to simple mode and then back again.

Let’s say that you have a bunch of DBs on your server that are all in full mode and you’re setting up new backup routines on the server and you want to start everything from scratch.  So in this scenario you may want to switch all the DBs to simple mode to truncate the logs and then back to full again.  This is to prevent a huge log backup the first time, and you may not even have a full backup file to go with it anymore so what would be the point?

So here’s some simple code you could run to make this happen very quickly and easily.

> dir | %{$_.set_RecoveryModel(3)}  # Set recovery to simple to truncate the logs.

> dir | %{$_.set_RecoveryModel(1)}  # Set recovery back to full.

Everything you do with powershell doesn’t have to cure cancer.  It can just save you a couple mins or even a few mins of tedium.  Writing really cool scripts to do big things is awesome, but most of the stuff you’re going to do is this adhoc kinda stuff.  That’s the reason I love powershell so much, because you don’t have to do anything grandios with it.  You can just make your day easier.

Derived Column vs Script Component

I get asked this from time to time… when should I put a transform inside a Derived Column (DC) and when should I put it in a Script Component (SC).  Well the answer isn’t always easy, but here are a couple guidelines… because you do have to make this decision quite often in SSIS.

Scenario #1:

In short I’d use a DC when the transform is really simple and there’s not much chance of it creeping past what it is initially.  So if it’s going to be something like testing a simple condition and doing a simple replacement based off of it, then a DC is an easy choice to make.  However, if it’s likely that the scope of the transform will shift in the future to something really big then you’re better off with an SC.

Scenario #2:

If the above transform is going to rely on several other columns in the row, then you’re far better off with an SC because it’s much easier to read.  So if you have to compare 4 or 5 cols to get the answer then use an SC.

Scenario #3:

If the transform will be really complex and/or have a lot of sub conditions in it, then it may be possible to do it in a DC, but you’ll be far better off in an SC.  Everything in a DC is on a single line and that model breaks down pretty fast.  And of course there are some sub-conditions that you won’t be able to accurately represent in the very limited DC.

Scenario #4:

If you need to test a data type like IsNumeric(), you can’t do that in a DC at all.  Even though it’s a valid VBA function, SSIS doesn’t support it, so you have to rely on .Net inside of an SC.

Scenario #5:

If you’ve got a lot of simple transforms and you’ve got one that’s more complex or longer, etc, then you may want to consider using an SC simply to keep everything in one place.

That’s all I’ve got.

Disk monitoring isn’t easy by default

At my last gig one of the other managers was constantly pissed off at my refusal to not setup disk space alerts.  "We’re always filling up the logs." he says, "so we need to monitor for space and send an alert."  And of course, that certainly sounds reasonable, so why did I refuse then?  Well, put simply, something that sounds reasonable to someone who doesn’t know what he’s doing isn’t reasonable to those of us who know something about DBs… oh and mind your own business.

The issue wasn’t whether or not I thought alerting on disk space is theoretically a good idea or not.  The issue is whether it’s what you actually need.  See, in our situation, the company was really slow to buy disks.  So most of the time we would be running at over 90% capacity for all of our disks.  And a lot of them were running on vapors.  In fact, our big DW system was reporting .01% free space for 2mos before they got us new disks.  Now, tell me I don’t know how to manage space.  So the logs were always filling up the drives because there was nothing left for them to grow into.  And the source system would push these mass changes w/o telling us quite often and there would be 3x the transactions out of the blue one night and we would have to find a way to deal with it.  So in the middle of ETL the disk would go from .01% free to 0%.  What the hell am I supposed to alert on?  Since we’re always running to tight, by the time the alert got fired and the email was sent out, the processes would have stopped anyway, so what good is the alert going to do?  And no matter what I said to him he just couldn’t get it out of his head that an alert is what we needed.  When what we really really needed was to not be running on vapors.  Trust me dude, I can spell SQL so just let me do my job.  I know it seems like we’re neglecting your system, but we’re really not.  There’s just nothing we can do about it.

The real fix of course is to get some disks and not run them at capacity.  I almost had them there when I left.  I had been telling them for 4yrs that you can’t run disks at capacity and they were finally barely starting to listen.  But again, remember it took us at least a year to get the new disks approved so once we got them, we were already running low.  And when the log is filling up in the middle of a big operation you won’t get an alert in any reasonable time because they’re not run continuously, they’re run every few mins.  So it’s quite possible that all the action of the disk filling up happens between sampling internals. 

I kind of have the same problem at my current gig as well.  Disk space is at a premium and alerting has proven to be a challenge because those rogue processes are the ones that push the log over the top and they fill up an already stressed drive very fast.  And there are unseen consequences as well.  Let’s take a look at a specific example I had just yesterday.  One of these boxes had a rogue large transaction that took the size of the log through the roof and filled up the drive.  So the DB shut down and I had to fix it manually.  However, the log backups go to the NAS with a lot of the other backups in the LAN so with the log backups being so much bigger it took up a lot of unexpected space on that drive as well and failed lots of backups on other servers.  Now there’s nothing I can do about the rogue processes filling up the log, but currently I also can’t alert on them effectively either.

So if you really want to effectively monitor your disk space, then run your disks at about 50% or so to give your alerting process a chance to detect the threshold breach and do something about it.

Oh y, and I recently heard from the DBA that took my place that they’re still running at capacity and that other manager, without me there to stand in his way, has finally gotten his precious alerts.  And the alert comes about 10mins after the process stops and the DBA is already working on it. 

Poweshell wins again

It may seem a little redundant, but I love stuff like this. I was asked in front of a group of devs to script out a group of SPs on the prod box and copy them over to the new test box. These SPs stretch across a couple schemas and are named differently from the other ones in those schemas. As it turns out, there are something like 300 of them total. I don’t have a final count.

So when the guy asked me I said sure, that’ll take me like 60secs. And one of the other devs said, there’s no way. You have to check all of those boxes individually and and make sure you don’t miss anything. I said, of course I can. I’m a powershell guy (yes, i actually said that). He was like, even if you could script something like that out, there’s no way to easily get all the ones you need. You’ll be much faster in the wizard.

I told him, I accept your challenge. And for the first time, I gave a dev rights in prod and we had a face-off right there. We sat side by side and both of us started working feverishly to get our SPs scripted. Him in the wizard and me in powershell. Very quickly a crowd gathered. We prob had like 15-20 people gather. These were the PMs, other devs, report writers, etc. They all wanted to see if the bigshot DBA MVP could be taken down by a lowly dev.

Unfortunately like 90secs later, I completed my script and was building my file with my scripted SPs. He was still slugging his way through the wizard and wasn’t even close to having all his little boxes checked. When I finished, I just stood up and walked out as everyone clapped. When I left to come back upstairs he was still working at the wizard determinded to at least finish.

At least that’s how my powershell hero stories always play-out in my mind. I really did get that assignment today, but it was through email and I just did it without any pomp and circumstance. Oh well, a guy can dream can’t he?

Here’s the code I wrote today to pull SPs by schema and matching a pattern. I used the regex in powershell to make this happen. Enjoy.

PS SQLSERVER:\SQL\Server1\DEFAULT\Databases\ProdDB\StoredProcedures> dir | ?{$_.schema -eq “Bay” -or $_.schema -match “EBM”} | ?{$_.Name -match “Bay?_PR”} | %{$_.Script() | out-file C:\SPs.txt -append; “GO” | out-file C:\SPs.txt -append}

Maybe someday one of you will actually save the day with it.

How to Monitor SQL Services with Powershell

Here’s the situation…

You get a call from one of your customers saying that the log has filled up on the DB and they can’t do anything any more.  So you connect to the server and find out that the log backups haven’t been running.  So you run the backup and everything is hunkydory.  But why did it fail to run in the first place?  Well about 3secs of investigation tells you that the Agent was turned off.  Ok, you turn it back on and go on about your business.  But this isn’t the way to do things.  You don’t want your customers informing you of important conditions on your DB servers.  And you certainly don’t want Agent to be turned off. 

And while there may be some other ways to monitor whether services are running or not, I’m going to talk about how to do it in PS.  There are 2 ways to do this in PS… get-service and get-wmiobject.  Let’s take a look at each one to see how they compare.

In the old days (about 2yrs ago), when all we had was the antiquated powershell v.1, you had to use get-wmiobject for this task because get-service didn’t allow you to hit remote boxes.  All that’s changed now so you can easily run get-service against a remote box with the -computername parameter.

get-service -computername Server2

And of course it supports a comma-separated list like this:

get-service -computername Server2, Server3

And just for completeness here’s how you would sort it, because by default they’re going to be sorted by DisplayName so services from both boxes will be inter-mingled.

get-service -computername Server2, Server3 | sort -property MachineName | FT MachineName, DisplayName, Status

Ok, that was more than just sorting wasn’t it?  I added a format-table (FT) with the columns I wanted to see.  You have to put the MachineName there so you know which box you’re gong against, right?  And the status is whether it’s running or not.

Remember though that I said we were going to do SQL services, and not all the services.  So we still have to limit the query to give us only SQL services.  This too can be done in 2 ways:

get-service -computername Server2, Server3 -include “*sql*” | sort -property MachineName | FT MachineName, DisplayName, Status

get-service -computername Server2, Server3 | ?{$_.DisplayName -match “sql”} | sort -property MachineName | FT MachineName, DisplayName, Status

so here I’ve used the -include and the where-object(?).  They’ll both give you the same results, only the -include will filter the results on the remote server and the where-object will filter them on the client.  So ultimately the -include will be more efficient because you don’t have to send all that extra text across the wire only to throw it away.

And of course, you don’t have to use that inline list to go against several boxes.  In fact, I don’t even recommend it because it doesn’t scale.  For purposes of this discussion I’ll put the servers in a txt file on C:.  Here’s how you would do the same thing while reading the servers from a txt file, only this time you could very conveniently have as many servers in there as you like.  And when creating the file, just put each server on a new line like this:

Server2
Server3

So here’s the same line above with the txt file instead:

get-content C:\Servers.txt | %{get-service -computername $_ -include “*sql*” | sort -property MachineName | FT MachineName, DisplayName, Status}

This is well documented so I’m not going to explain the foreach(%) to you.

Ok, so let’s move on to the next method because I think I’ve said all I need to say about get-service.  But isn’t this just gold?

get-wmiobject

Earlier I was talking about what we did in the old days and I always used to recommend get-wmiobject because of the server limitation imposed on get-service.  However, does that mean that get-wmiobject is completely interchangable with get-service now?  Unfortunately not.  I’m going to go ahead and cut to the chase here and say that you’ll still wanna use get-wmiobject for this task most of the time… if not all of the time, because why change methods?

You’ll notice one key difference between doing a gm against these 2 methods:

get-service | gm

get-wmiobject win32_service | gm

The get-wmiobject has more methods and more properties.

And the key property we’re interested in here is the StartMode.

If you’re going to monitor for services to see which ones are stopped, it’s a good idea to know if they’re supposed to be stopped.  Or even which ones are set to Manual when they should be set to Automatic.

And for this reason I highly recommend using getwmiobject instead of get-service.

Here’s some sample code using the txt file again.

get-content C:\Servers.txt | %{get-wmiobject win32_service -computernatm $_ -filter “DisplayName like ‘%sql%’ “} | FT SystemName, DisplayName, State, StartMode -auto

Notice that the names of things change between methods too, so watch out for that.  So like MachineName changes to SystemName.  You’ll also notice that I didn’t provide you with a full working example of a complete script.  That’ll be for another time perhaps.  The script I use fits into an entire solution so it’s tough to give you just a single script w/o also giving you all the stuff that goes along with it.  And that just gets out of the scope of a single blog post.

However, I’ll leave you with these parting pieces of advice when building your service monitor.

1.  Instead of pulling the servers from a txt file, put them in a table somewhere so you can run all of your processes from that location.

2.  Use get-wmiobject win32_service instead of get-service.  It’s more flexible.

3.  When you collect your data, just save it to a table somewhere instead of alerting on it right away.  In other words, there should be a collection and a separate alerting mechanism.

   *** Why you ask?  Well I’m glad you asked, because not asking something that should be asked is like asking something that shouldn’t be asked but in reverse.  Anyway though… I prefer to get a single alert on all my boxes at once instead of an alert for each box, or for each service.  And that kind of grouping is much easier to do in T-SQL than in PS.  Also, there may happen a time when a service is down for a reason and you don’t want to get alerts on it but you still want to get alerts on the rest of the environment.  This is easier to do in T-SQL as well.  And finally, you may want to also attempt to start the services that are down and that really should be a separate process so you can control it better.  Or you may just want to be alerted and handle them manually.  Again, what if the service is supposed to be down for some reason, you certainly don’t want the collection process going out and restarting it for you.  And the collection can be a nice way to make sure you remember to turn the service back on when you’re done with whatever you were doing.  You’ll get an alert saying it’s down, and you’ll be all like, oh y, I totally forgot to turn that back on and my backups aren’t kicking off.  All the same, you really do want the collection, alerting, and action processes to be separated.  But that’s just me, you do what you want. ***

4.  Keep history of that status of the services.  You can look back over the last few months and see which ones have given you the most trouble and you can then try to discover why.  It’s good info to have and you may not realize how much time you’re spending on certain boxes until you see it written down like that.

My favorite vendor code.

I was given this code the other day to review before it gets run against prod.  It’s code that a vendor wrote to help us clean-up some of the bad and old records in this one app table.

There’s really not much to say about it except that it’s by far the best vendor code that’s crossed my desk in a long time.

I hope you enjoy it as much as I do.

 BEGIN TRANTable1
SET touchedwhen = GETDATE(),
touchedby = 'mysupport',
statuscode = 'CISPA'
WHERE createdwhen < '2009-01-01 00:00:00.000' AND statuscode = 'AWDP'
-- Where createwhen < getdate()-1 and statuscode = 'AWDP'

ROLLBACK
COMMIT

So once I got this and the email chain I started based off of it was priceless.  I don’t feel really right about printing the email chain here, but rest assured that I’ve defended the logic of this query perfectly.