Tag Archives: Architecture

A Very Heated Argument about Backup Tuning in Minion Backup

A couple weeks ago we here at MinionWare got into a very heated argument that lasted most of the morning and part of the afternoon. The argument was around the backup tuning settings in Minion Backup (MB), and how they should work vs. how they actually work.
The problem came about because Jen was doing some testing for her first MB session at a user group. She came across an issue with the tuning settings when she added the time component to the Minion.BackupTuningThresholds table. She noticed that she wasn’t getting the tuning settings she thought she should get when she was trying to tune for a specific time of day. So naturally she assumed I was stupid and filed it as a bug.

In actuality though it’s doing exactly what it’s supposed to, and it’s following the letter of the Minion Backup law. That law is “Once you’re at a level, you never go back up”. Let me show you what I mean.

Precedence in the Tuning Thresholds table

Take a look at this sample Minion.BackupTuningThresholds table.

[Screenshot: sample Minion.BackupTuningThresholds rows for MinionDefault and MinionDev]

Ok, in the above table we’ve got some tuning rows. This is a truncated version of the table, but it’s all we need to demonstrate precedence. We’ve got two rule sets here; one for MinionDefault (the row that provides all the default configuration settings), and one for MinionDev (a specific database on my server).

  • MinionDefault is a global setting that says unless the DB has an override, it’ll take its rows from here.
  • MinionDev is the only DB on this server that has an override, so it’ll take its settings from the MinionDev rows.

At the most basic level, the precedence rule states that once there is an override row for a database, that database will never leave that level…it will never default back to the default row. So in this example, MinionDev is at the database level for its settings, so it will never go back up to the more generic MinionDefault row. Once you’re at a level, you stay at that level.

A “Zero Row” for every level

I’m going to explain how these rows work, and why they are the way they are. Notice that for both levels (that is, for the MinionDefault rows, and for the MinionDev rows), there is what we call a zero row. This is where the ThresholdValue = 0. The zero row is especially important for the MinionDefault row, because this is what covers all DBs; it’s quite possible that you could get a database that’s less than your lowest threshold value.

In the above table, the lowest (nonzero) threshold value for MinionDefault is 20GB. That means that no DBs under 20GB will get any tuning values. Without any tuning values, the number of files would be NULL, and therefore you wouldn’t be able to back up anything…they wouldn’t have any files. So setting the zero row is essential.

And, since each DB stays at that level once it’s got an override, then whenever you put in a DB-level override it’s an excellent idea to give that DB a zero row as well. It may be 50GB now, but if you ever run an archive routine that drops it below your lowest threshold, then your backups will stop if you don’t have that zero row to catch it. Did I explain that well enough? Does it make sense?

That’s how the rule is applied at a high level between DBs. Let’s now look at how it’s applied within the DB itself.

“Zero Rows” within the database level

As I just stated above, you should really have a zero row for each database that has an override row (you know, where DBName = <yourDBname>).

Let’s look at MinionDev above. It has a BackupType=All set, and a BackupType=Full set. The All set takes care of all backup types that don’t have backup type overrides. So in this case, the All set takes care of Log and Diff backups, because there’s a specific override for Full. Get it? Good, let’s move on.

Notice that MinionDev has a zero row for the All set, and a zero row for the Full set. This is essential because following the rules of precedence, once it’s at the MinionDev/Full level, it doesn’t leave that level. So again, if your database ever falls below your lowest tuning threshold – in this case 150GB – and there’s no zero row to catch it, the backup will fail, because there are no tuning parameters defined below 150GB. This again is why the zero row is so important: it provides settings for all backups that fall below your lowest tuning setting.

And, if you were to put in a BackupType=Log override for MinionDev, it would also need to have a zero row. I could argue that it’s even more important there because it’s quite possible that your log could be below your tuning threshold.
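If you ever want to double-check yourself, a quick query can flag any override set that’s missing its zero row. This is just a minimal sketch that assumes the column names discussed above (DBName, BackupType, ThresholdValue); check it against your actual Minion.BackupTuningThresholds schema before relying on it.

-- Hedged sketch: find any (DBName, BackupType) override set with no zero row.
-- Column names are taken from the discussion above; verify against your install.
SELECT DBName, BackupType
FROM Minion.BackupTuningThresholds
GROUP BY DBName, BackupType
HAVING MIN(ThresholdValue) > 0;

Anything this returns is a set that has nothing to fall back on if the DB ever drops below its lowest threshold.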

So now, our Argument

That’s how the precedence actually works in the Minion.BackupTuningThresholds table. The argument started when Jen thought that it should move back up to the All set if a specific BackupType override falls below its tuning threshold. So in other words, in the above table, she wouldn’t require a zero row for the MinionDev-Full set. Instead, if the DB size fell below the 150GB threshold, she would move it back up to the MinionDev-All set, and take the lowest tuning threshold from there.

She said that it wasn’t in the spirit of the precedence rules to make the setting quite that pedantic. So after hours of arguing, drawing on the board, making our case, sketching out different scenarios, etc… we just kinda lost steam and moved on, because she had to get ready for her talk.

The point is though that this is the way it currently works: once it’s at its most specific level, it stays there. So, if you have tuning settings for specific backup types, you’d be really well served to have a zero row for each one just in case.

And I’ll also note that BackupType is the lowest granularity. So, Day and Time (another config option in this table) have nothing to do with this setting. You need to concentrate on the DBName and BackupType. Everything else will fall into place.

Final Caveat: We break the rule (a little)

Now, I know it sounds like a contradiction, but there is just one place where I break this rule. I call it the FailSafe. With the FailSafe, it’s possible to have specific overrides and still get your tuning thresholds from the MinionDefault zero row. Here’s why:

This is a rather nuanced config in Minion Backup, and it’s fairly easy to get something wrong and wind up without a backup. I didn’t want that to happen. So, if you do something like leave your zero row out for an override level, and your DB falls below your lowest threshold setting, you would wind up without any backup because there isn’t a number of files to pass to the statement generator.

Failsafe says, if you screw up and don’t have a tuning setting available, MB will grab settings from the MinionDefault Zero Row.

In this situation, I kick in the FailSafe mechanism, which pulls the tuning settings from the MinionDefault zero row. At least you’ll have a backup, even if it’s slow.

(That was one of Jen’s arguments: that a FailSafe is a great idea, but she wants it to come from the DB-All set instead of the MinionDefault-All set. I don’t know, maybe she’s right. Maybe that’s more intuitive. I’ll have to think about it. It wouldn’t be that big of a change really. I could walk up the chain. In the above table I could try the MinionDev-All zero row and if that doesn’t exist then I could use the MinionDefault-All zero row. What do you guys think?)

So why not just hardcode a single file into the routine so that when this happens you’re backing up to that single file? The answer is: flexibility. Your MinionDefault zero row may be set to 4 files because all your databases are kinda big and you don’t ever want to backup with fewer than that. So, set your MinionDefault zero row to something you want your smallest DB to use. If that’s a single file, then ok, but if it’s 4 or 6 files, then also ok. That’s why I didn’t hardcode a value into the FailSafe: It’s all about giving you the power to easily configure the routine to your environment.

Takeaways:

  1. The precedence rules are followed to the very letter of the law.
  2. Once a database is configured at a level, it stays there.
  3. The configuration level is specific to DBName, and then (at the next most specific level) to the DBName and BackupType.
  4. Whenever you have a database-level override row, always have a zero row for it.
  5. Whenever you have a BackupType-level override, always have a zero row for it.
  6. The FailSafe defaults back to the MinionDefault zero row if a level-appropriate setting isn’t available.

Ok, that’s it for this time. I hope this explanation helps you understand the reasoning behind what we did.

25 things I learned writing commercial software

It’s our job to learn new things.  We’re constantly studying, practicing, refining, etc.  But I’m not sure that I’ve ever learned so much about the different ways people work than I have by writing commercial software.  Now, my free Minion modules don’t cost anything, but that doesn’t mean they’re not commercial software.  They’re released to the public and are becoming quite popular, so they’re commercial in the distribution sense.

And there are things that I’ve learned about SQL and DBAs in general while building these tools.  Here’s a list of some of the things I’ve learned while writing the Minion maintenance modules.  Not all of these were new to me.  Some of them I’ve known for years, but were shoved in my face during this process.  Others I’ve kind of known, and still others never even crossed my mind because I’ve never had to deal with them before.  So come with me on the very rewarding, yet sometimes often quite frustrating journey of commercial software design.

  1. The customer isn’t always right. That may work in the food service industry, but it definitely doesn’t fly in IT.  I’ve found that giving people what they want is a dicey thing because not only do a lot of them not really know what they want, but sometimes they don’t have enough experience to know that what they want isn’t what they need.
  2. Service is key. At MinionWare we pride ourselves on answering support tickets immediately.  We consider it poor service to even let a ticket sit for an hour and in fact most of the time we answer the ticket within 10mins.  I think this is essential because people have to know that their issues are going to be addressed in a timely manner.
  3. You’re proud of your product. You’ve written something that you think everyone will just love.  So you package it up and send it out to the masses.  And as much as you love what your code does for the world, cut that love in half for the public.  Nobody will love your baby as much as you do; at least not in the beginning they won’t.  However, there’ll be some who really get what you do and who love it right away.  Others will take some convincing.  While yet others never get excited about anything.  It’s only DB maintenance, dude; how exciting can it be?
  4. People have all kinds of reasons for not adopting your solution. Sometimes it’s warranted, and sometimes it’s just laziness, or not wanting to change.  This is neither good nor bad, it just exists.  Get ready to hear ‘no thanks’ a lot more than you’re used to.
  5. There are so many different configurations and situations people run SQL in that it’s exceptionally difficult to write software to cover all the bases. Minion Backup was more difficult in this respect than Minion Reindex, but there was still some of that for MR.  But there are so many ways people want to add backups to their processes and so many things they need to be able to do that it’s really hard to get it right.  So the question is, have we gotten it right with MB?  Honestly, only time will tell, but I think we objectively did a really good job.  We’ve had some bugs but no major config flaws that I can see.  I think we’re setup well enough for the future of the product.
  6. It has to be as easy to configure as possible. Users don’t like to jump through hoops to make simple changes to software.
  7. No matter what you put in the product, you’ll have forgotten something that someone wants. I forgot to allow NUL backups in MB and a user requested it.
  8. User requests and bug reports are a good thing. It doesn’t necessarily make you a bad coder to have bugs.  You could just have a complicated app with many different complicated situations and you can’t code for everything out of the gate.  But feature requests and bug reports mean that people are using your software and like it well enough to want to see it improved.
  9. That BS you pulled at your last company where the code you wrote was “good enough” simply won’t fly here. Your name is on this code and how embarrassing would it be for someone to comment on a poor portion of your code only for you to have to say that you didn’t feel like doing it right.  Laziness is no excuse for poor coding or design.  Take the time to do it right, even if you have to recode portions several times.
  10. Don’t be afraid to try really outlandish things. IT gets mired in the mundane sometimes.  Turn your product on its ear.  If there’s something that you really want to be able to do, but it seems too hard, or even impossible, then that’s a place for you to shine.  Try writing sample code for even the most outlandish ideas.  You never know when it’s really not going to be as bad as it seemed.  It may not always work out, but at least you’re trying to tackle the issues people are faced with.  I had a few of these moments in MB.  There are problems we face every day with different backup situations that I wanted to solve.  And I didn’t want to be bound by what’s considered tradition to solve them.
  11. You can’t control who downloads your software. You may have a primarily American market in mind, but you’ll get downloads from all around the world.  Why is this important?  Well, that instantly throws you into different collations, time zone issues, etc.  I got caught in MB by foreign decimals.  I hadn’t counted on that and when I started getting downloads from other countries, backups stopped running because Powershell and SQL handle these decimals differently.  I didn’t know that before I started this.
  12. Test Test Test… then test again. Keep a list of all your edge cases and test every new version against every one of them.  The more you test the better your product.  And formalize it.  Don’t just run it a few times on your laptop and release it to the world.  If you support different versions of SQL then you have to test every feature not only on every one of those versions, but also on all versions of windows they can be installed on.  And if you can, test it against every major service pack.  Microsoft added 3 columns to RESTORE HEADERONLY in a service pack and it broke MB.  It didn’t even cross my mind to test for individual service packs before I started this.
  13. You can’t test for everything. Sometimes there are some ridiculous things that keep your software from being successful and sometimes they’re not anything you could’ve foreseen.  Again, MB has a perfect example.  As it turns out, when you’re loading the Powershell SQL module, having SSAS installed and running on the server has no effect on backups.  However, if you have SSAS installed and the service isn’t started, then it shoots up a warning when you load the provider.  So we found that the warning was taking the place of the data we were expecting and backups were halted.  If you’d have asked me a year ago if having SSAS turned off would affect your backup routine, I would’ve said ‘Hell No’.  Leave it to me to write software that finds this kind of issue.
  14. Every feature request doesn’t have the same weight. I don’t really believe in up-voting feature reqs.  I think if a feature is a good idea then it should go into the product no matter how many people requested it.  Maybe I’ll change my mind when there are 2 million people using my stuff and I’ve got feature reqs coming out my ears, but for now, I look at everything on its merits.  That doesn’t mean though that every request is equal.  I’ve had some pretty ridiculous feature reqs from people who clearly weren’t DBAs and really don’t know the proper way to manage their backups.  These are the requests you don’t give much weight to.  However, this is your opportunity to teach, so help your product shine by showing them the proper way to do things using your product to do it.
  15. Documentation is key. The more you can tell people about your product the more successful you’ll be.  There are people who just won’t read it, but there are others who will comb out every last nugget.  And if you have a particularly sensitive feature, or something that is a little more difficult to configure, then give your reasoning behind designing it the way you did.  Give the use cases for the feature.  This will help people know when to use it and when not to.  And it’ll help them know what’s going on behind the scenes.  The more they know the better everyone is.
  16. You can’t add every feature request.
  17. Use your own software. If you don’t use it, then who will?  And there’s no better way to flesh out bugs, and usability issues.  You should always put yourself in the shoes of your users coming in for the first time.  You’d be really surprised how quirky something you wrote for yourself is.  MB was my private backup utility for years and I had a set of steps I went through to set it up.  I knew them.  I was used to them.  So it didn’t bother me having to do it.  But expecting everyone to go through those steps is ridiculous.  Sometimes you can only make something so easy, but don’t go out of your way to make it hard.  Step out of your own head.
  18. Get plenty of people to test it out for you. This can be hard because you’ve not only got to find someone willing to put beta software on their box, but they’ve got to be the right person.  Building up a group of reliable beta testers can be the difference between life and death.  I’ve had beta testers find some pretty glaring bugs in my software and I’m grateful for each and every one of them.
  19. Seriously ask yourself if you’re actually adding anything to the market. Are you really solving a problem, or putting a really good spin on something? Or just throwing a slightly different version of the same thing out there?  So if you’re not an expert in the field you’re writing the software in, then do some research and find out what already exists and what the biggest issues are.
  20. The internet is a cold, dark place. Writing software is one thing, and getting the word out is another.  You quickly find that coming up with different ways to get the word out isn’t as easy as you’d think.  It takes persistence too.  You can’t just send out a couple tweets and a blog and call it a day.  It takes dedication and a lot of thought to find the avenues that’ll actually tell people about your stuff.  Keep with it though.
  21. Write software with support in mind. Chances are you’ll have to support what you write, and not leaving yourself in a good position will be the death of you.  So make sure you try to anticipate any issues someone could have and write in some debugging mechanisms.  Your customers will love you for it, and so will you.  And don’t make the debug info too hard to get at.  Remember, you’re the one who’s going to use it, so give yourself what you need.  Sometimes your customers will use it and bypass you altogether.  These are the guys we like.
  22. Writing software is one thing, but learning to support it is another. Sure, you may be the genius behind your code, but that doesn’t mean you have experience troubleshooting it.  Sure, you’ve debugged your code many times on your test box, but what about through email with a customer who won’t just let you on his system?  Do you know the right questions to ask?  Do you know the right things to have them do to repro the more complicated issues?  I’ve had to learn how to support my own products and it’s shown me that even my debug mechanisms weren’t what I thought they were.  So I’ve had to improve my debugging scenario and going forward it’ll be first on my mind with every feature.
  23. There’s a fine line between hardcoding things, and having parameters. You can’t hardcode everything, but you can’t have 500 params passed in either.   It’s just too clunky.  So good luck with finding that balance.
  24. Never rest on your laurels. Always be thinking ahead to the next release.  I’m very happy with the current version of MB and MR, but before the code was released I was already listing enhancements for the next couple releases.
  25. Be honest about your shortcomings. People hate it when you BS them, so don’t even try.  Be honest about how the product works and why.  Not only will people respect you more for it, but you may convert them to your way of thinking.  People who don’t love you won’t love you anyway so you can’t convert everyone, but being honest about your bugs, and your features can go a very long way.  Show them you’re making an honest good-faith effort to write something good.

Security Theater

The biggest question I get about the Minion products is about security. People constantly tell me that a shop can’t implement Minion because it enables xp_cmdshell, whereas the SQL Server Maintenance Solution by Ola Hallengren doesn’t, so it’s naturally more secure. So in this post I’m going to show you, using both Minion and Ola’s routines, that what most people consider “more secure” is really just security theater. Now since the names of Ola’s routines change slightly, and the entire solution has kind of a long name, I’m going to refer to them collectively as SQLMaint. As well, I’m going to refer to the Minion suite of maintenance routines collectively as Minion.

For those of you who don’t know, Security Theater is when you have the appearance of security, but in reality you’re not secure at all. Your security measures are just for show. We see this everywhere, and it’s probably one of the main reasons that we’ve got so many data breaches happening all over the world. We’ve got admins making security decisions that they never bother testing, or never bother even questioning so while their intentions are good, they wind up with just the appearance of security. You’ll never be smarter than every hacker on the planet, but don’t get in your own way either.

So here I’m going to compare and contrast the methods that both Minion and SQLMaint use to perform their tasks and see which one is honestly more secure in your environment.

Let’s start by looking at how each product does its job.

Minion: Powershell and xp_cmdshell

First let’s look at Minion. The Minion tools use a combination of Powershell and xp_cmdshell inside SPs. Now, I’ve heard screams from some admins that Powershell isn’t secure because it could allow someone to do something inside SQL or the OS that they don’t have rights to do. This is called privilege escalation and it’s outright false. Powershell is a product like anything else. It has to follow security protocols the same as any other program. That means that there are no hooks inside Windows that notice when a Powershell script is running, and give it extra access. It simply runs under the context of the account that started the script. So Powershell is like any other scripting language from that respect. And if you’re accessing SQL Server from Powershell, then you’re still limited to your rights inside SQL itself. Powershell has no native way to access SQL so there’s no way for SQL to notice that Powershell is trying to connect and give it extra privileges. You access SQL from Powershell from one of 3 different methods: .net, sqlcmd, or invoke-sqlcmd.

Nobody has accused .net of privilege escalation, so making this call from Powershell wouldn’t do it either as you’re literally creating a .net framework object and connecting to SQL through any of those methods. And nobody thinks that sqlcmd gives you any extra rights either. You just connect to SQL and either use your AD account or your SQL account and you get the perms you’d have anyway. And of course, invoke-sqlcmd is just a Powershell wrapper for sqlcmd so there’s no extra security stuff going on in there either.

This is a good time to mention that Powershell is just a SQL client like anything else. SQL itself doesn’t even know Powershell exists. So when you connect to SQL through Powershell, as far as SQL is concerned, it might as well be SSMS, or Excel, or a website, or VBScript, or Python, etc. They’re all just clients and SQL doesn’t do anything special for any one of them. So the idea that Powershell leads to unintended privilege escalation is just that… it’s an idea. But it’s completely false. It has to follow the same security rules everything else does. And believe me, I really wish it would give me the rights to do things my personal account can’t.

Now does that mean that someone can’t fool Powershell into running under a different account so that it has more rights? Of course not. Hackers are smart. They’ve got all kinds of ways to get around things. But Powershell isn’t any less secure than VBScript, which comes on your box by default. So if they exploit a security mechanism they can use that exploit against Powershell or VBScript or even just batch file commands.

Second, the Minion tools use xp_cmdshell. By many admins this is considered to be a huge security hole. Why? Well in short the issue is definitely with privilege escalation. The issue isn’t with what they’re afraid the Minion tools are doing; they’re afraid of what someone else could do with xp_cmdshell enabled. Because with this enabled, someone could fairly easily access the OS layer and do all kinds of things under the SQL service account credentials.

SQLMaint: sqlcmd

SQLMaint works by calling sqlcmd from a job step. This is considered more secure because you don’t have to enable anything special outside of the default security configuration. However, I’m going to show you that this isn’t actually more secure; in fact, it can be considered less secure. So let’s get into it and I’ll show you what I mean.

xp_cmdshell is limited to sysadmins by default

Ok, we’ll start with xp_cmdshell. Let’s look at how you enable xp_cmdshell and the implications it has.

You implement xp_cmdshell by turning it on in sp_configure. This is an instance-level config so once it’s on, it’s on for all the DBs on that instance. Here’s the code you run to turn it on:

First you have to have ‘Show Advanced Options’ enabled.

EXEC sp_configure 'show advanced options', 1

RECONFIGURE

EXEC sp_configure 'xp_cmdshell', 1

RECONFIGURE

*Here’s something few people know. You actually only have to specify enough letters to make the option unique. So in the first command above you really only have to do this:

EXEC sp_configure 'show', 1

If there were two options that started with ‘show’ you’d have to put enough letters in there to make it unique.


Now that it’s on, by default it’s only available to sysadmins. Nobody else can access xp_cmdshell in any way unless you give them access. And what does this mean in reality? It means that sysadmins have rights to do through xp_cmdshell what they’ve already got rights to do anyway. They’re really not getting any extra rights since most DBAs have full rights on the box anyway. And since xp_cmdshell runs under the context of the SQL service, then they’re running under whatever rights it has. This is one of the big reasons why it’s important to follow the least privilege rule for your service accounts. The other reason is because someone who knows the service account password could log in under that account and do stuff and have their tracks completely covered. The threat doesn’t have to come from outside the company.
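If you want to see exactly whose rights you’re borrowing, a one-liner run as a sysadmin makes the point (assuming xp_cmdshell is already enabled as shown above):

-- Returns the Windows account xp_cmdshell executes under: the SQL Server service account.
EXEC master..xp_cmdshell 'whoami';

Whatever that account can touch on the OS and the network, a sysadmin can now touch through xp_cmdshell, which is exactly why least privilege on the service account matters.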

How to grant access to xp_cmdshell for non-sysadmins

You can give non-sysadmins rights to xp_cmdshell, but it takes an extra step. Since you don’t want just anyone running with full access, you have to define a proxy account. This proxy account provides the security context for all xp_cmdshell executions performed by non-sysadmins. This means that your non-sysadmins don’t automatically have unfettered access to anything on the OS or the network, because you’re going to make sure that the proxy account has only the rights it needs. You’re not going to make the proxy account an admin on any of your boxes. Here’s how you create the proxy:

EXEC sp_xp_cmdshell_proxy_account 'Domain\ProxyLogin', '$$$$###MyStr0ngPassw0rd!@#!@#!!!'

And yes, it has to be an AD account… or at least a local Windows account (I would imagine). And the reason is simple. The only reason for running xp_cmdshell is to access OS-level things. The OS has nothing to do with SQL security so you need to pass it a Windows account. Now you can grant any non-sysadmin execute rights on xp_cmdshell.
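With the proxy in place, the grant itself is just a couple of statements in master. The login name here is made up; substitute whichever non-sysadmin actually needs it:

USE master;
-- The login needs a user in master before it can be granted anything there
-- (assuming the login itself already exists at the server level).
CREATE USER [Domain\AppTeamLogin] FOR LOGIN [Domain\AppTeamLogin];
-- Non-sysadmin executions will now run under the proxy account's credentials.
GRANT EXECUTE ON xp_cmdshell TO [Domain\AppTeamLogin];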

The question is do you need to give non-sysadmins access to xp_cmdshell? I have to say that in my 20yrs in SQL, I think I can remember needing this only like once… maybe twice. The point is, that this is a lot like linked servers. The answer is almost always NO, unless there’s such a compelling reason that can’t be gotten around any other way. And that’s almost never. So in all but the strictest of circumstances, xp_cmdshell is only going to be available to your admins who have OS rights to do what they need anyway. Xp_cmdshell just makes it easier for them to code it.

The dangers of sqlcmd

Now let’s look at the method SQLMaint uses to launch its routines. Like I said, SQLMaint calls sqlcmd, which is an OS-level cmdline executable. So you have to have a way to make an OS-level call. And the way this is done is by using the command line job step type. So this job step type allows you to write any cmdline operation you need and it’ll run it for you, you guessed it, under the service account credentials. So by default this functionality is turned on and most people don’t even know it. And if you don’t know it’s there by default then how can you lock it down? The good news though is that only sysadmins have access to this type of job step by default. But anyone who has rights to alter jobs can make changes to the step.

So what does this mean for security vs security theater? Well, with xp_cmdshell you know you’re turning it on and you have to go out of your way to create the proxy and give non-sysadmins rights to it. So at every step of the way you’re aware of what’s going on. But with the job step you’re accepting the default config so anyone with the rights can come in and create a job with this job step and do something they’re not supposed to, or better yet, just alter the current job step in SQLMaint’s call.

Here’s a call I took for SQLMaint’s backup routine directly from its website:

sqlcmd -E -S $(ESCAPE_SQUOTE(SRVR)) -d master -Q "EXECUTE dbo.DatabaseBackup @Databases = 'USER_DATABASES', @Directory = 'C:\Backup', @BackupType = 'FULL'" -b

The above command backs up the user DBs and this is the code that’s inside his command line job step. Now, what if I wanted to be malicious? I could easily open the job and alter it to do this:

sqlcmd -E -S $(ESCAPE_SQUOTE(SRVR)) -d master -Q "CREATE LOGIN [MyNewSALogin] WITH PASSWORD = 'StrongPassword'; EXEC master..sp_addsrvrolemember @loginame = N'MyNewSALogin', @rolename = N'sysadmin'; EXECUTE dbo.DatabaseBackup @Databases = 'USER_DATABASES', @Directory = 'C:\Backup', @BackupType = 'FULL'" -b

Ok, so I just gave myself sysadmin on this instance. And I know what you’re thinking. You have to have admin rights to be able to make this change. Well, that’s not even close to accurate. You don’t need admin rights, and you don’t even need any of the SQL Agent job-management roles. In fact, a regular user can make these types of changes with very minimal rights. Let me give you a scenario that’s not only possible, but is highly likely.

You have a production box that has lots of jobs on it. You have different groups that need to create, alter or manage these jobs because as the DBA team you don’t want to have to get involved every time one of these many jobs has an issue, or every time they need an update… because it happens a lot. So you give that app team rights to manage jobs. Here’s where the fun begins. There are multiple ways this can go wrong, but I’ll go with the simplest. All your user needs is 3 permissions inside msdb, and here they are:

grant select on sysjobs to [SecurityTest]
grant select, update on sysjobsteps to [SecurityTest]


I created a SecurityTest user and gave him access to msdb. Then I granted the perms above. Now the only thing the user has to do to recreate what I did above is run a simple update.

update sysjobsteps
set command = 'CREATE LOGIN [MyNewSALogin] with password = ''StrongPassword''; EXEC master..sp_addsrvrolemember @loginame = N''MyNewSALogin'', @rolename = N''sysadmin''; ' + command
where job_id = '0C06625F-F518-4D86-9E5A-063AE8B9C4E4'
and step_name = 'BackupDBs'


He can query sysjobs to get the list of jobs and find the one he’s interested in, and then update sysjobsteps to make the change. Now, the next time that job runs, he’ll have a sysadmin account. He’s then free to cover his tracks by removing the changes, and even give himself a couple backdoors in case someone discovers his new account. This can even include adding this type of code inside of other SPs or jobs so that if his rights are ever removed, they’ll be put back. And depending on how many instances run off of the same SQL service account, he could easily spread his access to every other server in your shop. And he doesn’t even have to be that smart to do it. This isn’t all that clever of a hack.

But you see what’s happened here, right? You wanted to give him rights to manage his jobs and you ended up giving him sa. And he didn’t need that many rights to do it. See, one of the misunderstandings is that the msdb tables are system tables. They’re not. They’re regular user tables, so you can easily update them directly.

Lock down sqlcmd!

But how do you protect against this? Well, the same way you did for xp_cmdshell. You create a proxy account and run those command line job steps under an account with much lesser rights. This way, even though someone might change the code the job runs, it’ll fail because the account won’t have the rights to make the change.
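For reference, the setup is the standard SQL Agent credential/proxy combo. All the names and the password below are placeholders, so treat this as a sketch rather than a recipe:

-- 1. A credential that maps to a low-privilege Windows account.
CREATE CREDENTIAL LowPrivCmdCred
    WITH IDENTITY = 'Domain\LowPrivSvc', SECRET = 'Str0ngPasswordHere';

-- 2. An Agent proxy that uses that credential.
EXEC msdb.dbo.sp_add_proxy
    @proxy_name = 'CmdExecProxy',
    @credential_name = 'LowPrivCmdCred',
    @enabled = 1;

-- 3. Let the proxy run command line (CmdExec) job steps.
EXEC msdb.dbo.sp_grant_proxy_to_subsystem
    @proxy_name = 'CmdExecProxy',
    @subsystem_name = 'CmdExec';

Then point the “Run as” option on each command line job step at CmdExecProxy. Now even if someone alters the step, it runs with that account’s rights instead of the service account’s.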

Security is a complicated animal with many facets, twists, turns, and pitfalls. And to say that one solution is more secure than another just because one of them uses a feature that’s turned off by default simply isn’t the case. I hope I’ve proven that turning on xp_cmdshell isn’t inherently bad, and I really hope you understand now that simply leaving it off doesn’t mean that you’re secure. In fact, I’d say you’re at greater risk because you’re not likely to have addressed this issue in your jobs. You’re merely engaging in Security Theater. You have to be a sysadmin to turn on xp_cmdshell and you have to give someone specific rights to run it after creating a proxy. But you could innocently give someone perms in msdb and give them the keys to the kingdom.

So I hope you start looking at the xp_cmdshell issue differently because it’s a wonderful feature that allows you to do some very cool things. And it lets Minion tools give you some really awesome functionality that you wouldn’t have otherwise.

To AutoGrow or not?

I just got this question in the user group and thought I’d write a blog instead of just answering a sub-set of users who could benefit from it.  The question was:

I have customized the values of the Auto growth according to the size of the database and the rate at which it grows. I have noticed that Auto growth kicks in about every 3 months – 6 months on an average. Is that OK? I have read articles where the advice on it ranges from “Auto growth is OK” to “Auto growth should kick in only during emergency”.

This is one of those topics that comes up again and again, unlike AutoShrink which I hope is settled by now.  I suspect it keeps coming up because there’s no real solid answer.

Ok, so whether or not to AutoGrow your files.  I’m going to talk about both data and log files together unless there’s a difference.  So unless I call one out over the other, I’m talking about them both.

You should definitely use AutoGrow.  And you should definitely NOT use AutoGrow.  That’s my way of getting around saying “it depends”.

It depends on a few factors really.

1. What you’re going to do with the files.
2. How big your environment is.
3. How many other files are on the drive.
4. How much activity is on the files.
5. Monitoring method


Maybe there’s more, but that’s all I can think of right this second, but you get the idea.  Ok, so let’s go through them one at a time.

1. What you’re going to do with the files.

From time to time I step into a shop where the DBAs have bought into this idea that AutoGrowth is bad so they have some job setup to monitor the size and they grow the files manually.  Now while that sounds like a good idea, it can cause more problems than it solves.  Let’s look at a scenario I’ve encountered more times than I care to try to count.  You get a request from a group to restore the DB to a dev or maybe a QA box so they can do some testing.  This is a common scenario, right?  I’ve done it plenty in almost every shop I’ve been in. 

So you go to do the restore and it fails telling you that there’s not enough space on the drive.  You look and there’s not that much data in it so it should fit on the drive right?  No, not right.  The drive has to support the size of the file, not the size of the data.  So if you’ve got a 50GB drive, and a 100GB file it will fail even if there’s only 20GB of data in that file.  So now what do you do?  Do you go beg the SAN guys for more space or do you manage your files in smaller increments?

With AutoGrow you can set your own growth rate, so you can have it grow the files at whatever interval you want.  So how is manually growing the file going to add anything to the equation here?  And with Instant File Initialization (IFI) you don’t even have to worry about incurring the cost of zeroing out the file unless it’s a log.
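And setting that growth rate is a one-liner per file. The database and logical file names below are made up, so adjust them for your own DBs:

-- Grow the data file in fixed 1GB increments instead of a percentage.
ALTER DATABASE MyDB MODIFY FILE (NAME = N'MyDB_Data', FILEGROWTH = 1024MB);
-- Log files can't use IFI, so keep their growth increment reasonable too.
ALTER DATABASE MyDB MODIFY FILE (NAME = N'MyDB_Log', FILEGROWTH = 512MB);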

Now, for log files specifically, I know some of the top experts say that you can have perf problems if you grow your files too much and get too many VLFs, but honestly that problem really doesn’t come up that often.  And logs are very volatile.  Lots of things log activity that you don’t realize and I wouldn’t want the log file to rely on me.  And again, I can’t stress too much that it really matters what you’re going to be doing with the files.  If you’ve got an extra 60GB of space in your log because you’re afraid of VLFs, then you’ll need that extra 60GB on every other system you plan to restore the DB on.  And you may not be afraid of the VLFs on those lower-level servers.

2. How big your environment is

Now let’s talk about large enterprise environments.  I tend to be in really large shops with hundreds or thousands of servers.  And I don’t know about you, but I don’t wanna spend my time managing file growths.  Consider my last environment where I had over 900 servers with over 4,000 DBs spread across all of them.  And that was just prod.  I’m not going to do that kind of analysis on all of those servers and manually grow all of those files.  And it’s honestly just ridiculous to even try.  There are 2 ways I could solve a problem like this:

  • I could develop a process where I monitor the free space in all the files, and when it reaches a threshold it grows the file by a certain amount.  Hell, that’s just a homegrown version of autogrow isn’t it?  So that’s not a solution really.
  • I could also use autogrow on some of my boxes and manually grow my really important or trouble boxes.  And again we’re back to “it depends” aren’t we?  What we’re saying here is it’s ok to use autogrow on some servers and not on others, which means there’s no solid answer.  You just can’t spend all your time growing files.  Use autogrow here unless you have a reason not to.


3. How many other files are on the drive?

This argument may or may not have any teeth… it just depends on how you look at it.  The main reason for manually growing your files on drives where you’ve got a lot of other files is for the fragmentation.  And here I’m talking about fragmentation at the filesystem level, not inside the files themselves.  If you’ve got your files on a drive with lots of other files and they’re all growing, then they’ll be growing over each other, especially if they’re growing in smaller increments.  So you could fragment your drive pretty easily and that can definitely cause perf issues.  So the solution is typically to manually grow the files to a larger size so it reduces the amount of fragmentation you create when they do grow.  And that does have merit, but why not just set the AutoGrow setting higher then?

I can see a reason why you wouldn’t.  If there are a lot of DBs sharing that drive and they all grow fairly often, then you wouldn’t want to AutoGrow it to a certain size and have it fill up too much of the drive and starve the other DBs.  The most logical way around this issue though is twofold:

  • AutoGrow at smaller increments.  Unfortunately, this may put you back in the fragmentation scenario though.  If you go this route then you need to defrag the drive on a regular basis and you should be ok.
  • Split those DBs off onto their own drives.  This is the best solution because you get stuff for free.  Things like simplified space management, 0% fragmentation, and I/O isolation are all things that come along for the ride when you put DB files off onto their own drives.

However, all that said, if you can’t put the files on their own drives and you’re really afraid of starving the other DB files, then your only real choice may be to monitor the size and grow manually.  But this shouldn’t be the norm if you’re in a big shop.  Keep this kind of activity to a minimum if you can help it.


4. How much activity is on the files.

This one is almost like the other one, only this doesn’t necessarily rely on what else is on the drive.  This counts even if the file is on its own drive.  If the file grows a lot every day or every week, then you don’t want to take a chance on missing an email alert or whatever else you use and having the file fill up because you didn’t grow it.  So while there may be some exceptions, my skills are better spent elsewhere than growing files manually.


5. Monitoring method

Many shops monitor with 3rd party tools and those tools monitor disk space.  However, none of them are smart enough to know the difference between a full drive and a full file.  You could have a 100GB drive with a 99GB data file on it and the alarm will trip even if the file is only 3% full.  And depending on whether or not your monitoring team is friendly, they may or may not help you out by either turning off the alarm on that drive, or doing something so that it knows something about the space in the file.  I’ve honestly worked with both friendly and unfriendly teams.  So I could either set up an outlook rule to ignore all space alerts (bad idea) or shrink my file back again so it didn’t trip the alarm.
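If you want to see what the monitoring tool can’t – how full the file is, rather than how full the drive is – a quick query inside the DB will tell you:

-- Size vs. space actually used for each file in the current database (values in MB).
SELECT  name,
        size / 128.0                                     AS FileSizeMB,
        FILEPROPERTY(name, 'SpaceUsed') / 128.0          AS SpaceUsedMB,
        (size - FILEPROPERTY(name, 'SpaceUsed')) / 128.0 AS FreeInFileMB
FROM sys.database_files;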


Conclusion

So you can see there are several factors involved with this decision and chances are you’ll have a mixed solution.  I’ve worked in shops where I never managed space at the file level, and shops where it was very necessary, and everything in between.  For me #1 above is one of the biggest deciding factors.  I’m constantly fighting DBAs growing files a lot to be proactive and then we can’t restore to any of the other environments.  Even DR becomes an issue because you have to have that space anywhere you restore those DBs.  And that’s a lot of extra space to keep on hand for such little return.

Don’t get me wrong, I’m not a big fan of thin provisioning either.  I think that’s going a bit far, but it’s basically the same thing at the SAN level.  Thin provisioning is AutoGrow for the LUN itself.  And the biggest problem I have with it is that they tend to not grow it enough or they set the threshold too high, so the file fills up and brings the DB down while you’re still waiting for the LUN to expand.  If they can get it right though it’s not the evil it used to be.

So what we’re really doing with AutoGrow is thin provisioning our DB files.  And that’s actually much easier with IFI because data files expand in just a couple seconds.  That’s only for data files though.  Log files still have to be zeroed out, so you can run into the issue now and then where the log file is still growing when the process runs up against the end of the current file and everything stops.  Hey it happens.  Those are the cases where you might consider manually growing your log files.  These would be more DSS type systems where it’s unlikely that you’ll restore it to a different box.

Having huge files can also slow down your DR plan.  If you’ve got a huge log file and a 30min SLA, you could easily spend more time than that zeroing out your log file.  So you’ve practically guaranteed that you’ll miss your SLA just by trying to make sure you don’t run into an almost non-existent VLF issue.  So you’ve got to consider that too.


So anyway, I hope this helps you at least consider the different factors involved in making this decision.  Leave me comments if I’ve messed something up really badly.  Or if I’ve gotten something really right.  Hell, just tell me I have great hair and call it a day.


SODA Architecture

I was asked by a user to write this post.  And while I’m gonna beat around the bush a little, it’s only to give you some background so you understand the context of what this is all about.  Because everything’s connected and my goal here is to show you how a bunch of these things are related, especially when you break them down into essentials.

Quite often you’ll have a DB that just doesn’t perform no matter what you do to it.  You’ve spread-out the I/O, you’ve indexed the hell out of it, and you’ve thrown hardware at it.  And maybe it’s not strictly performance you’re concerned with.  Maybe you’re concerned with backup/restore time, or security, or something else.  Well, enter SODA.  SODA stands for Service-Oriented Data Architecture.  And what it really means to do is to split your DB into different pieces based off the services they’re going to perform.  For those of you who are really DB literate, that’s all you need, but I’ll go ahead and flesh this out a little bit for those of you who would like a little more explanation.

Background Basics

Now, before I really get into SODA itself, I’m going to go into some performance basics to give you some background on how this fits into the grand scheme of things.  Let’s take a look at disk access. 

Of course, whenever you need to get resting data, you need to get it from disk, and whenever you get it from disk, that’s a physical I/O that needs to be performed.  And like anything else that performs physical work, you can overload the disk and requests start to build-up.  And this is what the disk queue counters in perfmon are all about.  The disk queue counters tell you how many I/O requests are currently waiting in line (the queue).  And of course the larger the queue, the further behind the disk is. 

That’s not entirely relevant these days because most of us are using some kind of SAN and these counters are harder to measure on SANs.  The reason is because in order to interpret disk queues you need to know how many disks are in your array.  Every disk array in your system has an acceptable disk queue limit.  And what that functionally means is that if you have a single disk, say in your laptop, that single disk has a number of requests it can fit in its queue before it can be considered “falling behind”.  So there’s a certain level of queuing that can happen and still be considered normal behavior.  And of course, with all the different drive vendors and types of drives out there, you can imagine that this acceptable queue number would be different for them all.  And technically, you’re right.  If you want to know the exact acceptable disk queue for your given drive, you need to dig through the vendor specs and then take a lot of the numbers in there and plop them into a nice long formula that spits out the number of the acceptable disk queue for that drive.  And I’ll tell you right now that’s a bit hard to manage for all but the most dedicated IT folks (and even for them too, really). 

So the way we get around this is we just use a standard measure of 2 per disk.  So for any given single disk you can have an acceptable disk queue of 2.  That means that if you have an array with 10 disks in it, then you can have an acceptable disk queue of 20.  Get it?  That’s why this measure isn’t really as useful on a SAN because quite often you don’t have any real visibility into the specs of your SAN arrays.  Without this information, you can’t really say what an acceptable disk queue would be because you don’t know how many disks are in the array.  And, SANs are all mostly shared amongst many apps anyway so you don’t even have dedicated storage to begin with.  The counter you’ll use to diagnose disk problems on a SAN is avg. disk secs/read (or write).

Ok, so that seems like a pointless digression, but it ties in.  Like I said though I just wanted to give you a little background so we’re more or less on the same page.  Now that we have that background, we can say that in a situation where we’re disk queuing, we have an underlying cause for our issue.  What would you say that underlying cause is?  Go ahead and take a second to consider it real quick before I give you the answer. 

The answer is we have disk queuing because we’re overworking the disk.  We’re pushing more work at it than it can do so it has to stack up that work.  I know that sounds simplistic, and it is, but it has to be understood before we can go any further.  With that understood, how can we fix this issue?  This is the heart of what I want to talk about in this piece.  There are 2, and only 2 ways to fix this issue.  Those 2 ways are: decrease the activity on the disk or increase the throughput on the disk.  That’s it.  There’s nothing else you can do to fix a disk queuing issue.  It’s the methods for doing each of these that can get involved and where you have the most leeway to be creative.  Hang on, I’ll show you how this ties into SODA soon enough.  And of course I’m not going to get into a full discussion here about all the different ways you can solve the issue, but to give a couple examples, a common way to decrease the activity on the disk is to separate the different files onto different LUNs, and a common way of increasing the throughput is to add disks to the LUN.  Again, this isn’t going to be a full discussion on that topic, so I’m going to stop there.  We have enough info to continue with our current discussion on SODA. 

Let’s look at another common performance issue we have in DBs… that of table performance.  Again, you’ve got a single table that could be a bottleneck and you’ve got to make it perform faster.  And again, your choices are limited.  There are really only 2 ways to solve this bottleneck.  You can decrease the volume of the requests against the table, or you can increase the throughput of the table.  Well, quite often these 2 solutions go hand in hand, and while this isn’t a discussion on table performance, a couple of the more common ways to solve this problem are:

a) improve the indexes to decrease the volume of requests (by decreasing the I/O)

b) partitioning the table and placing the different partition data on different LUNs to increase the throughput of the table itself (there’s a rough sketch of this below).
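To make option b a little more concrete, here’s a rough sketch of a table partitioned by year across three filegroups. Everything here is hypothetical: the table, the dates, and the filegroup names, which would already need to exist on their own LUNs.

-- Hypothetical: spread dbo.Orders across three filegroups by OrderDate year.
CREATE PARTITION FUNCTION pfOrderYear (datetime)
    AS RANGE RIGHT FOR VALUES ('2014-01-01', '2015-01-01');

CREATE PARTITION SCHEME psOrderYear
    AS PARTITION pfOrderYear TO (FG_2013, FG_2014, FG_2015);

CREATE TABLE dbo.Orders
(
    OrderID   int      NOT NULL,
    OrderDate datetime NOT NULL,
    -- ...the rest of the columns...
    CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (OrderID, OrderDate)
) ON psOrderYear (OrderDate);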

So hopefully you’re seeing a pattern form here.  For any given bottleneck you’ve really only got 2 choices:  Decrease the requests or increase the throughput.  It’s in the execution of either of those where the art comes into performance tuning.

Now let’s talk about SODA.  SODA stands for Service-Oriented Data Architecture.  And its main purpose is to help get around some kind of bottleneck, typically at the system level.  That’s not the only reason to use SODA so nobody email me and say I’m an idiot cause I said the only reason to use SODA was for system-level performance.  So anyway, SODA is really just a way to break up your DB into pieces that are more manageable.

Let’s take a quick example.  Take a DB with something like 3,000 tables and thousands of SPs.  That’s bound to be a very busy system and you could easily run into both memory and CPU bottlenecks.  What people will commonly do is split up the DB based off of the services it provides.  They may split out the Shipping components from the rest of the DB, and they may split out the Orders portion as well.  This huge DB we started with could potentially be split into many separate DBs based off of the services or the functions they provide.  Then all the devs have to do is just teach these new DBs to talk to each other (usually through SSIS, BizTalk, web services, or Service Broker, etc).  You now have a DB architecture based off of the services provided by each one.  You have a Shipping DB, and an Orders DB, and a Manufacturing DB, and an Inventory DB, etc.  This is what SODA is.  It’s splitting off the different portions of a large DB into their own DBs based off of their services.  Now these DBs can go onto different servers and have their own security, resources, etc.  So you can see how this is solving the bottleneck problem.  You’re decreasing the requests on each single DB and increasing the throughput by giving them their own resources on their own servers.  And of course they can stay on the same server if you like.

And hopefully you can see why I wanted to give you the background on solving bottlenecks so you could see that this is no different.  It’s the same thing as before, just at the DB level.

Now that I’ve covered SODA, I think it only fitting that I also cover its counterpart, DDR.  DDR is Data-Dependent Routing, and it’s yet another way to manage the influx of data you’ve got coming in.  Imagine the same DB as above with many different modules inside it.  Only this time you’re not going to split it up by function, but by data range.  So maybe you’ll keep 5 copies of the exact same DB and each copy will contain a different region of data, or maybe a different country.  This is actually pretty common practice.  I’ve seen many apps that split their data apart like this.  And there are several reasons for it.  It could be performance, data size, archival, etc.  And the reason it’s called ‘data dependent’ is because quite often there’s a layer on the front end or maybe even the middle tier that decides where the request should go.  So for example, maybe you’re looking up customers in TX.  You submit the request from the front end, and it has knowledge of where the different regions are stored and it dynamically sends you to the DB that has the TX data.  So every request has to go through this process because every request potentially goes to a different DB.  These DBs can be on different servers, but that’s not always the case.  In fact, most of the time whenever I see a server with hundreds or thousands of DBs it’s because this design was used.

A good example of this is with online training companies.  I’ve seen a few of them that have chosen to keep all of their corporate customers in their own DB instead of having everyone together in the same DB.  There’s no need to put them on different servers because it’s not a performance decision.  It’s just an organizational decision.  Maybe it’s based on security, maybe it’s based on data size, or maybe it was adapted from a much smaller app and this was the best way they had to expand their business model with the least effort. Now, whether that’s a good design decision or a bad one doesn’t really matter here because putting all customers in the same DB and putting them in separate DBs both have their pros and cons so it really just depends on where you want your pain to reside.
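None of that is tied to any particular product; the DDR piece is often nothing more than a small routing table that the front end or middle tier checks before it connects. This is a purely hypothetical sketch:

-- Hypothetical routing table the front end or middle tier consults.
CREATE TABLE dbo.RegionRoute
(
    Region     varchar(10) NOT NULL PRIMARY KEY,
    ServerName sysname     NOT NULL,
    DBName     sysname     NOT NULL
);

INSERT dbo.RegionRoute (Region, ServerName, DBName)
VALUES ('TX', 'SQLProd02', 'Customers_TX'),
       ('CA', 'SQLProd03', 'Customers_CA');

-- "Looking up customers in TX": ask where TX lives, then connect there.
SELECT ServerName, DBName
FROM dbo.RegionRoute
WHERE Region = 'TX';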

So just to summarize:

SODA – putting each function or module in a DB into its own DB.  The schemas in the new DBs will be different as they all have different tables.

DDR – Keeping several identical copies of the same DB but with different data.  The schemas will be identical, but the data will be based off of some business rule you’ve chosen.  These rules can be date, region, country, alphabetical, etc.  It can be anything you like.

Ok I think that’s about all I have on this topic.  I know I went in kind of a roundabout way of talking about this, but I really felt you should have a little context.  Don’t think of these as completely separate things because they’re not.  Everything is related and if you can break them into basics then you’ll be much better at coming up with solutions.  And everything I talked about today is just another way to either increase throughput or decrease requests. 

The answer is No

There are times in a DBA’s life when the answer is No.  And for once I’m not talking about a DBA asking a girl out.  This time I’m talking about requests we get from our users.  I had such a request today.

A manager of another group came up and asked for a generic account that he could use to connect to the DB from an application. The answer is No. Sorry dude, but we don’t give out generic accounts unless it’s absolutely necessary. And of course, then the insistence sets in that he has to have the generic account because it’s an application. I said no, the app itself is running under an acct, and I’ll be happy to give that acct the perms. He said, well it’s not that easy, this is a web app. And I said well, that changes things significantly, and you’re right, it’s not that easy. He said, see, I told you. I said, it’s even easier.

You see, with a web app you can set the app pool to run under any acct you like, so connecting to the DB under specific credentials is wicked easy. And it really is. He said, well, we also need to be able to connect using SSMS to run a couple SPs manually, so we were hoping to use that acct for that as well. Again, No. Give me the accts you want to have these perms and I’ll make it happen.
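
And just so this isn’t all talk, here’s a minimal sketch of what “make it happen” looks like, assuming Windows auth. Every name in it (PortalDB, the domain accts, the SPs) is made up for illustration.

USE PortalDB;  -- hypothetical DB
GO
-- The app pool runs under its own domain acct, so that acct gets to connect
-- and execute exactly the SPs the app needs... no generic acct required.
CREATE LOGIN [MYDOMAIN\svc_WebPortal] FROM WINDOWS;
CREATE USER  [MYDOMAIN\svc_WebPortal] FOR LOGIN [MYDOMAIN\svc_WebPortal];
GRANT EXECUTE ON dbo.usp_GetOrders TO [MYDOMAIN\svc_WebPortal];
GRANT EXECUTE ON dbo.usp_SaveOrder TO [MYDOMAIN\svc_WebPortal];

-- The humans who need to run a couple SPs manually in SSMS get the same perms
-- under their own accts... or better yet, under an AD group you grant once.
CREATE LOGIN [MYDOMAIN\WebPortalOps] FROM WINDOWS;   -- an AD group works fine here
CREATE USER  [MYDOMAIN\WebPortalOps] FOR LOGIN [MYDOMAIN\WebPortalOps];
GRANT EXECUTE ON dbo.usp_GetOrders TO [MYDOMAIN\WebPortalOps];
GRANT EXECUTE ON dbo.usp_SaveOrder TO [MYDOMAIN\WebPortalOps];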

Now that’s the end of the conversation, but here’s a word for you guys reading this. There are several ways to make this happen depending on how you lay out the group (either in AD or SQL). The point I’m trying to make here is that sometimes you have to make sure you service your customers in a way that’s best for the company. They quite often ask for something specific that you can’t give them, but you can give them the equivalent. Oftentimes users ask for things a certain way because that’s all they know. They get too caught up in solving a problem, so they try to solve it the only way they know how… and in this case that was asking for a generic account. But this is where we as DBAs need to step up and have the courage to guide them. They may get upset, they may even insist that it be done their way, but they’re not DBAs, and their asses aren’t on the line. It’s our job to make sure that it gets done right.

Now, just for completeness, here’s a video I did quite some time ago that shows you how to run a website under a specific account. It’s in IIS 6, but it’ll show you how it’s done, and you should be able to transfer that skill to IIS 7 if you have to. And if the app has a Windows service it’s even easier… just run the service under the account you like and you’re golden.

A Round of Patching

I just finished an interesting email thread with a user, and it’s not at all an uncommon scenario. They were just installing a CU on some of their SQL boxes, and there was evidently a new bug introduced in that CU that causes their SSRS reports to render incorrectly. The sad thing is that this is the kind of thing that should have been caught in testing. Now of course, I don’t expect them to find everything during testing, but something as simple as report rendering should have been caught early on, because the second they turned their reporting structure back on, report rendering started messing up. Now, he assures me that they tested it, but what exactly did they test? Did they do any user testing at all, or did they just test the install of the patch itself? I see that quite often. I see shops claim to do patch testing, but all they’re concerned with is whether the install itself will fail.

Most of the time I blame managers and companies in general for this, because the IT staff knows what’s at stake more than anyone, but they’re not given the proper time and resources it takes to test properly. Managers always claim to know the business needs better, yet they never seem to grasp the full implications of putting untested code into production. All they want to know is, will this bring down my server? Unfortunately that always means something different to them than it does to us. What they mean is, will this fail and put us into an emergency situation? And the answer is probably not. However, disabling a critical piece of your business is putting you into an emergency situation. And in this case they’re lucky it’s just report rendering. What if it had been something more serious that caused some kind of logical corruption in their processing?

I would say that quite often the biggest problems caused by hotfixes are performance-related.  And by not testing a CU or hotfix properly you could bring your entire business to its knees.  You can slow your website down so much that nobody can get any real work done.  And what does that do for the reliability and confidence of your business?  It would be really nice sometimes if we in IT didn’t have to rely solely on people who know nothing about it to get our jobs done properly. 

And what’s amazing to me is that companies today, after all this time and all the mistakes, still don’t think it’s necessary to hire competent people to code and test. Oh sure, they pay lip service to it all day long, but what do they really do about it? Do they really slow down their dev cycles enough to do benchmarking and architect solutions properly? Do they spend the money necessary to ensure that their code and servers are solid by hiring enough competent people? Because every shop I’ve been in works their people so hard there’s no time for anything unexpected to happen. So I think I’ve said this before, but now I’m saying it again: it’s time to stop being children about IT. It’s time to stop letting people who know nothing about it make the technical decisions that affect entire projects. And it’s definitely time to stop thinking we can get by without testing things properly. And that includes functional testing as well as performance benchmarking.

The hidden update

I had a cool situation today where the answer didn’t come to me right away, so I thought I’d share it with all of you. We’re going to be talking about the lovely deadlock today. I can’t remember if I’ve ever talked about it before, but I am now, so there…

OK, here’s the situation. I was called over by our very sexy web team lead (George), who told me that they were having deadlock issues on the web portal DB. Fine, I’ll put a server-side trace on it and see what comes up. For this I used the Locks\Deadlock Graph event. Once I got my info back, I noticed that we had table1, table2, and table3 in the mix. The statement on table1 was a delete that joined the table against itself. Then another session ran an update against table2 joined to table3. The problem is that the update was deadlocking with the delete, and the delete was losing every time. And also, why was the deadlock on table1? The update doesn’t even touch table1.

For starters, all the tables have an update trigger that pulls the input buffer and session info for the spid that ran the update and puts it in a log table. I don’t know why. Unfortunately that wasn’t the problem. I checked, and none of the tables turned out to be views either, so that avenue was dead. The problem was just a tiny bit buried, but I eventually found it. There was another table in the mix… table4. Table3 had an update cascade set on its FK to table4, and table4 had an FK back to table1. AH-HA… there’s your connection. Now, as well, there wasn’t an index on the FK col in table1, so it was doing a scan. Nice huh?

So my recommended fix was as follows:

1.  Get rid of the auditing update triggers. If you really want to log the action, then put that code in an SP and call it after your update is done, but not as part of the main transaction. Yes, I’m aware of the very minute risks in this, but they’re far outweighed by completing your transaction so much faster.

2.  Put an index on the table1 FK column (see the sketch after this list). This is probably going to give you the biggest bang for your buck. If it’s not doing a table scan, then it’ll get in and out faster, so there’ll be less chance of deadlocking with the delete. I believe the delete is also searching on the same col, so it would really be worthwhile.

3.  Use updlock on the update query. 
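
Here’s a quick scrubbed sketch of fixes 2 and 3. The table and column names are the same fake ones I’ve been using (the real ones are scrubbed for the internet), so treat this as the shape of the fix rather than the literal code.

-- Fix 2: index the FK column so the lookup can seek instead of scanning table1.
-- FKCol is a stand-in for the real (scrubbed) column name.
CREATE NONCLUSTERED INDEX IX_table1_FKCol
    ON dbo.table1 (FKCol);

-- Fix 3: one way to apply the updlock hint... take update locks on the rows
-- read during the join instead of shared locks, so the update isn't stuck
-- converting locks while the delete holds its own.
UPDATE t2
   SET t2.SomeCol = t3.SomeCol
  FROM dbo.table2 AS t2
  JOIN dbo.table3 AS t3 WITH (UPDLOCK)
    ON t3.table2ID = t2.table2ID
 WHERE t3.SomeFilter = 1;

Exactly where the hint does the most good depends on the real plan, so test it against the actual workload before you ship it.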

My whole plan here is to get these transactions as short as possible. It’s not enough to have efficient queries, because if you’ve got a bunch of other stuff in the transaction, then you might as well be doing a table scan every time. And I know that reading table1/table2/table3 and all that isn’t easy to follow, but hey, we’ve gotta scrub these things for the internet, right? Really the whole point is that sometimes you have to dig to find the source of something. I knew the deadlock was on that other table, and I could see the index scan in the execution plan, but that table wasn’t listed in the query or in the trigger, so it had to be coming from somewhere. The basic point is to remind you of some of the things that are possible, so you remember to check for these types of things too.

Derived Column vs Script Component

I get asked this from time to time… when should I put a transform inside a Derived Column (DC), and when should I put it in a Script Component (SC)? Well, the answer isn’t always easy, but here are a couple of guidelines… because you do have to make this decision quite often in SSIS.

Scenario #1:

In short I’d use a DC when the transform is really simple and there’s not much chance of it creeping past what it is initially.  So if it’s going to be something like testing a simple condition and doing a simple replacement based off of it, then a DC is an easy choice to make.  However, if it’s likely that the scope of the transform will shift in the future to something really big then you’re better off with an SC.

Scenario #2:

If the above transform is going to rely on several other columns in the row, then you’re far better off with an SC because it’s much easier to read.  So if you have to compare 4 or 5 cols to get the answer then use an SC.

Scenario #3:

If the transform will be really complex and/or have a lot of sub-conditions in it, then it may be possible to do it in a DC, but you’ll be far better off in an SC. Everything in a DC is on a single line, and that model breaks down pretty fast. And of course there are some sub-conditions that you won’t be able to accurately represent in the very limited DC.

Scenario #4:

If you need to test a data type with something like IsNumeric(), you can’t do that in a DC at all. Even though it’s a valid VBA function, the SSIS expression language doesn’t support it, so you have to rely on .NET inside an SC.

Scenario #5:

If you’ve got a lot of simple transforms and you’ve got one that’s more complex or longer, etc, then you may want to consider using an SC simply to keep everything in one place.

That’s all I’ve got.

SSIS Email Chain

Today’s blog is actually an email chain between me and a user. It’s only a single question and reply, but I think it’s good info.

Q:
Hey, would you agree that a monthly load process is better served as an SSIS – even if you have to push/pull from text files for now – than as a series of SPs or DLLs?

A:
if you’re staying on the same box for the load then SPs can be an attractive option because they’re very fast and the memory stays in sql and can be managed quite well… if you move that process to ssis, and ssis is on the same box, then you have to allocate memory away from sql to run the pkg and house the data while it’s in the buffer…

if ssis is on another box, but the data is still being moved between dbs on the same box… so if the data is being moved from server1.db1 to server1.db2 and ssis is on server2, then you don’t have to fight sql for memory, but now you incur the network cost of moving the data off the box, and then back to it…

if you’re moving between boxes, then yes, ssis is a better choice, because in SPs you have to manage linked servers or openrowset to make that happen and that’s not cricket…

however, what makes ssis attractive in the single box scenario is that it handles errors more easily and alerting is much richer and easier to come by… you can also more easily fix problems in the data itself and it’s easier to extend… so if your requirements change and you need to switch it to another box, or if you need to send a copy somewhere else, etc, then that’s much easier in ssis… ssis also gives you parallelism that you can’t get in sps… you can load several tables at once in ssis where they have to be serialized in sps…

a good compromise in the single box scenario is to keep things moving like they are if possible, and where not stupid… so if they’ve already got an sp that inserts data into the 2nd db, then just call that sp from ssis… this way you get the insert and select process in sql where it belongs, and the workflow and error handling of ssis, and everything else that goes with it… if a single sp inserts several tables serially though, i’d create several sps and call them individually in ssis tasks… even if you have to keep them serialized (like for PK/FK issues)… because now you can even add things like checkpoints and individual error flows to your pkg steps and have a richer experience…

these things are meant to allow your process to grow and change much more easily than is possible with an sp… so if one of the DBs gets too big and has to move to another box, it’s a simple matter in ssis… but in an sp, that means you have to create and manage a linked server, which is just one more thing to deal with, and they come with their own problems…

as for the DLLs… these processes NEVER belong in a DLL… not only do you have to steal memory from sql, but it’s not optimized for data flow tasks… it’s just a plain DB connection… you can get parallelism by calling the windows threading model, but you have to manage those threads manually and mistakes can be made… and that takes extra memory, etc… not only that, but the code isn’t available for admins to fix problems, and making changes to compiled code can be dangerous, right… so no, that’s not a good idea…

does this answer… i realize you prob wanted more of a “yes, do this”, but it’s not that simple…
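
And just to put a little meat on that “create several sps” point from the email, here’s a minimal sketch of what the split looks like. All the names (the staging tables, the target db, the SPs) are made up for illustration.

-- Instead of one big sp that loads everything serially, each table gets its own sp...
CREATE PROCEDURE dbo.usp_LoadMonthly_Orders
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO db2.dbo.Orders (OrderID, OrderDate, Amount)
    SELECT OrderID, OrderDate, Amount
    FROM   db1.dbo.OrdersStage;
END
GO

CREATE PROCEDURE dbo.usp_LoadMonthly_OrderLines
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO db2.dbo.OrderLines (OrderID, LineNum, ProductID, Qty)
    SELECT OrderID, LineNum, ProductID, Qty
    FROM   db1.dbo.OrderLinesStage;
END
GO
-- ...and each sp gets its own Execute SQL Task in the package. Keep them
-- serialized with a precedence constraint where the PK/FK order matters, but
-- now each step can have its own error flow and checkpoint.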