Disk monitoring isn’t easy by default

At my last gig, one of the other managers was constantly pissed off at my refusal to set up disk space alerts.  "We're always filling up the logs," he'd say, "so we need to monitor for space and send an alert."  And of course, that certainly sounds reasonable, so why did I refuse?  Well, put simply, something that sounds reasonable to someone who doesn't know what he's doing isn't necessarily reasonable to those of us who know something about DBs… oh, and mind your own business.

The issue wasn't whether I thought alerting on disk space is theoretically a good idea.  The issue is whether it's what you actually need.  See, in our situation, the company was really slow to buy disks.  So most of the time we were running at over 90% capacity on all of our disks, and a lot of them were running on vapors.  In fact, our big DW system was reporting 0.01% free space for two months before they got us new disks.  Now, tell me I don't know how to manage space.  So the logs were always filling up the drives because there was nothing left for them to grow into.  And the source system would quite often push mass changes without telling us, so there would be 3x the transactions out of the blue one night and we would have to find a way to deal with it.  So in the middle of ETL the disk would go from 0.01% free to 0%.  What the hell am I supposed to alert on?  Since we were always running that tight, by the time the alert fired and the email went out, the processes would have stopped anyway, so what good is the alert going to do?  And no matter what I said to him, he just couldn't get it out of his head that an alert was what we needed, when what we really needed was to not be running on vapors.  Trust me dude, I can spell SQL, so just let me do my job.  I know it seems like we're neglecting your system, but we're really not.  There's just nothing we can do about it.

The real fix, of course, is to get some disks and not run them at capacity.  I almost had them there when I left.  I had been telling them for four years that you can't run disks at capacity, and they were finally, barely, starting to listen.  But again, remember it took us at least a year to get the new disks approved, so by the time we got them, we were already running low again.  And when the log is filling up in the middle of a big operation, you won't get an alert in any reasonable time because the checks aren't run continuously; they're run every few minutes.  So it's quite possible that all the action of the disk filling up happens between sampling intervals.
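To make the sampling-interval point concrete, here's roughly what that kind of polled check looks like.  This is just a minimal sketch, not the actual monitoring we ran; the path, threshold, and poll interval are all hypothetical.

```python
# Minimal sketch of a polled disk-space check -- illustration only.
# The path, threshold, and interval below are made-up values.
import shutil
import time

LOG_DRIVE = "/var/opt/mssql/log"   # hypothetical mount point for the log drive
ALERT_THRESHOLD = 0.10             # alert when less than 10% free
POLL_INTERVAL_SECONDS = 300        # the typical "every few minutes" schedule

def free_fraction(path: str) -> float:
    """Return the fraction of the volume that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total

while True:
    if free_fraction(LOG_DRIVE) < ALERT_THRESHOLD:
        print("ALERT: low disk space on", LOG_DRIVE)  # stand-in for an email/page
    # This sleep is the whole problem: a rogue transaction can take the drive
    # from "barely above threshold" to completely full inside this window,
    # so the alert fires after the database has already stopped.
    time.sleep(POLL_INTERVAL_SECONDS)
```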

I kind of have the same problem at my current gig as well.  Disk space is at a premium, and alerting has proven to be a challenge because rogue processes are the ones that push the log over the top, and they fill up an already stressed drive very fast.  And there are unseen consequences as well.  Let's take a look at a specific example from just yesterday.  One of these boxes had a rogue large transaction that took the size of the log through the roof and filled up the drive.  So the DB shut down and I had to fix it manually.  However, the log backups go to the NAS along with a lot of the other backups on the LAN, so with the log backups being so much bigger, they took up a lot of unexpected space on that drive as well and failed lots of backups on other servers.  Now there's nothing I can do about the rogue processes filling up the log, but currently I can't alert on them effectively either.

So if you really want to effectively monitor your disk space, run your disks at about 50% capacity or so to give your alerting process a chance to detect the threshold breach and do something about it.
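If you want to put rough numbers on that, the headroom you need above your alert threshold is basically your worst-case growth rate times your polling interval plus the time it takes someone to react.  Here's a back-of-the-envelope version; every number in it is hypothetical.

```python
# Back-of-the-envelope headroom math behind the "don't run at capacity" advice.
# All numbers here are made up for illustration.
growth_gb_per_min = 2.0      # how fast a rogue transaction can grow the log
poll_interval_min = 5        # how often the monitor samples the drive
reaction_time_min = 30       # time for a human to see the alert and act

# Free space needed *above* the alert threshold for the alert to be useful:
headroom_gb = growth_gb_per_min * (poll_interval_min + reaction_time_min)
print(f"Need at least {headroom_gb:.0f} GB free above the threshold")
# With these numbers that's 70 GB -- which is why a drive kept at 99.99% full
# can never give you an actionable alert, no matter how you tune the threshold.
```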

Oh yeah, and I recently heard from the DBA who took my place that they're still running at capacity, and that other manager, without me there to stand in his way, has finally gotten his precious alerts.  And the alert comes about 10 minutes after the process stops, when the DBA is already working on it.

3 thoughts on “Disk monitoring isn’t easy by default”

  1. Seems to be a common scenario in many companies. We move files around on a weekly basis to avoid this issue and are finally moving to a SAN environment this weekend. With it I should be able to plan accordingly without having my team worry about freeing up space every week. We've got applications inserting tens of millions of rows on a weekly basis which unfortunately are required for transactional reporting.

  2. Unfortunately, SAN disks are even more expensive so companies tend to run the SANs at capacity as well. So I fear you may just be moving your problem to a SAN instead of a DAS, but good luck on that.

  3. Agreed. It will give us additional time to plan and chop/compress unnecessary data and move it to a NAS. Right now it's impossible. Moving from 2005 EE to 2008 R2 EE.
