Red Alert or Situation Normal?

I’m a huge sci-fi geek. I love stories about interstellar travel, artificial intelligences and alien worlds. I have a soft spot for all things Star Trek too. I’d love to be zipping through space at superluminal speeds—boldly going where no one has gone before. Well, except for those pesky inertial dampeners and warp drives. Seriously—in a world where technology has flung humanity across the cosmos, you would think they’d have more reliable systems! Ever heard of systems monitoring, Mr. Scott?


As dramatic as sudden failures of warp drives and inertial dampeners are—filled with flashing lights, blaring klaxons and (wait for it) sparking consoles—going from situation normal to red alert inside your own enterprise is rarely appreciated. “Seriously! Guys!” I want to shout at the T.V. “Shouldn’t you have noticed something was out of the ordinary before the engines exploded? Talk about an epic engineering failure, Mr. La Forge!”

As much as we might enjoy technology failures on a television show, you might be putting your own enterprise in the same situation today. Is it smooth sailing until you’re surprised by IT alarms, alerts and impending doom? Hopefully not! After all, you probably have some sort of monitoring solution in place. But many monitoring solutions fail to spot developing situations before it is red alert time. Why is that?

I have a theory, and it revolves around the attempt to reduce the “noise” most IT systems generate. By noise, I mean the flood of IT events any monitoring solution would generate if it were set to its full potential. Of course, such fine-grained alerting would quickly overwhelm an IT staff. You need an approach that reduces all that noise, yet still takes advantage of all that monitoring potential.

The most common way to resolve that dilemma is to set minimum and maximum thresholds. It’s a great idea, except static thresholds usually only spot the most egregious IT events. How can you know if anomalous IT events are cause for concern, especially if they never trigger a static threshold alert? You may eventually get an alert once the behavior crosses your threshold, but still feel blindsided by the actual situation. It’s suddenly red alert time when you thought it was situation normal. Static thresholds, while nice, put you at risk.
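To make that blind spot concrete, here is a minimal sketch of a static threshold check. The function name, the 5% and 90% limits, and the overnight CPU readings are all made up for illustration; they don't come from any particular monitoring product.

```python
def static_alert(value, low=5.0, high=90.0):
    """Return True when a reading falls outside fixed min/max thresholds.

    The 5%/90% limits are illustrative assumptions, not recommended values.
    """
    return value < low or value > high


# Hypothetical overnight CPU readings (percent). Usage is normally near 10%
# at this hour, so the climb starting at 35% is already anomalous.
readings = [10, 12, 11, 35, 55, 70, 85, 95]

alerts = [static_alert(r) for r in readings]
print(alerts)
```

In this made-up run, only the final reading trips the alert. Everything from 35% through 85% is well outside what is normal for that hour, yet it slips by as situation normal.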

I think every IT department should adopt a more dynamic approach to IT monitoring. Instead of static thresholds, you need adaptive thresholds that learn the “normal” behavior of certain IT metrics. You should only get an alert when monitored metrics fall outside that expected range. Here’s the real challenge though. Such an approach must be context aware. It’s not enough to have a line that simply moves up and down across all situations. You need an IT monitoring solution that knows CPU usage is going to spike near 8 a.m. as users log into systems, expects a drop-off when the work day is over, and expects another spike near 10 p.m. as your data center backs up critical information or runs a virus scan. Until your IT monitoring solution has dynamic, adaptive thresholds, you will probably miss developing situations in your IT environment in the attempt to reduce the IT noise.
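One simple way to picture a context-aware threshold is a separate baseline for each hour of the day. The sketch below is only an illustration under stated assumptions: it learns a mean and standard deviation per hour from hypothetical history, then flags readings more than `k` standard deviations from that hour's norm. The function names, the `k=3.0` and `min_band=5.0` settings, and the sample data are all invented here; a real adaptive monitoring product would likely use something more sophisticated.

```python
from collections import defaultdict
from statistics import mean, stdev


def learn_baseline(history):
    """Learn a per-hour baseline from (hour, value) samples.

    Returns {hour: (mean, standard deviation)} for hours with 2+ samples.
    """
    by_hour = defaultdict(list)
    for hour, value in history:
        by_hour[hour].append(value)
    return {h: (mean(v), stdev(v)) for h, v in by_hour.items() if len(v) >= 2}


def adaptive_alert(baseline, hour, value, k=3.0, min_band=5.0):
    """Return True when value is outside the expected range for that hour.

    k and min_band are illustrative assumptions; min_band keeps the band
    from collapsing when the history for an hour is very steady.
    """
    if hour not in baseline:
        return False  # no learned behavior for this hour yet
    mu, sigma = baseline[hour]
    band = max(k * sigma, min_band)
    return abs(value - mu) > band


# Hypothetical CPU history (percent): busy at 8 a.m. logins, quiet at 3 a.m.,
# busy again at 10 p.m. during backups.
history = (
    [(8, v) for v in (78, 82, 80, 85, 75)]
    + [(3, v) for v in (10, 12, 9, 11, 8)]
    + [(22, v) for v in (70, 74, 68, 72, 66)]
)
baseline = learn_baseline(history)

print(adaptive_alert(baseline, 8, 83))  # morning login spike
print(adaptive_alert(baseline, 3, 60))  # same-sized load at 3 a.m.
```

With this sample data, 83% CPU at 8 a.m. is treated as normal, while 60% at 3 a.m. raises an alert even though it would never cross a static 90% ceiling. Keying the baseline on hour of day is the simplest form of context; day of week or month-end cycles could be added the same way.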


Adaptive monitoring strategies are the future. And as much as I love Star Trek—it’s not a realistic picture of how systems should be run today—much less 300 years from now! If you’re not using adaptive thresholds for IT monitoring today, I hope you have a great swivel chair to rock around in as your enterprise constantly goes from situation normal to red alert.

As for me—I’ll settle for a smoother journey because I’ve spotted anomalous situations developing before they caused critical problems. Engage away with confidence, Captain—I’ve got your enterprise covered!
