Don't Just Monitor. Manage!

Systems Management. It’s a common term, historically used by software vendors to categorise their monitoring solutions – the technologies that many organisations deploy to measure the performance and availability of their IT infrastructure. And there’s an important link there. After all, as the old adage goes, “you can’t manage what you don’t measure”. And it’s true. In order for IT teams to be able to optimise the performance of the IT infrastructure, and to respond to critical system problems, they have to know what is going on, and be notified quickly when something fails or degrades.

And it can work. Practically all vendors in the ‘Systems Management’ space have case studies describing the value that ‘X’ customer is getting from deploying ‘Y’ solution. The important thing to remember is that it takes effort. Human effort. The countless man-hours of keyboard-bashing and mouse-clicking in response to the alerts that are generated.

But that’s not management, is it? The Oxford English Dictionary defines ‘Management’ as “the process of managing”. Not an overly descriptive definition, but it makes an important assertion – namely that management is a process. The definition of Process? “A series of actions or steps towards achieving a particular end”. Now, if we apply this to the IT environment, whilst many organisations might say “we don’t have a process”, in reality they all do – it’s just not formally documented (and often not scalable or particularly resilient…)

Let’s look at ABC Corp again...

Jane, on the ServiceDesk, knows that when a particular type of event is received for the CashOrder app, she should assign the call to Bob, because if there’s a problem with CashOrder, you call Bob – he’s supported it since day one. He always knows how to fix it. And what does Bob do? He spends the next 30 or so minutes connecting to the system, checking to see who is using the app, and sending them an e-mail saying the app needs to be taken offline, so they’ll need to disconnect, and if they don’t he’s going to disconnect them anyway. He goes and grabs a coffee, then he checks again, stops the app, and navigates through the obscure folder hierarchy to that folder tucked away where all those old files collect (Bob knows that when there are more than 30 files, greater than 5 MB in size and over 4 hours old, the app doesn’t like it and throws a wobbly). He deletes them all, then restarts the app. Sometimes he’ll check if the app is actually working again, and sometimes, if he’s in a good mood, he’ll actually tell the users that the app is back up again so that they can continue doing the job they are being paid to do. Bob loses count of the number of times he has to do this every month…
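Bob's rule of thumb lives in his head, but it is simple enough to write down. The sketch below is one possible reading of it, not how CashOrder actually works: it assumes the rule means "more than 30 files that are each over 5 MB and over 4 hours old", that age is measured from the last-modified time, and that 1 MB is 1024 × 1024 bytes. The function names and the folder argument are hypothetical.

```python
import time
from pathlib import Path

# Thresholds taken from Bob's description; the exact interpretation is an assumption.
MAX_STALE_FILES = 30                 # "more than 30 files"
MIN_SIZE_BYTES = 5 * 1024 * 1024     # "greater than 5 MB"
MIN_AGE_SECONDS = 4 * 60 * 60        # "over 4 hours old"


def find_stale_files(folder, now=None,
                     min_size_bytes=MIN_SIZE_BYTES,
                     min_age_seconds=MIN_AGE_SECONDS):
    """Return the files in `folder` that are both large enough and old enough."""
    now = time.time() if now is None else now
    stale = []
    for path in Path(folder).iterdir():
        if not path.is_file():
            continue  # ignore sub-folders
        info = path.stat()
        age = now - info.st_mtime  # seconds since the file was last modified
        if info.st_size > min_size_bytes and age > min_age_seconds:
            stale.append(path)
    return stale


def cleanup_needed(folder, now=None,
                   min_size_bytes=MIN_SIZE_BYTES,
                   min_age_seconds=MIN_AGE_SECONDS,
                   max_files=MAX_STALE_FILES):
    """True when the number of stale files exceeds the threshold."""
    stale = find_stale_files(folder, now, min_size_bytes, min_age_seconds)
    return len(stale) > max_files
```

Once the rule is written down like this, it no longer matters whether Bob is the one who remembers it.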

Who, or what, did the ‘management’ here? In reality, part of the process was undertaken by Jane on the ServiceDesk assigning the call to Bob, and the rest by Bob himself, who undertook all the necessary steps to resolve the problem. Not the ‘Systems Management’ tool – in this case it’s just monitoring. But what happens if someone new on the ServiceDesk responds to the alert and, heaven forbid, Bob is on holiday? (Bob normally takes his Blackberry on holiday and checks his e-mail, but this time his wife put her foot down and told him to leave it at home…) Well, the monitoring tool continues to raise the alerts, but they take longer to resolve, as people have to work out who should own the problem, and what they need to do to fix it. The problem gets worse, people can’t do their job, they phone the ServiceDesk to complain, and eventually, a very unhappy Robin Farquhar-Brown, ABC’s VP of Operations, storms into the CIO’s office…

Wouldn’t it therefore be good if the ‘Systems Management’ (read ‘Monitoring’) tools actually ‘Managed’? A kind of ‘Systems Management 2.0’. A solution that provides the comprehensive, and important, monitoring of the IT infrastructure, but then actually knows what to do with the events when they are generated, regardless of whether Bob is on holiday or not. A tool that:

  1. Collects the event(s)

  2. Creates a ServiceDesk ticket

  3. Connects to the app to check current users

  4. E-mails the users asking them to disconnect

  5. Waits for users to disconnect; any who remain are e-mailed again, telling them the app is going down

  6. Stops the app

  7. Navigates to the folder and searches for old files meeting the defined criteria

  8. Deletes the files

  9. Starts the app, and validates that it is functioning again

  10. E-mails the users to tell them the app is back up and working

  11. Updates and closes the ServiceDesk ticket
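To make the idea concrete, here is a minimal sketch of what such a documented process could look like once it is written as code rather than kept in Bob's head. Everything in it is hypothetical: `Ticket`, `Step`, `run_runbook` and `cashorder_steps` are illustrative names, not the API of any real ServiceDesk or automation product, and the steps are placeholders that only record that they ran. The sketch also assumes that when a step fails, the sensible thing is to stop and hand the ticket to a person.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Ticket:
    """Stand-in for a ServiceDesk ticket (hypothetical, not a real API)."""
    event: str
    status: str = "open"
    notes: List[str] = field(default_factory=list)


@dataclass
class Step:
    """One named action in the documented process."""
    name: str
    action: Callable[[], None]


def run_runbook(event: str, steps: List[Step]) -> Ticket:
    """Run the steps in order, recording each outcome on the ticket."""
    ticket = Ticket(event=event)        # steps 1-2: collect the event, create a ticket
    for step in steps:                  # steps 3-10: the work Bob does by hand
        try:
            step.action()
        except Exception as exc:
            # Assumption: on failure, stop and escalate to a human.
            ticket.notes.append(f"FAILED: {step.name} ({exc})")
            ticket.status = "escalated"
            return ticket
        ticket.notes.append(f"done: {step.name}")
    ticket.status = "closed"            # step 11: update and close the ticket
    return ticket


def cashorder_steps(log: List[str]) -> List[Step]:
    """Placeholder steps: each one only appends its name to `log`."""
    names = [
        "check current users",
        "e-mail users asking them to disconnect",
        "wait, then warn remaining users",
        "stop the app",
        "find old files matching the criteria",
        "delete the files",
        "start the app and validate it",
        "e-mail users that the app is back",
    ]
    return [Step(name, lambda n=name: log.append(n)) for name in names]


log: List[str] = []
ticket = run_runbook("CashOrder: stale files alert", cashorder_steps(log))
print(ticket.status)
for note in ticket.notes:
    print(note)
```

In a real deployment each placeholder would call out to the app, the mail system and the ServiceDesk. The point of the sketch is only that the order of the steps, and what happens when one fails, are written down once and run the same way every time.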

Systems Management 2.0? IT Process Automation anyone..?
  • Both Gartner and Forrester state that IT Operations consumes 60 to 75% of IT budgets. A major part of IT operations is spent dealing with events. While there are ways to reduce events with good architecture, BSM, etc., no system or application is perfect, so events are a way of life. IT organizations can’t ignore them, so they must deal with them. NetIQ's philosophy is in agreement with yours, in a sense – document the process for dealing with events and then automate that process as much as possible to drive down operational costs and reallocate those funds to innovation. Check out the analyst-documented results of one customer. The cost savings can be dramatic.
  • Management is a process, plain and simple. Tools don't manage, people do. Technology and tools can only provide information and assistance. They cannot actually perform exclusive of human involvement.

    The 11-step outline is nice but overwrought. This maniacal focus on "event management" crap is maddening and a grand misapplication of resources. Events are usage-based and occur as a result of user activity in conjunction with system configuration. Trying to wrap one's head around millions of events is a big waste of time. Treating symptoms is not addressing the source of problems. An event is a result of some activity. With an end-to-end view of all devices and usage, one can more easily address "management" issues without all the complexity.

    Management is a process and not something that can be accomplished with any one solution or tool, especially when said tools cost thousands of dollars and require hours and hours of involvement to extract useful information.