Enterprise Management - The Pros and Cons
Point Monitoring: As IT and the business grow, monitoring becomes more and more important. In some cases monitoring tools expand based on devices purchased (IE: SAN, switch, etc.) because the device may come with its own point monitoring tool. In other cases, IT realizes that additional tools are required: one for the network, one for applications, one for end user simulation, and so on. No one tool is best of breed for monitoring everything... and even though large software companies provide features or tools for each critical area, that doesn't mean the offering meets your needs, or it may be priced out of internal approvability (yes, I made up a word).
Pros: These point tools are typically the best at doing what they do, depending of course on which one you buy. Regardless, they provide the most detailed information at that layer (again, depending on which one you buy).
Cons: When someone opens a helpdesk ticket reporting errors from an application that is used heavily in the company, and it is a complex application (ie: more than a single server), a very experienced IT person (or several) has to log into multiple tools to figure out what is going on. Is the SAN up? Are the disks in the SAN online? Are there performance problems in the SAN? Are there any alarms in the SAN? And that is just the SAN; what about the database, the network, the web server, and so on?
Event Correlation: This is a common feature that customers desire. The basic concept is that the organization has matured over the years and has purchased specific mature products (devices, appliances, firewalls, etc.) that come with tools to manage (and monitor) the technology. In other cases, IT has come to realize that a single tool is not able to keep IT informed of problems within the enterprise (IE: ping is not enough, digging through log files is not enough, monitoring processes is not enough). The basic idea behind Event Correlation is that the customer has two or more (in some cases 20 different) tools and realizes the value of sending all of the errors/alarms from these tools to a single tool. This single tool then automatically analyzes the events and combines related events into a single event/alarm; it typically also auto-closes "noise" alarms, de-duplicates alarms, and so on.
Pros: Done right, this methodology significantly reduces the raw count of alarms, which in turn reduces the noise and hopefully gets IT focused on the right things. Even then, "the right things" is still pretty general: 10 computers failed ping tests, and the real culprit is the switch. This level of information is only as good as the rules you put into the correlation engine.
Cons: When you have several outages showing in the Event Correlation tool, which one (say, of 10 critical alarms) should be worked first? Correlation is not always (and usually isn't) root cause (IE: this application is reporting login errors; is it LDAP, a backend database, or a specific area of the network and its switch?). The other challenge is that not everything from a point monitoring tool can be jammed into the correlation tool, which impacts troubleshooting as well as the general details about the issue. The SAN monitoring tool can raise a disk problem, but the other factoids (is there a RAID rebuild going on? is the SAN temperature high?) typically don't fit into the Event Correlation tool. Network maps, specialized right-click actions in the underlying tool, performance data, historical data, and so on don't fit either. IT area experts are still using their point tool and are not fully aware of the rest of the world, the full impact of an alarm, etc.
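To make the de-duplicate/suppress/combine idea concrete, here is a minimal sketch of a correlation pass in Python. The event shape, noise list, and grouping rule are all made up for illustration; they are not any vendor's actual schema, and a real engine would be rule-driven rather than hard-coded.

```python
from collections import defaultdict

# Hypothetical event shape: (timestamp_secs, source, severity, message).
NOISE = {"heartbeat ok", "link flap cleared"}  # alarms we auto-close as noise

def correlate(events, window=60):
    """De-duplicate identical alarms, drop known noise, and combine
    events from the same source within a `window`-second bucket."""
    seen = set()
    groups = defaultdict(list)
    for ts, source, severity, message in sorted(events):
        if message.lower() in NOISE:
            continue                      # auto-close noise alarms
        key = (source, message)
        if key in seen:
            continue                      # de-duplicate repeats
        seen.add(key)
        # bucket by source and coarse time window -> one combined alarm
        groups[(source, ts // window)].append((severity, message))
    return groups
```

Note the limitation the Cons paragraph describes: the combined alarm carries only what was forwarded to it; the RAID rebuild status or SAN temperature never reaches this layer.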
End User Response: Several companies want to be proactive rather than reactive. IE: instead of waiting for someone to call the Help Desk to report that an application is slow, set up automated testing that measures response time from an end user's perspective.
Pros: Assuming you place these response time testers correctly throughout the network (internally and externally), and you set up the tests to simulate the right features, this is a good early indicator, typically ahead of a Help Desk ticket, of a problem that may not be noticed by the other monitoring tools.
Cons: These tools point out a slow response for a specific action such as logging into an application, submitting a timesheet, or issuing a stock trade. The test may imply where the problem is, but it is typically not root cause. Login is slow; is it LDAP, the network (bandwidth over-utilization), or some switch that lost its mind?
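The core of any such tester is simple: run a scripted transaction, time it, and alarm on a threshold. A minimal sketch, with a stand-in function where a real probe would perform an HTTP login or trade submission (the names and threshold are illustrative assumptions):

```python
import time

def probe(transaction, threshold_secs):
    """Run one synthetic transaction and report (elapsed_secs, ok).

    `transaction` is any callable that performs the scripted user
    action, e.g. log in, submit a timesheet, issue a trade.
    """
    start = time.monotonic()
    transaction()
    elapsed = time.monotonic() - start
    return elapsed, elapsed <= threshold_secs
```

In practice you would run probes like this from several vantage points (inside the LAN, over the WAN, from the internet) and trend the timings. As the Cons above note, a breach tells you *that* the user experience degraded, not *why*.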
Application Monitoring: At this point the IT department has realized that ping is not enough and end user response is not enough. They need more information, and the actual running (or not running) application can typically point out specifics of the general problem (IE: a SQL exception, file permission problems, etc.).
Pros: There are a lot of vendors that provide these types of monitoring products. Depending on the depth of Application Monitoring the vendor offers, they can provide a wealth of information about the running process, the system processes, error logs, performance data, and more.
Cons: For complex applications (multiple servers and processes), you typically aren't going to get at the underlying reason, and you may end up chasing symptoms. IE: the application is timing out connecting to a database and throwing out-of-memory exceptions in a log file. Which is the real problem: the database timeouts spinning out of control, or the application, VM, or physical server running out of memory? Or, or, or...
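A common building block of application monitoring is exactly this kind of symptom counting: scan the error log and bucket lines by error type so you can at least see which symptom dominates. A small sketch; the patterns and category names are invented for the example, not from any real product:

```python
import re
from collections import Counter

# Illustrative symptom patterns -- a real tool would make these configurable.
PATTERNS = {
    "db_timeout": re.compile(r"timed? ?out.*(database|db)", re.IGNORECASE),
    "oom": re.compile(r"out of memory", re.IGNORECASE),
    "sql_exception": re.compile(r"SQLException", re.IGNORECASE),
}

def classify(log_text):
    """Count log lines per symptom category."""
    counts = Counter()
    for line in log_text.splitlines():
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

Counting symptoms is still not root cause, which is the Cons point above: heavy `db_timeout` counts and a single `oom` line leave you guessing which one started the cascade.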
Change Control: Tastes great, less filling! Of course everyone wants/needs this, since a very high percentage of outages are caused by the routine changes IT makes. It is not that they don't know what they are doing (most of the time); it is that they typically are unable to get enough information ahead of time, such as the risk of the change, the impact of the change, etc. Most companies need to improve their process, but they also need tools to help with tracking, approving changes, and so on. Even after getting that lined up, the fact that a change is about to occur, or just occurred, is important for the IT support people to be aware of, and they are typically not the people involved with the change to begin with.
Pros: If done right (or even close to right), outages or issues related to a change (IE: upgrading an OS, patching an application, re-subnetting) will have less impact on the availability of the critical applications the business uses.
Cons: Since a high percentage of the problems found in the environment are caused by human changes, getting a heads-up connecting last night's (successful or failed) change to the outage that just happened is rare at nearly every company I have encountered. IE: application errors started spamming the log files around noon, when all the employees were at work and using the application. Come to find out that the application owners patched the application last night, and clearly this is the cause of the problem, since none of the other monitoring tools are reporting errors. None of the typical IT tools would point this out: the network is fine, the database is fine, the server is fine, etc.
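The missing heads-up is mechanically simple once change records and alarms live in the same place: when an alarm fires, list recent changes on the affected system. A sketch under assumed data shapes (the tuple layout and 24-hour lookback are illustrative choices, not a product's behavior):

```python
from datetime import datetime, timedelta

def suspect_changes(changes, alarm_system, alarm_time,
                    lookback=timedelta(hours=24)):
    """Return changes to `alarm_system` that completed within the
    lookback window before `alarm_time`.

    `changes` is a list of (when, system, description) tuples.
    """
    return [
        (when, system, desc)
        for when, system, desc in changes
        if system == alarm_system
        and timedelta(0) <= alarm_time - when <= lookback
    ]
```

This is exactly the noon-outage scenario above: the alarm at 12:00 immediately surfaces last night's patch as a suspect, before ten people get on a conference call.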
Let's face it, there are probably ten other areas I could list: ensuring all servers are running approved versions of Windows at specific patch levels, verifying every server/device is being monitored in some manner, knowing you don't need to work a help desk ticket because IT is already working the issue, and so on.
It is amazing how many companies still get 10+ people on the phone when the company email stops working. The person running the call does a roll call, which is more of a temperature check of each person. OK, Sally, how is the network? Bob, what's going on in database land? Charlie, are all of the SANs happy today? Tim... oh wait, Tim is out sick today; anyone have an idea if the log file has any errors?

The outage started slowly 30 minutes ago, and the Help Desk did troubleshooting for another 30 minutes. Level 2 support used some of their magic powers to look deeper for 15 minutes and, after not figuring it out, punted to management. Now there are 10+ people on a phone call doing an inventory of factoids. That call may go quickly, or it could go 30 minutes, or a few hours; the team may need to drop from the call to dig in further and regroup every 30 minutes to provide updates. There are costs associated with this, in some cases huge costs: not only the time all those people spend on the outage, but the delays in the projects they are no longer working on, plus the impact on the business side, where employees are now reading the newspaper and customers are not opening new orders. An hour-long outage can be a non-issue, or it could be a million-plus dollar outage.
We interrupt this blog for an important message.... Sorry, Product pitch time.
NetIQ Operations Center was designed from the ground up to integrate with point monitoring tools. If you have an Event Correlation engine, great; if not, that is fine too. Event Correlation is a nice option to get you pointed in the right direction, but NetIQ Operations Center provides not only an Event Correlation engine for those who don't have one, it provides a higher level of correlation. Imagine if you could look at a correlated event and drill down into the actual tool for even more information.
There is a corporate webserver problem: the synthetic test tool (ours or someone else's) is unable to issue a purchase, issue a trade, sign up for service, open an account... whatever.
The Corporate Webserver icon is red inside of NetIQ Operations Center.
Since the company website is a very critical application both internally and externally, emails and pages are automatically sent out to notify engineers and management. End users are able to log into the NetIQ Operations Center dashboard and see that IT is already working on it, so the help desk is not overwhelmed with duplicate/repeated phone calls and emails. Our partners can see the dashboard, but with a different level of detail, and they know that the outage is being worked. Management has a higher-level dashboard with limited abilities, and they see the impact from a business perspective as well as knowing that IT is already aware and working on it.
The Level 1 operator is logged into NetIQ Operations Center and notices right away that the corporate website is critical/red. That is odd, because they were told it would never go down: the corporate website runs on a cluster of web servers as well as a clustered database (uh... cluster... what's that?). Without having to look at alarms, without looking at other consoles, without joining an IT tiger team conference call, without dialing 411, the Level 1 operator (who is not an expert in the corporate applications) reads that the root cause is a SAN. They immediately call Jessie, the SAN owner, who says... yah, yah, yah, I already know, NetIQ Operations Center sent me an email that SAN45 reported a critical error. BTW, the email also said that other less-used applications are impacted, so don't be surprised if you start getting calls that people can't log into the Training area of our intranet website or the Application Download area of the website... they are all down until I replace the power supply on the SAN. I updated a Help Desk ticket and a change request to track this, so you should see those in NetIQ Operations Center under each impacted application.
Wow, you made it this far; this is a long blog, sorry. This is a common goal for many organizations. There are variations of this, as well as other things our customers do with our product(s). If you want to improve the way IT works today, give us a call, tell us your challenges, and if we can help, we will. Thanks for reading.