Heartbeat-based server down and agent availability KS

over 12 years ago
 

Overview

The Server Down Knowledge Script (KS) uses a 'data heartbeat' to identify down servers and non-functioning AppManager agents. The main benefit of this approach is that it does not ping every server to determine up/down status, only those which have not returned data to the database recently. This may make it more suitable for large AppManager environments. It also provides a mechanism to identify non-functioning agents, for example those that are stopped, hung or no longer installed.


The monitoring process consists of two parts: a Stored Procedure and a KS. The Stored Procedure, which must be installed into each AppManager database, determines which agents have not returned a data point from a designated KS within a specified interval (in minutes). Separate KSs may be defined for Windows and UNIX agents. Two versions of the code are included, one for version 7 and another for version 8 onwards. Note that in many cases it may not be necessary to run this script in version 8, because AppManager 8 performs out-of-the-box server down monitoring as part of its new self-monitoring capabilities.
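The selection rule described above can be illustrated outside the database. The following Python sketch is not the stored procedure itself: the function name `stale_agents`, the input mapping of agent name to last data point time, and the choice to treat an agent with no data point at all as stale are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

def stale_agents(last_data_point, interval_minutes, now):
    """Return the agents whose newest data point is older than the interval.

    last_data_point maps an agent name to the datetime of its most recent
    data point from the designated KS, or to None if none exists
    (assumption: such agents are treated as stale).
    """
    cutoff = now - timedelta(minutes=interval_minutes)
    return sorted(
        agent
        for agent, stamp in last_data_point.items()
        if stamp is None or stamp < cutoff
    )

# Hypothetical sample data: three agents, checked against a 30 minute interval.
now = datetime(2014, 3, 6, 12, 0)
sample = {
    "SRV-A": now - timedelta(minutes=5),   # reported recently
    "SRV-B": now - timedelta(minutes=45),  # heartbeat overdue
    "SRV-C": None,                         # never reported
}
print(stale_agents(sample, 30, now))  # ['SRV-B', 'SRV-C']
```

Only the agents returned by a check of this kind need any further testing, which is what keeps the number of pings low in a large environment.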


The agents identified by the stored procedure are pinged (i.e. ICMP echo) to determine up/down status.  Windows servers which ping up may be further checked to determine if the NetIQ AppManager Client Resource Monitor (netiqmc) service is running or not, based on the availability of its TCP port. This type of check is not possible for UNIX agents or the Client Communication Manager service.
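A port availability check of the kind described above can be sketched in Python as follows. This is an illustration only, not the code used by the KS: the function name `port_is_open` and the timeout value are assumptions, and the port the netiqmc service listens on must be taken from your own agent configuration.

```python
import socket

def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established.

    A refused connection, an unreachable host or a timeout all count as
    'not available'.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A successful connection only shows that something is listening on the port. It does not prove that the service behind it is healthy, which is one reason the data heartbeat is still needed.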


The Knowledge Script may be configured to raise up to three different types of Event as follows:

  • The server is down, i.e. it did not respond to a ping check.

  • Inactive AppManager agent, i.e. the AppManager agent is not running (Windows agents only).

  • Heartbeat failure, i.e. the agent has not returned data from the specified KS within the defined interval (although the server responded to the ping and the agent responded to the port check).
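The relationship between the three event types can be summarised as a small decision function. This Python sketch is an interpretation of the list above rather than the script's actual logic. The name `classify` and the use of `None` for "port check not performed" (for example for UNIX agents) are assumptions, as is reporting such agents as heartbeat failures.

```python
def classify(ping_ok, port_ok=None):
    """Map the check results for an agent with a stale heartbeat to an event type.

    ping_ok: True if the server answered the ping.
    port_ok: True/False result of the agent port check, or None when the
             check was not performed (assumption: e.g. UNIX agents).
    """
    if not ping_ok:
        return "server down"
    if port_ok is False:
        return "inactive agent"
    # The server is up and the agent is reachable (or could not be checked),
    # yet no data arrived within the interval.
    return "heartbeat failure"

print(classify(ping_ok=False))                # server down
print(classify(ping_ok=True, port_ok=False))  # inactive agent
print(classify(ping_ok=True, port_ok=True))   # heartbeat failure
```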


To provide flexibility for most circumstances, the KS has numerous parameters to include agents (i.e. to limit monitoring to them) or to exclude agents from monitoring, based on Maintenance Mode, (CC or OC) Group membership, Management Server assignment, host name or custom properties. These criteria may be used in any combination, but exclusion criteria take precedence over inclusion criteria if an agent satisfies both. The more filters applied, the more queries the script needs to execute to process them, so filters should be kept to a minimum for performance and load reasons.
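The precedence rule can be shown with a short Python sketch. It assumes the inclusion and exclusion criteria have already been resolved to sets of agent names. The function name `monitored_agents` and this set-based representation are hypothetical, and how the KS combines several criteria of the same kind is not covered here.

```python
def monitored_agents(all_agents, included=None, excluded=frozenset()):
    """Apply inclusion and exclusion filters, with exclusion taking precedence.

    included: set of agent names to limit monitoring to, or None for no limit.
    excluded: set of agent names never to monitor, even if also included.
    """
    return [
        agent
        for agent in all_agents
        if agent not in excluded and (included is None or agent in included)
    ]

agents = ["SRV-A", "SRV-B", "SRV-C"]
# SRV-B satisfies both filters, so the exclusion wins.
print(monitored_agents(agents, included={"SRV-A", "SRV-B"}, excluded={"SRV-B"}))
# ['SRV-A']
```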

Under most circumstances, only one instance of the script needs to be deployed to monitor all agents in a repository, typically on the agent of the AppManager repository server, though any Windows agent may be used. However, there may be situations where multiple Jobs are required, for example where no agent exists that can ping every server in the repository. If multiple server down Jobs are deployed in a single repository, they should be scheduled so that they do not execute at the same time.

The KS needs access to the AppManager database (which it accesses via ADO) in order to execute. This may be provided by running the agent services with a Windows account which has been granted access in SQL Server to the repository, or by defining a SQL login in the script's parameters. Note that if inclusion or exclusion parameters are defined for Control Center Management Groups then the KS will also need access to the Control Center database, NQCCDB.

Version History


The latest version of this KS is 3.63, released 06-Mar-2014. See code body for detailed change history.


NetIQ does not test or validate any software, code or other materials provided in, on or through NetIQ Cool Solutions (collectively, "Materials"), so please use caution when downloading or accessing any Materials from Cool Solutions and ensure that you have reasonable and current security, spyware and anti-virus measures in place on your computer and/or network prior to downloading. Additionally, do not use any Materials downloaded from Cool Solutions in any production environment without first testing the Materials to ensure they are compatible with your version of NetIQ software or any other hardware or software present in your network or environment. Cool Solutions is not a substitute for authorized NetIQ support and should not be used as such. NETIQ COOL SOLUTIONS AND ANY MATERIALS ARE PROVIDED ON AN AS-IS, AS-AVAILABLE BASIS WITHOUT ANY WARRANTY OF ANY KIND. By downloading this file, you are agreeing to these terms of use. To report a problem please contact: coolguys-netiq@netiq.com. Your use of Cool Solutions is governed by the Cool Solutions Terms and Conditions. https://www.netiq.com/communities/coolsolutions/terms-and-conditions/

Comment List
Anonymous
  • We are mostly interested in getting an event for missing data. To test, I set up the job on the repository server to monitor server x and selected the event for missing data. I then stopped the agent services on server x to stop data collection, but did not receive any events for missing data. The only time we received an event was when I selected the event for inactive AppManager agent. Not sure why we are not getting events for missing data. Thanks.