Heartbeat-based server down and agent availability KS

over 12 years ago
 

Overview

The Server Down Knowledge Script (KS) uses a 'data heartbeat' to identify down servers and non-functioning AppManager agents. The main benefit of this approach is that it does not ping every server to determine up/down status. It pings only those servers which have not returned data to the database recently, which may make it more suitable for large AppManager environments. It also provides a mechanism to identify non-functioning agents if, for example, they are stopped, hung or no longer installed.


The monitoring process consists of two parts: a Stored Procedure and a KS.  The Stored Procedure, which must be installed into each AppManager database, determines which agents have not returned a data point from a designated KS within a specified interval in minutes. Separate KSs may be defined for Windows and UNIX agents. Two versions of the code are included, one for version 7 and another for version 8 onwards. Note that in many cases it may not be necessary to run this script in version 8, because AppManager 8 performs out-of-the-box server down monitoring as part of its new self-monitoring capabilities.
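
The heartbeat test itself can be illustrated with a minimal sketch. This is not the code of the Stored Procedure, which runs against the AppManager database. It only shows the idea on in-memory data, and the function name, the agent names and the treatment of agents that have never returned data are assumptions made for the example.

```python
from datetime import datetime, timedelta


def agents_missing_heartbeat(last_data_point, now, interval_minutes):
    """Return the agents whose newest data point is older than the interval.

    last_data_point maps an agent name to the datetime of its most recent
    data point from the designated KS, or to None if it has never returned
    one (assumed here to count as a missing heartbeat).
    """
    cutoff = now - timedelta(minutes=interval_minutes)
    return sorted(
        name
        for name, seen in last_data_point.items()
        if seen is None or seen < cutoff
    )


# Hypothetical sample data.
now = datetime(2014, 3, 6, 12, 0)
last_seen = {
    "srv-a": datetime(2014, 3, 6, 11, 55),  # inside the interval
    "srv-b": datetime(2014, 3, 6, 11, 40),  # too old
    "srv-c": None,                          # never returned data
}
print(agents_missing_heartbeat(last_seen, now, interval_minutes=11))
# prints ['srv-b', 'srv-c']
```

Only the agents returned by a check of this kind go on to the ping and port checks, which is what keeps the network load low.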


The agents identified by the stored procedure are pinged (i.e. ICMP echo) to determine up/down status.  Windows servers which respond to the ping may be further checked to determine whether the NetIQ AppManager Client Resource Monitor (netiqmc) service is running, based on the availability of its TCP port. This type of check is not possible for UNIX agents or for the Client Communication Manager service.
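
A port availability check of this kind can be sketched as a plain TCP connection attempt. This is an illustration, not the KS's own code. The function name and the timeout are assumptions, and the port number is left as a parameter because the article does not state which port netiqmc listens on.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable or unresolvable: treat as closed.
        return False
```

A successful connection shows only that something is listening on the port. It does not prove that the agent service behind it is working correctly.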


The Knowledge Script may be configured to raise up to three different types of Event as follows:

  • The server is down, i.e. it did not respond to a ping check.

  • Inactive AppManager agent, i.e. the AppManager agent is not running (Windows agents only).

  • Heartbeat failure, i.e. the agent has not returned data from the specified KS within the defined interval (but pinged up and agent responded to port check).
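
The way the three Event types relate to the ping and port checks can be sketched as below. This is a reading of the description above, not the KS source. The names are hypothetical, and treating an agent with no port check result (for example a UNIX agent) as a heartbeat failure is an assumption.

```python
from enum import Enum


class Event(Enum):
    SERVER_DOWN = "Server down"
    AGENT_INACTIVE = "Inactive AppManager agent"
    HEARTBEAT_FAILURE = "Heartbeat failure"


def classify(ping_ok, port_ok, enabled):
    """Pick the Event for an agent that has already missed its heartbeat.

    ping_ok: True if the server answered the ping.
    port_ok: True or False for the agent port check, or None if the check
             was not performed (assumed to be the case for UNIX agents).
    enabled: the set of Event types the Job is configured to raise.
    Returns an Event, or None if the matching Event type is not enabled.
    """
    if not ping_ok:
        event = Event.SERVER_DOWN
    elif port_ok is False:
        event = Event.AGENT_INACTIVE
    else:
        event = Event.HEARTBEAT_FAILURE
    # A disabled Event type is not converted into another type.
    return event if event in enabled else None
```

With only heartbeat failure enabled, `classify(True, False, {Event.HEARTBEAT_FAILURE})` returns `None`, so a stopped agent raises no Event at all in this sketch.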


To provide flexibility for most circumstances, the KS has numerous parameters to include (i.e. to limit monitoring to) or exclude certain agents from monitoring based on Maintenance Mode, (CC or OC) Group membership, Management Server assignment, host name or custom properties. These criteria may be used in any combination, but exclusion criteria take precedence over inclusions if an agent satisfies both. The more filters applied, the more queries the script will need to execute to process them, so they should be kept to a minimum for performance and load reasons.
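
The precedence rule can be sketched as follows. The agent fields and filter functions are hypothetical, and requiring an agent to satisfy every inclusion filter is an assumption, since the article does not say how multiple inclusion criteria combine. Only the rule that exclusions win comes from the description above.

```python
def is_monitored(agent, include_filters, exclude_filters):
    """Decide whether an agent is monitored, with exclusions taking precedence.

    Each filter is a function that takes the agent and returns True on a match.
    """
    # Any matching exclusion removes the agent, even if it is also included.
    if any(matches(agent) for matches in exclude_filters):
        return False
    # With no inclusion filters, every remaining agent is monitored.
    return all(matches(agent) for matches in include_filters)


# Hypothetical filters and agents.
def in_prod_group(agent):
    return agent["group"] == "Prod"


def in_maintenance(agent):
    return agent["maintenance"]


agents = [
    {"name": "srv-a", "group": "Prod", "maintenance": False},
    {"name": "srv-b", "group": "Prod", "maintenance": True},
    {"name": "srv-c", "group": "Test", "maintenance": False},
]
monitored = [
    a["name"] for a in agents
    if is_monitored(a, [in_prod_group], [in_maintenance])
]
print(monitored)  # prints ['srv-a']
```

Here srv-b satisfies both the inclusion and the exclusion, so it is left out.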

Under most circumstances, only one instance of the script needs to be deployed to monitor all agents in a repository, typically on the agent of the AppManager repository server, though any Windows agent may be used. However, there may be situations where multiple Jobs are required, for example where no agent exists that can ping every server in the repository. If multiple server down Jobs are deployed in a single repository, they should be scheduled so that they do not execute at the same time.

The KS needs access to the AppManager database (which it accesses via ADO) in order to execute. This may be provided by running the agent services with a Windows account which has been granted access in SQL Server to the repository, or by defining a SQL login in the script's parameters. Note that if inclusion or exclusion parameters are defined for Control Center Management Groups then the KS will also need access to the Control Center database, NQCCDB.

Version History


The latest version of this KS is 3.63, released 06-Mar-2014. See code body for detailed change history.


NetIQ does not test or validate any software, code or other materials provided in, on or through NetIQ Cool Solutions (collectively, "Materials"), so please use caution when downloading or accessing any Materials from Cool Solutions and ensure that you have reasonable and current security, spyware and anti-virus measures in place on your computer and/or network prior to downloading. Additionally, do not use any Materials downloaded from Cool Solutions in any production environment without first testing the Materials to ensure they are compatible with your version of NetIQ software or any other hardware or software present in your network or environment. Cool Solutions is not a substitute for authorized NetIQ support and should not be used as such. NETIQ COOL SOLUTIONS AND ANY MATERIALS ARE PROVIDED ON AN AS-IS, AS-AVAILABLE BASIS WITHOUT ANY WARRANTY OF ANY KIND. By downloading this file, you are agreeing to these terms of use. To report a problem please contact: coolguys-netiq@netiq.com. Your use of Cool Solutions is governed by the Cool Solutions Terms and Conditions. https://www.netiq.com/communities/coolsolutions/terms-and-conditions/

Labels:

How To-Best Practice
Comment List
Anonymous
  • Alain,

    Is there an easy way ;) to check whether scheduled maintenance periods have expired, and then change the Dbo.Object.Status in the QDB for machines that are not supposed to be in maintenance mode?

    Thanks!

  • Hi Christian
    There is a parameter which determines whether servers in maintenance mode are excluded from the server down monitoring process or not, and by default the exclusion is enabled.

    To avoid this situation you would probably need to run another Server Down job where agents in maintenance mode are included, but where the Job alerts only for Server Down and not for agent inactive or heartbeat failure.

    Then you would run a second Job which excludes maintenance mode but only alerts for agent inactive or heartbeat failure.

    This configuration would avoid alerts for heartbeat failures for agents in maint mode which would otherwise occur for any agent in maintenance.

    Hope this helps. Perhaps you can re-post your question in the main forum so others can more easily see my response.

    Regards
    Alain
  • Hi Alain,

    I observed that when a machine is in scheduled maintenance mode and the server crashes during that time, the server status remains in maintenance and the server is still excluded from Heartbeat-based server down monitoring after the end of the schedule. No event is generated!
    Have you ever seen this?

    Thanks!
    Christian
  • Hi Nathan
    I emailed you a couple of weeks ago to let you know that I've updated the server down process to make use of the new heartbeat process in AM8. Just download the zip file on this page!
    Regards
    Alain
  • Alain,

    I saw something above about a version of this KS/StoredProc that is updated to work with AM8? Has this been posted somewhere?

    Thanks!
    -Nathan
  • Hi Mona
    This is by design. If no data is received, the KS will perform a port check to determine if the agent service is running or not. Depending on the outcome it will raise an "Agent Inactive" or "Missing Heartbeat Data" alert. But it doesn't convert the former into the latter alert if the agent is stopped, it simply doesn't generate the alert. You should set the KS to alert for both conditions, and if you want to test for missing data then stop the data job rather than the agent. If you need any more assistance then email me directly at Alain.Salesse@netiq.com.
    Regards
    Alain
  • We are mostly interested to get an event for missing data. To test I setup the job on the repository server to monitor server x and selected event for missing data. I then stopped the agent services on server x to stop data collection but did not receive any events for missing data? The only time we received an event is when I select to event for inactive AppManager agent. Not sure why we are not getting events for missing data? Thanks.
  • Thank you Brett for your feedback
    I will look more closely at this side.
  • Christian,
    I was having similar issues that were caused by our SQL server being overloaded (too much disk IO). In my case the _DataNotReceived procedure was getting deadlocked by another SQL resource at random times. Moving to a bigger RAID (more drives/more IO) fixed the issue for me.
  • Hi Alain,

    Sometimes I get an "Unexpected error encountered" event on the repository server.
    Event Message: Failed to execute query Timeout expired
    Exec _DataNotReceived 'NTRRQ_MemUtil_Commun', 11, 2
    Failed to execute query
    Exec _DataNotReceived 'NTRRQ_MemUtil_Commun', 11, 2

    The event is raised on the right server, but I can't associate the error time with the error message because they do not occur at the same time. So I am wondering if a real message is missed when I get an error!

    I have the KS version 7.3.57 -1.3 and the _DataNotReceived.sql Created on 06-Aug-2009

    The only changes on my side are the name of the KS (NetIQ_ServerDown) and the name of the QDB.
    And I have two ServerDown KSs running on the same repository machine!

    Do you have the same behavior or any idea?

    Thanks
    Christian