Heartbeat-based server down and agent availability KS

over 12 years ago
 

Overview

The Server Down Knowledge Script (KS) uses a 'data heartbeat' to identify down servers and non-functioning AppManager agents. The main benefit of this approach is that it does not ping every server to determine up/down status; it pings only those that have not recently returned data to the database, which may make it more suitable for large AppManager environments. It also provides a mechanism to identify non-functioning agents, for example those that are stopped, hung or no longer installed.


The monitoring process consists of two parts: a Stored Procedure and a KS. The Stored Procedure, which must be installed into each AppManager database, determines which agents have not returned a data point from a designated KS within a specified interval in minutes. Separate KSs may be designated for Windows and UNIX agents. Two versions of the code are included, one for version 7 and another for version 8 onwards. Note that in many cases it may not be necessary to run this script in version 8, because AppManager 8 performs out-of-the-box server down monitoring as part of its new self-monitoring capabilities.
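
The heartbeat test described above can be illustrated with a minimal Python sketch. It is not the Stored Procedure itself (which runs inside the AppManager database); the function name and the in-memory dictionary of last data point times are assumptions made purely for illustration.

```python
from datetime import datetime, timedelta

def agents_without_recent_data(last_data_point, interval_minutes, now):
    """Return the agents whose newest data point is older than the interval.

    last_data_point maps an agent name to the time of its most recent data
    point from the designated KS, or to None if no data was ever received.
    This dictionary is a hypothetical stand-in for the repository tables.
    """
    cutoff = now - timedelta(minutes=interval_minutes)
    return sorted(
        name
        for name, seen in last_data_point.items()
        if seen is None or seen < cutoff
    )

if __name__ == "__main__":
    now = datetime(2014, 3, 6, 12, 0)
    sample = {
        "SERVER-A": now - timedelta(minutes=5),   # recent data, healthy
        "SERVER-B": now - timedelta(minutes=45),  # stale, needs checking
        "SERVER-C": None,                         # never returned data
    }
    print(agents_without_recent_data(sample, interval_minutes=20, now=now))
```

Only the agents returned by a check of this kind would go on to be pinged, which is what keeps the number of pings small in a large environment.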


The agents identified by the stored procedure are pinged (i.e. ICMP echo) to determine up/down status. Windows servers that respond to the ping may be further checked to determine whether the NetIQ AppManager Client Resource Monitor (netiqmc) service is running, based on the availability of its TCP port. This type of check is not possible for UNIX agents or for the Client Communication Manager service.
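
A TCP port availability check of the kind described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the KS code: the function name is hypothetical, and the port on which netiqmc listens is deliberately left as a parameter because it depends on the agent's configuration.

```python
import socket

def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    A successful connect only shows that something is listening on the port;
    it is taken here as a proxy for "the service is running".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

A short timeout matters here, because a check that runs against many servers in sequence would otherwise stall on each unreachable host.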


The Knowledge Script may be configured to raise up to three different types of Event as follows:

  • The server is down, i.e. it did not respond to a ping check.

  • Inactive AppManager agent, i.e. the AppManager agent is not running (Windows agents only).

  • Heartbeat failure, i.e. the agent has not returned data from the specified KS within the defined interval, although the server responded to the ping and the agent responded to the port check.
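
The way the three Event types relate to the check results can be summarised in a small Python sketch. The function and constant names are hypothetical, and treating an agent whose port was not checked (for example a UNIX agent) as a heartbeat failure is an assumption rather than documented behaviour.

```python
from typing import Optional

SERVER_DOWN = "Server down"
INACTIVE_AGENT = "Inactive AppManager agent"
HEARTBEAT_FAILURE = "Heartbeat failure"

def classify_stale_agent(pinged_up: bool, is_windows: bool,
                         port_open: Optional[bool]) -> str:
    """Pick the Event type for an agent that has missed its data heartbeat.

    port_open is None when the port check was not performed.
    """
    if not pinged_up:
        return SERVER_DOWN
    if is_windows and port_open is False:
        return INACTIVE_AGENT
    # Assumption: anything that pinged up and did not fail a port check
    # is reported as a heartbeat failure.
    return HEARTBEAT_FAILURE
```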


To provide flexibility, the KS has numerous parameters that either include agents (i.e. limit monitoring to them) or exclude agents from monitoring, based on Maintenance Mode, Group membership (CC or OC), Management Server assignment, host name or custom properties. These criteria may be used in any combination, but if an agent satisfies both an inclusion and an exclusion criterion, the exclusion takes precedence. Each additional filter requires the script to execute more queries, so filters should be kept to a minimum for performance and load reasons.
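
The precedence rule can be illustrated with a short Python sketch. The names are hypothetical, and two points are assumptions rather than documented behaviour: that an agent qualifies if it matches any one inclusion rule, and that every agent is a candidate when no inclusion rule is defined.

```python
def is_monitored(agent, include_rules, exclude_rules):
    """Decide whether an agent is monitored; exclusions always win.

    agent is a dict of hypothetical attributes; each rule is a function
    that takes the agent and returns True when the agent matches.
    """
    if any(rule(agent) for rule in exclude_rules):
        return False
    if not include_rules:
        return True  # assumption: no inclusion rules means "all agents"
    return any(rule(agent) for rule in include_rules)

# Hypothetical rules: include by custom property, exclude by Maintenance Mode.
include = [lambda a: a.get("custom_property") == "Production"]
exclude = [lambda a: a.get("maintenance_mode", False)]
```

With these example rules, an agent whose custom property is "Production" but which is in Maintenance Mode would not be monitored, because it satisfies both criteria and the exclusion takes precedence.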

Under most circumstances, only one instance of the script needs to be deployed to monitor all agents in a repository, typically on the agent of the AppManager repository server, though any Windows agent may be used. However, there may be situations where multiple Jobs are required, for example where no single agent can ping every server in the repository. If multiple server down Jobs are deployed in a single repository, they should be scheduled so that they do not execute at the same time.

The KS needs access to the AppManager database (which it accesses via ADO) in order to execute. This may be provided by running the agent services with a Windows account which has been granted access in SQL Server to the repository, or by defining a SQL login in the script's parameters. Note that if inclusion or exclusion parameters are defined for Control Center Management Groups then the KS will also need access to the Control Center database, NQCCDB.

Version History


The latest version of this KS is 3.63, released 06-Mar-2014. See code body for detailed change history.


NetIQ does not test or validate any software, code or other materials provided in, on or through NetIQ Cool Solutions (collectively, "Materials"), so please use caution when downloading or accessing any Materials from Cool Solutions and ensure that you have reasonable and current security, spyware and anti-virus measures in place on your computer and/or network prior to downloading. Additionally, do not use any Materials downloaded from Cool Solutions in any production environment without first testing the Materials to ensure they are compatible with your version of NetIQ software or any other hardware or software present in your network or environment. Cool Solutions is not a substitute for authorized NetIQ support and should not be used as such. NETIQ COOL SOLUTIONS AND ANY MATERIALS ARE PROVIDED ON AN AS-IS, AS-AVAILABLE BASIS WITHOUT ANY WARRANTY OF ANY KIND. By downloading this file, you are agreeing to these terms of use. To report a problem please contact: coolguys-netiq@netiq.com. Your use of Cool Solutions is governed by the Cool Solutions Terms and Conditions. https://www.netiq.com/communities/coolsolutions/terms-and-conditions/

Comment List
Anonymous
  • Nothing immediately springs to mind Brett. If you can send me a job execution trace for when it happened then I may be able to figure it out.
  • Alain,
    I am using the inclusions option to include agents with a Custom Property set to a specific value. There have been several occasions where I have received Heartbeat failures for devices that do not have the specified Custom Property value. This only happens occasionally. Any ideas of what is going on?
  • Hi Brett, You can try executing that query directly in SQL Server and seeing if it reports an error. It may, however, be a performance issue, i.e. that SQL is getting deadlocked and killing the query. Let me know how you get on and email me directly at alain.salesse@netiq.com.
    Regards
    Alain
  • I am getting a lot of unexpected error events. Here is the detailed message:

    Failed to execute query
    Exec _DataNotReceived 'NT_SystemUpTime', 20, 2
  • Hi Sam
    Unless you configured the script to run once, then that clearly is not expected behaviour. Have you installed the accompanying SQL Stored Procedure into the repository? If so, can you run this query in SQL "Exec _ServiceStatus". Does it error or return a list of values? Does the Job raise an error - if so, what is it? Please can you email me directly at alain.salesse@netiq.com to continue this discussion
    Regards
    Alain
  • I have attempted to deploy this to one repository site and it starts and then stops with all the objects inside configured. Is this expected behavior?
  • Hi Alain,

    We have tested the script and the functionality is very good. However, in situations where some of the servers are unreachable due to network issues, a huge number of alerts is generated.

    Could you please add a check so that, if any servers are unreachable, the script also pings the router, and if the router is also unreachable, it does not raise alerts for those servers?

    The script also does not raise an alert if a server is grayed out in the console because the NetIQ AppManager services are down. Could you please check this as well?

    Regards,
    Kiran Kumar
  • Another error message I am receiving frequently. Any ideas?

    Detail Message: Failed to execute query Timeout expired
    Select Name from Object Where IsMachine = 1 and (Status & 0x00004000 > 0 or Status & 0x00000100 > 0)
    Failed to execute query Timeout expired
    Select Name from Object Where IsMachine = 1 and (Status & 0x00004000 > 0 or Status & 0x00000100 > 0)
  • I received this error from the ServerDown KS:

    KS Name: NetIQ_ServerDown
    Object Name:
    EventID: 9718397
    Detail Message: Failed to execute query Protocol error in TDS stream
    select o.Name, convert(Datetime,right(p.ValueStr,len(p.ValueStr)-charindex(':',p.ValueStr)))
    from ObjectProp p with (NOLOCK) inner join Object o with (NOLOCK)
    On o.ObjID = p.ObjID
    where p.Name = 'CreateMachine'
    order by 2 Asc
    Failed to execute DateMachineAddedtoRepos query
    Failed to execute query Timeout expired
    Exec _DataNotReceived 'NT_SystemUpTime', 10, 2
    Failed to execute query
    Exec _DataNotReceived 'NT_SystemUpTime', 10, 2
  • Hi Christian
    The new Stored Procedure will overwrite the old one, but if it doesn't install then you can delete the old one manually to work around that.
    Regards
    Alain