Management Server Performance Monitor and Issues Alert

over 12 years ago
This is an older script (2005), but it still works in current AM 7.x. I have customers use it when we suspect the SQL repository is slow, or just for overall housekeeping. The KS always produces an event, even if only to tell you everything is OK. I suspect someone might want to add a parameter to make that event optional, but I'm just the messenger; this KS has been used in the field for several years. If you do not want the event, I usually suggest going to Operator Console File/Preferences/Event Options and turning on 'automatically close when event threshold is greater than' = 39. The event this KS triggers is 40 (nothing in AM uses 40 by default), so it will simply be closed automatically.

The Knowledge Script itself is linked, and the details below outline the use of the KS, including a complete list of the event messages it might produce and an explanation of what they mean. I strongly suggest reading this document before deploying the KS.

I suggest not running this Knowledge Script on an extremely frequent schedule (I usually use an interval of 15 to 60 minutes). It was written as a diagnostic tool to examine the workload of the Management Server for a defined amount of time. While it does not appear to affect the Management Server adversely, it does consume some resources.

Disclaimer

 

 

 

This Knowledge Script is provided on an "as is" basis. Customers who use this script must agree to the terms of the Knowledge Script Usage Agreement found on our website. Technical support is not available for this Knowledge Script.

What is it?

 

 

 

The AppManager Management Server Performance Monitor gathers critical metrics potentially impacting the operational performance of the MS. These metrics are taken from the paper Optimization and Tuning of Windows NT by Scott B Suhy and the NetIQ AppManager Administrator’s Guide.

 

 

 

After the metrics have been gathered the Knowledge Script attempts to analyze the information and make meaningful recommendations. These recommendations are meant to improve the performance of the monitored MS.

 

 

 

What are the known limitations?

 

 

 

Development was done on Windows 2003 and AppManager 6.0.2. xxxx made the necessary modifications to allow the original script to run on AppManager 5.0.1. Usability testing was conducted by xxxx.

 

 

 

Physical disk metrics are taken across all drives. No attempt is made to identify which physical disk the PIOC table is actually located on. The assumption is that most installations have everything installed on one physical drive and that, while more physical drives may be available, they are not being utilized.

Metrics are gathered, analyzed and then discarded. The KS has no current capability of comparing current metrics to past iterations. The analysis made is therefore subject to spikes normally associated with server operations. It is recommended that AppManager’s Advanced options be employed to smooth events should a high degree of false positives occur.

 

 

 

This release should be treated as an “alpha” attempt. After completion of representative testing in actual customer sites there will undoubtedly be some adjustments made. Please report any suggestions back to your Technical Account Manager.

 

 

 

What options are available in the KS?

Debug Parameters let the user control the KS's built-in debugging features. They are meant as a development aid and would rarely be used in a production environment. “No debug log” is the default option. The available options are:

No debug log    No debug logging is done.
Trace log       A trace log only is produced. A trace log shows the path of execution within the KS.
Value log       A value log is produced. A value log shows the resultant values of metrics used within the KS.
Full debug      A trace and value log are produced.

 

 

 

The filename of the debug file allows the user to specify the name of that file. The default location of the file is the AppManager temporary directory which is normally C:\Program Files\NetIQ\AppManager\Temp.

 

 

 

Error Severity allows the user to specify the severity levels assigned to the various types of events. Currently only two types of events are issued. The first type relates to metric gathering: critical information is missing, so continuing execution of the KS would be pointless. The second type reports the results of analyzing the gathered metrics.

Event Options control where event information is recorded for user reference. The assumption is that not all users will want to use the AppManager event mechanisms especially if they are having MS issues currently or if the generated events over time need to be copied to other media for remote analysis.

 

 

 

 

 

With this in mind the user has two options for recording events related to KS analysis of performance metrics:

 

 

 

Event to AppManager    Issue AppManager events through the normal mechanisms, fully utilizing the product's capabilities.
Event to file          Write text events to an ASCII file to minimize the impact on the current AppManager environment.

The filename of the event file allows the user to specify the name of that file. The default location of the file is the AppManager temporary directory which is normally C:\Program Files\NetIQ\AppManager\Temp.

 

 

 

If the AppManager environment is experiencing operational difficulties the user might also elect not to gather data for charting and reporting. This is explained under Data Options.

 

 

 

Data Options control whether or not data will be captured for later charting and reporting utilizing AppManager standard facilities. The available options are:

 

 

 

Box not checked    No data will be gathered for charting or reporting

 

 

 

Box checked          Data will be gathered using standard AppManager mechanisms.

 

 

 

What metrics are gathered by the KS?

There are two classes of metrics gathered by the KS. Both classes are in the form of Windows Performance Counters. The operating system related metrics are:

 

 

 

Object          Counter                        Instance

Memory          Available Bytes
Memory          Pages/sec
Processor       % Processor Time               _Total
Processor       Interrupts/sec                 _Total
System          System Calls/sec
System          Processor Queue Length
PhysicalDisk    % Disk Time                    _Total
PhysicalDisk    Current Disk Queue Length      _Total
PhysicalDisk    Avg. Disk Queue Length         _Total
PhysicalDisk    Avg. Disk sec/Transfer         _Total

The MS related metrics are:

 

 

 

Object          Counter                        Instance

Process         % Processor Time               NetIQms
Process         IO Data Bytes/sec              NetIQms
Process         Page Faults/sec                NetIQms
NetIQms         Ioc Collapsed Events Queued    0
NetIQms         Ioc Data Queued                0
NetIQms         Ioc Events Queued              0
NetIQms         QDB Request Service Time       0
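For anyone who wants to spot-check these counters outside AppManager, the two tables can be written out as Windows performance counter paths. This is only a minimal Python sketch, assuming the usual `\Object(Instance)\Counter` path form; the names `OS_COUNTERS`, `MS_COUNTERS` and `counter_path` are made up for illustration and are not part of the KS.

```python
# Illustrative only: the counters from the two tables as
# (object, counter, instance) tuples. None means "no instance".
OS_COUNTERS = [
    ("Memory", "Available Bytes", None),
    ("Memory", "Pages/sec", None),
    ("Processor", "% Processor Time", "_Total"),
    ("Processor", "Interrupts/sec", "_Total"),
    ("System", "System Calls/sec", None),
    ("System", "Processor Queue Length", None),
    ("PhysicalDisk", "% Disk Time", "_Total"),
    ("PhysicalDisk", "Current Disk Queue Length", "_Total"),
    ("PhysicalDisk", "Avg. Disk Queue Length", "_Total"),
    ("PhysicalDisk", "Avg. Disk sec/Transfer", "_Total"),
]

MS_COUNTERS = [
    ("Process", "% Processor Time", "NetIQms"),
    ("Process", "IO Data Bytes/sec", "NetIQms"),
    ("Process", "Page Faults/sec", "NetIQms"),
    ("NetIQms", "Ioc Collapsed Events Queued", "0"),
    ("NetIQms", "Ioc Data Queued", "0"),
    ("NetIQms", "Ioc Events Queued", "0"),
    ("NetIQms", "QDB Request Service Time", "0"),
]


def counter_path(obj, counter, instance=None):
    """Build a counter path such as \\Processor(_Total)\\% Processor Time."""
    if instance is None:
        return f"\\{obj}\\{counter}"
    return f"\\{obj}({instance})\\{counter}"


if __name__ == "__main__":
    for entry in OS_COUNTERS + MS_COUNTERS:
        print(counter_path(*entry))
```

Paths in this form can be given to a tool such as typeperf on the Management Server to confirm that each counter exists before the KS is deployed.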


What are the possible event messages?

 

 

 

Each event has three possible messages: the short message and the long message used when creating an AppManager event, and the message written to the ASCII event file when the user chooses not to use the standard AppManager event mechanism. Each entry below lists its messages first, followed by an explanation.

This is not an MS

 

This is not a server that has NetIQms running as a process.

 

 

 

Not an MS machine. Please start NetIQms or drop on a valid server

 

 

 

This KS is intended to run only on a server where an active MS is available. This message is generated when process NetIQms is not running on the server.

 

 

 

One or more counter failures

 

There has been one or more failures at obtaining the required counter values - Please correct

 

 

 

One or more counter failures. Please correct

 

 

 

The KS will not continue if any counter listed in this document cannot be found for whatever reason. The KS does go through the complete list of counters attempting to gather each one and generating error messages describing the ones that are missing.
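The gather-everything-then-stop behavior described above can be sketched roughly as follows. This is not the KS source, just a Python illustration; `gather_counters` and the `read_counter` callable are hypothetical names, and the error text is modeled on the Counter Error message listed later in this document.

```python
def gather_counters(counters, read_counter):
    """Try every counter, collecting values and one error per failure.

    counters     -- iterable of (object, counter, instance) tuples
    read_counter -- hypothetical callable that returns a value or raises
                    LookupError when the counter cannot be read
    """
    values = {}
    errors = []
    for obj, counter, instance in counters:
        try:
            values[(obj, counter, instance)] = read_counter(obj, counter, instance)
        except LookupError:
            # Keep going so that every missing counter gets reported.
            message = f"Unable to read counter {obj} {counter} {instance or ''}"
            errors.append(message.rstrip())
    return values, errors


def may_continue(errors):
    """Analysis only makes sense when no counter was missing."""
    return not errors
```

The point of the design is that one pass reports every missing counter, so the user can fix them all at once rather than one per iteration.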

 

 

 

MS performance OK

 

The MS is currently running optimally.

 

 

 

The MS is currently running optimally.

 

 

 

For the current iteration the KS analysis of the performance metrics found no abnormalities. This message simply affirms that fact.

Low virtual memory

 

There is less than 4 megabytes of virtual memory available. Paging/Poor performance possible.

 

 

 

Less than 4 megabytes virtual memory. Paging/Poor performance possible.

 

 

 

A Windows server whose available virtual memory falls below 4 megabytes is likely to experience severe degradation.

Paging rate is elevated

 

Paging rate > 10 pages/sec. Memory is probably becoming a bottleneck in the system.

 

 

 

Paging rate is elevated. Memory becoming bottleneck in system.

 

 

 

A paging rate greater than 10 but no more than 20 pages per second indicates that paging is beginning to occur on the server. This could be a sign that the server is starting to experience poor performance.

Excessive paging rate

 

Paging rate is greater than 20 pages per sec. Performance degraded and disk thrashing probable.

 

 

 

Paging rate critical. Degradation & thrashing probable.

 

 

 

Paging rates above 20 pages per second indicate a severe condition. This server is probably experiencing degraded performance if not disk thrashing.

 

 

 

Disk thrashing is the condition, on systems that use virtual memory, in which the hard drive is used excessively for virtual memory because physical memory is full. It slows a system considerably because data has to be transferred back and forth between the hard drive and physical memory.

A sure sign that a computer is thrashing is an application that stops responding while the disk drive light keeps blinking on and off. Thrashing is generally caused by too many processes competing for scarce memory resources. To stop thrashing temporarily, terminate one or more applications. To stop it permanently, install more main memory.
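To make the memory thresholds above concrete, here is a small Python sketch of how the three memory events relate to each other. It assumes that "virtual memory available" is the Memory / Available Bytes counter and that the two paging events are mutually exclusive; both are my reading of the descriptions, not something confirmed from the KS source. The function name `memory_findings` is hypothetical.

```python
FOUR_MEGABYTES = 4 * 1024 * 1024  # threshold from the "Low virtual memory" event


def memory_findings(available_bytes, pages_per_sec):
    """Return the short event messages that the given readings would trigger."""
    findings = []
    if available_bytes < FOUR_MEGABYTES:
        findings.append("Low virtual memory")
    if pages_per_sec > 20:
        # Severe: degraded performance, disk thrashing probable.
        findings.append("Excessive paging rate")
    elif pages_per_sec > 10:
        # Paging is starting; memory is probably becoming a bottleneck.
        findings.append("Paging rate is elevated")
    return findings
```

For example, a server with 3 MB available and 15 pages/sec would get both "Low virtual memory" and "Paging rate is elevated", while a reading of exactly 10 pages/sec triggers neither paging event.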

 

NetIQms minor contributor to paging

 

NetIQms is page faulting less than 5 pages per sec. Investigate other processes.

 

 

 

MS minor contributor to paging. Investigate other processes.

 

 

 

While the end result is the same it appears that the MS is not the cause of a paging issue on this server. The user should investigate other processes running on the server that may be causing this problem and inadvertently impacting the MS and the server as a whole.

 

 

 

>10% disk time is high for paging

 

>10% of the disk time for paging is too high.

 

 

 

>10% disk time for paging. Paging too high.

 

 

 

The calculated percentage of disk time spent on paging activities is greater than 10%. This percentage was calculated using the formula:

 

 

 

DiskTimeUsedForPaging = 100*(MemPagesSec*PhyAvgDiskSecTransfer)

 

 

 

This is another strong indication that paging may be excessive on this server impacting MS operations in a negative manner.
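A quick worked example of the formula, as a Python sketch. The function name is hypothetical; the two inputs correspond to the Memory / Pages/sec and PhysicalDisk / Avg. Disk sec/Transfer counters, and the sample readings are invented purely for illustration.

```python
def disk_time_used_for_paging(mem_pages_sec, phy_avg_disk_sec_transfer):
    """Percentage of disk time spent paging, per the formula in this document."""
    return 100 * (mem_pages_sec * phy_avg_disk_sec_transfer)


def paging_disk_time_too_high(mem_pages_sec, phy_avg_disk_sec_transfer):
    """True when the computed percentage exceeds the 10% threshold."""
    return disk_time_used_for_paging(mem_pages_sec, phy_avg_disk_sec_transfer) > 10


if __name__ == "__main__":
    # Invented readings: 25 pages/sec at 8 ms per transfer is about 20% of disk time.
    print(disk_time_used_for_paging(25, 0.008))
    print(paging_disk_time_too_high(25, 0.008))
```

With 25 pages/sec and 0.008 seconds per transfer the result is roughly 20%, which is over the threshold; 5 pages/sec at 0.010 seconds per transfer gives roughly 5%, which is not.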

 

 

 

Interrupts > 1000 sec

 

CPU > 80% and Interrupts > 1000 sec could be indicators of high disk I/O.

 

 

 

Interrupts > 1000 sec. High disk I/O indicated.

 

 

 

The CPU utilization is greater than 80% and CPU interrupts are exceeding 1000 per second. This indicates that a hardware controller may be issuing an excessive number of interrupts, possibly because of one or more very busy physical disks.

NetIQms minor contributor to CPU

 

NetIQms is contributing < 20% of the CPU load of > 80% overall. Investigate other processes.

 

 

 

MS minor contributor to CPU time. Investigate other CPU processes.

 

 

 

Even though the server is running in excess of 80% CPU utilization the MS process is using less than 20% of the CPU. This would indicate that the user should investigate other processes that are probably negatively impacting the operation of this server.

 

 

 

CPU > 80

 

CPU readings at or near 80% indicate this may be bottleneck.

 

 

 

CPU > 80%. Ongoing high CPU may indicate bottleneck.

 

 

 

It is recommended in the NetIQ AppManager Administrator’s Guide that the server on which the MS is running not consume more than 80% of the CPU resources. This server is exceeding that.

 

 

 

Processor queue Length > 3

 

With a processor queue length of greater than 3 the MS machine may be overloaded.

 

 

 

Processor queue length > 3. MS may be overloaded.

 

 

 

It is recommended in the NetIQ AppManager Administrator’s Guide that the server on which the MS is running have no more than a processor queue length of 3. This server is exceeding that.
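The four CPU-related checks above can be summarized in a few lines of Python. This is a sketch of the thresholds as described in this document, not the KS logic itself: in particular, I am assuming the checks are evaluated independently, so a busy server may get more than one of these events in a single iteration. The name `cpu_findings` is hypothetical.

```python
def cpu_findings(cpu_pct, interrupts_per_sec, ms_cpu_pct, processor_queue_length):
    """Return the short event messages that the given readings would trigger.

    cpu_pct                -- Processor / % Processor Time (_Total)
    interrupts_per_sec     -- Processor / Interrupts/sec (_Total)
    ms_cpu_pct             -- Process / % Processor Time (NetIQms)
    processor_queue_length -- System / Processor Queue Length
    """
    findings = []
    if cpu_pct > 80:
        findings.append("CPU > 80")
        if interrupts_per_sec > 1000:
            # High CPU plus many interrupts hints at heavy disk I/O.
            findings.append("Interrupts > 1000 sec")
        if ms_cpu_pct < 20:
            # The MS is not the main CPU consumer; look at other processes.
            findings.append("NetIQms minor contributor to CPU")
    if processor_queue_length > 3:
        findings.append("Processor queue Length > 3")
    return findings
```

Note that the interrupt and minor-contributor checks only matter once overall CPU is above 80%, which matches the wording of their long messages.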

 

 

 

Physical Disk Time > 65

 

The physical disk is a bottleneck. Investigate adding/moving PIOC to another disk.

 

 

 

Disk Time > 65%. Move/add PIOC to another disk.

The physical disk is in operation more than 65% of the time (assuming there is one physical disk or that the other physical disks are relatively inactive). This essentially exceeds the capability of the controller and drive. The recommendation is to add another physical disk and/or move the PIOC table to an inactive drive to spread the disk I/O activity.

Before moving the PIOC file it is highly recommended that NetIQ Technical Support be called for the proper procedure to follow. Basically it involves shutting down the MS, moving the PIOC folder to the new drive, changing the PIOC Map File Path key in the registry and restarting the MS.

 

 

 

Disk queue length > 2

 

The physical disk is a bottleneck. Investigate adding/moving PIOC to another disk.

 

 

 

Disk queue length > 2. Move/add PIOC to another disk.

 

 

 

A physical disk queue length greater than 2 indicates an extremely busy device. Again, the recommendation is to investigate moving the PIOC table to other available resources.

 

PIOC collapsed queue > 25

 

There are > 25 events in an IOC queue on the PIOC. Possible problems with PIOC

 

 

 

IOC collapsed events Q > 25. Possible PIOC problems.

 

 

 

The NetIQ AppManager Administrator’s Guide recommends that this counter maintain a value of nearly zero at most times. A consistent value higher than that would indicate that the MS is having difficulty keeping up with the influx of events and data. An arbitrary setting of 25 has been selected for this metric.

PIOC data queue > 25

 

There are > 25 events in an IOC queue on the PIOC. Possible problems with PIOC

 

 

 

IOC data Q > 25. Possible PIOC problems.

 

 

 

The NetIQ AppManager Administrator’s Guide recommends that this counter maintain a value of nearly zero at most times. A consistent value higher than that would indicate that the MS is having difficulty keeping up with the influx of events and data. An arbitrary setting of 25 has been selected for this metric.

 

 

 

PIOC events queue > 25

 

There are > 25 events in an IOC queue on the PIOC. Possible problems with PIOC

 

 

 

IOC events Q > 25. Possible PIOC problems.

 

 

 

The NetIQ AppManager Administrator’s Guide recommends that this counter maintain a value of nearly zero at most times. A consistent value higher than that would indicate that the MS is having difficulty keeping up with the influx of events and data. An arbitrary setting of 25 has been selected for this metric.
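Since all three PIOC queue events share the same arbitrary limit of 25, they can be sketched with one lookup table. This is a Python illustration only; `QUEUE_EVENTS` and `queue_findings` are hypothetical names, and the counter names are taken from the NetIQms object listed earlier in this document.

```python
QUEUE_LIMIT = 25  # the arbitrary setting selected for all three queues

# NetIQms counter name -> short event message
QUEUE_EVENTS = {
    "Ioc Collapsed Events Queued": "PIOC collapsed queue > 25",
    "Ioc Data Queued": "PIOC data queue > 25",
    "Ioc Events Queued": "PIOC events queue > 25",
}


def queue_findings(samples):
    """Return short messages for every queue counter above the limit.

    samples -- dict mapping NetIQms counter names to their current values
    """
    return [
        message
        for name, message in QUEUE_EVENTS.items()
        if samples.get(name, 0) > QUEUE_LIMIT
    ]
```

Because the guide expects these counters to sit near zero most of the time, a single reading over 25 is less interesting than one that stays there across several iterations.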

 

 

 

Counter Error

 

Unable to read counter ObjectName CounterName InstanceName

 

 

 

Counter Error / Unknown Unable To read counter ObjectName CounterName InstanceName

 

 

 

The KS was unable to find this counter or obtain a value for it. The Object, Counter and Instance names are displayed for investigation. Until the user corrects all invalid counter problems the KS will not continue execution.

 

 

 

 


 

 



NetIQ does not test or validate any software, code or other materials provided in, on or through NetIQ Cool Solutions (collectively, "Materials"), so please use caution when downloading or accessing any Materials from Cool Solutions and ensure that you have reasonable and current security, spyware and anti-virus measures in place on your computer and/or network prior to downloading. Additionally, do not use any Materials downloaded from Cool Solutions in any production environment without first testing the Materials to ensure they are compatible with your version of NetIQ software or any other hardware or software present in your network or environment. Cool Solutions is not a substitute for authorized NetIQ support and should not be used as such. NETIQ COOL SOLUTIONS AND ANY MATERIALS ARE PROVIDED ON AN AS-IS, AS-AVAILABLE BASIS WITHOUT ANY WARRANTY OF ANY KIND. By downloading this file, you are agreeing to these terms of use. To report a problem please contact: coolguys-netiq@netiq.com. Your use of Cool Solutions is governed by the Cool Solutions Terms and Conditions. https://www.netiq.com/communities/coolsolutions/terms-and-conditions/
