I have a customer who was used to Novell Cluster Services on NetWare, where NetWare's ability to abend a server on an application crash would trigger clustering to fail over a resource. In OES2 Linux that behaviour isn't really there. An application like GroupWise might crash, but the Linux server will stay up and the resource won't migrate, since the resource volume and IP address are still up. The solution to this is the Monitoring function within Novell Clustering on Linux.
First a quick note about the NCS monitoring function.
NCS Monitoring is pretty cool in that it's a great way to monitor applications on a per resource basis. It's divided into 2 parts.
1 - Monitor configuration
2 - The Monitor scripting
For Number 1 you configure how Monitor behaves per resource. You have options such as
- Enable/Disable ---> Whether or not Monitoring is active for that resource
- Polling interval ---> How often the Monitor script will run
- Failure Rate -----> This is made of 2 settings - Max Local Failures and Time Interval. It works in conjunction with the next setting
- Failure Action ----> Based on the Failure Rate you can either make the resource go Comatose or Migrate it.
How Failure Rate and Failure Action work:
- If a failure is detected AND the Failure Rate is NOT reached - the resource gets reloaded on the server
- If a failure is detected AND the Failure Rate is reached - the Failure Action is performed
For example:
- Failure Rate is set to 3 failures in 10 minutes
Then when a failure occurs the resource will be restarted. If this happens 4 times in 10 minutes, the Failure Action will occur (Migrate or Comatose)
- Failure Rate is set to 0 failures in 10 minutes
Then the first failure triggers the Failure Action (Migrate or Comatose)
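The two examples above can be sketched as a small shell function. This is only an illustration of the decision as I describe it here, not NCS code; the function name decide_action and the two result strings are made up for the example, and the boundary (action on the 4th failure when the rate is 3) is taken from the example above.

```shell
#!/bin/bash
# Illustration only: mimics the Failure Rate decision described in this article.
# $1 = failures seen inside the Time Interval, counting the current one
# $2 = the Max Local Failures setting
decide_action() {
    local failures_in_window=$1
    local max_local_failures=$2
    if [ "$failures_in_window" -gt "$max_local_failures" ]; then
        echo "failure-action"   # Migrate or Comatose, whichever is configured
    else
        echo "restart-local"    # resource is reloaded on the same server
    fi
}

decide_action 1 3   # first failure, rate 3 in 10 minutes  -> restart-local
decide_action 4 3   # fourth failure inside 10 minutes     -> failure-action
decide_action 1 0   # rate 0: the very first failure       -> failure-action
```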
For Number 2 you tell Monitor what to check for via script commands. The commands can be either direct commands in the Monitor script page or they can call scripts. If the direct commands or scripts return an exit code of 0, then all is well; if they return an exit code other than 0, it's considered a failure and Number 1 above kicks in.
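To make the exit code convention concrete, here is a minimal sketch. The helper report_check is hypothetical and only prints how Monitor would interpret a command's exit code; in the real Monitor script that job is done by exit_on_error.

```shell
#!/bin/bash
# Hypothetical helper: run a check command and show how its exit code
# would be read (0 = all is well, anything else = failure).
report_check() {
    local rc=0
    "$@" >/dev/null 2>&1 || rc=$?
    if [ "$rc" -eq 0 ]; then
        echo "ok: no action"
    else
        echo "failure: exit code $rc"
    fi
}

report_check true                 # a check that succeeds
report_check false                # a check that fails with exit code 1
report_check sh -c 'exit 3'       # any non-zero code counts as a failure
```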
So basically it comes down to how well your scripts check on the resource. You need to take into consideration multiple resources running on the server, the fact that with GroupWise clustering every agent is in the gwha.conf file, administrative actions (i.e. manually unloading the application but leaving the resource up), etc.
GroupWise Monitoring Script
So I have created a monitor script that will check the basics of GroupWise 7 to see if it is working correctly. Additional monitoring can be added as you desire. You will need to modify it per resource (I tested POA and MTA resources)
At the top of the script there are 2 variables to set: AGENT and LOGDirectory
AGENT = the agent that is loaded with the resource. The format is the same as in gwha.conf or as returned by "rcgrpwise status", for example PO1.DOM1 for a post office or DOM1 for a domain. I tested this with GroupWise 7. I believe that in GroupWise 8 the agent name should be in lower case, so the example above becomes "po1.dom1".
LOGDirectory = the directory where you want failures captured. The failures get written to a file called "gwstatus" in that directory. Make sure the log directory exists on all nodes.
Script explanation: Assume we are monitoring PO1 POA agent
The script first does a "rcgrpwise status PO1.DOM1"
- If the status check returns a value of " Failed" we write/append the "timestamp status agent name" to the log file. We will then exit out with a return code of "1". The return code of "1" will cause Monitor to act on the failure
The above is based on rcgrpwise status, which I would think covers most failures. However, since I do not know at this time whether rcgrpwise status will return a failure if the agent is defunct or not running in a normal state, I created a section to check the status of the PID.
To get the correct PID (as there could be multiple POA resources running) I use the PIDFILE, /var/run/novell/groupwise/AGENT.pid. The PIDFILE disappears if the agent is unloaded manually, which is the key to detecting that an administrator, for whatever reason, unloaded the agent manually and does not want the monitor script to act. However, if the PID is killed, stopped (kill -stop pid), or crashed, the PIDFILE remains. I ASSUME that it will remain if the PID is defunct as well, but I couldn't test this last one. Test 2 in the script acts on this information.
- The script checks to see if the PIDFILE exists. If it does NOT exist it should mean that the agent was unloaded manually. The script writes to the log file
timestamp agentname message stating the agent is down manually but the resource is still up.
Result code 0 is returned to Monitor, hence no action is taken by Monitor.
- If the PIDFILE exists we then check to see if the PID is defunct. If it is we write to the log file
timestamp agentname status message stating that the Agent is defunct
We then exit with a return code of 1 and let Monitor act on it
- If the PID is in any status other than S or R (sleeping or running, both normal) we write to the log file
timestamp agentname status message stating that the agent is not running
We then exit with a return code of 1 and let Monitor act on it
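The checks above can be sketched roughly as follows. This is not the downloadable script, just a simplified reconstruction of the logic described here. Assumptions: the log directory path is a placeholder, the function names log_msg and check_agent are made up, the grep pattern for "Failed" may need adjusting to what rcgrpwise actually prints on your system, and the state is read with "ps -o state=".

```shell
#!/bin/bash
# Simplified sketch of the checks described above (not the original script).
AGENT="PO1.DOM1"                       # as listed in gwha.conf
LOGDIR="/var/log/gwmonitor"            # placeholder; must exist on all nodes
PIDFILE="/var/run/novell/groupwise/${AGENT}.pid"

# Append "timestamp agent message" to the gwstatus log file.
log_msg() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') $AGENT $*" >> "$LOGDIR/gwstatus"
}

check_agent() {
    # Test 1: ask the init script for the agent status.
    if rcgrpwise status "$AGENT" 2>&1 | grep -q "Failed"; then
        log_msg "rcgrpwise status reports Failed"
        return 1
    fi

    # Test 2: a missing PID file should mean a deliberate manual unload.
    if [ ! -f "$PIDFILE" ]; then
        log_msg "agent is down manually but the resource is still up"
        return 0
    fi

    # PID file present: look at the process state letter.
    local pid state
    pid=$(cat "$PIDFILE")
    state=$(ps -o state= -p "$pid" 2>/dev/null)
    case "$state" in
        S*|R*) return 0 ;;                                   # normal
        Z*)    log_msg "$state agent is defunct"; return 1 ;;
        *)     log_msg "${state:-none} agent is not running"; return 1 ;;
    esac
}

# A monitor script built on this would end with:
#   check_agent
#   exit $?
```

Returning 0 for the missing PID file is what keeps Monitor quiet when an administrator takes the agent down on purpose.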
Testing the scripts
- You can monitor the monitor actions by doing a "tail -f /var/opt/novell/log/ncs/resourcename.monitor.out"
- You can monitor the log file by doing a "tail -f /[LOGDirectory]/gwstatus"
- Then do things like kill the process, bring it down clean, or test the logic of the scripts by changing them a bit. For example, since you can't force a defunct state, change the defunct PID test to act on an S instead, which exercises the defunct code
Just remember to make the script executable, to put it in the same directory on all nodes, to change the AGENT variable to match the resource, and to create one for every resource, e.g. gwcheckPO1, gwcheckPO2, gwcheckDOM1, etc.
Finally, below is an example of the monitor script showing where I put in my script.
From iManager/cluster/cluster manager/click on resource/scripts/monitor script/
The first lines are default for monitoring the ip address and the volume
exit_on_error status_fs /dev/evms/GWPOOL /opt/novell/nss/mnt/.pools/GWPOOL nsspool
exit_on_error status_secondary_ipaddress 18.104.22.168
exit_on_error ncpcon volume GWVOL
#monitor groupwise PO1 <------------------------------------------------ Agent that I'm monitoring with this resource
exit_on_error /etc/opt/novell/groupwise/scripts/gwcheckPO1.sh <----------------- My script
#script below does "netstat -planet | grep -w 1677"<------------------------|
exit_on_error /tmp/1677.sh <------------------------------------------|--- A way to monitor a port to see if it's listening
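For the port check, a plain "grep -w 1677" also matches established client connections, or a PID or inode that happens to be 1677. Here is a slightly stricter sketch, assuming the usual netstat column layout (local address in column 4, state in column 6). The function name port_listening is made up for this example.

```shell
#!/bin/bash
# Sketch: succeed (exit 0) only if netstat-style output on stdin shows
# a TCP socket in LISTEN state on the given local port.
port_listening() {
    awk -v port="$1" '
        $6 == "LISTEN" && $4 ~ (":" port "$") { found = 1 }
        END { exit !found }
    '
}
```

A script like /tmp/1677.sh could then consist of the function plus a last line such as "netstat -planet | port_listening 1677", so that the exit code of the pipeline is what Monitor sees.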
Updates to Original Script
I've added new functionality to the script. It's attached as gwcheckv3.
I've had reports where a POA/MTA agent may be running but is unresponsive. An indication of this was that the administrator could not connect to the agent's HTTP monitor port. To capture this scenario, I've added logic to the script that connects to the HTTP port with wget (a command line web client). If wget does not get a proper response, the script reports a failure so that Monitor acts on it.
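The idea behind the wget check looks roughly like this. It is a sketch, not the code from gwcheckv3: the function name check_http, the URL, the port and the timeout are placeholders, and it assumes the HTTP port answers without a login prompt (if a password is set on the agent's web console, wget would need credentials as well).

```shell
#!/bin/bash
# Sketch: return 0 if the agent's HTTP port answers, non-zero otherwise.
# URL and timeout are placeholders; adjust them per agent.
HTTP_URL="http://127.0.0.1:7181/"

check_http() {
    # -q: quiet, -O /dev/null: discard the page,
    # one attempt only, give up after 10 seconds
    wget -q -O /dev/null --tries=1 --timeout=10 "$1"
}

# In a monitor script this could be used as:
#   check_http "$HTTP_URL" || exit 1
```

A single attempt with a short timeout keeps the check well inside the polling interval even when the agent is hung.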
Please check the new variables you will need to populate at the top of the script
To download: http://ginohm.webs.com