GroupWise Clustering Basics: Part 2
Gregg Hinchman has just written a new eBook that is now available at www.taykratzer.com. He has graciously shared Chapter 1 with Cool Solutions, and we presented it in a four-part series. The full title of this outstanding work is Success with Clustering GroupWise 7: A Guide to Building a Highly Available GroupWise 7 System on an OES NetWare Cluster. Enjoy!
- Cluster Hardware Design
- Cluster-Enabled Volumes
- GroupWise Server Requirements
- Processors and NICs
- ConsoleOne and Snapins
- IP Addresses and Ports
- Protected Memory
- Additional Patches
- NSS Configuration Parameters
- STARTUP.NCF Parameters
This is a great place to stop and discuss hardware design for clusters. I am definitely not an expert in all things hardware, but I have spent much time in discussions on designing cluster hardware to be fault tolerant. The one major consideration is building redundancy into the hardware. This means a cluster node (server) should have redundant power supplies, redundant Network Interface Cards (NICs) preferably with "Teaming" supported, redundant Host Bus Adapters (HBAs), and at least RAID Level 1 hardware mirroring on the node's on-board disk subsystem, which means an array controller and two disks. All of this drives up the price of the cluster hardware for sure, but let's take a look at what you are getting.
If you have a three-node cluster, as I do, and your cluster nodes do not have redundant power supplies that each plug into a different circuit, a power supply failure will turn your cluster into a two-node cluster very quickly. Now, if all your nodes are drawing their electricity from one electrical circuit and that circuit fails, you have a cluster of none. This means downtime, and your organization loses money through lost productivity, missed email-delivered opportunities, and the loss of rapid access to requested data. A redundant path for power is needed to increase availability. And do not forget the Heating, Ventilation, and Air Conditioning (HVAC). Adding more servers and a SAN to a data center will drive up the heat - make sure you have cooling in place that can handle the additional hardware.
The same applies to NICs. Having two NICs in a node, each going to a different module on a switch, provides network interface path redundancy. Add to the redundancy the ability to "Team" the NICs, and a failure of one NIC or one switch module will not cause the loss of a node in your cluster - again, increasing the resiliency of the cluster.
Note: What is NIC "Teaming"? Teaming is a feature of newer NICs and their drivers that makes two or more physical NICs appear as one. You might think of "Teaming of NICs" as "virtualization of NICs." The bonded pair of NICs acts as one from the operating system's point of view. In the case of clustering, this means a bonded pair of NICs has one IP address and balances the traffic across the bonded NICs. This also means that should one NIC - or the switch module it is plugged into - fail, the other NIC of the pair picks up all traffic, effectively increasing the availability of the cluster node.
Host Bus Adapters (HBAs) are the "NIC" for SAN connectivity. HBAs connect to the SAN through a 'fibre switch', and the fibre switch connects to the SAN's 'service processor'. "Service processor" is a loose term, as different vendors have different names for it. Suffice it to say a service processor is the CPU of the SAN. A fibre switch is similar to an Ethernet switch but with disk channel access to the SAN. Since all data is held on the SAN, it makes sense to have two HBAs in case of an HBA failure. But what if the fibre switch fails? It is best to have two fibre switches, with each of a cluster node's HBAs plugged into a different fibre switch. This is called 'multi-pathing'. Most SAN vendors have software that runs on the server nodes and assists and directs the 'multi-pathing' for OES and clustering, such as PowerPath from EMC.
Now you have to consider the service processors. If you only have one service processor, can it support the demand being placed upon it, and what happens if it fails? Having two service processors is therefore preferred, even though a service processor is an expensive piece of hardware. This is especially true if you are doing any type of SAN mirroring. SAN mirroring is service processor intensive. Let's look at an example of why I suggest all the built-in hardware redundancy. If you have only one service processor, then all nodes in your cluster will be attached to it, and all services will run through that one service processor. Should the service processor fail, or if throughput is degraded, you have a cluster failure and downtime. If you have two service processors and two fibre switches, each cluster node has only one HBA, and half your nodes plug into one fibre switch while the other half plug into the second fibre switch, then a service processor failure (or fibre switch failure) means half your cluster fails. Lastly, when you have two HBAs in a cluster node and each HBA plugs into a different fibre switch, a failure of a service processor OR a fibre switch OR an HBA does not cause the loss of a node. Again, this effectively increases availability.
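The scenarios above can be sketched as a small Python model that asks which nodes still have a working path to the SAN after a component fails. This is only an illustration: the node, HBA, switch, and service processor names are hypothetical, and the model assumes each fibre switch feeds exactly one service processor, which is how the example above reads but may not match your SAN vendor's design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    """One route from a node to SAN storage: HBA -> fibre switch -> service processor."""
    hba: str
    switch: str
    service_processor: str


def surviving_nodes(topology, failed):
    """Return the sorted names of nodes that keep at least one fully working path.

    topology: dict mapping node name -> list of Path objects
    failed:   set of failed component names (HBAs, switches, or service processors)
    """
    alive = []
    for node, paths in topology.items():
        # A path works only if none of its three components has failed.
        if any(not ({p.hba, p.switch, p.service_processor} & failed) for p in paths):
            alive.append(node)
    return sorted(alive)


# Hypothetical layout 1: one HBA per node, nodes split across two fibre switches.
single_hba = {
    "NODE1": [Path("N1-HBA1", "FSW-A", "SP-A")],
    "NODE2": [Path("N2-HBA1", "FSW-B", "SP-B")],
}

# Hypothetical layout 2: two HBAs per node, each HBA on a different fibre switch.
dual_hba = {
    "NODE1": [Path("N1-HBA1", "FSW-A", "SP-A"), Path("N1-HBA2", "FSW-B", "SP-B")],
    "NODE2": [Path("N2-HBA1", "FSW-A", "SP-A"), Path("N2-HBA2", "FSW-B", "SP-B")],
}

print(surviving_nodes(single_hba, {"FSW-A"}))  # half the cluster is lost
print(surviving_nodes(dual_hba, {"FSW-A"}))    # both nodes keep a path
```

With one HBA per node, losing a single fibre switch or service processor removes every node behind it. With two HBAs per node on separate switches, it takes a failure on both sides before a node loses its storage.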
Note: Service processors may be the most important part of your SAN from a performance perspective. If the service processor has a slow processor or a minimal amount of RAM, write performance through the service processor to the SAN disk will suffer. In the case of GroupWise, which is very disk-efficient but very I/O-intensive, you must consider the load you are placing upon the SAN. Here are some general guidelines to remember when scoping a SAN for a cluster. First, consider what you are going to have running on the cluster. If it's I/O-intensive, you need a faster processor and more RAM in the service processor to handle the traffic. Second, if you are planning on building a disaster recovery (DR) site or a Business Continuity Cluster (BC), you will be implementing some form of SAN mirroring. SAN mirroring is I/O-intensive and will need a faster processor and more RAM to handle the load of the production environment AND the mirroring of the SAN at the same time. Your SAN vendor can assist with proper design.
Why do I need RAID Level 1 hardware mirroring for my cluster nodes? The answer is simple. If you are installing an OES NetWare server on the cluster node's local hard drive, in other words you are not "booting SYS off the SAN," you may want to increase the availability of this node with RAID Level 1. RAID Level 1 hardware mirroring uses a RAID array controller and two hard drives. The disks mirror the vendor's system partition, the DOS partition, and the OES NetWare partition. Disk reads can be faster because either disk can service a request, while every write goes to both disks to keep the mirror in sync. Also, should one disk fail, the other disk will continue running until you get a replacement disk. Once the replacement disk arrives, just pop it in and the array controller will start mirroring to it. Effectively, you do not lose a node due to hard-disk failure.
What's the point of clustering if you have such high hardware fault tolerance? The protection is for the weakened areas such as: corrupt data on a node's SYS volume, ABENDs, and motherboard, RAM, or processor failures. Of the hardware redundancy items mentioned, I would recommend them all, but if you want to save some money, the RAID Level 1 hardware mirroring is the least important. You can always rebuild a node, which is especially easy with imaging tools such as Portlock's Storage Manager.
[ ] 4. Verify the SAN design and nodes for hardware fault tolerance.
[ ] 5. Check the electrical circuits and HVAC for the data center.
A cluster-enabled volume is one that has an eDirectory/NDS volume object associated with it; the application running on the volume is therefore reached through that eDirectory volume object as well. I strongly recommend cluster-enabling all GroupWise volumes. In the long run, the benefits far exceed any setup issues you may encounter. Here are two very good reasons to cluster-enable GroupWise resources.
Mapped drives to the clustered resource remain available even if the resource fails over to another node. The mapping is to the virtual server. Now I know what you may be thinking. "Our GroupWise users don't map a drive to where the GroupWise post office is, so why talk about mapped drives?" When you administer GroupWise, you almost always need at least one mapped drive to a domain. Also in order to view/edit the Properties of a gateway, you need a mapped drive to that gateway's domain. Finally, if you are testing GroupWise clustering or upgrading GroupWise, the last thing you want to do is to continue remapping drives every time a volume moves to a new cluster node.
ConsoleOne is the second reason. If the volumes are cluster-enabled, then ConsoleOne will be able to connect to the GroupWise databases no matter which cluster node holds the clustered resource, after the first domain is mapped of course. I have seen occasions where ConsoleOne will not connect to domains unless it has a mapped drive to the domain. Again, this may be a shortcoming in ConsoleOne. Save yourself the potential grief and just cluster-enable all GroupWise volumes.
The main lesson here is you no longer manage GroupWise based upon \\Servername\volume. Rather, you manage GroupWise based upon its clustered resource name. If PMDOMVL is the clustered resource name (cluster-enabled volume) and PMDOMVS is the virtual server, all your mappings are to the: \\<Virtual Server Name>\PMDOMVL volume ONLY, not to the server hosting it at that moment. This means if the volume moves to another host node, your mapping NEVER breaks.
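As a small illustration of that rule, the following Python sketch builds the UNC path from the virtual server and cluster-enabled volume names used above. The helper name `virtual_unc` and the optional trailing directory arguments are my own inventions for the example, not part of any Novell tool.

```python
def virtual_unc(virtual_server, volume, *directories):
    """Build a UNC path that points at the virtual server, never at a physical node.

    virtual_server: name of the cluster resource's virtual server (e.g. PMDOMVS)
    volume:         name of the cluster-enabled volume (e.g. PMDOMVL)
    directories:    optional sub-directories below the volume (hypothetical)
    """
    path = "\\\\" + virtual_server + "\\" + volume
    for directory in directories:
        path += "\\" + directory
    return path


# The mapping stays the same no matter which node currently hosts the volume.
print(virtual_unc("PMDOMVS", "PMDOMVL"))  # \\PMDOMVS\PMDOMVL
```

Because the physical node name never appears in the path, a failover changes nothing from the administrator's side.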
This brief section will discuss the hardware requirements for GroupWise. Understand these are recommendations and your "mileage" may vary. Clustering GroupWise from a hardware perspective is not much different than running GroupWise on stand-alone servers. The one exception is RAM. OES NetWare takes more RAM than older versions of NetWare. In order to determine the basic specifications for RAM on a clustered server for GroupWise, you first need an idea of what GroupWise services will run on that box. You also need to take into account the fact that the node may have to support multiple instances of the GroupWise services. This means multiple MTAs, POAs and gateways may be running on the same cluster node.
Out of the box with just the basic installed GroupWise services (one POA, one MTA, one WebAccess agent), assuming a virus-scanning software, TSAFSGW, and a back-up software agent running on the node, a server should have 2GB of RAM. This is enough RAM to support the OS and the services just mentioned. Add to that the clustering software, and if you are planning for fail over and having multiple MTAs, POAs and gateways all being hosted on the same node, I recommend adding at least 1GB more of RAM. This would bring the total up to 3GB of RAM per server. Remember these are recommendations for a general audience. Your system may require more or less based upon customized factors for your environment.
When I cluster GroupWise, I use protected memory in order to run multiple instances of GroupWise services. Protected memory consumes more RAM than running GroupWise services in the OS memory space. It is far better to have a bit too much RAM than not enough. Remember, you are building a highly available GroupWise system -- why let a little RAM stand in the way?
[ ] 6. Determine whether you have enough RAM.
- Go to the server console.
- Type in "MONITOR" to load the monitor utility.
- Select Disk Cache Utilization.
- Look at the LRU Sitting Time.
If the LRU Sitting Time is listed as a long period of time, say a couple of days, then you have plenty of RAM. If, however, it shows a shorter period of time, say four hours, you are in need of more RAM. Also keep in mind how long the server has been up and running. If it has been up and running for only four hours, then the LRU Sitting Time statistic is not valid yet.
The next parameter to check is Cache Buffers. This can be found in MONITOR, under System Resources. If the "Cache Buffer memory" value is low, say 70%, and LRU Sitting Time is short, you are definitely in need of more RAM. However, if the Cache Buffers are high, say 80%, and LRU Sitting Time is long, then no additional RAM is needed. The gray area here is when LRU Sitting Time is short and Cache Buffers are high. In that case, you may need to tune your OES NetWare OS and eDirectory.
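The two checks can be combined into one rough decision rule. The Python sketch below is only that, a sketch. The 48-hour and 80% cutoffs are assumptions taken from the "couple of days" and "say 80%" examples above, not official thresholds, and you would read the input values off the MONITOR screens by hand.

```python
def assess_ram(lru_sitting_hours, cache_buffer_pct, uptime_hours,
               lru_long_hours=48.0, cache_high_pct=80.0):
    """Classify a node's RAM situation from two MONITOR statistics.

    Returns one of: "enough", "inconclusive", "gray_area", "add_ram".
    The default cutoffs are assumptions drawn from the examples in the text.
    """
    # A long LRU Sitting Time means cached data is rarely evicted: plenty of RAM.
    if lru_sitting_hours >= lru_long_hours:
        return "enough"
    # The statistic cannot look "long" until the server has been up long enough.
    if uptime_hours < lru_long_hours:
        return "inconclusive"
    # Short LRU Sitting Time but healthy cache buffers: tune the OS and eDirectory first.
    if cache_buffer_pct >= cache_high_pct:
        return "gray_area"
    # Short LRU Sitting Time and low cache buffers: the node needs more RAM.
    return "add_ram"


print(assess_ram(lru_sitting_hours=4, cache_buffer_pct=70, uptime_hours=200))  # add_ram
```

Adjust the two cutoff parameters to whatever your own baseline shows for a healthy node.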
Processors and NICs are another requirement for GroupWise to perform efficiently. As always, purchase the best processor you can afford. Multiple processors are nice to have, and GroupWise agents will take advantage of them. As for network interface cards (NICs), buy the best and, of course, the fastest available. In a severe fail-over situation, you may have three or more post offices running on the same box. That means three or more times the traffic will flow across the NIC. Gigabit NIC cards are the only way to go.
Note: Make sure that, in a 10/100MB environment, the switch ports and the servers are hard-configured for 100MB full duplex. Auto-configured NICs and ports may cause a mismatch, which will bring your network to a crawl and on occasion create corruption within the databases. In a gigabit environment, hard-configure the server NIC and the switch if possible; otherwise, let the server NIC auto-configure, then verify it is running full duplex. Also verify the switch is running full duplex.
[ ] 7. Verify your NICs and server switch ports.
ConsoleOne is used to manage GroupWise 6.x and above. Currently (at the time this guide was written), the GroupWise 7 Snapins are available and should be implemented if you have GroupWise 7. Occasionally, I find it necessary to keep a separate copy of ConsoleOne with GroupWise 6.5 Snapins available. My reason for this is simple. If I find the GroupWise 7 Snapins are not working quite right, I will use the GroupWise 6.5 ConsoleOne to verify changes or settings. Call it a "sanity check."
[ ] 8. Verify you are using the latest GroupWise Snapins.
To determine the Snapins you are using, do the following in ConsoleOne:
- Select Help|About Snapins.
- Select the "GroupWise Administration" Snapin.
- Confirm that the version is the latest version, for example "7.0".
As I stated before, GroupWise will use the clustered resource IP address. Remember, this is the secondary IP address stated in the cluster load script. When you install or move GroupWise to a cluster, you will use the volume IP address (cluster resource) where GroupWise will reside. This guarantees that when the GroupWise volume fails over to another node, the GroupWise service will still run with the same IP address. But there is a gotcha: the IP ports. By default in GroupWise 6.0x and above, GroupWise components always listen on all bound IP addresses, not just the one specified. This means if a POA fails over to another server running a POA, even though they have different IP addresses, they will both listen on the same common default ports, such as 1677. You can imagine the nightmare that would cause - GroupWise would become confused, messages would not get sent, the system would grind to a halt. So how do you overcome this issue?
Well, the easiest way is planning. Make each port different for each GroupWise component. I recommend standardizing this and, for the sake of simplicity, use numeric sequence. Let me show you. This assumes the MTA, POA and WebAccess agent exist on the same cluster resource.
|Cluster Resource IP Address:|192.168.20.11|
|Cluster Resource Name:|PMDOMVS|
|GroupWise MTA MTP Port:|7101|
|GroupWise MTA HTTP Port:|3801|
|GroupWise POA MTP IN IP Address:|192.168.20.11|
|GroupWise POA MTP IN Port:|7301|
|GroupWise POA Client/Server Port:|1681|
|GroupWise POA HTTP Port:|2801|
|GroupWise POA MTP OUT IP Address:|192.168.20.11 (the MTA IP Address)|
|GroupWise POA MTP OUT Port:|7101 (the MTA MTP)|
|GroupWise WebAccess IP Address:|192.168.20.11 (the MTA IP Address)|
|GroupWise WebAccess TCP Port:|7201|
|GroupWise WebAccess HTTP Port:|4801|
|GroupWise WebAccess Document Viewer Ports:|7440 to 74xx|
As you may have noticed, any port that is on the 192.168.20.11 clustered resource ends with a 1. Also, you will note I created a standard for all ports. This standard will make it easy for anyone to follow after you when administering or troubleshooting GroupWise. The one exception to all of this is 1677. If you would like to configure the "NGWNAMESERVER" functionality of GroupWise, then GroupWise insists on using port 1677 for the POA registered in your DNS as "NGWNAMESERVER".
Example of Port Standards:
710x is for MTA MTP
720x is for WebAccess TCP Ports
730x is for POA MTP IN Ports
280x is for POA HTTP
380x is for MTA HTTP
480x is for Gateway HTTP
168x is for POA Client/Server
74xx is for WebAccess Document Viewer Ports
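The numbering standard lends itself to a tiny generator. Here is a minimal Python sketch, assuming one digit (0 to 9) identifies each cluster resource, as in the example where everything on 192.168.20.11 ends in 1. The function name is hypothetical, and the Document Viewer range is left out because the text gives it only as a range.

```python
# Base port for each service, taken from the port standards listed above.
PORT_BASES = {
    "MTA MTP": 7100,
    "WebAccess TCP": 7200,
    "POA MTP IN": 7300,
    "POA HTTP": 2800,
    "MTA HTTP": 3800,
    "Gateway HTTP": 4800,
    "POA Client/Server": 1680,
}


def ports_for_resource(suffix):
    """Return the standardized port for every service on one cluster resource.

    suffix: the single digit (0-9) assigned to the cluster resource.
    """
    if not 0 <= suffix <= 9:
        raise ValueError("this sketch supports only a single-digit suffix (0-9)")
    return {service: base + suffix for service, base in PORT_BASES.items()}


for service, port in ports_for_resource(1).items():
    print(f"{service}: {port}")
```

A single digit caps this sketch at ten cluster resources. A larger system would need a wider numbering scheme, which is a design decision for your own matrix.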
Before you ever start implementing GroupWise or moving it in a cluster, create a GroupWise design matrix. The GroupWise design matrix will house all the information about the GroupWise system. The design matrix will specify:
- IP addresses and ports
- The cluster resource name
- The primary cluster node (the first node the GroupWise cluster resource is regularly assigned to)
- The primary cluster node IP address
- The "virtual" UNC path
- The volume sizes
- The GroupWise service names, such as DOM1, PO1, etc.
Figure 1.16: GroupWise Design Matrix - An Example
This notion of a design matrix is best practice whether clustering GroupWise or not. Appendix A has a copy of a GroupWise design matrix; you may want to use it, or you may want to construct a design matrix of your own in a spreadsheet program.
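If you keep the design matrix in a spreadsheet, a quick script can catch duplicate ports before they bite you during a failover. The following Python sketch assumes the matrix has been exported to simple rows with a resource name, an agent service, and a port. The row layout and the PO2VS and PO3VS resource names are hypothetical. Because agents may listen on every bound address, the sketch treats any port used twice anywhere in the system as a conflict.

```python
from collections import defaultdict


def find_port_conflicts(matrix):
    """Return {port: [(resource, agent), ...]} for every port used more than once.

    matrix: list of dicts with the keys "resource", "agent", and "port".
    """
    users = defaultdict(list)
    for row in matrix:
        users[row["port"]].append((row["resource"], row["agent"]))
    return {port: owners for port, owners in users.items() if len(owners) > 1}


# Hypothetical design matrix rows; only PMDOMVS comes from the example above.
design_matrix = [
    {"resource": "PMDOMVS", "agent": "MTA MTP", "port": 7101},
    {"resource": "PMDOMVS", "agent": "POA Client/Server", "port": 1681},
    {"resource": "PO2VS", "agent": "POA Client/Server", "port": 1677},
    {"resource": "PO3VS", "agent": "POA Client/Server", "port": 1677},
]

for port, owners in find_port_conflicts(design_matrix).items():
    print(f"Port {port} is shared by: {owners}")
```

In this made-up matrix, the two post offices left on the default port 1677 would collide if they ever landed on the same node.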
GroupWise 7 now has both a working "/ip-" switch AND a "Bind Exclusive" check box for the MTA, POA, GWIA, and WebAccess agent. The "/ip-" switch tells the agent to listen only on the TCP/IP address specified after the "/ip-" switch. With this improvement, which actually was around in GroupWise 6.5, an agent should now listen only on the ports associated with its IP address. This means a GroupWise design matrix with a standardized port number scheme, as we previously presented, is not strictly needed. All agents, such as the POA, can use the same ports on the same node as another POA because now they will listen only on their own IP address for that port. This is good news!
Note: A /ip- switch tip: Did you know you can replace the IP address in the /IP parameter with the DNS name for the GroupWise agent? Simply place the DNS name of the cluster resource where the agent runs from in quotes, like so: /IP-"PMDOMVS.DEMO.COM". Then make sure you have a DNS entry for the cluster resource. Also, it is advisable to use the DNS configuration for the agent in ConsoleOne, in the Properties of the agent, under the GroupWise tab called Network Settings. If you think about it, IP addresses start to have less importance, making it easier to manipulate GroupWise domains and post offices to different servers. This is also very helpful when implementing a Business Continuity Cluster for GroupWise. Fair warning, however, your organization's DNS must be stable and fault-tolerant, otherwise you will have GroupWise issues.
In the past, I have not recommended relying upon the "/ip-" switch in a clustering environment. The main reason was (and still is) that I want to be absolutely 100% guaranteed an agent (POA, MTA) is only listening on its appropriate ports. And the only way to do this is to use the GroupWise design matrix and standardize the ports numerically. Think about this for a moment. You have GroupWise running in a cluster, you have banked your career on 99.999% uptime for GroupWise, and now you have to upgrade or apply a service pack to GroupWise. Unbeknownst to you, the upgrade or service pack could break the "/ip-" switch, and now you are fighting fires trying to figure out why GroupWise is misbehaving. Please understand I am not picking on Novell's developers here; they do an excellent job, but mistakes happen. So a bit of planning and work at the beginning will save you hours of potential headaches in the future. Better to be safe than sorry; at least that is my conservative consulting approach.
[ ] 9. Create a GroupWise design matrix and standardize names, ports, and IP addresses.
Protected memory is an administrator-defined space where the NLM's will load. NLM's in this space do not cause other NLM's outside the protected memory space to stop performing, in essence to ABEND. Here's a simple analogy. Pretend you are an NLM. Now, pretend your office has four walls, a floor, and a ceiling - it's a protected memory space. Assuming your door is closed and you do not share your office, try to touch a person outside your office. You cannot. This is how protected memory works. It isolates the NLM from other NLM's. This means you can have two post offices running on one cluster node and they will not contend or corrupt each other's memory space. Protected memory is a feature within the OES NetWare OS. Unless manually forced into a protected memory space, all NLM's load in the "OS" memory space. Protected memory does use more RAM and does slow down performance. The performance hit is negligible; you can expect an increase on average of 20% more memory used than if you did not use protected memory.
Also, when you load multiple GroupWise components, say an MTA and a POA, in separate memory spaces, they will each load a copy of GWENN5.NLM in their respective memory spaces. So you have two copies loaded instead of one without protected memory. In the case of GroupWise and clustering, protected memory is a gift from Novell. By loading GroupWise NLM's into their own protected and uniquely named memory space, the NLM is easier to manage. If, for example, an NLM will not unload after an ABEND in a protected memory space, the protected memory space can just be removed, which will bring the NLM down - no questions asked!
Then there is the benefit of auto restart of the memory space. If an NLM within the memory space ABEND's, the OS can restart the entire memory space. This pays big dividends in clustering and striving for 99.999% uptime with GroupWise.
Example: You are running a POA and WebAccess agent on the same clustered resource, but each component runs in its own memory space. The GWINTER.NLM ABENDs and, rather than taking the whole cluster resource (and the cluster node) down with it, the failure is isolated to just its memory space. Here's the kicker: The OS notices the memory space has an ABEND'ed NLM and restarts the space, and now the WebAccess agent is back up and running again. This process usually takes place within 10 to 15 seconds. So unless users are pushing a button or link at that moment, they will never notice.
There is a downside to auto-restarting a memory space however. On rare occasions, the memory space may become stuck in an ABEND-Restart-ABEND-Restart loop until the entire server runs out of memory and locks up. In this case, a cluster resource may not fail over to another node until the server becomes completely unresponsive. The agent involved in this loop will likely not be servicing user needs effectively, making it unavailable or "down."
As you can see, protected memory is a very valuable tool for GroupWise and for clustering GroupWise. Remember when you design the fail-over path for the GroupWise cluster resources, at some point multiple GroupWise components will load on the same host node. Protected memory will allow you to better manage GroupWise and the host node.
To see protected memory address spaces set up on your OES NetWare server (if any are set up), do the following:
- From the Console screen load Monitor.
- From the "Available Options" dialog, select "Virtual Memory".
- Select "Address spaces".
In the "Known Address Space" dialog, you can now see the address spaces you have set up on your OES NetWare server. There will always be an "OS" address space. This is the base operating system. If you are using protected memory, you will see the name of the protected memory space listed here also. In OES NetWare, memory spaces are listed in the "Current Screens" function. To access this, at the server console hold down the Control key and press the Escape key. You will see the Current Screens, which list the products running on that server. Over to the right, the memory space they run in is shown.
There are occasionally patches for the OES NetWare OS outside of the standard OES NetWare service pack. I advise you confirm you have the latest OES NetWare patches. OES NetWare service pack 1 was released on August 27, 2005, and as of this writing there are no additional patches available; nor are there patches available for Novell Cluster Services. In the case of GroupWise 7 and Novell Messenger 2, as of this writing there are no service packs or additional patches available. Please take a moment and check Novell's Support site for any patches that may be available before you start implementing GroupWise 7 and OES NetWare in a clustered environment.
Note: I chose to implement OES NetWare service pack 1 (NetWare 6.5 sp4) for this writing, mainly because of a few noted issues with NetWare 6.5 service pack 3 when it was released. Additional service packs were released to compensate for the issues in service pack 3. Do your research on service packs and potential pitfalls before you implement any solution. It saves time and lots of suffering.
[ ] 10. Apply patches to OES NetWare.
NSS does not require any adjusting in order to perform efficiently for GroupWise with OES NetWare.
There are a couple SET commands that need to be set on the cluster node servers that will host GroupWise cluster resources. These settings should be included in the STARTUP.NCF file of the server. Following are the set commands:
SET MEMORY PROTECTION RESTART COUNT = 9
SET MEMORY PROTECTION NO RESTART INTERVAL = 10
SET MEMORY PROTECTION ABEND AFTER RESTART COUNT = ON
SET HUNG UNLOAD WAIT DELAY = 60
The "Set Memory Protection Restart Count" tells the server to allow the restart of the protected memory space x times within a specified period of time; this is set in the "Memory Protection No Restart Interval". The Default setting for this command is 1. I like 9 as the setting here. I will explain this in conjunction with the other SET commands in a moment.
The "Set Memory Protection No Restart Interval" tells the server an amount of time to allow a restart of a memory space in conjunction with the "Restart Count" mentioned above. The Default for this setting is 1.
The "Set Memory Protection ABEND after Restart Count" tells the server to ABEND if the "Restart Count" setting is met within the "No Restart Interval". The Default for this setting is OFF.
Let me take a moment to explain the interaction of the three Memory Protection SET commands. With the parameters I have given, let's run a scenario. Say the WebAccess agent is running in its protected memory space and suddenly goes "rogue" on us. The restart count will increase with each restart of the memory space the OES NetWare OS performs, up to the "Set Memory Protection Restart Count" value of 9. Once the restart count of 9 is reached within a 10-minute interval (the "Set Memory Protection No Restart Interval" setting), the "Set Memory Protection ABEND after Restart Count" setting will tell the OES NetWare OS to ABEND itself. This will cause all cluster resources and any GroupWise components to fail over to a different cluster node and start running again. Hopefully this helps you understand the interrelatedness of these three settings and why I recommend these modifications from the default.
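To see how the three settings play together, here is a small Python simulation of the scenario as described above. It is a model of that description only, not of NetWare's internal bookkeeping. In particular, the sliding-window counting is my assumption, and the function name is hypothetical.

```python
def abend_time(restart_minutes, restart_count=9, no_restart_interval=10):
    """Return the minute at which the server would ABEND, or None if it never does.

    restart_minutes:     times (in minutes) at which the memory space was restarted
    restart_count:       models SET MEMORY PROTECTION RESTART COUNT
    no_restart_interval: models SET MEMORY PROTECTION NO RESTART INTERVAL (minutes)

    Assumes ABEND AFTER RESTART COUNT = ON and a sliding window (an assumption).
    """
    window = []
    for now in sorted(restart_minutes):
        window.append(now)
        # Keep only the restarts that fall inside the current interval.
        window = [t for t in window if now - t < no_restart_interval]
        if len(window) >= restart_count:
            return now
    return None


# A "rogue" agent restarting once a minute reaches 9 restarts inside 10 minutes.
print(abend_time([0, 1, 2, 3, 4, 5, 6, 7, 8]))  # 8

# An agent restarting every 5 minutes never has 9 restarts inside 10 minutes.
print(abend_time([0, 5, 10, 15, 20, 25, 30, 35, 40]))  # None
```

In the first case the node ABENDs and the cluster resources fail over. In the second, the memory space just keeps restarting in place.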
The "Set Hung Unload Wait Delay" instructs the OES NetWare OS on how long it should wait for resources to unload before it brings up a prompt to unload the NLM. If you set this higher, the cluster node will wait longer to prompt to unload an NLM. If you set it lower, the cluster node will prompt sooner. This is mainly a concern for the POA and you may need to tune this for your environment if your POAs take longer than 60 seconds. I recommend a 60-second wait delay to start.
These parameters are especially useful for the WebAccess agent. As we discussed previously, these SET commands can be very beneficial in increasing uptime of a GroupWise agent. But remember that the memory space can, on rare occasions, get caught in a restart loop. (I have seen it only one time in four years; others have told me they have experienced looping, however.) Therefore, I am suggesting these settings as a way to increase uptime for GroupWise agents. Previously, I recommended "0" for the "Restart Interval" to allow infinite restarts. (Actually, it disables this parameter.) However, after discussing this with folks at various conferences, via email, and over late-night dinners, I now think a more conservative approach is to limit the restarts and their interval. This will allow the cluster resource to fail over to another node in the cluster sooner and decrease the potential for downtime of the affected GroupWise agent.
[ ] 11. Configure the STARTUP.NCF file.
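One way to double-check this step is to compare a copy of the node's STARTUP.NCF against the four recommended values. The Python sketch below assumes you have copied the file's text off the server. The function names are hypothetical and the parsing is deliberately simple: it looks only at lines of the form SET name = value.

```python
# The four settings recommended in this section.
RECOMMENDED = {
    "MEMORY PROTECTION RESTART COUNT": "9",
    "MEMORY PROTECTION NO RESTART INTERVAL": "10",
    "MEMORY PROTECTION ABEND AFTER RESTART COUNT": "ON",
    "HUNG UNLOAD WAIT DELAY": "60",
}


def parse_set_commands(ncf_text):
    """Collect 'SET name = value' lines into a dict with upper-cased names and values."""
    settings = {}
    for line in ncf_text.splitlines():
        line = line.strip()
        if not line.upper().startswith("SET ") or "=" not in line:
            continue  # not a SET command (blank line, comment, LOAD, and so on)
        name, value = line[4:].split("=", 1)
        settings[" ".join(name.upper().split())] = value.strip().upper()
    return settings


def missing_or_different(ncf_text):
    """Return {setting: (found_value_or_None, recommended_value)} for every mismatch."""
    found = parse_set_commands(ncf_text)
    return {
        name: (found.get(name), wanted)
        for name, wanted in RECOMMENDED.items()
        if found.get(name) != wanted
    }


sample = """SET MEMORY PROTECTION RESTART COUNT = 9
set hung unload wait delay=30
"""
for name, (found, wanted) in missing_or_different(sample).items():
    print(f"{name}: found {found}, recommended {wanted}")
```

An empty result means all four settings are present with the values suggested here. Remember that you may deliberately tune the hung unload delay for your own POAs.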
End Part 2
About the Author
Gregg A. Hinchman is a self-employed consultant (www.HinchmanConsulting.com), and Novell Consulting Partner. He has supported GroupWise for more than 11 years. He specializes in GroupWise, Clustering, eDirectory, NetWare and GroupWise Document Management. He currently carries 18 Novell Certifications, including CDE, MCNE and CNE NetWare 6. He is also certified IT Project + and regularly provides project management for clients. Gregg is the co-author of Success with GroupWise Document Management and has spoken at Novell's Premiere Technology Conference "BrainShare" four years running in the United States and abroad, as well as at the GroupWise Advisor Summit. He lives in Indianapolis, Indiana, and spends his vacation time seeking mountain summits "just for the view." He is always for hire at very reasonable rates. Gregg's modest website can be found at: http://www.hinchmanconsulting.com/.