Gregg Hinchman has just written a new eBook that is now available at www.taykratzer.com. He has graciously shared Chapter 1 with Cool Solutions, and we presented it in a four-part series. The full title of this outstanding work is Success with Clustering GroupWise 7: A Guide to Building a Highly Available GroupWise 7 System on an OES NetWare Cluster. Enjoy!
Part 1
Part 2
Part 3
Part 4
Part 4
In this section I will discuss NetWare partitions, NSS pools, volumes, and LUNs in relation to one another. As previously discussed, LUNs are devices as far as OES is concerned. A LUN is disk space. To make that disk space usable by OES, you must have a partition. It is possible to have partitions span multiple LUNs, but it isn't advisable. The LUN-to-partition ratio should be one-to-one.
NSS pools are the future cluster resources where the GroupWise components will live. Many times I have seen organizations place multiple pools on a single LUN. It does work, but it isn't advisable: if the LUN breaks, users lose everything running on that LUN. The LUN-to-NSS-pool ratio should be one-to-one.
Volumes are the file-system-accessible disk space for GroupWise. As you are aware, you can have multiple volumes running under one NSS pool, which means that should the NSS pool become corrupt and need repair, both volumes become unavailable. If each volume holds a different GroupWise post office, both post offices are unavailable. Therefore, the NSS-pool-to-volume ratio should be one-to-one.
To summarize, one LUN should hold one NetWare partition, which should hold one NSS pool, which should hold one volume, which should contain one GroupWise component unit. If any one of these fails, only one GroupWise component is lost. (In the case of gateways, I consider one GWIA and its parent domain as one GroupWise component unit because of their interdependence.) Now we need to create cluster resources on the devices (LUNs).
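Before moving on, here is a hypothetical layout to make that one-to-one chain concrete. The LUN number, names, and size are purely illustrative and are not part of the demo system used later in this chapter:

LUN 4 (50GB) -> one NetWare partition -> one NSS pool (PO1POOL) -> one volume (PO1VOL) -> one post office (PO1)

If LUN 4 fails, only post office PO1 is affected; every other GroupWise component keeps running from its own LUN.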
The heart of Novell Cluster Services lies in the communication of each node with every other node and their ability to know when a node is up or down. When you install Novell Cluster Services, you are asked to decide which LUN to create an SBD (split brain detector) partition on. It is best practice to create a LUN just for the SBD partition, and preferably to create two LUNs in order to mirror the SBD partition. Should one SBD partition become corrupted or its LUN fail, the other allows the cluster to continue without disruption.
The SBD partition is a small space (usually 15MB or less) on the SAN where each node writes a packet to let the others know it is "alive." If a node does not write its "I'm alive" packet within a configurable interval, known as the heartbeat, the node holding the Master IP address will ping that node. If the node does not respond, the Master IP address node sends the "failed node" a "poison pill" packet, which tells the failed node to ABEND. For more technical details on clustering, check out Novell's documentation site under Cluster Services.
A LUN of 100 to 500MB will more than suffice for an SBD partition; in my experience, SAN vendors prefer not to make LUNs smaller than 500MB anyway. Make sure you have two such LUNs, and make sure they do not have any partitions, pools, or anything else on them. You do not want to share your SBD partition with anything else: for example, a GroupWise post office.
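Once the cluster is up, you can sanity-check node membership and the SBD from the server console. The commands below are the ones I recall using on OES NetWare clusters; verify the exact names against Novell's Cluster Services documentation for your version:

CLUSTER VIEW
SBD VIEW

CLUSTER VIEW should report the node you are on and the current cluster membership, and SBD VIEW should report what each node last wrote to the SBD partition, which is a quick way to confirm every node is checking in.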
In this section I will walk you through the process of creating a cluster resource (NSS pool, or virtual server). Then I will walk you through creating a cluster-enabled volume on that cluster resource. I will be using NetWare Remote Manager to perform both operations. My intention is not to make you an expert in the cluster resource creation process, but rather to give you a solid footing and some help that will expand your knowledge of clustering.
Once you have a device (LUN) accessible by OES, the next step is to make the device sharable for clustering. If you do not, the device will belong only to the server on which you initially configure it. To configure a device (LUN) as sharable, at the OES system console type:
NSSMU
[ ] 14. Set Devices to "Sharable for Clustering".
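For those who like a step-by-step reminder, the sequence inside NSSMU looks roughly like this. The menu labels and the F6 keystroke are from memory, so your build of NSSMU may word them slightly differently:

NSSMU
   Devices
   (highlight the LUN you want to cluster-enable)
   F6 (Share), so that "Sharable for Clustering" shows Yes
   Esc (to back out of NSSMU)

Repeat this for every LUN that will hold a cluster resource.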
To get started, you will need a web browser and a plan. Since I will be doing the planning in the next chapter (aptly titled "Planning"), I will give you a cluster resource I have already planned out in order to demonstrate how to create one. My cluster resource (NSS pool) name is PMDOMVS, and my volume name is PMDOMVL. Because my system is a demo system for teaching purposes, my NSS pool will only be 200MB in size. Finally, I assume you already have a cluster built and that all LUNs are seen by all cluster nodes (servers). Let's do it!!
That is all there is to creating a cluster resource. Next up, I need to create a volume on that cluster resource. I highly recommend you create your cluster resources first, then proceed to creating the volumes. Also, as I stated previously, it is a best practice to have a one-to-one relationship between cluster resources and volumes. Now it's time to create a volume on a cluster resource.
Creating a cluster volume is as easy as creating a cluster resource was in the last section. You will need a browser and NRM again.
[ ] 15. Create cluster resources and volumes.
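For reference, once the pool is cluster-enabled and the volume is created, the resource's load and unload scripts typically look something like the sketch below. NRM (or iManager) generates the real scripts for you; the cluster name (GWCLUSTER), the virtual server name, and the IP address here are hypothetical, so treat this only as a sanity check of what you should expect to see:

Load script:
nss /poolactivate=PMDOMVS
mount PMDOMVL VOLID=254
CLUSTER CVSBIND ADD GWCLUSTER_PMDOMVS_SERVER 10.1.1.110
NUDP ADD GWCLUSTER_PMDOMVS_SERVER 10.1.1.110
add secondary ipaddress 10.1.1.110

Unload script:
del secondary ipaddress 10.1.1.110
CLUSTER CVSBIND DEL GWCLUSTER_PMDOMVS_SERVER 10.1.1.110
NUDP DEL GWCLUSTER_PMDOMVS_SERVER 10.1.1.110
nss /pooldeactivate=PMDOMVS /overridetype=question

If the generated scripts you see in NRM differ, trust the generated versions; the point is simply that activating the pool, mounting the volume, and binding the secondary IP address are what turn this disk space into a virtual server ready to host GroupWise.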
One last tidbit to ease your worried mind. In order for GroupWise to perform well on a SAN, the hardware must be well designed and the underlying OES must be configured well. These items we have discussed. But how do you test that the I/O throughput from your cluster to your SAN is as fast as it should be? The answer comes from Novell: TSATEST.NLM. Born from one too many calls from customers complaining that throughput when attached to a SAN was bringing network services to a crawl, this little gem was written by Novell's developers. TSATEST can be used to help diagnose backup issues as well as SAN issues. It has many switches and provides a ton of information to help you determine whether a problem is SAN- or network-related versus OES-related. Check it out.
[ ] 16. Read up and use TSATEST to verify SAN throughput.
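As a rough example, TSATEST is loaded at the server console against the volume you want to measure. The /V, /U, and /P switches below (volume, user, and password) are the ones I remember using, the values are placeholders from my demo system, and you should check TSATEST's on-screen help for the full and current switch list:

load tsatest /V=PMDOMVL: /U=admin.demo /P=novell

TSATEST then reads data through the backup TSAs and reports throughput, which you can compare against what your SAN vendor says the hardware should deliver.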
This section will discuss the next level of high availability and disaster recovery for GroupWise. Business Continuity Clusters (BCC) are a "cluster of clusters" separated by distance. BCC allows organizations to secure their Novell environment against a site disaster or failure, whatever event causes it.
The design of BCC allows for up to four totally separate locations to provide redundancy for the Novell environment. Should one location experience a failure, all Novell services hosted at that location would fail over to another location within the organization's network and continue running, providing near-zero downtime during the failure. Most organizations consider having two sites: one is the main data center, and the second is the disaster recovery (DR) site. This design works well, provided the second site is located a good distance from the main site. Let's take a look at recent disasters and allow me to provide some insight into this thinking.
Personal Note: Both 9/11 and the recent hurricanes were and still are tragic events in United States history. My intention in discussing them here is not in any way to belittle the events or the people affected, but to learn from them in a business sense. I apologize to you, the reader, if I touch upon events that may have affected you personally.
On September 11, 2001, many organizations felt their businesses were protected from disaster. Their main data centers were in one tower of the World Trade Center while their DR sites were in the second tower. It seemed out of the realm of possibility that both towers would ever be affected by disaster at the same time, let alone what actually happened. Had those businesses located their DR sites in Newark, New Jersey, or even near Times Square, their business continuity would have flowed.
Businesses in New Orleans felt they knew how to protect their company data from disaster by locating their DR sites across town or in a nearby city such as Biloxi. They never thought an event could possibly affect such a wide area. Had those businesses located their DR sites farther inland, in Alexandria, Louisiana, or Jackson, Mississippi, their business continuity would have flowed.
The fact is, we now know that DR sites must be located far enough away to protect organizations from catastrophic events, or an organization should invest in up to four different DR sites spread across entirely separate physical locations. This is where Novell's BCC comes into play. BCC is software that rides above the Novell Cluster Services software; by this I mean it controls it. BCC also controls, through a series of scripts, the SAN vendor's mirroring software. To effect a site failover, an administrator simply launches iManager, logs in, and "pushes the button." The button is actually the administrator failing cluster resources over to a second, third, or fourth site where a cluster is waiting to run the resources. This can also be done remotely using a remote console to the server(s). This is a manual process, meaning an administrator must be involved. Some find a manual site failover less than desirable, but let's consider for a moment why manual is better than fully automatic failover.
Manual failover provides a level of control that automatic failover does not: specifically, human intelligence decides whether the failover should occur. Have you ever been around when a WAN link was accidentally cut between two sites? Imagine the high-speed link between two BCC sites is cut somehow. In this scenario, site 1 and site 2 would not know this happened. Each would believe the other site is completely dead and would load the cluster resources. This means there are now two sets of the same data at two different locations: a divergence of data, and a real problem, especially in an active/active BCC. Manual failover is preferred so such events do not happen. Now, let's take a look at what Novell's BCC has to offer.
Note: Recovery Time Objective (RTO) is the maximum amount of time allowed between the disaster and the point when normal data access is restored. It is during this time that organizations operate in a manual mode or not at all. As you can see, the shorter this time, the quicker an organization's business can return to normal. BCC has the ability to bring this time to less than five minutes, if and only if it is properly configured: you must use the Novell Cluster Services preferred nodes list feature, or all bets are off. This is also highly dependent on the SAN.
The Recovery Point Objective (RPO) is the maximum amount of data allowed to be lost or re-entered. Data that is created between backups counts toward the RPO. BCC has the ability to provide an RPO of zero, meaning zero data loss.
Now that you have a basic understanding of BCC, let's look at an example. If my Demo Company has three BCC sites, and one of the sites is hit by a tornado and completely destroyed, I would be able to fail over all my Novell services, GroupWise in particular, to my second data center. Then, while I am rebuilding the first data center, I still have a backup BCC data center in case of a second "freak of nature" disaster. Let's look at what happens if my second data center goes down.
If I only had two BCC data centers and the first was wiped out by a tornado, then the second was hit by an electrical failure (too much coffee spilt on the circuits while working late one night), my Demo Company business would come to a complete and utter stop until I got a generator, backup tapes, servers, or whatever else I might need. Since my Demo Company has three BCC data centers, when the second data center fails, my business continues running at data center three. So, just like in clustering, in BCC the more the merrier, at a cost of course. See Figure 1.39 below for my example of BCC failover.
Figure 1.39 Business Continuity Cluster Example
Something else for you to ponder about BCC: active versus passive. In its simplest application, BCC can have one active site, where all users access Novell services, and one passive site for DR. However, this means that the resources at the passive site (servers, a nice cluster, and a SAN) are not being utilized; they are just sitting there waiting for a disaster. Enter active/active. Provided your network infrastructure allows it (true BCC and SAN mirroring rely upon a fast, stable network), both sites could be providing Novell services for all users. Site one could be active, servicing half the data and GroupWise requests on its own cluster with its many different failover paths, while site two services the other half of the data and GroupWise services. Should either site experience a failure, the other site could completely take over and run all Novell services. This scenario allows the organization to capitalize on its hardware expenditures rather than have them collect dust.
Note: A quick note on SAN mirroring. Whether you are mirroring a SAN for BCC or just to provide a non-BCC-oriented DR site, there are two kinds of SAN mirroring to consider: asynchronous and synchronous. Because I am discussing GroupWise, and as we all know GroupWise databases are sensitive to inconsistencies across the different databases, synchronous is the way to go. Synchronous mirroring is done at the block level for an entire LUN at one moment in time. Unfortunately, synchronous mirroring has a distance limitation, which must be taken into consideration. You can stretch synchronous mirroring to roughly 200km, provided you add some pricey hardware. Asynchronous mirroring does not have a distance limitation per se; it is limited more by your bandwidth and by moving the previous snapshot before taking the new one. Asynchronous mirroring is more often used with snapshot technology, but it does not work well with databases because of the time inconsistencies. These inconsistencies are not just a problem for DR solutions; they are a backup issue with all types of databases as well.
Hopefully this section provided you with a little bit of information to consider the next time you attend a disaster recovery meeting. If your organization, like many I have seen, is in the process of collapsing its Novell infrastructure, data, and GroupWise into a single data center running on Novell Cluster Services clusters, then disaster recovery will be your next step or assignment. Thankfully, Novell has BCC available to assist you and your organization.
A final note on BCC: I will discuss some planning considerations for BCC in the next chapter as well. I will not, however, be discussing how to implement BCC. The reason is simple: BCC requires a very tight working relationship with your SAN vendor to get the SAN scripting correct, and one false move in mirroring a SAN the wrong way could wipe out production data. Therefore, Novell highly recommends, and I am in complete agreement, that you engage both your SAN vendor's consulting arm and Novell Consulting. I have been involved in BCC design and implementation projects; they require detail-oriented design and complete accuracy to test and implement. For more information on Novell's Business Continuity Clustering, see Novell's documentation site.
[ ] 17. Consider your disaster recovery plan and BCC.
This brings to a close the Clustering Basics chapter. It should have provided you with the knowledge required to proceed with clustering GroupWise. Please realize that there is a lot more to know about building a cluster, and Novell does a great job in its Advanced Technical Training clustering class. I highly recommend it.
This chapter presented the basics of Novell Cluster Services. It further drilled into cluster hardware design and SAN LUN design. It also presented a review of clustering components and management utilities. Finally, it provided a few specific task-based instructions for setting up cluster resources and their volumes. At this time, I will re-list the action items for this chapter to help you prepare for clustering GroupWise.
[ ] 1. Verify ConsoleOne Snapins version.
[ ] 2. Access NetWare Remote Manager.
[ ] 3. Access iManager.
[ ] 4. Verify the SAN design and nodes for hardware fault tolerance.
[ ] 5. Check the electrical circuits and HVAC for the data center.
[ ] 6. Determine whether you have enough RAM.
[ ] 7. Verify your NICs and server switch ports.
[ ] 8. Verify you are using the latest GroupWise Snapins.
[ ] 9. Create a GroupWise design matrix and standardize names, ports, and IP addresses.
[ ] 10. Apply patches to OES NetWare.
[ ] 11. Configure the STARTUP.NCF file.
[ ] 12. Decide which GroupWise components will be clustered.
[ ] 13. Verify the LUN design.
[ ] 14. Set Devices to "Sharable for Clustering".
[ ] 15. Create cluster resources and volumes.
[ ] 16. Read up and use TSATEST to verify SAN throughput.
[ ] 17. Consider your disaster recovery plan and BCC.
End Part 4
Gregg A. Hinchman is a self-employed consultant (www.HinchmanConsulting.com) and Novell Consulting Partner. He has supported GroupWise for more than 11 years. He specializes in GroupWise, clustering, eDirectory, NetWare, and GroupWise Document Management. He currently carries 18 Novell certifications, including CDE, MCNE, and CNE NetWare 6. He is also a certified IT project manager and regularly provides project management for clients. Gregg is the co-author of Success with GroupWise Document Management and has spoken at Novell's premier technology conference, BrainShare, four years running in the United States and abroad, as well as at the GroupWise Advisor Summit. He lives in Indianapolis, Indiana, and spends his vacation time seeking mountain summits "just for the view." He is always for hire at very reasonable rates. Gregg's modest website can be found at http://www.hinchmanconsulting.com/.