GroupWise Clustering Basics: Part 4


Gregg Hinchman has just written a new eBook that is now available at www.taykratzer.com. He has graciously shared Chapter 1 with Cool Solutions, and we presented it in a four-part series. The full title of this outstanding work is Success with Clustering GroupWise 7: A Guide to Building a Highly Available GroupWise 7 System on an OES NetWare Cluster. Enjoy!



Part 1

Part 2

Part 3

Part 4



Part 4
Partitions, Pools, Volumes, and LUNs - OH MY!

 

In this section I will discuss NetWare partitions, NSS pools, volumes, and LUNs in relationship to each other. As previously discussed, LUNs are devices according to OES. A LUN is disk space. To make that disk space usable by OES, you must have a partition. It is possible to have a partition span multiple LUNs, but it isn't advisable. The LUN-to-partition ratio should be one-to-one.



NSS pools are the future cluster resources where the GroupWise components will live. Many times I have seen organizations place multiple pools on a single LUN. It does work, but it isn't advisable. If the LUN breaks, users lose everything running on that LUN. The LUN-to-NSS-pool ratio should be one-to-one.



Volumes are the file-system-accessible disk space for GroupWise. As you are aware, you can have multiple volumes running under one NSS pool, which means that should the NSS pool become corrupt and need repair, both volumes become unavailable. If each volume has a different GroupWise post office on it, both post offices are unavailable. Therefore, the NSS-pool-to-volume ratio should be one-to-one.



To summarize, one LUN should hold one NetWare partition, which should hold one NSS pool, which should hold one volume, which should contain one GroupWise component unit. If any one of these fails, only one GroupWise component is lost. (In the case of gateways, I consider one GWIA AND its parent domain as one unit or GroupWise component unit because of their interdependence.) Now we need to create cluster resources on the devices (LUNs).
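To make that one-to-one chain concrete, here is a hypothetical layout sketch for a small GroupWise system. Aside from PMDOMVS and PMDOMVL, which appear later in this chapter, the LUN numbers and pool/volume names below are examples I made up for illustration, not requirements:

      LUN 0 -> NetWare partition -> NSS pool PMDOMVS -> volume PMDOMVL -> domain plus its GWIA
      LUN 1 -> NetWare partition -> NSS pool GWPO1VS -> volume GWPO1VL -> post office 1
      LUN 2 -> NetWare partition -> NSS pool GWPO2VS -> volume GWPO2VL -> post office 2

With this layout, if LUN 1 fails, only post office 1 is lost; the domain, the GWIA, and post office 2 keep running.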



Split-Brain Detector Partition



The heart of Novell Cluster Services lies in the communication of each node with every other node and their ability to know when a node is up or down. When you install Novell Cluster Services, you are asked to decide which LUN to create the SBD partition on. It is best practice to create a LUN just for the SBD partition, and preferably to create two LUNs in order to mirror the SBD partition. Should one SBD partition become corrupted or its LUN fail, the other will allow the cluster to continue without disruption.



The SBD partition is a small space (usually 15MB or less) on the SAN where each node writes a packet to let the others know it is "alive." If a node does not write its "I'm alive" packet within a configurable time, known as the heartbeat, the node holding the Master IP address will ping that node. If the node does not respond, the Master IP address node sends the failed node a "poison pill" packet, which tells the failed node to ABEND. For more technical details on clustering, check Novell's documentation site under Cluster Services.



A LUN of 100 to 500MB will more than suffice for an SBD partition; in my experience, SAN vendors prefer not to carve LUNs smaller than 500MB anyway. Make sure you have two such LUNs, and make sure they do not have any partitions, pools, or anything else on them. You do not want to share your SBD partition with anything else - for example, a GroupWise post office.
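As a concrete sketch of that advice (the LUN labels and sizes here are hypothetical), the SBD portion of the SAN might be carved up like this:

      LUN A - 500MB - SBD partition (primary) - nothing else on this LUN
      LUN B - 500MB - SBD partition (mirror)  - nothing else on this LUN
      LUN C and up  - GroupWise pools/volumes - one partition, pool, and volume per LUN

The two small LUNs exist only to hold the mirrored SBD partition; keep GroupWise data off of them.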



Creating Cluster Resources and Cluster Volumes



In this section I will walk you through the process of creating a cluster resource (NSS pool, or virtual server). Then I will walk you through creating a cluster-enabled volume on that cluster resource. I will be using NetWare Remote Manager (NRM) to perform both operations. My intention is not to make you an expert in the cluster resource creation process, but rather to give you a solid working knowledge and some help that will expand your understanding of clustering.



Devices Sharable for Clustering

 

Once you have a device (LUN) accessible by OES, the next step is to make the device sharable for clustering. If you do not, the device will belong only to the server you initially configure it on. To configure a device (LUN) as sharable, at the OES system console type:


NSSMU



    1. Select Devices.

    2. Select and highlight the device you wish to make sharable.

    3. Press F6.

  4. In the Device Information screen, the device will display "Yes" under "Sharable for Clustering".



[ ] 14. Set Devices to "Sharable for Clustering".



Create a Cluster Resource (NSS Pool)

 

To get started, you will need a web browser and a plan. Since I will be doing the planning in the next chapter - aptly titled "Planning" - I will give you a cluster resource I have planned out in order to demonstrate how to create a cluster resource. My cluster resource (NSS pool) name is PMDOMVS. My volume name is PMDOMVL. Because my system is a demo system for the purpose of teaching the reader, my NSS pool will only be 200MB in size. Finally, I assume you already have a cluster built and that all LUNs are seen by all cluster nodes (servers). Let's do it!!
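For reference, here is the tiny planning worksheet behind this demo resource. Only the names and the 200MB size come from my demo system; the secondary IP address shown is a placeholder you would replace with a free address on your cluster's subnet:

      Cluster resource (NSS pool):  PMDOMVS
      Cluster-enabled volume:       PMDOMVL
      Pool size:                    200MB (demo only - size for real GroupWise data in production)
      Secondary IP address:         192.168.20.20 (placeholder)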



    1. Launch a browser and enter the IP address of a cluster node along with the port for NetWare Remote Manager (NRM) and hit the Enter key, like this:


      https://192.168.20.11:8009/



    2. Log in to NRM.

    3. On the left side, under Manage Server, select the Partition Disks link.








      Figure 1.21 NRM Manage Server Category - Partition Disks Link





    4. Using the right-side scroll bar, scroll down until you find the disks that contain all the free space. In my case, I am using an LSI HBA. As you can see, I have several LUNs available, including one labeled as "Cluster Services Partition". In this example, I will choose the next one in line - V358-A2-D2:0 - to create my cluster resource.








      Figure 1.22 NRM Manage Server Category - Partition Disks Link





    5. Select the Create link next to the Free Disk Space. This will launch the File System Creation Operations window.







      Figure 1.23 File System Creation Operations Window





    6. Select the Create a New Pool link.



      Note: I have found that if you want more control over naming your cluster resources and volumes, you are better off first creating the new pool and then going back to the new pool to create the volume. Also, for every clustered NSS pool you create, eDirectory will attach "_Server" to the end of the name; in the case of a volume, it tags "_Vol" onto the end of the name. I prefer to delete these suffixes, but eDirectory will keep the "_Server" on the pool regardless. In the case of the volume, however, it will not keep the "_Vol" if you delete it during the creation process.



    7. Next, the NSS Pool Create window will appear. Fill in the pool size and the pool name. In my case, the name is PMDOMVS.








      Figure 1.24 NSS Pool Create Window



    8. Then check the Cluster Enable Pool box. This is very important!! If you do not check this box, you will not create a clustered resource and will not be able to keep your mapped drives in a fail-over scenario.

    9. Select the Create link.

    10. An informational prompt will appear asking if you want to create the pool; select OK.








      Figure 1.25 Are you Sure You Want to Create This Pool?





    11. Now the Cluster Pool Enabling Information window will appear. Change the Virtual Server Name to how you would like it to read. Again, mine is PMDOMVS.

    12. Next, if you are going to use CIFS, set the CIFS name as you wish it to appear.

    13. Then choose the protocol you are going to use - NCP is the only one I will use. However, if you have both Windows and Apple machines in your network, you may choose the CIFS and AFP protocols as well.

    14. Next, check the Auto Pool Activate box.

    15. Finally, fill in the IP address for this virtual server (cluster resource).








      Figure 1.26 Virtual Server Configuration




    16. Select the Create button. This will create the cluster resource and then return you to the Partition Disks window.

    17. At the Partition Disks window, use the right-side scroll bar to scroll down to view the newly created NSS pool.









      Figure 1.27 Back to Partition Disks with the Cluster Resource Created





    18. On the left side, scroll down to the Cluster Menu category and select the Cluster Management link. This will bring up the Cluster State view from NRM's perspective. You will notice the cluster resource you just created is running. You will also notice it still has the "_Server" tagged to it. Try as we might, eDirectory and Novell Cluster Services will just not let us forget that the cluster resource is a server.








      Figure 1.28 New Cluster Resource Running





    19. Again, on the left side, select the Cluster Config link. Note the cluster resource appears here as well. As I have stated already, the Cluster Config link is for configuring the cluster resource.



      Note: You will notice there is a link for "New Cluster Volume", and the icon appears to be the same as for the cluster resource I have loaded and running. Do not let it confuse you. It's a small misrepresentation of the truth. The cluster resource is an NSS pool, a virtual server, and a cluster resource. It is NOT a volume. I have to manually create the volume. If you select this link, it will take you to the "NSS Pool and Volume Create" window.









      Figure 1.29 Cluster Config with New Cluster Resource




    20. Now let's switch to the LOGGER screen of the server where the cluster resource is running. In my case, it's the OESNW1 server. You will note the PMDOMVS pool is activated and the secondary IP address is now resident on the OESNW1 server. This means the cluster resource is running and, most importantly, working as designed.









      Figure 1.30 Cluster Node LOGGER Screen Showing the Cluster Resource Loading





  21. As a final step, you may want to test offline/online/migration of the cluster resource. This assures you the cluster resource is configured and working correctly before you create a volume or place data on it.
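Behind the scenes, cluster-enabling the pool generates load and unload scripts for the new resource, which you can view under the Cluster Config link. What follows is only a hedged sketch of what those scripts typically look like on OES NetWare: the 192.168.20.20 address and the "CLUSTER" portion of the virtual server name are placeholders from my demo, the CVSBIND/NUDP name-binding lines vary by configuration, and you should always trust the scripts NRM actually generated over this example.

      # Load script (sketch): activate the pool, then bind the resource's secondary IP address
      nss /poolactivate=PMDOMVS
      CLUSTER CVSBIND ADD CLUSTER_PMDOMVS_SERVER 192.168.20.20
      NUDP ADD CLUSTER_PMDOMVS_SERVER 192.168.20.20
      add secondary ipaddress 192.168.20.20

      # Unload script (sketch): release the IP address first, then deactivate the pool
      del secondary ipaddress 192.168.20.20
      CLUSTER CVSBIND DEL CLUSTER_PMDOMVS_SERVER 192.168.20.20
      NUDP DEL CLUSTER_PMDOMVS_SERVER 192.168.20.20
      nss /pooldeactivate=PMDOMVS /overridetype=question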



That is all there is to creating a cluster resource. Next up, I need to create a volume on that cluster resource. I highly recommend you create your cluster resources first. Then proceed to creating the volumes. Also it is a best practice to have a one-to-one relationship between cluster resources and volumes, as I stated previously. Now it's time to create a volume on a cluster resource.



Create a Cluster Volume

 

Creating a cluster volume is as easy as creating a cluster resource was in the last section. You will need a browser and NRM again.



    1. Launch a browser and enter the IP address of a cluster node along with the port for NetWare Remote Manager (NRM) and hit the Enter key, like this:



      https://192.168.20.11:8009/



    2. Log in to NRM.

    3. On the left side, under Manage Server, select the Partition Disks link.






      Figure 1.31 NRM Manage Server Category - Partition Disks Link





    4. Using the right-side scroll bar, scroll down until you find the cluster resource created in the last section. The cluster resource is "PMDOMVS".








      Figure 1.32 NRM Manage Server Category - Partition Disks Link




    5. Select the Add a Volume link next to the cluster resource. This will bring up the NSS Volume Create window.








      Figure 1.33 NSS Volume Create Window





    6. Type in the volume name and select the volume attributes you desire.

    7. Select Create.

    8. An informational prompt will appear asking if you want to create the volume. Select OK.








      Figure 1.34 Are You Sure You Want to Create This Volume?





    9. NRM will now take you back to the Partition Disks link. At this point, you will notice the PMDOMVS cluster resource is no longer available. Select the Begin Refresh button at the top to refresh the view. You will see the new volume created under the PMDOMVS cluster resource.

    10. Next, in NRM under the Manage Server category, select the Volumes link. This will now display the volumes mounted on this server, as shown in Figure 1.35.






      Figure 1.35 Volumes Mounted on the Server





    11. Then select the Partition Disks link and scroll down to the cluster resource. You will now see the cluster volume you have created.



      Note: Notice the Create eDir Object link. Do not use it. If you switch to iManager or ConsoleOne, you will see that the volume object already appears in eDirectory. I have found that if you do select the Create eDir Object link here, the volume will be tied specifically to the server it is currently mounted on, and the eDirectory object for the volume will be named "servername_volumename". This is NOT desired if you want the volume to be a clustered volume.








      Figure 1.36 New Cluster Volume Mounted




    12. Next, switch to the LOGGER screen of the cluster node running the cluster resource and volume you just created. You will notice the new volume and its parent NSS pool are now both in an "active" state. They are ready for data.






      Figure 1.37 Both the NSS Pool and Volume Are Set to Active





  13. Finally, launch Windows Explorer and browse through Network Neighborhood, through the eDirectory tree, and into the context of the cluster node, resource, and volume. Notice the cluster resource shows up as a Windows "server", and under it are all the volumes it knows about that are mounted on the same cluster node as itself. This includes, of course, its own volume - PMDOMVL in my example.



    Note: I am not quite sure how better to describe how Windows Explorer sees virtual servers. It's quite confusing for the end user, especially if they browse to multiple virtual servers and see exactly the same volumes appear under each of them. In this case, if two cluster resources (virtual servers) are running on the same cluster node, both virtual servers will show all volumes mounted on that cluster node. An example of this can be seen in Figure 1.38, where the SYS and _ADMIN volumes show up under the PMDOMVS cluster resource (virtual server). Confused? Don't be; it's just an illusion.








    Figure 1.38 Windows Explorer View of Cluster Resources and Volumes
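One more under-the-hood detail: once the cluster volume exists, the resource's load script typically gains a mount line for it, so the volume mounts wherever the resource runs rather than from AUTOEXEC.NCF. Again, this is only a hedged sketch using my demo names - the VOLID number is whatever was assigned on your system and the IP address is a placeholder, so verify the real script under the Cluster Config link:

      nss /poolactivate=PMDOMVS
      # The cluster-enabled volume is mounted by the resource itself, not by the server's startup files
      mount PMDOMVL VOLID=254
      add secondary ipaddress 192.168.20.20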





[ ] 15. Create cluster resources and volumes.




TSATest: Testing SAN Throughput

 

One last tidbit to ease your worried mind. In order for GroupWise to perform well on a SAN, the hardware must be well designed and the underlying OES must be configured well. These items we have discussed. But how do you test that the I/O throughput from your cluster to your SAN is as fast as it should be? The answer comes from Novell - TSATEST.NLM. Born from one too many calls from customers complaining that their throughput when attached to a SAN was bringing network services to a crawl, this little gem was written by Novell's developers. TSATEST can be used to help diagnose backup issues as well as SAN issues. TSATEST has many switches and provides a ton of information to help you determine whether the problem is SAN- or network-related versus OES-related. Check it out.
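As a rough illustration, the lines below show how TSATEST is commonly invoked at the server console. The volume, user, and password values are placeholders from my demo tree, and the switch names can vary between TSATEST builds, so verify them against the documentation that ships with the NLM before relying on this sketch:

      # Make sure SMS (SMDR/TSAFS) is loaded first, then point TSATEST at a test volume
      SMSSTART
      LOAD TSATEST /V=PMDOMVL: /U=admin.demo /P=password

TSATEST reads the volume through the TSA exactly as a backup engine would and reports the throughput it achieves, which makes it easy to compare a SAN-attached volume against local disk.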



[ ] 16. Read up and use TSATEST to verify SAN throughput.




Business Continuity Clusters

 

This section will discuss the next level of high availability and disaster recovery for GroupWise. Business Continuity Clusters (BCC) are a "cluster of clusters" separated by distance. BCC allows organizations to secure their Novell environment against a site disaster or failure. Site failures may consist of such events as:



    • WAN link failure

 

    • Electrical or HVAC outage

 

  • Destruction of the physical site holding the Novell environment



The design of BCC allows for up to four totally separate locations to provide redundancy for the Novell environment. Should one location experience a failure, all Novell services hosted at that location would fail over to another location within the organization's network and continue running, providing near-zero downtime during the failure. Most organizations consider having two sites: the main data center and a disaster recovery (DR) site. This design works well, provided the second site is located a good distance from the main site. Let's take a look at recent disasters, and allow me to provide some insight into this thinking.



Personal Note: Both 9/11 and the recent hurricanes were and still are tragic events in United States history. My intention in discussing them here is not in any way to belittle the events or the people affected, but to learn from them in a business sense. I apologize to you, the reader, if I touch upon events that may have affected you personally.



On September 11, 2001, many organizations felt their businesses were protected from disaster. Their main data centers were in one tower while their DR site was in the second tower of the World Trade Center. It seemed out of the realm of possibility that both towers would ever be affected by disaster at the same time, let alone what actually happened. Had those businesses located their DR sites in Newark, New Jersey, or even near Times Square, business continuity would have flowed.



Businesses in New Orleans felt they knew how to protect their company data from disaster by locating their DR sites across town or in a nearby city such as Biloxi. They never thought an event could possibly affect such a wide area. Had those businesses located their DR sites further inland, in Alexandria, Louisiana, or Jackson, Mississippi, business continuity would have flowed.



What Does BCC Do?

 

The fact is, we now know that DR sites must be located far enough away to protect organizations from catastrophic events, or an organization should invest in up to four different DR sites spread across entirely different physical locations. This is where Novell's BCC comes into play. BCC is software that rides above - and controls - the Novell Cluster Services software. BCC also controls, through a series of scripts, the SAN vendor's mirroring software. To effect a site failover, an administrator simply launches iManager, logs in, and "pushes the button." The button is actually the administrator failing cluster resources over to a second, third, or fourth site where a cluster is waiting to run the resources. This can also be done remotely using a remote console to the server(s). This is a manual process, meaning an administrator must be involved. Some find it less than desirable to have a manual failover of a site, but let's consider for a moment why manual is better than fully automatic failover.



Manual failover provides a level of control that automatic failover does not. Specifically, human intelligence decides whether the failover should occur. Have you ever been around when a WAN link was accidentally cut between two sites? Imagine the high-speed link between two BCC sites is cut somehow. In this scenario, site 1 and site 2 would not know this happened. Each would believe the other site is completely dead and would load the cluster resources. This means there are now two sets of the same data at two different locations - a divergence of data. That is a real problem, especially in an active/active BCC. Manual failover is preferred so such events do not happen. Now, let's take a look at what Novell's BCC has to offer:



    • Provides failover of mission-critical services to up to three other clusters (for a total of four clusters)

    • Standards-based integration with various SAN hardware devices for an automated fail-over process

    • Synchronizes and transforms cluster-related eDirectory objects using Novell Identity Manager technology

    • Provides a Recovery Time Objective (RTO) of less than five minutes and a Recovery Point Objective (RPO) of 0 (see the Note below for definitions)

    • Failover of one, all, or a group of cluster resources

  • Support for the new Novell Virtual IP Address (VIPA) technology



Note: Recovery Time Objective (RTO) is the maximum amount of time that may pass between the disaster and the time when normal data access is reinstated. It is during this time that organizations will operate in a manual mode or not at all. As you can see, the shorter this time, the quicker an organization's business can return to normal. BCC has the ability to make this time less than five minutes - if and only if it is properly configured. You must use the Novell Cluster Services preferred nodes list feature or all bets are off. This is also highly dependent on the SAN.



The Recovery Point Objective (RPO) is the maximum amount of data allowed to be lost or re-entered. Data created between backups represents the RPO: for example, if your only protection is a nightly backup, your RPO is up to a full day of mail. BCC has the ability to provide an RPO of 0, or zero data loss.



BCC Example Scenario

 

Now that you have a basic understanding of BCC, let's look at an example. If my Demo Company has three BCC sites and one of the sites is hit by a tornado and completely destroyed, I would be able to fail over all my Novell services, GroupWise in particular, to my second data site. Then, while I am rebuilding my first data center, I still have a backup BCC data center in case of a second "freak of nature" disaster. Let's look at what happens if my second data center goes down.



If I had only two BCC data centers and the first was wiped out by a tornado, then the second was hit by an electrical failure (too much coffee spilled on the circuits while working late one night), my Demo Company's business would come to a complete and utter stop until I got a generator, backup tapes, servers, or whatever I might need. Since my Demo Company has three BCC data centers, when the second data center fails, my business continues running at data center three. So, just as in clustering, in BCC the more the merrier - at a cost, of course. See Figure 1.39 below for my example of BCC failover.








Figure 1.39 Business Continuity Cluster Example




BCC: Active versus Passive

 

Something else for you to ponder about BCC: active versus passive. In its simplest application, BCC can have one active site, where all users are accessing Novell services, and one passive site for DR. However, this means that resources (servers, a nice cluster, and a SAN) are not being utilized; rather, they are just sitting there waiting for a disaster. Enter active/active. Provided your network infrastructure allows it - and true BCC and SAN mirroring rely upon a fast, stable network - both sites could be providing Novell services for all users. Site one could be active, servicing half the data and GroupWise requests on its own cluster with its many different fail-over paths, while site two services the other half of the data and GroupWise services. Should either site experience a failure, the other site could completely take over and run all Novell services. This scenario allows the organization to capitalize on its hardware expenditures rather than have them collect dust.



Note: A quick note on SAN mirroring. Whether you are mirroring a SAN for BCC or just to provide a non-BCC-oriented DR site, there are two kinds of SAN mirroring to consider: asynchronous and synchronous. Because I am discussing GroupWise - and as we all know, GroupWise databases are sensitive to inconsistencies across the different databases - synchronous is the way to go. Synchronous mirroring is done at the block level for an entire LUN at one moment in time. Unfortunately, synchronous mirroring has a distance limitation, which must be taken into consideration. You can stretch synchronous mirroring to roughly 200km, provided you add some pricey hardware. Asynchronous mirroring does not have a distance limitation per se; it depends more on your bandwidth and on moving the previous 'snapshot' before taking the new one. Asynchronous mirroring is more often used with 'snapshot' technology, but it does not work well with databases because of the time inconsistencies. These inconsistencies are not a problem only for DR solutions; they are a backup issue with all types of databases as well.



BCC Summary

 

Hopefully this section provided you with a little bit of information to consider the next time you attend a disaster recovery meeting. If your organization, like many I have seen, is in the process of collapsing their Novell infrastructure, data, and GroupWise into single data centers all running on Novell Cluster Services clusters, then disaster recovery will be your next step or assignment. Thankfully, Novell has BCC available to assist you and your organization.



A final note on BCC: I will discuss some planning considerations for BCC in the next chapter as well. I will not, however, be discussing how to implement BCC. The reason is simple: BCC requires a very tight working relationship with your SAN vendor to get the SAN scripting correct. One false move - mirroring a SAN in the wrong direction - could wipe out production data. Therefore, Novell highly recommends, and I am in complete agreement, that you engage both your SAN vendor's consulting arm and Novell Consulting. I have been involved in BCC design and implementation projects; they require detail-oriented design and complete accuracy to test and implement. For more information, see Novell's documentation on Business Continuity Clustering.





[ ] 17. Consider your disaster recovery plan and BCC.




Summary

 

This brings the Clustering Basics chapter to a close. This chapter should have provided you with the knowledge required to proceed with clustering GroupWise. Please realize that there is a lot more to know to build a cluster, and Novell does a great job with its Advanced Technical Training Clustering class. I highly recommend it.



This chapter presented the basics of Novell Cluster Services. It further drilled into cluster hardware design and SAN LUN design. This chapter also presented a review of clustering components and management utilities. Finally, this chapter provided a few specific task-based instructions for setting up cluster resources and their volumes. At this point, I will re-list the action items for this chapter to help you prepare for clustering GroupWise.



Chapter 1 Consolidated Task List



[ ] 1. Verify ConsoleOne Snapins version.


[ ] 2. Access NetWare Remote Manager.


[ ] 3. Access iManager.


[ ] 4. Verify the SAN design and nodes for hardware fault tolerance.


[ ] 5. Check the electrical circuits and HVAC for the data center.


[ ] 6. Determine whether you have enough RAM.


[ ] 7. Verify your NICs and server switch ports.


[ ] 8. Verify you are using the latest GroupWise Snapins.


[ ] 9. Create a GroupWise design matrix and standardize names, ports, and IP addresses.


[ ] 10. Apply patches to OES NetWare.


[ ] 11. Configure the STARTUP.NCF file.


[ ] 12. Decide which GroupWise components will be clustered.


[ ] 13. Verify the LUN design.


[ ] 14. Set Devices to "Sharable for Clustering".


[ ] 15. Create cluster resources and volumes.


[ ] 16. Read up and use TSATEST to verify SAN throughput.


[ ] 17. Consider your disaster recovery plan and BCC.



End Part 4



About the Author



Gregg A. Hinchman is a self-employed consultant (www.HinchmanConsulting.com) and a Novell Consulting Partner. He has supported GroupWise for more than 11 years. He specializes in GroupWise, clustering, eDirectory, NetWare, and GroupWise Document Management. He currently holds 18 Novell certifications, including CDE, MCNE, and CNE NetWare 6. He also holds an IT project management certification and regularly provides project management for clients. Gregg is the co-author of Success with GroupWise Document Management and has spoken at Novell's premier technology conference, BrainShare, four years running in the United States and abroad, as well as at the GroupWise Advisor Summit. He lives in Indianapolis, Indiana, and spends his vacation time seeking mountain summits "just for the view." He is always for hire at very reasonable rates. Gregg's modest website can be found at http://www.hinchmanconsulting.com/.
