Gregg Hinchman has just written a new eBook that is now available at www.taykratzer.com. He has graciously shared Chapter 1 with Cool Solutions, and we presented it in a four-part series. The full title of this outstanding work is Success with Clustering GroupWise 7: A Guide to Building a Highly Available GroupWise 7 System on an OES NetWare Cluster. Enjoy!
As I continue through this book I will be referring to an example GroupWise system I am clustering. Here is how this GroupWise system is designed. Figure 1.16 represents this design.
It should also be noted that I have one WebAccess gateway per domain. This provides a substantial amount of flexibility in GroupWise design and management. For example, if I decide to add another WebAccess gateway, I can just create a new cluster resource, create a new domain and WebAccess gateway, and I am finished. If I need to service one of the WebAccess agents, I can do so without affecting users because I have a second one running.
Figure 1.16 My GroupWise System Design
That is my GroupWise system in all its glory. To get the most out of the example, I will both create new GroupWise components and move existing GroupWise components to the cluster, just like in a real environment. The post office secondary domain will be created along with the DMS post office. I will then create a new domain to serve as the primary domain and install that domain onto the cluster. Next, I will move the EMAIL post office to the cluster, and create new GWIA and WebAccess gateways. This allows me to demonstrate both principles: migrating existing components to a cluster and installing new ones on it. I will use the new Server Consolidation and Migration Kit 1.0 when I move the GroupWise components to the cluster. This kit consolidates two earlier Novell tools: Novell Server Consolidation Utility 4.1 and the NetWare Migration Wizard 8.1.
[ ] 12. Decide which GroupWise components will be clustered.
Several utilities are very helpful in clustering GroupWise. The next few paragraphs acquaint you with these utilities so you can understand the role each will play in configuring GroupWise to run on your cluster.
TCPCON will allow you to see what ports the OES NetWare server is listening on. To view the listening ports in TCPCON do the following:
The second utility is Novell's Server Consolidation and Migration Kit. This utility is used to move data from one location to another and to migrate from one server to another. In my case, I use it to move a GroupWise post office to its new home on a cluster. This utility can be downloaded for free from Novell. Here's how you obtain this utility:
The Server Consolidation and Migration Kit performs server-to-server copies and is non-destructive to data. In other words, it only copies; it does not delete or move the data. I have seen copy speeds averaging from 4GB to 10GB an hour, but copies from compressed to non-compressed volumes can be slower. You are also limited by the speed of the hardware. As I have already stated, I recommend having 10/100 NICs statically set to 100 and full duplex. The switch ports they are plugged into should be statically set to the same. This avoids a speed or duplex mismatch, which can corrupt data and, more importantly, bring all copy jobs to a crawl. Also, having the source and destination on the same switch will increase copy speeds.
Another little tip for the Server Consolidation and Migration Kit is to have all servers involved in the copy process listed in Service Location Protocol (SLP). If for some reason this is a problem, then edit the HOSTS file on each server and add an entry for the other server.
Example: In the source server HOSTS file, place an entry for the Destination server. The entry might read like this:
192.168.20.11 OESNW1.DEMO.COM OESNW1
This little tip will help identify the servers on the network. It also helps the Server Consolidation and Migration Kit to facilitate communication between the source and destination servers.
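As a minimal sketch of the tip above, the following Python snippet builds a HOSTS line in the same shape as the example and checks whether a HOSTS file already names a given server. The helper names `format_hosts_entry` and `hosts_has_server` are hypothetical, and the sample file content is illustrative only; it is not taken from a real server.

```python
def format_hosts_entry(ip, fqdn):
    """Return a HOSTS line: IP address, full name, then the short alias."""
    short_name = fqdn.split(".")[0]
    return f"{ip} {fqdn} {short_name}"


def hosts_has_server(hosts_text, name):
    """Return True if any non-comment line lists `name` (case-insensitive)."""
    wanted = name.upper()
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        # fields[0] is the IP address; the rest are host names and aliases.
        if wanted in (f.upper() for f in fields[1:]):
            return True
    return False


# The destination-server entry from the example above.
entry = format_hosts_entry("192.168.20.11", "OESNW1.DEMO.COM")
print(entry)

# Hypothetical source-server HOSTS content before and after adding the entry.
before = "127.0.0.1 loopback lb localhost\n"
after = before + entry + "\n"
print(hosts_has_server(before, "OESNW1"))
print(hosts_has_server(after, "OESNW1"))
```

Running a check like this against both servers before scheduling a copy job is one way to confirm that each HOSTS file really does list the other server.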
Also, the Server Consolidation and Migration Kit has a few system parameters:
I recommend performing "pre-copies" with the Server Consolidation and Migration Kit. A pre-copy is a run of the scheduled copy job while the GroupWise system is still up. This can be done in the middle of the day. The pre-copy takes the longest and will report copy errors due to open files, and that is OK. On the day of the GroupWise system move, take down the GroupWise agents running on the server, then run the same scheduled copy job again and make sure you select the "Copy Files if newer" option. The job will then copy only those files that are newer on the source than on the destination. Any file that has changed since the pre-copy (user.db, wphost.db, wpdomain.db, and so on) will have a newer modified date, so it will be overwritten on the destination. At first, this copy will seem to take as long as the first, but it actually ends up taking about 25-33% of the original copy time. I have seen pre-copy jobs take 10 hours to copy a 20GB post office, while the second "copy if newer" pass takes only 2.5 to 3 hours.
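To make the "copy if newer" idea concrete, here is a small Python sketch of the comparison involved. It is not how the Server Consolidation and Migration Kit is implemented; it only illustrates the principle on a local temporary directory. The function name `copy_if_newer` and the file `unchanged.dat` are hypothetical.

```python
import os
import shutil
import tempfile


def copy_if_newer(src_dir, dst_dir):
    """Copy files that are missing or older on the destination.

    Returns the list of relative paths that were copied.
    """
    copied = []
    for root, _dirs, files in os.walk(src_dir):
        rel_root = os.path.relpath(root, src_dir)
        target_root = os.path.normpath(os.path.join(dst_dir, rel_root))
        os.makedirs(target_root, exist_ok=True)
        for name in files:
            src = os.path.join(root, name)
            dst = os.path.join(target_root, name)
            # Copy when the destination is absent or has an older timestamp.
            if not os.path.exists(dst) or os.path.getmtime(src) > os.path.getmtime(dst):
                shutil.copy2(src, dst)  # copy2 preserves the modified time
                copied.append(os.path.normpath(os.path.join(rel_root, name)))
    return copied


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "source")
    dst = os.path.join(tmp, "destination")
    os.makedirs(src)
    base_time = 1_000_000_000  # fixed timestamp so the demo is repeatable
    for name in ("wphost.db", "unchanged.dat"):
        path = os.path.join(src, name)
        with open(path, "w") as f:
            f.write("original")
        os.utime(path, (base_time, base_time))

    first_pass = sorted(copy_if_newer(src, dst))  # the "pre-copy"

    # Simulate a database that changed after the pre-copy.
    changed = os.path.join(src, "wphost.db")
    with open(changed, "w") as f:
        f.write("updated")
    os.utime(changed, (base_time + 60, base_time + 60))

    second_pass = copy_if_newer(src, dst)  # the "copy if newer" run

print(first_pass)
print(second_pass)
```

The second pass touches only the file whose modified date moved forward, which is why the final copy on moving day is so much shorter than the pre-copy.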
The LOGGER screen on an OES NetWare server is a valuable information tool. The logger screen will actually list the events that take place on the server. You can use the up- or down-arrow or the page-up or page-down keys to view the log. This is very handy when testing and troubleshooting GroupWise on a server and in a cluster.
The Cluster Monitor (CMON) screen, shown in Figure 1.17, is very helpful in keeping track of the cluster environment. The CMON screen shows details on the cluster itself, including which nodes are up and running, the Epoch number, and which server is the holder of the Master IP address. Hint: The server with the yellow "UP" is the holder of the Master IP address. The Epoch is the counter for events in a cluster. A server coming online into a cluster is an event. A server ABEND'ing out of the cluster is an event. The Master IP address is "the IP address" of the cluster.
Figure 1.17 Cluster Monitor Screen
The Cluster Resource Manager (CRM) screen on an OES NetWare server, shown in Figure 1.18, is useful as well when testing and troubleshooting cluster resource loading and unloading. This screen will provide information on cluster resources, when they load and unload, and any errors directly associated with their loading/unloading.
Figure 1.18 Cluster Resource Manager Screen
NSSMU is the NSS Management Utility that runs on the server. This utility allows you to view all devices, partitions, pools and volumes that are available to a cluster node. Within this utility, you can also add, delete, activate and deactivate pools, etc. To run NSSMU, type the following at the server console:
NSSMU
Figure 1.19 NSSMU View of Pools
Figure 1.20 NSSMU View of Volumes
In this section I will take a few moments to discuss LUN design. LUNs, or Logical Unit Numbers, are identity numbers given to a logically created set of disks on a SAN. To understand LUNs, let's look at SAN disk space for a minute. A SAN has a physical "rack" that holds physical disks (spindles, if you will). Vendors have their own specifications on how many disks (of what speed and size) will fit in a rack. The rack may hold 10 disks of 300GB each, which is 3 terabytes if I did my math correctly. Obviously, if you have a 100GB GroupWise post office, you do not need to provide it 3TB of disk space. So you subdivide (or carve out) the rack into disk space units and the SAN assigns a number to each unit. The number assigned to the "carved disk space" is considered a LUN. From an operating system point of view, the LUN is considered a device.
Try this: on a NetWare server console, type:
LIST DEVICES
Figure 1.21 List Devices Command Displays All LUNs Seen by OES
As you can see, many devices are listed. In my example figure, the LUNs are the items labeled as "VMWare Virtual S rev:1.0 [HDD]". If you look a bit further, you will see that the numbers in front that start with a "V" are the device numbers. These correlate to the LUNs. Let's use a second tool to view the LUNs. At the OES system console, type:
NSSMU
Figure 1.22 NSSMU View of Devices
As you can see in NSSMU, this server has a device called 0x6. This device name is specific to just this server. However, if you look at the Device Information section, you will see a "name" with the number "V358-A2-D2:0". This correlates to the above figure as well. This device is considered Adapter 2 Device 2 and has a capacity of 204.79MB. A small LUN for sure. Now for one last correlation: in a browser, go to NetWare Remote Manager at http://IPAddressOfServer:8008 and log in. Then select "Manage Server" and "Partition Disks".
Figure 1.23 NRM's View of a Device
Again you can see that the device name "V358-A2-D2:0" appears. Notice I have "Free Disk Space" for this device. Well, this device is a LUN. That's right -- in OES terms, a disk device is a LUN. The name "V358-A2-D2:0" is the device name, and for this demonstration and your understanding, it's the LUN as well.
Let's get back to the SAN. When you carve out your rack of disks into LUNs, you have to take several things into consideration. First, what is going to be running on the LUNs? Second, what level of RAID do you need? Third, how many "hot spare" drives do you need? GroupWise, as I have stated, is I/O intensive, which means it performs many reads and writes to disk.
Note: OES is very "cache intensive". GroupWise POAs will write to the OES "cache" first, then OES will commit the cache to disk, which is handled by the service processors of the SAN. Service processors also have cache (RAM). The more RAM a service processor has, the more commits it can handle from OES servers, and the more efficient and fast the disk I/O becomes. Do you see the relationship between SAN and server hardware and why it's not GroupWise's fault if it performs poorly? It's the hardware designer's fault!
If you assign two GroupWise post offices to one LUN, what happens? Well, if they are busy post offices, you will have degraded performance on that LUN. You will also have the potential to lose two post offices should that one LUN have a failure. It's best practice to have one GroupWise component running on one LUN. I will discuss this in more detail later.
Figure 1.24 Example of a SAN Vendor's LUN Configuration Software
What about RAID level? In a SAN, you can have many different RAID levels. Each LUN, in theory, can have a different RAID level. In practice, the more RAID levels in use, the greater the drain placed upon the service processors. RAID level is important and should also be considered when designing LUNs.
Which RAID level is most fault tolerant: RAID 1, RAID 5, or RAID 10? The answer is RAID 10, which is a set of RAID 1 mirrored pairs striped together with RAID 0. However, RAID 10 consumes a large amount of disk space because every disk is mirrored. RAID 10 is faster than RAID 5. RAID 1 with duplexing is also faster than RAID 5, because RAID 5 must calculate and write parity along with the striped data. However, with RAID 1, you lose 50% of your disk space. RAID 5's performance improves as you add more disks.
So back to the original question: What about RAID level? In a SAN for GroupWise, RAID 5 will suffice for most implementations. If you have the money and can afford the additional disks, RAID 10 will provide an added boost of performance for heavily used post offices. Honestly, the choice between RAID 5 and RAID 10 comes down to cost justification. I lean toward RAID 5 for GroupWise LUNs because it saves the customer money up front and has little downside in performance over the long term. Here is a quick analysis of RAID 10 vs. RAID 5 vs. RAID 1 disk space.
| RAID Level | Disks in RAID Set | Usable Disk Space |
|---|---|---|
| RAID 10 | 6 x 146GB drives | 400.94GB |
| RAID 5 | 6 x 146GB drives | 668.24GB |
| RAID 1 | 2 x 146GB drives | 133.65GB |
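The figures in the table follow from simple arithmetic, which the Python sketch below reproduces. It assumes each 146GB drive yields about 133.65GB of formatted space (the value implied by the RAID 1 row) and uses the standard rules of thumb: a mirror keeps half the disks' space, and RAID 5 gives up one disk's worth of space to parity. The function name `usable_gb` is hypothetical, and results may differ from the table in the last decimal place because of rounding.

```python
def usable_gb(raid_level, disks, formatted_gb_per_disk):
    """Approximate usable capacity for a RAID set of identical disks."""
    if raid_level == "RAID 1":
        if disks != 2:
            raise ValueError("this sketch treats RAID 1 as a two-disk mirror")
        data_disks = 1
    elif raid_level == "RAID 5":
        if disks < 3:
            raise ValueError("RAID 5 needs at least three disks")
        data_disks = disks - 1   # one disk's worth of space holds parity
    elif raid_level == "RAID 10":
        if disks < 4 or disks % 2:
            raise ValueError("RAID 10 needs an even number of disks, four or more")
        data_disks = disks // 2  # every disk is mirrored
    else:
        raise ValueError(f"unsupported RAID level: {raid_level}")
    return data_disks * formatted_gb_per_disk


FORMATTED_GB = 133.65  # assumption: per-disk figure taken from the RAID 1 row

for level, disks in (("RAID 10", 6), ("RAID 5", 6), ("RAID 1", 2)):
    space = usable_gb(level, disks, FORMATTED_GB)
    print(f"{level}: {disks} disks -> about {space:.2f} GB usable")
```

With the same six drives, RAID 5 leaves five disks' worth of space for data while RAID 10 leaves three, which is the cost difference weighed in the discussion above.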
Briefly, I will explain the "hot spare" drive for a SAN. In a SAN's physical rack with the ability to hold 10 disk spindles, one of those should be a hot spare in case of a disk failure in the rack. Usually hardware vendors recommend only one or two hot spare disks per SAN. But this does vary based upon the number of disks in the SAN.
A final word on RAID levels. If you have small LUNs for the WebAccess agent and its parent domain, then RAID 1 makes sense. But for larger post offices, better fault tolerance, and cost savings, stick with RAID 5.
Figure 1.25 Example of SAN Vendor's RAID Group Configuration Software
Your new LUN knowledge, along with the fault tolerance of your SAN LUNs, can be increased with Virtual-LUNs. Seems like another long, sticky topic, right? Wrong. It's a simple concept. Previously, I explained that a SAN disk drive exists in a physical rack and that you can carve out LUNs based upon the space of the physical rack and the RAID level. Well, Virtual-LUNs remove the physical rack limitation. Virtual-LUNs are written across all disks in the SAN. Here is an example. If we have a SAN with 100 drives and we create a 100GB Virtual-LUN, it will write across all disks, substantially increasing the ability to handle the loss of one, two, or three or more drives. So you see, Virtual-LUNs are not "tied to" the physical racks. Xiotech is the premier vendor for virtualizing LUNs. Check them out at: www.Xiotech.com .
A SAN vendor has software for configuring the LUNs and for assigning the OES nodes to the correct LUNs. In the OES clustering world, all OES nodes must always see all LUNs that they are going to service. This linking of LUNs to server nodes (called Hosts in the SAN world) is performed with the SAN vendor software. The linked LUNs and servers are called "storage groups" by some vendors. The terms differ but the result is the same: All 16 nodes of a cluster that will service all 20 LUNs on a SAN must be in the same storage group.
Figure 1.26 Example of SAN Vendor's Storage Group Configuration Software
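The rule that every node must see every LUN it will service lends itself to a simple cross-check. The Python sketch below assumes you have already collected, by whatever means your SAN vendor's software provides, a list of the LUNs visible to each node. The function name, node names and LUN labels are all hypothetical placeholders.

```python
def find_missing_luns(required_luns, visible_by_node):
    """Map each node to the sorted list of required LUNs it cannot see."""
    required = set(required_luns)
    missing = {}
    for node, visible in visible_by_node.items():
        gap = required - set(visible)
        if gap:
            missing[node] = sorted(gap)
    return missing


# Hypothetical inventory: node names and LUN labels are illustrative only.
required_luns = ["LUN0", "LUN1", "LUN2"]
visible_by_node = {
    "NODE1": ["LUN0", "LUN1", "LUN2"],
    "NODE2": ["LUN0", "LUN2"],  # NODE2 was left out of LUN1's mapping
}

problems = find_missing_luns(required_luns, visible_by_node)
print(problems)
```

An empty result means every node is mapped to every required LUN; any entry names a node that could not host a resource whose LUN it cannot see.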
Time to summarize. Proper LUN design is essential to GroupWise performance and fault tolerance. A LUN is just a device number for "carved up" disk space on a SAN. Only one GroupWise component should be on each LUN. Each LUN should have a RAID 5 configuration. If LUNs are limited to the physical racks of the SAN, then careful planning is required to make sure that too many I/O-intensive post offices are not sitting on the same SAN rack. Likewise, too many I/O-intensive post offices on multiple racks being serviced primarily by one service processor will decrease GroupWise performance. Spread out your I/O-intensive LUNs. If Virtual-LUNs are available, use them. Last and certainly most important, all OES servers in the cluster MUST see all LUNs to be clustered.
[ ] 13. Verify the LUN design.
End Part 3
Gregg A. Hinchman is a self-employed consultant (www.HinchmanConsulting.com) and Novell Consulting Partner. He has supported GroupWise for more than 11 years. He specializes in GroupWise, Clustering, eDirectory, NetWare and GroupWise Document Management. He currently carries 18 Novell Certifications, including CDE, MCNE and CNE NetWare 6. He is also IT Project certified and regularly provides project management for clients. Gregg is the co-author of Success with GroupWise Document Management and has spoken at Novell's premier technology conference, "BrainShare," four years running in the United States and abroad, as well as at the GroupWise Advisor Summit. He lives in Indianapolis, Indiana, and spends his vacation time seeking mountain summits "just for the view." He is always for hire at very reasonable rates. Gregg's modest website can be found at: http://www.hinchmanconsulting.com/.