Consultant's Corner: Tsk, Tsk, Disk, Disk


Introduction



“Maybe I'm a man ... who's in the middle of something that he doesn't really understand.” I know I have some days where I just wonder 'what the heck is going on?' Yeah – it’s a general question that I apply to the environment, politics, my life and IT technology equally. It helps me to think about issues from a different (call it “outside of myself”) perspective. This is exactly how you should think about GroupWise issues when you have them. Problem-solving is often very systematic: try this, try that. The key is to assess the problem, note the issues and effects, and research solutions. Of the possible solutions, you have to rank them in order of difficulty, time to fix, and potential damage. Then you start with the easiest, shortest and least dangerous.

I am sure most of you have this process down, but how many times can you just sit back and laugh when the solution which you overlooked was the easiest, and had it been a snake, you would have been bitten? It happens to the best of us. What does this have to do with GroupWise? Lots. GroupWise is the network monitor in your environment. It’s an application used by nearly all in your organization. So, any issue in the OSI 7 layer model will directly impact GroupWise, as it’s the 'A' in the model (Application).

Note: The OSI 7 Layer Model is not a new type of salad. It’s the frame for technology - period. The 7 layers are shown from the top-most down:

1. Application - Layer 7. Think: GroupWise / end-user interaction.
2. Presentation - Layer 6. Think: Presents data in an application-acceptable form.
3. Session - Layer 5. Think: Establishing sessions with applications.
4. Transport - Layer 4. Think: Transporting to other systems.
5. Network - Layer 3. Think: Network routing.
6. Data Link - Layer 2. Think: Frames and encoding.
7. Physical - Layer 1. Think: Hardware.

Yes, I did these from memory. Just remember 'APS' Transport to 'NDP' and you will have it for life. LOL! And I know many of you are telling me I forgot at least one layer - the Politics layer - and sometimes the Religion layer. Those are “soft” layers for another discussion. And I am sure there are more of you that would like to question my “quick” memory definitions, but there’s really no need. For more on the OSI 7 Layer model go to: http://www.webopedia.com/quick_ref/OSI_Layers.asp.

Back at it!

So what does all of this have to do with GroupWise and performance? Glad you asked. Disk throughput (input/output, I/O) is the topic for this article. I am hoping to provide you with a few tips and tricks to help you work on making GroupWise perform optimally on disk. Every time I do a GroupWise audit, one of the items I look at is the hardware and disk usage in particular. Usually I do not do full-blown SAN reviews or that sort of thing unless requested, but when requested, I am often amazed at what I find.

SAN Design for GroupWise



There are many ways to slice a SAN. I will provide you with a few things to keep in mind. And remember that your SAN vendor is the expert on their product, but you are the expert on GroupWise. Make sure to tell them that GroupWise SAN design should be treated like a database.

A SAN is broken into many parts, but most notably disks. A SAN will have an enclosure with many disks in it. There can be, and often are, multiple enclosures. It’s these disks that you cut up in order to serve a disk up to the Operating System (OS). When you carve up a SAN, the pieces are called “LUNs” or Logical Unit Numbers. LUNs become devices to the OS. The devices become NSS pools and eventually volumes where GroupWise roams.

Some Fibre Channel SAN vendors force you to carve up their SAN disk based on the enclosure and then further based on the Raid Group. A Raid Group is a pre-identified set of disks created at a RAID level, such as RAID 5 or RAID 10.

Note: Let’s compare RAID 5 vs. RAID 10 for GroupWise. Fact: RAID 10 is more fault tolerant, and the Input/Output (read/write) performance is great for databases. Fact: RAID 10 eats up twice as much disk space. Meaning, if I want a 100GB volume of usable disk space, I will need to use approximately 200GB of raw disk space. RAID 5 uses less disk space and has good performance, but not great when compared to RAID 10. Which do you use for GroupWise? Use RAID 10 if you can afford it and if the I/O demands are heavy. Otherwise, RAID 5 will work fine. Heck, you might even do a hybrid approach: RAID 10 for post offices, and RAID 5 for gateways or domains.
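For those who like to see the arithmetic, here is a minimal Python sketch of the raw-versus-usable math behind that note. The function name and the default of five disks are my own assumptions for illustration; it only models mirroring overhead for RAID 10 and one disk's worth of parity for RAID 5, and ignores hot spares and vendor overhead.

```python
def raw_needed_gb(usable_gb, level, disks=5):
    """Estimate raw capacity (GB) needed to get `usable_gb` of usable space.

    Illustrative helper, not a vendor sizing tool.
    """
    if level == "raid10":
        # Every block is mirrored, so only half of the raw space is usable.
        return float(usable_gb) * 2
    if level == "raid5":
        if disks < 3:
            raise ValueError("RAID 5 needs at least 3 disks")
        # One disk's worth of capacity in the group is consumed by parity.
        return float(usable_gb) * disks / (disks - 1)
    raise ValueError("unknown RAID level: %s" % level)


print(raw_needed_gb(100, "raid10"))
print(raw_needed_gb(100, "raid5", disks=5))
```

With these assumptions, a 100GB usable volume needs about 200GB raw on RAID 10 and about 125GB raw on a five-disk RAID 5 group.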

Other vendors such as Xiotech do not care about enclosures when carving a disk. They will stripe a LUN tied to a Raid Group across all disks on the SAN. This provides the ultimate in fault tolerance, as the loss of a disk or two or an enclosure, if properly designed, will not result in an outage. And it provides the ultimate in I/O performance, as data is read from or written to a large number of disks. EMC has something similar they call “Meta LUNs”.

Example #1

Here is an example of a SAN enclosure for an EMC CX500 SAN. (I actually used the SAN software to view and then create this spreadsheet picture, so I could clearly see how things are designed on the disk.) Notice we have one enclosure that contains 15 drives, each 146GB in size. There are 4 Raid Groups that show up: Raid Group 0, 1, 2, and 200. The Raid Group 200 is a Hot Spare drive for the SAN. As is usual practice for EMC, the SAN system partition exists on the first set of disks. This is where the SAN does most of its work and caching. It’s an important spot for EMC. A step further down shows all the LUNs, their size, and what is running on them. In Raid Group 0 we have five post offices running on just that one set of five disks, along with GWIA, WebAccess, and two domains. Is that a problem? See the figure below:



Figure 1: Example of a CX500 EMC SAN Enclosure

Yes. Why? For one, there are too many high I/O components, in this case post offices, hitting the same set of disks. Number two: fault tolerance. What happens if we lose one too many drives on that enclosure or the Raid Group? We then lose five post offices, GWIA, WebAccess, and 2 domains. That is a failure of biblical proportions. Now to make matters worse, Raid Group 1 on the same enclosure has 6 more post offices running across 7 disks. Lose the enclosure for some reason, or lose too many disks, and this GroupWise system is T-O-A-S-T ! The day-to-day issue, though, is performance. If each of these post offices represented say 1000 users, each accessing their mail at the same time ... well, you get the point: things would slow down.

Tip Number 1: Spread out! Do not place all heavy I/O LUNs on the same set of disks. Balance out file and print LUNs, which are read/cache, with GroupWise post office LUNs, which are read/write.
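If you keep your LUN layout in a spreadsheet like I do, a few lines of Python can flag the crowded Raid Groups for you. This is only a sketch: the group names, the role labels, the Raid Group 2 contents, and the limit of two post offices per group are all assumptions I made up for illustration. Only the post office counts for Raid Groups 0 and 1 come from Example #1.

```python
from collections import Counter

# Hypothetical layout as (Raid Group, LUN role) pairs. RG0 and RG1 follow
# Example #1; the RG2 entries are invented for illustration.
LAYOUT = (
    [("RG0", "post_office")] * 5
    + [("RG0", "gwia"), ("RG0", "webaccess"), ("RG0", "domain"), ("RG0", "domain")]
    + [("RG1", "post_office")] * 6
    + [("RG2", "file_print"), ("RG2", "post_office")]
)


def post_offices_per_group(layout):
    """Count post office LUNs on each Raid Group."""
    return Counter(group for group, role in layout if role == "post_office")


def crowded_groups(layout, limit=2):
    """Return Raid Groups with more post offices than `limit` (assumed threshold)."""
    counts = post_offices_per_group(layout)
    return sorted(group for group, n in counts.items() if n > limit)


print(crowded_groups(LAYOUT))
```

Pick a limit that fits your own disks and user counts; the point is simply to see the heavy read/write LUNs per set of disks at a glance.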


Testing Throughput



Once you have your SAN carved up properly for GroupWise you should test your I/O throughput. There are many ways to test, such as copying files to/from the SAN, or setting up Test LUNs with Test GroupWise post offices and trying to flood post offices with email. Unfortunately, there are few tools you have at your side to run numbers on performance using these tests. Enter: TSATEST.

For those interested, Novell created a utility that will help you test I/O called TSATEST. This utility is priceless!

There are several parameters that can be used for testing. Here are the ones I use at a minimum:

1. /V: Points TSATEST to the volume where it will run the test.
2. /Path: Provides the directory for the data set that TSATEST will use to simulate backup/copy jobs.
3. /U: The user TSATEST will use to authenticate.
4. /P: The user’s password.
5. /I: The iterations or number of rounds TSATEST will run a test against the data set.
6. /ERR: The filename of the error log.
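If you run these tests on many volumes, it can help to generate the command line rather than retype it. Below is a small Python sketch that assembles a TSATEST load line from the six parameters listed above. The function name and the sample values are placeholders of my own; the sketch only joins the switches described in this article and does not validate them against TSATEST itself.

```python
def tsatest_command(volume, path, user, password, iterations, err_log):
    """Build a 'load tsatest' line from the six switches described above."""
    parts = [
        "load tsatest",
        "/V=%s" % volume,        # volume to test
        "/PATH=%s" % path,       # directory holding the test data set
        "/U=%s" % user,          # user to authenticate as
        "/P=%s" % password,      # that user's password
        "/I=%d" % iterations,    # number of rounds to run
        "/ERR=%s" % err_log,     # error log file
    ]
    return " ".join(parts)


# Placeholder values; substitute your own volume, path and credentials.
print(tsatest_command("VOL1:", r"\TSATTEST", ".admin.demo", "novell",
                      5, r"sys:\etc\tsaErr.txt"))
```

The printed line can be pasted into an NCF file. Keep in mind that the password ends up in plain text there, so use a test account.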

Next, you have to choose the test data set. Here is what I would recommend: user MSG.DB files. Yes, large GroupWise Message database files can be a great data set for testing average throughput. You can also use OFFILES if you want to test small file throughput. I recommend about 3GB of a test data set in either case. So, I create a directory with 3GB of data and copy that to the various locations where I want to run the throughput test. I always run a baseline test first against local disk to see what my throughput is; then I run the real SAN test and compare the numbers to tell me how my SAN is performing.

Just so you know, TSATEST does not actually copy the data; it just simulates the read/write. But you must have the test data set on the disk where you want to test throughput.

Baseline Test

Here is my baseline test. First I run this against the local hardware disk within the server. This is the highest number you should be seeing, because it uses the server’s internal backplane. Also, I like to create an NCF file for a NetWare test that loads and runs the following:

BASETSATEST.NCF
load tsatest /V=VOL1: /PATH=\TSATTEST /U=.admin.demo /P=novell /I=5 /ERR=sys:\etc\tsaErr.txt





Figure 2: A baseline run of TSATEST against a data set on the local server disk

As you can see, my local test yielded 1755 MB per minute. Also notice that my total read time is 16 seconds. This gives me a great throughput result.

SAN Test

Now you need to create a second NCF file to run the same test - but this time across your Host Bus Adapters (HBAs) to the SAN disk - and see what numbers you get. Then you can determine if the SAN is performing properly. Here is my NCF:

SANTSATEST.NCF
load tsatest /V=SANVOL1: /PATH=\TSATTEST /U=.admin.demo /P=novell /I=5 /ERR=sys:\etc\tsaErr.txt





Figure 3: A TSATEST run against a data set on the SAN

Notice the Raw and Effective MB/min. You tell me: is 42MB/min. a good rate of I/O speed, versus your baseline of 1755MB/min? Hopefully your answer is “NO!” You have now found that your SAN is not performing as it should, and you can further conclude that GroupWise will run very slow as well. What next? Well, it’s simple: go back, look at the LUN design, look at the HBAs, the fiber cables, the BIOS of the HBAs, and the fiber switches or SAN service processors. The last stop is the SAN vendor.
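To put a number on "not performing as it should", you can express the SAN result as a percentage of the baseline. Here is a minimal Python sketch using the two figures from my tests. The function names and the 50 percent cutoff are assumptions for illustration only; pick a threshold that makes sense for your hardware.

```python
def percent_of_baseline(test_mb_min, baseline_mb_min):
    """Return the test throughput as a percentage of the baseline throughput."""
    return 100.0 * test_mb_min / baseline_mb_min


def needs_investigation(test_mb_min, baseline_mb_min, min_percent=50.0):
    """True if the test falls below `min_percent` of baseline (assumed cutoff)."""
    return percent_of_baseline(test_mb_min, baseline_mb_min) < min_percent


# Figures from the runs above: 42 MB/min on the SAN, 1755 MB/min locally.
print(round(percent_of_baseline(42, 1755), 1))
print(needs_investigation(42, 1755))
```

At roughly 2.4 percent of baseline, this SAN clearly belongs on the investigation list.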

Tip Number 2: Use a Number 2 Pencil! That is - Test I/O. Make sure you have run a baseline and tested the LUNs before you start loading GroupWise on the SAN. This will avoid future issues.

iSCSI Twist

For those of you using iSCSI SANs, does this apply to you? Yes, absolutely! But you have one more test to run. An iSCSI implementation is best when there is a separate iSCSI network - aka different IP address networks for the data channel - so they do not compete with the end user’s channel. The way it works is simple enough. First, you have an iSCSI SAN that you have configured for 10.10.10.x network. This means your servers have two network interface cards: one for the 10.x network, which is iSCSI, and one for the end-user network, which might be 192.168.0.x. Your servers now act as a sort of a router as well, because they have to transition data between the two networks.
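A quick sanity check on that design is to confirm that a server's two interfaces really sit on separate IP networks. The Python sketch below uses the standard `ipaddress` module. The host addresses and the /24 masks are assumptions I picked to match the 10.10.10.x and 192.168.0.x examples above; use your own values.

```python
import ipaddress


def on_separate_networks(iface_a, iface_b):
    """True if two interfaces (in 'address/prefix' form) are on non-overlapping networks."""
    net_a = ipaddress.ip_interface(iface_a).network
    net_b = ipaddress.ip_interface(iface_b).network
    return not net_a.overlaps(net_b)


# Hypothetical addresses: one NIC for iSCSI, one for end users.
print(on_separate_networks("10.10.10.5/24", "192.168.0.5/24"))
# Two NICs on the same network would compete for the same channel.
print(on_separate_networks("10.10.10.5/24", "10.10.10.9/24"))
```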

For testing, you have to do the baseline test and the SAN test. There is a third test you should consider as well: testing between servers on the iSCSI AND the end-user networks. For those without iSCSI, you could run a “between test” as well, to see how your switches affect backup.

'Tween Test

Here are the premises for a “between test” –

- I have FS1 and FS2 both attached to the iSCSI SAN. Both are on both networks, 10.x and 192.168.x.

- Run the test from FS1 to FS2’s sys or local disk, which would give you an idea of network throughput. Or you can run it from FS1 to FS2’s SAN volume, to test the iSCSI network. In theory it would give you a baseline between servers. Maybe you will find you have a bad switch or poor switched network design. I use the /S= for the server.

Here is my NCF for server-to-server local disk:

CROSSTSATEST.NCF
load tsatest /S=FS2 /V=FS2VOL1: /PATH=\TSATTEST /U=.admin.demo /P=novell /I=5 /ERR=sys:\etc\tsaErr.txt





Figure 4: A TSATEST run across the network to a data set on a different server

Ah, yes - a third set of numbers! So how do you figure out if this is good or bad? It depends upon your network, right? If it’s a 10/100 network, then this seems good. If it’s a gigabit network - not so good. You need 600MB/min.
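To judge a network number, it helps to know the ceiling for the link. This Python sketch converts a link speed in megabits per second to a theoretical maximum in MB per minute. It assumes decimal megabytes and ignores protocol overhead, so real-world results will always come in lower; the function name is my own.

```python
def theoretical_mb_per_min(link_mbit_per_s):
    """Theoretical ceiling in MB/min for a link speed given in Mbit/s."""
    mb_per_s = link_mbit_per_s / 8.0   # 8 bits per byte
    return mb_per_s * 60               # 60 seconds per minute


print(theoretical_mb_per_min(100))    # Fast Ethernet (10/100)
print(theoretical_mb_per_min(1000))   # Gigabit Ethernet
```

That works out to a ceiling of 750 MB/min for a 100 Mbit link and 7500 MB/min for a gigabit link, which is why a result that looks fine on 10/100 is disappointing on gigabit.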

Tip Number 3: Get a clue! Look at the WHOLE picture of throughput. This includes HBAs and their BIOSes, NICs and their BIOSes (if iSCSI), network switch design/implementation, and all the parts/pieces of your SAN.

That wraps up my thoughts on testing disk I/O for GroupWise. For more detail on TSATEST, go to:
http://www.novell.com/support/search.do?cmd=displayKC&docType=kc&externalId=10092890&sliceId=&dialogID=53520801&stateId=0 0 53528393

Side Board



I just wanted to take a quick break from the article to cover some non-GroupWise-related items. First and foremost, I hope your 2007 was a great year and 2008 brings you more success, great health and a tremendous amount of happiness! I know I can always use a bit more of all of these.

Second, I bet you are wondering why I did not get an article written for October and November. I can only say: “Sorry.” I tried to write one article per month and nearly succeeded, with 12 articles written - some months doubled up. I ran out of time/steam/ideas. So if you have any time/steam/ideas - do send them to me so I can use them for 2008.

Third up, Brainshare - yeeeehaaaawwww! It will be my 13th year! My, how the time flies. At 2008 Brainshare I have two very good sessions that, for those attending, should make the trip worthwhile. Oh, and I am sure I will have at least one guest (surprise) speaker to provide additional expertise!

Fourth and final, 2008. I have a plan in place for “Consultant’s Corner” for 2008 that will provide you with some great reading and learning, I hope. I will not promise 12 articles next year, but hey - who knows, I may find lots of new things to write about with Bonsai coming out!

OK, back to your "regularly scheduled show" ....

Summary



Hopefully this article has given you a few ideas that you can run with to further investigate GroupWise performance in your environment. The fact is, GroupWise is very efficient and runs strongly even in mis-configured environments, but wouldn't you like to say: “Maybe I'm Amazed...” at how fast this puppy runs!

As always, I can be reached at: Gregg@HinchmanConsulting.com if you have any comments or article ideas, or if you just want to help a quirky consultant support his GroupWise habit.
