Proactively testing your GroupWise system's IO performance
We've all taken the call. You know the one: your CxO says GroupWise is slow, the old needle-in-a-haystack call. Looking through the logs on the POA yields nothing, and testing with a test account on the CxO's post office doesn't show the reported symptom, yet the helpdesk is taking calls that things are slow. CPU looks good, no backup is running, and the network team says everything looks good... there's no obvious cause, so now what?!?
IO performance testing
We all want to avoid "the call". Performing some simple tests and keeping the results from when "things are good" can go a long way towards identifying whether or not disk IO performance is contributing to slowness. Here are some simple tests that should identify what sustained disk IO you can expect from your system.
Note: dd can be destructive; carefully review your command before executing it.
The first command runs sync (i.e., forces everything currently cached to be written to disk), then copies a stream of zeros in blocks of 1024k, 10000 times, to a file named /media/nss/GW/testfile. The second command then copies the testfile to a device named /dev/null (the bitbucket in the sky).
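The two commands described above can be sketched as a pair of small shell functions. The write command matches the one shown in the sample results; the read command is a reconstruction from the description (testfile in, /dev/null out), so treat its exact form as an assumption. The function names write_test and read_test are made up for illustration.

```shell
# Sustained write test: flush cached data first, then write COUNT blocks
# of 1024k zeros to FILE. dd OVERWRITES FILE, so double-check the path.
write_test() {
    sync
    dd if=/dev/zero of="$1" bs=1024k count="$2"
}

# Sustained read test: read FILE back and discard the data in /dev/null.
# (Reconstructed from the description; the original command is not shown.)
read_test() {
    dd if="$1" of=/dev/null bs=1024k
}
```

To repeat the run from the article you would call write_test /media/nss/GW/testfile 10000 followed by read_test /media/nss/GW/testfile, optionally prefixing each call with time -p as in the sample results. Remove the testfile when you are done.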
What you should see:
If you have decent hardware and it's not a busy time for disk access, you should see throughput numbers that exceed 200 MB/s on the sustained write test and 300 MB/s on the read test. Re-run the test a few times, then average the returned values to get a better picture of what the throughput numbers are when things are good. If you don't see at least these numbers, you'll want to closely watch any POA hosted on this server for signs of degradation in service delivery, or choose a different server to host your POA.
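Averaging a few runs and comparing the result against the baseline can be done with a couple of helper functions. This is only a sketch: the names average and meets_baseline are hypothetical, and you supply the MB/s figures that dd printed for each run.

```shell
# average N1 N2 ... -> prints the arithmetic mean with one decimal place
average() {
    [ "$#" -gt 0 ] || return 1
    printf '%s\n' "$@" |
        awk '{ sum += $1 } END { printf "%.1f\n", sum / NR }'
}

# meets_baseline VALUE MINIMUM -> exit status 0 when VALUE >= MINIMUM
meets_baseline() {
    awk -v v="$1" -v min="$2" 'BEGIN { exit !(v >= min) }'
}
```

For example, with three made-up write results you could run avg=$(average 212.4 198.7 205.1) and then meets_baseline "$avg" 200 || echo "write throughput below baseline". Keep the averages from a known-good day so you have something to compare against when the call comes.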
Sample results:
smoring@slowpoke:~/Desktop> sync;time -p dd if=/dev/zero of=/media/nss/GW/testfile bs=1024k count=10000
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 422.616 s, 24.8 MB/s
real 422.62
user 0.02
sys 13.64
As you can see from this example, the disk throughput during the sustained write test, 24.8 MB/s, is far below 200 MB/s. A well-performing GroupWise system requires enterprise-class disk throughput; without it, you'll be taking evasive action when the CxO comes looking for you...
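The MB/s figure dd reports is simply bytes copied divided by elapsed seconds, in decimal megabytes. A minimal sketch that reproduces the number from the sample run (the helper name throughput_mbs is hypothetical):

```shell
# throughput_mbs BYTES SECONDS -> prints decimal megabytes per second
throughput_mbs() {
    awk -v b="$1" -v s="$2" 'BEGIN { printf "%.1f\n", b / s / 1000000 }'
}

# Figures taken from the sample run: 10485760000 bytes in 422.616 s
throughput_mbs 10485760000 422.616   # prints 24.8
```

This is handy if you only kept the byte count and the "real" time from an older run and want to compare it with newer results.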
Good point on disk frag under Windows. The installation I worked with started slowing down 6-12 months after migrating to Windows (from NetWare). We moved from 7200 rpm disk (on SAN) to 15K and that helped a bit. However, the ultimate fix was to install a third-party defrag tool that would use "free" CPU/disk cycles to keep the disk space unfragmented.
Note: we also found that too many hosts sharing the same physical SAN disk can hurt performance as well. We wound up putting POAs and MTAs on their own set of disks on the SAN to avoid disk contention.
Linux file systems (like many modern file systems) use system memory to cache file writes until the available memory is exhausted; after that, the writes must be committed to the disk IO subsystem. Since this test reads zeros from memory and writes them to a file many times, the data goes to cache first and then, once cache memory is exhausted, to disk with no read-time delay. The "sync" at the beginning of the command string forces any previously cached data to be written to disk, leaving our "dd" command with a clean slate for the test. It is true that the system cache will skew the test results, making a machine with oodles of memory appear much faster than one that is starved. Adjusting the amount of data written to at least double the memory available on the server will show a much more consistent and accurate number, because it is much faster to write the zeros from memory to memory (cache) than to write them to disk across even the fastest SAN or disk system.
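One way to size the test at double the server's memory is to derive the dd count from MemTotal in /proc/meminfo. This is a sketch under the assumption that the block size stays at 1024k (1 MiB) as in the original command; the function name count_for_double_ram is made up.

```shell
# count_for_double_ram [MEMINFO_FILE]
# Prints the number of 1 MiB blocks that add up to twice MemTotal.
# MemTotal is reported in kB (KiB), so kB * 2 / 1024 gives MiB.
count_for_double_ram() {
    awk '/^MemTotal:/ { print int($2 * 2 / 1024) }' "${1:-/proc/meminfo}"
}
```

On a Linux server you could then pass the result as the count, for example count=$(count_for_double_ram) followed by the write test with count="$count". A server with 8 GiB of RAM would get a count of 16384, i.e. a 16 GiB test file, so make sure the volume has that much free space.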
Read operations, on the other hand, rely almost entirely on the connection to disk and will only occur at the maximum sustainable speed of the IO subsystem (the slowest/weakest link in the chain). In our test we're reading the file from disk and writing it to /dev/null (the bitbucket).
If you are writing to a SAN or cache enabled array controller, you can bias the read/write ratio to match your traffic. It's possible that your SAN administrator has biased the system one way or the other.
The answer to your question on Windows-based tools is "it depends". Many tools emulate reads and writes, but some will actually let you point to a directory structure and "exercise the hard drives" by copying those files to and from the server's disk. A simple way to do this is with a stopwatch and a few large files. It should give you a pretty decent idea of what your sustained read throughput is; what it won't show is your sustained write throughput, because you'd be reading a file and writing it at the same time.
I have always been hesitant to host a GroupWise system on Windows due to fragmentation issues, which are exacerbated by how GroupWise reads and writes many small transient files and then deletes them. This will rapidly degrade performance on a file system ill-equipped to handle fragmentation. Again, these are my opinions and results from years of experience; your mileage will vary.
I tried running the read and write tests on a SLES 11 SP1 x64 server running as a vSphere 4.0 VM connected to an EMC SAN and got pretty bad numbers. On average after 3 runs I was seeing 92.4 MB/s for writes and 71 MB/s for reads! Why would reads be slower than writes?
Also, do you know of an equivalent way to test a Windows GW server so I could compare performance of Windows to Linux?