Consultant's Corner: Disaster Recovery Thoughts for GroupWise
"Burn baby, burn" ... so goes my mouth when I eat Thai food. I have even traveled to the source of Thai food - Thailand - and eaten the hottest of the hot there. It was during one of these feasts, with my mouth on fire, that I considered GroupWise on fire - not because it's popular (which we all know it is) but because of a disaster scenario. Since e-mail is electronic, you will not have the paper mess a fire can cause (just 1's and 0's spilled on the floor). But certainly there are other disaster types to consider.
In a recent article I discussed Highly Available/Fault tolerance versus High Availability/Fault Resilience (http://www.novell.com/coolsolutions/feature/17110.html). In this article, I want to drill down a bit and discuss a subset of Business Continuance for GroupWise - Disaster Recovery. Many organizations consider disaster recovery as their business continuance plan, but truly it is not. A disaster recovery (DR) plan is just that - a plan to recover "something" in the event of a disaster. Business continuance is about an organization continuing its business before, during and after a disaster. So how does all of this apply to GroupWise?
GroupWise is an organization's lifeblood, and as you have heard me say on many occasions, GroupWise is the most important application in an organization. The loss of a single e-mail could cost a company millions in lost revenue or lost lawsuits. Therefore, let's list a few items to consider in order to ensure GroupWise can survive a disaster or two.
I have discussed in many articles how to properly design GroupWise. When designed properly, GroupWise can survive some minimal disasters. Proper design can also reduce the number of affected users in a disaster situation. It's important to research and review your organization's current design and see how you can strengthen it.
Then there is the issue of the size of your GroupWise data stores. If you have a 300GB post office, it will take quite a while to restore it. And if there is a problem with the databases or data, maintenance procedures will also take a long time. Consider trimming your data stores, either by archiving or by policies that limit their size. I recently wrote an article for the Spring GroupWise Advisor magazine on e-mail policies. You can use it to help your CxOs make effective e-mail policies.
1. Consider and evaluate your organization's GroupWise design from a disaster point of view.
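To put a rough number on "quite a while," here is a minimal back-of-the-envelope sketch in Python. The throughput figures are purely hypothetical placeholders, not measurements of any backup product; substitute the sustained restore rate you actually observe in a test restore. The estimate also ignores database maintenance time after the restore, which adds to the total.

```python
def restore_hours(store_gb: float, throughput_mb_per_s: float) -> float:
    """Estimate hours needed to restore a data store at a sustained rate.

    store_gb            -- size of the post office data store in GB
    throughput_mb_per_s -- sustained restore rate in MB/s (an assumption;
                           measure your own environment)
    """
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    seconds = (store_gb * 1024) / throughput_mb_per_s
    return seconds / 3600


# Hypothetical sustained restore rates in MB/s, for illustration only.
for rate in (10, 25, 50):
    big = restore_hours(300, rate)
    trimmed = restore_hours(100, rate)
    print(f"{rate} MB/s: 300 GB takes {big:.1f} h, 100 GB takes {trimmed:.1f} h")
```

Restore time scales linearly with size here, so trimming a post office to a third of its size cuts the estimated restore window to a third as well.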
Consider hardware and the number of GroupWise components sitting on one server. For decades now, hardware has decreased in price and increased in reliability, yet we still have hard drives that fail and NICs that die. It's easy and cheap to guard against most hardware failures. For example, RAID 5 lets a server keep running through the failure of a single hard drive. NIC teaming covers the failure of a single network interface card. Multiple NICs, each connected to a separate backend switch, avoid problems from a switch port failure.
What about the loss of a motherboard? When was the last time that happened? It's a rare event, and there is no way to avoid it in a single-server or stand-alone-server environment. This is where placing too many GroupWise components on one server can come back to get you. Just because you can run three, four, or five post offices on one server doesn't mean you should, unless you want a single failure to affect a large portion of your user population.
Clustering also comes into play here. Clustering is cheap when compared to a day's lost revenue or productivity if an organization cannot get access to e-mail. I have personally seen a substantial increase in the number of organizations doing clustering specifically for GroupWise. Why? Because they "get it!"
2. Consider and evaluate the hardware GroupWise runs on.
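The "how many users does one dead server take down" question can be sketched in a few lines of Python. The server names, post office names, and user counts below are hypothetical; plug in your own inventory. This is only an illustration of the reasoning, not a GroupWise tool.

```python
from typing import Dict

# Hypothetical layout: server name -> {post office name -> user count}.
LAYOUT: Dict[str, Dict[str, int]] = {
    "server-a": {"PO1": 400, "PO2": 350, "PO3": 300, "PO4": 450},
    "server-b": {"PO5": 500},
}


def users_affected(layout: Dict[str, Dict[str, int]], failed_server: str) -> int:
    """Count the users whose post office lives on the failed server."""
    return sum(layout.get(failed_server, {}).values())


def share_affected(layout: Dict[str, Dict[str, int]], failed_server: str) -> float:
    """Fraction of all users who lose e-mail if the given server fails."""
    total = sum(sum(post_offices.values()) for post_offices in layout.values())
    return users_affected(layout, failed_server) / total if total else 0.0


for server in LAYOUT:
    print(f"{server}: {users_affected(LAYOUT, server)} users, "
          f"{share_affected(LAYOUT, server):.0%} of the organization")
```

In this made-up layout, the server carrying four post offices accounts for three quarters of the user population, which is the kind of concentration worth spotting before a motherboard fails.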
I don't know how many times I have designed a GroupWise system to sit on a cluster in a data center and gone to implement it, only to find the customer has yet to strengthen their data center. If a data center is set up and running fine today, will it still be running fine after we add a terabyte SAN, 10 cluster nodes, fibre switches and such, and flip on the switch? Most likely, no!
If your organization's data center is humming fine today, investigate a bit further. What is the average daily temperature? Are there enough electrical outlets? What about the amount of electricity being delivered to the data center by the electric company? I am not an electrician nor do I pretend to know that side of the IT business. But I do know that if I add a ton of electrical equipment, I may have brown-outs if current electrical circuits cannot handle it. Sometimes electric companies will do an analysis of the current and planned electrical load. And hardware manufacturers do provide detailed information on the electrical consumption of their products. Don't forget your data center's uninterruptible power supply - make sure it's updated as well.
Then there's the issue of heating/air conditioning. Now let's face it - heating is not an issue unless you live in the Antarctic. But AC is very important - I have seen servers shut down due to heat. Imagine spending $400,000 on a SAN only to have it overheat. How do you explain that to the CxOs? An HVAC person can also run an analysis of current and planned output.
3. Consider and evaluate Electrical/HVAC for the data center.
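Before the electrician and the HVAC person arrive, a quick tally of current and planned load is a useful conversation starter. The Python sketch below is only that: the equipment wattages and circuit capacity are hypothetical, and the 80% utilization ceiling is an assumed rule of thumb to confirm with a qualified electrician. The one hard fact used is the unit conversion: one watt of load produces roughly 3.412 BTU per hour of heat.

```python
from typing import Dict

WATTS_TO_BTU_PER_HOUR = 3.412  # 1 W of electrical load is about 3.412 BTU/h of heat


def load_summary(equipment_watts: Dict[str, float],
                 circuit_capacity_watts: float,
                 max_utilization: float = 0.8) -> Dict[str, float]:
    """Compare total equipment draw against a derated circuit budget.

    max_utilization is an assumed safety ceiling, not an electrical code citation.
    """
    total = sum(equipment_watts.values())
    budget = circuit_capacity_watts * max_utilization
    return {
        "total_watts": total,
        "budget_watts": budget,
        "headroom_watts": budget - total,  # negative means over budget
        "heat_btu_per_hour": total * WATTS_TO_BTU_PER_HOUR,
    }


# Hypothetical figures; take real ones from the manufacturers' spec sheets.
planned = {"existing servers": 3000, "new SAN": 1200, "new cluster nodes": 2500}
summary = load_summary(planned, circuit_capacity_watts=8000)
for name, value in summary.items():
    print(f"{name}: {value:,.0f}")
```

A negative headroom figure is the cue to talk to the electric company before the new hardware is racked, and the BTU figure is the number to hand to whoever sizes the air conditioning.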
What is the state of your data center? Does it have a flat roof? Is it in a flood plain? Is it in the middle of tornado alley? My main point here is the building itself. If the roof leaks, that is bad. Water and electrical equipment, surprisingly, do not mix. If you have a great seaside view from your data center window, I bet you have not considered flooding or salt air. Or what if you are in an area known for its dust or large airborne particulate matter? Recently, China was zapped with a nice big sand storm of sorts. They are going to seed clouds to produce rain to clear the sky. That is an extreme solution to a natural problem. Consider your data center building. Do a walk-around; look at the floor, the ceiling, the inside and the outside. You do the same when you purchase a home, so why not with your data center?
Then there is the security of the data center. As a consultant, I tend to spend much time in data centers. I find it a game these days just to see if I can get to places that I should not be able to reach without proper authorization. No, I am not a "sneaker," I just use a bit of human factoring. Here is an example: I'm on site and know the name of an IT director. Next thing you know, I walk to a locked data center, look at the closest person and ask them if they can let me in. Then I give them the story of how "so and so, the IT director" is still getting me my access card/number/etc. but just told me to have them open the door in the meantime. Funny, but it works. And sometimes, I will just walk in like I own the place, and no one asks questions. Check out security - into the building and into the data center room.
4. Consider and evaluate the building and security where the data center is located.
Let's suppose you have a safe, secure, well-designed, clustered GroupWise system. Do you need a backup? I guess the answer is - it depends. It depends on whether you want to recover data on a daily/weekly/monthly/yearly basis. It depends on whether you want to have data available in case of a disaster. Clustering is not a backup. There are many products that back up GroupWise and do it well. Just make sure you are getting consistency in the databases by using an open file manager and/or TSA.
There are many kinds of backups as well. Of course we are all aware of the nightly backups. But there are near-line backup/restores as well. These can be native in GroupWise with the "restore area" or with a 3rd party product like Reload by GWAVA. Just keep in mind they are great solutions and could be used as a backup solution, but they are not traditional backups.
5. Consider and evaluate your current backup solution.
Here is one that I bet you didn't think of - retention and archiving. How does it play into disaster planning? It fits into the "keeping data store sizes in GroupWise under control" area. I mentioned 300GB post offices in the Design/Size section of this article - let's discuss this a bit more here.
Implement email retention policies! Implement email box size restrictions! These two things will keep your GroupWise data stores lean. And then implement an enterprise-level archiving solution that stores email in a standards-based format. In my article on Open Source GroupWise, I discussed the need for standards that can survive the test of time. If you currently use native GroupWise archiving, what will you do with all those archives when GroupWise is dead and gone and you need to recover one for a court case? GWArchive is one such 3rd party application that stores archived GroupWise email in XML format. And you can access the archives from WebAccess.
6. Consider and evaluate your organization's retention and archiving policies and practice.
I am now going to go off the deep end. There is so much talk about virtualization these days - how it can save an organization money and hardware, and how easy it makes recovering from a failure. But think about this: if you have one server with four virtual NetWare sessions and the motherboard blows on the server, you lose all the sessions. Am I saying virtualization is not the answer? NO! But tread wisely; don't just run blindly into virtualizing everything. It adds a level of complexity and may not gain you much from a disaster recovery point of view.
7. Consider and evaluate virtualization for your organization.
How the heck do you protect GroupWise from a natural disaster? Answer: Business Continuity Clustering (BCC). I have spoken of BCC on many occasions. It's a cluster of clusters - live and active redundant data centers, SAN mirroring between the data centers, IDM to replicate cluster objects, and the possibility of zero data loss and 5 minutes of downtime in the event of a data center failure.
This is a costly solution. Depending upon how it's designed, it could run into millions. It's not a solution for all organizations, but for industries such as government, healthcare, and financial, it makes fiscal sense. If a tornado or flood hits your data center and it's the only data center you have, I hope your previous night's backup tapes are valid and stored off site. I also hope you have the hardware, electricity, network connectivity, and a location to do the restore. If not, business will not continue until you do. Honestly, what disasters can happen where your organization's data center is located? That is the key question.
Side note: I live in Indianapolis, Indiana - not a thriving metropolis by any means, but a good-sized city with few disaster potentials. Not long ago the one major disaster that could happen, did - a tornado. Along with the tornado came baseball-size hail. I was fortunate; my home is intact, but the siding was severely damaged by hail. I can live with that. But I do not have to live with tornados very often and I do not have to worry about floods or hurricanes. Earthquakes have happened, but the fault lines are so far away that the best we get is a small shake, and only once in a decade or so. Not to be gloom-and-doom here, but one thing that I do have to worry about more than anything is the VX nerve gas disposal business a mere 20 miles to the west. A nice tornado and a good westerly wind and most of Indianapolis could be exposed. My point: I know my area and its disaster potential.
8. Consider and evaluate your organization's need for a "hot and live" Business Continuity Cluster.
To quickly summarize, think about and evaluate every single part of your existing GroupWise system implementation, from design to hardware to building to disaster potential. I am not a disaster recovery specialist, but like you I have been in this business long enough to know - stuff happens. Anyway, with good planning maybe the only burning you will experience will be your "Disco Inferno" on a Saturday night out on the town!