Due to instability with the web and printing code we had to segregate our cluster nodes into "File only" and "App" nodes. Three nodes were dedicated to hosting volumes, the other three were dedicated to the web and NDPS/iPrint services. The FTP and Zen services just worked and were not segregated. This suited us for several years.
Then, the sheer amount of data being used by the three file-serving nodes exceeded their NSS caches. The cache-hit percentages were typically in the 75%-85% range, and these were nodes with 4GB of RAM in them. Even though NetWare supports PAE memory access, it isn't fast. The conventional wisdom in the support forums is that you needed 12GB of RAM to see any improvements. These servers were old enough that adding that much memory wasn't really an option. So, something had to give.
Did I mention a major budget crunch was hitting about this time? Whatever I did had to be free to us.
We have two NDPS Managers, one for Student printing (heavily used), and one for Faculty/Staff printing (lightly used). These two couldn't cohabit, but we also needed to make sure the printing services were highly available. This meant I needed at minimum four nodes: two 2-node clusters.
As it happened, a pair of HP ProLiant BL20p G2 blades had just been freed up after we moved a certain Microsoft application to the ESX cluster. These blades had 2GB of RAM in them. This was plenty for what I needed to do, since print and web services generally aren't in the same class of memory pig as NSS.
I didn't have another two blades, so I couldn't go bare iron. What to do?
Since this had to be a free solution, purchasing a few more ESX licenses was out. As we were running NetWare 6.5 SP7, Xen paravirtualization was an option! Since we hadn't moved to SP8, this meant it had to be SLES10 SP1.
Since I didn't have either fibre cards for these servers or any iSCSI servers out there on the LAN, I had to use the local disks for the shared volume. With only mirrored 72GB drives to work with, I had to be careful with my partitioning when I installed SLES10-SP1. I reserved 12GB for the OS, and put the rest into two LVM containers. Since you need to use partitions for virtualized NetWare and not virtual-disk files, I had to create LVM Volumes for everything. One LVM Container for the SYS volumes, and the other Container to host the spool volumes for NDPS/iPrint.
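The layout above can be sketched as a small script that only prints the LVM commands rather than running them. The device paths, volume-group names, volume names, and sizes are all assumptions for illustration; the post gives only the 12GB root and the two-container split.

```python
# Sketch: print the LVM commands for a layout like the one described above.
# Every device path, name, and size here is an assumption, not the real setup.

def lvm_commands(vg, device, volumes):
    """Return shell commands that build one volume group and its logical volumes."""
    cmds = [f"pvcreate {device}", f"vgcreate {vg} {device}"]
    for name, size_gb in volumes:
        cmds.append(f"lvcreate -L {size_gb}G -n {name} {vg}")
    return cmds

layout = [
    # One group for the guests' SYS volumes...
    ("vg_sys", "/dev/cciss/c0d0p3", [("vm1_sys", 8), ("vm2_sys", 8)]),
    # ...and one for the shared NDPS/iPrint spool volume.
    ("vg_spool", "/dev/cciss/c0d0p4", [("cluster_spool", 20)]),
]

plan = [cmd for vg, dev, vols in layout for cmd in lvm_commands(vg, dev, vols)]
for cmd in plan:
    print(cmd)
```

Printing the plan first makes it easy to check that the volumes fit in the roughly 60GB left after the OS partition before touching the disks.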
With only 2GB of RAM to work with, and a tiny 12GB root partition, I had to be careful with what software packages I installed. I needed the Xen packages, of course, so those were added. But I also removed things these servers would never need, like Samba and Gnome. I also configured the boot-loader to launch in text-mode (runlevel 3) in order to save memory by not loading the X console.
Once I got the servers installed and booted, I went into YaST to change a few things. First off, load the Xen kernel by default. Second, turn off several services I don't need on a server of this type. This saved even more memory. I also turned off the ZMD service, as any patching will be done on a specific schedule and not done automatically. This step saved significant CPU cycles, but is one I only take in very specific circumstances.
First things first, I downloaded the NW65SP7 DVD ISO to the local file-system so I could use it for installs. As these were blade servers, they did not have any attached optical drives. What's more, I don't think it's possible to do an over-the-network install of NetWare the way I did the SLES install. Having the ISO on local disk sped up the installs once I got to that point.
Configuring the VMs took a bit of trial and error to see how much memory I could allocate to both VMs before things started getting crunchy on the host end of things. Right now, 818MB of RAM is allocated to each VM, and I can probably go a bit further if things end up not being stable. Each VM was given one virtual CPU, which brought the total virtual CPUs to 2, the same as the number of physical processors in the blade. For the disks, I gave the VMs their SYS partitions first, so I could get the base server installs done.
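As a rough illustration of those numbers, here is what the relevant part of one guest definition might look like. The xm config files under /etc/xen/vm/ use Python syntax, so the sketch below doubles as runnable Python. The guest name and disk path are assumptions, and the NetWare boot-loader settings are left out entirely.

```python
# Sketch of one guest definition in the Python-style syntax of xm config files.
# Boot-loader settings for NetWare are deliberately omitted.
name = "printnode1"                        # hypothetical guest name
memory = 818                               # MB of RAM per guest, from the text
vcpus = 1                                  # one virtual CPU per guest
disk = ["phy:/dev/vg_sys/vm1_sys,hda,w"]   # SYS volume; path and device are assumed

# Memory budget on a 2GB blade carrying two such guests.
HOST_MB = 2048
GUESTS = 2
dom0_mb = HOST_MB - GUESTS * memory
print(f"left for the SLES host: {dom0_mb} MB")
```

The real headroom is somewhat smaller than this subtraction suggests, since the hypervisor itself also takes a share of the 2GB.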
Installing went just fine. No problems at all. I used X over SSH to view the virtual console screens while working. I made my usual edits to the STARTUP and AUTOEXEC scripts to tweak things how I'd like them, as well as disabling AFP, CIFS, and NSS serving as these nodes would do none of that.
Once both servers had the products installed that I needed, NetStorage, NDPS/iPrint, and on one pair iManager/Tomcat5, I took both VMs down, changed their config files in /etc/xen/vm/ to add the Cluster LVM Volume I created when I installed the SLES server, and started both up. Both saw the storage. I initialized the storage through NSSMU, and flagged it sharable.
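The config change amounts to one extra entry in each guest's disk list. A minimal sketch, again in the Python syntax of the xm config files: both paths are assumed, and the `w!` mode is, as far as I know, the xm way of allowing a writable device to be attached to more than one guest at once.

```python
# Sketch: a guest's disk list before and after adding the shared cluster volume.
sys_disk = "phy:/dev/vg_sys/vm1_sys,hda,w"               # assumed path
shared_disk = "phy:/dev/vg_spool/cluster_spool,hdb,w!"   # assumed path; "w!" = shared write

disk = [sys_disk]          # what the guest was installed with
disk.append(shared_disk)   # added once the base install was done
print(disk)
```

The same shared entry goes into both guests' files, which is what lets each NetWare node see the volume that NSSMU then flags as sharable.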
Installing the cluster was pretty simple. I already had the IP addresses I needed allocated. This was done from an XP workstation, as that's how NCS on NetWare needs to be installed.
Once the cluster was installed, it was time to get services configured. Each 2-node cluster had three web services and NDPS/iPrint cluster services to create. Creating the NDPS/iPrint services took a bit of work until I figured out how to fine-tune the creation process to give it the object names I wanted. I went back and forth on that a few times until I nailed it.
I also took this opportunity to really clean up the web service files. As they exist on the 6-node cluster, they tend to get munched during NetWare service packs and require a good amount of hand-editing to undo changes Novell does during that process. Rather than fight that, I took a page from the SLES10-style of Apache config files, and broke out my changes into the conf/conf.d and conf/vhost.d directories. The master httpd.conf file has had two lines added to it:
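The two lines themselves are not shown here. As a sketch only, assuming they were ordinary Apache `Include` directives for the two directories named above, re-adding them after a service pack could be scripted like this. The stock config text is a stand-in, not the real file.

```python
# Sketch: re-add two Include lines to httpd.conf text if a service pack dropped them.
# The exact directives are an assumption; the post names only the directories.
INCLUDES = [
    "Include conf/conf.d/*.conf",
    "Include conf/vhost.d/*.conf",
]

def ensure_includes(conf_text):
    """Return conf_text with any missing Include line appended exactly once."""
    lines = conf_text.splitlines()
    present = {line.strip() for line in lines}
    for inc in INCLUDES:
        if inc not in present:
            lines.append(inc)
    return "\n".join(lines) + "\n"

# Stand-in for a freshly replaced master config.
stock = 'ServerRoot "SYS:/Apache2"\nListen 80\n'
patched = ensure_includes(stock)
print(patched)
```

Because the function is idempotent, running it against a config that already has both lines leaves the file unchanged.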
This greatly reduced the exposure we had to service-pack related screwups. It also, erm, made the config cleaner. Okay, a lot cleaner. Someone besides me could probably figure it out now.
Printing took a little bit of help to configure, especially on the Student side. We use A.N.D. Technologies PCounter for auditing our printing, and that HAD to work. Very happily for me, the possibility of this move had been thought about back when we had money, and licenses were purchased. I was able to get the new nodes licensed and installed.
Moving the web services was very easy. Stop the service on the 6-node cluster, start it up on the 2-node cluster. Since all three clusters were on the same subnet, the same IPs could be used. This was done during the day, without users being aware of it. It was great.
Printing took a bit more sweat. As it happened, Novell had very recently published a TID on how to do exactly that: TID7001772. Following that, I was able to get things moved over to their new homes. Copying the Broker directory was a real time-saver, as there are a lot of drivers in there.
One thing that wasn't mentioned in that document was the iPrint-specific config that needed changing. The directory structure under SYS:Apache2/htdocs/ippdocs/ needed copying as well in order to get the right iPrint client versions. The SYS:Apache2/htdocs/ippdocs/iprint.ini file also had to be copied from the 6-node cluster.
Also as part of this was generating a bunch of SSL certificates. NetStorage and iPrint both needed them, as did iManager. Our CA Manager was able to handle this with no problems. As one of the old certificates was within 60 days of expiring, it was due anyway.
The biggest complication was printers rejecting binds from the new IP addresses. Since NDPS uses the server's IP address for communication rather than the cluster service's IP address, the printers were suddenly seeing traffic from IPs they had never dealt with. A few of our printers had been hard-coded to only accept jobs from the two IPs of the server nodes that handled student printing on the 6-node cluster. These had to be modified to allow printing from the new nodes.
None of the iPrint users noticed any changes. If they looked closely they saw that the CA that was signing the iPrint certificate had changed, but authentication and printing just worked. The same for our NDPS users.
I forgot that NSS sets the cache balance percentage to 85% by default. Shortly after we moved printing over, the console started getting alerts about cache allocator errors. Since we do zero file serving from these nodes, I set the cache balance percentage to something really low like 20%.
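Some rough arithmetic shows why the default hurt on guests this small. The sketch below treats the cache balance as a share of the guest's whole 818MB allocation, which is a simplifying assumption and overstates what NSS would actually claim, but the proportions make the point.

```python
# Rough arithmetic only: treats the cache balance as a share of the guest's
# whole 818MB allocation, which overstates what NSS would really claim.
GUEST_MB = 818

def nss_cache_ceiling(balance_pct, total_mb=GUEST_MB):
    """Approximate upper bound on NSS cache, in MB, for a given percentage."""
    return total_mb * balance_pct / 100

for pct in (85, 20):
    ceiling = nss_cache_ceiling(pct)
    rest = GUEST_MB - ceiling
    print(f"{pct}%: up to ~{ceiling:.0f} MB cache, ~{rest:.0f} MB for everything else")
```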
Right now, both clusters-in-a-box have been running for over a month. To my surprise, iManager can run in this kind of restricted memory environment! Unfortunately, all it'll take is one memory leak to bring the node down, but that hasn't happened yet.
After all the services had been moved to the clusters-in-a-box, we realigned which services run on which nodes on the 6-node cluster. We tried to level each node so it was hosting volumes with around the same amount of space allocated. Each node hosts between 1.3 and 1.7TB of allocated space, with one node hosting a single volume that is 2.3TB in size. Since we moved the web and print services off, our cache hit percentage is now back in the 90s on most nodes.
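One simple way to approach that kind of leveling is a greedy pass: take volumes from largest to smallest and give each to the node carrying the least so far. This is a sketch of the general technique, not necessarily how the realignment was actually worked out, and the volume names and sizes are hypothetical.

```python
import heapq

def level_volumes(volumes_tb, node_names):
    """Greedy leveling: largest remaining volume goes to the least-loaded node."""
    heap = [(0.0, name) for name in node_names]   # (allocated TB, node)
    heapq.heapify(heap)
    placement = {name: [] for name in node_names}
    by_size = sorted(volumes_tb.items(), key=lambda kv: kv[1], reverse=True)
    for vol, size in by_size:
        load, name = heapq.heappop(heap)          # least-loaded node so far
        placement[name].append(vol)
        heapq.heappush(heap, (load + size, name))
    totals = {n: sum(volumes_tb[v] for v in vols) for n, vols in placement.items()}
    return placement, totals

# Hypothetical volume sizes in TB; only the 2.3TB volume comes from the text.
volumes = {"BIG": 2.3, "A": 0.9, "B": 0.9, "C": 0.8, "D": 0.8, "E": 0.8,
           "F": 0.7, "G": 0.7, "H": 0.7, "I": 0.6, "J": 0.6}
nodes = ["node1", "node2", "node3", "node4", "node5", "node6"]

placement, totals = level_volumes(volumes, nodes)
for node in nodes:
    print(node, placement[node], round(totals[node], 1))
```

With these made-up sizes the large volume ends up alone on its node, much like the 2.3TB volume described above, and the remaining nodes come out even.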