I get asked regularly about High Availability for Novell Identity Manager implementations. (Most recently at a Sentinel training session, hi Sanath!) I had been meaning to write something about this for a while, and this request finally pushed me into doing it.
Once a site gets Identity Manager configured and running, almost the next question is how to make it Highly Available (HA) and how to fit it into Disaster Recovery planning.
It used to be that there was not really a great answer, and in fact there still is no single great answer, but rather a series of answers, some better than others.
As it turns out, over the years Novell Identity Manager has moved from being just about event synchronization to much more, with other components that are just as important.
Some of the newer features in Identity Manager are more user facing, like the User Application. The User Application (User App for short) is the place where your users actually interact on their own with the Identity Manager implementation.
The simplest install case may use the User App as a white pages tool, for password resets, and for the Forgotten Password interface. This is a great tool, saving lots of Helpdesk staff time, since a user can just go to a link, answer some personal questions they filled in beforehand, and get a password reset done.
If you are licensed for the Provisioning (as it was called in Identity Manager 3.0 and 3.5) or Role Based Provisioning use of the User App, then you may have implemented a number of workflows, allowing much more self service. For one client (Hey Marc and Ray!) we implemented a hugely complex self-service page: once the user logs in to the User App, based on what entitlements they have, they get a nice dynamic web page that shows them only the services they are allowed to request, and validates everything against information in eDirectory.
Once they filled out the form, it generated a workflow for each request, which created a work order in their helpdesk system; once approved, the request went back into Identity Manager to be processed.
Well, that is VERY customer facing, so high availability is really important, and making sure it can survive a disaster is pretty important as well.
The good news here is that the design of the User Application takes this into account. The application has two basic components: the JBoss Application Server, which runs the interface and the application logic, and a database that stores workflows while they exist and are running. The default install uses MySQL, which you should be able to cluster, but Microsoft SQL Server 2005 SP1 and Oracle 10g are also supported. Those products have their own internal clustering implementations, and if you already have them as part of your regular High Availability or Disaster Recovery planning, this should be an easy step for you.
The JBoss application server is designed to support clustering, so when you install the User Application you can simply choose a clustered install, follow the documentation, and get several nodes running it, all pointing at your clustered database.
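As a rough illustration (a hedged sketch, not from the product documentation: the install path and partition name are placeholders), JBoss 4.x era servers ship an "all" server configuration with clustering enabled, and nodes started with the same partition name on the same subnet find each other:

```shell
# Hypothetical sketch: starting a JBoss 4.x node with the clustered
# "all" configuration.  Path and partition name are placeholders --
# the User App installer may create its own server configuration.
/opt/jboss/bin/run.sh -c all -b 0.0.0.0 \
    -Djboss.partition.name=UserAppCluster   # run the same on each node
```

Check the User Application install guide for the exact server configuration name it creates; the point is only that each node joins the same named partition.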
The User Application can also be deployed on IBM's WebSphere application server, as well as a WebLogic application server. I had not realized that Oracle acquired BEA; I was about to say a BEA WebLogic server, but it turns out that should now be called an Oracle WebLogic application server. Who can keep up with all these purchases, mergers, etc.? Anyway, those two are higher end application servers, and I have not dealt with them enough to know how they handle clustering, other than to be confident that they must do it somehow!
One issue that comes up has to do with the eDirectory tree that the User Application will be talking to. Within the User App you can use an Integration Activity, which is where modifications to users or other objects in eDirectory can occur. As far as I know, that is just a standard LDAP connection. In that case, when configuring the User Application, you should be able to point at a DNS name that is either a round robin entry pointing at several LDAP servers, or a group of LDAP servers behind a real load balancer. That too should already be part of your HA and DR planning, so a simple configuration change should get you there as well.
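To make the idea concrete, here is a toy sketch (nothing product-specific; the host names and the connect function are invented): whether you use DNS round robin or a load balancer, the effect for the client is "try the listed hosts until one answers," which you can model in a few lines:

```python
# Hypothetical sketch of client-side LDAP failover: try each host in
# order and use the first one that accepts a connection.  Hosts and
# connect_fn are placeholders, not real infrastructure.
def first_reachable(hosts, connect_fn):
    """Return (host, connection) for the first host that connects."""
    last_error = None
    for host in hosts:
        try:
            return host, connect_fn(host)
        except ConnectionError as err:
            last_error = err          # remember why this host failed
    raise ConnectionError(f"no LDAP host reachable: {last_error}")

# Simulated connect function: only ldap2 is "up" in this example.
def fake_connect(host):
    if host != "ldap2.example.com":
        raise ConnectionError(f"{host} down")
    return f"connection-to-{host}"

host, conn = first_reachable(
    ["ldap1.example.com", "ldap2.example.com"], fake_connect)
print(host)   # ldap2.example.com
```

A real load balancer or round robin DNS entry does this for you below the application layer, which is why a single configured DNS name is usually enough.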
eDirectory High Availability and Disaster Recovery is pretty easy most of the time. Just add more servers with replicas. eDirectory is really nice this way as it runs on so many platforms (Netware, Windows (2000, 2003, 2008), Solaris, Red Hat, SUSE Linux, AIX, HPUX (Though it looks like HPUX may be getting dropped since it is basically a dead OS now)) you can just add a replica to any box in the data center that has some spare capacity.
The only real issue is to be sure you get sufficient replica coverage. By that I mean that all the objects in the tree that you care about are replicated elsewhere. In larger installations, the tree is usually partitioned, and few servers hold replicas of every object in the tree. But in the Disaster Recovery site, you probably will be doing that. Or else you may decide to have a couple of servers out there hosting the various partitions that are needed.
Clustering eDirectory is usually not done, since it is so easy to just add another server, add replicas to it, and load balance across them. This is usually significantly simpler than most clustering solutions. On Netware, because of eDirectory's tight integration into the operating system, it is basically not possible to cluster eDirectory.
On other platforms, where eDirectory is basically just an application, it is pretty straightforward to cluster. The main issue to address is the NICI keys for the tree, and making sure that they too are clustered, as well as the application binaries and the database itself. There is documentation out there for clustering on Linux, and on Windows which I referenced at the beginning of the article.
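As a hedged sketch of what that means in practice (all paths here are assumptions; real defaults vary by eDirectory version and install options, so verify against your own install and the clustering documentation): the database and the NICI key material need to follow the clustered disk, which is often done by relocating them and leaving symlinks behind.

```shell
# Hedged sketch: relocate a directory onto shared storage and leave a
# symlink behind, so whichever cluster node owns the shared disk sees
# the same eDirectory database (DIB) and NICI keys.  All paths below
# are placeholders, not verified defaults.
relocate_to_shared() {
    local src="$1" shared="$2"
    mv "$src" "$shared/$(basename "$src")"
    ln -s "$shared/$(basename "$src")" "$src"
}

# Typical candidates (assumed locations, check your install):
#   relocate_to_shared /var/opt/novell/eDirectory/data/dib /mnt/cluster/edir
#   relocate_to_shared /var/opt/novell/nici                /mnt/cluster/edir
```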
Identity Manager is mostly a Java based application that runs in the eDirectory process space. Thus once you have eDirectory clustered, you at least have the engine clustered.
Here comes the part that is not so great.
Even though just clustering eDirectory technically does cluster the Identity Manager engine, since it is running in the eDirectory memory space, that may or may not be all that helpful!
Identity Manager is an event based system. The component that catches events as they happen in eDirectory and stores them in the .TAO event cache file is dxevent (.NLM on Netware, .so on Linux and Unix variants, and .dlm on Windows), and it is very important. (Hey, any Canadians ever listen to "As it Happens" on CBC radio? I still remember it as 740 AM, but I know in Toronto they moved to an FM channel a few years back. I used to quite enjoy that show. Wonder if I can get it here in the US. Maybe they stream it!) See these two articles for more details on the TAO file, how Identity Manager handles caching events, and how to manage those caches: Managing the eDirectory Database Cache and Driver Cache Stats in the IDM 3.6 iManager Plugins. The thing is that dxevent is a per-server process, and the events it monitors are only those that occur in the replicas that server holds. Additionally, if the TAO file cache gets backed up, because events are coming in faster than the engine can process them, and something causes a cluster failover, the events in the local TAO file will probably not come across to the other node. The cluster might be able to handle this, since the TAO files are stored in the eDirectory DIB directory, which should be part of what is clustered; if so, you might be able to get past that aspect of it. But it is something important to pay attention to, confirm, and understand.
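To make that failure mode concrete, here is a toy simulation (nothing here is product code; the class and event names are invented). Each node keeps its own event cache: if the cache lives on node-local disk, a failover silently drops whatever was still queued, while a cache on the shared DIB volume can be drained by the new node.

```python
# Toy model of a per-server event cache (like the .TAO file).  Nothing
# here is product code; it only illustrates what is lost on failover.
class Node:
    def __init__(self, name, shared_cache=None):
        # shared_cache simulates a cache file on clustered storage;
        # if None, the cache is node-local and dies with the node.
        self.name = name
        self.cache = shared_cache if shared_cache is not None else []

    def catch_event(self, event):
        self.cache.append(event)      # dxevent writes to the cache

    def drain(self):
        drained, self.cache[:] = list(self.cache), []
        return drained

# Node-local cache: events queued on node1 vanish at failover.
node1 = Node("node1")
node1.catch_event("add user alice")
node2 = Node("node2")                 # fails over with an empty cache
print(node2.drain())                  # []

# Shared cache on the clustered DIB volume: node2 can drain it.
shared = []
node1 = Node("node1", shared_cache=shared)
node1.catch_event("add user alice")
node2 = Node("node2", shared_cache=shared)
print(node2.drain())                  # ['add user alice']
```

The real question to confirm is exactly the one the toy model poses: does your cluster configuration put the TAO files on the storage that fails over?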
One thing in our favour is that Remote Loaders are rarely configured to only accept clients from one host. Usually they just listen for any connection on the correct port, with the correct credentials. Though I recently saw someone post a Remote Loader configuration file that had the option fromaddress= some IP address. I was not aware that was a valid option for the Remote Loaders, but I have not been able to confirm it either way yet.
This means that if the Remote Loader device is clustered, it will probably mostly continue to work. As usual, however, the devil is in the details. The details in this case relate to how the specific driver shims running in the Remote Loader work. For example, with the Active Directory driver, the basic functions of replicating events in and out (the Publisher and Subscriber channels) would probably work fine if the Remote Loader were clustered or manually failed over, but one part would not: synchronizing passwords out of Active Directory on the Publisher channel. The reason for this has to do with the specifics of how the Active Directory driver catches password changes. There is a Password Filter, using Microsoft's password API, running on each Domain Controller (DC) in the domain. If a password change event is recorded via one of those Domain Controllers, the Password Sync Filter catches it, encrypts it, and forwards it to the server running the Remote Loader and the Password Synchronization applet. This accepts the password and passes it on to the Remote Loader to send back to eDirectory.
In order to fail over the Remote Loader for Active Directory, you would need to get all the Domain Controllers to change their Password Synchronization Filters to point at the new server running the Remote Loader. Alas, this takes a reboot of each Domain Controller.
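A toy sketch of why this is painful (all names invented, nothing here is product code): each DC's password filter holds its own statically configured target, so failing over the Remote Loader host means touching, and rebooting, every single DC.

```python
# Toy model: each Domain Controller's password filter points at one
# statically configured Remote Loader host.  All names are invented.
class DomainController:
    def __init__(self, name, sync_target):
        self.name = name
        self.sync_target = sync_target
        self.needs_reboot = False

    def repoint(self, new_target):
        # Changing the filter's target only takes effect after a reboot.
        self.sync_target = new_target
        self.needs_reboot = True

dcs = [DomainController(f"dc{i}", "rl-old.example.com") for i in range(3)]

# Fail the Remote Loader over to a new host: every DC must be touched.
for dc in dcs:
    dc.repoint("rl-new.example.com")

print(all(dc.needs_reboot for dc in dcs))   # True
```

Contrast this with the engine side of the connection, where the Remote Loader simply listens for any client with the right credentials.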
I have not investigated whether the Password Synchronization applet, along with the Remote Loader, can be successfully clustered, but these are the things you would need to look at. For a cold failover, the same set of issues applies.
Other Remote Loaders will have similar issues, where there is something special about the specific server running the Remote Loader, and some extra piece needs to be properly configured to fail over to a new server. It would be worth working through the options and collating the various issues to share. Anyone up for the task?
Another piece of good news is that, the way the SSL configuration is done, this aspect should not be an issue. Usually the Remote Loader wants a copy of the Organizational Certificate Authority's public key (the Trusted Root certificate) in the Java keystore. Any certificate signed by this Org CA will be accepted, since they are all signed with the private key, and the Org CA's public key is sufficient to prove they are valid.
The User Application is the front end that most users will see, so that is the piece that needs to be highly available, precisely because users see it. The rest of Identity Manager is mostly a behind-the-scenes sort of thing.
The Remote Loaders cache events locally if they cannot talk to the Identity Manager engine, so when the engine comes back, the events get played back in the correct order.
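The store-and-forward behaviour described above can be sketched as a toy queue (invented names, not product code): events are always queued first, and drained in FIFO order once the engine is reachable, which is what preserves ordering across an outage.

```python
# Toy store-and-forward queue, like a Remote Loader caching events while
# the engine is unreachable and replaying them in order once it is back.
from collections import deque

class StoreAndForward:
    def __init__(self, send):
        self.send = send              # callable that delivers one event
        self.queue = deque()
        self.engine_up = False

    def submit(self, event):
        self.queue.append(event)      # always queue first (FIFO order)
        if self.engine_up:
            self.flush()

    def flush(self):
        while self.queue:
            self.send(self.queue.popleft())

delivered = []
rl = StoreAndForward(delivered.append)
rl.submit("event-1")                  # engine down: cached locally
rl.submit("event-2")
rl.engine_up = True                   # engine comes back
rl.flush()
print(delivered)                      # ['event-1', 'event-2']
```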
eDirectory events will get queued up for synchronization to the replica that is down, and when it comes back up the events will get sent over, and the dxevent module local to that server will process them and get them back into production.
With the exception of passwords, most end users do not care about, nor actually notice, most Identity Manager back end events. In reality, Human Resources is usually so far behind the curve that the information they enter is rarely time-critical by the time it arrives.
None of this is to say that the system should not be maintained, or that you should not strive for high availability, but rather to try to put some context around the issue.
This is a big and complex topic, and I would love to hear feedback! If you have strong feelings either way, write them up and post them as an article! Come on, contribute! I would love to hear what people have to say about this, and actually, anything I write!