High Performance Data Extract Driver for Identity Manager
Novell Identity Manager is great. It allows us to connect very disparate systems together, making for a more "agile" and productive environment. One of the side effects of working with Identity Manager is that you start to think in terms of event-driven changes to the system. When X happens in application [this], then we need Y to happen over here in application [that]. Oh, and because Y changed in [that], we need a workflow to fire off for approval to Y's manager, which then allows Z to be provisioned for them over in [the other]. The more time you spend doing this, the more systems you work with, and the more interesting the mesh of applications and interactions gets.
Then it happens. You run into the application that doesn't have an IDM driver available. Worse, it doesn't have any sort of Application Programming Interface (API). All it offers is the ability to import a CSV file of data to build its own internal database. Ugh. Ok, well, it won't be as cool as doing live updates via a dedicated driver, but we can still integrate it into our system via the IDM DelimText driver. That's ok.
As you start developing this, eventually you find that not only does this application lack an API, it was written by programmers that think small. They do not see any point in having a way to update their database, even with CSV file imports. Nope. It is easier to just drop and rebuild the whole database. So what they want you to produce is a CSV file of "everybody", every day. For one change, they will just drop the whole database and rebuild it from your CSV file. Even if *nothing* changes, they'll drop and rebuild it anyway. Usually the justification for this is "well, that way we're sure that what's in the database is right", or something along those lines. It's ugly, bad programming, but it's all too common. And it doesn't scale well at all.
But, from the Identity Management point of view, it's not our place to critique the systems we have to work with, only to find ways to make them work. So, we set off down the path of making a DelimText driver that exports everybody, every day.
Usually these applications don't really need "everybody"; what they need is some subset of "everybody". As with any IDM driver development project, the really interesting part is in the discussions of who to provision into this application, when, and when and how to deprovision them from it. Getting a handle on that makes everything else possible, and it is the key to the remainder of this article. A good definition of "who to include" also tells us, by its very nature, who *not* to include. Since this is a drop-all-and-rebuild-every-day application, it does not need a deprovision step: we just stop including the information for deprovisioned users, and in the next day's database rebuild they will not be recreated.
Defining the Problem
Just as with any other IDM driver, the first steps are to determine what it is that this driver is supposed to do, and when. Despite the fact that they want a "CSV file of everybody" to come out of it, you still have to think about which objects should be included, and which excluded. This is your "scope". For the sake of example, an application to be used by "managers" might only need to include User objects whose Title attribute is "Manager". We'll proceed using that as our scope definition.
The simplest, easiest, and most obvious way to approach this problem, the First Draft, is with a Job (http://www.novell.com/documentation/idm36/idm_jobs/?page=/documentation/idm36/idm_jobs/data/front.html) that triggers against the OU=Users container, all objects. Schedule the job to run daily. Then do the rest in Policy Builder. In the Subscriber Event Transform, a Scope Filter policy like:
<comment xml:space="preserve">Filter out people who are not Managers.</comment>
<if-class-name mode="nocase" op="equal">User</if-class-name>
<if-attr mode="nocase" name="Title" op="not-equal">Manager</if-attr>
will work. Your job runs, all User objects get triggered, your policy filters out non-Managers by checking the Title, and a bunch of CSV files pop out into your driver's output directory.
But there is a problem. You only have a couple of dozen managers, but you have 10,000 Users. Your driver Job is triggering 10,000 events, but you are really only interested in about 20 of them. This works, but it is inefficient: nearly every event is processed only to be thrown away. Maybe it is good enough. But if you try to scale this up to 100,000 Users, or 1,000,000 Users, it quickly gets out of control as your job takes longer and longer to run to completion.
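To put numbers on that, here is a toy model of the First Draft in Python. It is only a sketch, not IDM code: the `User` class, `build_directory`, and `first_draft` are hypothetical names invented for illustration, and the "Staff" title for non-managers is an assumption. It shows that the work done grows with the total number of Users, not with the number of Managers you actually want.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    """Hypothetical stand-in for a User object in the Identity Vault."""
    cn: str
    title: str


def build_directory(total_users, manager_count):
    """Build a toy directory where the first manager_count users are Managers."""
    return [
        User(cn=f"user{i}", title="Manager" if i < manager_count else "Staff")
        for i in range(total_users)
    ]


def first_draft(directory):
    """Model the First Draft: one event per User, filtered afterwards in policy."""
    events_processed = 0
    kept = []
    for user in directory:
        events_processed += 1  # the Job submits an event for every User
        if user.title.lower() != "manager":  # case-insensitive, like mode="nocase"
            continue  # the event is filtered out after it was already processed
        kept.append(user)
    return events_processed, kept


directory = build_directory(total_users=10_000, manager_count=20)
events, managers = first_draft(directory)
print(f"{events} events processed, {len(managers)} rows kept")
```

Change `total_users` to 1,000,000 and the number of rows kept stays at 20, while the number of events the driver has to chew through grows a hundredfold.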
Another approach, the Second Draft, is to use a Job, but have it trigger only one event. Using that event, on the Subscriber Event Transform, a Search For Objects policy like:
<description>Search for Managers</description>
<comment xml:space="preserve">Using Title, select Managers for further processing.</comment>
<if-operation mode="case" op="equal">trigger</if-operation>
<do-set-local-variable name="Managers" scope="policy">
<token-query class-name="User" datastore="src">
can be used to search for the Users you want to include in the output. This is somewhat more efficient, and faster, than the First Draft. It scales better, too, but it is still not the best. For some reason, the IDM search function is not as fast as an LDAP search. With the right selection criteria and indexing, an LDAP search is much faster. LDAP searches also offer better control, via search filters that can be extremely complicated.
Having explained how not to do this, it is time to bring this driver up to its best possible performance level with a Third Draft.
In your Identity Vault, create a Dynamic Group object (http://www.novell.com/documentation/edir88/edir88/?page=/documentation/edir88/edir88/data/fbabihje.html). This special type of eDirectory object looks, for all intents and purposes, like a Group. But it differs from Group objects in that it doesn't have any actual Members (it can, but ignore that for now). What it has is an LDAP style search filter that defines what the Members should be. When queried for members, eDirectory generates the list on the fly, using the LDAP search filter to do it. And more importantly, it uses the same type of search that an LDAP search would use, which has been tuned to be as fast as possible.
For our example, an LDAP search filter like '(&(objectclass=user)(title=Manager))' is needed. Notice that this is essentially the same logic as we used in the Scope Filter policy in the First Draft; it is just expressed in LDAP search terms instead of Policy Builder logic. For more complicated searches, it may take a while to define the LDAP search filter necessary to select the correct objects, but it can be built in stages, testing each stage until the filter is found to be correct.
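Building a filter in stages is easier with a couple of small helpers. The Python sketch below is one way to do it; the helper names (`escape_filter_value`, `equals`, `all_of`, `member_query_url`) are hypothetical, and the base DN `ou=Users,o=example` is made up for illustration. The last helper formats the filter as an LDAP URL of the kind that, as far as I know, a Dynamic Group stores in its `memberQueryURL` attribute; treat that attribute name and URL layout as an assumption and check it against the eDirectory documentation for your version.

```python
def escape_filter_value(value):
    """Escape the characters that RFC 4515 reserves inside filter values."""
    replacements = {
        "\\": "\\5c",
        "*": "\\2a",
        "(": "\\28",
        ")": "\\29",
        "\0": "\\00",
    }
    return "".join(replacements.get(ch, ch) for ch in value)


def equals(attribute, value):
    """Build a single equality clause such as (title=Manager)."""
    return f"({attribute}={escape_filter_value(value)})"


def all_of(*clauses):
    """Combine clauses with the LDAP AND operator."""
    return "(&" + "".join(clauses) + ")"


def member_query_url(base_dn, search_filter, scope="sub"):
    """Format an LDAP URL with an empty attribute list: ldap:///base??scope?filter."""
    return f"ldap:///{base_dn}??{scope}?{search_filter}"


# Stage 1: select every User object and check the result.
stage_one = equals("objectclass", "user")

# Stage 2: narrow the selection to Managers.
search_filter = all_of(stage_one, equals("title", "Manager"))
print(search_filter)

BASE_DN = "ou=Users,o=example"  # hypothetical container DN
print(member_query_url(BASE_DN, search_filter))
```

Test each stage with an ordinary LDAP search before moving on to the next, and only put the finished filter on the Dynamic Group once it returns exactly the objects you expect.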
Remove the Scope Filter policy from the First Draft, and the Search For Objects policy from the Second Draft. They will no longer be needed.
Recreate the trigger Job definition. This will be similar to the one created in the First Draft, since you want it to submit a <trigger> event for every object that it finds in its Scope. But, instead of using the OU=Users container, use the Dynamic Group object you just created.
Now, when your Job runs, it queries eDirectory for the membership of this Dynamic Group. eDirectory goes off and tears through its database and returns a list of members matching the LDAP search filter. The IDM job then submits a <trigger> document for each object that was returned in scope. And out the other side of your driver pops a set of CSV files. Some quick adjustments to the XSLT stylesheet on the Subscriber Event Transform, the Schema Map, and the XSLT stylesheet on the Subscriber Output Transform are all that is needed to create whatever CSV file your application needs.
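The end of that pipeline is easy to picture outside of IDM. The Python sketch below is not how the DelimText driver is implemented; it only models the last step, turning the records for the group's members into CSV text. The field names, the sample people, and the `members_to_csv` function are all hypothetical, standing in for whatever columns your application asks for.

```python
import csv
import io

# Hypothetical column layout; use whatever your application expects.
FIELDS = ["CN", "Given Name", "Surname", "Title"]


def members_to_csv(members):
    """Render one CSV row per member, preceded by a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for member in members:
        writer.writerow(member)
    return buffer.getvalue()


# Invented sample data standing in for the Dynamic Group's membership.
members = [
    {"CN": "jdoe", "Given Name": "Jane", "Surname": "Doe", "Title": "Manager"},
    {"CN": "rroe", "Given Name": "Richard", "Surname": "Roe", "Title": "Manager"},
]

csv_text = members_to_csv(members)
print(csv_text, end="")
```

Because only in-scope objects ever reach this step, nobody has to be removed from the output: a user who drops out of the group's membership simply never shows up in the next day's file.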
The overall operation of all three approaches here will be the same. You will get a set of CSV files with the requested User objects in them. Your application will get what it wants to import. But, as your Identity Vault increases in size, the First Draft will slow down dramatically, and the Second Draft will be better but will still take a long time to complete. The Third Draft method will be the fastest of the three. In my environment, First took many hours, Second took a couple of hours, and Third took 20 minutes. The difference for you may not be as dramatic, depending on your tree and your data. If you have one driver that takes eight hours to make a CSV file, you might not mind. But if you have two, or three, or a dozen of these types of applications, you eventually run out of hours in the day for a driver like the First Draft to run in. Setting them up to be as fast and efficient as possible pays off in a huge way as these badly behaved applications proliferate in your environment.