
How Dstrace can affect Identity Manager performance


Identity Manager performance tips:



When you deploy Novell Identity Manager it usually starts out small, and quickly gets out of hand. Once you see the benefit of connecting authentication systems, it starts to make more and more sense to connect additional systems.

When you see the value of synchronizing passwords through those connected systems, making life easier for end users, technical staff, and the Helpdesk, it becomes more and more compelling to add more systems to the mix.

We already know from decades of use around the world, by hundreds of millions of users, that Novell eDirectory is clearly scalable and very performant. The example that always comes to mind is CNN, which runs eDirectory on Solaris and uses it to support their main web site, storing cookie-like artifacts about visiting users, and they get millions if not billions of hits a day. (I once heard a rough description of how it works, but do not think I can fairly reproduce it, so I will leave it at that.)

Personally, I used to work at a university with a 50,000-user-object eDirectory tree, with thousands of machines connected to it all day and close to half a million logins every month. Some governments use eDirectory to represent either employees or citizens, which gets into very large deployments of eDirectory, with millions and millions of objects.

Novell Identity Manager is based on eDirectory (in fact it runs as an application in eDirectory's memory space) and gains much of that scalability and performance just from being based on eDirectory. Where we run into trouble is when our Identity Manager systems get large enough that so many events are happening, some events fall behind and get stuck in the queue.

There are all sorts of tricks you can use to boost performance. Of course, you can always throw more hardware at it. eDirectory, and thus Identity Manager, can take advantage of multiple processors on modern hardware on many platforms. (Alas, as much as it pains me, not so much on NetWare, where it is not as multiprocessor-aware as I would like.) Memory is a limit at times: since eDirectory in its current releases is a 32-bit application, it can only use a maximum of 2 gigabytes of memory. With the coming release of eDirectory 8.8.3 and going forward, we will see 64-bit eDirectory support when running on 64-bit operating systems and 64-bit hardware (alas, also not on NetWare; I really like NetWare as an operating system and am very saddened to see its decline).

Until then we are limited, but in fact I do not usually find the 2 gigabyte limit all that limiting. I was shocked when I discovered that a test lab environment I was using had the default JVM heap size set. The default is 64 megabytes, and I did not run into it as an issue until I tried a query that returned a nodeset that was 115 megabytes in size, and finally ran out of memory.

Disk performance can help a lot, since eDirectory's major bottleneck has to do with writes to the directory. eDirectory can handle huge numbers of simultaneous read requests and does so very well. However, it is still limited to a single thread for write events, so if you are making large numbers of updates to the directory you can run into a bottleneck as the writes stack up.

Putting the DIB set (the Directory Information Base, a naming convention that comes from the X.500 days; nobody, except perhaps IBM, ever really developed a fully X.500-compliant directory. In fact, LDAP is the Lightweight Directory Access Protocol, where the L for Lightweight is meant to distinguish it from the very heavy DAP that X.500 specifies) on the fastest disk you have access to can help performance as well. Modern SAN storage devices, with modern RAID controllers and many spindles spinning at fifteen thousand RPM, like the fastest Fibre Channel or SAS disks, are very much faster at writes than any other device you may have encountered.

One of the most powerful features of eDirectory is its ability to replicate the directory to multiple servers. Technically eDirectory is a single-master directory, but the funny thing about that is that you can run without a master replica for long periods of time and never know it. At the university I used to work at, we once had a server problem that caused us to lose the master replica of a partition, and we ran for many months without noticing, since unless we were going to do partitioning or schema changes (which are very rare) we really did not need it. Admittedly obituaries piled up, but they cleared quickly once we fixed the issue.

You can have a single server with the entire tree on it. You can have dozens of servers, each holding whatever pieces of the directory you want, with multiple replicas on the same server. With eDirectory 8.8 and higher, you can run multiple instances of eDirectory on the same physical server. In that case, you could have one server running three instances of eDirectory, each holding multiple partitions and replicas. The flexibility is great and very useful. You can build the model that makes sense for your environment.

Identity Manager is powerful since it can react to events as they happen. With multiple replicas on many different physical boxes, it is possible to spread the load of events between many servers.

This does not always make sense, since you should not have two Identity Manager instances eventing on the same users in ways that overlap. However, there are many use cases where this can work. For example, one implementation I worked on had 25 drivers, and we ended up putting the two busiest drivers on one server and the remaining 23 on a second server, since the two busiest really were significantly busier than all the rest.

Look at the article Identity Manager Performance Example for some examples of how performant Identity Manager can be.

However, even with all the performance available, you still need to focus on writing efficient rules. In order to do that, it is important to know what is slow and what takes time. You need to know what slows things down, and whether there is a more efficient way to do it. I am working on a series on the use of XPATH in many interesting ways that can be very efficient (see: Some thoughts on XPATH in Novell Identity Manager and Cool tricks using XPATH on nodesets); the alternatives that perform the same task other ways can be horribly slow.
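
If you want to experiment with this kind of XPATH outside the engine, you can prototype expressions against a captured document using Python's lxml library. This is only a standalone sketch; the XDS-style input below is a simplified, made-up fragment, not a real trace capture.

# Prototype an XPath expression outside the engine (pip install lxml).
# The XDS-style input is a simplified, made-up fragment.
from lxml import etree

xds = etree.fromstring(b"""
<input>
  <add class-name="User" src-dn="\\TREE\\data\\users\\jdoe">
    <add-attr attr-name="Surname"><value>Doe</value></add-attr>
  </add>
  <add class-name="Group" src-dn="\\TREE\\data\\groups\\staff"/>
</input>
""")

# One XPath selects every User add event in a single pass,
# instead of walking the document node by node in policy.
for node in xds.xpath('//add[@class-name="User"]'):
    print(node.get('src-dn'))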

One issue that is really problematic is the use of Dstrace. Dstrace is indispensable in troubleshooting issues in Identity Manager. In fact, you pretty much cannot troubleshoot most problems in Identity Manager without the use of Dstrace.

For some background, look at the other Dstrace articles on this site.

However, Dstrace has a serious downside: it is a performance hog. I had been told this on numerous occasions, and I believed it, but in a development system you do not really have the option of turning it off, and in production, if it is off, it is impossible to see after the fact what happened when something went awry. When you are running in production and some event did not occur as planned, if Dstrace was not running you cannot go look and see what happened; you can only try to recreate the event and watch it with Dstrace on. If Dstrace is running all the time, you can search the logs and try to figure out what happened. That becomes a very compelling argument to leave Dstrace running all the time, assuming you have the disk space for it to store enough logs to be useful.

The desire to leave Dstrace running all the time is very hard to resist. The problem is that the engine has to take all the work it is doing in memory, with DOM objects that can be processed very efficiently, and convert it to human-readable XML and trace output. Then it has to write all of that out to disk as a final insult.

It turns out that just turning Dstrace on causes a large performance slowdown, even at level 1. It is the act of converting from efficient memory structures to human-readable XML that really eats the CPU cycles.
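
You can see the shape of that cost outside of Identity Manager with a toy Python timing: walking a DOM in memory is cheap, while serializing the same tree to human-readable text is comparatively expensive. This is purely illustrative and has nothing to do with the engine's internals.

# Toy illustration (not an IDM measurement): serializing an in-memory
# DOM to human-readable text costs real CPU, even if nothing reads it.
import time
from xml.etree import ElementTree as ET

# Build a modest in-memory document, roughly event-log shaped.
root = ET.Element('input')
for i in range(20000):
    ev = ET.SubElement(root, 'add', {'class-name': 'User'})
    attr = ET.SubElement(ev, 'add-attr', {'attr-name': 'Surname'})
    ET.SubElement(attr, 'value').text = 'user-%d' % i

# Walking the tree in memory is cheap...
start = time.perf_counter()
count = sum(1 for _ in root.iter('add'))
walk = time.perf_counter() - start

# ...serializing it to readable text is the expensive part.
start = time.perf_counter()
text = ET.tostring(root, encoding='unicode')
dump = time.perf_counter() - start

print('walked %d events in %.4fs; serialized %d chars in %.4fs'
      % (count, walk, len(text), dump))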

While I knew this, I never knew how much of a performance difference there was. At a client I was working with, we were using the JDBC driver to synchronize a database table of users that has about 8000 rows (that is, 8000 users). We were running the driver in triggerless mode, which means a state file is maintained (see: Identity Manager JDBC Driver State Files), and if anything happens to it, the driver will force a resync event.

A resync event for the 8000 rows of the database is painful: unfortunately, due to a poor choice of database design (which is outside our control to resolve), it takes a fair bit of work to synchronize one user, more time than we might like. Until that resync is completed, other events do not come through the driver, effectively paralyzing the system.

Recently I watched a resync event come through and decided to try turning off trace to see how much it helped.

If you are not aware, you can change the Dstrace level of a driver live, without requiring a driver restart for it to take effect. This is a very powerful thing: you can watch an event start in Dstrace and realize it will take a while, look at how long events are taking, open the driver object's properties in the Identity Manager snapins for iManager, and then on the Misc tab just drop the trace level to 0. Once you click Apply, the change takes effect immediately.
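
If you would rather script that change than click through iManager, the per-driver trace level is stored in the DirXML-DriverTraceLevel attribute on the driver object, so it can be set over LDAP as well. Here is a minimal sketch using the Python ldap3 library; the server address, DNs, and credentials are placeholders for your environment.

# Sketch: drop a driver's trace level to 0 over LDAP by writing the
# DirXML-DriverTraceLevel attribute on the driver object.
# Server, DNs, and credentials below are placeholders.
from ldap3 import Server, Connection, MODIFY_REPLACE

server = Server('ldaps://edir.example.com', use_ssl=True)
conn = Connection(server,
                  user='cn=admin,o=acme',
                  password='secret',
                  auto_bind=True)

driver_dn = 'cn=JDBC-Driver,cn=DriverSet,o=acme'
conn.modify(driver_dn,
            {'DirXML-DriverTraceLevel': [(MODIFY_REPLACE, ['0'])]})
print(conn.result['description'])  # 'success' if the change applied

conn.unbind()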

I was astounded. The resync event was taking over three seconds per user with Dstrace set to level 3. I turned the Dstrace level down to 0 and watched the success DirXML Log events stream by on the screen. When I looked at the timestamps, I saw it was taking about 100 to 200 milliseconds to resync one user.

That means at 3 seconds a user, for 8000 users, it would take 24,000 seconds, or 400 minutes (6 hours 40 minutes), to resync the entire database. At 100 ms a user, it would take about 800 seconds, or a little over 13 minutes.
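
As a quick sanity check of that math:

# Back-of-envelope check of the resync numbers above.
users = 8000
traced = users * 3.0   # seconds at ~3 s/user with trace at level 3
quiet = users * 0.1    # seconds at ~100 ms/user with trace off

print('trace on : %.0f s = %.0f min' % (traced, traced / 60))  # 24000 s = 400 min
print('trace off: %.0f s = %.1f min' % (quiet, quiet / 60))    # 800 s = 13.3 min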

Thirteen minutes versus over six hours is a HUGE difference. I knew the performance hit from Dstrace existed; I just never knew how to quantify it.

Now I find myself very torn. Running with Dstrace off is very fast. Running with it on is slower, but enables me to troubleshoot better. Which way to go? I am not sure.

The fact that trace level 1 and trace level 10 both generate the same base performance hit means that even a little tracing hurts a lot.

We could mitigate some of this by instrumenting the various error messages with generate-event tokens, which could send the errors to NSure Audit (which has basically been subsumed by the Sentinel product), and then use Sentinel to monitor the errors. Of course, the issue is that to see why things went wrong, you really need to read the trace carefully.

Alas, I suspect we will continue to suffer the performance hit of leaving Dstrace running, but take care, when we notice a resync event, to turn trace off until it has completed.

Now my next goal is to figure out how to automate that process of changing trace levels based on a resync event. If you have any ideas, please comment and let me know!

One issue is that there is no resync event showing up in trace that precedes the resync, at least as far as I can see; that is, there is no sync event in the trace as it happens. The events do not carry an attribute like from-merge, nor an event-id like app-migrateXXX that you would get when starting a migrate from eDirectory. Then, of course, there is the trick of automatically figuring out that the resync has completed, so as to re-enable trace.
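
Lacking any marker in the trace, the best idea I have is a heuristic: watch the trace output itself for a sustained burst of events, drop the trace level, wait out an estimated resync window, and then turn trace back on. Below is a rough Python sketch of that idea; the trace file path, thresholds, window, and event pattern are all made-up placeholders, and set_trace_level stands in for the ldap3 modify shown earlier.

# Heuristic sketch only: the trace has no explicit resync marker, so
# watch the driver's trace file for a burst of events, drop the trace
# level, wait out an estimated resync window, then restore it.
# Path, thresholds, and window are made-up placeholders.
import re
import time

TRACE_FILE = '/var/log/jdbc-driver-trace.log'  # placeholder path
BURST = 50                 # events per interval suggesting a resync
INTERVAL = 10              # seconds between checks
RESYNC_WINDOW = 20 * 60    # conservative guess at trace-off resync time

def set_trace_level(level):
    # In real use this would be the ldap3 modify of
    # DirXML-DriverTraceLevel shown earlier.
    print('--> setting trace level to %d' % level)

def watch():
    with open(TRACE_FILE, 'r', errors='replace') as f:
        f.seek(0, 2)  # start tailing from the end of the file
        while True:
            time.sleep(INTERVAL)
            events = len(re.findall(r'<(?:add|modify|sync)[ >]', f.read()))
            if events > BURST:
                set_trace_level(0)         # ride out the resync quickly
                time.sleep(RESYNC_WINDOW)  # trace is off, so wait it out
                set_trace_level(3)
                f.seek(0, 2)               # skip anything logged meanwhile

watch()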
