More thoughts on the size of a node set in Identity Manager

Novell Identity Manager is an identity management product, and much of the work it does happens in XML.

XML is an interesting language with many benefits, but it also has many downsides. One of them is that it is probably NOT the most efficient method of storing data. An excellent blog on the topic of programming, and of running a small software development company, is written by Joel Spolsky. I personally judge a blog on content and grammar. If you cannot write in comprehensible English, with punctuation and proper spelling, I will not read it more than once. Conversely, the most grammatically correct and elegant blog will lose my interest if the content is not interesting.

Joel Spolsky is both, and entertaining to boot. (Being a Canadian, I should point out that when I say "No doubt about it", it in NO way sounds like I said A-boot, no matter what my wife says!) If you are interested in these notions, programming as it applies to running a business and the travails of running a small programming firm, I highly recommend his blog.

Now why on earth am I bringing it up? Well, to give some context to the issue of XML as a language. Joel feels strongly that XML is not the most efficient way to store data, and that there are better ways. See his article on the subject for more of his thoughts. While you are there, browse around; I am sure you will enjoy the experience.

One of the things he brings up is that since XML is not a fixed-length-field (or at least not a predictable-length) data structure, it is very inefficient for code to search through a record. One of the ways Identity Manager (and, actually, pretty much any XML-consuming tool or API) tries to combat that performance hit is by storing the document in memory not as arbitrary text, but rather as a node set (usually in DOM, though there are other models). DOM (Document Object Model) stores it in a more efficient fashion than repeatedly parsing clear text.
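As a minimal illustration of the idea (using Python's standard xml.dom.minidom rather than the Identity Manager engine, with a made-up attribute fragment), parsing text into a DOM gives you a tree of node objects you can walk directly, instead of re-scanning raw text:

```python
from xml.dom import minidom

# A made-up XDS-style fragment; element names and values are illustrative only.
doc = minidom.parseString(
    '<attr attr-name="Surname">'
    '<value type="string">Smith</value>'
    '<value type="string">Smythe</value>'
    '</attr>'
)

# Once parsed, the values are nodes in a tree, not substrings to hunt for.
values = doc.getElementsByTagName("value")
print(len(values))                     # 2
print(values[0].firstChild.nodeValue)  # Smith
```

The trade-off, of course, is that the tree lives in memory, which is exactly the sizing question the rest of this article pokes at.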

In Identity Manager, we see this most clearly when we use node sets, and XPATH upon those node sets. If you have ever watched a node set in DStrace then, annoyingly, you will NOT see the values; rather, you see a representation of the node set, which is somewhat useful, but not entirely so.

An example of such a trace would be:

[09/18/08 14:54:16.097]:Active Directory PT: Token Value: {<value> @timestamp = "1221487939#6" @type = "structured",<value> @timestamp = "1221658965#11" @type = "structured",<value> @timestamp = "1221745364#11" @type = "structured",<value> @timestamp = "1221764032#1" @type = "structured"}.

In that example, I queried for a multi-valued Back Link syntax attribute, which is basically a DN followed by a 32-bit integer. It is meant to represent an object in the directory, and a value in this DIB set for the object ID. A back link is a way of storing a reference from one object to another. eDirectory handles this all quite nicely in the background. It turns out that if you want to store a time value about an object, say last login time from different systems, it is quite useful! You could, for example, store the DN of a driver to represent the connected system, and then use the second field of the syntax type, the 32-bit integer, to store time in CTIME format. Neat eh? (Yes, another Canadian'ism, we do admittedly say Eh, on occasion, but come on, it works doesn't it, eh?)
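Since CTIME is just a count of seconds since midnight, January 1, 1970 (UTC), the integer half of such a value is trivial to turn into something human-readable. A quick sketch in Python, using one of the timestamps from the trace above (the driver DN is made up for illustration):

```python
import datetime

# Hypothetical Back Link style pairing: a driver DN plus a 32-bit CTIME integer.
driver_dn = "\\TREE\\org\\services\\DriverSet\\Active Directory"  # made-up DN
last_login_ctime = 1221487939  # CTIME: seconds since the Unix epoch (UTC)

when = datetime.datetime.fromtimestamp(last_login_ctime, tz=datetime.timezone.utc)
print(when.isoformat())  # 2008-09-15T14:12:19+00:00
```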

Anyway, so there you see a node set of 4 values, as trace will show it. In some ways useful, in other ways, not so much. It depends what you want. If I am troubleshooting a rule, I really want to see what the node sets look like inside, so I can see why my XPATH or DirXML Script is not working on them. If I am in production, I probably want to minimize that so I am more efficient, since tracing out to screen is very hard on the processor and performance.

Now as a hint, to see what the node set looks like, usually I just look at a rule beforehand, where I get the node set values. Usually from a Query, or a Source/Destination Attribute token. (See: The different attribute options in Identity Manager for some more details.) You can look at the output document that gets returned to the query input document, and see what the node set looks like. As it turns out, what really matters for XPATH selecting is the current context node. You need to know where it is. It is a little different if the node set is the result of a Source/Destination Attribute token than if it is the result of a Query token. I have an article in the works on this topic, so stay tuned for that in the near future.
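To illustrate why the context node matters, here is a sketch using Python's xml.etree.ElementTree (which supports a small XPath subset) against a simplified, made-up query-result document. The same relative path selects different nodes depending on which node you start from:

```python
import xml.etree.ElementTree as ET

# A simplified stand-in for a query result document; structure is illustrative.
output = ET.fromstring(
    '<output>'
    '<instance src-dn="\\TREE\\org\\user1">'
    '<attr attr-name="CN"><value>user1</value></attr>'
    '</instance>'
    '<instance src-dn="\\TREE\\org\\user2">'
    '<attr attr-name="CN"><value>user2</value></attr>'
    '</instance>'
    '</output>'
)

# Relative to the document root, this path finds every value node:
print(len(output.findall('.//value')))           # 2

# Relative to one instance node, the same path finds only that user's values:
first_instance = output.find('instance')
print(len(first_instance.findall('.//value')))   # 1
```

The same relative XPath in a policy behaves the same way: what it returns depends entirely on where the current context node sits.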

Next we come to an issue of scaling. How many node sets could an XML node set hold if a node set could hold node sets? (Think woodchucks...) More seriously, how much memory does a node set take up?

In an article on reporting the Java heap size via policy (Reading and Displaying the Value of Java Heap in Identity Manager Rules), I noted how to report on current heap, max heap size, and free heap space, so you can know how much memory is free, and whether you need to allocate more to Identity Manager's JVM. Do be aware that with 32-bit Identity Manager on 32-bit eDirectory you are limited to 2GB total RAM allocated for the eDirectory memory space. That includes the eDirectory database cache (set in iMonitor; see: Monitoring eDirectory Performance, Managing the eDirectory Database Cache, or Finding the eDirectory Database Size), the memory eDirectory needs for its actual component libraries, and the amount of space the JVM is allocated. With 64-bit eDirectory (I believe 8.8.3 can run 64-bit on SLES only, this release) and the someday release of 64-bit Identity Manager (dunno when, but hopefully soon enough!) we should be able to use more memory per process. Until that day we are limited to 2GB of RAM total.

Thus the size of a node set is a big deal. In the article about reporting Java heap size (Reading and Displaying the Value of Java Heap in Identity Manager Rules) I noted that for a single attribute, I was guessing at about 7K per node set. Read the article for details, but I will restate that I was entirely estimating the size, and there were many places for errors to creep in.

Well, I did it again: I ran a test that gave me another attempt to estimate the size of a node set. In my 7K example, the methodology I followed was: I started the engine up (I was on SLES, a Linux variant, so I did an /etc/init.d/ndsd restart) and I ran a report rule that first told me the heap sizes, did its work, then told me the heap sizes afterwards.

My thinking was that the memory used in between is mostly allocated to the report rule. In that report rule, I read a node set of thousands of users, returning at most a single attribute value each. Based on the math, that came out to about 7K a node.

This time I did something reasonably foolish: I ran a report that did much the same thing, this time for many more attributes. I reported on Java heap size, about 58 MB in use at startup of the engine. I then ran a report rule that queried for about 7800 objects, and the query asked for 5 of our customized entitlement attributes, plus DirXML-Associations. Users have all sorts of possible combinations of values, but they mostly have 3 DirXML-Associations, and then paired values for each entitlement attribute. (I.e., if they have the Active Directory entitlement, then they also have the Active Directory driver DirXML-Association.) As a rough guess, I would imagine the average user has 3 entitlements and 6-7 DirXML-Association values.

Thus my node set was probably 7800 users * 9 value nodes, or about 70,200 nodes. The memory usage (current Java heap) after the report was run was about 284 MB, roughly a 226 MB increase. Now some of that is attributable to the other drivers on this server doing work, but in that time period not many operations really flowed through the drivers. (Actually, it took about 25 minutes to run through the report rule, and this was a FAST box. Did I mention it was probably reasonably foolish? I really did need to fix up some attributes in the background to get some core functionality we developed working. I could have done it in some other language, and built an LDIF from the results to push them back into eDirectory, but since I have DirXML Script, and a 4-CPU physical box with 16 GB of RAM, why the heck not?)

Also, I was careful to prepopulate the cache in my initial Query. Read this article (More thoughts on Source/Destination/Operation attribute tokens in Identity Manager) about how Identity Manager caches events, and reuses values from the cache to boost performance. I made sure to request every attribute I knew I would need in the report rule in that initial request; this way, I should not have needed to query again inside the big loop.

For about 7800 instance nodes, with about 70,000 value nodes, taking 226 MB of space, that comes out to roughly 3.3K per value node (call it 4K to be safe), which is about 29K or so per instance node on average.
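The back-of-envelope math, as a sketch (figures taken from the run described above):

```python
# Numbers from the test run described above.
instances = 7800           # objects returned by the query (instance nodes)
values_per_instance = 9    # rough average: entitlement values + DirXML-Associations
heap_before_mb = 58        # Java heap in use at engine startup
heap_after_mb = 284        # Java heap in use after the report ran

delta_mb = heap_after_mb - heap_before_mb        # 226 MB attributed to the node set
value_nodes = instances * values_per_instance    # 70,200 value nodes

kb_per_value_node = delta_mb * 1024 / value_nodes  # per value node
kb_per_instance = delta_mb * 1024 / instances      # per instance node, on average
print(round(kb_per_value_node, 1), round(kb_per_instance, 1))  # 3.3 29.7
```

Remember these are rough estimates built on rough estimates; the averages for values-per-user are themselves guesses.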

I think that, together with the notion from the previous article (Reading and Displaying the Value of Java Heap in Identity Manager Rules) that an instance node with a single value node may be close to 7K, this provides us some insight.

Clearly the instance node (parent node, whatever you want to call it) takes up more space than a single value node. This is not something I had thought of before, and it is good to know. Perhaps that 7K figure makes more sense in this light: perhaps half the space required is for the instance node, and the other half is for the value node.

This at least gives two possible values, within an order of magnitude of each other, for how much memory you should be thinking about, based on how many node sets you may need to keep in memory at one time.

I would love to hear from other people who have similar, or even better, DIFFERENT results than I am seeing, so that we can try to come up with a better experiment to nail the value down even closer to the real one.

Heck if anyone from the Identity Manager design and programming team knows the answer to the question, with actual accuracy, I would love to know as well!

