
Using the Generic File Driver in Identity Manager (IDM) with XML Input files

I recently attended an online IDM User Group meeting about the products created by another member of the IDM community; his name is Stefaan Van Cauwenberge and one of his creations is known as the Generic File Driver. This new shim provides the same kind of functionality as the Delimited Text shim from NetIQ, but has a lot of extra features worth mentioning.

Currently at version 0.6, this shim comes without some of the quirks of the Delimited Text shim, specifically the need to hack XSLT in order to process character-separated values (CSV) files. This has been a stumbling block for years with the NetIQ option, and while we always hope it will change, so far it has not. The NetIQ option can accept XML files with XDS within, but if your XML is not XDS you'll need to add a stylesheet or some other code to transform it into XDS directly, and the last time I checked this all worked best if you had one XML event per file. Finally, the NetIQ shim lacks per-event metadata; it basically reads lines from a file, sends them through, and lets you guess about when things are completed. Several threads in the forums have discussed this and speculated about ways to let the administrator know that the end of a file has been reached so that some finishing operations can be completed; Stefaan's shim does this by default by indicating a record number with each record coming in, and also indicating when the last record is reached. This also means that there is the option to stop and resume within a single file, which is a huge deal when processing a huge file and needing to do some kind of eDir/IDM/server maintenance. There's more about this shim to love, but I want to get into the XML file processing specifically, so try it out for yourself and see what can finally be done with plain old text files, or XML files, or even proprietary XLS files.

In my setup I decided to use a basic, made-up bit of XML that would handle a single user per file. The idea is that I have some custom application from which I can dump XML as easily as anything, and since XML is generally more capable than CSV for complex types of data, it is the better option if either is equally easy to implement. Initially I had hoped to use the Generic File Driver shim and just read in XML and then map from one schema (in the XML files) to another (XDS/IDM), but it's not quite that easy. The Generic File Driver wants a specific type of XML record in order for it to do its extra and magical things, such as sending metadata about which record in the XML file you're on (in case you have multiples), whether you're on the last one, or even querying back to the record for more information; yes, that's right, the Generic File Driver supports querying back to the source, which is another pain point for the original Delimited Text shim. Along these same lines, this type of query works out of the box, naturally, so when you modify an existing object and the engine detects it as an existing associated object, or matches with an existing object, the merge works perfectly without needing to block queries. This is really awesome, especially since it appears you can even query other records in the "application" (input file) besides the one which caused the current event.

The required format for input is as follows:
<root>
<aRecord>
<someAttribute>someValueHere</someAttribute>
<someOtherAttribute>someOtherValueHere</someOtherAttribute>
<yetAnotherAttribute>yetAnotherValueHere</yetAnotherAttribute>
</aRecord>
</root>

Seems pretty simple overall. The format nicely supports multiple records per file, and therefore allows the shim to do all of its fancy stuff like querying, supporting offsets within and among files, etc. Still, the chances that random applications will generate events in exactly this format, with <root/> and <aRecord/> at the top of a list of attributes, are pretty slim. Stefaan either encountered or anticipated this and thankfully implemented a simple fix. He could have just read in whatever was out there and left it up to us to implement a stylesheet in the Input Transformation Policyset (ITP) to clean it up, like the Delimited Text shim does for anything it supports, but that would mean losing the ability to handle offsets, queries, etc. which are nice features of this new shim. Instead he built a field into the driver configuration parameters to allow pasting in XSLT to handle this initial transformation.
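To make the per-record metadata idea concrete, here is a minimal Python sketch that walks a multi-record file in the format above and attaches a record number and a last-record flag to each record. The names recordNumber and isLastRecord are borrowed from the Remote Loader trace later in this article; everything else (the function name, the sample values, numbering from 1) is my own assumption, and this is not how the shim itself is implemented.

```python
import xml.etree.ElementTree as ET

# Invented sample data in the <root>/<aRecord> layout described above.
SAMPLE = """<root>
  <aRecord><AppUserID>10001</AppUserID><FirstName>Test10001</FirstName></aRecord>
  <aRecord><AppUserID>10002</AppUserID><FirstName>Test10002</FirstName></aRecord>
</root>"""


def iter_records(xml_text):
    """Yield (fields, metadata) for each record element under the root."""
    records = list(ET.fromstring(xml_text))
    for number, record in enumerate(records, start=1):
        # Each child element of a record is treated as one attribute value.
        fields = {child.tag: (child.text or "") for child in record}
        # Hypothetical metadata: numbering from 1 is an assumption.
        metadata = {
            "recordNumber": number,
            "isLastRecord": number == len(records),
        }
        yield fields, metadata


for fields, metadata in iter_records(SAMPLE):
    print(metadata, fields)
```

Knowing which record is the last one is what makes end-of-file housekeeping possible in policy, and knowing the record number is what makes stop-and-resume within a file possible.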

The initial XML I used was pulled out of my head and looks like this:
<UserDetails>
<AppUserID>10001</AppUserID>
<FirstName>Test10001</FirstName>
<LastName>Lname10001</LastName>
<EmailID>ttest10001@me.com</EmailID>
</UserDetails>

It is made to have a single record per file (notice it lacks the depth of the example expected by the Generic File shim) and while I could have made it look like what the shim wanted I really wanted to see the XSLT apply and convert things. The attribute names can be left alone since we'll handle those normally with simple Schema Mapping, so the only thing to do is to make that block look more like this:
<root>
<aRecord>
<AppUserID>10001</AppUserID>
<FirstName>Test10001</FirstName>
<LastName>Lname10001</LastName>
<EmailID>ttest10001@me.com</EmailID>
</aRecord>
</root>
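The reshaping itself is small enough to sketch outside the driver. The Python below is only an illustration of the target structure, using the standard library's ElementTree; the function name is made up, and the shim does not work this way (it applies whatever XSLT you give it).

```python
import xml.etree.ElementTree as ET

SOURCE = """<UserDetails>
<AppUserID>10001</AppUserID>
<FirstName>Test10001</FirstName>
<LastName>Lname10001</LastName>
<EmailID>ttest10001@me.com</EmailID>
</UserDetails>"""


def wrap_single_record(xml_text):
    """Move every child of the top element under <root><aRecord>."""
    source = ET.fromstring(xml_text)
    root = ET.Element("root")
    record = ET.SubElement(root, "aRecord")
    # The attribute elements are carried over unchanged; only the
    # wrapper elements differ from the input.
    record.extend(list(source))
    return root


print(ET.tostring(wrap_single_record(SOURCE), encoding="unicode"))
```

Running something like this against a sample file is a quick way to confirm what the transformed document should look like before wiring anything into the driver.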

For those well-versed in XSLT there are probably several ways to pull this off. One way is shown below:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:fo="http://www.w3.org/1999/XSL/Format">
<xsl:output method="xml" indent="yes"/>
<xsl:template match="/UserDetails">
<root>
<aRecord>
<xsl:for-each select="/UserDetails/*">
<xsl:copy-of select="."/>
</xsl:for-each>
</aRecord>
</root>
</xsl:template>
</xsl:stylesheet>

This basically matches the top /UserDetails element, and then adds the top tags that I want (<root/> and <aRecord>) to replace UserDetails, copying in everything within. To use this with the Generic File shim just merge these lines into one (XSLT doesn't care about line breaks or whitespace in the same way that XML doesn't care about them) and then use that in the Pre-xslt field under the Publisher channel of the driver configuration:
<?xml version="1.0" encoding="UTF-8"?><xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:fo="http://www.w3.org/1999/XSL/Format"><xsl:output method="xml" indent="yes"/>  <xsl:template match="/UserDetails">    <root>      <aRecord>        <xsl:for-each select="/UserDetails/*">          <xsl:copy-of select="."/>        </xsl:for-each>      </aRecord>    </root>  </xsl:template></xsl:stylesheet>
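Merging the lines by hand is easy to get wrong, so here is a small Python sketch that collapses a stylesheet to a single line and checks that the result is still well-formed XML. The function name is hypothetical, and the sketch assumes the stylesheet is UTF-8 and has no significant line breaks inside text nodes such as xsl:text content.

```python
import xml.etree.ElementTree as ET


def collapse_to_one_line(stylesheet_text):
    """Join a multi-line stylesheet into one line and confirm it parses."""
    # Strip each line and join with a single space, so that attributes
    # split across lines are not glued together.
    pieces = [line.strip() for line in stylesheet_text.splitlines()]
    one_line = " ".join(piece for piece in pieces if piece)
    # Raises xml.etree.ElementTree.ParseError if the result is not
    # well-formed XML. Assumes the declared encoding is UTF-8.
    ET.fromstring(one_line.encode("utf-8"))
    return one_line


example = """<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/UserDetails">
    <root><aRecord><xsl:copy-of select="*"/></aRecord></root>
  </xsl:template>
</xsl:stylesheet>
"""
print(collapse_to_one_line(example))
```

This only checks that the text is well-formed XML; it does not run the transformation, so a stylesheet with a wrong match pattern would still pass.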

Now that I have jumped ahead of myself, I realize that you may not see the Pre-xslt field under the Publisher channel settings, and the reason is that you need to tell the driver config to use XML. Under the Publisher channel settings is a drop-down labeled 'File Reader Strategy' which is used to tell the shim how to handle the input file. In our case, we want to handle it like XML and that's what we'll do.

As a note, this was my first shot at using this package-based driver config along with this shim, so I could have made some mistakes in importing the package to Designer (Designer: right-click on Package Catalog, Import Package, Browse), or created the driver config using this package as a base, or changed settings here or there, but at some point I did something that caused all kinds of pain, suffering, and error messages. I believe I correctly identified (and passed along to Stefaan) the problems, but maybe I did not and somebody can correct me.

First, changing the 'File Reader Strategy' to XML looks harmless. The various Publisher Channel fields all update and everything seems fine, right up until you try to run the driver. At that point, you get some of this:
DirXML: [03/31/14 12:50:39.44]: TRACE:  xml-app\PT: MetaData read:recordNumber,isLastRecord
DirXML: [03/31/14 12:50:39.44]: TRACE: xml-app\PT: MetaData result:MetaDataManager [requiresFileSize=false, requiresFileName=false, requiresFilePath=false, requiresLastRecordFlag=true, requiresRecordNumber=true]
DirXML: [03/31/14 12:50:39.44]: TRACE: xml-app\PT: init - Creating fileLocator object
DirXML: [03/31/14 12:50:39.44]: TRACE: xml-app\PT: init:/var/opt/novell/dirxml/rdxml/xml-app/input - java.util.regex.Matcher[pattern=(?i).*\.xml region=0,0 lastmatch=]
DirXML: [03/31/14 12:50:39.44]: TRACE: xml-app\PT: init - Creating fileSorter object
DirXML: [03/31/14 12:50:39.45]: TRACE: xml-app\PT: init - Creating fileReader object
DirXML: [03/31/14 12:50:39.45]: TRACE: xml-app\PT: java.lang.ClassNotFoundException: info.vancauwenberge.filedriver.filereader.csv.XMLFileReader
at java.net.URLClassLoader$1.run(Unknown Source)
at java.net.URLClassLoader$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Unknown Source)
at info.vancauwenberge.filedriver.shim.FileDriverPublicationShim.initStrategy(FileDriverPublicationShim.java:321)
at info.vancauwenberge.filedriver.shim.FileDriverPublicationShim.init(FileDriverPublicationShim.java:475)
at com.novell.nds.dirxml.remote.loader.Driver.run(Driver.java:809)
at java.lang.Thread.run(Unknown Source)

This is taken from the Remote Loader (RL) trace during startup; as you can see there are mentions of the metadata used to handle offsets and to indicate where in the file we are, which is exciting, but then we have that ugly ClassNotFoundException which means something is amiss. If you look closely right there you'll see what took me a few minutes to find:
java.lang.ClassNotFoundException: info.vancauwenberge.filedriver.filereader.csv.XMLFileReader

Notice that the package of the XMLFileReader class is 'csv', which may not be terribly obvious, but it is wrong. I unzipped the shim to see what was present, and while the class name is correct it is actually in a sibling package to 'csv' named, unsurprisingly, 'xml'. This is part of the config shipped with the shim (at one of those first links way above) and so I think that is a bug. Since the config is easy to modify just use the 'Edit XML' button within IDM Designer to modify the driver properties, search for the old class, and replace it with the new version. Being a little bit foolish I just replaced the one value causing the problem, which then let me move on to errors that had a similar problem. Moral of the story: search-replace for all instances of the old string above and replace with the new string. Here's what some of this should look like after the fact:
<definition display-name="File Reader Strategy:" name="pub_fileReader" type="enum">
<enum-choice display-name="CSVFileReader">info.vancauwenberge.filedriver.filereader.csv.CSVFileReader</enum-choice>
<enum-choice display-name="XMLFileReader">info.vancauwenberge.filedriver.filereader.xml.XMLFileReader</enum-choice>
<enum-choice display-name="XLSFileReader">info.vancauwenberge.filedriver.filereader.xls.XLSFileReader</enum-choice>
<description>Publisher: File Reader Strategy (object class implementing IFileReadStrategy). This class will actually read the file, and thus knows about the file format (scv, tcv, xml,...).
Current implementations:
info.vancauwenberge.filedriver.filereader.csv.CSVFileReader
info.vancauwenberge.filedriver.filereader.xml.XMLFileReader
info.vancauwenberge.filedriver.filereader.xls.XLSFileReader</description>
<value>info.vancauwenberge.filedriver.filereader.xml.XMLFileReader</value>
</definition>
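If you have exported the driver configuration to a file, the replace-everything step can also be scripted. This is a minimal sketch that assumes the config is available as text; the function name is made up, and it only handles the XMLFileReader string from the trace above.

```python
OLD = "info.vancauwenberge.filedriver.filereader.csv.XMLFileReader"
NEW = "info.vancauwenberge.filedriver.filereader.xml.XMLFileReader"


def fix_reader_class(config_xml):
    """Return the corrected config text and how many occurrences changed."""
    count = config_xml.count(OLD)
    # str.replace swaps every occurrence, not just the first one, which is
    # the point: the same class name appears in several places.
    return config_xml.replace(OLD, NEW), count


snippet = (
    '<enum-choice display-name="XMLFileReader">' + OLD + "</enum-choice>"
    "<value>" + OLD + "</value>"
)
fixed, changed = fix_reader_class(snippet)
print(changed, "occurrence(s) replaced")
```

The count is worth printing: if it comes back as one, you have probably found only the value and missed the enum choice or the description.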

Notice that besides csv.CSVFileReader and xml.XMLFileReader there is also xls.XLSFileReader. I'd recommend fixing them all (they were all under the 'csv' package) in case you ever copy this driver config and modify it slightly for some other purpose and then get to hit the same issue all over again. Java class names are case-sensitive, so copy the spelling from the JAR exactly; the listing below shows the XLS reader as 'XlsFileReader' rather than 'XLSFileReader'. Anytime you want to verify the correct paths in a JAR you can list its contents using the 'unzip' command:
> unzip -t /path/to/the/GenericFileDriver.jar
*snip for brevity*
testing: info/vancauwenberge/filedriver/filereader/DummyFileReader.class OK
testing: info/vancauwenberge/filedriver/filereader/DummyFileReader.java OK
testing: info/vancauwenberge/filedriver/filereader/csv/CSVFileReader$1.class OK
testing: info/vancauwenberge/filedriver/filereader/csv/CSVFileReader.class OK
testing: info/vancauwenberge/filedriver/filereader/csv/CSVFileReader.java OK
testing: info/vancauwenberge/filedriver/filereader/xls/XlsFileReader.class OK
testing: info/vancauwenberge/filedriver/filereader/xls/XlsFileReader.java OK
testing: info/vancauwenberge/filedriver/filereader/xml/XMLFileReader$1.class OK
testing: info/vancauwenberge/filedriver/filereader/xml/XMLFileReader.class OK
testing: info/vancauwenberge/filedriver/filereader/xml/XMLFileReader.java OK
*snip for brevity*

The '-t' option only tests the integrity of the ZIP-formatted file, but as part of that test it prints every directory and file in the archive by default. If you only want the listing, 'unzip -l' shows the contents without testing them.
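The same check can be automated. The sketch below uses Python's zipfile module to report which configured class names have no matching .class entry in a JAR; the function name and the in-memory demo JAR are my own inventions, with entry names copied from the listing above.

```python
import io
import zipfile

PREFIX = "info.vancauwenberge.filedriver.filereader."


def missing_classes(jar_file, class_names):
    """Return the class names that have no matching .class entry."""
    with zipfile.ZipFile(jar_file) as jar:
        entries = set(jar.namelist())
    # Java resolves class names case-sensitively, so compare exactly.
    return [
        name for name in class_names
        if name.replace(".", "/") + ".class" not in entries
    ]


# Build a tiny stand-in JAR in memory; a real check would pass the path
# to the actual JAR file instead.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as jar:
    for entry in ("csv/CSVFileReader", "xls/XlsFileReader", "xml/XMLFileReader"):
        jar.writestr(PREFIX.replace(".", "/") + entry + ".class", b"")
demo.seek(0)

configured = [
    PREFIX + "csv.CSVFileReader",
    PREFIX + "csv.XMLFileReader",   # wrong package, as in the shipped config
    PREFIX + "xml.XMLFileReader",
    PREFIX + "xls.XLSFileReader",   # differs in case from the JAR entry
]
print(missing_classes(demo, configured))
```

Anything this reports would fail with a ClassNotFoundException at driver startup, so it is a cheap check to run before deploying a modified config.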

Next we can move on to my next headache, and this one was a bit more painful. Because the world does not revolve around any single locale, and since computers are very specific machines that care about every single bit of information presented, it is important to have the computer and the data source agree on the encoding of data. You have likely heard of ASCII, Unicode, UTF-8, or any of a large number of other data encoding standards. Generally speaking they allow two computers to agree on what a sequence of bits means by giving a pre-determined and agreed-upon context for those bits, just as if you and I were to agree to communicate via spoken (US) English. The Generic File shim also lets you set the encoding of files fed to it from whatever application, which is useful in case the locale of the server running this shim is not the same as that of the application creating the files to be used for input.
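A quick way to see why the agreed-upon context matters is to decode the same bytes under two different encodings. This is a generic Python illustration with an invented sample value; it says nothing about how the shim reads files internally.

```python
import locale

text = "Andr\u00e9"            # "André", a sample value with one non-ASCII character
raw = text.encode("utf-8")     # the bytes a UTF-8 application would write

# The encoding a program falls back to when none is given depends on
# the environment it runs in.
print("platform default:", locale.getpreferredencoding(False))

# Same bytes, two contexts: only the matching one recovers the text.
print("UTF-8 round trip matches:  ", raw.decode("utf-8") == text)
print("Latin-1 round trip matches:", raw.decode("latin-1") == text)
```

The bytes never change; only the interpretation does, which is why leaving the encoding to a system default can behave differently from one server or user account to another.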

The option looks like this in the driver configuration under the Publisher Channel properties:

<definition display-name="File encoding:" name="csvReader_forcedEncoding" type="string">
<value xml:space="preserve">UTF-8</value>
<description>CSVReader: forced encoding of the xml file. Leave blank to use system default encoding.</description>
</definition>

Sounds like I should be able to leave it blank, which I did. Unfortunately it seems like that was not a good thing, as I ended up with all kinds of problems applying my Pre-xslt code described above. Maybe my system does not have a default locale somehow passed on to the RL (running as 'root' in this case, which I never use otherwise), or maybe there is a bug in the shim. Either way I had failures that looked like this:
DirXML: [03/31/14 20:03:12.72]: TRACE:  xml-app\PT: Sleeping for 5 seconds
DirXML: [03/31/14 20:03:17.72]: TRACE: xml-app\PT: poll start
DirXML: [03/31/14 20:03:17.72]: TRACE: xml-app\PT: getFileList start
DirXML: [03/31/14 20:03:17.72]: TRACE: xml-app\PT: File matches regexp:test10001.xml
DirXML: [03/31/14 20:03:17.72]: TRACE: xml-app\PT: getFileList done
DirXML: [03/31/14 20:03:17.72]: TRACE: xml-app\PT: getNextFile: found file(s)
DirXML: [03/31/14 20:03:17.72]: TRACE: xml-app\PT: Move via java native.
DirXML: [03/31/14 20:03:17.72]: TRACE: xml-app\PT: processFile: start
DirXML: [03/31/14 20:03:17.72]: TRACE: xml-app\PT: processFile: staticMetaData added:{isLastRecord=false}
DirXML: [03/31/14 20:03:17.72]: TRACE: xml-app\PT: Exception while handeling XML document:java.io.FileNotFoundException: file:/var/opt/novell/dirxml/rdxml/xml-app/work/2014.03.31_20.03.17/test10001.xml.transformed (No such file or directory)
DirXML: [03/31/14 20:03:17.72]: TRACE: xml-app\PT: An error occured reading the input file.(/var/opt/novell/dirxml/rdxml/xml-app/work/2014.03.31_20.03.17/test10001.xml)java.io.FileNotFoundException: file:/var/opt/novell/dirxml/rdxml/xml-app/work/2014.03.31_20.03.17/test10001.xml.transformed (No such file or directory)
DirXML: [03/31/14 20:03:17.72]: TRACE: Remote Loader: Received document from publicationShim
DirXML: [03/31/14 20:03:17.72]: TRACE: <nds dtdversion="3.0">
<source>
<product build="2014-02-28 21:19" instance="xml-app" version="0.6">Generic File Driver</product>
<contact>VanCauwenberge.info</contact>
</source>
<input>
<status level="error" type="driver-general">
<description>An error occured reading the input file.(/var/opt/novell/dirxml/rdxml/xml-app/work/2014.03.31_20.03.17/test10001.xml)</description>
<exception class-name="javax.xml.transform.TransformerException">
<message>java.io.FileNotFoundException: file:/var/opt/novell/dirxml/rdxml/xml-app/work/2014.03.31_20.03.17/test10001.xml.transformed (No such file or directory)</message>
</exception>
<exception class-name="java.io.FileNotFoundException">
<message>file:/var/opt/novell/dirxml/rdxml/xml-app/work/2014.03.31_20.03.17/test10001.xml.transformed (No such file or directory)</message>
</exception>
</status>
</input>
</nds>

Seeing that and being familiar with the original Delimited Text shim you may wonder what is going on. When processing starts on a file the Generic File shim moves the file to a 'work' directory, and the actual reading of the contents takes place there. There is an option in the driver config to specify this 'work' directory. There is a work directory for each channel, so this is particularly nice with the Subscriber channel, which I was not using. The result on the Subscriber channel is that the final output directory does not see the file until the shim is completely done with it. With the original Delimited Text (DT) shim, if you have an application watching for a file (via a cron job, for example, or just actively checking a directory for output created by the DT shim) then that application may start reading the DT-created file before it is finished being created. Having a 'work' directory prevents that nicely, since only after writes are complete is the finished product moved to the place where something else can consume it. All very nice, well thought out, etc.
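The work-directory pattern is a general one and easy to sketch. The Python below writes a file completely in one directory and only then moves it to where a consumer would look; the function name, directory names, and file name are hypothetical, and this shows the pattern rather than the shim's actual code.

```python
import os
import tempfile


def publish_via_work_dir(work_dir, output_dir, name, content):
    """Write the file completely in work_dir, then move it to output_dir."""
    work_path = os.path.join(work_dir, name)
    final_path = os.path.join(output_dir, name)
    with open(work_path, "w", encoding="utf-8") as handle:
        handle.write(content)
    # os.replace is atomic when both directories are on the same
    # filesystem, so a watcher of output_dir never sees a partial file.
    os.replace(work_path, final_path)
    return final_path


with tempfile.TemporaryDirectory() as base:
    work = os.path.join(base, "work")
    output = os.path.join(base, "output")
    os.mkdir(work)
    os.mkdir(output)
    publish_via_work_dir(work, output, "users.csv", "10001,Test10001\n")
    print(os.listdir(output))
```

The caveat is the same-filesystem requirement: if the work and output directories sit on different mounts, a move becomes a copy and the consumer can again see a half-written file.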

To resolve my final bit of pain I set the encoding explicitly to 'UTF-8' and then things worked perfectly. The rest of the setup is what you'd expect; the driver config has a parameter for you to specify which class of object is created from the input files ('User' is the default as I recall), the schema mapping and filter functions of the engine are defaults as always, etc. The base package for this config does not have any Subscriber channel policies, so add those to properly create an object to be sent out the Subscriber channel, but I was not doing anything like that for my test. The Subscriber channel configuration options also include a Post-xslt option which can be applied to your generated XDS before writing the final output file, just like the Pre-xslt was used to transform events at the time of reading on the Publisher channel. Overall I hit a few trivial quirks which are probably new bugs, and I anticipate they will be fixed by the time anybody reads this. Look forward to the version after 0.6 for those fixes, or else just fix them as described previously to get started with this driver right away.
