Thoughts on the Delimited Text Driver

over 11 years ago

There are a number of Identity Manager drivers available from Novell for the various supported connected systems. Some are very specific, such as the Lotus Notes/Domino driver, which as the name suggests really only talks to a Notes or Domino server. Others are quite generic, like the JDBC driver, which ought to be able to connect to almost any database for which a JDBC driver class is available.



When all else fails, the driver of last resort is often the Delimited Text driver. Recently I got my first chance to work with one, and I had some thoughts about the driver that I thought would be worth sharing.



Something that I would like to see for each driver is a basic summary of the high points and the low points of its feature set. For example, with the SAP HR driver there is an entire sub-issue of how relationships between Organizations, Positions, Jobs, and Persons are handled (see SAP HR Driver and Organizational Management - Part 1). Another example is that its two channels use different approaches: the Publisher channel reads iDOC files and the Subscriber channel uses BAPI.



Sure, you could read the documentation, but that would be a lot of work, and you would then have to sift out the important parts and effectively summarize the issues for yourself to wrap your head around them.



I thought it would be nice to offer some such comments on the Delimited Text driver.



This is one of those drivers where you have no choice but to use XSLT to transform the incoming XML from one format to another. I personally prefer to use DirXML Script rather than XSLT wherever possible, for many reasons, not the least of which is that Policy Builder is a truly great interface for generating DirXML Script. You can see more about this topic at:
Open Call: What Can You Do in XSLT that You Cannot Do in DirXML Script?



In some ways this article is a continuation of that open call for what you cannot do in DirXML Script and can only do in XSLT, but focused on the specific case of the Delimited Text driver. I think I have the SAP HR issue solved, and it is pretty straightforward to query for Relationship data in DirXML Script. But the more I look at the Delimited Text driver, the more I realize there is no way to get around this one, and working with the driver opened my eyes more fully to why that is the case.



The basic problem is that the event comes in as an XML document that looks like:



[11/20/09 13:49:23.642]:ACME Add PT:Receiving DOM document from application.
[11/20/09 13:49:23.642]:ACME Add PT:
<delimited-text>
<record>
<field name="Empl ID">1234567</field>
<field name="Status">Active</field>
<field name="Employee Name">Knud, Dalaire</field>
<field name="Company">xx</field>
<field name="Location">BMetal A/S</field>
<field name="Department">Marketing</field>
<field name="EE Type">S</field>
<field name="Regular/Temporary">Regular</field>
<field name="Full/Part">Part-Time</field>
<field name="ABCD"></field>
<field name="Job Code"></field>
<field name="Job Title">Marketing Coordinator</field>
<field name="Job Family">Super</field>
<field name="Reports To ID">1234</field>
<field name="Report To Name">Smith, John</field>
<field name="Hire Date">12/12/2009</field>
<field name="Email">bob@acme.com</field>
</record>
</delimited-text>



There are a couple of things to note here. The field names are parsed from the driver config, where you provide a comma-separated list of the names of the fields, in the order they will appear. A really cool enhancement, which ought not to be too hard, would be to read the header line in each file instead and use that to get the field names.



Seeing that, I had an issue I wanted to work on: sometimes HR was sending us the data with columns missing. Those guys. I tell you, working with HR is always an experience. You can tell them time and again that we need the data in this exact format, and they still keep changing it. Anyway, that can be disastrous. If you are about to modify every user based on a data feed from HR and a field is shifted by one, you get what in biology is called a frame shift mutation, which is almost always lethal. That is the case where a single base pair is inserted into the DNA strand. Since proteins are coded by reading the base pairs in sets of three, that shifts the meaning of everything downstream, usually in a horrible way. No less so in the case of bad data. Imagine you rewrite everyone's email address to a date string. Ouch, that would hurt.
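To make the frame shift concrete, here is a minimal Python sketch that pairs values with field names purely by position, which is what a fixed field-name list implies. The shortened field list and the map_row helper are made up for illustration, and the sketch does not model how the driver shim itself reacts to a short row.

```python
import csv
import io

# Hypothetical, shortened field list; the real driver config lists all 17 names.
FIELD_NAMES = ["Empl ID", "Status", "Hire Date", "Email"]


def map_row(line):
    """Pair the values of one CSV line with FIELD_NAMES by position only."""
    values = next(csv.reader(io.StringIO(line)))
    return dict(zip(FIELD_NAMES, values))


good = map_row("1234567,Active,12/12/2009,bob@acme.com")
# The same record, but HR dropped the Status column.
shifted = map_row("1234567,12/12/2009,bob@acme.com")

print(good["Email"])         # bob@acme.com
print(shifted["Status"])     # 12/12/2009
print(shifted["Hire Date"])  # bob@acme.com
```

Every value after the missing column lands under the wrong name, and nothing in the data itself flags the problem, which is why a simple count check is worth having.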



I thought, OK, I will do it after the Input Transform style sheet writes out the new version of the XML, and in DirXML Script I will say: if the XPATH expression count(//add-attr)=17 is not true, then veto this operation. That's an easy way to do it.



I tried it, and realized that the XSLT style sheet is nice enough to NOT copy in empty nodes. If there is an empty node, as with the ABCD field above, the resulting XML has no corresponding add-attr node for it. So that test fails often, since empty nodes in the delimited text are allowed; they just mean there is no data for that field.



OK, my next thought was that there is another easy way. Before the Input Transform fires, I will do the same test, but this time the count() statement in XPATH is for the nodes /delimited-text/record/field, which would solve my problem. If there are fewer or more nodes provided, that would catch it.
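The difference between the two counts can be shown outside the engine with a small Python sketch using the standard library's ElementTree. The document here is a shortened, five-field version of the trace above, and the variable names are only for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened version of the trace above: five fields, two of them empty.
DOC = """<delimited-text>
  <record>
    <field name="Empl ID">1234567</field>
    <field name="ABCD"></field>
    <field name="Job Code"></field>
    <field name="Job Title">Marketing Coordinator</field>
    <field name="Email">bob@acme.com</field>
  </record>
</delimited-text>"""

record = ET.fromstring(DOC).find("record")

# Equivalent of count(field): every column, empty or not.
all_fields = record.findall("field")
# Equivalent of count(field[string()]): only columns that carry a value,
# which is what ends up as add-attr nodes after the style sheet runs.
non_empty = [f for f in all_fields if (f.text or "") != ""]

print(len(all_fields))  # 5
print(len(non_empty))   # 3
```

Counting the raw field nodes is stable regardless of how many columns are blank, while counting the generated add-attr nodes varies from record to record.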



It probably would not catch a misordering of the columns in the CSV file, but whatcha gonna do. I cannot perfectly validate everything. Let's catch the easy stuff, and stress to the HR folk how important accuracy is.



I wrote my simple little rule to test whether the XPATH expression count(delimited-text/record/field)=17 is not true. Now being me, I did not hard-code the value 17: since I know I will have to do this again, I stored it in a GCV called nodeCount. If I had a little more time, I would have read back the driver's configuration attribute, DirXML-ConfigValues, into a node set variable, and XPATH'ed out the configuration value that holds the comma-separated list of field names, then used a split token to turn the comma-separated list into a node set variable. Then I would have set nodeCount to count($VAR) to get my final answer, made it a driver-scoped variable, and started this all off by testing on each pass through the rule whether the driver-scoped variable has a value. This way, the first event through would do the work to calculate the number of nodes expected, and the value would remain available until the driver was restarted, at which point the next event through would calculate it again.



But I digress. Regardless of how elegant and simple my approach was, it did bubkes. Nothing happened. The rule did not even fire on the input document.



I realized why once I had tried it. The main reason why you have to use XSLT Style sheets to transform the XML before DirXML Script can process it, in the SOAP and Delimited Text drivers, is that DirXML Script only works on documents that start with <nds> and contain <input> or <output> nodes according to the DirXML Script DTD. If the document does not look like one DirXML Script can process, there is nothing to process.



Thus the XSLT Style sheet is required to convert the incoming XML and transform it into something that matches the DTD that DirXML Script is even willing to consider looking at.



Once I realized that, I went into the XSLT style sheet (see below) and added an <operation-data count="{count(field)}"/> line, which adds an operation property to the XML document that I can easily catch in DirXML Script and test there.



If operation property count not equal to 17 (or in my case a GCV) then veto. Nice and easy.
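As a rough model of that check, here is a Python sketch that reads the count attribute from operation-data in an <nds> document and decides whether the event should be vetoed. The should_veto and sample helpers are hypothetical names, and the real test of course lives in a DirXML Script rule, not in Python.

```python
import xml.etree.ElementTree as ET


def should_veto(nds_xml, expected_count):
    """Return True if any <add> lacks an operation-data count equal to expected_count."""
    root = ET.fromstring(nds_xml)
    for add in root.iter("add"):
        op_data = add.find("operation-data")
        if op_data is None or op_data.get("count") != str(expected_count):
            return True
    return False


def sample(count):
    """Build a stripped-down <nds> document carrying only the count."""
    return ('<nds><input><add class-name="User">'
            '<operation-data count="%d"/></add></input></nds>' % count)


print(should_veto(sample(17), 17))  # False
print(should_veto(sample(16), 17))  # True
```

Treating a missing operation-data node as a veto is a design choice made in this sketch: if the style sheet did not stamp the count, the record has not been validated.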



We discussed this in the support forums and came up with some simple DirXML Script that will read the field names and figure out the number of nodes for us. (Thanks to David Gersic and Joakim Ganse for your help!)



Here is some sample code to do just that:



<rule>
    <description>check input doc</description>
    <conditions>
        <and>
            <!-- Only do the work when the driver-scoped variable is not yet set -->
            <if-local-variable name="nodeCount" op="not-available"/>
        </and>
    </conditions>
    <actions>
        <!-- Read DirXML-ShimConfigInfo from the driver object and Base64 decode it -->
        <do-set-local-variable name="configvalues" scope="policy">
            <arg-string>
                <token-base64-decode>
                    <token-dest-attr name="DirXML-ShimConfigInfo">
                        <arg-dn>
                            <token-global-variable name="dirxml.auto.driverdn"/>
                        </arg-dn>
                    </token-dest-attr>
                </token-base64-decode>
            </arg-string>
        </do-set-local-variable>
        <!-- Drop the first 38 characters so the rest can be parsed as XML -->
        <do-set-local-variable name="pconfigvalues" scope="policy">
            <arg-string>
                <token-substring length="-1" start="38">
                    <token-local-variable name="configvalues"/>
                </token-substring>
            </arg-string>
        </do-set-local-variable>
        <!-- Parse the remaining string into a node set -->
        <do-set-local-variable name="dconfigvalues" scope="policy">
            <arg-node-set>
                <token-xml-parse>
                    <token-local-variable name="pconfigvalues"/>
                </token-xml-parse>
            </arg-node-set>
        </do-set-local-variable>
        <!-- Pull out the comma-separated list of field names -->
        <do-set-local-variable name="fields2" scope="policy">
            <arg-string>
                <token-xpath expression="$dconfigvalues/driver-config/driver-options/configuration-values/definitions/definition[@name='field-names']/value/text()"/>
            </arg-string>
        </do-set-local-variable>
        <!-- Split the list into one node per field name -->
        <do-set-local-variable name="nconfigvalues" scope="policy">
            <arg-node-set>
                <token-split delimiter=",">
                    <token-local-variable name="fields2"/>
                </token-split>
            </arg-node-set>
        </do-set-local-variable>
        <!-- Count the field names and keep the result for the life of the driver -->
        <do-set-local-variable name="nodeCount" scope="driver">
            <arg-string>
                <token-xpath expression="count($nconfigvalues)"/>
            </arg-string>
        </do-set-local-variable>
    </actions>
</rule>



If the local variable is not available (being a driver-scoped variable, this should only occur after a driver restart), then read the DirXML-ShimConfigInfo attribute from the driver's DN, using the automatic GCV that all drivers have for the driver's DN. It is stored Base64 encoded, so decode it first.



Then there is a leading <?xml ...?> declaration line that needs to be removed before the Parse XML token will treat the rest as valid XML to be parsed. David Gersic came up with the idea of using the Substring token to do this, good call! Note that start="38" with length="-1" keeps everything AFTER the first 38 characters, which matches the length of a declaration such as <?xml version="1.0" encoding="UTF-8"?>.



Use XPATH to get the right node of the configuration document into a variable.



Then use the Split token to take the comma-separated line of text and turn it into a nodeset.



Once we have a nodeset, handy dandy XPATH has a count() function to get the correct number.
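The same chain of steps (decode, strip the declaration, parse, select, split, count) can be traced in a short Python sketch. The sample value is a made-up stand-in for the decoded DirXML-ShimConfigInfo attribute; its element layout is taken from the XPath in the rule above, and the assumption that the value starts with exactly this 38-character XML declaration is mine.

```python
import base64
import xml.etree.ElementTree as ET

XML_DECL = '<?xml version="1.0" encoding="UTF-8"?>'  # 38 characters long

# Hypothetical stand-in for the attribute value, with only three field names.
sample = XML_DECL + (
    "<driver-config><driver-options><configuration-values><definitions>"
    '<definition name="field-names"><value>Empl ID,Status,Email</value></definition>'
    "</definitions></configuration-values></driver-options></driver-config>"
)
encoded = base64.b64encode(sample.encode("utf-8"))

# 1. Base64 decode.
decoded = base64.b64decode(encoded).decode("utf-8")
# 2. Keep everything after the first 38 characters (substring start=38, length=-1).
body = decoded[len(XML_DECL):]
# 3. Parse what is left as XML.
config = ET.fromstring(body)
# 4. Select the field-names definition, relative to the driver-config root.
value = config.find(
    "driver-options/configuration-values/definitions/"
    "definition[@name='field-names']/value"
).text
# 5. Split on commas and count.
field_names = value.split(",")
node_count = len(field_names)

print(field_names)  # ['Empl ID', 'Status', 'Email']
print(node_count)   # 3
```

A fixed offset is fragile if the declaration ever changes length, so a production rule might prefer to cut at the end of the first "?>" instead; the sketch keeps the offset to mirror the rule.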



Then you can use the local variable instead of the GCV and it no longer needs to be configured separately, which is quite nice, as it becomes quite generic and you can reuse it without needing to change any specific values.



Back to the document that is coming in, the XSLT transforms the <delimited-text> document into an <nds> document as below:



[11/20/09 13:49:23.648]:ACME Add PT:Applying XSLT policy: % CCpub-its-InputTransformSS%-C.
[11/20/09 13:49:23.649]:ACME Add PT:Policy returned:
[11/20/09 13:49:23.649]:ACME Add PT:
<nds dtdversion="1.1" ndsversion="8.6" xml:space="default">
<input>
<add class-name="User" src-dn=" ">
<association>bob@acme.com</association>
<add-attr attr-name="Empl ID">
<value type="string">1234567</value>
</add-attr>
<add-attr attr-name="Status">
<value type="string">Active</value>
</add-attr>
<add-attr attr-name="Employee Name">
<value type="string">Knud, Dalaire</value>
</add-attr>
<add-attr attr-name="Company">
<value type="string">xx</value>
</add-attr>
<add-attr attr-name="Location">
<value type="string">BMetal A/S</value>
</add-attr>
<add-attr attr-name="Department">
<value type="string">Marketing</value>
</add-attr>
<add-attr attr-name="EE Type">
<value type="string">S</value>
</add-attr>
<add-attr attr-name="Regular/Temporary">
<value type="string">Regular</value>
</add-attr>
<add-attr attr-name="Full/Part">
<value type="string">Part-Time</value>
</add-attr>
<add-attr attr-name="Job Title">
<value type="string">Marketing Coordinator</value>
</add-attr>
<add-attr attr-name="Job Family">
<value type="string">Super</value>
</add-attr>
<add-attr attr-name="Reports To ID">
<value type="string">1234</value>
</add-attr>
<add-attr attr-name="Report To Name">
<value type="string">Smith, John</value>
</add-attr>
<add-attr attr-name="Hire Date">
<value type="string">12/12/2009</value>
</add-attr>
<add-attr attr-name="Email">
<value type="string">bob@acme.com</value>
</add-attr>
<operation-data count="17"/>
</add>
</input>
</nds>



Let's walk through some of the XSLT to see how it does it.



Here is the input transform style sheet I am using.



<?xml version="1.0" encoding="UTF-8"?><xsl:stylesheet extension-element-prefixes="nxsl" version="1.0" xmlns:nxsl="http://www.novell.com/nxsl" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<!-- each application must fill in the name of the field that provides the association key -->
<xsl:variable name="association-field-name" select="'Email'"/>
<!-- The following two fields will be concatenated to form the CN of the user -->
<xsl:variable name="srcdn-field-name1" select="'FirstName'"/>
<xsl:variable name="srcdn-field-name2" select="'LastName'"/>
<!-- each application must fill in the name of the class that the delimited text represents -->
<xsl:variable name="object-class" select="'User'"/>
<xsl:template match="/">
<xsl:choose>
<!-- if document element is delimited-text, then we need to do the transformation -->
<xsl:when test="delimited-text">
<nds dtdversion="1.1" ndsversion="8.6" xml:space="default">
<input>
<!-- for each record, do an add -->
<xsl:for-each select="delimited-text/record">
<!-- see NDSDTD doc on web for Add verb syntax & details -->
<!-- get the association id into a variable -->
<xsl:variable name="association" select="field[@name=$association-field-name]"/>
<!-- get the src-dn id into a variable, replacing invalid DN characters with a dash -->
<xsl:variable name="temp1" select="concat(field[@name=$srcdn-field-name1],' ')"/>
<xsl:variable name="temp2" select="concat($temp1,field[@name=$srcdn-field-name2])"/>
<xsl:variable name="srcdn" select="translate($temp2,' =,.\','-----')"/>
<!-- generate the add event -->
<add class-name="{$object-class}" src-dn="{$srcdn}">
<!-- generate the association -->
<association>
<xsl:value-of select="$association"/>
</association>
<!-- handle each field -->
<xsl:for-each select="field[string()]">
<xsl:variable name="fieldValue" select="normalize-space(.)"/>
<!-- generate the add-attr -->
<add-attr attr-name="{@name}">
<!-- generate the value element using string syntax -->
<!-- note that attributes that require a structured or octet syntax -->
<!-- may require special handling here -->
<value type="string">
<xsl:value-of select="$fieldValue"/>
</value>
</add-attr>
</xsl:for-each>
<operation-data count="{count(field)}"/>
</add>
</xsl:for-each>
</input>
</nds>
</xsl:when>
<xsl:otherwise>
<!-- if the document element is not <delimited-text> copy as is-->
<xsl:copy-of select="."/>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
</xsl:stylesheet>




First up we have some variables being set. They are pretty straightforward and come from the default sample config; the two src-dn ones are not really used in my example.



<xsl:variable name="association-field-name" select="'Email'"/>
<!-- The following two fields will be concatenated to form the CN of the user -->
<xsl:variable name="srcdn-field-name1" select="'FirstName'"/>
<xsl:variable name="srcdn-field-name2" select="'LastName'"/>
<!-- each application must fill in the name of the class that the delimited text represents -->
<xsl:variable name="object-class" select="'User'"/>




Basically you decide which field is the value used for association values. You can see what other drivers use for DirXML-Association values in the article Open Call - IDM Association Values for eDirectory Objects. In this case, we are using Email, since it is the only thing we have that is somewhat unique.



The next two are used to try and generate a src-dn value, assuming there are some columns for first and last name that are useful. However, that is not how our data is laid out so it really does nothing for me.



Finally the class of the object we will be working with needs to be set to User here.



The next line is the important part:


<xsl:template match="/">


This is where XSLT applies a template to whatever matches the pattern, in this case the root of the document. This is the step that DirXML Script cannot handle, since it only accepts documents with <nds> at the root.



<xsl:choose>
<!-- if document element is delimited-text, then we need to do the transformation -->
<xsl:when test="delimited-text">


This branch only handles documents that come in as <delimited-text>, which is all this driver usually provides, so that should be good to go.



<nds dtdversion="1.1" ndsversion="8.6" xml:space="default">
<input>


Write out the beginning of a valid XDS document that DirXML Script can begin to handle.




<!-- for each record, do an add -->
<xsl:for-each select="delimited-text/record">
<!-- see NDSDTD doc on web for Add verb syntax & details -->
<!-- get the association id into a variable -->
<xsl:variable name="association" select="field[@name=$association-field-name]"/>
<!-- get the src-dn id into a variable, replacing invalid DN characters with a dash -->
<xsl:variable name="temp1" select="concat(field[@name=$srcdn-field-name1],' ')"/>
<xsl:variable name="temp2" select="concat($temp1,field[@name=$srcdn-field-name2])"/>
<xsl:variable name="srcdn" select="translate($temp2,' =,.\','-----')"/>
<!-- generate the add event -->
<add class-name="{$object-class}" src-dn="{$srcdn}">



Here we look at each <record> node under the <delimited-text> node, since one document can have many records. We try to use the variables above to make a valid src-dn value by sticking the values that ought to be First and Last name together (the concat() function) and then replacing illegal characters with dashes (the translate() function). Finally we add an <add> node with the XML attributes class-name and src-dn.
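For readers less familiar with the XPath string functions, here is a minimal Python sketch of what the concat() and translate() calls do to the two name fields. The make_src_dn helper and the sample name are made up for illustration.

```python
# The five characters the style sheet replaces: space, '=', ',', '.', backslash.
DN_UNSAFE = str.maketrans({c: "-" for c in " =,.\\"})


def make_src_dn(first, last):
    """Mimic concat(first, ' '), concat(temp1, last), then translate(...)."""
    joined = first + " " + last
    return joined.translate(DN_UNSAFE)


print(make_src_dn("Mary Ann", "O.Neil"))  # Mary-Ann-O-Neil
print(make_src_dn("", ""))                # -
```

The second call shows what happens when the FirstName and LastName fields do not exist in the data, as in this feed: the only character left is the joining space, which becomes a single dash.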



<!-- generate the association -->
<association>
<xsl:value-of select="$association"/>
</association>



Now we add the <association> node, using the email address field value.



<!-- handle each field -->
<xsl:for-each select="field[string()]">
<xsl:variable name="fieldValue" select="normalize-space(.)"/>
<!-- generate the add-attr -->
<add-attr attr-name="{@name}">
<!-- generate the value element using string syntax -->
<!-- note that attributes that require a structured or octet syntax -->
<!-- may require special handling here -->
<value type="string">
<xsl:value-of select="$fieldValue"/>
</value>
</add-attr>
</xsl:for-each>



Now we loop through the fields that have a non-empty string value, and use the normalize-space() function to trim any loose spaces at the beginning or end and to collapse runs of multiple spaces down to a single space character.
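A small Python sketch of normalize-space() may help here. It follows the XPath 1.0 definition of whitespace (space, tab, carriage return, line feed); the normalize_space function name is just the obvious Python spelling, not part of any IDM API.

```python
import re

# XPath 1.0 whitespace: space, tab, carriage return, line feed.
_XML_WS = re.compile(r"[ \t\r\n]+")


def normalize_space(text):
    """Collapse runs of XML whitespace to one space and trim both ends."""
    return _XML_WS.sub(" ", text).strip(" ")


print(repr(normalize_space("  Marketing   Coordinator \n")))  # 'Marketing Coordinator'
print(repr(normalize_space("   ")))                           # ''
```

The second call hints at an edge case: going by XPath semantics, a field holding only spaces has a non-empty string value, so it would pass the field[string()] test and then produce an add-attr with an empty value. Whether such a field ever reaches the style sheet depends on how the shim builds the document, which is not something this sketch can confirm.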



Next we add an <add-attr> node with an attr-name XML attribute based on the name of the field.



Add a <value> node, and then the value itself. You might have to, as the comments suggest, do different things for Octet string or structured attributes.



<operation-data count="{count(field)}"/>



Here is where I add in the count of the fields to use as I described above to validate the data.



<xsl:otherwise>
<!-- if the document element is not <delimited-text> copy as is-->
<xsl:copy-of select="."/>
</xsl:otherwise>



Finally, if the document element is not <delimited-text>, copy the document through as is.



Last but not least close off all the open nodes so the XML is valid.



At this point, you have a more standard <add> event document that DirXML Script can manage and process.



As you can see, this XSLT, once done, is much of what you need. Of course there is a corresponding style sheet on the other end, in the Output Transform, to convert back to a format that can be written out to a CSV style file.



Hopefully this walk through of a segment of how the driver works will be helpful if you are trying to get a better understanding of how the driver is meant to work.


