Cybersecurity
DevOps Cloud (ADM)
IT Operations Cloud
In part 1 of this series, I started talking about the Start Workflow token in Novell Identity Manager.
It turns out I have a really nice troubleshooting issue to work through, that shows a bunch of possible places to have errors, and how I worked through tracking each of them down, one by one, until finally we identified the root cause of the problem. Hope you enjoy coming along for the ride.
This started with a Start Workflow token in our Identity Manager driver. (It happens it was a SOAP driver talking to Salesforce.com, but basically when an event came out of Salesforce.com we needed to start a workflow for someone to approve the changes that were made, since the sales critters used Salesforce.com to enter clients in, and often made errors, but the helpdesk was responsible to make sure they did not do anything foolish. By automating the connection between the sales folk and the Identity system, we needed to put a human review step into the process. Thus Workflows are the perfect way to do this!).
When we left off in part 1, the start workflow token had failed with a HTTP 403
However still see the 403 error in response to the Start-Workflow token in DStrace that looked like:
DirXML Log Event -------------------
Driver: \IDVAULT-LAB\com\ACME\Drivers\IDM\SOAP-SPML
Channel: Publisher
Object: 003T000000OL9KGIA1 (com\ACME\Contacts\arewethereyet@td.com)
Status: Error
Message: Code(-9194) Error in vnd.nds.stream://IDVAULT-LAB/com/ACME/Drivers/IDM/SOAP-SPML/Publisher/[acme] pub-ctp-Fork off events for Workflow approval#XmlData:300 : Couldn't start workflow 'CN=Client Contact,CN=RequestDefs,CN=AppConfig,CN=User Application,CN=IDM,OU=Drivers,O=ACME,dc=com' for recipient 'cn=arewethereyet@td.com,ou=Contacts,o=acme,dc=com': java.rmi.RemoteException: HTTP 403 Forbidden
Realized initially that it was because we had never installed and run the Roles and Resource driver that manages the Roles ACLs inside the User Application environment. (In fact it turns out that after you install a fresh Roles Based Provisioning System, with the Roles Services driver, it is actually the first time that you access the User Application web page, which starts up the WAR and gets the processes spinning that finally generates the events to be sent back to eDirectory for the Roles and Resources driver to be processed).
Once we did that, we still did not have any success, but this time we got a further hint. The User App log (server.xml for JBoss) shows:
2010-06-09 13:54:55,422 ERROR [com.novell.common.auth.JAASManager] (http-0.0.0.0-8080-6) Login failed for user: CN=dev-test@us\.acme\.com.OU=Users.O=ACME.dc=com
Hmmm, the user did not succeed at logging in. By the way, to ruin the suspense, the answer to the root cause of this issue is actually in that line above, and my colleague (Hey Sai!) noticed, pointed it out, and I brushed him off, so hats off to you Sai, you got it, and I totally missed it! Anyway, if you see it, don't tell, wait till the end!
Back to the problem at hand, So we had a 403 error, we definitely were missing the Roles, but we added them back successfully, What else might cause a 403 error? Well the JBoss log (server.xml) showed that our user failed to login, when starting the workflow. That makes sense, if you fail to login, you get a 403 access denied page. So what would cause a failed login? Well password seems like the next obvious issue.
What I was doing in the Start Workflow token, was querying the directory for the nspmDistributionPassword of the user specified, and using that. In fact I was trying to be cute. In order to login to the User Application in the Start Workflow token, you need to specify the user DN in LDAP format. Well this is definitely a pain point in Identity Manager, and the ParseDN token ( ParseDN Token in Identity Manager and Some of its Limitations,
Examples of using the ParseDN Token in Identity Manager ) while pretty cool and powerful is not a super hero, and when going from a backslash formatted DN, cannot generate missing information, like what is the naming attribute for each node? (I.e. the cn=, ou=, o=, or dc=). What you need to do is Query for the user into a nodeset variable, then use the XPATH of $Variable/@qualified-src-dn and the use the Parse DN token to convert from qualified slash format to LDAP format. Since I needed to get the password of said user anyway, I figured save a query, and do it all at once. Then when I need the password, the XPATH of $Variable/attr[@attr-name='nspmDistributionPassword'] will get it for me, no need for any additional querying. I thought I was being clever...
Of course, that only works if the user has a Universal Password set. So maybe we do not have UP set? Well when it comes to diagnosing these sorts of issues there is only one place to look! That is Jim Willeke's Dump Up tool at:
http://ldapwiki.willeke.com/wiki/DumpEdirectoryPasswordInformationTool
Well first run of the tool shows, no Universal Password set, since I forgot to assign a UP Policy to this users container. 1697/16049 errors are correct errors in this case.
Here is the output of the UP diagnostic tool showing the error:
dn: cn=dev-test@us.acme.com,ou=Users,o=ACME,dc=com
Password: NMAS is enabled on the server but Universal Password is not enabled for the Entry. (-1697)
Password policy assigned to user: NMAS Code: (-16049)
Does Current password meet password policy assigned to user? NMAS Return Code (-16049)
===> Password Status <===
==> Universal Password <==
Is UPwd Enabled: false
Is the UPwd history full: false
Does UPwd match NDSPwd: false
Does UPwd match SimplePwd: false
Is UPwd older than NDSPwd: false
==> Simple Password <==
Is Simple Password Set: false
Is Simple Password Clear Text: false
Does Simple Password match NDSPwd: false
==> Account Status <==
Is Entry Account Disabled: FALSE
Is Account Intruder Locked: FALSE
Login Time: 20100609181727Z
Added a Universal Password policy and logged into iManager as the user, which takes the NDS password and sets the UP to match, which is darn useful. . But I did not change the Password Policy to allow the admin user to retrieve the password, so I get a 1659 error since I have no rights to read it, then I realized I had set the password to 123456, but the password policy requires a minimum of 8 characters (alphas or numerics) so that is not yet quite right. So I get a 1659 error due to lack of rights to see the password, and 216 that the password does not match the password policy.
dn: cn=dev-test@us.acme.com,ou=Users,o=ACME,dc=com
Password: Requestor does not have sufficient rights to perform operation. (-1659)
Password policy assigned to user: cn=User,cn=Password Policies,cn=Security
Does Current password meet password policy assigned to user? NMAS Return Code (-216)
===> Password Status <===
==> Universal Password <==
Is UPwd Enabled: true
Is the UPwd history full: false
Does UPwd match NDSPwd: true
Does UPwd match SimplePwd: false
Is UPwd older than NDSPwd: false
==> Simple Password <==
Is Simple Password Set: false
Is Simple Password Clear Text: false
Does Simple Password match NDSPwd: false
==> Account Status <==
Is Entry Account Disabled: FALSE
Is Account Intruder Locked: FALSE
Grace Logins Remaining: 6
Login Time: 20100609182233Z
The 216 error means the password is too short so I set it to 12345678 to make it long enough, lets see if that does it. One more test through the Universal Password Diagnostic tool (How can you NOT love this thing?)
dn: cn=dev-test@us.acme.com,ou=Users,o=ACME,dc=com
userpassword: 12345678
Password policy assigned to user: cn=User,cn=Password Policies,cn=Security
Does Current password meet password policy assigned to user? true
===> Password Status <===
==> Universal Password <==
Is UPwd Enabled: true
Is the UPwd history full: false
Does UPwd match NDSPwd: true
Does UPwd match SimplePwd: false
Is UPwd older than NDSPwd: false
==> Simple Password <==
Is Simple Password Set: false
Is Simple Password Clear Text: false
Does Simple Password match NDSPwd: false
==> Account Status <==
Is Entry Account Disabled: FALSE
Is Account Intruder Locked: FALSE
Password Expiration Time: 20110608182600Z
Grace Logins Remaining: 6
Login Time: 20100609182233Z
Well the password looks good now. No errors, meets password policy, so the access denied (HTTP 403) error cannot be related to this issue any longer.
However, even after cleaning this up, we still see the same error in User Application JBoss (server.xml) trace, and it still has the failed login line, but now I see an extra line:
2010-06-09 14:23:12,037 INFO [com.novell.pwdmgt.util.PasswordHelper] (http-0.0.0.0-8080-7) [Login_Failure] CN=dev-test@us\.acme\.com.OU=Users.O=ACME.dc=com failed to log in.
2010-06-09 14:23:12,037 ERROR [com.novell.common.auth.JAASManager] (http-0.0.0.0-8080-7) Login failed for user: CN=dev-test@us\.acme\.com.OU=Users.O=ACME.dc=com
So that tells me that setting the Universal Password had an affect of some kind. What next to diagnose a failed login? Well logins to eDirectory can show up in a couple of places. In the case of a CIFS or AFP login NMAS (Novell Modular Authentication Module) the component that allows different password types, smart cards, tokens, etc to work is involved. I worked through an example of this a while back that you can read to see some of those trouble shooting steps:
Example of Troubleshooting AFP NMAS Issues
So NMAS trace will probably be at least somewhat interesting, lets enable that. Next question to consider is, how does the Start workflow token connect and login? Well we know it is something of a wrapper to a SOAP calling the exposed User Applications API interfaces. Actually, we kind of know that from the logging in JBoss log, since to see any events, we need to enable DEBUG logging in the class: com.novell.soa.af.impl.soap.ProvisioningImpl which is pretty clearly a hint that SOAP is somehow involved. Well how does the SOAP call login from the User Application side, once it gets processed? Well we have to pass the DN's in LDAP format for it to work, so that is pretty much as good a hint as you are going to get.
Thus our next step, is to turn on LDAP/NMAS in Dstrace on the replica the User Application is pointed at, and see whats going on there. This is a huge advantage eDirectory has over some of the other LDAP directory servers. The ability to see what is being queried for at so many levels (NMAS, LDAP, DirXML, etc) make debugging complex issues like this so much more doable!
Of course, you need to be sure that you are watching on the right replica. You will not see much of anything on if you are watching on the wrong replica.
When it comes to troubleshooting these kinds of problems, I spend lots of time in DSTrace watching to see what is happening under the covers to try and understand what the issue is. Another nice example of doing that was in this article, where I watched how the SAP GRC application does LDAP queries to try and figure out why it was not working:
Troubleshooting an External LDAP Application (SAP GRC)
Anyway, here is what I was seeing in DSTrace:
14:30:43 33BD8950 NMAS: Successful get distribution password for dev-test@us\.acme\.com.Users.ACME.com
14:30:44 31FE4950 LDAP: (10.1.1.2:48540)(0x0006:0x63) DoSearch on connection 0xfb8f380
14:30:44 31FE4950 LDAP: (10.1.1.2:48540)(0x0006:0x63) Search request:
base: ".com.OU=Users.O=ACME.dc=com"
scope:0 dereference:0 sizelimit:0 timelimit:0 attrsonly:0
filter: "(objectclass=*)"
attribute: "srvprvUUID"
attribute: "givenName"
attribute: "sn"
attribute: "mail"
attribute: "srvprvPreferredLocale"
14:30:44 31FE4950 LDAP: Illegal ndsname ".com.OU=Users.O=ACME.dc=com" in ldap2uNDSDN, err = illegal ds name (-610)
14:30:44 31FE4950 LDAP: ldap2uNDSDN ldapDN = ".com.OU=Users.O=ACME.dc=com" - error illegal ds name (-610)
14:30:44 31FE4950 LDAP: (10.1.1.2:48540)(0x0006:0x63) nds_back_search: ldap2uNDSDN failed with err illegal ds name (-610)
14:30:44 31FE4950 LDAP: (10.1.1.2:48540)(0x0006:0x63) Sending operation result 34:"":"NDS error: illegal ds name (-610)" to connection 0xfb8f380
14:30:45 33BD8950 NMAS: Successful get distribution password for dev-test@us\.acme\.com.Users.ACME.com
14:30:46 320E5950 LDAP: (10.1.1.2:48541)(0x0006:0x63) DoSearch on connection 0xfb8e9c0
14:30:46 320E5950 LDAP: (10.1.1.2:48541)(0x0006:0x63) Search request:
base: ".com.OU=Users.O=ACME.dc=com"
scope:0 dereference:0 sizelimit:0 timelimit:0 attrsonly:0
filter: "(objectclass=*)"
attribute: "srvprvUUID"
attribute: "givenName"
attribute: "sn"
attribute: "mail"
attribute: "srvprvPreferredLocale"
14:30:46 320E5950 LDAP: Illegal ndsname ".com.OU=Users.O=ACME.dc=com" in ldap2uNDSDN, err = illegal ds name (-610)
14:30:46 320E5950 LDAP: ldap2uNDSDN ldapDN = ".com.OU=Users.O=ACME.dc=com" - error illegal ds name (-610)
14:30:46 320E5950 LDAP: (10.1.1.2:48541)(0x0006:0x63) nds_back_search: ldap2uNDSDN failed with err illegal ds name (-610)
14:30:46 320E5950 LDAP: (10.1.1.2:48541)(0x0006:0x63) Sending operation result 34:"":"NDS error: illegal ds name (-610)" to connection 0xfb8e9c0
Lets pick that apart. Note the first line: As I described above, in policy I was getting the users nspmDistributionPassword via an IDM Query token, and using it in the Start Workflow token, so you see that request for to read the password (even though in a simple IDM Query token) succeed as:
14:30:43 33BD8950 NMAS: Successful get distribution password for dev-test@us\.acme\.com.Users.ACME.com
Now note the escaped periods in the user name (For this case we are using email address as the naming attribute, and thus always has a period in it).
Next note that the UA queries for the user as:
14:30:44 31FE4950 LDAP: (10.1.1.2:48540)(0x0006:0x63) Search request:
base: ".com.OU=Users.O=ACME.dc=com"
scope:0 dereference:0 sizelimit:0 timelimit:0 attrsonly:0
filter: "(objectclass=*)"
attribute: "srvprvUUID"
attribute: "givenName"
attribute: "sn"
attribute: "mail"
attribute: "srvprvPreferredLocale"
Gal dang it! It took the escaped periods (\.) and improperly parsed it.
Ok, is this User Application, NMAS, or LDAP with the issue? What can I look at next? What can I change to try and isolate the issue down to one of the many components. Well first easy way to approach it is to change the input data, and see how that affects the results.
Lets start with a simple change. Lets get rid of the extra escaped periods and see if that makes a difference. Changed my Start Workflow IDM token and added a replace all token of \\\. to be replaced with \. (\\ for the backslash \. for the period and replace it with a period (\.), gotta love escaping things. This reminds me on VAX/VMS when you wanted to send email to the "Internet" before Al Gore invented it, and you wanted to alias email addresses it was in the format IN%"""geoffc@something.com""" where you needed three double quotes to escape the double quotes. Ah the good old days! Though that is as far back as I go, no punch cards in my lifetime.).
Now we see:
14:36:10 327EC950 NMAS: Successful get distribution password for dev-test@us\.acme\.com.Users.ACME.com
14:36:11 321E6950 LDAP: (10.1.1.2:46019)(0x0007:0x63) DoSearch on connection 0xfb7a040
14:36:11 321E6950 LDAP: (10.1.1.2:46019)(0x0007:0x63) Search request:
base: "CN=dev-test@us.acme.com.OU=Users.O=ACME.dc=com"
scope:0 dereference:0 sizelimit:0 timelimit:0 attrsonly:0
filter: "(objectclass=*)"
attribute: "srvprvUUID"
attribute: "givenName"
attribute: "sn"
attribute: "mail"
attribute: "srvprvPreferredLocale"
14:36:11 321E6950 LDAP: (10.1.1.2:46019)(0x0007:0x63) Cannot resolve NDS name 'CN=dev-test@us.acme.com.OU=Users.O=ACME.dc=com' in ResolveAndAuthNDSName, err = no such entry (-601)
14:36:11 321E6950 LDAP: (10.1.1.2:46019)(0x0007:0x63) Base "CN=dev-test@us.acme.com.OU=Users.O=ACME.dc=com" not found, err = no such entry (-601)
14:36:11 321E6950 LDAP: (10.1.1.2:46019)(0x0007:0x63) Sending operation result 32:"":"NDS error: no such entry (-601)" to connection 0xfb7a040
14:36:11 327EC950 NMAS: Successful get distribution password for dev-test@us\.acme\.com.Users.ACME.com
14:36:12 320E5950 LDAP: (10.255.3.52:52738)(0x000e:0x42) DoUnbind on connection 0xef3f9c0
14:36:12 320E5950 LDAP: Connection 0xef3f9c0 closed
The 610 error should have clued me in at this point, but I was still oblivious at this point.
The escaped DN is used by the IDM token (I query for the user, get the @qualified-src-dn, Parsed DN from qualified slash format to qualified dot to get proper LDAP format, and then do my replace all:
<do-set-local-variable name="INITIATOR-LDAP" scope="policy">
<arg-string>
<token-replace-all regex="\\\." replace-with="\.">
<token-parse-dn dest-dn-format="qualified-dot" src-dn-format="qualified-slash">
<token-xpath expression="$INITIATOR/@qualified-src-dn" notrace="true"/>
</token-parse-dn>
</token-replace-all>
</arg-string>
</do-set-local-variable>
But now we see that LDAP needs the escaped DN to bind. But when the User App tries to login using the escaped version, it looks like it is truncating the DN down to just the .com at the end of the email address. thats odd.
Well lets see if this works with a simpler user name. So lets ditch the email address naming for a moment, and try just a simple user name, so rename the test user to dev-test instead of dev-test@us.acme.com
JBoss server.xml log shows:
2010-06-09 15:01:13,195 ERROR [com.novell.common.auth.JAASManager] (http-0.0.0.0-8080-7) Login failed for user: CN=dev-test.OU=Users.O=ACME.dc=com
This should have clued me in as well, something goofy is wrong with my DN I am passing if it is failing even with a simple username. In other words it is not the email address part that is making it fail, but I was too blinded by the trees to see the passing forest of Ent's walking by (Hey a LOTR reference!).
LDAP and NMAS DSTrace shows it clearly to me this time.
15:01:10 327EC950 NMAS: Successful get distribution password for dev-test.Users.ACME.com
15:01:11 FFFFFFFFADE93950 LDAP: New TLS connection 0xfb8f380 from 10.1.1.2:42609, monitor = 0x33cd9950, index = 2
15:01:11 33CD9950 LDAP: Monitor 0x33cd9950 initiating TLS handshake on connection 0xfb8f380
15:01:11 322E7950 LDAP: (10.1.1.2:42609)(0x0000:0x00) DoTLSHandshake on connection 0xfb8f380
15:01:11 31EE3950 LDAP: (10.1.1.2:42609)(0x0002:0x63) DoSearch on connection 0xfb8f380
15:01:11 31EE3950 LDAP: (10.1.1.2:42609)(0x0002:0x63) Search request:
base: "CN=dev-test.OU=Users.O=ACME.dc=com"
scope:0 dereference:0 sizelimit:0 timelimit:0 attrsonly:0
filter: "(objectclass=*)"
attribute: "srvprvUUID"
attribute: "givenName"
attribute: "sn"
attribute: "mail"
attribute: "srvprvPreferredLocale"
Am I totally dumb and using qualified dot format instead of LDAP in ParseDN? Can it really be that simple? I even showed it in the sample code above. Sai saw this first thing, and I brushed him off. Haha, should have listened to him. Would have saved me a lot of time.
Good grief! PEBKAC. PICNIC, GIGO! Oh dear, how silly do I feel now?
You can see it in my Parse DN token line here:
<token-parse-dn dest-dn-format="qualified-dot" src-dn-format="qualified-slash">
My destination format is qualified-dot not LDAP! (Or perhaps qualified-comma, but that is called LDAP format). Oh man, that was silly.
Nonetheless I think that the process of working through the process of figuring out how I screwed up is still worth it as you can see each step along the way the steps I took to try and isolate the problem.
Sometimes the silliest most obvious mistakes just kill you. This is where the second set of eyes can be so powerful. My friend saw it in his first glimpse of the problem, and I was so deep into it, I could not see it until an hour or two later.
I have run into this issue time and time again, where just the act of asking someone to come look at the code, trace, or whatever, and try to explain what is going on, causes them to look at it, and in the first minute or two ask why you did it this way, which is the core issue you could not see.
I have a collection of other Start Workflow errors that I think I will see if I have enough for another article, so there may be a part 3 to this! Stay tuned!