iManager 3.2 displaying at a slow pace after upgrade
After having enjoyed sloppily out commented AJP connector definitions and LDAP connection problems that we managed to overcome, we are still facing troublesome iManager behaviour.
We had upgraded our 4.7.2 Identity Manager system (eDir 9.1.4, iMan 3.1.4) to 4.7.4 on eDir 9.2.1 without apparent problems using the integrated IM 4.7.4 Linux installer except for an exceptionally slow performance of the new iManager 3.2.
We therefore uninstalled iManager and gave iMan_322_linux_x86_64 a shot. As before, the usual internal LDAP connections were missing. (Our iManager instance runs on a server with an identity vault replica, but without IDM engine.) netstat -na revealed 0.0.0.0 listening on port 636 — in fact, there were no problems logging in to the server and reading information with ADS —, but no other 636 connections were reported. Instead, catalina.out contained a complaint not seen before, regarding failure to create a secure SSL LDAP context.
It took us a while until we discovered, that adding the server's IP value to attribute ldapInterfaces that used to contain just "ldaps://:636" with previous iM versions, reestablished LDAP connectivity between iM and vault. Unfortunately, this still didn't prevent further iM naps.
These naps occur at several places when iM is asked to display a new frame or tab. While iM pretends to communicate with the server, nothing happens visually until finally the information requested gets displayed with normal speed. Nap duration comes remarkably close to 60s or an integer multiple. One might speculate, therefore, iM is trying something in the background without success, before giving up and resorting to a more suitable process.
Since there seem to be a couple of iM 3.2 installations around in this place, we wonder, if somebody else has seen this behaviour and knows a remedy.
Thank-you for the descriptive write-up. If you have a moment could could you elaborate a bit on "several places when iM is asked to display a new frame or tab", e.g. which link you clicked and if it might have anything to do with LDAP? iManager, for the most part, is an NCP/NDAP-using application just like eDirectory is natively, but it uses LDAP in a few places, e.g. setting Universal Passwords. Usually iManager does not open new browser tabs, but of course it has its own "tabs" across the top, and in the view of an object (e.g. a User object) and maybe those are the tabs you mean, or maybe you really do mean new browser tabs.
Considering the delays you may want to see if you can find stuck connections being attempted on the iManager server when this happens via a command like this:
> sudo /usr/sbin/ss -planeto | grep SYN
It is not impossible to see a SYN-SENT normally, but they usually only exist for a fraction of a second, so if you do see one for more than that it likely means the sought connection is not responding, and these happen to timeout after a minute or so. Perhaps your iManager instance regularly wants to talk to another LDAP service, and perhaps that service is blocked for any number of reasons; the same could happen for an NCP connection, by default on TCP 524.
I've helped a few clients of mine move to IDM 4.8 and iManager 3.2 without huge issue, and certainly without positives multiples of minutes in delays opening things, but that does not make what you are seeing impossible or even unlikely given the right conditions. We'll figure this out too.
Have a good day,
thanks for your reply.
Having done some dumping meanwhile, I doubt the issue is LDAP related. Performing the same steps (log in to iM, IDM administration -> IDM overview -> driver sets -> our driver set) using iManager 3.1.4 still in use in our production environment and our test iM 3.2, we didn't see any LDAP connection at all. Still iM 3.1.4 showed no abnormalities, while im 3.2 repeatedly went asleep.
Diving deeper with Wireshark we found the connection between client:i and iM:8443 fell silent for exactly 60s (allow for an error of a few milliseconds). In the captured case a new client:j - iM:8443 connection was established that didn't transmit any application data for another 60s before everything resumed normal operation.
During those breaks SSH traffic between iM server and client machine took place as well as a replica update. (The iM server holds an eDir replica, but doesn't run the IDM engine.) I'd therefore considered eDir replication issues, but didn't find any evidence thus far.
The "places" where iM falls asleep are not always reproducible. I suspect some caching behind the scenes. The displayment of the driver landscape after selecting the driver set in question seems to be a quite reliable spot, though.
I'll try your and Matt's suggestions later this week.
Have a fine day, too,
if you are seeing slowness specifically when you browse for a DriverSet (e.g. when you have '[root]' in the context field and you then click on something to have it find DriverSet objects in your tree) then the slowness is almost certainly because you lack a necessary index. This has been an issue for a long time, but IDM boxes auto-add a "value" index for the Object Class attribute on their engine servers, and usually iManager ends up there too, but if that is not the case for you then the lack of an index there can significantly hurt performance. Any randomness can be easily attributed to iManager choosing different servers to query; if it queries one with an index, the results are quick even with millions of objects in the tree; if one without, the results can be slow even if you only have many thousands of objects, as you would expect with any query of something from a database when a column has many values and there is no index.
Go to the 'Indexes" task within iManager and browse to your IDM engine machine; you will almost certainly see a Value index for "Object Class" on it. From there you can go to something like "Index Locations" and choose to copy the index to other servers. Put it on all servers, as it's just a good idea for many reasons like this, and I do it by default for all clients so it is very well-tested.
Have a good day,
I have meanwhile collected some more evidence. "Naps" are not 100% reproducible. That means, they occur very often before displaying a driver set, but not necessarily. Sometimes driver displayment proceeds without any noticeable delay. The same holds for other places, where "naps" have been observed. These places are not restricted to IDM related tasks, but may already occur upon login or logout, e.g.
I'm attaching five screenshots showing still images lasting for 60s each, before all window elements ("frames" as I had called them earlier without strictly referring to real HTML frames) got displayed. If you have three minutes to spare you may watch a real time mp4 starting just after login and continuing to the point, where a driver set finally appeared on screen (here including three naps). Many times, iManager displays the status "transferring data to X" while stuck, where X is the IP address of the iManager server.
Our test system had been emptied, before updating from eDir 9.1.4, IM 4.7.2, iManager 3.1.4. Before the update, these delays have never been observed. Neither in our (filled) test nor in our populated production system.
/usr/sbin/ss -planeto | grep SYN didn't produce any output when repeatedly issued before and during a nap. Providing the tree name in capital letters in Authenticate.Form.DefaultTree as suggested by Matt, didn't make any difference. The value has been written in small letters by iManager's configuration routine in our earlier installations, too, without causing any noticeable trouble.
I'm going to do simultaneous screen and tcp traffic captures to see, if some relations can be established. Moreover, I'll return to the office for once to get rid of home-office-connected VPN traffic and see, if the SSH traffic between iManager and VPN server observed during naps persists.
Did you check indexes at all?
Also a LAN/wire trace of iManager's NCP traffic at the time of a "nap" might be interesting. You may not see a lot, but if you see a query go out to eDirectory and not come back for a while that might be indicative of something, especially if "naps" are associated with connections to serverA but not serverB.
The +RECM flag in ndstrace shows things like queries and the indexes they use, so that output from whichever (including multiple) eDirectory instances are accessed by iManager during a "nap" could be useful.
Did you check indexes at all?
Not yet. I wanted to share more precise information regarding where iManager tends to get stuck first. As iM-2020-06-25-um-18.16.06.jpg depicts, even displaying the blue NetIQ iManager title ribbon may send iManager to sleep and I wonder, if there are any queries etc. involved in this task. (Indexing is on my list, anyway.)
Last week, after several weeks of absence, I've found an opportunity to return to office and capture TCP traffic without involvement of a VPN. Despite of the almost identical client (machine, OS, browser) it proved hard to record evidence at all: out of 23 sequences (login, launch the identity manager overview page, see iMan 3.2.2 SD.mp4) only three 60s naps were observed with iManager giving totally innocent performances most of the time. Back home two of the test sequences yielded eight naps lasting 480s in total.
Reverting our test system to a snapshot taken before the upgrade (iManager 3.1.4) reestablishes normal operation in the same network environment including IP addresses, eDir tree, certificates and except for iManager software, of course, client.
Traffic analysis shows a typical pattern before and after naps:
1. There are five to six connections between Tomcat:8443 and arbitrary client port (client or VPN node, resp.)
2. After a final ACK packet, all these connections fall silent
3. After 10s of silence the client sends a "keepAlive“ request that is acknowledged with normal speed (after around 40µs)
4. More „keepAlive“ requests are sent every 10s until...
5. ...60s after the transmission of "application data" has ceased, Tomcat or client issue an "encrypted alert" that get's acknowledged within µs again
6. All Tomcat - client connections are then reset, new connections are set up and communication resumes.
A detailed analysis of an example nap showed six GET requests uttered by the client that were served by Tomcat at a time. Obviously, one of the requests wasn't fully satisfied, when communication came to a stop (a JS file was only partially transmitted). The last packet sent in stage 2. was a client ACK for the last part of this JS file. 59.8s later Tomcat came up with the last part of the overdue JS file together with an "encrypted alert" of Close Notify. After the existing six connections had been substituted by three new connections the client continued with three more GET requests. Two were aiming at files not yet transmitted, but the third did request the JS file that presumably triggered the nap once again. All requested files resided on disk, neither eDirectory nor Identity Manager had to be involved in the process.
There was additional traffic on port 524 during the nap, but no timely correlation with the paused 8443 communication was obvious. NCP communication was initiated by "serverA" holding the master replica, proceeded swiftly and always ended with End Update Replica - OK packets that were finally mutually acknowledged. In one case the iManager server started to talk to itself exchanging a bunch of ACK packets, each of length 66 bytes, through lo afterwards.
Where the nap occurred at different places (when building the identity manager overview page, e.g.), additional traffic on ports 524 (internal and to the server running identity manager's engine) and 636 (internal) was observed without any indication (to my eye) that something went wrong with these connections.
Other servers apart from serverA (engine, vault, master) and serverB (iManager, vault replica) are not part of the game, nor is another eDir instance.
ndstrace with +RECM moans repeatedly
Iter #697469b0 setIndex( 261)
Iter #697469b0 query PartID==5 && Obituary$257A$
Iter #697469b0 index = Obituary$IX$261
Iter #697469b0 first( ID_INVALID)
and sometimes also
Failed to apply attribute .ACL.[Attribute Definitions].[Schema Root] to entry .Security.TREEB., failed, duplicate value (-614)
Failed to apply attribute .ACL.[Attribute Definitions].[Schema Root] to entry .TREEB CA.Security.TREEB"., failed, duplicate value (-614)
Again the timing seems not to be related to iManager naps (the former lines are also recorded when iManager is not supposed to deliver any screen updates) and equivalent messages appear in our well-behaving iManager 3.1.4 traces.
When preparing the driver set overview page both iManager versions deliver a more comprehensive query...
Iter #f15d610 query (ParentID==33112) && ((Flags&1)==1) && (((ClassID==430) || (ClassID==432) || (ClassID==433) || <++
Iter #f15d610 query ++> (ClassID==434) || (ClassID==435) || (ClassID==455) || (ClassID==462) || (ClassID==463) || ( <++
Iter #f15d610 query ++> ClassID==1009) || (ClassID==1010) || (ClassID==1011) || (ClassID==1016) || (ClassID==1020) || <++
Iter #f15d610 query ++> (ClassID==1022) || (ClassID==1023) || (ClassID==1025) || (ClassID==1027) || (ClassID==1028) <++
Iter #f15d610 query ++> || (ClassID==1031) || (ClassID==1032) || (ClassID==1034) || (ClassID==1039) || (ClassID==1041 <++
Iter #f15d610 query ++> ) || (ClassID==1042) || (ClassID==1046) || (ClassID==1048) || (ClassID==1049) || (ClassID==1055 <++
Iter #f15d610 query ++> ) || (ClassID==1057) || (ClassID==1059) || (ClassID==1062) || (ClassID==1159) || (ClassID==1160 <++
Iter #f15d610 query ++> ) || (ClassID==1161) || (ClassID==1162) || (ClassID==1164) || (ClassID==1166) || (ClassID==2362 <++
Iter #f15d610 query ++> ) || (ClassID==2368) || (ClassID==2371) || (ClassID==2376) || (ClassID==2377) || (ClassID==2378 <++
Iter #f15d610 query ++> ) || (ClassID==2379) || (ClassID==2380) || (ClassID==2409) || (ClassID==2410) || (ClassID==2411 <++
Iter #f15d610 query ++> ) || (ClassID==2412) || (ClassID==2413) || (ClassID==2414) || (ClassID==2415) || (ClassID==2417 <++
Iter #f15d610 query ++> ) || (ClassID==2423) || (ClassID==2425) || (ClassID==2427) || (ClassID==2429) || (ClassID==2430 <++
Iter #f15d610 query ++> ) || (ClassID==2432) || (ClassID==2438) || (ClassID==2448) || (ClassID==2449) || (ClassID==2450 <++
Iter #f15d610 query ++> ) || (ClassID==2451) || (ClassID==2452) || (ClassID==2453) || (ClassID==2454) || (ClassID==2455 <++
Iter #f15d610 query ++> ) || (ClassID==2456) || (ClassID==2457) || (ClassID==2458) || (ClassID==2471) || (ClassID==2472 <++
Iter #f15d610 query ++> ) || (ClassID==2473) || (ClassID==2478) || (ClassID==2479) || (ClassID==2480) || (ClassID==2482 <++
Iter #f15d610 query ++> ) || (ClassID==2483) || (ClassID==2484) || (ClassID==2485) || (ClassID==2486) || (ClassID==2495 <++
Iter #f15d610 query ++> ) || (ClassID==2497) || (ClassID==2498) || (ClassID==2501) || (ClassID==2503) || (ClassID==2504 <++
Iter #f15d610 query ++> ) || (ClassID==2506) || (ClassID==2660)))
Iter #f15d610 index = ObjectList_IX
Iter #f15d610 first( eid=65963)
Iter #111cb1f0 query ((Flags&1)==1) && (ParentID==33112) && (((((Flags&64)!=0)) || ((ClassID==447)) || ((ClassID==451 <++
Iter #111cb1f0 query ++> ))))
Iter #111cb1f0 index = ObjectList_IX
Iter #111cb1f0 first( ID_INVALID)
Iter #11618170 setOrderIndex() for attr # 31
Iter #11618170 query ((Flags&1)==1) && (ParentID==33112) && ((Object_Class$262A$.Flags&8==1166))
...that gets transmitted twice. The queries built by iManager versions 3.2.2 and 3.1.4 look almost identical, while 60s breaks before the query is logged the second time are a 3.2.2 specialty.
Unfortunately (for us), our network guy is on holiday at present, but we'll investigate further when he returns. Especially the differences between iManager versions and client - server connection (home office ./. office) might be interesting.
I've noticed problems related to LDAP SSL failures in iManager 3.1 and beyond as well and these seem to manifest themselves as very slow iManager behavior or outright plugin failure (like with the Cert Server plugins). I believe there are a few bugs/issues with iManager, but so far I've been unsuccessful in convincing support of that.
First, make sure the Cert/KMO used for each LDAP server has all the possible names that could be used to contact each server. I make sure the fully qualified DNS name of the server, the short host name, the IP address, AND the actual TREE NAME of the server are in the Subject Alternative Name (SAN) on the cert (I usually make sure the tree name resolves in DNS as well). This seemed to fix a lot of the slow issues I saw with iManger.
Second, and this has been a bug for ages, when you set the DEFAULT tree name in the iManager Server settings, no matter how you type it in, it changes it to lower case. I've complained about this for a long time, but I always thought it was cosmetic. But I discovered it was having an impact on this issue as well. If your tree name is in all CAPITAL letters you may see these SSL context errors. Try just typing in the tree name in all upper case when logging into iManager and see if that helps. To make this permanent, edit the /var/opt/novell/iManager/nps/WEB-INF/config.xml file and correct the Authenticate.Form.DefaultTree field to match your tree name case exactly. You will have to restart tomcat after that.
Those two items have solved slow iManager problems for me at NUMEROUS sites. I've opened SRs only to get back "unable to reproduce" and the SR closed. So if this is an issue for you, please be sure and open an SR!
That last bit with the RECM output is useful as it confirmed iManager was trying to query by Object Class. If that is slow both before and after the upgrade then it's because of the missing index, but I would expect it to be slow either way since, as you mentioned, the queries are almost identical and the search by object class is a big part of it in any tree. That may not be the reason for the "naps" you see in other places and at other times, but please believe me you want those indexes in place on all replica-holders that iManager might query, either for IDM or often other tasks.
Also, as a reminder, whenever you use ndstrace (including via iMonitor), especially for these types of timing issues, having +TIME and +TAGS enabled is also really useful since that would show us the time elapsed between a query going in and the results coming out, or at least the next thing traced.
Having the system stall in the middle of a file transfer seems really unusual. That you saw it less while onsite vs. not makes me wonder about an issue I'm seeing with a completely unrelated product, but it has to do with silently dropped ICMP packets (argh!) by routers/VPNs when packets get too large and for some reason the brainless network devices decide to drop rather than fragment. Anyway, VPNs do this type of thing regularly (in my experience, though my own never do) so it makes me wonder if the MTU at some point is small, the packet going across is big, something drops it, and instead of notifying everybody of the drop the ICMP packet is also dropped. That may explain some of this, maybe, and it is exactly the issue I am having here (with SCP file transfers and database connections doing large batches of INSERT operations). On the other hand, you said you can still see the "naps" while onsite, and it might be odd for you to have MTU and ICMP issues when everything is local.
Have you tried running iManager Workstation on your local machine to see if it fares any better? You can still point your browser to it even (rather than using the built in browser) by pointing to http://localhost:48080/nps and ignoring the security warnings (it's all local, so the warning is invalid).
Also, as a reminder...
Thanks for that one. As I had confessed a few problems earlier, I'm using iMonitor most of the time and had managed to temporarily forget how to include timestamps when working with ndstrace. I'm attaching the trace collected a few minutes ago. As mentioned, first ( ID_INVALID) and -614 messages are observed all the time, even before iManager is called.
Walking through the meanwhile familiar test sequence I was able to witness five naps until 18:27:33, 18:28:34, 18:31:53 (a "double" one) and 18:33:13 as reported by my home desktop until the driver overview appeared.
The JS file not being fully transmitted until the remaining 8443 connections were notified to close, was dojo.js of 264kB transmitted in packets of 16384 and 532 bytes, resp. The late packet had 16258 bytes. All sizes are for decrypted packets.
Trying iManager Workstation sounds good. I'll do that next week (I'm only working for my institution three days a week).