IDM processes events on the subscriber channel one at a time.
For each event on the subscriber channel, IDM runs through all of the policies on the channel, and only when it is done does it continue to the next event.
The following image describes the subscriber channel processing:
In environments where there are millions of events on the subscriber channel, the processing time is very long and can take weeks.
I want to propose a solution that will make the subscriber channel faster.
It would be faster if the channel were split into multiple identical subscriber channels running the same logic. This feature would allow us to process multiple events in parallel - one event per channel (thread/process).
The following image describes the offered solution:
I know that it doesn't fit all needs, but one would be able to activate this solution through a configuration setting for the number of simultaneous processing channels (the default should be 1).
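The split-channel idea can be sketched in a few lines. This is a minimal, hypothetical simulation (not IDM code): one shared event queue drained by N identical "channels", each running the same policy chain. With `channel_count=1` it behaves exactly like today's single-threaded channel; higher values assume the events are independent, since strict ordering is lost.

```python
import queue
import threading

def apply_policies(event, policies):
    # Run one event through the full policy chain, exactly as a single
    # subscriber channel would do today.
    for policy in policies:
        event = policy(event)
    return event

def run_subscriber(events, policies, channel_count=1):
    # Fan the event queue out to `channel_count` identical channels (threads).
    # channel_count=1 matches the current single-threaded behavior; higher
    # values trade strict event ordering for parallelism.
    q = queue.Queue()
    for e in events:
        q.put(e)
    results, lock = [], threading.Lock()

    def channel():
        while True:
            try:
                event = q.get_nowait()
            except queue.Empty:
                return  # queue drained, channel exits
            out = apply_policies(event, policies)
            with lock:
                results.append(out)

    threads = [threading.Thread(target=channel) for _ in range(channel_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With four channels, a backlog that would take one thread four units of wall time drains in roughly one, which is the whole point of the proposal.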
Thank you for the good suggestions and the call.
1. We have added 8 CPUs to our eDir servers and, first of all, we see an extreme performance improvement over what we initially had. Very good and efficient tip!
2. We have both legacy IDM drivers and new IDM drivers. To convert groupMembership to entitlements we have done the following:
a) On the drivers where we use groupMembership, we have defined and declared a userAccount entitlement.
b) We have created a custom multivalued attribute called linkedEntitlment and added it on the security groups where it is needed, pointing to the userAccount entitlement (a reference).
c) We created a null driver with a subscriber filter on groupMembership changes; for each groupMembership change it checks the group's linkedEntitlment attribute.
d) If a linkedEntitlment attribute is found, it starts a custom workflow in userapp (start-workflow) to grant or revoke the entitlement.
e) The custom workflow in userapp has a very simple process: start - entitlement activity (1|0) - finish.
In this fashion we are able to support all legacy and new IDM drivers with entitlement support: only one null driver is listening to the groupMembership attribute (notify), and all the other IDM drivers are listening (subscribe/notify) on EntitlementRef.
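The null driver's decision logic from steps c) and d) can be sketched roughly as follows. Everything here is illustrative Python, not the real policy: `read_attr`, `start_workflow` and the workflow name are stand-ins for the vault read and the start-workflow call.

```python
def handle_group_membership_event(event, read_attr, start_workflow):
    # Sketch of the null driver's subscriber logic; all names are
    # illustrative, not the real policy/workflow identifiers.
    # event: {"group_dn": ..., "member_dn": ..., "operation": "add"|"remove"}
    refs = read_attr(event["group_dn"], "linkedEntitlment")
    started = []
    for entitlement_ref in refs:  # empty -> group not entitlement-backed, do nothing
        grant = 1 if event["operation"] == "add" else 0  # 1 = grant, 0 = revoke
        start_workflow("grantRevokeUserAccountEntitlement", {
            "recipient": event["member_dn"],
            "entitlement": entitlement_ref,
            "grant": grant,
        })
        started.append((entitlement_ref, grant))
    return started
```

Groups without linkedEntitlment fall straight through, which is what keeps every other driver off the groupMembership filter.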
Overall performance is just amazing - lightning speed!
We have a high focus on automation; we don't log in to userapp to create and map roles/resources to entitlements, as all security groups are created in eDir from external systems and are created as a role or resource in userapp (using a custom REST driver).
Sounds good. Would you mind if we discuss the details on a Teams call? It would be good to talk it through with you so we understand correctly what you mean. If you have time for a short call with us, I can send you an invitation privately.
Personally, I don't think the RRSD is being bottlenecked; I would rather look at how a driver can process events with "reference"-type attributes such as groupMembership, or any attribute of type "reference".
If scope-filtering these events in the sub-etp is the solution, we are doing that already and we are fine with it, but I would rather my drivers not listen to groupMembership at all when the groups are not relevant for the driver.
I understand the desire to limit complexity. By taking group membership into a null services driver to do a do-add-role or do-add-resource so you can add the comment, you are essentially doing the exact same process that the RRSD would be doing. The only difference for performance is that the RRSD can be multi-threaded. To replicate this increase in throughput, you could create cloned copies of this driver (or simply have this driver run on more than one server with different GCV settings). If you run it on multiple servers, you are essentially just setting it to automatic startup on some number of IDVs in your replica ring, but a simple GCV difference will ensure that you process different populations separately.
The logic to use for multithreading is really environment specific. Many of my client environments are using flat placement, so where they exist in the tree is not an option. In the cases where we have a segmented tree, I have to avoid mass changes at a single location in the tree from bogging the environment down, so I still prefer to leave that alone. What I've really found to work best is to segment on entryId, which is pseudo random within the tree itself, but be careful if there are event-specific dependencies, which does not sound like the case here.
I think you could get some serious throughput boosts by just doing this. It'll be either copies of a driver (clones) or just running on multiple servers (setting up new Remote Loader instances at worst), both of which are pretty simple changes to make to get massive performance boosts at your biggest bottlenecks.
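The GCV-based split could look something like this sketch. The function name and parameters are assumptions, but it mirrors the idea: each clone (or each server's copy of the driver) reads a distinct partition-index GCV and vetoes, right at the top of the subscriber channel, any event that belongs to another partition.

```python
def handles_event(entry_id: int, partition_index: int, partition_count: int) -> bool:
    # Hypothetical GCV-driven check for a cloned driver: the clone with
    # partition_index GCV = p keeps only events whose entryID falls in its
    # slice and vetoes the rest. entryID mod N is pseudo-random within the
    # tree, so mass changes spread roughly evenly across the clones.
    return entry_id % partition_count == partition_index
```

Because the partitions are disjoint and cover everything, no event is processed twice and none is dropped, regardless of how many clones you add.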
As NetIQ moves towards IGA 6.0 where Identity Governance and IDM are merged into a single product, it could provide you with some avenues for significant assistance. Not saying to dump what you have, but you may inherit some functionality in your IDM license that helps enhance that process.
How much work have you done to tune userapp? Have you tried some of the simple stuff in setenv.sh yet? There are also some indexes that are often missed during configuration and can really burn you.
I believe your roles process may gain some IG functionality in the 6.0 release, but I am not sure if NetIQ is going to require an additional license for it. Over in IG, you actually have an approval and lifecycle process built into the role creation/modification process, as well as the ability to regularly review the contents of those roles. While IG is a tool specifically intended for your use case, if you have a process that satisfies your compliance requirements and is working, I'd say leave it as is and just try to get your scalability gains.
Are the drivers handling those groups and assigning roles and resources your primary bottleneck? If so, try splitting one into two drivers and see if it doubles your throughput (give or take). If it doesn't, is userapp a bottleneck? I'd be interested in what performance tuning you've tried on userapp.
Thanks for your review of my comments. We simply want the lowest possible footprint for IDM, to reduce the total cost of ownership as much as possible, and to get more out of the platform through tuning and tweaks. We had a two-tree/two-vault design in the past and have now consolidated into one, for simplicity, licenses (arguments with Novell/NetIQ) and other operational reasons.
We cannot add a role to a group from the identity application, because every role and resource assignment we document must have a "reason" for the review process. All our access policy (automation) rules are defined in the IDV using our custom concept, combined with RBE, and we have many mgmt (null) drivers which do add-role, remove-role, add-resource and remove-resource with a "reason" text explaining why (either an RBE rule or some other logic is the requester), so that the assignment shows up correctly in the identity application. Otherwise it would show up as a "direct assignment" in userapp, which would give no clue to the access review process for the main question: why?
We have a custom access review solution which reads that info from userapp to show it to the reviewer. Our custom access review is built on top of the internal issue management system based on Jira, which is already used by business users in the company. We have done some custom integration which scans roles and their assignments with the "reason" text from userapp, creates Jira issues, and assigns them to the role owner in Jira for the review process.
The userapp is really slow and unstable for us, and we do not know how to tune it further. All role/resource assign/revoke operations are carried out by mgmt (null) drivers running in IDM, which use the userapp role/resource web services for create, update, setting the owner (delegated admin), the review process, etc. The calls to the identity application (userapp) web services are made frequently, as we have a high rate of change on users from HR; we are in a pure retail business where people change or move very often.
We mainly use userapp to document roles/resources for access review purposes, and for custom workflows (approvers) where needed programmatically. Our business users hardly ever log in to userapp, only when there is a manual access request.
We also have a mature role lifecycle process. Our roles/resources are born in source systems such as SAP and then fed into IDM, which publishes them into userapp with the correct owner and permissions. When roles expire in the source systems, they are removed from IDM and from userapp - a fully automated process.
I think you have done a great job of tuning the system to maximize throughput; however, the primary limitation you are hitting is the one the original poster was discussing: your drivers are single threaded and process all events in the order they are received. In your example, if you throw 1000 modifies (group object add member), you are essentially creating a queue of 1000 events on each driver that has that group in its filter.
The first step, which you have already completed, is to put a policy right at the beginning to drop the events the driver shouldn't consider (because it doesn't care, why waste cycles?). How to scale from this point is where, as you stated, you need to get somewhat creative in your architecture.
One scalability thing I've seen done is to segment out into multiple Identity Vaults, connected by eDir-to-eDir drivers (or the newer driver that only requires one). This can have some scalability benefits because you can now segment your tree and potentially run multiple, parallel trees that are kept in near real-time sync. This can allow separate trees to have different subsets of schema, reference objects and even users. One sample implementation of this would be to set up a few additional vaults and synchronize a few groups to each. The user objects will only be created in the new vaults if they are a member of one of the groups that is hosted in that vault. This should help reduce the load of creating events on hundreds of drivers within the IDVault engines by offloading to (potentially) much smaller vaults with fewer users and fewer total events.
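The group-based routing described above could be sketched like this; the vault names, group DNs and the function itself are made up for illustration, but the rule is the one stated: a user lands in a satellite vault only if they belong to a group hosted there.

```python
def vaults_for_user(user_groups, vault_map):
    # vault_map: vault name -> set of group DNs that vault hosts.
    # Returns the satellite vaults whose eDir-to-eDir driver should create
    # this user, keeping each small vault limited to its own population.
    return sorted(v for v, groups in vault_map.items()
                  if set(user_groups) & groups)
```

Users in none of the hosted groups stay in the main vault only, so each satellite engine sees far fewer objects and events.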
The next item to consider is eliminating the single threaded design of each of the drivers. Right now, in your example of 1000 users being added to a group, a single thread in a driver must run through each of the 1000 events and process it to the shim. If you were to split your user population in half across two drivers (or two vaults, two driversets, etc), this would then allow 2 separate threads to process that same queue of events.
If we use the two-driver sample, you would create a copy of your existing driver (I recommend having them share policies in a library for ease of maintenance). On each driver, you can simply set a GCV for which population to handle. Some simple examples are to parse based on the first character of the CN (A-M on driver 1, N-Z on driver 2) or to do a mod on the entryID (odd numbers on driver 1, even numbers on driver 2). You'll probably never split it perfectly, but based on dividing it up, you theoretically would get 500 on driver one and 500 on driver two, thus processing the same batch with two threads. You can continue this parallelizing exercise until you hit the throughput target you are looking for.
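Both partitioning schemes can be made concrete in a short sketch. The function names and the entryID values are illustrative; the point is that the 1000-event batch splits into two roughly equal queues, one per driver clone.

```python
def driver_for_cn(cn: str) -> int:
    # First-character split: A-M -> driver 1, N-Z -> driver 2.
    return 1 if cn[:1].upper() <= "M" else 2

def driver_for_entry_id(entry_id: int) -> int:
    # Modulus split: odd entryIDs -> driver 1, even -> driver 2.
    return 1 if entry_id % 2 == 1 else 2

# A 1000-member group add, divided between the two clones: each clone's
# starting policy vetoes members assigned to the other, so two threads
# now drain what used to be a single queue.
members = [(f"user{i}", 40000 + i) for i in range(1000)]
queue1 = [m for m in members if driver_for_entry_id(m[1]) == 1]
queue2 = [m for m in members if driver_for_entry_id(m[1]) == 2]
```

The entryID variant tends to balance better than the CN variant, since first letters of names are anything but uniformly distributed.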
The last thing you can do is evaluate your policies and process for handling those batches. Depending on how each shim works, you may want to handle your processing separately. Some simple global things in IDM: use operation-attribute instead of attribute where possible to reduce queries back to the vault; each millisecond you shave off each event adds up pretty quickly at scale. How you handle those individual events to the shims matters as well. Back to your group modification adding 1000 users: when you add them in a single operation, you may see a single event with 1000 adds to a group object. If the remote system can batch the operation, why split it out? We may simply be able to go from a batch to a batch, depending on the shim we are using.
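The batch-versus-split decision can be sketched as below; `send` stands in for the call to the shim, and the capability flag is an assumption about the connected system.

```python
def forward_member_adds(member_dns, shim_supports_batch, send):
    # If the connected system accepts a batched membership update, keep the
    # single 1000-value modify as one call to the shim (batch to batch);
    # otherwise fall back to one call per member.
    if shim_supports_batch:
        send({"op": "modify-group", "add-members": list(member_dns)})
        return 1  # one round trip instead of a thousand
    for dn in member_dns:
        send({"op": "modify-group", "add-members": [dn]})
    return len(member_dns)
```

At 1000 members, even a few milliseconds saved per round trip is seconds saved per group modify, which is exactly the per-event shaving described above.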
I am a little curious as to your comments on the user application being slow to process. The roles and resources driver has had quite a few modifications in the past few years to really get moving on its operations, including multithreading support, similar to the discussion about parsing the events into separate drivers, but it handles it within a single driver. The benefit of using RRSD + UserApp is that you can enter situations where those group memberships can directly assign resources - which then translate into direct entitlement grants on specific drivers, thereby eliminating drivers that do not need to see an event from actually seeing the event.
If you combine RRSD + UserApp with the multi-IDVault or multi-driverset architecture, many of your issues outside of single threaded drivers can be mitigated much more easily. Back to the examples: suppose you have a single RRSD with a userapp cluster, plus some null services drivers in the vault. An nrfRole can be assigned to the appropriate group. The resources attached to this role can either be for an entitlement on specific drivers in this vault, or can be assigned to a null services driver. That entitlement grant can then call a do-assign-resource in a separate vault. If you have no need for the request interface, you probably don't need to do cross-vault calls, but it is an option if you need it. The beauty of this architecture is that the RRSD, which is already multithreaded, can split those group membership events onto multiple threads for processing. The resulting operation is to grant the resource(s) and subsequent entitlement(s) directly on the driver you want, without requiring processing of any logic other than transforming the entitlement value for the shim (and processing a synthetic add if necessary).
In short, depending on where your bottlenecks are and your desired throughput, you can architect a way to use parallel processing in all of the right places to achieve some pretty insane throughput. We have a client with over 1200 drivers, 700k users and well north of 5 million objects and unacceptable queues do not occur. We simply evaluate the required throughput and use some architectural prowess to ensure we can achieve that throughput.