Archiving Data in Identity Governance 3.5
There is a relatively new product in the Identity and Access Governance space from Micro Focus called Identity Governance. While it is at version 3.51 as I write this, it is still early in its life cycle and developing pretty rapidly.
One of the key differences between IDG and IDM (Identity Manager) from Micro Focus is their approach to accessing data. IDM as we know and love it is event driven, with the data mostly stored in eDirectory (and in the connected systems' native data stores). But the metadirectory, the Identity Vault running on eDirectory, is there with most of the data collected in one place.
Thus when we look at the data it is always live (or very close to it, perhaps some queued events are still processing while you look).
However, you should not really validate that a toolset is working by using that same toolset, since a failure in one part is likely to be masked, during validation, by the failure itself.
Thus Validator for IDM, the testing tool Micro Focus supports, is a standalone tool that uses a different approach than IDM itself.
IDG similarly is a different tool set (even if it shares a GUI and Single Sign On mechanism with IDM) that approaches the data issue differently.
In IDG, the data is collected from the various sources on some sort of schedule, stored in a database, perhaps transformed, and finally published into a final view that can be used for various purposes.
While you might point IDG at your Identity Vault, the collected data is mostly static or polled, and almost always out of sync. One caveat is that the Identity collectors in IDG for eDirectory and Active Directory have mechanisms to receive events that keep the data more up to date. However, the Account collectors (tracking permission to access systems, and permissions within systems) do not support this more lifelike view of the data.
As a consequence, the main database in IDG is not eDirectory (as it was in IDM) but rather an external SQL-ish database (PostgreSQL, MS SQL Server, or Oracle).
I mentioned before that the data from Identity sources, Accounts, permissions and what not are collected and then published. This is purposefully a two step process. First the data is imported. Only once the import is complete is the data pushed to the front-facing view, at which point it is "Published".
Along the way, the collected data is maintained, until it is cleaned up. This means your database is very likely to grow and grow and grow. Especially as you develop the solution and test multiple tweaks, collecting and publishing each iteration of a change.
In IDG 3.0x there was a script in the idgov/bin directory that would allow you to perform database maintenance. The idea was that, for each of the 'things' stored in the database, eventually you would want to clean them up. Now of course, if you wish to report on some of these things, you probably do NOT want to clean them up quickly, or perhaps at all.
Some things are much smaller in size than others, and persisting them for a long period of time is not expensive in terms of database space and might be useful to track in reports later. These are tradeoffs that will differ with each site.
However, some general rules are obvious. A collection of 100,000 identities, each time it runs through, will add at least 100,000 rows to the database just for the Account information. If there are many thousands of Permissions (reflected as Groups perhaps in the application), then the groups and their memberships can potentially add many tens of thousands more rows.
The thing is, you only really ever see the current published view of the data, the previous collections mostly sit there doing very little. I have not specifically looked, but I could imagine a use case, where you might want to be able to report on the history of changes in a particular object through time, which could require keeping many older collections. However I do not think this is the best tool in which to store that information.
When you ran the command line tool you got a number of options in a menu, and you could schedule it to run via cron as needed, passing in the parameters you wanted. There was a config file in which you could define how you wanted it to run, which made scheduling the script possible.
In Identity Governance 3.5.x this script is no longer present and is instead part of the GUI under the Data Administration menu item as maintenance.
New in 3.5.x is the idea of archiving of data. That is, before you delete it, you can archive it. In this case, it copies the data from the igops database into the igarc database. (In case you were wondering, igarc is a new schema/database in 3.5.x for this purpose).
It was unclear to me from the docs what one is expected to do with this archived data. In discussions with some folk we agreed in principle it is cool to archive your old data, and maybe move it off to somewhere else. However, it was unclear how you would use the igarc database to possibly restore data or recover. Thus if you archive, it appears you just shuffle the deck chairs on your disk usage.
You are not required to use archiving in IDG 3.5.x and can disable it, but when you do you get a warning that at first seems somewhat ominous:
Disabling archiving deactivates database triggers that capture database updates. Although this may speed up various activities that do many database updates (collection, publication, starting a big review, etc.), it also clears all of the data currently in the archive database, because it is not possible to keep it current when archiving is disabled.
Are you sure that you want to disable archiving?
This warning is interesting in that it seems to say it will do that which I want (speed things up, clear the archive DB (igarc)) but somehow paints this as a negative. I fully believe more thought and information is required on this topic before I recommend any actions. But the more you know, the better informed you will be to decide for your organization.
The command line version would query for the stats of what could be deleted. In the GUI version, if you wait a moment or three, you get a display with refresh links, like this:
Tables changed since last archive: 56, Deletes: 640027, Updates: 2223675
Total entities that can be cleaned up: 10, Entity Types: 3
This informs me how much data has changed (640K deletes!!! 2.2 million updates?!!!) since I last ran an archive. I did clean up my system and scope my identities down from 150,000 users to 23,000 (Ditched the darn students!) so the numbers make some sense. (There were lots of group memberships as well).
You can click on the two lines to see the summary of the changes.
Here is the Table changes view snipped a bit.
Then in the Entities that can be cleaned up link you get this view:
You can see that the various entities are linked and you can dig down into them, so I looked at Snapshots, and you can see dates, but not much else here:
When you wish to clean up, if Archiving is enabled, first you archive (which can take more than an hour), then you delete. This locks out access to IDG while it is running, and only a Global Administrator or Database Administrator can actually log in while this is going on.
Alas, you may note that there is no way to schedule the database maintenance to run (perhaps overnight on weekends?) to keep the database trim. I have an idea that we could use the REST API that IDG exposes to call the database maintenance. That is basically what the IDG 3.01 script did: it called a Java class that made the needed REST calls (and also drove the menu system). But that is a topic for another day, and it will take some work to figure out the specific REST call that needs to be made. Once that is done, I expect it should be scriptable via curl on Linux, with cron to run it on a schedule.
One thing I did not notice at first, and had to have pointed out to me, is that the maintenance page starts you on the Archive tab (there is a Cleanup tab as well), where the past Archive attempts are all listed with dates and who ran them. (Useful for blame casting, I suppose.) You can actually select an archive instance from the list and then choose to delete that specific archive instance from the igarc database. I missed that entirely.
I had noticed that if you clicked on the Completed message you got a summary of what was archived:
This shows you the time it took, how many tables were involved, how many deletes and updates in total. My example is 770K deletes, and 3.4 million updates, and 46 minutes. So that may be larger than most cases, but gives a feel for performance.
If you click on the linked Table count value, you get to see, table by table, the status and the deletes and updates that were done, as in this image:
If you select a Completed Archive from the list view there is an Action button that allows you to delete it. (Side note: there is a minor UI glitch and if your window is not large enough, you won't see the Delete button, so if you do not see it, consider resizing the window larger).
This is useful, as it means you do not have to keep archives forever; they can be trimmed. I am still waiting to hear back from the DBA on how best to measure the space usage of the databases before I experiment much further with cleanup.
This example is using Oracle 12c as the target database, and in my quick research I found the SQL statement:
select sum(bytes)/1024/1024 size_in_mb from dba_data_files;
This is supposed to show you the space on disk the database is taking. I checked before and after archiving and deleting. The DB seemed to always get bigger, never smaller with this measure. I even disabled archiving, which in theory should have thrown away all my old archives and freed up much space, yet this SQL command returned ever larger numbers. (Sort of the Hotel California for table space? You can check out but you can never leave?)
It was explained to me, as a novice in SQL on Oracle, that this measure is not quite the correct tool to use. What the dba_data_files view reports is how much raw disk space the DB instance has allocated. This is different from how much disk space is in use. The DB can dynamically grow to use more and more disk space, and if you clean up you might be using less inside the allocated disk space, but the allocated disk space stays that large until a DBA shrinks it. I was also informed that the IDG interface to SQL does not have the ability to shrink the space, so talk to your DBA about that issue.
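To make the allocated versus in-use distinction concrete, here is a sketch of a query that puts the two side by side for each tablespace. It assumes your account can read the DBA dictionary views dba_data_files and dba_free_space, and it has not been run against this IDG install, so treat it as a starting point to review with your DBA rather than a verified answer.

```sql
-- Sketch only: compare space allocated on disk with space still free inside it.
-- Assumes read access to the Oracle DBA dictionary views.
select df.tablespace_name,
       round(df.allocated_mb)                         as allocated_mb,
       round(nvl(fs.free_mb, 0))                      as free_mb,
       round(df.allocated_mb - nvl(fs.free_mb, 0))    as used_mb
from (
       -- Raw space the data files occupy, per tablespace
       select tablespace_name, sum(bytes)/1024/1024 as allocated_mb
       from dba_data_files
       group by tablespace_name
     ) df
left join (
       -- Space inside those files not currently assigned to any object
       select tablespace_name, sum(bytes)/1024/1024 as free_mb
       from dba_free_space
       group by tablespace_name
     ) fs
  on fs.tablespace_name = df.tablespace_name
order by df.tablespace_name;
```

If cleanup is freeing anything, the expectation would be that free_mb rises while allocated_mb stays flat, which would match the behavior described above.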
I am still looking for the proper SQL command to see how much space a particular schema/database is using up before I try seriously cleaning up my DB. If you happen to know, please comment on the article to let me know.
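One candidate worth trying is Oracle's dba_segments view, which lists the space held by each table and index along with its owner. The sketch below sums that per schema. The owner names IGOPS and IGARC are an assumption based on the igops and igarc names mentioned earlier, so substitute whatever your install actually uses. It has not been verified against this system.

```sql
-- Sketch only: space held by segments (tables, indexes, LOBs) per schema.
-- ASSUMPTION: the IDG schemas are owned by users named IGOPS and IGARC.
select owner,
       round(sum(bytes)/1024/1024) as segment_mb
from dba_segments
where upper(owner) in ('IGOPS', 'IGARC')
group by owner
order by owner;
```

A caveat: Oracle generally keeps space assigned to a table after rows are deleted, until the table is truncated or shrunk, so even this figure may not drop right after a cleanup. It is still a closer measure of per-schema usage than the total of dba_data_files.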
As you can see, the product is maturing. We got the addition of archiving, and the maintenance tool moved from an external script to an internal GUI element. Yet it is still missing the ability to schedule the maintenance. There is always room to grow and expand functionality, but this is a nice step forward, with a useful interface that is better than before, so all in all a win I should think.
I expect I will have more to say on this topic in the future, and look forward to input from anyone else who has figured out some neat aspect of it as well.