This brief knowledge document deals with the performance tuning of the ingestion process in ControlPoint. It talks about the constraints on components and parameters that influence performance.
Before reading this article, you may want to read the "ControlPoint ingestion process: components, architecture and flow" post if you are not familiar with the components.
This article goes through a list of parameters to look at in order to maximize the performance.
Logs are written to the file system. Once the process is validated and working, you can reduce logging to save CPU cycles and disk I/O.
Each configuration file of the components has a log level parameter in the [logging] section.
It defaults to Normal. You can safely set it to "Warning", especially for the Connector, Framework and MetaStore.
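As a sketch, the change could look like this in each component's configuration file (the [logging] section name comes from the article; the exact parameter name and accepted values may differ per component, so treat this as illustrative):

```ini
; Illustrative only: reduce log verbosity to save CPU cycles and disk I/O.
; The parameter name "LogLevel" is an assumption; check your component's
; documentation for the exact name in its [logging] section.
[Logging]
LogLevel=Warning
```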
Files are processed in batches. The default value is 100. If the environment contains more than 20TB to scan, it is reasonable to increase this value up to 1000 (one thousand).
Change this parameter in the configuration file of the Connector.
You should also consider that the Connector will run much faster than the Framework, especially if Eduction is on. A lower batch size for the Connector therefore makes sense: it avoids flooding the Framework with a large queue. Slowing down the Connector also reduces the load of requests on the data source repository.
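As an illustrative sketch (both the section name and the parameter name below are placeholders, since the article does not name the batch-size parameter in the Connector configuration file), a tuned setting might look like:

```ini
; Placeholder names: adjust to the actual batch-size parameter of your Connector.
; The default batch size is 100; large environments (20TB+) may go up to 1000,
; but keeping the Connector's value modest avoids flooding the Framework queue.
[Ingestion]
BatchSize=250
```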
Optionally introduce this new section [Actions] to define MaximumThreads (default is 2 if undefined):
Optionally introduce IngestDelayMS if you wish to introduce a delay to slowdown current ingestion performance of a given connector.
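Putting the two options above together, the Connector configuration could include the following (the thread and delay values are illustrative examples, not recommendations):

```ini
; Optional section described in the article.
[Actions]
; Parallel action threads; defaults to 2 if undefined.
MaximumThreads=4
; Optional delay in milliseconds to deliberately slow down ingestion.
IngestDelayMS=500
```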
Change this parameter in the configuration file of the Connector Framework...
The components are multi-threaded. This means, for example, that one CFS can run multiple worker threads in order to process multiple tasks in parallel.
Set the ThreadCount parameter equal to the number of CPU cores on the server.
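For example, on an 8-core CFS server (ThreadCount is the parameter named above; where it sits in the file is an assumption, so check your CFS configuration):

```ini
; Set to the number of CPU cores on the CFS server (8 in this example).
ThreadCount=8
```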
You can configure the number of parallel threads the MetaStore service uses to send data to SQL Server. Normally, set it to the number of cores on the server hosting the service. However, if you set it too high and SQL Server cannot keep up, you may start seeing deadlocks reported in the MetaStore ingest.log.
Parameters are in the cfg file of the metastore.
If deadlocks occur, you can make retries less frequent via ActionRetrySleepPeriodInSeconds, which defaults to 60. Increase this setting to retry less often.
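A sketch of the relevant MetaStore .cfg entries (the values are examples; ActionRetrySleepPeriodInSeconds is named above, while the thread-count parameter name is an assumption):

```ini
; Number of parallel threads sending data to SQL Server; usually the core
; count of the MetaStore server. Lower it if deadlocks appear in ingest.log.
ThreadCount=8
; Wait longer between retries when deadlocks occur (default is 60 seconds).
ActionRetrySleepPeriodInSeconds=120
```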
Pattern matching is done with the Eduction module. There are many options to define precisely what is extracted and how. It is important to discuss with business users exactly what is needed and what is not.
See the Eduction reference guide for detailed information on the parameters. They are set in the Framework configuration file. Here are some that you may consider first and set to these values.
MaxMatchPerDoc is critical. It sets the maximum number of entities to capture in each document. We often see environments containing, for example, CSV files with tens of thousands of PII entities each; extracting them all can take hours. The functional requirement may only be to know that the document contains at least one (i.e., that the document is responsive). In that case, MaxMatchPerDoc=1 is enough, and runs much faster.
Similarly, we can imagine a use case where customers want to distinguish between documents with high quantities of PII and documents with low, due to internal risk policy. Documents would be then processed differently based on what category they fall into. Here MaxMatchPerDoc could be set to a number that makes sense for the business use case, such as 100 or 1000. You could then direct documents in the “over 100 hits” category to be processed more thoroughly, for example.
Here is a concrete example from a customer using ControlPoint to scan nearly 2 million documents on a file share.
We can see that 885 files each contain more than 100,000 PII entities, representing about 50% of all entities extracted across the document set. In other words, 0.05% of the files take as long to process through Eduction as all the others combined. Setting MaxMatchPerDoc=100000 would therefore halve the Eduction time, and we would still have valuable information for the business: "885 files contain more than 100,000 PII each and are therefore highly sensitive".
It is best practice never to leave MaxMatchPerDoc unlimited. I would highly recommend not exceeding 100.
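To make the two use cases above concrete, the Framework configuration could set the following (the [Eduction] section name is an assumption; MaxMatchPerDoc is the parameter discussed above):

```ini
; Responsiveness-only use case: one match is enough to flag the document.
[Eduction]
MaxMatchPerDoc=1
; For the risk-banding use case, use a business-meaningful threshold instead:
; MaxMatchPerDoc=100
```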
I would recommend one Connector per Framework. The Connector simply scans, and does its job much faster than the Framework, which runs many processes such as KeyView and Eduction. There is therefore no need to overload the Framework.
This changes for the runs after the initial one, which are incremental scans. The Connector then has the huge task of crawling the file structure and identifying what has changed, while the Framework has only a small number of files to process. In that case, we may reconsider the one-to-one relation and think about two-to-one or three-to-one.
I would recommend a MetaStore on each Connector/Framework server. This makes sense because the MetaStore reads the same file the Framework produced in order to build the SQL requests it sends to the database.
CPU and queues will have to be monitored. Depending on how much the Framework (i.e., Eduction) has extracted, the MetaStore could be overloaded and generate deadlocks.
Use SSD (a.k.a. flash) disks. SSD latency is roughly 20 times lower, and latency is the most important parameter for the small random I/Os typical of a relational database.
Separate the MetaStore database from the other four databases, on its own dedicated disks. Also put the MetaStore database logs on dedicated disks. The MetaStore database is the most heavily accessed and needs this special treatment.
Use the database partitioning feature of SQL Server 2016 and later releases for the MetaStore database. This splits the database into as many files as there are cores on the server. Having multiple files accessed in parallel removes the bottleneck of accessing a single file on a Windows file system.
Monitor the average disk queue depth on the server: if it is consistently higher than 1.5, the database storage is overloaded.
Monitor the SQL page life expectancy on the server as well; if it is low, you may need to add memory.
And though this may be obvious: best practice is not to share this database with any other application, or you will not be able to control this key part of ControlPoint performance.
Great article that really summarizes how to improve performance. I would add:
- Upgrade to the latest product version. Performance improvement is a large part of the roadmap, in 5.8 we saw a 20% ingestion performance improvement using lab criteria.
- Consider your use case. You can reduce how many PII matches are kept per document, as you have explained, but you can also choose not to keep the PII values at all, only the "count" of how many times they appear in the document, if that suits the customer use case. This improves performance because those values no longer have to be written to SQL Server.
- Disable features you don't need, such as Document Level Security, or Encryption.