TL;DR: A properly configured MySQL database for the HP UCMDB Data Flow Probe dramatically improves reliability and performance. The Probe has to be tuned properly, and we share the method and queries to do so. After tuning, jobs are more reliable, results arrive faster, there are fewer errors and no corruption, and clearing the probe cache is no longer a requirement of daily use. Integration and Federation jobs improve by up to 366-fold. These changes have been made in half a dozen enterprises and have no negative impact on features, functions or licensing.
The free tuning package and the results of the tuning are available here.
I am the CEO of Effectual, an SPM-only HP Partner. I've been working with the UCMDB technologies since Mercury first acquired it from Appilog. I’ve been thinking about and using UCMDB in effective ways for something close to a decade.
I founded Effectual to systematically approach and resolve the barriers and challenges every UCMDB project must overcome. Over the past three years, my engineers have often turned to this support forum for guidance. This is my first time posting since the old Mercury support forums were retired, but I expect this is one of the most important findings I could share in my career.
The MySQL version of the Probe (every version of UCMDB prior to 10.10) has shipped with a misconfigured set of variables that has caused a huge amount of pain for every UCMDB user working with anything larger than a tiny data set. Beyond workloads exceeding the originally defined variable limits, the Probe also needs additional setup to really meet the use case and technical requirements we have asked of it.
We are sharing both the root causes and the solutions fully with you. While we’re also making the usual marketing efforts to let people know, this thread will go much deeper into what the problem is and why it exists, so that Support and other users can understand how significantly it impacts their daily lives.
Bear with me, this will not be a short post.
- Simplified Version of the Probe in Operation
We first need to understand that the Probe is not a black box. It is a series of applications, processes and functions that all have moving parts. These parts behave differently depending on what data is moving through them, how that data is written to the database, and how it is passed back to the Probe application and on to the UCMDB.
The MySQL version of the database uses the MyISAM storage engine, in which the .MYD files hold the table data and the .MYI files hold the table indexes. Every single step of the Probe operation writes to MyISAM, whether to record how far the job has progressed, to return results from Discovery, or to normalize CI results from an integration. These writes are done by the operating system, which manages its own memory and threads outside of the MySQL database.
At regular intervals in the processing of any job, the MyISAM results are queried back into memory via the InnoDB features. Although these results are “in memory”, the out-of-the-box configuration also double-writes them to the IBDATA1 file and its logs. This supports features like ACID compliance and keeps track of transactions that would be mission critical in the event of a failover.
Once the DFP application has the results from the InnoDB process, it then handshakes it over to the UCMDB and processes the results; recon engine, history tables, etc. This cycle repeats over and over again in small chunks to prevent “overwhelming” the UCMDB until all the results are processed.
At each step through this process, the application updates the database to keep track of where it is in the process. These results are also queried by the UCMDB UI as “Discovery Progress” and “Discovery Results”. Left alone, this process runs smoothly and without attention from a human, just as designed, until the data size exceeds the key buffer cache size and the process grinds to a halt.
- Key Buffer Cache
With the Probe as it arrives out of the box, it isn’t uncommon to see a job that completed 14 days earlier still chunking results from Probes through the DAL log to the UCMDB. It isn’t uncommon to have a job hung perpetually at a certain percent, or jobs failing and returning errors hours or days later after finally timing out.
This isn’t because the workload is heavy or because the job failed on its own merits. It is because a mechanism used by the MySQL MyISAM engine, the Key Buffer Cache, is running out of available memory blocks and forcing a horrible swap, with writes to and reads from disk. There’s a reason the symptoms are so predictable: performance of the Probe breaks down once the cumulative database index usage exceeds this one limit.
The Key Buffer Cache, or Key Cache (set by the MySQL variable key_buffer_size), is used to keep MyISAM index blocks from the .MYI files in memory. It also needs additional space to handle joins across those index blocks, such as for result processing of large jobs with many relationships. A small dataset would never experience these problems, but in our experience once the .MYD files and the indexes used to operate on the database tables exceed the key buffer cache setting of 384MB, all bets are off and things begin to spiral out of control.
We have established that the Probe job sequence has many repeated steps that take place between the application, the actual database, the temporary database file, local log files and back to the application, then on to UCMDB. UCMDB can also demand information from the Probe when you ask for it in the UI. That is a great deal of throughput and repeated processing over time, all banging head-on into the limited key buffer cache.
This happens over and over and over again making the problem worse and the response time slow down exponentially. Once the Probe database size exceeds that cache, the only relief a customer has is to restart the servers, Clear Probe Cache Data, empty out those tables, and start the process over again.
Or you can properly size the key buffer cache and give MySQL adequate Free Memory – and all of these symptoms go away.
- I/O Memory & Probe Data Table Sizes
You know the one thing not working hard at this point? The UCMDB.
By design the MySQL MyISAM engine makes use of the operating system to handle writes and reads, work buffers and blocks from memory to disk. The OS threads are used in every function, not just as a result of swapping or paging. Every healthy write and read uses OS threads and needs Free Physical Memory to work well.
The rule of thumb from MySQL DBAs is that the MyISAM engine’s key buffer cache should not be set to more than 20% of the total memory on the server, with the expectation that the rest of the memory will be consumed by the operating system and thread-related work.
So when tuning the Probe, we should have at least four times more memory per server than we have related to the Key Buffer Cache. This has worked out pretty well for us, as you’ll see in our last section on how to evaluate and tune the Probe.
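To make the 20% rule of thumb concrete, here is a minimal sketch of the arithmetic. The function names are ours for illustration only and are not part of any tuning package; the only inputs taken from this post are the 20% ceiling and the 384MB out-of-the-box setting.

```python
def min_total_ram_mb(key_buffer_mb, max_percent=20):
    """Smallest server RAM (MB) at which the key buffer stays within max_percent."""
    if not 0 < max_percent <= 100:
        raise ValueError("max_percent must be in (0, 100]")
    return key_buffer_mb * 100 / max_percent


def max_key_buffer_mb(total_ram_mb, max_percent=20):
    """Largest key buffer (MB) the rule of thumb allows for a given server RAM."""
    if not 0 < max_percent <= 100:
        raise ValueError("max_percent must be in (0, 100]")
    return total_ram_mb * max_percent / 100


if __name__ == "__main__":
    # The out-of-the-box 384MB key buffer implies at least 1920MB of RAM.
    print(min_total_ram_mb(384))
    # An 8GB (8192MB) Probe allows a key buffer of roughly 1638MB under the rule.
    print(max_key_buffer_mb(8192))
```

Note that this is only the generic DBA ceiling. The sizing we recommend later in this post is driven by the actual table and index sizes.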
We provide queries with our Probe Performance Tuning Guide so you can measure the size of your database tables and their corresponding indexes. You can also look at the .MYD and .MYI file sizes on the Probe, add them up, and then add 15% on top.
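For the file-based approach, a small script can do the adding up. This is a rough sketch and not one of the queries from the guide: the function names are hypothetical, and the data directory path is something you would have to supply for your own Probe installation.

```python
from pathlib import Path


def myisam_sizes(data_dir):
    """Return (data_bytes, index_bytes) summed over all .MYD and .MYI files."""
    root = Path(data_dir)
    data_bytes = sum(p.stat().st_size for p in root.rglob("*.MYD"))
    index_bytes = sum(p.stat().st_size for p in root.rglob("*.MYI"))
    return data_bytes, index_bytes


def padded_total(data_bytes, index_bytes, headroom_percent=15):
    """Total size with headroom added on top (15% as suggested above)."""
    return (data_bytes + index_bytes) * (100 + headroom_percent) / 100


if __name__ == "__main__":
    import sys

    # Usage: python myisam_sizes.py <path to the Probe MySQL data directory>
    data, index = myisam_sizes(sys.argv[1])
    mb = 1024 * 1024
    print(f"data:  {data / mb:.1f} MB")
    print(f"index: {index / mb:.1f} MB")
    print(f"total + 15%: {padded_total(data, index) / mb:.1f} MB")
```

Run it against a copy of the data directory or during a quiet period, since file sizes change while jobs are writing.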
What we’ve seen in all of our testing of various environments and use cases with Probes is that, with MyISAM, index size and data size are not always equal. A Probe that has very simple jobs to run won’t have a big difference between data file and index size, but a Probe handling complex jobs or doing federation or integrations will have a significantly larger total index size. This is due to all the extra joins required when containment, composition, dependency or other related links are created.
In these cases the index of such a Probe is usually 1.5-2.0 times the size of the actual data in the tables. The more complicated the job and the more interrelated the results, the more likely you’ll have a lot of joins (I’m looking RIGHT at you, Network and Application Discovery). We here at Effectual have been very careful with how our discovery and integration jobs are triggered, and we use very specific TQL in every aspect of the work to keep the trigger input sizes to an ideal minimum.
Even with this careful and constant best practice, we couldn’t escape the reality of poor Probe performance and so we delicately separated out the job work load across many Probes and balanced schedules and job types. It was a pain. But it worked and many other folks do the same thing to get results.
After we had performed extensive testing, we went back to some of our Discovery farms and checked this hypothesis thoroughly. The largest Index we had across more than 30 different balanced probes was 481MB. So that particular probe was swapping, but the vast majority were under 400MB. We were able to correlate that the 481MB probe was historically much slower and less responsive to report results than the others.
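The same check is easy to repeat across a farm once you have each Probe’s total index size. The sketch below is illustrative only: the probe names and the 350MB figure are made up, while 481MB and the 384MB out-of-the-box key buffer come from this post.

```python
DEFAULT_KEY_BUFFER_MB = 384  # out-of-the-box setting discussed above


def probes_over_key_buffer(index_sizes_mb, key_buffer_mb=DEFAULT_KEY_BUFFER_MB):
    """Map each probe whose index exceeds the key buffer to its overshoot in MB."""
    return {
        name: size - key_buffer_mb
        for name, size in index_sizes_mb.items()
        if size > key_buffer_mb
    }


if __name__ == "__main__":
    # Hypothetical farm: names and the 350MB value are placeholders.
    farm = {"probe-a": 481, "probe-b": 350}
    for name, over in probes_over_key_buffer(farm).items():
        print(f"{name}: index exceeds key buffer by {over} MB")
```

Any Probe this flags is a candidate for the swapping behavior described earlier and is worth tuning first.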
- IBDATA1, Temporary Tables, ACID, Doublewrites, Useless Overhead
Without going into query cache, table cache, open table cache, join buffer, read rnd buffer and all the other things we’ve tuned, we want to briefly mention the IBDATA1 and innodb_file_per_table settings that we want you to change.
As we mentioned in the first section, and as the problems in the second section showed, the InnoDB functions eventually start paging. When there isn’t enough memory, or when work is sitting in memory waiting for more work to be completed, MySQL’s active memory is shuffled off to disk as a normal part of how Windows operates. So the longer something takes, the more likely disk writing is happening. Egad.
It’s a perfect storm that really adds up to long response times and poor observed performance. What isn’t intended is that the IBDATA1 file, which is used for all kinds of unintentional things, never shrinks or auto-truncates. If a Probe is really struggling, it also does lots of “temporary” writes to and reads from the IBDATA1 file. We’ve seen these files exceed 4 and 5 GB in size. That’s a lot of bottlenecked activity going into that one file, and it can have a real impact on an already distressed system. We don’t see any intended activity with the IBDATA1 file; it just happens that MySQL as configured operates that way.
The single IBDATA1 file also lacks any real read/write checksum control and is essentially an unregulated temporary space that sticks around permanently. So it’s possible the MySQL database is corrupting part or all of the result set it sends back to the UCMDB when a Probe is having a difficult time.
Your IBDATA1 file may not be very large (and there’s no way to disable the use of IBDATA1), but at least we can offload the actual temp file work to actual temporary files. That’s where innodb_file_per_table comes in, along with our guide that walks you through how to enable it. By following our guide, you’re giving MySQL the ability to create actual temporary tables and log files as they are needed. On a modern VM Probe with fast SAN, you’ll see temp files get created, fill up with data and disappear in a second. It’s awesome compared to the way the out-of-the-box Probe worked. These temp files most often hold joins and large complex queries, and with our configuration the results of that work are kept cached in memory.
Since IBDATA1 can’t clean up after itself, the odds are that a lot of the results that may be read back into memory could be bad. Or they could get pushed out of the file entirely and not be found. Think about the types of error messages we see most commonly with Discovery and Integration jobs. Think about bulk failures; think about duplicates being detected. These all originate not from the Discovery script, but from the stress condition of the Probe breaking down over time as the database size grows.
In short, the out of the box Probe was also configured more for a “financial website application” than the way we use it to process and manage Discovery results. The ACID compliant settings, the double writing, are essentially additional overhead where you might need a warm backup of the data for a MySQL database cluster. All of that has been disabled. We’ve tuned the I/O and tested it and mechanically changed the functions of the temporary tables, MyISAM buffers and InnoDB temporary table (and file) usage. The Probe use case does not require financial accountability for every transaction. We save time and resources by disabling this and it has an impact on performance.
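As a rough illustration of what this kind of change looks like in the MySQL option file, the sketch below renders a [mysqld] fragment. It is not the Effectual tuning package and the values are placeholders. The option names key_buffer_size, innodb_file_per_table, innodb_doublewrite and innodb_flush_log_at_trx_commit are standard MySQL options, but mapping “ACID settings disabled” onto innodb_flush_log_at_trx_commit = 2 is our assumption for the example, as is the helper function itself.

```python
import configparser
import io


def render_mysqld_section(key_buffer_mb):
    """Render an illustrative [mysqld] fragment; values are placeholders."""
    cfg = configparser.ConfigParser()
    cfg["mysqld"] = {
        # Sized from your own table and index measurements.
        "key_buffer_size": f"{key_buffer_mb}M",
        # One tablespace file per InnoDB table instead of everything in IBDATA1.
        "innodb_file_per_table": "1",
        # Turn off the doublewrite buffer (trades crash safety for less I/O).
        "innodb_doublewrite": "0",
        # ASSUMPTION: relaxed durability; flush the log roughly once per second.
        "innodb_flush_log_at_trx_commit": "2",
    }
    out = io.StringIO()
    cfg.write(out)
    return out.getvalue()


if __name__ == "__main__":
    print(render_mysqld_section(3072))
```

Review any such fragment against the MySQL documentation for the version bundled with your Probe before applying it, and back up the existing option file first.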
There still remains a series of features in the Application that should be tuned, such as the actual SQL queries executed when a job starts and stops. The Select Count(*) and Insert/Delete ... Where statements in the Probe database are not very efficient, and these now produce the only long-running “Slow” entries in the MySQL Slow Log. This will be a lot easier for HP R&D to see over time and hopefully address, for example by replacing a 6 minute long delete statement with a drop or alter statement.
- Tuning The Probe
We’ve misunderstood Probe operation for years. So we’re also going to provide you with our observations and recommendations for how you should be tuning and thinking about sizing your own Probes.
Adding machine resources without at least changing the key buffer cache would not show any remarkable improvement in performance. Simply adding more RAM or CPUs would just mean you page more and swap less. With the problems we’re describing, adding more resources without changing the tuning would result in faster paging and would decrease overall run time, but would not alleviate any of the underlying pain and data corruption issues we’ve seen.
There is no one size fits all solution here because once you’re aware of the “ecosystem” of the Probe you’ll realize that your needs change as the maturity of the Discovery and Integration efforts grow. There’s certainly a “better” key buffer cache size for all, but even this will grow over time.
Effectual emphatically recommends that you build your Probes on virtual systems, because resources can be added or removed as the Probe role or workload changes, and because I/O and disk are faster on modern SAN. It would be wasteful to allocate physical hardware and keep the Probe configuration the same. If you have a very large and dedicated physical environment for your Probes, you might want to consider a different architecture for the Probe databases. Please contact us for design and guidance.
The Probe as an ecosystem needs to run a good number of concurrent operations; the actual discovery processes, the application processes, the MySQL processes, the operating system I/O processes. We are currently standardizing on 4 Core, 8GB RAM for all Windows Discovery Probes and 9GB and 12GB for the Integration Probes. As a result of the tuning we need far fewer Probes to accomplish the same amount of work, with none of the previous pain.
At the end of a week full of Discovery job execution and processing your Probe should have at least 1.5GB of Free Physical Memory unallocated. This prevents the OS and the above processes from swapping. Reducing swapping is good for your processor as well and you can find a good balance between the right size memory, just enough swapping and paging and good I/O. It takes patience and discipline and paying attention to the Probe MySQL databases with the queries we provide.
Every single operation the MySQL database performs requires I/O – although the Effectual tuning will remove a great deal of this excessive I/O, the fastest I/O storage should be used. Avoid traditional attached physical HDD for the database and probe operation.
In addition to these rules of thumb, you need to tune the Probe’s key buffer cache (the MySQL variable key_buffer_size) and innodb_buffer_pool_size based upon the actual and intended growth of your database tables and index size. The innodb_buffer_pool_size will need to be increased when we resolve some of the sipping that the UCMDB does from the Probe. When we tune the UCMDB to take more results, the innodb_buffer_pool_size will begin to cause excess paging due to the rate at which data will be able to flow out of the Probe database. That is not a big consideration today.
As previously mentioned, MySQL DBAs have a rule of thumb for the MyISAM key_buffer_size: it should not exceed 20% of the available Physical Memory. Effectual’s test results show that key_buffer_size should be at least the size of your total data files, if not 1.5x the size of your data files, with room for growth on top. Once the total Index size exceeds the key buffer, you’re heading back towards poor performance. Setting the key cache higher does not mean the system will immediately use all that memory, just that there is more for the Index if needed.
So the general physical memory minimum recommendation we’re comfortable making is:
key_buffer_size = Total Data Table Sizes at Max Possible Size * 1.5
MySQL memory = key_buffer_size * 1.5 (for threads, buffers and cache)
Example: 2 GB total data table size * 1.5 = 3 GB for the key buffer, and 3 GB * 1.5 = 4.5 GB for the MySQL Session. Assuming you’re running a 64-bit OS and 64-bit versions of the Probe and Database, you can set your key buffer cache to whatever size you have available. The rest of the Free Memory on the box should go to the Java application (2GB) and the operating system (whatever is left).
Effectual also recommends having at least 1.5 GB of free Physical Memory at all times, so this 2 GB data table size leads to a recommendation of 9 GB of Physical RAM for ideal throughput and growth. Our large table Probes are 9GB and 12GB of RAM and will continue to grow as our CMS projects get larger.
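The sizing arithmetic above can be written out as a short calculation. This is only a sketch of the rule of thumb as stated in this post: the function name is ours, and reading the gap between the computed subtotal and the 9 GB recommendation as operating system headroom is our interpretation.

```python
def probe_memory_plan(data_gb, java_gb=2.0, free_gb=1.5):
    """Apply the 1.5x / 1.5x rule of thumb to a total data table size in GB."""
    key_buffer_gb = data_gb * 1.5          # key buffer sized at 1.5x the data
    mysql_gb = key_buffer_gb * 1.5         # plus threads, buffers and caches
    subtotal_gb = mysql_gb + java_gb + free_gb
    return {
        "key_buffer_gb": key_buffer_gb,
        "mysql_gb": mysql_gb,
        "java_gb": java_gb,
        "free_gb": free_gb,
        "subtotal_before_os_gb": subtotal_gb,
    }


if __name__ == "__main__":
    plan = probe_memory_plan(2.0)
    for name, value in plan.items():
        print(f"{name}: {value} GB")
```

For the 2 GB example this gives a 3 GB key buffer, 4.5 GB for MySQL and 8 GB before the operating system’s own share, which is in line with the 9 GB figure.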
If you’re not running DDMI you can turn off the HP Universal Discovery XML Enricher application which consumes 500-600MB of memory on the Probe. We disable this service on our Integration only probes. If you do run DDMI, you’ll need additional cores and 2GB of additional memory just to accommodate this service based on these findings.
- Conclusion & Final Thoughts
As a former engineer and now a CEO, I’m a terrible writer and this is only a draft. I’m sure someone could do a better job than I of polishing up this language. I’m sure my writer/communications person will be doing that soon. However, I felt we couldn’t let this continue and so I’ve done my best with a dozen or so hours of work writing this out for the forum.
We will be releasing alternate configurations, including shared database tables in memory and super high performing MySQL instances, alongside the results of our PostgreSQL testing in the fall of 2014. We will not be keeping this as a proprietary secret; there’s just too much at stake. Our mission at Effectual is to see more HP customers succeed with UCMDB and CMSes in particular. This will help in every way.
We are currently performing tests against UCMDB and the Probe application. Delta syncs are fluid and take seconds. Larger jobs and Discovery results processing seem to have serious artificial “functionality” put into place to slow down the result processing.
To provide a sense of where we are heading next: by tuning the Probe and UCMDB task result behavior, we’ve been able to move 1,000,000 CIs from the Probe to the UCMDB in under 20 seconds. This created ten 5MB DAL log files in under a minute during the test. UCMDB handled it just fine, with the longest batch UCMDB update taking 636ms. We’re now looking at the history and Last Access time mechanics and the reconciliation engine chunk and merge in fuses.
Stay tuned to our Blog at EffectualSystems.com for more information as we release it. We do welcome discussion on this. We’re not going to stop until we can help demonstrate to the market how amazing a product UCMDB is. No other vendor supports ITIL centric value propositions like HP does and the new History and Integration capabilities of UCMDB are unmatched.
I look forward to discussing this and hearing your thoughts.