I'm still unsure whether we really have the same problem, as the manager crashed twice in the last month because of vulnerability scans, so I'll need a bit more time to be sure...
Support had me change some GC settings, which seems to help, but it limits the EPS In the manager can handle...
Anything on your side?
Thanks for the quick response. We tried the GC 'fix', but the issue recurred within a week of applying it. It also capped EPS at 20k; we fairly regularly run at 22k+ and plan to be near 30k within the next few weeks, so we rolled the change back.
Same here. EPS is capped at about 18k (I guess it's hardware dependent), though we're only running at 12k right now (we also have plans to get to about 20k, but are waiting on a better resolution). It does seem to extend the time between crashes in my case; we had the problem every 3-5 days before.
Let's hope they are gonna find a real solution soon...
The "GC fix" (switching from ParallelOld to CMS) was recommended to customers in cases where a substantial EPS drop after several weeks appeared to be JVM-related and not correlated to any obvious root cause in code, content, or DB performance. If the performance drop is due to something else, then switching the GC will not help.
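For anyone wondering what that switch actually amounts to: on HotSpot JVMs the two collectors are selected by command-line flags. A minimal sketch, assuming the Manager's JVM options are passed through an environment variable (where exactly you set this varies by install; this is illustrative, not the literal property support edits):

```shell
# Default old-generation collector (what the 'fix' replaces):
#   -XX:+UseParallelOldGC
# The 'GC fix' selects the CMS collector instead:
JAVA_OPTS="${JAVA_OPTS:-} -XX:+UseConcMarkSweepGC"
```

Heap sizing and any additional CMS tuning flags would come from support for your specific hardware; this only shows which collector flag is involved.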
If you saw an EPS drop after applying the GC change, could you collect thread dumps in that state (at least three) and email them to me, along with thread dumps taken during slow EPS under the default GC? I'd like to see how they compare.
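For anyone unsure how to grab those, a simple way is the JDK's `jstack` tool against the Manager JVM. A minimal sketch (the pid, count, and interval below are illustrative; point it at your actual Manager process):

```shell
# Write COUNT thread dumps for PID, INTERVAL seconds apart.
# Assumes the JDK's jstack tool is on PATH.
collect_dumps() {
  pid="$1"; count="$2"; interval="$3"
  i=1
  while [ "$i" -le "$count" ]; do
    jstack "$pid" > "threaddump_${pid}_${i}.txt"
    if [ "$i" -lt "$count" ]; then
      sleep "$interval"
    fi
    i=$((i + 1))
  done
}

# Example: three dumps, 30 seconds apart (replace 12345 with the Manager pid)
# collect_dumps 12345 3 30
```

Spacing the dumps out a bit makes it easier to tell a stuck thread from one that's merely busy.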
ESM Correlation Tech Lead
Do you have access to support cases, Rob? You would have everything you need in mine. I just started seeing some caching a couple of hours ago with the "GC fix" applied.
Just curious if you have a large amount of assets or vulnerabilities or both?
We do have a large amount of assets and vulnerabilities. We have McAfee Vulnerability Manager scanning more than 50,000 assets every month...
Just to add... We deleted all of our assets and disabled auto-creation. We have no vulnerability scanning data in ArcSight.
Reviving an old/dead thread. This was still occurring for us until recently, with no real resolution.
But we did identify a couple scenarios which could induce this behavior...
1) Massive amounts of badly timestamped events. We haven't quite figured out exactly how much is too much, but we had over 1,000 EPS of data that was on average more than a week old. Given our collection infrastructure this is unfortunately not abnormal, *but* it was not something I was aware would cause this kind of issue. Has anyone else tried to correlate how many 'bad' timestamps they have with the persistence slowdown?
2) We were able to reproduce an unrecoverable persistence slowdown when doing backups... Our consultant, who was working on the project at the time, found specific corrupted resources (specifically query viewers with drilldowns) that were being backed up.
I will note that these may or may not be related and could just have similar symptoms, but I was curious whether anyone else out there 'living with this' could take a look and see if there is any correlation on the other systems exhibiting this behavior.
Edit: Oh, and... our insert rates improved unbelievably after we notified our SOC, changed our processes regarding timestamps, and essentially set the majority of our events to 'use connector time as device time'. ****Please do not use this without understanding the implications!****
Our insert rates went from 200+ms per insert to 20ms on average, and that's during our normal daily activity at 20,000+ EPS with multiple users and all custom content running.
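For anyone wanting to check point 1 on their own system, a rough way to quantify 'bad' timestamps is to export events with both their device (end) time and receipt time and count how many lag by more than a week. A minimal sketch, assuming you can get those two timestamps per event (the field names and the one-week threshold are illustrative choices, not ArcSight's schema):

```python
from datetime import datetime, timedelta

# Illustrative threshold: the thread's example was data averaging > 1 week old
STALE_THRESHOLD = timedelta(weeks=1)

def stale_event_stats(events):
    """events: iterable of (device_time, receipt_time) datetime pairs.
    Returns (stale_count, total_count, stale_fraction), where an event is
    'stale' if its device time lags its receipt time by over the threshold."""
    stale = total = 0
    for device_time, receipt_time in events:
        total += 1
        if receipt_time - device_time > STALE_THRESHOLD:
            stale += 1
    return stale, total, (stale / total if total else 0.0)

# Synthetic example: one event 8 days old, three current ones
now = datetime(2014, 6, 1, 12, 0, 0)
sample = [(now - timedelta(days=8), now)] + [(now, now)] * 3
print(stale_event_stats(sample))  # -> (1, 4, 0.25)
```

Multiplying the stale fraction by your average EPS gives a rough 'stale EPS' figure to compare against the ~1,000 EPS of old data mentioned above.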
I know we also have some events "from the past", never checked how many though. I'll take a look!
By backup, do you mean system tables export?
Our insert rate is around 100ms during normal activity, at 10-12k EPS.
Regarding the slowdown, we found it while trying to read resources using the API, but I think we were able to reproduce it using packages as well. I don't believe there was any testing with the system tables export specifically.