Memory Exhaustion in ESM 6
This thread is for discussing memory utilization issues in ESM, in particular what I'd like to call the "staircase to hades" issue.
If you view the graphic below, you can figure out how I came up with the name.
The system has a steady event rate, and little activity outside a trend that runs twice an hour. As the chart moves to the right, the system runs out of memory and performance suffers as swapping begins. Does anyone else see this phenomenon on their system?
Our read rates are pretty close to yours (similar setup with fusions)... They vary wildly, between 30k and 250k+ on that same dashboard. I have not gotten any solid info as to what would be considered 'good' for this number, or what exactly it represents and how it's calculated...
From a 'feel' perspective, most things seem fast (reporting being the exception), considering our EPS.
I'm kind of with Ray on this one; I have de-tuned MySQL since that seems to be the cause of our "memory leak" and things have been pretty stable. I'm still flushing the cache every hour and have some other ESM stuff bumped up.
These are "microseconds", not milliseconds, so it's a pretty short timespan. In usual operation, we're seeing retrieval times of 7-8k microseconds (that's 7-8 milliseconds) over the past hour.
For the past 24 hours we're seeing the average at 183k microseconds, so 183 milliseconds. That's a bit under 1/5th of a second per event retrieval. I'm not sure how that rates/compares, but it doesn't seem too bad.
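To keep the units straight, here is a minimal sketch of the conversion used above. The function name is hypothetical and not part of ESM; it only restates the arithmetic.

```python
def describe_latency(microseconds):
    """Convert a retrieval time in microseconds to (milliseconds, seconds)."""
    milliseconds = microseconds / 1000
    seconds = microseconds / 1_000_000
    return milliseconds, seconds


# The two figures quoted in this thread:
hourly_ms, hourly_s = describe_latency(8_000)    # upper end of the usual range
daily_ms, daily_s = describe_latency(183_000)    # 24-hour average
```

With those inputs, the 24-hour average works out to 183 ms, or 0.183 s per retrieval.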
I'm having issues with "Loading events..." messages when opening an Active Channel in ESM (regardless of sorting field, field-set, filtering conditions, etc). Sometimes painting events in a channel takes multiple seconds. If you're experiencing an overall good 'feel' with those retrieval numbers that would suggest a network issue on our side. I'll be looking into this today. Thanks for the quick response!
Painting channels definitely doesn't seem to be slow on our end: we easily get sub-second response times when scrolling through vanilla channels (MRT sort, simple to moderate filters). We do have a decent-sized pipe between our SOC and our ESM. The only exceptions are extremely complex filters (such as active list lookups in channels) or actual broken events (a whole other, unrelated story).
Network latency definitely has an impact on things like this. You're right, it might not be a bad idea to check.
ESM v3 - v5 performed poorly if there was high latency between the console and manager. I haven't checked ESM 6.x, but it's possible the issue is still there.
What are your ping times between the console system and the manager server?
Perhaps try larger packets. 50ms is somewhat slow for an internal connection, but it shouldn't be too big of an issue unless it's a throughput problem:
ping -l 65500 (hostname)
Oh, and yes, it is still there in 6. We cannot use the console over the internal WAN from India, as their throughput is horrible (ping response times with small packets are OK). They get a black screen when attempting to use the console over the WAN.
I support a number of customers remotely. For some customers with a lot of bandwidth, things are OK, but for others I insist on a VM in their infrastructure. That way I can log in via MS Terminal Services (or similar) and run the console there, since the latency over the internet as a whole makes the session unusable.
It seems packets exceeding a certain size are not permitted in this environment. With a size of 65451 bytes (ping -l 65451...) I get ~67ms RTT and minimal packet loss; however, when going above this size (e.g. 65452 bytes) I get 100% packet loss...
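If anyone wants to find that cutoff on their own network without trying sizes by hand, below is a rough sketch of a binary search over the payload size. It assumes the Windows ping flags used in this thread (-n for count, -l for size) and assumes that every size below the cutoff passes and every size above it fails. The function names are made up for illustration, and checking for "TTL=" in the output is only a heuristic for "got a reply".

```python
import subprocess


def windows_ping_ok(host, size):
    """Send one echo request with `size` payload bytes via Windows ping.

    Heuristic: treat the attempt as successful only if the output
    contains "TTL=", since the exit code alone can be misleading.
    """
    result = subprocess.run(
        ["ping", "-n", "1", "-l", str(size), host],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0 and "TTL=" in result.stdout


def largest_passing_size(probe, low=0, high=65500):
    """Return the largest size in [low, high] where probe(size) is True.

    Assumes probe is monotonic (passes up to some cutoff, fails above it).
    Returns None if even the smallest size fails.
    """
    if not probe(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always shrinks
        if probe(mid):
            low = mid
        else:
            high = mid - 1
    return low
```

Usage would be something like largest_passing_size(lambda size: windows_ping_ok("manager-host", size)), where "manager-host" is a placeholder for your manager's hostname. A single lost packet will skew the result, so it may be worth retrying each size a few times inside the probe.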
If you navigate to the "ASM Database Statistics" filter (/All Filters/ArcSight Administration/ESM/System Health/Storage/ASM Database Statistics), right-click and create channel, how long does it take for you to load? Are you also seeing "Loading Event ID..." messages for some time?
I do see the "Loading Event ID.." message for about 10 seconds before the information gets displayed. This happens on both my production and lab boxes, and the lab box is mostly idle, so it does not look like it's load dependent (the prod box runs at 10K EPS with lots of content).
We are having issues loading active channels and reports for an AIX system when End Time is used as the timestamp, but if I use MRT as the timestamp, active channels and reports fare well. I can't make sense of it: Logger shows all the events fine, and I only have issues with ESM 6.0. I have checked ESM memory and processor usage, and they are pretty stable while that active channel is loading. Has anyone come across this issue in their setup?