ESM6 ingest first indication of something else bad..?

-= 2015-01-05 Update -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Answer wrote:

You need to add a line in the server.defaults.wrapper.conf file, in the additional parameters section.

My line:

The index (12 in my case) needs to be adjusted depending on your file and any changes you may have made to it.

NOTE: Do not modify this parameter without first consulting support.  This is a pretty dramatic change and there is definitely a wrong way to do it!
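The actual option from support was not preserved in this post, so purely as a hypothetical illustration of the wrapper.conf "additional parameters" format, an entry at index 12 looks like:

```properties
# Format illustration only -- the real option from support is not shown in this post.
# The trailing index (12 here) must be unique and sequential with the other
# wrapper.java.additional.N entries already in server.defaults.wrapper.conf.
wrapper.java.additional.12=<JVM option provided by support>
```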

We have finally tested this in both our production and development environments.  We haven't had a recurrence in dev in about 3 months, or in prod in about 1 month, so this looks like a success.  We have had moments where the ESM starts to cache across the connectors, but it seems to recover without a restart.  Thanks for all the help and support, everyone.  I am sure there may be other symptoms that look similar, but this specific issue seems to be resolved on our system.

-= ORIGINAL MESSAGE FOLLOWS -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

I know this is a fishing expedition, but I'm hoping we aren't the only ones going through this...  Maybe as a group we can combine our knowledge and findings.

Basically, we have a system that has been running in production at 10-25k EPS on ESM6 Patch 1.

System specs are beyond the recommendations (512 GB RAM, SSDs in RAID 10, 4x 10-core Xeons).

Anyway - under normal circumstances our system does not have issues (besides ones already identified in other threads).  But periodically (randomly, 1-2 times per month at most; we have gone 2 months with no issue) we hit a slowdown that is visible in nearly everything, and a manager service restart fixes it completely.

Noticed symptoms:
-Insertion rates drop drastically across all connectors (normally we maintain 50-80 ms ingest on monitor:102 events; after the slowdown, 200-500 ms)

-ESM garbage collections happen extremely frequently (normally one per hour; during a slowdown they are sporadic, ~10-30 minutes apart)

-Large trends slow down drastically (usually taking about twice as long)

-CPU utilization goes up dramatically (normally we maintain 10-20% CPU at peak; it spikes to and stays at 60-70%)

Our investigations have come up fairly dry:

-The slowdown does not occur at specific timeframes; so far it seems random, occurring maybe 1-2 times per month at most

-It does not necessarily start during EPS peaks (we normally peak around 10am PST; this has occurred at night, on Sundays, and at 2pm - no definitive time/EPS correlation)

-I do have monitoring set up looking for resource changes (trends/lists/etc.), and nothing conclusive has come up

-Restarting *only* the ESM/manager service completely resolves the issue (not the server, MySQL, Logger, etc.)

      -^^^ To add to the restart point: ZERO other changes are made to the system (conf, content, event filtering/aggregation, etc.)

To be honest, I am a bit at a loss...  Has anyone else come across something similar and found a resolution or made headway?

Just for full disclosure, the link below contains our system specs and configured parameters:

Edit: 9/24/2014 - I don't know why this wasn't done earlier... but I wrote a quick script that will put all of your full GCs in order:

grep "Full GC" -A1 /opt/arcsight/manager/logs/default/server.std.log* | awk '{print $6 " " $7 " | " $9 " " $10}' | sort
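A variant of the one-liner above that buckets Full GCs by hour, which makes the "sporadic ~10-30 minutes apart" symptom easy to spot. This is a sketch only: it assumes (hypothetically) that each log line begins with an ISO-8601 timestamp such as `2014-09-24T10:15:02.123`; adjust the `cut` range to match your actual server.std.log format.

```shell
# Reads GC log lines on stdin and prints "YYYY-MM-DDTHH count" per hour.
count_full_gc_per_hour() {
  grep "Full GC" |       # keep only Full GC lines
    cut -c1-13 |         # keep "YYYY-MM-DDTHH" (assumes a leading ISO timestamp)
    sort | uniq -c |     # count occurrences per hour
    awk '{print $2, $1}' # print "hour count"
}

# Example: run against the manager logs (path from the thread):
# count_full_gc_per_hour < /opt/arcsight/manager/logs/default/server.std.log
```

A normal hour should show a count of 1; a slowdown window shows 2-6.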

  • Which features are you using?  Active lists, session lists, rules, trends, actors, zones, query viewers, data monitors, scheduled reports, etc.

    I think this may also be important in order to figure out why the manager died; it may not be as simple as EPS and the current running config.


  • This is a bland response... But we are using nearly all of it.

    No usage:


    -session lists

    -pattern discovery, etc

    Light usage:

    -data monitors (mostly Last N, but we do have a moving average and a few others; I'd have to go back and look)

    -scheduled reports (about 5 per day, all running off of small trends)

    -Query viewers (we have a few set up off of trends that refresh every 4 hours, gathering basic stats like total event counts, etc... Most take less than 10 seconds to populate)

    Heavy usage:

    -zones/assets (roughly 4,000 zones and 8,000 assets, I believe)

    -active lists (we have a few large lists for active/new processes)

    -rules (some extremely complex and a few 'expensive' rules)

    -trends (as discussed in previous threads, we found ways to make trends work, but basically it involves narrowing down to 1-2 key fields max.  We run about 20 trends per hour; total trend run time is ~30 minutes per hour, all trends combined)

    Also, we have about 60 connectors: roughly 40% Windows, 40% firewall/syslog, and a smattering of other things like IDS, AV, etc.

    The reason I am hesitant to blame content is that we do not make changes to the content, and restarting the ESM service alone resolves the issue for weeks...

  • Thanks, Ray.  Posting that information may help others with the manager issue find commonalities with what you have.

    I had my ESM 6.0c box up for a month without issues, but I had low EPS and not much content.  I'm hoping to get a 6.5 environment running and stress test it for at least 30 days.  I'll post my findings.


  • We have the EXACT same problem as you, Ray.  It happens roughly every 1-2 weeks.  We're running 6.0c Patch 1 (planning to install Patch 2 soon) at 10K EPS sustained.  System specs are 128 GB RAM, 2x 8-core Xeons, and an HP SAN.

    We are using a ConnApp - Logger - ESM architecture, and we clearly see the Loggers begin to cache badly when the problem occurs.

    Symptoms and investigation results are about the same.  Kind of happy to see we are not alone...  I had multiple cases with support, and every time content was blamed for causing thread blockages (I have a hard time believing this).  I'm pretty sure it's Java getting overwhelmed.

  • Oh, to add to the investigation: tuning the MySQL buffer size seems to have helped when the problem arises.  With stock MySQL settings, incoming EPS was dropping to about 2K; now, with updated settings, incoming EPS stays at 5-6K when the caching occurs.

    Have you done any MySQL tuning?
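The poster doesn't list the actual settings changed, so as a sketch only, the usual InnoDB knobs for this kind of insert-throughput tuning in my.cnf look like this (values hypothetical - size them to your hardware):

```ini
[mysqld]
# Hypothetical values -- the thread does not give the real ones.
innodb_buffer_pool_size = 96G        # biggest win: cache more of the event tables in RAM
innodb_log_file_size    = 2G         # larger redo logs smooth heavy insert bursts
innodb_flush_log_at_trx_commit = 2   # trade a little durability for insert throughput
```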

  • Thanks for the response.  That is interesting...  I think we have been pretty thoroughly MySQL-tuned since the onset of ESM...  Our EPS under duress doesn't really drop below 5-6k.

    Are all your cases with support closed?  I don't think I've been able to keep one open, because this happens so intermittently (typically once per month), and because they constantly blame content - even though disabling the content they find in thread dumps doesn't resolve the issue...

    If enough people are seeing this, maybe we can raise it as a group concern, and then maybe it won't get ignored.

  • I was able to keep a case open for a couple of months.  It got closed in October during my paternity leave, because I was unable to reply while our SAIDs were being renewed.

    I have my HP sales engineer in touch on this; I actually just sent him a link to this thread.  Let's hope he will be able to make something happen!

  • Checking in with the group.  Has anyone found a solution to these issues?

    Thank you,



  • I haven't yet...  I have a couple of consultants also looking at it; they haven't found any specific reason either.  It still happens every 2-4 weeks, at random, with the same symptoms and resolution as before: only a restart of the manager service is needed to fix it...