Established Member.. raymond.doty

ESM6 ingest first indication of something else bad..?


-= 2015-01-05 Update -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Answer wrote:

You need to add a line in the server.defaults.wrapper.conf file, in the additional parameters section.

My line:  wrapper.java.additional.12=-XX:ReservedCodeCacheSize=256m

The index (12 in my case) needs to be adjusted depending on your file and changes you may have done to it.

NOTE: Do not modify this parameter without first consulting support.  This is a pretty dramatic change and there is definitely a wrong way to do it!
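Since the index has to fit your own file, here is a minimal sketch of one way to find the next free index. The function name next_additional_index is made up for illustration, and the sketch assumes the existing wrapper.java.additional.N entries are numbered without gaps and that commented-out lines start with "#". It only reads the file; it does not change anything.

```shell
# Hypothetical helper: print the next unused wrapper.java.additional.<N> index.
# Usage: next_additional_index /path/to/server.defaults.wrapper.conf
next_additional_index() {
    conf_file="$1"
    # Pull the numeric suffix from every uncommented
    # wrapper.java.additional.N=... line, take the highest one,
    # and print it plus 1 (prints 1 if no such line exists).
    sed -n 's/^[[:space:]]*wrapper\.java\.additional\.\([0-9][0-9]*\)[[:space:]]*=.*/\1/p' "$conf_file" |
        sort -n | tail -n 1 | awk '{ n = $1 } END { print n + 1 }'
}
```

If the highest existing entry in your file were wrapper.java.additional.11, this would print 12. Check the result against the file by eye before editing, and see the NOTE above about consulting support first.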


We have finally tested this in both our production and development environments.  We haven't had a recurrence in Dev in about 3 months, or in Prod in about 1 month, so this looks like a success.  We have had moments where the ESM starts to cache across the connectors, but it seems to recover without a restart.  Thanks, everyone, for the help and support. There may well be other issues with similar symptoms, but this specific one seems to be resolved on our system.

-= ORIGINAL MESSAGE FOLLOWS -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

I know this is a fishing expedition, but I'm hoping we aren't the only ones going through this...  Maybe as a group we can combine our knowledge and findings..

Basically we have a system which we have been running in production at 10-25k EPS on ESM6 patch 1.

System specs are beyond recommendations (512 GB RAM, SSDs in RAID 10, 4x10 Xeons).


Anyway - under normal circumstances our system does not have issues (besides ones already identified in other threads).  But periodically (at random, roughly 1-2 times per month at most; we have gone 2+ months with no issue) we hit a slowdown that is visible in nearly everything, and a manager service restart fixes it completely.

Noticed symptoms:
-Insertion rates overall drop drastically (all connectors; we normally maintain 50-80 ms ingest on monitor:102 events, and see 200-500+ ms once the slowdown starts)

-ESM garbage collections happen far more frequently (normally about 1/hr; during a slowdown they are sporadic, ~10-30 minutes apart)

-Large trends slow down drastically (usually take about twice as long)

-CPU utilization goes up dramatically (normally we maintain 10-20% CPU at peak, it will spike and maintain 60-70%+)

Our investigations have come up fairly dry:

-The slowdown does not occur at specific times; thus far it seems random, occurring maybe 1-2 times per month at most

-It does not necessarily start during EPS peaks (we normally peak around 10am PST; this has occurred at night, on Sundays, and at 2pm, so there is no definitive time/EPS correlation)

-I do have monitoring set up looking for resource changes (trends/lists/etc), and nothing conclusive has come up

-Restarting *only* the ESM/manager service completely resolves the issue (no restart of the server, MySQL, logger, etc. is needed)

      -Adding to the point about the restart: ZERO other changes are made to the system (conf, content, event/filtering/aggregation, etc)

To be honest, I am a bit at a loss...  Has anyone else come across something similar - found resolution or made headway?

For full disclosure, the link below contains our system specs and configured parameters:

https://protect724.arcsight.com/thread/8414

Edit: 9/24/2014 - I don't know why this wasn't done earlier... but I wrote a quick script that will put all of your full GCs in order:

grep "Full GC" -A1 /opt/arcsight/manager/logs/default/server.std.log* | awk '{print $6 " " $7 " | " $9 " " $10}' | sort
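To see the jump in GC frequency at a glance, the sorted lines can be bucketed by hour. This is only a rough sketch: the function name gc_per_hour is made up, and it assumes each input line starts with a date and a time in "YYYY-MM-DD HH:MM:SS" form, which may not match what your log version prints, so adjust the fields to suit.

```shell
# Hypothetical helper: count input lines per hour.
# Reads lines from stdin that start with "YYYY-MM-DD HH:MM:SS" (assumed format)
# and prints one "YYYY-MM-DD HH:00 <count>" line per hour, sorted.
gc_per_hour() {
    awk '{ key = $1 " " substr($2, 1, 2) ":00"; count[key]++ }
         END { for (key in count) print key, count[key] }' | sort
}
```

Pipe any list of timestamp-first lines into it. Under normal load you would expect roughly one entry per hour; hours with several entries line up with the "10-30 minutes apart" symptom described above.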


111 Replies
jbur Absent Member.

Re: ESM6 ingest first indication of something else bad..?

Which features are you using?  Active lists, session lists, rules, trends, actors, zones, query viewers, data monitors, scheduled reports, etc.

I think this may also be important for figuring out why the manager died; it may not be as simple as EPS and the current running config.

-Joe

Established Member.. raymond.doty

Re: ESM6 ingest first indication of something else bad..?

This is a bland response... But we are using nearly all of it.

No usage:

-Actors

-session lists

-pattern discovery, etc

Light usage:

-data monitors (mostly last N, but we do have a moving average and a few others; I'd have to go back and look)

-scheduled reports (about 5 per day, all running off of small trends)

-Query viewers (we have a few set up off of trends that refresh every 4 hours, gathering basic stats like total event counts, etc... Most take less than 10 seconds to populate)

Heavy usage:

-zones/assets (total estimates at about 4000 zones, 8000 assets I believe)

-active lists (we have a few large lists for active/new processes)

-rules (some extremely complex and a few 'expensive' rules)

-trends (as discussed in previous threads, we found ways to make trends work, but basically it involves narrowing down to 1-2 key fields, max. We run about 20 trends per hour, I think total trend run time per hour is ~30 minutes - all trends combined)

Also, we have about 60 connectors: about 40% Windows, 40% firewall/syslog, and a smattering of other things like IDS, AV, etc...

The reason I am hesitant to blame content is that we do not make changes to the content, and restarting the ESM service alone resolves the issue for weeks+...

jbur Absent Member.

Re: ESM6 ingest first indication of something else bad..?

Thanks Ray.  Posting that information may help others with the manager issue find commonalities with what you have.

I had my ESM 6.0c box up for a month without issues, but I had low EPS and not much content.  I'm hoping to get a 6.5 environment running and stress test it for at least 30 days.  I'll post my findings.

-Joe

Answer Honored Contributor.

Re: ESM6 ingest first indication of something else bad..?

We have the EXACT same problem as you, Ray. It happens roughly every 1-2 weeks. We are running 6.0c Patch 1 (planning to install patch 2 soon) at 10K EPS sustained. System specs are 128 GB RAM, 2x8 Xeons, HP SAN.

We are using a ConnAPP - Logger - ESM architecture and we clearly see the loggers begin to cache badly when the problem occurs.

Symptoms and investigation results are about the same. Kinda happy to see we are not alone... I have had multiple cases with support, and every time content was blamed for causing thread blockages (I have a hard time believing this). I'm pretty sure it's Java getting overwhelmed.

Answer Honored Contributor.

Re: ESM6 ingest first indication of something else bad..?

Oh, to add to the investigation: tuning the MySQL buffer size seems to have helped when the problem arises. With stock MySQL settings, incoming EPS was dropping to about 2K; now, with updated settings, incoming EPS stays at 5-6K when the caching occurs.

Have you done any MySQL tuning?

Established Member.. raymond.doty

Re: ESM6 ingest first indication of something else bad..?

Thanks for the response.  That is interesting... I think we have pretty much been MySQL tuned since the onset of ESM...  Our EPS under duress doesn't really drop below 5-6k.

Are all your cases with support closed?  I don't think I've been able to keep one open, because this happens so intermittently (typically once per month), and because they constantly blame content, even though disabling the content they find in thread dumps doesn't resolve the issue...

If enough of us are seeing this, maybe we can raise it as a group concern, and maybe then it won't get ignored.

Answer Honored Contributor.

Re: ESM6 ingest first indication of something else bad..?

I was able to keep a case open for a couple of months. It got closed in October during my paternity leave because I was unable to reply due to our SAIDs being renewed.

I have my HP sales engineer in touch on this. I actually just sent him a link to this thread. Let's hope he will be able to make something happen!

jbur Absent Member.

Re: ESM6 ingest first indication of something else bad..?

Checking in with the group.  Has anyone found a solution to these issues?

Thank you,

-Joe

Established Member.. raymond.doty

Re: ESM6 ingest first indication of something else bad..?

I haven't yet... We have a couple of consultants in looking at it too, and they haven't found any specific cause yet.  It happens every 2-4 weeks, at random.  Same symptoms and resolution as before though: we only need to restart the manager service to fix it...

Answer Honored Contributor.

Re: ESM6 ingest first indication of something else bad..?

Same here, nothing new. Are you guys running patch 2?

matslofva
New Member.

Re: ESM6 ingest first indication of something else bad..?

Seeing the exact same thing in ESM 6.5.

EPS down to 4-5K EPS from 20K EPS, connectors caching, CPU utilization at 70% ballpark.

Established Member.. raymond.doty

Re: ESM6 ingest first indication of something else bad..?

Does restarting the manager service *only* resolve the issue for you too?

GCs occurring much more frequently than before this started?

What is normal CPU utilization at 20k EPS?

No hardware, arcsight content, or OS changes I assume?

Anything indicative in the logs?  I have had 3 sets of eyes + support look and find nothing so far... But we haven't given up.

Answer Honored Contributor.

Re: ESM6 ingest first indication of something else bad..?

Crap!  I was hoping that 6.5 would resolve the issue... Looks like they have no idea what the actual problem is!

deathbywedgie1 Frequent Contributor.

Re: ESM6 ingest first indication of something else bad..?

This is my obligatory "Me too!" post. We just recently finished our 6.5 migration, and I noticed that we had hit this wall last night. I'm not sure how long we'd been running behind, but 3-4 hours after I bounced the manager service we were still cooking 10k+ EPS as our largest connectors were still trying to catch up on cached events, and we were still getting events that were 13 hours old.

To make matters worse, we're rolling out a tiered architecture where we have a pair of independent managers in each of 5 different regions. I'm not looking forward to having to keep an eye on this issue across 10 different ESM instances. I reeeaaally need this problem to just go away.
