Cadet 2nd Class

ESM6 ingest first indication of something else bad..?


-= 2015-01-05 Update -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Answer wrote:

You need to add a line in the server.defaults.wrapper.conf file, in the additional parameters section.

My line:  wrapper.java.additional.12=-XX:ReservedCodeCacheSize=256m

The index (12 in my case) needs to be adjusted depending on your file and the changes you may have made to it.

NOTE: Do not modify this parameter without first consulting support.  This is a pretty dramatic change and there is definitely a wrong way to do it!


We have finally tested this in both our production and development environments. We haven't had a recurrence in Dev in about 3 months, or in prod in about 1 month. That seems like a success. We have had moments where the ESM will start to cache across the connectors, but it seems to recover without a restart. Thanks for all the help and support, everyone. I am sure there may be other issues with similar symptoms, but this specific one seems to be resolved on our system.
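
For anyone applying the fix above, here is a rough sketch of what the additional-parameters section of the wrapper config might end up looking like. The surrounding entries below are hypothetical placeholders - your file will have its own set, so use the next unused index rather than 12, and as noted above, check with support before touching this.

# ... existing wrapper.java.additional.N entries (hypothetical examples - yours will differ) ...
wrapper.java.additional.10=-XX:+HeapDumpOnOutOfMemoryError
wrapper.java.additional.11=-XX:+UseConcMarkSweepGC
# new line from the accepted answer; the index must be the next unused number in your file
wrapper.java.additional.12=-XX:ReservedCodeCacheSize=256m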

-= ORIGINAL MESSAGE FOLLOWS -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

I know this is a fishing expedition, but I'm hoping we aren't the only ones going through this... Maybe as a group we can combine our knowledge and findings.

Basically, we have a system that we have been running in production at between 10-25k EPS on ESM6 Patch 1.

System specs exceed the recommendations (512 GB RAM, SSDs in RAID 10, 4x 10-core Xeons).


Anyway - under normal circumstances our system does not have issues (besides the ones already identified in other threads). But periodically (seemingly at random, ~1-2 times per month max; we have gone 2+ months with no issue) we hit this slowdown. It is visible in nearly everything, and a manager service restart fixes the issue completely.

Noticed symptoms:

-Insertion rates go down drastically overall (all connectors - normally we maintain 50-80ms ingest times on monitor:102 events; after the slowdown, 200-500ms+)

-ESM garbage collections happen much more frequently (normally Full GCs are about 1/hr; during the slowdown they are sporadic, ~10-30 minutes apart)

-Large trends slow down drastically (usually taking about twice as long)

-CPU utilization goes up dramatically (normally we maintain 10-20% CPU at peak; it will spike to and hold at 60-70%+)

Our investigations have come up fairly dry:

-The slowdown does not occur at specific timeframes; thus far it seems random - occurring maybe 1-2 times per month max

-Does not necessarily start during EPS peaks (we normally peak around 10am PST - this has occurred at night, on Sundays, and at 2pm - no definitive time/EPS correlation)

-I do have monitoring set up looking for resource changes (trends/lists/etc.) and nothing conclusive has come up

-Restarting the ESM/manager service *only* completely resolves the issue (not the server, MySQL, logger, etc.)

      -^^^To add to the point about the restart fixing it - ZERO other changes are made to the system (conf, content, event filtering/aggregation, etc.)

To be honest, I am a bit at a loss... Has anyone else come across something similar - found a resolution or made headway?

Just for full disclosure, the link below contains our system specs and configured parameters:

https://protect724.arcsight.com/thread/8414

Edit: 9/24/2014 - I don't know why this wasn't done earlier... but I wrote a quick script that will put all of your Full GCs in order:

grep "Full GC" -A1 /opt/arcsight/manager/logs/default/server.std.log* | awk '{print $6 " " $7 " | " $9 " " $10}' | sort

Cadet 2nd Class

Does restarting the manager service *only* resolve the issue for you too?

GCs occurring much more frequently than before this started?

What is normal CPU utilization at 20k EPS?

No hardware, ArcSight content, or OS changes, I assume?

Anything indicative in the logs? I have had 3 sets of eyes plus support look, and nothing has been found so far... but we haven't given up.

Vice Admiral

Crap!  I was hoping that 6.5 would resolve the issue... Looks like they have no idea what the actual problem is!


This is my obligatory "Me too!" post. We just recently finished our 6.5 migration, and I noticed that we had hit this wall last night. I'm not sure how long we'd been running behind, but 3-4 hours after I bounced the manager service we were still cooking 10k+ EPS as our largest connectors were still trying to catch up on cached events, and we were still getting events that were 13 hours old.

To make matters worse, we're rolling out a tiered architecture where we have a pair of independent managers in each of 5 different regions. I'm not looking forward to having to keep an eye on this issue across 10 different ESM instances. I reeeaaally need this problem to just go away.

Cadet 2nd Class

1. Yes, seems like it.

2. The logs have rolled over. In normal operation we have a Full GC about every 2 minutes... I believe it was about the same.

3. Really inconsistent, but I would say around 40-50% (this is unacceptable for what we're doing... it seems like just persisting events is generating way too much CPU utilization).

4. No significant changes during that time, no.

Cadet 2nd Class

Every two minutes is pretty fast for Full GCs. I've heard that roughly one Full GC per hour is what you're shooting for, optimally, alongside each GC being quick (I don't know the exact number on that; it might be 200MB/sec)... Do you have RAM you could allocate to your ESM JVM heap? (server.wrapper.conf file - config parameters wrapper.java.initmemory and wrapper.java.maxmemory)
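
(If you do have headroom, here is a rough sketch of what those two parameters look like - the value below is purely an example in MB, so size it to what your box can actually spare and to the sizing guidance for your version:)

# JVM heap settings in server.wrapper.conf (example values only - size to your system)
wrapper.java.initmemory=16384
wrapper.java.maxmemory=16384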

Thanks for sharing!

Captain

This may be a moot point, and one already addressed by support, but has anyone been able to run ESM in debug mode?

Absent Member.

superman wrote:

This may be a moot point, and one already addressed by support, but has anyone been able to run ESM in debug mode?

I haven't tried debug mode in 6.x yet, but running debug mode in 5.x slows the system down to under 1,000 EPS (which I imagine would be too slow to recreate your issue).

-Joe

Vice Admiral

Just had the caching problem this morning.  I did take some time to check on the Full GCs and here's the result:

INFO   | jvm 1    | 2013/12/12 04:26:03 | [Full GC

INFO   | jvm 1    | 2013/12/12 04:26:07 |  5027791K->1940325K(16700032K), 3.4398970 secs]

--

INFO   | jvm 1    | 2013/12/12 05:26:07 | [Full GC

INFO   | jvm 1    | 2013/12/12 05:26:10 |  5101306K->1980106K(16696000K), 3.4956920 secs]

--

INFO   | jvm 1    | 2013/12/12 06:26:10 | [Full GC

INFO   | jvm 1    | 2013/12/12 06:26:14 |  5970906K->1982132K(16688896K), 3.4655490 secs]

--

INFO   | jvm 1    | 2013/12/12 07:26:14 | [Full GC

INFO   | jvm 1    | 2013/12/12 07:26:18 |  6086238K->1987097K(16682816K), 3.7416110 secs]

--

INFO   | jvm 1    | 2013/12/12 08:26:18 | [Full GC

INFO   | jvm 1    | 2013/12/12 08:26:22 |  7181117K->2086567K(16661056K), 4.3971330 secs]

--

INFO   | jvm 1    | 2013/12/12 09:26:22 | [Full GC

INFO   | jvm 1    | 2013/12/12 09:26:26 |  8715083K->2064446K(16682048K), 4.4700140 secs]

--

Caching started at about 09:40

--

INFO   | jvm 1    | 2013/12/12 09:52:20 | [Full GC

INFO   | jvm 1    | 2013/12/12 09:52:22 |  11269172K->2242501K(16532096K), 2.6180600 secs]

--

INFO   | jvm 1    | 2013/12/12 10:10:49 | [Full GC

INFO   | jvm 1    | 2013/12/12 10:10:51 |  11163293K->2158772K(16332608K), 2.5266630 secs]

--

Manager restarted here

--

INFO   | jvm 1    | 2013/12/12 10:18:43 | [Full GC

INFO   | jvm 1    | 2013/12/12 10:18:43 |  5744K->5238K(16078208K), 0.1411190 secs]

--

INFO   | jvm 1    | 2013/12/12 10:19:55 | [Full GC

INFO   | jvm 1    | 2013/12/12 10:19:56 |  159839K->141899K(16078208K), 0.7561660 secs]

--

INFO   | jvm 1    | 2013/12/12 10:20:26 | [Full GC

INFO   | jvm 1    | 2013/12/12 10:20:26 |  152324K->137378K(16078208K), 0.6157070 secs]

--

INFO   | jvm 1    | 2013/12/12 10:20:26 | [Full GC

INFO   | jvm 1    | 2013/12/12 10:20:27 |  137506K->136679K(16078208K), 0.5874090 secs]

So it does seem to do Full GCs more often when the problem is occurring.
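
If anyone wants to watch this in near-real time rather than grepping the logs after the fact, here is a rough approach - it assumes the JDK's jstat tool is present on the manager host, that you run it as the same user that owns the manager process, and that you substitute the manager JVM's actual PID:

# find the manager java process (adjust the grep pattern for your install)
ps -ef | grep java | grep -i arcsight

# sample heap occupancy and GC counters every 10 seconds;
# the FGC / FGCT columns are the Full GC count and cumulative Full GC time
jstat -gcutil <manager_pid> 10000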

Captain

Would the increased frequency of GC mean there is more garbage to collect?

Does memory free up after the GC?

Absent Member.

As we are facing the same symptoms with ESM 5.2 Patch 2, I wouldn't bet it's a CORR-E issue. Anybody else seeing this on 5.x?

Vice Admiral

Now that you're talking about it, we might have been facing the same issue with 5.2... We always blamed Oracle for the problems we had, but that might not actually have been the case... We had a much lower EPS than we have right now (2k vs 10k)...

Absent Member.

I think it's a valid point, given the amount of code shared between 5.x and 6.x on the manager side. Do you think the issue first appeared in 5.2 Patch 2?

-Joe
