ALERT! The community will be read-only starting on April 19, 8am Pacific as the migration begins. Read more for important details.
Cadet 2nd Class Cadet 2nd Class
Cadet 2nd Class
4261 views

ESM6 ingest first indication of something else bad..?


-= 2015-01-05 Update -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Answer wrote:

You need to add a line in the server.defaults.wrapper.conf file, in the additional parameters section.

My line:  wrapper.java.additional.12=-XX:ReservedCodeCacheSize=256m

The index (12 in my case) needs to be adjusted depending on your file and changes you may have done to it.

NOTE: Do not modify this parameter without first consulting support.  This is a pretty dramatic change and there is definitely a wrong way to do it!
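Since the index has to fit whatever is already in your file, here is a minimal Python sketch for working out the next free `wrapper.java.additional.N` slot. It is only an illustration: the function names are made up for this example, it assumes the entries follow the `wrapper.java.additional.<number>=<value>` pattern shown above, and it only builds the line as a string; it does not edit the file. The note about consulting support before changing anything still applies.

```python
import re

# Matches active lines such as:
#   wrapper.java.additional.12=-XX:ReservedCodeCacheSize=256m
# Commented-out lines (starting with '#') do not match.
_ADDITIONAL = re.compile(r"^\s*wrapper\.java\.additional\.(\d+)\s*=\s*(.*)$")


def next_additional_index(conf_text):
    """Return (next_free_index, existing) for wrapper.java.additional.N entries.

    `existing` maps each index found in the text to its value.
    """
    existing = {}
    for line in conf_text.splitlines():
        match = _ADDITIONAL.match(line)
        if match:
            existing[int(match.group(1))] = match.group(2).strip()
    # One past the highest index in use (1 if there are no entries at all).
    return max(existing, default=0) + 1, existing


def code_cache_line(conf_text, size="256m"):
    """Build the line to add, or return None if the flag is already set."""
    next_index, existing = next_additional_index(conf_text)
    if any("ReservedCodeCacheSize" in value for value in existing.values()):
        return None
    return f"wrapper.java.additional.{next_index}=-XX:ReservedCodeCacheSize={size}"


if __name__ == "__main__":
    # Hypothetical file contents, for illustration only.
    sample = (
        "wrapper.java.additional.10=-Dexample.one=true\n"
        "wrapper.java.additional.11=-Dexample.two=true\n"
    )
    print(code_cache_line(sample))
```

To use it on a real system you would read the contents of server.defaults.wrapper.conf into a string and pass that in, then review the suggested line by hand before adding it.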


We have finally tested this in both our production and development environments.  We haven't had a recurrence in Dev in about 3 months or in Prod in about 1 month, so this looks like a success.  We still have moments where the ESM starts to cache across the connectors, but it recovers without a restart.  Thanks, everyone, for the help and support.  There may be other issues with similar symptoms, but this specific one seems to be resolved on our system.

-= ORIGINAL MESSAGE FOLLOWS -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

I know this is a fishing expedition, but I'm hoping we aren't the only ones going through this...  Maybe as a group we can combine our knowledge and findings.

Basically, we have a system that has been running in production at between 10k and 25k EPS on ESM6 patch 1.

System specs are beyond recommendations (512 GB RAM, SSDs in RAID 10, 4x10 Xeons).


Anyway - under normal circumstances our system does not have issues (besides the ones already identified in other threads).  But periodically (at random, 1-2 times per month at most; we have gone 2+ months with no issue) we get this slowdown.  It is visible in nearly everything, and a manager service restart fixes the issue completely.

Noticed symptoms:
-Insertion rates go down drastically across all connectors (normally we maintain 50-80 ms ingest on monitor:102 events; after the slowdown it is 200-500+ ms)

-ESM garbage collections happen far more frequently (normally about one per hour; during the slowdown they are sporadic, roughly 10-30 minutes apart)

-Large trends slow down drastically (they usually take about twice as long)

-CPU utilization goes up dramatically (normally we maintain 10-20% CPU at peak; it spikes to 60-70%+ and stays there)

Our investigations have come up fairly dry:

-The slowdown does not occur at specific times; so far it seems random, occurring maybe 1-2 times per month at most

-It does not necessarily start during EPS peaks (we normally peak around 10am PST; this has occurred at night, on Sundays, and at 2pm, with no definitive time/EPS correlation)

-I do have monitoring set up looking for resource changes (trends/lists/etc.) and nothing conclusive has come up

-Restarting *only* the ESM/manager service completely resolves the issue (no restart of the server, MySQL, logger, etc. is needed)

      -^^^Adding to the restart fixing it - ZERO other changes are made to the system (conf, content, event/filtering/aggregation, etc.)

To be honest, I am a bit at a loss...  Has anyone else come across something similar, and found a resolution or made headway?

For full disclosure, the link below contains our system specs and configured parameters:

https://protect724.arcsight.com/thread/8414

Edit: 9/24/2014 - I don't know why this wasn't done earlier, but I wrote a quick one-liner that lists all of your full GCs in order:

grep "Full GC" -A1 /opt/arcsight/manager/logs/default/server.std.log* | awk '{print $6 " " $7 " | " $9 " " $10}' | sort
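Once the full GCs are in order, the gaps between them are what matter for the symptom described above (about one per hour normally, 10-30 minutes apart during a slowdown). Here is a minimal Python sketch of that comparison. It deliberately does not parse server.std.log, since the exact timestamp layout may differ between systems; it assumes you have already turned the timestamps into `datetime` objects. The function name and the 30-minute threshold are illustrative choices, not anything from ESM.

```python
from datetime import datetime, timedelta


def short_gc_intervals(timestamps, threshold=timedelta(minutes=30)):
    """Return (previous, current, gap) for consecutive full GCs
    that are no more than `threshold` apart."""
    ordered = sorted(timestamps)
    flagged = []
    # Pair each full GC with the one that follows it.
    for previous, current in zip(ordered, ordered[1:]):
        gap = current - previous
        if gap <= threshold:
            flagged.append((previous, current, gap))
    return flagged


if __name__ == "__main__":
    # Made-up timestamps, for illustration only.
    full_gcs = [
        datetime(2014, 9, 24, 10, 0),
        datetime(2014, 9, 24, 11, 0),
        datetime(2014, 9, 24, 11, 20),
        datetime(2014, 9, 24, 11, 35),
    ]
    for previous, current, gap in short_gc_intervals(full_gcs):
        print(f"{previous} -> {current}: {gap}")
```

A run of flagged intervals clustered together would line up with the "sporadic, 10-30 minutes apart" behavior, while the occasional isolated short gap is probably nothing to worry about.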

0 Likes
111 Replies
Absent Member.
Absent Member.

We started with a fresh 5.2 Patch 2 install, so there is no experience with previous 5.x versions.

0 Likes
Cadet 2nd Class Cadet 2nd Class
Cadet 2nd Class

It is interesting that you mention this.  We didn't have a similar experience in 5.2; we had other issues, but not this specific one.  Our investigation does seem to be leaning towards manager code and memory management, but we're still troubleshooting...

0 Likes
Absent Member.
Absent Member.

I'd like to try to replicate the issue in a lab with ESM 6.5.  Are these the correct test parameters to trigger the problem?

  • Multiple connectors feeding a single ESM instance (connector version >= 6.x)
  • Total event throughput > 10,000 EPS
  • Minimum test duration = 2 weeks

Are any of those affected using peer search or archives?

Thank you,

Joe

0 Likes

We average 4-5k EPS during the day on the manager that we encountered this on recently, and it only took about 10 days of uptime before we encountered it. I had not yet enabled peer search, but we do use archives. I look forward to your lab test results. There's no telling what other user or content behaviors contribute to the issue, but if you're able to reproduce it with basic (but high volume) forwarding in a lab without a lot of user activity, that will be very meaningful.

--dbw

0 Likes
Absent Member.
Absent Member.

I wish I also had a chance to do labs like that.

Over here, it's also between 5 and 7 kEPS, multiple connectors (20-ish to 40-ish) 6.x and also some 5.x. Test duration could be longer, I would reckon 2 to 4 weeks.

0 Likes
Vice Admiral
Vice Admiral

I've been bleeping my lab at 15k eps for about a week and a half now. No problems to this day. I do have far less content in the lab than in production, so my guess is that content does play a role. The problem arises in production about every 10ish days.

0 Likes
Absent Member.
Absent Member.

There's always time for labs if you don't sleep, Heiko.  😉

Questions for users that see the problem in production:

1) Are you running any add-on modules like identity view?

2) Do events expire by retention or are you letting them overwrite themselves due to 0% free space?

3) Are you using trends?

4) Are you using reactivated events from offline archives?

-J

0 Likes
Vice Admiral
Vice Admiral

1) We have IT Gov, PCI and Cisco packages. But most of the rules are disabled / not yet implemented.

2) They expire by retention (33 days), we have 90% free space.

3) Yes, quite a lot of trends (more than I'd like to have). Also quite a lot of Data Monitors / scheduled reports.

4) We are not using archives.

0 Likes
Absent Member.
Absent Member.

Merci!

My ESM 6.5 instance is now running at 6500 EPS (multiple connectors and products, mostly TestAlert replays with some production logs coming off my firewall).

While that's cooking for a few weeks, I recommend we compare the configuration of our systems.

  1. Anyone having the issue that is NOT using trends?
  2. Anyone else using the IT Gov, PCI or Cisco packages?
  3. Anyone using payload sampling and payload retrieval?
  4. Anyone using cases or annotation/stages?

-Joe

0 Likes
Captain
Captain

ESM 6.0 Patch 1 ( Patch 2 not installed ).

22-30K eps for 30 days+  ;  no reboot or manager restart  ;  no issues

  1. Anyone having the issue that is NOT using trends? - One trend of 200,000 entries with 5 fields of 12-14 characters each, excluding timestamps
  2. Anyone else using the IT Gov, PCI or Cisco packages? - Only one FISMA package
  3. Anyone using payload sampling and payload retrieval? - Not using yet.
  4. Anyone using cases or annotation/stages? - Yes, annotations on less than 0.01% of events

    Free space 77%

        Disabled heaviest DataMonitors provided with stock content.


0 Likes
Absent Member.
Absent Member.

More questions to try to narrow this down. (including @ superman since his system is working)

  1. How much memory are you assigning to the manager?
  2. Which connector versions are you using?
  3. Does any of your data contain IPv6 fields?
  4. Are you using SSD or HDD for storage?

Thank you,

-J

0 Likes
Cadet 2nd Class Cadet 2nd Class
Cadet 2nd Class

Adding to that (in superman's case): a list of all customized parameters in my.cnf, postgresql.conf and server.properties would be great.

Thanks,

David

0 Likes
The opinions expressed above are the personal opinions of the authors, not of Micro Focus. By using this site, you accept the Terms of Use and Rules of Participation. Certain versions of content ("Material") accessible here may contain branding from Hewlett-Packard Company (now HP Inc.) and Hewlett Packard Enterprise Company. As of September 1, 2017, the Material is now offered by Micro Focus, a separately owned and operated company. Any reference to the HP and Hewlett Packard Enterprise/HPE marks is historical in nature, and the HP and Hewlett Packard Enterprise/HPE marks are the property of their respective owners.