Cadet 2nd Class
4258 views

ESM6 ingest first indication of something else bad..?

Jump to solution

-= 2015-01-05 Update -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Answer wrote:

You need to add a line in the server.defaults.wrapper.conf file, in the additional parameters section.

My line:  wrapper.java.additional.12=-XX:ReservedCodeCacheSize=256m

The index (12 in my case) needs to be adjusted depending on your file and any changes you may have made to it.

NOTE: Do not modify this parameter without first consulting support.  This is a pretty dramatic change and there is definitely a wrong way to do it!


We have finally tested this in both our production and development environments.  We haven't had a recurrence in Dev in about 3 months, or in prod in about 1 month, so this seems like a success.  We have had moments where the ESM will start to cache across the connectors, but it seems to recover without a restart.  Thanks for all the help and support, everyone.  I am sure there may be other symptoms that look similar, but this specific issue seems to be resolved on our system.
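Purely for illustration, here is a rough sketch of how the additional-parameters block of server.defaults.wrapper.conf might end up looking; the surrounding entries and index numbers below are assumptions about a typical file, so match them to whatever is actually in yours (and, per the note above, check with support first):

# existing additional JVM parameters (hypothetical examples)
wrapper.java.additional.10=-Dsome.existing.option=value
wrapper.java.additional.11=-XX:+SomeExistingFlag
# new line - use the next unused index in your own file
wrapper.java.additional.12=-XX:ReservedCodeCacheSize=256m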

-= ORIGINAL MESSAGE FOLLOWS -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

I know this is a fishing expedition, but I'm hoping we aren't the only ones going through this...  Maybe as a group we can combine our knowledge and findings...

Basically, we have a system that we have been running in production at between 10-25k EPS on ESM6 Patch 1.

System specs are beyond the recommendations (512 GB RAM, SSDs in RAID 10, 4x10 Xeons).


Anyway - under normal circumstances our system does not have issues (besides the ones already identified in other threads).  But periodically (randomly, ~1-2 times per month at most; we have gone 2+ months with no issue) we get this slowdown.  It is visible in nearly everything, and a manager service restart fixes the issue completely.

Noticed symptoms:
-Insertion rates overall go down drastically (all connectors - normally we maintain between 50-80 ms ingest on monitor:102 events; after the slowdown, 200-500+ ms)

-ESM garbage collections happen extremely frequently (normally about 1/hr; during the slowdown they occur roughly every 10-30 minutes - a quick way to watch this is sketched after this list)

-Large trends slow down drastically (usually take about twice as long)

-CPU utilization goes up dramatically (normally we maintain 10-20% CPU at peak, it will spike and maintain 60-70%+)
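If it helps anyone correlate, one quick way to watch the GC behaviour while a slowdown is in progress (a rough sketch, assuming the JDK tools are available on the manager host and that you substitute the actual manager java PID):

# print heap space utilization and GC counts/times for the manager JVM every 10 seconds
jstat -gcutil <manager-pid> 10000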

Our investigations have come up fairly dry:

-The slowdown does not occur at specific timeframes; thus far it seems random - it occurs maybe 1-2 times per month at most

-It does not necessarily start during EPS peaks (we normally peak around 10am PST - this has occurred at night, on Sundays, and at 2pm - no definitive time/EPS correlation)

-I do have monitoring set up looking for resource changes (trends/lists/etc.) and nothing conclusive has come up

-Restarting *only* the ESM/manager service completely resolves the issue (not the server, MySQL, logger, etc.)

      -^^^To add to the point about the restart fixing it: ZERO other changes are made to the system (config, content, event filtering/aggregation, etc.)

To be honest, I am a bit at a loss...  Has anyone else come across something similar - found a resolution or made headway?

Just for full disclosure, the link below contains our system specs and configured parameters:

https://protect724.arcsight.com/thread/8414

Edit: 9/24/2014 - I don't know why this wasn't done earlier... but I wrote a quick script that will put all of your full GCs in order:

grep "Full GC" -A1 /opt/arcsight/manager/logs/default/server.std.log* | awk '{print $6 " " $7 " | " $9 " " $10}' | sort

0 Likes
111 Replies
Absent Member.

Thanks.  There must be something else in the equation, other than EPS, that's causing the issue.

10,000 EPS + x = manager reboot

We just need to figure out what x is.  It may be user activity like you mentioned, but what do you suspect the users are running that's crashing the manager?

Is anyone who is NOT receiving events with IPv6 fields having the issue?

-J

0 Likes

Actually, I think your "agent.threads.max reached" error is a different issue from what the bulk of this thread (no pun intended) is about. It sounds like you need to add or increase the "agents.threads.max=" parameter in server.properties.
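For reference, a hedged sketch of what that could look like (the value below is a placeholder rather than a recommendation - size it to your connector count - and the path is assumed from a default manager install):

# in /opt/arcsight/manager/config/server.properties (path assumed from a default install)
agents.threads.max=1024

Then restart the manager service so it picks up the change.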

0 Likes
Vice Admiral

I know it's not the same error.  It happened twice in December, and the max was increased (it seems it was caused by the MVM agent).  I was only stating that the manager was restarted last Monday due to this error, not because of the caching.  But what I normally get is exactly the same thing as the others here: agent caching and EPS dropping drastically, all with a big spike in CPU utilization.

0 Likes
Cadet 2nd Class

jbur wrote:

There's always time for labs if you don't sleep Heiko.  ;-)

Questions for users that see the problem in production:

1) Are you running any add-on modules like identity view?

2) Do events expire by retention or are you letting them overwrite themselves due to 0% free space?

3) Are you using trends?

4) Are you using reactivated events from offline archives?

-J

Realized I didn't answer these questions:

1) ID view is installed, but unused - no others

2) Expire by 90-day retention; I think we maintain >60% free space on the event_data tablespace

3) Heavy trending

4) Never reactivated events

0 Likes
Absent Member.

Thanks for the information!

More troubleshooting questions for all concerned:

1) Have you tried bumping the manager heap size down to 12GB to see if that's more stable?

2) Do you have enough free memory available for ioMemory VSL to run properly? (Release notes indicate about 40GB of physical memory is needed if you have 12TB of Fusion-io storage - 6TB usable in RAID 10.) One of you reported that your SAN-based ESM 6 install has no problems, while the Fusion-io powered system requires manager restarts.

3) Anyone seeing "manager:200" events?  What kind of "database:" and "manager:" events are you seeing when the problem occurs?

-J

0 Likes
Cadet 2nd Class

jbur wrote:

There's always time for labs if you don't sleep Heiko.  ;-)

Questions for users that see the problem in production:

1) Are you running any add-on modules like identity view?

2) Do events expire by retention or are you letting them overwrite themselves due to 0% free space?

3) Are you using trends?

4) Are you using reactivated events from offline archives?

-J

One observation that I've made on multiple installs:

When a storage group is full, instead of aging out old data (as advertised), ESM opens up new threads trying to persist data and ends up DoSing the service. Once ESM reaches this state you cannot even log in… Note that users who are already logged in won't be kicked out.

That's when you'll see things like this in server.std.log:

WARNING: '24' agent requests REJECTED because the limit of '768' agent threads was exceeded.

The way to get around this is to reduce the retention period in ESM so that data is purged based on the retention period as opposed to available space.
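If you want to check whether you're hitting this state, a simple sketch (assuming the default manager log location used elsewhere in this thread) is to count those rejection warnings:

grep -c "agent threads was exceeded" /opt/arcsight/manager/logs/default/server.std.log*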

/D

0 Likes
Cadet 2nd Class

Do any of you have the following entry/exception in server.log?

[ERROR][default.com.arcsight.server.util.ASComponentAssetsUtil]

com.arcsight.common.resource.InvalidResourceNameException

Thanks,

David

0 Likes
Vice Admiral

Hi David,


I have a ton of them: Agent asset auto-creation failed ...... (Character '%' is not allowed).

They mostly seem to happen during the night, when the vulnerability scans are running and updating the assets.

You have a case open for that?

0 Likes
Cadet 2nd Class

In our environment, this exception was limiting the persistence rate to ~22k EPS. Since addressing this we've been able to persist 40k EPS consistently as the connector caches cleared out.

The exception is caused by connectors including IPv6 addresses in hostnames, which are passed to ESM for asset creation. Since the following characters are not allowed in ESM resource names: "/~`*%", an IPv6 address containing the % character triggers an exception in the asset creation process.

On the connector side this issue is identified by CON-13478 and CON-13620 and is supposed to be fixed in the next connector release.

In the meantime, we've disabled IPv6 on all connector servers sending events to ESM (preventing this error from being thrown when an asset is created for the agent server).

If you have asset auto-creation enabled, or maintain an asset model based on vulnerability scanner output, solving this issue might be more challenging (requiring a parser override, etc.).
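For anyone wanting to confirm they're hitting the same thing, a rough sketch (the log path is assumed from a default install, and the sysctl approach is just one way to disable IPv6 on a Linux connector host - test it first, since disabling IPv6 system-wide can have side effects):

# count the asset-creation exceptions in the manager logs
grep -c "InvalidResourceNameException" /opt/arcsight/manager/logs/default/server.log*

# disable IPv6 at runtime on a Linux connector host (persist via /etc/sysctl.conf if you keep it)
sysctl -w net.ipv6.conf.all.disable_ipv6=1
sysctl -w net.ipv6.conf.default.disable_ipv6=1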

Thanks,

David


0 Likes
Absent Member.

Those on this thread are reporting more frequent full GCs in the logs and increased CPU utilization during the problem.


Anyone seeing deviceEventClassId "manager:200" events?  What kind of "database:" and "manager:" events are you seeing when the problem occurs?

-Joe

0 Likes

My company added its name to those bugs recently too. Those asset errors are killing us.

0 Likes
Vice Admiral

On my side, I see none of these events, except that when the manager is restarted I get a manager:100.

0 Likes