nmbabkin1 Absent Member.

ESM EPS decreasing halfly after 7-10 days of uptime

Hello!

Our ESM Version is 6.5.0.1837.1 (6.5 Patch 1).

We have a critical issue regarding event processing on our ESM installation.

After 7-10 days of uptime, EPS suddenly drops by roughly half, and connectors start caching immediately.

For example, if we had 6-7K EPS before, after 7-10 days it drops to 2.5-3K. Once that happens, connector caches start growing rapidly, and eventually the connectors start dropping events.

There don't seem to be any performance problems: the hardware is large (256 GB of RAM, 32 processors), the Manager heap size is 80 GB with roughly one Full GC per hour, and the thread count in the config was increased to 2048 long ago.

It happens every time: seven, at most 10, days of uptime.

Our investigation so far has not helped.

1) First, I disabled all rules with more than 100k partial matches. It did not help.

2) Then I increased the Active List entry limit to 5 million (to avoid the AL overflow bug). It did not help.

3) I also experimented with the heap size, from 16 GB to 100 GB. No effect at all.

4) Current load average

top - 10:56:48 up 233 days, 10:40,  3 users,  load average: 26.37, 25.62, 25.87

which is fine for a 32-processor system.

5) Garbage Collector stats:

INFO   | jvm 1    | 2014/10/02 08:28:30 | [Full GC

INFO   | jvm 1    | 2014/10/02 08:28:35 |  56149204K->8650221K(82874048K), 5.8604970 secs]

INFO   | jvm 1    | 2014/10/02 08:28:35 | HOSTINFO 1412224115861 59.0 6.0 1.0 35.0 21872 105715 115 0 0 0 0 3306 537 4 9676 22890 0 0.0

--

INFO   | jvm 1    | 2014/10/02 09:22:37 | [Full GC

INFO   | jvm 1    | 2014/10/02 09:22:43 |  56190733K->8372061K(82826304K), 6.8200720 secs]

INFO   | jvm 1    | 2014/10/02 09:22:43 | SnapshotsCleanUp task starting ...

--

INFO   | jvm 1    | 2014/10/02 10:18:03 | [Full GC

INFO   | jvm 1    | 2014/10/02 10:18:10 |  56094225K->8346081K(82935040K), 6.6189020 secs]

6) Event insertion performance (which is excellent also):

INFO   | jvm 1    | 2014/10/02 11:06:13 | (02-Pre-SecurityEventPersistor3988) Persisted 296 events in 129 ms.

INFO   | jvm 1    | 2014/10/02 11:06:10 | (02-Pre-SecurityEventPersistor3988) Persisted 486 events in 134 ms.

INFO   | jvm 1    | 2014/10/02 11:06:10 | (02-Pre-SecurityEventPersistor3988) Persisted 1076 events in 96 ms.

7) No errors or warns in server.std.log:

[arcsight@0001arc-mgr nmbabkin]$ less /opt/arcsight/manager/logs/default/server.std.log | grep -v INFO

[arcsight@0001arc-mgr nmbabkin]$

8) Database free space is also fine:

arc_event_data:DNS Storage Group 204800 191488 93

arc_event_data:Default Storage Group 2237440 752640 33

arc_event_data:Firewall Storage Group 614400 584704 95

arc_event_data:Internal Event Storage Group 5120 4096 80

arc_event_data:Mobile DNS File 10240 9216 90

arc_system_data 307210 300613 97

The only thing that helps is rebooting the Manager.

Any advice that would save us from this annoying weekly manual reboot would be greatly appreciated.
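A side note on item 5: the Full GC lines can be summarized with a few lines of Python. This is a minimal sketch, not something from the original poster. The regular expression and the `parse_full_gcs` helper are illustrative names written against the three log entries quoted above, and a JVM with different GC logging flags may print another format.

```python
import re
from datetime import datetime

# Log lines copied from item 5 of the post; the "[Full GC" lines are skipped by the regex.
LOG = """\
INFO   | jvm 1    | 2014/10/02 08:28:30 | [Full GC
INFO   | jvm 1    | 2014/10/02 08:28:35 |  56149204K->8650221K(82874048K), 5.8604970 secs]
INFO   | jvm 1    | 2014/10/02 09:22:37 | [Full GC
INFO   | jvm 1    | 2014/10/02 09:22:43 |  56190733K->8372061K(82826304K), 6.8200720 secs]
INFO   | jvm 1    | 2014/10/02 10:18:03 | [Full GC
INFO   | jvm 1    | 2014/10/02 10:18:10 |  56094225K->8346081K(82935040K), 6.6189020 secs]
"""

# Matches the heap summary: before->after(total) in KiB, then the pause in seconds.
GC_LINE = re.compile(
    r"\|\s*(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\s*\|\s*"
    r"(?P<before>\d+)K->(?P<after>\d+)K\((?P<total>\d+)K\),\s*(?P<secs>[\d.]+) secs"
)

def parse_full_gcs(text):
    """Return one dict per heap-summary line: timestamp, sizes in GiB, pause in seconds."""
    kib_per_gib = 1024 * 1024
    events = []
    for m in GC_LINE.finditer(text):
        events.append({
            "time": datetime.strptime(m["ts"], "%Y/%m/%d %H:%M:%S"),
            "before_gib": int(m["before"]) / kib_per_gib,
            "after_gib": int(m["after"]) / kib_per_gib,
            "total_gib": int(m["total"]) / kib_per_gib,
            "pause_s": float(m["secs"]),
        })
    return events

events = parse_full_gcs(LOG)
previous = None
for cur in events:
    gap = ""
    if previous is not None:
        minutes = (cur["time"] - previous["time"]).total_seconds() / 60
        gap = f", {minutes:.0f} min since previous"
    print(f"{cur['time']:%H:%M:%S}: {cur['before_gib']:.1f} -> {cur['after_gib']:.1f} GiB "
          f"of {cur['total_gib']:.1f} GiB, pause {cur['pause_s']:.1f} s{gap}")
    previous = cur
```

On these three samples each Full GC takes the heap from roughly 53.5 GiB down to a little over 8 GiB, with pauses of 6 to 7 seconds a bit under an hour apart. That is consistent with the poster's view that garbage collection alone does not explain the EPS drop.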


Accepted Solutions
daniel.eddy Trusted Contributor.

All,

Note that in the ESM 6.8c release notes, they appear to have addressed and resolved the ReservedCodeCacheSize issue as detailed below:

NGS-8578:

Event throughput would sometimes drop significantly after the Manager had been

running for a period of 1-2 weeks, for no obvious reason. This was happening due

to the CodeCache (a section of the JVM heap used by the just-in-time compiler)

becoming full.

The problem has been fixed by increasing the CodeCache capacity setting.

Danny

15 Replies
jring1 Trusted Contributor.

Hi,

do you experience caching on all connectors equally so we can rule out a problem with connectors or transport?

Just restarting the manager doesn't help - you have to reboot?

Also what is your system memory - is it very low due to all memory being used by buffer/cache - post the output of free...

Joachim

nmbabkin1 Absent Member.

Hello Joachim, thanks for the answer.

> do you experience caching on all connectors equally so we can rule out a problem with connectors or transport?
>
> Just restarting the manager doesn't help - you have to reboot?

We only see caching on the heavily loaded connectors (>500 EPS). By "rebooting" I mean restarting the Manager service (/etc/init.d/arcsight_services restart manager); after that the problem is gone. It clearly seems to be a Manager bug.

> Also what is your system memory - is it very low due to all memory being used by buffer/cache - post the output of free...

There you go:

[arcsight@0001arc-mgr tmp]$ free -m

             total       used       free     shared    buffers     cached
Mem:        258286     236464      21822          0        114      99628
-/+ buffers/cache:     136721     121565
Swap:        16383          0      16383
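The `-/+ buffers/cache` row is derived from the `Mem:` row, and the arithmetic can be reproduced as a quick check. This is a minimal sketch; `effective_memory` is a hypothetical helper name, and the inputs are the MiB figures from the output above.

```python
def effective_memory(total, used, free, buffers, cached):
    """Split `free -m` figures (MiB) into application use and reclaimable memory."""
    reclaimable = buffers + cached  # buffers and page cache the kernel can hand back
    return {
        "used_by_apps": used - reclaimable,
        "available": free + reclaimable,
        "available_pct": 100.0 * (free + reclaimable) / total,
    }

# Figures from the `Mem:` row of the output above, in MiB.
mem = effective_memory(total=258286, used=236464, free=21822, buffers=114, cached=99628)
print(mem)
```

The results differ from the `-/+ buffers/cache` row by 1 MiB, presumably because `free -m` rounds each figure separately. Either way, close to half of the RAM is still available to applications, and swap is unused.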

jring1 Trusted Contributor.

OK, memory seems fine. I've seen Linux boxes start swapping despite having loads of memory used as cache, but with 20 GB free you are far from that.

A shot in the dark: have you tried setting http.transport.threadcount to something like 2 or 4 (default = 1) for the loaded connectors? We always do this for all connectors doing more than a few EPS.

What does arcsight support say?

Joachim

daniel.eddy Trusted Contributor.

Nikolay,

Thanks so much for posting this!  We seem to be having very similar issues with our ESM instance and have not been able to find a resolution after months of continued outages.  Our issues are described below:

Problem Description

Every few days (not consistent), the "Sent to Manager EPS" count drops significantly, causing all of our Loggers/connectors to start caching.  Restarting the ArcSight services appears to be the only way to resolve this issue.  Once the services are restarted, the EPS to the manager count jumps up to 13 to 15K until the cache is depleted.  After another few days (again, inconsistent), the problem returns and we must repeat this process.


System Information

  • ArcSight ESM 6.5c SP1 (6.5.1.1845.0)
  • 256 GB RAM (32 GB manager heap size)
  • 64 Processors
  • Fusion IO Storage Cards
  • Average EPS is 6 to 7k

History

We upgraded from ArcSight 5.2 back in April of 2014 and ever since then, we've been plagued with these ESM issues.  We've had ArcSight professional services involved from the get-go but they can't seem to find a root cause.  We even went as far as re-installing everything from scratch to rule out a bad install but this did not help.

Troubleshooting Steps

We started off by going to HP/ArcSight support for these issues and they identified the standard issues:

  • Rules with high partial matches
  • Full GCs occurring too often (more than once per hour)
  • Blocked threads due to rules (likely related to partial matches)

We followed their recommendations and disabled all of the high partial matching rules and did our best to clean up the Full GCs.  However, after all of these troubleshooting steps, we were still left with the same issue.  After sending HP more thread dumps and log files, they said they were seeing "chunks" with very few events, leading them to believe we had devices with time issues.  After only being able to identify a few devices with time issues (and correcting them), the issues were not resolved.

New Issues

Starting last week, we started to experience a new symptom that also leads to an ESM outage.  We started receiving database errors (Event Receiver errors) with the following information:

== SUBSYSTEM STATUS CHANGED ====================================================

  Error - Event Receiver

== ORIGIN OF CHANGE ============================================================

  Error - Database Connection

-- DESCRIPTION -----------------------------------------------------------------

[Database Connection:

Received exception while trying to check connectivity to the database: Communications link failure

The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.

  Please see manager log files for details.]


After this event, the EPS to manager drops to almost 0 and we have to restart the manager to process events again.  I'm not sure if these 2 issues are related but I thought I'd include that info just in case.

Questions

  • How long have you been experiencing this issue?
  • Did something change in your environment (new devices, upgrades to ESM software/hardware)?
  • What is your average EPS (before and after filtering/aggregation)?
  • Have you contacted HP support about this issue?

Thanks again for your post!  I was starting to think that we were the only ones out there experiencing these types of issues!

Danny

rgraham4 Absent Member.

Have any of you checked your insertion rate? What are you using for storage? We had this issue at first with our SAN, when the insertion rate was not keeping up with the incoming EPS. If insertion slows down, incoming EPS slows down too and the connectors downstream start caching. Once we migrated to the fastest EMC VMAX SAN, our problems stopped. Look for the dashboard under dashboards --> arcsight express (no, we are not running Express) --> cross-device --> "Connector Connection and Cache Status". Open that dashboard and look at "current connector status".

Look at the TOTAL values for sent to manager EPS and received by agent EPS. If they do not match, that is a warning sign; for us it was the clue that our SAN was slowing things down. We haven't had the issue since the SAN upgrade. Hope this helps!

pbrettle Acclaimed Contributor.

For everyone suffering these issues - have you called support and logged a case around this? There are situations in which JVM memory management can lead to some odd behaviors. Usually it's down to high EPS rates inbound, but we are trying to understand the exact circumstances as to what causes this - it has proven to be a troublesome issue to pinpoint!


That said, get on to support - talk about the issue and get them to log and assign this. There are ways to resolve this and get around it. But this really needs careful investigation and troubleshooting - although Protect is great and a fantastic forum for this  - support really needs that direct interaction to get the right information. If we get it wrong, it can cause the system to stop functioning correctly.

Get on the phone to support ASAP.

pavel.novotny Absent Member.

Hello Nikolay,

Do you have any update on this issue?

It seems that I have the same problem. For nearly a year our ESM 6.5c ran fine: few reports, few rules, 44 SmartConnectors, about 2000 EPS.

On 20.11.2014 I added 12 new connectors with a load of about 1200 EPS. During the night of 26.11.2014 ESM performance degraded, and since then it has been sending these mails every hour:

-----Original Message-----

Subject: [ARCSIGHT][servername] User Session Server -- OK

== SUBSYSTEM STATUS CHANGED ====================================================

   OK - User Session Server

== ORIGIN OF CHANGE ============================================================

   OK - Database Connection

================================================================================

-----Original Message-----

Subject: [ARCSIGHT][servername] Event Receiver -- Error

== SUBSYSTEM STATUS CHANGED ====================================================

   Error - Event Receiver

== ORIGIN OF CHANGE ============================================================

   Error - Database Connection

-- DESCRIPTION -----------------------------------------------------------------

[Database Connection:

Received exception while trying to check connectivity to the database: Communications link failure

The last packet successfully received from the server was 1 milliseconds ago.  The last packet sent successfully to the server was 1 milliseconds ago.

Please see manager log files for details.]

================================================================================

So after 6 days with a higher load and 12 more SmartConnectors, I have a similar issue.

I prepared logs based on this topic:

https://protect724.hp.com/message/48404#48404

Then I restarted the Manager and, as in the other reports here, it has been working fine so far.

But some solution/conclusion would be nice.

daniel.eddy Trusted Contributor.

All,

The solution for us was the ReservedCodeCacheSize setting in the server.defaults.wrapper.conf file on the ESM.  We heard from other customers that upping this setting from the default 48MB to 256MB solved their issue.  I can't explain in detail what this setting does but from what I've read, it affects the JIT compiler and how efficiently it runs.

To verify that we were going over the 48MB limit, we grep'd the server.status.log files using this command:

grep -A1 "Name=\"Code Cache\"" server.status.log*

All of the logs we captured during our EPS/database drops showed that we were at the max (48MB) which helped verify that this was an issue for us.  Due to these factors, we made the following change to our server.defaults.wrapper.conf file (ESM):

wrapper.java.additional.13=-XX:ReservedCodeCacheSize=256m

*(note that the "13" in this setting may be different depending on your specific settings)

Since making this change, we have not experienced any other EPS or database issues.  To verify this was a legitimate fix, we consulted with HP support.  According to them, this is a workable solution and should have no adverse side effects.
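Because the index in `wrapper.java.additional.<n>` depends on what is already in the file, it can help to scan the existing entries before adding the line. The following is a minimal sketch under stated assumptions: `plan_code_cache_setting` is a hypothetical helper, and the sample text is an invented excerpt, not the contents of a real server.defaults.wrapper.conf.

```python
import re

# Matches active (uncommented) wrapper.java.additional.<n>=<value> lines.
ADDITIONAL = re.compile(r"^\s*wrapper\.java\.additional\.(\d+)\s*=\s*(.*)$")

def plan_code_cache_setting(conf_text, size="256m"):
    """Return the line to add, or None if ReservedCodeCacheSize is already configured."""
    highest = 0
    for line in conf_text.splitlines():
        m = ADDITIONAL.match(line)
        if not m:
            continue  # comments and unrelated properties
        if "-XX:ReservedCodeCacheSize" in m.group(2):
            return None
        highest = max(highest, int(m.group(1)))
    return f"wrapper.java.additional.{highest + 1}=-XX:ReservedCodeCacheSize={size}"

# Hypothetical excerpt; a real file will contain other options.
sample = """\
wrapper.java.additional.1=-server
# wrapper.java.additional.2=-Xdebug
wrapper.java.additional.12=-XX:+PrintGCDetails
"""
print(plan_code_cache_setting(sample))
```

The sketch only suggests a line and does not modify the file. Back up the configuration and confirm the change with support before applying it, as the poster did.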

After almost a month of up-time, our server.status.log files show this:

server.status.log.9:[2014-12-02 13:49:06,136]      Name="Code Cache"

server.status.log.9-[2014-12-02 13:49:06,136]      PeakUsage="javax.management.openmbean.CompositeDataSupport(compositeType=javax.management.openmbean.CompositeType(name=java.lang.management.MemoryUsage,items=((itemName=committed,itemType=javax.management.openmbean.SimpleType(name=java.lang.Long)),(itemName=init,itemType=javax.management.openmbean.SimpleType(name=java.lang.Long)),(itemName=max,itemType=javax.management.openmbean.SimpleType(name=java.lang.Long)),(itemName=used,itemType=javax.management.openmbean.SimpleType(name=java.lang.Long)))),contents={committed=52690944, init=2555904, max=268435456, used=52238592})

Note that we are using about 52MB which is just over the 48MB limit.
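The numbers inside `contents={...}` are bytes, so a small conversion makes the comparison with the limit easier to read. A minimal sketch follows; `code_cache_usage` is a hypothetical helper name, and the sample string is the PeakUsage line above with the long type description abbreviated.

```python
import re

def code_cache_usage(peak_usage_line):
    """Extract the byte counts from a PeakUsage line and add MiB and percent figures."""
    body = re.search(r"contents=\{([^}]*)\}", peak_usage_line)
    if body is None:
        raise ValueError("no contents={...} section found")
    values = {k: int(v) for k, v in re.findall(r"(\w+)=(\d+)", body.group(1))}
    mib = 1024 * 1024
    return {
        "used": values["used"],
        "max": values["max"],
        "used_mib": values["used"] / mib,
        "max_mib": values["max"] / mib,
        "used_pct_of_max": 100.0 * values["used"] / values["max"],
    }

# Abbreviated copy of the PeakUsage line quoted above (type description omitted).
sample = 'PeakUsage="...contents={committed=52690944, init=2555904, max=268435456, used=52238592})'
usage = code_cache_usage(sample)
old_limit = 48 * 1024 * 1024  # the previous 48 MB default, in bytes
print(f"peak used {usage['used_mib']:.1f} MiB of {usage['max_mib']:.0f} MiB "
      f"({usage['used_pct_of_max']:.1f}%), above old limit: {usage['used'] > old_limit}")
```

The peak of 52238592 bytes is about 49.8 MiB (roughly 52 MB in decimal units). That is above the old 50331648-byte (48 MiB) cap but well within the new 256 MiB limit.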

Thank you to everyone who helped us out on this one.  It's greatly appreciated!

Danny

Answer Honored Contributor.

See this thread: https://protect724.hp.com/message/40198#40198

I'm 90% positive that you are experiencing the same problem as we did. You'll need to add the Java additional parameter to fix the problem.

pavel.novotny Absent Member.

Well, here are the details of our Java settings:

grep -A1 "Name=\"Code Cache\"" server.status.log*

[2015-02-10 08:03:55,074]      Name="Code Cache"

[2015-02-10 08:03:55,074]      PeakUsage="javax.management.openmbean.CompositeDataSupport(compositeType=javax.management.openmbean.CompositeType(name=java.lang.management.MemoryUsage,items=((itemName=committed,itemType=javax.management.openmbean.SimpleType(name=java.lang.Long)),(itemName=init,itemType=javax.management.openmbean.SimpleType(name=java.lang.Long)),(itemName=max,itemType=javax.management.openmbean.SimpleType(name=java.lang.Long)),(itemName=used,itemType=javax.management.openmbean.SimpleType(name=java.lang.Long)))),contents={committed=39714816, init=2555904, max=50331648, used=39362176})"

Based on HP support's investigation and recommendation, we switched off Hyper-Threading (HT) in the server BIOS.

The following command shows the same output whether HT is on or off:

[root@servername ~]# /usr/sbin/dmidecode -t processor |grep HT

                HTT (Multi-threading)

                HTT (Multi-threading)

The best place to check whether HT is on or off is the server BIOS.

We have this note there:

Hyper-threading         [disabled]

Yesterday (9.2.2015) we disabled HT, and now we will see whether ESM can run longer than 14 days without a restart.
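The `HTT` flag printed by dmidecode does not change when HT is switched off, which is why the output looks identical in both cases. One commonly used alternative on x86 Linux is to compare the `siblings` and `cpu cores` fields in /proc/cpuinfo. The sketch below is a heuristic under that assumption; `hyperthreading_active` is a hypothetical helper, and the sample excerpts are invented rather than taken from the poster's server.

```python
def hyperthreading_active(cpuinfo_text):
    """Heuristic: 'siblings' greater than 'cpu cores' suggests SMT/HT is active."""
    fields = {}
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        # Keep only the first occurrence of each field (first logical CPU).
        if sep and key in ("siblings", "cpu cores") and key not in fields:
            fields[key] = int(value)
    if len(fields) < 2:
        return None  # fields not reported, e.g. on some VMs or non-x86 systems
    return fields["siblings"] > fields["cpu cores"]

# Hypothetical /proc/cpuinfo excerpts for illustration only.
ht_on = "model name\t: Example CPU\nsiblings\t: 16\ncpu cores\t: 8\n"
ht_off = "model name\t: Example CPU\nsiblings\t: 8\ncpu cores\t: 8\n"
print(hyperthreading_active(ht_on), hyperthreading_active(ht_off))

# On a real Linux host:
# with open("/proc/cpuinfo") as f:
#     print(hyperthreading_active(f.read()))
```

Treat the result as a hint rather than proof; the BIOS setting remains the authoritative source.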

pavel.novotny Absent Member.

Just FYI: we did not modify the server.defaults.wrapper.conf file on the ESM, and we do not have this line in it:

wrapper.java.additional.13=-XX:ReservedCodeCacheSize=256m

Here is our config file server.defaults.wrapper.conf:
