ALERT! The community will be read-only starting on April 19, 8am Pacific as the migration begins. Read more for important details.
Absent Member.

Memory Exhaustion in ESM 6

This discussion is for memory utilization issues in ESM, in particular what I like to call the "staircase to Hades" issue.

If you view the graphic below, you can see how I came up with the name.

mem.png

The system has a steady event rate, and little activity outside a trend that runs twice an hour.  As the chart moves to the right, the system runs out of memory and performance suffers as swapping begins.  Does anyone else see this phenomenon on their system?

-Joe

Cadet 2nd Class

Thanks for the extremely thoughtful post, Jonathan.  I am going to take some time to digest it.

To be completely honest, I've found that nearly *all* of the my.cnf settings have very little impact on my system. I am considering striking most of them out of the perf tuning post...

I have to say you make some extremely valid logical points. It seems silly for the OS to pull all of these filesystem objects off the SSD and into RAM just to write them back later... This would make sense if large volumes of data weren't constantly being written/read and the disks were slower, but with SSDs and ArcSight it seems like an unnecessary extra step and a possible bottleneck...

Will definitely look into this as well:

'sync_binlog' on

Absent Member.

Yes, I'm replying to myself.

That raises the question: which process is really writing data to disk? Well, let's find out. Here on my test system, I ran 'lsof' against /opt/arcsight/logger/data/logger.

[root@SERVER logger]# pwd
/opt/arcsight/logger/data/logger
[root@fry logger]# !lsof
lsof *
COMMAND   PID     USER   FD   TYPE DEVICE  SIZE/OFF    NODE NAME
java    14345 arcsight  170u   REG 202,97 228242602 3703200 Arcsight_Data_3
[root@SERVER logger]#

So from what I can tell, it's 'java' that's doing the disk writing. Unless HP/ArcSight rolled their own JRE, I doubt Java does anything fancy like O_DIRECT when writing files.

Then I started playing around further on my test system. 'mysql' is only used for reads of event data, not writes. I think. At least that's what I can tell from mysql.log and strace. If mysql were used for writes to the database, we'd see 'COMMIT' all over mysql.log. Instead, we see 'select' for our reports and active channels! There are no 'update', 'insert' or 'commit' statements in there! MySQL isn't doing any writing! So all the write tuning we're doing for MySQL is a waste of time!

So how are writes done?

PID 14345 is the Logger Server. After looking at it with 'strace -f -p 14345' I finally got somewhere...

The 'logger server' process is doing all the writing to disk.

From this strace, you can see that the Java process that owns 'logger_server' is running an open() on the stuff in /opt/arcsight/logger/data/logger...

484   open("/opt/arcsight/logger/current/arcsight/logger/user/logger/data/server_stats_serverstats.txt", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 174
484   open("/opt/arcsight/logger/current/arcsight/logger/user/logger/data/server_stats_serverstats.txt", O_WRONLY|O_CREAT|O_TRUNC, 0666 <unfinished ...>
1181  open("/opt/arcsight/logger/data/logger/Arcsight_Data_3", O_RDWR|O_CREAT, 0666 <unfinished ...>

Don't sweat the "unfinished" markers. If O_DIRECT were in use, it would show up in the flags alongside O_RDWR, etc.

So we're just doing a straight open on the stuff on the XFS file system. Nothing fancy. Any 'directio' would have to be done via a mount option or similar.
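For anyone who wants to run the same check over a longer trace, here is a minimal Python sketch of the idea: pull the flags out of each open() line and look for O_DIRECT. The function names are mine, and the regex assumes the older open() syscall format shown in the trace above; newer strace versions usually show openat(), which this does not match.

```python
import re

# Matches the path and flags arguments of an open() call in strace output, e.g.
#   open("/path", O_RDWR|O_CREAT, 0666 <unfinished ...>
# Assumption: the trace uses open(), not openat().
OPEN_RE = re.compile(r'open\("(?P<path>[^"]+)",\s*(?P<flags>[A-Z_|]+)')


def open_flags(strace_line):
    """Return (path, set_of_flags) for an open() line, or None if no match."""
    m = OPEN_RE.search(strace_line)
    if m is None:
        return None
    return m.group("path"), set(m.group("flags").split("|"))


def uses_direct_io(strace_line):
    """True only if the open() call on this line passed O_DIRECT."""
    parsed = open_flags(strace_line)
    return parsed is not None and "O_DIRECT" in parsed[1]


# The Arcsight_Data_3 line from the trace above.
line = ('1181  open("/opt/arcsight/logger/data/logger/Arcsight_Data_3", '
        'O_RDWR|O_CREAT, 0666 <unfinished ...>')
print(sorted(open_flags(line)[1]))  # ['O_CREAT', 'O_RDWR']
print(uses_direct_io(line))         # False
```

Feeding a saved strace log through uses_direct_io() line by line would show whether any file is ever opened with O_DIRECT, rather than eyeballing a few lines.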

Absent Member.

I did some more digging, as I'm still trying to get my customer's instance stable.

Just because you set the amount of memory to be used by the JVM doesn't mean it stays within those confines.

linux - Java memory mystery (do I have a leak)? - Stack Overflow

Something to be mindful of.

I have my JVM set to 12G. I run 'pmap -x' against it and... (just the headers and the summary)

Address           Kbytes     RSS   Dirty Mode   Mapping
----------------  ------  ------  ------
total kB        33361288 9196812 9172528

That's about 9GB resident in RAM, but roughly 32G of address space mapped/reserved (IIRC).
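To double check the arithmetic, here is a small Python sketch that converts the 'total kB' line into GiB. The helper names are made up for illustration, and it assumes the three numbers are mapped size, RSS and dirty pages in kB, as in the header above.

```python
def parse_pmap_total(line):
    """Parse the 'total kB' summary line of `pmap -x` into a dict of kB values."""
    fields = line.split()
    if fields[:2] != ["total", "kB"]:
        raise ValueError("not a pmap -x summary line")
    mapped, rss, dirty = (int(v) for v in fields[2:5])
    return {"mapped_kb": mapped, "rss_kb": rss, "dirty_kb": dirty}


def kb_to_gib(kb):
    """Convert kB (1024-byte units, as pmap reports) to GiB."""
    return kb / (1024 * 1024)


# The summary line from the pmap output above.
totals = parse_pmap_total("total kB        33361288 9196812 9172528")
print(f"mapped:   {kb_to_gib(totals['mapped_kb']):.1f} GiB")  # mapped:   31.8 GiB
print(f"resident: {kb_to_gib(totals['rss_kb']):.1f} GiB")     # resident: 8.8 GiB
```

So the process has far more address space mapped than the 12G heap setting, while the resident set is still under it.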

Absent Member.

Jonathan,

Many of us are using huge/large pages with memlock and disabling THP.  With that setup, it runs very stable and I haven't seen the ArcSight processes take more memory than expected.

-Joe

Absent Member.

I'll take another gander. I also have vm.drop_caches=3 running every hour. So far so good.

Absent Member.

Something else I learned. With my current customer we put a FusionIO tech on the phone. Each FusionIO card requires up to 1GB of memory; we see about 850MB per card (we have six IODrive Duos). So when tuning the heck out of JVMs, remember to leave some space for the cards. We were told to make sure to have 10G available for our installation. This memory is allocated dynamically as IO is performed, but there is a setting that has the FusionIO kernel module reserve the RAM permanently. Also, using O_DIRECT (aka directio) against the FusionIO devices is encouraged.

Cadet 2nd Class

Jonathan Katz wrote:

Something else I learned. With my current customer we put a FusionIO tech on the phone. Each FusionIO card requires up to 1GB of memory; we see about 850MB per card (we have six IODrive Duos). So when tuning the heck out of JVMs, remember to leave some space for the cards. We were told to make sure to have 10G available for our installation. This memory is allocated dynamically as IO is performed, but there is a setting that has the FusionIO kernel module reserve the RAM permanently. Also, using O_DIRECT (aka directio) against the FusionIO devices is encouraged.

Do you happen to have the config for the FusionIO RAM reservation?

Again, thanks a lot for sharing the info.

Absent Member.

In /etc/modprobe.d/iomemory-vsl.conf:

options iomemory-vsl preallocate_memory=XXXYXXX-XXXX,XXXYXXX-XXXX

where each XXXYXXX-XXXX is the serial number of one of the cards in question.

Absent Member.

So we know that memory is being used, but it's good to know WHAT memory is being used.

I set up the following cron job to run every 10 minutes to tell me what's going on.

1,11,21,31,41,51 * * * *  /usr/bin/pidstat -drh >> /root/pidstat.out

That tells me memory and disk usage by process.

After running through the weekend I found that mysql slowly but surely eats up more and more RAM. Not sure if this works for you, but you probably want to do this ASAP to see what your culprit is.

Note, I am dropping caches every hour, too.

0 * * * *  /sbin/sysctl vm.drop_caches=3 > /dev/null

I have a huge session list, and all that stuff is kept in MySQL, not in CORRE.
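If you would rather have a number than a pile of pidstat output, here is a rough Python sketch of an alternative approach (not what the cron job above does): read VmRSS from the text of /proc/<pid>/status and work out growth per hour from a series of samples. The function names and the sample values are hypothetical, just to show the calculation.

```python
def parse_vmrss_kb(status_text):
    """Return the VmRSS value (kB) from the text of /proc/<pid>/status, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None  # e.g. kernel threads have no VmRSS line


def growth_kb_per_hour(samples):
    """samples: list of (unix_seconds, rss_kb) tuples, oldest first."""
    (t0, r0), (t1, r1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("need at least two samples with increasing timestamps")
    return (r1 - r0) * 3600 / (t1 - t0)


# Made-up samples taken 10 minutes apart, NOT real measurements.
samples = [(0, 1000000), (600, 1001000), (1200, 1002000)]
print(growth_kb_per_hour(samples))  # 6000.0
```

A steadily positive growth rate for one process over a weekend would point at the same kind of culprit the pidstat log showed.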

Cadet 2nd Class

Excellent research. I'm facing the same memory issue and have enabled huge pages combined with hourly cache drops to address this.

Another issue that I've come across in CORRE is extremely high retrieval times. This seems to be tied to the number of events being persisted and in some cases the actual queries being executed on the system. However, after replicating the same issue on different hardware configurations, OSes, ESM versions (6.0 patch 1 and 6.5), and tuned and default mysql, pgsql and server.properties configs, I'm getting concerned that this is a CORRE design issue.

The scenario is: two continuously evaluating Active Channels, one with the "ASM Database Statistics" filter, and one showing events with Type=Correlation. Event rate is about 10k EPS, the majority of these events are Bluecoat, various firewalls and Windows Security events.

Observations: Active Channels are not painting in a timely fashion (showing 'Loading Event ID messages...' for quite some time before returning results). Disabling all content does not impact performance noticeably, neither does changing the Sorting field or filter in the Active Channels.

Maybe this issue deserves a separate thread, but I'm interested to hear what retrieval times you're getting in CORRE. According to internal Monitoring events we have 200k+ microseconds (200+ ms) per event, sometimes spiking up to 1M microseconds (1 second)...

Absent Member.

What underlying storage are you using for CORRE? Things are pretty speedy for us with FusionIO and/or SSDs (depending upon my customers.)

Cadet 2nd Class

8x FusionIO 3TB Accelerator cards in RAID10. What read stats are you seeing according to the internal Monitoring events? ("ASM Database Statistics" filter)
