We've been monitoring ArcMC and noticed that most of our connectors show some level of queue drops. This went unnoticed for a long while, but since we started watching it closely, the drops have turned out to be quite significant on most connectors. I'm not certain whether this is related to the volume of events, the types of events, parsing performance, or the connector in general. We haven't been able to determine the root cause; it's been a month of investigation, with the support team involved, and there has been no breakthrough so far. I'm hoping fellow members here can share any known issues or suggested resolutions.
To give a brief overview of the issue:
We have a setup where a few syslog (UDP) connectors collecting logs from 3 sources handle very high EPS, and this is where the issues began:
Connector1 - collecting Cisco ASA logs - receiving avg. EPS of around 15-18k - constant Connector queue drops
Connector2 - collecting Fortigate logs - receiving avg. EPS around 13-15k - constant Connector queue drops
Connector3 - collecting Bluecoat logs - receiving avg. 10k EPS - constant Connector queue drops
We were certain the high volumes were causing events to drop from the queue before they reached the processing cycle, so we deployed ArcSight Load Balancer and split the traffic across multiple connectors.
Cisco ASA - 6 connectors - 2-3k avg. EPS per connector - some of the connectors are functioning properly, while around 4 are dropping events from the queue.
Fortigate - 5 connectors - 2-3k avg. EPS per connector - all connectors are dropping events from the queue.
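As a rough sanity check, the per-connector figures are consistent with the pre-split volumes. Below is a minimal Python sketch of that arithmetic, assuming the load balancer spreads events evenly across connectors (an assumption; the real distribution may be skewed). The function name `per_connector_eps` is made up for illustration.

```python
def per_connector_eps(total_eps_range, connectors):
    """Return the (low, high) EPS each connector would see under an even split."""
    low, high = total_eps_range
    return (low / connectors, high / connectors)


# Figures from this post: pre-split average EPS per source,
# and the number of connectors behind the load balancer.
sources = {
    "Cisco ASA": ((15_000, 18_000), 6),
    "Fortigate": ((13_000, 15_000), 5),
}

for name, (eps_range, count) in sources.items():
    low, high = per_connector_eps(eps_range, count)
    # Assumes an even split; a skewed split would push some connectors higher.
    print(f"{name}: {low:.0f}-{high:.0f} EPS per connector")
```

Both sources come out within the 2-3k EPS per connector range observed after the split.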
In the above scenario, all the connectors share a similar configuration: 4 GB of JVM memory, Linux-hosted, syslog UDP. Additional configuration: the ASA connectors have map files added; the Fortigate and ASA connectors have aggregation applied.
Even without aggregation applied, the connectors drop events from the queue, so this is not related to the aggregation/filtering applied. We've also ruled out EPS as a concern, since each connector is processing only around 2-3k EPS, which shouldn't be an issue at all. There are no parsing issues either; the concern is solely events being dropped from the queue.
Suggested actions from support:
- Adding a few parameters to agent.properties (multithreading, increased queue size, increased batch sizes, etc.) - applied all of them, but the issue persists.
- Disabling the queue on the connector - rejected, as it's not an acceptable solution.
- Upgrading the connector to the latest 8.1 framework, as it has significant performance enhancements - done on one of the connectors to confirm, but the issue still persists.
Attaching snips of the patterns observed in the queue drops.
Any helpful pointers to resolve the issue of connector drops are welcome.