Random Loss of Management Capability
Does anyone have experience of apparently healthy and manageable (from a connector appliance) SmartConnectors suddenly not being manageable?
The scenario is:
1. A host with two containers
2. Each container is running one SmartConnector only
3. One day I'm able to send commands (Get Status) from the connector appliance to both SmartConnectors; the next day I can only send commands to one SmartConnector (error: unable to send the command to the container. Container may not be running).
4. The host is up, containers are running (though I have tried start/stop to see if this would fix the problem), listening on ports 9001 and 9002 and able to run Logfu against both containers from the connector appliance.
As far as I know there have been no network changes to block port 9002 (which is the port the problem container is on). I can't completely discount this but I don't have tcpdump on the host (and I don't have root to install it). Without a full understanding of the management/command communication mechanism between a connector appliance and a connector (anybody?), I'm assuming that as I can run Logfu then port 9002 is reachable, and therefore at some level management is good (I've asked Arcsight for clarification on this - silence).
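For anyone in the same position (no tcpdump, no root), a plain TCP connect test from another host is about the simplest reachability check available. The following is a minimal Python sketch, not an ArcSight tool: the function name is made up for illustration, and the host and port you pass in are your own. A successful connect only shows that something is accepting connections on the port; it says nothing about whether the management protocol on top of it is working.

```python
import socket


def port_accepts_tcp(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port completes within timeout.

    Hypothetical helper for illustration only. A True result means the
    three-way handshake succeeded, nothing more.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

For example, `port_accepts_tcp("connector-host.example", 9002)` would test the problem container's port, where the hostname is a placeholder.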
So, has anyone else had this problem, where it wasn't attributable to any network issues?
Thanks in advance
Re: Random Loss of Management Capability
This issue is 'fixed'.
All of what I'm about to tell you is from my own observations and research. I did engage Arcsight on this but didn't receive anything meaningful to assist in resolving the issue or identifying root cause. So this is in no way 'official' analysis.
Since first posting the problem, we started having similar issues with other SmartConnectors. Curiously the issues were limited to SmartConnectors in one Zone only. On 'healthy' SmartConnectors (in other zones) I could see management connections being logged in agent.out.wrapper.log, but I wasn't seeing them on the SmartConnectors that were having issues. Perhaps there had been a series of changes in the Zone that were blocking management ports? I wasn't convinced though, as we operate quite tight change control and nothing had been going on.
I finally discovered that there was a large number of connections (101 in total) to the management port. Netstat showed the following (addresses changed):
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 332 0 220.127.116.11:9001 18.104.22.168:33167 CLOSE_WAIT
Remote ports were different in each of the other hundred cases (obviously).
There were two things that caught my eye:
- The remote host wasn't the connector appliance!
- The state of the connection was CLOSE_WAIT
Further analysis revealed:
- The remote host was one of our internal vulnerability scanners, which was not currently active.
- The TCP sessions had been closed by the remote end, but it was likely that the application (Arcsight) was not closing the socket, probably because Recv-Q still had data to be read.
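Counting connections by remote host and state makes this kind of pile-up easy to spot. Here is a rough Python sketch that summarises saved netstat output for one local port. It assumes the Linux-style column layout shown above (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State); the function name is hypothetical.

```python
from collections import Counter


def summarize_netstat(text, local_port):
    """Count TCP connections on local_port, grouped by (remote host, state).

    Assumes six whitespace-separated columns per line, as in the netstat
    output quoted in this thread. Lines that do not match are skipped.
    """
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue  # header lines, UDP, unix sockets, blanks
        local, foreign, state = fields[3], fields[4], fields[5]
        if local.rsplit(":", 1)[-1] != str(local_port):
            continue  # a different local port
        remote_host = foreign.rsplit(":", 1)[0]
        counts[(remote_host, state)] += 1
    return counts
```

Feeding it the output of `netstat -an` and the management port (9001 here) would show at a glance whether one remote host holds a large number of CLOSE_WAIT entries.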
So I'm now thinking that the scanner has caused the SmartConnector some problem by scanning the management port(s), the SmartConnector is not closing sockets and we have hit an upper limit of sockets that the application can handle. Therefore subsequent management connections are being refused. NOTE: some of the errors that the connector appliance was reporting, when trying to connect to the affected SmartConnectors, were java socket exceptions.
I asked Arcsight to verify my analysis, but they disagreed with me on the determination that the application was keeping sockets open (their view was that the SmartConnector was waiting on the remote end to close and I suggested that if that were the case then these connections should have timed out by now as the scanner had been offline for some time). Arcsight didn't comment on any socket limit that may be configured in the application. We had to agree to disagree on the analysis.
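For background on the disagreement: in general TCP terms, CLOSE_WAIT is the state of a socket whose remote peer has already closed its side while the local application has not yet called close. The small loopback sketch below illustrates that general behaviour in Python. It says nothing about what the SmartConnector does internally, and the function name is made up.

```python
import socket


def demo_remote_close():
    """Show that a peer's close leaves the local socket open until we close it."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]

    client = socket.create_connection(("127.0.0.1", port))
    server_side, _ = listener.accept()

    client.close()  # the remote end closes first (sends FIN)

    server_side.settimeout(2.0)
    data = server_side.recv(1024)  # empty bytes: the peer has closed
    # The local descriptor is still held; the kernel keeps the connection
    # in CLOSE_WAIT until the application itself calls close().
    still_open = server_side.fileno() != -1

    server_side.close()  # only now is the socket released
    listener.close()
    return data, still_open
```

An application that never reaches the final `close()` for such connections accumulates CLOSE_WAIT entries, which is the pattern seen in the netstat output.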
Resolution: If I had followed the vendor's suggestions I would by now have upgraded the SmartConnector(s) and the connector appliance. This would no doubt have 'fixed' the problem, but only in the same way that buying a new car every time you have a problem with your old one would solve your transportation problems. I ended up restarting the SmartConnectors (which curiously didn't work first time). This cleared the sockets and management connectivity was restored. The outstanding action for me is to understand why these SmartConnectors, in this zone, are affected this way by this particular vulnerability scanner. We have SmartConnectors in other zones, with different scanners, that aren't affected in this way. First stop will be to compare the scanner settings.
However, in one case restarting the SmartConnector didn't fix the problem (possibly the one I restarted at the outset?). I kept getting the error that there was no SmartConnector running when I tried to stop it. It was clearly running in the process list and wasn't orphaned or a zombie. If I tried to start the SmartConnector it would briefly create a new instance, which would then die, with the original 'uncontrollable' process remaining. I then discovered that the PID files for the uncontrollable running process (and its child) were missing:
- arc_windowsfg.pid (This is the PID of the connector parent process, in this case a wuc)
- arc_windowsfg.java.pid (This is the PID of the child process associated with arc_windowsfg.pid)
I recreated these manually and did a 'chmod 644' on them. This allowed the control script to find the process that was running and shut down both it and the child, although it did complain as follows:
ArcSight Microsoft Windows Event Log - Unified did not die cleanly. Will destroy!
Removed stale pid file: /opt/arcsight/smartconnector/wuc/current/run///arc_windowsfg.pid
Removed stale pid file: /opt/arcsight/smartconnector/wuc/current/run///arc_windowsfg.java.pid
No matter. Controlled shutdown achieved, sockets released, fresh instance of the SmartConnector started and management connectivity restored.
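A quick way to spot this situation is to compare each PID file with the process table. The Python sketch below shows the general Unix pattern (signal 0 probes for a process without affecting it). It is an assumption about how such checks are usually done, not a description of the ArcSight control script, and the function name is hypothetical.

```python
import os


def pid_file_status(path):
    """Classify a PID file as 'missing', 'unreadable', 'stale' or 'running'."""
    try:
        with open(path) as handle:
            pid = int(handle.read().strip())
    except FileNotFoundError:
        return "missing"
    except ValueError:
        return "unreadable"  # file exists but does not contain a number
    if pid <= 0:
        return "unreadable"  # not a valid process ID

    try:
        os.kill(pid, 0)  # signal 0: existence check only, nothing is sent
    except ProcessLookupError:
        return "stale"  # file present, no such process
    except PermissionError:
        return "running"  # process exists but belongs to another user
    return "running"
```

In the case described above, running this against the two PID files in the connector's run directory would have returned "missing" for both while the processes were still visible in the process list.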
I hope somebody else finds this experience useful 🙂
Message was edited by: Robert Colvin Strikethrough 'connector appliance' and add 'SmartConnector'