Anonymous_User Absent Member.

NetIQ AM to detect excessive numbers of orphaned CLOSE_WAIT sessions - how?

Gurus, we're looking at a condition on a couple of our DMZ servers where
they accumulate excessive numbers of orphaned CLOSE_WAIT sessions that
never close. Over time, that chokes the TCP stack and the servers become
non-responsive. We're trying to figure out why the application sometimes
doesn't properly close a session. We think this started sometime last
week, and we're not sure why: no Microsoft patches or reboots of the
servers had occurred before we had to reboot them Monday. Orphaned
sessions are accumulating at a rate of 60 to 80 per day. At around 1500 to
1800 of them, the server/service becomes choked. The same thing happened
to two of our other application servers in the Sensitive Zone several
months ago, when there was some odd problem with Coxnet routing or
something. Those miraculously cured themselves after a couple of weeks.

In the meantime, we have discovered that a simple disable/enable on the
NIC, from the Device Manager GUI, restarts the NIC and the TCP stack and
clears the orphaned connections. That whole process takes about 20-25
seconds. Is it possible to monitor this condition with NetIQ AppManager?
Once it is monitored, we could invoke the Action_DosCommand option to
clear the connections as the job action.


Anonymous_User Absent Member.

Re: NetIQ AM to detect excessive numbers of orphaned CLOSE_WAIT sessions- how?

This likely won't help as much as you'd like, but (as you probably know) a
CLOSE_WAIT lasting more than a few seconds should be very rare. In every
case it is, by definition, a problem on the side where the CLOSE_WAIT
connections build up, and specifically the fault of the application
holding those connections open. The reason is in the TCP RFCs: a socket
enters CLOSE_WAIT when the other side has sent its FIN, and it stays there
until the local application closes the socket, which sends the local FIN.
Other things can make these connections go away (like a device between
the client and server that tears them down with RSTs), but the origin of
the problem is the application on the side that shows the CLOSE_WAIT,
which is not sending a FIN to finally close the connection. In TCP land
this is because the application (server or client) does not think it is
done sending data to the other side. Whether there really is more data to
send is not certain, but if the application thought it was done it would
probably send a FIN, since the other side has already closed its half of
the TCP connection.
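
To make that concrete, here is a minimal Python sketch of the pattern an
application needs to follow so its sockets do not sit in CLOSE_WAIT. It is
only an illustration, not the code of the application in question: the
function name drain_and_close is made up for this example, and it assumes
a plain blocking socket. The key points are that recv() returning an empty
result means the peer has already sent its FIN, and that our side must
then call close() so the local FIN goes out.

```python
import socket


def drain_and_close(conn: socket.socket) -> int:
    """Read until the peer closes its side, then close ours.

    Hypothetical helper for illustration. Returns the number of bytes
    received before the peer's end-of-stream was seen.
    """
    total = 0
    # The with-block guarantees conn.close() runs even if recv() raises,
    # so the socket cannot be left half-closed by an error path.
    with conn:
        while True:
            chunk = conn.recv(4096)
            if not chunk:
                # Empty read: the peer has shut down its sending side.
                # On a TCP socket we are now in CLOSE_WAIT until we close.
                break
            total += len(chunk)
    return total
```

An application that sees the empty read but never reaches close() (for
example because the socket object is kept in a pool or a forgotten list)
is exactly the kind that leaves CLOSE_WAIT entries behind.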

I share this to avoid going down paths that do not matter, like routers
(layer three, so they have no business with layer-four things like TCP).
And while restarting the TCP/IP stack is a workaround, it often just
delays the inevitable, and it becomes infeasible as connections increase
(business grows).

Now I'll get out of the way and let somebody who knows AppManager provide
a way to implement the workaround.

Good luck.
Anonymous_User Absent Member.

Re: NetIQ AM to detect excessive numbers of orphaned CLOSE_WAIT sessions - how?


I have a KS which can detect this state (and others, if you specify them)
using the "netstat -n" command. It *does not* account for the length of
time a session has been in this state (I'm not sure you can reliably do
that, as there is no unique ID for each session), but it will allow you
to raise an event if the number of sessions in the specified state
exceeds a threshold, and to collect data. It is provided "as is" and has
not been tested by anyone but me. It runs OK on Windows 2012, 2008, 2003,
7 and 8. The KS is attached (zipped). Let me know if it helps.
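
For anyone who wants to see the general idea outside AppManager, here is
a rough Python sketch of counting connection states from "netstat -n"
output and comparing one state against a threshold. This is not the
contents of the attached KS; the function names and the default threshold
of 1000 are assumptions chosen for illustration (the original post
reported trouble at around 1500 to 1800). It assumes the state is the last
column of each TCP line, which is the case for plain "netstat -n" on
Windows and Linux.

```python
import subprocess
from collections import Counter


def count_tcp_states(netstat_output: str) -> Counter:
    """Count TCP connections per state from `netstat -n` text."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Keep only TCP rows (TCP, tcp, tcp6); skip headers, UDP, unix.
        if len(fields) >= 4 and fields[0].lower().startswith("tcp"):
            counts[fields[-1].upper()] += 1
    return counts


def exceeds_threshold(netstat_output: str,
                      state: str = "CLOSE_WAIT",
                      threshold: int = 1000) -> bool:
    """True if more than `threshold` connections are in `state`."""
    return count_tcp_states(netstat_output)[state] > threshold


def run_netstat() -> str:
    """Run `netstat -n` and return its standard output as text."""
    result = subprocess.run(["netstat", "-n"], capture_output=True,
                            text=True, check=True)
    return result.stdout
```

A monitoring job could call exceeds_threshold(run_netstat()) on a
schedule and raise an event when it returns True. As noted above, a count
like this says nothing about how long each session has been in that state.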


+----------------------------------------------------------------------+
|Filename: NTAdmin_CheckNetStatStatus.zip |
|Download: https://forums.netiq.com/attachment.php?attachmentid=41 |
+----------------------------------------------------------------------+

--
Andy Doran
Software Engineer Consultant (NetIQ)
------------------------------------------------------------------------
andy_doran's Profile: https://forums.netiq.com/member.php?userid=3937
View this thread: https://forums.netiq.com/showthread.php?t=47422

Anonymous_User Absent Member.

Re: NetIQ AM to detect excessive numbers of orphaned CLOSE_WAIT sessions - how?


We are at Access Manager 4.0.1.88 + HF3 running on Linux. Our system is
choked by httpd, which causes a CPU load that our system cannot
handle. Restarting the application was our workaround, but we updated to
the latest version and the issue is still there. Two AGs, IDPs and ACs are
in production, but luckily only one of the AGs gets overwhelmed, after 10+
days of running.
Is there any solution we should apply?

Any help would be appreciated!
Tamas


--
ertom
------------------------------------------------------------------------
ertom's Profile: https://forums.netiq.com/member.php?userid=9462
