djbrightman1 Absent Member.

oes2sp2 cluster node freeze

Hi
We have recently patched up all of our cluster nodes (oes2sp2 x86_64).
Fully patched to 15Jun10 (i.e. including the much awaited ncs and nss fixes)

Since then we have had 3 incidents of nodes freezing.
In one instance we managed to get something out of /var/log/messages:
>>
Jun 22 21:29:18 sblx-1 kernel: cma invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=0
Jun 22 21:29:18 sblx-1 kernel:
Jun 22 21:29:18 sblx-1 kernel: Call Trace: <ffffffff8016538a>{oom_kill_process+95}
Jun 22 21:29:18 sblx-1 kernel: <ffffffff8016596b>{out_of_memory+410} <ffffffff801678a9>{__alloc_pages+613}
Jun 22 21:29:18 sblx-1 kernel: <ffffffff80169070>{__do_page_cache_readahead+166} <ffffffff802f0cf5>{__wait_on_bit_lock+92}

<<

So, it looks like a memory leak issue, as the oom-killer is stepping in and killing off processes...

Unfortunately this has had varying degrees of failure. In one instance the node's resources were inaccessible, ssh was not possible and the console was frozen. However, you could ping the node, and the other nodes thought it was OK (cluster view) and still successfully running its resources (cluster status). It took a power cycle to make it fail properly.

We have now logged an SR and have been advised of a sysctl setting that should dump the task list if the oom-killer kicks in again. We are waiting to see what might be causing the issue.

Has anyone else experienced any issues like this?

Regards and thanks

David
utman Absent Member.

Re: oes2sp2 cluster node freeze

I haven't but am curious what Novell recommends. I usually just create a script to dump top and ps information to a file and have cron call it every 5-10 mins. Please post back with what you find out.
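A minimal sketch of the kind of script described above, assuming a POSIX shell with procps `top` and `ps` available. The function name `snapshot`, the default output path and the script location mentioned below are illustrative assumptions, not something taken from this thread.

```shell
#!/bin/sh
# Hypothetical snapshot script: append a timestamped top/ps dump to a log file
# so there is something to look at after a node has hung or been power cycled.

snapshot() {
    out="$1"
    {
        echo "==== $(date '+%Y-%m-%d %H:%M:%S') ===="
        # One batch-mode iteration of top, trimmed to the first 40 lines
        top -b -n 1 | head -n 40
        # Process list with memory columns, useful for spotting a growing process
        ps -eo pid,ppid,user,rss,vsz,pcpu,pmem,args
        echo
    } >> "$out" 2>&1
}

# The default path is an assumption; pick a location with enough free space.
snapshot "${1:-/var/tmp/proc-snapshot.log}"
```

Saved as, say, /usr/local/sbin/proc-snapshot.sh (a made-up path), it could be called every 5 minutes with a cron.d entry such as `*/5 * * * * root /usr/local/sbin/proc-snapshot.sh`. The log grows without bound, so it probably wants rotating.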
changju Absent Member.

Re: oes2sp2 cluster node freeze

Hi David,

I believe this is a problem exposed by the latest release.

NCS no longer sets a node to reboot. Administrators should set the parameter “kernel.panic” in the file “/etc/sysctl.conf” to a non-zero value if they want their servers to reboot in case of a panic.

Example:
#30 root@CG_01:~ # cat /etc/sysctl.conf
# Disable response to broadcasts.
# You don't want yourself becoming a Smurf amplifier.
net.ipv4.icmp_echo_ignore_broadcasts = 1
# enable route verification on all interfaces
net.ipv4.conf.all.rp_filter = 1
# enable ipV6 forwarding
#net.ipv6.conf.all.forwarding = 1
kernel.panic = 8


Before the patch, the server would have rebooted if memory was exhausted.
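For anyone wanting to check what a node is actually configured to do, here is a small sketch that reads the last `kernel.panic` value from a sysctl.conf-style file. The helper name `panic_timeout` is made up for illustration. A value of 0, or no entry at all, means the kernel will not reboot itself after a panic.

```shell
#!/bin/sh
# Hypothetical helper: print the last kernel.panic value found in a
# sysctl.conf-style file, or "unset" if the key is absent.
panic_timeout() {
    awk -F= '
        /^[ \t]*#/ { next }                       # skip comment lines
        { key = $1; gsub(/[ \t]/, "", key) }      # normalise the key
        key == "kernel.panic" { val = $2; gsub(/[ \t]/, "", val) }
        END { print (val == "" ? "unset" : val) }
    ' "$1"
}

# Only look at the real file if it is present and readable.
if [ -r /etc/sysctl.conf ]; then
    echo "kernel.panic in /etc/sysctl.conf: $(panic_timeout /etc/sysctl.conf)"
fi
```

After editing the file, `sysctl -p` (as root) loads it, and `sysctl kernel.panic` shows the value the running kernel is using.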

Regards,

Changju
djbrightman1 Absent Member.

Re: oes2sp2 cluster node freeze

Hi

Changju: Thanks for that, though I think we shall keep the setting for now...

...we have been asked to set the boxes up for kernel core dump capture (to attempt manual capture on 'hung' machine) as well as the "oom_dump_tasks" .... Other than that I'm afraid they haven't been able to detect anything from the supportconfigs, logs and core provided....
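A read-only sketch for confirming that the diagnostic settings are in place before the next failure. It assumes that "oom_dump_tasks" refers to the `vm.oom_dump_tasks` sysctl and that the kernel in use exposes it. `kernel.sysrq` is included because a keyboard-triggered crash dump (Alt+SysRq+C) only works when SysRq is enabled. The function name `show_diag_sysctls` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical check: print the current value of each diagnostic sysctl by
# reading /proc/sys directly. Nothing is changed on the system.
show_diag_sysctls() {
    root="${1:-/proc/sys}"
    for key in vm/oom_dump_tasks kernel/sysrq kernel/panic kernel/panic_on_oops; do
        name=$(echo "$key" | tr / .)
        if [ -r "$root/$key" ]; then
            printf '%s = %s\n' "$name" "$(cat "$root/$key")"
        else
            printf '%s = (not available on this kernel)\n' "$name"
        fi
    done
}

show_diag_sysctls
```

Any value that needs changing can go into /etc/sysctl.conf in the same `key = value` form as the kernel.panic example earlier in the thread.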
So, we are waiting for a failure to gather more info....

Also, because we experienced two situations where the nodes have actually hung or ndsd core'd in such a way that cluster failover didn't occur but service was lost, we have invoked Monitor scripts, with 0 local retries and an action to migrate... This is because any attempt to locally reload the service on a machine with dead ndsd resulted in comatose...

Now that everything is set up to capture we have been OK for several days....!

I'll try and keep this updated as things progress...

Cheers
David
Anonymous_User Absent Member.

Re: oes2sp2 cluster node freeze

>>>>> "djbrightman" == djbrightman <djbrightman@no-mx.forums.novell.com> writes:

djbrightman> Hi We have recently patched up all of our cluster nodes
djbrightman> (oes2sp2 x86_64). Fully patched to 15Jun10
djbrightman> (i.e. including the much awaited ncs and nss fixes)

:

djbrightman> We have now logged an SR and have been advised a sysctl
djbrightman> setting which should dump the task list should oom kick in
djbrightman> again.... We await to see what might be causing the
djbrightman> issue....

djbrightman> Has anyone else experienced any issues like this?

I suspect that I have. Two three-node clusters in a development tree
that I haven't really touched for four or five months other than to
restart ZMD every time it crashes (grrrr). Logged in last week to
manually do all outstanding patches (they automatically apply
non-interactive patches at 3am) and after restarting, two of the six
nodes kept hanging on reboot.

One of the two we could ping and login via ssh (although the console was
frozen and we couldn't login there), the other one locked up hard with
no network or console.

djbrightman> Regards and thanks
djbrightman> David

--
Adrian Tritschler -- Adrian.Tritschler@monash.edu
Monash University, Victoria, Australia, 3800
djbrightman1 Absent Member.

Re: oes2sp2 cluster node freeze

Hi

The customer experienced another freeze/hang yesterday...
Unfortunately I was not on site and, in their efforts to get services restored, they didn't try to take a manual kernel core dump (though I'm not sure that was possible, as the console was frozen and apparently not accepting keyboard input...)

Again:

  • NSS cluster resource volumes inaccessible, but still reported as running on the node
  • No auto cluster failover (despite the 'new' monitor scripts - I guess the box had hung in such a way as to prevent determination)
  • ABLE to ping
  • Unable to login SSH (PuTTY)
  • Console appeared frozen
  • Powercycle required to release resources, etc.


I am attempting to escalate the SR with Novell....

One thing I have noticed is occasional errors in ndsd.log
e.g.
SSG Thread 24:332 taking too long to service a request (66:-1) - replacing with a new thread

I can find references, but no description or help with these messages.
Anyone know what, how, etc.? Is there a way of tracking the problem through the numbers (e.g. 24:332 and/or (66:-1))?
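What the numbers mean is not established anywhere in this thread, so the following is only a sketch for tallying the messages by the two number pairs, to see whether the same values keep recurring. The function name `count_ssg` is made up, and the log path is an assumption that may differ per installation.

```shell
#!/bin/sh
# Hypothetical tally: count "taking too long" messages per
# (thread pair, parenthesised pair), most frequent first.
count_ssg() {
    sed -n 's/.*SSG Thread \([0-9]*:[0-9]*\) taking too long to service a request (\([^)]*\)).*/\1 \2/p' "$1" |
        sort | uniq -c | sort -rn
}

# Assumed log location; adjust to wherever ndsd.log lives on the node.
LOG=/var/opt/novell/eDirectory/log/ndsd.log
if [ -r "$LOG" ]; then
    count_ssg "$LOG"
fi
```

Each output line is a count followed by the two pairs as they appear in the log, which at least shows whether one pair dominates around the time of a failure.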

Regards

David
schmatzp1 Absent Member.

Re: oes2sp2 cluster node freeze

Hi David,
We have nearly the same issue, where nodes are freezing (the only way out is OFF / ON).
We opened an SR, too.
Would you be so kind as to provide me with the SR number (my email is "pschmatz at novell dot lu")? If you want my SR, I can reply to your mail.

I feel less alone 😉

many thanks,
regards

djbrightman1 Absent Member.

Re: oes2sp2 cluster node freeze

Hi Pascal

Anything new your end?

We've had a couple more ndsd dead situations and another freeze... Have set up a bunch of things to try and enhance the cores, kdump, statistics collection, etc. but no progress yet...

I have a colleague who has also experienced the freeze issue and logged another SR....

Cheers

David
djbrightman1 Absent Member.

Re: oes2sp2 cluster node freeze

Hi

We are still experiencing issues...

No more frozen servers - I believe that behaviour was due to the change, highlighted by Changju, in the May 2010 patch (kernel.panic, etc.)

Because we have implemented kdump, this will cause the reboot... hence no more freeze!

Anyhow, it now looks like an ndsd/ncp issue... We have had a number of 'ndsd dead' situations, and have eventually managed to get a successful core dump (after stopping apparmor and tuning some dump settings...), which Novell support are currently assessing....

I am still no wiser as to the SSG Thread messages in ndsd.log...

>>
One thing I have noticed is occasional errors in ndsd.log
e.g.
SSG Thread 24:332 taking too long to service a request (66:-1) - replacing with a new thread

I can find references, but no description or help with these messages.
Anyone know what, how, etc.? Is there a way of tracking the problem through the numbers (e.g. 24:332 and/or (66:-1))?
<<


These are still happening, despite tuning ncp parameters, etc.

Can anyone out there shed any further light on what these are, what they refer to, etc.?

It may be coincidental, but I have seen larger than expected increases in the ncp connection slots around the same time.... Could they be related?

Thanks

David