Absent Member.

NCS Cluster Volumes become unavailable

System and versions:
====================
Novell Cluster with 9 nodes and 44 cluster volumes under
SLES11SP1/OES11 with the latest patches applied
Hardware: HP Blades and HP Storage (EVA7000)

Problem:
========
For some weeks we have had a problem with cluster volumes and the NDS daemon (ndsd). Almost daily, the cluster volumes on one cluster node become unavailable to clients. The command

# rcndsd status returns "Unable to get server status".

When this happens, the server in question has to be rebooted. Because of the reboot, the cluster volumes migrate to other cluster nodes and become functional again.
After some hours another server shows the same symptoms. There seem to be one or two "favorite" cluster volumes that are always involved.
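
For anyone who wants to catch this state before the clients notice, here is a minimal sketch of a status check. It only looks for the two strings quoted in this thread ("Unable to get server status" and "dead"). The function names, the labels "unresponsive", "crashed" and "unknown", and the use of Python 3.7+ on the node are my assumptions, not anything shipped with OES. I do not know the exact output of a healthy ndsd, so everything else is reported as "unknown" rather than as healthy.

```python
import re
import subprocess

# Strings quoted in this thread; any other output is classified as "unknown".
UNRESPONSIVE_MARKER = "unable to get server status"
DEAD_PATTERN = re.compile(r"\bdead\b")


def classify_ndsd_status(output: str) -> str:
    """Map the text printed by `rcndsd status` to a coarse label (hypothetical labels)."""
    text = output.lower()
    if UNRESPONSIVE_MARKER in text:
        return "unresponsive"
    if DEAD_PATTERN.search(text):
        return "crashed"
    return "unknown"


def check_ndsd(command=("rcndsd", "status")) -> str:
    """Run the status command and classify its combined stdout/stderr."""
    result = subprocess.run(list(command), capture_output=True, text=True, check=False)
    return classify_ndsd_status(result.stdout + result.stderr)
```

Calling check_ndsd() from cron on each node and alerting on "unresponsive" or "crashed" would at least record when the problem starts, which can then be matched against /var/log/messages.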

Additional information:
=======================
1.)
I applied TID 7012793 to one cluster node. The only change: when the cluster volumes become unavailable to clients, the command
# rcndsd status no longer returns an error. But when the cluster volume is then migrated (via iManager), the ndsd on the server it is migrated from crashes, and rcndsd status returns "dead".

2.)
A piece of /var/log/messages
Using iManager, I migrated the cluster volume C3-NL3K12P-SERVER, which had become unavailable to clients, away from the server nc308:

Sep 25 06:13:01 nc308 /usr/sbin/cron[22602]: (root) CMD (/usr/sbin/smt-agent)
Sep 25 06:14:48 nc308 [XTCOM]: pam_sm_authenticate in pam_ncl.c (novell-client's pam)is called
Sep 25 06:15:01 nc308 /usr/sbin/cron[22639]: (root) CMD ( /opt/hp/hp-health/bin/check-for-restart-requests)
Sep 25 06:16:15 nc308 sshd[22665]: Accepted keyboard-interactive/pam for root from 172.20.144.40 port 58548 ssh2
Sep 25 06:19:28 nc308 smdrd[16219]: Received Leave Event for C3-NL3K12P-SERVER
Sep 25 06:19:28 nc308 smdrd[16219]: Target name C3-NL3K12P-SERVER successfully de-advertised from SLP
Sep 25 06:19:28 nc308 kernel: [54445.897985] ndsd[22110]: segfault at 58 ip 00007fb6b44962b9 sp 00007fb69cec1be0 error 4 in libncpengine.so.0.0.0[7fb6b4429000+105000]
Sep 25 06:19:29 nc308 smdrd[16219]: Could not start TCP listener on 172.20.144.50
Sep 25 06:19:32 nc308 adminus daemon: umounting volume NL3K12S lazy=1
Sep 25 06:19:34 nc308 kernel: [54451.742301] NSSLOG ==> [MSAP] comnLog[201]
Sep 25 06:19:34 nc308 kernel: [54451.742303] Pool "NL3K12P" - MSAP deactivate.
Sep 25 06:20:01 nc308 /usr/sbin/cron[22848]: (root) CMD ( /opt/hp/hp-health/bin/check-for-restart-requests)
Sep 25 06:21:50 nc308 shutdown[22906]: shutting down for system reboot
Sep 25 06:21:51 nc308 init: Switching to runlevel: 6
Sep 25 06:21:53 nc308 kernel: [54591.102010] bootsplash: status on console 0 changed to on
Sep 25 06:21:57 nc308 multipathd: 36001438012599fc20000400000c40000: stop event checker thread (140680465872640)
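
To compare crashes across nodes, it may help to pull the segfault details out of the log automatically. The sketch below parses kernel segfault lines in the format shown above and computes the offset of the faulting instruction inside the library (instruction pointer minus the library base address). The function name and dictionary keys are made up for illustration.

```python
import re

# Matches kernel lines such as:
#   ndsd[22110]: segfault at 58 ip ... sp ... error 4 in lib.so[base+size]
SEGFAULT_RE = re.compile(
    r"(?P<proc>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>\d+)"
    r"(?: in (?P<lib>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\])?"
)


def parse_segfault(line: str):
    """Return a dict describing a segfault log line, or None if it is not one."""
    match = SEGFAULT_RE.search(line)
    if match is None:
        return None
    info = {
        "process": match.group("proc"),
        "pid": int(match.group("pid")),
        "fault_address": int(match.group("addr"), 16),
        "error_code": int(match.group("err")),
        "library": match.group("lib"),
        "offset": None,
    }
    if match.group("base") is not None:
        # Offset of the faulting instruction relative to the library's load address.
        info["offset"] = int(match.group("ip"), 16) - int(match.group("base"), 16)
    return info
```

For the line logged at 06:19:28 this gives process ndsd, library libncpengine.so.0.0.0 and offset 0x6d2b9. If the same offset shows up on every crashing node, that points to one specific code path, which is probably worth mentioning in a service request. A fault address of 58 with error code 4 usually means a user-mode read from an unmapped address close to zero, so this looks like a NULL pointer dereference, but that is only my reading of the log.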
Absent Member.

From your description, the problem is likely caused by the NDS daemon. Please post it in the NDS sub-forum and report it to NTS.

Just FYI, we added a defensive mechanism to NCS in OES11 SP2 to guard against such NDS problems. You can enable the built-in ndsd monitoring on a node and choose to restart ndsd, migrate the resource, or reboot the node if NDS is experiencing problems. This should buy the NDS team more time to fix the root cause.

Regards,

Changju
Absent Member.

kernel: [54445.897985] ndsd[22110]: segfault

Yes, NDSD has crashed. When you say "last patches", does that include everything up to the September eDir patch (8.8.7 patch 4 hotfix 1)?
NOVELL: Downloads - September 2013 OES11 eDirectory 8.8 SP7 Patch 4 Hot Patch 8260

Andrew C Taubman (Sorry, support is not provided via e-mail) Opinions expressed above are not necessarily those of Micro Focus.
Absent Member.

Thank you for the hint! I have posted my thread in the NDS sub-forum.

changju;2284341 wrote:
From your description, the problem is likely caused by the NDS daemon. Please post it in the NDS sub-forum and report it to NTS.

Just FYI, we added a defensive mechanism to NCS in OES11 SP2 to guard against such NDS problems. You can enable the built-in ndsd monitoring on a node and choose to restart ndsd, migrate the resource, or reboot the node if NDS is experiencing problems. This should buy the NDS team more time to fix the root cause.

Regards,

Changju