babilon Absent Member.

Pool Corrupted after poison pill to nodes

Hi folks,

We're running a two-node GroupWise 8 SP1 cluster on OES2 SP1 (SLES 10 SP2) with an EMC Fibre Channel SAN using EMC PowerPath.

When the resource (called POA_DIRETOR_SERVER) tries to load, the other node sends a poison pill and the server reboots. I found that all nodes rebooted during the night.

Every three months, one of the resources (a pool) becomes corrupted. The latest log:

NSSLOG ==> [Error] zlssMSAP.c[1899]
Oct 21 20:17:43 srv-corp-120 kernel: Oct 21, 2009 7:17:43 pm NSS<ZLSS>-4.11b-xxxx:
Oct 21 20:17:43 srv-corp-120 kernel: MSAP: Pool "POA_DIRETOR" ownership lost, pool may have been corrupted
Oct 21 20:17:43 srv-corp-120 kernel: by being activated from two servers at the same time.
...
Oct 22 09:47:21 srv-corp-120 kernel: err=20801 comnVol.c[894]
Oct 22 09:49:20 srv-corp-120 kernel: err=20801 comnVol.c[894]
Oct 22 09:49:39 srv-corp-120 sshd[16218]: Accepted keyboard-interactive/pam for root from 10.100.207.6 port 59479 ssh2
Oct 22 09:50:00 srv-corp-120 kernel: lsa_vol_statfs: zOpen = 20407
Oct 22 09:50:06 srv-corp-120 kernel: lsa_vol_statfs: zOpen = 20407
Oct 22 09:51:12 srv-corp-120 sshd[22740]: Accepted keyboard-interactive/pam for root from 172.22.0.101 port 1149 ssh2
Oct 22 09:51:15 srv-corp-120 kernel: lsa_vol_statfs: zOpen = 20407
Oct 22 09:51:15 srv-corp-120 kernel: lsa_vol_statfs: zOpen = 20407
Oct 22 09:52:24 srv-corp-120 kernel: err=20801 comnVol.c[894]
Oct 22 09:52:56 srv-corp-120 smdrd[19377]: Received Leave Event for POA_DIRETOR_SERVER
Oct 22 09:52:56 srv-corp-120 smdrd[19377]: Target name POA_DIRETOR_SERVER could not be de-advertised from SLP
Oct 22 09:53:44 srv-corp-120 kernel: CLUSTER-<WARNING>-<6077>: The cluster has lost communication with node [srv-corp-121].
Oct 22 09:53:44 srv-corp-120 kernel: Node [srv-corp-121] may have failed or experiencing other problems.
Oct 22 09:53:44 srv-corp-120 kernel: To ensure cluster stability, this node has sent a poison pill to node [srv-corp-121].
Oct 22 09:53:44 srv-corp-120 kernel: Epoch for this node is higher than for some other node.
Oct 22 09:53:44 srv-corp-120 kernel: Other node is slow to update epoch and bitmask (slow or dead).
Oct 22 09:58:53 srv-corp-120 syslog-ng[13581]: syslog-ng version 1.6.8 starting
Oct 22 09:58:53 srv-corp-120 ifup: lo
Oct 22 09:58:53 srv-corp-120 syslog-ng[13581]: Changing permissions on special file /dev/xconsole
Oct 22 09:58:53 srv-corp-120 syslog-ng[13581]: Changing permissions on special file /dev/tty10
Oct 22 09:58:53 srv-corp-120 dbus-daemon: nds_nss_GetGroupsbyMember: failed to init socket, status = -1
Oct 22 09:58:53 srv-corp-120 dbus-daemon: nds_nss_GetGroupsbyMember: failed to init socket, status
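
In case it helps anyone compare against their own logs, here is a rough Python sketch that pulls the cluster-relevant lines out of a syslog excerpt like the one above. The patterns, the labels, and the function name `scan` are my own assumptions, based only on the messages shown in this post and not on any Novell documentation.

```python
import re

# Patterns copied from the messages in the excerpt above.
# The labels on the left are hypothetical names chosen for this sketch.
PATTERNS = {
    "msap_ownership_lost": re.compile(r'MSAP: Pool "([^"]+)" ownership lost'),
    "poison_pill_sent": re.compile(r"sent a poison pill to node \[([^\]]+)\]"),
    "lost_communication": re.compile(r"lost communication with node \[([^\]]+)\]"),
}


def scan(lines):
    """Return (label, captured pool or node name, original line) per match."""
    hits = []
    for line in lines:
        for label, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                hits.append((label, match.group(1), line.rstrip()))
    return hits


if __name__ == "__main__":
    # Three lines taken verbatim from the log in this post.
    sample = [
        'Oct 21 20:17:43 srv-corp-120 kernel: MSAP: Pool "POA_DIRETOR" ownership lost, pool may have been corrupted',
        "Oct 22 09:53:44 srv-corp-120 kernel: CLUSTER-<WARNING>-<6077>: The cluster has lost communication with node [srv-corp-121].",
        "Oct 22 09:53:44 srv-corp-120 kernel: To ensure cluster stability, this node has sent a poison pill to node [srv-corp-121].",
    ]
    for label, name, _line in scan(sample):
        print(label, name)
```

Running it over the full /var/log/messages from both nodes should at least show whether the ownership-lost message always shows up around the same time as a poison pill.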

To correct the problem, I rebuilt the pool with the ravsui command.

Does anybody know how I can prevent this sort of thing from happening again?

Thanks.