AubreyJ
Visitor.

Cluster node stops responding then resource reverts to old copy

Hello all!

I have a two-node cluster with two resources. Each node is configured to host one resource unless the other node goes down. These are two servers hosting shared volumes for my users.

Twice in the past few months, the first node has stopped responding, and when its resource migrates to the other node, the volume apparently reverts to an old copy. For example, any modifications to a file are undone and any new files are gone, just as if I had removed the current drive and replaced it with an older version.

This event happened again this morning. I was alerted when a user who modifies a certain file often said their recent changes were not there. When I checked a file that I had modified yesterday, those modifications were gone too, and some folders that had recently been added were missing.

I could not ping node1 (which hosts the affected resource volume), could not connect with PuTTY, and VMware would not bring up its console. VMware showed its CPU usage as high. I powered node1 off and back on, and once it came back up and its assigned resource migrated back to it, everything returned to normal - the modifications to the files were there again.

The cluster event log in iManager shows the node leaving at 3:43 in the morning, but gives no reason beyond that it left and its resources migrated to node2.

The /var/log/messages log stops at 3:39 and picks back up at 8:44, when I rebooted the server.

Here are the last few lines in the messages log before I restarted the node:

 

2019-10-09T03:30:01.269251-04:00 boe1 systemd[307]: Received SIGRTMIN+24 from PID 431 (kill).
2019-10-09T03:30:01.276920-04:00 boe1 systemd: pam_unix(systemd-user:session): session closed for user root
2019-10-09T03:30:01.277145-04:00 boe1 systemd[1]: Stopped User Manager for UID 0.
2019-10-09T03:30:01.281191-04:00 boe1 systemd[1]: Removed slice User Slice of root.
2019-10-09T03:34:25.051629-04:00 boe1 sshd[553]: Connection closed by 127.0.0.1 port 39128 [preauth]
2019-10-09T03:39:25.102526-04:00 boe1 sshd[656]: Connection closed by 127.0.0.1 port 39172 [preauth]
2019-10-09T08:44:46.865576-04:00 boe1 systemd[1]: systemd 228 running in system mode. (+PAM -AUDIT +SELINUX -IMA +APPARMOR -SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT -GNUTLS +ACL +XZ -LZ4 +SECCOMP +BLKID -ELFUTILS +KMOD -IDN)
2019-10-09T08:44:46.867697-04:00 boe1 systemd[1]: Detected virtualization vmware.

 

  

Do you have any idea what may be happening?  Any leads on where to look?

Thank you!

Aubrey

Knowledge Partner

Re: Cluster node stops responding then resource reverts to old copy

Classic VMware issue combined with a clustering bug that Micro Focus claims is not a bug. You have two different VMDKs for your cluster volume, and neither the cluster nor you notices this. It most likely happened because you migrated a cluster node (which copies shared disks around).
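
If you want to confirm this, something along these lines could list which VMDK file backs each disk on the two node VMs so you can compare them. This is only a rough pyVmomi sketch; the vCenter hostname, credentials and VM names below are placeholders for your own values, and you can see the same information in the vSphere client under each VM's disk settings.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

# Lab-only shortcut: skip certificate verification.
ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.example.com",          # placeholder
                  user="administrator@vsphere.local",  # placeholder
                  pwd="password", sslContext=ctx)      # placeholder
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    for vm in view.view:
        if vm.name not in ("node1", "node2"):          # adjust to your VM names
            continue
        for dev in vm.config.hardware.device:
            if isinstance(dev, vim.vm.device.VirtualDisk):
                # backing.fileName is the datastore path of the VMDK behind this disk
                print(vm.name, dev.deviceInfo.label, "->", dev.backing.fileName)
finally:
    Disconnect(si)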

CU,
--
Massimo Rosen
Micro Focus Knowledge Partner
No emails please!
http://www.cfc-it.de
AubreyJ
Visitor.

Re: Cluster node stops responding then resource reverts to old copy

Thanks, Massimo!
I haven't manually migrated a node recently.
I do see in the config for each server node in VMware that each drive points to its own VMDK and not a "shared" VMDK. For example, node1's shared volume 1 is \node1\node1_1.vmdk, and the same volume on node2 is \node2\node2_1.vmdk.
Is there a way to fix this so that the nodes see the correct VMDK?

Am I correct in understanding that each time a resource is migrated, a new drive is created in VMware?
Will I need to manually migrate volumes in order for an up-to-date copy to be available?
Knowledge Partner

Re: Cluster node stops responding then resource reverts to old copy

I was referring to migrating (vMotion) VMs inside your VM environment, e.g. changing the storage. That's when this happens (which in my mind is also a bug in VMware: it should strongly warn you not to move shared disks, or (even better) change the config on all VMs).

To fix this, you need to determine which disk is the correct one, remove the "wrong" VMDK from the node(s), and add the correct one back.
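
For what it's worth, here is a rough pyVmomi sketch of that kind of swap: detach the stale VMDK from a node without deleting the file, then attach the known-good one in its place. The datastore path, VM name and disk label are placeholders, the node should be down (cluster resource offline) while you do this, and the same steps can be done in the vSphere client by removing the disk (without deleting it from disk) and adding an existing hard disk.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

GOOD_VMDK = "[shared_ds] SharedDisks/cluster_vol1.vmdk"   # placeholder path to the good copy
STALE_LABEL = "Hard disk 2"                               # placeholder label of the stale disk

ctx = ssl._create_unverified_context()                    # lab-only: skip cert checks
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="password", sslContext=ctx)         # placeholders
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    vm = next(v for v in view.view if v.name == "node1")  # placeholder VM name
    stale = next(d for d in vm.config.hardware.device
                 if isinstance(d, vim.vm.device.VirtualDisk)
                 and d.deviceInfo.label == STALE_LABEL)

    # Detach the stale disk from the VM configuration only; leaving
    # fileOperation unset keeps the old VMDK file on the datastore.
    remove = vim.vm.device.VirtualDeviceSpec()
    remove.operation = vim.vm.device.VirtualDeviceSpec.Operation.remove
    remove.device = stale
    WaitForTask(vm.ReconfigVM_Task(spec=vim.vm.ConfigSpec(deviceChange=[remove])))

    # Attach the existing good VMDK on the same controller and slot.
    add = vim.vm.device.VirtualDeviceSpec()
    add.operation = vim.vm.device.VirtualDeviceSpec.Operation.add
    disk = vim.vm.device.VirtualDisk()
    disk.backing = vim.vm.device.VirtualDisk.FlatVer2BackingInfo()
    disk.backing.fileName = GOOD_VMDK
    disk.backing.diskMode = "independent_persistent"
    disk.controllerKey = stale.controllerKey
    disk.unitNumber = stale.unitNumber
    disk.key = -101                                       # temporary key for the new device
    add.device = disk
    WaitForTask(vm.ReconfigVM_Task(spec=vim.vm.ConfigSpec(deviceChange=[add])))
finally:
    Disconnect(si)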

 

 

CU,
--
Massimo Rosen
Micro Focus Knowledge Partner
No emails please!
http://www.cfc-it.de
AubreyJ
Visitor.

Re: Cluster node stops responding then resource reverts to old copy

I had come to a similar conclusion, but since this is a production cluster, I'm wary of doing that right now. I do have a backup just in case and will try it as soon as I can. Since the current VMDKs are located in each VM's folder, can I move them to another folder (like SharedDisks) before reconnecting them? They are configured as Independent and Persistent.

I found this document that details configuring a cluster in a VMware environment, but it loses me at step 2B of preparing the shared storage. It looks like some shared VMDKs were created that are not detailed in the instructions.
