Cluster node stops responding then resource reverts to old copy
I have a two node cluster that has two resources. Each node is configured to host one resource unless the other node goes down. These are two servers hosting shared volumes for my users.
Twice in the past few months the first node has stopped responding, and when the resource migrates to the other node, it apparently reverts to an old copy. For example, any modifications to a file are undone and any new files are gone. It is as if I had removed the current drive and replaced it with an older version.
This morning it happened again. I was alerted when a user who modifies a certain file often said their recent changes were not there. When I checked a file that I had modified yesterday, those modifications were gone. Some folders that had been added were gone as well.
I could not ping node1 (which hosts the affected resource volume), could not connect with PuTTY, and VMware would not bring up its console. VMware reported that the VM's CPU usage was high. I powered node1 off and back on, and once it came back up and its assigned resource migrated back to it, everything went back to normal: the modifications to the files were there again.
The cluster event log in iManager shows the node leaving at 3:43 in the morning, but gives no reason. It just says that the node left and the resources migrated to node2.
The /var/log/messages log stops at 3:39 then picks back up at 8:44, when I rebooted the server.
Here are the last few lines in the messages log before I restarted the node:
2019-10-09T03:30:01.269251-04:00 boe1 systemd: Received SIGRTMIN+24 from PID 431 (kill).
2019-10-09T03:30:01.276920-04:00 boe1 systemd: pam_unix(systemd-user:session): session closed for user root
2019-10-09T03:30:01.277145-04:00 boe1 systemd: Stopped User Manager for UID 0.
2019-10-09T03:30:01.281191-04:00 boe1 systemd: Removed slice User Slice of root.
2019-10-09T03:34:25.051629-04:00 boe1 sshd: Connection closed by 127.0.0.1 port 39128 [preauth]
2019-10-09T03:39:25.102526-04:00 boe1 sshd: Connection closed by 127.0.0.1 port 39172 [preauth]
2019-10-09T08:44:46.865576-04:00 boe1 systemd: systemd 228 running in system mode. (+PAM -AUDIT +SELINUX -IMA +APPARMOR -SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT -GNUTLS +ACL +XZ -LZ4 +SECCOMP +BLKID -ELFUTILS +KMOD -IDN)
2019-10-09T08:44:46.867697-04:00 boe1 systemd: Detected virtualization vmware.
Do you have any idea what may be happening? Any leads on where to look?
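For anyone chasing a similar hang, the point where /var/log/messages goes silent can be located mechanically instead of by scrolling. The following is only a rough sketch: it assumes every log line starts with an ISO 8601 timestamp like the ones quoted above, and the function name find_gaps and the 30-minute threshold are made up for illustration.

```python
from datetime import datetime, timedelta


def find_gaps(lines, threshold=timedelta(minutes=30)):
    """Return (previous, current) timestamp pairs that are further apart than threshold."""
    gaps = []
    prev = None
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        stamp = line.split(None, 1)[0]  # first whitespace-separated field
        try:
            ts = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # line does not start with an ISO 8601 timestamp
        if prev is not None and ts - prev > threshold:
            gaps.append((prev, ts))
        prev = ts
    return gaps


if __name__ == "__main__":
    # Two of the lines quoted in the post above.
    sample = [
        "2019-10-09T03:39:25.102526-04:00 boe1 sshd: Connection closed by 127.0.0.1 port 39172 [preauth]",
        "2019-10-09T08:44:46.865576-04:00 boe1 systemd: Detected virtualization vmware.",
    ]
    for before, after in find_gaps(sample):
        print(f"log silent from {before} until {after} ({after - before})")
```

A gap like this only says when the node stopped logging, not why, so it is a starting point for correlating with the cluster event log rather than a diagnosis.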
This is a classic VMware issue combined with a clustering bug that Micro Focus claims is not a bug. You have two different VMDKs for your cluster volume, and neither the cluster nor you noticed. Most likely this happened because you migrated a cluster node (which copies shared disks around).
Micro Focus Knowledge Partner
No emails please!
I haven't manually migrated a node recently.
I do see in the config for each server node in VMware that each drive points to its own vmdk and not a "shared" vmdk. For example, shared volume 1 on node1 is \node1\node1_1.vmdk and the same volume on node2 is \node2\node2_1.vmdk.
Is there a way to fix this so that the nodes see the correct vmdk?
Am I correct in understanding that each time a resource is migrated, a new drive is created in VMware?
Will I need to manually migrate volumes in order for an up-to-date copy to be available?
I was referring to migrating (vMotion) VMs inside your VM environment, e.g. changing the storage. That's when this happens (which in my mind is also a bug in VMware: it should strongly warn you not to move shared disks, or, even better, change the config on all VMs).
To fix this, you need to determine which disk is the correct one, remove the "wrong" VMDK from the node(s), and add the correct one back.
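To spot this kind of mismatch without clicking through each VM's settings, the .vmx files of the two nodes can be compared directly. This is a minimal sketch, not a tested tool: it assumes disk entries in the .vmx file have the usual form scsiX:Y.fileName = "path", the slot names scsi1:1 and scsi1:2 must be adjusted to wherever the shared disks are attached, and the datastore paths and function names in the example are hypothetical.

```python
import re

# Matches lines such as: scsi1:1.fileName = "/vmfs/volumes/ds/node1/node1_1.vmdk"
DISK_LINE = re.compile(r'^\s*(scsi\d+:\d+)\.fileName\s*=\s*"(.*)"\s*$', re.IGNORECASE)


def disk_files(vmx_text):
    """Map each SCSI slot (e.g. 'scsi1:1') to the vmdk path configured for it."""
    disks = {}
    for line in vmx_text.splitlines():
        match = DISK_LINE.match(line)
        if match:
            disks[match.group(1).lower()] = match.group(2)
    return disks


def mismatched_disks(vmx_a, vmx_b, shared_slots=("scsi1:1", "scsi1:2")):
    """Return the shared slots whose vmdk path differs between the two VMs."""
    a, b = disk_files(vmx_a), disk_files(vmx_b)
    return {
        slot: (a.get(slot), b.get(slot))
        for slot in shared_slots
        if a.get(slot) != b.get(slot)
    }


if __name__ == "__main__":
    # Hypothetical .vmx excerpts; in practice, read the text from each node's .vmx file.
    node1 = 'scsi0:0.fileName = "node1.vmdk"\nscsi1:1.fileName = "/vmfs/volumes/ds/node1/node1_1.vmdk"\n'
    node2 = 'scsi0:0.fileName = "node2.vmdk"\nscsi1:1.fileName = "/vmfs/volumes/ds/node2/node2_1.vmdk"\n'
    for slot, (path_a, path_b) in mismatched_disks(node1, node2).items():
        print(f"{slot}: node1 uses {path_a}, node2 uses {path_b}")
```

Only the slots listed in shared_slots are compared, because each node's boot disk is expected to differ. An empty result means both VMs point at the same file for every shared slot; it does not tell you which copy holds the current data.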
I had come to a similar conclusion, but since this is a production cluster, I'm wary of doing that right now. I do have a backup just in case and will try it as soon as I can. Since the current vmdk's are located in the VM's folder, can I move them to another folder (like SharedDisks) before reconnecting them? They are configured Persistent and Independent.
I found this document that details configuring a cluster in a VMware environment, but it loses me in step 2B of preparing the shared storage. It looks like some shared vmdk's were created that are not detailed in the instructions.
I finally had some down time and was able to resolve this.
I shut down both nodes of the cluster, located the correct copy of each shared disk, and copied the vmdks to a SharedDisks folder. I then detached the old shared drives from node1 and attached the relocated disks to it using the original drive settings (scsi1:1 and scsi1:2). I booted up that node and both drives showed up and were running!
I then did the same for node2, and the disks are now shared properly and I can migrate the shared drives between nodes.
Thanks for your help!