I have a two-node cluster that has two resources. Each node is configured to host one resource unless the other node goes down. These are two servers hosting shared volumes for my users.
Twice in the past few months the first node has stopped responding, and when its resource migrates to the other node, the volume apparently reverts to an old copy. For example, any modifications to a file are undone and any new files are gone. It is just as if I had removed the current drive and replaced it with an older version.
This morning it happened again. I was alerted when a user who often modifies a certain file said their recent changes were not there. When I checked a file that I had modified yesterday, those modifications were gone. Some folders that had been added were also gone.
I could not ping node1 (which hosts the affected resource volume), could not connect with PuTTY, and VMware would not bring up its console. VMware reported that the node's CPU usage was high. I powered node1 off and back on, and once it came back up and its assigned resource migrated back to it, everything went back to normal - the modifications to the files were there again.
The cluster event log in iManager shows the node leaving at 3:43 in the morning, but gives no reason. It only shows that the node left and the resources migrated to node2.
The /var/log/messages log stops at 3:39 then picks back up at 8:44, when I rebooted the server.
Here are the last few lines in the messages log before I restarted the node:
2019-10-09T03:30:01.269251-04:00 boe1 systemd: Received SIGRTMIN 24 from PID 431 (kill).
2019-10-09T03:30:01.276920-04:00 boe1 systemd: pam_unix(systemd-user:session): session closed for user root
2019-10-09T03:30:01.277145-04:00 boe1 systemd: Stopped User Manager for UID 0.
2019-10-09T03:30:01.281191-04:00 boe1 systemd: Removed slice User Slice of root.
2019-10-09T03:34:25.051629-04:00 boe1 sshd: Connection closed by 127.0.0.1 port 39128 [preauth]
2019-10-09T03:39:25.102526-04:00 boe1 sshd: Connection closed by 127.0.0.1 port 39172 [preauth]
2019-10-09T08:44:46.865576-04:00 boe1 systemd: systemd 228 running in system mode. ( PAM -AUDIT SELINUX -IMA APPARMOR -SMACK SYSVINIT UTMP LIBCRYPTSETUP GCRYPT -GNUTLS ACL XZ -LZ4 SECCOMP BLKID -ELFUTILS KMOD -IDN)
2019-10-09T08:44:46.867697-04:00 boe1 systemd: Detected virtualization vmware.
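To check whether the earlier incident left the same kind of silence in the log, something like the following could be used to scan for gaps. This is only a minimal Python sketch: the function names find_log_gaps and print_gaps are made up for illustration, the 30-minute threshold is an arbitrary assumption (the sshd entries above arrive about every five minutes), and it assumes every entry starts with an ISO 8601 timestamp as in the lines above.

```python
from datetime import datetime, timedelta


def find_log_gaps(lines, threshold=timedelta(minutes=30)):
    """Return (previous, current) timestamp pairs that are more than
    `threshold` apart. The threshold is an assumed value, not a standard."""
    gaps = []
    previous = None
    for line in lines:
        # The timestamp is assumed to be the first space-separated field.
        stamp = line.split(" ", 1)[0]
        try:
            current = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if previous is not None and current - previous > threshold:
            gaps.append((previous, current))
        previous = current
    return gaps


def print_gaps(path):
    """Print every silent period found in the log file at `path`."""
    with open(path, errors="replace") as log:
        for start, end in find_log_gaps(log):
            print(f"no entries between {start} and {end} ({end - start})")
```

Calling print_gaps("/var/log/messages") on node1 would list each silent period, which could then be compared with the times the cluster event log shows the node leaving.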
Do you have any idea what may be happening? Any leads on where to look?