Cannot check or repair disk errors on the SMG appliance.

This thread is a follow-on from the "Issue with updated version 23.3.6?" thread, where I posted this:

I opened a case to learn what happened to my appliance and, hopefully, prevent it from happening again.

The logs showed disk errors on /dev/sdb1 & /dev/sdc1. They also showed that /var was not empty when /dev/sdc1 was mounted.

OT Support could not determine what happened. I asked how to deal with disk errors and they reached out to SUSE for recommendations.

SUSE has two recommendations:

  1. If I wanted to investigate further, they said I could umount /var and see if there are any clues there.
  2. As for how to run e2fsck on the appliance, the proper way is to boot to the ISO that matches the version of the OS of the appliance.

I'm not surprised by the response from SUSE and I don't disagree with them. While their response does provide some guidance, it resolves nothing.

The SMG appliance has had issues, some of which can be directly attributed to SMG while others not so much. This one falls into that "not so much" category but that doesn't mean that customers can't still be severely impacted. Others have posted in the forum that they too had encountered issues after installing the OS upgrade. For some, they were able to recover simply by rebooting. I wasn't so lucky: I almost lost the whole appliance.

As long as we continue to use the appliance, we must find ways to mitigate these risks and to deal with problems when they do happen. The obvious way to mitigate much of the risk is to take backups. That in itself is problematic with the SMG appliance, and a restore will almost certainly result in data loss, so let's leave that discussion for another time...

The other way is to do regular maintenance. Currently, the only way to become aware of potential issues is to scan the logs, but there is no automated way to do so, and doing it manually on a regular basis requires substantial effort. But the real issue is this: what can we do when we encounter log entries that suggest dire consequences if ignored, such as disk errors?
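The log scan itself could probably be scripted and run from cron. What follows is only a minimal sketch: it assumes the kernel reports these problems in /var/log/messages with wording such as "EXT4-fs error" or "I/O error". The patterns and the function name scan_disk_errors are assumptions for illustration, not something shipped with the appliance, so adjust them to whatever your own logs show.

```shell
#!/bin/sh
# Count the lines in a log file that look like disk or filesystem errors.
# The patterns are illustrative assumptions, not an exhaustive list.
scan_disk_errors() {
    logfile=${1:-/var/log/messages}
    grep -E -i -c 'EXT4-fs error|I/O error' "$logfile"
}

# Example use from a cron job (report only when something matches):
#   count=$(scan_disk_errors /var/log/messages)
#   [ "$count" -gt 0 ] && echo "possible disk errors: $count lines"
```

This only raises a flag. It does not repair anything, and a clean scan is not proof that the filesystem is healthy.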

SUSE's recommendations overlook two important points:

  1. /var cannot be umounted while the appliance is running.
  2. There is no available ISO that matches the appliance OS.

The result is we are unable to run e2fsck to check for or resolve disk errors.

So, we have what I would consider some major issues that could result in the total loss of all SMG configuration and user data. It is true that this situation will improve when SMG is no longer dependent on the appliance, but according to Product Manager Eric Maughan, "the appliance will still be with us for at least a year". In the meantime, something must be done!

Comments welcomed...

__________
Kevin Boyle, 
Knowledge Partner

Calgary, Alberta, Canada

  • Suggested Answer

    Kevin.  You don't need to mount var from an identical iso.  This is just a basic Linux filesystem maintenance workflow.  I am a little concerned that support tell you to do the below.

    Just turn off the appliance and take a snapshot/backup.

    Boot a SUSE/openSUSE 15.5 live or recovery OS ISO.

    Find your appliance's var device, /dev/whatever.

    fsck /dev/whatever, or xfs_repair /dev/....

    Note: fsck detects the filesystem type and hands off to the matching checker (e2fsck for ext filesystems).

    https://www.suse.com/support/kb/doc/?id=000020980
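
    As a rough sketch of those steps from the rescue shell: the helper below only prints a read-only ("dry run") check command for a given filesystem type, so you can see what would be reported before letting anything modify the disk. The function name check_command and the device /dev/sdX1 are placeholders, not real names from the appliance.

```shell
#!/bin/sh
# Print a read-only check command for a filesystem type and device.
# Run the printed command only from a live/rescue system, with the
# filesystem unmounted. check_command is a hypothetical helper name.
check_command() {
    fstype=$1
    dev=$2
    case "$fstype" in
        ext2|ext3|ext4) echo "e2fsck -f -n $dev" ;;   # -f force check, -n change nothing
        xfs)            echo "xfs_repair -n $dev" ;;  # -n no-modify mode
        *)              echo "unsupported filesystem type: $fstype" >&2
                        return 1 ;;
    esac
}

# Usage sketch (as root in the rescue system; /dev/sdX1 is a placeholder):
#   lsblk -f                                     # list disks and filesystem types
#   fstype=$(blkid -s TYPE -o value /dev/sdX1)   # e.g. ext4
#   check_command "$fstype" /dev/sdX1            # prints the dry-run command
```

    If the dry run reports problems, run the checker again without -n so that it can make repairs, ideally only after the snapshot/backup is in place.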

    I do weekly disaster recovery snapshots using my hypervisor management.  It shuts down, takes a snapshot then starts back up.  VM is offline less than a minute.

    I'd recommend doing quarterly fsck checks. 

    Rodney

  • In reply to:
    "You don't need to mount var from an identical iso."

    Now that is interesting.

    The feedback I received was supposed to be directly from someone at SUSE.

    My Linux skills are pretty basic so for some things I have to rely on OT Tech Support. This time, it seems, they let me down or I misinterpreted their response.

    Thanks Rodney!

    __________
    Kevin Boyle, 
    Knowledge Partner

    Calgary, Alberta, Canada

  • 0  

    It seems the issue I encountered was not an isolated incident!

    Why? Because it happened again. The issue was resolved in a support session with a SUSE engineer. In the process I learned a few things I would like to share.

    I did a normal shutdown of the appliance while I did some work on GroupWise. We use an email spooling service that collects our email if the email host is down. When I started the appliance, the SMG application did not start. /var/log/messages showed disk errors on /dev/sdb1 and /vastorage was completely empty. 

    SMG uses three disks:

    1. /dev/sda - this is the main (boot) disk. It is mounted at / (root).
    2. /dev/sdb - /dev/sdb1 is mounted at /vastorage.
    3. /dev/sdc - /dev/sdc1 is mounted at /var and cannot be unmounted while the appliance is running.

    Last time I was able to umount /dev/sdb1 and run e2fsck /dev/sdb1. This resolved the disk errors and the contents of /vastorage were restored, allowing the SMG application to run. This time, however, I could not umount /dev/sdb1.

    When the support session started, I was shown a couple of things that surprised me:

    1. /dev/sda - was not mounted!
    2. /dev/sdb - /dev/sdb1 was mounted at / (root)
    3. /dev/sdc - /dev/sdc1 was mounted at /var as it should be

    What??? How did that happen? I was reminded that this can happen when disks are mounted using device ID. To prevent it from happening again, the disk mounts were changed to use UUID.
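
    For anyone wanting to make the same change, here is a minimal sketch of the idea. It assumes the fstab entries name their devices by a path under /dev and that blkid can report a UUID for each one. The names fstab_by_uuid, uuid_of and UUID_LOOKUP are made up for illustration; this is not what the SUSE engineer actually typed.

```shell
#!/bin/sh
# Default lookup: ask blkid for the filesystem UUID of a device.
uuid_of() {
    blkid -s UUID -o value "$1"
}

# Read fstab lines on stdin and print them with /dev/... device fields
# replaced by UUID=... where a UUID can be found. Other lines pass through.
# UUID_LOOKUP is a hypothetical hook that lets the lookup be swapped out.
fstab_by_uuid() {
    lookup=${UUID_LOOKUP:-uuid_of}
    while read -r dev rest || [ -n "$dev" ]; do
        case "$dev" in
            /dev/*)
                if uuid=$("$lookup" "$dev") && [ -n "$uuid" ]; then
                    dev="UUID=$uuid"
                fi
                ;;
        esac
        printf '%s\n' "${dev}${rest:+ $rest}"
    done
}

# Usage sketch (as root): write a candidate file and review it by hand.
#   cp /etc/fstab /etc/fstab.bak
#   fstab_by_uuid < /etc/fstab > /tmp/fstab.new
#   diff /etc/fstab /tmp/fstab.new
```

    The sketch deliberately writes to a separate file. A mistake in /etc/fstab can leave a system unbootable, so compare the result and keep the backup before replacing anything.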

    Next, we booted the appliance into rescue mode using an openSUSE Leap 15 SP5 DVD and checked the disks for errors. Quite a few errors were discovered and corrected on /dev/sdb1. We were back in business!

    Disk errors: What is causing them? They were only on /dev/sdb1 (/vastorage). I repaired them two months ago. Is it just a coincidence there were additional errors on the same device? My VMware datastores are backed by SAS SSDs. I have not discovered any disk errors on any of my other VMs. Why just this device?

    Mount issues: They were confirmed on my appliance, but I suspect this may have happened on others' appliances too, especially those whose /vastorage appeared empty but mysteriously appeared normal after a reboot. That is definitely a symptom of disks not being mounted as expected. If SUSE's recommendation is to mount disks by UUID, why are the appliances not following that recommendation?
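
    A quick way to see whether the mounts are as expected is to ask which device is behind each mount point. The sketch below reads a table in /proc/mounts format; the function name mount_source is hypothetical, and it assumes /proc/mounts shows the real device names on the appliance.

```shell
#!/bin/sh
# Print the device mounted at a mount point, reading a mounts table in
# /proc/mounts format (fields: device, mount point, type, options, ...).
# Prints nothing if the mount point is not in the table.
mount_source() {
    target=$1
    table=${2:-/proc/mounts}
    # If a mount point appears more than once, the last entry wins.
    awk -v t="$target" '$2 == t { src = $1 } END { if (src != "") print src }' "$table"
}

# Usage sketch:
#   for m in / /vastorage /var; do
#       echo "$m -> $(mount_source "$m")"
#   done
```

    On a healthy appliance, going by the layout described above, /vastorage should show /dev/sdb1 and /var should show /dev/sdc1. An empty result for /vastorage, or /dev/sdb1 showing up at /, would match what I saw.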

    I'm beginning to look forward to running SMG on my own SLES server.

    __________
    Kevin Boyle, 
    Knowledge Partner

    Calgary, Alberta, Canada