Cannot check or repair disk errors on the SMG appliance.

This thread is a follow-on from the "Issue with updated version 23.3.6?" thread, where I posted this:

I opened a case to learn what happened to my appliance and, hopefully, prevent it from happening again.

The logs showed disk errors on /dev/sdb1 & /dev/sdc1. They also showed that /var was not empty when /dev/sdc1 was mounted.

OT Support could not determine what happened. I asked how to deal with disk errors and they reached out to SUSE for recommendations.

SUSE has two recommendations:

  1. If I wanted to investigate further, they said I could umount /var and see if there are any clues there.
  2. As for how to run e2fsck on the appliance, the proper way is to boot to the ISO that matches the version of the OS of the appliance.

I'm not surprised by the response from SUSE and I don't disagree with them. While their response does provide some guidance, it resolves nothing.

The SMG appliance has had issues, some of which can be directly attributed to SMG while others not so much. This one falls into that "not so much" category, but that doesn't mean that customers can't still be severely impacted. Others have posted in the forum that they too encountered issues after installing the OS upgrade. Some were able to recover simply by rebooting. I wasn't so lucky: I almost lost the whole appliance.

As long as we continue to use the appliance, we must find ways to mitigate these risks and to deal with them when they do happen. The obvious way to mitigate much of the risk is to take backups. That in itself is problematic with the SMG appliance, and a restore will almost certainly result in data loss, so let's just leave this discussion for another time...

The other way is to do regular maintenance. Currently, the only way to become aware of potential issues is to scan the logs, but there is no automated way to do so, and doing it manually on a regular basis requires substantial effort. But the real issue is this: what can we do when we encounter log entries that suggest dire consequences if ignored, for example, disk errors?
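To illustrate what an automated log scan might look like, here is a minimal Python sketch. It is not something the appliance provides, and I am making assumptions: the patterns are typical Linux kernel messages for ext filesystem and block device errors, not strings confirmed from the SMG logs, and the function names find_disk_errors and scan_file are hypothetical. Whether Python is even available on the appliance is also an assumption.

```python
import re

# Assumed patterns: typical Linux kernel messages for filesystem and block
# device errors. Adjust them to match what the appliance actually logs.
ERROR_PATTERNS = [
    re.compile(r"EXT[234]-fs (error|warning)", re.IGNORECASE),
    re.compile(r"I/O error", re.IGNORECASE),
    re.compile(r"Remounting filesystem read-only", re.IGNORECASE),
]


def find_disk_errors(lines):
    """Return (line_number, text) pairs for lines matching any pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(pattern.search(line) for pattern in ERROR_PATTERNS):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_file(path):
    """Scan one log file and print every suspicious line with its number.

    Returns the number of matching lines so a caller (for example a cron
    wrapper) can decide whether to raise an alert.
    """
    # errors="replace" keeps the scan going if the log contains odd bytes.
    with open(path, errors="replace") as handle:
        hits = find_disk_errors(handle)
    for number, text in hits:
        print(f"{path}:{number}: {text}")
    return len(hits)
```

Calling scan_file with the path to the system log (the path depends on the appliance, so I have not filled one in) from a scheduled job would at least flag disk errors without anyone reading the logs by hand. It only detects; it does nothing about the repair problem described below.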

SUSE's recommendations overlook two important points:

  1. /var cannot be unmounted while the appliance is running.
  2. There is no available ISO that matches the appliance OS.

The result is we are unable to run e2fsck to check for or resolve disk errors.

So, we have what I would consider some major issues that could result in the total loss of all SMG configuration and user data. It is true that this situation will improve when SMG is no longer dependent on the appliance, but according to Product Manager Eric Maughan, "the appliance will still be with us for at least a year". In the meantime, something must be done!

Comments welcomed...

__________
Kevin Boyle, 
Knowledge Partner

Calgary, Alberta, Canada