stelgenkamp
New Member.

NSS Pool crash on OES2015 VM holding Groupwise

Our server crashed yesterday, ouch!
We have a Xen server running OES2018, which hosts several VMs.
There are iSCSI drives on our NAS, which hold the data for our GroupWise system.
I pass the whole iSCSI drive through to the VM, so /dev/sdd, and the VM has its NSS pool and volume on it.

Now the host server itself crashed, causing all the VMs to crash.
When it came back up, the drive names had somehow been mixed up:
the GroupWise disk used to be /dev/sdd, but it changed to /dev/sdc!
Of course, when booting the VM (which runs OES2015, fully patched) I could not see the NSS drive.
It took me a while to figure that out, so I rebooted several times.
Now that I have connected the correct drive, I still cannot see the NSS pool.
I cannot connect to the pool using nssmu, as it just hangs.
I cannot see anything about the pool in iManager either; it just errors out.

But the VM's processors are running at full speed, so I guessed some sort of automatic rebuild was going on and let it sit overnight.
Still nothing.
Our GroupWise system is small: a maximum of 10 users and only about 5 GB of data.

Questions:
Could the VM still be rebuilding NSS, or is it just hung?
How long should a rebuild like this normally take?
I have seen some information about ravsui, but it does not seem to be able to access the disk.

Any ideas?
stelgenkamp
New Member.

Re: NSS Pool crash on OES2015 VM holding Groupwise

Update: I managed to reboot and get into nsscon:
Unable to rebuild Super Block Headers
Pool DATA_NAS system data error 20444
Pool DATA_NAS read error 20206

My next plan was to run nss /Poolmaintenance=DATA_NAS,
but then the system becomes unresponsive. How long should it take for a pool to enter maintenance mode?
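For anyone following along, this is roughly the sequence I have been trying. DATA_NAS is just our pool name, so substitute your own; the nss command is typed at the NSS console prompt, not in the shell:

# as root on the OES2015 VM
nsscon
# then, at the NSS console prompt:
nss /poolmaintenance=DATA_NAS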
Anonymous_User
Absent Member.

Re: NSS Pool crash on OES2015 VM holding Groupwise


Check with lsblk (run as root).
There you should be able to see whether the device and its partition are visible to OES.
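Something like this, for example (the device and pool names are the ones from this thread; yours will differ):

lsblk
# you should see the iSCSI disk (sdc, sdd, ...) with its partition, and a dm-* device
# for the pool mounted under /opt/novell/nss/mnt/.pools/DATA_NAS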

Bernd
stelgenkamp
New Member.

Re: NSS Pool crash on OES2015 VM holding Groupwise

Thank you Bernd,

lsblk shows:
DATA_NAS (dm-0) /opt/novell/nss/mnt/.pools/DATA_NAS
So it should be available, I think?

However, I cannot put the pool into maintenance mode.
System Monitor is showing about 6 NSS tasks taking all of the processor power.
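If anyone wants to look at the same thing from the command line instead of System Monitor, standard Linux tools should do it (nothing NSS-specific here):

top -H                  # show individual threads; in my case the busy ones are all NSS tasks
ps -eLf | grep -i nss   # list the NSS-related threads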
stelgenkamp
New Member.

Re: NSS Pool crash on OES2015 VM holding Groupwise

The documentation reads:
"This type of error is fixed automatically during volume mount by the NSS journaling feature."
So is the NSS pool auto-rebuilding?
How long would that take? The total pool size is 939 GB.
Knowledge Partner

Re: NSS Pool crash on OES2015 VM holding Groupwise

stelgenkamp wrote:

>
> Server crash yesterday, auchhh!
> Have a Xen server OES2018, which holds several VM's
> Have ISCSI drives on our NAS, which holds the data for our Groupwise
> system.
> I give the total ISCI drive to the VM : so /dev/sdd and the VM has its
> NSS POOL and Volume on it.


It's always a good idea to place data on a separate volume. That way
it's easier to back up and recover, and it may also provide some
performance advantages.



> Now the server itself crashed, causing all VM's to crash.
> But when getting back up, somehow messed up the drive letters.
> so the Groupwise system used to be /dev/sdd, but it changed to
> /dev/sdc !


Your devices are mounted "by device name" (/dev/sda, /dev/sdb, etc.).
The names are assigned based on the order in which the drives are
encountered when the system is booted. If there is an issue with a
drive, different names can be assigned to your drives, as appears to
have happened here. This, of course, can cause all kinds of problems.
Some years back SUSE recommended that drives be mounted "by device
ID", which does not change, so you could have avoided this very
issue.
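As a rough sketch of what that can look like for a Xen guest (the ID string below is made up; use whatever your iSCSI LUN actually shows):

ls -l /dev/disk/by-id/
# then reference the persistent name in the domU disk configuration instead of /dev/sdd, e.g.
# disk = [ 'phy:/dev/disk/by-id/scsi-360014051234abcd,xvdb,w' ]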



> Of course when booting up the VM, (which is on OES2015 fully patched)
> I could not see the NSS drive.
> It took my a while to figure that out. So rebooting several times.


I hope you have a backup? 😉

When you have a failure and your data is at risk, the last thing you
want to do is make matters worse. NSS is pretty robust, but sometimes
things do go wrong. In a situation like this it might be better to open
a Service Request and let the experts advise on the correct course of
action!



> Now I have connected the correct drive, I still cannot see the NSS.
> I cannot connect to the pool using nssmu, as it just hangs.
> Cannot see anything about the pool in imanager, as this errors out.


I can't tell what you may have done. You may just be making matters
worse! 😞



> But the VM's processors are running full speed. So i guessed it was
> auto-re-building of sorts, and let it sit overnight.
> Now still nothing.
> Our Groupwise system is small, max 10 users, only 5 Gb of Data or
> something.
>
> - Questions :
> Can the VM is still rebuilding NSS ? Or does it just hang ?
> How long should a rebuild like this normally take ?
> I have seen some info about ravsui, but it cannot seems to access the
> disc.
>
> Any ideas?


You are trying to fix a problem but don't know what the problem is. You
said your VMs and data reside on a NAS:
- What kind of a NAS?
- What kind of redundant storage has been configured?
- Have you checked for hardware errors (drive, RAID, network, etc.)?
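For the drive and network side on the OES server itself, a quick first look could be something like this (the NAS and its RAID will have to be checked through the NAS's own management interface):

dmesg | grep -iE 'error|i/o|iscsi'            # recent kernel, disk and iSCSI errors
tail -200 /var/log/messages | grep -i error   # anything logged around the crash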

Your server is doing something, but you have no idea what. It *could* be
rebuilding your data, but if there are other underlying hardware errors,
and especially with different drive names having been assigned, it
could just as easily be *destroying* it.

If your data is important to you, it's time to stop the hit-and-miss
approach and get some help.


In a follow-up post you say:
> Total Pool size is 939 Gb.


If you have a gigabit connection between your server and your NAS, the
*maximum* throughput you can expect is about 120 MB/s, so it *could* take
a very long time.


--
Kevin Boyle - Knowledge Partner - Calgary, Alberta, Canada
Who are the Knowledge Partners?
If you appreciate my comments, please click the Like button.
If I have resolved your issue, please click the Accept as Solution button.
stelgenkamp
New Member.

Re: NSS Pool crash on OES2015 VM holding Groupwise

Dear Kevin,
Lots of questions:

> NSS is pretty robust but sometimes things do go wrong
But does it auto-rebuild? The tasks running are all NSS tasks.

> I hope you have a backup?
I hope so too, but it is better to know first whether this is fixable.

> data reside on a NAS:
On the NAS (a Lenovo) there is an NSS pool and volume.
It is RAID-mirrored, and the RAID is in order.

> *maximum* throughput you can expect is about 120MBs so it *could* take a very long time.
So if we take 100 MB/s for 939 GB, it should be ready within about 3 hours. And it has been working at it all day now... sigh
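(Rough math, just to show where the 3 hours comes from: 939 GB at 100 MB/s is about 9,400 seconds, roughly 2.6 hours, assuming a single sequential pass over the data.)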

I have opened an SR per your advice.

I just want to understand the process.
So the server boots and starts up NSS; the NSS pool is deactivated due to read errors.
Then what? The server is cranking away at something. Is it working on a restore?
Is it normal to have 6 NSS tasks even after the pool has been deactivated?
Any advice on this is highly appreciated, and I will keep you posted about the SR.
Stephan
Knowledge Partner

Re: NSS Pool crash on OES2015 VM holding Groupwise

stelgenkamp wrote:

>
> Dear Kevin,
> Lots of questions :
> -NSS is pretty robust but sometimes things do go wrong-
> But does it auto-rebuild ? The tasks running are all NSS tasks.


Hi Stephan,

I know very little about the internal workings of NSS. Yes, it is
supposed to recover from errors, and it uses a journal to help in that
recovery, but there are limits to what it can accomplish. For example, if
there are disk errors that have caused a large loss of data, it may be
impossible for any program to recover it. That is when you are
glad you have a backup!


> -I hope you have a backup?-
> I hope so too, but better to know if this is fixable.


Yes...



> -data reside on a NAS:-
> So on the NAS (Lenovo) is a NSS Pool and Volume.
> This is RAID mirrored, and the RAID is in order.


A RAID array will only show as failed if a drive has actually failed;
there could still be errors on one of the mirrored drives. I assume you
are using RAID 1?


> -*maximum* throughput you can expect is about 120MBs so it could take
> a very long time.-
> So when we take 100 Mbs : 939 Gb it should be ready within 3 hours.
> And it has been working at it all day now .. sigh


So, your Ethernet connection can support a maximum throughput of about
120 MB/s. If you are using RAID 1 in your NAS, the maximum write transfer
rate of a drive is likely 30 MB/s to 60 MB/s.

During a repair operation, small amounts of data are read from the
drive and examined. If necessary, the data is changed and written back
to the drive, so the actual amount of data transferred would be much
greater than 939 GB. And because of the overhead of transferring small
chunks of data, the throughput is likely to be far less than the
maximum write transfer rate of a drive. In other words, your simple
calculation has nothing to do with how long a repair might take.

If you want a better idea of the actual transfer rates you might be
able to get that from your NAS.


> I have opened up a SR per your advise.


I'm glad you did. Please let us know how that goes.
Your GroupWise server is still not working, so be sure to tell support
you have a "server down" situation, which will make this a high priority.

>
> I just want to understand the process.
> So the server boots, starts up NSS. NSS pool is deactived due to read
> errors.
> Then what? The server is cracking at something. Is it working to
> restore ?
> Is it normal, to have 6 NSS tasks even after the pool has been
> deactivated ?
> Any advice on this is highly appreciated. And I will keep you posted
> about the S/R
> Stephan


Those are the questions I cannot answer. The support person working on
your SR will have to do some troubleshooting to assess the situation
before he can determine that.



--
Kevin Boyle - Knowledge Partner - Calgary, Alberta, Canada
Who are the Knowledge Partners?
If you appreciate my comments, please click the Like button.
If I have resolved your issue, please click the Accept as Solution button.
stelgenkamp
New Member.

Re: NSS Pool crash on OES2015 VM holding Groupwise

So this was fast! Here is the reply:

> The pool rebuild doesn’t run automatically. It is probably the attempted activation of the corrupt pool that is causing the NSS utilization.
So the high processor usage didn't mean a thing; nothing was actually happening.

> Given the 20206 and 20444 errors, you’ll want to run a rebuild with rezid.
That was the problem; I could not get as far as running a rebuild.

> In order to do this, you’ll probably need to boot the server without it attempting to activate the pool.
> To do this, add /poolautodeactivate=all to the very top of /etc/opt/novell/nss/nssstart.cfg and then reboot.
This was the magic key; now the pool did not auto-activate and the processors were back to normal.

> You should then be able to put the pool in maintenance mode and run the rebuild with “ravsui -r 0xefffffff rebuild DATA_NAS”
Ta-da, within 2 minutes and 48 seconds I was up and running!
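To recap, the full sequence as I understood it from support (DATA_NAS is our pool name; obviously check with support before running a rebuild against your own data):

# 1. Stop NSS from activating the corrupt pool at boot:
#    add this line at the very top of /etc/opt/novell/nss/nssstart.cfg, then reboot
/poolautodeactivate=all

# 2. After the reboot, put the pool into maintenance mode from the NSS console (nsscon):
nss /poolmaintenance=DATA_NAS

# 3. Run the rebuild:
ravsui -r 0xefffffff rebuild DATA_NAS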

Hope this helps someone in the future.
Stephan
Knowledge Partner

Re: NSS Pool crash on OES2015 VM holding Groupwise

stelgenkamp wrote:

> To do this, add /poolautodeactivate=all to the very top of
> /etc/opt/novell/nss/nssstart.cfg and then reboot.


That's especially good to know!

I'm glad your data is safe and I appreciate your updating this thread
with your solution.

--
Kevin Boyle - Knowledge Partner - Calgary, Alberta, Canada
Who are the Knowledge Partners?
If you appreciate my comments, please click the Like button.
If I have resolved your issue, please click the Accept as Solution button.