NCS pool stuck in loading state without a node assigned

After a mishap in which an NCS node failed to load an NSS pool, followed
by a hard power-off, I am now stuck with one pool in state "loading"
without a node assignment. A simple "cluster offline" does not work.
Some of the usual tricks, such as migrating the master node or forcing
an update of the locally stored configuration from the one in eDirectory
by changing some options in iManager, have not helped so far.
So I am looking for a way to rectify the situation without doing an
actual "cluster restart", which in most cases helps as a last resort
but of course also causes downtime for all the other pools.

Günther
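
A small sketch for spotting the stuck resource from a shell on one of the nodes. It assumes that `cluster status` prints one line per resource with the state somewhere in that line; the exact output format is not shown in this thread, so the filter only looks for the whole word "loading" in any case. `loading_resources` is a hypothetical helper name.

```shell
# loading_resources: read status text on stdin and print every line that
# contains the whole word "loading" (any case). "Unloading" does not match.
loading_resources() {
    # "|| true" keeps the exit status zero when nothing is stuck.
    grep -iw -- 'loading' || true
}
```

Usage would be `cluster status | loading_resources`; an empty result means no resource reports that state.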
  • In article <KbwRC.1498$RN1.317@novprvlin0914.provo.novell.com>, Günther
    Schwarz wrote:
    > After a mishap in which an NCS node failed to load an NSS pool,
    > followed by a hard power-off, I am now stuck with one pool in state
    > "loading" without a node assignment. A simple "cluster offline"
    > does not work.


    Hi Günther
    So that we can better understand, which version are we running here and
    how many nodes?
    I would try to manually mount that pool to see if that makes a
    difference, and to possibly get some better error reporting.
    Check /var/opt/novell/log/ncpserv.log and ncp2nss.log as well as
    /var/log/messages for anything of note when you attempt to mount this
    pool.
    Make sure your SAN isn't showing any errors; if it is, fix them first.
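
    The log check above could be scripted roughly as follows. This is
    only a sketch: `scan_pool_logs` is a hypothetical helper, POOL1 in
    the usage line is a placeholder pool name, and it assumes that
    ncp2nss.log sits in the same directory as ncpserv.log.

    ```shell
    # scan_pool_logs POOL [LOGFILE...]
    # Print the most recent lines mentioning POOL from each readable log.
    scan_pool_logs() {
        pool=$1
        shift
        # Default to the logs named above when no files are given.
        # Assumption: ncp2nss.log lives next to ncpserv.log.
        if [ "$#" -eq 0 ]; then
            set -- /var/opt/novell/log/ncpserv.log \
                   /var/opt/novell/log/ncp2nss.log \
                   /var/log/messages
        fi
        for log in "$@"; do
            # Skip logs that are missing or not readable by this user.
            [ -r "$log" ] || continue
            echo "== $log"
            # -i: ignore case, -F: treat the pool name as a fixed string
            grep -iF -- "$pool" "$log" | tail -n 20
        done
    }
    ```

    Run it right after a mount attempt, for example
    `scan_pool_logs POOL1`, so the newest entries are the relevant ones.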

    You might have to go as far as the deeper verify steps as per
    https://www.novell.com/support/kb/doc.php?id=7006457
    Note that the verify can take a while, and the rebuild certainly takes
    a long time; it is like watching grass grow once you get past 99%
    done.


    Andy of
    http://KonecnyConsulting.ca in Toronto
    Knowledge Partner
    http://forums.novell.com/member.php/75037-konecnya

  • On 06/05/2018 09:28 PM, Andy Konecny wrote:
    > In article <KbwRC.1498$RN1.317@novprvlin0914.provo.novell.com>, Günther
    > Schwarz wrote:
    >> After a mishap in which an NCS node failed to load an NSS pool,
    >> followed by a hard power-off, I am now stuck with one pool in state
    >> "loading" without a node assignment. A simple "cluster offline"
    >> does not work.


    > So that we can better understand, which version are we running here and
    > how many nodes?


    These are four nodes running OES2015 SP1.

    > I would try to manually mount that pool to see if that makes a
    > difference, and to possibly get some better error reporting.
    > Check /var/opt/novell/log/ncpserv.log and ncp2nss.log as well as
    > /var/log/messages for anything of note when you attempt to mount this
    > pool.
    > Make sure your SAN isn't showing any errors; if it is, fix them first.


    Actually the NSS part looks just fine. I can even run the cluster load
    script from the command line: the pool and volume come online and the
    secondary IP address is configured.

    > You might have to go as far as the deeper verify steps as per
    > https://www.novell.com/support/kb/doc.php?id=7006457
    > Note that the verify can take a while, and the rebuild certainly takes
    > a long time; it is like watching grass grow once you get past 99%
    > done.


    It is a small pool, so a verify will not take long. But then my
    problem does not seem to be related to NSS. To me this looks just
    like NCS being stuck. So I am looking for a way to reset this single
    resource without doing a cluster restart. I might just delete it and
    create it again.

    Günther

  • In article <D%LRC.1522$RN1.526@novprvlin0914.provo.novell.com>, Günther
    Schwarz wrote:
    > Actually the NSS part looks just fine. I can even run the cluster load
    > script from the command line: the pool and volume come online and the
    > secondary IP address is configured.

    ...
    > It is a small pool, so a verify will not take long. But then my
    > problem does not seem to be related to NSS. To me this looks just
    > like NCS being stuck. So I am looking for a way to reset this single
    > resource without doing a cluster restart. I might just delete it and
    > create it again.


    Ah, so at least you aren't down and out with that resource.
    Perhaps there is an eDir sync issue. Run through the basic checks on
    all those cluster nodes and any others that hold those objects, the
    usual
    ndsrepair -T
    ndsrepair -E
    ndsrepair -C -Ad -A
    making sure there are no errors and that nothing simple has snuck by
    you.
    Then make some other change(s) to the cluster resource object to force
    a sync of it, perhaps looking at the object from different iManager
    instances.
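
    To repeat the same three checks on every node, a wrapper along these
    lines may help. It is a sketch only: `run_edir_checks` and the
    report file names are hypothetical, the NDSREPAIR override exists
    just to allow a dry run, and whether ndsrepair signals problems via
    its exit status is an assumption, so the saved reports still need to
    be read.

    ```shell
    # run_edir_checks [OUTDIR]
    # Run the three ndsrepair checks in order and keep each report.
    # NDSREPAIR may be overridden (e.g. for a dry run); default: ndsrepair.
    run_edir_checks() {
        outdir=${1:-.}
        n=0
        for opts in "-T" "-E" "-C -Ad -A"; do
            n=$((n + 1))
            report="$outdir/ndsrepair-$n.txt"
            echo "== ndsrepair $opts -> $report"
            # $opts is unquoted on purpose so "-C -Ad -A" splits into flags.
            ${NDSREPAIR:-ndsrepair} $opts > "$report" 2>&1 ||
                echo "ndsrepair $opts exited non-zero; see $report"
        done
    }
    ```

    For example, `run_edir_checks /tmp` leaves ndsrepair-1.txt to
    ndsrepair-3.txt in /tmp for comparison between nodes.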


    Andy of
    http://KonecnyConsulting.ca in Toronto
    Knowledge Partner
    http://forums.novell.com/member.php/75037-konecnya

  • On 06/07/2018 05:03 AM, Andy Konecny wrote:

    > Ah, so at least you aren't down and out with that resource.
    > Perhaps there is an eDir sync issue. Run through the basic checks on
    > all those cluster nodes and any others that hold those objects, the
    > usual
    > ndsrepair -T
    > ndsrepair -E
    > ndsrepair -C -Ad -A
    > making sure there are no errors and that nothing simple has snuck by
    > you.
    > Then make some other change(s) to the cluster resource object to force
    > a sync of it, perhaps looking at the object from different iManager
    > instances.


    Thank you very much for your suggestions. In the end, a cluster
    restart fixed the issue. The pool is online again and I can migrate it
    from one node to another without problems. What I am still missing is
    a reset command for single cluster resources.

    Günther