Standby node in AppFailover cluster reports STANDBY_DB_FAILED START_FAILED (NNMi 2022.05 with no patches)

Hello,

I noticed that the standby node in our AF cluster has been showing the STANDBY_DB_FAILED START_FAILED state for a few weeks.

Local?    NodeType  State               OvStatus      Hostname/Address
------    --------  -----               --------      ----------------------------
* REMOTE  DAEMON    STANDBY_DB_FAILED   START_FAILED  node2/node2-17831
  LOCAL   DAEMON    ACTIVE_NNM_RUNNING  RUNNING       node1/node1-29522
  (SELF)  ADMIN     n/a                 n/a           node1/node1-57504

nnmcluster-daemon.0.0.log shows the following, including one WARNING line related to nmsdbmgr:

Sep 13, 2024 11:44:55.571 PM [ThreadID:43] INFO: com.hp.ov.nms.admin.nnmcluster.NnmCluster viewAccepted: New cluster View accepted: [node2-13411|2] [node2-13411, node2-17831, node1-29522]
Sep 13, 2024 11:45:43.406 PM [ThreadID:43] INFO: com.hp.ov.nms.admin.nnmcluster.NnmCluster viewAccepted: New cluster View accepted: [node2-13411|3] [node2-13411, node2-17831, node1-29522, node1-15523]
Sep 13, 2024 11:45:50.507 PM [ThreadID:43] INFO: com.hp.ov.nms.admin.nnmcluster.NnmCluster viewAccepted: New cluster View accepted: [node2-17831|4] [node2-17831, node1-29522, node1-15523]
Sep 13, 2024 11:45:50.508 PM [ThreadID:28] INFO: com.hp.ov.nms.admin.nnmcluster.ClusterInfo : Detected controller change from node2-13411 to node2-17831. Requesting updated node info from all nodes
Sep 13, 2024 11:45:50.510 PM [ThreadID:28] INFO: com.hp.ov.nms.admin.nnmcluster.NnmCluster : Received updated node info NodeInfo(addr=node2-17831, type=DAEMON, state=QUERY_CONTROLLER, ovstatus=NOT_RUNNING, startTime=-1)
Sep 13, 2024 11:45:50.574 PM [ThreadID:28] INFO: com.hp.ov.nms.admin.nnmcluster.NnmCluster : Received updated node info NodeInfo(addr=node1-15523, type=ADMIN, state=NONDAEMON_READY, ovstatus=null, startTime=-1)
Sep 13, 2024 11:45:50.576 PM [ThreadID:28] INFO: com.hp.ov.nms.admin.nnmcluster.NnmCluster : Received updated node info NodeInfo(addr=node1-29522, type=DAEMON, state=ACTIVE_NNM_STARTING, ovstatus=STARTING, startTime=1,726,263,897,964)
Sep 13, 2024 11:56:00.655 PM [ThreadID:28] INFO: com.hp.ov.nms.admin.nnmcluster.NodeStateTransition setState: Transitioning NodeState from QUERY_CONTROLLER to STANDBY_INITIALIZING
Sep 13, 2024 11:56:00.718 PM [ThreadID:28] INFO: com.hp.ov.nms.admin.nnmcluster.NodeStateTransition setState: Transitioning NodeState from STANDBY_INITIALIZING to STANDBY_QUERY_DB
Sep 13, 2024 11:56:04.069 PM [ThreadID:28] INFO: com.hp.ov.nms.admin.nnmcluster.NodeStateTransition setState: Transitioning NodeState from STANDBY_QUERY_DB to STANDBY_PREPWORK
Sep 13, 2024 11:56:07.233 PM [ThreadID:28] INFO: com.hp.ov.nms.admin.nnmcluster.NnmCluster startDbOnStandby: Starting NNM database.
Sep 13, 2024 11:56:07.234 PM [ThreadID:28] INFO: com.hp.ov.nms.admin.nnmcluster.NodeStateTransition setState: Transitioning NodeState from STANDBY_PREPWORK to STANDBY_DB_STARTING
Sep 13, 2024 11:56:08.713 PM [ThreadID:28] INFO: com.hp.ov.nms.admin.nnmcluster.ClusterFileReceiver : Receiving file: /var/opt/OV/shared/nnm/databases/Postgres_standby/TxWALs_recv/0000002300001B9C000000E4.zip
Sep 13, 2024 11:56:08.824 PM [ThreadID:28] INFO: com.hp.ov.nms.admin.nnmcluster.ClusterFileReceiver : Receiving file: /var/opt/OV/nmsas/NNM/log/audit-2024-09-13.log
Sep 13, 2024 11:56:09.199 PM [ThreadID:28] INFO: com.hp.ov.nms.admin.nnmcluster.ClusterFileReceiver : Receiving file: /var/opt/OV/shared/nnm/databases/Postgres_standby/TxWALs_recv/0000002300001B9C000000E5.zip
Sep 13, 2024 11:58:06.301 PM [ThreadID:82] WARNING: com.hp.ov.nms.admin.nnmcluster.utils.ExecProc call: Command ("/opt/OV/bin/ovstart" "-c" "nmsdbmgr" ) returned non-zero exit status: 1
Sep 13, 2024 11:58:06.303 PM [ThreadID:28] INFO: com.hp.ov.nms.admin.nnmcluster.NodeStateTransition setState: Transitioning NodeState from STANDBY_DB_STARTING to STANDBY_DB_FAILED
Sep 13, 2024 11:59:09.088 PM [ThreadID:56] INFO: com.hp.ov.nms.admin.nnmcluster.NnmCluster viewAccepted: New cluster View accepted: [node2-17831|5] [node2-17831, node1-29522]
Sep 14, 2024 12:00:01.479 AM [ThreadID:28] INFO: com.hp.ov.nms.admin.nnmcluster.ClusterFileReceiver : Receiving file: /var/opt/OV/nmsas/NNM/log/audit-2024-09-13.log
Sep 14, 2024 12:00:28.377 AM [ThreadID:28] INFO: com.hp.ov.nms.admin.nnmcluster.ClusterFileReceiver : Receiving file: /var/opt/OV/shared/nnm/databases/Postgres_standby/TxWALs_recv/0000002300001B9C000000E6.zip
Sep 14, 2024 12:00:28.440 AM [ThreadID:28] INFO: com.hp.ov.nms.admin.nnmcluster.ClusterFileReceiver : Receiving file: /var/opt/OV/shared/nnm/databases/Postgres_standby/TxWALs_recv/0000002300001B9C000000E7.zip
Sep 14, 2024 12:00:28.689 AM [ThreadID:28] INFO: com.hp.ov.nms.admin.nnmcluster.ClusterFileReceiver : Receiving file: /var/opt/OV/shared/nnm/databases/Postgres_standby/TxWALs_recv/0000002300001B9C000000E8.zip
Sep 14, 2024 12:00:29.080 AM [ThreadID:28] INFO: com.hp.ov.nms.admin.nnmcluster.ClusterFileReceiver : Receiving file: /var/opt/OV/shared/nnm/databases/Postgres_standby/TxWALs_recv/0000002300001B9C000000E9.zip
Sep 14, 2024 12:00:29.712 AM [ThreadID:28] INFO: com.hp.ov.nms.admin.nnmcluster.ClusterFileReceiver : Receiving file: /var/opt/OV/shared/nnm/databases/Postgres_standby/TxWALs_recv/0000002300001B9C000000EA.zip
Sep 14, 2024 12:05:39.637 AM [ThreadID:28] INFO: com.hp.ov.nms.admin.nnmcluster.core.FileService : Removed /var/opt/OV/shared/nnm/databases/Postgres_standby/PostgresBackup.zip from the queue because it is no longer on the active
Sep 14, 2024 12:05:39.677 AM [ThreadID:28] INFO: com.hp.ov.nms.admin.nnmcluster.ClusterFileReceiver : Receiving file: /var/opt/OV/shared/nnm/databases/Postgres_standby/TxWALs_recv/0000002300001B9C000000EB.zip
Sep 14, 2024 12:05:40.149 AM [ThreadID:28] INFO: com.hp.ov.nms.admin.nnmcluster.ClusterFileReceiver : Receiving file: /var/opt/OV/shared/nnm/databases/Postgres_standby/TxWALs_recv/0000002300001B9C000000EC.zip
Sep 14, 2024 12:10:52.498 AM [ThreadID:28] INFO: com.hp.ov.nms.admin.nnmcluster.core.FileService : Removed /var/opt/OV/shared/nnm/databases/Postgres_standby/PostgresBackup.zip from the queue because it is no longer on the active
Sep 14, 2024 12:10:52.540 AM [ThreadID:28] INFO: com.hp.ov.nms.admin.nnmcluster.ClusterFileReceiver : Receiving file: /var/opt/OV/shared/nnm/databases/Postgres_standby/TxWALs_recv/0000002300001B9C000000ED.zip
Sep 14, 2024 12:13:04.273 AM [ThreadID:28] INFO: com.hp.ov.nms.admin.nnmcluster.NnmCluster : Rejecting db transfer while in state STANDBY_DB_FAILED
Sep 14, 2024 12:13:04.273 AM [ThreadID:28] INFO: com.hp.ov.nms.admin.nnmcluster.NnmCluster : Rejecting db transfer while in state STANDBY_DB_FAILED

Does anyone know how to fix this?
I read this article: Application Failover Cluster got status STANDBY_DB_FAILED START_FAILED (microfocus.com)
but it does not apply to our case: the standby node cannot start normally because the embedded DB fails to start.

  • Verified Answer

    Hello,

    I was able to find the root cause and fix it.

    The /var/opt/OV/log/nnm/nnmcluster-daemon.*.log files contained another entry: ("/opt/OV/nonOV/perl/a/bin/perl" "/opt/OV/lbin/fixdbperms.ovpl") returned non-zero exit status: 2
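
A quick way to surface entries like this one is to filter the cluster daemon logs for failed external commands. The following is only a minimal sketch: the function name find_failed_commands is hypothetical, and the search string is taken from the log lines quoted in this thread.

```shell
# Print every log line that reports a failed external command.
# Usage: find_failed_commands [logfile ...]
# With no arguments, the default log location mentioned in this thread is used.
find_failed_commands() {
    if [ "$#" -eq 0 ]; then
        set -- /var/opt/OV/log/nnm/nnmcluster-daemon.*.log
    fi
    # -h suppresses file names so the output looks like the original log lines
    grep -h 'returned non-zero exit status' "$@"
}
```

Calling find_failed_commands without arguments on the standby node should list both the ovstart and the fixdbperms.ovpl failures, if they are present in the logs.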

    It turned out that this script could not run correctly to fix the file permissions because the Perl module "NonRootFunc.pm" was missing:

    root@standby-node:~ # /opt/OV/nonOV/perl/a/bin/perl /opt/OV/lbin/fixdbperms.ovpl

    Can't locate NonRootFunc.pm in @INC (you may need to install the NonRootFunc module) (@INC contains: /opt/OV/nonOV/perl/a/lib/site_perl/5.34.0/x86_64-linux-thread-multi /opt/OV/nonOV/perl/a/lib/site_perl/5.34.0 /opt/OV/nonOV/perl/a/lib/5.34.0/x86_64-linux-thread-multi /opt/OV/nonOV/perl/a/lib/5.34.0 /opt/OV/nonOV/perl/a/lib/site_perl/5.34.0/x86_64-linux-thread-multi /opt/OV/nonOV/perl/a/lib/site_perl/5.34.0 /opt/OV/nonOV/perl/a/lib/site_perl) at /opt/OV/lbin/fixdbperms.ovpl line 10.
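
The same condition can be checked without running the whole script, by asking the Perl interpreter to load the module. This is a small sketch, assuming only Perl's standard -M and -e switches; the function name perl_module_loads is hypothetical.

```shell
# Return 0 if the given Perl interpreter can load the given module, 1 otherwise.
# Usage: perl_module_loads <perl-binary> <module-name>
perl_module_loads() {
    perl_bin="$1"
    module="$2"
    # -M<module> loads the module; -e 1 runs a trivial program.
    # Any "Can't locate ... in @INC" message is discarded.
    "$perl_bin" -M"$module" -e 1 2>/dev/null
}
```

On the standby node this could be called as: perl_module_loads /opt/OV/nonOV/perl/a/bin/perl NonRootFunc || echo "NonRootFunc is missing"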


    This module is located here:
    /opt/OV/newconfig/HPNmsCommon/perl/a/lib/site_perl/NonRootFunc.pm

    but it is also expected to be in the /opt/OV/nonOV/perl/a/lib/site_perl folder, because all the content of /opt/OV/newconfig/HPNmsCommon/perl/a/ is expected to be copied to /opt/OV/nonOV/perl/a/.

    This is normally done by the script /opt/OV/lbin/nnm-install-config.sh, which contains the line "/opt/OV/nonOV/perl/a/bin/perl /opt/OV/newconfig/HPNmsCommon/scripts/nnm-configure-perl.ovpl -source /opt/OV/newconfig/HPNmsCommon/perl/a/ -target $INSTPERLDIR".

    I have no idea why NonRootFunc.pm was missing from /opt/OV/nonOV/perl/a/lib/site_perl, but after I copied it there from /opt/OV/newconfig/HPNmsCommon/perl/a/ the standby node was able to start the nmsdbmgr process correctly, because the file permissions were then as expected.

    cp /opt/OV/newconfig/HPNmsCommon/perl/a/lib/site_perl/NonRootFunc.pm /opt/OV/nonOV/perl/a/lib/site_perl

    chown bin:bin /opt/OV/nonOV/perl/a/lib/site_perl/NonRootFunc.pm
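
Because the whole tree under /opt/OV/newconfig/HPNmsCommon/perl/a/ is expected to end up in /opt/OV/nonOV/perl/a/, it may be worth checking whether any other files are missing as well. The sketch below assumes that the copy is a one-to-one mirror of relative paths, which this thread implies but does not confirm; the function name list_missing_perl_files is hypothetical.

```shell
# List files present under the source Perl tree but absent from the target tree.
# Usage: list_missing_perl_files [source-dir] [target-dir]
# The default paths are the ones named in this thread.
list_missing_perl_files() {
    src="${1:-/opt/OV/newconfig/HPNmsCommon/perl/a}"
    dst="${2:-/opt/OV/nonOV/perl/a}"
    # Walk the source tree with relative paths, then test each one in the target.
    ( cd "$src" && find . -type f ) | sed 's|^\./||' | sort |
    while IFS= read -r rel; do
        [ -e "$dst/$rel" ] || printf '%s\n' "$rel"
    done
}
```

An empty result would suggest that the target tree is complete; any path that is printed is a candidate for the same cp and chown treatment as NonRootFunc.pm.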



    Best regards,

    Jarosław Kęsy