NNMi cluster health - postgres issue on standby?

NNMi is running on Linux with application fail-over clustering [NNMi v10.10 + Patch 8].

Has anyone else encountered this issue? Any tips or tricks? I'm concerned that if a fail-over to the standby system were needed, the database might be in an unusable state. For the time being, I'm running nnmbackupembdb.ovpl daily on the active node.

Since this past weekend, the output from nnmhealth.ovpl -print verbose -filter NNMClusterHealthAgent has always shown "Last DB transaction ID received on standby: 0xffffffffffffffff=-1".

Before this past weekend, the value was not always -1. I'm used to seeing "Last DB transaction ID received on standby: 0xffffffffffffffff=-1" while the active node takes a backup and sends it to the standby, at which point the standby nmsdbmgr shuts down to update/replace the existing database. Neither the active nor the standby node has been rebooted since this started. However, there was a wide-area network outage late Friday evening, which is the last thing that occurred before this started happening. The active node reported that it lost connectivity to the cluster agent on the standby node.

FWIW, I attempted a restore of the embedded database from the active to the standby, but that didn't resolve it. I also cleaned up the Postgres_standby directory on the standby per KM1156362: after shutting down nnmcluster on the standby, I removed Postgres_standby/PostgresBackup.* and Postgres_standby/TxWALs_recv/*. I have not performed any actions on the active node, since that would require down-time. I don't see anything out of the ordinary in the pg_log files on the standby, and the cluster log and nmsdbmgr log don't show any persistent or repeating errors, but it's possible that the interesting log entries have already rotated out of view.
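
For anyone repeating the clean-up, here is a rough dry-run sketch of the two removal patterns above. It only lists what would match and deletes nothing. The function name is made up for illustration, and the full path to Postgres_standby is deliberately left as a parameter because it depends on the installation; nnmcluster should already be shut down on the standby before anything is actually removed.

```python
import glob
import os


def standby_cleanup_candidates(standby_dir):
    """List files matching the two clean-up patterns, without deleting them.

    standby_dir is the Postgres_standby directory on the standby node
    (its full path is installation-specific, so it is passed in).
    """
    patterns = [
        os.path.join(standby_dir, "PostgresBackup.*"),
        os.path.join(standby_dir, "TxWALs_recv", "*"),
    ]
    found = []
    for pattern in patterns:
        found.extend(glob.glob(pattern))
    return sorted(found)
```

Calling standby_cleanup_candidates() with the Postgres_standby path and printing the result gives a list to review before removing anything by hand.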

Current NNMClusterHealthAgent output from nnmhealth.ovpl:

NNMi System Health Report

  Hostname: dcwipvmnnm001.XXX.XXX.XX.com
  Date: 2019-06-12 13:15:03.148
  Overall Status: Normal

  History:

    NNM Cluster Health Agent
    No Issues Detected

  System Details

  NNM Cluster Health Agent
    Active node: dcwipvmnnm001.XXX.XXX.XX.com
       State: ACTIVE_NNM_RUNNING
       Last file system scan at: 2019-06-12 13:09:29.404
       File transfer failed duration(minutes): N/A
    Secondary node: dcmipvmnnm001.XXX.XXX.XX.com
       State: STANDBY_READY
       Last file received at: 2019-06-12 13:09:30.364
       File transfer queue length: 0
       Average transfer rate: 24556.44 KB/s
    Automatic failover: DISABLED
    Embedded database stream:
       Last DB transaction ID on active: 0xcaada4d8=3400377560
       Last DB transaction ID received on standby: 0xffffffffffffffff=-1
       Transaction ID difference: 3400377561 transactions.
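
A note on the numbers: 0xffffffffffffffff is simply -1 when read as a signed 64-bit integer, and the reported difference matches "active minus standby". The small sketch below reproduces that arithmetic. The helper name is my own, and treating -1 as a "nothing received yet" placeholder is my assumption, not something I've found documented.

```python
def to_signed64(value):
    """Interpret an unsigned 64-bit integer as a two's-complement signed value."""
    value &= 0xFFFFFFFFFFFFFFFF
    return value - (1 << 64) if value >= (1 << 63) else value


active = to_signed64(0xCAADA4D8)
standby = to_signed64(0xFFFFFFFFFFFFFFFF)

# Difference as it appears in the report: active minus standby.
print(active, standby, active - standby)
# prints: 3400377560 -1 3400377561
```

So the huge "difference" is not a real backlog of transactions; it is just the active ID minus -1.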

Cluster health agent output early Saturday a.m.:

  NNMi System Health Report

  Hostname: dcwipvmnnm001.XXX.XXX.XX.com
  Date: 2019-06-08 04:05:04.167
  Overall Status: Minor

  History:
    [2019-06-08 02:15:21.609] 'NNM Cluster Health Agent' has changed status from 'Normal' to 'Minor'

  NNM Cluster Health Agent
    [Minor] Application Failover is enabled and the standby node is not in the cluster

  System Details

  NNM Cluster Health Agent
    Active node: dcwipvmnnm001.XXX.XXX.XX.com
       State: ACTIVE_NNM_RUNNING
       Last file system scan at: 2019-06-08 04:03:26.314
       File transfer failed duration(minutes): N/A
    Secondary node: dcmipvmnnm001.XXX.XXX.XX.com
       State: STANDBY_EXTRACT_DBZIP
       Last file received at: 2019-06-08 04:04:50.140
       File transfer queue length: 275
       Average transfer rate: 12914.37 KB/s
       Automatic failover: DISABLED
    Embedded database stream:
       Last DB transaction ID on active: 0x930ae000=2466963456
       Last DB transaction ID received on standby: 0xffffffffffffffff=-1
       Transaction ID difference: 2466963457 transactions.

Example of good NNMClusterHealthAgent output from nnmhealth.ovpl:

  NNMi System Health Report

    Hostname: dcwipvmnnm001.XXX.XXX.XX.com
    Date: 2019-06-08 02:05:02.467
    Overall Status: Normal

    History:

    NNM Cluster Health Agent
      No Issues Detected

    System Details

      NNM Cluster Health Agent
        Active node: dcwipvmnnm001.XXX.XXX.XX.com
           State: ACTIVE_NNM_RUNNING
           Last file system scan at: 2019-06-08 02:03:56.376
           File transfer failed duration(minutes): N/A
        Secondary node: dcmipvmnnm001.XXX.XXX.XX.com
           State: STANDBY_READY
           Last file received at: 2019-06-08 02:04:02.476
           File transfer queue length: 0
           Average transfer rate: 10193.74 KB/s
           Automatic failover: DISABLED
        Embedded database stream:
           Last DB transaction ID on active: 0x194c1a50=424417872
           Last DB transaction ID received on standby: 0x198f2398=428811160
           Transaction ID difference: -4393288 transactions.
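
To keep an eye on this without reading every report by hand, something like the following could scan saved report text for the two transaction ID lines. It is only a sketch based on the line format shown in the outputs above; the function names are hypothetical, and flagging -1 as "stalled" is my interpretation (it also appears briefly during a normal backup transfer, so a single hit is not conclusive).

```python
import re

# Matches e.g. "Last DB transaction ID on active: 0xcaada4d8=3400377560"
TXID_LINE = re.compile(
    r"Last DB transaction ID (on active|received on standby):"
    r"\s*0x[0-9a-fA-F]+=(-?\d+)"
)


def parse_txids(report_text):
    """Return {'active': int, 'standby': int} for the IDs found in the report."""
    ids = {}
    for label, value in TXID_LINE.findall(report_text):
        key = "active" if label == "on active" else "standby"
        ids[key] = int(value)
    return ids


def standby_stream_looks_stalled(report_text):
    """True when the standby ID is reported as -1 (assumed placeholder value)."""
    return parse_txids(report_text).get("standby") == -1
```

Feeding it the output of several consecutive nnmhealth.ovpl runs would show whether the standby value ever moves off -1.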
