

NNMi cluster health - postgres issue on standby?
NNMi is running on Linux with application fail-over clustering [NNMi v10.10 + Patch 8].
Has anyone else encountered this issue? Any tips or tricks? I'm concerned that if a fail-over to the standby system were needed, the database might be in an unusable state. For the time being, I'm running nnmbackupembdb.ovpl daily on the active node.
Since this past weekend, I've noticed that the output of nnmhealth.ovpl -print verbose -filter NNMClusterHealthAgent always shows the Last DB transaction ID received on standby as 0xffffffffffffffff=-1.
Prior to this past weekend, this was not always -1. I'm used to seeing "Last DB transaction ID received on standby: 0xffffffffffffffff=-1" when the active node does a back-up and sends it to the standby, and the standby nmsdbmgr then shuts down to update/replace the existing database. Neither the active nor the standby node has been rebooted since this started; however, there was a wide-area network outage late Friday evening, which is the last thing that occurred before this started happening. The active node reported that it lost connectivity to the cluster agent on the standby node.
FWIW, I attempted a restore of the embedded database from the active to the standby, but that didn't resolve it. I also attempted to clean up the Postgres_standby directory on the standby: I removed Postgres_standby/PostgresBackup.* and Postgres_standby/TxWALs_recv/* after shutting down nnmcluster on the standby [KM1156362]. I have not performed any actions on the active node, since that would require down-time. I don't see anything out of the ordinary in the pg_log log files on the standby, and the cluster log and nmsdbmgr log don't show any persistent or repeating errors, but it's possible that the interesting log entries have rotated out of view.
Current output from the nnmhealth.ovpl's NNMClusterHealthAgent:
NNMi System Health Report
Hostname: dcwipvmnnm001.XXX.XXX.XX.com
Date: 2019-06-12 13:15:03.148
Overall Status: Normal
History:
NNM Cluster Health Agent
No Issues Detected
System Details
NNM Cluster Health Agent
Active node: dcwipvmnnm001.XXX.XXX.XX.com
State: ACTIVE_NNM_RUNNING
Last file system scan at: 2019-06-12 13:09:29.404
File transfer failed duration(minutes): N/A
Secondary node: dcmipvmnnm001.XXX.XXX.XX.com
State: STANDBY_READY
Last file received at: 2019-06-12 13:09:30.364
File transfer queue length: 0
Average transfer rate: 24556.44 KB/s
Automatic failover: DISABLED
Embedded database stream:
Last DB transaction ID on active: 0xcaada4d8=3400377560
Last DB transaction ID received on standby: 0xffffffffffffffff=-1
Transaction ID difference: 3400377561 transactions.
Cluster health agent output early Saturday a.m.:
NNMi System Health Report
Hostname: dcwipvmnnm001.XXX.XXX.XX.com
Date: 2019-06-08 04:05:04.167
Overall Status: Minor
History:
[2019-06-08 02:15:21.609] 'NNM Cluster Health Agent' has changed status from 'Normal' to 'Minor'
NNM Cluster Health Agent
[Minor] Application Failover is enabled and the standby node is not in the cluster
System Details
NNM Cluster Health Agent
Active node: dcwipvmnnm001.XXX.XXX.XX.com
State: ACTIVE_NNM_RUNNING
Last file system scan at: 2019-06-08 04:03:26.314
File transfer failed duration(minutes): N/A
Secondary node: dcmipvmnnm001.XXX.XXX.XX.com
State: STANDBY_EXTRACT_DBZIP
Last file received at: 2019-06-08 04:04:50.140
File transfer queue length: 275
Average transfer rate: 12914.37 KB/s
Automatic failover: DISABLED
Embedded database stream:
Last DB transaction ID on active: 0x930ae000=2466963456
Last DB transaction ID received on standby: 0xffffffffffffffff=-1
Transaction ID difference: 2466963457 transactions.
Example of good output from the nnmhealth.ovpl's NNMClusterHealthAgent:
NNMi System Health Report
Hostname: dcwipvmnnm001.XXX.XXX.XX.com
Date: 2019-06-08 02:05:02.467
Overall Status: Normal
History:
NNM Cluster Health Agent
No Issues Detected
System Details
NNM Cluster Health Agent
Active node: dcwipvmnnm001.XXX.XXX.XX.com
State: ACTIVE_NNM_RUNNING
Last file system scan at: 2019-06-08 02:03:56.376
File transfer failed duration(minutes): N/A
Secondary node: dcmipvmnnm001.XXX.XXX.XX.com
State: STANDBY_READY
Last file received at: 2019-06-08 02:04:02.476
File transfer queue length: 0
Average transfer rate: 10193.74 KB/s
Automatic failover: DISABLED
Embedded database stream:
Last DB transaction ID on active: 0x194c1a50=424417872
Last DB transaction ID received on standby: 0x198f2398=428811160
Transaction ID difference: -4393288 transactions.
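
In case it helps anyone who wants to script a check for this, here is a rough Python sketch that pulls the two transaction ID lines out of saved nnmhealth.ovpl output and flags the -1 case. It is not an NNMi tool. The function names (to_signed64, check_db_stream) are made up for illustration, and it assumes two things inferred only from the three reports above: that 0xffffffffffffffff is a 64-bit value printed as signed (hence -1), and that the reported difference is the active ID minus the standby ID. Treating -1 as "standby has no transaction ID" is my reading of the output, not documented behavior.

```python
import re

# Patterns match the two "Embedded database stream" lines shown in the reports above.
ACTIVE_RE = re.compile(r"Last DB transaction ID on active:\s*(0x[0-9a-fA-F]+)")
STANDBY_RE = re.compile(r"Last DB transaction ID received on standby:\s*(0x[0-9a-fA-F]+)")


def to_signed64(hex_text):
    """Interpret a hex string as a signed 64-bit integer (two's complement)."""
    value = int(hex_text, 16)
    return value - (1 << 64) if value >= (1 << 63) else value


def check_db_stream(report):
    """Extract both transaction IDs from a health report and compare them.

    Assumption: difference = active - standby, inferred from the sample output.
    Assumption: a standby ID of -1 means no transaction ID has been received.
    """
    active = ACTIVE_RE.search(report)
    standby = STANDBY_RE.search(report)
    if not active or not standby:
        raise ValueError("transaction ID lines not found in report")
    active_id = to_signed64(active.group(1))
    standby_id = to_signed64(standby.group(1))
    return {
        "active": active_id,
        "standby": standby_id,
        "difference": active_id - standby_id,
        "standby_has_no_id": standby_id == -1,
    }


if __name__ == "__main__":
    sample = (
        "Last DB transaction ID on active: 0xcaada4d8=3400377560\n"
        "Last DB transaction ID received on standby: 0xffffffffffffffff=-1\n"
    )
    print(check_db_stream(sample))
```

Feeding it the saved report text (for example, the output redirected to a file and read back in) gives a dictionary you could alert on, e.g. when standby_has_no_id stays true across several consecutive runs rather than only during a database transfer.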