Flexible Hot standby which fits into GNM - OMi/OBM

At present, NOM offers two redundancy concepts: Application Failover (AF) and High Availability (HA).

Application Failover (AF), with prominent pros (+) and cons (-):
+ Syncs automatically.
+ UUIDs are kept in sync.
+ Reforming an AF cluster forces a sync of the cluster members.
- Only one NOM instance is active at a time.
- Opening/reforming the cluster is required for functional testing of config changes, hotfixes, patches, etc.
- AF is Cold Standby only; failing over can take up to 30 minutes.
= No option for more demanding setups.

High Availability (HA), with prominent pros (+) and cons (-):
+ Convenient failover handled by the HA cluster vendor.
+ Upgrading and patching without dismantling the HA setup.
- Only one database on a shared disk - if it gets corrupted => production down.
- HA is Warm Standby at best; failing over can take up to 30 minutes.
= No option for more demanding setups.

Larger installations call for a Global Network Management (GNM) setup with several Regional Managers (RMs) reporting to a Global Manager (GM):
+ Regional Manager (RM) UUIDs are nicely transferred to the Global Manager (GM).
+ The RM-GM relation is n:1 or n:2 (since at least 10.40).
- No adequate Hot/Hot on the GM level; same limitations as listed above.

On top of this, NOM nowadays is no longer stand-alone - it is one part of a much larger setup.
A frequently seen example:
Frequently_seen_example.png
What is actually required:
(1)  Hot/Hot mode on all levels, RM and GM.
(2)  Fully flexible fail-over of only the failed system.
(3)  NO (or near-zero) loss of Management Events/SNMP traps on northbound (OMi/OBM or other) after failover or failback.
(4)  NO (or near-zero) duplicate Management Events/SNMP traps after failover or failback.
(5)  NO outdated (e.g. buffered) Management Events/SNMP traps after failback.
(6)  Splitting and reforming a Hot/Hot pair should be comfortable, for installing patches, hotfixes, and new releases.
(7)  Configurable sync delay for topology updates.
(8)  In 'Very Large' installations: the RM-to-OMi/OBM trap/syslog stream needs to bypass the GM due to its volume.
1.Desired_redundancy.png
How to get to a flexible Hot/Hot in GNM with 'northbound'?
Synchronized UUIDs for all topology objects would be a good basis; however, such synchronization seems to conflict with fully independent side-by-side operation.
Below is an idea for balancing both:

  • New: Some mechanism for sync'ing only UUIDs between RMs, let's call it 'UUIDsync'
    (similar to the 'push' in AF or the inform/pull-like mechanism in GM-RM), to force identical UUIDs in RM pairs.
  • New: Some mechanism to form, split, and re-form a 'UUIDsync' pair.
  • New: A mechanism to coordinate which GM is sending management incidents northbound to e.g. OMi/OBM (as in HA).

  o Available: Each RM may forward information to a supported limit of two GMs (at least since 10.40).
  o Available: GNM already includes a highly efficient pull-like mechanism for requesting updates from RMs.
  o Available: GNM already has a robust mechanism for sync'ing UUIDs between RM and GM.
  o Available: Some mechanism to re-evaluate changes in a failed RM#b during the last polling cycle (NetworkOutageStateSyncTask) - goal: prevent missed incidents.
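To make the 'UUIDsync' idea concrete: no such mechanism exists in NOM today, and nothing below is NOM API. This is a minimal, purely illustrative Python sketch, assuming hypothetical RM objects that hold a topology table keyed by node name, where only the discovering side mints UUIDs and pushes them to its peer:

```python
import uuid

class RegionalManager:
    """Toy model of an RM topology table (node name -> UUID). Hypothetical, not NOM API."""
    def __init__(self, name):
        self.name = name
        self.topology = {}  # node name -> UUID string

    def discover(self, node_names):
        # Only the discovering RM (the RM#a of a pair) mints new UUIDs.
        for node in node_names:
            self.topology.setdefault(node, str(uuid.uuid4()))

    def uuid_sync_push(self, peer):
        # 'UUIDsync': push this RM's UUIDs to the peer (the RM#b),
        # forcing identical UUIDs in the pair - similar in spirit
        # to the push in AF or the inform/pull in GM-RM.
        peer.topology.update(self.topology)

rm1a = RegionalManager("RM1a")
rm1b = RegionalManager("RM1b")
rm1a.discover(["router-1", "switch-7"])
rm1a.uuid_sync_push(rm1b)
assert rm1a.topology == rm1b.topology  # identical UUIDs in the RM pair
```

The point of the sketch: identical UUIDs on both sides are what later allow either RM of the pair to report to either GM without creating duplicate topology objects.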

Activities required:

  • Fail-over/fail-back between RMs for discovering topology
  • Fail-over/fail-back of an RM reporting to one GM or to both
  • For 'Very Large': fail-over/fail-back of the RM-to-OMi/OBM trap/syslog stream
  • Fail-over/fail-back of a GM sending Management incidents northbound

Achievements and limitations:
+ Very flexible
+ Very fast fail-over expected
+ Resilient against multiple failed systems
+ Minimal fail-over activity
+ Fully Hot/Hot with respect to monitoring, root cause analysis, incident creation, etc.
-  Only limitedly Hot/Hot with respect to topology, because RM#b's topology is derived from RM#a's topology
   (hence requirement (7) above, as a safeguard; any better idea for getting around this limitation is welcome!)

Normal operation
2.Normal_operation.gif

RMs run as Hot/Hot pairs (a and b), and each does monitoring, querying network devices, root cause analysis, etc.
>  Only the RM#a systems perform (re-)Discovery to update topology.
>  The RM#b systems receive topology updates by sync ('UUIDsync') from their RM#a peers.
>  RM#a systems report to GMa; RM#b systems report to GMb.
The GMs run as a Hot/Hot pair (no extra sync between them); their sync'ed UUIDs come from their respective RMs.
GMa sends Management Incidents northbound to e.g. OMi/OBM.
>  In 'Very Large': the RM#a systems handle the RM-to-OMi/OBM trap/syslog stream.
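The normal-operation role split can be written down as a small invariant check. Again purely illustrative Python with hypothetical names taken from the figures (RM1a, GMa, ...); the invariant is simply "exactly one RM per pair discovers, exactly one GM talks northbound":

```python
# Hypothetical role table for normal operation of one RM pair plus the GM pair.
ROLES = {
    "RM1a": {"discovery": True,  "reports_to": ["GMa"], "trap_stream": True},
    "RM1b": {"discovery": False, "reports_to": ["GMb"], "trap_stream": False},
    "GMa":  {"northbound": True},
    "GMb":  {"northbound": False},
}

def invariant_ok(roles):
    # Exactly one RM of the pair does (re-)Discovery, and exactly one GM
    # sends Management Incidents northbound - otherwise we would get
    # conflicting topologies or duplicate incidents in OMi/OBM.
    one_discovers = roles["RM1a"]["discovery"] != roles["RM1b"]["discovery"]
    one_northbound = roles["GMa"]["northbound"] != roles["GMb"]["northbound"]
    return one_discovers and one_northbound

assert invariant_ok(ROLES)
```

Every fail-over scenario below is then just a role swap that must keep this invariant intact.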

Fail-over scenarios
3.Fail-over_scenarios.gif

(1) In scenario 'RM1a fails':
>  RM1b picks up (re-)Discovery for updating topology (per schedule and by ChangeDetection).
>  RM1b now also reports to GMa (in addition to reporting to GMb as before).
>  In 'Very Large': RM1b takes over the RM-to-OMi/OBM trap/syslog stream.
>  RM1b is re-evaluated with respect to changes during the last polling cycle.
o  No further change on the GM level.
o  No impact on 'northbound'.
>  Fail-back: RM1a catches up on changes in the network, and once done,
     roles are swapped back: RM1a reports to GMa, RM1b stops the additional reporting to GMa,
     and in 'Very Large': RM1a takes back the RM-to-OMi/OBM trap/syslog stream.

(2) In scenario 'GMa fails':
>  GMb picks up sending Management incidents northbound to e.g. OMi/OBM (all RM#b systems already report to it).
>  GMa needs to be stopped from sending 'northbound' (e.g. if GMa is still 'half-alive').
>  GMb is re-evaluated with respect to changes during the last polling cycle.
o  No change on the RM level (RM#a systems continue discovering topology changes, which continue to be UUIDsync'ed to RM#b).
o  Unchanged in 'Very Large': RM2a still handles the RM-to-OMi/OBM trap/syslog stream.
>  Fail-back: GMa catches up on updates from RM#a, and once done, roles are swapped back:
     GMa starts, then GMb stops talking 'northbound' (this order may help avoid missing incidents).

(3) In scenario 'GMa is failed-over, and now, on top, RM2b fails':
o  RM2a continues with (re-)Discovery as before.
>  RM2a now reports to GMb (GMa is failed in this scenario). This should be doable because the UUIDs are identical.
o  Unchanged in 'Very Large': RM2a still handles the RM-to-OMi/OBM trap/syslog stream.
o  No (further) impact on 'northbound'.
>  Fail-back if RM2b is fixed first (GMa still failed):
     RM2b catches up on changes sent to it via 'UUIDsync', and once done,
     RM2b reports to GMb, and RM2a stops reporting to GMb.
>  Fail-back if GMa is fixed first (RM2b still failed):
     GMa catches up on updates from RM#a, and once that is done,
     RM2a starts reporting to GMa and stops reporting to GMb.
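Scenario (1) and its fail-back can be sketched as two role swaps over the normal-operation role table. As before, this is illustrative Python with hypothetical names, not anything NOM provides; it only shows that during the whole swap GMa keeps being fed and the trap stream keeps an owner:

```python
# Hypothetical simulation of scenario (1), 'RM1a fails', and the later fail-back.

def fail_rm1a(roles):
    """RM1b takes over discovery, additionally reports to GMa, and owns the trap stream."""
    roles["RM1a"].update({"up": False, "discovery": False,
                          "reports_to": [], "trap_stream": False})
    roles["RM1b"].update({"discovery": True,
                          "reports_to": ["GMa", "GMb"],  # now feeds both GMs
                          "trap_stream": True})

def fail_back_rm1a(roles):
    """After RM1a has caught up on network changes: swap roles back."""
    roles["RM1a"].update({"up": True, "discovery": True,
                          "reports_to": ["GMa"], "trap_stream": True})
    roles["RM1b"].update({"discovery": False,
                          "reports_to": ["GMb"], "trap_stream": False})

roles = {
    "RM1a": {"up": True, "discovery": True,  "reports_to": ["GMa"], "trap_stream": True},
    "RM1b": {"up": True, "discovery": False, "reports_to": ["GMb"], "trap_stream": False},
}
fail_rm1a(roles)
assert "GMa" in roles["RM1b"]["reports_to"]   # GMa is still fed
assert roles["RM1b"]["trap_stream"]           # trap stream has an owner
fail_back_rm1a(roles)
assert roles["RM1a"]["reports_to"] == ["GMa"] # back to normal operation
```

Scenarios (2) and (3) would be analogous swaps of the `northbound` role on the GM side, which is why the identical UUIDs from 'UUIDsync' are the precondition for all of them.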


Existing idea with a similar topic:
https://community.microfocus.com/t5/NOM-Idea-Exchange/NA-NOM-should-be-able-to-be-run-in-Hot-Hot-mode/idi-p/1768454
