Idea ID: 1642238

Primary HeartBeat - Agent Ping Monitoring from OMi

Status : Under Consideration
over 3 years ago

Agent health monitoring in OMi is not reactive enough to outages for us, and we needed to add an extra layer of monitoring for agents (called primary heartbeat in HPOM): basically a simple ping on port 383 every 5 minutes.

We could script a ping monitor or use SiteScope with automatic template deployment, but it is a shame to have to do it ourselves when OMi could do it OOTB.
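
For reference, the scripted option could look something like the minimal Python sketch below. It only tests whether a TCP connection to port 383 can be opened, so it is a port check and not a full agent-level health check. The function names, the 5-second timeout and the host names are assumptions for illustration, and running it every 5 minutes is left to cron or a scheduled task.

```python
import socket

AGENT_PORT = 383  # port named in this post; everything else here is illustrative


def agent_port_open(host: str, port: int = AGENT_PORT, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def check_nodes(hosts, port: int = AGENT_PORT, timeout: float = 5.0) -> dict:
    """Check every host once and return a {host: reachable} mapping."""
    return {host: agent_port_open(host, port, timeout) for host in hosts}
```

Calling check_nodes(["node1.example.com", "node2.example.com"]) with such hypothetical host names would return a dictionary of True/False values that a wrapper could turn into events.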

thanks,

Anthony

Labels:

Self-health
Agents

  • Michael,
    I agree we can create many things with the toolset available, but I confirm that the request I made 3 years ago is still NOT available out of the box.
    OBM waiting for a missing heartbeat before it reacts is very different from OBM actively checking all agents every X minutes, as in an HPOM environment.
    As a workaround, I built SiteScope monitoring on port 383 with automatic template deployment, but it is a miss that this is not out of the box in OBM.
    rgds,
    Anthony

  • This is out-of-the-box behavior in the Monitored Nodes applet of OBM and has been for several years. By default, every X minutes, the OBM Management Servers run a Health Check on the nodes, and the Managed Nodes send bi-directional "keep alive" messages back to the OBM Management Servers. This is the equivalent of bbcutil -ping <Managed Node> and bbcutil -ping <OBM Gateway Server> and/or bbcutil -ping <OBM Data Processing Server>.

    On top of that, you can create schedule policies for problematic nodes, as well as OBM tools that execute the same commands at a specific time interval based on the importance of the server. That said, a service-oriented approach should be used, with the appropriate KPIs, HIs and Business Rules weighted to show a service performance degradation of a certain degree, rather than "a resource is up or down," which is a rather ancient way of thinking, frankly.

    All the Health Check does is check whether the OBM Management Servers can reach the bbc control sub-process of the Operations Agent (OA), i.e., an "application layer ping" on TCP port 383. That's why I usually create a multiplicity of checks, including this heartbeat check, as part of an approach I call "the better availability." Why? I've seen occasions where blue-screened Windows servers responded to ping. The default heartbeat check runs once every 30 minutes. Does one missed heartbeat check mean a server that has the OA on it is definitely down? NO WAY.

    Therefore, several other checks can be created that leverage OBM policies, tools, and automatic actions after enabling internal messaging (there's a nifty Content Pack (CP) on the ITOM Marketplace that does that for you: OA content pack for internal message filtering | ITOM Marketplace (microfocus.com)). SBEC/TBEC rules can then be created to take all of these OBM and OA events and responses (or lack of them), along with events and similar tool executions from other element managers, such as APM/BPM VuGen synthetic transactions, SIEM data from ArcSight/Splunk, other 3rd-party, non-framework tools like Oracle Enterprise Manager (OEM) and Tenable Nessus/Tenable.io, events from Azure/AWS APIs like CloudWatch, log analytics, etc.

    At that point, we comfortably cross the finish line of one of my most critical missions: creating what I would term "The Better Availability." This is a real measurement of Service Availability that can have KPIs, HIs, etc. associated with it (weighted accordingly).
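
To make the weighting idea concrete, here is a minimal Python sketch of combining several check results into one score, and of requiring more than one missed heartbeat before treating a node as suspect. It is not how OBM calculates KPIs or HIs; the check names, weights and the two-miss threshold are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str      # illustrative labels such as "heartbeat" or "synthetic_txn"
    passed: bool
    weight: float  # relative importance; the values used are made up


def availability_score(results) -> float:
    """Weighted fraction of passing checks, between 0.0 and 1.0."""
    total = sum(r.weight for r in results)
    if total <= 0:
        raise ValueError("at least one check with a positive weight is required")
    return sum(r.weight for r in results if r.passed) / total


def node_suspect(heartbeat_history, misses_required: int = 2) -> bool:
    """True only if the most recent `misses_required` heartbeats were all missed."""
    recent = heartbeat_history[-misses_required:]
    return len(recent) == misses_required and not any(recent)
```

With a missed heartbeat of weight 1.0, a passing ICMP ping of weight 1.0 and a passing synthetic transaction of weight 2.0, availability_score gives 0.75, so a single missed heartbeat lowers the score without declaring the server down.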

  • I read the following in the OBM 2020.05 release notes:

    • opr-node-hc-config: The opr-node-hc-config CLI has introduced a new parameter that determines the state and time when the last heartbeat was received for a specified node. For more information, see the opr-node-hc-config Command-Line interface.

    Could this functionality be made available in earlier versions of OBM?

    With a little tweak, this could be a solution to the problem.

  • We need this feature.

    Any update?