Hello,
I have been having an issue since we migrated our OpsBridge to an external k8s cluster (version 24.1).
I'm receiving a lot of agent health events every day (roughly 50-100 events per day across about 800 nodes).
- The event is the "classic" agent health issue:
    Agent health problem
    The agent did not send events during the last 35 minutes
    No additional information is available
    Connectivity Status: Disconnected
- All agents are configured with the default health check settings (Agent & Server, heartbeat interval 30 minutes, heartbeat grace period 5 minutes).
- I have upgraded all my agents to version 12.25.006.
What's weird is that we receive these alerts, yet we can still perform actions on these nodes (like executing tools), and on the agent all ovc services are up and running.
One of my agents currently has the event open:
ovc -status
    agtrep     OV Discovery Agent          AGENT,AgtRep  (36182)   Running
    hpsensor   Compute Sensor              AGENT,OA      (36334)   Running
    oacore     Operations Agent Core       AGENT,OA      (36308)   Running
    oahmon     Agent Health Monitor        AGENT,EA      (241346)  Running
    ompolparm  OM Parameter Handler        AGENT,EA      (36156)   Running
    opcacta    OVO Action Agent            AGENT,EA      (36264)   Running
    opcgeni    Generic Source Interceptor  AGENT,EA      (36109)   Running
    opcle      OVO Logfile Encapsulator    AGENT,EA      (36253)   Running
    opcmona    OVO Monitor Agent           AGENT,EA      (36205)   Running
    opcmsga    OVO Message Agent           AGENT,EA      (36281)   Running
    opcmsgi    OVO Message Interceptor     AGENT,EA      (36195)   Running
    ovbbccb    OV Communication Broker     CORE          (241052)  Running
    ovcd       OV Control                  CORE          (241043)  Running
    ovconfd    OV Config and Deploy        COREXT        (241080)  Running
I found some issues in system.txt with agent <-> OpsBridge communication (which I didn't have before on version 2022 with CDF):
0: WRN: Thu Sep 19 07:08:41 2024: opcmsga (36281/139952380274496): [genmsga.c:9931]: Forwarding message/action response to OVO message receiver failed due to server failure : (bbc-422) HttpOutputRequestImpl::ReceiveResponse() caught OvXplNet::ConnectionRefusedException_t. <null> . (OpC30-36) There is no server process active for address: https://[MYSERVEROPSBRIDGE]:383/com.hp.ov.opc.msgr/rpc/.
0: INF: Thu Sep 19 07:08:42 2024: opcmsga (36281/139952380274496): [genmsga.c:7342]: Message Agent is not buffering. (OpC30-100)
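In case it helps to correlate these warnings with the health events, here is a rough sketch of how the refused connections could be counted per hour. It assumes the entry layout shown in the excerpt above (a severity of INF, WRN or ERR followed by a timestamp and the process name); the function name `refused_per_hour` is hypothetical and not part of any agent tooling.

```python
import re
from collections import Counter

# Assumed entry header, based only on the log excerpt above:
# "0: WRN: Thu Sep 19 07:08:41 2024: opcmsga (pid/tid): ..."
ENTRY = re.compile(
    r"\d+: (INF|WRN|ERR): \w{3} (\w{3}) +(\d{1,2}) (\d{2}):\d{2}:\d{2} (\d{4}): (\w+) "
)

def refused_per_hour(text):
    """Count opcmsga entries mentioning ConnectionRefusedException_t, per hour."""
    counts = Counter()
    headers = list(ENTRY.finditer(text))
    for i, match in enumerate(headers):
        # An entry runs from its header to the start of the next header.
        end = headers[i + 1].start() if i + 1 < len(headers) else len(text)
        body = text[match.start():end]
        _severity, month, day, hour, year, process = match.groups()
        if process == "opcmsga" and "ConnectionRefusedException_t" in body:
            counts[f"{year} {month} {int(day):02d} {hour}:00"] += 1
    return counts
```

Running this over the contents of the log file on an affected node would show whether the refusals cluster at particular times of day or are spread evenly.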
From my agent I can run bbcutil -ping MYSERVEROPSBRIDGE repeatedly and I always get an answer, so everything seems to be working fine.
I have an L1 support team executing tools all day long to restart the agents, after which the alert disappears.
Can anyone help me with this? (I also have a case open with support, but to be honest I don't have any workaround so far, and this community seems more active for this kind of issue.)
Thanks :) !