Agent Health Problem: The agent did not send events during the last 35 minutes, but the agent is still working fine.

Hello, 

I have been having an issue since we migrated our OpsBridge to external Kubernetes (version 24.1).

I'm receiving a lot of agent health events every day (roughly 50-100 events per day for around 800 nodes).

- The event is the "classic" agent health issue:

         Agent health problem
         The agent did not send events during the last 35 minutes
         No additional information is available

         Connectivity Status: Disconnected

- All agents are configured with the default health check settings (Agent & Server, heartbeat interval 30 minutes, heartbeat grace period 5 minutes); a quick way to confirm what was actually deployed on an agent is shown right after this list.


- I have upgraded all my agents to version 12.25.006.
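
To confirm what was actually deployed on an agent, I just grep the full configuration dump, since the exact parameter names can differ between agent versions (and if nothing heartbeat-related shows up locally, the health check is most likely evaluated purely on the server side):

ovconfget | grep -i heartbeat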

What's weird is that we receive this kind of alert even though we can still perform actions on these nodes (like executing tools), and on the agent all ovc services are up and running.

One of my agents currently has the event open:

ovc -status
agtrep OV Discovery Agent AGENT,AgtRep (36182) Running
hpsensor Compute Sensor AGENT,OA (36334) Running
oacore Operations Agent Core AGENT,OA (36308) Running
oahmon Agent Health Monitor AGENT,EA (241346) Running
ompolparm OM Parameter Handler AGENT,EA (36156) Running
opcacta OVO Action Agent AGENT,EA (36264) Running
opcgeni Generic Source Interceptor AGENT,EA (36109) Running
opcle OVO Logfile Encapsulator AGENT,EA (36253) Running
opcmona OVO Monitor Agent AGENT,EA (36205) Running
opcmsga OVO Message Agent AGENT,EA (36281) Running
opcmsgi OVO Message Interceptor AGENT,EA (36195) Running
ovbbccb OV Communication Broker CORE (241052) Running
ovcd OV Control CORE (241043) Running
ovconfd OV Config and Deploy COREXT (241080) Running

I've found in System.txt some communication issues between the agent and OpsBridge (which I didn't have before on version 2022 with CDF):

0: WRN: Thu Sep 19 07:08:41 2024: opcmsga (36281/139952380274496): [genmsga.c:9931]: Forwarding message/action response to OVO message receiver failed due to server failure : (bbc-422) HttpOutputRequestImpl::ReceiveResponse() caught OvXplNet::ConnectionRefusedException_t. <null>. (OpC30-36)
There is no server process active for address: https://[MYSERVEROPSBRIDGE]:383/com.hp.ov.opc.msgr/rpc/.

0: INF: Thu Sep 19 07:08:42 2024: opcmsga (36281/139952380274496): [genmsga.c:7342]: Message Agent is not buffering. (OpC30-100)

From my agent I can run bbcutil -ping MYSERVEROPSBRIDGE, and even if I spam it I always get an answer; everything seems to be working fine.
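
One extra check that may be worth doing: since the System.txt warning points at the /com.hp.ov.opc.msgr/rpc/ address specifically, bbcutil -ping can (as far as I know) also be given that full URL, which tests the registered message receiver service itself rather than only the communication broker on port 383:

bbcutil -ping https://[MYSERVEROPSBRIDGE]:383/com.hp.ov.opc.msgr/rpc/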

My L1 support team spends all day executing tools that restart the agents, after which the alerts disappear.
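
For context, the restart tool is essentially the standard agent restart along these lines (restarting only the message agent instead of all processes might be enough, but I have not verified that):

# restart all agent processes
ovc -restart

# or possibly just the message agent
ovc -restart opcmsga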

Can anyone help me with this? (I also have a case open with support, but to be honest I still don't have any workaround, and this community seems more active for this kind of issue.)


Thanks :) ! 

  • 0  

    Hello Raphael,

    If you get several such events for multiple agents, then the problem is very likely on the receiver end, not on the agent(s).

    I would check if wde is low on memory, and if that's the case increase memory allocation for wde if possible.

    Check /opt/HP/BSM/log/wde/jvm_statistics.log in your omi-0 and omi-1 pods.

    If there are times when the free heap memory is 0 or close to 0, you will need to increase the memory for wde.
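
    A rough way to scan for that (assuming the HEAP lines in jvm_statistics.log carry a "FREE: <MB>" field at the end, as they usually do):

    # print the lowest free-heap values recorded in the log
    grep ' HEAP - ' /opt/HP/BSM/log/wde/jvm_statistics.log | sed 's/.*FREE: //; s/].*//' | sort -n | head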

    Best regards,

    Tobias

  • 0 in reply to   

    Hello, 

    Yes, you are right, I think it's more on the server side!

    I have checked as you requested:

    On both OMI pods, HEAP FREE stays between 280 and 500 and NON-HEAP is always at 498.7:

    2024-09-19 08:18:11,625 INFO  - HEAP - [USED: 408.2, COMMITTED: 866.1, MAX: 866.1, FREE: 457.8];

    Both OMI pods:

    omiuser@omi-0:/> grep Xmx /opt/HP/BSM/conf/OPR-SCRIPTING-HOST_vm_params.ini
    -Xms1024m -Xmx1024m -XX:MaxMetaspaceSize=256m

    omiuser@omi-0:/> grep Xmx /opt/HP/BSM/conf/OPR_vm_params.ini
    -Xms3072m -Xmx3072m -XX:MaxMetaspaceSize=256m



    Do you think that is enough?

  • 0   in reply to 

    Hello Raphael,

    866 M in general is on the low side. But if FREE memory doesn't go below 280 M, then that's not the problem.

    As a side note, the memory settings for wde are in this file:

    > grep Xmx /opt/HP/BSM/conf/wde_vm_params.ini
    -Xms896m -Xmx896m -XX:MaxMetaspaceSize=256m

    You could check if there are any errors in opr-gateway.log

    grep ERROR /opt/HP/BSM/log/wde/opr-gateway.log

    Best regards,

    Tobias

  • 0 in reply to   

    Oops 

    Yes, all good (I will increase it later, thanks for the info):

    grep Xmx /opt/HP/BSM/conf/wde_vm_params.ini
    -Xms896m -Xmx896m -XX:MaxMetaspaceSize=256m



    It seems like the issue shows up in opr-gateway.log: I don't have any ERROR entries on omi-0, but on omi-1 there are a lot:

    2024-09-19 12:06:46,335 [EventSyncThread:itom-opsbridge-des-svc:MYOPSBRIDGE_239824] ERROR EventSyncForward.logForward(527) - Event forward request has expired for node itom-opsbridge-des-svc. Deleting request from queue for event with ID: 0a9c3a60-761b-71ef-158d-0a3603440000


    2024-09-19 12:06:46,335 [EventSyncThread:itom-opsbridge-des-svc:MYOPSBRIDGE_239824] ERROR EventSyncForward.logForward(527) - Event forward request has expired for node itom-opsbridge-des-svc. Deleting request from queue for event with ID: 0aab5590-761b-71ef-158d-0a3603440000



    2024-09-19 12:07:19,391 [EventSyncThread:itom-opsbridge-des-svc:MYOPSBRIDGE_239824] ERROR EventSyncUpdate.logUpdate(538) - Event update request has expired for node itom-opsbridge-des-svc. Deleting request from queue for event with ID: 1ca4c650-761b-71ef-158d-0a3603440000

    2024-09-19 12:07:19,391 [EventSyncThread:itom-opsbridge-des-svc:MYOPSBRIDGE_239824] ERROR EventSyncUpdate.logUpdate(538) - Event update request has expired for node itom-opsbridge-des-svc. Deleting request from queue for event with ID: 1ca4c650-761b-71ef-158d-0a3603440000

    There are 2 kinds of ERROR:

    - EventSyncUpdate.logUpdate(538) 

    - EventSyncForward.logForward(527)
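
    To see how the two kinds break down, a grep like this over the same log works (it just counts the patterns visible in the lines above):

    grep ERROR /opt/HP/BSM/log/wde/opr-gateway.log | grep -oE 'EventSync(Forward|Update)\.log(Forward|Update)\([0-9]+\)' | sort | uniq -c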

  • 0   in reply to 

    Hello Raphael,

    The EventSync errors are from forwarding to another target (or in this case, another component of the OpsBridge suite).

    That is independent of receiving events from agents and wouldn't explain the issue.

    I think it would be best to open a support case so somebody can have a closer look.

    Best regards,

    Tobias

  • 0 in reply to   

    Ok, thanks for your help.

    Yes, a case has been open since 20 August; no resolution for the moment ... to be continued ...

  • 0  

    If the agent is buffering, then checking the opcagt status should show that the agent is buffering to the management server; during that time you can run the bbcutil -ping command and see whether the server is reachable.

    I have seen a rare scenario where bbcutil was working fine but a random agent was still buffering to the OBM at certain intervals. We did the checks below:

    1) We checked the heartbeat log and found that there is a pattern to when it happens. For example: agent 1 will buffer today at 10:00 AM, and the same agent will be buffering tomorrow around the same time.

    2) Later we did some network packet analysis and found that network drops were happening at the network level on the OBM side (see the capture sketch at the end of this reply).

    Finally, the issue was fixed by bypassing the OBM IP at the network firewall level.

    It took a very long time to figure out because these drops are hard to detect.
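
    If you want to try the same kind of packet analysis, a capture along these lines (run on the agent, or on a host in front of OBM) during the time window suggested by the heartbeat pattern is a reasonable starting point; the interface, hostname, and port (383 is the default communication broker port) are assumptions you should adapt to your environment:

    # write agent <-> OBM traffic on the default BBC port to a file for later analysis (retransmissions, resets, etc.)
    tcpdump -i any -w agent_obm_capture.pcap host MYSERVEROPSBRIDGE and port 383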

  • 0

    Can you post one of the agent's logs, and check the pod logs of omi-1 and omi-0?

    Also check the cluster events for any issues there. Depending on which pod the agent is connecting to, there may be an issue in the upstream connection; a few example commands are sketched below.
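
    For example (assuming a standard kubectl setup and that the OMi pods run in a namespace called opsbridge, which is just a placeholder; add -c <container> if the pods have more than one container):

    # recent logs of the OMi pods
    kubectl logs omi-0 -n opsbridge --tail=200
    kubectl logs omi-1 -n opsbridge --tail=200

    # cluster events, oldest first, to spot restarts, failed probes, evictions, etc.
    kubectl get events -n opsbridge --sort-by=.metadata.creationTimestamp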