What is the best way to handle stuck threads in a discovery job?

Hello everyone,


uCMDB version: 9.05.CUP12.351

DDM Content Pack 11.08.802



This is a generic question about probes and their capabilities.

While running heavy jobs such as Host Networking by SNMP or the VLAN-related jobs, I check the probe JMX (Local_probeFQDN --> type=JobsInformation --> Description of viewJobsStatuses) for stuck threads, run duration, etc. Often I see that all 8 threads (the default) are stuck and remain stuck for hours on end.


When I edit the DiscoveryProbe.properties file and change settings such as the maximum number of threads, the maximum number of stuck threads, and how long the probe waits before restarting once that stuck-thread limit is reached, it doesn't seem to help. I would love to know how to achieve an efficient and fast discovery process.


Also, is there a way to fix these stuck threads so that the job continues without interruption, other than logging on to the probe box and restarting them?


Please share your thoughts.




  • Look for a pattern or trend. Is it always stuck on a certain host? Is it a particular job? Maybe it is stuck on a particular command? Find the root cause of the problem instead of changing system settings.

    The best approach is to first see which job always gets stuck, then which host or command it gets stuck on. Then figure out why it is stuck.
  • Hi Chuong,


    Alright. So, I followed your suggestion and found the following:

    1: The discovery job is Host Networking by SNMP.

    2: The threads run against a group of switches and end up with an In Progress (With error) message.

    3: Now I need to figure out whether some command is causing this. Where do we find that? I checked the communication log and see no errors, but the file ends with "Incomplete communication log...".


    I will have to start from scratch, run the job against one particular IP address, and watch exactly what happens with those switches.




  • Good places to start:


    Error logs (ucmdb and probe-error)

    Operation log

    Reconciliation log



    Run the job against one of the IPs known to have the problem and watch the logs. That should give you an idea of where the problem is.
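
If it helps, the "pick one problem IP and watch the logs" step can be roughly sketched in Python. This is only a minimal sketch, not anything shipped with uCMDB: the function names (`lines_mentioning`, `mention_counts`) and the sample log lines are hypothetical, and it assumes only that the logs are plain text with the IP address appearing somewhere in the relevant lines.

```python
import re
from collections import Counter


def lines_mentioning(ip, lines):
    """Return the log lines that mention exactly this IP address.

    Hypothetical helper: assumes plain-text log lines that contain the IP.
    """
    # The lookarounds keep 10.0.0.1 from also matching 10.0.0.10 or 110.0.0.1,
    # while still allowing a sentence-ending period after the address.
    pattern = re.compile(r"(?<!\d)(?<!\d\.)" + re.escape(ip) + r"(?!\.?\d)")
    return [line for line in lines if pattern.search(line)]


def mention_counts(ips, lines):
    """Count how many log lines mention each suspect IP."""
    return Counter({ip: len(lines_mentioning(ip, lines)) for ip in ips})


# Made-up sample lines, purely for illustration (not real probe log output).
sample = [
    "10:00:00 ERROR SNMP timeout on 10.0.0.1",
    "10:00:05 INFO  finished 10.0.0.10",
    "10:00:09 ERROR SNMP timeout on 10.0.0.1.",
]

if __name__ == "__main__":
    for line in lines_mentioning("10.0.0.1", sample):
        print(line)
    print(mention_counts(["10.0.0.1", "10.0.0.10"], sample))
```

To use it on real files, read each log into a list of lines (for example with `open(path).read().splitlines()`) and pass that list in place of `sample`. An IP that keeps showing up next to errors across several logs is a reasonable first candidate for a single-IP test run.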

