SiteScope shows "No data" for some monitors

Hello Sitescopers,

 

I am running SiteScope 11.20 on a Linux server, monitoring ~70 servers running Solaris 10 (x86).

In our setup, we use ssh to connect and run the monitors.

 

One of the problems we have with SiteScope is that it sometimes fails to get data for basic monitors like the CPU monitor or the Memory monitor, resulting in a "no data" failure.

 

When I investigate the connection, there is never anything wrong.

A manual ssh session from the Linux server that SiteScope is running on to the monitored server works perfectly.

(I have tuned the ssh timeout to 180s, as was recommended in a previous post.)

 

When the problem occurs, I cannot make it go away by re-running the monitor. (I am not sure it actually runs the monitor again; it seems to finish very quickly.) Sometimes I have been successful after a restart of SiS, but this seems like overkill.

 

When I have this problem with, for instance, the CPU monitor, other monitors are working fine.

 

One suspicion I have is that the ssh "Message Of The Day" (motd) could have an impact on some monitors.

We use a motd for legal reasons that is echoed to all (interactive) ssh connections, and I am not sure whether the motd output is messing up the data parsing for some monitors.

 

Any suggestions on what could cause this behaviour or how to troubleshoot it?

 

 

Best Regards,

 

Thomas

 

 

 

 

  • Hi Oliw

     

     

    I'd start with checking your logfiles (error.log and RunMonitor.log in $SITESCOPE_HOME/logs/). It could be that certain monitors are being skipped or are timing out; something like this as a start could potentially point it out:

    egrep -i 'timeout' error.log*

    egrep -i 'skip' error.log*

    egrep -i 'ssh' error.log*

    egrep -i 'timeout' RunMonitor.log*

    egrep -i 'skip' RunMonitor.log*

    egrep -i 'ssh' RunMonitor.log*

     

     

    Maybe have a look at your SiteScope's "Server Statistics" as well when you notice this happening again. It could just be that you're running too many monitors too frequently.

     

    I'd recommend upgrading to 11.21 as well. It won't necessarily fix this particular problem, but from personal experience 11.21 contains quite a few useful fixes (particularly around Match Expressions and Match Value Labels).

     

    If you spot anything out of the ordinary in your logfiles, post back and we'll take it from there.

     

     

    EDIT: Also, could you post a screenshot of your SiteScope's "Health" group of monitors when you notice this happening again? If anything's in a critical state at that point, it might narrow things down slightly.

     

    EDIT #2: Something else that you could try is enabling the "Verify Error" option on one or two monitors where you notice this happening and just seeing how it goes after that. This in essence just causes the monitor to run again in the event of the monitor entering a critical state (which is what I assume happens when you get the 'no data' situation). If you go for this though, just do it on one or two monitors for now as a test. I've had it before where overusing this option semi-kills a SiteScope instance.
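
The six egrep commands above can be folded into one loop that prints a hit count per log and keyword, which makes it easier to see at a glance where to dig. This is only a rough sketch: the function name `count_keyword_hits` is made up for illustration, and it assumes the rotated logs share the `error.log*` and `RunMonitor.log*` prefixes used in this thread.

```shell
# Count case-insensitive keyword hits across each log and its rotated copies.
# The directory argument is whatever holds error.log* and RunMonitor.log*.
count_keyword_hits() {
    log_dir=$1
    for log in error.log RunMonitor.log; do
        for word in timeout skip ssh; do
            # Concatenate the rotated files so one total is printed per pair;
            # a missing file simply yields a count of 0.
            hits=$(cat "$log_dir"/"$log"* 2>/dev/null | grep -ci "$word")
            printf '%s %s %s\n' "$log" "$word" "$hits"
        done
    done
}
```

It would be called as `count_keyword_hits "$SITESCOPE_HOME/logs"`. A count only says where matches exist; the matching lines still need to be read with the original egrep commands.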

  • Hi dcsbeemer,

     

    Thanks for your solid troubleshooting suggestions!

     

    I did as you suggested, but in reverse order. Attached is a screenshot of my SiS dashboard showing that I get no data from one of the servers, esekiux1023.

     

    So I grepped the logfiles as you suggested, but added the server name as well.

     

    [root@esekil4adm logs]# egrep -i '(timeout)' RunMonitor.log* | grep esekiux1023
    [root@esekil4adm logs]# egrep -i 'ssh' RunMonitor.log* | grep esekiux1023
    [root@esekil4adm logs]# egrep -i 'skip' RunMonitor.log* | grep esekiux1023
    [root@esekil4adm logs]# egrep -i 'skip' error.log* | grep esekiux1023
    [root@esekil4adm logs]# egrep -i 'ssh' error.log* | grep esekiux1023
    [root@esekil4adm logs]# egrep -i 'timeout' error.log* | grep esekiux1023

     

    Unfortunately, none of these commands capture anything.

    (If I remove the esekiux1023 part, I DO get lots of hits, but not for this problem.)

     

    To prove my point, I did a manual test run from the SiS server, running the "mpstat 2 2" command like SiteScope does:

     

     

    -sh-4.1$ ssh esekiux1023 'mpstat 2 2'

    CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
    0 805 0 4 2125 114 69 2 3 5 0 748 1 2 0 98
    1 535 0 15 40 24 85 2 4 4 0 1115 1 1 0 98
    2 33 0 7 85 20 100 0 1 2 0 168 0 0 0 100

    .

    .

     

    So, the connection to the system is obviously OK, but these monitors have been stuck on 'no data' all day.

    I wonder if they are running at all? Is there a way to check that?

     

    I have also attached a screenshot with ssh statistics. Is this what you asked for?

     

    I would love to upgrade to 11.21. We are currently evaluating SiteScope so I have a 60 day trial license.

    I looked for a trial version of 11.21, but 11.20 was the version I could find on the web.

    Is the update in the form of a patch? Or is it basically a new install?

     

    Thanks for your help!

     

     

    Kind Regards,

     

     

    Thomas Oliw
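
On the question of whether the stuck monitors are being run at all, one rough check is to look for the most recent log line that mentions the host, without filtering on a keyword. The sketch below assumes that the host name appears in the log lines for its monitors and that each line starts with a timestamp, as in the error.log excerpts later in this thread; `last_mention` is a hypothetical helper name.

```shell
# Print the most recent line mentioning a host in each readable log file.
last_mention() {
    host=$1
    shift
    for file in "$@"; do
        # Skip files that do not exist or cannot be read
        [ -r "$file" ] || continue
        # -F treats the host name as a fixed string rather than a regex
        line=$(grep -F -- "$host" "$file" | tail -n 1)
        if [ -n "$line" ]; then
            printf '%s: %s\n' "$file" "$line"
        else
            printf '%s: no entry for %s\n' "$file" "$host"
        fi
    done
}
```

For example: `last_mention esekiux1023 "$SITESCOPE_HOME"/logs/RunMonitor.log*`. If the newest timestamp is much older than the monitor's configured frequency, that would suggest the monitor is not being executed rather than failing on the remote side.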

     

     

  • You can try increasing the connection timeout value for your servers. Can you tell us whether these errors occur on specific monitors for specific servers, and how many monitors in total are configured for the server on which you are getting the error?

    Hanumant
  • Hi Oliw

     

     

    What does your remote server config look like more or less?

    We usually make sure that "SSH V2 only" is selected, but this might not have anything to do with your problem to be honest.

     

    The MOTD shouldn't make a difference either, I think; we have quite a few servers on this side with the same setup without it adversely affecting SiteScope in any way.

     

    What I'm thinking is if you notice another monitor returning 'no data', tail -f the RunMonitor.log file and manually run the monitor a couple of times, say 5 to 10 times in quick succession. Might as well tail -f the error.log file as well while we're at it. Something like this should do the trick:

    tail -f $SITESCOPE_HOME/logs/RunMonitor.log > $HOME/RunMonitor.log

     

    If you could post the resultant logfiles here afterwards, we'll have a look and see if anything stands out.
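
Since both RunMonitor.log and error.log are of interest, the capture can be run for both files in the background while the monitor is re-run from the GUI. The following is a sketch under assumptions: `start_capture` and `stop_capture` are illustrative names, and the `.capture` suffix is arbitrary.

```shell
# Start one background "tail -f" per log file, writing only new lines
# to <out_dir>/<logname>.capture. PIDs are remembered for stop_capture.
start_capture() {
    out_dir=$1
    shift
    CAPTURE_PIDS=""
    for file in "$@"; do
        # -n 0 skips existing content so only lines written from now on are kept
        tail -n 0 -f "$file" > "$out_dir/$(basename "$file").capture" &
        CAPTURE_PIDS="$CAPTURE_PIDS $!"
    done
}

# Stop the background captures started by start_capture.
stop_capture() {
    # Word splitting of the PID list is intentional here
    kill $CAPTURE_PIDS 2>/dev/null || true
    wait $CAPTURE_PIDS 2>/dev/null || true
    CAPTURE_PIDS=""
}
```

Typical use would be `start_capture "$HOME" "$SITESCOPE_HOME/logs/RunMonitor.log" "$SITESCOPE_HOME/logs/error.log"`, then running the monitor a few times, then `stop_capture`. Both functions must be called from the same shell session, because the PIDs are kept in a shell variable.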

  • Hi Hanumant,

     

    I have changed the connection timeout value to 60s, but I still have the problem.

    When I re-run the monitor, I immediately get a popup window that states "no data".

     

    My problem occurs on a few different servers. I would not call it random since I have seen this only on 4-5 servers out of ~70.

    I have seen it on the CPU, Memory and a Logfile monitor.

     

    However, when the problem occurs, it seems to be sticky. I can get rid of it sometimes by restarting SiteScope.

     

    Thanks for helping!

     

     

    Kind Regards,

     

    Thomas

  • Hi Oliw

     

     

    Had a bit of a look through the forums and found a thread where the issue was 'no data' being returned for certain Disk Space monitors. The solution there was to upgrade from 11.11 to 11.12.

     

    Not sure if you'll be able to access it, but I'd say give upgrading to 11.21 a go, available here.

  • Add this to master.config to have monitors use SSH V2 as first option:

    _sshV2First=true

    Also add this one for the intermittent "no data":

    _numberOfRepeatExecForSSHConnection=4

    Finally, make sure you're not using KSH; it is known for issues like this. Assign Bash as the default shell for the SiS user.
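
One cautious way to apply the two master.config settings above is to keep a copy of the original file and to replace a key if it already exists instead of adding a second line for it. This is a sketch only: `set_config_key` is a hypothetical helper, the `.orig` backup name is arbitrary, and it assumes plain `key=value` lines with no `|` character in the key or value.

```shell
# Set KEY=VALUE in a key=value config file, keeping the first original as FILE.orig.
set_config_key() {
    file=$1
    key=$2
    value=$3
    # Keep a single pristine backup; do not overwrite it on later calls
    [ -e "$file.orig" ] || cp -p "$file" "$file.orig" || return 1
    if grep -q "^${key}=" "$file"; then
        # Rewrite through a temp file because "sed -i" is not portable
        tmp_file="$file.tmp.$$"
        sed "s|^${key}=.*|${key}=${value}|" "$file" > "$tmp_file" || return 1
        cat "$tmp_file" > "$file" && rm -f "$tmp_file"
    else
        # Add a newline first if the file does not already end with one
        [ -z "$(tail -c 1 "$file")" ] || echo >> "$file"
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}
```

For example, `set_config_key "$SITESCOPE_HOME/groups/master.config" _sshV2First true`, followed by the same call for `_numberOfRepeatExecForSSHConnection` with the value `4`. The path is an assumption about a default install, and it is probably safest to stop SiteScope before editing the file and start it again afterwards.
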
  • Hello,

     

    Thanks for your tips Kenneth. I will make sure to follow them.

     

    The situation now is a bit different. I decided to upgrade to 11.21 since there seem to be many fixes, and I have some other issues I want to resolve as well.

     

    I downloaded the patch SIS_00237.zip and installed it.

    Installation went fine, but when I restart SiS the GUI will not come up.

     

    After a timeout period I get a popup that states "server did not start after restart in a reasonable time"

     

    In the error.log I see the following:

    2013-02-15 15:01:03,129 [SiteScope Main] (SiteScopeGroup.java:844) INFO - Starting HTTP server...
    2013-02-15 15:01:03,134 [SiteScope Main] (SiteScopeGroup.java:944) INFO -
    2013-02-15 15:01:03,134 [SiteScope Main] (SiteScopeGroup.java:945) INFO - Open your web browser to:
    2013-02-15 15:01:03,135 [SiteScope Main] (SiteScopeGroup.java:946) INFO - https://<ip>
    2013-02-15 15:01:03,176 [SiteScope Main] (SiteScopeGroup.java:309) INFO - Starting common scheduler...
    2013-02-15 15:01:03,198 [SiteScope Main] (SiteScopeGroup.java:329) INFO - Starting maintenance scheduler...
    2013-02-15 15:01:03,519 [SiteScope Main] (SiteScopeGroup.java:434) INFO - starting topaz manager
    2013-02-15 15:02:06,620 [SiteScope Main] (ApacheHttpUtils.java:473) ERROR - problem while sending data to URL: http://myhost.se/topaz/topaz_api/tmc_info_query?type=integration_support error: URL: http://myhost.se/topaz/topaz_api/tmc_info_query?type=integration_support, host: es

    .

    .

    2013-02-15 15:20:59,245 [main] (Service.java:395) INFO - stopping monitor process: 27223
    2013-02-15 15:20:59,247 [main] (Service.java:293) INFO - SiteScope monitoring process restarted, process not responding: no progress change was detected.

     

    I think that I get this error because we have an old connection to a BSM server that is no longer available. After the upgrade, it seems that SiteScope tries for a very long time to reach it, eventually times out, and kills itself since there is no progress.

     

    I have not found a way to remove the connection to BSM (esekilx1200) from SiteScope. Any clues how to do that without the GUI?

     

    Kind Regards,

     

    Thomas
