Sitescope shows "No data" for some monitors

Hello Sitescopers,

 

I am running SiteScope 11.20 on a Linux server, monitoring ~70 Solaris servers running Solaris 10 (x86).

In our setup, we use ssh to connect and run the monitors.

 

One of the problems we have with SiteScope is that it sometimes fails to get data for basic monitors like the CPU monitor or the Memory monitor, resulting in a "no data" failure.

 

When I investigate the connection, there is never anything wrong. 

A manual ssh session works perfectly from the Linux server that SiteScope is running on to the monitored server.

(I have tuned the ssh timeout to 180s, as was recommended in a previous post.)

 

When the problem occurs, I cannot make it go away by re-running the monitor (I am not sure it actually runs the monitor again; it seems to finish very quickly). Sometimes I have been successful after a restart of SiS, but this seems like overkill.

 

When I have this problem with, for instance, the CPU monitor, other monitors are working fine.

 

One suspicion I have is that the ssh "Message Of The Day" (motd) could have an impact on some monitors.

We use a motd for legal reasons that is echoed to all (interactive) ssh connections, and I am not sure whether the motd output is messing up the data parsing for some monitors.

 

Any suggestions on what could cause this behaviour or how to troubleshoot it?

 

 

Best Regards,

 

Thomas

 

 

 

 

  • Hi Oliw

     

     

    I'd start by checking your logfiles (error.log and RunMonitor.log in $SITESCOPE_HOME/logs/). It could be that certain monitors are being skipped or are timing out. Something like this, as a start, could potentially point it out:

    egrep -i 'timeout' error.log*

    egrep -i 'skip' error.log*

    egrep -i 'ssh' error.log*

    egrep -i 'timeout' RunMonitor.log*

    egrep -i 'skip' RunMonitor.log*

    egrep -i 'ssh' RunMonitor.log*
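
As a rough sketch, the six searches above could be rolled into one loop that prints a match count per log and keyword. The function name `scan_logs` is made up for illustration, and it assumes the rotated copies share the `error.log` / `RunMonitor.log` prefix, so adjust it to whatever you actually see in your logs directory.

```shell
# Sketch only: count case-insensitive matches per log family and keyword.
# scan_logs is a hypothetical helper name; $1 is the directory holding the logs.
scan_logs() {
    log_dir=$1
    for prefix in error.log RunMonitor.log; do
        for pattern in timeout skip ssh; do
            # Concatenate the current and rotated files; grep -c prints the
            # number of matching lines (0 when nothing matches).
            count=$(cat "$log_dir"/"$prefix"* 2>/dev/null | grep -ci "$pattern" || true)
            echo "$prefix $pattern $count"
        done
    done
}
```

Calling it as `scan_logs "$SITESCOPE_HOME/logs"` prints one line per combination, for example `error.log timeout <count>`. A non-zero count tells you which of the egrep commands is worth running in full.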

     

     

    Maybe have a look at your SiteScope's "Server Statistics" as well when you notice this happening again. It could simply be that you're running too many monitors too frequently.

     

    I'd recommend upgrading to 11.21 as well. It won't necessarily fix this particular problem, but from personal experience 11.21 contains quite a few useful fixes (particularly around Match Expressions and Match Value Labels).

     

    If you spot anything out of the ordinary in your logfiles, post back and we'll take it from there.

     

     

    EDIT: Also, if you can, post a screenshot of your SiteScope's "Health" group of monitors when you notice this happening again. If anything is in a critical state at that point, it might narrow things down slightly.

     

    EDIT #2: Something else you could try is enabling the "Verify Error" option on one or two monitors where you notice this happening and seeing how it goes after that. In essence, this causes the monitor to run again when it enters a critical state (which is what I assume happens when you get the 'no data' situation). If you go for this, though, do it on just one or two monitors for now as a test. I've had it before where enabling this option too widely semi-kills a SiteScope instance.

  • Hi dcsbeemer,

     

    Thanks for your solid troubleshooting suggestions!

     

    I did as you suggested, but in reverse order. Attached is a screenshot of my SiS dashboard showing that I get no data from one of the servers, esekiux1023.

     

    So I grepped in the logfiles as you suggested but added the servername as well.

     

    [root@esekil4adm logs]# egrep -i '(timeout)' RunMonitor.log* | grep esekiux1023
    [root@esekil4adm logs]# egrep -i 'ssh' RunMonitor.log* | grep esekiux1023
    [root@esekil4adm logs]# egrep -i 'skip' RunMonitor.log* | grep esekiux1023
    [root@esekil4adm logs]# egrep -i 'skip' error.log* | grep esekiux1023
    [root@esekil4adm logs]# egrep -i 'ssh' error.log* | grep esekiux1023
    [root@esekil4adm logs]# egrep -i 'timeout' error.log* | grep esekiux1023

     

    Unfortunately, none of these commands capture anything.

    (If I remove the esekiux1023 part, I DO get lots of hits, but not for this problem)

     

    To prove my point, I did a test run from the SiS server, manually running an "mpstat 2 2" command the way SiteScope does:

     

     

    -sh-4.1$ ssh esekiux1023 'mpstat 2 2'

    CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
    0 805 0 4 2125 114 69 2 3 5 0 748 1 2 0 98
    1 535 0 15 40 24 85 2 4 4 0 1115 1 1 0 98
    2 33 0 7 85 20 100 0 1 2 0 168 0 0 0 100

    .

    .

     

    So, the connection to the system is obviously OK, but these monitors have been stuck on 'no data' all day.

    I wonder if they are running at all? Is there a way to check that?

     

    Also attached a screenshot with ssh statistics. Is this what you asked for?

     

    I would love to upgrade to 11.21. We are currently evaluating SiteScope so I have a 60 day trial license.

    I looked for a trial version of 11.21, but 11.20 was the version I could find on the web.

    Is the update delivered as a patch, or is it basically a new install?

     

    Thanks for your help!

     

     

    Kind Regards,

     

     

    Thomas Oliw

     

     

  • You can try increasing the connection timeout value for your servers. Can you tell us whether these errors occur on specific monitors for specific servers, and how many monitors in total are created for the server on which you are getting the error?

    Hanumant
  • Hi Oliw

     

     

    What does your remote server config look like more or less?

    We usually make sure that "SSH V2 only" is selected, but this might not have anything to do with your problem to be honest.

     

    The MOTD shouldn't make a difference either, I think. We have quite a few servers on this side with the same setup, and it doesn't adversely affect SiteScope in any way.
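
If you'd rather rule the MOTD out than take my word for it, something along these lines might help. It is only a sketch: `starts_with_mpstat_header` is a made-up helper name, and it only checks what a plain command-line ssh returns, which may not be exactly what SiteScope's own ssh client sees.

```shell
# Sketch only: reads command output on stdin and succeeds when the first
# non-blank line starts with "CPU", the first column of the mpstat header.
# Anything printed before that (e.g. a banner) makes it fail.
starts_with_mpstat_header() {
    # Take the first non-blank line and strip its leading whitespace.
    first=$(awk 'NF { sub(/^[ \t]+/, ""); print; exit }')
    case $first in
        CPU*) return 0 ;;
        *)    echo "unexpected first line: $first" >&2
              return 1 ;;
    esac
}
```

For example: `ssh esekiux1023 'mpstat 2 2' | starts_with_mpstat_header && echo "output starts cleanly"`. If the legal notice shows up ahead of the header, the helper prints the offending line on stderr.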

     

    What I'm thinking is if you notice another monitor returning 'no data', tail -f the RunMonitor.log file and manually run the monitor a couple of times, say 5 to 10 times in quick succession. Might as well tail -f the error.log file as well while we're at it. Something like this should do the trick:

    tail -f $SITESCOPE_HOME/logs/RunMonitor.log > $HOME/RunMonitor.log

     

    If you could post the resultant logfiles here afterwards, we'll have a look and see if anything stands out.
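
To grab both logs in one go, a sketch like the following might do. `capture_logs` and the `.capture` file names are invented for the example. The idea is to follow both files for a fixed number of seconds while you manually run the monitor, and then stop on its own instead of waiting for Ctrl-C.

```shell
# Sketch only: follow RunMonitor.log and error.log for a fixed time and save
# only the lines appended during that window.
# Arguments: log directory, output directory, seconds to capture.
capture_logs() {
    log_dir=$1
    out_dir=$2
    seconds=$3

    # -n 0 starts at the end of each file, so old entries are not copied.
    tail -n 0 -f "$log_dir/RunMonitor.log" > "$out_dir/RunMonitor.capture" &
    pid_run=$!
    tail -n 0 -f "$log_dir/error.log" > "$out_dir/error.capture" &
    pid_err=$!

    # Run the monitor by hand during this window.
    sleep "$seconds"

    # Stop both followers and reap them quietly.
    kill "$pid_run" "$pid_err" 2>/dev/null || true
    wait "$pid_run" "$pid_err" 2>/dev/null || true
    return 0
}
```

Something like `capture_logs "$SITESCOPE_HOME/logs" "$HOME" 60` gives you a minute to run the monitor 5 to 10 times, and leaves `RunMonitor.capture` and `error.capture` in your home directory to post here.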
