Multiple Measurement Threshold Sources in OBM

Hello,

I am having an issue when I try to create multiple events that are derived from different program sources in one threshold monitoring policy.

This is what my sources look like:

Source type: Program source
Short name: memory
Description: Checks the percentage of memory used in the system using the opcmon
Program: /home/dslesarenko/mem_check.sh

Source type: Program source
Short name: cpu
Description: CPU used on the system
Program: /home/dslesarenko/cpu_check.sh

This is what my Perl script looks like for creating a memory threshold event:
my $Src = $Policy->Source("memory");
if ($Src->Value >= 40 ) {
$Rule->Status(TRUE);
} else {
$Rule->Status(FALSE);
}
my $SrcValue = $Src->Value;
$Session->Value('AlertString',$SrcValue);

This is what my Perl script looks like for creating a CPU threshold event:
my $Src = $Policy->Source("cpu");
if ($Src->Value >= 40 ) {
$Rule->Status(TRUE);
} else {
$Rule->Status(FALSE);
}
my $SrcValue = $Src->Value;
$Session->Value('AlertString',$SrcValue);


Now, if there is only a memory threshold event, it runs perfectly fine and catches all the opcmon commands I send to it. The issue arises when I add the CPU threshold event. The CPU event can be triggered via opcmon right after the policy is deployed, but once the memory threshold event has been triggered, the CPU threshold stops working: it does not return anything or show up in the event browser.

What is going on with my policy? Why doesn't the CPU threshold monitor send events after the memory monitor triggers? Is OBM supposed to have such functionality, or is this not the intended use?

Thanks in advance, 

Daniil

  • 0  

    Hello,

    I think I understand your issue: you have a monitor policy with two sources, memory and CPU. You added CPU as a second source, and the CPU can be burning at 100%. You get your first event, but no further CPU events are generated, and you would like to know why.

    Would you be able to upload your policy (create a content pack), your /home/dslesarenko/cpu_check.sh script plus some instructions about how to reproduce the issue? 

    Is this running on an OA 12 managed node?

    If you don't want to upload it, feel free to email it to me if you like, but others may have a better idea than me!  

    Thanks.

  • 0 in reply to   

    Thank you for your reply Duncan,

    Certainly, here is a content pack with my policy in it. The script for mem_check is:
    check_mem=$(free | awk '/^Mem:/ {print $3/$2 * 100.0}')
    opcmon dsMemory-memory=$check_mem -object "MEM"

    The script for cpu_check:
    check_cpu=$(top -bn1 | awk 'NR > 7 { sum += $9 } END { print sum }')
    opcmon "dsMemory"-cpu=$check_cpu -object "CPU"

    Recreating the issue is very simple; it starts right after deployment. Before the first memory threshold event is generated, the CPU event can be triggered via the opcmon command a couple of times. After the memory event is triggered, the CPU event stops responding.

    It is running on OA 12 managed node.

    OBM Content Pack - Threshold_Monitor_Memory_Policy (1.00).zip

  • 0   in reply to 

    Hello there,

    I just checked the policy, would you be able to provide /home/dslesarenko/cpu_check.sh so I have a source to test with?

    Thanks.

  • 0   in reply to   

    Sorry, I have just seen it.

  • 0 in reply to   

    Hello Duncan, 

    This website doesn't allow me to upload .sh files for some reason. It gives an error message: "File type is not allowed: cpu_check.sh"

    The entire content of the cpu_check.sh file is:

    check_cpu=$(top -bn1 | awk 'NR > 7 { sum += $9 } END { print sum }')
    opcmon "dsMemory"-cpu=$check_cpu -object "CPU"

    And the entire content of mem_check.sh is: 

    check_mem=$(free | awk '/^Mem:/ {print $3/$2 * 100.0}')
    opcmon dsMemory-memory=$check_mem -object "MEM"

    Both of these are just simple shell scripts to gather some system data. I think that the most important part of them is the opcmon command. 
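
As a hedged illustration of that opcmon call, the sketch below factors the two scripts into small shell functions. The function names `opcmon_arg` and `mem_percent` are hypothetical helpers, not part of Operations Agent. The `<policy>-<source>=<value>` form is copied from the scripts above, and the opcmon call itself is left as a comment because it needs a managed node.

```shell
#!/bin/sh

# Hypothetical helper: build the "<policy>-<source>=<value>" argument
# in the same form the scripts above pass to opcmon.
opcmon_arg() {
    printf '%s-%s=%s\n' "$1" "$2" "$3"
}

# Hypothetical helper: read `free`-style output on stdin and print
# used memory as a percentage of total (same awk as mem_check.sh).
mem_percent() {
    awk '/^Mem:/ { print $3 / $2 * 100.0 }'
}

# Usage on a managed node (not executed in this sketch):
#   opcmon "$(opcmon_arg dsMemory memory "$(free | mem_percent)")" -object "MEM"
```

Building the argument in one place keeps the policy name and the source short name spelled the same way in both scripts, which makes it easier to rule out a naming mismatch while debugging.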

    Hope this helps.

    Thank you for your time,

    Daniil 

  • Suggested Answer

    0   in reply to 


    Hello,

    I hope all is well. I'm not sure this is going to answer your question directly, but I think it may be helpful nevertheless.

    I checked to see if the values from your scripts could be obtained directly using SCOPE or CODA, that is, without a shell script. Yes, they can, and there are many advantages to doing this. Please see the values from my test system below:

    # free -mt | awk '/^Mem:/ {print $3/$2 * 100.0}'
    76.9452
    # ovcodautil -ds SCOPE -o GLOBAL -m GBL_MEM_UTIL |tail -1
    05/02/24 10:15:00 78.02
    # ovcodautil -ds CODA -o GLOBAL -m GBL_MEM_UTIL |tail -1
    05/02/24 10:15:00 78.02

    Please note that SCOPE and CODA return the same value.

    The script for cpu_check:
    # top -bn1 | awk 'NR > 7 { sum += $9 } END { print sum }'
    11.8
    # ovcodautil -ds CODA -o GLOBAL -m GBL_CPU_TOTAL_UTIL |tail -1
    05/02/24 10:15:00 11.97

    I think it's important to mention that the difference in values comes from how the values are rounded (*).

    You can get these values directly; a good example of this is Sys_MemoryUsageAndPerformance. This uses SCOPE GBL_MEM_UTIL and GBL_MEM_FREE, which are then processed via a Perl script. Maybe you can copy this policy and adapt it to your needs? Another memory example would be Sys_MemoryBottleneckDiagnosis. For CPU, please check Sys_CPUBottleneckDiagnosis, which uses CODA GBL_CPU_TOTAL_UTIL and GBL_ACTIVE_CPU. I think these are really useful out-of-the-box policies which you can copy and change to meet your needs.
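
For quick checks from the command line, the ovcodautil output shown above can also be reduced to a bare number and compared against a threshold. This is only a rough sketch: `latest_value` and `exceeds` are hypothetical helper names, the parsing assumes the metric is the last field of the last line (as in the sample output above), and the threshold of 40 is borrowed from the Perl snippets earlier in the thread.

```shell
#!/bin/sh

# Hypothetical helper: take ovcodautil-style output on stdin and print
# the last field of the last line, e.g. "05/02/24 10:15:00 78.02" -> 78.02
latest_value() {
    tail -1 | awk '{ print $NF }'
}

# Hypothetical helper: exit status 0 when value ($1) >= threshold ($2).
# awk is used because plain sh cannot compare floating-point numbers.
exceeds() {
    awk -v v="$1" -v t="$2" 'BEGIN { exit !(v + 0 >= t + 0) }'
}

# Usage on a managed node (not executed in this sketch):
#   mem=$(ovcodautil -ds SCOPE -o GLOBAL -m GBL_MEM_UTIL | latest_value)
#   if exceeds "$mem" 40; then echo "memory above threshold: $mem"; fi
```

This does not replace a measurement threshold policy; it is just a way to see which value a policy based on the same metric would be evaluating.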

    However, I'm not sure about the logic of monitoring both memory and CPU in one policy. I always think that any performance issue is based around memory, CPU, I/O and networking (not forgetting the underlying hypervisor). Looking at what your scripts do, I think it would be beneficial to keep memory and CPU as two separate policies. However, if you are using this as a learning exercise, then let me know.

    (*) Rounding errors: I'm sure somebody will laugh at me, but with something like this, it's a bit like asking how hot the sun is, then asking if that's in degrees F, degrees C or Kelvin: it doesn't really matter that much. It’s just a value which either exceeds or doesn't exceed a threshold. This is very different to pay rises :-)

    Thanks.

  • 0 in reply to   

    Thank you a lot for your help Duncan,

    This is the conclusion I came to as well. This policy seems to work best when there is one source and multiple events, or multiple sources and one event with the same object. I appreciate your help a lot and I am extremely impressed that you got that to work with CODA.
    I was doing this as a practice exercise and exploring the flexibility and possibilities of threshold monitor policies.

    Thanks again,
    Daniil

  • 0   in reply to 

    Hello,

    Thanks for the update and feedback. I think that monitors offer massive flexibility: you can use external scripting via instrumentation, or you can use multiple sources. When developing, I always look for examples which somebody else has written (and which are known to work :-)) and then change them. Often, the amount of change is minimal. In your case Sys_MemoryUsageAndPerformance, Sys_MemoryBottleneckDiagnosis or Sys_NetworkInterfaceErrorDiagnosis are all good starting points, as they work in similar ways.

    There are advantages and disadvantages to using embedded Perl. The advantages are that you don't need any instrumentation and that you can access policy objects; the downside is that it can be time consuming to troubleshoot. I'd also mention that while you may be tied to Perl and VBS within an embedded policy, that's not the case if you go external. This opens up Python, .NET, PowerShell and so on.

    I think it's great that you are finding out what OA12 can do.  If you have another example then feel free to let me know - I'm always learning new stuff and happy to help.

    Thanks!