hellboy_1 Super Contributor.

OMW server critical issue

Hi Team, 

We are facing a problem with our management server: the message queue has reached its maximum size. In addition, old alerts that were already acknowledged are appearing in the console again, which is why the message queue reports that it has reached its maximum size.

Please help with this.

NB: This is our production server, so please treat this as urgent.

Regards,

Debashish.

nicnictout Outstanding Contributor.

Re: OMW server critical issue

Hello,

I do not understand what your question is. In case it helps, here is a procedure:

1) Stop all OMW services with ovc -kill and vpstat -3 -r stop

2) Manually kill all remaining processes, especially the trace service, which never gets stopped

3) Delete all message queue files; it is better to lose old messages than to be blind. They are in %ovdatadir%/tmp/OpC and %ovdatadir%/tmp/OpC/public

Note: I'm not absolutely sure of the paths; I do not have an OMW at hand to check

 

I really tried several times to read your sentence, but I finally gave up 🙂

"Also the old alert coming which was acknowledge already is in console that why message queue showing reached the max size reached"

 

Could you explain precisely what you are asking?

Gediminas Daniu Outstanding Contributor.

Re: OMW server critical issue

Hi,


Did you try to find out why alarms are accumulating in the queue? An alarm storm from agents, a dead OMW process that prevents processing of incoming messages, a slow database, or a lack of free DB space?
Have you opened a support call with HP Software (now Micro Focus)?

Regards,

Gedas

nicnictout Outstanding Contributor.

Re: OMW server critical issue

Do you have other tools connected to HPOM, such as NNM or SIM (or especially SCOM, for example)?

hellboy_1 Super Contributor.

Re: OMW server critical issue

Hi,

Yes, it has an integration with HP NNMi.

Regards,

Debashish

hellboy_1 Super Contributor.

Re: OMW server critical issue

Hi,

 

I mean that old alerts which arrived in 2012, and which we acknowledged at that time, are now showing up again in the console. The received time shows today, but the creation time shows 2/12/2012.

We have deleted the queue several times, but within 15 minutes it fills up again and new alerts stop coming in.

I hope this is clear now.
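
The symptom described here (received today, created years ago) can be checked mechanically once message timestamps have been exported. A minimal sketch, assuming each message is available as a (message id, creation time, received time) tuple; that layout, the sample rows and the one-day threshold are assumptions for illustration, not an OMW export format.

```python
from datetime import datetime, timedelta

def find_stale_messages(messages, max_delay=timedelta(days=1)):
    """Return ids of messages received much later than they were created."""
    stale = []
    for msg_id, created, received in messages:
        if received - created > max_delay:
            stale.append(msg_id)
    return stale

# Made-up sample rows for illustration only.
sample = [
    ("msg-1", datetime(2012, 12, 2, 10, 0), datetime(2018, 5, 14, 9, 30)),
    ("msg-2", datetime(2018, 5, 14, 9, 29), datetime(2018, 5, 14, 9, 30)),
]
print(find_stale_messages(sample))
```

Grouping the stale ids by node or by policy would then show whether the replayed alerts share a common source.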

 

Regards,

Debashish   

nicnictout Outstanding Contributor.

Re: OMW server critical issue

Hello,

 

Thanks for the explanation, but you are facing a very strange issue...

Does the issue affect all alarms, or is there a specific pattern you can see that would help identify the root cause?

Can you check your manager configuration, and share it if possible (don't forget to remove all sensitive data first)?

 

I noticed this key in the ovrg server config:

OPC_UNACK_ON_DUPL_EVENT:

When set to TRUE, the message manager that receives a 'duplicate
counter' event via M2M forwarding from another server for a
history message, will unacknowledge the message.

 

The description seems to match your manager's behavior perfectly...
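
One way to check for this key is to save the ovconfget output to a text file and search it. A minimal sketch, assuming the saved output uses `[namespace]` header lines followed by `KEY=value` lines; the sample text below is invented for illustration and is not taken from a real server.

```python
def find_setting(ovconfget_text, key):
    """Return a list of (namespace, value) pairs where `key` is set."""
    hits = []
    namespace = None
    for raw in ovconfget_text.splitlines():
        line = raw.strip()
        if line.startswith("[") and line.endswith("]"):
            # A new namespace section begins.
            namespace = line[1:-1]
        elif "=" in line:
            name, value = line.split("=", 1)
            if name.strip() == key:
                hits.append((namespace, value.strip()))
    return hits

# Invented sample; "example.ns" and SOME_OTHER_KEY are hypothetical.
sample_output = """\
[opc]
OPC_UNACK_ON_DUPL_EVENT=TRUE
[example.ns]
SOME_OTHER_KEY=1
"""
print(find_setting(sample_output, "OPC_UNACK_ON_DUPL_EVENT"))
```

An empty result would mean the key is not set explicitly in that output, so it is probably not what is driving the behavior.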

 

 

hellboy_1 Super Contributor.

Re: OMW server critical issue

Hi All,

Any update on this issue?

Regards,

Debashish

 

nicnictout Outstanding Contributor.

Re: OMW server critical issue

Hello,

We would like to help you, but you did not answer our questions, so it is quite hard to help!

1) Post the output of ovconfget and ovconfget -ovrg server.

2) I mentioned a config variable on the manager that could explain the strange behavior of alarms being automatically unacknowledged. Did you check it? We will try to solve your issue step by step!

3) Stop all OM services and delete the queue files; they will be created again on restart. Then take a close look at your manager when it comes back online, to see whether an event storm suddenly arrives.

You can also query the database to find which nodes generate a ton of alarms:

SELECT node_id, COUNT(opc_act_messages.message_number) FROM opc_act_messages GROUP BY node_id;

SELECT node_id, COUNT(opc_hist_messages.message_number) FROM opc_hist_messages GROUP BY node_id;

 

After that, you just need to look up in the opc_nodes table which CI is filling your queue 😉
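
The two queries above return one row per node in no particular order. A sketch of the same count with an ORDER BY added so the noisiest node comes first, run here against a throwaway in-memory SQLite table. Only the names opc_act_messages, node_id and message_number come from the post; the column types and sample rows are made up, and the real OMW database is not SQLite, so treat this purely as an illustration of the query shape.

```python
import sqlite3

# Query from the post, with an ORDER BY added so the noisiest node is listed first.
QUERY = """
SELECT node_id, COUNT(message_number) AS msg_count
FROM opc_act_messages
GROUP BY node_id
ORDER BY msg_count DESC
"""

def build_demo_db():
    """Create a throwaway in-memory table with made-up rows (illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE opc_act_messages (message_number TEXT, node_id TEXT)")
    conn.executemany(
        "INSERT INTO opc_act_messages VALUES (?, ?)",
        [("m1", "node-a"), ("m2", "node-a"), ("m3", "node-a"), ("m4", "node-b")],
    )
    return conn

rows = build_demo_db().execute(QUERY).fetchall()
for node_id, msg_count in rows:
    print(node_id, msg_count)
```

The same ORDER BY clause can be appended to the opc_hist_messages query.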

 

hellboy_1 Super Contributor.

Re: OMW server critical issue

Hi,

As requested, the details are given below:

1) Please find attached the output of ovconfget and ovconfget -ovrg server.

2) How can I check for the strange behavior of alarms being automatically unacknowledged? Could you please tell me the steps?

3) Yes, the event storm suddenly comes from one old policy named Shutdown_Reboot event monitor.

This is an old OSSPI policy. We used it for the Genesys servers. We had more than 900 Genesys servers where this policy was deployed.

I tried to disable and remove the policy from the Genesys servers but was unable to do so. After that, I also deleted the Genesys node group from our management server. The alerts still keep coming continuously from various Genesys servers, and most of them show a very old creation date (2014/15/16).

The alerts are coming from various Genesys systems.

 

Thanks & Regards,

Debashish Guha Roy

 

nicnictout Outstanding Contributor.

Re: OMW server critical issue

OK Hellboy,

 

1) I do not find anything strange in your config files.

2) I just wanted to check whether that config variable is present in either of the files you uploaded. I do not see it.

Just one question, because I have a doubt when reading your response: in the alarm properties, the creation time is stamped with the year 2014, for example, isn't it? In other words, this is not an alarm raised because the agent read an old Windows Event Log entry from 2014, for example? If possible, take a screenshot of your alarm properties (it may help...).

3) When you said you cannot remove the policy from the servers, did you try remotely or on the server itself?

If you have a permission issue locally, try to remove it with:

ovdeploy -cmd "ovpolicy -remove -polname 'xxxxxx'" -node yyyy

4) In the meantime, you can add a new condition to your Genesys policy that discards all messages (acting like a filter) and redeploy it to all your servers. It should at least prevent your manager from collapsing if the event storm is really heavy.

 

hellboy_1 Super Contributor.

Re: OMW server critical issue

Hi Nicnictout,

 

Please find attached the alarm properties. You can also see the duplicate alerts below.

I have tried to remove the policy from the management server and also from the node itself.

Right now we need to stop the message storm.

Could you please confirm one thing: the console shows an acknowledged message count of 239879. How can we remove these old alerts?

 

 

Regards,

Debashish

 

nicnictout Outstanding Contributor.

Re: OMW server critical issue

Hmm, OK,

As I guessed, this kind of issue appears with Windows Event Log policies, because they often read the whole log each time you deploy your policies, and the agent may have difficulty setting its position flag when the logs become huge...
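
To make the suspected mechanism concrete: if the reader loses its saved position and starts again from the beginning of the log, every old entry is raised again as a new alarm. A toy model of that effect, not a description of how the agent is actually implemented; the function name, the list-based log and the sample entries are all hypothetical.

```python
def read_new_entries(log, last_position, read_from_beginning=False):
    """Return (entries to process, new position) for one polling pass."""
    start = 0 if read_from_beginning else last_position
    return log[start:], len(log)

# Made-up log entries for illustration only.
event_log = ["2014 shutdown", "2015 shutdown", "2016 shutdown"]

# Normal pass: the saved position is at the end of the log, nothing is re-read.
entries, position = read_new_entries(event_log, last_position=3)
print(entries)

# Pass after the position is lost or ignored: the whole history is replayed.
replayed, position = read_new_entries(event_log, last_position=3, read_from_beginning=True)
print(replayed)
```

With some 900 nodes each replaying years of entries, a pattern like this would be enough to fill the manager queue within minutes.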

 

So:

1) Edit the policy and check whether the "never read from beginning" option is selected (I don't remember the exact term, and I only have a Linux OML at hand)

2) Create a new condition in your policy:

- Condition match => <*> or EventID=6008 or anything else that matches all your unwanted alarms

- Matching option => select "suppress on matching condition"

This will prevent any new alarm from being pushed to the manager; the agent will discard it. Obviously, I hope you do not need to monitor the Windows Event Log...

This is only a temporary fix; it is not normal that you cannot unassign and remove a policy from nodes. Take a look at instrumentation: even if you remove the policy from a host, if your nodes have instrumentation assigned, the policy and scripts will be pushed again and again.

3) There is no way to clean it up: these are entries in the database, and all the tables are linked, so erasing them directly would not be a good idea ^^

But don't be upset about having 30k duplicates; this is just a number and a message_number stored in the database. It suggests that you have an issue, but those entries won't make your manager collapse. 😉

 

What you can try as a last resort, provided you were able to stop the event storm first:

- Edit the server configuration properties from the console; there you will find how many days an acknowledged alarm is kept in the database. Set it to 1 day, for example

- After that you can acknowledge all active alarms and run the binary that downloads all messages to plain text (opchistdown; I'm not sure about the exact name, I do not have my OMW memo guide with me). When that is done, run the HPOM DB cleaner "opcdbidx -all" (once again, check the command syntax, as I gave it to you in Unix format).

 

In case of doubt, post the OMW command list page available in the help; I may be able to help you find the matching ones 😉

 

hellboy_1 Super Contributor.

Re: OMW server critical issue

Hello,

One more thing: I have already removed the node group (Genesys Server) from our management server, so how can I redeploy the policy?

Is it possible to set up a WebEx session so you can check the problem from your end?

Regards,
Debashish
nicnictout Outstanding Contributor.

Re: OMW server critical issue

Hello,


- Removing the node group won't help fix anything; it is just a folder.

- I cannot do a WebEx during work hours: I'm a contractor, and my customer would not be pleased if I spent my time helping other companies. But we can do it after my work hours (send me a PM so we can check our respective availability; I'm in the French time zone)


In the meantime, you can carry out some tasks:

1) On your manager server:
ovconfchg -ovrg server -ns opc -set OPC_UNACK_ON_DUPL_EVENT FALSE
2) vpstat -3 -r stop
3) vpstat -3 -r start
4) In your problematic policy, create a new condition in first position with the matching condition set to EventID=6008; in the Actions tab, select "Do nothing: stop evaluation"
5) Right-click your policy; the context menu has an option to show all servers where the policy is deployed
6) Try deploying the new policy to a few nodes first; there is no need to spend 2 hours fixing all servers at once if it does not fix your issue.
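
Step 4 relies on conditions being evaluated in order, with the first matching condition deciding what happens to the event, which is why the new condition has to be in first position. A toy model of that first-match logic, not OMW code; the Condition class, the event dictionaries, the action names and the default action are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Condition:
    name: str
    matches: Callable[[dict], bool]
    action: str  # "suppress" or "forward" in this toy model

def evaluate(event, conditions, default_action="forward"):
    """Return the action of the first condition that matches the event."""
    for cond in conditions:
        if cond.matches(event):
            return cond.action  # stop evaluating further conditions
    return default_action  # assumed behavior when nothing matches

conditions = [
    # New condition placed first, so it wins over anything below it.
    Condition("suppress EventID 6008", lambda e: e.get("event_id") == 6008, "suppress"),
    Condition("existing catch-all", lambda e: True, "forward"),
]

print(evaluate({"event_id": 6008}, conditions))
print(evaluate({"event_id": 1074}, conditions))
```

If the suppress condition were placed after the catch-all, it would never be reached, and the unwanted events would still be forwarded.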

NB: could you post the output of:
ovdeploy -cmd "ovpolicy -remove -polname 'xxxxxx'" -node yyyy*

It's quite strange that you cannot remove a policy from a node. Please also attach the System.txt.
