OMU TIP: opcbbcdist stops processing when distributing to a large number of nodes
Upon distributing policies/instrumentation to a large number of nodes, opcbbcdist stops processing distribution requests after a while.
When a mass distribution of policies is started from the Management Server to a large number of nodes,
the distribution may get stuck and files may accumulate in the distrib directory.
While it is stuck, no distribution of any kind can be performed.
The system.txt shows the following errors:
0: WRN: Tue Nov 5 17:34:11 2013: opcbbcdist (5390/1083803968): [DBHandler.cpp:2042]: Error occurred while getting instrumentation value: Type is MAP_OS_VERSION. Assd value is 'EMPTY'.
0: ERR: Tue Nov 5 17:37:15 2013: opcbbcdist (5390/1083803968): (depl-218) Could not get inventory.
1: ERR: Tue Nov 5 17:37:15 2013: opcbbcdist (5390/1083803968): (depl-176) Message returned from host 'abc.com':
2: ERR: Tue Nov 5 17:37:15 2013: opcbbcdist (5390/1083803968): (xpl-117) Timeout occurred while waiting for data.
0: ERR: Tue Nov 5 17:38:02 2013: opcbbcdist (5390/1083803968): (depl-218) Could not get inventory.
1: ERR: Tue Nov 5 17:38:02 2013: opcbbcdist (5390/1083803968): (depl-176) Message returned from host 'abc.com':
2: WRN: Tue Nov 5 17:38:02 2013: opcbbcdist (5390/1083803968): (bbc-427) HttpOutputRequestImpl::WaitForPreResponse() caught OvXplIo::IOException_t. <null>
3: ERR: Tue Nov 5 17:38:02 2013: opcbbcdist (5390/1083803968): (xpl-333) recv() on '[10.x.x.x]:383' failed.
4: ERR: Tue Nov 5 17:38:02 2013: opcbbcdist (5390/1083803968): (RTL-110) Connection timed out
The root cause of this issue is indeed the large number of distributions triggered at the same time.
The problem is due to opcbbcdist reaching the maximum number of 1024 open files.
opcbbcdist only runs a limited number of distributions in parallel; the remaining distributions wait until
the previous ones have completed.
The problem most likely arises because connections to the nodes remain open for a long time after a distribution.
After a while the opcbbcdist process therefore runs out of file descriptors and cannot open new connections.
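To check whether opcbbcdist is actually approaching the 1024 limit, the number of descriptors it currently holds can be counted. The following is a minimal sketch, assuming a Linux Management Server with /proc mounted; the helper name count_open_fds is only illustrative and is not part of OMU.

```shell
# Count the open file descriptors of a process (Linux, via /proc).
# Run as root or as the owner of the process.
count_open_fds() {
    # $1 = PID of the process to inspect
    ls "/proc/$1/fd" 2>/dev/null | wc -l | tr -d ' '
}

# Example (assumes a single opcbbcdist process is running):
#   count_open_fds "$(pgrep -x opcbbcdist)"
```

A value close to 1024 while distributions are stuck would be consistent with the explanation above.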
In order to close connections faster please use the following settings on the Management Server:
# ovconfchg -ns xpl.net -set SocketPoll TRUE
# ovconfchg -ns bbc.http.ext.opcragt -set AUTO_CONNECTION_CLOSE_INTERVAL 60
# ovconfchg -ns bbc.http.ext.opcbbcdist -set AUTO_CONNECTION_CLOSE_INTERVAL 300
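As a quick check, the values can be read back with ovconfget after setting them. This is only a sketch and assumes the commands are run on the Management Server with the OV binaries in the PATH:

```shell
# Read back the settings configured above.
ovconfget xpl.net SocketPoll
ovconfget bbc.http.ext.opcragt AUTO_CONNECTION_CLOSE_INTERVAL
ovconfget bbc.http.ext.opcbbcdist AUTO_CONNECTION_CLOSE_INTERVAL
```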
However, due to the large number of distributions, opcbbcdist may still reach the maximum of 1024 file descriptors.
The settings in /etc/security/limits.conf (32768 in this example) are not inherited by opcbbcdist because the OMU processes are started by ovcd,
and this process in turn is started during system boot and thus does not inherit login settings as defined in /etc/security/limits.conf,
unless you restart it from a login shell with "ovc -kill" and "ovc -start".
So in order for opcbbcdist to run with the same limit as configured in /etc/security/limits.conf, please do the following:
Delete the files in the distrib directory.
Add "ulimit -n 32768" near the beginning of the scripts /etc/rc.d/init.d/omu500 and /etc/rc.d/init.d/OVCtrl.
A value of 8192 would probably be sufficient, but 32768 keeps it consistent with the system's current settings in /etc/security/limits.conf.
After this, the distribution should proceed as expected.
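To confirm that the raised limit has actually reached the running processes, the effective limit can be read from /proc. This is a minimal sketch, assuming Linux; the helper name soft_nofile_limit is hypothetical.

```shell
# Print the soft "Max open files" limit of a running process (Linux /proc).
soft_nofile_limit() {
    # $1 = PID; field 4 of the "Max open files" line is the soft limit
    awk '/^Max open files/ {print $4}' "/proc/$1/limits"
}

# Line added near the top of /etc/rc.d/init.d/omu500 and /etc/rc.d/init.d/OVCtrl:
#   ulimit -n 32768
#
# After the processes have been restarted, check for example:
#   soft_nofile_limit "$(pgrep -x ovcd)"
#   soft_nofile_limit "$(pgrep -x opcbbcdist)"
```

If the value printed for opcbbcdist is still 1024, the processes were probably started before the change or from a context that did not run the modified scripts.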
Launching a mass distribution is not advisable because it puts a heavy load on the network.
It is better to distribute to small groups of nodes and increase the number of Agents step by step, to find out
how many policies or Agents can be handled well in a single distribution.
There are some tips for distribution documented in the OMU 9.10 Admin Reference guide, Chapter 3,
in the section “HPOM Agent-Configuration Distribution”, page 185.
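The batching idea can be sketched as a small shell helper. Everything here is an assumption made for illustration: the function name, the variables DIST_CMD, BATCH_SIZE and PAUSE_SECS, and the default of simply echoing each batch. Replace DIST_CMD with the distribution command you actually use.

```shell
# Distribute to nodes in small batches, pausing between batches.
# DIST_CMD is a placeholder; by default each batch is only printed.
DIST_CMD=${DIST_CMD:-echo}
BATCH_SIZE=${BATCH_SIZE:-20}
PAUSE_SECS=${PAUSE_SECS:-60}

distribute_in_batches() {
    # Reads one node name per line from stdin.
    batch=""
    count=0
    while IFS= read -r node; do
        [ -n "$node" ] || continue          # skip empty lines
        batch="$batch $node"
        count=$((count + 1))
        if [ "$count" -ge "$BATCH_SIZE" ]; then
            $DIST_CMD $batch                # run the command for this batch
            sleep "$PAUSE_SECS"             # give the batch time to finish
            batch=""
            count=0
        fi
    done
    # Send the final, possibly smaller, batch.
    if [ "$count" -gt 0 ]; then
        $DIST_CMD $batch
    fi
}

# Example: distribute_in_batches < nodes.txt
```

Starting with a small BATCH_SIZE and raising it between runs makes it possible to see at which point distributions start to slow down.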
This document is available under the following link: