Hi everyone,
I am an HPOO 10.20 consultant.
For the past 2 days I have had a problem with the OO Central worker: it does not execute any flows, and the list of flows "in execution" keeps growing without ever being resolved.
In the server log I found the following warning (more than 20 occurrences):
(AbstractEventsBuffer.java:63) WARN - Step log execution events: buffer is full.
Could someone tell me what it means, please, and why it appears?
Thanks a lot
If I had to guess based on what I've seen so far: Central isn't processing its internal queues as efficiently as it needs to. I think it's probably exacerbated by the fact that there is only one worker to assign jobs to. However, I would focus on the trigger firings: my prod env now has 9 clustered Centrals and 30 dedicated RASes processing a few thousand jobs an hour (mostly via the scheduler), and any time I've seen issues like the ones you describe, they were mostly resolvable by trigger tuning. I've also made other changes to the engine and Quartz configs, but the conditions that led me there aren't in your current error messages.
That said, I would also work on upgrading to the latest version soon; you're missing out on a lot of enhancements that improve job execution speed, among other things.
Take a look at the monitoring page (https://<central>/oo/monitoring). Check for long-running SQL, long-running jobs, and JVM resource issues.
I'd also suggest upgrading; you're about 7 versions behind, and several fixes have gone in to resolve queueing issues.
You can also clear the queue and set all currently executing jobs to cancelled using the following SQL on your OO DB. It doesn't resolve the issue long term, but it'll get you functional again:
TRUNCATE TABLE OO_FINISHED_BRANCHES
DELETE FROM [OO_SUSPENDED_EXECUTIONS]
TRUNCATE TABLE OO_EXECUTION_INTERRUPTS
TRUNCATE TABLE OO_EXECUTION_QUEUES
truncate table OO_EXECUTED_FLOW_GRAPH
TRUNCATE TABLE OO_EXECUTION_STATE
TRUNCATE TABLE OO_EXECUTION_STATES
TRUNCATE TABLE OO_EXECUTION_STATES_1
TRUNCATE TABLE OO_PAUSE_DATA
TRUNCATE TABLE OO_SUSPENDED_EXECUTIONS
delete from [OO_RUNTIME_VALUE_STORE]
SET STATUS = 'CANCELED', END_TIME = START_TIME
WHERE STATUS IN ('RUNNING', 'PAUSED', 'PENDING_PAUSE', 'PENDING_CANCEL')
After discussion with colleagues:
The monitoring page does not help us: we applied a database restore, so the page contains no information about the critical period (when OO stopped executing flows and we got the "buffer is full" message). The restore took the database back to 3 days earlier.
Just for information, we found other warnings and errors in the critical window:
ERROR - HHH000315: Exception executing batch [Batch update returned unexpected row count from update ; actual row count: 0; expected: 1]
WARN - HHH020003: Could not find a specific ehcache configuration for cache named [com.hp.oo.security.authn.entities.UserRef]; using defaults.
WARN - SQL Error: 2627, SQLState: 23000
ERROR - Violation of UNIQUE KEY constraint 'OO_FINISHED_BRANCHES_UC'. Cannot insert duplicate key in object 'dbo.OO_FINISHED_BRANCHES'. The duplicate key value is ......
The data on this page isn't kept in the DB but in a file store on Central, in the Tomcat temp directory, so it may still provide some insight. If you run into this again, it is one of the first places I would suggest looking.
Just to verify: the environment was restored to a point prior to the error condition, so you're operating normally now? If so, I would focus on upgrading to a more recent version to take advantage of the numerous updates since 10.20.
As for the upgrade, we will move to the latest version, 2018.**, in a few months.
As for our current situation, we are really not operating normally: every day at 8 am the "buffer is full" message appears, the worker stops, and we apply a database restore (which is not a good solution, because we lose all recent executions).
The problem is that we have no recently developed flows; all of them are old and have been running for at least 2 months without problems.
When did the issues start?
Are you performing regular DB maintenance?
Can you describe your environment (number of Centrals/workers), which DB you are using, and the size of the DB (actual, not allocated)?
So the problem started as follows:
Sunday at 8 am we found in the logs that the disk partition containing the OO DB .ldf (DB log) was full, so we resolved that by adding more space.
After that we had problems with the worker every hour (during Sunday), so we restarted the service many times (but the problem stayed the same), and then we performed the first DB restore.
On Monday morning, the worker stopped working from 8 am.
Tuesday was the same, and today the same again.
We are using a SQL Server DB.
For DB maintenance, we only have the DB purge, which runs at 2 am every day and keeps only the last week of execution data.
For the architecture, we have one Central (which is also the only worker).
DB size: ~400 GB.
Thanks a lot
You might check your index fragmentation and reindex where necessary.
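To make the fragmentation check concrete, here is a rough sketch, not an official OO procedure. The query string uses the standard SQL Server DMV `sys.dm_db_index_physical_stats`; the helper `suggest_action`, its `min_pages` cutoff, and the 5% / 30% thresholds are my own assumptions based on the commonly cited rule of thumb, so adjust them to your DBA's policy.

```python
# T-SQL to run against the OO database (e.g. from SSMS); 'LIMITED' is the
# cheapest scan mode, which matters on a ~400 GB database.
FRAGMENTATION_QUERY = """
SELECT OBJECT_NAME(ips.object_id) AS table_name,
       i.name                     AS index_name,
       ips.avg_fragmentation_in_percent,
       ips.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
  ON i.object_id = ips.object_id AND i.index_id = ips.index_id
WHERE ips.index_id > 0            -- skip heaps
ORDER BY ips.avg_fragmentation_in_percent DESC;
"""


def suggest_action(frag_percent, page_count, min_pages=1000):
    """Map one result row to a maintenance action (assumed thresholds).

    Small indexes are ignored because their fragmentation figure is
    mostly noise; 5% and 30% are rule-of-thumb cutoffs, not OO guidance.
    """
    if page_count < min_pages or frag_percent < 5.0:
        return "NONE"
    if frag_percent <= 30.0:
        return "REORGANIZE"
    return "REBUILD"


if __name__ == "__main__":
    # Hypothetical rows, only to show how the helper is used.
    for frag, pages in [(2.0, 50000), (18.5, 50000), (72.3, 50000), (90.0, 10)]:
        print(frag, pages, suggest_action(frag, pages))
```

Run the query first, then feed each row's fragmentation percentage and page count to the helper to get a shortlist of indexes worth touching.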
Can you share your config from database.properties and central-wrapper.conf? (Please make sure you mask sensitive info.)
Also, look at central/tomcat/temp: do you see a large number of jtds files?
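If the directory is too big to eyeball, a small script can do the counting. This is only a sketch: the function name `count_temp_files` and the `jtds*` file name pattern are assumptions on my side, and you would pass the real path to your central/tomcat/temp directory.

```python
import fnmatch
from pathlib import Path


def count_temp_files(temp_dir, pattern="jtds*"):
    """Return (matching, total) counts of regular files directly in temp_dir.

    Subdirectories (such as javamelody) are not counted and not descended into.
    """
    files = [p for p in Path(temp_dir).iterdir() if p.is_file()]
    matching = [p for p in files if fnmatch.fnmatch(p.name, pattern)]
    return len(matching), len(files)


if __name__ == "__main__":
    import sys

    # Usage: python count_temp.py <path to central/tomcat/temp>
    jtds, total = count_temp_files(sys.argv[1])
    print(f"{jtds} jtds files out of {total} files in total")
```

Running it once a day around the time the worker stops would show whether the file count grows together with the problem.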
As for the conf files, I will share them tomorrow because I am out of the office.
I checked the Tomcat temp directory this morning, and there were ~15000 files.
You might stop OO and delete all files under temp, excluding the javamelody directory (this is where the /oo/monitoring data is stored).
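For the cleanup itself, something along these lines would work. Treat it as a sketch: `clean_temp` and its parameters are names I made up, it defaults to a dry run that deletes nothing, and it assumes OO is already stopped when you run it for real.

```python
import shutil
from pathlib import Path


def clean_temp(temp_dir, keep=("javamelody",), dry_run=True):
    """Remove every entry directly under temp_dir except the names in keep.

    Returns the sorted list of entry names that were (or, in a dry run,
    would be) removed. Stop OO before calling this with dry_run=False.
    """
    removed = []
    for entry in sorted(Path(temp_dir).iterdir()):
        if entry.name in keep:
            continue  # preserve the /oo/monitoring data store
        if not dry_run:
            if entry.is_dir() and not entry.is_symlink():
                shutil.rmtree(entry)
            else:
                entry.unlink()
        removed.append(entry.name)
    return removed


if __name__ == "__main__":
    import sys

    # Usage: python clean_temp.py <path to central/tomcat/temp> [--delete]
    really_delete = "--delete" in sys.argv[2:]
    names = clean_temp(sys.argv[1], dry_run=not really_delete)
    print(("Removed" if really_delete else "Would remove"), len(names), "entries")
```

Run it without `--delete` first and check the count against what you expect before letting it remove anything.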