Issue with search, IRQUEUE gets processed but nothing is added to SCIREXPERT during an IR Regen

My environment is HP SM 9.41 Hybrid, horizontally scaled, with an asynchronous IR setup.

Our text search stopped working due to a corrupted IR index for ir.probsummary, so I triggered an IR Regen on the probsummary table. The regen added about 70,000 records to the SCIREXPERT table with the filename ir.probsummary, but then it got "stuck". The sm.log file showed that the IRQUEUE session was getting a signal 11 error and terminating. I tried to manually start the sm -que:ir process multiple times, but each time it stopped with the same error message in the logs.

Since then I did a full server restart. It seemed to process the entire IRQUEUE table, but it wasn't actually putting any entries in the SCIREXPERT table, and now that the IRQUEUE table has been completely emptied, any text search just says no incidents found. There are no error messages in the sm.log that I can see that would explain this, and I'm not sure what might cause this behaviour. Has anyone else experienced this issue, or something similar, and could help?

  • The signal 11 could be caused by any of the factors mentioned in the KM article. I found IR Regen doesn't do well when you start having millions of records in the db.

    One remedy is to do the regen in dev and port it back to prod. Note: this is not mentioned in KM868376 but is in other KMs: you can increase the shared memory allocation specifically for the IR regen in dev to help it along, since you don't have to worry about production's need for shared memory. Also, to save some time, delete the scirexpert records in dev yourself before you start. Not sure why, but the IR regen's deletion of scirexpert takes longer than doing it manually.

    Or stop using IR: disable it, free up resources and just use the Knowledge Management module. IR is old tech and free; the KM module is newer tech and is continuously developed, but it is not free.

  • The signal 11 error is no longer occurring since the restart. The IRQUEUE is being processed, but incorrectly: entries are cleared from IRQUEUE without any being created in the SCIREXPERT table. Also, we don't have millions of records in probsummary, just around 210,000, so I wouldn't expect that to be too much for the IR Regen to handle.


    I could do the regen in dev, but I'm concerned that the IRQUEUE is no longer getting processed correctly at all. If so, new records won't be searchable until I do another regen in dev each time, which would be a real hassle compared to getting it working correctly in production.

    I have the shared memory set to 256000000. I increased it when I did the restart, since memory isn't really an issue on the server.

    Unfortunately, disabling IR and using the KM module isn't really an option for us right now, since we don't have the license for that module.

    Ideally I would like to troubleshoot this further and get the issue resolved. Actually, now that I look again, I see there is an error message straight after the IRQUEUE gets filled.


    1720( 3908) 10/24/2017 11:10:46 RTE W sqllimit exceeded, user=IRQUEUE limit=5.000 actual=12.594 SQL statement follows
    1720( 3908) 10/24/2017 11:10:46 RTE D 23080554: sqmssqlSelect - EXECUTE:SELECT * FROM IRQUEUEM1 READCOMMITTED WHERE "FILENAME"=? AND "KEYINTERNAL"=? AND "COUNTER"=?

    Not sure why the sqllimit is 5 seconds, my understanding is that it should default to 30 seconds.
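
    To see how often this warning fires and for which session, the log can be scanned with a small script. This is only a sketch: the pattern is modelled on the single warning line quoted above, the function name is made up for illustration, and other SM versions may format the line differently.

```python
import re

# Pattern modelled on the one "sqllimit exceeded" line quoted in this thread.
# Assumption: every such warning in sm.log has the same layout.
SQLLIMIT_RE = re.compile(
    r"RTE W sqllimit exceeded, user=(?P<user>\S+) "
    r"limit=(?P<limit>[\d.]+) actual=(?P<actual>[\d.]+)"
)


def find_sqllimit_warnings(lines):
    """Return (user, limit_seconds, actual_seconds) for each warning line."""
    hits = []
    for line in lines:
        match = SQLLIMIT_RE.search(line)
        if match:
            hits.append(
                (
                    match.group("user"),
                    float(match.group("limit")),
                    float(match.group("actual")),
                )
            )
    return hits


# The two log lines from this post, used as sample input.
sample = [
    "1720( 3908) 10/24/2017 11:10:46 RTE W sqllimit exceeded, "
    "user=IRQUEUE limit=5.000 actual=12.594 SQL statement follows",
    '1720( 3908) 10/24/2017 11:10:46 RTE D 23080554: sqmssqlSelect - '
    'EXECUTE:SELECT * FROM IRQUEUEM1 READCOMMITTED WHERE "FILENAME"=? '
    'AND "KEYINTERNAL"=? AND "COUNTER"=?',
]

print(find_sqllimit_warnings(sample))  # [('IRQUEUE', 5.0, 12.594)]
```

    For a real log, pass the open file object instead of the sample list, since it is iterated line by line in the same way.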


  • I've triggered a new IR Regen of the probsummary table with the "sm -que:ir" process stopped. Once the IRQUEUE table was populated, I manually started the process with the options "sm -que:ir -sqllimit:60 -ir_trace". I'm not getting the SQLLimit message in the logs this time, but it's still doing the same thing: the number of rows in the IRQUEUE table is decreasing without any new entries being added to the SCIREXPERT table.
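
    The symptom can be pinned down by sampling the row counts of both tables a few times while the queue is running and comparing the first and last sample. A minimal sketch follows; the function name and the sample numbers are hypothetical, and how the counts are collected (for example, row counts taken directly in the database) is left open.

```python
def queue_drains_without_indexing(samples):
    """Detect the symptom described above.

    samples: list of (irqueue_rows, scirexpert_rows) tuples, oldest first.
    Returns True when the queue shrank between the first and last sample
    but the index table did not grow.
    """
    if len(samples) < 2:
        return False  # not enough data to compare
    first_queue, first_index = samples[0]
    last_queue, last_index = samples[-1]
    return last_queue < first_queue and last_index <= first_index


# Hypothetical counts, for illustration only.
broken = [(200000, 70000), (150000, 70000), (90000, 70000)]
healthy = [(200000, 70000), (150000, 290000), (90000, 554000)]

print(queue_drains_without_indexing(broken))
print(queue_drains_without_indexing(healthy))
```

    Comparing only the first and last sample keeps the check simple; it does not catch a queue that indexes for a while and then stalls partway through.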

  • I haven't seen such an error before. 

    Do you get the same issue with a different table?

    If it's the same, maybe unload a single scirexpert record and see if you can load it back into scirexpert, to rule out a DB issue.

  • OK, so I've manually run sm -que:ir from the sync server instead of the main primary server in the horizontal scaling environment, and it is processing the records correctly. It's been going for about 14 hours, is halfway through the 200,000 IRQUEUE records, and has created 440,000 records in SCIREXPERTM1 so far.

    It's quite strange though, I have no clue why the behaviour would be so different on one server versus the other. They have identical configuration and software versions including java.
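
    From the figures above, a rough projection of the remaining run time can be worked out. This is only a sketch: it assumes the processing rate stays constant, which the thread does not confirm, and the function name is made up for illustration.

```python
def regen_projection(total_queue, processed, hours_elapsed, index_rows_created):
    """Project the rest of a regen, assuming a constant processing rate."""
    rate_per_hour = processed / hours_elapsed
    remaining_hours = (total_queue - processed) / rate_per_hour
    rows_per_record = index_rows_created / processed
    projected_index_rows = rows_per_record * total_queue
    return rate_per_hour, remaining_hours, rows_per_record, projected_index_rows


# Figures from the post: 14 hours in, half of 200,000 queue records done,
# 440,000 SCIREXPERTM1 rows created so far.
rate, hours_left, per_record, total_rows = regen_projection(
    total_queue=200000,
    processed=100000,
    hours_elapsed=14,
    index_rows_created=440000,
)

print(f"{rate:.0f} queue records/hour, about {hours_left:.1f} hours left")
print(f"{per_record:.1f} index rows per record, about {total_rows:.0f} rows in total")
```

    With these inputs the rate works out to roughly 7,100 queue records per hour, so the second half should take about another 14 hours if nothing changes.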

  • Thanks for sharing the solution. This will be going into my notes.

    I didn't realise it has to be run from the sync server. I would have run it from the main primary server too, but I haven't done this for a long time.

    I did some digging. I suspect the reason may be that the sync server handles the locks, and IR regen needs exclusive locks.

    IR Horizontal Scaling
    Since IR Expert indexes are held in shared memory, locks to these indexes have to be communicated between the different machines in horizontal scaling to prevent IR issues. There are two different kinds of access against the IR indexes in shared memory: add/update when a record was added or updated or an IR regen was performed, and IR searches. Exclusive locks are required for any add / update action, meaning no searches can be performed on an IR file while the IR Index is being updated.

  • That's the thing: I don't think it is meant to be run from the sync server. At least, none of the recommendations I looked at while we set up the environment said so. I only attempted it because I wasn't sure where to go next with my troubleshooting, since it was a frustrating issue. It seems to be working now, so I'm happy about that, as it gives me a viable workaround, but I'll still need to find out why it wasn't working from the primary server like it should be.

  • Have you experienced any other issues lately? e.g. to do with load balancing.

    I'm not sure if IR Regen has anything to do with JGROUP, but the server-to-server comms are handled via JGROUP. You could give this a go to see if the JGROUP comms are affecting your IR regen.

  • Haven't experienced any issues. There are a number of active sessions connected to both servers at the moment, and people are logging/updating incidents/changes/problems without issue, as per normal operations. What sort of issues would you expect to notice if there were JGROUP problems?

  • Users running out of connections, because the load balancer isn't talking to the other servers, so it doesn't get the real session counts and doesn't spread the load. One servlet may end up overloaded while other servlets are free. Come to think of it, you can just run sm -reportlbstatus to check on your lb and confirm whether you have a JGROUP comms issue.