(SM) Support Tip: Generating effective IR stopwords
Service Manager IR expert uses two mechanisms to reduce the diversity of terms it has to store in the IR index: stop lists and lexical analysis.
IR expert uses a list of stop words to exclude terms from the IR index. Typical candidates for the stop word list are terms that appear in many documents or carry no meaning for a typical user. The word "regards" falls into both classes: it appears in most emails exchanged with the support requester, and it has no bearing on the actual support issue.
The other mechanism, lexical analysis, reduces the number of terms stored in the index while still finding relevant documents when a word appears in a slightly different form. For simplicity we refer to it here as "stemming": a word is analyzed and reduced to its word stem, which we refer to as a "term". This enables IR expert to store fewer terms while still finding documents that contain different words derived from the same stem.
Service Manager lets you specify a custom stemming algorithm for any language except English and German, for which it contains a hardcoded stemming algorithm. Because IR expert cannot tell whether a word is English, French, or another language, only one IR language, and therefore only one stemming algorithm, can be configured. In Service Manager instances containing multilingual data, this can have unexpected effects on the stop words that need to be added to the stop word file.
Finally, it is important to understand where we are dealing with words and where with terms:
Words: data input by the user as stored in the journal of an incident
Words: data input by the user into a text search field
Terms: stored in the IR index
Words: stop word list
Terms: VRIR log to verify IR index contents
The typical method of identifying terms to add to the stop word file is to run a Verify IR (VRIR; see below for an example of how to run it). It analyzes the contents of the IR index and dumps all terms in the index into the log file, together with the number of documents (records) each term appears in:
RTE I Term accept is used in 11220 Documents. Hash offset: 7. File offset: 5204485
A few tools exist that automatically generate a list of the terms appearing in more than n documents from the VRIR log file. The attachment includes the ScriptLibrary record IRReadVRIR with the function readVRIRlog(), which does this job. In the context of this ScriptLibrary record, you can execute the function like this to write a file containing all extracted terms:
writeFile( 'c:\\stopterms.txt', 't', readVRIRlog('..\\logs.vrir.log', 10000).join('\n') );
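To illustrate the idea behind this extraction step, here is a minimal sketch of how terms above a document threshold could be pulled out of VRIR log text. It is not the shipped readVRIRlog(): the function name extractFrequentTerms is hypothetical, it works on a string rather than a file, and it assumes every relevant log line has the same shape as the sample line shown above. The second sample line is made up for the example.

```javascript
// Hypothetical helper, NOT the shipped readVRIRlog(): returns the terms that
// appear in more than minDocs documents according to the given VRIR log text.
function extractFrequentTerms(logText, minDocs) {
  // Assumed line shape, taken from the sample log line:
  // "RTE I Term accept is used in 11220 Documents. Hash offset: ..."
  var pattern = /Term (\S+) is used in (\d+) Documents/;
  var lines = logText.split(/\r?\n/);
  var terms = [];
  for (var i = 0; i < lines.length; i++) {
    var match = pattern.exec(lines[i]);
    if (match && parseInt(match[2], 10) > minDocs) {
      terms.push(match[1]);
    }
  }
  return terms;
}

// First line is the sample from this article; the second is invented.
var sample = [
  "RTE I Term accept is used in 11220 Documents. Hash offset: 7. File offset: 5204485",
  "RTE I Term zebra is used in 3 Documents. Hash offset: 9. File offset: 5204999"
].join("\n");

var frequent = extractFrequentTerms(sample, 10000);
```

With a threshold of 10000, only the term from the first sample line passes the filter. Lines that do not match the assumed shape are simply skipped.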
The terms extracted from the VRIR log are word stems, whereas the stop word file contains words. Therefore, the attached unload file contains another tool:
Given a list of terms, it appends one of a list of suffixes (e.g. ed, s, ing, ...) to the term and uses a stemming algorithm to verify that the generated artificial word is stemmed back to the expected term. If so, the artificial word is added to the list of words ready to be appended to the stop word file. If not, it tries the next suffix; if no suffix works, it adds the term to a list of terms that need manual handling.
The stemming algorithm implemented in the tool is an open-source implementation of the same stemming algorithm used by the Service Manager binaries. The tool ships with an English stemmer only.
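The suffix-and-verify procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the code of the attached tool: genStopWordsSketch and toyStem are hypothetical names, and toyStem is a deliberately crude stand-in that strips only three suffixes, whereas the real tool uses a full stemming algorithm. The result object mirrors the "stopwords" and "failed" keys used when calling the tool.

```javascript
// Toy stand-in for a real stemmer (assumption for this sketch only):
// strips "ing", "ed" or "s" if at least three characters remain.
function toyStem(word) {
  var suffixes = ["ing", "ed", "s"];
  for (var i = 0; i < suffixes.length; i++) {
    var s = suffixes[i];
    if (word.length > s.length + 2 && word.slice(-s.length) === s) {
      return word.slice(0, -s.length);
    }
  }
  return word;
}

// For each term, try the suffixes in order and keep the first artificial
// word that the stemmer reduces back to the term. Terms for which no
// suffix works are collected for manual handling.
function genStopWordsSketch(terms, suffixes, stem) {
  var result = { stopwords: [], failed: [] };
  for (var i = 0; i < terms.length; i++) {
    var found = null;
    for (var j = 0; j < suffixes.length && found === null; j++) {
      var candidate = terms[i] + suffixes[j];
      if (stem(candidate) === terms[i]) {
        found = candidate;
      }
    }
    if (found !== null) {
      result.stopwords.push(found);
    } else {
      result.failed.push(terms[i]);
    }
  }
  return result;
}

var out = genStopWordsSketch(["accept", "regard", "up"], ["ed", "s", "ing"], toyStem);
```

Here "accept" and "regard" yield the artificial words "accepted" and "regarded", while "up" ends up in the failed list because the toy stemmer refuses to strip suffixes from words that short. Passing the stemmer in as a parameter keeps the procedure independent of the language-specific algorithm.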
Using the tool in the context of the ScriptLibrary record IRGenStopwords:
var source = readCandidateList( "C:\\stopterms.txt" );
var list = [];
// Option 1: If the source list is already stemmed, just use it
list = source;
// Option 2: If the source list is not stemmed yet, stem it first
// for (var i in source) list.push( lib.IRPorter.stem( source[i] ) );
// genStopWords() returns the generated words and the terms it could not handle
var result = genStopWords( list );
writeStopwordList( "C:\\stopword_candidates.txt", result["stopwords"] );
writeStopwordList( "C:\\stopterms_failed.txt", result["failed"] );
Finally, append the new list of words to your stop word file and sort it, because Service Manager requires the stop word file to be sorted.
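The append-and-sort step can be sketched like this. The function name mergeStopWords is hypothetical, and the sketch assumes that JavaScript's default string sort (by code unit) is acceptable; the exact collation Service Manager expects is not stated here, so verify the ordering against your existing stop word file before using the result.

```javascript
// Hypothetical helper: merges existing stop words with new candidates,
// drops blank entries and duplicates, and returns a sorted list.
// Assumption: default Array.prototype.sort() ordering is what is needed.
function mergeStopWords(existing, additions) {
  var seen = Object.create(null);
  var merged = [];
  var all = existing.concat(additions);
  for (var i = 0; i < all.length; i++) {
    var word = all[i].replace(/^\s+|\s+$/g, ""); // trim whitespace
    if (word !== "" && !seen[word]) {
      seen[word] = true;
      merged.push(word);
    }
  }
  return merged.sort();
}

var mergedList = mergeStopWords(["the", "and", "regards"], ["accepted", "and", ""]);
```

Removing duplicates before sorting avoids growing the file each time the procedure is repeated with overlapping candidate lists.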
Running VRIR from the command line:
HP Service Manager Database Exerciser (Version: 9.40.0000 Build: (r173058)) [02/02/2016 12:07:42]
Enter your choice: opn
Open a file
(Version: 9.40.0000 Build: (r173058)) [02/02/2016 12:07:47]
Enter file name: probsummary
dbInitRelation() for file 'probsummary' returned 0.
Enter your choice: vrir
Validate an IR file
(Version: 9.40.0000 Build: (r173058)) [02/02/2016 12:07:53]
Enter fully qualified name for the IR file: ir.probsummary
IR validation succeeded. See log for details
Enter your choice: x
Closing currently open file 'probsummary'.
dbTermRelation() returned 0