Micro Focus Expert

(SM) Support Tip: Generating effective IR stopwords

Building a full-text index such as the IR index means fighting the diversity of words in natural-language text. Reducing this diversity effectively shrinks the index dramatically, which improves both performance and accuracy.
Service Manager IR expert uses two mechanisms to reduce this diversity: stop lists and lexical analysis.
IR expert uses a list of stop words to exclude terms from the IR index. Typical candidates for the stop word list are terms that appear in many documents or that carry no meaning for a typical user. The word "regards" falls into both classes: it appears in most emails exchanged with the support requester and has no bearing on the actual support issue.
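As a rough illustration of the stop list idea, the following sketch drops stop words from a list of words before they would be indexed. It is not how IR expert is implemented: the function name removeStopWords is hypothetical, and the case-insensitive comparison is an assumption made for the example.

```javascript
// Hypothetical helper, for illustration only: removes stop words from a
// list of words. Case-insensitive matching is an assumption of this sketch.
function removeStopWords(words, stopWords) {
  var stopSet = {};
  for (var i = 0; i < stopWords.length; i++) {
    stopSet[stopWords[i].toLowerCase()] = true;
  }
  var kept = [];
  for (var j = 0; j < words.length; j++) {
    var key = words[j].toLowerCase();
    if (!Object.prototype.hasOwnProperty.call(stopSet, key)) {
      kept.push(words[j]);
    }
  }
  return kept;
}

var words = "printer fails after update best regards".split(" ");
var indexed = removeStopWords(words, ["best", "regards", "after"]);
// indexed is ["printer", "fails", "update"]
```

Only the three words that say something about the issue remain, which is the reduction the stop list is meant to achieve.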
The other mechanism is lexical analysis, which for simplicity we refer to here as "stemming". It reduces the number of terms stored in the index while still finding relevant documents when a word varies slightly. A word is analyzed and reduced to its word stem, which we refer to as a "term". This lets IR expert store fewer terms and still find documents containing different words derived from the same stem.
Service Manager lets you specify a custom stemming algorithm for any language except English and German, for which it contains a hardcoded stemming algorithm. Because IR expert cannot tell whether a word is English, French, or some other language, only one IR language, and therefore only one stemming algorithm, can be configured. In Service Manager instances containing multilingual data, this can have unexpected effects on which stop words need to be added to the stop word file.
Finally, it is important to understand when we are talking about words and terms:

Data input by the user as stored in the journal of an incident: unstemmed word
Data input by the user into a text search field: unstemmed word
Stored in IR index: stemmed term
Stop word list: unstemmed word
VRIR log to verify IR index contents: stemmed term

The typical method of identifying terms to add to the stop word file is to run a Verify IR (VRIR; see the end of this tip for an example of how to run it). It analyses the contents of the IR index and dumps every term in the index into the log file, together with the number of documents (records) it appears in:

RTE I Term accept is used in 11220 Documents. Hash offset: 7. File offset: 5204485

The general HPE recommendation is to add all terms appearing in more than 10,000 documents to the stop word file. It is also advisable to review all terms appearing in more than 5,000 documents for conversational terms that have no relevance for the researcher.
A few tools exist that automatically generate lists of terms appearing in more than n documents from the VRIR log file. The attachment includes the ScriptLibrary record IRReadVRIR with the function readVRIRlog(), which does this job. In the context of this ScriptLibrary record, you can execute the function like this to write a file containing all extracted terms:

writeFile( 'c:\\stopterms.txt', 't', readVRIRlog('..\\logs\\vrir.log', 10000).join('\n') );
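The attached readVRIRlog() is not reproduced here. As a rough idea of what such an extraction involves, the sketch below pulls the term and document count out of log lines shaped like the sample line shown earlier and keeps the terms above a threshold. The function name extractFrequentTerms is hypothetical, the second and third sample lines are made up for illustration, and real log lines may carry additional prefixes, which is why the pattern is not anchored to the start of the line.

```javascript
// Sketch only: NOT the readVRIRlog() implementation from the attachment.
// Returns all terms that appear in more than minDocuments documents.
function extractFrequentTerms(logLines, minDocuments) {
  var pattern = /Term (\S+) is used in (\d+) Documents/;
  var terms = [];
  for (var i = 0; i < logLines.length; i++) {
    var match = pattern.exec(logLines[i]);
    if (match && parseInt(match[2], 10) > minDocuments) {
      terms.push(match[1]);
    }
  }
  return terms;
}

// The first line is the sample from this tip; the other two are invented.
var sampleLines = [
  "RTE I Term accept is used in 11220 Documents. Hash offset: 7. File offset: 5204485",
  "RTE I Term printer is used in 6100 Documents. Hash offset: 9. File offset: 5301122",
  "RTE I Term toner is used in 42 Documents. Hash offset: 3. File offset: 5400010"
];

var stopCandidates = extractFrequentTerms(sampleLines, 10000);  // ["accept"]
var reviewCandidates = extractFrequentTerms(sampleLines, 5000); // ["accept", "printer"]
```

Running the extraction once with 10,000 and once with 5,000 gives the two lists the recommendation above distinguishes: terms to add outright and terms to review by hand.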

When Service Manager loads the stop word file, it automatically stems each loaded word. The terms in the VRIR log, however, are already stemmed. For that reason, simply appending the term list from the VRIR log to the stop word list results in terms being stemmed twice, so many of them may fail to stop the words that appear in many documents.
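The effect of stemming twice can be shown with a deliberately naive stemmer. The toyStem function below is a made-up suffix stripper, not the algorithm used by Service Manager; it only serves to show how a term that is stemmed a second time can drift away from the term stored in the index.

```javascript
// Toy stemmer for illustration only: strips a trailing "ed" or "s" from
// words longer than three characters. NOT the Service Manager algorithm.
function toyStem(word) {
  if (word.length > 3 && /ed$/.test(word)) return word.slice(0, -2);
  if (word.length > 3 && /s$/.test(word)) return word.slice(0, -1);
  return word;
}

// The user typed "needed"; the index (and the VRIR log) holds the term "need".
var indexTerm = toyStem("needed");        // "need"

// Copying "need" from the log into the stop word file stems it again on load.
var loadedStopTerm = toyStem(indexTerm);  // "ne"

// The loaded stop term no longer matches the term stored in the index.
var stillStops = (loadedStopTerm === indexTerm); // false
```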
Therefore the attached unload file contains another tool.
Given a list of terms, it appends one of a list of suffixes (e.g. ed, s, ing, ...) to each term and uses a stemming algorithm to verify that the resulting artificial word stems to the expected term. If so, the artificial word is added to the list of words ready to be appended to the stop word file. If not, the tool tries the next suffix; if no suffix works, it adds the term to a list of terms that need manual handling.
The stemming algorithm implemented in the tool is an open-source implementation of the same stemming algorithm used by the Service Manager binaries. The tool ships with an English stemmer only.
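The suffix-probing idea can be sketched as follows. This is not the code of the attached tool: probeStopWords and toyStem are hypothetical names, the toy stemmer is a stand-in that only strips "ed" or "s", and the result keys "stopwords" and "failed" simply mirror the ones used in the usage example in this tip. In the ScriptLibrary context a real stemmer would presumably be passed in instead of the toy one.

```javascript
// Toy stemmer for illustration only; NOT the Service Manager algorithm.
function toyStem(word) {
  if (word.length > 3 && /ed$/.test(word)) return word.slice(0, -2);
  if (word.length > 3 && /s$/.test(word)) return word.slice(0, -1);
  return word;
}

// For each term, try the suffixes in order until the artificial word
// (term + suffix) stems back to exactly that term.
function probeStopWords(terms, suffixes, stem) {
  var result = { stopwords: [], failed: [] };
  for (var i = 0; i < terms.length; i++) {
    var term = terms[i];
    var found = null;
    for (var j = 0; j < suffixes.length && found === null; j++) {
      var candidate = term + suffixes[j];
      if (stem(candidate) === term) found = candidate;
    }
    if (found !== null) {
      result.stopwords.push(found); // safe to append to the stop word file
    } else {
      result.failed.push(term);     // no suffix worked: handle manually
    }
  }
  return result;
}

var probe = probeStopWords(["accept", "go", "a"], ["s", "ed"], toyStem);
// probe.stopwords is ["accepts", "goed"]; probe.failed is ["a"]
```

In this run "accept" succeeds with the first suffix, "go" only with the second (producing the artificial word "goed"), and "a" is too short for either toy rule, so it lands in the manual list.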

Using the tool in the context of the ScriptLibrary record IRGenStopwords:

var source = readCandidateList( "C:\\stopterms.txt" );
var list = [];

// Option 1: If the source list is already stemmed, just use it
list = source;

// Option 2: If the source list is not yet stemmed, stem it first
// for (var i = 0; i < source.length; i++) list.push( lib.IRPorter.stem( source[i] ) );

var result = genStopWords( list );

// Words verified to stem to the expected term, ready for the stop word file
writeStopwordList( "C:\\stopword_candidates.txt", result["stopwords"] );
// Terms for which no suffix worked; these need manual handling
writeStopwordList( "C:\\stopterms_failed.txt", result["failed"] );

Finally, append the new list of words to your stop words file and sort it, as Service Manager requires the stop word file to be sorted.
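A minimal sketch of this final step, assuming both lists are already available as arrays of words. The function name mergeStopWords is hypothetical, removing duplicates is an addition of this sketch, and the plain character-code ordering of Array.prototype.sort is an assumption; the tip does not state which sort order Service Manager expects.

```javascript
// Hypothetical helper: appends new words to an existing stop word list,
// trims whitespace, drops empty entries and duplicates, and sorts the result.
// Assumption: default character-code ordering is acceptable.
function mergeStopWords(existing, additions) {
  var seen = {};
  var merged = [];
  var all = existing.concat(additions);
  for (var i = 0; i < all.length; i++) {
    var word = all[i].replace(/^\s+|\s+$/g, "");
    if (word !== "" && !Object.prototype.hasOwnProperty.call(seen, word)) {
      seen[word] = true;
      merged.push(word);
    }
  }
  return merged.sort();
}

var mergedList = mergeStopWords(["regards", "the"], ["accepts", "the", " need "]);
// mergedList is ["accepts", "need", "regards", "the"]
```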

Running VRIR from the command line:


RUN>sm -util -log:..\logs\vrir.log
HP Service Manager Database Exerciser (Version: 9.40.0000 Build: (r173058)) [02/02/2016 12:07:42]
Enter your choice: opn
Open a file
(Version: 9.40.0000 Build: (r173058)) [02/02/2016 12:07:47]
Enter file name: probsummary
dbInitRelation() for file 'probsummary' returned 0.
Enter your choice: vrir
Validate an IR file
(Version: 9.40.0000 Build: (r173058)) [02/02/2016 12:07:53]
Enter fully qualified name for the IR file: ir.probsummary
IR validation succeeded. See log for details
Enter your choice: x
Closing currently open file 'probsummary'.
dbTermRelation() returned 0