Acclaimed Contributor

file system crawling

Experts,

I am trying to set up file system crawling on HP SM v9.41 with the SOLR search engine and am having some challenges. Below are the steps followed so far; please let me know what is missing.

  1. Added the knowledgebase for the filesystem library.

[screenshot: crawl 1.png]

2. Under the 'Type Information' tab, added the below values:

Start Path: contains the shared drive name on the KM Master / Crawler host server.

(Note: I have created a folder under the D: drive of the search server host, added a 'shareddocuments' folder, and set advanced sharing to Everyone.)
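A roughly equivalent way to create that share from the command line would be something like the following (the share name and folder path are assumptions based on the note above):

    :: hypothetical command; share name and folder path assumed from the note above
    net share SharedDocuments="D:\shareddocuments" /GRANT:Everyone,READ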

[screenshot: crawl 2.JPG]

3. On the KM Search Engine / Crawler server, I modified the 'crawl-urlfilter' file value as below:

[screenshot: crawl 3.JPG]
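For comparison, a minimal sketch of what a crawl-urlfilter entry for a local file path typically looks like (assuming the Nutch regex URL filter format; the path is taken from the shared folder noted above and may differ from the actual Start Path):

    # accept everything under the local shared-documents folder (path is an assumption)
    +^file:///D:/shareddocuments/
    # reject everything else
    -.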

4. The seed text value shows as below:

[screenshot: crawl 4.JPG]
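The seed text would then contain the matching start URL, for example (again assuming the same path as above):

    file:///D:/shareddocuments/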

5. Restarted the SM services and the KM Search Engine services.

6. In KMCrawler / Logs / hadoop, I see the below warnings and errors:

[screenshot: crawl 5.JPG]

7. I see the below entries in the sm.log file:

[screenshot: crawl 6.JPG]

Acclaimed Contributor

Re: file system crawling

2017-09-08 01:05:29,015 ERROR job.JobUtil - Job : LocalFile not found in job file
2017-09-08 01:05:29,030 ERROR job.JobUtil - Trigger: Trigger_LocalFile not found in job file
2017-09-08 01:05:29,030 ERROR job.JobUtil - Trigger: Trigger_LocalFile_Immediate not found in job file
2017-09-08 01:05:35,874 WARN mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2017-09-08 01:05:36,390 WARN regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2017-09-08 01:05:37,124 WARN mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2017-09-08 01:05:37,327 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-09-08 01:05:38,249 WARN mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2017-09-08 01:05:38,374 WARN regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
2017-09-08 01:05:38,390 WARN regex.RegexURLNormalizer - can't find rules for scope 'generate_host_count', using default
2017-09-08 01:05:40,343 WARN mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2017-09-08 01:05:40,468 WARN regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
2017-09-08 01:05:41,437 WARN mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2017-09-08 01:05:41,562 WARN parse.ParsePluginsReader - No aliases defined in parse-plugins.xml!
2017-09-08 01:05:42,624 WARN regex.RegexURLNormalizer - can't find rules for scope 'outlink', using default
2017-09-08 01:05:43,515 WARN mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2017-09-08 01:05:43,609 WARN regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default
2017-09-08 01:05:43,671 WARN regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default
2017-09-08 01:05:43,703 WARN regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default
2017-09-08 01:05:44,609 WARN mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2017-09-08 01:05:45,687 WARN mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2017-09-08 01:08:29,129 ERROR job.JobUtil - Trigger: Trigger_LocalFile_Immediate not found in job file
Micro Focus Expert

Re: file system crawling

Two things when it comes to file system crawling:

1. SOLR won't crawl a UNC path (this was never fixed in the SOLR world, but does work when using Smart Analytics and IDOL)

2. You need to turn on a very verbose debug parameter in the sm.ini in order for the crawling to actually function. 

Below are the workarounds; you need to do both in order for this to work.

For Item 1:

1. Seems there is a bug within our implementation of SOLR/Nutch that breaks the ability to crawl a UNC path. Workarounds are to crawl the local drive letter folder (which is not optimal) or point a webserver to this directory and crawl the webserver's URL (preferable; see the sketch just below).
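As a sketch of the webserver approach (assuming Apache httpd 2.4; the alias name /kmdocs and the folder path are illustrative, not from this thread):

    # httpd.conf excerpt: expose the shared documents folder over HTTP
    Alias /kmdocs "D:/shareddocuments"
    <Directory "D:/shareddocuments">
        Options +Indexes        # enable directory listing so the crawler can follow links to the files
        Require all granted     # Apache 2.4 access control; tighten as needed
    </Directory>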

For Item 2:

The problem is that without turning on the debug parameter the quartz scheduler doesn’t function:

1. Edit the sm.ini file
2. Add a new comment line titled #Fsyslib Crawling Workaround (sm.ini treats lines starting with # as comments)
3. Add the following parameter: log4jDebug:com.hp.ov.sm.server.plugins.knowledgemanagement.solr.KMSolrSearch
4. Add the following parameter: numberoflogfiles:7
5. Add the following parameter: maxlogsize:10485760
6. An example of how this should look is below:

#Fsyslib Crawling Workaround
log4jDebug:com.hp.ov.sm.server.plugins.knowledgemanagement.solr.KMSolrSearch
numberoflogfiles:7
maxlogsize:10485760

7. Save sm.ini file
8. Stop and restart Service Manager server
9. Configure a fsyslib

Acclaimed Contributor

Re: file system crawling

Thank you for the reply, Brett.

  1. I have mentioned the absolute path on the 'Master Search / Crawler' server (D:/Shared Contents/) while configuring the filesystem library.
  2. I already have the below parameters set in the sm.ini file (numberoflogfiles:7 and maxlogsize:5MB were already set in our ini); I only had to add the log4jDebug parameter:
    numberoflogfiles:7
    maxlogsize:5242880
    #KM Crawler Settings
    log4jDebug:com.hp.ov.sm.server.plugins.knowledgemanagement.solr.KMSolrSearch
Acclaimed Contributor

Re: file system crawling


 

 

For Item 1, you wrote:

"1. Seems there is a bug within our implementation of SOLR/Nutch that breaks the ability to crawl a UNC path. Workarounds are to crawl the local drive letter folder (which is not optimal) or point a webserver to this directory and crawl the webserver's URL (preferable)"

Regarding the above step: are you suggesting not to use a local drive with file system library crawling, but rather to use web crawling by pointing a webserver at this directory? Could you please tell me why filesystem crawling is not preferred?

 

Micro Focus Expert

Re: file system crawling

So, as I mentioned, UNC paths won't work. Knowing that, you'd need to use an actual drive path like "D:\my_file_directory\myfiles". If you really think about it, will everyone on the network searching SM knowledge have access to some server's D drive (or maybe an X drive) without a physical mapping? UNC paths would get around that problem, but, like I said, they won't work.

Therefore, to control access, you could set up a webserver to serve the files and have the crawler point to that URL, as sketched below.
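A sketch of the corresponding crawler changes under that approach (the hostname and alias are illustrative, not from this thread):

    # seed text: point at the webserver instead of a file path
    http://kmweb.example.com/kmdocs/
    # crawl-urlfilter: accept only that URL space, reject the rest
    +^http://kmweb\.example\.com/kmdocs/
    -.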
