WebConnector 11.6 StayOnSite

Why does the WebConnector (v.11.6) continue to try to crawl URLs that are not on the start point when I have StayOnSite=TRUE? I have even gone to great lengths to create SpiderUrlCantHaveRegex patterns to get these URLs excluded but the WebConnector continues to attempt to contact the URLs through the proxy server.
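For reference, the configuration being described looks roughly like this. This is an illustrative sketch only: the task name, start URL, and regex pattern are made up, and the exact section layout may vary between connector versions; only the parameter names StayOnSite and SpiderUrlCantHaveRegex come from the question itself.

```ini
[MyWebSite]
; Illustrative start point, not the actual site from the question
Url0=http://www.example.com/
; Restrict the spider to the start site
StayOnSite=True
; Attempted exclusion pattern (example only)
SpiderUrlCantHaveRegex=.*othersite\.example\..*
```

Despite settings along these lines, requests to off-site URLs still appear in the proxy logs.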

  • This might be one for our support team, ideally with logs and configuration files shared. Here is some feedback I received, though:

    The best guess is that the connector is behaving correctly in that it is sticking to pages from that website, but because you are looking at proxy traffic, you see other URLs being downloaded. We embed Chrome, which will download resources from other sites in order to render a complete page. It won't save those resources, though.

    Caroline Oest

    Micro Focus Customer Experience Marketing


  • I think it would be a good idea to add a parameter that instructs the WebConnector to "absolutely stay on site" and not attempt to contact any "outside" URLs at all. Do you agree that such a parameter would be useful?

  • This is also occurring on v12 web crawlers.

    I've added "spider must/cant...", "must have", and "cant have" options, and those sites still get hit. It slows everything down.

  • If you're certain that this is the result of following links, I'd recommend opening a support ticket. While the answer regarding the proxy is certainly a possibility, it's not necessarily established functionality.

    If, however, the specific items being requested are things like images, JavaScript, or other resources that may be referenced in the HTML but not necessarily in an <a href="..."> hyperlink, then you may want to look at ResourceUrlMustHaveRegex to prevent attempts at loading them. (This parameter was added in 12.4.)
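    As a sketch of that suggestion, something along these lines would restrict resource requests (images, scripts, stylesheets) to the site's own domain. The task name, URL, and pattern are illustrative; ResourceUrlMustHaveRegex is the parameter named above and requires version 12.4 or later:

    ```ini
    [MyWebSite]
    ; Illustrative start point
    Url0=http://www.example.com/
    StayOnSite=True
    ; Only fetch embedded page resources from the site itself (example pattern)
    ResourceUrlMustHaveRegex=.*example\.com.*
    ```

    With a pattern like this, the embedded browser should skip off-site resource requests rather than routing them through the proxy, which may also address the slowdown mentioned earlier.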