Configuring Site Crawler


The Site Crawler configuration file points to an HTML site for Site Crawler to crawl. Site Crawler parses the static HTML content and indexes the following default values in the main index:

  • URL
  • Title
  • Abstract
  • BodyCopy
  • Headings (H1-H6)

Site Crawler follows links on a page as deep as they go and puts an entry in the indexing log as Follows Links. This log entry also shows response codes if Site Crawler attempts to follow a link and is unable to resolve content for any reason (e.g., 404 response).

Note
Site Crawler only searches for HTML content and does not crawl content within binary documents (e.g., PDF and Microsoft Word documents).

To configure Site Crawler:

  1. Navigate to [Drive]:\[path-to-DSS-root-directory], and open Search.config in a text editor.
    Important
    Search.config acts as a hub for all InSite Search configurations and must be configured before configuring SiteCrawlerSource.config.
    The example below displays generic code contained within Search.config.
    Sample Search.config
    <?xml version="1.0"?>
    <configuration>
        <configSections>
            <section name="Search"
                type="Ingeniux.Search.Configuration.IndexingConfiguration, Ingeniux.Search"/>
        </configSections>
        <Search indexLocation="App_Data\LuceneIndex"
            synonymsLocation="[Drive]:\[path to DSS root directory]\published\iss-config\Synonyms.xml"
            indexingEnabled="true" queryMaxClauses="1024">
            <Hiliter startTag="&lt;strong&gt;" endTag="&lt;/strong&gt;"/>
            <Settings>
                <add name="defaultIndexingAnalyzer"
                    value="Ingeniux.Search.Analyzers.StemmingIndexingAnalyzer, Ingeniux.Search"/>
                <add name="defaultQueryAnalyzer"
                    value="Ingeniux.Search.Analyzers.StemmingQueryAnalyzer, Ingeniux.Search"/>
                <add name="QueryFieldsFileLocation"
                    value="[Drive]:\[path to DSS root directory]\[subfolder(s)]\QueryFields.xml"/>
                <add name="DocumentBoostByFacetsFileLocation"
                    value="App_Data\DocumentBoostByFacetsFileLocation.xml"/>
                <add name="GSearchFieldMapping"
                    value="[Drive]:\[path to DSS root directory]\[subfolder(s)]\GSearchFieldMapping.xml"/>
            </Settings>
            <IndexingSources>
                <add name="CMSPublishedContent" type="Ingeniux.Runtime.Search.DssContentSearchSource"
                    settingsFile="[Drive]:\[path to DSS root directory]\settings\SearchSource.config"/>
                <add name="KeyMatch" type="Ingeniux.Search.KeyMatchSearchDocumentSource"
                    settingsFile="App_Data\KeymatchSource.config"/>
                <add name="SpellCheckDictionary" type="Ingeniux.Search.SpellCheckerSearchDocumentSource"
                    settingsFile="App_Data\spellcheckerSource.config"/>
                <add name="SiteCrawlerSource" type="Ingeniux.Search.HtmlSiteSource"
                    settingsFile="App_Data\sitecrawlerSource.config"/>
                <add name="Analytics" type="Ingeniux.Search.AnalyticsSearchDocumentSource"
                    settingsFile="App_Data\analyticSource.config"/>        
            </IndexingSources>
            <SearchProfiles>
                <add name="Independent-search-profile-1">
                    <Sources>
                        <add name="KeyMatch"/>
                        <add name="SpellCheckDictionary"/>
                        <add name="Analytics"/>
                    </Sources>
                </add>
            </SearchProfiles>
        </Search>
    </configuration>
  2. If an <add> element doesn't already exist for Site Crawler within the <IndexingSources> element, create one. Complete the following steps in the <add> element.
    1. Enter an arbitrary, unique name as the value of the @name attribute for identification purposes.
    2. Enter Ingeniux.Search.HtmlSiteSource as the value of the @type attribute.
    3. Enter the path to SiteCrawlerSource.config as the value of the @settingsFile attribute.
      Note
      You can use a relative path to the default site crawler configuration file (i.e., App_Data/SiteCrawlerSource.config), which is included during installation.
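    As a rough sketch of the three sub-steps above, an <add> entry inside <IndexingSources> might look like the following. The name MySiteCrawler is a hypothetical placeholder chosen for illustration; the @type value and the relative @settingsFile path follow the sample Search.config shown earlier.

    ```xml
    <IndexingSources>
        <!-- "MySiteCrawler" is a hypothetical name; the source only requires an arbitrary, unique value. -->
        <add name="MySiteCrawler" type="Ingeniux.Search.HtmlSiteSource"
            settingsFile="App_Data\SiteCrawlerSource.config"/>
    </IndexingSources>
    ```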
  3. Save and close Search.config.
  4. Navigate to SiteCrawlerSource.config, and open the file in a text editor.
    Note
    By default, SiteCrawlerSource.config resides in [Drive]:\[path-to-cms-site]\App_Data. However, this source file can reside elsewhere. You can also download a sample SiteCrawlerSource.config and then save the file to App_Data or a custom location.
    Sample SiteCrawlerSource.config
    <?xml version="1.0" encoding="utf-8"?>
    <configuration>
        <configSections>
            <section name="Search" type="Ingeniux.Search.Configuration.IndexingSourceConfig, Ingeniux.Search" />
        </configSections>
        <Search>
            <Settings>
                <!-- URL to crawl. -->
                <add name="url" value="" />
                <!-- Value for how often to cycle (e.g., 1 time each week) -->
                <add name="cycle" value="1" />
                <!-- cycleUnit value can be week, day, hour. -->
                <add name="cycleUnit" value="week" />
                <!-- Batch size to process into index. -->
                <add name="batchSize" value="200" />
                
                <!-- Sets up start time for the next crawl. Delete timestamp file for this option to be honored. -->
                <!-- <add name="time" value="16:25:00" /> -->
                
                <!-- <add name="querystringsToIgnore" value="fullIgnore|partial1*|*partial2|*partial3*" /> -->
                
                <!-- If <link rel="canonical" href /> exists, use that HREF value as the url of the crawled page. -->
                <add name="useCanonicalUrlInHtml" value="false" />
                
                <!-- The prefixes to restrict urls to crawl. Absolute urls.
                     When specified, crawled page url must start with one of the given prefixes to be indexed. -->
                <add name="urlRestrictionPrefixes" value="" />
                
                <!-- When enabled, eliminate duplicates, when url matches, but cases are different. -->
                <add name="caseInsensitiveUrlMatch" value="false" />
                
                <!-- Pages that start with the provided prefix will be excluded from indexing during site crawl. -->
                <add name="urlExclusionPrefixes" value="http://mysite.com/this/, http://mysite.com/that/" />
            </Settings>
        </Search>
    </configuration>
  5. Modify the @value attributes on the following <add> elements to suit your needs.
    Each setting is listed with its default value, possible values, example value, description, and notes, where applicable.

    url
      • Default value: -empty-
      • Example value: https://sites.google.com/view/testindex/home
      • Description: URL to the root of the crawled site.

    cycle
      • Default value: -empty-
      • Example value: 1
      • Description: An integer value to use in the defined re-crawl cycle. Represents how many times the site is crawled within each cycleUnit.

    cycleUnit
      • Default value: -empty-
      • Possible values: week, day, hour
      • Example value: hour
      • Description: Sets the cadence for re-crawling content.

    batchSize
      • Default value: 200
      • Example value: 200
      • Description: Batch size to process into index.

    time
      • Default value: -empty-
      • Example value: 16:25:00
      • Description: Sets the start time for the next crawl.
      • Notes: If the time value in Site Crawler is defined, InSite Search ignores the timestamp file. In Ingeniux CMS 10.x, for example, there is no need to delete the timestamp file each time the app is restarted in order to honor the time value.

    Version Notes: InSite Search 2.13+
    The settings below only apply to InSite Search 2.13+.

    querystringsToIgnore
      • Default value: -empty-
      • Possible values: fullIgnore, *partial1, partial2*, *partial3*
      • Example value: partial2*
      • Description: The ignore querystring feature is used to differentiate URLs. The URL stored in Lucene for a given page still retains the full querystring.
        • fullIgnore: The full ignore matches on the querystring (i.e., ignoreTerm).
        • *partial1: The leading wildcard ignore matches the trailing part of a querystring (i.e., *ignoreTerm).
        • partial2*: The trailing wildcard ignore matches the leading part of a querystring (i.e., ignoreTerm*).
        • *partial3*: The full wildcard ignore matches the querystring contained by the wildcards (i.e., *ignoreTerm*).

      Important
      If you enter the wildcard symbol (*) by itself as a value, all query strings are ignored.

      When setting querystringsToIgnore, consider the scenario in which querystringsToIgnore is defined as, for example, myquerystring. The crawler finds a new link, http://mysite.com/thispage.html?myquerystring=test, and an existing page is already indexed as http://mysite.com/thispage.html. With the querystring ignored, these URLs are considered a match, and the newly crawled link is not processed.

    useCanonicalUrlInHtml
      • Default value: false
      • Possible values: true, false
      • Example value: true
      • Description: If <link rel="canonical" href /> exists, use the @href value as the URL of the crawled page.

    urlRestrictionPrefixes
      • Default value: -empty-
      • Example value: http://mysite.com/this/, https://yoursite.com/that/
      • Description: In order to be indexed, crawled page URLs must start with one of the given prefix values. Comma delimited. Not case-sensitive. Absolute URLs required.
      • Notes: The crawler traverses any links that match the base URL of the site but only indexes pages whose URLs start with one of the defined urlRestrictionPrefixes values (e.g., http://mycrawlsite.com/crawl/page1.html will be crawled and indexed, but http://mycrawlsite.com/hello/ will be crawled but not indexed).

    caseInsensitiveUrlMatch
      • Default value: false
      • Possible values: true, false
      • Example value: true
      • Description: When enabled, this option eliminates duplicates when URLs match but differ only in letter case.

    urlExclusionPrefixes
      • Default value: -empty-
      • Example value: http://mysite.com/this/, https://yoursite.com/that/
      • Description: Comma-delimited list of URL prefixes. Pages that start with a provided prefix are excluded from indexing during site crawl. Not case-sensitive. Absolute URLs required.
        • If the prefix ends with a forward slash (e.g., http://mysite.local/here/), then all URLs that start with this value will not be indexed.
        • If the prefix does not end with a forward slash (e.g., http://mysite.local/helloworld), it is an exact match for exclusion.
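    To illustrate how the settings described above fit together, here is a minimal sketch of a filled-in <Settings> element for SiteCrawlerSource.config. All URLs are hypothetical placeholders on the mysite.com example domain, and the chosen values are assumptions for illustration rather than recommended defaults.

    ```xml
    <Settings>
        <!-- Hypothetical site root; replace with the site to crawl. -->
        <add name="url" value="http://mysite.com/" />
        <!-- Re-crawl cadence: a cycle of 1 with a cycleUnit of day. -->
        <add name="cycle" value="1" />
        <add name="cycleUnit" value="day" />
        <add name="batchSize" value="200" />
        <!-- Start time for the next crawl. -->
        <add name="time" value="16:25:00" />
        <!-- InSite Search 2.13+ settings follow. -->
        <!-- Ignore the myquerystring querystring when comparing URLs. -->
        <add name="querystringsToIgnore" value="myquerystring" />
        <add name="useCanonicalUrlInHtml" value="true" />
        <add name="caseInsensitiveUrlMatch" value="true" />
        <!-- First prefix ends with a slash, so every URL under it is excluded.
             Second prefix has no trailing slash, so it is an exact-match exclusion. -->
        <add name="urlExclusionPrefixes" value="http://mysite.com/archive/, http://mysite.com/helloworld" />
    </Settings>
    ```

    In this sketch, urlRestrictionPrefixes is omitted, so indexing is not restricted to particular prefixes; add it only if crawled pages should be indexed from specific sections of the site.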
  6. Save and close SiteCrawlerSource.config when you finish.
  7. Choose one of the following steps to generate the Site Crawler index.
    • To generate the initial indexing of the Site Crawler URL(s), recycle the search service application pool.
    • To force a re-crawl, delete the SiteCrawlerSource_LastIndexed.timestamp file from the App_Data folder, and then recycle the search service application pool.