
Configuring Site Crawler

SiteCrawlerSource.config points Site Crawler to an HTML site to crawl. Site Crawler parses the static HTML content and indexes the following default values in the main index:

URL
Title
Abstract
BodyCopy
Headings (H1-H6)

Site Crawler follows links on a page as deep as they go and puts an entry in the indexing log as Follows Links. This log entry also shows response codes if Site Crawler attempts to follow a link and is unable to resolve content for any reason (e.g., 404 response).

Note

Site Crawler only searches for HTML content and does not crawl content within binary documents (e.g., PDF and Microsoft Word documents).

To configure Site Crawler:

Navigate to [Drive]:\[path-to-DSS-root-directory], and open Search.config in a text editor.

Important

Search.config acts as a hub for all InSite Search configurations and must be configured before configuring SiteCrawlerSource.config.

The example below displays generic code contained within Search.config.

Sample Search.config

<?xml version="1.0"?>
<configuration>
    <configSections>
        <section name="Search"
            type="Ingeniux.Search.Configuration.IndexingConfiguration, Ingeniux.Search"/>
    </configSections>
    <Search indexLocation="App_Data\LuceneIndex"
        synonymsLocation="[Drive]:\[path to DSS root directory]\published\iss-config\Synonyms.xml"
        indexingEnabled="true" queryMaxClauses="1024">
        <Hiliter startTag="&lt;strong&gt;" endTag="&lt;/strong&gt;"/>
        <Settings>
            <add name="defaultIndexingAnalyzer"
                value="Ingeniux.Search.Analyzers.StemmingIndexingAnalyzer, Ingeniux.Search"/>
            <add name="defaultQueryAnalyzer"
                value="Ingeniux.Search.Analyzers.StemmingQueryAnalyzer, Ingeniux.Search"/>
            <add name="QueryFieldsFileLocation"
                value="[Drive]:\[path to DSS root directory]\[subfolder(s)]\QueryFields.xml"/>
            <add name="DocumentBoostByFacetsFileLocation"
                value="App_Data\DocumentBoostByFacetsFileLocation.xml"/>
            <add name="GSearchFieldMapping"
                value="[Drive]:\[path to DSS root directory]\[subfolder(s)]\GSearchFieldMapping.xml"/>
        </Settings>
        <IndexingSources>
            <add name="CMSPublishedContent" type="Ingeniux.Runtime.Search.DssContentSearchSource"
                settingsFile="[Drive]:\[path to DSS root directory]\settings\SearchSource.config"/>
            <add name="KeyMatch" type="Ingeniux.Search.KeyMatchSearchDocumentSource"
                settingsFile="App_Data\KeymatchSource.config"/>
            <add name="SpellCheckDictionary" type="Ingeniux.Search.SpellCheckerSearchDocumentSource"
                settingsFile="App_Data\spellcheckerSource.config"/>
            <add name="SiteCrawlerSource" type="Ingeniux.Search.HtmlSiteSource"
                settingsFile="App_Data\sitecrawlerSource.config"/>
            <add name="Analytics" type="Ingeniux.Search.AnalyticsSearchDocumentSource"
                settingsFile="App_Data\analyticSource.config"/>        
        </IndexingSources>
        <SearchProfiles>
            <add name="Independent-search-profile-1">
                <Sources>
                    <add name="KeyMatch"/>
                    <add name="SpellCheckDictionary"/>
                    <add name="analytics"/>
                </Sources>
            </add>
        </SearchProfiles>
    </Search>
</configuration>

If an <add> element doesn't already exist for Site Crawler within the <IndexingSources> element, create one. Complete the following steps in the <add> element.
1. Enter an arbitrary, unique name as the value of the @name attribute for identification purposes.
2. Enter Ingeniux.Search.HtmlSiteSource as the value of the @type attribute.
3. Enter the path to SiteCrawlerSource.config as the value of the @settingsFile attribute.
  Note
  You can use a relative path to the default site crawler configuration file (i.e., App_Data/SiteCrawlerSource.config), which is included during installation.
Save and close Search.config.
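
As a minimal sketch of the three steps above, the resulting entry might look like the fragment below. The name MySiteCrawler is a hypothetical value chosen for illustration (any unique name should work), and the relative path assumes the default configuration file location.

```xml
<IndexingSources>
    <!-- "MySiteCrawler" is a hypothetical name; use any unique value. -->
    <!-- The relative settingsFile path assumes the default file in App_Data. -->
    <add name="MySiteCrawler" type="Ingeniux.Search.HtmlSiteSource"
        settingsFile="App_Data\SiteCrawlerSource.config"/>
</IndexingSources>
```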

Navigate to SiteCrawlerSource.config, and open the file in a text editor.

Note

By default, SiteCrawlerSource.config resides in [Drive]:\[path-to-cms-site]\App_Data. However, this source file can reside elsewhere. You can also download a sample SiteCrawlerSource.config and then save the file to App_Data or a custom location.

Sample SiteCrawlerSource.config

<?xml version="1.0" encoding="utf-8"?>
<configuration>
    <configSections>
        <section name="Search" type="Ingeniux.Search.Configuration.IndexingSourceConfig, Ingeniux.Search" />
    </configSections>
    <Search>
        <Settings>
            <!-- URL to crawl. -->
            <add name="url" value="" />
            <!-- Value for how often to cycle (e.g., 1 time each week) -->
            <add name="cycle" value="1" />
            <!-- cycleUnit value can be week, day, hour. -->
            <add name="cycleUnit" value="week" />
            <!-- Batch size to process into index. -->
            <add name="batchSize" value="200" />
            
            <!-- Sets up start time for the next crawl. Delete timestamp file for this option to be honored. -->
            <!-- <add name="time" value="16:25:00" /> -->
            
            <!-- <add name="querystringsToIgnore" value="fullIgnore|partial1*|*partial2|*partial3*" /> -->
            
            <!-- If <link rel="canonical" href /> exists, use that HREF value as the url of the crawled page. -->
            <add name="useCanonicalUrlInHtml" value="false" />
            
            <!-- The prefixes to restrict urls to crawl. Absolute urls.
                 When specified, crawled page url must start with one of the given prefixes to be indexed. -->
            <add name="urlRestrictionPrefixes" value="" />
            
            <!-- When enabled, eliminate duplicates, when url matches, but cases are different. -->
            <add name="caseInsensitiveUrlMatch" value="false" />
            
            <!-- Pages that start with the provided prefix will be excluded from indexing during site crawl. -->
            <add name="urlExclusionPrefixes" value="http://mysite.com/this/, http://mysite.com/that/" />
        </Settings>
    </Search>
</configuration>

Modify the following attribute values on the <add> elements to suit your needs.

Name	Default Values	Possible Values	Example Values	Description	Notes
url	-empty-		https://sites.google.com/view/testindex/home	URL to the root of crawled site.
cycle	-empty-		1	Integer value used with cycleUnit to define the re-crawl cycle. For example, a cycle of 1 with a cycleUnit of week crawls the site one time each week.
cycleUnit	-empty-	week day hour	hour	This sets up the cadence for re-crawling content.
batchSize	200		200	Batch size to process into index.
time	-empty-		16:25:00	Sets up the start time for the next crawl.	If the time value in Site Crawler is defined, InSite Search ignores the timestamp file. In Ingeniux CMS 10.x, for example, there is no need to delete the timestamp file each time the app is restarted in order to honor the time value.
Version Notes: The settings below apply only to InSite Search 2.13+.
querystringsToIgnore	-empty-	fullIgnore partial1* *partial2 *partial3*	partial2*	Query strings to ignore when differentiating URLs. The URL stored in Lucene for a given page retains the full querystring. Full ignore (i.e., ignoreTerm) matches on the whole querystring. Leading wildcard ignore (i.e., *ignoreTerm) matches the trailing part of a querystring. Trailing wildcard ignore (i.e., ignoreTerm*) matches the leading part of a querystring. Full wildcard ignore (i.e., *ignoreTerm*) matches a querystring that contains the term between the wildcards.	Important: If you enter the wildcard symbol (*) by itself as a value, all query strings are ignored. When setting querystringsToIgnore, consider the scenario in which querystringsToIgnore is defined, for example, as myquerystring. The crawler finds a new link, http://mysite.com/thispage.html?myquerystring=test, and there is an existing page indexed as http://mysite.com/thispage.html. With the querystring ignored, these URLs are considered a match, and the newly crawled link is not processed.
useCanonicalUrlInHtml	false	true false	true	If `<link rel="canonical" href />` exists, use the `@href` value as the URL of the crawled page.
urlRestrictionPrefixes	-empty-		http://mysite.com/this/, https://yoursite.com/that/	In order to be indexed, crawled page URLs must start with one of the given prefix values. Comma delimited. Not case-sensitive. Absolute URLs required.	The crawler traverses any links that match the base URL of the site but indexes only pages whose URLs start with one of the values defined in urlRestrictionPrefixes. For example, with a prefix of http://mycrawlsite.com/crawl/, the page http://mycrawlsite.com/crawl/page1.html is crawled and indexed, but http://mycrawlsite.com/hello/ is crawled and not indexed.
caseInsensitiveUrlMatch	false	true false	true	When enabled, this option eliminates duplicates when URLs match except for letter case.
urlExclusionPrefixes	-empty-		http://mysite.com/this/, https://yoursite.com/that/	Comma-delimited list of URL prefixes. Pages whose URLs start with a provided prefix are excluded from indexing during the site crawl. Not case-sensitive. Absolute URLs required.	If the prefix ends with a forward slash (e.g., http://mysite.local/here/), then all URLs that start with this value will not be indexed. If the prefix does not end with a forward slash (e.g., http://mysite.local/helloworld), it is an exact match for exclusion.
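
To show how the settings in the table might fit together, here is a sketch of a filled-in <Settings> element. All values are illustrative: mycrawlsite.com is a placeholder host, the /crawl/drafts/ subfolder is hypothetical, and the fragment assumes InSite Search 2.13+ for the settings listed after the version note.

```xml
<Settings>
    <!-- Illustrative values only. mycrawlsite.com is a placeholder host. -->
    <add name="url" value="http://mycrawlsite.com/" />
    <!-- Re-crawl one time each day. -->
    <add name="cycle" value="1" />
    <add name="cycleUnit" value="day" />
    <!-- Start time for the next crawl. -->
    <add name="time" value="16:25:00" />
    <add name="batchSize" value="200" />
    <!-- Ignore the myquerystring query string when comparing URLs. -->
    <add name="querystringsToIgnore" value="myquerystring" />
    <add name="useCanonicalUrlInHtml" value="true" />
    <add name="caseInsensitiveUrlMatch" value="true" />
    <!-- Index only pages whose URLs start with this prefix. -->
    <add name="urlRestrictionPrefixes" value="http://mycrawlsite.com/crawl/" />
    <!-- Hypothetical subfolder to exclude. The trailing slash makes this a prefix match. -->
    <add name="urlExclusionPrefixes" value="http://mycrawlsite.com/crawl/drafts/" />
</Settings>
```

Replace each value with one that suits your site. Remove the time entry if you do not need a fixed start time.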

Save and close SiteCrawlerSource.config when you finish.
Choose one of the following steps to generate the Site Crawler index.
- To generate the initial indexing of the Site Crawler URL(s), recycle the search service application pool.
- To force a re-crawl, delete the SiteCrawlerSource_LastIndexed.timestamp file from the App_Data folder, and then recycle the search service application pool.
