Configuring Site Crawler
This configuration file points to an HTML site, which Site Crawler then crawls. Site Crawler parses the static HTML content and indexes the following default fields in the main index:
- URL
- Title
- Abstract
- BodyCopy
- Headings (H1-H6)
Site Crawler follows links on a page to whatever depth they go and records each followed link in the indexing log as Follows Links. This log entry also shows the response code if Site Crawler attempts to follow a link and cannot resolve its content for any reason (e.g., a 404 response).
To configure Site Crawler:
- Navigate to [Drive]:\[path-to-DSS-root-directory], and open Search.config in a text editor.

  Important: Search.config acts as a hub for all InSite Search configurations and must be configured before configuring SiteCrawlerSource.config.

  The example below displays generic code contained within Search.config.
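  The fragment below is a sketch only: the <IndexingSources> and <add> elements and their attributes follow the steps in this procedure, but the name and settingsFile values are placeholders, and any elements surrounding <IndexingSources> in your Search.config are omitted.

  ```xml
  <!-- Fragment of Search.config; surrounding elements omitted -->
  <IndexingSources>
    <!-- One <add> element per indexing source. The @name value is arbitrary but must be unique.
         @type identifies the Site Crawler source, and @settingsFile points to
         SiteCrawlerSource.config (a relative path works for the default file in App_Data). -->
    <add name="SiteCrawlerSource"
         type="Ingeniux.Search.HtmlSiteSource"
         settingsFile="App_Data/SiteCrawlerSource.config" />
  </IndexingSources>
  ```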
- If an <add> element doesn't already exist for Site Crawler within the <IndexingSources> element, create one. Complete the following steps in the <add> element.
  - Enter an arbitrary, unique name as the value of the @name attribute for identification purposes.
  - Enter Ingeniux.Search.HtmlSiteSource as the value of the @type attribute.
  - Enter the path to SiteCrawlerSource.config as the value of the @settingsFile attribute.

    Note: You can use a relative path to the default site crawler configuration file (i.e., App_Data/SiteCrawlerSource.config), which was included during installation.
- Save and close Search.config.
- Navigate to SiteCrawlerSource.config, and open the file in a text editor.

  Note: By default, SiteCrawlerSource.config resides in [Drive]:\[path-to-cms-site]\App_Data. However, this source file can reside elsewhere. You can also download a sample SiteCrawlerSource.config and then save the file to App_Data or a custom location.
- Modify the following attribute values on the <add> elements to suit your needs. An illustrative sketch follows this list.
  - url
    - Default value: -empty-
    - Example value: https://sites.google.com/view/testindex/home
    - Description: URL of the root of the crawled site.
  - cycle
    - Default value: -empty-
    - Example value: 1
    - Description: Integer value used in the defined re-crawl cycle. Represents how many times the site is crawled within the cycleUnit.
  - cycleUnit
    - Default value: -empty-
    - Possible values: week, day, hour
    - Example value: hour
    - Description: Sets the cadence for re-crawling content.
  - batchsize
    - Default value: 200
    - Example value: 200
    - Description: Batch size to process into the index.
  - time
    - Default value: -empty-
    - Example value: 16:25:00
    - Description: Sets the start time for the next crawl.
    - Note: If the time value in Site Crawler is defined, InSite Search ignores the timestamp file. In Ingeniux CMS 10.x, for example, there is no need to delete the timestamp file each time the app is restarted in order to honor the time value.

  Version Notes: The settings below only apply to InSite Search 2.13+.

  - querystringsToIgnore
    - Default value: -empty-
    - Possible values: fullIgnore, *partial1, partial2*, *partial3*
    - Example value: partial2*
    - Description: The ignore-querystring feature is used to differentiate URLs. The URL stored for a given page in Lucene retains the full querystring in the field.
      - fullIgnore: The full ignore matches on the querystring (i.e., ignoreTerm).
      - *partial1: The leading wildcard ignore matches the trailing part of a querystring (i.e., *ignoreTerm).
      - partial2*: The trailing wildcard ignore matches the leading part of a querystring (i.e., ignoreTerm*).
      - *partial3*: The full wildcard ignore matches the querystring contained by the wildcards (i.e., *ignoreTerm*).
    - Important: If you enter the wildcard symbol (*) by itself as a value, all querystrings are ignored.
    - Note: When setting querystringsToIgnore, consider the scenario in which querystringsToIgnore is defined, for example, as myquerystring. The crawler finds a new link, http://mysite.com/thispage.html?myquerystring=test, and there is an existing page indexed as http://mysite.com/thispage.html. Without the querystring value, these URL values are considered a match, and the newly crawled link is not processed.
  - useCanonicalUrlInHtml
    - Default value: false
    - Possible values: true, false
    - Example value: true
    - Description: If <link rel="canonical" href /> exists, use the @href value as the URL of the crawled page.
  - urlRestrictionPrefixes
    - Default value: -empty-
    - Example values: http://mysite.com/this/, https://yoursite.com/that/
    - Description: To be indexed, crawled page URLs must start with one of the given prefix values. Comma-delimited. Not case-sensitive. Absolute URLs required.
    - Note: The crawler traverses any links that match the base URL of the site but only indexes pages that contain a defined urlRestrictionPrefixes value (e.g., http://mycrawlsite.com/crawl/page1.html is crawled and indexed, but http://mycrawlsite.com/hello/ is crawled and not indexed).
  - caseInsensitiveUrlMatch
    - Default value: false
    - Possible values: true, false
    - Example value: true
    - Description: When enabled, this option eliminates duplicates when URLs match but their cases differ.
  - urlExclusionPrefixes
    - Default value: -empty-
    - Example values: http://mysite.com/this/, https://yoursite.com/that/
    - Description: Comma-delimited list of URL prefixes. Pages whose URLs start with a provided prefix are excluded from indexing during the site crawl. Not case-sensitive. Absolute URLs required.
      - If the prefix ends with a forward slash (e.g., http://mysite.local/here/), all URLs that start with this value are not indexed.
      - If the prefix does not end with a forward slash (e.g., http://mysite.local/helloworld), it is an exact match for exclusion.
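  The sketch below gathers these attributes into a single source definition. It assumes the attributes sit directly on an <add> element, as described above; all values are illustrative (the restriction and exclusion prefixes are hypothetical paths under the example site), and the elements that surround the definition are omitted. Treat the sample SiteCrawlerSource.config included with the installation as the authoritative template.

  ```xml
  <!-- Illustrative sketch only: all values are examples, and the parent elements are
       omitted because they are not described here. Match the overall structure of the
       sample SiteCrawlerSource.config shipped with (or downloaded for) your installation. -->
  <!-- Crawl the site once per hour, starting at 16:25:00, indexing 200 pages per batch. -->
  <add url="https://sites.google.com/view/testindex/home"
       cycle="1"
       cycleUnit="hour"
       batchsize="200"
       time="16:25:00"
       querystringsToIgnore="partial2*"
       useCanonicalUrlInHtml="true"
       caseInsensitiveUrlMatch="true"
       urlRestrictionPrefixes="https://sites.google.com/view/testindex/"
       urlExclusionPrefixes="https://sites.google.com/view/testindex/drafts/" />
  <!-- querystringsToIgnore, useCanonicalUrlInHtml, urlRestrictionPrefixes,
       caseInsensitiveUrlMatch, and urlExclusionPrefixes apply to InSite Search 2.13+ only. -->
  ```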
- Save and close SiteCrawlerSource.config when you finish.
- Choose one of the following steps to generate the Site Crawler index.
  - To generate the initial index of the Site Crawler URL(s), recycle the search service application pool.
  - To force a re-crawl, delete the SiteCrawlerSource_LastIndexed.timestamp file from the App_Data folder, then recycle the search service application pool.