Appendix I - The Scout Configuration File

Configuration File Syntax

The configuration file that sets Scout's runtime options uses a simple syntax of assignment statements grouped into two named sections. Each section begins with a line of the form [SECTIONNAME]; within a section, each line of the form NAME=VALUE sets one parameter. The two sections are EXTRACTOR and SCOUT. The EXTRACTOR section contains parameters for the HTML parser, and the SCOUT section contains parameters for Scout itself.
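The syntax described above can be illustrated with a small reader. This is a minimal sketch in Python, not Scout's own parser: the function name parse_config and the choices to ignore blank lines and to split each line at the first "=" are assumptions made for the example.

```python
def parse_config(text):
    """Parse [SECTION] headers and NAME=VALUE lines into nested dicts."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            sections[current] = {}
            continue
        if current is None:
            raise ValueError(f"parameter outside any section: {line!r}")
        # Split at the first "=" only, so values such as URLs may contain "=".
        name, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected NAME=VALUE, got: {line!r}")
        sections[current][name.strip()] = value.strip()
    return sections


sample = """\
[EXTRACTOR]
EntityFile=entities.txt
[SCOUT]
StartURL=http://www.nku.edu/
NetDelay=2000
"""
config = parse_config(sample)
print(config["SCOUT"]["NetDelay"])  # prints 2000
```

All values are returned as strings; converting them to numbers or booleans is left to the caller.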

Configuration File Parameters

The following tables detail the parameters in the EXTRACTOR and SCOUT sections of the configuration file.

The EXTRACTOR Section

Parameter Name      Description
EntityFile          Path on the local file system to a file used by the parser to interpolate SGML entities as characters, for example &quot; for " and &amp; for &
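To make entity interpolation concrete, the following Python sketch replaces named entities with characters. It is an illustration under stated assumptions, not the parser's implementation: the format of the entity file is not described here, so the mapping is a hypothetical in-memory table holding only the two entities mentioned above, and the function name interpolate_entities is invented for the example.

```python
import re

# Hypothetical table; in Scout this data would come from the file named by
# EntityFile, whose format is not described in this appendix.
ENTITIES = {"quot": '"', "amp": "&"}

# A named entity reference: "&", a name, then ";".
_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def interpolate_entities(text, entities=ENTITIES):
    """Replace known entity references; leave unknown ones untouched."""
    def replace(match):
        return entities.get(match.group(1), match.group(0))
    return _ENTITY.sub(replace, text)


print(interpolate_entities("Tom &amp; Jerry said &quot;hi&quot; &unknown;"))
# prints: Tom & Jerry said "hi" &unknown;
```

The substitution is done in a single pass, so the text produced by one replacement is never reinterpreted as another entity.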

The SCOUT Section

Parameter Name      Description
StartURL            The fully qualified URL of the page at which to begin a session
UseGUI              If set to true, Scout provides a simple set of button controls for pausing and resuming the session and for toggling a windowed display of the 10 most recent log entries
LogFile             The file to which log data is written
PersistFile         The file to which session state is saved on program termination
NetDelay            Time in milliseconds to wait between sending successive requests to the same host
RestrictHost        If not null, Scout will not request URLs from any server other than the one named
RestrictDomain      If not null, Scout will not request URLs from servers in any domain other than the one named
CacheDir            Directory in which documents are cached locally
MaxCacheFiles       Maximum number of files to keep in the cache at any time
MaxURLs             Maximum number of URLs to request in the session; set to 0 to make Scout search until the queue is empty
RequestRobotsFile   If true, request and honor the robots.txt file from all hosts
SearchCache         If true, try to load documents from the cache before requesting them from a remote server
SearchWeb           If true, request documents from their Web servers
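The effect of RestrictHost and RestrictDomain can be sketched as a URL filter. This Python example is an assumption-laden illustration, not Scout's code: the function name is_allowed is hypothetical, the literal string "null" is taken to mean "no restriction" as in the example files, and a host is assumed to belong to a domain when it equals the domain or ends with a dot followed by the domain.

```python
from urllib.parse import urlsplit


def is_allowed(url, restrict_host="null", restrict_domain="null"):
    """Return True if the URL's host passes both restrictions."""
    host = (urlsplit(url).hostname or "").lower()
    # RestrictHost: only the named server may be contacted.
    if restrict_host != "null" and host != restrict_host.lower():
        return False
    # RestrictDomain: assumed to match the domain itself and its subdomains.
    if restrict_domain != "null":
        domain = restrict_domain.lower()
        if host != domain and not host.endswith("." + domain):
            return False
    return True


print(is_allowed("http://www.nku.edu/", restrict_domain="nku.edu"))  # True
print(is_allowed("http://www.uky.edu/", restrict_domain="nku.edu"))  # False
```

Matching on a leading dot keeps an unrelated host such as notnku.edu from passing a restriction to nku.edu.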

Configuration File Examples

The examples below are the actual configuration files used for the sessions detailed in the paper on Scout. As the meaning of all the parameters has already been defined, the examples are not annotated.

Configuration File: ClassDoc.ini

[SCOUT]
StartURL=http://www.cs.engr.uky.edu/~borchers/javatree/docs/api/packages.html
UseGUI=false
LogFile=ClassDoc.log
PersistFile=ClassDoc.dat
NetDelay=2000
RestrictHost=www.cs.engr.uky.edu
RestrictDomain=null
CacheDir=JavaDocCache
MaxCacheFiles=1024
MaxURLs=0
RequestRobotsFile=true
SearchCache=true
SearchWeb=true
[EXTRACTOR]
EntityFile=c:\classes\SGMLKit\entities.txt

Configuration File: NKU.ini

[EXTRACTOR]
EntityFile=/a/al/u/al-d7/csgrad/borchers/classes/SGMLKit/entities.txt
[SCOUT]
UseGUI=true
RuleBase=Rules
LogFile=sessions\NKU.log
PersistFile=sessions\NKU.dat
NetDelay=2000
StartURL=http://www.nku.edu/
RestrictHost=null
RestrictDomain=nku.edu
CacheDir=NKUCache
MaxCacheFiles=256
MaxURLs=1
RequestRobotsFile=true
SearchCache=true
SearchWeb=true
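The values in the example files are plain strings, so a program reading them has to interpret each one according to its parameter. The Python sketch below shows one way to do this for a few values taken from NKU.ini. It is an assumption about how such values could be handled, not a description of Scout: the grouping of parameters by type follows the descriptions in the SCOUT table, and the names BOOLEAN, INTEGER, NULLABLE, and convert are hypothetical.

```python
# Parameter groupings inferred from the SCOUT table (assumed, not from Scout).
BOOLEAN = {"UseGUI", "RequestRobotsFile", "SearchCache", "SearchWeb"}
INTEGER = {"NetDelay", "MaxCacheFiles", "MaxURLs"}
NULLABLE = {"RestrictHost", "RestrictDomain"}


def convert(name, value):
    """Convert a raw string value according to its parameter name."""
    if name in BOOLEAN:
        return value.lower() == "true"
    if name in INTEGER:
        return int(value)
    if name in NULLABLE and value == "null":
        return None  # the literal "null" means no restriction
    return value  # paths, URLs, and host names stay as strings


raw = {
    "UseGUI": "true",
    "NetDelay": "2000",
    "RestrictHost": "null",
    "RestrictDomain": "nku.edu",
    "MaxURLs": "1",
}
typed = {name: convert(value=value, name=name) for name, value in raw.items()}
print(typed)
# prints: {'UseGUI': True, 'NetDelay': 2000, 'RestrictHost': None, 'RestrictDomain': 'nku.edu', 'MaxURLs': 1}
```

Keying the conversion on the parameter name avoids misreading a file name or host name that happens to look like a number or a boolean.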