Scout's runtime options are set in a configuration file with a simple syntax: assignment statements grouped into two named sections. Each section begins with a line of the form [SECTIONNAME], and within a section each parameter is set on its own line of the form NAME=VALUE. The file has two sections, EXTRACTOR and SCOUT. The EXTRACTOR section contains parameters relevant to the HTML parser, and the SCOUT section contains parameters relevant to Scout itself.
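The syntax described above can be read with a few lines of code. The following is a minimal sketch in Python, not Scout's own parser: the function name `parse_config`, the handling of blank lines, and the shortened `EntityFile` value are assumptions made for illustration.

```python
def parse_config(text):
    """Parse [SECTIONNAME] headers and NAME=VALUE lines into nested dicts."""
    sections = {}
    current = None
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        if line.startswith("[") and line.endswith("]"):
            # A section header opens (or reopens) a named section.
            current = sections.setdefault(line[1:-1].strip(), {})
        elif "=" in line and current is not None:
            # Split on the first "=" only, so values such as URLs
            # with query strings keep any later "=" characters.
            name, _, value = line.partition("=")
            current[name.strip()] = value.strip()
        else:
            raise ValueError(f"line {lineno}: cannot parse {raw!r}")
    return sections


# Hypothetical, shortened configuration used only to exercise the parser.
sample = """\
[EXTRACTOR]
EntityFile=entities.txt

[SCOUT]
StartURL=http://www.nku.edu/
NetDelay=2000
"""

config = parse_config(sample)
print(config["SCOUT"]["NetDelay"])
```

All values come back as strings; converting them to numbers or booleans is left to the code that consumes each parameter.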
The following tables detail the parameters in the EXTRACTOR and SCOUT sections of the configuration file.
The EXTRACTOR Section

| Parameter Name | Description |
|---|---|
| EntityFile | Path on the local file system to a file used by the parser to interpolate SGML entities as characters, for example &quot; for " and &amp; for & |
The SCOUT Section

| Parameter Name | Description |
|---|---|
| StartURL | The fully qualified URL of the page at which to begin a session |
| UseGUI | If set to true, Scout provides a simple set of button controls for pausing and resuming the session and for toggling a windowed display of the 10 most recent log entries |
| LogFile | The file to which log data is written |
| PersistFile | The file to which session state is saved on program termination |
| NetDelay | Time in milliseconds to wait between successive requests to the same host |
| RestrictHost | If not null, Scout will not request URLs from any server other than the one named |
| RestrictDomain | If not null, Scout will not request URLs from servers in any domain other than the one named |
| CacheDir | Directory in which documents will be cached locally |
| MaxCacheFiles | Maximum number of files to keep in the cache at any time |
| MaxURLs | Maximum number of URLs to request in the session; set to 0 to make Scout search until the queue is empty |
| RequestRobotsFile | If true, request and honor the robots.txt file from every host |
| SearchCache | If true, try to load documents from the cache before requesting them from a remote server |
| SearchWeb | If true, request documents from their Web servers |
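To make the semantics of a few SCOUT parameters concrete, here is a hedged Python sketch of how a crawler might interpret them. It is not Scout's implementation. The helper names (`as_bool`, `as_optional`, `is_allowed`, `under_limit`) are hypothetical, and the rule that a host belongs to a domain when it equals the domain or ends with a dot followed by the domain is an assumption; the table does not spell it out.

```python
from urllib.parse import urlsplit


def as_bool(value):
    # Assumption: only the literal "true" (in any case) enables a flag.
    return value.strip().lower() == "true"


def as_optional(value):
    # The literal "null" marks a restriction as unset.
    value = value.strip()
    return None if value.lower() == "null" else value


def is_allowed(url, restrict_host=None, restrict_domain=None):
    """Return True if the URL's host passes both restrictions."""
    host = (urlsplit(url).hostname or "").lower()
    if restrict_host is not None and host != restrict_host.lower():
        return False
    if restrict_domain is not None:
        domain = restrict_domain.lower()
        # Assumed membership rule: exact match or a ".domain" suffix.
        if host != domain and not host.endswith("." + domain):
            return False
    return True


def under_limit(requested, max_urls):
    # MaxURLs=0 means no limit: continue until the queue is empty.
    return max_urls == 0 or requested < max_urls


# String values as they would appear in the [SCOUT] section.
scout = {
    "RestrictHost": "null",
    "RestrictDomain": "nku.edu",
    "MaxURLs": "0",
    "SearchWeb": "true",
}
host = as_optional(scout["RestrictHost"])
domain = as_optional(scout["RestrictDomain"])

print(is_allowed("http://www.nku.edu/", host, domain))
print(is_allowed("http://www.cs.engr.uky.edu/", host, domain))
print(under_limit(500, int(scout["MaxURLs"])))
```

The suffix check includes the leading dot so that a host such as notnku.edu is not mistaken for a member of nku.edu.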
The examples below are the actual configuration files used for the sessions detailed in the paper on Scout. As the meaning of all the parameters has already been defined, the examples are not annotated.
Example 1:

    [SCOUT]
    StartURL=http://www.cs.engr.uky.edu/~borchers/javatree/docs/api/packages.html
    UseGUI=false
    LogFile=ClassDoc.log
    PersistFile=ClassDoc.dat
    NetDelay=2000
    RestrictHost=www.cs.engr.uky.edu
    RestrictDomain=null
    CacheDir=JavaDocCache
    MaxCacheFiles=1024
    MaxURLs=0
    RequestRobotsFile=true
    SearchCache=true
    SearchWeb=true
    [EXTRACTOR]
    EntityFile=c:\classes\SGMLKit\entities.txt

Example 2:

    [EXTRACTOR]
    EntityFile=/a/al/u/al-d7/csgrad/borchers/classes/SGMLKit/entities.txt
    [SCOUT]
    UseGUI=true
    RuleBase=Rules
    LogFile=sessions\NKU.log
    PersistFile=sessions\NKU.dat
    NetDelay=2000
    StartURL=http://www.nku.edu/
    RestrictHost=null
    RestrictDomain=nku.edu
    CacheDir=NKUCache
    MaxCacheFiles=256
    MaxURLs=1
    RequestRobotsFile=true
    SearchCache=true
    SearchWeb=true