Scout reads runtime command directives and Rule initialization information from a section of HTML-like tags. These tags can make up a file of their own or be embedded in another HTML file, since it is standard practice for browsers to ignore tags they do not understand. In this way, an HTML document can contain a template describing how Scout might process its content into rule results. This appendix describes the template tagging syntax and shows the templates used in the example applications detailed in the paper, Scout --- An Infrastructure for Web-Based Information Retrieval. The template syntax defined here is only partially supported by Scout: some tag attributes are, at present, meaningful only as annotations to the template author and user. These are clearly differentiated in the discussion that follows.
All template tags begin with the generic identifier SCOUT, with all tag-specific semantics being conveyed through attributes. A template begins with a tag <SCOUT TSTART NAME=X>, where X is a name for the template. The NAME attribute is not currently used by Scout, but is included to allow future implementations to recognize and use more than one template per session, and produce different result sets for each. The template ends with a corresponding tag <SCOUT TEND>. Between these two tags, Scout will interpret any tags with the SCOUT identifier as meaningful to the template. Other HTML tags within the template or SCOUT tags outside of it are ignored.
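The delimiting rule just described can be sketched in a few lines of Python. This is only an illustration, not Scout's implementation: the helper name template_tags is hypothetical, and the sketch assumes that tags are written in upper case as in this appendix and that no attribute value contains a > character.

```python
import re

# Matches any tag whose generic identifier is SCOUT and captures its attributes.
# Simplifying assumption: no attribute value contains a '>' character.
SCOUT_TAG = re.compile(r"<SCOUT\b([^>]*)>")


def template_tags(html):
    """Return the attribute text of each SCOUT tag between TSTART and TEND."""
    inside = False
    found = []
    for match in SCOUT_TAG.finditer(html):
        body = match.group(1).strip()
        keyword = body.split()[0] if body else ""
        if keyword == "TSTART":
            inside = True
        elif keyword == "TEND":
            inside = False
        elif inside:
            found.append(body)
    return found


page = (
    '<P>Intro</P>'
    '<SCOUT TYPE=C NAME=SETVAR VALUE="OUTSIDE=1">'  # outside the template: ignored
    '<SCOUT TSTART NAME="Demo">'
    '<B>ordinary HTML</B>'                           # not a SCOUT tag: ignored
    '<SCOUT TYPE=C NAME=SETVAR VALUE="FOO=BAR">'
    '<SCOUT TEND>'
)
print(template_tags(page))
```

Only the SETVAR tag inside the template is returned; the SCOUT tag before TSTART and the ordinary HTML inside the template are both skipped.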
For tags between the TSTART and TEND tags, the required attribute TYPE is the most important, and the one that Scout evaluates first to determine how to process the rest of the tag. The TYPE attribute may take one of three values: C for "Command", D for "Data", or I for "Ignore". C and D type tags are discussed below. TYPE=I tags are simply an alternative to commenting the tag out using a standard HTML comment. If Scout encounters a template tag with this type value, it is passed over.
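As a rough illustration of evaluating TYPE first, the following Python sketch parses a tag's attributes and classifies the tag. The attribute syntax is inferred from the examples in this appendix (NAME=value, NAME="quoted value", or a bare NAME), the function names are hypothetical, and raising an error for a missing or unknown TYPE is an assumption; the appendix does not say how Scout reacts in that case.

```python
import re

# Assumed attribute forms: NAME=value, NAME="quoted value", or a bare NAME.
ATTR = re.compile(r'(\w+)(?:=(?:"([^"]*)"|(\S+)))?')


def parse_attributes(body):
    """Turn the attribute text of a SCOUT tag into a dict (bare names map to '')."""
    attrs = {}
    for name, quoted, unquoted in ATTR.findall(body):
        attrs[name] = quoted or unquoted
    return attrs


def classify(body):
    """Return ('command' | 'data' | 'ignore', attrs) based on the TYPE attribute."""
    attrs = parse_attributes(body)
    kinds = {"C": "command", "D": "data", "I": "ignore"}
    tag_type = attrs.get("TYPE")
    if tag_type not in kinds:
        # Assumption: treat a missing or unknown TYPE as an error.
        raise ValueError("missing or unknown TYPE: %r" % tag_type)
    return kinds[tag_type], attrs


kind, attrs = classify('TYPE=C NAME=SETVAR VALUE="FOO=BAR"')
print(kind, attrs)
```

A caller would then hand "command" tags to the command interpreter, use "data" tags to start rule threads, and drop "ignore" tags.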
A TYPE=C tag tells Scout to execute an internal command. The NAME attribute supplies the name of the command to execute, and the remaining attributes are specific to the command named. Only one command is presently implemented, SETVAR, which instructs Scout to set an internal value in a hash table using the assignment expression in the VALUE attribute. Values set in this manner are readable by all rules in a Scout session. For example, the following tag instructs Scout to set the key FOO to the value BAR.
<SCOUT TYPE=C NAME=SETVAR VALUE="FOO=BAR">
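The effect of SETVAR can be pictured with a small Python sketch. It assumes that the assignment expression is split at its first = sign, which is consistent with the examples in this appendix but is not stated explicitly; the function name setvar and the dict standing in for Scout's hash table are illustrative only.

```python
def setvar(variables, expression):
    """Store KEY=VALUE from a SETVAR assignment expression into the variables dict."""
    # Assumption: the key ends at the first '=' and the rest is the value.
    key, separator, value = expression.partition("=")
    if not separator or not key:
        raise ValueError("not an assignment expression: %r" % expression)
    variables[key] = value


variables = {}  # stands in for Scout's session-wide hash table
setvar(variables, "FOO=BAR")
setvar(variables, r"USPHONENUMBER=\d{3}-\d{4}")
print(variables["FOO"])
```

Splitting at the first = sign leaves any later = characters in the value untouched.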
A TYPE=D tag identifies result data and the rule for extracting it. For this type of tag, the NAME attribute supplies a name for the runtime rule thread. Note that multiple instances of the same rule-derived class can execute in the same session with different names and for different purposes. The VALUE attribute is currently unsupported for a TYPE=D tag, but is intended to seed a rule's results with initial data, or to provide named data for which no rule exists. The RULE attribute specifies what rule class to load. The PARSE attribute is an annotation to indicate what type of result objects the rule produces. It is not currently used by Scout.
Any other attributes in TYPE=D tags are specific to Rules, and are ignored by Scout. One attribute, VALIDATE, is recognized by the Rule base class. If this attribute is present, the rule will require that Scout successfully parse each document before invoking the processDoc() method. Documents which cannot be thus parsed are ignored by the rule.
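The division of labor between Scout and a rule can be sketched as follows. This Python fragment is an illustration under stated assumptions, not Scout's code: it takes an already parsed attribute dictionary, separates the attributes described above from the rule-specific ones, and treats the mere presence of VALIDATE as the switch. The names SCOUT_ATTRS and split_data_tag are hypothetical.

```python
# Attributes of a TYPE=D tag that this appendix describes as belonging to Scout.
SCOUT_ATTRS = {"TYPE", "NAME", "VALUE", "RULE", "PARSE"}


def split_data_tag(attrs):
    """Split parsed TYPE=D attributes into (scout_attrs, rule_params)."""
    scout_attrs = {k: v for k, v in attrs.items() if k in SCOUT_ATTRS}
    rule_params = {k: v for k, v in attrs.items() if k not in SCOUT_ATTRS}
    return scout_attrs, rule_params


attrs = {
    "TYPE": "D",
    "NAME": "BFS",
    "VALUE": "null",
    "PARSE": "void",
    "RULE": "Scout.BreadthFirstSearch",
    "VALIDATE": "true",
}
scout_attrs, rule_params = split_data_tag(attrs)

# Presence alone matters for VALIDATE; its value is not inspected here.
requires_valid_document = "VALIDATE" in rule_params
print(scout_attrs["RULE"], rule_params, requires_valid_document)
```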
As an example, consider the following tag. It constructs an instance of Scout.RegExpRule, a rule that extracts text matching a regular expression from documents. The rule thread will run under the name EmailAddress, and match strings of the form indicated by the RegExpRule-specific attribute PATTERN. Note that the VALIDATE attribute is not supplied, as there is no reason to restrict a search for email addresses to HTML documents.
<SCOUT TYPE="D" NAME="EmailAddress" RULE="Scout.RegExpRule" PATTERN="[^\s]+@[^\s]+\.(com|net|edu)">
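To see what this PATTERN accepts, it can be tried against a sample string. The sketch below uses Python's re module purely for illustration; Scout's own regular expression engine may differ in details, and the helper name find_matches and the sample text are invented for the example.

```python
import re

# The PATTERN attribute from the tag above, written as a Python raw string.
EMAIL = r"[^\s]+@[^\s]+\.(com|net|edu)"


def find_matches(pattern, text):
    """Return the full text of every non-overlapping match of pattern in text."""
    return [match.group(0) for match in re.finditer(pattern, text)]


text = "Mail alice@example.edu or bob@site.org, then carol@uni.com."
print(find_matches(EMAIL, text))
```

The sketch collects group(0) from re.finditer rather than calling re.findall, because with a parenthesized group in the pattern findall would return only the matched suffix (com, net, or edu). Note that the .org address is not matched, since the pattern lists only three top-level domains.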
Attribute values used as rule parameters are not restricted to literal strings. Values can be interpolated into the parameters from Scout's global variable hash table by the rule constructor. To perform such interpolation, the attribute string must contain the key to be interpolated between the slashes in a string of the form VAR/Key/. For example, if we had three rules that used the email address matching regular expression above, we could use the following set of template tags to set the value and initialize the rules:
<SCOUT TYPE="C" NAME="SETVAR" VALUE="EMAIL=[^\s]+@[^\s]+\.(com|net|edu)">
<SCOUT TYPE="D" NAME="EmailAddressRule1" RULE="Rule1Class" PATTERN="VAR/EMAIL/">
<SCOUT TYPE="D" NAME="EmailAddressRule2" RULE="Rule2Class" PATTERN="VAR/EMAIL/">
<SCOUT TYPE="D" NAME="EmailAddressRule3" RULE="Rule3Class" PATTERN="VAR/EMAIL/">
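The VAR/Key/ substitution can be sketched as a single regular expression replacement. The Python below is a minimal illustration, not the rule constructor's actual code. It assumes keys consist of letters, digits, and underscores, and it leaves a reference to an unknown key unchanged; the appendix does not say what Scout does in that case. The name interpolate is hypothetical.

```python
import re

# Assumption: keys are made of letters, digits, and underscores.
VAR_REFERENCE = re.compile(r"VAR/(\w+)/")


def interpolate(value, variables):
    """Replace each VAR/Key/ in value with variables[Key]."""
    def lookup(match):
        key = match.group(1)
        # Assumption: an unknown key leaves the reference text as it was.
        return variables.get(key, match.group(0))
    return VAR_REFERENCE.sub(lookup, value)


variables = {
    "EMAIL": r"[^\s]+@[^\s]+\.(com|net|edu)",
    "CAPWORD": "[A-Z][A-Za-z]*",
}
print(interpolate("VAR/EMAIL/", variables))
print(interpolate(r"(VAR/CAPWORD/\s+)+University", variables))
```

Passing a function to sub, rather than a replacement string, keeps the backslashes in the stored regular expressions from being interpreted as replacement escapes.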
The following examples show the templates used in the sample applications described in the paper on Scout. They have been annotated with comments to illustrate various features. Recall that the VALUE and PARSE attributes, when present, are only annotations to the human template user.
The first four lines after the TSTART tag show four variables being set in Scout's internal hash. These correspond one to one to the first entries displayed in the sample log file example.
<SCOUT TSTART NAME="School">
<SCOUT TYPE=C NAME=SETVAR VALUE="USAREACODE=(\(\d{3}\))|(\d{3})">
<SCOUT TYPE=C NAME=SETVAR VALUE="USPHONENUMBER=\d{3}-\d{4}">
<SCOUT TYPE=C NAME=SETVAR VALUE="CAPWORD=[A-Z][A-Za-z]*">
<SCOUT TYPE=C NAME=SETVAR VALUE="ACRONYM=[A-Z][A-Z]+">
The first thread initialized, BFS, is for breadth-first search using the Scout.BreadthFirstSearch rule. The VALIDATE attribute is set, as valid HTML tags with anchors to other documents are the target of the search.
<SCOUT TYPE=D NAME=BFS VALUE=null PARSE=void RULE=Scout.BreadthFirstSearch VALIDATE=true>
The remaining rules are all instances of the Scout.RegExpRule class, each charged with a specific data-extraction task. Each uses a VAR parameter to interpolate values set by the C tags above. Each rule also requires valid HTML via the VALIDATE attribute. Additionally, three new attributes meaningful to the Scout.RegExpRule class are demonstrated: SQUEEZEDOC, TRIM, and SQUEEZEMATCH. SQUEEZEDOC tells the rule to compress each run of whitespace into a single space character before applying the search pattern. TRIM and SQUEEZEMATCH respectively tell the rule to remove leading and trailing whitespace and to compress interior whitespace in results.
<SCOUT TYPE=D NAME="UniversityName" VALUE=null PARSE="String" RULE=Scout.RegExpRule PATTERN="(University\s+of\s+(VAR/CAPWORD/\s+)+)|((VAR/CAPWORD/\s+)+University)" SQUEEZEDOC TRIM VALIDATE>
<SCOUT TYPE=D NAME=PhoneNumber VALUE=null PARSE=string RULE=Scout.RegExpRule PATTERN=(VAR/USAREACODE/){0,1}(\s){0,1}(VAR/USPHONENUMBER/) SQUEEZEMATCH TRIM VALIDATE>
<SCOUT TYPE=D NAME=Acronym VALUE=null RULE=Scout.RegExpRule PATTERN=(VAR/ACRONYM/) TRIM VALIDATE>
<SCOUT TEND>
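The effect of SQUEEZEDOC, SQUEEZEMATCH, and TRIM on the UniversityName rule can be approximated in Python. This is a sketch based only on the descriptions above, not on Scout.RegExpRule itself: the function names and sample document are invented, the VAR/CAPWORD/ reference has been expanded by hand, and the squeeze step here also collapses leading and trailing whitespace runs to one space, which TRIM then removes.

```python
import re

CAPWORD = r"[A-Z][A-Za-z]*"
# The UniversityName PATTERN with VAR/CAPWORD/ expanded by hand.
UNIVERSITY = (
    r"(University\s+of\s+(" + CAPWORD + r"\s+)+)"
    r"|((" + CAPWORD + r"\s+)+University)"
)


def squeeze(text):
    """Compress each run of whitespace into a single space character."""
    return re.sub(r"\s+", " ", text)


def extract(pattern, doc, squeezedoc=False, squeezematch=False, trim=False):
    """Apply pattern to doc, honoring the three whitespace options."""
    if squeezedoc:
        doc = squeeze(doc)          # SQUEEZEDOC: normalize before matching
    results = []
    for match in re.finditer(pattern, doc):
        text = match.group(0)
        if squeezematch:
            text = squeeze(text)    # SQUEEZEMATCH: normalize each result
        if trim:
            text = text.strip()     # TRIM: drop leading/trailing whitespace
        results.append(text)
    return results


doc = "Visit the University\n  of   New Mexico today, or Stanford   University."
print(extract(UNIVERSITY, doc, squeezedoc=True, trim=True))
```

Without TRIM, the first result keeps a trailing space, because the pattern requires whitespace after each capitalized word that follows "University of".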
This template is quite simple: it loads the three rules that traverse the javadoc-produced HTML. None of these rules is parameterized. For a discussion of what the rules do, see An Application for Processing Javadoc in the main paper.
<SCOUT TSTART NAME="ClassDoc">
<SCOUT TYPE=D NAME=PackageListRule VALUE=null PARSE=Object RULE=JavaDoc.PackageListRule>
<SCOUT TYPE=D NAME=PackageIndexRule VALUE=null PARSE=Object RULE=JavaDoc.PackageIndexRule>
<SCOUT TYPE=D NAME=ClassDocRule VALUE=null PARSE=Object RULE=JavaDoc.ClassDocRule>
<SCOUT TEND>