Scout
An Infrastructure for Web-Based Information Retrieval

Anthony L. Borchers - borchers@cs.engr.uky.edu
University of Kentucky Department of Computer Science
December, 1998
Advised by Dr. Raphael A. Finkel

Abstract

We describe Scout, a multithreaded robot infrastructure for Web-based information retrieval tasks. Scout implements HTTP communications, document caching, and simple HTML parsing, and can be extended with programs called rules that perform arbitrary data-processing tasks on the documents collected. We describe and demonstrate two simple proof-of-concept applications built using Scout: one that collects and extracts basic information about a university from its home page, and one that converts javadoc documentation to data structures that can be categorized and manipulated in interesting ways. We conclude with some observations about the performance of the sample applications and discuss some future applications that might be built using Scout.


Contents

Introduction
Scout Overview
Rules
Third-Party Code Used in Scout
Robots Exclusion Protocol
Scout Usage Example
Running Scout
Log Output
An Application for Processing Javadoc
The JavaDoc Application
Collection and Processing
Postprocessing
Conclusions
Future Applications
References

Appendices

The Scout Configuration File
Template Syntax
JavaDoc Phase Two Session Log

Introduction

Scout is a general-purpose Web robot that can be arbitrarily extended with procedures called rules that implement data-processing techniques specialized to the text or data formats retrieved. Initialized with a set of rules and a list of URLs, Scout collects the documents associated with the URLs and provides them to the rules for processing. Rules might parse natural language, interpret markup structure, or analyze binary-data formats.

Rules build data structures called results and store them in a globally accessible table. At any time during a Scout session, a rule may look up results previously generated by itself or any other rule. This facility gives rules a multiple-document memory and allows processing logic to be broken into easily managed functional components. Any rule may also append discovered URLs to Scout's search queue or connect to an external database to export results to structured records. When a session terminates, the results table is stored to disk for postprocessing or to seed a later session.

Scout Overview

Scout is a multithreaded Java application organized around the readers-writer model. The Scout thread, the writer, removes URLs from a search queue and requests the associated documents from the Web servers on which they reside using the HyperText Transfer Protocol (HTTP). Successfully collected documents and their HTTP headers are stored in a shared buffer, where rules executing as concurrent threads may access them. Each rule must access each buffered document exactly once to maintain synchronization, though a rule may at any time choose to release the document without performing any work on it. Scout and the rule threads also synchronize on the URL queue, so Scout can distinguish an empty queue from one that is waiting on a rule to produce a URL.
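The synchronization contract just described can be modeled with Java's monitor primitives. The class below is an illustrative sketch of such a readers-writer buffer, not Scout's actual DocBuffer; every name and signature here is an assumption.

```java
// Illustrative model of the readers-writer buffer described above.
// All names and signatures here are assumptions, not Scout's actual API.
public class DocBuffer {
    private String document;      // the currently buffered document body
    private int readersRemaining; // rules that have not yet accessed it
    private final int ruleCount;

    public DocBuffer(int ruleCount) {
        this.ruleCount = ruleCount;
    }

    // Writer (the Scout thread): wait until every rule has taken its one
    // required access of the previous document, then buffer the next one.
    public synchronized void fill(String doc) {
        while (readersRemaining > 0) {
            try { wait(); } catch (InterruptedException e) { return; }
        }
        document = doc;
        readersRemaining = ruleCount;
        notifyAll();
    }

    // Reader (a rule thread): wait for a fresh document, then take the
    // one access required of each rule. A rule may still decide to do
    // no work on the document it receives.
    public synchronized String acquire() {
        while (readersRemaining == 0) {
            try { wait(); } catch (InterruptedException e) { return null; }
        }
        readersRemaining--;
        notifyAll();
        return document;
    }
}
```

The writer blocks in fill() until every rule has taken its one required access; readers block in acquire() until a fresh document arrives.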

The Scout thread avoids collecting redundant documents that could lead to cycling, and caches documents to minimize network traffic. It implements the Robots Exclusion Protocol and can be configured to stall for a specified interval between successive accesses to the same server to reduce remote server load.

Since HTML is the most common format of Web pages, Scout attempts to parse each document as HTML before buffering it, and stores the tags and text separately if the parse succeeds. This preprocessing step permits rules to specialize in markup or text processing. As necessary, tags may be mapped back into their positions in the text, or tags and text may be recombined into a normalized HTML document.
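The separation step can be sketched as a single scan over the document. This is a deliberately naive illustration, not Scout's parser; real HTML contains comments, quoted attribute values, and malformed markup that such a scanner cannot handle.

```java
import java.util.Vector;

public class Separator {
    // Scan the document once, collecting tags into the vector and
    // plain text into the buffer. Deliberately naive: comments and
    // '>' characters inside quoted attribute values are not handled.
    public static void separate(String html, Vector tags, StringBuffer text) {
        int i = 0;
        while (i < html.length()) {
            if (html.charAt(i) == '<') {
                int close = html.indexOf('>', i);
                if (close < 0) break; // unterminated tag: stop scanning
                tags.addElement(html.substring(i, close + 1));
                i = close + 1;
            } else {
                text.append(html.charAt(i));
                i++;
            }
        }
    }
}
```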

Throughout a Scout session, both the Scout thread and the rules write detailed activity records to a log file. The last few lines of this log can optionally be monitored in a graphical window.

Rules

Rules are implemented by extending a base class Rule.class that provides standard interactions with Scout, including thread synchronization and result recording. To perform useful work, subclasses of Rule must override one method, processDoc(). This method is called once by the Rule parent class for each document buffered by the Scout thread.

When the processDoc() method is called, the following conditions exist:

  1. A previously unseen document is buffered in the object scout.doc (class Scout.Document) as HTTP headers and body, the latter separated into tags and text if the source was successfully parsed as HTML
  2. The integer variable sequenceNumber contains a value that all threads will consistently use to refer to the buffered document
  3. A local object results (class java.lang.Vector) is initialized empty and ready to receive results from the rule
  4. A shared object scout.ruleResults (class Scout.Results) contains a table of all results previously generated by all running rules

The overridden processDoc() method performs task-specific work by examining the buffered document, previous entries in the results table, or both.

The rule may generate any legal Java class objects as results. These are stored in the local results vector, which is automatically entered into the shared table when processDoc() returns. If the rule does not generate any results, the empty vector is stored to maintain the structure of the results table and differentiate results that will never be produced from those that simply have not yet been.
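A minimal rule in this style might look like the following sketch. The Rule base class and the processDoc() contract are paraphrased from the description above, so the sketch omits the extends clause and the thread machinery and should not be read as Scout's actual API.

```java
import java.util.Vector;

// The Rule base class and processDoc() contract are paraphrased from
// the description above; this sketch omits the extends clause and the
// thread machinery so it can stand alone.
public class TitleRule /* extends Rule */ {
    Vector results = new Vector();

    // Called once per buffered document. Here the "document" is just
    // the raw text; Scout actually supplies a richer Scout.Document.
    public void processDoc(String text) {
        String lower = text.toLowerCase();
        int open = lower.indexOf("<title>");
        int close = lower.indexOf("</title>");
        if (open >= 0 && close > open)
            results.addElement(text.substring(open + 7, close).trim());
        // If nothing matched, results stays empty; Scout stores the
        // empty vector anyway to preserve the results-table structure.
    }
}
```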

Third-Party Code Used in Scout

In addition to the java.* packages included in Sun's Java Development Kit (JDK), two third-party packages were used. Ronald Tschalär's HTTPClient [1], provided under the GNU General Public License, offers a more developed interface to Web connections than the java.net package provides. Pat 1.0 [2] by Steven R. Brandt provides Perl-style regular-expression syntax and is free to educational institutions.

The Hashlookup class that Scout uses to track visited URLs is from Praveen Devulapalli's 1997 Master's Project [3]. This class uses a constant-sized bit map and fast hashing methods for probabilistically equating strings.

Robots Exclusion Protocol

Scout implements the Robots Exclusion Protocol [4] developed by Martijn Koster. This protocol is a standard by which Web servers instruct robots not to access certain paths. The standard relies on the cooperation of robots not to go where they are not wanted, and is implemented by including a simple text file named robots.txt in the server's Web root directory. The robots.txt file lists where particular agents are forbidden to navigate. A simple, commented example follows.

# sample robots.txt file
# The * wildcard means "match any" in either User-agent or Disallow
# lines

# forbid a known rude robot to access anything
User-agent: RudeBot 6.66
Disallow: *

# forbid all other robots to access the cgi-bin and images directories
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/

# end robots.txt

Scout's first request to any server will be for the robots.txt file. If the server provides such a file and it contains a User-agent field that includes Scout, any prohibitions for this field will be merged with a default set of excluded paths and file types and honored for all subsequent accesses in the current session. If no robots.txt file is returned or no User-agent field can be found that applies to Scout, then only the defaults are used.

This feature can be disabled for searching on well-known hosts but should always be used for searches on unfamiliar servers.
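The prefix matching behind these exclusions can be sketched as follows. This illustration parses only the Disallow lines and treats each as a path prefix; a real implementation, like Scout's, must first select the record whose User-agent field applies.

```java
import java.util.StringTokenizer;
import java.util.Vector;

// Illustrative parser for the Disallow lines of a robots.txt file.
// A real implementation, like Scout's, must also select the record
// whose User-agent field applies; names here are hypothetical.
public class Nobots {
    // Collect the Disallow paths from a robots.txt body.
    public static Vector parseDisallows(String robotsTxt) {
        Vector paths = new Vector();
        StringTokenizer lines = new StringTokenizer(robotsTxt, "\n");
        while (lines.hasMoreTokens()) {
            String line = lines.nextToken().trim();
            if (line.startsWith("#")) continue; // comment line
            if (line.startsWith("Disallow:"))
                paths.addElement(line.substring("Disallow:".length()).trim());
        }
        return paths;
    }

    // A URL path is excluded if it begins with any disallowed prefix,
    // or if the record disallows everything with "*".
    public static boolean excluded(String path, Vector disallows) {
        for (int i = 0; i < disallows.size(); i++) {
            String d = (String) disallows.elementAt(i);
            if (d.equals("*") || (d.length() > 0 && path.startsWith(d)))
                return true;
        }
        return false;
    }
}
```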

Scout Usage Example

Scout runs on any Java Virtual Machine compatible with Java version 1.1.4 or later. For this example, we assume a generic Unix environment and Sun's Java bytecode interpreter. We also assume that the environment has been configured so that all required class libraries are available.

In addition to the core Scout package and a collection of rule classes, two files are needed to run Scout: an initialization file to set Scout's run-time parameters and an HTML file with a section of markup called a template. The template contains the information needed to load and initialize named rules and can also set named values in a table within Scout so common data can be readily accessed by a number of rules. The format of the initialization file and the markup for templates will be detailed later. For the following example, it is sufficient to know their purposes.

Our example concerns collecting a university name, acronym, and phone number from a home page at http://www.nku.edu/. The initialization data and template are contained in the files NKU.ini and university.html respectively. The template associates three instances of a regular expression matching rule, RegExpRule.class, with the runtime names UniversityName, Acronym, and PhoneNumber. A fourth rule, BreadthFirstSearch.class, instantiated under the name BFS, queues URLs found in the home page. Scout does not search beyond the first page collected, but the queue is serialized with the rest of the program's state when it halts, and could be used to resume the session later.

Running Scout

Scout is invoked from the command prompt as shown. It dumps a summary of the configuration file parameters, detailed in Appendix I, to standard output before beginning Web crawling, during which it is silent:

% java Scout.Scout NKU.ini university.html

Configuration File: NKU.ini
[EXTRACTOR]
EntityFile: /a/al/u/al-d7/csgrad/borchers/classes/SGMLKit/entities.txt
[SCOUT]
CacheDir: NKUCache
LogFile: NKU.log
MaxCacheFiles: 256
MaxURLs: 1
NetDelay: 2000
PersistFile: NKU.dat
RequestRobotsFile: true
RestrictDomain: null
RestrictHost: www.nku.edu
SearchCache: true
SearchWeb: true
StartURL: http://www.nku.edu/
UseGUI: true

Log Output

After running for a few seconds, Scout exits, leaving the following log in the file NKU.log. Editorial comments, indicated by italic text, have been added throughout. Log messages from specific rules are preceded by the names assigned to the instances invoked by the template.

The first section of the log shows how Scout processes the template. Its first four directives set variables in an internal hash table to regular expressions matching area codes, phone numbers, capitalized words, and acronyms.

Scout.setvar - Set runtime variable USAREACODE=(\(\d{3}\))|(\d{3})
Scout.setvar - Set runtime variable USPHONENUMBER=\d{3}-\d{4}
Scout.setvar - Set runtime variable CAPWORD=[A-Z][A-Za-z]*
Scout.setvar - Set runtime variable ACRONYM=[A-Z][A-Z]+

Next, the template calls for the rules to be loaded. The three instances of RegExpRule identify themselves as their constructors execute, and each reports the pattern it will match. What is not evident here is that the template parameterized RegExpRule with references to the internally stored variables listed above. For example, the PhoneNumber rule was parameterized with zero or one occurrences of USAREACODE, followed by an optional space, followed by a USPHONENUMBER; Scout expanded this into the complex expression shown. Such variable parameters are discussed in Appendix II, which details the template syntax.
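The variable expansion can be modeled as plain textual substitution. The sketch below is illustrative and assumes no variable name occurs inside another variable's name or value; Scout's actual template-expansion code is not shown here.

```java
import java.util.Enumeration;
import java.util.Hashtable;

// Naive textual substitution: each stored variable name found in the
// pattern is replaced by its regular expression. This assumes no
// variable name occurs inside another variable's name or value.
public class Expander {
    public static String expand(String pattern, Hashtable vars) {
        Enumeration names = vars.keys();
        while (names.hasMoreElements()) {
            String name = (String) names.nextElement();
            String value = (String) vars.get(name);
            int at;
            while ((at = pattern.indexOf(name)) >= 0)
                pattern = pattern.substring(0, at) + value
                        + pattern.substring(at + name.length());
        }
        return pattern;
    }
}
```

Expanding the pattern (USAREACODE){0,1}(\s){0,1}(USPHONENUMBER) against the variables above yields exactly the PhoneNumber expression reported in the log.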

UniversityName.RegExpRule - ready to search on pattern (University\s+of\s+([A-Z][A-Za-z]*\s+)+)|(([A-Z][A-Za-z]*\s+)+University)
PhoneNumber.RegExpRule - ready to search on pattern ((\(\d{3}\))|(\d{3})){0,1}(\s){0,1}(\d{3}-\d{4})
Acronym.RegExpRule - ready to search on pattern ([A-Z][A-Z]+)

Scout reports that four rules were initialized and dumps a simple view of the state of each.

Scout.Scout - Loaded 4 rules
Scout.Scout - BFS - {type=D, parse=void, value=null, validate=true, name=BFS, rule=Scout.BreadthFirstSearch}
Scout.Scout - UniversityName - {type=D, parse=String, squeezedoc=true, value=null, validate=true, trim=true, pattern=(University\s+of\s+([A-Z][A-Za-z]*\s+)+)|(([A-Z][A-Za-z]*\s+)+University), name=UniversityName, rule=Scout.RegExpRule}
Scout.Scout - PhoneNumber - {type=D, parse=string, value=null, validate=true, trim=true, pattern=((\(\d{3}\))|(\d{3})){0,1}(\s){0,1}(\d{3}-\d{4}), name=PhoneNumber, squeezematch=true, rule=Scout.RegExpRule}
Scout.Scout - Acronym - {type=D, value=null, validate=true, trim=true, pattern=([A-Z][A-Z]+), name=Acronym, rule=Scout.RegExpRule}

Scout looks for a cache of previously collected documents and any previously serialized state data as indicated by the CacheDir and PersistFile fields in the initialization file. In our case the robot is running for the first time, so neither the cache nor the state data exists.

Scout.restoreState - Initialized cache of 0 objects
Scout.restoreState - No state data found. Creating new objects

The session proper begins here as Scout removes the first URL from the queue. For a fresh session, this URL will be the one indicated by the StartURL parameter in the initialization file.

Scout.run - Scout started at Mon Nov 30 21:34:02 EST 1998

The search engine and all rules execute as separate threads that synchronize using the DocBuffer and URLQueue objects, so their log entries are in a nondeterministic order. In the next few lines, the run methods of the four rules start while Scout requests the robot-exclusion data from www.nku.edu and mixes the seven excluded paths returned from the host with its own internally excluded paths and file types. The next-to-last line shows a rule trying to retrieve the first document. Since no such document yet exists, the DocBuffer object forces the rule's thread to wait.

BFS.run - starting
URLQueue.removeFront - returning http://www.nku.edu/
UniversityName.run - starting
Nobots.getHostExclusions - Getting exclusions for host www.nku.edu
Nobots.loadExclusions - Read 7 path exclusions for host www.nku.edu
PhoneNumber.run - starting
Nobots.getHostExclusions - Stored 10 excluded paths and 19 excluded types for www.nku.edu
Acronym.run - starting

Next, Scout requests the first (root) URL from the www.nku.edu host while two of the remaining rules become blocked waiting to access the document. Once the Web server returns the document, Scout parses it into text and tag components. Though no log entry is written to indicate it, the document is also cached at this point.

Scout.getDocument - Requesting URL http://www.nku.edu/ from cache
Scout.getDocument - URL not cached - hitting the Web now
Scout.getDocument - Separated tags and text
DocBuffer.fill - Buffered 2273 bytes of text and 311 tags

Now that the document is available, the rules are unblocked and allowed to access it. The URL queue poses a producer-consumer problem with multiple producers, the rules, and one consumer, Scout. Scout will be blocked from exiting on an empty queue if any rule is still running with the potential to add to the queue.

UniversityName.run - acquired document http://www.nku.edu/ [0]
PhoneNumber.run - acquired document http://www.nku.edu/ [0]
Acronym.run - acquired document http://www.nku.edu/ [0]
BFS.run - acquired document http://www.nku.edu/ [0]
PhoneNumber.processDoc - processing document http://www.nku.edu/ [0]
UniversityName.processDoc - processing document http://www.nku.edu/ [0]
Acronym.processDoc - processing document http://www.nku.edu/ [0]
BFS - Extracting links from URL http://www.nku.edu/
PhoneNumber.processDoc - searching on pattern ((\(\d{3}\))|(\d{3})){0,1}(\s){0,1}(\d{3}-\d{4})
UniversityName.processDoc - searching on pattern (University\s+of\s+([A-Z][A-Za-z]*\s+)+)|(([A-Z][A-Za-z]*\s+)+University)
Acronym.processDoc - searching on pattern ([A-Z][A-Z]+)

Next, the rules finish the document and report their results. BFS does not report any results, but announces that it has enqueued 34 discovered URLs to Scout's search queue.

Acronym.run - finished document http://www.nku.edu/ [0] in 0 minutes 0 seconds
Results.put - Storing 8 results for rule Acronym, document 0
BFS - Enqueued 34 URLs
BFS.run - finished document http://www.nku.edu/ [0] in 0 minutes 1 seconds
Results.put - Storing 0 results for rule BFS, document 0
UniversityName.run - finished document http://www.nku.edu/ [0] in 0 minutes 2 seconds
Results.put - Storing 3 results for rule UniversityName, document 0

If the configuration calls for collection of more than one URL, Scout will wait on the URL queue until either one of the rules produces a URL or all of the rules exit without producing one. Since the configuration parameter MaxURLs calls for only one URL to be collected, Scout does not wait on the queue. Instead it begins its shutdown procedure. Meanwhile, the remaining rule threads finish their tasks and exit.

Scout.run - URL queue exhausted. Shutting down buffer and exiting...
DocBuffer.close() - closing
Acronym.run - finished
BFS.run - finished
UniversityName.run - finished
PhoneNumber.run - finished document http://www.nku.edu/ [0] in 0 minutes 2 seconds
Results.put - Storing 2 results for rule PhoneNumber, document 0
PhoneNumber.run - finished

Scout provides a summary of the results reported by each rule listed according to their names and subscripted by document and result numbers.

Acronym[0,0]: MM
Acronym[0,1]: MM
Acronym[0,2]: NKU
Acronym[0,3]: JD
Acronym[0,4]: MBA
Acronym[0,5]: NCAA
Acronym[0,6]: KY
Acronym[0,7]: NKU
PhoneNumber[0,0]: (606) 572-5220
PhoneNumber[0,1]: 637-9948
UniversityName[0,0]: Northern Kentucky University
UniversityName[0,1]: NKU Northern Kentucky University
UniversityName[0,2]: Other News Northern Kentucky University

Lastly, some statistics for the session are recorded and the program exits.

Scout.logResults - URL Stats: discovered = 0 requested = 1 expanded = 1 ignored = 0 failed = 0 error = 0
CacheManager.save - Saving cache information
Scout.run - Finished after 0 minutes 9 seconds

An Application for Processing Javadoc

Java provides a source-tagging syntax and a utility, javadoc, for automating HTML-formatted documentation of classes and packages. Because the documents produced by javadoc have a predictable markup structure and familiar subject matter in a limited domain, we built a test application, JavaDoc, to traverse this document space and produce results describing the Java packages and classes. We now describe JavaDoc and our early experience using it.


The JavaDoc Application

We were interested in describing the Java containment hierarchy of packages, their classes and interfaces, and their fields and methods. We were also interested in making field and method inheritance explicit in our object descriptions, which it is not in the javadoc-produced HTML. JavaDoc contains a class for describing Java objects and primitives and three rules to process a portion of the set of javadoc documents that shipped with Sun's JDK version 1.1.4. We also built a stand-alone postprocessor and interactive query tool to index and look up lists of interfaces and classes by the names of fields or methods they contain.

The subset of the documentation we concentrated on consists of a file that lists the packages contained in a Java distribution, package-index files that describe the interface and class contents of each of these packages, and a large set of files describing the interfaces and classes themselves. In designing our rules, we were careful to approach these documents as an HTML corpus without considering the tags or engine that produced them.

Our first rule, PackageListRule, reads the package-list file and appends the URLs of the package-index files referenced to Scout's search queue without producing any results. The second rule, PackageIndexRule, reads the package-index documents and builds a hash table describing the packages. This table contains entries giving the package name, and lists of the interfaces, classes, exceptions and errors it contains. For each package member, class or interface, the URL of the file documenting it is added to the search queue. The third rule, ClassDocRule, processes the class and interface descriptions and builds a description of each one as a vector of JavaDocObject instances.

The JavaDocObject class represents a Java class, interface, variable, or method. It stores the URL of the javadoc-generated source file, along with the name of the object, information about its type, scope, containment and parentage. If the object is a class or interface, the JavaDocObject stores its description and a list of any interfaces it implements. If the object is a method, the JavaDocObject stores a list of its parameter types. As mentioned previously, a vector of JavaDocObjects describes a class or interface where the first element represents the class itself and subsequent elements represent its variables and methods.

Collection and Processing

We ran JavaDoc in two phases, both on a 166 MHz Pentium machine with 64 MB of RAM. In the first phase, we collected the documents by initializing Scout with PackageListRule and PackageIndexRule and seeding the URL queue with the URL of the package-list file on the UK Computer Science Department's Web server. The session ran over a 28.8 kbps modem connection to a commercial ISP in the Lexington area, with Scout configured to stall 2 seconds between requests. Scout collected 500 documents (1 package list, 22 package indices, and 477 class or interface description files) in just under 63 minutes. Scout spent at least 55 minutes of this time in networking and other overhead unrelated to the rules, as no rule recorded a full second of processing time per document.

In the second phase, the rules from the first session were used again, with the ClassDocRule added to build the vectors describing the 477 class and interface descriptions. This time, operating on the cached copies of the documents, the session completed in 12 minutes, 2 seconds. Again, the maximum time reported by a rule in processing a document was under 1 second. The results generated by the JavaDoc application can be seen at the end of the log output from the second-phase session.

Postprocessing

Java is a fully object-oriented language, but javadoc does not explicitly document inherited members in subclasses. In order to easily answer the question "what classes or interfaces implement a method named Y?", we built a small Java program BuildMethodIndex to extract the ClassDocRule results from the table saved by Scout and compute a representation of the classes and interfaces that makes inherited members explicit by adding references to the ancestral member in the descendant object's vector. It then builds a hash table using member names as keys and vectors of the object names that implement the key member as values. We then built an interactive application QueryMethodIndex that reads member names at a prompt and displays the list of objects that contain a member of that name.
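The inverted index at the heart of BuildMethodIndex can be sketched as follows. The (class, member) pair input is invented for illustration; the real program extracts this information from the saved ClassDocRule results and from the inheritance links it has made explicit.

```java
import java.util.Hashtable;
import java.util.Vector;

// Inverted index: member name -> vector of the classes or interfaces
// containing a member of that name. The (class, member) pair input is
// invented for illustration; the real program reads ClassDocRule results.
public class MethodIndex {
    public static Hashtable build(String[][] members) {
        Hashtable index = new Hashtable();
        for (int i = 0; i < members.length; i++) {
            String cls = members[i][0];
            String member = members[i][1];
            Vector owners = (Vector) index.get(member);
            if (owners == null) {
                owners = new Vector();
                index.put(member, owners);
            }
            if (!owners.contains(cls)) // inherited duplicates collapse
                owners.addElement(cls);
        }
        return index;
    }
}
```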

Assume that the file classdoc.dat contains the saved state of the second-phase JavaDoc session, in which the class description files were processed by an instance of ClassDocRule identified in the template as ClassDocRule, and that we want to build the method index in the file method.index. The postprocessing then proceeds as follows:

% java JavaDoc.BuildMethodIndex classdoc.dat ClassDocRule method.index

Once this process completes, we can run the interactive QueryMethodIndex program.

% java JavaDoc.QueryMethodIndex method.index

Because the methods table is a large and complex structure and Java is fairly slow reconstructing it from the version serialized to disk, the application reports that it is running.

wait... ready!

The simplest result to demonstrate is a negative one.

? MAX

Not Found MAX

Next, we query for a variable member that appears in more than one class.

? MAX_VALUE

[java.lang.Long, java.lang.Character, java.lang.Float, java.lang.Double, java.lang.Integer, java.lang.Short, java.lang.Byte]

Now, we query for a method member.

? contains

Found contains [java.awt.Panel, java.awt.FileDialog, java.awt.TextField, java.awt.Choice, java.applet.Applet, java.util.Vector, java.awt.List, java.util.Hashtable, java.security.Provider, java.awt.Container, java.awt.Polygon, java.awt.Button, java.awt.TextComponent, java.awt.Dialog, java.awt.Label, java.awt.Component, java.awt.Window, java.awt.Canvas, java.awt.ScrollPane, java.util.Stack, java.awt.TextArea, java.awt.Frame, java.awt.Checkbox, java.awt.Rectangle, java.util.Properties, java.awt.Scrollbar]

A query on a member that is contained in java.lang.Object, the ultimate parent of all Java objects, would result in all 477 classes being listed.

Conclusions

We have described Scout, a Java infrastructure for building and running Web-based information retrieval applications. We have also described two proof-of-concept applications, one for extracting basic information from university home pages and another for converting javadoc-produced documentation into objects representing the classes and interfaces.

As proof-of-concept tests, both examples were encouraging, but serious questions regarding performance remain to be studied. In these applications, the vast majority of execution time was spent in networking, other input/output operations, and thread management. It will be interesting to apply performance metrics to determine exactly how this time is being spent. It will also be interesting to run rules on a larger cached collection to get a better idea of how performance scales with the cache size and to run applications that deal with multiple web servers so that Scout need not be artificially slowed to limit the individual server load.

Currently, the set of rules for a session is loaded once at program start, and all the threads that will ever run in the session run throughout. It would be useful for rules to be able to add other rules to the set dynamically and to remove themselves from the set when they have fulfilled their purpose. The JavaDoc example is a case in point: the sets of documents handled by each rule are disjoint, yet all rules remain active throughout the session and must touch each document. Rules that run only when the session completes could do much of the work that presently requires a postprocessing phase. Such a class of rules could easily be created by adding an attribute to the template syntax and making minor modifications to the Scout thread's code.

The simple FIFO queue of URLs has also begun to show weaknesses. Currently, Scout stalls idle between successive requests to the same host. The ability to look ahead in the queue for a URL located on a different server would allow computation to progress. A priority queue that rules can modify would also be useful in many circumstances, although it raises interesting questions about how much authority the rules should have over central data structures.
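Such a look-ahead might be sketched as follows; the queue API and the simplistic host extraction are assumptions for illustration.

```java
import java.util.Vector;

// Sketch of the proposed look-ahead: skip past URLs on the host we
// would have to stall for. The API and the simplistic host extraction
// are assumptions for illustration only.
public class LookAheadQueue {
    private final Vector urls = new Vector();

    public void add(String url) { urls.addElement(url); }

    // Crude host extraction: the text between "//" and the next "/".
    static String hostOf(String url) {
        int start = url.indexOf("//") + 2;
        int end = url.indexOf('/', start);
        return end < 0 ? url.substring(start) : url.substring(start, end);
    }

    // Remove the first URL not on the stalled host; fall back to the
    // front of the queue if every queued URL is on that host.
    public String removeAvoiding(String stalledHost) {
        for (int i = 0; i < urls.size(); i++) {
            String u = (String) urls.elementAt(i);
            if (!hostOf(u).equals(stalledHost)) {
                urls.removeElementAt(i);
                return u;
            }
        }
        if (urls.isEmpty()) return null;
        String front = (String) urls.elementAt(0);
        urls.removeElementAt(0);
        return front;
    }
}
```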

To date, Scout has been strictly an HTTP application, but there is no reason why it should not be extended to support other protocols such as FTP, gopher, finger, or any other service that can be referenced by a URL. Support for other protocols will require substantial rethinking of much of Scout's design, though, so there are no short-term plans to begin this generalization effort.

Future Applications

We have just begun to explore uses for Scout. We have considered several applications that might be implemented with Scout:

Email-Address Extractor
A simple rule could identify and extract email addresses from Web pages.
Meta-Search Engine
Seeded with URL queries to several search engines, Scout could collect their responses and apply rules to filter and rank returned links.
Look-ahead and Content-filtering Caching Proxy Servers
A rule that provides a network-server socket could use Scout to collect the documents, analyze them, then selectively relay them to the user, meanwhile expanding URLs contained in the document so they would be locally available if the user requested them.
Comparative-Shopping Agents
Perhaps beginning with a Meta-Search Engine query, a set of rules could collect and sort pages describing products of interest and identify the vendor offering the best prices.

There are certainly many other applications for Scout that we have not yet imagined. We believe that Scout may prove particularly useful for XML applications, which can use the existing tag parser without modification. We hope that other programmers will adapt Scout to their purposes, creating new classes of robots and improving how all of us navigate the vast maze of information available on the World Wide Web.

References

  1. Tschalär, Ronald. HTTPClient, 1998, http://www.innovation.ch/java/HTTPClient/
  2. Brandt, Steven R. Regular Expressions in Java, 1998, http://javaregex.com/
  3. Devulapalli, Praveen. A Web-crawling engine to discover email addresses, 1997, Master's Project Report, University of Kentucky Department of Computer Science
  4. Koster, Martijn. A Standard for Robot Exclusion, 1994, http://info.webcrawler.com/mak/projects/robots/norobots.html