Scout
An Infrastructure for Web-Based Information Retrieval
Department of Computer Science, University of Kentucky
Abstract
We describe Scout, a multithreaded robot infrastructure for Web-based
information retrieval tasks. Scout implements HTTP communications,
document caching, and simple HTML parsing, and can be extended with
programs called rules that perform arbitrary data-processing
tasks on the documents collected. We describe and demonstrate two
simple proof-of-concept applications built using Scout: one that
collects and extracts basic information about a university from its
home page, and one that converts javadoc documentation to data
structures that can be categorized and manipulated in
interesting ways. We conclude with some observations about the
performance of the sample applications and discuss some future
applications that might be built using Scout.
Contents
- Introduction
- Scout Overview
- Rules
- Third-Party Code Used in Scout
- Robots Exclusion Protocol
- Scout Usage Example
- Running Scout
- Log Output
- An Application for Processing Javadoc
- The JavaDoc Application
- Collection and Processing
- Postprocessing
- Conclusions
- Future Applications
- References
Introduction
Scout
- A general-purpose Web robot
- Arbitrarily extensible with procedures called rules
- Initialized with a set of rules and a list of URLs, Scout
- Collects the documents associated with the URLs
- Provides them to the rules for processing
Rules
- Techniques specialized to text or data formats
- Parse natural language
- Interpret markup structure
- Analyze binary-data formats
- Produce and use data structures called results
- Append discovered URLs to Scout's search queue
- Connect to an external database to export results
Scout Overview
- Multithreaded Java application with readers-writer model
- Scout thread = the writer
- Removes URLs from a search queue and retrieves associated documents
- Documents and their HTTP headers are stored in a shared buffer
- Rule threads = readers
- Required to "touch" each document to maintain synchronization
- May release document without performing any work
- Threads also synchronize on the URL queue
- Scout must differentiate an empty queue from a queue waiting on a rule to
produce a URL
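The handoff between the Scout thread and the rule threads can be sketched roughly as follows. This is a minimal illustration built on Java's wait/notifyAll, not Scout's actual buffer code: the class name DocBufferSketch and the methods fill, acquire, release, close, and demo are all hypothetical, and a document is reduced to a plain string.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical single-slot buffer: one writer fills it, every reader must touch each document. */
class DocBufferSketch {
    private final int readerCount;
    private String doc;              // current document, null before the first fill
    private int sequenceNumber = -1; // index of the current document
    private int pendingReaders = 0;  // readers that have not yet released the current document
    private boolean closed = false;

    DocBufferSketch(int readerCount) { this.readerCount = readerCount; }

    /** Writer: blocks until every reader has released the previous document. */
    synchronized void fill(String newDoc) throws InterruptedException {
        while (pendingReaders > 0) wait();
        doc = newDoc;
        sequenceNumber++;
        pendingReaders = readerCount;
        notifyAll();
    }

    /** Reader: blocks until a document newer than lastSeen arrives; returns null once closed. */
    synchronized String acquire(int lastSeen) throws InterruptedException {
        while (sequenceNumber <= lastSeen && !closed) wait();
        return sequenceNumber > lastSeen ? doc : null;
    }

    /** Reader: signals that it is done with the current document, whether or not it did any work. */
    synchronized void release() {
        pendingReaders--;
        notifyAll();
    }

    /** Writer: waits for the last document to be released, then tells readers to stop. */
    synchronized void close() throws InterruptedException {
        while (pendingReaders > 0) wait();
        closed = true;
        notifyAll();
    }

    /** Runs one writer and readerCount readers; returns how many documents each reader touched. */
    static List<Integer> demo(List<String> docs, int readerCount) {
        final DocBufferSketch buffer = new DocBufferSketch(readerCount);
        final List<Integer> touched = Collections.synchronizedList(new ArrayList<Integer>());
        List<Thread> readers = new ArrayList<>();
        for (int i = 0; i < readerCount; i++) {
            Thread t = new Thread(() -> {
                int lastSeen = -1;
                try {
                    while (buffer.acquire(lastSeen) != null) {
                        lastSeen++;       // a real rule would process the document here
                        buffer.release();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                touched.add(lastSeen + 1);
            });
            readers.add(t);
            t.start();
        }
        try {
            for (String d : docs) buffer.fill(d);
            buffer.close();
            for (Thread t : readers) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return touched;
    }
}
```

Because the writer cannot advance until the pending count reaches zero, a rule that has no interest in a document still has to acquire and release it, which is the "touch" requirement described above.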
Scout Thread
- Avoids redundant links that could lead to cycling
- Caches documents to minimize network traffic
- Implements the Robots Exclusion Protocol
- Can be configured to stall between accesses to the same server
- Attempts to parse each document as HTML before buffering
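Two of these duties, skipping URLs already seen and stalling between requests to one server, might look something like the sketch below. It is an assumption-laden illustration: PolitenessSketch and its methods are hypothetical names, and an exact java.util.HashSet stands in for the probabilistic Hashlookup.class that Scout uses.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: remembers visited URLs and spaces out requests per host. */
class PolitenessSketch {
    private final Set<String> seen = new HashSet<>();
    private final Map<String, Long> nextFree = new HashMap<>(); // host -> earliest next access time
    private final long delayMillis;

    PolitenessSketch(long delayMillis) { this.delayMillis = delayMillis; }

    /** Returns true the first time a URL is offered and false for repeats. */
    boolean markIfNew(String url) {
        // normalize() removes "." and ".." path segments before the comparison
        return seen.add(URI.create(url).normalize().toString());
    }

    /** Returns how long to wait before contacting the URL's host, and records the planned access. */
    long waitBeforeAccess(String url, long nowMillis) {
        String host = URI.create(url).getHost();
        Long free = nextFree.get(host);
        long wait = (free == null) ? 0 : Math.max(0, free - nowMillis);
        nextFree.put(host, nowMillis + wait + delayMillis);
        return wait;
    }
}
```

A caller would sleep for the returned number of milliseconds before issuing the request; requests to different hosts are never delayed by one another.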
Rule Threads
- Extensions of a base class, Rule.class
- Synchronizes threads
- Records results
- Subclasses override one method, processDoc()
- Called once by the parent class for each document
When processDoc() is called
- Previously unseen document buffered in scout.doc
- Contains HTTP headers and body
- Body separated into tags and text if source is HTML
- Integer variable sequenceNumber identifies the document
- All threads use it consistently to refer to the document
- Local vector results is initialized empty, ready to receive results from the rule
- Object scout.ruleResults contains a table of previously generated results
The overridden processDoc()
- Performs task-specific work on
- The buffered document
- Previous entries in the results table
- May generate Java objects of any class as results
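To make the subclassing pattern concrete, here is a small sketch. Only the names processDoc(), sequenceNumber, and results come from the description above; the stand-in base class RuleSketch, its handle and resultsFor methods, the docText field, and the CapitalizedWordRule subclass are hypothetical and greatly simplified (no threads, no shared buffer).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Vector;

/** Hypothetical stand-in for the base class: prepares state, calls processDoc(), records results. */
abstract class RuleSketch {
    protected int sequenceNumber;     // identifies the document being processed
    protected String docText;         // simplified document body
    protected Vector<Object> results; // emptied before each call to processDoc()
    private final Map<Integer, Vector<Object>> resultsTable = new HashMap<>();

    /** Called once per document by the framework side of this sketch. */
    final void handle(int seq, String text) {
        sequenceNumber = seq;
        docText = text;
        results = new Vector<>();
        processDoc();
        resultsTable.put(seq, results);
    }

    /** The one method a subclass overrides. */
    protected abstract void processDoc();

    Vector<Object> resultsFor(int seq) { return resultsTable.get(seq); }
}

/** Example rule: records every whitespace-separated word that starts with a capital letter. */
class CapitalizedWordRule extends RuleSketch {
    @Override
    protected void processDoc() {
        for (String word : docText.split("\\s+")) {
            if (!word.isEmpty() && Character.isUpperCase(word.charAt(0))) {
                results.add(word);
            }
        }
    }
}
```

The subclass never touches synchronization or result storage; it only reads the document and appends to results, which mirrors the division of labor described for Rule.class.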
Third-Party Code
HTTPClient by Ronald Tschalär
- Provided under the GNU public license
- A more developed interface to Web connections than those given in the java.net package
Pat 1.0 by Steven R. Brandt
- Free for educational use
- Perl-style regular-expression syntax
Hashlookup.class by Praveen Devulapalli
- Unpublished, 1997 Master's Project
- Fast, probabilistic string (URL) equality testing
Robots Exclusion Protocol
- Developed by Martijn Koster
- Web servers instruct robots not to access certain paths
- Relies on the cooperation of robots
- Exclusions are listed in a file, robots.txt, in the server's Web root directory
Sample robots.txt File
# forbid a known rude robot to access anything
User-agent: RudeBot 6.66
Disallow: /
# forbid all other robots to access the cgi-bin
# and images directories
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Scout's Implementation of the Protocol
- First request to any server will be for robots.txt
- Prohibitions, if any, merged with a default set
- Can be disabled for searches on well-known hosts
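A robot's side of the protocol can be sketched as below. This is not Scout's Nobots code; RobotsSketch, exclusionsFor, and allowed are hypothetical names, and the matching follows the 1994 standard only loosely (case-insensitive substring match on the robot name, a record for a named robot taking precedence over the "*" record, prefix matching on paths).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical robots.txt reader: finds the Disallow prefixes that apply to one robot. */
class RobotsSketch {
    static List<String> exclusionsFor(String robotsTxt, String robotName) {
        List<String> specific = new ArrayList<>(); // from records naming this robot
        List<String> wildcard = new ArrayList<>(); // from the "*" record
        boolean matchedSpecific = false;
        boolean inSpecific = false, inWildcard = false, lastWasAgent = false;
        for (String raw : robotsTxt.split("\r?\n")) {
            int hash = raw.indexOf('#');           // strip comments
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) continue;
            String field = line.substring(0, colon).trim().toLowerCase();
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                if (!lastWasAgent) { inSpecific = false; inWildcard = false; } // a new record starts
                if (value.equals("*")) {
                    inWildcard = true;
                } else if (!value.isEmpty()
                        && robotName.toLowerCase().contains(value.toLowerCase())) {
                    inSpecific = true;
                    matchedSpecific = true;
                }
                lastWasAgent = true;
            } else if (field.equals("disallow")) {
                lastWasAgent = false;
                if (value.isEmpty()) continue;     // an empty Disallow forbids nothing
                if (inSpecific) specific.add(value);
                if (inWildcard) wildcard.add(value);
            }
        }
        return matchedSpecific ? specific : wildcard;
    }

    /** A path is allowed unless it starts with one of the excluded prefixes. */
    static boolean allowed(String path, List<String> exclusions) {
        for (String prefix : exclusions) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }
}
```

Merging with a default set, as Scout does, would amount to appending further prefixes to the returned list before calling allowed.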
Example
Assume
- A generic Unix environment with Sun's Java bytecode interpreter
- Environment configured to include required class libraries
- Two required files
- Initialization file to set run-time parameters
- HTML file containing a template
Example
Collecting a university name, acronym, and phone number from http://www.nku.edu/
- Initialization data and template contained in the files NKU.ini and university.html
- Template associates three instances of a regular-expression matching rule, RegExpRule.class, with the names UniversityName, Acronym, and PhoneNumber
- A fourth rule, BreadthFirstSearch.class, instantiated as BFS, queues URLs
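The kind of matching a regular-expression rule performs can be illustrated with the phone-number pattern that appears in the log output for this example. The sketch uses the standard java.util.regex package rather than the Pat library that Scout is built on, and the class RegExpSketch, its findAll method, and the sample sentence in the usage note are hypothetical. Matches are trimmed, on the assumption that this is what the rule's trim=true setting does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of pattern-based extraction using java.util.regex. */
class RegExpSketch {
    // Building blocks as they appear in the log output
    static final String USAREACODE = "(\\(\\d{3}\\))|(\\d{3})";
    static final String USPHONENUMBER = "\\d{3}-\\d{4}";

    // Optional area code, optional whitespace character, then the seven-digit number
    static final Pattern PHONE =
            Pattern.compile("(" + USAREACODE + "){0,1}(\\s){0,1}(" + USPHONENUMBER + ")");

    /** Returns every match of the pattern in the text, with surrounding whitespace trimmed. */
    static List<String> findAll(Pattern pattern, String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            matches.add(m.group().trim());
        }
        return matches;
    }
}
```

For instance, findAll(RegExpSketch.PHONE, "Call (606) 572-5220 or 637-9948") yields the two numbers. The second raw match begins with the space before 637, because the optional whitespace group consumes it, which is why trimming is useful here.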
Running Scout
% java Scout.Scout NKU.ini university.html
Configuration File: NKU.ini
[EXTRACTOR]
EntityFile: /a/al/u/al-d7/csgrad/borchers/classes/SGMLKit/entities.txt
[SCOUT]
CacheDir: NKUCache
LogFile: NKU.log
MaxCacheFiles: 256
MaxURLs: 1
NetDelay: 2000
PersistFile: NKU.dat
RequestRobotsFile: true
RestrictDomain: null
RestrictHost: www.nku.edu
SearchCache: true
SearchWeb: true
StartURL: http://www.nku.edu/
UseGUI: true
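Reading a file laid out like this one could be sketched as follows. The parsing rules are inferred from the listing alone (bracketed section names, then "Key: value" lines split at the first colon) and may not match Scout's real loader; IniSketch and parse are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical reader for a sectioned "Key: value" configuration file. */
class IniSketch {
    /** Returns section name -> (key -> value), preserving the order of the file. */
    static Map<String, Map<String, String>> parse(String text) {
        Map<String, Map<String, String>> sections = new LinkedHashMap<>();
        Map<String, String> current = null;
        for (String raw : text.split("\r?\n")) {
            String line = raw.trim();
            if (line.startsWith("[") && line.endsWith("]")) {
                current = new LinkedHashMap<>();
                sections.put(line.substring(1, line.length() - 1), current);
            } else if (current != null) {
                // Split at the first colon only, so values such as URLs keep their own colons
                int colon = line.indexOf(':');
                if (colon > 0) {
                    current.put(line.substring(0, colon).trim(),
                                line.substring(colon + 1).trim());
                }
            }
        }
        return sections;
    }
}
```

All values are kept as strings; converting entries such as NetDelay or MaxURLs to numbers would be left to the caller.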
Log Output Excerpts
Scout.setvar - Set runtime variable USAREACODE=(\(\d{3}\))|(\d{3})
Scout.setvar - Set runtime variable USPHONENUMBER=\d{3}-\d{4}
Scout.setvar - Set runtime variable CAPWORD=[A-Z][A-Za-z]*
Scout.setvar - Set runtime variable ACRONYM=[A-Z][A-Z]+
UniversityName.RegExpRule - ready to search on pattern (University\s+of\s+([A-Z][A-Za-z]*\s+)+)|(([A-Z][A-Za-z]*\s+)+University)
PhoneNumber.RegExpRule - ready to search on pattern ((\(\d{3}\))|(\d{3})){0,1}(\s){0,1}(\d{3}-\d{4})
Acronym.RegExpRule - ready to search on pattern ([A-Z][A-Z]+)
Scout.Scout - Loaded 4 rules
Scout.Scout - BFS - {type=D, parse=void, value=null, validate=true, name=BFS, rule=Scout.BreadthFirstSearch}
Scout.Scout - UniversityName - {type=D, parse=String, squeezedoc=true, value=null, validate=true, trim=true, pattern=(University\s+of\s+([A-Z][A-Za-z]*\s+)+)|(([A-Z][A-Za-z]*\s+)+University), name=UniversityName, rule=Scout.RegExpRule}
Scout.Scout - PhoneNumber - {type=D, parse=string, value=null, validate=true, trim=true, pattern=((\(\d{3}\))|(\d{3})){0,1}(\s){0,1}(\d{3}-\d{4}), name=PhoneNumber, squeezematch=true, rule=Scout.RegExpRule}
Scout.Scout - Acronym - {type=D, value=null, validate=true, trim=true, pattern=([A-Z][A-Z]+), name=Acronym, rule=Scout.RegExpRule}
Scout.restoreState - Initialized cache of 0 objects
Scout.restoreState - No state data found. Creating new objects
Scout.run - Scout started at Mon Nov 30 21:34:02 EST 1998
BFS.run - starting
URLQueue.removeFront - returning http://www.nku.edu/
UniversityName.run - starting
Nobots.getHostExclusions - Getting exclusions for host www.nku.edu
Nobots.loadExclusions - Read 7 path exclusions for host www.nku.edu
PhoneNumber.run - starting
Nobots.getHostExclusions - Stored 10 excluded paths and 19 excluded types for www.nku.edu
Acronym.run - starting
Scout.getDocument - Requesting URL http://www.nku.edu/ from cache
Scout.getDocument - URL not cached - hitting the Web now
Scout.getDocument - Separated tags and text
DocBuffer.fill - Buffered 2273 bytes of text and 311 tags
Log Output Excerpts, More
UniversityName.run - acquired document http://www.nku.edu/ [0]
PhoneNumber.run - acquired document http://www.nku.edu/ [0]
Acronym.run - acquired document http://www.nku.edu/ [0]
BFS.run - acquired document http://www.nku.edu/ [0]
PhoneNumber.processDoc - processing document http://www.nku.edu/ [0]
UniversityName.processDoc - processing document http://www.nku.edu/ [0]
Acronym.processDoc - processing document http://www.nku.edu/ [0]
BFS - Extracting links from URL http://www.nku.edu/
PhoneNumber.processDoc - searching on pattern ((\(\d{3}\))|(\d{3})){0,1}(\s){0,1}(\d{3}-\d{4})
UniversityName.processDoc - searching on pattern (University\s+of\s+([A-Z][A-Za-z]*\s+)+)|(([A-Z][A-Za-z]*\s+)+University)
Acronym.processDoc - searching on pattern ([A-Z][A-Z]+)
Acronym.run - finished document http://www.nku.edu/ [0] in 0 minutes 0 seconds
Results.put - Storing 8 results for rule Acronym, document 0
BFS - Enqueued 34 URLs
BFS.run - finished document http://www.nku.edu/ [0] in 0 minutes 1 seconds
Results.put - Storing 0 results for rule BFS, document 0
UniversityName.run - finished document http://www.nku.edu/ [0] in 0 minutes 2 seconds
Results.put - Storing 3 results for rule UniversityName, document 0
Scout.run - URL queue exhausted. Shutting down buffer and exiting...
DocBuffer.close() - closing
Acronym.run - finished
BFS.run - finished
UniversityName.run - finished
PhoneNumber.run - finished document http://www.nku.edu/ [0] in 0 minutes 2 seconds
Results.put - Storing 2 results for rule PhoneNumber, document 0
PhoneNumber.run - finished
Log Output Excerpts, Concludes
Acronym[0,0]: MM
Acronym[0,1]: MM
Acronym[0,2]: NKU
Acronym[0,3]: JD
Acronym[0,4]: MBA
Acronym[0,5]: NCAA
Acronym[0,6]: KY
Acronym[0,7]: NKU
PhoneNumber[0,0]: (606) 572-5220
PhoneNumber[0,1]: 637-9948
UniversityName[0,0]: Northern Kentucky University
UniversityName[0,1]: NKU Northern Kentucky University
UniversityName[0,2]: Other News Northern Kentucky University
Scout.logResults - URL Stats: discovered = 0 requested = 1 expanded = 1 ignored = 0 failed = 0 error = 0
CacheManager.save - Saving cache information
Scout.run - Finished after 0 minutes 9 seconds
An Application for Processing Javadoc
What is Javadoc?
- Source-tagging syntax
- Utility, javadoc
- Produces HTML-formatted documentation for Java classes and packages
Javadoc Files
- Have a predictable markup structure
- Have a familiar subject matter
- Cover a limited domain
- Are an ideal test application for Scout
The JavaDoc Application
Files Processed
- A package-list file that lists the packages in a Java distribution
- A set of package-index files that describe the interface and class contents of each of these packages
- A large set of files describing the interfaces and classes themselves
JavaDoc Consists of
- A class for describing Java objects and primitives, JavaDocObject
- Three rules to process specific types of javadoc-produced documents
- A stand-alone postprocessor to index lists of interfaces and classes by member names
- An interactive query tool to search this index
JavaDocObject
- Represents a Java class, interface, variable, or method
- Stores
- URL of the javadoc-generated source file
- Name of the object
- Information about type, scope, containment and parentage
- Class or interface: description and list of interfaces
- Method: parameter types
JavaDoc Rules
PackageListRule
- Reads package-list file and queues the URLs of the package-index files
- Produces no results
PackageIndexRule
- Reads the package-index documents and builds a table describing the packages
- Queues URLs for each package member, class or interface
ClassDocRule
- Processes the class and interface descriptions
- Produces a description of each one as a vector of JavaDocObjects
Collection and Processing
- Platform
- 166 MHz Pentium machine
- 64 MB RAM
- Sun's JVM 1.1.4
- Ran in two phases
- Documents collected and cached using PackageListRule and PackageIndexRule
- Documents processed using PackageListRule, PackageIndexRule and ClassDocRule
Statistics - Phase 1
- 500 documents
- 1 package list
- 22 package indices
- 477 class or interface description files
- Just under 63 minutes
- 55 minutes of this time in networking and other overhead
Statistics - Phase 2
- Completed in 12 minutes, 2 seconds
- Maximum time reported by a rule in processing a document was under 1 second
Postprocessing
- Purpose: Answer the question "what classes or interfaces contain a member named Y?"
- Small Java program BuildMethodIndex
- Extracts the ClassDocRule results from the table saved by Scout
- Makes inherited members explicit by adding references to the ancestral member in the descendant object's vector
- Builds a hash table of vectors of the object names that implement a member, using the member name as a key
- An interactive application QueryMethodIndex
- Reads member names at a prompt and displays the list of objects that contain a member of that name
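The indexing step can be sketched as below. This is not the BuildMethodIndex source: MethodIndexSketch and build are hypothetical names, the input is simplified to two maps (declared members per class, and each class's parent), and the parent chain is assumed to contain no cycles.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical index builder: member name -> names of the classes that contain it. */
class MethodIndexSketch {
    /**
     * members: class name -> member names declared in that class.
     * parents: class name -> parent class name (absent for root classes).
     */
    static Map<String, List<String>> build(Map<String, List<String>> members,
                                           Map<String, String> parents) {
        Map<String, List<String>> index = new HashMap<>();
        for (String cls : members.keySet()) {
            // Make inherited members explicit by walking up the parent chain
            Set<String> all = new LinkedHashSet<>();
            for (String c = cls; c != null; c = parents.get(c)) {
                List<String> declared = members.get(c);
                if (declared != null) all.addAll(declared);
            }
            // Key the index on the member name
            for (String member : all) {
                index.computeIfAbsent(member, k -> new ArrayList<>()).add(cls);
            }
        }
        for (List<String> owners : index.values()) {
            Collections.sort(owners); // stable output order for display
        }
        return index;
    }
}
```

A query tool then only needs a single hash lookup per member name, reporting "not found" when the lookup returns null.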
Example
- classdoc.dat contains Scout's saved state from Phase 2
- Phase 2 ClassDocRule instance was named ClassDocRule
- Builds the method index in method.index
% java JavaDoc.BuildMethodIndex classdoc.dat ClassDocRule method.index
% java JavaDoc.QueryMethodIndex method.index
wait...
ready!
? MAX
Not Found MAX
? MAX_VALUE
[java.lang.Long, java.lang.Character,
java.lang.Float, java.lang.Double,
java.lang.Integer, java.lang.Short,
java.lang.Byte]
? contains
Found contains
[java.awt.Panel, java.awt.FileDialog,
java.awt.TextField, java.awt.Choice,
java.applet.Applet, java.util.Vector,
java.awt.List, java.util.Hashtable,
java.security.Provider, java.awt.Container,
java.awt.Polygon, java.awt.Button,
java.awt.TextComponent, java.awt.Dialog,
java.awt.Label, java.awt.Component,
java.awt.Window, java.awt.Canvas,
java.awt.ScrollPane, java.util.Stack,
java.awt.TextArea, java.awt.Frame,
java.awt.Checkbox, java.awt.Rectangle,
java.util.Properties, java.awt.Scrollbar]
A query on a member contained in java.lang.Object, the ultimate parent
of all Java objects, would result in all 477 classes being listed!
Conclusions
- As proof-of-concept tests, both examples were encouraging
- Serious performance questions remain to be studied
- Need to apply performance metrics to determine exactly how Scout is spending its time
- It may be advantageous to rebuild the Scout-Rule interface so that rules execute serially
- Need to run rules on a larger cached collection to see how performance scales with the size of the cache
- Need to run applications that deal with multiple web servers to eliminate stalling
Future applications
Email-Address Extractor
- A simple rule could identify and extract email addresses from Web pages
Meta-Search Engine
- Seeded with URL queries to several search engines, Scout could collect responses and filter and rank returned links
Proxy Servers
- A rule providing a server socket collects and relays documents
- Look-ahead: extracts links and caches them in advance
- Content-filtering: analyzes document content and selectively relays it to the user
Comparative-Shopping Agents
- A set of rules to collect and sort pages describing products of interest and identify the vendor offering the best prices
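As a rough idea of how small the email-address extractor could be, the sketch below pulls address-like strings out of page text. It is only an illustration: EmailSketch and extract are hypothetical names, java.util.regex stands in for Scout's Pat library, and the pattern is deliberately loose rather than a full address grammar.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical extractor: finds strings shaped like email addresses. */
class EmailSketch {
    // Local part, "@", dotted domain, and an alphabetic top-level label of two or more letters
    static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    /** Returns the address-like matches in order of appearance. */
    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }
}
```

Wrapped in a rule, the body of processDoc() would run this over the document text and add each match to the results vector.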
Thank You!
Salutations to my ever-helpful committee!
- Raphael A. Finkel, Chair
- Victor W. Marek
- Miroslaw Truszczynski
and, of course,
- Carol Hannahs, who mentioned my graderbot to
- Joe Oldham, who introduced me to the committee that turned a Perl script designed to let me avoid reading student Web pages into a Master's Project
References
- Tschalär, Ronald. HTTPClient, 1998, http://www.innovation.ch/java/HTTPClient/
- Brandt, Steven R. Regular Expressions in Java, 1998, http://javaregex.com/
- Devulapalli, Praveen. A Web-crawling engine to discover email addresses, 1997, Master's Project Report, University of Kentucky Department of Computer Science
- Koster, Martijn. A Standard for Robot Exclusion, 1994, http://info.webcrawler.com/mak/projects/robots/norobots.html