Scout
An Infrastructure for Web-Based Information Retrieval

Anthony L. Borchers - borchers@cs.engr.uky.edu
University of Kentucky Department of Computer Science
December 1998
Advised by Dr. Raphael A. Finkel

Abstract

We describe Scout, a multithreaded robot infrastructure for Web-based information retrieval tasks. Scout implements HTTP communications, document caching, and simple HTML parsing, and can be extended with programs called rules, which perform arbitrary data-processing tasks on the documents it collects. We describe and demonstrate two simple proof-of-concept applications built using Scout: one that collects and extracts basic information about a university from its home page, and one that converts javadoc documentation into data structures that can be categorized and manipulated in interesting ways. We conclude with some observations about the performance of the sample applications and discuss some future applications that might be built using Scout.