We describe Scout, a multithreaded robot infrastructure for Web-based
information retrieval tasks. Scout implements HTTP communications,
document caching, and simple HTML parsing, and can be extended with
programs called rules that perform arbitrary data-processing
tasks on the documents collected. We describe and demonstrate two
simple proof-of-concept applications built using Scout: one that
collects and extracts basic information about a university from its
home page, and one that converts javadoc documentation into
data structures that can be categorized and manipulated in
interesting ways. We conclude with observations about the
performance of the sample applications and discuss future
applications that might be built using Scout.