WWW Index Server: Frequently Asked Questions

Questions:

  1. How does the system build its index?
  2. What is a Web robot?
  3. How do I identify a particular robot?
  4. What sites are in the robot's target list?
  5. How many Web pages/URLs has the system visited?
  6. How do I register my site/page?
  7. How can I tell robots to stay away from my site?
  8. Your robot is causing problems; what should I do?
  9. How often does the system update its index?
  10. Why do my pages get hit all the time?
  11. Your robot never goes past my homepage; why?
  12. Can I register a site outside HK?
  13. What kind of data does the robot collect?
  14. Which words in a page are indexed?
  15. What search engine does the system use?
  16. Is wildcard/word-pattern search supported?
  17. Why do I need a Boolean query construct?
  18. How do I search for phrases?
  19. Why do I get results not relevant to my query?
  20. Why does my query return nothing?
  21. How does the system search on Chinese texts?
  22. Why do I want to save my query?
  23. Does the system index gopher, ftp, telnet sites?
  24. Is source code available?
  25. What code library is used in the implementation?
  26. Any online documentation?
  27. Where do I send comments/enquiries?

Answers:

  1. Q: How does the system build its index?
    A: It runs a Web robot that automatically retrieves Web pages, traverses hyperlinks, and collects index information.
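
    As a toy illustration of this traversal (a sketch only, not our actual code; the URLs and link structure below are made up), the C fragment does a breadth-first walk over a hardcoded set of pages, marking each page as visited and following its hyperlinks:

    #include <stdio.h>

    #define NPAGES 5

    /* hypothetical URLs standing in for real Web pages */
    static const char *url[NPAGES] = {
        "http://www.example.hk/",
        "http://www.example.hk/a.html",
        "http://www.example.hk/b.html",
        "http://www.example.hk/c.html",
        "http://www.example.hk/d.html"
    };

    /* links[i] lists the pages that page i points to; -1 ends a list */
    static const int links[NPAGES][NPAGES] = {
        {1, 2, -1}, {3, -1}, {3, 4, -1}, {-1}, {0, -1}
    };

    int main(void)
    {
        int queue[NPAGES], head = 0, tail = 0;
        int visited[NPAGES] = {0};
        int i, page;

        queue[tail++] = 0;          /* start from a registered page */
        visited[0] = 1;

        while (head < tail) {
            page = queue[head++];
            printf("indexing %s\n", url[page]);  /* collect index data here */
            for (i = 0; links[page][i] != -1; i++)
                if (!visited[links[page][i]]) {  /* follow new hyperlinks */
                    visited[links[page][i]] = 1;
                    queue[tail++] = links[page][i];
                }
        }
        return 0;
    }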

  2. Q: What is a Web robot?
    A: Essentially, it is a user-less Web client program: its internal algorithm decides what information to retrieve and where to go next. There are many Web robots on the net today; the majority of them are designed to collect index data.

  3. Q: How do I identify a particular robot?
    A: If you have access to your site's Web server access log and full logging mode is set, a (well behaved) robot can be identified from its "User-Agent" field. Here is our robot's identification data. Robots that honor the Proposed Standard for Robot Exclusion always attempt to retrieve a plain text file named robots.txt before sending any further access requests (and stay away if that file disallows them).
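
    For instance, each access request from a robot carries its name in the request's "User-Agent" header; a request from a well behaved robot might look like this (the robot name below is hypothetical):

    GET /robots.txt HTTP/1.0
    User-Agent: HKIndexRobot/1.2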

  4. Q: What sites are in the robot's target list?
    A: Our service covers only Web servers/sites in the Hong Kong territory. Here is a list of Web servers visited by our robot.

  5. Q: How many Web pages/URLs has the system visited?
    A: See our weekly summary.

  6. Q: How do I register my site/page?
    A: You can register your site or Web page by filling out our URL registration form. Pages linked from the registered page will be traversed automatically by the robot, so they need not be registered separately.

  7. Q: How can I tell robots to stay away from my site?
    A: Assuming all robots honor the Proposed Standard for Robot Exclusion, you can prepare a text file named robots.txt in your site's root directory containing the following:
    
    User-agent: *
    Disallow: /
    

    This disallows any robot from retrieving any page on your site. The "Disallow:" field can also be used to keep robots away from pages under a specified directory path. For example:

    
    User-agent: *
    Disallow: /webstat/
    

    This disallows access to pages under the "/webstat/" directory. Multiple "Disallow:" fields are permitted, and an empty "Disallow:" field means all requests are allowed. The "User-agent:" field can be used to address a particular Web robot, by naming that robot's User-Agent identification (see the example below). If no robots.txt file is found, by convention, a robot may assume that all access requests are allowed.
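
    For example, the following record (the robot name "HKIndexRobot" here is hypothetical) keeps that one robot out of a private directory while leaving all other robots unrestricted:

    User-agent: HKIndexRobot
    Disallow: /private/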

  8. Q: Your robot is causing problems; what should I do?
    A: If our robot causes any problem, please report it to our robot maintainer.

  9. Q: How often does the system update its index?
    A: Once every two weeks we run an index maintenance process. For every URL/page in the index, this process checks whether the URL/page still exists and whether it has been modified since the robot's last visit. HTTP (the Hypertext Transfer Protocol) supports such probing by means of a special request type, the "HEAD" request, which returns a page's header information without its body, as illustrated below. A page is then revisited (and re-indexed) if its last modification date is later than the robot's last visit.
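
    For illustration, a HEAD request and a server's reply might look like this (the path, date, and size are made up; note that only the headers come back, not the page itself):

    HEAD /index.html HTTP/1.0

    HTTP/1.0 200 OK
    Last-Modified: Tue, 06 Jun 1995 14:30:00 GMT
    Content-Type: text/html
    Content-Length: 2048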

  10. Q: Why do my pages get hit all the time?
    A: On some systems, a weekly file-system backup resets the last modification dates of the whole file system. As a result, HTTPD (the Web server program) reports all Web files in the file system as having been changed on that date, and based on this information our system runs the robot to re-index the pages.

  11. Q: Your robot never goes past my homepage; why?
    A: This usually happens when pointers to your other pages are accessible only through an image-map. Since the target of an image-map link is computed by the server from the coordinates of a mouse click, our robot cannot construct access requests from image-maps.

  12. Q: Can I register a site outside HK?
    A: To avoid redundant coverage with other Web indexing systems, we restrict our coverage to Hong Kong based sites only. Our main purpose is to provide browsers outside Hong Kong (as well as those inside Hong Kong) with an easy way to locate information provided by the Hong Kong Internet community.
    One indirect way to make a site outside Hong Kong keyword-searchable through our Index Server is to have its URL listed on some local Web page.

  13. Q: What kind of data does the robot collect?
    A: Our robot collects URLs (Web addresses), page titles, hypertext anchors, hyperlink data, and keywords.

  14. Q: Which words in a page are indexed?
    A: Our robot collects words in page titles, hypertext anchors, headings, list items, and words in bold, italic, or emphasized type. Function words and other common words are discarded.

  15. Q: What search engine does the system use?
    A: Our search engine employs the so-called vector space model [Salton, G., Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer, Addison-Wesley, Reading MA, 1989] to search and rank documents (URLs) based on keywords specified by the user. Basically, the algorithm ranks a document according to the distribution of the query words within that document and across the whole collection.
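
    As a rough sketch of this kind of scoring (a common textbook variant, not our actual implementation; compile with -lm), each query term contributes a weight proportional to its frequency in the document and inversely related to how many documents contain it:

    #include <stdio.h>
    #include <math.h>

    /* tf * idf weight: term frequency times log(N / df), where df is
       the number of documents containing the term and N is the
       collection size. */
    double term_weight(int tf, int df, int n_docs)
    {
        if (tf == 0 || df == 0)
            return 0.0;
        return (double) tf * log((double) n_docs / (double) df);
    }

    /* Score one document against a query of n_terms terms.  q_tf[i]
       and d_tf[i] are the i-th term's frequencies in the query and in
       the document; df[i] is its document frequency. */
    double vs_score(int n_terms, int q_tf[], int d_tf[], int df[], int n_docs)
    {
        double score = 0.0, norm = 0.0;
        int i;

        for (i = 0; i < n_terms; i++) {
            double qw = term_weight(q_tf[i], df[i], n_docs);
            double dw = term_weight(d_tf[i], df[i], n_docs);
            score += qw * dw;   /* inner product of the two vectors */
            norm  += dw * dw;
        }
        /* normalize by document vector length (over the query terms
           only, for brevity) so long pages are not unduly favored */
        return norm > 0.0 ? score / sqrt(norm) : 0.0;
    }

    int main(void)
    {
        /* two query terms with toy frequencies: the first appears 3
           times in the document, the second not at all */
        int q_tf[] = {1, 1}, d_tf[] = {3, 0}, df[] = {20, 50};
        printf("score = %f\n", vs_score(2, q_tf, d_tf, df, 1000));
        return 0;
    }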

  16. Q: Is wildcard/word-pattern search supported?
    A: No, because our search engine uses a hash-based lookup method, which requires exact word matches. However, word stemming is performed to remove suffixes (for example, the `s' suffix of plural nouns) from query and index words, and the search is case insensitive (it disregards upper/lower case).
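
    As a minimal sketch of the kind of suffix stripping involved (our actual stemmer handles more suffixes and special cases than this):

    #include <stdio.h>
    #include <string.h>
    #include <ctype.h>

    /* Normalize a word in place: lower-case it, then strip a plural
       `s' suffix, taking care not to truncate words ending in "ss". */
    void normalize(char *word)
    {
        size_t len;
        char *p;

        for (p = word; *p; p++)
            *p = tolower((unsigned char) *p);

        len = strlen(word);
        if (len > 3 && word[len - 1] == 's' && word[len - 2] != 's')
            word[len - 1] = '\0';   /* "Hotels" -> "hotel"; "less" kept */
    }

    int main(void)
    {
        char w[] = "Hotels";
        normalize(w);
        printf("%s\n", w);          /* prints "hotel" */
        return 0;
    }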

  17. Q: Why do I need a Boolean query construct?
    A: You can use Boolean constructs to specify multiple queries (search statements) in one line. For example, if you are interested in information about hotels, instead of the query "hotel information guide" you can get a better result with the query "hotel (information | guide)". Unlike the former, the latter gives a lower score to documents that mention "information guide" but lack the word "hotel".

  18. Q: How do I search for phrases?
    A: You can use a - (hyphen) to connect the words of a phrase so that they are not treated independently of each other. For example: "local-area-network".

  19. Q: Why do I get results not relevant to my query?
    A: The search engine returns at most N documents (URLs) with relevance scores higher than 0 (N is user-specified, with a default value of 40). However, a score higher than 0 does not guarantee that a page is relevant to the query (let alone to your real information need). If you find an irrelevant URL in the result list, you can be sure that the URLs further down the list (with lower scores) are also irrelevant.

  20. Q: Why does my query return nothing?
    A: First, make sure that the spelling is correct. If correct spelling does not help, try the word's synonyms. If that also fails, the keyword(s) may simply not be in the index; our system does not index Web-specific common words (among others) such as "homepage", "home", etc.

  21. Q: How does the system search on Chinese texts?
    A: Our system searches on Chinese texts based on character-sequence statistics. The version of the Chinese search engine that we are running now is still very crude, but works reasonably well.
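
    As a loose illustration of one character-sequence approach (a sketch under our own assumptions, not necessarily the method our engine uses), the fragment below slides over Big5-encoded text, in which each Chinese character occupies two bytes with the first byte's high bit set, and emits every pair of adjacent characters (a character bigram); the frequencies of such bigrams can then be indexed much like words:

    #include <stdio.h>

    /* Print each pair of adjacent two-byte (Chinese) characters in a
       Big5-encoded string; ASCII characters break the sequence. */
    void print_bigrams(const unsigned char *s)
    {
        const unsigned char *prev = NULL;  /* previous character, if any */

        while (*s) {
            if (*s & 0x80) {               /* lead byte of a character */
                if (s[1] == '\0')
                    break;                 /* truncated trailing byte */
                if (prev != NULL)
                    printf("%c%c%c%c\n", prev[0], prev[1], s[0], s[1]);
                prev = s;
                s += 2;
            } else {
                prev = NULL;
                s += 1;
            }
        }
    }

    int main(void)
    {
        /* placeholder bytes standing in for three Big5 characters */
        const unsigned char text[] = "\xA4\xA4\xB0\xEA\xA4\x48";
        print_bigrams(text);   /* prints two bigrams as raw bytes */
        return 0;
    }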

  22. Q: Why do I want to save my query?
    A: You may save your query for later reference (saving you the time of fine-tuning the query again), or so that others can share your discovery. Some people save failed queries in the hope that somebody somewhere will put the missing information online.

  23. Q: Does the system index gopher, ftp, telnet sites?
    A: Yes, as long as they are listed on some Web pages somewhere.

  24. Q: Is source code available?
    A: Not at this moment, partly because we are short of staff.

  25. Q: What code library is used in the implementation?
    A: Our entire system is written from scratch in the C language, except for the index database management module, which is implemented using the GNU Database Manager (GDBM) library.
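
    For illustration, storing and fetching one index record through GDBM looks roughly like this (the key, value, and file name are made up, not our actual index schema; compile with -lgdbm):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <gdbm.h>

    int main(void)
    {
        GDBM_FILE db;
        datum key, val, out;

        /* open (or create) the database file */
        db = gdbm_open("index.db", 0, GDBM_WRCREAT, 0644, NULL);
        if (db == NULL)
            return 1;

        key.dptr  = "hotel";
        key.dsize = strlen(key.dptr);
        val.dptr  = "http://www.example.hk/hotels.html";  /* hypothetical */
        val.dsize = strlen(val.dptr);

        gdbm_store(db, key, val, GDBM_REPLACE);

        out = gdbm_fetch(db, key);      /* caller must free out.dptr */
        if (out.dptr != NULL) {
            printf("%.*s\n", (int) out.dsize, out.dptr);
            free(out.dptr);
        }
        gdbm_close(db);
        return 0;
    }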

  26. Q: Any online documentation?
    A: We currently have two papers submitted for publication. One paper, "Search and Ranking Algorithms for Locating Resources on the World Wide Web" (PostScript format, 174KB), describes and evaluates a number of document ranking/search algorithms used by the earlier version of this server. The other paper, "A WWW Resource Discovery System", describes our prototype of an integrated system consisting of the indexer robot, the search engine, and an enhanced user interface to the search engine.