A Hybrid Parallel Web Document Clustering Algorithm
and Its Performance Study

Shuting Xu and Jun Zhang
Laboratory for High Performance Scientific Computing and Computer Simulation
Department of Computer Science
University of Kentucky
Lexington, KY 40506-0046, USA

Abstract

Clustering Web documents is an important procedure in many Web document retrieval systems. As the size of the Internet grows rapidly and the number of information requests increases exponentially, the use of parallel computing techniques in large-scale Web document retrieval is unavoidable. We develop parallel implementations of the Principal Direction Divisive Partitioning (PDDP) algorithm and the K-means algorithm based on the message-passing model. We also propose a hybrid parallel Web document clustering algorithm, which combines the parallel PDDP algorithm with the parallel K-means algorithm. We conduct computational experiments to test the performance of these three parallel algorithms using three real-life Web document datasets. The results show that the quality of the clusters obtained from the hybrid algorithm is better than that from the parallel PDDP or the parallel K-means algorithm. The parallel run time of the hybrid algorithm is similar to, and sometimes better than, that of the parallel K-means algorithm.
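
For readers unfamiliar with the two building blocks, the following is a minimal serial sketch of one plausible way to combine them. It assumes that PDDP is used to produce initial centroids which K-means then refines; the abstract does not state how the two algorithms are combined, and the sketch omits the message-passing parallelization entirely. All function names, the power-iteration step, and the toy data are illustrative assumptions, not the authors' implementation.

```python
import math

def centroid(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def scatter(rows):
    # Sum of squared distances to the cluster centroid.
    c = centroid(rows)
    return sum(sqdist(r, c) for r in rows)

def pddp_split(rows, iters=50):
    # Split one cluster by the sign of each document's projection onto
    # the principal direction of the centered data (power iteration).
    c = centroid(rows)
    centered = [[x - m for x, m in zip(r, c)] for r in rows]
    v = max(centered, key=lambda r: sum(x * x for x in r))  # deterministic start
    for _ in range(iters):
        proj = [sum(a * b for a, b in zip(r, v)) for r in centered]
        w = [sum(p * r[j] for p, r in zip(proj, centered)) for j in range(len(c))]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    proj = [sum(a * b for a, b in zip(r, v)) for r in centered]
    left = [r for r, p in zip(rows, proj) if p <= 0]
    right = [r for r, p in zip(rows, proj) if p > 0]
    return left, right

def pddp(rows, k):
    # Repeatedly split the cluster with the largest scatter until k clusters exist.
    clusters = [list(rows)]
    while len(clusters) < k:
        i = max(range(len(clusters)), key=lambda j: scatter(clusters[j]))
        if scatter(clusters[i]) == 0.0:
            break  # nothing left to split
        left, right = pddp_split(clusters.pop(i))
        clusters += [left, right]
    return clusters

def assign(rows, centers):
    return [min(range(len(centers)), key=lambda j: sqdist(r, centers[j])) for r in rows]

def kmeans(rows, centers, iters=20):
    # Standard assign/update iterations starting from the given centers.
    for _ in range(iters):
        labels = assign(rows, centers)
        new = []
        for j in range(len(centers)):
            members = [r for r, l in zip(rows, labels) if l == j]
            new.append(centroid(members) if members else centers[j])
        if new == centers:
            break
        centers = new
    return assign(rows, centers), centers

def hybrid_cluster(rows, k):
    # Assumed combination: PDDP centroids seed the K-means refinement.
    seeds = [centroid(c) for c in pddp(rows, k)]
    return kmeans(rows, seeds)

# Toy "document vectors": three well-separated groups of three points each.
docs = [[0, 0], [1, 0], [0, 1],
        [5, 0], [6, 0], [5, 1],
        [20, 0], [21, 0], [20, 1]]
labels, centers = hybrid_cluster(docs, 3)
```

On this toy input each group of three points should end up sharing one label. In a message-passing setting, the per-document projections and the per-cluster centroid sums are the natural quantities to compute locally and then reduce across processes, but that step is not shown here.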


Key words: Information retrieval, parallel document clustering, PDDP, K-means.

Mathematics Subject Classification:


Download the compressed PostScript file ppddp.ps.gz, or the compressed PDF file ppddp.pdf.gz.
Technical Report 366-03, Department of Computer Science, University of Kentucky, Lexington, KY, 2003.

The research work of S. Xu was supported in part by the U.S. National Science Foundation under grant CCR-0092532.

The research work of J. Zhang was supported in part by the U.S. National Science Foundation under grants CCR-9988165, CCR-0092532, and ACR-0202934, by the U.S. Department of Energy Office of Science under grant DE-FG02-02ER45961, by the Kentucky Science & Engineering Foundation under grant KSEF-02-264-RED-002, by the Japanese Research Organization for Information Science & Technology, and by the University of Kentucky Research Committee.