Clustering Web Document Sets with Different Closeness

Shuting Xu and Jun Zhang
Laboratory for High Performance Scientific Computing and Computer Simulation
Department of Computer Science
University of Kentucky
Lexington, KY 40506-0046, USA


Document clustering algorithms usually perform differently on document sets with different closeness. In general, most document clustering algorithms perform better on independent and distant document sets than on similar or close document sets. We propose an efficient method, based on the Principal Direction Divisive Partitioning (PDDP) algorithm, which refines the clustering solutions according to the closeness of the document sets. The experimental results show that the quality of the clustering solutions obtained by our method is better than that of PDDP, while the time cost is about 39% less on average.
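For readers unfamiliar with the baseline, the following is a minimal sketch of a single PDDP split: the documents are centered on their centroid, the principal direction of the centered data is estimated, and each document is assigned to one of two groups by the sign of its projection onto that direction. It illustrates standard PDDP only, not the closeness-based refinement proposed in this report. The function name pddp_split, the iteration count, and the toy data are assumptions made for illustration, and a plain power iteration stands in for the singular value decomposition normally used.

```python
import math

def pddp_split(docs, iters=200):
    """Split documents (rows of term weights) into two index lists.

    Hypothetical helper: one bisection step of PDDP, with the principal
    direction approximated by power iteration instead of an SVD.
    """
    n, d = len(docs), len(docs[0])
    # Center every document vector on the cluster centroid.
    centroid = [sum(doc[j] for doc in docs) / n for j in range(d)]
    centered = [[doc[j] - centroid[j] for j in range(d)] for doc in docs]

    # Start from the centered row with the largest norm (an assumption;
    # any start vector not orthogonal to the principal direction works).
    v = max(centered, key=lambda row: sum(x * x for x in row))
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(range(n)), []  # all documents identical: nothing to split
    v = [x / norm for x in v]

    # Power iteration on C^T C, where C is the centered document matrix.
    for _ in range(iters):
        proj = [sum(row[j] * v[j] for j in range(d)) for row in centered]
        w = [sum(proj[i] * centered[i][j] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]

    # Assign each document by the sign of its projection.
    proj = [sum(row[j] * v[j] for j in range(d)) for row in centered]
    left = [i for i, p in enumerate(proj) if p <= 0]
    right = [i for i, p in enumerate(proj) if p > 0]
    return left, right

# Toy term-weight vectors: two groups using mostly different terms.
docs = [[2, 1, 0], [3, 1, 0], [2, 2, 0], [0, 1, 2], [0, 1, 3], [0, 2, 2]]
left, right = pddp_split(docs)
```

Which group ends up as left or right depends on the arbitrary sign of the computed direction, so only the partition itself is meaningful. A full PDDP run would apply this split repeatedly, each time choosing a cluster to bisect.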

Key words: Document clustering, PDDP, closeness.

Download the compressed PostScript file or the PDF file closeness.pdf.
Technical Report 389-04, Department of Computer Science, University of Kentucky, Lexington, KY, 2004.

The research work of S. Xu was supported in part by the U.S. National Science Foundation under grant ACR-0234270.

The research work of J. Zhang was supported in part by NSF under grants CCR-9988165, CCR-0092532, ACR-0202934, ACR-0234270, by DOE under grant DE-FG02-02ER45961, and by the University of Kentucky Research Committee.