Enhancing Clustering Blog Documents by Utilizing Author/Reader Comments

Beibei Li
Department of Computer Science
University of Kentucky
Lexington, KY 40506-0046, USA
Shuting Xu
Department of Computer Information Systems
Virginia State University
Petersburg, VA 23906, USA
and Jun Zhang
Department of Computer Science
University of Kentucky
Lexington, KY 40506-0046, USA

Abstract

Blogs are a new Internet phenomenon and a vast, ever-increasing information resource. Mining blog files for information is a very new research direction in data mining. We propose to include the title, body, and comments of blog pages when building datasets for clustering blog documents. In particular, we argue that the author/reader comments of blog pages may have a stronger discriminating effect in clustering blog documents than the other parts. We constructed a word-page matrix from blog pages downloaded from a well-known website and experimented with a k-means clustering algorithm, assigning different weights to the title, body, and comment parts. Our experimental results show that assigning a larger weight to the blog comments helps the k-means algorithm produce better clustering solutions. These results confirm our hypothesis that the author/reader comments of blog files are very useful in discriminating blog files.
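The weighting idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact weighting scheme, the preprocessing, or the k-means variant, so everything below is an assumption made for illustration: term counts are scaled by a per-part weight, each page vector is length-normalized, and a plain k-means is seeded with the first k pages. The toy pages and the names `weighted_counts`, `build_matrix`, and `kmeans` are hypothetical.

```python
from collections import Counter
import math

# Hypothetical toy pages; the dataset used in the report is not reproduced here.
pages = [
    {"title": "my day", "body": "today was fine", "comments": "great recipe love the sauce"},
    {"title": "my day", "body": "today was fine", "comments": "great match love the goal"},
    {"title": "weekend notes", "body": "today was fine", "comments": "recipe sauce tasty"},
    {"title": "weekend notes", "body": "today was fine", "comments": "match goal score"},
]

def weighted_counts(page, weights):
    """Count words in each part of a page, scaling each count by the part's weight."""
    counts = Counter()
    for part, weight in weights.items():
        for word in page.get(part, "").lower().split():
            counts[word] += weight
    return counts

def build_matrix(pages, weights):
    """Return (vocab, matrix) with one unit-length row per page.

    This is the transpose of a word-page matrix: rows are pages, columns are words.
    """
    rows = [weighted_counts(p, weights) for p in pages]
    vocab = sorted(set().union(*rows))
    matrix = []
    for row in rows:
        vec = [row[t] for t in vocab]  # a Counter returns 0 for absent words
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        matrix.append([x / norm for x in vec])
    return vocab, matrix

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iters=10):
    """Plain k-means; seeding with the first k vectors is a simplification."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: each page goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c])) for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Assumed weights: comments count five times as much as title or body.
weights = {"title": 1.0, "body": 1.0, "comments": 5.0}
vocab, matrix = build_matrix(pages, weights)
labels = kmeans(matrix, k=2)
print(labels)
```

On this toy input, the two pages whose comments discuss recipes share a label and the two whose comments discuss a match share the other, even though titles and bodies do not separate them that way. Changing the entries of `weights` is the knob that corresponds to the title, body, and comment weights varied in the experiments.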


Key words: Blog, blogosphere, data mining, comment, clustering.

Categories and Subject Descriptors: H.3.3 [Information Search and Retrieval]: Clustering, retrieval models, search process.


Technical Report 462-06, Department of Computer Science, University of Kentucky, Lexington, KY, 2006.

The research work of J. Zhang was supported in part by the National Science Foundation under grants CCR-0092532 and CCF-0527967, in part by the Kentucky Science and Engineering Foundation under grant KSEF-148-502-05-132, and in part by the Alzheimer's Association under grant NIRG-06-25460.