An appropriate similarity measure for k-means algorithm in clustering web documents |
Author(s): |
| S JAIGANESH, PSNA COLLEGE OF ENGINEERING AND TECHNOLOGY, DINDIGUL, TAMILNADU; Dr. P. JAGANTHAN, PSNA COLLEGE OF ENGINEERING AND TECHNOLOGY, DINDIGUL, TAMILNADU |
Keywords: |
| Partitional Clustering, Cosine Similarity, Euclidean, Jaccard, Pearson, KLD |
Abstract |
|
Organizing a large volume of documents into categories through clustering makes searching for and finding relevant information on the web easier and quicker. Hence we need more efficient clustering algorithms for organizing large volumes of documents. Clustering of large text datasets can be done effectively using partitional clustering algorithms. The K-means algorithm is the most suitable partitional clustering approach for handling large volumes of data. The K-means clustering algorithm uses a similarity metric that determines the distance from a document to a point that represents a cluster head. This similarity metric plays a vital role in cluster analysis, and the use of a suitable similarity metric improves the clustering results. A variety of similarity metrics are available for measuring the similarity between any two documents. In this paper, we analyse the performance and effectiveness of these similarity measures specifically for k-means partitional clustering of text document datasets. We use seven text document datasets and five similarity measures, namely Euclidean distance, cosine similarity, Jaccard coefficient, Pearson correlation coefficient and Kullback-Leibler divergence. Based on our experimental study, we conclude that cosine similarity is the best-suited similarity metric for the K-means clustering algorithm. |
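For readers unfamiliar with the five measures named in the abstract, the following is a minimal Python sketch of their textbook definitions on term-weight vectors. It is an illustration under stated assumptions, not the authors' implementation: the function names, the extended (Tanimoto) form of the Jaccard coefficient, the zero-vector conventions, the smoothing constant `eps`, and the toy vectors `doc_a`, `doc_b` and `doc_c` are all hypothetical choices made here.

```python
import math


def _dot(a, b):
    """Inner product of two equal-length term-weight vectors."""
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a, b):
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between the vectors; ignores document length."""
    norm_a, norm_b = math.sqrt(_dot(a, a)), math.sqrt(_dot(b, b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumed convention for an empty document
    return _dot(a, b) / (norm_a * norm_b)


def jaccard_coefficient(a, b):
    """Extended (Tanimoto) Jaccard coefficient for real-valued vectors."""
    ab = _dot(a, b)
    denom = _dot(a, a) + _dot(b, b) - ab
    return ab / denom if denom else 0.0  # assumed convention for two empty documents


def pearson_correlation(a, b):
    """Pearson correlation of the two weight vectors, in [-1, 1]."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    dev_a = [x - mean_a for x in a]
    dev_b = [y - mean_b for y in b]
    denom = math.sqrt(_dot(dev_a, dev_a) * _dot(dev_b, dev_b))
    return _dot(dev_a, dev_b) / denom if denom else 0.0  # assumed convention for constant vectors


def kl_divergence(a, b, eps=1e-9):
    """KL divergence D(p || q) after normalising each vector to a distribution.

    Assumes non-negative weights with a non-zero sum. The eps smoothing is an
    assumption added here so that zero weights do not produce log(0).
    """
    def to_distribution(v):
        total = sum(v)
        return [(x / total + eps) / (1.0 + eps * len(v)) for x in v]

    p, q = to_distribution(a), to_distribution(b)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Hypothetical term-frequency vectors over a three-term vocabulary.
doc_a = [1, 2, 0]
doc_b = [2, 4, 0]  # same term proportions as doc_a, twice the length
doc_c = [0, 1, 3]

if __name__ == "__main__":
    for name, fn in [
        ("euclidean", euclidean_distance),
        ("cosine", cosine_similarity),
        ("jaccard", jaccard_coefficient),
        ("pearson", pearson_correlation),
        ("kl", kl_divergence),
    ]:
        print(name, fn(doc_a, doc_b), fn(doc_a, doc_c))
```

In this toy example `doc_b` is `doc_a` scaled by two, so the length-insensitive measures (cosine, Pearson, KL divergence on normalised vectors) treat the pair as identical, while Euclidean distance and the Jaccard coefficient do not. Note also that Euclidean distance and KL divergence are dissimilarities (lower is closer), whereas the other three are similarities (higher is closer), and that KL divergence is not symmetric in its arguments.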
Other Details |
|
Paper ID: IJSRDV3I2393; Published in: Volume 3, Issue 2; Publication Date: 01/05/2015; Page(s): 408-412 |
