
Study and Analysis of Distributed Document Clustering Based on Mapreduce in Hadoop

Author(s):

Suman Devi, Manav Rachna International University; Dr. Suresh Kumar, Manav Rachna International University

Keywords:

Hadoop, MapReduce, Document Clustering, Distributed Document Clustering, Large Data Sets

Abstract

MapReduce is a simplified programming model for distributed parallel computing. It is an important technology originating at Google and is commonly used for data-intensive distributed parallel computing. Cluster analysis is one of the most important data mining methods. Efficient parallel algorithms and frameworks are the key to meeting the scalability and performance requirements of such data analysis. In this paper, we describe how document clustering for large collections can be efficiently implemented with MapReduce. The Hadoop implementation provides a convenient and flexible framework for distributed computing on clusters of commodity machines. The design and implementation of the direct K-Means and distributed K-Means algorithms on MapReduce are presented.
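
To illustrate how K-Means is commonly mapped onto the MapReduce model, the following is a minimal sketch in plain Python, not the authors' implementation and not Hadoop API code. It assumes documents have already been converted to numeric vectors (for example, term weights); the function names `kmeans_map`, `kmeans_reduce`, `kmeans_iteration`, and `kmeans` are hypothetical and chosen only for this example. The map step assigns each vector to its nearest centroid, and the reduce step averages the vectors assigned to each centroid. The shuffle between the two steps is simulated in-process with a dictionary.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def kmeans_map(point, centroids):
    """Map step: emit (index of nearest centroid, (point, 1))."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, (point, 1)


def kmeans_reduce(values):
    """Reduce step: average all points that share one centroid index."""
    total = None
    count = 0
    for point, n in values:
        if total is None:
            total = list(point)
        else:
            total = [a + b for a, b in zip(total, point)]
        count += n
    return tuple(x / count for x in total)


def kmeans_iteration(points, centroids):
    """One simulated MapReduce round: map, group by key, reduce."""
    groups = defaultdict(list)
    for p in points:
        key, value = kmeans_map(p, centroids)
        groups[key].append(value)  # stands in for the shuffle phase
    # Assumption: an empty cluster keeps its previous centroid.
    return [
        kmeans_reduce(groups[i]) if i in groups else centroids[i]
        for i in range(len(centroids))
    ]


def kmeans(points, centroids, max_iters=20):
    """Repeat rounds until the centroids stop changing or max_iters is hit."""
    for _ in range(max_iters):
        new_centroids = kmeans_iteration(points, centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids


if __name__ == "__main__":
    # Toy 2-D vectors standing in for document vectors.
    docs = [(0, 0), (0, 2), (10, 10), (10, 12)]
    print(kmeans(docs, [(0, 0), (10, 10)]))
```

In a real Hadoop job each round would typically run as a separate MapReduce job, with the current centroids distributed to every mapper; this sketch leaves those details out.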

Other Details

Paper ID: IJSRDV3I60210
Published in: Volume : 3, Issue : 6
Publication Date: 01/09/2015
Page(s): 290-293
