An Efficient K-means Algorithm: Generating Clusters Dynamically in MapReduce Framework

Anupama Chadha

doi:10.9734/bpi/nramcs/v3/15830D

Authors

Anupama Chadha Manav Rachna International Institute of Research and Studies, Faridabad, India.

DOI:

https://doi.org/10.9734/bpi/nramcs/v3/15830D

Keywords:

Clustering, K-Means, generating clusters on the run, MapReduce framework

Abstract

Background: K-Means is a widely used partition-based clustering algorithm that organizes an input dataset into a predefined number of clusters. Its simplicity and its speed in clustering massive data have made it very popular. The growth in the volume of electronic data has prompted modifications to data clustering algorithms so that they can process such data. The performance of K-Means can be enhanced further by running it in a distributed computing environment suited to big data. The MapReduce paradigm can provide K-Means with such an environment and reduce its running time. K-Means has a major limitation: the number of clusters, ‘K’, must be specified in advance as an input to the algorithm. Without thorough domain knowledge, or for a new and unknown dataset, estimating and specifying the number of clusters in advance typically leads to “forced” clustering of the data, and a proper clustering does not emerge.

Method: In this paper, we introduce a new algorithm based on K-Means that takes only the numerical dataset as input and generates an appropriate number of clusters on the run using the MapReduce programming style.

Findings: The new algorithm not only removes the need to specify the value of K in advance but also reduces computation time by using the MapReduce framework.
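For readers unfamiliar with how K-Means maps onto MapReduce, the following is a minimal single-process sketch of the standard formulation with a fixed K. It is not the chapter's algorithm: the abstract does not describe how clusters are generated on the run, so that part is not shown. The function names (`map_phase`, `reduce_phase`, `kmeans`) and the sample data are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from math import dist


def map_phase(points, centroids):
    """Map: emit (index of nearest centroid, point) for every input point."""
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        yield nearest, p


def reduce_phase(pairs, centroids):
    """Reduce: average the points grouped under each centroid index."""
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    new_centroids = list(centroids)  # a centroid with no points is left unchanged
    for idx, pts in groups.items():
        new_centroids[idx] = tuple(sum(c) / len(pts) for c in zip(*pts))
    return new_centroids


def kmeans(points, centroids, max_iter=100):
    """Repeat map and reduce until the centroids stop changing (fixed K)."""
    for _ in range(max_iter):
        updated = reduce_phase(map_phase(points, centroids), centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
    print(kmeans(data, [(0.0, 0.0), (10.0, 0.0)]))
```

In a real MapReduce deployment the map calls would run in parallel across data partitions and the framework would group the emitted pairs by key before the reduce step; here both phases run in one process so the data flow is easy to follow.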

Author Biography

Anupama Chadha, Manav Rachna International Institute of Research and Studies, Faridabad, India.
