An Efficient K-means Algorithm: Generating Clusters Dynamically in MapReduce Framework

Authors

  • Anupama Chadha, Manav Rachna International Institute of Research and Studies, Faridabad, India.

DOI:

https://doi.org/10.9734/bpi/nramcs/v3/15830D

Keywords:

Clustering, K-Means, generating clusters on the run, MapReduce framework

Abstract

Background: K-Means is a widely used partition-based clustering algorithm that organizes an input dataset into a predefined number of clusters. Its simplicity and its speed in clustering massive data are the two features that have made K-Means so popular. The generation of huge amounts of electronic data has prompted modifications to data clustering algorithms so that they can process data at this scale. The performance of K-Means can be enhanced further by using a distributed computing environment to deal with big data. The MapReduce paradigm can supply K-Means with such an environment and make it more efficient in terms of time. K-Means has a major limitation: the number of clusters, ‘K’, needs to be pre-specified as an input to the algorithm. In the absence of thorough domain knowledge, or for a new and unknown dataset, this advance estimation and specification of the cluster number typically leads to “forced” clustering of the data, and a proper clustering does not emerge.

Method: In this paper, we introduce a new algorithm based on K-Means that takes only the numerical dataset as input and generates an appropriate number of clusters on the run using the MapReduce programming style.

Findings: The new algorithm not only overcomes the limitation of having to provide the value of K in advance but also reduces computation time by using the MapReduce framework.
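
As a rough illustration of how K-Means fits the MapReduce style mentioned in the background, the sketch below expresses one iteration of standard K-Means with a fixed K as a map step (assign each point to its nearest centroid) and a reduce step (average the points in each group). It is not the dynamic cluster-generation algorithm introduced in the chapter, whose details the abstract does not give. The function names are hypothetical, and plain Python stands in for an actual MapReduce framework.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def map_point(point, centroids):
    """Map step (hypothetical helper): emit (nearest centroid index, point)."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, point


def reduce_cluster(points):
    """Reduce step (hypothetical helper): mean of the points sharing one key."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans_iteration(data, centroids):
    """One standard K-Means iteration: map, group by key (shuffle), reduce."""
    groups = defaultdict(list)
    for point in data:
        key, value = map_point(point, centroids)
        groups[key].append(value)
    # Assumption: a centroid that receives no points keeps its old position.
    return [
        reduce_cluster(groups[i]) if groups[i] else tuple(centroids[i])
        for i in range(len(centroids))
    ]


data = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
new_centroids = kmeans_iteration(data, centroids)
print(new_centroids)  # [(1.0, 1.5), (8.5, 8.0)]
```

In a distributed setting the map calls would run in parallel over partitions of the data, and each reduce call would handle the points for one centroid; repeating the iteration until the centroids stop moving gives the usual K-Means loop.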

Author Biography

Anupama Chadha, Manav Rachna International Institute of Research and Studies, Faridabad, India.

Published

2022-05-14

How to Cite

Anupama Chadha. (2022). An Efficient K-means Algorithm: Generating Clusters Dynamically in MapReduce Framework. Novel Research Aspects in Mathematical and Computer Science Vol. 3, 54–66. https://doi.org/10.9734/bpi/nramcs/v3/15830D