An Efficient K-means Algorithm: Generating Clusters Dynamically in MapReduce Framework
DOI: https://doi.org/10.9734/bpi/nramcs/v3/15830D
Keywords: Clustering, k-means, generating clusters on the run, mapreduce framework
Abstract
Background: K-Means is a widely used partition-based clustering algorithm that organizes an input dataset into a predefined number of clusters. Its simplicity and its speed on massive data have made K-Means very popular. The growth of electronic data has driven modifications to data clustering algorithms so that they can process such volumes. The performance of K-Means can be improved further by running it in a distributed computing environment: the MapReduce paradigm can supply that environment and reduce running time. K-Means has a major limitation: the number of clusters, ‘K’, needs to be specified in advance as an input to the algorithm. In the absence of thorough domain knowledge, or for a new and unknown dataset, this advance estimate of the cluster count typically leads to “forced” clustering of the data, and the proper clustering does not emerge.
Method: In this paper, we introduce a new algorithm based on K-Means that takes only the numerical dataset as input and generates an appropriate number of clusters on the run using the MapReduce programming style.
Findings: The new algorithm not only removes the need to supply the value of K in advance but also reduces computation time by using the MapReduce framework.
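
For readers unfamiliar with how K-Means maps onto the MapReduce paradigm, the following is a minimal single-process sketch of one standard K-Means iteration written as a map step and a reduce step. It is not the algorithm proposed in the paper: it assumes a fixed K supplied through the initial centroids, and the function names (map_point, reduce_cluster, kmeans_iteration) and the sample data are hypothetical, chosen only for illustration. An in-memory dictionary stands in for the shuffle phase that a real MapReduce framework would perform across machines.

```python
from collections import defaultdict
from math import dist


def map_point(point, centroids):
    """Map step: emit (index of nearest centroid, (point, 1))."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, (point, 1)


def reduce_cluster(values):
    """Reduce step: average all points that share one centroid index."""
    count = sum(n for _, n in values)
    dims = len(values[0][0])
    return tuple(sum(p[d] for p, _ in values) / count for d in range(dims))


def kmeans_iteration(points, centroids):
    """Run one map/shuffle/reduce pass and return the updated centroids."""
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for p in points:
        key, value = map_point(p, centroids)
        grouped[key].append(value)
    # A centroid that received no points is kept unchanged (an assumption
    # of this sketch, not something stated in the paper).
    return [
        reduce_cluster(grouped[i]) if i in grouped else tuple(centroids[i])
        for i in range(len(centroids))
    ]


# Hypothetical sample data: two well-separated groups in 2-D.
points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
new_centroids = kmeans_iteration(points, centroids)
```

Because each map call depends only on one point and the current centroids, the map step can run independently on separate data partitions, which is the property that makes K-Means a natural fit for MapReduce.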