Evaluation of an Optimized K-Means Algorithm Based on Real Data

Cosmin Marian Poteraş, Mihai Mocanu

DOI: http://dx.doi.org/10.15439/2016F231

Citation: Proceedings of the 2016 Federated Conference on Computer Science and Information Systems, M. Ganzha, L. Maciaszek, M. Paprzycki (eds). ACSIS, Vol. 8, pages 831–835 (2016)

Full text

Abstract. In a previous paper [1] we introduced an optimized version of the K-Means Algorithm. Unlike the standard version of the K-Means algorithm that iteratively traverses the entire data set in order to decide to which cluster the data items belong, the proposed optimization relies on the observation that after performing only a few iterations the centroids get very close to their final position causing only a few of the data items to switch their cluster. Therefore, after a small number of iterations, most of the processing time is wasted on checking items that have reached their final cluster. At each iteration, the data items that might switch the cluster due to centroids' deviation will be re-checked. The prototype implementation has been evaluated using data generated based on an uniform distribution random numbers generator. The evaluation showed up to 70\\% reduction of the running time. This paper will evaluate the optimized K-Means against real data sets from different domains.

References

Cosmin Marian Poteraş, Marian Cristian Mihăescu, Mihai Mocanu: An optimized version of the K-Means clustering algorithm, Proceedings of the 2014 Federated Conference on Computer Science and Information Systems, ACSIS, Vol. 2, pages 695–699, 2014, http://dx.doi.org/10.15439/2014F258.
Dolnicar, S: Using cluster analysis for market segmentation—typical misconceptions, established methodological weaknesses and some recommendations for improvement, Australasian Journal of Market Research, 2003, 11(2), 5–12.
Ng, H. P.; Ong, S. H.; Foong, K. W. C.; Goh, P. S.; Nowinsky, W. L.: Medical Image Segmentation Using K-Means Clustering and Improved Watershed Algorithm, 7th IEEE Southwest Symposium on Image Analysis and Interpretation, March 26-28, 2006, Denver, Colorado, pages 61-66
Agnieszka Wosiak, Danuta Zakrzewska: On Integrating Clustering and Statistical Analysis for Supporting Cardiovascular Disease Diagnosis, Proceedings of the 2015 Federated Conference on Computer Science and Information Systems, ACSIS, Vol. 5, pages 303–310 (2015) http://dx.doi.org/10.15439/2015F151, http://dx.doi.org/10.15439/2015F151
Hongwei Xie, Li Zhang; Jingyu Sun, Xueli Yu: Application of K-means Clustering Algorithms in News Comments - The International Conference on E-Business and E-Government, May 2010, Guangzhou, China, pages 451-454
Oyelade, O. J, Oladipupo, O. O, Obagbuwa, I. C: Application of K-Means Clustering algorithm for prediction of Students’ Academic Performance, (IJCSIS) International Journal of Computer Science and Information Security, Vol. 7, No. 1, 2010, pages 292–295
Szymon Wawrzyniak, Wojciech Niemirom: Clustering Approach to the Problem of Human Activity Recognition using Motion Data, Proceedings of the 2015 Federated Conference on Computer Science and Information Systems, ACSIS, Vol. 5, pages 411–416 (2015), http://dx.doi.org/10.15439/2015F424, http://dx.doi.org/10.15439/2015F424
Souptik Datta, Chris Giannella, Hillol Kargupta: K-Means Clustering Over a Large, Dynamic Network, Proceedings of the Sixth SIAM International Conference on Data Mining, April 20-22, 2006, Bethesda, MD, USA. SIAM 2006 ISBN 978-0-89871-611-5, pages 153—164.
Yufang Zhang, Zhongyang Xiong, Jiali Mao, Ling O: The Study of Parallel K-Means Algorithm, Proceedings of the 6th World Congress on Intelligent Control and Automation, June 21–23, 2006, Dalian, China, pages 5868–5871.
Jing Zhang, Gongqing Wu, Xuegang Hu, Shiying Li, Shuilong Hao: A Parallel K-means Clustering Algorithm with MPI, 4th Internation Symposium on Parallel Architectures, Algorithms and Programming, ISBN 978-0-7695-4575-2, pages 60-64, 2011.
Jitendra Kumar, Richard T. Mills, Forrest M. Hoffman, William W. Hargrove: Parallel k-Means Clustering for Quantitative Ecoregion Delineation Using Large Data Sets, Proceedings of the International Conference on Computational Science, ICCS 2011, Procedia Computer Science 4 (2011) 1602–1611.
Reza Farivar, Daniel Rebolledo, Ellick Chan, Roy Campbell: A Parallel Implementation of K-Means Clustering on GPUs, Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, PDPTA 2008, Las Vegas, Nevada, USA, July 14-17, 2008, 2 Volumes. CSREA Press 2008 ISBN 1-60132-084-1, pages 340–345.
Mario Zechner, Michael Granitzer: Accelerating K-Means on the Graphics Processor via CUDA, The First International Conference on Intensive Applications and Services INTENSIVE 2009, 20–25 April, Valencia, Spain, pages 7–15, ISBN 978-1-4244-3683-5.
M. Lichman: UCI Machine Learning Repository, University of California, Irvine, School of Information and Computer Sciences, 2013, http://archive.ics.uci.edu/ml
The USCensus1990raw data set, U.S. Department of Commerce Census Bureau - http://dataferrett.census.gov/
Manohar Kaul: Building Accurate 3D Spatial Networks to Enable Next Generation Intelligent Transportation Systems, Proceedings of International Conference on Mobile Data Management (IEEE MDM), June 3-6 2013, Milan, Italy
http://archive.ics.uci.edu/ml/datasets/Bag+of+Words