Clustering Validity Indices Evaluation with Regard to Semantic Homogeneity

Tomasz Dziopa

DOI: http://dx.doi.org/10.15439/2016F371

Citation: Position Papers of the 2016 Federated Conference on Computer Science and Information Systems, M. Ganzha, L. Maciaszek, M. Paprzycki (eds). ACSIS, Vol. 9, pages 3–9 (2016)

Abstract. Clustering validity indices are methods for examining and assessing the quality of data clustering results. Various studies have thoroughly evaluated their performance on both synthetic and real-world datasets. In this work, we describe several approaches to evaluating a clustering scheme. Moreover, we present a new solution to the problem of selecting an appropriate clustering validity index and apply it to a real-world task: clustering biomedical articles using the MeSH ontology.
