Hashtag Discernability - Competitiveness Study of Graph Spectral and Other Clustering Methods

Bartłomiej Starosta; Mieczysław Kłopotek; Slawomir T. Wierzchon; Dariusz Czerski

Hashtag Discernability - Competitiveness Study of Graph Spectral and Other Clustering Methods

Bartłomiej Starosta, Mieczysław Kłopotek, Slawomir T. Wierzchon, Dariusz Czerski

DOI: http://dx.doi.org/10.15439/2023F2398

Citation: Proceedings of the 18th Conference on Computer Science and Intelligence Systems, M. Ganzha, L. Maciaszek, M. Paprzycki, D. Ślęzak (eds). ACSIS, Vol. 35, pages 759–767 (2023)

Full text

Abstract. Spectral clustering methods are claimed to possess ability to represent clusters of diverse shapes, densities etc. They constitute an approximation to graph cuts of various types (plain cuts, normalized cuts, ratio cuts). They are applicable to unweighted and weighted similarity graphs. We perform an evaluation of these capabilities for clustering tasks of increasing complexity.

References

S. T. Wierzchoń and M. A. Kłopotek, Modern Clustering Algorithms, ser. Studies in Big Data. Springer Verlag, 2018, vol. 34. ISBN 978-3-319-69307-1. https://doi.org/10.1007/978-3-319-69308-8
P. Łoziński, D. Czerski, and M. A. Kłopotek, “Grammatical case based IS-A relation extraction with boosting for polish,” in Proceedings of the 2016 Federated Conference on Computer Science and Information Systems, FedCSIS 2016, Gdańsk, Poland, September 11-14, 2016, ser. Annals of Computer Science and Information Systems, M. Ganzha, L. A. Maciaszek, and M. Paprzycki, Eds., vol. 8. IEEE, 2016. http://dx.doi.org/10.15439/2016F391 pp. 533–540. [Online]. Available: https://doi.org/10.15439/2016F391
J. Dörpinghaus, S. Schaaf, J. Fluck, and M. Jacobs, “Document clustering using a graph covering with pseudostable sets,” in Proceedings of the 2017 Federated Conference on Computer Science and Information Systems, FedCSIS 2017, Prague, Czech Republic, September 3-6, 2017, ser. Annals of Computer Science and Information Systems, M. Ganzha, L. A. Maciaszek, and M. Paprzycki, Eds., vol. 11, 2017. http://dx.doi.org/10.15439/2017F84 pp. 329–338. [Online]. Available: https://doi.org/10.15439/2017F84
P. Borkowski, M. A. Kłopotek, B. Starosta, S. T. Wierzchoń, and M. Sydow, “Eigenvalue based spectral classification,” PLoS ONE, vol. 18, no. 4, p. e0283413, 2023. https://doi.org/10.1371/journal.pone.0283413
I. S. Dhillon and D. S. Modha, “Concept decompositions for large sparse text data using clustering,” Machine Learning, vol. 42, no. 1, pp. 143–175, Jan 2001. https://doi.org/10.1023/A:1007612920971
S. T. Wierzchoń and M. A. Kłopotek, “Spectral cluster maps versus spectral clustering,” in Computer Information Systems and Industrial Management, ser. LNCS, vol. 12133. Springer, 2020. http://dx.doi.org/10.1007/978-3-030-47679-3 40 pp. 472–484.
C. R. Harris, K. J. Millman, S. J. van der Walt, R. Gommers, P. Virtanen, D. Cournapeau, E. Wieser, J. Taylor, S. Berg, N. J. Smith, R. Kern, M. Picus, S. Hoyer, M. H. van Kerkwijk, M. Brett, A. Haldane, J. Fernández del Rı́o, M. Wiebe, P. Peterson, P. Gérard-Marchant, K. Sheppard, T. Reddy, W. Weckesser, H. Abbasi, C. Gohlke, and T. E. Oliphant, “Array programming with NumPy,” Nature, vol. 585, p. 357–362, 2020. http://dx.doi.org/10.1038/s41586-020-2649-2 https://numpy.org.
P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, S. J. van der Walt, M. Brett, J. Wilson, K. J. Millman, N. Mayorov, A. R. J. Nelson, E. Jones, R. Kern, E. Larson, C. J. Carey, İ. Polat, Y. Feng, E. W. Moore, J. VanderPlas, D. Laxalde, J. Perktold, R. Cimrman, I. Henriksen, E. A. Quintero, C. R. Harris, A. M. Archibald, A. H. Ribeiro, F. Pedregosa, P. van Mulbregt, and SciPy 1.0 Contributors, “SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python,” Nature Methods, vol. 17, pp. 261–272, 2020. http://dx.doi.org/10.1038/s41592-019-0686-2 https://scipy.org.
L. Buitinck, G. Louppe, M. Blondel, F. Pedregosa, A. Mueller, O. Grisel, V. Niculae, P. Prettenhofer, A. Gramfort, J. Grobler, R. Layton, J. VanderPlas, A. Joly, B. Holt, and G. Varoquaux, “API design for machine learning software: experiences from the scikit-learn project,” in ECML PKDD Workshop: Languages for Data Mining and Machine Learning, 2013. http://dx.doi.org/10.48550/arXiv.1309.023 pp. 108–122, https://scikit-learn.org.
H. Kim and H. K. Kim, “clustering4docs github repository,” 2020, https://pypi.org/project/soyclustering/. [Online]. Available: https://github.com/lovit/clustering4docs
H. Kim, H. K. Kim, and S. Cho, “Improving spherical k-means for document clustering: Fast initialization, sparse centroid projection, and efficient cluster labeling,” Expert Systems with Applications, vol. 150, p. 113288, 2020. https://doi.org/10.1016/j.eswa.2020.113288. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0957417420301135
U. von Luxburg, “A tutorial on spectral clustering,” Statistics and Computing, vol. 17, no. 4, pp. 395–416, 2007. https://doi.org/10.48550/arXiv.0711.0189
P. Macgregor and H. Sun, “A tighter analysis of spectral clustering, and beyond,” in Proceedings of the 39th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato, Eds., vol. 162. PMLR, 17–23 Jul 2022. https://doi.org/10.48550/arXiv.2208.01724 pp. 14 717–14 742. [Online]. Available: https://proceedings.mlr.press/v162/macgregor22a.html
Y. Xu, A. Srinivasan, and L. Xue, A Selective Overview of Recent Advances in Spectral Clustering and Their Applications. Cham: Springer International Publishing, 2021, pp. 247–277. ISBN 978-3-030-72437-5. http://dx.doi.org/10.1007/978-3-030-72437-5_12
C. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. New York, NY, USA: Cambridge University Press, 2008. https://doi.org/10.1017/CBO9780511809071
F. Krzakala, C. Moore, E. Mossel, J. Neeman, A. Sly, L. Zdeborov, and P. Zhang, “Spectral redemption in clustering sparse networks,” in Proc. the National Academy of Sciences, vol. 110
, 2013. http://dx.doi.org/10.48550/arXiv.1306.5550 pp. 20 935–20 940.
H. T. Ali and R. Couillet, “Improved spectral community detection in large heterogeneous networks,” Journal of Machine Learning Research, vol. 18, no. 225, pp. 1–49, 2018. [Online]. Available: http://jmlr.org/papers/v18/17-247.html
A. Saade, F. Krzakala, and L. Zdeborová, “Spectral clustering of graphs with the bethe hessian,” 2014. [Online]. Available: https://arxiv.org/abs/1406.1880. http://dx.doi.org/10.48550/ARXIV.1406.1880
Y. Endo and S. Miyamoto, “Spherical k-means++ clustering,” in Modeling Decisions for Artificial Intelligence, V. Torra and T. Narukawa, Eds. Cham: Springer International Publishing, 2015. https://doi.org/10.1007/978-3-319-23240-9 9. ISBN 978-3-319-23240-9 pp. 103–114.
S. Ji, D. Xu, L. Guo, M. Li, and D. Zhang, “The seeding algorithm for spherical k-means clustering with penalties,” J. Comb. Optim., vol. 44, no. 3, p. 1977–1994, oct 2022. http://dx.doi.org/10.1007/s10878-020-00569-1. [Online]. Available: https://doi.org/10.1007/s10878-020-00569-1
J. Knittel, S. Koch, and T. Ertl, “Efficient sparse spherical k-means for document clustering,” in Proceedings of the 21st ACM Symposium on Document Engineering, DocEng ’21. ACM, New York, NY, United States, 2021. https://doi.org/10.1145/3469096.3474937 pp. 1–4.
R. Pratap, A. Deshmukh, P. Nair, and T. Dutt, “A faster sampling algorithm for spherical k-means,” in Proceedings of The 10th Asian Conference on Machine Learning, ser. Proceedings of Machine Learning Research, J. Zhu and I. Takeuchi, Eds., vol. 95. PMLR, 14–16 Nov 2018, pp. 343–358. [Online]. Available: https://proceedings.mlr.press/v95/pratap18a.html
R. A. Kłopotek, M. A. Kłopotek, and S. T. Wierzchoń, “A feasible k-means kernel trick under non-euclidean feature space,” International Journal of Applied Mathematics and Computer Science, vol. 30, no. 4, pp. 703–715, 2020. https://doi.org/10.34768/amcs-2020-0052 Online publication date: 1-Dec-2020.
J. Gower, “Some distance properties of latent root and vector methods used in multivariate analysis,” Biometrika, vol. 53(3-4), pp. 325—-338, 1966. https://doi.org/10.1093/biomet/53.3-4.325