Document Clustering using a Graph Covering with Pseudostable Sets

Jens Dörpinghaus; Sebastian Schaaf; Juliane Fluck; Marc Jacobs

Document Clustering using a Graph Covering with Pseudostable Sets

Jens Dörpinghaus, Sebastian Schaaf, Juliane Fluck, Marc Jacobs

DOI: http://dx.doi.org/10.15439/2017F84

Citation: Proceedings of the 2017 Federated Conference on Computer Science and Information Systems, M. Ganzha, L. Maciaszek, M. Paprzycki (eds). ACSIS, Vol. 11, pages 329–338 (2017)

Full text

Abstract. In text mining, document clustering describes the efforts to assign unstructured documents to clusters, which in turn usually refer to topics. Clustering is widely used in science for data retrieval and organisation. In this paper we present a new graph theoretical approach to document clustering and its application on a real-world data set. We will show that the well-known graph partition to stable sets or cliques can be generalized to pseudostable sets or pseudocliques. This allows to make a soft clustering as well as a hard clustering. We will present an integer linear programming and a greedy approach for this NP-complete problem and discuss some results on random instances and some real world data for different similarity measures.

References

A. K. Jain, M. N. Murty, and P. J. Flynn, “Data clustering: A review,” ACM Comput. Surv., vol. 31, no. 3, pp. 264–323, 9 1999.
C. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008.
W. B. A. Karaa, A. S. Ashour, D. B. Sassi, P. Roy, N. Kausar, and N. Dey, “Medline text mining: an enhancement genetic algorithm based approach for document clustering,” in Applications of Intelligent Optimization in Biology and Medicine. Springer, 2016, pp. 267–287. [Online]. Available: http://dx.doi.org/10.1007/978-3-319-21212-8_12
T. Mu, J. Y. Goulermas, I. Korkontzelos, and S. Ananiadou, “Descriptive document clustering via discriminant learning in a co-embedded space of multilevel similarities,” Journal of the Association for Information Science and Technology, vol. 67, no. 1, pp. 106–133, 2016.
L. Stanchev, “Semantic document clustering using a similarity graph,” in Semantic Computing (ICSC), 2016 IEEE Tenth International Conference on. IEEE, 2016, pp. 1–8. [Online]. Available: http://dx.doi.org/10.1109/ICSC.2016.8
L. Hirsch and A. Di Nuovo, “Document clustering with evolved search queries,” 2017. [Online]. Available: http://shura.shu.ac.uk/15409/
C.-J. Lee, C.-C. Hsu, and D.-R. Chen, “A hierarchical document clustering approach with frequent itemsets,” International Journal of Engineering and Technology, vol. 9, no. 2, p. 174, 2017. [Online]. Available: http://dx.doi.org/10.7763/IJET.2017.V9.965
S. E. Schaeffer, “Graph clustering,” Computer Science Review, vol. 1, no. 1, pp. 27 – 64, 2007.
E. Hartuv and R. Shamir, “A clustering algorithm based on graph connectivity,” Information Processing Letters, vol. 76, no. 4–6, pp. 175–181, 2000.
S. O. Krumke and H. Noltemeier, Graphentheoretische Konzepte und Algorithmen, 2nd ed. Wiesbaden: Vieweg + Teubner, 2009.
P. Hansen, M. Labbé, and D. Schindl, “Set covering and packing formulations of graph coloring: Algorithms and first polyhedral results,” Discrete Optimization, vol. 6, no. 2, pp. 135 – 147, 2009.
J. Dörpinghaus, “Über das Train Marshalling Problem,” 2012. [Online]. Available: https://doi.org/10.5281/zenodo.570503
A. Kosowski and K. Manuszewski, “Classical coloring of graphs,” Contemporary Mathematics, vol. 352, pp. 1–20, 2004.
D. Brélaz, “New methods to color the vertices of a graph,” Commun. ACM, vol. 22, no. 4, pp. 251–256, Apr. 1979.
J. Bhasker and T. Samad, “The clique-partitioning problem,” Computers & Mathematics with Applications, vol. 22, no. 6, pp. 1–11, 1991.
E. N. Gilbert, “Random graphs,” The Annals of Mathematical Statistics, vol. 30, no. 4, pp. 1141–1144, 1959.
P. Erdös and A. Rényi, “On random graphs, i,” Publicationes Mathematicae (Debrecen), vol. 6, pp. 290–297, 1959.
V. Batagelj and U. Brandes, “Efficient generation of large random networks,” Phys. Rev. E, vol. 71, p. 036113, Mar 2005.
D. Hanisch, K. Fundel, H.-T. Mevissen, R. Zimmer, and J. Fluck, “ProMiner: rule-based protein and gene entity recognition.” BMC bioinformatics, vol. 6 Suppl 1, p. S14, 2005.
E. Younesi, L. Toldo, B. Müller, C. M. Friedrich, N. Novac, A. Scheer, M. Hofmann-Apitius, and J. Fluck, “Mining biomarker information in biomedical literature,” BMC medical informatics and decision making, vol. 12, no. 1, p. 148, 2012.
I. Iliopoulos, A. Enright, and C. Ouzounis, “Textquest: Document clustering of medline,” Biocomputing 2001, p. 384, 2000.
T. Theodosiou, N. Darzentas, L. Angelis, and C. A. Ouzounis, “Pured-mcl: a graph-based pubmed document clustering methodology,” Bioinformatics, vol. 24, no. 17, pp. 1935–1941, 2008. [Online]. Available: http://dx.doi.org/10.1093/bioinformatics/btn318