Knowledge Detection and Discovery using Semantic Graph Embeddings on Large Knowledge Graphs generated on Text Mining Results

Jens Dörpinghaus; Marc Jacobs

DOI: http://dx.doi.org/10.15439/2020F36

Citation: Proceedings of the 2020 Federated Conference on Computer Science and Information Systems, M. Ganzha, L. Maciaszek, M. Paprzycki (eds). ACSIS, Vol. 21, pages 169–178 (2020)

Abstract. Knowledge graphs play a central role in big data integration, especially for connecting data from different domains. Bringing unstructured texts, e.g. from scientific literature, into a structured, comparable format is one of the key assets. Here, we combine knowledge graphs in the biomedical domain with text-mining-based document data for knowledge extraction and retrieval from text and natural language structures. For example, cause-and-effect models can potentially facilitate clinical decision making or help to drive research towards precision medicine. However, the power of knowledge graphs critically depends on context information. We provide a novel semantic approach towards a context-enriched biomedical knowledge graph that uses data integration with linked data applied to language technologies and text mining. This graph concept can be used for graph embedding in different approaches, e.g. with a focus on topic detection, document clustering and knowledge discovery. We discuss algorithmic approaches to tackle these challenges and show results for several applications such as search query finding and knowledge discovery. The presented approaches lead to valuable results on large knowledge graphs.
