## Optimization of Retrieval Algorithms on Large Scale Knowledge Graphs

### Jens Dörpinghaus, Andreas Stefan

DOI: http://dx.doi.org/10.15439/2020F2

Citation: Proceedings of the 2020 Federated Conference on Computer Science and Information Systems, M. Ganzha, L. Maciaszek, M. Paprzycki (eds). ACSIS, Vol. 21, pages 227–236 (2020)

Abstract. Knowledge graphs have been shown to play an important role in recent knowledge mining and discovery, for example in the field of life sciences or bioinformatics. Although a lot of research has been done on the field of query optimization, query transformation and of course in storing and retrieving large scale knowledge graphs the field of algorithmic optimization is still a major challenge and a vital factor in using graph databases. Few researchers have addressed the problem of optimizing algorithms on large scale labeled property graphs. Here, we present two optimization approaches and compare them with a naive approach of directly querying the graph database. The aim of our work is to determine limiting factors of graph databases like Neo4j and we describe a novel solution to tackle these challenges. For this, we suggest a classification schema to differ between the complexity of a problem on a graph database. We evaluate our optimization approaches on a test system containing a knowledge graph derived biomedical publication data enriched with text mining data. This dense graph has more than 71M nodes and 850M relationships. The results are very encouraging and -- depending on the problem -- we were able to show a speedup of a factor between 44 and 3839.

### References

- M. Desai, R. G Mehta, and D. P Rana, “Issues and challenges in big graph modelling for smart city: An extensive survey,” International Journal of Computational Intelligence & IoT, vol. 1, no. 1, 2018.
- J. Dörpinghaus and A. Stefan, “Knowledge extraction and applications utilizing context data in knowledge graphs,” in 2019 Federated Conference on Computer Science and Information Systems (FedCSIS). IEEE, 2019, pp. 265–272.
- D. B. Johnson, “Efficient algorithms for shortest paths in sparse networks,” Journal of the ACM (JACM), vol. 24, no. 1, pp. 1–13, 1977.
- H. Huang and Z. Dong, “Research on architecture and query performance based on distributed graph database neo4j,” in 2013 3rd International Conference on Consumer Electronics, Communications and Networks. IEEE, 2013, pp. 533–536.
- J. Hölsch and M. Grossniklaus, “An algebra and equivalences to transform graph patterns in neo4j,” in EDBT/ICDT 2016 Workshops: EDBT Workshop on Querying Graph Structured Data (GraphQ), 2016.
- H. Thakkar, D. Punjani, S. Auer, and M.-E. Vidal, “Towards an integrated graph algebra for graph pattern matching with gremlin,” in International Conference on Database and Expert Systems Applications. Springer, 2017, pp. 81–91.
- R. Angles, H. Thakkar, and D. Tomaszuk, “Rdf and property graphs interoperability: Status and issues,” in Proceedings of the 13th Alberto Mendelzon International Workshop on Foundations of Data Management, Asunción, Paraguay, 2019.
- S. Mennicke, “Modal schema graphs for graph databases,” in International Conference on Conceptual Modeling. Springer, 2019, pp. 498–512.
- P. Zhao and J. Han, “On graph query optimization in large networks,” Proceedings of the VLDB Endowment, vol. 3, no. 1-2, pp. 340–351, 2010.
- J. Eymer, P. Dexter, and Y. D. Liu, “Toward lazy evaluation in a graph database,” SPLASH 2019.
- A. Cheung, S. Madden, and A. Solar-Lezama, “Sloth: Being lazy is a virtue (when issuing database queries),” ACM Transactions on Database Systems (ToDS), vol. 41, no. 2, p. 8, 2016.
- A. B. Mathew, “Efficient query retrieval from social data in neo4j using lindex.” KSII Transactions on Internet & Information Systems, vol. 12, no. 5, 2018.
- W. Cabrera and C. Ordonez, “Scalable parallel graph algorithms with matrix–vector multiplication evaluated with queries,” Distributed and Parallel Databases, vol. 35, no. 3-4, pp. 335–362, 2017.
- X. Wu and S. Deng, “Research on optimizing strategy of database-oriented gis graph database query,” in 2018 5th IEEE International Conference on Cloud Computing and Intelligence Systems (CCIS). IEEE, 2018, pp. 305–309.
- P. Fournier-Viger, C. Cheng, L. J. chuan Wei, U. Yun, and R. U. Kiran, “Tkg: Efficient mining of top-k frequent subgraphs,” in Big Data Analytics: 7th International Conference, BDA 2019, Ahmedabad, India, December 17–20, 2019, Proceedings, vol. 11932. Springer Nature, 2019, p. 209.
- J. Fluck, A. Klenner, S. Madan, S. Ansari, T. Bobic, J. Hoeng, M. Hofmann-Apitius, and M. Peitsch, “Bel networks derived from qualitative translations of bionlp shared task annotations,” in Proceedings of the 2013 Workshop on Biomedical Natural Language Processing, 2013, pp. 80–88.
- M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig et al., “Gene ontology: tool for the unification of biology,” Nature genetics, vol. 25, no. 1, p. 25, 2000.
- D. S. Wishart, Y. D. Feunang, A. C. Guo, E. J. Lo, A. Marcu, J. R. Grant, T. Sajed, D. Johnson, C. Li, Z. Sayeeda et al., “Drugbank 5.0: a major update to the drugbank database for 2018,” Nucleic acids research, vol. 46, no. D1, pp. D1074–D1082, 2017.
- K. Khan, E. Benfenati, and K. Roy, “Consensus qsar modeling of toxicity of pharmaceuticals to different aquatic organisms: Ranking and prioritization of the drugbank database compounds,” Ecotoxicology and environmental safety, vol. 168, pp. 287–297, 2019.
- J. Dörpinghaus, A. Stefan, B. Schultz, and M. Jacobs. (2020) Towards context in large scale biomedical knowledge graphs. [Online]. Available: http://arxiv.org/abs/2001.08392
- P. Barceló Baeza, “Querying graph databases,” Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, 2013.
- R. Angles, “A Comparison of Current Graph Database Models,” in 2012 IEEE 28th International Conference on Data Engineering Workshops, apr 2012, pp. 171–177.
- R. Angles, M. Arenas, P. Barceló, A. Hogan, J. Reutter, and D. Vrgoč, “Foundations of Modern Query Languages for Graph Databases,” ACM Comput. Surv., vol. 50, no. 5, pp. 68:1—-68:40, sep 2017. [Online]. Available: http://doi.acm.org/10.1145/3104031
- P. T. Wood, “Query Languages for Graph Databases,” SIGMOD Rec., vol. 41, no. 1, pp. 50–60, apr 2012. [Online]. Available: http://doi.acm.org/10.1145/2206869.2206879
- J. Pokorný, “Functional querying in graph databases,” Vietnam Journal of Computer Science, vol. 5, no. 2, pp. 95–105, 2018. [Online]. Available: https://doi.org/10.1007/s40595-017-0104-6
- J. Pokorny, “Graph Databases: Their Power and Limitations,” in Computer Information Systems and Industrial Management, K. Saeed and W. Homenda, Eds. Cham: Springer International Publishing, 2015, pp. 58–69.
- M. Needham and A. E. Hodler, Graph Algorithms. O’Reilly Media, Inc., 2019.
- A. Aziz and A. Prakash, Algorithms for Interviews: A Problem Solving Approach. algorithmsforinterviews.com, 2010.