Analyzing longitudinal Data in Knowledge Graphs utilizing shrinking pseudo-triangles

This paper aims to analyze longitudinal data, that is, serial data related to different time points, in knowledge graphs. Knowledge graphs play a central role in linking different data and are widely used in big data integration, especially for connecting data from different domains. While multiple layers for data from different sources are considered, there is only very limited research on longitudinal data in knowledge graphs. Few studies have investigated how multiple layers and time points within graphs impact methods and algorithms developed for single-purpose networks. This manuscript investigates the impact of modeling longitudinal data in multiple layers on retrieval algorithms. In particular, (a) we propose a first draft of a generic model for longitudinal data in multi-layer knowledge graphs, and (b) we develop an experimental environment to evaluate a generic retrieval algorithm on random graphs inspired by computational social sciences. We present a knowledge graph generated from German job advertisements comprising data from different sources, both structured and unstructured, covering the years 2011 to 2021. The data is linked using text mining and natural language processing methods. We further (c) present two different shrinking techniques for structured and unstructured layers in knowledge graphs based on graph structures like triangles and pseudo-triangles. The presented approach (d) shows that both the initial research questions and the graph structures and topology have a great impact on the structures and efficiency for additional data stored. Although the experimental analysis of random graphs allows us to make some basic observations, we will (e) make suggestions for additional research on particular graph structures that have a great impact on the analysis of knowledge graph structures.


I. INTRODUCTION
KNOWLEDGE graphs have been shown to play an important role in recent knowledge mining and discovery, for example in the fields of computational social sciences, digital humanities, life sciences or bioinformatics. They also include single-purpose networks (like social networks), but mostly they contain additional information and data as well, see for example [1], [2], [3]. Thus, a knowledge graph can be seen as a multi-layer graph comprising different data layers, for example social data, spatial data, etc. In addition, scientists study network patterns and structures, for example paths, communities or other patterns within the data structure, see for example [4]. Very few studies have investigated how multiple layers within graphs impact methods and algorithms developed for single-purpose networks, see [5]. In addition, it is possible to store and analyze longitudinal data in knowledge graphs. This is an important topic in medical informatics, for example when working with longitudinal patient records. For example, in [6], the authors use a temporal query language on clinical knowledge graphs. Other authors like [7] use longitudinal patient records within a medical knowledge graph for predictive models. Longitudinal knowledge graphs are related to the versioning of knowledge graphs and other graph-based structures like ontologies. Although research on versioning RDF knowledge bases started early, see [8], only little research has been done in this field. Some attention was paid to the evolution of data structures within information management, see for example [9], [10], [11], and to decentralized collaborative work on knowledge resources, see [12], [13]. Other researchers were interested in parallel world frameworks to analyze scenarios in knowledge graphs, see [14]. Nevertheless, a generic framework for modeling longitudinal data in knowledge graphs is still missing.
In this paper, we focus on an example use case from computational social sciences. In labor market research, extracting skill requirements from job advertisements (short: job ads) becomes a feasible approach to observe which skills are in demand by employers [15], [16], [17], [18]. Job ads are one way for a company to recruit new employees. Besides general information about the hiring company and the working conditions, they document the current skill needs on the labor market. For a longitudinal view on how skill demands develop, it is necessary to build a model which is capable not only of utilizing these data within a knowledge graph with contextual data, but also of supporting efficient analysis.
We present a detailed overview of the knowledge graph representation in Figure 1. We give a detailed formal definition and overview in the next section.
The main research question of this paper is: How can we model longitudinal knowledge graphs on job ads for an efficient analysis of the development of skills while preserving all contextual data? In order to answer that question, this manuscript investigates both the impact of shrinking triangles in multiple layers and the runtime needed for this approach.
This paper is divided into five sections. The first section gives a brief overview of the research question, state of the art and related work. The second section describes the preliminaries and background. We will in particular introduce knowledge graphs and describe the knowledge graph models on job ads. In the third section, we present the experimental setting and the methods used. The fourth section is dedicated to experimental results and their evaluation. Our conclusions are drawn in the final section.

II. PRELIMINARIES
The term knowledge graph (sometimes also called a semantic network) is not clearly defined, see [19]. In [20], several definitions are compared, but the only formal definition was related to RDF graphs which does not cover labeled property graphs. However, a knowledge graph is a systematic way to connect information and data to knowledge.
Definition 1 (Knowledge Graph). We define a knowledge graph as a mixed graph $G = (E, R)$ with entities $e \in E = \{E_1, \dots, E_n\}$ coming from formal structures $E_i$, like ontologies.
By using formal structures within the graph, we are implicitly using the model of a labeled property graph, see [21] and [22]. Here, nodes and edges form a heterogeneous set. Nodes and edges can be identified by using a single label or multiple labels, using a mapping $\lambda : V \cup E \to \Sigma$, where $\Sigma$ denotes a set of labels. We need to mention that both concepts are equivalent, since graph databases use the concept of labeled property graphs.
Context is a widely discussed topic in text mining and knowledge extraction since it is an important factor in determining the correct semantic sense of unstructured text. In [23], Nenkova and McKeown discuss the influence of context on text summarization. Ambiguity is an issue for both common language words and those in scientific context. The challenge in this field is not only to extract such context data, but also to be able to store this data for further natural language processing (NLP), like querying and discovery approaches, see for example [4].
In general, for a node $n \in V$, the neighborhood $N(n)$ contains all relevant contextual information. But usually information is best understood using information triangles. Thus every two nodes $v, w \in N(n)$ form an implicit triangle $v, n, w$, and adding an additional edge $(v, w)$ turns this into a triangle $K_3$; see Figure 1 for an illustration. Here, the additional edges that form pseudo-triangles are red.
Definition II.1 (Pseudo-Triangle). Let $G = (E, R)$ be a knowledge graph and let $n, v, w \in G$ be three nodes in $G$. Moreover, let $n \in E_i$ for some $i$ and let $v, w \in N(n)$. Then $n, v, w$ form a pseudo-triangle in $G$.
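The definition can be illustrated with a minimal Python sketch. It assumes an undirected adjacency-set representation and a hypothetical `layer_of` mapping from node to layer name; neither is prescribed by the paper, and the toy node names are invented for illustration.

```python
from itertools import combinations

def pseudo_triangles(adj, layer_of, layer):
    """Yield (n, v, w, closed) for every pseudo-triangle centred on a node n of `layer`.

    `closed` is True when the edge (v, w) exists, i.e. n, v, w form a K_3.
    """
    for n, nbrs in adj.items():
        if layer_of[n] != layer:
            continue
        # every pair of neighbours of n forms a pseudo-triangle with n
        for v, w in combinations(sorted(nbrs), 2):
            yield n, v, w, w in adj[v]

# Toy graph (hypothetical): one job ad linked to an occupation and two skills.
edges = [("ad1", "occ"), ("ad1", "s1"), ("ad1", "s2"), ("s1", "s2")]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)
layer_of = {"ad1": "E1", "occ": "E2", "s1": "E4", "s2": "E4"}

result = list(pseudo_triangles(adj, layer_of, "E1"))
```

In this toy graph the job ad is the centre of three pseudo-triangles, and only the pair of skills is closed by an explicit edge.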
The knowledge graph on German job ads is built upon different corpora of job ads from multiple sources. In this paper, we will focus on a corpus from the German Federal Employment Agency. The corpus that we use to extract skills and tools contains approximately 600,000 job ads per year that were advertised from 2011 to 2021. In Table I we present the different knowledge graph layers.
Thus, combining Figure 1 and Table I, it appears that we are working on a knowledge graph $G = (V, E)$ with eight different layers, thus $G = E_1 \cup E_2 \cup \dots \cup E_8 \cup T_1 \cup I_1$ with the given data subsets $E_1, \dots, E_8$, the text mining results $T_1$ and other data integrated in $I_1$. We will now discuss how this knowledge graph can be connected with data from different years to build a longitudinal knowledge graph representation.
To test the efficiency of the analysis, we will focus on a very generic question: Given a structured layer which is not constantly changing (e.g. a taxonomy), how do the results on unstructured data (in our case: the job ads) evolve with respect to another structured layer (e.g. another taxonomy, for example tools or skills)? In other words: How can we efficiently retrieve data from a structured layer $E_s$ ordered by another structured layer $E_i$ when both are connected over time by different sets of unstructured data? For the sake of simplicity, we will define $E_s = E_4$ as the skills retrieved by text mining and $E_i = E_2$ as given by the classification of occupations. Both sets are connected by the job ads.
PROCEEDINGS OF THE FEDCSIS. SOFIA, BULGARIA, 2022

III. METHOD
We can extend the knowledge graph with the information for one particular time point $t$, in our case a year, and denote the resulting graph by $G_t$. With this, we can build a generic graph model $G_T$ comprising multiple time points $T = \{t_1, \dots, t_m\}$, which consists of the graphs $G_{t_1}, \dots, G_{t_m}$ together with the connecting edge sets $C_{t_i,t_j}$. In this case, $C_{t_i,t_j}$ comprises all edge relations from $G_{t_i}$ to $G_{t_j}$. This contains relations like isEqual if two entities are equivalent or isSuccessor if an entity in $G_{t_j}$ is the successor of a deprecated element in $G_{t_i}$.
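As a rough illustration of this model, the following Python sketch builds the connecting edge sets between consecutive yearly snapshots of one layer. It assumes that equivalent entities share an identifier across years and that successor relations are supplied as a hypothetical lookup table; the function name `link_years` and the occupation names are invented for the example.

```python
def link_years(snapshots, successors=None):
    """Build mapping edges between consecutive time points.

    snapshots:  {year: set of entity ids present in that year}
    successors: {(year, old_id): new_id} for deprecated entities (assumed input)
    Returns {(t_i, t_j): [((id, t_i), (id2, t_j), relation), ...]}.
    """
    successors = successors or {}
    years = sorted(snapshots)
    C = {}
    for t_i, t_j in zip(years, years[1:]):
        rel = []
        for e in sorted(snapshots[t_i]):
            if e in snapshots[t_j]:
                # same identifier in both years: treated as equivalent
                rel.append(((e, t_i), (e, t_j), "isEqual"))
            elif (t_i, e) in successors:
                rel.append(((e, t_i), (successors[(t_i, e)], t_j), "isSuccessor"))
        C[(t_i, t_j)] = rel
    return C

snapshots = {
    2011: {"baker", "typist"},
    2012: {"baker", "data_typist"},
    2013: {"baker", "data_typist"},
}
C = link_years(snapshots, successors={(2011, "typist"): "data_typist"})
```

Entities that disappear without a registered successor simply receive no outgoing mapping edge in this sketch.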

[Figure: "Classification of Occupations" across Year $a_1$, ..., Year $a_n$]

Thus, a first step to shrink the volume of the knowledge graph is to merge elements that exist multiple times across several years. We therefore search for maximal paths $P = p_1, \dots, p_m$ in $G_T$ where $p_i \in E_j \ \forall p_i \in P$. The edges between $p_i$ and $p_{i+1}$ are either isEqual or isSuccessor edges, see Figure 2. Thus for every edge $(p_i, p_{i+1}) \in C_{t_j,t_k}$ we can either merge $p_i$ and $p_{i+1}$ if they are the same (isEqual) or leave the isSuccessor edges. In our case we are in particular working on $E_2$, the classification of occupations.
This can be done with depth-first search, see Algorithm 1, because we explicitly only use the directed subgraph $R$ of $G_T$ induced by $C_{t_1,t_2} \cup C_{t_2,t_3} \cup \dots \cup C_{t_{m-1},t_m}$. The worst-case behavior is in $O(|E(R)| + |V(R)|)$, and since every node $p_i$ has at most $\Delta(E_2)$ neighbors in $E_2$ and at most $N(E_1)$ neighbors in $E_1$, the time complexity of merging the nodes is $O(\Delta(E_2) + N(E_1))$. Thus, the runtime of this step is linear, $O(n)$ in $G_T$. We denote the graph after step 1 with $G_T^1$.
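A minimal sketch of this first step in Python is given below. It is not the paper's Algorithm 1; it assumes nodes are (entity, year) pairs, mapping edges are (source, target, relation) triples, and isEqual edges may be followed in both directions during the depth-first search.

```python
def step1_merge(nodes, mapping_edges):
    """Merge nodes connected by isEqual edges; keep isSuccessor edges.

    Returns (rep, successor_edges): rep maps every node to the representative
    of its merged chain, successor_edges link representatives.
    """
    equal = {}
    for src, dst, rel in mapping_edges:
        if rel == "isEqual":
            equal.setdefault(src, []).append(dst)
            equal.setdefault(dst, []).append(src)
    rep = {}
    for start in sorted(nodes):
        if start in rep:
            continue
        rep[start] = start
        stack = [start]  # iterative depth-first search along isEqual edges
        while stack:
            cur = stack.pop()
            for nxt in equal.get(cur, []):
                if nxt not in rep:
                    rep[nxt] = start
                    stack.append(nxt)
    successor_edges = sorted(
        {(rep[s], rep[d]) for s, d, rel in mapping_edges if rel == "isSuccessor"}
    )
    return rep, successor_edges

nodes = [("baker", 2011), ("baker", 2012), ("baker", 2013),
         ("typist", 2011), ("data_typist", 2012), ("data_typist", 2013)]
mapping_edges = [
    (("baker", 2011), ("baker", 2012), "isEqual"),
    (("baker", 2012), ("baker", 2013), "isEqual"),
    (("typist", 2011), ("data_typist", 2012), "isSuccessor"),
    (("data_typist", 2012), ("data_typist", 2013), "isEqual"),
]
rep, successor_edges = step1_merge(nodes, mapping_edges)
```

Each node and each mapping edge is visited a constant number of times, which matches the linear behavior stated above; the final sort of the successor edges is only there to make the output deterministic.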

[Figure: "Classification of Occupations" across Year $a_1$, ..., Year $a_n$, with "Job Ads" and "Other structured Data"]

In general we can make the following observations:
• Since the number of jobs in the classification schema does not increase dramatically, we can assume $N(E_2^t) \approx N(E_2^{t+1})$.
• Thus, even though a number of $e_1$ jobs may either be deprecated or added as new items, the size of [...].
In a next step, we can merge all job ads for a given year while preserving the further links to other structured data. Thus for every time point $t$ and every $v_t \in E_2^t$ we merge all nodes in $N = \{n_t \mid n_t \in N(v_t) \text{ and } n_t \in E_1^t\}$ into a meta node $a_t$ and add an edge $(v_t, a_t)$ with weight $|N|$, see Figure 3. These form pseudo-triangles in $E_1$.
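The merge into meta nodes can be sketched in Python as follows. This is an illustration under stated assumptions rather than the paper's Algorithm 2: each job ad is attached to exactly one occupation node, occupation nodes are (occupation, year) pairs, and the links to other structured data are aggregated as counted edges, which the paper does not specify.

```python
from collections import Counter

def step2_merge(occ_ad_edges, ad_skill_edges):
    """Replace the job ads of each occupation node v_t by one meta node a_t.

    occ_ad_edges:   list of (v_t, ad) pairs
    ad_skill_edges: list of (ad, skill) pairs (links to other structured data)
    Returns (weights, skill_links) as plain dicts.
    """
    ad_to_meta = {}
    weights = Counter()
    for v_t, ad in occ_ad_edges:
        a_t = ("meta", v_t)            # meta node standing for all ads of v_t
        ad_to_meta[ad] = a_t
        weights[(v_t, a_t)] += 1       # edge weight |N|
    skill_links = Counter()
    for ad, skill in ad_skill_edges:
        # links of the merged ads are re-attached to the meta node (assumed counted)
        skill_links[(ad_to_meta[ad], skill)] += 1
    return dict(weights), dict(skill_links)

occ_ad_edges = [
    (("baker", 2011), "ad1"),
    (("baker", 2011), "ad2"),
    (("baker", 2012), "ad3"),
]
ad_skill_edges = [
    ("ad1", "kneading"), ("ad1", "hygiene"),
    ("ad2", "hygiene"), ("ad3", "hygiene"),
]
weights, skill_links = step2_merge(occ_ad_edges, ad_skill_edges)
```

After the merge, an occupation node, its meta node and a linked skill form the pseudo-triangle that later retrieval operates on; the individual ad identifiers no longer appear in the output.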
The runtime of this step is in $O(t \, N(E_1) \, N(E_2))$ and thus is quadratic, $O(n^2)$ in $G_T$, see Algorithm 2. We denote the changed graph after step 2 with $G_T^2$ and the new shrunk nodes in $E_1$ with $E_1^2$. Before continuing with a possible third step of shrinking graph structures, we should consider the theoretical results. Given the question of how the description of skills in job ads evolves in job classifications over the years, in the initial graph $G_T$ we need to consider the following steps:
• Consider the evolution of any classification $v$ for all [...].
• Consider all skills for any job ad in $N(v_t)$ for all times, [...].
Thus, the average runtime is in $O(n^3)$. For the graph with shrunk pseudo-triangles this reduces to linear runtime:
• [...].
• Consider all skills for all times, runtime $O(m)$.
Algorithm 2 STEP-2
Require: Knowledge graph $G_T$ with an unstructured layer $E_x^t$ and a target layer $E_y^t$ containing paths in $P$ and mappings $C_{t_i,t_{i+1}}$ for all $t \in T = \{t_1, \dots, t_m\}$ and [...]
Ensure: Shrunk $G_T^2$ [...]
    [...]
    P = []
    for every $p \in P$ in $E_y^t$ do
        for every $p_i \in P = \{p_1, \dots, p_z\}$ do
            for every $t \in \{1, \dots, m\}$ do
                [...]
            end for
        end for
    end for
    return x

With this third step we can reduce the data complexity, but while in step 1 we do not lose any relevant data, in step 2 we lose the information about specific job ads while preserving the information for a complete time set. Thus we can make the following observations:
• Structured data like taxonomies and ontologies can be shrunk without any data loss.
• Shrinking unstructured data in triangles or pseudo-triangles always goes along with the loss of particular data points, while accumulated information might be preserved.
• Thus, it highly depends on the initial research question which layers and information can be shrunk to improve the runtime of algorithms. For the given research question, steps 1-2 are the maximum reduction of the initial graph when considering the change over the years.
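The trade-off named in these observations can be made concrete with a small Python sketch. It compares a skill retrieval that scans every job ad with the same retrieval on meta nodes. The record layout, the function names and the toy data are assumptions made for illustration only.

```python
from collections import Counter

ads = [
    {"id": "ad1", "year": 2011, "occupation": "baker", "skills": ["kneading", "hygiene"]},
    {"id": "ad2", "year": 2011, "occupation": "baker", "skills": ["hygiene"]},
    {"id": "ad3", "year": 2012, "occupation": "baker", "skills": ["hygiene", "sales"]},
]

def skills_raw(ads, occupation):
    """Skill counts per year, obtained by scanning every job ad."""
    out = {}
    for ad in ads:
        if ad["occupation"] == occupation:
            out.setdefault(ad["year"], Counter()).update(set(ad["skills"]))
    return out

def shrink(ads):
    """One meta node per (occupation, year); individual ads are dropped."""
    meta = {}
    for ad in ads:
        key = (ad["occupation"], ad["year"])
        node = meta.setdefault(key, {"weight": 0, "skills": Counter()})
        node["weight"] += 1
        node["skills"].update(set(ad["skills"]))
    return meta

def skills_shrunk(meta, occupation):
    """The same query, reading one meta node per year."""
    return {t: node["skills"] for (v, t), node in meta.items() if v == occupation}

raw = skills_raw(ads, "baker")
meta = shrink(ads)
shrunk = skills_shrunk(meta, "baker")
```

Both retrievals return the same yearly skill counts, but the shrunk representation no longer allows asking which particular ad mentioned a skill, which is the data loss described above.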

IV. EXPERIMENTAL RESULTS
Our testing environment contains a random graph with $m$ time points containing several graph layers. First, we have a random tree $E_2^1$ with 18,700 nodes and two probabilities $p_p$ and $p_d$ denoting the rate of a changing predecessor or a deleted node. These probabilities lead to $m$ copies of $E_2^1$ and their mapping from one time point to the next as described in the previous section.
Second, we generate $m$ times 600,000 random nodes in $E_1^1, \dots, E_1^m$ with equally distributed mappings to $E_2^1, \dots, E_2^m$ as described in the last section. In addition, these nodes receive random edges to 600 descriptive elements. Thus, our experimental setting is closely related to our real-world environment described in the second section.
We used 50 instances to evaluate the runtime and performance of the algorithms presented in the last section. In Table II and Figure 4 we show the runtime of the two optimization steps. In general, we can see that both steps take 0.6 seconds on average. In Table III and Figure 5 we show the runtime of the retrieval described in the last section. We can see that the runtime of both optimization steps is nearly the same as one retrieval on the initial graph $G_T$. This is not surprising, since the steps are quite similar. The retrieval on the optimized graph $G_T^2$ is much faster, and from the second run of a retrieval algorithm at the latest, we see a good improvement of runtime.

V. DISCUSSION AND OUTLOOK
This paper investigates the impact of longitudinal data in knowledge graphs. Knowledge graphs play a central role in linking different data. While multiple layers for data from different sources are considered, there is only very limited research on longitudinal data in knowledge graphs. We presented an experimental environment to evaluate one generic retrieval heuristic given different data layers, both structured and unstructured. The result clearly shows that the graph structures and topology have a great impact on the efficient retrieval of additional data stored. The initial very generic question was: Given a structured layer which is not constantly changing (e.g. a taxonomy), how do the results on unstructured data (in our case: the job ads) evolve with respect to another structured layer (e.g. another taxonomy, for example tools or skills)? In other words: How can we efficiently retrieve data from a structured layer $E_s$ ordered by another structured layer $E_i$ when both are connected over time by different sets of unstructured data? We specified three example layers to illustrate our optimization approach on (pseudo-)triangles and to evaluate the efficiency.
In particular, we propose a first draft of a generic model for longitudinal data in multi-layer knowledge graphs. This approach stores copies of the knowledge graph at multiple time points and the mapping between nodes at one time point and the following one. Since some optimization can be done without losing data, e.g. step 1, we propose further research on a generic longitudinal data model to use these approaches when building the knowledge graph. Second, we develop an experimental environment to evaluate a generic retrieval algorithm on random graphs inspired by computational social sciences. This example was highly influenced by the boundary conditions given by the real-world problem, a knowledge graph generated from German job advertisements comprising data from different sources, both structured and unstructured, covering the years 2011 to 2021. The data is linked using text mining and natural language processing methods. In general, we present two different shrinking approaches for structured (step 1) and unstructured (step 2) layers in knowledge graphs based on graph structures like triangles and pseudo-triangles.
Here, more research needs to follow. While we have argued that these approaches are generic and can be used for any content, further attention for triangles and pseudo-triangles is needed. They form a crucial factor both for understanding the data context and for efficient retrieval of these data.
The presented approach shows that both the initial research questions (which layers to shrink) and the graph structures and topology have a great impact on the structures and efficiency for additional data stored. The experiments show promising results, but further research is necessary to build a generic, time-efficient representation of longitudinal data in knowledge graphs.