Learning edge importance in bipartite graph-based recommendations

In this work, we propose the P3 Learning to Rank (P3LTR) model, a generalization of the RP3Beta graph-based recommendation method. In our approach, we learn the importance of user-item relations based on features that are usually available in online recommendations (such as types of user-item past interactions and timestamps). We keep the simplicity and explainability of RP3Beta predictions. We report the improvements of P3LTR over RP3Beta on the OLX Jobs Interactions dataset, which we published.


I. INTRODUCTION
The graph-based RP3Beta model [1] is a very strong baseline on multiple recommender systems datasets [2], [3], [4]. This relatively simple model outperformed other approaches on our published OLX Jobs Interactions dataset and is currently the state-of-the-art collaborative filtering recommender system at OLX. In this work, we propose the P3LTR (P3 Learning to Rank) model, which generalizes the RP3Beta model.
In RP3Beta each user and item is represented as a node of the user-item bipartite graph. The recommendations are generated based on the paths of length 3 starting from a given user. The scores of these paths are calculated based on the scores assigned to the edges of the graph. Scores of the edges are directly calculated based on the degrees of the connected nodes. Hence, there is no learning process in this approach.
In P3LTR, to better leverage the importance of the user-item relation, we learn the score of a given edge based on the features of this edge. As features, we use not only node degrees but also the sequence of interactions between two given nodes. This enables us to incorporate the information that the user visited the item several times, that the user not only clicked on but also applied for a given job, or how recent the click was.
In this work, we propose a training procedure and a loss function for the P3LTR model. We tune, train, and evaluate RP3Beta and P3LTR models on the OLX Jobs Interactions dataset.
The paper consists of 6 sections. The second section presents a literature review and formulates a research gap addressed by this work. Section III proposes the P3LTR model and describes its advantages and relation to the RP3Beta model. Section IV describes the considered dataset and hyperparameter tuning procedure. The results of our model are discussed in Section V. Section VI presents the conclusions and future perspectives.

A. Recommender systems
Most digital platforms provide more choices than the user can explore in a reasonable time. Even a perfect search engine cannot resolve this problem, because it requires users to know what they are looking for and to spend time providing this information. For this reason, powerful recommendation systems are developed by multiple companies, such as Netflix [5] or Amazon [6].
We usually distinguish two categories of recommendation methods: content-based and collaborative filtering. In content-based models [7], [8] we utilize user and item features to provide recommendations. The history of interactions between users and items is considered from the perspective of a single user. In contrast, collaborative filtering techniques [9], [10] do not consider additional information about users or items but utilize the rating history of all users at the same time to provide recommendations. During the last few decades, several collaborative filtering recommendation techniques have been proposed: neighborhood-based (e.g., [11], [12]), matrix factorization-based (e.g., [13], [14]), graph-based (e.g., [1], [15]), and Word2Vec-based (e.g., [16], [17]).
Another category of recommendation systems, context-aware recommendation systems (CARS) [18], [19], utilizes contextual information about user-item interactions, such as time or location. We can distinguish an important subcategory of these methods, sequence-aware recommendation systems [20], which utilize sequentially ordered user-item interaction logs.
In this work, we extend a graph-based collaborative-filtering approach that does not utilize additional contextual information. Our approach is a sequence-aware recommendation method that utilizes timestamps and types of interactions.

B. Graph-based recommendations
Many graph-based recommender systems focus on producing item and/or user embeddings. Some of them utilize the graph structure to produce random walks, which are used as input for the model that produces the embeddings. For instance, Node2vec [21] and DeepWalk [22] utilize a SkipGram model [23]. In recent years, several collaborative filtering methods based on graph convolutional neural networks have been proposed [15], [24], [25], [26].
Another type of graph-based recommendation system directly utilizes the graph structure to calculate the scores of items, usually by utilizing a user-item bipartite graph. Cooper et al. [27] proposed the simple and efficient P3 and P3alpha methods, which outperformed more complex and computationally demanding techniques [28], [29], [30]. Paudel et al. [1] extended this work by proposing the RP3Beta model, an extension of P3alpha which recommends popular items less often. These methods were recently used as a benchmark by Dacrema et al. [2], [3] and Anelli et al. [4], who compared them with several state-of-the-art neural recommendation methods. P3alpha and RP3Beta demonstrated very good performance against other baselines and neural models. The RP3Beta model provided the most accurate recommendations on some of the considered datasets (i.e., Pinterest [31], CiteULike-a [32] and MovieLens1M [33]; on MovieLens1M the authors used their own random splits). In our previous work, as yet unpublished, we showed that RP3Beta outperforms other methods on the OLX Jobs Interactions dataset. It is currently the state-of-the-art collaborative filtering recommendation technique deployed at OLX Jobs.

C. Research gap
The RP3Beta model does not utilize any information about user-item relations. Additionally, there is no learning process that could optimize model parameters. Hence it is not possible to learn the importance of edges in the user-item bipartite graph. In this work, we fill this gap by proposing a machine learning model which generalizes RP3Beta.

A. Model
Let $U$ be a set of users and $I$ a set of items. For each user $u \in U$ and item $i \in I$ let $r_{ui}$ be the score assigned by the model to the pair $(u, i)$. We denote the matrix of all user-item scores by $\mathbf{R}$.
We represent users and items as the nodes of the bipartite graph, where edges represent interactions between users and items. Let $N(x)$ represent the set of nodes connected with the node $x$. For a given user $u \in U$ our model recommends the items with the highest score $r_{ui}$, excluding the items which the user interacted with.
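The score $r_{ui}$ is calculated as the sum of the scores assigned to the paths of length 3 connecting $u$ and $i$, i.e.:

$$r_{ui} = \sum_{i_2 \in N(u)} \; \sum_{u_2 \in N(i_2) \cap N(i)} p(u, i_2, u_2, i),$$

where $p(u, i_2, u_2, i)$ is the score assigned to the given path. Following the idea used in the RP3Beta model, we factorize this score as:

$$p(u, i_2, u_2, i) = p^{(1)}_{u i_2} \, p^{(2)}_{i_2 u_2} \, p^{(3)}_{u_2 i},$$

where $p^{(k)}_{xy}$ is the score assigned to the edge connecting nodes $x$ and $y$ in the $k$-th layer, $k = 1, 2, 3$. The edge scores of a given path are illustrated in Fig. 1. With this assumption, the calculation of the scores is simplified in the following way:

$$\mathbf{R} = \mathbf{P}^{(1)} \mathbf{P}^{(2)} \mathbf{P}^{(3)},$$

where $\mathbf{P}^{(k)} = (p^{(k)}_{xy})$, $\mathbf{P}^{(1)}$ and $\mathbf{P}^{(3)}$ are $|U| \times |I|$ matrices, and $\mathbf{P}^{(2)}$ is an $|I| \times |U|$ matrix.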
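As a rough illustration of the factorized scoring, the sketch below multiplies three edge-score matrices for a toy graph. The edge-score rule (score 1 for every existing edge in every layer, which corresponds to the #3-Paths variant compared in Section V) and the helper names `matmul`, `transpose`, and `path_scores` are assumptions made for this example only.

```python
def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[t] * b[t][j] for t in range(inner)) for j in range(cols)]
            for row in a]


def transpose(a):
    """Swap rows and columns of a dense matrix."""
    return [list(col) for col in zip(*a)]


def path_scores(p1, p2, p3):
    """Sum, over all paths of length 3, of the product of the three edge scores."""
    return matmul(matmul(p1, p2), p3)


# Toy interaction matrix: 3 users (rows) x 3 items (columns), 1 = interaction.
A = [
    [1, 1, 0],  # user 0 interacted with items 0 and 1
    [0, 1, 1],  # user 1 interacted with items 1 and 2
    [0, 0, 1],  # user 2 interacted with item 2
]

# Assumed edge scores: 1 for every existing edge, so each entry of R counts paths.
P1, P2, P3 = A, transpose(A), A   # shapes: |U| x |I|, |I| x |U|, |U| x |I|
R = path_scores(P1, P2, P3)
print(R)
```

In this toy graph, user 0 has not interacted with item 2, and the single path user 0 → item 1 → user 1 → item 2 gives that pair a score of 1.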
In this work we propose to calculate the edge scores as a function of node features $f^n$ and edge features $f^e$, i.e.:

$$p^{(k)}_{xy} = \phi^{(k)}(f^n_x, f^n_y, f^e_{xy}),$$

where $f^n_x$ is the feature vector of node $x$, $f^e_{xy}$ is the feature vector of the edge connecting nodes $x$ and $y$, and $\phi^{(k)}$ can be any real-valued function (e.g., a neural network). We call the functions $\phi^{(k)}$ feature encoders and propose them below.
228 PROCEEDINGS OF THE FEDCSIS. SOFIA, BULGARIA, 2022

Assume that for each (user, item) pair we know the types of interactions between them with the corresponding timestamps. Let $E = \{e_1, e_2, \ldots, e_{|E|}\}$ be the set of all possible types of interactions (e.g., click, reply, purchase). Then we define the following features:
• $\deg(x)$ - the degree of the node $x$ (the number of distinct users/items which interacted with $x$),
• $\mathrm{rec}(x, y)$ - the number of days which passed between the most recent interaction of the user ($x$ or $y$) with the item ($y$ or $x$) and the most recent interaction of this user with any item,
• $\mathrm{ev}(e_i, x, y)$ - the number of interactions of type $e_i$ between $x$ and $y$,
• $\mathrm{ev}(x, y)$ - the number of interactions between $x$ and $y$.
Then the score is calculated as:
Parameters $d^{(k)}$ tell us how impactful (destination) nodes of a given degree should be, i.e., $d^{(k)} = 0$ means that we treat all nodes equally, $d^{(k)} > 0$ means that we reduce the impact of nodes with a greater degree, and $d^{(k)} < 0$ means that we increase the impact of nodes with a greater degree.
Parameters $r^{(k)}$ are used to utilize the recency of interactions, i.e., when $r^{(k)} > 0$ more recent interactions have higher importance, when $r^{(k)} = 0$ the recency of interactions has no impact on recommendations, and when $r^{(k)} < 0$ the older interactions are more impactful.
Parameters $e^{(k)}_e$ are designed to utilize the information about the type of interactions between users and items. The parameters associated with events requiring higher user engagement (e.g., replying to an offer) might have higher values than the others (e.g., the parameter associated with visiting an offer).
Parameters $e^{(k)}$, $b^{(k)}$ were introduced to include the information about the frequency of interactions. Higher frequency is an indicator of higher user engagement and might be used to increase the importance of a particular item.
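The features deg, rec, and ev defined above can be derived from a raw event log. The following is a minimal sketch under stated assumptions: events are tuples `(user, item, event_type, unix_timestamp)`, recency is rounded down to whole days, and the function name `edge_features` and the dictionary keys are hypothetical. It does not compute the edge score itself.

```python
from collections import defaultdict

SECONDS_PER_DAY = 86_400


def edge_features(events):
    """Compute deg, rec and ev features for every (user, item) pair in the log."""
    counts = defaultdict(lambda: defaultdict(int))  # (user, item) -> type -> count
    last_pair = {}                                  # (user, item) -> latest timestamp
    last_user = {}                                  # user -> latest timestamp overall
    user_items = defaultdict(set)
    item_users = defaultdict(set)

    for user, item, etype, ts in events:
        counts[(user, item)][etype] += 1
        last_pair[(user, item)] = max(ts, last_pair.get((user, item), ts))
        last_user[user] = max(ts, last_user.get(user, ts))
        user_items[user].add(item)
        item_users[item].add(user)

    features = {}
    for (user, item), by_type in counts.items():
        features[(user, item)] = {
            "deg_user": len(user_items[user]),   # distinct items of the user
            "deg_item": len(item_users[item]),   # distinct users of the item
            # Assumption: whole days between the user's last interaction with
            # this item and the user's last interaction with any item.
            "rec": (last_user[user] - last_pair[(user, item)]) // SECONDS_PER_DAY,
            "ev_by_type": dict(by_type),         # ev(e_i, x, y) for each type
            "ev": sum(by_type.values()),         # ev(x, y)
        }
    return features


events = [
    ("u1", "a", "click", 0),
    ("u1", "a", "reply", 3_600),
    ("u1", "b", "click", 3 * SECONDS_PER_DAY),
    ("u2", "a", "click", 100),
]
features = edge_features(events)
print(features[("u1", "a")])
```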

C. Model training
The goal of training the model is to learn the values of the parameters of our feature encoders $\phi^{(k)}$. We describe three components of this process: a forward pass for a single user, a training loop, and a loss function.

1) Forward pass for a single user:
We can present our model from the perspective of the message-passing paradigm used in graph convolutional neural networks [15], [24], [25], [26]. For the initial representation ($k = 0$) we set the score $r^{(0)}_u = 1$ for the node representing the given user and the score 0 for all other nodes. Then for all nodes $x$ and for $k = 1, 2, 3$ we perform the message passing:

$$r^{(k)}_x = \sum_{y \in N(x)} r^{(k-1)}_y \, p^{(k)}_{yx}.$$

This process can be interpreted as spreading the message across the graph. At the beginning, we send the message to the neighboring nodes depending on their relevancy calculated by $\phi^{(1)}$. Then each of these nodes sends the message to its neighbors with respect to the relevancy calculated by $\phi^{(2)}$. This process could be continued, but for efficiency reasons and based on the results of Cooper et al. [27], we limit it to 3 steps.
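The three-step propagation for a single user can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the adjacency dictionary `neighbors`, the callable `edge_score(k, x, y)` standing in for the feature encoders, and the constant edge score of 1 are all assumptions.

```python
def forward_pass(user, neighbors, edge_score, steps=3):
    """Spread a unit score from `user` along weighted edges for `steps` layers."""
    scores = {user: 1.0}  # initial representation: 1 for the user, 0 elsewhere
    for k in range(1, steps + 1):
        new_scores = {}
        for x, r_x in scores.items():
            for y in neighbors[x]:
                # Node x sends its score to neighbor y, weighted by the
                # score of the edge (x, y) in layer k.
                new_scores[y] = new_scores.get(y, 0.0) + r_x * edge_score(k, x, y)
        scores = new_scores
    return scores


# Toy bipartite graph: users u0..u2 and items i0..i2.
neighbors = {
    "u0": ["i0", "i1"], "u1": ["i1", "i2"], "u2": ["i2"],
    "i0": ["u0"], "i1": ["u0", "u1"], "i2": ["u1", "u2"],
}


def constant_score(k, x, y):
    """Assumed placeholder encoder: every edge has score 1 in every layer."""
    return 1.0


scores = forward_pass("u0", neighbors, constant_score)

# Recommend unseen items only, best score first.
seen = set(neighbors["u0"])
recommended = sorted((i for i in scores if i not in seen), key=lambda i: -scores[i])
print(scores, recommended)
```

After an odd number of steps the scores sit on item nodes, so the result can be read directly as the row of item scores for the target user.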
2) Training loop: A training process is described by Algorithm 1.

Algorithm 1 Training loop of the P3LTR model.
for iteration = 1, 2, ..., iterations do
    Update edge weights of the graph based on feature encoders.
    Set the loss to 0.
    for i = 1, 2, ..., batch size do
        Pick a random target user (by default: a random user who interacted with at least 2 items) and take their most recent interacted item as a validation node.
        Make a forward pass for this user and calculate the scores of the top k items and the score and position of the validation node.
        Calculate the loss for this user and add it to the current loss.
    end for
    Backpropagate the loss and update the weights of feature encoders.
end for

3) Loss function:
The idea of our loss function is to score the validation item higher than the other items. Let us define

$$\text{ratio} = \frac{\text{avg score of top } k \text{ items}}{\text{validation node score}}.$$
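To make the ratio concrete, here is a small sketch that computes it from a dictionary of item scores. The function name `top_k_ratio` and the input layout are hypothetical, and the sketch assumes the validation item received a non-zero score. It only computes the ratio and does not reproduce any of the loss functions.

```python
def top_k_ratio(scores, validation_item, k):
    """Average score of the k best-scored items divided by the validation item's score."""
    top_k = sorted(scores.values(), reverse=True)[:k]
    avg_top_k = sum(top_k) / len(top_k)
    # Assumption: the validation item was reached by the forward pass,
    # so its score is strictly positive.
    return avg_top_k / scores[validation_item]


scores = {"a": 4.0, "b": 2.0, "c": 1.0}
ratio = top_k_ratio(scores, validation_item="c", k=2)
print(ratio)
```

A ratio above 1 means the validation item is scored below the average of the top k items, so lowering the ratio moves the validation item up the ranking.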
To stabilize the training, we additionally calculated the sum of squares of the parameters and multiplied it by a constant regularization parameter. We considered three loss functions: by WARP loss [35]. The best loss function was chosen during the hyperparameter optimization.

D. Model advantages
We would like to emphasize the following advantages of the proposed approach:
1) P3LTR generalizes RP3Beta, which is a strong baseline model.
2) P3LTR directly utilizes the information about the user-item relationship. For instance, our model may be used for encoding the importance of ratings in an explicit feedback dataset used for the top N recommendations task if we treat each rating as a different type of interaction.
3) P3LTR utilizes additional information regarding the users and the items. In our collaborative filtering dataset, we used only node degrees as such features, but we can easily extend the model to include additional user and item features.
4) P3LTR is an explainable model from two perspectives: we can explain which items caused a given item to be recommended, and we can explain why some items are more influential on recommendations.
5) P3LTR directly utilizes the information of the node's neighbors. Such an approach might give better results than embedding-based approaches for users with a low number of interactions.
6) The training pipeline is used only for optimizing the weights of feature encoders. Hence the model can be trained sporadically (or even just once) and be utilized for providing predictions every day.
7) The model prediction is almost as efficient as RP3Beta. The difference is in the preprocessing stage, where in P3LTR we additionally need to calculate features and pass them through feature encoders.

A. Dataset
We utilized the OLX Jobs Interactions dataset, which is publicly available on Kaggle. In our previous work, as yet unpublished, we compared several non-neural collaborative filtering approaches. The RP3Beta model outperformed other approaches in terms of accuracy and efficiency and, after online A/B tests, has been deployed at OLX. The dataset contains 65 502 201 events made on http://olx.pl/praca by 3 295 942 users who interacted with 185 395 job ads in 2 weeks of 2020. Each event contains 4 pieces of information: user id, item id, event type (e.g., click or reply), and timestamp. It is important to note that users usually do not interact with many job ads (average: 20, median: 6, first quartile: 2, third quartile: 18).

B. Train-test splitting
We split the events into train and test sets by time, i.e., 20% of the newest events (approximately 2.8 days) were included in the test set. We filtered out from the test set all user-item pairs which appeared in the train set (to avoid recommending already seen items).
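The splitting procedure can be sketched in a few lines. This is a minimal sketch, assuming events are tuples `(user, item, event_type, timestamp)`. The function name `time_split` and the toy events are hypothetical.

```python
def time_split(events, test_fraction=0.2):
    """Put the newest `test_fraction` of events in the test set, then drop
    test events whose (user, item) pair already appears in the train set."""
    ordered = sorted(events, key=lambda e: e[3])          # oldest first
    cut = len(ordered) - int(len(ordered) * test_fraction)
    train, test = ordered[:cut], ordered[cut:]
    seen_pairs = {(user, item) for user, item, _, _ in train}
    test = [e for e in test if (e[0], e[1]) not in seen_pairs]
    return train, test


events = [
    ("u1", "a", "click", 1), ("u1", "b", "click", 2),
    ("u2", "a", "click", 3), ("u2", "c", "click", 4),
    ("u1", "a", "reply", 5), ("u3", "b", "click", 6),
    ("u3", "c", "click", 7), ("u2", "b", "click", 8),
    ("u1", "a", "click", 9),    # pair already seen in train: filtered out
    ("u1", "c", "click", 10),   # new pair: kept in the test set
]
train, test = time_split(events)
print(len(train), test)
```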

C. Hyperparameter tuning
For the sake of efficiency, we extracted 20% of users and 20% of items from the original train set and, according to the train-test splitting technique described in the previous section, we divided them into train and test sets used for validation. For each model, we defined the hyperparameter space and performed 100 iterations chosen by Bayesian optimization using Gaussian processes. We were optimizing for precision@10 [36] calculated on 30 thousand users. In Fig. 2 we can observe that the hyperparameters significantly affect the performance of tuned models. Therefore, tuning was essential for providing reliable results for compared methods. We can also see that choosing suboptimal hyperparameters for the RP3Beta model can result in very poor performance, which is not the case for P3LTR. We report the optimal hyperparameters in Table I.
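For reference, the tuning objective can be sketched as below. This is a sketch under assumptions: precision@k is taken as the number of relevant items among the top k recommendations divided by k, averaged over users, and the names `precision_at_k` and `mean_precision_at_k` are hypothetical. The Bayesian optimization loop itself is not shown.

```python
def precision_at_k(recommended, relevant, k=10):
    """Fraction of the top-k recommended items that are relevant for one user."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k


def mean_precision_at_k(recs, truth, k=10):
    """Average precision@k over all users that have test interactions."""
    users = list(truth)
    return sum(precision_at_k(recs.get(u, []), truth[u], k) for u in users) / len(users)


recs = {"u1": ["a", "b", "c", "d"], "u2": ["x", "y", "z", "w"]}
truth = {"u1": {"b", "d", "q"}, "u2": {"x"}}
score = mean_precision_at_k(recs, truth, k=4)
print(score)
```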

V. RESULTS
We used the best hyperparameters found to train our model on the full dataset and generate recommendations for all 619 389 users in the test set. We compared the following methods:
• P3LTR,
• RP3Beta,
• P3: the RP3Beta model for α = 1 and β = 0,
• #3-Paths: the RP3Beta model for α = 0 and β = 0.
We initialized all the parameters of our P3LTR model to zeros, which makes the #3-Paths model equivalent to the P3LTR model before the learning process. In this section, we compare the accuracy and diversity of these models. We also discuss the parameters of our P3LTR model.

A. Accuracy
In Table II we list common accuracy evaluation metrics calculated with respect to the top 10 recommendations. To identify differences between the methods, we test the null hypothesis that all methods perform the same. We used the Friedman test with the Iman and Davenport extension. The p-value from this test is equal to 0, which indicates that we can safely reject the null hypothesis that all the algorithms perform the same. We can therefore proceed with the post-hoc tests in order to detect significant differences among all of the methods. Demšar [37] proposes using Nemenyi's test and preparing a critical difference plot to visually check the differences. In the plot, those algorithms that are not joined by a line can be regarded as different. In our case, with a significance level of α = 0.05, any two algorithms with a difference in the mean rank above 0.006 are regarded as non-equal (Fig. 3).
We can observe three disjoint groups of methods: 1) P3LTR, 2) RP3Beta and P3, 3) #3-Paths. From this analysis, we see that P3LTR performs significantly better than other methods on the examined dataset.

B. Diversity
Most of the job ads refer to only one job position. Hence we should avoid recommending the same item to a great number of users. For that reason, we also report the diversity metrics in Table III. Test coverage is the fraction of items from the test set which were recommended to at least one user. We also report Shannon entropy [38] and the Gini index [38]. We can see that P3LTR is the most diverse method with respect to all these metrics. We can also note that the #3-Paths method seems less diverse than the other methods. We suppose that the reason is that this method recommends items based on the number of paths of length 3 connecting a given user and item, so the most popular items are recommended more often. In order to decide whether to deploy a new recommendation system in production, we usually check how different the recommendations produced by the new model are from those of the old one. To assess this, we calculated the overlap coefficient [39] with respect to user-item pairs. The results are reported in Table IV. We see that 70% of the top 10 recommendations provided by the P3LTR and RP3Beta models are the same. We can also observe that RP3Beta and P3 provide quite similar results on our dataset (the overlap coefficient equals 84%).
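Some of these metrics can be sketched directly from their definitions. The sketch below is illustrative only and rests on assumptions: recommendations are stored as a dictionary from user to a list of items, Shannon entropy is computed with the natural logarithm over the share of recommendations each item receives (the exact variant used in [38] may differ), and all function names are hypothetical. The Gini index is omitted.

```python
import math
from collections import Counter


def coverage_on_test_items(recs, test_items):
    """Fraction of test-set items recommended to at least one user."""
    recommended = {item for items in recs.values() for item in items}
    return len(recommended & test_items) / len(test_items)


def shannon_entropy(recs):
    """Entropy of the distribution of recommendations over items (natural log)."""
    counts = Counter(item for items in recs.values() for item in items)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def overlap_coefficient(recs_a, recs_b):
    """|A ∩ B| / min(|A|, |B|) computed over recommended (user, item) pairs."""
    pairs_a = {(u, i) for u, items in recs_a.items() for i in items}
    pairs_b = {(u, i) for u, items in recs_b.items() for i in items}
    return len(pairs_a & pairs_b) / min(len(pairs_a), len(pairs_b))


recs_new = {"u1": ["a", "b"], "u2": ["b", "c"]}
recs_old = {"u1": ["a", "c"], "u2": ["b", "c"]}
test_items = {"a", "b", "c", "d"}

coverage = coverage_on_test_items(recs_new, test_items)
entropy = shannon_entropy(recs_new)
overlap = overlap_coefficient(recs_new, recs_old)
print(coverage, entropy, overlap)
```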

C. Parameters of the P3LTR model
As we mentioned, the parameters of our model can be easily interpreted. In Table V we report the values of $d^{(k)}$ (parameters related to node degrees) and $r^{(k)}$ (parameters related to recency).
In previous works regarding the RP3Beta model, positive values for α and β were usually chosen to discourage the model from recommending the most popular items [1], [3]. We can see that in our machine learning approach positive values were also learned for all $d^{(k)}$ parameters.
Regarding recency, the model learned positive values of the $r^{(k)}$ parameters. This means that more recent interactions have higher importance.
We do not discuss parameters related to event type (e.g., viewing or replying to an ad), because they did not converge within the number of iterations we chose. Hence the reported results might differ when the model is trained multiple times. We believe that convergence could be achieved with a greater value of the batch_size hyperparameter, but this would also significantly increase the training time.

VI. SUMMARY
In this paper, we introduced a new graph vertex ranking recommendation method, which we named P3LTR. It generalizes the RP3Beta model, which provides very efficient and accurate recommendations on multiple datasets. We described several strengths of our approach, including explainability and prediction efficiency. We showed that our method is superior to RP3Beta on the OLX Jobs Interactions dataset in terms of accuracy and diversity of recommendations.
The proposed method may improve the quality of recommendations currently being generated by the RP3Beta model that is deployed at OLX Jobs in a production setting.
In future work, we plan to explore more advanced feature encoders which utilize user and item features. We would like to explore and compare different loss functions for the P3LTR model. Additionally, we would like to launch A/B tests in production to measure the model's effectiveness on real users.