Using graph solutions to identify "troll farms" and fake news propagation channels

This paper addresses the issue of fake news detection, with a particular focus on solutions derived from graph theory. It covers identifying channels that are sources of fake news and identifying users who spread false information, including users who deliberately mislead their audience and form clusters called 'troll farms'. It proposes a solution based on graph theory that classifies users according to the social context captured by graph centrality measures, computed on networks built from user interactions or from follower relations on the social network Twitter. The solution covers not only the identification of trolls but also users who potentially spread false information unintentionally, users exposed to false information, and automated scripts spreading information (bots). The efficiency of different features and classifiers is studied on the MIB and FakeNewsNet datasets. The research confirms general conclusions from previous studies and offers some improvements.


I. INTRODUCTION
With the rapid increase in accessibility to information caused by the development of the Internet, it has become much easier to manipulate and spread false information to any audience. Social media platforms have changed the way journalism can be conducted in the 21st century, enabling anyone to report on events to the masses. What is most commonly considered fake news is, in a broad sense, information that is not true or, in a narrower sense, information that has been made available to mislead the recipient [1].
Information portals and social media make it possible to target any audience, which, if used skillfully, can influence public sentiment and affect countries' internal politics. This severely threatens the stability and internal security of a nation and its citizens. Examples of such events include the 2016 US presidential election campaigns, during which 20 of the most popular manipulated posts generated more shares and comments than 19 of the most prominent news sites [2]. 'Trolling' can be defined as deviant, malicious, anti-social behavior aimed at destroying a conversation or creating conflict. The key features of this activity are deception, aggression, and negative disruptive actions, and the measure of success is to gain as much audience attention as possible [3].
An example of how vital the information domain is can be seen in the actions taken by Russia and Ukraine during the Russian-Ukrainian war that began on 24 February 2022.
The Russian Federation's attack on Ukraine was preceded by efforts to build public support for an invasion of a neighboring country using manipulative techniques and a wide range of information channels [4]. This action also targeted the rest of the world, using messengers such as Telegram to release posts or videos that distorted the picture of reality in order to present the Russian view of the conflict and gain support for its actions. While most Western countries did not succumb to disinformation, the manipulation work carried out domestically in the Russian Federation served its purpose: it convinced most Russians that the war was necessary and consolidated citizens around the authorities.

II. METHODS OF FAKE NEWS DETECTION
Methods for detecting false information are classified as content-oriented, social context-oriented, and graph-based [5].

A. Content-oriented methods
Methods that use fact-checking, i.e., comparing the thesis presented in the news with external sources, are called knowledge-oriented methods. Manual fact-checking scales poorly and is manpower-intensive. However, it allows for creating valuable datasets for developing automated solutions, such as FakeNewsNet [6]. Fact-checking using 'crowd-sourcing' carries a high risk of biased results, but it scales better than the expert method [1].
Style-oriented methods are similar to knowledge-oriented methods, but in this case, the aim is not to assess the content's veracity but to extract the author's intentions and determine whether it was to mislead the audience [1].
Content-oriented methods also include linguistic analysis of the text [7]. It is based on analyzing the syntax and semantics of a sentence by extracting features that distinguish false information from accurate information, such as length of statements, word embedding, lexical context, discourse level, etc. [5]. This solution works in the case of longer forms of expression, but in social media, extracting these features proves difficult, or there are too few of them to determine the veracity of such information.

B. Social context-oriented methods
One method is to analyze the life of information on the web. It allows one to observe how the information evolves with each sharing and how it changes to form a 'rumor.' Analyzing the life cycle of such information allows us to understand the diffusion patterns of rumors over time. Another way is to assess the information's veracity by analyzing the source's credibility [1] [5]. The third popular solution for identifying fake news is to analyze the networks it forms with other information, such as social networks, friends, post sharing, interactions with other profiles, and profile data [1]. This allows for identifying the relationships between people spreading such information and extracting the characteristics of such interactions or profiles. An important aspect is the propagation pattern of such information, which differs between false and authentic information.

C. Graph-based solutions
Network and graph analysis is mainly based on studying features that are challenging to describe by standard data-averaging methods. In the case of graphs, there are often power-law relationships due to the uneven distribution of nodes or the high degree of links between data. Graph solutions allow the study of features such as the propagation speed of objects in the network, the relevance of individual nodes, or the way objects interact within the network and whether this can change.
Graph-based methods are used extensively in deep learning to detect internet trolls, fake news propagation channels, or fake news in general. Graph neural networks (GNNs) can encode the graph structure and the node features at the same time, which, in the case of social networks or news propagation networks, dramatically increases the efficiency of classification [8] [9].
To verify fake news, automatic fact-checking methods are often used. They consist of extracting facts from the content of the news and then comparing these facts with a knowledge base, which can take the form of a knowledge graph [1].

III. DATASET PREPARATION AND PREPROCESSING
In this work, graph centrality measures were used as features for machine learning algorithms. These measures were chosen because this area has yet to be fully explored despite some work on the subject.
Identifying 'fake' users based on a follower network is a method of detecting fake news based on social context. The source of false information can be identified in this way. When a new user arrives and 'adds' other users to his/her social network, there is a chance to identify whether he or she is an account that will spread false information. This is a significant advantage because action can be taken at that point: the user can be observed and their content analyzed with other solutions for detecting false information. References [10] and [11] examined follower networks and followed accounts using graph centrality measures. In addition, it was possible to classify online trolls from the 2016 US presidential campaign by creating a network of users who retweeted their posts [12]. Using graph algorithms, it was also feasible to identify Russian troll accounts from the 2016 US election, extracted from a list provided by the US House Intelligence Committee [13].
As part of this work, it was decided to use the centrality measures applied in previous works: betweenness centrality, closeness centrality (variant unspecified), node degree centrality (degree, in-degree, out-degree), PageRank centrality, and eigenvector centrality. In addition, the following measures were examined: Wasserman-Faust closeness centrality, harmonic closeness centrality, ArticleRank, and HITS (Hyperlink-Induced Topic Search).
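The closeness-type measures listed above (plain, Wasserman-Faust, and harmonic) differ mainly in how they treat nodes that cannot be reached. The following is a minimal pure-Python sketch under stated assumptions, not the Neo4j Graph Data Science implementation used in the study. The function names `bfs_distances` and `closeness_scores`, the adjacency-dictionary input, the choice to measure distances outward along edge direction, and the division of the harmonic score by n-1 are all assumptions made for illustration.

```python
from collections import deque


def bfs_distances(adj, source):
    """Return shortest-path hop counts from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adj.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def closeness_scores(adj):
    """Compute three closeness-style scores for each node of an unweighted graph.

    `adj` maps every node (all nodes must appear as keys) to the nodes it
    links to. Distances are measured from the node outward (an assumption).
    """
    n = len(adj)
    scores = {}
    for node in adj:
        dist = bfs_distances(adj, node)
        reached = len(dist) - 1       # nodes reachable from `node`
        total = sum(dist.values())    # sum of distances to those nodes
        plain = reached / total if total else 0.0
        # Wasserman-Faust: scale by the fraction of the graph that was reached.
        wf = plain * reached / (n - 1) if n > 1 else 0.0
        # Harmonic: unreachable nodes simply contribute zero to the sum.
        harmonic = sum(1.0 / d for d in dist.values() if d > 0)
        if n > 1:
            harmonic /= n - 1
        scores[node] = {
            "closeness": plain,
            "wasserman_faust": wf,
            "harmonic": harmonic,
        }
    return scores
```

On a graph with disconnected parts, the plain score can look high for a node that reaches only a few neighbors, while the Wasserman-Faust and harmonic variants penalize it, which is one reason to examine them side by side.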

A. Tools
Of the available tools for operating on graphs, it was decided to use Neo4j in the study because of the numerous previous uses of this tool for analyzing fake news and troll accounts [2][11] [13]. This database is also fully adapted to operate on graphs. The research was performed on a computer with an Intel Core i7 7700HQ processor with 16GB RAM DDR5 and Google Colab. All collections were placed in the Neo4j database version 5.1.0. Additional libraries were used: APOC version 5.1.0 and Graph Data Science Library 2.2.5.

B. Datasets
The datasets used were those collected for the study of fake Twitter accounts [14]. This MIB dataset consists of five subsets: two sets of accounts run by humans (TFP and E13) and three sets of accounts with fake followers (INT, FSF, TWT). The data was collected before 2015. For machine learning, the collection was filtered by removing profiles with fewer than two edges; because of their large number, they were considered noise. However, the complete set was used for feature extraction to capture the centrality features of all nodes as accurately as possible. In addition to the MIB collection, users extracted from the FakeNewsNet [6] collection were also used. This collection was created in 2018 and consisted of tweets spreading fake and real news, their retweets, the profiles of the users who sent them, and tweets from the users' timelines. The collection is based on manual fact-checking performed by the portals Gossipcop and Politifact.
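The noise filter described above can be sketched as a simple degree count. This is an illustrative sketch only: the function name `nodes_with_min_edges`, the edge-list input format, and the decision to count both incoming and outgoing relations are assumptions, since the paper does not state how edges were counted.

```python
from collections import Counter


def nodes_with_min_edges(edges, min_edges=2):
    """Return the profiles that take part in at least `min_edges` relations.

    `edges` is an iterable of (follower, followed) pairs. Both endpoints of
    each pair are counted (an assumption of this sketch).
    """
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return {node for node, count in degree.items() if count >= min_edges}
```

In line with the text, such a filter would be applied only when assembling the training table, while centrality features would still be computed on the complete graph.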
Due to the known problems with downloading the collection and the Twitter limits [6][12][14] [15], it was eventually possible to obtain 6 240 964 unique user identifiers. Each profile was assigned a label of true or false depending on whether it was among the followers of, or the accounts followed by, an account that spread real or fake news. Thus, profiles potentially at risk of seeing fake news were labeled as if they were spreading fake news. After filtering out the noise in the form of profiles that had one or fewer relations and were irrelevant to the graph, 2 713 356 profiles were obtained. To speed up the feature extraction algorithms in the Neo4j database, once the collection was imported, the Random Walk with Restart algorithm was used to sample the collection at a ratio of 0.3. This algorithm preserves the structural features of the graph, which is crucial for obtaining centrality results close to the truth. Nevertheless, this procedure introduced additional uncertainty into the study. The final result was 541 255 nodes labeled as potentially false and 272 743 as potentially genuine.
As access to information about real troll accounts via the Twitter API was blocked, and the data about these accounts contained in the Neo4j Sandbox were scarce, another collection from the GNN Fake News survey was used [8]. That survey used the FakeNewsNet dataset and provided the collection as a finished graph: the relationships between individual users who retweeted another user's post. In this form the collection occupies much less memory because the original tweet identifiers have been mapped to unique numerical values starting from 0, and it does not contain additional information related to the user profile; it is a kind of skeleton. This processed collection yielded user profiles linked by interactions in the form of retweeting a post. The original collection contained 425 842 profiles, but due to accounts being blocked, deleted, or unavailable, the authors obtained only 355 316. The tweet collection is a separate part of the MIB collection. It was created for the paper [15] on the study of spambots.
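The idea behind sampling with Random Walk with Restart can be illustrated with a short pure-Python sketch. It is not the Neo4j Graph Data Science procedure used in the study: the function name `rwr_sample`, the parameters `restart_prob` and `patience`, their default values, and the rule of jumping to a fresh start node when the walk stalls are all assumptions made so that the example terminates on small or disconnected graphs.

```python
import random


def rwr_sample(adj, ratio, restart_prob=0.15, patience=100, seed=0):
    """Sample about ratio * |V| nodes with a random walk with restart.

    `adj` maps every node to a list of its neighbors. The walk returns to
    its start node with probability `restart_prob`; if it finds no new node
    for `patience` steps, it jumps to an unvisited start node.
    """
    rng = random.Random(seed)
    nodes = sorted(adj)
    target = max(1, int(ratio * len(nodes)))
    start = current = rng.choice(nodes)
    visited = {start}
    stalled = 0
    while len(visited) < target:
        neighbors = adj.get(current, [])
        if not neighbors or rng.random() < restart_prob:
            current = start                      # restart the walk
        else:
            current = rng.choice(neighbors)      # follow a random edge
        if current in visited:
            stalled += 1
        else:
            visited.add(current)
            stalled = 0
        if stalled > patience:
            # The walk is stuck in an exhausted region: pick a new start.
            remaining = [n for n in nodes if n not in visited]
            start = current = rng.choice(remaining)
            visited.add(start)
            stalled = 0
    return visited
```

Because the walk keeps returning to its start node, the sample tends to consist of connected neighborhoods rather than scattered nodes, which is why this family of samplers is often preferred when structural measures such as centrality are computed afterwards.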

C. Selection of characteristics of user interaction and followers sets
To select features for machine learning algorithms, the following dependency measures were used: Pearson correlation, the analysis-of-variance F-test (F classifier), mutual information, the chi-square test (chi2), and a tree classifier [16] [17].
The feature selection analysis started with determining the Pearson correlation matrix, identifying linearly dependent features, and then sifting them out. The selection was carried out on the complete set, with the awareness that some algorithms would show a linear relationship because they are similar in implementation, for example, closeness and harmonic closeness, or PageRank and ArticleRank.
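To see why PageRank and ArticleRank tend to be linearly related, both can be written as one power iteration that differs in a single denominator term. The sketch below follows the non-normalized formulation as I understand it from the Neo4j Graph Data Science documentation, where ArticleRank adds the average out-degree to each node's out-degree; treat that formula, the function name `rank_scores`, and the default parameters as assumptions rather than as the exact procedure run in the study.

```python
def rank_scores(out_links, damping=0.85, iterations=50, article_rank=False):
    """Power iteration for PageRank or, optionally, ArticleRank.

    `out_links` maps every node to the list of nodes it links to; every
    target must also appear as a key. Scores are not normalized to sum to 1.
    """
    nodes = list(out_links)
    avg_out = sum(len(targets) for targets in out_links.values()) / len(nodes)
    score = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        new_score = {node: 1.0 - damping for node in nodes}
        for source, targets in out_links.items():
            if not targets:
                continue
            # ArticleRank dampens the influence of low-out-degree nodes by
            # adding the average out-degree to the denominator.
            denom = len(targets) + (avg_out if article_rank else 0.0)
            share = damping * score[source] / denom
            for target in targets:
                new_score[target] += share
        score = new_score
    return score
```

Since the two variants share the same update and differ only in the denominator, their outputs are usually strongly correlated, and keeping both as features adds little information.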
The significance level of α = 0.1 was assumed to reject the null hypothesis. Thus, for p > 0.1, we cannot reject the null hypothesis that the variables are independent. The choice to leave one of the two features was made when the linear Pearson correlation coefficient between the features exceeded the value of 0.3. Pearson correlation matrices were prepared for different cases: separate Politifact user interaction set, separate Gossipcop user interaction set, combined Politifact and Gossipcop user interaction set, MIB set of followers, and combined Politifact and Gossipcop set of followers. Results for different cases were similar. One of the Pearson correlation matrices is shown in Figure 1. Based on them, the features selected for the user interaction sets were: eigenvector score, closeness score, hits auth, page rank, and inDegree. For the MIB set of followers and the combined Politifact and Gossipcop set of followers, the following features were selected: eigenvector score, harmonic closeness score, hits auth, page rank.
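The correlation-based pruning step can be sketched as follows. This is a minimal illustration, not the authors' code: the helper names `pearson` and `drop_correlated` are hypothetical, the rule of keeping whichever feature of a correlated pair comes first is an assumption (the paper does not say which one was retained), and the p-value check at α = 0.1 is left out for brevity.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant feature carries no linear relationship
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(features, threshold=0.3):
    """Greedily keep features whose |r| with every kept feature <= threshold.

    `features` maps a feature name to its list of values. Features are
    visited in insertion order, so the earlier one of a pair is kept.
    """
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept
```

With a threshold as low as 0.3, the order in which features are visited has a visible effect on the final set, so in practice the order would be chosen deliberately, for example by the relevance scores from the other dependency measures.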

IV. CLASSIFICATION
To roughly identify the classifiers that would give the best results, the following algorithms were trained on the extracted features:
• K-Neighbors Classifier, with the number of neighbors k set to 3;
• classifier with decision tree algorithm (Decision Tree Classifier);
• Random Forest Classifier, where the number of estimators was initially set heuristically to 300;
• adaptive boost classifier (AdaBoost Classifier);
• Gradient Boosting Classifier;
• Gaussian classifier with naive Bayes algorithm (GaussianNB);
• Linear Discriminant Analysis classifier;
• Quadratic Discriminant Analysis classifier;
• Support Vector Machines Classifier (SVC), with regularization parameter C=0.025, radial basis function kernel, and 5-fold cross-validation;
• Nu Support Vector Machines Classifier (NuSVC), a support vector classifier with control over the number of support vectors, proposed by Bernhard Schölkopf [18].
Algorithms with unspecified configurations used the default settings of the scikit-learn library. An initial test was carried out using the Accuracy index and the Log Loss function to determine the confidence with which the algorithm made the classification [19]. The sets were divided using the function from [17].
PATRYK SULEJ, KRZYSZTOF HRYNIÓW: USING GRAPH SOLUTIONS TO IDENTIFY "TROLL FARMS" AND FAKE NEWS PROPAGATION CHANNELS
Finally, the following classifiers were subjected to further analysis: random forest, gradient boost, k-nearest neighbors, adaptive boost. These classifiers were subjected to a stratified 10-fold cross-validation study following a review of popular methods for testing the efficiency of classifiers [20]. Stratification is a good solution for unbalanced sets, and the K-fold method itself has already been used in previous works on this topic [10] [11]. It is also widely used and effective [21]. The study results are presented in Table VI.
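The difference between the two indices used in the initial test, Accuracy and Log Loss, can be shown with a small sketch. It uses only the standard library rather than scikit-learn, and the function names and the convention that label 1 means "fake" are assumptions made for illustration.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def log_loss(y_true, p_fake, eps=1e-15):
    """Mean binary cross-entropy.

    `y_true` holds 0/1 labels (1 = "fake", an assumed convention) and
    `p_fake` the predicted probability of class 1 for each sample.
    """
    total = 0.0
    for label, prob in zip(y_true, p_fake):
        prob = min(max(prob, eps), 1 - eps)  # avoid log(0)
        total -= label * math.log(prob) + (1 - label) * math.log(1 - prob)
    return total / len(y_true)
```

Two classifiers can reach the same accuracy while one of them assigns probabilities much closer to 0.5; the Log Loss of that less confident classifier is higher, which is why the two indices are reported together.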
Table VI shows that all models obtained a relatively low recall value, indicating many classifications of "fake" users as "real". A better result was obtained in the case of precision, which gives us information about how many "real" accounts were rated as "fake". Fewer false profiles in the set (Table II) could have contributed to obtaining a high value of the accuracy coefficient.
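How low recall, better precision, and high accuracy can coexist on an unbalanced set is easiest to see from the confusion counts. The sketch below is illustrative only: the function names are hypothetical, and "fake" is treated as the positive class, consistent with the description above.

```python
def confusion_counts(y_true, y_pred, positive="fake"):
    """Return (tp, fp, fn, tn) with `positive` treated as the positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1          # a "real" account rated as "fake"
        elif truth == positive:
            fn += 1          # a "fake" account rated as "real"
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_accuracy(y_true, y_pred, positive="fake"):
    """Compute precision, recall, and accuracy from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy
```

When "fake" profiles are the minority, a model that labels almost everything as "real" still scores well on accuracy, so recall is the index that reveals how many "fake" accounts slip through.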
When detecting fake user accounts, it is essential to consider the cost of recognizing a user as someone spreading accurate information when they are in fact a "troll". This cost can be very high, making the built algorithm useless. Sometimes, however, the "forbearance" of the algorithm can be desirable.
The final proposed solution is a classifier based on the random forest algorithm, where the number of estimators n has been heuristically set to n=300. This classifier was tested on the set of Russian troll accounts described in Table II. This set consisted of 413 fake accounts and was used only as another measure of verifying the task's success. The final version of the random forest classifier trained on the Politifact and Gossipcop collections achieved an accuracy of 84.50%. This observation is consistent with previous conclusions for the set of low validity, but the obtained result is better than the tests would indicate.
A. Choosing a solution to detect fake propagation channels and bots by analyzing the network of followed users
The classifiers were studied for these sets by testing the best-performing algorithms using selected features. Studies for the sets were performed using 10-fold cross-validation to better compare the results with those in other studies. In addition, the results of training performed on one set and then testing the model on a second set were also examined.
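The cross-set protocol, training on one set and testing on another, can be sketched with a tiny k-nearest-neighbors classifier written in plain Python. This stands in for the scikit-learn classifiers used in the study; the function names `knn_predict` and `cross_set_accuracy` are hypothetical, and any data passed to them in an example would be toy values, not results from the paper.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote of its k nearest points."""
    order = sorted(
        range(len(train_X)),
        key=lambda i: math.dist(train_X[i], query),  # Euclidean distance
    )
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def cross_set_accuracy(train_X, train_y, test_X, test_y, k=3):
    """Fit on one dataset and report accuracy on a different dataset."""
    hits = sum(
        knn_predict(train_X, train_y, x, k) == y
        for x, y in zip(test_X, test_y)
    )
    return hits / len(test_y)
```

Because no sample from the second set is seen during training, the resulting accuracy says more about generalization across collections than a cross-validation score computed inside a single set.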
Table VII shows that the classifiers obtained high confidence and accuracy on the MIB set.This may be because it consisted of accounts generated by bots, which may have resulted in more significant differences between the characteristics.An important fact is that this set is already about ten years old, so the algorithms creating the bots could have been less advanced then.Similar high accuracy was achieved for all algorithms except Gaussian naive Bayes and linear and quadratic discriminant analysis.
The extracted features for this set allowed us to obtain results similar to the work [11] where closeness centrality was introduced.At the same time, it can be seen that betweenness centrality, in this case, does not play a significant role in classifying "fake" users.It was also possible to obtain better results with the KNN classifier than previous works did with the random forest.
Worse algorithm efficiency results were obtained for the set of FakeNewsNet followers, presented in Table VIII. In this case, the random forest classifier was the best, achieving the highest accuracy and the lowest Log Loss. The decision tree algorithm, KNN, and gradient boost also achieved high scores. The worse results may stem from the fact that accounts were classified as fake or genuine only because an account that tweeted false information was among their followers or followed accounts. However, achieving such accuracy means that we can identify people who may be potentially unwitting spreaders of fake news, and according to research, they constitute a large part of fake news propagation channels [1].
Surprising results were obtained for the model trained on the FakeNewsNet set and tested on the MIB set containing bots. The results presented in Table IX show that both the decision tree algorithm and the naive Bayes classifier performed much worse in this case than before. An interesting result was obtained in the case of the adaptive boost algorithm, which turned out to be the best in terms of both precision and accuracy. The gradient boost and random forest algorithms also performed well. The KNN method obtained a relatively high value of the Log Loss coefficient. This result was probably obtained because the MIB set profiles were relatively easy to detect. The model built on FakeNewsNet seems to be quite effective in this case. In the reverse situation, when the MIB model was used on the FakeNewsNet set, worse results were obtained; it can be assumed that the model built on this set will have a lower generalization ability.
The random forest, gradient boost, KNN, and adaptive boost classifiers were tested on a combined set of FakeNewsNet and MIB followers to maximize the efficiency.Table X shows the result of testing the effectiveness of classifiers using 10-fold cross-validation with stratification.
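Stratification means that every fold keeps roughly the same class proportions as the full set. A minimal sketch of how such folds can be formed is given below; it is a simplified stand-in for a library routine, and the function name `stratified_folds`, the round-robin assignment, and the fixed seed are assumptions.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds that preserve class proportions.

    Indices of each class are shuffled and then dealt to the folds in
    round-robin order, so each fold receives a near-equal share per class.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds
```

Each fold then serves once as the test part while the remaining k-1 folds form the training part, and the reported score is the average over the k runs.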
The obtained values are slightly worse than those obtained in the research from 2021 [11] on the MIB set alone. However, the MIB set alone yielded a classifier with a significantly lower ability to generalize in detecting fake users, in contrast to the set of FakeNewsNet followers obtained in this work. Building a classifier based on both sets significantly increases its generalization capabilities.
Ultimately, the best overall results were obtained for the random forest, which confirms previous studies. At the same time, other algorithms have also been shown to be highly effective. Very high accuracy was obtained for the gradient boost, which may be beneficial when a maximally "strict" classifier is needed for detecting fake users, even at the cost of considering some genuine users as fake.

V. CONCLUSIONS
The article presents graph techniques for detecting false information.An essential aspect of detecting fake news is combining knowledge from many disciplines and data from different contexts to get better results.Combining several methods gives better results, but creating a complete system that classifies information as genuine and false using content and social context analysis is time-consuming and complicated.Studying individual techniques of a complex solution, such as the one presented in the paper, requires much time and collecting appropriate training data for machine learning algorithms.
The problem of classification presented in both cases, for the analysis of connections between users based on retweeting posts and based on followers, turned out to be a complicated issue.In the case of the user interaction network, it was impossible to build an effective classifier to solve the problem.However, we created a classifier that dealt with accounts of Russian trolls from the 2016 US elections quite effectively, proving that research in this direction should be continued.
It is more difficult to determine whether a user is part of a fake news channel based on what users they retweet.In further research, the set should be enlarged with additional samples, more work should be done to remove potential outliers, and the set should be better balanced to avoid overfitting.An important area for improvement is set normalization, model regularization, and parameter tuning.
Classifier tests in the case of the follower network essentially confirmed the conclusions regarding the effective operation of the random forest from previous studies [10] [11]. It turned out, however, that the KNN classifier achieved better results on the same set of MIB followers than the random forest used in previous studies. This is an important finding, considering that training this algorithm took less time than training a random forest with n=300 estimators. Training on the set of FakeNewsNet followers and validating on the MIB set was also reasonably effective, although KNN proved less reliable than the other algorithms in that setting, obtaining a considerable value of the Log Loss coefficient.

Fig. 1: Pearson correlation matrix of features extracted using graph algorithms for the FakeNewsNet skeleton -combined Politifact and Gossipcop sets.

TABLE I: Size of follower datasets.

TABLE II: Size of user interaction sets from the FakeNewsNet skeleton and US Elections Trolls.

TABLE III: The results of individual algorithms measuring data dependence for the combined set of Politifact and Gossipcop (skeleton of the FakeNewsNet set).

TABLE IV: The results of individual algorithms measuring data dependence for the MIB set of followers.

TABLE V: The results of individual algorithms measuring data dependence for the FakeNewsNet set of followers.

TABLE VI: Stratified 10-fold cross-validation results for selected classifiers for the combined set of Politifact and Gossipcop.

TABLE VII: Test results of different classifiers on a set of MIB followers using 10-fold cross-validation with stratification.

TABLE IX: Test results of various classifiers learned on a set of FakeNewsNet followers and tested on a set of MIB followers without cross-validation (70% training data, 30% test data).

TABLE X: Stratified 10-fold cross-validation results for selected classifiers for the combined set of FakeNewsNet and MIB followers.