Identifying Reliable Sources of Information about Companies in Multilingual Wikipedia

For over 21 years, Wikipedia has been edited by volunteers from all over the world. These editors differ in education, cultural background and competences. One of the core rules of Wikipedia states that information in its articles should be based on reliable sources, and Wikipedia readers must be able to verify particular facts in the text. However, reliability is a subjective concept, and the reputation of the same source can be assessed differently depending on the person (or group of persons), language and topic. Therefore, each language version of Wikipedia may have its own rules or criteria for how a website must be assessed before it can be used as a source in references. At the same time, there are now over 1 billion websites on the Internet, and only a few developed Wikipedia language versions contain non-exhaustive lists of popular websites with reliability assessments. Additionally, since the reputation of a source can change over time, such lists must be updated regularly. This study presents the results of identifying reliable sources of information based on the analysis of over 200 million references extracted from over 40 million Wikipedia articles. Using DBpedia and Wikidata, we identified articles related to various kinds of companies and found the most important sources of information in this area. This also allows us to compare differences in source reliability between Wikipedia languages.


I. INTRODUCTION
INFORMATION presented in Wikipedia articles should be based on reliable sources [1]. A source can be understood as the work (book, paper, etc.), the author, or the publisher. Such sources must have a proper reputation and should present all majority and significant minority views on a given piece of information. Following this rule ensures that readers of an article can be assured that each specific fact (piece of information or statement) comes from a published and reliable source. Hence, before adding any information (even a generally accepted truth) to this online encyclopedia, Wikipedia volunteer editors (authors or users) need to ascertain whether the facts put forward in the article can be verified by other people who read Wikipedia [2].
Few developed language versions of Wikipedia contain non-exhaustive lists of sources whose reliability and use on Wikipedia are frequently discussed. Even the English Wikipedia (the largest chapter of the encyclopedia) has such a general list with reliability information for fewer than 400 websites [3]. Sometimes such lists exist for specific topics (e.g., video games, films, or new articles in the English Wikipedia).
Producing a more complete list of assessed internet sources could take significant human effort: there are over a billion websites available on the Internet [4], [5], and many of them can be considered sources of information. Assessing the reliability of each source can therefore be a very challenging and time-consuming task for Wikipedia volunteers. Moreover, the reputation of each website can change with time, so such lists must be updated regularly. An additional challenge is that each source may have a different reliability score depending on the topic and the language version of Wikipedia.
A more complete and regularly updated list of reliable sources can be useful not only for Wikipedia editors, but also for readers of this popular encyclopedia. The aim of this study is to show some possibilities for automating this process by analyzing existing and accepted content with sources in Wikipedia articles about companies in different languages. This paper uses existing and new models for the reliability and popularity assessment of websites. The results show that, depending on the model, it is possible to find such important sources in selected Wikipedia languages. Additionally, we show how the assessment of the same sources can vary depending on the language of this encyclopedia.

II. RELATED WORKS
Researching the quality of Wikipedia content is a fairly well-developed topic in scientific works. As the presence of references is one of the key factors influencing the quality of Wikipedia articles, some studies have focused on information sources. Some works use the number of references to automatically assess the quality of information in Wikipedia [6], [7], [8]. Such important measures are implemented in different approaches for the automatic quality assessment of Wikipedia articles (for example, WikiRank [9]). References often contain external links (URL addresses) pointing to where the cited information is placed. Such links in references can be assessed by indicating the degree to which they conform to their intended purpose [10]. Moreover, those links can be employed separately to assess the quality of Wikipedia articles [11], [12].
Some studies have focused on metadata analysis of the sources in Wikipedia references. One previous work used ISBN and DOI identifiers to unify the references and find the similarity of sources between various Wikipedia language editions [13]. It is increasingly common practice to include scientific sources in the references of Wikipedia articles [13], [14], [15], [16]. At the same time, it is worth noting that such references often link to open-access works [17] and recently published journal articles [18]. One study devoted to COVID-19-related scientific works cited in Wikipedia articles found that the information comes from about 2% of the scientific works published at that time [19].
News websites are also among the most popular sources of information in Wikipedia, and there is a method for automatically suggesting news references for a selected piece of information [20]. Particularly popular are references about recent content or life events [21]. For example, in the case of information related to the COVID-19 pandemic, Wikipedia editors were inclined to cite the latest scientific works and inserted recent information into Wikipedia shortly after the publication of these works [19].
The publication most relevant to this paper [15] proposed and implemented 10 models for source evaluation in Wikipedia articles. The assessment results are also implemented in the online tool "BestRef" [22]. These approaches use features (or measures) that can be extracted from publicly available data (Wikimedia Downloads [23]), so anybody can use the models for different purposes. One recent study [24], in addition to the proposed models, also included a time dimension to show how the importance of a given web source of information on the COVID-19 pandemic can change over different months.

III. REFERENCES EXTRACTION
To extract information about references, we prepared our own parser in Python and applied it to Wikimedia dumps with articles in HTML format [23]. Table I presents the general statistics of the extraction. External links (URL addresses) in references were used to indicate the main address of the website. However, each web source can use a different structure of URL addresses. For example, some websites use subdomains for separate topics of information or news. As another example, some organizational units (e.g., departments) of the same company may post their information on separate subdomains of the main organization. To detect which level of domain indicates the source, this work uses the Public Suffix List, a cross-vendor initiative to provide an accurate list of domain name suffixes [25]. Figure 1 presents an example of a URL address with a fourth-level domain and an indication of the main website. Table I also reports the number of references per article (RpA) for each language version (separate language chapter). The English Wikipedia has the highest value of this measure: almost 11 references per article. The French (fr), Greek (el), Japanese (ja) and Russian (ru) Wikipedias also have high RpA values.

IV. MODELS FOR WEB SOURCES
Based on a previous study [15], this work used the following models for source assessment, with changes described in this section:
1) F-model: how frequently (F) the considered source appears in references.
2) PR-model: how popular (P) the Wikipedia articles in which the considered source appears are, divided by the number of references (R) in those articles.
3) AR-model: how many authors (A) edited the articles in which the considered source appears, divided by the number of references (R) in those articles.
One of the most basic and commonly used approaches to assess the importance of a web source is to count how frequently it was used in Wikipedia articles. This principle was used in relevant studies [26], [13], [27], [18]. So, the F-model assesses how many times a specific web domain occurs in the external links of the references. For example, if the same source is cited 25 times in 13 Wikipedia articles (each containing at least one reference with that source), we count the (cumulative) frequency as 25. Equation 1 shows the calculation for the F-model:

F(s) = \sum_{i=1}^{n} C_s(i)    (1)

where:
• s is the source,
• n is the number of the considered Wikipedia articles,
• C_s(i) is the number of references using source s (e.g., domain in URL) in article i.

Some studies showed a correlation between information quality and page views of Wikipedia articles [28], [8], [29]. The more people read a specific Wikipedia article, the more likely its content was checked by some of them (including the presence of reliable sources in references). So the more readers see particular facts in Wikipedia, the bigger the probability that one of these readers will make an appropriate edit if the facts are incorrect (or if the source of the information is inappropriate).
In other words, the page views of a particular article usually show the demand for its information from Wikipedia readers. Therefore, the visibility of a reference is also important. If more references are presented in an article, then a specific source is less visible to a particular reader (visitor). At the same time, the more visitors a Wikipedia article with references has, the more visible each particular source in it is. Equation 2 shows the calculation using the PR-model:

PR(s) = \sum_{i=1}^{n} \frac{C_s(i)}{C(i)} \cdot V(i)    (2)

where:
• s is the source,
• n is the number of the considered Wikipedia articles,
• C(i) is the total number of references in article i,
• C_s(i) is the number of references using source s (e.g., domain in URL) in article i,
• V(i) is the page views (visits) value of article i for a certain period of time.

In comparison with previous research [15], for the purposes of this study, apart from the PR-model, which uses cumulative page views V from humans (non-bot views) for a recent month (March 2022), a PRy-model will additionally be used, which takes into account a wider date range: April 2021 - March 2022.
The quality of Wikipedia articles also depends on the quantity and experience of the authors who contributed to the content. High-quality articles in Wikipedia are often jointly created by a large number of different editors, and this measure positively correlates with information quality [30], [31], [32], [33], [29]. To assess the popularity of an article among editing users, it is possible to analyze the revision history of the article to find how many authors were involved in content creation and editing. So, the AR-model shows how popular an article is among Wikipedia volunteer editors. Equation 3 presents this model in mathematical form:
AR(s) = \sum_{i=1}^{n} \frac{C_s(i)}{C(i)} \cdot E(i)    (3)

where:
• s is the source,
• n is the number of the considered Wikipedia articles,
• C(i) is the total number of references in article i,
• C_s(i) is the number of references using source s (e.g., domain in URL) in article i,
• E(i) is the total number of authors of article i.

In contrast to previous work [15], the AR-model in this study uses the number of authors E that are registered on Wikipedia as users, excluding bot users. Names of bots were selected based on a separate page (for example, there is a special category in the English Wikipedia [34]).
Additionally, this study provides an ARe-model, which is a modification of the AR-model: instead of counting the number of authors of a Wikipedia article, the number of edits made by these authors (registered and non-bot) is taken into account.
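The shared structure of the models above can be sketched in a few lines: each model sums C_s(i)/C(i) over the articles containing the source, weighted by a per-article quantity (total references for F, page views for PR/PRy, author counts for AR). The data below is an invented toy example, not the paper's dataset, and the function names are our own.

```python
from collections import defaultdict

# Toy per-article statistics: total references C(i), references per
# source C_s(i), human page views V(i), registered non-bot authors E(i).
# All names and numbers are illustrative.
articles = [
    {"refs": 4, "by_source": {"nytimes.com": 2, "reuters.com": 1},
     "views": 900, "authors": 12},
    {"refs": 2, "by_source": {"reuters.com": 2},
     "views": 300, "authors": 5},
]

def scores(articles, weight):
    """Per source, sum C_s(i)/C(i) * weight(i) over all articles."""
    result = defaultdict(float)
    for a in articles:
        for source, cs in a["by_source"].items():
            result[source] += cs / a["refs"] * weight(a)
    return dict(result)

f_model  = scores(articles, weight=lambda a: a["refs"])     # reduces to plain frequency
pr_model = scores(articles, weight=lambda a: a["views"])    # page views per reference
ar_model = scores(articles, weight=lambda a: a["authors"])  # authors per reference
```

The PRy- and ARe-models fit the same skeleton: PRy only swaps in yearly page views, and ARe swaps author counts for edit counts.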
V. USING DBPEDIA AND WIKIDATA TO IDENTIFY WIKIPEDIA ARTICLES ABOUT COMPANIES
There are different possibilities to find the topic of a particular Wikipedia article. For example, each article can be aligned to multiple categories, and the corresponding Wikidata item or DBpedia resource can highlight the topic based on properties in statements [29]. Additionally, a Wikipedia article can be included in different WikiProjects, which indicates interest in its information from groups of Wikipedia editors focused on a specific topic (e.g., culture, history, military, etc.).
This study used data from DBpedia and Wikidata to find Wikipedia articles related to companies. Each of these semantic databases has its own advantages and disadvantages, related to its operating principles and the technologies used.

A. DBpedia
DBpedia [35] is a semantic knowledge base that is enriched automatically using structured information from Wikipedia articles in different languages [36], [37]. The resulting knowledge about a subject is available on the Web at an address based on the title of the Wikipedia article (as the source of that knowledge). For example, semantic data about "Meta Platforms" as a DBpedia resource can be found on the page https://dbpedia.org/resource/Meta_Platforms, because those data were extracted from the relevant article in the English Wikipedia: https://en.wikipedia.org/wiki/Meta_Platforms. At the same time, DBpedia has separate knowledge extracted from other language versions, so relevant information can also be found on pages extracted from other Wikipedia chapters. On such DBpedia pages, among different properties, we can also find information about the type(s) of the subject. In our example, "Meta Platforms" is aligned to "Company" and other classes of the DBpedia ontology [38] and other structures. Such information can be generated automatically based on infoboxes (contained in Wikipedia articles) and their parameters. Figure 2 shows examples of infoboxes about the "Meta Platforms" company in different Wikipedia languages. DBpedia extracts information about infoboxes based on the source code (wiki code or wiki markup) of the Wikipedia articles.
The DBpedia ontology has a hierarchical structure, and if a resource is aligned to other company-related classes, we can use the connections between those classes to detect Wikipedia articles related to companies. For example, some organizations can be aligned to "Bank", "Publisher", "BusCompany" or another company-related class of the DBpedia ontology, and after generalization we can find that all of them belong to the "Company" class. Based on the DBpedia dumps related to instance types [35] (the "specific" part of the dumps for each available language), we found that Wikipedia articles can be aligned directly to one of 634 classes of the DBpedia ontology. Figure 3 shows those classes, highlighting the most popular ones with a larger font size: Person, Species, PopulatedPlace, Insect, Settlement, Place and others. The "Company" class is the 20th most popular in this ranking.
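The generalization step described above amounts to walking subclass links upward in the ontology. A minimal sketch, using a small invented fragment of the hierarchy rather than the real ontology dump:

```python
# Illustrative fragment of subClassOf relations in the DBpedia
# ontology; the real hierarchy comes from the ontology files.
PARENT = {
    "Bank": "Company",
    "Publisher": "Company",
    "BusCompany": "Company",
    "Company": "Organisation",
    "Organisation": "Agent",
}

def generalizes_to(cls: str, target: str) -> bool:
    """Walk subClassOf links upward; True if `target` is reached."""
    while cls is not None:
        if cls == target:
            return True
        cls = PARENT.get(cls)
    return False

generalizes_to("Bank", "Company")  # True
```

With this check, a resource typed only as "Bank" or "BusCompany" still ends up on the company list, which is how the transitive counts exceed the direct ones.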
It is worth mentioning that DBpedia provides two kinds of dumps containing information on the classification of resources (instances): instance-types (containing only direct types) and instance-types-transitive (containing the transitive types of a resource based on the DBpedia ontology). Such files contain triples of the form '<resource> rdf:type <class>' generated by the mappings extraction and other techniques for different language chapters of Wikipedia. Figure 4 shows the structure of a part of the DBpedia ontology with the "Organisation" class as a root node. It also presents information about direct alignments to separate classes of this ontology based on the English Wikipedia. The numbers there are based on instance-types (direct alignment).
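Counting how many resources fall into each class from such a dump is a matter of scanning the triples. A sketch under the triple shape quoted above, with invented sample lines (the real dumps are large N-Triples files):

```python
from collections import Counter

# Sample lines in the shape of DBpedia's instance-types dumps
# ('<resource> rdf:type <class> .'); the data is illustrative.
lines = [
    '<http://dbpedia.org/resource/Meta_Platforms> '
    '<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> '
    '<http://dbpedia.org/ontology/Company> .',
    '<http://dbpedia.org/resource/Ada_Lovelace> '
    '<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> '
    '<http://dbpedia.org/ontology/Person> .',
]

def class_counts(lines):
    """Count how many resources are aligned to each ontology class."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) >= 3 and parts[2].startswith("<http://dbpedia.org/ontology/"):
            counts[parts[2].strip("<>").rsplit("/", 1)[-1]] += 1
    return counts
```

Running the same scan over instance-types and instance-types-transitive files yields the direct and transitive rankings, respectively.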
If we also include information on transitive types, more resources are aligned to the same classes by taking into account the connections between them in the DBpedia ontology. Figure 5 shows those classes, highlighting the most popular ones with a larger font size: Species, Eukaryote, Animal, Person, Location, Place and others. The "Company" class is the 34th most popular in this ranking.
After considering the transitive DBpedia dumps, we obtained an additional 20,736 resources (on top of the 64,372 directly aligned resources) in the "Company" class: 85,108 in total based on data from the English Wikipedia. Next, we took similar data extracted by DBpedia from other Wikipedia languages, and finally we got 173,418 unique companies (a unique company here means that separate Wikipedia articles in various languages related to the same company are counted as one company, instead of counting each Wikipedia article in each language version as a separate company). Further, we use "DBO-companies" to refer to the obtained list of Wikipedia articles about companies based on DBpedia extraction.

B. Wikidata
Wikidata [40] is a semantic knowledge base that works on similar principles to Wikipedia, with one important difference: here, facts about subjects are inserted using statements with properties and values rather than sentences in natural language. Wikidata is also considered the central data management platform for Wikipedia and most of its sister projects [41].
Each Wikidata item has a collection of different statements structured in the form: "Subject-Predicate-Object". Figure 6 shows Wikidata item Q380 ("Meta Platforms") with some statements.
Based on Wikidata statements, we can find items on a specific topic. In our case, we will use the statement "Property:P31 Q783794" ("instance of" - "company"). Listing 1 presents a SPARQL query to get such a list from Wikidata using its query service [43]. The result of this query is available on the web page https://w.wiki/5Bsc.

SELECT ?item WHERE {
  ?item wdt:P31 wd:Q783794.
}
Listing 1. SPARQL query to get the list of Wikidata items directly connected to the "company" item (Q783794) by the "instance of" property (P31).

So, based on this simple query, we got 12,635 Wikidata items related to companies. However, there are other connections in Wikidata that indicate items related to our topic. Similarly to DBpedia, there are also other "subclasses" or alternatives that can build a more complete list of Wikidata items, which in turn gives a list of appropriate Wikipedia articles. Let us go back to our example of "Meta Platforms" as a Wikidata item, shown in Figure 6. We can see that, apart from "company", this item is also aligned to "business" (Q4830453), "enterprise" (Q6881511), "public company" (Q891723) and "technology company" (Q18388277) by the "instance of" property. Now we will use this information to enrich our query; Listing 2 presents such a SPARQL query: https://w.wiki/5Bsw. This query returned many more Wikidata items than the previous one: 275,944 items. It is important to note that this number does not directly show the number of Wikipedia articles related to companies, because not all Wikidata items contain links to at least one Wikipedia article.

First, let us try to obtain general statistics on the values inserted into the "instance of" (P31) property among over 95 million Wikidata items. To do so, we prepared a special algorithm in Python to extract such information from Wikidata dumps in JSON format [44]. It is worth noting that it is possible to construct a SPARQL query to solve this task; however, due to the limitations of the Wikidata query service (such as the limited execution time of a query), such statistics and other complex analyses can be done by extracting the necessary data from the dump files.
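The dump-based tallying of P31 values can be sketched as follows. This is a simplified illustration, not the paper's algorithm: it assumes one entity per input line in the JSON dump layout (entities carry a `claims` map keyed by property, each claim holding a `mainsnak` with a `datavalue`), and the sample record is heavily trimmed.

```python
import json
from collections import Counter

def count_p31(dump_lines):
    """Tally the values of 'instance of' (P31) across Wikidata entities."""
    counts = Counter()
    for line in dump_lines:
        entity = json.loads(line)
        for claim in entity.get("claims", {}).get("P31", []):
            value = claim["mainsnak"]["datavalue"]["value"]["id"]
            counts[value] += 1
    return counts

# Trimmed, illustrative record for Q380 ("Meta Platforms"):
sample = ['{"id": "Q380", "claims": {"P31": [{"mainsnak": '
          '{"datavalue": {"value": {"id": "Q4830453"}}}}]}}']
count_p31(sample)  # Counter({'Q4830453': 1})
```

Streaming the dump line by line this way avoids the query service's execution-time limits, at the cost of downloading and scanning the full dump.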
Figure 7 shows those items, highlighting the most popular ones with a larger font size: scholarly article (Q13442814), human (Q5), Wikimedia category (Q4167836), temporal range start (Q523), taxon (Q16521), infrared source (Q67206691), galaxy (Q318) and others. Overall, there are 87,501 different alignments ("classes"). Items related to companies, such as "business" (Q4830453) and "enterprise" (Q6881511), are in 39th and 129th place, respectively, in this ranking.
Next, we conducted this analysis only on Wikidata items that have at least one link to a Wikipedia article in one of the 42 languages considered in this study (see Table I).

C. Combined approach
Compared to the DBpedia ontology classes (see V-A), Wikidata has many more possible alignments to different items: over 100 times more. There are various possibilities to automate the process of identifying company-related items in Wikidata. One of them is to analyze the Wikidata items related to "DBO-companies" selected using DBpedia extraction and find the most popular alignments in their "instance of" statements. Figure 9 presents the popular alignments for this case. Overall, there are 3,453 various "classes", and the most popular are: business, enterprise, public company, company, automobile manufacturer, airline, record label, publisher, bus company, video game developer, organization, commercial organization, bank and others.
Fig. 6. Scheme of the Wikidata item related to the "Meta Platforms" company. Source: own work based on [42].
Finally, let us take into account the alignments that appear at least 200 times, to avoid insignificant mistakes that could be made by some users editing Wikidata. In that case, we have 63 Wikidata items that can appear as values in "instance of" (P31) statements. Additionally, we removed the alignment to "organization" (Q43229), which is too general. As a result, we have more Wikidata items with articles on the list of companies: overall, 291,768 Wikidata items with at least one related Wikipedia article in the considered language versions were identified. In further analysis, we will use "WCA-companies" for this list.
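The threshold-and-exclude step can be sketched in a few lines. Function name, threshold parameterization and the counts below are illustrative; only the 200-occurrence cutoff and the exclusion of Q43229 come from the text.

```python
from collections import Counter

def select_company_classes(alignment_counts, threshold=200, exclude=("Q43229",)):
    """Keep P31 values occurring at least `threshold` times, dropping
    overly general items such as 'organization' (Q43229)."""
    return {q for q, n in alignment_counts.items()
            if n >= threshold and q not in exclude}

# Illustrative counts: frequent 'business', general 'organization',
# and a rare, likely erroneous alignment.
counts = Counter({"Q4830453": 90000, "Q43229": 5000, "Q99999999": 3})
select_company_classes(counts)  # {'Q4830453'}
```

The frequency threshold filters out one-off misclassifications, while the explicit exclusion list handles classes that are frequent but too broad.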

VI. ESTIMATING THE INFORMATION SOURCES IN WIKIPEDIA ABOUT COMPANIES
This section presents the results of the assessment of the most important sources of information about companies across Wikipedia languages using different models. Due to space limitations, the following subsections present results for the 15 most developed language versions of Wikipedia (with at least 1 million articles, see Table I). Additionally, for the charts below, only the websites that appear at least 20 times in the top 100 at each language/model intersection were selected. More extended and interactive results can be found in the supplementary materials [39].
It is important to note that archive services (such as archive.org) were excluded from the analysis due to the frequent occurrence of such links alongside the original sources in the same reference. If the original source is no longer available, such archive services are very important because Wikipedia readers can still verify the information, but unavailable original web sources are not within the scope of this research. References to Wikipedia itself and to Wikidata were also excluded. Links that are automatically inserted into references based on identifiers such as DOI (often links to doi.org) or ISBN (often links to books.google.com) cannot directly indicate the source of information, so such links were not considered in the website analysis.
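The exclusion rules above can be expressed as a simple link filter. A minimal sketch: the exclusion set contains only the domains named in the text (the actual study may have excluded more), and the function name is our own.

```python
from urllib.parse import urlsplit

# Domains excluded from the website analysis: archive services,
# Wikipedia/Wikidata self-references, and identifier resolvers.
EXCLUDED = {"archive.org", "wikipedia.org", "wikidata.org",
            "doi.org", "books.google.com"}

def keep_for_analysis(url: str) -> bool:
    """True if the link can directly indicate an original source."""
    host = urlsplit(url).hostname or ""
    return not any(host == d or host.endswith("." + d) for d in EXCLUDED)

keep_for_analysis("https://www.reuters.com/article")  # True
```

Matching on the domain and its subdomains (the `endswith` check) catches variants such as web.archive.org or language-specific Wikipedia hosts without listing each one.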

A. DBO-companies
First, we conducted a source analysis for the list of Wikipedia articles generated based on data from DBpedia (see V-A): "DBO-companies". Figure 10 shows the most important web sources of information on companies described in Wikipedia, with positions in rankings across the 15 most developed language versions using the five considered models.
The top 10 web sources in DBO-companies across the 15 considered languages according to the different models are as follows: nytimes.com, reuters.com, youtube.com, bloomberg.com, forbes.com, techcrunch.com, bbc.co.uk, cnn.com, wsj.com, theguardian.com [39].

B. WCA-companies
Figure 11 presents the most important web sources of information on companies described in Wikipedia based on the WCA-companies list (described in V-B and V-C), with positions in rankings across the 15 most developed language versions using the five considered models.

C. Wikipedia languages
Based on the average position in the rankings calculated using the different models, we prepared the top 10 most important sources of information about companies in each Wikipedia language. Lists of such sources are presented below.
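The rank aggregation described above can be sketched as follows. The rankings below are invented for illustration; only the "average position across models, lower is better" logic comes from the text.

```python
# Illustrative positions of each website in the rankings produced by
# the five models (F, PR, PRy, AR, ARe); lower is better.
rankings = {
    "nytimes.com": [1, 2, 2, 3, 1],
    "reuters.com": [2, 1, 1, 1, 2],
    "youtube.com": [3, 5, 4, 2, 3],
}

def top_sources(rankings, k=10):
    """Order websites by their average position across the models."""
    avg = {site: sum(p) / len(p) for site, p in rankings.items()}
    return sorted(avg, key=avg.get)[:k]

top_sources(rankings)  # ['reuters.com', 'nytimes.com', 'youtube.com']
```

Averaging positions rather than raw scores keeps the five models comparable, since their score scales (counts, page views, author counts) differ widely.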

VII. CONCLUSION AND FUTURE WORK
This study focused on the analysis of the information sources of Wikipedia articles about companies in different languages. After the extraction of over 230 million references, the main website address was identified for each URL address. As a result, over 2 million unique websites were identified. To find important web sources across the languages, the topics of the Wikipedia articles were analyzed. Using the semantic representation of this information in DBpedia and the user-generated knowledge in Wikidata, this study shows how to find important web sources across languages based on existing and new models.
The models presented in this work can help not only Wikipedia volunteer editors select websites that can provide valuable information on companies, but can also help other Internet users better understand how to find valuable sources of information on a specific topic on the Web using open data from Wikipedia.
We plan to extend this research in the future by providing additional features for the identification of companies in Wikipedia. Additionally, we plan to divide the different organizations into specific sectors (industries) to find differences in the reliability of information sources between them. Future work will also focus on extending the reliability models and using different methods for topic classification. One of the directions is to develop ways of weighting the importance of a reference based on its position within a Wikipedia article. There are also plans to include different measures related to the reputation of Wikipedia authors, the protection of articles, topic similarity and others.