Preliminary Citation and Topic Analysis of International Conference on Agile Software Development Papers (2002-2018)

This study utilizes citation analysis and automated topic analysis of papers published in the International Conference on Agile Software Development (XP) from 2002 to 2018. We collected data from the Scopus database, finding 789 XP papers. We performed topic and trend analysis with R/RStudio utilizing the text mining approach, and used MS Excel for the quantitative analysis of the data. The results show that the first five years of the XP conference cover nearly 40% of the papers published until now and that almost 62% of the XP papers have been cited at least once. Mining of XP conference paper titles and abstracts results in these hot research topics: “Coordination”, “Technical Debt”, “Teamwork”, “Startups” and “Agile Practices”, thus strongly focusing on practical issues. The results also highlight the most influential researchers and institutions. The approach applied in this study can be extended to other software engineering venues and applied to large-scale studies.


I. INTRODUCTION
In every field of science, identifying emerging research topics is useful for researchers, funding agencies and policy makers. This helps to promote and enhance the development of potentially promising research topics. Citation is a way to judge influential work and build new studies on existing research results [1], [2], [3]. Citation analysis is a common way not only to judge but also to observe the most popular and influential work [1], [2], [4]. Bibliometrics, on the other hand, is a method used for statistical analysis of publications in order to provide quantitative analysis [5]. Bibliometrics-based identification of active authors and institutions has many benefits, e.g., helping students and researchers to identify active and relevant institutes for their area of interest, and enabling employers to recruit the most qualified potential researchers [3].
In various fields of science, e.g., in medicine, physics and social sciences, it is common to identify the highly cited papers [6], [7], [8]. Bibliometrics and citation analysis studies have also been conducted in software engineering, computer science and other disciplines, e.g., [4], [2], [3], [9], [10], [11], [12], [13], [14]. The highly cited papers usually provide insights into new avenues of research, a significant summary of the state-of-the-art in a research area and a measure of scientific activity, in general [1], [2]. One of the key outlets for Agile research, "Agile Software Development Conference (XP)", has not been evaluated under the lens of citation analysis alone or as a sub-field of its own (processes). The XP Conference ("International Conference on Extreme Programming (XP)", formerly "Conference on Agile Software Development (AGILE)") was included in a bibliometrics study by Karanatsiou et al. [14] in the general domain of software engineering (where the XP conference was the only process-oriented conference in that study). The study by Chuang et al. [13] assessed agile software development, in general, for 221 published primary articles on the topic.
The purpose of this study is to provide an overview of the literature published in all XP conference proceedings. This study helps readers to understand the development and evolution of the XP conference from three main aspects: (i) the citation landscape and the most cited papers, (ii) the most active authors, institutions and countries, in terms of number of publications, and (iii) the identification of emerging research topics in XP conference publications and use of indexed keywords.
This paper is organized as follows: First, we discuss the research method and the data extraction technique. Second, we present the results of the analysis including findings on active individuals and institutes, highly cited papers and authors, and trends in the covered topics. Third, we discuss the threats to validity of the study. Finally, we summarize the findings and provide recommendations for future research.

II. RESEARCH METHOD AND DATA EXTRACTION
The research data were collected from the Scopus database on September 2nd, 2018. Scopus is claimed to be the largest abstract and citation database of peer-reviewed literature. Scopus also provides citation data and allows search results to be saved to a CSV file for further analysis.
We started with the search string "1" (see Table I) to collect data related to all published XP conference papers. The search resulted in 758 papers. To our surprise, the search string "1" did not retrieve papers for the year 2011. We learned that the records for the 2011 papers do not include the information about the XP conference in the Scopus database. Thus, to collect those missing papers, we complemented the findings with the search string "2", resulting in 31 papers.
The complete search gave us 789 papers (758+31), covering the years 2002-2018 (published by September 2nd, 2018). The data, including, e.g., names of the authors, title, publication year, source title, number of citations, link and abstract, were stored as a CSV file. We were also able to extract data directly from Scopus for the analysis of the affiliations and countries related to the authors (analysis of the search results in Scopus) as well as the top 20 cited papers (overview of the citations in Scopus). We used both MS Excel and R/RStudio for analyzing statistics and trends from the data.
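Combining the two search results into one data set can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the procedure used in the study (which relied on MS Excel and R/RStudio): the column names `EID`, `Title`, `Year` and `Cited by`, the sample rows, and the treatment of an empty citation cell as zero are all assumptions made for the example.

```python
import csv
import io

# Two tiny in-memory stand-ins for the CSV exports of search strings "1" and "2".
# Column names and rows are hypothetical, chosen only for illustration.
EXPORT_1 = """EID,Title,Year,Cited by
e1,Paper A,2002,12
e2,Paper B,2004,
"""
EXPORT_2 = """EID,Title,Year,Cited by
e3,Paper C,2011,3
e2,Paper B,2004,
"""

def read_export(text):
    """Parse one CSV export into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def merge_exports(*exports):
    """Concatenate exports, keeping the first row seen for each record ID."""
    merged = {}
    for rows in exports:
        for row in rows:
            merged.setdefault(row["EID"], row)
    return list(merged.values())

def citation_count(row):
    """Treat an empty 'Cited by' cell as zero citations (an assumption)."""
    return int(row["Cited by"] or 0)

papers = merge_exports(read_export(EXPORT_1), read_export(EXPORT_2))
total_citations = sum(citation_count(p) for p in papers)
print(len(papers), total_citations)
```

Deduplicating on a record identifier guards against a paper being returned by both search strings; whether any such overlap existed in the actual data is not stated in the text.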

III. RESULTS
In 2001, the first "XP Universe" hosted tutorials, lectures, panel discussions, posters, workshops, and other less traditional discussions. A year later, the 2nd "XP Universe" and 1st "Agile Universe" were brought together to attract software experts, educators, and developers, in general. In 2003 and 2004, the two conferences, "Extreme Programming and Agile Methods - XP/Agile Universe" and "Extreme Programming and Agile Processes in Software Engineering", were organized separately, but reported together in a Springer database. In 2005, the conferences were merged and formed a single venue: "Extreme Programming and Agile Processes in Software Engineering". Since 2007, the conference has been called "Agile Processes in Software Engineering and Extreme Programming".
The Scopus database search yielded 789 papers in the proceedings of the XP conference published between 2002 and 2018, see Fig. 1. The high number of papers for 2004 (n=96) is explained by the fact that the two aforementioned conferences are recorded together. The first five years of the XP conference cover nearly 40% of the papers published so far.

A. Authorship Trends
The results show that 1260 unique authors contributed to the 789 papers in XP conferences until 2018. The minimum number of authors for an XP paper was one, whereas the maximum was nine. The largest share of the XP papers in 2018 (almost 35%) have four authors. In general, about 30% of all papers have two authors, 25% have one author, and 9% of the papers have five or more authors, see Table IV. The number of authors having contributed to three or more XP papers is rather small, as most authors have contributed to just one or two papers. About 75% of the authors (944) have authored just one paper and about 88% of the authors (1108) have authored only one or two papers, as a single author or as a co-author. Chuang et al. [13] also reported a finding of a core intellectual pool contributing to the agile research realm.
During the first three years (2002-2004) of the conference, most papers were published by a single author. For the years 2005-2009, most papers were published by two authors, and for the years 2010-2012 and 2013-2014 by three and four authors, respectively. We consider the growing number of authors per paper an indication of increased (international) collaboration among the contributors. In the 1970s, the average number of authors per paper in software engineering was around 1.5, while after 2010, the number of authors has typically been three [15]. The average number (i.e., arithmetic mean) of authors for the papers in the XP conference is 2.6.
Aksnes [16] studied a body of Norwegian articles (nearly 50000 articles having at least one Norwegian address). He concluded that at an aggregated, general level the "highly cited papers typically involve more collaborative research than what is the normal or average" [16]. In our study, the correlation between the number of authors and citations for a paper, for all papers, is weak (r = 0.13, df = 787, p = 0.0002). However, for the set of top 20 cited papers (see Table VI), the correlation between the number of authors and citations for a paper is considerably higher (r = 0.59, df = 18, p = 0.0064). Thus, the correlation coefficient suggests a strong positive correlation between the number of authors and citations for those top 20 cited papers.
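The correlation figures above can be related to their degrees of freedom with a short sketch. The Python code below is an illustration only and is not taken from the study's scripts: it computes a Pearson coefficient for hypothetical toy data and the t statistic t = r * sqrt(df / (1 - r^2)) with df = n - 2, which is the usual test of a Pearson coefficient against zero. Turning t into a p-value additionally requires the Student t distribution from a statistics package and is not shown.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need two sequences of equal length >= 3")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, df):
    """t statistic for testing r against zero, with df = n - 2."""
    return r * math.sqrt(df / (1.0 - r * r))

# Hypothetical toy data: authors per paper and citations per paper.
authors = [1, 2, 3, 4, 5]
citations = [1, 3, 2, 5, 4]
r = pearson_r(authors, citations)
t = t_statistic(r, df=len(authors) - 2)
print(round(r, 2), round(t, 2))

# t statistics implied by the coefficients reported in the text.
print(round(t_statistic(0.13, 787), 2), round(t_statistic(0.59, 18), 2))
```

The degrees of freedom in the text match n - 2 for both samples (789 papers give df = 787, and 20 papers give df = 18). With 789 papers, even a weak coefficient such as 0.13 yields a large t statistic, which is why a weak correlation can still be statistically significant.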

B. Citation Landscape & Most Cited Papers of XP Conference
A high citation count of a scientific work is an indication of the influence and impact of a given paper [16], [17]. Our analysis shows that 62% (n=488) of XP papers have been cited at least once, leaving about 38% (n=301) as uncited papers, see Fig. 2. This is an indication of the higher visibility of XP conference papers. When focusing on the first ten years of the XP conference, i.e., the papers prior to 2012, nearly 65% of those papers (352/542) have been cited at least once. The findings are in line with prior studies [4], [18] in which about 43% of the papers were uncited (large body of software engineering publications). Similarly, about 42% of the papers of the "International Symposium on Empirical Software Engineering and Measurement" [3] were uncited.
Garfield [1] discusses whether the citation count is a measure of the importance or impact of a scientific work. He claims that citation count is rather a measure of utility, i.e., usefulness of the work for a large number of people or experiments [1]. Furthermore, a citation count can also be a measure of scientific activity and not necessarily related to the significance of the scientific work [1]. As, in reality, only a rather small portion of the XP conference papers retrieved from Scopus are full research papers, the high number of uncited papers is not a surprise. Thus, it can be claimed that the samples from indexed databases may not be as representative as expected for citation analysis without rigorous filtering. However, such sample papers may well be valid for analysing author activity as well as research trends and topics.
Fig. 2. Distribution of Citations (0-100) for the papers
Table VI shows the top 20 most cited XP conference papers (each having at least 23 citations). The top 20 papers account for about 23% of all citations (680/2920), and they are mainly from the earlier years of the XP conference (2002-2009). However, one paper was published in 2015, and five of the top 20 papers were published in 2002. Table VI shows that 92% of the citations (624/680) are from papers not written by the authors of the cited paper themselves. Typically, a paper is cited for the first time during the year of its publication or during the following year. However, the two top cited papers, "Empirical findings

C. Highest Cited Papers Per Year
Many countries and evaluating bodies (for funding, promotions or appointments) are using figures like publication record or citation count in decision-making [3] VII). The paper also ranked the highest for the number of citations (100, see Table VI) and has the fourth highest normalized citation count (6.25). Garousi and Fernandes [18] claim that newer papers will first get to be known in the community. According to Raulamo-Jurvanen et al. [3], the longer a paper has been available, the better its chances of being cited. However, according to our results, recent papers have received more attention in terms of citations. One reason can be that the software engineering community has grown over the years and recent topical papers may have a slight advantage when it comes to the number of citations per year.
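The normalized citation counts quoted in this section are consistent with dividing a paper's citations by the number of years between its publication and the 2018 data collection. That formula is an assumption inferred from the reported numbers, not a definition given in the text, and the function name below is hypothetical. A minimal Python sketch:

```python
def normalized_citations(citations, publication_year, reference_year=2018):
    """Citations per year since publication (assumed normalization)."""
    years = reference_year - publication_year
    if years <= 0:
        raise ValueError("paper must be published before the reference year")
    return citations / years

# 100 citations for a 2002 paper, and 23 citations for a 2015 paper.
print(round(normalized_citations(100, 2002), 2))
print(round(normalized_citations(23, 2015), 2))
```

Under this assumption, 100 citations over 16 years give 6.25 and 23 citations over three years give 7.67, matching the values reported for Table VII. Papers published in the reference year itself would need separate handling, which the sketch rejects explicitly.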
We were curious to see whether the length of the title had an impact on the number of citations for a paper. Letchford et al. [19] studied the relationship between the lengths of paper titles and citations (across various journals) and concluded that a short title for a paper is an advantage for receiving citations. However, they also stated that the evidence is not as strong when adjusted for the journal where the paper is published. For the XP papers, the correlation between the length of the title, either in words or characters, and the number of citations is weak (r = 0.03, df = 787, p = 0.415 and r = 0.04, df = 787, p = 0.235, respectively). The top 5 cited papers have rather short titles (length varying from 31 to 77 in characters and from 5 to 10 in words). The median length of all titles, in characters and words, is 62 and 8, respectively.

D. Topical Issues
With topic modeling, we intend to analyze the abstract topics in the documents. We removed 66 documents from the original pool of 789 documents because their abstracts were not included in Scopus. Thus, the set of documents for trend analysis included 723 documents. We combined the titles and the abstracts of the documents, converted the text to lowercase and removed all (English) stopwords in R.
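The preprocessing steps just described can be sketched as follows. The study performed them in R; the Python version below is only an illustration, and the tiny stopword list, the letters-only tokenization rule and the sample records are assumptions made for the example.

```python
import re

# A deliberately tiny stopword list for illustration; a real English
# stopword list is much longer.
STOPWORDS = {"a", "an", "and", "in", "of", "on", "the", "to", "we", "with"}

def preprocess(title, abstract):
    """Combine title and abstract, lowercase, tokenize, drop stopwords."""
    text = f"{title} {abstract}".lower()
    # Assumption: tokens are runs of ASCII letters; digits and punctuation are dropped.
    tokens = re.findall(r"[a-z]+", text)
    return [tok for tok in tokens if tok not in STOPWORDS]

# Hypothetical records; the second has no abstract and is excluded.
records = [
    {"title": "Pair Programming in the Classroom", "abstract": "We report on a course."},
    {"title": "Untitled Panel", "abstract": ""},
]
with_abstract = [rec for rec in records if rec["abstract"].strip()]
corpus = [preprocess(rec["title"], rec["abstract"]) for rec in with_abstract]
print(corpus)
```

Filtering out records without an abstract before tokenization mirrors the reduction from 789 to 723 documents.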
For the trend analysis we utilized topic modeling and Latent Dirichlet Allocation (LDA) as described by Griffiths and Steyvers [12] with R scripts based on Ponweiser [20]. Our approach was identical to the process used by Raulamo-Jurvanen et al. [3] and Garousi and Mäntylä [4]. We created a document term matrix from the corpus (using the R "text2vec" package), excluding words having fewer than two characters or appearing in fewer than three documents. We generated an LDA model (using the R "topicmodels" package) by fitting models with 2 to 100 topics, in steps of one, yielding 35 as the optimal number of topics.
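The vocabulary pruning rule for the document term matrix can be illustrated with a small sketch. This is plain Python, not the "text2vec" API, and the function names and toy documents are hypothetical; it only shows the rule of dropping terms shorter than two characters or occurring in fewer than three documents. Fitting the LDA models and choosing the number of topics are not shown.

```python
from collections import Counter

def build_vocabulary(docs, min_chars=2, min_docs=3):
    """Keep terms with >= min_chars characters occurring in >= min_docs documents."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return sorted(
        term for term, df in doc_freq.items()
        if len(term) >= min_chars and df >= min_docs
    )

def document_term_matrix(docs, vocabulary):
    """One row of term counts per document, columns ordered as in vocabulary."""
    rows = []
    for tokens in docs:
        counts = Counter(tokens)
        rows.append([counts[term] for term in vocabulary])
    return rows

# Hypothetical tokenized documents.
docs = [
    ["agile", "team", "x", "agile"],
    ["agile", "team", "debt"],
    ["agile", "team", "x"],
    ["x", "debt"],
]
vocab = build_vocabulary(docs)
dtm = document_term_matrix(docs, vocab)
print(vocab)
print(dtm)
```

In this toy corpus, "x" is dropped for its length although it occurs in three documents, and "debt" is dropped because it occurs in only two.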
In the analysis of the trend slopes (by publication year), the topics gaining interest among the authors are the "hot topics" and the topics declining in interest are the "cold topics". The five hottest and coldest topics, interpreted by the topic-specific words (and related titles), and 10 significant terms for each of those, are shown in Table VIII(a) and Table VIII(b), respectively. The topics gaining the most interest are "Coordination" and "Technical Debt", which include issues like large-scale coordination and inter-team objectives as well as metrics and automation. Cold topics such as "Education", "Methods and Practices" (including pair programming) and "Testing" have been of less inspiration for the submissions during the recent years of the XP conference.
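One way to read "trend slope" is as the slope of a linear fit of a topic's share of the corpus against publication year, with positive slopes marking hot topics and negative slopes marking cold ones. The Python sketch below assumes exactly that; the regression form is an assumption, and the yearly shares are invented for illustration and are not results from this study.

```python
def trend_slope(years, shares):
    """Ordinary least-squares slope of topic share regressed on publication year."""
    n = len(years)
    if n != len(shares) or n < 2:
        raise ValueError("need at least two (year, share) pairs")
    mean_year = sum(years) / n
    mean_share = sum(shares) / n
    sxy = sum((y - mean_year) * (s - mean_share) for y, s in zip(years, shares))
    sxx = sum((y - mean_year) ** 2 for y in years)
    return sxy / sxx

# Hypothetical mean topic proportions per publication year (not study data).
years = [2014, 2015, 2016, 2017, 2018]
topic_shares = {
    "Coordination": [0.01, 0.02, 0.03, 0.05, 0.06],
    "Education": [0.06, 0.05, 0.03, 0.02, 0.01],
}
slopes = {topic: trend_slope(years, s) for topic, s in topic_shares.items()}
hot = [topic for topic, slope in slopes.items() if slope > 0]
cold = [topic for topic, slope in slopes.items() if slope < 0]
print(slopes, hot, cold)
```

Ranking topics by slope, rather than only by sign, would then give the "five hottest" and "five coldest" lists; a significance test on each slope is a natural refinement that is omitted here.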
In 2012, Dingsøyr et al. [21] studied agile software development and outlined key research themes at the time, namely Case Study Methodology, Traditional Software Engineering, CMM, Project Management, Software estimation, Pair Development, Distributed Cognition, Agile methods, User-centered design, Agile methodologies and Patterns. Some of those themes still seem topical, e.g., software estimation as "Technical Debt", and some not, like Pair Development or Agile Methods as "Methods and Practices" (see Table VIII). In fact, Dingsøyr et al. [21] report that at Agile2011 they had specifically asked people (mainly academics) which topics should be researched less or further. Pair programming in educational settings and reuse of code were considered as topics not requiring further research, while topics like agile across projects and across organizations and distributed agile were considered to be important. "We concur that these are exciting research areas that can further our understanding of the effectiveness of agile methods and practices, particularly in different project/organizational contexts" [21]. Such a trend is also visible in our study, as "Education" and "Methods and Practices" (including pair programming) were found to be cold topics and topics like "Coordination" and "Teamwork" were among the hot topics.
Perhaps researchers should ask research topic related questions more frequently, not only among academics but also among the practitioners in the field, to support the needs or interests in the industry, too.

E. Indexed Keywords
To study the published topics from another perspective, we collected the indexed keywords from Scopus. It is notable that we used the indexed keywords (not the author keywords), as the indexed keywords outnumber the author keywords, providing more details. Additionally, there are papers that are missing not only abstracts (see Section III-D) but also keywords (see Scopus, e.g., a conference paper "Agile acceptance testing" by Pettichord and Marick from 2002). There were 720 papers with indexed keywords. The minimum number of indexed keywords for a paper was 3, the maximum was as high as 25 (for one paper) and the arithmetic mean was 9.4. We checked the correlation between the number of indexed keywords and the number of citations for a paper, but that correlation is weak (r = 0.028, df = 718, p = 0.459).
We paired the keywords for each paper (e.g., a paper having four keywords would eventually yield 6 unique keyword pairs) and converted the keywords to lower case. The pairing resulted in 32131 keyword pairs, which we then stored in a CSV file. We used Cytoscape, an open source software platform, for visualizing the network of the paired keywords (after removing duplicates), see Fig. 3. The lighter the color in the figure, the more connections the keyword had. The keyword "software engineering" was, unsurprisingly, the most used keyword, see Fig. 3. The ten other most used keywords were "software design", "agile software development", "agile methods", "computer programming", "project management", "computer software", "agile development", "extreme programming", "agile" and "software testing". The keywords are rather generic, but still quite nicely represent the key research themes identified by Dingsøyr et al. [21]. However, a more detailed analysis of the keywords, to view the overall importance and reveal the topicality of the keywords, would be required to see the trends in the area of XP.
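The keyword pairing step can be sketched with `itertools.combinations`. The Python code below is an illustration under stated assumptions, not the script used in the study: the sample keyword lists are hypothetical, and sorting the keywords within each paper is a choice made here so that the same pair always appears in the same order and duplicates can be removed.

```python
from itertools import combinations

def keyword_pairs(keywords):
    """All unordered pairs of distinct, lowercased keywords of one paper."""
    unique = sorted({kw.strip().lower() for kw in keywords if kw.strip()})
    return list(combinations(unique, 2))

# Hypothetical indexed keywords for two papers.
papers = [
    ["Software engineering", "Agile methods", "Software design", "Project management"],
    ["Software engineering", "Agile methods", "Software testing"],
]
all_pairs = [pair for kws in papers for pair in keyword_pairs(kws)]
edges = sorted(set(all_pairs))  # duplicates removed before visualization
print(len(keyword_pairs(papers[0])), len(all_pairs), len(edges))
```

A paper with k keywords yields k(k-1)/2 pairs, so four keywords give the six pairs mentioned in the text. The deduplicated pairs form an edge list that can be written to a CSV file for a network visualization tool.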

IV. THREATS TO VALIDITY
In this section, we discuss four perspectives of validity threats [17] and the steps that we have taken to mitigate those threats.
Internal validity reflects the extent to which a causal conclusion based on a study is warranted [17]. The approach used for the selection and extraction of XP conference papers is discussed in Section II. In order to ensure repeatability and reproducibility of our study, the search terms have been defined carefully and reported in the research method Section II. Additionally, the raw data and the scripts used are provided to ensure transparency and replicability of our analysis. The material can be accessed via this link: https://bit.ly/2LiqQ3S.
Construct validity is concerned with the extent to which the object of study truly represents the theory behind the study [17]. As a limitation w.r.t. construct validity, we assumed that all the papers were indexed in the Scopus database properly. Scopus claims to be "the largest abstract and citation database of peer-reviewed literature". All the XP conference proceedings are indexed in Scopus and we fetched all the data from this database. However, the 2011 papers are not properly indexed, so papers for the year 2011 were fetched with a separate query and added to the research data manually.
Conclusion validity of a study deals with whether correct conclusions are reached through rigorous and repeatable treatments [17]. Throughout the paper, the discussions and conclusions are based on actual quantitative measures and statistics from the extracted data. The approach we used to identify and map the top papers should ensure that the results of any replications of this study will not deviate substantially from our results.
External validity is concerned with the extent to which the results of this secondary study can be generalized [17]. The results of this study are not meant to be generalized to the whole SE field or outside SE. However, we believe that, given the rigor of the approach we used to identify top cited papers and emerging hot topics, the results highlight the citation landscape of the top XP conference papers in the SE area.

V. CONCLUSIONS AND FUTURE WORK
This is the first citation and topic analysis study on XP conference papers from 2002 to 2018. The paper identifies and classifies the highly cited papers, topic trends, and the top individuals and institutes who have published significantly in the XP conference.
The trend of the papers shows that the XP conference has received interest from both the academic community and industry. The papers highlight that much of the research is stirred by practices emerging in industry. Overall, 62% of the XP conference papers received at least one citation, which is a sign of good visibility and relevance of the published papers. However, about 38% of the XP papers so far have received no citations at all. This raises concerns and questions such as: what are the reasons for the large ratio of non-cited XP conference papers? Does this have anything to do with paper or venue quality? Or, is it about the topics of the papers, the indexed keywords, or the keywords provided by the author(s)? The data, which we make publicly available, can be used to conduct various analyses (e.g., characteristics of highly cited papers) on XP conference papers.
The analysis shows that XP community interest has been moving away from "Process Simulation", "Education" and "Coaching & Experimenting" related topics to more practice and process oriented topics. According to the trend analysis, the hottest research topics, i.e., the topics gaining the most interest, are "Coordination", "Technical Debt", "Teamwork", "Startups" and "Agile Practices". The identified trends can help both researchers and practitioners to see which topics have more impact and to align their future research activities accordingly.
The study found an active core intellectual pool of authors along with their highly cited work. New researchers can start their journey from these papers and follow the listed active researchers to stay up to date about the latest trends in the Agile world. Additionally, the list of actively publishing institutes in the XP conference can help doctoral students to approach experts on a specific topic for further research and doctoral studies. We hope that this paper encourages further discussions in the software engineering community towards further analysis and formal characterization of the highly cited software engineering papers in general and specifically in the XP conference community. The important thing about citation count is that it is an "objective measure of the utility or impact of the scientific work" [1].
The following are among our future work directions:
• To replicate this analysis for other SE publication venues in order to conduct comparison between research venues and provide more depth to our analysis.
• To mine typical features of highly cited papers and to assess the extent to which a paper's inner quality, external features, and the reputation of the authors and journals contribute to the generation of highly cited papers in the future.
• To study the indexed keywords within a publication venue in more detail, e.g., by years, to see whether we could find trends from those, too.
Table V includes the 16 most active authors in the XP conference, each having at least 10 papers. Maurer F. has been the most active author compared to the other top contributors of the XP conference. There are four authors that have their most cited papers published in the 2010s (the publication year for the most cited paper in parentheses), namely Abrahamsson P. (2015), Wang X. (2015), Concas G. (2012) and Bosch J. (2012); the rest of those most cited papers have been available for ten years or more. Interestingly, in a study "Institutions, scholars and contributions on agile software development (2001-2012)" by Chuang et al. [13], the list of the 18 most active authors included four of the 20 most active authors in this study, namely Abrahamsson P., Dingsøyr T., Moe, N.B. and Sharp H. However, the list of the most active authors in that study [13] also included Boehm, B., Robinson H., Williams L., Dingsøyr T., Moe, N.B. and Sharp H., who were among the authors of the top 20 most cited papers in this study.
in agile methods" by Lindvall et al. (2002) and "Towards a framework for integrating agile development and user-centred design" by Chamberlain et al. (2006), were published over ten years ago and have received the most citations since 2015. Chamberlain et al. (2006) had only a few citations right after its publication. From 2010 until 2015, the paper received attention from both industry and academia in various fields of science, e.g., Computer Science, Mathematics, Decision Science, Business, Management and Accounting, Social Sciences or Psychology. In 2017, Chamberlain et al. (2006) received the most citations among the top 20 cited papers, and was the second most cited in 2018 (after Lindvall et al. 2002), at the time of the study.

TABLE IV
PROPORTION OF THE NUMBER OF THE AUTHORS PER YEAR

TABLE V
a Maximum number of citations for a single paper & publication year of that paper
b Percentage of the total number of citations (2920 for all publications)
c Number of times as first or second author in the publications
# Total number of publications

of specific investigator, and secondly, the appropriateness of such trends/counts can be questioned on scientific grounds. Rapid growth of citations for a paper may be a sign of a popular topic, or active author(s) building on their existing research, or both. Eight of the year-wise most cited papers are the same as reported in Table VI. Those papers have been available for the public for a long period of time, from years
The average number of citations for the top cited paper per year in Table VII is 26.6, which is less than the average for the top 20 most cited papers, 34, in Table VI. To compare the general interest in the published papers, we normalized the number of citations for years, see column C-Norm in Table VII. The values for normalized citations varied between 0.53 and 7.67. The highest number of normalized citations, 7.67, is for the paper "What do practitioners vary in using scrum" by Diebold et al. (2015), which received 23 citations in three years (ranked #8 in Table VII considering a

TABLE VIII
HOT AND COLD TOPICS, TERMS & NUMBER OF PAPERS FOR EACH TOPIC