Handling of Categorical Data in Software Development Effort Estimation: A Systematic Mapping Study

Producing reliable and accurate estimates of software effort remains a difficult task in software project management, especially at the early stages of the software life cycle, when the available information is more categorical than numerical. In this paper, we report a systematic mapping study of papers dealing with categorical data in software development effort estimation (SDEE). In total, 27 papers published between 1997 and January 2019 were identified. The selected studies were analyzed and classified according to eight criteria: publication channel, year of publication, research approach, contribution type, SDEE technique, technique used to handle categorical data, type of categorical data, and datasets used. The results show that most of the selected papers investigate the use of both nominal and ordinal data. Furthermore, Euclidean distance, fuzzy logic, and fuzzy clustering were the techniques most often used to handle categorical data in analogy-based estimation. In regression-based estimation, most papers employed ANOVA and the combination of categories.


I. INTRODUCTION
The competitiveness of software companies relies on the successful management of their software projects. One of the most important and difficult tasks in software project management is accurately estimating the effort needed to develop a software product. This task is known as software development effort estimation (SDEE). Delivering reliable and accurate estimates remains a challenging objective for software companies due to several factors, including the human factor, the variety of software projects, the inherent uncertainty of feature measurement, and the diversity of development environments [1]. In an attempt to obtain accurate predictions, various SDEE techniques have been proposed. These techniques fall into three main types [2]: parametric models [3], [4], machine learning (ML) models [5]-[10], and expert judgment [11].
SDEE techniques build their predictions on a set of attributes (also called features or cost drivers) that characterize software projects [12], [13]. Most of these techniques derive their predictions from numerical attributes. However, the information available at the early stages of the software life cycle is more categorical than numerical. Furthermore, the datasets used to build and validate SDEE models contain a large number of categorical attributes. For example, in the COCOMO'81 dataset [14], 15 attributes out of 17 are measured on a scale composed of six categories: very low, low, nominal, high, very high, and extra high. Another example is the International Software Benchmarking Standards Group (ISBSG) dataset [15], in which numerous attributes such as programming language, application type, and development platform are measured on a nominal scale.
Categorical attributes may be measured on a nominal or an ordinal scale. The nominal scale type allows the classification of entities into different categories [16]; for example, the primary programming language may be classified into five categories: Visual Basic, C, Cobol, Visual C++, and Oracle. Unlike the nominal scale type, in which there is no order between the categories, the ordinal scale type enables ranking the categories in a specific order [16]. An example of an ordinal attribute is application experience, which may be measured as 'low', 'nominal', 'high', or 'very high'. To deal with attributes of this kind, different approaches have been used in the SDEE literature [17]-[21].
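The difference between the two scale types can be illustrated with a minimal Python sketch. The category lists come from the examples above; the function names `one_hot` and `ordinal_rank` are hypothetical, and the two encodings (indicator variables for nominal values, rank positions for ordinal values) are only two common options, not a method prescribed by this study.

```python
from typing import List

# Category lists taken from the examples in the text.
LANGUAGES: List[str] = ["Visual Basic", "C", "Cobol", "Visual C++", "Oracle"]
EXPERIENCE_LEVELS: List[str] = ["low", "nominal", "high", "very high"]


def one_hot(value: str, categories: List[str]) -> List[int]:
    """Encode a nominal value as indicator (dummy) variables; no order is implied."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]


def ordinal_rank(value: str, ordered_levels: List[str]) -> int:
    """Encode an ordinal value by its 1-based position in the ordered scale."""
    return ordered_levels.index(value) + 1


print(one_hot("Cobol", LANGUAGES))              # [0, 0, 1, 0, 0]
print(ordinal_rank("high", EXPERIENCE_LEVELS))  # 3
```

The rank encoding preserves the order of the levels but also implies equal spacing between them, which is an assumption rather than a property of an ordinal scale.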
In this paper, a Systematic Mapping Study (SMS) is performed to investigate the use of categorical data to estimate software development effort. As pointed out in [22], a systematic map is a method that concentrates on building a classification scheme and categorizing primary research studies in a specific domain with respect to a set of defined categories. Thus, it provides a common starting point for many researchers [23]. To the best of the authors' knowledge, no systematic mapping study has been carried out with a focus on how to handle categorical data in SDEE.
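The remainder of this paper is organized as follows. Section II describes the research methodology. Section III reports the results of the mapping study. Section IV presents the implications for research and practice. Conclusions and future work are presented in Section V.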

II. RESEARCH METHODOLOGY
In this study, the systematic mapping process suggested by Kitchenham and Charters [24] is used. According to Kitchenham, a mapping study aims to identify the research trends related to a specific topic and to classify research works with respect to a set of defined criteria [22], [24]. The mapping process comprises the following five steps: (1) define the mapping questions, (2) conduct an exhaustive search for candidate papers, (3) select studies, (4) extract data, and (5) summarize data. Each of these steps is described next.

A. Mapping questions
Eight mapping questions (MQs) were formulated in this mapping study. Table I shows the MQs as well as their main motivations.

B. Search Strategy
The aim of this step is to find the relevant SDEE papers that address the MQs listed in Table I. To perform the search, four electronic databases were used: ACM Digital Library, IEEE Xplore, ScienceDirect, and Google Scholar. These libraries were chosen because they were used in previous systematic maps and reviews in SDEE to conduct the search for candidate papers [5], [25], [26]. All searches were restricted to studies published between 1997 and January 2019.

TABLE I (excerpt)
MQ7. Motivation: To identify the types of categorical data that are the most investigated in SDEE.
MQ8. What are the datasets used for validation? Motivation: To explore the datasets used in the selected papers as well as the percentage of categorical features used in the experiments.
To carry out the search in the four databases, a search string was defined. To do so, we derived the main terms from the MQs. Then, we identified all alternative spellings and synonyms of the main terms. The Boolean operators OR and AND were used to combine the terms [25], [26]. The final search string was formulated as follows: (software OR system OR application OR product OR project OR development) AND (effort OR cost) AND (estimat* OR predict* OR assess*) AND (categorical OR nominal OR ordinal OR "non-quantitative") AND (feature OR attribute OR data OR "cost driver").
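The construction of the search string can be sketched in Python as follows. The term groups are copied from the string above; the helper names `quote_if_needed` and `build_search_string` are hypothetical, and the quoting rule (quote terms containing a space or a hyphen) is an assumption inferred from the two quoted terms in the string.

```python
from typing import List

# Term groups copied from the search string in the text; each group holds
# a main term together with its synonyms and alternative spellings.
TERM_GROUPS: List[List[str]] = [
    ["software", "system", "application", "product", "project", "development"],
    ["effort", "cost"],
    ["estimat*", "predict*", "assess*"],
    ["categorical", "nominal", "ordinal", "non-quantitative"],
    ["feature", "attribute", "data", "cost driver"],
]


def quote_if_needed(term: str) -> str:
    """Quote terms containing a space or hyphen so they are searched as phrases."""
    return f'"{term}"' if (" " in term or "-" in term) else term


def build_search_string(groups: List[List[str]]) -> str:
    """Join the terms of each group with OR, then join the groups with AND."""
    clauses = ["(" + " OR ".join(quote_if_needed(t) for t in g) + ")" for g in groups]
    return " AND ".join(clauses)


SEARCH_STRING = build_search_string(TERM_GROUPS)
print(SEARCH_STRING)
```

Keeping the terms in groups makes it easy to add a synonym to one group without retyping the whole string. Individual databases may require their own query syntax, which this sketch does not cover.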
To ensure that no relevant paper was missed, we adopted a two-stage search process. In the first stage, we performed the search in the four electronic databases using the above search string to identify the set of candidate papers. In the second stage, we applied the inclusion and exclusion criteria to each of the candidate papers based on title, abstract, and keywords to decide on its relevance to our study. If necessary, the full paper was examined. The reference list of each relevant paper was then scanned to check whether an SDEE study with a focus on categorical data had been left out in the first stage.

C. Study Selection
The purpose of this step was to select the papers that are relevant to our SMS (i.e., papers that address the MQs). To achieve this, a set of inclusion and exclusion criteria was applied to each of the candidate papers by each of the authors of this study to decide whether it should be retained or discarded.
Inclusion criteria:
• Studies with a focus on how to handle categorical data to estimate software effort
• Studies in which a technique is proposed or extended and which enables software effort estimation using categorical data or a mixture of numerical and categorical data
• Studies comparing different techniques that handle categorical data
Exclusion criteria:
• SDEE studies in which categorical features are not handled or are discarded
• SDEE studies whose main objective is not to deal with categorical data and which use only transformation to dummy variables
• SDEE studies that fuzzify numerical inputs to get linguistic values without dealing with categorical inputs
• SDEE studies with a focus on missing categorical data
• Duplicate publications of the same paper (in this case, only the most complete study is included)
• Studies estimating maintenance or testing effort
Using the above criteria, the two researchers independently evaluated the candidate papers. Based on the title and abstract (and, if necessary, the full text), a researcher could categorize a candidate paper as "Include", "Exclude", or "Uncertain". A paper that was categorized as "Include" ("Exclude") by both researchers was retained (discarded); otherwise, the paper was discussed until an agreement was reached.
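The agreement rule applied to the two independent assessments can be written as a small decision function. This is only an illustrative sketch: the names `Vote` and `resolve` are hypothetical, and the outcome label "discuss" stands for the discussion step, whose result is decided by the researchers and not by code.

```python
from enum import Enum


class Vote(Enum):
    INCLUDE = "include"
    EXCLUDE = "exclude"
    UNCERTAIN = "uncertain"


def resolve(vote_a: Vote, vote_b: Vote) -> str:
    """Combine two independent assessments of one candidate paper."""
    if vote_a is Vote.INCLUDE and vote_b is Vote.INCLUDE:
        return "retained"
    if vote_a is Vote.EXCLUDE and vote_b is Vote.EXCLUDE:
        return "discarded"
    # Any disagreement or "uncertain" vote is settled by discussion.
    return "discuss"


print(resolve(Vote.INCLUDE, Vote.INCLUDE))    # retained
print(resolve(Vote.INCLUDE, Vote.UNCERTAIN))  # discuss
```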

D. Data Extraction Strategy and Synthesis Method
Each of the selected papers was examined by both authors to extract the data necessary to answer the mapping questions of Table I. To this end, a data extraction form was used and completed by both authors for each selected paper.
Table II shows the data extraction form used in our mapping study.
The extracted data were then synthesized and summarized with respect to each MQ. To achieve this, a narrative synthesis approach was used. We also used visualization charts such as pie charts and bubble plots to improve the presentation of the results and facilitate their interpretation.

III. RESULTS AND DISCUSSION
This section presents and discusses the results of our systematic mapping related to the questions of Table I.

A. Overview of the selected studies
The results of the selection process are shown in Fig. 1. As can be seen, 1226 candidate papers were retrieved by applying the search string described previously to the four electronic databases. Afterward, the inclusion and exclusion criteria were used to evaluate each of the candidate papers and decide whether it should be retained or discarded. The evaluation was based on the title, abstract, keywords, and full text of the candidate papers. This process resulted in 27 relevant papers. No additional relevant studies were identified by checking the reference lists of the selected studies.

B. Publication Channels (MQ1)
We identified two main publication channels in which the selected studies were published: journals and conferences. Specifically, among the 27 selected papers, 15 (55.56%) appeared in journals and 12 (44.44%) were presented at conferences. Tables III and IV show the publication sources of the papers identified in journals and conferences, respectively. The number of studies per publication source is given in the second column of each table. Three journals were identified with two or more papers dealing with categorical data in SDEE: Empirical Software Engineering, Information and Software Technology, and IEEE Transactions on Software Engineering. Only one conference was identified with two papers: the International Conference on Predictive Models in Software Engineering (PROMISE). Each of the remaining sources (journals and conferences) published a single SDEE study with a focus on categorical data.

C. Publication Trends (MQ2)
To get a global picture of the publication trends of SDEE papers dealing with categorical data, we analyzed the distribution of the selected studies over time. Fig. 2 shows the number of papers per year from 1997 to January 2019. As can be seen, the publication of SDEE papers with a focus on categorical data is characterized by discontinuity. In fact, no paper was identified in some years (1998, 2000, 2003, 2005, 2014, 2017, 2018). Handling categorical data in SDEE gained research interest in the period 2008-2013 (59% of the selected papers). Outside this period, few studies were identified (no more than one paper per year, except for 2001).

D. Research approaches (MQ3) and contribution types (MQ4)
As shown in Fig. 3, two main research approaches were used in the selected papers: solution proposal and history-based evaluation. The solution proposal approach was adopted by 85% (23 out of 27) of the selected studies. Among them, 91% (21 out of 23) proposed new techniques, 4% (1 out of 23) proposed a new framework, and 4% (1 out of 23) investigated the use of a new metric. Note that all selected studies were included in the history-based evaluation approach. Among them, 15% (4 out of 27) performed a comparison of various SDEE techniques using datasets with mixed numerical and categorical data. The remaining papers used historical datasets to assess the performance of their proposed approaches.
Fig. 3 Research approaches used in the selected studies and their contribution type

E. SDEE Techniques investigating categorical data (MQ5)
Various approaches were used in the selected papers to estimate software effort using a mixture of numerical and categorical data. Table V shows the techniques used as well as the number of studies in which they were applied. Case-based reasoning (CBR), regression (SR), fuzzy logic (FL), and classification and regression trees (CART) were the techniques most frequently used to investigate categorical data in software effort estimation. Most of these techniques were not used alone; they were combined with each other to improve prediction accuracy. Specifically, 59% (16 out of 27) of the selected papers used a combination of two or more techniques to predict software effort, whereas 41% (11 out of 27) employed a single technique.

F. Handling of categorical data in SDEE (MQ6)
To deal with categorical data, different techniques were applied depending on the data type (nominal or ordinal) as well as the SDEE technique in which they were used. Table VI shows how both nominal and ordinal data were handled in the selected SDEE studies. Note that some studies used the term 'categorical' without specifying the exact data type (nominal or ordinal). As shown in Table VI, in CBR, Euclidean distance is the most used metric to assess the similarity between two projects that are described by a mixture of numerical and categorical data [9], [27]-[33]. Fuzzy logic and fuzzy clustering techniques were also used in many CBR/DT works to deal with categorical data [10], [17], [20], [34], [35]. In regression, most papers employed one-way analysis of variance (ANOVA) and recoded categorical variables into new ones with fewer categories [2], [18], [30], [31], [36], [37]. Other studies employed classification and regression trees to handle categorical data [30], [31], [35], [38]-[40].
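To make the regression-side treatment concrete, the following Python sketch merges categories into fewer groups and computes the one-way ANOVA F statistic of effort across the resulting groups. It is a generic textbook formulation, not the procedure of any particular selected study; the function names, the effort values, and the merged label "C family" are all hypothetical.

```python
from typing import Dict, List


def combine_categories(groups: Dict[str, List[float]],
                       mapping: Dict[str, str]) -> Dict[str, List[float]]:
    """Recode categories into fewer ones; unmapped categories keep their name."""
    merged: Dict[str, List[float]] = {}
    for category, efforts in groups.items():
        merged.setdefault(mapping.get(category, category), []).extend(efforts)
    return merged


def one_way_anova_f(groups: Dict[str, List[float]]) -> float:
    """F statistic: between-group mean square over within-group mean square."""
    values = [v for g in groups.values() for v in g]
    n, k = len(values), len(groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two groups and more observations than groups")
    grand_mean = sum(values) / n
    means = {name: sum(g) / len(g) for name, g in groups.items()}
    ss_between = sum(len(g) * (means[name] - grand_mean) ** 2 for name, g in groups.items())
    ss_within = sum((v - means[name]) ** 2 for name, g in groups.items() for v in g)
    if ss_within == 0:
        raise ValueError("no within-group variance; F is undefined")
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical effort values (e.g., person-months) grouped by programming language.
EFFORT_BY_LANGUAGE = {"C": [10.0, 12.0, 14.0], "Visual C++": [11.0, 13.0, 15.0],
                      "Cobol": [20.0, 22.0, 24.0]}
MERGED = combine_categories(EFFORT_BY_LANGUAGE, {"C": "C family", "Visual C++": "C family"})
print(sorted(MERGED))           # ['C family', 'Cobol']
print(one_way_anova_f(MERGED))  # a large F suggests effort differs between the groups
```

Deciding whether an F value is significant additionally requires the F distribution with (k - 1, n - k) degrees of freedom, which is left out of this sketch.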
The above-mentioned techniques were applied to handle both nominal and ordinal data. Other techniques to deal with categorical data were identified depending on whether the data are measured on a nominal or an ordinal scale. Table VII shows how nominal data were handled in the selected papers. In regression, four techniques were identified: transformation to dummy variables, dataset segmentation, interaction, and use of a hierarchical linear model [30], [31], [36], [41], [42]. In CBR, the equality distance was used to assess the similarity between projects that are described by nominal features [1], [36]. Regarding ordinal data, in regression they were handled as if they were measured on an interval scale or were converted to numerical values [30], [43]. In CBR, they were treated as interval scaled or handled using Grow's formula [1], [36] (see Table VIII).
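A minimal Python sketch of a CBR-style distance over mixed attributes is given below. It combines three ideas named in the text: the equality distance for nominal attributes, the treatment of ordinal attributes as interval scaled, and a Euclidean combination across attributes. The range normalization, the attribute names, and the value ranges are assumptions made for illustration and are not taken from any of the selected studies.

```python
import math
from typing import Dict, List, Tuple, Union

Value = Union[float, str]

# Hypothetical attribute descriptions (not taken from any dataset in the study).
NUMERIC_RANGES: Dict[str, Tuple[float, float]] = {"size_fp": (50.0, 1050.0)}
ORDINAL_SCALES: Dict[str, List[str]] = {"experience": ["low", "nominal", "high", "very high"]}
NOMINAL_ATTRS = {"language"}


def attribute_distance(name: str, a: Value, b: Value) -> float:
    """Per-attribute distance in [0, 1], chosen according to the scale type."""
    if name in NOMINAL_ATTRS:
        # Equality distance: 0 if the categories match, 1 otherwise.
        return 0.0 if a == b else 1.0
    if name in ORDINAL_SCALES:
        # Ordinal treated as interval scaled: rank difference over the rank range.
        scale = ORDINAL_SCALES[name]
        return abs(scale.index(a) - scale.index(b)) / (len(scale) - 1)
    # Numerical: absolute difference normalized by the assumed value range.
    low, high = NUMERIC_RANGES[name]
    return abs(float(a) - float(b)) / (high - low)


def project_distance(p: Dict[str, Value], q: Dict[str, Value]) -> float:
    """Euclidean combination of the per-attribute distances."""
    return math.sqrt(sum(attribute_distance(k, p[k], q[k]) ** 2 for k in p))


P1 = {"size_fp": 250.0, "experience": "low", "language": "C"}
P2 = {"size_fp": 750.0, "experience": "high", "language": "Cobol"}
print(round(project_distance(P1, P2), 4))
```

Normalizing every attribute to [0, 1] keeps a single attribute with a wide range from dominating the distance; whether all attributes should carry equal weight is a separate design choice.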
It is worth noting that, when investigating the use of categorical data in the selected papers, we found that some CBR works used categorical data not only to measure the similarity between software projects using Euclidean distance but also (1) to adjust estimation by analogy, (2) to identify whether a categorical attribute is appropriate to yield predictions, or (3) for feature weighting (see Table IX).

G. Types of used categorical data (MQ7)
Fig. 4 shows the types of categorical data used in the selected papers. As can be seen, 59% (16 out of 27) of the selected studies dealt with both nominal and ordinal data, 7% (2 out of 27) dealt with only nominal data, and 4% (1 out of 27) dealt with only ordinal data. Among the selected studies, 30% (8 out of 27) did not specify the exact categorical data type handled in the paper. However, based on our knowledge and the datasets used in the experiments, we concluded that most of these papers dealt with both nominal and ordinal data types.

H. Datasets used (MQ8)
Several datasets were used in the selected papers to investigate the use of categorical data in software effort estimation. Table X shows the datasets used for validation as well as the number of studies in which they were used and the percentage of categorical data. The min, max, and mean columns show, respectively, the minimum, maximum, and mean percentage of categorical data used in the selected papers to conduct experiments. Note that different studies may select different categorical features for their experiments; therefore, the percentage of categorical data is not the same for all studies. Note also that, for some studies, it was not possible to extract the percentage of categorical features used in the experiments. As can be seen from Table X, 21 datasets were used in the selected papers. Among them, ISBSG, COCOMO, Desharnais, Kemerer, Albrecht, and Maxwell are the most used. In terms of the percentage of categorical data, COCOMO (93.52%) had the highest mean percentage, followed by Maxwell (88.83%) and Laturi (80.00%).
Although ISBSG is the most used dataset and contains numerous categorical features, the mean percentage of categorical data used in the selected papers was 49.13%. This is because some studies used only a few categorical features to conduct experiments. Also, one study [29] used only the numerical features of ISBSG. This study was included in our mapping study because the technique described in the paper may be applied to both numerical and categorical data. It is worth noting that some papers employed datasets with numerical and mixed data to show the efficiency of their techniques in dealing with both data types.
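The min, max, and mean columns described above can be computed as in the following Python sketch. The per-study feature counts are hypothetical, except that the first pair echoes the COCOMO'81 example from the introduction (15 of 17 attributes); `None` stands for a study in which the features used are not given.

```python
from statistics import mean
from typing import Dict, List, Optional, Tuple

# Hypothetical per-study feature counts for one dataset: (categorical, total).
# None marks a study that does not report the features it used.
STUDY_COUNTS: List[Optional[Tuple[int, int]]] = [(15, 17), (8, 16), None, (10, 10)]


def categorical_percentage(categorical: int, total: int) -> float:
    """Share of categorical features among the features used, in percent."""
    if total <= 0 or not 0 <= categorical <= total:
        raise ValueError("invalid feature counts")
    return 100.0 * categorical / total


def summarize(counts: List[Optional[Tuple[int, int]]]) -> Dict[str, float]:
    """Min, max, and mean percentage over the studies that report counts."""
    pcts = [categorical_percentage(c, t) for c, t in (x for x in counts if x is not None)]
    if not pcts:
        raise ValueError("no study reports its feature counts")
    return {"min": min(pcts), "max": max(pcts), "mean": mean(pcts)}


SUMMARY = summarize(STUDY_COUNTS)
print({k: round(v, 2) for k, v in SUMMARY.items()})
```

Because studies with unreported features are skipped, the mean describes only the studies for which the percentage could be extracted.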

IV. IMPLICATIONS FOR RESEARCH AND PRACTICE
This study aims to present an overview of how categorical data are handled in SDEE. Based on the findings of our SMS, some recommendations to SDEE researchers and practitioners are provided. Dealing with categorical data is an important issue in SDEE, especially at the early stages of the software life cycle, when most of the available attributes are categorical rather than numerical. This study found that the publication of SDEE papers with a focus on categorical data is characterized by discontinuity. This implies that the use of categorical data in SDEE needs further investigation.
No case study was identified in the selected papers. Therefore, researchers are encouraged to cooperate with practitioners in order to explore the use of categorical data to yield estimates in industry. We also recommend that researchers develop tools that enable software effort estimation using a mixture of numerical and categorical data, to encourage the use of categorical data by practitioners and researchers.
This study found that CBR, regression, and classification and regression trees are the techniques most frequently used to investigate categorical data in SDEE. It is therefore recommended to conduct further research using other SDEE techniques. Researchers are also encouraged to develop new techniques to handle categorical data instead of using traditional ones. Furthermore, previous studies revealed that ensemble techniques yield better results than single techniques [26], [45]-[47]. However, all selected papers used single SDEE techniques; no ensemble SDEE technique dealing with categorical data was identified. This implies that researchers should give more attention to the use of categorical data in ensemble techniques and investigate its impact on estimation accuracy.

V. CONCLUSION AND FUTURE WORK
In this paper, a systematic mapping study was carried out in order to identify and summarize the existing works on SDEE dealing with categorical data. A total of 27 relevant studies were identified and classified according to research approach, contribution type, SDEE technique, technique used to handle categorical data, types of categorical data, and datasets used. Research sources and publication trends were also identified and analyzed. Our findings are summarized as follows.
(MQ1): Dealing with categorical data has not been sufficiently investigated in SDEE. Journals were the most targeted publication channel, followed by conferences.
(MQ2): The publication of SDEE papers with a focus on categorical data is characterized by discontinuity. Dealing with categorical data in SDEE gained research interest in the period 2008-2013.
(MQ3): Solution proposal and history-based evaluation were the two main research approaches used in the selected papers.
(MQ4): Most of the selected papers focus on developing new techniques, especially to improve existing approaches.
(MQ5): Case-based reasoning, regression, fuzzy logic, and classification and regression trees were the techniques most frequently used to investigate categorical data in SDEE.
(MQ6): Euclidean distance, fuzzy logic, and fuzzy clustering were the techniques most often used to handle categorical data in CBR. In regression, most papers employed ANOVA and the combination of categories.
(MQ7): Most of the selected studies dealt with both nominal and ordinal data.
For future work, we will carry out a systematic literature review to analyze the use of categorical data in SDEE, taking into account the findings of this SMS.

Fig. 1 Results of selection process

Fig. 2 Publication trends of the selected studies

Fig. 4 Types of used categorical data

TABLE I. MAPPING QUESTIONS

TABLE II. DATA EXTRACTION FORM

TABLE III. PUBLICATION SOURCES OF JOURNAL PAPERS

TABLE IV. PUBLICATION SOURCES OF CONFERENCE PAPERS

FATIMA AZZAHRA AMAZAL, ALI IDRI: HANDLING OF CATEGORICAL DATA IN SOFTWARE DEVELOPMENT EFFORT

TABLE VI. CATEGORICAL (NOMINAL AND ORDINAL) DATA HANDLING

TABLE VII. NOMINAL DATA HANDLING

TABLE VIII. ORDINAL DATA HANDLING

TABLE IX. OTHER USES OF CATEGORICAL DATA

TABLE X. DATASETS USED IN THE SELECTED PAPERS
N: Not given in the paper