An Empirical Study on Application of Word Embedding Techniques for Prediction of Software Defect Severity Level

The severity level of a software defect indicates the impact of the bug on the execution of the software and how quickly it needs to be addressed by the team. Development teams regularly analyze bug reports and prioritize defects. Manual prioritization based on experience can misjudge severity and delay the fixing of critical bugs, so it is essential to automate the assignment of an appropriate severity level from the bug report. This work aims to develop defect severity level prediction models that assign a severity level to a defect based on its bug report. Seven different word embedding techniques are applied to the defect description to represent each word not just as a number but as a vector in n-dimensional space, thereby reducing the number of features. The predictive ability of the developed models depends on the vectors extracted from the text, as these vectors serve as the input to the prediction models. Further, three feature selection techniques are applied to find the right set of relevant vectors. The effectiveness of the word embedding techniques and the different sets of vectors is evaluated using eleven different classification techniques together with the Synthetic Minority Oversampling Technique (SMOTE) to overcome the class imbalance problem. The experimental results show that word embedding, feature selection techniques, and SMOTE enable prediction of the severity level of defects in a software system.


I. INTRODUCTION
Applying data mining techniques to software repositories, e.g., for software fault prediction, maintainability prediction, version control systems, source code analysis, and bug archives, is an emerging field that has received significant research interest in recent times. Researchers have proposed many tools and methods using machine learning techniques to assist practitioners in decision making and to automate software engineering tasks [1] [2] [3] [4]. However, Forrest et al. observed that finding and fixing defects in software is a time-consuming and expensive process. They found that the median time to repair bugs is 190 days for ArgoUML and 200 days for PostgreSQL. They also observed that more than 50% of all fixed bugs in Mozilla took more than 29 days [5] [6]. Therefore, it is essential to reduce the time and cost of the bug fixing process and to improve the quality of the software system. Defect severity level prediction has emerged as a novel research field for the effective allocation of resources and planning to fix defects based on their severity level [3]. These models help to find the severity level of defects, which in turn indicates the effect of the defects on the software. Defect severity level prediction models are designed based on features extracted from the defect description. Recent research has used different data mining techniques to extract numerical features from defect descriptions and machine learning techniques to predict the severity level of defects. However, there are three main technical challenges in building defect severity level prediction models that predict the proper severity level of defects from the defect description.
• Word Embedding: Defect severity level prediction models are often developed from the unstructured description of defects, and the unstructured nature of the data poses intrinsic challenges. If numerical features can be assigned using text mining techniques and used as input for model development, they can be utilized to predict the severity level of future defects. In this work, seven different word embedding techniques, namely the Continuous Bag of Words model (CBOW), Skip-gram (SKG), Global Vectors for Word Representation (GLOVE), Google News word-to-vector (w2v), fastText (FST), Bidirectional Encoder Representations from Transformers (BERT), and the generative pre-training model (GPT), have been applied to bug reports to represent each word not just as a number but as a vector in n-dimensional space. These techniques provide similar representations for similar words and also yield a small number of features compared to the size of the vocabulary. We removed stop-words, spaces, and bad symbols before applying these techniques. The predictive ability of these techniques is compared with the frequently used term frequency-inverse document frequency (TFIDF).
• High-Dimensional Feature Data: The predictive ability of defect severity level prediction models also depends on the features that are considered as the input of the models. Researchers have concluded that high-dimensional data containing redundant and irrelevant features negatively affect the performance of defect severity level prediction models [2] [3] [1]. The huge number of features that arises in text analysis poses intrinsic challenges for developing models that predict the proper severity level of defects from the defect description. In this study, we have used three different feature selection techniques to remove irrelevant features and select the right sets of relevant features.
• Imbalanced Data: The last challenge in building defect severity level prediction models is that the data used to build them are imbalanced. A dataset is balanced when the samples of the dependent (output) variable are approximately evenly distributed across its different values [7] [8]. In this study, the considered datasets do not possess an equal number of defects at each severity level. Hence, the Synthetic Minority Oversampling Technique (SMOTE) is applied to each dataset in order to obtain balanced data.
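The text cleanup mentioned in the first challenge (removal of stop-words, spaces, and bad symbols before embedding) can be sketched as follows. This is a minimal illustration only; the stop-word list and the example sentence are hypothetical, as the paper does not specify the exact list used.

```python
import re

# A tiny stop-word list for illustration; the actual list used in the
# study is not specified.
STOP_WORDS = {"the", "is", "a", "an", "on", "in", "of", "to", "and", "when"}

def clean_description(text: str) -> list[str]:
    """Lower-case, strip bad symbols, collapse extra spaces, drop stop-words."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # remove bad symbols
    tokens = text.split()                      # split() also collapses spaces
    return [t for t in tokens if t not in STOP_WORDS]

tokens = clean_description("NullPointerException is thrown when the editor closes!")
```

The resulting token list is what a word embedding model would consume in place of the raw defect description.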
Prioritization of defects based on the severity level computed from bug reports is a problem encountered by software practitioners, and the study presented in this work is motivated by the need to develop defect severity level prediction models using features extracted from bug reports with the help of word embedding techniques. This study aims to find the best word embedding technique by comparing the predictive ability of models developed using seven different word embedding techniques. It further investigates the application of feature selection techniques, data sampling techniques, and eleven different classification techniques for predicting the severity level of defects.

II. RELATED WORK
Software researchers have used different methods in the past to extract features from bug reports and used these features as input for developing models. Menzies and Marcus used various text mining concepts to extract features from bug reports [9]. They proposed an automated method called SEVERIS and validated it using defect reports from NASA's Project and Issue Tracking System (PITS). The proposed models help to predict the proper severity level of defects from the defect description. Rajni Jindal et al. did similar work, extracting features from defect descriptions using term frequency-inverse document frequency (TFIDF) [10]. They used a radial basis function network to develop defect severity prediction models and found that the proposed methods have a high ability to predict the severity levels of defects. Sari and Siahaan followed a similar approach to develop models that predict the severity level of defects from the defect description [11]. They applied InfoGain to the extracted features to find relevant features for model development and then used a support vector machine to develop defect severity prediction models.
In 2011, David Lo and his team analyzed the performance of models at three different levels of severity: low, medium, and high. They found that an artificial neural network (ANN) was among the best methods; however, the predictions were less accurate for high-severity faults. In 2012, Sharma et al. [12] proposed a priority prediction method using SVM, naive Bayes, KNN, and a neural network. It predicted the priority of newly arrived bug reports, and the accuracy of almost all techniques (except NB) was less than 70% for the Eclipse and Open Office projects. In 2014, Gayathri and Sudha developed an enhanced multilayer perceptron neural network [13] and performed a comparative analysis of defect proneness prediction models using datasets of different metrics from the NASA MDP (Metrics Data Program). In 2017, Gupta and Saxena developed a model for predicting the existence of bugs in a class [14]: the object-oriented Software Bug Prediction System (SBPS), trained using the Promise Software Engineering Repository. The logistic regression classifier provided the best accuracy, and the average accuracy of the model was found to be 76.27%.
In the context of software severity level prediction, most researchers have used count vectorization and TFIDF to extract numerical features from bug reports. These techniques are based on the bag-of-words concept and therefore cannot capture the position of vocabulary in sentences. They also do not play well with many machine learning models because of the high-dimensional features they produce. In this work, we instead use seven different word embedding techniques that represent each word not just as a number but as a vector in n-dimensional space, providing similar representations for similar words. The effectiveness of these word embedding techniques is evaluated using eleven different classification techniques together with the Synthetic Minority Oversampling Technique (SMOTE) to overcome the class imbalance problem.
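The dimensionality problem of bag-of-words representations described above can be illustrated with a minimal count-vectorization sketch (the three toy bug descriptions are hypothetical): the feature dimension equals the vocabulary size and therefore grows with the corpus, whereas an embedding maps every document into a fixed n-dimensional space.

```python
from collections import Counter

docs = [
    "editor crashes on save",
    "crash when saving large file",
    "ui freezes on startup",
]

# Count-vectorization: one feature per distinct vocabulary term.
vocabulary = sorted({tok for d in docs for tok in d.split()})

def count_vector(doc: str) -> list[int]:
    counts = Counter(doc.split())
    return [counts[term] for term in vocabulary]

vectors = [count_vector(d) for d in docs]

# The feature dimension equals the vocabulary size and keeps growing
# with the corpus; note also that "crash", "crashes", and "saving"
# are unrelated features here, while word embeddings would place them
# close together in a fixed n-dimensional space.
dim = len(vocabulary)
```

Even this tiny three-document corpus already produces 12 features; a realistic bug-report corpus yields thousands, which motivates the embedding-based representation used in this work.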

III. STUDY DESIGN
This section presents the details of the various design settings used for this research.

A. Experimental Dataset
In this study, six different software datasets, referred to as CDT, JDT, PDE, Platform, Bugzilla, and Thunderbird, have been used to validate our proposed models.
These datasets have been collected from the msr2013-bug_dataset repository (available via http://2013.msrconf.org/). The Mining Software Repositories (MSR) conference conducts a challenge every year, providing software-related data and motivating participants to apply data mining techniques to find important patterns. The datasets are collections of bug reports, wherein each bug report contains the defect ID, the defect description, and the severity level of the defect. Table I shows the details of the datasets used in this study. As shown in Table I, the CDT bug report dataset consists of 2220 normal defects, 146 minor defects, 288 major defects, 42 trivial defects, 58 blocker defects, and 106 critical defects.

B. Training of Models from Imbalanced Data Set:
After analyzing the experimental data shown in Table I, it is quite evident that the considered datasets suffer from the class imbalance problem, i.e., the number of samples in each class is not the same. Therefore, balancing the data is required before applying any classification technique [15]. This step helps to improve the predictive ability of the developed software defect severity level prediction models [16] [17]. In this study, we have applied the Synthetic Minority Oversampling Technique (SMOTE) to each dataset in order to obtain balanced data. SMOTE has been identified by different researchers as a very popular technique that helps to improve the predictive ability of models.
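The core of SMOTE is interpolation between a minority-class sample and one of its k nearest minority neighbours. The sketch below shows only that interpolation step on hypothetical two-dimensional feature vectors; a real study would use a library implementation (e.g., imbalanced-learn) that handles k-nearest-neighbour search and per-class sampling ratios.

```python
import random

def smote(minority: list[list[float]], n_new: int, k: int = 2,
          seed: int = 0) -> list[list[float]]:
    """Generate n_new synthetic minority samples: pick a sample x, pick
    one of its k nearest minority neighbours, and interpolate between
    them at a random point on the connecting segment."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        others = [m for m in minority if m is not x]
        # nearest neighbours of x by squared Euclidean distance
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)))
        neighbour = rng.choice(others[:k])
        u = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + u * (b - a) for a, b in zip(x, neighbour)])
    return synthetic

# Toy minority class ('blocker') with only two feature vectors.
blocker = [[1.0, 1.0], [2.0, 2.0]]
new_samples = smote(blocker, n_new=4)
balanced_minority = blocker + new_samples
```

Because each synthetic sample lies on a segment between two real minority samples, SMOTE enlarges the minority class without simply duplicating existing points.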

C. Word Embedding:
The software bug reports consist of the defect ID and the corresponding defect description. In this work, seven different word embedding techniques, including the Continuous Bag of Words model (CBOW), Skip-gram (SKG), Global Vectors for Word Representation (GLOVE), Google News word-to-vector (w2v), fastText (FST), Bidirectional Encoder Representations from Transformers (BERT), and the generative pre-training model (GPT), have been applied to the defect descriptions extracted from bug reports. We applied these techniques to represent each word not just as a number but as a vector in n-dimensional space. These vectors are used as input to develop models for assigning appropriate severity levels to the defects present in the bug reports. We removed stop-words, bad symbols, and spaces before applying word embedding. We also compared the predictive ability of these techniques with term frequency-inverse document frequency (TFIDF).
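One common way to turn per-word vectors into a single fixed-length input for a classifier is to average the vectors of the words in a defect description. The sketch below uses toy 3-dimensional vectors; the paper does not state its aggregation method, and a real pipeline would load pretrained GLOVE/w2v/fastText vectors (typically 100-300 dimensions).

```python
# Toy 3-dimensional word vectors standing in for a pretrained model.
word_vectors = {
    "editor":  [0.9, 0.1, 0.0],
    "crashes": [0.0, 0.8, 0.2],
    "save":    [0.1, 0.1, 0.9],
}
DIM = 3

def document_vector(tokens: list[str]) -> list[float]:
    """Average the vectors of known tokens; zero vector if none are known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * DIM
    return [sum(col) / len(known) for col in zip(*known)]

# Out-of-vocabulary tokens ("on") are simply skipped.
vec = document_vector(["editor", "crashes", "on", "save"])
```

Whatever the corpus size, every defect description maps to the same DIM-dimensional space, which is the dimensionality advantage over bag-of-words claimed in the text.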

D. Feature Selection Techniques
After computing vectors for the defect descriptions, we use these vectors as the input of the models. Since these n-dimensional vectors are the model input, the performance of the models also depends on the selection of important feature vectors. In this study, we have used three different feature selection techniques, i.e., significant sets of features using the rank-sum test, uncorrelated sets of features using cross-correlation analysis, and principal component analysis, to remove irrelevant features and select the right sets of relevant features. We have also compared the predictive ability of models developed using the selected sets of features with that of models using the original features.
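Of the three techniques, the cross-correlation filter is the simplest to sketch: a feature is kept only if it is not highly correlated with a feature already kept. The greedy order and the 0.9 threshold below are assumptions for illustration; the paper does not specify them.

```python
import math

def pearson(xs: list[float], ys: list[float]) -> float:
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def drop_correlated(features: dict[str, list[float]],
                    threshold: float = 0.9) -> list[str]:
    """Greedy cross-correlation filter: keep a feature only if its
    |correlation| with every already-kept feature is below threshold."""
    kept: list[str] = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept

features = {
    "f1": [1.0, 2.0, 3.0, 4.0],
    "f2": [2.1, 4.0, 6.2, 8.1],   # nearly a multiple of f1 -> redundant
    "f3": [4.0, 1.0, 3.0, 2.0],
}
selected = drop_correlated(features)
```

Here f2 is dropped because it is almost perfectly correlated with f1, mirroring how the uncorrelated feature sets in the study are smaller than the full set while carrying similar information.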

E. Classification Technique:
The predictive ability of the different word embedding techniques, feature selection techniques, and SMOTE is evaluated using eleven of the most frequently used classifiers, including multinomial naive Bayes (MNB), Bernoulli naive Bayes (BNB), support vector machines with different kernels, and neural networks trained with the ADAM, LBFGS, and SGD algorithms.

IV. RESEARCH METHODOLOGY
In this work, we have applied seven different word embedding methods to extract features from bug reports and considered these features as input to develop models for predicting the proper severity level of defects from the defect description. These models are trained using eleven different classifiers and validated using 5-fold cross-validation. We have also used SMOTE to handle imbalanced data and three feature selection techniques to find the best combination of relevant features. A detailed overview of the proposed work is given in Figure 1. As Figure 1 shows, the proposed framework is a multi-step process consisting of feature extraction from text data using word embedding, handling the class imbalance problem using SMOTE, removal of irrelevant features, and finally the development of prediction models using eleven different classification techniques.
First, the bug reports for a software project are collected from the Bugzilla bug tracking system; each report contains the unique ID of the defect, the description of the defect, and the associated severity level. Next, we use seven different word embeddings to find numerical representations of the defect descriptions. We then apply SMOTE to handle the class imbalance problem, because the considered datasets are not evenly distributed; the performance of models trained on balanced data is also compared with that of models developed on the original data. After balancing the data, three different feature selection techniques, i.e., significant features using the rank-sum test, cross-correlation analysis, and principal component analysis, are used to remove irrelevant features and select the right sets of relevant features. Finally, eleven different classifiers are used to develop models that predict the proper severity level of defects from the defect description. The performance of these models is computed and compared using AUC, F-Measure, and accuracy.
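The evaluation protocol in the last step, 5-fold cross-validation with per-fold scoring, can be sketched as follows. The labels and the majority-class baseline classifier are hypothetical stand-ins for the study's severity labels and its eleven real classifiers.

```python
import random

def five_fold_indices(n: int, seed: int = 0) -> list[list[int]]:
    """Shuffle sample indices and deal them into 5 roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::5] for i in range(5)]

def majority_class(labels: list[str]) -> str:
    return max(set(labels), key=labels.count)

# Toy severity labels standing in for a real bug-report dataset.
labels = ["normal"] * 12 + ["major"] * 5 + ["critical"] * 3

accuracies = []
for fold in five_fold_indices(len(labels)):
    test = [labels[i] for i in fold]
    train = [labels[i] for i in range(len(labels)) if i not in fold]
    pred = majority_class(train)  # trivial baseline "classifier"
    accuracies.append(sum(p == pred for p in test) / len(test))

mean_acc = sum(accuracies) / len(accuracies)
```

Each sample is tested exactly once, and the reported score is the mean over the five folds; the baseline's accuracy here simply equals the majority-class share (12/20 = 0.6), which also shows why imbalanced data inflates accuracy and motivates AUC and F-Measure as additional metrics.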

V. EMPIRICAL RESULTS AND ANALYSIS
In this work, we have applied eight different word embeddings (the seven word embedding techniques plus TFIDF), one sampling technique, three feature selection techniques, and eleven classification techniques to develop models that predict the proper severity level of defects from the defect description. Each word embedding is applied to the considered datasets mentioned in Table I. The effectiveness of these word embedding techniques is evaluated using the eleven most frequently used classifiers. Therefore, a total of 4224 (6 datasets * 8 word embeddings * (1 original data + 1 SMOTE data) * (3 feature selections + 1 all features) * 11 classification techniques) distinct prediction models are built in this study. The predictive ability of these trained models is evaluated in terms of AUC, F-Measure, and accuracy, and the models are validated with the help of 5-fold cross-validation. From the results in Table II, it can be inferred that: • The high values of AUC confirm that the developed models have the ability to predict the proper severity level of defects from the defect description.
• The models developed using a support vector machine with polynomial kernel have better predictive ability as compared to other classifiers.
• The models trained using a neural network with the ADAM training algorithm (NNADAM) have better predictive ability than those using the LBFGS and SGD training algorithms.
• The models trained on data balanced using SMOTE have better predictive ability than those trained on the original data.
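The total model count stated above follows directly from the experimental grid; as a quick arithmetic check:

```python
datasets = 6
embeddings = 8      # seven word embedding techniques plus TFIDF
samplings = 2       # original data and SMOTE-balanced data
feature_sets = 4    # three selection techniques plus all features
classifiers = 11

total_models = datasets * embeddings * samplings * feature_sets * classifiers
```

The product is 4224 distinct prediction models, matching the figure reported in the text.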

VI. COMPARATIVE ANALYSIS
In this section, we analyze and compare the performance of the models developed using different word embeddings, classifiers, sampling techniques, and sets of features. In this paper, we use descriptive statistics, box-plots, and significance tests to compare the developed severity level prediction models.

A. Word Embedding
The predictive ability of the defect severity level prediction models developed using different word embeddings is computed with the help of AUC, F-Measure, and accuracy, and compared using descriptive statistics, box-plots, and significance tests. In this study, seven different word embedding techniques, namely the Continuous Bag of Words model (CBOW), Skip-gram (SKG), Global Vectors for Word Representation (GLOVE), Google News word-to-vector (w2v), fastText (FST), BERT, and the generative pre-training model (GPT), have been used to compute numerical vectors from the defect reports. Comparison of Word Embedding: box-plots: Figure 2 provides the performance values, i.e., AUC, F-Measure, and accuracy, of the different word embeddings in terms of box-plot diagrams and descriptive statistics. It is clear from Figure 2 that the models developed using word vectors computed with GLOVE and w2v have better predictive ability for assigning the appropriate severity level to defects than the other models. The models developed using w2v achieve an average AUC of 0.70, a maximum AUC of 0.99, and a third-quartile (Q3) AUC of 0.87, i.e., 25% of the models developed using w2v have an AUC of at least 0.87. The models developed using SKG, however, have lower predictive ability than the other techniques.
Comparison of Word Embedding: Significant Test: In this study, the Wilcoxon signed-rank test is also applied to the AUC, F-Measure, and accuracy values to statistically compare the ability of models developed using different word embeddings to predict the appropriate severity level. The objective of this test is to find whether the models developed using different word embeddings differ significantly. The test uses the p-value to accept or reject the considered null hypothesis. The null hypothesis considered in this paper is "the defect severity level prediction models developed by considering word vectors computed using different word embedding techniques as input are significantly the same". The pairwise results are depicted in Table III. For simplicity, we use only two numbers to represent the results: 0 means the hypothesis is accepted (the models are significantly the same) and 1 means the hypothesis is rejected (the models are significantly different). According to the information in Table III, the models developed by considering word vectors from different word embeddings as input are significantly different in most cases.

[Table III: pairwise Wilcoxon signed-rank results (0 = significantly same, 1 = significantly different) for TFIDF, CBOW, SKG, GLOVE, W2V, FST, BERT, and GPT.]

B. Data Sampling

The predictive ability of the defect severity level prediction models developed using the original data and the SMOTE-sampled data is computed using AUC, F-Measure, and accuracy, and compared using descriptive statistics, box-plots, and significance tests. Comparison of Original Data and SMOTE: box-plots: Figure 4 provides the performance values, i.e., AUC, F-Measure, and accuracy, of the models developed using the original data and the SMOTE-sampled data in terms of box-plot diagrams and descriptive statistics. The information in Figure 4 demonstrates that the SMOTE data sampling technique plays an important role in improving the predictive ability of the defect severity level prediction models.
The models developed using SMOTE-sampled data achieve an average AUC of 0.75, a maximum AUC of 0.99, and a third-quartile (Q3) AUC of 0.86, i.e., 25% of the models developed using SMOTE-sampled data have an AUC of at least 0.86.

Comparison of Original Data and SMOTE: Significant Test:
In this study, the Wilcoxon signed-rank test is also applied to the AUC, F-Measure, and accuracy values to statistically compare the ability of models developed using the original data and the SMOTE-sampled data to predict the appropriate severity level. The objective of this test is to find whether the models developed using sampled data show a significant improvement. The null hypothesis considered in this paper is "the defect severity level prediction models trained using sampled data show no significant improvement". The null hypothesis is only accepted if the p-value obtained using the Wilcoxon signed-rank test is greater than 0.05. In this work, the p-value comparing models trained on sampled data and on original data is less than 0.05, i.e., the hypothesis is rejected. Hence, the models trained using sampled data show a significant improvement in predicting defect severity levels.
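The test above can be sketched end to end. The implementation below uses the normal approximation to the signed-rank distribution (library versions handle zeros and ties more carefully), and the paired per-dataset AUC values are hypothetical, chosen only to show a case where SMOTE consistently improves AUC and the null hypothesis is rejected.

```python
import math

def wilcoxon_signed_rank_p(x: list[float], y: list[float]) -> float:
    """Two-sided Wilcoxon signed-rank p-value via the normal approximation."""
    diffs = [a - b for a, b in zip(x, y) if a != b]   # drop zero differences
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:                                      # average ranks of ties
        j = i
        while j < n and abs(diffs[order[j]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j):
            ranks[order[k]] = (i + 1 + j) / 2
        i = j
    w_plus = sum(ranks[i] for i in range(n) if diffs[i] > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    # two-sided p-value from the standard normal CDF
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Hypothetical paired AUCs per experimental configuration:
# SMOTE-balanced vs original data.
auc_smote = [0.78, 0.81, 0.75, 0.80, 0.77, 0.83, 0.79, 0.76]
auc_orig = [0.70, 0.72, 0.71, 0.69, 0.74, 0.73, 0.68, 0.75]
p_value = wilcoxon_signed_rank_p(auc_smote, auc_orig)
reject_null = p_value < 0.05
```

Since every SMOTE AUC exceeds its paired original AUC, the p-value falls below 0.05 and the null hypothesis of "no significant improvement" is rejected, mirroring the decision rule described in the text.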

C. Feature Selection
In this study, we have used three different feature selection techniques, i.e., significant sets of features using the rank-sum test, uncorrelated sets of features using cross-correlation analysis, and principal component analysis, to remove irrelevant features and select the right sets of relevant features. We have also validated the performance of the models developed using the selected sets of features against models using all features, based on AUC, F-Measure, and accuracy, and compared them with the help of descriptive statistics, box-plots, and significance tests. Comparison of Different Sets of Features: box-plots: Figure 3 provides the performance values, i.e., AUC, F-Measure, and accuracy, of the models trained using the selected sets of features and all features. We can see that the models developed using CCRA (cross-correlation analysis) and AF (all features) have slightly better performance than the other techniques. The models developed using CCRA achieve an average AUC of 0.65, a maximum AUC of 0.98, and a third-quartile (Q3) AUC of 0.78, i.e., 25% of the models developed using CCRA have an AUC of at least 0.78. We can also observe that the models developed using AF have similar performance, but use more features than the CCRA feature sets.

Comparison of Different Sets of Features: Significant Test: In this study, the Wilcoxon signed-rank test is also applied to the AUC, F-Measure, and accuracy values to statistically compare the ability of models developed by considering different sets of features as input to predict the appropriate severity level. The objective of this test is to find whether the performance of the models depends on the input sets of features. The null hypothesis considered in this paper is "the defect severity level prediction models developed by considering different sets of features as input are significantly the same". The null hypothesis is only accepted if the p-value obtained using the Wilcoxon signed-rank test is greater than 0.05. The results of the Wilcoxon signed-rank test are depicted in Table IV. We can see that the models developed using all features, significant sets of features, and uncorrelated sets of features are significantly the same.
Comparison of Classification Techniques: Significant Test: In this study, the Wilcoxon signed-rank test is also applied to the AUC, F-Measure, and accuracy values to statistically compare the ability of models developed using different classifiers to predict the appropriate severity level. The objective of this test is to find whether the models trained using different classification techniques differ significantly. The null hypothesis considered in this paper is "the defect severity level prediction models trained using different classifiers are significantly the same". The null hypothesis is only accepted if the p-value obtained using the Wilcoxon signed-rank test is greater than 0.05. The results of the Wilcoxon signed-rank test on different pairs of classifiers are depicted in Table V. For simplicity, we use only two numbers to represent the results: 0 means the hypothesis is accepted (the models are significantly the same) and 1 means the hypothesis is rejected (the models are significantly different). Comparing the values in Table V, we observe that the models trained using different classifiers are significantly different in most cases.

VII. CONCLUSION
In this paper, we build models to predict the proper severity level of defects using the defect description. Different from existing research, this work focuses on seven different word embedding methods that represent each word not just as a number but as a vector in n-dimensional space. The predictive ability of these methods is evaluated using three sets of features selected using feature selection techniques and eleven different classifiers with 5-fold cross-validation. We have also used the SMOTE technique to handle the class imbalance problem. Finally, the predictive ability of these models is computed and compared using AUC, F-Measure, and accuracy. Our main conclusions are the following: • The high values of AUC confirm that the models developed using word embedding on balanced data have the ability to predict the proper severity level of defects. • The models developed by considering word vectors computed using GLOVE and w2v have better predictive ability than the other models.
• The defect severity level prediction models developed using different word embedding methods are significantly different.
• The models trained on sampled data have significant improvement in predicting defect severity levels.
• The models developed using significant, uncorrelated features have a better ability to predict severity levels than those using all features.
• The models developed using SVM with polynomial kernel achieve significantly better performance as compared to other techniques.
In this study, the developed models are trained using the most frequently used classifiers. Future work can extend this to deep-learning approaches to achieve higher accuracy in software severity level prediction.

VIII. ACKNOWLEDGEMENTS
This research is funded by TestAIng Solutions Pvt. Ltd.