Software Sentiment Analysis Using a Deep-Learning Approach with Word-Embedding Techniques

Sentiment analysis in the software engineering community aims to make the development and maintenance of software a better experience by helping provide code and library suggestions, defect-related comments for source code, and similar guidance. Manually identifying the sentiment of comments is both error-prone and time-consuming. Automating sentiment analysis with machine learning models can benefit software professionals by giving them, at a glance, insight into other developers' feelings about software products, libraries, and development and maintenance tasks. This study aims to develop software sentiment prediction models based on comments by (1) identifying the best embedding techniques to represent the words of the comments not just as numbers but as vectors in an n-dimensional space, (2) finding the best sets of vectors using different feature selection techniques, (3) finding the best methods to handle the class-imbalanced nature of the data, and (4) finding the best deep-learning architecture for training the models. The developed models are validated using 5-fold cross-validation with four performance parameters, accuracy, AUC, recall, and precision, on three different datasets. The experimental findings show that models developed using word embeddings with feature selection and deep-learning classifiers on balanced data can significantly predict the underlying sentiments of textual comments.


I. INTRODUCTION
Sentiment analysis can be used to gather the opinions and feelings of consumers regarding social and political issues, brand loyalty, and more. It uses natural language processing and machine learning algorithms to draw out the mood, opinions, and feelings in textual data. The texts can be product reviews, posts on social media, messages on chat boards, answers on question-answering websites, commit messages by developers, etc. [1]. Software developers can use sentiment analysis to assist them in their development and maintenance activities: they can be well informed about whether a particular software product, technology, or tutorial is appropriate for their purposes. Sentiment analysis can distinguish feedback as either positive or negative, which helps in development decisions. Finding software professionals' sentiments and mindsets can be approached in two ways. The first and most straightforward way is to sit down face to face with the software developers, glean insights into their mindset, and evaluate their mood and feelings; unfortunately, this is a very time-intensive and tedious process. The other, preferred approach applies sentiment analysis to discern developers' mood and feelings from product reviews, feedback forms, commit messages, and so on, identifying their positive and negative sentiments. We have therefore worked to create a predictive model that leverages natural language processing techniques and machine learning algorithms to detect the exhibited sentiments, moods, and feelings effectively and efficiently. The datasets are obtained from users' app reviews, issues tracked and managed using JIRA, and users' comments and messages on the Stack Overflow platform [1].
For the machine to better understand and analyze the text to improve the predictive model's performance, we must represent words as vectors [2] [3].
Word embedding techniques do this by representing words as vectors in an n-dimensional space. This gives the words a numeric representation that can be used as input to machine learning models while preserving their syntactic and semantic integrity, so that words used in similar ways have similar vector representations. In this study, we use six word embedding techniques to vectorize the textual data: Term Frequency and Inverse Document Frequency (TF-IDF), Skip-Gram (SKG), Continuous Bag of Words (CBOW), Global Vectors for Word Representation (GLOVE), fastText (FST), and Google News Word to Vector (GW2V) [3]. After applying the word embeddings, we obtain a multitude of features, many of which will be ineffective in the predictive model. To obtain the subset of important features to use as input to the model, we apply six feature selection techniques, namely Principal Component Analysis, Gain Ratio Attribute Eval, Classifier Attribute Eval, Info Gain Attribute Eval, OneR Attribute Eval, and Analysis of Variance (ANOVA) [4]. Analyzing the data after feature selection, it is clear that it suffers from the class imbalance problem, which occurs when the number of samples in each class is not the same; if not corrected, this can degrade the predictive performance of the model. So, the Synthetic Minority Oversampling Technique (SMOTE) and Borderline-SMOTE are applied to balance the data. Finally, to compare and evaluate the techniques applied, we use eight different deep-learning classifiers, obtained by varying the number of hidden layers and dropout layers.
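As a small illustration of the first step, TF-IDF turns each comment into a sparse vector whose dimensions are vocabulary terms. The sketch below uses scikit-learn with toy comments; it is not the study's actual pipeline:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy developer comments (illustrative only)
comments = [
    "this library is great for testing",
    "the build fails and the docs are terrible",
    "great docs and great examples",
]

# Each comment becomes one row; each column is a vocabulary term
# weighted by term frequency times inverse document frequency.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(comments)

print(X.shape)  # (number of comments, vocabulary size)
print("great" in vectorizer.vocabulary_)
```

Words shared by several comments (such as "great" and "docs") receive lower IDF weight than words unique to one comment, which is exactly what makes TF-IDF useful for discriminating documents.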
The application of Deep Learning Classifiers to the models developed using different Word Embedding Techniques can help determine the models that can accurately and effectively predict the underlying sentiment in textual data, which can be convenient for a broad scope of Software Development and Maintenance activities. This study also aims to find the Word Embeddings, Feature Selection, and Data Sampling Techniques that provide the most optimal results. The remainder of the paper is laid out as follows: Section 2 presents a literature review on software sentiment analysis and various word embedding approaches. Section 3 describes the experimental dataset collection as well as the various machine learning algorithms used. The research methodology is described in Section 4 using an architecture framework. In Section 5, the results of the experiments, along with their analysis, are presented. Section 6 shows a comparison of models created using various word-embedding approaches, sets of features, and machine learning models. Finally, Section 7 summarizes the information provided and offers directions for further research.

II. RELATED WORK
There are many methods to acquire features from textual data. Rajni Jindal et al. used Term Frequency and Inverse Document Frequency (TF-IDF) to obtain features from defect descriptions. They used a radial basis function neural network to classify the defect reports and, based on tangible evidence, established that the model predicted high-severity defects with significant accuracy and efficiency [5]. Sari and Siahaan also leveraged TF-IDF to extract features from defect descriptions. They applied the InfoGain feature selection technique to obtain the set of relevant features and built severity prediction models with the assistance of a Support Vector Machine to predict the severity levels of defects [6]. Sentiment analysis of software engineering tasks has tremendous potential, but pre-trained models do not accurately predict sentiments in this domain, so Biswas et al. used a word embedding learned from Stack Overflow in an attempt to improve the performance of the predictive model. They compared the impact on the performance of sentiment analysis tools of a domain-specific word embedding and a generic word embedding trained on Google News, and concluded that the generic word embedding was better than the domain-specific one. Biswas et al. also found that oversampling, or a combination of oversampling and undersampling, achieves a jump in performance when handling compact software engineering datasets [2]. R. Malhotra et al. have attempted to develop Software Bug Classification (SBC) models that can identify "low", "moderate", and "high" impact levels of software bugs. The levels were assigned based on Maintenance Effort (ME), Change Impact (CI), or a product of both. The data is sourced from the changelogs in Google's GIT repository. Data preprocessing is performed, and the SBC models are developed using six classification techniques.
The study assessed three predictors obtained from text mining. After evaluation, it was found that the combined SBC model showed higher accuracy than the ME or CI SBC models. They also found that the accuracy for the "high" category was superior to that of the other categories [7]. R. Malhotra et al. have also worked to find out whether resampling methods applied to software defect data improve performance. They used datasets sourced from the Defect Collection and Reporting tool (DCRS), performed data preprocessing, applied three different resampling methods, and evaluated their performance using measures including accuracy, precision, sensitivity, specificity, G-Mean, and AUC. They concluded that applying resampling methods to maintainability prediction models enables accurate prediction of the minority class [5]. R. Malhotra et al. have further attempted to find the effect of resampling techniques on the performance of software defect prediction models. They applied six oversampling and four undersampling methods to rectify the class imbalance problem. Examining the evaluators, sensitivity, G-Mean, Balance, and AUC, it was found that there was an evident improvement in their values when data resampling methods were applied to the software defect prediction models [8].
Dr. Lov Kumar et al. have worked to automate determining the severity level of a defect in software. Textual defect descriptions were tokenized using seven different word embedding techniques. The obtained features were further pruned to an optimal set of relevant features using three different feature selection techniques, and the class imbalance in the resulting data was handled using the Synthetic Minority Oversampling Technique (SMOTE). The performance of the word embeddings was evaluated using eleven different classifiers. In this way, they successfully combined word embeddings, feature selection, and SMOTE to assemble a predictive model capable of assigning a severity level to defect descriptions [9]. SentiStrength-SE, proposed by Islam et al., achieved 73.85% precision and 85% recall. They call attention to issues commonly faced by sentiment analysis tools, including domain-specific meanings of words, context-sensitive variations in word meaning, difficulties in dealing with negation, sentimental words in copy-pasted content, difficulty in dealing with irony and sarcasm, and wrong detection of proper nouns [3].

III. STUDY DESIGN
This section presents the details regarding various design settings used for this research.

A. Experimental Dataset
The study uses three different experimental datasets to validate our proposed framework. These datasets have been used by many software researchers for sentiment analysis [1] [2]. The primary objective is to explore different types of embedding, feature selection, data sampling, and different variants of deep learning on these datasets to predict the sentiments of software engineers. Figure 2 shows the number of positive and negative sentiments for the considered datasets. From Figure 2, we observe that the number of positive sentiments for Stack Overflow is much higher than the number of negative sentiments. This unequal distribution leads to a class imbalance problem, i.e., the number of samples in each class is not the same, so balancing the data is required to improve the predictive ability of the developed sentiment analysis models. We have applied the Synthetic Minority Oversampling Technique (SMOTE) and Borderline-SMOTE to each dataset to balance the data [10].
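SMOTE's core idea can be sketched in a few lines of NumPy: each synthetic minority sample is an interpolation between a minority point and one of its k nearest minority-class neighbours. This is a simplified illustration, not the imbalanced-learn implementation used in practice:

```python
import numpy as np

def smote_sample(X_min, n_synthetic, k=5, rng=None):
    """Generate n_synthetic points by interpolating randomly chosen minority
    samples toward one of their k nearest minority neighbours (simplified SMOTE)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))
        # Euclidean distances from X_min[i] to every minority sample
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]      # skip the point itself
        j = rng.choice(neighbours)
        gap = rng.random()                       # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# 20 minority samples in 2-D; create 30 synthetic ones to balance a 50/20 split
X_min = np.random.default_rng(1).normal(size=(20, 2))
X_new = smote_sample(X_min, n_synthetic=30)
print(X_new.shape)  # (30, 2)
```

Borderline-SMOTE differs only in *which* minority points it interpolates from: it restricts generation to minority samples near the class boundary rather than sampling them uniformly.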

C. Word Embedding
The textual data of the dataset must be expressed as vectors in relation to each other. Six different word embedding techniques, Term Frequency and Inverse Document Frequency (TF-IDF), Continuous Bag of Words (CBOW), Skip-Gram (SKG), Global Vectors for Word Representation (GLOVE), Google News Word to Vector (GW2V), and fastText (FST), have been applied to the dataset to represent the textual data as vectors in an n-dimensional space. We removed all stopwords, bad symbols, and extra spaces before applying the word embedding techniques. The resulting vectors are then used to develop models that determine the sentiment of software engineering tasks [9] [2].
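The cleaning step described above (stopwords, bad symbols, extra spaces) can be sketched in plain Python. The stopword list and regular expression here are illustrative, not the study's exact configuration:

```python
import re

STOPWORDS = {"the", "is", "a", "an", "and", "of", "to", "in"}  # illustrative subset

def clean_comment(text: str) -> str:
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)       # replace bad symbols with spaces
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)                        # split() also collapses extra spaces

print(clean_comment("The build FAILED!!  (again) -- see log #42"))
# build failed again see log 42
```

Only after this normalization are the cleaned token sequences handed to the embedding techniques.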

D. Feature Selection Techniques
The feature vectors extracted from word embeddings are used as input, so performance also depends on selecting the important features. To extract the important features from the existing set of vectors, we have used six different feature selection techniques: Analysis of Variance (ANOVA) is used to find features capable of differentiating positive and negative sentiment, correlation attribute evaluation (CORR_ATR) is used to remove highly correlated features, Principal Component Analysis (PCA) is used to derive new values for uncorrelated features, and gain ratio, information gain, and OneR are used to rank the features and select the best ones for sentiment analysis [4] [11].
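For instance, ANOVA-based selection can be sketched with scikit-learn's SelectKBest, where the f_classif score measures how well each embedding dimension separates the two sentiment classes. The data below is synthetic and purely illustrative:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))      # 200 comments, 50 embedding dimensions
y = rng.integers(0, 2, size=200)    # 0 = negative, 1 = positive sentiment
X[:, 3] += 2.0 * y                  # make dimension 3 genuinely discriminative

# Keep the 10 dimensions with the highest ANOVA F-statistic
selector = SelectKBest(score_func=f_classif, k=10)
X_sel = selector.fit_transform(X, y)

print(X_sel.shape)  # (200, 10)
print(selector.get_support()[3])    # the informative dimension survives
```

The ranking-based techniques (gain ratio, information gain, OneR) follow the same pattern, differing only in the scoring function used to order the features.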

E. Classification Technique
In this study, we have used eight deep-learning models, validated using k-fold cross-validation with a k value of 5, after separating the data into training and testing subsets. Every model has an input layer with a number of neurons equal to the number of features of the input data, and all models are composed of Dense and Dropout layers. Each neuron in a Dense layer receives inputs from all the neurons in the previous layer; the neurons in a Dropout layer are randomly deactivated, with a dropout value of 0.2 in this study. The output layer has a single neuron corresponding to the binary classification of positive or negative sentiment and uses a sigmoid activation function, unlike the other layers, which use the Rectified Linear Unit (ReLU) activation function. Adam is the optimizer used to train the models, with binary cross-entropy as the loss function. The number of hidden layers is increased for the other four models. Figure 1 demonstrates the architectures of the models (DL1, DL2, DL3, and DL4). The parameters used to validate the models are: batch size = 30, dropout = 0.2, and epochs = 100.
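The layer arrangement described above can be illustrated with a bare NumPy forward pass: one Dense hidden layer with ReLU, a Dropout layer with rate 0.2, and a single sigmoid output neuron. The weights and sizes here are made up for illustration; this is not the trained Keras model:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dropout(x, rate=0.2, training=True):
    if not training:
        return x
    mask = rng.random(x.shape) >= rate   # deactivate ~20% of neurons at random
    return x * mask / (1.0 - rate)       # inverted-dropout rescaling

n_features, n_hidden = 40, 16            # input width = number of selected features
W1 = rng.normal(scale=0.1, size=(n_features, n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, 1));          b2 = np.zeros(1)

x = rng.normal(size=(1, n_features))     # one comment's feature vector
h = dropout(relu(x @ W1 + b1), rate=0.2) # Dense -> ReLU -> Dropout
p = sigmoid(h @ W2 + b2)                 # single sigmoid output neuron

print(p.item())                          # probability of positive sentiment
```

In the actual models, Adam updates W1, b1, W2, and b2 by minimizing binary cross-entropy over mini-batches of 30 samples for 100 epochs; the deeper variants simply repeat the Dense/Dropout pair.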

IV. RESEARCH METHODOLOGY
The pictorial representation of the proposed framework is provided in Figure 3. We first extract textual documents from three different SE repositories: JIRA issue comments from issue trackers, Stack Overflow discussions containing questions and answers, and user reviews of mobile apps from app stores. After collecting these textual documents, six different word-embedding techniques are used to represent them as numerical vectors. Each embedding uses a different way to represent the words of a text document with real-valued vectors, whose values are closer in the vector space for similar words. Next, we use two sampling techniques, SMOTE and BLSMOTE, to handle the class-imbalanced nature of the datasets. We then apply different feature selection techniques to select the best combination of relevant features: the ANOVA test is used to remove insignificant features; PCA is used to remove high correlation between features and derive new feature values; gain ratio, information gain, and OneR are used to rank features and select the top ones; and correlation analysis is used to remove highly correlated features. After finding the right sets of features, we use different variants of deep-learning techniques to train software sentiment analysis models. The trained models are validated with the 5-fold cross-validation method, and their performance is compared using four performance parameters: accuracy, precision, recall, and AUC.
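The final validation step can be sketched with scikit-learn's 5-fold utilities. A logistic-regression stand-in replaces the deep-learning models purely for illustration, and the data is synthetic:

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))                       # embedded + selected features
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)  # toy sentiment labels

# 5-fold cross-validation with the study's four performance parameters
scores = cross_validate(
    LogisticRegression(max_iter=1000), X, y, cv=5,
    scoring=("accuracy", "precision", "recall", "roc_auc"),
)
for name in ("test_accuracy", "test_precision", "test_recall", "test_roc_auc"):
    print(name, round(scores[name].mean(), 3))
```

Each fold trains on 4/5 of the data and evaluates on the held-out 1/5, so every sample contributes to the reported averages exactly once.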

V. EMPIRICAL RESULTS AND ANALYSIS
The primary objective of this work is to analyze the performance of the developed software sentiment analysis models using different variants of deep learning, word-embedding techniques, feature selection techniques, and data sampling techniques, with the purpose of investigating how different contexts impact their effectiveness. The proposed models are validated with three software-related datasets, namely mobile app reviews, Stack Overflow discussions, and JIRA issue comments. Finally, the predictive ability of these models is evaluated using performance parameters such as accuracy, AUC, precision, and recall. AUC is considered the primary parameter for model performance because it provides reliable findings even for imbalanced data. Tables I and II show the performance of the models in terms of precision, recall, accuracy, and AUC for the AppReview dataset using different variants of deep learning and feature selection techniques, with original and sampled data; the results for other combinations are similar. The high values of AUC (≥ 0.7) in Tables I and II suggest that the proposed models have the capability to predict sentiment for software engineering; in the majority of cases, the precision, recall, and AUC values are higher than 0.8. The information in Tables I and II also suggests that the models trained on sampled data have a better ability to predict sentiment than those trained on the original data, and that models trained on selected sets of features have higher precision, recall, and AUC values than models trained on all features.
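The preference for AUC on imbalanced data can be seen in a small sketch: a degenerate scorer that rates every comment identically reaches 90% accuracy on a 90/10 split, yet its AUC exposes the complete lack of discriminative power. The data is invented for illustration:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# 90/10 imbalanced labels: 90 positive comments, 10 negative
y_true = np.array([1] * 90 + [0] * 10)
p_constant = np.full(100, 0.9)     # same score for every comment

acc = accuracy_score(y_true, (p_constant > 0.5).astype(int))
auc = roc_auc_score(y_true, p_constant)
print(acc)   # 0.9: looks good, but only because of the class imbalance
print(auc)   # 0.5: no better than random ranking
```

This is why a model with AUC at or above 0.7 on imbalanced sentiment data is more meaningful than an accuracy figure alone.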

VI. COMPARATIVE ANALYSIS
In this section, we have compared the performance of the different word embedding techniques, class balancing approaches, feature selection strategies, and deep-learning techniques used for developing the sentiment prediction models, using box-plot diagrams and descriptive statistics. We have also performed the Friedman test to find statistically significant differences between the techniques. The hypotheses used to achieve our objective are mentioned below:
• Word Embedding Techniques: Null Hypothesis: There is no significant difference between the models trained by using features extracted by different embedding techniques.
• Feature Selection Techniques: Null Hypothesis: There is no significant difference between the models trained by using selected sets of features using different feature selection techniques and all features.
• Data Sampling Techniques: Null Hypothesis: There is no significant difference between the models trained on sampled data and original data.

A. Word-Embedding
In this work, six different types of word embedding approaches, TFIDF, CBOW, GLOVE, GW2V, SKG, and FASTXT, have been used to find the numerical vectors of software text comments. To find the best embedding approach, we computed the performance evaluators accuracy, precision, AUC, and recall for models trained on the features produced by each embedding technique, using different variants of deep learning with 5-fold cross-validation on both sampled and original data. Figure 4 visually depicts the models' ability to predict sentiments using the different word-embedding techniques, Table III depicts their descriptive statistics, and Table IV shows the mean ranks using the Friedman test for the various word embedding techniques. We evaluated the considered null hypothesis at the 0.05 level with five degrees of freedom on four performance parameters: accuracy, recall, precision, and AUC. A lower mean rank represents a better word-embedding technique for sentiment analysis of software engineering comments. According to the information in Table IV, the models trained using different embedding techniques are significantly different. Similarly, the models trained using GW2V have the lowest mean ranks, i.e., 1.91 for accuracy, 1.92 for precision, 3.61 for recall, and 1.37 for AUC, indicating that these models have better prediction capability than those using the other embedding techniques.

B. Feature Selection
In this work, six different feature selection techniques have been used to find the best combination of relevant features for software engineering sentiment analysis: significant features computed with the ANOVA test, uncorrelated sets of features from PCA, and the best sets of features from gain ratio (GAIN_RAT), information gain (INFO_GAIN), OneR attribute evaluation (OneR_ATR), and correlation attribute selection (CORR_ATR). We computed the performance evaluators accuracy, precision, AUC, and recall to find the feature selection technique that gives the best sets of features for models trained using different variants of deep learning with 5-fold cross-validation on both sampled and original data. Figure 5 visually depicts the models' ability to predict sentiments using the different feature selection techniques, and Table V depicts their descriptive statistics in terms of accuracy, AUC, precision, and recall. From Figure 5, it is quite evident that the models trained on the significant sets of features identified by the ANOVA test have a better capability to predict sentiment compared to the other feature selection techniques. The models trained using ANOVA features achieved 0.85 average AUC, 0.88 average recall, 0.84 average precision, and 83.21 average accuracy. Similarly, from Table V, we observed that the models trained using features selected by OneR_ATR have

C. Classification Techniques
The sentiment prediction models for software engineering comments are trained using different variants of deep-learning techniques with a 5-fold cross-validation approach, and their capability is compared using performance parameters such as accuracy, precision, recall, and AUC. Figure 6 depicts the box-plot diagrams of the different performance parameters for the models trained using different variants of deep learning, and Table VI shows the descriptive statistics in terms of mean, median, min, max, Q1, and Q3 for the different deep-learning techniques. It can be seen from Figure 6 and Table VI that the models trained using DL2, DL3, DL4, and DL5 have similar average AUC values, i.e., 0.78, while the DL8 classifier produces the models with the minimum average AUC of 0.67.
In this paper, we have also compared the effectiveness of different variants of deep learning using the Friedman test with a significance level of 0.05 and six degrees of freedom on four different performance parameters such as accuracy, recall, precision, and AUC.
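The Friedman test used throughout this comparison operates on matched scores: each technique is ranked within each dataset/fold block, and the test asks whether the mean ranks differ more than chance would allow. With scipy it can be sketched as follows; the AUC values are invented for illustration:

```python
from scipy.stats import friedmanchisquare

# AUC of three illustrative classifiers on the same six dataset/fold pairs
dl_a = [0.78, 0.80, 0.77, 0.81, 0.79, 0.80]
dl_b = [0.74, 0.75, 0.73, 0.76, 0.74, 0.75]
dl_c = [0.70, 0.69, 0.71, 0.68, 0.70, 0.69]

stat, p = friedmanchisquare(dl_a, dl_b, dl_c)
print(stat, p)     # reject the null hypothesis of equal performance if p < 0.05
```

Because dl_a beats dl_b, which beats dl_c, in every block, the rankings are perfectly consistent and the p-value falls well below 0.05, so the null hypothesis of equal performance is rejected.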

D. SMOTE
In this study, we have used two different variants of the SMOTE technique to handle the class-imbalanced nature of the data. We have used box plots and descriptive statistics of the performance parameters to find the impact of the data sampling techniques on sentiment analysis for software engineering comments. Figure 7 presents a visual comparison of the predictive capability of models trained on the balanced datasets versus models trained on the imbalanced datasets, and Table VII shows the descriptive statistics in terms of min, max, mean, median, Q1, and Q3 for models trained on sampled and original data. From Figure 7 and Table VII, it can be seen that the models trained on sampled data perform better than those trained on the original data. The prediction models trained on sampled data have 0.76 average AUC, 0.88 average recall, 0.78 average precision, and 76 average accuracy, while the models trained on original data have 0.73 average AUC, 0.81 average recall, 0.82 average precision, and 81.51 average accuracy.
In this paper, the Friedman test with a significance level of 0.05 and two degrees of freedom has been used to assess the impact of the sampling techniques on model performance. Table IV shows the mean ranks using the Friedman test for SMOTE, BLSMOTE, and original data. The small p-value indicates that the models trained on sampled data show a significant improvement in performance compared to the original data. Similarly, the models trained on SMOTE-sampled data have a better capability to predict sentiments than those trained on BLSMOTE-sampled or original data.

VII. CONCLUSION
Sentiment analysis prediction models for software engineers help in various engineering tasks, such as analyzing developers' sentiments, evaluating app reviews, and gauging users' sentiments about software products. The work presented in this paper is a successful effort toward the development of software sentiment models using different variants of embedding techniques, different methods to find important features, different methods to handle the imbalanced nature of the datasets, and, finally, different variants of deep learning for model development. The performance of the developed models is computed and compared using accuracy, precision, AUC, and recall. We have also applied the Friedman test to statistically examine the performance of models developed using different combinations of features. The major findings are summarized as follows:
• The high AUC values of the trained models confirm their capability to predict sentiment from text comments.
• The use of sampling techniques like SMOTE and BLSMOTE significantly helps in improving the performance of software sentiment prediction models.
• The models trained by using selected sets of features using ANOVA achieved better performance as compared to other techniques.
• The deep learning with one dropout layer and two hidden layers achieved better performance as compared to other combinations.

VIII. ACKNOWLEDGEMENTS
This research is funded by TestAIng Solutions Pvt. Ltd.