Software Requirements Classification Using a Deep Learning Approach with Various Hidden Layers

Software requirements classification is becoming increasingly crucial for the industry to keep pace with growing project sizes. Based on client feedback or demand, software requirements classification is critical in segregating user needs into functional and quality requirements. However, because there are numerous machine learning (ML) and deep learning (DL) models that require parameter tuning, the use of ML to facilitate decision-making across the software engineering pipeline is not well understood. In this study, five distinct word embedding techniques were applied to the functional and quality software requirements. The imbalanced classes in the dataset are balanced using the Synthetic Minority Oversampling Technique (SMOTE). Then, to remove redundant and irrelevant features, feature selection and dimensionality reduction techniques are applied: dimensionality reduction is accomplished with Principal Component Analysis (PCA), while feature selection is accomplished with the Rank-Sum Test (RST). For binary categorization into functional and non-functional requirements, the generated vectors are provided as inputs to eight distinct deep learning classifiers. The findings of the research show that combining word embedding and feature selection techniques with various classifiers can accurately classify functional and quality software requirements.


I. INTRODUCTION
Software requirements classification deals with segregating the clients' requirements and demands found in the Software Requirements Specification (SRS) document into functional and non-functional requirements. It is a key step in the software development pipeline which can be automated using machine learning techniques. This allows the industry to save on labor expenses, as a domain expert is often required, while also optimizing the process and saving crucial time [1].
A key problem that needs to be addressed during requirements classification is that of the inconsistency in terminology used by the clients and the software engineers. This may lead to misclassification of the software requirements.
Functional requirements are the demands that the end-user defines as critical characteristics that the system should supply and that can be observed directly in the finished product. They define or state the input to the system, the action to be taken, and the intended output. The system's basic quality standards, often known as non-functional requirements, include factors such as reliability, maintainability, security, and portability [2].
Another problem faced during software requirements classification is the imbalance between the number of instances of the functional and non-functional requirements classes. Data imbalance means that the number of instances of the minority class is much lower than that of the majority class. Because of the unbalanced distribution of data, classifiers are misled while learning the minority class, resulting in biased and erroneous findings [3]. A good software requirements classification model is one that has been trained on a similar distribution of functional and non-functional requirement classes. In this study, this problem is addressed using oversampling through the Synthetic Minority Oversampling Technique (SMOTE).
In this paper, we aim to solve the above problems and create highly accurate software requirements classification models that can be reliably used in industry. The following research questions (RQs) guide the pursuit of these goals.
• RQ1: Which feature extraction technique can best capture the unstructured, textual data present in the SRS document, and convert it to structured data in the form of numerical vectors?
• RQ2: Which feature selection techniques are the best at getting rid of redundant and irrelevant features which may affect the performance of the classification models?
• RQ3: For what structure of the deep learning classifiers do the software requirements classification models achieve the best results?
• RQ4: How does the application of class balancing technique through oversampling affect the performance of the models?
The criteria used to evaluate the performance of the various models are F-measure, accuracy, and Area under the ROC curve (AUC). The Friedman test was used to determine whether an ML technique resulted in a substantial difference in performance. The PROMISE dataset [4] was used in this study, which contains 625 labeled requirements from 15 different projects. The contributions of the study are as follows:
• An extensive comparison of various word embedding techniques for the purpose of feature extraction is provided to analyze which technique is best suited for the SRS document.
• A thorough investigation on the effect of various feature selection techniques on the performance of classification models in software engineering is presented.
• Deep learning classifiers are employed to increase the accuracy of software requirements classification over previous studies, with variations in the number and type of layers analyzed to find the best model among the eight distinct DL classifiers used in this study.
• The study evaluates and analyses the performance of requirements classification models using relevant performance metrics. The study includes a thorough statistical analysis to back up the findings with statistical testing, unlike previous studies.
• The effect of class-balancing techniques on software datasets to build more accurate models is examined.
The remainder of the paper is structured as detailed here: Section II presents a literature review on software requirement classification and various word embedding approaches that are used in this study. Section III describes the experimental dataset collection as well as the various machine learning algorithms used. The research methodology is described in Section IV using an architecture framework. In Section V, the results of the experiments, along with their analysis, are presented. Section VI shows a comparison of models created using various word-embedding approaches, sets of features, and machine learning models. Finally, Section VII summarizes the information provided and offers directions for further research.

A. Software Requirements Word Embeddings
Navarro-Almanza et al. [5] explore Word2Vec with Skip-Gram to obtain a structured representation of the textual software requirements dataset. The purpose of the Skip-Gram model is to predict context words. The projection from Skip-Gram is a low-dimensional, continuous vector-space representation of the word. The models developed achieved a maximum of 0.8 precision, 0.785 recall, and 0.77 F-measure.
To improve how well the word embeddings capture the content of the text, Marcacini et al. [6] analyze the impact of using contextual word embeddings for software requirements. They use the RE-BERT model to obtain the structured data to feed into their hierarchical clustering classifier. BERT is built on the Transformers architecture and uses a deep neural network. To address the sequence of occurrence of the tokens, a positional embedding is used. Static word embeddings, such as word2vec and FastText, on the other hand, have the issue of having the same embedding regardless of context, which makes structured modeling of software requirements difficult.

B. Classifying Software Requirements
Ott [7] employs two classifiers, Multinomial Naive Bayes and Support Vector Machine, for software requirements classification. The classification techniques are applied to two datasets, one confidential and one public. The maximum recall achieved by the Naive Bayes classifier is 0.94, whereas the Support Vector Machine achieves a precision of 0.86 in the best model. Baker et al. [8] work on the classification of non-functional requirements into their subcategories. The authors compare the performance of a CNN model with that of an ANN model, and the results indicate that the CNN outperforms the ANN on most performance metrics. The ANN consists of one hidden layer of 20 neurons. The ANN model has a precision of 82% to 90%, a recall of 78% to 85%, and a greatest F-score of 84%. With a maximum F-score of 92%, the CNN model obtains precision between 82% and 94%, and recall between 76% and 97%. Rahimi et al. [9] focus on further classifying functional requirements into six categories: policy, action constraint, solution, definition, attribute constraint, and enablement. The authors use an ensemble approach that combines five distinct classifiers: support vector classification, support vector machine, decision tree, logistic regression, and Naïve Bayes. For each class, per-class accuracy is used as a weight to find the most suitable classifier. The best results are obtained using LR, SVC, and SVM as the classifiers, which perform better than using all five. An accuracy of 99.45% is achieved in classifying 600 functional requirements.

III. STUDY DESIGN
This section presents the details of the various design settings used in this research.

A. Experimental Dataset
This study uses the dataset created by Cleland-Huang et al. [4] to validate the proposed software requirements classification solution. The data was extracted with the help of MS students from DePaul University and made available for public research via the PROMISE repository. The functional and non-functional attributes are shown below in Figure 2. The first observation to be made about Figure 2 is that the PROMISE repository is imbalanced in the number of functional and non-functional requirements, with quality requirements accounting for 382 of the 625 total.

B. Training of Models from Imbalanced Data Set:
Several ML algorithms have the issue of neglecting the minority class in imbalanced datasets, despite the fact that performance on that class is often what matters. Before applying further ML techniques, it was therefore essential to apply the SMOTE technique to the imbalanced classes in our dataset. SMOTE [10] (Synthetic Minority Oversampling Technique) is an oversampling approach that generates synthetic minority-class instances by interpolating between existing minority samples and their nearest neighbors.
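The interpolation at the heart of SMOTE can be sketched in a few lines of NumPy. This is a minimal illustration of the idea, not the exact implementation used in this study; the function name and parameters are our own.

```python
import numpy as np

def smote_oversample(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples: pick a minority point,
    pick one of its k nearest minority neighbours, and interpolate."""
    rng = np.random.default_rng(seed)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # a point is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]      # k nearest neighbours per point
    synth = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))       # random minority sample
        j = nn[i, rng.integers(k)]         # one of its neighbours
        gap = rng.random()                 # interpolation factor in [0, 1)
        synth.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synth)
```

Appending such synthetic rows to the minority class until both classes contain the same number of instances yields the balanced training set.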

C. Word Embedding:
The dataset's textual data must be expressed as numerical vectors relative to one another. The dataset was subjected to five different word embedding techniques: Term Frequency-Inverse Document Frequency (TF-IDF), Continuous Bag of Words (CBOW)1, Skip-Gram (SKG), Global Vectors for Word Representation (GloVe)2, and Google News Word2Vec (GW2V). The aim of techniques such as GloVe and CBOW is to capture the semantic information in the text, which is not possible with other word-embedding techniques such as TF-IDF. The textual data was represented as a vector in an n-dimensional space using these techniques. Before applying the word embedding techniques, all characters in the requirements are converted to lowercase letters, and stopwords, bad symbols, and extra spaces are deleted. The resulting vectors are then used to create models that classify the requirements into functional and non-functional [11].
1 https://towardsdatascience.com/word2vec-skip-gram-model-part-1-intuition-78614e4d6e0b
2 https://nlp.stanford.edu/projects/glove/
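To make the weighting concrete, the following sketch implements the basic TF-IDF formula tf(t, d) · log(N / df(t)) for a toy pair of requirements. It is a minimal illustration; common libraries apply smoothed variants, so exact values will differ.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocab, vectors) where each vector holds
    tf(t, d) * log(N / df(t)) for every vocabulary term t."""
    n = len(docs)
    tokenized = [d.lower().split() for d in docs]
    df = Counter(t for doc in tokenized for t in set(doc))  # document frequency
    vocab = sorted(df)
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        vectors.append([tf[t] / len(doc) * math.log(n / df[t]) for t in vocab])
    return vocab, vectors
```

A term such as "the" that appears in every requirement receives idf = log(1) = 0, which is exactly how TF-IDF suppresses the frequent, uninformative words removed during pre-processing.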

D. Feature Selection Techniques
After applying word embedding to the data, we identify the critical feature vectors that impact the performance of the models. To eliminate redundant and irrelevant features that may have a detrimental impact on the models' performance, the Rank-Sum Test (RST) and Principal Component Analysis (PCA) are used. To see the difference, the performance of models trained on these selected features is compared with that of models trained on all features. This phase helps reduce both overfitting and training time [12].
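A rank-sum-based selection step can be sketched with SciPy as follows: every feature whose distribution differs significantly between the two classes is retained. This is an illustrative reading of the procedure, with a hypothetical significance threshold of 0.05.

```python
import numpy as np
from scipy.stats import ranksums

def select_by_ranksum(X, y, alpha=0.05):
    """Return indices of features whose values differ significantly
    between class 0 and class 1 under the Wilcoxon rank-sum test."""
    keep = []
    for j in range(X.shape[1]):
        _, p = ranksums(X[y == 0, j], X[y == 1, j])
        if p < alpha:              # distributions differ -> feature is informative
            keep.append(j)
    return keep
```

A feature whose values look the same in both classes yields a large p-value and is dropped as irrelevant.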

E. Classification Technique:
The dataset is divided into training and testing subsets and classified using eight deep learning models with K-fold cross-validation with a k value of 5. The structure of the various models is presented below. All models have an input layer with a number of neurons equal to the number of features of the input data. For each subsequent hidden layer, the number of neurons is halved. All layers in the deep learning models are either Dense layers or Dropout layers. In a Dense layer, each neuron receives input from all neurons of the previous layer. A Dropout layer, on the other hand, randomly omits a fraction of neurons while training the model; in this work, a dropout value of 0.2 is used. Dropout layers are used to mitigate overfitting. Finally, the output layer consists of a single neuron that produces a binary value corresponding to the classification into functional or non-functional requirements. The activation function for each layer is the rectified linear unit (ReLU), except for the output layer, which uses a sigmoid activation function. Binary cross-entropy is the loss function used to train the models, with Adam as the optimizer. The remaining four deep learning models are obtained by further increasing the number of hidden layers. All models are validated using 5-fold cross-validation with batch size = 30, epochs = 100, and dropout = 0.2.
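The halving rule above fixes the width of every hidden layer once the input width and the number of hidden layers are chosen. The following helper (our own sketch, not code from the study) enumerates such a layer plan:

```python
def layer_plan(n_features, n_hidden, dropout=0.2):
    """Enumerate the layer layout described above: an input layer as wide
    as the feature vector, Dense ReLU layers that halve in width, a
    Dropout layer after each hidden layer, and a 1-neuron sigmoid output."""
    plan = [("input", n_features)]
    width = n_features
    for _ in range(n_hidden):
        width //= 2                       # each hidden layer halves the width
        plan.append(("dense_relu", width))
        plan.append(("dropout", dropout))
    plan.append(("dense_sigmoid", 1))     # binary output neuron
    return plan
```

Each `("dense_relu", w)` entry would map directly onto a fully connected layer of `w` neurons in any deep learning framework; the model is then compiled with binary cross-entropy loss and the Adam optimizer as stated above.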
IV. RESEARCH METHODOLOGY
Figure 1 provides a full outline of our proposed effort. We started by extracting the software requirements dataset from the PROMISE repository. This data contains labels of the software requirement category each instance falls under, i.e., functional or non-functional requirements. The data was then subjected to pre-processing techniques such as converting all characters to lowercase, removing non-alphanumeric and non-symbol characters, and eliminating frequently used words like 'the' and 'a', as well as other words with lengths less than or equal to two, as they do not make a significant impact on the classification. Then, all of the phrases were tokenized into words.
To extract features from this unstructured, pre-processed data, five different word embedding techniques were applied to best capture the information. To account for the imbalance in classes due to 382 instances of quality requirements out of 625, a data oversampling technique in the form of SMOTE was used. Models were trained on both the balanced and imbalanced datasets to compare the effectiveness of the classbalancing. The features acquired after the word embeddings and class balancing needed to be refined to remove redundant and irrelevant features. Feature selection technique in the form of Rank-Sum test, and dimensionality reduction technique in the form of Principal Component Analysis were employed. These two sets of features as well as the set of original features were fed to the Deep Learning classifiers. This was done to help understand the importance of feature selection.
The eight different DL classifiers were trained using 5-fold cross-validation. The classifiers have varying layer sizes and layer types, but certain attributes, such as the optimizer and loss function, remain constant across the different classifiers. These classifiers are named DL1, DL2, DL3, and so on up to DL8. Finally, the performance of the various models was measured using accuracy, F-measure, and AUC. This performance was statistically analyzed using box-plots for visual representation, with statistics such as mean, maximum, minimum, Q1, and Q3 for each performance metric. Further, any conclusions derived were statistically supported using the Friedman test.
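The 5-fold procedure partitions the 625 requirements into five disjoint test folds, each used once for evaluation while the remaining four folds train the model. A minimal index-level sketch:

```python
def kfold_indices(n, k=5):
    """Split indices 0..n-1 into k contiguous folds and return
    (train, test) index lists for each of the k rounds."""
    folds = [list(range(i * n // k, (i + 1) * n // k)) for i in range(k)]
    return [(sum(folds[:i] + folds[i + 1:], []), folds[i]) for i in range(k)]
```

In practice the data is shuffled (and often stratified by class) before folding; the contiguous split here only illustrates the bookkeeping.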

V. EMPIRICAL RESULTS AND ANALYSIS
To categorize software requirements as functional or non-functional, we used five distinct word embedding approaches, a class balancing strategy, two feature selection strategies, and eight different classification techniques. As a result, a total of 480 [5 word-embedding techniques × 2 requirements datasets (1 functional + 1 non-functional) × (1 imbalanced dataset + 1 balanced dataset) × 3 sets of features × 8 DL classifiers] models were generated. As shown in Tables I and II, the predictive performance of these trained models is assessed using the F-measure, accuracy, and Area Under the Curve (AUC) performance metrics.
• The high AUC values confirm that the developed models can accurately classify the software requirements into functional and non-functional, as almost all the performance values on the right side of Table I exceed 0.75 AUC.
• The models developed using the Deep Learning structure of DL3 have better performance as compared to other classifiers.
• The models trained using the neural network with the Adam (NNADAM) training algorithm have better predictive ability than those trained with the LBFGS and SGD training algorithms.
• Simply by observing the values in Table I, we can see the improvement in performance that SMOTE provides.
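The total of 480 models reported above follows directly from enumerating every combination of design choices; the sketch below makes the count explicit (the label strings are ours):

```python
from itertools import product

embeddings  = ["TF-IDF", "CBOW", "SKG", "GloVe", "GW2V"]
req_sets    = ["functional", "non-functional"]
balancing   = ["imbalanced", "SMOTE-balanced"]
feature_set = ["all features", "RST", "PCA"]
classifiers = [f"DL{i}" for i in range(1, 9)]

# one trained model per combination of the five design dimensions
configs = list(product(embeddings, req_sets, balancing, feature_set, classifiers))
```

Each element of `configs` names one trained model, e.g. `("TF-IDF", "functional", "SMOTE-balanced", "RST", "DL3")`.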

VI. COMPARATIVE ANALYSIS
The various models created using word embedding techniques, class balancing approaches, feature selection strategies, and different classifiers are compared in this section. The comparison is based on the area under the ROC curve (AUC), F-measure, and accuracy, with box plots serving as a visualization of the comparative performance. The Friedman test was used in this research to verify the findings and to accept or reject the following hypotheses.
• Null Hypothesis: There is no substantial difference in the predictive performance of software classification models constructed using different machine learning approaches.
• Alternate Hypothesis: The predictive power of software classification models constructed using various ML approaches varies significantly.
With degrees of freedom of 4 for word embedding, 1 for class balancing, 2 for feature selection, and 7 for distinct DL classifier comparisons, the Friedman test was performed with a significance threshold of α = 0.05.
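For reference, the Friedman statistic compares k techniques across n models by ranking each model's results. The sketch below (our own implementation, assuming no tied values within a row and ranking smaller values first) reproduces the textbook formula χ²F = 12n/(k(k+1)) Σ r̄j² − 3n(k+1):

```python
def friedman_statistic(results):
    """Friedman chi-square over n blocks (rows, e.g. models) and
    k treatments (columns, e.g. techniques). Assumes no ties in a row;
    rank 1 goes to the smallest value in each row."""
    n, k = len(results), len(results[0])
    rank_sums = [0.0] * k
    for row in results:
        order = sorted(range(k), key=lambda j: row[j])  # ascending ranks
        for pos, j in enumerate(order):
            rank_sums[j] += pos + 1
    mean_ranks = [s / n for s in rank_sums]
    chi2 = 12 * n / (k * (k + 1)) * sum(r * r for r in mean_ranks) - 3 * n * (k + 1)
    return chi2, mean_ranks
```

With k = 3 treatments the critical chi-square value at α = 0.05 (2 degrees of freedom) is 5.99, so a larger statistic rejects the null hypothesis.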
A. RQ1: Which feature extraction technique can best capture the unstructured, textual data present in the SRS document, and convert it to structured data in the form of numerical vectors?
In this work, five distinct word embedding approaches were utilized to compute the numerical vectors of the functional and quality requirements: TF-IDF, Skip-Gram (SKG), Global Vectors for Word Representation (GloVe), Google News Word2Vec (GW2V), and Continuous Bag of Words (CBOW). To assess the prediction abilities of models generated using the various word embedding techniques, the AUC, accuracy, and F-measure statistics were used.
1) Box-plot: Word-Embedding: Figure 4 illustrates the results of the several word embedding algorithms. Models developed with the word embeddings generated by TF-IDF are more reliable than the other models, as shown in Figure 4. CBOW models exhibit poor prediction performance when compared to the other methodologies. The mean AUC value of TF-IDF models is 0.91, with a maximum AUC of 0.98 and a Q3 of 0.96, implying that 25% of the TF-IDF models have an AUC value of at least 0.96.
898 PROCEEDINGS OF THE FEDCSIS. SOFIA, BULGARIA, 2022
2) Friedman Test: Word-Embedding: In this work, the Friedman test is also utilized to examine the predictive power of the models created using different word embedding methods. The goal of the test is to see whether the null hypothesis holds. The null hypothesis asserts that "the various word embedding techniques have no discernible impact on the performance of the classification models." Table IV shows the mean ranks for the various word embedding techniques; the lower the mean rank, the better the models' performance. TF-IDF has the lowest mean rank of 1.88, while CBOW has the highest mean rank of 4.88. With a mean rank of 1.96, W2V provides comparable performance to TF-IDF. The high performance of TF-IDF is likely due to the fact that requirements documents contain numerous similar phrases and terms. Because of its method of assigning weights to each term, TF-IDF can prevent frequent terms in the requirements from dominating the classification better than the other word-embedding algorithms.
B. RQ2: Which feature selection techniques are the best at getting rid of redundant and irrelevant features which may affect the performance of the classification models?
In the proposed study, we use the Rank-Sum test and PCA as feature selection procedures, and, as a third set of models, we use all of the original features for developing predictive models for requirements categorization. These feature selection procedures were applied to both the functional and quality requirements datasets.
1) Box-plot: Feature Selection: As Figure 5 shows, RST selects a better subset of features than either of the other techniques. RST has an average AUC of 0.91, with a minimum of 0.84 and a maximum of 0.97. The features created using PCA performed the worst of the three sets of features, with a mean AUC of 0.87.

2) Friedman Test: Feature Selection:
We also utilized the Friedman test to evaluate the various feature selection procedures based on the performance metrics of the models generated with the three distinct sets of features. The null hypothesis to be evaluated with the Friedman test is that the choice of feature selection technique has no significant impact on the performance of the classification models.

C. RQ3: For what structure of the deep learning classifiers do the software requirements classification models achieve the best results?
The study used eight different deep learning classifiers to categorize the software requirements. These classifiers are employed in conjunction with the various word-embedding and feature selection techniques. The DL binary classification models include models with different layer sizes and layer types. Hyperparameters such as the optimizer, the activation functions of specific layers, and the loss function are kept consistent across the different models.
1) Box-plot: Classification Techniques: Figure 6 depicts the accuracy, precision, recall, and AUC statistics for the eight DL classifiers.

2) Friedman Test: Classification Techniques:
The Friedman test is also performed on the performance metrics of the various classifiers in order to statistically compare the models' performance and help differentiate the models where the box-plots could not. The goal of the test is to see whether the null hypothesis holds. The null hypothesis for this test is that "the requirements classification models developed utilizing the different classifiers do not have a significant variation in their prediction abilities." With 7 degrees of freedom and a significance level of α = 0.05, the Friedman test was performed. The mean ranks of the various classifiers after the Friedman test are shown in Table IV. DL3 has the lowest mean rank of 3.16, followed by DL2 with 3.25, and DL4 and DL5 with 3.65. DL1 and DL8 performed significantly worse, with mean ranks of 7.28 and 5.58, respectively.

D. RQ4: How does the application of class balancing technique through oversampling affect the performance of the models?
Based on the data used to train the models, there are two types of models presented in this study: one set is trained on the dataset with imbalanced classes, while the other is trained on the class-balanced dataset.
1) Box-plot: SMOTE: Figure 7 presents a visual representation of the predictive ability of the models trained on the balanced and imbalanced datasets.
2) Friedman Test: SMOTE: To evaluate the performance of the two sets of models, the Friedman test is applied to their performance measures on the class-balanced and imbalanced datasets. The goal of this test is to accept or reject the null hypothesis, which states that "there is no substantial difference in performance between models trained with balanced or imbalanced classes." The test was performed with a significance threshold of α = 0.05.

VII. CONCLUSION
The deep learning classifiers used in this work, in conjunction with the word embedding, feature selection, and class balancing techniques, have produced very high accuracy for software requirements classification. The best models can be deployed in industry to do away with manual classification, which suffers from inconsistencies between client and software engineer terminologies [13]. Industry can benefit from the lower costs and higher efficiency. The key conclusions we arrived at are as follows:
• Models that utilized features extracted with TF-IDF word embedding performed considerably better than the other models.
• Out of the eight DL classifiers used, the models trained with the DL3 classifier had the best results, with the DL2 and DL4 classifiers performing similarly.
• The features selected by the Rank-Sum test outperformed every other set of features used in this work.
• SMOTE-based class balancing improved the performance of requirements categorization models.
According to the results of this study, DL classifiers are critical for more accurate software requirement classification. Future research can extend this work by classifying the requirements into the subcategories of functional and non-functional requirements. Researchers can also focus on other parts of the software development pipeline, and use the models developed in this work for the requirements classification step.

VIII. ACKNOWLEDGEMENTS
This research is funded by TestAIng Solutions Pvt. Ltd.