
Proceedings of the 18th Conference on Computer Science and Intelligence Systems

Annals of Computer Science and Information Systems, Volume 35

Experiments on software error prediction using Decision Tree and Random Forest algorithms


DOI: http://dx.doi.org/10.15439/2023F363

Citation: Proceedings of the 18th Conference on Computer Science and Intelligence Systems, M. Ganzha, L. Maciaszek, M. Paprzycki, D. Ślęzak (eds). ACSIS, Vol. 35, pages 865-869 (2023)


Abstract. Machine learning algorithms are widely used to assess the error-proneness of software. We conducted several error-prediction experiments on datasets from the public PROMISE repository, using the Decision Tree and Random Forest algorithms. We also examined techniques aimed at improving the performance and accuracy of the model, such as oversampling, hyperparameter optimization, and threshold adjustment. The outcome of our experiments suggests that the Random Forest algorithm, with 100 to 1000 trees, can achieve high values of evaluation metrics such as accuracy and balanced accuracy. However, it has to be combined with techniques that counter the imbalance of the datasets in order to assure the high precision and recall that correspond to correct detection of erroneous software. Additionally, oversampling and hyperparameter optimization were shown to be reliably applicable to the algorithm, while the threshold adjustment technique was not found to be consistent.
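As an illustration of the techniques named in the abstract, the following Python sketch combines a Random Forest classifier with SMOTE oversampling, grid-search hyperparameter optimization, and decision-threshold adjustment, using the scikit-learn and imbalanced-learn libraries cited in the references below (entries 31-33). It is a minimal sketch under assumed inputs, not the authors' implementation: the dataset file name, the "defects" label column, the parameter grid, and the 0.3 threshold are all illustrative placeholders.

```python
# Minimal sketch (not the authors' code): Random Forest on a PROMISE-style
# dataset with SMOTE oversampling, grid-search hyperparameter optimization,
# and decision-threshold adjustment.
import pandas as pd
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Hypothetical PROMISE-style dataset: numeric software metrics plus a
# binary "defects" label marking error-prone modules.
data = pd.read_csv("promise_dataset.csv")
X, y = data.drop(columns=["defects"]), data["defects"].astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# SMOTE sits inside the pipeline, so synthetic minority samples are
# generated only from the training folds of each cross-validation split.
pipeline = Pipeline([
    ("smote", SMOTE(random_state=42)),
    ("rf", RandomForestClassifier(random_state=42)),
])

# Illustrative grid covering the 100-1000 tree range from the abstract.
param_grid = {
    "rf__n_estimators": [100, 500, 1000],
    "rf__max_depth": [None, 10, 20],
}
search = GridSearchCV(pipeline, param_grid, scoring="balanced_accuracy", cv=5)
search.fit(X_train, y_train)

# Threshold adjustment: flag a module as defective when the predicted
# probability of the positive class exceeds a lowered threshold (0.3 here,
# chosen arbitrarily) instead of the default 0.5, trading precision for recall.
proba = search.predict_proba(X_test)[:, 1]
y_pred = (proba >= 0.3).astype(int)

print("balanced accuracy:", balanced_accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall:", recall_score(y_test, y_pred))
```

Keeping SMOTE inside the pipeline matters: oversampling before the train/test split (or before cross-validation) would leak synthetic copies of test-set neighbors into the training data and inflate the reported scores.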

References

  1. F. Elberzhager, A. Rosbach, R. Eschbach, J. Münch, "Reducing Test Effort: A Systematic Mapping Study on Existing Approaches", Information and Software Technology, vol. 54, no. 10, pp. 1092-1106, 2012.
  2. K. Bareja, A. Singhal, "A Review of Estimation Techniques to Reduce Testing Efforts in Software Development", http://dx.doi.org/10.1109/ACCT.2015.110, 2015.
  3. J. Hryszko, L. Madeyski, "Cost Effectiveness of Software Defect Prediction in an Industrial Project", http://dx.doi.org/10.1515/fcds-2018-0002, 2018.
  4. Y. Z. Bala, P. A. Samat, K. Y. Sharif, N. Manshor, "Current Software Defect Prediction: A Systematic Review", http://dx.doi.org/10.1109/AiIC54368.2022.99114586, 2022.
  5. F. Matloob et al., "Software Defect Prediction Using Ensemble Learning: A Systematic Literature Review", http://dx.doi.org/10.1109/ACCESS.2021.3095559, 2021.
  6. Y. Zhao, K. Damevski, H. Chen, "A Systematic Survey of Just-in-Time Software Defect Prediction", http://dx.doi.org/10.1145/3567550, 2023.
  7. T. Menzies, J. DiStefano, A. Orrego, R. Chapman, "Assessing predictors of software defects", in Proc. Predictive Software Models Workshop, pp. 1-5, 2004.
  8. G. Boetticher, T. Menzies, T. Ostrand, PROMISE Repository of Empirical Software Engineering Data, West Virginia University, Department of Computer Science, 2007.
  9. C. Catal, B. Diri, B. Ozumut, "An artificial immune system approach for fault prediction in object oriented software", pp. 238-245, http://dx.doi.org/10.1109/DEPCOS-RELCOMEX, 2007.
  10. C. Catal, B. Diri, "Investigating the effect of dataset size, metrics sets, and feature selection techniques on software fault prediction problem", http://dx.doi.org/10.1016/j.ins.2008.12.001, 2009.
  11. J. Brownlee, "Clonal selection theory & CLONALG. The clonal selection classification algorithm", Technical Report 2-02, Swinburne University of Technology, 2005.
  12. J. H. Carter, "The immune system as a model for pattern recognition and classification", http://dx.doi.org/10.1136/jamia.2000.0070028, 2001.
  13. L. Breiman, "Bagging predictors", Machine Learning, vol. 24, pp. 123-140, http://dx.doi.org/10.1007/BF00058655, 1996.
  14. D. Mundada, A. Murade, O. Vaidya, J. N. Swathi, "Software Fault Prediction Using Artificial Neural Network And Resilient Back Propagation", Int. J. Comput. Sci. Eng., vol. 5, no. 03, pp. 173-179, 2016.
  15. Z. Xiang, L. Zhang, "Research on an Optimized C4.5 Algorithm Based on Rough Set Theory", http://dx.doi.org/10.1109/ICMeCG.2012.74, 2012.
  16. P. Bishnu, V. Bhattacherjee, "Software Fault Prediction Using Quad Tree-Based K-Means Clustering Algorithm", pp. 1146-1150, http://dx.doi.org/10.1109/TKDE.2011.163, 2012.
  17. P. Bishnu, V. Bhattacherjee, "Outlier Detection Technique Using Quad Tree", in Proc. Int'l Conf. Computer Comm. Control and Information Technology, pp. 143-148, 2009.
  18. A. Okutan, O. T. Yıldız, "Software defect prediction using Bayesian networks", http://dx.doi.org/10.1007/s10664-012-9218-8, 2014.
  19. P. Kumudha, R. Venkatesan, "Cost-Sensitive Radial Basis Function Neural Network Classifier for Software Defect Prediction", http://dx.doi.org/10.1155/2016/2401496, 2016.
  20. S. Gupta, D. Gupta, "Fault Prediction using Metric Threshold Value of Object Oriented Systems", International Journal of Engineering Science and Computing, vol. 7, no. 6, pp. 13629-13643, 2017.
  21. E. Erturk, E. A. Sezer, "Iterative software fault prediction with a hybrid approach", http://dx.doi.org/10.1016/j.asoc.2016.08.025, 2016.
  22. J. S. R. Jang, "ANFIS: adaptive-network-based fuzzy inference system", http://dx.doi.org/10.1109/21.256541, 1993.
  23. F. Alighardashi, M. A. Z. Chahooki, "The Effectiveness of the Fused Weighted Filter Feature Selection Method to Improve Software Fault Prediction", http://dx.doi.org/10.22385/jctecs.v8i0.96, 2016.
  24. C. Lakshmi Prabha, N. Shivakumar, "Software Defect Prediction Using Machine Learning Techniques", in Proc. of the Fourth International Conference on Trends in Electronics and Informatics (ICOEI), ISBN: 978-1-7281-5518-0, 2020.
  25. Y. Shen, S. Hu, S. Cai, M. Chen, "Software Defect Prediction based on Bayesian Optimization Random Forest", http://dx.doi.org/10.1109/DSA56465.2022.00149, 2022.
  26. T. F. Husin, M. R. Pribadi, Yohannes, "Implementation of LSSVM in Classification of Software Defect Prediction Data with Feature Selection", in Proc. 9th Int. Conf. on Electrical Engineering, Computer Science and Informatics (EECSI 2022), pp. 126-131, 2022.
  27. Md. A. Jahangir, Md. A. Tajwar, W. Marma, "Intelligent Software Bug Prediction: An Empirical Approach", http://dx.doi.org/10.1109/ICREST57604.2023.10070026, 2023.
  28. Python Core Team, "Python: A dynamic, open source programming language", Python Software Foundation, accessed 28.04.2022, <https://www.python.org/>.
  29. C. R. Harris, K. J. Millman, S. J. van der Walt et al., "Array programming with NumPy", Nature, vol. 585, pp. 357-362, http://dx.doi.org/10.1038/s41586-020-2649-2, 2020.
  30. W. McKinney, "Data structures for statistical computing in python", in Proc. of the 9th Python in Science Conference, vol. 445, pp. 56-61, http://dx.doi.org/10.25080/Majora-92bf1922-00a, 2010.
  31. F. Pedregosa et al., "Scikit-learn: Machine Learning in Python", Journal of Machine Learning Research, vol. 12, pp. 2825-2830, 2011.
  32. G. Lemaître, F. Nogueira, C. K. Aridas, "Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning", Journal of Machine Learning Research, vol. 18, pp. 1-5, http://dx.doi.org/10.48550/arXiv.1609.06570, 2017.
  33. N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, "SMOTE: synthetic minority over-sampling technique", Journal of Artificial Intelligence Research, vol. 16, pp. 321-357, 2002.