## Improved Analogy-based Effort Estimation with Incomplete Mixed Data

### Ibtissam Abnane, Ali Idri

DOI: http://dx.doi.org/10.15439/2018F95

Citation: Proceedings of the 2018 Federated Conference on Computer Science and Information Systems, M. Ganzha, L. Maciaszek, M. Paprzycki (eds). ACSIS, Vol. 15, pages 1015–1024 (2018)

Abstract. Estimation by analogy (EBA) is one of the most attractive software development effort estimation techniques. However, a critical issue when using EBA is the occurrence of missing data (MD) in historical data sets. Missing values for several relevant software attributes are common and may lead to inaccurate EBA estimates. The MD can be numerical, categorical, or both. This paper evaluates four MD techniques (toleration, deletion, k-nearest neighbors (KNN) imputation and support vector regression (SVR) imputation) over four mixed data sets. A total of 432 experiments were conducted, involving four MD techniques, nine MD percentages (from 10% to 90%), three missingness mechanisms (MCAR: Missing Completely at Random, MAR: Missing at Random and NIM: Non-Ignorable Missing) and four data sets. The evaluation process consists of four steps and uses several accuracy measures, such as standardized accuracy (SA) and prediction level (Pred). The results suggest that EBA with imputation techniques achieved significantly better SA values than EBA with toleration or deletion, regardless of the missingness mechanism. Moreover, no particular MD imputation technique outperformed the others overall. However, according to Pred and other accuracy criteria, EBA with SVR imputation was the best, followed by EBA with KNN imputation; we also found that using toleration instead of deletion improves the accuracy of EBA.
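To make the experimental design and the two named accuracy measures concrete, the following is a minimal Python sketch, not the authors' implementation. It enumerates the 4 × 9 × 3 × 4 factor combinations and computes Pred(p) and SA from their commonly used definitions: Pred(p) is the share of projects whose magnitude of relative error is at most p, and SA is 1 − MAE / MAE_p0, where MAE_p0 is the error of random guessing. The function names, the data set placeholders, the example effort values, and the choice to compute MAE_p0 exactly rather than by repeated random runs are all assumptions made for illustration.

```python
from itertools import product


def mae(actual, estimated):
    """Mean absolute error between actual and estimated efforts."""
    return sum(abs(a - e) for a, e in zip(actual, estimated)) / len(actual)


def pred(actual, estimated, level=0.25):
    """Pred(level): fraction of projects with relative error <= level."""
    hits = sum(abs(a - e) / a <= level for a, e in zip(actual, estimated))
    return hits / len(actual)


def standardized_accuracy(actual, estimated):
    """SA = 1 - MAE / MAE_p0.

    Assumption: MAE_p0 is computed exactly as the expected error of
    estimating each project with the effort of another project picked
    uniformly at random, instead of averaging many random runs.
    """
    n = len(actual)
    mae_p0 = sum(
        abs(actual[i] - actual[j])
        for i in range(n)
        for j in range(n)
        if i != j
    ) / (n * (n - 1))
    return 1 - mae(actual, estimated) / mae_p0


# Factors named in the abstract; data set names are placeholders.
techniques = ["toleration", "deletion", "KNN imputation", "SVR imputation"]
md_percentages = list(range(10, 100, 10))  # 10%, 20%, ..., 90%
mechanisms = ["MCAR", "MAR", "NIM"]
datasets = ["dataset_1", "dataset_2", "dataset_3", "dataset_4"]

experiments = list(product(techniques, md_percentages, mechanisms, datasets))
print(len(experiments))  # 4 * 9 * 3 * 4 = 432

# Hypothetical efforts, used only to exercise the measures.
actual_effort = [100, 200, 400]
estimated_effort = [110, 120, 400]
print(pred(actual_effort, estimated_effort))
print(standardized_accuracy(actual_effort, estimated_effort))
```

Higher values are better for both measures. An SA near zero means the estimator does no better than random guessing, which is why SA is a useful complement to Pred when comparing MD techniques.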
