Preprocessing compensation techniques for improved classification of imbalanced medical datasets
Agnieszka Wosiak, Sylwia Karbowiak
DOI: http://dx.doi.org/10.15439/2017F82
Citation: Proceedings of the 2017 Federated Conference on Computer Science and Information Systems, M. Ganzha, L. Maciaszek, M. Paprzycki (eds). ACSIS, Vol. 11, pages 203–211 (2017)
Abstract. This paper studies the problem of applying classification techniques to medical datasets with class imbalance. The aim is to identify factors that negatively affect classification results and to propose actions that may improve performance. To alleviate the impact of uneven and complex class distributions, methods of balancing the datasets are proposed and compared. The experiments were conducted on five datasets: three binary and two multiclass. Each experiment combined one of several data preprocessing methods with one of several classification techniques. The study shows that for some datasets there exists a combination of a particular preprocessing method and classification technique that outperforms the other approaches. For datasets with a complex distribution or too many features, the ratio of correctly predicted labels may remain low regardless of which resampling method and classification technique is applied.
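As an illustration of what balancing a dataset can mean in practice, the following is a minimal sketch of random oversampling, one common resampling strategy. The abstract does not say which balancing methods were used, so this is not presented as the authors' procedure; the function name `random_oversample` and the toy data are hypothetical.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class has as many rows as the largest class (illustrative helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Rows of this class in the original data; duplicates are drawn from here.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels

# Toy imbalanced data (made up for illustration): four "healthy", one "sick".
rows = [[0.1], [0.2], [0.3], [0.4], [0.9]]
labels = ["healthy", "healthy", "healthy", "healthy", "sick"]

balanced_rows, balanced_labels = random_oversample(rows, labels)
class_counts = Counter(balanced_labels)
print(class_counts)
```

When a sketch like this is combined with cross-validation, the resampling is usually applied to the training portion only, so that duplicated rows do not leak into the data used for evaluation.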