Alternatives for greedy discrete subsampling: various approaches including cluster subsampling of COVID-19 data with no response variable

Lubomír Štěpánek; Filip Habarta; Ivana Malá; Luboš Marek

DOI: http://dx.doi.org/10.15439/2021F87

Citation: Position and Communication Papers of the 16th Conference on Computer Science and Intelligence Systems, M. Ganzha, L. Maciaszek, M. Paprzycki, D. Ślęzak (eds). ACSIS, Vol. 26, pages 103–111 (2021)

Abstract. An exhaustive selection of all possible combinations of n = 400 from N = 698 observations of the COVID-19 dataset was used as a benchmark. Drawing a random set of subsamples and choosing the one that minimized the averaged sum of squares of each variable's category frequencies returned results similar to those of a "forward" subselection, which reduces the dataset one observation at a time so that the same metric keeps decreasing. These approaches work similarly to k-means clustering (with a random number of clusters) of the original dataset's observations, followed by choosing a subsample from each cluster proportionally to its size. However, the approaches differ significantly in computational time.
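As a rough illustration of the first two approaches named in the abstract, the following Python sketch may help. It is not the authors' implementation. The abstract does not define the metric precisely, so the sketch assumes that it is the sum of squared relative category frequencies of each variable, averaged over the variables. The function names (`balance_score`, `best_random_subsample`, `greedy_reduction`) and the toy data are hypothetical and chosen only for this example.

```python
import random
from collections import Counter


def balance_score(rows):
    """Assumed metric: for each categorical variable, sum the squared
    relative frequencies of its categories, then average over variables.
    Lower values mean more evenly represented categories."""
    n_vars = len(rows[0])
    total = 0.0
    for j in range(n_vars):
        counts = Counter(row[j] for row in rows)
        total += sum((c / len(rows)) ** 2 for c in counts.values())
    return total / n_vars


def best_random_subsample(data, n, n_trials, seed=0):
    """Draw n_trials random subsamples of size n and keep the one
    with the lowest score."""
    rng = random.Random(seed)
    best_idx, best_score = None, float("inf")
    for _ in range(n_trials):
        idx = rng.sample(range(len(data)), n)
        score = balance_score([data[i] for i in idx])
        if score < best_score:
            best_idx, best_score = sorted(idx), score
    return best_idx, best_score


def greedy_reduction(data, n):
    """Remove observations one at a time, each time dropping the one
    whose removal yields the lowest score, until n observations remain.
    This naive version rescans all candidates at every step, so it is
    only practical for small toy inputs."""
    keep = list(range(len(data)))
    while len(keep) > n:
        drop = min(
            keep,
            key=lambda i: balance_score([data[k] for k in keep if k != i]),
        )
        keep.remove(drop)
    return keep, balance_score([data[k] for k in keep])


# Hypothetical toy data: two categorical variables, unbalanced categories.
toy = [("a", "x")] * 6 + [("b", "y")] * 2

random_idx, random_score = best_random_subsample(toy, n=4, n_trials=200)
greedy_idx, greedy_score = greedy_reduction(toy, n=4)
print(random_score, greedy_score)
```

On this toy input both procedures move toward a subsample in which the categories are equally represented. The difference lies in cost: the random search evaluates a fixed number of candidate subsamples, while the one-by-one reduction re-evaluates the metric for every remaining observation at every step.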
