
Proceedings of the 17th Conference on Computer Science and Intelligence Systems

Annals of Computer Science and Information Systems, Volume 30

A short note on post-hoc testing using random forests algorithm: Principles, asymptotic time complexity analysis, and beyond


DOI: http://dx.doi.org/10.15439/2022F265

Citation: Proceedings of the 17th Conference on Computer Science and Intelligence Systems, M. Ganzha, L. Maciaszek, M. Paprzycki, D. Ślęzak (eds). ACSIS, Vol. 30, pages 489–497 (2022)


Abstract. When testing whether a continuous variable differs between categories of a factor variable or their combinations, while taking other continuous covariates into account, one may use an analysis of covariance. Several post-hoc methods, such as Tukey's honestly significant difference test and Scheffé's, Dunn's, or Nemenyi's tests, are well established for the case when the analysis of covariance rejects the hypothesis that there is no difference between any of the categories. However, these methods are statistically rigid and usually require meeting statistical assumptions. In this work, we address the issue with a practically assumption-free, random forest-based algorithm that classifies individual observations into the factor's categories using the dependent continuous variable and the covariates as input. The higher the proportion of trees classifying the observations into two different categories, the more likely it is that the two categories differ statistically. To adjust the method's type I error rate, we change the complexity of the random forest's trees by pruning, which modifies the proportion of highly complex trees. Besides simulations that demonstrate a relationship between the tree pruning level, tree complexity, and type I error rate, we analyze the asymptotic time complexity of the proposed random forest-based method compared to established techniques.
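
To make the core of the algorithm concrete, the following minimal R sketch illustrates the classification step described above. It is not the paper's reference implementation: the randomForest package, the simulated data frame d (factor group, continuous response y, covariate x), and the pairwise per-tree voting statistic are assumptions introduced here for illustration, and the tree-pruning step that the paper uses to tune the type I error rate is omitted.

## Minimal illustrative sketch (assumed setup, not the authors' code)
library(randomForest)

set.seed(1)
d <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 50)),       # factor's categories
  y     = c(rnorm(50, 0), rnorm(50, 0.8), rnorm(50, 0)),  # dependent continuous variable
  x     = rnorm(150)                                       # continuous covariate
)

## Classify observations into the factor's categories from y and the covariate
rf <- randomForest(group ~ y + x, data = d, ntree = 500)

## Per-tree predictions for every observation (n x ntree matrix of class labels)
per_tree <- predict(rf, newdata = d, predict.all = TRUE)$individual

## For a pair of categories, the share of trees whose majority votes for the
## two categories differ; higher values hint at a statistical difference
pairwise_share <- function(g1, g2) {
  vote1 <- apply(per_tree[d$group == g1, ], 2, function(v) names(which.max(table(v))))
  vote2 <- apply(per_tree[d$group == g2, ], 2, function(v) names(which.max(table(v))))
  mean(vote1 != vote2)
}

pairwise_share("A", "B")   # groups simulated to differ: share expected to be high
pairwise_share("A", "C")   # groups simulated alike: share expected to be low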

References

  1. Geoffrey Keppel and Thomas D. Wickens. Design and analysis. 4th ed. Upper Saddle River, NJ: Pearson, Jan. 2004.
  2. John W. Tukey. “Comparing Individual Means in the Analysis of Variance”. In: Biometrics 5.2 (June 1949), p. 99. http://dx.doi.org/10.2307/3001913.
  3. H. Scheffé. The analysis of variance. Wiley Classics Library. Nashville, TN: John Wiley & Sons, Feb. 1999.
  4. Olive Jean Dunn. “Multiple Comparisons among Means”. In: Journal of the American Statistical Association 56.293 (Mar. 1961), pp. 52–64. http://dx.doi.org/10.1080/01621459.1961.10482090.
  5. Myles Hollander and Douglas Alan Wolfe. Nonparametric Statistical Methods. 2nd ed. Wiley Series in Probability & Statistics: Applied Section. Nashville, TN: John Wiley & Sons, Feb. 1999.
  6. Ellen R. Girden. ANOVA: Repeated measures. 84. Sage, 1992. ISBN: 0803942575.
  7. Kenneth L. Lange, Roderick J. A. Little, and Jeremy M. G. Taylor. “Robust Statistical Modeling Using the t Distribution”. In: Journal of the American Statistical Association 84.408 (Dec. 1989), p. 881. http://dx.doi.org/10.2307/2290063.
  8. A. Charnes, E. L. Frome, and P. L. Yu. “The Equivalence of Generalized Least Squares and Maximum Likelihood Estimates in the Exponential Family”. In: Journal of the American Statistical Association 71.353 (Mar. 1976), pp. 169–171. http://dx.doi.org/10.1080/01621459.1976.10481508.
  9. R. A. Bailey. Design of Comparative Experiments. Cambridge Series in Statistical and Probabilistic Mathematics, no. 25. Cambridge, England: Cambridge University Press, Apr. 2008.
  10. Leo Breiman. Classification and regression trees. New York: Chapman & Hall, 1993. ISBN: 9780412048418.
  11. Leo Breiman. “Random Forests”. In: Machine Learning 45.1 (2001), pp. 5–32. http://dx.doi.org/10.1023/a:1010933404324.
  12. Lubomír Štěpánek, Filip Habarta, Ivana Malá, et al. “Analysis of asymptotic time complexity of an assumption-free alternative to the log-rank test”. In: Proceedings of the 2020 Federated Conference on Computer Science and Information Systems. IEEE, Sept. 2020. http://dx.doi.org/10.15439/2020f198.
  13. Kawther Hassine, Aiman Erbad, and Ridha Hamila. “Important Complexity Reduction of Random Forest in Multi-Classification Problem”. In: 2019 15th International Wireless Communications & Mobile Computing Conference (IWCMC). 2019, pp. 226–231. http://dx.doi.org/10.1109/IWCMC.2019.8766544.
  14. R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria, 2017. URL: https://www.R-project.org/.
  15. Lubomír Štěpánek, Pavel Kasal, and Jan Měšt’ák. “Evaluation of facial attractiveness for purposes of plastic surgery using machine-learning methods and image analysis”. In: 2018 IEEE 20th International Conference on e-Health Networking, Applications and Services (Healthcom). IEEE, Sept. 2018. http://dx.doi.org/10.1109/healthcom.2018.8531195.
  16. Lubomír Štěpánek, Pavel Kasal, and Jan Měšt’ák. “Machine-learning at the service of plastic surgery: a case study evaluating facial attractiveness and emotions using R language”. In: Proceedings of the 2019 Federated Conference on Computer Science and Information Systems. IEEE, Sept. 2019. http://dx.doi.org/10.15439/2019f264.
  17. Lubomír Štěpánek, Pavel Kasal, and Jan Měšt’ák. “Machine-Learning and R in Plastic Surgery – Evaluation of Facial Attractiveness and Classification of Facial Emotions”. In: Advances in Intelligent Systems and Computing. Springer International Publishing, Sept. 2019, pp. 243–252. http://dx.doi.org/10.1007/978-3-030-30604-5_22.
  18. Lubomír Štěpánek, Pavel Kasal, and Jan Měšt’ák. “Machine-learning at the service of plastic surgery: a case study evaluating facial attractiveness and emotions using R language”. In: Proceedings of the 2019 Federated Conference on Computer Science and Information Systems. IEEE, Sept. 2019. http://dx.doi.org/10.15439/2019f264.
  19. Lubomír Štěpánek, Pavel Kasal, and Jan Měšt’ák. “Evaluation of Facial Attractiveness after Undergoing Rhinoplasty Using Tree-based and Regression Methods”. In: 2019 E-Health and Bioengineering Conference (EHB). IEEE, Nov. 2019. http://dx.doi.org/10.1109/ehb47216.2019.8969932.
  20. Lubomír Štěpánek, Filip Habarta, Ivana Malá, et al. “A Machine-learning Approach to Survival Time-event Predicting: Initial Analyses using Stomach Cancer Data”. In: 2020 International Conference on e-Health and Bioengineering (EHB). IEEE, Oct. 2020. http://dx.doi.org/10.1109/ehb50910.2020.9280301.
  21. Lubomír Štěpánek, Filip Habarta, Ivana Malá, et al. “A random forest-based approach for survival curves comparing: principles, computational aspects and asymptotic time complexity analysis”. In: Proceedings of the 16th Conference on Computer Science and Intelligence Systems. IEEE, Sept. 2021. http://dx.doi.org/10.15439/2021F89.
  22. Lubomír Štěpánek, Filip Habarta, Ivana Malá, et al. “Data envelopment analysis models connected in time series: A case study evaluating COVID-19 pandemic management in some European countries”. In: 2021 International Conference on e-Health and Bioengineering (EHB). Iasi, Romania: IEEE, Nov. 2021. http://dx.doi.org/10.1109/EHB52898.2021.9657597.
  23. Owen Jones, Robert Maillardet, and Andrew Robinson. Introduction to Scientific Programming and Simulation Using R. Chapman and Hall/CRC, Mar. 2009. http://dx.doi.org/10.1201/9781420068740.