A short note on post-hoc testing using the random forest algorithm: Principles, asymptotic time complexity analysis, and beyond

When testing whether a continuous variable differs between categories of a factor variable or their combinations, taking into account other continuous covariates, one may use an analysis of covariance. Several post-hoc methods, such as Tukey's honestly significant difference test, Scheffé's, Dunn's, or Nemenyi's test, are well established for the case when the analysis of covariance rejects the hypothesis that there is no difference between any categories. However, these methods are statistically rigid and usually require meeting statistical assumptions. In this work, we address the issue using a random forest-based algorithm, practically assumption-free, that classifies individual observations into the factor's categories using the dependent continuous variable and covariates on input. The higher the proportion of trees classifying observations into two different categories, the more likely a statistical difference between the categories. To adjust the method's first-type error rate, we change the complexity of the random forest's trees by pruning, thus modifying the proportion of highly complex trees. Besides simulations that demonstrate a relationship between the tree pruning level, tree complexity, and first-type error rate, we analyze the asymptotic time complexity of the proposed random forest-based method compared to established techniques.


I. INTRODUCTION
COMPARING the means of a continuous variable across two or more categories (or their combinations) of one or more factor variables, and detecting significant mutual differences, if any, is very common in applied statistics. Particularly when the dependent variable needs to be adjusted by other continuous covariates, an analysis of covariance (ANCOVA) is the tool of choice [1].
Since the analysis of covariance only tests whether there is, in general, a difference between at least two categories of a given factor, the big question is where exactly the statistical difference lies, i.e., which exact two (or more) categories of the factor the significant difference arises from.
For this reason, post-hoc tests are usually applied to identify the significantly different categories or their combinations. Some of them are well established, for instance, the Tukey honestly significant difference (HSD) test [2], Scheffé's test [3], Dunn's test [4], or, if needed, Nemenyi's test with a reduced set of assumptions required to be met [5].
However, the covariance analysis and the post-hoc tests are limited by relatively tough statistical assumptions, usually in terms of normality or independence of observation subsamples. Furthermore, empirically, when there are multiple methods for one task, that usually implies each method is limited somehow and, consequently, there is no "a priori first-choice" method routinely working in all situations.
This work introduces a new post-hoc method based on a random forest algorithm to overcome the mentioned limitations. Each classification tree the random forest model consists of has its own complexity, i.e., its number of leaf nodes, which determines whether it can classify observations into only one category or into more than one. The continuous dependent variable and the continuous covariates are the variables by which an entry sample of observations is split into subsamples, using logical formulas with the variables and searched cut-offs. The categories given by the factor variable (or more factor variables) entering the analysis of covariance may thus serve not only as an input for the analysis of covariance but also as the output (target) classes for the random forest algorithm. If the number of trees in the random forest model with sufficient complexity, i.e., classifying into two or more factor categories or their combinations, is high enough, then the hypothesis that there is no statistical difference between the two categories is hardly likely. As a tuning parameter, the pruning level may affect how complex the trees in a random forest are.
Proceedings of the 17th Conference on Computer Science and Intelligence Systems, pp. 489-497, DOI: 10.15439/2022F265, ISSN 2300-5963, ACSIS, Vol. 30.
After revisiting the well-established methods, we describe the principles behind the random forest-based algorithm for post-hoc testing, derive the asymptotic time complexity of the proposed method, and estimate a feasible number of trees in the random forest model with respect to the number of other factors and continuous covariates. Eventually, we run simulations to compare the new method to the established ones and, in particular, describe a relationship between the random forest tree pruning level, tree complexity, and the model's first-type error rate.

II. PRINCIPLES AND ASSUMPTIONS OF ANALYSIS OF COVARIANCE AND POST-HOC TESTS REVISITED
In this section, we recapitulate basic principles of analysis of covariance and commonly used post-hoc tests to refresh their logic and mention their assumptions and limitations.
A. Analysis of covariance (ANCOVA) - principles, assumptions, and limitations
Principles of ANCOVA. Analysis of covariance is a linear model standing in between the analysis of variance (ANOVA) [6] and linear regression [1], [7]. While the analysis of variance assumes there is a continuous dependent variable and independent categorical factors, linear regression allows for independent covariates, as does the analysis of covariance. However, compared to linear regression, ANCOVA estimates effect sizes as excesses above or below the covariate variable average and enables one to elegantly estimate the proportion of the continuous dependent variable's variability explained by the covariates. Furthermore, the analysis of covariance is performed particularly when the continuous covariates are not of much interest compared to the factors. That being said, inference tests for the coefficients of the covariates are usually skipped.
A model of the analysis of covariance, including k ∈ ℕ categorical factor variables and m ∈ ℕ continuous covariates, is, for the i-th observation out of n ∈ ℕ observations in total, as follows,

y_i = μ + Σ_{j=1}^{k} α_j + Σ_{l=1}^{m} β_l · x_{i,l} + ε_i,   (1)

where y_i is a value of the dependent continuous variable for the i-th observation, μ is a grand total mean of the dependent variable, α_j is an effect of the j-th factor on the i-th observation, with ∀j ∈ {1, 2, ..., k}, β_l is a coefficient (slope) of the l-th covariate, with ∀l ∈ {1, 2, ..., m}, x_{i,l} is a value of the l-th covariate for the i-th observation, and ε_i is a residual term of the i-th observation, respectively.
Firstly, the coefficients β_l for ∀l ∈ {1, 2, ..., m}, listed in a vector β = (β_1, β_2, ..., β_m)^T, are estimated as those minimizing the sum of squared residuals [8] from formula (1), ignoring the (not yet estimated) effects α_j of the factor variables, thus

β̂ = argmin_β Σ_{i=1}^{n} (y_i − μ − Σ_{l=1}^{m} β_l · x_{i,l})².   (2)

So far, considering formula (2), the analysis of covariance is similar to linear regression. Secondly, once the vector β = (β_1, β_2, ..., β_m)^T of linear coefficients is estimated, a part close to a multifactorial analysis of variance follows. The total sum of squares, SS_tot = Σ_{i=1}^{n} (y_i − ȳ)², describing the total variability of the dependent variable [9], is corrected (reduced) by the variability explained by the continuous covariates, SS_{β_l}, yielding SS*_tot,

SS*_tot = SS_tot − Σ_{l=1}^{m} SS_{β_l},   (3)

considering a vector of the dependent variable y = (y_1, y_2, ..., y_n)^T, a vector of the l-th covariate x_l = (x_{1,l}, x_{2,l}, ..., x_{n,l})^T, and the grand mean ȳ* adjusted by the correction.
Finally, a k-way analysis of variance for the estimation of the coefficients α_j is applied. The null hypothesis claiming that the j-th factor does not affect the dependent continuous variable, i.e., H_0: α_j = 0, is formulated k times and tested using the adjusted sum of squares, SS*_tot, decomposed into components for the factors and for the residuals. Using formula (3) and assuming the j-th factor has n_j categories and the c-th category has n_{j,c} observations, the decomposition is as follows [9],

SS*_tot = Σ_{j=1}^{k} SS_{j-th factor} + SS_ε = Σ_{j=1}^{k} Σ_{c=1}^{n_j} n_{j,c} · (ȳ_{j,c} − ȳ*)² + SS_ε,   (4)

where SS_{j-th factor} is the sum of squares for the j-th factor, SS_ε is the sum of squares for the residuals, and ȳ_{j,c} is an average of all values that belong to the c-th category of the j-th factor. Consequently, using formula (4), the null hypothesis H_0: α_j = 0 for the j-th factor is rejected on confidence level 1 − α if and only if

F = (SS_{j-th factor} / (n_j − 1)) / (SS_ε / df_2) > F_{1−α}(df_1, df_2),   (5)

where df_1 = n_j − 1, df_2 = n + k − 1 − Σ_{j=1}^{k} n_j, and F_{1−α}(df_1, df_2) is the (1−α)-th quantile of the Fisher-Snedecor distribution with df_1 and df_2 degrees of freedom, respectively. Rejecting the null hypothesis for the j-th factor does not determine which categories of the factor mutually differ significantly, though.
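As a plain illustration of the decomposition and F statistic above, the following Python sketch computes a one-way simplification of formulas (4) and (5). The paper's own experiments used R; this helper, its name, and the one-way simplification are ours, not the paper's code.

```python
def anova_f(groups):
    """One-way ANOVA F statistic for `groups`, a list of lists of floats.

    Illustrates the sum-of-squares decomposition SS_tot = SS_factor + SS_eps
    and F = (SS_factor/df1) / (SS_eps/df2). Returns (F, df1, df2).
    """
    n = sum(len(g) for g in groups)          # total number of observations
    k = len(groups)                          # number of categories
    grand = sum(sum(g) for g in groups) / n  # grand mean
    # between-category sum of squares (the factor's component)
    ss_factor = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    # within-category (residual) sum of squares
    ss_eps = sum((y - sum(g) / len(g)) ** 2 for g in groups for y in g)
    df1, df2 = k - 1, n - k
    f = (ss_factor / df1) / (ss_eps / df2)
    return f, df1, df2
```

Comparing F against the Fisher-Snedecor quantile F_{1−α}(df1, df2) then decides the rejection; the quantile itself is not computed here, since it is not available in the Python standard library.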

B. Post-hoc tests -principles, assumptions, and limitations
Assuming the null hypothesis has been rejected for the j-th factor, one would like to determine which exact two or more categories of the factor significantly differ. Let us denote the average values of observations that belong to categories c_r and c_s of the j-th factor, with r, s ∈ {1, 2, ..., n_j}, as μ̄_r and μ̄_s. Usually, the categories c_r and c_s of the j-th factor significantly differ when some inequality using data parameters or estimates holds, as shown below, applying the mathematical notation from formulas (4) and (5). The decision process may be repeated for each pair of categories r, s of the j-th factor to examine all possible differences.
1) Tukey honestly significant difference (HSD) test: Based on Tukey, the averages of the categories c_r and c_s significantly differ if

|μ̄_r − μ̄_s| > q(α, k, n−k) · √(σ̂²/2 · (1/n_{c_r} + 1/n_{c_s})),   (6)

where q(α, k, n−k) is the studentized range critical value for confidence level α, σ̂² is the residuals' variance, n is the sample size, k is the number of factors, and n_{c_r} and n_{c_s} are the numbers of observations in categories c_r and c_s. The Tukey HSD method assumes that the subsamples for the compared categories are independent, of the same variability (homoskedasticity), and follow a normal distribution [2].
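A minimal sketch of the Tukey HSD decision rule follows. The studentized range distribution is not in the Python standard library, so the critical value q_crit is passed in by the caller; the balanced-design denominator used below is a simplifying assumption of this sketch, not the paper's formula (6).

```python
import math

def tukey_hsd_differs(mean_r, mean_s, sigma2, n, k, q_crit):
    """Tukey HSD decision sketch: do categories r and s differ?

    mean_r, mean_s: category averages; sigma2: residual variance;
    n: total sample size; k: number of groups; q_crit: studentized
    range critical value q(alpha, k, n - k), supplied by the caller.
    Assumes a balanced design with n // k observations per group.
    """
    n_per_group = n // k
    hsd = q_crit * math.sqrt(sigma2 / n_per_group)  # honestly significant difference
    return abs(mean_r - mean_s) > hsd
```

In practice, q_crit would come from a statistics package (e.g., a studentized range quantile function); here it is a plain parameter.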
2) Scheffé's test: Following Scheffé's (unweighted) test, the averages of the categories c_r and c_s significantly differ if

|μ̄_r − μ̄_s| > √((n_j − 1) · F_{1−α}(df_1, df_2) · σ̂² · (1/n_{c_r} + 1/n_{c_s})),   (7)

where df_1 = n_j − 1 and df_2 = n + k − 1 − Σ_{j=1}^{k} n_j. Scheffé's test is less limited than Tukey's HSD test since there is no explicit assumption of any normal distribution of observations; however, it has lower statistical power [3].

3) Dunn's test:
Transforming the values of the dependent variable y_i that belong to the categories c_r and c_s from the initial continuous ones to their ranks, we get their averages w̄_r and w̄_s. Dunn's test recommends considering the categories c_r and c_s as different when

|w̄_r − w̄_s| > z_{1−α/2} · √((n(n+1)/12 − Σ_t (n_t³ − n_t)/(12(n−1))) · (1/n_{c_r} + 1/n_{c_s})),   (8)

where t is a possible tied value of the ranks w_r and w_s, n_t is a count of tied ranks at value t, and n_{c_r} and n_{c_s} are the numbers of observations in categories c_r and c_s, respectively.
While assumption-free, Dunn's test may fail to identify significant differences between categories due to its low statistical power [4].
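The midrank transform that Dunn's test starts from (ties share the average of the ranks they occupy) can be sketched as follows; this is an illustrative helper, not the paper's code.

```python
def average_ranks(values):
    """Midrank transform used by rank-based tests such as Dunn's:
    tied values all receive the average of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        # extend the run of tied values
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j + 2) / 2  # ranks are 1-based: average of i+1 .. j+1
        for idx in order[i:j + 1]:
            ranks[idx] = avg
        i = j + 1
    return ranks
```

For example, `average_ranks([10, 20, 20, 30])` assigns the two tied values the shared rank 2.5.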

4) Nemenyi's test:
Similarly to Dunn's test, assuming the average ranks w̄_r and w̄_s of the categories c_r and c_s, these significantly differ if

|w̄_r − w̄_s| > (q(α, k, n−k)/√2) · √(n(n+1)/12 · (1/n_{c_r} + 1/n_{c_s})),   (9)

where q(α, k, n−k) is the studentized range critical value for confidence level α, n is the sample size, k is the number of factors, and n_{c_r} and n_{c_s} are the numbers of observations in categories c_r and c_s, respectively.
Nemenyi's test is nonparametric and robust enough but may suffer from low statistical power [5].

III. CLASSIFICATION TREES AND RANDOM FORESTS
Before introducing the proposed method, we briefly point out important pieces of knowledge about classification trees and random forests.

A. Classification trees -principles and assumptions
Classification trees from the CART family (classification and regression trees) split a hyperspace of k ∈ ℕ explanatory variables (continuous or categorical) into disjoint hyper-rectangles, fitting simple (constant) models there by minimizing a given criterion [10].
An observation given by a vector of values x_i = (x_{i,1}, x_{i,2}, ..., x_{i,k})^T is classified into one of m classes of a target variable by a set of rules that comes from node formulas created while the tree is growing, as described in Fig. 1. Initially, the root covers all observations until a node rule, i.e., a found explanatory variable and a cut-off value minimizing the given criterion, partitions the dataset into two parts. Each part is then again split by a new rule set for a child node. The process is recursively repeated as the tree grows, by which a set of node rules successively splits the input dataset into more parts that are mutually more and more different. The process of the tree growing, called top-down induction of a decision tree (TDIDT), is stopped by a stopping criterion, e.g., a maximum number of leaf (terminal) nodes, a maximum tree depth, etc. Let π_f(n_t) be the proportion of all observations that belong, by the rules of all nodes from the root to the leaf one, to a target class f. A leaf node n_t classifies into the class c*_f if c*_f = argmax_{f ∈ {1,2,...,n_j}} π_f(n_t). Since each node is, throughout the tree growing, a leaf one (for a limited time), the criterion has to be minimized when searching for the node n_t rule. There are several commonly used criteria, also called impurity measures, Q_{n_t}(T), such as the misclassification error (10), the Gini index (11), or the deviance (cross-entropy) (12),

Q_{n_t}(T) = 1 − max_f π_f(n_t),   (10)
Q_{n_t}(T) = Σ_f π_f(n_t) · (1 − π_f(n_t)),   (11)
Q_{n_t}(T) = −Σ_f π_f(n_t) · log π_f(n_t).   (12)

One can easily see that the lower the impurity measure is, the higher π_f(n_t), i.e., the proportion of a target class f in the node n_t, has to be, as expected.
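The three impurity measures (10)-(12) can be computed directly from a node's class proportions; a short sketch (the function name is ours):

```python
import math

def impurities(props):
    """Impurity measures for class proportions `props` (summing to 1):
    misclassification error (10), Gini index (11), deviance (12)."""
    miscls = 1 - max(props)
    gini = sum(p * (1 - p) for p in props)
    # deviance/cross-entropy; terms with p == 0 contribute 0 by convention
    deviance = -sum(p * math.log(p) for p in props if p > 0)
    return miscls, gini, deviance
```

A pure node such as `[1.0]` yields zero for all three measures, while a maximally mixed two-class node `[0.5, 0.5]` maximizes them, matching the remark that lower impurity corresponds to a more dominant class proportion.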
Classification trees, as depicted in Fig. 1, in order to minimize the leaf nodes' impurity, tend to overfit the node rules on a given dataset, which manifests as the tree's typical "overgrowing", i.e., high complexity. To avoid this, besides some other naive approaches, pruning is commonly applied. Firstly, let us use the usually defined cost-complexity function,

C_λ(T) = Σ_{n_t ∈ {n_t}} |{x_{n_t}}| · Q_{n_t}(T) + λ · |{n_t}|,   (13)

where {n_t} is the set of leaf nodes of the tree and {x_{n_t}} is the set of all observations constrained by the rules coming from the root to the node n_t. The idea of the pruning is to find, for a given λ, a subtree T_λ ⊆ T that minimizes the statistic C_λ(T), i.e., T_λ = argmin_T (Σ_{n_t ∈ {n_t}} |{x_{n_t}}| · Q_{n_t}(T) + λ · |{n_t}|). The λ ≥ 0 is a tuning parameter that governs the trade-off between high tree complexity and size (for low values of λ) and tree parsimony and reproducibility on other datasets (for large values of λ).
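Once a tree's leaves are summarized as (number of observations, impurity) pairs, the cost-complexity function (13) is a one-liner; a sketch under that assumed representation:

```python
def cost_complexity(leaves, lam):
    """Cost-complexity C_lambda(T) from formula (13).

    `leaves`: list of (n_obs_in_leaf, impurity) pairs, one per leaf node;
    `lam`: the pruning parameter lambda (>= 0). The second term penalizes
    the number of leaves, i.e., the tree's complexity.
    """
    return sum(n * q for n, q in leaves) + lam * len(leaves)
```

Pruning then amounts to searching, among the subtrees of T, for the one with the lowest value of this function; a larger `lam` pushes the optimum toward smaller trees.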

B. Principles of the random forests
Random forests are finite sets of (distinct) classification trees, described in detail above, each classifying a k-dimensional observation. It is worth mentioning that only k* < k variables are considered as possible partitioning variables in the node rules. The subset of k* variables from the original k explanatory variables is selected randomly using bootstrapping, which ensures the pre-selected k* variables are mutually independent enough. A flowchart of the random forest model building is in Fig. 2.
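The forest-construction loop of Fig. 2 can be sketched as follows. Here `build_tree` is a caller-supplied placeholder for any tree inducer, and `random.sample` stands in for the bootstrap-based variable pre-selection; both are simplifying assumptions of this sketch.

```python
import random

def build_forest(build_tree, k, k_star, t, seed=0):
    """Forest-construction loop from Fig. 2: until t trees exist,
    preselect k_star < k variables at random and grow a tree on them.

    `build_tree(variables)` is a hypothetical tree inducer supplied by
    the caller; only the loop and the variable subsetting are sketched.
    """
    rng = random.Random(seed)
    forest = []
    while len(forest) < t:                      # "are there less than t trees?"
        variables = rng.sample(range(k), k_star)  # per-tree variable preselection
        forest.append(build_tree(variables))
    return forest
```

For instance, `build_forest(lambda v: tuple(sorted(v)), k=10, k_star=3, t=5)` produces five "trees", each remembering its own random 3-variable subset.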
Neither classification trees nor random forests have important assumptions or limitations worth speaking of.

IV. THE PROPOSED METHOD FOR POST-HOC TESTING
In this section, we introduce a novel alternative for post-hoc testing based on a random forest algorithm. Considering the ANCOVA notation, the categories of a factor that contains statistically different effects on the continuous dependent variable are the leaf node classes each tree of a random forest classifies into. The dependent variable and the covariates, and other factor variables, if any, are the entry variables that serve for the node rules if needed. Each tree of the random forest model can either classify only into one category (as a root-node tree) or into two or more categories, based on its complexity (size). For details, see Fig. 3.
The more trees of sufficient complexity can classify into the classes (categories) in the forest, the more likely we can reject the null hypothesis that there is no difference between the effects of the factor's categories on the dependent variable. This is formally done by ANCOVA, too. What is more, if a proportion of trees classifying into two given categories is large enough, considering all trees, the given two categories seem to be of statistically different effect on the dependent variable [12].
Fig. 2. A construction of the random forest model involving t decision trees.
A proportion of trees classifying into two or more categories to all trees in the random forest is close to a point estimate of the p-value. The p-value is the probability that we incorrectly reject the null hypothesis of no different effects of the factor categories on the dependent variable, assuming the null hypothesis is true. Thus, the method also provides statistical inference as a post-hoc test. Since we can modify the random forest's tree complexity (size), i.e., also the tendency to classify either only into one or into two or more classes, by pruning and the tuning parameter λ, we may control the first-type error rate, i.e., the incorrect rejection of the null hypothesis when it is true, of the random forest model as an inferential post-hoc test. Due to the random forest algorithm behind it, the proposed method is almost assumption-free.
Besides the derivation of the inferential properties of the method, we also discuss the method's asymptotic time complexity and run a simulation study with a varying tuning parameter λ to describe the relationship between the parameter and the method's first-type error rate.

A. Statistical inference behind the proposed method
Using the mathematical notation from the previous sections, let us assume the ANCOVA has already rejected the null hypothesis that the j-th factor does not affect the dependent variable. Thus, the question is which two (or more) categories of the j-th factor are significantly different, so that the factor influences the dependent variable.
Fig. 3. An example of a root-node tree (on the left) not able to classify into any class unambiguously, an example of a tree with sufficient complexity (in the middle) able to classify into two classes (j = 1 and j = 2), and an example of a tree with sufficient complexity (on the right) able to classify into three classes (j = 1, j = 2, and j = 3).
Intuitively, when a large number of the (appropriately pruned) trees of the random forest model can classify into two given classes, i.e., categories, then one can hardly suppose the categories do not statistically differ.
Similarly to the post-hoc tests, let the null hypothesis H_0 claim that there is no statistical difference between the given two categories c_r and c_s of the j-th factor; the alternative hypothesis H_1 claims the opposite, so H_0: μ_r = μ_s versus H_1: μ_r ≠ μ_s. Whenever a post-hoc test rejects the null hypothesis H_0 in favor of the alternative hypothesis H_1, the case is equivalent to a situation where the test's p-value is lower than or equal to an a priori set significance level α, usually equal to 0.05.
By definition, the p-value is the probability of obtaining data at least as extreme as the data actually observed, assuming the null hypothesis is true. Let t_c be the number of trees in the random forest model that are in contradiction to the null hypothesis (under the null hypothesis assumption). Then, the value of t_c is equal to the number of all trees classifying, besides other classes, into the given two classes (categories) c_r and c_s, showing that there is a difference between the two classes. Let I_{c_r,c_s}(τ) be an indicator function returning 1 if and only if the tree τ classifies into the classes (categories) c_r and c_s (regardless of whether it classifies into other classes, too); thus,

t_c = Σ_{τ ∈ random forest} I_{c_r,c_s}(τ)   and   p-value = 1 − (t_c − 1)/t.   (14)

Thus, formula (14) shows that the p-value's estimate is equal to the fraction 1 − (t_c − 1)/t. Intuitively, suppose the number t_c of trees in the random forest model that are complex enough to classify, besides others, into the categories c_r and c_s is generally low. In that case, such a model is not "much" in contradiction to the null hypothesis about no difference between the two categories. Thus, when t_c is relatively low, the fraction p-value = 1 − (t_c − 1)/t is relatively high and close to 1, so it is unlikely to be lower than α (= 0.05). The null hypothesis then probably fails to be rejected. However, for high values of t_c, i.e., when there are many trees in the forest with sufficient complexity classifying into the two categories c_r and c_s (thus, in contradiction to the null hypothesis), then, due to the high value of t_c, the fraction p-value = 1 − (t_c − 1)/t is relatively low and perhaps below the α level. Consequently, the null hypothesis is likely rejected.
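Given the per-tree indicator values I_{c_r,c_s}(τ), the p-value estimate of formula (14) reduces to a few lines (the function name is ours):

```python
def rf_posthoc_pvalue(trees_classify_pair):
    """p-value estimate from formula (14).

    `trees_classify_pair`: list of booleans, one per tree, each the value
    of the indicator I_{cr,cs}(tau) (does the tree classify into both
    categories c_r and c_s?). Returns 1 - (t_c - 1)/t.
    """
    t = len(trees_classify_pair)          # total number of trees
    t_c = sum(trees_classify_pair)        # trees contradicting H0
    return 1 - (t_c - 1) / t
```

With t = 1000 trees, a forest where every tree separates the two categories gives an estimate of 1/t = 0.001, while one with only 10 such trees gives 0.991, matching the intuition in the text.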
Indeed, the λ parameter determines how complex the trees in the random forest are, i.e., how radical the pruning of the trees is. Investigating formula (13), one can realize that if λ = 0, then there is no penalization for large tree complexity, so the trees in the random forest are generally very complex (i.e., of large size). So, whenever there are at least two observations, one from the category c_r and the other from the category c_s of the j-th factor, all trees in the forest would classify those observations into their categories, i.e., for each tree τ it holds that I_{c_r,c_s}(τ) = 1, which results in t_c = Σ_{τ ∈ random forest} I_{c_r,c_s}(τ) = t, and, thus, the p-value estimate is p-value = 1 − (t_c − 1)/t = 1 − (t − 1)/t = 1/t ≈ 0. Finally, if p-value ≈ 0, then also p-value < α, which, consequently, results in the rejection of the null hypothesis. However, when the null hypothesis is rejected often, a false rejection is also very likely, which increases the first-type error rate. A high chance of the null hypothesis rejection also means high statistical power, though, i.e., the case when an incorrect null hypothesis is correctly rejected.
For λ > 0, the penalization for tree complexity (size) is applied, so the trees' complexity (size) decreases, and, thus, many of the trees, if not all, do not classify into both the c_r and c_s categories. This means that there are trees τ in the random forest such that I_{c_r,c_s}(τ) = 0, and, finally, t_c = Σ_{τ ∈ random forest} I_{c_r,c_s}(τ) < t. So, the p-value estimate is p-value = 1 − (t_c − 1)/t > 0, and it could be below or above the α level.
B. A feasible lower bound of the number of trees t in the random forest
Adopting the ANCOVA mathematical notation, when two categories c_r and c_s of the j-th factor are compared using the random forest-based method, the other k − 1 factors, together with the m covariates and the originally dependent continuous variable, play the role of input variables for the node decision rules. Assuming that the j′-th factor, j′ ≠ j, contains n_{j′} ≥ 2 categories and that the cut-offs for the covariates are usually estimated as midpoints of the covariates' ranges, splitting the ranges into a number of intervals corresponding to the categories to be classified into, i.e., n_j, we may estimate a minimum number of mutually different trees. Each mentioned feature may or may not be included in a tree; that being said, the number of all combinations of the k − 1 factor node rules is at least ∏_{j′ ∈ {1,2,...,k}\{j}} n_{j′} ≥ 2^{k−1}, and the number of all combinations of the m covariates and the dependent variable, split into n_j intervals, is at least n_j^{m+1}. Thus, the minimum number of mutually different trees and, therefore, the feasible lower bound of the number of the random forest's trees is

t ≥ ∏_{j′ ∈ {1,2,...,k}\{j}} n_{j′} · n_j^{m+1} ≥ 2^{k−1} · n_j^{m+1}.

Furthermore, since the number of trees t determines the decimal precision of the p-value estimate based on formula (14), if we ask for a decimal precision of d ∈ ℕ digits, then the minimum number of trees in the random forest is about t > 10^{d+1}, to ensure feasible precision for the d-th decimal digit.
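Both lower bounds on the forest size t can be combined in a small helper. The combinatorial term below (2^(k−1) · n_j^(m+1)) is reconstructed from the prose above and should be treated as an approximation, not the paper's exact expression.

```python
def min_trees(k, m, n_j, d):
    """Feasible lower bound on the number of trees t.

    k: number of factors; m: number of covariates; n_j: number of
    categories of the compared factor; d: required decimal digits of
    p-value precision. Returns the larger of the combinatorial bound
    2^(k-1) * n_j^(m+1) and the precision bound 10^(d+1).
    """
    combinatorial = 2 ** (k - 1) * n_j ** (m + 1)
    precision = 10 ** (d + 1)
    return max(combinatorial, precision)
```

For few variables the precision bound dominates (e.g., k = 3, m = 2, n_j = 2, d = 2 gives 1000 trees), while for many covariates the combinatorial bound quickly takes over.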

C. A brief asymptotic time complexity analysis of the proposed method
The random forest model consists of classification trees as atomic units, constructed following the flowchart in Fig. 2 and Algorithm 1. A decision tree is induced until the moment all its leaf nodes include only observations of one category of the j-th factor, i.e., a sequence of node rules coming from the root node to the leaf one successively limits the entry dataset to one-class observations [13]. When the categories are well balanced across the dataset, each binary partitioning halves them, and the average tree depth is around log₂ n levels; thus, the asymptotic time complexity is also Θ(log₂ n), assuming one split of a node takes one atomic time unit. However, if the categories are totally unbalanced, each splitting cuts the current dataset of size n* into 1 and n* − 1 observations, which takes n steps in total. Then, the tree depth is n, so the asymptotic time complexity is Θ(n). Assuming there are k − 1 factors, m covariates, and the originally dependent variable, i.e., k − 1 + m + 1 = k + m variables, searched through on average n/2 observations within each node splitting, the asymptotic time complexity Θ(tree) of one tree induction is therefore between Θ((k + m) · (n/2) · log₂ n) (best-case scenario) and Θ((k + m) · (n/2) · n) (worst-case scenario),

Θ((k + m) · (n/2) · log₂ n) ≤ Θ(tree) ≤ Θ((k + m) · (n/2) · n).   (15)

Algorithm 1: Top-down induction of one classification tree
1: initialize the tree with a root node covering all observations
2: while there is a node such that the data constrained by all node rules coming from the root to the node belong to ≥ 2 classes do
3:   find for the node a splitting variable and a splitting point minimizing the criterion Q_{n_t}(T)
4:   add to the node two child nodes n_left and n_right
5: end while

Since a random forest contains t trees, each built in Θ(tree) time by (15), the asymptotic time complexity Θ(forest) of the entire random forest construction is

Θ(t · (k + m) · (n/2) · log₂ n) ≤ Θ(forest) ≤ Θ(t · (k + m) · (n/2) · n).   (16)

One model of the random forest provides one (point) estimate of the p-value using formula (14), enabling one to statistically distinguish between two categories.
In comparison, the estimation of the decision rules for the post-hoc tests using formulas (6), (7), (8), and (9) usually takes only several linear steps, assuming the SS_{j-th factor} term in (7) is precalculated. Fortunately, the time complexity (16) is still polynomial. Furthermore, since the building of the random forest with the complexity (16) is based on the induction of independent trees, it can be parallelized; if the random forest building were parallelized into π ≤ t independent slave processes, each inducing a bunch of t/π ∈ ℕ trees, the time complexity (16) would be reduced by a factor of π.

V. SIMULATION STUDY

To compare the established post-hoc tests with the proposed method, particularly its first-type error rate, we ran a simulation study generating many n × (k + m + 1) datasets with n observations, k factor variables, m covariates with various relationships between the variables, and, lastly, the continuous dependent variable. For each post-hoc test, i.e., the Tukey HSD test, Scheffé's test, Dunn's test, Nemenyi's test, and the random forest-based method, we compare two categories of a selected factor such that the categories have non-significantly different averages of the continuous dependent variable, and we check how many times the methods claim there is a significant difference. Thus, in theory, we measure the first-type error rate. The simulation was repeated for different values of the λ parameter to illustrate how the value of λ determines the first-type error rate of the new method, i.e., what the ideal λ values are to control the first-type error rate at a feasible level.
The datasets were generated as follows. One of the k factors, say the j-th one, contained two categories, c_r and c_s, following a normal distribution with the same average of the continuous dependent variable, i.e., N(0, 1). The other k − 1 factors always split the dependent continuous variable into 2 to 4 categories with random averages from N(μ, σ²), where μ ∈ ⟨−1, 1⟩ and σ² ∈ ⟨0, 2⟩. Furthermore, the covariates also followed a normal distribution N(μ, σ²), where μ ∈ ⟨−1, 1⟩ and σ² ∈ ⟨0, 2⟩, and the correlations r between the dependent continuous variable and the covariates were drawn randomly from r ∈ ⟨−0.5, 0.5⟩. The continuous variable served as the dependent one for all post-hoc tests, with the exception of the proposed method, where the j-th factor plays the role of the dependent (target) variable.
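A sketch of generating one such dataset under the null hypothesis follows. The paper's generator was written in R; here Python is used, and the correlation-inducing mixing scheme (adding r · y to each covariate) is an illustrative assumption, not the paper's exact procedure.

```python
import random

def simulate_dataset(n, m, seed=0):
    """One simulated dataset under H0: a two-category factor whose
    categories share the same N(0, 1) mean of the dependent variable,
    plus m weakly correlated covariates.

    Correlation with y is induced by mixing in y with a weight r drawn
    from <-0.5, 0.5>; this mixing is a simplification for illustration.
    """
    rng = random.Random(seed)
    factor = [rng.choice(("c_r", "c_s")) for _ in range(n)]
    y = [rng.gauss(0, 1) for _ in range(n)]     # same mean in both categories
    covariates = []
    for _ in range(m):
        r = rng.uniform(-0.5, 0.5)              # target correlation weight
        mu = rng.uniform(-1, 1)                 # covariate mean
        sd = rng.uniform(0, 2) ** 0.5           # sd from variance in <0, 2>
        covariates.append([r * y[i] + rng.gauss(mu, sd) for i in range(n)])
    return factor, y, covariates
```

Repeating this generator and feeding each dataset to the post-hoc tests, then counting rejections at α = 0.05, yields the first-type error rate estimates of Table I.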
There were 1000 datasets, as described above, generated in total for each λ ∈ {0.1, 0.3, 0.5, 0.7, 0.9}. The number of trees in each random forest was always t = 1000. The numbers of cases where the p-value was lower than or equal to α = 0.05, regardless of the post-hoc test, were summed up, indicating the point estimates of the first-type error rates, as illustrated in Table I. The simulation study was performed using the R programming language and environment [14]. There are more numerical applications of the R language to various fields in [15]-[23].
While the established post-hoc tests returned point estimates of the first-type error rate of about 0.050 (regardless of λ, since their formulas (6), (7), (8), and (9) are not functions of λ), the point estimates of the first-type error rates output by the introduced method progressively decreased with increasing values of λ, see Table I.
However, since the proposed random forest-based algorithm for the comparison of categories' averages is data-driven and heuristic, it is nontrivial to suggest specific values of the λ parameter.
TABLE I: POINT ESTIMATES OF THE FIRST-TYPE ERROR RATES FOR POST-HOC TESTS, I.E., TUKEY HSD TEST (T-HSD