A Non-Deterministic Strategy for Searching Optimal Number of Trees Hyperparameter in Random Forest

In this paper, we present a non-deterministic strategy for searching for the optimal number of trees hyperparameter in Random Forest (RF). Hyperparameter tuning in Machine Learning (ML) algorithms is essential: it optimizes the predictability of an ML algorithm and/or improves computer resource utilization. However, hyperparameter tuning is a complex and time-consuming optimization task. We set up experiments with the goal of maximizing predictability, minimizing the number of trees and minimizing the time of execution. Compared to the deterministic search algorithm, the non-deterministic search algorithm recorded an average accuracy of approximately 98%, an average improvement in the number of trees of 44.64%, a mean execution-time improvement ratio of 175.62 and an average improvement of 94% in iterations. Moreover, evaluations using Jackknife Estimation show stable and reliable results across several experiment runs of the non-deterministic strategy. The non-deterministic approach to hyperparameter search shows significant accuracy and better utilization of computer resources (i.e. CPU and memory time). This approach can be adopted widely in hyperparameter tuning and in conserving computer resources, as in green computing.


I. INTRODUCTION
ML performance tuning is aimed at improving the predictability of ML algorithms. Improving the performance of an ML system can be done by configuring a set of hyperparameters. Most ML algorithms have several hyperparameters to be configured. Hyperparameters specify the behavior of the underlying model. Hyperparameter tuning of ML algorithms is aimed at finding optimal values that can improve the algorithm's predictability while keeping the consumption of computer system resources to a minimum [6]. When adapting an ML algorithm to a specific dataset, hyperparameter tuning can be cumbersome and time consuming [13].
Manual search, grid search and Bayesian optimization are methods of hyperparameter optimization. Grid search is deterministic: it performs an exhaustive search over a predefined parameter space S = {0, 1, 2, ..., n}. The goal is to find the hyperparameter s in S that records the optimal accuracy. Grid search consumes a substantial amount of time and is computationally expensive; however, it gives accurate results [4]. Manual search involves selecting a value s in S, configuring it in the algorithm, executing the experiment and observing the accuracy. The process is repeated while comparing accuracies, and the hyperparameter that records the optimal accuracy is selected. Manual search is cumbersome and its results are difficult to reproduce [1]. Bayesian optimization stochastically and efficiently trades off exploration and exploitation of the parameter space. It also exploits historical information to find the parameters that maximize the objective function, informing the user of the configuration that best optimizes the predictability of the ML algorithm [5].
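As an illustration of the deterministic, exhaustive nature of grid search, the following is a minimal scikit-learn sketch; the candidate grid and the example dataset are our own assumptions, not the configurations used in this paper's experiments.

```python
# A minimal grid search over the number of trees in RF.
# The candidate grid and dataset are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [2, 4, 8, 16, 32, 64, 128, 256, 512]},
    cv=5,  # every candidate is fitted and cross-validated: exhaustive
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Every candidate in S is evaluated, which explains both the accuracy and the computational cost of this approach [4].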
This paper introduces a non-deterministic search algorithm. The algorithm randomly selects 10% of the elements in a parameter space. It then uses heuristics and termination conditions to maximize accuracy (acc) and minimize time of execution (t). The algorithm was applied and tested in selecting the optimal number of trees (θ) in Random Forest (RF). In this paper, Section II covers related works, Section III discusses the methodology and Section IV concludes the paper.

II. RELATED WORKS
Large scale machine learning systems at times involve a large number of parameters that are fixed manually; this is time consuming and at times inaccurate and difficult for a human expert. Hazan et al. [7] propose a hyper-parameter optimization strategy inspired by the analysis of boolean functions and focused on high-dimensional datasets. The algorithm is an iterative application of compressed sensing techniques for orthogonal polynomials, and it is tested on deep neural networks. In terms of running time, the algorithm is at least an order of magnitude faster than Hyperband and Bayesian Optimization and outperforms Random Search 8x. It requires only uniform sampling of the hyperparameters and is easily parallelizable [7]. This work guides ours: the authors develop an algorithm and test it in another algorithm, and their algorithm establishes heuristics for reducing the search space.
Experiments showed that accuracy increased when the number of trees in RF was doubled. However, there was a threshold beyond which there was no significant gain in accuracy. Therefore, increasing the number of trees does not always mean better performance can be attained [15]. We note that no variable was used to measure the computing resources consumed when varying the number of trees.
MapReduce was used to optimize regularization parameters for boosted trees and random forests (RF). For RF [2], two parameters were tuned: the number of trees in the model and the number of features selected to split each node. Experiments showed that performance was sensitive to the number of trees but less sensitive to the number of features in each split. Results showed that MapReduce could make parameter optimization feasible on a massive scale. However, it created possibilities for overfitting that could reduce accuracy and lead to inferior learning parameters [6].
In the technical report by [3], Breiman discusses manually setting up, using and understanding RF. He notes that RF grows trees rapidly and that setting up a large number of trees (e.g. 1000) is acceptable. He further notes that, if there are many variables, more trees (up to 5000) can be grown [3]. From this work we can set up experiments with a variable number of trees and observe their effects on computing resources.
ML algorithms often involve careful tuning of learning parameters and model hyper-parameters. This tuning is often a "black art" that requires expert experience, rules of thumb or sometimes brute-force search. To address this problem, the following techniques were used: a full Bayesian treatment of expected improvement, and algorithms (e.g. ANN) for dealing with variable time regimes and for running experiments in parallel. Results of this experiment surpassed a human expert at selecting hyper-parameters on the competitive CIFAR-10 dataset, beating the state of the art by over 3%. SVM was used as a case study algorithm [13].
A novel idea for approximate tree learning is seen in a sparsity-aware algorithm for sparse data and a weighted quantile sketch. The algorithm (XGBoost) proposes candidate splitting points according to percentiles of the feature distribution, maps the continuous features into buckets split by these candidates, aggregates the statistics and finds the best solution among the proposals based on the aggregated statistics. The work also provides insights on cache access patterns, data compression and sharding to build a scalable tree boosting system. The algorithm has been widely used and recognized in machine learning and data mining challenges, e.g. Kaggle and KDDCup 2015. It can be applied to machine learning systems and to solving real-world-scale problems using a minimal amount of resources [4].
Optimizing the parameter values of an evolutionary algorithm is a challenging activity. CMA-ES tuning algorithms gave better results, in terms of utility, in evolutionary algorithms. It is noted that using algorithms for tuning the parameters of evolutionary algorithms pays off in terms of performance, and that tuning algorithms give better parameter values than relying on intuition and the usual parameter-setting conventions [14].
It is challenging to create a large dataset and improve the trainability of deep neural network models (DNNs). A selection of supplemental training datasets was used in fine-tuning a high-performing neural network model. The ability of the Natural Language Processing system, as evaluated by Item Response Theory ability scores, is improved without negatively affecting generalization through overfitting [9].
In the department of Soil Survey in the Kenya Agriculture and Livestock Research Organization (KALRO) [10] and other soil research organizations, land evaluation is done manually; it is stressful, takes a long time and is prone to human error [11], [12]. Parallel RF experiment prototypes are set up in [11], with further experiments in [12]. Parallel RF, Linear Regression, Linear Discriminant Analysis, KNN, Gaussian Naive Bayes and Support Vector Machine are applied to predicting land suitability for crop (sorghum) production given soil properties information. Parallel RF had the best accuracy of 0.96 and a time of execution of 1.7 sec [12].
Despite assertions regarding the performance reliability of default parameters in RF, many RF experiments are fit using these values. An examination of the parameter sensitivity of RF in computational genomics was studied. Experiments were evaluated using Area Under Curve (AUC), Root Mean Square Error (RMSE) and cross-fold validation. RF performance was strongly affected by the number of trees, the sample size and the number of random variables used at each split. It was noted that tuned RF gave better results than default parameters/values. The effects of parameterization were analyzed using selection methods, showing that tuning can successfully improve the prediction accuracy of non-parametric ML algorithms [8].

III. METHODOLOGY

A. Considering 2 to 4096 Number of Trees
We considered a finite set of sorted numbers of trees in the parameter space. RF predictability was evaluated by acc, defined in equation 1 as acc = (1/n) Σᵢ 1(ŷᵢ = yᵢ) over n samples, where ŷᵢ is the predicted label and yᵢ is the original label. The results of acc and t are tabulated in Tables I and II respectively. In this case, we think 256 trees is the better choice because the change in accuracy is rather insignificant (−0.3) while it runs approximately 7x faster. Generally, we observed better results between 2 and 512 trees, and we assume these results can be extended to other datasets. We call the region between 2 and 512 the fertile region.
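The sweep behind Tables I and II can be sketched as follows. This is a minimal illustration, assuming scikit-learn and a stand-in dataset (load_digits) rather than the 14 datasets used in the experiments.

```python
# Sketch of the sweep over theta in {2, 4, ..., 4096}: train RF,
# then record accuracy (acc) and time of execution (t) per theta.
import time

from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

results = []
for theta in [2 ** k for k in range(1, 13)]:  # 2 .. 4096
    start = time.time()
    rf = RandomForestClassifier(n_estimators=theta).fit(X_tr, y_tr)
    acc = accuracy_score(y_te, rf.predict(X_te))  # equation 1
    results.append((theta, acc, time.time() - start))

for theta, acc, t in results:
    print(f"theta={theta:5d}  acc={acc:.3f}  t={t:.2f}s")
```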
Table II shows a general trend of time of execution increasing steadily with the number of trees. This tells us that more trees demand more computing resources. We also observed relatively significant changes in time of execution; the threshold values are in bold. Generally, after 64 trees, we see a significant change in time difference. Increasing the number of trees increases the time of execution, since more trees require more computer resources to build and average the de-correlated trees in RF.
Different datasets give different values of accuracy and time of execution for the same number of trees. The selected datasets have different complexity, i.e. dimensionality, number of records and number of classes. This leads to variation in accuracy and time of execution. To obtain an optimal number of trees hyperparameter in the RF classifier, it is important to consider maximizing accuracy while minimizing the number of trees.
However, we see the 6th dataset's maximum accuracy of 88.7%, with a time of execution of 6.42 seconds, occurring outside the fertile region, at 2048 trees. As per our experiments, this has a probability of 0.07, i.e. 1 out of 14 datasets can exhibit this. The second best accuracy of 87.9% is observed inside the fertile region, at 128 trees with a 0.5 second time of execution. In such instances, we can compromise accuracy to get a better time of execution; in this case, we compromise 0.8% accuracy to gain 5.92 seconds.

B. Considering 2 to 512 Number of Trees
In the fertile region, we observed lower time of execution and maximum accuracy; we will therefore have avoided searching regions (> 512 trees) that show higher time of execution and the same or lower accuracy. We defined a finite set of sorted numbers of trees from the parameter space θ. We configured, trained and tested RF with each respective θ and recorded acc and t. The results are shown in Figs. 1 and 2.

Fig. 1 is a box plot of accuracy of number of trees against datasets, across the 14 datasets in the fertile region. Most datasets had a low inter-quartile range, a low difference between the minimum and maximum points, and more outliers below the lower whiskers. Some box plots also recorded outliers above the upper whisker. A low inter-quartile range means there was low variation in accuracy around the median, and 50% of the accuracy records fall within this region. However, the outliers inform us that some maximum accuracy values were very far from the median, as were some of the lowest accuracy values. The goal of any data scientist is to attain the maximum accuracy when configuring RF with a specific number of trees. Nonetheless, we see variations in accuracy across datasets, i.e. different datasets record different accuracy levels. This makes the search problem more difficult, because we need a strategy that is dynamic enough to find the best accuracy on different datasets. This research was interested in finding the numbers of trees (i.e. the outliers above the upper whisker) that maximize accuracy.

Fig. 2 is a box plot of time of execution of number of trees against datasets, across the 14 datasets in the fertile region. We see the lower whiskers having almost the same time of execution. This means some numbers of trees could give almost the same minimum time of execution when configured in RF. We also see the lower whiskers being shorter than the upper whiskers; a shorter lower whisker means most of the lower times of execution were close to the median. This research was interested in those numbers of trees that minimize time of execution.
From this analysis, we formulated deterministic, non-deterministic and auto-configured (having 8 trees by default) algorithmic approaches to searching for the optimal number of trees hyperparameter in the fertile region.

C. Deterministic Hyperparameter Search
The deterministic search is defined in equation 2, from which we developed the deterministic hyperparameter search algorithm outlined in Algorithm 1. We considered the number of trees θ, time t and accuracy acc descriptions and results from Section III-B. The deterministic hyperparameter search algorithm's goal is to maximize acc and minimize θ.
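Equation 2 and Algorithm 1 are not reproduced here; a selection rule consistent with their description (the smallest θ that attains the maximum acc over the exhaustive sweep) can be sketched as below, reusing the hypothetical `results` list of (theta, acc, t) tuples from the sweep sketch in Section III-A.

```python
# Deterministic selection consistent with the stated goal:
# maximize acc, then minimize theta among the maximizers.
def deterministic_search(results):
    best_acc = max(acc for _, acc, _ in results)
    return min(theta for theta, acc, _ in results if acc == best_acc)
```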

D. The Non-Deterministic Hyperparameter Search Algorithm
In this research, we were interested in maximizing accuracy and minimizing the number of trees. Tables I and II show almost the same accuracy but different times of execution, and Table II shows that more trees require more time of execution (i.e. memory and CPU time). With this analogy, this research formulated a non-deterministic search approach to converge close to or at the maximum accuracy while minimizing the number of trees, thereby saving time of execution. The algorithm is outlined in Algorithm 2, where θᵢ = random(∈ T), ψ₁ = 1 + lim/100 and ψ₂ = 1 − lim/100.
We considered the θ, acc and t descriptions and results from Section III-B. The goal of this algorithm was to maximize acc and minimize t through randomization. In this algorithm we assume that ∃ acc_best ∈ acc with a corresponding θ_best. Note that the function GENERATE( ) returns 26 elements, which is approximately 10% of the elements in the parameter space. We iterate through the randomly selected numbers of trees as we configure RF. We considered percentage upper and lower bounds of acc_best. If acc_rand falls in the upper boundary, then acc_best ← acc_rand, θ_best ← θ_rand and we break, with the assumption that we do not anticipate a further percentage ∆acc_best. If acc_rand falls in the lower boundary and θ_rand is less than θ_best, then acc_best ← acc_rand, θ_best ← θ_rand and we also break, with the assumption that ∆acc_best is insignificant and we gain a better t. Moreover, if acc_rand falls above the upper boundary, then acc_best ← acc_rand, θ_best ← θ_rand, and we continue looping, with the assumption that we anticipate a further percentage ∆acc_best. Lastly, we break when the iteration count reaches 10% of the parameter space, with the assumption that we have uniformly sampled the whole parameter space. We set the percentage boundary to 1% to increase the algorithm's accuracy. Experiment results are tabulated in Tables III, IV and V.
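To make the control flow concrete, the following is a minimal Python sketch of this strategy, complementing the pseudocode in Algorithm 2 below; the scikit-learn calls, helper names and train/test handling are our own illustrative assumptions.

```python
# Sketch of the non-deterministic search: sample ~10% of the
# parameter space, then apply the boundary and break policies
# described above. Helper names are illustrative assumptions.
import random
import time

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def generate(space, k=26):
    """Randomly pick k distinct tree counts (~10% of the space)."""
    return random.sample(list(space), k)

def non_deterministic_search(X_tr, y_tr, X_te, y_te, space, lim=1.0):
    psi_1 = 1 + lim / 100.0  # upper percentage boundary
    psi_2 = 1 - lim / 100.0  # lower percentage boundary
    acc_best, theta_best = 0.0, None
    start = time.time()
    for theta in generate(space):
        rf = RandomForestClassifier(n_estimators=theta).fit(X_tr, y_tr)
        acc = accuracy_score(y_te, rf.predict(X_te))
        if acc > psi_1 * acc_best:
            # Above the upper boundary: keep the new best and keep
            # looping, anticipating a further gain in acc_best.
            acc_best, theta_best = acc, theta
        elif acc >= acc_best:
            # Within the upper boundary: accept and break, as no
            # further significant change in acc_best is anticipated.
            acc_best, theta_best = acc, theta
            break
        elif acc >= psi_2 * acc_best and theta < theta_best:
            # Within the lower boundary with fewer trees: trade an
            # insignificant accuracy loss for a faster model, break.
            acc_best, theta_best = acc, theta
            break
    return acc_best, theta_best, time.time() - start
```

In the experiments, `space` would hold the 256 candidate tree counts of the fertile region, so 26 samples correspond to the 10% stopping condition.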

Algorithm 2
The Non-Deterministic Hyperparameter Search

procedure GENERATE( )
    while LEN(vals) ≤ 26 do
        val ← RANDOM(∈ T)
        if val is not in vals then
            add val to vals
    return vals

procedure NONDETERMINISTICSEARCH(train, test)
    acc_rand, θ_rand, acc_best, θ_best, count ← 0
    for each θ_rand in T do
        rf ← RANDOMFOREST(θ_rand, train)
        (boundary checks and break policies of Section III-D)
        (acc_best, θ_best, count) ← (acc_rand, θ_rand, 0)
    return (acc_best, θ_best, time_spent)

E. Deterministic and Non-Deterministic Hyperparameter Search Algorithms, and Auto-Configured RF

Table III contains results and analysis of the minimum numbers of trees selected by the deterministic and non-deterministic hyperparameter search algorithms. We see a considerably good percentage improvement in the number of trees for the non-deterministic search algorithm. In some instances, for example datasets 8 and 13, the non-deterministic search algorithm perfectly converged to the minimum number of trees, with 26 and 2 iterations respectively. In some datasets, e.g. dataset 1, the percentage improvement in the number of trees was poor. Moreover, as observed in Table III, 50% of the datasets used less than 50% of the random values (i.e. less than 5% of the search space) while iterating to converge close to or at the maximum accuracy and minimum number of trees. With this observation, we can assume that, in some cases, increasing the search space would not have much scientific significance. Generally, the percentage improvement in the number of trees was 44.6% and the average number of iterations used was 14.5.
Table IV has results and analysis of the accuracy recorded from running the deterministic, non-deterministic and auto-configured RF algorithms. The auto-configured RF had a mean percentage difference of -5.46, while the non-deterministic search algorithm had a considerably better percentage change of -2.1. In the non-deterministic search algorithm, datasets 2, 8 and 13 recorded a zero percentage change in accuracy, and 50% of the datasets recorded a percentage change of more than 1%.
Table V has results and analysis of the time of execution of the deterministic and non-deterministic search algorithms, and the auto-configured RF. The ratios deterministic:non-deterministic and deterministic:auto-configured are calculated, along with their averages. The auto-configured RF and the non-deterministic algorithm record very high average ratios of 5623 and 176 respectively.
As discussed in Section III-C, the deterministic search algorithm is exhaustive and selects the minimum number of trees that attains the maximum accuracy. With these results, we benchmark the non-deterministic search algorithm and the auto-configured RF. The non-deterministic search algorithm, as discussed in Section III-D, uses the principles of randomization, heuristics and termination policies outlined in Algorithm 2. With this strategy, the non-deterministic search algorithm recorded ≈ 98% average accuracy and ran faster by an average ratio of 175.62, using an average of 14.5 iterations. Using the strategy formulated in Algorithm 2, the non-deterministic search algorithm recorded 100% accuracy in three instances and a zero percentage improvement in the number of trees in two instances. Moreover, in the non-deterministic search algorithm, we recorded numbers of trees below the threshold (64 trees) that marked a significant change in time of execution, as discussed in Section III-A. This means the formulated strategy worked quite well.

Considering dataset 2, we note that the 0% change in accuracy was obtained with more trees (48 instead of 46) but 34.8 times faster. This shows that the same accuracy can be attained with more trees but in a shorter searching time, which makes the strategy formulated in this research more relevant. Despite the 1% boundary policy and the breaking policies, 50% of the datasets recorded less than 1% change in percentage accuracy; the other 50% scored fairly good results too. Generally, a shorter time of execution means the tuning process spends less time in memory and less CPU time. We see the non-deterministic search algorithm run ≈ 175 times faster on average, achieving an average of ≈ 98% accuracy, using an average of 5.6% of the iterations (i.e. 14.5 of 256 iterations in the parameter space). This is an improvement in iterations of 94.4%. Therefore, the non-deterministic search algorithm can improve the utilization of computing resources while maintaining significant accuracy.
Auto-configuring RF (with 8 trees by default)

F. Evaluation using Jackknife Estimation
Jackknife estimation is used to evaluate the quality of the predictions of computational models. It uses resampling to calculate the standard error and estimate the bias of a sample statistic, as shown in equations 3 and 4 [16]. We computed Jackknife estimates across the 14 datasets and tabulated the results in Table VI. We recorded a bias and standard error of zero across all datasets.
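Equations 3 and 4 are the standard leave-one-out Jackknife standard-error and bias estimators; a minimal NumPy sketch of how such estimates can be computed (our own illustration, not the paper's evaluation code) is shown below.

```python
# Leave-one-out Jackknife: standard error (eq. 3) and bias (eq. 4)
# of a sample statistic; a minimal NumPy illustration.
import numpy as np

def jackknife(sample, statistic=np.mean):
    sample = np.asarray(sample, dtype=float)
    n = len(sample)
    theta_hat = statistic(sample)  # full-sample estimate
    # Leave-one-out replicates theta_(i).
    loo = np.array([statistic(np.delete(sample, i)) for i in range(n)])
    theta_dot = loo.mean()
    se = np.sqrt((n - 1) / n * np.sum((loo - theta_dot) ** 2))
    bias = (n - 1) * (theta_dot - theta_hat)
    return se, bias, theta_hat - bias  # bias-corrected estimate
```

Applied to the accuracies from repeated runs of the non-deterministic search, identical replicates yield zero standard error and zero bias, which is consistent with the values reported in Table VI.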
In Table VI we see different datasets recording different values of the bias-corrected Jackknifed estimates. We also observe that the results are stable, as per the predictions in Table IV. The standard error is used for null hypothesis testing and for computing confidence intervals (upper and lower bounds); this explains why we observe the confidence intervals deviating insignificantly. We also see the bias-corrected Jackknifed estimates deviating minimally, because the standard errors were zero across all records. These results show that the non-deterministic search algorithm's predictions are stable and reliable.

IV. CONCLUSION
In this research, we formulated a non-deterministic strategy for searching for the best hyperparameter in the Random Forest algorithm, considering the number of trees, accuracy and the time spent searching for the hyperparameter. The non-deterministic search strategy recorded significantly good results in maximizing accuracy, minimizing the number of trees and minimizing searching time. Evaluations using Jackknife Estimation show that its predictions are stable. Moreover, the non-deterministic search strategy had significant accuracy levels and better utilization of CPU processing and time in memory. This research can be widely adopted in hyperparameter search for algorithms and in green computing to conserve computing resources.

Table I shows a general trend of accuracy increasing steadily with the number of trees, then flattening. RF classification employs bagging principles, where a committee of trees each casts a vote for the predicted class. However, the RF classifier modifies bagging by building a large collection of de-correlated trees and then averaging them. When the number of trees becomes huge, we see RF accuracy varying insignificantly, meaning the average accuracy of the de-correlated trees varies insignificantly. The average accuracy varies because of the random nature of RF, for example the random selection of features when building trees. We further observed an interesting trend in the number of trees against accuracy: increasing the number of trees does not necessarily contribute positively to accuracy. The maximum accuracy values are in bold in Table I. Moreover, we see 13 out of the 14 datasets' maximum accuracy values found between 2 and 512 trees. Dataset 6 recorded its maximum accuracy of 88.7% at 2048 trees, taking 6.42 seconds; its second best accuracy of 88.4%, taking 0.89 seconds, was observed at 256 trees.

Table I: Accuracy (percentage) of RF with θ trees for the 14 datasets (DS)