Application of Diversified Ensemble Learning in Real-life Business Problems: The Case of Predicting Costs of Forwarding Contracts


Abstract-Finding an optimal machine learning model that can be applied to a business problem is a complex challenge that needs to provide a balance between multiple requirements, including a high predictive performance of the model, continuous learning and deployment, and explainability of the predictions. The topic of the FedCSIS 2022 Challenge: 'Predicting the Costs of Forwarding Contracts' is related to the challenges logistics and transportation companies are facing. To tackle these challenges, we established an entire Machine Learning framework which includes domain-specific feature engineering and enrichment, generic feature transformation and extraction, model hyperparameter tuning, and creating ensembles of traditional and deep learning models. Our contributions additionally include an analysis of the types of models which are suitable for the case of predicting a multi-modal continuous target variable, as well as an explainable analysis of the features which have the largest impact on predicting the value of these costs. We further show that ensembles created by combining multiple different models trained with different algorithms can improve the performance on unseen data. In this particular dataset, the experiments showed that such a combination improves the score by 3% compared to the best performing individual model.
Index Terms-Costs of Forwarding Contracts, explainability, prediction ensembles, Diversified Ensemble Learning

I. INTRODUCTION
To be competitive in the market, companies need to be able to utilize all available data and perform analytics to identify hidden patterns [1]. This can allow them to improve their processes, better understand their customers, and make predictions (e.g., churn prediction, service-outage prediction, fraud detection, etc.). To achieve such goals, companies face a variety of challenges, ranging from data integration from a variety of sources [1] and finding suitable machine learning models that are both performant and also practical and explainable [2], to maintaining the corresponding infrastructure. To perform such analytical data processing and machine learning on a large scale, companies require a complex computing infrastructure and methods that will minimize their total cost of ownership [3], and yet scale the computation to multiple nodes [4].
In this paper, we focus on the problem of finding an optimal machine learning algorithm that can be easily applied in a real-life business domain, meaning it should achieve high predictive performance, continuous learning and deployment, and explainability of the models and their predictions. The topic of the FedCSIS 2022 Challenge, hosted on the KnowledgePit portal, is 'Predicting the Costs of Forwarding Contracts' [5]. The competition addresses the challenges of transportation, shipping, and logistics companies related to their digital transformation. Particularly, the benefits of the research boosted by this competition for such companies can be multi-fold:
• Identify reasons and circumstances that lead to increased transportation costs.
• Improve companies' planning to lower the costs, and generally, improve their investment strategy.
• Help companies in selecting contracts that maximize their profits by predicting the forwarding (i.e., delivery) contract cost.
Similar real-world challenges were addressed at previous competitions on the KnowledgePit platform, such as predicting escalations in customer support [6], network device workload prediction [7], suspicious network event recognition [8], and predicting victories in video games [9], to name a few. These papers also demonstrate how predictions from individual solutions could be integrated into diversified ensembles to create more powerful and more robust models.
For the case study on which we focus in this paper, we choose to use the XGBoost model with a grid search with 5-fold cross-validation [10], due to its extensive use in retail sale predictions [11]. We also use Random Forest models with grid search, as well as deep learning models that are commonly used in demand forecasting in multi-channel retail [12]. Finally, Linear Regression models are among the most commonly used simple models for price prediction in industry [13]. All hyper-parameter tuning is done by exhaustively searching a specified subset of the hyper-parameter space of the given model. The validation is performed using 5-fold cross-validation, which balances validation speed and the accuracy of the performance estimate on the test data.
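To make this setup concrete, the snippet below is a minimal sketch of grid search with 5-fold cross-validation over an XGBoost regressor; the grid values and variable names are illustrative assumptions, not the exact settings used in the competition.

```python
# Minimal sketch of grid-search tuning with 5-fold CV (xgboost + scikit-learn);
# the grid below is illustrative, not the grid used in the competition.
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

param_grid = {
    "max_depth": [4, 6, 8],        # hypothetical search space
    "learning_rate": [0.05, 0.1],
    "n_estimators": [300, 600],
}

search = GridSearchCV(
    estimator=XGBRegressor(objective="reg:squarederror"),
    param_grid=param_grid,
    scoring="neg_root_mean_squared_error",  # the competition metric is RMSE
    cv=5,                                   # 5-fold cross-validation
)
# search.fit(X_train, y_train)   # X_train, y_train: engineered features and contract costs
# best_model = search.best_estimator_
```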
The rest of the paper is structured as follows. Section II reviews the most important related works. Section III describes the experimental setup and the multiple validation procedures we used to evaluate each model developed in this study. Section IV describes the preprocessing of the data, including data transformations and aggregations. Subsection IV-A contains information on the preprocessing implemented over the main table training and testing data, and subsection IV-B includes the preprocessing information for the routes table training and testing data. Section V describes the implemented feature selection methods. The experiments, including model hyper-parameter tuning, training, and evaluation techniques, are described in Section VI, along with an overview of the final scores of the implemented models. Section VII discusses the results, the limitations of the study, and opportunities for further work and improvement of the methods. The paper concludes with Section VIII, where we give a brief overview of the entire Machine Learning workflow.

II. LITERATURE REVIEW
Even though similar challenges have been extensively studied in other industries, this problem is fairly new in the logistics sector. Authors of [14] analyze the shipping cost differences between various carriers, and attempt to identify opportunities for reducing transportation costs.
Similarly, in [15], authors utilize neural networks to forecast shipping freight rates and compare them with traditional time series analysis models. The key objective of their work is to improve the forecasting accuracy of traditional time series analysis. In relation to the competition task, this article also highlights the importance of the information contained in forwarding freight agreements in relation to predictive accuracy.
Another interesting approach is presented in [16], where the impact of the demand and cargo capacity on the shipping price is identified. Forecasting of the long-term cost of logistics contracts is particularly important in long-term agreements with upfront-defined prices, such as in various types of tenders and auctions. On one hand, the bids should be attractive so that the contract can be won, while still being profitable for the logistic company. This challenge was researched in [17], which utilized historic data to train the models.
On a related topic, Men et al. [18] use an ensemble of mixture density neural networks for short-term wind speed and power forecasting. They show that this methodology works well for multi-step-ahead prediction. Additionally, [19] illustrates the use case of multi-observation and multi-dimensional data cleaning methods for applying machine learning algorithms. In this study [19], the authors use transactions from the Lending Club data set to train tree-based models to predict peer-to-peer (P2P) loan default and observe that the LightGBM algorithm, using multiple observational data, has the best performance. In many cases, it has been shown that decision tree-based methods significantly outperform linear models for predicting complex response variables, such as the example of predicting accrual expenses in a balance sheet by utilizing the unused vacation time of employees [20].
The scientific community has placed a massive effort into studying individual algorithms (e.g. ensemble algorithms, various deep learning architectures, etc.). Additionally, some studies also focus on finding ways to utilize the diverse algorithms and integrate their predictions. This process is often referred to as diversified ensemble learning and aims to find the best classification algorithms (out of many heterogeneous classification algorithms) and an optimal method to combine them [2]. Note that the individual algorithms used in a diversified ensemble could be ensembles on their own (e.g., XGBoost [21] or Random Forest [22]), so the term diversified ensemble learning refers to another layer of integration. Some methods train another classifier whose inputs are the predictions of the individual classifiers [23] or use other ways of voting. In this paper, algorithms perform weighted voting based on empirically identified weights.

III. VALIDATION PROCEDURE
As in all practical machine learning problems, the experimental setup concerning the training/validation/test split should resemble the natural chronological and logical process as closely as possible, so that the models built are valid and robust over time. In that regard, we attempted to split the training dataset into two subsets, one for training and one for validation, in a way that we thought would most resemble the natural setting in which the data was collected. Considering that this is a very practical problem coming from the industry, any results of the transformation and validation methods should be applicable in a production setting.
That being said, we considered the id payer column, the client identifier, as special because it gave us the ability to use it primarily for splitting the original training set into our training and validation subsets. For this purpose, we first analyzed the frequency of rows in the main table per id payer, dubbing it the number of contracts. We noticed a huge discrepancy in the frequency of contracts, ranging from just a few to upwards of thousands. Therefore, we tried several approaches to account for this fact:
• Split by alternating frequency of records per id payer. In this approach, we ordered the id payer records by the number of contracts, and we assigned them to our training or validation split in alternating order. The idea was that roughly 50% of the records would be in our training set, and the other 50% in the validation set. One additional benefit of this approach was that it made sure that the id payer column would not have an effect on the prediction. With this approach we are very conservative with respect to overfitting, training the models on one subset of the data and applying them to a completely new set of data. Indeed, our first submissions showed that our own validation results were considerably worse than the leaderboard results, but were still consistent when comparing different algorithms or feature subsets (the better models per our internal evaluation were also better on the leaderboard).
• Time-sensitive split. We also tested splitting the data in such a way that the older contracts (records with an earlier start date) were in the training set, while newer records were in the validation set. This approach mitigates the previous conservativeness by allowing the same clients to be in the training and validation set, while also allowing some new clients to appear in the validation set.
After the initial testing of the previous approaches, we noticed that the hyper-parameter tuning procedures performed on our hold-out training set (a subset of the competition training set) were not fully applicable when we used the whole training dataset provided in the competition. Namely, we used our hold-out training set and the remainder of the training set to learn the hyper-parameters. Then, we compared two models trained with the same hyper-parameters: one using the hold-out training set, and another trained on the full training dataset. The former performed significantly better on the leaderboard, even though it was trained on a smaller data set. With this counter-intuitive finding, which contradicts the common principle that more training data is better, and having very limited time for this competition, we decided to use 5-fold cross-validation in the remaining experiments so that we could use the full training dataset for making the final test predictions. Despite that, we strongly believe that further experiments on the validation procedure are needed to properly tackle the problem.
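For illustration, the sketch below shows how the two candidate splits described above could be implemented with pandas before we settled on 5-fold cross-validation; the column names (id_payer, route_start_datetime) are assumptions based on the dataset description.

```python
import pandas as pd

def alternating_split_by_payer(df: pd.DataFrame):
    """Assign clients to train/validation in alternating order of contract count."""
    counts = df["id_payer"].value_counts()        # contracts per client, descending
    train_payers = set(counts.index[0::2])        # every other client goes to training
    train_mask = df["id_payer"].isin(train_payers)
    return df[train_mask], df[~train_mask]

def time_sensitive_split(df: pd.DataFrame, train_fraction: float = 0.8):
    """Older contracts go to training, newer ones to validation."""
    df_sorted = df.sort_values("route_start_datetime")
    cutoff = int(len(df_sorted) * train_fraction)
    return df_sorted.iloc[:cutoff], df_sorted.iloc[cutoff:]
```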

IV. DATA PREPROCESSING
After the initial data exploration phase, we decided to primarily focus on the main table and extract whatever knowledge we could from it, before proceeding with utilizing the detailed table of expected routes. Firstly, the prim train line and prim ferry line features were not used, due to the high missing-data ratio (between 80% and 90% of the total number of observations missing). Additionally, these columns contained unstandardized data (e.g., temperature ranges, or temperature and unit combined in the same string as a descriptive field, etc.). For the remaining columns which had missing data, we applied mean filling (for continuous columns) or median filling (for nominal data).

A. Main Table
The transformations done on the Main Table were split into two major types:
• One-hot encoding of categorical (nominal) data. We considered utilizing the Weight of Evidence [24] approach, but considering that the categorical features had a relatively small number of different values, the one-hot encoding technique was considered sufficient.
• Combinations of two or more features to create a new meaningful feature. Such features were a result of calculations based on the columns that contain date or timestamp information.
1) Nominal to numeric features with one-hot encoding: The one-hot encoding was done to maximize data balance while minimizing the loss of information. Binary features or features with a few different values were transformed with classic one-hot encoding. The features with over 15 distinct values were split into three major categories: low-frequency categories (values that appeared fewer than 1000 times in the data set), high-frequency categories (values that appeared over 1000 times in the data set), and the single most frequent value of the feature, which was kept as an individual category.
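A minimal sketch of this frequency-based bucketing before one-hot encoding is shown below; the column name and the helper are hypothetical, and missing values are assumed to have been imputed beforehand (Section IV).

```python
import pandas as pd

def bucketed_one_hot(df: pd.DataFrame, col: str, threshold: int = 1000) -> pd.DataFrame:
    """Bucket a high-cardinality nominal column, then one-hot encode it.

    The single most frequent value keeps its own indicator; the remaining values
    are grouped into 'high_freq' (over `threshold` occurrences) or 'low_freq'.
    Assumes missing values were already imputed.
    """
    counts = df[col].value_counts()
    top_value = counts.index[0]

    def bucket(value):
        if value == top_value:
            return str(value)
        return "high_freq" if counts[value] > threshold else "low_freq"

    return pd.get_dummies(df[col].map(bucket), prefix=col)

# Hypothetical usage:
# dummies = bucketed_one_hot(main_df, "id_currency")
# main_df = pd.concat([main_df.drop(columns=["id_currency"]), dummies], axis=1)
```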
2) New domain-specific features based on other features: a) Date-time related features: Combinations of two or more features were used to derive the route start numeric and time taken minutes features. The route start numeric is the difference in days between the minimum date found in the dataset (i.e., 1/1/2016) and the start date of the specific route. It was computed from the route start datetime feature by finding the number of days between it and 1/1/2016. Similarly, the time taken minutes is the time the complete route is estimated to take, in minutes. This is calculated by finding the difference in minutes between the route start datetime and route end datetime features.
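A sketch of these two derived features is given below, assuming pandas and hypothetical column names route_start_datetime and route_end_datetime.

```python
import pandas as pd

REFERENCE_DATE = pd.Timestamp("2016-01-01")  # minimum date found in the dataset

def add_date_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive route_start_numeric (days since 1/1/2016) and time_taken_minutes."""
    out = df.copy()
    start = pd.to_datetime(out["route_start_datetime"])
    end = pd.to_datetime(out["route_end_datetime"])
    out["route_start_numeric"] = (start - REFERENCE_DATE).dt.days
    out["time_taken_minutes"] = (end - start).dt.total_seconds() / 60.0
    return out
```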
b) Geo-spatial features: To enhance the geo-spatial information about the routes, we created a new feature, using the Euclidean distance [25] between the route starting point and route ending point. This calculation uses the latitude and longitude values of the original points. Additionally, we used the geo-spatial (Haversine) distance [26], given in the competition dataset.
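The two distance measures can be sketched as follows; the Euclidean distance is computed in latitude/longitude degree space as described above, while the Haversine distance was already provided in the competition data and is included here only as a reference formula.

```python
import numpy as np

def euclidean_distance(lat1, lon1, lat2, lon2):
    """Straight-line distance between start and end points in degree space."""
    return np.sqrt((lat1 - lat2) ** 2 + (lon1 - lon2) ** 2)

def haversine_km(lat1, lon1, lat2, lon2, earth_radius_km=6371.0):
    """Great-circle (Haversine) distance in kilometres."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * earth_radius_km * np.arcsin(np.sqrt(a))
```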

B. Routes table
The initial experiments were conducted using only data from the Main Table. To further improve model performance in later experiments, we enriched the dataset with aggregate features extracted from the Routes Table. The columns which had missing data were very sparse in the general case. Moreover, the lack of entries seemed correlated in most cases. For this reason, we decided to ignore such columns and not create features based on them, especially considering the limited time we had for experiments. Still, we believe that more sophisticated data imputation methods could be explored in the future, or at least some bins of values could be prepared for the cases when such data is available. The columns ignored for this reason were: ferry line, train line, and another 17 columns whose names start with vehicle or id vehicle.
We extracted the aggregate features by grouping the dataset based on the id contract column. Before the aggregation, one-hot encoding was performed on the step type feature with the goal of extracting the number of steps of each type taken in one route. Aggregate features were then extracted for each route, as sketched below.
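A minimal sketch of this per-contract aggregation is shown below; the column names step_type and id_contract follow the dataset description, but the exact identifiers and the set of aggregates are assumptions.

```python
import pandas as pd

def aggregate_routes(routes_df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate route steps per contract (hypothetical column names)."""
    # One-hot encode the step type so that the number of steps of each type
    # taken in one route can be counted after grouping.
    step_dummies = pd.get_dummies(routes_df["step_type"], prefix="steps")
    enriched = pd.concat([routes_df[["id_contract"]], step_dummies], axis=1)
    aggregated = enriched.groupby("id_contract").sum()
    aggregated["n_steps"] = routes_df.groupby("id_contract").size()
    return aggregated.reset_index()

# Hypothetical usage: enrich the main table with the aggregates.
# merged = main_df.merge(aggregate_routes(routes_df), on="id_contract", how="left")
```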

V. FEATURE SELECTION

A. Manual Filtering of Correlated Features
This method was used for training the Linear Regression models. Since maximum likelihood estimates (MLE) [27] can be highly disturbed by correlated features, we decided to manually remove features with correlations greater than 0.6 in absolute value. For this purpose, we calculated the Pearson correlation [28] between each pair of continuous variables in the main dataset, and excluded the highly correlated features from the feature set used for training the linear regression models.
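The filtering was performed manually in our case; the sketch below automates the same criterion for illustration, greedily keeping a feature only if its absolute Pearson correlation with every already-kept feature is at most 0.6.

```python
import pandas as pd

def drop_correlated(continuous_df: pd.DataFrame, threshold: float = 0.6) -> pd.DataFrame:
    """Keep a feature only if its |Pearson correlation| with all kept features <= threshold."""
    corr = continuous_df.corr(method="pearson").abs()
    kept = []
    for col in corr.columns:
        if not kept or (corr.loc[col, kept] <= threshold).all():
            kept.append(col)
    return continuous_df[kept]
```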
The linear regression models were trained using only the uncorrelated features from the main dataset, plus higher-degree terms of some of the most important features chosen by applying domain knowledge. These features include the total kilometers, the time taken, and the maximum weight. This was done in order to use the results of the most basic linear regression models as an internal evaluation baseline for all the other trained models. The experiments for the Linear Regression models were conducted using only the Main Table data. The correlation map of the continuous features is displayed in Figure 1. As evident from the figure, many features have high correlations (positive and negative), so removing these dependencies was one of the feature selection methods we implemented.

B. XGBoost Feature Importance
Another method for feature selection was using the built-in feature importance metric from the Extreme Gradient Boosting (XGBoost) model for regression. The process of optimizing this model is detailed in Section VI. Figure 2 represents the feature importance obtained from XGBoost on the full dataset (containing the main table and routes table data). One feature, namely direction d (a binary feature stating whether the direction is d or not), dominates the rest of the features in the dataset. Sometimes these built-in feature importance estimates can be inaccurate, as we suspected in this case. The reason behind this is that XGBoost weights the features based on the frequency of splitting, the gain, and the coverage metrics. In the case of categorical variables, the frequency of splitting can be very low, since there are few possible split points, contributing to an overall lower importance than the actual one for categorical variables. Additionally, the gain for continuous features can be lower than that of the categorical ones, since more split points can be made, each of which may rule out fewer examples compared to splits on categorical features.
In our case, we are creating shallow individual decision trees as weak learners. This means that each tree has fewer levels, so splitting by a binary (or categorical) feature that is correlated with the target variable can yield a higher gain than splitting by an equally correlated continuous feature, because the continuous feature might rule out fewer examples in each split. In turn, this can result in inaccurate calculations of the overall feature importance when using a mix of categorical and continuous features.
For this reason, we further try to extract the important features using wrapper methods [29] over the XGBoost algorithm, which is explained in the following subsection.

C. Boruta search with Shapley values
Boruta search [30] is a wrapper algorithm originally built around the Random Forest classification model, but later extended to all types of decision tree-based models and to regression. The method implements feature selection by creating copies of the original features and shuffling them to remove any correlation with the target variable. These copies are called shadow features. The algorithm then compares the shadow features' Z-scores [31] to the original features' Z-scores. Each feature that fails a two-sided test for a significant difference in importance against the shadow feature with maximum importance is removed from the dataset. The feature importance, in this case, was measured using Shapley values [32]. Shapley values are often used as a method for explainable AI (XAI) because they reveal the average marginal contribution of a feature value across all possible coalitions.
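As a rough illustration of the idea, the sketch below performs a single pass of the shadow-feature comparison using SHAP importances over an XGBoost regressor; the actual Boruta procedure repeats this over many iterations with a statistical test, and the sketch assumes a purely numeric feature matrix.

```python
import numpy as np
import pandas as pd
import shap
from xgboost import XGBRegressor

def boruta_shap_single_pass(X: pd.DataFrame, y, random_state: int = 42) -> list:
    """Single-pass approximation of Boruta with Shapley-value importances."""
    rng = np.random.default_rng(random_state)
    # Shadow features: shuffled copies that break any relation to the target.
    shadows = X.apply(lambda col: rng.permutation(col.values))
    shadows.columns = [f"shadow_{c}" for c in X.columns]
    X_ext = pd.concat([X, shadows], axis=1)

    model = XGBRegressor(objective="reg:squarederror").fit(X_ext, y)
    shap_values = shap.TreeExplainer(model).shap_values(X_ext)
    importance = pd.Series(np.abs(shap_values).mean(axis=0), index=X_ext.columns)

    # Keep only original features more important than the strongest shadow feature.
    max_shadow = importance[shadows.columns].max()
    return [c for c in X.columns if importance[c] > max_shadow]
```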
The results of implementing this feature selection method over the XGBoost algorithm are shown in Figure 3. A total of 48 features were identified as important, and their importance compared to the shadow features is displayed in the figure.
From the figure, we can again see that the features euclidean distance, direction d, and km total dominate in their importance for the algorithm compared to all of the other features in the dataset, which means that their impact is most significant in determining the expenses variable's values.

VI. EXPERIMENTS

A. Linear Regression
We first experimented with Linear Regression models using the features in the main table, standardized according to the needs of the algorithm. The ferry intervals, train intervals, and id service type features were scaled using Min-Max scaling, while all other continuous features were scaled using Standard scaling [33]. The root mean squared error (RMSE) of this primary model, using the train/validation split for validation, was 0.6703. The RMSE of this model on the leaderboard was 0.4598.
The same model was later modified to include only the features that are not correlated, according to the OLS [34] statistical test for feature relationships and the correlations represented in Figure 1. Using only those features, the model had a RMSE of 0.6942 on the validation data set, and a RMSE of 0.5027 on the leaderboard.
Finally, the squared values of the features km total, max weight and time taken minutes were added to the model. This improved the model's RMSE on the validation set to 0.5713, and the RMSE of the test set to 0.4309. The coefficients and their significance are shown in Figure 4.
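A minimal sketch of this final linear model, with the scaling described above and the added squared terms, is shown below; the column names are hypothetical and, in practice, the scalers should be fitted on the training split only.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler, StandardScaler

def build_linear_features(df: pd.DataFrame) -> pd.DataFrame:
    """Scale the features and add the squared terms used by the final linear model."""
    out = df.copy()
    minmax_cols = ["ferry_intervals", "train_intervals", "id_service_type"]
    out[minmax_cols] = MinMaxScaler().fit_transform(out[minmax_cols])

    continuous_cols = ["km_total", "max_weight", "time_taken_minutes"]
    out[continuous_cols] = StandardScaler().fit_transform(out[continuous_cols])

    for col in continuous_cols:                    # squared terms of the key features
        out[f"{col}_sq"] = out[col] ** 2
    return out

# Hypothetical usage:
# X = build_linear_features(train_df[feature_cols])
# model = LinearRegression().fit(X, train_df["expenses"])
```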
We can see that all of the features have significant coefficients according to the reported p-values. In this case, the total number of features for training the model was 15. Since linear regression might not estimate the coefficients well in the case of a large number of features, we decided to move on to non-linear models that handle a large number of features better.
Moreover, we observed that the target variable is multi-modal, meaning that any model which expects a Gaussian distribution of the target variable will not be well suited to this scenario. For this reason, we mainly focus on tree-based models and ensembles. We also tried Gaussian mixture distributions with neural networks, but due to the time limit, we did not have the resources to optimize these types of models.

B. Extreme Gradient Boost Regression
The first XGB Regressor model, built only on the main table dataset, included the booster hyper-parameters alpha, eta, lambda, and max depth in the tuning job, using Bayesian search [35] to find the optimal values. This resulted in an average RMSE of 0.34 on the 5-fold cross-validation, and a RMSE of 0.1735 on the leaderboard.
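A sketch of such a Bayesian search, assuming the scikit-optimize library and the scikit-learn API of XGBoost (where learning_rate corresponds to eta and reg_alpha/reg_lambda to alpha/lambda), might look as follows; the ranges are illustrative.

```python
from skopt import BayesSearchCV
from skopt.space import Integer, Real
from xgboost import XGBRegressor

search = BayesSearchCV(
    estimator=XGBRegressor(objective="reg:squarederror"),
    search_spaces={
        "reg_alpha": Real(0.0, 10.0),                            # alpha
        "reg_lambda": Real(0.0, 10.0),                           # lambda
        "learning_rate": Real(0.01, 0.3, prior="log-uniform"),   # eta
        "max_depth": Integer(3, 10),
    },
    n_iter=30,
    cv=5,
    scoring="neg_root_mean_squared_error",
)
# search.fit(X_train, y_train)
```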
The model was later improved with the addition of the routes table, as well as with the features selected by the Boruta Shap search, which improved the 5-fold cross-validation RMSE to approximately 0.18 and 0.15, respectively, and the test RMSE on the leaderboard to 0.1649 and 0.1622, respectively.

C. Random Forest
The first Random Forest model was built on the transformed features of the main table. Hyperparameter optimisation was done on the max depth and min samples leaf parameters. The model was later improved by using a deeper grid search on all features from the merged main and routes tables. This time, max depth, max features, min samples leaf, and min samples split were optimized using 5-fold cross-validation, which resulted in a validation RMSE of 0.0234. The test results had a RMSE of 0.1625.
Both Random Forest models were clearly over-fitted, however, we tackled that issue with the different Ensembles of models later on.

D. Deep Learning Models
A few feed-forward neural networks were implemented using different configurations. The neural networks were trained on the full dataset. The networks only included dense layers and the main activation function used in the hidden layers was ReLU [36]. We experimented with a few regularization techniques including dropout, batch normalization, and kernel regularization with L2 [37].
From the conducted experiments, we concluded that adding regularization caused the model to underfit the training data. Moreover, batch normalization caused a significant performance degradation in this case.
For the output layer, we tried two activation functions, namely softplus [38] and linear. In this case, the same network configuration had better performance using the linear activation instead of softplus.
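A minimal Keras sketch of such a network with dense ReLU hidden layers and a linear output is shown below; the layer sizes, optimizer settings, and training arguments are illustrative assumptions rather than the exact configuration we used.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_ffn(n_features: int) -> keras.Model:
    """Feed-forward regressor with ReLU hidden layers and a linear output."""
    model = keras.Sequential([
        layers.Input(shape=(n_features,)),
        layers.Dense(256, activation="relu"),   # hypothetical layer sizes
        layers.Dense(128, activation="relu"),
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="linear"),   # linear output worked better than softplus
    ])
    model.compile(optimizer="adam", loss="mse",
                  metrics=[keras.metrics.RootMeanSquaredError()])
    return model

# model = build_ffn(X_train.shape[1])
# model.fit(X_train, y_train, validation_split=0.2, epochs=50, batch_size=256)
```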
The best-performing configuration achieved a RMSE of 0.1683 on a random validation split of 20% of the training data and a RMSE of 0.1775 on the leaderboard.
Adding more layers improved model performance; however, it also caused the model to overfit the data in the early stages of training.
Since these neural networks effectively approximate a Gaussian-like distribution, they are not quite suitable for the multi-modal target in this case. We additionally tried mixture density networks but did not have the resources to optimize these models. A basic model with two dense layers of 100 neurons and ReLU activation each, used to approximate the parameters of the distribution, resulted in a RMSE of 0.1868 on the leaderboard.

E. Diversified Ensemble Models
Ensemble methods [39] were used in order to compensate for models which might overfit or underfit the data, and to exploit the errors the individual models make in order to further tune the final predictions.
One of the ensemble methods used was model stacking. For this type of ensemble, we used the best performing XGBoost model using the features extracted with the Boruta search method and additionally re-trained a Bagging model of 100 linear regressions with the same features.
The outputs from these models were then fed to another linear regression model, which learned the weights to assign to each individual model, thus creating a weighted ensemble. This approach resulted in a 0.1631 RMSE on the leaderboard and a RMSE of approximately 0.06 on the validation data. This means that the stack resulted in obvious overfitting.
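The stacking procedure can be approximated with scikit-learn's StackingRegressor as sketched below (scikit-learn >= 1.2 is assumed for the `estimator` argument of BaggingRegressor); in our experiments the base models were trained and combined manually, so this is only an equivalent-in-spirit sketch.

```python
from sklearn.ensemble import BaggingRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor

stack = StackingRegressor(
    estimators=[
        ("xgb", XGBRegressor(objective="reg:squarederror")),
        ("bagged_lr", BaggingRegressor(estimator=LinearRegression(), n_estimators=100)),
    ],
    final_estimator=LinearRegression(),  # learns the weight given to each base model
    cv=5,
)
# stack.fit(X_train, y_train)
```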
We further experimented with the same approach using a decision tree in the last layer instead of a Linear regression, for learning the weights in the ensemble, however, this resulted in a more overfit model, with a RMSE of 0.1805 on the leaderboard.
Since this approach resulted in fast overfitting, we decided to abandon it.
The other method of ensembling the models that we attempted was a simple weighted ensemble. Using a combination of the Linear Regression models, the XGBoost models, the Random Forest models, and the feed-forward neural network models, we attempted to manually adjust the weights that these models had in the final outcome. The best ensemble was found to be an equal-weights ensemble of the highest-scoring Random Forest model and the highest-scoring XGBoost model, which had a validation RMSE of 0.1318 and a test RMSE of 0.1586.
We then tried another ensemble, which used the underfitted feed-forward neural network with a weight of 0.2, the highest-scoring XGBoost model with a weight of 0.4, and the highest-scoring Random Forest model with a weight of 0.4, all trained on the features chosen with the Boruta search method. We expected that the feed-forward neural network would generally make mistakes in the opposite direction of the Random Forest and XGBoost models, and therefore contribute to the reduction of the average mistake. The weight of the feed-forward neural network model is low, however, due to its larger average errors. This resulted in an Ensemble with a validation RMSE of 0.1856, and a test RMSE of 0.1567, our best score in this competition.
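The weighted ensemble itself reduces to a fixed-weight average of the individual model predictions, as in the sketch below; the prediction variables are placeholders.

```python
import numpy as np

def weighted_ensemble(predictions: dict, weights: dict) -> np.ndarray:
    """Combine model predictions with fixed, manually chosen weights."""
    total = sum(weights.values())
    return sum(weights[name] * np.asarray(preds) for name, preds in predictions.items()) / total

# Weights reported for the best-scoring ensemble:
# blended = weighted_ensemble(
#     {"ffn": ffn_preds, "xgb": xgb_preds, "rf": rf_preds},
#     {"ffn": 0.2, "xgb": 0.4, "rf": 0.4},
# )
```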

F. Model evaluation
In this subsection, we present the results of the individual models which were optimized and chosen for creating ensembles in the final stage of experimentation. Table I shows the models, their training configurations, the features they used, and their validation and leaderboard RMSE scores.
The final 3 models which were chosen for the competition include:
• Ensemble using the best performing feed-forward neural network, Random Forest, and XGBoost models
• Ensemble using only the best performing Random Forest and XGBoost models (excluding models which expect Gaussian distributions of the target variable)
• The best performing Random Forest which uses the features chosen with the Boruta search method, to avoid over-smoothing or overfitting the ensemble methods

VII. DISCUSSION AND FUTURE WORK
The top-performing models were the XGBoost and Random Forest models, strongly outperforming the Linear Regression models and slightly outperforming the feed-forward neural network models. This was expected, due to the multi-modal nature of the target variable, which is hard to estimate using models that expect a Gaussian distribution of the target.
Moreover, the ensembles of diverse models performed the best out of all the predictive options. They used weights that were calculated using the inverse of the cross-validation RMSE scores of the models they were composed of. The individual Linear Regression models were not used in the ensembles due to their massive underperformance compared to the other three model types. They were, however, used within the bagging regressors, but this approach also underperformed compared to the non-linear approaches.
Hyper-parameter optimization on all models was performed using the grid search algorithm, with 5-fold cross-validation, and RMSE as a metric to evaluate performance. Grid search was used because it is one of the most thorough hyperparameter tuning algorithms. Given more time, we would have expanded the search space of the grid search of all models.
According to the Boruta search for feature importance, the euclidean distance, direction d, and km total columns were considered the most important for determining the expenses value, with starting and ending locations of low-frequency destinations being some of the least important features. This further implies that the distance of the route is the most important deciding factor in the final expenses of forwarding contracts.
The main challenge in working with this dataset was the limited information we had on the meaning of some of the given features. With better information on the features, the data engineering process, as well as the model building process, would have been more specific and exhaustive.
While experimenting with the aforementioned validation procedures in Section III, we noticed that some additional features could be extracted from the id payer column, considering that it, in its original form, is not applicable as a feature. Such derived features could be:
• num previous contracts - the number of previous contracts (before this contract date) for the same client (id payer)
• average cost previous contracts - the average cost of previous contracts (before this contract date) for the same client (id payer)
• average duration previous contracts - the average duration of previous contracts (before this contract date) for the same client (id payer)
• average length previous contracts - the average length of previous contracts (before this contract date) for the same client (id payer)
• ratio length previous contracts - the ratio of the current length divided by the average of previous contracts (before this contract date) for the same client
• cost most similar contract - the cost of the previous contract with the most similar length, adjusted by the difference in exchange ratios
Considering that the computation of such features should be properly handled and closely integrated with the training-validation split process, we did not utilize them. Despite that, we believe that there is merit in further experimenting with them, as sketched below.
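For illustration, a leakage-safe way to compute the first few of these features is sketched below; rows are sorted per client by start date, and only strictly earlier contracts contribute to each aggregate. The column names are assumptions, and the cost-based features can of course only be computed where the target is known.

```python
import pandas as pd

def add_payer_history_features(df: pd.DataFrame) -> pd.DataFrame:
    """Per-client history features using only contracts that started earlier."""
    out = df.sort_values(["id_payer", "route_start_datetime"]).copy()
    grouped = out.groupby("id_payer")
    out["num_previous_contracts"] = grouped.cumcount()
    # shift(1) excludes the current contract from its own history.
    out["average_cost_previous_contracts"] = grouped["expenses"].transform(
        lambda s: s.shift(1).expanding().mean()
    )
    out["average_duration_previous_contracts"] = grouped["time_taken_minutes"].transform(
        lambda s: s.shift(1).expanding().mean()
    )
    return out
```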
Although it was considered, fuel prices were ultimately not used in the prediction of the target variable. This was due to the uncertainty of the availability of current fuel prices, which would make using them a potential data leak.
Finally, the usage of external public datasets could have vastly improved the predictions of all models. Unfortunately, due to time restrictions, we were unable to properly search for, test, and use any relevant public dataset.

VIII. CONCLUSION
The original goal of the challenge was to use preprocessing methodologies, Machine Learning algorithms, and feature selection methods in order to most accurately predict the costs related to the execution of forwarding contracts in a transportation company. Using feature engineering techniques, as well as a weighted diversified ensemble of XGBoost, Random Forest, and deep learning models (a feed-forward neural network), we were able to predict the expenses of the forwarding contracts with a RMSE of 0.1573.
In this paper, all missing data was imputed using mean filling and median filling. However, in the future, more sophisticated methods for data imputation can be utilized, such as Multiple Imputation by Chained Equations [40] or Regression Imputation [41].
In a broader context, we can conclude that in real-life business problems, domain knowledge and information are essential. With manual feature extraction that reflects the domain knowledge, valuable features could be created that improve the model performance. Likewise, without the domain knowledge, the model validation from a practicality and explainability perspective could be limited.